Academic Paper |
| Title: | A new PPM variant for Chinese text compression |
| Author: | Peiliang Wu |
| Institution: | University of Wales, Bangor |
| Author: | W. J. Teahan |
| Institution: | University of Wales, Bangor |
| Linguistic Field: | Computational Linguistics; Writing Systems |
| Subject Language: | Chinese, Mandarin |
| Abstract: | Large-alphabet languages such as Chinese differ greatly from English and therefore present different problems for text compression. In this article, we first examine the characteristics of Chinese, then introduce a new variant of the Prediction by Partial Match (PPM) model designed specifically for Chinese characters. Unlike traditional PPM coding schemes, which encode an escape probability whenever a novel character occurs in the context, the new scheme encodes the order first and then the symbol, without having to output an escape probability. This scheme achieves excellent compression rates in comparison with other schemes on a variety of Chinese text files. |
This article appears in Natural Language Engineering Vol. 14, Issue 3, which you can read on Cambridge's site or on LINGUIST.


