Editor for this issue: Martin Jacobsen <marty@linguistlist.org>
On 26 August, Ji Donghong posted an open question intended to roughly parallel modern Chinese: suppose there were a language just like English in word order but without any affixes, so that the following are all possible phrases in the language: make develop, develop country, develop product, etc. How do you determine the distribution-based POS system for that language?

ANSWER: The key lies in two issues: (1) where is that language in its stage of linguistic development (that is, what is its history?) and (2) how do you segment that language?

A few clarifications are in order first. Chinese does have several affixes. Examples are the -er suffix (in Pinyin -zhe3), which denotes the nominal doer of an action and is formed by adding "-zhe3" to an action verb, as in "ji4 (to report) + zhe3 (-er) => ji4-zhe3 (reporter)", and the -ity suffix (in Pinyin -xing4), which denotes the abstract nominal form of a pure adjective and is formed by adding "-xing4" to an adjective, as in "ke3-neng2 (possible) + xing4 (-ity) => ke3-neng2-xing4 (possibility)". So there are several aspects of Chinese which are indeed syntactically compositional. But Prof. Ji correctly contends that the rules are often difficult to codify.

1. WHERE IS THE CHINESE LANGUAGE IN ITS HISTORICAL DEVELOPMENT?

CHINESE HAD 2 DIFFERENT FORMS AT THE TURN OF THE CENTURY

As recently as 1919, there were two distinct languages in China. One was the classical language used in formal written, governmental, and literary communication, modeled on China's 2,000 continuous years of literary heritage using Chinese characters and constructions, some of which date back to 200 BC and Emperor Qin Shihuang's standardization. The other was the colloquial language used in spoken communication.

CLASSICAL CHINESE (WEN2-YAN2-WEN2), THE LITERARY TRADITION

The classical language tended to be monosyllabic and incorporated usages from throughout the 2,000-year history, with the result that sometimes a verbal usage dating from 200 AD might be used in one sentence and the next sentence might use one from 900 AD or later. This language drew largely on allusions to the panoply of Chinese literary history. It was an elegant, erudite means of expression in which scholars vied with each other, trying to communicate while also showing their erudition and veneration of the literary tradition.

COLLOQUIAL CHINESE (BAI3-HUA4), THE MODERN SPOKEN LANGUAGE

The colloquial language had evolved into a polysyllabic language with many lengthy compound words coined in the last 50 years to describe new Western concepts and modern inventions, and even with foreign loan words. This was the language of business, of barter, and of negotiation among international merchants. This was the common spoken language across all of the Chinese empire--except that the classical language was still spoken in the court on formal occasions in deference to the Emperor, who was the embodiment of China's millennia-long traditions.

1919 LITERARY REVOLUTION STARTS TO COMBINE THE TWO

On May 4th, 1919, several thousand Chinese university students marched in protest in Beijing. They wanted China to modernize, unify, and throw off the foreign domination which had partitioned the empire among the colonial powers of that age. One of their demands was that China should not have two languages--one the written classical language and another the spoken colloquial language--but one. In the words of literary scholar Chow Tse-tsung in his "The May Fourth Movement", it was clear that this "literary revolution was a crucial part of the May Fourth reforms" (The May Fourth Movement, p. 272).

A LANGUAGE IN FLUX

But there was no overall official authority, such as the French Academie exercised over the development of the French language in the 1800's, nor an unofficial authority of the kind McGuffey's Reader provided for 19th-century American English while the United States matured as a nation. Instead, in China these two languages merged spontaneously over most of this last century, with the only moderating influence being what the populace at large accepted. The result is a language with the POS ambiguities seen today, evidenced by so many different outlooks and by so few published dictionaries even venturing to offer POS information. This does not mean parts of speech are not in use. This does not mean that syntax does not exist--as some naive German linguists thought in the early 1900's when they called Chinese a "primitive" language. History suggests that the Chinese language is simply undergoing a period of flux amid its development of a new unified language.

AMID ITS EVOLUTION, SCHOOLS IN BEIJING AND TAIPEI CAN STILL AGREE

So here the key concept is: Chinese is evolving. You cannot at this particular point in time pontificate on what the specific syntactical rules are for every occurrence in the natural language today. But many schools of thought generally agree in their descriptions of modern Chinese syntax. Simply compare the syntactic POS descriptions coming from the computational linguistics societies of Beijing and Taipei. There is a great deal of general agreement.

2. CHINESE SEGMENTATION

WRITTEN CHINESE HAS NO WHITE SPACE

It is obvious that one cannot simply pluck a Chinese word out of a text, because the characters appear one after another up to the end-of-sentence marker without any spaces between them. Whereas English and most other languages demarcate their words by blank spaces, Chinese does not.

CHINESE TRADITIONAL SEGMENTATION

When a student of Chinese reads a text in class, he usually reads it aloud.
His understanding of the text is monitored by watching how he reads the text. If he understands the text correctly, as he verbalizes the written word, he will naturally pause slightly between words--grouping some characters together in polysyllabic utterances while reading others singly. Sometimes the beginning reader may have to back up a sentence when he reaches the end and finds that he had segmented the text incorrectly earlier and cannot end the sentence without correcting that segmentation.

CHINESE SEGMENTATION SOFTWARE

Today Natural Language Processing (NLP) applications run segmentation software to identify and extract Chinese words and phrases from within running text. In this community there is great diversity of opinion as to what the optimum segmentation is for any sentence, since these segmenters are often written for different NLP applications with different objectives. But here again most can agree on a general segmentation standard--see the standard proposed by Prof. Dekai Wu (http://www.cs.ust.hk/~dekai/papers/segmentation.html).

DIFFERENT SEGMENTATIONS CAN RESULT IN DIFFERENT VIEWS

So back to Prof. Ji's open question: how do you determine the distribution-based POS system for that language? First you must agree on a segmentation standard so that you extract words in similar fashion. Then, by analyzing the distributional patterns, you can come up with POS guidelines. I am working on a Chinese-English machine translation project at the University of Maryland. You can see Prof. Dekai Wu's segmentation guideline and our POS tagging guideline at http://umiacs.umd.edu/labs/CLIP/forest.html. We believe our system is well-formed, but it is of course not the one and only. There are alternate ways to view modern Chinese syntax as it evolves further, like every other living language today.
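
As a rough illustration of the two steps just described (segment first, then compare distributions), here is a minimal Python sketch. It is not the segmenter or tagger used in our project or in Prof. Wu's standard. The greedy longest-match strategy, the cosine comparison, the toy lexicon, the function names, and the character forms given for ji4-zhe3 and ke3-neng2-xing4 are all assumptions made for the example; only the three affix-less phrases come from Prof. Ji's question.

```python
from collections import Counter, defaultdict
from math import sqrt


def segment(text, lexicon, max_len=3):
    """Greedy forward longest-match segmentation (one simple strategy among many).

    At each position, take the longest lexicon entry that fits; if none fits,
    emit a single character.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def context_profiles(phrases):
    """For every word, count its immediate left ("L:") and right ("R:") neighbours."""
    profiles = defaultdict(Counter)
    for words in phrases:
        padded = ["<s>"] + list(words) + ["</s>"]
        for k in range(1, len(padded) - 1):
            profiles[padded[k]]["L:" + padded[k - 1]] += 1
            profiles[padded[k]]["R:" + padded[k + 1]] += 1
    return profiles


def cosine(a, b):
    """Cosine similarity between two context-count profiles."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Step 1: segmentation with a hypothetical three-word lexicon
# (reporter, to report, possibility).
toy_lexicon = {"记者", "报道", "可能性"}
print(segment("记者报道可能性", toy_lexicon))

# Step 2: distributional comparison on the affix-less phrases from the question.
phrases = [["make", "develop"], ["develop", "country"], ["develop", "product"]]
profiles = context_profiles(phrases)
print(cosine(profiles["country"], profiles["product"]))
print(cosine(profiles["make"], profiles["country"]))
```

On this tiny sample, "country" and "product" have identical context profiles, while "make" and "country" share no contexts at all, which is the kind of evidence a distribution-based POS grouping would build on. Note that the greedy segmenter never backs up the way the beginning reader described above does, and a different lexicon or matching strategy would yield different words and therefore different distributions.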

John Kovarik, a Chinese language instructor contributing to the University of Maryland's Computational Linguistics and Information Processing Laboratory Chinese-English MT Project