LINGUIST List 9.1202

Sun Aug 30 1998

Disc: POS is well-formed or not well-formed?

Editor for this issue: Martin Jacobsen <marty@linguistlist.org>


Directory

  1. John Kovarik, Re: 9.1186 Sum: POS is well-formed or not well-formed?

Message 1: Re: 9.1186 Sum: POS is well-formed or not well-formed?

Date: Sun, 30 Aug 1998 14:51:56 +0000
From: John Kovarik <kovariks@worldnet.att.net>
Subject: Re: 9.1186 Sum: POS is well-formed or not well-formed?

Ji Donghong listed an open question on 26 August to roughly parallel
modern Chinese, asking: suppose there were a language just like
English in word order but without any affixes, so the following are
all possible phrases in the language: make develop, develop country,
develop product, etc. How do you determine the distribution-based POS
system for that language?

ANSWER: The key lies in two issues: (1) where is that language in its
stage of linguistic development? (that is, what is its history?) and
(2) how do you segment that language?

A few clarifications are in order first. Chinese does have several
affixes. One is the -er suffix (in Pinyin -zhe3), which denotes the
nominal doer of an action and is formed by adding "-zhe3" to an action
verb: "ji4 (to report) + zhe3 (-er) => ji4-zhe3 (reporter)". Another
is the -ity suffix (in Pinyin -xing4), which denotes the abstract
nominal form of a pure adjective and is formed by adding "-xing4" to
an adjective: "ke3-neng2 (possible) + xing4 (-ity) => ke3-neng2-xing4
(possibility)". So there are several aspects of Chinese which are
indeed syntactically compositional. But Prof. Ji correctly contends
that the rules are often difficult to codify.
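
As a rough illustration of the two derivations just described, here is
a minimal Python sketch. The table SUFFIXES and the function derive()
are hypothetical names made up for this example, and the table covers
only the two suffixes mentioned above; it is not a description of
Chinese morphology in general.

```python
# Illustrative only: the two suffixes named above, in tone-numbered Pinyin.
# Each entry maps a suffix to (POS it attaches to, POS of the result).
SUFFIXES = {
    "zhe3": ("verb", "noun"),        # doer of an action, roughly English -er
    "xing4": ("adjective", "noun"),  # abstract noun, roughly English -ity
}

def derive(stem, stem_pos, suffix):
    """Return (derived form, derived POS); raise if the suffix does not apply."""
    required_pos, result_pos = SUFFIXES[suffix]
    if stem_pos != required_pos:
        raise ValueError(f"-{suffix} attaches to a {required_pos}, not a {stem_pos}")
    return f"{stem}-{suffix}", result_pos

print(derive("ji4", "verb", "zhe3"))              # reporter
print(derive("ke3-neng2", "adjective", "xing4"))  # possibility
```

Even in this tiny form, the rule only works once the stem's part of
speech is known, which is exactly where the codification gets hard.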

1. WHERE IS THE CHINESE LANGUAGE IN ITS HISTORICAL DEVELOPMENT? 

CHINESE HAD 2 DIFFERENT FORMS AT THE TURN OF THE CENTURY 

As recently as 1919, there were two distinct languages in China. One
was the classical language used in written formal, governmental, and
literary communication, modeled on China's 2,000 continuous years of
literary heritage using Chinese characters and constructions, some of
which date back to 200 BC and Emperor Qin Shihuang's
standardization. The other was the colloquial language used in spoken
communication.

CLASSICAL CHINESE (WEN2-YAN2-WEN2), THE LITERARY TRADITION 

The classical language tended to be monosyllabic and incorporated
usages from throughout the 2,000-year history, with the result that
sometimes a verbal usage dating from 200 AD might be utilized in one
sentence and the next sentence use one from 900 AD or later. This
language drew largely on allusions to the panoply of Chinese literary
history. It was an elegant, erudite means of expression in which
scholars vied with each other trying to communicate while also showing
their erudition and veneration of the literary tradition.

COLLOQUIAL CHINESE (BAI3-HUA4) THE MODERN SPOKEN LANGUAGE 

The colloquial language had evolved into a polysyllabic language with
many lengthy compound words coined in the last 50 years to describe
new Western concepts, modern inventions, and even foreign loan words.
This was the language of business, of barter, and of negotiation among
international merchants. This was the common spoken language across
all of the Chinese empire--except that the classical language was
still spoken in the court on formal occasions in deference to the
Emperor who was the embodiment of China's millennia-long traditions.

1919 LITERARY REVOLUTION STARTS TO COMBINE THE TWO 

On May 4th, 1919, several thousand Chinese university students marched
in protest in Beijing. They wanted China to modernize, unify, and
throw off the foreign domination which had partitioned the empire
among the colonial powers of that age. One of their demands was that
China should not have two languages--one the written classical
language and another the spoken colloquial language--but one. In the
words of literary scholar Chow Tse-tsung, it was clear that this
"literary revolution was a crucial part of the May Fourth reforms"
(The May Fourth Movement, p. 272).

A LANGUAGE IN FLUX

But there was no overall official authority, such as the French
Academie exercised over the development of the French language in the
1800's, nor an unofficial authority, such as McGuffey's Reader provided
for 19th-century American English while the United States matured as a
nation. Instead, in China these two languages merged spontaneously
over most of this last century, with the only moderating influence
being what the populace at large accepted. The result is a language
with the POS ambiguities seen today, evidenced by so many different
outlooks and by so few published dictionaries even venturing to offer
POS information. This does not mean parts of speech are not in use.
Nor does it mean that syntax does not exist, as some naive German
linguists thought in the early 1900's when they called Chinese a
"primitive" language. History suggests that the Chinese language is
simply undergoing a period of flux amid its development of a new
unified language.

AMID ITS EVOLUTION, SCHOOLS IN BEIJING AND TAIPEI CAN STILL AGREE

So here the key concept is: Chinese is evolving. You cannot at this
particular point in time lay down what the specific syntactic rules
are for every occurrence in the natural language today. But many
schools of thought generally agree in their descriptions of modern
Chinese syntax. Simply compare the syntactic POS descriptions coming
from the computational linguistics societies of Beijing and Taipei.
There is a great deal of general agreement.

2. CHINESE SEGMENTATION

WRITTEN CHINESE HAS NO WHITE SPACE

It is obvious that one cannot pluck a Chinese word out of a text,
because the characters appear in a text one after another to the
end-of-sentence marker without any spaces between them. Whereas
English and most other languages demarcate their words by blank
spaces, Chinese does not.

CHINESE TRADITIONAL SEGMENTATION

When a student of Chinese reads a text in class, he usually reads it
aloud. His understanding of the text is monitored by watching how he
reads the text. If he understands the text correctly, as he verbalizes
the written word, he will naturally pause slightly between
words--grouping some characters together in polysyllabic utterances
while reading others singly. Sometimes the beginning reader may have
to back up a sentence when he reaches the end and finds that earlier
he had segmented the text incorrectly and cannot end the sentence
without correcting his earlier segmentation.
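
The reader's "back up and re-segment" behavior can be sketched as a
dictionary lookup with backtracking. This is only a toy under stated
assumptions: LEXICON is a hypothetical four-word dictionary, the text
is given as tone-numbered Pinyin syllables standing in for characters,
and segment() is a name invented for this sketch.

```python
# Hypothetical toy lexicon; each word is a tuple of Pinyin syllables.
LEXICON = {
    ("yan2", "jiu1"),            # research
    ("yan2", "jiu1", "sheng1"),  # graduate student
    ("sheng1", "ming4"),         # life
    ("qi3", "yuan2"),            # origin
}
MAX_LEN = max(len(word) for word in LEXICON)

def segment(syllables, start=0):
    """Return the first complete segmentation found, or None if there is none.

    Longer words are tried first. If a choice leaves the rest of the
    sentence unsegmentable, the function backs up and tries a shorter word.
    """
    if start == len(syllables):
        return []
    for length in range(min(MAX_LEN, len(syllables) - start), 0, -1):
        word = tuple(syllables[start:start + length])
        if word in LEXICON:
            rest = segment(syllables, start + length)
            if rest is not None:
                return ["-".join(word)] + rest
    return None

text = ["yan2", "jiu1", "sheng1", "ming4", "qi3", "yuan2"]
print(segment(text))
```

Here the first guess, "yan2-jiu1-sheng1", strands "ming4", so the
function backs up and settles on "yan2-jiu1", "sheng1-ming4",
"qi3-yuan2", much as the student does when the sentence will not end.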

CHINESE SEGMENTATION SOFTWARE

Today Natural Language Processing (NLP) applications run segmentation
software to identify and extract Chinese words and phrases from within
running text. In this community there is great diversity of opinion
as to what the optimum segmentation is for any sentence, since these
segmentors are often written for different NLP applications with
different objectives. But here again most can agree on a general
segmentation standard--see the standard proposed by Prof. Dekai Wu
(http://www.cs.ust.hk/~dekai/papers/segmentation.html).
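
To see how two reasonable segmenters can disagree on the same
sentence, consider a minimal sketch of forward and backward maximum
matching. These two strategies are chosen here purely for
illustration; they are not taken from the standard cited above.
LEXICON is a hypothetical toy dictionary over tone-numbered Pinyin
syllables, and a syllable that matches nothing is emitted as a
one-syllable word.

```python
# Hypothetical toy lexicon; each word is a tuple of Pinyin syllables.
LEXICON = {
    ("yan2", "jiu1"),            # research
    ("yan2", "jiu1", "sheng1"),  # graduate student
    ("sheng1", "ming4"),         # life
    ("ming4",),                  # fate
    ("qi3", "yuan2"),            # origin
}
MAX_LEN = max(len(word) for word in LEXICON)

def forward_max_match(syllables):
    """Scan left to right, always taking the longest lexicon word."""
    words, i = [], 0
    while i < len(syllables):
        for length in range(min(MAX_LEN, len(syllables) - i), 0, -1):
            word = tuple(syllables[i:i + length])
            if word in LEXICON or length == 1:  # unknown syllable stands alone
                words.append("-".join(word))
                i += length
                break
    return words

def backward_max_match(syllables):
    """Scan right to left, always taking the longest lexicon word."""
    words, j = [], len(syllables)
    while j > 0:
        for length in range(min(MAX_LEN, j), 0, -1):
            word = tuple(syllables[j - length:j])
            if word in LEXICON or length == 1:  # unknown syllable stands alone
                words.insert(0, "-".join(word))
                j -= length
                break
    return words

TEXT = ["yan2", "jiu1", "sheng1", "ming4", "qi3", "yuan2"]
print(forward_max_match(TEXT))
print(backward_max_match(TEXT))
```

With this lexicon the forward pass groups "yan2-jiu1-sheng1" and
leaves "ming4" alone, while the backward pass yields "yan2-jiu1" and
"sheng1-ming4". Neither is wrong by the dictionary, which is why an
agreed standard matters.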

DIFFERENT SEGMENTATIONS CAN RESULT IN DIFFERENT VIEWS 

So back to Prof. Ji's open question: how do you determine the
distribution-based POS system for that language? First you must agree
on a segmentation standard so that you extract words in similar
fashion. Then by analyzing the distributional patterns, you can come
up with POS guidelines. I am working on a Chinese-English machine
translation project at the University of Maryland. You can see
Prof. Dekai Wu's segmentation guideline and our POS tagging guideline
at http://umiacs.umd.edu/labs/CLIP/forest.html. We believe our system
is well-formed, but it is of course not the one and only. There are
alternate ways to view modern Chinese syntax as it evolves further,
like every other living language today.
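
One simple way to picture "analyzing the distributional patterns" is
to count each word's left and right neighbors in already segmented
text and compare the resulting profiles. The sketch below is an
assumption-laden toy, not the Maryland project's method: PHRASES
holds the three phrases from Prof. Ji's question plus two invented
ones ("make product", "make country"), and context_profiles() and
cosine() are hypothetical helper names.

```python
from collections import Counter, defaultdict
from math import sqrt

# Segmented toy corpus: three phrases from the question, two invented.
PHRASES = [
    ["make", "develop"],
    ["develop", "country"],
    ["develop", "product"],
    ["make", "product"],   # invented for this sketch
    ["make", "country"],   # invented for this sketch
]

def context_profiles(phrases):
    """Count, for every word, which words occur directly left and right of it."""
    profiles = defaultdict(Counter)
    for phrase in phrases:
        padded = ["<s>"] + phrase + ["</s>"]  # phrase-boundary markers
        for i in range(1, len(padded) - 1):
            profiles[padded[i]][("L", padded[i - 1])] += 1
            profiles[padded[i]][("R", padded[i + 1])] += 1
    return profiles

def cosine(a, b):
    """Cosine similarity between two context-count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

profiles = context_profiles(PHRASES)
print(cosine(profiles["country"], profiles["product"]))
print(cosine(profiles["develop"], profiles["country"]))
```

In this toy, "country" and "product" share all their contexts and so
score higher with each other than either does with "develop", which
also occurs phrase-initially. Words with similar profiles are
candidates for the same POS class; a different segmentation would
change the counts and possibly the classes.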

 John Kovarik, a Chinese language instructor contributing to
 the University of Maryland's Computational Linguistics and
 Information Processing Laboratory Chinese-English MT Project