AUTHORS: Anna Feldman, Jirka Hana
TITLE: A resource-light approach to morpho-syntactic tagging
SERIES TITLE: Language and Computers 70
PUBLISHER: Rodopi
YEAR: 2010
Michael Maxwell, Center for Advanced Study of Language, University of Maryland
I am writing this review with an audience of linguists, not computational linguists, in mind. I therefore think it important to say why linguists should read a review of this computational linguistics book.
Linguists know that there are thousands of languages in the world; computational linguists often behave as if there are just a few, namely those for which there are many resources: dictionaries, corpora, treebanks, etc. This book represents a break from that thinking, in that the authors develop a way to build computational tools for less-resourced languages, specifically tools for morphology. Tools like these could, if further developed, be used to describe less-resourced languages and annotate texts in such languages, and this is the first and foremost reason linguists -- including field linguists -- should pay attention.
This is of course not the first time computational linguists (or at least linguists who do computer programming) have built morphological tools. In the context of field linguistics, SIL has produced several such tools (most notably Shoebox/Toolbox and FieldWorks Language Explorer, or FLEx). More generally, there are numerous tools for building hand-crafted morphological parsers and generators, beginning with SIL's AMPLE and STAMP, as well as more modern tools like the Xerox Finite State Transducer. At the other end of the spectrum from hand-built to machine-built, there has been considerable work on automatically building morphological parsers using only unannotated text corpora, i.e. unsupervised machine learning. Creating morphological parsers by unsupervised machine learning is of considerable theoretical interest, since it captures to some extent what children must be doing. The literature on this is becoming large, and there is as yet no general survey; see Goldsmith 2001 for an early and still influential paper, and the presentations at http://www.cis.hut.fi/morphochallenge2009/workshop.shtml for some recent work.
But from a practical standpoint -- that is, meeting the needs of linguists interested in describing the thousands of languages of the world -- neither the hand-built approach, nor the machine learning from raw corpora approach, is adequate. The manual approach takes a long time to create adequate morphological analyses for under- or un-documented languages, while unsupervised machine learning has yet to achieve what most working linguists would consider adequate performance (not to mention the fact that machine learning from unannotated corpora is incapable of assigning any meaning to the affixes it finds).
The authors of this work, Anna Feldman and Jirka Hana (F&H), take a middle road, which holds promise of being able to create more adequate morphological tools for many languages than unsupervised machine learning, and to do so much more quickly than a purely hand crafted approach. This middle road combines hand-built resources which are relatively simple and can therefore be built relatively quickly, with machine learning. But the point which sets their work apart from a standard semi-supervised approach (Abney 2007) is that F&H's machine learning programs are first trained on a related language (which I will call a ''bridge language'') for which more resources are assumed to be available. This is of course what most (human) linguists would do: when describing a previously unstudied language, one usually finds out what linguists have discovered about related languages. (This is not the first computational linguistic effort to use the notion of a bridge language, but it is perhaps the most extended exploration of such an approach. F&H discuss some of the previous work in their third chapter.)
In the test cases described in this book, the higher resourced languages are Czech and Spanish, while the lesser resourced languages are Russian on the one hand, and Portuguese and Catalan on the other. Of course none of these languages is truly a low resource language. But they are appropriate choices, given that in order to demonstrate that a technology works, one needs a way to evaluate the results -- that is, one needs to have (or to be able to quickly create) a gold standard for the target languages, against which to test the technology.
With this rather extended preface, then, I hope to have whetted the appetite of linguists reading this review. I now turn to the book itself.
This is a short book: just over 130 pages in the main text, plus several appendices describing technical details of the methodology, and a very brief grammatical overview of the languages used as test cases. Part of the book is based on Feldman's Ph.D. dissertation, but the present work is a joint effort.
The first chapter is a very brief introduction to the overall theme, while the second chapter gives a background on previous work on ''tagging.'' (The reader who is unfamiliar with tagging may want more background than F&H give; for that, any introductory textbook on computational linguistics should suffice, such as Manning and Schuetze (1999) or Jurafsky and Martin (2009). I might also recommend Abney (2007) which, although it deals mostly with semi-supervised learning, may prove more accessible to linguists.)
Tagging is a form of corpus annotation in which each word (or rather, each token, typically including tokens consisting of punctuation characters) is marked for some properties. In the context of this book, tagging means assigning such morpho-syntactic properties as person, number, tense and so forth, and a citation form (or assigning a special tag to punctuation tokens). This kind of tagging is sometimes misleadingly referred to as ''part of speech tagging'', whereas in fact the tags are at a much finer grain than what linguists typically think of as parts of speech. Tagging is in fact similar to what linguists do when they create interlinear text, except that in tagging, the individual affix morphemes may not be distinguished, and the citation form of the word stands in for the gloss of one of its senses. (Assigning the correct lexeme gloss to words, which is usually done as part of interlinear text glossing, is something that computational linguists call 'sense disambiguation', and is not discussed in this book.)
Tagging is a two part process: first, one finds all possible tags for a particular word; second, one chooses the correct tag from among the possible tags. Since the computer cannot choose the correct tag on the basis of the meaning of a word in context, computational tagging is done probabilistically on the basis of properties of the neighboring words, such as their tags.
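The two steps can be made concrete with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy lexicon, the tag names, and the transition scores are invented, and the search (a standard Viterbi pass over tag bigrams) is one common way to do step two, not necessarily the one F&H use.

```python
# Toy illustration of two-step tagging. All data below are invented.

# Step 1: every possible tag for each word form (hypothetical lexicon).
POSSIBLE_TAGS = {
    "the": ["DET"],
    "old": ["ADJ", "NOUN"],
    "man": ["NOUN", "VERB"],
}

# Step 2 resource: how plausible it is for one tag to follow another.
# These numbers are made up; in practice they would be estimated from
# a hand-tagged corpus.
TRANSITION = {
    ("<s>", "DET"): 0.6,
    ("DET", "ADJ"): 0.4,
    ("DET", "NOUN"): 0.5,
    ("ADJ", "NOUN"): 0.7,
    ("ADJ", "VERB"): 0.05,
    ("NOUN", "VERB"): 0.4,
    ("NOUN", "NOUN"): 0.2,
}
DEFAULT = 0.01  # score for tag pairs not listed above


def tag(words):
    """Choose one tag per word by Viterbi search over the candidate tags."""
    # best maps the tag of the previous word to (score, tag path so far)
    best = {"<s>": (1.0, [])}
    for word in words:
        new_best = {}
        for candidate in POSSIBLE_TAGS[word]:  # step 1: look up candidates
            score, path = max(
                (
                    (s * TRANSITION.get((prev, candidate), DEFAULT), p)
                    for prev, (s, p) in best.items()
                ),
                key=lambda pair: pair[0],
            )
            new_best[candidate] = (score, path + [candidate])
        best = new_best
    # step 2: keep the highest-scoring complete path
    return max(best.values(), key=lambda pair: pair[0])[1]


print(tag(["the", "old", "man"]))
```

With these toy numbers, ''old'' is resolved as an adjective and ''man'' as a noun purely because of the tags of their neighbors. A word missing from the toy lexicon would raise a KeyError; a real tagger needs a fallback for unknown words.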
The third chapter briefly discusses previous work on natural language processing for languages with few computational resources, ranging from lexicon acquisition to syntactic parsing. F&H sketch their own goals in view of this background, pointing out that their wish is to produce computational resources which are interpretable by humans. This may sound obvious, but in fact most statistical machine translation programs produce ''grammars'' which are far from interpretable in a linguistic sense.
Chapter four gives an overview of the grammars and the corpora of the languages in question, and the tagsets to be used (i.e. the information about gender etc. that words will be marked with). The grammar sketches are amplified in an appendix; I am not sure why there is this redundancy.
The fifth, rather short, chapter quantifies properties of the tagsets for each language, pointing out the data sparsity problem: it takes a large corpus to see all the possible tags (i.e. all the combinations of morphosyntactic features possible for a given part of speech). This is in contrast to a relatively uninflected language like English, where a relatively small corpus suffices to see each tag at least once. They point out that to some extent, the data sparsity is caused by the choice of news text for corpora. A conversational corpus would certainly show a different distribution of person marking, for example, although whether it would be more diverse (and therefore have less of a data sparsity problem) is not obvious.
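One way to see the sparsity problem is to count how many distinct tags have been observed after reading a given number of tokens. The sketch below is only an illustration under assumed inputs: the function name, the (word, tag) pair format, and the sample data are hypothetical and are not taken from the book.

```python
def tags_seen_by_size(tagged_tokens, checkpoints):
    """Return {corpus size: number of distinct tags seen so far}.

    tagged_tokens: iterable of (word, tag) pairs in corpus order.
    checkpoints:   set of token counts at which to record coverage.
    """
    seen = set()
    coverage = {}
    for position, (_word, tag) in enumerate(tagged_tokens, start=1):
        seen.add(tag)
        if position in checkpoints:
            coverage[position] = len(seen)
    return coverage


# Invented four-token sample with three distinct tags.
sample = [("a", "T1"), ("b", "T2"), ("c", "T1"), ("d", "T3")]
print(tags_seen_by_size(sample, {2, 4}))
```

For a richly inflected language the count keeps rising as the corpus grows, whereas for a language like English it flattens out early.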
Zipf's Law is also a source of difficulty: no matter how large the corpus, many words will be attested in only a subset of the forms in which they could theoretically appear. This is of course a problem for the human language learner, too -- and the fact that humans usually cope with this (but not always, cf. Albright 2007) suggests that it should be possible for the computer to do well, too. But that is for future work.
With these preliminaries out of the way, chapter six turns to morphological analysis. Recall that the overall task is to tag each word of the target language corpus with its part of speech and morphosyntactic information. F&H decompose this task into a morphological analysis phase, followed by a disambiguation phase. There is a potential tradeoff here: the more effort one expends in getting the morphological analyzer right -- that is, getting it to produce only the possible parses for each word -- the less effort will be required in the disambiguation phase. In the limit, a perfect morph analyzer would generate only the possible analyses for each word, so that the task of a tagger (human or machine) is only to choose from the possible analyses for a given word which analysis is correct in a particular context. In fact, this is what computational linguists usually do.
But in the interest of minimizing the human labor needed to build the morphological parser for the target language, F&H explore varying this division of labor, so that the morph analyzer might over-generate and the tagger would be called on to choose the correct analysis from among both possible and impossible analyses.
In addition to the high level decision governing the division of labor between parser and tagger, there are lower level tradeoffs in the development of the parser itself, some involving linguistic shortcuts. For instance, suppose that in some language, the end of a stem undergoes allomorphy before a suffix. One can imagine an alternative analysis, in which the stem would be divided from the suffix so that the changed part of the stem is treated as if it were part of the suffix. This increases the number of suffix allomorphs, since there is now an additional suffix allomorph which includes what was the end of the stem; and the number of conjugation classes increases correspondingly. An English example would be the f~v alternation in words like 'wife~wives'. Practically any linguist would argue for the analysis in which the stem ends in a labiodental, and the labiodental undergoes a voicing alternation. But under the alternative analysis, stems would be vowel-final, and for this set of words there would be a singular suffix -f and a plural suffix -ves (or /-vz/). While linguists might balk at such an approach, to my mind it represents an acceptable compromise, given the goals. Linguists may even be reminded of Maori, for which Hale (1973) argued that verbal stem-final consonants have undergone re-analysis to become part of the passive suffix.
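The shortcut analysis can be sketched mechanically: cut each singular/plural pair at the longest shared prefix and treat whatever is left as the suffixes. The Python below is an illustration of that idea only; the function names and word list are invented, it works on spelling rather than phonology, and it is not F&H's procedure.

```python
from collections import defaultdict


def common_prefix(a, b):
    """Longest string that both a and b start with."""
    length = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        length += 1
    return a[:length]


def split_pair(singular, plural):
    """Return (stem, singular suffix, plural suffix) under the shortcut analysis."""
    stem = common_prefix(singular, plural)
    return stem, singular[len(stem):], plural[len(stem):]


def inflection_classes(pairs):
    """Group stems by the pair of suffixes they take."""
    classes = defaultdict(list)
    for singular, plural in pairs:
        stem, sg_suffix, pl_suffix = split_pair(singular, plural)
        classes[(sg_suffix, pl_suffix)].append(stem)
    return dict(classes)


pairs = [("cat", "cats"), ("wolf", "wolves"), ("leaf", "leaves")]
print(inflection_classes(pairs))
```

Here ''wolf'' and ''leaf'' fall into a separate class with the suffixes -f and -ves, at the cost of one extra class. In spelling, ''wife~wives'' would come out as -fe and -ves because of the silent e.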
Another division of labor involves the creation of a lexicon. A morphological parser works best with a lexicon of stems, and preferably a lexicon with such additional information as part of speech and (where relevant) declension or paradigm class. (A lexicon used for parsing need not include semantic information, i.e. senses.) F&H describe a methodology for rapidly acquiring such a lexicon from texts; it appears to resemble work by John Goldsmith and others on automatically building morphological parsers. F&H discuss tradeoffs which make such lexicon acquisition easier (again, easier means less human intensive). Some of these tradeoffs also lead to parser over-generation; for example, one may decide to leave lower frequency words ambiguous as to their conjugation class.
For very common words F&H avoid the use of a parser entirely by supplying pre-built analyses, thereby ensuring both high recall and high precision for the most common words. For uncommon words, on the other hand, there may be no lexical entry, which means that the morphological parser can only guess what the stem might be. Guessing of course results in greater ambiguity. The hope is that tagging will later reduce ambiguity by choosing the most likely parse based on the word's context.
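The three-way treatment just described (pre-built analyses for frequent words, lexicon-checked parses for known stems, guesses for everything else) might look roughly like the following Python sketch. The suffix table, lexicon, tag names, and function name are all hypothetical, chosen only to show how guessing produces extra ambiguity; they do not reproduce F&H's analyzer.

```python
# Hypothetical resources, invented for illustration only.
FREQUENT = {"is": [("be", "VERB.3SG")]}              # pre-built analyses
LEXICON = {"walk": {"VERB", "NOUN"}, "cat": {"NOUN"}}  # stem -> parts of speech
SUFFIXES = {                                          # ending -> possible tags
    "s": ["NOUN.PL", "VERB.3SG"],
    "ed": ["VERB.PAST"],
    "": ["NOUN.SG", "VERB.BASE"],
}


def analyze(word):
    """Return a list of (stem, tag) analyses for one word form."""
    # Very common words bypass the parser entirely.
    if word in FREQUENT:
        return list(FREQUENT[word])

    # Otherwise propose every stem+suffix split the suffix table allows.
    candidates = []
    for suffix, tags in SUFFIXES.items():
        if not word.endswith(suffix):
            continue
        stem = word[: len(word) - len(suffix)]
        if stem:
            candidates.extend((stem, tag) for tag in tags)

    # Keep only analyses licensed by the lexicon, if there are any.
    known = [
        (stem, tag)
        for stem, tag in candidates
        if stem in LEXICON and tag.split(".")[0] in LEXICON[stem]
    ]
    # For unknown stems, fall back to all guesses: more recall, more ambiguity.
    return known or candidates


print(analyze("cats"))
print(analyze("blorks"))
```

The known word ''cats'' gets a single analysis, while the made-up word ''blorks'' gets four, which is the ambiguity the tagger is later asked to resolve.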
The next step is therefore to build a tagger, whose job is to choose the correct parsed form from the often ambiguous results returned by the parser. If F&H were working on a highly resourced language, they would train such a tagger using a large corpus which had been tagged (disambiguated) by hand. But annotated corpora are expensive, and F&H are trying to develop a methodology for low resourced languages. This then brings us to the next chapter, which is really the heart of F&H's method: the cross-language tagger. Rather than building an annotated corpus for the target low resourced language and training a tagger on that, F&H use a tagger which has been trained on a corpus for a closely related but higher resourced language: in the case of Russian, a Czech tagger; and in the case of Catalan and Portuguese, a Spanish tagger. They run several experiments for each target language, but report mostly on the Russian experiments. The baseline uses the Czech tagger (with some simple modifications) for both morphological analysis and disambiguation; as one might expect, this performs poorly. Subsequent experiments substitute the quick-and-dirty Russian morphological analyzer, and then add cognate detection to improve the morphological analysis (specifically, to adjust the probabilities coming out of the morph analyzer); simple syntactic transformations to compensate for differences between Czech and Russian clause structure; and finally, sub-taggers which are sensitive to only some of the morpho-syntactic properties of words. The latter is an attempt to deal with the sparse data problem, i.e. the fact that some combinations of morpho-syntactic properties are rare. All the experiments except this last one result in improved scores.
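The review does not spell out how F&H detect cognates, so the following Python sketch should be read as a generic stand-in: it pairs words across two related languages by normalized edit distance. The threshold value, the function names, and the Portuguese/Spanish word lists are assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                   # deletion
                    current[j - 1] + 1,                # insertion
                    previous[j - 1] + (ch_a != ch_b),  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """1.0 for identical strings, 0.0 for strings with nothing aligned."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def likely_cognates(target_words, bridge_words, threshold=0.7):
    """Map each target word to its closest bridge word, if close enough."""
    pairs = {}
    for target in target_words:
        best = max(bridge_words, key=lambda bridge: similarity(target, bridge))
        if similarity(target, best) >= threshold:
            pairs[target] = best
    return pairs


portuguese = ["livro", "noite", "cachorro"]
spanish = ["libro", "noche", "perro"]
print(likely_cognates(portuguese, spanish))
```

With the assumed threshold of 0.7, only ''livro''/''libro'' is accepted; ''noite''/''noche'' is a true cognate pair that plain edit distance misses, which is one reason to look at more sophisticated methods.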
In the end, then, what F&H describe is a way to build a quick-and-dirty morphological parser for the target low-resourced language, and to use a tagger trained on a related but more highly resourced language to disambiguate the word analyses provided by this parser in context. The result is a methodology for quickly creating morphologically tagged text for low-resourced languages.
The final chapter summarizes the results and suggests many directions for future work.
Given how little research has gone into natural language processing for low resourced languages, it is hard not to be excited about F&H's work.
Perhaps the biggest surprise to come out of this work is how often one's linguistic intuitions turn out to be wrong. For instance, it is commonplace that morphologically complex languages tend to have free word order; one might have thought that this would mean that a tagger, which uses the words in the context to disambiguate among possible morphosyntactic properties, would not help in such a language. In fact the tagger helps a great deal. One reading of this is that ''free word order'' is a misnomer; what is really free in such languages is not the individual word order, but the phrase order, the implication being that the neighboring phrase-internal words are often sufficient for disambiguation.
It might also seem obvious that the more morphosyntactic properties one was marking words for in a particular language -- that is, the larger the tagset -- the more difficult the task would be. If this were true, one might explore reducing the size of the tagset in hopes that this would make the results more accurate. Again, this turns out to be wrong (at least in the cases examined). One reason appears to be that reducing the tagset reduces the ability to use tags on words that a given word agrees with to disambiguate that word.
In short, while linguistics can inform hypothesis formation in computational linguistics, verifying a hypothesis requires testing it on real data. Of course, the results might vary depending on the choice of test languages; the languages of the experiments described here have a mostly fusional morphology, and results might differ with agglutinating languages. One also wonders what would happen with polysynthetic languages. At the same time, it is clear that comparing tagging results across languages does not work; too much depends on the corpus size and composition, the number of tags used, and other factors. So any claimed improvements on F&H's work will require careful testing on the languages they used. Generalizing those improvements across languages, or determining what works with languages of different typologies, will be still harder.
One experiment that F&H did not try is to modify the baseline by adding cognate detection, but without using the target language morphological analyzer. Eliminating the need to build a morphological analyzer of the target language by hand would remove the need to consult an existing grammatical description of the target language, a step which might be impossible for some languages. More sophisticated methods of automatic cognate detection, perhaps along the lines of Kondrak (2009), might improve such results.
There are several questions of a practical nature for F&H's method. For how many less resourced languages is there a more highly resourced language that is closely enough related to serve as the bridge language? This question can be broken down into several parts: first, how closely related does the bridge language need to be? This is not easy to answer, and F&H just touch on it. It is clear that Czech/Russian and Spanish/Portuguese/Catalan are close enough. Depending on how close the languages need to be, most languages of the world might belong to such a group.
Another part of this question is whether within each of these groups there is a highly resourced language that is typical enough of the other languages in the group (i.e. not an outlier). However, this may be the wrong question: where such a language does not already exist, it makes sense to choose one of the languages of a group and create the expensive resources, then apply the cross-language method to the other languages. This surely makes more sense than trying to create expensive resources for all the languages of every group.
The book is not without its defects. The brief and uneven discussions of the grammars of the languages studied, in chapter 4 and appendix C, may make linguists cringe. For example, contrary to what is said on p. 180, Spanish does not have a class of verbs ending in -or; and contrary to table 4.7, subject-verb agreement in Spanish, Catalan and Portuguese is based on person and number, not on gender and number. (The text on p. 57 appears to make this same mistake, although here it may just be a typo -- leaving out the word ''no''.) Again, table 4.7 claims that Spanish has a neuter gender. The alleged dropping of prepositions before complementizers in Catalan (p. 177) is probably an unmotivated analysis; in many respects, complementizers really are prepositions which take as their complements sentences instead of NPs, as Emonds (1985) pointed out. Finally, the book has a citation index, but a topic index (and perhaps a language index) would have been a welcome addition. But these are fairly minor issues, and at least the grammar sketches can easily be supplemented from other sources.
In conclusion, F&H have opened a very interesting door, showing us a method with many potential applications to less resourced languages. I suspect there are many other methods behind that door that we could put to use leveraging the computational analysis of one language to help analyze related languages. Finally, it is a potential way for field linguists and computational linguists to work together--again, after a lapse of some years (cf. Bird 2009).
Abney, Steven. 2007. Semisupervised Learning for Computational Linguistics. Chapman & Hall/CRC Computer Science & Data Analysis 8. CRC Press.
Albright, Adam. 2007. Lexical and morphological conditioning of paradigm gaps. In Curt Rice (ed.), Modeling ungrammaticality in optimality theory. Equinox.
Bird, Steven. 2009. Natural Language Processing and Linguistic Fieldwork. Computational Linguistics 35: 469-474.
Emonds, Joseph. 1985. A unified theory of syntactic categories. Studies in Generative Grammar 19. Dordrecht: Foris.
Goldsmith, John. 2001. Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics 27: 153-198.
Hale, Ken. 1973. Deep-surface canonical disparities in relation to analysis and change: An Australian example. Current Trends in Linguistics 11: 401-458.
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall Series in Artificial Intelligence. Upper Saddle River, NJ: Prentice Hall.
Kondrak, Grzegorz. 2009. Identification of Cognates and Recurrent Sound Correspondences in Word Lists. Traitement automatique des langues et langues anciennes 50: 201-235.
Manning, Christopher D., and Hinrich Schuetze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
ABOUT THE REVIEWER
Dr. Maxwell is a researcher in computational morphology and other
computational resources for low density languages, at the Center for
Advanced Study of Language at the University of Maryland. He has also
worked on endangered languages of Ecuador and Colombia, with the Summer
Institute of Linguistics, and on low density languages with the Linguistic
Data Consortium (LDC) of the University of Pennsylvania.