Review of A resource-light approach to morpho-syntactic tagging
AUTHORS: Anna Feldman, Jirka Hana
TITLE: A resource-light approach to morpho-syntactic tagging
SERIES TITLE: Language and Computers 70
Michael Maxwell, Center for Advanced Study of Language, University of Maryland
I am writing this review with an audience of linguists, not computational
linguists, in mind. I therefore think it important to say why linguists should
read a review of this computational linguistics book.
Linguists know that there are thousands of languages in the world; computational
linguists often behave as if there are just a few, namely those for which there
are many resources: dictionaries, corpora, treebanks, etc. This book represents
a break from that thinking, in that the authors develop a way to build
computational tools for less-resourced languages, specifically tools for
morphology. Tools like these could, if further developed, be used to describe
less-resourced languages and annotate texts in such languages, and this is the
first and foremost reason linguists -- including field linguists -- should pay
attention.
This is of course not the first time computational linguists (or at least
linguists who do computer programming) have built morphological tools. In the
context of field linguistics, SIL has produced several such tools (most notably
Shoebox/Toolbox, and FieldWorks Language Explorer, or FLEx). More generally,
there are numerous tools for building hand-crafted morphological parsers and
generators, beginning with SIL's AMPLE and STAMP, as well as more modern tools
like the Xerox Finite State Transducer. At the other end of a spectrum from
hand-built to machine-built, there has been considerable work on automatically
building morphological parsers using only unannotated text corpora, i.e.
unsupervised machine learning. Creating morphological parsers by unsupervised
machine learning is of considerable theoretical interest, since it captures to
some extent what children must be doing. The literature on this is becoming
large, and there is as yet no general survey; see Goldsmith 2001 for an early
and still influential paper, and the presentations at
http://www.cis.hut.fi/morphochallenge2009/workshop.shtml for some recent work.
But from a practical standpoint -- that is, meeting the needs of linguists
interested in describing the thousands of languages of the world -- neither the
hand-built approach, nor the machine learning from raw corpora approach, is
adequate. The manual approach takes a long time to create adequate morphological
analyses for under- or un-documented languages, while unsupervised machine
learning has yet to achieve what most working linguists would consider adequate
performance (not to mention the fact that machine learning from unannotated
corpora is incapable of assigning any meaning to the affixes it finds).
The authors of this work, Anna Feldman and Jirka Hana (F&H), take a middle road,
which holds promise of being able to create more adequate morphological tools
for many languages than unsupervised machine learning, and to do so much more
quickly than a purely hand-crafted approach. This middle road combines
hand-built resources which are relatively simple and can therefore be built
relatively quickly, with machine learning. But the point which sets their work
apart from a standard semi-supervised approach (Abney 2007) is that F&H's
machine learning programs are first trained on a related language (which I will
call a "bridge language") for which more resources are assumed to be available.
This is of course what most (human) linguists would do: when describing a
previously unstudied language, one usually finds out what linguists have
discovered about related languages. (This is not the first computational
linguistic effort to use the notion of a bridge language, but it is perhaps the
most extended exploration of such an approach. F&H discuss some of the previous
work in their third chapter.)
In the test cases described in this book, the higher resourced languages are
Czech and Spanish, while the lesser resourced languages are Russian on the one
hand, and Portuguese and Catalan on the other. Of course none of these languages
is truly a low resource language. But they are appropriate choices, given that
in order to demonstrate that a technology works, one needs a way to evaluate the
results -- that is, one needs to have (or to be able to quickly create) a gold
standard for the target languages, against which to test the technology.
With this rather extended preface, then, I hope to have whetted the appetite of
linguists reading this review. I now turn to the book itself.
This is a short book: just over 130 pages in the main text, plus several
appendices describing technical details of the methodology, and a very brief
grammatical overview of the languages used as test cases. Part of the book is
based on Feldman's Ph.D. dissertation, but the present work is a joint effort.
The first chapter is a very brief introduction to the overall theme, while the
second chapter gives a background on previous work on "tagging." (The reader who
is unfamiliar with tagging may want more background than F&H give; for that, any
introductory textbook on computational linguistics should suffice, such as
Manning and Schuetze (1999) or Jurafsky and Martin (2009). I might also
recommend Abney (2007) which, although it deals mostly with semi-supervised
learning, may prove more accessible to linguists.)
Tagging is a form of corpus annotation in which each word (or rather, each
token, typically including tokens consisting of punctuation characters) is
marked for some properties. In the context of this book, tagging means assigning
such morpho-syntactic properties as person, number, tense and so forth, and a
citation form (or assigning a special tag to punctuation tokens). This kind of
tagging is sometimes misleadingly referred to as "part of speech tagging",
whereas in fact the tags are at a much finer grain than what linguists typically
think of as parts of speech. Tagging is in fact similar to what linguists do
when they create interlinear text, except that in tagging, the individual affix
morphemes may not be distinguished, and the citation form of the word stands in
for the gloss of one of its senses. (Assigning the correct lexeme gloss to
words, which is usually done as part of interlinear text glossing, is something
that computational linguists call 'sense disambiguation', and is not discussed
in this book.)
Tagging is a two part process: first, one finds all possible tags for a
particular word; second, one chooses the correct tag from among the possible
tags. Since the computer cannot choose the correct tag on the basis of the
meaning of a word in context, computational tagging is done probabilistically on
the basis of properties of the neighboring words, such as their tags.
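The two steps can be made concrete with a toy sketch. Everything in it is invented for illustration (the English word list, the coarse tag names, and the probabilities are assumptions, not figures from the book), and it uses only tag-to-tag transition probabilities, which is a simplification of what a real tagger does:

```python
from typing import Dict, List, Tuple

# Step 1: all possible tags for each word form (hypothetical analyzer output).
CANDIDATES: Dict[str, List[str]] = {
    "the": ["DET"],
    "old": ["ADJ", "NOUN"],
    "man": ["NOUN", "VERB"],
    "boats": ["NOUN", "VERB"],
}

# Invented probabilities P(tag | previous tag); "<s>" marks the sentence start.
TRANSITION: Dict[Tuple[str, str], float] = {
    ("<s>", "DET"): 0.9,
    ("DET", "ADJ"): 0.5, ("DET", "NOUN"): 0.5,
    ("ADJ", "NOUN"): 0.8, ("ADJ", "VERB"): 0.05,
    ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.2,
    ("VERB", "NOUN"): 0.5, ("VERB", "VERB"): 0.05,
}

def tag(words: List[str]) -> List[str]:
    """Step 2: pick the most probable tag sequence (a small Viterbi search)."""
    # best[t] = (probability of the best path ending in tag t, that path)
    best: Dict[str, Tuple[float, List[str]]] = {"<s>": (1.0, [])}
    for word in words:
        new_best: Dict[str, Tuple[float, List[str]]] = {}
        for t in CANDIDATES[word]:
            # Unseen tag pairs get a tiny probability instead of zero.
            score, path = max(
                ((p * TRANSITION.get((prev, t), 1e-6), prev_path)
                 for prev, (p, prev_path) in best.items()),
                key=lambda pair: pair[0],
            )
            new_best[t] = (score, path + [t])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])[1]

print(tag(["the", "old", "man"]))
```

Here "old" and "man" are each ambiguous in isolation; the neighboring tags are what settle the choice.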
The third chapter briefly discusses previous work on natural language processing
for languages with few computational resources, ranging from lexicon acquisition
to syntactic parsing. F&H sketch their own goals in view of this background,
pointing out that their wish is to produce computational resources which are
interpretable by humans. This may sound obvious, but in fact most statistical
machine translation programs produce "grammars" which are far from interpretable
in a linguistic sense.
Chapter four gives an overview of the grammars and the corpora of the languages
in question, and the tagsets to be used (i.e. the information about gender etc.
that words will be marked with). The grammar sketches are amplified in an
appendix; I am not sure why there is this redundancy.
The fifth, rather short, chapter quantifies properties of the tagsets for each
language, pointing out the data sparsity problem: it takes a large corpus to see
all the possible tags (i.e. all the combinations of morphosyntactic features
possible for a given part of speech). This is in contrast to a relatively
uninflected language like English, where a relatively small corpus suffices to
see each tag at least once. They point out that to some extent, the data
sparsity is caused by the choice of news text for corpora. A conversational
corpus would certainly show a different distribution of person marking, for
example, although whether it would be more diverse (and therefore have less of a
data sparsity problem) is not obvious.
Zipf's Law is also a source of difficulty: no matter how large the corpus, many
words will be attested in only a subset of the forms in which they could
theoretically appear. This is of course a problem for the human language
learner, too -- and the fact that humans usually cope with this (but not always,
cf. Albright 2007) suggests that it should be possible for the computer to do
well, too. But that is for future work.
With these preliminaries out of the way, chapter six turns to morphological
analysis. Recall that the overall task is to tag each word of the target
language corpus with its part of speech and morphosyntactic information. F&H
decompose this task into a morphological analysis phase, followed by a
disambiguation phase. There is a potential tradeoff here: the more effort one
expends in getting the morphological analyzer right -- that is, getting it to
produce only the possible parses for each word -- the less effort will be
required in the disambiguation phase. In the limit, a perfect morph analyzer
would generate only the possible analyses for each word, so that the task of a
tagger (human or machine) is only to choose from the possible analyses for a
given word which analysis is correct in a particular context. In fact, this is
what computational linguists usually do.
But in the interest of minimizing the human labor needed to build the
morphological parser for the target language, F&H explore varying this division
of labor, so that the morph analyzer might over-generate and the tagger would be
called on to choose the correct analysis from among both possible and impossible
analyses.
In addition to the high level decision governing the division of labor between
parser and tagger, there are lower level tradeoffs in the development of the
parser itself, some involving linguistic shortcuts. For instance, suppose that
in some language, the end of a stem undergoes allomorphy before a suffix. One
can imagine an alternative analysis, in which the stem would be divided from the
suffix so that the changed part of the stem is treated as if it were part of the
suffix. This increases the number of suffix allomorphs, since there is now an
additional suffix allomorph which includes what was the end of the stem; and the
number of conjugation classes increases correspondingly. An English example
would be the f~v alternation in words like 'wife~wives'. Practically any
linguist would argue for the analysis in which the stem ends in a labiodental,
and the labiodental undergoes a voicing alternation. But under the alternative
analysis, stems would be vowel-final, and for this set of words there would be a
singular suffix -f and a plural suffix -ves (or /-vz/). While linguists might
balk at such an approach, to my mind it represents an acceptable compromise,
given the goals. Linguists may even be reminded of Maori, for which Hale (1973)
argued that verbal stem-final consonants have undergone re-analysis to become
part of the passive suffix.
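A minimal sketch of the shortcut analysis may help. It is my own illustration, not F&H's implementation: the class names and suffix tables are hypothetical, and I use the orthographic pair 'leaf~leaves' because it splits cleanly into a vowel-final stem plus -f or -ves:

```python
from typing import Dict, List, Tuple

# Hypothetical paradigm classes. The alternating consonant is treated as part
# of the suffix, so no stem-alternation rule is needed; the price is an extra
# class with its own suffix allomorphs.
PARADIGMS: Dict[str, Dict[str, str]] = {
    "regular": {"": "SG", "s": "PL"},      # cat, cat-s
    "f_class": {"f": "SG", "ves": "PL"},   # lea-f, lea-ves
}

def analyze(word: str) -> List[Tuple[str, str, str]]:
    """Return every (stem, class, number) split that the suffix tables allow."""
    analyses = []
    for cls, suffixes in PARADIGMS.items():
        for suffix, number in suffixes.items():
            if word.endswith(suffix) and len(word) > len(suffix):
                stem = word[: len(word) - len(suffix)]
                analyses.append((stem, cls, number))
    return analyses

print(analyze("leaf"))    # includes the split lea + f
print(analyze("leaves"))  # includes the split lea + ves, plus spurious splits
```

Without a stem lexicon the analyzer also proposes splits such as a regular singular 'leaves', which is exactly the kind of over-generation the tagger is later asked to clean up.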
Another division of labor involves the creation of a lexicon. A morphological
parser works best with a lexicon of stems, and preferably a lexicon with such
additional information as part of speech and (where relevant) declension or
paradigm class. (A lexicon used for parsing need not include semantic
information, i.e. senses.) F&H describe a methodology for rapidly acquiring such
a lexicon from texts; it appears to resemble work by John Goldsmith and others
on automatically building morphological parsers. F&H discuss tradeoffs which
make such lexicon acquisition easier (again, easier means less human intensive).
Some of these tradeoffs also lead to parser over-generation; for example, one
may decide to leave lower frequency words ambiguous as to their conjugation class.
For very common words F&H avoid the use of a parser entirely by supplying
pre-built analyses, thereby ensuring both high recall and high precision for the
most common words. For uncommon words, on the other hand, there may be no
lexical entry, which means that the morphological parser can only guess what the
stem might be. Guessing of course results in greater ambiguity. The hope is that
tagging will later reduce ambiguity by choosing the most likely parse based on
the word's context.
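The three tiers just described (pre-built analyses, lexicon-backed parsing, and guessing) can be sketched roughly as follows. This is a loose illustration under my own assumptions; the tiny English tables and tag labels are invented and do not reproduce F&H's system:

```python
from typing import Dict, List, Tuple

# Tier 1: hand-supplied analyses for very frequent words (hypothetical entry).
FREQUENT: Dict[str, List[Tuple[str, str]]] = {"is": [("be", "VERB.3SG.PRES")]}
# Tier 2: stems with part of speech, as if acquired from text.
LEXICON: Dict[str, str] = {"walk": "VERB"}
# Suffixes and the tags each may signal.
SUFFIXES: Dict[str, List[str]] = {
    "s": ["VERB.3SG.PRES", "NOUN.PL"],
    "ed": ["VERB.PAST"],
}

def analyze(word: str) -> List[Tuple[str, str]]:
    """Return (lemma or stem, tag) pairs, preferring the more reliable tiers."""
    if word in FREQUENT:
        return FREQUENT[word]
    known, guessed = [], []
    for suffix, tags in SUFFIXES.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            for tag in tags:
                if stem in LEXICON:
                    # The lexicon filters out tags of the wrong part of speech.
                    if tag.startswith(LEXICON[stem]):
                        known.append((stem, tag))
                else:
                    # Tier 3: no lexical entry, so every suffix reading survives.
                    guessed.append((stem, tag))
    return known or guessed

print(analyze("walks"))   # one analysis, thanks to the lexicon
print(analyze("blorks"))  # two analyses: the stem is only a guess
```

The unknown word comes back more ambiguous than the known one, which is the ambiguity the tagger is expected to resolve from context.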
The next step is therefore to build a tagger, whose job is to choose the correct
parsed form from the often ambiguous results returned by the parser. If F&H were
working on a highly resourced language, they would train such a tagger using a
large corpus which had been tagged (disambiguated) by hand. But annotated
corpora are expensive, and F&H are trying to develop a methodology for low
resourced languages. This then brings us to the next chapter, which is really
the heart of F&H's method: the cross-language tagger. Rather than building an
annotated corpus for the target low resourced language and training a tagger on
that, F&H use a tagger which has been trained on a corpus for a closely related
but higher resourced language: in the case of Russian, a Czech tagger; and in
the case of Catalan and Portuguese, a Spanish tagger. They run several
experiments for each target language, but report mostly on the Russian
experiments. The baseline uses the Czech tagger (with some simple modifications)
for both morphological analysis and disambiguation; as one might expect, this
performs poorly. Subsequent experiments substitute the quick-and-dirty Russian
morphological analyzer, and then add cognate detection to improve the
morphological analysis (specifically, to adjust the probabilities coming out of
the morph analyzer); simple syntactic transformations to compensate for
differences between Czech and Russian clause structure; and finally, sub-taggers
which are sensitive to only some of the morpho-syntactic properties of words.
The latter is an attempt to deal with the sparse data problem, i.e. the fact
that some combinations of morpho-syntactic properties are rare. All the
experiments except this last one result in improved scores.
In the end, then, what F&H describe is a way to build a quick-and-dirty
morphological parser for the target low-resourced language, and to use a tagger
trained on a related but more highly resourced language to disambiguate the word
analyses provided by this parser in context. The result is a methodology for
quickly creating morphologically tagged text for low-resourced languages.
The final chapter summarizes the results and suggests many directions for future
work.
Given how little research has gone into natural language processing for low
resourced languages, it is hard not to be excited about F&H's work.
Perhaps the biggest surprise to come out of this work is how often one's
linguistic intuitions turn out to be wrong. For instance, it is commonplace that
morphologically complex languages tend to have free word order; one might have
thought that this would mean that a tagger, which uses the words in the context
to disambiguate among possible morphosyntactic properties, would not help in
such a language. In fact the tagger helps a great deal. One reading of this is
that "free word order" is a misnomer; what is really free in such languages is
not the individual word order, but the phrase order, the implication being that
the neighboring phrase-internal words often are sufficient for disambiguation.
It might also seem obvious that the more morphosyntactic properties one was
marking words for in a particular language -- that is, the larger the tagset --
the more difficult the task would be. If this were true, one might explore
reducing the size of the tagset in hopes that this would make the results more
accurate. Again, this turns out to be wrong (at least in the cases examined).
One reason appears to be that reducing the tagset reduces the ability to use
tags on words that a given word agrees with to disambiguate that word.
In short, while linguistics can inform hypothesis formation in computational
linguistics, verifying a hypothesis requires testing it on real data. Of course,
the results might vary depending on the choice of test languages; the languages
of the experiments described here have a mostly fusional morphology, and results
might differ with agglutinating languages. One also wonders what would happen
with polysynthetic languages. At the same time, it is clear that comparing
tagging results across languages does not work; too much depends on the corpus
size and composition, the number of tags used, and other factors. So any claimed
improvements on F&H's work will require careful testing on the languages they
used. Generalizing those improvements across languages, or determining what
works with languages of different typologies, will be still harder.
One experiment that F&H did not try is to modify the baseline by adding cognate
detection, but without using the target language morphological analyzer.
Eliminating the need to build a morphological analyzer of the target language by
hand would remove the need to consult an existing grammatical description of the
target language, a step which might be impossible for some languages. More
sophisticated methods of automatic cognate detection, perhaps along the lines of
Kondrak (2009), might improve such results.
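For readers unfamiliar with automatic cognate detection, the simplest possible version compares word forms by edit distance. The sketch below is a deliberately crude stand-in, far simpler than the methods Kondrak describes, and I do not claim it is what F&H used; the threshold and the transliterated, diacritic-free word pairs are my own assumptions:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Edit distance scaled to the range 0..1 by the longer word's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def likely_cognates(a: str, b: str, threshold: float = 0.5) -> bool:
    # The threshold is an arbitrary illustrative value, not a tuned one.
    return similarity(a, b) >= threshold

# Czech 'mleko' and Russian 'moloko' (both 'milk') versus an unrelated pair.
print(likely_cognates("mleko", "moloko"))
print(likely_cognates("ryba", "kniga"))
```

A measure like this knows nothing about regular sound correspondences, which is precisely where more sophisticated methods would be expected to do better.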
There are several questions of a practical nature for F&H's method. For how many
less resourced languages is there a more highly resourced language that is
closely enough related to serve as the bridge language? This question can be
broken down into several parts: first, how closely related does the bridge
language need to be? This is not easy to answer, and F&H just touch on it. It is
clear that Czech/Russian, and Spanish/Portuguese/Catalan are close enough.
Depending on how close the languages need to be, most languages of the world
might belong to such a group.
Another part of this question is whether within each of these groups there is a
highly resourced language that is typical enough of the other languages in the
group (i.e. not an outlier). However, this may be the wrong question: where such
a language does not already exist, it makes sense to choose one of the languages
of a group and create the expensive resources, then apply the cross-language
method to the other languages. This surely makes more sense than trying to
create expensive resources for all the languages of every group.
The book is not without its defects. The brief and uneven discussions of the
grammars of the languages studied, in chapter 4 and appendix C, may make
linguists cringe. For example, contrary to what is said on p. 180, Spanish does
not have a class of verbs ending in -or; and contrary to table 4.7, subject-verb
agreement in Spanish, Catalan and Portuguese is based on person and number, not
on gender and number. (The text on p. 57 appears to make this same mistake,
although here it may just be a typo -- leaving out the word "no".) Again, table
4.7 claims that Spanish has a neuter gender. The alleged dropping of
prepositions before complementizers in Catalan (p. 177) is probably an
unmotivated analysis; in many respects, complementizers really are prepositions
which take as their complements sentences instead of NPs, as Emonds (1985)
pointed out. Finally, the book has a citation index, but a topic index (and
perhaps a language index) would have been a welcome addition. But these are
fairly minor issues, and at least the grammar sketches can easily be
supplemented from other sources.
In conclusion, F&H have opened a very interesting door, showing us a method with
many potential applications to less resourced languages. I suspect there are
many other methods behind that door that we could put to use leveraging the
computational analysis of one language to help analyze related languages.
Finally, their approach offers a potential way for field linguists and
computational linguists to work together -- again, after a lapse of some years
(cf. Bird 2009).
Abney, Steven. 2007. Semisupervised Learning for Computational Linguistics.
Chapman & Hall/CRC Computer Science & Data Analysis 8. CRC Press.
Albright, Adam. 2007. Lexical and morphological conditioning of paradigm gaps. In
Curt Rice (ed.), Modeling ungrammaticality in optimality theory. Equinox.
Bird, Steven. 2009. Natural Language Processing and Linguistic Fieldwork.
Computational Linguistics 35: 469-474.
Emonds, Joseph. 1985. A unified theory of syntactic categories. Studies in
Generative Grammar 19. Dordrecht: Foris.
Goldsmith, John. 2001. Unsupervised Learning of the Morphology of a Natural
Language. Computational Linguistics 27: 153-198.
Hale, Ken. 1973. Deep-surface canonical disparities in relation to analysis and
change: An Australian example. Current Trends in Linguistics 11: 401-458.
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing: An
Introduction to Natural Language Processing, Computational Linguistics and
Speech Recognition. Prentice Hall Series in Artificial Intelligence. Upper
Saddle River, NJ: Prentice Hall.
Kondrak, Grzegorz. 2009. Identification of Cognates and Recurrent Sound
Correspondences in Word Lists. Traitement automatique des langues et langues
anciennes 50: 201-235.
Manning, Christopher D., and Hinrich Schuetze. 1999. Foundations of Statistical
Natural Language Processing. Cambridge, MA: MIT Press.
ABOUT THE REVIEWER
Dr. Maxwell is a researcher in computational morphology and other
computational resources for low density languages, at the Center for
Advanced Study of Language at the University of Maryland. He has also
worked on endangered languages of Ecuador and Colombia, with the Summer
Institute of Linguistics, and on low density languages with the Linguistic
Data Consortium (LDC) of the University of Pennsylvania.