Editors: Soudi, Abdelhadi; van den Bosch, Antal P.; Neumann, Günter
Title: Arabic Computational Morphology
Subtitle: Knowledge-based and Empirical Methods
Series Title: Text, Speech and Language Technology
Publisher: Springer
Year: 2007
Adel Jebali, Département de Linguistique et de Didactique des Langues, Université du Québec à Montréal (UQAM), Canada.
SUMMARY
This book is a collection of papers that deal with the different methods employed in the field of Arabic computational morphology and with the use of these approaches in large-scale applications. The collection covers two main approaches: knowledge-based and empirical. The difference between them lies in the way computational linguists provide the linguistic knowledge used in the analysis. In knowledge-based techniques, the linguist encodes the linguistic knowledge manually, based on a predefined theory of the morphological entities. In empirical approaches, on the other hand, linguistic knowledge is extracted directly from natural language data by means of machine learning techniques.
The preface is written by Richard Sproat, an eminent linguist working on computational morphology. The book itself is divided into four parts, containing fifteen chapters. Part 1 is a three-chapter introduction to Arabic computational morphology, and specifically to the two methods used in this field: knowledge-based and empirical. Part 2 contains four chapters and focuses on knowledge-based methods. Part 3 contains four chapters and deals with empirical methods. Finally, the four chapters of Part 4 deal with the integration of Arabic morphology in two main applications: information retrieval and machine translation.
Chapter 1 is written by the editors to offer a brief roadmap of the book. They introduce the two approaches widely used in Arabic computational morphology, the applications related to this field, and Basic Language Resource Kits (BLARK) for Arabic.
Chapter 2 focuses on the transliteration scheme adopted in this book to represent Arabic characters. The authors also present guidelines for pronouncing Arabic using this scheme. The goal is to have a standard for transliterating Arabic script that is respected by all the authors in this book. The scheme is proposed as a complete system to be widely adopted by the natural language processing research community working on Arabic, which currently lacks such a standard.
Chapter 3 provides a presentation of the main issues facing Arabic morphological analysis. Even if the relation between modern dialects and Modern Standard Arabic is a challenging one, Timothy Buckwalter thinks that the salient issues are orthographic. These include the status of non-standard Arabic characters, the persistent variation in the spelling of some letters, problems related to the tokenization of Arabic input strings and the absence of annotation for lexically-determined features, such as gender, number and humanness.
Chapter 4 begins Part 2. It introduces the first of the knowledge-based approaches, called Syllable-Based Morphology (SBM). In this model, morphological realizations are defined in terms of their syllable structure. Cahill shows that this framework accounts for facts from Semitic languages, and particularly Arabic, in the same way it accounts for facts from European ones.
The second knowledge-based approach to Arabic morphology is presented in the fifth chapter. Al-Najem demonstrates the benefits of using an inheritance-based model to account for Arabic root-and-pattern morphology and to capture generalizations, dependencies and syncretisms. He further implements his analysis in DATR, an inheritance network formalism designed for the representation of natural language lexical information.
In the sixth chapter, Cavalli-Sforza and Soudi present and use another approach, the Lexeme-Based Morphology of Aronoff (1994) and Beard (1995). In this theory, priority is given to stems rather than to prefixes and suffixes. The authors propose a concatenative method to generate Arabic inflected forms even when the underlying morphological process is not concatenative in nature. They implement this approach in an extension of the MORPHÉ tool developed by Leavitt (1994).
The last method in this paradigm draws on work in two projects: DIINAR.1 and the SYSTRAN Arabic-English translator. The approach adopted in those projects is a stem-based Arabic lexicon with grammar and lexis specifications. It is presented in Chapter 7 by Dichy and Farghaly. The authors argue that the most appropriate way to store information for a language like Arabic is to use stem-grounded lexical databases whose entries are associated with grammar and lexis specifications.
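For readers unfamiliar with root-and-pattern morphology, the following minimal Python sketch illustrates the basic idea of interleaving a consonantal root with a vocalic pattern. It is not Al-Najem's DATR analysis: the function name `interdigitate`, the use of `C` as a consonant slot, and the simplified Latin transliteration are all assumptions made for illustration only.

```python
def interdigitate(root: str, pattern: str) -> str:
    """Fill each 'C' slot of a pattern with the next consonant of the root.

    Hypothetical helper for illustration: 'C' marks a consonant slot and every
    other character of the pattern is copied unchanged.
    """
    if pattern.count("C") != len(root):
        raise ValueError("pattern must have one C slot per root consonant")
    consonants = iter(root)
    # Replace consonant slots left to right; keep vowels as they are.
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


if __name__ == "__main__":
    # Root k-t-b (related to writing) combined with three simplified patterns.
    for pattern in ("CaCaC", "CuCiC", "CaaCiC"):
        print(pattern, interdigitate("ktb", pattern))
```

The same root yields different word forms depending on the pattern, which is the kind of regularity an inheritance-based lexicon can state once and share across many entries.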
The third part of the book focuses on empirical methods and presents four accounts of data-driven processing models of Arabic morphology. Chapter 8, by Daya et al., serves as a sort of introduction to these methods. The authors present a machine learning approach to the problem of extracting the consonantal roots of Arabic words. This approach relies on both statistical methods and linguistic constraints. The accuracy of the resulting predictions is by no means inferior to that of human judgments about the correct roots.
The second account in this paradigm is presented in Chapter 9 by Diab et al. These authors provide a Support Vector Machine (SVM) based approach to tokenizing, tagging and annotating Modern Standard Arabic data. They apply a method that has proved efficient on English data and obtain high scores on the Arabic Treebank. Chapters 10 and 11 present two memory-based models whose application data come from the Arabic Treebank. The first of these models is semi-supervised, while the second is supervised. In Chapter 10, Clark uses partially supervised machine learning techniques, largely motivated by first language acquisition: the learner is presented with a pair of sets of words and must align them. The author's focus is on the broken plural (a nonconcatenative morphological process). In Chapter 11, van den Bosch et al. use annotated corpora to apply memory-based learning to morphological analysis and part-of-speech tagging of written Arabic.
Chapter 12 begins Part 4. Larkey et al. focus on one possible application of Arabic computational morphology: information retrieval (henceforth IR). They use a method called light-stemming, i.e. stemming without resorting to morphological analysis. They argue that this method is more efficient than several stemmers based on morphological analysis.
Chapter 13 deals with IR as well. Darwish and Oard present a method for adapting existing Arabic morphological analysis techniques to make them suitable for the requirements of IR. They also present a shallow statistical Arabic morphological analyzer called Sebawai and a light-stemmer called Al-Stem. The authors used both in an IR application to produce Arabic index terms.
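To give a rough idea of what stemming without morphological analysis can look like, here is a minimal Python sketch of a light stemmer that only strips a few frequent prefixes and suffixes. It is not the stemmer of Larkey et al.: the affix lists, the minimum stem length and the function name `light_stem` are assumptions chosen for illustration.

```python
# Illustrative affix lists (assumptions for this sketch, not taken from the chapter).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]   # wal-, bal-, kal-, fal-, al-, w-
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]
MIN_STEM_LEN = 3  # hypothetical guard: never cut a word below three letters


def light_stem(word: str) -> str:
    """Strip at most one prefix, then any number of suffixes, longest first."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    stripped = True
    while stripped:
        stripped = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
                word = word[:-len(suffix)]
                stripped = True
                break
    return word


if __name__ == "__main__":
    for w in ["والكتاب", "المعلمون", "ولد"]:
        print(w, light_stem(w))
```

No lexicon or root extraction is involved: the procedure only looks at the surface string, which is what distinguishes light stemming from analysis-based stemmers.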
The second application to benefit from morphological analysis is Machine Translation (henceforth MT). In Chapter 14, Habash is mainly concerned with the representations used by different MT-relevant resources (morphological analyzers, dictionaries and treebanks). He discusses the usability of these representations in different MT approaches and argues that the lexeme-and-feature level of representation is motivated.
The last chapter focuses on MT as well. Guessoum and Zantout investigate the impact of Arabic morphological generation on the quality of MT systems. The system they chose is a web-based English-to-Arabic MT system called Ajeeb. They translated thousands of sentences using this tool and analyzed the translations. Their analysis reveals that morphological information captures various linguistic aspects and affects the quality of the translation.
EVALUATION
I think this collection could indeed be a very good starting point for any researcher who wants to engage with Arabic computational morphology, its challenges, its theories and its applications. The tripartite division provides a clear distinction between the main problems, the theoretical issues and the areas of application. As the editors rightly state, this book is unique in several respects. I know of no other book with so wide a coverage of both knowledge-based and empirical methods, and of applications as well.
The book offers a general view of the trends in Arabic computational morphology, but it omits one of the most important approaches: the so-called finite-state morphology of Beesley (1989, 1990), which has contributed greatly to Arabic computational morphology. No paper is devoted to this knowledge-based approach. The editors mention it in Chapter 1 and present some of its concepts, but I think that this brief presentation does not do justice to such an important theory in the history of computational morphology.
Furthermore, redundancy is the main drawback of this book. Each author in each chapter, with the exception of Chapter 12, is concerned with presenting an introduction to Arabic morphology. While this could be useful for someone reading only one paper, or some of the papers in isolation, it may be somewhat tedious for someone who reads all the chapters in the book. It would have been preferable to devote a chapter to introducing Arabic and, specifically, Arabic morphology. Chapter 3 was meant for that purpose, but Buckwalter focuses mainly on orthographic issues, whereas it is well established that the main issues in Arabic morphology are linguistic (its nonconcatenative nature, for example, as stated by McCarthy (1981)).
Some dialects are mentioned in the papers, such as Egyptian and Levantine Arabic in Chapter 3, which is the only chapter that takes into account the complexity of the data from both Standard Arabic and the modern dialects. In the remainder of the book, however, the main focus is on Standard Arabic. While this is a natural choice when dealing with written Arabic, dialects should have been taken into account in order to propose more precise linguistic analyses. In addition, what some authors call ‘Standard Arabic’ is not defined in the papers or in the introduction. Cahill states: “The data we will cover in this chapter is from Standard Arabic.” (Chapter 4, page 48), and further: “We will not address bi- and quadriliteral roots, even though the latter do occur in Classical Arabic.” (pages 48-49). This suggests that ‘Standard Arabic’ somehow includes the variety called ‘Classical Arabic’, but that data from the latter are not taken into account. Dichy and Farghaly (Chapter 7) state clearly that the variety studied is ‘Modern Standard Arabic’ (page 116), which means that data from ‘Classical Arabic’ are not discussed. Finally, Larkey et al. (Chapter 12) declare: “The morphological complexity of Arabic (see Chapter 3 of this volume) makes it particularly difficult to develop natural language processing applications for Arabic information retrieval.” (page 222). They refer to Chapter 3, where Buckwalter takes into account both Standard Arabic and the modern dialects. Nevertheless, their own analysis takes only Standard Arabic into account.
Apart from these issues, there are some minor points I would like to address. A glossary at the end of the book would have been very useful for someone looking for the definition of a specific notion. In addition, the index is too short and does not contain the names of the authors cited in the papers. The lists of bibliographic references follow different formatting standards from one chapter to another; the editors should have paid more attention to this aspect. Finally, while most authors both gloss and translate their Arabic examples, some translate without glossing (see Chapters 6 and 15, for example).
REFERENCES
Aronoff, M. (1994) _Morphology by Itself: Stems and Inflectional Classes_. Cambridge, MA: MIT Press.
Beard, R. (1995) _Lexeme-Morpheme Base Morphology: A General Theory of Inflection and Word Formation_. Albany: State University of New York Press.
Beesley, K. R. (1989) Computer Analysis of Arabic Morphology: A Two-Level Approach with Detours. In _Third Annual Symposium on Arabic Linguistics_. Salt Lake City: University of Utah. Published as Beesley, 1991.
Beesley, K. R. (1990) Finite-State Description of Arabic Morphology. In _Proceedings of the Second Cambridge Conference on Bilingual Computing in Arabic and English_. No pagination.
Leavitt, J.R. (1994) MORPHÉ: A Morphological Rule Compiler. Technical Report, CMU-CMT-94-MEMO.
McCarthy, J. (1981) A Prosodic Theory of Nonconcatenative Morphology. _Linguistic Inquiry_, vol. 12, pp. 373–418.
ABOUT THE REVIEWER
Adel Jebali is currently a lecturer and a PhD student in linguistics at the Université du Québec à Montréal (UQAM). His research focuses on the implementation of Arabic argument markers within the HPSG framework using the LKB system. He is also interested in computational linguistics, and more specifically in Arabic computational morphology and syntax.