Editor for this issue: Terence Langendoen <terry
linguistlist.org>
Bourigault, Didier, Christian Jacquemin, and Marie-Claude L'Homme, eds. (2001) Recent Advances in Computational Terminology. John Benjamins Publishing Company, xvii+375pp, hardback ISBN 1-58811-016-8, $99.00, Natural Language Processing 2.

Federica Da Milano, Department of Linguistics, University of Pavia, Italy

INTRODUCTION

The book is an edited collection of 17 articles by researchers in the automatic analysis, storage, and use of terminology, and by specialists in applied linguistics, computational linguistics, information retrieval, and artificial intelligence. It follows the First Workshop on Computational Terminology, which took place at COLING-ACL'98. The goal of the workshop was to bring together researchers from different scientific communities, leading to the recognition of a field that can now be called 'computational terminology'. The book contains extended and revised versions of the papers published in the proceedings. The contributions reflect the innovative and fruitful advances in computational terminology at the crossroads of terminology, linguistics, and computer science, and they show the wide range of fields for which computational terminology tools are developed. The following applications are covered: information retrieval, the building of bilingual lexicons, terminography, and automatic abstracting. The book shows that there are a few similarities and many differences among the techniques and approaches that computational terminologists use to improve term extraction or to assist terminology-related applications.

SUMMARY

1.
A graph-based approach to the automatic generation of multilingual keyword clusters (Akiko Aizawa and Kyo Kageura)

The authors introduce a graph-based approach to the automatic generation of Japanese-English bilingual keyword clusters from the keyword lists that the papers' authors assign to academic papers. Each generated cluster contains keywords with similar meanings from both languages. After describing the methodology used to extract the data, the authors provide a useful overview of the state of the art in their topic and situate their work against the background of current research. The major advantages of this methodology are that, unlike statistical methods, it can properly generate clusters for low-frequency as well as high-frequency keywords, and that its computational costs are relatively low.

2. The automatic construction of faceted terminological feedback for interactive document retrieval (Peter G. Anick)

In this paper the author presents a linguistic method for the automatic construction of terminological feedback for use in interactive information retrieval. He underlines that query terms are likely to match many irrelevant documents. This potential mismatch between query terms and document terms is not the only problem facing the online information seeker: the information need itself may be poorly defined in the user's mind. In his opinion, of all the techniques for generating terminological feedback for interactive query refinement, faceted feedback schemes are unique in providing users not only with terminology but also with an explicit framework for reasoning along the multiple dimensions that characterize a domain. His approach to the automatic generation of terminological feedback, like a faceted classification, structures terminology along salient dimensions.
The approach is based on the observation that key domain concepts within databases and result sets tend to participate in families of semantically related lexical compounds. The result is a system that dynamically generates terminological feedback for query result sets.

3. Automatic term detection. A review of current systems (M. Teresa Cabré Castellví, Rosa Estopà Bagot and Jordi Vivaldi Palatresi)

This paper is a useful survey of a number of recently developed term extraction systems. All systems are analysed and compared against a set of technically relevant characteristics. The aim of the paper is to analyse the main terminology extraction systems in order to describe their current status and thus make it possible to improve them. The paper is divided into two main parts. The first and largest part describes various terminology extraction systems, each with a short evaluation outlining its weak and strong points. The systems described are: ANA, CLARIT, Daille-94 (ACABIT), FASTR, HEID, LEXTER, NAULLEAU, NEURAL, NODALIDA-95, TERMIGHT, TERMINO, TERMS. In the second part, the systems are classified according to a number of parameters. The contrastive analysis is based on six aspects that are relevant when designing a new system for detecting terminological units, among them linguistic resources, strategies of term delimitation, strategies of term filtering, classification of recognised terms, and obtained results.

4. Incremental extraction of domain-specific terms from online text resources (Lee-Feng Chien and Chun-Liang Chen)

This paper presents an efficient approach that can classify online text collections from the Internet dynamically and extract domain-specific terms incrementally. The approach is based on a live dictionary linked with online information systems on the Internet, in which most domain-specific terms can be incrementally extracted and adapted as the text collections change.
Such a live dictionary can reflect up-to-date information and will be very helpful in providing real-time information services. Moreover, as an unlimited number of corpus resources are available over the Internet, the proposed approach also attempts to find an automatic way to organize text collections that grow daily. The approach is based on the integration of linguistic knowledge acquisition (Zernik 1991) and IR technologies, and extends previous work (Chien 1997), which was originally designed to extract Chinese terms with correct lexical boundaries from a large but static text collection. Although this work is mainly designed for Chinese and oriental language applications, some of the techniques developed are believed to be applicable to western languages.

5. Knowledge-based terminology management in medicine (James J. Cimino)

The author presents a knowledge base in the field of medicine and discusses the addition of new terminology to its existing semantic network. In medicine, computer systems are often integrated to facilitate data sharing, but the terminologies they use are typically not integrated. There is rarely any centralized repository of the terms used in the various systems, and no widely accepted standards exist. The author describes an exception, the Medical Entities Dictionary (MED) at the New York Presbyterian Hospital (NYPH). The paper describes the design and development of the terminology and presents two case examples demonstrating some of the advantages of this approach: the addition of a new terminology of laboratory terms and the maintenance of an existing drug terminology.

6. Searching for and identifying conceptual relationships via a corpus-based approach to a Terminological Knowledge Base (CTKB). Methods and results (Anne Condamines and Josette Rebeyrolle)

This paper shows how conceptual information on terms can be found in corpora.
The authors present different methods for extracting information that can later be used to build terminological knowledge bases based on corpora (as opposed to application-based terminological knowledge bases). The results presented aim to show the feasibility of a corpus-based approach to constructing a Terminological Knowledge Base (CTKB). The authors show that it is possible, with appropriate systems and linguistic interpretation, to model a text, particularly the conceptual relationships it contains. The method is applied to a French corpus, and the results are assessed from the point of view of various applications.

7. Qualitative terminology extraction. Identifying relational adjectives (Béatrice Daille)

This paper presents the identification of French relational adjectives in corpora. First, the author defines relational adjectives (AdjR) and gives some of their linguistic properties. Then she presents the termer (term extractor) and the modifications she made to it in order to identify relational adjectives in texts. Relational adjectives and compound nouns that include a relational adjective are then quantified, and their informative status is evaluated against a thesaurus of the domain. The results corroborate the linguistic studies and their intuition about the informative character of relational adjectives. The author concludes with a discussion of the status of such adjectives and compound nouns in terminology extraction and other automatic terminology tasks.

8. General considerations on bilingual terminology extraction (Eric Gaussier)

The paper addresses general questions about bilingual terminology extraction. The author presents a standard characterization of terms based on morpho-syntactic patterns. Then, using French and English as the language pair to illustrate his discussion, he shows how the specifications of terms in two different languages affect the alignment process.
In the second part of the paper, the author reviews three methods that could be well suited to bilingual terminology alignment. The paper shows that, in order to account for the differences that exist between two languages, alignment methods must be very flexible.

9. Detection of synonymy links between terms. Experiment and results (Thierry Hamon and Adeline Nazarenko)

This paper focuses on a specific semantic relationship: synonymy. The authors evaluate a method for detecting synonymy links between terminological units contained in a specialized corpus, and report new experiments that help to clarify how this synonymy detection approach should be used. The method makes use of machine-readable dictionaries (general-language as well as specialized dictionaries) to infer synonymy relationships among the components of complex terms. The results show the complementarity and the usefulness of the different sources.

10. Extracting useful terms from parenthetical expressions by combining simple rules and statistical measures. A comparative evaluation of bigram statistics (Toru Hisamitsu and Yoshiki Niwa)

Another method of enhancing term quality is to select specific zones in texts, for example parenthetical expressions, from which terms can be acquired in correlation with other terms. This article presents such a method and provides a comprehensive study of various statistical criteria used to filter out relevant terms. Parenthetical expressions are pairs of character strings A and B related to each other by parentheses, as in A(B). These expressions contain a large number of important terms, such as organisation names, company names, and their abbreviations, and are easily extracted by pattern matching. The authors present a simple and accurate method for collecting unregistered terms from parenthetical expressions, which identifies two types of parenthetical expressions by using pattern matching, bigram statistics, and entropy.

11.
Software tools to support the construction of bilingual terminology lexicons (David A. Hull)

This paper presents a case study on the problem of bilingual lexicon extraction. The author evaluates an existing terminology alignment system by comparing its performance to that of human experts working on the same task: the construction of a bilingual lexicon from a corpus provided by the European Court of Human Rights. He then describes an automatic terminology alignment algorithm that can serve as a valuable pre-processing step in the interactive process of lexicon construction. Finally, he presents a quantitative comparison of automatic and manual alignment strategies.

12. Determining semantic equivalence of terms in information retrieval. An approach based on context distance and morphology (Hongyan Jing and Evelyne Tzoukermann)

This paper presents an approach for determining, in information retrieval, the semantic equivalence between terms in a query and terms in a document. The approach is based on context distance and morphology. Context distance is a measure used to assess the closeness of word meanings. The context distance model compares the similarity of the contexts in which a word appears, using local document information and global lexical co-occurrence information derived from the entire set of documents to be retrieved. The method integrates this context distance model with morphological analysis so that the two operations can enhance each other.

13. Term extraction using a similarity-based approach (Diana Maynard and Sophia Ananiadou)

The authors integrate syntactic and semantic information to find, rank, and disambiguate terminological units. The paper describes a new thesaurus-based similarity measure, which uses semantic information to calculate the importance of different parts of the context in relation to the term. The authors claim context is the "key to understanding a term".
This method relies on the hypothesis that terms tend to occur in groups, rather than singly or randomly; in other words, that "terms are better indicators of other terms". The results show that making use of semantic information is beneficial for both theoretical and practical aspects of terminology.

14. Extracting knowledge-rich contexts for terminography. A conceptual and methodological framework (Ingrid Meyer)

The author has developed a method for extracting information on terms from running text in order to assist terminographers in their everyday work. Terminographers can focus on knowledge-rich contexts, i.e. contexts that contain relevant information on terminological units and are signaled by specific patterns. First, the paper defines the concept of a knowledge-rich context (KRC) by providing an analysis of the two main types of KRC. Then the author describes a methodology for developing extraction tools that is based on lexical, grammatical, and paralinguistic patterns. Finally, there is a discussion of the most pressing research problems in the field.

15. Experimental evaluation of ranking and selection methods in term extraction (Hiroshi Nakagawa)

The author proposes techniques for ranking and classifying candidate terms that rely on structural and statistical properties. He claims that the relationship between complex terms and the simple terms they include must be analyzed; according to the author, this is essential in estimating the importance of candidate terms. The paper experimentally compares the performance of two term extraction methods: the C-value based method (Frantzi and Ananiadou 1996) and the Imp based method (Nakagawa 1997). The experimental evaluation was carried out on several Japanese technical manuals.

16. Corpus-based extension of a terminological semantic lexicon (A. Nazarenko, P. Zweigenbaum, B. Habert and J.
Bouaud)

This paper proposes a method for adapting a terminological semantic lexicon to meet the requirements of new domains and corpora. The tuning method described explores the corpus and gathers words that are likely to have similar meanings on the basis of their dependency relationships in the corpus. The tagging procedure is tested and parameterised on a rather small French corpus dealing with coronary diseases. The method is systematically evaluated by creating and categorizing artificial unknown words. The results show that the tagging procedure is a valuable aid in accounting for new words and new word uses in a sublanguage.

17. Term extraction for automatic abstracting (Michael P. Oakes and Chris D. Paice)

The authors offer a template-based technique for term extraction that instantiates the semantic roles of contextual words during the extraction process. The paper describes term extraction from full-length journal articles in the domain of crop husbandry for the purpose of producing abstracts automatically. Initially, candidate terms are extracted that occur in one of a number of fixed lexical environments. Candidate terms that can be lexically validated receive an enhanced weight. The grammar for lexical validation was derived from a training corpus of 50 journal articles. Selected terms may be used to generate a short abstract that indicates the subject matter of the paper.

COMMENTS

The articles collected in this book cover a topic that is of interest not only to specialists. They can shed light on the structure of the lexicon: "I don't think there can be any corpora, however large, that contain information about all of the areas of English lexicon and grammar that I want to explore ... [but] every corpus I have had the chance to examine, however small, has taught me facts I couldn't imagine finding out any other way." (Fillmore 1992:35)

REFERENCES

Chien, L. F.
(1997), PAT-Tree Based Keyword Extraction for Chinese Information Retrieval. In Proceedings of ACM SIGIR ’97, Philadelphia, USA, 50-58.

Fillmore, C. (1992), Corpus linguistics or Computer-aided armchair linguistics. In Svartvik, J. (ed.), Directions in Corpus Linguistics, Mouton de Gruyter, Berlin, 35-60.

Frantzi, K. T. and Ananiadou, S. (1996), Extracting nested collocations. In Proceedings of the 16th International Conference on Computational Linguistics, 41-46.

Nakagawa, H. (1997), Extraction of index words from manuals. In Conference Proceedings of Computer-Assisted Information Searching on Internet, 598-611.

Zernik, U. (1991), Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon, Lawrence Erlbaum Associates, Publishers.

ABOUT THE REVIEWER

Federica Da Milano is a Ph.D. student in Linguistics at the Department of Linguistics, University of Pavia, Italy. Her research interests include linguistic typology, spatial deixis, and negation.