Dissertation Information

Title: Case Studies on Cross-Language Information Retrieval and Bilingual Terminology Acquisition from Comparable Corpora
Author: Fatiha Sadat
Institution: Nara Institute of Science and Technology, Doctor courses
Completed in: 2003
Linguistic Subfield(s): Translation
Subject Language(s): English
Director(s): Shunsuke Uemura
Yuji Matsumoto
Eric Gaussier
Masatoshi Yoshikawa

Abstract: The rapid growth of the Internet, in both size and use, has accelerated the exchange of information and greatly increased the availability of on-line texts and resources. Expanded international collaboration, the increasing availability of electronic foreign-language texts, the growing number of non-English-speaking users, and the lack of a common language of discourse compel us to develop Cross-Language Information Retrieval (CLIR) tools capable of bridging the language barrier. CLIR bridges this gap by enabling a person to search in one language and retrieve documents in other languages.

There are several goals for the research described herein. The first is to gain a clear understanding of the problems associated with the CLIR task and to develop techniques for addressing them. Empirical work shows that the main hurdles are ambiguity during translation, the lack of lexical resources, and words missing from the bilingual dictionary.

The objective of this research is to provide some solutions to these problems. We concentrate on the following techniques:

1. Disambiguation techniques for short and long queries. We show how statistical techniques can significantly reduce the ambiguity that arises from dictionary-based translation and degrades CLIR performance. For long queries, we propose disambiguation techniques based on statistical measures estimated from large corpora in both the source and target languages. Evaluations using the TREC test collection for the French-English language pair show that ranking source terms and then disambiguating the target translation alternatives is very effective in CLIR.

2. Combining multiple resources for query expansion, through relevance feedback, domain-based feedback and thesauri, before and after translation, for effective and efficient retrieval across languages. Domain-based feedback uses hierarchical category schemes and pseudo-relevance feedback to extract domain keywords and expand the original queries. Evaluations of query expansion using the TREC test collection for the French-English language pair show that a suitable weighting scheme for selecting the best expansion terms is necessary. Combining thesauri with domain-based feedback also proved effective in CLIR.

3. Bilingual terminology acquisition from comparable corpora, which enriches bilingual lexicons and helps cross the language barrier in CLIR. We propose an approach that combines statistics-based and linguistics-based pruning techniques for bilingual terminology acquisition and disambiguation from comparable corpora. Combining this approach with bilingual dictionaries and with transliteration for the special phonetic alphabet of Japanese proved effective in CLIR. Evaluations using the NTCIR test collection demonstrate that the proposed hybrid translation model yields better translations and that effective retrieval can be achieved across the Japanese-English language pair.
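
As a rough illustration of the disambiguation idea in item 1, the Python sketch below picks one translation per query term so that the chosen translations co-occur as strongly as possible in a target-language corpus. It is not the method evaluated in the dissertation: the choice of pointwise mutual information as the statistical measure, the function names, the toy dictionary entries and the four-sentence corpus are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations, product


def cooccurrence_counts(sentences):
    """Count terms and unordered term pairs that share a sentence."""
    term_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        terms = sorted(set(sentence.lower().split()))
        term_counts.update(terms)
        pair_counts.update(frozenset(p) for p in combinations(terms, 2))
    return term_counts, pair_counts


def pmi(a, b, term_counts, pair_counts, n_sentences):
    """Pointwise mutual information; unseen pairs score 0.0 in this sketch."""
    joint = pair_counts[frozenset((a, b))]
    if a == b or joint == 0:
        return 0.0
    return math.log(joint * n_sentences / (term_counts[a] * term_counts[b]))


def disambiguate(candidates, sentences):
    """Choose one translation per source term, maximising summed pairwise PMI."""
    term_counts, pair_counts = cooccurrence_counts(sentences)
    n = len(sentences)

    def score(combo):
        return sum(pmi(a, b, term_counts, pair_counts, n)
                   for a, b in combinations(combo, 2))

    # Exhaustive search is fine for short queries; long queries need pruning.
    return max(product(*candidates), key=score)


# Hypothetical dictionary output for the French query "avocat tribunal".
candidates = [["avocado", "lawyer"], ["tribunal", "court"]]
# Hypothetical target-language (English) corpus.
corpus = [
    "the lawyer spoke in court",
    "the court heard the lawyer",
    "avocado salad recipe",
    "the tribunal met today",
]
best = disambiguate(candidates, corpus)
print(best)  # ('lawyer', 'court')
```

The exhaustive search over all combinations grows exponentially with query length, which is one reason a real system would rank source terms first and resolve them in order rather than score every combination.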

Finally, a case study on the specialized medical domain for thesauri enrichment and CLIR is briefly introduced.
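
The statistical side of terminology acquisition from comparable corpora (item 3) is often approached by comparing context vectors across languages. The Python sketch below shows that general idea only, under stated assumptions: the window size, the cosine measure, the function names, the seed dictionary and the toy French and English sentences are hypothetical, and the linguistics-based pruning and transliteration steps described in the abstract are not modelled.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def translate_vector(vector, seed_dict):
    """Project a source-language context vector through a seed dictionary."""
    projected = Counter()
    for word, count in vector.items():
        for target in seed_dict.get(word, []):
            projected[target] += count
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def rank_translations(source_word, src_vectors, tgt_vectors, seed_dict):
    """Rank target-language words by context similarity to `source_word`."""
    projected = translate_vector(src_vectors[source_word], seed_dict)
    scores = {t: cosine(projected, vec) for t, vec in tgt_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical comparable corpora: similar topics, not sentence-aligned.
french = ["le médecin soigne le malade", "le médecin examine le malade"]
english = [
    "the doctor treats the patient",
    "the doctor examines the patient",
    "the river floods the town",
]
# Hypothetical seed dictionary; "médecin" is deliberately missing from it.
seed = {"le": ["the"], "soigne": ["treats"],
        "examine": ["examines"], "malade": ["patient"]}

ranking = rank_translations("médecin", context_vectors(french),
                            context_vectors(english), seed)
print(ranking[0])  # doctor
```

In this toy data the function word "the" dominates every vector, so a practical system would filter stop words or weight the counts before comparing vectors.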