|Title:||Case Studies on Cross-Language Information Retrieval and Bilingual Terminology Acquisition from Comparable Corpora|
|Author:||Fatiha Sadat|
|Institution:||Nara Institute of Science and Technology, Doctor courses|
|Abstract:||The rapid growth in the size and use of the Internet has accelerated the exchange of information and greatly increased the availability of on-line texts and resources. Expanded international collaboration, the increasing availability of electronic foreign-language texts, the growing number of non-English-speaking users, and the lack of a common language of discourse compel us to develop Cross-Language Information Retrieval (CLIR) tools capable of bridging the language barrier. CLIR bridges this gap by enabling a person to search in one language and retrieve documents written in other languages.
There are several goals for the research described herein. The first is to gain a clear understanding of the problems associated with the CLIR task and to develop techniques for addressing them. Empirical work shows that the main hurdles are ambiguity, the lack of lexical resources, and words missing from the bilingual dictionary during translation.
The objective of this research is to provide some solutions to these problems. We concentrate on the following techniques:
1. Disambiguation techniques for short and long queries. We show how statistical techniques can significantly reduce the effect of the ambiguity that arises from dictionary-based translation and compounds the difficulty of CLIR. Disambiguation techniques based on statistical measures, estimated from large corpora in both the source and target languages, are proposed for long queries. Evaluations using the TREC test collection for the French-English language pair show that ranking the source terms and then disambiguating the target translation alternatives is very effective in CLIR.
2. Combining multiple resources for query expansion (relevance feedback, domain-based feedback, and thesauri), applied before and after translation, for effective and efficient retrieval across languages. Domain-based feedback relies on hierarchical category schemes and pseudo-relevance feedback to extract domain keywords and expand the original queries. Evaluations of query expansion using the TREC test collection for the French-English language pair show that a suitable weighting scheme for selecting the best expansion terms is necessary. Combining thesauri with domain-based feedback also proved effective in CLIR.
3. Bilingual terminology acquisition from comparable corpora, which enriches bilingual lexicons and helps cross the language barrier in CLIR. We propose an approach that combines statistics-based and linguistics-based pruning techniques for bilingual terminology acquisition and disambiguation from comparable corpora. Combining this approach with bilingual dictionaries and with transliteration for the special phonetic alphabet of Japanese proved effective in CLIR. Evaluations using the NTCIR test collection demonstrate that the proposed hybrid translation model yields better translations and that effective retrieval can be achieved for the Japanese-English language pair.
Finally, a case study on the specialized medical domain for thesauri enrichment and CLIR is briefly introduced.
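To make the first technique more concrete, the following is a minimal sketch of dictionary-based query translation with statistical disambiguation. It is not the dissertation's actual method: the abstract does not name the statistical measure, so pointwise mutual information over document-level co-occurrence is assumed here, the source-term ranking step is omitted, and the toy corpus, the toy French-English dictionary, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Toy target-language (English) corpus; each inner list is one document.
# Hypothetical data, standing in for a large monolingual corpus.
CORPUS = [
    ["bank", "loan", "interest", "rate"],
    ["bank", "interest", "account"],
    ["river", "shore", "water"],
    ["interest", "rate", "loan"],
    ["shore", "river", "boat"],
]

# Toy French-to-English dictionary with ambiguous entries (hypothetical).
DICTIONARY = {
    "banque": ["bank", "shore"],
    "intérêt": ["interest", "stake"],
    "taux": ["rate"],
}


def cooccurrence_stats(corpus):
    """Count document frequencies of words and of unordered word pairs."""
    word_df, pair_df = Counter(), Counter()
    for doc in corpus:
        words = sorted(set(doc))
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    return word_df, pair_df


def pmi(a, b, word_df, pair_df, n_docs):
    """Pointwise mutual information of a and b.

    Simplification (an assumption of this sketch): pairs that never
    co-occur score 0.0 instead of negative infinity.
    """
    joint = pair_df[tuple(sorted((a, b)))]
    if joint == 0:
        return 0.0
    return math.log2(joint * n_docs / (word_df[a] * word_df[b]))


def disambiguate(query, dictionary, corpus):
    """Pick, for each source term, the translation candidate that is most
    strongly associated with the candidates of the other query terms."""
    word_df, pair_df = cooccurrence_stats(corpus)
    n_docs = len(corpus)
    chosen = {}
    for term in query:
        candidates = dictionary.get(term, [])
        if not candidates:
            continue  # term missing from the dictionary: left untranslated
        context = [
            cand
            for other in query
            if other != term
            for cand in dictionary.get(other, [])
        ]
        chosen[term] = max(
            candidates,
            key=lambda c: sum(pmi(c, x, word_df, pair_df, n_docs) for x in context),
        )
    return chosen


if __name__ == "__main__":
    print(disambiguate(["banque", "intérêt", "taux"], DICTIONARY, CORPUS))
```

On this toy data the financial senses ("bank", "interest") are selected because they co-occur with one another and with "rate", whereas "shore" and "stake" never appear alongside the other candidates. A term absent from the dictionary is simply skipped, which is the gap that terminology acquisition from comparable corpora is meant to fill.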