Date: Wed, 19 Nov 2003 21:59:49 -0500 (EST)
From: fatia-s <>
Subject: Case Studies on Cross-Language Information Retrieval...

Institution: Nara Institute of Science and Technology NAIST
Program: Department of Information Science
Dissertation Status: Completed
Degree Date: 2003

Author: Fatiha Sadat

Dissertation Title: Case Studies on Cross-Language Information
Retrieval and Bilingual Terminology Acquisition from Comparable

Dissertation URL:

Linguistic Field: Translation 

Subject Language: Japanese (code: JPN); German, Standard (code: GER);
French (code: FRN); English (code: ENG)

Dissertation Director 1: Shunsuke Uemura
Dissertation Director 2: Yuji Matsumoto
Dissertation Director 3: Eric Gaussier
Dissertation Director 4: Pierre Isabelle

Dissertation Abstract: 

The rapid expansion in the size and use of the Internet has
facilitated the exchange of information and has led to a large
increase in the availability of on-line texts and resources. Expanded
international collaboration, the increase in the availability of
electronic foreign language texts, the growing number of non-English
speaking users, and the lack of a common language of discourse compel
us to develop Cross-Language Information Retrieval (CLIR) tools
capable of bridging the language barrier. CLIR bridges this gap by
enabling a person to search in one language and retrieve documents
across different languages.

There are several goals for the research described herein. The first
is to gain a clear understanding of the problems associated with the
CLIR task and to develop techniques for addressing them. Empirical
work shows that ambiguity, lack of lexical resources, and missing words
in the bilingual dictionary during translation are the main hurdles.

The objective of this research is to provide some solutions to these
problems. We concentrate on the following techniques:
1. Disambiguation techniques for short and long queries. We show how
statistical techniques can be used to significantly reduce the effect
of the ambiguity that arises from dictionary-based translation and
compounds the difficulty of CLIR. Disambiguation techniques based on
statistical measures, which are estimated using large corpora in both
source and target languages, are proposed for long
queries. Evaluations using the TREC test collection for the
French-English language pair show that ranking source terms and then
disambiguating the target translation alternatives is very effective
in CLIR.

2. Combining multiple resources for query expansion, through relevance
feedback, domain-based feedback, and thesauri, before and after
translation, for effective and efficient retrieval across
languages. Domain-based feedback relies on hierarchical category
schemes and pseudo-relevance feedback to extract domain keywords and
expand the original queries. Evaluations of query expansion using the
TREC test collection for the French-English language pair show that a
suitable weighting scheme for selecting the best expansion terms is
necessary. Combining thesauri and domain-based feedback also proved
effective in CLIR.

3. Bilingual terminology acquisition from comparable corpora, which
will enrich bilingual lexicons and help cross the language barrier in
CLIR. We propose an approach that combines statistics-based and
linguistics-based pruning techniques for bilingual terminology
acquisition and disambiguation from comparable corpora. Combining
this approach with bilingual dictionaries and transliteration for the
special phonetic alphabet of Japanese proved effective in
CLIR. Evaluations using the NTCIR test collection demonstrate that the
proposed hybrid translation model yields better translations and
improves retrieval effectiveness for the Japanese-English language pair.
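
To make the disambiguation step in item 1 concrete, the following is a
minimal sketch, not the method used in the dissertation. The abstract
does not name the statistical measure, so smoothed pointwise mutual
information (PMI) is assumed here, and the dictionary entries and
corpus counts are invented toy data. The sketch scores every
combination of translation candidates by pairwise association in the
target language and keeps the best one.

```python
import math
from itertools import combinations, product

# Hypothetical bilingual dictionary: French query term -> English candidates.
CANDIDATES = {
    "avocat": ["lawyer", "avocado"],
    "tribunal": ["court", "tribunal"],
}

# Hypothetical target-corpus statistics (toy numbers, not real data):
# document frequency per term and co-occurrence counts per term pair.
N_DOCS = 1000
DOC_FREQ = {"lawyer": 50, "avocado": 20, "court": 60, "tribunal": 10}
CO_FREQ = {
    frozenset(["lawyer", "court"]): 30,
    frozenset(["lawyer", "tribunal"]): 2,
    frozenset(["avocado", "court"]): 1,
}


def pmi(a, b):
    """Smoothed PMI of two target terms; the measure itself is an assumption."""
    joint = CO_FREQ.get(frozenset([a, b]), 0) + 0.5  # additive smoothing
    return math.log(joint * N_DOCS / (DOC_FREQ[a] * DOC_FREQ[b]))


def disambiguate(source_terms):
    """Return one translation per source term, maximizing total pairwise PMI."""
    options = [CANDIDATES[term] for term in source_terms]

    def score(choice):
        return sum(pmi(a, b) for a, b in combinations(choice, 2))

    return list(max(product(*options), key=score))


print(disambiguate(["avocat", "tribunal"]))
```

With these toy counts the sketch selects "lawyer" and "court" for the
query "avocat tribunal". The exhaustive search over all combinations
is only practical for short queries; ranking source terms first, as
described in item 1, is one way to limit the search for long queries.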

Finally, a case study on the specialized medical domain for thesauri
enrichment and CLIR is briefly introduced.
