LINGUIST List
Title: Looking for Better Chinese Indexes: A corpus-based approach to base NP detection and indexing
Author: Hongbiao Chen
Degree Awarded: Guangdong University of Foreign Studies, Faculty of English Language and Culture
Degree Date: 2001
Linguistic Subfield(s): Computational Linguistics
Subject Language(s): Chinese, Mandarin
Director(s): Chunyan Ning

Abstract:

Previous studies have shown that using phrases to represent a document's content can enhance the effectiveness of an automatic information retrieval (IR) system. However, among the few Chinese IR systems that have adopted a phrase indexing strategy, most do not have a real automatic phrase finder. They merely extract phrases by means of maximum matching against a pre-compiled dictionary. Moreover, the phrases extracted by most current phrase extraction methods have structures that are too complicated for indexing.
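The dictionary-based extraction mentioned above can be illustrated with a minimal sketch of forward maximum matching. This is a generic illustration, not the code of any system discussed in the dissertation; the function name `forward_maximum_match`, the `max_len` parameter, and the toy dictionary are all assumptions made for this example.

```python
def forward_maximum_match(text, dictionary, max_len=4):
    """Greedy left-to-right longest match against a dictionary.

    Characters that start no dictionary entry are emitted one at a time.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical pre-compiled dictionary (illustrative entries only).
toy_dictionary = {"信息", "检索", "信息检索", "系统"}
tokens = forward_maximum_match("信息检索系统", toy_dictionary)
```

Such a matcher can only return phrases that are already listed in the dictionary, which is the limitation the abstract points to.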

This study proposes the use of the Chinese base noun phrase (baseNP) as a complex indexing unit. A relatively effective and easy-to-implement baseNP extraction method and a baseNP indexing method have been designed and tested.

The Chinese baseNP is defined as a combination of conceptual words. A corpus-based approach is adopted to acquire the probabilities with which words, tags and tag sequences constitute baseNPs. Four detection algorithms have been designed and tested. The results show that 90.21% of the word combinations that contain good baseNPs can be extracted with the help of word probability information alone. By adding template checking, the hybrid method achieves a precision of 60.43% and a recall of 58.93%.

Two kinds of index databases have been generated: one containing single words only (the single word indexing method) and the other containing single words supplemented with baseNPs (the baseNP indexing method).
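The difference between the two index databases can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the documents are assumed to be already segmented into words with baseNPs already extracted, and the names `build_index`, `use_base_nps`, and the sample data are hypothetical.

```python
from collections import defaultdict


def build_index(docs, use_base_nps=False):
    """Build an inverted index mapping each term to the set of document ids.

    With use_base_nps=False only single words are indexed; with
    use_base_nps=True the extracted baseNPs are added as extra terms.
    """
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        terms = list(doc["words"])
        if use_base_nps:
            terms += doc["base_nps"]
        for term in terms:
            index[term].add(doc_id)
    return index


# Hypothetical pre-processed documents (illustrative data only).
docs = {
    1: {"words": ["信息", "检索", "系统"], "base_nps": ["信息检索系统"]},
    2: {"words": ["信息", "安全"], "base_nps": ["信息安全"]},
}
word_index = build_index(docs)
base_np_index = build_index(docs, use_base_nps=True)
```

In this toy data the single word 信息 matches both documents, while the baseNP 信息检索系统 matches only the first, which is the kind of added specificity a phrase index is meant to provide.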

Retrieval experiments show that the baseNP indexing method increases retrieval precision by an average of 23.10% compared with the single word indexing method. It is concluded that the baseNP is a kind of complex index capable of enhancing the performance of Chinese IR systems, and that the baseNP indexing method is more effective than the single word indexing method.

The Chinese Experimental IR System (CEIRS 1.0) was developed and used as the environment for the retrieval experiments. The Vector Space Model (VSM) was adopted as the retrieval model.
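For readers unfamiliar with the Vector Space Model, the sketch below shows the usual idea of ranking by cosine similarity between term vectors. It assumes raw term-frequency weights; the abstract does not state which weighting scheme CEIRS 1.0 uses, and the function name `cosine_similarity` is an assumption for this example.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between two raw term-frequency vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


# Terms may be single words or baseNPs, depending on the index in use.
score = cosine_similarity(["信息", "检索"], ["信息", "检索", "系统"])
```

Documents are then ranked by this score for each query, with higher values indicating a closer match.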
Page Updated: 28-Nov-2009
