Title: Looking for Better Chinese Indexes: A corpus-based approach to base NP detection and indexing
Author: Hongbiao Chen
Degree Awarded: Guangdong University of Foreign Studies, Faculty of English Language and Culture
Degree Date: 2001
Linguistic Subfield(s): Computational Linguistics
Subject Language(s): Chinese, Mandarin
Director(s): Chunyan Ning

Abstract:
Previous studies have shown that using phrases to represent a document's content can enhance the effectiveness of an automatic information retrieval (IR) system. However, among the few Chinese IR systems that have adopted a phrase indexing strategy, most do not have a real automatic phrase finder. They merely extract phrases by means of maximum matching against a pre-compiled dictionary. Moreover, the phrases extracted by most current phrase extraction methods are structurally too complicated for indexing.
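For readers unfamiliar with dictionary-based extraction, the following is a minimal sketch of forward maximum matching, assuming a greedy left-to-right scan. It is not taken from the thesis: the function name, the `max_len` parameter, and the toy dictionary are all hypothetical and serve only to illustrate the general technique.

```python
def forward_maximum_match(text, dictionary, max_len=4):
    """Greedy left-to-right matching against a dictionary.

    At each position the longest dictionary entry (up to max_len
    characters) is taken; if nothing matches, a single character
    is emitted so the scan always advances.
    """
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                segments.append(candidate)
                i += size
                break
    return segments


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"信息", "检索", "信息检索", "系统"}
print(forward_maximum_match("信息检索系统", toy_dictionary))
```

A matcher of this kind can only return entries that are already in the dictionary, which is the limitation the abstract points to when it says such systems lack a real automatic phrase finder.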
This study proposes the use of the Chinese base noun phrase (baseNP) as a complex indexing unit. A relatively effective and easy-to-implement baseNP extraction method and a baseNP indexing method have been designed and tested.
The Chinese baseNP is defined as a combination of conceptual words. A corpus-based approach is adopted to acquire the probabilities of words, tags and tag sequences in constituting baseNPs. Four detection algorithms have been designed and tested. The results show that 90.21% of the word combinations that contain good baseNPs can be extracted with the help of word probability information alone. When combined with template checking, the hybrid method achieves a precision of 60.43% and a recall of 58.93%.
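As a reminder of how precision and recall are usually computed for an extraction task, here is a small sketch. It assumes exact-match scoring of extracted phrases against a gold-standard set; the abstract does not say how matches were counted in the thesis, and the function name and example phrases below are hypothetical.

```python
def precision_recall(extracted, gold):
    """Exact-match precision and recall of extracted phrases.

    precision = correct / number of extracted phrases
    recall    = correct / number of gold-standard phrases
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical phrases, for illustration only.
extracted_phrases = {"信息检索", "检索系统", "自动短语"}
gold_phrases = {"信息检索", "短语索引"}
precision, recall = precision_recall(extracted_phrases, gold_phrases)
print(f"precision={precision:.2%} recall={recall:.2%}")
```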
Two kinds of index databases have been generated: one containing single words only (i.e., the single word indexing method) and the other containing single words supplemented with baseNPs (i.e., the baseNP indexing method).
Retrieval experiments show that the baseNP indexing method increases retrieval precision by an average of 23.10% compared with the single word indexing method. It is concluded that the baseNP is a kind of complex index capable of enhancing Chinese IR system performance and that the baseNP indexing method is more effective than the single word indexing method.
The Chinese Experimental IR System (CEIRS 1.0) was developed and used as the environment for the retrieval experiments, with the Vector Space Model (VSM) adopted as the retrieval model.
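To make the two indexing methods concrete, the sketch below scores a query against a document under a vector space model using cosine similarity. It is an illustration under stated assumptions, not a description of CEIRS 1.0: raw term frequency is assumed as the weighting (the abstract does not specify one), and the function name and English placeholder terms are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between two raw term-frequency vectors."""
    q = Counter(query_terms)
    d = Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)  # Counter returns 0 for missing terms
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


# Single word indexing: index terms are individual words only.
doc_single = ["information", "retrieval", "system"]
# baseNP indexing: the same words supplemented with a phrase term.
doc_basenp = doc_single + ["information retrieval"]

query = ["information", "retrieval", "information retrieval"]
print(cosine_similarity(query, doc_single))
print(cosine_similarity(query, doc_basenp))
```

In this setup a phrase is simply one more dimension of the vector, so a document that contains the phrase as a unit gains an extra matching term over one that only contains its component words.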
Page Updated: 28-Nov-2009
