Academic Paper |
| Title: | An information-theoretic, vector-space-model approach to cross-language information retrieval |
| Author: | Peter A. Chew |
| Homepage: | http://www.dissertation.com/library/1121784a.htm |
| Institution: | Sandia National Laboratories |
| Author: | Brett W. Bader |
| Institution: | Sandia National Laboratories |
| Author: | Stephen Helmreich |
| Institution: | New Mexico State University |
| Author: | Ahmed Abdelali |
| Institution: | New Mexico State University |
| Author: | Stephen J. Verzi |
| Institution: | Sandia National Laboratories |
| Linguistic Field: | Computational Linguistics |
| Abstract: | In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus, essentially, on three areas: pre-processing (morphological analysis), term weighting, and alternative geometrical models to the widely used term-by-document matrix. The latter include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals, comparing them to a "standard" approach based on Latent Semantic Analysis, on a multilingual document clustering task. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations upon the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back a number of decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that the fact that our proposals yield empirical improvements is not coincidental given that they increase the theoretical transparency of VSM approaches to IR; on the contrary, they help shed light on why aspects of these approaches work as they do. |
This article appears in Natural Language Engineering Vol. 17, Issue 1, which you can read on Cambridge's site or on LINGUIST.


