Academic Paper


Title: An information-theoretic, vector-space-model approach to cross-language information retrieval
Author: Peter A. Chew
Homepage: http://www.dissertation.com/library/1121784a.htm
Institution: Sandia National Laboratories
Author: Brett W. Bader
Institution: Sandia National Laboratories
Author: Stephen Helmreich
Institution: New Mexico State University
Author: Ahmed Abdelali
Institution: New Mexico State University
Author: Stephen J. Verzi
Institution: Sandia National Laboratories
Linguistic Field: Computational Linguistics
Abstract: In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus, essentially, on three areas: pre-processing (morphological analysis), term weighting, and alternative geometrical models to the widely used term-by-document matrix. The latter include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals, comparing them to a "standard" approach based on Latent Semantic Analysis, on a multilingual document clustering task. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations upon the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back a number of decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that the fact that our proposals yield empirical improvements is not coincidental given that they increase the theoretical transparency of VSM approaches to IR; on the contrary, they help shed light on why aspects of these approaches work as they do.
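
Illustration (not from the article): the abstract names term weighting of a term-by-document matrix as one of its focus areas but does not say which weightings are used. The following is a minimal sketch of one widely used entropy-based scheme, log-entropy weighting, followed by cosine comparison of document vectors. The choice of scheme, the whitespace tokenizer, the toy documents, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def term_document_counts(docs):
    """Return (vocab, counts), where counts[term][j] is the raw
    frequency of term in docs[j]. Tokenization is naive whitespace
    splitting (an assumption; the article discusses morphological
    pre-processing instead)."""
    per_doc = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*per_doc))
    counts = {t: [c[t] for c in per_doc] for t in vocab}
    return vocab, counts


def log_entropy_weights(counts, n_docs):
    """Apply log-entropy weighting: local weight log(1 + f_ij) times
    global weight 1 + sum_j p_ij * log(p_ij) / log(n_docs), where
    p_ij = f_ij / (total frequency of term i)."""
    weighted = {}
    for term, row in counts.items():
        total = sum(row)
        entropy = sum((f / total) * math.log(f / total) for f in row if f > 0)
        global_w = 1.0 + entropy / math.log(n_docs) if n_docs > 1 else 1.0
        weighted[term] = [global_w * math.log(1 + f) for f in row]
    return weighted


def doc_vector(weighted, vocab, j):
    """Column j of the weighted term-by-document matrix."""
    return [weighted[t][j] for t in vocab]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


if __name__ == "__main__":
    docs = ["the cat sat", "the dog sat", "the cat ran"]  # toy data
    vocab, counts = term_document_counts(docs)
    w = log_entropy_weights(counts, len(docs))
    print(cosine(doc_vector(w, vocab, 0), doc_vector(w, vocab, 2)))
```

Under this scheme a term spread evenly across all documents (here "the") receives a global weight of zero, so it contributes nothing to document similarity, while a term concentrated in a single document keeps its full local weight. A Latent Semantic Analysis pipeline would typically apply a truncated singular value decomposition to such a weighted matrix before comparing documents; that step is omitted here.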


This article appears in Natural Language Engineering, Vol. 17, Issue 1, which you can read on Cambridge's site or on LINGUIST.


