Editor for this issue: Tomoko Okuno <tomoko linguistlist.org>
Institution: Columbia University
Program: Natural Language Group, Department of Computer Science
Dissertation Status: Completed
Degree Date: 2002
Author: Min-Yen Kan
Dissertation Title: Automatic text summarization as applied to information retrieval: Using indicative and informative
Dissertation URL: http://www.comp.nus.edu.sg/~kanmy/papers/thesis.pdf
Linguistic Field: Computational Linguistics
Dissertation Director 1: Kathleen R. McKeown
Dissertation Director 2: Judith L. Klavans

Dissertation Abstract:
I identify weaknesses with the standard "ranked list of documents" information retrieval user interface by examining the search process as performed in the traditional library by professional librarians and catalogers. I distill these processes into a list of core strategies which can be effectively fulfilled by multidocument summaries which assist in both the searching and browsing process. This thesis implements such automatic text summarization components to create an alternative method of presenting search results coming from IR frameworks. As a post-processor of results coming from a search framework, Centrifuser implements these principles by producing both informative and indicative summaries that aid the user in information seeking tasks. Centrifuser uses novel techniques in analyzing source articles as a nested tree of topics, which allows the system to compare and contrast discussions of common topics across documents, and to identify rare topics. Documents similar in topic distribution are grouped together to enable faster and more accurate relevance judgment. A novel contribution in Centrifuser is the focus on generating indicative summaries. I analyze two sources of indicative summaries -- online public access catalog summaries as well as annotated bibliography entries -- by examining guidelines for writing such summaries and by cataloging types of information used in actual summary corpora.
The study reveals that metadata, such as the purpose or audience of a resource, are important inclusions in indicative summaries. By using the study's results, I derive an algorithm that enables Centrifuser to author indicative summaries that both utilize and include metadata, a novel contribution in the summarization field. To enhance the quality and the variety of summaries that are produced, I have employed novel techniques in natural language generation. The system generates summaries using a two-part method: high-level content planning deduces what semantic predicates to include and where to place them, and a low-level realization model computes the most appropriate phrasing for each predicate using both local and global context.