LINGUIST List 15.1419

Wed May 5 2004

Diss: Computational Ling: Kan: 'Automatic text...'

Editor for this issue: Tomoko Okuno


  1. kanmy, Automatic text summarization as applied to information retrieval...

Message 1: Automatic text summarization as applied to information retrieval...

Date: Tue, 4 May 2004 21:28:28 -0400 (EDT)
From: kanmy
Subject: Automatic text summarization as applied to information retrieval...

Institution: Columbia University
Program: Natural Language Group, Department of Computer Science
Dissertation Status: Completed
Degree Date: 2002

Author: Min-Yen Kan 

Dissertation Title: 
Automatic text summarization as applied to information retrieval:
Using indicative and informative

Dissertation URL:

Linguistic Field: Computational Linguistics 

Dissertation Director 1: Kathleen R. McKeown
Dissertation Director 2: Judith L. Klavans

Dissertation Abstract: 

I identify weaknesses in the standard "ranked list of documents"
information retrieval user interface by examining the search process
as performed in the traditional library by professional librarians and
catalogers. I distill these processes into a list of core strategies
that can be effectively fulfilled by multidocument summaries, which
assist in both searching and browsing. This thesis
implements such automatic text summarization components to create an
alternative method of presenting search results coming from IR

As a post-processor of results coming from a search framework,
Centrifuser implements these principles by producing both informative
and indicative summaries that aid the user in information seeking
tasks. Centrifuser uses novel techniques in analyzing source articles
as a nested tree of topics, which allows the system to compare and
contrast discussions of common topics across documents, and to
identify rare topics. Documents similar in topic distribution are
grouped together to enable faster and more accurate relevance

A novel contribution in Centrifuser is the focus on generating
indicative summaries. I analyze two sources of indicative summaries
-- online public access catalog summaries as well as annotated
bibliography entries -- by examining guidelines for writing such
summaries and by cataloging types of information used in actual
summary corpora. The study reveals that metadata, such as the purpose
or audience of a resource, are important inclusions in indicative
summaries. By using the study's results, I derive an algorithm that
enables Centrifuser to author indicative summaries that both utilize
and include metadata, a novel contribution in the summarization field.

To enhance the quality and the variety of summaries that are produced,
I have employed novel techniques in natural language generation. The
system analyzes documents using a two-part method: high-level content
planning deduces what semantic predicates to include and where to
place them, and a low-level realization model computes the most
appropriate phrasing for each predicate using both local and global
context.
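
The two-part method described above can be illustrated with a minimal
sketch. This is not Centrifuser's implementation: the Predicate class,
the TEMPLATES table, the priority field and the cost heuristic are all
hypothetical names and choices made up for illustration. The sketch
assumes that "local context" means the previous sentence and that
"global context" means phrasings already used in earlier summaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """Hypothetical semantic predicate; not taken from the dissertation."""
    name: str      # predicate label, e.g. "audience"
    value: str     # content to express
    priority: int  # lower number = placed earlier in the summary


# Hypothetical phrasing options for each predicate type.
TEMPLATES = {
    "topic": ["This document discusses {v}.", "It covers {v}."],
    "audience": ["It is written for {v}.", "The intended audience is {v}."],
    "purpose": ["Its purpose is {v}.", "It aims at {v}."],
}


def plan_content(candidates, max_predicates=3):
    """High-level step: decide which predicates to include and their order."""
    known = [p for p in candidates if p.name in TEMPLATES]
    return sorted(known, key=lambda p: p.priority)[:max_predicates]


def realize(plan, used_templates=None):
    """Low-level step: choose a phrasing for each planned predicate.

    Local context: avoid opening with the same word as the previous sentence.
    Global context: avoid templates recorded in used_templates, a set that
    the caller may share across several summaries.
    """
    used = set() if used_templates is None else used_templates
    sentences = []
    for pred in plan:
        prev_first = sentences[-1].split()[0] if sentences else None

        def cost(template):
            # One penalty point per violated preference (booleans sum to int).
            return (template in used) + (template.split()[0] == prev_first)

        best = min(TEMPLATES[pred.name], key=cost)
        used.add(best)
        sentences.append(best.format(v=pred.value))
    return " ".join(sentences)


plan = plan_content([
    Predicate("audience", "librarians", 2),
    Predicate("topic", "text summarization", 1),
])
shared = set()
first_summary = realize(plan, shared)
second_summary = realize(plan, shared)
```

Calling realize twice with the same shared set yields two differently
worded summaries of the same plan, which is one simple way to obtain
the variety that the abstract mentions.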