LINGUIST List 15.1166

Fri Apr 9 2004

Diss: Computational Ling: Saidi: 'Linguistic...'

Editor for this issue: Takako Matsui


  1. aishasaidi, Linguistic Validation of Automatic Subtopic...

Message 1: Linguistic Validation of Automatic Subtopic...

Date: Fri, 9 Apr 2004 09:09:33 -0400 (EDT)
From: aishasaidi
Subject: Linguistic Validation of Automatic Subtopic...

Institution: Boston University
Program: Graduate Program in Applied Linguistics
Dissertation Status: Completed
Degree Date: 2004

Author: Aisha F Saidi 

Dissertation Title: Linguistic Validation of Automatic Subtopic

Linguistic Field: Computational Linguistics

Dissertation Director 1: Bruce Fraser
Dissertation Director 2: Mary Catherine O'Connor
Dissertation Director 3: Nanette Veilleux

Dissertation Abstract: 

This study evaluates a technique for automatically segmenting medical
history paragraphs by subtopic with the view that subtopic language
models could be created in order to improve speech recognition of the
medical history sections of medical reports. The technique uses a
Hidden Markov Model segmenting tool to mark boundaries of hypothesized
subtopic segments within each text. Since the tool is built on the
assumption that the input texts have a similar topic structure, it can
be used to segment medical histories, which generally have a
three-part structure.

The data for this study is a set of medical histories extracted from
2,700 orthopedic medical reports. The study is carried out in four
broad steps. First, a group of linguists independently mark the
subtopic structure of a test set of medical histories; histories upon
which there is significant agreement become the standard by which the
automatic segmenter is evaluated. Next, the automatic segmenter is
trained on a large set of histories. Then, using the statistical
information built from the training data, the automatic segmenter
marks subtopic segments in the test data. Finally, the automatic
segmentation of the test data is graded against the evaluation
standard developed by the expert subjects.
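
As an illustration of the final grading step, the following is a minimal
sketch of one way hypothesized subtopic boundaries could be scored against
a reference segmentation. The abstract does not say which metric the study
used; exact-match precision, recall, and F1 are assumed here, and the
function name and sample boundary positions are hypothetical.

```python
def boundary_scores(reference, hypothesis):
    """Exact-match precision, recall, and F1 for segment boundaries.

    Each argument is a list of sentence indices at which a new subtopic
    is judged to begin. The metric is an assumption for illustration,
    not the one reported in the dissertation.
    """
    ref, hyp = set(reference), set(hypothesis)
    hits = len(ref & hyp)  # boundaries placed at exactly the same sentence
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical three-part history: subtopics start at sentences 3 and 7
# in the reference, and at sentences 3 and 8 in the automatic output.
print(boundary_scores([3, 7], [3, 8]))
```

Exact matching is strict: a boundary that is off by one sentence counts as
a complete miss, so a more tolerant measure may suit noisier data.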

Two types of subtopic segmentation are explored in this work. The
first type, linear subtopic segmentation, assumes that each of the
three subtopics in a medical history is a continuous chunk of text
within the paragraph, uninterrupted by other subtopics. Despite the
relatively homogeneous structure of medical histories, this model is
found to be linguistically unrealistic, and the performance of the
automatic segmenter is poor compared to the evaluation standard. The
second type, non-linear subtopic segmentation, allows each sentence to
be assigned to a subtopic regardless of order. Because of the
variability of the data, the tool is unable to successfully
distinguish three subtopics in the histories. However, the automatic
segmentation of two non-linear subtopics for each medical history is
successful, with a high rate of accuracy compared to the human