LINGUIST List, Eastern Michigan University and Wayne State University
Title: Linguistic Validation of Automatic Subtopic Segmentation
Author: Aisha Saidi
Degree Awarded: Boston University, Graduate Program in Applied Linguistics
Degree Date: 2004
Linguistic Subfield(s): Computational Linguistics
Director(s): Mary O'Connor
Bruce Fraser
Nanette Veilleux

Abstract:

This study evaluates a technique for automatically segmenting medical history paragraphs by subtopic, with the aim of building subtopic language models that could improve speech recognition of the medical history sections of medical reports. The technique uses a Hidden Markov Model segmenting tool to mark the boundaries of hypothesized subtopic segments within each text. Because the tool assumes that its input texts share a similar topic structure, it can be applied to medical histories, which generally have a three-part structure.

The data for this study is a set of medical histories extracted from 2,700 orthopedic medical reports. The study is carried out in four broad steps. First, a group of linguists independently mark the subtopic structure of a test set of medical histories; histories upon which there is significant agreement become the standard by which the automatic segmenter is evaluated. Next, the automatic segmenter is trained on a large set of histories. Then, using the statistical information built from the training data, the automatic segmenter marks subtopic segments in the test data. Finally, the automatic segmentation of the test data is graded against the evaluation standard developed by the expert subjects.
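
The first and last steps can be pictured with a small sketch. The abstract does not state how agreement was measured or how the segmenter was graded, so everything here is an assumption made for illustration: the function names, the per-sentence majority vote, the `min_agreement` threshold of 0.75, and the use of per-sentence accuracy as the grading measure are all hypothetical.

```python
from collections import Counter


def majority_labels(annotations):
    """Return the majority label for each sentence and the share of annotators who chose it.

    `annotations` holds one label sequence per annotator, all of equal length.
    """
    n_sentences = len(annotations[0])
    labels, agreement = [], []
    for i in range(n_sentences):
        votes = Counter(annotator[i] for annotator in annotations)
        label, count = votes.most_common(1)[0]
        labels.append(label)
        agreement.append(count / len(annotations))
    return labels, agreement


def build_standard(histories, min_agreement=0.75):
    """Keep only histories whose mean per-sentence agreement reaches the threshold.

    The threshold is an assumed value, not one taken from the study.
    """
    standard = {}
    for history_id, annotations in histories.items():
        labels, agreement = majority_labels(annotations)
        if sum(agreement) / len(agreement) >= min_agreement:
            standard[history_id] = labels
    return standard


def accuracy(predicted, gold):
    """Fraction of sentences whose predicted subtopic matches the standard."""
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)
```

Under these assumptions, a history on which the annotators mostly agree enters the standard with its majority labels, a history on which they disagree is left out, and the segmenter's output for each retained history is scored with `accuracy`.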

Two types of subtopic segmentation are explored in this work. The first type, linear subtopic segmentation, assumes that each of the three subtopics in a medical history is a continuous chunk of text within the paragraph, uninterrupted by other subtopics. Despite the relatively homogeneous structure of medical histories, this model is found to be linguistically unrealistic, and the performance of the automatic segmenter is poor compared to the evaluation standard. The second type, non-linear subtopic segmentation, allows each sentence to be assigned to a subtopic regardless of order. Because of the variability of the data, the tool is unable to successfully distinguish three subtopics in the histories. However, the automatic segmentation of two non-linear subtopics for each medical history is successful, with a high rate of accuracy compared to the human standard.
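
The difference between the two segmentation types can be illustrated with a minimal sketch. It is not the segmenting tool used in the study. It assumes that each sentence already has a log-score under each subtopic (for example from hypothetical subtopic language models), ignores transition probabilities, and uses invented function names. The linear version is a Viterbi-style search over a left-to-right topology in which subtopics must appear in order as contiguous, non-empty chunks; the non-linear version labels each sentence on its own.

```python
import math


def nonlinear_segmentation(scores):
    """Label each sentence independently with its best-scoring subtopic."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def linear_segmentation(scores):
    """Find the best labeling in which subtopics 0..K-1 occur in order,
    each as one contiguous, non-empty chunk.

    `scores[i][t]` is the assumed log-score of sentence i under subtopic t.
    """
    n, k = len(scores), len(scores[0])
    if n < k:
        raise ValueError("need at least one sentence per subtopic")
    best = [[-math.inf] * k for _ in range(n)]
    back = [[0] * k for _ in range(n)]
    best[0][0] = scores[0][0]  # the text must start in the first subtopic
    for i in range(1, n):
        for t in range(k):
            stay = best[i - 1][t]
            advance = best[i - 1][t - 1] if t > 0 else -math.inf
            if stay >= advance:
                prev, prev_score = t, stay
            else:
                prev, prev_score = t - 1, advance
            best[i][t] = prev_score + scores[i][t]
            back[i][t] = prev
    # Backtrack from the last subtopic at the last sentence.
    labels = [k - 1]
    for i in range(n - 1, 0, -1):
        labels.append(back[i][labels[-1]])
    labels.reverse()
    return labels


# Toy scores: the third sentence returns to the first subtopic.
toy_scores = [
    [-1.0, -3.0, -4.0],
    [-3.0, -1.0, -4.0],
    [-1.0, -3.0, -4.0],
    [-4.0, -3.0, -1.0],
]
print(nonlinear_segmentation(toy_scores))  # [0, 1, 0, 2]
print(linear_segmentation(toy_scores))     # [0, 1, 1, 2]
```

In the toy input the third sentence fits the first subtopic best, but the linear model cannot return to a subtopic it has already left, so it forces that sentence into the second chunk. This is the kind of mismatch the abstract describes when it calls the linear model linguistically unrealistic.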
Page Updated: 27-Nov-2009
