Title: Maschineller Erwerb lexikalischen Wissens aus kleinen und verrauschten Textkorpora (Machine Learning of Lexical Knowledge from Sparse and Noisy Text Corpora)
Author: Rene Schneider
Degree Awarded: Heinrich-Heine-Universität Düsseldorf, Department of General Linguistics
Degree Date: 1998
Linguistic Subfield(s): Computational Linguistics
Director(s): James Kilbury
H. Geisler

Abstract:

A major reason for the development of information extraction (IE) systems was the fact that quite often, especially in industrial applications, deep text understanding can be abandoned in favor of the robust extraction and interpretation of relevant text segments. Nevertheless, IE systems still need knowledge bases that vary from application to application and are generally handcrafted, while the use of unsupervised learning algorithms is hampered by the very small sets of training data available in industrial applications. Furthermore, the majority of IE systems are restricted to the analysis of electronic text input, and those working with paper-bound information have to cope with a considerable amount of 'noisy' output produced by optical character recognition. Additionally, an unexpectedly high number of mistakes arise during text production, consisting of typos and orthographical and grammatical mistakes.

These problems, i.e. the rather small and noisy text corpora that have to be analysed by IE systems, marked the starting point for the dissertation and led to the implementation of a learning algorithm for the automatic acquisition of lexical knowledge. The algorithm starts with the empirical analysis of small text corpora after they have been scanned. A quantitative comparison of correct and noisy word forms shows that the number of correct word forms is significantly higher than the number of noisy word forms, and that the number of correct stems is in turn higher than the number of their corresponding variants. Combining the rank-frequency list with the Levenshtein distances between the different word forms allows the generation of a core lexicon, with each entry covering the correct stem together with its correct and noisy variants. Syntagmatic knowledge is acquired through the automatic weighting of the frequency lists, which assigns one definite rank to each stem, with the rank estimating the significance of a word in a given domain. These ranks lead to the formulation of a collocation measure whose value represents the tendency of two word forms to be used together. These values make it possible to link the different lexical entries, with each link representing the domain-specific relationship between two words.
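
The core-lexicon step described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the greedy grouping strategy, and the `max_distance` threshold are all assumptions made for illustration, and the sketch works on whole word forms rather than stems. It only shows the general idea of walking a rank-frequency list and attaching rarer forms to a more frequent form that lies within a small Levenshtein distance.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def build_core_lexicon(tokens, max_distance=1):
    """Hypothetical helper: group word forms under more frequent forms.

    Forms are visited in rank-frequency order (most frequent first, ties
    broken alphabetically). A form becomes a variant of the first existing
    entry within max_distance edits; otherwise it opens a new entry.
    The threshold is an assumption, not a value taken from the source.
    """
    counts = Counter(tokens)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    lexicon = {}
    for form in ranked:
        for head in lexicon:  # dicts keep insertion order, i.e. rank order
            if levenshtein(form, head) <= max_distance:
                lexicon[head].append(form)
                break
        else:
            lexicon[form] = []
    return lexicon


# Toy corpus with OCR-style noise (invented for illustration only).
tokens = ["invoice"] * 5 + ["amount"] * 3 + ["inv0ice", "invoicc", "amounl"]
lexicon = build_core_lexicon(tokens)
```

In this toy run, the frequent forms `invoice` and `amount` become entries and the single-occurrence noisy forms are attached to them as variants. The greedy assignment relies on the observation in the abstract that correct forms outnumber noisy ones; a real system would also need stemming and a more careful choice of threshold.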

With the help of the lexicon bootstrapped by the algorithm, IE systems that were previously restricted to well-formed input can now extract significant text features in combination with robust error correction and lemmatisation.
Page Updated: 26-Nov-2009
