Title: A Corpus-Based Delimitation of New Words: Cross-segment comparison and morphological productivity
Author: Eiji Nishimoto
Degree Awarded: City University of New York, Linguistics Program
Degree Date: 2004
Linguistic Subfield(s): Computational Linguistics; Morphology; Text/Corpus Linguistics
Director(s): Martin Chodorow; Dianne Bradley; Virginia Teller

Abstract:

The dissertation explores methods of identifying new words in a large corpus of texts, the British National Corpus (BNC) of 100 million English words, and of assessing productivity in derivational affixation. Adopting a smoothing technique, deleted estimation, from the Language Technology literature, we show that new words can be detected when segments of a corpus are cross-compared to find which word types are shared (or unshared). When each corpus segment is created so as to reflect a set of words used by a group of randomly sampled speakers, through a randomization respecting document boundaries, the cross-comparison of corpus segments can be interpreted as revealing the usage distribution of words across groups of speakers. A word shared by fewer corpus segments is more limited in its usage commonality and thus a more likely candidate for a new word. Morphological productivity, the potential of a word formation process involving an affix to form a new word, is assessed for 12 English derivational suffixes (nominal -ness, -ity, -er, -ee, -ion, -ment, and -th; verbal -ize and -ify; adjectival -ish and -ous; adverbial -ly), based on new words identified in the BNC via deleted estimation. Quantifying the usage distribution of new word types across corpus segments opens many possibilities for assessing the productivity of affixes. Cross-comparing as few as two corpus segments offers a crude yet computationally simple method of separating new words (unshared) from non-new words (shared), to yield a productivity index for a given affix. Cross-comparing as many as six corpus segments supports a graded definition of a word’s newness (words shared by fewer corpus segments being more likely new) and thereby a more detailed characterization of the productivity of affixes. The proposed methods of identifying new words and assessing productivity are shown to offer valuable insights into the issue of productivity in word formation.
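The cross-comparison idea described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names, the whitespace tokenization, the round-robin assignment of shuffled documents, and the simple "unshared ratio" are all assumptions made for illustration. The abstract does not define its productivity index, so the ratio below is only a stand-in.

```python
import random
from collections import Counter


def split_into_segments(documents, k, seed=0):
    """Shuffle whole documents and deal them round-robin into k segments.

    Keeping each document intact is one way to respect document
    boundaries (an assumption about how the randomization might work).
    """
    rng = random.Random(seed)
    shuffled = list(documents)
    rng.shuffle(shuffled)
    segments = [[] for _ in range(k)]
    for i, doc in enumerate(shuffled):
        segments[i % k].append(doc)
    return segments


def segment_types(segment):
    """Return the set of lowercased word types in one segment."""
    return {tok.lower() for doc in segment for tok in doc.split()}


def sharing_counts(segments):
    """Map each word type to the number of segments that contain it."""
    counts = Counter()
    for seg in segments:
        counts.update(segment_types(seg))
    return counts


def unshared_ratio(counts, suffix):
    """Fraction of types ending in `suffix` that occur in exactly one segment.

    Hypothetical stand-in for a productivity index, not the author's measure.
    """
    with_suffix = [w for w in counts if w.endswith(suffix)]
    if not with_suffix:
        return 0.0
    unshared = [w for w in with_suffix if counts[w] == 1]
    return len(unshared) / len(with_suffix)
```

With two segments, a type whose count is 1 is unshared and so a candidate new word. With six segments, the count from 1 to 6 gives the graded notion of newness that the abstract mentions. Plain string matching on a suffix will also pick up non-derived words (for example, "witness" for -ness), so a real study would need morphological filtering.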
Page Updated: 27-Nov-2009
