Dissertation Information

Title: Computational Evidence for the Use of Frequency Information in Discovery of the Infant's First Lexicon
Author: Eleanor Batchelder
Institution: City University of New York, Linguistics Program
Completed in: 1997
Linguistic Subfield(s):
Subject Language(s): English

Abstract: My thesis is that statistical characteristics of language are a significant source of information to prelinguistic infants in isolating meaning-based chunks from ambient speech. The principal demonstration is a cross-linguistic corpus study, using the O-B computational algorithm to locate word boundaries in a wide variety of language corpora. Additional evidence that such a strategy is cognitively plausible is cited from research in both linguistics and psychology.

First, a review of the linguistics literature shows that the use of distributional information as one of several combined cues to segmentation of the speech stream would fill an explanatory gap.
Then the O-B algorithm, a modification of work by Olivier (1968), takes a language text with word-delimiting spaces removed and successfully re-divides it into most of the original words. The algorithm proceeds incrementally, segmenting each utterance as it is presented by using frequency information gathered from all preceding utterances, and building a lexicon as it goes along. A number of experiments on spoken and written texts in English, Spanish, and Japanese show that such distributional cues are inherent in all the texts examined and are effective in segmentation. Specific characteristics of texts which facilitate segmentation are identified:
1) More language is better.
2) Short utterances, short words, and high lexical repetition make learning faster, but are not absolutely necessary. In the corpora studied, these were more typical of spoken than written texts, and of speech to young children than speech to adults.
3) No consistent difference is found between various coded representations, including standard alphabetic spelling, phonemes, and Japanese hiragana (mora), suggesting that children can use idiosyncratic categorical representations in this early period.
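
The incremental procedure described above can be loosely illustrated with a toy sketch. This is not the O-B algorithm itself, whose details are not given in this abstract; it is a simplified frequency-based segmenter written under stated assumptions. Each unspaced utterance is split by dynamic programming, scoring known chunks by their relative frequency so far and charging new chunks a length-based penalty, and the resulting chunks are then added to the lexicon before the next utterance is seen. The names `chunk_score`, `segment`, and `learn`, the two penalty constants, and the miniature corpus are all hypothetical.

```python
import math
from collections import Counter

# Illustrative constants (assumptions, not values from the dissertation).
NOVEL_BASE = 3.0      # fixed cost of positing any new chunk
NOVEL_PER_CHAR = 1.0  # additional cost per character of a new chunk


def chunk_score(chunk, lexicon, total):
    """Log-score of one chunk: relative frequency if known, a penalty if new."""
    count = lexicon[chunk]  # Counter returns 0 for unseen chunks
    if count:
        return math.log(count / total)
    return -(NOVEL_BASE + NOVEL_PER_CHAR * len(chunk))


def segment(utterance, lexicon):
    """Return the highest-scoring split of an unspaced utterance."""
    total = sum(lexicon.values())
    n = len(utterance)
    best_score = [0.0] + [-math.inf] * n  # best score for each prefix length
    back = [0] * (n + 1)                  # start index of the last chunk
    for end in range(1, n + 1):
        for start in range(end):
            score = best_score[start] + chunk_score(
                utterance[start:end], lexicon, total
            )
            if score > best_score[end]:
                best_score[end], back[end] = score, start
    # Walk the back-pointers from the end to recover the chunks.
    words, end = [], n
    while end > 0:
        words.append(utterance[back[end]:end])
        end = back[end]
    return words[::-1]


def learn(utterances):
    """Segment utterances one at a time, updating the lexicon after each."""
    lexicon = Counter()
    segmented = []
    for utterance in utterances:
        words = segment(utterance, lexicon)
        lexicon.update(words)  # frequencies inform all later utterances
        segmented.append(words)
    return segmented, lexicon


# Toy corpus with word-delimiting spaces removed (hypothetical data).
corpus = ["doggie", "look", "lookdoggie", "nicedoggie", "lookkitty", "nicekitty"]
segmented, lexicon = learn(corpus)
for words in segmented:
    print(" ".join(words))
```

In this toy setting, the short one-word utterances at the start give the learner its first lexical entries, which it then uses to carve up longer utterances and to isolate new chunks as residue. That is only a schematic analogue of the role that short utterances and high lexical repetition play in point 2 above.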

A review of results from cognitive psychology shows that the encoding of frequency data is an implicit, automatic, innate process, and that infants and adults actually do use frequency information to segment language-like stimuli in laboratory experiments.

Finally, I compare the O-B algorithm with five other computational models of speech segmentation, concluding that the O-B algorithm is comparable in quantitative performance and superior in cognitive plausibility.