|Title:||Computational Evidence for the Use of Frequency Information in Discovery of the Infant's First Lexicon|
|Author:||Eleanor Batchelder|
|Institution:||City University of New York, Linguistics Program|
|Abstract:||My thesis is that statistical characteristics of language are a significant source of information to prelinguistic infants in isolating meaning-based chunks from ambient speech. The principal demonstration is a cross-linguistic corpus study, using the O-B computational algorithm to locate word boundaries in a wide variety of language corpora. Additional evidence that such a strategy is cognitively plausible is cited from research in both linguistics and psychology.
First, a review of the linguistics literature shows that the use of distributional information as one of several combined cues to segmentation of the speech stream would fill an explanatory gap.
Then the O-B algorithm, a modification of work by Olivier (1968), takes a language text with word-delimiting spaces removed and successfully re-divides it into most of the original words. The algorithm proceeds incrementally, segmenting each utterance as it is presented by using frequency information gathered from all preceding utterances, and building a lexicon as it goes along. A number of experiments on spoken and written texts in English, Spanish, and Japanese show that such distributional cues are inherent in all the texts examined and are effective in segmentation. Specific characteristics of texts which facilitate segmentation are identified:
1) More language is better.
2) Short utterances, short words, and high lexical repetition make learning faster, but are not absolutely necessary. In the corpora studied, these were more typical of spoken than written texts, and of speech to young children than speech to adults.
3) No consistent difference is found between various coded representations, including standard alphabetic spelling, phonemes, and Japanese hiragana (mora), suggesting that children can use idiosyncratic categorical representations in this early period.
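The incremental procedure described above can be illustrated with a small Python sketch. It is not the O-B algorithm itself, whose scoring details are not given in this abstract. It assumes a simple scheme: each candidate chunk is scored by the log of its relative frequency in the lexicon so far, unseen chunks pay a per-character penalty plus a fixed cost (`new_word_cost`, a hypothetical parameter), and the best-scoring division is found by dynamic programming. The names `segment` and `learn` are likewise illustrative.

```python
import math
from collections import Counter


def segment(text, lexicon, max_len=8, new_word_cost=1.0):
    """Divide an unspaced string into the best-scoring sequence of chunks.

    Scoring is an assumption made for illustration: known chunks score
    log(count / total); unseen chunks score len * log(1 / total) minus a
    fixed cost, so longer and more numerous unseen chunks are penalized.
    """
    total = sum(lexicon.values()) + 2
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)  # (score, start of last chunk)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            chunk = text[start:end]
            count = lexicon.get(chunk, 0)
            if count:
                score = math.log(count / total)
            else:
                score = len(chunk) * math.log(1 / total) - new_word_cost
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Follow the back-pointers from the end of the string.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


def learn(utterances, max_len=8):
    """Segment each utterance using only frequencies from earlier ones."""
    lexicon = Counter()
    segmented = []
    for utterance in utterances:
        text = utterance.replace(" ", "")  # remove word-delimiting spaces
        words = segment(text, lexicon, max_len)
        lexicon.update(words)  # the lexicon grows as the input is processed
        segmented.append(words)
    return lexicon, segmented


lexicon, segmented = learn(["dog", "the dog"])
print(segmented)
```

In this toy run the first utterance is stored whole because nothing is known yet, and the second is then divided at the boundary of the chunk already in the lexicon.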
A review of results from cognitive psychology shows that the encoding of frequency data is an implicit, automatic, innate process, and that infants and adults actually do use frequency information to segment language-like stimuli in laboratory experiments.
Finally, I compare the O-B algorithm with five other computational models of speech segmentation, concluding that the O-B algorithm is comparable in quantitative performance and superior in cognitive plausibility.