Editor for this issue: Karen Milligan <karen linguistlist.org>
Whilst it cannot be doubted that this is an interesting and laudable idea, there are problems inherent in a corpus approach to phonetics/phonology (the distinction is unclear in the original post). Something like this is needed: books like Maddieson's Patterns of Sounds (CUP 1984) form the basis of much interesting work in phonological universals, and many interesting phonetic sketches of languages have been produced, some of which make it into journals such as JIPA.

However, phonetic analysis ideally takes repetitions of a single variable from several speakers under strictly controlled conditions, and reading a long connected text may produce neither enough data for analysis nor sufficiently controlled conditions. For example, the vowel /a/ may occur twenty times, but in different segmental, prosodic, intonational and positional contexts, all of which can affect factors such as duration and formant frequencies. And if two speakers of the same language read the same long text, there may be variations in rhythm and pausing which are not apparent in shorter sentences, such as the ones normally used in phonetic analysis. So the phonetic utility of such texts is doubtful.

In phonological terms, not only does a text fail to provide the data needed for deciding which oppositions are contrastive, it may not give examples of all the phonemes of a language. So for Amharic, the ejective /p'/ may not occur, and in English /T/ [theta] may not crop up. The 'marginal' nature of such phonemes is not uninteresting, but larger patterns may reflect historical accidents. In English, for example, very few words end in a long vowel + /b, d, g/, e.g. league, barb (for non-rhotic speakers like me), although final /d/ is fairly common. Recent coinages like Beeb for BBC show that there is no phonological constraint on such words occurring, but they don't crop up as regularly as their counterparts with voiceless codas (e.g. beat, meat, seep, sheep, soup, seek, park), for whatever historical reasons. A random text may not contain any such words with a final voiced plosive, and may lead one to conclude that English, like German, does not allow phonologically voiced plosives in coda position.

So I think a corpus approach should not be based on connected texts, but on more traditional phonetic and phonological approaches. A purely text-based approach also has the drawback that unwritten languages cannot be represented. I look forward to reading what other List users think.

Mark Jones