LINGUIST List 13.88

Tue Jan 15 2002

Disc: Phonetic Frequencies & "Corpus Phonetics"

Editor for this issue: Karen Milligan <>


  1. Greg Kochanski, Linguist 13.50, Phonetic Frequencies & "Corpus Phonetics"

Message 1: Linguist 13.50, Phonetic Frequencies & "Corpus Phonetics"

Date: Fri, 11 Jan 2002 16:35:36 -0500
From: Greg Kochanski <>
Subject: Linguist 13.50, Phonetic Frequencies & "Corpus Phonetics"

Mark Jones has done a good job of laying out two of the
difficulties in corpus phonetics/phonology. However, that's only
one side of the story. First, the problems he raises are
solvable with a proper statistical approach and proper
corpus design.
Second, "classic" approaches have problems of their own.

His first objection is that reading a text may not provide
enough control of the environment in which a word is spoken.
That is certainly true if one randomly chooses a text
and simply asks people to read it. Different people will
often interpret it, and speak it, in different ways.
ways. Shakespeare's plays provide a good example here:
the same text is interpreted by different actors, and the
acoustic results can be dramatically different.

However, without pretending to solve all the problems of experimental
design, I can point out some possible solutions:
1) Carefully design texts that have only one reasonable interpretation.
2) Ask listeners (listeners who aren't an author) to evaluate
	the resulting speech:
	"Was he putting the sentence focus on 'George'?"

Then, once the data is acquired, you can ask listeners to
mark prosodic or other factors that can influence the
pronunciation. For instance, listeners can mark pauses.
Then, later stages of the analysis can differentiate
/a/ preceding a pause from /a/ that does not.

You need to build models that can survive incomplete data.
The model has to be statistically correct, so
that its results will show where the data is missing.
For instance, if the data contains only one example of
/a/ at the end of a sentence, the results should not
describe how /a/'s formants change under these conditions.
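The idea can be sketched in a few lines. This is a hypothetical illustration, not anything from the original discussion: the threshold, the function name, and the numbers are all made up. The point is simply that an analysis should return "unknown" rather than a spurious estimate when the data is too thin.

```python
# Hypothetical sketch: report a formant estimate only when enough
# tokens support it, so missing data shows up as "no answer" rather
# than as a misleading measurement. Threshold and values are
# illustrative only.
import statistics

MIN_TOKENS = 5  # below this, refuse to report an estimate

def f1_estimate(tokens):
    """tokens: list of F1 measurements (Hz) for one context,
    e.g. /a/ in sentence-final position."""
    if len(tokens) < MIN_TOKENS:
        return None  # data is missing; say so instead of guessing
    return statistics.mean(tokens)

print(f1_estimate([710.0]))                     # one token: None
print(f1_estimate([710, 695, 720, 705, 698]))   # enough tokens: the mean
```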

Hopefully, the model also generalizes to some extent.
For instance, one might have a model that claims that all
vowels have similar formant shifts in sentence-final position.
If so, one could still measure an effect averaged over all vowels,
even if /a/ were missing, so long as the model's assumptions
were clearly stated.
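The pooling idea above can be sketched as follows. Again, this is a hypothetical illustration with invented numbers: under the stated assumption that all vowels shift similarly in sentence-final position, one averages the observed shifts over whatever vowels the corpus does contain, even with /a/ absent.

```python
# Hypothetical sketch of pooling across vowels under the stated
# assumption that all vowels show similar sentence-final formant
# shifts. All data values are made up for illustration.
def pooled_shift(shifts_by_vowel):
    """shifts_by_vowel: dict mapping vowel -> list of observed F1
    shifts (Hz) between sentence-medial and sentence-final tokens."""
    all_shifts = [s for shifts in shifts_by_vowel.values() for s in shifts]
    if not all_shifts:
        return None
    return sum(all_shifts) / len(all_shifts)

# No /a/ tokens at all, yet a pooled estimate is still available:
data = {"i": [-12.0, -8.0], "u": [-10.0], "e": [-9.0, -11.0]}
print(pooled_shift(data))  # average shift under the pooling assumption
```

The estimate is only as good as the pooling assumption, which is why the post stresses stating it clearly.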

And, clearly, one needs a large enough corpus to study improbable
combinations. You may also need to 'seed' the corpus
with words that you want to study.
Again, a proper statistical analysis will
tell you what you do and don't know as a result of the experiment.
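A back-of-the-envelope calculation shows why corpus size matters for improbable combinations. This sketch is illustrative only (the occurrence probability is invented): if a target combination occurs with probability p per word, a corpus of N words yields about N*p examples, and the chance of seeing none at all is (1 - p)**N.

```python
# Hypothetical sizing sketch: expected number of examples of a rare
# combination, and the probability the corpus contains none at all.
# p is invented for illustration.
def expected_count(p, n_words):
    return p * n_words

def prob_none(p, n_words):
    return (1.0 - p) ** n_words

p = 1e-4  # say, one occurrence per 10,000 words
for n in (10_000, 100_000):
    print(n, expected_count(p, n), prob_none(p, n))
```

At 10,000 words the expected count is about one, and there is still roughly a 37% chance of finding no example at all, which is exactly the situation where seeding the corpus with target words pays off.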

On the other hand, the problem with
"classic" approaches is that they tend to yield precise
results about a language called "Laboratory English", which is
not quite the same language as is spoken on the street.

So, my general attitude is that _if_ you can answer a question
with a corpus approach, you should, because that lets you
study real languages. Doing the job with a corpus-based
approach may involve designing your own corpus, and it may
involve subsidiary experiments to select relevant parts of
the corpus. It's generally a lot of work.

If you can't design a corpus experiment, you have to fall back
on a formal, laboratory experiment. There, you have more
control of the conditions, but you always have to be aware that
your subjects will not speak quite normally. Also, their answers
to questions like "is X contrastive" may not reflect how they
would interpret such speech in the real world.

So, different questions get answered different ways.