LINGUIST List 13.2044

Wed Aug 7 2002

Disc: New: Accuracy in Speech Recognition: Priorities

Editor for this issue: Karen Milligan


  1. Richard Sproat, Re: Linguist 13.2025: Media: NYT - Speech recognition

Message 1: Re: Linguist 13.2025: Media: NYT - Speech recognition

Date: Sat, 3 Aug 2002 10:47:19 -0400
From: Richard Sproat
Subject: Re: Linguist 13.2025: Media: NYT - Speech recognition

The NYT article that Karen S. Chung pointed us to is a pretty good
example of the kind of reporting that anyone who works on speech
technology (or at least anyone who is honest) should cringe at.

There seems to be the implication that a major problem in speech
recognition is that we can't detect where sentence boundaries are in
running speech, and that we are only beginning to be able to detect
emotional content.

How about the more basic problem of getting most of the words right?
Speech recognition methods that might score in the low 90% range in
terms of word accuracy on relatively "clean" tasks such as dictation
or broadcast news can easily fall to the 60-70% range on
conversational speech. And if there isn't sufficient training data for
the domain or the acoustic conditions -- a highly realistic scenario
for tasks such as eavesdropping on potential terrorists -- then they
easily drop into the 30% range. When you are getting two out of every
three words wrong, the fact that you are also unable to detect
sentence boundaries, or whether the speaker is angry, somehow doesn't
seem to be that critical. And all of this assumes that the people are
speaking English, or one of the handful of other languages for which
there is enough data to train large vocabulary speech recognizers. One
cannot generally assume that terrorists plotting their next attack
will be speaking one of those languages.
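
For readers who have not worked with these figures: word accuracy is
conventionally computed by aligning the recognizer output against a
reference transcript and counting substitutions (S), deletions (D) and
insertions (I) over the N reference words. A minimal Python sketch,
assuming whitespace-separated words and the usual (N - S - D - I) / N
definition; the function name and the example strings are illustrative
and do not come from the NYT article or from any particular system:

```python
def word_accuracy(reference, hypothesis):
    """Return (N - S - D - I) / N using a minimum-edit-distance word alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # reference word deleted
                curr[j - 1] + 1,          # extra word inserted
                prev[j - 1] + (r != h),   # match (cost 0) or substitution (cost 1)
            ))
        prev = curr

    # prev[-1] is now S + D + I for the best alignment.
    return 1.0 - prev[-1] / len(ref)


# Hypothetical example: one substitution in a four-word reference.
print(word_accuracy("a b c d", "a x c d"))  # 0.75
```

Because insertions count against the score, this measure can go below
zero when the recognizer outputs many spurious words.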

Now I happen to think that trying to detect things like prosodic
phrasing or emotion is worthwhile. Certainly detecting if someone is
angry can have useful applications: one could for example use that
information to route a disgruntled caller to an agent specially
trained to deal with unhappy customers. (There has even been a recent
patent on precisely that application, though the "inventors" were a
bit fuzzy on the implementational details: this may have been the
patent referred to in the NYT article.) And detecting prosodic
phrasing can be useful for such things as deciding how to parse a long
string of numbers, and so forth, as the article points out.

But I also believe it is important to put these kinds of things in
perspective. For many applications detecting sentence boundaries or
the speaker's emotional state just ain't number one on the list of
problems to be solved. This should be obvious: if you hand a security
analyst a near-perfect transcription of some speech that omits
sentence boundaries, they are likely to get a whole lot more out of
that than if you hand them a transcript with only 30% of the words
correct, but which puts in sentence boundaries (say with 70% accuracy)
and tells them if the speaker is angry or not (say with 65% accuracy).

- Richard Sproat

Richard Sproat
Information Systems and Analysis Research
AT&T Labs -- Research, Shannon Laboratory
180 Park Avenue, Room B207, P.O. Box 971
Florham Park, NJ 07932-0000