Editor for this issue: Karen Milligan <karen linguistlist.org>
The NYT article that Karen S. Chung pointed us to is a pretty good example of the kind of reporting that anyone who works on speech technology (or at least anyone who is honest) should cringe at. There seems to be the implication that a major problem in speech recognition is that we can't detect where sentence boundaries are in running speech, and that we are only beginning to be able to detect emotional content. How about the more basic problem of getting most of the words right?

Speech recognition methods that might score in the low 90% range in terms of word accuracy on relatively "clean" tasks such as dictation or broadcast news can easily fall to the 60-70% range on conversational speech. And if there isn't sufficient training data for the domain or the acoustic conditions -- a highly realistic scenario for tasks such as eavesdropping on potential terrorists -- then they easily drop into the 30% range. When you are getting two out of every three words wrong, the fact that you are also unable to detect sentence boundaries, or whether the speaker is angry, somehow doesn't seem to be that critical. And all of this assumes that the people are speaking English, or one of the handful of other languages for which there is enough data to train large vocabulary speech recognizers. One cannot generally assume that terrorists plotting their next attack will be speaking one of those languages.

Now I happen to think that trying to detect things like prosodic phrasing or emotion is worthwhile. Certainly detecting whether someone is angry can have useful applications: one could, for example, use that information to route a disgruntled caller to an agent specially trained to deal with unhappy customers. (There has even been a recent patent on precisely that application, though the "inventors" were a bit fuzzy on the implementational details: this may have been the patent referred to in the NYT article.)
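For readers unfamiliar with how figures like "low 90%" or "the 30% range" are arrived at, here is a minimal sketch of one common way to compute word accuracy: align the recognizer output against a reference transcript by word-level edit distance, and take one minus the error count divided by the reference length. This is an illustration only; the function names and the example sentences are made up here, and the exact scoring convention behind the numbers quoted above is an assumption.

```python
def edit_distance(ref, hyp):
    """Minimum number of word substitutions, deletions and insertions
    needed to turn the reference word list into the hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word deleted
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Word accuracy = 1 - (S + D + I) / N, with N the reference length.
    Assumed convention; the value can go negative with many insertions."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substitution and one deletion in 7 words.
reference = "we will meet at the station tomorrow"
hypothesis = "we will meat at station tomorrow"
print(f"word accuracy: {word_accuracy(reference, hypothesis):.1%}")
```

Note that none of this touches sentence boundaries or emotion; it scores only whether the words themselves came out right.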
And detecting prosodic phrasing can be useful for such things as deciding how to parse a long string of numbers, and so forth, as the article points out.

But I also believe it is important to put these kinds of things in perspective. For many applications detecting sentence boundaries or the speaker's emotional state just ain't number one on the list of problems to be solved. This should be obvious: if you hand a security analyst a near-perfect transcription of some speech that omits sentence boundaries, they are likely to get a whole lot more out of that than if you hand them a transcript with only 30% of the words correct, but which puts in sentence boundaries (say with 70% accuracy) and tells them whether the speaker is angry or not (say with 65% accuracy).

- Richard Sproat

Information Systems and Analysis Research
rws research.att.com
AT&T Labs -- Research, Shannon Laboratory
180 Park Avenue, Room B207, P.O.Box 971
Florham Park, NJ 07932-0000
--------------http://www.research.att.com/~rws/-----------------------