Editor for this issue: Karen Milligan <karen linguistlist.org>
The system Steven Roberts describes sounds interesting, but I don't see how it relates to the point I was addressing in my comment on the NY Times article. To repeat my point briefly: an uninformed reader of the New York Times article would come away thinking that the main problems in speech recognition are things like inferring emotional states and detecting phrase boundaries. But in many applications, including a couple that were mentioned in the article, the bigger problem is simply getting most of the words right. If you are at a 70% word error rate, worrying about prosody will not get you to a 30% word error rate (or at least nobody has, to my knowledge, demonstrated that it will). Thus the article gives a misleading view of the main issues in the field.
[By the way, I completely agree with Kurt Godden that the standard word error rate (WER) measure leaves much to be desired, but it generally correlates reasonably well with performance on a given task. Still, it is true that for a real application one might want to report some other measure, such as task completion. In speech-based information retrieval, people generally report standard measures such as precision and recall; of course, these also correlate with WER.]
As far as I can tell from the description, Roberts' system was not a demonstration that detection of emotion or detection of phrasing improves recognition. As he says, it seems to demonstrate that there is "value to be obtained from adding even slightly more sophisticated constraints in the recognition process". But we knew that already: speech recognition systems depend upon various kinds of constraints, ranging from the phonotactics of the language to domain-specific language models. The fact that these are often trainable statistical models does not nullify the fact that they incorporate linguistic knowledge.
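For readers unfamiliar with the word error rate (WER) measure mentioned above: it is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. The following is only a minimal sketch of that conventional definition. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, not the scoring procedure of any particular system discussed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumes whitespace-separated words and exact string matching;
    real scoring tools usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (b -> x) and one deletion (d) over four words.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, a value computed this way can exceed 100% when the hypothesis is much longer than the reference.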
Anyway, I fail to see how my argument (still less my attitude, which one could hardly infer) illustrates the kind of problems that have been "hampering forward progress in speech recognition". I did not say that people should not be working on prosodic features: in fact, I said precisely the reverse. I also did not say that people should not be trying to make use of various kinds of linguistic information in improving recognition: I am all for that (I am a linguist, not an engineer, after all). I was merely trying to make the point that, for many applications, things like detecting emotion do not rank as number one on the list of problems to be solved.
Richard Sproat
Information Systems and Analysis Research
rws research.att.com
AT&T Labs -- Research, Shannon Laboratory
180 Park Avenue, Room B207, P.O. Box 971
Florham Park, NJ 07932-0000
http://www.research.att.com/~rws/