LINGUIST List 13.2065

Sat Aug 10 2002

Disc: Accuracy in Speech Recognition: Priorities

Editor for this issue: Karen Milligan <karen@linguistlist.org>


Directory

  1. Richard Sproat, Re: 13.2050, Disc: Accuracy in Speech Recognition: Priorities

Message 1: Re: 13.2050, Disc: Accuracy in Speech Recognition: Priorities

Date: Fri, 9 Aug 2002 10:53:57 -0400
From: Richard Sproat <rws@research.att.com>
Subject: Re: 13.2050, Disc: Accuracy in Speech Recognition: Priorities


The system Steven Roberts describes sounds interesting, but I don't
see how it relates to the point I was addressing in my comment on the
NY Times article.

I briefly repeat my point: an uninformed reader of the New York Times
article would come away thinking that the main problem in speech
recognition is things like inference of emotional states and
detection of phrase boundaries. But in many applications, including a
couple that were mentioned in the article, the bigger problem is
simply getting most of the words right. If you are at a 70% word
error rate, worrying about prosody will not get you to a 30% word
error rate (or at least nobody has, to my knowledge, demonstrated
that it will). Thus the
article gives a misleading view of the main issues in the field.

[By the way, I completely agree with Kurt Godden that the standard
word error rate (WER) measure leaves much to be desired, but it
generally correlates reasonably well with performance on a given
task. Still, it is true that for a real application one might want to
report some other measure, like task completion. In speech-based
information retrieval, people will generally report standard measures
such as precision and recall; of course, these also correlate with
WER.]
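
[For readers unfamiliar with the measure: WER is standardly computed
from a minimum-edit-distance (Levenshtein) alignment of the
recognizer's hypothesis against a reference transcript, as
(substitutions + deletions + insertions) divided by the number of
reference words. A minimal illustrative sketch in Python (mine, not
from any particular toolkit):

  def wer(reference, hypothesis):
      ref, hyp = reference.split(), hypothesis.split()
      # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
      d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
      for i in range(len(ref) + 1):
          d[i][0] = i          # i deletions
      for j in range(len(hyp) + 1):
          d[0][j] = j          # j insertions
      for i in range(1, len(ref) + 1):
          for j in range(1, len(hyp) + 1):
              sub = 0 if ref[i - 1] == hyp[j - 1] else 1
              d[i][j] = min(d[i - 1][j] + 1,        # deletion
                            d[i][j - 1] + 1,        # insertion
                            d[i - 1][j - 1] + sub)  # substitution/match
      return d[len(ref)][len(hyp)] / len(ref)

  # e.g. wer("the cat sat on the mat", "the cat sat mat") == 2/6,
  # since the alignment involves two deletions against six
  # reference words.]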

As far as I can tell from the description, Roberts' system was not a
demonstration that detection of emotion or detection of phrasing
improves recognition. As he says, it seems to demonstrate that there
is "value to be obtained from adding even slightly more sophisticated
constraints in the recognition process". But we knew that already:
speech recognition systems depend upon various kinds of constraints
ranging from the phonotactics of the language to domain-specific
language models. The fact that these are often trainable statistical
models does not nullify the fact that they incorporate linguistic
knowledge.
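
To make the language-model point concrete, here is a toy sketch (the
domain and the probabilities are entirely made up) of how a
domain-specific bigram model acts as a constraint: among acoustically
confusable word sequences, the recognizer prefers the one the model
scores higher.

  import math

  # Toy bigram log-probabilities for a flight-booking domain
  # (hypothetical numbers, for illustration only).
  bigram_logprob = {
      ("book", "a"): math.log(0.20),
      ("a", "flight"): math.log(0.15),
      ("a", "fright"): math.log(0.0001),
  }

  def lm_score(words, unseen=math.log(1e-6)):
      # Sum of bigram log-probabilities; unseen pairs get a floor.
      return sum(bigram_logprob.get(pair, unseen)
                 for pair in zip(words, words[1:]))

  # "book a flight" outscores the acoustically similar
  # "book a fright", so the model steers recognition toward it.
  print(lm_score(["book", "a", "flight"]) >
        lm_score(["book", "a", "fright"]))  # True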

Anyway, I fail to see how my argument (even less my attitude, which
one could hardly infer) illustrates the kind of problems that have
been "hampering forward progress in speech recognition". I did not say
that people should not be working on prosodic features: in fact I said
precisely the reverse. I also did not say that people should not be
trying to make use of various kinds of linguistic information in
improving recognition: I am all for that (I am a linguist, not an
engineer, after all). I was merely trying to make the point that for
many applications, things like detecting emotion do not rank as number
one on the list of things to be solved.

--
Richard Sproat         Information Systems and Analysis Research
rws@research.att.com   AT&T Labs -- Research, Shannon Laboratory
                       180 Park Avenue, Room B207, P.O. Box 971
                       Florham Park, NJ 07932-0000
--------------http://www.research.att.com/~rws/-----------------------