LINGUIST List 13.2050

Thu Aug 8 2002

Disc: Accuracy in Speech Recognition: Priorities

Editor for this issue: Karen Milligan


  1. Steven L. Roberts, Re: 13.2044, Disc: New: Accuracy in Speech Recognition: Priorities

Message 1: Re: 13.2044, Disc: New: Accuracy in Speech Recognition: Priorities

Date: Thu, 08 Aug 2002 13:38:00 -0700
From: Steven L. Roberts
Subject: Re: 13.2044, Disc: New: Accuracy in Speech Recognition: Priorities

Re: Linguist 13.2044

The poster's line of argument and attitude are, in my opinion, among
the things that have been hampering forward progress in speech
recognition. At least in commercial engines, recognition is viewed as
a "black box" stage in processing, assumed to be feasible using only
statistical models of acoustic data and a regular expression "grammar"
for limiting the search space. I have built speech recognition
applications that use semantic and contextual (one could call them
"pragmatic" if one wanted to stretch it) constraints to significantly
increase reliability and robustness on conversational (i.e. real-world)
speech. But to do so you have to 'cheat' and work around the
design of current engines, yielding far from optimal designs.

Speech recognition accuracy rates (again, in commercial systems) have
hit, or at least come very close to, a plateau, one that will not be
overcome until speech recognition systems admit semantic,
contextual/pragmatic, prosodic and other linguistic sources of
constraints as first-class citizens (i.e. as a part of the primary
recognition stage). However, the phrase "with friends like statistics,
who needs linguistics" seems to apply far too often in the industry.
(This is a quote from someone in the Microsoft speech technology
group, but it seems to be reflected elsewhere.)

An example of a VERY hacky approach to adding simple semantic and
contextual constraints follows. It's based on the observation that even
current engines can be very accurate phonetically when working with
non-constraining grammars. The sequences of words recognized will be
very inaccurate, but the implied sequence of phonemes is often very
accurate (as measured by a weighted edit distance metric against the
actual phoneme stream used as input). A "non-constraining" grammar for
my purposes is simply a flattened grammar, START ::= WORD_LIST*. I
added a second pass which "reconstructs" the recognition by scoring
word sequences that are phonetically close to the recognized sequence.
(The actual algorithm analyzes n-best results to find islands of
higher confidence in the stream, then uses edit distances with higher
penalties assigned for deviation from the islands; other constraints
are then evaluated to modify the score.) This approach is tremendously
inefficient and has its limitations, but was nevertheless deployable
and obtained very good results.

The benchmark that we used was a database of sales opportunities for a
large Fortune 500 company. The opportunity names had originally been
entered by hand and were thus full of misspellings, duplicates, etc.
The sales people did not remember what they had named opportunities, so
often they only remembered one or two words from the name or were
making outright guesses. We ran the system with automatically
generated dictionaries (i.e. no hand-coding of phonetic entries) for
all the unique words in a set of about 15,000 opportunities. The
constraints used for scoring were things like: is the opportunity
associated with the user in some way, was it recently modified by the
user or someone in the user's group, is there an entry in the user's
calendar that is close in time, did the proposed sequence meet
syntactic constraints, etc. After reconstruction we presented our
best guess and, if that was incorrect, then presented the next three
guesses. We achieved rates in the high 80s on the first guess and in
the high 90s when considering hits in the next three items.

Recognition data was collected from a variety of land line and
cellular phones, and represented a wide range of background noise and
signal quality. The users knew what items they were looking for, but
generally did not have the exact phrasing of the name, so there was a
lot of variety in word selection and ordering. The setup time was
simply the time necessary to import the opportunity records (we
allowed selection on any combination of name, date, client and a
couple of other fields) and generate the non-constraining grammar. In
other words, something like 5 minutes of computer time + install and
basic setup of the software (the significance is something you realize
if you have ever had the misfortune to be involved in tuning and
deploying a speech rec
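
The second pass described above can be sketched roughly as follows.
This is a minimal illustration in Python, not the deployed system: the
phoneme symbols, the table of confusable pairs, the cost values, the
Candidate fields and the bonus weights are all assumptions invented
for the example, and the n-best "island" analysis is left out
entirely. It only shows the general shape: a weighted edit distance
over phoneme sequences, adjusted by contextual constraints, returning
a best guess plus the next three.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical table: acoustically confusable phoneme pairs are cheaper to swap.
CONFUSABLE = {
    frozenset(("p", "b")),
    frozenset(("t", "d")),
    frozenset(("s", "z")),
    frozenset(("m", "n")),
}


def sub_cost(a: str, b: str) -> float:
    """Substitution cost: 0 for a match, 0.5 for a confusable pair, else 1."""
    if a == b:
        return 0.0
    return 0.5 if frozenset((a, b)) in CONFUSABLE else 1.0


def weighted_edit_distance(hyp: Sequence[str], ref: Sequence[str],
                           indel: float = 1.0) -> float:
    """Dynamic-programming edit distance over two phoneme sequences."""
    prev = [j * indel for j in range(len(ref) + 1)]
    for i, h in enumerate(hyp, start=1):
        cur = [i * indel]
        for j, r in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + indel,               # drop a phoneme from hyp
                cur[j - 1] + indel,            # add a phoneme from ref
                prev[j - 1] + sub_cost(h, r),  # match or substitute
            ))
        prev = cur
    return prev[-1]


@dataclass
class Candidate:
    name: str
    phonemes: List[str]
    owned_by_user: bool = False       # assumed contextual constraint
    recently_modified: bool = False   # assumed contextual constraint


def rescore(recognized: Sequence[str], candidates: Sequence[Candidate],
            top_n: int = 4) -> List[Candidate]:
    """Rank candidates by phonetic closeness minus contextual bonuses."""
    def score(c: Candidate) -> float:
        dist = weighted_edit_distance(recognized, c.phonemes)
        norm = dist / max(len(recognized), len(c.phonemes), 1)
        bonus = 0.2 * c.owned_by_user + 0.1 * c.recently_modified
        return norm - bonus  # lower is better
    return sorted(candidates, key=score)[:top_n]
```

Calling rescore() with the phoneme sequence implied by the engine's
output and a list of Candidate records returns up to four names, the
first being the best guess. In this toy version a candidate tied to
the user can outrank one that is slightly closer phonetically, which
is the kind of trade-off the contextual constraints are meant to make.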

The system sketched out above is pretty crude but even so indicates to
me that there is a lot of value to be obtained from adding even
slightly more sophisticated constraints in the recognition
process. The response times were more than acceptable with the extra
pass of processing (this was a telephony app and performance was
evaluated on a fully loaded system), so I don't believe arguments that
adding such constraints at the engine level would impose too much
runtime overhead. There are two classes of constraints: those that can
be applied to directly limit the HMM search space, and those analogous
to a second pass, which are not currently represented in engines. The
regular (type-3) grammars are used as just a low-powered method for
expressing search space constraints by defining possible sequences of
HMM states. However, there is no reason that the constraints have to
exist as a pre-compiled graph of sequences; any method that constrains
the sequences of states explored could be used. As for the second class
of constraints, at least setting up a more open design for the engine
is a starting point.
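
To make the first class of constraints concrete, here is a small
sketch, in Python, of a Viterbi-style search in which the permitted
state sequences are not read from a pre-compiled graph but decided by
a callback consulted during the search. The function name, the
callback signature and the toy scores are assumptions for
illustration only; transition probabilities are omitted (treated as
uniform), and no real engine's API is implied.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# allowed(prev_state, next_state, t) -> bool; prev_state is None at t == 0.
Constraint = Callable[[Optional[str], str, int], bool]


def constrained_viterbi(
    obs_logp: Sequence[Dict[str, float]],
    states: Sequence[str],
    allowed: Constraint,
) -> Optional[Tuple[float, List[str]]]:
    """Return (score, path) for the best state path that `allowed` permits.

    obs_logp[t][s] is the log-likelihood of frame t under state s.
    Returns None if the constraint rules out every path.
    """
    if not obs_logp:
        return None
    # Initial frame: keep only the states the constraint admits as a start.
    best = {s: (obs_logp[0][s], [s]) for s in states if allowed(None, s, 0)}
    for t in range(1, len(obs_logp)):
        nxt = {}
        for s in states:
            # The constraint is evaluated here, at search time.
            options = [
                (score + obs_logp[t][s], path + [s])
                for prev, (score, path) in best.items()
                if allowed(prev, s, t)
            ]
            if options:
                nxt[s] = max(options, key=lambda o: o[0])
        best = nxt
    if not best:
        return None
    return max(best.values(), key=lambda o: o[0])
```

Because `allowed` is an ordinary function, it could in principle look
at anything available at that moment (the user, the dialogue state, a
calendar) rather than only at a fixed grammar.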

When someone says "prosodic information", I think of things like pitch
contours, not classification as "angry" or whatever. For instance, a
pitch contour could be used to measure the acceptability of a
proposed recognition sequence, both in terms of the boundaries that are
implied in the sequence and whether it parses to a syntactic form that
is suitable for the contour. Information like this can be used BEFORE
or DURING the construction of likely word sequences. The view that
recognition is simply a process of finding the right words, after which
someone else can do "other stuff" with them, is extremely short-sighted
and limits the kind of approaches that could be used to actually
improve the technology.
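
As a very rough illustration of the boundary half of that idea, the
Python sketch below scores a proposed word sequence by how well the
word boundaries it implies line up with breaks detected in the pitch
contour. Everything here is assumed for the example: that break times
(in seconds) have already been extracted from the contour, that the
hypothesis comes with boundary times, and the 50 ms tolerance. The
syntactic-fit half is not attempted.

```python
from typing import Sequence


def boundary_agreement(
    prosodic_breaks: Sequence[float],
    word_boundaries: Sequence[float],
    tolerance: float = 0.05,
) -> float:
    """Fraction of prosodic breaks lying within `tolerance` seconds of
    some word boundary implied by the hypothesis (1.0 if no breaks)."""
    if not prosodic_breaks:
        return 1.0
    hits = sum(
        any(abs(b - w) <= tolerance for w in word_boundaries)
        for b in prosodic_breaks
    )
    return hits / len(prosodic_breaks)
```

A score like this could be computed for partial hypotheses as well as
complete ones, which is what would let it be applied during the
construction of word sequences rather than only afterwards.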
