LINGUIST List 13.2111

Fri Aug 16 2002

Disc: Accuracy in Speech Recognition: Priorities

Editor for this issue: Karen Milligan <>


  1. David Horowitz, RE: 13.2065, Disc: Accuracy in Speech Recognition: Priorities

Message 1: RE: 13.2065, Disc: Accuracy in Speech Recognition: Priorities

Date: Fri, 16 Aug 2002 10:49:09 +0100
From: David Horowitz <>
Subject: RE: 13.2065, Disc: Accuracy in Speech Recognition: Priorities

I agree with Dr. Sproat's comments (Linguist 13.2065), but wanted to
add my own.

Statistical methods do incorporate some linguistic constraints, and
acoustic front ends encompass some of the acoustic phonotactics. In
my own research, we have considered the detection of prosody for
spoken dialog systems. Probably the best example is Roberto
Pieraccini's research on detecting negative emotional states, to
ascertain whether the user is frustrated with the system, perhaps
because of poor performance. However, as Dr. Sproat points out, it is
not high on the priority list of methods to focus on in the hopes of
producing improved performance.

I am enthusiastic that there is a move in the stochastic speech
community to begin thinking about prosody. However, any analysis of
the problem should not be solely considered as a stochastic algorithm.
If you look at the work of Professor Mari Ostendorf, she begins to
examine the application of distinctive feature theory (posited by
Keyser and Stevens). Moreover, the Ph.D. thesis of Dr. Mark Johnson
looks at a feature detector for the front end of a speech recogniser.
They posit that, for improved performance, we need to embed more
knowledge of the speech signal. While statistical approaches have
been shown to be powerful for commercial recognisers, I believe it is
fruitful and timely to re-examine the literature of traditional
speech science and acoustic phonetics (see Stevens's textbook,
Acoustic Phonetics, MIT Press).

The fact that people are worried about improved measures of
performance also indicates that the traditional acoustic modelling
techniques of speech have a role. A well-known speech synthesis
scientist once commented to me that examining a spectrogram does not
tell the scientist how to measure voice quality and naturalness.
However, Klatt and Klatt (1990) and Helen Hanson and Ken Stevens have
demonstrated reliable acoustic measures that reflect voice quality,
and Klatt validated the model by closely resynthesising human speech
with a formant synthesiser.
Measures such as spectral tilt, formant bandwidth and glottal open
quotient need to be modelled for any work to be done in prosody.
These parameters change dynamically with time, especially at phrase
boundaries. It is difficult for me to see how the prosody problem can
be solved using purely statistical approaches when subtle spectral
changes account for perceived naturalness and emotional state.
Furthermore, this line of investigation holds much promise for the
acoustic front end of speech recognition and for improved benchmarks
of system performance.
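To make the spectral-tilt measure concrete, here is a minimal
pure-Python sketch (not from the original post; the function names and
the test signal are my own). It fits a least-squares line to spectral
magnitude in dB against log2(frequency), so the slope comes out in
dB/octave; a harmonic series whose partials fall off as 1/h should
yield roughly -6 dB/octave.

```python
import math
import cmath

def dft_magnitudes(signal):
    """Naive DFT magnitude spectrum (adequate for short frames)."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

def spectral_tilt_db_per_octave(frame, sample_rate):
    """Least-squares slope of magnitude (dB) against log2(frequency).

    Only bins well above the numerical noise floor are used, so the
    fit is dominated by the harmonic peaks.
    """
    mags = dft_magnitudes(frame)
    n = len(frame)
    peak = max(mags)
    points = [(math.log2(k * sample_rate / n), 20 * math.log10(mags[k]))
              for k in range(1, len(mags))          # skip DC
              if mags[k] > 1e-6 * peak]
    # ordinary least-squares slope
    m = len(points)
    mx = sum(x for x, _ in points) / m
    my = sum(y for _, y in points) / m
    return (sum((x - mx) * (y - my) for x, y in points) /
            sum((x - mx) ** 2 for x, _ in points))

# A harmonic series whose partials fall off as 1/h, i.e. -6 dB/octave.
sr, f0 = 8000, 200.0
frame = [sum((1.0 / h) * math.sin(2 * math.pi * h * f0 * t / sr)
             for h in range(1, 10))
         for t in range(400)]
print(round(spectral_tilt_db_per_octave(frame, sr), 1))   # → -6.0
```

A real analysis would of course window the frame and track the tilt
over time, since, as noted above, these parameters vary dynamically at
phrase boundaries.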

At Vox Generation, we have worked on an extension of Abney's work
(phrase chunking) and Hirschberg's work (automatic prosody marking)
with the aim of an improved linguistic model for prosody generation
in synthetic speech. The new research we are pursuing involves
overgenerating ToBI marks and then selecting the appropriate mark, to
add intelligence to the unit selection mechanism. However, I raise
the question of whether traditional unit selection techniques can be
trained to interpret these prosodic marks.
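The overgenerate-then-select idea can be sketched as follows. This is
purely illustrative: the class, the scoring values, and the
one-mark-per-syllable policy are my own assumptions, not Vox
Generation's actual mechanism.

```python
from dataclasses import dataclass

@dataclass
class ToBICandidate:
    syllable: str
    mark: str        # e.g. "H*", "L*", "L-L%" (ToBI tone labels)
    score: float     # hypothetical model score for this candidate

def select_marks(candidates):
    """Keep the best-scoring mark per syllable from an overgenerated set."""
    best = {}
    for c in candidates:
        if c.syllable not in best or c.score > best[c.syllable].score:
            best[c.syllable] = c
    return {s: c.mark for s, c in best.items()}

cands = [
    ToBICandidate("to", "H*", 0.2),
    ToBICandidate("to", "L*", 0.7),
    ToBICandidate("day", "L-L%", 0.9),
    ToBICandidate("day", "H-H%", 0.4),
]
print(select_marks(cands))   # → {'to': 'L*', 'day': 'L-L%'}
```

The selected marks would then be passed to the unit selection
mechanism as additional features on each target unit.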

David Horowitz
Executive Chief Scientist
Vox Generation Ltd, London