Editor for this issue: Karen Milligan <karen linguistlist.org>
I agree with Dr. Sproat's comments (Linguist 13.2065), but wanted to add my own. Statistical methods do incorporate some linguistic constraints, and acoustic front ends encompass some of the acoustic phonotactics. In my own research, we have considered the detection of prosody for spoken dialog systems. Probably the best example is some of the research of Roberto Pieraccini on the detection of negative emotional states, to ascertain whether the user is frustrated with the system, perhaps because of poor performance. However, as Dr. Sproat points out, it is not high on the priority list of methods to focus on in the hopes of producing improved performance.

I am enthusiastic that there is a move in the stochastic speech community to begin thinking about prosody. However, the problem should not be analysed solely as a stochastic algorithm. If you look at the work of Professor Mari Ostendorf, she begins to examine the application of distinctive feature theory (posited by Keyser and Stevens). Moreover, the Ph.D. thesis of Dr. Mark Johnson looks at a feature detector for the front end of a speech recogniser. They posit that, for improved performance, we need to embed more knowledge of the speech signal.

While statistical approaches have been shown to be powerful for commercial recognisers, I believe it is fruitful and timely to begin to re-examine the literature of traditional speech science and acoustic phonetics (see Stevens' textbook, Acoustic Phonetics, MIT Press). The fact that people are worried about improved measures of performance also indicates that the traditional acoustic modelling techniques of speech have a role. I have talked to a well-known speech synthesis scientist who commented to me that examining a spectrogram does not tell the scientist how to measure voice quality and naturalness.
However, Klatt and Klatt (1987) and Helen Hanson and Ken Stevens have shown reliable acoustic measures that reflect voice quality, and Klatt showed that this model works by identically resynthesising human speech with a formant synthesiser. Measures such as spectral tilt, formant bandwidth and glottal open quotient need to be modelled for any work to be done in prosody. These parameters change dynamically with time, especially at phrase boundaries. It is a little difficult for me to understand how the prosody problem can be solved using purely statistical approaches when subtle spectral changes account for the quality of naturalness or emotional state. Furthermore, this investigation has much promise for speech recognition, both in the acoustic front end and in improved benchmarks of system performance.

At Vox Generation, we have worked on an extension of Abney's work (phrase chunking) and Hirschberg's work (automatic prosody marking) towards the end of an improved linguistic model for prosody generation of synthetic speech. The new research we are pursuing involves taking these ToBI marks, which are overgenerated, and selecting the appropriate mark to add intelligence to the unit selection mechanism. However, I raise the question of whether traditional unit selection techniques can be trained to interpret these marks of prosody.

David Horowitz
Executive Chief Scientist
Vox Generation Ltd, London
www.voxgeneration.com