Editor for this issue: Karen Milligan <karen linguistlist.org>
Re: Linguist 13.2044

The poster's line of argument and attitude are, in my opinion, among the things that have been hampering progress in speech recognition. At least in commercial engines, recognition is viewed as a "black box" stage in processing, assumed to be feasible using only statistical models of acoustic data and a regular expression "grammar" for limiting the search space.

I have built speech recognition applications that use semantic and contextual (one could call them "pragmatic" if one wanted to stretch it) constraints to significantly increase reliability and robustness on conversational (i.e. real-world) speech. But to do so you have to 'cheat' and work around the design of current engines, which yields far from optimal designs.

Speech recognition accuracy rates (again, for commercial systems) have hit, or at least come very close to, a plateau. That plateau will not be overcome until speech recognition systems admit semantic, contextual/pragmatic, prosodic and other linguistic sources of constraints as first-class citizens (i.e. as part of the primary recognition stage). However, the phrase "with friends like statistics who needs linguistics" seems to apply far too often in the industry. (This is a quote from someone in the Microsoft speech technology group, but it seems to be reflected elsewhere.)

An example of a VERY hacky approach to adding simple semantic and contextual constraints follows. It's based on the observation that even current engines can be very accurate phonetically when working with non-constraining grammars. The sequences of words recognized will be very inaccurate, but the implied sequence of phonemes is often very accurate (as measured by a weighted edit distance metric against the actual phoneme stream used as input). A "non-constraining" grammar for my purposes is simply a flattened grammar, START ::= WORD_LIST*.
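To make the "weighted edit distance" over phoneme streams concrete, here is a minimal sketch in Python. It is not the metric used in the system described here; the phoneme symbols, the vowel set, and the cost values (0.5 for a substitution within the same broad class, 1.0 otherwise) are all assumptions chosen purely for illustration.

```python
from typing import Callable, Sequence

# Hypothetical broad phoneme class, used only to make some substitutions cheaper.
VOWELS = {"AA", "AE", "AH", "EH", "IH", "IY", "OW", "UW"}


def sub_cost(a: str, b: str) -> float:
    """Illustrative substitution cost: free if equal, cheaper within a class."""
    if a == b:
        return 0.0
    if (a in VOWELS) == (b in VOWELS):
        return 0.5  # vowel-for-vowel or consonant-for-consonant (assumed weight)
    return 1.0


def weighted_edit_distance(
    ref: Sequence[str],
    hyp: Sequence[str],
    sub: Callable[[str, str], float] = sub_cost,
    indel: float = 1.0,
) -> float:
    """Levenshtein-style distance with a pluggable substitution cost."""
    # prev[j] holds the cost of aligning the ref prefix seen so far with hyp[:j].
    prev = [j * indel for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [i * indel]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + indel,          # phoneme in ref missing from hyp
                    curr[j - 1] + indel,      # extra phoneme in hyp
                    prev[j - 1] + sub(r, h),  # match or substitution
                )
            )
        prev = curr
    return prev[-1]
```

Comparing the reference stream against the phonemes implied by a recognized word sequence with a function like this gives a phonetic accuracy figure that is independent of whether the words themselves were right.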
I added a second pass which "reconstructs" the recognition by scoring word sequences that are phonetically close to the recognized sequence. (The actual algorithm analyzes n-best results to find islands of higher confidence in the stream, then computes edit distances with higher penalties assigned for deviation from the islands; other constraints are then evaluated to modify the score.) This approach is tremendously inefficient and has its limitations, but it was nevertheless deployable and obtained very good results.

The benchmark that we used was a database of sales opportunities for a large Fortune 500 company. The opportunity names had originally been entered by hand and were thus full of misspellings, duplicates, etc. The sales people did not remember what they had named opportunities, so often they remembered only one or two words from the name or were making outright guesses. We ran the system with automatically generated dictionaries (i.e. no hand-coding of phonetic entries) for all the unique words in a set of about 15,000 opportunities.

The constraints used for scoring were things like: is the opportunity associated with the user in some way, was it recently modified by the user or someone in the user's group, is there an entry in the user's calendar that is close in time, did the proposed sequence meet syntactic constraints, etc.

After reconstruction we presented our best guess and, if that was incorrect, then presented the next three guesses. We achieved rates in the high 80s on the first guess and in the high 90s when counting hits in the next three items. Recognition data was collected from a variety of land-line and cellular phones and represented a wide range of background noise and signal quality. The users knew what items they were looking for, but generally did not have the exact phrasing of the name, so there was a lot of variety in word selection and ordering.
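As a rough illustration of the second pass described above, the sketch below reranks candidate names by combining an island-weighted phonetic distance with contextual bonuses. This is a simplified stand-in, not the deployed algorithm: the `Candidate` fields, the bonus values in `CONTEXT_BONUS`, and the per-position island weights are all hypothetical, and the island detection from n-best lists is assumed to have happened already.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """A database entry that could be what the caller meant (hypothetical fields)."""
    name: str
    phonemes: Sequence[str]
    owned_by_user: bool = False
    recently_modified: bool = False
    near_calendar_entry: bool = False


# Illustrative bonus values; a real system would have to tune these.
CONTEXT_BONUS = {
    "owned_by_user": 1.0,
    "recently_modified": 0.5,
    "near_calendar_entry": 0.5,
}


def island_weighted_distance(
    recognized: Sequence[str], weights: Sequence[float], candidate: Sequence[str]
) -> float:
    """Edit distance where changing recognized[i] costs weights[i].

    Positions inside a high-confidence island carry a larger weight, so
    candidates that deviate from the islands are penalized more heavily.
    """
    n, m = len(recognized), len(candidate)
    # dp[i][j]: cost of aligning recognized[:i] with candidate[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + weights[i - 1]
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + 1.0
    for i in range(1, n + 1):
        w = weights[i - 1]
        for j in range(1, m + 1):
            sub = 0.0 if recognized[i - 1] == candidate[j - 1] else w
            dp[i][j] = min(
                dp[i - 1][j] + w,        # drop a recognized phoneme
                dp[i][j - 1] + 1.0,      # candidate phoneme that was not heard
                dp[i - 1][j - 1] + sub,  # match or substitution
            )
    return dp[n][m]


def rerank(
    recognized: Sequence[str],
    weights: Sequence[float],
    candidates: Sequence[Candidate],
    k: int = 4,
) -> List[Candidate]:
    """Return the k best candidates: the first guess plus the next three."""
    def score(c: Candidate) -> float:
        bonus = sum(b for attr, b in CONTEXT_BONUS.items() if getattr(c, attr))
        return bonus - island_weighted_distance(recognized, weights, c.phonemes)

    return sorted(candidates, key=score, reverse=True)[:k]
```

With additive scoring like this, a candidate that is slightly off phonetically but strongly tied to the user can outrank an exact phonetic match, which is the kind of trade-off the contextual constraints are meant to allow.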
The setup time was simply the time necessary to import the opportunity records (we allowed selection on any combination of name, date, client and a couple of other fields) and generate the non-constraining grammar. In other words, something like 5 minutes of computer time plus install and basic setup of the software (the significance of this is something you realize if you have ever had the misfortune to be involved in tuning and deploying a speech rec application).

The system sketched out above is pretty crude, but even so it indicates to me that there is a lot of value to be obtained from adding even slightly more sophisticated constraints to the recognition process. The response times were more than acceptable with the extra pass of processing (this was a telephony app and performance was evaluated on a fully loaded system), so I don't believe arguments that adding such constraints at the engine level would impose too much runtime overhead.

There are two classes of constraints: those that can be applied to directly limit the HMM search space, and those analogous to a second pass, which are not currently represented in engines. The regular (type-3) grammars are used as just a low-powered method for expressing search space constraints by defining possible sequences of HMM states. However, there is no reason that the constraints have to exist as a pre-compiled graph of sequences; any method that constrains the sequences of states explored could be used. As for the second class of constraints, at least setting up a more open design for the engine is a starting point.

When someone says "prosodic information", I think of things like pitch contours, not classification as "angry" or whatever. For instance, a pitch contour could be used to measure the acceptability of a proposed recognition sequence, both in terms of the boundaries that are implied in the sequence and whether it parses to a syntactic form that is suitable for the contour.
Information like this can be used BEFORE or DURING the construction of likely word sequences. The view that recognition is simply a process of finding the right words, after which someone else can do "other stuff" with them, is extremely short-sighted and limits the kinds of approaches that could be used to actually improve the technology.

/slr