Editor for this issue: James Yuells <james@linguistlist.org>
Dear Colleague:

The Center for Language and Speech Processing at the Johns Hopkins University is offering a unique summer internship opportunity, which we would like you to bring to the attention of your best students in the current junior class. This internship is unique in that the selected students will participate in cutting-edge research as full members alongside leading scientists from industry, academia, and the government. The exciting nature of the internship is the exposure of undergraduate students to the emerging fields of language engineering, such as automatic speech recognition (ASR), natural language processing (NLP), machine translation (MT), and speech synthesis (TTS). We are specifically looking to attract new talent into the field and, as such, do not require the students to have prior knowledge of language engineering technology.

Please take a few moments to nominate suitable bright students who may be interested in this internship. On-line applications for the program can be found at http://www.clsp.jhu.edu/workshops along with additional information regarding plans for the 2000 Workshop and information on past workshops. The application deadline is January 28, 2000.

If you have questions, please contact us by phone (410-516-7730), e-mail (sec@clsp.jhu.edu) or via the Internet (http://www.clsp.jhu.edu).

Sincerely,
Frederick Jelinek
J.S. Smith Professor and Director

Project Descriptions [3]

1. Reading Comprehension

Building a computer system that can acquire information by reading texts has been a long-standing goal of computer science. Consider designing a computer system that can take the following third-grade reading comprehension exam.

How Maple Syrup is Made

Maple syrup comes from sugar maple trees. At one time, maple syrup was used to make sugar. This is why the tree is called a "sugar" maple tree. Sugar maple trees make sap. Farmers collect the sap.
The best time to collect sap is in February and March. The nights must be cold and the days warm. The farmer drills a few small holes in each tree. He puts a spout in each hole. Then he hangs a bucket on the end of each spout. The bucket has a cover to keep rain and snow out. The sap drips into the bucket. About 10 gallons of sap come from each hole.

   1. Who collects maple sap? (Farmers)
   2. What does the farmer hang from a spout? (A bucket)
   3. When is sap collected? (February and March)
   4. Where does the maple sap come from? (Sugar maple trees)
   5. Why is the bucket covered? (To keep rain and snow out)

Such exams measure understanding by asking a variety of questions. Different types of questions probe different aspects of understanding. Existing techniques currently earn roughly a 40% grade: still failing, but encouraging. We will investigate methods by which a computer can understand the text better, and hope that by the end of the workshop the computer will be ready to move on to the fourth grade!

2. Mandarin-English Information (MEI)

Our globally interconnected world increasingly demands technologies to support on-demand retrieval of relevant information in any medium and in any language. If we search the web for, say, the loss of life in an earthquake in Turkey by entering keywords in English, the most relevant stories are likely to be in Turkish or even Greek. Furthermore, the latest information may be in the form of audio files of the evening's news. One would like to be able first to find such information and then to translate it into English. Finding such information is beyond the capabilities of most commercially available search engines; good automatic translation is even harder.

In this project, we will extend the state of the art for searching audio and on-line text in one language for a user who speaks another language. A very large corpus of concurrent Mandarin and English textual and spoken news stories is available for conducting such research.
These textual and spoken documents in both languages will be automatically indexed; in the case of spoken documents, this will involve automatic speech recognition. Given a query in either language, we will then investigate systems that retrieve relevant documents in both languages for the user. Such cross-lingual and cross-media (CLCM) information retrieval is a novel problem with many technical challenges. Several schemes for recognizing the audio, indexing the text, and estimating translation models to match queries in one language with documents in another will be investigated in the summer. Applications of this research include audio and video browsing, spoken document retrieval, automated routing of information, and automatically alerting the user when special events occur.

3. Audio-Visual Speech Recognition

It is well known that humans have the ability to lip-read: we combine audio and visual information in deciding what has been spoken, especially in noisy environments. A dramatic example is the so-called McGurk effect, where a spoken sound /ba/ is superimposed on the video of a person uttering /ga/. Most people perceive the speaker as uttering the sound /da/.

We will strive to achieve automatic lip-reading by computers, i.e., to make computers recognize human speech even better than is now possible from the audio input alone, by using the video of the speaker's face. There are many difficult research problems on the way to succeeding in this task, e.g., tracking the speaker's head as she moves in the video frame, identifying the type of lip movement, guessing the spoken words independently from the video and the audio, and combining the information from the two signals to make a better guess of what was spoken.

In the summer, we will focus on a specific problem: how best to combine the information from the audio and video signal.
For example, using visual cues to decide whether a person said /ba/ rather than /ga/ can be easier than making the decision based on audio cues, which can sometimes be confusing. On the other hand, deciding between /ka/ and /ga/ is more reliably done from the audio than from the video. Our confidence in the audio-based and video-based hypotheses therefore depends on the kinds of sounds being confused. We will invent and test algorithms for combining the automatic speech classification decisions based on the audio and visual stimuli, resulting in audio-visual speech recognition that significantly improves on traditional audio-only speech recognition performance.

4. Pronunciation Modeling of Mandarin Casual Speech

When people speak casually in daily life, they are not consistent in their pronunciation. In listening to such casual speech, it is quite common to find many different pronunciations of individual words. Current automatic speech recognition systems can reach word accuracies above 90% when evaluated on carefully produced standard speech, but in recognizing casual, unplanned speech, performance drops to 75% or even lower.

There are many reasons for this. In casual speech, one phoneme can shift to another. In Mandarin, for example, the initial /sh/ in "wo shi" (I am) is often pronounced weakly and shifts into an /r/. In other cases, sounds are dropped. In Mandarin, phonemes such as b, p, d, t, k are often reduced and as a result are often recognized as silence.

These problems are especially severe in Mandarin casual speech since most Chinese are non-native Mandarin speakers. Chinese languages such as Cantonese are as different from standard Mandarin as French is from English. As a result, there is an even larger pronunciation variation due to the influence of the speakers' native languages.

We propose to study and model such pronunciation differences in casual speech using interviews found in Mandarin news broadcasts.
We hope to include experienced researchers from both China and the US in the areas of pronunciation modeling, Mandarin speech recognition, and Chinese phonology.

[3] Proposed projects for WS00, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, Maryland 21218-2686.

-
Amy Berdann                  410-516x4778
Center Administrator         berdann@jhu.edu
320 Barton Hall              http://www.clsp.jhu.edu
Center for Language and Speech Processing
Johns Hopkins University