Editor for this issue: Scott Fults <scott linguistlist.org>
Dear Colleague,

The Center for Language and Speech Processing at the Johns Hopkins University is offering a unique summer internship opportunity which we would like you to bring to the attention of your best students in the current junior class.

This internship is unique in that the selected students will participate in cutting-edge research as full members alongside leading scientists from industry, academia and the government. The exciting nature of the internship is the exposure of the undergraduate students to the emerging fields of text-to-speech synthesis, automatic speech recognition and natural language processing. We are specifically looking to attract new talent into the field and, as such, do not require the students to have prior knowledge of the technologies.

Please take a few moments to nominate suitable bright students who may be interested in this internship. Details are attached below. If you have any questions, please contact us by phone, e-mail or via the internet.

Sincerely,

Frederick Jelinek
Professor and Director

----------------------------------------------------------------------

INTERNSHIP ANNOUNCEMENT

The Center for Language and Speech Processing at the Johns Hopkins University is seeking outstanding members of the current junior class to participate in a summer workshop on language engineering from June 28 to August 20, 1999.

No limitation is placed on the undergraduate major. Only relevant skills, employment experience, past academic record and the strength of letters of recommendation will be considered. In the past, students of Biomedical Engineering, Computer Science, Cognitive Science, Electrical Engineering, Linguistics, Mathematics, Physics, Psychology, etc., have been considered. Women and minorities are encouraged to apply.
* An opportunity to explore an exciting new area of research;
* A two-week tutorial on speech and language technology;
* Mentoring by an experienced researcher;
* Use of a computer workstation throughout the workshop;
* A $4800 stipend and $1680 towards per diem expenses;
* Private accommodation for 8 weeks covering the workshop;
* Travel expenses to and from the workshop venue;
* Participation in project planning activities.

The eight-week workshop provides a stimulating and enriching intellectual environment and aims to attract students who will eventually pursue graduate study or research in the field of human language technologies.

Application forms are available via the internet or by mail. Electronic submission of applications is strongly encouraged. Applications must be received at CLSP by February 10, 1999. For details, contact CLSP, Barton Hall, 3400 N. Charles Street, Baltimore, MD 21218, visit our web site at http://www.clsp.jhu.edu, or call 410 516 4237.

----------------------------------------------------------------------

THE 1999 LANGUAGE ENGINEERING WORKSHOP

Automated systems that interact with human users in spoken and written communication will greatly enhance productivity and program usability. These systems will act as on- and off-ramps to the information superhighway, allowing friendly access to services. The convenience these systems provide is essential in other settings, such as for handicapped users or when accessing a database of maintenance manuals while performing intricate repairs. Other applications include conversion of phone mail to text, transcription of radio or TV programs or of telephone conversations, mechanical translation, and information retrieval.

Unfortunately, in many respects, current technology is inadequate for the tasks at hand. For instance, automatic recognition of natural conversational speech has a 40% error rate.
Mechanical translation of technical manuals results in confusing and ungrammatical instructions. Even parsing of sentences of newspaper articles, while it has improved a lot, leads to faulty analysis of over 50% of the sentences attempted.

There is a need to make progress in this important field. The number of available personnel trained in the field is small, and solutions to long-standing research problems must be found. At this time, relatively few universities educate students capable of performing the required tasks.

We are organizing a six-week workshop on Language Engineering at Johns Hopkins University from July 12 to August 20, 1999, in which mixed teams of leading professionals and students will cooperate fully to advance the state of the art. The professionals will be university professors and industrial and governmental researchers presently working in widely dispersed locations. Six or more undergraduates will be selected through a nationwide search from the current junior class based on outstanding academic promise. Graduate students will be familiar with the field and will be selected in accordance with their demonstrated performance.

Four research topics are proposed for this workshop; they were determined by a group of leading professionals in the field:

1. Statistical Machine Translation
2. Towards Language Independent Acoustic Modeling
3. Topic-based Novelty Detection
4. Normalization of Non-standard Words

The Center for Language and Speech Processing has successfully organized similar workshops for the last three summers. Details of past workshops are available at our web site: http://www.clsp.jhu.edu

----------------------------------------------------------------------

OVERVIEW OF SPECIFIC GROUP PROJECTS

1. STATISTICAL MACHINE TRANSLATION

Automatic translation from one human language to another using computers, better known as machine translation (MT), is a longstanding goal of computer science.
To perform such a task, the computer must "know" the two languages: synonyms for words and phrases, grammars of the two languages, and semantic or world knowledge. One way to incorporate such knowledge into a computer is to use bilingual experts to hand-craft the necessary information into the computer program. Another is to let the computer learn some of these things automatically by examining large amounts of parallel text: documents which are nearly exact translations of each other. The Canadian government produces one such resource, for example, in the form of parliamentary proceedings which are recorded in both English and French.

Recently, statistical data analysis has been used to gather MT knowledge automatically from parallel bilingual text. The techniques have unfortunately not been disseminated to the scientific community in a very usable form, and new follow-on ideas have not developed rapidly. In pre-workshop activity, we plan to reconstruct a baseline statistical MT system for distribution to all researchers, and to use it as a platform for workshop experiments. These experiments will include working with morphology, online dictionaries, widely available monolingual texts, and syntax. The goal will be to improve the accuracy of the baseline and/or achieve the same accuracy with only limited parallel corpora. We will work with the French-English Hansard data as well as with a new language, perhaps Czech or Chinese.

2. TOWARD LANGUAGE INDEPENDENT ACOUSTIC MODELING

The state of the art in automatic speech recognition (ASR) has advanced considerably for those languages for which large amounts of data are available to build the ASR system. Obtaining such data is usually very difficult, as it includes tens of hours of recorded speech along with accurate transcriptions, an on-line dictionary or lexicon which lists how words are pronounced in terms of elementary sound units such as phonemes, and on-line text resources.
The text resources are used to train a language model which helps the recognizer anticipate likely words. The dictionary tells the recognizer how a word will sound in terms of phonemes when it is spoken. The speech recordings are used to learn the acoustic signal pattern for each phoneme. The result is a hierarchy of models which work together to recognize successive spoken words. Relatively little research has been done on building speech recognition systems for languages for which such data resources are not available, a situation which unfortunately holds for all but a few languages of the world.

This project will investigate the use of speech from diverse source languages to build an ASR system for a single target language. We will study promising modeling techniques to develop ASR systems in languages for which large amounts of training data are not available. We intend to pursue three themes. The first concerns the development of algorithms to map pronunciation dictionary entries in the target language to elements in the dictionaries of the source languages. The second theme will be Discriminative Model Combination of acoustic models in the individual source languages for recognition of speech in the target language. The third theme will be development of clustering and adaptation techniques to train a single set of acoustic models using data pooled from the available source languages.

The goal is to develop Czech Broadcast News transcription systems using a small amount of Czech adaptation data to augment training data available in English, Spanish, and Mandarin. The best data for this modeling task would be natural, unscripted speech collected on a quiet, wide-band acoustic channel. News broadcasts are a good source of such speech and are fairly easily obtained. Broadcast news data of other source or target languages, possibly German or Russian, will be used if they become available in a suitable amount and quality.

3.
TOPIC-BASED NOVELTY DETECTION

Computers are increasingly used to manage the large volumes of news and information now available in electronic form. The task of the computer is to organize the incoming data into segments or stories which are related and to index them in a way which makes it easier for the user to digest them. A key problem of digesting new data is deciding which parts contain redundant information so attention can be focused on the new material.

This project proposes to investigate the problem of analyzing newly arrived news stories for two purposes: (1) to decide if the story discusses an event or topic that has not been seen earlier (first story detection); and (2) to identify, within a sequence of stories on the same pre-defined topic, which portions of subsequent stories contain new information and to determine the new named entities that are central to the topic (within-topic novelty detection).

The project will focus on extending and combining Information Retrieval and Natural Language Processing/Information Extraction techniques toward addressing these questions. Specifically, the team will look at identifying who/where/when entities and how to use them in Information Retrieval and other language modeling approaches for addressing this problem. An important component of the proposed project is investigating the impact on the detection results of using (degraded) text put out by a speech recognition system.

The evaluation of the project's results will be based on established measures from the Topic Detection and Tracking initiative in the case of first story detection, and on accuracy of aligning predicted new text with actual new information (as identified by human experts prior to the workshop) in the case of novelty detection.

4.
NORMALIZATION OF NON-STANDARD WORDS

Real text contains a variety of "non-standard" token types, such as digit sequences; words, acronyms and letter sequences in all capitals; mixed case words (WinNT, SunOS); abbreviations; Roman numerals; URLs and e-mail addresses. Many of these kinds of elements are pronounced according to principles that are quite different from the pronunciation of ordinary words. Furthermore, many items have more than one plausible pronunciation, and the correct one must be disambiguated from context: IV could be "four", "fourth", "the fourth", or "I.V."

Normalizing or rewriting such text using ordinary words is an important issue for several applications. For instance, an essential feature of natural human-computer interfaces is that the computer be capable of responding with spoken replies or comments. A Text-to-Speech module synthesizes the spoken response from such text input and must be able to render such items appropriately into speech. In Automatic Speech Recognition, non-standard types cause problems for training acoustic as well as language models. More sophisticated text normalization will be an important tool for utilizing the vast amounts of on-line text resources. Normalized text is likely to be of specific benefit in information extraction applications.

This project will apply language modeling techniques to the creation of wide coverage models for disambiguating non-standard words in English. Its aim is to create (1) a publicly available corpus of tagged examples, plus a publicly available taxonomy of cases to be considered, and (2) a set of tools that would represent the best state of the art in text normalization for English.
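
As a rough illustration of the token types listed in the normalization project (and not the workshop's actual method or taxonomy), the Python sketch below assigns a surface category to a single token. The category names, the regular expressions, and the function name classify_token are all assumptions made for this example. Abbreviations are left out because spotting them reliably would need a word list. The sketch only proposes a candidate type; choosing among readings such as "four" or "the fourth" for IV still requires context, which is the disambiguation problem the project targets.

```python
import re

# Hypothetical category labels; the announcement lists token types
# but does not define a taxonomy.
ROMAN = re.compile(
    r"^(?=[IVXLCDM]+$)M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
URL = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def classify_token(token: str) -> str:
    """Return a coarse surface category for one whitespace-delimited token."""
    if URL.match(token):
        return "url"
    if EMAIL.match(token):
        return "email"
    if token.isdigit():
        return "digits"
    # Checked before all-caps: "IV" is both, and the Roman reading
    # is the one that needs special pronunciation rules.
    if ROMAN.match(token):
        return "roman_numeral"
    if len(token) > 1 and token.isalpha() and token.isupper():
        return "all_caps"
    # Mixed case: a capital somewhere after the first character,
    # plus at least one lower-case letter (WinNT, SunOS).
    if (
        token.isalnum()
        and any(c.isupper() for c in token[1:])
        and any(c.islower() for c in token)
    ):
        return "mixed_case"
    return "ordinary"


if __name__ == "__main__":
    for tok in ["IV", "WinNT", "SunOS", "1999", "NATO", "workshop"]:
        print(tok, classify_token(tok))
```

Note that a surface check like this over-generates: all-capital words such as "MIX" or "CD" also match the Roman numeral pattern, so a real system would treat the label as one candidate among several.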