Editor for this issue: Martin Jacobsen <marty@linguistlist.org>
REMINDER

2nd International Symposium on Bilingualism (April, 1999, Newcastle, UK)

Details of the symposium and registration form now available at http://www.newcastle.ac.uk/~nspeech

Deadline for submission of abstract: 31 August, 1998.
********* DEADLINE EXTENSION ANNOUNCEMENT *********
********* NEW SUBMISSION DEADLINE: August 24, 1998

WORKSHOP ANNOUNCEMENT
-------------------

WORKSHOP ON EMBEDDED MT SYSTEMS
CALL FOR PAPERS

Design, Construction, and Evaluation of Systems with an MT Component

Wednesday, October 28, 1998
(preceding the AMTA 98 conference)
Sheraton Bucks County Hotel, Langhorne, Pennsylvania

Introduction

As the strengths and weaknesses of machine translation (MT) engines have become better understood and accepted, there has been a marked increase in the development of computer systems with an embedded MT component. One consequence of this shift to "embedded MT" is that researchers, developers, and users have begun pushing the limits on the input that such systems will accept for translation. In so doing, a new class of problems has surfaced: any input---whether it appears in physical form on paper, in electronic form on-line, or mixed in with another modality such as graphics or video---will bring with it some unknown mix of noisy natural language data as well as non-linguistic data. How are systems with an MT component to be designed and evaluated given the challenge this input brings?

The objective of this workshop is to examine and evaluate techniques for adjusting this "linguistic impedance mismatch" between the real-world input and the natural language input expected by various MT engines. Thus the workshop will focus on computational approaches to preprocessing system input for MT engines and on statistical methods for evaluating systems with an embedded MT component.

Linguistic Preprocessing In Image Data

For researchers working with image data, an effort is currently underway to augment OCR (optical character recognition) engines with linguistic data as they recognize and convert bitmap data into characters---similar to what has already been done in speech recognition with linguistic data in HMMs (hidden Markov models).
Other OCR researchers have also experimented with image-level early topic detection using word-shape recognition. In principle, this could provide a first-step filtering of documents into a more homogeneous MT input set, a desirable goal for MT evaluation. Thus we expect that individuals working with or intending to incorporate OCR into their computer systems will be interested in this new area.

Linguistic Preprocessing in Online Data

For those working with online input, even though the characters are already present, there often still remains the task of preprocessing meaningful, symbolic character strings that are not a part of the text to be translated. For some systems, the rules for identifying and encapsulating or removing such strings may need to be hand-crafted over time as MT engine limitations surface. For others, a combination of hand-crafted rules and statistically trained NL models has worked. Many have observed that HTML annotations, alphanumeric items, and spreadsheet and word-processing codes are harder to weed out than originally expected.

Research efforts with the low-density and less commonly taught languages, as well as more common ones, encounter a substantial problem with variation in spelling conventions and transcription preferences. For those natural languages that are primarily spoken and not written, for example, this is frequently the case. Researchers working on this class of problem have built variants on spell checkers (SC), components that standardize words to one orthography (spelling convention) before submitting them to an MT engine. An idea that has arisen for this component is to build in an option to adjust the level of SC correction---as would be relevant when input after OCR nonetheless varies from very noisy to relatively clean.

Evaluation of Embedded MT Systems

Among those working on statistical methods for evaluating systems with an embedded MT component, we have seen two distinct trends.
One group of statisticians has begun looking for appropriate models from outside the world of MT evaluation, examining the efforts by others to take distinct metrics for components and combine them for an overall system-level metric using fuzzy mathematics. Another group of researchers is looking instead at developing a one-dimensional scale for ranking MT engines along a continuum defined by system-level function. That approach, for example, might rank one engine as good enough for filtering documents, while another engine deemed more linguistically robust would be ranked higher because it could generate a good enough initial translation for subsequent post-editing. We welcome other functional evaluations of MT components and computer systems with embedded MT components as well.

SUBMISSIONS

Submitters are invited to send in a short paper, not more than 5 pages, addressing one or more of the three areas discussed above. Papers should define the problem in an embedded MT system that is the focus of the work, describe the embedded MT system design (a simple sketch) with sample input data where relevant, and present their approach to the problem. Work at various stages of completion is acceptable; we expect the current status of the work to be made clear. Submission of end-to-end output of an embedded MT system is especially encouraged.

The papers will be collected and distributed to participants of the workshop. Ideally, the result of the workshop will be a clearer delineation of:

(1) the range of linguistic preprocessing problems
(2) the range of designs in embedded MT systems
(3) how these problems are treated in different embedded MT systems
(4) the metrics that are being used to evaluate these systems and their components.

DATES

Notice of interest in participation: July 10, 1998 (to voss@arl.mil)
Please identify which of the three areas you intend to address: preprocessing in image data, preprocessing in online data, evaluation of embedded MT systems.

Position paper submission: August 10, 1998 NOTE: Now, August 24, 1998
Notifications: September 10, 1998 NOTE: Now, September 17, 1998
Final copies of papers: October 10, 1998
Workshop: October 28, 1998

Submissions may be in printed or electronic form. Submissions should be sent to:

Clare Voss
Army Research Laboratory
AMSRL-IS-CI
2800 Powder Mill Road
Adelphi, MD 20783
phone: (301) 394-5615
fax: (301) 394-3903
e-mail: voss@arl.mil

The registration fee for the conference is $50. Non-presenters will be accepted on a first-come, first-served basis. We strongly encourage the participation of embedded MT system users, as well as members of the research and development communities.

A copy of the call, the registration form, and further update information is available via a link at: <http://rpstl.arl.mil/isb-south/>
Look for the Conferences and Workshop link.