LINGUIST List 23.2420
Mon May 21 2012
Diss: Computational Ling: Nojoumian: 'Towards the Development of an Automatic Diacritizer for the Persian Orthography based on the Xerox Finite State Transducer'
Editor for this issue: Xiyan Wang
<xiyan linguistlist.org>
Date: 21-May-2012
From: Peyman Nojoumian <nojoumia usc.edu>
Subject: Towards the Development of an Automatic Diacritizer for the Persian Orthography based on the Xerox Finite State Transducer
Institution: University of Ottawa
Program: Department of Linguistics
Dissertation Status: Completed
Degree Date: 2011
Author: Peyman Nojoumian
Dissertation Title: Towards the Development of an Automatic Diacritizer for the Persian Orthography based on the Xerox Finite State Transducer
Dissertation URL: http://www.ruor.uottawa.ca/en/handle/10393/20158
Linguistic Field(s):
Computational Linguistics
Dissertation Director:
Diana Inkpen
Paul Hirschbuhler
Dissertation Abstract:
Because Persian orthography does not mark short vowels (diacritics), many Natural Language Processing applications for the language, including information retrieval, machine translation, text-to-speech, and automatic speech recognition systems, must disambiguate the input before any further processing. In machine translation, for example, the whole text must first be correctly diacritized so that the correct words, parts of speech, and meanings are matched and retrieved from the lexicon. This requirement stems primarily from the ambiguity of Persian orthography. In fact, the core engine of any Persian language processor should include a diacritizer and a lexical disambiguator. This dissertation describes the design and implementation of an automatic diacritizer for Persian based on the state-of-the-art Finite State Transducer technology developed at Xerox by Beesley & Karttunen (2003). It presents the results of morphological analysis and generation on a test corpus, including the insertion of diacritics. The study also examines the phonological and semantic ambiguities that arise because short vowels are absent from the Persian writing system. It proposes a hybrid model (rule-based and inductive), inspired by psycholinguistic experiments on the human mental lexicon, that disambiguates heterophonic homographs in Persian using frequency and collocation information. A syntactic parser could be developed from the proposed model to detect Ezafe (the linking short vowel /e/ within a noun phrase) or to disambiguate homographs, but its implementation is left for future work.
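As a rough illustration of how frequency and collocation information might be combined to choose among the readings of a heterophonic homograph, the following Python sketch scores each candidate reading. It is not the dissertation's implementation: the function name `disambiguate`, the `collocation_weight` parameter, the romanized forms, and all counts are hypothetical and chosen only for the example.

```python
from collections import Counter

# Hypothetical toy lexicon: undiacritized form -> {reading: corpus frequency}.
# Romanized forms and counts are invented for illustration only.
LEXICON = {
    "krm": {"kerm": 50, "kerem": 30, "karam": 15, "korom": 5},
}

# Hypothetical collocation counts: (reading, neighbouring word) -> count.
COLLOCATIONS = Counter({
    ("kerem", "pust"): 12,  # "cream" near "skin"
    ("kerm", "xak"): 9,     # "worm" near "soil"
})


def disambiguate(token, context, collocation_weight=10):
    """Return the highest-scoring reading of `token` given context words.

    Score = reading frequency + weight * collocation count with the context.
    Tokens missing from the lexicon are returned unchanged.
    """
    readings = LEXICON.get(token)
    if not readings:
        return token

    def score(reading):
        # Counter returns 0 for unseen (reading, word) pairs.
        colloc = sum(COLLOCATIONS[(reading, word)] for word in context)
        return readings[reading] + collocation_weight * colloc

    return max(readings, key=score)


if __name__ == "__main__":
    print(disambiguate("krm", []))        # frequency alone decides
    print(disambiguate("krm", ["pust"]))  # collocation can override frequency
```

With no context the most frequent reading is chosen; a strongly associated neighbouring word can outweigh raw frequency. The additive weighting is an arbitrary design choice for the sketch.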
Page Updated: 21-May-2012