Editor for this issue: Ljuba Veselinova <lveselin@emunix.emich.edu>
Dear people in LINGUIST and NLPASIA-L,

In Linguist List (Vol-6-1244, 9/14/95 (Thu)), I submitted the following question:

>> I want to make a list of languages to classify if a (written) language
>> uses a between-word delimiter (e.g., space in English), or not.
>> That is, if it doesn't have such delimiters, we need to segment
>> for the language processing (by human or computer).
>>
>> You can tell me:
>> 1) Name of the language,
>> 2) Segmentation - Need or No Need,
>> 3) Letters - Use Alphabets (as a group) or not. Or, other graphic
>>    group (Cyrillic, Chinese characters, or Own special, etc.).
>>    No detail.
>> 4) Note - If you like, short comment.

This is a first, preliminary summary for this inquiry. I have also included some of my own data. So far I have received 12 responses from the following people, and I want to thank them all.

From: Shanley Allen <allen@mpi.nl>
From: Philippe Mennecier <ferry@cimrs1.mnhn.fr>
From: Stavros Macrakis <macrakis@osf.org>
From: Dan I. Slobin <slobin@cogsci.Berkeley.EDU>
From: Boris Fridman Mintz <fridman@ucol.mx>
From: Allan C Wechsler <Wechsler@world.std.com>
From: Wolfram Kahl <kahl@hermes.informatik.unibw-muenchen.de>
From: Stefan Frisch <frisch@babel.ling.nwu.edu>
From: Doug Cooper <doug@chulkn.car.chula.ac.th>
From: Nicholas Ostler <nostler@chibcha.demon.co.uk>
From: Steve Seegmiller <SEEGMILLER@apollo.montclair.edu>
From: Duncan MacGregor <aa735@freenet.carleton.ca>

First, here is the list of languages, grouped by whether each has a delimiter symbol for the 'word' boundary in text, like the blank space between words in English:

Q: Does the language have word-boundary delimiters?

[YES]: Inuktitut (Eskimo), Amharic, Cherokee(?), Arabic(??), Hebrew (Modern), Yiddish (Judeo-German), Ladino (Judeo-Spanish)

[NO]: Sanskrit, Thai, Lao, Khmer, Burmese(?), Tibetan, Mongolian(?), Manchu(?), Japanese, Chinese, Korean(?)

Here, I excluded historical/classical/medieval/extinct languages because they are not a concern of this survey. I hope I have not misunderstood what the respondents wrote. If you find a mistake, or can clarify a (?) item in this list, please send me a message.

(Hereafter I will call a 'word-boundary delimiter' simply a 'delimiter'. There were comments about the confusion of terminology: "segmentation" or "separation"; "spaces/blanks", "punctuation", "word breaks" or "delimiters"; and "segmented", which can mean either "the text is 'segmented' as is" or "the text must be 'segmented' to separate words". I will restate my question at the end of this message.)

At least so far, I have not seen a counter-example to my guess, i.e., that most Asian languages do not have delimiters to separate words, whether their letters are phonetic or ideo/logographic (except languages written in Romanized characters). Obviously we do not have enough data to cover many of the typological language families, and I would like to see data on more languages. I welcome your further contributions, especially for the languages listed at the end of this message.

I got several valuable comments, such as:

1) According to Doug Cooper, there are Indian languages which "are segmented, while others, of similar origin, are not". If so, it implies that a language's script is not a decisive factor in whether it has delimiters or not.

2) Despite Cooper's observation above, "it is probably safe to say that all modern languages that use a Latin-, Cyrillic, or Greek-based writing system use a blank space as a delimiter", according to Steve Seegmiller. I counted the frequency of Latin-, Cyrillic-, or Greek-based languages among the 96 languages in Campbell's Concise Compendium of the World's Languages (1995). The result was that 63% (61 languages) use one of these three types. This data is not a typologically fair sample, since it is based on the population of speakers; in any case, the establishment of an orthography is very much a product of religion or cultural politics in history.

The following are non-Latin/Cyrillic/Greek-based *modern* languages for which I still do not have data:

Armenian (modern), Assamese, Bengali, Buginese, Georgian, Hindi, Kannada, Kashmiri, Kurdish, Lahnda, Malayalam, Marathi, Nepali, Panjabi, Pashto, Persian, Sinhalese, Sundanese, Tamil, Telugu, Urdu, Uzbek

Please send your response directly to me, so that I can submit the final summary to LINGUIST/NLPASIA-L later. You can tell me:

1) Name of the language,
2) Whether the language has word-boundary delimiters or not,
3) Letter type: Roman/Greek, Cyrillic, Arabic, Devanagari, Hebrew, Chinese, or other group,
4) Note - If you like, a short comment.

I appreciate your contribution.

- Hideo Fujii (fujii@cs.umass.edu)
  University of Massachusetts