LINGUIST List 6.1264

Sun Sep 17 1995

Sum: Languages With No Between-word Delimiters

Editor for this issue: Ljuba Veselinova <lveselinemunix.emich.edu>


Directory

  1. Hideo Fujii, Prelim.Summary: languages with no between-word delimiters

Message 1: Prelim.Summary: languages with no between-word delimiters

Date: Sun, 17 Sep 1995 08:02:00
From: Hideo Fujii <fujiimackay.cs.umass.edu>
Subject: Prelim.Summary: languages with no between-word delimiters


Dear people in LINGUIST and NLPASIA-L,

In LINGUIST List (Vol-6-1244, Thu 9/14/95), I submitted the following
question:

>> I want to make a list of languages to classify if a (written) language
>> uses a between-word delimiter (e.g., space in English), or not.
>> That is, if it doesn't have such delimiters, we need to segment
>> for the language processing (by human or computer).
>>
>> You can tell me:
>>	1) Name of the language,
>>	2) Segmentation - Need or No Need,
>>	3) Letters - Use Alphabets (as a group) or not. Or, other graphic
>> group (Cyrillic, Chinese characters, or Own special, etc.).
>> No detail.
>>	4) Note - If you like, short comment.

This is a first, preliminary summary of the responses to this inquiry.
I have also included some of my own data.

So far I have received 12 responses from the following people, and I
would like to thank them:
	From: Shanley Allen <allenmpi.nl>
	From: Philippe Mennecier <ferrycimrs1.mnhn.fr>
	From: Stavros Macrakis <macrakisosf.org>
	From: Dan I. Slobin <slobincogsci.Berkeley.EDU>
	From: Boris Fridman Mintz <fridmanucol.mx>
	From: Allan C Wechsler <Wechslerworld.std.com>
	From: Wolfram Kahl <kahlhermes.informatik.unibw-muenchen.de>
	From: Stefan Frisch <frischbabel.ling.nwu.edu>
	From: Doug Cooper <dougchulkn.car.chula.ac.th>
	From: Nicholas Ostler <nostlerchibcha.demon.co.uk>
	From: Steve Seegmiller <SEEGMILLERapollo.montclair.edu>
	From: Duncan MacGregor <aa735freenet.carleton.ca>

First, here is the list of languages, grouped by whether the written language
has a delimiter symbol for 'word' boundaries in text, like the blank space
between words in English:

Q: Does the language have word-boundary delimiters?
 [YES]: Inuktitut(Eskimo), Amharic, Cherokee(?), Arabic(??),
	 Hebrew(Modern), Yiddish(Judeo-German), Ladino(Judeo-Spanish)

 [NO]: Sanskrit, Thai, Lao, Khmer, Burmese(?), Tibetan, Mongolian(?),
	 Manchu(?), Japanese, Chinese, Korean(?)
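
For anyone who wants to use this list in a program, here is a minimal
Python sketch of it as a lookup table. The names SURVEY,
needs_segmentation and uncertain_entries are made up for illustration, and
the "(?)" marks are kept as an "uncertain" flag; the entries are only the
answers reported above, not verified facts.

```python
# Each entry: language -> (has_delimiter, uncertain).
# Values are copied from the preliminary list above; "(?)" and "(??)"
# in that list are both recorded as uncertain=True.
SURVEY = {
    "Inuktitut": (True, False),
    "Amharic": (True, False),
    "Cherokee": (True, True),
    "Arabic": (True, True),
    "Hebrew": (True, False),
    "Yiddish": (True, False),
    "Ladino": (True, False),
    "Sanskrit": (False, False),
    "Thai": (False, False),
    "Lao": (False, False),
    "Khmer": (False, False),
    "Burmese": (False, True),
    "Tibetan": (False, False),
    "Mongolian": (False, True),
    "Manchu": (False, True),
    "Japanese": (False, False),
    "Chinese": (False, False),
    "Korean": (False, True),
}


def needs_segmentation(language):
    """Return True if the list says the language has no delimiters,
    False if it has them, and None if the language is not in the list."""
    entry = SURVEY.get(language)
    if entry is None:
        return None
    has_delimiter, _uncertain = entry
    return not has_delimiter


def uncertain_entries():
    """Return the languages still marked with (?) in the list, sorted."""
    return sorted(name for name, (_, unsure) in SURVEY.items() if unsure)
```

A language missing from the table (for example Hindi, for which there is
no data yet) gives None rather than a guess.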

Here, I excluded historical/classical/medieval/extinct languages because
they are not a concern of this survey.


I hope I didn't misunderstand what the respondents wrote. If you find a
mistake, or can clarify a (?)-item in this list, please send me a message.

(Hereafter I will call a 'word-boundary delimiter' simply a 'delimiter'.
 Several comments pointed out confusing terminology, such as
 "segmentation" or "separation"; "spaces/blanks", "punctuations",
 "word breaks" or "delimiters". Also, "segmented" can mean either "the text
 is 'segmented' as is" or "the text must be 'segmented' to separate words".
 I will restate my question at the end of this message.)

So far, at least, I have not seen a counter-example to my guess, namely that
most Asian languages do not use delimiters to separate words, regardless of
whether the script is phonetic or ideographic/logographic (except for
languages written in Romanized characters).
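
To illustrate what "needing segmentation" means for text written without
delimiters, here is a minimal Python sketch of greedy longest-match
segmentation against a word list. The function name segment, the tiny
lexicon, and the unspaced English sample string are all made up for
illustration; this is only one simple baseline, not a method proposed by
any respondent.

```python
def segment(text, lexicon):
    """Split undelimited text into words by greedy longest match.

    At each position the longest lexicon entry that matches is taken.
    If nothing matches, a single character is emitted so the scan
    always moves forward.
    """
    max_len = max(len(word) for word in lexicon)
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical lexicon and input, standing in for a language
# that is written without spaces between words.
LEXICON = {"the", "cat", "sat", "a", "at"}
print(segment("thecatsat", LEXICON))
```

With this lexicon the sample string comes out as "the", "cat", "sat".
Greedy matching can choose the wrong boundary when words overlap, which
is one reason the presence or absence of delimiters matters for
language processing.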

Obviously we do not yet have enough data to cover many of the typological
language families. I would like to see data for more languages, and I welcome
further contributions, especially for the languages listed at the end of this
message.

I received several valuable comments, such as:

1) According to Doug Cooper, there are Indian languages which "are segmented,
while others, of similar origin, are not". If so, this implies that a
language's script does not by itself determine whether it has delimiters.

2) Despite Cooper's observation above, "it is probably safe to say that
all modern languages that use a Latin-, Cyrillic, or Greek-based writing
system use a blank space as a delimiter" according to Steve Seegmiller.

I counted how many of the 96 languages in Campbell's Concise Compendium of
the World's Languages (1995) use a Latin-, Cyrillic-, or Greek-based script.
The result was that 63% (61 languages) use one of these three types. This
sample is not typologically balanced, since it is based on the population of
speakers. In any case, the establishment of an orthography is very much a
product of religion or cultural politics in history.

The following are non-Latin/Cyrillic/Greek-based *modern* languages for which
I still have no data:

 Armenian(modern),	Assamese, 		Bengali,
 Buginese, 		Georgian, 		Hindi,
 Kannada, 		Kashmiri, 		Kurdish,
 Lahnda, 		Malayalam, 		Marathi,
 Nepali, 		Panjabi, 		Pashto,
 Persian, 		Sinhalese, 		Sundanese,
 Tamil, 		Telugu, 		Urdu,
 Uzbek

Please send your response directly to me, so that I can later submit the
final summary to LINGUIST and NLPASIA-L. You can tell me:

	1) Name of the language,
	2) Whether the language has word-boundary delimiters or not.
	3) Letter Type: Roman/Greek, Cyrillic, Arabic, Devanagari,
 Hebrew, Chinese, or other group
	4) Note - If you like, short comment.

I appreciate your contribution.

- Hideo Fujii (fujiics.umass.edu)
 University of Massachusetts