Editor for this issue: <>
Alexis Manaster Ramer (amr@CS.Wayne.EDU) just wrote:

> Janhunen 1992 argues that the odds of finding apparent
> matches simply by chance when Japanese is compared to the four
> Altaic languages/subgroups, viz., Turkic, Mongolic, Tungusic, and
> Korean, are four times as high as are the odds of finding such
> spurious matches when Japanese is compared to just one language.
> In other words, Janhunen assumes that a 5-ary comparison is four
> times as likely to produce matches purely by chance (what I call
> 'false positives') as is a binary comparison. This, needless to
> say, is a fallacy, but there you have it.

I was curious to check that, so I turned to my simulation program "chance". With 200 words, a one-in-250 chance of accidental resemblances, and semantic domains of size 8, I saw, after 600 simulations:

For 2 languages:
     31.10 spurious resemblances per simulation attested by 2 languages

For 5 languages:
    141.14 spurious resemblances per simulation attested by 2 languages
     12.01 attested by 3 languages
      0.22 attested by 4 languages
    Total: 153.37 spurious resemblances per simulation.

Yes, 153 chance resemblances out of a sample list of 200 words! That is the cost of allowing semantic shifts.

Then, not allowing semantic shifts, I got, with the same parameters, the following results:

For 2 languages:
     0.535 spurious resemblances per simulation attested by 2 languages

For 5 languages:
     4.907 spurious resemblances per simulation attested by 2 languages
     0.022 attested by 3 languages
    Total: 4.927 spurious resemblances per simulation.

I was extremely surprised at both sets of results: a 5-ary comparison is from 5 to about 10 times as likely to yield spurious resemblances as a binary comparison!

My curiosity piqued, I tried again, with a vocabulary size of 100 words only.
Semantic domain size: 8

For 2 languages:
     14.50 spurious resemblances per simulation attested by 2 languages

For 5 languages:
     83.88 spurious resemblances per simulation attested by 2 languages
      5.80 attested by 3 languages
      0.13 attested by 4 languages
    Total: 89.81 spurious resemblances per simulation.

About seven times as likely.

Semantic domain size: 1 (i.e. no semantic shifts allowed)

For 2 languages:
     0.219 spurious resemblances per simulation attested by 2 languages

For 5 languages:
     2.509 spurious resemblances per simulation attested by 2 languages
     0.008 attested by 3 languages
    Total: 2.517 spurious resemblances per simulation.

About twelve times as likely.

Note that 600 simulations are not enough to get anything near two decimal places accuracy. I just did not have the time to wait for a more reasonable 10,000 iterations to run.

For details on the simulation method itself, see my article in Anthropos Vol.90:223-228 "The Incidence of Chance Resemblances on Language Comparison".

j.guy@trl.oz.au
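The kind of experiment described in the message above can be sketched as a small Monte Carlo run. This is not the "chance" program, whose method is given in the Anthropos article, and it will not reproduce the figures quoted. It is a toy model under stated assumptions: every language assigns each meaning one of 250 equiprobable "form classes" (so two words resemble each other by chance once in 250), no semantic shift is allowed, and all names (`simulate`, `n_classes`, and so on) are hypothetical. It keeps two tallies, because "spurious resemblances" can be counted either against one fixed target language or among all the languages compared.

```python
import random
from collections import Counter


def simulate(n_langs, n_words=200, n_classes=250, runs=1000, seed=0):
    """Return (anchored, all_groups): mean chance resemblances per run.

    anchored   : meanings for which language 0 shares a form class
                 with at least one of the other languages
    all_groups : sets of two or more languages sharing a form class
                 for the same meaning, counted over all languages

    Assumption (not from the original posts): forms are drawn
    independently and uniformly from n_classes classes.
    """
    rng = random.Random(seed)
    anchored = 0
    all_groups = 0
    for _ in range(runs):
        for _ in range(n_words):
            forms = [rng.randrange(n_classes) for _ in range(n_langs)]
            if forms[0] in forms[1:]:
                anchored += 1
            all_groups += sum(1 for c in Counter(forms).values() if c >= 2)
    return anchored / runs, all_groups / runs


if __name__ == "__main__":
    a2, g2 = simulate(2)
    a5, g5 = simulate(5)
    print("anchored on one language: 5-ary / binary =", a5 / a2)
    print("all languages compared:   5-ary / binary =", g5 / g2)
```

Under these assumptions the anchored tally for five languages should come out at roughly four times the binary figure (four comparisons instead of one), while the all-languages tally should come out at roughly ten times (ten pairs instead of one). Which of the two is the relevant count depends on what is meant by a match, so the sketch settles nothing about the argument itself.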
Alexis Manaster Ramer (amr@CS.Wayne.EDU) writes:

> In other words, Janhunen assumes that a 5-ary comparison is four
> times as likely to produce matches purely by chance (what I call
> 'false positives') as is a binary comparison. This, needless to
> say, is a fallacy, but there you have it.

Under what assumptions is it a fallacy? I don't know anything about Altaic or comparative linguistics, but under certain assumptions it seems quite reasonable. In fact, it doesn't matter how ridiculous the assumptions are when the aim is to counter a blanket statement like this one, which doesn't state any. The assumptions I make are blatant oversimplifications, but I don't believe they change the story much compared with reality.

It will depend on what you mean by a match. I assume a match in both form and meaning across a PAIR of languages - viz. a binary match. Clearly a match across all 5 languages is much stronger (more unlikely rather than more likely).

Suppose W is the set of possible words for a given alphabet and length criterion, and Vi is the vocabulary of language Li, randomly and equiprobably selected (with replacement - possible homonymy) from W, and that the Sum over Li of |Vi| << |W| (viz. the number of possible words is considerably greater than the number that occur in any or all of the group of languages being compared). We are interested in the probability of there being a word in common between language L0 and some Li, i in [1,N], for N=1 and N=5.

We make the simplifying assumption that we require a match of both form and meaning - we would get a similar result if we allowed some factor of meaning shift in terms of some lattice of meaning relationships.

Let the set of concepts in each language Li be Ci. We further assume that Ci = C (a universal set of concepts). Let's suppose that for all Li, |Vi| = |V| = |Ci| = |C| (a language-independent constant).
Let's further define a semantic function Mi which maps a word x (in Vi) to a concept c in C for language Li, and assume that any word x is randomly and equiprobably mapped to some c in C.

The assumptions we have correspond to a null hypothesis with all languages INDEPENDENT. In particular, languages 1 to N would be expected to have a total vocabulary V[1,N] of size close to N * |V| after exclusion of their matches. Then we have

    p(x in Vi) = |Vi|/|W| = |V|/|W|
    p(x in Vj for some j in [1,N]) = |V[1,N]|/|W| = N * |V|/|W|

Then the probability of some specified concept c and word x of L0 matching in Lj is

    p(Mj(x)=c | given j, x and c such that M0(x)=c) = 1/|W|

whence the probability that there exists a c and word x of L0 matching in Lj is

    p(Mj(x)=c | given j) ~ |V|/|W|           (we assumed |V| << |W|)

and extending to the case of some Lj for j in [1,N] we have

    p(Mj(x)=c | j in [1,N]) ~ |V[1,N]|/|W|   (we assumed |V| << |W|)
                            ~ N * |V|/|W|    (we assumed |V[1,N]| ~ N * |V|)

Thus the ratio of the number of FALSE matches for a group of N languages to that for a single language is

    R = |V[1,N]|/|V| = N.

The main assumption I have made that DOESN'T hold for N>1 under the COMPARATIVE hypothesis, when the N languages form a language group, is that the N languages are mutually INDEPENDENT. In fact, we are assuming that they will have a SIGNIFICANT number of TRUE matches, quite apart from the FALSE matches which we are exploring in relation to the NULL hypothesis. It all depends on what you mean by SIGNIFICANT!

What this MAY imply is that |V[1,N]| << N * |V| and the ratio of FALSE matches R = |V[1,N]|/|V| << N.

In other words, the ratio you actually get is equal to the ratio of the total vocabulary of the N-language group to that of an individual language. Clearly if you use N identical languages, that ratio is 1, and you are no better off. However, if there is any point in using multiple languages it must be that you expand the number of potential matches - and the above formula for R applies.
In other words, if your use of N languages is going to increase the potential for TRUE matches by N then it will also increase the potential for FALSE matches by N.

Or again, if you choose N languages which are representative of different features of the language group, you will tend to multiply the number of FALSE matches by N; but if you choose N languages which are representative of the core features of the language group and have little extraneous vocabulary, then you gain nothing - either in terms of the number of TRUE matches or the number of FALSE matches.

In yet other words, increasing the number of languages compared doesn't guarantee improving your signal-to-noise ratio. It may however do so if they represent N different subfamilies and the target language has roots in more than one of them.

The best signal-to-noise ratio would, in fact, seem to occur when using the minimum set of related languages - a chicken-and-egg problem we can bypass by accumulating evidence only from languages that in binary comparison pass some significance test, or where the cumulative N-ary results prove more significant than the individual binary results (which may happen if the TRUE matches are relatively INDEPENDENT, in which case R -> N).

It strikes me that R can be determined quite easily for any analysis which has been brought into question, using the above equivalence: R = |V[1,N]|/|V|.

Yours thoughtfully but no doubt ignorantly,
David
--
powers@acm.org             http://www.cs.flinders.edu.au/people/DMWPowers.html
Associate Professor David Powers            David.Powers@flinders.edu.au
SIGART Editor; SIGNLL Chair                 Facsimile: +61-8-201-3626
Department of Computer Science              UniOffice: +61-8-201-3663
The Flinders University of South Australia  Secretary: +61-8-201-2662
GPO Box 2100, Adelaide South Australia 5001 HomePhone: +61-8-357-4220
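The equivalence R = |V[1,N]|/|V| from the message above can be tried out directly on sets. The following is a minimal sketch, not anything from the original post: vocabulary items are stood in for by integers (each one imagined as a form-plus-meaning pairing), the function name `false_match_ratio` is hypothetical, and where the message assumes every language has the same vocabulary size |V|, the code uses the mean size so that it still runs on unequal vocabularies.

```python
import random


def false_match_ratio(vocabularies):
    """R = |V[1,N]| / |V| for a list of vocabularies given as sets.

    |V[1,N]| is the size of the pooled vocabulary (the union, so
    shared items count once). |V| is taken as the mean vocabulary
    size, which is an assumption of this sketch; the original
    argument assumes all vocabularies are the same size.
    """
    pooled = set().union(*vocabularies)
    mean_size = sum(len(v) for v in vocabularies) / len(vocabularies)
    return len(pooled) / mean_size


if __name__ == "__main__":
    rng = random.Random(0)
    possible_words = range(1_000_000)  # stands in for W, with |V| << |W|

    # Four mutually independent vocabularies: R should be close to N = 4.
    independent = [set(rng.sample(possible_words, 1000)) for _ in range(4)]
    print(false_match_ratio(independent))

    # Four identical vocabularies: R is 1, so nothing is gained or lost.
    identical = [independent[0]] * 4
    print(false_match_ratio(identical))
```

Real related languages would fall somewhere between the two cases, which is the point made in the message: the more the N vocabularies overlap, the further R drops below N.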
Alexis Manaster Ramer (amr@CS.Wayne.EDU) presents as an example of a published claim "that binary comparison is preferable to n-ary comparison" the following:

> In his attack on the theory that Japanese is Altaic (and on Altaic
> as a whole), Janhunen 1992 argues that the odds of finding apparent
> matches simply by chance when Japanese is compared to the four
> Altaic languages/subgroups, viz., Turkic, Mongolic, Tungusic, and
> Korean, are four times as high as are the odds of finding such
> spurious matches when Japanese is compared to just one language,
> specifically Korean

There may be a confusion here between two notions of "comparison". In the use of that term which is standard among many linguists, it refers to reconstructing a protolanguage on the basis of data from attested daughter languages. This is a task which one would undertake only after being convinced that the attested languages are in fact genetically related. And in this sense of "comparison", it is hard to imagine any reasonable linguist arguing that binary comparison is preferable to n-ary (unless, of course, there is reason to believe that two specific languages form a genetic subgroup). Clearly, the more data one can bring to this task the better.

But if I understand AMR's example correctly, what is under discussion is the other sense of comparison, which refers to seining data from two or more languages looking for resemblant forms, which are then to be assessed for their value as evidence that the languages are related. This is a very different proposition from the first, although they are certainly not unconnected. (Comparison in this sense, for example, may constitute the groundwork leading to a hypothesis of relationship which then can be pursued by the method of comparative reconstruction).
In this sense there is a problem with n-ary comparison, and it is exactly the one which Janhunen suggests:

> In other words, Janhunen assumes that a 5-ary comparison is four
> times as likely to produce matches purely by chance (what I call
> 'false positives') as is a binary comparison. This, needless to
> say, is a fallacy, but there you have it.

Well, not needless to say, because in fact it doesn't look like a fallacy to me. If I search through the vocabulary of English and Klamath looking for possible cognates, I will certainly find a few resemblant forms that might be candidates. If I extend my search to include Yokuts, Maidu, and Wintu, and thus have (roughly) four times as much vocabulary to search through for English resemblants, don't I have (roughly) four times as much chance of finding some?

Scott DeLancey
delancey@darkwing.uoregon.edu
Department of Linguistics
University of Oregon
Eugene, OR 97403, USA
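The "(roughly) four times" in the message above can be made explicit with a short calculation. This is a sketch under assumptions that are not in the message: each comparison language independently offers a chance resemblant for a given English word with the same small probability $p$, and $k$ is the number of comparison languages. The value $p = 1/250$ is borrowed purely for illustration.

```latex
% Chance of at least one spurious resemblant among k independent languages
P_k = 1 - (1 - p)^k \approx k\,p \qquad \text{for } k\,p \ll 1 .

% Illustration with p = 1/250:
P_1 = 0.004, \qquad
P_4 = 1 - \left(\tfrac{249}{250}\right)^{4} \approx 0.0159, \qquad
\frac{P_4}{P_1} \approx 3.98 .
```

The ratio falls slightly short of 4 because a word that finds resemblants in two languages at once is counted only once. The approximation also depends on independence, so it would overstate the increase for languages that share much of their vocabulary.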
How to Disprove that the Indo-European Languages are Related

1. They are too similar to be genuinely related. While there are a few cases where supposed IE cognates really look dissimilar (e.g., Arm erku and Sindhi b'a '2'), there are MANY MORE forms with the same meanings but looking even more dissimilar if we compare each IE language with some other group, e.g., Polish with Basque, Armenian with Aztec, or Sindhi with Bangu-Bangu.

2. The founder of IE comparative studies, Bopp, also thought that IE included Kartvelian and was related to Austronesian. This obviously undermines the validity of the IE connection itself.

3. Given the rate at which languages supposedly lose vocabulary and the supposed age of Proto-IE, we should only be able to reconstruct at best a small fraction of the so-called Swadesh 100-word list. But actually Indo-Europeanists turn out to reconstruct over 90% of these items, so there is a fundamental contradiction which means that the IE hypothesis must be wrong.

4. All that stuff was published in a foreign language (German), so who can possibly evaluate it or even review it, and why should we bother with it? Only linguistic work published originally in English should count!