Editor for this issue: <>
It is no secret to readers of this list that I take a dim view of comparative linguistics. Some may wonder, then, why I bother writing about the subject. Contrary to what one would be justified in thinking, it is not because I am intent on debunking the many fallacies that litter the topic. I could hardly care less about X, Y or Z claiming to have produced the ultimate classification of Q, W, E, R, T and Y and consequently to have located the cradle of Indo-European speakers, Nostratophones, even Geophones (Proto-World speakers). I am interested because the properties of language families are mirrored in other aspects of language, far more important, and of immediate practical interest.

I have cobbled together this short explanation out of a much longer article which I am still polishing. It is mostly about the reconstruction algorithms in GLOTREE and GLOTLPP (part of the lexicostatistical package in pc/linguistics/glotto02.zip at garbo.uwasa.fi). I had sent a preview of it to Cameron Laird, and you may perhaps (if it's still there) ftp it if you are interested. It's at ftp.neosoft.com in pub/users/claird/sci.anthropology/text/jg_salish.zip. It's in PostScript format, 33 pages long.

Here is what *really* makes me tick. Consider this excerpt from a table of cognate percentages between eight languages:

        C   D   E   F   G   H
    A  20  20  21  22  22  23
    B  40  42  43  40  44  45

Language B scores about twice as many cognates with the rest as A does. This pattern occurs if and only if languages C to H are *external* to A and B, in other words, if A and B are *siblings*, or, in other terms again, if A and B have an *immediate* common ancestor. When they do not, the pattern is breached, thus:

        C   D   E   F   G   H
    A  20  20  21  22  22  39
    B  40  42  43  40  44  45

There, A and B would be siblings if H were not attested, or were removed from the table. (The proof is given in the original article; I feel too lazy to translate the equations and diagrams from Word-for-Windows to straight ASCII. It's not a trivial 5-minute task.)
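The proportionality pattern in the two tables above can be checked mechanically. What follows is only a minimal sketch, not the procedure used in GLOTREE or GLOTLPP: the function names, the use of the median, and the 15% tolerance are all assumptions made for illustration. It computes the ratio of B's score to A's score against each other language and reports the languages whose ratio strays from the typical one.

```python
from statistics import median

# Cognate scores copied from the two fabricated tables above.
EXTERNAL = ["C", "D", "E", "F", "G", "H"]
TABLE_1 = {"A": [20, 20, 21, 22, 22, 23], "B": [40, 42, 43, 40, 44, 45]}
TABLE_2 = {"A": [20, 20, 21, 22, 22, 39], "B": [40, 42, 43, 40, 44, 45]}


def score_ratios(table):
    """Ratio of B's score to A's score against each of the other languages."""
    return [b / a for a, b in zip(table["A"], table["B"])]


def median_ratio(table):
    """Typical B-to-A ratio; the median resists a single aberrant column."""
    return median(score_ratios(table))


def breaking_languages(table, tolerance=0.15):
    """Names of the languages whose ratio strays from the median ratio.

    `tolerance` is an arbitrary relative threshold chosen for this sketch.
    """
    typical = median_ratio(table)
    return [
        name
        for name, ratio in zip(EXTERNAL, score_ratios(table))
        if abs(ratio - typical) / typical > tolerance
    ]


if __name__ == "__main__":
    for label, table in (("first table", TABLE_1), ("second table", TABLE_2)):
        print(label, round(median_ratio(table), 2), breaking_languages(table))
```

Under these assumptions no language breaks the pattern in the first table, while in the second table H is singled out, which matches the remark that A and B would be siblings if H were removed.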
This observation gives us a key to reconstructing the genetic trees of language families whatever the variations in lexical retention rates, and to estimating those retention rates as well. Thus in the fabricated example above, B shares twice as many cognates as A with external languages very simply because it has been twice as retentive since the split of their common ancestor. Real data will exhibit similar patterns, sometimes with surprising differences in retention rates: thus Sakao is 1.5 times as retentive as Akei (both spoken in Espiritu Santo, Vanuatu; see Guy 1982:288,300).

Note that it does not matter whether the figures in a cognation matrix represent percentages or the actual numbers of cognates counted. A language n times as retentive as another will have n times as many cognates with external languages, whether that amount is expressed as a percentage or as an absolute number of cognates.

Consider now a matrix of the frequencies with which the words from an English corpus follow each other, and take, for instance, prepositions. Prepositions will most often be followed by articles, adjectives or substantives, very rarely by other parts of speech. If, say, "in" occurs n times as often as "between", we may expect to observe "in the" occurring approximately n times as often as "between the". Thus the frequency vectors for words occurring in similar environments can be expected to exhibit linear relations similar to those of the cognate scores of sibling languages with external languages.

This is the cause of the results reported by Finch (1993). Finch submits matrices of frequencies of co-occurrence of words or letters to various clustering techniques and produces dendrograms showing for instance "dog, cat, mouse" in one cluster, "girl, boy, woman, man" in another (Finch 1993:112), or, counting individual letters rather than words, showing all vowels in one cluster, all consonants in another (ibid. p.118).
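The "in the" versus "between the" argument above can be tried on a toy text. This is a minimal sketch under loud assumptions: the sentence is made up for the purpose, and cosine similarity is simply one convenient way of testing whether two frequency vectors are proportional; it is not claimed to be the measure used in the work cited.

```python
import math
from collections import Counter

# A made-up toy text, not real corpus data.
TEXT = ("the cat sat in the box in the hat in a bag in the car in the van "
        "in a pot between the cups between the jars between a lamp")
TOKENS = TEXT.split()


def follower_counts(tokens, word):
    """Count the words that immediately follow `word` in `tokens`."""
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == word)


def cosine(u, v):
    """Cosine similarity of two count vectors; near 1 when they are proportional."""
    dot = sum(u[k] * v[k] for k in u)  # a Counter returns 0 for missing keys
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    after_in = follower_counts(TOKENS, "in")
    after_between = follower_counts(TOKENS, "between")
    after_sat = follower_counts(TOKENS, "sat")
    print(dict(after_in), dict(after_between))
    print(cosine(after_in, after_between), cosine(after_in, after_sat))
```

In this toy text "in" occurs twice as often as "between" and each of its followers is twice as frequent, so the two vectors are proportional and their similarity equals 1 up to rounding, whereas "sat" shares no follower with "in". Grouping words by similarities of this kind is in the spirit of the clustering that yields the dendrograms reported by Finch.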
Finch leaves open to speculation the reasons for the success of these procedures. The reason is that like tends to occur in like environments. Vowels, for instance, tend to occur in the immediate environment of consonants, and consonants in that of vowels; words of a given grammatical category tend to occur in the immediate environment of words of those grammatical categories which the syntax of the language allows. As the environment is made wider and wider, syntactic contingencies become weaker and weaker, gradually disappearing until only semantic contingencies can remain.

Cognation matrices and frequency matrices of word or letter co-occurrences share another, surprising, property. Consider this matrix:

        D   E   F   G   H   I
    A  19  19  20   2   3   4
    B   0   1   1  20  19  19
    C  20  20  21  22  22  23

Imagine that it contains cognate scores: A has nineteen words in common with D and E, twenty with F, etc., and is clearly a member of the DEF family, whereas B belongs to GHI. Note now how C is the sum of A and B. Thus C might represent a sample of English, and A and B its Romance and Germanic components.

Imagine now that this matrix contains the frequencies of occurrence of words A, B, and C in environments D...I. Row C might represent the frequency distribution of the word "like", being the sum of the frequencies of "like" (preposition) in row A and "like" (verb) in row B.

The quantitative properties of polysemic tokens are thus the same as those of hybrid languages, and polysemy is mathematically equivalent to hybridization or undetected borrowing. Therefore, a procedure capable of identifying languages affected by hybridization or undetected borrowing ought to identify polysemic words when applied to frequency matrices of word co-occurrences.

Once again, complex, seemingly almost intractable problems of automatic text analysis seem to be reducible to much simpler models.

The two works quoted are:

Finch, S.P. (1993), "Finding Structure in Language". Ph.D. thesis, available in electronic form in /pub/statling/Papers/phdThesis.ps.Z at ftp.cogsci.ed.ac, University of Edinburgh, Scotland.

Guy, J.B.M. (1982), "Bases for New Methods in Glottochronology", in "Papers from the Third International Conference on Austronesian Linguistics", Vol. 1, Halim, Carrington, Wurm (eds). Pacific Linguistics, Canberra.
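A minimal sketch of the additive property noted for the matrix with rows A, B and C above. It assumes that the two component profiles (rows A and B) are already known, which a real procedure for detecting hybrids or polysemic words would have to discover for itself; the function name and the choice of an ordinary least-squares fit are likewise assumptions made for illustration. Note that the printed row C differs from A + B by one in column D (20 against 19 + 0), so the fitted weights come out close to, rather than exactly, 1 and 1.

```python
# Rows copied from the matrix with columns D to I above.
ROW_A = [19, 19, 20, 2, 3, 4]
ROW_B = [0, 1, 1, 20, 19, 19]
ROW_C = [20, 20, 21, 22, 22, 23]


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def mixing_weights(target, first, second):
    """Least-squares weights (w1, w2) with target ~ w1*first + w2*second.

    Solves the 2x2 normal equations directly (Cramer's rule).
    """
    aa, bb, ab = dot(first, first), dot(second, second), dot(first, second)
    at, bt = dot(first, target), dot(second, target)
    det = aa * bb - ab * ab
    if det == 0:
        raise ValueError("component rows are proportional; weights are not unique")
    return (at * bb - ab * bt) / det, (aa * bt - ab * at) / det


if __name__ == "__main__":
    print("C as a mixture of A and B:", mixing_weights(ROW_C, ROW_A, ROW_B))
    print("A as a mixture of A and B:", mixing_weights(ROW_A, ROW_A, ROW_B))
```

A row that draws on both components gets two weights well away from zero, whereas a row belonging to a single family, such as A itself, gets weights of 1 and 0. Read as word frequencies instead of cognate scores, the same fit would flag row C as a candidate polysemic word.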
I am trying to learn what I can from J. Guy's recent posting on comparative techniques and mutation algorithms.

First, two points on which we seem to agree totally (I think so, reading Guy):

1. Any development of techniques which include empirical content on possible changes must rely on attested examples in which the antecedents (proto-Romance for example) are at least approximately known (via attested Latin, a close relative). Discoveries that changes more often go in one direction than in the reverse must in the first instance be based on such attested examples. Since the sample is small, there is much room for error.

2. Many claims which have been made of "natural" linguistic changes going in only one direction are false.

Next, two points on which we seem not to agree (I think so, reading Guy):

3. There are in fact linguistic changes which go in only one direction.

4. Guy seems almost to be denying that comparative-historical linguistics is possible in the absence of written attestations of the earlier stages. This to me is a denial of the very meaning of historical method, which, once established on cases where something approximating an ancestor is attested, should, if validly so established, be usable also on cases where no such ancestor is attested.

3. There are in fact linguistic changes which go in only one direction.

Guy objected to the following (his quotation and emphasis): "This "step by step" is like a minimal series of mutations, with the added information that it is our business to learn which changes (mutation steps) are more natural, and OF COURSE MOST of these go only in one direction". (My emphasis again.)

When I originally mentioned this topic, I was using it to explore one of several reasons why a linguistic account of mutations might not be the same as a computer-algorithm account of mutations. I had in mind mutation "steps", minimal very small changes, with detailed context, rather than global changes.
Perhaps this clarification will help J. Guy to address the questions I had at least intended to ask. I assumed claims of such one-directional changes would be formulated in sufficiently precise terms, not as global changes from for example having a case system to not having one or the reverse, or as one sound changing into another without considering surrounding context. Like Guy, I would mock either of the latter as linguistically naive. But more precisely formulated claims, many of the sort that particular morphological categories often derive from phrases containing originally free words, or that particular strings of adjacent sounds often change by a minimal step (!!!) into other strings of adjacent sounds (including detailed context), are not at all absurd, and in fact quite often valid.

There are celebrated exceptional cases of morphological elements becoming more free, and of sequences of historical changes giving rise to "crazy" synchronic patterns, which if interpreted directly as reflecting a single direct change are highly misleading about possible or normal changes. But these are recognized as exceptions, and quite possibly the conditions under which they can occur may be, if not determinable, at least circumscribable in part.

4. Guy seems almost to be denying that comparative-historical linguistics is possible in the absence of written attestations of the earlier stages. This to me is a denial of the very meaning of historical method, which, once established on cases where something approximating an ancestor is attested, should, if validly so established, be usable also on cases where no such ancestor is attested.

Essentially all historical sciences base their general principles on what there is the best evidence for, and then extrapolate to attempt to render an account of phenomena for which there is less, little, or no direct attestation of earlier stages or origins.
Also, all historical sciences prefer to operate with surviving records of earlier stages, but do not restrict their considerations to such.

Guy says: "Biologists are helped by the fossil record, linguists by documentary evidence, dated or datable. But most of the world's languages lack this evidence. And beyond some 5000 years in the past, the evidence is, in all cases and for all practical purposes, zilch."

The last statement is true in one sense, but false in another, because the synchronic evidence of descendants is also evidence, if we have learned anything at all about preferred paths of change. I feel the same way Guy said he did in his message: why should this have to be repeated? It is self-evident in any field attempting historical reconstructions, whether linguistics, biology, or anything else. Biologists do in fact attempt reconstructions of possible histories based on synchronic descriptions, without using only the fossil record, based on their kind of morphology, DNA studies, etc. The biologists' claim that among plants, fungi, and animals, fungi and animals share some common history (of innovations?) would be merely one of many such cases, using the fossil record as far as it goes, but going beyond it to a deeper level.

If comparative-historical ***techniques*** are not merely techniques for cataloging attestations, but are in fact making empirical claims about how languages have been known to change in the past, then it is certainly legitimate to attempt to extrapolate the application of such techniques to cases where there is no attested (near-relative of an) ancestral language in attempting to make sense of the patterned data of sets of descendant languages. Removing the attestation of a (near-relative of an) ancestral language is only removing one piece of the evidence.
I have used "near relative of an ancestral language" in this paragraph and earlier to emphasize that these questions are, like most in the real world, not ones with absolute answers. In a world of greys, it would be just as possible to deny that the edifice of the Romance family tree has an attested ancestor as to assert that it does. Is any minor deviation of the reconstructed ancestor from attested Latin to be taken to invalidate the techniques? Of course not. Yet the absolute tone of Guy's comments might also be used by someone to suggest it is.

Lastly, places where I am still unclear on what Guy is referring to:

5. Guy says that Hartigan's method was accompanied by a word list in some number of languages, supplied by Dyen. These are presumably real languages, because Guy then says he (not Hartigan?) "applied it on language families computer-generated under the strict condition of a constant universal rate of lexical change." I do not know what effect the assumption of a constant universal rate of lexical change might have on the mutation algorithm, but it is of course claimed by most participants in these discussions to be a contrary-to-fact assumption; that is, it is claimed that lexical items with some semantics change more rapidly than those with other semantics. I can certainly imagine that artificially generated data may have fewer quirks that a computer program could use to detect historical splits. Whether this is the case is one of the things I was trying to ask Guy about (perhaps my question was not clear), and perhaps he has this additional information.

6. I will have to actually read the article Guy refers to, I guess, namely (Experimental glottochronology: basic methods and results. Pacific Linguistics, Canberra, 1980, p.19), because the account Guy extracted from it, namely "The program was fed the wordlists of the simulated language family, and a phylogenetic tree ([26]) drawn from the account of the successive mergings of lists and of the predicted past individual word replacements.", sounds, if taken literally, as if the results that the algorithm was supposed to achieve were fed to it in advance. Obviously, my literal interpretation is not a viable one, so perhaps clarification is possible?

7. Proceeding to what the algorithm did: "The reduced mutation algorithm identified the basic binary split in all experiments, but did not succeed, even once, in reconstructing the subsequent ternary split of ECHO-SIERRA, either as such, or as two successive binary splits." Does this mean that the reduced-mutation algorithm has built into it a preference for binary splits only, never ternary splits? The majority of comparative-historical linguists who raise the issue of binary vs. ternary splits, in my experience, take a position that binary splits are preferred, or should be attempted before multinary, or else that only binary splits are permitted in proper historical reconstruction (weaker or stronger versions). So this is hardly a criticism against a computer algorithm. Reference to "the ... ternary split of ECHO-SIERRA" seems to imply that a ternary split was deliberately built into the data sets? Perhaps again my lack of understanding of the tree "fed" to the program is at issue.

8. And the following I cannot understand at all, since it seems to contradict what was said earlier: "The reasons for the resounding failure of the reduced mutation algorithm are somewhat akin to those for the failure of the traditional lexicostatistical method: the measure of the similarity or of the distance between two languages is based on data from just two wordlists." I thought the data was multiple sets of word lists generated by random mutation?
Otherwise, there is no work for the mutation algorithm to do, if there are only two lists. No matter what the data, it would in that case simply posit one ancestor with the two attested descendants.

"The measure of distance used by the reduced mutation algorithm is furthermore not reconcilable, at least in my eyes, with the linguistic model. Interested readers should refer to Hartigan 1975:233-246." (Ditto, p.33)

The book in question is: Hartigan, John A. Clustering Algorithms. Wiley, New York, 1975.

My attempts in the previous message were precisely to ferret out ways in which the algorithm might be making assumptions contrary to what we linguists know about historical change. J. Guy's response does not (unless I missed them) give any catalog of these, though he does discuss at length the question of "naturalness" of changes. On some of that, I agree with him, though not on all, as explained above.

In summary, I agree with some of what Guy says, believe that some of it needs qualification, and would still like more clarification on the assumptions behind the mutation algorithm he was talking about, since his recent communication does not unambiguously provide this clarification. Indeed, my attempts at giving a completely literal interpretation do not seem to work with much of what Guy wrote on this, and I have not found a more abstract or metaphorical interpretation to make sense of it either.

I would like clarification on points 5, 6, 7 and 8 above. I can of course go to the references Guy mentions, but since he has already read these, I am sure a number of us would very much appreciate it if Guy can further clarify some of the matters concerning the algorithm. I have done my best to understand what he has already provided, and to clarify some of the questions I posed in case they were not easily interpretable.
What is missing is not the things which Guy feels he has repeated many times, which I suspect most of us have in fact understood and agree with in part and not in part (as is anyone's right), but rather further clearly presented information on the assumptions involved in the mutation algorithm. Some of these certainly may go beyond my attempts to guess at them (points 5 to 8 above and the guess that the mutation algorithm does not incorporate any notions of preferences for some mutations over others).

Sincerely,
Lloyd Anderson