LINGUIST List 14.1876

Mon Jul 7 2003

Disc: New: Re 'Celtic Found to Have Ancient Roots'

Editor for this issue: Karen Milligan


  • Larry Trask, Re: 14.1825, Media: NYT: Celtic Found to Have Ancient Roots

    Message 1: Re: 14.1825, Media: NYT: Celtic Found to Have Ancient Roots

    Date: Mon, 07 Jul 2003 15:26:02 +0100
    From: Larry Trask
    Subject: Re: 14.1825, Media: NYT: Celtic Found to Have Ancient Roots

    Last week Anthony Aristar drew our attention to a story in the New York Times about a recent article in the Proceedings of the National Academy of Sciences of the USA (Linguist 14.1825). The article proposes a new way of drawing family trees in historical linguistics, and it presents a (partial) tree for Indo-European, with particular emphasis on Celtic. The authors' conclusions are surprising in two respects. First, they conclude, contrary to the usual view, that the Insular Celtic languages form a single taxon within Celtic. Second, they claim to be able to estimate absolute dates for the splits in their tree, and they propose dates that are vastly earlier than those commonly accepted: ca. 8100 BC for the break-up of PIE, and ca. 3200 BC for the split of Insular from Continental Celtic.

    Anthony notes that one of the authors is quoted as making some insulting remarks in the Times article, implying that we linguists are too dumb to understand his important work. Anthony asks if anyone has had a look at the article, which is this:

    Peter Forster and Alfred Toth. 2003. 'Toward a phylogenetic chronology of ancient Gaulish, Celtic, and Indo-European'. PNAS. Available on line: .

    Well, I've now worked through the article very carefully, and I have a great deal to say about it. Depending on taste, you may find my account shocking, depressing, or just funny.

    1. The linguistic background

    This is an attempt at drawing genetic trees using character-states. A character is a linguistic "slot", a meaning or a function that must be provided by some linguistic material. A state is a piece of linguistic material filling that slot.

    For example, the meaning 'man' is a character. If -- as is commonly done -- we take cognation (common ancestry) as the basis for assigning states, then Latin <vir> is one state, English 'man' and German <Mann> are a second state, French <homme>, Spanish <hombre> and Italian <uomo> are a third state, modern Greek <andras> is a fourth state, Basque <gizon> is a fifth state, and so on.
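
    As a rough illustration of the bookkeeping involved, the 'man' example can be written down as a small table. This is only a sketch: the state labels A to E are arbitrary names introduced here, and the helper `group_by_state` is hypothetical, not part of any published method.

```python
from collections import defaultdict

# The 'man' character from the text: each language's word, tagged with a
# cognate-class label. The labels A-E are arbitrary names for the states.
MAN = {
    "Latin": ("vir", "A"),
    "English": ("man", "B"),
    "German": ("Mann", "B"),
    "French": ("homme", "C"),
    "Spanish": ("hombre", "C"),
    "Italian": ("uomo", "C"),
    "Modern Greek": ("andras", "D"),
    "Basque": ("gizon", "E"),
}


def group_by_state(character):
    """Return {state: sorted list of languages sharing that state}."""
    groups = defaultdict(list)
    for language, (_word, state) in character.items():
        groups[state].append(language)
    return {state: sorted(langs) for state, langs in groups.items()}


groups = group_by_state(MAN)
for state, langs in sorted(groups.items()):
    print(state, langs)
```

    The point of the encoding is that the states record a judgement about common ancestry, not about how much the words look alike.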

    Once determined, these character-states are used to construct trees according to criteria selected by the investigators. One criterion commonly used is this: each innovation should appear at only one branching point in a tree. Another is this: the tree should be robust -- that is, it should be little altered by a different choice of characters.
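
    For binary characters, the first criterion can be checked pairwise. A minimal sketch follows, assuming the standard compatibility test from phylogenetics: two binary characters can both sit on one unrooted tree, each changing only once, unless all four combinations of states turn up among the languages. The languages L1 to L4 and the characters c1 to c3 are invented toy data, not taken from any real study.

```python
from itertools import combinations


def compatible(char_a, char_b):
    """Four-combination test for two binary characters.

    Each character maps language -> 0 or 1. The pair fits on a single
    unrooted tree with one change per character unless all four
    combinations (0,0), (0,1), (1,0), (1,1) are observed.
    """
    shared = char_a.keys() & char_b.keys()
    seen = {(char_a[lang], char_b[lang]) for lang in shared}
    return len(seen) < 4


# Hypothetical toy data: four invented languages, three invented characters.
CHARACTERS = {
    "c1": {"L1": 1, "L2": 1, "L3": 0, "L4": 0},
    "c2": {"L1": 1, "L2": 0, "L3": 0, "L4": 0},
    "c3": {"L1": 1, "L2": 0, "L3": 1, "L4": 0},
}

# Collect every pair of characters that cannot share a tree without
# some innovation appearing at more than one branching point.
conflicts = [
    (a, b)
    for a, b in combinations(sorted(CHARACTERS), 2)
    if not compatible(CHARACTERS[a], CHARACTERS[b])
]
print(conflicts)
```

    Here c1 and c3 conflict, so any tree must treat one of them as having changed more than once. Real methods search for the tree that minimizes such conflicts across all the characters.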

    Now, linguists have been working on methods of this sort for some time. Most prominently, the group based at the University of Pennsylvania has been developing such methods for quite a few years now, and it has published a number of reports, including at least one presenting a tree for IE. But the present authors do not cite any of this work, and they seem to be unaware of its existence. They appear to believe that they are the first people ever to try such an approach in linguistics. This is not a good sign.

    The authors, who are geneticists, propose an approach taken from genetics, where they say it has been very successful, and they believe it should be just as successful in our field. We'll see.

    The authors also claim that their method does not force trees, but that it is consistent with the presence of networked structures (reticulations), in which branches of the tree are cross-connected. They tell us, in fact, that their method marries the tree model, with its rigid binary branching, with Father Schmidt's wave model.

    However, it is clear that they do not understand what the wave model is. Their account of it is badly confused, and they describe it repeatedly as no more than an account of borrowing between distinct speech varieties. But this misses the point altogether. Wave theory is an alternative to binary branching, in which innovations spread out from any number of different centers, producing a dialect continuum, rather than a tree-like arrangement with sharply distinct varieties. Again, this bad misunderstanding is not encouraging.

    2. Why in PNAS?

    Why have the authors published in PNAS? PNAS is not a journal of linguistics, and few linguists read it. Since the only possible readership for the article is historical linguists, why did they not submit it to a journal of linguistics, where it would be seen?

    There's more. PNAS enforces strict limitations on space. As a result, the writing style of this article is uncomfortably terse and compressed. In a number of cases I found myself wanting more information, more detail -- but I didn't get it, because of the space limitation.

    Even worse, some critical information on the authors' procedure is absent altogether from the article, and has been relegated to a Website, which the reader must consult in order to find out why the authors have made some important decisions. This is infuriating. That information is central to the authors' case, and it really ought to be in the article.

    PNAS was a very bad idea.

    3. Choosing the characters

    The languages chosen include 13 IE languages, ancient and modern, plus Basque, described as a "negative control".

    The authors have a special interest in Celtic, including the extinct and sparsely recorded Gaulish. They therefore expressly choose characters which they believe are well represented in Gaulish, and more particularly in Gaulish-Latin bilinguals. But the characters they choose are very odd.

    Their first choice is the difference between Subject-Verb word order and Verb-Subject order. Now, I have never before seen it suggested that SV versus VS is of any great linguistic interest, and the authors seem to have chosen this odd item expressly because it separates the VSO Insular Celtic languages from all the others. But it is clearly out of order to choose eccentric characters solely in order to single out groupings you hope to pick out (see below).

    Another character is the presence or absence of the cluster /ps/, described as present in Greek, Latin and English but absent elsewhere. But this is absurd. The authors appear to believe that /ps/ was present in PIE, and that it survived in these three languages but was lost elsewhere. However, I know of no evidence for a cluster /ps/ in PIE. The Latin and Greek instances, I think, are all acquired, not inherited. Some instances arise by borrowing; others arise at morpheme boundaries; and still others arise by various phonological changes. And the native English examples all occur at morpheme boundaries, as in 'cups', 'slaps' and 'upside-down'. This character could hardly be more pointless.

    Some of their other "characters" are no such thing. They choose entire phrases, like 'and to men'. But this is not a character: it's a whole cluster of lexical and grammatical characters. What happens when two languages match in some respects but differ in others? How can states be sensibly assigned?

    All in all, the list of characters is poorly thought out. And there are further problems with this list, as we'll see later.

    4. Assigning the states

    It's at this point that the authors' method falls completely to pieces.

    States are usually assigned on the basis of cognation. But the authors reject this approach. Why? Because, they say, appealing to cognation automatically implies a particular tree, and a tree is what they're trying to find, so working with cognates is "circular".

    But this is nonsense. Establishing cognation requires no appeal to trees at all. In fact, we don't even try to draw any trees until we have first established enough cognates to give us material to work with. The authors could not be more confused than they are.

    Anyway, having rejected cognation, the authors now require some other criteria for assigning states. What criteria do they come up with?

    Nothing. Nothing at all. They offer *no* criteria. Instead, they make it up as they go along.

    What they do is to appeal to an unexplained and wholly subjective notion of "similarity". Two items are assigned to the same state if the authors judge them to be similar, but to different states if the authors judge them to be dissimilar. Let's see what that means in practice.

    Latin <filia> 'daughter' and its Spanish descendant <hija> are assigned to different states, because the authors judge them to be dissimilar. But the Gaulish inflected form <teuo-> 'to gods' and the Scottish Gaelic prepositional phrase <do dhiadhan> are assigned to the same state, because the authors judge them to be similar. Why are they similar?

    Breton <forn> 'oven' is assigned to the same state as Spanish <horno>, but to a different state from Irish <sorn>. Italian <e> 'and' is assigned to a different state from its Spanish cognate <y>, but to the same state as the unrelated Basque <eta>. (Spanish <y> has a positional variant <e>, but apparently that doesn't matter.) On the other hand, the Gaulish genitive suffix <-i> is assigned to the same state as Greek <-ou>. So, /i/ resembles /u/ but not /e/. How do the authors come by these remarkable insights?

    Normally, an overt suffix is counted as different from zero suffix. However, Latin feminine <-a> is assigned to the same state as French <-e>, even though in Parisian French that orthographic <-e> is purely decorative, and the suffix is zero.

    I could go on in this vein, but you get the idea. There is no rhyme or reason in the assignment of states, and the authors' procedure is as capricious as it is unexplained.

    At this point, the work under discussion abandons the discipline of linguistics altogether, and in fact it ceases to be anything recognizable as serious scholarship. Linguistics cannot be done in terms of subjective notions of similarity. This is the kind of sludge we see in those lurid articles claiming to have reconstructed "Proto-World", and in those delightful Websites announcing "Latvian -- the key to all languages".

    Whatever other virtues the authors' method may have, this shocking procedure is enough to reduce their proposal to worthlessness.

    An observation. Since <forn> and <sorn> are assigned to different states, even though they differ only in their initial segments, it appears to me that our authors must, if they want to maintain any consistency at all, assign the Chicago and London pronunciations of the word 'herb' to different states. After all, these have nothing in common beyond their final /b/: Chicago has /Vrb/, with /r/ but no /h/, while London has /hVb/, with /h/ but no /r/. Of course, this outcome is ridiculous, but that's what happens when you make it up as you go along.

    Out of curiosity, I tried to apply the authors' method to English and German. But I couldn't, because I had no way of knowing what should be counted as similar. Are 'first' and <erste> similar, like <forn> and <horno>, or dissimilar, like <forn> and <sorn>? What about 'day' and <Tag>? These words have no segments in common at all. That makes them more different than <forn> and <sorn>, but then recall that <teuo-> and <do dhiadhan> are counted as similar. It appears that I can't apply the authors' method unless I have the authors looking over my shoulder and telling me what to count as similar. And this is supposed to be science?
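
    The inconsistency can be made concrete with a small sketch. It uses plain Levenshtein edit distance on the spellings, a stand-in chosen purely for illustration: nothing in the article suggests the authors used it, and spelling distance is not a sound basis for comparison either. The form <teuo-> is entered without its hyphen.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


PAIRS = [
    ("forn", "sorn"),         # judged dissimilar by the authors
    ("forn", "horno"),        # judged similar by the authors
    ("teuo", "do dhiadhan"),  # judged similar by the authors
]
distances = {pair: levenshtein(*pair) for pair in PAIRS}
for pair, d in distances.items():
    print(pair, d)
```

    On this measure <forn> and <sorn> are one edit apart and <forn> and <horno> are two, while <teuo-> and <do dhiadhan> are far more distant than either. So even a crude explicit measure ranks the pairs the opposite way from the authors, which shows that their judgements follow no rule a reader could reconstruct.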

    There is much more. The authors assign to the same state practically all of the words for 'crane', including Latin <grus>, Irish <corr mhóna>, Breton <garan>, and Basque <kurrillo> -- but *not* Welsh <crychedd>, for some reason. Now, I have seen it seriously suggested that at least some of these names are imitative in origin, reflecting the bird's distinctive cry. I don't know if that's true or not, but clearly the authors have taken no steps to exclude imitative forms -- a serious shortcoming in comparative work.

    Further, the authors mysteriously assign Welsh <mam> 'mother' to a different state from Latin <mater>, but to the same state as Basque <ama>. However, the Welsh and Basque words are mama/papa words, and every linguist knows that mama/papa words are useless in comparison. But our authors don't know this, and they solemnly report these items and assign them to states, as though they were doing something sensible.

    The authors assign English 'day' and Latin <dies> to the same state, but these words are unrelated, and they resemble each other purely by chance. Like everybody who tries to work with similarities, the authors are helpless to exclude chance resemblances.

    So, imitative words, mama/papa words, chance resemblances -- the authors have committed every schoolboy howler I can think of. They badly need a course in historical linguistics.

    I'm not done yet. The Basque data presented here contain a number of errors, some of them very serious. For example, they report a "nominative singular suffix" <-a> for Basque, and they assign this to the same state as the nominative singular feminine <-a> of Latin and some other languages. But Basque, with its wholly ergative morphology, doesn't even have a category of nominative, let alone a nominative ending, and what the authors are reporting is merely the definite article <-a>.

    What the authors report as the Basque "dative" suffix is in fact the benefactive suffix (and even this is given wrongly). The real dative ending, <-i>, happens to occur twice in the phrases 'to gods' and 'and to men', but the authors fail to notice this.

    Of course, Basque is only the control language here, but the errors in the Basque data are so numerous and so serious that I have to wonder whether similar errors might be lurking in the data for the IE languages I don't know, like Occitan and Scottish Gaelic.

    5. Drawing the tree

    The authors draw their tree by hand. Their first step is to throw away all the characters which, in their opinion, produce results that are too messy -- that is, insufficiently tree-like. Of their 35 characters, they throw away seven for this reason, and those characters are not used at all. Well, I'm exceeding my competence here, but this doesn't look very principled to me. Is it really OK to throw away all the data that give you results you don't like?

    By the way, having jettisoned seven of their 35 characters, the authors announce that they have 29 left. This is a trivial point, of course, but it does nothing to instill confidence in the care and attentiveness of the authors.

    The authors go on to begin their tree with the binary characters (those with only two states, which include the really silly ones like SV and /ps/). Then they decompose the ternary characters into binary segments. Any character which fails to give a sufficiently tree-like graph is postponed for later use. In short, they do everything they can to force binary branching and thus a conventional tree. Only at the last is a little reticulation admitted.

    Only one tree is drawn. There is no searching of tree space, and so this is not a "best tree" method. The authors do not check their tree for robustness, by testing it with different characters. One tree is all we get.

    The tree is rootless, but the authors insert a root, representing PIE, at a point of their choosing.

    I'll have to leave it to someone more competent to pass judgement on this tree-drawing procedure. But it looks fishy to me.

    By the way, I don't understand the function of the control language. Having reported the Basque data (badly), and having solemnly assigned states, the authors then forget all about the language. Since it is reported as sharing a few character-states with some of the other languages, why is it not included in the tree? Earlier, the authors rejected the use of cognation on the ground that it supposedly implies something about the tree which they are trying to draw. But they seem to have omitted Basque from their tree for no reason apart from an *a priori* belief that Basque shouldn't be in the tree. Is this consistent?

    6. The Celtic results

    The authors report that their tree shows that the Insular Celtic languages form a unit, separated from the rest of Celtic, represented here by Gaulish. But this is a *big* blunder.

    Apart from a single character which is unique to Gaulish and so irrelevant, the Insular Celtic languages as a group are separated from Gaulish by only three character-states. Let's look at these.

    The first is that VS word order. But I complained earlier that this character was introduced *ad hoc* merely to do the job of singling out Insular Celtic. This is out of order.

    But it's much worse than that. Gaulish is recorded many centuries earlier than our first records of the Insular languages. Since Proto-Celtic is widely thought to have had SOV order, it is quite possible, and even likely, that VSO order had not yet developed in the Insular languages at the time when Gaulish was recorded. (Note that our earliest records of Irish show some evidence of SOV order.) It is also possible that Gaulish itself would have developed VSO order if it had survived until the time when our records of Insular Celtic begin. Note that, while Gaulish is predominantly SVO, it exhibits VSO order in certain constructions. Very likely these constructions represent the first steps in the process which led eventually to the introduction of VSO order in the Insular languages, many centuries later.

    The other two character-states separating the Insular languages from Gaulish are both ancestral case-suffixes which survived in Gaulish but are gone in the Insular languages. But, again, note the huge time difference between Gaulish and Insular Celtic. Very likely, those suffixes were still present in the Insular languages, too, at the time when Gaulish was recorded. And quite possibly they would have disappeared in Gaulish as well, if Gaulish had survived to the time when Irish and Welsh were recorded.

    The authors' case does not stand up. They have in fact presented no evidence at all for an Insular Celtic unity. All they have done is to note the passage of time and the linguistic changes which go with it. I think I am not going too far when I say that their claim for an Insular unity is foolish and linguistically naive.

    One more point, involving that silly /ps/ character. Noting the absence of /ps/ from Celtic, and noting that Latin /ps/ has disappeared in the Romance languages, the authors attribute this disappearance to a Celtic substrate! Er -- a Celtic substrate in Tuscany? I don't think so.

    Anyway, the authors, focusing on their beloved /ps/, have failed to notice that quite a number of original and secondary Latin clusters were eliminated in Romance, including at least one -- medial /kl/ -- which is present in Gaulish. So much for a Celtic substrate. Gad.

    7. The dating

    The authors claim to be able to assign moderately reliable absolute dates to branching events in their tree, and they do this, producing the astoundingly early dates given earlier.

    But it is clear that they have merely re-invented glottochronology. They claim that the rate of replacements is approximately constant -- a position known to be false. Drawing a parallel with genetics, they insist that any fluctuations in the rate of replacement will average out successfully over time. Well, this may be true in genetics, where the time depths in question are millions or tens of millions of years. But they have given us no reason to suppose that it is true in linguistics, where the time spans are only a few thousand years. Their dating claims are based upon unsupported assertions, and assertions which are extremely unlikely to be true.
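
    To show how much weight the constant-rate assumption carries, here is a sketch of the classical glottochronological formula, t = ln(c) / (2 ln(r)). This is the traditional formula, not the authors' own calculation. The shared fraction c = 0.5 is invented, and the three rates are chosen only to bracket the figure of roughly 0.86 per millennium often quoted for the 100-word list.

```python
import math


def divergence_time(shared_fraction, retention_rate):
    """Classical glottochronological estimate, in millennia.

    t = ln(c) / (2 * ln(r)), where c is the fraction of cognates two
    languages share and r is the assumed retention rate per millennium.
    """
    if not (0 < shared_fraction <= 1 and 0 < retention_rate < 1):
        raise ValueError("need 0 < c <= 1 and 0 < r < 1")
    return math.log(shared_fraction) / (2 * math.log(retention_rate))


# Illustrative values only: c = 0.5 is invented, and the rates bracket
# the figure of about 0.86 traditionally quoted for the 100-word list.
estimates = {r: divergence_time(0.5, r) for r in (0.80, 0.86, 0.90)}
for r, t in estimates.items():
    print(f"r = {r:.2f}: about {t:.2f} millennia")
```

    With the same data, moving the assumed rate from 0.80 to 0.90 roughly doubles the estimated age of the split. Any date obtained this way is only as good as the claim that the rate is constant.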

    Anyway, recall that all this rests upon what has gone before. At least glottochronology operates with a principled and rigorous definition of replacement. But our authors don't: their notion of a replacement is wholly unprincipled and capricious. They are therefore claiming that the rate of occurrence of the events they capriciously choose to call "replacements" is constant. Is there any point in taking this seriously?

    8. Summing up

    This paper is a disaster. There is no reason for any linguist to pay any attention to it. The procedure described is capricious, unprincipled, arbitrary, *ad hoc* and subjective from beginning to end.

    One of the authors remarks about this work, in the Times article, "To be honest, [linguists] don't understand it, most of them. They don't even know what I'm talking about." Well, for my part, I don't understand how this mess made it into print. I can't believe that a competent referee would allow this stuff to pass. But, interestingly, PNAS is not peer-reviewed. Hmmm. This is not the first time that PNAS has published some extremely dubious work in historical linguistics.

    In this case, I'm afraid, our scientific colleagues have nothing to teach us. Instead, they have a great deal to learn from us, about the use of principled and rigorous procedures, and even, it would seem, about collecting accurate data. And they could certainly do with a few lessons in linguistics.

    Larry Trask COGS University of Sussex Brighton BN1 9QH UK