LINGUIST List 14.2012

Fri Jul 25 2003

Disc: Re 'Celtic Found to Have Ancient Roots'

Editor for this issue: Karen Milligan <karen@linguistlist.org>


Directory

  1. Paul Purdom, PNAS article on Celtic branching.
  2. Peter Forster, response to Larry Trask's critique

Message 1: PNAS article on Celtic branching.

Date: Thu, 24 Jul 2003 15:31:48 -0500 (EST)
From: Paul Purdom <pwp@cs.indiana.edu>
Subject: PNAS article on Celtic branching.

Recently Larry Trask wrote an unfavorable review of a PNAS article on
the date of Celtic branching (Linguist 14.1876).

I don't want to refute the review point by point. Indeed, many of the
points made in the review are likely to be true. Regardless, while I
know a lot about classification, I don't know that much about
classification of language.

However, I do want to take issue with one major point implied by
the review: the suggestion that people interested in classifying
languages should ignore the methods used in the article. (This is a
separate question from whether they should ignore the conclusions of
the article.)

The problem of classification can largely be divided into:
1. what characteristics to consider when doing the classification 
2. what algorithms to use to convert the data on the characteristics
into a classification.

The biologists who work on phylogenetic trees are experts on point 2,
and they are quite knowledgeable on point 1 when it comes to
characteristics of organisms. There is no particular reason to expect
them to be experts on point 1 when it comes to languages. The review
mainly focused on the defects related to point 1. I believe that much
of the merit of the paper, however, relates more to point 2.

It is interesting to note that at one time most biological
classification was done using complex characters. This work required a
lot of human effort, and expert knowledge was needed to develop good
classifications. Only a few characteristics were used for doing the
classifications. Much useful work was done with that model.

Much of the recent work on biological classification is based on data
that is basically very simple: DNA sequences. Although expertise is
still important, it is much less so. Each piece of data does not have
much information about how the organisms should be classified. 
However, by using large amounts of data (thousands of positions in the
DNA sequence) combined with computer algorithms to do the calculations,
many interesting classifications have been developed. Often these have
agreed with the older classifications, at times they have replaced
them, and at times they have shown limitations in the new
methods. People who are interested in classifying how the older
languages developed into the current ones should pay some attention to
these techniques.
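
As a rough illustration of how many simple data points can be combined
by an algorithm, here is a minimal sketch that counts pairwise
differences between aligned sequences. The four sequences, the taxon
names, and the helper names are all hypothetical and chosen only for
illustration; counting mismatches is one of the simplest possible
measures and is not the method used in the PNAS article.

```python
from itertools import combinations

# Hypothetical toy data: each string holds one taxon's states at 12 aligned sites.
sequences = {
    "A": "ACGTACGTACGT",
    "B": "ACGTACGTACGA",
    "C": "ACGAACTTACGA",
    "D": "TCGAACTTACCA",
}

def hamming(x, y):
    """Count the positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y))

def distance_table(seqs):
    """Return the mismatch count for every unordered pair of taxa."""
    return {
        (p, q): hamming(seqs[p], seqs[q])
        for p, q in combinations(sorted(seqs), 2)
    }

def closest_pair(seqs):
    """Return the pair of taxa with the fewest mismatches."""
    table = distance_table(seqs)
    return min(table, key=table.get)

if __name__ == "__main__":
    for pair, dist in distance_table(sequences).items():
        print(pair, dist)
    print("closest:", closest_pair(sequences))
```

No single site says much on its own, but the table of counts over all
sites already separates the close pairs from the distant ones. The same
bookkeeping would apply to a word list coded as one character per row.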

Paul Purdom, Professor of Computer Science, Indiana University.

Message 2: response to Larry Trask's critique

Date: Fri, 25 Jul 2003 18:18:17 +0100
From: Peter Forster <pf223@cam.ac.uk>
Subject: response to Larry Trask's critique


With respect to our paper, Forster and Toth "Toward a phylogenetic
chronology of ancient Gaulish, Celtic, and Indo-European", available
at

http://www.pnas.org/cgi/content/abstract/100/15/9079

Larry Trask made a number of critical comments which we fear will
cause considerable confusion for the potential readers of our work
(Linguist 14.1876).

Both core issues and peripheral issues are raised by Larry Trask. For
the record, some of the peripheral claims made by Larry Trask are in
error (e.g., concerning publication procedures at PNAS, citation of
the Pennsylvania group, Celtic substrate in Tuscany, networks versus
trees, impact of mutation rates, etc.); in other cases he is right
(e.g., concerning the typographical error of "29"). Rather than
dealing with these details here, we would encourage interested readers
to peruse the paper, the Supplementary Information and the website
Tutorial. For those of you who do not have access to PNAS, note that
all materials are routinely made available on the PNAS website six
months after publication, free of charge.

The core criticisms concern the issue of negative controls, and the
issue of resemblance coding versus cognation coding. As we have
explained in our article and in the Tutorial, cognation coding is
unfortunately not advisable with the fragmentary corpus of ancient
Gaulish, because we would inevitably run the risk of ascertainment
bias. Incidentally, this by no means implies that we reject cognation
coding in general: it is a powerful procedure where applicable, as we
demonstrated for example in our first linguistic network paper on
Alpine Romance languages (Forster et al. 1998). Accepting that we had
to resort to error-prone resemblance coding for the Gaulish analysis,
we needed to test our error rate using a negative control, for which
we chose Basque. As expected, Basque demonstrated that the resemblance
coding entails a noticeable error rate (about 5 spurious identities
out of 35 characters), and we can expect a similar, but invisible,
error for other language pairs in the table. That is the function of a
negative control.
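
The arithmetic behind that error rate can be sketched as follows. The
figures 5 and 35 are the approximate ones quoted above; the function
names are hypothetical, and scaling the rate to another pair assumes,
as the text does, that a similar error applies elsewhere in the table.

```python
def spurious_rate(spurious_matches, characters):
    """Fraction of characters showing a chance identity with the control."""
    if characters <= 0:
        raise ValueError("characters must be positive")
    return spurious_matches / characters

def expected_spurious(rate, characters):
    """Expected number of chance identities among `characters` comparisons."""
    return rate * characters

# Approximate figures from the Basque negative control described above.
rate = spurious_rate(5, 35)

if __name__ == "__main__":
    print(f"spurious identity rate: {rate:.1%}")
    print(f"expected spurious identities in 35 characters: "
          f"{expected_spurious(rate, 35):.1f}")
```

For a pair of genuinely related languages these chance identities are
hidden among the real ones, which is why the rate has to be estimated
from an unrelated control.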

As concerns the coding procedure, we have no disagreement with the
statement that resemblance coding is much more subjective and
difficult than cognation coding; indeed we explicitly made this point
in the Tutorial, and we listed borderline cases in the
article. Unfortunately, Larry Trask seems to go much further than this
by implying in his examples that the coding procedure needs to be
identical between rows of the table (i.e., between different
characters). This is not the case, as it would amount to comparing the
proverbial apples and pears. The coding must only be consistent within
a row (i.e. within a character), regardless of decisions in other
rows, and it is up to the researcher to decide at which resolution to
perform the coding for each character (e.g. whether or not to
distinguish the Chicago and London pronunciations of "herb"). For
tree-generating characters, the level of resolution will have a
bearing neither on the branching order of the tree, nor on the time
estimates.
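
One narrow part of this point can be checked mechanically: the labels
used within a row are arbitrary, because a character only says which
languages share a state. The sketch below uses two hypothetical rows
and a hypothetical helper name; it illustrates label independence
within a row only, and does not by itself demonstrate the claim about
resolution and branching order.

```python
def partition(row):
    """Group language indices by the state they share in one character (row)."""
    groups = {}
    for index, state in enumerate(row):
        groups.setdefault(state, set()).add(index)
    return {frozenset(members) for members in groups.values()}

# Two hypothetical rows for five languages. They use different labels
# but encode the same pattern of shared states.
row_a = ["x", "x", "y", "y", "z"]
row_b = [1, 1, 2, 2, 3]

if __name__ == "__main__":
    print(partition(row_a) == partition(row_b))
```

Since each row is reduced to its own grouping of languages, the coding
scheme chosen for one row places no constraint on the scheme chosen for
another.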

A fuller discussion of these and other issues is given in our
Tutorial, which we have now extended to incorporate the above
comments. Note especially the extended "Frequently Asked Questions"
section.

http://www.mcdonald.cam.ac.uk/genetics/gaulish_tutorial.pdf

Finally, a note of guidance to those who would like to get started on
phylogenetic network analysis on their own data: the best way to get
to grips with phylogenetic network analysis is to carry out examples
by hand. It is not difficult: all you need is a pencil, paper and a
rubber (yes, I know, "eraser" to Americans). I suggest you start with
the simple tree example in the Tutorial, then pass on to the more
complex network example in the PNAS paper. Both examples show
step-by-step how to proceed. Finally, you may wish to tackle the
larger example in the Forster et al. (1998) paper, which does not show
any intermediate steps. Good luck.

Peter Forster

Dr. Peter Forster
Molecular Genetics Laboratory
The McDonald Institute for Archaeological Research
University of Cambridge
Downing Street
Cambridge CB2 3ER
England

tel. +44-(0)1223-339-330
fax. +44-(0)1223-339-285
email: pf223@cam.ac.uk


References

Forster et al. (1998) Evolutionary networks of word lists: visualising
the relationships between Alpine Romance languages. Journal of
Quantitative Linguistics 5 (3): 174-187.

Forster & Toth (2003) Toward a phylogenetic chronology of ancient
Gaulish, Celtic, and Indo-European. Proceedings of the National
Academy of Sciences of the USA. doi:10.1073/pnas.1331158100