Editor for this issue: Karen Milligan <karen@linguistlist.org>
Recently Larry Trask wrote an unfavorable review of a PNAS article on the date of Celtic branching (Linguist 14.1876). I don't want to refute the review point by point. Indeed, many of the points made in the review are likely to be true. In any case, while I know a lot about classification, I don't know that much about the classification of languages. However, I do want to take issue with one major point implied by the review: the suggestion that people interested in classifying languages should ignore the methods used in the article. (This is separate from the question of whether they should ignore the conclusions of the article.)

The problem of classification can largely be divided into:

1. what characteristics to consider when doing the classification
2. what algorithms to use to convert the data on the characteristics into a classification.

The biologists who work on phylogenetic trees are expert on point 2, and they are quite knowledgeable on point 1 when it comes to characteristics of organisms. There is no particular reason to expect them to be expert on point 1 when it comes to languages. The review mainly focused on the defects related to point 1. I believe that much of the merit of the paper, however, relates more to point 2.

It is interesting to note that at one time most biological classification was done using complex characters. This work required a lot of human effort, and expert knowledge was needed to develop good classifications. Only a few characteristics were used for doing the classifications. Much useful work was done with that model.

Much of the recent work on biological classification is based on data that is basically very simple: DNA sequences. Expertise is still important, but much less so than before. Each individual piece of data carries little information about how the organisms should be classified.
However, large amounts of data (thousands of positions in the DNA sequence), combined with computer algorithms to do the calculations, have produced many interesting classifications. Often these have agreed with the older classifications, at times they have replaced them, and at times they have shown limitations in the new methods. People who are interested in classifying how the older languages developed into the current ones should pay some attention to these techniques.

Paul Purdom, Professor of Computer Science, Indiana University
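As a rough illustration of "point 2" in the message above, here is a minimal Python sketch of how many simple characters, each weak on its own, can be turned into a classification by an algorithm. It is only one generic distance-based approach (Hamming distances plus average-linkage merging) and does not reproduce the network method of the PNAS article. The taxa "A" to "D", their character states, and the function names are all invented for the example.

```python
from itertools import combinations

# Hypothetical toy data: each taxon is coded on the same 8 simple characters.
# Names and states are invented for illustration only.
CHARACTERS = {
    "A": "00110101",
    "B": "00110111",
    "C": "01100101",
    "D": "11001010",
}


def hamming(x, y):
    """Count the positions at which two equal-length codings differ."""
    if len(x) != len(y):
        raise ValueError("codings must have the same length")
    return sum(a != b for a, b in zip(x, y))


def average_linkage_tree(data):
    """Repeatedly merge the two clusters with the smallest average distance."""
    clusters = {name: [name] for name in data}  # cluster label -> member taxa
    trees = {name: name for name in data}       # cluster label -> nested tuple

    def dist(pair):
        # Average Hamming distance over all cross-cluster pairs of taxa.
        p, q = pair
        total = sum(
            hamming(data[a], data[b])
            for a in clusters[p]
            for b in clusters[q]
        )
        return total / (len(clusters[p]) * len(clusters[q]))

    while len(clusters) > 1:
        p, q = min(combinations(sorted(clusters), 2), key=dist)
        merged = p + "+" + q
        clusters[merged] = clusters.pop(p) + clusters.pop(q)
        trees[merged] = (trees.pop(p), trees.pop(q))
    return next(iter(trees.values()))


tree = average_linkage_tree(CHARACTERS)
print(tree)  # prints ((('A', 'B'), 'C'), 'D')
```

No single character decides the grouping here; the nesting emerges only from counting differences across all positions, which is the sense in which large amounts of simple data can stand in for a few expertly chosen complex characters.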
With respect to our paper, Forster and Toth, "Toward a phylogenetic chronology of ancient Gaulish, Celtic, and Indo-European", available at http://www.pnas.org/cgi/content/abstract/100/15/9079, Larry Trask made a number of critical comments which we fear will cause considerable confusion for the potential readers of our work (Linguist 14.1876).

Larry Trask raises both core issues and peripheral issues. For the record, some of his peripheral claims are in error (e.g., concerning publication procedures at PNAS, citation of the Pennsylvania group, Celtic substrate in Tuscany, networks versus trees, impact of mutation rates, etc.); in other cases he is right (e.g., concerning the typographical error of "29"). Rather than dealing with these details here, we would encourage interested readers to peruse the paper, the Supplementary Information and the website Tutorial. For those of you who do not have access to PNAS, note that all materials are routinely made available on the PNAS website six months after publication, free of charge.

The core criticisms concern the issue of negative controls, and the issue of resemblance coding versus cognation coding. As we have explained in our article and in the Tutorial, cognation coding is unfortunately not advisable with the fragmentary corpus of ancient Gaulish, because we would inevitably run the risk of ascertainment bias. Incidentally, this by no means implies that we reject cognation coding in general: it is a powerful procedure where applicable, as we demonstrated for example in our first linguistic network paper on Alpine Romance languages (Forster et al. 1998). Accepting that we had to resort to error-prone resemblance coding for the Gaulish analysis, we needed to test our error rate using a negative control, for which we chose Basque.
As expected, Basque demonstrated that the resemblance coding entails a noticeable error rate (about 5 spurious identities out of 35 characters), and we can expect a similar, but invisible, error for other language pairs in the table. That is the function of a negative control.

As concerns the coding procedure, we have no disagreement with the statement that resemblance coding is much more subjective and difficult than cognation coding; indeed we explicitly made this point in the Tutorial, and we listed borderline cases in the article. Unfortunately, Larry Trask seems to go much further than this by implying in his examples that the coding procedure needs to be identical between rows of the table (i.e., between different characters). This is not the case, as it would amount to comparing the proverbial apples and pears. The coding must only be consistent within a row (i.e., within a character), regardless of decisions in other rows, and it is up to the researcher to decide at which resolution to perform the coding for each character (e.g., whether or not to distinguish the Chicago and London pronunciations of "herb"). For tree-generating characters, the level of resolution will have a bearing neither on the branching order of the tree, nor on the time estimates.

A fuller discussion of these and other issues is given in our Tutorial, which we have now extended to incorporate the above comments. Note especially the extended "Frequently Asked Questions" section.

http://www.mcdonald.cam.ac.uk/genetics/gaulish_tutorial.pdf

Finally, a note of guidance to those who would like to get started on phylogenetic network analysis of their own data: the best way to get to grips with the method is to carry out examples by hand. It is not difficult: all you need is a pencil, paper and a rubber (yes, I know, "eraser" to Americans). I suggest you start with the simple tree example in the Tutorial, then pass on to the more complex network example in the PNAS paper.
Both examples show step by step how to proceed. After that, you may wish to tackle the larger example in the Forster et al. (1998) paper, which does not show any intermediate steps. Good luck.

Peter Forster

Dr. Peter Forster
Molecular Genetics Laboratory
The McDonald Institute for Archaeological Research
University of Cambridge
Downing Street
Cambridge CB2 3ER
England
tel. +44-(0)1223-339-330
fax. +44-(0)1223-339-285
email: pf223@cam.ac.uk

References

Forster et al. (1998) Evolutionary networks of word lists: visualising the relationships between Alpine Romance languages. Journal of Quantitative Linguistics 5 (3): 174-187.

Forster & Toth (2003) Toward a phylogenetic chronology of ancient Gaulish, Celtic, and Indo-European. Proceedings of the National Academy of Sciences of the USA, doi: 10.1073/pnas.1331158100
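Two of the quantitative points in the second message can be checked with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' procedure: the language labels "L1" to "L4", the state codes, and the function name `partition` are invented, and only the figures 5 and 35 come from the message. The sketch shows that a character (row) matters only through which languages share a state, so the labels used in one row need not match those used in any other row, and it computes the error rate implied by the Basque negative control.

```python
def partition(row):
    """Group languages by the state they share within one character (row)."""
    groups = {}
    for language, state in row.items():
        groups.setdefault(state, set()).add(language)
    return {frozenset(members) for members in groups.values()}


# Hypothetical row coded twice with different, arbitrary state labels.
row_coded_with_numbers = {"L1": 0, "L2": 0, "L3": 1, "L4": 2}
row_coded_with_letters = {"L1": "x", "L2": "x", "L3": "y", "L4": "z"}

# The grouping of languages is the same, so the row carries the same signal.
same_grouping = partition(row_coded_with_numbers) == partition(row_coded_with_letters)
print(same_grouping)

# Negative control from the message: about 5 spurious identities in 35 characters.
spurious, total = 5, 35
control_error_rate = spurious / total
print(round(control_error_rate, 3))
```

The rate works out to one in seven, roughly 14 percent, which is the background level of chance resemblance the message says should be assumed, unseen, for the other language pairs as well.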