Last June, I posted the following query about "comparing 2 texts":

A colleague in our Classics dept. wants to be able to compare 2 texts to see if they were written by the same author, or by different authors. Presumably, this would be done by some combination of a stylistic and a statistical analysis. (As I recall, this sort of technique has been used by folks who try to figure out if Shakespeare really wrote Shakespeare's plays.) What she needs are pointers to the literature, especially information on how reliable such arguments are.

Appended is a summary of the replies. Thanks to all of you!

William J. Rapaport
Associate Professor of Computer Science and Center for Cognitive Science
Dept. of Computer Science | (716) 645-3180 x 112
SUNY Buffalo | fax: (716) 645-3464
Buffalo, NY 14260 | rapaport@cs.buffalo.edu

*************
Date: Fri, 3 Jun 1994 14:18:46 --100
From: Ken.Beesley
xerox.fr (Ken Beesley) Some important work on authorship was done by Michaelson & Morton in Edinburgh, Scotland. The Rev. A.Q. Morton The Abbey Manse Culross Dunfermline, Fife KY128JD Newmills 880-231 Prof. S. Michaelson Computer Science JCMB Kings Buildings University of Edinburgh Edinburgh, Scotland There was also some work by a couple of statisticians at Brigham Young University in Provo, Utah, USA. Their names escape me now. ************* Date: Fri, 3 Jun 1994 13:24:12 -0500 From: hrubin
stat.purdue.edu (Herman Rubin) Probably the best sound work from a statistical basis is Title: Inference and disputed authorship: The Federalist <by> Frederick Mosteller <and> David L. Wallace. Author(s): Mosteller, Frederick, 1916- Wallace, David L. (David Lee), 1928- Publisher: Reading, Mass., Addison-Wesley <1964> This book considered the Federalist papers, written by three known authors, but different ones by different authors. One of their conclusions was that analysis by context vocabulary and other similar things, much used by those who assigned authorship in the past, did not work; the only thing which did for this problem was the use of connectives. There is an article in the _Journal of Applied Probability_ on the type-token relationship in Shakespeare's plays. A cursory glance at the data indicates that one cannot treat these as a sample from a single population, even if the comedies, tragedies, and historical plays are separated; there is a definite effect of the individual work. Similar things can be noticed in the attempts of other statisticians to do this, such as the writings of Yule. I did look at some of the data; I have not published on this. It is quite dangerous to say on the basis of a statistical test that two works are by different authors. ----------------- Date: Mon, 6 Jun 94 11:16:55 -0700 From: jtang
cogsci.Berkeley.EDU (Joyce Tang Boyland) [Re: Mosteller and Wallace:] ... The 1984 edition [see reply below] has a much more informative table of contents than the original 1964 version published by Addison-Wesley. ------------------ Date: Tue, 7 Jun 1994 10:33:39 --100 From: Gregory.Grefenstette
xerox.fr (Gregory Grefenstette)

Frederick Mosteller and David L. Wallace, "Applied Bayesian and Classical Inference: The Case of the Federalist Papers", 2nd edition of "Inference and Disputed Authorship: The Federalist" (Springer-Verlag).

This book gives statistical methods for deciding which of the Federalist papers were written by Hamilton and which were written by Madison.

------------------
From: Robert.Sigley
vuw.ac.nz Date: Sat, 04 Jun 1994 13:02:58 +1200 I've just finished reading (and returned, unfortunately) a collection of papers which I can thoroughly recommend to anyone trying to identify an author by style. It's called "Statistics and Style" and appeared in 1968. ... on checking the library online catalogue, I find that no extra information is given for it, so I won't be able to confirm this identification until it's reshelved (within 2 days, I hope). As far as I remember, the editor had a Slavic-type name beginning with `G'; a grep of the library's entire author index makes J. Gvozdanovic the most likely option (listed for another book in the general area of linguistics, but unconnected to the topic under discussion), but this id is tentative for now. Analyses covered include comparisons of (1) word-length spectra (and a number of statistics calculated from them); (2) sentence-length spectra (which are found to follow a log-normal distribution for any particular text by a single author. Author behaviour is reasonably consistent, but there is considerable overlap between different authors in the same genre); (3) use of certain vocabulary items previously identified as `typical' of the candidate author(s) on the basis of uncontested works; (4) use of certain grammatical constructions; (5) counts of certain grammatical classes (eg noun/verb ratios, or adjective/verb ratios). The final paper in the collection is perhaps the most important, as it deals with the general question of reliability. Overall, it has to be said that the crude general statistics above are not useful for deciding questions of authorship unless: (i) the number of possible candidate authors is small; (ii) we have a large body of work from each of the candidate authors; (iii) this corpus covers the entire span of their career (or at least shows little change over time); and (iv) the corpus is in similar genres to the contested item. 
In short, the rewards of such analysis are mostly not worth the considerable time it used to take to compute the statistics. The results are, I'm afraid, especially indecisive in answering questions in classics (where this volume of work, and the historical information about potential alternative authors, is often lacking). But if there is only one candidate author, with a large known corpus, and the exercise is simply to determine how similar to that author's style the unknown text is, then it can still be attempted. (1) and (2) above are now relatively quick and easy to calculate with most concordance programs - providing the text is in machine-readable form to start with! But they are the least author-specific methods. (4) and (5) could be useful, but are still very time-consuming to calculate, and require a whole lotta manual tagging of the texts. Best avoided. So (3) is probably going to be of most use in identifying a specific author. The best approach I can think of would be to construct a concordance (using OCP or similar) for a large corpus (20000 words minimum) of the candidate author, and then do the same for a similar-sized matched-genre corpus from the author's contemporaries. (If the text's general *date* is in doubt, you may as well give up now.) Then you compare the frequency ratios of common vocabulary items (ie frequency in candidate corpus/ frequency in mixed-contemporary corpus). This will identify a number of vocabulary items which are used proportionately much more or much less by the candidate author, and so can be used as `characteristic' of that author. Discard items which are linked in any literal sense to the text topic. Ignore very rare items (eg with frequency less than 5 over 20000 words). To save yourself time, and to maximise the sensitivity of your tests, look at only the 10 or so items with the largest differential frequency. Now calculate the frequencies of the remaining items in the contested text. 
Compare these with both the candidate-author and contemporary corpora frequencies. Finally, conduct a series of statistical tests to determine whether any differences you find can reasonably be attributed to chance. The best method will depend on the frequencies you get at the end of all this; ask a friendly statistician. Hope this helps. I'll mail back when I find the book again to confirm its identity. I should add though that there's been considerable progress in text manipulation on computers since its publication, so it's out of date in some areas; however, this is more or less made up for by falling interest and a lack of progress in statistical style analysis since 1970. --------------------- From: Robert.Sigley
vuw.ac.nz
Date: Wed, 08 Jun 1994 16:34:09 +1200

The reference I mentioned is actually:

Lubomir Dolezel & Richard W. Bailey (eds) 1969. _Statistics_and_Style_. New York: American Elsevier Publishing Company.

I shall try to give a brief description of the more important collected papers, with original references where possible. Page references for quotes are from the collection, though.

Vocabulary Measures:

Paul Bennett. The Statistical Measurement of a Stylistic Trait in _Julius_Caesar_ and _As_You_Like_It_. (from _Shakespeare_Quarterly_ VIII (1957): 33-50)
Bennett applies Yule's characteristic (a measure of vocabulary repetitiveness) to two very different plays by Shakespeare. Using a card-sorting technique, this was very time-consuming; it would be much quicker today! He finds that the characteristic is a useful measure of style - it varies from act to act in a way predictable from the plays' structures - but "should not care to suggest that the characteristic is going to provide an infallible test of authorship" (p40).

Charles Muller. Lexical Distribution Reconsidered: The Waring-Herdan Formula. (from _Cahiers_de_Lexicologie_ VI (1965): 35-53.)
Muller tests a rather complicated formula designed to predict the word-frequency spectrum of a text. It works reasonably well on material from a variety of texts in several languages. [The shape of the frequency distribution is therefore of little use in author attribution. This formula has recently surfaced again in Baayen's (1990, 1991) work on morphological productivity. -RJS]

Friederike Antosch. The Diagnosis of Literary Style with the Verb-Adjective Ratio. (translated from German original.)
Antosch analyses a number of plays by Grillparzer, Goethe and Anzengruber, in terms of the verb/adjective ratio. She finds that this is extremely sensitive to elements of genre (eg dialogue/monologue; and novels vs. academic writings) and characterisation (eg lower-class/upper-class). The V/A ratio may show local maxima within a play at points of rising action and climactic scenes, and so is a potentially useful stylistic indicator. [Corollary: it's of very limited use for comparing authors unless these factors can be controlled.]

See also:
G. Udny Yule. 1944. _The_Statistical_Study_of_Literary_Vocabulary_. Cambridge.

Sentence-level Measures:

C.B. Williams. A Note on the Statistical Analysis of Sentence-Length as a Criterion of Literary Style.
Williams compares works by Chesterton, Wells and Shaw with respect to their sentence-length frequency spectra. He finds that these spectra are reasonably well modelled by a log-normal distribution (that is, the log of the sentence length has a normal distribution), and that the three books studied have significantly different mean sentence lengths - though the significance is marginal between Shaw and Chesterton. Williams uses samples of 600 sentences (approx 15000 words) from each book; this is a minimum sample size for work of this nature! [NB we can't conclude from this that we have identified any characteristic of the *authors*. -RJS]

Kai Rander Buch. A Note on Sentence-Length as Random Variable.
Buch comments on Williams' paper, presenting (with fearsome maths) a statistical analysis of two works by the same author, and concluding that the author's style has changed over time to such an extent that the texts are significantly different under Williams' test.

See also:
C.B. Williams. 1956. Studies in the History of Probability and Statistics IV. A Note on an Early Statistical Study of Literary Style. _Biometrika_ XLIII (1956): 248-256.
G. Udny Yule. 1938. On Sentence-Length as a Statistical Characteristic of Style in Prose, with Application to Two Cases of Disputed Authorship. _Biometrika_ XXX (1938-39): 363-390.
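
The sentence-length comparison described for Williams above could be sketched roughly as follows. This is only an illustration, not Williams' actual procedure, which the summary does not spell out: the function names are hypothetical, the sentence splitter is deliberately crude, and the use of Welch's t statistic on log sentence lengths is an assumption chosen because the lengths are said to be roughly log-normal.

```python
import math
import re
from statistics import mean, variance


def sentence_lengths(text):
    """Return the word count of each sentence (crude split on . ! ?)."""
    sentences = re.split(r"[.!?]+", text)
    return [len(s.split()) for s in sentences if s.split()]


def log_length_summary(lengths):
    """Mean, sample variance and count of the log sentence lengths."""
    logs = [math.log(n) for n in lengths]
    return mean(logs), variance(logs), len(logs)


def welch_t(lengths_a, lengths_b):
    """Welch's t statistic for the difference in mean log sentence length.

    Assumption for this sketch: each sample has at least two sentences
    and the lengths within a sample are not all identical.
    """
    mean_a, var_a, n_a = log_length_summary(lengths_a)
    mean_b, var_b, n_b = log_length_summary(lengths_b)
    return (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)
```

A large absolute value of the statistic only says that the two samples differ in sentence length; as the comments in this summary stress, that need not mean the authors differ.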
[Hence gross sentence-length measures are of little use for author attributions: they can return non-significant differences between different authors, and significant differences between texts by the same author. They simply aren't specific enough. -RJS]

Curtis W. Hayes. A Study in Prose Styles: Edward Gibbon and Ernest Hemingway. (from _Texas_Studies_in_Literature_and_Language_ VII (1966): 371-386.)
Hayes avoids the above problem by taking a more detailed transformational analysis of passages of Gibbon & Hemingway. He finds a variety of grammatical patterns which show highly significant differences between the two authors - in particular, passives, doublets, infinitival nominals, and relative clauses are far commoner in Gibbon. [This is a valuable stylistic measure, though not a method I would have the patience to use myself! But it doesn't serve to identify the authors, so much as the very different genres they write in. -RJS]

Studies of Individual Author Styles:

John B. Carroll. Vectors of Prose Style. (from Thomas A. Sebeok (ed) 1960. _Style_In_Language_. MIT Press: 283-292.)
This is an interesting use of factor analysis to determine the linguistic correlates of literary judgements.

George M. Landon. The Quantification of Metaphoric Language in the Verse of Wilfred Owen.
Least said the better.

Frederick L. Burwick. Stylistic Continuity and Change in the Prose of Thomas Carlyle.
The mutant offspring of an entropic study of 5-word wordclass sequences, and a more traditional literary analysis. The latter wins out, but is not easily applicable to other authors.

Karl Kroeber. Perils of Quantification: The Exemplary Case of Jane Austen's _Emma_.
Kroeber undertakes a detailed analysis of the vocabulary of Austen, Eliot, Dickens and [E.] Bronte. While many of the restrictions he places on his samples are arbitrary, this is potentially a useful direction for author comparison and attribution (see below).

See also the case studies:
Alvar Ellegard. 1962. _A_Statistical_Method_for_Determining_Authorship:_The_Junius_Letters,_1769-1772_. Gothenburg Studies in English 13.
Ivor S. Francis. 1966. An Exposition of a Statistical Approach to the Federalist Dispute, in Jacob Leed (ed) _The_Computer_and_Literary_Style_. Ohio.

Survey of the field:

Richard W. Bailey. Statistics and Style: A Historical Survey. (pp217-236)
This deals with the general question of reliability: "What is wanted... is a litmus test by which the critic can decide whether or not two given texts were written by the same author. Though some attempts have been made to formulate such a test, they have been almost wholly unsuccessful." (p222)

Some other surveys cited by Bailey:
William J. Paisley. 1964. Identifying the Unknown Communicator [...] _The_Journal_of_Communication_, XIV (1964): 219-237.
Rebecca Posner. 1963. The Use and Abuse of Stylistic Statistics. _Archivum_Linguisticum_ XV (1963): 111-119.

Overall, it has to be said that the crude general statistics above are not useful for deciding questions of authorship unless:
(i) the number of possible candidate authors is small;
(ii) we have a large body of work from each of the candidate authors;
(iii) this corpus covers the entire span of their career (or at least shows little change over time); and
(iv) the corpus is in similar genres to the contested item.

In short, the rewards of such analysis are mostly not worth the considerable time it used to take to compute the statistics. The results are, I'm afraid, especially indecisive in answering questions in classics (where this volume of work, and the historical information about potential alternative authors, is often lacking). But if there is only one candidate author, with a large known corpus, and the exercise is simply to determine how similar to that author's style the unknown text is, then it can still be attempted.
The general vocabulary and sentence-length measures above are now relatively quick and easy to calculate with most concordance programs - providing the text is in machine-readable form to start with! But they are the least author-specific methods. Possibly they could be of use as a preliminary check before plunging into more time-consuming methods. More specific grammatical analysis could be useful, but very time-consuming to calculate, requiring a whole lotta manual tagging of the texts. Best avoided. So an intermediate 'specific vocabulary' index is probably going to be of most use in identifying a specific author. The best approach I can think of would be to construct a concordance (using OCP or similar) for a large corpus (20000 words minimum) of the candidate author, and then do the same for a similar-sized matched-genre corpus from the author's contemporaries. (If the text's general *date* is in doubt, you may as well give up now.) Then you compare the frequency ratios of common vocabulary items (ie frequency in candidate corpus/ frequency in mixed-contemporary corpus). This will identify a number of vocabulary items which are used proportionately much more or much less by the candidate author, and so can be used as `characteristic' of that author. Discard items which are linked in any literal sense to the text topic. Ignore very rare items (eg with frequency less than 5 over 20000 words). To save yourself time, and to maximise the sensitivity of your tests, look at only the 10 or so items with the largest differential frequency. Now calculate the frequencies of the remaining items in the contested text. Compare these with both the candidate-author and contemporary corpora frequencies. Finally, conduct a series of statistical tests to determine whether any differences you find can reasonably be attributed to chance. The best method will depend on the frequencies you get at the end of all this; ask a friendly statistician. Hope this helps. 
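
The 'specific vocabulary' procedure above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the add-one smoothing, the use of the combined count for the rarity cutoff, and the ranking by absolute log ratio are all choices made for this sketch rather than part of the advice above. The final significance testing is deliberately left out, since the best test depends on the frequencies obtained.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased word tokens (runs of letters; crude tokenisation)."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def characteristic_items(candidate_text, contemporary_text,
                         topic_words=(), min_count=5, top_n=10):
    """Rank words by how differently the candidate author uses them.

    Returns up to top_n (word, ratio) pairs, where ratio is the relative
    frequency in the candidate corpus divided by the relative frequency
    in the contemporary corpus. Add-one smoothing (an assumption of this
    sketch) avoids division by zero for words missing from one corpus.
    """
    cand = word_counts(candidate_text)
    cont = word_counts(contemporary_text)
    cand_total = sum(cand.values())
    cont_total = sum(cont.values())
    if cand_total == 0 or cont_total == 0:
        raise ValueError("both corpora must contain at least one word")
    scored = []
    for word in set(cand) | set(cont):
        if word in topic_words:
            continue  # discard items tied to the topic of the text
        if cand[word] + cont[word] < min_count:
            continue  # ignore very rare items
        ratio = ((cand[word] + 1) / cand_total) / ((cont[word] + 1) / cont_total)
        scored.append((word, ratio))
    # Largest differential frequency first, whether over- or under-used.
    scored.sort(key=lambda item: abs(math.log(item[1])), reverse=True)
    return scored[:top_n]


def rates_per_thousand(text, words):
    """Frequency per 1000 words of each chosen item in a text."""
    counts = word_counts(text)
    total = sum(counts.values())
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: 1000 * counts[w] / total for w in words}
```

In use, `characteristic_items` would be run on the candidate-author and contemporary corpora, and `rates_per_thousand` then applied to the contested text and to both corpora for the selected items, giving the figures to hand to the statistician.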
I should add though that there's been considerable progress in text manipulation on computers since 1970, so it's out of date in some areas; however, this is more or less made up for by falling interest and a lack of progress in statistical style analysis since then. ---------------- Date: Sat, 4 Jun 1994 11:51:03 -0600 From: nostler
crl.nmsu.edu (Nick Ostler) Your colleague should look at a work by AJP Kenny on assessing the authorship of Aristotle's Eudemian Ethics: "The Aristotelian ethics: a study of the relationship..." Oxford: Clarendon Press, 1978. [also recommended by Virginia Knight <ZZAASVK
cms.manchester-computing-centre.ac.uk>] ------------------ Date: Mon, 6 Jun 94 13:58:20 +0200 From: monique
gia.univ-mrs.fr (Monique Rolbert)

[Translated from French:] We are a BDD-LN (databases and natural language) team developing a query language for textual databases (based on an SGML-type format), and one open question is which types of operator are worth making available to a user who wants to do natural language processing on texts. I would be very interested to learn what kinds of needs you have for the sort of comparison you want to make (statistical-stylistic). Thanks in advance.

Monique Rolbert
monique.rolbert
gia.univ-mrs.fr ----------------- Date: Mon, 6 Jun 1994 23:34:23 +1000 From: sussex
lingua.cltr.uq.oz.au (Prof. Roly Sussex) John Burrows (LCJFB
cc.newcastle.edu.au) at the University of Newcastle, Australia, has done important work on text analysis and authorship. You could email him direct. Roly Sussex Director Centre for Language Teaching and Research and Language and Technology Centre of the National Languages and Literacy Institute of Australia University of Queensland Queensland 4072 Australia email: sussex
lingua.cltr.uq.oz.au phone: +61 7 365-6896 (work) +61 7 300-2942 (home) fax: +61 7 365-7077 ------------------- Date: Mon, 6 Jun 94 15:37:19 -0700 From: edwards
cogsci.Berkeley.EDU (Jane A. Edwards) Your query reminded me of a recent exchange regarding stylistic analysis, though in different context. Hope this is of use. -Jane Edwards | ------------------------ | Date: Mon, 31 Jan 1994 11:59:00 -0500 | From: neff
watson.ibm.com (Mary Neff) | To: FL-LIST
BHAM.AC.UK | Cc: neff
watson.ibm.com | Subject: The Case of the Plagiarized Patent | | A few months back I was buttonholed at a party by the owner of a company in the middle of a patent infringement case. He wanted to know if, as a | linguist, I might have anything useful to offer. Not a lot, it's not my | field, but I just found this list, and one of YOU might. It seems that his | company had signed a contract with another one that included giving them | access to his design documentation and his patent applications. Some | time later, he discovered that the other company is siphoning off his | business and is making a product too similar to his to be accidental, and | has filed patents also (I think in other countries). His question to me was whether it were possible to study and compare the two patents by structure, language, etc. to determine whether there might have been any plagiarism involved. I later looked at the patents and decided that it was perhaps not a wild idea, but that any investigation would also have to take into account the general "formula" of a patent, which might account for a lot of similarity. Who are the experts on this sort of thing? What are some of the other issues involved? It's not so often that I get approached at a party for some free advice as a linguist; usually it's the doctors and the lawyers that encounter that sort of thing! | | Interestingly, I read something in this month's DISCOVER magazine that | mentions a couple of guys who designed a computer program to snoop for | plagiarism in books. ------------------ Date: Tue, 01 Feb 94 10:03:54 EST From: Larry Horn <LHORN
YaleVM.CIS.Yale.edu>

Thanks for the postings. The lawyer has settled on one of my earlier respondents, Gerry McMenamin of Fresno, who wrote a book on authorship determination. Apparently computers are indeed much used in these matters, but I don't know whether his samples (from his client and another man) are generous enough to allow for statistical significance. I guess McMenamin will help him decide.

------------------
From: "Richard Hamilton-Williams" <RJHW
registry.cit.ac.nz>
Date: Wed, 8 Jun 1994 13:29:26 GMT+1200

Long ago, but not so far away, I studied Middle High German and wrote a bit of a thesis on the transmission of MHG texts. It wasn't very popular with a lot of people because it made little reference to "taste" and was based, rather, on a statistical analysis of variance between texts. My professor at the time got me interested in this and he in turn had got it from a book called, I think, "The Calculus of Variants". I've an idea it was written by E H Greig (Gregg?) in the 1920s or 1930s. In any case, I think I have a copy at home and will send you the details tomorrow.

I made use of a fairly crude algorithm which established a model as if the transmission of texts were known, and then measured actual variation against this. My professor died, nothing to do with me I hope, and although I completed my degree I went on to other things, so I can't claim that I know what goes on in the field nowadays. I imagine, however, that analysis is much more sophisticated now - I used punchcards to enter data on a mainframe - although the concepts should be very similar.

Richard Hamilton-Williams Central Institute of Technology, Wellington, New Zealand 04 527-6397 x6982 Private Bag 39807 Wellington Mail Centre New Zealand

From: "Richard Hamilton-Williams" <RJHW
registry.cit.ac.nz> Date: Thu, 9 Jun 1994 08:03:41 GMT+1200 The reference is: Greg, W. W. The Calculus of Variants, An Essay on Textual Criticism (Oxford, 1927) Greg wrote a number of other things and edited works on the basis of his theories on textual transmission. Richard Hamilton-Williams Central Institute of Technology, Wellington, New Zealand 04 527-6397 x6982 Private Bag 39807 Wellington Mail Centre New Zealand ------------------ From: h9290030
hkuxa.hku.hk (R.Y.L. TANG) Subject: Authorship identification In David Crystal's _The Cambridge Encyclopedia of Language_ (Cambridge UP, 1987), there is a very succinct account of the use of statistics in stylistic analysis and authorship identification (Chapter 12). ----------- From: Brett.Baker
linguistics.su.edu.au (Brett Baker) Date: Wed, 15 Jun 1994 15:41:12 +1000 ... I don't know if this will be much use to your colleague, but she could do worse than have a look at a new monograph by John Myhill called 'Typological Discourse Analysis' published by Blackwell 1992. Apart from loads of interesting stuff about analysing texts quantitatively, it also has references for analyses that have been done on written texts which sound like the kind of thing you want. Much of the purpose of this kind of analysis is to show up regularities of expression type and stylistic/grammatical function. Good luck. ----------- Date: Thu, 16 Jun 1994 00:24:45 -0500 (CDT) From: Kristin E Hiller <hill0087
gold.tc.umn.edu>

This is in response to the query you posted on Linguist (on behalf of your colleague). I'm sorry it's taken me so long to respond.

Stylostatistical studies abound concerning cases of disputed authorship. You mention the Shakespeare/Marlowe controversy. I'll name a few others:

1) One of the most often cited cases of disputed authorship is that of the _Federalist Papers_. Of the 85 papers, the authorship of twelve was in question (having been written by either Madison or Hamilton).
2) Several anonymous articles appeared in the journals _Vremja_ (_Time_) and _`Epoxa_ (_Epoch_), which were both edited by Dostoevsky. Some of the articles have variously been attributed to Dostoevsky.
3) The authorship of _The Junius Letters_, not known for certain, has often been attributed to Sir Philip Francis (although some 40 others were considered at one time or another).
4) Gustave Adlerfeld's _The Military History of Charles XII_ was anonymously translated from the French. Henry Fielding is considered by some to have been the translator.
5) Some scholars maintain that Sholoxov did not actually write all of _Tixij Don_, but plagiarized Krjukov's manuscripts.

I have only recently begun reading about this field and have already come across many references to the work done on (1) by Frederick Mosteller and David Wallace, _Inference and Disputed Authorship: "The Federalist"_ (Reading, MA: Addison-Wesley, 1964), and a less statistic-laden work, Francis, Ivor S. "An Exposition of the Statistical Approach to the _Federalist_ Dispute," in _The Computer and Literary Style_, ed. Jacob Leed (Kent: Kent State U. Press, 1966).

Geir Kjetsaa tackles (2) in his book (written in Russian) _Prinadlezhnost' Dostoevskomu: K voprosu ob atribucii F.M. Dostoevskomu anonimnyx statej v zhurnalax "Vremja" i "Epoxa"_ (Oslo: Solum Forlag, 1986).
Michael and Jill Farringdon address (4) in "A computer-aided study of the prose style of Henry Fielding and its support for his translation of the Military History of Charles XII", in _Advances in Computer-aided Literary and Linguistic Research: Proceedings of the Fifth International Symposium on Computers in Literary and Linguistic Research_, D.E. Ager, F.E. Knowles, Joan Smith, eds. (Birmingham: AMLC, 1979).

Rudall, B.H. and T.N. Corns, _Computers and Literature: A practical guide_ (Cambridge, MA: Abacus Press, 1987) contains a chapter on "Author identification and canonical investigation."

With all the literature out there I could continue listing references until my fingers ached from typing. Instead I'll just list two more:

Kenny, Anthony, _The Computation of Style: An introduction to statistics for students of literature and humanities_ (Oxford: Pergamon Press, 1982). A great book -- the title says it all.

Feldman, Paula R. and Buford Norman. _The Wordworthy Computer: Classroom and research applications in language and literature_ (NY: Random House, c.1987). The best part of this book is its HUGE bibliography. A very good starting point.