LINGUIST List 22.412

Mon Jan 24 2011

Review: Computational Ling., Discipline of Ling.: Gries (2009)

Editor for this issue: Joseph Salmons <jsalmons@linguistlist.org>


1. Andrew Caines, Statistics for Linguistics with R

Message 1: Statistics for Linguistics with R
Date: 24-Jan-2011
From: Andrew Caines <andrew.caines@cantab.net>
Subject: Statistics for Linguistics with R

Announced at http://linguistlist.org/issues/21/21-2773.html

AUTHOR: Gries, Stefan Th.
TITLE: Statistics for Linguistics with R
SUBTITLE: A Practical Introduction
SERIES TITLE: Trends in Linguistics. Studies and Monographs [TiLSM] 208
PUBLISHED: 2009
PUBLISHER: De Gruyter Mouton

Andrew Caines, Computation, Cognition and Language Group, Research Centre for English and Applied Linguistics, University of Cambridge

SUMMARY

This book discusses methods of statistical analysis using R, the open-source software environment. It contains numerous linguistic datasets as case studies, features 'think breaks' throughout and exercises at the end of every chapter, and gives comprehensive instruction on how to code the required calculations and plots in R. There is also higher-level discussion of experiment design, as well as top-to-tail coaching in diligent empirical research: from hypothesis formation and data collection, through appropriate statistical analysis, to reporting the results.

Prerequisites to getting the most out of this book include downloading R which, being open source, is free (instructions can be found in chapter 2); and downloading the data and exercise files from the companion website. These files also contain all the code that is shown and referred to in the text of each chapter, and answer keys for the exercises.

The first two chapters deal with the essential preliminaries. Chapter 1 outlines procedures for empirical work, including hypothesis formation, operationalization of variables, best practice for data collection and annotation, and experiment design. Chapter 2 gives instruction on how to install R, how to obtain the code and exercise files from the companion website, how to load, manipulate and save data in the R console, and the difference between vectors, factors and data frames. From this point forward, Gries proceeds with case studies which the reader should work through in R alongside the text so as to get the most out of the book. It is therefore essential that the reader has acquired the skills presented in chapter 2, whether from previous experience or from working through that chapter.
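For readers new to R, the three data structures named above can be illustrated with a minimal sketch. The values and the names word_length, word_class and words are invented for this illustration and are not taken from the book or its companion files.

```r
# A numeric vector: word lengths in characters (made-up values)
word_length <- c(3, 5, 2, 7, 4, 6)

# A factor: a categorical variable with a fixed set of levels
word_class <- factor(c("noun", "verb", "noun", "adj", "verb", "noun"))

# A data frame: one row per observation, one column per variable
words <- data.frame(LENGTH = word_length, CLASS = word_class)

str(words)                # inspect the structure of the data frame
table(words$CLASS)        # frequency of each factor level

# Subset the vector of lengths by a condition on the factor
noun_mean <- mean(words$LENGTH[words$CLASS == "noun"])
noun_mean
```

A vector holds values of one type, a factor records category membership, and a data frame bundles such columns into the case-by-variable layout on which the later case studies rely.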

Chapter 3 introduces various descriptive statistical methods, covering measures of central tendency, measures of dispersion and bivariate analysis. The measures of central tendency include the arithmetic and geometric mean, the median and the mode. The measures of dispersion -- at least one of which, Gries emphasises, should always accompany any measure of central tendency -- covered here are relative entropy, range, quartiles and quantiles, average deviation, standard deviation, variation coefficient and standard error. The next section describes centering and standardization methods (i.e., z-scores), followed by confidence intervals. The chapter closes with a section on bivariate statistics, necessary to characterize datasets with more than one variable and the relation(s) among those variables. The methods discussed include crosstabulation, correlation coefficients, linear regression and a range of plotting techniques. The plot types covered in this chapter include scatter, mosaic, box, spine, line and bar plots.
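Several of the descriptive measures listed above can be computed with base R alone. The following is a sketch on made-up numbers, not the book's own code; the vector x and the other variable names are assumptions made for illustration.

```r
x <- c(2, 4, 4, 5, 7, 8)                 # made-up measurements

# Measures of central tendency
x_mean   <- mean(x)                      # arithmetic mean
x_geo    <- exp(mean(log(x)))            # geometric mean (positive values only)
x_median <- median(x)
# Base R has no function for the statistical mode; take the most frequent value
x_mode   <- as.numeric(names(which.max(table(x))))

# Measures of dispersion
range(x)                                 # smallest and largest value
quantile(x)                              # quartiles by default
x_sd <- sd(x)                            # standard deviation
x_se <- x_sd / sqrt(length(x))           # standard error of the mean
x_vc <- x_sd / x_mean                    # variation coefficient

# Standardization: z-scores have mean 0 and standard deviation 1
z <- as.numeric(scale(x))

c(mean = x_mean, median = x_median, mode = x_mode, sd = x_sd, se = x_se)
```

Reporting x_sd or x_se next to x_mean follows the advice that a measure of central tendency should always be accompanied by a measure of dispersion.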

The fourth chapter turns to analytical statistics and identifies the appropriate calculation(s) for various experiment design scenarios, according to number of variables, variable type (dependent or independent), and data type (nominal, ordinal, ratio-scaled). The distribution tests include measures of goodness-of-fit, such as chi-square, as well as the finer details of distributions -- dispersions and means. In addition, Gries shows how to compute a table of p-values in R adjusted for degrees of freedom. There are further demonstrations of correlation and regression analyses as well as versatile plot types such as association plots, cross-tabulation plots and strip charts.
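As an indication of what such a test looks like in practice, here is a minimal sketch of a chi-square test of independence using base R's chisq.test. The counts, the variable labels and the object names are invented for this illustration and do not come from the book.

```r
# Made-up frequencies: two constructions crossed with two registers
counts <- matrix(c(30, 10, 20, 40), nrow = 2,
                 dimnames = list(CONSTRUCTION = c("A", "B"),
                                 REGISTER = c("spoken", "written")))

# Pearson chi-square test without the continuity correction
test <- chisq.test(counts, correct = FALSE)

test$expected    # frequencies expected if the two variables were independent
test$statistic   # the chi-square value
test$parameter   # degrees of freedom
test$p.value     # probability of a value this large under the null hypothesis
```

With a 2-by-2 table there is one degree of freedom; comparing test$expected with counts shows which cells contribute most to the result.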

Chapter 5 takes an advanced step into multifactorial modelling. After all, "we live in a multifactorial world in which probably no phenomenon is really monofactorial" (p238). The techniques explored are multiple regression analysis, both mono- and multifactorial analysis of variance (ANOVA), binary logistic regression, and cluster analysis. To conclude the book, there is a brief but very thought-provoking Epilog in chapter 6 (on which more below).

EVALUATION

This book successfully performs at least three roles. Firstly, it gives a wide-ranging overview of statistical techniques and when to use them. Secondly, it is a well-written instruction manual on how to carry out these techniques in R. Thirdly, and more generally, it codifies the standards which should be observed in empirical linguistic work. The book's most significant contribution is in bringing together advice on both experiment design and data analysis in one volume. The comparable work by Baayen (2008), for instance, goes further in its exploration of clustering, regression and mixed models but lacks the section on best practice for experiment design with which Gries begins. Each has its own role, then -- Baayen (2008) being more narrowly focused on statistics and the present volume being a more complete guide to linguistic research.

This book by Gries, along with those such as Baayen (2008) and Gries (2009b), provides some much-needed rigour to linguistic study. It is desirable that such high standards and procedures for empirical work are followed, if research is to be properly discussed and built upon. If all university linguistics courses could feature one of these works as a textbook it would be an important step in the right direction. Not only is there guidance on data collection and analysis, there are also recommendations on how to summarize the results of statistical analyses in prose -- an essential skill for journal papers and conference proceedings which is all too easy to get wrong, especially when blindly imitating other publications.

The interactive nature of this book is its best asset. Code files are available from the companion website so that the reader can follow the narrative and replicate the case studies, and there are exercises at the end of every chapter as well as frequent 'think breaks'. Much effort and care have evidently gone into preparing the accompanying code and exercise files. The code files contain an appropriate amount of editorial comment as well as suggestions for the reader to try alternative statistical or graphical techniques to those outlined in the main text of the book.

The calculations range from the straightforward (p64), such as this:

> sqrt(9) # compute square root of 9

to the relatively complex (p296), like this:

> model.lrm<-lrm(CONSTRUCTION1 ~ V_CHANGPOSS + REC_ACT + PAT_ACT, x=T, y=T, linear.predictors=T) # compute binary logistic regression of pre-loaded dataset
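The second call cannot be run on its own: lrm is not part of base R but comes from an add-on package, and the call relies on a dataset loaded earlier in the book. For readers who want to try something comparable straight away, the following sketch fits a binary logistic regression with base R's glm on simulated data. It is a stand-in rather than the book's method, and the variable names (CHOICE, LENGTH, ANIMATE) and coefficient values are invented for this illustration.

```r
set.seed(1)                       # make the simulated data reproducible
n <- 200

# Hypothetical predictors (not the book's dataset)
d <- data.frame(
  LENGTH  = rnorm(n),
  ANIMATE = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Simulate a binary outcome whose log-odds depend on both predictors
log_odds <- -0.5 + 1.2 * d$LENGTH + 0.8 * (d$ANIMATE == "yes")
d$CHOICE <- rbinom(n, size = 1, prob = plogis(log_odds))

# Fit the binary logistic regression
model <- glm(CHOICE ~ LENGTH + ANIMATE, data = d, family = binomial)

summary(model)                                # coefficients on the log-odds scale
fitted_p <- predict(model, type = "response") # predicted probabilities
head(fitted_p)
```

The formula interface (outcome ~ predictor + predictor) is the same in both calls; what differs is the fitting function and the extra output each one offers.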

The reader may well feel challenged by the increase in complexity, but the progression is steady enough that it should not be overwhelming, helped further by the supporting code files and exercise answer keys. One enhancement, however, would have been a glossary of the functions covered in the book. In the absence of a glossary, the index is comprehensive enough to serve this purpose, but a ready-reference function list would have been better still.

The plot types introduced in this book (and the statistical methods themselves, for that matter) are acknowledged as being only the tip of the iceberg. The reader is referred to specialist works for more advanced techniques (the book cites Murrell 2005; Cook and Swayne 2007; Sarkar 2008; see also Wickham 2009). However, with histograms, boxplots, stripcharts, pie charts, scatter plots, association plots and many more, there are plenty of data display methods to satisfy most needs.

The Epilog (chapter 6) observes that the sections on linear models (ANOVAs and regressions) are short and points the reader to references on further techniques which might be of use: Poisson regression, repeated-measures ANOVAs and multi-level models. The reader is also referred to R libraries containing more powerful graphical tools and books dedicated to graphics in R. Finally, both to "shake up a bit what you have learnt so far" and "stimulate some curiosity for what else is out there" (p320), Gries observes that the null hypothesis testing paradigm which is central to every case study in the book is not quite so uncontroversial as it seems and is generally held to be. This is an appropriate overview of what the book stands for: on the one hand it offers more than enough for the reader to get by with data collection, analysis and reporting; on the other hand it can be the jumping-off point for the reader's exploration into more advanced statistical methods and theoretical considerations. The book is thus not only an introduction to data analysis with R but also an introduction to statistical theory and reconsideration of current techniques.

Gries has diligently compiled a work of great use and interest. It is relevant above all to students and researchers in linguistics, and can readily act as a textbook for taught courses. It should be noted that the book is equally useful as a reference guide, with the analysis scenarios sufficiently well labelled and organized that the reader can dip into it as and when necessary, or as a complete set of exercises which the reader can work through section by section.

REFERENCES

Baayen, R. H. (2008). Analyzing Linguistic Data: a Practical Introduction to Statistics using R. Cambridge: Cambridge University Press.

Cook, D. and D. F. Swayne (2007). Interactive and Dynamic Graphics for Data Analysis. New York: Springer.

Gries, St. Th. (2009b). Quantitative Corpus Linguistics with R: A practical introduction. London: Routledge.

Murrell, P. (2005). R Graphics. Boca Raton, FL: Chapman and Hall / CRC.

Sarkar, D. (2008). Lattice: Multivariate Data Visualization with R. New York: Springer.

Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. New York: Springer.

ABOUT THE REVIEWER

Andrew Caines recently completed his PhD dissertation at the University of Cambridge. His research is a corpus-based study of an innovative construction in English -- namely, the 'zero auxiliary' interrogative: 'what you doing? you going to town? you talking to me?' For more information go to http://www.srcf.ucam.org/~apc38 .

Page Updated: 24-Jan-2011