AUTHOR: Baayen, Harald
TITLE: Analyzing Linguistic Data
SUBTITLE: A Practical Introduction to Statistics using R
PUBLISHER: Cambridge University Press
YEAR: 2008
Aditi Ghosh, Department of Linguistics, University of Calcutta, Kolkata
SUMMARY This book is a guide for researchers and students who want to use statistical computation to analyze linguistic data. It teaches how different types of linguistic data can be handled quantitatively using R, a free statistical environment that implements the S language originally developed at AT&T Bell Laboratories. The book is divided into seven chapters, each giving instruction on how to use R to analyze different types of linguistic data, starting from relatively simple problems and gradually leading to more complex ones. The first chapter, entitled "An introduction to R", gives instructions on the basic handling of R: how to download the relevant packages, how to use the software on different operating systems, how to import or export datasets and even how to use R as a simple calculator. This chapter also teaches how to select a portion of the data in a data set, how to order or sort a data frame, how to change specific information and how to extract relevant portions of a data frame. Lastly, it shows how to perform basic calculations on a data frame, such as the mean or the sum of a numeric vector. The chapter, like all the other chapters in the book, is followed by a set of exercises, the solutions to which are provided in Appendix A.
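To give a rough idea of the kind of operations the first chapter covers, the following is a minimal sketch in base R. The data frame `words` and all its values are invented for this illustration and are not taken from the book's datasets.

```r
# Hypothetical toy data frame; the words and counts are invented for illustration.
words <- data.frame(
  Word      = c("walk", "think", "sing", "go"),
  Frequency = c(120, 340, 45, 980),
  Regular   = c(TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Select a portion of the data: rows for irregular verbs, two columns only.
irregular <- words[!words$Regular, c("Word", "Frequency")]

# Sort the data frame by descending frequency.
sorted <- words[order(words$Frequency, decreasing = TRUE), ]

# Change one specific value in place.
words$Frequency[words$Word == "sing"] <- 50

# Basic calculations on a numeric column.
mean_freq  <- mean(words$Frequency)
total_freq <- sum(words$Frequency)
```

Logical indexing with `[rows, columns]` and `order()` are only one way of doing this; `subset()` would serve equally well for the selection step.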
The second chapter, "Graphical Data Exploration", deals with the graphical presentation of data. It starts with a brief definition of a random variable and goes on to show how bar plots and histograms can be produced and how curves can be added to a histogram. The chapter also shows how to plot ordered values and densities and how to generate boxplots. For two or more variables it is more useful to create a mosaic plot or a scatter plot, and the third section of this chapter deals with these. The final section introduces trellis graphics, a type of display in which the data are shown in many systematically arranged panels at the same time.
The third chapter, "Probability Distributions", begins with an introduction to the notion of a distribution and goes on to demonstrate how to deal with discrete and continuous distributions, introducing the relevant functions available in R. It also introduces the Poisson distribution and different types of normal distribution. Lastly, it introduces three important continuous distributions, namely the t, F and chi-squared distributions, and the functions used in R for these distributions.
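As an illustration of the distribution functions referred to here, the sketch below uses base R's naming scheme (`d` for density, `p` for cumulative probability, `q` for quantile, `r` for random draws). The particular numbers are arbitrary choices made for this example, not values from the book.

```r
# Normal distribution: density, cumulative probability, quantile, random draws.
dens_at_zero <- dnorm(0)        # density of the standard normal at 0
prob_below   <- pnorm(1.96)     # P(X <= 1.96)
upper_q      <- qnorm(0.975)    # 97.5% quantile
set.seed(1)                     # make the random draws reproducible
draws        <- rnorm(5, mean = 0, sd = 1)

# Poisson distribution (discrete): P(X = 2) when lambda = 3.
p_pois <- dpois(2, lambda = 3)

# Upper-tail probabilities for the t, F and chi-squared distributions.
p_t     <- pt(2.1, df = 20, lower.tail = FALSE)
p_f     <- pf(4.2, df1 = 2, df2 = 30, lower.tail = FALSE)
p_chisq <- pchisq(3.84, df = 1, lower.tail = FALSE)
```

The same four prefixes apply to the other distributions in base R, so `qt()`, `rf()` or `rchisq()` follow the same pattern.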
Chapter four is entitled "Basic statistical methods". The first two sections of this chapter introduce tests for single vectors and for two independent vectors. For single vectors, one can examine the distribution by plotting the density in different kinds of graphical representation, or one can use appropriate tests, such as the Shapiro-Wilk test for normality or the Kolmogorov-Smirnov one-sample test. To test the mean of a single vector one can use a t-test. To observe the distributions of two independent vectors, one can plot them as two differently colored lines. To compare the means of the two vectors, one can use the boxplot function to inspect their distributions, or run a t-test to check whether the two means are significantly different. R also has specific functions to test whether the variances of the two vectors are the same. For paired vectors, the t-test and the Wilcoxon test can again be used. To see whether two vectors are significantly related, one can plot their individual points in a scatterplot and obtain a regression line. This chapter also shows how to evaluate the significance of a correlation between two variables, and it discusses the problems that arise in linear regression and how to address them. The chapter further shows how to examine the joint density of two paired vectors, how to handle one or two numerical vectors together with a factor, and how to handle two vectors with counts. The final section explains how to estimate the significance of a statistical test.
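The tests named in this chapter summary correspond to functions in base R's `stats` package. The following is a minimal sketch on simulated data. The vectors `a` and `b` and their parameters are assumptions made for this illustration and do not reproduce any analysis from the book.

```r
# Simulated data: two hypothetical vectors of 30 measurements each.
set.seed(42)
a <- rnorm(30, mean = 500, sd = 50)
b <- rnorm(30, mean = 540, sd = 50)

# Single vector: normality tests and a one-sample t-test.
sw  <- shapiro.test(a)
# Note: estimating mean and sd from the same data makes this p-value approximate.
ks  <- ks.test(a, "pnorm", mean = mean(a), sd = sd(a))
one <- t.test(a, mu = 500)

# Two independent vectors: compare means (Welch t-test) and variances (F test).
two <- t.test(a, b)
vt  <- var.test(a, b)

# Paired vectors: paired t-test and Wilcoxon signed-rank test.
paired_t <- t.test(a, b, paired = TRUE)
paired_w <- wilcox.test(a, b, paired = TRUE)

# Relation between two vectors: correlation test and a simple regression line.
ct  <- cor.test(a, b)
fit <- lm(b ~ a)
```

Each test returns a list whose `p.value` element holds the significance level, for example `two$p.value`; `coef(fit)` gives the intercept and slope of the regression line.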
Chapter five, "Clustering and classification", discusses methods for handling more than two vectors. With the help of 'principal component analysis' and 'factor analysis' it explores the relationship between the use of 27 derivational affixes and the type of text in which they appear. In the next two subsections we are introduced to 'correspondence analysis' and 'multidimensional scaling', used respectively to create a low-dimensional map of the data and to trace structure in a matrix of distances. The last subsection of this section (5.1.5) considers hierarchical cluster analysis, that is, techniques for clustering data and displaying the clusters in a tree diagram. The next section (5.2) moves from clustering to the classification of data. The first subsection teaches how to create a classification tree with CART (classification and regression tree) analysis. CART and related analyses are applied to the 'dative' data sets to find out whether the realization of the recipient as NP or PP can be predicted from other variables such as 'semantic class', 'length of theme' etc. The second subsection shows how linear discriminant analysis can be done in R to predict an item's class from a set of numerical predictors. The last subsection (5.2.3) demonstrates the use of the 'support vector machine' for classification.
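For readers unfamiliar with these techniques, the sketch below shows how principal component analysis, multidimensional scaling and hierarchical clustering might be run with base R functions. The matrix `x` is random data invented for this example (10 hypothetical texts by 4 hypothetical affix measures), not the affix data discussed in the book.

```r
# Hypothetical data: 10 texts (rows) by 4 invented affix measures (columns).
set.seed(3)
x <- matrix(rnorm(40), nrow = 10,
            dimnames = list(paste0("text", 1:10), paste0("affix", 1:4)))

# Principal component analysis on standardized columns.
pca <- prcomp(x, scale. = TRUE)
var_explained <- summary(pca)$importance["Proportion of Variance", ]

# Multidimensional scaling: a two-dimensional map from a distance matrix.
d   <- dist(x)
mds <- cmdscale(d, k = 2)

# Hierarchical clustering, then cutting the tree into three groups.
tree   <- hclust(d, method = "average")
groups <- cutree(tree, k = 3)
```

Calling `plot(tree)` would draw the tree diagram (dendrogram), and `biplot(pca)` would display the first two principal components.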
Chapter six takes up regression modeling, a topic introduced earlier in chapter four. This chapter discusses multiple linear regression and the related functions available in R, such as those for 'ordinary least squares regression'. Two subsections show how to deal with models with a nonlinear relation between independent and dependent variables and with datasets in which the independent variables are strongly correlated. The two following subsections deal with how to check whether the model that one arrives at is satisfactory. The third section introduces 'generalized linear models'. Two subsections in this section (6.3.1 and 6.3.2) show how binary responses can be handled in R with a logistic regression model and how ordered responses can be handled with ordinal logistic regression. Section 6.4 demonstrates how to deal with a discontinuity in an otherwise linear relation. The next section (6.5) shows how to study lexical richness. It introduces various functions available in R for finding the unique units in a dataset, comparing datasets and so on. The last section of this chapter discusses some general issues in using statistical models, with reference to the examples used in the chapter.
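As a small illustration of logistic regression for a binary response, the sketch below uses base R's `glm()`. The variables `predictor` and `response` are simulated and purely hypothetical. The book may use different functions and packages for these models, so this should be read as one possible approach, not as the book's own code.

```r
# Simulated data: one numeric predictor and a binary (0/1) response.
set.seed(7)
n         <- 200
predictor <- rpois(n, lambda = 3) + 1
# Assumed true relation for the simulation: log-odds rise with the predictor.
true_prob <- plogis(-1.5 + 0.5 * predictor)
response  <- rbinom(n, size = 1, prob = true_prob)

# Fit a logistic regression model (generalized linear model, binomial family).
fit <- glm(response ~ predictor, family = binomial)

# Coefficients are on the log-odds scale; see summary(fit) for standard errors.
coefs <- coef(fit)

# Predicted probability of response = 1 for a new observation.
p_new <- predict(fit, newdata = data.frame(predictor = 5), type = "response")
```

Because the coefficients are log-odds, `exp(coefs)` converts them to odds ratios, which are often easier to interpret.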
The last chapter is entitled "Mixed Models". The first four sections of this chapter deal with various strategies for building mixed models. Section 7.1 introduces the packages and functions in R for building mixed-effects models and illustrates their use on datasets. The next section compares mixed-effects models with traditional approaches such as quasi-F ratios, Latin square designs and traditional regression. The following section deals with BLUPs (best linear unbiased predictors), which are available in mixed-effects models but not in classical models, and which provide 'shrinkage' estimates for the by-subject and by-item adjustments. Section 7.4 discusses the mixed-model counterpart of the generalized linear model, the 'generalized linear mixed model'. The last section (7.5) presents case studies in which mixed models are put into practice.
The chapters are followed by two appendices. Appendix A provides solutions to the exercises that appear at the end of each chapter, and Appendix B gives an overview of R functions. There are four indexes: of datasets, of R, of topics and of authors.
EVALUATION This book is like a course or a tutorial on how to use R to analyze linguistic data. Linguistic research has successfully used quantitative tools for a long time now and there are quite a few introductory books dealing with the subject (e.g., Douglas 1943, Herdan 1964, Butler 1985, Woods, Fletcher & Hughes 1986, Tesitelová 1992, Rietveld & van Hout 1993, Kretzschmar & Schneider 1996, Paolillo 2002, Johnson 2008, Rasinger 2008). However, this field is developing rapidly and researchers need up-to-date knowledge about the new resources available. Since R is becoming one of the most widely used tools for statistical analysis in the social sciences, books exploring its utility in linguistics are the need of the hour. This book meets that requirement. Though there are other works dealing with this topic (cf. Johnson 2008), Baayen's introduction is valuable because, apart from being a practical introduction to statistics, it is also a thorough introduction to R, beginning with the downloading of the relevant packages for linguistics. It starts by introducing basic statistics and their use in R and progresses step by step to more sophisticated methods. This, apart from making it a very systematically organized book, makes it equally useful for linguists with a limited mathematical background and for those with sufficient expertise in statistical methods. With examples from a number of real data sets it demonstrates how to study linguistic data quantitatively. The exercises at the end of each chapter are very useful for practicing the functions introduced in that chapter. The separate indexes are also quite useful for researchers who need to look up specific R functions or topics to meet their research requirements. It is also enriching to be introduced to the actual datasets used in the course of this book.
However, I wish the datasets were more varied in type. Almost all the sets used here are morphological or lexical data sets. It would have been worthwhile to see more data oriented towards sociolinguistics or language teaching. I faced a few problems in installing the packages, as apparently the version of R that I had earlier (2.7.0) was not compatible with languageR, the package that is used extensively in this book. This problem was solved once I downloaded and installed version 2.7.1. All in all, in my opinion, this book succeeds in its aim of providing its readers with "a driving license for exploratory data analysis" (p. xi).
REFERENCES Butler, Christopher. (1985) _Statistics in linguistics_. New York: Blackwell Publishing.
Douglas, Chretien, C. (1943) _Quantitative Method for Determining Linguistic Relationships: Interpretation of Results and Tests of Significance_. Berkeley, CA: University of California Berkeley.
Herdan, Gustav. (1964) _Quantitative Linguistics_. London: Butterworths.
Johnson, Keith. (2008) _Quantitative methods in linguistics_. Malden, MA: Blackwell Publishing.
Kretzschmar, William A., Jr., and Edgar W. Schneider. (1996) _Introduction to Quantitative Analysis of Linguistic Survey Data: An Atlas by the Numbers_. Thousand Oaks, CA: Sage Publications.
Paolillo, John C. (2002) _Analyzing linguistic variation: Statistical models and methods_. Stanford, CA: Center for the Study of Language and Information.
Rasinger, Sebastian M. (2008) _Quantitative Research in Linguistics: An Introduction_. London and New York: Continuum International Publishing Group.
Rietveld, Toni & Roeland van Hout. (1993) _Statistics in language research: Analysis of variance_. Berlin and New York: Mouton de Gruyter.
Tesitelová, Marie. (1992) _Quantitative Linguistics_. Amsterdam and Philadelphia: John Benjamins.
Woods, Anthony, Paul Fletcher, and Arthur Hughes. (1986) _Statistics in Language Studies_. New York: Cambridge University Press.
ABOUT THE REVIEWER Dr Aditi Ghosh is a Lecturer at the Department of Linguistics at Calcutta University. Her current research interests include the impact of multilingualism, the relationship between society and language, linguistic politics and semantics. At present, she is engaged in two major research projects, one on language use and attitude and one on concepts in linguistics.
