LINGUIST List 19.3453: Discipline of Linguistics: Baayen (2008)

LINGUIST List 19.3453

Wed Nov 12 2008

Review: Discipline of Linguistics: Baayen (2008)

Editor for this issue: Randall Eggert <randylinguistlist.org>

1. Aditi Ghosh, Analyzing Linguistic Data

Message 1: Analyzing Linguistic Data
Date: 12-Nov-2008
From: Aditi Ghosh <aditi.ghgmail.com>
Subject: Analyzing Linguistic Data

Announced at http://linguistlist.org/issues/19/19-786.html

AUTHOR: Baayen, Harald
TITLE: Analyzing Linguistic Data
SUBTITLE: A Practical Introduction to Statistics using R
PUBLISHER: Cambridge University Press
YEAR: 2008

Aditi Ghosh, Department of Linguistics, University of Calcutta, Kolkata

SUMMARY
This book is a guidebook for researchers and students who want to use statistical computation to analyze linguistic data. It teaches how different types of linguistic data can be dealt with quantitatively using R, a free statistical environment that implements the S language originally developed at AT&T Bell Laboratories. The book is divided into seven chapters, each giving instruction on how to use R to analyze different types of linguistic data, starting from relatively simple problems and gradually leading to more complex ones. The first chapter, entitled ''An introduction to R'', gives, as anticipated, instructions on the basic handling of R: how to download the relevant packages, how to use the software in different operating systems, how to import or export datasets and even how to use R as a simple calculator. This chapter also teaches how to select a portion of the data out of a data set, how to order or sort a data frame, how to change specific information and how to extract relevant portions out of a data frame. Lastly, it shows how to perform basic calculations on a data frame, such as the mean or the sum of a numeric vector. The chapter, like all the other chapters in the book, is followed by a set of exercises, and the solutions to these are provided in Appendix A.

The second chapter, ''Graphical Data Exploration'', deals with producing graphical presentations of data. It starts with a brief definition of a random variable and goes on to show how bar plots and histograms can be produced and how curves can be added to a histogram. The chapter also shows how to plot ordered values and densities and how to generate boxplots. For two or more variables the more useful practice is to create a mosaic plot or a scatter plot, and the third section of this chapter deals with this. The final section introduces trellis graphics, in which data can be represented by many organized graphs at the same time.

The third chapter, ''Probability Distributions'', begins with an introduction to distributions, goes on to demonstrate how to deal with discrete and continuous distributions, and introduces the relevant functions available in R. It also introduces the Poisson distribution and different types of normal distribution. Lastly, it introduces three important continuous distributions, namely the t, F and chi-squared distributions, and the functions used in R for these distributions.

Chapter four is entitled ''Basic statistical methods''. The first two sections of this chapter introduce tests for single vectors and for two independent vectors. For single vectors, one can examine the distribution by plotting the density in different kinds of graphical representations, or one can use appropriate tests, such as the Shapiro-Wilk test for normality or the Kolmogorov-Smirnov one-sample test. To test the mean of a single vector one can use a t-test. To observe the distributions of two independent vectors, one can plot them as two differently colored lines. To test whether the means of the vectors in question are the same, one can use the boxplot function to inspect their distributions, or one may run a t-test to verify whether the two means are significantly different. R also has specific functions to test whether the variances of the two vectors in question are the same. For paired vectors, again the t-test and the Wilcoxon test can be used. To understand whether two vectors are significantly related, one can plot their individual points in a scatterplot and obtain a regression line. This chapter also shows how to evaluate the significance of a correlation between two variables. It discusses problems that arise in linear regression and how to address them. The chapter also shows how to examine the joint density of two paired vectors, and how to deal with one or two numerical vectors and a factor and with two vectors with counts. The final section explains how to estimate the significance of a statistical test.
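
To make the two-sample comparison described above concrete, here is a minimal sketch of the arithmetic behind a t-test for two independent vectors. It is not taken from the book, which works in R (where t.test applies the Welch correction by default); it is written in Python with only the standard library, and the function name welch_t and the reaction-time values are hypothetical, chosen purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's two-sample t-test on two independent samples."""
    nx, ny = len(x), len(y)
    # Squared standard error of each sample mean (sample variance / n)
    vx, vy = variance(x) / nx, variance(y) / ny
    # Difference of means scaled by the combined standard error
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Hypothetical reaction times in milliseconds for two groups of words
short_words = [512.0, 498.0, 530.0, 505.0, 520.0]
long_words = [548.0, 561.0, 539.0, 570.0, 552.0]

t, df = welch_t(short_words, long_words)
print(f"t = {t:.3f}, df = {df:.2f}")
```

The sketch stops at the test statistic and its degrees of freedom; turning these into a p-value requires the t distribution, which a statistical environment such as R supplies directly.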

Chapter five, ''Clustering and classification'', discusses methods of handling more than two vectors. With the help of 'principal component analysis' and 'factor analysis' it explores the relationship between the use of 27 derivational affixes and the types of texts in which they appear. In the next two subsections we are introduced to 'correspondence analysis' and 'multidimensional scaling', used to create a low-dimensional map of the data and to trace structure in a matrix of distances respectively. The last subsection of this section (5.1.5) considers hierarchical cluster analysis: techniques to cluster data and display them in a tree diagram. The next section (5.2) moves from clustering to classification of data. The first subsection teaches how to create a classification tree with CART (classification and regression tree) analysis. CART and related analyses are applied to the 'dative' data set to find out whether the realization of the recipient as NP or PP can be predicted from other variables such as 'semantic class', 'length of theme' etc. The second subsection shows how linear discriminant analysis can be done in R to predict an item's class from a set of numerical predictors. The last subsection (5.2.3) demonstrates the use of the 'support vector machine' for classification.

Chapter six takes up regression modeling, a topic which was introduced earlier in chapter four. This chapter discusses multiple linear regression and related functions available in R, such as those for 'ordinary least squares regression'. Two subsections show how to deal with models with a nonlinear relation between independent and dependent variables and with datasets where all the independent variables are strongly correlated. The two following subsections deal with how to check whether the model that one arrives at is satisfactory or not. The third section introduces 'generalized linear models'. Two subsections in this section (6.3.1 and 6.3.2) show how binary responses can be handled in R with a logistic regression model and how ordered responses can be handled with ordinal logistic regression. Section 6.4 demonstrates how to deal with a discontinuity in an otherwise linear relation. The next section (6.5) shows how to study lexical richness. It introduces various functions available in R to find the unique units in a dataset, to compare datasets etc. The last section in this chapter discusses some general issues in using statistical models with reference to the examples used in this chapter.
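
As a rough illustration of what counting the unique units in a dataset involves, the sketch below computes a few common lexical richness figures for a list of word tokens. The review does not say which measures the book uses, so the choice of type count, type-token ratio and hapax legomena is an assumption, as are the function name lexical_richness and the toy sentence. The sketch is in Python with the standard library only, not in R.

```python
from collections import Counter


def lexical_richness(tokens):
    """Return basic richness counts for a list of word tokens."""
    counts = Counter(tokens)
    n_tokens = sum(counts.values())  # running words
    n_types = len(counts)            # distinct words
    # Words that occur exactly once (hapax legomena)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "ttr": n_types / n_tokens if n_tokens else 0.0,
        "hapax_legomena": hapaxes,
    }


# Toy sentence, whitespace-tokenized; real corpora need proper tokenization
text = "the cat sat on the mat and the dog sat on the rug"
stats = lexical_richness(text.split())
print(stats)
```

The type-token ratio depends strongly on text length, so comparing datasets of different sizes with it needs care.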

The last chapter is entitled ''Mixed Models''. The first four sections of this chapter deal with various strategies for building mixed models. Section 7.1 introduces the packages and functions in R for building mixed-effects models and illustrates their usage on datasets. The next section compares mixed-effects models with traditional analyses such as quasi-F, Latin square designs and traditional regression (as against mixed-effects regression). The following section deals with BLUPs (best linear unbiased predictors), which are available in mixed-effects models, unlike classical models, and which provide 'shrinkage' estimates for the by-subject and by-item adjustments. Section 7.4 discusses the mixed-model counterpart of the generalized linear model, the 'generalized linear mixed model'. The last section (7.5) presents case studies where mixed models are put into practice.

The chapters are followed by two appendices. Appendix A provides solutions to the exercises that appear at the end of each chapter, and Appendix B gives an overview of functions for R. There are four indexes: of datasets, of R, of topics and of authors.

EVALUATION
This book is like a course or a tutorial on how to use R to analyze linguistic data. Linguistic research has successfully used quantitative tools for a long time now, and there are quite a few introductory books dealing with the subject (e.g. Chretien 1943, Herdan 1964, Butler 1985, Woods, Fletcher & Hughes 1986, Tesitelová 1992, Rietveld & van Hout 1993, Kretzschmar & Schneider 1996, Paolillo 2002, Johnson 2008, Rasinger 2008). However, this field is developing rapidly and researchers need up-to-date knowledge about the new resources available. Since R is becoming one of the most widely used tools for statistical analysis in the social sciences, books exploring its utility in linguistics are the need of the hour. This book meets that requirement. Though there are other works (cf. Johnson 2008) dealing with this topic, Baayen's introduction is valuable because, apart from being a practical introduction to statistics, it is also a thorough introduction to R, beginning with downloading the relevant packages for linguistics. It starts by introducing basic statistics and its use in R and progresses step by step to more sophisticated methods. This, apart from making it a very systematically organized book, makes it equally useful for linguists with limited mathematical background and those with sufficient expertise in statistical methods. With examples from a number of real data sets, it demonstrates how to study linguistic data quantitatively. The exercises at the end of each chapter are very useful for practicing the functions introduced in the respective chapters. The separate indexes are also quite useful for researchers who need to look up specific R functions or topics to meet their research requirements. It is also enriching to be introduced to the actual datasets used in the course of this book. However, I wish the datasets were more varied in type. Almost all the sets used here are morphological/lexical data sets. It would have been worthwhile to see more sociolinguistic or language-teaching-oriented data. I faced a few problems in installing the packages, as apparently the version that I had earlier (R 2.7.0) was not compatible with languageR, the package that is used extensively in this book. This problem was solved when I downloaded and installed version 2.7.1. All in all, in my opinion, this book succeeds effectively in its aim to provide its readers with ''a driving license for exploratory data analysis'' (p. xi).

REFERENCES
Butler, Christopher. (1985) _Statistics in linguistics_. New York: Blackwell Publishing.

Chretien, C. Douglas. (1943) _Quantitative Method for Determining Linguistic Relationships: Interpretation of Results and Tests of Significance_. Berkeley, CA: University of California Berkeley.

Herdan, Gustav. (1964) _Quantitative Linguistics_. London: Butterworths.

Johnson, Keith. (2008) _Quantitative methods in linguistics_. Malden, MA: Blackwell Publishing.

Kretzschmar, William A., Jr. and Edgar W. Schneider. (1996) _Introduction to Quantitative Analysis of Linguistic Survey Data: An Atlas by the Numbers_. Thousand Oaks, CA: Sage Publications.

Paolillo, John C. (2002) _Analyzing linguistic variation: Statistical models and methods_. Stanford, CA: Center for the Study of Language and Information.

Rasinger, Sebastian M. (2008) _Quantitative Research in Linguistics: An Introduction_. London and New York: Continuum International Publishing Group.

Rietveld, Toni & Roeland van Hout. (1993) _Statistics in language research: Analysis of variance_. Berlin and New York: Mouton de Gruyter.

Tesitelová, Marie. (1992) _Quantitative Linguistics_. Amsterdam and Philadelphia: Benjamins Publisher.

Woods, Anthony, Paul Fletcher, and Arthur Hughes. (1986) _Statistics in Language Studies_. New York: Cambridge University Press.

ABOUT THE REVIEWER
Dr Aditi Ghosh is a Lecturer at the Department of Linguistics at Calcutta University. Her current research interests include the impacts of multilingualism, the relationship between society and language, linguistic politics and semantics. At present, she is engaged in two major research projects: one on language use and attitude and one on concepts in Linguistics.