LINGUIST List 28.4143
Tue Oct 10 2017
Review: General Linguistics; Text/Corpus Linguistics: Eddington (2015)
Editor for this issue: Clare Harshey <clare@linguistlist.org>
Chiara Meluzzi <chiara.meluzzi
Statistics for Linguists: A Step-by-Step Guide for Novices
Book announced at http://linguistlist.org/issues/26/26-5673.html
AUTHOR: David Eddington
TITLE: Statistics for Linguists: A Step-by-Step Guide for Novices
PUBLISHER: Cambridge Scholars Publishing
REVIEWER: Chiara Meluzzi, Scuola Normale Superiore
REVIEWS EDITOR: Helen Aristar-Dry
David Eddington’s book “Statistics for Linguists: A Step-by-Step Guide for Novices” presents a detailed account of the main statistical methods used in linguistic analysis, using the software SPSS 20. From its subtitle, it is clear that the book is intended for newbies who know little or nothing about either statistics or the IBM software SPSS. Moreover, even though the graphs and figures refer to version 20 of the software, the author states in the Introduction that the instructions will also work for nearby versions, whether slightly older or newer.
The book consists of nine chapters preceded by a short introduction.
Chapter one “Getting to know SPSS” introduces the software from the very first steps (e.g., opening or saving a file in SPSS). This first chapter can be skipped by expert or semi-expert users of SPSS, as it is intended for real newbies. It also gives a first view of the overall structure of the book, which uses many figures to reproduce the software’s windows and dialogue boxes.
Chapter two “Descriptive and Inferential Statistics” introduces the fundamental concepts of quantitative analysis, namely data, variables, and descriptive and inferential statistics. First, the chapter presents the different types of data that can be used in SPSS. Categorical data, also called nominal or factor variables, are represented by named and unordered categories, such as levels of education, or ethnicity. It goes without saying that categorical data are the type most frequently used in linguistics. Ordinal data are represented by labeled intervals on a given scale of values. An example of ordinal data is the so-called Likert scale, which is often used in linguistic analysis, for instance in perception tests or in testing language attitudes. Continuous data are numeric variables, such as age, the duration of a fricative segment, or the number of months spent studying another language. All these types of data can serve as variables, either independent or dependent, but also as control and confounding variables. Control variables are those variables that we know may influence the results (e.g., textual genre and the presence/absence of different forms of past tense). Confounding variables are implicit factors of variation that may influence the results and, in some cases, undermine the whole structure of the analysis if they are not adequately considered in the research design. For instance, in preparing a word list for a phonetic experiment one must consider not only the specific variables addressed by the research question (e.g., phonological context, length of the word, surrounding vowels), but also other factors that could affect production and, as a consequence, the whole research design of the experiment (in this case, the influence of prosody in a repetitive, unnatural task such as word-list reading).
It is good practice to take into account and to control for as many confounding variables as possible, but Eddington himself admits that sometimes “we only find out about their existence after the data have been gathered and analyzed” (p. 9). Finally, the chapter introduces the basic distinction between descriptive statistics (i.e., how to describe and summarize your data and their distribution throughout the corpus), and inferential statistics (i.e., the possibility of applying the results obtained on a small sample of the population to the entire population). In descriptive statistics, different measures can be used according to the type of data: the mean, the median, the mode, the dispersion, etc. The chapter also explains how to calculate these values in SPSS, and how to visualize the data using histograms and boxplots generated by the software, by presenting pictures not only of the input window but also of the output window, with accurate descriptions of both. Moreover, Eddington spends some pages on the notion of the normal distribution (i.e., data following the characteristic bell curve), and on the tests used to verify whether data are normally distributed. As for inferential statistics, traditional experimental practice works with two hypotheses: the so-called null hypothesis states that there is no relationship between two variables, whereas the alternative hypothesis states that variable A influences the behavior of variable B. The main goal of inferential statistics is to reject the null hypothesis, and to confirm the existence (and, possibly, the strength) of the relationship between two variables, not only in a sample of data but in the entire population. A result is said to be statistically significant if its p value is at or below the chosen level of significance, or alpha level, conventionally .05. This corresponds to a probability of 5% or less of obtaining such a result by chance alone if the null hypothesis were true. Even though the simple p value has some limitations (pp. 22-23), it is nevertheless the most commonly used statistic.
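The descriptive measures listed above can be illustrated with a minimal sketch in Python rather than SPSS. It is not taken from the book; the fricative durations are invented values, and the standard-library `statistics` module stands in for the SPSS dialogue boxes Eddington describes.

```python
import statistics

# Hypothetical fricative durations in milliseconds (invented for illustration).
durations_ms = [82, 95, 95, 101, 110, 118, 140]

mean = statistics.mean(durations_ms)      # arithmetic average
median = statistics.median(durations_ms)  # middle value of the sorted data
mode = statistics.mode(durations_ms)      # most frequent value
stdev = statistics.stdev(durations_ms)    # sample standard deviation (dispersion)

print(f"mean={mean:.2f} median={median} mode={mode} sd={stdev:.2f}")
```

The mean sits above the median here because the single long token (140 ms) pulls it upward, which is the kind of skew a histogram or boxplot would make visible.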
Having established the background, Chapters Three through Seven consider the most frequently used statistical measures, by discussing some theoretical issues with concrete examples, as well as the input and output windows given by the software.
Chapter 3 “Pearson Correlation” moves into the statistical analysis of data, starting with continuous variables (e.g., percentage of monophthongs, or years lived outside South Britain). The chapter illustrates how to plot a correlation using scatter plots (pp. 27-29), and how to statistically evaluate the relationship between the two variables by using the Pearson correlation coefficient ‘r’, which ranges from +1 to -1. A negative correlation indicates an inverse relationship between the two variables, whereas a positive correlation indicates a direct relationship; a value of r = 0 means that there is no linear relationship between the two variables. The chapter also explores the difference between parametric and nonparametric statistics: the former rely on normally distributed data, whereas the latter make no assumption about a particular distribution and use models that adapt to the complexity of the data.
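To make the coefficient concrete, here is a small hand-rolled sketch in Python (again, not the SPSS procedure shown in the book). The function name `pearson_r` and the data are assumptions made for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of the deviations from each mean.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical data: years lived away vs. percentage of monophthongs.
years_away = [0, 2, 5, 8, 12, 20]
pct_monophthongs = [90, 85, 70, 60, 45, 20]

r = pearson_r(years_away, pct_monophthongs)
print(f"r = {r:.3f}")
```

With these invented numbers the coefficient comes out close to -1, i.e., a strong inverse relationship: the longer the time away, the fewer monophthongs.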
Chapter 4 “Chi-Square” presents the use of a goodness-of-fit chi-square to analyze data with a single variable (pp. 43-46), and the chi-square test of independence (p. 47) to analyze data with more than one variable. It is important to note that the chi-square is a nonparametric statistic and therefore does not require normally distributed data. The chi-square tells us whether there is an association between two categorical variables, whereas Cramer’s V is used to measure the strength of this association on a 0-to-1 scale, with 0 indicating no relationship. Both values can easily be calculated in SPSS by using the “Crosstabs” dialogue box.
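As a rough illustration of what the “Crosstabs” output summarizes, the following Python sketch computes the chi-square statistic for a test of independence and Cramér's V by hand. The function name and the counts are hypothetical, and the p value is deliberately left out because it would require a chi-square distribution function.

```python
import math

def chi_square_independence(table):
    """Return (chi2, degrees of freedom, Cramér's V) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    # Cramér's V rescales chi2 to a 0-to-1 effect size.
    k = min(len(table), len(table[0])) - 1
    cramers_v = math.sqrt(chi2 / (n * k))
    return chi2, dof, cramers_v

# Hypothetical counts: rows = two speaker groups, columns = variant A / variant B.
counts = [[30, 10],
          [10, 30]]

chi2, dof, v = chi_square_independence(counts)
print(f"chi2={chi2:.2f} dof={dof} V={v:.2f}")
```

For one degree of freedom the critical value at alpha = .05 is 3.84, so a statistic above that threshold would lead to rejecting the null hypothesis of independence.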
Chapter 5 “T-Test” presents the statistic used to evaluate the significance of the difference between two groups of values, with a continuous dependent variable and a categorical independent variable with two values. This test can be used to compare two groups of speakers, or the same group before and after a particular linguistic training, as is usually done in language acquisition research. As with the chi-square, once it has been established that the difference between the two groups is significant, it is possible to use Cohen’s d coefficient to calculate the effect size distinguishing the two groups. However, it is important to note that the t-test assumes normally distributed data. A Mann-Whitney test is more appropriate for skewed data, which may also contain outliers.
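The relation between the t statistic and Cohen's d can be sketched as follows in Python. This is an illustrative version of Student's independent-samples t-test that assumes equal variances; the function name and the scores are invented, and it is not the SPSS procedure from the book.

```python
import math
import statistics

def independent_t_and_cohens_d(group_a, group_b):
    """Student's t (equal variances assumed), its df, and Cohen's d."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    # Cohen's d: mean difference expressed in pooled standard deviations.
    d = (mean_a - mean_b) / math.sqrt(pooled_var)
    return t, n_a + n_b - 2, d

# Hypothetical test scores for a trained group and a control group.
trained = [78, 82, 85, 88, 92]
control = [70, 72, 75, 78, 80]

t, df, d = independent_t_and_cohens_d(trained, control)
print(f"t({df}) = {t:.3f}, Cohen's d = {d:.3f}")
```

The t value depends on sample size while d does not, which is why an effect size is reported alongside the significance test.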
Chapter 6 “ANOVA (Analysis of Variance)” presents several types of variance analysis: one-way, Welch’s, factorial, and repeated measures. Generally speaking, the name “analysis of variance” reflects the fact that the test considers both the mean of each set of data (or group of speakers) and the variance of the scores, that is, how much the scores are spread out from the mean (p. 65). Eddington also points out the importance of running a post hoc analysis to compare each group to the others and to test whether they are statistically different: he suggests using the Tukey or Scheffé test, depending on the number of scores, whereas the Dunnett test may be preferred when using a control group, which is often the case in linguistic analysis. However, one-way ANOVA assumes homoscedastic data, that is, data in which the variance is homogeneous across the groups being compared. This is also a major concern in regression analysis. If the data are heteroscedastic (i.e., the variances are unequal), Welch’s ANOVA is to be preferred. Factorial ANOVA is designed to check the effect of more than one independent variable, although the more variables there are, the harder the interpretation of the results becomes (p. 74). Finally, repeated measures ANOVA deals with sets of data in which the same subject is included in more than one group: the classic example is the phonetic analysis of different tokens repeated more than once by the same speakers. Repeated measures are fully addressed in Chapter 8, together with mixed-effect models.
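How a one-way ANOVA weighs group means against the spread of the scores can be shown with a short Python sketch. It computes only the F ratio and its degrees of freedom; the function name and the three groups of scores are assumptions for illustration, and post hoc tests are not covered.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_scores = [x for group in groups for x in group]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(group) / len(group) for group in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of the individual scores around their own group mean.
    ss_within = sum(
        (x - mean) ** 2
        for group, mean in zip(groups, group_means)
        for x in group
    )
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical scores for three groups of speakers.
scores = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]

f, df_b, df_w = one_way_anova_f(scores)
print(f"F({df_b}, {df_w}) = {f:.2f}")
```

A large F means the group means differ by more than the within-group spread would lead one to expect; which groups differ is what the post hoc tests then address.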
Chapter 7 “Multiple Linear Regression” explores what is perhaps the most important tool for linguistic analysis when dealing with several independent variables, either categorical or continuous. Eddington first explores the key issues in simple regression, then moves on to multiple regression. Particular emphasis is placed on the interpretation of the output, in terms of both the simple visualization of the SPSS charts and, more importantly, the correct evaluation of these numbers within a linguistic research design. The chapter also explains how to deal with categorical data in multiple regression analysis by using “dummy coding”, in which each coded variable is allowed only two values: for instance, yes/no questions or the sex of the speaker (male/female). The author emphasizes that it is important to dummy code the categorical variables in order to avoid error messages from the software (p. 93).
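Dummy coding can be sketched in a few lines of Python. The helper `dummy_code`, the column-naming scheme, and the example variables are all hypothetical and serve only to show the idea; SPSS users would recode the variable in the data editor instead.

```python
def dummy_code(values, reference):
    """Turn a categorical variable into 0/1 indicator columns.

    One column is created for every level except the reference level,
    so a variable with k levels yields k - 1 dummy columns.
    """
    levels = sorted(set(values) - {reference})
    return {f"is_{level}": [int(v == level) for v in values] for level in levels}

# Two-level variable: a single 0/1 column is enough.
sex = ["female", "male", "male", "female"]
sex_dummies = dummy_code(sex, reference="female")

# Three-level variable (hypothetical genres): two 0/1 columns are needed.
genre = ["news", "fiction", "academic", "news"]
genre_dummies = dummy_code(genre, reference="news")

print(sex_dummies)
print(genre_dummies)
```

The reference level is the one coded as all zeros, so each regression coefficient is read as the difference between that level and the reference.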
The final chapters present some more complex tools, whose use in linguistic research has been increasing recently. Chapter 8 deals with “Mixed-Effect Models: Analysis of Repeated (and Other Nested) Measures”. Because the topic is more complex, Eddington first presents a theoretical example, then follows it with hands-on examples in order to emphasize the possible uses of mixed models in linguistic research; he then moves on to explaining how to run the models in SPSS, as usual with an emphasis on the correct interpretation of the output offered by the software. Generally speaking, the main advantage of mixed-effect models (MEM) is that they provide a robust analytical approach for addressing problems associated with hierarchical data. MEMs can also handle missing data, and are less restricted in their applicability than ANOVA (as shown in Chapter 6). As Eddington points out, “when random factors are included, the results are considered more generalizable to other members of the random factors (other people and other test items) that we haven’t actually tested” (p. 118). It is evident that mixed models represent a very powerful tool for linguistic research, and it goes without saying that their use has been increasing in recent years.
Finally, Chapter 9 presents “Mixed-Effects Logistic Regression”. Logistic regression is particularly useful when your dependent variable is categorical, with two or more values: for instance, if you carry out a morpho-syntactic analysis of the distribution of periphrastic constructions vs. tense marking according to different contextual variables, as often happens in sociolinguistic research. As with mixed models, within logistic regression data can be analyzed in different ways, according to the different types of coding illustrated in the chapter (e.g., treatment coding, deviation coding). The important thing to keep in mind is, again, the basic assumptions of logistic regression: even though the calculation does not have the same requirements as those needed for continuous variables, it is good practice to have at least 20 observations for each independent variable included in the analysis. Otherwise, the results of the regression might be unreliable and inaccurate (p. 154). Like the preceding chapters, this one ends with a list of examples and exercises for applying logistic regression to real data deriving from a linguistic experiment.
The main aim of the book is really challenging: explaining statistics to novices with a focus on linguistic research and, at the same time, illustrating how to perform these analyses in a new piece of software (i.e., SPSS 20, or other versions). Even if complex and not exactly easy reading, Eddington’s book manages to achieve its goal by providing a very useful survey of the main statistical tools, moving from the most common ones (e.g., the chi-square or p-value) to the most complex and recently implemented ones, such as mixed models. One of the greatest advantages of the book, which definitely distinguishes it from other “introduction to statistics” books already available, is that it is specifically intended for linguists: this means that examples are taken from existing experiments, many of them conducted by the author himself, or from hypothetical experimental settings, without being limited to a specific linguistic subfield. Thus, the book introduces both the basic concepts of statistical research and the way to apply these concepts concretely using the software. For this reason, the book is also full of images showing both the input and output windows provided by SPSS: this is incredibly useful for “novices” approaching the IBM software for the first time. In contrast to other introductions to the software (e.g., Gray & Kinnear 2011), Eddington’s book presents only those images strictly needed for the explanation, without spending time on subtle mathematical details that are usually not of interest to the main audience.
However, a general question could arise with respect to the software chosen. In fact, there are other powerful tools for quantitative analysis, which also have the advantage of being free of charge. The first on the list is, obviously, R. Indeed, R is, to my knowledge, (one of) the most popular tools for doing statistics in linguistic research, and there are many introductions to statistics using this software (e.g., Baayen 2008). Although free, R presents some disadvantages, in particular for novices in statistics, the main one being that it is not very user-friendly: R works with strings of code, whereas SPSS offers a fancier and “Excel-like” environment which could be less intimidating to newbies, and could help occasional users of the software, who will not have to remember commands and code just to open a folder or create a simple frequency table. Finally, while there are many guides to statistics in R for linguistic research (e.g., the fundamental Baayen 2008), no such book existed for SPSS before Eddington’s guide. Of course, other scholars have used this software in specific fields of linguistic research (e.g., Larson-Hall 2010), and for introductions to the software one cannot fail to mention Andy Field’s books (e.g., Field 2009) or the official guides (e.g., Gray & Kinnear 2011). Eddington, however, clearly explains the potential of the software for linguistic research in general. Moreover, the book is small compared to Field (2009) or other guides, since it contains only the essential details needed for doing the analysis, illustrating the basic assumptions and goals of each test. In fact, SPSS is a powerful tool, which could really help researchers save time with analysis and graphs, but its output needs to be interpreted with precision.
Eddington’s book is in this respect a real lifesaver, since it explains how to interact with the software not only in giving instructions to the machine, but also in interpreting the results. What the book lacks is a final bibliographical section, covering both the general theoretical perspective (e.g., Johnson 2008; Bod, Hay & Jannedy 2003, just to quote a few) and its application in specific linguistic fields (for instance, Larson-Hall 2010 for second language research, which also uses SPSS; the old but gold Oakes 1998 for corpus linguistics; and Macaulay 2009 for sociolinguistics).
Moreover, Eddington is very aware of the risks of imprecise statistical research, particularly with regard to: (1) the use of the correct statistical tool for your set of data; (2) the correct interpretation of the output of your analysis, given the possible contradictions or unexpected results that may sometimes appear. In this respect, the author states that “correlations are only dangerous when people take them to show causation, because correlations show relationships but not necessarily causes” (p. 32). Eddington also gives the example of a correlation between height and IQ: even assuming that such a correlation might exist for a certain sample, this doesn’t mean that height causes intelligence. In short, one must always be cautious about one’s assumptions and about how statistics is used within a rigorous scientific paradigm. Similar caveats are offered throughout the different chapters of the book.
In conclusion, Eddington’s book represents an exceptional tool for understanding the possibilities of the quantitative paradigm in linguistic research. At the same time, the book represents a real “step-by-step” guide to performing statistical analysis in SPSS.
REFERENCES
Baayen, R. Harald. 2008. Analyzing Linguistic Data. A Practical Introduction to Statistics using R. Cambridge: Cambridge University Press.
Bod, Rens, Hay, Jennifer & Jannedy, Stephanie. 2003. Probabilistic Linguistics. Cambridge (MA): MIT Press.
Field, Andy. 2009. Discovering Statistics using SPSS (third edition). London: Sage.
Gray, Colin D. & Kinnear, Paul K. 2011. IBM SPSS Statistics 19 Made Simple. Hove & New York: Psychology Press.
Johnson, Keith. 2008. Quantitative Methods in Linguistics. London: Blackwell.
Larson-Hall, Jennifer. 2010. A Guide to Doing Statistics in Second Language Research Using SPSS. London: Routledge.
Macaulay, Ronald K.S. 2009. Quantitative Methods in Sociolinguistics. New York: Palgrave.
Oakes, Michael P. 1998. Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
ABOUT THE REVIEWER
Chiara Meluzzi is a postdoctoral fellow at Scuola Normale Superiore in Pisa (Italy) with a project on gestural coordination and speech rhythm in Italian disyllables. She mainly works in sociophonetics, experimental phonetics and sociolinguistic research on Italian. Her main publications include various articles on the sociophonetic distribution of Italian dental affricates in Bolzano/Bozen, an analysis of rhotic variation in an Italian and Sicilian bilingual corpus (Loquens, 3:1, in collaboration with C. Celata and I. Ricci), and a pragmatic analysis of personal pronouns in Ancient Greek comedies (Pragmatics 26:3).
Page Updated: 10-Oct-2017