Review of Using Corpora to Analyze Gender

Reviewer: Liang Zhao
Book Title: Using Corpora to Analyze Gender
Book Author: Paul Baker
Publisher: Bloomsbury Publishing (formerly The Continuum International Publishing Group)
Linguistic Field(s): Discourse Analysis; Linguistic Theories; Text/Corpus Linguistics
Issue Number: 27.282

Reviews Editor: Helen Aristar-Dry


The book is mainly about how corpus linguistics uses specialist software to identify linguistic patterns in large computerized collections of text. It critically explores two ways in which corpus linguistics techniques can aid the analysis of language and gender: gendered usage (e.g. how males and females use language) and gendered representations (e.g. how males and females are written or spoken about). Six case studies illustrate the pros and cons of different methods of choosing, creating, tagging and analyzing a corpus, with reasoned interpretation and argument about the research designs and tools used. It is strongly recommended to students, faculty and researchers specializing in sociolinguistics, language and gender, gender studies, corpus linguistics and sociology, and to others who have an interest in gender, language and corpus research.

A range of techniques and measures is discussed in the book, including frequencies, keywords, collocations, dispersion, word sketches, downsizing and triangulation, all in an accessible style and with the help of case studies on topics that include directives in spoken conversations, changes in sexist and non-sexist language use over time, personal adverts, press representation of gay men, and the ways that boys and girls are constructed through language. Detailed illustrations of the six case studies not only provide a comprehensive understanding of how a corpus can be deployed in gender research, but also help realize the author’s intention of strengthening the dialogue between gender researchers and corpus linguists, empowering gender researchers ‘to feel confident in building and exploiting corpora, while encouraging corpus linguists to incorporate some of the more recent thinking about Gender and Language into their own studies’ (7).

By tracing how Gender and Language research has come a long way from finding gender differences to examining gendered discourses, the author points out that detailed qualitative discourse analysis, which is based on small excerpts of text, can be combined with techniques from Corpus Linguistics, which work well on large amounts of data, sometimes millions or even billions of words. How, then, to build a corpus? How to analyze it with tools such as WordSmith, AntConc, R, SPSS or Microsoft Excel, and with statistical measures such as log-likelihood (LL)? A brief introduction to the steps and methods of building and analyzing a corpus is provided, accompanied by an illustration of two important measures: frequency lists and keyness. Two techniques related to frequency, collocation and concordance, are explained, the former being based on quantitative analysis and the latter on functional qualitative reasoning. The role of Corpus Linguistics in the field of Gender and Language research, together with these how-to questions, makes up the content of Chapter 1.
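As a rough illustration of my own (not taken from the book), the idea of keyness can be sketched in a few lines of Python. The sketch assumes the commonly used two-corpus log-likelihood formulation, in which each word's observed frequency in the two corpora is compared with the frequency expected if the word were spread evenly across both. The function names `log_likelihood` and `keywords` and the toy token lists are hypothetical and stand in for what tools such as WordSmith or AntConc do on real corpora.

```python
import math
from collections import Counter


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Two-corpus log-likelihood score for one word (assumed formulation)."""
    total = freq_a + freq_b
    # Expected frequencies if the word were distributed evenly across corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    # A zero frequency contributes nothing (the limit of x * ln(x) is 0).
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def keywords(tokens_a, tokens_b, top=10):
    """Rank words by keyness; label the corpus in which each is over-used."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    scored = []
    for word in set(counts_a) | set(counts_b):
        fa, fb = counts_a[word], counts_b[word]
        score = log_likelihood(fa, fb, size_a, size_b)
        # Compare relative frequencies to see which corpus favours the word.
        label = "A" if fa / size_a > fb / size_b else "B"
        scored.append((word, score, label))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top]


if __name__ == "__main__":
    corpus_a = "she said she would".split()
    corpus_b = "he said he would".split()
    for word, score, label in keywords(corpus_a, corpus_b):
        print(f"{word}\t{score:.2f}\t{label}")
```

A word that is equally frequent in both corpora scores zero, while a word confined to one corpus scores highly. On real data the score still has to be read alongside dispersion and concordance lines, a caution the book itself repeats.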

Both Chapter 2 and Chapter 3 address the issue of identity and language usage. Chapter 2 takes a largely corpus-based approach to the issue of linguistic gender difference. Drawing on male and female speech in the BNC, the author concludes that, at the level of words, males and females share a great deal of language use, with a few generalizable differences which can be attributed to social context and dispersion patterns.

Chapter 3 is a corpus-driven study. By focusing only on the disagreement expressions of female academic supervisors, rather than comparing them against anything else, the author argues that such a perspective frees researchers from thinking in terms of gendered ‘over-use’ or ‘under-use’. Examining how women perform a potentially face-threatening act (FTA), disagreeing, in the relatively liberal setting of a university, the author finds that a range of different strategies is employed and that no single strategy covers all cases of disagreement. Both this chapter and Chapter 2 indicate ‘variation within the sexes, which is filtered through context’ (203).

Chapters 4, 5 and 6 deal with gender representation. In Chapter 4, which focuses on a number of aspects of sexist (and non-sexist) language in a large diachronic corpus of American English stretching back to the early nineteenth century, the author finds quantifiable male bias of several kinds: relational identification (the tendency for males to be talked about more than females), male firstness (the tendency for males to be mentioned first), and genericization (the tendency for males to be referred to as generic humans). What is more, the English language has more pejorative terms for women than for men, and even the male equivalents that exist do not carry a similar negative force.

Chapter 5 turns to representation, considering how gay people are written about in a corpus of articles taken from the popular British newspaper The Daily Mail. This is a replication of research on Public Discourse of Gay Men (Baker 2005), in which Baker revisits older data and redoes the analysis. Compared with the earlier research, the overall outcome is the same, apart from a few important points that his ten-years-younger self missed. The analysis of two sets of articles, taken from 2001-2 and 2008-9 and containing the words ‘gay’ and ‘homosexual’, finds that a noticeable shift in discourse has taken place: ‘the more negative discourse associating gay people with shame, crime, violence, promiscuity and sleaze are being replaced with those which acknowledge the concept of gay rights and relationships and homophobia’ (205). The author also mentions that analyzing expanded concordance lines enables the identification of features that can run over multiple sentences, such as legitimation strategies.

Chapter 6 takes a more in-depth look at collocates by addressing different methods of collocation and by considering questions such as how large the collocational span should be, and whether a confidence-based or hypothesis-testing technique (or both) ought to be used. Using one of Sketch Engine’s preloaded corpora, the ukWaC British English corpus, the author compares collocational relationships for the words ‘boy’ and ‘girl’ (and their plurals) in order to identify similar and different ways in which these identities are consistently constructed. For example, boys are more often evaluated by their behavior while girls are evaluated by their appearance. Interpretation, explanation and critical evaluation of the findings are important, for they encourage the researcher not to take for granted the ‘gender differences’ paradigm or men’s privilege in society, and they prompt reflection on how to diminish the reproduction of stereotypes.
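To make the question of collocational span concrete, here is a minimal sketch of my own, not the procedure used in the book or in Sketch Engine. It counts the words that occur within a fixed window around a node word and scores them with one common formulation of mutual information (MI), namely the base-2 logarithm of observed over expected co-occurrence. The function name `collocates`, the `span` and `min_freq` parameters, and the toy sentence are all hypothetical.

```python
import math
from collections import Counter


def collocates(tokens, node, span=4, min_freq=2):
    """Return (word, co-occurrence count, MI) for words near the node."""
    word_freq = Counter(tokens)
    co_freq = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Collect up to `span` tokens on each side of the node word.
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        co_freq.update(left + right)

    n = len(tokens)
    window_size = 2 * span
    results = []
    for word, observed in co_freq.items():
        if observed < min_freq:
            continue  # MI is unreliable for very rare co-occurrences
        # Co-occurrences expected by chance, given both overall frequencies.
        expected = word_freq[node] * word_freq[word] * window_size / n
        mi = math.log2(observed / expected)
        results.append((word, observed, mi))
    # Strongest association first; ties broken alphabetically.
    results.sort(key=lambda item: (-item[2], item[0]))
    return results


if __name__ == "__main__":
    text = "the naughty boy and the pretty girl saw a naughty boy"
    for word, count, mi in collocates(text.split(), "boy", span=1):
        print(f"{word}\t{count}\t{mi:.2f}")
```

Widening `span` picks up more distant associations at the cost of more noise, and MI is known to favour low-frequency words, which is one reason for combining a confidence-based measure with a hypothesis-testing one, as the chapter discusses.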

Chapter 7 brings together some of the themes already addressed and blurs the distinction between language use and gender representation by examining three small corpora of heterosexual men’s personal advertisements taken from the website Craigslist. The aim is to find out how heterosexual men use language to attract female sexual or romantic partners, as well as how men from different backgrounds, i.e. Australians, Indians and Singaporeans, construct themselves in the adverts. Since techniques for comparing more than two corpora are not yet widely used, the author gives a detailed explanation of the steps, methods and tools for creating and tagging the corpora for this research. As he says, different kinds of corpus analysis techniques will produce different types of results. That is why the analysis involves a triangulation of methods on the same dataset, incorporating analyses of key semantic tags, collocational networks and concordance lines.

In the concluding chapter, Chapter 8, the author summarizes the main findings of the book, reflecting on the research outcomes of the individual chapters, critically evaluating the different methods used, and addressing potential limitations of using corpus methods to analyze gender.


This book is an updated and expanded version of Baker’s Using Corpora in Discourse Analysis (2006). Compared with the 2006 work, this one not only focuses more on the interrelation between gender and language, but also updates and expands some of Baker’s ideas about discourse- or socially oriented corpus linguistics; for example, it introduces some corpora and tools which were not available in 2006, such as COHA and Sketch Engine.

First, the author gives a detailed explanation of how to use corpus methods in gender studies, while remaining reflective and critical about those methods. By highlighting what they can and cannot do, Baker points out that researchers need to be careful when using corpus methods, for otherwise these methods may lead them to mistake pre-existing facts for newly drawn conclusions.

Second, the author corrects a popular misunderstanding that Corpus Linguistics is only about numbers and calculations; he explains how to combine quantitative analysis with qualitative reasoning and shows how productive and interesting such a combination can be. For example, in Chapter 5, the author analyzes qualitatively how the negative discourses in the articles from The Daily Mail are legitimated. He also stresses that it is important to consider whether further analysis is needed to evaluate the findings in terms of ‘who benefits’.

Third, the author suggests that the macro analysis should always be put in a micro social context and be correlated with micro discourses. For example, in Chapter 6, the analysis of the words ‘boy’ and ‘girl’ is placed within an overarching representational framework, such as van Leeuwen’s framework (1996) or Sunderland’s ‘gender discourse’ (2004).

Fourth, Baker emphasizes the importance and advantages of methodological triangulation, i.e. approaching a research project in multiple ways. In Chapter 7 he takes three small corpora of adverts from Craigslist and tries out three methods of uncovering something interesting about gender in them: one based on comparisons of key semantic tags, another which uses frequent self-descriptors and considers how they relate to each other, and a final technique which involves a qualitative examination of concordance lines in order to identify gendered discourses.

Finally, the author mentions more than once the importance of researchers engaging with non-academics and the wider media, not only to diminish public stereotypes about gendered language use but also to bring about social change in gendered discourse as a whole. His belief that ‘if our research is not aimed at improving people’s lives or their environment in some ways, there is little point in carrying out research’ is very inspiring, for it is always the duty and responsibility to make a change that motivates us as researchers to go further.

Overall, I found Baker’s book quite manageable and pleasant to read. Its very clear structure makes it accessible. Having years of experience in using corpora in discourse analysis (Baker 2005, 2006, 2009, 2010; Baker, Gabrielatos and McEnery 2013), Baker makes good use of that experience in describing and illustrating what corpus linguistics can offer to sociolinguists interested in the relationship between language and gender.

Baker states that part of the reason for writing the book is to address researchers who either come from corpus linguistics and want to look at gender, or do research on gender and want to use corpus methods. He makes it clear that the ‘gender differences’ paradigm is the first thing we need to change, and that it can be changed through greater integration of gender studies and corpus methods. Such a combination would dispose of the popular metaphors about Mars and Venus, as well as the nonsensical curiosity as to whether men or women say ‘I love you’ more.


I am interested in language and gender, language and identity, language and globalization, and sociolinguistics. Currently I am a PhD candidate in the Institute of Linguistics and Applied Linguistics, Peking University. I am also an assistant professor at the Northwest University for Nationalities in China.

