Review of Lexical Diversity and Language Development

Reviewer: Philip McCarthy
Book Title: Lexical Diversity and Language Development
Book Author: David D. Malvern, Ngoni Chipere, Brian J. Richards, Pilar Durán
Publisher: Palgrave Macmillan
Linguistic Field(s): Morphology; Language Acquisition
Issue Number: 16.47

Date: Mon, 3 Jan 2005 11:47:13 -0600
From: Philip McCarthy
Subject: Lexical Diversity and Language Development: Quantification and

AUTHORS: Malvern, David D.; Chipere, Ngoni; Richards, Brian J.; Durán,
TITLE: Lexical Diversity and Language Development
SUBTITLE: Quantification and Assessment
PUBLISHER: Palgrave Macmillan
YEAR: 2004

Philip M. McCarthy, Department of English (Linguistics), and the Institute
for Intelligent Systems (IIS), the University of Memphis.

"Lexical Diversity and Language Development: Quantification and
Assessment" is, predominantly, a summary of David Malvern and Brian
Richards' last seven years' work on the lexical richness measure known
as 'D'. The measure D, it is argued, is the most reliable measure of
lexical diversity and is particularly useful for measuring short
transcripts such as those produced by young children. The book is of
interest to researchers working in the areas of language acquisition,
English as a second language (ESL), aphasiology, or any other field where
the quantification of language deployment (lexical diversity) is a factor.

Lexical diversity, reported throughout this book (despite the title) as
lexical richness, is one of the greatest linguistic enigmas -- if a rather
unsung one. In brief, we have long known that people of different ages and
abilities, and different texts for different purposes, appear to produce
significantly different degrees of lexical diversity. No one, for
instance, would argue that Shakespeare was less diverse in his vocabulary
deployment than would be a typical five-year-old child. And by the same
token, we all seem to intuitively know that works by such authors as Joyce
or Tolstoy are lexically richer than are works by, say, Hemingway or
Steinbeck. Despite such appearances, however, no one has yet been able to
produce a measure that is capable of scoring such differences meaningfully
and accurately: It is as if we were all aware of differences in
temperature, had tacitly agreed what constituted heat, and yet had been
unable to invent the thermometer. What Malvern et al. are offering us,
therefore, is the best yet attempt at a lexical diversity thermometer.

Malvern et al.'s book is organized into four parts. The first, and main
part of the book, serves to explain the concept of lexical richness, to
outline why lexical richness is such a tricky and elusive measurement, to
explain where and how lexical richness measures have been employed, to
discuss the various types of lexical richness measures that have been
proposed, to show where and why these measures fail to reliably account
for lexical richness, and, most importantly, to introduce and discuss the
measure known as D. Part II offers a collection of previously published
papers that serve to support the authors' claims as to D's reliability.
Part III offers a look at other considerations for lexical richness
measures, and part IV is a brief overview and conclusion.

The book's review of previously proposed measures of lexical richness is
probably the most thorough ever published. The authors begin by explaining
the underlying problem of basic lexical richness measures, such as the
type-token ratio (TTR). In brief, types are the distinct words used in a
text, whereas tokens are the individual instances of those words. Thus, the
sentence "the big dog chased the small dog" has five types and seven
tokens, with the types "the" and "dog" having two tokens each. The problem,
as Malvern et
al. explain, is that as a text increases in length the likelihood of new
types being introduced decreases. Consequently, the longer a text is, the
lower the TTR is likely to be.
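
The arithmetic behind TTR, and its sensitivity to text length, can be
illustrated with a minimal Python sketch. The example text and the function
name type_token_ratio are illustrative assumptions, not taken from the book,
and repeating a text is only a crude stand-in for the way new types become
rarer as a real text grows.

```python
def type_token_ratio(tokens):
    """Return the number of distinct word forms (types) divided by
    the number of running words (tokens)."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty text")
    return len(set(tokens)) / len(tokens)


# Hypothetical example text: 8 tokens, 5 types (the, cat, saw, on, mat).
short_text = "the cat saw the cat on the mat".split()
short_ttr = type_token_ratio(short_text)

# Doubling the text adds tokens but no new types, so TTR halves.
long_text = short_text * 2
long_ttr = type_token_ratio(long_text)

print(len(set(short_text)), len(short_text), short_ttr)
print(len(set(long_text)), len(long_text), long_ttr)
```

The two texts use exactly the same vocabulary, yet the longer one scores
half the TTR of the shorter one, which is why raw TTR values from texts of
different lengths cannot be compared directly.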

Over the years, numerous alternatives to TTR have been proposed, and
Malvern et al. explain each with great clarity. Mathematically manipulated
lexical richness scores such as RootTTR (G) and Corrected-TTR (C),
logarithmic variations of lexical richness such as R and H, and frequency
based measures such as Z and K are all explained, dissected, and
discredited. Malvern et al. show the problems with each measure through
theoretical and empirical approaches. The studies of Jarvis (2002) and
Tweedie and Baayen (1998) form a good deal of the empirical testing that
has shown problems with other lexical richness measures, and where theory
rather than empirical evidence discredits the measures, Malvern et al. go
to great lengths themselves to explain the problems.
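
For readers unfamiliar with the mathematically manipulated scores, the
following Python sketch shows the two square-root transformations. The
formulas are the commonly cited forms of Root TTR (Guiraud) and Corrected
TTR (Carroll) and are assumed here; this review does not state them, and
the function names and example text are hypothetical.

```python
import math


def _counts(tokens):
    """Return (number of types, number of tokens) for a token list."""
    if not tokens:
        raise ValueError("measure is undefined for an empty text")
    return len(set(tokens)), len(tokens)


def root_ttr(tokens):
    """Root TTR: types divided by the square root of tokens."""
    types, n = _counts(tokens)
    return types / math.sqrt(n)


def corrected_ttr(tokens):
    """Corrected TTR: types divided by the square root of twice
    the number of tokens."""
    types, n = _counts(tokens)
    return types / math.sqrt(2 * n)


# Hypothetical example text: 8 tokens, 5 types.
text = "the cat saw the cat on the mat".split()
print(root_ttr(text), corrected_ttr(text))
```

Both transformations slow the decline of the score as the token count
grows, but neither removes the dependence on text length, which is the
weakness the authors press.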

Part I builds towards the most complex method of obtaining a lexical
richness score: the "curve fitting" approach of Sichel (1986). It is
largely on the basis of this model that Malvern et al. have composed their
measure of D. Like Sichel's model, D operates by trying to fit empirical
data, derived from TTR scores, to a theoretical TTR curve. D differs from
Sichel in a number of ways: of primary importance is that D operates by
taking hundreds of samples of data and averaging them to fit an ideal TTR
curve. Because of the complexity of D, the freely available vocd software
(MacWhinney, 2000) is used to make the calculation. The 18 pages dedicated
to D's development are highly enlightening and clearly the book's
most important section. Despite the fact that much of what is written in
this section has been said in previously published journal articles
(Malvern & Richards, 1997; McKee, Malvern & Richards, 2000), the
thoroughness and clarity with which the development of D is relayed here
make this section well worth the read.
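
The sample-average-and-fit idea can be sketched in Python. This is a
minimal illustration under stated assumptions, not the vocd implementation:
the curve equation TTR = (D/N) * (sqrt(1 + 2N/D) - 1), the 100 draws per
sample size, and sampling without replacement follow general published
descriptions of the method rather than anything stated in this review, and
the grid search is a simple stand-in for whatever optimizer vocd uses. All
function names are hypothetical. The 35 to 50 token sample sizes are those
reported later in this review.

```python
import math
import random


def model_ttr(n, d):
    """Theoretical TTR for n tokens under diversity parameter d.
    Assumed curve: TTR = (D/N) * (sqrt(1 + 2N/D) - 1)."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)


def mean_ttr_points(tokens, sizes=range(35, 51), draws=100, seed=0):
    """Average the TTR of `draws` random subsamples at each sample size."""
    if len(tokens) < max(sizes):
        raise ValueError("text is shorter than the largest sample size")
    rng = random.Random(seed)
    points = []
    for n in sizes:
        # Each draw samples n tokens without replacement (an assumption).
        ttrs = [len(set(rng.sample(tokens, n))) / n for _ in range(draws)]
        points.append((n, sum(ttrs) / draws))
    return points


def fit_d(points):
    """Grid-search d from 1.0 to 200.0 in steps of 0.1 for the value that
    minimises squared error between the model curve and the points."""
    def squared_error(d):
        return sum((model_ttr(n, d) - ttr) ** 2 for n, ttr in points)

    candidates = [i / 10 for i in range(10, 2001)]
    return min(candidates, key=squared_error)


def estimate_d(tokens, seed=0):
    """Sketch of the whole procedure: sample, average, then fit."""
    return fit_d(mean_ttr_points(tokens, seed=seed))
```

Calling estimate_d on a list of at least 50 word tokens returns a single
value; under the assumed curve, a higher value corresponds to a higher
expected TTR at every sample size, that is, to greater diversity.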

If part I is the synthesis and expansion of D's genesis (Malvern &
Richards, 1997; McKee et al., 2000; Durán, Malvern, Richards, & Chipere,
2004) then Part II is simply the collection and reprinting of more recent
papers (Malvern & Richards, 2002; Richards & Malvern, 2004). The four
chapters forming Part II provide empirical evidence supporting D and its
operating methodology: Chapters 4 and 5 focus on measures of D across
different corpora, Chapter 6 offers compelling evidence on the
inadequacies of assessment examination testing as opposed to the
reliability of results produced by D, and Chapter 7 investigates how
variations in how words are lemmatized for analysis can lead to markedly
differing results. While these chapters would have been more convincing
had there been more work from other researchers, Malvern et al.'s own
breadth of experimentation and investigation is quite forceful. Hopefully,
more research will soon be underway to support even further these initial

Part III of the book compares lexical richness to other methods for
assessing texts: type-type ratios (as opposed to type-token ratios), for
example, are considered. Evidence compiled here suggests that
investigations into the diversity of parts of speech are also a product of
text length and that, once again, D may provide the best answers. In Part
III, the authors also expand the investigation of D's reliability into
written texts concluding that the measure effectively discriminates across
ages and developmental levels. Part IV is a bare six-page overview and
conclusion. The brevity is somewhat disturbing as one would imagine the
potential for future research involving lexical richness and D would be
vast. And it would certainly seem apparent that far more testing of D
would be undertaken. That said, Malvern et al. do take this opportunity to
once more drive home the importance of an accurate measure of lexical
richness, and they once more go to great pains to show how numerous
previous studies using flawed measures of lexical richness have led to
results that must now be seriously questioned (for example, see Le Normand
& Cohen, 1999; Ouellet, Cohen, Le Normand, & Braun, 2000; and Delaney-
Black et al., 2000). Even studies as recent as Ertmer, Strong, and
Sadagopan (2003) use TTR of differing text lengths and quote the
questionable "norms" of Templin (1957). Malvern et al. show their clear
concern by writing:

These things matter. Much of the research based on flawed measures has
significant implications for theory, practice, and policy. It is important
therefore that the methodological issues of measuring vocabulary richness
are understood and that these confusions are cleared up.

The authors' conclusion also acknowledges a few of D's problems: problems
involving topic change and rhetorical styles that confound the curve
fitting approach of D. Such problems are not dwelt upon, however, and it
would be fair to assume that later analyses of D will be somewhat more

The authors' claim that previous lexical diversity measures are unreliable,
and their evidence for that claim, are well made. It would be hard to believe that
following such work any previously published approach could now win favor
as the lexical richness measure of choice. Unfortunately, whether D itself
is truly capable of carrying the crown is also, as we shall see, less than

As the book is essentially an advertisement for D, rather than a
disinterested history of lexical richness, criticism and potential
problems with D are less than boldly stated. The main problem for D lies
in its limitations caused by the attempt to satisfy its primary aim. As
stated above, this aim is to offer a reliable measure of lexical richness
for short samples of transcripts. The problem for Malvern et al. is that
while other measures of lexical richness are particularly weak at
measuring short samples, in establishing a measure that actually does
accomplish the task, Malvern et al. appear to have made a measure that is
only accurate for short samples. In other words, we must ask whether the
baby has been thrown out with the bathwater. A closer look at how D is
calculated may show why this is so.

Malvern et al. use the vocd system to sample items from the available
data. These samples are between 35 and 50 tokens in length. As such, the
minimum transcript size is 50 words; however, Malvern et al. claim that
they cannot guarantee lexical richness for samples this small. Thus, the
lower end of reliability for the measure is not made clear -- except to
say that it must be above 50 tokens. Similarly, Malvern et al. cannot
claim that D is reliable for longer texts. In fact, they place their upper
limit at an unspecified "few hundred" tokens. The first question to ask,
therefore, is, if D is reliable then where exactly is it reliable? The
transcript borders are not that far apart (greater than 50 tokens but less
than a few hundred), yet if the border areas are so murky then researchers
would seriously have to wonder whether their data were of a suitable
length for D.

The next issue is that Malvern et al. recommend using only stem forms in
any lexical analysis so as to reduce the potential for confounding
results. They further recommend controls for testing participants so that
conversational topics do not diverge greatly. Perhaps most worrying,
however, is that they base the primary evidence of empirical testing on a
corpus of 32 transcripts from children of just 2;8 years of age (Durán et
al., 2004).

Such limited borders of transcript size, based on the production of such
young children, from such a small corpus, with only stem forms recommended
for fear of confounding D, do not yet secure faith that D is the most
reliable (or the most robust) of lexical richness measures.

We can look at Owen and Leonard's (2002) study for supporting concerns
over D. In this work, it was concluded that D may not be a reliable
measure of lexical richness. Owen and Leonard's transcripts were divided
into sample sizes of 100, 250 and 500 tokens, but when measured for lexical
richness, differing D scores were produced. Jarvis (2002), despite knowing
of D, chose to use an earlier D incarnation (see Malvern and Richards
1997) and was quite critical of the theoretical underpinning of the latest
version of D (the one used in this book). The earlier D, used by Jarvis
(2002), was quite successful at predicting lexical richness measures;
however, the texts used in his study all had fewer than 400 words, and an
alternative measure, U, actually performed better. Silverman and Bernstein
Ratner (2002), on the other hand, do provide support for D, and Owen and
Leonard (2002), while finding fault with D, still mention that it is a
promising tool. On the whole, however, while Malvern and his colleagues
continue to turn out positive studies on D, the wider community has not
yet reached the same level of enthusiasm.

With a relatively limited use for the measure D, it is extremely hard to
see how the measure could become the standard for lexical richness. That
said, whatever the weaknesses of D, it does appear to be more reliable
than any other available measure for texts of shorter length. Researchers
would certainly be strongly advised to, at least, include D in their
measurements, whatever the text size. However, with data of differing text
length, or from different sources, researchers are equally strongly
advised to interpret results with great care. While D itself may yet have
a number of problems to overcome, while Malvern et al. may well have been
a shade generous in their assessment of D, and while this book appears to
promise much discussion on lexical diversity but in the end serves more as
a commercial for a single measure, the book itself is nonetheless clearly
the best (and indeed the only) book on lexical diversity currently
available. Its competitors, Yule (1944) and Herdan (1960), have long been
out of date, and a more recent offering by Baayen (2001) neither comes
close to the expansive history offered by Malvern et al., nor does it
focus on diversity so much as it does distribution. The significance of
the differences between the two approaches may best be described by
stating that neither author sees fit to mention the other's work. In sum,
Lexical Diversity and Language Development makes a good attempt to fill a
gaping hole in linguistic enquiry; however, whether its proposed product
lives up to its authors' faith will only be revealed if greater research
in this area (and through this method) is undertaken.


Philip McCarthy moved to the United States in 2001 having spent 11 years
as an English teacher in England, Turkey and Japan. In 2003, he graduated
with a Master's degree in English (Linguistics) from The University of
Memphis, and he is currently conducting research for his Ph.D. in applied
linguistics at the same university. Philip's primary work concerns lexical
and textual diversity algorithms though he has also published work on
child readers and the application of cohesion measures across genres.
Philip is currently working as a research assistant on three grants at the
Institute for Intelligent Systems at the FedEx Institute for Technology:
iSTART, CohMetrix, and the iMAP project. His primary responsibilities are
corpus analyses and programming. Philip teaches a variety of linguistics,
ESL and composition courses. He is also working on a number of software
projects including a phoneme acquisition application, and temporal and
structural cohesion algorithms. When not working, Philip coaches one of
Memphis's most successful soccer teams: Strangers FC.

