LINGUIST List 15.1171

Sat Apr 10 2004

Review: Software: Wordstat, v. 4

Editor for this issue: Naomi Ogasawara

What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Sheila Dooley Collberg at


  1. Danko Sipka, WordStat v. 4

Message 1: WordStat v. 4

Date: 10 Apr 2004 16:44:30 -0000
From: Danko Sipka
Subject: WordStat v. 4

WordStat v. 4, content analysis module for SimStat, Provalis Research
Danko Sipka, Critical Languages Institute, Arizona State University
WordStat v. 4, content analysis module for SimStat by Normand
Péladeau of Provalis Research, belongs to a
sizable family of software packages intended for content
analysis. Brief online compendiums
with links to competing software packages offer the possibility of
exploring the user's options in this field. Owing to length
restrictions, this review will address WordStat in isolation, without
comparing it with other available packages.
Because content analysis, like other branches of applied
linguistics, is frequently marginalized by work in
English as a Second Language (ESL), a definition of the approach is
in order. According to Neuendorf (2002) content analysis is "the
systematic, objective, quantitative analysis of message
characteristics." Applications of this approach in linguistics are
very broad, ranging from forensic and intelligence work to
establishment of authorship. More information about the field is
available at
Sociologists and psychologists have traditionally done the work in
this area, yet a solid rationale exists for linguists to engage in
content analysis in a considerably broader and deeper manner than has
been the case in the past.
The qualifiers "systematic, objective, quantitative" from the
aforementioned definition constitute the standard each content
analysis package should meet. In addition, such analysis should be
made available to a broad range of researchers (with varied information
technology backgrounds, subject matter, methodologies, and language
interests) and performed within a reasonable time. "Easy yet powerful
tools for research and teaching", the commercial slogan of Provalis
Research, captures these requirements in an elegantly concise manner.
WordStat is a module, which means that it needs to be executed by
another stand-alone application (in this case SimStat or QDA Miner).
Because using one of these two programs to initiate WordStat
requires only a few simple steps, from the user's perspective the
module vs. application difference is merely an academic one. The
consumer's perspective is quite different, as this feature requires a
bundle purchase.
WordStat uses a database as the source for analysis. It is easy to
import files to this database, and adequate support for conversions
from most common formats (Excel, MS Access, SPSS, etc.) is
provided. The lack of Unicode support is a major
disadvantage in this segment of the program. While most texts can be
automatically converted from the Unicode standard, there are still
some specialized linguistic texts (e.g., any use of a wide variety of
IPA symbols) which would
require considerable searching and replacing to be prepared for use in
WordStat.
Although primarily intended for analysis that applies coding schemata
to plain text files, the program also supports analysis of manually
entered codes (e.g., if the researcher marks a segment of text as
[irony], [facetious] or applies any other pragmatic or semantic tags).
Typically and as demonstrated by a sample set of data included in the
program, the database comprises one or more independent variables
(normally categories) and one or more dependent variables (message,
text, etc.). Thus the demonstration data contains two independent
variables (sex and age) and one dependent variable (texts of personal
ads). Once the variables have been chosen and WordStat initiated, a
series of analyses can be performed.
Although numerous analyses can be performed without a categorization
dictionary, building such a dictionary is the key to any insightful
and significant research. The sample categorization dictionary
accompanying the aforementioned personal ads data illustrates its
importance. It contains the
following top-level categories: appearance, arts, communication,
education, family, finance, humor, nightlife, outdoor, sexuality,
spirituality, sports, work. Each category comprises concrete lexical
items. The category of appearance, for example, contains the following
lexemes: athletic, attractive, beautiful, beauty, body,
ex-pro-athlete, good-looking, muscular, physique, proportionate,
slender, and slim. With this dictionary in place, the user can
perform the analysis for the entire category. One can thus tabulate
the frequencies of the words relating to appearance in the ads
placed by males and females, create a concordance for all
lexemes related to appearance, etc. At a higher level of analysis, one
can use the entire category as the dependent variable. For example, if
a research hypothesis states that males emphasize appearance more than
females, the user can select the sex of the subjects as the
independent variable and the appearance category as the dependent
variable and compute Pearson's correlation coefficient. The sample
data corroborates the hypothesis by yielding a moderately strong,
statistically significant correlation between the two variables.
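The workflow just described can be approximated outside WordStat in a
few lines of Python. What follows is only a minimal sketch under stated
assumptions: the six ads are invented for illustration (they are not
the WordStat sample data), the tokenization is deliberately naive, and
the names ADS, DICTIONARY, category_count and pearson_r are
hypothetical. Sex is coded 0/1, so Pearson's r here is the
point-biserial correlation between sex and the per-ad count of
appearance words.

```python
from math import sqrt

# Hypothetical miniature dataset: (sex, ad text). NOT the WordStat sample data.
ADS = [
    ("male", "Athletic man seeks slim attractive woman"),
    ("male", "Seeking a beautiful slender partner who likes sports"),
    ("male", "Engineer enjoys hiking and good books"),
    ("female", "Teacher enjoys theatre, books and travel"),
    ("female", "Attractive woman seeks a man with a sense of humor"),
    ("female", "Looking for an honest partner who loves family"),
]

# Category -> lexemes, following the appearance example in the review.
DICTIONARY = {
    "appearance": {
        "athletic", "attractive", "beautiful", "beauty", "body",
        "ex-pro-athlete", "good-looking", "muscular", "physique",
        "proportionate", "slender", "slim",
    },
}


def category_count(text, lexemes):
    """Count the tokens of `text` that belong to the category's lexeme set."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return sum(1 for t in tokens if t in lexemes)


def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Independent variable: sex (male = 1, female = 0).
sex = [1 if s == "male" else 0 for s, _ in ADS]
# Dependent variable: number of appearance lexemes in each ad.
counts = [category_count(text, DICTIONARY["appearance"]) for _, text in ADS]
r = pearson_r(sex, counts)
print(counts, round(r, 3))
```

A real analysis would also need a significance test and far more cases;
the sketch only shows how a category, rather than a single word, serves
as the dependent variable.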
Recognizing the importance of categorization schemata in content
analysis, the developer has provided excellent support for the English
language. The spell-checking dictionary and in particular the stemmer
(lemmatization dictionary) considerably facilitate the analysis. The
same is true about the exclusion list (it normally contains synsemantic
lexical items such as prepositions, conjunctions, etc.). Most
importantly, several ready-made categorization dictionaries are
available, most notably the schemata based on Roget's Thesaurus and the
WordNet lexical database. While the former resource features full
functionality, the latter is somewhat limited by the fact that the
top-level includes parts of speech rather than content categories and
by the fact that adjectives and adverbs are articulated in a less
elaborate manner. Thus, if one is interested in concepts related to
communication, one will get two values, for nouns and verbs
respectively, while adverbs and adjectives are excluded. What a
researcher would want is the result for the entire category, with
adverbs and adjectives included. WordNet
is a superb linguistic resource which offers more categorization
possibilities than are utilized in WordStat. On the other hand, WordNet is
used masterfully in the dictionary building module. From the WordStat
dictionary screen, one can take any available dictionary and ask the
program, using the suggest button, to assess which new words should be
added to the existing categories. The Advanced mode searches for all
possible relationships in the WordNet database (synonyms, paronyms,
hypernyms, coordinate terms, etc.) for new words or phrases and
presents those words in descending order of relevance. This allows
swift development of comprehensive and highly intricate dictionaries.
Several additional categorization schemata are also available,
including the Regressive Imagery Dictionary, the Linguistic Inquiry and
Word Count dictionary, and the Forest Value dictionary.
Support for languages other than English is varied and far less
abundant. Spellcheckers are available for a number of languages (e.g.,
Spanish, German, Russian, Polish, Hungarian, etc.). The data from
these spell checkers can also be used to identify word-formation
clusters and use them in developing categorization dictionaries. The
stemmer is available for French in addition to English. A workaround
is to use the categorization dictionary to lemmatize
foreign-language lexemes, yet this technique requires the lemmatization
dictionary to be incorporated in the categorization schema. In
addition, the display of non-Western scripts needs to be configured
in the Windows language settings rather than in the application
itself. An additional disadvantage for users interested in the
content analysis of languages other than English is that WordStat
lacks Unicode capability.
Keeping categorization dictionaries in separate text files is an
excellent solution. It allows more advanced users to save
preparation time by supplying ready-made categorization text files
rather than having to use the dictionary editor and type in words and
categories. Other users in turn are adequately supported by the dictionary
editor and building tools. Categorization dictionaries are flexible in
providing multiple hierarchical levels, they are presented in a clear
format, and they are easy to use. This flexibility and simplicity
are clear fortes of WordStat.
Once a categorization schema is in place and options are selected,
there are four major areas of analysis: Frequencies, Crosstab,
Key-Word-In-Context, and Phrase Finder. The latter two options are well known
to linguists. Phrase Finder can be used to identify n-grams
(combinations of two or more word forms), while Key-Word-In-Context
creates a KWIC concordance. A major disadvantage of Phrase Finder
is that it provides only global frequencies of n-grams; in
order to have them broken down by the values of the independent
variable(s), one needs to go back to the concordance for each n-gram
individually or store them in a temporary categorization dictionary.
The concordance offers the standard functions found in concordancers
(e.g., in Concordance), with
additional handling of related variables. Both Phrase Finder and
the concordancer remain stable and take a reasonable amount of time even
with relatively large sets of data.
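For readers unfamiliar with the two procedures, the following Python
sketch shows what an n-gram count and a KWIC concordance amount to, and
how n-gram frequencies could be kept apart for each value of an
independent variable (the breakdown missed above). It is an
illustration only and makes no claim about how WordStat works
internally; the two cases and the names tokenize, ngrams, kwic, CASES
and by_group are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip edge punctuation."""
    tokens = (t.strip(".,;:!?\"'()").lower() for t in text.split())
    return [t for t in tokens if t]


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Hypothetical cases: (value of the independent variable, text).
CASES = [
    ("male", "Good sense of humor and a good sense of style"),
    ("female", "A sense of humor is a must"),
]

# Bigram frequencies kept separately for each value of the variable.
by_group = {}
for group, text in CASES:
    by_group.setdefault(group, Counter()).update(ngrams(tokenize(text), 2))

for group, counter in by_group.items():
    print(group, counter.most_common(3))

# A small concordance for one keyword in the first case.
for left, key, right in kwic(tokenize(CASES[0][1]), "sense"):
    print(f"{left:>20} | {key} | {right}")
```

Grouping the counters by the independent variable at counting time is
what removes the need to revisit the concordance for each n-gram
individually.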
The area of analysis titled Frequencies tabulates keyword frequencies
as counts and percentages and provides the option of conducting
cluster analysis using two different measures and representing the results in
the form of a dendrogram, 2D and 3D maps, a proximity plot, and a table.
All these options function seamlessly if the dataset is limited. Large
sets of keywords mean longer processing times and higher memory and CPU
speed requirements.
The most useful and diversified tool of content analysis is found in
the Crosstab area. This option cross-tabulates keywords from the
categorization schema with the independent variables. A range of
statistical procedures is available (Chi-square, Likelihood ratio,
Student's F, Tau-a/b/c, Somers' D/Dxy/Dyx, Gamma, Spearman's Rho,
Pearson's R, etc.). While the available statistical procedures serve
their function well, it would be useful to mark statistically
significant probability values as is usually done in statistical
packages (e.g., Statistica), for example
by marking p<.01 with two asterisks and p<.05 with one. Lack of direct
access to ANOVA is a drawback of this module. In order to test a causal
link between the two variables once a statistically significant
correlation has been attested, one needs to filter and export the data
and perform the analysis in SimStat (i.e., the main program engine).
Other highly useful procedures, such as factor analysis or the
Herfindahl-Hirschman concentration index, are not implemented in the module.
Clustering and correspondence statistics, on the other hand, are a very
strong point of this statistical module. Various measures and modes of
presentation (dendrograms, heatmaps, and tables) can accommodate any
user more than adequately.
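The asterisk convention suggested above is easy to illustrate. The
Python sketch below computes Pearson's chi-square for a 2x2 cross
tabulation and flags the probability value with one or two asterisks.
It is a sketch under stated assumptions, not a description of
WordStat's output: the counts in TABLE are invented, the function names
are hypothetical, no continuity correction is applied, and the
closed-form p-value used here holds only for one degree of freedom.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With one degree of freedom the upper-tail probability is erfc(sqrt(x/2)).
    p = erfc(sqrt(chi2 / 2))
    return chi2, p


def stars(p):
    """Two asterisks for p < .01, one for p < .05, none otherwise."""
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Hypothetical counts. Rows: male, female.
# Columns: ads with / without an appearance keyword.
TABLE = [[30, 20], [15, 35]]

chi2, p = chi_square_2x2(TABLE)
print(f"chi2 = {chi2:.3f}, p = {p:.4f} {stars(p)}")
```

For larger tables the degrees of freedom change, and the p-value would
have to come from a general chi-square distribution function rather
than this shortcut.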
Another forte of WordStat is a powerful case filtering engine which
allows selection of cases according to a selected logical condition.
The engine supplies a wide range of functions and operators which can
support even the most intricate filtering requests.
A simple research project was conducted in order to test the
functionality of WordStat for linguists. It was hypothesized that
Russian dictionaries from the nineteenth century exhibit a higher
degree of discursiveness than their twentieth-century
counterparts. The decreasing level of discursiveness goes together
with the professionalization and formalization of lexicographic techniques. To
test the hypothesis, the letter A sections of approximately equal size
from two general monolingual Russian dictionaries, Dal' (1862-1866)
and Ozhegov-Shvedova (1992), were imported into the database, with the
century of the dictionary as the independent variable (twentieth
century 0, nineteenth century 1) and the entries of the two
dictionaries as the dependent variable. With no Russian
stemmer available, the simplest manner of measuring discursiveness was
to use closed inflected classes, such as relative pronouns, and
uninflected sets, such as prepositions and conjunctions. These
words were therefore included under the categorization schema category titled
Markers of discursiveness. Pearson's correlation coefficient was then
used to test the hypothesis. The results corroborated the
hypothesis in that a statistically significant correlation was
found between the two variables (century when the dictionary was
published and markers of discursiveness).
Both this test and general perusal of the WordStat module show that
this software lives up to its corporate motto. It is indeed an easy
yet powerful research tool. Suggestions for improvements in the
ensuing versions of the software include Unicode support, better
support for languages other than English, reorganization of the
WordNet-based categorization dictionary, inclusion of additional
statistical procedures, and implementation of more efficient
algorithms to better accommodate large sets of data.
To conclude, WordStat is an excellent content analysis yardstick with
obvious potential to become, mutatis mutandis, a ruler.
The reviewer extends his gratitude to Dina Anani for proofreading this
review.

Dal', V. (1862-1866) Tolkovyj slovar' zhivogo velikorusskogo jazyka,
St. Petersburg
Ozhegov, S.V. and N.Ju. Shvedova (1992) Tolkovyj slovar' russkogo
jazyka, Moskva
Neuendorf, Kimberly A. (2002) The content analysis guidebook, Thousand
Oaks, Calif.: Sage Publications
Danko Sipka holds a PhD and
Habilitation in Slavic Linguistics and a doctorate in Psychology. He
is a research associate professor and the acting director of the
Arizona State University Critical Languages Institute.
His numerous publications include the
recent volumes Serbo-Croatian-English Colloquial Dictionary (2000)
and A Dictionary of New Bosnian, Croatian, and Serbian Words (2002).