LINGUIST List 15.1171

Sat Apr 10 2004

Review: Software: Wordstat, v. 4

Editor for this issue: Naomi Ogasawara

What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in.

If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Sheila Dooley Collberg at


  • Danko Sipka, WordStat v. 4

    Message 1: WordStat v. 4

    Date: 10 Apr 2004 16:44:30 -0000
    From: Danko Sipka
    Subject: WordStat v. 4

    WordStat v. 4, content analysis module for SimStat, Provalis Research

    Danko Sipka, Critical Languages Institute, Arizona State University

    WordStat v. 4, a content analysis module for SimStat by Normand Péladeau of Provalis Research, belongs to a sizable family of software packages intended for content analysis. Brief compendiums with links to competing software packages are available online and let users explore their options in this field. Owing to restrictions on length, this review will address WordStat in isolation, without comparing it with other available packages.

    Inasmuch as content analysis, along with other branches of applied linguistics, is frequently marginalized by work in English as a Second Language (ESL), a definition of the approach is in order. According to Neuendorf (2002), content analysis is "the systematic, objective, quantitative analysis of message characteristics." Applications of this approach in linguistics are very broad, ranging from forensic and intelligence work to the establishment of authorship. More information about the field is available online. Sociologists and psychologists have traditionally done the work in this area, yet a solid rationale exists for linguists to engage in content analysis in a considerably broader and deeper manner than has been the case in the past.

    The qualifiers "systematic, objective, quantitative" from the aforementioned definition constitute the standard each content analysis package should meet. In addition, such analysis should be accessible to a broad range of researchers (with varied information technology backgrounds, subject matter, methodologies, and language interests) and should be performed within a reasonable time. "Easy yet powerful tools for research and teaching", the commercial slogan of Provalis Research, captures these requirements in an elegantly concise manner.
WordStat is a module, which means that it needs to be executed by another stand-alone application (in this case SimStat or QDA Miner). Since launching WordStat from either of these two programs takes only a few simple steps, from the user's perspective the module vs. application difference is merely an academic one. The consumer's perspective is quite different, as this feature requires a bundle purchase.

WordStat uses a database as the source for analysis. It is easy to import files into this database, and adequate support for conversions from the most common formats (Excel, MS Access, SPSS, etc.) is provided. The lack of Unicode support is a major disadvantage in this segment of the program. While most texts can be automatically converted from the Unicode standard, there are still some specialized linguistic texts (e.g., those using a wide variety of IPA symbols) which would require considerable searching and replacing to be prepared for use in WordStat.

Although primarily intended for analysis that applies coding schemata to plain text files, the program also supports analysis of manually entered codes (e.g., if the researcher marks a segment of text as [irony] or [facetious], or applies any other pragmatic or semantic tags). Typically, and as demonstrated by a sample set of data included in the program, the database comprises one or more independent variables (normally categories) and one or more dependent variables (message, text, etc.). Thus the demonstration data contains two independent variables (sex and age) and one dependent variable (texts of personal ads). Once the variables have been chosen and WordStat has been launched, a series of analyses can be performed. Although numerous analyses can be performed without a categorization dictionary, building such a dictionary is the key to any insightful and significant research.
The sample categorization dictionary accompanying the aforementioned personal ads data illustrates the importance of such a dictionary. It contains the following top-level categories: appearance, arts, communication, education, family, finance, humor, nightlife, outdoor, sexuality, spirituality, sports, work. Each category comprises concrete lexical items. The category of appearance, for example, contains the following lexemes: athletic, attractive, beautiful, beauty, body, ex-pro-athlete, good-looking, muscular, physique, proportionate, slender, and slim. With this dictionary in place, the user can perform the analysis for the entire category. One can thus tabulate the frequencies of words relating to appearance in the ads placed by males and females, create a concordance for all lexemes related to appearance, etc.

At a higher level of analysis, one can use the entire category as the dependent variable. For example, if a research hypothesis states that males emphasize appearance more than females, the user can select the sex of the subjects as the independent variable and the appearance category as the dependent variable and compute Pearson's correlation coefficient. The sample data corroborates the hypothesis: a moderately strong, statistically significant correlation is found between the two variables.

Recognizing the importance of categorization schemata in content analysis, the developer has provided excellent support for the English language. The spell-checking dictionary and in particular the stemmer (lemmatization dictionary) considerably facilitate the analysis. The same is true of the exclusion list (it normally contains synsemantic lexical items such as prepositions, conjunctions, etc.).
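The category-level analysis described above can be sketched in a few lines of Python. This is only an illustration of the idea, not WordStat's implementation: the miniature dictionary, the exclusion list, the four ads, and all function names are invented for the example, and the correlation is computed on a binary coding of sex (M=1, F=0).

```python
import math
import re

# Hypothetical miniature categorization dictionary (category -> lexemes).
CATEGORIES = {
    "appearance": {"athletic", "attractive", "beautiful", "slender", "slim"},
    "humor": {"funny", "humor", "witty"},
}
# Hypothetical exclusion list of function words.
EXCLUDED = {"a", "and", "for", "with", "who", "is", "the"}

# Invented cases: independent variable (sex) and dependent variable (ad text).
ADS = [
    ("M", "Athletic man seeks attractive, slim and beautiful partner"),
    ("M", "Funny guy looking for a slender woman"),
    ("F", "Witty woman seeks partner with humor"),
    ("F", "Attractive professional seeks funny companion"),
]


def tokenize(text):
    """Lowercase the text, split it into word forms, drop excluded words."""
    return [w for w in re.findall(r"[a-z'-]+", text.lower()) if w not in EXCLUDED]


def category_count(text, category):
    """Count the tokens of `text` that belong to `category`."""
    return sum(1 for w in tokenize(text) if w in CATEGORIES[category])


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Category frequencies per value of the independent variable.
totals = {}
for sex, text in ADS:
    totals[sex] = totals.get(sex, 0) + category_count(text, "appearance")

# Correlation between sex (coded M=1, F=0) and per-ad category counts.
sex_codes = [1 if sex == "M" else 0 for sex, _ in ADS]
counts = [category_count(text, "appearance") for _, text in ADS]
r = pearson_r(sex_codes, counts)
print(totals, round(r, 3))
```

With real data one would also want a significance test and some normalization for ad length. Both are left out here to keep the sketch short.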
Most importantly, several ready-made categorization dictionaries are available, most notably the schemata based on Roget's Thesaurus and the WordNet lexical database. While the former resource features full functionality, the latter is somewhat limited by the fact that the top level consists of parts of speech rather than content categories and by the fact that adjectives and adverbs are articulated in a less elaborate manner. Thus, if one is interested in concepts related to communication, one will get two values, for nouns and verbs respectively, while adverbs and adjectives will be excluded. What a researcher would want is a result for the entire category, with adverbs and adjectives included. WordNet is a superb linguistic resource which offers more categorization possibilities than are utilized in WordStat.

On the other hand, WordNet is used masterfully in the dictionary building module. From the WordStat dictionary screen, one can take any available dictionary and ask the program, using the Suggest button, to assess which new words should be added to the existing categories. The Advanced mode searches all possible relationships in the WordNet database (synonyms, paronyms, hypernyms, coordinate terms, etc.) for new words or phrases and presents those words in descending order of relevance. This allows swift development of comprehensive and highly intricate dictionaries. Several additional categorization schemata are also available, including the Regressive Imagery Dictionary, the Linguistic Inquiry and Word Count dictionary, and the Forest Value dictionary.

Support for languages other than English is varied and far less abundant. Spell checkers are available for a number of languages (e.g., Spanish, German, Russian, Polish, Hungarian). The data from these spell checkers can also be used to identify word-formation clusters, which can then be used in developing categorization dictionaries. The stemmer is available for French in addition to English.
There is a backdoor solution that uses the categorization dictionary to lemmatize foreign-language lexemes, yet this technique requires the lemmatization dictionary to be incorporated into the categorization schema. In addition, settings for displaying non-Western scripts must be configured in the Windows language settings rather than in the application itself. An additional disadvantage for users interested in the content analysis of languages other than English is that WordStat lacks Unicode capability.

Keeping categorization dictionaries in separate text files is an excellent solution. It allows more advanced users to save preparation time by supplying ready-made categorization text files rather than having to use the dictionary editor and type in words and categories. Other users in turn are adequately supported by the dictionary editor and building tools. Categorization dictionaries are flexible in providing multiple hierarchical levels, they are presented in a clear format, and they are easy to use. This flexibility and simplicity are among the clear strengths of WordStat.

Once a categorization schema is in place and options are selected, there are four major areas of analysis: Frequencies, Crosstab, Key-Word-In-Context, and Phrase Finder. The latter two options are well known to linguists. Phrase Finder can be used to identify n-grams (combinations of two or more word forms), while Key-Word-In-Context creates a KWIC concordance. A major disadvantage of Phrase Finder is that it provides only global frequencies of n-grams; in order to have them broken down by the values of the independent variable(s), one needs to go back to the concordance for each n-gram individually or store the n-grams in a temporary categorization dictionary. The concordance offers the standard functions found in concordancers (e.g., in Concordance), with additional handling of related variables.
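To make the two linguist-oriented tools concrete, the following Python sketch builds a small KWIC concordance and counts n-grams separately for each value of an independent variable, which is the breakdown the review finds missing in Phrase Finder. The cases and the function names are invented for illustration and do not reflect how WordStat itself is implemented.

```python
import re
from collections import Counter, defaultdict

# Invented cases: (value of the independent variable, text).
CASES = [
    ("M", "looking for a long term relationship with a kind woman"),
    ("F", "seeking a long term relationship and a good sense of humor"),
    ("F", "a good sense of humor is a must"),
]


def tokens(text):
    """Split text into lowercase word forms."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(cases, keyword, width=3):
    """Return (group, left context, keyword, right context) for each hit."""
    rows = []
    for group, text in cases:
        words = tokens(text)
        for i, word in enumerate(words):
            if word == keyword:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                rows.append((group, left, word, right))
    return rows


def ngrams_by_group(cases, n=2):
    """Count n-grams separately for each value of the independent variable."""
    counts = defaultdict(Counter)
    for group, text in cases:
        words = tokens(text)
        for i in range(len(words) - n + 1):
            counts[group][" ".join(words[i:i + n])] += 1
    return counts


# Print a concordance with the keyword aligned in one column.
for row in kwic(CASES, "relationship"):
    print("%s | %25s | %s | %s" % row)

# Trigram frequencies for one group only.
trigrams = ngrams_by_group(CASES, n=3)
print(trigrams["F"].most_common(2))
```

Keying the counts by group from the start avoids a second pass through the concordance for each n-gram.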
Both Phrase Finder and the concordancer remain stable and take a reasonable amount of time even with relatively large sets of data.

The area of analysis titled Frequencies tabulates keyword frequency as counts and percentages and provides the option of conducting cluster analysis using two different measures and representing the results in the form of a dendrogram, 2D and 3D maps, a proximity plot, and a table. All these options function seamlessly if the dataset is limited. Large sets of keywords mean longer processing time and higher memory and CPU speed requirements.

The most useful and diversified tool of content analysis is found in the Crosstab area. This option cross-tabulates keywords from the categorization schema with the independent variables. A range of statistical procedures is available (Chi-square, Likelihood ratio, Student's F, Tau-a/b/c, Somers' D/Dxy/Dyx, Gamma, Spearman's Rho, Pearson's R, etc.). While the available statistical procedures serve their function well, it would be useful to mark statistically significant probability values as is usually done in statistical packages (e.g., Statistica), for example by marking p<.01 with two asterisks and p<.05 with one. The lack of direct access to ANOVA is a drawback of this module. In order to test a causal link between the two variables once a statistically significant correlation has been attested, one needs to filter and export the data and perform the analysis in SimStat (i.e., the main program engine). Other highly useful procedures, such as factor analysis or the Herfindahl-Hirschman concentration index, are not implemented in the module. Clustering and correspondence statistics, on the other hand, are a very strong point of this statistical module. Various measures and modes of presentation (dendrograms, heatmaps, and tables) can accommodate any user more than adequately.

Another strength of WordStat is a powerful case filtering engine, which allows selection of cases according to a chosen logical condition.
The engine supplies a wide range of functions and operators that can support even the most intricate filtering requests.

A simple research project was conducted in order to test the functionality of WordStat for linguists. It was hypothesized that Russian dictionaries from the nineteenth century exhibit a higher degree of discursiveness than their twentieth-century counterparts. The declining level of discursiveness goes together with the professionalization and formalization of lexicographic techniques. To test the hypothesis, the letter A sections of approximately equal size from two general monolingual Russian dictionaries, Dal' (1862-1866) and Ozhegov-Shvedova (1992), were imported into the database, with the century of the dictionary as the independent variable (twentieth century 0, nineteenth century 1) and the entries of the two dictionaries as the dependent variable. With no Russian stemmer available, the simplest manner of measuring discursiveness was to use closed inflected classes, such as relative pronouns, and uninflected sets, such as prepositions and conjunctions. These words were therefore included in the categorization schema under a category titled Markers of discursiveness. Pearson's correlation coefficient was then used to test the hypothesis. The results corroborated the hypothesis in that a statistically significant correlation was found between the two variables (the century in which the dictionary was published and markers of discursiveness).

Both this test and a general perusal of the WordStat module show that this software lives up to its corporate motto. It is indeed an easy yet powerful research tool. Suggestions for improvements in future versions of the software include Unicode support, better support for languages other than English, reorganization of the WordNet-based categorization dictionary, inclusion of additional statistical procedures, and implementation of more efficient algorithms to better accommodate large sets of data.
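The suggestion made earlier, to flag significant probability values with asterisks in crosstab output, can be illustrated with a short Python sketch. It is not WordStat's procedure and does not reproduce the reviewer's test, which used Pearson's correlation coefficient. Chi-square on a 2x2 table is swapped in here because it is one of the statistics listed for the Crosstab area and needs only the standard library. The counts are invented and the function names are hypothetical.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value). With one degree of freedom the
    p-value equals erfc(sqrt(statistic / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic, math.erfc(math.sqrt(statistic / 2))


def stars(p):
    """Mark p < .01 with two asterisks and p < .05 with one."""
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Invented counts: rows = century (19th, 20th),
# columns = (marker tokens, all other tokens).
table = [[120, 880], [80, 920]]
chi2, p = chi_square_2x2(table)
print("chi2 = %.3f, p = %.4f%s" % (chi2, p, stars(p)))
```

Treating every token as an independent observation is a simplification. A real study would have to consider whether that assumption holds for dictionary entries.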
To conclude, WordStat is an excellent content analysis yardstick with obvious potential to become, mutatis mutandis, a ruler.

ACKNOWLEDGMENT

The reviewer extends his gratitude to Dina Anani for proofreading this text.

REFERENCES

    Dal', V. (1862-1866) Tolkovyj slovar' zhivogo velikorusskogo jazyka, St. Petersburg.

    Neuendorf, Kimberly A. (2002) The content analysis guidebook, Thousand Oaks, Calif.: Sage Publications.

    Ozhegov, S.I. and N.Ju. Shvedova (1992) Tolkovyj slovar' russkogo jazyka, Moskva.

    ABOUT THE REVIEWER

    Danko Sipka holds a PhD and Habilitation in Slavic Linguistics and a doctorate in Psychology. He is a research associate professor and the acting director of the Arizona State University Critical Languages Institute. His numerous publications include the recent volumes Serbo-Croatian-English Colloquial Dictionary (2000) and A Dictionary of New Bosnian, Croatian, and Serbian Words (2002).