LINGUIST List 14.1737

Thu Jun 19 2003

Review: Computational Linguistics: Hammond (2003)

Editor for this issue: Naomi Ogasawara <naomi@linguistlist.org>


What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Simin Karimi at siminlinguistlist.org.

Directory

  1. Hans Paulussen, Programming for Linguists: Perl for Language Researchers

Message 1: Programming for Linguists: Perl for Language Researchers

Date: Wed, 18 Jun 2003 13:36:22 +0000
From: Hans Paulussen <Hans.Paulussen@kulak.ac.be>
Subject: Programming for Linguists: Perl for Language Researchers

Hammond, Michael (2003) Programming for Linguists: Perl for Language
Researchers, Blackwell Publishing.

Announced at http://linguistlist.org/issues/14/14-900.html


Hans Paulussen, University of Leuven (campus Kortrijk)

SYNOPSIS

''Programming for Linguists'' is an introductory book for linguists
who want to learn to program in Perl.

The book consists of nine chapters, which I would group into three
sections (although the author does not make this distinction): an
introductory section (chapters 1-2), a description of Perl (chapters
3-7), and an extension to this basic introduction, covering HTML and
CGI scripting (chapters 8-9).

The introductory section (chapters 1-2) briefly introduces Perl and
explains how to start using it. The first chapter explains in a few
lines why Perl programming skills would help the linguist in his
research tasks. This is followed by a brief note on how to download
and install Perl, and a note on how to read the book. The second
chapter explains how to edit and run your first script, the ubiquitous
print command: ''Hello world''.

The second section (chapters 3-7) gives a carefully documented
description of the basics of the Perl programming language. Basic
control structures and variables are introduced in chapter 3. At the
end of the chapter, the knowledge gained is illustrated in a
linguistic experimental task involving the automatic generation of
'nonsense' syllables (p. 25).

Chapter 4 covers the subject of input and output, array operations and
randomizing. The whole set of new elements (input/output, array,
randomization) is then illustrated in a program (expprog.pl) which
shows how to collect experimental data (p. 43).

Chapter 5 deals with the subject of subroutines and modules. It is a
bit longer than the previous chapters and requires close attention
from the beginning programmer. The topics discussed also include the
anonymous variable, variable scope, arguments to subroutines,
multidimensional arrays, and the Exporter module to round off the
modularity of Perl. All the new features are integrated in new
versions of the sample program expprog.pl introduced in the previous
chapter.

Regular expressions form the main topic of chapter 6. All the main
features of pattern matching are explained, and the regular
expressions are illustrated in a ''pig latin'' generator (p. 89): a
program which swaps syllables (similar to the French verlan), taking
into account some syllabic constraints. A sentence splitter (p. 90) is
shown as another linguistic example.

Chapter 7 deals with all the Perl tools used for text and string
manipulation. This includes, for example, string replacement based on
regular expressions, conversion of strings to arrays (split and join)
and sorting. The power of hashes is introduced, and the chapter ends
with two linguistic illustrations: a concordancing program (p. 114)
and a bigram selector (p. 118).

The third and final section covers two chapters which go beyond the
topic of a basic Perl introduction, but which nevertheless show some
interesting features of Perl. An introduction to HTML is given in
chapter 8, which also shows how to retrieve and parse web pages,
illustrated in a simplified websearch script (p. 136). Chapter 9 is an
introduction to CGI scripting, which is used to create web pages
dynamically. The chapter finishes with a further development of the
syllable experiment introduced in previous chapters, which is now run
over the web.

The book finishes with four appendices. The first two appendices can
be considered additional chapters on an extra, rather intricate
feature of Perl: references. Appendix A is a brief yet well-documented
introduction to object-oriented Perl programming. Appendix B is a
general introduction to the Perl Tk module, which is a library used
for building a graphical user interface (GUI). Appendix C lists the
most important ''special'' variables, and appendix D gives some hints
about how to find further information on Perl.

CRITICAL EVALUATION

Perl is a well-known scripting language, and many books have been
written on this topic, but most of them are aimed at readers with some
background in programming. Even Schwartz & Christiansen (1997), which
is an excellent introduction to Perl, covers topics which are not
always that transparent to beginning programmers, in particular those
with a background in the humanities. Moreover, programming books in
general use samples situated mainly in the field of mathematics.

Many books have been written on Perl and CGI scripting and web
tracking and logging, all domains for computer specialists. Only
recently have introductory books on Perl been written for other
domains, mainly for researchers in the field of bioinformatics
(cf. Cross 2001, Dwyer 2002, Tisdall 2001), where the knowledge of
mathematics and logic is a prerequisite.

Hammond's book is the first work, as far as I know, which focuses on
the specific needs of language researchers. It is strange that an
introductory book for language researchers has appeared so late, since
the extensive support of regular expressions makes Perl a perfect
programming language for linguistic applications.

The main asset of this introductory book is the gradual introduction
of all the basic features of Perl and the use of language
samples. Seldom have I seen an introductory programming book which
explains in such minute detail the steps needed to understand what
variables and program loops are. Every new feature is consolidated in
a number of exercises at the end of each chapter. One can see that
Hammond has many years of experience in teaching basic programming in
general, but also, and especially, to learners with a background in
the humanities. Very instructive is the use of small modifications to
the proposed scripts, which serve to introduce new features gradually
and in a very digestible way. Each script is explained in detail.

Apart from the table on p. 31, showing a rather obscure nomenclature
for console input and output, the whole chapter on input and output
gives a clear overview of the different aspects of input and output
facilities in Perl. On the other hand, the exact input and output for
numerous examples are often not shown at all. In a number of cases, it
is precisely the input and output files that would give the extra
information needed to understand what the script is about.

You can download the sample programs from the author's website in
three versions: Unix, Windows and Macintosh. However, there are
occasional minor practical problems. Some of the downloaded
exercises do not match their paper version, especially in chapter 5. A
small number of exercises are missing, but since the programs are not
very lengthy at all, this is not a big problem. The book says that
answers to selected even-numbered exercises are also available on the
website, but I have not found them.

Special attention is often given to environment-specific features
of Unix, Windows and Macintosh, the last being quite different from
the others and thus posing some problems for beginners. However, a
proper explanation of the basic text file formats is missing. In
fact, an introduction to end-of-line conventions is a sine qua non
for text manipulation, especially for the linguistic researcher.
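
To make the end-of-line point concrete, here is a minimal sketch, written
in Python rather than Perl and not taken from the book. It assumes the
three classic conventions (LF on Unix, CRLF on Windows, CR on the classic
Macintosh); the function name and sample strings are purely illustrative.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and classic Mac (CR) line endings to Unix (LF)."""
    # Replace CRLF first, so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# The same two-line text as it might arrive from three platforms.
samples = {
    "unix": b"de kat\nle chat\n",
    "windows": b"de kat\r\nle chat\r\n",
    "mac": b"de kat\rle chat\r",
}

for name, raw in samples.items():
    # After normalisation, all three split into identical lines.
    print(name, normalize_newlines(raw).split(b"\n"))
```

Without such a step, a script that reads a file ''line by line'' may see a
whole Macintosh file as one line, or leave a stray carriage return at the
end of every word list entry from a Windows file.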

Because of the detailed explanations, one might be tempted to use the
book as a self-study course book. However, there are a number of flaws
which mean that the beginner will often need a helping hand from a
teacher. For example, no script starts with the warning flag (-w) and
the ''use strict'' pragma, which are the minimum requirements for
defensive programming. Especially in the case of beginners, the lack of
these two items can result in many mistakes. The chapters on HTML and
CGI are brief, clear introductions to the subject, but I doubt whether
a beginner can get any hands-on experience without the help of a web
guru.

Hammond limits himself to introducing only the most common structures,
which is fair enough in the context of a beginners' handbook of
Perl. However, to call some structures merely redundant (note 8,
p. 29) is an exaggeration. Some of these redundant constructions are
very handy instruments which render the programming code more
transparent, as soon as one gets used to them. If one wants to discard
all redundant elements, one could just as well start by leaving out
the ubiquitous shorthand operations. Another example is the function
each(), which is of limited use for a small hash (cf. p. 114), but
which is very useful when reading a hash table of thousands of
elements, as is likely to occur in real-world applications.

In a world where localisation and internationalisation have become
important topics, it is strange that no attention has been paid to
multilingual text processing. The only multilingual program sample is
replace5.pl (p. 97), which converts a number of diacritical characters
to their base form. Is this once again a typical English-oriented
approach to analysing language (English being one of the few languages
using few or no diacritics)? More and more language researchers need
to know how to handle language-specific features efficiently.
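
For readers unfamiliar with this kind of conversion, the following is a
minimal sketch in Python, not the book's replace5.pl. It assumes Unicode
text and relies on canonical decomposition, so it only covers characters
that decompose into a base letter plus combining marks; letters such as
''ø'' or ''ß'' are left untouched.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Reduce accented letters to their base form via Unicode decomposition."""
    # NFD splits e.g. "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("garçon naïve"))  # garcon naive
```

Note that throwing diacritics away is exactly the English-oriented
shortcut questioned above: in many languages the accented and unaccented
forms are different words.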

Since regexes are considered one of the most useful features of Perl,
why then is the chapter on regexes so short? Why not show how to
expand regexes based on the concatenation of previously defined
regexes? One could develop for example a regex procedure to detect
French words in an English text.
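
A rough sketch of what such concatenation could look like follows, in
Python rather than Perl and not based on anything in the book. The
building blocks (ACCENTED, LETTER, ELISION) are hypothetical and far too
crude for real language identification; they only show how larger
patterns can be assembled from previously defined smaller ones.

```python
import re

# Hypothetical building blocks, chosen only for illustration.
ACCENTED = r"[àâçéèêëîïôùûüœ]"
LETTER = r"[a-zàâçéèêëîïôùûüœ]"
ELISION = r"(?:[cdjlmnst]|qu)'"  # l', d', qu', ...

# Larger patterns are concatenations of the pieces defined above.
ACCENTED_WORD = rf"{LETTER}*{ACCENTED}{LETTER}*"
ELIDED_WORD = rf"{ELISION}{LETTER}+"
FRENCH_CANDIDATE = re.compile(
    rf"\b(?:{ELIDED_WORD}|{ACCENTED_WORD})\b", re.IGNORECASE
)


def french_candidates(text: str) -> list:
    """Return words that look French: elided forms or words with accents."""
    return FRENCH_CANDIDATE.findall(text)


sample = "The café served l'addition with a certain élan."
print(french_candidates(sample))  # ['café', "l'addition", 'élan']
```

Each named piece can be tested and refined on its own, which is the main
advantage over one long, unreadable expression.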

Hammond is a true master at explaining complicated programming
features in a simple way. A case in point is the introduction to
references (appendix A), which shows a transparency seldom seen in
any other introductory book on Perl. This didactic approach, however,
conceals the intricacy of programming in the field of linguistics. The
language examples are too general to be considered full-fledged
scripts. They only scratch the surface of the complexities involved in
language research and language technology. Linguistic analysis is
simplified and thus mystified. It is simple to render some linguistic
features into a regex (such as vowels and verbal inflection,
cf. p. 85), but what about the intricacies involved in determining
syllables? The tokenisation algorithm as illustrated in the sentence
splitter script (chapter 6) does not even take account of
abbreviations (any dot is considered the end of a sentence) nor of the
possibility of multiple dots, to name just the basic problems in
sentence splitting. On the other hand, the concordance program
concord3.pl (p. 116) shows one possible way to deal with capitalized
nouns.
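
To illustrate the two sentence-splitting problems just mentioned, here is
a minimal sketch in Python (not the book's script, and not a complete
solution). It assumes whitespace-separated tokens and a small,
hypothetical abbreviation list; a run of dots is treated as a single
boundary, and an abbreviation that happens to end a sentence is still
missed.

```python
import re

# Illustrative abbreviation list; a real one would be far longer
# and language-specific.
ABBREVIATIONS = {"cf.", "p.", "e.g.", "i.e.", "dr.", "mr.", "mrs.", "etc."}

# One or more terminators, optionally followed by closing quotes/brackets.
SENTENCE_END = re.compile(r"""[.!?]+["')\]]*$""")


def split_sentences(text: str) -> list:
    """Split text into sentences, skipping dots that belong to abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # Strip surrounding quotes/brackets before the abbreviation lookup.
        bare = token.lower().strip("\"'()[]")
        if SENTENCE_END.search(token) and bare not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without a final terminator
        sentences.append(" ".join(current))
    return sentences


text = "See the table (cf. p. 31) for details. Is it clear... Mr. Smith disagrees!"
for sentence in split_sentences(text):
    print(sentence)
```

On this sample the sketch yields three sentences, whereas a splitter that
breaks at every dot would cut after ''cf.'', ''p.'' and ''Mr.'' and would
stumble over the ellipsis.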

To conclude, this book is a nice introduction to Perl programming,
aimed at researchers with an interest in language. There are quite a
number of program examples and exercises that deal with language, but
the scripts presented are too rudimentary to be used for real-world
applications. They nevertheless give a general idea of what one can do
with text and string manipulation. The main asset of this handbook is
the gradual introduction of the most important features of Perl, with
a transparency that appeals to beginners with no background in
mathematics or logic. As such, this book can be considered an
excellent complement to the well-known introductory Perl book of
Schwartz & Christiansen (1997).

REFERENCES

Cross, David (2001), ''Data Munging with Perl'', Manning Publications
Company.

Dwyer, Rex A. (2002), ''Genomic Perl: From Bioinformatics Basics to
Working Code'', Cambridge University Press.

Schwartz, Randal L. & Tom Christiansen (1997), ''Learning Perl'',
O'Reilly.

Tisdall, James (2001), ''Beginning Perl for Bioinformatics'', O'Reilly.

ABOUT THE REVIEWER

Hans Paulussen is a translator and computational linguist. He has worked
in the field of language teaching and text corpora research. He wrote
a PhD on the contrastive analysis of prepositions in English, French
and Dutch, within the cognitive linguistic framework. This empirical
work was based on a parallel aligned corpus specifically compiled for
that purpose. He has also been involved in the computational
linguistic support of the development of the first corpus-based
Arabic-Dutch dictionary. He is presently involved in a CALL research
project at the University of Leuven (campus Kortrijk) and he teaches
an introductory postgraduate course in corpus linguistics at the
University of Lille.