LINGUIST List 14.1536

Thu May 29 2003

Review: Computational Linguistics: Hammond (2003)

Editor for this issue: Naomi Ogasawara <naomi@linguistlist.org>


What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Simin Karimi at simin@linguistlist.org.

Directory

  1. Anne Mahoney, Programming for Linguists: Perl for Language Researchers

Message 1: Programming for Linguists: Perl for Language Researchers

Date: Thu, 29 May 2003 09:20:07 +0000
From: Anne Mahoney <amahoney@perseus.tufts.edu>
Subject: Programming for Linguists: Perl for Language Researchers

Hammond, Michael (2003) Programming for Linguists: Perl for Language
Researchers. Blackwell Publishing.

Announced at http://linguistlist.org/issues/14/14-900.html


Anne Mahoney, Tufts University

Programming for Linguists is an introduction to computer programming
using the Perl language, aimed at people who work with language.
Although Hammond seems to envision it as a self-study guide, it would
probably work better as a course textbook. It is a generally sound
introduction to the language and to the notion of programming a
computer.

Perl is a particularly nice language for text processing because of
its wealth of pattern-matching and string-handling constructs. It is
easy in Perl to say, for example, ''find all words that end in a
vowel'' or ''replace every occurrence of the word 'cat' with the word
'feline.''' In addition, Perl is easy for beginners because it is
interpreted rather than compiled: one simply writes a program and runs
it, without explicitly having to turn it into machine code. As
Hammond points out (p. 2), Perl is moreover available, and free, for
every type of computer system in current use. I therefore agree that
Perl is a good starting point for a linguist with a computational
problem.
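
As a minimal sketch of those two tasks (not taken from the book; the
sample sentence and variable names are invented for illustration),
each can be written in a line or two of Perl:

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen only for illustration.
my $text = 'The cat sat on a sofa near the piano';

# Find all words that end in a vowel: a word boundary, any run of
# word characters, then a vowel immediately before the closing boundary.
my @vowel_final = $text =~ /\b(\w*[aeiou])\b/gi;

# Replace every whole-word occurrence of 'cat' with 'feline',
# working on a copy so the original string is left untouched.
(my $replaced = $text) =~ s/\bcat\b/feline/g;

print "@vowel_final\n";    # The a sofa the piano
print "$replaced\n";       # The feline sat on a sofa near the piano
```

The ''\b'' anchors keep the substitution from touching words such as
''concatenate'' that merely contain the letters ''cat.''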

Hammond's intended audience is ''a naive reader who may know nothing
about programming'' (p. ix). The reader who already knows another
programming language and wants to pick up Perl will be better served
by Wall et al. (2000). Hammond's naive reader, however, is expected to
understand how to install software, how to use a text editor as
distinct from a word processor, and how files and directories work.
Although Hammond gives basic instructions on how to invoke an editor,
how to invoke the Perl interpreter, and how to display the text of a
Perl program, he leaves the reader helpless if anything goes wrong.
While the details of using a text editor really are beyond the scope
of the book, especially if the reader could be using any of several
computing platforms, it is often the case that someone who has never
thought about programming before has also never had occasion to use a
text editor, change the path (set of directories from which executable
programs can automatically be found), or install anything that
requires configuration or compilation. Although Hammond sensibly
suggests that some of these are ''delicate tasks'' and ''you should
seek assistance before attempting them on your own if you've never
done this before'' (p. 7), it would be useful to provide more concrete
information about where such assistance might be available. A
college- or university-affiliated linguist may be able to ask the
school's ''academic technology'' group. If no such resource is
available, the reader will want a good book on the relevant operating
system, perhaps one introducing system administration or development.

The first two chapters introduce Perl and how to create and run a
program. Chapters 3-7 cover the core features of the language. Not
every bit of Perl syntax is included, only what beginners need to
write basic programs. Each chapter includes examples, which are also
available from the author's home page,
http://www.u.arizona.edu/~hammond/ (p. x), and ends with a group of
exercises, many of which are variations on the example programs. The
exercises are all relatively easy, a few minutes' to half an hour's
work; there are no term projects or research questions here. They
provide practice on the language features introduced in the text, and
may help the reader figure out what kinds of problems a computer
program might help solve. The core chapters introduce, in order,
control statements, scalar variables, and arrays; input and output,
both at the user's screen and to files; organizing programs into
subroutines; regular expressions; substitutions, sorting, and
tokenization. Examples grow increasingly elaborate, including an
English-to-Pig Latin translator.

Chapter 8, on HTML, talks about using Perl to generate or parse HTML
files. Chapter 9 is about CGI, the ''Common Gateway Interface'' for
web programming. Oddly, it does not mention the commonly used CGI
module, available from CPAN (the Comprehensive Perl Archive Network,
http://www.cpan.org, discussed in appendix D), which includes
functions to do several of the things Hammond has the reader do
laboriously by hand, notably retrieving the input to a CGI routine.

Four appendices round out the book. Appendix A mentions
object-oriented programming as it is done in Perl. While it is
appropriate to explain the odd syntax that object-style modules may
use (all those double colons and extra pointers), this topic is
otherwise rather more advanced than the rest of the book. Appendix B
discusses the Perl implementation of the Tk toolkit for building
graphical interfaces. Finally, appendix C lists the basic ''special
variables'' built in to Perl, and appendix D gives a few pointers to
further information.

Any introductory programming textbook is necessarily its readers'
first initiation not only into the mechanics of programming, but also
into style. Here Hammond's recommendations and examples are sometimes
inappropriate, and often unidiomatic for Perl. For example, on p. 49
he suggests that programmers should avoid ''command condensation,'' by
which he means using the output of one routine as an argument to
another, or more generally doing more than one operation in a single
step. He notes that this technique produces shorter programs, but
''it results in far less clarity and should be avoided.'' (p. 50) The
alternative, however, is generally to introduce new variables to hold
intermediate results. This is also confusing, as another programmer
reading or working on the code some time later must determine what
happens to each of those variables, and whether they are still
relevant in some later part of the code. In programming languages as
in natural languages, greater fluency makes it possible to read longer
''sentences'' without getting confused. A first-year Latin student
might be thoroughly confounded by the sentence of 60-odd words that
begins Cicero's speech for Archias the poet, but the experienced
Latinist understands its sense, its structure, and its sound effects.
Similarly, experienced programmers learn to use increasingly
complicated statements. (While in natural language acquisition
students can generally read more complex sentences than they can
accurately write, in programming language acquisition the sequence is
often the reverse, because students rarely get practice in reading
existing code. This is unfortunate, however, as working programmers
spend much more time reading, documenting, and modifying existing code
than they do writing new code from scratch.)
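
The two styles can be compared side by side in a small sketch. The
example is invented for this purpose and is not from the book; the
variable names are assumptions.

```perl
use strict;
use warnings;

# Hypothetical input line.
my $line = 'Arma virumque cano';

# Step-by-step style: every intermediate result gets its own variable,
# which a later reader must then keep track of.
my $lowered  = lc $line;
my @words    = split ' ', $lowered;
my @reversed = reverse @words;
my $stepwise = join ' ', @reversed;

# Condensed style: the output of each function is passed directly
# to the next, read from right to left.
my $condensed = join ' ', reverse split ' ', lc $line;

print "$stepwise\n";     # cano virumque arma
print "$condensed\n";    # cano virumque arma
```

Both versions produce the same string; the second leaves no
intermediate variables behind for a maintainer to account for.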

Hammond points out that code should have comments (p. 48), but the
examples rarely do. He also notes that variable names should give
some information about the use of the variable (p. 49); although most
of the examples follow this precept, there are occasional one-letter
or otherwise neutral names. He characterizes the ubiquitous Perl
''anonymous variables'' as ''one major threat to writing easy-to-read
programs'' (p. 50), yet anyone who will be working with Perl will run
into them almost at once. Anonymous (or ''implicit'') variables in
Perl are supplied from the context when a function requires an
argument which it is not given. They include the current main input
filehandle, the current record from a file being read, and the current
element of an array within a loop.
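
A short sketch may make the idea concrete. It is an illustration
only, with invented data, and uses the implicit variable ''$_'' in
the two most common settings: a loop over an array and a loop over
the records of a file (here an in-memory file, so that the example
is self-contained).

```perl
use strict;
use warnings;

my @words = qw(arma virumque cano Troia);

# Explicit loop variable: every use names $word.
my @explicit;
foreach my $word (@words) {
    next unless $word =~ /a$/;
    push @explicit, $word;
}

# Implicit variable: foreach puts each element in $_,
# and a bare pattern match is applied to $_.
my @implicit;
foreach (@words) {
    next unless /a$/;
    push @implicit, $_;
}

# Reading records: while (<$fh>) stores each line in $_,
# and chomp with no argument removes the newline from $_.
my $data = "arma\nvirumque\ncano\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @lines;
while (<$fh>) {
    chomp;
    push @lines, $_;
}
close $fh;

print "@explicit\n";    # arma Troia
print "@implicit\n";    # arma Troia
print "@lines\n";       # arma virumque cano
```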

Real programs must be prepared for errors, especially if they expect
to receive any data from outside. Hammond notes (p. 36) that it's
always necessary to check whether a program has successfully opened a
file it intends to use, and gives the standard idiom for doing so.
The example programs, however, merely complain that there has been an
error, without saying what error or on which file; the information
necessary to construct an informative error message is relegated to a
footnote (p. 45). Once we get to regular expressions, in chapter 6, a
series of example programs allow the user to enter a regular
expression as input to the program. These expressions are then used
without any check on their validity (examples p. 80, 81, 85, etc.).
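
Both checks are short to write. The following is a sketch under
stated assumptions, not the book's code: the subroutine names
''open_or_explain'' and ''compile_pattern'' and the file name are
hypothetical. The error message names the file and includes ''$!'',
Perl's description of the system error, and the user's pattern is
compiled inside ''eval'' so that a malformed expression can be
rejected instead of stopping the program.

```perl
use strict;
use warnings;

# Try to open a file for reading. Returns the filehandle, or undef
# plus a message naming the file and the system error ($!).
sub open_or_explain {
    my ($path) = @_;
    open my $fh, '<', $path
        or return (undef, "Cannot open '${path}' for reading: $!");
    return ($fh, undef);
}

# Compile a user-supplied pattern. qr// dies on a syntax error,
# so eval turns that failure into an undef return value.
sub compile_pattern {
    my ($source) = @_;
    my $regex = eval { qr/$source/ };
    return $regex;
}

# Hypothetical path that should not exist.
my ($fh, $error) = open_or_explain('no-such-directory/no-such-file.txt');
print "$error\n" if defined $error;

for my $candidate ('o.*s', '(unclosed') {
    my $regex  = compile_pattern($candidate);
    my $status = defined $regex ? 'valid' : 'invalid';
    print "Pattern <$candidate> is $status\n";
}
```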

Although some aspects of programming have changed in the last fifty
years or so, the basic principles of good style are much the same as
ever. The style manual Kernighan and Plauger (1978) has really not
been superseded; its key style rules (including ''Avoid temporary
variables''; ''Use the good features of a language; avoid the bad
ones''; ''Make sure all variables are initialized before use''; and so
on) are as relevant to object-oriented Perl as they were to Fortran
and PL/1. Hammond is an experienced enough programmer to know this,
as is clear from the programs he makes available on his home page.
Students may as well learn good habits from the beginning, rather than
being encouraged by the textbook to be sloppy.

The book has little to say about either design or debugging. Any
non-trivial program should be sketched out first, before the
programmer starts writing code, to be sure nothing major will be
overlooked. Simply starting in to write without first thinking about
the structure of the program can lead to using the wrong structure.
How large a program is non-trivial depends on experience; for the
intended readers of this book, the solutions to the exercises are not
yet trivial. Moreover, few programs are correct when first written,
and Hammond gives no suggestions about how to determine why a program
does not do what you think it should. Perl does include several
tools: the ''use strict'' pragma to enforce variable declarations, the
''-w'' command line switch to enable warnings, and a debugger. New
programmers need to be reminded that everyone makes mistakes, that
programming mistakes are rarely disastrous unless the program modifies
a file or something else beyond its own borders, and that there are
systematic ways of finding and fixing the mistakes that will happen.
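
To show the kind of mistake ''use strict'' catches, here is a small
sketch with invented variable names. The faulty snippet is compiled
through a string ''eval'' only so that the resulting error can be
captured and printed; in ordinary use one would simply put ''use
strict'' at the top of the program and read the message when the
program fails to compile.

```perl
use strict;
use warnings;

# A snippet with a misspelled variable name. Without "use strict"
# Perl would quietly create a new variable; with it, compilation fails.
my $snippet = <<'END_CODE';
use strict;
my $word_count = 0;
$word_cuont = $word_count + 1;   # typo: this variable was never declared
END_CODE

# Compile and run the snippet, capturing any error in $@.
my $ok      = eval "$snippet; 1";
my $message = $ok ? "compiled cleanly\n" : $@;

print $message;
```

The message names the undeclared variable and the line on which it
appears, which is usually enough to find the typo.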

The book is in general accurate and well-edited, but I found a few
errors or inaccuracies which might lead to a bit of confusion. For
example, on page 9, the Perl escape ''\n'' is described as ''an
explicit return -- or newline''; a carriage return (''\r'') and a
newline are not the same character. In a
footnote on p. 29, the definition of ''prime number'' is correct, but
the example includes 1, which is incorrect. On p. 57, the scope of a
variable defined as a loop index in a ''for'' or ''foreach'' statement
is the loop itself, not the block or routine that encloses the loop.
In the discussion of regular expressions, p. 82-83, the pattern that
is intended to contain a backslashed vertical bar is twice printed
with a space between the backslash and the bar: for ''\|'' we have ''\
|'' instead. The example on p. 87 misses the first match: the pattern
/o.*s/ applied to ''John loves Mary'' will match ''ohn loves'' rather
than merely ''oves''. In the discussion of sorts, p. 105-106, the
text says you specify ''an explicit sorting function'' as an argument
to the standard sort routine. In fact, what you specify is only a
comparison function, which tells how to determine if one item comes
before another, not an entire sort function. In the discussion of
HTML, correct terminology seems to be deliberately avoided: ''escape
sequence'' on p. 129 instead of the standard term ''entity,''
''parameter'' on p. 131 instead of the standard term ''attribute.''
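
Three of these points are easy to check directly. The sketch below is
an illustration written for this purpose, not code from the book; the
word list and variable names are invented.

```perl
use strict;
use warnings;

# Leftmost match wins: the pattern starts at the first 'o' (in "John"),
# and the greedy .* then extends to the last 's' it can reach.
my ($match) = 'John loves Mary' =~ /(o.*s)/;
print "$match\n";        # ohn loves

# sort takes a comparison block, not a whole sorting routine: the block
# only says how two items, $a and $b, are ordered (here by length,
# with ties broken alphabetically).
my @by_length = sort { length($a) <=> length($b) or $a cmp $b }
                qw(loves John Mary a);
print "@by_length\n";    # a John Mary loves

# A loop variable declared in the foreach statement exists only
# inside the loop; the outer $item is a different variable.
my $item = 'outer';
my @seen;
foreach my $item (qw(x y)) {
    push @seen, $item;
}
print "$item\n";         # outer
```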

The code sample on p. 136-138 does not handle URLs with directories,
and will also fail on a relative link from a page whose URL includes a
filename. Although I did not test all of the code, this is the only
program in the book which had visible errors: a commendable success
rate.

The preface suggests that the audience for the book includes not only
linguists but ''literary theorists'' (p. ix, repeated on p. 1). I
assume Hammond means ''literary scholars'' here; while literary
theorists are unlikely to need computational tools, many of us who
work on literature -- applying theories rather than creating them --
do have occasion to program. Literary scholars might want to write
programs for stylometrics, collation and textual criticism, metrical
analyses, concordancing, and so on. In addition, knowledge of
programming greatly facilitates marking up a text for other uses, for
example turning a plain typed or scanned text into TEI XML.

Overall, this is a sound book, with only a few questionable
recommendations and very few errors. It would make a good foundation
text for an introductory course on computational linguistics or
humanities computing, perhaps coupled with something like Hockey
(2001) to give the students some ideas about what this new skill will
allow them to do.

References

Hockey, Susan M. (2001). Electronic Texts in the Humanities:
Principles and Practice. Oxford.

Kernighan, Brian W., and P. J. Plauger (1978). The Elements of
Programming Style, second edition. New York: McGraw-Hill.

Wall, Larry, Tom Christiansen, and Jon Orwant (2000).
Programming Perl, third edition. Sebastopol, CA: O'Reilly and
Associates.

ABOUT THE REVIEWER

Anne Mahoney teaches in the department of classics at Tufts University
and is the lead programmer at the Perseus Project there. Her research
interests include Greek and Latin meter and poetics, ancient drama,
and vocabulary.