LINGUIST List 5.1122

Fri 14 Oct 1994

Sum: String manipulation tools for the Mac


Directory

  1. Bill Croft, Sum: String manipulation tools for the Mac

Message 1: Sum: String manipulation tools for the Mac

Date: Wed, 12 Oct 94 15:00:27 BST
From: Bill Croft <W.Croftmanchester.ac.uk>
Subject: Sum: String manipulation tools for the Mac

I got many responses to my request for string-manipulation programs
for the Macintosh. I also discovered that a similar query was posted
by Loren Billings on the CORPORA list. The following is a compilation
of the responses to both requests (Loren will post the summary to
CORPORA).

 I am posting this summary now, quoting liberally the basic info
and evaluations of the respondents. Since I am less computer-literate
than the respondents, please correct any errors that have crept into
my summaries. More details can probably be had from the respondents
I've named after each entry.

 Thanks to all of the following for their information! --Bill Croft

Evan L. Antworth (evan.antworthSIL.ORG)
Cathy Ball (CBALLguvax.acc.georgetown.edu)
Michael Barlow (barlowruf.rice.edu)
Loren Allen Billings (BILLINGSpucc.Princeton.EDU)
Chris Culy (cculyuiowa.edu)
Andrew E. Dolbey (dolbeyuclink.berkeley.edu)
Sebastian Adorjan Dyhr (LINSADstud.hum.aau.dk)
George Fowler (GFOWLERucs.indiana.edu)
Larry Gorbet (lgorbetmail.unm.edu)
John Henderson (jkhuniwa.uwa.edu.au)
Ken Hughes (hughesunixg.ubc.ca)
Dirk Janssen (U249009VM.UCI.KUN.NL)
Michael Kelly (kellycattell.psych.upenn.edu)
John Kirk (J.Kirkqub.ac.uk)
John E. Koontz (koontzalpha.bldr.nist.gov)
Barbara Levergood (levebruby.ils.unc.edu)
Hugh Nicoll (hnicollfunatsuka.miyazaki-mu.ac.jp)
Alain Polguere (ellalainleonis.nus.sg)
Malcolm Ross (mdr412coombs.anu.edu.au)
Achim Stein (achimchianti.philosophie.uni-stuttgart.de)
Theo Vosse (vosseruls41.LEIDENUNIV.NL)
Bill Westaley (westaleyOREGON.UOREGON.EDU)

0. TABLE OF CONTENTS
 I. Tools associated with word processors
 (1) Alpha
 (2) BBEdit
 (3) Nisus
 (4) emacs
 II. Tools associated with HyperCard
 (5) MonoConc
 (6) FreeText
 (7) XFCN
 (8) Folio Views
 III. Tools associated with database programs
 (9) 4th Dimension
 IV. Stand-alone tools
 (10) Conc
 (11) Concorder
 (12) MacGawk
 (13) grep/agrep in MacMint
 (14) Search Files 1.3
 (15) MicroConcord (DOS)
 (16) TACT (DOS)
 V. Programming languages
 (17) MacPerl
 (18) MaxSPITBOL
 (19) Icon and ProIcon

I. TOOLS ASSOCIATED WITH WORD PROCESSORS

1. Many modern word/text processors have grep (e.g. Nisus, BBEdit).
(Chris Culy)

(1) Alpha [adapted from Ken Hughes' discussion of Alpha, BBEdit and
emacs--BC]

I would recommend one of three programmer's editors that seem to dominate
the environment these days... Alpha (v5.81?) can be found in most Mac
archives.

All have implementations of grep (full 'regular expression' use) for
search and replace functions. I don't believe that there is any choice
for doing serious work that doesn't involve regular expressions since
these allow work across linefeed, carriage return, whitespace, and other
'punctuation' boundaries and text variances. All allow for operations
on multiple files.

All three editors are highly configurable and permit sophisticated macros.

Alpha [and emacs] provide a command-line shell within a dandy window/buffer
interface, but shell functions are limited and the built-in grep can't be
piped to other machinery. Operations on files require programming skills.
Ideally, an implementation of perl or awk which outputs to a file should
satisfy pretty much any desire. Consequently, the only real 'solution'
available so far may be MacPerl, using Alpha as a front end. (I haven't
tried it yet. [see (17) below--BC]) Info can be found at
"http://web.nexor.co.uk/mak/mak.html".
(Ken Hughes)
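
To illustrate the point above about regular expressions working across line boundaries and over multiple files, here is a minimal sketch in Python rather than in any of the editors discussed. The phrase searched for, the replacement text and the function names are all made up for the example; none of them come from Alpha, BBEdit or emacs.

```python
import re
from pathlib import Path

# \s+ matches any run of whitespace, including line breaks (LF or CR), so a
# two-word phrase is found even when the words sit on different lines.
PHRASE = re.compile(r"\bstring\s+manipulation\b", re.IGNORECASE)


def replace_in_text(text, replacement="text processing"):
    """Return (new_text, number_of_replacements) for one document."""
    return PHRASE.subn(replacement, text)


def replace_in_files(paths, replacement="text processing"):
    """Apply the same search-and-replace to several files in place."""
    counts = {}
    for path in map(Path, paths):
        new_text, n = replace_in_text(path.read_text(), replacement)
        if n:
            path.write_text(new_text)
        counts[str(path)] = n
    return counts


sample = "Tools for string\nmanipulation and String   manipulation."
print(replace_in_text(sample))
```

The same pattern would miss the first occurrence in a tool that only searches line by line, which is the limitation the regular-expression editors avoid.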

(2) BBEdit

Bare Bones Software has just released BBEdit 3.0:
an elegant little program well worth looking at. The freeware BBEdit Lite
3.0 and the demo version of the full (commercial) program are available at
info-mac mirror sites. The commercial version is $99.

For more info contact < bbeditworld.std.com >
(Hugh Nicoll)

BBEdit is perhaps the most popular and easiest to use.
BBEdit is widely loved.
(Ken Hughes [see (1) Alpha for more description of
 BBEdit's capabilities--BC])

(3) Nisus

The word processor Nisus has an extensive GREP type find/replace setup as
well as a programming language which allows you to work up some fairly
sophisticated tools. One of its strengths is that because you are operating
on open text documents it is very easy to check each step as you go.

The latest version, Nisus Writer 4, has just been released.

There is also an e-mail list for Nisus users (contact
pfterrymsmail.kgs.ukans.edu) and an ftp site at syrinx.kgs.ukans.edu
(/home/ftp/nisus).
(John Henderson)

My main word processing tool is Nisus, and probably the main reason I use
it is its grep-like string manipulation tools. Actually, there's a pretty
standard grep facility (as part of its Find/Replace facility) and an "easy
grep", a language that uses a much more transparent formalism to do about
the same things. Actually, the latter is a nice tool for learning the
former: you can type something in that, then click a button and have its
grep translation appear.
(Larry Gorbet)

Nisus 3 allows sophisticated wildcard searches. It can also be used
to find all cases of the search object simultaneously: it selects all of
them, so that they can all be copied to another file. Nisus Writer 4 has
just been released, but I am still waiting for my copy.
(Malcolm Ross)

(4) emacs

There is the Unix-like implementation of emacs (new to the Mac
environment, v1.14?)...
emacs is at "ftp.cs.cornell.edu" in the directory "pub/parmet".
...
emacs provides a command-line shell within a dandy window/buffer
interface, but shell functions are limited and the built-in grep can't be
piped to other machinery. Operations on files require programming skills.
Ideally, an implementation of perl or awk which outputs to a file should
satisfy pretty much any desire.
...
I have a personal penchant for emacs, but I have a lot of experience
with it under Unix. It is extremely (!) feature-rich but the learning
curve is a little stiff. Installing all the files requires about 7 MB.
I'm hoping that someone will soon write an implementation of awk and perl
which will work within the emacs shell.
(Ken Hughes [see (1) Alpha for more description of emacs' capabilities--BC])

II. TOOLS ASSOCIATED WITH HYPERCARD

(5) MonoConc

I have a HyperCard stack, MonoConc, which will give a KWIC concordance
for a text. It will do left and right sorts and allow the search results
to be saved as a file or printed. It is pretty basic. This program is
really just a modification of ParaConc [another program that Michael Barlow
described in an earlier LINGUIST posting], which works with parallel texts.
Again the program gives a KWIC concordance for a keyword (e.g., "say" or
"say*") and allows sorting etc. In addition, the sentences from the
second language containing the equivalent of the keyword (in the first
language) are displayed.

The HyperCard programs are now at version 0.9x and I can email the
program in binhex form to any interested linguists. (My email is
barlowruf.rice.edu) In the future I will place these on an ftp site. If
people need a disk and a manual, they should contact Athelstan. We can
send a copy for $10. (Athelstan -- 800-598-3880 in the USA)
(Michael Barlow)
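
For readers unfamiliar with KWIC output, the following Python sketch shows the general idea of a keyword-in-context search with a trailing wildcard ("say*") and left and right sorts. It is not MonoConc's own method; the function names `kwic`, `left_sort` and `right_sort`, the context width and the sample sentence are all assumptions made for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left, match, right) triples for every hit of `keyword`.

    A trailing * in `keyword` matches any word ending, as in "say*".
    """
    stem = re.escape(keyword.rstrip("*"))
    tail = r"\w*" if keyword.endswith("*") else ""
    pattern = re.compile(r"\b" + stem + tail + r"\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks and extra spaces
    hits = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        hits.append((left, m.group(), right))
    return hits


def right_sort(hits):
    """Sort by the context that follows the keyword."""
    return sorted(hits, key=lambda h: h[2].lower())


def left_sort(hits):
    """Sort by the words before the keyword, nearest word first."""
    return sorted(hits, key=lambda h: h[0].lower().split()[::-1])


text = "They say no. She says yes. Who said that? I say maybe."
for left, word, right in right_sort(kwic(text, "say*", width=12)):
    print(f"{left:>12} [{word}] {right}")
```

Note that "say*" here matches "say" and "says" but not "said", since the wildcard only extends the typed stem.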

(6) FreeText

If you want a simple, fast concordance program that isn't too capable on
non-IE languages, you could try FreeText. It runs under HyperCard using
externals and is very fast. The main limitation is how it handles
characters: it turns everything into caps. If you can live with that
limitation, then it's great. You can also do some modification, since some
of it is in HyperCard. FreeText is free.
(Bill Westaley)

There is a HyperCard program called AnyText from Linguist's Software that, I
believe, does proximity searching. It is based on a freeware program called
FreeText or something (sorry for being vague).
(Evan Antworth)

Also FreeText Browser (a HyperCard stack) allows
you to do Boolean searches, but it doesn't have any print capabilities.
(Cathy Ball)

(7) XFCN

There is a grep search/replace XFCN for HyperCard. (It's free.)
(Chris Culy)

(8) Folio Views

The Hypertext Programme may be `Folio Views' for Macintosh
[this is in reference to Loren's request--BC]

Folio Corporation
2155 North Freedom Boulevard
Suite 150
Provo, Utah 84604

Distributor: e.g. GVPi (Global Village Publishing Inc.)
1101 King St., STE 190, Alexandria, VA 22314
Call 1-800-394-GVPi
(Achim Stein)

III. TOOLS ASSOCIATED WITH DATABASE PROGRAMS

(9) 4th Dimension

I use a database system called 4th Dimension (by ACI), which has very
powerful string manipulation capabilities, if you don't mind writing a
little bit of Pascal-like code.
4D's programming language is powerful, flexible and relatively easy to
learn, and is thus a nice choice for amateurs like myself. However, it
does require that you break the text into "alpha-numeric" fields of a
limited size rather than "text fields" (which can be much larger), because
many of the commands operate only on the "alpha-numeric" fields, not on
"text" fields.

Another disadvantage is that 4D is extremely expensive -- as of last year,
it listed at $600. But it's just about the best database system you'll get
for the Mac, at least in my opinion.
(Andrew Dolbey)

IV. STAND-ALONE TOOLS

(10) Conc

Attached is information on Conc. Conc is primarily a "keyword in context"-type
concordancer. What you want is sometimes called proximity searching. It is
possible to get Conc to do something close to what you want using a GREP search,
but it's a bit clumsy.

Conc: a concordance generator for the Macintosh

 Conc produces concordances of texts. A concordance consists
of a list of the words in the text with a short section of the
context that precedes and follows each word. Conc also produces an
index, consisting of a list of the distinct words in the text,
each with the number of times it occurs and a list of the places
where it occurs. Conc displays the original text, the concordance,
and the index each in its own window. Clicking on a word in any
one of the three windows causes the other two windows to display
the entries for the same word.

 Conc permits the user to define the sorting order and to
limit the concordance to words that match specified patterns
(GREP expressions).
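
As a rough illustration of the kind of index described above (distinct words, each with a count and a list of places, optionally limited by a pattern), here is a short Python sketch. It is not Conc's implementation: the function name `build_index`, the use of line numbers as "places", the simple word definition and the alphabetical sort are all assumptions.

```python
import re
from collections import defaultdict


def build_index(text, pattern=None):
    """Map each distinct word to the line numbers where it occurs.

    If `pattern` is given, only words that fully match it are kept.
    """
    keep = re.compile(pattern) if pattern else None
    index = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[a-z']+", line.lower()):
            if keep is None or keep.fullmatch(word):
                index[word].append(line_no)
    return dict(index)


text = "Call me Ishmael.\nSome years ago, never mind how long,\ncall it a whim."
index = build_index(text)
for word in sorted(index):
    places = index[word]
    print(f"{word:10} {len(places):3}  lines {places}")
```

Passing a pattern such as `r"[aeiou].*"` restricts the index to vowel-initial words; a user-defined sorting order could be supplied through the `key` argument of `sorted`.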

 Conc will do concordances both on ordinary flat text files
and also on multiple-line interlinear texts. In the case of
interlinear texts, the concordance can be limited to selected
lines (fields). In addition to word concording, Conc can also
produce a concordance of each letter in a text or body of
phonological data. Pattern-matching facilities are also available
to letter concordances, so the user can specify search patterns
that will have the effect of retrieving, say, words containing
intervocalic obstruents.
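
A pattern of the sort mentioned above can be sketched with an ordinary regular expression. The Python example below is only an illustration: the vowel and obstruent inventories are an assumed toy set for a plain-ASCII transcription, and the word list is invented.

```python
import re

# Assumed toy inventory for a plain-ASCII transcription; a real analysis
# would substitute the vowels and obstruents of the language in question.
VOWELS = "aeiou"
OBSTRUENTS = "ptkbdgfvsz"

# One vowel, one obstruent, then a lookahead for a following vowel.
INTERVOCALIC = re.compile(f"[{VOWELS}][{OBSTRUENTS}](?=[{VOWELS}])")


def words_with_intervocalic_obstruent(words):
    """Keep the words in which a single obstruent stands between vowels."""
    return [w for w in words if INTERVOCALIC.search(w.lower())]


words = ["mata", "manta", "aba", "alma", "stop", "Kasi"]
print(words_with_intervocalic_obstruent(words))
```

Here "manta" is excluded because its obstruent follows a nasal, not a vowel.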

 Concordances can be both printed and exported to plain text
files. As for performance, on a Mac IIci Conc can produce a
concordance of Moby Dick (1,177KB) in about 13 minutes and
requires about 2,500KB of memory.

Conc version 1.76 is a beta test version offered as 'freeware'. If
you use it, we only ask that you send us your comments, complaints,
and wishlist. You can affect the shape of the final product!
Documentation is included on-disk in a Microsoft Word file.

Conc 1.76 is available in any of three ways:

 1. Conc can be downloaded by anonymous FTP from ftp.sil.org [198.213.4.1].
Do these commands:

 cd [.software.mac]
 get conc176.sea_hqx

You will need a Binhex program to decode it.

 2. Conc can be retrieved via e-mail. Send a message to mailservsil.org
consisting of this single line only:

 send [ftp.software.mac]conc176.sea_hqx

You will need a Binhex program to decode it.

 3. Conc can be ordered on disk from:

 International Academic Bookstore
 7500 W. Camp Wisdom Road
 Dallas, TX 75236
 U.S.A.
 phone: 214/709-2404
 fax: 214/709-2433
 e-mail: Academic.Bookssil.org

Cost is $5 plus postage.
(Checks *must* be drawn on a U.S. bank. They do not accept credit cards,
but will bill by invoice.)

(Evan Antworth; thanks also to Bill Westaley and Theo Vosse)

(11) Concorder

Concorder is a simple concordance program for the Mac which does not have
sophisticated searching (so if you want to specify the distance between
search items it will not do) but for extraction of lines which contain
two items it should work.

 Concorder - Concordance software for the Macintosh
 available from:
 Les publications CRM
 Universite de Montreal
 C.P. 6128-A
 Montreal, Quebec H3C 3J7
 Canada
 Cost CAN$100 + $3 shipping
 one of the authors: David W. Rand
 randere.umontreal.ca
(Laura Proctor)

(12) MacGawk

There is a version of awk (GNU awk or gawk, actually) for the Mac, called,
of course, MacGawk.

(John Koontz)

[John Koontz also sent the README file for MacGawk patch 4, which I have
excerpted here:]

About GNU awk for the Macintosh...

This is GNU awk, gawk, for the Macintosh. For those who don't know, GNU
stands for GNU's Not UNIX, an as-yet unfinished operating system, and is
the primary goal of the Free Software Foundation. The FSF has publicly
condemned Apple Computer for its litigation in defense of perceived
copyrights. The FSF, therefore, has no knowledge of the existence of this
gawk version, and would not support it if it did. Do not report bugs or
make any other contact with FSF concerning Macintosh gawk.

Why Macintosh gawk exists

gawk for the Macintosh exists for a number of reasons. First, I use gawk
extensively as part of my day to day work activities and wanted to have
it at home. Second, I was looking for a project in C to work on at home
to learn Mac programming. And third, it was a challenge. I have every
intention of following the GNU copyleft, meaning that I can not sell gawk
itself ( I could conceivably charge for support) for profit and must
also make full source available.

Macintosh gawk is Free Software

I do not charge for gawk. It is free software, not shareware or public
domain. I encourage you to read the documents that describe the GNU
Public License, or GPL so that you understand what this means.

Differences from UNIX gawk

Macintosh gawk lacks some features that UNIX-like systems provide. These
features include pipes and multiple processes. Mac gawk will quit when
source programs invoke these functions. I caution against redirecting
input and output in getline and print/printf calls. All other features
should work the same. Read the Macintosh Supplement.mw document for
details.

Macintosh caveats

Multifinder

Mac gawk will run under Multifinder, but is not particularly MF adapted.
It is set to use a partition size of 768K but large input files may
require more, much more. Operation under Finder should be fine.

Command Line

Macintosh gawk uses the THINK C ccommand interface. This provides a
dialog box that allows the user to enter UNIX shell-like command lines.
Redirection of input and output is done with radio buttons.

TEXT Files

Mac gawk reads and writes standard Macintosh TEXT files. To use word
processor files, it will be necessary to save them as TEXT first.

Behind the scenes

Compilation

Mac gawk was compiled using THINK C 4.0.2 on a 4M Mac+
running System 6.0.7. gawk requires bison to generate the awk.tab.c file.
This is generally only required when making changes in the actual awk
language. The source files were converted to comply with the ANSI
standard ( as THINK defines it) and makes full use of function prototypes.

The author

I'm not really the author, I just did the porting. My name is Tom
Maszerowski, I work as a software engineer for Moscom, Inc. in Pittsford,
NY. Moscom is nice enough to allow me email and UUCP access and I thank
them, but there are no guarantees. Thanks to my wife as well, for
allowing me the time at home to do this.

Bugs and updates

Please do not contact the FSF concerning this version of gawk. I expect
to be the sole point of contact for bugs and source code updates. I
monitor the GNU groups on NETNEWS and will try to incorporate them as
needed. If you make changes to the gawk source you feel will benefit
others send them to me.

Addresses

I can be reached at the following email addresses:
 tcmmoscom.com
 {rit,tropix,ur-valhalla}!moscom!tcm

Mail delivery is usually quite good and I try to respond in a timely
fashion ( although timely is a subjective term).

Manual

The manual with a supplement is found in the "gawk Manual.mw" and
"Macintosh Supplement.mw" files. The manual was made by converting the
original texinfo file to nroff and then to text. The text was then
converted to MacWrite 5.0 format for the release. I tried to keep
pagination correct but this may change based on the device you print to.

Directions

The source should be stripped of non-Mac code, since it seems unlikely that
someone without a Mac would grab this source code.

Memory allocation problems should be fixed, possibly with replacement of
the THINK malloc() calls with something else.

Real Mac interface would be nice, possibly using Prototyper ( this
presents a problem with source distribution since Prototyper-produced
code requires libraries that cannot be given away).

(John Koontz)

[The awk textbook is:]

The AWK Programming Language, by A. Aho, B. Kernighan and P. Weinberger
Addison-Wesley 1988, isbn 0-201-07981-X

The source is on ftp.funet.fi, in the directory /mac/utils.

I want to stress again that Awk is very versatile but also very limited:
Your strings have to be formatted on 'hard' lines, i.e. at the end of a line
there is a CR (CRLF) sign. There may be more strings on a line, but your
line may not be longer than about 250 characters.
This is a very non-maccy restriction :-)

(Dirk Janssen)
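
A file can be checked against the line-length restriction described above before it is handed to awk. The Python sketch below is one possible way to do that; the 250-character limit is the approximate figure quoted above, and the function names and the choice to re-break lines at word boundaries are assumptions.

```python
import textwrap

LIMIT = 250  # approximate figure quoted above; the exact limit is not stated


def long_lines(text, limit=LIMIT):
    """Return (line_number, length) for each line longer than `limit`."""
    return [
        (n, len(line))
        for n, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


def hard_wrap(text, limit=LIMIT):
    """Break over-long lines at word boundaries; short lines are kept as-is."""
    out = []
    for line in text.splitlines():
        if len(line) <= limit:
            out.append(line)
        else:
            out.extend(textwrap.wrap(line, width=limit))
    return "\n".join(out)


paragraph = "word " * 100  # one long 'soft-wrapped' line of 500 characters
print(long_lines(paragraph))
print(long_lines(hard_wrap(paragraph)))
```

Word-processor paragraphs saved as TEXT typically arrive as single very long lines, which is the case `hard_wrap` is meant to handle.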

There have been standalone implementations of grep and awk for the Mac
but I haven't found any of these to be usefully standard or reliable.
(Ken Hughes)

(13) grep/agrep in MacMint

There is a free Unix-clone available, and all the tools you'd expect
are either already ported or can be recompiled using the GNU C compiler.
One tool in particular is agrep, which is faster than any of the other
greps, and which allows approximate matches. The clone is called MacMint,
and it is a port of Mint for the Atari. The starter kit is available from
Info-Mac mirrors (no compiler necessary to get set up). It takes a little
work to get it set up, but I've found it to be very stable, and it's what
I use for a lot of stuff.
(Chris Culy)
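
To show what "approximate matches" means in practice, here is a small Python sketch that reports lines containing a word within a given number of edits of the search word. It uses a plain Levenshtein edit distance, which is a swapped-in technique and not agrep's own algorithm; the function names and sample text are invented for the example.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def approx_lines(text, word, max_errors=1):
    """Return lines containing a word within `max_errors` edits of `word`."""
    hits = []
    for line in text.splitlines():
        tokens = re.findall(r"\w+", line.lower())
        if any(edit_distance(t, word.lower()) <= max_errors for t in tokens):
            hits.append(line)
    return hits


text = "a concordance program\nconcordence tools\ngrep and awk"
print(approx_lines(text, "concordance"))
```

With `max_errors=1` the misspelling "concordence" is still found, which is useful for texts with variant spellings or typing errors.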

(14) Search Files 1.3

I recommend Search Files 1.3 by Robert Morris, a shareware program that
you can download from anywhere. It's some sort of sophisticated grep for
the Mac. It gives you output that looks like simple concordances. Very
basic but very good. I am sure you'll like it.
(Alain Polguere)

(15) MicroConcord (DOS)

Don't forget that you can run all the IBM PC stuff under SoftPC. A simple
concordancer that will look for collocations (to the left or the right,
within a specified 'horizon' of N words) is MicroConcord, published by
Oxford University Press. It's for the PC, but I run it under SoftPC.

(Cathy Ball)
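
The idea of looking for collocations within a 'horizon' of N words to the left or right can be sketched as follows in Python. This is a generic illustration under stated assumptions, not MicroConcord's procedure: the function name `collocates`, the `side` parameter and the simple word definition are made up for the example.

```python
import re
from collections import Counter


def collocates(text, node, horizon=3, side="both"):
    """Count words within `horizon` words of `node`.

    `side` is "left", "right" or "both".
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, w in enumerate(words):
        if w != node:
            continue
        if side in ("left", "both"):
            counts.update(words[max(0, i - horizon):i])
        if side in ("right", "both"):
            counts.update(words[i + 1:i + 1 + horizon])
    return counts


text = "strong tea and strong coffee, but never powerful tea"
print(collocates(text, "tea", horizon=2, side="left").most_common())
```

The same function answers proximity questions such as "which words occur within two words to the right of 'strong'" by changing `node`, `horizon` and `side`.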

I am also the source in the U.S. (as Athelstan -- 800-598-3880 in the
USA) for Oxford University Press's DOS program, MicroConcord, which is
pretty much the standard commercial concordance program.
(Michael Barlow)

(16) TACT (DOS)

TACT (PC, from University of Toronto) is quite a popular 'research'
concordancer, too.
(Cathy Ball)

One possibility would be to use DOS packages on a PowerPC and copy the
output - use Micro-OCP or TACT for instance. I'm just experimenting myself
with this very process.
(John Kirk)

V. PROGRAMMING LANGUAGES

(17) MacPerl

A Macintosh version of the UNIX perl language is available in the public
domain. It's called "MacPerl." It's not as powerful as the Unix rendition
(it doesn't allow for file expansions using ? or *). Nor does it run in
the background. But it has an extremely flexible set of regular
expressions and is a full programming language. There's also a good
introductory book available called "Learning Perl" published by O'Reilly.
Although based on the UNIX version, it applies by and large to MacPerl as
well.

Here are some ftp locations for MacPerl:
 ftp://ftp.cis.ufl.edu/pub/perl/src/macperl
 ftp://ftp.eunet.ch/software/mac/perl
 ftp://ftp.funet.fi/pub/languages/perl/ports/perl4/mac
 ftp://src.doc.ic.ac.uk/packages/mac/umich/development/languages/macperl4.13.sit.hqx.gz
 America OnLine Mac Development Forum (keyword: mdv)

(Michael Kelly)

Operations on files require programming skills.
Ideally, an implementation of perl or awk which outputs to a file should
satisfy pretty much any desire. Consequently, the only real 'solution'
available so far may be MacPerl, using Alpha as a front end. (I haven't
tried it yet.) Info can be found at "http://web.nexor.co.uk/mak/mak.html".
(Ken Hughes)

(18) MaxSPITBOL

I can strongly recommend MaxSPITBOL, an implementation of SNOBOL4 for
the Mac. It is (was?) available from:

Catspaw, Inc.
P.O. Box 1123
Salida, CO 81201
719-539-3884

When I have called before, I have talked to the programmer, Mark
Emmer, who has had a real commitment to SNOBOL. He cheerfully
answered all my questions before I invested in the software and helped
me debug a couple of times when I ran into big dead ends.

MaxSPITBOL works on text files, but the nice thing (for me anyway) was
that I could use any font I want. This way, I could keep my database
in the linguistics font, and then run a SPITBOL program on it, and get
output in that same font.

SPITBOL includes a very very powerful string manipulation language and
pattern matching language, much more powerful than grep.

I have programmed in SNOBOL/SPITBOL for many years, so I can't comment
on the learning curve for a new user. I imagine it would be like
learning any new programming language, except that the pattern
matching syntax and semantics are fairly complex to learn in depth. Of
course, the simpler tasks are simpler to learn and program.

I don't remember what I paid for the program, but I imagine it was
$150-$200. It was worth every penny to me.
(Barbara Levergood; also recommended by George Fowler)

(19) Icon and ProIcon

There is the language ICON, which is like a better, more
modern & object-oriented language, which has a public domain Macintosh
implementation. I don't know where I got it from, and I don't use it, but you
could try doing an Archie search for "ICON".
(George Fowler)

If you can do your own simple programming, then the programming language
ProIcon for the Mac does a good job of various kinds of search. I think
it is now public domain, like most versions of Icon...You would
need to obtain a manual (it is a published book).
(Malcolm Ross)

ProIcon is useful if you want to make complex changes throughout a large
file which would take too long with a Nisus macro.

(Malcolm Ross)

Dept of Linguistics, U Manchester, Oxford Rd, Manchester M13 9PL, UK
w.croftmanchester.ac.uk FAX: +44-61-275 3187 Phone: 275 3188