LINGUIST List 2.137

Monday, 15 Apr 1991

Disc: Munda, Shoebox



Directory

  1. Ian Smith, Re: Munda Homeland
  2. John E. Koontz, Shoebox

Message 1: Re: Munda Homeland

Date: Thu, 11 Apr 1991 23:09:38 -0400
From: Ian Smith <IANSMITHVM1.YorkU.CA>
Subject: Re: Munda Homeland
Re: Susan Steele's query about a hypothesized homeland for the Munda lgs

Given that the Munda languages were in South Asia before the arrival of Indo-
Aryan speakers (circa 14th C BC) and that there is a dearth of historical info
on the languages (and of decent current info for many of them), any hypothesis
would likely be pretty shaky. You could finesse the question by saying that
ultimately it is the same as the homeland of the Austro-Asiatic family, of
which Munda is a branch. Certainly Munda isn't considered to be connected with
the Indus valley civilization (Mohenjo-Daro, Harappa etc.) since, for one thing,
its language seems to have been exclusively suffixing, while Munda makes
extensive use of prefixes.
More knowledgeable sources on Munda would be David Stampe (U Hawaii) or Norman
Zide (U Chicago).
Ian Smith, York University (iansmithvm1.yorku.ca)

Message 2: Shoebox

Date: Mon, 15 Apr 91 12:34:05 MDT
From: John E. Koontz <koontzalpha.bldr.nist.gov>
Subject: Shoebox
Comments on Shoebox

Since Tom Payne's cautionary remarks on Shoebox, he and I have been
corresponding on the problems he reports. 

One of his objections was that Shoebox alphabetizes all records by
their key field. This means, as he stated, that for records
consisting of glossed text one has to create a set of key fields
apart from the text. These keys are called reference fields in
Shoebox, and usually take the form of some field like

 \ref short_title sentence_number 

In Shoebox, since the keys are sorted alphabetically, one has to make
sure that all numbers have the same length by adding leading zeroes. 
So, use 0001, 0002, ..., 0010, etc., since otherwise 1, 10, 100,
1000, etc., will sort before 2, 20, 200, 2000, etc.!
Actually the requirement for reference numbers is a property of both
Shoebox and IT, the only two interlinear text glossing tools that I
know of, and both tools automate the process of creating the keys to
some extent. 
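
The sorting problem can be seen in a few lines of Python. This is only a
sketch: the short title "txt" and the helper make_ref are hypothetical names
chosen for illustration, not anything that Shoebox or IT provides.

```python
# Sketch: why reference numbers need leading zeroes when keys sort as text.
# "txt" and make_ref() are hypothetical, chosen only for illustration.

def make_ref(short_title, sentence_number, width=4):
    """Build a key like 'txt 0010' with a zero-padded sentence number."""
    return f"{short_title} {sentence_number:0{width}d}"

numbers = [1, 2, 10, 20, 100]

# Keys compared character by character, as in an alphabetical sort.
unpadded = sorted(f"txt {n}" for n in numbers)
padded = sorted(make_ref("txt", n) for n in numbers)

print(unpadded)  # ['txt 1', 'txt 10', 'txt 100', 'txt 2', 'txt 20']
print(padded)    # ['txt 0001', 'txt 0002', 'txt 0010', 'txt 0020', 'txt 0100']
```

With padding, the alphabetical order of the keys coincides with the numerical
order of the sentences, which is what the reference field is meant to preserve.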

The value of reference fields as a book-keeping device is clear, and
the Shoebox manual discusses the need for them as order preserving
keys, and explains how Shoebox can generate them mechanically when a
text is first broken down into records, or regenerate them when they
fall out of regularity. 

However, what Tom objects to is not the existence of numbers, but the
unnaturalness of having to work in terms of these reference numbers. 
One has to worry about creating and updating them, one seldom wants
to access a record in terms of them, etc. While it is possible to
work around the numbers in Shoebox, they are not a particularly
natural way to organize text mentally, and a nicer model for
manipulating the interlinearized text records could easily be
imagined. One would prefer to work with a model in which the order
of the sentence database was simply defined and maintained
automatically, as the text was imported and updated. Rather than
being treated as the primary key of the database, the implied
reference numbers should be incidental information in an unkeyed
database. In particular there should be a search command that could
instantly access any original word or added gloss, etc., rather than a
search that can only access reference numbers.

In fact, some concordance type programs support this model as far as
access is concerned, though I am not aware of anything that combines
this with a running maintenance scheme of the type suggested, let
alone with interlinearizing. 

Fortunately, there is nothing to keep one from using Shoebox (or IT)
to interlinearize a text and then importing the results into a
concordance program or any other system in order to search it. I've
contemplated using WordCruncher or TACT in such roles, but so far
haven't had the opportunity to go beyond experimentation. 
Unfortunately, a linguistic database, especially in a fieldworking
situation, is apt to be undergoing continuous revision, and with
concordance programs like this one has in most cases to perform a
time-consuming reindexing operation after any modifications to the
text. A degree of effort and computer savvy is needed, too, since in
going from Shoebox to the concordance program you are transferring
data between different programs with different conventions for just
about everything.

There is an alternative. Some of the features of concordance
programs can be gotten by using a simple text searching program
instead. I've actually used MKS's Unix grep for PCs, the Norton
Commander's VIEW function, and Vern Buerg's LIST on linguistic
databases in this way, though not always on Shoebox databases. 
The main problem with this is that none of these programs are easily
made aware of the structure of the database - its records and fields.
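
What a structure-aware text search might look like can be sketched in Python.
The sketch assumes backslash-coded fields with \ref opening each record; the
markers \tx and \gl, the sample data, and the function names are all
assumptions made for illustration, not features of any of the tools named.

```python
# Sketch: a record-aware text search over backslash-coded data.
# The markers \ref, \tx, \gl and the sample data are assumptions.

SAMPLE = """\
\\ref txt 0001
\\tx first sentence
\\gl gloss one

\\ref txt 0002
\\tx second sentence
\\gl gloss two
"""

def parse_records(text, key_marker="ref"):
    """Split text into records; a new record starts at each key-marker line."""
    records = []
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue  # skip blank or unmarked lines in this simple sketch
        marker, _, value = line[1:].partition(" ")
        if marker == key_marker or not records:
            records.append([])
        records[-1].append((marker, value.strip()))
    return records

def search(records, needle, field=None):
    """Return records with needle in any field, or only in the named field."""
    return [
        rec for rec in records
        if any(needle in value and (field is None or marker == field)
               for marker, value in rec)
    ]

records = parse_records(SAMPLE)
hits = search(records, "two", field="gl")
print([dict(rec)["ref"] for rec in hits])  # ['txt 0002']
```

Unlike a plain line-oriented search, this returns whole records and can
restrict the match to one field, at the cost of reading the file into memory.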

As a matter of fact, Tom Payne does the same thing in working with
his Panare database. He interlinearizes with IT and searches the
result with a public domain text searching tool called lookfor. He
then uses Sidekick to paste the material he finds into his word
processor file. Lookfor also ignores structure, but, on the other
hand, it is very fast (much faster than searching in Shoebox), and it
doesn't require reindexing when the database is changed, so the main
problem with combining easy access and easy maintenance is that the
tools for access and maintenance are separate. There are any number
of such tools around, commercial and otherwise, and you can easily
find the combination that suits you. So far you can't find them all
under one software roof, unfortunately. 

The question of searching brings me to Tom's second objection to
Shoebox, which was that searching in Shoebox does not work as he
thinks it should. (He accidentally used the term jumping, which has
another meaning in Shoebox.) Here he and I have a conundrum. 
Searching seems to work one way for him, and another way for me. I
can't figure out why. For example, we are both using version 1.2a,
so it is not a difference of versions. When I first search for a
record with key X, and then move to the next record or the preceding
record, the move is with respect to the new record with key X, and
within the alphabetized sequence of the keys. For him the move is
with respect to the record he was in before he conducted the search. 
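
The first of these two behaviours can be modelled with a small Python sketch.
It is a model of the description above, not Shoebox's actual implementation;
the class KeyedBrowser and its method names are hypothetical.

```python
# Sketch of the navigation behaviour described above, not Shoebox's code.
import bisect

class KeyedBrowser:
    """Holds record keys in sorted order plus a cursor to the current record."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.pos = 0

    def search(self, key):
        """Move the cursor to the record with this key; raise if absent."""
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            raise KeyError(key)
        self.pos = i
        return self.keys[self.pos]

    def next(self):
        """Step forward in key order, stopping at the last record."""
        self.pos = min(self.pos + 1, len(self.keys) - 1)
        return self.keys[self.pos]

    def previous(self):
        """Step backward in key order, stopping at the first record."""
        self.pos = max(self.pos - 1, 0)
        return self.keys[self.pos]

b = KeyedBrowser(["txt 0003", "txt 0001", "txt 0004", "txt 0002"])
b.search("txt 0003")
print(b.next())  # txt 0004: the move is relative to the record just found
```

In this model the second behaviour would correspond to a search that returns
the record without updating the cursor, so that a following next() would be
taken from the old position instead.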

I speculate that this problem may be a bug related to the size of the
database he has, a collection of 2700-odd interlinearized Cebuano
clauses occupying over a megabyte. However, we have not conducted
any experiments to lay out the parameters of the phenomenon. All I
know is that I have never experienced it on any file, and have never
used a file much above 600KB. For example, I don't have the problem
in a tiny five record sample file created to test the search command,
or in a 4500 entry 370 KB Omaha lexical file in which I also tested
search/previous/next.

Tom experiences some other problems with his database that may,
hypothetically, reflect a difficulty with large files, and certainly
seem to reflect at least a corrupted index file. (I will omit the
details I have, because I do not have all of them.) On the other
hand, at the University of Colorado's Center for the Study of the
Native Languages of the Plains and Southwest (CeSNaLPS), we have not
experienced any of the problems of this sort that Tom experiences in
our use of a collection of lexical database files each under c. 300K
in size. 

Proceeding, Tom's main objection on searching is to the metaphor that
Shoebox uses for searching non-key material in the database. Shoebox
has two normal enough searches. One operates over the database as a
whole and searches for any key in the database, i.e., it ignores
non-key fields. The other operates within a particular record, and
searches for particular text within the record. Unfortunately, there
is no search (per se) that operates over the entire database and
searches for any text. What one can do instead is to use an
operation called filtering. 

Filtering means restricting the database to only those records which
match some filtering predicate - those that pass the filter. 
Filtering is very powerful and the least of the things it can do is
find all records with a particular piece of text in them. However,
in some contexts it is an awkward metaphor for searching. As Tom
puts it, "I never filter my sock drawer in the morning for a suitable
pair of socks." While one can always do an honest search of a
Shoebox database by using some other tool to search it, as suggested
above, or even by loading it into Box 9 as text and searching it
there, if it is small enough, it would be nicer if there were simply a
generalized database search facility in Shoebox.
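
The difference between the two metaphors can be sketched in Python. The
records are plain dictionaries here, and the field names, data, and function
names are illustrative assumptions rather than anything taken from Shoebox.

```python
# Sketch: filtering versus searching over a list of records.
# Field names and data are illustrative assumptions.

records = [
    {"ref": "txt 0001", "tx": "first sentence", "gl": "gloss one"},
    {"ref": "txt 0002", "tx": "second sentence", "gl": "gloss two"},
    {"ref": "txt 0003", "tx": "third clause", "gl": "gloss three"},
]

def contains(text):
    """Build a predicate that passes records with text in any field."""
    return lambda rec: any(text in value for value in rec.values())

def filter_records(records, predicate):
    """Filtering: restrict the database to the records that pass."""
    return [rec for rec in records if predicate(rec)]

def search_from(records, predicate, start=0):
    """Searching: index of the first passing record at or after start."""
    for i in range(start, len(records)):
        if predicate(records[i]):
            return i
    return None  # nothing found from this position onward

print([r["ref"] for r in filter_records(records, contains("sentence"))])
# ['txt 0001', 'txt 0002']
print(search_from(records, contains("sentence"), start=1))
# 1
```

The same predicate serves both operations: filtering changes what the
database appears to contain, while searching only moves to the next match and
leaves every other record in view.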

In addition to metaphorical difficulties, filtering seems to have
some problems with bugs, too. Tom reports that after a few filtering
operations one starts finding that filtering fails to find records
that are definitely in the file. I had not noticed this before, but
I was quickly able to find a two filter sequence with the same
problem, and it looks like there is a bug in filtering such that,
after some set of filters not easily characterized, all (or most?)
filtering operations fail. 

Where do we go from here? Tom is rather down on Shoebox, and
wouldn't recommend it for the "average linguist," who wants computer
products to work immediately, in the obvious way, doing precisely
what the linguist wants without any experimentation or searching for
work-arounds. He feels that I should make more of a point of the
problems with Shoebox than I do. Well, he's right - I haven't gone to
any length at all in my contributions to Linguist in pointing out the
problems that exist with Shoebox, and I should have. I hope that this
posting helps to counteract that tendency on my part.

In spite of this, the fact of the matter is, there is very little
micro-computer software that does things useful for linguists in
particular. What exists or can be pressed into service is mostly
limited in power, cranky, limping, and user-unsympathetic, if not
actually user-hostile. I think that Shoebox is a significant step
forward from this state of affairs in the sense that it is a
purpose-made linguistic application with a fairly good user
interface. In fact, I prefer its interface to that of the commercial
package AskSAM, even though there are various major changes to the
Shoebox interface that I'd be happy to see. In the course of my work
I look at a fair amount of DOS/Non-DOS software - retail, shareware,
and public domain - for generalists and specialists - and Shoebox is
a very creditable job in this department. 

In spite of this, I can think of dozens of major and minor
improvements to Shoebox myself, as John Wimbish can attest, and this
is without taking into account fixing the bugs that Tom and others
have been uncovering. Many of the additions I'd like to see are like
Tom's - changes that expand the scope of the package by making it
easier to use with tasks peripheral to its central function of
building and maintaining lexical slip files. However, as I have said
before, Shoebox has this effect on users. It raises their
expectations. And there's nothing wrong with asking for more, but
you can't reject Shoebox because it isn't perfect or universal. For
the moment Shoebox is one essential tool for the computerized
language data worker. 

I would still recommend Shoebox for use in linguistics classes, but I
see that I should qualify this by saying that you would be advised to
get some hands on experience with it in a real project of your own
before casually adding it to the required text list.

John E. Koontz

All views represented are my own, for which I bear sole responsibility.

[End Linguist List, Vol. 2, No. 137]