LINGUIST List 13.1916

Sat Jul 13 2002

Review: Software: VisualText

Editor for this issue: Terence Langendoen <terrylinguistlist.org>


What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Simin Karimi at siminlinguistlist.org.

Directory

  1. Ramon Krosley, Review of VisualText

Message 1: Review of VisualText

Date: Tue, 9 Jul 2002 15:56:49 -0600
From: Ramon Krosley <rkrosleyattbi.com>
Subject: Review of VisualText

Review of VisualText, originally announced at
http://linguistlist.org/issues/13/13-1366.html; also see
http://linguistlist.org/issues/13/13-1915.html for an updated
announcement about the software.

Ramon Krosley, independent software engineer

The goal of this review is to describe VisualText as a tool for
building natural language processing (NLP) applications. This review
considers the process of building NLP applications in two phases:
learning and operation. The first phase gathers information that
enables the execution of the second phase. The learning phase is a
cost to the user, while the operational phase reaps benefits for the
user. This review first summarizes the applications and the two
phases, then discusses some details in appendices.

I. A. Applications

The two classes of applications that appear in this review are text
mining and transformation of text; they define the ends of a continuum
of applications in which VisualText may offer significant value. In
this review, the term "text mining" refers to an application whose goal
is to recognize a fixed set of messages that may appear in a limited
population of texts, such as acquisitions and sales of corporate assets
in news articles. The term "text transformation" in this review refers
to an application whose goal is to model most of the messages in some
population of texts, and to restate those messages in a standard
format. Gisting is an example of the latter application. The
requirements of real applications place them somewhere between these
extremes.

A designer can use VisualText to provide the parsing function of an
application, and as a framework upon which to drape the software that
completes the application's text analysis component. Extensions to
VisualText, called TAIParse and GenTxt,
provide additional framework and linguistic knowledge beyond the basic
parsing mechanism.

One point worth noting is the character set that the product supports.
VisualText processes ASCII text. Support for Unicode is a future
objective, depending upon the market.

The following sections review costs and benefits of applying VisualText
in the learning and operational phases.

I. B. Knowledge Engineering

VisualText implements the learning phase as a knowledge engineering
effort conducted by humans and facilitated by the software. The human
knowledge engineers construct a procedural knowledge base that produces
a unique parse tree. This policy separates the responsibility for
disambiguation from the parsing mechanism that VisualText provides.
The procedural knowledge must address the issue of exploring
alternative interpretations.

I. B. 1. Procedural and Declarative Rules

VisualText's unique style of encoding knowledge affects the effort of
knowledge engineering. The appendix "Model of Analysis" explains why I
chose to categorize this style as "procedural". VisualText's style
provides a level of control that is difficult to achieve in more
declarative rule systems. There are at least two reasons why many
knowledge engineers prefer that control.

First, it is possible to tailor the behavior of the system precisely to
match the customer's preferences for interpreting the training corpus.
This follows from the fact that declarative systems take responsibility
for the computational strategy, while a more procedural system like
VisualText gives more of that control to the knowledge engineers.

Second, the knowledge engineering effort may progress more quickly,
because diagnosis and repair of procedures seem cognitively less
difficult, compared to diagnosis and repair of declarative rules. I
think that the second reason may come from the same cause as a person's
fluency in their first language, but that does not diminish its
importance. The choice between procedural and declarative rules is as
important to the knowledge engineering team as the choice of language
in an international discussion.

It is possible to extend VisualText by writing additional software.
Those extensions can implement any NLP algorithms, including algorithms
that use more declarative knowledge, such as a chart parser.

I. B. 2. Tools

One of the attractive features of VisualText is its rich set of tools
to facilitate the knowledge engineering process. The integrated
development environment provides easy access to those tools.

VisualText includes a tool that helps its users formulate production
rules. By providing a list of examples, or by highlighting examples in
text, a person can request that VisualText generate a rule that
recognizes the examples.

Navigation is simple between views of text, views of a parse tree, and
the rule that produced a parse tree vertex.

The learning phase uses a training corpus that represents the
population of texts that the application will encounter in its
operation. VisualText provides a simple process to gather texts for
that corpus.

The knowledge engineering process adds texts to the training corpus
until a person decides that the knowledge is adequate for operation.
VisualText does not offer tools to facilitate that decision. If
management would like to monitor the progress of the knowledge
engineering effort, the engineers must develop a method for measuring
progress and estimating operational performance.

The appendices of this review mention other tools that VisualText
provides, describing them in the context of the problems that they
solve.

I. B. 3. Knowledge Engineering for Text Mining

For a text mining application, the knowledge engineering effort is a
manageable cost, because both (1) the set of messages to recognize and
(2) the set of text styles are finite. Text Analysis International allowed
me to contact a few of their customers, who report a substantial
reduction in the knowledge engineering effort, compared to software
that preceded VisualText. One person achieved satisfactory results in
about a month, developing an analyzer to extract personal information,
education, work experience, computer skills, and known foreign
languages from text resumes. In another text mining application, a
customer estimated that VisualText reduced their development effort
from four to two person-years.

I. B. 4. Knowledge Engineering for Text Transformation

For a text transformation application, the open set of goal messages
adds a new dimension to the complexity of the knowledge engineering
task, compared to text mining. The procedural knowledge can reduce the
complexity of the task by modeling regularities in this new dimension
and by exploiting general-purpose knowledge sources.

Text Analysis International offers extensions to VisualText, called
TAIParse and GenTxt, which contain mechanisms to address the
requirements of text transformation. For example, GenTxt includes an
extensive English dictionary augmented by WordNet, a stemmer, a part-
of-speech tagger, and a shallow syntactic parser. The extensions are
comprehensive beginnings that can reduce the need to include a software
engineering task in the project budget, but it may be necessary to
employ NLP software engineers to accomplish some goals. Working with
the same toolset as the knowledge engineers, the software engineers
would devise extensions within the fundamental mechanism of VisualText.

A customer reported that approximately six person-months were
sufficient to build a domain-specific natural language generation
system that includes "an integrated lexicon, a logical-form-to-
syntactic-form mapping, and a frame-semantic representation of
underlying conceptual structures".

Of course, the extensions to VisualText that support text
transformation applications can also improve the performance of text
mining applications, but their cost may be unnecessary.

I. C. Operation

VisualText implements the operational phase as a text analyzer that can
execute in a variety of environments, including its integrated
development environment and as a dynamic link library.

The first question that will occur to many computational linguists when
considering VisualText is whether its style of knowledge can generalize
well beyond the training corpus. This question becomes moot when a
project can afford to add any kind of NLP software within the framework
of VisualText. The question is important in projects that plan simply
to add knowledge to basic VisualText. A real project is likely to
develop as a compromise between these two extremes, because the
procedural style of knowledge encourages the knowledge engineers to do
some software engineering.

The measures of operational performance apply outside the training
corpus. Usually they apply to a sample that represents the target
population of texts independently of the training corpus. The
performance of a text analyzer has dimensions of speed and correctness.
The former affects the cost of operation, while the latter measures
generalization. Measures of precision and recall often describe the
correctness issue, while measures of throughput and response time often
describe the speed.
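
As a minimal sketch of the correctness measures named above, the
following Python computes precision and recall for a set of extracted
items against a hand-labeled reference set. The function name and the
sample data are hypothetical illustrations; they are not taken from
VisualText or from any customer project.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of the analyzer's output that is correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the reference items that the analyzer found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical evaluation sample, drawn independently of the training corpus.
gold = {"acquisition:AcmeCo", "sale:WidgetInc",
        "acquisition:FooCorp", "sale:BarLtd"}
predicted = {"acquisition:AcmeCo", "sale:WidgetInc", "sale:BazLLC"}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running the same measurement on a held-out sample after each round of
knowledge engineering would give one possible way to track progress.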

No metrics from actual projects are available. Some of the users with
whom I communicated planned to measure performance, but have not yet
done so. We live in a time when interpretation of natural language
text by machines is evolving in the experiments of pioneers. The
success of many of those projects is measured by accomplishing positive
results. The comparison of alternative technologies through careful
measurement in identical projects is expensive, due to the cost of the
learning phase. The lack of performance measurements makes it
difficult to predict the success of a project that requires a specific
minimum level of performance.

Evaluations of VisualText by users are preliminary, due to its recent
availability. The experiences of customers indicate that VisualText
should provide effective text mining service. Customers are just
starting to use the extensions TAIParse and GenTxt in text
transformation applications.

VisualText's policy of producing a unique parse tree makes for a fast
program in a text mining application. The project's limits on targets
for recognition and styles of expression reduce the risk of poor
generalization. One user reported that his VisualText application was
"clearly outperforming conventional text mining tools".

In a text transformation application, the reliance on extensions to
VisualText makes the performance a responsibility of the software
engineers who write those extensions.

II. Appendix: Model of Analysis

The structure that develops in parsing natural language is usually a
tree in which each vertex classifies the sequence of vertices in its
branches; the leaves of the tree are the smallest tokens of text that
the parser considers. VisualText develops the tree in multiple passes
over the text, each pass building on the results of the prior passes.
The smallest tokens that VisualText considers are strings of adjacent
letters, numbers, punctuation, or whitespace.
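
The token classes described above can be approximated in a few lines of
Python. This is only a sketch of the idea, not VisualText's tokenizer:
the regular expression, the class names, and the decision to group
adjacent punctuation characters into one token are my assumptions, and
the sketch assumes ASCII input.

```python
import re

# One alternative per token class; names are hypothetical labels.
TOKEN_RE = re.compile(
    r"(?P<alpha>[A-Za-z]+)"         # run of adjacent letters
    r"|(?P<num>[0-9]+)"             # run of adjacent digits
    r"|(?P<white>\s+)"              # run of adjacent whitespace
    r"|(?P<punct>[^A-Za-z0-9\s]+)"  # run of anything else
)


def tokenize(text):
    """Split text into (class, string) pairs that cover every character."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


tokens = tokenize("Dr. Smith paid $42.")
for kind, value in tokens:
    print(kind, repr(value))
```

Because every character falls into exactly one class, joining the token
strings reproduces the input, so later passes lose no information.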

The first pass is always a black-box tokenizer in VisualText.
Subsequent passes may be of two types. A recursive pass applies its
rules repeatedly to the result of each successive repetition within the
pass. A pattern pass applies its rules in the order of their
appearance in the file that defines the pass.
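
The difference between the two pass types can be illustrated with a toy
reducer in Python. This is a sketch under my own assumptions, not the
VisualText mechanism: a vertex is a (label, children) pair, a rule is a
(label, pattern) pair, and matching is left to right without overlap.
All names here are hypothetical.

```python
def apply_rule(nodes, rule):
    """Reduce each non-overlapping match of the rule's pattern, left to right."""
    label, pattern = rule
    if len(pattern) < 2:
        # Shorter patterns would not shrink the sequence, so the
        # recursive pass below could loop forever.
        raise ValueError("pattern must contain at least two labels")
    out, i, changed = [], 0, False
    while i < len(nodes):
        window = nodes[i:i + len(pattern)]
        if [node[0] for node in window] == list(pattern):
            out.append((label, window))  # new vertex above the matched nodes
            i += len(pattern)
            changed = True
        else:
            out.append(nodes[i])
            i += 1
    return out, changed


def pattern_pass(nodes, rules):
    """Apply each rule once, in the order the rules are listed."""
    for rule in rules:
        nodes, _ = apply_rule(nodes, rule)
    return nodes


def recursive_pass(nodes, rules):
    """Re-apply the rules to their own output until nothing changes."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            nodes, hit = apply_rule(nodes, rule)
            changed = changed or hit
    return nodes


leaves = [("expr", "a"), ("expr", "b"), ("expr", "c"), ("expr", "d")]
rules = [("expr", ("expr", "expr"))]

print(len(pattern_pass(leaves, rules)))    # one sweep over the sequence
print(len(recursive_pass(leaves, rules)))  # sweeps until stable
```

With these inputs the pattern pass stops after pairing adjacent leaves,
while the recursive pass keeps pairing the results into a single vertex.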

The parsing rules are more procedural than a declarative system such as
a chart parser. If the rules were declarative, then VisualText would
have some freedom (and responsibility) in its choice of the position
and timing of production rules. Instead, the knowledge engineer
specifies the order of execution of the rules, and within a pass a
person can execute procedural statements. The distinction is important
because it gives the user the responsibility for exploration of
alternative structural interpretations.

Part of the specification of a pass in VisualText is the context in
which its production rules apply. The tutorial examples in VisualText
begin with unrestricted passes that recognize small-scale structural
units, such as period-terminated abbreviations. After those passes
have recognized and hidden all the small structural units that do not
mark large-scale boundaries, the examples then apply a pass that
recognizes large-scale structural units, such as period-terminated
sentences. Subsequent passes match patterns within the context of the
large-scale units, recognizing structure that falls between the small
and large scales.
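
The ordering described above, small-scale units first and large-scale
units second, can be sketched in Python. The abbreviation list, the
function names, and the whitespace-based word splitting are hypothetical
simplifications of mine, not the tutorial passes themselves.

```python
# Hypothetical list; a real analyzer would use a larger resource.
ABBREVIATIONS = {"Dr.", "Mr.", "Inc."}


def abbreviation_pass(words):
    """Small-scale pass: mark period-terminated abbreviations."""
    return [("abbrev", w) if w in ABBREVIATIONS else ("word", w)
            for w in words]


def sentence_pass(nodes):
    """Large-scale pass: close a sentence at each period that an
    earlier pass has not already claimed."""
    sentences, current = [], []
    for kind, text in nodes:
        current.append(text)
        if kind == "word" and text.endswith("."):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Smith joined Acme Inc. in May. He left in June."
words = text.split()

with_abbreviations = sentence_pass(abbreviation_pass(words))
without_abbreviations = sentence_pass([("word", w) for w in words])

print(with_abbreviations)
print(without_abbreviations)
```

Skipping the small-scale pass leaves the abbreviation periods visible,
and the sentence pass then splits the text in the wrong places.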

VisualText provides the ability to view the intermediate parse trees
after each pass, which simplifies the diagnosis and repair of
unexpected results. It is also possible to highlight the parts of a
text that a particular pass affects.

At first glance, it seems that the knowledge engineers must not only
gather structural production rules similar to those they would gather
for a chart parser, but also describe the schedule for applying those
rules, which would otherwise be the responsibility of chart parsing
software. After some thought, it appears that the quantity of
knowledge to obtain comparable performance may be similar for
VisualText and for a chart parser. To build an efficient chart parser
for text transformation, the knowledge engineers must provide
information that guides the search among the combinatorial explosion of
alternative parses that can arise from purely structural grammar rules.
That extra knowledge for a chart parser appears in the form of a
greater number of rules and classes of vertices when the knowledge
engineers use a semantic grammar. The extra knowledge appears in the
form of probabilities when they use a probabilistic grammar. In
VisualText, the extra knowledge appears in the procedural
specifications. Determining which approach, if any, is generally more
efficient would require a substantial experiment.

The extensions to VisualText demonstrate that it can implement any
other NLP technology within appropriate contexts of the parse tree.
For example, procedural parsing is ideal for finding the components of
a text that uses a markup language, such as HTML or XML. (VisualText
ships with "library passes" that provide knowledge for interpreting
common parts of text, such as dates, telephone numbers, email
addresses, HTML tags, and XML.) Within those components of the text, a
pass could apply a chart parser, if that best suits the application.
The application designer must add to their project the cost of
obtaining or writing the chart parser in this example.

III. Appendix: Handling Ambiguity

Ambiguity is often an artifact of our design of an analytical process,
in which separate modules produce results to feed other modules. The
product of analyzing text is the message that the writer of the text
probably intended to deliver through the text. A module in a text
analysis process identifies information that is probably part of the
writer's message.

The information that a module identifies serves as clues for later
modules, which recognize additional information. When the natural
dependencies among the pieces of information identified by the modules
are circular, we necessarily find a module earlier in the process that
has insufficient information to produce an unambiguous result. Of
course, the sequence chosen for execution of the modules can also
schedule a module to have insufficient information, even when the
natural dependencies are not circular. Ideally, we should find a later
module in the process, in which the necessary information has finally
accrued and a unique result is possible. It is possible that the
necessary information does not accrue during the process, and ambiguity
remains after the analysis.

This discussion distinguishes the residual ambiguity from the temporary
structural ambiguity that appears within the process as a result of our
choice of modules. NLP software must manage both forms of ambiguity,
but the temporary structural ambiguity is important for its effect on
the performance of the software.

When we implement our view of an analytical process as computer
software, the ambiguity inherent in our choice of modules becomes real.
The software must include steps that resolve ambiguities. The software
must also have a data structure to retain the ambiguous possibilities
from the time of their discovery until the time that there is
sufficient knowledge to choose among the possibilities.

In the case of VisualText, the "modules" of this discussion correspond
to the "passes" of the analyzer. The information about ambiguities may
appear in one of the data structures mentioned in the appendix "Data
Structures", or it may appear in the structure of the parse tree.

The parse tree may represent ambiguous choices for a polysemous word or
phrase as a series of singly branched vertices, known as the "singlet-
and-base" structure in VisualText terminology. The pattern-matching
component of VisualText can search the chain of vertices to make an
appropriate choice based on contextual clues.
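
A chain of singly branched vertices can be modeled in Python as follows.
This sketch reflects only my reading of the description above; the
class, the labels, and the contextual rule are hypothetical and do not
reproduce VisualText's pattern-matching component.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Vertex:
    label: str                        # e.g. "noun", "verb", or "token"
    child: Optional["Vertex"] = None  # the single branch toward the base
    text: str = ""                    # set only on the base vertex


def chain_labels(top: Vertex) -> List[str]:
    """Collect the labels from the top vertex down to the base."""
    labels = []
    vertex = top
    while vertex is not None:
        labels.append(vertex.label)
        vertex = vertex.child
    return labels


def choose_reading(top: Vertex, previous_label: str) -> Optional[str]:
    """Pick one reading from the chain, using the preceding label as a clue."""
    readings = chain_labels(top)[:-1]  # drop the base token itself
    # Hypothetical clues: a determiner favors a noun, a pronoun a verb.
    if previous_label == "det" and "noun" in readings:
        return "noun"
    if previous_label == "pronoun" and "verb" in readings:
        return "verb"
    return readings[0] if readings else None


# "saw" carries two candidate readings stacked above its base token.
saw = Vertex("noun", Vertex("verb", Vertex("token", text="saw")))

print(choose_reading(saw, "det"))      # as in "the saw"
print(choose_reading(saw, "pronoun"))  # as in "she saw"
```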

One of the controls available to knowledge engineers using VisualText
is altering the sequence of passes, in order to achieve a satisfactory
analysis. Ideally, each pass should provide an analysis that is
cleanly separable from subsequent passes, without structural ambiguity.
If that clean separation were possible for general texts, then the
sequence of passes that achieves it would be a remarkable theory of
language. In practice, the model fits the text approximately.
Experimenting with the sequence of passes is one way to optimize the
match between the model and its training corpus.

Compared to chart parsing, where exploration of combinatorial
alternatives is exhaustive unless pruned by specific additions to the
algorithm, the procedural rules in VisualText reverse that strategy.
The procedural rules prune the search as they build a trusted parse
tree, exploring alternative structures only when the pass file
specifies it. As knowledge engineers make the results more correct and
efficient, systems using either strategy probably approach a common
level of ambiguity, but the VisualText strategy errs in favor of faster
execution.

IV. Appendix: Data Structures

This appendix briefly identifies major data structures available to
procedures in VisualText. These structures may retain information
about ambiguities, or store a dictionary, or serve any other purpose
that a knowledge engineer or software engineer may devise.

The variables live in a variety of maps. One of the maps is globally
available across the analyzer. Other maps associate with each vertex
in a parse tree, or they associate with an instance of a procedure.
Referencing the value of a variable consists of identifying the map and
the name of the variable.

The knowledge base is a tree of concepts, each of which may contain
attributes. The attributes of a concept are a map from attribute name
to collection of values. A concept may also contain a sequence of
references to other concepts. A knowledge engineer may assign any
interpretation to this data structure, or to a part of the structure.
Typically, the concepts close to the trunk of the tree divide it into
subtrees that have specific interpretations. For example, the concept
at the end of the path concept-sys-dict may be the trunk of a subtree
that functions as a dictionary. The knowledge base may persist between
executions of an analyzer, and an analyzer may alter its contents
during execution.
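
The shape of the knowledge base described above can be sketched with a
small Python class. The class and function names are hypothetical, and
this is not the interface that VisualText exposes; the sketch shows only
a tree of concepts, an attribute map from name to a collection of
values, and a sequence of references to other concepts.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    children: dict = field(default_factory=dict)    # child name -> Concept
    attributes: dict = field(default_factory=dict)  # name -> list of values
    references: list = field(default_factory=list)  # ordered concept refs

    def child(self, name):
        """Return the named child concept, creating it if absent."""
        if name not in self.children:
            self.children[name] = Concept(name)
        return self.children[name]

    def add_value(self, attribute, value):
        """Append a value to the collection stored under an attribute."""
        self.attributes.setdefault(attribute, []).append(value)


def find_or_make(root, path):
    """Walk a path of concept names below root, creating missing concepts."""
    node = root
    for name in path:
        node = node.child(name)
    return node


# The path concept-sys-dict from the text: a subtree used as a dictionary.
root = Concept("concept")
dictionary = find_or_make(root, ["sys", "dict"])

saw = dictionary.child("saw")
saw.add_value("pos", "noun")
saw.add_value("pos", "verb")

print(saw.attributes)
```

Persisting such a tree between executions, as the text describes, would
need a serialization step that this sketch omits.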

V. Acknowledgements

I am grateful to Amnon Meyers, Patrice Mellot, James Luke, Paul Deane,
D. Bruce Rex, and George Calvert for their helpful insights and
communications.

ABOUT THE REVIEWER
Ramon Krosley is working to become a member of the community of
natural language scientists. He has been studying and writing natural
language processing software for more than two years.