LINGUIST List 13.1916

Sat Jul 13 2002

Review: Software: VisualText

Editor for this issue: Terence Langendoen

What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in.

If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Simin Karimi.


  • Ramon Krosley, Review of VisualText

    Message 1: Review of VisualText

    Date: Tue, 9 Jul 2002 15:56:49 -0600
    From: Ramon Krosley
    Subject: Review of VisualText

    Review of VisualText, which was announced earlier and later described in an updated announcement about the software.

    Ramon Krosley, independent software engineer

    The goal of this review is to describe VisualText as a tool for building natural language processing (NLP) applications. This review considers the process of building NLP applications in two phases: learning and operation. The first phase gathers information that enables the execution of the second phase. The learning phase is a cost to the user, while the operational phase reaps benefits for the user. The review first summarizes the applications and the two phases; appendices then discuss some details.

    I. A. Applications

    The two classes of applications that appear in this review are text mining and transformation of text; they define the ends of a continuum of applications in which VisualText may offer significant value. In this review, the term "text mining" refers to an application whose goal is to recognize a fixed set of messages that may appear in a limited population of texts, such as acquisitions and sales of corporate assets in news articles. The term "text transformation" in this review refers to an application whose goal is to model most of the messages in some population of texts, and to restate those messages in a standard format. Gisting is an example of the latter application. The requirements of real applications place them somewhere between these extremes.

    A designer can use VisualText to provide the parsing function of their application, and VisualText provides a framework upon which to drape the software that completes the text analysis component of their application. Extensions to VisualText, called TAIParse and GenTxt, provide additional framework and linguistic knowledge beyond the basic parsing mechanism.

    One feature worth noting is the character set that the product supports. VisualText processes ASCII text. Support for Unicode is a future objective, depending upon market demand.

    The following sections review costs and benefits of applying VisualText in the learning and operational phases.

    I. B. Knowledge Engineering

    VisualText implements the learning phase as a knowledge engineering effort conducted by humans and facilitated by the software. The human knowledge engineers construct a procedural knowledge base that produces a unique parse tree. This policy separates the responsibility for disambiguation from the parsing mechanism that VisualText provides. The procedural knowledge must address the issue of exploring alternative interpretations.

    I. B. 1. Procedural and Declarative Rules

    VisualText's unique style of encoding knowledge affects the effort of knowledge engineering. The appendix "Model of Analysis" explains why I chose to categorize this style as "procedural". VisualText's style provides a level of control that is difficult to achieve in more declarative rule systems. There are at least two reasons why many knowledge engineers prefer that control.

    First, knowledge engineers can tailor the behavior of the system precisely to match the customer's preferences for interpreting the training corpus. This is possible because declarative systems take responsibility for the computational strategy, while a more procedural system like VisualText gives more of that control to the knowledge engineers.

    Second, the knowledge engineering effort may progress more quickly, because diagnosis and repair of procedures seem cognitively less difficult, compared to diagnosis and repair of declarative rules. I think that the second reason may come from the same cause as a person's fluency in their first language, but that does not diminish its importance. The choice between procedural and declarative rules is as important to the knowledge engineering team as the choice of language in an international discussion.

    It is possible to extend VisualText by writing additional software. Those extensions can implement any NLP algorithms, including algorithms that use more declarative knowledge, such as a chart parser.

    I. B. 2. Tools

    One of the attractive features of VisualText is its rich set of tools to facilitate the knowledge engineering process. The integrated development environment provides easy access to those tools.

    VisualText includes a tool that helps its users formulate production rules. By providing a list of examples, or by highlighting examples in text, a person can request that VisualText generate a rule that recognizes the examples.

    Navigation is simple between views of text, views of a parse tree, and the rule that produced a parse tree vertex.

    The learning phase uses a training corpus that represents the population of texts that the application will encounter in its operation. VisualText provides a simple process to gather texts for that corpus.

    The knowledge engineering process adds texts to the training corpus until a person decides that the knowledge is adequate for operation. VisualText does not offer tools to facilitate that decision. If management would like to monitor the progress of the knowledge engineering effort, the engineers must develop a method for measuring progress and estimating operational performance.

    The appendices of this review mention other tools that VisualText provides, describing them in the context of the problems that they solve.

    I. B. 3. Knowledge Engineering for Text Mining

    For a text mining application, the knowledge engineering effort is a manageable cost, because of the two finite sets of (1) messages to recognize and (2) styles of texts. Text Analysis International allowed me to contact a few of their customers, who report a substantial reduction in the knowledge engineering effort, compared to software that preceded VisualText. One person achieved satisfactory results in about a month, developing an analyzer to extract personal information, education, work experience, computer skills, and known foreign languages from text resumes. In another text mining application, a customer estimated that VisualText reduced their development effort from four to two person-years.

    I. B. 4. Knowledge Engineering for Text Transformation

    For a text transformation application, the open set of goal messages adds a new dimension to the complexity of the knowledge engineering task, compared to text mining. The procedural knowledge can reduce the complexity of the task by modeling regularities in this new dimension and by exploiting general-purpose knowledge sources.

    Text Analysis International offers extensions to VisualText, called TAIParse and GenTxt, which contain mechanisms to address the requirements of text transformation. For example, GenTxt includes an extensive English dictionary augmented by WordNet, a stemmer, a part-of-speech tagger, and a shallow syntactic parser. The extensions are comprehensive beginnings that can reduce the need to include a software engineering task in the project budget, but it may be necessary to employ NLP software engineers to accomplish some goals. Working with the same toolset as the knowledge engineers, the software engineers would devise extensions within the fundamental mechanism of VisualText.

    A customer reported that approximately six person-months were sufficient to build a domain-specific natural language generation system that includes "an integrated lexicon, a logical-form-to-syntactic-form mapping, and a frame-semantic representation of underlying conceptual structures".

    Of course, the extensions to VisualText that support text transformation applications can also improve the performance of text mining applications, but their cost may be unnecessary.

    I. C. Operation

    VisualText implements the operational phase as a text analyzer that can execute in a variety of environments, including its integrated development environment and as a dynamic link library.

    The first thought that will come to many computational linguists when considering VisualText is whether its style of knowledge can generalize well, beyond the training corpus. This question becomes moot when a project can afford to add any kind of NLP software within the framework of VisualText. The question is important in projects that plan simply to add knowledge to basic VisualText. A real project is likely to develop as a compromise between these two extremes, because the procedural style of knowledge encourages the knowledge engineers to do some software engineering.

    The measures of operational performance apply outside the training corpus. Usually they apply to a sample that represents the target population of texts independently of the training corpus. The performance of a text analyzer has dimensions of speed and correctness. The former affects the cost of operation, while the latter measures generalization. Measures of precision and recall often describe the correctness issue, while measures of throughput and response time often describe the speed.
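
    As a concrete illustration of the correctness measures named above, the following minimal sketch computes precision and recall for a held-out sample. The item strings and the function name are hypothetical and are not part of VisualText; the sketch assumes that the analyzer's output and a human-made answer key can both be expressed as sets of extracted items.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for two collections of extracted items."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    # Precision: share of the analyzer's output that is correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of the answer key that the analyzer found.
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical sample drawn from outside the training corpus.
found = {"acquisition:A->B", "sale:C->D", "sale:E->F"}
key = {"acquisition:A->B", "sale:C->D", "acquisition:G->H", "sale:I->J"}
precision, recall = precision_recall(found, key)
```

    Here two of the three extracted items are in the key (precision 2/3), and two of the four key items were found (recall 1/2).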

    No metrics from actual projects are available. Some of the users with whom I communicated planned to measure performance, but have not yet done so. We live in a time when interpretation of natural language text by machines is evolving in the experiments of pioneers. The success of many of those projects is measured by accomplishing positive results. The comparison of alternative technologies through careful measurement in identical projects is expensive, due to the cost of the learning phase. The lack of performance measurements makes it difficult to predict the success of a project that requires a specific minimum level of performance.

    Evaluations of VisualText by users are preliminary, due to its recent availability. The experiences of customers indicate that VisualText should provide effective text mining service. Customers are just starting to use the extensions TAIParse and GenTxt in text transformation applications.

    VisualText's policy of producing a unique parse tree makes for a fast program in a text mining application. The project's limits on targets for recognition and styles of expression reduce the risk of poor generalization. One user reported that his VisualText application was "clearly outperforming conventional text mining tools".

    In a text transformation application, the reliance on extensions to VisualText makes the performance a responsibility of the software engineers who write those extensions.

    II. Appendix: Model of Analysis

    The structure that develops in parsing natural language is usually a tree in which each vertex classifies the sequence of vertices in its branches; the leaves of the tree are the smallest tokens of text that the parser considers. VisualText develops the tree in multiple passes over the text, each pass building on the results of the prior passes. The smallest tokens that VisualText considers are strings of adjacent letters, numbers, punctuation, or whitespace.
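
    The token classes described above can be sketched with a regular expression. This is only an approximation under stated assumptions: the class names are invented for illustration, ASCII input is assumed, and the actual VisualText tokenizer is a black box whose exact behavior may differ.

```python
import re

# One alternative per token class; the group names are hypothetical labels.
TOKEN = re.compile(
    r"(?P<alpha>[A-Za-z]+)"
    r"|(?P<num>[0-9]+)"
    r"|(?P<white>\s+)"
    r"|(?P<punct>[^A-Za-z0-9\s]+)"
)


def tokenize(text):
    """Split text into (class, string) pairs that cover every character."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]


tokens = tokenize("Dr. Smith paid $25.")
# The first tokens are ("alpha", "Dr"), ("punct", "."), ("white", " ").
```

    Because every character belongs to exactly one class, concatenating the token strings reproduces the input, so these leaves can anchor a parse tree without losing text.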

    The first pass is always a black-box tokenizer in VisualText. Subsequent passes may be of two types. A recursive pass applies its rules repeatedly to the result of each successive repetition within the pass. A pattern pass applies its rules in the order of their appearance in the file that defines the pass.
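
    The difference between the two pass types can be sketched as follows. This is a simplified model, not VisualText's rule language: vertices are reduced to bare labels, a rule is a hypothetical (pattern, label) pair, and the recursive pass is assumed to stop when a repetition changes nothing.

```python
def apply_rules_once(nodes, rules):
    """Scan left to right; at each position try the rules in file order."""
    out, i = [], 0
    while i < len(nodes):
        for pattern, label in rules:
            if tuple(nodes[i:i + len(pattern)]) == pattern:
                out.append(label)      # replace the matched run by one vertex
                i += len(pattern)
                break
        else:
            out.append(nodes[i])       # no rule matched here; keep the vertex
            i += 1
    return out


def pattern_pass(nodes, rules):
    """Apply the rules in one sweep over the current vertices."""
    return apply_rules_once(nodes, rules)


def recursive_pass(nodes, rules):
    """Reapply the rules to each successive result until nothing changes."""
    while True:
        reduced = apply_rules_once(nodes, rules)
        if reduced == nodes:
            return nodes
        nodes = reduced


rules = [(("adj", "noun"), "noun")]
nodes = ["det", "adj", "adj", "noun"]
once = pattern_pass(nodes, rules)     # ["det", "adj", "noun"]
fixed = recursive_pass(nodes, rules)  # ["det", "noun"]
```

    The single sweep absorbs only the adjective next to the noun, while the repeated sweeps absorb both. The sketch assumes rules that eventually stop changing the sequence; rules that rewrite each other's output indefinitely would not terminate.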

    The parsing rules are more procedural than a declarative system such as a chart parser. If the rules were declarative, then VisualText would have some freedom (and responsibility) in its choice of the position and timing of production rules. Instead, the knowledge engineer specifies the order of execution of the rules, and within a pass a person can execute procedural statements. The distinction is important because it gives the user the responsibility for exploration of alternative structural interpretations.

    Part of the specification of a pass in VisualText is the context in which its production rules apply. The tutorial examples in VisualText begin with unrestricted passes that recognize small-scale structural units, such as period-terminated abbreviations. After those passes have recognized and hidden all the small structural units that do not mark large-scale boundaries, the examples then apply a pass that recognizes large-scale structural units, such as period-terminated sentences. Subsequent passes match patterns within the context of the large-scale units, recognizing structure that falls between the small and large scales.
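
    The ordering described above, small units first and sentences afterward, can be illustrated with a toy two-pass pipeline. The abbreviation list, the function names, and the whitespace splitting are assumptions made for this sketch; they do not reproduce the tutorial passes that ship with VisualText.

```python
ABBREVIATIONS = {"Dr.", "Inc.", "Mr."}  # hypothetical small-scale units


def abbreviation_pass(words):
    """Mark period-terminated abbreviations so their periods are hidden."""
    return [("abbrev", w) if w in ABBREVIATIONS else ("word", w) for w in words]


def sentence_pass(nodes):
    """Close a sentence only at a period that an earlier pass did not hide."""
    sentences, current = [], []
    for kind, text in nodes:
        current.append(text)
        if kind == "word" and text.endswith("."):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Lee joined Acme Inc. in May. She left in June."
sentences = sentence_pass(abbreviation_pass(text.split()))
```

    Running the sentence pass without the abbreviation pass would split after "Dr." and "Inc.", which is why the small-scale pass must come first.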

    VisualText provides the ability to view the intermediate parse trees after each pass, which simplifies the diagnosis and repair of unexpected results. It is also possible to highlight the parts of a text that a particular pass affects.

    At first glance, it seems that the knowledge engineers must not only gather structural production rules similar to those they would gather for a chart parser, but must also describe the schedule for applying those rules, which would otherwise be the responsibility of the chart parsing software. On reflection, the quantity of knowledge needed to obtain comparable performance may be similar for VisualText and for a chart parser. To build an efficient chart parser for text transformation, the knowledge engineers must provide information that guides the search among the combinatorial explosion of alternative parses that can arise from purely structural grammar rules. That extra knowledge for a chart parser appears in the form of a greater number of rules and classes of vertices when the knowledge engineers use a semantic grammar. The extra knowledge appears in the form of probabilities when they use a probabilistic grammar. In VisualText, the extra knowledge appears in the procedural specifications. Determining which approach, if any, is generally more efficient would require a substantial experiment.

    The extensions to VisualText demonstrate that it can implement any other NLP technology within appropriate contexts of the parse tree. For example, procedural parsing is ideal for finding the components of a text that uses a markup language, such as HTML or XML. (VisualText ships with "library passes" that provide knowledge for interpreting common parts of text, such as dates, telephone numbers, email addresses, HTML tags, and XML.) Within those components of the text, a pass could apply a chart parser, if that best suits the application. The application designer must add to their project the cost of obtaining or writing the chart parser in this example.

    III. Appendix: Handling Ambiguity

    Ambiguity is often an artifact of our design of an analytical process, in which separate modules produce results to feed other modules. The product of analyzing text is the message that the writer of the text probably intended to deliver through the text. A module in a text analysis process identifies information that is probably part of the writer's message.

    The information that a module identifies serves as clues for later modules, which recognize additional information. When the natural dependencies among the pieces of information identified by the modules are circular, we necessarily find a module earlier in the process that has insufficient information to produce an unambiguous result. Of course, the sequence chosen for execution of the modules can also schedule a module to have insufficient information, even when the natural dependencies are not circular. Ideally, we should find a later module in the process, in which the necessary information has finally accrued and a unique result is possible. It is possible that the necessary information does not accrue during the process, and ambiguity remains after the analysis.

    This discussion distinguishes the residual ambiguity from the temporary structural ambiguity that appears within the process as a result of our choice of modules. NLP software must manage both forms of ambiguity, but the temporary structural ambiguity is important for its effect on the performance of the software.

    When we implement our view of an analytical process as computer software, the ambiguity inherent in our choice of modules becomes real. The software must include steps that resolve ambiguities. The software must also have a data structure to retain the ambiguous possibilities from the time of their discovery until the time that there is sufficient knowledge to choose among the possibilities.

    In the case of VisualText, the "modules" of this discussion correspond to the "passes" of the analyzer. The information about ambiguities may appear in one of the data structures mentioned in the appendix "Data Structures", or it may appear in the structure of the parse tree.

    The parse tree may represent ambiguous choices for a polysemous word or phrase as a series of singly branched vertices, known as the "singlet-and-base" structure in VisualText terminology. The pattern-matching component of VisualText can search the chain of vertices to make an appropriate choice based on contextual clues.
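
    A chain of singly branched vertices can be modeled as below. The class, the function names, and the idea of passing the context as a set of acceptable labels are all assumptions for illustration; the review does not describe how VisualText lays out or searches the chain internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vertex:
    label: str
    child: Optional["Vertex"] = None  # at most one branch below this vertex


def chain_labels(top):
    """Collect the labels from the top of the chain down to the base token."""
    labels = []
    while top is not None:
        labels.append(top.label)
        top = top.child
    return labels


def choose(top, allowed_by_context):
    """Return the first reading in the chain that the context permits."""
    for label in chain_labels(top):
        if label in allowed_by_context:
            return label
    return None


# The word "saw" with a noun reading stacked over a verb reading.
saw = Vertex("noun", Vertex("verb", Vertex("saw")))
after_determiner = choose(saw, {"noun", "adj"})  # "noun"
after_pronoun = choose(saw, {"verb"})            # "verb"
```

    The chain keeps every candidate reading attached to the same stretch of text, so a later pass can pick one without rebuilding the tree.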

    One of the controls available to knowledge engineers using VisualText is altering the sequence of passes, in order to achieve a satisfactory analysis. Ideally, each pass should provide an analysis that is cleanly separable from subsequent passes, without structural ambiguity. If that clean separation were possible for general texts, then the sequence of passes that achieves it would be a remarkable theory of language. In practice, the model fits the text approximately. Experimenting with the sequence of passes is one way to optimize the match between the model and its training corpus.

    Compared to chart parsing, where exploration of combinatorial alternatives is exhaustive unless pruned by specific additions to the algorithm, the procedural rules in VisualText reverse that strategy. The procedural rules prune the search as they build a trusted parse tree, exploring alternative structures only when the pass file specifies it. As knowledge engineers make the results more correct and efficient, systems using either strategy probably approach a common level of ambiguity, but the VisualText strategy errs in favor of faster execution.

    IV. Appendix: Data Structures

    This appendix briefly identifies major data structures available to procedures in VisualText. These structures may retain information about ambiguities, or store a dictionary, or serve any other purpose that a knowledge engineer or software engineer may devise.

    The variables live in a variety of maps. One of the maps is globally available across the analyzer. Other maps associate with each vertex in a parse tree, or they associate with an instance of a procedure. Referencing the value of a variable consists of identifying the map and the name of the variable.
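
    The idea that a variable reference names both a map and a variable can be sketched as follows. The class, the scope names, and the owner identifiers are hypothetical; VisualText's own syntax for these references is not shown in this review.

```python
from collections import defaultdict


class VariableStore:
    """Hypothetical container: one global map plus one map per owner."""

    def __init__(self):
        self.global_map = {}
        # (scope, owner) -> {variable name: value}
        self.owner_maps = defaultdict(dict)

    def map_for(self, scope, owner=None):
        """Identify the map: global, per vertex, or per procedure instance."""
        if scope == "global":
            return self.global_map
        if scope in ("vertex", "procedure"):
            return self.owner_maps[(scope, owner)]
        raise ValueError(f"unknown scope: {scope}")

    def set(self, scope, name, value, owner=None):
        self.map_for(scope, owner)[name] = value

    def get(self, scope, name, owner=None):
        return self.map_for(scope, owner)[name]


store = VariableStore()
store.set("global", "sentence_count", 2)
store.set("vertex", "senses", ["noun", "verb"], owner="vertex-17")
store.set("procedure", "matches", 0, owner="pass-3")
```

    A variable set on one vertex is invisible from another vertex, which is what allows each vertex to carry its own record of unresolved alternatives.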

    The knowledge base is a tree of concepts, each of which may contain attributes. The attributes of a concept are a map from attribute name to collection of values. A concept may also contain a sequence of references to other concepts. A knowledge engineer may assign any interpretation to this data structure, or to a part of the structure. Typically, the concepts close to the trunk of the tree divide it into subtrees that have specific interpretations. For example, the concept at the end of the path concept-sys-dict may be the trunk of a subtree that functions as a dictionary. The knowledge base may persist between executions of an analyzer, and an analyzer may alter its contents during execution.
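
    The concept tree can be pictured with a small data class. This is a sketch under assumptions: the field names and the descend helper are invented here, and only the path concept-sys-dict comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    attributes: dict = field(default_factory=dict)  # attribute name -> list of values
    children: dict = field(default_factory=dict)    # child name -> Concept
    references: list = field(default_factory=list)  # ordered references to other concepts

    def descend(self, *path):
        """Follow a path of child names, creating missing concepts on the way."""
        node = self
        for name in path:
            node = node.children.setdefault(name, Concept(name))
        return node


root = Concept("concept")
dictionary = root.descend("sys", "dict")   # the subtree used as a dictionary
entry = dictionary.descend("saw")
entry.attributes.setdefault("pos", []).extend(["noun", "verb"])
```

    Looking up the same path again returns the existing concepts, so an analyzer could read the entry for "saw" during one pass and add to it during another.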

    V. Acknowledgements

    I am grateful to Amnon Meyers, Patrice Mellot, James Luke, Paul Deane, D. Bruce Rex, and George Calvert for their helpful insights and communications.

    ABOUT THE REVIEWER

    Ramon Krosley is working to become a member of the community of natural language scientists. He has been studying and writing natural language processing software for more than two years.