LINGUIST List 4.534

Wed 07 Jul 1993

Sum: CART

Editor for this issue: <>


Directory

  1. "Ederveen D.", Summary: CART

Message 1: Summary: CART

Date: Wed, 07 Jul 1993 17:41:41 Summary: CART
From: "Ederveen D." <D.N.M.Ederveenresearch.ptt.nl>
Subject: Summary: CART

Some days ago I posted a query on CART, and I promised to post a summary.
Here it is. I received responses from the following people:
Mike Anderson, Lee Applebaum, Leo Breiman, John Choi, Barrett P. Eynon,
Jonathan Foote, Georg Fuellen, Mark Irwin, Peter G. Hamer, Trevor Hastie,
Wei-Yin Loh, Richard Olshen, Charles Roosen, Ajay Shah, William D. Shannon,
Charles J. Stone, Joerg Ueberla. Thanks *very* much to all.

The person to contact for the original CART software is not Richard Olshen
but Charles J. Stone <stonegandalf.Berkeley.EDU>. CART is distributed
for UNIX, VMS, CMS and MVS--the academic price is $500.

SYSTAT Corporation distributes a PC version of CART. They can be reached at
SYSTAT, Inc., 1800 Sherman Avenue, Evanston, IL 60201, USA.
Phone: (708) 864-5670, FAX: (708) 492-3567.

 ------------------------------------------------------------------------------

A. OTHER CART SOFTWARE

0. SPLUS

Some of the CART has been implemented in recent versions fo the SPLUS
language (a statistical language that is very popular in many math/stat
departments). References to a couple of papers on using CART for
language modeling in speech recognition appear below.

[ another remark on SPLUS, don't know if it's the same software: ]
Look into S-Plus from StatSci located in Seattle. Their implementation has
some slight differences, but they have an extraordinary interactive
graphical environment.

[ another remark on SPLUS, don't know if it's the same software: ]
I took a class last fall with Leo Breiman on this topic, and I've used
both the original CART implementation and the Splus version called
tree(). I prefer the CART program, but the Splus version is workable -
it is probably best used with theprune.tree() and cross-validation
options, this is the closest to what CART does. The book "Statistical
Models in S"" has a discussion of the varioustree functions and their
implementation.

1. C4.5:

C source code for a new, improved decision tree algorithm known as C4.5 is
in the new book by Ross Quinlan (of ID3 fame). "C4.5: Programs for Machine
Learning", Morgan Kaufmann, 1992. It goes for $44.95. With accompanying
software on magnetic media it runs for $69.95. ISBN # 1-55860-238-0

2. FACT:

There is a fortran program called FACT that is similar to CART. It can
be ftp'd from lib.stat.cmu.edu (login with "statlib" with your email
address as password) under the "general" subdirectory. There is also a
bried user guide for FACT.

3. IND Version 2.1 - creation and manipulation of decision trees from data:

A common approach to supervised classification and prediction in
artificial intelligence and statistical pattern recognition is the use
of decision trees. A tree is "grown" from data using a recursive
partitioning algorithm to create a tree which (hopefully) has good
prediction of classes on new data. Standard algorithms are CART (by
Breiman, Friedman, Olshen and Stone) and Id3 and its successor C4.5 (by
Quinlan). More recent techniques are Buntine's smoothing and option
trees, Wallace and Patrick's MML method, and Oliver and Wallace's MML
decision graphs which extend the tree representation to graphs. IND
reimplements and integrates these methods. The newer methods produce
more accurate class probability estimates that are important in
applications like diagnosis.

IND is applicable to most data sets consisting of independent
instances, each described by a fixed length vector of attribute values.
An attribute value may be a number, one of a set of attribute specific
symbols, or omitted. One of the attributes is delegated the "target"
and IND grows trees to predict the target. Prediction can then be done
on new data or the decision tree printed out for inspection.

IND provides a range of features and styles with convenience for the
casual user as well as fine-tuning for the advanced user or those
interested in research. Advanced features allow more extensive search,
interactive control and display of tree growing, and Bayesian and MML
algorithms for tree pruning and smoothing. These often produce more
accurate class probability estimates at the leaves. IND also comes
with a comprehensive experimental control suite.

IND consist of four basic kinds of routines; data manipulation
routines, tree generation routines, tree testing routines, and tree
display routines. The data manipulation routines are used to partition
a single large data set into smaller training and test sets. The
generation routines are used to build classifiers. The test routines
are used to evaluate classifiers and to classify data using a
classifier. And the display routines are used to display classifiers in
various formats.

IND is written in K&R C, with controlling scripts in the "csh" shell of
UNIX, and extensive UNIX man entries. It is designed to be used on any
UNIX system, although it has only been thoroughly tested on SUN
platforms. Assistence with porting to other machines will be given in
some cases. IND comes with a manual giving a guide to tree methods and
pointers to the literature, and several companion documents.

IND Version 2.0 will shortly be available through NASA's COSMIC
facility. IND Version 2.1 is available strictly as unsupported
beta-test software. If you're interested in obtaining a beta-test copy,
with no obligation on your part to provide feedback, contact
 Wray Buntine
 NASA Ames Research Center
 Mail Stop 269-2
 Moffett Field, CA, 94035
 email: wraykronos.arc.nasa.gov
Unfortunately, the beta-test version is not available for overseas.
This is standard NASA policy. Version 2.0, however, should
be available soon at a modest price from NASAs COSMIC center
in Georgia, USA. Enquiries should be directed to:
 mail (to customer support): servicecossack.cosmic.uga.edu
 Phone: (706) 542-3265 and ask for customer support
 FAX: (706) 542-4807.

4. ADS software:

I saw your posting just as I was going to post the same query. We, at
ADS (now a div. of Booz-Allen & Hamilton) have a homebuilt
implementation that is difficult to use. We are using CART in document
(free-form or formatted text) profiling experiments; actually, we use
it in conjunction with other, including our own, text analysis
software. I anxiously await hearing your responses.
Lee Appelbaum <leeads.com>

4. Knowledge Seeker:

Telephone number to contact for "Commercial CART package":
The number for _Knowledge_Seeker_ is 613 723 8020.

5. General information on numerical software:

Have a look at the "Index to free source code in C or C++ for numerical
computation". It is the file pub/C-numanal/numcomp-free-c.gz on usc.edu
If you don't have access to ftp, send mail to ftpmaildecwrl.dec.com
saying "help"; you will be given information needed to do ftp by email.

 ------------------------------------------------------------------------------

B. REFERENCES ON CART AND SPEECH

Following is a list, in various formats, of the references I received.

 o "Statistical Models in S" by John M. Chambers and Trevor J. Hastie.
 (description of version of CART in the S language) 1992, ISBN 0-534-16674-0

 o a thesis by Jonathan Foote <jtflems.Brown.EDU> on using decision trees
 for vector quantization in an HMM speech recognition system. Included
 is a section on previous uses of decision trees in speech research.

 o a dissertation by Pitrelli (Victor Zue's lab at MIT)

 o Riley, Michael D. (1992). Tree-based modelling of segmental
 durations. In G. Bailly, et al (eds.), Talking Machines:
 Theories, Models and Designs. Elsevier Science Publishers.

 author = {Loh, W.-Y. and Vanichsetakul, N.},
 title = {Tree-structured classification via generalized discriminant
 analysis (with discussion)},
 journal = {Journal of the American Statistical Association},
 year = {1988},
 volume = {83},
 pages = {715--728}

TechReport(Kuh92,
 author = "Roland Kuhn and Renato de Mori",
 title = "Keyword Patterns Generated by String
 Classification Trees",
 institution = "School of Computer Science, McGill University",
 year = "1992",
 email = "kuhncrim.ca")

InProceedings(Bla92b,
 author = "E. Black and F. Jelinek and J. Lafferty and R.
 Mercer and S. Roukos",
 title = "Decision Tree Models Applied to the Labeling of
 Text with Parts-of-Speech",
 pages = "117-121",
 booktitle = darpa,
 organization = "Harriman, New York",
 year = "February 1992")

InProceedings(Wae92,
 author = "Nick Waegner and Steve Young",
 title = "A trellis-based Language Model for Speech Recognition",
 pages = "245-248",
 booktitle = "International Conference on Spoken Language Processing",
 organization = "Banff, Canada",
 year = "1992")

InCollection(Bah90,
 author = "Lalit R. Bahl and Peter F. Brown and Peter V. de
 Souza and Robert L. Mercer",
 editor = "Alex Waibel and Kai-Fu Lee",
 title = "A Tree-Based Statistical Language Model for
 Natural Language Speech Recognition",
 booktitle = "Readings in Speech Recognition",
 publisher = "Morgan Kaufmann, San Mateo, {CA}",
 pages = "507-514",
 year = "1990")

 ------------------------------------------------------------------------------

C. PERSONS WITH EXPERIENCE WITH CART AND SPEECH PROBLEMS

Below, I listed some people of whom others said they might have experience
with, or have more information on the subject.

Leo Breiman <leostat.berkeley.edu>
Trevor Hastie <trevorresearch.att.com>
William Meisel, TMA Associates, Encino, CA. tel (818) 708-0962
Mike Riley <rileyresearch.att.com>
Mail to author|Respond to list|Read more issues|LINGUIST home page|Top of issue