Editor for this issue: <>
Some days ago I posted a query on CART, and I promised to post a summary. Here it is. I received responses from the following people: Mike Anderson, Lee Applebaum, Leo Breiman, John Choi, Barrett P. Eynon, Jonathan Foote, Georg Fuellen, Mark Irwin, Peter G. Hamer, Trevor Hastie, Wei-Yin Loh, Richard Olshen, Charles Roosen, Ajay Shah, William D. Shannon, Charles J. Stone, Joerg Ueberla. Thanks *very* much to all. The person to contact for the original CART software is not Richard Olshen but Charles J. Stone <stoneMail to author|Respond to list|Read more issues|LINGUIST home page|Top of issuegandalf.Berkeley.EDU>. CART is distributed for UNIX, VMS, CMS and MVS--the academic price is $500. SYSTAT Corporation distributes a PC version of CART. They can be reached at SYSTAT, Inc., 1800 Sherman Avenue, Evanston, IL 60201, USA. Phone: (708) 864-5670, FAX: (708) 492-3567. ------------------------------------------------------------------------------ A. OTHER CART SOFTWARE 0. SPLUS Some of the CART has been implemented in recent versions fo the SPLUS language (a statistical language that is very popular in many math/stat departments). References to a couple of papers on using CART for language modeling in speech recognition appear below. [ another remark on SPLUS, don't know if it's the same software: ] Look into S-Plus from StatSci located in Seattle. Their implementation has some slight differences, but they have an extraordinary interactive graphical environment. [ another remark on SPLUS, don't know if it's the same software: ] I took a class last fall with Leo Breiman on this topic, and I've used both the original CART implementation and the Splus version called tree(). I prefer the CART program, but the Splus version is workable - it is probably best used with theprune.tree() and cross-validation options, this is the closest to what CART does. The book "Statistical Models in S"" has a discussion of the varioustree functions and their implementation. 1. C4.5: C source code for a new, improved decision tree algorithm known as C4.5 is in the new book by Ross Quinlan (of ID3 fame). "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1992. It goes for $44.95. With accompanying software on magnetic media it runs for $69.95. ISBN # 1-55860-238-0 2. FACT: There is a fortran program called FACT that is similar to CART. It can be ftp'd from lib.stat.cmu.edu (login with "statlib" with your email address as password) under the "general" subdirectory. There is also a bried user guide for FACT. 3. IND Version 2.1 - creation and manipulation of decision trees from data: A common approach to supervised classification and prediction in artificial intelligence and statistical pattern recognition is the use of decision trees. A tree is "grown" from data using a recursive partitioning algorithm to create a tree which (hopefully) has good prediction of classes on new data. Standard algorithms are CART (by Breiman, Friedman, Olshen and Stone) and Id3 and its successor C4.5 (by Quinlan). More recent techniques are Buntine's smoothing and option trees, Wallace and Patrick's MML method, and Oliver and Wallace's MML decision graphs which extend the tree representation to graphs. IND reimplements and integrates these methods. The newer methods produce more accurate class probability estimates that are important in applications like diagnosis. IND is applicable to most data sets consisting of independent instances, each described by a fixed length vector of attribute values. An attribute value may be a number, one of a set of attribute specific symbols, or omitted. One of the attributes is delegated the "target" and IND grows trees to predict the target. Prediction can then be done on new data or the decision tree printed out for inspection. IND provides a range of features and styles with convenience for the casual user as well as fine-tuning for the advanced user or those interested in research. Advanced features allow more extensive search, interactive control and display of tree growing, and Bayesian and MML algorithms for tree pruning and smoothing. These often produce more accurate class probability estimates at the leaves. IND also comes with a comprehensive experimental control suite. IND consist of four basic kinds of routines; data manipulation routines, tree generation routines, tree testing routines, and tree display routines. The data manipulation routines are used to partition a single large data set into smaller training and test sets. The generation routines are used to build classifiers. The test routines are used to evaluate classifiers and to classify data using a classifier. And the display routines are used to display classifiers in various formats. IND is written in K&R C, with controlling scripts in the "csh" shell of UNIX, and extensive UNIX man entries. It is designed to be used on any UNIX system, although it has only been thoroughly tested on SUN platforms. Assistence with porting to other machines will be given in some cases. IND comes with a manual giving a guide to tree methods and pointers to the literature, and several companion documents. IND Version 2.0 will shortly be available through NASA's COSMIC facility. IND Version 2.1 is available strictly as unsupported beta-test software. If you're interested in obtaining a beta-test copy, with no obligation on your part to provide feedback, contact Wray Buntine NASA Ames Research Center Mail Stop 269-2 Moffett Field, CA, 94035 email: wray
kronos.arc.nasa.gov Unfortunately, the beta-test version is not available for overseas. This is standard NASA policy. Version 2.0, however, should be available soon at a modest price from NASAs COSMIC center in Georgia, USA. Enquiries should be directed to: mail (to customer support): service
cossack.cosmic.uga.edu Phone: (706) 542-3265 and ask for customer support FAX: (706) 542-4807. 4. ADS software: I saw your posting just as I was going to post the same query. We, at ADS (now a div. of Booz-Allen & Hamilton) have a homebuilt implementation that is difficult to use. We are using CART in document (free-form or formatted text) profiling experiments; actually, we use it in conjunction with other, including our own, text analysis software. I anxiously await hearing your responses. Lee Appelbaum <lee
ads.com> 4. Knowledge Seeker: Telephone number to contact for "Commercial CART package": The number for _Knowledge_Seeker_ is 613 723 8020. 5. General information on numerical software: Have a look at the "Index to free source code in C or C++ for numerical computation". It is the file pub/C-numanal/numcomp-free-c.gz on usc.edu If you don't have access to ftp, send mail to ftpmail
decwrl.dec.com saying "help"; you will be given information needed to do ftp by email. ------------------------------------------------------------------------------ B. REFERENCES ON CART AND SPEECH Following is a list, in various formats, of the references I received. o "Statistical Models in S" by John M. Chambers and Trevor J. Hastie. (description of version of CART in the S language) 1992, ISBN 0-534-16674-0 o a thesis by Jonathan Foote <jtf
lems.Brown.EDU> on using decision trees for vector quantization in an HMM speech recognition system. Included is a section on previous uses of decision trees in speech research. o a dissertation by Pitrelli (Victor Zue's lab at MIT) o Riley, Michael D. (1992). Tree-based modelling of segmental durations. In G. Bailly, et al (eds.), Talking Machines: Theories, Models and Designs. Elsevier Science Publishers. author = {Loh, W.-Y. and Vanichsetakul, N.}, title = {Tree-structured classification via generalized discriminant analysis (with discussion)}, journal = {Journal of the American Statistical Association}, year = {1988}, volume = {83}, pages = {715--728}
TechReport(Kuh92, author = "Roland Kuhn and Renato de Mori", title = "Keyword Patterns Generated by String Classification Trees", institution = "School of Computer Science, McGill University", year = "1992", email = "kuhn
crim.ca")
InProceedings(Bla92b, author = "E. Black and F. Jelinek and J. Lafferty and R. Mercer and S. Roukos", title = "Decision Tree Models Applied to the Labeling of Text with Parts-of-Speech", pages = "117-121", booktitle = darpa, organization = "Harriman, New York", year = "February 1992")
InProceedings(Wae92, author = "Nick Waegner and Steve Young", title = "A trellis-based Language Model for Speech Recognition", pages = "245-248", booktitle = "International Conference on Spoken Language Processing", organization = "Banff, Canada", year = "1992")
InCollection(Bah90, author = "Lalit R. Bahl and Peter F. Brown and Peter V. de Souza and Robert L. Mercer", editor = "Alex Waibel and Kai-Fu Lee", title = "A Tree-Based Statistical Language Model for Natural Language Speech Recognition", booktitle = "Readings in Speech Recognition", publisher = "Morgan Kaufmann, San Mateo, {CA}", pages = "507-514", year = "1990") ------------------------------------------------------------------------------ C. PERSONS WITH EXPERIENCE WITH CART AND SPEECH PROBLEMS Below, I listed some people of whom others said they might have experience with, or have more information on the subject. Leo Breiman <leo
stat.berkeley.edu> Trevor Hastie <trevor
research.att.com> William Meisel, TMA Associates, Encino, CA. tel (818) 708-0962 Mike Riley <riley
research.att.com>