# LINGUIST List 4.430

## Fri 04 Jun 1993

## Sum: Finite State Machines from Corpora

**Editor for this issue:** <>

## Directory

- James Tauber, Deriving FSMs from corpus of well-formed strings

### Message 1: Deriving FSMs from corpus of well-formed strings

**Date:** Fri, 4 Jun 1993 16:10:17 +Deriving FSMs from corpus of well-formed strings

**From:** James Tauber <jtaubertartarus.uwa.edu.au>

**Subject:** Deriving FSMs from corpus of well-formed strings

A few days ago, I inquired about deriving finite-state machines from
corpora of well-formed strings. My particular interest was as an aid
to developing grammars from part-of-speech tagged text.
Thank you to all those who responded. By combining the information given
by each person and looking up the references given, I was able, over the
last couple of days, to write software to do exactly what I wanted.
While the more general approach is to derive a non-deterministic finite
state machine from the corpus, convert it to an equivalent deterministic
machine and then minimize it, I found this impossible in my situation.
The standard algorithm for converting an NFSM to a DFSM results in the
number of states in the DFSM being 2 to the power of the number of states
in the NFSM. This leads to ridiculous numbers in the order of 10^500 in my
application.
I got around this by deriving the DFSM straight from the corpus. I then wrote
a program that calculates the least-state equivalent DFSM.
If anyone is interested in the programs, they can e-mail me. At the moment
they are in Borland C++. If anyone is interested in helping do a Unix-C, LISP
or Prolog port, please contact me. If you just want MS-DOS executables I can
also provide those.
The next step is writing tools to find recurring sub-networks in the finite
state transition network and hence derive RTNs.
James Tauber
jtaubertartarus.uwa.edu.au

Mail to author|Respond to list|Read more issues|LINGUIST home page|Top of issue