LINGUIST List 4.430

Fri 04 Jun 1993

Sum: Finite State Machines from Corpora

Editor for this issue: <>


Directory

  1. James Tauber, Deriving FSMs from corpus of well-formed strings

Message 1: Deriving FSMs from corpus of well-formed strings

Date: Fri, 4 Jun 1993 16:10:17 +Deriving FSMs from corpus of well-formed strings
From: James Tauber <jtaubertartarus.uwa.edu.au>
Subject: Deriving FSMs from corpus of well-formed strings

 A few days ago, I inquired about deriving finite-state machines from
 corpora of well-formed strings. My particular interest was as an aid
 to developing grammars from part-of-speech tagged text.

 Thank you to all those who responded. By combining the information given
 by each person and looking up the references given, I was able, over the
 last couple of days, to write software to do exactly what I wanted.

 While the more general approach is to derive a non-deterministic finite
 state machine from the corpus, convert it to an equivalent deterministic
 machine and then minimize it, I found this impossible in my situation.
 The standard algorithm for converting an NFSM to a DFSM results in the
 number of states in the DFSM being 2 to the power of the number of states
 in the NFSM. This leads to ridiculous numbers in the order of 10^500 in my
 application.

 I got around this by deriving the DFSM straight from the corpus. I then wrote
 a program that calculates the least-state equivalent DFSM.

 If anyone is interested in the programs, they can e-mail me. At the moment
 they are in Borland C++. If anyone is interested in helping do a Unix-C, LISP
 or Prolog port, please contact me. If you just want MS-DOS executables I can
 also provide those.

 The next step is writing tools to find recurring sub-networks in the finite
 state transition network and hence derive RTNs.

 James Tauber
 jtaubertartarus.uwa.edu.au
Mail to author|Respond to list|Read more issues|LINGUIST home page|Top of issue