Editor for this issue: <>
A few days ago, I inquired about deriving finite-state machines from corpora of well-formed strings. My particular interest was as an aid to developing grammars from part-of-speech tagged text. Thank you to all those who responded. By combining the information given by each person and looking up the references given, I was able, over the last couple of days, to write software to do exactly what I wanted. While the more general approach is to derive a non-deterministic finite state machine from the corpus, convert it to an equivalent deterministic machine and then minimize it, I found this impossible in my situation. The standard algorithm for converting an NFSM to a DFSM results in the number of states in the DFSM being 2 to the power of the number of states in the NFSM. This leads to ridiculous numbers in the order of 10^500 in my application. I got around this by deriving the DFSM straight from the corpus. I then wrote a program that calculates the least-state equivalent DFSM. If anyone is interested in the programs, they can e-mail me. At the moment they are in Borland C++. If anyone is interested in helping do a Unix-C, LISP or Prolog port, please contact me. If you just want MS-DOS executables I can also provide those. The next step is writing tools to find recurring sub-networks in the finite state transition network and hence derive RTNs. James Tauber jtauberMail to author|Respond to list|Read more issues|LINGUIST home page|Top of issuetartarus.uwa.edu.au