SPRAAK
SPRAAK has a number of tools to create finite state grammars. The preferred format for developing an application is the "Wirth Syntax Notation" (WSN). It is a highly readable format similar to BNF and supports weighted rules (these are weights rather than probabilities, as the weights between 2 nodes are not forced to add up to 1.0). With the WSN Grammar Compiler this representation can be transformed into a weighted finite state transducer (WFST). The internals of the compiler are described in WSN Compiler - Internals. Finally, there is a small utility that converts the WFST into a more compact form, which is used by the SPRAAK decoder.
All files are editable ASCII files, which makes it easy to develop or modify a grammar in any of the formats.
Hence, development will typically comprise the following steps, shown here for the 'euros.wsn' grammar, which describes euro amounts (including cents) from 0 to 100:
> wsn2fsm.pl euros.wsn > euros.fsm
> spr_fsn2spr_fsg.py -i euros.fsm -o euros.fsg
These files can be found in the directory
$SPR_HOME/examples/fsg
Remark: The WSN compiler is written in Perl and uses a different style of command line arguments than the other programs in the SPRAAK package.
The file format and supported rules of the WSN notation are described with the compiler (cf. WSN Grammar Compiler).
A .fsm file contains lines of the format
ARC-NR START_NODE END_NODE LABEL WEIGHT(LN)
Remark: these are finite state machines with only a single label per arc.
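The line format above can be read with a few lines of Python. This is only a minimal sketch, not part of SPRAAK: it assumes that the five fields are separated by whitespace, that arc and node numbers are integers, and that labels contain no whitespace. The names FsmArc and parse_fsm_line, as well as the sample line, are hypothetical.

```python
from typing import NamedTuple


class FsmArc(NamedTuple):
    """One arc of a .fsm file (hypothetical helper type)."""
    arc_nr: int
    start_node: int
    end_node: int
    label: str
    weight_ln: float  # weight as a natural logarithm


def parse_fsm_line(line: str) -> FsmArc:
    """Parse 'ARC-NR START_NODE END_NODE LABEL WEIGHT(LN)'.

    Assumes whitespace-separated fields; this is an assumption,
    not a documented property of the format.
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    arc_nr, start, end, label, weight = fields
    return FsmArc(int(arc_nr), int(start), int(end), label, float(weight))


# Made-up sample line, not taken from euros.fsm
arc = parse_fsm_line("0 0 1 one -0.693")
```

Because each arc carries exactly one label, a single string field is enough here; no separate input and output symbols are needed.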
A .fsg file contains the following types of lines
[FSG]                  magic first line
name XXX               grammar name
Nstate XXX             number of states
Narc XXX               number of arcs
accept one two ...     known terminals (words that should be defined as well in the lexicon)
end ....               specifications about sentence_end
arc START_NODE END_NODE1 INPUT1 WEIGHT1(LOG10) OUTPUT1 END_NODE2 LABEL2 WEIGHT2 ...
                       all arcs leaving from node START_NODE
Contrary to the .fsm format, the .fsg format is more general: it supports full finite state transducers with INPUT and OUTPUT symbols. When mapping from .fsm to .fsg, the output symbols will be epsilon symbols (i.e. '[]').
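The mapping from single-label arcs to transducer arcs can be illustrated as follows. This is a rough sketch and not the actual converter: it assumes each destination is written as END_NODE INPUT WEIGHT OUTPUT, that fields are separated by single spaces, and that weights are passed through unchanged. The function name and the sample arcs are hypothetical, and the exact layout accepted by the decoder may differ.

```python
from collections import defaultdict

EPSILON = "[]"  # epsilon output symbol, as described in the text


def fsm_arcs_to_fsg_arc_lines(arcs):
    """Group (start, end, label, weight) arcs by start node.

    Returns one 'arc ...' line per start node; every arc gets the
    epsilon symbol as output. Field order and spacing are assumptions.
    """
    by_start = defaultdict(list)
    for start, end, label, weight in arcs:
        by_start[start].append((end, label, weight))

    lines = []
    for start in sorted(by_start):
        parts = ["arc", str(start)]
        for end, label, weight in by_start[start]:
            parts += [str(end), label, f"{weight:g}", EPSILON]
        lines.append(" ".join(parts))
    return lines


# Made-up arcs for illustration only
sample = [(0, 1, "one", -0.5), (0, 2, "two", -1.0), (1, 2, "euro", 0.0)]
arc_lines = fsm_arcs_to_fsg_arc_lines(sample)
```

The point of the sketch is the grouping: a .fsm file lists one arc per line, whereas a .fsg 'arc' line collects all arcs leaving the same start node.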
See also cwr_lm_fsg.c
The conversion from .fsm to .fsg does not perform the intended conversion from 'ln' to 'log10' probabilities, so the weights in the .fsg file are still natural logarithms. This need not affect applications, as the two scales differ only by a constant factor (ln(10), about 2.3026) and the relative ranking within the language model scores therefore remains consistent. However, one needs to be careful when combining these scores with acoustic models or when interpreting language model scores.
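If log10 weights are needed, the rescaling follows from the identity log10(x) = ln(x) / ln(10). The snippet below is a small sketch of that arithmetic only; the function name is hypothetical and it is not part of the SPRAAK tools.

```python
import math

LN10 = math.log(10.0)  # about 2.302585


def ln_to_log10(weight_ln: float) -> float:
    """Rescale a natural-log weight to a base-10 log weight."""
    return weight_ln / LN10


# A weight of 0.5 expressed in both scales
w_ln = math.log(0.5)
w_log10 = ln_to_log10(w_ln)
```

Since the factor is the same for every arc, applying it changes the magnitude of the scores but not their ordering.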