SPRAAK
Finite State Grammars

SPRAAK has a number of tools to create finite state grammars. The preferred format for developing an application is the "Wirth Syntax Notation" (WSN). It is a highly readable format similar to BNF and has support for probabilistic rules (these are weights rather than probabilities, as the weights between 2 nodes are not forced to add up to 1.0). With the WSN Grammar Compiler this representation may be transformed into a weighted finite state transducer (WFST). The internals of the compiler are described in WSN Compiler - Internals. Finally, there is a small utility that converts the WFST into a more compact form which is used by the SPRAAK decoder.

All files are editable ASCII files, which makes it easy to develop or modify a grammar in any of the formats.

Hence development will typically comprise the following steps (shown here for the 'euros.wsn' grammar, which covers euro amounts, including cents, from 0 to 100):

> wsn2fsm.pl   euros.wsn > euros.fsm
> spr_fsn2spr_fsg.py -i euros.fsm  -o euros.fsg

These files can be found in the directory

$SPR_HOME/examples/fsg

Tools

Remark: The WSN compiler is written in Perl and uses a different style of command line arguments than the other programs in the SPRAAK package.

File Formats

Wirth Syntax Notation (.wsn)

The file format and the supported rules are described in the documentation of the WSN Grammar Compiler.

Weighted Finite State Machines (.fsm)

A .fsm file contains lines of the format

ARC-NR  START_NODE  END_NODE  LABEL  WEIGHT(LN)

Note that these are finite state machines with only a single label per arc.
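
As an illustration of the line format above, the following minimal Python sketch reads .fsm lines into records. It is not part of SPRAAK. It assumes that fields are separated by whitespace, that the arc number and node identifiers are integers, and that every non-empty line carries all five fields; none of this is stated by the format description, and the names FsmArc, parse_fsm_line and parse_fsm are hypothetical.

```python
from typing import List, NamedTuple


class FsmArc(NamedTuple):
    """One line of a .fsm file (hypothetical record type, not a SPRAAK API)."""
    arc_nr: int
    start_node: int
    end_node: int
    label: str
    weight_ln: float  # weight on a natural-logarithm scale


def parse_fsm_line(line: str) -> FsmArc:
    """Parse 'ARC-NR START_NODE END_NODE LABEL WEIGHT(LN)'.

    Assumption: fields are whitespace separated and all five are present.
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    arc_nr, start, end, label, weight = fields
    return FsmArc(int(arc_nr), int(start), int(end), label, float(weight))


def parse_fsm(text: str) -> List[FsmArc]:
    """Parse all non-empty lines of a .fsm file given as a string."""
    return [parse_fsm_line(line) for line in text.splitlines() if line.strip()]
```

For example, parse_fsm("0 0 1 one -0.5") yields a single arc from node 0 to node 1 with the label 'one' (the sample line is made up for illustration and is not taken from euros.fsm).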

Finite State Grammar (.fsg)
(SPRAAK)

A .fsg file contains the following types of lines

[FSG]                   magic first line
name XXX                grammar name
Nstate XXX              number of states
Narc XXX                number of arcs
accept one two ...      known terminals (words that should also be defined in the lexicon)
end ....                specifications about sentence_end
arc START_NODE END_NODE1 INPUT1 WEIGHT1(LOG10) OUTPUT1 END_NODE2 LABEL2 WEIGHT2 ...    all arcs leaving from node START_NODE

In contrast to the .fsm format, the .fsg format is more general: it supports full finite state transducers with INPUT and OUTPUT symbols. When mapping from .fsm to .fsg, the output symbols will be epsilon symbols (i.e. '[]').
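
The mapping from single-label arcs to 'arc' lines can be sketched as follows. This is an illustration only, not the code of the actual converter: it assumes that each arc is written as END_NODE INPUT WEIGHT OUTPUT (the order of the first arc in the template above), that the .fsm label becomes the input symbol, and it copies the weights unchanged. The header lines (name, Nstate, Narc, accept, end) are not produced, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

EPSILON = "[]"  # epsilon output symbol, as used when mapping .fsm to .fsg

# (start_node, end_node, label, weight); a hypothetical in-memory form of .fsm arcs
Arc = Tuple[int, int, str, float]


def fsm_arcs_to_fsg_arc_lines(arcs: Iterable[Arc]) -> List[str]:
    """Group arcs by start node and emit one 'arc' line per start node.

    Assumption: the per-arc field order is END_NODE INPUT WEIGHT OUTPUT.
    """
    by_start = defaultdict(list)
    for start, end, label, weight in arcs:
        by_start[start].append((end, label, weight))

    lines = []
    for start in sorted(by_start):
        parts = ["arc", str(start)]
        for end, label, weight in by_start[start]:
            # The .fsm label is used as input symbol; the output is epsilon.
            parts += [str(end), label, f"{weight:g}", EPSILON]
        lines.append(" ".join(parts))
    return lines
```

All arcs that leave the same node end up on one line, which is what makes the .fsg form more compact than one line per arc.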

See also cwr_lm_fsg.c

Bugs, Limitations

The conversion from .fsm to .fsg does not perform the intended conversion from 'ln' to 'log10' probabilities. This does not necessarily affect applications, as the relative ranking within the language model scores remains consistent. However, one needs to be careful when combining the grammar with acoustic models or when interpreting language model scores.
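
If the weights need to be on a log10 scale, the rescaling itself is a division by ln(10), since log10(x) = ln(x) / ln(10). The helper below is a minimal sketch of that arithmetic. It is not part of SPRAAK, its name is hypothetical, and it assumes that the weights in question really are on a natural-logarithm scale.

```python
import math


def ln_to_log10(weight_ln: float) -> float:
    """Rescale a natural-log weight to base 10: log10(x) = ln(x) / ln(10)."""
    return weight_ln / math.log(10.0)
```

Because this is a multiplication by a positive constant, it preserves the ranking of the weights, which is why the relative ranking within the language model is unaffected by the missing conversion.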