SPRAAK
cwr_lm_fsg.c File Reference

Finite state grammar. More...

Data Structures

struct  SprCwrFsgWarc
 hash table entry: {state,wid} -> this struct More...
 
struct  SprCwrFsgMarc
 list of sub-models for given FSG-node More...
 
union  _Union1_CWR_LM_FSG_
 
struct  SprCwrFsgSmwh
 hash table entry: {submod,wid} -> this struct More...
 
struct  SprCwrFsgState
 
struct  spr_t_cwr_fsg_sm_hdr
 common header for the sub-model structures More...
 
struct  spr_t_cwr_fsg_fsg
 a FSG (sub)model More...
 
struct  spr_t_cwr_fsg_ngram
 an N-gram sub-model More...
 
struct  spr_t_cwr_fsg_elm
 a sub-model (other type of LM) More...
 
union  _Union2_CWR_LM_FSG_
 
struct  SprCwrFsgSm
 a sub-model More...
 
struct  SprCwrFsg
 
struct  SprCwrFsgSmFsgi
 interface to a FSG sub-model More...
 
struct  SprCwrFsgSmElmi
 interface to a sub-model (other type of LM) More...
 

Functions

SprCwrFsg * spr_cwr_fsg_free (SprCwrFsg *lm)
 Free all data corresponding to the LM. Return NULL. More...
 
int spr_cwr_fsg_uninstall (SprCwrLMInterface *lmi, SprCwrLMInstance *lm_data)
 
int spr_cwr_fsg_write (SprStream *fd, const SprCwrFsg *lm)
 
int spr_cwr_fsg_install (SprCwrLMInterface *lmi, SprCwrLMInstance *lm_data, unsigned int sz_lm_data, const SprCwrFsg *lm, const SprCwrSLex *lex)
 
SprCwrFsg * spr_cwr_fsg_read (SprCwrFsg *dst, const char *fname, const char *options)
 

Variables

const int spr_cwr_lm_fsg_offs_sm_id_main
 needed since we don't have descent object yet More...
 
const SprCmdOptDesc spr_cwr_od_fsg []
 
const SprCwrSRMDesc spr_cwr_srm_desc_lm_fsg
 

Detailed Description

Finite state grammar.

Implementation of a finite state grammar (FSG) with sub-models. The implementation is optimized for speed at the cost of memory; hence one should not try to encode a large N-gram as an FSG using this implementation.

An observation in a given state is evaluated as follows:

Remarks on eps-arcs
Eps-arcs can introduce a substantial overhead. Avoid eps-arcs if possible. If eps-arcs are needed, try to adhere to the following guidelines:
  • when two or more states are connected with eps-arcs, only one state (or none) should be an end-state; a simple technique to fulfill this guideline is to make sure that source states with eps-arcs are not end-states, by using an eps-arc referring to a global end-state instead
  • when two or more states are connected with eps-arcs, only one state (or none) should have a fall-back state
  • avoid chains of eps-arcs; or limit the depth using the <eps_lvl_depth> or <prob_clip> configuration parameters
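As an illustration of the first guideline: suppose states 3 and 4 are connected by eps-arcs and both could terminate the model. Instead of marking both as end-states, give each an eps-arc (word '[]', cost 0) to a single global end-state, here state 9, which becomes the only end-state. The syntax follows the [FSG_file_format] section below; the state numbers and costs are illustrative:

```
arc 3 9 [] 0.0 []
arc 4 9 [] 0.0 []
end 0.0 NO_FIT 9
```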
Remark on sub-models
Sub-models introduce a substantial overhead. For optimal performance: combine the FSGs into one monolithic FSG and minimize that.
Remark on end-states and end conditions
Avoid the TRY condition, as it may introduce a large number of alternatives.

[FSG_file_format]
name <str>
Give the FSG a name.
Nstate <Nstate>
The FSG uses <Nstate> states (may be overestimated by a small fraction).
Narc <Narc>
The FSG uses approximately <Narc> (word) arcs; the value does not have to be exact and should only count the arcs that observe words; arcs that start sub-models should not be counted.
fail_cost <cost>
If the incoming word cannot be explained at all, then assign a cost <cost> (and stay in the same state).
accept <word/submod> ...
Define a list of words the FSG (and its sub-models) can use as input symbols, i.e. only these words/sub-models may be observed on an arc.
output <word/submod> ...
Define the words that can be output by the FSG.
arc <from> <to> <word/submod> <cost> <symb_out>([]) [to...] ...
Define one or more arcs starting from node <from>. Each arc goes from node <from> to node <to> while observing word <word> (use '[]' for an epsilon arc) and with a cost <cost>. Optionally, there can also be an output symbol (no output is indicated by []).
end <ecost> <TRY/NO_END/NO_FIT/OOV> <state_nr> ...
Signal that the sub-model can end in the given set of states with a given extra cost and when a certain condition is met:
OOV if the observed word is an OOV word;
NO_FIT if the observed word does not fit the current state;
TRY try to both stop the sub-model and continue it (results in two hypotheses to investigate).
fb_state <to_state> <cost> <state_nr> ...
If the incoming word cannot be observed in state(s) <state_nr>, then fall back to state <to_state> with an extra cost <cost>.
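Putting the keywords above together, a minimal grammar file might look as follows. The word list, costs, and state numbering are illustrative, and details such as whitespace handling may differ from the actual format. The grammar accepts "one" or "two" followed by "three", ending in state 2:

```
name digits
Nstate 3
Narc 3
fail_cost 10.0
accept one two three
output one two three
arc 0 1 one 0.5 []
arc 0 1 two 0.7 []
arc 1 2 three 0.3 []
end 0.0 NO_FIT 2
```

Note that Narc counts only the three word arcs; per the description above, eps-arcs and sub-model arcs would not be included.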

[Load_flags]
check_lvl=<I32:val>(0)
Do some checking to see if there are errors in the LM, the level controls the output: (0) no checking, (1) give error count, (2) give detailed error info.
closed=<I32:val>(1)
The FSG has a closed vocabulary; report open-vocabulary words.
max_eps_arc=<I32:val>(1)
Limit the number of consecutive eps-arcs to <val>.
max_arc=<I32:val>(1024)
Specify the maximum number of arcs that will match for any given word; calculating this from the FSG is slow and hence for practical reasons a conservative upper bound is used. Specify -1 to calculate the real number.
prob_clip=<F32:val>(-999.9)
Discard options when the probability is too low.
FWDstate=<I32:val>(-1)
Use the probabilities in the specified state as static LM smearing probs. By default max(P(w|*)) is used as the static LM smearing prob (excluding submod-end and eps-arc transitions).
ext=<String:val>(NULL)
A file containing word extensions to the base lexicon.
wdsf=<String:val>(NULL)
A file containing word dependent LM scale factors.
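For example, a load-flag string enabling detailed checking, a deeper eps-arc limit, and probability clipping might look as follows; the values and the use of whitespace as flag separator are assumptions, not confirmed by this page:

```
check_lvl=2 max_eps_arc=2 prob_clip=-50.0
```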
Date
March 2004
Author
Kris Demuynck
Revision History:
15/03/04 - KD
implemented based on cwr_lm_template and cwr_lm_std.
25/11/04 - KD
added eps-arcs + fixed some bugs
20/04/05 - KD
fixed some bugs, made it more general
Bug:
The implementation is not optimal in some circumstances, e.g. when reaching a sure end of a sub-model: the current implementation stays in the sub-model. Closing the sub-model would reduce the number of open LM-contexts and hence allow for more recombination in the search.