SPRAAK
spr_lm_arpabo.c File Reference

Convert arpa-style N-grams to and from the binary format used in SPRAAK.

Detailed Description


spr_lm_arpabo [-R](flag: bin->arpa) [-FIX how](no) [-i input LM](stdin) [-o output LM](stdout)
    [-c build options] [-rf LM-load options](check_lvl=2) [-l level](0)
Parameters
-R (flag: bin->arpa)
Reverse conversion, i.e. binary to arpabo format
-FIX how (no)
Try to fix certain problems. The extra 'how' argument can be: 'no' (do not fix or check anything), 'check' (check for problems but do not solve them), 'rm' (remove all problematic N-grams), or 'add' (add the missing lower-level N-grams; this option may require that the input file is read twice)
-i input LM (stdin)
The input LM (default is arpabo format; binary for reverse conversion)
-o output LM (stdout)
The output LM (default is binary format; arpabo for reverse conversion)
-c build options
A file containing extra options (see the man-pages for more details).
-rf LM-load options (check_lvl=2)
Flags used to load the LM when doing the reverse conversion.
-l level (0)
Reverse conversion; print only a single level of the N-gram (e.g. all 3-gram probs in a 5-gram)
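
As an illustration of the parameters above (the LM file names are hypothetical; the tool is assumed to be on the PATH), typical invocations might look like:

```
# arpa-style -> binary, adding any missing lower-level N-grams
spr_lm_arpabo -FIX add -i lm_3g.arpa -o lm_3g.bin

# reverse conversion: binary -> arpabo format
spr_lm_arpabo -R -i lm_3g.bin -o lm_3g.arpa

# reverse conversion, printing only the 3-gram probabilities of a 5-gram LM
spr_lm_arpabo -R -l 3 -i lm_5g.bin -o lm_5g_3gram.arpa
```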

Convert the N-gram representation between the arpa-style format and the compact binary format used by the SPRAAK system.

When converting from the arpa-style format to the binary format, a file with extra options can be specified (the -c flag). The possible options are:


[Options]
pquant <quant_precision>(10000.0)
Quantize the probabilities with a precision of 1.0/<quant_precision>
Nlmc <estimated_nr_of_LM_contexts>
Initial size for the LM-context hash table. The automatic estimate tends to be too high, so providing a good initial value may save some memory during the conversion. When specified, this value must not be less than the actual number of LM-contexts.
pct_init <percentage>(2.0)
Start with an LM-context hash table <percentage> percent larger than the minimal size.
pct_add <percentage>
Enlarge the LM-context hash table by <percentage> percent whenever no perfect hash table could be made.
max_prob <max_prob(log10)>(0.0) [msg_lvl](0)
Clip all probabilities larger than <max_prob> (log10) to <max_prob>. Report this change at message level <msg_lvl>.
min_prob <min_prob(log10)>(-Inf) [msg_lvl](0)
Clip all probabilities smaller than <min_prob> (log10) to <min_prob>. Report this change at message level <msg_lvl>.
max_disc <max_discount_frac(log10)>(10.0) [msg_lvl](0)
Clip all discount fractions larger than <max_discount_frac> (log10) to <max_discount_frac>. Report this change at message level <msg_lvl>.
min_disc <min_discount_frac(log10)>(-Inf) [msg_lvl](0)
Clip all discount fractions smaller than <min_discount_frac> (log10) to <min_discount_frac>. Report this change at message level <msg_lvl>.
<lprob/ldisc> offs <prob_scale_factor(log10)> <words> ...
Scale the (discount) probability for the given words with a factor <prob_scale_factor> (log10). Since all operations are done in the log10 domain, the scaling is an addition.
<lprob/ldisc> fac <prob_power(log10)> <words> ...
Raise the (discount) probability for the given words to the power <prob_power>. Since all operations are done in the log10 domain, raising to a power corresponds to a multiplication (scaling) of the log-probability.
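
As an illustration (the values and the word list are hypothetical), a build-options file passed via -c could look like the sketch below. Note that, because everything is expressed in the log10 domain, the offs value log10(0.5) ≈ -0.30103 halves the probability of the listed words:

```
[Options]
pquant 10000.0
Nlmc 2500000
pct_init 2.0
pct_add 1.0
max_prob -0.001 1
min_prob -9.0 1
lprob offs -0.30103 <UNK>
```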

Date
05/05/1997
Author
Kris Demuynck
Revision History:
17/07/2002 - KD
new binary format (more compact + faster)
07/08/2008 - KD
add options to automatically fix certain problems