SPRAAK
spr_lm_arpabo.c File Reference

Convert arpa-style N-grams to and from the binary format used in SPRAAK.

Detailed Description


spr_lm_arpabo [-R](flag: bin->arpa) [-FIX how](no) [-i input LM](stdin) [-o output LM](stdout)
    [-c build options] [-rf LM-load options](check_lvl=2) [-l level](0)
Parameters
-R (flag: bin->arpa)
Reverse conversion, i.e. binary to arpabo format
-FIX how (no)
Try to fix certain problems. The extra 'how' argument can be: 'no' (do not fix or check anything), 'check' (check for problems but do not solve them), 'rm' (remove all problematic N-grams), or 'add' (add the missing lower-level N-grams; this option may require that the input file is read twice)
-i input LM (stdin)
The input LM (default is arpabo format; binary for reverse conversion)
-o output LM (stdout)
The output LM (default is binary format; arpabo for reverse conversion)
-c build options
A file containing extra options (see the man-pages for more details).
-rf LM-load options (check_lvl=2)
Flags used to load the LM when doing the reverse conversion.
-l level (0)
Reverse conversion; print only a single level of the N-gram (e.g. all 3-gram probs in a 5-gram)
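
As an illustration of the parameters above (the LM file names are hypothetical; the tool is assumed to be on the PATH), typical invocations might look like:

```
# arpa-style -> binary, adding any missing lower-level N-grams
spr_lm_arpabo -FIX add -i lm_3g.arpa -o lm_3g.bin

# reverse conversion: binary -> arpabo format
spr_lm_arpabo -R -i lm_3g.bin -o lm_3g.arpa

# reverse conversion, printing only the 3-gram probabilities of a 5-gram LM
spr_lm_arpabo -R -l 3 -i lm_5g.bin -o lm_5g_3gram.arpa
```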

Convert the N-gram representation between the arpa-style format and the compact binary format used by the SPRAAK system.

When converting from the arpa-style format to the binary format, a file with extra options can be specified (the -c flag). The possible options are:


[Options]
pquant <quant_precision>(10000.0)
Quantize the probabilities with a precision of 1.0/<quant_precision>
Nlmc <estimated_nr_of_LM_contexts>
Initial size for the LM-context hash table. The automatic estimate tends to be too high, so providing a good initial value may save some memory during the conversion. When specified, this value must not be less than the actual number of LM-contexts.
pct_init <percentage>(2.0)
Start with an LM-context hash table <percentage> percent larger than the minimal size.
pct_add <percentage>
Enlarge the LM-context hash table by <percentage> percent whenever no perfect hash table could be made.
max_prob <max_prob(log10)>(0.0) [msg_lvl](0)
Clip all probabilities larger than <max_prob> (log10) to <max_prob>. Report this change at message level <msg_lvl>.
min_prob <min_prob(log10)>(-Inf) [msg_lvl](0)
Clip all probabilities smaller than <min_prob> (log10) to <min_prob>. Report this change at message level <msg_lvl>.
max_disc <max_discount_frac(log10)>(10.0) [msg_lvl](0)
Clip all discount fractions larger than <max_discount_frac> (log10) to <max_discount_frac>. Report this change at message level <msg_lvl>.
min_disc <min_discount_frac(log10)>(-Inf) [msg_lvl](0)
Clip all discount fractions smaller than <min_discount_frac> (log10) to <min_discount_frac>. Report this change at message level <msg_lvl>.
<lprob/ldisc> offs <prob_scale_factor(log10)> <words> ...
Scale the (discount) probability for the given words with a factor <prob_scale_factor> (log10). Since all operations are done in the log10 domain, the scaling is an addition.
<lprob/ldisc> fac <prob_power(log10)> <words> ...
Raise the (discount) probability for the given words to the power <prob_power>. Since all operations are done in the log10 domain, raising to a power corresponds to a multiplication (scaling) of the log-probability.
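
As an illustration (the values and the word list are hypothetical), a build-options file passed via -c could look like the sketch below. Note that, because everything is expressed in the log10 domain, the offs value log10(0.5) ≈ -0.30103 halves the probability of the listed words:

```
[Options]
pquant 10000.0
Nlmc 2500000
pct_init 2.0
pct_add 1.0
max_prob -0.001 1
min_prob -9.0 1
lprob offs -0.30103 <UNK>
```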

Date
05/05/1997
Author
Kris Demuynck
Revision History:
17/07/2002 - KD
new binary format (more compact + faster)
07/08/2008 - KD
add options to automatically fix certain problems