SPRAAK
Language Model Scaling

Combining Acoustic Model and Language Model scores is not as trivial as textbook formulas may suggest. This is mainly due to imperfections in both models: the Acoustic Model typically uses far more dimensions than strictly needed (e.g. 39 instead of 10), leading to a gross underestimation of the true acoustic model probabilities; in effect it roughly looks as if the probabilities are raised to the 4th power. The language model probabilities, on the other hand, suffer from intrinsic sparsity problems.

The proper correction to be applied to the acoustic model and/or language model is hard to derive in theory, and depends on a complex mix of the algorithms and databases used. Hence it is common practice to derive optimal scaling factors with a grid search that minimizes the error rate on an independent development set.

There are two parameters that can be set:

cost_C          word startup cost: cost that is added whenever a new word is started
cost_A          scaling factor of the LM vs. the AM

Hence each recognized word contributes the following score:

cost_C + acoustic_model_score + cost_A*language_model_score

in which all scores are expressed in 'log10'.
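The scoring rule above can be sketched as a small helper function; the default values for cost_C and cost_A below are merely illustrative, not SPRAAK defaults:

```python
import math

def word_score(acoustic_model_score, language_model_score,
               cost_C=-4.0, cost_A=3.0):
    """Combined score contributed by one recognized word.

    All scores are log10 probabilities, so the weighted sum below
    corresponds to multiplying probabilities, with the LM probability
    raised to the power cost_A and a per-word startup penalty cost_C.
    """
    return cost_C + acoustic_model_score + cost_A * language_model_score

# Example: AM score of -20 (log10) and an LM probability of 0.01
score = word_score(-20.0, math.log10(0.01))
```

Note that because all scores live in the log10 domain, the word startup cost acts multiplicatively on the word's probability, which is why it can trade off insertions against deletions.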

The language model weight largely implements the necessary scaling described above, while the word startup cost tries to strike an optimal balance between insertions and deletions. Typical values also differ for phoneme vs. word recognition. With rather standard feature extraction (e.g. 39D mel-scaled coefficients) we have derived optimal values in the following ranges:

for the TIMIT database (phoneme recognition):
cost_C    [-2.0 ...  1.0]
cost_A    [ 1.0 ...  2.5]
for the WSJ database (sentence recognition):
cost_C    [-8.0 ... -2.0]
cost_A    [ 2.0 ...  4.0]
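The grid search mentioned above can be sketched as follows. The `decode` callable is a hypothetical stand-in for running the recognizer on the development set with a given parameter pair and returning its word (or phoneme) error rate; it is not part of the SPRAAK API:

```python
import itertools

def tune_costs(decode, dev_set, cost_C_values, cost_A_values):
    """Exhaustive grid search for the (cost_C, cost_A) pair that
    minimizes the error rate reported by `decode` on `dev_set`.

    decode(dev_set, cost_C, cost_A) -> error rate (a float).
    """
    best = None
    for cost_C, cost_A in itertools.product(cost_C_values, cost_A_values):
        err = decode(dev_set, cost_C, cost_A)
        if best is None or err < best[0]:
            best = (err, cost_C, cost_A)
    return best[1], best[2]
```

In practice one would first scan a coarse grid over the ranges listed above, then refine around the best point; the error-rate surface is usually smooth enough that a coarse-to-fine search suffices.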