Recognizing and Aligning with SPRAAK

GOAL of this TUTORIAL

In this tutorial we provide you with a reference HMM, trained on the TIMIT database. If not done already, first get acquainted with the linguistic resources that are required to work with a Hidden Markov Model by reading Introduction to SPRAAK concepts, conventions and file structures. Then we show you how feature extraction is done in SPRAAK. After that everything is in place to run a small evaluation on the provided reference model. Finally we use the same models to compute a Viterbi alignment on a small corpus.

Feature Extraction

Speech recognizers do not work directly on sampled data. The great majority of speech recognition systems work with features extracted from the data at regular time intervals. SPRAAK has its own scripting language with which you can specify all of the most popular feature extraction schemes without the need for programming. The file 'mfcc39.ssp' specifies 39-dimensional mel cepstral features (including first and second order derivatives and sentence-based mean normalization), a common baseline for comparative experiments. Without needing to understand all the details, you can see what steps are performed in the signal processing demo script:

For an overview of all signal processing modules in SPRAAK, see Preprocessing modules
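As a rough illustration only (plain Python/NumPy, not the SPRAAK signal processing modules and not the contents of 'mfcc39.ssp'), the sketch below shows the kind of steps such a chain typically performs: pre-emphasis, windowing, a mel filterbank, cepstra via a DCT, sentence-based mean normalization and first/second order derivatives. All function names and parameter values are assumptions chosen for the example.

# Conceptual sketch (NOT the SPRAAK pipeline): 39-dim MFCC features with
# first/second order derivatives and sentence-based mean normalization.
import numpy as np

def mel(f):
    """Hz -> mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filt, n_fft, sr):
    """Triangular mel filterbank matrix of shape (n_filt, n_fft//2+1)."""
    edges = np.linspace(mel(0), mel(sr / 2), n_filt + 2)
    hz = 700.0 * (10.0 ** (edges / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def deltas(x, w=2):
    """Simple regression-based derivatives along the time axis."""
    pad = np.pad(x, ((w, w), (0, 0)), mode='edge')
    num = sum(t * (pad[w + t:len(x) + w + t] - pad[w - t:len(x) + w - t])
              for t in range(1, w + 1))
    return num / (2 * sum(t * t for t in range(1, w + 1)))

def mfcc39(signal, sr=16000, n_fft=512, frame_ms=25, shift_ms=10, n_filt=24, n_ceps=13):
    frame, shift = int(sr * frame_ms / 1000), int(sr * shift_ms / 1000)
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])   # pre-emphasis
    n_frames = 1 + max(0, (len(signal) - frame) // shift)
    win = np.hamming(frame)
    frames = np.stack([signal[i * shift:i * shift + frame] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    logmel = np.log(power @ mel_filterbank(n_filt, n_fft, sr).T + 1e-10)
    # DCT-II to cepstra, keep the first n_ceps coefficients
    k = np.arange(n_filt)
    dct = np.cos(np.pi / n_filt * (k + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    ceps = logmel @ dct.T
    ceps -= ceps.mean(axis=0)            # sentence-based mean normalization
    d1 = deltas(ceps)                    # first order derivatives
    d2 = deltas(d1)                      # second order derivatives
    return np.hstack([ceps, d1, d2])     # (number of frames, 39)

feat = mfcc39(np.random.randn(16000))    # 1 s of dummy audio
print(feat.shape)                        # (frames, 39)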

A TIMIT reference model

SPRAAK uses tied mixture models with diagonal covariances. The './models' directory contains a reference HMM. The model is composed of 3 files:

A description of the acoustic models in SPRAAK is given in Acoustic Model.

For this example, the model name mname is 'mfcc39_ci'. As you may guess from the name, these are context-independent acoustic models using 39-dimensional cepstral vectors. The distributions of the 141 states are modeled by mixtures of Gaussians drawn from a global pool of 5,550 Gaussians, using on average 126 non-zero weights per state.
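To make the terminology concrete, here is a minimal sketch (not SPRAAK's actual data structures or file formats) of what a tied-mixture model with diagonal covariances amounts to: one global pool of Gaussians shared by all states, with each state only storing a sparse set of mixture weights over that pool. The array names and random parameters are invented for illustration; only the dimensions come from the description above.

# Conceptual sketch (not SPRAAK's data structures): a tied-mixture acoustic
# model keeps one global pool of diagonal-covariance Gaussians; every HMM
# state only stores a sparse weight vector over that shared pool.
import numpy as np

rng = np.random.default_rng(0)
DIM, POOL, STATES, ACTIVE = 39, 5550, 141, 126    # numbers quoted in the text

means = rng.standard_normal((POOL, DIM))          # global Gaussian pool
variances = rng.uniform(0.5, 2.0, (POOL, DIM))    # diagonal covariances only

# Each state: indices into the pool plus the matching (log) mixture weights.
state_idx = [rng.choice(POOL, ACTIVE, replace=False) for _ in range(STATES)]
state_logw = [np.log(np.full(ACTIVE, 1.0 / ACTIVE)) for _ in range(STATES)]

def log_gauss(x, mu, var):
    """Log density of diagonal-covariance Gaussians, vectorized over the pool."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)

def state_loglik(x, s):
    """log p(x | state s) = logsumexp over the state's active pool Gaussians."""
    idx = state_idx[s]
    lp = state_logw[s] + log_gauss(x, means[idx], variances[idx])
    m = lp.max()
    return m + np.log(np.exp(lp - m).sum())

x = rng.standard_normal(DIM)                      # one 39-dim feature vector
print(state_loglik(x, 0))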

Running an Evaluation

For running the first demos, copy the provided EVAL and VITALIGN scripts to your ./exp directory.

> cd ~/MyTimit/exp
> cp $SPR_HOME/examples/scripts/{EVAL,VITALIGN} .

EVAL is a wrapper around a call to spr_eval.py, a Python script that simultaneously performs recognition and scoring of the results against a given reference. The evaluation script uses the information given in a ".ini" initialization file to configure the recognizer. Further arguments and argument overrides can be specified on the command line, such that the same .ini file can be reused for a whole group of experiments with different parameters.

The .ini file contains a number of different sections:

This section also takes care of the conversion of the output alphabet, i.e. a mapping from the TIMIT-51 phone set (used for model generation) to the TIMIT-39 phone set typically used for evaluation.

Finally, the .ini file contains a number of preset Lexicon and Language Model sections.

IMPORTANT REMARK on DIRECTORY STRUCTURE: all directories specified within the '.ini' file are RELATIVE to the '.ini' file (and not to where you execute the command!).

We will not explain all parameters in detail and assume that the parameter settings not described here are OK or irrelevant for this task.

In the evaluation script (EVAL) we further specify the experiment:

Now run the evaluation:

> EVAL

It should take only about one minute to recognize and score the 88 sentences from this small test corpus. The main result is echoed to the screen:

WER=29.55%, (ins=2.82%, del=9.23%, sub=17.50%), 3371
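For reference, the sketch below illustrates how such a score line is typically computed (this is not spr_eval.py's scorer): the recognized string is aligned against the reference with a Levenshtein alignment, and the insertion, deletion and substitution counts are expressed as a percentage of the number of reference tokens. The example strings are made up.

# Conceptual sketch (not spr_eval.py): WER and its insertion/deletion/
# substitution breakdown via a Levenshtein alignment; all error counts are
# normalized by the number of reference tokens.
def wer(ref, hyp):
    """Return (wer, ins, del, sub) as percentages of the reference length."""
    R, H = len(ref), len(hyp)
    # d[i][j] = (total errors, insertions, deletions, substitutions)
    d = [[(0, 0, 0, 0)] * (H + 1) for _ in range(R + 1)]
    for j in range(1, H + 1):
        d[0][j] = (j, j, 0, 0)                  # hypothesis-only prefix: insertions
    for i in range(1, R + 1):
        d[i][0] = (i, 0, i, 0)                  # reference-only prefix: deletions
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]       # match, no extra cost
            else:
                best = min((d[i - 1][j - 1], 's'),   # substitution
                           (d[i - 1][j], 'd'),       # deletion
                           (d[i][j - 1], 'i'),       # insertion
                           key=lambda t: t[0][0])
                e, ins, dele, sub = best[0]
                d[i][j] = (e + 1,
                           ins + (best[1] == 'i'),
                           dele + (best[1] == 'd'),
                           sub + (best[1] == 's'))
    e, ins, dele, sub = d[R][H]
    return tuple(100.0 * x / R for x in (e, ins, dele, sub))

ref = "sil b ah t er sil".split()
hyp = "sil b ah dx er er sil".split()
print("WER=%.2f%%, (ins=%.2f%%, del=%.2f%%, sub=%.2f%%)" % wer(ref, hyp))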

IMPORTANT REMARK on SCORE COMPARISON: different architectures may imply different roundoff errors and cause different results. Hence it is not impossible that your result is SLIGHTLY different; however, for this small test we do NOT expect this to be the case. We only want to warn you that it is possible to obtain different results on different computers.

Four files are created as output. If all goes right, only the RESULT file (expname.RES) is important. The global result for the experiment is found at the end of this file and may be the only thing you are interested in. Before that, the file also contains the reference, the recognition result and the mismatches for each test utterance.

The other files are more for debugging and deep analysis:

Viterbi Alignment

A second baseline experiment consists of running a Viterbi alignment on a small corpus. For this, execute:

> VITALIGN

This script computes a Viterbi alignment, based on the same reference model, for the 'dr1' part of the TIMIT training data. In a Viterbi alignment, the optimal alignment of a speech file against an imposed reference transcription is found. The acoustic model and linguistic resources are specified in a similar way as for the evaluation, with only a few differences. The VITALIGN script is used to set all the parameters and then call the main (C) program spr_vitalign.c. A state-based segmentation is requested with the '-S' flag. The output is written to the file 'train_dr1_ref.seg'.
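Conceptually, a forced Viterbi alignment is a dynamic programming search over an imposed left-to-right state sequence. The sketch below is a generic illustration, not spr_vitalign.c: it assumes the per-frame acoustic log-likelihoods have already been computed and simply finds the best monotone mapping of frames to states, i.e. a state segmentation.

# Conceptual sketch (not spr_vitalign.c): forced Viterbi alignment of T frames
# against an imposed left-to-right sequence of S states. loglik[t, s] is the
# acoustic log-likelihood of frame t under the s-th state in the sequence.
import numpy as np

def forced_align(loglik):
    T, S = loglik.shape
    NEG = -np.inf
    score = np.full((T, S), NEG)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = loglik[0, 0]                    # must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]                # self-loop
            move = score[t - 1, s - 1] if s > 0 else NEG   # forward transition
            back[t, s] = s if stay >= move else s - 1
            score[t, s] = max(stay, move) + loglik[t, s]
    path = [S - 1]                                # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]                             # state index for every frame

rng = np.random.default_rng(1)
ll = rng.standard_normal((20, 5))                 # 20 frames, 5 imposed states
print(forced_align(ll))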

This output file (Segmentation File) contains segmentations of the 304 utterances, with in total 32,759 state segments in the corpus. You may want to compare this segmentation with the one that the training started from (../resources/train_hand_state.seg). You will see that the phone boundaries are barely different; i.e. the hand segmentations provided with the TIMIT CDs largely coincide with the segmentations found automatically (though once in a while there are significant differences). On the other hand, the within-phone state boundaries are quite different, as the ones in the ../resources directory were very rough initial estimates that were only used to bootstrap the HMM training.
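If you want to quantify such a comparison yourself, the sketch below shows one way to measure how far corresponding phone boundaries lie apart once both segmentations of an utterance share the same phone sequence. The (begin, end, phone) triples and the 10 ms frame shift are assumptions made for illustration, not the on-disk Segmentation File format.

# Conceptual sketch: compare two segmentations of the same utterance that
# share the same phone sequence, e.g. the hand segmentation vs. the Viterbi
# output, in terms of boundary deviation in milliseconds.
import numpy as np

def boundary_deviation(seg_a, seg_b, frame_shift_ms=10):
    assert [p for _, _, p in seg_a] == [p for _, _, p in seg_b], "phone sequences differ"
    a = np.array([end for _, end, _ in seg_a[:-1]])   # inner boundaries only
    b = np.array([end for _, end, _ in seg_b[:-1]])
    diff_ms = np.abs(a - b) * frame_shift_ms
    return diff_ms.mean(), diff_ms.max()

hand = [(0, 12, 'sil'), (12, 20, 'b'), (20, 35, 'ah'), (35, 50, 'sil')]
auto = [(0, 13, 'sil'), (13, 21, 'b'), (21, 35, 'ah'), (35, 50, 'sil')]
print("mean/max boundary deviation: %.1f ms / %.1f ms" % boundary_deviation(hand, auto))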

INFO, WARNING & ERROR Messages

SPRAAK prints out plenty of information while it is processing. There are 3 distinct streams: