Recognizing and Aligning with SPRAAK

GOAL of this TUTORIAL

In this tutorial we provide you with a reference HMM, trained on the TIMIT database. If not done already, first get acquainted with the linguistic resources that are required to work with a Hidden Markov Model by reading Introduction to SPRAAK concepts, conventions and file structures. Then we show you how feature extraction is done in SPRAAK. After that everything is in place to run a small evaluation on the provided reference model. Finally we use the same models to compute a Viterbi alignment on a small corpus.

Feature Extraction

Speech recognizers do not work directly on sampled data. The great majority of speech recognition systems work with features extracted from the data at regular time intervals. SPRAAK has its own scripting language with which you can specify all of the most popular feature extraction schemes without the need for programming. The file 'mfcc39.ssp' specifies 39-dimensional mel cepstral features (including first and second order derivatives and sentence-based mean normalization), a common baseline for comparative experiments. Without needing to understand all the details, you can see what steps are performed in the signal processing demo script:

For an overview of all signal processing modules in SPRAAK, see Preprocessing modules
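As a rough illustration only (plain Python/NumPy, not the SPRAAK signal processing modules and not the contents of 'mfcc39.ssp'), the sketch below shows the kind of steps such a chain typically performs: pre-emphasis, windowing, a mel filterbank, cepstra via a DCT, sentence-based mean normalization and first/second order derivatives. All function names and parameter values are assumptions chosen for the example.

# Conceptual sketch (NOT the SPRAAK pipeline): 39-dim MFCC features with
# first/second order derivatives and sentence-based mean normalization.
import numpy as np

def mel(f):
    """Hz -> mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filt, n_fft, sr):
    """Triangular mel filterbank matrix of shape (n_filt, n_fft//2+1)."""
    edges = np.linspace(mel(0), mel(sr / 2), n_filt + 2)
    hz = 700.0 * (10.0 ** (edges / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def deltas(x, w=2):
    """Simple regression-based derivatives along the time axis."""
    pad = np.pad(x, ((w, w), (0, 0)), mode='edge')
    num = sum(t * (pad[w + t:len(x) + w + t] - pad[w - t:len(x) + w - t])
              for t in range(1, w + 1))
    return num / (2 * sum(t * t for t in range(1, w + 1)))

def mfcc39(signal, sr=16000, n_fft=512, frame_ms=25, shift_ms=10, n_filt=24, n_ceps=13):
    frame, shift = int(sr * frame_ms / 1000), int(sr * shift_ms / 1000)
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])   # pre-emphasis
    n_frames = 1 + max(0, (len(signal) - frame) // shift)
    win = np.hamming(frame)
    frames = np.stack([signal[i * shift:i * shift + frame] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    logmel = np.log(power @ mel_filterbank(n_filt, n_fft, sr).T + 1e-10)
    # DCT-II to cepstra, keep the first n_ceps coefficients
    k = np.arange(n_filt)
    dct = np.cos(np.pi / n_filt * (k + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    ceps = logmel @ dct.T
    ceps -= ceps.mean(axis=0)            # sentence-based mean normalization
    d1 = deltas(ceps)                    # first order derivatives
    d2 = deltas(d1)                      # second order derivatives
    return np.hstack([ceps, d1, d2])     # (number of frames, 39)

feat = mfcc39(np.random.randn(16000))    # 1 s of dummy audio
print(feat.shape)                        # (frames, 39)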

A TIMIT reference model

SPRAAK uses tied mixture models with diagonal covariances. The './models' directory contains a reference HMM. The model is composed of 3 files:

A description of the acoustic models in SPRAAK is given in Acoustic Model.

For this example, the model name mname is 'mfcc39_ci'. As you may guess from the name, these are context-independent acoustic models using 39-dimensional cepstral vectors. The distributions of the 141 states are modeled by mixtures of Gaussians drawn from a global pool of 5,550 Gaussians, using on average 126 non-zero weights per state.
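To make the terminology concrete, here is a minimal sketch (not SPRAAK's actual data structures or file formats) of what a tied-mixture model with diagonal covariances amounts to: one global pool of Gaussians shared by all states, with each state only storing a sparse set of mixture weights over that pool. The array names and random parameters are invented for illustration; only the dimensions come from the description above.

# Conceptual sketch (not SPRAAK's data structures): a tied-mixture acoustic
# model keeps one global pool of diagonal-covariance Gaussians; every HMM
# state only stores a sparse weight vector over that shared pool.
import numpy as np

rng = np.random.default_rng(0)
DIM, POOL, STATES, ACTIVE = 39, 5550, 141, 126    # numbers quoted in the text

means = rng.standard_normal((POOL, DIM))          # global Gaussian pool
variances = rng.uniform(0.5, 2.0, (POOL, DIM))    # diagonal covariances only

# Each state: indices into the pool plus the matching (log) mixture weights.
state_idx = [rng.choice(POOL, ACTIVE, replace=False) for _ in range(STATES)]
state_logw = [np.log(np.full(ACTIVE, 1.0 / ACTIVE)) for _ in range(STATES)]

def log_gauss(x, mu, var):
    """Log density of diagonal-covariance Gaussians, vectorized over the pool."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)

def state_loglik(x, s):
    """log p(x | state s) = logsumexp over the state's active pool Gaussians."""
    idx = state_idx[s]
    lp = state_logw[s] + log_gauss(x, means[idx], variances[idx])
    m = lp.max()
    return m + np.log(np.exp(lp - m).sum())

x = rng.standard_normal(DIM)                      # one 39-dim feature vector
print(state_loglik(x, 0))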

Running an Evaluation

For running the first demos, copy the provided EVAL and VITALIGN scripts to your ./exp directory.

> cd ~/MyTimit/exp
> cp $SPR_HOME/examples/scripts/{EVAL,VITALIGN} .

EVAL is a wrapper around a call to spr_eval.py, a Python script that simultaneously performs recognition and scoring of the results against a given reference. The evaluation script uses the information given in a ".ini" initialization file to configure the recognizer. Further arguments and argument overrides can be specified on the command line, such that the same .ini file can be reused for a whole group of experiments with different parameters.

The .ini file contains a number of different sections:

This section also takes care of the conversion of the output alphabet, i.e. a mapping from the TIMIT-51 phone set (used for model generation) to the TIMIT-39 phone set typically used for evaluation.

Finally, the .ini file contains a number of preset Lexicon and Language Model sections.

IMPORTANT REMARK on DIRECTORY STRUCTURE: all directories specified within the '.ini' file are RELATIVE to the '.ini' file (and not to where you execute the command!).

We will not explain all parameters in detail and assume that the parameter settings not described here are OK or irrelevant for this task.

In the evaluation script (EVAL) we further specify the experiment:

Now run the evaluation:

> EVAL

It should take only about one minute to recognize and score the 88 sentences from this small test corpus. The main result is echoed to the screen:

WER=29.55%, (ins=2.82%, del=9.23%, sub=17.50%), 3371
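For reference, the sketch below illustrates how such a score line is typically computed (this is not spr_eval.py's scorer): the recognized string is aligned against the reference with a Levenshtein alignment, and the insertion, deletion and substitution counts are expressed as a percentage of the number of reference tokens. The example strings are made up.

# Conceptual sketch (not spr_eval.py): WER and its insertion/deletion/
# substitution breakdown via a Levenshtein alignment; all error counts are
# normalized by the number of reference tokens.
def wer(ref, hyp):
    """Return (wer, ins, del, sub) as percentages of the reference length."""
    R, H = len(ref), len(hyp)
    # d[i][j] = (total errors, insertions, deletions, substitutions)
    d = [[(0, 0, 0, 0)] * (H + 1) for _ in range(R + 1)]
    for j in range(1, H + 1):
        d[0][j] = (j, j, 0, 0)                  # hypothesis-only prefix: insertions
    for i in range(1, R + 1):
        d[i][0] = (i, 0, i, 0)                  # reference-only prefix: deletions
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]       # match, no extra cost
            else:
                best = min((d[i - 1][j - 1], 's'),   # substitution
                           (d[i - 1][j], 'd'),       # deletion
                           (d[i][j - 1], 'i'),       # insertion
                           key=lambda t: t[0][0])
                e, ins, dele, sub = best[0]
                d[i][j] = (e + 1,
                           ins + (best[1] == 'i'),
                           dele + (best[1] == 'd'),
                           sub + (best[1] == 's'))
    e, ins, dele, sub = d[R][H]
    return tuple(100.0 * x / R for x in (e, ins, dele, sub))

ref = "sil b ah t er sil".split()
hyp = "sil b ah dx er er sil".split()
print("WER=%.2f%%, (ins=%.2f%%, del=%.2f%%, sub=%.2f%%)" % wer(ref, hyp))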

IMPORTANT REMARK on SCORE COMPARISON: different architectures may imply different roundoff errors and cause different results. Hence it is not impossible that your result is SLIGHTLY different; however, for this small test we do NOT expect this to be the case. We only want to warn you that it is possible to obtain different results on different computers.

Four files are created as output. If all goes right, only the RESULT file (expname.RES) is important. The global result for the experiment is found at the end of this file and may be the only thing you are interested in. Before that, the file also contains the reference, the recognition result and the mismatches for each test utterance.

The other files are more for debugging and deep analysis:

Viterbi Alignment

A second baseline experiment consists of running a Viterbi alignment on a small corpus. For this, execute:

> VITALIGN

This script computes a Viterbi alignment, based on the same reference model, for the 'dr1' part of the TIMIT training data. In a Viterbi alignment, the optimal alignment of a speech file against an imposed reference transcription is found. The acoustic model and linguistic resources are specified in a similar way as for the evaluation, with only a few differences. The VITALIGN script is used to set all the parameters and then call the main (C) program spr_vitalign.c. A state-based segmentation is requested with the '-S' flag. The output is written to the file 'train_dr1_ref.seg'.
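Conceptually, a forced Viterbi alignment is a dynamic programming search over an imposed left-to-right state sequence. The sketch below is a generic illustration, not spr_vitalign.c: it assumes the per-frame acoustic log-likelihoods have already been computed and simply finds the best monotone mapping of frames to states, i.e. a state segmentation.

# Conceptual sketch (not spr_vitalign.c): forced Viterbi alignment of T frames
# against an imposed left-to-right sequence of S states. loglik[t, s] is the
# acoustic log-likelihood of frame t under the s-th state in the sequence.
import numpy as np

def forced_align(loglik):
    T, S = loglik.shape
    NEG = -np.inf
    score = np.full((T, S), NEG)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = loglik[0, 0]                    # must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]                # self-loop
            move = score[t - 1, s - 1] if s > 0 else NEG   # forward transition
            back[t, s] = s if stay >= move else s - 1
            score[t, s] = max(stay, move) + loglik[t, s]
    path = [S - 1]                                # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]                             # state index for every frame

rng = np.random.default_rng(1)
ll = rng.standard_normal((20, 5))                 # 20 frames, 5 imposed states
print(forced_align(ll))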

This output file (Segmentation File) contains segmentations of the 304 utterances, with in total 32,759 state segments in the corpus. You may want to compare this segmentation with the one that the training started from (../resources/train_hand_state.seg). You will see that the phone boundaries are barely different; i.e. the hand segmentations provided with the TIMIT CDs largely coincide with the segmentations found automatically (though once in a while there are significant differences). On the other hand, the within-phone state boundaries are quite different, as the ones in the ../resources directory were very rough initial estimates that were only used to bootstrap the HMM training.
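If you want to quantify such a comparison yourself, the sketch below shows one way to measure how far corresponding phone boundaries lie apart once both segmentations of an utterance share the same phone sequence. The (begin, end, phone) triples and the 10 ms frame shift are assumptions made for illustration, not the on-disk Segmentation File format.

# Conceptual sketch: compare two segmentations of the same utterance that
# share the same phone sequence, e.g. the hand segmentation vs. the Viterbi
# output, in terms of boundary deviation in milliseconds.
import numpy as np

def boundary_deviation(seg_a, seg_b, frame_shift_ms=10):
    assert [p for _, _, p in seg_a] == [p for _, _, p in seg_b], "phone sequences differ"
    a = np.array([end for _, end, _ in seg_a[:-1]])   # inner boundaries only
    b = np.array([end for _, end, _ in seg_b[:-1]])
    diff_ms = np.abs(a - b) * frame_shift_ms
    return diff_ms.mean(), diff_ms.max()

hand = [(0, 12, 'sil'), (12, 20, 'b'), (20, 35, 'ah'), (35, 50, 'sil')]
auto = [(0, 13, 'sil'), (13, 21, 'b'), (21, 35, 'ah'), (35, 50, 'sil')]
print("mean/max boundary deviation: %.1f ms / %.1f ms" % boundary_deviation(hand, auto))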

INFO, WARNING & ERROR Messages

SPRAAK prints out plenty of information while it is processing. There are 3 distinct streams: