SPRAAK
In this tutorial we introduce two key routines that we recommend for use in the standard training path of SPRAAK: MIDA (Mutual Information Discriminant Analysis) and decorrelation.
The main goal of both procedures is to make the features as suitable as possible for the backend acoustic modeling using Gaussian mixture models with diagonal covariance matrices.
While the two are quite different in concept and motivation, they are presented together because they go hand in hand and are implemented as a single combined linear transformation of the original features.
The principal statistical modeling approach in SPRAAK is to model class distributions as a sum of elementary diagonal covariance Gaussians. In theory any distribution can be approximated arbitrarily closely by a sum of Gaussians, provided the number of Gaussians can grow infinitely large and the number of learning examples is infinite as well. In practice both the number of Gaussians that can be used and the number of training samples are quite limited. For this reason it pays to prepare the features optimally with the backend acoustic modeling in mind.
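As a conceptual illustration of this kind of model (not SPRAAK code), the short Python/numpy sketch below evaluates the log-likelihood of one feature vector under a small mixture of diagonal covariance Gaussians; the function name, weights and dimensions are made up for the example.

import numpy as np

def diag_gmm_loglik(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.
    weights: (K,), means: (K, D), variances: (K, D) per-component diagonal variances."""
    # Per-component Gaussian log-densities: diagonal covariance => sum over dimensions
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_expo = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_expo
    # Log-sum-exp over the K components
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Toy example: 2 components in a 3-dimensional feature space
x = np.array([0.2, -0.1, 1.0])
w = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0, 1.0], [1.0, -1.0, 0.0]])
var = np.array([[1.0, 0.5, 2.0], [0.3, 1.0, 1.0]])
print(diag_gmm_loglik(x, w, mu, var))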
From this, two criteria can be derived:
1. Dimensionality reduction: any dimension without added discriminative value will require extra training data and will inherently induce noise into the modeling. While speech may only have an intrinsic dimensionality of about 10 (a very rough guess), state-of-the-art recognizers tend to use dimensions of the order of 30-40. This overdimensioning is the main reason why speech recognizers need so much data and why they exhibit low robustness. In conclusion, reducing the dimensionality of the feature vector to the strict minimum is a main objective.
2. Decorrelation: decorrelated features reduce the number of Gaussians required per distribution, which again reduces the amount of training data needed and yields higher robustness as a consequence. As such, distributions of highly decorrelated features will be easier to model and will better fit the backend acoustic modeling.
Linear Discriminant Analysis (LDA) is a well-known method in pattern classification used to create a linear transformation that optimally separates classes. The hyperplanes are computed by maximizing the "between-class covariance" versus the "within-class covariance". The computed decision hyperplanes will be optimal decision boundaries (in terms of classification error) if all classes are homogeneous and have identically shaped distributions. Together with the design of the hyperplanes, LDA ranks directions from more to less important. As such, LDA is a great tool for dimensionality reduction.
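To make the criterion concrete, the following sketch shows textbook LDA in numpy: build the within- and between-class scatter matrices, solve the generalized eigenproblem, and keep the leading directions. This is a generic illustration of the standard formulation, not the SPRAAK implementation; the function name and the example dimensions are assumptions.

import numpy as np

def lda_transform(X, labels, n_keep):
    """Textbook LDA: rank directions by between- vs. within-class scatter.
    X: (N, D) feature matrix, labels: (N,) class label per row."""
    mean_all = X.mean(axis=0)
    D = X.shape[1]
    Sw = np.zeros((D, D))   # within-class scatter
    Sb = np.zeros((D, D))   # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Solve Sw^-1 Sb v = lambda v and sort directions by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_keep]]   # (D, n_keep) projection matrix

# Usage (hypothetical): project 39-dimensional features onto the 12 most
# discriminative directions:  W = lda_transform(X, labels, 12);  X_red = X @ W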
Basic LDA already results in an improved speech feature representation. Nevertheless, some of its assumptions are far too restrictive for speech features, so improved variants of LDA can do better than the baseline implementation.
MIDA is such an improved version of LDA. For more theoretical background, see the PhD thesis of Kris Demuynck.
In the acoustic modeling we approximate full covariance Gaussians by a sum of diagonal covariance Gaussians. The output of MIDA, however, may not be ideal from the point of view of diagonal covariance Gaussians: these assume that the data is uncorrelated within every class, and the more this holds, the better our acoustic modeling will work. Therefore an optimal transformation is sought that makes the data as uncorrelated as possible.
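The numpy sketch below illustrates the idea of decorrelation in the simplest possible setting: for a single Gaussian cloud, rotating the data onto the eigenvectors of its covariance matrix makes the off-diagonal covariance terms vanish. SPRAAK itself searches (by gradient descent, see below) one shared transformation that decorrelates all class distributions simultaneously, so this is only a conceptual illustration, not the actual algorithm.

import numpy as np

# Single-class example: diagonalize the covariance exactly with one rotation.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=5000)

cov = np.cov(X, rowvar=False)
_, eigvecs = np.linalg.eigh(cov)     # orthonormal eigenvectors of the covariance
X_rot = X @ eigvecs                  # rotate the features onto the eigenvector basis

print(np.cov(X, rowvar=False))       # clearly non-zero off-diagonal terms
print(np.cov(X_rot, rowvar=False))   # off-diagonal terms ~ 0 after rotation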
The MIDA transform will be computed automatically if the parameter 'mida_np' is set in the config object, e.g.
config.mida_np = 36
The decorrelation is implemented as an individual step in the training. It is advised to add a few (3) decorrelation steps at the end of a cycle of Viterbi training steps.
trainer.fvg(niter=3)
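Putting the fragments from this tutorial together, the relevant part of a training script (in the style of spr_train.py) could look roughly as follows. This is only a sketch: the creation of the 'config' and 'trainer' objects and all other settings are omitted and follow the standard SPRAAK training setup.

# Sketch only: order of calls as used in the e3 experiment described below.
config.mida_np = 36          # have the MIDA transform computed automatically
config.ph_spec = "# % <G>"   # symbols without acoustic-phonetic content (see below)

trainer.tied()               # major iteration 1: MIDA + tied-mixture Viterbi training
trainer.fvg(niter=3)         # major iteration 2: 3 decorrelation passes + 1 Viterbi pass
trainer.cdtree()             # major iteration 3: context-dependent decision tree training
trainer.fvg(niter=3)         # major iteration 4: 3 decorrelation passes + 1 Viterbi pass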
Both MIDA and decorrelation are computed by gradient descent algorithms that may converge rather slowly. It is not unusual for them to take up a large part of the training time. For details on the parameter settings, especially those of MIDA, please see [xxx].
To avoid the decorrelation passes at the end of MIDA and before the first training passes, you can add the following:
config.mida_opt = "...."
MIDA and decorrelation should primarily look at the phones and not so much at silence and noise, even though these are exactly the most frequent symbols in most corpora. Therefore it is possible to specify which symbols have no acoustic-phonetic content and should only play a secondary role when optimizing MIDA and the Gaussian decorrelation, e.g.
config.ph_spec = "# % <G>"
For these 'special phones', context-independent models will also be created.
Both MIDA and decorrelation compute a linear transformation of the features. These transformations are combined into a single transformation matrix, which is stored in the model directory together with a small piece of signal processing code that should be appended to the signal processing file. The latter can be done on the fly by specifying a sequence of signal processing files.
acmod.mat       (transformation matrix)
acmod.preproc   (signal processing code)
Hence you will need to add 'acmod.preproc' to the signal processing definition in the '.ini' file as done in e3_m2.ini:
[preproc]
script   ../resources/sp22_vtln.ssp ${mod}/acmod.preproc
A preprocessing script can be specified as multiple files, which are treated as if they were concatenated. The new preprocessing file contains a single command:
[lin_comb] A_matrix e3_m2/acmod.mat
which tells the preprocessing to apply a matrix multiplication with the coefficients found in 'acmod.mat'. This matrix is the composition of the original MIDA transform computed in the first iteration and the additional rotation applied by the second iteration.
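Conceptually, this linear combination step just multiplies every feature vector with the stored matrix. The numpy sketch below illustrates that operation; it assumes, purely for illustration, that the matrix has been exported to a plain-text file (the real 'acmod.mat' is read directly by the SPRAAK preprocessing), and the dimensions are arbitrary.

import numpy as np

A = np.loadtxt("acmod_matrix.txt")         # assumed plain-text export of the transformation matrix
frames = np.random.randn(100, A.shape[1])  # 100 input feature frames (dimension = nr. of columns of A)
transformed = frames @ A.T                 # each output frame is A times the input frame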
A TIMIT experiment is defined by the script files
./scripts/e3*
Copy these to your experiment directory.
MIDA is automatically added as the first step during initialization in major iteration 1. Full covariance Gaussians are computed in major iteration 2. Hence the respective models will be found in "./e3_m1/acmod.*" and "./e3_m2/acmod.*".
The training will also perform context-dependent training in major iteration 3 and compute the final models in major iteration 4; these are described in Context-Dependent Model Training. We recommend that you launch the whole training at once for simplicity and just run independent evaluation experiments on the CI models in ./e3_m2 and the CD models in ./e3_m4.
The following processes and subtasks will be performed:
spr_train.py:
  trainer.tied()
    iter1_mida                 # MIDA Training
    iter1_tied_init            # initialize Tied Mixtures
    iter1_tied_iter_i          # 3 passes of Viterbi Training
  trainer.fvg()
    iter2_fvg_i                # 3 passes of Decorrelation
    iter2_viterbi              # 1 pass of Viterbi
  trainer.cdtree()
    iter3_cdtree_init          # initialize decision tree
    iter3_cdtree_segpass
    iter3_cdtree_sel_gauss
    iter3_cdtree_tree
    iter3_cdtree_tree_train.i  # multiple passes of tree training
  trainer.fvg()
    iter4_fvg_i                # 3 passes of Decorrelation
    iter4_viterbi              # 1 pass of Viterbi
spr_eval.py:
  mod = e3_m2                  # evaluation of CI models
  mod = e3_m4                  # evaluation of CD models
A very rough estimate of the execution times on a contemporary (2010) dual-core machine is:
MIDA training:           a few hours
FVG training:            one hour
Model training:          less than an hour
Decision tree training:  a few hours
FVG training:            one hour
Evaluation:              15 minutes
Expected results are on the order of 26.5% and 24.1% for CI and CD models respectively.
Filenames of all files directly involved in the experiment are given below. Filenames are relative to the local experiment directory.
SETUP:
  e3.csh                              master script for running training + evaluation
  e3.config                           config file for training
  e3_m2.ini                           config file for evaluation of models _m2
  e3_m4.ini                           config file for evaluation of models _m4

OTHER INPUT FILES/DIRECTORIES USED:
  ../resources/sp22_vtln.ssp          feature extraction file
  ../resources/timit51.ci             phone alphabet
  ../resources/timit51.cd             state definition file
  ../resources/timit51.dic            lexicon
  ../resources/train_hand_states.seg  segmentation file (hand labeled)
  ../resources/timit.questions        question set for decision tree building
  ../resources/train.cor              specification of the training corpus
  ../resources/test.39.cor            specification of the test corpus
  ../dbase/---                        database with speech waveform files

GENERATED FILES (primary):
  e3_.log                             log file, contains logging information on the experiment
  e3_recovery.log                     recovery log file, contains recovery points for automatic restart
  e3_m1/acmod.{mvg,hmm,sel}           acoustic model files after training of the initial context-independent models
  e3_m2/acmod.{mvg,hmm,sel}           final context-independent acoustic model files (after 3 extra full covariance rotations)
  e3_m3/acmod.{mvg,hmm,sel}           initial context-dependent acoustic model files after decision tree training
  e3_m4/acmod.{mvg,hmm,sel}           final context-dependent acoustic model files (after 3 extra full covariance rotations)
  e3_mI/acmod.preproc                 preprocessing script applying the linear feature transformation, to be used in conjunction with model I
  e3_mI/acmod.mat                     MIDA+FVG linear feature transformation matrix, to be used in conjunction with model I
  e3_m2.RES                           result file for the models in e3_m2
  e3_m4.RES                           result file for the models in e3_m4

GENERATED FILES (supporting):
  e3_mI.CMD                           commands sent to spr_cwr_main.c during evaluation
  e3_mI.OUT                           output generated by spr_cwr_main.c
  e3_mI/acmod.i.xxx                   acoustic model files at the end of a minor iteration
  e3_mI/acmod.yyy                     auxiliary files generated during acoustic model training