SPRAAK
In this tutorial we introduce two key routines that we recommend for use in the standard training path of SPRAAK: MIDA (Mutual Information Discriminant Analysis) and decorrelation.
The main goal of both procedures is to make the features as suitable as possible for the backend acoustic modeling using Gaussian mixture models with diagonal covariance matrices.
While the two are quite different in concept and motivation, they are presented together because they go hand in hand and are implemented as a single combined linear transformation of the original features.
The principal statistical modeling approach in SPRAAK is to model class distributions as a sum of elementary diagonal covariance Gaussians. In theory any distribution can be approximated arbitrarily closely by a sum of Gaussians, provided the number of Gaussians can grow infinitely large and the number of learning examples is infinite as well. In practice both the number of Gaussians that can be used and the number of training samples are quite limited. For this reason it pays to prepare the features optimally with the backend acoustic modeling in mind.
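As a conceptual illustration of this kind of model (not SPRAAK code), the short Python/numpy sketch below evaluates the log-likelihood of one feature vector under a small mixture of diagonal covariance Gaussians; the function name, weights and dimensions are made up for the example.

import numpy as np

def diag_gmm_loglik(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.
    weights: (K,), means: (K, D), variances: (K, D) per-component diagonal variances."""
    # Per-component Gaussian log-densities: diagonal covariance => sum over dimensions
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_expo = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_expo
    # Log-sum-exp over the K components
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Toy example: 2 components in a 3-dimensional feature space
x = np.array([0.2, -0.1, 1.0])
w = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0, 1.0], [1.0, -1.0, 0.0]])
var = np.array([[1.0, 0.5, 2.0], [0.3, 1.0, 1.0]])
print(diag_gmm_loglik(x, w, mu, var))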
From this, two criteria can be derived:
1. Dimensionality reduction: any dimension without added discriminative value will require extra training data and will inherently induce noise into the modeling. While speech may only have an intrinsic dimensionality of about 10 (a very rough guess), state-of-the-art recognizers tend to use dimensions of the order of 30-40. This overdimensioning is the main reason why speech recognizers need so much data and why they exhibit low robustness. In conclusion, reducing the dimensionality of the feature vector to the strict minimum is a main objective.
2. Decorrelation: decorrelated features reduce the number of Gaussians required per distribution, which again reduces the amount of training data needed and yields higher robustness as a consequence. As such, distributions of highly decorrelated features will be easier to model and will better fit the backend acoustic modeling.
Linear Discriminant Analysis (LDA) is a well-known method in pattern classification used to create a linear transformation that optimally separates classes. The hyperplanes are computed by maximizing the "between-class covariance" versus the "within-class covariance". The computed decision hyperplanes will be optimal decision boundaries (in terms of classification error) if all classes are homogeneous and have identically shaped distributions. Together with the design of the hyperplanes, LDA ranks directions from more to less important. As such, LDA is a great tool for dimensionality reduction.
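To make the criterion concrete, the following sketch shows textbook LDA in numpy: build the within- and between-class scatter matrices, solve the generalized eigenproblem, and keep the leading directions. This is a generic illustration of the standard formulation, not the SPRAAK implementation; the function name and the example dimensions are assumptions.

import numpy as np

def lda_transform(X, labels, n_keep):
    """Textbook LDA: rank directions by between- vs. within-class scatter.
    X: (N, D) feature matrix, labels: (N,) class label per row."""
    mean_all = X.mean(axis=0)
    D = X.shape[1]
    Sw = np.zeros((D, D))   # within-class scatter
    Sb = np.zeros((D, D))   # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Solve Sw^-1 Sb v = lambda v and sort directions by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_keep]]   # (D, n_keep) projection matrix

# Usage (hypothetical): project 39-dimensional features onto the 12 most
# discriminative directions:  W = lda_transform(X, labels, 12);  X_red = X @ W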
Basic LDA already results in an improved speech feature representation. Nevertheless, some of its assumptions are far too restrictive for speech features, so improved variants of LDA can do better than the baseline implementation.
MIDA is such an improved version of LDA. For more theoretical background, see the PhD thesis of Kris Demuynck.
In the acoustic modeling we approximate full covariance Gaussians by a sum of diagonal covariance Gaussians. The output of MIDA, however, may not be ideal from the point of view of diagonal covariance Gaussians: these assume that the data is uncorrelated within every class, and the more this holds, the better our acoustic modeling will work. Therefore an optimal transformation is sought that makes the data as uncorrelated as possible.
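The numpy sketch below illustrates the idea of decorrelation in the simplest possible setting: for a single Gaussian cloud, rotating the data onto the eigenvectors of its covariance matrix makes the off-diagonal covariance terms vanish. SPRAAK itself searches (by gradient descent, see below) one shared transformation that decorrelates all class distributions simultaneously, so this is only a conceptual illustration, not the actual algorithm.

import numpy as np

# Single-class example: diagonalize the covariance exactly with one rotation.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=5000)

cov = np.cov(X, rowvar=False)
_, eigvecs = np.linalg.eigh(cov)     # orthonormal eigenvectors of the covariance
X_rot = X @ eigvecs                  # rotate the features onto the eigenvector basis

print(np.cov(X, rowvar=False))       # clearly non-zero off-diagonal terms
print(np.cov(X_rot, rowvar=False))   # off-diagonal terms ~ 0 after rotation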
The MIDA transform will be computed automatically if the parameter 'mida_np' is set in the config object, e.g.
config.mida_np = 36
The decorrelation is implemented as an individual step in the training. It is advised to add a few (3) decorrelation steps at the end of a cycle of Viterbi training steps.
trainer.fvg(niter=3)
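Putting the fragments from this tutorial together, the relevant part of a training script (in the style of spr_train.py) could look roughly as follows. This is only a sketch: the creation of the 'config' and 'trainer' objects and all other settings are omitted and follow the standard SPRAAK training setup.

# Sketch only: order of calls as used in the e3 experiment described below.
config.mida_np = 36          # have the MIDA transform computed automatically
config.ph_spec = "# % <G>"   # symbols without acoustic-phonetic content (see below)

trainer.tied()               # major iteration 1: MIDA + tied-mixture Viterbi training
trainer.fvg(niter=3)         # major iteration 2: 3 decorrelation passes + 1 Viterbi pass
trainer.cdtree()             # major iteration 3: context-dependent decision tree training
trainer.fvg(niter=3)         # major iteration 4: 3 decorrelation passes + 1 Viterbi pass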
Both MIDA and decorrelation are computed by gradient descent algorithms that may converge rather slowly. It is not unusual for them to take up a large part of the training time. For details on the parameter settings, especially those of MIDA, please see [xxx].
To avoid the decorrelation passes at the end of MIDA and before the first training passes, you can add the following:
config.mida_opt = "...."
MIDA and decorrelation should primarily look at the phones and not so much at silence and noise, even though these are exactly the most frequent symbols in most corpora. Therefore it is possible to specify which symbols have no acoustic-phonetic content and should only play a secondary role when optimizing MIDA and the Gaussian decorrelation, e.g.
config.ph_spec = "# % <G>"
For these 'special phones', context-independent models will also be created.
Both MIDA and decorrelation compute a linear transformation of the features. These transformations are combined into a single transformation matrix, which is stored in the model directory together with a small piece of signal processing code that should be appended to the signal processing file. The latter can be done on the fly by specifying a sequence of signal processing files.
acmod.mat       (transformation matrix)
acmod.preproc   (signal processing code)
Hence you will need to add 'acmod.preproc' to the signal processing definition in the '.ini' file as done in e3_m2.ini:
[preproc]
script   ../resources/sp22_vtln.ssp ${mod}/acmod.preproc
A preprocessing script can be specified as multiple files, which are treated as if they were concatenated. The new preprocessing file contains a single command:
[lin_comb] A_matrix e3_m2/acmod.mat
which tells the preprocessing to apply a matrix multiplication with the coefficients found in 'acmod.mat'. This matrix is the composition of the original MIDA transform computed in the first iteration and the additional rotation applied by the second iteration.
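Conceptually, this linear combination step just multiplies every feature vector with the stored matrix. The numpy sketch below illustrates that operation; it assumes, purely for illustration, that the matrix has been exported to a plain-text file (the real 'acmod.mat' is read directly by the SPRAAK preprocessing), and the dimensions are arbitrary.

import numpy as np

A = np.loadtxt("acmod_matrix.txt")         # assumed plain-text export of the transformation matrix
frames = np.random.randn(100, A.shape[1])  # 100 input feature frames (dimension = nr. of columns of A)
transformed = frames @ A.T                 # each output frame is A times the input frame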
A TIMIT experiment is defined by the script files
./scripts/e3*
Copy these to your experiment directory.
MIDA is automatically added as the first step during initialization in major iteration 1. Full covariance Gaussians are computed in major iteration 2. Hence the respective models will be found in "./e3_m1/acmod.*" and "./e3_m2/acmod.*".
The training will also perform context-dependent training in major iteration 3 and compute the final models in major iteration 4; these are described in Context-Dependent Model Training. We recommend that you launch the whole training at once for simplicity and just run independent evaluation experiments on the CI models in ./e3_m2 and the CD models in ./e3_m4.
The following processes and subtasks will be performed:
spr_train.py:
  trainer.tied()
    iter1_mida                 # MIDA Training
    iter1_tied_init            # initialize Tied Mixtures
    iter1_tied_iter_i          # 3 passes of Viterbi Training
  trainer.fvg()
    iter2_fvg_i                # 3 passes of Decorrelation
    iter2_viterbi              # 1 pass of Viterbi
  trainer.cdtree()
    iter3_cdtree_init          # initialize decision tree
    iter3_cdtree_segpass
    iter3_cdtree_sel_gauss
    iter3_cdtree_tree
    iter3_cdtree_tree_train.i  # multiple passes of tree training
  trainer.fvg()
    iter4_fvg_i                # 3 passes of Decorrelation
    iter4_viterbi              # 1 pass of Viterbi
spr_eval.py:
  mod = e3_m2                  # evaluation of CI models
  mod = e3_m4                  # evaluation of CD models
A very rough estimate of the execution times on a contemporary (2010) dual-core machine is:
MIDA training:           a few hours
FVG training:            one hour
Model training:          less than an hour
Decision tree training:  a few hours
FVG training:            one hour
Evaluation:              15 minutes
Expected results are on the order of 26.5% and 24.1% for CI and CD models respectively.
Filenames of all files directly involved in the experiment are given below. Filenames are relative to the local experiment directory.
SETUP:
  e3.csh                              master script for running training + evaluation
  e3.config                           config file for training
  e3_m2.ini                           config file for evaluation of models _m2
  e3_m4.ini                           config file for evaluation of models _m4

OTHER INPUT FILES/DIRECTORIES USED:
  ../resources/sp22_vtln.ssp          feature extraction file
  ../resources/timit51.ci             phone alphabet
  ../resources/timit51.cd             state definition file
  ../resources/timit51.dic            lexicon
  ../resources/train_hand_states.seg  segmentation file (hand labeled)
  ../resources/timit.questions        question set for decision tree building
  ../resources/train.cor              specification of the training corpus
  ../resources/test.39.cor            specification of the test corpus
  ../dbase/---                        database with speech waveform files

GENERATED FILES (primary):
  e3_.log                             log file, contains logging information on the experiment
  e3_recovery.log                     recovery log file, contains recovery points for automatic restart
  e3_m1/acmod.{mvg,hmm,sel}           acoustic model files after training of the initial context-independent models
  e3_m2/acmod.{mvg,hmm,sel}           final context-independent acoustic model files (after 3 extra full covariance rotations)
  e3_m3/acmod.{mvg,hmm,sel}           initial context-dependent acoustic model files after decision tree training
  e3_m4/acmod.{mvg,hmm,sel}           final context-dependent acoustic model files (after 3 extra full covariance rotations)
  e3_mI/acmod.preproc                 preprocessing script applying the linear feature transformation, to be used in conjunction with model I
  e3_mI/acmod.mat                     MIDA+FVG linear feature transformation matrix, to be used in conjunction with model I
  e3_m2.RES                           result file for the models in e3_m2
  e3_m4.RES                           result file for the models in e3_m4

GENERATED FILES (supporting):
  e3_mI.CMD                           commands sent to spr_cwr_main.c during evaluation
  e3_mI.OUT                           output generated by spr_cwr_main.c
  e3_mI/acmod.i.xxx                   acoustic model files at the end of a minor iteration
  e3_mI/acmod.yyy                     auxiliary files generated during acoustic model training