SPRAAK
All of the world's languages can be transcribed using a compact phonetic alphabet of only 30-50 symbols. However, such alphabet symbols do not necessarily represent nice homogeneous classes. The major reasons for this are coarticulation with adjacent phones and dependencies on word position and stress. For this reason 'context-dependent phones' have become the mainstream solution for creating more homogeneous classes that are by nature easier to model.
The trivial way of creating context-dependent allophonic variants is to create an allophone for each pair of left and right contexts. This approach is often referred to as untied triphone modeling, i.e. all possible triphones are defined as independent units. While the context-independent approach defines classes that are too broad, triphones define many classes with too few examples for training: with a 50-symbol alphabet there are 50^3 = 125,000 possible triphones, far more than a corpus the size of TIMIT can populate with sufficient data.
The generalized triphone concept found a middle ground: allophones are created for clusters of left and right context phones. The preferred way of building such clustered triphones nowadays is a decision tree approach. A set of phonetic criteria (questions) is defined according to which the contexts can be divided into a 'yes' and a 'no' class; the splitting can be repeated several times, and the leaves of the tree define the context classes of the allophones. The appeal of decision trees is that they are fully data driven, i.e. more allophonic variants will be created for phones with diverse data and with lots of examples.
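To make the splitting criterion concrete, here is a toy walk-through of a single greedy split, written in Python. It is only a sketch, not SPRAAK's implementation: the per-context statistics, the question sets and the 1-D single-Gaussian scoring are all simplifying assumptions, but the principle is the same: choose the question whose yes/no partition of the contexts gives the largest gain in log-likelihood.

import math

def gauss_loglik(count, mean, sum_sq):
    # maximum log-likelihood of pooled data under one 1-D Gaussian
    var = max(sum_sq / count - mean * mean, 1e-6)
    return -0.5 * count * (math.log(2 * math.pi * var) + 1.0)

def pool(stats, phones):
    # pool sufficient statistics (count, mean, sum of squares) over a subset
    n = sum(stats[p][0] for p in phones)
    s = sum(stats[p][0] * stats[p][1] for p in phones)
    ss = sum(stats[p][2] for p in phones)
    return n, s / n, ss

# hypothetical statistics for the left contexts of some phone state:
# left phone -> (count, mean, sum_sq)
stats = {'m': (200, 1.1, 260.0), 'n': (150, 1.0, 170.0),
         'iy': (300, -0.8, 230.0), 'uw': (80, -0.9, 80.0)}

# hypothetical phonetic questions: name -> set of phones answering 'yes'
questions = {'NASAL': {'m', 'n'}, 'VOWEL': {'iy', 'uw'}, 'ROUND': {'uw'}}

def split_loglik(yes_set):
    # likelihood when yes- and no-contexts each get their own Gaussian
    yes = stats.keys() & yes_set
    no = stats.keys() - yes_set
    return gauss_loglik(*pool(stats, yes)) + gauss_loglik(*pool(stats, no))

parent = gauss_loglik(*pool(stats, stats.keys()))
name, yes_set = max(questions.items(), key=lambda q: split_loglik(q[1]))
print(f"best question: {name}, gain: {split_loglik(yes_set) - parent:.1f}")

In a real trainer the same kind of score is computed per HMM state over multi-dimensional statistics, and splitting continues recursively until a stop criterion is met.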
The principal task for the designer is to define a set of language-specific questions. The size of the decision trees can be controlled with two parameters.
The standard method 'cdtree' of the trainer class creates one tree for each state of each phone. State sharing is applied between allophones of the same phone, but never across phones.
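Once trained, such a tree is simply walked with the left and right context to find the tied state. The toy fragment below shows the idea; the node layout, the questions and the leaf names are invented for illustration and do not reflect SPRAAK's internal data structures.

def tied_state(tree, left, right):
    # walk the tree: internal nodes ask a set-membership question about
    # the left ('L') or right ('R') context, leaves name the shared cd-state
    node = tree
    while isinstance(node, dict):
        phones, side = node['question']
        answer = (left if side == 'L' else right) in phones
        node = node['yes' if answer else 'no']
    return node

# hypothetical tree for state 0 of phone 'n'
tree_n0 = {'question': ({'iy', 'ih', 'eh'}, 'L'),      # left context a front vowel?
           'yes': 'n_state0_leaf1',
           'no': {'question': ({'p', 't', 'k'}, 'R'),  # right context an unvoiced stop?
                  'yes': 'n_state0_leaf2',
                  'no': 'n_state0_leaf3'}}

print(tied_state(tree_n0, left='iy', right='t'))  # -> n_state0_leaf1
print(tied_state(tree_n0, left='s', right='k'))   # -> n_state0_leaf2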
Building a global decision tree would allow across-phone tying and is possible with the routines provided in SPRAAK, but its implementation falls outside the scope of this introductory tutorial.
Decision tree training of context-dependent models is implemented in major iteration 3 of the training in e3.config.
...
config.questions = "../resources/timit.questions"
...

# 3 - make a decision tree for context dependent modeling
trainer.cdtree()

# 4 - do yet another 3 full covariance gaussian rotation steps
trainer.fvg(niter=3)
The file with phonetic questions is set in config.questions. This file looks exactly the same as a phonetic dictionary: the first column contains the question names and the second column the question definitions, specified as a set of phones.
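For illustration, a few lines of such a questions file could look as follows; the question names and phone sets below are made up for this example and are not copied from the actual timit.questions:

VOWEL     aa ae ah ao aw ax ay eh er ey ih ix iy ow oy uh uw
NASAL     m n ng en
UNV_STOP  p t k
FRONT     ae eh ey ih ix iy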
In the provided example the decision tree training is called with its default parameters. The whole training process is finalized with a few extra iterations of Gaussian decorrelation, since the previous transformation might no longer be optimal for the new models.
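For intuition, the numpy fragment below demonstrates the idea behind such a decorrelation step on toy data: estimate the pooled within-class covariance and rotate the features onto its eigenvectors, so that diagonal-covariance Gaussians become a better fit. This is a conceptual sketch only, not the fvg routine itself.

import numpy as np

rng = np.random.default_rng(0)
# toy data: two classes sharing one correlated covariance structure
cov = np.array([[2.0, 1.2], [1.2, 1.0]])
X = np.vstack([rng.multivariate_normal([0.0, 0.0], cov, 500),
               rng.multivariate_normal([3.0, 1.0], cov, 500)])
labels = np.repeat([0, 1], 500)

# pooled within-class covariance and its eigenvectors
within = sum(np.cov(X[labels == k].T) for k in (0, 1)) / 2
_, eigvecs = np.linalg.eigh(within)
X_rot = X @ eigvecs  # rotated features: ~decorrelated within each class

for k in (0, 1):
    c = np.cov(X_rot[labels == k].T)
    print(f"class {k} off-diagonal covariance after rotation: {c[0, 1]:.3f}")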
The results of these training passes will be stored in ./e3_m3 and ./e3_m4 respectively.
Remember that the decision tree will generate new phone definitions. Hence a newly generated 'acmod.cd' file will be found in ./e3_m3. You should find that this training resulted in the creation of 541 cd-states used in 1444 context-dependent allophones.
A complete training path is provided in the scripts {e3.csh,e3.config}. For testing with the final models, you need to run the evaluation with e3_m4.ini.
The description of the experimental setup is given, together with MIDA, in the section MIDA & Gaussian Decorrelation.
With the presented methodologies we have now created a pretty decent TIMIT baseline system with an error rate of 24%, well below the 29% obtained when using default mel cepstra and context-independent phones (a relative error reduction of roughly 17%).
Further improvements are possible. You could explore the optimization of a large number of parameters.
See for yourself what is possible!