SPRAAK
Vocal tract length normalization (VTLN) is one of the easiest ways of doing fast speaker adaptation. The underlying idea is simple: resonances in an acoustic tube (such as the vocal tract) are inversely proportional to the length of the tube. Thus, as female vocal tracts are 10-15% shorter than male vocal tracts, female formant positions are higher than the equivalent male formant positions. Given the all in all moderate variation in vocal tract length, the effect of vocal tract length variation can be modeled well by a linear warping of the frequency axis. Hence, by warping a spectrum by a speaker-specific warping factor, typically towards a global average vocal tract length, we obtain a spectral estimate that is 'normalized' with respect to vocal tract length. By normalizing out this physiological influence, the obtained spectral estimates are more homogeneous across speakers and hence more suitable for recognizing the acoustic-phonetic content. The linear warping itself can be incorporated into the filterbank that converts from linear frequency to mel frequency (or some other frequency scale); this is also how it is implemented in SPRAAK.
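The inverse relation between tube length and resonance frequency can be made concrete with a small sketch (illustration only, not SPRAAK code; the tube lengths are rough textbook values):

```python
# Resonances of a uniform acoustic tube closed at one end:
#   F_k = (2k - 1) * c / (4 * L)
# A shorter tube raises every resonance by the same factor, which is why
# a single linear warp of the frequency axis can map speakers onto each other.
C = 343.0  # speed of sound in air, m/s

def tube_resonances(length_m, n=3):
    """First n resonance frequencies (Hz) of a uniform tube of given length."""
    return [(2 * k - 1) * C / (4 * length_m) for k in range(1, n + 1)]

male = tube_resonances(0.17)    # ~17 cm male vocal tract
female = tube_resonances(0.15)  # ~12% shorter female vocal tract

# The ratio female/male is the same for every resonance: a single
# frequency-axis scale factor of L_male / L_female.
ratios = [f / m for f, m in zip(female, male)]
```

Because the ratio is constant across all resonances, one scalar warping factor per speaker suffices.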
Warping by scaling the filterbank has one potentially negative side-effect, which is however easy to overcome. For a female voice, frequencies are warped to higher values. This implies that the highest frequencies (just below Nyquist) are moved beyond the Nyquist frequency and are hence not available in the digital signal.
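A quick sanity check of this effect, using the 16 kHz sampling rate and the 7998 Hz upper edge of the highest channel quoted later in this section:

```python
# Illustration: with 16 kHz sampling, a 10% upward warp pushes the top
# of the highest mel channel past the Nyquist frequency, where no
# information is available in the digital signal.
fs = 16000.0
nyquist = fs / 2.0             # 8000 Hz
top_edge = 7998.0              # upper edge of the highest unwarped channel
warp = 1.10                    # typical upward warp for a female voice
warped_edge = top_edge * warp  # 8797.8 Hz, beyond Nyquist
```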
Warping can be implemented straightforwardly as an option in the [filter_bank] signal processing block. Either a fixed warping factor can be specified for a whole file/stream, or the warping factor can be passed in a frame-synchronous stream.
[filter_bank] scale MELDM bank TRI output dB LWfixed 1.07 warp PROCESS 1:vlen-2
The above filterbank description (sspmod_filter_bank.c) specifies:
Especially the last line needs further clarification. The [filter_bank] module in SPRAAK performs its design 'unwarped': it creates all filters that fit between 0 Hz and the Nyquist frequency for the unwarped case. Using the default Davis and Mermelstein mel scale and triangular filter shapes, this implies 24 channels, with the first one having a center frequency of 100 Hz and the last one a center frequency of 6963 Hz.
boundary   frequency   frequency + 10% warp
 index       (in Hz)        (in Hz)
   0             0              0      low-frequency edge of first channel
   1           100            110      center frequency of first channel
   2           200            220      high-frequency edge of first channel
  ..
  23          6061           6667      low-frequency edge of highest channel (16 kHz sampling)
  24          6963           7666      center frequency of highest channel (16 kHz sampling)
  25          7998           8798      high-frequency edge of highest channel (16 kHz sampling)
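The unwarped boundary frequencies above can be reproduced to within a few Hz by the classical Davis and Mermelstein layout: boundaries spaced linearly 100 Hz apart up to 1 kHz, then five filters per octave. The sketch below is an assumption about that layout, not SPRAAK's actual design code (meldm_boundaries is a hypothetical helper name):

```python
def meldm_boundaries(n=26):
    """Approximate Davis & Mermelstein mel channel boundaries in Hz:
    100 Hz linear spacing up to 1 kHz, then 5 filters per octave."""
    step = 2.0 ** (1.0 / 5.0)  # log-spacing factor above 1 kHz
    freqs = []
    for i in range(n):
        if i <= 10:
            freqs.append(100.0 * i)                 # 0, 100, ..., 1000 Hz
        else:
            freqs.append(1000.0 * step ** (i - 10))  # log-spaced above 1 kHz
    return freqs

bounds = meldm_boundaries()
# bounds[1]  ~  100 Hz : center of the first channel
# bounds[24] ~ 6963 Hz : center of the highest channel
# bounds[25] ~ 7998 Hz : upper edge of the highest channel
warped = [f * 1.10 for f in bounds]  # the +10% warp column of the table
```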
After upward stretching for female voices, a big piece of the highest band is moved beyond Nyquist. If needed, the values for the missing components will be computed by some extrapolation mechanism. However, such extrapolation is always highly noisy; moreover, the highest frequency channel may be very unreliable from the start, as part of the information is likely to have been cut by an equipment-dependent anti-aliasing filter. Therefore, a better solution is to drop the highest channel altogether. For similar reasons, it is generally a good idea to drop the lowest frequency channel as well, unless one can consistently rely on high-fidelity audio equipment for all recordings (both during training and testing). Dropping the lowest and highest frequency channels is exactly what the last statement (PROCESS) in the [filter_bank] module does. Hence it will generate a 22-dimensional filterbank estimate, with unwarped center frequencies ranging from 200 Hz to 6061 Hz.
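The effect of the PROCESS 1:vlen-2 selection can be mimicked in Python terms (assuming, as the text implies, an inclusive zero-based index range from 1 through vlen-2):

```python
# Hypothetical illustration of the channel selection 'PROCESS 1:vlen-2':
# keep elements 1 .. vlen-2 inclusive, dropping the first and last channel.
filterbank = list(range(24))   # 24 mel channels, indices 0..23
vlen = len(filterbank)
kept = filterbank[1:vlen - 1]  # Python's exclusive end -> indices 1..22
```

This leaves the expected 22 channels out of the original 24.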
While linear warping is available in many software packages, the way in which the vocal tract length is estimated varies widely. Many systems use an open-loop type of estimation: a set of possible warping factors is hypothesized, and the one that maximizes the probability of the test sentence against a vocal-tract-neutral model (either a full-blown recognition model or a neutral GMM) is selected.
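Such an open-loop grid search can be sketched as follows (toy code, not SPRAAK: the model and feature extractor below are stand-ins, and the candidate grid is an assumption):

```python
# Open-loop VTLN estimation sketch: score each candidate warping factor
# against a vocal-tract-neutral model and keep the best one.

def neutral_loglik(features):
    """Stand-in for the log-likelihood of a neutral GMM or recognizer."""
    return -sum(f * f for f in features)  # toy model: favors small features

def extract_features(signal, warp):
    """Stand-in for feature extraction with a warped filterbank."""
    return [s / warp for s in signal]     # toy warping

def grid_search_warp(signal,
                     candidates=(0.85, 0.90, 0.95, 1.00, 1.05, 1.10, 1.15)):
    """Return the candidate warping factor with the highest score."""
    return max(candidates,
               key=lambda w: neutral_loglik(extract_features(signal, w)))
```

Note that this only ever yields one of the discrete candidate values, which is precisely the limitation the SPRAAK approach below avoids.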
SPRAAK advocates a different methodology that is able to estimate a continuous warping factor. This is obtained by:
This procedure has the advantage of providing a continuous warping factor estimate in a low-complexity open-loop system, without the need for a full-blown backend recognizer.
The implementation in SPRAAK consists of two parts: (i) the creation of the male/female models, and (ii) the addition of the VTL estimate to the signal processing script. Creating the male/female models is achieved by a simple training run in which a segmentation file specifies where male and female speech can be found in the corpus. It is generally beneficial to have an additional model for noise, so that the male and female models do not have to account for it. However, if you have no easy way of bootstrapping your segmentation, then assigning a male or female label to a file as a whole (including the noise segments) will normally do a decent job as well.
For the training of the gender models, we suggest that you use following settings:
trainer.untied(niter=3,ngps=128)
This will:
The next thing to do is to put the VTL estimation, using these male/female models, into the signal processing script.
[vtl_estim]
    nfr        -1              process till the end of the file
    no_update                  no intermediate updates
    weight     0.5 partial     the likelihoods are raised to the power 1/(nfr/0.5) before being converted to posteriors
    wf         0.85 1.15 1.0   the warping factors for the pure models (posteriors equal to 1.0)
    models     mf.mvg          the GMMs for male/female/noise (same order as the 'wf'!)
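The continuous warping factor implied by this description can be sketched as a posterior-weighted combination of the pure-model factors: the accumulated model log-likelihoods are flattened (raised to the power 1/(nfr/weight)), converted to posteriors, and used to interpolate the 'wf' values. This is an assumption-laden sketch of the computation, not the spr_vtl_estim implementation:

```python
import math

def continuous_warp(logliks, wf=(0.85, 1.15, 1.0), nfr=100, weight=0.5):
    """Posterior-weighted warping factor (sketch).
    logliks : total log-likelihoods of the male/female/noise GMMs over
              nfr frames (same order as wf).
    The likelihoods are raised to the power 1/(nfr/weight) before being
    converted to posteriors, as described for 'weight ... partial'."""
    power = 1.0 / (nfr / weight)
    scaled = [ll * power for ll in logliks]       # flatten, in the log domain
    m = max(scaled)                               # stabilize the softmax
    post = [math.exp(s - m) for s in scaled]
    z = sum(post)
    post = [p / z for p in post]                  # posteriors, sum to 1
    return sum(p * w for p, w in zip(post, wf))   # interpolated warp factor
```

A model with posterior 1.0 yields exactly its pure warping factor; mixed posteriors yield intermediate, continuous values.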
For implementing VTLN we use a new signal processing script:
mfcc39_vtln.ssp
Here we see a few new functionalities of the SSP scripting language. After computing the FFT power spectrum, dual-stream processing is initiated by copying the power spectrum to a buffer 'POWSPEC'. The first stream computes standard cepstral features, from which warping factors are derived; these are in turn stored in a buffer 'WF'. Then the 'POWSPEC' buffer is moved back to the main processing stream and a warped mel spectrum is computed, with the warping factor buffer 'WF' entered as an extra parameter in the mel frequency filterbank design. Processing then continues as in standard cepstral processing.
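The dual-stream flow can be summarized schematically (toy Python, with placeholder stubs; none of these function names are SPRAAK API):

```python
# Toy schematic of the dual-stream flow in mfcc39_vtln.ssp.
# Every helper below is a trivial stand-in for the real processing step.

def fft_power_spectrum(samples):
    return [s * s for s in samples]        # stub for the FFT power spectrum

def mel_cepstrum(powspec):
    return powspec                         # stub for standard MFCC features

def vtl_estimate(cepstra):
    return 1.05                            # stub: one warping factor ('WF')

def mel_filterbank(powspec, warp):
    return [p * warp for p in powspec]     # stub for the warped filterbank

def process_utterance(samples):
    powspec = fft_power_spectrum(samples)  # buffered as 'POWSPEC'
    # Stream 1: standard cepstra -> warping factor, stored in buffer 'WF'
    wf = vtl_estimate(mel_cepstrum(powspec))
    # Stream 2: 'POWSPEC' returns to the main stream; the warped mel
    # filterbank takes 'WF' as an extra design parameter
    return mel_filterbank(powspec, warp=wf)
```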
A few extra remarks are in order:
We're using the standard 39-dimensional mel cepstral coefficients as features for the male/female GMMs. A training setup for the male/female models is given. Please copy the mf.config and mf.csh files from the "./scripts" to your "./exp" directory:
cd ~/MyTimit/exp
cp ../scripts/mf.* .
Before creating the models (by running mf.csh), have a look at the configuration file and all its resources in the "./resources" directory:
MF.ci MF.cd MF.dic train_MF.cor train_MF.seg
You can see that actually three models will be trained: M(ale), F(emale), and #(silence). Also have a look at the call to the trainer in the .config file, all the way at the bottom.
trainer.untied(niter=3 , ngps=128)
An experiment on TIMIT is defined by
e2.csh e2.config e2.ini
in the ./scripts directory. Copy these files from the scripts directory to your experiment directory. Make sure that the relative path names for the male/female models remain correct: if you changed model names in the mf-experiment or ran it in a different directory, you will need to make some changes; otherwise the settings should be fine. The performance improvement on this particular experiment is small (on the order of 3% relative only); however, on other tasks the performance improvement due to VTLN may be 5-10% relative.
Expected execution times on a contemporary (2010) dual-core machine:
Creation of Male/Female models: 15'
Training:                       60'
Evaluation:                     10'
Filenames of all files directly involved in the experiment are given. Filenames are relative to the local experiment directory.
SETUP:
    mf.csh                              master script for making M/F models
    mf.config                           mf configuration file
    e2.csh                              master script for running training+evaluation
    e2.config                           config file for training
    e2.ini                              config file for evaluation
OTHER INPUT FILES/DIRECTORIES USED:
    ../resources/mfcc39.ssp             mel frequency feature extraction (used for MF estimation)
    ../resources/mfcc39_vtln.ssp        mel frequency + VTLN feature extraction (used in main experiment)
    ../resources/MF.ci                  MF phone alphabet
    ../resources/MF.cd                  MF state definition file
    ../resources/MF.dic                 MF lexicon
    ../resources/timit51.ci             phone alphabet
    ../resources/timit51.cd             state definition file
    ../resources/timit51.dic            lexicon
    ../resources/train_MF.seg           MF segmentation file (derived from hand labels)
    ../resources/train_hand_states.seg  segmentation file (hand labeled)
    ../resources/train.cor              specification of the training corpus
    ../resources/test.39.cor            specification of the test corpus
    ../dbase/---                        database with speech waveform files
GENERATED FILES (primary):
    e2_.log                             log file, contains logging information on the experiment
    e2_recovery.log                     recovery log file, contains recovery points for automatic restart
    e2_m1/acmod.{mvg,hmm,sel}           acoustic model files
    e2_.RES                             result file
GENERATED FILES (supporting):
    e2_.CMD                             commands sent to spr_cwr_main.c during evaluation
    e2_.OUT                             output generated by spr_cwr_main.c
    e2_m1/acmod.x.xxx                   acoustic model files at the end of a minor iteration