SPRAAK
Vocal tract length normalization (VTLN) is one of the easiest ways of doing fast speaker adaptation. The underlying idea is simple: resonances in an acoustic tube (such as the vocal tract) are inversely proportional to the length of the tube. Thus, as female vocal tracts are 10-15% shorter than male vocal tracts, female formant positions are higher than the equivalent male formant positions. Given the all in all moderate variation in vocal tract length, the effect of vocal tract length variation can be modeled well by a linear warping of the frequency axis. Hence, by warping a spectrum by a speaker-specific warping factor, typically towards a global average vocal tract length, we obtain a spectral estimate that is 'normalized' with respect to vocal tract length. By normalizing out this physiological influence, the obtained spectral estimates are more homogeneous across speakers and hence more suitable for recognizing the acoustic-phonetic content. The linear warping itself can be incorporated into the filterbank that converts from linear frequency to mel frequency (or some other frequency scale); this is also how it is implemented in SPRAAK.
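The inverse relation between tube length and resonance frequency can be made concrete with a small sketch (illustration only, not SPRAAK code; the tube lengths are rough textbook values):

```python
# Resonances of a uniform acoustic tube closed at one end:
#   F_k = (2k - 1) * c / (4 * L)
# A shorter tube raises every resonance by the same factor, which is why
# a single linear warp of the frequency axis can map speakers onto each other.
C = 343.0  # speed of sound in air, m/s

def tube_resonances(length_m, n=3):
    """First n resonance frequencies (Hz) of a uniform tube of given length."""
    return [(2 * k - 1) * C / (4 * length_m) for k in range(1, n + 1)]

male = tube_resonances(0.17)    # ~17 cm male vocal tract
female = tube_resonances(0.15)  # ~12% shorter female vocal tract

# The ratio female/male is the same for every resonance: a single
# frequency-axis scale factor of L_male / L_female.
ratios = [f / m for f, m in zip(female, male)]
```

Because the ratio is constant across all resonances, one scalar warping factor per speaker suffices.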
Warping by scaling the filterbank has one potentially negative side-effect, which is however easy to overcome. For a female voice, frequencies are warped to higher values. This implies that the highest frequencies (just below Nyquist) are moved beyond the Nyquist frequency and are hence not available in the digital signal.
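A quick sanity check of this effect, using the 16 kHz sampling rate and the 7998 Hz upper edge of the highest channel quoted later in this section:

```python
# Illustration: with 16 kHz sampling, a 10% upward warp pushes the top
# of the highest mel channel past the Nyquist frequency, where no
# information is available in the digital signal.
fs = 16000.0
nyquist = fs / 2.0             # 8000 Hz
top_edge = 7998.0              # upper edge of the highest unwarped channel
warp = 1.10                    # typical upward warp for a female voice
warped_edge = top_edge * warp  # 8797.8 Hz, beyond Nyquist
```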
Warping can be implemented straightforwardly as an option in the [filter_bank] signal processing block. Either a fixed warping factor can be specified for a whole file/stream, or the warping factor can be passed in a frame-synchronous stream.
[filter_bank] scale MELDM bank TRI output dB LWfixed 1.07 warp PROCESS 1:vlen-2
The above filterbank description (sspmod_filter_bank.c) specifies:
Especially the last line needs further clarification. The [filter_bank] module in SPRAAK performs its design 'unwarped': it creates all filters that fit between 0 Hz and the Nyquist frequency for the unwarped case. Using the default Davis and Mermelstein mel scale and triangular filter shapes, this implies 24 channels, with the first one having a center frequency of 100 Hz and the last one a center frequency of 6963 Hz.
boundary   frequency   frequency + 10% warp
 index       (in Hz)        (in Hz)
   0             0              0      low-frequency edge of first channel
   1           100            110      center frequency of first channel
   2           200            220      high-frequency edge of first channel
  ..
  23          6061           6667      low-frequency edge of highest channel (16 kHz sampling)
  24          6963           7666      center frequency of highest channel (16 kHz sampling)
  25          7998           8798      high-frequency edge of highest channel (16 kHz sampling)
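The unwarped boundary frequencies above can be reproduced to within a few Hz by the classical Davis and Mermelstein layout: boundaries spaced linearly 100 Hz apart up to 1 kHz, then five filters per octave. The sketch below is an assumption about that layout, not SPRAAK's actual design code (meldm_boundaries is a hypothetical helper name):

```python
def meldm_boundaries(n=26):
    """Approximate Davis & Mermelstein mel channel boundaries in Hz:
    100 Hz linear spacing up to 1 kHz, then 5 filters per octave."""
    step = 2.0 ** (1.0 / 5.0)  # log-spacing factor above 1 kHz
    freqs = []
    for i in range(n):
        if i <= 10:
            freqs.append(100.0 * i)                 # 0, 100, ..., 1000 Hz
        else:
            freqs.append(1000.0 * step ** (i - 10))  # log-spaced above 1 kHz
    return freqs

bounds = meldm_boundaries()
# bounds[1]  ~  100 Hz : center of the first channel
# bounds[24] ~ 6963 Hz : center of the highest channel
# bounds[25] ~ 7998 Hz : upper edge of the highest channel
warped = [f * 1.10 for f in bounds]  # the +10% warp column of the table
```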
After upward stretching for female voices, a big piece of the highest band is moved beyond Nyquist. If needed, the values for the missing components will be computed by some extrapolation mechanism. However, such extrapolation is always highly noisy; moreover, the highest frequency channel may be very unreliable from the start, as part of the information is likely to have been cut by an equipment-dependent anti-aliasing filter. Therefore, a better solution is to drop the highest channel altogether. For similar reasons, it is generally a good idea to drop the lowest frequency channel as well, unless one can consistently rely on high-fidelity audio equipment for all recordings (both during training and testing). Dropping the lowest and highest frequency channels is exactly what the last statement (PROCESS) in the [filter_bank] module does. Hence it will generate a 22-dimensional filterbank estimate, with unwarped center frequencies ranging from 200 Hz to 6061 Hz.
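The effect of the PROCESS 1:vlen-2 selection can be mimicked in Python terms (assuming, as the text implies, an inclusive zero-based index range from 1 through vlen-2):

```python
# Hypothetical illustration of the channel selection 'PROCESS 1:vlen-2':
# keep elements 1 .. vlen-2 inclusive, dropping the first and last channel.
filterbank = list(range(24))   # 24 mel channels, indices 0..23
vlen = len(filterbank)
kept = filterbank[1:vlen - 1]  # Python's exclusive end -> indices 1..22
```

This leaves the expected 22 channels out of the original 24.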
While linear warping is available in many software packages, the way in which the vocal tract length is estimated varies widely. Many systems use an open-loop type of estimation: a set of possible warping factors is hypothesized, and the one that maximizes the probability of the test sentence against a vocal-tract-neutral model (either a full-blown recognition model or a neutral GMM) is selected.
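Such an open-loop grid search can be sketched as follows (toy code, not SPRAAK: the model and feature extractor below are stand-ins, and the candidate grid is an assumption):

```python
# Open-loop VTLN estimation sketch: score each candidate warping factor
# against a vocal-tract-neutral model and keep the best one.

def neutral_loglik(features):
    """Stand-in for the log-likelihood of a neutral GMM or recognizer."""
    return -sum(f * f for f in features)  # toy model: favors small features

def extract_features(signal, warp):
    """Stand-in for feature extraction with a warped filterbank."""
    return [s / warp for s in signal]     # toy warping

def grid_search_warp(signal,
                     candidates=(0.85, 0.90, 0.95, 1.00, 1.05, 1.10, 1.15)):
    """Return the candidate warping factor with the highest score."""
    return max(candidates,
               key=lambda w: neutral_loglik(extract_features(signal, w)))
```

Note that this only ever yields one of the discrete candidate values, which is precisely the limitation the SPRAAK approach below avoids.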
SPRAAK advocates a different methodology that is able to estimate a continuous warping factor. This is obtained by:
This procedure has the advantage of providing a continuous warping factor estimate in a low-complexity open-loop system, without the need for a full-blown backend recognizer.
The implementation in SPRAAK consists of two parts: (i) the creation of the male/female models, and (ii) the addition of the VTL estimate to the signal processing script. Creating the male/female models is achieved by a simple training run in which a segmentation file specifies where male and female speech can be found in the corpus. It is generally beneficial to have an additional model for noise, so that the male and female models do not have to account for it. However, if you have no easy way of bootstrapping your segmentation, then assigning a male or female label to a file as a whole (including the noise segments) will normally do a decent job as well.
For the training of the gender models, we suggest that you use following settings:
trainer.untied(niter=3,ngps=128)
This will:
The next thing to do is to put the VTL estimation, using these male/female models, into the signal processing script.
[vtl_estim]
    nfr        -1              process till the end of the file
    no_update                  no intermediate updates
    weight     0.5 partial     the likelihoods are raised to the power 1/(nfr/0.5) before being converted to posteriors
    wf         0.85 1.15 1.0   the warping factors for the pure models (posteriors equal to 1.0)
    models     mf.mvg          the GMMs for male/female/noise (same order as the 'wf'!)
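The continuous warping factor implied by this description can be sketched as a posterior-weighted combination of the pure-model factors: the accumulated model log-likelihoods are flattened (raised to the power 1/(nfr/weight)), converted to posteriors, and used to interpolate the 'wf' values. This is an assumption-laden sketch of the computation, not the spr_vtl_estim implementation:

```python
import math

def continuous_warp(logliks, wf=(0.85, 1.15, 1.0), nfr=100, weight=0.5):
    """Posterior-weighted warping factor (sketch).
    logliks : total log-likelihoods of the male/female/noise GMMs over
              nfr frames (same order as wf).
    The likelihoods are raised to the power 1/(nfr/weight) before being
    converted to posteriors, as described for 'weight ... partial'."""
    power = 1.0 / (nfr / weight)
    scaled = [ll * power for ll in logliks]       # flatten, in the log domain
    m = max(scaled)                               # stabilize the softmax
    post = [math.exp(s - m) for s in scaled]
    z = sum(post)
    post = [p / z for p in post]                  # posteriors, sum to 1
    return sum(p * w for p, w in zip(post, wf))   # interpolated warp factor
```

A model with posterior 1.0 yields exactly its pure warping factor; mixed posteriors yield intermediate, continuous values.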
For implementing VTLN we use a new signal processing script:
mfcc39_vtln.ssp
Here we see a few new functionalities of the SSP scripting language. After computing the FFT power spectrum, dual-stream processing is initiated by copying the power spectrum to a buffer 'POWSPEC'. The first stream computes standard cepstral features, from which warping factors are derived; these are in turn stored in a buffer 'WF'. Then the 'POWSPEC' buffer is moved back to the main processing stream and a warped mel spectrum is computed, with the warping factor buffer 'WF' entered as an extra parameter in the mel frequency filterbank design. Processing then continues as in standard cepstral processing.
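The dual-stream flow can be summarized schematically (toy Python, with placeholder stubs; none of these function names are SPRAAK API):

```python
# Toy schematic of the dual-stream flow in mfcc39_vtln.ssp.
# Every helper below is a trivial stand-in for the real processing step.

def fft_power_spectrum(samples):
    return [s * s for s in samples]        # stub for the FFT power spectrum

def mel_cepstrum(powspec):
    return powspec                         # stub for standard MFCC features

def vtl_estimate(cepstra):
    return 1.05                            # stub: one warping factor ('WF')

def mel_filterbank(powspec, warp):
    return [p * warp for p in powspec]     # stub for the warped filterbank

def process_utterance(samples):
    powspec = fft_power_spectrum(samples)  # buffered as 'POWSPEC'
    # Stream 1: standard cepstra -> warping factor, stored in buffer 'WF'
    wf = vtl_estimate(mel_cepstrum(powspec))
    # Stream 2: 'POWSPEC' returns to the main stream; the warped mel
    # filterbank takes 'WF' as an extra design parameter
    return mel_filterbank(powspec, warp=wf)
```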
A few extra remarks are in order:
We're using the standard 39-dimensional mel cepstral coefficients as features for the male/female GMMs. A training setup for the male/female models is given. Please copy the mf.config and mf.csh files from the "./scripts" to your "./exp" directory:
cd ~/MyTimit/exp
cp ../scripts/mf.* .
Before creating the models (by running mf.csh), have a look at the configuration file and all its resources in the "./resources" directory:
MF.ci MF.cd MF.dic train_MF.cor train_MF.seg
You can see that actually three models will be trained: M(ale), F(emale), and #(silence). Also have a look at the call to the trainer in the .config file, all the way at the bottom.
trainer.untied(niter=3 , ngps=128)
An experiment on TIMIT is defined by
e2.csh e2.config e2.ini
in the ./scripts directory. Copy these files from the scripts directory to your experiment directory. Make sure that the relative path names for the male/female models remain correct: if you changed model names in the mf-experiment or ran it in a different directory, you will need to make some changes; otherwise the settings should be fine. The performance improvement on this particular experiment is small (on the order of 3% relative only); however, on other tasks the performance improvement due to VTLN may be 5-10% relative.
Expected execution times on a contemporary (2010) dual-core machine:
Creation of Male/Female models: 15'
Training:                       60'
Evaluation:                     10'
Filenames of all files directly involved in the experiment are given. Filenames are relative to the local experiment directory.
SETUP:
    mf.csh                              master script for making M/F models
    mf.config                           mf configuration file
    e2.csh                              master script for running training+evaluation
    e2.config                           config file for training
    e2.ini                              config file for evaluation
OTHER INPUT FILES/DIRECTORIES USED:
    ../resources/mfcc39.ssp             mel frequency feature extraction (used for MF estimation)
    ../resources/mfcc39_vtln.ssp        mel frequency + VTLN feature extraction (used in main experiment)
    ../resources/MF.ci                  MF phone alphabet
    ../resources/MF.cd                  MF state definition file
    ../resources/MF.dic                 MF lexicon
    ../resources/timit51.ci             phone alphabet
    ../resources/timit51.cd             state definition file
    ../resources/timit51.dic            lexicon
    ../resources/train_MF.seg           MF segmentation file (derived from hand labels)
    ../resources/train_hand_states.seg  segmentation file (hand labeled)
    ../resources/train.cor              specification of the training corpus
    ../resources/test.39.cor            specification of the test corpus
    ../dbase/---                        database with speech waveform files
GENERATED FILES (primary):
    e2_.log                             log file, contains logging information on the experiment
    e2_recovery.log                     recovery log file, contains recovery points for automatic restart
    e2_m1/acmod.{mvg,hmm,sel}           acoustic model files
    e2_.RES                             result file
GENERATED FILES (supporting):
    e2_.CMD                             commands sent to spr_cwr_main.c during evaluation
    e2_.OUT                             output generated by spr_cwr_main.c
    e2_m1/acmod.x.xxx                   acoustic model files at the end of a minor iteration