SPRAAK
An MDT mask is essentially a frequency-dependent voice activity detector (VAD). For reasonably stationary noise, methods based on second-order statistics yield good results; for non-stationary noise, other cues are needed. This section describes a number of mask estimators that assume the speech is corrupted by additive background noise and by smooth (short-impulse-response) linear filtering. This is where the flexibility of the MDT framework can be exploited to its full potential: by designing a different mask estimator, different types of speech degradation can be handled. For example, spatial information from multiple microphones can be exploited to separate the target speaker from the background, or spectro-temporal knowledge about the noise can be used.
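As a minimal illustration of a second-order-statistics mask estimator for stationary noise (a sketch for this documentation, not SPRAAK's actual implementation; the function name, the noise-only leading frames, and the threshold are all assumptions): estimate the noise power spectrum from leading frames, compute the per-cell SNR, and threshold it into a binary reliability mask.

```python
import numpy as np

def snr_mask(power_spec, n_noise_frames=10, snr_thresh_db=0.0):
    """Binary speech-presence mask from the local a-posteriori SNR.

    power_spec: (n_frames, n_bins) power spectrogram.
    The noise PSD is estimated from the first n_noise_frames, which are
    assumed to contain noise only (a common, simplistic choice).
    Returns True where the cell is deemed reliable (speech dominates).
    """
    noise_psd = power_spec[:n_noise_frames].mean(axis=0)
    # a-posteriori SNR per time-frequency cell, in dB
    snr_db = 10.0 * np.log10(power_spec / np.maximum(noise_psd, 1e-12))
    return snr_db > snr_thresh_db
```

Such an estimator works well exactly when the noise statistics change slowly relative to the speech, which is why the non-stationary case calls for the additional cues described below.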
Harmonicity is a strong cue for detecting speech presence in a noisy signal. To detect harmonicity, the signal is decomposed into a harmonic and a random part [1]. The provided implementation follows a frequency-domain variant of the idea of [1]. In [1] the frame length is set dynamically to two pitch periods, whereas SPRAAK works with fixed frame lengths; hence the way the harmonic decomposition is carried out was changed. It consists of the following steps:
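A rough frequency-domain sketch of such a harmonic/random decomposition is given below. The comb-style bin selection, the tolerance parameter, and the assumption that a pitch estimate is available are all illustrative choices, not SPRAAK's actual steps:

```python
import numpy as np

def harmonic_random_split(frame, f0, fs, tol=0.2):
    """Split one windowed frame's spectrum into a harmonic and a random part.

    frame: time-domain samples; f0: pitch estimate in Hz (assumed given
    by a separate pitch tracker); fs: sample rate in Hz.
    Bins within tol*f0 of a pitch harmonic are kept as the harmonic part;
    the remainder of the spectrum is the random part.
    """
    spec = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # distance of each bin to the nearest multiple of the pitch
    dist = np.abs(freqs - f0 * np.round(freqs / f0))
    harm = np.where(dist <= tol * f0, spec, 0.0)
    rand = spec - harm
    return harm, rand
```

Because the frame length is fixed, the comb spacing in bins varies with the pitch, which is precisely the complication a fixed-frame-length system has to handle compared with the two-pitch-period frames of [1].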
Harmonicity masks can be improved by constraining the harmonic and random log-spectra of clean speech to a codebook (see [2]). This codebook is trained on harmonically decomposed (with mean normalization, see -ssp) clean data using the program spr_mdt_make_vq.py (see the reference manual). At run time, silence frames are detected by a global VAD algorithm, i.e. one that uses information over all frequency bands and is therefore more robust than local per-band decisions; for these frames the harmonic and random spectra are constrained to a silence codebook. Hence, spr_mdt_make_vq.py places the -Nmaxs codebook entries for -sil first in a codebook of total size -Nc. In fact, it generates a maximum of -Nmax clusters per broad phonetic class (as defined in -u).
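The resulting codebook layout can be sketched as follows (a simplified reimplementation of the assumed behaviour of spr_mdt_make_vq.py, for illustration only; the function names and the plain k-means clustering are assumptions):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means; returns a (k, dim) codebook for the data X."""
    rng = np.random.default_rng(seed)
    cb = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign every vector to its nearest centroid, then re-estimate
        d = ((X[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        lab = d.argmin(1)
        for j in range(k):
            if (lab == j).any():
                cb[j] = X[lab == j].mean(0)
    return cb

def build_mask_codebook(frames, classes, n_sil, n_per_class):
    """Mimic the codebook layout described above: the silence clusters
    come first, followed by the clusters of every other broad phonetic
    class, concatenated into one codebook."""
    parts = [kmeans(frames[classes == "sil"], n_sil)]
    for c in sorted(set(classes) - {"sil"}):
        parts.append(kmeans(frames[classes == c], n_per_class))
    return np.vstack(parts)
```

Placing the silence clusters first is what lets the run-time VAD restrict the codebook search to the leading -Nmaxs entries when a frame is classified as silence.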
Note that the quality of the segmentation file used for making the codebook is not crucial: the segmentation only serves to divide the data into several classes, one of them being silence. The program allows one to use context-independent phones as classes or to work with broad phonetic classes. Hence the segmentation file does not have to be at the HMM (context-dependent) state level, although it can be. To illustrate these laxer constraints on the segmentation file, the initial segmentation file for training the back-end models was reused in the code above.
Delta masks are ternary masks (see the background section on MDT in this documentation). They are computed according to the "acoustic" expression: the mask for the delta coefficients is the delta of the static mask, which corresponds to a weighted voting scheme over the frames that contribute to each delta. The delta mask computation is implemented in spr_sigp.
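One reading of this weighted voting scheme can be sketched as follows. The static mask cells that enter a delta coefficient vote with the magnitude of their regression weight, and the vote is quantized to three levels. The vote thresholds and the {-1, 0, 1} encoding are assumptions for illustration; consult spr_sigp for SPRAAK's exact rule.

```python
import numpy as np

def delta_mask(static_mask, n=2, hi=0.7, lo=0.3):
    """Ternary delta mask from a weighted vote of the static mask.

    static_mask: (n_frames, n_bins), 1 = reliable, 0 = unreliable.
    Each delta coefficient spans 2n+1 frames; each frame votes with the
    magnitude of its delta regression weight.  The vote is quantized to
    1 (reliable), 0 (intermediate) or -1 (unreliable); the thresholds
    hi/lo are illustrative assumptions.
    """
    w = np.abs(np.arange(-n, n + 1, dtype=float))
    w /= w.sum()
    # replicate the edge frames so the vote is defined at the borders
    padded = np.pad(static_mask.astype(float), ((n, n), (0, 0)), mode="edge")
    vote = np.stack([w @ padded[i:i + 2 * n + 1]
                     for i in range(len(static_mask))])
    return np.where(vote >= hi, 1, np.where(vote <= lo, -1, 0))
```

With this scheme a delta cell is only marked reliable when most of the (strongly weighted) static cells it depends on are reliable, which is the intuition behind computing the mask for the delta as the delta of the mask.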
[1] Hugo Van hamme. Robust Speech Recognition using Cepstral Domain Missing Data Techniques and Noisy Masks. In Proc. International Conference on Acoustics, Speech and Signal Processing, volume I, pages 213–216, Montreal, Canada, May 2004.
[2] Maarten Van Segbroeck and Hugo Van hamme. Vector-Quantization based Mask Estimation for Missing Data Automatic Speech Recognition. In Proc. INTERSPEECH, pages 910–913, Antwerp, Belgium, August 2007.