SPRAAK
An MDT mask is essentially a frequency-dependent voice activity detector (VAD). For reasonably stationary noise, methods based on second-order statistics yield good results; for non-stationary noise, other cues are needed. This section describes a number of mask estimators that assume the speech is corrupted by additive background noise and by smooth (short-impulse-response) linear filtering. This is where the flexibility of the MDT framework can be exploited to its full potential: by designing a different mask estimator, different types of speech degradation can be handled. For example, spatial information from multiple microphones can be exploited to separate the target speaker from the background, or spectro-temporal knowledge about the noise can be used.
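As a minimal illustration of a second-order-statistics mask estimator for stationary noise (a sketch for this documentation, not SPRAAK's actual implementation; the function name, the noise-only leading frames, and the threshold are all assumptions): estimate the noise power spectrum from leading frames, compute the per-cell SNR, and threshold it into a binary reliability mask.

```python
import numpy as np

def snr_mask(power_spec, n_noise_frames=10, snr_thresh_db=0.0):
    """Binary speech-presence mask from the local a-posteriori SNR.

    power_spec: (n_frames, n_bins) power spectrogram.
    The noise PSD is estimated from the first n_noise_frames, which are
    assumed to contain noise only (a common, simplistic choice).
    Returns True where the cell is deemed reliable (speech dominates).
    """
    noise_psd = power_spec[:n_noise_frames].mean(axis=0)
    # a-posteriori SNR per time-frequency cell, in dB
    snr_db = 10.0 * np.log10(power_spec / np.maximum(noise_psd, 1e-12))
    return snr_db > snr_thresh_db
```

Such an estimator works well exactly when the noise statistics change slowly relative to the speech, which is why the non-stationary case calls for the additional cues described below.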
Harmonicity is a strong cue for detecting speech presence in a noisy signal. To detect harmonicity, the signal is decomposed into a harmonic and a random part [1]. The provided implementation follows a frequency-domain variant of the idea of [1]. In [1] the frame length is set dynamically to two pitch periods, whereas SPRAAK works with fixed frame lengths; hence the way the harmonic decomposition is carried out was changed. It consists of the following steps:
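A rough frequency-domain sketch of such a harmonic/random decomposition is given below. The comb-style bin selection, the tolerance parameter, and the assumption that a pitch estimate is available are all illustrative choices, not SPRAAK's actual steps:

```python
import numpy as np

def harmonic_random_split(frame, f0, fs, tol=0.2):
    """Split one windowed frame's spectrum into a harmonic and a random part.

    frame: time-domain samples; f0: pitch estimate in Hz (assumed given
    by a separate pitch tracker); fs: sample rate in Hz.
    Bins within tol*f0 of a pitch harmonic are kept as the harmonic part;
    the remainder of the spectrum is the random part.
    """
    spec = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # distance of each bin to the nearest multiple of the pitch
    dist = np.abs(freqs - f0 * np.round(freqs / f0))
    harm = np.where(dist <= tol * f0, spec, 0.0)
    rand = spec - harm
    return harm, rand
```

Because the frame length is fixed, the comb spacing in bins varies with the pitch, which is precisely the complication a fixed-frame-length system has to handle compared with the two-pitch-period frames of [1].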
Harmonicity masks can be improved by constraining the harmonic and random log-spectra of clean speech to a codebook (see [2]). This codebook is trained on harmonically decomposed (with mean normalization, see -ssp) clean data using the program spr_mdt_make_vq.py (see the reference manual). At run time, silence frames are detected by a global VAD algorithm, i.e. one that uses information over all frequency bands and is therefore more robust than local per-band decisions; for these frames the harmonic and random spectra are constrained to a silence codebook. Hence, spr_mdt_make_vq.py places the -Nmaxs codebook entries for -sil first in a codebook of total size -Nc. In fact, it generates a maximum of -Nmax clusters per broad phonetic class (as defined in -u).
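The resulting codebook layout can be sketched as follows (a simplified reimplementation of the assumed behaviour of spr_mdt_make_vq.py, for illustration only; the function names and the plain k-means clustering are assumptions):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means; returns a (k, dim) codebook for the data X."""
    rng = np.random.default_rng(seed)
    cb = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign every vector to its nearest centroid, then re-estimate
        d = ((X[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        lab = d.argmin(1)
        for j in range(k):
            if (lab == j).any():
                cb[j] = X[lab == j].mean(0)
    return cb

def build_mask_codebook(frames, classes, n_sil, n_per_class):
    """Mimic the codebook layout described above: the silence clusters
    come first, followed by the clusters of every other broad phonetic
    class, concatenated into one codebook."""
    parts = [kmeans(frames[classes == "sil"], n_sil)]
    for c in sorted(set(classes) - {"sil"}):
        parts.append(kmeans(frames[classes == c], n_per_class))
    return np.vstack(parts)
```

Placing the silence clusters first is what lets the run-time VAD restrict the codebook search to the leading -Nmaxs entries when a frame is classified as silence.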
Note that the quality of the segmentation file used for making the codebook is not crucial: the segmentation only serves to divide the data into several classes, one of them being silence. The program allows one to use context-independent phones as classes or to work with broad phonetic classes. Hence the segmentation file does not have to be at the HMM (context-dependent) state level, although it can be. To illustrate these laxer constraints on the segmentation file, the initial segmentation file for training the back-end models was reused in the code above.
Delta masks are ternary masks (see the background section on MDT in this documentation). They are computed according to the "acoustic" expression: the mask for the delta coefficients is the delta of the static mask, which corresponds to a weighted voting scheme over the frames that contribute to each delta. The delta mask computation is implemented in spr_sigp.
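One reading of this weighted voting scheme can be sketched as follows. The static mask cells that enter a delta coefficient vote with the magnitude of their regression weight, and the vote is quantized to three levels. The vote thresholds and the {-1, 0, 1} encoding are assumptions for illustration; consult spr_sigp for SPRAAK's exact rule.

```python
import numpy as np

def delta_mask(static_mask, n=2, hi=0.7, lo=0.3):
    """Ternary delta mask from a weighted vote of the static mask.

    static_mask: (n_frames, n_bins), 1 = reliable, 0 = unreliable.
    Each delta coefficient spans 2n+1 frames; each frame votes with the
    magnitude of its delta regression weight.  The vote is quantized to
    1 (reliable), 0 (intermediate) or -1 (unreliable); the thresholds
    hi/lo are illustrative assumptions.
    """
    w = np.abs(np.arange(-n, n + 1, dtype=float))
    w /= w.sum()
    # replicate the edge frames so the vote is defined at the borders
    padded = np.pad(static_mask.astype(float), ((n, n), (0, 0)), mode="edge")
    vote = np.stack([w @ padded[i:i + 2 * n + 1]
                     for i in range(len(static_mask))])
    return np.where(vote >= hi, 1, np.where(vote <= lo, -1, 0))
```

With this scheme a delta cell is only marked reliable when most of the (strongly weighted) static cells it depends on are reliable, which is the intuition behind computing the mask for the delta as the delta of the mask.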
[1] Hugo Van hamme. Robust Speech Recognition using Cepstral Domain Missing Data Techniques and Noisy Masks. In Proc. International Conference on Acoustics, Speech and Signal Processing, volume I, pages 213–216, Montreal, Canada, May 2004.
[2] Maarten Van Segbroeck and Hugo Van hamme. Vector-Quantization based Mask Estimation for Missing Data Automatic Speech Recognition. In Proc. INTERSPEECH, pages 910–913, Antwerp, Belgium, August 2007.