SPRAAK
This section deals with a modification to SPRAAK in order to improve its robustness to noise and possibly reverberation. It does NOT improve robustness to pronunciation variation or accents. It is based on MISSING DATA THEORY (MDT). The principles of the approach, and the points you need to be aware of as a user, are explained below.
This implementation is largely based on [1]. At each frame $t$, the log of the speech power spectrum is represented by a vector $s(t)$. If the data are perturbed by noise, this spectral vector becomes $y(t)$. In the absence of speech, the spectral vector would be $n(t)$ (due to noise only). Next to the spectral data, a mask $m(t)$ is also estimated. For now, it is assumed to be a binary vector, with $m_k(t) = 1$ meaning the spectral value $y_k(t)$ is unreliable and $m_k(t) = 0$ meaning it is reliable. The reliable data are simply copied from the noisy input, i.e. $s_k(t) = y_k(t)$ if $m_k(t) = 0$. The unreliable data will be reconstructed or imputed based on the clean speech model. (Notice that there is also an MDT method in which the unreliable data are marginalized out.) In the SPRAAK implementation, this imputation is guided by the search process, i.e. the missing data are reconstructed for each assumption the back-end makes about the speech identity. In this context, an assumption means the visited state and the Gaussian selected from its mixture. The missing data are reconstructed based on the maximum likelihood principle, i.e. the most likely values of the unreliable spectral components are sought. The negative log-likelihood of a Gaussian model with mean $\mu$ and covariance $\Sigma$ is a quadratic (up to a constant):

(1)   $\frac{1}{2} (s(t) - \mu)^T \Sigma^{-1} (s(t) - \mu)$

which is to be minimized over the unreliable components of $s(t)$. However, there are useful constraints that can be imposed on $s(t)$: since the noise is additive in the power spectral domain, the clean speech energy cannot exceed the observed noisy energy. Though this is an approximation neglecting the phase relations between the spectra, MDT assumes:

(2)   $s_k(t) \le y_k(t)$ for all $k$ with $m_k(t) = 1$.

Hence, (1) is to be minimized over the unreliable components of $s(t)$ under constraint (2), and this for every Gaussian and for every frame. Notice that in (1), $\Sigma^{-1}$ is not a diagonal matrix, since $s(t)$ is a vector of log-spectra and not cepstra (or any other diagonalizing transform). In SPRAAK, the constrained optimization (1)+(2) is solved with a gradient descent method, which is computationally efficient and requires only a few iterations to converge to a sufficiently accurate solution (controlled by the parameter 'MDTiter=<nr>;' in the MDT options string; the default value is 2; 0 means trivial spectral imputation only).
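As an illustration of this step, below is a minimal numpy sketch of a projected gradient descent on (1) under constraint (2) for a single Gaussian. The function name, its interface and the fixed step size are assumptions made for illustration; this is not the actual SPRAAK implementation.

    import numpy as np

    def impute_missing(y, mask, mu, prec, n_iter=2):
        """Sketch of per-Gaussian MDT imputation: minimize the quadratic (1),
        0.5*(s-mu)^T prec (s-mu), over the unreliable components of s,
        subject to (2): s[k] <= y[k] wherever mask[k] == 1."""
        unrel = mask == 1
        # feasible starting point: trivial spectral imputation, i.e. the
        # Gaussian mean clipped against the noisy upper bound
        s = np.where(unrel, np.minimum(mu, y), y).astype(float)
        # a safe step size for gradient descent on a quadratic form
        step = 1.0 / np.linalg.eigvalsh(prec).max()
        for _ in range(n_iter):                        # cf. the 'MDTiter' option
            grad = prec @ (s - mu)                     # gradient of (1)
            s[unrel] -= step * grad[unrel]             # update unreliable dims only
            s[unrel] = np.minimum(s[unrel], y[unrel])  # project back onto (2)
        return s

In this sketch, n_iter plays the role of the 'MDTiter' option; with n_iter=0 it reduces to trivial spectral imputation (clipping the Gaussian mean against the noisy upper bound).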
For the delta (derivative) features, the likelihood maximization (1) remains valid, but the constraints (2) need to be changed. Due to the noise, the clean value of a delta feature can be either greater than or less than the noisy value. Therefore, a ternary mask is used:

(3)   $\Delta s_k(t) = \Delta y_k(t)$ if $\Delta m_k(t) = 0$ (reliable), $\Delta s_k(t) \le \Delta y_k(t)$ if $\Delta m_k(t) = +1$, and $\Delta s_k(t) \ge \Delta y_k(t)$ if $\Delta m_k(t) = -1$.
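As an illustration, the following sketch encodes the constraints (3) as a feasibility check. The 0/+1/-1 encoding and the helper name are assumptions made for illustration, not SPRAAK conventions.

    import numpy as np

    def delta_feasible(ds, dy, dmask):
        """Check whether imputed deltas ds satisfy the ternary-mask
        constraints (3) w.r.t. the noisy deltas dy, assuming the encoding
        0 = reliable, +1/-1 = unreliable with an upper/lower bound."""
        ok = np.ones(ds.shape, dtype=bool)
        ok[dmask == 0] = ds[dmask == 0] == dy[dmask == 0]     # copied as-is
        ok[dmask == +1] = ds[dmask == +1] <= dy[dmask == +1]  # bounded above
        ok[dmask == -1] = ds[dmask == -1] >= dy[dmask == -1]  # bounded below
        return bool(ok.all())

    # an all-zero mask marks every delta as 'reliable' and hence disables
    # MDT for the deltas (see below):
    # dmask = np.zeros(n_bands, dtype=int)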
The benefit of applying MDT to the deltas is questionable: on some data sets, no improvement was observed. To disable MDT for the deltas, simply generate an all-zero mask.
A non-MDT model would evaluate the Gaussians at the noisy data (after applying a diagonalizing transform). In MDT with Gaussian-based imputation, the evaluation of a likelihood is replaced by the non-negative least squares (NNLSQ) problem (1)+(2). Solving the NNLSQ problem requires an order of magnitude more computational effort than evaluating a Gaussian with diagonal covariance: the latter is linear in the feature vector dimension, while the former is quadratic or even cubic and requires iteration.
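For example, with an illustrative feature dimension of $d = 24$ log-spectral bands, a diagonal-covariance Gaussian evaluation costs on the order of $2d \approx 50$ multiply-adds, whereas a single gradient iteration on (1) already involves a full matrix-vector product of about $d^2 \approx 576$ multiply-adds, repeated for several iterations and for every Gaussian.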
A first optimization is to use a different feature set. In [2], it is shown that the NNLSQ problem can be solved with significantly less effort if the log-spectra are (linearly) transformed to the ProSpect representation, without any impact on accuracy.
To cut computational costs further, a multi-candidate approach was designed. The idea is to constrain $s(t)$ to a discrete set of candidate values, generated by solving the NNLSQ problem for a set of cluster Gaussians. This has several advantages: the expensive NNLSQ optimization is solved only for the small set of cluster Gaussians, and each backend Gaussian then only needs to be evaluated at a few precomputed candidates, as described below.
The multi-candidate approach requires an association table: for each backend Gaussian, it lists the cluster Gaussians that have yielded the best candidates on training data.
The MDT implementation then works as follows: all the (ProSpect) cluster Gaussians are evaluated on the current frame using MDT. Only the L cluster Gaussians with the highest likelihood are retained (the L-cluster-M-best method of [3] or [4]). Each of these cluster Gaussians generates an imputed speech estimate. The likelihood of each backend Gaussian is then given by its likelihood evaluated at the candidate that yields the highest likelihood among the candidates listed in the association table for that Gaussian. Typically, 3 to 10 candidates are considered.
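The following sketch illustrates this evaluation scheme for one frame. The data structures and the score/impute interface of the cluster Gaussians are illustrative assumptions (compare with the imputation sketch above), as is the diagonal-covariance evaluation of the backend Gaussians; this is not the SPRAAK API.

    import numpy as np

    def log_gauss_diag(x, mu, inv_var, log_norm):
        # ordinary diagonal-covariance log-likelihood of a backend Gaussian
        return log_norm - 0.5 * np.sum((x - mu) ** 2 * inv_var)

    def mdt_frame(y, mask, clusters, backend, assoc, L=10):
        """Sketch of the L-cluster-M-best evaluation for one frame.
        clusters : objects with .score(y, mask) giving the MDT likelihood and
                   .impute(y, mask) giving the imputed spectrum (solving (1)+(2))
        backend  : list of (mu, inv_var, log_norm) tuples
        assoc    : assoc[g] = cluster indices listed in the association table
                   for backend Gaussian g
        """
        # 1. evaluate all cluster Gaussians with MDT, keep the L best ones
        scores = [c.score(y, mask) for c in clusters]
        best = set(np.argsort(scores)[-L:])
        # 2. each surviving cluster Gaussian yields one imputed candidate
        cand = {i: clusters[i].impute(y, mask) for i in best}
        # 3. score each backend Gaussian at its best associated candidate
        result = []
        for g, (mu, inv_var, log_norm) in enumerate(backend):
            lls = [log_gauss_diag(cand[i], mu, inv_var, log_norm)
                   for i in assoc[g] if i in best]
            result.append(max(lls) if lls else -np.inf)
        return result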
Linear filtering (different microphones, bandwidth limitations, ...) has a detrimental effect on ASR. In a non-MDT recognizer, a commonly used remedy is cepstral mean normalization (CMN), in which the mean of the cepstral vector is subtracted from the inputs. Equivalently, the mean of the log-spectrum can be subtracted from the log-spectral features. In MDT, this does not make much sense, since part of the data is assumed to be unreliable. Therefore, in SPRAAK, the channel estimate is a maximum likelihood (ML) estimate, i.e. the log-spectral vector that maximizes the likelihood of the feature stream under the backend model. The channel estimate should also be applied to the cluster Gaussians and to any speech model that is used in the mask estimator; this is for instance the case in the VQ-based mask estimator.

The ML estimate is computed via iterative optimization and needs initialization. It is suggested to initialize it from a log-spectral mean estimate, computed in the preprocessing script. During recognition, the likelihood of the data is maximized by modifying the channel estimate (an offset vector on the data or, equivalently, on the Gaussian means) along the Viterbi backtrace path (without modifying the path). The channel update happens only on the common part of all backtraces from feasible active states. (When the end of the utterance is reached, there is only one feasible state, so all data will eventually be processed.) Both the initialization and the update strategy imply that, as in CMN, a sufficient amount of speech data must be processed before the channel estimate is reliable. Therefore, accuracy may be improved by processing the same utterance repeatedly after a channel change, if the evaluation restrictions allow this.
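A minimal sketch of such an initialization, assuming the channel acts as an additive offset in the log-spectral domain; the function name and the optional voice-activity selection are illustrative, not the SPRAAK preprocessing script.

    import numpy as np

    def init_channel(logspec, speech_frames=None):
        """Initialize the channel estimate from a log-spectral mean, as
        suggested for the preprocessing script (illustrative helper).
        logspec: (n_frames, n_bands) array; speech_frames: optional boolean
        selection of frames to average over."""
        frames = logspec if speech_frames is None else logspec[speech_frames]
        return frames.mean(axis=0)

    # The estimate acts as an offset: it can be subtracted from the data,
    #   y_norm = y - channel
    # or, equivalently, added to the means of the backend Gaussians, the
    # cluster Gaussians, and any speech model used in mask estimation.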
[1] Maarten Van Segbroeck and Hugo Van hamme. Advances in Missing Feature Techniques for Robust Large Vocabulary Continuous Speech Recognition. IEEE Transactions on Audio, Speech and Language Processing, volume 19, no. 1, pages 123-137, January 2011.
[2] Hugo Van hamme. PROSPECT Features and their Application to Missing Data Techniques for Robust Speech Recognition. In Proc. International Conference on Spoken Language Processing, volume I, pages 101-104, Jeju Island, Korea, October 2004.
[3] Yujun Wang, Maarten Van Segbroeck and Hugo Van hamme. Robust Large Vocabulary Continuous Speech Recognition Based on Missing Feature Techniques. In J. Ramírez and J.M. Górriz, editors, Recent Advances in Robust Speech Recognition Technology, pages 141-154, Bentham Science Publishers, 2011. eISBN 978-1-60805-172-4.
[4] Yujun Wang and Hugo Van hamme. Speed Improvements in a Missing Data-Based Speech Recognizer. In Proceedings of NAG-DAGA 2009, Rotterdam, The Netherlands, April 2009.