Histogram based normalization of spectral features.

Functions
void	spr_histeq_free (SprSspInfo *Info)

int	spr_histeq_setup (SprSspInfo *Info, const char *descript, void *aux_info)

int	spr_histeq_process (SprSspInfo *Info, const void *frame_in, void *frame_out)

void	spr_histeq_reset (SprSspInfo *Info, SprSspStatus action)

Detailed Description

Normalization of (log) spectral features using a uni-, bi- or tri-modal gaussian model (one component for the noise, one for undecided, one for the speech) per feature, followed by model-based histogram equalization.
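For the uni-modal case, equalizing one gaussian onto another reduces to a mean and variance normalization. The following is a minimal sketch of that special case only; the type `Gauss1` and both function names are hypothetical and are not part of this module's API. With two or three components the mapping has no closed form and has to be solved numerically.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-feature gaussian (not a type from this library). */
typedef struct {
    double mu;     /* mean */
    double sigma;  /* standard deviation, must be > 0 */
} Gauss1;

/* Map x so that data distributed as `obs` becomes distributed as `tgt`.
 * For single gaussians, matching the two CDFs gives this affine map. */
double histeq_unimodal(double x, Gauss1 obs, Gauss1 tgt)
{
    assert(obs.sigma > 0.0 && tgt.sigma > 0.0);
    return tgt.mu + tgt.sigma * (x - obs.mu) / obs.sigma;
}

/* Normalize one frame of n features, each with its own model. */
void histeq_unimodal_frame(const double *in, double *out, size_t n,
                           const Gauss1 *obs, const Gauss1 *tgt)
{
    for (size_t i = 0; i < n; i++)
        out[i] = histeq_unimodal(in[i], obs[i], tgt[i]);
}
```

A target of mean 0 and standard deviation 1 turns this into plain per-feature mean and variance normalization.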

The initial parameters are calculated on the first N frames. After that, the gaussian mixture model parameters are updated for each new input frame using a simple weight (first order filter).
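The first order filter can be illustrated with an exponentially weighted update of a mean and a variance. This is a sketch under simplifying assumptions: it tracks a single gaussian, whereas a mixture would presumably scale the weight by each component's posterior for the sample. `RunStat` and `runstat_update` are hypothetical names, not part of this module.

```c
#include <assert.h>

/* Hypothetical running statistics for one feature. */
typedef struct {
    double mean;
    double var;
} RunStat;

/* First-order (exponentially weighted) update with weight alpha in (0,1].
 * Assumption: single gaussian, no posterior weighting. */
void runstat_update(RunStat *s, double x, double alpha)
{
    assert(alpha > 0.0 && alpha <= 1.0);
    double d = x - s->mean;          /* deviation from the old mean */
    s->mean += alpha * d;
    s->var = (1.0 - alpha) * (s->var + alpha * d * d);
}
```

A small weight such as 0.05 makes the statistics follow slow changes in the signal while damping the effect of individual frames.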

[histeq]
`mix_sz [3/2/1](2)`
Model the data using a gaussian mixture with <mix_sz> components.
`nfr_init <number>(100)`
Number of frames used for calculating the initial statistics. Specify -1 if all frames are to be used.
`em_init <number>(10)`
Number of EM re-estimation passes done on the initial gaussian set.
`alpha <number>(0.05)`
Weight for the incremental updates.
`ign_sil`
Ignore 'silence' frames when estimating/updating the gaussian mixture statistics (requires a proper configuration of the silence/speech detector, see 'sil_det_pre').
`eval <precision>(1e-4) [lambda_backoff](0.0) [sig0_backoff](0.0) [sig1_backoff](0.0) [sig2_backoff](0.0)`
Specify how the histogram normalization must be done. <precision> is the desired precision for the histogram equalization. Non-zero values for <lambda_backoff> and <sigX_backoff> indicate that the observed (backoff>0.0) or target (backoff<0.0) lambdas and variances should be replaced with a weighted combination of mixed and target values. The backoff values must be in the range [-1.0 ... 1.0].
`adjust [adj_lambda](0.0) [adj_sig0](0.0) [adj_sig1](0.0) [adj_sig2](0.0)`
Replace the observed lambdas and sigmas with a weighted combination of mixed and target values before doing the (incremental) EM-updates. All values must be in the range [0.0 ... 1.0].
`file_init <fname>`
Read initial statistics from the given file; specifying this option has the side-effect of setting the <nfr_init> param to 0.
`file_target <fname>`
Read the target statistics from the given file. The default targets are useful for <mix_sz>=1 only (mean+variance normalization)!
`no_reset`
Do not reset the statistics at the beginning of a new file.
`multi_spkr <N> <copy/move> <buf_name>`
Set up histeq to work in a multi-speaker environment, i.e. cepstral means for (at least) the N last (least recently used) speaker ids are calculated. The speaker ids are input from a named buffer.
`freq_smooth <number>(0) [wgt](1.0) ...`
Smooth the statistics in each frequency band by adding data from the <number> higher and lower frequency bands using weights <wgt>.
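One possible reading of this smoothing is sketched below. The layout of the weights (one weight per distance), the weight of 1 for the band itself and the handling of the edge bands are all assumptions, and `smooth_bands` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Add neighbouring-band statistics to each band.
 * acc  : per-band accumulators (e.g. sums or counts), length n_bands
 * out  : smoothed accumulators, length n_bands (must not alias acc)
 * n_nb : number of bands taken on each side (<number>)
 * wgt  : wgt[d-1] weights a neighbour at distance d (assumed layout)
 * Bands outside [0, n_bands) are skipped (assumed edge handling). */
void smooth_bands(const double *acc, double *out, size_t n_bands,
                  size_t n_nb, const double *wgt)
{
    assert(acc != out);
    for (size_t i = 0; i < n_bands; i++) {
        double sum = acc[i];                      /* the band itself */
        for (size_t d = 1; d <= n_nb; d++) {
            if (i >= d)
                sum += wgt[d - 1] * acc[i - d];   /* lower band */
            if (i + d < n_bands)
                sum += wgt[d - 1] * acc[i + d];   /* higher band */
        }
        out[i] = sum;
    }
}
```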
`split_ini <low1>(0.3) <hi1>(0.9) [low2](0.2) [hi2](0.8) [outlier](0.0) [var_range](0.0)`
The gaussian mixtures are initialized by (repeatedly) splitting the data in two parts that minimize the weighted average variance. The limits are there to prevent extremely unbalanced splits (in #points). The highest and lowest log-energies (a fraction <outlier>) are regarded as outliers and are not used when making the gaussian mixtures. If a non-zero value for the parameter <var_range> is given, then the first split will ensure that the variance of the gaussian corresponding to the high energy values (speech) is close to the target value (1.0==exact).
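A single split step could look like the sketch below. It assumes that the limits bound the fraction of points assigned to the lower part; outlier removal and the <var_range> constraint are left out. `best_split` and `cmp_double` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts data in place and returns k, the number of points assigned to the
 * lower part, chosen to minimize the weighted average of the two variances
 * (k*var_low + (n-k)*var_high)/n. Only splits with low <= k/n <= hi are
 * considered (assumed meaning of the limits). Returns 0 if none qualifies. */
size_t best_split(double *data, size_t n, double low, double hi)
{
    assert(low >= 0.0 && low <= hi && hi <= 1.0);
    qsort(data, n, sizeof *data, cmp_double);

    double tot = 0.0, tot2 = 0.0;          /* sum and sum of squares */
    for (size_t i = 0; i < n; i++) {
        tot += data[i];
        tot2 += data[i] * data[i];
    }

    size_t best_k = 0;
    double best_cost = 0.0, s = 0.0, s2 = 0.0;
    for (size_t k = 1; k < n; k++) {
        s += data[k - 1];
        s2 += data[k - 1] * data[k - 1];
        double frac = (double)k / (double)n;
        if (frac < low || frac > hi)
            continue;
        /* k*var_low = s2 - s*s/k, and likewise for the upper part */
        double hs = tot - s, hs2 = tot2 - s2;
        double cost = (s2 - s * s / (double)k)
                    + (hs2 - hs * hs / (double)(n - k));
        if (best_k == 0 || cost < best_cost) {
            best_k = k;
            best_cost = cost;
        }
    }
    return best_k;
}
```

After the call, `data[0..k-1]` holds the low-energy part and `data[k..n-1]` the high-energy part; splitting one of the parts again would give a third component.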
`sil_msk <copy/move> <buf_name>`
Label a frame (or a component in a frame) as silence (will be used to update gaussian[0] in the mixture).
`spch_msk <copy/move> <buf_name>`
Label a frame (or a component in a frame) as speech (will be used to update gaussian[<mix_sz>-1] in the mixture).
`sil_thr <copy/move> <buf_name>`
Input values smaller than or equal to the given thresholds are labeled silence.
`spch_thr <copy/move> <buf_name>`
Input values larger than or equal to the given thresholds are labeled speech.
`sil_det_pre <min_sil>(-1) <max_spch>(-1)`
Label a complete frame (== all components) as silence if at least <min_sil> components were labeled as silence and at most <max_spch> components were labeled as speech.
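The threshold labeling and the frame-level rule can be combined as in the sketch below. The `Label` enum and both function names are hypothetical; the sketch assumes that the silence threshold lies below the speech threshold and does not model the default value of -1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical component labels (not identifiers from this library). */
typedef enum { LBL_UNDECIDED, LBL_SIL, LBL_SPCH } Label;

/* Per-component labeling as described for sil_thr / spch_thr. */
Label label_component(double x, double sil_thr, double spch_thr)
{
    assert(sil_thr < spch_thr);   /* assumed: thresholds do not overlap */
    if (x <= sil_thr)
        return LBL_SIL;
    if (x >= spch_thr)
        return LBL_SPCH;
    return LBL_UNDECIDED;
}

/* Frame-level rule as described for sil_det_pre: returns 1 if the frame
 * counts as silence, 0 otherwise. */
int frame_is_silence(const double *x, const double *sil_thr,
                     const double *spch_thr, size_t n,
                     size_t min_sil, size_t max_spch)
{
    size_t n_sil = 0, n_spch = 0;
    for (size_t i = 0; i < n; i++) {
        Label l = label_component(x[i], sil_thr[i], spch_thr[i]);
        n_sil += (l == LBL_SIL);
        n_spch += (l == LBL_SPCH);
    }
    return n_sil >= min_sil && n_spch <= max_spch;
}
```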
`sil_det_post <max_sil>(0.1) <min_spch>(0.4)`
Set the value of the "post sil/spch detector" to 0 (== silence) if the mean of the gaussian posteriors for gaussian[0] (== silence) is larger than or equal to <min_sil> and the mean of the gaussian posteriors for gaussian[<mix_sz>-1] (== speech) is smaller than or equal to <max_spch>.
`output <sil_spch/xnorm/wgt0/wgt1/wgt2/mu0/mu1/mu2/sigma0/sigma1/sigma2/mass0/mass1/mass2/probs0/probs1/probs2>(xnorm) [sil_spch/xnorm/wgt0/wgt1/wgt2/mu0/mu1/mu2/sigma0/sigma1/sigma2/mass0/mass1/mass2/probs0/probs1/probs2] ...`
Define the output (xnorm: normalized data; sil_spch: the post sil/spch detector; {wgt,mu,sigma,mass}{0,1,2}: the gaussian mixture weights, means, sigmas or mass; probs{0,1,2}: the posterior gaussian probabilities).
`history <hist_len>`
Limit the history length to <hist_len> (infinite by default).

Author: Kris Demuynck

Date: 18 May 2009
