SPRAAK
sspmod_histeq.c File Reference

Histogram based normalization of spectral features.

Functions

void spr_histeq_free (SprSspInfo *Info)
 
int spr_histeq_setup (SprSspInfo *Info, const char **descript, void *aux_info)
 
int spr_histeq_process (SprSspInfo *Info, const void *frame_in, void *frame_out)
 
void spr_histeq_reset (SprSspInfo *Info, SprSspStatus *action)
 

Detailed Description

Histogram based normalization of spectral features.

Normalization of (log) spectral features using a uni-, bi- or tri-modal gaussian model per feature (one gaussian for the noise, one for undecided, one for the speech), followed by model-based histogram equalization.

The initial parameters are calculated on the first N frames. After that, the gaussian mixture model parameters are updated for each new input frame using a simple weight (first order filter).


[histeq]
mix_sz [3/2/1](2)
Model the data using a gaussian mixture with <mix_sz> components.
nfr_init <number>(100)
Number of frames used for calculating the initial statistics. Specify -1 if all frames are to be used.
em_init <number>(10)
Number of EM re-estimation passes done on the initial gaussian set.
alpha <number>(0.05)
Weight for the incremental updates.
ign_sil
Ignore 'silence' frames when estimating/updating the gaussian mixture statistics (requires a proper configuration of the silence/speech detector, see 'sil_det_pre').
eval <precision>(1e-4) [lambda_backoff](0.0) [sig0_backoff](0.0) [sig1_backoff](0.0) [sig2_backoff](0.0)
Specify how the histogram normalization must be done. <precision> is the desired precision for the histogram equalization. Non-zero values for <lambda_backoff> and <sigX_backoff> indicate that the observed (backoff>0.0) or target (backoff<0.0) lambdas and variances should be replaced with a weighted combination of mixed and target values. The backoff values must be in the range [-1.0 ... 1.0].
adjust [adj_lambda](0.0) [adj_sig0](0.0) [adj_sig1](0.0) [adj_sig2](0.0)
Replace the observed lambdas and sigmas with a weighted combination of mixed and target values before doing the (incremental) EM-updates. All values must be in the range [0.0 ... 1.0].
file_init <fname>
Read initial statistics from the given file; specifying this option has the side-effect of setting the <nfr_init> param to 0.
file_target <fname>
Read the target statistics from the given file. The default targets are useful for <mix_sz>=1 only (mean+variance normalization)!
no_reset
Do not reset the statistics at the beginning of a new file.
multi_spkr <N> <copy/move> <buf_name>
Set up histeq to work in a multi-speaker environment, i.e. cepstral means are calculated for (at least) the N last (least recently used) speaker IDs. The speaker IDs are input from a named buffer.
freq_smooth <number>(0) [wgt](1.0) ...
Smooth the statistics in each frequency band by adding data from the <number> higher and lower frequency bands using weights <wgt>.
split_ini <low1>(0.3) <hi1>(0.9) [low2](0.2) [hi2](0.8) [outlier](0.0) [var_range](0.0)
The gaussian mixtures are initialized by (repeatedly) splitting the data in two parts that minimize the weighted average variance. The limits are there to prevent extremely unbalanced splits (in #points). The highest and lowest log-energies (a fraction <outlier>) are regarded as outliers and are not used when making the gaussian mixtures. If a non-zero value for the parameter <var_range> is given, then the first split will ensure that the variance of the gaussian corresponding to the high energy values (speech) is close to the target value (1.0==exact).
sil_msk <copy/move> <buf_name>
Label a frame (or a component in a frame) as silence (will be used to update gaussian[0] in the mixture).
spch_msk <copy/move> <buf_name>
Label a frame (or a component in a frame) as speech (will be used to update gaussian[<mix_sz>-1] in the mixture).
sil_thr <copy/move> <buf_name>
Input values smaller than or equal to the given thresholds are labeled silence.
spch_thr <copy/move> <buf_name>
Input values larger than or equal to the given thresholds are labeled speech.
sil_det_pre <min_sil>(-1) <max_spch>(-1)
Label a complete frame (== all components) as silence if at least <min_sil> components were labeled as silence and at most <max_spch> components were labeled as speech.
sil_det_post <max_sil>(0.1) <min_spch>(0.4)
Set the value of the "post sil/spch detector" to 0 (== silence) if the mean of the gaussian posteriors for gaussian[0] (== silence) is larger than or equal to <min_sil> and the mean of the gaussian posteriors for gaussian[<mix_sz>-1] (== speech) is smaller than or equal to <max_spch>.
output <sil_spch/xnorm/wgt0/wgt1/wgt2/mu0/mu1/mu2/sigma0/sigma1/sigma2/mass0/mass1/mass2/probs0/probs1/probs2>(xnorm) [sil_spch/xnorm/wgt0/wgt1/wgt2/mu0/mu1/mu2/sigma0/sigma1/sigma2/mass0/mass1/mass2/probs0/probs1/probs2] ...
Define the output (xnorm: normalized data; sil_spch: the post sil/spch detector; {wgt,mu,sigma,mass}{0,1,2}: the gaussian mixture weights, means, sigmas or mass; probs{0,1,2}: the posterior gaussian probabilities).
history <hist_len>
Limit the history length to <hist_len> (infinite by default).
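
The splitting step described under split_ini can be sketched as an exhaustive search over the sorted data for the cut that minimizes the weighted average variance of the two parts. The code below is a minimal illustration under stated assumptions, not the SPRAAK implementation: best_split and cmp_double are hypothetical names, and the limits low and hi are interpreted here as the allowed fraction of points in the lower part.

```c
#include <stdlib.h>

/* Ascending comparison for qsort(). */
int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sort x[0..n-1] in place and return the number of points in the
 * lower part of the best split, or -1 if no split satisfies the
 * limits. The cost of a split is the weighted average variance
 * (n_lo*var_lo + n_hi*var_hi) / n. */
int best_split(double *x, int n, double low, double hi)
{
    double tot1 = 0.0, tot2 = 0.0;  /* sum of x and of x*x, all points */
    double s1 = 0.0, s2 = 0.0;      /* the same sums for the lower part */
    double best_cost = 0.0;
    int best = -1;

    qsort(x, (size_t)n, sizeof *x, cmp_double);
    for (int i = 0; i < n; i++) {
        tot1 += x[i];
        tot2 += x[i] * x[i];
    }
    for (int i = 1; i < n; i++) {   /* i = points in the lower part */
        s1 += x[i - 1];
        s2 += x[i - 1] * x[i - 1];
        double frac = (double)i / n;
        if (frac < low || frac > hi)
            continue;               /* split too unbalanced */
        int nh = n - i;
        double m_lo = s1 / i;
        double m_hi = (tot1 - s1) / nh;
        double var_lo = s2 / i - m_lo * m_lo;
        double var_hi = (tot2 - s2) / nh - m_hi * m_hi;
        double cost = (i * var_lo + nh * var_hi) / n;
        if (best < 0 || cost < best_cost) {
            best = i;
            best_cost = cost;
        }
    }
    return best;
}
```

Applying the same search again to one of the two parts would give a three-component initialization. The outlier removal and the <var_range> constraint from the option description are left out of this sketch.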

Author
Kris Demuynck
Date
18 May 2009