SPRAAK
Histogram based normalization of spectral features.
Functions | |
---|---|
void | spr_histeq_free (SprSspInfo *Info) |
int | spr_histeq_setup (SprSspInfo *Info, const char **descript, void *aux_info) |
int | spr_histeq_process (SprSspInfo *Info, const void *frame_in, void *frame_out) |
void | spr_histeq_reset (SprSspInfo *Info, SprSspStatus *action) |
Normalization of (log) spectral features using a uni-, bi- or tri-modal gaussian model per feature (one gaussian for the noise, one for undecided, one for the speech), followed by model-based histogram equalization.
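For the unimodal case (<mix_sz>=1), equalizing one gaussian onto another reduces to mean and variance normalization, as the description of file_target also notes. The following is a minimal sketch of that special case only. The function `histeq_unimodal` and all of its parameters are hypothetical names chosen for illustration and are not part of the SPRAAK API.

```c
#include <stddef.h>

/* Hypothetical helper, not a SPRAAK function.
 * Maps n values of one feature from an observed gaussian (mu_obs, sigma_obs)
 * onto a target gaussian (mu_tgt, sigma_tgt). Matching the two cumulative
 * distributions gives y = mu_tgt + sigma_tgt * (x - mu_obs) / sigma_obs.
 * sigma_obs is assumed to be strictly positive. */
static void histeq_unimodal(const double *x, double *y, size_t n,
                            double mu_obs, double sigma_obs,
                            double mu_tgt, double sigma_tgt)
{
    for (size_t i = 0; i < n; i++)
        y[i] = mu_tgt + sigma_tgt * (x[i] - mu_obs) / sigma_obs;
}
```

With two or three gaussians the mapping has no such closed form, which is presumably why the eval option takes a <precision> argument.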
The initial parameters are calculated on the first N frames (see <nfr_init>). After that, the gaussian mixture model parameters are updated for each new input frame using a simple weight (first order filter, see <alpha>).
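The idea of a first order filter update can be sketched as follows for a single gaussian of a single feature. This is an illustration under assumptions, not the SPRAAK implementation: the type `RunStat` and the function `runstat_update` are hypothetical, and the sketch ignores the weighting by mixture posteriors that an update of a multi-component model would need.

```c
/* Hypothetical running statistics of one gaussian (not a SPRAAK type). */
typedef struct {
    double mean;
    double var;
} RunStat;

/* First order (exponential) update with weight alpha in [0, 1].
 * The mean follows new = (1 - alpha) * old + alpha * x. The variance update
 * is the equivalent form obtained when the first and second moments are both
 * filtered with the same alpha. */
static void runstat_update(RunStat *s, double x, double alpha)
{
    double d = x - s->mean;                           /* deviation from old mean */
    s->mean += alpha * d;                             /* filtered mean */
    s->var = (1.0 - alpha) * (s->var + alpha * d * d); /* filtered variance */
}
```

A small alpha such as the default 0.05 makes the statistics follow the input slowly; alpha equal to 0 leaves them unchanged.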
[histeq] | |
---|---|
mix_sz [3/2/1](2) | |
Model the data using a gaussian mixture with <mix_sz> components. | |
nfr_init <number>(100) | |
Number of frames used for calculating the initial statistics. Specify -1 if all frames are to be used. | |
em_init <number>(10) | |
Number of EM re-estimation passes done on the initial gaussian set. | |
alpha <number>(0.05) | |
Weight for the incremental updates. | |
ign_sil | |
Ignore 'silence' frames when estimating/updating the gaussian mixture statistics (requires a proper configuration of the silence/speech detector, see 'sil_det_pre'). | |
eval <precision>(1e-4) [lambda_backoff](0.0) [sig0_backoff](0.0) [sig1_backoff](0.0) [sig2_backoff](0.0) | |
Specify how the histogram normalization must be done. <precision> is the desired precision for the histogram equalization. Non-zero values for <lambda_backoff> and <sigX_backoff> indicate that the observed (backoff>0.0) or target (backoff<0.0) lambdas and variances should be replaced with a weighted combination of mixed and target values. The backoff values must be in the range [-1.0 ... 1.0]. | |
adjust [adj_lambda](0.0) [adj_sig0](0.0) [adj_sig1](0.0) [adj_sig2](0.0) | |
Replace the observed lambdas and sigmas with a weighted combination of mixed and target values before doing the (incremental) EM-updates. All values must be in the range [0.0 ... 1.0]. | |
file_init <fname> | |
Read initial statistics from the given file; specifying this option has the side-effect of setting the <nfr_init> param to 0. | |
file_target <fname> | |
Read the target statistics from the given file. The default targets are useful for <mix_sz>=1 only (mean+variance normalization)! | |
no_reset | |
Do not reset the statistics at the beginning of a new file. | |
multi_spkr <N> <copy/move> <buf_name> | |
Set up histeq to work in a multi-speaker environment, i.e. cepstral means for (at least) the N last (least recently used) speaker IDs are calculated. The speaker IDs are input from a named buffer. | |
freq_smooth <number>(0) [wgt](1.0) ... | |
Smooth the statistics in each frequency band by adding data from the <number> higher and lower frequency bands using weights <wgt>. | |
split_ini <low1>(0.3) <hi1>(0.9) [low2](0.2) [hi2](0.8) [outlier](0.0) [var_range](0.0) | |
The gaussian mixtures are initialized by (repeatedly) splitting the data in two parts that minimize the weighted average variance. The limits are there to prevent extremely unbalanced splits (in #points). The highest and lowest log-energies (a fraction <outlier>) are regarded as outliers and are not used when making the gaussian mixtures. If a non-zero value for the parameter <var_range> is given, then the first split will ensure that the variance of the gaussian corresponding to the high energy values (speech) is close to the target value (1.0==exact). | |
sil_msk <copy/move> <buf_name> | |
Label a frame (or a component in a frame) as silence (will be used to update gaussian[0] in the mixture). | |
spch_msk <copy/move> <buf_name> | |
Label a frame (or a component in a frame) as speech (will be used to update gaussian[<mix_sz>-1] in the mixture). | |
sil_thr <copy/move> <buf_name> | |
Input values smaller than or equal to the given thresholds are labeled silence. | |
spch_thr <copy/move> <buf_name> | |
Input values larger than or equal to the given thresholds are labeled speech. | |
sil_det_pre <min_sil>(-1) <max_spch>(-1) | |
Label a complete frame (== all components) as silence if at least <min_sil> components were labeled as silence and at most <max_spch> components were labeled as speech. | |
sil_det_post <max_sil>(0.1) <min_spch>(0.4) | |
Set the value of the "post sil/spch detector" to 0 (== silence) if the mean of the gaussian posteriors for gaussian[0] (== silence) is larger than or equal to <min_sil> and the mean of the gaussian posteriors for gaussian[<mix_sz>-1] (== speech) is smaller than or equal to <max_spch>. | |
output <sil_spch/xnorm/wgt0/wgt1/wgt2/mu0/mu1/mu2/sigma0/sigma1/sigma2/mass0/mass1/mass2/probs0/probs1/probs2>(xnorm) [sil_spch/xnorm/wgt0/wgt1/wgt2/mu0/mu1/mu2/sigma0/sigma1/sigma2/mass0/mass1/mass2/probs0/probs1/probs2] ... | |
Define the output (xnorm: normalized data; sil_spch: the post sil/spch detector; {wgt,mu,sigma,mass}{0,1,2}: the gaussian mixture weights, means, sigmas or mass; probs{0,1,2}: the posterior gaussian probabilities). | |
history <hist_len> | |
Limit the history length to <hist_len> (infinite by default). | |
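
The labeling rules described for sil_thr, spch_thr and sil_det_pre can be sketched as below. This is a hedged illustration of the rules as worded in the table, not SPRAAK code: the label constants and the functions `label_component` and `frame_is_silence` are hypothetical, the thresholds are assumed to be given per component, and the meaning of the default value -1 for <min_sil> and <max_spch> is not documented here and is therefore not modeled.

```c
#include <stddef.h>

/* Hypothetical labels, not SPRAAK identifiers. */
enum { LBL_NONE = 0, LBL_SIL = 1, LBL_SPCH = 2 };

/* Label one component: silence if x <= sil_thr, speech if x >= spch_thr.
 * Assumes sil_thr < spch_thr, so both tests cannot hold at once. */
static int label_component(double x, double sil_thr, double spch_thr)
{
    if (x <= sil_thr)
        return LBL_SIL;
    if (x >= spch_thr)
        return LBL_SPCH;
    return LBL_NONE;
}

/* Frame-level rule as described for sil_det_pre: returns 1 (silence) if at
 * least min_sil components are labeled silence and at most max_spch
 * components are labeled speech, and 0 otherwise. */
static int frame_is_silence(const double *x, const double *sil_thr,
                            const double *spch_thr, size_t n,
                            size_t min_sil, size_t max_spch)
{
    size_t n_sil = 0, n_spch = 0;
    for (size_t i = 0; i < n; i++) {
        int lbl = label_component(x[i], sil_thr[i], spch_thr[i]);
        n_sil += (lbl == LBL_SIL);
        n_spch += (lbl == LBL_SPCH);
    }
    return n_sil >= min_sil && n_spch <= max_spch;
}
```

For a frame with four components of which three fall below their silence threshold and one exceeds its speech threshold, the frame is labeled silence for <min_sil>=3 and <max_spch>=1, but not when <max_spch> is lowered to 0.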