SPRAAK
Example Based Recognition on WSJ database

Collecting the exemplars in a database

The key idea behind template-based ASR is to measure the distance between an input speech signal and templates that were previously labeled and stored in memory. No parametric models are actually created, and the `training' merely consists of these steps:

  1. phone segmentation
  2. extraction of the acoustic features
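The distance mentioned above is classically computed with dynamic time warping (DTW). As an illustration, here is a plain textbook DTW with Euclidean frame costs (a sketch only, not SPRAAK's implementation):

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic-time-warping distance between two feature sequences
    x (n x d) and y (m x d), using Euclidean frame costs and the
    standard match/insert/delete recursion."""
    n, m = len(x), len(y)
    # cost[i][j] = cheapest alignment of x[:i] with y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # deletion
                                 cost[i, j - 1],      # insertion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```

Two sequences of identical frames have distance 0 regardless of their lengths, which is exactly the temporal elasticity that makes DTW usable for comparing utterances of different durations.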

The CI-phone segmentation can be obtained with a Viterbi alignment using the previously trained HMM models:

> cd scripts_dtw/
> spr_vitalign -WM -S -c ../resources/SPRAAK/wsj_si284_train.cor -d ../resources/cmudict.0.7a.lex -seg ../resources_dtw/si284_train_WSJ0+1_CMU.wrd.seg -ssp "../resources/SPRAAK/mida_vtln.preproc  ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.4/acmod.preproc" -h  ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.4/acmod.hmm -g  ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.4/acmod.mvg -sel  ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.4/acmod.sel -ci ../resources/SPRAAK/yapa_en.ci  -cd ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.3/acmod.cd  -i ../data/sam_16k/ -suffix  wv1 -beam 'threshold=99,width=2000' -LMout -100 -rmg  'r15;new;' -unwind 'add_in_front=[/#];add_between=[/#];add_at_rear=[/#];sent_context=##;'
> CLEAN_SEG ../resources_dtw/si284_train_WSJ0+1_CMU.wrd.seg > ../resources_dtw/si284_train_WSJ0+1_CMU.seg
> SPR_SEG2PHON_LVL -v cd2ci=1 ../resources_dtw/si284_train_WSJ0+1_CMU.seg ../resources_dtw/si284_train_WSJ0+1_CMU.ci.seg

After that we can extract the features. (If a sentence could not be segmented, e.g. because of OOV words, no features are extracted for it; in other words, the segmentation stays aligned with the feature file.)

spr_sel_frames -c ../resources_dtw/si284_train_WSJ0+1_CMU.ci.seg -ssp "../resources/SPRAAK/mida_vtln.preproc ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.4/acmod.preproc" -suffix wv1 -o key:../resources_dtw/si284_train_WSJ0+1_CMU.trk -obs ../data/sam_16k/
dim1=`spr_getkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k NFR`
dim2=`spr_getkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k NPARAM`
spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k NDIM      -v $dim2
spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k NFRAMES   -v $dim1
spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k NFR       -v `echo "$dim1 * $dim2" | bc -q`
spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k NPARAM    -v 1
spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k DATATYPE  -v PARAM
spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k PARAMTYPE -v FEATUREDB

Later on, the features are (usually) sharpened. Data sharpening is performed on a frame-by-frame basis, with the HMM states as classes. Below, we first create a state-based segmentation file (.state.seg) and the list of classes/states (.state.ci). After that we sharpen the features with `vectorTranslation'.
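Conceptually, sharpening translates every frame toward its class. The sketch below (a hypothetical `sharpen` helper, not the actual vectorTranslation algorithm) illustrates the idea:

```python
import numpy as np

def sharpen(frames, labels, frac=0.5):
    """Conceptual data sharpening: translate every frame part of the
    way toward the mean of its class (here: its HMM state).  frac=0
    leaves the data untouched; frac=1 collapses each class onto its
    mean."""
    frames = np.asarray(frames, dtype=float)
    out = frames.copy()
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        mean = frames[idx].mean(axis=0)
        # Move each frame of class c a fraction of the way to the mean.
        out[idx] += frac * (mean - frames[idx])
    return out
```

The effect is to reduce within-class variance so that exemplars of the same state cluster more tightly, which makes the later distance computations more discriminative.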

seg_cd2state.py ../resources_dtw/si284_train_WSJ0+1_CMU.seg  ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.3/acmod.cd > ../resources_dtw/si284_train_WSJ0+1_CMU.state.seg
awk '{if(D) u[$2]; else D=($0~"^#+$");} END {for(s in u) {sub("^S","",s);print s;}}' ../resources_dtw/si284_train_WSJ0+1_CMU.state.seg \
|       sort -k 1,1n \
|       awk '{u[i++]=$1} END {printf(".key\nNENTRY %i\n#\n",i);for(i=0;i in u;i++) printf("S%s\n",u[i]);}' \
>       ../resources_dtw/si284_train_WSJ0+1_CMU.state.ci
nice -n 19 vectorTranslation -data ../resources_dtw/si284_train_WSJ0+1_CMU.trk -o ../resources_dtw/si284_train_WSJ0+1_CMU.sharp.trk -tr ../resources_dtw/si284_train_WSJ0+1_CMU.state.seg -alph ../resources_dtw/si284_train_WSJ0+1_CMU.state.ci -frac -1 -wgt 0 -threads 10 -maxMem 2000

Finally, we map the CD-phone segmentations onto a reduced set of CD-phones more suitable for use as template classes. The CD-phones obtained from HMM training share states and Gaussians and are therefore sparse. In the new set, every CD-phone has a minimum number of realizations in the database (256, in this case). A segmentation with CD-words (.tcdw) can also be produced:
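The reduction can be pictured as a back-off: any CD-phone with too few realizations falls back to its CI base phone, so that every retained class keeps enough exemplars. A minimal sketch (hypothetical `flatten_cd_set` helper, not the actual seg_ci2cd_ext.py logic):

```python
from collections import Counter

def flatten_cd_set(occurrences, min_count=256):
    """Back off rare context-dependent phones to their
    context-independent base phone.  `occurrences` is a list of
    (cd_phone, ci_phone) pairs, one per realization in the training
    segmentation."""
    counts = Counter(cd for cd, _ in occurrences)
    # Keep a CD-phone only if it occurs at least min_count times.
    return [cd if counts[cd] >= min_count else ci
            for cd, ci in occurrences]
```

With min_count=256, as used above, every surviving template class is guaranteed at least 256 exemplars in the si284 training set.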

seg_ci2cd_ext.py ../resources_dtw/si284_train_WSJ0+1_CMU.tcd.seg ../resources_dtw/si284_train_WSJ0+1_CMU.wrd.seg ../resources/SPRAAK/yapa_en.ci ../resources_dtw/SPRAAK/WSJ0+1_CMU_flat_256_20.cd  ../resources_dtw/SPRAAK/si284_train_fname2spkr.lst ../resources_dtw/SPRAAK/phon_word_exclude.lst
spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.tcd.seg -k NSEG -v `egrep -c '^(-|wsj)' ../resources_dtw/si284_train_WSJ0+1_CMU.tcd.seg` 
word_tree.py ../resources_dtw/WSJ0+1_CMU_flat_256_20.wrd.cd ../resources_dtw/si284_train_WSJ0+1_CMU.tcdw.seg ../resources/SPRAAK/yapa_en.ci ../resources_dtw/SPRAAK/questions ../resources_dtw/SPRAAK/WSJ0+1_CMU_flat_256_20.cd ../resources_dtw/si284_train_WSJ0+1_CMU.tcd.seg

Rescore a word/phone lattice using DTW score

Evaluation of the template system is done using the dev92 and nov92 datasets. Below we explain how to proceed with the nov92 test set only. Testing can be split into four main parts:

  1. generation of the word graph (WG), as explained in spr_wsj1_hmm.dox
  2. creation of a phone lattice from the WG
  3. creation of the kNN lists by computing DTW-kNN distances between hypotheses and templates
  4. decoding

The last three steps can be carried out as follows. First, we expand the word units in the WG into CD-phone units. The word graph (.LAT) must have been generated using the option wlat=2. Moreover, dummy arcs, initial arcs, etc. must have been removed:

WLAT_BEST_END ../output/MY_EXP.LAT | WLAT_RM_SENT_BEGIN - | WLAT_RENUM_MAX - | WLAT_RM_DEPS -v allow_root_D=1 - | WLAT_RM_DEPS2 - | WLAT_ARC_SORT - > ../output/MY_EXP_T.LAT
lat_spraak2platcdw.py ../output/nov92_np_20k_WSJ0+1_CMU.platw.gz ../resources/SPRAAK/yapa_en.ci ../resources_dtw/SPRAAK/WSJ0+1_CMU_flat_256_20.cd ../resources/wlist20o_nvp.dic ../resources_dtw/WSJ0+1_CMU_flat_256_20.wrd.cd ../output/MY_EXP_T.LAT

Next, we create the template lattice using the DTW-kNN algorithm; this requires the features for the nov92 test set:
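In essence, the DTW-kNN step scores each hypothesized phone arc against all training exemplars of its class and keeps the k closest. The sketch below illustrates this with a simplified segment distance (Euclidean distance between mean vectors as a stand-in for the real DTW distance; names are hypothetical, not the spr_knn_dtw internals):

```python
import numpy as np

def seg_distance(a, b):
    # Stand-in for a real DTW distance: Euclidean distance between
    # the mean feature vectors of the two segments.
    return float(np.linalg.norm(np.mean(a, axis=0) - np.mean(b, axis=0)))

def knn_list(test_seg, templates, k=50):
    """Score one hypothesized phone arc (a feature slice) against all
    exemplars of its class and return the k smallest distances,
    i.e. the kNN list that is later averaged into a phone score."""
    dists = sorted(seg_distance(test_seg, t) for t in templates)
    return dists[:k]
```

The `-k 50` option in the spr_knn_dtw call above corresponds to the size of this kNN list.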

spr_sel_frames -c ../resources_dtw/SPRAAK/nov92_np_20k_WSJ0+1_CMU.ci.seg -ssp "../resources/SPRAAK/mida_vtln.preproc ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.4/acmod.preproc" -suffix wv1 -o key:../resources_dtw/nov92_np_20k_WSJ0+1_CMU.trk -obs ../data/sam_16k/
spr_knn_dtw -o ../output/nov92_np_20k_WSJ1_CMUv4v.tlat.gz -ctx 0 -k 50 -threads 10 -ref ../resources_dtw/si284_train_WSJ0+1_CMU.trk -seg ../resources_dtw/si284_train_WSJ0+1_CMU.tcdw.seg -wrd '/' -tst ../resources_dtw/nov92_np_20k_WSJ0+1_CMU.trk -lat ../output/nov92_np_20k_WSJ0+1_CMU.platw.gz -mvgf -0.0

Finally, we rescore the word graph (.LAT) with the information gathered in the (pseudo) lattice (.tlat). For each phone, a score is obtained by averaging the scores in its kNN list; word scores are then calculated by accumulating the phone (log) scores.
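In sketch form (hypothetical helpers; assuming the kNN distances behave like negative log scores, so that accumulation over a word's phones is a sum):

```python
def phone_score(knn_distances):
    """Average the scores in one phone's kNN list."""
    return sum(knn_distances) / len(knn_distances)

def word_score(phone_knn_lists):
    """Accumulate the per-phone averages into a word score.  Treating
    the distances as negative log scores, accumulation is a plain sum
    over the phones that make up the word."""
    return sum(phone_score(lst) for lst in phone_knn_lists)
```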

lat_spraak_add_template_new.py ../output/nov92.lat ../lm/phon_eos1.dic ../trees/WSJ1_CMUv4v_256_20.cd ../trees/WSJ1_CMUv4v_256_20.wrd.cd ../wg/nov92_np_20k_WSJ1_CMUv4v.tlat.gz ../lm/wlist20o_nvp.cmu+s.dic ../wg/nov92_np_20k_WSJ1_CMUv4v.wrd.LAT ../resources_dtw/si284_train_WSJ1_CMUv4v.tcdw.seg nov92_np_20k_vtln_mn_hi1.Btr0 lc=-0.0,lf=0,nav=1,ctx=0 1.0

The new word graph is now ready to be decoded. Note that the scores on each arc should be combined appropriately, with the score weights optimized on held-out data.