SPRAAK
The key idea behind template-based ASR is to measure the distance between an input speech signal and templates that were previously labeled and stored in memory. No parametric models are actually created; the `training' merely consists of the following steps:
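Since the whole approach rests on distances between feature sequences of different lengths, dynamic time warping (DTW) is the core operation. The following is a minimal, illustrative DTW sketch, not the toolkit's implementation (which is heavily optimized and supports further options such as context and pruning):

```python
import numpy as np

def dtw_distance(x, y):
    """DTW distance between two feature sequences.

    x, y: 2-D arrays of shape (n_frames, n_features).
    Returns the accumulated frame distance along the optimal warping path.
    """
    n, m = len(x), len(y)
    # local (frame-to-frame) Euclidean distances
    local = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                  acc[i, j - 1],      # deletion
                                                  acc[i - 1, j - 1])  # match
    return acc[n, m]
```

The warping lets a short template match a longer realization of the same phone sequence without penalty for mere differences in speaking rate.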
The CI-phone segmentation can be obtained with a Viterbi alignment using the previously trained HMMs:
cd scripts_dtw/

spr_vitalign -WM -S -c ../resources/SPRAAK/wsj_si284_train.cor -d ../resources/cmudict.0.7a.lex \
    -seg ../resources_dtw/si284_train_WSJ0+1_CMU.wrd.seg \
    -ssp "../resources/SPRAAK/mida_vtln.preproc ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.4/acmod.preproc" \
    -h ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.4/acmod.hmm \
    -g ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.4/acmod.mvg \
    -sel ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.4/acmod.sel \
    -ci ../resources/SPRAAK/yapa_en.ci \
    -cd ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.3/acmod.cd \
    -i ../data/sam_16k/ -suffix wv1 -beam 'threshold=99,width=2000' -LMout -100 -rmg 'r15;new;' \
    -unwind 'add_in_front=[/#];add_between=[/#];add_at_rear=[/#];sent_context=##;'

CLEAN_SEG ../resources_dtw/si284_train_WSJ0+1_CMU.wrd.seg > ../resources_dtw/si284_train_WSJ0+1_CMU.seg

SPR_SEG2PHON_LVL -v cd2ci=1 ../resources_dtw/si284_train_WSJ0+1_CMU.seg ../resources_dtw/si284_train_WSJ0+1_CMU.ci.seg
After that we can extract the features. If a sentence has not been segmented (e.g. because of OOV words), no features are extracted for that sentence; in other words, the segmentation stays aligned with the feature file:
spr_sel_frames -c ../resources_dtw/si284_train_WSJ0+1_CMU.ci.seg \
    -ssp "../resources/SPRAAK/mida_vtln.preproc ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.4/acmod.preproc" \
    -suffix wv1 -o key:../resources_dtw/si284_train_WSJ0+1_CMU.trk -obs ../data/sam_16k/

dim1=`spr_getkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k NFR`
dim2=`spr_getkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k NPARAM`
spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k NDIM -v $dim2
spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k NFRAMES -v $dim1
spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k NFR -v `echo "$dim1 * $dim2" | bc -q`
spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k NPARAM -v 1
spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k DATATYPE -v PARAM
spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.trk -k PARAMTYPE -v FEATUREDB
Later on, the features are (usually) sharpened. Data sharpening is performed on a frame-by-frame basis, with the HMM states acting as classes. Below, we first create a state-based segmentation file (.state.seg) and the list of classes/states (.state.ci). After that we sharpen the features with `vectorTranslation':
seg_cd2state.py ../resources_dtw/si284_train_WSJ0+1_CMU.seg \
    ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.3/acmod.cd \
    > ../resources_dtw/si284_train_WSJ0+1_CMU.state.seg

awk '{if(D) u[$2]; else D=($0~"^#+$");} END {for(s in u) {sub("^S","",s);print s;}}' \
    ../resources_dtw/si284_train_WSJ0+1_CMU.state.seg | sort -k 1,1n | \
    awk '{u[i++]=$1} END {printf(".key\nNENTRY %i\n#\n",i);for(i=0;i in u;i++) printf("S%s\n",u[i]);}' \
    > ../resources_dtw/si284_train_WSJ0+1_CMU.state.ci

nice -n 19 vectorTranslation -data ../resources_dtw/si284_train_WSJ0+1_CMU.trk \
    -o ../resources_dtw/si284_train_WSJ0+1_CMU.sharp.trk \
    -tr ../resources_dtw/si284_train_WSJ0+1_CMU.state.seg \
    -alph ../resources_dtw/si284_train_WSJ0+1_CMU.state.ci \
    -frac -1 -wgt 0 -threads 10 -maxMem 2000
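To give an intuition for what frame-wise sharpening does, one simple form of it moves each frame some fraction of the way toward a representative point of its class (the HMM state it was aligned to). This is only a conceptual stand-in, with a plain class mean and an assumed `frac` parameter; the actual behaviour of `vectorTranslation` is governed by its own -frac/-wgt options:

```python
import numpy as np

def sharpen(features, labels, frac=0.5):
    """Move every frame a fraction `frac` toward the mean of its class.

    features: (n_frames, n_dims) array; labels: per-frame class (HMM state) ids.
    Illustrative only: a crude approximation of data sharpening.
    """
    out = features.copy()
    for c in np.unique(labels):
        idx = labels == c
        mean = features[idx].mean(axis=0)          # class centroid
        out[idx] += frac * (mean - features[idx])  # shrink toward centroid
    return out
```

The effect is to reduce within-class variability of the template frames, which makes the subsequent DTW distances more discriminative between classes.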
Finally, we map the CD-phone segmentations onto a reduced set of CD-phones more suitable for use as template classes. The CD-phones obtained from the HMM training share states and Gaussians and are therefore sparse. The new set of CD-phones guarantees a minimum number of realizations of each phone in the database (256, in this case). A segmentation with CD-words can also be produced (.tcdw):
seg_ci2cd_ext.py ../resources_dtw/si284_train_WSJ0+1_CMU.tcd.seg \
    ../resources_dtw/si284_train_WSJ0+1_CMU.wrd.seg \
    ../resources/SPRAAK/yapa_en.ci \
    ../resources_dtw/SPRAAK/WSJ0+1_CMU_flat_256_20.cd \
    ../resources_dtw/SPRAAK/si284_train_fname2spkr.lst \
    ../resources_dtw/SPRAAK/phon_word_exclude.lst

spr_addkey -i ../resources_dtw/si284_train_WSJ0+1_CMU.tcd.seg -k NSEG \
    -v `egrep -c '^(-|wsj)' ../resources_dtw/si284_train_WSJ0+1_CMU.tcd.seg`

word_tree.py ../resources_dtw/WSJ0+1_CMU_flat_256_20.wrd.cd \
    ../resources_dtw/si284_train_WSJ0+1_CMU.tcdw.seg \
    ../resources/SPRAAK/yapa_en.ci \
    ../resources_dtw/SPRAAK/questions \
    ../resources_dtw/SPRAAK/WSJ0+1_CMU_flat_256_20.cd \
    ../resources_dtw/si284_train_WSJ0+1_CMU.tcd.seg
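The reduction criterion described above (keep only CD-phones with enough realizations) can be sketched as follows. Everything here is hypothetical (the helper `ci_of` and label format are made up for illustration); the real mapping is done by seg_ci2cd_ext.py with the flat_256 tables:

```python
from collections import Counter

def reduce_cd_phones(cd_labels, ci_of, min_count=256):
    """Map sparse CD-phone labels onto a reduced set.

    A CD-phone is kept only if it has at least `min_count` realizations
    in the database; otherwise it backs off to its central CI-phone.
    `ci_of` maps a CD label to its CI phone (hypothetical helper).
    """
    counts = Counter(cd_labels)
    return [lbl if counts[lbl] >= min_count else ci_of(lbl)
            for lbl in cd_labels]
```

Backing off rare contexts guarantees that every template class has enough exemplars for the later kNN step to be meaningful.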
Evaluation of the template system is done on the dev92 and nov92 datasets. Below we explain how to proceed with the nov92 test set only. Testing can be split into three main parts:
The last three steps can be carried out as follows. First, we expand word units into CD-phone units in the word graph. The word graph (.LAT) must have been generated using the option wlat=2. Moreover, dummy arcs, initial arcs, etc. must have been removed:
WLAT_BEST_END ../output/MY_EXP.LAT | WLAT_RM_SENT_BEGIN - | WLAT_RENUM_MAX - | \
    WLAT_RM_DEPS -v allow_root_D=1 - | WLAT_RM_DEPS2 - | WLAT_ARC_SORT - \
    > ../output/MY_EXP_T.LAT

lat_spraak2platcdw.py ../output/nov92_np_20k_WSJ0+1_CMU.platw.gz \
    ../resources/SPRAAK/yapa_en.ci \
    ../resources_dtw/SPRAAK/WSJ0+1_CMU_flat_256_20.cd \
    ../resources/wlist20o_nvp.dic \
    ../resources_dtw/WSJ0+1_CMU_flat_256_20.wrd.cd \
    ../output/MY_EXP_T.LAT
We then create the template lattice using the DTW-kNN algorithm. The features for the nov92 test set are needed:
spr_sel_frames -c ../resources_dtw/SPRAAK/nov92_np_20k_WSJ0+1_CMU.ci.seg \
    -ssp "../resources/SPRAAK/mida_vtln.preproc ../models/wsj0+1_mida_vtln_CD_tied_auto/wsj0+1_mida_vtln_CD_tied_auto.4/acmod.preproc" \
    -suffix wv1 -o key:../resources_dtw/nov92_np_20k_WSJ0+1_CMU.trk -obs ../data/sam_16k/

spr_knn_dtw -o ../output/nov92_np_20k_WSJ1_CMUv4v.tlat.gz -ctx 0 -k 50 -threads 10 \
    -ref ../resources_dtw/si284_train_WSJ0+1_CMU.trk \
    -seg ../resources_dtw/si284_train_WSJ0+1_CMU.tcdw.seg -wrd '/' \
    -tst ../resources_dtw/nov92_np_20k_WSJ0+1_CMU.trk \
    -lat ../output/nov92_np_20k_WSJ0+1_CMU.platw.gz -mvgf -0.0
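Conceptually, the DTW-kNN step computes, for each test segment, the DTW distances to all labeled training templates, keeps the k nearest (here k=50), and derives a per-class score from that list. A minimal sketch of the scoring part (illustrative only; function name and score convention are assumptions, not the spr_knn_dtw API):

```python
import numpy as np

def knn_class_scores(distances, labels, k=5):
    """Score each class from the k nearest templates.

    distances: DTW distances from one test segment to all templates;
    labels: the template class labels. Returns {class: average distance
    over that class's entries in the k-nearest list}; lower is better.
    """
    order = np.argsort(distances)[:k]   # indices of the k nearest templates
    per_class = {}
    for i in order:
        per_class.setdefault(labels[i], []).append(distances[i])
    return {c: float(np.mean(d)) for c, d in per_class.items()}
```

A class that dominates the k-nearest list with small distances receives a strong score; classes absent from the list receive none.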
Finally, we rescore the word graph (.LAT) with the information gathered in the (pseudo) lattice (.tlat). Phone scores are obtained by averaging, for each phone, the scores in its kNN list. After that, word scores are calculated by accumulating the phone (log) scores:
lat_spraak_add_template_new.py ../output/nov92.lat ../lm/phon_eos1.dic ../trees/WSJ1_CMUv4v_256_20.cd ../trees/WSJ1_CMUv4v_256_20.wrd.cd ../wg/nov92_np_20k_WSJ1_CMUv4v.tlat.gz ../lm/wlist20o_nvp.cmu+s.dic ../wg/nov92_np_20k_WSJ1_CMUv4v.wrd.LAT ../resources_dtw/si284_train_WSJ1_CMUv4v.tcdw.seg nov92_np_20k_vtln_mn_hi1.Btr0 lc=-0.0,lf=0,nav=1,ctx=0 1.0
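The phone-to-word accumulation described above can be sketched in a few lines. This is a conceptual illustration under assumed conventions (posterior-like scores in (0, 1]; the real script also handles the lc/lf/nav/ctx options shown on the command line):

```python
import math

def word_log_score(phone_knn_scores):
    """Accumulate phone-level kNN scores into a word score.

    phone_knn_scores: for each phone of the word, the list of scores
    gathered from its kNN list. Each phone score is the average of its
    list; the word score is the sum of the phone log scores.
    """
    return sum(math.log(sum(s) / len(s)) for s in phone_knn_scores)
```

Summing log scores makes the word score the log of a product of per-phone scores, which composes naturally with the other log-domain scores on the graph arcs.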
The new word graph is now ready to be decoded. Note that for each arc the scores should be combined appropriately, with the score weights optimized on held-out data.
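A common way to combine the per-arc scores is log-linearly, with the weights tuned (e.g. by grid search over WER) on held-out data. The weight values and score names below are purely hypothetical:

```python
def combined_arc_score(acoustic, lm, template, w_lm=16.0, w_tpl=1.0):
    """Log-linear combination of per-arc log scores.

    acoustic, lm, template: log-domain scores attached to the arc.
    w_lm, w_tpl: weights; the values here are placeholders, to be
    optimized on held-out data rather than taken as given.
    """
    return acoustic + w_lm * lm + w_tpl * template
```

Because all scores live in the log domain, the weighted sum corresponds to a weighted product of the underlying likelihoods.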