SPRAAK
For convenience we bring together all the resources in a single directory structure. All commands/actions will be described relative to its root directory.
We distinguish the data that you will generate from data that was given at the start and which might be public or shared. Such shared public data is stored in directories whose names carry the prefix "pub_".
Whether you use shell variables to refer to such 'pub_' directories or prefer to put links to them in your working directory is a matter of programming preference; the choice is left to the user. A sketch of the shell-variable alternative is given right after the commands below.
In these demos we have put symbolic links to all public resources, but converting to an alternative approach would be easy. First we put the SPRAAK wsj1 demo materials in place:
> ln -s $SPR_HOME/examples/wsj1/resources pub_res
> ln -s $SPR_HOME/examples/wsj1/models pub_mod
> ln -s $SPR_HOME/examples/wsj1/scripts pub_scripts
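Should you prefer the shell-variable approach instead, a minimal csh sketch could look as follows; the variable names are our own choice, and every later occurrence of pub_res, pub_mod or pub_scripts in this tutorial would then have to be written as $pub_res, $pub_mod or $pub_scripts.

> set pub_res     = $SPR_HOME/examples/wsj1/resources
> set pub_mod     = $SPR_HOME/examples/wsj1/models
> set pub_scripts = $SPR_HOME/examples/wsj1/scripts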
Next we create the directories and subdirectories in which we will work:
> mkdir exp                     the principal working directory for experiments
> mkdir models                  directory for storing the acoustic models we train
> mkdir resources               directory for storing all resources we create and/or modify
> ln -s $MYSCRATCHDIR scratch   a scratch directory to store (large amounts of) temporary data (may be on a different volume)
At the end of these operations, your directory structure should look like this.
.
|-- exp            YOUR experimentation directory
|-- models         the acoustic models we train
|-- resources      the resources we create and/or modify
|-- scratch        YOUR scratch directory
|-- pub_mod
|   `-- wsj_init   a small context-independent HMM for bootstrapping
|-- pub_res        transcripts, phonetic alphabet, ...
`-- pub_scripts    scripts and configuration files
For convenience we will assume that you have a common data directory on your system where all your speech databases are stored; we'll call this SPCHDATA. The full contents of WSJ0 and WSJ1 would thus be in $SPCHDATA/WSJ0 and $SPCHDATA/WSJ1 respectively. For our purposes we make them accessible as follows:
> mkdir data
> ln -s $SPCHDATA/WSJ0 data/wsj0
> ln -s $SPCHDATA/WSJ1 data/wsj1
This should make your data directory look more or less like this (you might have additional data as well):
data
|-- wsj0 -> [link to where the WSJ0 CD data is stored on your system]
|   |-- sd_dt_05
|   |-- sd_dt_20
|   |-- sd_dt_jd
|   |-- sd_dt_jr
|   |-- sd_et_05
|   |-- sd_et_20
|   |-- sd_tr_l
|   |-- sd_tr_s
|   |-- si_dt_05
|   |-- si_dt_20
|   |-- si_dt_jd
|   |-- si_dt_jr
|   |-- si_et_05
|   |-- si_et_20
|   |-- si_et_ad
|   |-- si_et_jd
|   |-- si_et_jr
|   `-- si_tr_s
`-- wsj1 -> [link to where the WSJ1 CD data is stored on your system]
    |-- si_dt_05
    |-- si_dt_20
    |-- si_et_h1
    |-- si_et_h2
    |-- si_tr_l
    `-- si_tr_s
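To check that the links actually resolve to the WSJ data, a quick listing of one of the training subdirectories (e.g. si_tr_s, which should be present on both discs as shown above) is usually sufficient:

> ls data/wsj0/si_tr_s | head
> ls data/wsj1/si_tr_s | head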
As with the sampled data, we assume that you already have the CMU dictionary online, in the CMUDICT subdirectory of your SPCHDATA database. In the demos we use the Carnegie Mellon University (CMU) pronouncing dictionary 'cmudict.0.7a'. You should make a local copy and store it in the ./resources directory, after which we apply a small patch to correct a few omissions/mistakes that are relevant to the WSJ evaluations.
> cp $SPCHDATA/CMUDICT/cmudict.0.7a resources
> patch resources/cmudict.0.7a < pub_res/cmudict.0.7a.patch
The next tasks consist of a conversion of the phonetic alphabet, which is needed in two places: the training lexicon is derived from the training corpus and then converted to the YAPA format, and the lexicon for testing is taken from the WSJ0 CD and converted in the same way.
#
# make lexicon for wsj0+1 training
#
> scripts/cor2wlist.py pub_res/wsj_si284_train.cor > resources/wsj01_train.def
> scripts/cmudict2yapa.py pub_res/wsj01_train.def resources/cmudict.0.7a pub_res/phon_cvt.def > pub_res/wsj01_train.dic
#
# make lexicon for 20k open vocabulary, non verbalized pronunciation from word list on WSJ CD's
#
> scripts/cmudict2yapa.py data/wsj0/lng_modl/vocab/wlist20o.nvp resources/cmudict.0.7a pub_res/phon_cvt.def > resources/wsj20onp.dic
> cd ..
All steps required for the lexicon construction are summarized in the 'make_lex.csh' script.
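After running these steps (either by hand or via the script), a quick sanity check is to count the number of entries in the word list and in the resulting dictionaries; this assumes, as is usual for such files, one entry per line:

> wc -l resources/wsj01_train.def pub_res/wsj01_train.dic resources/wsj20onp.dic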
Note1: If you are familiar with the CMU alphabet and prefer to deviate as little as possible from it, we can suggest the following conversion: add one specific character (e.g. ':') to each phone in the alphabet; this also makes parsing unambiguous.
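As an illustration only (the demo itself uses the phon_cvt.def mapping to YAPA instead), such a ':'-suffixed dictionary could be produced with a one-liner along these lines, turning e.g. 'CAT  K AE1 T' into 'CAT K: AE1: T:'; the output file name is merely a suggestion:

> awk '/^;;;/ {print; next} {printf "%s", $1; for (i = 2; i <= NF; i++) printf " %s:", $i; printf "\n"}' resources/cmudict.0.7a > resources/cmudict_colon.dic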
Note2: Having a phonetic dictionary that is not complete is no disaster: sentences containing words with missing transcriptions will simply be skipped in the training process.
Note3: Having a phonetic dictionary that is overcomplete during training may cause issues. Words that are in the dictionary but not in the language model will be assigned the 'UNK' (unknown) word category and will receive a small though finite language model probability. Unless you set the UNK probability to be very low, this may have a significant effect on your results. It is therefore best to keep the dictionary and the LM in sync.
The 3-gram LM provided with the WSJ data must be converted to the SPRAAK format:
> gzip -cd data/wsj0/lng_modl/base_lm/tcb20onp.gz | spr_lm_arpabo | gzip -c > resources/tcb20onp.lm.gz
We provide a few resources that will make the experiments a bit easier. These include signal processing scripts, the YAPA phone set and a conversion table for the CMU alphabet, and corpora for training and testing.
|-- resources
|   |-- tcb20onp.gz                the tri-gram 20k language model provided with the WSJ0+1 data (not provided with SPRAAK)
|   |-- cmudict.0.7a               the downloaded CMU lexicon 0.7a (not provided with SPRAAK)
|-- pub_res
|   |-- cmudict.0.7a.patch         patch for a few missing words in the CMU lexicon 0.7a
|   |-- dev92_np_20k.cor           corpus file for 20k development set (nvp)
|   |-- nov92_np_20k.cor           corpus file for 20k testset (nvp)
|   |-- melcepstra.preproc         preprocessing file for mel cepstra
|   |-- mida.preproc               preprocessing file for mida transformed cepstra
|   |-- mida_vtln.preproc          preprocessing file for vtln+mida transformed cepstra
|   |-- phon_cvt.def               phone conversion file CMU -> YAPA
|   |-- vtln.cd                    cd-phone definition file for M/F training
|   |-- vtln.ci                    ci-phone definition file for M/F training
|   |-- vtln.dic                   dictionary file for M/F training
|   |-- vtln.preproc               preprocessing for VTLN (M/F) models
|   |-- wsj0-spkr-info.txt.920128  speaker info file
|   |-- wsj0-spkr-info.txt.add     additional speaker info file
|   |-- wsj_si284_train.cor        corpus file for WSJ0+1 training set
|   |-- wsj_si84_train.cor         corpus file for WSJ0 training set
|   |-- yapa_en.cd                 cd-phone definition file
|   |-- yapa_en.ci                 ci-phone definition file (YAPA alphabet)
|   `-- yapa_en.questions          phonetic question set for decision tree building
Bootstrapping HMM training is non-trivial. Typically you need some piece of data that is (at least roughly) phonetically segmented, or you need an initial model with which to compute such a segmentation. We provide you with such an initial model (context-independent models, trained on a small speaker sample).
These models are nevertheless good enough to yield a quite accurate segmentation using the Viterbi algorithm. This may be done with the following commands:
> set mod=pub_mod/wsj_init
>
> spr_vitalign -S -c pub_res/wsj_si284_train.cor -d resources/cmudict.0.7a.lex \
      -seg resources/wsj_si284_train.seg \
      -ssp "pub_res/mida.preproc $mod/acmod.preproc" \
      -h $mod/acmod.hmm -g $mod/acmod.mvg \
      -u "pub_res/yapa_en.ci pub_res/yapa_en.cd" \
      -i data -suffix wv1 \
      -beam 'threshold=99,width=2000' -LMout -100 -rmg no \
      -unwind 'add_in_front=[/#];add_between=[/#];add_at_rear=[/#];sent_context=##;'
> cd ..
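For a quick look at what was produced (we only assume here that the segmentation file is plain text; its exact layout is described in the SPRAAK documentation):

> ls -l resources/wsj_si284_train.seg
> head resources/wsj_si284_train.seg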
Note: Sentences/files are skipped if one of the following problems occurs: