SPRAAK
Preparing the data

Setting Up

For convenience we bring together all the resources in a single directory structure. All commands/actions will be described relative to the root directory.

We distinguish the data that you will generate from data that was given at the start and which might be public or shared. Such shared public data will be stored in directories named with the prefix "pub_".

Whether you use shell variables to refer to such 'pub_' directories or prefer to put symbolic links to them in your working directory is a matter of personal preference.
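As a minimal sketch of the shell-variable alternative (POSIX-shell syntax; the variable names mirror the symbolic links used below and are otherwise arbitrary):

```shell
# Point variables at the public resources instead of creating symbolic links.
# $SPR_HOME is assumed to be set, as in the rest of these demos.
pub_res=$SPR_HOME/examples/wsj1/resources
pub_mod=$SPR_HOME/examples/wsj1/models
pub_scripts=$SPR_HOME/examples/wsj1/scripts
# Later commands would then refer to e.g. $pub_res/phon_cvt.def
# instead of pub_res/phon_cvt.def.
```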

In these demos we have put symbolic links to all public resources, but converting to an alternate approach would be easy. First we put the SPRAAK wsj1 demo materials in place.

> ln -s $SPR_HOME/examples/wsj1/resources pub_res
> ln -s $SPR_HOME/examples/wsj1/models pub_mod
> ln -s $SPR_HOME/examples/wsj1/scripts pub_scripts

Next we create the directories and subdirectories in which we will work:

> mkdir exp             the principal working directory for experiments
> mkdir models          directory for storing the acoustic models we train
> mkdir resources       directory for storing all resources we create and/or modify
> ln -s $MYSCRATCHDIR scratch           a scratch directory to store (large amounts of) temporary data (may be on a different volume)

At the end of these operations, your directory structure should look like this.

.
|-- exp                 YOUR experimentation directory
|-- models              acoustic models you train
|-- resources           resources you create and/or modify
|-- scratch             YOUR scratch directory
|-- pub_mod
|   |-- wsj_init        a small context-independent HMM for bootstrapping
|-- pub_res             transcripts, phonetic alphabet, ...
|-- pub_scripts         scripts and configuration files

Accessing the sampled data

For convenience we will assume that you have a common data directory on your system where all your speech databases are stored; we'll call it SPCHDATA. The full contents of WSJ0 and WSJ1 would then be in $SPCHDATA/WSJ0 and $SPCHDATA/WSJ1 respectively. For our purposes we make them accessible as follows:

> mkdir data
> ln -s $SPCHDATA/WSJ0 data/wsj0
> ln -s $SPCHDATA/WSJ1 data/wsj1

This should make your data directory look more or less like this (you might have additional data as well):

data
|-- wsj0 -> [link to where the WSJ0 CD data is stored on your system]
|   |-- sd_dt_05
|   |-- sd_dt_20
|   |-- sd_dt_jd
|   |-- sd_dt_jr
|   |-- sd_et_05
|   |-- sd_et_20
|   |-- sd_tr_l
|   |-- sd_tr_s
|   |-- si_dt_05
|   |-- si_dt_20
|   |-- si_dt_jd
|   |-- si_dt_jr
|   |-- si_et_05
|   |-- si_et_20
|   |-- si_et_ad
|   |-- si_et_jd
|   |-- si_et_jr
|   |-- si_tr_s
`-- wsj1 -> [link to where the WSJ1 CD data is stored on your system]
    |-- si_dt_05
    |-- si_dt_20
    |-- si_et_h1
    |-- si_et_h2
    |-- si_tr_l
    |-- si_tr_s

Lexicon and Phonetic Alphabet

Background Dictionary (CMUDICT)

Similarly to the sampled data, we assume that the CMU dictionary is already available in the CMUDICT subdirectory of your SPCHDATA directory. In the demos we use the Carnegie Mellon University (CMU) pronouncing dictionary 'cmudict.0.7a'. Make a local copy and store it in the ./resources directory, after which we apply a small patch to correct a few omissions/mistakes that are relevant to the WSJ evaluations.

> cp $SPCHDATA/CMUDICT/cmudict.0.7a resources
> patch resources/cmudict.0.7a < pub_res/cmudict.0.7a.patch

YAPA Phonetic Alphabet

The next tasks involve a conversion of the phonetic alphabet from the CMU set to the YAPA format.

Training Lexicon

The training lexicon is derived from the training corpus, after which it is converted to the YAPA format. The lexicon for testing is taken from the WSJ0 CD and converted in the same way.

#
# make lexicon for wsj0+1 training
#
> pub_scripts/cor2wlist.py pub_res/wsj_si284_train.cor > resources/wsj01_train.def
> pub_scripts/cmudict2yapa.py resources/wsj01_train.def resources/cmudict.0.7a  pub_res/phon_cvt.def > resources/wsj01_train.dic
#
# make lexicon for 20k open vocabulary, non verbalized pronunciation from word list on WSJ CD's
#
> pub_scripts/cmudict2yapa.py data/wsj0/lng_modl/vocab/wlist20o.nvp resources/cmudict.0.7a  pub_res/phon_cvt.def > resources/wsj20onp.dic

All steps required for the lexicon construction are summarized in the 'make_lex.csh' script.

Note1: If you are familiar with the CMU alphabet and prefer to deviate from it as little as possible, we suggest the following conversion: append one specific character (e.g. ':') to each phone in the alphabet; this also makes parsing unambiguous.
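Such a conversion can be sketched with a one-line awk filter (a sketch only; the file names are hypothetical, and we assume one dictionary entry per line with the word in the first field and the phones in the remaining fields):

```shell
# Append ':' to every phone of a CMU-style dictionary entry;
# the word in field 1 is copied unchanged, stress digits are kept (AH0 -> AH0:).
awk '{ printf "%s", $1; for (i = 2; i <= NF; i++) printf " %s:", $i; print "" }' \
    cmudict.in > cmudict.colon
```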

Note2: Having a phonetic dictionary that is not complete is no disaster. Sentences containing words with missing transcriptions will simply be skipped in the training process.

Note3: Having a phonetic dictionary that is overcomplete during training may cause issues. Words that are in the dictionary but not in the language model will be assigned the 'UNK' (unknown) word category and will receive a small though finite language model probability. Unless you set the UNK probability very low, this may have a significant effect on your results. It is therefore best to keep the dictionary and LM in sync.
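A quick way to spot such out-of-sync entries is to compare the two word lists (a sketch only: 'lm_vocab.txt' is a hypothetical file holding the LM vocabulary one word per line, and we again assume the dictionary has the word in the first field):

```shell
# List dictionary words that are absent from the LM vocabulary.
awk '{print $1}' resources/wsj20onp.dic | sort -u > dic.words
sort -u lm_vocab.txt > lm.words
comm -23 dic.words lm.words     # lines only in the dictionary word list
```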

Making the language models

The 3-gram LM provided with the WSJ data must be converted to the SPRAAK format:

> gzip -cd data/wsj0/lng_modl/base_lm/tcb20onp.gz | spr_lm_arpabo | gzip -c > resources/tcb20onp.lm.gz

The provided SPRAAK resources

We provide a few resources that will make the experiments a bit easier. These include signal processing scripts, the YAPA phone set and its conversion to the CMU alphabet, and corpora for training and testing.

|-- resources
|   |-- tcb20onp.gz             the tri-gram 20k language model provided with the WSJ0+1 data (not provided with SPRAAK)
|   |-- cmudict.0.7a            the downloaded CMU lexicon 0.7a (not provided with SPRAAK)
|-- pub_res
|   |-- cmudict.0.7a.patch              patch for a few missing words in the CMU lexicon 0.7a
|   |-- dev92_np_20k.cor                corpus file for 20k development set (nvp)
|   |-- nov92_np_20k.cor                corpus file for 20k testset (nvp)
|   |-- melcepstra.preproc              preprocessing file for mel cepstra
|   |-- mida.preproc            preprocessing file for mida transformed cepstra
|   |-- mida_vtln.preproc           preprocessing file for vtln+mida transformed cepstra
|   |-- phon_cvt.def            phone conversion file CMU -> YAPA
|   |-- vtln.cd                 cd-phone definition file for M/F training
|   |-- vtln.ci                 ci-phone definition file for M/F training
|   |-- vtln.dic                        dictionary file for M/F training
|   |-- vtln.preproc            preprocessing for VTLN (M/F) models
|   |-- wsj0-spkr-info.txt.920128       speaker info file
|   |-- wsj0-spkr-info.txt.add  additional speaker info file
|   |-- wsj_si284_train.cor             corpus file for WSJ0+1 training set
|   |-- wsj_si84_train.cor              corpus file for WSJ0 training set
|   |-- yapa_en.cd                      cd-phone definition file 
|   |-- yapa_en.ci                      ci-phone definition file (YAPA alphabet)
|   `-- yapa_en.questions               phonetic question set for decision tree building

Making a segmentation of the training database

Bootstrapping HMM training is non-trivial. Typically you need some piece of data that is (at least roughly) phonetically segmented, or you need an initial model to compute such a segmentation. We provide you with such an initial model (using context-independent models, trained on a small speaker sample).

Though small, these models are good enough to yield a quite accurate segmentation using the Viterbi algorithm. This can be done with the following commands:

> set mod=pub_mod/wsj_init
>
> spr_vitalign -S -c pub_res/wsj_si284_train.cor -d resources/cmudict.0.7a.lex -seg resources/wsj_si284_train.seg -ssp "pub_res/mida.preproc $mod/acmod.preproc" -h $mod/acmod.hmm -g $mod/acmod.mvg -u "pub_res/yapa_en.ci pub_res/yapa_en.cd"  -i data -suffix  wv1 -beam 'threshold=99,width=2000' -LMout -100 -rmg no -unwind 'add_in_front=[/#];add_between=[/#];add_at_rear=[/#];sent_context=##;'

Note: Sentences/files are skipped if one of the following problems occurs: