The Wall Street Journal (WSJ) test suite has been one of the most widely used speech recognition benchmarks since the early 1990s. This demo suite contains a configuration for training a state-of-the-art HMM-based system for the WSJ benchmark. On the popular November 92 non-verbalized punctuation test set, using a trigram language model, an error rate of less than 8% is obtained.
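As a point of reference for the figure quoted above, the following is a minimal sketch of how a word error rate is conventionally computed, assuming the reported error rate is the usual word-level measure (substitutions, deletions and insertions divided by the number of reference words). The function name `word_error_rate` is hypothetical and is not part of SPRAAK; the scoring tool actually used in the experiments is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration only; not a SPRAAK function.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, this measure can exceed 100% for very poor hypotheses.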

Who should read this tutorial?

This tutorial is not written for first-time users of SPRAAK and/or speech recognition. We advise first-time users to go through the TIMIT tutorials first, as these introduce the individual components and concepts of the SPRAAK package step by step. This tutorial assumes an understanding of the concepts underlying a large vocabulary speech recognition system. Experienced speech recognition researchers may try starting here, consulting the general manual pages or selected pages from the TIMIT tutorials when needed.

What will you learn?

This tutorial shows all the steps in creating and evaluating a large vocabulary system, including:
- getting all the required resources in place and in the right format
- training an acoustic model
- preparing a language model
- evaluating and fine-tuning the final system

External Prerequisites

SPRAAK does not come with a license to the WSJ data. The WSJ sampled data, transcriptions and default language models can be licensed from LDC (http://www.ldc.upenn.edu/).
You will also need a lexicon. The lexicon used in our setup is the CMU lexicon (v0.7a), which can be downloaded from the CMU website (http://www.speech.cs.cmu.edu/cgi-bin/cmudict).
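To give an idea of what reading the CMU lexicon involves, here is a minimal sketch that parses lines in the CMU dictionary text layout (comment lines starting with `;;;`, alternate pronunciations marked with a suffix such as `(1)`, stress digits on vowels). The function name `parse_cmudict`, the `strip_stress` option and the sample entries are illustrative assumptions; this is not the conversion script shipped with the tutorial, and the lexicon format SPRAAK expects is not shown here, so the sketch stops at an in-memory dictionary.

```python
import re
from collections import defaultdict

# Matches an alternate-pronunciation suffix such as "(1)" in "READ(1)".
_VARIANT_SUFFIX = re.compile(r"\(\d+\)$")


def parse_cmudict(lines, strip_stress=True):
    """Map each word to a list of pronunciations (each a list of phones).

    Hypothetical helper for illustration only; not part of SPRAAK.
    """
    lexicon = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comment lines
        word, *phones = line.split()
        word = _VARIANT_SUFFIX.sub("", word)  # "READ(1)" -> "READ"
        if strip_stress:
            # Drop the lexical stress digit: "IY1" -> "IY".
            phones = [phone.rstrip("012") for phone in phones]
        lexicon[word].append(phones)
    return dict(lexicon)


# Illustrative entries written in the CMU dictionary layout.
sample = [
    ";;; comment line",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
    "SPEECH  S P IY1 CH",
]
lexicon = parse_cmudict(sample)
```

Whether stress markers should be kept or removed depends on the phone set of the acoustic model, so `strip_stress` is left as a switch rather than a fixed choice.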

Included in this tutorial are

- scripts to do the necessary format conversions from the standard LDC formats to the formats used in SPRAAK
- a small context-independent acoustic model that is adequate for bootstrapping more complex models. It will be used to generate phone/state-based segmentations (Viterbi alignments) from which more sophisticated trainings can be launched
- configuration files and scripts to run the example experiments

About The Wall Street Journal Benchmark

The WSJ benchmark was established in the early 1990s as a means of driving the development of large vocabulary speech recognition and establishing a uniform evaluation methodology for it. The benchmark has mainly been used to evaluate progress in acoustic modeling. While it is a popular and useful benchmark, one should also understand its limitations. Because it consists of read speech in high-quality recordings, neither the results nor a system trained on this data alone are representative of real-life applications.

Acoustic Model Training Data:
- WSJ0 : 84 spkrs.
- WSJ0+1 : 284 spkrs.
Speech Quality:
- read speech
- high quality recordings
Language Models
- 34M words of WSJ data from preceding years
- bigram, trigram with verbalized or non-verbalized punctuation
Development Set
- 82000 utterances, 8hrs.
Test Sets
- Nov. 92 5k and 20k vocabularies with or without verbalized punctuation
  - 8 spkrs for spkr independent test set
  - 12 spkrs for spkr dependent test set