Speech Recognition works with large corpora of data both for training and evaluation. SPRAAK has a

Corpora are descriptions of a set of recordings, possibly with multiple levels of annotation. When training an HMM we need plenty of examples to perform the task. The information on all training data is combined in the train corpus "./resources/train.cor". Apart from the filenames, the corpus file may contain the transcription, time information, speaker information, and so on. More information on corpora is found in Corpus File.

Segmentations tell us which frames align with which acoustic or linguistic unit. Segmentations may be obtained by running a Viterbi alignment using an existing model. Alternatively, segmentations may be made by hand. The TIMIT corpus comes with hand-labeled phone segmentations, expressed in msec. We first converted the continuous-time segmentations to discrete 10 msec frames and then, for each phone, split its frames equally over the number of states that we use per phone, as specified in the .cd file. The resulting discrete state segmentations are given in ./resources/train_hand_states.seg. More information on the format of segmentation files is found in Segmentation File.
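
The conversion described above can be illustrated with a minimal Python sketch. It is not the script that produced train_hand_states.seg, and it neither reads the .cd file nor writes the SPRAAK segmentation format. The function names, the phone labels, the three states per phone, the rounding of boundaries to the nearest frame, and the rule that leftover frames go to the first states are all assumptions made for illustration.

```python
import math

FRAME_SHIFT_MS = 10  # frame length stated in the text


def ms_to_frame(t_ms, frame_shift_ms=FRAME_SHIFT_MS):
    """Map a time in msec to a frame index.

    Assumption: boundaries are rounded to the nearest frame (half rounds up).
    """
    return math.floor(t_ms / frame_shift_ms + 0.5)


def split_frames_over_states(start_frame, end_frame, n_states):
    """Split the frames [start_frame, end_frame) as equally as possible.

    Returns a list of (state_index, first_frame, end_frame_exclusive).
    Assumption: when the frame count is not divisible by n_states, the
    leftover frames go to the first states. A phone with fewer frames than
    states leaves some states empty.
    """
    base, extra = divmod(end_frame - start_frame, n_states)
    segments = []
    pos = start_frame
    for state in range(n_states):
        length = base + (1 if state < extra else 0)
        segments.append((state, pos, pos + length))
        pos += length
    return segments


def phones_to_state_segments(phone_segs, states_per_phone):
    """Turn (phone, start_ms, end_ms) entries into per-state frame segments."""
    result = []
    for phone, start_ms, end_ms in phone_segs:
        first, last = ms_to_frame(start_ms), ms_to_frame(end_ms)
        n_states = states_per_phone[phone]
        for state, a, b in split_frames_over_states(first, last, n_states):
            result.append((phone, state, a, b))
    return result


if __name__ == "__main__":
    # Hypothetical hand segmentation and state counts, for illustration only.
    phone_segs = [("sil", 0, 95), ("ah", 95, 170)]
    states_per_phone = {"sil": 3, "ah": 3}
    for seg in phones_to_state_segments(phone_segs, states_per_phone):
        print(seg)
```

In this example the 95 msec boundary rounds to frame 10, so "sil" covers 10 frames split as 4, 3 and 3, and "ah" covers 7 frames split as 3, 2 and 2. Frame ranges are half-open, so the end frame of one state is the first frame of the next.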