SPRAAK
Example Linguistic Resources

Linguistic resources are an essential component of every speech recognition system. They include the phonetic alphabets, dictionaries, and similar files. These are typically derived from expert knowledge and refined over the years based on experience with a speech recognizer.

For your convenience, a number of resources for some popular speech recognition benchmark tasks (TIMIT, WSJ, and others) are provided together with the SPRAAK distribution. Note that the databases themselves are NOT included in SPRAAK and must be obtained from LDC, ELDA, or another appropriate database distribution agency. For each database (named DBASE), these resources are bundled in the directory '$SPR_HOME/examples/$DBASE/resources'. The rest of this page uses the TIMIT database as its example; the same layout applies to the other databases.

Phonetic Alphabet

The Phone Alphabet is stored in a ".ci file"; it defines the phones that can be used (or, by extension, any type of acoustic subword unit). TIMIT uses an alphabet with 51 phones. The alphabet in "timit51.ci" differs slightly from the one used on the TIMIT CDs. The reason is that SPRAAK writes phonetic transcriptions without spaces, so such transcriptions must be parsable in a unique way in left-to-right direction. The translation between the original transcriptions and the example alphabet is given in "phon.xlat". "phon.39.xlat" gives the mapping from the original TIMIT phone set to the reduced set of 39 symbols that is typically used for evaluating experiments on TIMIT.
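The unique-parsing requirement can be illustrated with a short Python sketch. It assumes one sufficient condition, namely that no symbol is a prefix of another symbol; the exact rule SPRAAK applies may differ. The symbol sets and function names below are hypothetical and are not taken from "timit51.ci".

```python
def find_prefix_conflicts(symbols):
    """Return (short, long) pairs where one symbol is a proper prefix of another."""
    ordered = sorted(set(symbols))
    return [(a, b) for a in ordered for b in ordered
            if a != b and b.startswith(a)]


def parse_left_to_right(transcription, symbols):
    """Split a space-free transcription into symbols, scanning left to right.

    Raises ValueError when zero or several symbols match at a position.
    """
    symbol_set = set(symbols)
    max_len = max(len(s) for s in symbol_set)
    phones, pos = [], 0
    while pos < len(transcription):
        # Never look past the end of the string.
        longest = min(max_len, len(transcription) - pos)
        matches = [transcription[pos:pos + n] for n in range(1, longest + 1)
                   if transcription[pos:pos + n] in symbol_set]
        if len(matches) != 1:
            raise ValueError(f"position {pos}: candidates {matches}")
        phones.append(matches[0])
        pos += len(matches[0])
    return phones


# Hypothetical symbol sets for illustration only.
ambiguous = ["a", "ax", "s", "sh", "t"]    # "a"/"ax" and "s"/"sh" collide
prefix_free = ["a", "A", "s", "S", "t"]    # one character per phone

print(find_prefix_conflicts(ambiguous))         # [('a', 'ax'), ('s', 'sh')]
print(find_prefix_conflicts(prefix_free))       # []
print(parse_left_to_right("Sat", prefix_free))  # ['S', 'a', 't']
```

With the first set, a transcription such as "axt" cannot be split unambiguously without spaces, which is the kind of problem a renamed alphabet avoids.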

NOTE: The extension ".ci" is not enforced; it reflects that the phone alphabet also serves as the Acoustic Unit File that defines the "context-independent" acoustic units.

Unit File

The Unit File (.cd file) contains the state definitions for each acoustic unit. It looks identical to the phonetic alphabet file, except for a number of extra columns.
This file has multiple purposes:

The example file "timit51.cd" illustrates the versatile state assignment in SPRAAK. The number of states can differ from unit to unit, and by explicitly numbering the states the designer can enforce state tying if desired. The number of states used per phone is based on our past experience with TIMIT experiments.
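The idea of per-unit state counts and tying through explicit state numbers can be sketched in Python as follows. This is only a conceptual model: the table, the unit names, and the state numbers are hypothetical, and the sketch does not reproduce the column layout of "timit51.cd".

```python
# Hypothetical unit -> state-number table (illustrative values only).
unit_states = {
    "aa": [0, 1, 2],   # a unit modelled with three states
    "b": [3],          # a short unit with a single state
    "d": [3],          # reuses state 3, so it is tied with "b"
    "iy": [4, 5, 6],
}


def tied_states(table):
    """Return {state number: sorted unit names} for states shared by several units."""
    owners = {}
    for unit, states in table.items():
        for state in states:
            owners.setdefault(state, set()).add(unit)
    return {s: sorted(units) for s, units in owners.items() if len(units) > 1}


def count_distinct_states(table):
    """Count the distinct states that would need their own acoustic model."""
    return len({state for states in table.values() for state in states})


print(tied_states(unit_states))            # {3: ['b', 'd']}
print(count_distinct_states(unit_states))  # 7
```

Reusing a state number in two units is what ties them: both units then refer to the same state, so the total number of distinct states is smaller than the sum of the per-unit counts.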

NOTE: The '.cd' extension is used because, more often than not, this is also the file in which the "context-dependent" acoustic units are defined.

The Lexicon

The file "timit51.dic" is the Lexicon (or Dictionary); it contains a list of words transcribed as sequences of phones. Because TIMIT is used here for phone recognition, the word level is not really used. The TIMIT lexicon therefore contains mostly dummy information that maps a phone acoustic unit to a phone grapheme unit. The file must be present nonetheless, because SPRAAK requires a lexicon.

More information on the lexicon file format can be found in Lexicon File, and more details on the rules and conventions that apply to word and phone transcriptions are given in Transcriptions: Conventions & Rules.