Word Transcriptions

A sentence or paragraph is transcribed as a sequence of words:

Sentence        this_is_a_sentence

SPRAAK uses the "_" symbol as an explicit word separator in order to make parsing unique and to give an indication that something might be happening (optional silence, cross-word assimilation, ...). It also allows the transcription to be written as a single STRING, without the need for quotes, which simplifies parsing throughout the package.
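
As an illustration of the single-string convention, the following minimal Python sketch splits a word transcription on the separator. The function name split_words is hypothetical and not part of SPRAAK; the sketch assumes that no word contains '_' internally.

```python
def split_words(transcription: str) -> list:
    """Split a word transcription on the '_' word separator.

    Hypothetical helper, not a SPRAAK function. Assumes that '_' never
    occurs word-internally.
    """
    return transcription.split("_")


words = split_words("this_is_a_sentence")
print(words)
```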

Reserved words & characters

A number of words have a reserved or at least recommended usage.

'_' is reserved as the word separator symbol and should not be used word-internally
<s> for 'sentence start' [is not in the lexicon as it has no acoustic evidence associated with it]
</s> for 'sentence end' [SHOULD BE in the lexicon as it requires some evidence]
<UNK> for 'unknown word' (also called out-of-vocabulary word or OOV)
<PARTIAL> for partially decoded word, i.e. if the decoder could not reach the end state of a word
<sil> for 'silence' [recommended, not formally reserved]
<gbg> for 'garbage' [recommended, not formally reserved]

Phone Transcriptions

A lexicon (Lexicon File) contains the canonical transcription of words in terms of phones, or more generally in terms of acoustic units, since SPRAAK can use any user-defined subword unit such as phones, syllables, morphs, or full words.

Example:

hij             [i/I/hE+[j/]] 
hij             [i]/[I]/[hE+]/[hE+j]

The above example shows 2 different ways of representing the 4 pronunciation variants of the Dutch word 'hij'.

Phone Alphabet Rules

Phones may be represented by single and/or multiple characters, but the symbols must be chosen so that they can be parsed uniquely from left to right using a greedy algorithm, i.e. the longest phone string that matches is the correct one.
Reserved characters within the phone (acoustic unit) alphabet are:
- '/' [forward slash] for separating pronunciation variants
- '[',']' [square brackets] for grouping pronunciation variants
- '(',')' [round brackets] for including pronunciation probabilities [reserved for future release]
- '=' [equal sign] for usage in assimilation rules
- '_' [underscore] for usage as word separator
- '|' [vertical bar] for non-emitting acoustic units such as word or syllable boundary
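
The greedy left-to-right rule can be sketched as follows. This is a minimal Python illustration, not SPRAAK code: the function name tokenize_phones and the example alphabet (in which 'E+' is taken to be a single two-character phone) are assumptions made for the example.

```python
def tokenize_phones(transcription: str, alphabet: set) -> list:
    """Split a concatenated phone string using greedy longest match.

    Hypothetical helper: at every position the longest alphabet entry
    that matches is taken, as required by the phone alphabet rules.
    """
    if not alphabet:
        raise ValueError("the phone alphabet is empty")
    max_len = max(len(phone) for phone in alphabet)
    phones = []
    pos = 0
    while pos < len(transcription):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(transcription) - pos), 0, -1):
            candidate = transcription[pos:pos + length]
            if candidate in alphabet:
                phones.append(candidate)
                pos += length
                break
        else:
            raise ValueError(f"no phone matches at position {pos}")
    return phones


# Assumed example alphabet; 'E' and 'E+' are both present on purpose.
alphabet = {"h", "E", "E+", "j", "i", "I"}
print(tokenize_phones("hE+j", alphabet))
```

Because 'E+' is tried before 'E', the string hE+j is read as h, E+, j rather than failing on the stray '+'.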

Phone Transcription Rules

The individual phones in a transcription are concatenated, i.e. they are not separated by a delimiter.
Pronunciation variants are separated by forward slashes '/' and grouped by square brackets '[ ]':
- nesting is allowed
- [] has precedence over /
- Example: [i/I/hE+[j/]] is equivalent to [i]/[I]/[hE+]/[hE+j]
- [NOT IMPLEMENTED YET !!] Probabilities can be added before any phone or before the empty set [] using a floating point value between round brackets: A[(.5)B/(.8)C(.5)D/(.1)[]]E
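
The equivalence of the nested and the flat notation can be checked with a small recursive-descent expansion. The Python sketch below is an illustration under stated assumptions, not the SPRAAK parser: the name expand_variants is hypothetical, the phones of each variant are left concatenated as a plain string, and the (not yet implemented) probability syntax is not handled.

```python
def expand_variants(spec: str) -> list:
    """Expand a transcription with '/' and '[ ]' into all plain variants.

    Hypothetical helper. Grammar assumed here:
      alternatives := sequence ('/' sequence)*
      sequence     := (character | '[' alternatives ']')*
    """
    pos = 0

    def parse_alternatives() -> list:
        nonlocal pos
        variants = parse_sequence()
        while pos < len(spec) and spec[pos] == "/":
            pos += 1  # skip '/'
            variants += parse_sequence()
        return variants

    def parse_sequence() -> list:
        nonlocal pos
        results = [""]  # an empty sequence yields the empty variant
        while pos < len(spec) and spec[pos] not in "/]":
            if spec[pos] == "[":
                pos += 1  # skip '['
                options = parse_alternatives()
                if pos >= len(spec) or spec[pos] != "]":
                    raise ValueError("missing ']'")
                pos += 1  # skip ']'
            else:
                options = [spec[pos]]
                pos += 1
            # Concatenate every variant so far with every new option.
            results = [r + o for r in results for o in options]
        return results

    variants = parse_alternatives()
    if pos != len(spec):
        raise ValueError(f"unexpected {spec[pos]!r} at position {pos}")
    return variants


nested = expand_variants("[i/I/hE+[j/]]")
flat = expand_variants("[i]/[I]/[hE+]/[hE+j]")
print(sorted(nested) == sorted(flat))
```

Both notations expand to the same four variants; only the order in which they are produced differs.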

Assimilation Rules

Assimilation rules may optionally be added to a lexicon. They are described together with rules applying to word concatenations in Word Concatenation and Assimilation Rules.

Allophones (context-dependent phones)

Today's systems often assign different acoustic models to a phone depending on its context. Context-dependent phones are written as the concatenation of the context-independent phone and a unique numerical identifier. This identifier is an absolute number spanning ALL phones; hence it gives the n'th allophone of the alphabet (not the n'th allophone of the specified ci_phone).

mist    m245 i27 s1345 t4378

The above example gives a context-dependent transcription of the phoneme string mist using cd units 245,27,1345,4378.
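
A context-dependent unit in this notation can be split back into its context-independent phone and its allophone number. The Python sketch below is only an illustration: the name parse_cd_unit is hypothetical, and it assumes that phone names never end in a digit (otherwise the split would be ambiguous).

```python
import re

# Phone name (shortest possible prefix) followed by the trailing digits.
_CD_UNIT = re.compile(r"^(?P<phone>.+?)(?P<index>\d+)$")


def parse_cd_unit(unit: str) -> tuple:
    """Split a unit such as 'm245' into ('m', 245).

    Hypothetical helper; assumes phone names do not end in a digit.
    """
    match = _CD_UNIT.match(unit)
    if match is None:
        raise ValueError(f"not a context-dependent unit: {unit!r}")
    return match.group("phone"), int(match.group("index"))


units = [parse_cd_unit(u) for u in "m245 i27 s1345 t4378".split()]
print(units)
```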

The context corresponding to a given allophone is specified in the .cd file (Acoustic Unit File).

r2128   [pbkgfvxG*#]-r-[p]

The right-hand side of this definition shows the left and right context for a triphone model. Quinphones are represented by [L2][L1]-ph-[R1][R2], and so on; the context lists are to be interpreted as 'OR' lists.

State Transcriptions

The states belonging to a phone are specified in the Acoustic Unit File. States are indicated by numbers (counting starts at 0) and are entities by themselves, i.e. they are not private to an acoustic unit, but can be shared across as many units as wanted.

Most often states will be referenced by their numerical identifier. On certain occasions, however, it may be handier not to use the absolute numerical identifier but a reference that involves the allophonic identity, e.g. i27#0, which refers to:

state '0'
in context-dependent unit '27'
which is an allophone of i
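
Such a state reference can be decomposed mechanically. The Python sketch below is an illustration and not SPRAAK code: the name parse_state_ref is hypothetical, the allophone number is treated as optional so that references such as s#1 are accepted as well, and phone names are assumed not to end in a digit. The split is made on the last '#', so a unit that is itself named '#' (the default silence symbol) is still handled.

```python
import re
from typing import Optional, Tuple

# Phone name, optionally followed by an allophone number.
_UNIT = re.compile(r"^(?P<phone>.+?)(?P<index>\d+)?$")


def parse_state_ref(ref: str) -> Tuple[str, Optional[int], int]:
    """Split 'i27#0' into (phone, allophone number or None, state number).

    Hypothetical helper; assumes phone names do not end in a digit.
    """
    unit, sep, state = ref.rpartition("#")  # split on the LAST '#'
    if not sep or not unit or not state.isdigit():
        raise ValueError(f"not a state reference: {ref!r}")
    match = _UNIT.match(unit)
    if match is None:
        raise ValueError(f"not an acoustic unit: {unit!r}")
    index = match.group("index")
    return match.group("phone"), (int(index) if index else None), int(state)


print(parse_state_ref("i27#0"))
print(parse_state_ref("s#1"))
```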

Remarks, Bugs and Limitations

The '#' symbol is used for a number of different meanings in the SPRAAK package. While never leading to parsing problems, it may somewhat hamper readability:

'#' is the first symbol of the separator line between header and data
'#' is the default symbol for silence
'#' is used as separator symbol between <acoustic unit name> and <allophone number>
'#' is used as separator symbol between <acoustic unit> and <state number> when transcribing individual states (e.g. s#1 indicates the second state of unit 's' - counting starts at '0' as always)

Assimilation Rules still use the HMM7.5 implementation

Probabilistic Pronunciation Variants are NOT IMPLEMENTED YET