- Contents
A lexicon file contains the transcription of lexical units (typically words) in acoustic units (typically phones). It may also contain assimilation rules. The default extension is '.lex', though legacy '.dic' files are commonly used as well. The lexicon is stored as a '.spr' file, with the dictionary words in the left column and their transcriptions in the right column.
- Example
.spr
DATA DICTIONARY
TYPE STRING
UNIT_TYPE PHONEME, YAPA SET
LANGUAGE DUTCH
DIM1 10
#
<sil> #
<gbg> ***
</s> ###
hij [i/I/hE+[j/]]
moet mut
goed Gut
uitkijken @+[jtkE+jk@[n/]/tkE+k@]
voor vor
mistbanken mI[st/z]bANk@[n/]
mistbanken1 [mIstbANk@n/mIzbANk@n/mIstbANk@]
In the above example, 'mistbanken1' explicitly lists 3 of the 4 pronunciation variants encoded by the more compact representation in 'mistbanken'.
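The two-column layout can be illustrated with a few lines of Python. This is a minimal sketch, not SPRAAK code: it assumes that the header ends at the first line consisting solely of '#', and that each entry is a word, whitespace, and a transcription. The function name `read_lexicon` is hypothetical.

```python
def read_lexicon(text):
    """Split lexicon text into header fields and (word, transcription) pairs.

    Assumption: the header ends at the first line that is exactly '#'.
    """
    header, entries = {}, []
    in_header = True
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace only.
        parts = line.split(None, 1)
        first = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        if in_header:
            if line == "#":
                in_header = False      # end of header, entries follow
            else:
                header[first] = rest   # e.g. header["DATA"] = "DICTIONARY"
            continue
        entries.append((first, rest))
    return header, entries


sample = """.spr
DATA DICTIONARY
TYPE STRING
DIM1 3
#
<sil> #
hij [i/I/hE+[j/]]
moet mut
"""

header, entries = read_lexicon(sample)
```

Note that the entry `<sil> #` is not mistaken for the header terminator, because only a line that is exactly '#' ends the header. In the example above, DIM1 equals the number of entries (10); the sketch does not rely on that.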
- Reserved words
A number of word transcriptions have a reserved or at least recommended usage.
-
<s> for 'sentence start' [is not in the lexicon as it has no acoustic evidence associated with it]
-
</s> for 'sentence end' [SHOULD BE in the lexicon as it requires some evidence]
-
<UNK> for 'unknown word' (also called out-of-vocabulary word or OOV)
-
<PARTIAL> for a partially decoded word, i.e. if the decoder could not reach the end state of a word
-
<sil> for 'silence' [recommended, not formally reserved]
-
<gbg> for 'garbage' [recommended, not formally reserved]
- Phonetic Alphabet Rules
-
Phones may be represented by single and/or multiple characters, but the symbols must be chosen so that they can be parsed uniquely from left to right using a greedy algorithm, i.e. the longest phone string that matches is the correct one.
-
Reserved characters within the phonetic (acoustic unit) alphabet are:
-
'/' [forward slash] for separating pronunciation variants
-
'[',']' [square brackets] for grouping pronunciation variants
-
'(',')' [round brackets] for including pronunciation probabilities [reserved for future release]
-
'=' [equal sign] for usage in assimilation rules
-
'_' [underscore] for usage as word separator
-
'|' [vertical bar] for non-emitting acoustic units such as word or syllable boundary
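The greedy left-to-right parsing rule can be sketched as follows. This is an illustration under assumptions, not the SPRAAK implementation: the function name `tokenize` and the small phone inventory `PHONES` are hypothetical, chosen only to show a two-character symbol ('E+') competing with a one-character one ('E').

```python
def tokenize(transcription, phones):
    """Split a phone string into phones by greedy longest match, left to right."""
    max_len = max(len(p) for p in phones)
    result, pos = [], 0
    while pos < len(transcription):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(transcription) - pos), 0, -1):
            candidate = transcription[pos:pos + length]
            if candidate in phones:
                result.append(candidate)
                pos += length
                break
        else:
            raise ValueError(
                f"no phone matches at position {pos}: {transcription[pos:]!r}"
            )
    return result


# Hypothetical inventory, not the actual phone set used in the example lexicon.
PHONES = {"h", "E", "E+", "j", "m", "u", "t"}

tokens = tokenize("hE+j", PHONES)
```

With this inventory, "hE+j" is split into 'h', 'E+', 'j' because 'E+' is preferred over the shorter 'E'. An inventory in which the greedy choice leads to a dead end would violate the unique-parsing requirement stated above.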
- Transcription Rules
-
The individual phones in a transcription are concatenated, i.e. they are not separated by a delimiter.
-
Pronunciation variants are separated by forward slashes '/' and grouped by square brackets '[ ]'
-
nesting is allowed
-
[] has precedence over /
-
a / can only occur within a [ ... ] block
-
Example: [i/I/hE+[j/]] is equivalent to [[i]/[I]/[hE+]/[hE+j]]
-
[NOT IMPLEMENTED YET !!] Probabilities can be added before any phone or before the empty set [] using a floating point value between round brackets:
A[(.5)B/(.8)C(.5)D/(.1)[]]E
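The grouping rules above can be checked by expanding a compact transcription into its explicit variants. The following is a minimal sketch with hypothetical function names; it works character by character, does not split the result into phones, and does not handle probabilities or assimilation rules.

```python
def expand(transcription):
    """Return all pronunciations encoded by a transcription with [ / ] groups."""
    variants, pos = _sequence(transcription, 0)
    if pos != len(transcription):
        # A '/' or ']' at top level: '/' may only occur inside a [ ... ] block.
        raise ValueError(f"unexpected {transcription[pos]!r} at position {pos}")
    return variants


def _sequence(text, pos):
    """Parse a concatenation of symbols and groups; stop at '/' or ']'."""
    variants = [""]
    while pos < len(text) and text[pos] not in "/]":
        if text[pos] == "[":
            group, pos = _group(text, pos + 1)
            variants = [v + g for v in variants for g in group]
        else:
            variants = [v + text[pos] for v in variants]
            pos += 1
    return variants, pos


def _group(text, pos):
    """Parse the alternatives that follow '[' up to the matching ']'."""
    alternatives = []
    while True:
        seq, pos = _sequence(text, pos)
        alternatives.extend(seq)
        if pos >= len(text):
            raise ValueError("missing ']'")
        if text[pos] == "]":
            return alternatives, pos + 1
        pos += 1  # skip the '/' separating two alternatives


hij = expand("[i/I/hE+[j/]]")
mistbanken = expand("mI[st/z]bANk@[n/]")
```

Here `hij` holds the four variants i, I, hE+j and hE+, and `mistbanken` holds the four variants of which 'mistbanken1' in the example lexicon lists three.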
- Assimilation Rules
Assimilation rules may optionally be added to a lexicon.
-
The presence of a '=' sign indicates that an entry in the lexicon is an assimilation rule
-
When writing assimilation rules, a single phone or the empty set [] can be replaced by another phone, by the empty set or by a complex construction:
[A/E]B=CD=[]
[A/E][]=CD=E
[A/E]B=[X/Y/[]]D=E/F
Note: square brackets can be used to make the rules more readable:
[A/E][B=C]D
[A/E][[]=C]D
[A/E][B=[]]D
[A/E][B=[B/C/[]]][D=E/F]
-
When writing assimilation rules, probabilities can only be added on the right-hand side of the '=' sign:
[A/E][B=[(.1)B/(.7)C/(.2)[]]]D
This 'flat' notation strikes a good balance between readability and expressiveness. In the few cases where very complex descriptions are needed, the following formats can be used:
=<nr_of_nodes>[<from_node>/<to_node>/(<prob>)<phone>]...
=<nr_of_nodes>[<from_node>/<to_node>/<phone>=(<prob>)<phone>]...
The (<prob>) fields are optional. For example, the assimilation rule
[A/E][B=[(.1)B/(.7)C/(.2)[]]]D
can also be written as:
=4[0/1/A][0/1/E][1/2/B=(.1)B][1/2/B=(.7)C][1/2/B=(.2)[]][2/3/D]
Invocation of assimilation rules is governed by parameters set in the 'unwind' options, as described in Word Concatenation and Assimilation Rules.
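To make the node-based notation concrete, the sketch below parses such a rule into a node count and a list of arcs. It is an illustration only: the function name `parse_node_rule` and the arc tuple layout are hypothetical, and treating an arc without '=' as one that keeps its phone unchanged is an interpretation, not something the format description states.

```python
import re


def parse_node_rule(rule):
    """Parse '=<n>[from/to/label]...' into (n_nodes, arcs).

    Each arc is (from_node, to_node, source, target, prob);
    prob is None when no (<prob>) field is given.
    """
    match = re.match(r"=(\d+)", rule)
    if not match:
        raise ValueError("rule must start with '=<nr_of_nodes>'")
    n_nodes = int(match.group(1))
    pos, arcs = match.end(), []
    while pos < len(rule):
        if rule[pos] != "[":
            raise ValueError(f"expected '[' at position {pos}")
        # Find the matching ']' while allowing a nested '[]' (empty set).
        depth, end = 0, pos
        while end < len(rule):
            if rule[end] == "[":
                depth += 1
            elif rule[end] == "]":
                depth -= 1
                if depth == 0:
                    break
            end += 1
        if depth != 0:
            raise ValueError("unbalanced brackets")
        src_node, dst_node, label = rule[pos + 1:end].split("/", 2)
        source, has_rewrite, target = label.partition("=")
        if not has_rewrite:
            target = source              # form [from/to/(prob)phone]
        prob = None
        prob_match = re.match(r"\(([^)]*)\)(.*)", target)
        if prob_match:
            prob, target = float(prob_match.group(1)), prob_match.group(2)
        if not has_rewrite:
            source = target              # assumed: phone is kept unchanged
        arcs.append((int(src_node), int(dst_node), source, target, prob))
        pos = end + 1
    return n_nodes, arcs


n_nodes, arcs = parse_node_rule(
    "=4[0/1/A][0/1/E][1/2/B=(.1)B][1/2/B=(.7)C][1/2/B=(.2)[]][2/3/D]"
)
```

For the example rule this yields 4 nodes and 6 arcs, three of which rewrite B between nodes 1 and 2 with probabilities that add up to 1.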
- Tools
- SPRAAK contains several tools to help in constructing the pronunciation network:
-
Lexica can be read and converted to FSTs.
-
Assimilation rules can be read and converted to FSTs.
-
The description of the tied-state context-dependent phones can be read and converted to an FST.
-
Orthographic transcriptions can be read and converted to an FST. These orthographic transcriptions may even contain alternatives:
... apple computer [ inc. / incorporated ] ...
-
All resources can be combined using FST composition.
-
Using off-line tools, most language models can be converted to FSTs, allowing the integration of the LM constraint into the pronunciation network.
See e.g. spr_lex_cvt.c.
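As a rough illustration of what converting an orthographic transcription with alternatives to an FST involves, the sketch below builds a plain arc list for a word acceptor. It is not how the SPRAAK tools work internally: the function name `orthography_to_arcs`, the '<eps>' label for an empty alternative, and the restriction to whitespace-separated tokens without nested brackets are all assumptions made for this example.

```python
def orthography_to_arcs(text):
    """Turn 'a [ b / c d ] e' into (arcs, final_state).

    Arcs are (from_state, to_state, word); state 0 is the start state.
    Assumes '[', '/' and ']' are whitespace-separated tokens, no nesting.
    """
    tokens = text.split()
    arcs, state, fresh = [], 0, 1
    i = 0
    while i < len(tokens):
        if tokens[i] == "[":
            close = tokens.index("]", i)
            # Collect the alternatives as lists of words.
            alternatives, current = [], []
            for tok in tokens[i + 1:close]:
                if tok == "/":
                    alternatives.append(current)
                    current = []
                else:
                    current.append(tok)
            alternatives.append(current)
            join, fresh = fresh, fresh + 1   # state where all branches meet
            for alt in alternatives:
                if not alt:
                    arcs.append((state, join, "<eps>"))  # hypothetical label
                    continue
                src = state
                for k, word in enumerate(alt):
                    if k == len(alt) - 1:
                        dst = join
                    else:
                        dst, fresh = fresh, fresh + 1
                    arcs.append((src, dst, word))
                    src = dst
            state = join
            i = close + 1
        else:
            arcs.append((state, fresh, tokens[i]))
            state, fresh = fresh, fresh + 1
            i += 1
    return arcs, state


arcs, final_state = orthography_to_arcs("apple computer [ inc. / incorporated ]")
```

Both alternatives become parallel arcs between the same pair of states, so any path from state 0 to the final state spells one admissible word sequence.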