SPRAAK
Lexicon File

A lexicon file contains the transcription of lexical units (typically words) in terms of acoustic units (typically phones). It may also contain assimilation rules. The default extension is '.lex', although legacy '.dic' files are also in common use. The lexicon is stored in the '.spr' file format, with the words of the dictionary in the left column and their transcriptions in the right column.

Example
.spr
DATA            DICTIONARY      
TYPE            STRING
UNIT_TYPE       PHONEME, YAPA SET
LANGUAGE        DUTCH
DIM1            10
#
<sil>           #
<gbg>           ***
</s>            ###
hij             [i/I/hE+[j/]]
moet            mut
goed            Gut
uitkijken       @+[jtkE+jk@[n/]/tkE+k@]
voor            vor
mistbanken      mI[st/z]bANk@[n/]
mistbanken1     [mIstbANk@n/mIzbANk@n/mIstbANk@]

In the above example, 'mistbanken1' explicitly lists 3 of the 4 pronunciation variants that the more compact notation in 'mistbanken' allows; the omitted variant is mIzbANk@.
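The compact notation can be unfolded mechanically. The following Python sketch is not part of SPRAAK. It assumes that [a/b] denotes a choice between alternatives, that an empty alternative means the segment is skipped, and that groups may nest, as the example suggests. The function name expand is hypothetical, and only the plain alternatives shown in the example are handled.

```python
def expand(transcription):
    """Return all pronunciation variants of a compact transcription."""
    variants, pos = _parse_sequence(transcription, 0)
    if pos != len(transcription):
        raise ValueError(
            f"unexpected {transcription[pos]!r} at position {pos}")
    return variants


def _parse_sequence(text, pos):
    # A sequence is a concatenation of literal characters and [..] groups.
    # It ends at '/', at ']' or at the end of the string.
    variants = [""]
    while pos < len(text) and text[pos] not in "/]":
        if text[pos] == "[":
            options, pos = _parse_group(text, pos + 1)
        else:
            options, pos = [text[pos]], pos + 1
        # Every variant built so far is continued with every option.
        variants = [v + o for v in variants for o in options]
    return variants, pos


def _parse_group(text, pos):
    # A group is a '/'-separated list of sequences closed by ']'.
    options = []
    while True:
        sequence, pos = _parse_sequence(text, pos)
        options.extend(sequence)
        if pos >= len(text):
            raise ValueError("missing ']'")
        if text[pos] == "]":
            return options, pos + 1
        pos += 1  # skip the '/' separator


if __name__ == "__main__":
    for variant in expand("mI[st/z]bANk@[n/]"):
        print(variant)
```

Applied to mI[st/z]bANk@[n/], the sketch yields the four variants mIstbANk@n, mIstbANk@, mIzbANk@n and mIzbANk@.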

Reserved words

A number of word transcriptions have a reserved or at least recommended usage.

Phonetic Alphabet Rules
Transcription Rules
Assimilation Rules

Assimilation rules may optionally be added to a lexicon.

This 'flat' notation strikes a good balance between readability and expressiveness. In the few cases where very complex descriptions are needed, the following formats can be used:

  =<nr_of_nodes>[<from_node>/<to_node>/(<prob>)<phone>]...
  =<nr_of_nodes>[<from_node>/<to_node>/<phone>=(<prob>)<phone>]...

The (<prob>) fields are optional. For example, the assimilation rule
[A/E][B=[(.1)B/(.7)C/(.2)[]]]D
can also be written as:
=4[0/1/A][0/1/E][1/2/B=(.1)B][1/2/B=(.7)C][1/2/B=(.2)[]][2/3/D]
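To make the structure of the explicit format concrete, the following Python sketch splits such a description into its arcs. It is a syntactic illustration only, not SPRAAK's own reader. The names Arc and parse_graph_rule are hypothetical, and the check that node numbers are 0-based and smaller than <nr_of_nodes> is an assumption drawn from the example.

```python
import re
from typing import NamedTuple, Optional


class Arc(NamedTuple):
    src: int                 # <from_node>
    dst: int                 # <to_node>
    match: Optional[str]     # phone left of '=', or None if absent
    prob: Optional[float]    # optional (<prob>) field
    phone: str               # phone right of '=' ('' for the empty '[]')


_HEADER = re.compile(r"=(\d+)")
_ARC = re.compile(
    r"\[(\d+)/(\d+)/"            # [<from_node>/<to_node>/
    r"(?:([^=\[\]()/]+)=)?"      # optional '<phone>='
    r"(?:\(([^()]*)\))?"         # optional '(<prob>)'
    r"(\[\]|[^\[\]()=/]+)\]"     # <phone> or the empty '[]', then ']'
)


def parse_graph_rule(text):
    """Return (nr_of_nodes, list of Arc) for an '=<n>[..][..]' description."""
    header = _HEADER.match(text)
    if header is None:
        raise ValueError("expected '=<nr_of_nodes>' at the start")
    n_nodes = int(header.group(1))
    arcs, pos = [], header.end()
    while pos < len(text):
        m = _ARC.match(text, pos)
        if m is None:
            raise ValueError(f"malformed arc at position {pos}")
        src, dst = int(m.group(1)), int(m.group(2))
        # Assumption: nodes are numbered 0 .. nr_of_nodes-1.
        if not (0 <= src < n_nodes and 0 <= dst < n_nodes):
            raise ValueError(f"node number out of range at position {pos}")
        prob = float(m.group(4)) if m.group(4) is not None else None
        phone = "" if m.group(5) == "[]" else m.group(5)
        arcs.append(Arc(src, dst, m.group(3), prob, phone))
        pos = m.end()
    return n_nodes, arcs


if __name__ == "__main__":
    rule = "=4[0/1/A][0/1/E][1/2/B=(.1)B][1/2/B=(.7)C][1/2/B=(.2)[]][2/3/D]"
    n_nodes, arcs = parse_graph_rule(rule)
    print(n_nodes)
    for arc in arcs:
        print(arc)
```

For the rule above, the sketch reports 4 nodes and 6 arcs; the three arcs between nodes 1 and 2 carry the probabilities .1, .7 and .2 of the flat notation.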

Invocation of assimilation rules is governed by parameters set in the 'unwind' options, as described in Word Concatenation and Assimilation Rules.

Tools
SPRAAK contains several tools to help in constructing the pronunciation network:
  • Lexica can be read and converted to FST's.
  • Assimilation rules can be read and converted to FST's.
  • The description of the tied-state context-dependent phones can be read and converted to an FST.
  • Orthographic transcriptions can be read and converted to an FST. These orthographic transcriptions may even contain alternatives:
    ... apple computer [ inc. / incorporated ] ...
  • All resources can be combined using FST composition.
  • Using off-line tools, most language models can be converted to FST's, allowing the integration of the LM constraint into the pronunciation network.

See e.g. spr_lex_cvt.c.
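
As a rough illustration of how an orthographic transcription with alternatives maps onto a word graph, the Python sketch below turns such a line into a list of arcs. It is not how the SPRAAK tools are implemented. The function name transcription_to_arcs is hypothetical, and the sketch assumes whitespace-separated '[', '/' and ']' tokens as in the example above, without nesting.

```python
def transcription_to_arcs(text):
    """Return (n_states, final_state, arcs); arcs are (from, to, word).

    A word of None marks an empty alternative (an epsilon arc).
    """
    tokens = text.split()
    arcs = []
    n_states = 1      # state 0 is the start state
    current = 0
    i = 0
    while i < len(tokens):
        if tokens[i] == "[":
            close = tokens.index("]", i)   # ValueError if ']' is missing
            group = tokens[i + 1:close]
            if "[" in group:
                raise ValueError("nested alternatives are not supported")
            # Split the group into alternatives at each '/' token.
            alternatives, alt = [], []
            for tok in group:
                if tok == "/":
                    alternatives.append(alt)
                    alt = []
                else:
                    alt.append(tok)
            alternatives.append(alt)
            # All alternatives start in 'current' and join in 'end'.
            end = n_states
            n_states += 1
            for alt in alternatives:
                if not alt:
                    arcs.append((current, end, None))
                    continue
                state = current
                for k, word in enumerate(alt):
                    if k == len(alt) - 1:
                        nxt = end
                    else:
                        nxt = n_states
                        n_states += 1
                    arcs.append((state, nxt, word))
                    state = nxt
            current = end
            i = close + 1
        else:
            arcs.append((current, n_states, tokens[i]))
            current = n_states
            n_states += 1
            i += 1
    return n_states, current, arcs


if __name__ == "__main__":
    line = "apple computer [ inc. / incorporated ]"
    n_states, final_state, arcs = transcription_to_arcs(line)
    for arc in arcs:
        print(arc)
```

For the example line, the two alternatives become parallel arcs between the same pair of states, so both word sequences end in the same final state.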