Features of the SPRAAK Toolkit

Overview

Figure 1 shows the main components of the SPRAAK recognizer in a typical configuration and with the default interaction between these components.

Figure 1: SPRAAK Architecture and Components

In a large-vocabulary speech recognition application, typically all components are active. When the SPRAAK engine is used for other tasks (e.g. preprocessing, alignment, training of acoustic models, ...), only a subset of these main components may be needed. For highly experimental setups, more complex configurations and interactions are possible.

The components shown in figure 1 are:

  • the preprocessing
  • the acoustic model
  • the lexicon and pronunciation network
  • the language model
  • the decoder
  • the (word) lattice (post)-processing

A detailed description of the underlying software implementation may be found in the developer's manual. Here we give a brief functional description of each of these modules.

The preprocessing

The preprocessing converts audio data (or partially preprocessed features) into a (new) feature stream. The preprocessing is described in a dedicated preproc-file as a fully configurable processing flow-chart.
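
The sketch below illustrates this idea only, not the actual preproc-file syntax: the flow-chart is modelled as an ordered list of processing stages (framing, windowing, log spectrum) that together turn raw audio samples into a feature stream. All names and parameter values are illustrative assumptions.

    import numpy as np

    def frame(signal, frame_len=400, frame_shift=160):
        """Cut the signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
        n = 1 + max(0, (len(signal) - frame_len) // frame_shift)
        idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n)[:, None]
        return signal[idx]

    def window(frames):
        """Apply a Hamming window to every frame."""
        return frames * np.hamming(frames.shape[1])

    def log_power_spectrum(frames, nfft=512):
        """Compute a log power spectrum per frame."""
        spec = np.abs(np.fft.rfft(frames, nfft)) ** 2
        return np.log(spec + 1e-10)

    # The "flow-chart" is simply an ordered chain of stages; adding, removing
    # or swapping stages yields a different feature stream.
    pipeline = [frame, window, log_power_spectrum]

    audio = np.random.randn(16000)      # one second of (synthetic) 16 kHz audio
    features = audio
    for stage in pipeline:
        features = stage(features)
    print(features.shape)               # (number_of_frames, feature_dimension)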

Available processing modules include:

Important properties and constraints of the preprocessing module are:

See Also: Feature Extraction

The acoustic model

The acoustic model calculates observation likelihoods for the hidden Markov model (HMM) states.

Features are:

DNNs
  • fast CPU-based (no GPU required) DNNs (evaluation only, training will require a GPU but is not yet implemented)
  • a legacy MLP implementation featuring, amongst others, a novel fast hierarchical (and hence deep) structure (CPU-based; Viterbi-based training is also available)
GMMs
  • Gaussian mixture densities with full sharing
  • fast evaluation for tied Gaussians by data-driven pruning based on 'Fast Removal of Gaussians' (FRoG)
  • model topology is decoupled from the observation likelihoods, allowing for any number of states in any phone-sized unit
  • dedicated modules for initializing and updating the acoustic models (training and/or speaker adaptation)
  • access to all components of the acoustic model (the Gaussian set, FRoG, ...)
Other
  • legacy implementations for discrete density models
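
To make the observation-likelihood computation concrete, here is a minimal sketch of evaluating a shared (tied) set of diagonal-covariance Gaussians once and combining them per state with state-specific mixture weights; the function names, shapes and sizes are assumptions for illustration and do not correspond to the SPRAAK API. A FRoG-like scheme would additionally skip Gaussians predicted to be negligible for the current feature vector.

    import numpy as np

    def log_gaussians(x, means, log_vars):
        """Per-Gaussian log density of feature vector x (diagonal covariance)."""
        d = x.shape[0]
        diff = x - means                                    # (n_gauss, d)
        return -0.5 * (d * np.log(2 * np.pi) + log_vars.sum(axis=1)
                       + (diff ** 2 / np.exp(log_vars)).sum(axis=1))

    def state_log_likelihoods(x, means, log_vars, mix_weights):
        """Observation log-likelihood per HMM state.

        mix_weights[s, g] is the weight of shared Gaussian g in state s; since
        the Gaussian set is shared, every density is evaluated only once.
        """
        lg = log_gaussians(x, means, log_vars)              # (n_gauss,)
        scores = np.log(mix_weights + 1e-30) + lg[None, :]  # (n_states, n_gauss)
        m = scores.max(axis=1, keepdims=True)               # log-sum-exp per state
        return (m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))).ravel()

    rng = np.random.default_rng(0)
    n_gauss, n_states, dim = 256, 50, 39
    means = rng.normal(size=(n_gauss, dim))
    log_vars = np.zeros((n_gauss, dim))
    weights = rng.dirichlet(np.ones(n_gauss), size=n_states)
    print(state_log_likelihoods(rng.normal(size=dim), means, log_vars, weights).shape)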

Lexicon and Pronunciation Network

The lexicon is stored as a pronunciation network in a (possibly cyclic) finite state transducer (FST) that maps HMM states (the input symbols of the FST) to some higher level (the output symbols of the FST). Typically, the output symbols are words. Having phones as output symbols is also possible and results in a phone recognizer.

Apart from (word) pronunciations as such, this network can also encode assimilation rules and may use context dependent phones as learned by the acoustic model.

The same network may also be used to encode the constraints imposed by an FST-based LM. For example, when training the acoustic models, the sentence being spoken (a linear sequence of words) is encoded directly in the pronunciation network, eliminating the need for a language model component.
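
As a minimal sketch of this last use case, the snippet below compiles a known word sequence into a linear pronunciation network: each arc carries a phone (which would further expand into HMM states) as input label and emits the word as output label on its final arc. The toy lexicon and arc representation are assumptions for illustration, not a SPRAAK file format.

    # toy lexicon: word -> phone sequence (illustrative only)
    lexicon = {
        "the": ["dh", "ax"],
        "cat": ["k", "ae", "t"],
    }

    def compile_sentence(words, lexicon):
        """Return FST arcs as (from_state, to_state, input_label, output_label)."""
        arcs, state = [], 0
        for word in words:
            phones = lexicon[word]
            for i, phone in enumerate(phones):
                out = word if i == len(phones) - 1 else "<eps>"
                arcs.append((state, state + 1, phone, out))
                state += 1
        return arcs, state            # second value is the final state

    arcs, final_state = compile_sentence(["the", "cat"], lexicon)
    for arc in arcs:
        print(arc)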

Figure 2 gives an example of a pronunciation network using right-context-dependent phones and no assimilation rules. The final pronunciation network used by the decoder will also incorporate the tied-state information coming from the acoustic model. To keep the figure readable, this information was left out of the example.

Figure 2: Lexical Network

The pronunciation network deviates from normal FSTs in several ways:

See Also: Acoustic Model

The language model

The language model (LM) calculates conditional word probabilities, i.e. the probability of a new word given its predecessor words. For efficiency reasons the LM condenses all relevant information concerning the word predecessors into its own state variable(s).
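
The interface this implies can be sketched as follows: the LM exposes an initial state and an advance operation that maps (state, word) to (new state, conditional probability). The back-off-free bigram below is purely an illustrative assumption, not one of the LM types shipped with SPRAAK.

    import math

    class BigramLM:
        def __init__(self, counts):
            # counts[(w1, w2)] = number of times w2 followed w1 in the training data
            self.counts = counts
            self.totals = {}
            for (w1, _), c in counts.items():
                self.totals[w1] = self.totals.get(w1, 0) + c

        def initial_state(self):
            return "<s>"

        def advance(self, state, word):
            """Return (new_state, log P(word | state))."""
            c = self.counts.get((state, word), 0)
            total = self.totals.get(state, 0)
            logp = math.log(c / total) if c else float("-inf")
            return word, logp         # for a bigram the state is just the last word

    lm = BigramLM({("<s>", "the"): 2, ("the", "cat"): 1, ("the", "dog"): 1})
    state = lm.initial_state()
    for w in ["the", "cat"]:
        state, logp = lm.advance(state, w)
        print(w, logp)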

Supported LMs and interfaces are:

Furthermore, an extension layer on top of these LMs allows for various extensions:

See Also: Language Model

The decoder

The decoder (search engine) finds the best path through the search space defined by the acoustic model, the language model and the pronunciation network given the acoustic data coming from the preprocessing block.
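
At its core such a search boils down to a Viterbi pass that combines per-frame observation log-likelihoods with transition scores; a real decoder additionally folds in the pronunciation network and LM scores and applies beam pruning. The self-contained sketch below shows only that core step and uses made-up inputs.

    import numpy as np

    def viterbi(obs_loglik, trans_logprob, init_logprob):
        """obs_loglik: (T, S); trans_logprob: (S, S); init_logprob: (S,)."""
        T, S = obs_loglik.shape
        score = init_logprob + obs_loglik[0]
        backptr = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            cand = score[:, None] + trans_logprob   # (from_state, to_state)
            backptr[t] = cand.argmax(axis=0)
            score = cand.max(axis=0) + obs_loglik[t]
        # backtrace the best path
        path = [int(score.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t, path[-1]]))
        return path[::-1], float(score.max())

    T, S = 10, 4
    rng = np.random.default_rng(1)
    path, best_score = viterbi(rng.normal(size=(T, S)),
                               np.log(np.full((S, S), 1.0 / S)),
                               np.log(np.full(S, 1.0 / S)))
    print(path, best_score)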

SPRAAK implements an efficient all-in-one decoder whose main features are:

(Word) lattice (post)-processing

The (word) lattice (post)-processing consists, like the preprocessing, of a fully configurable processing flow-chart with a large set of processing blocks. The lattice processing can be fed with stored lattices or can be coupled directly to the decoder using the built-in lattice generator.
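
A minimal sketch of such a processing block is given below: a toy word lattice (nodes connected by arcs that carry a word and a score) from which the best-scoring path is extracted. The arc representation is an assumption for illustration and does not match the SPRAAK lattice format.

    lattice = {
        # node: list of (next_node, word, score); higher scores are better
        0: [(1, "the", -1.0), (1, "a", -2.5)],
        1: [(2, "cat", -1.2), (2, "cap", -3.0)],
        2: [],
    }

    def best_path(lattice, start=0, end=2):
        """Dynamic-programming best path; assumes nodes are topologically sorted."""
        best = {start: (0.0, [])}          # node -> (score, word sequence)
        for node in sorted(lattice):
            if node not in best:
                continue
            score, words = best[node]
            for nxt, word, arc_score in lattice[node]:
                cand = (score + arc_score, words + [word])
                if nxt not in best or cand[0] > best[nxt][0]:
                    best[nxt] = cand
        return best[end]

    print(best_path(lattice))              # (-2.2, ['the', 'cat'])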

The most important properties of the lattice processing component are:

Among the available processing modules are:
