SPRAAK
The SPRAAK architecture

Overview

Figure 1 gives an overview of the main components of the SPRAAK recognizer in a typical configuration and with the default interaction between these components.

[Figure 1 (spraak_arch.png): the main components of the SPRAAK recognizer in a typical configuration]
When using the SPRAAK engine for other tasks (e.g. training of acoustic models), a subset of the same main components is used. Hence, we limit the subsequent description to the recognition setup since this is the most general case. Note that for highly experimental setups, more complex configurations and interactions are possible; figure 1 only depicts a typical configuration.

The main components of the system are:

All major components are implemented as dynamic objects with well defined interfaces. For most objects, several interchangeable (at run-time) implementations are available. For example, there is a standard N-gram LM as well as a finite state based LM. These two implementations can be combined with an extension layer providing class-based grammars, sub-models, linear combination of LM's, and so on.

Whenever this is beneficial an extra interface layer (called a gateway) is added to the object to isolate the static data (e.g. the language model data itself) from the temporary data needed when interacting with that resource. These gateways also provide fast access (eliminating the dynamic binding) to the main functionality of an object.

To allow intervention inside the core components of the SPRAAK system without needing to recode any of these core components, observer objects are allowed at strategic places in the processing chain. Observer objects also allow easy monitoring and filtering of (intermediate) results.

Building a system that is fast as well as flexible enough to allow all possible configurations of modules, with whatever interaction between these modules, is a challenge to say the least. In order to provide the best of both worlds, SPRAAK provides both a generic and hence fairly slow implementation and several efficient but more constrained implementations. By making good use of templates, it is possible to write generic code, i.e. code that imposes as few limitations on the interactions between the components as possible, from which efficient code for a specific setup is derived automatically just by redefining a few templates.

All four core techniques used to make the SPRAAK engine as flexible as possible –objects, gateways, templates and observers– are further clarified in the section Design choices: objects, gateways, observers and templates.

Next to the main components, there are several auxiliary components and/or objects. The most important ones are handled in sections The preprocessing till Other objects. For a complete enumeration of all objects and methods, we refer to the low level API.

Design choices: objects, gateways, observers and templates

Objects

Principles

Although SPRAAK is mainly coded in C, all major building blocks are encoded in the form of objects. These objects are designed for dynamic binding, i.e. the function that must be called when method 'X' is requested for object 'Y' is determined at run-time. This is similar to the dynamically typed objects used in languages such as Objective-C, Smalltalk or Python, and contrasts with objects in C++, which are statically resolved (except when using virtual methods). However, SPRAAK also provides means to enforce static (i.e. at compile time) checking of the arguments. Such strong type checking helps in finding and resolving bugs at compile time (i.e. during development).
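In plain C, dynamic binding of this kind boils down to run-time method tables. The sketch below is purely illustrative (all names and signatures are invented here, not SPRAAK's actual API): the function bound to a method name is looked up when the call is made, not at compile time.

```c
#include <assert.h>
#include <string.h>

/* A method table that is searched at run-time: dynamic binding. */
typedef struct SprObj SprObj;
typedef double (*SprMethod)(SprObj *self, double arg);

typedef struct {
    const char *name;
    SprMethod   fnc;
} SprMethodEntry;

struct SprObj {
    const char           *class_name;
    const SprMethodEntry *methods;   /* NULL-terminated method table */
    double                state;     /* object-specific data */
};

/* Run-time lookup: find method 'name' in the object's table and call it. */
static double spr_call(SprObj *obj, const char *name, double arg)
{
    const SprMethodEntry *m;
    for (m = obj->methods; m->name != NULL; m++)
        if (strcmp(m->name, name) == 0)
            return m->fnc(obj, arg);
    return -1.0; /* method not found */
}

/* Two interchangeable implementations of the same "scale" method. */
static double scale_linear(SprObj *self, double x) { return self->state * x; }
static double scale_square(SprObj *self, double x) { return self->state * x * x; }

static const SprMethodEntry linear_methods[] = {
    {"scale", scale_linear}, {NULL, NULL}
};
static const SprMethodEntry square_methods[] = {
    {"scale", scale_square}, {NULL, NULL}
};
```

An object built with `square_methods` can replace one built with `linear_methods` without recompiling any caller, which is the plug&play property described above; static argument checking can then be layered on top by wrapping `spr_call` in typed helper functions.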

Rationale

A speech recognition engine such as SPRAAK is a large and complex system. The use of objects is an effective method to split the system into manageable units whose interaction is completely defined by their interface (methods and public members). Encoding the major components as dynamic objects makes it easy to replace each of the components in the recognizer. Combined with the possibilities of dynamic link libraries, this provides plug&play replacement of all major components in the recognizer. As a result, even those who only have access to the binaries (no source code license) can expand or modify the system without recompiling anything except their own additions.

On the other hand, a speech recognition engine must be very efficient if it is to be useful at all. There are tight limits on both computational and memory resources for the system to be useful: real time operation is a must for most applications and even research becomes impossible if the system is too slow or consumes too much memory. To achieve these goals, the following choices were made:

Gateways

Principles

Objects that represent resources are accessed by means of gateways, allowing different sub-systems or program threads to use one or more resources simultaneously. A gateway is a lightweight layer between the code using the resource and the object representing that resource. Next to simultaneous access to the same resource, gateways also enable the resource to be used efficiently for a certain usage pattern. To serve both these purposes, gateways contain at least the following items:

Since some objects have more than one usage pattern, there can be multiple types of gateways to the same resource. For example, acoustic models have an evaluation and a training gateway.
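The split a gateway makes between the shared resource and the per-use scratch storage can be sketched as follows. This is a minimal illustration with invented names (real SPRAAK gateways carry more, e.g. fast-access function pointers):

```c
#include <assert.h>
#include <stdlib.h>

/* The shared, read-only resource: e.g. the acoustic model parameters.
 * One copy exists, no matter how many users there are. */
typedef struct {
    int          nstates;
    const float *means;       /* model data (static after loading) */
} AcModel;

/* A gateway: a lightweight per-user handle that pairs a pointer to the
 * shared resource with the scratch storage one evaluation needs. */
typedef struct {
    const AcModel *model;     /* shared, never modified via the gateway */
    float         *scores;    /* private scratch: one score per state   */
} AcModelGw;

static AcModelGw *ac_gw_open(const AcModel *model)
{
    AcModelGw *gw = malloc(sizeof(*gw));
    if (gw == NULL) return NULL;
    gw->model  = model;
    gw->scores = malloc(model->nstates * sizeof(float));
    if (gw->scores == NULL) { free(gw); return NULL; }
    return gw;
}

static void ac_gw_close(AcModelGw *gw)
{
    if (gw != NULL) { free(gw->scores); free(gw); }
}
```

Two threads (or two places in a single-threaded processing chain) each open their own gateway: they share the model data but never touch each other's scratch buffers.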

Rationale

For most components (objects) in the SPRAAK system, the data needed to actually use that component can be split in two parts: (1) a large chunk of actual resource data (e.g. the acoustic model) and (2) some local storage needed to do something with this data (e.g. the vector needed to return state probabilities). In the light of concurrent use of the resource, an effective means to isolate that part of the data that can be re-used is needed. Gateways provide this functionality, and hence assure efficient re-use of the main resources in the SPRAAK system.

Concurrent use of resources is useful in several situations. For example, in a multi-threaded system, it allows multiple concurrent and independent queries to the same acoustic model. These multiple queries can either be on different frames of the same utterance, hence speeding up the processing, or on different inputs in a multi-user platform. Even in a single-threaded system, the same resource can be used at different locations in the processing chain. For example, the N-gram language model used in the first decoding pass to generate a word lattice can be re-used later on in the lattice rescoring pass as well.

Next to providing shared resources, the use of gateways in SPRAAK serves two additional purposes:

Observers

Principles

Observer objects can be inserted on predefined strategic positions in the processing chain. Observers have their own storage and can observe and sometimes even modify the system state at that point in the processing chain. As such, observers can be used to monitor the process, to add extra functionality or even to modify the behaviour of the system.
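A minimal sketch of this observer mechanism, with invented names, could look as follows. The core component only pays for a single pointer test when no observer is installed:

```c
#include <assert.h>
#include <stddef.h>

/* The system state an observer may inspect (and, where permitted, modify). */
typedef struct {
    int    frame;
    double best_score;
} SearchState;

/* An observer: private storage plus a callback invoked at a predefined
 * point in the processing chain. */
typedef struct Observer {
    void (*notify)(struct Observer *self, SearchState *st);
    long  nseen;   /* observer-private storage */
} Observer;

/* Core component: the hook is one member-variable check, so the overhead
 * when no observer is attached is negligible. */
static void process_frame(SearchState *st, Observer *obs)
{
    st->frame++;
    st->best_score *= 0.5;          /* stand-in for the real work */
    if (obs != NULL)                /* the cheap hook */
        obs->notify(obs, st);
}

/* Example observer that monitors progress without touching the core code. */
static void count_frames(Observer *self, SearchState *st)
{
    (void)st;
    self->nseen++;
}
```

Swapping `count_frames` for another callback changes the monitoring or filtering behaviour without recompiling `process_frame`.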

Rationale

The use of objects and dynamic libraries allows for a flexible combination of the core components. Observers extend this flexibility by allowing some intervention in the internal operation of core components (mainly the search) without having to modify the code of the core components.

The associated overhead for allowing observers at a certain position in the processing chain is low (it requires only a simple check on a member variable in an object/structure). Yet at the same time, allowing observers to be inserted at strategic positions opens up a whole new set of possibilities. Some examples are:

Templates and optimal task-specific implementations

Principles

Templates allow for the same piece of code to be re-used in different situations (and even for different objects) without any overhead. This is done by modifying the behaviour of the code at compile time.
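In C, this kind of compile-time specialization is typically achieved with macros. The following sketch (all names invented) derives two specialized routines from a single generic code body, with no run-time dispatch overhead:

```c
#include <assert.h>

/* Generic code written once; redefining a few "template" parameters
 * derives an efficient specialized variant at compile time. */
#define DEFINE_FOLD(NAME, TYPE, INIT, COMBINE)             \
    static TYPE NAME(const TYPE *x, int n)                 \
    {                                                      \
        TYPE acc = (INIT);                                 \
        for (int i = 0; i < n; i++)                        \
            acc = COMBINE(acc, x[i]);                      \
        return acc;                                        \
    }

#define FOLD_ADD(a, b)  ((a) + (b))
#define FOLD_MAX(a, b)  ((a) > (b) ? (a) : (b))

/* Two specializations derived automatically from the same generic code. */
DEFINE_FOLD(vec_sum, double, 0.0,    FOLD_ADD)
DEFINE_FOLD(vec_max, double, -1e300, FOLD_MAX)
```

The compiler inlines the combine operation into each generated loop, so the specialized variants are as fast as hand-written ones while the logic is maintained in one place.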

Rationale

SPRAAK is designed to be very flexible: it allows several behavioural changes at run-time and imposes few restrictions on the interaction between the objects. The techniques used to obtain this flexibility (dynamic objects, gateways, observers) were chosen and designed for low overhead. However, there is a very noticeable impact when comparing the generic code path (i.e. the one that provides maximal flexibility w.r.t. options and interconnections) to a trimmed down task-specific implementation. Therefore, we opted to offer several efficient but more constrained implementations next to the generic implementation. By making good use of templates, it is possible to write generic code, i.e. code that imposes as few limitations on the interactions between the components as possible, from which efficient code for a specific setup is derived automatically just by redefining a few templates. In other words, templates reduce the overhead for implementing and maintaining such specific code paths. At run-time, the SPRAAK system automatically chooses the most efficient implementation given the capabilities and interactions of the components used in the speech recognition system one configures. So, from an end-user perspective, the whole system behaves transparently.

Some examples of the use of task-specific code are:

Corollaries of the design choices

The design choices made in SPRAAK (objects using C, gateways, templates) result in a very flexible system (plug&play replacement of all major components, flexible configuration, ...) with minimal impact on the efficiency. There are however also some associated costs:

Functionality and interactions

Todo

The preprocessing

The preprocessing converts audio data (or half preprocessed features) to a (new) feature stream. The preprocessing consists of a fully configurable processing flow-chart and a large set of standard processing blocks.

Each processing block provides the following six methods:

help
    return the explanation (on-line help) concerning the configuration parameters this processing block takes (setup method)
status
    print out the current status (debugging)
setup
    create a new object given a set of configuration parameters
process
    process a single frame of data or return a previously 'held' frame
reset
    reset all variables that change during processing, i.e. prepare to start working on a new data file
free
    destroy the preprocessing object

The process method is not limited to direct feed-through processing, i.e. an input frame does not have to be converted immediately to an output frame. Non-causal behaviour is supported by allowing the process routine to withhold data whenever deemed necessary. The silence/speech detector, for example, looks a certain number of frames into the future in order to filter out short bursts of energy (clicks). The process method can also be requested to return previously withheld data, even if no new input data is available. Next to returning output or withholding data, the process method can also skip data (throw a frame away) or signal an error.
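The output/withhold/skip/error behaviour described above can be modelled with a small return-code enum. The block below is a toy illustration with invented names (the real SPRAAK interface differs): a block that withholds one frame, giving it a one-frame look-ahead in the spirit of the click filter.

```c
#include <assert.h>
#include <stddef.h>

/* Possible outcomes of a process call (illustrative names). */
typedef enum {
    PROC_OUTPUT,    /* an output frame was produced     */
    PROC_WITHHELD,  /* input consumed, output delayed   */
    PROC_SKIP,      /* no frame available / discarded   */
    PROC_ERROR      /* something went wrong             */
} ProcResult;

/* A toy non-causal block: it buffers one frame, so each frame is only
 * emitted once its successor has been seen (1-frame look-ahead). */
typedef struct {
    double held;
    int    has_held;
} DelayBlock;

static ProcResult delay_process(DelayBlock *b, const double *in, double *out)
{
    if (in == NULL) {                 /* flush: return withheld data */
        if (!b->has_held) return PROC_SKIP;
        *out = b->held; b->has_held = 0;
        return PROC_OUTPUT;
    }
    if (!b->has_held) {               /* first frame: withhold it */
        b->held = *in; b->has_held = 1;
        return PROC_WITHHELD;
    }
    *out = b->held;                   /* emit held frame, hold new one */
    b->held = *in;
    return PROC_OUTPUT;
}

/* reset: prepare to start working on a new data file. */
static void delay_reset(DelayBlock *b) { b->has_held = 0; }
```

A supervisor driving this block would keep calling it with `in == NULL` at end-of-file until it returns `PROC_SKIP`, draining all withheld frames.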

The constraints imposed on the preprocessing blocks are:

The configuration of and interaction between the processing blocks is governed by a supervisor object. When processing data, the supervisor is responsible for calling the different processing blocks in the correct order and with the correct data. In a sense, the supervisor converts a complex pipe-line of elementary processing blocks into a new elementary processing block. Next to the help, status, setup, process, reset, and free method, the supervisor also offers the following high-level methods:

The most important properties of and constraints on the preprocessing are:

The acoustic model

Basically, the acoustic model calculates observation likelihoods for the hidden Markov model (HMM) states the decoder investigates. In its most general form, the acoustic model can (1) provide its own state variable(s) to the search engine, (2) determine in full for which search hypothesis the state likelihood will be used, and (3) query all resources available to the search engine. Having its own state information allows the acoustic model to enforce a parallel search over some internal options (2D Viterbi decoding). For example, an acoustic model may contain a sub-model for males and females. By adding a male/female state to the search space, the acoustic model forces the search engine to investigate both genders in parallel (at least until one of the two dominates, resulting in the other one being removed due to the beam-search pruning). Having access to all information concerning the search state enables context-aware acoustic models. Note that since the decoder only takes the acoustic model states into account when merging search hypotheses (tokens), the acoustic model state must reflect the piece of context information that is being used if the search is to be consistent (guaranteed to find the best solution if the search beam is wide enough).

The overhead introduced by this general framework is substantial. In order to allow the automatic selection of a more restrictive and hence faster interfacing between the search and the acoustic model, acoustic models must indicate which functionality they need. The fastest implementation assumes that the state likelihoods only depend on the input frame and the state numbers (cf. figure 1). If so, and if allowed by all other components used in the recognizer, the decoder switches to the following scheme to speed up the acoustic model evaluation:

This alternative scheme is ideal for the fast acoustic models developed by ESAT, i.e. tied Gaussians and fast data-driven evaluation by means of the 'Fast Removal of Gaussians' (FRoG) algorithm. Other acoustic models such as multilayer perceptrons or discrete models using vector quantization also benefit from this scheme.
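The speed-up of the restricted interface comes from the fact that a likelihood depending only on (frame, state number) can be computed once and shared by every hypothesis in that state. A toy per-frame score cache (invented names, stand-in likelihood function) illustrates the idea:

```c
#include <assert.h>
#include <string.h>

#define NSTATES 4

/* Scores computed once per frame and shared by all hypotheses. */
typedef struct {
    int    frame_id;            /* frame the cache is valid for */
    double score[NSTATES];
    int    valid[NSTATES];
} FrameScores;

/* Stand-in for a real acoustic score (e.g. a tied-Gaussian evaluation). */
static double toy_likelihood(int frame, int state)
{
    return -0.5 * (double)(frame + state);
}

static long n_evals = 0;        /* count how often the model is evaluated */

static double state_score(FrameScores *fs, int frame, int state)
{
    if (fs->frame_id != frame) {        /* new frame: invalidate cache */
        memset(fs->valid, 0, sizeof(fs->valid));
        fs->frame_id = frame;
    }
    if (!fs->valid[state]) {            /* first request for this state */
        fs->score[state] = toy_likelihood(frame, state);
        fs->valid[state] = 1;
        n_evals++;
    }
    return fs->score[state];
}
```

However many tokens sit in a given state, the model is evaluated at most once per (frame, state) pair; algorithms such as FRoG additionally prune which states need evaluating at all.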

Next to the evaluation methods there are also methods for:

The language model

The language model (LM) calculates conditional word probabilities, i.e. the probability of a new word given its predecessor words. In its most general form the LM can –similar to the acoustic model– (1) provide its own state variable(s) to the search engine, (2) determine in full for which search hypothesis the word probability will be used, and (3) query all resources available to the search engine. Note that, as was the case with the acoustic models, the state variable(s) must be complete (reflect all dependencies of the conditional word probabilities) in order to obtain a consistent search.

Note that there is a crucial difference in the typical usage of the decoder for what concerns the language and acoustic model states. The normal acoustic model constraints, i.e. all information concerning the context dependent tied states, are usually encoded directly into the pronunciation network as this is far more efficient; hence no extra acoustic model states are needed. The language model constraints, on the other hand, are typically kept separate from the pronunciation network since encoding them into the pronunciation network is either (1) not possible, (2) requires some approximations, or (3) provides little gain in decoding speed while requiring substantially more memory.

For most situations the following faster but more constrained scheme is to be preferred:

This allows an intermediate LM cache system to resolve most of the LM queries, hence reducing the impact of a potentially slow LM sub-system on the overall decoding speed.
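Such an intermediate cache can be sketched as a small direct-mapped table keyed on (LM state, word). Everything below is illustrative (invented names and a stand-in backing LM), not SPRAAK's actual cache:

```c
#include <assert.h>

#define CACHE_SZ 256

/* One cache slot: key (state, word) plus the cached probability. */
typedef struct {
    int    state, word;   /* state == -1 marks an empty slot */
    double prob;
} LmCacheEntry;

typedef struct {
    LmCacheEntry slot[CACHE_SZ];
    long hits, misses;
} LmCache;

static void lm_cache_init(LmCache *c)
{
    for (int i = 0; i < CACHE_SZ; i++) c->slot[i].state = -1;
    c->hits = c->misses = 0;
}

/* Stand-in for a slow backing LM query (illustrative). */
static double toy_slow_lm(int state, int word)
{
    return 1.0 / (double)(state + word + 2);
}

/* Look up (state, word); on a miss, query the slow LM and fill the slot. */
static double lm_cache_prob(LmCache *c, int state, int word,
                            double (*slow_lm)(int, int))
{
    LmCacheEntry *e = &c->slot[(state * 31 + word) % CACHE_SZ];
    if (e->state == state && e->word == word) {
        c->hits++;
        return e->prob;
    }
    c->misses++;
    e->state = state;
    e->word  = word;
    e->prob  = slow_lm(state, word);
    return e->prob;
}
```

Since the decoder tends to re-query the same (state, word) pairs many times per frame, even a simple direct-mapped table like this absorbs most look-ups before they reach the LM sub-system.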

When interfacing with an LM (through a gateway), the following methods are available:

lmcn
Given an LM state, a new word and optionally some extra information concerning all other resources in the recognition system, return a new LM state and the corresponding conditional word probability. Note that there can be multiple new LM states and hence multiple conditional word probabilities. This allows for example for the word 'cook' to be used either as the verb or as the person in a class based LM. Whenever an end-of-word state is encountered in the pronunciation network (see pronunciation network), the decoder uses this method to update the search tokens so that they reflect the new situation.
prob
Given an LM state, a new word and optionally some extra information concerning all other resources in the recognition system, return (an estimate of) the conditional word probability. In case the combination of the current LM state and the new word leads to multiple new states, the highest conditional word probability is returned. This method typically runs substantially faster than the 'lmcn' method and is used by the decoder the moment the word identity is found in the pronunciation network, which is usually some time before the end-of-word state is reached. Keeping the old LM state until the end-of-word state is reached results in a faster decoder since the costly transitions to a new LM state (there may be multiple new states, the LM needs to create the new state(s), and new LM states require some bookkeeping to be done by the decoder) are delayed until all phonemes in a word are observed and deemed good enough.
lmcr
Release an LM state. This method is optional since not all LM's need it. It indicates that a certain LM state is no longer used by the decoder, and hence all memory used to build this state variable can be reclaimed.
lmc0
Create the initial LM state.
lmup
Provide unconditional word probabilities. The unconditional probabilities are used to precalculate the best word probability over large word sets so that even if only the first phone of a word is known, some educated guess concerning the upcoming conditional word probability can be made. Good estimates speed up the decoder by some 30%.
lmq
Query about several aspects of the LM. This allows for example to have a detailed description of how 'lmcn' obtained a certain probability.
modify
Modify some property of the LM. In finite state based LM's derived from context free grammars, this method can be used to activate or deactivate rules.
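To make the roles of 'lmc0', 'lmcn' and 'prob' concrete, here is a toy bigram LM in the same spirit. The integer states and signatures are invented for illustration; the real gateway API differs:

```c
#include <assert.h>

#define VOC 3   /* toy vocabulary: words 0..2 */

/* bigram[prev][next]: conditional word probabilities P(next | prev). */
static const double bigram[VOC][VOC] = {
    {0.10, 0.60, 0.30},
    {0.40, 0.20, 0.40},
    {0.50, 0.25, 0.25},
};

/* lmc0: create the initial LM state (here: a sentence-start state). */
static int toy_lmc0(void) { return 0; }

/* lmcn: given a state and a new word, return the new LM state and the
 * corresponding conditional word probability. */
static int toy_lmcn(int state, int word, double *prob)
{
    *prob = bigram[state][word];
    return word;   /* for a bigram, the new state is just the last word */
}

/* prob: a fast estimate of the conditional probability that does not
 * commit to a state transition (exact for this toy model; a real LM
 * may return an upper bound over multiple candidate states). */
static double toy_prob(int state, int word)
{
    return bigram[state][word];
}
```

The decoder would call `toy_prob` as soon as a word identity becomes known in the pronunciation network, and only call `toy_lmcn` (the costly state transition) once the end-of-word state is reached.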

On top of the methods provided by the LM gateway, the LM itself provides methods for:

Whereas the acoustic models have 'train' methods, LM's typically lack this functionality. Building LM's usually involves either manual work (e.g. context free grammars) or the collection of statistics concerning word usage patterns from large text corpora (e.g. N-grams). Both operations have little in common with the normal use of LM's in a speech recognition system and hence are better served using external methods, programs or packages.

Currently, the following LM's are available in the SPRAAK system:

A last point of interest for what concerns the language model is the handling of sentence starts and sentence ends. SPRAAK allows the LM to indicate that a special sentence start symbol (usually <s>) must be pushed at the start of each new sentence. Sentence ends are treated similarly: when desired by the LM, a sentence end symbol (usually </s>) will be pushed automatically on sentence end conditions. Using special symbols to start and end sentences instead of special start and stop states in the LM (a technique typically used in FST's) proved to be more flexible overall and is better supported when using non FST-based language models.

The pronunciation network

The pronunciation network is a (possibly cyclic) finite state transducer (FST) that goes from HMM states (input symbols of the FST) to some higher level (the output symbols of the FST). Typically, the pronunciation network encodes the lexical constraints, i.e. the pronunciation of words combined with some assimilation rules, and the acoustic model constraints (context dependent phones). It is possible to encode the constraints imposed by an FST-based LM in the pronunciation network as well. For example when training the acoustic models, the sentence being spoken (a linear sequence of words) is directly encoded in the pronunciation network, eliminating the need for a language model component. Typically, the output symbols are words. Having phones as output symbols is possible and results in a phone recognizer.

Figure 2 gives an example of a pronunciation network using right context dependent phones and no assimilation rules. The final pronunciation network used by the decoder will also incorporate the tied-state information coming from the acoustic model. To keep the figure readable, this information was left out of the example.

[Figure 2 (lex_network.png): example pronunciation network with right context dependent phones]

The pronunciation network deviates from normal FST's in several ways:

The SPRAAK decoder allows loops inside the pronunciation network. However, software to optimize FST's that contain arbitrary loops is very complex and warrants dedicated toolkits [references]. All routines to optimize the pronunciation network in SPRAAK assume the straightforward configuration of the pronunciation network depicted at the bottom of figure 2. This limitation allows for far more efficient and simpler code for the construction of the pronunciation network, while in practice almost any setup can still be built. In conclusion: it is possible to build a decoder that has loops in the middle part of the above figure; however, the construction of such networks requires tools from outside SPRAAK.

SPRAAK uses the following conventions for lexical transcriptions and assimilation rules:

This 'flat' notation strikes a good balance between readability and expressiveness. In the few cases where very complex descriptions are needed, the following formats can be used:

  =<nr_of_nodes>[<from_node>/<to_node>/(<prob>)<phone>]...
  =<nr_of_nodes>[<from_node>/<to_node>/<phone>=(<prob>)<phone>]...

The (<prob>) fields are optional. For example, the assimilation rule

  [A/E][B=[(.1)B/(.7)C/(.2)[]]]D

can also be written as:

  =4[0/1/A][0/1/E][1/2/B=(.1)B][1/2/B=(.7)C][1/2/B=(.2)[]][2/3/D]

SPRAAK contains several tools to help in constructing the pronunciation network:

The search engine (decoder)

The search engine finds the best path through the search space defined by the acoustic model, the language model and the pronunciation network given the acoustic data coming from the preprocessing block.

The search engine is a breadth-first frame-synchronous decoder which is designed to:

Other important properties of the decoder are:

(Word) lattice (post)-processing

The (word) lattice (post)-processing consists, similar to the preprocessing, of a fully configurable processing flow-chart and a large set of processing blocks. The lattice processing can either be fed from stored lattices or be coupled directly to the decoder using the built-in lattice generator.

The most important properties of the lattice processing component are:

Among the available processing modules are:

Other objects

Todo