Aims

SPRAAK is a speech recognition toolkit intended for a diverse population of users:

SPRAAK is a flexible modular toolkit for speech recognition research
SPRAAK is a state-of-the-art speech recognizer aimed at speech application deployment in niche markets

The SPRAAK Toolkit runs on UNIX/Linux and Windows (XP and later) platforms, though certain advanced functionality (distributed computing and resource sharing, mainly relevant for resource development) may not be available on Windows.

To address these different needs, SPRAAK exposes its functionality through a set of different interfaces. Application developers need control over the run-time engine and over the resources that the engine uses. They are fully served by the High Level API. Developing new resources may require using, and making minor adaptations to, a number of standalone scripts that build on the low level API.

Speech researchers will be served by the scripting framework for training and evaluating new systems. Certain types of speech research (e.g. feature extraction) may involve very little programming, while others (e.g. changes to the decoder) are always a major programming effort. Depending on the type of research, they may need to dig deeper into the internals, for which they will use the low level API.

SPRAAK as a research toolkit

SPRAAK is a flexible modular toolkit for research into speech recognition algorithms, allowing researchers to focus on one particular aspect of speech recognition technology without needing to worry about the details of the other components. To address the needs of the research community SPRAAK provides:

plug & play replacement of all core components
flexible configuration of all core components
extensive libraries covering all common operations
examples on how to add functionality
well defined and implemented core components and interaction schemes
programmer's and developer's documentation
a low level API (Application Programming Interface)
a direct interface with the Python scripting language
an interactive environment (Python) for speech research
scripts for batch processing

SPRAAK for application development

SPRAAK is also a state-of-the-art recognizer with a programming interface that non-specialists can use with minimal programming effort. At the same time, SPRAAK allows ASR specialists to integrate and harness special functionality needed for niche market projects that are not served by the existing off-the-shelf commercial packages. To address the needs of this part of the user base, SPRAAK provides:

user documentation
a high level API (Application Programming Interface)
a client-server model
standard scripts for developing new resources (such as acoustic models or language models).

Moreover, on request the ESAT team can provide

a set of resources (acoustic and language models, lexica, ...) for Northern and Southern Dutch for both broadband and telephony speech.
a set of reference implementations or frameworks for different applications

Next to conventional ASR applications, SPRAAK enables applications in other fields as well. Examples are linguistic and phonetic research, where the software can be used to segment large speech databases or to provide high quality automatic transcriptions, as well as pronunciation training and speaker recognition. To that end, a set of scripts for some selected tasks is provided. We expect the number of available scripts to grow rapidly as the user base of SPRAAK grows.

High Level API for the run-time engine

The high level API is the convenient interface for communicating with the SPRAAK run-time engine. It covers all necessary functionality: starting and stopping, setting parameters, loading and switching acoustic and language models, and controlling the audio device. This level of control suits the needs of people interested in dialog development, interactive recognizer behaviour, speech recognition testing on large corpora, interactive testing, building demonstrators, etc.

The high level API is available in two forms: C-callable routines and a corresponding set of commands for a client-server architecture. There is a one-to-one mapping between these two forms.
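The one-to-one mapping between the two forms can be illustrated with a small Python sketch. It is not the SPRAAK API: the class `ToyEngine`, its methods, and the command names `start`, `stop` and `set` are hypothetical stand-ins, chosen only to show how every text command can resolve to exactly one callable routine.

```python
import shlex


class ToyEngine:
    """Stand-in for a recognition engine (hypothetical, NOT the SPRAAK API)."""

    def __init__(self):
        self.params = {}
        self.running = False

    # "Routine" form: the application calls these directly.
    def start(self):
        self.running = True
        return "ok"

    def stop(self):
        self.running = False
        return "ok"

    def set_param(self, name, value):
        self.params[name] = value
        return "ok"


# "Command" form: each (hypothetical) command name maps to exactly one routine.
COMMANDS = {
    "start": ToyEngine.start,
    "stop": ToyEngine.stop,
    "set": ToyEngine.set_param,
}


def execute(engine, line):
    """Parse one text command and call the matching routine."""
    parts = shlex.split(line)
    if not parts:
        return "error: empty command"
    name, args = parts[0], parts[1:]
    routine = COMMANDS.get(name)
    if routine is None:
        return "error: unknown command " + name
    try:
        return routine(engine, *args)
    except TypeError:
        # Wrong number of arguments for the routine.
        return "error: bad arguments for " + name
```

With this layout, `engine.set_param("beam", "20")` and `execute(engine, "set beam 20")` have the same effect, which is the property the two forms of the high level API are described as sharing.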

In the client-server framework, client (application) and server (speech recognition engine) communicate with each other via a simple physical interface (pipes or sockets) using the high level API commands. Feedback from the engine to the application is both client driven and server driven. All concepts known to the high level API may also be specified in an initialization file that is loaded when the engine is started.
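The client-driven half of this exchange can be sketched in Python with a socket pair and a line-based text protocol. Everything in the sketch is an assumption made for illustration: the commands `status`, `start`, `stop` and `quit`, the replies, and the one-line-per-message framing are hypothetical and do not reproduce the actual SPRAAK command set. Server-driven feedback (messages the engine sends on its own initiative) is not shown.

```python
import socket
import threading


def serve(conn):
    """Toy server: read newline-terminated commands and answer each one."""
    running = False
    rfile = conn.makefile("r", encoding="utf-8", newline="\n")
    wfile = conn.makefile("w", encoding="utf-8", newline="\n")
    with conn, rfile, wfile:
        for line in rfile:
            command = line.strip()
            if command == "start":
                running, reply = True, "ok"
            elif command == "stop":
                running, reply = False, "ok"
            elif command == "status":
                reply = "running" if running else "idle"
            elif command == "quit":
                reply = "bye"
            else:
                reply = "error unknown command"
            wfile.write(reply + "\n")
            wfile.flush()
            if command == "quit":
                break


class Client:
    """Toy client: send one command, wait for the one-line reply."""

    def __init__(self, sock):
        self._sock = sock
        self._rfile = sock.makefile("r", encoding="utf-8", newline="\n")
        self._wfile = sock.makefile("w", encoding="utf-8", newline="\n")

    def request(self, command):
        self._wfile.write(command + "\n")
        self._wfile.flush()
        return self._rfile.readline().strip()

    def close(self):
        self._rfile.close()
        self._wfile.close()
        self._sock.close()


def demo():
    """Run client and server in one process over a connected socket pair."""
    client_sock, server_sock = socket.socketpair()
    worker = threading.Thread(target=serve, args=(server_sock,))
    worker.start()
    client = Client(client_sock)
    replies = [client.request(c) for c in ("status", "start", "status", "quit")]
    client.close()
    worker.join()
    return replies
```

Calling `demo()` runs both ends in a single process. In a real deployment the server side would be the recognition engine, reachable through a pipe or a network socket.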

The alternative form of the high level API uses C-callable routines instead of the pipe/socket interface layer. Except for giving up the flexibility of the client/server concept, it has exactly the same functionality. This is the most compact and efficient implementation and would be the implementation of choice for (semi-)commercial applications on a standalone platform.

The high level API gives the user intuitive and simple access to a speech recognition engine, while maintaining reasonably detailed control. The concepts defined in the high level API are concepts known to the speech recognition engine, i.e. language models, acoustic models, lexica, pre-processing blocks, search engines, ... In that sense the SPRAAK high level API may not be as high level as in certain commercial packages, which tend to work with concepts such as 'user', 'language', 'application'. If a translation is necessary from intuitive user concepts to speech recognition concepts, then this is the task of the application, although SPRAAK provides the necessary hooks to work with these high level concepts.
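Such an application-side translation layer can be very small. The Python sketch below maps user-facing choices (language, audio channel, application) onto engine-level concepts (acoustic model, language model, lexicon). All resource names and table entries are made-up placeholders, and the function `resolve` is hypothetical; it is not part of SPRAAK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineResources:
    """Engine-level concepts the recognizer needs (names are placeholders)."""
    acoustic_model: str
    language_model: str
    lexicon: str


# Hypothetical lookup tables maintained by the application, not by SPRAAK.
ACOUSTIC_MODELS = {
    ("nl", "broadband"): "am_nl_broadband",
    ("nl", "telephony"): "am_nl_telephony",
}
APPLICATIONS = {
    "dictation": ("lm_dictation", "lex_general"),
    "call_routing": ("lm_call_routing", "lex_call_routing"),
}


def resolve(language, channel, application):
    """Translate user-level choices into the resources the engine must load."""
    try:
        acoustic_model = ACOUSTIC_MODELS[(language, channel)]
        language_model, lexicon = APPLICATIONS[application]
    except KeyError as exc:
        raise ValueError(f"no resources configured for {exc.args[0]!r}") from None
    return EngineResources(acoustic_model, language_model, lexicon)
```

The application would then pass the resolved names to the engine through the high level API, for example when loading or switching acoustic and language models.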

Usage of the high level API should be suitable for users with a moderate understanding of speech recognition concepts and requires only a minimum of computational skills.

Low Level API

The low level API provides full access to all routines, including those that are not relevant for the run-time engine. To understand what is possible through this low level API, it is necessary to understand the software architecture of the SPRAAK toolkit, which is described further on in the section "The SPRAAK architecture". In essence, the SPRAAK toolkit consists of a set of modules (compiled code) and scripts. The modules are designed according to object-oriented concepts and are written in C. Python is used as the scripting language.

Standalone tools such as alignment, enrollment, operations on acoustic models, and training of acoustic models are implemented as Python scripts that make use of the low level API. SPRAAK includes example scripts for most of these tasks, and the number of available scripts will grow over time.

Usage of the low level API gives the user full control over the internals. It is the vehicle for the ASR researcher who wants to write new training scripts, modify the recognizer, etc. New concepts may first be prototyped in Python. As long as the prototype does not affect the lowest level computational loops in the system, the impact on efficiency will be limited. Nevertheless, the inherent delays at startup time and the loss of efficiency in the scripting language make this setup less suited for applications.

The low level API is also the ideal tool for ASR teaching. It provides detailed enough insight into the speech recognition internals, it allows for visualization of intermediate results (e.g. individual Gaussians) and makes it possible to modify specific pieces of the code without the need to develop a full recognition system.

Usage of the low level API is intended for those who have a good understanding of speech recognition internals and who are at least skillful programmers at the script level.

SPRAAK developers API

The SPRAAK developers API is mainly relevant to the developers of the SPRAAK toolkit. It handles low-level tasks such as parsing input, argument decoding, the object-oriented layer, debug messages and asserts, ...

For a number of reasons, not all of the functionality of the developers API is made available on the Python level:

Python provides its own alternatives (parsing, hash tables, the object-oriented layer)
the functionality is not relevant at the Python level (atomic operations)
exposing the functionality would be difficult and with little merit

Routines that are part of the developers API only are tailor-made to operate within the SPRAAK library and are (in theory) not fit for use outside this scope. Since these routines are meant only for internal use, their interface can also be changed without prior notice.

Extending and modifying the developers API is intended only for those who have a good understanding of the SPRAAK framework and who are skillful C-programmers.