
File formats

SPRAAK supports a variety of file formats. All formats have the same basic layout:

a (sometimes optional) magic cookie, indicating the file (or header) format
a header describing the data
the raw data (can be compressed, or in ASCII, or ...).

Header-less files are also supported. In that case any necessary header information can be provided in an external file or as extra information with the file name. See Specifying a stream of data (file) and Some practical examples for more details.

For a list of supported file formats, encryption and compression schemes and other features, see Data IO in SPRAAK.

The SPRAAK header format

A SPRAAK header consists of:

The magic cookie ".spr\n".
A list of key-value pairs.
The header-end line, consisting of a single pound sign '#' (no preceding or trailing white-space).

A key-value pair typically consists of a single line containing the following elements:

Optional white-space preceding the key.
The key. A key is a sequence of printable non-white-space ASCII characters, i.e. any ASCII character in the range ['!' ... '~'].
Required white-space separating the key and the value.
The value. A value can be any sequence of characters. Values that start with white-space or a double quote ('"'), or that end in white-space or a back-slash ('\'), must be specified as quoted strings (cf. C strings).
Optional white-space following the value.
A new line.
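The quoting rule above can be sketched in Python as follows. This is a minimal illustration, not SPRAAK code: the function names are hypothetical, the escapes used (back-slash, double quote, new-line, tab) are an assumed subset of C-string escapes, and quoting empty values or values that contain a new-line goes beyond the stated rule and is an assumption.

```python
def needs_quotes(value: str) -> bool:
    """Apply the quoting rule for SPRAAK header values."""
    if not value:
        return True  # assumption: an empty value is written as ""
    return (value[0].isspace() or value[0] == '"'
            or value[-1].isspace() or value[-1] == "\\"
            or "\n" in value)  # assumption: embedded new-lines are quoted too


def quote_c_string(value: str) -> str:
    """Quote a value using a minimal, assumed subset of C-string escapes."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n")
                    .replace("\t", "\\t"))
    return '"' + escaped + '"'


def format_pair(key: str, value: str) -> str:
    """Return one 'KEY value' header line, quoting the value when required."""
    # A key is a non-empty run of printable, non-white-space ASCII ('!' .. '~').
    if not key or any(not ("!" <= c <= "~") for c in key):
        raise ValueError(f"invalid key: {key!r}")
    text = quote_c_string(value) if needs_quotes(value) else value
    return f"{key} {text}\n"


print(format_pair("DIM2", "13"), end="")       # DIM2 13
print(format_pair("NOTE", " padded"), end="")  # NOTE " padded"
```

Plain values are written unchanged; only values that would otherwise be ambiguous after white-space stripping are quoted.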

A value can span multiple lines: a line that ends with a back-slash immediately followed by the new-line is continued on the next line.

The header may contain empty lines. The keys used in a SPRAAK header are identical to the keys exposed to the SPRAAK environment (see The standard keys).
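A header reader following these rules could look like the sketch below. It assumes that the magic cookie is the four characters .spr followed by a single new-line byte, that continuation lines are joined by dropping the back-slash new-line pair, and that quoted values use a small subset of C escapes. read_spraak_header and unquote are hypothetical names, not part of the SPRAAK API.

```python
import io

MAGIC = b".spr\n"  # assumption: ".spr" followed by one new-line byte


def unquote(text: str) -> str:
    """Strip the surrounding quotes and undo a minimal set of C-string escapes."""
    escapes = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}
    body, out, i = text[1:-1], [], 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            out.append(escapes.get(body[i + 1], body[i + 1]))
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


def read_spraak_header(stream):
    """Parse a SPRAAK header from a binary stream.

    Returns (dict of key -> value, bytes of the data that follow the header).
    """
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("missing .spr magic cookie")
    header, pending = {}, ""
    while True:
        raw = stream.readline()
        if not raw:
            raise ValueError("header-end line '#' not found")
        line = raw.decode("latin-1")
        if line.endswith("\n"):
            line = line[:-1]
        if line.endswith("\\"):        # back-slash + new-line: line continues
            pending += line[:-1]
            continue
        line, pending = pending + line, ""
        if line == "#":                # exact header-end line, no white-space
            return header, stream.read()
        parts = line.split(None, 1)    # also drops leading white-space
        if not parts:                  # empty lines are allowed
            continue
        if len(parts) != 2:
            raise ValueError(f"key without value: {parts[0]!r}")
        key, value = parts[0], parts[1].rstrip()
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = unquote(value)
        header[key] = value


blob = b'.spr\nDIM1 -1\n\n  DIM2 13\nNOTE " two words "\nLONG first \\\npart\n#\n\x00\x01'
header, data = read_spraak_header(io.BytesIO(blob))
print(header)  # {'DIM1': '-1', 'DIM2': '13', 'NOTE': ' two words ', 'LONG': 'first part'}
```

Because an unquoted value may not end in a back-slash, a trailing back-slash can safely be read as a continuation marker.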

The standard keys

The set of keys is open by definition, i.e. each object may define its own keys and corresponding values. However, to fit optimally into the SPRAAK framework, the following keys should be used where applicable:

FORMAT

Specifies how the data elements are stored:

BIN01: binary, little endian
BIN10: binary, big endian
ASCII: plain ASCII text

Other values are not allowed.
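The sketch below shows how a reader might honour the FORMAT key when decoding the data part with Python's struct module. It assumes that TYPE is a 32-bit float and that ASCII data consists of white-space separated numbers; decode_float32 is a hypothetical helper.

```python
import struct

# struct byte-order prefixes for the two binary FORMAT values.
BYTE_ORDER = {"BIN01": "<", "BIN10": ">"}


def decode_float32(fmt: str, data: bytes) -> list:
    """Decode the data part as 32-bit floats according to the FORMAT key.

    Assumptions: TYPE is a 32-bit float; ASCII data holds white-space
    separated numbers.
    """
    if fmt == "ASCII":
        return [float(tok) for tok in data.decode("ascii").split()]
    if fmt not in BYTE_ORDER:
        raise ValueError(f"FORMAT must be BIN01, BIN10 or ASCII, not {fmt!r}")
    if len(data) % 4:
        raise ValueError("data length is not a multiple of 4 bytes")
    return list(struct.unpack(f"{BYTE_ORDER[fmt]}{len(data) // 4}f", data))


little = struct.pack("<2f", 1.5, -2.0)
big = struct.pack(">2f", 1.5, -2.0)
print(decode_float32("BIN01", little), decode_float32("BIN10", big))
```

Both calls return the same values; only the byte order prefix differs.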

LAYOUT

Specifies the layout of the data elements:

LIST: a list (sequence) of entries, e.g. a corpus or a lexicon
MATRIX: a matrix or vector, e.g. feature vectors, samples, matrices
CUSTOM: some custom (usually complex) layout; in practice this will almost always also result in a non-base-type value for the TYPE key

DATA

Specifies what kind of data the file contains. Typical values are:

SAMPLE: the file contains audio data (samples)
TRACK: the file contains feature vectors (frames)
CORPUS: the file is a corpus file
DICTIONARY: the file is a dictionary (lexicon) file
SEG: the file is a segmentation file
PARAM: some set of parameters
DES: some text file listing items

This is an open set, i.e. other values are allowed.

TYPE

Specifies the type of the data elements. Typically this is one of the base types (see datatype.c). If there is no fixed data type (e.g. a mixture of floats and integers), the value 'CUSTOM' should be used.

OBJECT

If set, this indicates that the complete file corresponds to a single object. Examples are:

NIY: future work

DIM1

The primary dimension of the stream of data. The primary dimension corresponds to:

the number of entries (lines) in a lexicon or corpus file
the number of segments (lines) in a segmentation file
the number of frames in a track or sample file
the number of vectors in a matrix. A vector consists of a consecutive list of data elements as stored in the file; whether the vectors are interpreted as row or column vectors is up to the routine that reads the data. The standard spr_print and spr_copy commands interpret them as row vectors.

A value of -1 means "as many as are available", i.e. the application is supposed to process as many frames, entries, ... as it can read from the file.

DIM2

The secondary dimension of the stream of data. The secondary dimension corresponds to:

the number of entries in a segmentation file
the number of parameters per frame in a track or sample file
the vector length in a matrix

The secondary dimension cannot be (-1), except for segmentation files.
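For a matrix-like file, the DIM1/DIM2 rules above could be applied as in the following sketch, which cuts the data into DIM1 row vectors of DIM2 elements and resolves DIM1 = -1 from the amount of data. It assumes 32-bit float elements and a byte order already derived from FORMAT; read_rows is a hypothetical name.

```python
import struct


def read_rows(data: bytes, dim1: int, dim2: int, byte_order: str = "<") -> list:
    """Split binary float32 data into DIM1 vectors of DIM2 elements each.

    DIM1 == -1 means "as many vectors as the data holds"; DIM2 must be fixed.
    Assumptions: 32-bit float elements, byte order already derived from FORMAT.
    """
    if dim2 <= 0:
        raise ValueError("DIM2 must be a fixed, positive vector length")
    row_bytes = 4 * dim2
    available = len(data) // row_bytes        # complete vectors in the data
    if dim1 == -1:
        dim1 = available
    elif not 0 <= dim1 <= available:
        raise ValueError(f"DIM1={dim1} is invalid; the data holds {available} vectors")
    flat = struct.unpack(f"{byte_order}{dim1 * dim2}f", data[:dim1 * row_bytes])
    # Each vector is a consecutive run of DIM2 elements, read as a row vector.
    return [list(flat[i * dim2:(i + 1) * dim2]) for i in range(dim1)]


data = struct.pack("<6f", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
print(read_rows(data, -1, 3))  # [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
```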

FSHIFT

The frame shift, i.e. each frame corresponds to frame_shift seconds of data. The default value is 0.01 (10 ms).

FOFFSET

The frame offset, i.e. the first frame starts at frame_offset seconds. Typical values are 0.0 (ESAT convention) and (frame_length-frame_shift)/2 (HTK convention). The ESAT convention requires the first and last frames to invent some data on the left/right so that these frames also have frame_length seconds of data. The advantages are that (1) a file containing X seconds of data contains floor(X/frame_shift) frames, and (2) the data for frame i (base 0) lies between i*frame_shift and (i+1)*frame_shift, independent of the frame_length used.
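The timing rules above can be expressed as a few small functions. This is a sketch under the reading that frame i represents the frame_shift-long stretch starting at frame_offset + i*frame_shift, which reproduces property (2) of the ESAT convention when frame_offset is 0.0; the function names and the small epsilon in the frame count are assumptions, not SPRAAK code.

```python
import math


def frame_slot(i: int, fshift: float = 0.01, foffset: float = 0.0) -> tuple:
    """Start and end time (s) of the frame_shift-long stretch of data for frame i."""
    start = foffset + i * fshift
    return start, start + fshift


def htk_offset(flength: float, fshift: float = 0.01) -> float:
    """FOFFSET value for the HTK convention: (frame_length - frame_shift) / 2."""
    return (flength - fshift) / 2


def esat_frame_count(duration: float, fshift: float = 0.01) -> int:
    """floor(X / frame_shift) frames for X seconds of data (ESAT convention)."""
    # The small epsilon guards against results such as 0.3 / 0.1 = 2.9999999999999996.
    return math.floor(duration / fshift + 1e-9)


print(frame_slot(3))                             # ESAT: about 0.03 .. 0.04 s
print(frame_slot(0, foffset=htk_offset(0.025)))  # HTK, 25 ms window: about 0.0075 .. 0.0175 s
print(esat_frame_count(1.0))                     # 100
```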

NCHAN

The number of audio channels in the recording. The data for the different channels is stored in an interleaved fashion. For track files, this means that each vector of size DIM2 contains NCHAN feature vectors, also stored in an interleaved fashion.
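Undoing the interleaving amounts to taking every NCHAN-th value, as in the sketch below (deinterleave is a hypothetical helper). For track files it assumes that the NCHAN feature vectors inside one DIM2-long vector are interleaved element by element, in the same way as samples; the text above does not spell out the granularity.

```python
def deinterleave(values, nchan: int) -> list:
    """Split an interleaved sequence (c0 c1 ... c0 c1 ...) into NCHAN lists."""
    if nchan <= 0 or len(values) % nchan:
        raise ValueError("the number of values must be a multiple of NCHAN")
    return [list(values[c::nchan]) for c in range(nchan)]


# Sample file, NCHAN = 2: samples of the two channels alternate.
print(deinterleave([1, 10, 2, 20, 3, 30], 2))  # [[1, 2, 3], [10, 20, 30]]
# Track file, NCHAN = 2, DIM2 = 4 (assumed element-wise interleaving).
print(deinterleave([0.1, 0.5, 0.2, 0.6], 2))   # [[0.1, 0.2], [0.5, 0.6]]
```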

CHANLEN

The number of samples in a sample file (equal to sample_frequency*duration). Note that the primary dimension (DIM1) is the number of frames obtained when the file is read with frame shift FSHIFT.

SAMPLEFREQ

The sample frequency of an audio file. The default value is 8000 (8kHz).
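The relation between CHANLEN, SAMPLEFREQ, FSHIFT and DIM1 can be illustrated as follows. The sketch assumes the ESAT convention (floor(duration/FSHIFT) frames) and that FSHIFT spans a whole number of samples, so the frame count can be computed with integer arithmetic; sample_file_dims is a hypothetical name and the defaults mirror the documented defaults of 8000 Hz and 10 ms.

```python
def sample_file_dims(chanlen: int, samplefreq: int = 8000, fshift: float = 0.01) -> tuple:
    """Return (duration in seconds, DIM1) for a sample file with CHANLEN samples."""
    samples_per_frame = round(samplefreq * fshift)
    # Assumption: FSHIFT spans a whole number of samples (e.g. 80 at 8 kHz, 10 ms).
    if samples_per_frame <= 0 or abs(samples_per_frame - samplefreq * fshift) > 1e-6:
        raise ValueError("FSHIFT must correspond to a whole number of samples")
    duration = chanlen / samplefreq
    # Integer division gives floor(duration / FSHIFT) without rounding surprises.
    return duration, chanlen // samples_per_frame


print(sample_file_dims(16000))         # (2.0, 200): 2 s at 8 kHz, 10 ms frames
print(sample_file_dims(16050, 16000))  # DIM1 = 100; the last 50 samples do not fill a frame
```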

COMPRESS

When set, the value of this key specifies the compression algorithm (e.g. gzip) and the corresponding parameters that were used (when reading) or will be used (when writing) to compress the data (not the header). See From physical storage to data content for more details on compression and other coders for data (and headers).
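The sketch below illustrates the principle that only the data part passes through the compressor named by COMPRESS, using Python's gzip module. The exact syntax SPRAAK uses for the COMPRESS value and its parameters is not given here; treating the first word as the algorithm name is an assumption, and encode_data_part/decode_data_part are hypothetical helpers.

```python
import gzip


def _method(header: dict) -> str:
    """First word of COMPRESS (hypothetical syntax: "gzip [params]"), or ""."""
    words = header.get("COMPRESS", "").split()
    return words[0] if words else ""


def encode_data_part(header: dict, data: bytes) -> bytes:
    """Compress the data part (never the header) as requested by COMPRESS."""
    method = _method(header)
    if not method:
        return data
    if method == "gzip":
        return gzip.compress(data)
    raise NotImplementedError(f"unsupported COMPRESS value: {header['COMPRESS']!r}")


def decode_data_part(header: dict, raw: bytes) -> bytes:
    """Undo encode_data_part() for the data that follow the header."""
    method = _method(header)
    if not method:
        return raw
    if method == "gzip":
        return gzip.decompress(raw)
    raise NotImplementedError(f"unsupported COMPRESS value: {header['COMPRESS']!r}")


header = {"DATA": "SAMPLE", "COMPRESS": "gzip"}
payload = bytes(range(256)) * 16
print(decode_data_part(header, encode_data_part(header, payload)) == payload)  # True
```

The header itself would still be written as plain text in front of the compressed bytes, so a reader can find COMPRESS before it starts decoding.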

The key header format

A key header consists of:

An optional (but strongly recommended) magic cookie ".key\n".
A list of key-value pairs.
The header-end line, consisting of a sequence of one or more pound signs '#' (no preceding or trailing white-space).

A key-value pair typically consists of a single line containing the following elements:

Optional white-space preceding the key.
The key. A key is a sequence of printable non-white-space ASCII characters, i.e. any ASCII character in the range ['!' ... '~'].
Required white-space separating the key and the value.
The value. A value can be any sequence of characters, except that leading and trailing white-space will be automatically removed.
Optional white-space following the value.
A new line.
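A key header is simpler to read than a SPRAAK header: the cookie is optional, the end line may hold several '#' characters, and values are simply stripped (no quoting or continuation lines are described). A minimal sketch, with read_key_header as a hypothetical name and the cookie assumed to be ".key" followed by one new-line byte:

```python
import io
import re

HEADER_END = re.compile(r"#+")  # one or more '#', nothing else on the line


def read_key_header(stream):
    """Parse a key header from a binary stream; return (keys, remaining data)."""
    header = {}
    line = stream.readline()
    if line == b".key\n":           # the magic cookie is optional
        line = stream.readline()
    while line:
        text = line.decode("latin-1").rstrip("\n")
        if HEADER_END.fullmatch(text):
            return header, stream.read()
        parts = text.split(None, 1)  # key, then the rest of the line
        if parts:                    # empty lines are skipped
            if len(parts) != 2:
                raise ValueError(f"key without value: {parts[0]!r}")
            header[parts[0]] = parts[1].strip()  # leading/trailing white-space removed
        line = stream.readline()
    raise ValueError("header-end line of '#' characters not found")


blob = b"NPARAM 12\nNFR -1\n\n DATATYPE  TRACK  \n###\n\x01\x02"
print(read_key_header(io.BytesIO(blob)))
# ({'NPARAM': '12', 'NFR': '-1', 'DATATYPE': 'TRACK'}, b'\x01\x02')
```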

The header may contain empty lines. The typical keys used in a key header are:

DATATYPE

Specifies the type of data stored. Typical values are:

SAMPLE: the file contains audio data (samples)
TRACK: the file contains feature vectors (frames)
SEG: the file is a segmentation file
PARAM: the file contains filter parameters, coefficients ...
DES: database description files

This is an open set, i.e. other values are allowed.

CORPUS

Replaces DATATYPE, indicating a corpus file.

DICTIONARY

May replace DATATYPE, or may be specified in conjunction with a "DATATYPE DES" key-value pair. Indicates a dictionary (lexicon) file.

DATAFORMAT

Specifies the type of the data elements. Must be one of the following base types:

FLOAT: a single-precision (32-bit) float
DOUBLE: a double-precision (64-bit) float
INT: a 32-bit integer, signed or unsigned
SHORT: a 16-bit integer, signed or unsigned
BYTE: an 8-bit integer, signed or unsigned
ALAW: an A-law value (8-bit unsigned integer)
MULAW: a µ-law value (8-bit unsigned integer)
STRING: a string (sequence of characters)
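The element sizes listed above are enough to check or infer the size of the data part, for example to resolve NFR = -1 from the file size. The sketch below (hypothetical helper names) leaves STRING out because it has no fixed size, and it does not address signedness, which the key header does not specify.

```python
# Size in bytes of one element of each fixed-size DATAFORMAT base type.
# STRING has no fixed size and is deliberately absent.
ELEMENT_SIZE = {
    "FLOAT": 4, "DOUBLE": 8, "INT": 4, "SHORT": 2,
    "BYTE": 1, "ALAW": 1, "MULAW": 1,
}


def expected_data_bytes(dataformat: str, nfr: int, nparam: int):
    """Bytes needed for NFR vectors of NPARAM elements, or None if not predictable."""
    size = ELEMENT_SIZE.get(dataformat)
    if size is None or nfr == -1:  # STRING/unknown type, or "read until end of data"
        return None
    return nfr * nparam * size


def frames_in_data(dataformat: str, nparam: int, nbytes: int) -> int:
    """Number of complete vectors in nbytes of data, e.g. to resolve NFR == -1."""
    return nbytes // (nparam * ELEMENT_SIZE[dataformat])


print(expected_data_bytes("SHORT", 16000, 1))   # 32000
print(frames_in_data("FLOAT", 13, 13 * 4 * 7))  # 7
```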

NPARAM

The length of the feature vector in a track file (or sample file, assuming FSHIFT is known), or the length of the vectors (typically rows) in a parameter file.

NFR

The number of frames in a track file (or a sample file, assuming FSHIFT is known), or the number of vectors (typically rows) in a parameter file. A value of -1 means "as many as are available", i.e. the application is supposed to process as many frames, entries, ... as it can read from the file.

NCHAN

The number of audio channels in the recording. The data for the different channels is stored in an interleaved fashion. For track files, this means that each vector of size NPARAM contains NCHAN feature vectors, also stored in an interleaved fashion.

CHANLEN

The number of samples in a sample file.

NENTRY

The number of entries in a segmentation, corpus or dictionary file.

NSEG

The number of segments (lines) in a segmentation file.

FSHIFT

The frame shift, i.e. each frame corresponds to frame_shift seconds of data. The default value is 0.01 (10 ms).

SAMPLEFREQ

The sample frequency of an audio file. The default value is 8000 (8kHz).

Note: When reading or writing a key header, the keys are automatically converted to/from the new SPRAAK format (see The standard keys).
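One possible reading of this conversion, derived only from the key descriptions in this section, is sketched below. It is hypothetical: SPRAAK's own conversion routine may differ, in particular in how DATAFORMAT values map onto the base-type names used for TYPE (here they are copied unchanged), and key_to_spraak is an illustrative name.

```python
# Keys that keep their name in the SPRAAK header format.
SAME_NAME = ("NCHAN", "CHANLEN", "FSHIFT", "SAMPLEFREQ")


def key_to_spraak(old: dict) -> dict:
    """Translate key-header keys into SPRAAK keys (hypothetical mapping)."""
    new = {k: old[k] for k in SAME_NAME if k in old}
    # CORPUS and DICTIONARY take precedence over a generic DATATYPE (e.g. "DES").
    if "CORPUS" in old:
        new["DATA"] = "CORPUS"
    elif "DICTIONARY" in old:
        new["DATA"] = "DICTIONARY"
    elif "DATATYPE" in old:
        new["DATA"] = old["DATATYPE"]
    if "DATAFORMAT" in old:
        new["TYPE"] = old["DATAFORMAT"]  # assumption: type names carry over as-is
    if new.get("DATA") == "SEG":
        # Segmentation: DIM1 = segments (lines), DIM2 = entries.
        if "NSEG" in old:
            new["DIM1"] = old["NSEG"]
        if "NENTRY" in old:
            new["DIM2"] = old["NENTRY"]
    else:
        # Tracks/samples/parameters: DIM1 = frames or vectors; corpora/lexica: entries.
        if "NFR" in old:
            new["DIM1"] = old["NFR"]
        elif "NENTRY" in old:
            new["DIM1"] = old["NENTRY"]
        if "NPARAM" in old:
            new["DIM2"] = old["NPARAM"]
    return new


print(key_to_spraak({"DATATYPE": "TRACK", "DATAFORMAT": "FLOAT",
                     "NFR": "-1", "NPARAM": "13", "FSHIFT": "0.01"}))
# {'FSHIFT': '0.01', 'DATA': 'TRACK', 'TYPE': 'FLOAT', 'DIM1': '-1', 'DIM2': '13'}
```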

Some examples of headers

TODO