File formats
SPRAAK supports a variety of file formats. All formats have the same basic layout:
-
a (sometimes optional) magic cookie, indicating the file (or header) format
-
a header describing the data
-
the raw data (can be compressed, or in ASCII, or ...).
Header-less files are also supported. In that case any necessary header information can be provided in an external file or as extra information with the file name. See Specifying a stream of data (file) and Some practical examples for more details.
For a list of supported file formats, encryption and compression schemes and other features, see Data IO in SPRAAK.
The SPRAAK header format
A SPRAAK header consists of:
-
The magic cookie ".spr\n".
-
A list of key-value pairs.
-
The header-end line consisting of a single pound sign '#' (no preceding or trailing white-space).
A key-value pair typically consists of a single line containing the following elements:
-
Optional white-space preceding the key.
-
The key. A key is a sequence of printable non white-space ASCII characters, i.e. any ASCII character in the range ['!' ... '~'].
-
Required white-space separating the key and the value.
-
The value. A value can be any sequence of characters. Values that start with white-space or a double quote ('"'), or that end in white-space or a back-slash ('\'), must be specified as quoted strings (cf. C-strings).
-
Optional white-space following the value.
-
A new line.
If desired, a key-value pair can be spread over multiple lines by ending each line that is to be continued with a back-slash immediately followed by a new-line.
The header may contain empty lines. The keys used in a SPRAAK header are identical to the keys exposed to the SPRAAK environment (see The standard keys).
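The layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the SPRAAK reader itself: the function name `parse_spr_header` is hypothetical, quoted (C-string style) values are not unescaped, and it is an assumption that a back-slash new-line continuation is simply dropped when the lines are joined.

```python
import io

MAGIC = b".spr\n"  # magic cookie that opens a SPRAAK header


def parse_spr_header(stream):
    """Read a SPRAAK-style header from a binary stream; return {key: value}.

    Hypothetical helper for illustration only. Quoted values are returned
    verbatim (including the quotes); continuation handling is an assumption.
    """
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("missing '.spr' magic cookie")
    header, pending = {}, ""
    while True:
        raw = stream.readline()
        if not raw:
            raise ValueError("end of file before the '#' header-end line")
        line = raw.decode("latin-1").rstrip("\n")
        if not pending and line == "#":
            return header  # the stream is now positioned at the raw data
        if line.endswith("\\"):
            pending += line[:-1]  # assumed: drop the back-slash and join
            continue
        line, pending = pending + line, ""
        parts = line.split(None, 1)  # key, then the rest of the line
        if not parts:
            continue  # empty lines are allowed in the header
        header[parts[0]] = parts[1].strip() if len(parts) > 1 else ""


example = (
    b".spr\n"
    b"DATA    TRACK\n"
    b"FORMAT  BIN01\n"
    b"\n"
    b"  DIM1  -1  \n"
    b"DIM2    12\n"
    b"#\n"
    b"\x00\x01"  # stand-in for the raw data that follows the header
)
stream = io.BytesIO(example)
hdr = parse_spr_header(stream)
payload = stream.read()
```

After the call, `hdr` holds the four key-value pairs as strings and `payload` holds whatever follows the header-end line; converting values such as DIM1 to integers is left to the caller.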
The standard keys
The key-set is by definition open, i.e. each object may define its own keys and corresponding values. However, in order to fit optimally in the SPRAAK framework, the following set of keys should be used where applicable:
- FORMAT
- Specifies how the data elements are stored:
- BIN01
- binary, little endian
- BIN10
- binary, big endian
- ASCII
- ascii
Other values are not allowed.
- LAYOUT
- Specifies the layout of the data elements:
- LIST
- a list (sequence) of entries, e.g. a corpus or a lexicon
- MATRIX
- a matrix or vector, e.g. feature vectors, samples, matrices
- CUSTOM
- some custom (usually complex) layout; in practice this will almost always result in a non-base-type value for the TYPE key as well
- DATA
- Specifies what kind of data the file contains. Typical values are:
- SAMPLE
- the file contains audio data (samples)
- TRACK
- the file contains feature vectors (frames)
- CORPUS
- the file is a corpus file
- DICTIONARY
- the file is a dictionary (lexicon) file
- SEG
- the file is a segmentation file
- PARAM
- some set of parameters
- DES
- some text file listing items
This is an open set, i.e. other values are allowed.
- TYPE
- Specifies the type of the data elements. Typically this is one of the base types (see datatype.c). If there is no fixed data type (e.g. a mixture of floats and integers), the value 'CUSTOM' should be used.
- OBJECT
- If set, this indicates that the complete file corresponds to a single object. Examples are:
- NIY
- future work
- DIM1
- The primary dimension of the stream of data. The primary dimension corresponds to:
-
the number of entries (lines) in a lexicon or corpus file
-
the number of segments (lines) in a segmentation file
-
the number of frames in a track or sample file
-
the number of vectors in a matrix; a vector consists of a consecutive list of data elements as stored in the file; whether the vectors are interpreted as row or column vectors is up to the routine that reads the data; the standard spr_print and spr_copy commands will interpret them as row vectors.
A value of -1 means "as much data as is available", i.e. the application is supposed to process as many frames, entries, ... as it can read from the file.
- DIM2
- The secondary dimension of the stream of data. The secondary dimension corresponds to:
-
the number of entries in a segmentation file
-
the number of parameters per frame in a track or sample file
-
the vector length in a matrix
The secondary dimension cannot be (-1), except for segmentation files.
- FSHIFT
- The frame shift, i.e. each frame corresponds to frame_shift seconds of data. The default value is 0.01 (10 msec).
- FOFFSET
- The frame offset, i.e. the first frame starts at frame_offset seconds. Typical values are 0.0 (ESAT convention) and (frame_length-frame_shift)/2 (HTK convention). The ESAT convention requires that the first and last frames invent some left/right data in order to have frame_length seconds of data for these frames as well. The advantages are that (1) a file containing X seconds of data will contain floor(X/frame_shift) frames, and (2) the data for frame i (base 0) is located from i*frame_shift to (i+1)*frame_shift (independent of the frame_length used).
- NCHAN
- The number of audio channels in the recording. The data for the different channels is stored in an interleaved fashion. For track files, this means that each vector of size DIM2 contains NCHAN feature vectors stored in an interleaved fashion.
- CHANLEN
- The number of samples in a sample file (equals sample_frequency*duration). Note that the primary dimension (DIM1) is the number of frames when reading this using a frame_shift FSHIFT.
- SAMPLEFREQ
- The sample frequency of an audio file. The default value is 8000 (8kHz).
- COMPRESS
- When set, the value of this key specifies a compression algorithm (e.g. gzip) and corresponding parameters that were used (reading data) or will be used (writing data) when the data (not the header) is written. See From physical storage to data content for more details on compression and other coders for data (and headers).
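As an illustration of how CHANLEN, SAMPLEFREQ, FSHIFT and DIM1 relate under the ESAT convention (FOFFSET 0.0), the relations stated for FOFFSET and CHANLEN can be written out as below. The helper names `frame_count` and `frame_span` are hypothetical and not part of SPRAAK; exact rational arithmetic is used only to keep the floor operation free of floating-point rounding, which is a choice made for this sketch.

```python
import math
from fractions import Fraction


def frame_count(chanlen, samplefreq=8000, fshift="0.01"):
    """Frames in a sample file under the ESAT convention.

    duration = CHANLEN / SAMPLEFREQ, and the file holds
    floor(duration / frame_shift) frames.
    """
    duration = Fraction(chanlen) / Fraction(samplefreq)
    return math.floor(duration / Fraction(fshift))


def frame_span(i, fshift="0.01"):
    """Start and end time (in seconds) of frame i, counting from 0."""
    shift = Fraction(fshift)
    return i * shift, (i + 1) * shift


n_frames = frame_count(16000)  # 2 seconds of audio at the default 8 kHz
start, end = frame_span(3)     # frame 3 covers 0.03 s up to 0.04 s
```

With the default sample frequency and frame shift, 16000 samples give 200 frames, which is the value a reader would expect for DIM1 in that case.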
The key header format
A key header consists of:
-
An optional (but strongly recommended) magic cookie ".key\n".
-
A list of key-value pairs.
-
The header-end line consisting of a sequence (1 or more) of pound signs '#' (no preceding or trailing white-space).
A key-value pair typically consists of a single line containing the following elements:
-
Optional white-space preceding the key.
-
The key. A key is a sequence of printable non white-space ASCII characters, i.e. any ASCII character in the range ['!' ... '~'].
-
Required white-space separating the key and the value.
-
The value. A value can be any sequence of characters, except that leading and trailing white-space will be automatically removed.
-
Optional white-space following the value.
-
A new line.
The header may contain empty lines. The typical keys used in a key header are:
- DATATYPE
- Specifies the type of data stored. Typical values are:
- SAMPLE
- the file contains audio data (samples)
- TRACK
- the file contains feature vectors (frames)
- SEG
- the file is a segmentation file
- PARAM
- the file contains filter parameters, coefficients ...
- DES
- database description files
This is an open set, i.e. other values are allowed.
- CORPUS
- Replaces DATATYPE, indicating a corpus file.
- DICTIONARY
- May replace DATATYPE, or may be specified in conjunction with a "DATATYPE DES" key-value pair. Indicates a dictionary (lexicon) file.
- DATAFORMAT
- Specifies the type of the data elements. Must be one of the following base types:
- FLOAT
- a single-precision (32-bit) float
- DOUBLE
- a double-precision (64-bit) float
- INT
- a 32-bit integer, can be signed or unsigned
- SHORT
- a 16-bit integer, can be signed or unsigned
- BYTE
- an 8-bit integer, can be signed or unsigned
- ALAW
- an A-law value (8-bit unsigned integer)
- MULAW
- a u-law value (8-bit unsigned integer)
- STRING
- a string (sequence of characters)
- NPARAM
- The length of the feature vector in a track file (or sample file, assuming FSHIFT is known), or the length of the vectors (typically rows) in a parameter file.
- NFR
- The number of frames in a track file (or a sample file, assuming FSHIFT is known), or the number of vectors (typically rows) in a parameter file. A value of -1 means "as much data as is available", i.e. the application is supposed to process as many frames, entries, ... as it can read from the file.
- NCHAN
- The number of audio channels in the recording. The data for the different channels is stored in an interleaved fashion. For track files, this means that each vector of size DIM2 contains NCHAN feature vectors stored in an interleaved fashion.
- CHANLEN
- The number of samples in a sample file.
- NENTRY
- The number of entries in a segmentation, corpus or dictionary file.
- NSEG
- The number of segments (lines) in a segmentation file.
- FSHIFT
- The frame shift, i.e. each frame corresponds to frame_shift seconds of data. The default value is 0.01 (10 msec).
- SAMPLEFREQ
- The sample frequency of an audio file. The default value is 8000 (8kHz).
- Note
- When reading or writing a key header, the keys are automatically converted to/from the new SPRAAK format (see The standard keys).
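The key header layout described in this section differs from the SPRAAK header in three ways: the magic cookie is optional, the header-end line may contain several pound signs, and values are stripped of surrounding white-space. A minimal sketch of a reader that reflects these points follows. The function name `parse_key_header` is hypothetical, and the conversion of the keys to the SPRAAK key set is deliberately left out because the mapping is not spelled out here.

```python
import io


def parse_key_header(stream):
    """Read a key-style header from a binary stream; return {key: value}.

    Hypothetical helper for illustration only; keys are returned as found
    in the file, without conversion to the SPRAAK key names.
    """
    header = {}
    first = True
    while True:
        raw = stream.readline()
        if not raw:
            raise ValueError("end of file before the header-end line")
        line = raw.decode("latin-1").rstrip("\n")
        if first:
            first = False
            if line == ".key":
                continue  # optional magic cookie
        if line and set(line) == {"#"}:
            return header  # one or more '#', nothing else on the line
        parts = line.split(None, 1)
        if not parts:
            continue  # empty lines are allowed
        header[parts[0]] = parts[1].strip() if len(parts) > 1 else ""


body = b"DATATYPE TRACK\nDATAFORMAT  FLOAT \nNPARAM 12\nNFR -1\n####\n\x00"
with_cookie = parse_key_header(io.BytesIO(b".key\n" + body))
without_cookie = parse_key_header(io.BytesIO(body))
```

Both calls yield the same dictionary, since the cookie only identifies the header format and carries no key-value information.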
Some examples of headers
TODO