Contents

A corpus file contains information on a database. It gives the transcription of the data in a file or in part of a file. The data itself may be sampled data. Corpus files are typically stored with the extension '.cor'.

A corpus file abstracts away the root directory and the file name extensions. Hence, the same corpus file can be used for the sampled data as well as for all data derived from it, such as feature data, labels, etc.
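As a rough illustration of this abstraction, the sketch below resolves one corpus file name against different root directories and extensions. The helper name resolve, the directory names and the '.wav', '.mfcc' and '.lab' extensions are assumptions made for the example; they are not prescribed by the corpus format.

```python
from pathlib import Path


def resolve(root, filename, extension):
    """Combine a root directory, a corpus file name and an extension."""
    return Path(root) / (filename + extension)


# Hypothetical roots and extensions; only the bare name "tst" would come
# from the corpus file itself.
sample_path = resolve("/data/wav", "tst", ".wav")
feature_path = resolve("/data/feat", "tst", ".mfcc")
label_path = resolve("/data/lab", "tst", ".lab")
```

The corpus entry stays the same in all three cases; only the root and extension supplied by the calling program change.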

Keys

The important header keys are:

DATA: must be set to CORPUS
DIM1: the number of entries in the CORPUS (i.e. the number of lines of DATA)
TIMEBASE: CONTINUOUS or DISCRETE (Default)

Data

The first four fields in a corpus entry have a predefined meaning. Additional fields are optional and can be interpreted on a program-specific basis. Hence, a corpus entry takes the form:

FILENAME TRANSCRIPTION  F1 F2  [OPT_DATA]

FILENAME: file name, without extension
TRANSCRIPTION: word level transcription in which words are separated by "_"
F1: first frame (default=0) for DISCRETE or begin time for CONTINUOUS
F2: number of frames (default=-1, i.e. all data) for DISCRETE or end time for CONTINUOUS
OPT_DATA: one or more optional fields of metadata, to be interpreted in a program-specific way

Example

.spr
DATA    CORPUS
DIM1    5
#
1       one     0 -1 C
347     three_four_seven 0 -1 F
tst     This    0  24 M
-       is      25 39 M
-       a_test  40 70 M
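
A minimal sketch of how such a file could be read, assuming the layout seen in the example: a leading '.spr' line, whitespace-separated header keys, a '#' line that separates the header from the data, and one whitespace-separated entry per line. The function name parse_corpus and the dictionary keys are hypothetical, and this is not the parser used by any particular program.

```python
def parse_corpus(text):
    """Split corpus text into a header dict and a list of entry dicts."""
    header, entries = {}, []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if not in_data:
            if line == "#":                  # assumed header/data separator
                in_data = True
            elif not line.startswith("."):   # skip the leading ".spr" line
                parts = line.split(None, 1)
                header[parts[0]] = parts[1] if len(parts) > 1 else ""
            continue
        fields = line.split()
        entries.append({
            "filename": fields[0],
            "transcription": fields[1].split("_"),   # words separated by "_"
            # float covers both frame numbers and times; defaults per the spec
            "f1": float(fields[2]) if len(fields) > 2 else 0.0,
            "f2": float(fields[3]) if len(fields) > 3 else -1.0,
            "opt_data": fields[4:],
        })
    return header, entries


EXAMPLE = """.spr
DATA    CORPUS
DIM1    5
#
1       one     0 -1 C
347     three_four_seven 0 -1 F
tst     This    0  24 M
-       is      25 39 M
-       a_test  40 70 M
"""

header, entries = parse_corpus(EXAMPLE)
```

Here header["DIM1"] can be compared against len(entries) as a basic consistency check.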

Remarks & Limitations:

A single data file may be described by multiple lines.
A "-" as file name indicates continuation in the previous file. Segments do not need to be contiguous; however, there should be no 'rewinding', i.e. the first frame of a new segment in the same file should not lie before the end of the previous segment.
For historical reasons, the usage of F1 and F2 is not consistent across the DISCRETE and CONTINUOUS time bases: F2 is a number of frames for DISCRETE but an end time for CONTINUOUS.
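
The continuation and 'no rewinding' rules could be checked along the following lines. This is only a sketch under stated assumptions: the names segment_end and check_segments are hypothetical, only "-" continuation lines are checked, a continuation after a segment that already covers all data (F2 = -1) is treated as rewinding, and the segment times in the usage lines are invented for illustration.

```python
def segment_end(f1, f2, timebase="DISCRETE"):
    """End of a segment, or None when F2 is -1 (all data) for DISCRETE."""
    if timebase == "CONTINUOUS":
        return f2            # F2 is the end time
    if f2 == -1:
        return None          # segment runs to the end of the file
    return f1 + f2           # F2 is a number of frames


def check_segments(entries, timebase="DISCRETE"):
    """Resolve "-" file names and reject segments that rewind.

    entries is a list of (filename, f1, f2) tuples.
    """
    resolved = []
    current, prev_end = None, None
    for filename, f1, f2 in entries:
        if filename == "-":
            if current is None:
                raise ValueError('"-" used before any file name')
            # Assumption: nothing can follow a segment that covers all data.
            if prev_end is None or f1 < prev_end:
                raise ValueError(f"segment starting at {f1} rewinds in {current}")
        else:
            current = filename
        prev_end = segment_end(f1, f2, timebase)
        resolved.append((current, f1, f2))
    return resolved


# Hypothetical CONTINUOUS segments (begin time, end time) in one file.
segments = [("tst", 0.0, 0.24), ("-", 0.25, 0.39), ("-", 0.40, 0.70)]
resolved = check_segments(segments, timebase="CONTINUOUS")
```

Keeping the end computation in its own function isolates the DISCRETE/CONTINUOUS difference in the meaning of F2 from the rest of the check.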