- Contents
A corpus file contains information on a database. It gives the transcription of the data in a file or in part of a file. The data itself may be sampled data. Corpus files are typically stored with the extension '.cor'.
A corpus file abstracts away the root directory and the file name extensions. Hence, the same corpus file can be used for the sampled data as well as for all data derived from it, such as feature data, labels, etc.
- Keys
The important header keys are:
-
DATA: must be set to CORPUS
-
DIM1: the number of entries in the CORPUS (i.e. the number of lines of DATA)
-
TIMEBASE: CONTINUOUS or DISCRETE (Default)
- Data
The first four fields in a corpus entry have a predefined meaning. Additional fields are optional and can be interpreted on a program-specific basis. Hence a corpus entry takes the form:
FILENAME TRANSCRIPTION F1 F2 [OPT_DATA]
-
FILENAME: file name, without extension
-
TRANSCRIPTION: word level transcription in which words are separated by "_"
-
F1: first frame (default=0) for DISCRETE or begin time for CONTINUOUS
-
F2: number of frames (default=-1, i.e. all data) for DISCRETE or end time for CONTINUOUS
-
OPT_DATA: one or several optional fields of metadata, to be interpreted in a program-specific way
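The field layout above can be sketched as a small Python parser. This is only an illustration, not the toolkit's own reader: the names CorpusEntry and parse_entry are hypothetical, fields are assumed to be separated by whitespace, and the defaults for F1 and F2 are assumed to apply only when those fields are missing at the end of the line.

```python
from typing import List, NamedTuple


class CorpusEntry(NamedTuple):
    """One corpus line (hypothetical container, not part of the format)."""
    filename: str         # file name without extension, or "-"
    words: List[str]      # transcription split on "_"
    f1: float             # first frame (DISCRETE) or begin time (CONTINUOUS)
    f2: float             # number of frames (DISCRETE) or end time (CONTINUOUS)
    opt_data: List[str]   # remaining fields, left uninterpreted


def parse_entry(line: str) -> CorpusEntry:
    # Assumption: fields are separated by arbitrary whitespace.
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("a corpus entry needs at least FILENAME and TRANSCRIPTION")
    # Documented defaults: F1 = 0, F2 = -1 (all data).
    # Values are kept as floats so that CONTINUOUS times fit as well.
    f1 = float(fields[2]) if len(fields) > 2 else 0.0
    f2 = float(fields[3]) if len(fields) > 3 else -1.0
    return CorpusEntry(fields[0], fields[1].split("_"), f1, f2, fields[4:])


entry = parse_entry("347 three_four_seven 0 -1 F")
print(entry.filename, entry.words, entry.opt_data)
```

Keeping the optional fields as an uninterpreted list of strings mirrors the text: their meaning is left to the program that consumes the corpus.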
- Example
.spr
DATA CORPUS
DIM1 5
#
1 one 0 -1 C
347 three_four_seven 0 -1 F
tst This 0 24 M
- is 25 39 M
- a_test 40 70 M
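As a rough sketch, the example above could be split into header keys and data lines as follows. The function name split_corpus is hypothetical, and the code assumes what the example suggests but the text does not state: that header lines have the form "KEY VALUE", that the leading ".spr" line carries no key, and that a line holding only "#" ends the header.

```python
from typing import Dict, List, Tuple


def split_corpus(text: str) -> Tuple[Dict[str, str], List[str]]:
    """Split corpus file text into (header keys, data lines)."""
    lines = text.splitlines()
    header: Dict[str, str] = {}
    data_start = len(lines)
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped == "#":            # assumed end-of-header marker
            data_start = i + 1
            break
        parts = stripped.split(None, 1)
        if len(parts) == 2:            # "KEY VALUE"; the ".spr" line is skipped
            header[parts[0]] = parts[1]
    entries = [l for l in lines[data_start:] if l.strip()]

    if header.get("DATA") != "CORPUS":
        raise ValueError("DATA must be set to CORPUS")
    header.setdefault("TIMEBASE", "DISCRETE")   # documented default
    if "DIM1" in header and int(header["DIM1"]) != len(entries):
        raise ValueError("DIM1 does not match the number of entries")
    return header, entries


EXAMPLE = """.spr
DATA CORPUS
DIM1 5
#
1 one 0 -1 C
347 three_four_seven 0 -1 F
tst This 0 24 M
- is 25 39 M
- a_test 40 70 M
"""

header, entries = split_corpus(EXAMPLE)
print(header["TIMEBASE"], len(entries))
```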
- Remarks & Limitations:
-
A single file may be described by multiple lines in the corpus.
-
A "-" as filename indicates a continuation in the previous file. Data does not need to be specified contiguously; however, there should be no 'rewinding', i.e. a new segment in the same file should not start before the end of the previous segment.
-
For historical reasons, the usage of F1 and F2 is not consistent between the DISCRETE and CONTINUOUS time bases: F2 is a number of frames for DISCRETE but an end time for CONTINUOUS.
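The continuation and 'no rewinding' rules might be checked along the following lines. This sketch is not taken from the toolkit: the name resolve_segments is hypothetical, segments are given as (filename, begin, end) tuples using the CONTINUOUS interpretation in which F2 is an end time, and the rewinding check is assumed to apply only to "-" continuation lines.

```python
from typing import Iterable, List, Optional, Tuple

Segment = Tuple[str, float, float]   # (filename, begin time, end time)


def resolve_segments(segments: Iterable[Segment]) -> List[Segment]:
    """Replace "-" by the previous file name and reject rewinding."""
    resolved: List[Segment] = []
    current: Optional[str] = None
    prev_end: Optional[float] = None
    for name, begin, end in segments:
        if name == "-":
            if current is None:
                raise ValueError('"-" used before any file name')
            # A continuation must not start before the previous segment ended.
            if prev_end is not None and begin < prev_end:
                raise ValueError(f"rewinding in file {current!r}")
        else:
            current = name
        resolved.append((current, begin, end))
        prev_end = end
    return resolved


segments = [("tst", 0.00, 0.24), ("-", 0.25, 0.39), ("-", 0.40, 0.70)]
for filename, begin, end in resolve_segments(segments):
    print(filename, begin, end)
```

Gaps between consecutive segments are accepted, since the data does not need to be specified contiguously; only a start before the previous end is rejected.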