Contents

A segmentation file contains either manual or automatic segmentations of a corpus. There is little difference between a corpus file and a segmentation file, since a corpus file may be regarded as a sentence-level segmentation of a large body of data. Segmentation files typically take a '.seg' extension.

Keys

The important header keys are:

DATA: must be set to SEG
DIM1: the number of segments (lines of data)
DIM2: the number of segmentations (number of segmented files)
TIMEBASE: CONTINUOUS or DISCRETE (Default)
SEGTYPE: VITERBI, MANUAL
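
As an illustration of the constraints listed above, here is a minimal Python sketch that checks a header that has already been read into a dictionary of strings. The function name check_header and the dictionary representation are assumptions made for this example and are not part of any existing tool. Only the rules stated above are enforced (DATA must be SEG, TIMEBASE defaults to DISCRETE); SEGTYPE is left unchecked because the text does not say whether other values are allowed.

```python
from typing import Dict, Union

HeaderValue = Union[str, int]


def check_header(header: Dict[str, str]) -> Dict[str, HeaderValue]:
    """Return a normalised copy of a segmentation header (hypothetical helper).

    Raises ValueError if the header cannot describe a segmentation file.
    """
    # DATA must be set to SEG.
    if header.get("DATA") != "SEG":
        raise ValueError("DATA must be set to SEG")

    result: Dict[str, HeaderValue] = dict(header)

    # TIMEBASE is CONTINUOUS or DISCRETE, with DISCRETE as the default.
    result.setdefault("TIMEBASE", "DISCRETE")
    if result["TIMEBASE"] not in ("CONTINUOUS", "DISCRETE"):
        raise ValueError("TIMEBASE must be CONTINUOUS or DISCRETE")

    # DIM1 and DIM2 are counts, so convert them to integers when present.
    for key in ("DIM1", "DIM2"):
        if key in result:
            result[key] = int(result[key])

    return result
```

For example, check_header({"DATA": "SEG", "DIM1": "3"}) fills in TIMEBASE as DISCRETE and converts DIM1 to the integer 3.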

Data

Each line contains one entry and takes the following form:

FILENAME TRANSCRIPTION  F1 F2  [OPT_DATA]

The first four fields in an entry have a predefined meaning.

FILENAME: file name, without extension
TRANSCRIPTION: unit or state level transcription, taking the form <unit> or <unit>#<state_nr>
F1: first frame (default= 0 for DISCRETE or begin time for CONTINUOUS)
F2: number of frames (default= -1, i.e. end_of_file)

Additional fields are optional and can be interpreted on a program-specific basis.
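
The entry layout described above can be sketched in Python as follows. This is an illustrative parser, not a reference implementation: the names SegEntry, split_transcription and parse_entry are invented for this example, and it assumes that fields are separated by whitespace and that a missing F1 or F2 simply takes its documented default (0 and -1).

```python
from typing import List, NamedTuple, Optional, Tuple, Union

Number = Union[int, float]


class SegEntry(NamedTuple):
    filename: str          # FILENAME, without extension
    unit: str              # unit part of TRANSCRIPTION
    state: Optional[int]   # state number, or None for a unit-level entry
    f1: Number             # first frame or begin time
    f2: Number             # number of frames (-1 means up to end of file)
    extra: List[str]       # optional, program-specific fields


def _number(text: str) -> Number:
    """Parse an integer frame index, falling back to a float time value."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def split_transcription(label: str) -> Tuple[str, Optional[int]]:
    """Split '<unit>#<state_nr>' at the last '#'; plain '<unit>' has no state."""
    head, sep, tail = label.rpartition("#")
    if sep and head and tail.isdigit():
        return head, int(tail)
    return label, None


def parse_entry(line: str) -> SegEntry:
    """Parse one data line into its four predefined fields plus extras."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("an entry needs at least FILENAME and TRANSCRIPTION")
    unit, state = split_transcription(fields[1])
    f1 = _number(fields[2]) if len(fields) > 2 else 0    # default F1
    f2 = _number(fields[3]) if len(fields) > 3 else -1   # default F2
    return SegEntry(fields[0], unit, state, f1, f2, fields[4:])
```

Splitting at the last '#' matters because the unit itself may contain that character: the label ##1 is read as unit '#' with state 1, while a bare '#' is kept as a unit-level label.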

Example

.spr
DATA            SEG
DIM1            1773201
DIM2            7239
SEGTYPE         VITERBI
MODEL_FILE      mod_wsj_init/acmod.hmm
TIMEBASE        DISCRETE
FSHIFT          0.01
#
wsj0/si_tr_s/011/011c0201               ##0     0       1
-               ##1     1       35
-               ##2     36      23
-               D#0     59      2
-               D#1     61      2
-               D#2     63      2
-               @#0     65      1
-               @#1     66      1
-               @#2     67      1
-               s#0     68      4
-               s#1     72      7
-               s#2     79      3
-               e+I#0   82      3
-               e+I#1   85      2
-               e+I#2   87      2
-               l#0     89      3
-               l#1     92      10
-               l#2     102     1
...
-               t#0     609     1
-               t#1     610     7
-               t#2     617     1
-               ##0     618     9
-               ##1     627     25
-               ##2     652     2
wsj0/si_tr_s/011/011c0202               ##0     0       1
-               ##1     1       45
-               ##2     46      1
-               D#0     47      1
-               D#1     48      2
-               D#2     50      1
-               i#0     51      2
-               i#1     53      3
-               i#2     56      2
...

The above example contains segmentations of 7239 files, divided into 1773201 state segments. It also shows the use of a 3-state silence model: '#' is the transcription of silence, hence ##0, ##1 and ##2 are the respective state transcriptions.
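
A file laid out like the example could be read with a short Python sketch such as the one below. Several points are inferred from the example rather than stated in the text, and should be treated as assumptions: the first line '.spr' is skipped as a marker, a line containing only '#' ends the header, a FILENAME of '-' means "same file as the previous entry", and all four fields are present with integer values (DISCRETE timebase). The names read_seg and is_contiguous and the file names in SAMPLE are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Entry = Tuple[str, str, int, int]  # (filename, transcription, F1, F2)

# Small made-up sample in the same layout as the example above.
SAMPLE = """\
.spr
DATA            SEG
DIM1            4
DIM2            2
SEGTYPE         VITERBI
TIMEBASE        DISCRETE
#
utt_a           ##0     0       1
-               ##1     1       35
-               ##2     36      23
utt_b           ##0     0       1
"""


def read_seg(lines: Iterable[str]) -> Tuple[Dict[str, str], List[Entry]]:
    """Read header keys and data entries from the lines of a segmentation file."""
    header: Dict[str, str] = {}
    entries: List[Entry] = []
    in_header = True
    current_file: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if in_header:
            if line == "#":            # assumed end-of-header marker
                in_header = False
            elif line != ".spr":       # assumed marker line, carries no key
                parts = line.split(None, 1)
                header[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
            continue
        fields = line.split()
        name = fields[0]
        if name == "-":                # assumed: repeat the previous file name
            if current_file is None:
                raise ValueError("'-' used before any file name")
            name = current_file
        else:
            current_file = name
        entries.append((name, fields[1], int(fields[2]), int(fields[3])))
    return header, entries


def is_contiguous(entries: Iterable[Entry]) -> bool:
    """True if, within each file, every segment starts where the previous ended."""
    previous: Optional[Tuple[str, int, int]] = None
    for name, _, f1, f2 in entries:
        if previous is not None and previous[0] == name:
            if f1 != previous[1] + previous[2]:
                return False
        previous = (name, f1, f2)
    return True
```

Calling read_seg(SAMPLE.splitlines()) returns the header dictionary and four entries; the number of entries can then be compared with DIM1 and the number of distinct file names with DIM2. The contiguity check reflects what the example shows (each F1 equals the previous F1 plus F2 within a file) and is not a rule stated by the format description.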