SPRAAK
A segmentation file contains either manual or automatic segmentations of a corpus. There is little difference between a corpus file and a segmentation file, since a corpus file may be considered a sentence-level segmentation of a large body of data. Segmentation files typically take a '.seg' extension.
The important header keys are:
Each line contains one entry and takes the following form:
FILENAME TRANSCRIPTION F1 F2 [OPT_DATA]
The first four fields in an entry have a predefined meaning.
Additional fields are optional and can be interpreted on a program-specific basis.
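As a rough illustration of the entry layout above, the following Python sketch splits one line into its four predefined fields plus any optional data. It is not part of SPRAAK: the names SegEntry and parse_entry are hypothetical, and the sketch assumes that fields are separated by whitespace and that F1 and F2 are integers.

```python
from typing import List, NamedTuple


class SegEntry(NamedTuple):
    """One entry of a segmentation file (hypothetical helper type)."""
    filename: str
    transcription: str
    f1: int
    f2: int
    opt_data: List[str]


def parse_entry(line: str) -> SegEntry:
    """Split 'FILENAME TRANSCRIPTION F1 F2 [OPT_DATA]' into its fields."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}: {line!r}")
    filename, transcription, f1, f2, *opt_data = fields
    # Assumption: F1 and F2 are integers; anything after them is kept verbatim.
    return SegEntry(filename, transcription, int(f1), int(f2), opt_data)


entry = parse_entry("wsj0/si_tr_s/011/011c0201 ##0 0 1")
print(entry.transcription, entry.f1, entry.f2)
```

Keeping the optional fields as an uninterpreted list of strings mirrors the text: their meaning is left to the program that reads them.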
.spr
DATA SEG
DIM1 1773201
DIM2 7239
SEGTYPE VITERBI
MODEL_FILE mod_wsj_init/acmod.hmm
TIMEBASE DISCRETE
FSHIFT 0.01
#
wsj0/si_tr_s/011/011c0201 ##0 0 1
- ##1 1 35
- ##2 36 23
- D#0 59 2
- D#1 61 2
- D#2 63 2
- @#0 65 1
- @#1 66 1
- @#2 67 1
- s#0 68 4
- s#1 72 7
- s#2 79 3
- e+I#0 82 3
- e+I#1 85 2
- e+I#2 87 2
- l#0 89 3
- l#1 92 10
- l#2 102 1
...
- t#0 609 1
- t#1 610 7
- t#2 617 1
- ##0 618 9
- ##1 627 25
- ##2 652 2
wsj0/si_tr_s/011/011c0202 ##0 0 1
- ##1 1 45
- ##2 46 1
- D#0 47 1
- D#1 48 2
- D#2 50 1
- i#0 51 2
- i#1 53 3
- i#2 56 2
...
The above example contains segmentations from 7239 files (DIM2), segmented into 1773201 state segments (DIM1). One may also observe the use of a 3-state silence model: '#' is the transcription of silence, hence ##0, ##1 and ##2 are the respective state transcriptions.
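The relation between the header and the entries can be checked with a short Python sketch. It reads a made-up miniature file in the same layout, splits the header from the data at the lone '#' line, and compares the entry and file counts with DIM1 and DIM2. Several points are assumptions inferred from the example rather than documented behaviour: a '-' in the FILENAME field is taken to repeat the previous file name, F1 and F2 are taken to be a start frame and a length in frames, and FSHIFT is taken to be the frame shift in seconds. The function name read_seg is hypothetical.

```python
from typing import Dict, List, Tuple

# Made-up miniature file in the layout of the example above; the DIM1/DIM2
# values were adjusted to match this toy data.
SAMPLE = """\
.spr
DATA SEG
DIM1 5
DIM2 2
TIMEBASE DISCRETE
FSHIFT 0.01
#
wsj0/si_tr_s/011/011c0201 ##0 0 1
- ##1 1 35
- ##2 36 23
wsj0/si_tr_s/011/011c0202 ##0 0 1
- ##1 1 45
"""

Segment = Tuple[str, str, int, int]  # (filename, transcription, F1, F2)


def read_seg(text: str) -> Tuple[Dict[str, str], List[Segment]]:
    """Return the header key/value pairs and the list of entries."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    sep = lines.index("#")  # a lone '#' closes the header
    header: Dict[str, str] = {}
    for ln in lines[1:sep]:  # skip the leading '.spr' line
        key, _, value = ln.partition(" ")
        header[key] = value.strip()
    segments: List[Segment] = []
    current = ""
    for ln in lines[sep + 1:]:
        name, trans, f1, f2 = ln.split()[:4]
        if name != "-":  # assumption: '-' repeats the previous file name
            current = name
        segments.append((current, trans, int(f1), int(f2)))
    return header, segments


header, segments = read_seg(SAMPLE)
n_files = len({name for name, _, _, _ in segments})
print(len(segments) == int(header["DIM1"]), n_files == int(header["DIM2"]))

# Assumption: FSHIFT is seconds per frame, F1 a start frame, F2 a frame count.
fshift = float(header["FSHIFT"])
for name, trans, f1, f2 in segments[:3]:
    print(f"{trans}: start {f1 * fshift:.2f} s, duration {f2 * fshift:.2f} s")
```

Under these assumptions each segment starts where the previous one in the same file ends (0 + 1 = 1, 1 + 35 = 36), which matches the pattern in the example.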