- Frames in Speech Processing
SPRAAK uses frame-synchronous signal processing when converting sampled data to a sequence of feature vectors.
The following naming conventions are used:
- frameshift (FSHIFT): the shift between two consecutive frames (default = 0.01 sec)
- framelength (FLENGTH): the length of an individual frame (default = 0.03 sec)
- timebase (TIMEBASE): the unit in which frameshift and framelength are expressed; TIMEBASE=CONTINUOUS expresses time in seconds and TIMEBASE=DISCRETE expresses time in number of samples
FSHIFT, FLENGTH, TIMEBASE are the relevant keys in the SPRAAK file headers.
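As an illustration of the two timebases, the sketch below converts the default frameshift and framelength from seconds to a number of samples. The helper names, the 16 kHz sampling rate and the rounding to the nearest sample are assumptions made for this example; they are not taken from the SPRAAK sources.

```python
def seconds_to_samples(value_sec: float, sample_rate_hz: float) -> int:
    """Convert a TIMEBASE=CONTINUOUS value (seconds) to TIMEBASE=DISCRETE (samples).

    Rounding to the nearest sample is an assumption of this sketch.
    """
    return round(value_sec * sample_rate_hz)


def samples_to_seconds(value_samples: int, sample_rate_hz: float) -> float:
    """Convert a TIMEBASE=DISCRETE value (samples) to TIMEBASE=CONTINUOUS (seconds)."""
    return value_samples / sample_rate_hz


SAMPLE_RATE_HZ = 16000  # assumed sampling rate, not specified in the text

fshift_samples = seconds_to_samples(0.01, SAMPLE_RATE_HZ)   # default FSHIFT
flength_samples = seconds_to_samples(0.03, SAMPLE_RATE_HZ)  # default FLENGTH
print(fshift_samples, flength_samples)
```

At the assumed 16 kHz, the defaults correspond to a frameshift of 160 samples and a framelength of 480 samples.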
- FRAME Convention in SPRAAK
In SPRAAK the absolute position of a frame is uniquely defined by the frame number (IFR) and the frameshift parameter. With FSHIFT and FLENGTH expressed in number of samples, frame IFR covers the samples [ IFR*FSHIFT-(FLENGTH-FSHIFT)/2 : IFR*FSHIFT+(FLENGTH+FSHIFT)/2-1 ], which matches the ranges listed in the diagram. This is shown graphically in the diagram below:
(* marks a sample position before the start of the file)
     0123456789012345678901234567890123..  sample index (last digit shown only)
     ||||||||||||||||||||||||||||||||||||  sampled data
FSHIFT=10, FLENGTH=20
*****012345678901234                       frame 0 [-5:14]
          56789012345678901234             frame 1 [5:24]
                    56789012345678901234   frame 2 [15:34]
FSHIFT=10, FLENGTH=14
   **012345678901                          frame 0 [-2:11]
             89012345678901                frame 1 [8:21]
                       89012345678901      frame 2 [18:31]
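The frame ranges in the diagram can be reproduced with a few lines of code. The sketch below is not SPRAAK code: the function name is hypothetical, and the formula was chosen because it reproduces every range listed in the diagram. Integer division for odd values of FLENGTH-FSHIFT is an assumption.

```python
def frame_range(ifr: int, fshift: int, flength: int) -> tuple[int, int]:
    """Return (first, last) sample index of frame `ifr`, both inclusive.

    All arguments are expressed in samples (TIMEBASE=DISCRETE).
    The frame is centered on the interval [ifr*fshift, (ifr+1)*fshift - 1].
    """
    first = ifr * fshift - (flength - fshift) // 2  # integer division: assumption
    last = first + flength - 1
    return first, last


# Reproduce the two examples from the diagram.
for flength in (20, 14):
    for ifr in range(3):
        first, last = frame_range(ifr, fshift=10, flength=flength)
        print(f"FSHIFT=10, FLENGTH={flength}: frame {ifr} [{first}:{last}]")
```

A negative first index, as for frame 0, means the frame starts before the first sample of the file.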
- Motivation
This approach has the following significant advantages:
- a 1-to-1 synchronization of frames produced by signal processing algorithms that use the same frameshift
- a straightforward synchronization of sample files and frames
- a straightforward synchronization of frames produced by signal processing algorithms whose frameshifts are easily related (e.g. 2:1)
- synchronization is naturally maintained throughout all further processing (including the computation of feature-vs-time derivatives)
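The synchronization properties listed above can be made concrete with a small sketch. It assumes, as the ranges in the diagram suggest, that frame IFR is centered on the FSHIFT-sample interval starting at IFR*FSHIFT. The function names are hypothetical and not part of SPRAAK.

```python
def sample_to_frame(sample: int, fshift: int) -> int:
    """Frame number whose FSHIFT-sample interval contains `sample`."""
    return sample // fshift


def frame_center(ifr: int, fshift: int) -> float:
    """Center of frame `ifr` on the sample axis (between two samples if fshift is even)."""
    return ifr * fshift + (fshift - 1) / 2


def fine_frames_for(coarse_ifr: int, ratio: int) -> range:
    """Frames at frameshift FSHIFT that fall inside one frame at frameshift ratio*FSHIFT."""
    return range(coarse_ifr * ratio, (coarse_ifr + 1) * ratio)


# Samples 0..9 map to frame 0 and samples 10..19 to frame 1 when FSHIFT=10.
print(sample_to_frame(9, 10), sample_to_frame(10, 10))

# With a 2:1 frameshift ratio, coarse frame 3 (FSHIFT=20) pairs with fine frames 6 and 7 (FSHIFT=10).
fine = list(fine_frames_for(3, 2))
print(fine, frame_center(3, 20), [frame_center(i, 10) for i in fine])
```

Under this assumption the center of a coarse frame is the mean of the centers of the fine frames it pairs with, so the two streams stay aligned without any offset bookkeeping.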
This approach also has one minor drawback: computations involving the initial and final frames normally require data that extends beyond the file boundaries. The following solutions to this missing-data problem are offered:
- duplicate (mirror) a sufficiently large segment of initial/final sample data beyond the file boundary in order to account for the missing data (new default)
- append zeros before and after the first/last sample (old default)
- similar solutions exist for calculating the Mel-spectrum time derivatives or for any other time filter
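The two padding strategies for sample data can be sketched as follows. This is an illustration only: the function name and the mode names are hypothetical, and the exact mirroring rule (here the boundary sample itself is repeated) is an assumption, since the text does not specify how SPRAAK mirrors the data.

```python
def pad_signal(samples, n_before: int, n_after: int, mode: str = "mirror") -> list:
    """Extend `samples` beyond both file boundaries.

    mode="mirror": reflect the first/last samples around the boundary
                   (boundary sample included; an assumption of this sketch).
    mode="zero":   add zeros before the first and after the last sample.
    """
    samples = list(samples)
    if mode == "mirror":
        if n_before > len(samples) or n_after > len(samples):
            raise ValueError("not enough samples to mirror")
        before = samples[:n_before][::-1]
        after = samples[len(samples) - n_after:][::-1]
    elif mode == "zero":
        before = [0] * n_before
        after = [0] * n_after
    else:
        raise ValueError(f"unknown mode: {mode}")
    return before + samples + after


signal = [1, 2, 3, 4]
print(pad_signal(signal, 2, 1, mode="mirror"))  # [2, 1, 1, 2, 3, 4, 4]
print(pad_signal(signal, 2, 1, mode="zero"))    # [0, 0, 1, 2, 3, 4, 0]
```

For the first example in the diagram (FSHIFT=10, FLENGTH=20), frame 0 starts at sample -5, so at least 5 samples would have to be added before the start of the file.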
- Note
- This rarely leads to problems as more often than not the initial and final data is (stationary) noise.