|
SprCwrSLex * | spr_cwr_slex_free (SprCwrSLex *lex, SprMsgId *routine) |
| Free a set of lexicon names.
|
|
SprCwrSLex * | spr_cwr_slex_alloc (SprCwrSLex *lex, int nr_words, int wlen_tot) |
| Allocate and/or initialise the set of lexicon names.
|
|
int | spr_cwr_sum_str_len (const SprCwrSLex *lex) |
|
SprCwrSLex * | spr_cwr_slex_dup (SprCwrSLex *lex_dst, const SprCwrSLex *lex_src) |
|
SprCwrSLex * | spr_cwr_slex1_create (SprCwrSLex *lex, int duplicate, const SprStrHashTbl *whash_main, const SprStrHashTbl *whash_ext) |
|
SprCwrSLex * | spr_cwr_slex_read (SprCwrSLex *lex, const char *fname, SprKeySet *keys, int split_phon_desc, int unsorted) |
|
int | spr_cwr_slex_dump (SprStream *fd, const SprCwrSLex *lex) |
|
const char * | spr_cwr_slex_get_word (int word_id, const SprCwrSLex *lex) |
| Convert the word id to the corresponding word string.
|
|
int | spr_cwr_is_html_marker (const char *word, int len) |
|
int | spr_cwr_get_word_id (SprCwrWordSet *word_set, const char *word, int len, const SprCwrSLex *lex, int notify_level) |
|
void | spr_cwr_word_set_print (SprStream *dest, const SprCwrWordSet *word, const SprCwrSLex *lex) |
| Print all words in a word_set.
|
|
int | spr_cwr_word_seq_decode (SprCwrWordSeq *word_seq, const char *word_str, const SprCwrSLex *lex, int notify_level) |
|
Handling of words in all cwr-routines.
- Used symbols
WORD_ID = (GRAPHEME,GRAPHEME_EXTENSION)
WORD_MODEL = (WORD_ID,PRONUNCIATION)
LM_WORD_ID = (GRAPHEME,GRAPHEME_EXTENSION')
WORD_CLASSES = (MEANING)
The GRAPHEME_EXTENSION can be used for several purposes. Examples are:
-
Male/female models.
-
OOV-models for different word lengths (nr. of syllables)
-
Different types of OOV-models (e.g. one for proper names, geographic names, ...)
-
Different words (PRONUNCIATION,MEANING) with the same GRAPHEME (e.g. bedelen, appel, ...).
- How are words handled
-
The recognizer
-
Creates WORD_ID hypotheses and sends them to the LM.
-
The LM translates them into the corresponding LM_WORD_ID's, which are extended with the WORD_CLASS information. The LM also checks for special WORD_CLASSES (e.g. SENTENCE_MARKER, FILLER_MODEL, ...)
-
Perplexity/tagging operations:
-
The input consists of the GRAPHEME's.
-
They are translated into (multiple) WORD_ID hypotheses by the (binary search) lexicon lookup algorithm. Multiple WORD_ID hypotheses occur when a word is given without its GRAPHEME_EXTENSION.
-
The work of the LM remains the same as during the recognition.
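The lookup step for perplexity/tagging operations can be sketched as follows. This is only an illustration of a binary search that returns every WORD_ID sharing a GRAPHEME when no GRAPHEME_EXTENSION is given. The `WordEntry` type and the `lookup_grapheme` function are hypothetical names introduced here; they do not reflect the real `SprCwrSLex` layout or the behaviour of `spr_cwr_get_word_id`. The sketch assumes the lexicon is sorted by GRAPHEME in `strcmp` order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one lexicon entry (NOT the real SprCwrSLex). */
typedef struct {
    const char *grapheme;   /* GRAPHEME */
    const char *extension;  /* GRAPHEME_EXTENSION, "" if none (assumed encoding) */
} WordEntry;

/* Find all entries whose GRAPHEME equals `grapheme` in a lexicon that is
 * sorted by grapheme. Stores the index of the first candidate in *first and
 * returns the number of matching entries (0 if the grapheme is absent). */
int lookup_grapheme(const WordEntry *lex, int nr_words,
                    const char *grapheme, int *first)
{
    int lo = 0, hi = nr_words, n = 0;

    assert(lex != NULL && grapheme != NULL && first != NULL);

    /* Binary search for the lower bound: first entry with grapheme >= key. */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (strcmp(lex[mid].grapheme, grapheme) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    *first = lo;

    /* Entries sharing a grapheme are contiguous; each one is a hypothesis. */
    while (lo + n < nr_words && strcmp(lex[lo + n].grapheme, grapheme) == 0)
        n++;
    return n;
}
```

With a lexicon holding two entries for "appel" that differ only in their extension, the call returns 2 and the caller receives both WORD_ID hypotheses as a contiguous index range.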
- How is the information stored
-
The LEXICON stores
-
The WORD_MODEL's (GRAPHEME,GRAPHEME_EXTENSION,PRONUNCIATION)
-
The GRAPHEME for the UNKNOWN_WORD.
-
The COUNT(LM)-file stores the LM-info, i.e.:
-
The CLASS-sequences and counts.
-
The LM_WORD_ID information (GRAPHEME,GRAPHEME_EXTENSION').
-
The LM_WORD_ID to WORD_CLASSES conversion table.
-
The LM_WORD_ID distribution for all appropriate CLASSES.
-
A list of special CLASSES with their appropriate flags.
- Remarks
-
Special (fixed) WORD_CLASSES and WORD_MODEL's
These classes are fixed, so they can be used at any time and by any program without having to read any resources (e.g. bootstrapping). For each special class, there is a corresponding special word. These words allow you to use the GRAPHEME's (both as input and as output), even if they do not occur in the lexicon.
-
SENTENCE_BEGIN
-
SENTENCE_END
-
UNKNOWN_WORD
Other WORD_CLASSES are defined for internal use only:
-
EMPTY_LM_SLOT
-
PARTIAL_WORD
-
Fall back routine for unknown words:
Replace by the UNKNOWN_WORD symbol and retry (may result in multiple WORD_MODEL candidates).
-
Handling sentence begin/end:
-
Only one symbol is provided (i.e. the SENTENCE_MARKER).
-
No statistics are recorded regarding cross-sentence effects.
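The fall-back routine for unknown words mentioned in the remarks above can be illustrated with a minimal sketch. The spelling `"<UNK>"` for the UNKNOWN_WORD grapheme, the flat string-array lexicon, and the function names `find_word` and `find_word_or_unknown` are all assumptions made for this example; the actual library resolves words through `spr_cwr_get_word_id` and may return several WORD_MODEL candidates, whereas this sketch returns only the first match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical spelling of the special word; the real string is not given. */
#define UNKNOWN_WORD "<UNK>"

/* Return the index of the first entry equal to `word`, or -1 if absent.
 * A linear scan keeps the sketch short. */
int find_word(const char *const *lex, int nr_words, const char *word)
{
    assert(lex != NULL && word != NULL);
    for (int i = 0; i < nr_words; i++)
        if (strcmp(lex[i], word) == 0)
            return i;
    return -1;
}

/* Look up `word`; if it is missing, replace it by UNKNOWN_WORD and retry.
 * Returns -1 only when the lexicon has no UNKNOWN_WORD entry either. */
int find_word_or_unknown(const char *const *lex, int nr_words, const char *word)
{
    int id = find_word(lex, nr_words, word);
    if (id < 0)
        id = find_word(lex, nr_words, UNKNOWN_WORD);
    return id;
}
```

Retrying through the same lookup path means an out-of-vocabulary word is treated exactly like an explicit occurrence of the UNKNOWN_WORD symbol in the input.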
- Author
- Kris Demuynck
- Date
- Oct 1996