Handling of words in all cwr-routines.

Data Structures
struct	SprCwrSLex
	sorted lexicon

struct	SprCwrWordSet

struct	SprCwrWordSeq

Enumerations
enum	{ SPR_CWR_SENTENCE_BEGIN, SPR_CWR_SENTENCE_END, SPR_CWR_OTHER_HTML_MARKER, SPR_CWR_UNKNOWN_WORD, SPR_CWR_PARTIAL_WORD, SPR_CWR_EMPTY_LM_SLOT }

enum	{ SPR_CWR_SLEX_FREE_NON, SPR_CWR_SLEX_FREE_MAT, SPR_CWR_SLEX_FREE_VEC }

Functions
SprCwrSLex *	spr_cwr_slex_free (SprCwrSLex *lex, SprMsgId *routine)
	Free a set of lexicon names.

SprCwrSLex *	spr_cwr_slex_alloc (SprCwrSLex *lex, int nr_words, int wlen_tot)
	Allocate and/or initialise the set of lexicon names.

int	spr_cwr_sum_str_len (const SprCwrSLex *lex)

SprCwrSLex *	spr_cwr_slex_dup (SprCwrSLex *lex_dst, const SprCwrSLex *lex_src)

SprCwrSLex *	spr_cwr_slex1_create (SprCwrSLex *lex, int duplicate, const SprStrHashTbl *whash_main, const SprStrHashTbl *whash_ext)

SprCwrSLex *	spr_cwr_slex_read (SprCwrSLex *lex, const char *fname, SprKeySet *keys, int split_phon_desc, int unsorted)

int	spr_cwr_slex_dump (SprStream *fd, const SprCwrSLex *lex)

const char *	spr_cwr_slex_get_word (int word_id, const SprCwrSLex *lex)
	Convert the word id to the corresponding word string.

int	spr_cwr_is_html_marker (const char *word, int len)

int	spr_cwr_get_word_id (SprCwrWordSet *word_set, const char *word, int len, const SprCwrSLex *lex, int notify_level)

void	spr_cwr_word_set_print (SprStream *dest, const SprCwrWordSet *word, const SprCwrSLex *lex)
	Print all words in a word_set.

int	spr_cwr_word_seq_decode (SprCwrWordSeq *word_seq, const char *word_str, const SprCwrSLex *lex, int notify_level)

Variables
const char *const	spr_cwr_special_word_str [SPR_CWR_NR_SPECIAL_LM_CLASSES+1]
	see also get_word_id()

const SprCwrSLex	spr_cwr_empty_slex

const SprCwrWordSeq	spr_cwr_empty_word_seq

Detailed Description

Handling of words in all cwr-routines.

Used symbols

WORD_ID      = (GRAPHEME,GRAPHEME_EXTENSION)
WORD_MODEL   = (WORD_ID,PRONONCIATION)
LM_WORD_ID   = (GRAPHEME,GRAPHEME_EXTENSION')
WORD_CLASSES = (MEANING)

The GRAPHEME_EXTENSION can be used for several purposes. Examples are:

- Male/female models.
- OOV-models for different word lengths (number of syllables).
- Different types of OOV-models (e.g. one for proper names, one for geographic names, ...).
- Different words (PRONONCIATION,MEANING) with the same GRAPHEME (e.g. bedelen, appel, ...).
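
The symbol definitions above can be pictured as plain C records. The following is only a minimal sketch for illustration: the type names (`DemoWordId`, `DemoWordModel`), the helper functions, and the ordering (by GRAPHEME, then by GRAPHEME_EXTENSION) are assumptions made for this example and are not the structures used by the cwr-routines.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration types, NOT the actual cwr structures. */
typedef struct {
    const char *grapheme;   /* GRAPHEME: the written form, e.g. "appel"  */
    const char *extension;  /* GRAPHEME_EXTENSION: "" when there is none */
} DemoWordId;

typedef struct {
    DemoWordId  id;             /* WORD_ID                               */
    const char *pronunciation;  /* placeholder phone string (assumption) */
} DemoWordModel;

/* Two WORD_IDs share a GRAPHEME when only their extension may differ. */
static int demo_same_grapheme(const DemoWordId *a, const DemoWordId *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a->grapheme, b->grapheme) == 0;
}

/* qsort() comparator for DemoWordModel: order by GRAPHEME first and by
 * GRAPHEME_EXTENSION second, so that all variants of one GRAPHEME end up
 * next to each other (assumed ordering, chosen for this sketch). */
static int demo_word_model_cmp(const void *pa, const void *pb)
{
    const DemoWordModel *a = pa;
    const DemoWordModel *b = pb;
    int c = strcmp(a->id.grapheme, b->id.grapheme);
    return c != 0 ? c : strcmp(a->id.extension, b->id.extension);
}
```

Sorting an array of `DemoWordModel` with `qsort(models, n, sizeof models[0], demo_word_model_cmp)` groups homographs such as the two "appel" entries together, which is what makes a range lookup by GRAPHEME possible.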

How are words handled

The recognizer
- Creates WORD_ID hypotheses and sends them to the LM.
- The LM translates them into the corresponding LM_WORD_ID's, which are extended with the WORD_CLASS information. The LM also checks for special WORD_CLASSES (e.g. SENTENCE_MARKER, FILLER_MODEL, ...)
Perplexity/tagging operations:
- The input consists of the GRAPHEME's.
- They are translated into (multiple) WORD_ID hypotheses by the (binary search) lexicon lookup algorithm. Multiple WORD_ID hypotheses occur when a word is given without its GRAPHEME_EXTENSION.
- The work of the LM remains the same as during the recognition.
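
The binary-search lookup described above can be sketched as follows. This is an illustrative example under stated assumptions, not the implementation behind `spr_cwr_get_word_id()`: the `DemoEntry` type, the `demo_lex` table, and both functions are hypothetical, and the table is assumed to be sorted by GRAPHEME and then by GRAPHEME_EXTENSION.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lexicon entry; "" means "no GRAPHEME_EXTENSION". */
typedef struct {
    const char *grapheme;
    const char *extension;
} DemoEntry;

/* Toy lexicon, sorted by grapheme and then by extension (assumption). */
static const DemoEntry demo_lex[] = {
    { "appel",   "1" },
    { "appel",   "2" },
    { "bedelen", "1" },
    { "bedelen", "2" },
    { "zon",     ""  },
};

/* Binary search: index of the first entry whose grapheme is >= the query. */
static size_t demo_lower_bound(const DemoEntry *lex, size_t n,
                               const char *grapheme)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(lex[mid].grapheme, grapheme) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Returns the number of WORD_ID hypotheses for the query.
 * ext == NULL: every entry with this grapheme is a hypothesis; they occupy
 *              the index range [*first, *first + count).
 * ext != NULL: at most one entry matches; *first is its index. */
static size_t demo_lookup(const DemoEntry *lex, size_t n,
                          const char *grapheme, const char *ext,
                          size_t *first)
{
    assert(lex != NULL && grapheme != NULL && first != NULL);
    size_t i = demo_lower_bound(lex, n, grapheme);
    size_t count = 0;
    *first = i;
    for (; i < n && strcmp(lex[i].grapheme, grapheme) == 0; i++) {
        if (ext == NULL) {
            count++;                 /* each extension is one hypothesis */
        } else if (strcmp(lex[i].extension, ext) == 0) {
            *first = i;              /* fully specified WORD_ID found    */
            return 1;
        }
    }
    return count;
}
```

In this sketch a query for "appel" without an extension yields two hypotheses, while a query for "bedelen" with extension "2" yields exactly one. A return value of 0 means the word is not in the lexicon.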

How is the information stored

The LEXICON stores
- The WORD_MODEL's (GRAPHEME,GRAPHEME_EXTENSION,PRONONCIATION)
- The GRAPHEME for the UNKNOWN_WORD.
The COUNT(LM)-file stores the LM-info, i.e.:
- The CLASS-sequences and counts.
- The LM_WORD_ID information (GRAPHEME,GRAPHEME_EXTENSION').
- The LM_WORD_ID to WORD_CLASSES conversion table.
- The LM_WORD_ID distribution for all appropriate CLASSES.
- A list of special CLASSES with their appropriate flags.

Remarks

Special (fixed) WORD_CLASSES and WORD_MODEL's
These classes are fixed, so they can be used at any time and by any program without having to read any resources (e.g. bootstrapping). For each special class, there is a corresponding special word. These words allow you to use the GRAPHEME's (as input or as output), even if they do not occur in the lexicon.
- SENTENCE_BEGIN
- SENTENCE_END
- UNKNOWN_WORD
Other WORD_CLASSES are defined for internal use only:
- EMPTY_LM_SLOT
- PARTIAL_WORD
Fall back routine for unknown words:
Replace by the UNKNOWN_WORD symbol and retry (may result in multiple WORD_MODEL candidates).
Handling sentence begin/end:
- Only one symbol is provided (i.e. the SENTENCE_MARKER).
- No statistics are recorded regarding cross-sentence effects.
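
The fall-back routine for unknown words mentioned in the remarks above can be illustrated with a small sketch. Everything here is an assumption made for the example: the `"<UNK>"` grapheme, the `DemoEntry` type, the toy lexicon with two OOV-models, and the function names. The real special word strings are those in `spr_cwr_special_word_str`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_UNKNOWN_WORD "<UNK>"   /* assumed grapheme for UNKNOWN_WORD */

typedef struct {
    const char *grapheme;
    const char *extension;
} DemoEntry;

/* Toy lexicon sorted by grapheme, then extension; it holds two OOV-models
 * for the unknown word (e.g. one per word length). Purely illustrative. */
static const DemoEntry demo_lex[] = {
    { "<UNK>", "long"  },
    { "<UNK>", "short" },
    { "appel", "1"     },
    { "appel", "2"     },
    { "zon",   ""      },
};

/* Count the entries with the given grapheme; *first is the start index. */
static size_t demo_find(const DemoEntry *lex, size_t n,
                        const char *grapheme, size_t *first)
{
    size_t lo = 0, hi = n, count = 0;
    while (lo < hi) {                       /* binary search: lower bound */
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(lex[mid].grapheme, grapheme) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    *first = lo;
    while (lo < n && strcmp(lex[lo].grapheme, grapheme) == 0) {
        lo++;
        count++;
    }
    return count;
}

/* Look the word up; if it is not in the lexicon, replace it by the
 * UNKNOWN_WORD symbol and retry. *used_fallback reports which case applied. */
static size_t demo_find_with_fallback(const DemoEntry *lex, size_t n,
                                      const char *grapheme, size_t *first,
                                      int *used_fallback)
{
    assert(lex != NULL && grapheme != NULL);
    assert(first != NULL && used_fallback != NULL);
    size_t count = demo_find(lex, n, grapheme, first);
    *used_fallback = 0;
    if (count == 0 && strcmp(grapheme, DEMO_UNKNOWN_WORD) != 0) {
        count = demo_find(lex, n, DEMO_UNKNOWN_WORD, first);
        *used_fallback = 1;
    }
    return count;
}
```

With this toy lexicon, looking up the out-of-vocabulary word "fiets" falls back to the unknown-word symbol and returns both OOV-models as candidates, which matches the remark that the retry may result in multiple WORD_MODEL candidates.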

Author: Kris Demuynck

Date: Oct 1996
