SPRAAK
cwr_word.c File Reference

Handling of words in all cwr-routines.

Data Structures

struct  SprCwrSLex
 sorted lexicon
 
struct  SprCwrWordSet
 
struct  SprCwrWordSeq
 

Enumerations

enum  {
  SPR_CWR_SENTENCE_BEGIN, SPR_CWR_SENTENCE_END, SPR_CWR_OTHER_HTML_MARKER, SPR_CWR_UNKNOWN_WORD,
  SPR_CWR_PARTIAL_WORD, SPR_CWR_EMPTY_LM_SLOT
}
 
enum  { SPR_CWR_SLEX_FREE_NON, SPR_CWR_SLEX_FREE_MAT, SPR_CWR_SLEX_FREE_VEC }
 

Functions

SprCwrSLex spr_cwr_slex_free (SprCwrSLex *lex, SprMsgId *routine)
 Free a set of lexicon names.
 
SprCwrSLex spr_cwr_slex_alloc (SprCwrSLex *lex, int nr_words, int wlen_tot)
 Allocate and/or initialise the set of lexicon names.
 
int spr_cwr_sum_str_len (const SprCwrSLex *lex)
 
SprCwrSLex spr_cwr_slex_dup (SprCwrSLex *lex_dst, const SprCwrSLex *lex_src)
 
SprCwrSLex spr_cwr_slex1_create (SprCwrSLex *lex, int duplicate, const SprStrHashTbl *whash_main, const SprStrHashTbl *whash_ext)
 
SprCwrSLex spr_cwr_slex_read (SprCwrSLex *lex, const char *fname, SprKeySet *keys, int split_phon_desc, int unsorted)
 
int spr_cwr_slex_dump (SprStream *fd, const SprCwrSLex *lex)
 
const char * spr_cwr_slex_get_word (int word_id, const SprCwrSLex *lex)
 Convert the word id to the corresponding word string.
 
int spr_cwr_is_html_marker (const char *word, int len)
 
int spr_cwr_get_word_id (SprCwrWordSet *word_set, const char *word, int len, const SprCwrSLex *lex, int notify_level)
 
void spr_cwr_word_set_print (SprStream *dest, const SprCwrWordSet *word, const SprCwrSLex *lex)
 Print all words in a word_set.
 
int spr_cwr_word_seq_decode (SprCwrWordSeq *word_seq, const char *word_str, const SprCwrSLex *lex, int notify_level)
 

Variables

const char *const spr_cwr_special_word_str [SPR_CWR_NR_SPECIAL_LM_CLASSES+1]
 see also get_word_id()
 
const SprCwrSLex spr_cwr_empty_slex
 
const SprCwrWordSeq spr_cwr_empty_word_seq
 

Detailed Description

Handling of words in all cwr-routines.

Used symbols
WORD_ID      = (GRAPHEME,GRAPHEME_EXTENSION)
WORD_MODEL   = (WORD_ID,PRONONCIATION)
LM_WORD_ID   = (GRAPHEME,GRAPHEME_EXTENSION')
WORD_CLASSES = (MEANING)
The GRAPHEME_EXTENSION can be used for several purposes. Examples are:
  • Male/female models.
  • OOV-models for different word lengths (number of syllables).
  • Different types of OOV-models (e.g. one for proper names, geographic names, ...)
  • Different words (PRONONCIATION,MEANING) with the same GRAPHEME (e.g. bedelen, appel, ...).
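The (GRAPHEME,GRAPHEME_EXTENSION) pair can be illustrated with a minimal sketch. It assumes, purely for illustration, that a word is written as the grapheme followed by a '/' and the extension; the separator, the WordParts type and split_word() are hypothetical and not part of this file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical separator between GRAPHEME and GRAPHEME_EXTENSION;
   the real lexicon format may differ. */
#define EXT_SEP '/'

/* WORD_ID = (GRAPHEME,GRAPHEME_EXTENSION), stored as views into one string. */
typedef struct {
  const char *grapheme;  /* start of the grapheme (not NUL-terminated) */
  size_t      glen;      /* length of the grapheme in bytes            */
  const char *ext;       /* extension, or NULL when none was given     */
} WordParts;

/* Split e.g. "appel/1" into the grapheme "appel" and the extension "1". */
static WordParts split_word(const char *word)
{
  WordParts p;
  const char *sep;

  assert(word != NULL);
  sep = strchr(word, EXT_SEP);
  p.grapheme = word;
  p.glen = sep ? (size_t)(sep - word) : strlen(word);
  p.ext = sep ? sep + 1 : NULL;
  return p;
}
```

A word without a separator yields ext == NULL, which corresponds to a word given without its GRAPHEME_EXTENSION.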
How are words handled
  1. The recognizer
    • Creates WORD_ID hypotheses and sends them to the LM.
    • The LM translates them into the corresponding LM_WORD_ID's, which are extended with the WORD_CLASS information. The LM also checks for special WORD_CLASSES (e.g. SENTENCE_MARKER, FILLER_MODEL, ...).
  2. Perplexity/tagging operations:
    • The input consists of the GRAPHEME's.
    • They are translated into (multiple) WORD_ID hypotheses by the (binary search) lexicon lookup algorithm. Multiple WORD_ID hypotheses occur when a word is given without its GRAPHEME_EXTENSION.
    • The work of the LM remains the same as during the recognition.
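The lookup from a GRAPHEME to (multiple) WORD_ID hypotheses can be sketched as a lower-bound binary search followed by a short forward scan. This is a sketch under stated assumptions, not the code in cwr_word.c: the LexEntry type and lex_lookup() are hypothetical, the lexicon is assumed to be sorted on the grapheme in strcmp() order, and the index of an entry stands in for its WORD_ID.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical lexicon entry: one (GRAPHEME,GRAPHEME_EXTENSION) pair. */
typedef struct {
  const char *grapheme;
  const char *ext;       /* "" when the entry has no extension */
} LexEntry;

/* Collect the indices of all entries that match `graph`.
   lex must be sorted on grapheme in strcmp() order.
   ext == NULL means "no extension given": every extension matches.
   At most max_ids indices are stored; the total match count is returned. */
static int lex_lookup(const LexEntry *lex, int n, const char *graph,
                      const char *ext, int *ids, int max_ids)
{
  int lo = 0, hi = n, cnt = 0;

  assert(lex != NULL && graph != NULL && ids != NULL);

  /* Lower bound: first entry whose grapheme is >= graph. */
  while (lo < hi) {
    int mid = lo + (hi - lo) / 2;
    if (strcmp(lex[mid].grapheme, graph) < 0)
      lo = mid + 1;
    else
      hi = mid;
  }

  /* All entries with the same grapheme are adjacent: scan them. */
  for (; lo < n && strcmp(lex[lo].grapheme, graph) == 0; lo++) {
    if (ext != NULL && strcmp(lex[lo].ext, ext) != 0)
      continue;                       /* extension given but different */
    if (cnt < max_ids)
      ids[cnt] = lo;
    cnt++;
  }
  return cnt;
}
```

With the entries ("appel","1"), ("appel","2") and ("appelboom",""), looking up "appel" without an extension returns both "appel" entries, whereas looking up "appel" with the extension "2" returns only the second one.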
How is the information stored
  1. The LEXICON stores
    • The WORD_MODEL's (GRAPHEME,GRAPHEME_EXTENSION,PRONONCIATION)
    • The GRAPHEME for the UNKNOWN_WORD.
  2. The COUNT(LM)-file stores the LM-info, i.e.:
    • The CLASS-sequences and counts.
    • The LM_WORD_ID information (GRAPHEME,GRAPHEME_EXTENSION').
    • The LM_WORD_ID to WORD_CLASSES conversion table.
    • The LM_WORD_ID distribution for all appropriate CLASSES.
    • A list of special CLASSES with their appropriate flags.
Remarks
  1. Special (fixed) WORD_CLASSES and WORD_MODEL's
    These classes are fixed, so they can be used at any time and by any program without having to read any resources (e.g. bootstrapping). For each special class, there is a corresponding special word. These words allow you to use the GRAPHEME's (both as input and as output), even if they do not occur in the lexicon.
    • SENTENCE_BEGIN
    • SENTENCE_END
    • UNKNOWN_WORD
    Other WORD_CLASSES are defined for internal use only:
    • EMPTY_LM_SLOT
    • PARTIAL_WORD
  2. Fallback routine for unknown words:
    Replace by the UNKNOWN_WORD symbol and retry (may result in multiple WORD_MODEL candidates).
  3. Handling sentence begin/end:
    • Only one symbol is provided (i.e. the SENTENCE_MARKER).
    • No statistics are recorded regarding cross-sentence effects.
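The fallback for unknown words described in remark 2 can be sketched as follows. The "<UNK>" string, find_all() and lookup_with_fallback() are hypothetical names used only for illustration; the real UNKNOWN_WORD grapheme is the one stored in the lexicon, and a linear scan stands in for the binary search to keep the sketch short.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical grapheme of the UNKNOWN_WORD symbol. */
#define UNK_GRAPHEME "<UNK>"

/* Collect the indices of all lexicon entries spelled exactly `graph`.
   Entries sharing a grapheme differ only in their extension (not shown).
   At most max_ids indices are stored; the total match count is returned. */
static int find_all(const char *const *lex, int n, const char *graph,
                    int *ids, int max_ids)
{
  int i, cnt = 0;

  for (i = 0; i < n; i++) {
    if (strcmp(lex[i], graph) == 0) {
      if (cnt < max_ids)
        ids[cnt] = i;
      cnt++;
    }
  }
  return cnt;
}

/* Look up a word; when it is not found, replace it by the UNKNOWN_WORD
   symbol and retry. *was_unknown is set to 1 when the fallback was used. */
static int lookup_with_fallback(const char *const *lex, int n,
                                const char *graph, int *ids, int max_ids,
                                int *was_unknown)
{
  int cnt;

  assert(lex != NULL && graph != NULL && ids != NULL && was_unknown != NULL);
  cnt = find_all(lex, n, graph, ids, max_ids);
  *was_unknown = 0;
  if (cnt == 0) {
    cnt = find_all(lex, n, UNK_GRAPHEME, ids, max_ids);
    *was_unknown = 1;
  }
  return cnt;
}
```

When the lexicon holds several UNKNOWN_WORD models (e.g. one per OOV type), the retry returns more than one candidate, which matches the "multiple WORD_MODEL candidates" case mentioned in remark 2.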
Author
Kris Demuynck
Date
Oct 1996