Successor Variety (Cont.)


To illustrate the use of successor variety stemming, consider the example below where the task is to determine the stem of the word READABLE:
   Test Word: READABLE
   Corpus:    ABLE, APE, BEATABLE, FIXABLE, READ, READABLE,
              READING, READS, RED, ROPE, RIPE
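Using the complete word segmentation method, the test word READABLE will be segmented into READ and ABLE, since READ appears as a word in the corpus. The peak and plateau method would give the same result, because the successor variety at READ (3) is higher than that of the prefix before it, REA (1), and the prefix after it, READA (1).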
   Prefix      Successor Variety   Letters
   R           3                   E, I, O
   RE          2                   A, D
   REA         1                   D
   READ        3                   A, I, S
   READA       1                   B
   READAB      1                   L
   READABL     1                   E
   READABLE    1                   BLANK
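The successor varieties in the table can be recomputed with a short Python sketch. The function names (`successor_letters`, `successor_table`) are hypothetical, chosen only for this illustration. The sketch assumes that only actual letters count as successors, so it reports 0 for the full word READABLE where the table records 1 (BLANK).

```python
# Corpus from the example above.
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def successor_letters(prefix, corpus):
    """Return the sorted distinct letters that follow `prefix` in corpus words.

    Assumption: end-of-word (BLANK) is not counted as a successor.
    """
    n = len(prefix)
    return sorted({w[n] for w in corpus if w.startswith(prefix) and len(w) > n})

def successor_table(word, corpus):
    """Return a (prefix, successor variety, letters) row for each prefix of `word`."""
    rows = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        letters = successor_letters(prefix, corpus)
        rows.append((prefix, len(letters), letters))
    return rows

for prefix, variety, letters in successor_table("READABLE", CORPUS):
    print(f"{prefix:<10}{variety:<3}{','.join(letters)}")
```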

After a word has been segmented, the segment to be used as the stem must be selected by using the following rule:
   if (first segment occurs in ≤ 12 words in corpus)
       first segment is stem
   else
       second segment is stem
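The selection rule above can be sketched in Python as follows. The names `count_words_containing` and `select_stem` are hypothetical, and the sketch assumes that "occurs in" means the segment appears anywhere inside a corpus word; the slide does not say whether a prefix-only match is intended.

```python
# Corpus from the READABLE example.
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def count_words_containing(segment, corpus):
    """Count the corpus words that contain `segment` (assumed: substring match)."""
    return sum(1 for word in corpus if segment in word)

def select_stem(first, second, corpus, threshold=12):
    """Pick the first segment unless it is frequent enough to look like a prefix."""
    if count_words_containing(first, corpus) <= threshold:
        return first
    return second

print(select_stem("READ", "ABLE", CORPUS))
```

With the example corpus, READ is found in four words, which is under the threshold of 12, so the first segment is returned.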
The check on the number of occurrences is based on the observation that a segment occurring in more than 12 words in the corpus is probably a prefix. In this example, READ occurs in only four corpus words (READ, READABLE, READING, READS), so the rule selects READ as the stem of READABLE.
In summary, the successor variety stemming process has three parts:
  1. Determine the successor varieties for a word,
  2. Use this information to segment the word using one of the methods mentioned on the previous page, and
  3. Select one of the segments as the stem.
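The three parts can be combined into one minimal Python sketch. It uses the peak and plateau method for step 2, under the assumption that a break is made after a prefix whose successor variety exceeds that of both the prefix before it and the prefix after it, and that only the first such peak is used. All function names are hypothetical, end-of-word is not counted as a successor, and "occurs in" is taken to mean a substring match.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def successor_variety(prefix, corpus):
    """Step 1: number of distinct letters that follow `prefix` in the corpus."""
    n = len(prefix)
    return len({w[n] for w in corpus if w.startswith(prefix) and len(w) > n})

def peak_and_plateau_split(word, corpus):
    """Step 2: split after the first prefix whose variety is a local peak."""
    varieties = [successor_variety(word[:i], corpus)
                 for i in range(1, len(word) + 1)]
    # The first and last prefixes lack a neighbour on one side, so skip them.
    for i in range(1, len(varieties) - 1):
        if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]:
            return word[:i + 1], word[i + 1:]
    return word, ""  # no peak found: leave the word unsegmented

def stem(word, corpus, threshold=12):
    """Step 3: choose the first segment unless it is probably a prefix."""
    first, second = peak_and_plateau_split(word, corpus)
    if not second:
        return first
    occurrences = sum(1 for w in corpus if first in w)
    return first if occurrences <= threshold else second

print(stem("READABLE", CORPUS))
```

Nothing in this sketch depends on a suffix list; the split point comes entirely from the letter statistics of the corpus.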
While affix removal works well, it requires human preparation of suffix lists and removal rules. Successor variety stemming requires no such preparation, because its segmentation is derived from the corpus itself.