Successor Variety (Cont.)
To illustrate the use of successor variety stemming, consider the example below where the task is to determine the stem of the word READABLE:
Test Word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE, READING, READS, RED, ROPE, RIPE
Using the complete word segmentation method, the test word “READABLE” will be segmented into “READ” and “ABLE,” since READ appears as a word in the corpus.
The peak and plateau method would give the same result, since the successor variety of READ (3) exceeds that of both REA (1) and READA (1).
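The complete word method can be sketched in a few lines of Python. This is only an illustration: the function name `complete_word_segment` is hypothetical, and breaking at the shortest qualifying prefix is an assumption, since the text does not say which prefix to use when several are corpus words.

```python
# Corpus from the example above.
CORPUS = {"ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"}


def complete_word_segment(word, corpus):
    """Break `word` after a proper prefix that is itself a corpus word.

    Assumption: the shortest such prefix is used. If no proper prefix
    is a corpus word, the word is returned unsegmented.
    """
    for i in range(1, len(word)):
        if word[:i] in corpus:
            return [word[:i], word[i:]]
    return [word]


print(complete_word_segment("READABLE", CORPUS))
```

For READABLE, the prefixes R, RE and REA are not corpus words, so the first break is made after READ.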
| Prefix | Successor Variety | Letters |
|---|---|---|
| R | 3 | E, I, O |
| RE | 2 | A, D |
| REA | 1 | D |
| READ | 3 | A, I, S |
| READA | 1 | B |
| READAB | 1 | L |
| READABL | 1 | E |
| READABLE | 1 | BLANK |
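The successor varieties in the table can be reproduced with a short Python sketch. The function names are hypothetical, and two assumptions are made: end-of-word is counted as BLANK only when no letter follows the prefix (which is what the table shows for READ and READABLE), and the peak and plateau method cuts after a prefix whose successor variety exceeds that of both neighbouring prefixes.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def successors(prefix, corpus):
    """Distinct letters that follow `prefix` in the corpus words."""
    letters = {w[len(prefix)] for w in corpus
               if w.startswith(prefix) and len(w) > len(prefix)}
    # Assumption: BLANK (end of word) is counted only when nothing else
    # follows the prefix, matching the table in the text.
    if not letters and prefix in corpus:
        letters = {"BLANK"}
    return sorted(letters)


def successor_varieties(word, corpus):
    """Return (prefix, successor letters) for every prefix of `word`."""
    return [(word[:i], successors(word[:i], corpus))
            for i in range(1, len(word) + 1)]


def peak_and_plateau(word, corpus):
    """Cut after each prefix whose variety exceeds both neighbours'."""
    counts = [len(s) for _, s in successor_varieties(word, corpus)]
    cuts = [i + 1 for i in range(1, len(counts) - 1)
            if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]]
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]


for prefix, letters in successor_varieties("READABLE", CORPUS):
    print(prefix, len(letters), ",".join(letters))
print(peak_and_plateau("READABLE", CORPUS))
```

The only interior peak for READABLE is at READ, so the word is cut into READ and ABLE.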
After a word has been segmented, the segment to be used as the stem must be selected by using the following rule:
if (first segment occurs in ≤ 12 words in corpus)
    first segment is stem
else
    second segment is stem
The check on the number of occurrences is based on the observation that if a segment occurs in more than 12 words in the corpus, it is probably a prefix.
Using this rule, READ would be selected as the stem of READABLE, since READ occurs in only four words of the corpus (READ, READABLE, READING, READS).
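The selection rule can be sketched as follows. The function name `select_stem` and the `threshold` parameter are hypothetical, and "occurs in" is interpreted here as substring containment, which is an assumption; the text does not say whether the segment must appear at the start of the word.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def select_stem(segments, corpus, threshold=12):
    """Pick the stem from a segmented word using the occurrence rule.

    Assumption: a segment "occurs in" a word if it is a substring of it.
    """
    first = segments[0]
    occurrences = sum(1 for w in corpus if first in w)
    # A first segment seen in more than `threshold` words is treated as
    # a probable prefix, so the second segment is taken instead.
    if occurrences <= threshold or len(segments) == 1:
        return first
    return segments[1]


print(select_stem(["READ", "ABLE"], CORPUS))
```

The `threshold` argument is exposed only so the else branch can be exercised on a corpus this small; the rule in the text fixes it at 12.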
In summary, the successor variety stemming process has three parts:
- Determine the successor varieties for a word,
- Use this information to segment the word using one of the methods described on the previous page, and
- Select one of the segments as the stem.
While affix removal works well, it requires human preparation of suffix lists and removal rules.
Successor variety requires no such preparation.