Successor Variety (Cont.)
To illustrate the use of successor variety stemming, consider the example below where the task is to determine the stem of the word READABLE:
Test Word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE, READING, READS, RED, ROPE, RIPE
Using the complete word segmentation method, the test word “READABLE” will be segmented into “READ” and “ABLE,” since READ appears as a word in the corpus.
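The peak and plateau method would give the same result, because the successor variety of READ (3) is greater than that of the prefix before it (REA, 1) and the prefix after it (READA, 1).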
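The complete word method can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `complete_word_segments` is hypothetical, and it assumes a break is made after every proper prefix of the test word that is itself a word in the corpus.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def complete_word_segments(word, corpus):
    """Split `word` after each proper prefix that is a complete corpus word.

    Assumption: only proper prefixes are tested, so the test word itself
    (which may also be in the corpus) never triggers a break.
    """
    vocabulary = set(corpus)
    segments = []
    start = 0
    for end in range(1, len(word)):
        if word[:end] in vocabulary:
            segments.append(word[start:end])
            start = end
    segments.append(word[start:])  # whatever remains after the last break
    return segments


print(complete_word_segments("READABLE", CORPUS))  # ['READ', 'ABLE']
```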
Prefix      Successor Variety   Letters
R           3                   E, I, O
RE          2                   A, D
REA         1                   D
READ        3                   A, I, S
READA       1                   B
READAB      1                   L
READABL     1                   E
READABLE    1                   BLANK
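The table can be reproduced with a short Python sketch. The function names are hypothetical. One assumption is inferred from the table's counts rather than stated in the text: BLANK (end of word) is recorded only when no corpus word extends the prefix, which is why READ has variety 3 rather than 4. The peak and plateau rule is taken here to mean a break after any prefix whose successor variety exceeds that of both its neighbours.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def successors(prefix, corpus):
    """Return the sorted distinct letters that follow `prefix` in the corpus."""
    letters = {w[len(prefix)] for w in corpus
               if w.startswith(prefix) and len(w) > len(prefix)}
    # Assumption: BLANK is counted only when nothing extends the prefix.
    if not letters and prefix in corpus:
        return ["BLANK"]
    return sorted(letters)


def successor_variety_table(word, corpus):
    """Return one (prefix, successor variety, letters) row per prefix of `word`."""
    rows = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        letters = successors(prefix, corpus)
        rows.append((prefix, len(letters), letters))
    return rows


def peak_and_plateau_segments(word, corpus):
    """Break after each prefix whose variety exceeds both neighbouring values."""
    counts = [count for _, count, _ in successor_variety_table(word, corpus)]
    segments = []
    start = 0
    for i in range(1, len(counts) - 1):
        if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


for prefix, count, letters in successor_variety_table("READABLE", CORPUS):
    print(f"{prefix:<10}{count:<4}{', '.join(letters)}")

print(peak_and_plateau_segments("READABLE", CORPUS))  # ['READ', 'ABLE']
```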
After a word has been segmented, the segment to be used as the stem must be selected by using the following rule:
if (first segment occurs in ≤ 12 words in corpus)
    first segment is stem
else
    second segment is stem
The check on the number of occurrences is based on the observation that if a segment occurs in more than 12 words in the corpus, it is probably a prefix.
Using this rule, READ would be selected as the stem of READABLE, since READ occurs in only four words of the corpus (READ, READABLE, READING, READS).
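The selection rule translates directly into code. The sketch below is illustrative only: `select_stem` is a hypothetical name, and "occurs in" is assumed to mean that the segment appears anywhere inside a corpus word. The threshold of 12 comes from the rule above.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def select_stem(segments, corpus, threshold=12):
    """Pick the stem from a segmented word using the occurrence-count rule.

    Assumption: a segment "occurs in" a word if it is a substring of it.
    """
    first = segments[0]
    occurrences = sum(1 for w in corpus if first in w)
    # A very frequent first segment is probably a prefix, so fall back
    # to the second segment when one exists.
    if occurrences <= threshold or len(segments) < 2:
        return first
    return segments[1]


print(select_stem(["READ", "ABLE"], CORPUS))  # READ
```

With a small corpus like this one the threshold is never reached; lowering it (for example `threshold=3`) shows the other branch, where ABLE would be returned instead.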
In summary, the successor variety stemming process has three parts:
- Determine the successor varieties for a word,
- Use this information to segment the word using one of the methods described on the previous page, and
- Select one of the segments as the stem.
While affix removal works well, it requires human preparation of suffix lists and removal rules.
Successor variety requires no such preparation.