Successor Variety
Successor variety stemmers are based on work in structural linguistics which attempted to determine word and morpheme boundaries based on the distribution of phonemes in a large body of utterances.
Successor variety is defined formally as follows:
- Let α be a word of length n, and let αi be the prefix of α of length i.
- Let D be the corpus of words.
- Dαi is defined as the subset of D containing those terms whose first i letters match αi exactly.
The successor variety of αi, denoted Sαi, is then defined as the number of distinct letters that occupy position i+1 of the words in Dαi.
A test word of length n has n successor varieties: Sα1, Sα2, ..., Sαn.
In less formal terms, the successor variety of a string is the number of different characters that follow it in words in some body of text.
Consider, for example, a body of text consisting of the following words:
able, axle, accident, ape, about.
To determine the successor varieties for “apple,” the following process would be used.
- The first letter of apple is ‘a.’ In the text body, ‘a’ is followed by four distinct characters: ‘b,’ ‘x,’ ‘c,’ and ‘p.’
Thus, the successor variety of ‘a’ is four.
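- The next successor variety for apple is one, since only ‘e’ follows “ap” in the text body. The process continues for “app,” “appl,” and “apple,” each of which has a successor variety of zero because no word in this small text body begins with “app.”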
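The process above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the function name successor_varieties is made up for this example, and it assumes that the end of a word does not count as a successor character (some formulations count it).

```python
def successor_varieties(word, corpus):
    """Return the list [S1, S2, ..., Sn] for `word` against `corpus`.

    Assumption: the end of a word is NOT counted as a successor.
    """
    varieties = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # Distinct letters in position i+1 of corpus words that start with the prefix.
        successors = {w[i] for w in corpus if len(w) > i and w.startswith(prefix)}
        varieties.append(len(successors))
    return varieties


corpus = ["able", "axle", "accident", "ape", "about"]
print(successor_varieties("apple", corpus))  # [4, 1, 0, 0, 0]
```

The first two values match the hand calculation: four for ‘a’ and one for “ap.” The trailing zeros reflect only how small this text body is.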
When this process is carried out using a large body of text, the successor variety of successively longer prefixes of a term tends to decrease as more characters are added, until a segment boundary is reached.
At this point, the successor variety will sharply increase.
This information is used to identify stems.
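One plausible way to turn this observation into a segmentation step is to cut a word wherever a successor variety is larger than both of its neighbors. The sketch below makes several assumptions that the text above does not specify: the local-peak rule itself, the function names, the convention that the end of a word is not counted as a successor, and the small word list, which is a made-up toy corpus chosen only so that a peak is visible.

```python
def successor_varieties(word, corpus):
    """Successor variety of every prefix of `word` (end of word not counted)."""
    return [
        len({w[i] for w in corpus if len(w) > i and w.startswith(word[:i])})
        for i in range(1, len(word) + 1)
    ]


def segment_at_peaks(word, corpus):
    """Split `word` after each prefix whose successor variety is a local peak.

    Assumption: a boundary is a prefix whose variety exceeds the variety of
    both the shorter and the longer neighboring prefix.
    """
    sv = successor_varieties(word, corpus)
    # sv[k] belongs to the prefix of length k + 1, so the cut index is k + 1.
    cuts = [k + 1 for k in range(1, len(sv) - 1) if sv[k - 1] < sv[k] > sv[k + 1]]
    segments, start = [], 0
    for cut in cuts:
        segments.append(word[start:cut])
        start = cut
    segments.append(word[start:])
    return segments


# Hypothetical toy corpus, for illustration only.
corpus = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]
print(successor_varieties("readable", corpus))  # [3, 2, 1, 3, 1, 1, 1, 0]
print(segment_at_peaks("readable", corpus))     # ['read', 'able']
```

Here the variety falls from 3 to 1 over “r,” “re,” “rea,” then jumps back to 3 at “read,” which is where the sketch places the boundary. Choosing which segment to keep as the stem is a separate decision that this sketch does not attempt.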