Slide 6.9: Successor variety

Successor Variety

Successor variety stemmers are based on work in structural linguistics that attempted to determine word and morpheme boundaries from the distribution of phonemes in a large body of utterances. The successor variety of a string is defined as follows:

Let α be a word of length n, and let α_i be the prefix of α of length i.
Let D be the corpus of words.
D_αi is defined as the subset of D containing those terms whose first i letters match α_i exactly.

The successor variety of α_i, denoted S_αi, is then defined as the number of distinct letters that occupy the (i+1)st position of words in D_αi. A test word of length n has n successor varieties S_α1, S_α2, ..., S_αn.

In less formal terms, the successor variety of a string is the number of different characters that follow it in words in some body of text. Consider a body of text consisting of the following words, for example.

   able, axle, accident, ape, about.

To determine the successor varieties for “apple,” the following process would be used.

The first letter of apple is ‘a.’ ‘a’ is followed in the text body by four distinct characters: ‘b,’ ‘x,’ ‘c,’ and ‘p’ (‘b’ occurs twice, in “able” and “about,” but is counted once). Thus, the successor variety of ‘a’ is four.

The next successor variety for apple would be one, since only ‘e’ follows “ap” in the text body, and so on.
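
The worked example above can be reproduced with a minimal Python sketch. The function name successor_varieties is hypothetical, and the sketch assumes that a corpus word ending exactly at the prefix contributes no successor letter; some formulations may count the end of a word as a successor instead.

```python
def successor_varieties(word, corpus):
    """Return [S_1, ..., S_n] for `word`, measured against `corpus`.

    Assumption: a corpus word that ends exactly at the prefix
    contributes no successor letter.
    """
    varieties = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # Distinct letters in position i+1 of corpus words that start with the prefix.
        successors = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        varieties.append(len(successors))
    return varieties


corpus = ["able", "axle", "accident", "ape", "about"]
print(successor_varieties("apple", corpus))  # prints [4, 1, 0, 0, 0]
```

The first two values match the hand computation (four for ‘a,’ one for “ap”). The remaining values are zero under the stated assumption, because no word in this small text body begins with “app.”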

When this process is carried out on a large body of text, the successor variety of a term's prefixes tends to decrease as more characters are added, until a segment boundary is reached. At that point, the successor variety increases sharply. This information is used to identify stems.
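
One way to act on the sharp increase is to cut the word wherever a prefix's successor variety is higher than that of both its neighbours. The sketch below is only an illustration of that idea under stated assumptions: the local-peak rule, the function names, and the small corpus are all chosen for the example, and end-of-word is not counted as a successor.

```python
def successor_varieties(word, corpus):
    """Return [S_1, ..., S_n] for `word`; end-of-word is not counted (assumption)."""
    return [
        len({w[i] for w in corpus if w.startswith(word[:i]) and len(w) > i})
        for i in range(1, len(word) + 1)
    ]


def segment_at_peaks(word, corpus):
    """Split `word` after every prefix whose successor variety is a local peak."""
    v = successor_varieties(word, corpus)
    # Prefix length i is a cut point when S_i exceeds both S_(i-1) and S_(i+1).
    cuts = [i for i in range(2, len(word)) if v[i - 1] > v[i - 2] and v[i - 1] > v[i]]
    segments, start = [], 0
    for cut in cuts:
        segments.append(word[start:cut])
        start = cut
    segments.append(word[start:])
    return segments


# Hypothetical corpus chosen for illustration.
corpus = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]
print(successor_varieties("readable", corpus))  # prints [3, 2, 1, 3, 1, 1, 1, 0]
print(segment_at_peaks("readable", corpus))     # prints ['read', 'able']
```

The variety falls from 3 to 1 across “r,” “re,” “rea,” then jumps back to 3 at “read,” which is where the sketch places the boundary. Deciding which segment to keep as the stem requires a further rule that is not covered here.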