Successor Variety


Successor variety stemmers are based on work in structural linguistics that attempted to determine word and morpheme boundaries from the distribution of phonemes in a large body of utterances. Successor variety is defined as follows:
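Let α be a word of length n, and let αi be the prefix of α of length i. Let D be the corpus of words, and let Dαi be the subset of D containing those words whose first i letters match αi exactly. The successor variety of αi, denoted Sαi, is then defined as the number of distinct letters that occupy position i+1 of the words in Dαi. A test word of length n has n successor varieties Sα1, Sα2, ..., Sαn.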
In less formal terms, the successor variety of a string is the number of different characters that follow it in words in some body of text. Consider, for example, a body of text consisting of the following words:
   able, axle, accident, ape, about.
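To determine the successor varieties for “apple,” the following process would be used. The first letter of “apple” is “a,” which is followed in the text body by four distinct characters: “b,” “x,” “c,” and “p.” Thus, the successor variety of “a” is four. The next successor variety is one, since only “e” follows “ap” in the text body, and so on.
When this process is carried out using a large body of text, the successor variety of substrings of a term will decrease as more characters are added until a segment boundary is reached. At this point, the successor variety will sharply increase. This information is used to identify stems.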
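The counting process for “apple” can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name successor_varieties is hypothetical, the corpus is the five-word list above, and the end of a word is not counted as a successor (some formulations may treat it differently).

```python
from typing import Iterable, List


def successor_varieties(word: str, corpus: Iterable[str]) -> List[int]:
    """Return one successor variety per prefix of `word` (lengths 1..n)."""
    vocabulary = set(corpus)  # only distinct words matter
    varieties = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # Collect the distinct letters found at position i+1 of every
        # corpus word that starts with this prefix.
        # Assumption: a word that ends exactly at the prefix adds nothing.
        successors = {
            w[i] for w in vocabulary
            if len(w) > i and w.startswith(prefix)
        }
        varieties.append(len(successors))
    return varieties


corpus = ["able", "axle", "accident", "ape", "about"]
print(successor_varieties("apple", corpus))  # [4, 1, 0, 0, 0]
```

With so small a corpus the values fall to zero as soon as no corpus word shares the prefix. The sharp increase at a segment boundary only shows up when the counts are taken over a large body of text.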



