Performing Stemming


One technique for improving retrieval effectiveness and reducing the size of index files is to give searchers ways of finding morphological variants of search terms. If, for example, a searcher enters the term stemming as part of a query, he or she is likely to be interested in variants such as stemmed and stem as well. Because a single stem typically corresponds to several full terms, storing stems instead of terms can achieve compression factors of over 50 percent.
Stemming, also called conflation, is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form.
The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
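
As a rough illustration of the two points above (related words map to one stem, and the stem need not be a valid word), here is a minimal Python sketch of a toy suffix stripper. It is not the Porter stemmer or any other published algorithm; the function name `toy_stem`, the suffix list, and the minimum stem length are all assumptions made for this example.

```python
# Toy suffix-stripping stemmer, for illustration only.
# SUFFIXES and the length thresholds are hypothetical choices.
SUFFIXES = ("ing", "ed", "s")  # checked in this order, longest first


def toy_stem(word: str) -> str:
    """Strip one common suffix and undo a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least three remaining characters so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling left by the suffix: "stemm" -> "stem".
            if len(stem) >= 4 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


terms = ["stem", "stems", "stemmed", "stemming", "computed", "computing"]
stems = {toy_stem(t) for t in terms}
print(sorted(stems))  # ['comput', 'stem']
```

In this small sample, six full terms collapse to two stems, which is the effect that shrinks an index. Note that "comput" is not a valid word, yet it still serves to group "computed" and "computing" together.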

Stemming can be performed manually, using regular expressions, or automatically, using programs called stemmers. The figure shows a taxonomy of stemming algorithms that includes four automatic approaches: