Information Retrieval (Cont.)
Filtering
Filtering is a typical transformation in information retrieval, used for example to reduce the size of the text and/or to standardize it to simplify searching.
Text is the input and a processed or filtered version of the text is the output.
Major filtering techniques include:
- Common words removed using a list of stopwords such as “of” and “the”, which make poor index terms.
A stopword list contains words that may be entered into a search statement but cannot be searched for as individual words.
The page provides a sample stopword list.
- Uppercase letters transformed to lowercase letters.
- Special symbols such as ‘@’ removed and sequences of multiple spaces reduced to one space.
- Numbers and dates transformed to a standard format.
- Word stemming attempts to reduce a word to its stem or root form.
Thus, the key terms of a query or document are represented by stems rather than by the original words.
For example, a search for “develop” might return pages containing the words “development” or “developer,” and a search for “index” might return pages containing “indexes” or “indexing.”
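The filtering steps listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stopword list, the suffix list, and the names `filter_text` and `naive_stem` are hypothetical, the stemmer is a naive suffix stripper rather than an established stemming algorithm, only ASCII text is handled, and number and date normalization is left out.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in", "is", "for"}

# Suffixes tried by the naive stemmer, in this order (an assumption).
SUFFIXES = ("ment", "ing", "es", "er", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def filter_text(text):
    """Return the list of filtered terms for a piece of text."""
    text = text.lower()                        # uppercase -> lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop special symbols such as '@'
    tokens = text.split()                      # also collapses runs of spaces
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    return [naive_stem(t) for t in tokens]     # reduce each term to a stem


print(filter_text("The Development of  the @Indexes"))  # ['develop', 'index']
```

Applying the same function to both documents and queries means that “developer” and “development” are compared through the shared stem “develop”.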
Indexing
Almost all types of indexes are based on some kind of tree or on hashing; the exceptions include clustered data structures and directed acyclic word graphs.
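As a rough illustration of a hashing-based index, the sketch below builds a small inverted index on top of Python's built-in hash table (`dict`). The choice of an inverted index, the function name `build_inverted_index`, and the sample documents are assumptions made for this example; the notes do not prescribe a particular index type.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)   # hash lookup on the term
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, keyed by document id.
docs = {1: "index structures", 2: "hash index", 3: "tree structures"}
index = build_inverted_index(docs)
print(index["index"])  # [1, 2]
```

A tree-based index would store the same term-to-documents mapping in sorted order, which additionally supports range and prefix lookups at the cost of slower single-term access.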
DBMSs versus Files
Using a DBMS can relieve you of the task of designing a file structure for your applications.
However, the cost is reduced flexibility.
In fact, a DBMS is, first and foremost, a disk access manager.