Automatic Indexing Methods
Indexing is the process of developing a document representation by assigning content descriptors to the document.
These descriptors are used in assessing the relevance of a document to a user query.
Difficulties of Indexing
Search engines do not provide comprehensive indexes of the Web because of limitations in (i) network bandwidth, (ii) computational power, (iii) disk storage, and (iv) the scalability of their indexing and retrieval technology.
Indexing Categories
- Administrator-generated indexes,
which are documents containing pointers to web resources, created manually by administrators.
- Crawler-generated indexes,
which are built by crawlers that constantly scan the Web.
Indexing Range
- Some indexing methods perform full-text or near-full-text indexing.
- Other indexing methods index only parts of a document, such as the URL, title, headers, weighted words, or anchor hypertext.
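As a rough illustration of partial indexing, the sketch below collects text only from the title, header, and anchor elements of an HTML page and ignores the body text. The class name `PartialIndexer`, the helper `index_parts`, and the choice of tracked tags are assumptions made for this example; real crawlers choose and weight fields in their own ways.

```python
from html.parser import HTMLParser


class PartialIndexer(HTMLParser):
    """Collects text from selected parts of a page (hypothetical helper)."""

    # Assumed set of fields to index; body paragraphs are deliberately left out.
    FIELDS = {"title", "h1", "h2", "h3", "a"}

    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in self.FIELDS}
        self._open = []  # stack of tracked tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELDS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Pop back to (and including) the matching tracked tag.
            while self._open and self._open.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self._open:
            # Attribute the text to the innermost tracked element.
            self.fields[self._open[-1]].append(text)


def index_parts(html):
    """Return a dict mapping each tracked field to the text found in it."""
    parser = PartialIndexer()
    parser.feed(html)
    parser.close()
    return parser.fields


page = (
    "<html><head><title>Indexing Notes</title></head>"
    "<body><h1>Methods</h1><p>Body text is skipped.</p>"
    "<a href='x.html'>full text</a></body></html>"
)
fields = index_parts(page)
```

A full-text indexer would instead keep every text node, which is the storage and bandwidth trade-off between the two ranges.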
Indexing Methods
The automatic assignment of content terms to documents can be based on single terms or multiple terms.
- Single-term indexing:
The term set of a document consists of its words and their frequencies.
Single terms are less than ideal because their meanings are often ambiguous out of context.
Methods include: (i) statistical, (ii) information theoretic, and (iii) probabilistic.
- Multi-term or phrase indexing:
Multi-word terms and phrases carry more specific meaning than single words and thus have more discriminating power.
Methods include: (i) statistical, (ii) probabilistic, and (iii) linguistic.
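The difference between the two approaches can be sketched with simple frequency counting, which is only the most basic statistical variant. The function names `tokenize`, `single_term_index`, and `phrase_index` are hypothetical, and treating every run of `n` adjacent words as a phrase is an assumption of this example; it does no stop-word removal, weighting, or linguistic phrase detection.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def single_term_index(text):
    """Single-term indexing: each word with its frequency in the document."""
    return Counter(tokenize(text))


def phrase_index(text, n=2):
    """Phrase indexing (naive): each run of n adjacent words with its frequency."""
    tokens = tokenize(text)
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


doc = ("Information retrieval systems index documents; "
       "retrieval quality depends on the index.")
terms = single_term_index(doc)
phrases = phrase_index(doc)
```

Here the single term "retrieval" occurs twice but says little on its own, while the phrases "information retrieval" and "retrieval quality" each occur once and are more specific.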