Automatic Text Processing


Automatic text processing refers to the discipline of mechanizing the creation or manipulation of text.

The purpose of automatic text processing is to derive from the text a set of index terms that accurately match the terms users put in their queries. Automatic text processing includes the following steps:
  1. Lexical analysis, which is the process of converting an input stream of characters into a stream of words or tokens,

  2. Elimination of stopwords, which are known to make poor index terms,

  3. Stemming, which relates morphologically similar indexing and search terms,

  4. Selection of index terms, which are terms that capture the essence of the topic of a document,

  5. Building a thesaurus, which guides both the indexer and the searcher toward the same preferred term, or combination of preferred terms, to represent a given subject, and

  6. Creating inverted indexes, which are data structures that map content, such as words, to its locations in a set of documents.
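
As a rough illustration of how steps 1, 2, 3, and 6 fit together, the following Python sketch runs two short documents through a toy pipeline. Everything in it is an assumption made for the example: the stopword list and the suffix list are hypothetical and far smaller than anything used in practice, the suffix stripper is a crude stand-in and not an established stemming algorithm such as Porter's, and the function names are invented here. Term selection (step 4) is simplified to "keep every stem that survives", and thesaurus construction (step 5) is left out.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Toy suffix list for illustration only; this is not the Porter stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Step 1, lexical analysis: characters in, lowercase word tokens out."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Step 2: drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Step 3: strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Steps 1 to 3 combined; every surviving stem is kept as an index term."""
    return [stem(t) for t in remove_stopwords(tokenize(text))]


def build_inverted_index(docs):
    """Step 6: map each term to {doc_id: [positions]}.

    Positions count index terms, i.e. they are taken after stopword removal.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(index_terms(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


docs = {
    1: "Indexing the documents",
    2: "The indexed document is searched",
}
index = build_inverted_index(docs)
```

Because "Indexing" and "indexed" both reduce to the same stem, a query processed with `index_terms` would find both documents under a single entry of `index`. That is the sense in which stemming relates morphologically similar indexing and search terms.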


