Information Provided by HTML Files
The Web flourishes because of its format-free style.
The lack of a unifying structure has helped popularize the Web, but the resulting complexity also makes Web searches difficult.
HTML pages provide the following information:
- Audio/figure/flash/table/video captions:
A caption usually describes the subject of the media object it accompanies.
- Content:
Web page content provides the most accurate, full-text information.
However, it is also the information that search engines use least, since content extraction is still far less practical than using the other sources listed here.
- Descriptions:
Web page descriptions can either be constructed from the meta tags or submitted by webmasters or reviewers.
A meta tag is an HTML tag that provides information about a web page, such as its author, expiration date, or a list of keywords.
- Hyperlinks:
Hyperlinks contain high-quality semantic clues to a page’s topic.
A hyperlink to a web page represents an implicit endorsement of the page being pointed to.
- Hyperlink text:
Hyperlink text is normally a title or brief summary of the target page.
- Keywords:
Keywords can be extracted from full-text documents or meta tags.
When keywords are taken from the full text, filtering operations are first applied to the document.
Typical operations include removing common words with a stopword list and converting upper-case letters to lower case.
- Page titles:
The title tag defines the title of an HTML document.
- Text with a different font, style, color, or size:
Emphasized text is usually set in a different font, style, color, or size to highlight its importance.
- The first sentences:
The first sentences of a web page usually serve as an introduction or an abstract.
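
As a rough illustration of how several of these sources might be pulled out of a page, the following Python sketch uses only the standard library's html.parser module. It is a minimal example under stated assumptions, not a description of how any particular search engine works: the names PageInfoParser, parse_page, extract_keywords, first_sentence, STOPWORDS, and SAMPLE are hypothetical, the stopword list is deliberately tiny, the sample page is invented, and the set of tags treated as "emphasized" is an arbitrary choice.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}

# Tags treated here as "emphasized text" (an arbitrary, illustrative choice).
EMPHASIS_TAGS = {"b", "strong", "i", "em", "h1", "h2", "h3"}


class PageInfoParser(HTMLParser):
    """Collects title, meta tags, hyperlinks, emphasized text, and body text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}        # meta name (lower-cased) -> content
        self.links = []       # (href, hyperlink text) pairs
        self.emphasized = []  # text found inside EMPHASIS_TAGS
        self.text = []        # all text outside <title>
        self._in_title = False
        self._href = None     # href of the <a> currently open, if any
        self._anchor = []     # text pieces of the <a> currently open
        self._emph_depth = 0  # nesting depth of emphasis tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._anchor = []
        elif tag in EMPHASIS_TAGS:
            self._emph_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None
        elif tag in EMPHASIS_TAGS and self._emph_depth > 0:
            self._emph_depth -= 1

    def handle_data(self, data):
        piece = data.strip()
        if not piece:
            return
        if self._in_title:
            self.title += piece
            return
        self.text.append(piece)
        if self._href is not None:
            self._anchor.append(piece)
        if self._emph_depth > 0:
            self.emphasized.append(piece)


def parse_page(html):
    """Parse an HTML string and return the filled-in PageInfoParser."""
    parser = PageInfoParser()
    parser.feed(html)
    parser.close()
    return parser


def extract_keywords(text):
    """Lower-case the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def first_sentence(text):
    """Return the text up to the first sentence-ending punctuation mark."""
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]


# Invented sample page, used only to exercise the functions above.
SAMPLE = """<html><head><title>Web Search Basics</title>
<meta name="keywords" content="search, HTML, indexing">
</head><body>
<p>The Web is an open collection of pages.</p>
<p>See the <a href="https://example.com/guide">indexing guide</a>
for <b>important</b> details.</p>
</body></html>"""

if __name__ == "__main__":
    info = parse_page(SAMPLE)
    intro = first_sentence(" ".join(info.text))
    print("title:     ", info.title)
    print("meta:      ", info.meta)
    print("links:     ", info.links)
    print("emphasized:", info.emphasized)
    print("intro:     ", intro)
    print("keywords:  ", extract_keywords(intro))
```

The sketch keeps each source in its own field so that a later ranking step could weight them differently, for example by trusting title and hyperlink text more than body text. Captions are left out because HTML marks them up in several different ways, and handling them would need additional assumptions.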