Extracting Web Page Content


Most programming languages provide functions for extracting text from HTML pages. The instructor personally prefers a Unix tool, lynx, which is a text-mode browser for the World Wide Web. It is a Web client for users running cursor-addressable, character-cell display devices, such as vt100 terminals and terminal emulators. It displays HTML documents containing links to files residing on the local system as well as on remote systems. The syntax for lynx is given below:
       shell> lynx  [options]  [path or URL]
where some of the options are:
-dump
    Dumps the formatted output of the default document, or of the one specified on the command line, to standard output.
-help
    Displays a complete list of current options.
-listonly
    When used with -dump, shows only the list of links.
-nolist
    Disables the link list feature in dumps.
-source
    Works the same as -dump but outputs HTML source instead of formatted text.
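The options above can also be combined from a script. The following Python sketch assembles a lynx command line and runs it with the standard subprocess module. The function names build_lynx_command and dump_page are hypothetical helpers introduced for illustration, and the sketch assumes lynx is installed and on the PATH.

```python
import shutil
import subprocess


def build_lynx_command(target, *, source=False, nolist=False, listonly=False):
    """Return the argument list for a non-interactive lynx dump of `target`.

    `target` is a path or URL; the keyword flags map onto the options
    described above. This helper is illustrative, not part of lynx.
    """
    cmd = ["lynx", "-dump"]
    if source:
        cmd.append("-source")    # HTML source instead of formatted text
    if nolist:
        cmd.append("-nolist")    # leave out the list of links
    if listonly:
        cmd.append("-listonly")  # print only the list of links
    cmd.append(target)
    return cmd


def dump_page(target, **options):
    """Run lynx on `target` and return its standard output as a string."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx is not installed or not on PATH")
    result = subprocess.run(
        build_lynx_command(target, **options),
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode output using the locale encoding
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

For instance, build_lynx_command("page.html", nolist=True) returns ['lynx', '-dump', '-nolist', 'page.html'], and dump_page("page.html", nolist=True) would return the formatted text of that file without the link list.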
lynx can be used to access information on the World Wide Web, or to build information systems intended primarily for local access. For example, lynx has been used to build several Campus Wide Information Systems (CWIS). In addition, lynx can be used to build systems isolated within a single LAN.
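The commands below show these options in use; append a path or URL to act on a specific document instead of the default one:
       shell> lynx -help
       shell> lynx -dump
       shell> lynx -dump -source
       shell> lynx -dump -nolist
       shell> lynx -dump -listonly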
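
For comparison, the opening remark that most programming languages provide functions for extracting text from HTML can be illustrated with Python's standard html.parser module. This is only a minimal sketch: the names TextExtractor and html_to_text are hypothetical, and the output is far cruder than the formatted text lynx -dump produces (no layout, no link list).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty pieces of text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when outside <script>/<style> and not pure whitespace.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string, with pieces joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling html_to_text("<p>Hello <b>lynx</b></p><script>x=1</script>") returns "Hello lynx".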



