The Code Sample of Indexing


The following program is used to implement the button “Index web pages” of the previous interface. It adds the URL to the inverted list if its contents include the keyword entered from the Web. If the keyword does not exist in the inverted list, add it and the URL to the list if the URL contents include the keyword.

A PEM file is a text-based file that contains a certificate, a private key, and any associated certificates. The file DigiCertGlobalRootCA.crt.pem can be found from here.

http://undcemcs01.und.edu/~wen.chen.hu/course/515/2/Spider.php
<?php

 # SQL> CREATE TABLE  keywords (
 #   2    kwID     INT AUTO_INCREMENT PRIMARY KEY,
 #   3    keyword  VARCHAR(64) );
 #
 # SQL> CREATE TABLE  url_title (
 #   2    urlID  INT AUTO_INCREMENT PRIMARY KEY,
 #   3    url    VARCHAR(128),
 #   4    title  VARCHAR(128) );
 #
 # SQL> CREATE TABLE  www_index (
 #   2    kwID   INT,
 #   3    urlID  INT,
 #   4    PRIMARY KEY ( kwID, urlID ),
 #   5    FOREIGN KEY ( kwID  ) REFERENCES  keywords  ( kwID ),
 #   6    FOREIGN KEY ( urlID ) REFERENCES  url_title ( urlID ) );

 $keyword  = $argv[1];
 $URL      = $argv[2];
 $host     = "undcemmysql.mysql.database.azure.com";
 $username = "your_id";
 $database = "your_db";
 $password = "your-pw";
 $conn     = mysqli_init( );
 mysqli_ssl_set( $conn, NULL, NULL, "DigiCertGlobalRootCA.crt.pem", NULL, NULL );
 mysqli_real_connect( $conn, $host, $username, $password, $database, 3306 );

 if ( $conn->connect_errno )
   die( 'Failed to connect to MySQL: ' . $conn->connect_error );

 // Fetch the contents of the web page.
 $html = shell_exec( "lynx -dump -source '" . $URL . "'" );

 # Find the page title by using a regular expression.
 $pattern = '/<title.*?<\/title>/';
 preg_match( $pattern, $html, $matches );
 $title   = strip_tags( $matches[0] );

 # Check whether the page contains the keyword.
 if ( substr_count( $html, $keyword ) != 0 )
   $found = true;
 else
   $found = false;
      
 # Find the ID of the input keyword from the keywords table.
 $query = "SELECT kwID FROM keywords WHERE keyword='$keyword';";
 echo( $query . "\n\n" );
 $result = $conn->query( $query );
 $row    = $result->fetch_assoc( );
 if ( $result->num_rows > 0 )
   do
     $kwID = $row['kwID'];
   while( $row = $result->fetch_assoc( ) );
 else {
   $query = "INSERT INTO keywords( keyword ) VALUES ( '$keyword' );";
   echo( $query . "\n\n" );
   $conn->query( $query );
   $query = "SELECT kwID FROM keywords WHERE keyword='$keyword';";
   echo( $query . "\n\n" );
   $result = $conn->query( $query );
   $row    = $result->fetch_assoc( );
   if ( $result->num_rows > 0 )
     do
       $kwID = $row['kwID'];
     while( $row = $result->fetch_assoc( ) );
 }

 # Find the ID of the input URL from the url_title table.
 $query = "SELECT urlID FROM url_title WHERE url='$URL';";
 echo( $query . "\n\n" );
 $result = $conn->query( $query );
 $row    = $result->fetch_assoc( );
 if ( $result->num_rows > 0 )
   do
     $urlID = $row['urlID'];
   while( $row = $result->fetch_assoc( ) );
 else {
   $query = "INSERT INTO url_title( url, title ) VALUES ( '$URL', '$title' );";
   echo( $query . "\n\n" );
   $conn->query( $query );
   $query  = "SELECT urlID FROM url_title WHERE url='$URL';";
   $result = $conn->query( $query );
   $row    = $result->fetch_assoc( );
   if ( $result->num_rows > 0 )
     do
       $urlID = $row['urlID'];
     while( $row = $result->fetch_assoc( ) );
 }

 # Update the inverted list if the keyword is found.
 if ( $found == true ) {
   $query = "INSERT INTO www_index VALUES ( '$kwID', '$urlID' );";
   echo( $query . "\n\n" );
   $conn->query( $query );
 }

 // Close the database connection.
 $conn->close( );

?>

  • Regular expressions: A regular expression is a sequence of characters that forms a search pattern. When you search for data in a text, you can use this search pattern to describe what you are searching for.

  • shell_exec($cmd): This function accepts a single parameter $cmd which is used to hold the command that will be executed. It returns the executed command or NULL if an error occurred.

  • preg_match( ): This function returns whether a match was found in a string by using a regular expression.

  • strip_tags( ): This function strips a string from HTML, XML, and PHP tags.



      “Wise? No, I simply learned to think.”    
      ― Christopher Paolini, Eldest