Indexing the System
The strength of any search engine is in its index. Without a good index, you cannot produce good search results. With the three tables defined, it’s clear how the index will be represented in the database; the trick now is to populate that database by indexing the site’s content. Indexing in this case means reading in a page’s content, breaking it into words, counting the words, and populating the database.
One way you can go about doing that is to create a spider (a rough sketch follows this list). A spider:
- Starts with one URL and reads in that page’s content
- Finds all the URLs on the page that link to other pages on the same site
- Adds each page found to a list of pages to be indexed, but only if the page has not already been indexed
- Indexes the current page’s content
- Goes to the next page in the list of pages to be indexed
- Repeats this sequence for all pages on the site
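By way of illustration, here is a minimal sketch of that approach, assuming the crawl is limited to site-relative links; the starting URL and the index_page() function are hypothetical placeholders, and this is not the route taken below:

// Spider sketch (not used here): crawl same-site pages, indexing each one once.
$base  = 'http://www.example.com';
$queue = array('/index.php');             // pages waiting to be indexed
$seen  = array('/index.php' => true);     // pages already queued or indexed
while ($queue) {
    $path = array_shift($queue);
    $html = file_get_contents($base . $path);
    if ($html === false) continue;
    // Index the current page's content (hypothetical function):
    // index_page($base . $path, $html);
    // Find links to other pages on the same site and queue any not yet seen:
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from imperfect HTML
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '' && $href[0] === '/' && !isset($seen[$href])) {
            $seen[$href] = true;
            $queue[] = $href;
        }
    }
}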
This is a common approach but takes a bit of effort. There’s a simpler way in this particular example: Because the entire site’s (meaningful) content resides in the database, the content can be pulled from it in order to be indexed. By heading in this direction, the indexing system doesn’t need to read in tons of HTML and it doesn’t need to look for, manage, and follow links.
With this particular example (the message board), one can't just run a SELECT query and index the results, as that wouldn't get the thread subjects and usernames indexed. My solution is to create a virtual representation of each page and index that. Here, the key content is only displayed on the read.php page, which shows an entire thread. So I created se_read.php, which outputs a thread without any markup.
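As a rough idea of what se_read.php might contain (a sketch only; the table and column names used here, such as threads, posts, and users, are assumptions to be adjusted to your own schema, and error handling is omitted):

// se_read.php sketch: output one thread as plain text, with no markup at all.
$tid = (int) $_GET['tid']; // thread to display
$dbc = mysqli_connect('localhost', 'username', 'password', 'forum');
// The thread's subject:
$r = mysqli_query($dbc, "SELECT subject FROM threads WHERE thread_id = $tid");
list($subject) = mysqli_fetch_row($r);
echo $subject . "\n";
// Every post in the thread, along with the poster's username:
$q = "SELECT u.username, p.body
      FROM posts AS p INNER JOIN users AS u USING (user_id)
      WHERE p.thread_id = $tid";
$r = mysqli_query($dbc, $q);
while ($row = mysqli_fetch_assoc($r)) {
    echo $row['username'] . "\n" . $row['body'] . "\n";
}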
Once you have the content ready to be indexed, you need to create a script that reads in the content and then breaks it into words to be stored in the database. That script is se_index.php, available in the downloadable code. I’ll walk through some of the key parts here.
To read in the content, a quick call to file_get_contents() will do the trick, returning a long string:
$content = file_get_contents($se_url);
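Where $se_url comes from depends on how you drive the script; as one possibility (the tid parameter and the $thread_ids array are assumptions, not part of the actual script), the indexer could loop through every thread:

// Build the URL of the markup-free virtual page for each thread:
foreach ($thread_ids as $tid) {
    $se_url  = 'http://www.example.com/se_read.php?tid=' . $tid;
    $content = file_get_contents($se_url);
    // ...break $content into words and record them, as described below...
}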
Next, strip_tags() is applied, in case any HTML or other tags are present (you could also look for and strip out HTML entities, if you want). Then the content is converted to lowercase so that no distinction is made between Word and word:
$content = strip_tags($content);
$content = strtolower($content);
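If you do want to deal with HTML entities as well, one option (a sketch, not part of the original script) is to decode them before stripping the tags:

// Convert entities such as &amp; back to plain characters first:
$content = html_entity_decode($content, ENT_QUOTES, 'UTF-8');
$content = strip_tags($content);
$content = strtolower($content);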
A regular expression then breaks the string into individual words:
preg_match_all('/\b\w+\b/', $content, $output);
That code simply looks for runs of word characters (\w+) between word boundaries (\b). The matches are returned as the first element of the $output variable, $output[0].
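As a quick illustration of what that produces (using a made-up input string):

preg_match_all('/\b\w+\b/', 'the cat sat on the mat', $output);
// $output[0] is now array('the', 'cat', 'sat', 'on', 'the', 'mat')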
At this point, the words in the content have been identified and it’s time to count the frequency of each unique word:
foreach ($output[0] as $word) {
    if (strlen($word) > 3) {
        if (isset($words[$word])) {
            $words[$word]++;
        } else {
            $words[$word] = 1;
        }
    }
}
Within the foreach loop, each word is analyzed and added to the $words array. The if conditional only adds words containing more than three characters, although you can change this to suit your needs. The $words array, which stores the final list of words for the content, uses the word as the element index and the number of instances as the value. If the word already exists in the array, its value is incremented. Otherwise, the word is added to the array.
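An alternative, just to show there's more than one way to build the same array (this is not what se_index.php does), is to let PHP do the counting:

// Keep only words longer than three characters, then count each unique word:
$long_words = array_filter($output[0], function ($w) {
    return strlen($w) > 3;
});
$words = array_count_values($long_words);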
Now the script has a list of words for the content. The remaining steps are to fetch or create the page ID, fetch or create each word ID, and add each page-word combination to the se_pages_words table. This is just a series of queries and another foreach loop. See the corresponding script for the particulars.
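In outline, those queries might look something like this (a sketch, not the actual se_index.php code; aside from se_pages_words, the table and column names are assumptions, and escaping and error handling are omitted):

// Fetch or create the page's ID:
$r = mysqli_query($dbc, "SELECT page_id FROM se_pages WHERE url = '$se_url'");
if (mysqli_num_rows($r) === 1) {
    list($page_id) = mysqli_fetch_row($r);
} else {
    mysqli_query($dbc, "INSERT INTO se_pages (url) VALUES ('$se_url')");
    $page_id = mysqli_insert_id($dbc);
}
foreach ($words as $word => $count) {
    // Fetch or create the word's ID:
    $r = mysqli_query($dbc, "SELECT word_id FROM se_words WHERE word = '$word'");
    if (mysqli_num_rows($r) === 1) {
        list($word_id) = mysqli_fetch_row($r);
    } else {
        mysqli_query($dbc, "INSERT INTO se_words (word) VALUES ('$word')");
        $word_id = mysqli_insert_id($dbc);
    }
    // Record this page-word combination and its frequency:
    mysqli_query($dbc, "INSERT INTO se_pages_words (page_id, word_id, `count`)
                        VALUES ($page_id, $word_id, $count)");
}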
Note that this is going to be a memory-intensive script. Ideally you'd run it from the command line. Or, you could have the script process a single page of content at a time.