Friday, 13 August, 2004

How Searching Can Be Improved

Blogging is a relatively new phenomenon.  According to The History of Weblogs, the first weblog was the first Web page, published by Tim Berners-Lee in the early 1990s.  The concept caught on slowly, with just a handful of logs being published until 1998 or 1999, when blogging started to become popular in the technical community.  Blogging by the general public (i.e., people outside the computer industry) didn't really take off until 2000 or 2002.  Since then it's grown tremendously: Technorati says that the number of weblogs it tracks has increased from 100,000 to almost 3.5 million in just two years.

In the following discussion, I'm going to lump personal weblogs and frequently updated news sites together.  I realize that news sites and news aggregators serve different needs than do personal blogs, but the way that they're updated and searched is almost identical.

It's little surprise that current search and indexing techniques are inadequate for blog content.  We've spent centuries learning how to index relatively static content from books.  The growth of magazine publishing in the last 60 years or so gave us some idea of how to index and search monthly periodicals.  But indexing information that changes from minute to minute is still experimental.  The problem isn't so much in the indexing itself, but rather in keeping the index up to date.

Current search engines use a brute force method of keeping information up to date.  They have a bunch of servers (Google has over 10,000) that continually scan known Web sites for changes.  They do that in two ways: by re-reading known pages and comparing them against cached copies, and by scanning content for links to pages that are new, at least to the search engine.  Most search engines have some logic built in that prioritizes crawls based on how frequently a site has changed in the past.  A site that typically changes only once per week will be crawled much less often than a site like Yahoo News that changes many times per day.  Even so, a lot of bandwidth is wasted crawling pages that haven't changed.
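To make that concrete, here's a minimal sketch in Python of the compare-and-back-off idea: hash the fetched page, compare against a cached hash, and stretch or shrink the revisit interval depending on whether anything changed.  The cache structure and the doubling/halving policy are my own assumptions for illustration, not how any particular search engine actually does it.

    import hashlib
    import urllib.request

    # url -> (last content hash, current revisit interval in seconds)
    cache = {}

    def crawl(url, min_interval=3600, max_interval=7 * 24 * 3600):
        with urllib.request.urlopen(url) as resp:
            content = resp.read()
        digest = hashlib.sha1(content).hexdigest()

        old_digest, interval = cache.get(url, (None, min_interval))
        if digest == old_digest:
            # Page unchanged: back off, up to a weekly revisit.
            interval = min(interval * 2, max_interval)
        else:
            # Page changed: revisit more aggressively next time.
            interval = max(interval // 2, min_interval)
        cache[url] = (digest, interval)
        return digest != old_digest, interval

Note that even this "smart" crawler still has to download the whole page just to discover that nothing changed, which is exactly the bandwidth problem described above.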

There are ways to notify search engines of changes.  One case in point is Weblogs.com, which provides a Ping-Site form where site owners can report that their sites have changed.  When it receives a ping, the Weblogs.com server checks the site to verify that it has changed, and then publishes a change notification in a file called changes.xml.  Programs that want current information can download changes.xml and search the listed sites for new content.  Note that changes.xml says only that something on a listed site has changed; it's up to the client to figure out exactly what.  There's still some bandwidth wasted searching unchanged pages, but not as much as a blind crawl of known sites.
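For illustration, a changes.xml client might look something like the sketch below.  The <weblog> elements with name, url, and when attributes reflect the published format as I understand it; treat those details as assumptions.

    import urllib.request
    import xml.etree.ElementTree as ET

    CHANGES_URL = "http://www.weblogs.com/changes.xml"

    def recently_changed_sites():
        with urllib.request.urlopen(CHANGES_URL) as resp:
            root = ET.parse(resp).getroot()
        # Each <weblog> entry is a site that pinged recently; 'when' is
        # roughly how many seconds before the file was generated.
        return [(w.get("name"), w.get("url"), int(w.get("when", "0")))
                for w in root.iter("weblog")]

    for name, url, when in recently_changed_sites():
        print(f"{when:>6}s ago: {name} ({url})")

Even with this list in hand, the client only knows which sites changed, so it still has to crawl each one to find out what changed.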

That's where RSS can help.  If there were a service similar to Weblogs.com that listed only updated RSS feeds, a program could know exactly what had changed on a site.  Imagine a file called changedfeeds.xml that listed the RSS feeds that had changed in the last hour.  A program could download that file and then scan the listed feed summaries for new information.  The bandwidth required to maintain an index of current information drops, and clients of the search engine know they're getting the most up-to-date information available.
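Here's what a client of that hypothetical changedfeeds.xml might look like.  Everything about the file is imagined for the sake of the example (its URL, and that it reuses the changes.xml shape but points at feeds); only the RSS 2.0 item structure is standard.

    import urllib.request
    import xml.etree.ElementTree as ET

    # Hypothetical: neither this URL nor the file format actually exists.
    CHANGED_FEEDS_URL = "http://example.com/changedfeeds.xml"

    def fresh_items():
        with urllib.request.urlopen(CHANGED_FEEDS_URL) as resp:
            root = ET.parse(resp).getroot()
        for entry in root.iter("weblog"):  # assume the changes.xml shape
            feed_url = entry.get("url")
            with urllib.request.urlopen(feed_url) as feed_resp:
                feed = ET.parse(feed_resp).getroot()
            # RSS 2.0 puts entries under channel/item; the item itself
            # carries the changed content, so no blind re-crawl is needed.
            for item in feed.iter("item"):
                yield (feed_url,
                       item.findtext("title", ""),
                       item.findtext("link", ""))

    for feed_url, title, link in fresh_items():
        print(f"{feed_url}: {title} -> {link}")

The key difference from the changes.xml case is that the feed delivers the changed content directly, so the indexer downloads only new items instead of whole sites.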

Getting the most current information is only part of the solution.  Presenting the most relevant information to the user is the other big part, and that's going to require some software that doesn't appear to exist yet.  Next time I'll describe what that software has to do.