Deep web to be surfaced through XML

Written by David Tebbutt in December 2005

Searching is a fundamental part of your life, right? Whether it's searching third-party databases, your own intranet, the web or your desktop, you're probably never far from a search engine. The trouble with search engines is that they only look at the past and they don't look at everything.

Someone at the Online Information conference said "if it isn't in Google, it doesn't exist." While amusing, this is simply not true. The tragedy for you is that it's what a lot of your internal clients believe.

On the desktop, indexing is pretty much real-time, and with third-party providers it keeps pace with the arrival of fresh information. With web search engines, it depends on when the crawling and index updates take place. With systems such as Google, this can be weeks after new material arrives, depending on the algorithms that determine the relative importance of a website.

All the traditional search engines extract their information from the HTML of web pages. They usually start from the root of a website and follow the links they find; if a page isn't linked to, it isn't crawled.
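
To make the point concrete, here's a minimal sketch in Python of that link-following approach; the start URL is a placeholder, and a real crawler would add politeness rules, robots.txt handling and far better parsing.

    # Minimal sketch of a link-following crawler: pages that no crawled page
    # links to are never visited. The start URL below is a placeholder.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        """Collects href values from <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, limit=50):
        """Breadth-first crawl from the site root, following links only."""
        seen, queue = {start_url}, deque([start_url])
        domain = urlparse(start_url).netloc
        while queue and len(seen) <= limit:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue
            collector = LinkCollector()
            collector.feed(html)
            for href in collector.links:
                absolute = urljoin(url, href)
                # Stay on the same site; pages nothing links to are never found.
                if urlparse(absolute).netloc == domain and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return seen

    print(crawl("http://www.example.com/"))  # placeholder start URL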

Blogs changed things because they started using RSS feeds to syndicate their contents. Because part or all of each post is reproduced in XML, along with other contextual information, anyone can pick up the feeds and share them. A number of services sprang up around watching blog feeds. In search terms, most copied the traditional search engines by indexing the content of the feeds.
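
As a rough illustration of what a feed consumer does, the Python sketch below pulls the title, link and date out of each item in an RSS 2.0 feed using nothing but the standard library; the feed URL is a placeholder, and a real aggregator would also cope with Atom, encodings and broken feeds.

    # Minimal sketch of reading an RSS 2.0 feed: the XML carries the content
    # plus contextual fields (title, link, publication date) that any service
    # can pick up and share. The feed URL is a placeholder.
    import xml.etree.ElementTree as ET
    from urllib.request import urlopen

    def read_feed(feed_url):
        """Return (title, link, pubDate) for each item in the feed."""
        with urlopen(feed_url, timeout=10) as response:
            root = ET.parse(response).getroot()
        return [(item.findtext("title", default=""),
                 item.findtext("link", default=""),
                 item.findtext("pubDate", default=""))
                for item in root.iter("item")]

    for title, link, date in read_feed("http://example.com/blog/rss.xml"):
        print(date, title, link)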

One exception to this is PubSub Concepts. It decided to scan XML feeds and deliver 'hits' to its customers in real time, as they flowed through its filter. Anyone can try it out; it's free. Put in a 'when this happens' search argument and wait. It's similar in principle to Technorati's 'watchlists', but it only stores the last 32 items from any feed, although its remit extends way beyond blogs.
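
PubSub hasn't published its internals, but the principle behind this kind of prospective search is easy to sketch: store the 'when this happens' queries up front and match every incoming item against them, the reverse of indexing first and searching later. The Python below is a deliberately simplified illustration with made-up subscriptions, not PubSub's actual method.

    # Simplified illustration of prospective search: queries are stored up
    # front and every new item is matched as it arrives, the reverse of
    # indexing first and searching later. The subscriptions are made up.
    subscriptions = {
        "our company": ["acme", "acme corp"],
        "competitors": ["widgetco", "gadget inc"],
    }

    def match_item(item_text):
        """Return the subscriptions whose keywords appear in the item."""
        text = item_text.lower()
        return [name for name, keywords in subscriptions.items()
                if any(keyword in text for keyword in keywords)]

    def on_new_item(item_text):
        """Called once per incoming feed item; delivers 'hits' immediately."""
        for name in match_item(item_text):
            print(f"hit for '{name}': {item_text}")

    on_new_item("Acme Corp announces record quarterly results")
    on_new_item("WidgetCo opens a new factory in Leeds")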

Typically, companies use it to track mentions of themselves, their competitors, their prospects and their clients. At the moment PubSub includes a few other feeds, mainly US-centric, such as airport delays, earthquake warnings, SEC filings, press releases and newsgroups. It is easy to see how its reach could be extended to any organisation with information of value to offer.

This brings us to the deep web, where public information is delivered on demand from databases to transient web pages. Search engines can't see this information, yet it has been estimated at 400 to 500 times the size of the surface web. At the moment, if you wanted to automate your searching of this material, you'd need tools such as BrightPlanet's Deep Query Manager (DQM), which harvests the web both shallowly and deeply.

Any search strategy in the future will need to consider the role that XML will play in delivering valuable information. The blog world is already leading the way in this respect. Last month saw the announcement of 'structured blogging', which makes it easy to publish things like job vacancies, theatre reviews, cars for sale and event announcements.

Blogging software providers are rushing to include 'fill-in-the-blanks' templates for different kinds of announcement. Invisibly, but importantly, they have tweaked the XML to make it much easier for computers to understand context.
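
The real structured-blogging formats define their own schemas, so the element names in the short Python sketch below are purely illustrative; the point is simply that a post can carry labelled fields a machine can read rather than a single blob of HTML.

    # Sketch of the idea behind structured blogging: the post carries
    # labelled fields a machine can read. These element names are
    # illustrative, not the actual structured-blogging schema.
    import xml.etree.ElementTree as ET

    vacancy = ET.Element("item")
    ET.SubElement(vacancy, "title").text = "Information officer wanted"
    job = ET.SubElement(vacancy, "job-listing")   # hypothetical element
    ET.SubElement(job, "position").text = "Information officer"
    ET.SubElement(job, "location").text = "London"
    ET.SubElement(job, "salary").text = "GBP 30,000"
    ET.SubElement(job, "closing-date").text = "2006-01-31"

    print(ET.tostring(vacancy, encoding="unicode"))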

There's no earthly reason why organisations of all hues should not present an XML feed of their public information to the outside world. There's even talk of a common pool of feeds called a feedmesh. This makes a lot of sense and it will enable searching, mining and filtering companies to raise their game.

And make your life easier.