The X' (ML) Files - Finding Data in the Deep Web

By Drew Robb

With so much data being created daily (161 exabytes in 2006 alone, according to IDC; an exabyte is a 1 with 18 zeros after it), finding complete answers to any search requires the ability to query textual and database information anywhere, internally or on the Web. The problem is that most of that data isn't contained in a web page, but within a database that only displays it in response to user action.

“In 2000/2001 we did some analysis and realized that the quantity of documents from these deep web databases was far bigger than what everyone was calling the Internet,” said Jerry Tardif, vice-president of BrightPlanet Corp., a search firm headquartered in Sioux Falls, S.D.

You can't just plug some search terms into Google to access all this data. Reaching it requires a federated search tool.

“Google makes search look simple, but in fact search is not simple, particularly when completeness is important,” said David Fuess, a computer scientist in Lawrence Livermore National Laboratory's Nonproliferation, Homeland and International Security (NHI) directorate. His team uses BrightPlanet's Deep Query Manager (DQM) to look for information on end users of export-controlled goods that might have military uses.

“To be effective you must strike a proper balance that maximizes the probability that the information you seek is in the results and that the results can be reviewed within the response time allowed.”

Federated Search

Traditional Web searches, or searches against an organization's own content, consist of building an index of the words in those documents and then running a query against that index. Federated search is the ability to execute queries against multiple databases at the same time.

Internally, for example, a user could run a query on a customer that would turn up both invoices contained in the finance system as well as that customer's contract contained in the document management system. Expanding its scope to outside sources, the federated search engine could also pull up the most recent stock quote from Dow Jones, the bond rating from Moody's and the customer's latest filings with the Securities and Exchange Commission.
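
The fan-out described above can be sketched in a few lines of Python. This is a minimal illustration only, not how any product named in this article works. The three source functions, their records and the federated_search helper are hypothetical stand-ins for a finance system, a document management system and an outside filings feed.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real connectors. Each takes a query string
# and returns a list of result records (plain dicts).
def search_finance(query):
    invoices = [{"id": "INV-1", "title": "Invoice for Acme Corp"}]
    return [r for r in invoices if query.lower() in r["title"].lower()]

def search_documents(query):
    contracts = [{"id": "DOC-7", "title": "Acme Corp master contract"}]
    return [r for r in contracts if query.lower() in r["title"].lower()]

def search_filings(query):
    filings = [{"id": "SEC-3", "title": "Globex annual report"}]
    return [r for r in filings if query.lower() in r["title"].lower()]

SOURCES = {
    "finance": search_finance,
    "documents": search_documents,
    "filings": search_filings,
}

def federated_search(query, sources=SOURCES):
    """Send the same query to every source at once and tag each hit with its origin."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        # Submit all queries first so the sources are searched concurrently.
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        results = []
        for name, future in futures.items():
            for record in future.result():
                results.append({**record, "source": name})
    return results

hits = federated_search("acme")
for hit in hits:
    print(hit["source"], hit["id"], hit["title"])
```

In practice each connector would call a remote system and could be slow or fail, which is why the queries are submitted in parallel rather than one after another.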

Creating a federated search engine is more than just a matter of installing some software. “IT staff needs to understand that this is not a trivial undertaking,” said Abe Lederman, president of Deep Web Technologies, which develops the Explorit federated search software. “It is very unlikely that this is something an IT person can just purchase a copy of it, set it up and run it.”

The first step is surveying what resources are available to be searched. This is relatively simple when dealing with data the organization owns, but gets far more complex when locating outside sources. No one knows exactly how many publicly available databases exist on the Internet, but the CompletePlanet directory has a searchable and browsable list of more than 70,000 online databases and specialty search engines.

“If an agency is federating search on their own databases, they generally know what they have, where it is, and the type of information that is in there,” said BrightPlanet's Tardif. “But if they are doing something on the outside, they need subject matter expertise on what public sources are available.”

Once you have selected the databases to include in the search, there is the matter of creating links and writing the code needed to execute the query on each of those databases. This can include writing appropriate login scripts. These scripts need to be checked regularly and updated whenever the underlying database structure changes. A final step is to refine the user interface that aggregates the data from these different sources and presents it to the end user.
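
One way to catch a connector that has quietly stopped working is to run a known probe query against it on a schedule and inspect what comes back. The sketch below assumes every connector is a function returning a list of dicts with "id" and "title" fields. The field names, the check_connector helper and the three sample connectors are all hypothetical.

```python
# Fields every connector is expected to return; an assumption for this sketch.
REQUIRED_FIELDS = {"id", "title"}

def check_connector(name, search_fn, probe_query):
    """Run a known query and report whether the connector still works."""
    try:
        records = search_fn(probe_query)
    except Exception as exc:  # login failure, changed URL, timeout, etc.
        return {"source": name, "ok": False, "reason": f"error: {exc}"}
    if not records:
        return {"source": name, "ok": False, "reason": "no results for probe query"}
    missing = REQUIRED_FIELDS - set(records[0])
    if missing:
        return {"source": name, "ok": False, "reason": f"missing fields: {sorted(missing)}"}
    return {"source": name, "ok": True, "reason": ""}

# Hypothetical connectors: one healthy, one whose source renamed a field,
# and one whose login script no longer works.
def healthy(query):
    return [{"id": "1", "title": "Export licence"}]

def renamed_field(query):
    return [{"id": "2", "headline": "Export licence"}]

def broken_login(query):
    raise PermissionError("login rejected")

report = [
    check_connector("healthy", healthy, "export"),
    check_connector("renamed_field", renamed_field, "export"),
    check_connector("broken_login", broken_login, "export"),
]
for entry in report:
    print(entry)
```

A check like this only detects breakage; someone still has to update the connector when the underlying source changes.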

Roy Tennant, the User Services architect for the California Digital Library (the group that provides centralized digital access to the collections of all University of California campuses, as well as hundreds of other databases), found that an off-the-shelf product didn't provide the needed functions without extensive customization.

“Since the user interface of the commercial product was not as flexible as we required, we needed to build our own user interface layer and use the application program interface (API) of the commercial application to handle the connections to multiple sources, the searching, merging of search results, de-duplication, and ranking,” said Tennant.

“This also required us to work with the vendor and the product user community to create a prioritized list of enhancements to the vendor’s API and wait for those enhancements to be provided (which they were).”
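
The merging, de-duplication and ranking that Tennant mentions can be illustrated with a small Python sketch. It is not the commercial product's API or the California Digital Library's code. The record layout, the title-based duplicate test and the term-count score are simplifying assumptions chosen to keep the example short.

```python
def normalize(title):
    """Lowercase and collapse whitespace so near-identical titles compare equal."""
    return " ".join(title.lower().split())

def score(record, query_terms):
    """Count how many times the query terms appear in the title and text."""
    words = (record["title"] + " " + record.get("text", "")).lower().split()
    return sum(words.count(term) for term in query_terms)

def merge_results(result_lists, query):
    """Merge lists from several sources, drop duplicates, and rank by score."""
    terms = query.lower().split()
    best = {}
    for results in result_lists:
        for record in results:
            key = normalize(record["title"])
            scored = {**record, "score": score(record, terms)}
            # Keep the highest-scoring copy of each duplicate.
            if key not in best or scored["score"] > best[key]["score"]:
                best[key] = scored
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)

# Hypothetical results from two sources that overlap on one title.
library_a = [
    {"title": "Deep Web Search", "text": "search of deep web databases"},
    {"title": "Card Catalogs", "text": "history of library catalogs"},
]
library_b = [
    {"title": "deep  web search", "text": ""},
    {"title": "Federated Search Basics", "text": "federated search across sources"},
]

merged = merge_results([library_a, library_b], "search")
for record in merged:
    print(record["score"], record["title"])
```

Real systems use far more careful duplicate detection (identifiers such as ISBNs or DOIs, fuzzy matching) and more sophisticated relevance scoring than a raw term count.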

Relevant Results

The work doesn't stop once a site is up and running. The U.S. Department of Energy's Office of Scientific and Technical Information (OSTI) maintains the science.gov site, which provides a common public search interface for thirty scientific databases from a dozen federal agencies. It also maintains the newly launched worldwidescience.org site, which searches the scientific databases of ten countries.

In February, OSTI released version 4.0 of science.gov, which is created and maintained by Deep Web Technologies. The release added relevance ranking based on the full text of each document, rather than just the metadata and summary.

“Adding full-text relevance ranking was the most significant improvement, but there were others,” said OSTI director Walt Warnick. “We also added alert services where you can put a query in and each week you get an email about anything new that has turned up in any of the thirty databases, without repeating what you found previously.”
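
The alert behavior Warnick describes, reporting only what is new since the last run, comes down to remembering which records have already been sent. The following Python sketch shows the idea under stated assumptions: the Alert class, the in-memory database and the record IDs are hypothetical, and it does not represent how science.gov implements its alerts.

```python
class Alert:
    """A saved query that reports each matching record only once."""

    def __init__(self, query, search_fn):
        self.query = query
        self.search_fn = search_fn
        self.seen_ids = set()  # IDs already reported to the user

    def run(self):
        """Run the saved query and return only records not reported before."""
        new_records = [
            r for r in self.search_fn(self.query) if r["id"] not in self.seen_ids
        ]
        self.seen_ids.update(r["id"] for r in new_records)
        return new_records

# Hypothetical data source that grows between weekly runs.
database = [{"id": "A1", "title": "Fusion energy review"}]

def search(query):
    return [r for r in database if query.lower() in r["title"].lower()]

alert = Alert("fusion", search)
week_1 = alert.run()  # first run reports the existing match
database.append({"id": "B2", "title": "Fusion reactor materials"})
week_2 = alert.run()  # only the newly added record
week_3 = alert.run()  # nothing new, so nothing to send
```

A production service would store the seen IDs persistently and send the results by email. Here they live in memory only.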

And that, as Fuess said, is the key to developing an effective federated search engine: including all the relevant data sources without burying the user under more hits than can be reviewed in the time available.