Petabyte Power

By Stephen Brobst

So, just for starters and to be sure we're all clear on the subject, how big is a petabyte? Well, a terabyte is 10 to the 12th power bytes (a one followed by 12 zeroes) and a petabyte is 10 to the 15th power bytes (a one followed by 15 zeroes), or a thousand terabytes. In human terms it goes something like this: if every PC had a 50GB hard drive, storing a petabyte would take 20,000 PCs.
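
A few lines of Python spell out the same arithmetic (decimal units, as above):

```python
# The arithmetic behind the figures above, in decimal units.
terabyte = 10**12            # bytes: a one followed by 12 zeroes
petabyte = 10**15            # bytes: a one followed by 15 zeroes

print(petabyte // terabyte)  # 1000 -- a petabyte is a thousand terabytes

pc_drive = 50 * 10**9        # a 50GB hard drive, in bytes
print(petabyte // pc_drive)  # 20000 -- PCs needed to hold one petabyte
```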

Okay then, the question now becomes, What do I do with that much storage? Who needs it? Who wants it?

Well, quite a few people in fact. Very large data warehouses are already providing significant return on investment to the companies using them. In today's world, competitive advantage comes not from differences in prices and products, but from having more detailed information about your customers and potential customers than the competition does.

Converting prospects into loyal customers means presenting them with just the right products, services, and information at just the right time. Companies can do this only if they have collected enough detailed information about each prospect to identify the important patterns and have the proper systems in place to put information together and act upon it in a timely manner.

The companies that do the best job will be the winners: "retail is detail," as they say. Technology has given companies the power to collect detailed data in quantities (hundreds of terabytes already, with a petabyte not far in the future) and deploy it in time frames (seconds) that once would have seemed possible only in science fiction. Searching and deploying such huge volumes of data so quickly demands scalability.

Scalability is the ability to add more processing power to a hardware configuration and have a linearly proportional increase in performance. Or, looked at another way, it is the ability to add hardware to store and process increasingly larger volumes of data (or increasingly complex queries or increasingly larger numbers of concurrent queries) without any degradation in performance. A poor design or product deployment does just the opposite: it causes performance to deteriorate faster than data size grows.
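
To make "linearly proportional" concrete, here is a minimal sketch of how one might measure it. The function names and runtime figures are illustrative, not drawn from any particular system:

```python
def speedup(baseline_runtime, scaled_runtime):
    """How much faster the same workload runs after hardware is added."""
    return baseline_runtime / scaled_runtime

def scaling_efficiency(baseline_runtime, scaled_runtime, hardware_ratio):
    """1.0 means perfectly linear scalability: doubling the hardware
    halves the runtime (or doubles throughput for the same workload)."""
    return speedup(baseline_runtime, scaled_runtime) / hardware_ratio

# Illustrative figures: a query that ran in 600s on 4 nodes now runs in 310s on 8 nodes.
print(scaling_efficiency(600, 310, hardware_ratio=8 / 4))  # ~0.97, close to linear
```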

Scientific institutions such as Lawrence Livermore National Laboratory have been working with massive amounts of data (hundreds of terabytes) for hydrodynamic and particle-in-cell simulations for decades. They have often custom-developed the programs, operating systems, and compilers needed to exploit scalable hardware for these purposes.

However, companies such as SBC have brought this capability into the mainstream with commercial systems capable of harnessing hundreds of top-of-the-line Intel CPUs, many hundreds of gigabytes of addressable memory, and hundreds of terabytes of disk space, all supporting a single, integrated database.

So, what's involved in successfully designing and deploying such a system? True scalability has four dimensions:

Dimension One: Handling the Size

Every day businesses gather staggering amounts of data that can be used to support key business applications and enterprise decision-making. Meanwhile, the price per megabyte is falling. Yet the question remains: does the extra data add enough value to justify the expense of storing it?

It does if businesses can efficiently retrieve richly detailed answers to strategic and tactical business queries.

Assume, for example, that a multinational bank wants to score the lifetime value of its customers in one key customer segment. If the database still uses a serial approach to data processing, such a query could bog down the system. In contrast, a divide-and-conquer approach to those massive amounts of data, deploying parallel technology and a "shared nothing" architecture, delivers answers to key business questions more quickly and more reliably. That's where quantifiable business value begins.
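
As a rough illustration of that divide-and-conquer idea (not any vendor's implementation), the sketch below hash-partitions a set of toy customer records across independent workers, each of which scores only its own slice of the data, shared-nothing style. The record layout and scoring formula are invented for the example:

```python
from multiprocessing import Pool

# Toy customer records; the fields and the scoring formula are invented.
customers = [{"id": i, "monthly_revenue": 50 + i % 40, "tenure_months": 12 + i % 60}
             for i in range(100_000)]

def partition(rows, n_workers):
    """Hash-partition rows so each worker owns a disjoint slice -- "shared nothing"."""
    parts = [[] for _ in range(n_workers)]
    for row in rows:
        parts[hash(row["id"]) % n_workers].append(row)
    return parts

def score_partition(rows):
    """Each worker scores only its own partition: no shared state, no coordination."""
    return [(r["id"], r["monthly_revenue"] * r["tenure_months"]) for r in rows]

if __name__ == "__main__":
    n_workers = 4
    with Pool(n_workers) as pool:
        partial_results = pool.map(score_partition, partition(customers, n_workers))
    scores = [pair for part in partial_results for pair in part]  # combine the pieces
    print(len(scores))  # 100000 -- every customer scored exactly once
```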

Dimension Two: The Challenge of Concurrent Queries

Large corporations need to enable thousands of queries from anywhere within the organization at any time, covering both long- and short-range needs. The multinational bank in the example above might also need fraud detection for countless credit card transactions. Managers might want an analysis of monthly sales figures. Multiply all this by hundreds of business units across various geographic regions and the need for concurrent query capabilities becomes quite clear.

Handling concurrent queries demands that a data warehouse possess sophisticated resource management capabilities. As queries come in, the parallel database must be able to satisfy multiple requests and scan multiple tables.
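
One simple way to picture that resource management, purely as a sketch and not a description of any real warehouse engine, is a bounded pool of execution slots that admits hundreds of concurrent requests but runs only a fixed number at a time:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(query_id):
    """Stand-in for a warehouse query: short tactical lookups mixed with longer scans."""
    time.sleep(random.uniform(0.01, 0.1))
    return f"query {query_id} done"

# A fixed pool of execution slots: 200 requests are accepted concurrently,
# but only max_workers run at once, so no single burst starves the rest.
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(run_query, q) for q in range(200)]
    results = [f.result() for f in futures]

print(len(results))  # 200 -- every concurrent request eventually serviced
```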

Dimension Three: Maintaining Business Relationships Among Complex Data

Handling increased data complexity is another challenge for optimizing queries in massive databases. For example, building a simple customer profile might once have involved three or four interrelated data points stored in disparate data marts.

Now it might involve thirty or forty data points, all housed in one enterprise data warehouse. If the warehouse can only create a gargantuan table with billions of pieces of generically categorized transaction data, all the processing capacity in the world isn't going to deliver a useful customer profile. Even if the warehouse can separate the data into different tables, if it can't preserve the business relationships among the tables, then both the ability to analyze that data and, in turn, the business value are compromised.

Therefore, as warehouses increase their capacity, they also must create a super-efficient "file system" specifically for analytic queries. The system should contain multiple tables, but preserve the business relationships across subject areas for easy cross-referencing and extensibility. For the customer profile example, the deeply detailed information contained in those tables can now deliver unique insights for product development, marketing programs, or a number of other critical business challenges.
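
To illustrate the point, here is a toy example (using Python's built-in SQLite, with invented table and column names) of keeping subject areas in separate tables linked by keys, so that a cross-subject customer profile falls out of ordinary joins rather than a single gargantuan transaction table:

```python
import sqlite3

# Separate subject-area tables linked by customer_id, rather than one generic
# transaction table: the keys preserve the business relationships for analysis.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE account  (account_id INTEGER PRIMARY KEY, customer_id INTEGER, product TEXT);
    CREATE TABLE txn      (txn_id INTEGER PRIMARY KEY, account_id INTEGER, amount REAL);

    INSERT INTO customer VALUES (1, 'retail'), (2, 'private');
    INSERT INTO account  VALUES (10, 1, 'checking'), (11, 1, 'card'), (12, 2, 'card');
    INSERT INTO txn      VALUES (100, 10, 250.0), (101, 11, 40.0), (102, 12, 900.0);
""")

# A cross-subject-area profile comes from simple joins across the preserved keys.
profile = db.execute("""
    SELECT c.customer_id, c.segment, COUNT(DISTINCT a.account_id), SUM(t.amount)
    FROM customer c
    JOIN account a ON a.customer_id = c.customer_id
    JOIN txn     t ON t.account_id  = a.account_id
    GROUP BY c.customer_id, c.segment
""").fetchall()

print(profile)  # [(1, 'retail', 2, 290.0), (2, 'private', 1, 900.0)]
```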

Dimension Four: Support for Sophisticated Data Queries and Data Mining

Finally, the super data warehouse must be prepared to handle queries and data mining that ask for more than a tally of last month's shoe sales. For example, scoring the lifetime value of a customer is really a question with many component parts. The warehouse must be able to break down the various components and determine an efficient route for gathering the appropriate information.

A cost-based optimizer is supposed to automate this process in most databases, but too often database administrators end up having to intervene, which is costly and time-consuming. A data warehouse that truly delivers petabyte-type value would have an optimizer that handles sophisticated queries and data mining without human intervention.
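
The sketch below shows the principle behind a cost-based optimizer in miniature: enumerate candidate join orders, estimate each one's cost from table statistics, and pick the cheapest without human intervention. The cardinalities, selectivity, and cost model are invented and far cruder than anything a real optimizer uses:

```python
from itertools import permutations

# Toy table cardinalities and a crude cost model: the cost of a join order is the
# sum of its estimated intermediate result sizes, using one fixed join selectivity.
table_rows = {"customer": 10_000_000, "account": 25_000_000, "txn": 2_000_000_000}
JOIN_SELECTIVITY = 1e-7

def plan_cost(order):
    rows, cost = table_rows[order[0]], 0
    for table in order[1:]:
        rows = rows * table_rows[table] * JOIN_SELECTIVITY  # estimated intermediate size
        cost += rows
    return cost

# Enumerate every join order, cost each one, and pick the cheapest automatically.
best = min(permutations(table_rows), key=plan_cost)
print(best, f"{plan_cost(best):,.0f}")
```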

In the world of data, value derives from increasingly detailed and timely business intelligence that informs decision-making across an enterprise. Unless the data warehouse can efficiently organize increasingly complex data and optimize sophisticated and concurrent queries, the amount of data stored is meaningless.

What's exciting about the petabyte is that the capabilities to do something with that data are on the verge of becoming a reality. That's a development worth heralding.