Tackling Data Quality

By Allen Bernard


As companies try to realize the value locked away in the vast amounts of data they generate and store every day, an old problem is fast becoming a newly vexing issue: data quality.

Specifically, how do you know which data to use when, say, looking for a total sales figure for a business customer that has five different divisions and multiple corporate brands?

Some business units in your organization have referred to GM, for example, as Chevrolet, GMC, and Buick for years, because those are the only divisions they do business with. Others, in accounting for example, simply record GM as GM, because that is what works for them, and so on.
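The roll-up problem this creates can be sketched in a few lines. The mapping and sales figures below are purely illustrative, but they show why a naive per-name total understates sales to a parent company whose divisions are recorded under separate names:

```python
# Hypothetical mapping from divisional/brand names to a canonical parent.
# In a real master data management system this mapping is the hard part.
CANONICAL = {
    "Chevrolet": "GM",
    "GMC": "GM",
    "Buick": "GM",
    "GM": "GM",
}

def total_sales_by_parent(records):
    """Sum sales per canonical parent rather than per raw customer name."""
    totals = {}
    for name, amount in records:
        parent = CANONICAL.get(name, name)  # unmapped names stand alone
        totals[parent] = totals.get(parent, 0) + amount
    return totals

sales = [("Chevrolet", 120_000), ("GMC", 80_000), ("Buick", 50_000)]
print(total_sales_by_parent(sales))  # {'GM': 250000}
```

Without the mapping, each divisional name would show a partial figure and no report would reveal that this is one quarter-million-dollar customer.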

This can be costly, said Bob Hagenau, co-founder and VP of Product Management and Corporate Development for Purisma, a master data management vendor.

"A lot of times (companies) can't track entitlements (for example) and they end up giving free support out that can cost them millions and millions of dollars," he said. " … you've got all these silos of customer information that can't be integrated and don't allow you to have a complete understanding of that customer."

While these disparities are nothing new, the push today to mine data for new opportunities, regulatory compliance, and trends and patterns in customer behavior (such as how much they buy from you—or you from them—in a given year) is bringing this thorny issue to the fore.

"When you talk about it at a high level it sounds like it ought to be easy," said Philip Russom, senior manager of Research and Services at TDWI: The Data Warehousing Institute. "But the truth of the matter is different applications take a very different view of the customer and require different pieces of information about the customer."

With 90% of the data stored in corporations today concerning their customers, their products, or their financials, this is where the problem is most pervasive and the pain greatest, said Russom, who authored Taking Data Quality to the Enterprise through Data Governance, a report on the issue published in March.

But it isn't that the data itself is necessarily bad or inaccurate; it is the definition of the data (the metadata) across the enterprise that causes the most consternation, said Majid Abai, president and CEO of Seena Technologies, an enterprise information management and architecture consulting firm.

When metadata doesn't agree, the underlying information is hard to use from one application or division to the next. "There are several levels of the problem," said Abai. "Number one is the definition of data, the metadata, does not match from one business unit to another."

Number two is IT's obsession over the years with applications, not data. Garbage in, garbage out, as the expression goes, remains as much an issue today as ever. And the third issue isn't poor data quality at all but the copying of data from one application to another instead of drawing on a shared pool, so that every application is parsing the same numbers.
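Abai's first issue, mismatched metadata, is easy to demonstrate. In this hypothetical sketch (field names and records are invented for illustration), two business units query the same rows but apply different definitions of "customer" and get different counts:

```python
# The same underlying records, queried under two definitions of "customer".
records = [
    {"id": 1, "status": "billed"},
    {"id": 2, "status": "prospect"},
    {"id": 3, "status": "billed"},
    {"id": 4, "status": "trial"},
]

# Sales counts anyone it is actively working as a customer...
sales_customers = [
    r for r in records if r["status"] in ("billed", "prospect", "trial")
]

# ...while accounting counts only accounts that have actually been billed.
accounting_customers = [r for r in records if r["status"] == "billed"]

print(len(sales_customers), len(accounting_customers))  # 4 2
```

Neither unit's data is wrong; the two answers disagree because the definitions behind them disagree, which is exactly the metadata problem Abai describes.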

"This is the problem we are having in corporations: we have never learned to gather data in one place, clean it, and distribute it, and we are learning that," said Abai.

But, today, that is changing. Service-oriented architectures (SOA) and compliance with Sarbanes-Oxley (SOX) are forcing companies and IT departments to use data from a shared resource: SOA is decoupling applications from databases, and SOX is forcing companies to become more aware of the importance of getting their numbers straight.
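The shared-resource point can be sketched minimally. In this illustrative example (all names and figures are invented), two applications that snapshot their own copies of customer data drift apart after an update, while applications that read through a single shared store always see the same numbers:

```python
# A single shared store of customer master data.
shared_store = {"ACME-001": {"name": "Acme Corp", "credit_limit": 50_000}}

# Anti-pattern: each application takes its own copy at integration time.
app_a_copy = {k: dict(v) for k, v in shared_store.items()}
app_b_copy = {k: dict(v) for k, v in shared_store.items()}

# A later update reaches only one copy, and the applications now disagree.
app_a_copy["ACME-001"]["credit_limit"] = 75_000
assert app_a_copy["ACME-001"] != app_b_copy["ACME-001"]

# Preferred: every consumer reads through the shared store, so one update
# is visible to all of them at once.
shared_store["ACME-001"]["credit_limit"] = 75_000

def credit_limit(customer_id):
    return shared_store[customer_id]["credit_limit"]

print(credit_limit("ACME-001"))  # 75000
```

Decoupling applications from private databases, as the article says SOA encourages, is what makes the second pattern practical at enterprise scale.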

To begin the remediation process, the first step is to sit down with all concerned parties and decide things like whose data is correct, whose metadata best describes the data in question, and how to define terms like "customer."

Not an easy task by any stretch, but until it is completed, all the automation in the world will not get you past the core of the problem: people doing what people do to make their lives and work easier.

"That's the thing," said Russom. "At some level, a lot of companies stall because they get into arguments. People will argue about, 'Well, how do we define customer?' And those (issues) have to be resolved somehow before you really want to tackle moving data around.

"Sometimes people think, 'Well, it's data, so we must be able to hurl a technical solution at the problem.' But really, I'm here to tell you, the technical thing is half of it. Various types of actions taken by people are the other half."

Of course, since this is an old problem dating back to mainframe days, there is another option, said Russom: Do nothing. If it ain't broke, don't fix it.

"Sometimes the data is in terrible condition but the business can tolerate it. If you can tolerate the bad condition of data, don't fix it."