DATA VALIDATION
With no built-in mechanism to validate the data before it enters the system, the use of punch cards and the "butterfly ballot" leads to problems with vote validation. When using a punch card ballot (which, according to the LA Times, was used by more than 37 percent of registered voters nationwide in 1996), the voter selects a candidate by poking out the chad, the perforated section that should be ejected when the hole is punctured. The cards are read by a tabulation machine, which counts a vote when it reads the hole in the card.
The validation issue occurs when the chad is not completely
ejected. The automated tabulation of both "hanging chads" (chads that
are still partially attached) and "pregnant chads" (chads that are
bulging but not punched out) is questionable, and so it is not clear
whether all votes are counted. What constitutes a valid vote selection is
primarily based on whether the tabulation machine can read the card.
In the case of recounts, the cards are passed through the reader multiple
times. In that process some of the hanging chads are shaken free, which leads to different tallies after each recount.
In addition, if someone mistakenly punches out more than one
selection, the vote is automatically nullified. It is claimed that 19,000
ballots were disqualified because more than one vote for president had
been made on a single ballot. This is an example where a policy to pre-
qualify the ballot before it is sent to be counted could be instituted.
Since the rules for what constitutes a valid vote are well described, it
should be possible to have a machine evaluate the punch card to deter-
mine whether it is valid or not, and notify the voter that the ballot
would be invalidated before it is cast.
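As a rough sketch of such a pre-qualification check (a hypothetical Python fragment with invented data structures, not part of any actual voting system), a machine could verify that exactly one selection was fully punched for each race before accepting the ballot:

```python
# Hypothetical sketch of a ballot pre-qualification check. A ballot is
# modeled as a mapping from race name to the set of punch positions the
# reader detected as fully ejected chads.

def prequalify(ballot: dict, races: list) -> list:
    """Return a list of problems; an empty list means the ballot can be cast."""
    problems = []
    for race in races:
        selections = ballot.get(race, set())
        if len(selections) == 0:
            problems.append(f"No selection detected for {race}.")
        elif len(selections) > 1:
            # An overvote like this would nullify the vote for the race.
            problems.append(f"More than one selection detected for {race}.")
    return problems

# Example: warn the voter before the card is cast.
for issue in prequalify({"President": {3, 7}}, ["President"]):
    print("Warning:", issue)
```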
In the case of the 2000 election, the networks were led to predict
the winner of Florida incorrectly, not just once, but twice. The first
error occurred because predicting elections is based on statistical models
generated from past voting behavior that 1) were designed to catch vote
swings an order of magnitude greater than the actual (almost final) tal-
lies and 2) did not take changes in demographics into account. This
meant that the prediction of Gore's winning Florida was retracted
about 2 hours after it was made.
EXPECTATION OF ERROR
According to Title IX, Chapter 102 of Florida law, "if the returns for
any office reflect that a candidate was defeated or eliminated by one-
half of a percent or less of the votes cast for such office...the board
responsible for certifying the results of the vote...shall order a recount
of the votes..."
This section of the law carries a data-accuracy implication: there is an expected margin of error of one-half of one percent of the votes cast. The automatic recount is a good example where the threshold for potential error is recognized and where there is defined governance associated with a data quality problem.
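To make the threshold concrete, here is a small illustrative computation (hypothetical vote counts, with the two leading candidates used as a stand-in for all votes cast for the office):

```python
# Illustrative check of the automatic recount rule: a recount is ordered
# when the margin of defeat is one-half of one percent or less of the
# votes cast for the office. The vote counts below are invented.

def recount_required(winner_votes: int, runner_up_votes: int) -> bool:
    total_votes = winner_votes + runner_up_votes   # simplification: two candidates only
    margin = winner_votes - runner_up_votes
    return margin <= 0.005 * total_votes

print(recount_required(1_000_000, 996_500))   # True: margin is under 0.5 percent
print(recount_required(1_000_000, 980_000))   # False: margin is about 1 percent
```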
TIMELINESS
We expect to know the outcome of an election by the time we wake up the next day. Even in a close election when the results are
inconclusive, there are timeliness constraints for the reporting and certi-
fication of votes.
At that company, the records in the accounts database were treated as just that: accounts. Over a period of time, however, some people there
became convinced of the benefits of looking at the people associated
with those accounts as customers, and a new project was born that
would turn the accounts database inside out. My role in that project was
to interpret the different information paradigms that appeared in the
accounts database name and address field. For it turned out that a single
customer might be associated with many different accounts, in many dif-
ferent roles: as an owner, a trustee, an investment advisor, and so forth.
I learned two very interesting things about this project. The first
was that the knowledge that can be learned from combining multiple
databases was much greater than from the sum total of analyzing the
databases individually. The second was the realization that the prob-
lems that I saw at this organization were not limited to this company; in fact, these problems are endemic, not only within the financial industry but in any industry that uses information to run its business.
The insight that brought the world of data quality full circle was this: Every business process that uses data has some inherent assumptions and expectations about the data. These assumptions and expectations can be expressed in a formal way, and this formality can expose much more knowledge than simple database schemas and COBOL programs.
So I left that company and formed a new company, Knowledge Integrity Incorporated (www.knowledge-integrity.com), whose purpose is to understand, expose, and correct data quality problems. Our goal is
to create a framework for evaluating the impacts that can be caused by
low data quality, to assess the state of data quality within an enterprise,
to collect the assumptions and expectations about the data that is used,
and recast those assumptions and expectations as a set of data quality
and business rules. In turn, these rules are incorporated as the central
core of a corporate knowledge management environment, to capture
corporate knowledge and manage it as content.
This book is the product of that goal. In it, we elaborate on our
philosophy and methods for evaluating data quality problems and how
we aim to solve them. I believe that the savvy manager understands the
importance of high-quality data as a means for increasing business
effectiveness and productivity, and this book puts these issues into the
proper context. I hope the reader finds this book helpful, and I am cer-
tainly interested in hearing about others' experiences. Please feel free to
contact me at [email protected] and let me know how
your data quality projects are moving along!
Here is a news story taken from the Associated Press newswire. The
text is printed with permission.
Newark: For four years, a Middlesex County man fooled
the computer fraud programs at two music-by-mail clubs,
using 1,630 aliases to buy music CDs at rates offered only to
first-time buyers.
INTRODUCTION
The Mars Climate Orbiter, a key part of NASA's program to explore the
planet Mars, vanished in September 1999 after rockets were fired to
bring it into orbit of the planet. It was later discovered by an investiga-
tive board that NASA engineers failed to convert English measures of
rocket thrust to newtons, the metric unit of force, and that this failure was the root cause of the loss of the spacecraft. The orbiter
smashed into the planet instead of reaching a safe orbit.
This discrepancy between the two measures, which was relatively
small, caused the orbiter to approach Mars at too low an altitude. The
result was the loss of a $125 million spacecraft and a significant setback
in NASA's ability to explore Mars.
After having been a loyal credit card customer for a number of years, I
had mistakenly missed a payment when the bill was lost during the
move to our new house. I called the customer service department and
explained the omission, and they were happy to remove the service
charge, provided that I sent in my payment right away, which I did.
A few months later, I received a letter indicating that "immediate
action" was required. Evidently, I had a balance due of $0.00, and because
of that, the company had decided to revoke my charging privileges! Not
only that, I was being reported to credit agencies as being delinquent.
Needless to say, this was ridiculous, and after some intense conver-
sations with a number of people in the customer service department,
they agreed to mark my account as being paid in full. They notified the
credit reporting agencies that I was not, and never had been, delinquent
on the account (see Figure 1.1).
One would imagine that if any business might have the issue of data
quality at the top of its list, it would be the direct marketing industry. Yet, I
recently received two identical pieces of mail the same day from the
local chapter of an association for the direct marketing industry. One
was addressed this way:
David Loshin
123 Main Street
Anytown, NY 11787
Dear David,. . .
The other was addressed like this:
Loshin David
123 Main Street
Anytown, NY 11787
Dear Loshin,. . .
1.1.9 Conclusions?
These are just a few stories culled from personal experience, interac-
tions with colleagues, or reading the newspaper. Yet, who has not been
subject to some kind of annoyance that can be traced to a data quality
problem?
[Table 1: Tracking history for the equipment I ordered; the package progress log is not reproduced here.]
Over the past 30 years, advances in data collection and database tech-
nology have led to massive legacy databases controlled by legacy soft-
ware. The implicit programming paradigm encompasses both business
policies and data validation policies as application code. Yet, most
legacy applications are maintained by second- and third-generation
engineers, and it is rare to find any staff members with firsthand experi-
ence in either the design or implementation of the original system. As a
result, organizations maintain significant ongoing investments in daily
operations and maintenance of the information processing plant, while
mostly ignoring the tremendous potential of the intellectual capital that
is captured within the data assets.
1.2.3 Why Data Quality Is the Pivot Point for Knowledge Management
The use of a manufacturing chain assumes that multiple stages are asso-
ciated with the final product, and at each stage there is an expectation
that the partially completed product meets some set of standards. Infor-
mation processing is also a manufacturing chain: pieces of information flow into and out of processing stages where some set of operations is performed using the data.
To continue this analogy, when a product developed on a manufac-
turing chain does not fit the standards required at a specific stage, either
the product must be thrown away or fixed before it can continue down
the manufacturing line. Information is the same way: When a data item does not meet the standards expected at a particular stage, it must either be discarded or corrected before processing can continue.
4. http://www.dw-institute.com/whatworks9/Resources/warehousing/warehousing.html
Have you ever been billed for service that you have not received? Or
have you been threatened with being reported to a credit bureau for
being late on a payment on a balance due of $0.00? Many people have
some nightmare experience with which they can relate, always associ-
ated with some incorrect information that causes pain in the pocket-
book. These stories always seem to end with the customer ending his or
her relationship with the vendor or product provider over the matter.
These errors are typically due to some mistake on the part of the service or product provider, whether in customer records, customer billing, product pricing, or during data processing. In every case, the problem is worsened by the fact that the organization at fault evidently has no means of proactive error detection in place. This conclusion may be drawn because it is the customer who is doing the error detection. While significant expense goes into acquiring new customers, it is worthwhile to invest time and money in improving the data collected on current customers, since customer attrition may be tied directly to poor data quality.
Just as poor data fosters mistrust among current customers, it can also
cast doubt in the minds of potential customers. When a potential cus-
tomer is presented with an offer backed by high-quality data, the image
of the seller is enhanced, which can improve the opportunity to turn a
potential customer into a real customer.
As an example, consider this real-life pitch we recently received in
the mail. My wife and I recently took a trip with our 5-month-old baby.
We purchased a seat for our child at a discount rate because of her
young age. About a month after our trip, our baby received a letter
from the airline and a major long-distance carrier, offering her 10,000
frequent flyer miles if she switched her long-distance service. Clearly,
this cooperative sales pitch, cobbled together between the airline and
the long-distance carrier (LDC), may be effective some of the time, but
consider this: The airline knew that we had bought a discount seat for
our baby, and typically babies don't have authority to make purchasing
decisions in a household. In addition, both my wife and I received the same pitch from the same airline-LDC combination on the same day! Because we saw that the long-distance carrier was unable to keep its own customer data straight, the offer cast doubt on its ability to manage our account correctly.
Having been involved in some legacy migration projects, I can say from
direct experience that the most frustrating component of a migration
project is the inability to accumulate the right information about the
data and systems that are being migrated. Usually, this is due to the ten-
dency of implementers to program first, document later, if at all. But as systems age, they are modified, broken, fixed, or improved, often without any updates to the documentation.
This situation forces the integrators to become information archae-
ologists to discover what is going on within the system. Naturally, undi-
rected discovery processes will increase costs and delay the actual
implementation, and the amount of time needed to determine what is
going on cannot be predicted ahead of time.
When we have assessed the data quality requirements of our system and
put in place the right kinds of processes to validate information as it
passes through the system, we can limit the downtime and failed processes caused by low data quality. In turn, this means that without having to diagnose and fix data quality problems, processing can proceed smoothly.
Business operations are defined using a set of rules that are applied in
everyday execution. When the business depends on the correct flow of
information, there is an aspect of data quality that intersects the opera-
tional specification.
In essence, in an information business, business rules are data quality
rules. This implies that data quality is an integral part of any operational
specification, and organizations that recognize this from the start can
streamline operations by applying data quality techniques to information
while it is being processed or communicated. This in turn will prevent bad
data from affecting the flow of business, and denying the entry of incor-
rect information into the system eliminates the need to detect and correct
bad data. Because of this, a "data quality aware" operation can execute
at lower cost and higher margin than the traditional company.
1.4.2 Micromarketing
Because the data quality problem ultimately belongs to the data con-
sumer, it is a good idea to start out by establishing ownership and bound-
aries. Chapter 2 focuses on the issues of data ownership. The chapter
begins by discussing the data processing activity as a manufacture of
information. The final product of this factory is knowledge that is
owned by the data consumers in a business enterprise.
Who are the data producers and data consumers in an enterprise?
We look at internal data producers (internal processes like account
opening, billing, marketing) and external data producers ("lead lists,"
consumer research, corporate structure data). We also look at the enter-
prise data consumers, ranging from the operational (customer service,
billing, resource planning), the tactical (middle management, schedul-
ing), and strategic consumers (directional management, strategists).
There are complicating notions with respect to data ownership.
The means of dissemination, collecting data from the public domain, as
well as acquiring data from data providers confuse the issues. There-
fore, a set of ownership paradigms is introduced, including decision
makers, sellers, manipulators, guardians, and workers. These para-
digms bound the "social turf" surrounding data ownership. Finally,
Chapter 2 focuses on a finer granularity of ownership issues, including
metadata ownership, governance of storage and repositories, and
accountability for data policies.
The rationale for designing and building data quality systems seems
logical, but an economic framework that can be used to measure the cost of low data quality is also needed.
In this chapter, we begin to explore the ideas revolving around data types
and how data types are related to the notion of sets. We then describe
our definition of data domains, both descriptive and enumerated. Next,
we discuss the relations between domains and how those relations exist in actual data sets.
Actors represent the roles that users play, use-cases represent what the
actors do with the system, and triggers represent events that initiate use
cases. We then select from the list of data quality dimensions from Chap-
ter 5 those dimensions that are of greatest importance to the actors and
define data quality rules, as described in Chapter 8. We can choose
thresholds for conformance to the data quality rules as a baseline for
acceptance of the data. These baseline thresholds are defined so that, when they are met, the data set consumers can be confident of the level of data quality.
for Statistical process control, all within the context of the data con-
sumer's constraints.
One technique not yet discussed is using the results of the information
validity exercise to prevent the continuation of low data quality events.
When using the rule-based system for validity checking, we can use the
information in the reports to look for root causes of the occurrences of
bad data quality. This technique is the last link in our improvement chain,
since fixing the sources of bad data will directly improve overall data
quality.
Before we delve into the details of what data quality means and how it
relates to knowledge management, we should establish where the respon-
sibility for data quality falls within a company. Without a clear assign-
ment of accountability, it is almost impossible to measure the quality of
data, much less effect improvements.
This chapter examines the question of data ownership as the first
step in establishing a knowledge-oriented organization. We begin by dis-
cussing data processing activity as the manufacture of information; knowledge, which is owned by the data consumers in a business enterprise, is the final product of this factory.
Who are the data producers and data consumers in an enterprise?
We look at internal data producers (internal processes like account
opening, billing, marketing) and external data producers ("lead lists,"
consumer research, corporate structure data). We also look at the enter-
prise data consumers, ranging from the operational (customer service,
billing, resource planning), the tactical (middle management, schedul-
ing), and strategic consumers (directional management, strategists).
There are complicating notions with respect to data ownership.
The means of dissemination, collecting data from the public domain, as
well as acquiring data from data providers confuse the issues. There-
fore, a set of ownership paradigms is defined, including decision mak-
ers, sellers, manipulators, guardians, and workers. These paradigms
bound the "social turf" surrounding data ownership.
Finally, we try to resolve some of these issues by investigating data
policy paradigms. This includes metadata ownership, governance of
storage and repositories, and accountability for data policies.
A relatively simple analogy for processing data that we use throughout the
book is the information factory. Any information processing activity can
be viewed as a small factory that takes some data as raw input, processes
that input, and generates some information result, potentially generating
data by-products and side effects in the process. Inside the factory there
may be smaller subfactories, each with its own input/output production
activity. The raw input data are provided by data suppliers external to the
organization or by data manufacturers within the organization. The ulti-
mate data customers may be internal or external consumers.
To be more precise, let's look at the different roles that exist in the con-
text of the information factory. These roles may represent real people or
automated proxies within the system.
1. Suppliers: Data suppliers provide information to the system.
2. Acquirers: Acquirers accept data from external suppliers for
provision into the factory.
3. Creators: Internal to the factory, data may be generated and
then forwarded to another processing stage.
4. Processors: A processor is any agent that accepts input and gen-
erates output, possibly generating some side effects.
5. Packagers: A packager collates, aggregates, and summarizes
information for reporting purposes.
6. Delivery Agents: A delivery agent delivers packaged informa-
tion to a known data consumer.
7. Consumer: The data consumer is the ultimate user of processed
information.
8. Middle Manager: The people responsible for making sure the
actors are correctly performing their jobs.
9. Senior Manager: The senior manager is responsible for the over-
all operation of the factory.
10. Deciders: These are senior-level managers associated with
strategic and tactical decision making.
Each of these actors plays a well-defined role in the data processing
operation, and each is responsible at some level for quality assurance.
2.2.1 Value
Ownership is often claimed based on who created the database or what silo currently manages the system. But at
the core, the degree of ownership (and by corollary, the degree of
responsibility) is driven by the value that each interested party derives
from the use of that information.
2.2.2 Privacy
2.2.3 Turf
The effort invested in creating, packaging, and distributing a report naturally leads one to the conception of owning the data that makes up the report.
2.2.4 Fear
2.2.5 Bureaucracy
Bureaucracy, combined with the turf and fear factors, may account for the failure of many enterprise infrastructure renovation projects.
A major concern for any data system is the coordination and authoriza-
tion of access. In a system that contains data that is in any way sensi-
tive, whether it is confidential information, human resource data, or
corporate intelligence, it is necessary to define a security and authoriza-
tion policy and to provide for its enforcement.
In addition to standard user support, the owner also holds the responsi-
bility for providing the data to the data consumers. This may include
data preparation, packaging and formatting, as well as providing a
delivery mechanism (such as a data portal or a publish/subscribe mech-
anism).
Aside from the maintenance of the system itself, there is also the mainte-
nance of the information. This includes managing the data input
process, instituting gauges and measurements associated with the data,
and creating data extraction and loading processes.
The data owner is also accountable for maintaining the quality of the
information. This may include determining and setting user data qual-
ity expectations, instituting gauges and measurements of the levels of
data quality, and providing reports on the conformance to data quality.
This also includes defining data quality policies for all data that flows
into the system and any data cleansing, standardization, or other prepa-
ration for user applications.
All data processing operations have business rules. Whether these rules
are embedded in application code, abstracted into a rules format, or
just documented separately from their implementation, the data owner
is also responsible for managing business rules.
Managing metadata involves the data definitions, names, data types, data
domains, constraints, applications, database tables, reference reposito-
ries, and dependence rules associated with different tables and databases,
users, access rights, and so forth.
We can enumerate owner responsibilities, but that does not solve the prob-
lem of assigning (or declaring) data ownership. Instead of trying to pro-
actively dictate an ownership model, it is more helpful to explore different
existing ownership paradigms. In each one of these paradigms, we will
look at the question of value and how it relates to the claim of ownership.
In this paradigm, the party that creates or generates the data owns the
data. It represents a speculative investment in creating information as a
prelude to recognizing value from that information in the future.
This ownership paradigm indicates that the party that consumes the
data owns that data. This is a relatively broad ownership spectrum,
covering all aspects of data acquisition. In this paradigm, any party that
uses data claims ownership of that data. When the consumer requires a
high level of confidence in the data input into a process, this ownership
paradigm is very logical, since the party that cares most about the value
of the data claims ownership (and thus, responsibility). In this case, the
consumer derives the value from the data.
An example of this is a sales organization that uses information
provided from different organizations within a company. Once the data
lands at the sales staff's door, though, the information becomes integral
to the proper operation of the sales team, and so the sales team will
claim ownership of the data that it consumes.
In this paradigm, the user that commissions the data creation claims
ownership. Here there are two parties involved: the one that pays for
the creation of data and the one that actually creates the data. In this
case, the patron claims ownership, since the work is being done on his
or her behalf.
An example is a company that commissions a research organiza-
tion to prepare a competitive intelligence report covering a particular
industry. The company may stipulate that it is the sole owner of the provided data.
For example, companies that decode genetic material can sell the data that they decode to the medical and pharmaceutical industries.
This paradigm revolves around the subject data ownership issues, such
as personal privacy or image copyrights. In this view, the subject of the
data claims ownership of that data, mostly in reaction to another party
claiming ownership of the same data.
As an example of the privacy issue, consider a pharmacy filling prescriptions. Drug companies are interested in knowing which doctors are prescribing their medicines, and doctors like to know which of their patients are refilling prescriptions as a tool to see how well their patients are complying with their treatment.
The final paradigm is the model of global data ownership. Some feel
that monopolization is wrong and data should be available to all with
no restrictions. Clearly, in the business world, this is a radical view, and
it has its benefits as well as detriments.
This ownership model is often in operation, to some degree, in sci-
entific communities, where experimentation, followed by the publish-
ing of results, is common practice. In this situation, a common goal is
the increase in the global knowledge of a particular subject, and results
are subject to other experts' scrutiny.
A data ownership policy should address the complications discussed in Section 2.2, as well as hash out the strict definitions of ownership described in Section 2.4. The data ownership
policy specifically defines the positions covering the data ownership
responsibilities described in Section 2.3. At a minimum, a data owner-
ship policy should enumerate the following features.
1. The senior-level managers supporting the enforcement of the enumerated policies
2. All data sets covered under the policy
3. The ownership model (in other words, how ownership is allocated or assigned within the enterprise) for each data set
4. The roles associated with data ownership (and the associated reporting structure)
5. The responsibilities of each role
6. Dispute resolution processes
7. Signatures of the senior-level managers listed in item 1
A template for describing the ownership policy for a specific data set is
shown in Figure 2.1.
Once the stakeholders have been identified, the next step is to learn what data sets should fall under the ownership policy. The stakeholders should be interviewed to register the data sets with which they are associated and the degree of stake each believes he or she has in the data. The goal of this step is to create a metadatabase of data sets to use in the enforcement of the data ownership policies. This catalog should contain the name of the data set, the location of the data set, and the list of stakeholders associated with the data set. Eventually, the catalog will also maintain information about data ownership and responsibilities for the data set.
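A minimal sketch of one catalog entry, assuming Python and purely illustrative field names, might look like this:

```python
# Illustrative sketch of an entry in the data set catalog (metadatabase)
# used to support enforcement of the data ownership policy. All field
# names and values are assumptions for the example.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DataSetCatalogEntry:
    name: str                      # name of the data set
    location: str                  # where the data set resides
    stakeholders: list             # parties claiming a stake in the data
    owner: Optional[str] = None    # filled in once ownership is assigned
    responsibilities: dict = field(default_factory=dict)  # role -> assignee

catalog = [
    DataSetCatalogEntry(
        name="customer_accounts",
        location="production database, ACCOUNTS schema",
        stakeholders=["billing", "customer service", "marketing"],
    ),
]
```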
The next step is to determine the roles that are associated with each set
of data in the enterprise and describe the responsibilities of each role.
Here are some examples, although this list is by no means meant to be
exhaustive.
Chief Information Officer The CIO is the chief holder of
accountability for enterprise information and is responsible for deci-
sions regarding the acquisition, storage, and use of data. He or she is
the ultimate arbiter with respect to dispute resolution between areas of
ownership and is the ultimate manager of the definition and enforce-
ment of policies.
Chief Knowledge Officer The chief knowledge officer is responsi-
ble for managing the enterprise knowledge resource, which dictates and
enforces the data sharing policies, as well as overseeing the general
pooling of knowledge across the organization.
Data Trustee The data trustee manages information resources
internal to the organization and manages relationships with data con-
sumers and data suppliers, both internal and external.
Policy Manager The policy manager maintains the data owner-
ship policy and negotiates any modifications or additions to the data
ownership policy.
Data Registrar The data registrar is responsible for cataloging
the data sets covered under the policy as well as the assignment of own-
ership, the definition of roles, and the determination of responsibilities
and assignments of each role. The data registrar also maintains the data
policy and notifies the policy manager if there are any required changes
to the data ownership policy.
Data Steward The data steward manages all aspects of a subset
of data with responsibility for integrity, accuracy, and privacy.
Data Custodian The data custodian manages access to data in
accordance with access, security, and usage policies. He or she makes
sure that no data consumer makes unauthorized use of accessed data.
Data Administrator The data administrator manages production
database systems, including both the underlying hardware and the
database software. The data administrator is responsible for all aspects
related to the infrastructure needed for production availability of data.
Security Administrator The security administrator is responsible
for the creation of and the enforcement of security and authentication
policies and procedures.
Director of Information Flow The director of information flow is
responsible for the management of data interfaces between processing
stages, as well as acting as an arbiter with respect to conflicts associated
with data flow interfaces.
Director of Production Processing The director of production
processing manages production processing operations, transference of
data from one production source to another, scheduling of processing,
and diagnosis and resolution of production runtime failures.
Director of Application Development The director of application
development manages requirements analysis, implementation, testing,
and deployment of new functionality for eventual turnover to the pro-
duction facility.
Data Consumer A data consumer is an authorized user that has
been granted access rights to some data within the enterprise.
Data Provider A data provider is an accepted supplier of infor-
mation into the system.
The ownership registry is created from the data catalog and the assign-
ment of roles. It is the enterprise log that can be queried to determine
who has the ultimate responsibility for each data set. The ownership
registry should be accessible by all interested parties, especially when
new data requirements arise or there is a conflict that needs resolution.
Management of the ownership registry requires keeping a pulse on
the organization, as it is not unusual for employee turnover to affect the
data management structure. In addition, as new data sets are added to
the governance by the data ownership policy, the decisions regarding
the new data must be added to the registry.
This brings us back to the issue of data quality. Once we have estab-
lished a chain of command for the ownership of data, we can look at
how the responsibility for data quality falls with respect to the policy. A
major factor is the relationship between ownership and care, which is
explored in this section, along with the enforcement of data policies and
an introduction to data quality rules.
When a conflict arises, the data ownership policy is the first reference for resolving the dispute. If the issue is not covered in the data ownership policy, then the policy needs to be modified to incorporate the issue.
2.7 SUMMARY
In this chapter, we explore what data quality means and how it can be
effected in the enterprise. Without a definition for data quality, how can
we ever hope to improve it? And since everybody has a different idea as
to what data quality means, how can we level-set the members of an
organization so that an improvement in data quality can be measured
and recognized?
With the simple definition of data quality as "fitness for use," we
can start to plan how to improve data quality across the organization.
We first outline our procedure for a data quality improvement program,
which incorporates the gaining of senior management consensus, train-
ing, analysis, and implementation in a way that allows for a continuous
improvement program to build on each individual success.
We will also spend some time discussing what data quality means
within certain implementation domains: operations, databases, data
warehousing, data mining, electronic data interchange (EDI), and the
Internet. In each of these domains, we will explore the importance of
data quality and how a data quality program can be integrated into the
domain as an integral component.
What does data quality mean? In practicality, almost everyone has a dif-
ferent view of data quality. To the mailing list manager, data quality
means cleansed delivery addresses and deduplication. To the account
manager, data quality means accurate aggregation of customer activity.
To the medical industry, data quality may mean refined ability for record
linkage. Clearly, each definition is geared toward the individual's view of
what is "good" and what is not. This leads to the conclusion that there is
no hard and fast definition of data quality. Rather, data quality is defined
in terms of how each data consumer desires to use the data.
In the most general sense, we will use a qualitative definition of
data quality and refine that definition on a case-by-case basis. In essence,
we define data quality in terms of fitness for use: the level of data quality is determined by data consumers in terms of meeting or beating expectations. In practice, this means identifying a set of data quality
objectives associated with any data set and then measuring that data
set's conformance to those objectives.
This is not to say that the tools used for static data cleansing of
names and addresses or products that link data records based on spe-
cific data fields are not useful. It is true, however, that the use of these
tools is not a solution to the data quality problem. Instead, the best way
to get a handle on an organization's data quality is to define a set of
expectations about the data, measure against those expectations, and
continuously improve until those expectations are satisfied.
What has proven to be difficult up until now is that because every-
one's data sets are different, there are no well-defined means for defin-
ing data quality expectations. In this book, we address this need by
developing all the tools needed to determine if there is a data quality
problem, to measure the cost effect of low data quality, to assess the
current state of the organization's data, and to develop data quality
rules that can be used for measurement. But the first step in any of these
processes is to understand the notions of data quality and to get senior-level management support for the assessment and improvement
of enterprise data quality.
If we want to refine the definition of fitness for use, the first area of
focus is limiting the "badness" of the data. We can refer to this aspect as
"freedom from defects," where a defect is any situation where data val-
ues are not accessible or do not correspond in accuracy to an estab-
lished frame of reference. While we will explore the areas where defects
can crop up in Chapter 5, here is a short list of the kinds of defects we
want to avoid.
Inaccessibility
Inaccuracy
Out-of-date information
Unmanageably redundant information
Inconsistency with other sources
Incomplete data
Incomprehensible data
The flip side of freedom from defects is that the information has the
characteristics of a high-quality environment. Again, we will explore
these characteristics in Chapter 5, but here is the short list.
The information is timely.
The data model completely and accurately models its real-world
counterpart.
The information is presented in a way that is easy to understand.
The appropriate level of detail is maintained.
The information captured is meaningful in its proper context.
The first step of the improvement program enlightens senior management about the issues associated with data quality and shows how an integrated set of data quality solutions can add value to the organization. This enlightenment can be
effected through a number of steps, including initial training in knowl-
edge management and data quality, followed by the creation and
endorsement of a data ownership policy, along with the analysis that
demonstrates the economic impact of low data quality and the eco-
nomic value of measurable high data quality.
Once the policy and its enforcement procedures are in place, the next
step is to identify those areas in greatest need of improvement. (Chapter
4 presents a framework for finding those areas.) In our economic model
of low data quality, we provide a mechanism for characterizing the actual impact of low data quality both within and external to the organization.
This is done by taking these steps.
1. Looking for the signs of data quality problems
2. Mapping the flow of information into, through, and out of the
enterprise
3. Characterizing the impacts of low data quality at particular
stages in the information chain
4. Measuring the cost impact of low data quality
5. Building the data quality scorecard
The data quality scorecard is a tool used to focus on the locations
in the information chain where there are data quality problems that
have the greatest impact on the organization. The scorecard can be used
as input to the next step, current state assessment.
With the data quality scorecard, the current state assessment, and the
requirements analysis, there is enough data to select a project for
improvement. With senior management support, a team is assembled
and assigned a specific goal: to bring up the level of measured data qual-
ity to the target level determined during the requirements analysis.
Selecting a single project for execution is important. Unfortunately,
many data quality improvement projects fail because their scope is far too large. It may be impossible to demonstrate
overall improvement if there is no particular focus. Remember that the
overall success of the program is determined by small successes in small
steps. Selecting a high-profile but small project for improvement, and suc-
cessfully completing that project, accomplishes three things.
1. It provides a measurable (both in hard metrics and in economic
benefit) improvement in the quality of information in the
enterprise.
2. It gains positive association within the organization for accom-
plishment, which in turn builds more senior-level buy-in and
general consensus.
3. It opens the possibility for additional improvement projects.
[Figure: Data quality and business rules indicate how information and control flows in the process; rules can direct the packaging and presentation of information to the consumer.]
Data quality and business rules can specify what information must look like before it can enter a specific process, as well as validate the information as it passes from producer to consumer. In addition, data validation and business rules can be used as
triggers for particular events within an operational environment. For
example, threshold limits can be set for operational efficiency based on
the amount of invalid information in the system. When these thresholds
are encountered, an event is triggered that notifies a supervisor of the
presence of a problem and the location of the problem.
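A rough sketch of such a trigger (invented names, not any particular product's interface) might track the fraction of invalid records and notify a supervisor when the configured limit is crossed:

```python
# Sketch of a validation threshold trigger: when the fraction of invalid
# records seen at a processing location exceeds the configured limit, an
# event is raised identifying the location of the problem. All names are
# illustrative.

class ValidationMonitor:
    def __init__(self, location, invalid_limit, notify):
        self.location = location            # where in the information chain this sits
        self.invalid_limit = invalid_limit  # e.g., 0.02 means 2 percent invalid records
        self.notify = notify                # callback that alerts a supervisor
        self.total = 0
        self.invalid = 0

    def record(self, is_valid):
        self.total += 1
        if not is_valid:
            self.invalid += 1
        if self.invalid / self.total > self.invalid_limit:
            self.notify(f"Invalid-record rate at {self.location} is "
                        f"{self.invalid}/{self.total}, above the "
                        f"{self.invalid_limit:.1%} limit.")

monitor = ValidationMonitor("order intake", 0.02, notify=print)
for ok in [True, True, False, True, False]:
    monitor.record(ok)
```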
[Figure: A legacy database, its metadata, and business rules alongside a set of data quality rules; the dependence relationships between data attributes are broken out into the set of data quality rules.]
Data marts and data warehouses are used for an analytical environment.
It is often said that the bulk of the work performed in implementing a
data warehouse is in the data extraction, cleansing, and transformation
phases of moving information from the original source into the ware-
house. Nevertheless, many data warehouse projects fail because not
enough attention is paid either to understanding the data quality requirements or to the validation and quality assurance of information
imported into the warehouse.
As an example, you can use data quality and business rules for what we
call "data warehouse certification." Certification is a means of scoring
the believability of the information stored in a data warehouse. A data
warehouse is considered fit for use when the data inside conforms to a
set of data quality expectations embodied in a set of rules. Given these
rules, we assign a score to the quality of the data imported into a data
warehouse for certifying warehouse data quality (see Figure 3.5).
The first step is to define a set of rules that will qualify the data.
Again, we can use the rules framework that is described in Chapters 7
and 8. The next step is to import those data quality rules into a rules
engine. Each rule will have an associated validity threshold (as a per-
centage) based on the users' expectations of quality.
As records are fed into the engine, any relevant rules (that is, any
rules that refer to values of attributes defined within the record) are
tested. If no rules fail, the record is said to be valid and is successfully
gated through to the warehouse. If any rules fail, the record is enhanced
with information about which rules were violated, and the record is
output to a reconciliation system. The violating record can also be
passed through to the warehouse, but now it is marked as having not
conformed to the users' expectations, and this information can be used
when performing analysis. The count of failures and successes is main-
tained for each rule.
After the data is imported, each rule's validity value is computed as the ratio of valid records to the total number of records. A data quality certification report delineating all validity percentages is generated. If all valid-
ity percentages exceed the associated thresholds, the warehouse is
certified to conform to the users' data quality requirements. Otherwise,
the warehouse is not certified, and until the percentages can be brought
up to the conformance level, the warehouse cannot be said to meet the
data quality requirements.
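A minimal sketch of this certification computation, assuming Python and purely illustrative rules and thresholds (not any particular rules engine), follows; for simplicity it applies every rule to every record.

```python
# Sketch of data warehouse certification scoring: each rule's validity is
# the ratio of records that pass it to the records tested, and the
# warehouse is certified only if every validity percentage meets its
# threshold. The rules and thresholds are invented for illustration.

rules = {
    # rule name -> (predicate over a record, validity threshold as a percentage)
    "customer_id_present": (lambda r: bool(r.get("customer_id")), 99.0),
    "amount_non_negative": (lambda r: r.get("amount", 0) >= 0, 98.0),
}

def certify(records):
    passed = {name: 0 for name in rules}
    tested = 0
    for record in records:
        tested += 1
        for name, (predicate, _) in rules.items():
            if predicate(record):
                passed[name] += 1
    report = {}
    certified = True
    for name, (_, threshold) in rules.items():
        validity = 100.0 * passed[name] / tested if tested else 100.0
        report[name] = validity
        certified = certified and validity >= threshold
    return certified, report

ok, report = certify([{"customer_id": "C1", "amount": 25.0},
                      {"customer_id": "", "amount": -5.0}])
print(ok, report)   # not certified: both rules fall below their thresholds
```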
To qualify the warehouse after a failed certification, the records
output to the reconciliation system must be analyzed for the root cause
of the failures. This analysis and correction is part of a business work-
flow that relies on the same set of data quality and business rules used
for validation. After reconciliation, the data is resubmitted through the
rules engine, and the validity report is generated again. The root cause
information is used to return to the source of problems in the legacy
data and to correct those bad data at the source, gradually leading to
certification.
Data warehouse and data mart certification is an ongoing process,
and the certification report needs to be aligned with the ways that data
records are inserted into the warehouse. For example, if a data mart is
completely repopulated on a regular basis, the certification process can
be inserted as a component to the reloading process. Alternatively, if a
data warehouse is incrementally populated, the results of the certifica-
tion engine must be persistent.
Electronic Data Interchange (or EDI) is the term used for any standard-
ized format for representing business information for the purposes of
electronic communication. EDI is enabled through a process of cooper-
ative data standardization within a particular business environment. It
is used in many industries today, such as the health care and financial
industries, and is firmly entrenched in interactions with the federal gov-
ernment.
EDI is used to eliminate manual processing when executing routine
business operations such as purchase orders, product orders, invoicing,
securities trading transactions, shipping notices, and so forth. This
increases efficiency and volume, thereby lowering the overall cost per
transaction.
EDI is more than just forwarding purchase orders and invoices via
e-mail or through Intranet postings. It is designed to use standards for
formatting and transmitting information that are independent of the
hardware platform. EDI enables a process known as straight-through
processing (STP), which is the ability to completely automate business
operations with no manual intervention.
Any STP or EDI activity is almost by definition one that is based on
data quality. Because EDI is defined as a standard form for transmitting
information, there are rules about validation of both the format and
the content of EDI messages. Precisely because STP is meant to replace
well-defined operational processes, there are specific business rules that
guide the STP applications in how to process the transaction.
What is more relevant is that not only must an STP application be
able to execute operations based on a set of predefined business rules, it
must also be able to distinguish between valid and invalid EDI messages.
Validity is more than just structural conformance to the EDI format; it must also include validation of the content. STP systems often
will only perform the structural validity checks, and not the content
validity, which may account for slips in actual complete automation.
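To make the distinction concrete, here is a small sketch using an invented, simplified message layout (not any actual EDI standard) in which a message can be structurally well formed yet still fail content validation:

```python
# Sketch distinguishing structural validation (the message carries the
# required fields) from content validation (the values make business
# sense). The message layout is invented for illustration.

REQUIRED_FIELDS = {"order_id", "quantity", "unit_price", "ship_date"}

def structurally_valid(message):
    return REQUIRED_FIELDS.issubset(message)

def content_problems(message):
    problems = []
    if message["quantity"] <= 0:
        problems.append("quantity must be positive")
    if message["unit_price"] < 0:
        problems.append("unit price cannot be negative")
    return problems

msg = {"order_id": "PO-1001", "quantity": -3,
       "unit_price": 4.95, "ship_date": "2000-11-07"}
if structurally_valid(msg):
    # Structurally valid, but content validation flags the quantity.
    print("content issues:", content_problems(msg))
else:
    print("malformed message")
```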
3.7.1 XML
Another aspect of the use of the Internet is the growth of both business-
to-consumer (B2C) and business-to-business (B2B) EDI frameworks for
executing transactions. Whether we talk about amateur day traders,
competitors in online auctions, or B2B order processing and fulfillment,
there is an increase in the use of the Internet as a distributed transaction
factory.
There are at least five different data query paradigms over the Internet.
1. Hyperlinking As we said, hyperlinks can be seen as canned
queries that lead to (relatively) static data sets. As Web pages
evolve, sets of hyperlinks are grouped under distinct headings to
further refine and classify the virtual query.
2. Local data searching engines A local search engine that has
indexed the information present in a Web site's set of Web pages
can produce a set of references to Web pages that matches a
user's specific query.
3. Service queries This allows the web page to act as a front end
to an actual database service that is connected to the Web site,
although it may act independently of the Web site. Many
e-commerce sites operate in this manner. An example is a query
through an online catalog site as to the number of requested
items currently in stock and ready to be shipped.
4. Web-wide search engines These are web search engines that
act as front ends to databases of indexed Web pages. This can be considered
an extension to the localized Web searching.
Once when I was using the Internet to collect information for a report I
was preparing on some data management tools, I visited a number of
Web sites for companies that provided those tools. At one particular
corporate site, I found a menu of white papers that were clearly relevant
to my task, but in order to download the papers, I was required to fill
out a form with pertinent contact information: name, mailing address,
phone number, e-mail address, and so forth.
After submitting the form, I was gated through to the Web site to
download the white papers, at which point I made my selection and
downloaded the paper. After reading the paper, I realized that there had
been another paper at that site that I also wanted to read. I repointed
my browser at the site, only to find that I was required to fill out the
form again. Only after resubmitting all the same information was I able
to download the other white paper.
About two days later, I was called by a company representative
who introduced himself and asked if I had successfully downloaded the
white paper (not papers!) and inquired about my interest in their prod-
uct. I replied that I had gotten the papers, but that I had to fill out the
form twice. I asked him if my filling out the form twice meant that I
would appear in their database twice. He said it would. I then asked if
that was a problem, considering that if I were called more than once
about the product, I might get annoyed. The salesman answered that
while I would appear in the database more than once, he was aware of
the problem and usually was able to mentally keep track of whom he
had already called. We briefly talked about data quality and duplicate
record elimination, but he seemed convinced that this was not a serious
dilemma.
About four hours later, I found a message on my voice mail. Who
was it? It was the same salesman, calling to introduce himself and to see
if I had successfully downloaded the white paper! What was ironic was that the salesman not only completely exposed his inability to keep track in general but also completely forgot our earlier conversation focusing on this exact problem!
The simple practice of collecting sales lead information through
Web-based forms is a compelling idea as long as the operational aspects
don't backfire. We can see at least three ways how the absence of a data
quality program associated with this Web site can diminish the effec-
tiveness of the process.
Across one dimension, by filling out that form I expressed interest
in the area associated with the products being sold. As a potential cus-
tomer, filling out the form is a proactive action, exposing me as a quali-
fied sales lead. Yet, by registering a visitor in the database more than
once, the system dilutes the effectiveness of this sales lead qualification.
Across a second dimension, a measurable cost of customer acquisi-
tion is increased each time the salesman makes a telephone call. Associ-
ated with each call is some expectation of converting the lead into a
sale. By calling the same person twice, however, the salesman is not only
performing rework but preventing himself from getting real work done,
such as following up on other sales leads.
Across a third dimension, there is a strategic weakening of the tar-
geted sales process. When targeting a potential customer, it is in the
salesman's best interest to foster a good working relationship with that
customer. Interrupting a decision maker once during a day to gauge
interest is probably bearable, but someone might be irritated by having
his or her work routine interrupted twice in the same day by the same
person. I would probably make the assumption that the company is not
very organized, which would make me question their ability to provide
a good product.
3.9 SUMMARY
In this chapter, we saw that the meaning of data quality can be very dependent on the context and domain in which the data is scru-
tinized. We looked at the development of a data quality improvement
program and what it takes to get the ball rolling. We determined that
there are seven phases to the program.
1. Gaining senior level endorsement
2. Training in data quality
3. Creating and enforcing a data ownership policy (as discussed in
Chapter 2)
4. Building the economic model (as will be discussed in Chapter 4)
5. Performing a current state assessment (as will be described in
Chapter 9)
6. Selecting a project for improvement
7. Implementing and deploying the project
Of course, with each successful project implementation and
deployment, there are opportunities to select new improvement pro-
jects, which turns phases 6 and 7 into a cycle, although the entire
process may need to be repeated.
We also looked at different contexts for data quality, namely
data quality in an operational context, data quality in the database
world, data quality and the data warehouse, data mining, electronic
data interchange, and the Internet. While there is a distinct need for
data quality in each of these contexts, the actual work that is done with
respect to data quality is understated but will grow in importance.
4 ECONOMIC FRAMEWORK OF DATA QUALITY AND THE VALUE PROPOSITION
It is said that the best new customers are your current customers. In
other words, you are more likely to sell something to your current cus-
tomer base than to new customers. So when your organization is
always focusing on new business and not closing deals with current cus-
tomers, this may be evidence of customer dissatisfaction, which can be
related to poor data quality.
A high demand for customer service indicates that problems are leaving
the enterprise and making it to the customers. When the customer ser-
vice budget increases, especially when the customer service department mostly attends to billing problems, this is likely indicative of a
data quality problem.
Small data sets with problems result in small sets of errors, but as data
sets grow, so does the size of the error sets. In some cases the growth of
the error sets is linear, but it can increase exponentially compared to the
source data. If the number of errors multiplies rapidly, the organization
will not be able to scale its systems to address those problems.
76 ENTERPRISE KNOWLEDGE MANAGEMENT
Although it is clear that data are used in both operational processing and
decision-making processes, outside the implementation arena, these
processes are often considered "black boxes" that take input data as "raw
material" and generate value-added information as output. This output
then proceeds to another processing operation (or another black box), or
it is summarized in a report that serves as input to a decision-making process.
The first step in understanding the effect of low data quality is
peering inside the black box to identify the steps through which the
input data is converted into usable information. For simplicity, let's
divide the world into one of two data flow models: the strategic data
flow, used for decision-making, and the operational data flow, used for
data processing. Either model represents a data processing system, and,
also for simplicity, let's reduce the number of processing stages to an
abstract minimum. Note that there are, of course, exceptions to these
generalizations, but we'll use them as the generic model for determining
the COLDQ.
[Figure: An operational data flow for a hotel reservation, involving the customer, the reservation system, the hotel database, credit card processing, and confirmation processing, with data items such as dates, credit card information, customer ID, amount, query results, and the approved/declined response passed between stages.]
Each communication channel in the figure identifies the source and the target processing stage as well as the data items that
are communicated through that channel.
A strategic data flow represents the stages used for the decision-making
process, as shown in Figure 4.2.
[Figure 4.2: A strategic data flow, from data supply through processing and packaging stages to customer presentation. Annotations note that there is likely to be more than one data supplier, that there may be many processing and packaging steps in actual processing, that there may be multiple customers for the same packaged data, and that collected information from all the credit reporting agencies is merged at one point.]
4.4 IMPACTS
We'd like to assume that if there are no data quality problems, either of
the data flows described in Section 4.1 will operate smoothly. It is reason-
able to rely on decisions based on valid data and to expect that an operational
system will function smoothly as long as no invalid data items gum up the works.
Issues appear when the information chains involve low-quality data.
The effects of low data quality propagate through the systems, ultimately
leading to poor decision making, tactical difficulties, and increased costs.
Hard impacts are those whose effects can be estimated and/or mea-
sured. These include the following.
Customer attrition
Costs attributed to error detection
Costs attributed to error rework
Costs attributed to prevention of errors
Costs associated with customer service
Costs associated with fixing customer problems
Time delays in operation
Costs attributable to delays in processing
Soft impacts are those that are evident and clearly have an effect on pro-
ductivity but are difficult to measure. These include the following.
Difficulty in decision making
Costs associated with enterprise-wide data inconsistency
Organizational mistrust
Lowered ability to effectively compete
Data ownership conflicts
Lowered employee satisfaction
Low data quality has an impact on the operational domain, the tactical
domain, and the strategic domain. Within each domain, the different
kinds of cost measures and their effect on the economic model must be
evaluated. Note that in all three domains relying on incorrect or unfit
data will have a noticeable impact.
The strategic domain stresses the decisions that affect the longer term.
Strategic issues are proactive, less precise decisions that address "where
to be" over a long time horizon. The burden of strategic decisions falls
to the senior executives of an organization.
4.7.1 Detection
Detection costs are those incurred when a data quality problem pro-
vokes a system error or processing failure, and a separate process must
be invoked to track down the problem. Error detection only happens
when the system has the ability to recognize that an error has occurred.
Sometimes this is signaled by a total system failure, such as an incor-
rectly supplied divisor of 0 that causes a divide-by-zero interrupt.
Sometimes it is signaled by an abnormal end during transac-
tion processing because of an invalid data record.
The cost of error detection is mostly associated with three activi-
ties: determining where the failure occurred, determining what caused
the system failure, and determining the seriousness of the problem. This
cost is mostly attributable to employee activity, although there are also
costs associated with the purchase and maintenance of diagnostic tools.
4.7.2 Correction
the process. Correction involves figuring out what the incorrect item
should have been and then searching for the critical point of correction.
Correction may require a modification to data, a modification to process-
ing (software or operations), or both. The cost of correction encompasses
all these activities.
4.7.3 Rollback
4.7.4 Rework
4.7.5 Prevention
4.7.6 Warranty
Data quality problems that affect customers incur costs associated with
both fixing the problem and compensating customers for damages.
These are warranty costs. Any risks and costs associated with
legal action are also rolled up as warranty costs.
4.7.7 Spin
4.7.8 Reduction
4.7.9 Attrition
4.7.10 Blockading
4.8.1 Delay
If data are not accessible, or the timely availability of the data is con-
strained, the decision-making process becomes delayed. A delay in
making a decision will spread to the operational arena as well, causing
productivity delays.
4.8.2 Preemption
4.8.3 Idling
4.8.7 Misalignment
4.8.9 Decay
4.8.10 Infrastructure
Now that we have all the pieces for our economic model, let's look at
the actual steps involved in building it.
1. Map the information chain to understand how information flows
within the organization.
2. Interview employees to determine what people are doing with
respect to data quality issues.
3. Interview customers to understand the impacts on customers.
4. Isolate flawed data by reviewing the information chain and locat-
ing the areas where data quality problems are manifested.
5. Identify the impact domain associated with each instance of poor
data quality.
6. Characterize the economic impact based on the ultimate effects
of the bad data.
7. Aggregate the totals to determine the actual economic impact.
8. Identify opportunities for improvement.
The result is what we can call a data quality scorecard, shown in
Figure 4.6. This scorecard summarizes the overall cost associated with
low data quality and can be used as a tool to find the best opportunities
for improvement.
[FIGURE 4.6 Building the data quality scorecard. Map the information chain to understand how information flows within the organization; use employee interviews to understand what people are doing with respect to data quality; conduct customer interviews to understand customer impacts; isolate flawed data by reviewing the information chain and finding data quality problems; identify the impact domain associated with each instance of poor data quality; characterize the economic impact associated with each instance of low data quality; and aggregate the totals to determine the actual economic impact.]
With the results from the interviews in hand, it is time to start annotat-
ing the information chain. At each point where a data set is sent,
received, or manipulated, any sources of data flaws are
noted, along with a list of the activities attributable to those flaws.
With an information chain annotated with the list of both data flaws
and the activities associated with each of those flaws, it is time to start
attributing the flaws and activities to impact domains. For each source
of low data quality, the impact domains are selected, and each activity
is classified according to the classifications described in Sections 4.7
and 4.8.
We can now build a matrix associated with each data quality problem.
The first axis identifies the problem and its location in the information
chain. The second axis represents the activities associated with each
problem. The third axis denotes the impact areas for each activity. In
each cell in this matrix, we insert the estimated cost associated with that
impact, using the economic measures from Section 4.5. If no estimate
can be made, an indication of the order of magnitude of the impact
should be used.
Note that this matrix does not distinguish between hard and soft
impacts. The values assigned to each cell can represent actual dollar val-
ues or coded indications of level of impact. Figure 4.7 shows the data
quality scorecard matrix.
FIGURE 4.7 The data quality scorecard matrix

Reference  Data Quality Problem           Information      Activity                Impact                 Cost
ID                                        Chain Location
1          Malformed credit card numbers  Node 5           Credit Card Processing  Detection              $12,000.00
                                                           Contact Customer        Correction             $7,000.00
                                                                                   Rework                 $20,000.00
2          Invalid addresses              Node 0           Direct Marketing        Detection              $
                                                                                   Correction             $20,000.00
                                                           Reduced Reach           Acquisition Overhead   $4,500.00
                                                                                   Lost Opportunity       $9,000.00
3          Incorrect pick lists           Node 7           Shipping Processing     Detection              $25,000.00
                                                                                   Correction             $21,000.00
                                                           Customer Service        Warranty               $43,000.00
                                                                                   Spin                   $12,000.00
                                                                                   Attrition              $50,000.00
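To make the structure concrete, the matrix can be represented in code as a mapping from (problem, activity, impact) cells to estimated costs, which can then be rolled up along any axis. This is only a minimal illustrative sketch (the data structure and helper function are not part of the framework itself); the sample entries mirror a few rows of Figure 4.7.

from collections import defaultdict

# Each cell of the scorecard matrix: (problem, activity, impact) -> estimated cost.
# The entries mirror a few rows of Figure 4.7.
scorecard = {
    ("Malformed credit card numbers", "Credit Card Processing", "Detection"): 12000,
    ("Malformed credit card numbers", "Contact Customer", "Correction"): 7000,
    ("Invalid addresses", "Direct Marketing", "Correction"): 20000,
    ("Incorrect pick lists", "Customer Service", "Warranty"): 43000,
}

def totals_by(cells, axis):
    """Aggregate estimated costs along one axis: 0 = problem, 1 = activity, 2 = impact."""
    sums = defaultdict(float)
    for key, cost in cells.items():
        sums[key[axis]] += cost
    return dict(sums)

print(totals_by(scorecard, 0))   # total cost per data quality problem
print(totals_by(scorecard, 2))   # total cost per impact category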
The last component of this framework is using the model to look for the
biggest "points of pain." Having categorized the location and the
impacts of the different data quality problems, the next logical step is to
find the best opportunities for improvement, where the greatest value
can be gained with the smallest investment.
4.11 EXAMPLE
4.12 SUMMARY
5
DIMENSIONS OF DATA QUALITY
For our purposes, we group the data quality dimensions into the following
categories.
Data models
Data values
Information domains
Data presentation
Information policy
We will use an example data application for illustrative purposes.
First we will discuss the application of this data set, and the rest of the
chapter will examine what issues come into play when building the data
management system for this application.
5.2.2 Comprehensiveness
modeled to support all the applications that might draw from that data
set, which implies that all stakeholders in the application suite have had
their say in the design of the model. If users are sharing information that
serves different purposes, there may be other comprehensiveness require-
ments. Is the model comprehensive enough to allow the users to distin-
guish data based on their independent needs?
In our example, we are using both contact information and sales fig-
ures when representing our current customers. But with that same infor-
mation, the billing department can also run its applications, although it
may be more important to the billing department that there be a set of
attributes indicating if the product has been shipped, if the customer has
been billed, and if the customer has paid the bill. Therefore, the data
model must be comprehensive enough to support both sets of require-
ments as well as enable the collection and support of extracting only the
information that each data consumer needs.
5.2.3 Flexibility
5.2.4 Robustness
well as the definition of attribute types and domains to hold the possible
values that each attribute might contain in the future. Robustness also
involves defining attributes in ways that adapt to changing values with-
out having to constantly update the values.
In our example, we might want to keep track of how many years a
customer has been associated with our organization. A nonrobust way
to do this is with an attribute containing the number of years that the
person has been a customer. Unfortunately, for each customer this
attribute will need to be updated annually. A more robust way to main-
tain a customer's duration is to store the date of initial contact. That
way, the number of years that the customer has been retained can be
computed correctly at any time without having to change the attribute.
The Year 2000 problem (Y2K), for example, evolved because of a lack
of robustness in many data models: a date attribute that has four digits to
hold the year is more robust than a date attribute with only two digits.
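As a minimal sketch of the robust alternative (the field and function names here are illustrative, not taken from the example data set), the tenure can be derived on demand from the stored date of initial contact:

from datetime import date
from typing import Optional

def years_as_customer(initial_contact: date, as_of: Optional[date] = None) -> int:
    """Derive tenure from the stored date of initial contact rather than
    storing a year count that would have to be updated annually."""
    as_of = as_of or date.today()
    years = as_of.year - initial_contact.year
    # Subtract one if this year's anniversary has not yet been reached.
    if (as_of.month, as_of.day) < (initial_contact.month, initial_contact.day):
        years -= 1
    return years

print(years_as_customer(date(1995, 6, 15), date(2001, 3, 1)))   # 5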
5.2.5 Essentialness
On the other hand, a data model should not include extra information,
except for specific needs like planned redundancy or the facilitation of
analytical applications. Extraneous information requires the expense of
acquisition and storage, and unused data, by nature of its being
ignored, will have an entropic tendency to low data quality levels. In
addition, redundant information creates the problem of maintaining
data consistency across multiple copies.
Another potential problem with unessential attributes is the "over-
loading" effect. With applications that have been in production, it
becomes very hard to modify the underlying data model without causing a
lot of stress in the application code. For this reason, when a new attribute
needs to be added, a behavioral tendency is to look for an attribute that is
infrequently used and then overload the use of that attribute with values
for the new attribute. This typically manifests in program code as
conditional statements that test to make sure the overloaded
attribute is treated in the right manner (see Figure 5.1).
These kinds of conditionals are one basis for hidden business rules
that get buried in program code and/or are passed along as "lore"
within the information technology groups. The issue becomes a prob-
lem once the application has been in production for many years and the
original application implementers are long gone.
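A sketch of what such overloading looks like in practice follows; the attribute and flag values are hypothetical, invented purely to illustrate the buried business rule.

def interpret_discount_code(record: dict) -> str:
    """Hidden business rule: values beginning with 'X' in the overloaded
    discount_code attribute actually encode shipping priority."""
    code = record.get("discount_code", "")
    if code.startswith("X"):
        # Overloaded use: "X1" means overnight shipping, "X2" means two-day.
        return "ship overnight" if code == "X1" else "ship two-day"
    # Original use: the attribute really is a discount code.
    return "apply discount " + code

print(interpret_discount_code({"discount_code": "X1"}))       # ship overnight
print(interpret_discount_code({"discount_code": "SPRING"}))   # apply discount SPRING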
5.2.8 Homogeneity
case, we have two different kinds of customers that are being main-
tained within a single attribute.
Usually, this subclassing is evidenced by "footprints" in the accom-
panying application code. These conditional statements and extra sup-
port code are representative of business rules that are actually embedded
in the data and are only unlocked through program execution. As this
happens more and more, the application code will become more and
more convoluted and will require some sort of reverse engineering to
uncover the business rules. The evolution of subclassing of attribute val-
ues will eventually necessitate the insertion of new attributes by the data-
base administrator to allow distinction between entities in each subclass.
5.2.9 Naturalness
5.2.10 Identifiability
5.2.11 Obtainability
5.2.12 Relevance
5.2.13 Simplicity
[Figure: Sample customer records illustrating identifiability problems. The columns include CUSTOMER, ADDRESS, CITY, STATE, ZIP, CUSTOMER ACCT, CREDIT CARD, and ORDER LIMIT. The sample rows include "John Smith, 123 Main Street, New York, NY 10011," "John and Mary Smith, 123 Main St., New York, NY 10011," "Smith Family Trust, John Smith, 123 Main St, NY NY 10011," "Mary Cummings Smith, 123 Main Street, Apt. 4, NY NY," "John and Mary Smith UGMA John Jr., 123 Main Street, Apartment 4, New York 10011," and "Mary Smith, 123 Main Street, Apt. 4, NY NY." The annotations ask: Is a trust fund a customer? When two or three customer names are listed, how many customers are there? Are these two customers the same person?]
the second regards the degree to which there are different representa-
tions for the same value.
The first axis can be shown by this example: If our model has attrib-
utes for date of first contact and date of last sale, it is preferable to use the
same representation of dates in both attributes, both in structure (month,
day, and year) and in format (American vs. European date order). The
second axis can be measured by enumerating the different ways
that dates are maintained within the system. We will examine these issues
in the discussion of domains and mappings in Chapter 7.
Because low levels of quality of data values are the most obvious to
observers, when most people think of data quality, they think of these
most easily understood data quality dimensions. Low levels of data
value quality are likely to be recognized by users and customers alike
and are most likely to lead the user to conclusions about
the reliability (or lack thereof) of a data set. In an environment where
data are not only being used to serve customers but also as input to
automatic knowledge discovery systems (that is, data mining), it is
important to provide high levels of data quality for the data values.
Relying on bad data for decision-making purposes leads to poor strate-
gic decisions, and conclusions drawn from rules derived from incorrect
data can have disastrous effects.
Data value quality centers around accuracy of data values, com-
pleteness of the data sets, consistency of the data values, and timeliness
of information. Most data quality tools are designed to help improve
the quality of data values. In Chapter 8, we build a framework
for describing conformance to dimensions of data value quality as a set
of business rules that can be applied to the data set and used to measure
levels of data value quality.
5.3.1 Accuracy
Data accuracy refers to the degree with which data values agree with an
identified source of correct information. There are different sources of
correct information: a database of record, a similar, corroborative set of
data values from another table, dynamically computed values, the
result of a manual workflow, or irate customers. Inaccurate values don't
just cause confusion when examining a database; bad data values
result in increased costs. When inaccuracies reach the customers, costs
can increase due to added pressure on customer service centers,
searches for the inaccuracies, and the necessity to rework the process.
In our example data set, an inaccurate shipping address will result
in errors delivering products to the customers. The repercussions may
be great: A customer may delay payment, cancel the order, or even
cease to be a customer.
5.3.3 Completeness
5.3.4 Consistency
really mean? If we follow a strict definition, then two data values drawn
from separate data sets may be consistent with each other, yet both can
be incorrect. Even more complicated is the notion of consistency with a
set of predefined constraints. We may declare some data set to be the
"database of record," but what guarantees that the database of
record is itself of high quality?
More formal consistency constraints can be encapsulated as a set
of rules that specify consistency relationships between values of attrib-
utes, either across a record or message, or along all values of a single
attribute. These consistency rules can be applied to one or more dimen-
sions of a table or even across tables.
In our example, we can express one consistency constraint for all
values of a ZIP code attribute by indicating that each value must con-
form to the U.S. Postal Service structural definition. A second consis-
tency constraint declares that in every record, the ZIP code attribute's
value must be consistent with the city attribute's value, validated
through a lookup table. A third consistency constraint specifies that if
the ZIP code represents an area within a qualified geographic region,
the account specified by the account number field must be associated
with a salesman whose territory includes that geographic region.
The first consistency constraint applies to a single attribute. The
second applies to a relationship between two attributes within the same
record. The third constraint applies to values in different tables. Consis-
tency constraints can be arbitrarily complex, as shown by these three
examples, and they frequently reflect business rules inherent in the
applications using the data.
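A rough sketch of how the first two constraints might be checked programmatically follows. The record layout and the ZIP-to-city lookup table are illustrative assumptions, and the third (cross-table) constraint is omitted.

import re

# Assumed lookup table mapping ZIP codes to city names.
ZIP_TO_CITY = {"10011": "NEW YORK", "02134": "ALLSTON"}

def zip_is_well_formed(zip_code: str) -> bool:
    # First constraint: conform to the USPS structural definition
    # (five digits, optionally followed by a four-digit extension).
    return re.fullmatch(r"\d{5}(-\d{4})?", zip_code) is not None

def zip_matches_city(record: dict) -> bool:
    # Second constraint: the ZIP code must be consistent with the city,
    # validated through the lookup table.
    city = ZIP_TO_CITY.get(record["zip"][:5])
    return city is not None and city == record["city"].upper()

record = {"city": "New York", "zip": "10011"}
print(zip_is_well_formed(record["zip"]), zip_matches_city(record))   # True True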
5.3.5 Currency/Timeliness
site. To make sure that we do not charge a customer the wrong price,
we need to guarantee that the time lag between a product price change
and the new price's appearance on the Web site is minimized!
5.4.2 Stewardship
5.4.3 Ubiquity
Data quality does not apply only to the way that information is repre-
sented and stored. On the contrary, there are dimensions of data quality
that are related to the way that information is presented to the users
and the way that information is collected from the users. Typically, we
would like to measure the quality of labels, which are used for naming
or identifying items in a presentation; classification categories, which
indicate specific attributes within a category; and quantities, which
indicate the result of measurement or magnitude of quantitative values.
We also want to look at formats, which are mappings from data to
a set of symbols meant to convey information. The format for repre-
senting information depends on the application. As an example, in our
application, we might want to report on sales activity within certain
geographical ranges. We can take a dry approach and deliver a report of
numbers of sales by product within each region, sorted by dollars, or
we might take a different approach and provide a map of the region
with color-coding for each product, with intensities applied to indicate
the sales ranges.
Formats and presentations can make use of many different attributes.
Color, intensity, icons, fonts, scales, positioning, and so forth can all
contribute to how the information is communicated.
5.5.1 Appropriateness
A good presentation provides the user with everything required for the
correct interpretation of information. When there is any possibility of
ambiguity, a key or legend should be included.
In our example, consider an attribute that represents priority of
customer service calls, with the domain being integer values from 0 to
10. While it may have been clear to the data modeler that 0 represents
the highest priority and 10 the lowest, the presentation of that attribute
in the original integer form may confuse users who are not
familiar with the direction of the priority scale. The presentation of the
information, therefore, should not display the raw integer value of the prior-
ity, which might be confusing. Instead, provide an iconic format, such
as incremental intensities of the color red, to indicate the degree of
importance of each particular call. This is shown in Figure 5.3.
[FIGURE 5.3 An iconic presentation of priority, in which the shaded graphic indicates critical priority with darker shading.]
5.5.3 Flexibility
5.5.5 Portability
When the null value (or absence of a value) is required for an attribute,
there should be a recognizable form for presenting that null value that
does not conflict with any valid values. This means that for a numerical
field, a missing value should not be presented to the user as the value 0,
since the presence of any number there may have a different meaning
than the absence of a value. Also, if there
are ways of distinguishing the different kinds of null values (see Section
5.3.2), then there should also be different ways of presenting those null
values.
Over the past few years there has been an incredible reduction in the
cost of disk storage to the point where it seems silly to think about con-
serving disk space when building and using a database. Yet, just as the
interstate highway system encouraged travel by automobile, the high
availability of inexpensive disk space encourages our penchant for col-
lecting and storing data.
It is important to remember that even though disk space is inex-
pensive, it is not unlimited. A dimension of data quality, therefore, is in
the evaluation of storage use. This is not to say that the only issue is to
squeeze out every last bit. Instead, the issue is how effectively
the storage requirements are offset by other needs, such as performance
or ease of use. For example, a traditional relational database is
assumed to be in normal form, but in some analytical environments, data-
bases are deliberately denormalized to improve performance when
accessing the data.
5.6.1 Accessibility
5.6.2 Metadata
Metadata is data about the data in the system. The dimension of data
quality policy regarding metadata revolves around whether there is an
enterprise-wide metadata framework (which differs from a repository).
Is it required to maintain metadata? Where is it stored, and under
whose authority? Metadata is particularly interesting, and we cover it
in Chapter 11.
5.6.3 Privacy
If there is a privacy issue associated with any data set, there should be a
way to safeguard that information to maintain security. Privacy is an
issue of selective display of information based on internally managed
permissions. It involves the ways unauthorized users are prevented
from accessing data and ensures that data are secured from unautho-
rized viewing. Privacy is a policy issue that may extend from the way
that data is stored and encrypted to the means of transference and
whether the information is allowed to be viewed in a nonsecure loca-
tion (such as on a laptop while riding on a train).
5.6.4 Redundancy
5.6.5 Security
6
STATISTICAL PROCESS CONTROL AND IMPROVEMENT CYCLE
In any systems with causes and effects, the bulk of the effects are caused
by a small percentage of the causes. This concept, called the Pareto
Principle, has been integrated into common parlance as the "80-20
rule" 80 percent of the effect is caused by 20 percent of the causes.
This rule is often used to establish the degree of effort that must be
[Figure: A control chart, in which measurements are plotted over time against a center line and upper (UCL) and lower (LCL) control limits.]
Function   Runtime (seconds)
Foo        56
Bar        26
Baz        8
Boz        6
Raz        4
Faz        2
Together, these six functions account for 100 percent of the run-
time of the program, totaling 102 seconds. If we can speed up function
Foo by a factor of 2, we will have reduced the runtime of the entire
application by 28 seconds (half of the time of function Foo), or roughly
27 percent, making the total time 74 seconds. A subsequent improvement
in the function Bar by a factor of 2 will yield only an additional 13 seconds,
about 13 percent of the original runtime (though it is about 18 percent of
the 74 seconds that remained after the first improvement).
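The arithmetic can be reproduced directly from the runtime table; here is a minimal sketch.

runtimes = {"Foo": 56, "Bar": 26, "Baz": 8, "Boz": 6, "Raz": 4, "Faz": 2}
total = sum(runtimes.values())                    # 102 seconds in all

saving_foo = runtimes["Foo"] / 2                  # halving Foo saves 28 seconds
print(total - saving_foo)                         # 74.0 seconds remain
print(100 * saving_foo / total)                   # about 27.5 percent of the original runtime

saving_bar = runtimes["Bar"] / 2                  # halving Bar then saves 13 more seconds
print(100 * saving_bar / total)                   # about 12.7 percent of the original runtime
print(100 * saving_bar / (total - saving_foo))    # about 17.6 percent of what remained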
Our next step is to build a control chart. The control chart is made up
of data points consisting of individual or aggregated measures associ-
ated with a periodic sample enhanced with the center line and the upper
and lower control limits.
These are the steps to building a control chart for measuring data
quality.
1. Select one or more data quality dimensions to be charted. Use the
Pareto analysis we discussed in Section 6.3 to determine the variables or
attributes that most closely represent the measured problem, since trying
to track down the most grievous offenders is a good place to start.
2. If the goal is to find the source of particular problems, make sure
to determine what the right variables are for charting. For example, if
the dimension being charted is timeliness, consider making the charted
variable the "number of minutes late," instead of "time arrived." When
trying to determine variables, keep in mind that the result of charting
should help find the source and diagnosis of any problems.
3. Determine the proper location within the information chain to
attach the measurement probe. This choice should reflect the following
characteristics.
a. It should be early enough in the information processing
chain that detection and correction of a problem at that point
can prevent incorrectness further along the data flow.
There are many different varieties of control charts.¹ Since our goal is to
measure nonconformance with data quality expectation, we will con-
centrate on particular control chart attributes for measuring noncon-
1. For a detailed list of different control chart types, see Juran's Quality Hand-
book, 5th edition, edited by Joseph M. Juran and A. Blanton Godfrey (New York:
McGraw-Hill, 1999).
σ_p = √( p̄ (1 - p̄) / n )

where p̄ is the probability of occurrence and n is the sample size.
To set up a p chart, a small sample size is collected over a short
period of time (in most cases, 25 to 30 time points will be enough), and
the average p̄ is computed by counting the number of nonconforming
items in each sample, totaling the number of items in each sample
group, and dividing the total number of nonconforming items by the
total number of sampled items. For p charts, the control limits are cal-
culated using the binomial variable standard deviation; the UCL is com-
puted as p̄ + 3σ_p, and the LCL is computed as p̄ - 3σ_p. If the LCL is
computed to be a negative number, we just use 0 as the LCL.
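A minimal sketch of the p chart setup follows, assuming we already have daily counts of sampled and nonconforming records (the sample data is invented for illustration, and the average sample size is used for the control limits).

from math import sqrt

# Illustrative daily samples: (number of items sampled, number nonconforming).
samples = [(500, 21), (480, 19), (510, 22), (495, 20), (505, 23)]

total_items = sum(n for n, _ in samples)
total_bad = sum(bad for _, bad in samples)
p_bar = total_bad / total_items              # average proportion nonconforming
n_avg = total_items / len(samples)           # average sample size

sigma_p = sqrt(p_bar * (1 - p_bar) / n_avg)
ucl = p_bar + 3 * sigma_p
lcl = max(0.0, p_bar - 3 * sigma_p)          # never use a negative LCL

print(round(p_bar, 4), round(ucl, 4), round(lcl, 4))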
It is not out of the realm of possibility that each data item being
observed may have more than one error! In this case, we may not just
want to chart the number of nonconforming data items but the total of
all nonconformities. This kind of attributes chart is called a c chart, and
the UCL is calculated as c̄ + 3√c̄. The LCL is calculated as c̄ - 3√c̄,
where c̄ is the average number of nonconformities over all the samples.
Thus far, we have discussed the calculations of the upper and lower con-
trol limits as a function of the statistical distribution of points in the data
set. This is not to say that we can only define these limits statistically.
In reality, as quality overseers, it is our duty to specify the accept-
able limits for data quality. For example, when it comes to the acceptable
level of incorrect values in certain kinds of databases, we can specify that
there is no tolerance for error. In this case, the UCL for errors would be
0. In many cases of examining data quality, there is no need for a lower
control limit either. Ultimately, it is up to the users to determine their tol-
erance for expected variations and errors and use that as a guideline for
setting the control limits.
Over the period of time, the overall average ratio of bad records
was computed to be 0.0409, which we use as the center line. The UCL
and LCL were computed in accordance with the p chart computation described above.
[Figure: The resulting p chart, plotting the daily ratio of bad records (roughly 0.01 to 0.07) around the 0.0409 center line, with the computed UCL and LCL.]
Our next step in the SPC process is interpreting a control chart. Now
that we have collected and plotted the data points, how can we make
sense out of the resulting control chart? When a process is stable, we
can expect that all the points in the control chart will reflect a natural
pattern. The data points on the chart should be randomly distributed
above and below the center line, and the chart should have these char-
acteristics.
Most of the points are close to the center line.
Some of the points are near the UCL and LCL.
There may be some points above the UCL or below the LCL.
The distribution of points on the chart should not have any non-
random clustering or trending.
In the interpretation of control charts, our goal is to determine
whether a process is stable, and if it is not stable, to find and eliminate
special causes. So, what do we look for in a control chart?
6.8.3 Rebalancing
After the root cause of a problem has been identified and corrected, we
can claim that at least one aspect of an out-of-control situation has been
resolved. In this case, it may be interesting to recalculate the points and
control limits on the control chart, ignoring the data points associated
with the identified cause. This should help strengthen the control limit
calculations and point out other locations to explore for special causes.
We have collected our data, built our control chart, plotted data points
and control limits, and analyzed the chart for anomalous behavior. The
last step in the SPC process is to identify the special causes that are
echoed in the control chart.
Hopefully, we will have selected the areas of measurement in a way
that will point us in the right direction. While we discuss this process in
greater detail in Chapter 15, we can briefly introduce it here.
Assuming that we have translated our data quality expectations
into a set of data quality rules, we can use those rules for validating data
records. If we log the number of times a record is erroneous due to fail-
ing a particular test, we can use those logs to plot the daily conformance
for each specific rule.
At the end of the measurement period, we can construct a control
chart consolidating data from each of the data quality rules. Because
each rule describes a specific aspect of the users' data quality require-
ments, the problem of identifying a special cause reduces to determining
which of the data quality rules accounted for the anomalous behavior.
This provides a starting point for the root cause analysis process
described in Chapter 15.
6.11 SUMMARY
7
DOMAINS, MAPPINGS, AND ENTERPRISE REFERENCE DATA
7.2 OPERATIONS
We will build on the notion of data types in a moment, but first, let's
look at the operations that are valid between data values within a data
type and between data values in different data types.
The operations that can be applied to the numeric data types (integers,
decimals, floating point, etc.) are the standard arithmetic operations.
Addition (+)
Subtraction (-)
Multiplication (*)
Division (/)
Modulo (%)
Note that division of integers may, by definition, yield an integer
result. Some systems add other numeric operations.
Floor (returns the largest integer less than or equal to a real)
Ceiling (returns the smallest integer greater than or equal to a real)
Dates are usually stored in a special format, but the conditional opera-
tors listed should work on dates. In addition, there are special versions
of some of the arithmetic operators.
Addition (add a number of days to a date to get a new date)
Subtraction (subtract a number of days from a date to get a new date)
7.3 DOMAINS
In this section, we look at the ways to collect data values that can take
on intuitive meanings. When it becomes clear that a single collection of
values is used for the same meaning throughout different data reposito-
ries in the enterprise, a special status should be assigned to that collec-
tion as a data domain that can be shared by the users in the enterprise.
We can think about sets in two different ways: (1) through enumera-
tion, listing all the elements in the set, and (2) through description. One
nice thing about sets is that all the values in the set can frequently be
described in a few short terms. A very simple example is the positive
whole numbers: the integral values that are greater than 0. In that one
small phrase we can categorize an infinite number of values that intu-
itively belong to the same collection.
When you think about it, the use of databases is very much driven
by the idea of sets. All the records in a table in a database represent a
set. An SQL SELECT query into a database is a definition of a subset of
the records in the database.
In turn, when we look at the relationship between data attributes
and data types, we can see that many attributes draw their values from
a subset of the values allowable under the data type. For example, an
attribute called STATE may have been defined to be CHAR(2). While
there are 676 distinct two-letter strings, the values populating that
attribute are limited to the 62 two-letter United States state and posses-
sion abbreviations. This subset restriction can be explicitly stated as
"All values in the attribute STATE must also belong to the set of recog-
nized USPS state and possession abbreviations."
Code values
Colors
Employment categories and titles
National Holiday dates
Catalog items
Telephone area codes
Product suppliers
Currency Codes
For example, a credit card number domain can be described by a rule
rather than an enumeration: a string belongs to the domain only if it
passes the Luhn mod-10 checksum, sketched here in pseudocode.

multiplier := 1; sum := 0;
for i := length(cardnumber) downto 1 do
begin
    char := cardnumber[i];
    if digittype(char) then               { skip separators such as spaces or hyphens }
    begin
        product := digitvalue(char) * multiplier;
        { add the digits of the product, e.g., 14 contributes 1 + 4 }
        sum := sum + (product div 10) + (product mod 10);
        multiplier := 3 - multiplier;     { alternate: 1 -> 2, 2 -> 1 }
    end
end;
if (sum mod 10) = 0 then ok
Our domains are restrictions on value sets, and so we allow the same
operations for domains that we allow for sets. Figure 7.2 shows how
domains are similar to sets.
Union The union of domain A and domain B yields a new
domain that contains all the unique values of domain A and
domain B.
Intersection The intersection of domain A and domain B yields
a new domain that contains the values that are in both domain A
and domain B.
[FIGURE 7.2 Domains behave like sets. The figure illustrates the operations A union B, A intersect B, and difference (A - B), along with membership (a is a member of A), subset (B is a subset of A), and equality tests: A is equal to B if (A intersect B) = A and (A intersect B) = B, and A is not equal to B if (A - B) is not empty.]
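Because domains behave like sets, these operations map directly onto set operations; here is a quick sketch with two small illustrative domains.

# Two illustrative currency-code domains.
domain_a = {"USD", "CAD", "EUR", "GBP"}
domain_b = {"EUR", "GBP", "JPY"}

print(domain_a | domain_b)    # union: all unique values of A and B
print(domain_a & domain_b)    # intersection: values in both A and B
print(domain_a - domain_b)    # difference (A - B)
print("EUR" in domain_a)      # membership: a is a member of A
print(domain_b <= domain_a)   # subset test: is B a subset of A?
print(domain_a == domain_b)   # equality: A equals B only if they contain the same values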
7.4 MAPPINGS
A simple form of a mapping is the enumerated list of pairs. Per the defi-
nition in Section 7.4.1, a mapping between domain A and domain B is a
set of pairs of values [a, b] such that a is a member of domain A and b is
a member of domain B. For an enumerated mapping, the collection of [a,
b] pairs is listed explicitly. Note that all the a values must be validated
as belonging to domain A, and all the b values must be validated as
belonging to domain B.

[Figure: Mapping cardinalities: one-to-one, one-to-many, many-to-one, and many-to-many.]
In Section 7.3.3, we encountered the domain of USPS state abbre-
viations. Presuming that we have another domain called USPS State
Names, which has base data type VARCHAR(30), we can define a map-
ping from state abbreviations to state names, as shown in Figure 7.5.
[FIGURE 7.5 A mapping from USPS state abbreviations to state names.]
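As a minimal sketch (showing only a few pairs, with an illustrative name for the validation helper), the enumerated mapping can be held as a table of pairs and validated against the two domains:

# Small subsets of the two domains, for illustration.
STATE_ABBREVIATIONS = {"NY", "NJ", "CT"}
STATE_NAMES = {"NEW YORK", "NEW JERSEY", "CONNECTICUT"}

# The enumerated mapping from abbreviation to full state name.
ABBREVIATION_TO_NAME = {"NY": "NEW YORK", "NJ": "NEW JERSEY", "CT": "CONNECTICUT"}

def mapping_is_valid(mapping, domain_a, domain_b) -> bool:
    """Every a value must belong to domain A and every b value to domain B."""
    return all(a in domain_a and b in domain_b for a, b in mapping.items())

print(mapping_is_valid(ABBREVIATION_TO_NAME, STATE_ABBREVIATIONS, STATE_NAMES))   # True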
To round out the examples, let's go back to the example discussed in the
beginning of the chapter: the U.S. Social Security number. Our earlier
discussion focused on the structure of a valid Social Security number.
Now we'll look at constructing a domain definition for valid Social
Security numbers.
Our first step is to note that the Social Security number is com-
posed of five parts: the area denotation, a hyphen, the group number, a
hyphen, and the serial number. We can start our domain definition by
describing a Social Security number as a CHARACTER(11) string that
is a composition of these five domains, ordered by simplicity.
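A minimal structural check of this composition might look like the following regular-expression sketch; it validates only the shape (three digits, a hyphen, two digits, a hyphen, and four digits) and does not enforce the issued-area restrictions discussed in the accompanying footnotes.

import re

# area (3 digits) - group (2 digits) - serial (4 digits): an 11-character string.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def has_ssn_structure(value: str) -> bool:
    return SSN_PATTERN.match(value) is not None

print(has_ssn_structure("123-45-6789"))   # True
print(has_ssn_structure("123456789"))     # False: the hyphens are part of the domain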
7.5.1 Hyphen
1. The term geographical code is misleading, since according to the SSA, it is not
meant to be any kind of usable geographical information. It is a relic of the precomputer
filing and indexing system used as a bookkeeping device.
2. There are other number ranges that are noted as being "new areas allocated but
not yet issued." We are using this to mean that for testing to see if a Social Security num-
ber is valid, it must have one of the issued geographical codes.
[Figure: Reference data distribution. An information broker manages subscription and distribution requests between data suppliers and data clients.]
data and are responsible for service, support, maintenance, data quality,
and source management, among other responsibilities.
When a catalog of domains and mappings is available for all users,
they can use the reference data to contract an agreement with the infor-
mation suppliers or data stewards. Information is disseminated using a
publish/subscribe mechanism, which can be implemented using inter-
mediary agent processes known as brokers. If there is any sensitivity
regarding the users themselves or the specific reference data, an
anonymous publish/subscribe mechanism can be implemented.
8
DATA QUALITY ASSERTIONS
AND BUSINESS RULES
8.1.1 Definitions
Null value rules specify whether a data field may or may not contain
null values. A null value is essentially the absence of a value, although
there are different kinds of null values. Consequently, we will work
with one method for defining and characterizing null values and two
kinds of null value rules. The first asserts that null values are allowed to
be present in a field, and the second asserts that null values are not
allowed in a field.
Our goal for null value specifications is to isolate the difference between
a legitimate null value and a missing value. Since a data record may at
times allow certain fields to contain null values, we provide these deeper
characterizations.
1. No value: There is no value for this field; this is a true null.
2. Unavailable: There is a value for this field, but for some reason
it has been omitted. Using the unavailable characterization
implies that at some point the value will be available and the field
should be completed.
3. Not applicable: This indicates that in this instance, there is no
applicable value.
4. Not classified: There is a value for this field, but it does not con-
form to a predefined set of domain values for that field.
5. Unknown: The fact that there is a value is established, but that
value is not known.
Considering that we allow more than one kind of null value, we
also need to allow different actual representations, since there is usually
only one system-defined null value. Therefore, any null value specifica-
tion must include both the kind of null along with an optional assigned
representation. Here are some examples.
Use "U"* for unknown
Use "X" for unavailable
Use "N/A" for not applicable
Of course, in building our rule set, it is worthwhile to assign a han-
dle to any specific null value specifications. In this way other rules can
refer to null values by their handle, which increases readability.
Define X for unknown as "X"
Define GETDATE for unavailable as "fill in date"
Define U for unknown as "?"
This way, we can have different representations for the different
kinds of null values in a way that allows flexibility in defining the null
value rules. Figure 8.1 gives examples of each null value specification.
FIGURE 8.1 Examples of null value specifications

Unavailable: There is a value for this field, but for some reason it has been
omitted. Example: In a field for "mobile phone," there is a mobile phone
number associated with the customer, but it has not been filled in.

Not classified: There is a value for this field, but it does not conform to a
predefined set of domain values for that field. Example: In an order for a
sweater, where the colors are limited to red, blue, or black, if the buyer
requested a color that was not available, the field might be left blank.
The null value rule is a prescriptive rule that specifies that null values
may be present in the indicated field. A null value rule will specify the
kinds of null values that may be present and a representation (if any)
used for those null values.
A null value rule may allow traditional null values (such as system
nulls, empty fields, or blanks), generic null values as defined in null
value specifications, or a detailed list of specific null value representa-
tions. Note that if we only allow certain kinds of null values, this will
most likely mean that we want to restrict the appearance of the tradi-
tional nulls!
Here are some examples.
Rule ANulls:
Attribute A allowed nulls {GETDATE, U, X}
In this example, we only allow certain kinds of null values, as
described in Section 8.4.1.
Rule BNulls:
Attribute B allowed nulls
With the null value rules, the resulting validation depends on the
types of null values allowed. If any nulls are allowed, then there is really
nothing to do: whether or not a value is in the field, the field is con-
formant. But if only certain kinds of nulls are allowed, then the valida-
tion for that rule includes checking to make sure that if any other null
values appear (such as the system null or blanks), the record is marked
as violating that rule.
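A sketch of how such a null value rule might be checked against a single field value follows; the representations reuse the specifications defined earlier, while the helper function itself is only illustrative.

# Null value specifications: handle -> assigned representation, as defined earlier.
NULL_SPECS = {"GETDATE": "fill in date", "U": "?", "X": "X"}

def violates_null_rule(value, allowed_handles) -> bool:
    """Return True if the value is a null that the rule does not allow.
    An empty allowed list means any null is acceptable."""
    allow_any = len(allowed_handles) == 0
    if value is None or str(value).strip() == "":      # traditional null (system null, blank)
        return not allow_any
    if value in NULL_SPECS.values():                   # one of the defined null representations
        allowed_values = {NULL_SPECS[h] for h in allowed_handles}
        return not (allow_any or value in allowed_values)
    return False                                       # an ordinary value never violates a null rule

# Rule ANulls: Attribute A allowed nulls {GETDATE, U, X}
print(violates_null_rule("?", ["GETDATE", "U", "X"]))   # False: an allowed null representation
print(violates_null_rule("", ["GETDATE", "U", "X"]))    # True: a traditional null is not allowed here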
TABLE 8.1
Manipulation Operators and Functions
All attributes are associated with some preassigned data type, such as
string or integer. While the data type provides a level of data validation,
176 ENTERPRISE KNOWLEDGE MANAGEMENT
it is limited in its extent, since there may be many values that are data
type conformant, yet are inappropriate for a particular field.
One example would be an attribute GRADE associated with a stu-
dent course database. The data type for the field would be CHAR(l) (a
character string of length 1), but we can presume that the only valid val-
ues for GRADE are {A, B, C, D, F}. An entry for this field of the letter J
would be completely valid for the data type. To further restrict the set of
valid values, we can add an attribute value restriction rule for that field.
Rule GradeRestrict:
Restrict GRADE: (value >= "A" AND value <= "F") AND
(value != "E")
This rule contains all the restrictions necessary for the field: The
value is between A and F but cannot be E. All value restriction rules
indicate a narrowing of the set of values valid for the attribute and are
usually expressed using range operations combined with the logical
connectors "AND," "OR," and so on.
These kinds of rules exist in many places, and they are the ones
most frequently embedded in application code. Exposing these rules
can be very useful, especially if there is a possibility that there are con-
tradictions between the restrictions. If more than one rule is associated
with an attribute, both rules must hold true, the equivalent of creating a
new rule composed of both restrictions connected using the logical
AND operator. A contradiction exists when two rules exist that are
applied to the same field, yet it is not possible for both to always hold.
Here is an example of two contradicting rules.
Rule StartRestrict1:
Restrict STARTDATE: value > "June 21, 1987" AND value <
"December 1, 1990"
Rule StartRestrict2:
Restrict STARTDATE: value < "February 1, 1986"
In the first rule, the date must fall within a certain range, while the sec-
ond rule says that the date must be earlier than a different date. Combining
the rules yields a restriction that can't be true: The date is before February
1, 1986 but is also between June 21, 1987 and December 1, 1990.
These contradictions may not be discovered in standard use because
application code is executed in a sequential manner, and the time depen-
dence masks out what may be a harmful oversight. Once the rules are
enumerated, an automated process can check for contradictions.
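Once the restrictions are represented as data, an automated contradiction check becomes straightforward. The sketch below uses an intentionally simplified representation (each restriction reduced to a lower and upper bound on a single attribute) and flags an empty combined interval.

from datetime import date

# Each restriction is reduced to a (lower, upper) bound pair; None means unbounded.
# Conjoining restrictions (logical AND) intersects the intervals.
def combine(restrictions):
    lower, upper = None, None
    for lo, hi in restrictions:
        if lo is not None and (lower is None or lo > lower):
            lower = lo
        if hi is not None and (upper is None or hi < upper):
            upper = hi
    return lower, upper

def contradictory(restrictions) -> bool:
    lower, upper = combine(restrictions)
    return lower is not None and upper is not None and lower >= upper

# StartRestrict1: value > June 21, 1987 AND value < December 1, 1990
# StartRestrict2: value < February 1, 1986
rules = [(date(1987, 6, 21), date(1990, 12, 1)), (None, date(1986, 2, 1))]
print(contradictory(rules))   # True: the combined interval is empty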
"Labor Day,'^
"Veteran's Day,"
"Thanksgiving,"
"Christmas"}
In some data environments, data values that appear in one field always
belong to the same domain, although it may not be clear a priori what
the complete set of values will be. For example, we may define a domain
of "Current Customers," but values are added to this domain dynami-
cally as new customers are added to the database. But once a value
appears in the field, it will always be considered as a domain member.
This becomes especially useful when other data consumers rely on the
values in that domain. Because of this, we have a rule that specifies that
all field values define a domain.
Domain assignment is a prescriptive rule. A domain assignment
rule, when applied to a data field, specifies that all values that appear in
that field (or list of fields) are automatically assigned to be members of a
named domain. This domain is added to the set of domains and can be
used in other domain rules.
Our formal representation includes a domain name and a list of
database table field names whose values are propagated into the
domain. The domain is constructed from the union of all the values in
each of the table columns. Here is an example.
Define Domain CurrencyCodes from {Countries.Currency,
Orders.Currency}
In this example, the domain CurrencyCodes is created (and
updated) from the values that appear in both the Currency column in
the Countries table, and the Currency column in the Orders table. This
rule may have been configured assuming that the right currency codes
would be correctly set in the Countries table, but in reality, the curren-
cies that are used in customer orders are those that are actually used, so
we'll include the currencies listed in the Orders table.
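A sketch of how a domain assignment rule could be materialized follows, with the two tables represented as simple lists of records (the sample rows are illustrative).

# Minimal stand-ins for the Countries and Orders tables.
countries = [{"Name": "France", "Currency": "EUR"}, {"Name": "Japan", "Currency": "JPY"}]
orders = [{"OrderID": 1, "Currency": "USD"}, {"OrderID": 2, "Currency": "EUR"}]

def define_domain(*columns):
    """Build (and, when rerun, refresh) a domain as the union of the values
    appearing in each of the named table columns."""
    domain = set()
    for table, attribute in columns:
        domain.update(row[attribute] for row in table)
    return domain

# Define Domain CurrencyCodes from {Countries.Currency, Orders.Currency}
currency_codes = define_domain((countries, "Currency"), (orders, "Currency"))
print(sorted(currency_codes))   # ['EUR', 'JPY', 'USD']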
{"CO", "COLORADO"},
{"CT", "CONNECnCUT"},
{"DE", "DELAWARE"},
{"DC", "DISTRICT OF COLUMBIA"},
{"FM", "FEDERATED STATES OF MICRONESIA"},
{"FL", "FLORIDA"},
{"GA", "GEORGIA"},
{"GU", "GUAM"},
{"HI", "HAWAH"},
{"ID", "IDAHO"},
{"IL", "ILLINOIS"},
{"IN", "INDIANA"},
{"lA", "IOWA"},
{"KS", "KANSAS"},
{"KY", "KENTUCKY"},
{"LA", "LOUISIANA"},
{"ME", "MAINE"},
{"MH", "MARSHALL ISLANDS"},
{"MD", "MARYLAND"},
{"MA", "MASSACHUSETTS"},
{"MI", "MICHIGAN"},
{"MN", "MINNESOTA"},
{"MS", "MISSISSIPPI"},
{"MO", "MISSOURI"},
{"MT", "MONTANA"},
{"NE", "NEBRASKA"},
{"NV", "NEVADA"},
{"NH", "NEW HAMPSHIRE"},
{"NJ", "NEW JERSEY"},
{"NM", "NEW MEXICO"},
{"NY", "NEW YORK"},
{"NC", "NORTH CAROLINA"},
{"ND", "NORTH DAKOTA"},
{"MP", "NORTHERN MARIANA ISLANDS"},
{"OH", "OHIO"},
{"OK", "OKLAHOMA"},
{"OR", "OREGON"},
{"PW", "PALAU"},
{"PA", "PENNSYLVANIA"},
{"PR", "PUERTO RICO"},
{"RI", "RHODE ISLAND"},
The relation rules refer to data quality rules that apply to more than one
data field. We have two versions of these rules: intra-record and inter-
record. The intra-record version is limited to fields within a single
record or message, and the inter-record version allows referral to fields
in other records, messages, or tables.
There are four kinds of relation rules. Three kinds (complete-
ness, exemption, and consistency rules) are proscriptive rules. One
kind, derivation, is a prescriptive rule that describes how some
fields are filled in. Completeness rules describe when a record is fully
complete. Exemption rules describe when a record may be missing
data. Consistency rules indicate the consistency relationship between
fields.
There are times when some attributes are not required to have values,
depending on other values in the record. An exemption rule specifies a
condition and a list of attributes, and if the condition evaluates to true,
then those attributes in the list are allowed to have null values. Exemp-
tion rules, like completeness rules, are conditional assertions.
For example, a catalog order might require the customer to specify
a color or size of an article of clothing, but neither is required if the cus-
tomer orders a nonclothing product.
Rule NotClothing:
IF (Orders.Item_Class != "CLOTHING") Exempt
{Orders.Color,
Orders.Size
}
If a null value rule has been specified for any of the attributes in the
list, then the null value representations for those attributes may appear
as well as traditional nulls.
Rule Salary3:
IF (Employees.title = "Manager") Then (Employees.Salary >=
40000 AND Employees.Salary < 50000)
Rule Salary4:
IF (Employees.title == "Senior Manager") Then (Employees.Salary
>= 50000 AND Employees.Salary < 60000)
For each consistency rule, there is a condition and a consequent.
The condition may consist of clauses referring to more than one
attribute, and the consequent may also refer to more than one attribute.
So far we have mostly focused on rules that apply between the values of
attributes within the same record. In fact, there are rules that apply
between records within a table, between records in different tables, or
between fields in different messages.
The rules in this section are important in terms of traditional rela-
tional database theory. The most powerful of these rules is the func-
A table key is a set of attributes such that, for all records in a table, no
two records have the same set of values for all attributes in that key set.
A primary key is a key set that has been designated as a key that identi-
fies records in the table. A key assertion is a proscriptive rule indicating
that a set of attributes is a key for a specific table.
Rule CustomersKey:
{Customers.FirstName, Customers.LastName, Customers.ID} Key
for Customers
If a key assertion has been specified, no two records in the table
may have the same values for the indicated attributes. Therefore, a key
assertion basically describes the identifiability dimension of data quality
as described in Chapter 5.
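A key assertion can be checked by scanning for duplicate combinations of the key attributes; here is a minimal sketch over records held as dictionaries (the sample table is illustrative).

from collections import Counter

def key_violations(records, key_attributes):
    """Return the key-value combinations that appear in more than one record,
    which would violate a key assertion over those attributes."""
    counts = Counter(tuple(r[a] for a in key_attributes) for r in records)
    return [key for key, count in counts.items() if count > 1]

customers = [
    {"FirstName": "John", "LastName": "Smith", "ID": 17},
    {"FirstName": "Mary", "LastName": "Smith", "ID": 18},
    {"FirstName": "John", "LastName": "Smith", "ID": 17},   # duplicate key values
]

# Rule CustomersKey: {Customers.FirstName, Customers.LastName, Customers.ID} Key for Customers
print(key_violations(customers, ["FirstName", "LastName", "ID"]))   # [('John', 'Smith', 17)]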
A candidate key is a key set consisting of attributes such that the com-
position of the values in the key set uniquely identifies a record in the
table and such that none of the attributes in the key set may be removed
without destroying the uniqueness property. A primary key is an arbitrarily
selected candidate key used as the only record-level addressing mecha-
nism for that table. We can qualify the key assertion by indicating that
the key is primary, as in this example.
Rule CustomersPrimaryKey:
{Customers.FirstName, Customers.LastName, Customers.ID} Pri-
mary Key for Customers
A primary key defines an entity in a table, and, therefore, the
enforcement of a primary key assertion is that no attribute belonging to
the primary key may have null values. This demonstrates an interesting
point: Some rule definitions may have an impact on other rule defini-
tions. Our framework already has the facility for defining null value
rules, but implicit in a primary key rule are other null value rules! Fig-
ure 8.4 explains both key assertions.
Key Assertion           The attribute list forms a key.           {<attribute-list>} KEY FOR <table name>
Foreign Key Assertion   The attribute list forms a foreign key.   {<table name and attribute>} FOREIGN KEY FOR <table name and attribute>

FIGURE 8.4 Key assertions
We must not forget that the quality of data also affects the way the data
is used, and if there is a processing stream that uses information, there
may be rules that govern automated processing. Therefore, we also
define a class of rules that apply "in-process." In-process rules are those
that make assertions about information as it passes between interfaces.
It is through the use of in-process rules that we can add "hooks"
for measurement, validation, or the invocation of any external applica-
tion. Before we can specify the rules themselves, we must also have a
way to express the data flow for the process, and we do that using pro-
cessing chains. Given a processing chain, we can insert into that chain
in-process rules.
[Figure: A processing chain for order payment processing. Orders flow from the Order Files source through the Select Orders and Process Payments stages; good payments go to the Paid Orders target, and bad payments go to Reconciliation. The chain is defined by its processing stages and channels:]

Type     Name
Source   Order Files
Stage    Select Orders
Stage    Process Payments
Target   Reconciliation
Target   Paid Orders

Channel Type   Channel Name   Source             Target
Selector       Input          Order Files        Select Orders
Channel        OrderPay       Select Orders      Process Payments
Channel        GoodPayment    Process Payments   Paid Orders
Channel        BadPayment     Process Payments   Reconciliation
There are other opportunities for inserting in-process rules that do not
necessarily affect the data-flow chain but rather affect the data being
transferred. In this section we discuss transformation rules, which
describe how data are transformed from one form to another, and update
rules, which define when information should be updated.
MeasureCCNumbers:
AT OrderPay Measure {CCNumbers}

BigOrder:
IF (orders.total > 1000): E-mailTheBoss
There are other kinds of rules that are not discussed in this chapter. One
class of rules is the approximate or fuzzy rules, which specify assertions
allowing some degree of uncertainty. Another set is the navigation rules,
which indicate appropriate navigation paths that can be executed by
users at a Web site. Both of these classes will be treated at greater length
in other chapters.
Once data quality and business rules have been defined, there are two
more steps. The first is the accumulation and consolidation of all the
rules into one working set, and the second step is confirming that the
rules interact with each other correctly. These, together with the opera-
tion of storing, retrieving and editing rules, are part of the management
of rules.
about the business and data quality rules that govern a data system.
This simple formalism makes it possible for these rules to be defined,
read, and understood by a nontechnician. The ultimate goal is a frame-
work in which the data consumer can be integrated into the data qual-
ity process. In this way, data quality rules are integrated into the
enterprise metadata and can be managed as content that can be pub-
lished directly to all eligible data consumers.
8.14.2 Compilation
Before rules can be compiled into an executable format, the rules them-
selves must conform to their own set of validity constraints. Specifically,
the assertions about a data set cannot conflict with each other. For
example, we can make these two assertions about our data.
Rule: IF (Customers.Cust_Class != "BUSINESS") Exempt
{Customers.Business_Phone
}
Rule RequiredBizPhone:
Customers.Business_Phone nulls not allowed
Each of these rules by itself is completely understandable. The first rule
says that if the customer is not a business customer, the business phone
attribute may be empty. The second rule indicates that the business
phone field may not contain nulls. Put together, these two rules are con-
tradictory: The first says that nulls are allowed under some conditions,
while the second asserts that the same attribute may not be null. This is
an example of violating a validity constraint on data quality rules.
Some other kinds of validity rules include the following.
Restriction contradictions, where two restrictions create an asser-
tion that cannot be true
Uncovered consistency ranges, where consistency rules are defined
for only a subset of data values but not all the possible values that
can populate an attribute
Key assertion violations, where an attribute set specified as a key
is allowed to have nulls
The last issue regarding rules is order. One of the nice things about the
rule system is that declaring that the data conform to the rules means
that all the rules and assertions hold true. But how do we test this?
The answer is that we must apply each rule in its own context and vali-
date that there are no violations.
8.15.2 Dependencies
The only exceptions, as just noted, come into play when the rule itself
includes a data dependence or a control dependence. A data depen-
dence occurs when one piece of the rule depends on another part of the
rule. Any of the rules with conditions and domain and mapping assign-
ments have data dependences inherent in them. A control dependence
occurs when an event must take place before the test is allowed. The
trigger, update, and measurement directives imply control dependen-
cies. Some rules may have both control and data dependence issues:
Consider the transformation directives.
A dependence is said to be fulfilled if no further modifications
may take place to the dependent entity (attribute, table). A rule's depen-
dence is violated if a test for that rule is performed before the associated
dependencies have been fulfilled. The presence of dependences in the rule
set forces some ordering on the execution of rule tests. The goal, then, is to
determine the most efficient order that does not violate any dependence.
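Because the dependences form a directed graph over the rules, that ordering can be computed mechanically with a topological sort. Here is a minimal sketch; the rule names and the dependence graph are hypothetical illustrations, not part of the formalism defined earlier in this chapter.

from graphlib import TopologicalSorter  # Python 3.9+

# Each rule lists the rules (or events) that must be fulfilled before it
# can be tested, a mix of data and control dependences. Names are illustrative.
dependences = {
    "load_orders": set(),
    "order_payment_event": set(),
    "validate_order_total": {"load_orders"},
    "measure_cc_numbers": {"validate_order_total"},
    "email_the_boss": {"measure_cc_numbers", "order_payment_event"},
}

# A topological ordering tests every rule only after the rules it depends
# on have been fulfilled; a cycle would signal an ill-formed rule set.
order = list(TopologicalSorter(dependences).static_order())
print(order)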
8.16 SUMMARY
Let's review the steps in building the information chain that we dis-
cussed in Chapter 4.
When the information chain has been published, the next part of the
current state assessment is to select locations in the information chain
to use as the measurement points. The choice of location can be depen-
dent on a number of aspects.
1. Critical junction Any processing stage with a high degree of information load is likely to be a site where information from different data sources is merged or manipulated.
2. Collector A collector is likely to be a place where information
is aggregated and prepared for reporting or prepared for storage.
3. Broadcaster A broadcaster is likely to be a processing stage
that prepares information for many consumers and, therefore,
may be a ripe target for measurement.
4. Ease of access While some processing stages may be more
attractive in terms of analytical power, they may not be easily
accessed for information collection.
5. High-profile stages A processing stage that consumes a large
percentage of company resources might provide useful measure-
ment information.
The important point is that a number of locations in the informa-
tion chain be selected, preferably where the same data sets pass through
more than one of the locations. This will add some depth to the mea-
surement, since place-stamping the data along the way while registering
data quality measurements will help determine if there are specific fail-
ure points in the information chain.
The next sections detail some ways to measure data quality in terms of
the dimensions of data quality introduced in Chapter 5. In each section,
we review a dimension of data quality and suggest ways to measure it.
Of course, these are only starting points, and the ways to measure data
quality may be dependent on the specific situation.
First let's examine the dimensions of data quality affecting data models.
In most cases, data quality associated with the data model is likely to be
measured using static metrics. Each section will review the dimension of
data quality and provide suggestions for measuring its level.
9.7.2 Comprehensiveness
9.7.3 Flexibility
9.7.4 Robustness
9.7.5 Essentialness
Other than for specific needs such as planned redundancy or the facili-
tation of analytical applications, a data model should not include extra
information. Extra information requires the expense of acquisition and
storage and tends to decrease data quality levels. Also, there may be a
lot of data attributes for which space has been allocated, yet these
attributes are not filled. Statistics regarding the frequency at which data tables and attributes are referenced are a good source for measuring essentialness. Here are some ways to measure it:
Count the number of data elements, tables, and attributes that are
never read.
Count the number of data elements, tables, and attributes that are
never written.
Count the number of redundant copies of the same data in the
enterprise.
9.7.8 Homogeneity
9.7.9 Naturalness
9.7.10 Identifiability
9.7.11 Obtainabiiity
9.7.12 Relevance
9.7.13 Simplicity/Complexity
the number of tables in a database, and (3) the number of foreign key
relations in a database.
Next, we look at how to measure the data quality of data values. This is
the kind of measurement most frequently associated with data quality
and, consequently, is also the easiest for which to secure management
support.
9.8.1 Accuracy
Data accuracy refers to the degree to which data values agree with an
identified source of correct information. There are different sources of
correct information: a database of record, a similar, corroborative set of
data values from another table, dynamically computed values, the
result of a manual workflow, or irate customers.
Accuracy is measured by comparing the given values with the iden-
tified correct source. The simplest metric is a ratio of correct values and
incorrect values. A more interesting metric is the correctness ratio along
with a qualification of how incorrect the values are using some kind of
distance measurement.
It is possible that some of this measuring can be done automati-
cally, as long as the database of record is also available in electronic for-
mat. Unfortunately, many values of record are maintained in a far less
accessible format and require a large investment in human resources to
get the actual measurements.
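As a minimal sketch of this kind of measurement, assuming the database of record is available as a simple electronic lookup keyed by a record identifier (the field values below are hypothetical), the correctness ratio can be paired with a similarity score that qualifies how far off the incorrect values are:

from difflib import SequenceMatcher

def accuracy_report(observed: dict, record_of_reference: dict) -> dict:
    """Compare observed values against the database of record.
    Returns the simple correctness ratio plus an average similarity
    score (0.0 to 1.0) that qualifies how wrong the incorrect values are."""
    correct = 0
    similarities = []
    for key, value in observed.items():
        reference = record_of_reference.get(key)
        if value == reference:
            correct += 1
        else:
            # SequenceMatcher ratio acts as a crude string distance measurement.
            similarities.append(SequenceMatcher(None, str(value), str(reference)).ratio())
    total = len(observed)
    return {
        "correctness_ratio": correct / total if total else 1.0,
        "avg_similarity_of_errors": (sum(similarities) / len(similarities)) if similarities else 1.0,
    }

# Hypothetical example: customer names keyed by account number.
observed = {"1001": "John Smith", "1002": "Jane Dough", "1003": "Ann Lee"}
reference = {"1001": "John Smith", "1002": "Jane Doe", "1003": "Anne Lee"}
print(accuracy_report(observed, reference))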
A null value is a missing value. Yet, a value that is missing may provide
more information than one might think because there may be different
reasons for the missing value, such as:
Unavailable values
Not applicable for this entity
9.8.3 Completeness
9.8.4 Consistency
9.8.5 Currency/Timeliness
9.9.2 Stewardship
9.9.3 Ubiquity
Moving on to our next set of data quality dimensions, let's look at ways
to measure the quality of data presentation. Many of these dimensions
can only be measured through dialog with the users, since the presenta-
tion is an issue that affects the way information is absorbed by the user.
In each of these cases, though, the best way to get a feel for a score
in each dimension is to ask the user directly! In the question, formulate
each dimension definition as a positive statement, for example, "The format and presentation of data meets my needs," and see how strongly each user agrees or disagrees with the statement.
9.10.1 Appropriateness
A good presentation provides the user with everything required for the
correct interpretation of information. Applications with online help
facilities are fertile territory for measuring this dimension. The help
facility can be augmented to count the number of times a user invokes
help and log the questions the user asks.
Applications without online help can still be evaluated. This is
done by assessing the amount of time the application developer spends
explaining, interpreting, or fixing the application front end to enhance
the user's ability to correctly interpret the data.
9.10.3 Flexibility
9.10.5 Portability
9.11.1 Accessibility
that even if all the information has been presented, can the user
get as much as possible out of the data.
4. How easy is it to get authorized to access information? This is
not a question of automation but instead measures the steps that
must be taken to authorize a user's access.
5. Are there filters in place to block unauthorized access? This questions whether there is a way to automate the access limits.
This dimension refers to that information that is allowed to be pre-
sented to any selected subset of users. Even though the last two questions
border on the issue of security and authorization, there is a subtle differ-
ence. The dimension of accessibility characterizes the means for both
providing and controlling access, but the dimension of security charac-
terizes the policies that are defined and implemented for access control.
9.11.2 Metadata
Privacy and security policies should cover more than just the trans-
ference and storage of information. The privacy policy should also
cover the boundaries of any kind of information dissemination, includ-
ing inadvertent as well as inferred disclosure. An example of inadver-
tent disclosure is the lack of care taken when using a mobile telephone
in a public location. An example of inferred disclosure is when the accu-
mulation and merging of information posted to, then subsequently
gleaned from multiple Web sites can be used to draw conclusions that
could not have been inferred from the individual Web sites.
9.11.4 Redundancy
9.14 SUMMARY
10.1.1 ACTORS
10.1.3 Impacts
Organizational mistrust
Lowered ability to effectively compete
Data ownership conflicts
Lowered employee satisfaction
In Chapter 4, we described how to build the data quality scorecard
by determining the locations in the information chain where these kinds
of impacts are felt. In Chapter 9, we discussed how to measure the
kinds of data quality problems that occur at each location by perform-
ing a current state assessment (CSA).
10.1.4 CSA
The second goal, distribution of cost, can be achieved after the assign-
ment of impact, as each impact is associated with a cost. This step is
actually a refinement of the overall scorecard process, but it gives us a
means for attributing a specific cost to a specific data quality problem,
which then becomes input to the final step, determination of expecta-
tions. Figure 10.2 shows distribution of impact and cost.
This last goal of assessment review poses two generic questions for each
data quality dimension measured with each associated set of impacts
and costs.
1. What quantitative level of measured quality will remove the neg-
ative impact? This represents a base requirement of expected
data quality.
Ref  Data Quality Problem   Information     Activity                 Impact                 Cost          Percentage  Overall Cost
ID                          Chain Location
1    Malformed credit       Node 5          Credit card processing   Detection              $60,000.00    40.00%      $24,000.00
     card numbers                           Contact customer         Correction                           30.00%      $18,000.00
2    Invalid addresses      Node 0          Direct marketing         Detection              $150,000.00   60.00%      $90,000.00
                                                                     Correction                           40.00%      $60,000.00
                                            Reduced reach            Acquisition overhead   $600,000.00   33.00%      $198,000.00
                                                                     Lost opportunity                     67.00%      $402,000.00
3    Incorrect pick lists   Node 7          Shipping processing      Detection              $24,000.00    20.00%      $4,800.00
                                                                     Correction                           80.00%      $19,200.00
                                            Customer service         Warranty               $209,000.00   20.00%      $41,800.00
                                                                     Spin                                 10.00%      $20,900.00
                                                                     Attrition                            70.00%      $146,300.00
It may sometimes be more cost effective to achieve a lower threshold, taking into account the Pareto Principle that 80 percent of the work can be achieved with 20 percent of the effort.
Note that we must integrate the users' needs into the data quality
requirement thresholds, since it is the users that can decree whether their
expectations are being met. We therefore make use of a mechanism
called use case analysis, which, although it is usually used in the system
design process, is easily adapted to the data quality process.
Use case analysis is a process that was developed over a number of years and
is described by Ivar Jacobson in his book Object-Oriented Software
Engineering as a way to understand the nature of interaction between
users of a system and the internal requirements of that system. Accord-
ing to Jacobson, a use case model specifies the functionality of a system
as well as a description of what the system should offer from the user's
perspective. A use case model is specified with three components.
Actors, representing the roles that users play
Use cases, representing what the users do with the system
Triggers, representing events that initiate use cases
We will make use of this model, altered slightly into what we could call the "data quality view." The data quality view focuses on the con-
text and content of the communication interactions, so we can incorpo-
rate user expectations for the quality of information as a basis for
defining the minimum threshold of acceptance.
10.4.1 Actors
We have already seen the term actors used in this book, and that choice
of terminology was not arbitrary. In a use case analysis, an actor is a
representation of one of the prospective users of the system, describing
the different roles that a user can play.
Note that a single user may take on different roles, and each role is
represented as a different actor. Actors model anything that needs to
interact or exchange information with a system. Therefore, an actor can
represent anything external to the system (either a human or another
system) that interacts with our system. With respect to data quality, we
want to look at those actors that can have some effect on the levels of
data quality, so our list of actors, as reviewed in Section 10.1.1, is the
appropriate set.
In Jacobson's model, actors correspond to object classes, and the
appearance of a user taking on the role of an actor represents an instance
of one of those objects. The collection of actor descriptions forms a model
of what is external to the system. In our data quality view, we are less
concerned with object instances and more concerned with expectations
an actor will associate with information quality.
It is worthwhile to show a simple example that demonstrates a use
case analysis. Let's consider a simple candy vending machine that makes
use of a debit card. For the design of this candy machine, we expect
there to be three different actors: the purchaser, the maintenance per-
son, and the accounting system that keeps track of transactions that
have taken place on the machine.
10.4.4 Triggers
In the use case model, a trigger is an event that initiates a use case. A
trigger event may occur as a result of an input data structure, an actor request, or some other modification to the system.
10.4.6 Variants
10.4.7 Extensions
Extensions describe how one use case can be inserted into (extend)
another use case. In the vending machine example, the product may get
stuck during product release. The result would be that the purchaser's
debit card is debited, but the product is not properly dispensed. This
implies a new use case, product stuck, to extend get product. In this use
case, the maintenance person is alerted to fix the machine and provide a
refund for the purchaser's debit card.
Here are some other reasons for extensions.
In general system design, the systems designer begins with the use case
model to determine the information model that will encompass the
users' requirements. Typically, the process of figuring out the domain
objects begins with isolating all the nouns used in the use cases. These
nouns, along with their meanings, become a glossary for the system.
For the requirements stage, the degree of detail in describing each
object's attributes and its interactions with other objects in the system is
probably sufficient. Often, a set of objects with similar attributes can be
isolated and abstracted using a higher-level description. For example, the vending machine may vend different sizes of candy bars. Each of these objects is a different kind of product, which might imply that product is a base class, with the different sizes or flavors of candy being attributes of the product class.
Invariants are assertions about a system that must always be true and are
used to identify error conditions. In the case of a vending machine, an
invariant may be that dispensing a product may only be performed if the
inserted debit card has sufficient credit. Boundary conditions describe the
extents of the usability of the system. In the vending machine, one bound-
ary condition is that each row can only hold 15 candy bars, so there is a
maximum number of products that may be sold between maintenance
periods. Constraints deal with issues that impede the usability of the sys-
tem. A constraint of the vending machine is that only one actor may be
served at a time.
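To make these distinctions concrete, here is a minimal sketch of how the vending machine's invariant, boundary condition, and constraint might be expressed as runtime checks; the class, its price, and its limits are illustrative assumptions rather than part of the use case notation.

class VendingMachine:
    ROW_CAPACITY = 15              # boundary condition: at most 15 candy bars per row

    def __init__(self, price_cents: int, stock: int):
        self.price_cents = price_cents
        self.stock = min(stock, self.ROW_CAPACITY)
        self.busy = False          # constraint: only one actor may be served at a time

    def dispense(self, card_balance_cents: int) -> int:
        """Dispense one product and return the new debit card balance."""
        if self.busy:
            raise RuntimeError("constraint violated: machine is serving another actor")
        # Invariant: dispensing may only happen if the card has sufficient credit.
        assert card_balance_cents >= self.price_cents, "invariant violated: insufficient credit"
        assert self.stock > 0, "boundary condition violated: row is empty"
        self.busy = True
        try:
            self.stock -= 1
            return card_balance_cents - self.price_cents
        finally:
            self.busy = False

machine = VendingMachine(price_cents=125, stock=15)
print(machine.dispense(card_balance_cents=500))   # prints 375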
10.6.5 Performance
10.8 SUMMARY
In any metadata framework, there are some generic elements that can
tag most data or metadata elements in the system. This section enumer-
ates some of the generic metadata elements that would be incorporated
into the repository.
11.1.2 Description
This is a text description of any entity. This might also incorporate ref-
erences to more advanced descriptive objects, such as text documents,
graphical pictures, spreadsheets, or URLs.
11.1.4 Keywords
11.1.5 Location
11.1.6 Author/Owner/Responsible
This represents a resource that provides help for the described entity, such
as a URL, a README file, or the phone number of a support person.
11.1.8 Version
11.1.9 Handler
This describes some kind of object that serves as a handler for the entity,
such as an application component that handles the user interface for
directing operations on the entity.
11.1.10 Menus
These are the menus associated with the user interface for a described
entity.
the types that are available among all platforms. In our metadata repos-
itory, we keep track of all the types that are available within the envi-
ronment, as well as the ways that types are defined and used.
11.2.1 Alias
11.2.2 Enumeration
Any enumerated list of data values that is used as a type is kept as meta-
data. Enumerated types, such as our domains from Chapter 8, are rep-
resented using a base type and a list of values.
11.2.3 Intrinsic
Any defined type that is intrinsic to the data model should be denoted as
such, and the information about both the size of objects and the physi-
cal layout of values of that type should be maintained as metadata. For
example, decimal values may be maintained in a number of different
ways on different platforms, although the presentation of values of type
decimal may be identical across all those platforms. In this case, the
exact representation should be denoted in case the data need to be
migrated to other platforms.
11.2.4 Namespace
accumulated under a single "rule set," the name of which defines a rules
namespace.
11.2.6 Scalars
Atomic data types used in a system as base types are scalar types. Exam-
ples include integers, floating point numbers, and strings. We enumerate
a set of base intrinsic scalar types in Section 11.2.11.
11.2.7 Structure
11.2.8 Union
11.2.9 Array
11.2.10 Collections
With objects that can be bundled using collections, the collection meta-
data maintains all information about the maximum size of a collection,
the way that collection members are accessed, whether there is any
inherent ordering to members in the collection, and the operations that
are permissible on the collection, such as insert, delete, sort, and so forth.
If there is some inherent ordering, there must also be descriptions of
equality of objects and an ordering rule (that is, a "less-than" operation).
Here is a list of base types that are typically used. Any metadata reposi-
tory should keep track of which of these types are valid within each
enterprise system.
• Binary: A large unbounded binary object, such as a graphic image or a large text memo
• Boolean: TRUE or FALSE
• Character: The metadata repository should log which representation is used for characters, such as ASCII or UNICODE
• Date: A date that does not include a time stamp
• Datetime: A date stamp that does include a time stamp
• Decimal: The exact decimal value representation, as opposed to float, which is approximate
• Double: Double-precision floating point
• Float: Single-precision floating point
• Integer: Standard integer type
• Small integer: Small integer type
• Long integer: Extended-size integer type
• Long long integer: Larger-sized extended integer type
• Numeric: Fixed-size decimal representation, attributed by the number of numeric digits that can be stored and the number of digits to the left of the decimal point
• Pointer: A reference to another data object
• String: A sequence of characters that must be attributed with the length as well as the character type
• Time: A timestamp data type
• TimePrecision: The precision of the counting period associated with a timestamp type
• Void: The standard intrinsic void data type (as in C/C++)
11.3.1 Catalog
11.3.2 Connections
11.3.3 Tables
Size of the table and growth statistics (and upper size limit, if
necessary)
Source of the data that is input into the table
Table update history, including date of last refresh and results of
last updates
Primary key
Foreign keys
Referential integrity constraints
Cross-columnar data quality assertions
Functional dependences
Other intratable and cross-tabular data quality rules
Data quality requirements for the table
11.3.4 Attributes/Columns
This is a list of the programs used to load data into the tables or that
feed data into the tables. We maintain this information as metadata.
The name of the program
The version of the program
11.3.6 Views
Views are table-like representations of data joins that are not stored as
tables. For each view, we want to maintain the following.
The name of the view
The owner of the view
The source tables used in the view
The attributes used in the view
Whether updates may be made through the view
11.3.7 Queries
11.3.8 Joins
11.3.9 Transformations
In all data sets, data come from different original sources. While some data originate from user applications, other data originate from alternate sources. For supplied data, we want to maintain the following.
The name of the data package
The name of the data supplier
The name of the person responsible for supplying the data (from
the supplier side)
The name of the person responsible for accepting the data (on the
organization side)
The expected size of the data
The time the data is supposed to be delivered
Which tables and attributes are populated with this data
The name of any data transformation programs for this data
The name of the load programs for this data
External data quality requirements associated with provision
agreement
11.3.11 Triggers
Triggers are rules that are fired when a particular event occurs. These events may be table updates, inserts, or deletes, and in the metadata we
want to maintain the following.
Name of the trigger
The author of the trigger
The owner of the trigger
Whether the trigger fires on an update, insert, or delete
Whether the trigger fires before or after the event
The trigger frequency (once per row or once per table)
The statements associated with the trigger
Which columns are associated with the trigger
11.3.13 Indexes
Usage statistics for a data set are useful for system resource manage-
ment, system optimization, and for business reasons (answering ques-
tions such as "Which data set generates the most interest among our
customers?").
11.4.2 Users
For each touchable entity in the system, we want to keep track of which
users in the community have which levels of access.
User name
Data object name
Access rights
This metadata is used both for enumerating the access rights a user
has for a particular data set, as well as enumerating which users have
rights for a given data set.
11.4.4 Aggregations
11.4.5 Reports
11.5 HISTORICAL
11.5.1 History
For all data objects, we want to keep track of who is reading the data.
The user name
The program name
The data object being read
The frequency of object reads
For all data objects, we want to keep track of who is writing the data.
The user name
The program name
The data object being written
The frequency of object writing
[Figure: A metadata model for domains and mappings. A DomainRef table (domainID, name, dClass, dType, description, source) is linked to domainVals (domainID, value) and domainRules (domainID, rule) tables; a Mappings table (mappingID, name, sourcedomain, targetdomain, description, source) is linked to mappingPairs (mappingID, sourcevalue, targetvalue) and mappingRules tables.]
In fact, all operations can be represented using a tree structure, where the
root of the tree represents the operator (in this case, "AND") and the
leaves of the tree represent the operands (in this case (employees.salary <
20000) and (employee.status <> "fuUtime")). We can apply this recur-
sively, so that each of those two operands is also represented using a tree.
This is shown in Figure 11.3.
There are two alternatives for maintaining the rules as metadata,
both of which rely on programmatic means for implementation. The
first alternative is to use a data table for embedding nested structures. We
can create a single table to represent expressions, where each expression
contains an operator (like "AND" or "<>") and a finite maximum num-
ber of operands. For argument's sake, let's constrain it to two operands.
ID   Operator   Operand 1          Operand 2
0    NAME       Employees.salary
1    CONSTANT   20000
2    LESSTHAN   0                  1
3    NAME       Employees.status
4    CONSTANT   Fulltime
5    NOTEQUAL   3                  4
6    AND        2                  5
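A small sketch of how rows like these could be generated from a nested rule expression follows; the tuple encoding of the expression tree is an assumption made for illustration.

# A rule expression is a nested tuple: (operator, operand, operand) or a leaf.
rule = ("AND",
        ("LESSTHAN", ("NAME", "Employees.salary"), ("CONSTANT", 20000)),
        ("NOTEQUAL", ("NAME", "Employees.status"), ("CONSTANT", "Fulltime")))

def flatten(expr, rows):
    """Post-order walk that assigns each node an ID and emits
    (id, operator, operand1, operand2) rows; operands reference node IDs."""
    op = expr[0]
    if op in ("NAME", "CONSTANT"):             # leaf nodes carry a literal value
        rows.append((len(rows), op, expr[1], None))
    else:                                      # interior nodes reference child IDs
        left = flatten(expr[1], rows)
        right = flatten(expr[2], rows)
        rows.append((len(rows), op, left, right))
    return rows[-1][0]                         # ID of the node just emitted

rows = []
flatten(rule, rows)
for row in rows:
    print(row)
# Produces rows equivalent to the table above: the leaves first, then
# LESSTHAN, NOTEQUAL, and finally the AND node referencing them.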
The rule must be recreated from the representation in the table, but
if the rule is truly modified, then those tuples that compose the rule
within the table must be removed from the table, and the new version of
the rule must be inserted into the table as if it were a new rule. This
means that there is little opportunity for reuse, and rule editing is a very
read/write-intensive operation.
One more table is needed to keep track of which rules are associated
with specific rule sets, since different users may have different sets of
rules, even for the same data. Managing rule sets requires defining a
rule set name, associating it with a unique identifier, and having a sepa-
rate table linking the rule set id with the identifiers associated with each
defined rule.
create table rulesets (
RuleSetID integer,
RuleSetName varchar(100)
);
create table rulesetCollections (
RuleSetID integer,
RuleID integer
);
If we are using the first approach, the rule ID is the identifier asso-
ciated with the root of the expression tree for that rule. If we are using
the second approach, we use rule ID.
base tables are defined and the ways that information is extracted and
presented to users. By controlling the flow of metadata, the managers of
a data resource can control the way that knowledge is integrated
throughout the enterprise. Figure 11.4 shows metadata browsing.
The use of a metadata repository does not guarantee that the systems in
use in the enterprise will subscribe to the philosophy of centralized
knowledge management and control. Unfortunately, the existence of
useful tools does not necessarily induce people to use them.
As a more practical matter, the instituting of metadata use can
enhance an organization's overall ability to collect knowledge and reuse
data. But without an enforceable policy behind the decision to incorpo-
rate a metadata framework, the benefits will not be achieved. In tan-
dem, information policy drives the use of metadata, but the use of
metadata will also drive information policy.
Creating a central core of reference data and metadata requires an
organizational commitment to cooperation. As with the data owner-
ship policy, the information management policy must specify the ways
that data and metadata are shared within the company.
One more aspect of information policy embedded in the metadata
question is that of enterprise-wide accessibility. Between issues of pri-
vacy and security and knowledge sharing, there is some middle ground
that must indicate how metadata is shared and when accessibility con-
straints are invoked. This is a significant area of policy that must be
effectively legislated within the organization.
11.11 SUMMARY
We already saw data quality rules in Chapter 8. We can see that the
rules described there fall into the definitions described in Section 12.2.
Our data quality rules are structural assertions (such as domain and
mapping definitions), action assertions (such as our transformation
rules and domain assignment rules), or knowledge derivation rules
(such as our domain mapping and assignment rules or our derivation
rules). In any of these cases, our data quality rules match the specifica-
tions listed in Section 12.2.
1. Our data quality rules are declarative. The fact that we attempt
to move data quality rules from the executable program to the
world of content proves that our data quality rules are declara-
tive and not procedural.
2. Each specific data quality rule applies to one specific operational
or declarative assertion, demonstrating atomicity.
3. We have defined a well-formed semantic for specifying data qual-
ity rules, yielding a well-formed specification language.
4. Each rule in a system exists in its own context and can be viewed,
modified, or deleted without affecting any other rule in the set.
Data quality and business rules reflect the ongoing operations of a busi-
ness. In any large environment, there are many situations where the
same business rules may affect more than one area of operations. This
permits us to collapse enterprise-wide usage of predefined domains and
mappings into a coordinated, centralized repository. Not only do data-driven defining rules represent enterprise knowledge; executable declarations can also represent operational knowledge that can be centralized. Once the repository of rules is centralized, the actual
processing and execution of these rules can be replicated and distrib-
uted across multiple servers located across an enterprise network.
Because the rules are not embedded in source code in unmanageable (and
indeterminate) locations, when the business operation changes, it is more
efficient to update the rule base to speed up the implementation of modi-
fied policies. Changes to the rule base, as long as they do not cause incon-
sistencies within the rule base, can be integrated quickly into execution.
Such a system incorporates both state ("memory") and inference ("knowledge") into the sys-
tem. As the execution progresses, the choices of rules change as infor-
mation about the external events is integrated into the knowledge base.
As an example, consider these two rules.
1. If the oil consumption exceeds 21 gallons, then fax an order to
the oil delivery company.
2. If the oil consumption exceeds 21 gallons for 10 days in a row,
then fax an order to the oil delivery company.
Rule 1 implies no state, and if the external event occurs, the rule should
be fired. Rule 2 requires that some knowledge be maintained (the number of consecutive days the threshold has been exceeded) so these facts are integrated into a knowledge base.
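A minimal sketch of the difference follows. Rule 1 can be evaluated against a single day's reading, while Rule 2 requires the engine to remember how many consecutive days the threshold has been exceeded; the class and the numbers are illustrative only.

class OilConsumptionRules:
    THRESHOLD_GALLONS = 21
    REQUIRED_DAYS_IN_A_ROW = 10

    def __init__(self):
        self.consecutive_days = 0      # the "memory" required by Rule 2

    def rule1(self, gallons_today: float) -> bool:
        # Stateless: fires on any single day's reading over the threshold.
        return gallons_today > self.THRESHOLD_GALLONS

    def rule2(self, gallons_today: float) -> bool:
        # Stateful: fires only after 10 consecutive days over the threshold.
        if gallons_today > self.THRESHOLD_GALLONS:
            self.consecutive_days += 1
        else:
            self.consecutive_days = 0
        return self.consecutive_days >= self.REQUIRED_DAYS_IN_A_ROW

rules = OilConsumptionRules()
for day, gallons in enumerate([22, 23, 25, 24, 22, 26, 23, 22, 24, 25], start=1):
    if rules.rule2(gallons):
        print(f"day {day}: fax an order to the oil delivery company")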
12.6.2 Scalability
Rules can be restated in action form, if not already worded that way.
All assertional form rules are changed so the assertion becomes the
condition, and if the assertion is not true, then an alert action takes
place. A rule is then said to be fired when its condition is evaluated to
true. This is also referred to as triggering a rule.
For interactive systems, a set of inputs signals the evaluation stage
of the rules engine. An input value that allows one or more conditions
to evaluate to true is called a triggering event.
12.9.1 Isolation
Throughout the book, we have seen that a major driver for defining and
using a rules system is the disengagement of the statement of the busi-
ness operations and policies from the technical implementation of those
rules and policies. Therefore, the ability to isolate the rules from the
application that uses those rules is a requirement.
A strategic benefit of isolation is the encapsulation of the rules as
content, which can then be managed separately from any application that
uses those rules. The rule definitions in Chapter 8 are designed to provide
this isolation, especially when managed through a separate interface.
12.9.2 Abstraction
Abstraction refers to the way that rules are defined in the system. This
can encompass a GUI that queries the user for rules or a rules language
for describing rules. This requirement dimension covers the question of
how rules are defined and not necessarily how the rules actually interact
within an executing system. The rules in Chapter 8 are meant to be
edited, modified, and tested from within a separate GUI.
12.9.3 Integration
1. Can the rules engine execute as a server? If so, then the rules
engine is likely to be loosely coupled and can be implemented as
a distributed component or even replicated across the enterprise.
2. Are the rules read at runtime or compile time? The rules being
read at compile time would indicate that the rules are read once
when the application is built, as opposed to the application hav-
ing access to a rule base during execution. The implication is that
at execution/runtime, there is a more dynamic system that allows
for rules to evolve as time moves forward.
3. Does the rules engine require access to a database? On the one
hand, requiring a database forces an additional cost constraint,
but on the other hand, the engine may store rules in a proprietary
format.
4. How is a rule system created? How are rules defined, and how
are they stored? How are rules moved from the definition stage
to the execution stage?
5. Is the rule system integrated as a library in an application, or is it
a standalone application? This question asks whether you can
integrate a rules engine and a set of rules as part of another appli-
cation.
6. When updating a rule base, does the entire application need to be
rebuilt, or are builds limited to a subset of the application? To
what degree is the system dependent on the definition of rules?
7. How is the knowledge base maintained during execution? Is
there a database that holds the newly defined assertions?
8. How is the knowledge base maintained during application
update? This question asks how a knowledge base is stored and
restored when the application is halted. If there is a state that
exists in the system, then when the system is brought down, there
should be some format for state persistence.
9. If the rules are integrated directly into a compiled application, is
there a facility for management of the distribution of the applica-
tion? This question asks how rules are distributed to server or
broker applications that execute in a distributed fashion.
10. How are rules validated when updated? It is possible that a new
rule can invalidate or be inconsistent with other rules that are
already in the rule base. Is there a way to check the validity of a
rule set?
11. Is there a debugging facility? Is there a separate means for testing out a rule set before putting it into production? If there is a problem with a production rule base, is there a tracing facility that can be turned on so that the execution can be monitored?
Associated with any set of rules is the specter of a rule base gone wild,
filled with meaningless trivialities and stale rules that only clog up the
system. When using a rules system, the rules engineer must be willing to commit to understanding both the rule definition system and the rule management system.
The rules approach requires a dedication to detail, since all objects operating in the business process, as well as all attributes of each object, must be specified. It also requires an understanding of business process analysis.
In Chapter 10, we looked at use-case analysis and how it affects
requirements. It is at this level of granularity that these skills come in
handy. Ultimately the rule specification operates on a set of objects rep-
resenting entities in the real world, and whether the implementation of
rules is data-oriented or object-oriented, the application must be aware
of all potential actors.
So now that we know more about rules and rules engines, how do we
use the data quality rules described in Chapter 8? This requires a few
steps.
The first step involves the selection of a rules system that will execute our
data quality rules. We can use the guidelines and questions discussed in
Section 12.9. In the case of data quality rules, depending on whether the
rules system is being used for real-time purposes or for offline purposes,
there may be different answers to each of the questions in Section 12.9.5.
For example, if we are using the data quality rules for validating a
data mart, we might want our rules engine to be slightly decoupled but
still integrated with the data loading process. If we are using a rules
engine to generate GUIs, the system can be completely decoupled. If we
want to use rules to ensure the correct operation of our systems, we
might want a tightly coupled rule engine (see Figure 12.2).
[Figure 12.2: Coupling options for a rules engine. Data input flows into an application process that consults a rule set through a rules engine, which may be tightly coupled to the application or run as a separate, decoupled component.]
To use the rules in Chapter 8, we must translate them from their syntac-
tic definitions into a form suitable for an off-the-shelf rules engine. To
demonstrate, let's turn some of our rules from Chapter 8 into a form
that either is a direct assertion or an if-then statement.
Non-null value rule
Attribute B nulls not allowed
is changed into
Assert !isNull(B);
Attribute value restriction
Restrict GRADE: value >= 'A' AND value <= 'F' AND value != 'E'
is changed into
Assert (GRADE >= 'A') AND (GRADE <= 'F') AND (GRADE != 'E')
For the most part, this translation is relatively straightforward. In
order to represent domains and mappings, any chosen rules engine
must support set definitions with a syntax that allows for inserting
strings into sets and checking for set membership. Domains are then
represented as sets of string values, and mappings can be represented as
sets of strings composed of the source domain value, a separator string,
and the target domain value.
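As a minimal sketch of that representation (the separator character and the sample domain and mapping values are illustrative assumptions):

# A domain is just a set of valid string values.
state_domain = {"NY", "NJ", "CT"}

# A mapping is represented as a set of "source<SEP>target" strings.
SEP = "|"
state_to_region = {f"NY{SEP}NORTHEAST", f"NJ{SEP}NORTHEAST", f"CT{SEP}NORTHEAST"}

def in_domain(value: str, domain: set) -> bool:
    return value in domain

def mapping_holds(source: str, target: str, mapping: set) -> bool:
    # Membership of the composed string asserts that the pair belongs to the mapping.
    return f"{source}{SEP}{target}" in mapping

print(in_domain("NY", state_domain))                      # True
print(mapping_holds("NY", "NORTHEAST", state_to_region))  # True
print(mapping_holds("NY", "SOUTH", state_to_region))      # False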
One more example would be the completeness rule.
IF (Orders.Total > 0.0), Complete With
{Orders.Billing_Street,
Orders.Billing_City,
Orders.Billing_State,
Orders.Billing_ZIP}
this would change into the following.
IF (Orders.Total > 0.0) THEN !isNull(Orders.Billing_Street) AND
!isNull(Orders.Billing_City) AND !isNull(Orders.Billing_State) AND
!isNull(Orders.Billing_ZIP);
A thorough walkthrough of the rules will clarify a mapping into
the assertion/if-then format. Then the rules are ready for the next step
(see Figure 12.3).
The next step is that the rules must be validated, and that entails guaranteeing that no set of two or more rules is contradictory.
Actually, the brevity of this section belies many issues, the most important one being that validation of rule-based systems is extremely difficult and is a current topic of research in the world of databases, artificial intelligence, and knowledge-based systems. In essence, it is easy to say, "Validate the rules" but much harder to actually do it. Hopefully, if
we constrain our system somewhat and don't allow dynamic inclusion of
new rules, the process may be easier (see Figure 12.4).
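As a very restricted sketch of what such validation might look like, suppose each rule has been reduced to an attribute plus a coarse null policy; a pairwise pass can then catch the simplest class of contradiction, such as the conflicting business-phone rules shown earlier in this chapter. The tuple encoding of the rules is an assumption made for illustration, not a general validator.

# Each rule is reduced to an attribute plus a coarse null-policy flag:
#   "NULLS_NEVER"      means nulls are not allowed under any condition
#   "NULLS_SOMETIMES"  means nulls are exempted/allowed under some condition
rules = [
    ("ExemptNonBusiness", "Customers.Business_Phone", "NULLS_SOMETIMES"),
    ("RequiredBizPhone",  "Customers.Business_Phone", "NULLS_NEVER"),
]

def find_null_contradictions(rules):
    """Report attribute-level collisions between 'never null' and
    'sometimes null' assertions, the simplest kind of contradiction."""
    conflicts = []
    by_attribute = {}
    for name, attribute, policy in rules:
        for other_name, other_policy in by_attribute.get(attribute, []):
            if {policy, other_policy} == {"NULLS_NEVER", "NULLS_SOMETIMES"}:
                conflicts.append((other_name, name, attribute))
        by_attribute.setdefault(attribute, []).append((name, policy))
    return conflicts

print(find_null_contradictions(rules))
# [('ExemptNonBusiness', 'RequiredBizPhone', 'Customers.Business_Phone')]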
The last step is importing the rules into the rules system. Depending on
the system, this process may involve something as simple as inserting rules
into a database or as difficult as converting rules from a natural-language-
based format to instances of embedded C++ or Java class objects.
An interesting goal is to use automatic means to translate rules
defined with the natural language format into a format that can be
loaded into a rules system. The process of translation from one format
to another is called "compilation."
12.12 SUMMARY
In this chapter, we looked more closely "under the hood" of a rules sys-
tem. First we further refined our understanding of what a rule is, and
then we discussed the specifics of business rules. In reference to our
rules formalism defined in Chapter 8, we again posited that data quality
rules and business rules are really the same thing.
We then looked a little more closely at rules systems. A rules engine
drives the execution of rules by identifying which rule conditions have
become true and selecting one of those rules to execute its action. This
process can be repeated on all rules whose conditions have been fulfilled.
We discussed the advantages and disadvantages of the rules
approach, including the augmented capabilities, the ability to implement
business policies relatively quickly, and opportunities for reuse. A discus-
sion of how rules engines worked followed. We also looked at how to eval-
uate a rules system and different philosophies embedded in rule systems.
We focused a bit on some of the drawbacks of using a rule-based
system, including the work required to focus on details, as well as the
fact that many users have inflated expectations when moving from an
exclusively programmed system to one that is rule based. We also saw
that using a rules system does not eliminate programming from the
environment, although rules "programming" may be simpler than
using complex programming languages.
Finally, we looked at the steps in transforming the data quality rules
that were discussed in Chapter 8 into a format that can be imported into a
real rules system. This involves selecting a rules system, translating the
rules into the appropriate form, validating the rules to make sure that there
are no contradictions, and importing those rules into the rules system.
13
METADATA AND RULE DISCOVERY
Up until now, our discussion has centered on using data quality and
business rules as tools for leveraging value from enterprise knowledge.
We have shown that we can use data domains and the mappings
between those domains to consolidate distributed information as a sin-
gle metadata resource that can be shared by the entire organization.
Thus far, however, we have concentrated on the a priori definition of
data domains, mappings, data quality and business rules, and general
metadata.
In this chapter, we focus on analyzing existing data to distinguish
the different kinds of metadata. The processes and algorithms presented
help us find metadata that can be absorbed into a centrally managed
core resource and show us how to manage the metadata and its uses.
Domain discovery is the process of recognizing the existence and
use of either enumerated or descriptive domains. The existence of the
domain is interesting metadata; the domain itself can be absorbed as
enterprise reference data. Having identified a domain, it is also interest-
ing to explore whether there are any further derived subdomains based
on the recognized domain. Along with domain discovery and analysis is
the analysis of domains to detect any multiple domains embedded in the
discovered domain. We can also analyze columns that use data domains
to see whether attributes are being overloaded (used for more than a
single purpose) and attempt to perform mapping discovery.
Another significant area of investigation is the discovery of keys in
collected data, merged data, and legacy databases. Older, nonrelational
database management systems did not have primary key requirements,
so when migrating a legacy database into a modern RDBMS, it may be
necessary to find primary and foreign keys and to validate the referen-
tial integrity constraints.
A third area of investigation involves the discovery of data qual-
ity/business rules that already exist in the data. There are two steps to
formulating these rules: identifying the relationships between attributes
and assigning meaning to those relationships. We will look at a tech-
nique for discovering association rules and the methodology of figuring
out what those associations mean.
You will recall that data domains, based on a base data type, consist of
either enumerated lists of values or a set of rules that specify restrictions
on the values within that data type, using operations that are allowed
on values within that data type. In Chapter 7, we covered the definition
of data domains through enumeration and description. In many data
sets, common data domains are already used, whether by design or not.
We can presume that if data domains are used by design, they will
already have been documented. But if not, we can make use of heuristic
algorithms to find used data domains.
Domain discovery is the recognition that a set of values is classified
as a set and that one or more attributes draw their values from that set.
Once a domain has been discovered, there is a manual validation phase
to verify the discovered domain. Subsequent analysis can be performed
to understand the business meaning of the discovered domain. This stage
applies a semantic meaning to a domain and can then be used as the
basis for the validation of data domain membership (see Figure 13.1).
[Figure 13.1: Candidate data domains (Data Domain 1 through Data Domain N) feeding a metadata and reference data repository.]
[Figure: Domain analysis example. Column names such as AccountNumber and CurrencyType appear in more than one table, which may indicate that the values assigned to these attributes belong to a specific domain; attributes with a low occurrence of nulls are also flagged as potential domains. An agreement report scores each analyzed attribute against known domains (for example, 97%, 65%, 23%, 12%, and 0% agreement).]
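A minimal sketch of the agreement computation behind a report like the one in the figure; the known domains and the profiled column values are illustrative assumptions.

def domain_agreement(column_values, domain):
    """Fraction of non-null column values that are members of the domain."""
    values = [v for v in column_values if v is not None]
    if not values:
        return 0.0
    matches = sum(1 for v in values if v in domain)
    return matches / len(values)

known_domains = {
    "CurrencyType": {"USD", "EUR", "GBP", "JPY"},
    "TransactionType": {"BUY", "SELL", "TRANSFER"},
}

# Hypothetical column extracted from a table being profiled.
column = ["USD", "USD", "EUR", "JPY", "XXX", None, "GBP"]

for name, domain in known_domains.items():
    print(f"{name}: {domain_agreement(column, domain):.0%} agreement")
# A high agreement percentage suggests the column draws its values from
# (or should be validated against) that known domain.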
Aside from looking at enumerated value sets, which are easily analyzed
when presented with already extant data value collections, it is also use-
ful to look for patterns that might signal membership in a rule-based
data domain. For string-based attributes, our goal is to derive a rule-
oriented domain by analyzing attribute value patterns.
For example, if we determine that each value has 10 characters, the
first character is always A, and the rest of the characters are digits, we
have a syntax rule that can be posited as a domain definition. We can use the discovered definition as a validation rule, which we would then add to a
metadata database of domain patterns. Simple examples of rule-based
data domains include telephone numbers, ZIP codes, and Social Security
numbers. What is interesting is that frequently the pattern rules that
define domains have deeper business significance, such as the geographical
aspect of Social Security numbers, as we discussed in Chapter 7, or the
hierarchical location focus associated with ZIP codes.
As a more detailed example, consider a customer accounts data-
base containing a data field called ACCOUNT_NUMBER, which
turned out to always be composed of a two-character prefix followed
by a nine-digit number. There was existing code that automatically gen-
erated a new account number when a new customer was added. It
turned out that embedded in the data as well as the code were rules
indicating how an account number was generated. Evidently, the two-
character code represented a sales region, determined by the customer's
address, while the numeric value was assigned as an increasing number
per customer in each sales region. Because this attribute's value carried
multiple pieces of information, it was a classical example of an over-
loaded attribute. The discovery of a pattern pointed to a more compli-
cated business rule, which also paved the way for the cleaving of the
overloaded information into two separate data attributes.
Our first method for pattern analysis is through the superimposi-
tion of small, discrete "meaning" properties to each symbol in a string,
slowly building up more interesting patterns as meanings are assigned
to more symbol components. Initially, we make use of these basic sym-
bol classifications:
Letter
Digit
Punctuation
White space
Alphabetic
Alphanumeric
Numeric
First name
Last name
Business word
Address words
One of any other categorized word class
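A minimal sketch of the character-level pass described above; the choice of symbol codes (D for digit, A for letter) and the sample values are assumptions made for illustration.

from collections import Counter

def pattern_signature(value: str) -> str:
    """Map each character to a coarse symbol class, collapsing a value such as
    '212-555-1212' to 'DDD-DDD-DDDD' and 'AB123456789' to 'AADDDDDDDDD'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("A")
        elif ch.isspace():
            out.append(" ")
        else:
            out.append(ch)          # punctuation is kept literally
    return "".join(out)

values = ["AB123456789", "XY987654321", "212-555-1212"]
print(Counter(pattern_signature(v) for v in values))
# A dominant signature (for example, 'AADDDDDDDDD') becomes a candidate
# rule-based domain definition, which can then be reviewed for business meaning.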
[Figure: Pattern-based domain assignment. Values with patterns such as DD-AAAAA-DD or DDD-DD-DDDD, or values that contain a street designator or a business word, are assigned to domains such as a Street Designator Domain or a Business Word Domain.]
Domain
We will also find that as our domain inventory grows, it is possible that similar,
but not exact, data value sets may appear in our set of domains. When
this occurs, there are two possibilities. The first is that one domain may
be completely contained within another domain, and the other is that
two (or more) similar domains may actually be subsets of a larger data
domain.
We refer to the first case as a subdomain, which is a set of values
that is a subset of another domain. The second case is a superdomain, in
which the true domain that the attributes rely on is the composition (or
union) of the smaller domains.
Note that unless an occasional "self-test" of each domain against
the rest of the set of domains is performed, the only way to recognize
these kinds of overlapping domain value issues is when comparing an
attribute's values against many different known domains. If an analyzed
attribute appears to belong to more than one domain, it may signal the
existence of a subdomain lurking among the known domains. This sub-
domain may represent another business rule, or we might infer that two
similar domains represent the same set of values, in which case the
domains might be merged.
[Figure: A single collection of values (HIGH, LOW, GREEN, RED, YELLOW) split into two embedded domains, Domain 1 and Domain 2.]
the source of the data that populates this mapping, along with an
assigned identifier for the mapping.
create table mappingref (
name varchar(30),
sourcedomain integer,
targetdomain integer,
description varchar(1024),
source varchar(512),
mappingid integer
);
The value pairs can all be stored in a single table, referenced by
mapping identifier. In this case, we arbitrarily limit the size of the values
to 128 characters or fewer.
create table mappingpairs (
mappingid integer,
sourcevalue varchar(128),
targetvalue varchar(128)
);
Finally, we represent our rules-based mappings using records that
consist of rule statements.
create table mappingrules (
mappingid integer,
rule varchar(1024)
);
Just as with data domains, the best way to identify mappings between
domains is through conversations with experts in the area. Since map-
ping discovery is more complex than domain discovery, any way to
avoid relying on an automated process can only be beneficial.
In a one-to-one mapping, given the value of the first attribute, the second attribute's value is predetermined, so this rule can be used for both validation and automated completion. These kinds of mapping member-
ships may exist between composed sets of attributes as well and are also
referred to as functional dependencies.
A one-to-one mapping has certain characteristics that are useful
for integration into the discovery process:
1. All values from the first attribute must belong to one domain.
2. All values from the second attribute must belong to one domain.
3. When all distinct pairs have been extracted, there may not be any
duplicate entries of the source attribute value.
Note that because different source values can map to the same tar-
get value, the third characteristic does not hold for the target attribute
value.
These are some of the heuristics used to determine a one-to-one
mapping:
• The number of distinct values in the source attribute is greater than or equal to the number of distinct values in the target attribute. This reflects the one-to-one aspect. Since each source must map to one target and multiple sources may map to the same target, we cannot have more values in the target than in the source.
• The attribute names appear together in more than one table or relation. The existence of a tightly bound relationship between data sets will be evident if it appears more than once, such as a mapping of ZIP codes to cities.
• The attribute values appear in more than one table. In this case, we are not looking for the names of the attributes occurring frequently but the actual usage of the same domains.
In reality, one-to-one mappings are essentially equivalent to func-
tional dependencies, and we can use the association rule discovery algo-
rithms (in Section 13.5) to find mappings.
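A minimal sketch of testing the one-to-one (functional dependence) property over extracted value pairs; the ZIP-code-to-city pairs are illustrative only.

def is_one_to_one_mapping(pairs):
    """True if no source value maps to more than one distinct target value.
    'pairs' is an iterable of (source, target) tuples extracted from a table.
    Different sources may still share a target, which is allowed."""
    seen = {}
    for source, target in pairs:
        if source in seen and seen[source] != target:
            return False          # the same source maps to two different targets
        seen[source] = target
    return True

zip_to_city = [("10001", "New York"), ("10001", "New York"),
               ("07030", "Hoboken"), ("06901", "Stamford")]
print(is_one_to_one_mapping(zip_to_city))   # True

conflicting = zip_to_city + [("10001", "Albany")]
print(is_one_to_one_mapping(conflicting))   # False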
When analysts examine the data, they already have some ideas as to the set of validation rules that should be applied. But in the discovery process, we don't always know what to look for, which makes that process particularly hard. In addition, we need to be able to distinguish between rules that make sense or have some business value and spurious or tautological rules, which do not have any business value. Automated rule discovery is not a simple task, but we do have some tools that can be used for rule discovery. The first tool we will look at is clustering.
[Figure: Clustering.]
When we discover rules instead of defining them from the start, our goal is the same: classifying records based on a rule. We can use the clustering process to help perform this
classification. In the next sections, we will look at ways of using cluster-
ing to discover certain kinds of rules.
[Figure legend: points where Value A == Value B versus points where Value A != Value B.]
The determination of the two nearest neighbors is the tricky part, and there
are a number of methods used for this determination.
• Single link method: The distance between any two clusters is determined by the distance between the nearest neighbors in the two clusters.
• Complete link: The distance between any two clusters is determined by the greatest distance between any two data items in the two different clusters.
• Unweighted pair-group average: The distance between any two clusters is calculated as the average distance between all pairs of data items in the two different clusters.
• Weighted pair-group average: This method is similar to the unweighted pair-group average, except that the size of the cluster is used as a weight.
• Unweighted pair-group centroid: A centroid is a point calculated as the average point in a space formed by a cluster. For any multidimensional space, the centroid effectively represents the center of gravity for all the points in a cluster. In this method, the distance between any two clusters is measured as the distance between centroids of the two clusters.
• Weighted pair-group centroid: This is similar to the unweighted pair-group centroid method, except that the size of the cluster is used as a weight.
• Ward's method: This method measures the distance between two clusters based on the total distance between all members of the two clusters.
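For illustration, here is a minimal sketch of agglomerative clustering of one-dimensional values using the single link method; the values and the distance threshold are arbitrary assumptions.

def single_link_clusters(values, threshold):
    """Greedy agglomerative clustering: repeatedly merge the two clusters whose
    nearest members are closest (single link), until no pair is within threshold."""
    clusters = [[v] for v in sorted(values)]
    def nearest(a, b):
        return min(abs(x - y) for x in a for y in b)
    while len(clusters) > 1:
        # Find the closest pair of clusters under the single-link distance.
        (i, j), dist = min(
            (((i, j), nearest(clusters[i], clusters[j]))
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda item: item[1])
        if dist > threshold:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters

amounts = [9.95, 10.00, 10.05, 250.00, 251.50, 10000.00]
print(single_link_clusters(amounts, threshold=5.0))
# [[9.95, 10.0, 10.05], [250.0, 251.5], [10000.0]]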
Clustering is useful for more than just value range detection. When we
can characterize each data record as a combination of measurable prop-
erties, we can use clustering as a classification method that depends on
those properties. Since all members of a cluster are related because they
are near to each other based on some set of metrics, we can aim toward
translating the clustering characteristics into a description that explains
why those records belong together. If the distinction is clear, it is a
strong argument to represent those characteristics as one of our data
quality rules.
For example, let's assume that we have a database of customers
that includes a significant amount of demographic information as well
as sales histories. If we wanted to determine whether there were differ-
ent classes of customers based on both their sales histories and their
demographics, we can apply a clustering algorithm to break down the
set of customers into different classes. If our goal is to determine the
characteristics of different classes of customers, we can direct the clus-
tering based on fixing the "type of customer" property by assigning one
of a set of values based on sales volume. Once the clustering has been performed, the resulting clusters can be examined to characterize each class of customer.
Obviously, decision trees are useful for harvesting business rules from a
data set. But how do we build a decision tree? Here we discuss the
CART algorithm. CART is an acronym for Classification and Regres-
sion Tree, and a CART tree is built by iteratively splitting the record set
at each step based on a function of some selected attribute.
The first step in building a CART tree is to select a set of data
records on which the process is to be performed; we might want to use only a subset of the data to build the tree.
[Figure: A decision tree for classifying animals, with splits such as mammal?, domesticated?, reptile?, rodent?, barks?, feline?, long neck?, Everglades?, Hawaii?, bumpy skin?, and flies?]
Each path of the tree represents a set of selection criteria for classifica-
tion. If we decide that the leaf represents a valid set of records (that is,
there is some valuable business relevance), we walk the tree from the
root to the leaf, collecting conditional terms, to accumulate search crite-
ria in the data set. By the end of the traversal, we basically have the con-
dition under which the records in that set are classified. If that
classification should enforce some assertion, we can use the condition
as the condition in a data quality consistency rule. Other rules can be
constructed similarly.
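A minimal sketch of that walk, assuming a hypothetical tree structure in which each internal node carries the text of its conditional term:

class Node:
    def __init__(self, condition=None, yes=None, no=None, label=None):
        self.condition = condition  # text of the test, e.g. "mammal = 'Y'"
        self.yes, self.no = yes, no
        self.label = label          # set only on leaves

def path_condition(root, target_label):
    # Depth-first search; collect the conditions taken to reach the leaf.
    def walk(node, terms):
        if node is None:
            return None
        if node.label == target_label:
            return terms
        found = walk(node.yes, terms + [node.condition])
        if found is not None:
            return found
        return walk(node.no, terms + ["NOT (" + node.condition + ")"])
    terms = walk(root, [])
    return " AND ".join(terms) if terms else None

# Hypothetical fragment of the animal-classification tree
tree = Node("mammal = 'Y'",
            yes=Node("domesticated = 'Y'",
                     yes=Node(label="pets"),
                     no=Node(label="wild mammals")),
            no=Node(label="non-mammals"))
print(path_condition(tree, "wild mammals"))
# -> mammal = 'Y' AND NOT (domesticated = 'Y')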
13.7 SUMMARY
14.1 STANDARDIZATION
If we try to compare two items that are not of the same class, we are
told that we cannot compare apples with oranges. When we talk about
standardization, we are really trying to make sure that we compare
apples with apples and oranges with oranges. Standardization is a
process by which all elements in a data field (or a set of related data
fields) are forced to conform to a standard.
There are many benefits to this process, the first of which we have
already mentioned: conformity for comparison. When aggregating data
down a column (or set of columns), we will be in much better shape
when we know that all the values in those columns are in standard
form. That way, our sums will not be skewed by erroneous data values.
The same goes for using standard values as foreign keys into other
tables: we can feel comfortable that referential integrity is more likely
to be enforced if we stay within the standard.
Another interesting benefit of standardization is the ability to
insert an audit trail for data error accountability. The process of stan-
dardization will point out records that do not conform to the standard,
and these records can be tagged as incorrect or forwarded to a reconcil-
iation process. Either way, by augmenting a table with audit trail fields
and recording at which point in the data processing chain the tables are
standardized, we can trace back any significant source of errors. This
gives us a head start in analyzing the root cause of nonstandard data.
The first step in defining a standard should be obvious: Invite all data
consumers together to participate in defining the standard because a
standard is not a standard until it is recognized as such through the con-
currence of the users. In practice, a representative body is the best vehi-
cle for defining a data standard.
The second step is to identify a simple set of rules that completely
specify the valid structure and meaning of a correct data value. The
rules in this set may include syntactic rules that define the symbols and
format that a data value may take as well as data domain and inclusion
rules that specify the base data domains from which valid value components
may be taken. The most important part of this step is making sure
that there is a clear process for determining if a value is or is not in stan-
dard form.
The third step in standard definition is presenting the standard to
the committee (or even the community as a whole) for comments.
Sometimes small items may be overlooked, which might be caught by a
more general reading. And, remember, a standard only becomes a stan-
dard when it is accepted by all the data consumers. After a brief time
period for review and comments, an agreement is reached, and the stan-
dard is put in place.
One critical idea with respect to data value standards is that if a value is
dictated to conform to a standard, there must be a way to test to see if a
data value conforms to the standard form. This means that the defini-
tion of a standard must by association imply the test.
This test can usually be embodied using application code and refer-
ence data sets. Going back to our U.S. telephone number example, it is
easy to write a program that will check if the telephone number string
itself matches the defined format. In addition, two reference tables are
needed. The first is a domain table listing all valid NPA prefix codes that
are valid for the covered geographical area. The second is a domain
mapping between NXX codes (the second digit triplet) and the valid
NPA codes associated with each NXX. The first table is used to test our
first rule that the NPA code is a valid one. The second table allows
us to test if the local exchange code is used within that NPA.
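A sketch of such a test in Python, assuming a simple NNN-NNN-NNNN string format and two tiny illustrative reference tables (a production system would load the full NPA and NPA-to-NXX domain tables):

import re

VALID_NPA = {"212", "509", "410"}               # illustrative area codes
NPA_TO_NXX = {"212": {"555", "930"},            # exchanges valid in each NPA
              "509": {"555"},
              "410": {"730"}}

def phone_in_standard_form(phone):
    # Rule 1: the string must match the defined format
    if not re.fullmatch(r"\d{3}-\d{3}-\d{4}", phone):
        return False
    npa, nxx, _ = phone.split("-")
    # Rule 2: the NPA must be a valid area code
    if npa not in VALID_NPA:
        return False
    # Rule 3: the exchange (NXX) must be in use within that NPA
    return nxx in NPA_TO_NXX.get(npa, set())

print(phone_in_standard_form("212-555-1234"))   # True
print(phone_in_standard_form("212-730-1234"))   # False: NXX not valid for NPA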
Another very good example is the U.S. Postal Standard. The Postal
Service has a well-defined standard for mail addressing. This standard
includes a definition of each component of an address (first name, last
name, street, street suffix, city, etc.), as well as standard and nonstandard
forms for the wording of particular components. For example, there is a
set of standard abbreviations for street types; the address suffix
"AVENUE" has a set of commonly used abbreviations: {"AV," "AVE,"
"AVEN," "AVENU," "AVENUE," "AVN," "AVNUE"}, but only one
("AVE") is accepted as the Postal Service standard. The Postal Service
also has a predefined domain of valid ZIP codes, as well as a mapping
between valid street name and city combinations and an assigned ZIP
code. In each case, though, an address can be tested to see whether it is in
valid form. We look at this in greater detail in Section 14.9.
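As a small illustration of applying such a standard, the following sketch maps the commonly used abbreviations of a street suffix to the single Postal Service form; the table shown covers only the AVENUE example, not the full Publication 28 list.

SUFFIX_STANDARD = {
    "AV": "AVE", "AVE": "AVE", "AVEN": "AVE", "AVENU": "AVE",
    "AVENUE": "AVE", "AVN": "AVE", "AVNUE": "AVE",
}

def standardize_suffix(token):
    # Return the standard abbreviation, or the token unchanged if it
    # is not a recognized (standard or nonstandard) suffix form.
    return SUFFIX_STANDARD.get(token.upper(), token)

print(standardize_suffix("Avenu"))   # AVE
print(standardize_suffix("BLVD"))    # BLVD (not covered by this toy table)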
Many types of what appear to be random errors actually are due to rel-
atively common problem paradigms, as we see in the next sections.
They are outlined in Figure 14.2.
Attribute Granularity The data model is not configured for the proper granularity of values (for example, one data attribute holding street, city, state, and ZIP code).
Strict Format Conformance The format for data entry is too restrictive (for example, insisting on a first name and middle initial).
Semistructured Format The format for data entry is too permissive (for example, freeform text used in a description field).
Transcription Error The data entry person makes a mistake when transcribing data (for example, spelling the name LOSHIN as LOTION).
Misfielded Data Data from one attribute appears in a different attribute (for example, the street address appears in an ADDRESS2 field instead of ADDRESS1).
Floating Data Small-sized attributes cannot support longer values, so data flows from one field into the next (for example, company name data spills out of the name field and into the street address field).
Overloaded Attributes The same attribute contains data associated with more than one value, or the attribute represents more than one property of the entity (for example, inserting both names of a couple in the name field of a customer database when both members of the couple should be listed).
When the format of the data attributes is too restrictive, the data entry
user may not be able to correctly put the right information into the
database. Using the same example of customer name, let's assume that
we broke up one data field for name into three fields: last name, first
name, and middle initial. But there are many people who prefer to be
called by their middle names instead of their first names, in which case
the correct data should be last name, first initial, middle name. But since
there is only room for a middle initial, either both first initial and mid-
dle name are crammed into the first name field, or the middle name is
placed in the first name field and the first initial is placed in the middle
initial field, thereby reversing the customer's first two names!
Typists are not infallible, and typing errors creep into data. Common
mistakes are transcribed letters inside words, misspelled words, and
miskeyed letters. In one environment that maintained information on
businesses, many of the company names in the database had the amper-
sand (&) character as part of the name (Johnson & Smith). A frequent
error that occurred was the appearance of a 7 instead of the ampersand
(Johnson 7 Smith). This makes sense, however, when you see that on
one key the & character is shift-7. (So, the 7 might be due to a sticky
shift key!)
In the third example, we also have two parties, but in this case there
is a different kind of relationship represented by the business phrase "in
Trust for." The two parties here are John Smith and Charles Smith.
In the fourth example, we now have three parties: John Smith,
Mary Smith, and Charles Smith, linked together via a different business
term ("UGMA," an acronym for Uniform Gift to Minors Act).
In the fifth example, we have three parties but more relations
between the parties. To start off, we have John Smith and Mary Smith,
and now we also have the John and Mary Smith Foundation, which is
an entity in its own right. The relationship here is that John Smith and
Mary Smith act as trustees for the foundation that bears their names.
In all of these cases, the first step in data cleansing is the identification of
distinct data elements embedded in data records. In the most common
situations, where the data set under investigation consists of name,
address, and perhaps other important information, the data elements
may be as follows.
First names
Last names
Middle names or middle initials
Title (MR, MRS, DR)
Name Suffix (JR, SR, MD)
Position (SALES MANAGER, PROJECT LEADER)
Company name
Building
Street address
Unit address (Apt, Floor, Suite)
City
State
ZIP code
Business terms
We will need some metadata intended specifically for parsing rules. The
first metadata items we need are the data element types. In other words,
we want to know what kinds of data elements we are looking for, such
as names, street names, city names, telephone numbers, and so forth.
For each data element type, we need to define a data domain for
the valid values. This allows us to test to see if any particular data ele-
ment is recognized as belonging to a specific domain.
The next component we need for data parsing is the parser itself: a pro-
gram that, given a data value (such as a string from a data field), will
output a sequence of individual strings, or tokens. These tokens are
then analyzed to determine their element types. The token is at the same
time fed into a pattern analyzer as well as a value lookup analyzer.
[Figure: sample token patterns produced by the pattern analyzer, such as DD-DDDDDDD, CDD, DCC, and DDDDD-DDDD.]
If the token does not match a particular pattern, then we will see if that
value exists in one of the predefined data domains associated with the
parsing metadata. Note that a token may appear in more than one data
domain, which necessitates the next component, probability assignation.
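A minimal sketch of this stage, with hypothetical patterns and data domains: each token is matched against the patterns and then looked up in the domains, and every domain in which it appears is reported, leaving any ambiguity to the probability-assignment step.

import re

PATTERNS = {
    "ZIP+4": r"\d{5}-\d{4}",
    "ZIP": r"\d{5}",
    "INITIAL": r"[A-Z]\.?",
}
DOMAINS = {
    "FIRST_NAME": {"JOHN", "MARY", "CHARLES"},
    "TITLE": {"MR", "MRS", "DR"},
    "STREET_SUFFIX": {"AVE", "ST", "BLVD"},
}

def classify_token(token):
    t = token.upper().strip(",")
    matches = [name for name, pat in PATTERNS.items() if re.fullmatch(pat, t)]
    matches += [name for name, values in DOMAINS.items() if t in values]
    return matches or ["UNRECOGNIZED"]

for tok in "DR JOHN SMITH 123 MAIN AVE 11223-6523".split():
    print(tok, classify_token(tok))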
The parsing stage identifies those data elements that are recognized as
belonging to a specific element type and those that are not recognized.
The next stage of the process attempts to correct those data values that
are not recognized and to tag corrected records with both the original
and the corrected information.
kinds of data will bias the corrective process to that kind of informa-
tion. Thus, if a company's set of correcting rules is based on direct mar-
keting databases, there may be an abundance of rules for correcting
individual names but a dearth of rules for correcting business names.
The second flaw is that every organization's data is somehow dif-
ferent from any other organization's data, as are the business rules that
govern the use of that data. Relying on the business rules from other
organizations will still add value, especially if the data content is simi-
lar, but there will always be some area where humans will need to inter-
act with the system to make decisions about data corrections.
The third flaw is that data can only be perceived to be incorrect
when there are rules indicating correctness. Again, if we rely on other
sets of correctness rules, we may miss errors in the data that may pass
through provided correctness tests. An example of this in address cor-
rection is the famous East-West Highway in suburban Washington,
D.C. Because the expectation with addresses with the word "East" at
the beginning is that the word is being used as a direction prefix and not
as part of the street name itself, some applications inappropriately "cor-
rect" this to "E. West Highway," which is not the correct name.
14.5.2 Correction
14.5.3 Standardizing
might not be Elizabeth, but we can assign Elizabeth to the records any-
way just in case. In this context, standardization is being used purely as
a means to a different end: enhancement and linkage.
As we have seen in other sections, we can implement standardiza-
tion using the same kind of data domain and domain mapping method-
ologies from Chapters 7 and 8. As an example, we can create a data
domain for what we will consider to be the standard male first names
and a separate domain called "alternate names" consisting of diminu-
tive, shortened, and nicknames. This allows us to have a mapping from
the alternate names to the standard names, which we will use to
attribute each name record.
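A sketch of that mapping, with a tiny illustrative alternate-name domain:

STANDARD_NAME = {
    "LIZ": "ELIZABETH", "BETH": "ELIZABETH", "BETSY": "ELIZABETH",
    "BOB": "ROBERT", "ROB": "ROBERT", "BOBBY": "ROBERT",
}

def standard_first_name(name):
    # Map an alternate (diminutive, shortened, or nick-) name to the
    # standard form; names not in the alternate domain pass through.
    n = name.upper()
    return STANDARD_NAME.get(n, n)

print(standard_first_name("Liz"))     # ELIZABETH
print(standard_first_name("Robert"))  # ROBERT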
14.5.6 Enhancement
distance function: the closer the distance between two data values, the
more similar those two values are to each other. The simplest form of a dis-
tance function is a Euclidean distance on a Cartesian coordinate system.
For example, across a single integral dimension, we can compute a
distance between two integers as the absolute value of the difference
between the two values. In a plane, the distance between two points,
(x1, y1) and (x2, y2), is the square root of (x1 - x2)^2 + (y1 - y2)^2.
Even with nonnumeric data, we still have to formulate some kind
of quantitative similarity measure. Because a similarity measure yields a
score that measures the closeness of two data values, there must be
some way to characterize closeness, even if the values are character
strings. In general, we will want to be able to compute the similarity
between two multiattribute records, and that means that we must have
distance functions associated with all kinds of data types, as well as our
constrained data domains.
[Figure: example per-field comparisons and weights. For the Address field, "123 Main St" versus "123 Main" is scored by value parsing and component matching, with weight 3; for the ZIP field, "11223-6523" versus "11223" is scored so that an exact ZIP+4 match counts 1, a ZIP matching a ZIP+4 counts .85, and a ZIP matching a ZIP counts .75, with weight 2.]
The overall record similarity is then a weighted combination of the per-attribute scores: the sum over the attributes i = 0 to n of w_i * d(x_i, y_i), normalized by the sum of the weights w_i.
This gives us a basis for defining similarity measures. The next step
is in defining the difference function for each particular data type. We
have already seen the use of Euclidean distance for numeric domains.
Next we explore similarity measures for different data domains.
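A sketch of a weighted record similarity assembled this way, with hypothetical per-field scoring functions and weights (the name comparison here is a crude word-overlap stand-in for the string and phonetic measures discussed below):

def zip_similarity(a, b):
    # ZIP+4 exact match scores 1.0; matching 5-digit ZIPs score lower
    if a == b:
        return 1.0
    if a[:5] == b[:5]:
        return 0.85
    return 0.0

def name_similarity(a, b):
    # Placeholder: fraction of shared words (a real system would use
    # edit distance or phonetic comparison here)
    wa, wb = set(a.upper().split()), set(b.upper().split())
    return len(wa & wb) / max(len(wa), len(wb))

FIELDS = [("name", name_similarity, 3.0), ("zip", zip_similarity, 2.0)]

def record_similarity(r1, r2):
    # Weighted average of the per-field similarities
    total = sum(w * sim(r1[f], r2[f]) for f, sim, w in FIELDS)
    return total / sum(w for _, _, w in FIELDS)

a = {"name": "JOHN SMITH", "zip": "11223-6523"}
b = {"name": "JOHN R SMITH", "zip": "11223"}
print(round(record_similarity(a, b), 3))   # 0.74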
14.6.3 Thresholding
The first nonnumeric data type we will look at is the character string.
Clearly, comparing character strings for an exact match is straightfor-
ward. It is the determination of closeness for character strings that
becomes hazy. For example, we know intuitively that the last names
Smith, Smyth, and Smythe are similar. Or do we? If we are not native
English speakers, or if we had never heard all three names pronounced,
how would we know that these names all sound the same?
One way to measure similarity between two character strings is to
measure what is called the edit distance between those strings.
[Figure: per-field match thresholds combined into a total threshold: Name 75%, Address 67%, City 50%, State 100%, ZIP 85%, Telephone 100%, INC 100%.]
The edit distance between two strings is the minimum number of basic edit oper-
ations required to transform one string to the other. There are three
basic edit operations.
1. Insertion, where an extra character is inserted into the string.
2. Deletion, where a character has been removed from the string.
3. Transposition, in which two characters are reversed in their
sequence.
So, for example, the edit distance between the strings "INTER-
MURAL" and "INTRAMURAL" is 3, since to change thefirststring to
the second, we would transpose the "ER" into "RE" and delete the "E"
followed by an insertion of an "A."
Some people include substitution as an automatic edit operation,
which is basically a deletion followed by an insertion. Strings that com-
pare with small edit distances are likely to be similar, whereas those that
compare with large edit distances are likely to be dissimilar.
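For reference, here is a sketch of an edit distance computation, the standard dynamic program extended with adjacent transpositions; note that it counts substitution as a single operation, so it scores some pairs lower than the three-operation definition given above.

def edit_distance(s, t):
    # d[i][j] is the distance between the first i characters of s
    # and the first j characters of t.
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(edit_distance("INTERMURAL", "INTRAMURAL"))
# 2 here (transposition plus substitution); 3 under the three-operation definition
print(edit_distance("SMITH", "SMYTHE"))   # 2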
[Figure: worked edit distance examples. INTRANATIOANL becomes INTERNATIONAL through two transpositions and a substitution (edit distance 3); ORGANASATION becomes ORGANIZATION through two substitutions (edit distance 2); HOUSNGAUTHORTY becomes HOUSING AUTHORITY through two insertions; PRODUICTIVTY becomes PRODUCTIVITY through a deletion and an insertion.]
them all with the symbol "A." Unfortunately, the NYSIIS encoding is
also faulty in that it is biased toward English names. Figure 14.8 shows
some Soundex and NYSIIS encodings.
A system called Metaphone, developed in 1990, claims to better rep-
resent English pronunciation. The Metaphone system reduces strings to a
one- to four-letter code, with some more complex transformation rules
than Soundex or NYSIIS. Specifically, Metaphone reduces the alphabet to
16 consonant sounds: {B X S K J T F H L M N P R 0 W Y}, where the zero
represents the th sound. Transformations associated with character
sequences are defined based on the different phonetic constructs. For
example, a T can reduce to an X (representing a sh sound) if it appears in
the -TIA- or -TIO- context or to a 0 if it appears in the -TH- context.
Otherwise, it remains a T.
There are some ways to improve on the efficacy of these phonetic
representations, especially when addressing the issues described here. In
terms of similar sounding words that do not share the same initial sound,
the phonetic encoding algorithm can be applied to strings in both the for-
ward (in English, left to right) and the backward (right to left) directions.
In this way, words that are similar, like "DIXON" and "NIXON," may
have a chance of matching on the reversed phonetic encodings.
The way to address Soundex's inability to deal with silent letters or
alternate phonemes is to use a more complex algorithm such as Metaphone.
The issue of the truncated encoding is easy to address: Encode
the entire string without truncation.
Unfortunately, it is difficult to overcome the fact that these pho-
netic schemes are biased toward English names.
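For reference, here is a compact sketch of the classic Soundex encoding (keep the first letter, map the remaining consonants to digit classes, skip vowels, collapse repeated codes, and pad or truncate to four characters); the reversed and untruncated variants suggested above are small modifications of the same routine.

SOUNDEX_CODE = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
                **dict.fromkeys("DT", "3"), "L": "4",
                **dict.fromkeys("MN", "5"), "R": "6"}

def soundex(word):
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    digits = [SOUNDEX_CODE.get(ch, "") for ch in word]
    code = word[0]
    prev = SOUNDEX_CODE.get(word[0], "")
    for ch, d in zip(word[1:], digits[1:]):
        if d and d != prev:                      # skip vowels, collapse repeats
            code += d
        prev = d if ch not in "HW" else prev     # H and W do not separate codes
        if ch in "AEIOUY":
            prev = ""                            # vowels reset the previous code
    return (code + "000")[:4]

print(soundex("SMITH"), soundex("SMYTHE"))   # both S530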
14.6.6 N-Gramming
TER
ERN
RNA
NAT
ATI
TIO
ION
ONA
NAL
We can use the n-gramming technique as part of another similarity
measure. If two strings match exactly, they will share all the same n-grams
as well. But if two strings are only slightly different, they will still share a
large number of the same n-grams! So, a new measure of similarity
between two strings is a comparison of the number of n-grams the two
strings share. If we wanted to compare "INTERNATIONAL" with
"INTRENATIONAL" (a commonfingerflub,considering that "E" and
"R" are right next to each other on the keyboard), we would generate the
n-grams for both strings, then compare the overlap. Using n = 2, we have
already generated the digrams for the first string; the digrams for
"INTRENATIONAL" are:
IN
NT
TR
RE
EN
NA
AT
TI
IO
ON
NA
AL
The two strings share 9 out of 12 digrams, or 75 percent. For a two-
string comparison, this is a high percentage of overlap, and we might say
that any two strings that compare with a score of 70 percent or higher
are likely to be a match. Given two strings, X and Y, where ngram(X) is
the set of n-grams for string X and ngram(Y) is the set of n-grams for
string Y, we can actually define three different measures.
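One plausible measure of this kind (shown purely as an illustration, not necessarily one of the three defined here) divides the number of shared n-grams, counted with duplicates, by the larger of the two n-gram counts:

from collections import Counter

def ngrams(s, n=2):
    # Multiset of the overlapping n-character substrings of s
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_similarity(x, y, n=2):
    gx, gy = ngrams(x, n), ngrams(y, n)
    shared = sum((gx & gy).values())   # n-grams the two strings share
    return shared / max(sum(gx.values()), sum(gy.values()))

print(ngram_similarity("INTERNATIONAL", "INTRENATIONAL"))   # 0.75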
14.7 CONSOLIDATION
One of the most significant insights into similarity and difference mea-
surements is the issue of application context and its impact on both
measurement precision and on the matching criteria. Depending on the
kind of application that makes use of approximate searching and
matching, the thresholds will most likely change.
As our first example, let's consider a simple direct mail sales pro-
gram. While our goal would be to find duplicate entries, if a pair of
duplicates is not caught, the worst that can happen is that some house-
hold might get some extra unwanted mail. In this case, we might prefer
that any borderline matches be assumed to be mismatches so our cover-
age is greater.
For our second example, let's consider an antiterrorist application
used to screen incoming visitors. If the visitor's name matches one of the
names on the list of known terrorists, the visitor is detained, and a full
investigation is performed to determine if the visitor should be allowed
into the country. In this instance, where safety and security are con-
cerned, the worst that can happen if there is a missed match is that a
dangerous person is allowed to enter the country. In this case, we might
prefer that the match threshold be lowered and any borderline matches
be brought to the attention of the examiners so as to avoid missing
potential matches (see Figure 14.9).
The basic application in both of these cases is the same (matching
names against other names), but the precision depends on our expected
results. We can group our applications into those that are exclusive
searches, which are intended to distinguish as many individuals as pos-
sible, and inclusive searches, which want to include as many potential
matches into a cluster as possible. The direct marketing duplicate elimi-
nation would be an exclusive application, while the terrorist applica-
tion is an inclusive application.
[Figure 14.9: Two matching applications. For duplicate elimination, a direct mail mailing list is matched against itself with a high match threshold (93%) to produce duplicate pairs; for screening, the list of people entering the country is matched against the list of known terrorists with a lower threshold to flag potential terrorists entering the country.]
When duplicates are exact matches, they can be discovered through the
simple process of sorting the records based on the data attributes under
investigation. When duplicates exist because of erroneous values, we
have to use a more advanced technique such as approximate searching
and matching for finding and eliminating duplicates.
Duplicate elimination is essentially a process of clustering similar
records together, then using the three-threshold ranges described in Sec-
tion 14.6.3. Depending on the application, as we discuss in Section
14.7.1, the decisions about which records are duplicates and which are
not may either be made automatically or with human review.
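A sketch of that clustering step, assuming a record_similarity function like the one sketched earlier and the two-threshold scheme just described: pairs above the match threshold are linked automatically, and pairs in the middle range are queued for review.

def partition_pairs(records, similarity, match=0.93, review=0.70):
    # Compare every pair once and bin it by the two thresholds.
    matches, needs_review = [], []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = similarity(records[i], records[j])
            if score >= match:
                matches.append((i, j, score))
            elif score >= review:
                needs_review.append((i, j, score))
    return matches, needs_review

# Hypothetical usage with the earlier record_similarity sketch:
# matches, review_queue = partition_pairs(customer_records, record_similarity)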
14.7.3 Merge/Purge
14.7.4 Householding
[Figure: sample records to be linked for merge/purge and householding, with fragments such as "H D Loshin, Product Manager" and "509-555-1259, John Franklin, DS-187932771."]
There are other applications that use a consolidation phase during data
cleansing. One application is currency and correctness analysis. Given a
set of data records collected from multiple sources, the information
embedded within each of the records may be either slightly incorrect or
out of date. In the consolidation phase, when multiple records associ-
ated with a single entity are combined, the information in all the
records can be used to infer the best overall set of data attributes.
Timestamps, placestamps, and quality of data source are all prop-
erties of a record that can be used to condition the value of data's cur-
rency and correctness. Presuming that we can apply the data quality
techniques prescribed in this book, we can quantify a data source's data
quality level and use that quantification to consolidate information.
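A sketch of one such consolidation rule, assuming each candidate record carries a timestamp and a numeric quality score for its source: for every attribute, the value supplied by the highest-quality, most recent record survives into the consolidated record.

def consolidate(records, attributes):
    # records: dicts with "timestamp", "source_quality", and data fields
    merged = {}
    for attr in attributes:
        candidates = [r for r in records if r.get(attr) not in (None, "")]
        if candidates:
            best = max(candidates,
                       key=lambda r: (r["source_quality"], r["timestamp"]))
            merged[attr] = best[attr]
    return merged

records = [
    {"timestamp": "1999-05-01", "source_quality": 0.9,
     "phone": "509-555-1259", "title": ""},
    {"timestamp": "2000-02-14", "source_quality": 0.7,
     "phone": "509-555-1300", "title": "Product Manager"},
]
print(consolidate(records, ["phone", "title"]))
# phone comes from the higher-quality source; title from the only record that has one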
In this example, each node represents a party, and a link between two nodes indicates an
established connection between those two nodes. The shaded areas cover networks that
indicate a "minicommunity," where (for the most part) each member has an established
link with all (or almost all) other members.
One aspect of data cleansing is being able to fill fields that are missing
information. Recall from Chapter 8 that there are five types of empty
attributes:
1. No value There is no value for this field: a true null.
2. Unavailable There is a value for this field, but for some reason it
has been omitted. Using the unavailable characterization implies
that at some point the value will be available and the field should
be completed.
3. Not applicable This indicates that in this instance, there is no
applicable value.
4. Not classified There is a value for this field, but it does not con-
form to a predefined set of domain values for that field.
5. Unknown The fact that there is a value is established, but that
value is not known.
The data cleansing process will not address categories 1 or 3, since
by definition there is no way to attack those missing fields. But with the
other categories, the reason for the missing value may be due to errors
in the original data, and after a cleansing process, there may be enough
information to properly fill out the missing field.
For unavailable fields, if the reason for the omissions has to do
with the dearth of data at the time of record instantiation, then the con-
solidation process may provide enough information leverage to supply
previously unavailable data. For unclassified fields, it may not be pos-
sible to classify the value due to erroneous data in other attributes that
may have prevented the classification. Given the corrected data, the
proper value may be filled in. For unknown attributes, the process of
cleansing and consolidation may provide the missing value.
The recipient line indicates the person or entity to which the mail is to
be delivered. The recipient line is usually the first line of a standard
address block, which contains a recipient line, a delivery address line,
and the last line. If there is an "attention" line, the standard specifies
that it should be placed above the recipient line.
The delivery address line is the line that contains the specific location
associated with the recipient. Typically, this line contains the street
address, and should contain at least some of these components:
Primary Address Number This is the number associated with the
street address.
Predirectional and Postdirectional A directional is the term the
Postal Service uses to refer to the address component indicating direc-
tional information. Examples of directionals include "NORTH,"
"NW," "W." The predirectional is the directional that appears before
the street name; the postdirectional is the directional that appears after
the street name. While spelled-out directionals are accepted within the
standard, the preferred form is the abbreviated one. When two direc-
tionals appear consecutively as one or two words before or after the
street name or suffix, the two words become the directional, the excep-
tion being when the directional is part of the street's primary name.
When the directional is part of the street name, the preferred form is not
to abbreviate the directional.
Street Name This is the name of the street, which precedes the
suffix. The Postal Service provides a data file that contains all the valid
street names for any ZIP code area.
Suffix The suffix is the address component indicating the type of
street, such as AVENUE, STREET, or CAUSEWAY, for example. When
the suffix is a real suffix and not part of a street name, the preferred form
is the abbreviated form. The standard provides a table enumerating a
large list of suffix names, common abbreviations, and the preferred stan-
dard abbreviation.
Secondary Address Designator The secondary address unit desig-
nator is essentially a more precise form of the address, narrowing the
delivery point to an apartment, a suite, or a floor. Examples of sec-
ondary unit designators include "APARTMENT," "FLOOR,"
"SUITE." The preferred form is to use the approved abbreviations,
which are also enumerated in Publication 28.
Additionally, there are other rules associated with the delivery
address line. Numeric street names should appear the way they are
specified in the Postal Service's ZIP + 4 file and should be spelled out
only when there are other streets with the same name in the same deliv-
ery area and spelling the numeric is the only way to distinguish between them.
The last line of the address includes the city name, state, and ZIP code.
Besides the dash in the ZIP + 4 code, punctuation is acceptable, but it is
preferred that punctuation be removed. The standard recommends that
only city names that are provided by the Postal Service in its city state
file be used (this addresses the issue of vanity city names).
The format of the last line is a city name, followed by a state abbre-
viation, followed by a ZIP + 4 code. Each of these components should be
separated using at least one space. The standard also prefers that full city
names be spelled out, but if there are labeling constraints due to space,
the city name can be abbreviated using the approved 13-character abbre-
viations provided in the city state file.
14.9.5 ZIP + 4
ZIP codes are postal codes assigned to delivery areas to improve the
precision of sorting and delivering mail. ZIP codes are five-digit num-
bers unique to each state, based on a geographical assignment. ZIP + 4
codes are a further refinement, narrowing down a delivery location
within a subsection of a building or a street.
14.9.8 NCOA
14.10 SUMMARY
15.1.2 Example
the newspaper. Supposedly, the correct price for each product is posted
within the store at each product display. But when a shopper brings the
products to the checkout counter, it turns out that many products do
not ring up at the advertised sale price.
If the shopper is keeping track of the prices as they are being rung
up, it is possible to catch these errors at checkout time. When the shop-
per alerts the cashier that an incorrect price has come up, the cashier
must call for an attendant to go to the product display or the weekly cir-
cular and confirm the price. This usually takes time, but when the cor-
rect price is found, the incorrect price is voided, and the correct price is
keyed into the register. This may happen two or three times per pur-
chase. The result of this process is that the amount of time it takes for
the shopper to check out and pay for the purchase increases, resulting in
longer lines. When the store manager sees this, he instructs another
cashier to open up a new register.
If the shopper does not catch the error until after the payment has
been made, the procedure is a bit different. The shopper must go to the
customer service line and wait until a cashier can evaluate the price dif-
ference. At that point, an attendant is sent to do the same price check,
and if there is a price discrepancy, the shopper is reimbursed the over-
charge. Typically, this line is very long because aside from the multitude
of incorrect charges, there are other customer services being provided at
this location also.
The cashier's process of crediting the incorrect price and then
directly typing in the correct price is one example of treating the symp-
tom. The problem, to the cashier's eyes, is an incorrect charge that must
be fixed. Unfortunately, the cashiers are usually not in a position to cor-
rect the product price in the store database. Also, since they are paid
based on the number of hours they work and not as a function of the
number of customers they service, there is no compelling business rea-
son to care how long it takes to process each customer.
The fact that longer lines precipitate the manager's opening of
more registers is another example of treating the symptom. In the man-
ager's eyes, the problem, which we could call a throughput problem
(slow-moving lines), can be fixed by creating additional bandwidth
(additional registers), but it does not occur to the manager to examine
the root cause of the low throughput. If he did, though, he might notice
that approximately 15 percent of the purchased articles ring up at an
incorrect price.
We can see a number of important impacts from these incorrect
prices.
[Figure: a portion of the store's information chain, including the step in which an associate notifies the manager of a difference in price between the display and the computer, feeding the data consumption stage.]
The next step is charting the causal factors. In our case, the causal fac-
tors are likely to be manifested as incorrect data values, and to chart
these factors we must trace the provenance of the incorrect data back to
its origin. Luckily, we already have a road map for this trace: the
information chain.
The Pareto chart will direct the selection of a particular problem to
address. Once this selection has been made, the way to proceed is to
start tracing the problem backward through the information chain. At
each point in the information chain where the offending value can be
modified, we can insert a new probe to check the value's conformance
with what was expected.
There are two important points to note. The first is that since the
value may be subject to modification along the information chain, the
determination of what is a correct versus incorrect value is needed at
each connecting location (such as the exit from a communication chan-
nel and entrance to a processing stage) in the information chain. The
second is that at each location in the information chain, any specific
data value may be dependent on other values sourced from other loca-
tions in the chain. If one of those source values causes the nonconfor-
mity, it is that value that is picked up as the one to continue with the
trace-back.
Eventually, this kind of value tracing will result in finding a point where
the value was correct on entry to the location but incorrect on exit from
that location. We can call this location the problem point because that is
the source of the problem.
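A minimal sketch of that trace-back, representing the information chain as an ordered list of locations that record a value on entry and on exit; the chain, the expected price, and the conformance test are all hypothetical.

def find_problem_point(chain, conforms):
    # Walk backward; the problem point is the first location (from the end)
    # whose input conforms to expectations but whose output does not.
    for location in reversed(chain):
        if conforms(location["input"]) and not conforms(location["output"]):
            return location["name"]
    return None

chain = [
    {"name": "price feed",      "input": 24.99, "output": 24.99},
    {"name": "price update",    "input": 24.99, "output": 2.49},   # corrupted here
    {"name": "register lookup", "input": 2.49,  "output": 2.49},
]
expected = 24.99
print(find_problem_point(chain, lambda v: v == expected))   # price update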
Finding the problem point is also interesting if we were to recali-
brate some of our other analysis processes. As we traced back the value,
we may have encountered the source of other problems along the infor-
mation chain. For example, the appearance of an incorrect product
price might affect the customer checkout process but also the daily
bookkeeping process. This means that what we originally considered to
be two problems is actually only one problem, and the impacts and
costs associated with the two problems can be aggregated.
As we trace backward through the information chain (see Figure
15.3), we can decide whether we want to move forward along other
paths and check the effect of the traced nonconformity. This allows us to
subsume both the impacts and the costs associated with low data quality
(as we analyzed using our tools from Chapters 4, 6, and 9) at earlier
points in the information chain.
Having narrowed the analysis to the problem point, the next step is to
determine the cause of the problem. We can segregate problems into
chronic problems (those that have been around for a long time and
ignored) and acute problems (those that have cropped up recently and
are putting new pressures on the system). Depending on the type of
problem, there are different approaches to debugging.
Often, organizations choose to address chronic problems only
when the cost of the solution is disproportionate to the growth of the
system. On the other hand, organizations address acute problems
because their appearance disrupts normal operations. Debugging the
problem may be as simple as determining that incorrect data are being
input, but this may be difficult when faced with legacy application code
that is not actively being supported.
perhaps the application and/or the data have not been brought
into synchronization with the business change.
The answers to these questions will lead to other questions. The
debugging process is one of narrowing down the set of possible causes
by the process of elimination. Eventually, the precise point of the non-
conformity will be isolated, and that leads to the next question: How do
we fix the problem?
Now that the problem location has been isolated, it is the job of the
analyst to work with the system engineers to determine what needs to
be done to correct the problem. If we go back to our department store
example, we see that the location of the root cause was at the product
price update process. If the product price database were updated at a
consistent time with the correct prices, the effect would be seen at the
cash registers.
We have two goals at this point: further analyze to propose a cor-
rection to the problem and register what we have learned, and then
decide whether to implement the correction. Knowledge about the trace
of the problem backward through the system and the accumulated
knowledge about the specifics of the problem and suggestions for cor-
rection both fall into the area of enterprise knowledge that should be
absorbed under enterprise management. This information should be
registered within a centralized repository, so if the decision is made not
to correct the problem at this time, the history of that decision can be
reviewed at a later date.
[Figure: a template for registering a problem analysis, with fields for the high-level description, the characterization, the local information chain, the suggested correction, the date, and the analyst.]
[Figure: supplier management with an enterprise data validation buffer zone through which supplier data must pass before it enters the system.]
been injected into the information chain or before it enters the system.
Obviously, it is better to prevent bad data from entering the system than
having to deal with it once it is in the system.
Supplier management is best implemented as a "compliance codi-
cil," which is included as a component of the original agreement between
the supplier and purchaser. The agreement should include a provision for
a periodic review of the levels of data quality, a minimum threshold for
acceptance, and a definition of how the threshold is computed. The com-
pliance codicil should also specify the penalty when the data provided by
a supplier do not meet the purchaser expectations, as well as a set of
actions to be taken whenever a serious violation of the requirements has
occurred.
The threshold can be expressed as a function of the measured con-
formance to the set of requirements. In Chapter 10, we looked at the
definition of data quality requirements, as well as a way to collect and
manage requirements in a set of metadata tables. Each requirement is
specified as a rule, along with a description of that rule's implementa-
tion. It is the responsibility of the purchaser to provide a copy of these
rules to the supplier so that the desired levels of data quality are under-
stood equally well by both supplier and purchaser.
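A sketch of how such a threshold might be computed, assuming each requirement rule is implemented as a predicate over a record; the rules, the sample records, and the 95 percent acceptance level are illustrative only.

def conformance_level(records, rules):
    # Fraction of records that violate none of the agreed-upon rules
    violating = set()
    for rule in rules:
        violating.update(id(r) for r in records if not rule(r))
    return 1.0 - len(violating) / len(records) if records else 1.0

# Hypothetical rules and acceptance threshold from the compliance codicil
rules = [lambda r: r.get("phone") not in (None, ""),
         lambda r: len(r.get("zip", "")) in (5, 10)]
records = [{"phone": "212-555-1234", "zip": "11223"},
           {"phone": "", "zip": "11223-6523"}]
level = conformance_level(records, rules)
print(level, "meets threshold" if level >= 0.95 else "below threshold")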
One interesting way to make sure that both sides are complying
with the agreement is for each side to develop the testing application for
requirements testing, agree that each implementation correctly validates
the information, and then exchange those applications. This means that
when the supplier is running its validation engine, it is using the appli-
cation written by the purchaser, and vice versa.
15.6 SUMMARY
We can now provide a full analysis of the data quality of our sys-
tem and create a framework for selecting and implementing improve-
ments in implementable units of work. In the next chapters, we look at
improving the value of our data through enrichment, and then we dis-
cuss the actual means to implement our data quality improvement
program.
16
DATA ENRICHMENT/ENHANCEMENT
Let's imagine this brief scenario. You are a sales manager, and your
entire compensation is based on commissions on the sales that you and
your team make. Imagine that the director of sales handed each member
of your sales staff a list of sales leads that contained names, addresses,
and telephone numbers. Each sales representative might have some
chance of closing a sale with any one of those leads, but it would be
anybody's guess as to which lead is more likely to convert to being a
customer than any other. Without any additional information, each call
is a crapshoot.
Now imagine if the list contained each sales lead's annual salary,
the amount of money they've spent on similar products over the last five
years, a list of the last five purchases the lead has made, and the propen-
sity of that lead's neighbors to purchase your product. Which of these
lists would you rather have?
That second list contains enhanced data. The value of the original
list has been improved by adding the extra personal and demographic
information. By having the extra data added to the original set, the sales
team can increase its effectiveness by prioritizing the sales leads in the
order of propensity to buy.
The value of an organization's data can be greatly increased when
that information is enhanced. Data enhancement is a method to add
value to information by accumulating additional information about a
base set of entities and then merging all the sets of information to pro-
vide a focused view of the data.
duce a set of customer profiles that can provide both a framework for
more efficient sales and a streamlined mechanism for cus-
tomer service. In the sales analysis world, this may imply enhancing
point-of-sale data to understand purchase patterns across the organiza-
tion's sales sites. In the health/pharmaceuticals industry, a goal could be
to understand the interactions between different drugs and to suggest
the best possible treatments for different diseases.
There are different ways to enhance data. Some of these enhance-
ments are derived enhancements, while others are based on incorporation
of different data sets. Here are some examples of data enhancement,
which are shown in Figure 16.1.
For customer data, there are many ways to add demographic enhance-
ments. Demographic information includes customer age, marital status,
gender, income, ethnic coding, to name a few. For business entities,
demographics can include annual revenues, number of employees, size
of occupied space, and so on.
Having a base set of data whose quality we can trust gives us the oppor-
tunity to aggregate, drill, slice, and dice that data. When we can infer
knowledge based on that data, we can augment the data to reflect what
we have learned.
1. Determine if the city is a valid city name within the state. This
corresponds to a query in a city-state mapping table.
2. If so, determine if the street name is a valid street name within
that city. Again, this corresponds to a query in a database map-
ping between streets and cities.
3. If the street is valid, check to see if the address (the street num-
ber) is in a range that is valid for that street. Do another mapping
lookup, and this mapping should also reflect the ZIP code map-
ping as well. A test will compare the found ZIP code, and an
assignment will just use that found ZIP code.
Note, by the way, how this process uses the data domain mappings
we discussed in Chapter 7! But what happens if one of these lookups
fails? The default would be to resolve the ZIP code to the closest level in
the geographical hierarchy. For example, if the street is valid but the
number is not, then assign the ZIP code for the street. If the street does
not exist within the city, assign a default ZIP code for that city. While
the result is not always correct, it may still be in standard form.
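A sketch of this fallback hierarchy, with toy mapping tables standing in for the real city-state, street-city, and street-range ZIP mappings:

CITY_ZIP = {("NEW YORK", "NY"): "10001"}
STREET_ZIP = {("MAIN ST", "NEW YORK", "NY"): "10002"}
RANGE_ZIP = {("MAIN ST", "NEW YORK", "NY", 100, 199): "10002-1234"}

def resolve_zip(number, street, city, state):
    # Most specific match first, then fall back up the geographic hierarchy
    for (st, ci, sta, lo, hi), zip4 in RANGE_ZIP.items():
        if (st, ci, sta) == (street, city, state) and lo <= number <= hi:
            return zip4
    if (street, city, state) in STREET_ZIP:
        return STREET_ZIP[(street, city, state)]     # street-level default
    return CITY_ZIP.get((city, state))                # city-level default

print(resolve_zip(123, "MAIN ST", "NEW YORK", "NY"))   # 10002-1234
print(resolve_zip(901, "MAIN ST", "NEW YORK", "NY"))   # 10002 (street default)
print(resolve_zip(55, "ELM ST", "NEW YORK", "NY"))     # 10001 (city default)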
When two companies merge, eventually they will have to merge their
customer databases, employee databases, and base reference data. Con-
solidating customer records prevents potentially embarrassing market-
ing snafus (such as having more than one sales representative contacting
the same customer on the same day). More interesting, additional busi-
ness intelligence can be collected on those customers that existed in both
original data sets.
f "1
OS 1 1
OS 1 1
OS 1 1 2 ^
u 1
1 **; 00 2 1
t1 1 1 1 > wi a 1
1 a 4> 4> 1
a> 1 1 JD "O 1
^ 1 1 u 1 u 11
t S 2:!
^ 1^ 1 1 ON 1 1 TD 1 1
OS 1 1
o
Q a 1 ON 1 1
I1^
^^ 1 1
u 1 00 1 1
Q 1 5^ 1
T^ 1 1
to 1
ON 1 OS 1
o1 11 o) 11 11
o 1 1
1
0 1 1
OS
o 11
1 1 lit
1
"3
n
00
C3N
I^
I 1
00 1
ON 1 1
*o 1 1
I^ 11
NO
1 f2
d 1
3
V
f^
00
1
t^ 1 1
1
A 1 1
f^
S> 1
1
1
1
1 C3 (A >
O H-u
CJ
^^ 1
1 1
< 6 ^ 1 1 1 V jd o
ON ON 1 <<
- o
ON 1 1
CO 1 1
OS ON 1
^1 ON 1 ^1 S
OS 1 o 1 ON 1
-^ i ON ON 1 ON 1
<N 1
(3 ON 1
1 1 oo
'^ 1
00 1
^-^ 1
1
O
J^
l^
00
1
1
1
1
1
> 00
rH
1
1
ON 1
A 1
<s 1
(A
2i SO 1
^
a <N K. 1 <N 1
Q
z ON 1
1 ^ ^ tt
1>
i 2 2 1 i 2 1
O ^1
bS
r
1 u<
1
b
i ^ M >
8 S-S W)
a
r r tn
^
ill
^1
M
OS
Mil ^ 1 ^
1 "
8 1 c: 1
1 ^ 1 o 1 Q
1^ 1 1 ^
1 o
B 1 o a 1 ^
1 V
1
1 73 i
i 1 ^ 1 Si 1 ^ 1 MH 1
1 (A 1
1 *"* 1 J5 a - g 1 c:
1\2
1 Q^ 1 V 1 1 ^ 1
^ 1 ^ 1 w
1 ""* 1 1 *^ S !> 8 1 ^
^
1 '^ 1 ^ 11 "^
^ 1
1
r 1 ^1
V ^ &
1 V i$ (d
HI 1 T-
UJ
1 1 PQ t " S 1 CO
J v__ ,^ il
DATA ENRICHMENT/ENHANCEMENT 409
in terms of accuracy. Not only that, when people are involved in the
data entry process, mistakes can be made. This leads to the existence of
duplicate records for many individuals or corporate entities. As seen in
Figure 16.3, which contains real data extracted from the U.S. Securities
and Exchange Commission's EDGAR database, many of these dupli-
cates are not always simple to recognize. Linking the duplicate records
together is a way to cleanse the database and eliminate these duplicates.
16.6.5 Householding
In the health industry, data merging is performed both for diagnosis and
for determination of treatment. A collection of medical professionals
may pool their individual patient information (most likely having been
anonymized first) as a cooperative means for building up a knowledge
base particular to a set of illnesses. The information, which consists of a
patient's history, diagnosis, and treatment, must be enhanced to fit into
the collaborative data model.
For diagnostic purposes, a new patient's history and profile are
matched against other patients' histories in the hope of finding a close
match that can help determine the cause of a health problem. Once a num-
ber of matches have been found, the matched patients' treatment proto-
cols are examined to suggest a way to treat the new patient's problem.
[Figure 16.3: a sample of duplicate company records extracted from the SEC's EDGAR database.]
In many areas where there is a potential for fraud, data merging is used
as a way to both identify fraudulent behavior patterns and use those
patterns to look for fraud. Opportunities exist for fraud in all kinds of
businesses, such as transaction based (telephone and mobile-phone ser-
vice), claim based (all kinds of insurance), or monetary transaction-
based (where there are opportunities for embezzlement, for example).
In fact, there are many areas of crime that call out for data matching
and merging: money laundering, illegal asset transfer, drug enforce-
ment, and "deadbeat dads," to name a few.
self is not. People move, get married, divorced. The data that may be
sitting in one database may be completely out of synch with informa-
tion in another database, making positive matching difficult. For exam-
ple, recently I moved from Washington, D.C. to New York City. It took
a year before any of the online telephone directories had my listing
changed from the D.C. address and telephone to the New York one.
(Incidentally, by the time those listings had changed, I had already
moved to a new address!)
Information is lost. The actual database joins may be constructed
in a way that important information that is originally contained in one
of the data sets is lost during the merge. For example, when my baby
daughter receives a letter asking her to switch her long distance service
in return for an extra 5,000 frequent flyer miles, the information that
my wife and I just purchased a child's seat for her is apparently lost.
When the limitations of standard linkage combine, they eventually
cause inefficiencies and increased costs. Here are some real examples:
A large frequent traveler program had fielded many complaints
because patrons who had visited member hotels were not credited
with the points corresponding to their purchases. It turns out that
the company providing the frequent traveler program had recently
merged with a few other hotel chains, and the number of different
data formats had ballooned to more than 70, all out of synchro-
nization with each other.
A bid/ask product pricing system inherited its prices from multiple
market data providers, the information of which was collected
and filtered through a mainframe system. When the mainframe
did not forward new daily prices for a particular product, the last
received price was used instead. There was significant embarrass-
ment when customers determined that the provided prices were
out of synchronization with other markets!
Very frequently with direct mail campaigns, multiple mailings are
sent to the same household or to the same person, and a large
number of items are sent to the wrong address.
Current customers are pitched items that they have already
purchased.
Some silly data merging mishaps can be embarrassing to the orga-
nization and cause customer distrust. For example, if a service company
can't keep track of who its customers are, how can they be trusted to
supply the right level of service as well as provide a correct bill every
month?
Databases and data warehouses are not the only place where data merg-
ing is an important operation. In our ever-growing World Wide Web, any-
one can publish anything, and that information can be aggregated as well.
Closer to the fact, many Web sites act as front ends to different underlying
databases, and clicking on a Web page is effectively the invocation of a
query into a database, and the presentation of information posted back to
the client Web browser is a way of displaying the result of the query.
If we configure an application to act as a client front-end replacing
the standard browser but still making use of the HTTP protocols to
request and accept replies, we can create a relatively powerful data
aggregation agent to query multiple Web sites, collect data, and provide
it back to the user. Effectively, the World Wide Web can be seen as the
world's largest data warehouse. Unfortunately, the problems that
plague standard record linkage as described in Section 16.7 are magni-
fied by at least an order of magnitude. The reason for this lies in the rel-
ative free-form style of presenting information via the World Wide Web.
For a large part, Web data are "semistructured data." While database
records and electronic data interchange messages are highly structured,
information presented on Web pages conforms to the barest of stan-
dards, some of which are bent based on the selection of Web browser
targeted. On the other hand, there are some Internet "motifs" that
appear regularly. For example, business home pages most often contain
links to other sections of a Web site, with a contact page, an informa-
tion page, a products page, a services page, a privacy policy page, and a
"terms of service" page. Information on each of these pages also tends
to follow a certain style.
For example, a corporate data page will often have a list of the top
managers in a company, followed by a short biography. We can even
drive down to finer detail in our expectations: in the corporate
biographies, we can expect to see some reference to college and gradu-
ate school degrees, an account of prior work experiences (with dura-
tions), and professional affiliations. Yet while the format of these
biographies is not standardized at all, in general we learn the same kind
of stuff about each manager in each company.
The key to making the best use of aggregated Web data is that the pre-
sentation and packaging of the information is effectively an exercise in
data enhancement. Before the information is presented to the client, any
data quality rules, business rules, validation filters, or trigger rules
should be applied. This can only be done when there is a method for
linking data items from different sources coupled with the definition
and usage of business and data quality rules, as we have explored in
other sections of this book.
One way to counter the limitations of standard record linkage is the use
of a technique called approximate searching and matching. Standard
linkage requires that the sets of values in the characterizing attributes
all must match exactly. But in databases, just as in real life, sometimes
things are not always as they seem. Approximate matching relaxes the
exact matching requirement, giving us the chance to find those elusive
near (and sometimes not-so-near) matches.
Since all the scores are relatively high, even outside of the context
of the material being sent, our intuition would say that these are likely
to be the same person. Now adding in one more piece of information,
which is the psychographic item of "interest in data warehousing," this
might tip the balance to an automated process to automatically link the
two records. One more interesting note: I received the same invitation
at my home address!
In all these occasions for data merging, business rules can improve the
outcome. When merging customer databases, derived analytical data
enhancements such as "decision maker" can be inferred. Cooperative
marketing programs can be made more efficient, thereby increasing
response rate. Affinity programs can be improved if rules characterizing
the target's tendency to respond can be inferred. And, of course, data
cleansing operations such as de-duplication and householding are
improved when rules are added.
16.13 SUMMARY
[Figure: validation flow labels, including "Corrected data" and "Does not conform to DQ expectations."]
The rest of the section reviews the data quality rules, then shows how the set
representing the nonconforming rule can be specified in SQL.
In Chapter 8, we discussed the fact that there are different kinds of null
values, including this list:
1. No value There is no value for this field: a true null.
2. Unavailable There is a value for this field, but for some reason
it has been omitted. Using the unavailable characterization
implies that at some point the value will be available and the field
should be completed.
3. Not applicable This indicates that in this instance, there is no
applicable value.
4. Not classified There is a value for this field, but it does not con-
form to a predefined set of domain values for that field.
5. Unknown The fact that there is a value is established, but that
value is not known.
We allow the user to define named aliased representations for the
different null types. This is the syntax for this definition.
Define <nullname> for <nulltype> as <string
representation>
In this definition, <nullname> is the alias name, <nulltype> is one
of the varieties of nulls described here, and <string representation>
is the character string used for the null. For example, this definition
specifies the null representation for an unavailable telephone number.
Define NOPHONE for UNAVAILABLE as "Phone Number Not
Provided"
There are two kinds of null value rules: those that allow nulls and those
that disallow nulls. Let's look at those that allow nulls first.
If the rule indicated that system null values are allowed, with no
qualification, there is really no validation, since the assertion states that
the absence of a value is allowed. Alternatively, if the rule specified that
only a particular kind of defined null value was allowed, then the vali-
dation is a little more complicated.
Our earlier definition of the null value rule allowed the restriction
of use to defined null values, such as the following.
Attribute employees.phone_number allows nulls
{GETPHONE, NOPHONE}
More formally, our rule syntax will specify the name of the
attribute, the keywords allows nulls, and a list of null representation
types allowed.
Attribute <table>.<attribute> allows nulls
{<nullreptype> [, <nullreptype> ...]}
The restriction on the type of nulls allowed is meant to disallow the
use of any other type. Therefore, the validation test of this rule is the
test for violators, and we are really testing that no real nulls are used.
This is the SQL statement for this test.
Select * from <table> where <attribute> is null;
Note that in older database systems with no system null, or in text
(delimiter-separated file) data sets, blanks may appear in place of an
"official" null, so we may add this validation test, which grabs the first
character from the attribute and tests for a blank.
Select * from <table> where substring(<attribute>, 1, 1) = ' ';
If the rule is being obeyed, the size of the result set should be 0. If not,
the result set represents those records that violate the rule.
Selecting records with null values or counting the number of records
with null values is now a bit more complicated, since the null values
allowed are those drawn from the null representation list. Since we have
accumulated those null representations in our null representation table,
the query can be specified either in the general sense (grabbing all records
with nulls) or in the specific sense (getting all records with a "not avail-
able" null type) by using the null representation in the nullreps table. To
locate all records with nulls, we use this SQL statement.
Select * from <table> where <attribute> in
    (select nullrep from nullreps where nullreps.name in
        (<nullreptype> [, <nullreptype> ...]));
To specify the selection of those records associated with a specific
null representation, we restrict the null representation list to the specific
representation we care about.
Select * from <table> where <attribute> in
    (select nullrep from nullreps where
        nullreps.name = <nullreptype>);
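As a concrete instance, the earlier rule allowing only the GETPHONE and NOPHONE representations for employees.phone_number resolves to this query.
Select * from employees where phone_number in
    (select nullrep from nullreps where
        nullreps.name in ('GETPHONE', 'NOPHONE'));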
Similarly, when we specify that an attribute may not be null, the valida-
tion is the same as for the defined null case. This is the syntax of our
original rule.
Attribute employees.empid nulls not allowed
The validation test is as follows.
Select * from employees where empid is null;
Along with the corresponding test for spaces.
Select * from employees where substring(empid, 1, 1) = ' ';
If the rule is being obeyed, the size of the result set should be 0. If
not, the result set represents those records that violate the rule.
Value restriction rules limit the set of valid values that can be used
within a context. In our definition of a value restriction, we assigned an
alias to the restriction and specified the expression that defines the
restriction. Here is an example.
Restrict GRADE: value >= 'A' AND value <= 'F' AND value != 'E'
In general, the form for the value restriction rule includes the keyword restrict, a name assigned for the restriction, and a conjunction of conditions, where the operators associated with the conditions may be any drawn from a predefined set (+, -, *, /, etc.) or from a user-defined set of functions (which must be provided for execution, of course).
Restrict <restriction name>: <condition> [(AND | OR) <condition> ...]
We actually use this form both for defining value ranges that can be associated with functional domains and for representing the restriction of values for a specific attribute. This is the format for the latter specification.
Attribute <table>.<attribute> restrict by <restriction name>
This indicates that no values that show up in the named attribute
may violate the named restriction. The validation test for this restric-
tion is a test for violators of the restriction, and the query is composed
by searching for the negation of the restriction. We apply DeMorgan's
laws to generate the negation.
NOT (A AND B) => (NOT A OR NOT B)
NOT (A OR B) => (NOT A AND NOT B)
The where clause of the validation test is generated from the negation. This would be the where clause from our previous example.
NOT (value >= 'A' AND value <= 'F' AND value != 'E')
After the application of DeMorgan's laws, it becomes this.
(value < 'A' OR value > 'F' OR value = 'E')
In general, the validation test becomes this.
Select * from <table> where NOT <condition list>;
Again, if the rule is being obeyed, the size of the result set should
be 0. If not, the result set represents those records that violate the rule.
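For example, if the GRADE restriction were applied to a hypothetical transcript.grade attribute (substituting the attribute name for value), the generated validation test would be this.
Select * from transcript
    where (grade < 'A' OR grade > 'F' OR grade = 'E');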
Domains are maintained as metadata using two data tables. The first is the domain reference table, which stores the name of the domain, its type, a description, the source of the data, and an assigned domain identifier.
create table domainref (
    name varchar(30),
    dtype char(1),
    description varchar(1024),
    source varchar(512),
    domainid integer);
The actual values are stored in a single data table, referenced via a
foreign key back to the domain reference table.
create table domainvals (
    domainid integer,
    value varchar(128));
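As a purely hypothetical illustration of how the two tables relate, a small domain of two-letter U.S. state abbreviations might be registered and populated like this (the name, type code, and identifier are invented for the example).
insert into domainref (name, dtype, description, source, domainid)
    values ('US_STATE', 'S', 'Two-letter U.S. state abbreviations', 'USPS', 17);
insert into domainvals (domainid, value) values (17, 'AK');
insert into domainvals (domainid, value) values (17, 'AL');
insert into domainvals (domainid, value) values (17, 'AZ');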
A domain membership rule specifies that the data values that pop-
ulate an attribute be taken from a named data domain. This is the for-
mat for this rule.
<table>.<attribute> taken from <domain name>
The validation test for domain membership tests for violations of
the rule. In other words, we select out all the statements whose attribute
has a value that does not belong to the domain. We can do this using a
subselect statement in SQL.
SELECT * from <table> where <attribute> not in
    (SELECT value from domainvals where domainid =
        (SELECT domainid from domainref
         where domainref.name = <domain name>));
If the rule is being obeyed, the size of the result set should be 0. If
not, the result set contains those records that violate the rule. With a
nonempty result set, we can also grab the actual data values that do not
belong to the named domain.
SELECT <attribute> from <table> where <attribute> not in
    (SELECT value from domainvals where domainid =
        (SELECT domainid from domainref
         where domainref.name = <domain name>));
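As a concrete instance, a rule stating that a hypothetical customers.state attribute is taken from the US_STATE domain would be validated with this query.
SELECT state from customers where state not in
    (SELECT value from domainvals where domainid =
        (SELECT domainid from domainref
         where domainref.name = 'US_STATE'));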
The domain assignment rule specifies that all the values from the
attribute are implicitly to be included in the named domain. This rule,
which is useful in propagating domain values to the metadata reposi-
tory, must include the keywords define domain, the name of the
domain, the keyword from, and a list of attributes from which the
domain values are gleaned.
Define Domain <domain name> from
    (<table>.<attribute> [, <table>.<attribute> ...])
For each of the named attributes, we must have a select statement
to extract those values that are not already in the domain.
SELECT <attribute> from <table> where <attribute> not in
(SELECT value from domainvals where domainid =
(SELECT domainid from domainref
where domainref.name = <domain name>));
We then must merge all these values into a single set, and for each of the values in this set, create a new record to be inserted into the domainvals table. Each new record will include the identifier of the domain and the new value, and a SQL insert statement is generated. Here is the pseudocode for the entire process.
dom_id = SELECT domainid from domainref
    where domainref.name = <domain name>;
Nonmember set S = null;
For each table t, attribute a in attribute list AL do:
    S = S union
        SELECT a from t where a not in
            (SELECT value from domainvals
             where domainid = dom_id);
For each value v in nonmember set S do:
    Insert into domainvals (domainid, value)
        Values (dom_id, v);
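In SQL dialects that support an insert with a subselect, each attribute's contribution can be folded into a single statement; this is a sketch in which dom_id stands for the previously retrieved domain identifier.
Insert into domainvals (domainid, value)
    Select distinct dom_id, <attribute> from <table>
    where <attribute> not in
        (Select value from domainvals where domainid = dom_id);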
Mappings are maintained as metadata using two data tables. The first is
the mapping reference table, into which we store the name of the map-
ping, the source domain identifier, the target domain identifier, a
description, the source of the data, and an assigned mapping identifier.
create table mappingref (
name varchar(30),
sourcedomain integer,
targetdomain integer,
description varchar(1024),
source varchar(512),
mappingid integer
);
The second table actually holds all pairs of values associated with a
particular domain mapping.
create table mappingpairs (
mappingid integer,
sourcevalue varchar(128),
targetvalue varchar(128)
);
The mapping assignment rule specifies that all the value pairs from two
specified attributes are implicitly to be included in the named mapping.
This rule, which is useful in propagating mapping values to the meta-
data repository, must include the keywords define mapping, the name
of the mapping, the keyword from, and a list of attributes from which
the mapping values are gleaned.
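The format itself is not spelled out here, but by analogy with the domain assignment rule it presumably reads along these lines, with the nonmember pairs extracted by a query of the same shape as the domain version (row comparison syntax varies by SQL dialect).
Define Mapping <mapping name> from
    (<table>.<source attribute>, <table>.<target attribute>)

SELECT <source attribute>, <target attribute> from <table>
    where (<source attribute>, <target attribute>) not in
        (SELECT sourcevalue, targetvalue from mappingpairs
         where mappingid =
            (SELECT mappingid from mappingref
             where mappingref.name = <mapping name>));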
17.1.13 Completeness
17.1.14 Exemption
Here is the SQL to extract nonconformers when there are no null representations specified.
Select * from <table> where not <condition> and
    (<attribute> is null [or <attribute> is null ...]);
If we have qualified the attributes in the attribute list using a null
representation, then the where condition is slightly modified to incorpo-
rate the null representation. Here is the same SQL statement with the
null test replaced.
Select * from <table> where not <condition> and
    <attribute> in
        (select nullrep from nullreps where
            nullreps.name = <nullreptype>);
17.1.15 Consistency
17.1.16 Derivation
We can actually ratchet down the constraint from defining a specific pri-
mary key to just defining a key on a table. A key is a set of one or more
attributes such that for all records in a table, no two records have the
same set of values for all attributes in that key set. We specify the key
assertion by listing the attributes that compose the key, along with the
keyword key, and the table for which those attributes are a key.
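One plausible rendering of such an assertion, together with a grouping query that exposes any violations, is the following; the assertion syntax shown is an assumption, not the book's exact notation.
{<attribute> [, <attribute> ...]} is a key for <table>

Select <attribute> [, <attribute> ...], count(*)
    from <table>
    group by <attribute> [, <attribute> ...]
    having count(*) > 1;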
17.1.19 Uniqueness
The uniqueness rule is a cousin to the primary key rule. The difference
is that we may indicate that a column's values be unique without using
that column as a primary key. We indicate that an attribute's value is to
be unique with a simple assertion.
<table>.<attribute> is Unique
Our first test for validation is to count the number of records and
then count the number of distinct attribute values. Those counts,
retrieved using these SQL statements, should be the same.
Select count(*) from <table>;
Select count(distinct <attribute>) from <table>;
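Alternatively, the duplicated values themselves can be pulled out directly with a grouping query; if the uniqueness rule holds, this returns no rows.
Select <attribute>, count(*) from <table>
    group by <attribute>
    having count(*) > 1;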
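For a foreign key relationship between a source table and a target table, the corresponding check looks for source values that have no match in the target; a reconstructed sketch, with placeholder names, is the following.
Select <attribute> from <source table>
    where <attribute> not in
        (Select <key attribute> from <target table>);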
This SQL statement pulls out all the values that are supposed to be
used as a foreign key but do not appear in the target table. These are vio-
lating records, and the question is whether the violation occurred because
the foreign key value in the source table was incorrect or whether the for-
eign key is really missing in the target table and should be there.
In this section, we have explored the use of SQL as a way of both exe-
cuting and representing data quality and business rules. We should not
feel obligated to use this mechanism only as a tool for databases. Even
though a rule is described using SQL, the implementation need not be
restricted to an environment where the data sit in a relational database.
The use of intermediate data representations, the use of standard data
structures such as arrays, linked lists, and hash tables, and other pro-
gramming tricks will allow for the execution of these rules in a runtime
environment separated from a query engine.
We like to think that the kinds of rules engines that we saw in
Chapter 12 can be called upon to create intermediately executing rule
validation objects. We can encapsulate the interpretation and operation
of a validator as an operational execution object that can be inserted
into a processing system, and we use this idea in the next section.
As long as the rules engine can be handed a "map" of the data that pass through it, whether through metadata for the table schemas provided to the rules engine or through a data type definition and message format schema (such as one described using a markup system like XML), the rules engine can manage the testing and validation of data quality rules when integrated into the processing
stream. In this way, coordinating with actions that can be linked in
together with the rules engine, a content-oriented workflow system can
be enabled.
17.2.2 Measurements
total_passed = sum(passed[1:num_rules]);
total_failed = sum(failed[1:num_rules]);
Insert record into measurements table;
For each rule,
    Insert detail record into measurementdetails table;
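A minimal sketch of what the two insert steps might look like, assuming hypothetical column layouts for the measurements and measurementdetails tables (the actual schemas are not prescribed here), is the following.
Insert into measurements (runid, total_passed, total_failed)
    values (<runid>, <total_passed>, <total_failed>);
Insert into measurementdetails (runid, rulename, passed, failed)
    values (<runid>, <rule name>, <passed count>, <failed count>);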
17.2.3 Triggers
17.2.4 Transformations
17.2.5 Updates
In Chapter 16, we defined two additional rules. The first of these rules,
the approximate match definition, defines a similarity score between
any two values, along with a threshold indicating whether two values
are assumed to match. The formal definition of an approximate match
rule must include a defined function, which should evaluate to a real
value between 0.0 and 1.0 and a threshold value above which a score
indicates a match, which is also a real value between 0.0 and 1.0.
Approximate Match (<real function>(<value1>, <value2>)): threshold=0.75
The definition of the function can include predefined operators
(such as the same ones used for expressions) as well as user-defined
functions (which of course, must be provided to build a working appli-
cation). The actual implementation of this rule converts the rule defini-
tion into a function. Here is an example of the conversion of this rule
match_def1: approximate match (1 - (ABS(Value1 - Value2)/MAX(Value1, Value2))): threshold=0.75
into C code:
#include <math.h>

#define MAX(_a, _b) ((_a) > (_b) ? (_a) : (_b))

int match_def1(float _val1, float _val2) {
    /* similarity = 1 - |val1 - val2| / max(val1, val2); match if >= 0.75 */
    if ((1 - (fabs(_val1 - _val2) / MAX(_val1, _val2))) >= 0.75)
        return (1);
    else
        return (0);
}
Now that we have a clearer story of how these data quality and busi-
ness rules can be used in an executing system, let's look at some specific
implementation paradigms. Our first is using data quality rules to help
guide the execution in a transaction factory.
In a transaction factory, a transaction, such as a stock trade or a
product purchase, takes place, and the factory processes the transaction
so that a desired result occurs. The transaction factory takes transaction
records as well as other data as raw input, produces some product or
service as a "side effect," and perhaps may generate some output data.
Each transaction factory can be represented as an information chain,
with individual processing stages taking care of the overall transaction
processing.
Data quality and business rules are integrated into the transaction
factory in two ways. First, transaction factories are built assuming that
the data being inserted into the process is of high quality and meets all of the defined input requirements.
Because the factory is built assuming that the input meets a high stan-
dard, problems in continuing operations occur when the input does not
meet that standard. If a bad record can adversely affect the streamlined
processing, then it is beneficial to capture, remove, and correct offend-
ing data before it ever reaches the internals of the factory.
In this instance, data quality rules would be defined to match the
input requirements. A validation engine would be inserted at the entry
point for all data inputs. In addition, we must also define a reconciliation
process to direct the way that invalid records are treated (see Figure 17.2).
The implementation consists of these steps:
1. Define the data quality rules based on the input requirements.
2. For each rule, associate a degree of criticality, which will dictate
the action taken if the rule is violated.
3. For each rule and degree of criticality, determine the action to be
taken. Typically, we can make use of at least these actions.
a. Ignore: If the violation is not business critical, we may want
to log the error but let the value pass.
b. Auto-correct: We can define a derivation rule that uses the
current value of a field to determine the actual value, if that is
possible. If a record is auto-corrected, it can then be gated
through into the system.
c. Remove for reconciliation: If the violation is severe enough
that it will affect processing, and it cannot be (or should not
be) auto-corrected, the record should be shunted off to a rec-
onciliation database, enhanced by the rule that was violated,
the data source, and a timestamp.
4. An implementation of a rule validator is built and inserted at the
appropriate entry points of data into the factory.
5. The reconciliation database needs to be constructed and made operational; a sketch of such a table follows this list.
6. A reconciliation decision process is put into effect.
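A minimal sketch of a reconciliation table capturing the elements named in step 3c, namely the violated rule, the data source, and a timestamp, might look like this; the column layout is an assumption rather than a prescribed schema.
create table reconciliation (
    recordid integer,
    rulename varchar(30),
    datasource varchar(128),
    violation_time timestamp,
    record_image varchar(1024));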
(Figure: input data is checked for conformance with expectations; if the information is valid, it is propagated into the database; if the data does not pass muster, the user is notified and the correct information is requested.)
Last, at each decision point in the processing, it is possible that the deci-
sion is based on the content of the data record as it passes through a
point in the information chain. Again, we can associate actions for data
routing based on the values embedded in the records (or messages) as
they pass through locations in the information chain. We can then cre-
ate rule application engines that make use of these rules (and their asso-
ciated actions) at the corresponding location within the information
chain, directing the information flow.
We have already discussed how complex and pervasive the data quality
problem is with respect to data warehousing. In Chapter 3, we looked
at an example where we would use data quality and business rules for
what we call "data warehouse certification." Certification is a means
for scoring the believability of the information stored in a data ware-
house. We certify a data warehouse as being fit for use when the data
inside conform to a set of data quality expectations embodied in a set of
rules. Given these rules, we assign a score to the quality of the data
imported into a data warehouse for certifying warehouse data quality.
In this chapter we can see how this can be implemented. A set of
rules is developed as a quality gate for data imported into the data
warehouse. We will associate with each rule a validity threshold (as a
percentage) based on the users' expectations of quality. An engine is
configured to incorporate those rules and execute the validation tests
as data is prepared to be entered into the warehouse (see Figure 17.4).
(Figure 17.4: data quality rules feed a rules engine at the data warehouse input; records that are valid are gated into the warehouse, and the warehouse is certified when the validity thresholds are exceeded.)
As records are fed into the engine, any relevant rules (that is, any
rules that refer to values of attributes defined within the record) are
tested. We create a measurement object, which tallies the successes and
failures associated with each rule and outputs the results to the mea-
surements tables. We incorporate a trigger to notify the users whether
the warehouse has been certified or not.
For any record, if no rules fail, the record is said to be valid and is
successfully gated through to the warehouse. For each rule that does
fail, a record is generated to be inserted into the measurements tables
with the information about which rules were violated. The record is
then output to a reconciliation system, as we described in Section 17.3.
The violating record can also be passed through to the warehouse, but
now it should be timestamped and marked as having not conformed to
the users' expectations, and this information can be used when per-
forming analysis.
After you've imported the data, each rule's validity value is com-
puted as the ratio of valid records to the total number of records. A data quality
certification report delineating all validity percentages is generated. If
all validity percentages exceed the associated thresholds, the warehouse
is certified to conform to the users' data quality requirements. Other-
wise, the warehouse is not certified, and until the percentages can be
brought up to the conformance level, the warehouse cannot be said to
meet the data quality requirements.
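Using the hypothetical measurementdetails layout sketched under Section 17.2.2, the per-rule validity values could be computed along these lines.
Select rulename,
       (1.0 * passed) / (passed + failed) as validity
    from measurementdetails
    where runid = <runid>;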
Since we have integrated the measurement process into the ware-
house loading system, we will have a periodic (or as periodic as the
warehouse is loaded) measure of how well the database conforms to the
data quality requirements. In order to qualify the warehouse after a
failed certification, the records output to the reconciliation system must
be analyzed for the root cause of the failures, as explored in Chapter 15.
After reconciliation, the data is resubmitted through the rules engine,
and the validity report is generated again. This process continues until
certification is achieved.
plete unless the other attribute is given a value. This tells us when the
field must be filled in and when it doesn't have to be filled in. This is
shown in Figure 17.5.
The insight we gain from the idea of dependence is that when many
attributes in a table depend on other attributes in the same (or another)
table, we can group our requests for data by what is called a depen-
dence class. All attributes are assigned into a dependence class with
some integer-valued degree. The zero-th degree includes all nondepen-
dent attributes. The first degree includes all attributes that depend only
on zero-th degree attributes.
More formally, a dependence class of degree i is the set of all attributes whose values depend on attributes of a degree less than i. At each
stage of data collection, we only need to ask for data for the current
dependence class, and we can impose our rules on the attribute values
being collected at that point because all the conditions for our rules are
known to be able to be examined. The relation between different
dependence classes can be summarized in a dependence graph.
To create a user interface, we will generate input forms that are data-
directed. The first form to be presented asks for all zero-th degree
attributes. Each subsequent form can be crafted to incorporate only those attributes in the next dependence class, and its format can be annotated using the data attributes associated via the data quality rules.
To continue our previous example, if the user did indicate that he or
she owned a car, the next form could explicitly say, "You have indicated
that you own a car. What is the registration number for that car?" and
then prompt for the answer. This offers more context to the user than the
traditional form of empty fields. Figure 17.6 shows this process.
17.7 SUMMARY
The best way to win the support of senior management is to cast the
problem in terms of how the business is affected or can be affected by
poor data quality. A presentation can be crafted that incorporates the
following notions.
• The reliance of the organization on high-quality information
• The evidence of existence of a data quality problem
• The types of impacts that low data quality can have
• The fact that managing data quality is the basis of a knowledge organization
• A review of the anecdotes regarding poor data quality
• A review of data ownership issues
• The implementation of a data ownership policy
• A projection of Return on Investment (ROI)
There are two goals of this presentation. The first is to encourage an awareness on the part of the senior managers of the importance of data quality. The second is to obtain the authority to craft a data ownership policy.
Our data ownership policy will define a set of data ownership roles and
assign responsibilities to those roles. Here are some of the responsibili-
ties we discussed in Chapter 2.
• Data definition
• Authorization of access and validation of security
• Support of the user community
• Data packaging and delivery
• Maintenance of data
• Data quality
• Management of business rules
• Management of metadata
• Standards management
• Supplier management
The data ownership policy is the document guiding the roles associated
with information and the responsibilities accorded those roles. At the
very least, a data ownership policy should enumerate these elements.
1. The senior-level managers supporting the enforcement of the
policies enumerated
2. All data sets covered under the policy
3. The ownership model (that is, how is ownership allocated or
assigned within the enterprise?) for each data set
4. The roles associated with data ownership (and the associated
reporting structure)
5. The responsibilities of each role
6. Dispute resolution processes
7. Signatures of those senior-level managers listed in item 1
Keep in mind that there are complicating notions working against the smooth transition to a knowledge organization, including these.
• Questions of information value
• Privacy issues
• Turf and control concerns
• Fear
• Bureaucracy
These are the steps in defining the data ownership policy (see Figure
18.1).
1. Identifying the interested parties or stakeholders associated with
the enterprise data. This includes identifying the senior-level
managers that will support the enforcement of the policy.
2. Cataloging the data sets that are covered under the policy
3. Determining the ownership models in place and whether these
are to continue or whether they will be replaced or modified
4. Determining the roles that are in place, those that are not in
place, assigning responsibilities to each role, and assigning the
roles to interested parties
5. Maintaining a registry that keeps track of policies, data owner-
ship, roles, responsibilities, and so forth
We suggest a training program that can be broken into two parts. The
first part covers the "business" aspects of data quality, such as the eco-
nomic analysis, the cost of low data quality, assessments, and building
ROI models. The second part covers the implementation issues, such as
data domains, mappings, data quality and business rules, measure-
ments, data cleansing, correction, and enhancement.
3. Data ownership
4. Quality concepts and the quality improvement cycle
5. Understanding the economic impact of data quality issues
6. Dimensions of data quality
7. Aspects of reference data domains
8. Data quality and business rules
9. Metrics for measuring and assessing data quality
10. Metadata
11. Data quality requirements analysis
12. Data cleansing and standardization
13. Error detection, correction, and root cause analysis using data
quality rules
14. Data enhancement
We look for those stages that create, read, write, send, or process data.
Here is a review of the processing stages discussed in Chapter 4.
The information chain is a graph, where the vertices are the processing
stages and the communication channels are directed edges. Every vertex
and every edge in the information chain is assigned a unique name.
The data quality scorecard summarizes the overall cost associated with
low data quality and can be used as a tool to help determine where the
best opportunities are for improvement.
Low data quality can have impacts that affect the way the operational
and strategic environments run. In Chapter 4, we explored these impacts
of low data quality.
• Detection of errors
• Correction of errors
• Rollback of processing
• Rework of work already completed under erroneous circumstances
• Prevention of errors
• Warranty against damages caused by nonconformities
• Reduction of customer activity
• Attrition and loss of customers
• Blockading on behalf of angered ex-customers
• Delay of decisions
• Preemption of decision making
• Idling of business activity while waiting for strategy to be defined
• Increased difficulty of execution
• Lost opportunities
• Organizational mistrust
• Lack of alignment between business units
• Increased acquisition overhead associated with information products
• Decay of information value
• Infrastructure costs to support low data quality
To create a data quality scorecard (see Figure 18.3), we follow these steps.
• Map the information chain to understand how information flows within the organization.
• Interview employees to understand what people are doing with respect to data quality issues.
• Interview customers to understand the kinds of customer impacts.
• Isolate flawed data by reviewing the information chain and locating the areas where data quality problems are manifested.
The data quality scorecard highlights the effect of low data quality on
the bottom line. We must perform a current state assessment to collect
enough data to understand the nature of the actual data quality prob-
lems (see Figure 18.4).
(Figure 18.4: choose locations in the information chain, choose a subset of the data quality dimensions, describe expectations for data quality at each location for each dimension, measure data quality, and prepare the assessment report.)
The next stage of the current state assessment is selecting a subset of the dimensions of data quality for measurement. We suggest, at a minimum, selecting at least one dimension from each of our five classes of data quality dimensions.
At this stage, we combine the result of the data quality scorecard, which
attributed the information chain with the impacts and costs associated
with low data quality, and the current state assessment, which attrib-
uted the information chain with the measured levels of data quality.
Choosing the first problem to address requires some care, since the
results of implementing the first project can make or break the practice.
The first project should reflect these ideas.
• The problem to be solved has a noticeable impact.
• Solving the problem results in measurable cost savings.
• There are few or no political issues that need to be addressed.
• There is senior management support.
• Access to the problem space is open.
• The problem can be solved.
Remember: The goal of the first data quality project is to ensure
the continued operation of the data quality program. A failure in solv-
ing the first project will probably result in the demise of the program, so
choose wisely.
The analysis is complete. The project has been selected. Now it is time
to solve the problem. Each solution project will need to have a team of
people that can execute the different parts of the job. In this section, we
describe these roles.
In solving data quality problems, the role of the system architect is both
system analyst and system historian. The architect must be able to
understand the way the systems work in the context of the problem and
work with the rest of the team members to craft a solution that accom-
modates the already existing system environment.
Our choice of a problem to solve was based on the types of impacts and
costs associated with the problem. In order to understand how modifi-
cations to the process will change the impacts, it is critical to have an
expert from the business/analytical side to vet any proposed changes. It
is also the role of the domain expert to document the user requirements
and work with the rules engineer to transform those requirements into
data quality and business rules.
The rules engineer will work with the domain expert to translate user
requirements into data quality and business rules. The rules engineer is
also tasked with managing and configuring any automated rules-
oriented processing, and this includes evaluation of tools, rule defini-
tion, and application integration. This engineer will also implement any
software needed for integration with the main system, as well as any
standalone applications.
While the kinds of tools that exist for data quality were discussed in
Chapters 14 and 16, we did not specify or recommend any particular product. The reason for this is that each problem is slightly different,
and different combinations of product capabilities may be needed.
On the other hand, it is likely that the project will need at least
some of these components for implementation.
• Data cleansing
• Data standardization
• Database checking/validation
• Rules definition system
• Rules execution system
• Approximate matching
Some of these products may be offered in bundled form; a data cleansing tool may have a rules manager bundled, for example. The
important thing is to recognize the kind of application functionality
that is needed and plan accordingly.
There are many fine products in the market that can make up part of a
data quality solution, but that doesn't mean that a solution can be inte-
grated every time. It is possible that much of the componentry needed to
configure a solution is available, but sometimes there are reasons why it
doesn't make sense to actually buy the product(s). Also, the cost of
these products can range from hundreds to hundreds of thousands of
dollars. For example, if a cleansing application needs to be run only
once before a new data validation system is installed, it may be more
cost effective to invest a small amount of money in a simple tool and
augment it with constructed applications.
Our intention is to ensure that the definition of data quality and busi-
ness rules can be entrusted directly to the user. To do this, a rule editing
and management interface can be used. This interface should provide
an intuitive mechanism for nonexperts to define rules about the infor-
mation they care about. This kind of application can be integrated with
the rule execution system.
The rule execution system will take as input the rules defined via the
definition and management system and generate executable engines
that implement those rules. The rule execution system should conform
to the kinds of specifications discussed in Chapter 12.
The next step to success is to define the metadata model if one does not
already exist. Use the guidelines described in Chapter 11 to help set this
up. Remember that we can incorporate enterprise reference data into
our metadata system, so it is worthwhile to consider the storage of
domains and mappings within this framework as well.
We have done our impact analysis and our current state assessment,
both of which have fed the requirements definition process. With our
application components and metadata system in hand, we are now
ready to define our data quality and business rules.
Once the domains and mappings are in place, we can begin to succes-
sively build our rule base by collecting the important assertions about
the expectations of data quality. Using the current state assessment and
requirements as input, we can target those areas that are most amenable
to rule definition. We use the rule editing and management application
(see Section 18.10.5) to define the rules.
The rules have been defined and reviewed. Now is the time to integrate
the rules into an executable system and build the environment to run
tests and validate the validator. A test environment should be created
that draws its input from the same source the application will use in
production. This is the
opportunity to evaluate the validity of the rules in an operational con-
text so any necessary changes may be flagged and made before moving
to production.
This component will incorporate all validation rules and all pre-
scriptive rules. If rules are being used to perform a data transformation,
the integration must take place at this time also.
At the point where the level of acceptance is reached during the testing
phase described in Section 18.15.3, the decision to move into produc-
tion can be made. This means integrating the rules and nonconformance processing into the production environment.
One of the most critical pieces of the data quality program is the ability
to demonstrate success at improving data quality. We already have a
baseline for measuring improvement: the current state assessment. At
that point in the process, we have identified a subset of the particular
areas that critically impact the data quality within the enterprise, and
we have gathered measurements based on defined metrics.
When we integrated the rules system, we made sure to also integrate
these measurements. The reason is that we have successfully built a vali-
dation system to improve the data quality. This should be reflected in the
locations and metrics we chose for the current state assessment. In other
words, we can deliver a strict determination of measured improvement by
continuing to perform the same measurements from the current state
assessment. If we really have improved the data quality, we can document
it with real evidence, not just anecdotal stories about improvements.
We can use the methods of statistical process control, discussed in
Chapter 6, to document historical improvement. We can then make use
of that method to assign new thresholds for additional improvements by
resetting the upper and lower control limits based on user specification.
18.18 CONCLUSION