Data Quality and Cleaning


Conventional Definition of Data Quality

Accuracy

The data was recorded correctly.

Completeness

All relevant data was recorded.

Uniqueness

Entities are recorded once.

Timeliness

The data is kept up to date.

Special problems in federated data: time consistency.

Consistency

The data agrees with itself.
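
Several of these dimensions can be measured directly on a record set. The following is a minimal sketch, assuming records are plain dictionaries; the function names, field names, and the sample rule (an end year must not precede a start year) are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def completeness(records, fields):
    """Fraction of (record, field) cells that are present and non-empty."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(
        1 for r in records for f in fields
        if r.get(f) not in (None, "")
    )
    return filled / total

def uniqueness(records, key):
    """Fraction of records whose key value occurs exactly once."""
    if not records:
        return 1.0
    counts = Counter(r.get(key) for r in records)
    return sum(1 for r in records if counts[r.get(key)] == 1) / len(records)

def consistency_violations(records, rule):
    """Records for which a self-consistency rule does not hold."""
    return [r for r in records if not rule(r)]

# Hypothetical sample data: one empty name, one duplicated id,
# and one record whose end year precedes its start year.
records = [
    {"id": 1, "name": "Ann", "start": 2019, "end": 2021},
    {"id": 2, "name": "", "start": 2020, "end": 2018},
    {"id": 2, "name": "Bob", "start": 2020, "end": None},
]

fields = ["id", "name", "start", "end"]
print(completeness(records, fields))
print(uniqueness(records, "id"))
print(consistency_violations(
    records, lambda r: r["end"] is None or r["end"] >= r["start"]
))
```

Accuracy and timeliness are harder to score this way, since they need an external reference (the true value, or the time of the last real-world change).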

Data and Information


Data and information are not static; they flow through a data
collection and usage process:
Data gathering
Data delivery
Data storage
Data integration
Data retrieval
Data mining/analysis

Data Collection and Usage Processes


Issues

Data Gathering
Typos, multiple formats, missing / default values
Incomplete, incorrect, inaccurate, irrelevant
Different data dictionary definitions

Sources of Problems

Manual entry
No uniform standards for content and formats
Parallel data entry (duplicates)
Approximations, surrogates
SW/HW constraints
Measurement errors

Solutions

Potential Solutions:
Preemptive:
    Process architecture (build in integrity checks)
    Process management (reward accurate data entry, data sharing, data stewards)
Retrospective:
    Cleaning focus (duplicate removal, merge/purge, name & address matching, field value standardization)
    Diagnostic focus (automated detection of glitches)
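
As an illustration of the retrospective cleaning focus, the sketch below standardizes field values and then removes duplicates on the standardized key. It is a simplified example under stated assumptions: the record layout, the helper names, and the choice of (name, phone) as the matching key are all hypothetical.

```python
import re

def standardize_phone(raw):
    """Reduce a phone number to digits only; None if nothing usable remains."""
    digits = re.sub(r"\D", "", raw or "")
    return digits or None

def standardize_name(raw):
    """Collapse whitespace and normalize letter case."""
    return " ".join((raw or "").split()).title()

def merge_purge(records):
    """Keep one record per standardized (name, phone) key; first occurrence wins."""
    seen = {}
    for r in records:
        key = (standardize_name(r.get("name")), standardize_phone(r.get("phone")))
        seen.setdefault(key, {"name": key[0], "phone": key[1]})
    return list(seen.values())

# Hypothetical manual-entry data: the first two rows are the same person
# typed in two different formats; the third has a missing phone number.
raw = [
    {"name": "ann  LEE", "phone": "(555) 010-2000"},
    {"name": "Ann Lee", "phone": "555.010.2000"},
    {"name": "Bob Ray", "phone": ""},
]

cleaned = merge_purge(raw)
print(cleaned)
```

Note that the missing phone number is kept as None rather than replaced by a default value, so later steps can still tell "unknown" apart from a real value.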

Data Collection and Usage Processes


Issues

Data delivery
Corruption in transmission or storage

Sources of Problems

Destroying or mutilating information by inappropriate preprocessing
    Inappropriate aggregation
    Nulls converted to default values
Loss of data:
    Buffer overflows
    Transmission problems
    No checks
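
A small example shows why converting nulls to default values mutilates information. The readings and the default of 0 are made up for illustration.

```python
from statistics import mean

# Hypothetical sensor readings; None marks a value that was never recorded.
readings = [21.5, None, 22.5, None]

# Preprocessing that silently converts nulls to a default of 0.
with_defaults = [0.0 if x is None else x for x in readings]

# Nulls kept as nulls and excluded explicitly at analysis time.
observed = [x for x in readings if x is not None]

print(mean(with_defaults))  # average dragged down by the fake zeros
print(mean(observed))       # average of the values actually measured
```

Once the zeros are in the delivered file, the receiver can no longer tell a missing reading from a real measurement of 0.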

Solutions
Build reliable transmission protocols
    Use a relay server
Verification
    Checksums, verification parser
    Do the uploaded files fit an expected pattern?
Relationships
    Are there dependencies between data streams and processing steps?
Interface agreements
    Data quality commitment from the data stream supplier.
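
The verification step can be sketched as a checksum comparison followed by a simple verification parser. This is one possible approach, not a prescribed protocol: the file layout (a date and an integer per line), the function names, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import re

def sha256_hex(payload: bytes) -> str:
    """Hex digest the sender publishes alongside the file."""
    return hashlib.sha256(payload).hexdigest()

def verify_delivery(payload, expected_digest, line_pattern):
    """Return a list of problems found; an empty list means the file passed."""
    problems = []
    if sha256_hex(payload) != expected_digest:
        problems.append("checksum mismatch")
    text = payload.decode("utf-8", errors="replace")
    for n, line in enumerate(text.splitlines(), 1):
        if not line_pattern.fullmatch(line):
            problems.append(f"line {n} does not fit expected pattern")
    return problems

# Assumed interface agreement: each line is "YYYY-MM-DD,<integer>".
LINE = re.compile(r"\d{4}-\d{2}-\d{2},\d+")

sent = b"2024-01-01,17\n2024-01-02,23\n"
digest = sha256_hex(sent)

# Hypothetical corrupted copy: the second value was lost in transit.
received_bad = b"2024-01-01,17\n2024-01-02,\n"

print(verify_delivery(sent, digest, LINE))
print(verify_delivery(received_bad, digest, LINE))
```

The checksum catches any change to the bytes; the pattern check additionally says where the file stopped looking like what the supplier agreed to send.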

Data Collection and Usage Processes


Issues
Data Storage

Sources of Problems
Problems in physical storage
    Can be an issue, but terabytes are cheap.
Problems in logical storage (ER relations)
    Poor metadata.
        Data feeds are often derived from application programs or legacy data sources. What does it mean?
    Inappropriate data models.
        Missing timestamps, incorrect normalization, etc.
    Ad-hoc modifications.
        Structure the data to fit the GUI.
    Hardware / software constraints.
        Data transmission via Excel spreadsheets, Y2K

Solutions

Metadata
    Document and publish data specifications.
Planning
    Assume that everything bad will happen.
    Can be very difficult.
Data exploration
    Use data browsing and data mining tools to examine the data.
        Does it meet the specifications you assumed?
        Has something changed?
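
A published specification can be checked mechanically against incoming records. The sketch below is a minimal illustration; the SPEC dictionary, its field names, and the check_against_spec helper are hypothetical and stand in for whatever specification format a project actually publishes.

```python
# Hypothetical published specification: field name -> expected type and
# whether the field is required.
SPEC = {
    "customer_id": {"type": int, "required": True},
    "signup_date": {"type": str, "required": True},
    "balance": {"type": float, "required": False},
}

def check_against_spec(record, spec):
    """Return a list of ways in which a record departs from the spec."""
    problems = []
    for field, rule in spec.items():
        value = record.get(field)
        if value is None:
            if rule["required"]:
                problems.append(f"{field}: missing")
        elif not isinstance(value, rule["type"]):
            problems.append(
                f"{field}: expected {rule['type'].__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields nobody documented often signal that something has changed.
    for field in record:
        if field not in spec:
            problems.append(f"{field}: not in specification")
    return problems

good = {"customer_id": 7, "signup_date": "2024-01-01", "balance": 10.0}
changed = {"customer_id": "7", "region": "EU"}

print(check_against_spec(good, SPEC))
print(check_against_spec(changed, SPEC))
```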

Data Collection and Usage Processes


Issues
Data Integration

Sources of Problems
Combine data sets (acquisitions, across departments).
Common source of problems
    Heterogeneous data: no common key, different field formats
        Approximate matching
    Different definitions
        What is a customer: an account, an individual, a family, ...
    Time synchronization
        Does the data relate to the same time periods? Are the time windows compatible?
    Legacy data
        IMS, spreadsheets, ad-hoc structures
    Sociological factors
        Reluctance to share: loss of power.

Solutions

Commercial Tools
    Significant body of research in data integration
    Many tools for address matching, schema mapping are available.
Data browsing and exploration
    Many hidden problems and meanings: must extract metadata.
    View before and after results: did the integration go the way you thought?
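
When two data sets share no common key, approximate matching on a descriptive field is one option. The sketch below uses the standard library's difflib.SequenceMatcher as a stand-in for a dedicated matching tool; the company names, the normalize helper, and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop commas and periods, collapse whitespace."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())

def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def best_matches(left, right, threshold=0.8):
    """Map each left name to its most similar right name, or None if too weak."""
    result = {}
    for a in left:
        score, b = max((similarity(a, b), b) for b in right)
        result[a] = b if score >= threshold else None
    return result

# Hypothetical customer lists from two departments with no shared key.
dept_a = ["ACME Corp.", "Globex Inc"]
dept_b = ["Acme Corp", "Initech LLC"]

matches = best_matches(dept_a, dept_b)
print(matches)
```

Viewing the unmatched entries (those mapped to None) before and after the merge is a quick way to see whether the integration went the way you expected.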

Data Collection and Usage Processes


Issues
Data Retrieval

Sources of Problems
Exported data sets are often a view of the actual data.
Problems occur because:
    Source data not properly understood.
    Need for derived data not understood.
    Just plain mistakes.
        Inner join vs. outer join
        Understanding NULL values
Computational constraints
    E.g., too expensive to give a full history, we'll supply a snapshot.
Incompatibility
    EBCDIC?
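
The join and NULL mistakes are easy to reproduce. The sketch below uses an in-memory SQLite database; the customers and orders tables and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, NULL);
""")

# Inner join: customers without any order row disappear from the export.
inner = conn.execute("""
    SELECT c.name, SUM(o.amount) FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id ORDER BY c.id
""").fetchall()

# Left outer join: every customer is kept; missing orders show up as NULL.
outer = conn.execute("""
    SELECT c.name, SUM(o.amount) FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id ORDER BY c.id
""").fetchall()

# Comparing with "= NULL" never matches; "IS NULL" is the correct test.
eq_null = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount = NULL").fetchone()[0]
is_null = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount IS NULL").fetchone()[0]

print(inner)
print(outer)
print(eq_null, is_null)
```

In the outer join result, Bob (an order with an unknown amount) and Cy (no orders at all) both show a NULL total, which is exactly the kind of ambiguity that needs to be documented for whoever receives the export.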

Solutions

Commercial Tools
    Significant body of research in data integration
    Many tools for address matching, schema mapping are available.
Data browsing and exploration
    Many hidden problems and meanings: must extract metadata.
    View before and after results: did the integration go the way you thought?

THANK YOU
