Data Quality and Cleaning
Accuracy
Completeness
Uniqueness
Timeliness
Consistency
Sources of Problems
Manual entry
No uniform standards for content and formats
Parallel data entry (duplicates)
Approximations, surrogates
SW/HW constraints
Measurement errors
Solutions
Potential solutions:
Preemptive:
  Process architecture (build in integrity checks)
  Process management (reward accurate data entry, data sharing, data stewards)
Retrospective:
  Cleaning focus (duplicate removal, merge/purge, name & address matching, field value standardization)
  Diagnostic focus (automated detection of glitches)
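A minimal sketch of the retrospective cleaning focus: field value standardization followed by duplicate removal. The record layout (name, phone) and the helper names are assumptions made for illustration, not part of any particular tool.

```python
import re

def standardize(record):
    """Normalize whitespace, letter case, and phone formatting in one record."""
    name = " ".join(record["name"].split()).title()
    phone = re.sub(r"\D", "", record["phone"])  # keep digits only
    return {"name": name, "phone": phone}

def remove_duplicates(records):
    """Keep the first occurrence of each standardized record."""
    seen = set()
    unique = []
    for rec in map(standardize, records):
        key = (rec["name"], rec["phone"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical manually entered records; the first two describe the same person.
raw = [
    {"name": "john  SMITH", "phone": "(555) 010-2000"},
    {"name": "John Smith", "phone": "555-010-2000"},
    {"name": "Mary Jones", "phone": "555 010 3000"},
]
cleaned = remove_duplicates(raw)
```

Standardizing before comparing is the design choice that matters here: the two John Smith rows only become detectable duplicates once their formats agree.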
Sources of Problems
Mutilating information by inappropriate preprocessing
  Inappropriate aggregation
  Nulls converted to default values
Loss of data
  Buffer overflows
  Transmission problems
  No checks
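A small illustration of why converting nulls to default values mutilates information. The sensor readings are invented for the example, and 0.0 stands in for whatever default a preprocessing step might substitute.

```python
from statistics import mean

readings = [20.5, None, 21.0, None, 19.5]  # None marks a missing value

# Preprocessing that replaces nulls with a default silently biases the result.
with_defaults = [r if r is not None else 0.0 for r in readings]
biased_mean = mean(with_defaults)

# Keeping nulls explicit lets the analysis exclude them and report the gap.
observed = [r for r in readings if r is not None]
honest_mean = mean(observed)
missing_count = len(readings) - len(observed)
```

Once the defaults are written, a downstream consumer can no longer tell a real zero from a missing reading.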
Solutions
Build reliable transmission protocols
  Use a relay server
Verification
  Checksums, verification parser
  Do the uploaded files fit an expected pattern?
Relationships
  Are there dependencies between data streams and processing steps?
Interface agreements
  Data quality commitment from the data stream supplier
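The verification step can be sketched as a checksum comparison plus a pattern check on each uploaded file. The file naming convention, the choice of SHA-256, and the trailing-newline rule are assumptions for illustration; a real interface agreement would define its own.

```python
import hashlib
import re

# Hypothetical naming convention agreed with the data supplier.
EXPECTED_NAME = re.compile(r"^feed_\d{8}\.csv$")

def sha256_hex(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def verify_upload(filename: str, payload: bytes, declared_checksum: str) -> list:
    """Return a list of problems; an empty list means the upload passed."""
    problems = []
    if not EXPECTED_NAME.match(filename):
        problems.append("unexpected file name")
    if sha256_hex(payload) != declared_checksum:
        problems.append("checksum mismatch")
    if not payload.endswith(b"\n"):
        problems.append("file does not end with a newline (possibly truncated)")
    return problems

data = b"id,amount\n1,10.0\n2,12.5\n"
good = verify_upload("feed_20240101.csv", data, sha256_hex(data))
# Simulate a transmission that lost its final bytes.
bad = verify_upload("feed_20240101.csv", data[:-5], sha256_hex(data))
```

Returning every problem found, rather than stopping at the first, gives the supplier a complete report for each rejected file.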
Sources of Problems
Problems in physical storage
  Can be an issue, but terabytes are cheap.
Problems in logical storage (ER relations)
  Poor metadata
    Data feeds are often derived from application programs or legacy data sources. What does the data mean?
  Inappropriate data models
    Missing timestamps, incorrect normalization, etc.
  Ad-hoc modifications
    Structure the data to fit the GUI.
  Hardware / software constraints
    Data transmission via Excel spreadsheets, Y2K
Solutions
Metadata
  Document and publish data specifications.
Planning
  Assume that everything bad will happen.
  Can be very difficult.
Data exploration
  Use data browsing and data mining tools to examine the data.
    Does it meet the specifications you assumed?
    Has something changed?
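Data exploration against a published specification can be sketched as a column profile compared with the documented limits. The spec fields (max_nulls, min, max) and the sample rows are hypothetical.

```python
def profile(rows, column):
    """Summarize one column: row count, null count, distinct count, min, max."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

def check_spec(summary, spec):
    """Compare a column profile with the documented specification."""
    violations = []
    if summary["nulls"] > spec["max_nulls"]:
        violations.append("too many nulls")
    if summary["min"] is not None and summary["min"] < spec["min"]:
        violations.append("value below documented minimum")
    if summary["max"] is not None and summary["max"] > spec["max"]:
        violations.append("value above documented maximum")
    return violations

rows = [{"age": 34}, {"age": None}, {"age": 29}, {"age": 999}]
spec = {"max_nulls": 0, "min": 0, "max": 120}  # assumed specification
summary = profile(rows, "age")
violations = check_spec(summary, spec)
```

Running the same profile on every delivery and comparing it with the previous one is one way to answer "Has something changed?".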
Sources of Problems
Combining data sets (acquisitions, across departments) is a common source of problems.
Heterogeneous data: no common key, different field formats
  Approximate matching
Different definitions
  What is a customer: an account, an individual, a family, ...?
Time synchronization
  Does the data relate to the same time periods? Are the time windows compatible?
Legacy data
  IMS, spreadsheets, ad-hoc structures
Sociological factors
  Reluctance to share: loss of power.
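The time synchronization question can be checked mechanically before two data sets are combined. This sketch assumes each data set is described by an inclusive (start, end) date window; the windows shown are invented.

```python
from datetime import date

def overlap(window_a, window_b):
    """Return the shared (start, end) period of two inclusive windows, or None."""
    start = max(window_a[0], window_b[0])
    end = min(window_a[1], window_b[1])
    return (start, end) if start <= end else None

sales = (date(2023, 1, 1), date(2023, 12, 31))    # calendar year
support = (date(2023, 7, 1), date(2024, 6, 30))   # fiscal year
shared = overlap(sales, support)
```

Only the shared period supports a like-for-like comparison; rows outside it describe different time periods.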
Solutions
Commercial tools
  Significant body of research in data integration.
  Many tools for address matching and schema mapping are available.
Data browsing and exploration
  Many hidden problems and meanings: must extract metadata.
  View before-and-after results: did the integration go the way you thought?
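When two sources share no common key, records have to be paired by approximate matching on descriptive fields. A minimal sketch using the standard library's difflib; the normalization rules and the 0.85 threshold are assumptions, and commercial matching tools use far more elaborate methods.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_records(left, right, threshold=0.85):
    """Pair each left name with its best right candidate above the threshold."""
    matches = []
    for name in left:
        best = max(right, key=lambda candidate: similarity(name, candidate))
        if similarity(name, best) >= threshold:
            matches.append((name, best))
    return matches

# Hypothetical customer names from two departments.
crm = ["Acme Corp.", "Globex Inc", "Initech"]
billing = ["ACME Corp", "Globex, Inc.", "Umbrella LLC"]
pairs = match_records(crm, billing)
```

Listing the matched pairs next to the unmatched names gives the before-and-after view needed to judge whether the integration went as expected.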
Sources of Problems
Exported data sets are often a view of the actual data. Problems occur because:
  Source data not properly understood.
  Need for derived data not understood.
  Just plain mistakes.
    Inner join vs. outer join
    Understanding NULL values
Computational constraints
  E.g., too expensive to give a full history, so we'll supply a snapshot.
Incompatibility
  EBCDIC?
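The inner join vs. outer join mistake and the meaning of NULL can be shown with an in-memory SQLite database. The tables and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 5.0), (12, 2, NULL);
""")

# Inner join: customers without orders silently disappear from the export.
inner = conn.execute("""
    SELECT c.name, SUM(o.amount) FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id ORDER BY c.id
""").fetchall()

# Left outer join: every customer is kept; missing totals come back as NULL.
outer = conn.execute("""
    SELECT c.name, SUM(o.amount) FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id ORDER BY c.id
""").fetchall()
conn.close()
```

In the outer join result, Bob's NULL means an order with an unknown amount while Cy's NULL means no orders at all. The exported view cannot distinguish the two, which is why the source data has to be understood before it is exported.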
Solutions
Commercial tools
  Significant body of research in data integration.
  Many tools for address matching and schema mapping are available.
Data browsing and exploration
  Many hidden problems and meanings: must extract metadata.
  View before-and-after results: did the integration go the way you thought?
THANK YOU