Data Mining

 Introduction
 Knowledge Discovery in Databases (KDD)
 Data Mining and Query Tools
 Basic Data Mining Techniques
 Data Mining and Data Warehouse
 Association Rules

 A short story
• The library of Babel (infinite)
Books must be somewhere in the library
People wander round this library until they die
The library contains an infinite amount of data but
no information
• Today’s environment
Too much data but too little information
 Challenge
• Find the required information from huge
amounts of data
• As the amount of data grows, it becomes increasingly
difficult to find the meaningful information
 Knowledge Discovery in Databases (KDD)
• The whole process of extracting implicit,
previously unknown and potentially useful
knowledge, as a production factor, from large
data sets
• Includes data selection, cleaning, coding, data
mining, and reporting
 Data Mining
• The key stage of Knowledge Discovery in
Databases (KDD)
• The process of finding the desired information
in large databases
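
The KDD stages listed above (selection, cleaning, coding, data mining, reporting) can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the field names, and the choice of "count products" as the mining step are all assumptions made up for the example, not part of the original material.

```python
from collections import Counter

# Hypothetical raw records; every field name and value is invented for illustration.
RAW = [
    {"customer": "c1", "product": "Milk ", "year": 2023},
    {"customer": "c2", "product": "milk", "year": 2023},
    {"customer": "c3", "product": None, "year": 2023},
    {"customer": "c1", "product": "bread", "year": 2021},
    {"customer": "c2", "product": "Bread", "year": 2023},
]

def select(records, year):
    """Data selection: keep only the records relevant to the question."""
    return [r for r in records if r["year"] == year]

def clean(records):
    """Cleaning: drop records with missing values."""
    return [r for r in records if r["product"] is not None]

def code(records):
    """Coding: bring values into one consistent representation."""
    return [{**r, "product": r["product"].strip().lower()} for r in records]

def mine(records):
    """Data mining (stand-in): count how often each product occurs."""
    return Counter(r["product"] for r in records)

def report(counts):
    """Reporting: turn the result into readable lines."""
    return [f"{product}: {n}" for product, n in counts.most_common()]

def kdd(records, year):
    """Run the five stages in order."""
    return report(mine(code(clean(select(records, year)))))

result = kdd(RAW, 2023)
print(result)
```

In a real project the mining step would be a proper technique such as association rules or clustering; the point here is only the order of the stages and that mining is one stage among several.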

 KDD is not a new technique but rather a multi-
disciplinary field of research

 AI, machine learning (1950s)
 It is extremely difficult to build a computer
with intelligence close to that of a human
• Lack of creativity and self-learning
 1960s: research on learning largely stops
• Neural networks fail on XOR (a single-layer
perceptron cannot represent it)
 1980s onward: new neural network architectures,
new machine learning algorithms (decision
trees, genetic algorithms, etc.), more powerful
computers, a focus on the simple and practical
 Why learning
• Even a simple problem, such as timetable
planning, is extremely hard to solve with a
computer but easily solved by experienced people
 Using expert systems to solve problems
• Even for simple systems, a great many rules
exist. It is difficult to find the right rules.
• Relevant experts must be interviewed many times
and their answers integrated to obtain the expert
knowledge
Knowledge acquisition: use learning algorithms
to generate rules automatically
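
To make "generate rules automatically" concrete, here is a minimal sketch of one very simple rule learner in the style of 1R (One Rule): for each attribute it builds one rule per value (predict the majority class), then keeps the attribute whose rules make the fewest mistakes. This technique is chosen here only for illustration and is not necessarily the one the slides have in mind; the data set and attribute names are hypothetical.

```python
from collections import Counter, defaultdict

def one_rule(rows, target):
    """Return (attribute, {value: predicted_class}, errors) for the best single attribute."""
    best = None
    attributes = [a for a in rows[0] if a != target]
    for attr in attributes:
        # Count the target classes seen for each value of this attribute.
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        # One rule per value: predict the most frequent class.
        rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(1 for row in rows if rules[row[attr]] != row[target])
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Hypothetical training data, invented for this example.
ROWS = [
    {"age": "young", "children": "no", "buys": "no"},
    {"age": "young", "children": "yes", "buys": "yes"},
    {"age": "old", "children": "yes", "buys": "yes"},
    {"age": "old", "children": "no", "buys": "no"},
    {"age": "young", "children": "yes", "buys": "yes"},
]

attribute, rules, errors = one_rule(ROWS, target="buys")
print(attribute, rules, errors)
```

On this toy data the learner finds the rules from the records alone, without anyone interviewing an expert, which is the contrast the slide draws with hand-built expert systems.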

 Why the interest in data mining
• In the 1980s, organizations began to build
databases. By now, these contain gigabytes of data
with much ‘hidden’ information that cannot easily
be traced using SQL
SQL is only a query language: it retrieves data under
constraints that you already know
• As the use of networks grows, it becomes increasingly
easy to connect databases
Discover more information
• Machine learning techniques have improved
Easier to find interesting information
• Client/server environment
Electronic commerce

 Data mining tool & Query tool
• Suppose a large database contains millions of
records that describe customers’ purchases
Who bought which product on what date?
What is the average turnover in July?
What is an optimal segmentation of clients?
What are the most important trends in customer
behavior?
• If you know exactly what you are looking for,
use query tool
• If you know only vaguely what you are looking
for, use data mining tool
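
The first two questions above are precise, so a query tool can answer them directly. The sketch below uses Python's built-in `sqlite3` module with an in-memory table; the table name `purchases`, its columns, and the rows are assumptions invented for the example, and "turnover" is read here as the purchase amount.

```python
import sqlite3

# Hypothetical purchases table; names and data are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer TEXT, product TEXT, day TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        ("ann", "book", "2024-07-03", 20.0),
        ("bob", "game", "2024-07-15", 40.0),
        ("ann", "game", "2024-08-01", 30.0),
    ],
)

# "Who bought which product on what date?"
who_bought = conn.execute(
    "SELECT customer, product, day FROM purchases ORDER BY day"
).fetchall()

# "What is the average turnover in July?"
july_avg = conn.execute(
    "SELECT AVG(amount) FROM purchases WHERE strftime('%m', day) = '07'"
).fetchone()[0]

print(who_bought)
print(july_avg)
```

No comparable one-line query exists for "an optimal segmentation of clients": the question does not say which columns or thresholds to use, and working that out is what a data mining tool is for.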
 Data mining in electronic commerce
• The success of KDD comes primarily from
• Prediction
A customer buying baby clothes today may buy
computer games in ten years and, fifteen years later,
a motorcycle

• Suppose a company keeps data about which
products its customers bought
Mailing everyone: only 3% to 4% show interest
Analyzing user behavior and clustering customers
according to their interests can save 50% of
mailing costs
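
Clustering customers by interest, as described above, can be sketched with a small k-means routine. Everything here is an assumption for illustration: each customer is reduced to two invented numbers (spending on baby products and on games), the starting centroids are simply the first `k` points, and the mailing is assumed to advertise a baby product. The toy numbers say nothing about real savings.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Plain k-means with a deterministic start (first k points)."""
    centroids = list(points[:k])
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical customers: (spend on baby products, spend on games).
CUSTOMERS = [(9, 1), (1, 8), (8, 2), (2, 9), (10, 0), (0, 10)]

labels, centroids = kmeans(CUSTOMERS, k=2)

# Mail only the cluster whose average baby-product spend is highest.
target = max(range(2), key=lambda c: centroids[c][0])
mailing_list = [i for i, label in enumerate(labels) if label == target]
print(labels, centroids, mailing_list)
```

Mailing only the customers in `mailing_list` instead of everyone is the kind of targeting the slide refers to; how much it saves depends entirely on the real data.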

 The problems of data mining
• Lack of long-term vision
What do we want to get from the database in the future?
• Not all files are up to date
Example: the price of computers
• Struggle between departments
• Poor cooperation between users and EDP dept.
• Legal and privacy restrictions
• Data models need to be transformed for different
data mining techniques
• Timing problems: integrate data from different
• Interpretation problems


