Association Rules PDF
Association Rules
• Basic concepts
• Apriori algorithm
• Different data formats for mining
• Mining with multiple minimum supports
• Mining class association rules
• Summary
Association rule mining
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied
extensively by the database and data mining
community.
• Assume all data are categorical.
• No good algorithm for numeric data.
• Initially used for Market Basket Analysis to find how
items purchased by customers are related.
The model: data
• I = {i1, i2, …, im}: a set of items.
• Transaction t:
– t is a set of items, and t ⊆ I.
• Transaction Database T: a set of transactions T
= {t1, t2, …, tn}.
Transaction data: supermarket data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
• Concepts:
– An item: an item/article in a basket
– I: the set of all items sold in the store
– A transaction: items purchased in a basket; it may
have TID (transaction ID)
– A transactional dataset: A set of transactions
Transaction data: a set of documents
• A text document data set. Each document is
treated as a “bag” of keywords
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game
The model: rules
• A transaction t contains X, a set of items
(itemset) in I, if X ⊆ t.
• An association rule is an implication of the
form:
X → Y, where X, Y ⊂ I, and X ∩ Y = ∅
Rule strength measures
• Support: The rule holds with support sup in T
(the transaction data set) if sup% of
transactions contain X ∪ Y.
– sup = Pr(X ∪ Y).
• Confidence: The rule holds in T with
confidence conf if conf% of transactions that
contain X also contain Y.
– conf = Pr(Y | X)
• An association rule is a pattern stating that
when X occurs, Y occurs with a certain
probability.
Support and Confidence
• Support count: The support count of an
itemset X, denoted by X.count, in a data set T
is the number of transactions in T that
contain X. Assume T has n transactions.
• Then,
support = (X ∪ Y).count / n
confidence = (X ∪ Y).count / X.count
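The two formulas above can be sketched directly; the toy transactions here are made-up for illustration, not the slides' data.

```python
# A minimal sketch of the support and confidence formulas
# (the transaction set below is hypothetical).
transactions = [
    {"bread", "cheese", "milk"},
    {"apple", "eggs", "milk"},
    {"bread", "milk"},
    {"bread", "eggs"},
]
n = len(transactions)

def count(itemset):
    """X.count: number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"bread"}, {"milk"}
support = count(X | Y) / n            # (X ∪ Y).count / n  -> 2/4
confidence = count(X | Y) / count(X)  # (X ∪ Y).count / X.count -> 2/3
```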
Goal and key features
• Goal: Find all rules that satisfy the user-
specified minimum support (minsup) and
minimum confidence (minconf).
• Key Features
– Completeness: find all rules.
– No target item(s) on the right-hand-side
– Mining with data on hard disk (not in memory)
An example
• Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Clothes, Milk
• minsup = 30%
• minconf = 80%
• An example frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7]
• Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
… …
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
Transaction data representation
• A simplistic view of shopping baskets.
• Some important information is not considered, e.g.:
– the quantity of each item purchased, and
– the price paid
The Apriori algorithm
• Probably the best known algorithm
• Two steps:
– Find all itemsets that have minimum support
(frequent itemsets, also called large itemsets).
– Use frequent itemsets to generate rules.
Step 1: Mining all frequent itemsets
• A frequent itemset is an itemset whose support
is ≥ minsup.
• Key idea: The apriori property (downward
closure property): every subset of a frequent
itemset is also a frequent itemset.
[Figure: the itemset lattice over items A, B, C, D — triples ABC, ABD, ACD, BCD above pairs AB, AC, AD, BC, BD, CD above single items A, B, C, D]
The Algorithm
• Iterative algo. (also called level-wise search):
Find all 1-item frequent itemsets; then all 2-item frequent
itemsets, and so on.
– In each iteration k, only consider itemsets that
contain some frequent (k-1)-itemset.
• Find frequent itemsets of size 1: F1
• From k = 2:
– Ck = candidates of size k: those itemsets of size k
that could be frequent, given Fk-1
– Fk = those itemsets that are actually frequent, Fk ⊆
Ck (need to scan the database once).
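The level-wise search above can be sketched as follows. This is an illustrative, unoptimized implementation; the function name and interface are my own choices, not the slides' pseudocode.

```python
from itertools import combinations

# A sketch of level-wise (Apriori-style) frequent itemset mining.
def apriori(transactions, minsup):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    # F1: frequent 1-itemsets, kept as sorted tuples
    freq = {}
    Fk = []
    for i in items:
        c = sum(1 for t in transactions if i in t)
        if c / n >= minsup:
            freq[(i,)] = c
            Fk.append((i,))

    k = 2
    while Fk:
        prev = set(Fk)
        # Ck: size-k candidates all of whose (k-1)-subsets are in Fk-1
        Ck = set()
        for a in Fk:
            for b in Fk:
                u = tuple(sorted(set(a) | set(b)))
                if len(u) == k and all(s in prev for s in combinations(u, k - 1)):
                    Ck.add(u)
        # one scan of the database counts every candidate
        Fk = []
        for cand in sorted(Ck):
            c = sum(1 for t in transactions if set(cand) <= t)
            if c / n >= minsup:
                freq[cand] = c
                Fk.append(cand)
        k += 1
    return freq
```

Run on the 4-transaction dataset T of the example that follows (minsup = 0.5), it reproduces F1, F2, and F3 = {{2, 3, 5}}.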
Example – Finding frequent itemsets (minsup = 0.5)

Dataset T:
TID    Items
T100   1, 3, 4
T200   2, 3, 5
T300   1, 2, 3, 5
T400   2, 5

(itemset:count)
1. scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. scan T → C3: {2,3,5}:2 → F3: {2,3,5}
Details: ordering of items
• The items in I are sorted in lexicographic order
(which is a total order).
• The order is used throughout the algorithm in
each itemset.
• {w[1], w[2], …, w[k]} represents a k-itemset w
consisting of items w[1], w[2], …, w[k], where
w[1] < w[2] < … < w[k] according to the total
order.
Apriori candidate generation
• The candidate-gen function takes Fk-1 and
returns a superset (called the candidates)
of the set of all frequent k-itemsets. It has
two steps:
– join step: Generate all possible candidate
itemsets Ck of length k.
– prune step: Remove those candidates in Ck that
cannot be frequent.
An example
• F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}
• After join:
– C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
• After pruning:
– C4 = {{1, 2, 3, 4}}
because {1, 4, 5} is not in F3 ({1, 3, 4, 5} is removed)
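The join and prune steps in the example above can be reproduced with a small sketch (the helper name follows the slides' candidate-gen function; the exact signature is my own assumption).

```python
from itertools import combinations

# A sketch of candidate-gen: join then prune.
def candidate_gen(F_prev, k):
    prev = set(F_prev)
    # Join: merge two sorted (k-1)-itemsets sharing their first k-2 items.
    Ck = set()
    for a in F_prev:
        for b in F_prev:
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                Ck.add(a + (b[k - 2],))
    # Prune: drop candidates with an infrequent (k-1)-subset.
    return {c for c in Ck if all(s in prev for s in combinations(c, k - 1))}

F3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
C4 = candidate_gen(F3, 4)
# {1,2,3,4} survives; {1,3,4,5} is pruned since {1,4,5} is not in F3
```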
Step 2: Generating rules from frequent
itemsets
• Frequent itemsets ⇒ association rules
• One more step is needed to generate
association rules.
• For each frequent itemset X,
for each proper nonempty subset A of X:
– Let B = X − A
– A → B is an association rule if
• confidence(A → B) ≥ minconf, where
support(A → B) = support(A ∪ B) = support(X)
confidence(A → B) = support(A ∪ B) / support(A)
Generating rules: an example
• Suppose {2,3,4} is frequent, with sup = 50%
– Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with
sup = 50%, 50%, 75%, 75%, 75%, 75% respectively
– These generate the following association rules:
• 2,3 → 4, confidence = 100%
• 2,4 → 3, confidence = 100%
• 3,4 → 2, confidence = 67%
• 2 → 3,4, confidence = 67%
• 3 → 2,4, confidence = 67%
• 4 → 2,3, confidence = 67%
• All rules have support = 50%
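The rule-generation loop above can be sketched as follows, reusing the example's supports; the function name and support-table layout are my own, hypothetical choices.

```python
from itertools import combinations

# A sketch of generating rules from one frequent itemset X.
def rules_from(X, sup, minconf):
    """Emit (A, B, confidence) for each rule A -> B with B = X - A."""
    out = []
    for r in range(1, len(X)):                 # proper nonempty subsets A
        for A in combinations(sorted(X), r):
            A = frozenset(A)
            conf = sup[X] / sup[A]             # support(X) / support(A)
            if conf >= minconf:
                out.append((A, X - A, conf))
    return out

sup = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}
strong = rules_from(frozenset({2, 3, 4}), sup, 0.8)
# only 2,3 -> 4 and 2,4 -> 3 reach 80% confidence
```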
Generating rules: summary
• To recap, in order to obtain A → B, we need
to have support(A ∪ B) and support(A)
• All the required information for confidence
computation has already been recorded in
itemset generation. No need to see the data T
any more.
• This step is not as time-consuming as
frequent itemsets generation.
On Apriori Algorithm
Seems to be very expensive
• Level-wise search
• K = the size of the largest itemset
• It makes at most K passes over data
• In practice, K is small (often around 10).
• The algorithm is very fast. Under some conditions, all
rules can be found in linear time.
• Scales up to large data sets
More on association rule mining
• Clearly the space of all association rules is
exponential, O(2m), where m is the number of
items in I.
• The mining exploits sparseness of data, and
high minimum support and high minimum
confidence values.
• Still, it often produces a huge number of
rules: thousands, tens of thousands, even millions.
Different data formats for mining
• The data can be in transaction form or table form.
Transaction form:
  a, b
  a, c, d, e
  a, d, f
Table form:
  Attr1  Attr2  Attr3
  a      b      d
  b      c      e
• Table data need to be converted to transaction
form for association mining.
From a table to a set of transactions
Table form:
  Attr1  Attr2  Attr3
  a      b      d
  b      c      e
Transaction form:
  (Attr1, a), (Attr2, b), (Attr3, d)
  (Attr1, b), (Attr2, c), (Attr3, e)
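The conversion above can be sketched in a few lines: each table cell becomes an (attribute, value) pair, and each row becomes one transaction.

```python
# A sketch of table-to-transaction conversion, using the example rows.
rows = [("a", "b", "d"), ("b", "c", "e")]
attrs = ("Attr1", "Attr2", "Attr3")

transactions = [{(a, v) for a, v in zip(attrs, row)} for row in rows]
# first transaction: {("Attr1", "a"), ("Attr2", "b"), ("Attr3", "d")}
```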
Example 2
• Given the dataset M below, identify the
frequent itemsets with minimum support of 2.

Transaction ID   Items Bought
T1               MILK, TEA, CAKE
T2               EGGS, TEA, BREAD
T3               MILK, EGGS, TEA, BREAD
T4               EGGS, BREAD
T5               ICE CREAM
Step 1: Scan the data.
Step 2: Calculate the support/frequency of all items.
Step 3: Discard the items with support less than 2.
Step 4: Combine two items.
Step 5: Calculate the support/frequency of the item pairs.
Step 6: Discard the pairs with support less than 2.
Step 7: Combine three items and calculate their support.
Step 8: Discard the itemsets with support less than 2.
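The steps above can be checked by brute-force enumeration over dataset M (feasible at this size; Apriori's pruning only matters on larger data).

```python
from itertools import combinations

# Enumerate all 1-, 2- and 3-itemsets of dataset M and keep those
# with support count >= 2.
M = [
    {"MILK", "TEA", "CAKE"},
    {"EGGS", "TEA", "BREAD"},
    {"MILK", "EGGS", "TEA", "BREAD"},
    {"EGGS", "BREAD"},
    {"ICE CREAM"},
]
items = sorted({i for t in M for i in t})

frequent = {}
for k in (1, 2, 3):
    for c in combinations(items, k):
        cnt = sum(1 for t in M if set(c) <= t)
        if cnt >= 2:                  # minimum support count of 2
            frequent[c] = cnt
```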
1. Scan dataset M
2. Calculate the support/frequency of all items
3. Discard the items with support less
than 2
4. Combine two items
ITEMS BOUGHT
MILK, TEA
MILK, EGGS
MILK, BREAD
TEA, EGGS
TEA, BREAD
EGGS, BREAD
5. Calculate the support/frequency of all pairs
MILK, TEA      2
MILK, EGGS     1
MILK, BREAD    1
TEA, EGGS      2
TEA, BREAD     2
EGGS, BREAD    3
6. Discard the pairs with support less
than 2
ITEMS BOUGHT
MILK, TEA
TEA, EGGS
TEA, BREAD
EGGS, BREAD
The sets of two items that are bought together most frequently: (MILK, TEA),
(TEA, EGGS), (TEA, BREAD) and (EGGS, BREAD).
Association rules:
Milk → Tea; a person who buys milk likely buys tea
Tea → Milk; a person who buys tea likely buys milk
Based on the table above, complete the association rules.
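One way to complete the exercise above is to compute the confidence of each direction of every surviving pair over dataset M.

```python
# A sketch for the exercise: confidence of A -> B over dataset M.
M = [
    {"MILK", "TEA", "CAKE"},
    {"EGGS", "TEA", "BREAD"},
    {"MILK", "EGGS", "TEA", "BREAD"},
    {"EGGS", "BREAD"},
    {"ICE CREAM"},
]

def conf(A, B):
    """confidence(A -> B) = count(A ∪ B) / count(A)"""
    both = sum(1 for t in M if A | B <= t)
    return both / sum(1 for t in M if A <= t)

# e.g. Milk -> Tea holds with confidence 2/2 = 100%,
# while Tea -> Milk holds with confidence 2/3 ≈ 67%
```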
7. Combine three items and calculate their
support
MILK, TEA, EGGS      1
MILK, TEA, BREAD     1
MILK, EGGS, BREAD    1
TEA, EGGS, BREAD     2
8. Discard the itemsets with support less than 2
Association rules:
Tea → Eggs, Bread (a person who buys tea likely buys eggs and bread)
Eggs → Tea, Bread
Bread → Tea, Eggs