
Mining Association Rules
• Basic concepts
• Apriori algorithm
• Different data formats for mining
• Mining with multiple minimum supports
• Mining class association rules
• Summary

2
Association rule mining
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied
extensively by the database and data mining
community.
• Assume all data are categorical.
• No good algorithm for numeric data.
• Initially used for Market Basket Analysis to find how
items purchased by customers are related.

Bread  Milk [sup = 5%, conf = 100%]

3
The model: data
• I = {i1, i2, …, im}: a set of items.
• Transaction t:
– t is a set of items, and t ⊆ I.
• Transaction Database T: a set of transactions T
= {t1, t2, …, tn}.

4
Transaction data: supermarket data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
• Concepts:
– An item: an item/article in a basket
– I: the set of all items sold in the store
– A transaction: items purchased in a basket; it may
have a TID (transaction ID)
– A transactional dataset: A set of transactions

5
Transaction data: a set of documents
• A text document data set. Each document is
treated as a “bag” of keywords
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game

6
The model: rules
• A transaction t contains X, a set of items
(itemset) in I, if X ⊆ t.
• An association rule is an implication of the
form:
X → Y, where X, Y ⊆ I, and X ∩ Y = ∅

• An itemset is a set of items.


– E.g., X = {milk, bread, cereal} is an itemset.
• A k-itemset is an itemset with k items.
– E.g., {milk, bread, cereal} is a 3-itemset

7
Rule strength measures
• Support: The rule holds with support sup in T
(the transaction data set) if sup% of
transactions contain X ∪ Y.
– sup = Pr(X ∪ Y).
• Confidence: The rule holds in T with
confidence conf if conf% of transactions that
contain X also contain Y.
– conf = Pr(Y | X)
• An association rule is a pattern stating that
when X occurs, Y occurs with a certain
probability.

8
Support and Confidence
• Support count: The support count of an
itemset X, denoted by X.count, in a data set T
is the number of transactions in T that
contain X. Assume T has n transactions.
• Then, ( X  Y ).count
support 
n
( X  Y ).count
confidence 
X .count
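A minimal Python sketch of these two measures (not part of the original slides); the three-transaction dataset and the bread → milk rule are illustrative assumptions:

# Count-based support and confidence, following the formulas above.
T = [{"bread", "cheese", "milk"},
     {"apple", "eggs", "salt", "yogurt"},
     {"biscuit", "eggs", "milk"}]   # assumed toy dataset

def support_count(itemset, transactions):
    # Number of transactions that contain every item in itemset.
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y, transactions):
    return support_count(X | Y, transactions) / len(transactions)

def confidence(X, Y, transactions):
    return support_count(X | Y, transactions) / support_count(X, transactions)

print(support({"bread"}, {"milk"}, T))     # 1/3: only t1 contains {bread, milk}
print(confidence({"bread"}, {"milk"}, T))  # 1.0: every bread transaction has milk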
9
Goal and key features
• Goal: Find all rules that satisfy the user-
specified minimum support (minsup) and
minimum confidence (minconf).

• Key Features
– Completeness: find all rules.
– No target item(s) on the right-hand side
– Mining with data on hard disk (not in memory)

10
An example
• Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
• Assume:
minsup = 30%
minconf = 80%
• An example frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7]
• Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
… …
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]

11
Transaction data representation
• A simplistic view of shopping baskets.
• Some important information is not considered, e.g.:
– the quantity of each item purchased, and
– the price paid

12
The Apriori algorithm
• Probably the best known algorithm
• Two steps:
– Find all itemsets that have minimum support
(frequent itemsets, also called large itemsets).
– Use frequent itemsets to generate rules.

• E.g., a frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]

13
Step 1: Mining all frequent itemsets
• A frequent itemset is an itemset whose support
is ≥ minsup.
• Key idea: The apriori property (downward
closure property): every subset of a frequent
itemset is also a frequent itemset.
(Illustration: the itemset lattice over items A, B, C, D)
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D

14
The Algorithm
• Iterative algorithm (also called level-wise search):
find all 1-item frequent itemsets, then all 2-item frequent
itemsets, and so on.
– In each iteration k, only consider itemsets that
contain a frequent itemset of size k-1.
• Find frequent itemsets of size 1: F1
• For k = 2, 3, …:
– Ck = candidates of size k: those itemsets of size k
that could be frequent, given Fk-1
– Fk = those candidates that are actually frequent, Fk ⊆
Ck (this needs one scan of the database; see the sketch below).
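A compact Python sketch of this level-wise loop (an in-memory illustration, not the disk-based algorithm of the slides); candidate generation here applies the join-and-prune idea detailed on the following slides:

from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)
    min_count = minsup * n              # minsup given as a fraction
    # F1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Fk = {s for s, c in counts.items() if c >= min_count}
    frequent = {s: counts[s] for s in Fk}
    k = 2
    while Fk:
        # Candidates Ck: unions of two frequent (k-1)-itemsets that have
        # size k and whose (k-1)-subsets are all frequent (apriori property).
        Ck = {a | b for a in Fk for b in Fk
              if len(a | b) == k
              and all(frozenset(s) in Fk for s in combinations(a | b, k - 1))}
        # One scan over the data to count every candidate.
        ccounts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Fk = {c for c, cnt in ccounts.items() if cnt >= min_count}
        frequent.update({c: ccounts[c] for c in Fk})
        k += 1
    return frequent

# Reproduces the worked example on the next slide (minsup = 0.5):
T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(T, 0.5))  # includes frozenset({2, 3, 5}): 2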
15
Example – Finding frequent itemsets (minsup = 0.5)
Dataset T:
TID   Items
T100  1, 3, 4
T200  2, 3, 5
T300  1, 2, 3, 5
T400  2, 5
itemset:count
1. scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
→ F1: {1}:2, {2}:3, {3}:3, {5}:3
→ C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
→ F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
→ C3: {2,3,5}
3. scan T → C3: {2,3,5}:2 → F3: {2,3,5}

16
Details: ordering of items
• The items in I are sorted in lexicographic order
(which is a total order).
• The order is used throughout the algorithm in
each itemset.
• {w[1], w[2], …, w[k]} represents a k-itemset w
consisting of items w[1], w[2], …, w[k], where
w[1] < w[2] < … < w[k] according to the total
order.

17
Apriori candidate generation
• The candidate-gen function takes Fk-1 and
returns a superset (called the candidates)
of the set of all frequent k-itemsets. It has
two steps
– join step: Generate all possible candidate
itemsets Ck of length k
– prune step: Remove those candidates in Ck that
cannot be frequent.

18
An example
• F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4},
{1, 3, 5}, {2, 3, 4}}

• After join
– C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
• After pruning:
– C4 = {{1, 2, 3, 4}}
because {1, 4, 5} is not in F3 ({1, 3, 4, 5} is removed)
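A Python sketch of candidate-gen under the ordering assumption above (itemsets kept as sorted tuples); the demo below reproduces this example:

from itertools import combinations

def candidate_gen(F_prev, k):
    # F_prev: the frequent (k-1)-itemsets, each a sorted tuple.
    fset = set(F_prev)
    F = sorted(F_prev)
    Ck = []
    for i in range(len(F)):
        for j in range(i + 1, len(F)):
            a, b = F[i], F[j]
            # Join step: merge two itemsets sharing their first k-2 items.
            if a[:-1] == b[:-1]:
                c = a + (b[-1],)
                # Prune step: drop c if any (k-1)-subset is not frequent.
                if all(s in fset for s in combinations(c, k - 1)):
                    Ck.append(c)
    return Ck

F3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(candidate_gen(F3, 4))
# [(1, 2, 3, 4)] -- (1, 3, 4, 5) is joined but pruned, since (1, 4, 5) is not in F3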

19
Step 2: Generating rules from frequent
itemsets
• Frequent itemsets  association rules
• One more step is needed to generate
association rules
• For each frequent itemset X,
For each proper nonempty subset A of X,
– Let B = X - A
– A  B is an association rule if
• Confidence(A  B) ≥ minconf,
support(A  B) = support(AB) = support(X)
confidence(A  B) = support(A  B) / support(A)

20
Generating rules: an example
• Suppose {2,3,4} is frequent, with sup=50%
– Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with
sup=50%, 50%, 75%, 75%, 75%, 75% respectively
– This generates the following association rules:
• 2,3 → 4, confidence=100%
• 2,4 → 3, confidence=100%
• 3,4 → 2, confidence=67%
• 2 → 3,4, confidence=67%
• 3 → 2,4, confidence=67%
• 4 → 2,3, confidence=67%
• All rules have support = 50%
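A short Python sketch of this rule-generation step; the support table is taken from the example above, with minconf = 80% as assumed earlier:

from itertools import combinations

def gen_rules(X, support, minconf):
    # X: a frequent itemset (frozenset); support: itemset -> support fraction.
    rules = []
    for r in range(1, len(X)):                  # proper nonempty subsets A
        for A in map(frozenset, combinations(X, r)):
            conf = support[X] / support[A]      # support(X) / support(A)
            if conf >= minconf:
                rules.append((A, X - A, support[X], conf))
    return rules

sup = {frozenset(k): v for k, v in [
    ((2, 3, 4), 0.50), ((2, 3), 0.50), ((2, 4), 0.50), ((3, 4), 0.75),
    ((2,), 0.75), ((3,), 0.75), ((4,), 0.75)]}
for A, B, s, c in gen_rules(frozenset({2, 3, 4}), sup, 0.8):
    print(sorted(A), "->", sorted(B), f"[sup={s:.0%}, conf={c:.0%}]")
# Only 2,3 -> 4 and 2,4 -> 3 reach minconf = 80%.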

21
Generating rules: summary
• To recap, in order to obtain A  B, we need
to have support(A  B) and support(A)
• All the required information for confidence
computation has already been recorded in
itemset generation. No need to see the data T
any more.
• This step is not as time-consuming as
frequent itemsets generation.

22
On Apriori Algorithm
Seems to be very expensive
• Level-wise search
• K = the size of the largest itemset
• It makes at most K passes over data
• In practice, K is bounded (often around 10).
• The algorithm is very fast. Under some conditions, all
rules can be found in linear time.
• Scale up to large data sets

23
More on association rule mining
• Clearly the space of all association rules is
exponential, O(2m), where m is the number of
items in I.
• The mining exploits sparseness of data, and
high minimum support and high minimum
confidence values.
• Still, it always produces a huge number of
rules: thousands, tens of thousands, even
millions.
24
Different data formats for mining
• The data can be in transaction form or table
form
Transaction form:
a, b
a, c, d, e
a, d, f
Table form:
Attr1  Attr2  Attr3
a      b      d
b      c      e
• Table data need to be converted to transaction
form for association mining

25
From a table to a set of transactions
Table form:
Attr1  Attr2  Attr3
a      b      d
b      c      e

Transaction form:
(Attr1, a), (Attr2, b), (Attr3, d)
(Attr1, b), (Attr2, c), (Attr3, e)
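A tiny Python sketch of this conversion (the attribute names are the illustrative ones above); each (attribute, value) combination becomes a distinct item:

rows = [{"Attr1": "a", "Attr2": "b", "Attr3": "d"},
        {"Attr1": "b", "Attr2": "c", "Attr3": "e"}]

# Each (attribute, value) pair becomes one item in the transaction.
transactions = [{(attr, val) for attr, val in row.items()} for row in rows]
print(transactions[0])  # {('Attr1', 'a'), ('Attr2', 'b'), ('Attr3', 'd')}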

26
Example 2
• Given the dataset M below, identify the
frequent itemset with minimum support of 2
Transaction ID   Items Bought
T1               MILK, TEA, CAKE
T2               EGGS, TEA, BREAD
T3               MILK, EGGS, TEA, BREAD
T4               EGGS, BREAD
T5               ICE CREAM

27
Step 1: Scan the data.
Step 2: Calculate the support/frequency of all items.
Step 3: Discard the items with support less than the
minimum support of 2.
Step 4: Combine two items.
Step 5: Calculate the support/frequency of the two-item sets.
Step 6: Discard the two-item sets with support less than 2.
Step 7: Combine three items and calculate their support.
Step 8: Discard the three-item sets with support less than 2.
28
1. Scan dataset M
2. Calculate the support/frequency of all items

ITEMS BOUGHT   SUPPORT
MILK           2
EGGS           3
TEA            3
BREAD          3
ICE CREAM      1
CAKE           1

29
3. Discard the items with support less than 2

ITEMS BOUGHT   SUPPORT
MILK           2
TEA            3
EGGS           3
BREAD          3

30
4. Combine two items

ITEMS BOUGHT
MILK, TEA
MILK, EGGS
MILK, BREAD
TEA, EGGS
TEA, BREAD
EGGS, BREAD

31
5. Calculate the support/frequency of all pairs

ITEMS BOUGHT   SUPPORT
MILK, TEA      2
MILK, EGGS     1
MILK, BREAD    1
TEA, EGGS      2
TEA, BREAD     2
EGGS, BREAD    3

32
6. Discard the pairs with support less than 2

ITEMS BOUGHT
MILK, TEA
TEA, EGGS
TEA, BREAD
EGGS, BREAD

The sets of two items that are bought together most frequently are (MILK, TEA),
(TEA, EGGS), (TEA, BREAD), and (EGGS, BREAD).
Association rules:
Milk → Tea; a person who buys milk likely buys tea
Tea → Milk; a person who buys tea likely buys milk
Based on the table above, complete the association rules (a sketch follows below).
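One way to complete them: a Python sketch (not part of the original slides) that computes the confidence of both directions of each frequent pair in dataset M:

M = [{"MILK", "TEA", "CAKE"},
     {"EGGS", "TEA", "BREAD"},
     {"MILK", "EGGS", "TEA", "BREAD"},
     {"EGGS", "BREAD"},
     {"ICE CREAM"}]

def count(itemset):
    # Number of transactions in M containing the whole itemset.
    return sum(1 for t in M if itemset <= t)

for a, b in [("MILK", "TEA"), ("TEA", "EGGS"),
             ("TEA", "BREAD"), ("EGGS", "BREAD")]:
    print(f"{a} -> {b}: conf = {count({a, b}) / count({a}):.0%}")
    print(f"{b} -> {a}: conf = {count({a, b}) / count({b}):.0%}")
# MILK -> TEA is 100%, TEA -> MILK is 67%; EGGS -> BREAD and
# BREAD -> EGGS are both 100%; the remaining rules are 67%.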

33
7. Combine three items and calculate their
support.

ITEMS BOUGHT        SUPPORT
MILK, TEA, EGGS     1
MILK, TEA, BREAD    1
MILK, EGGS, BREAD   1
TEA, EGGS, BREAD    2

34
8. Discard the three-item sets with support less than 2

ITEMS BOUGHT        SUPPORT
TEA, EGGS, BREAD    2

Frequent itemset with minimum support of 2: {TEA, EGGS, BREAD}.

Thus the set of three items that are bought together most frequently
is {TEA, EGGS, BREAD}.

Association rules:
Tea → Eggs, Bread (a person who buys tea likely buys eggs and bread)
Eggs → Tea, Bread
Bread → Tea, Eggs

35
