
Data Mining Techniques (DMT)

By Kushal Anjaria
Session-2

• In this session, we will focus on data transformation and pattern mining. The first pattern we will consider is known as association rules.
• The association pattern was one of the earliest uses of data mining, originating in retail. Say, for example, you go to a supermarket or a mall and buy some items. For each such purchase, the bill can be recorded, so every transaction or purchase by a customer produces a row describing that basket of items. You can see a table below where the rows describe different transactions. TID 1 is transaction Id 1, for the first customer's transaction; the next row is the subsequent customer's transaction, and so on. Along with each transaction, the items purchased by that customer are noted. So, in the table, customer one has bought bread and milk; customer two has bought bread, diaper, beer and eggs; customer three has bought milk, diaper, beer and coke; and so on.
• These types of transactions are called market basket transactions. Each transaction consists of 2 parts: the first is the Id of the particular customer's transaction, and the second is the list of items purchased by that customer. Suppose every day thousands of people come to the supermarket and make this kind of transaction. Over, say, 1 or 2 years, there will be an enormous amount of data. IBM was the first company to analyze this type of data and come up with the association rule generation and mining technique.

Let's observe the following table and find out what IBM discovered from the data.

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke



The table shows that people who buy bread and milk are most likely to buy diapers, and people who buy diapers are most likely to buy beer. This kind of pattern has commercial significance. For example, if you buy diapers, I could give you a discount so that you can buy beer at a discounted rate. I can also arrange the placement of items in the store accordingly. Now the question is: from this vast amount of data, how do we calculate association rules?
For association rules, the following terminologies are useful:
1. Itemset: A collection of one or more items, e.g., {Bread, Milk, Diaper, Coke}. A k-itemset is an itemset that contains k items.
2. Support count (σ): The frequency of occurrence of an itemset. E.g., σ({Bread, Milk, Diaper}) = 2.
3. Support (s): The fraction of transactions that contain an itemset. E.g., s({Bread, Milk, Diaper}) = 2/5.
4. Frequent itemset: An itemset whose support is greater than or equal to some minimum support threshold.
5. Association rule: Represented in the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}. The support (s) of the rule X → Y is the fraction of transactions that contain both X and Y.
6. Confidence (c): How often the items in Y appear in transactions that contain X.
7. Example: X → Y = {Milk, Diaper} → {Beer}
   s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
   c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67
   (see the code sketch below)
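To make these definitions concrete, here is a minimal Python sketch that computes support and confidence over the example transactions; the helper names (support_count, support, confidence) are illustrative, not from any standard library.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    # sigma(itemset): number of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    # s(itemset): fraction of transactions containing the itemset
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    # c(X -> Y) = sigma(X union Y) / sigma(X)
    return support_count(X | Y, transactions) / support_count(X, transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X | Y, transactions))    # 0.4, i.e. 2/5
print(confidence(X, Y, transactions))  # 0.666..., i.e. 2/3

These numbers match the worked example in item 7 above.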
In simple words, support suggests whether an itemset is popular or not, and confidence suggests whether the items are purchased together or not. An association rule requires both: the itemset should be popular and the rule should be confident. From the above concepts and understanding, the following questions may come to the reader's mind:
❑ How to find association rules?
❑ How to scan millions of transactions and check which itemsets satisfy the support and confidence criteria?
❑ Which mathematical concept is applicable in the case of association rule mining?
❑ How to visualize and represent huge data and the associations among data points?

General steps to generate association rules:
❑ Suppose you have the form X → Y.
❑ First of all, consider all items and all possible values of X and Y.
❑ Based on these X and Y, try to make rules using support and confidence.
❑ The initial form of such rules is known as candidate rules.
❑ Next, decide the threshold confidence and support values.
❑ If, for some pair of X and Y in the candidate rules, the confidence and support values are above the thresholds, then they are rules. Example: {Milk, Diaper} → {Beer}. A small sketch of this brute-force procedure follows below.
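The following is a minimal sketch of that brute-force procedure, reusing the transactions list from the previous sketch; the function name brute_force_rules and the threshold values are illustrative.

from itertools import chain, combinations

def nonempty_subsets(items):
    # All non-empty subsets of a collection, as tuples.
    items = list(items)
    return chain.from_iterable(combinations(items, k) for k in range(1, len(items) + 1))

def brute_force_rules(transactions, minsup=0.4, minconf=0.6):
    n = len(transactions)
    all_items = set().union(*transactions)

    def count(s):
        # sigma(s): support count of itemset s
        return sum(1 for t in transactions if s <= t)

    rules = []
    # Consider every possible itemset X union Y of size >= 2 that meets minsup ...
    for cand in nonempty_subsets(all_items):
        xy = frozenset(cand)
        if len(xy) < 2 or count(xy) / n < minsup:
            continue
        # ... and every binary split of it into a candidate rule X -> Y.
        for lhs in nonempty_subsets(xy):
            X = frozenset(lhs)
            Y = xy - X
            if Y and count(xy) / count(X) >= minconf:
                rules.append((set(X), set(Y), count(xy) / n, count(xy) / count(X)))
    return rules

print(brute_force_rules(transactions))  # includes ({'Milk', 'Diaper'}, {'Beer'}, 0.4, 0.66...)

Note that the outer loop enumerates every subset of the item universe, which is exactly why this approach breaks down as the number of items grows.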
The above approach definitely gives us the required result. However, it is computationally prohibitive: suppose there are 100 items; then 2^100 candidate itemsets would appear. Thus, this brute-force approach will not work. To form association rules, we instead use the Apriori algorithm.
❑ Just like the general steps, the Apriori technique is also based on two pillar elements:
1. Frequent itemset generation: generate all itemsets whose support >= minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
❑ We understand the Apriori technique using a lattice diagram.

Fig-1 Lattice diagram to understand the Apriori Technique

Fig-2 Example of Apriori Algorithm


The lattice theory can be related with the Apriori technique in the following ways:
❑ If there are d items, then there are 2^d possible candidate itemsets.
❑ The lattice starts with the null set and ends with the full itemset.
❑ With each level, the itemsets grow uniformly by one item.
❑ With each level, the number of sets a node can generate is reduced by one. E.g., with five items A to E, A can generate 4 sets, AB can generate 3 sets, ABC can generate 2 sets, and ABCD can generate 1 set.
❑ In the lattice, one can check each and every member of the graph as a candidate itemset, check whether it appears frequently or not, and then decide the association rules. This is basically the brute-force approach.
❑ For the computational complexity, we also have to consider the length of the itemsets.
❑ If I know that ABCD appears frequently, can I say something about its upper layer? Or, if I know ABCD is not frequent, can I say something about the upper layer?
❑ For example, if I know that people do not buy milk and bread together frequently, can I answer whether people buy milk, bread and beer together frequently or not?
❑ Yes: if people do not buy milk and bread frequently, then people will not buy milk, bread and beer frequently.
❑ This intuition leads to the Apriori principle. The Apriori principle states that:
Apriori Principle: If an itemset is frequent, then all of its subsets must also be frequent.
❑ As per the Apriori principle, the support of an itemset never exceeds the support of its subsets.
❑ This is known as the anti-monotone property of support. The contrapositive also holds: if an itemset is infrequent, then all of its supersets must be infrequent.
❑ In the lattice, if AB is not frequent, then ABC cannot be frequent. How many other sets cannot be frequent?
❑ The lattice can be helpful in finding this pattern: using the Apriori principle, we can prune candidate itemsets.
The Apriori algorithm starts with itemsets of size one and checks whether they are frequent or not. An example of the Apriori algorithm is shown in Fig-2, and a compact sketch is given below.
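Below is a compact Python sketch of frequent itemset generation with Apriori pruning; it illustrates the idea rather than reproducing the exact procedure of Fig-2, and the minsup value is arbitrary.

from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup=0.6):
    n = len(transactions)

    def support(iset):
        return sum(1 for t in transactions if iset <= t) / n

    # Level 1: frequent single items.
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    frequent = list(level)
    k = 2
    while level:
        # Join step: build size-k candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must itself be frequent.
        prev = set(level)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count support only for the surviving candidates.
        level = [c for c in candidates if support(c) >= minsup]
        frequent.extend(level)
        k += 1
    return frequent

With the five example transactions and minsup = 0.6, this returns the frequent items Bread, Milk, Diaper and Beer plus the pairs {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper} and {Diaper, Beer}; {Bread, Milk, Diaper} is counted but rejected (support 0.4), while the other 3-itemsets are pruned without counting because one of their pairs is already infrequent.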
Rule generation using the Apriori principle
❑ Now we have the frequent itemsets. From these itemsets, how do we generate rules? Note that confidence does not follow the Apriori property in general; i.e., c(ABC → D) can be larger or smaller than c(AB → D).
❑ However, the confidence of rules generated from the same itemset does follow the Apriori property. In other words, confidence is anti-monotone with respect to the number of items on the RHS of the rule. E.g., if {A, B, C, D} is the frequent itemset, then c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD).
❑ The rule generation is shown in Figure-3. Examples of the Apriori algorithm are shown in Figure-4 and Figure-5. A sketch of this rule-generation step follows below.
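Below is a sketch of that rule-generation step for a single frequent itemset; the support_count argument is assumed to be a dictionary mapping frozensets to their support counts (e.g., collected during the Apriori pass above), and the function name is illustrative.

def rules_from_itemset(itemset, support_count, minconf=0.6):
    itemset = frozenset(itemset)
    rules = []
    # Start with 1-item consequents (RHS) and grow the RHS level by level.
    rhs_level = [frozenset([i]) for i in itemset]
    while rhs_level:
        survivors = []
        for Y in rhs_level:
            X = itemset - Y
            if not X:
                continue
            conf = support_count[itemset] / support_count[X]
            if conf >= minconf:
                rules.append((set(X), set(Y), conf))
                survivors.append(Y)  # only confident consequents are extended
            # Otherwise, by the anti-monotone property above, every rule whose
            # RHS is a superset of Y has even lower confidence, so Y is pruned.
        # Merge surviving consequents to form RHS candidates one item larger.
        rhs_level = list({a | b for a in survivors for b in survivors
                          if len(a | b) == len(a) + 1})
    return rules

For the frequent itemset {Milk, Diaper, Beer} from the example data, this yields {Milk, Diaper} → {Beer}, {Milk, Beer} → {Diaper} and {Diaper, Beer} → {Milk} at the first level, and only {Beer} → {Milk, Diaper} survives at the second.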

Figure-3 Apriori Algorithm for association rule generation

Figure-4 Apriori Algorithm Example


Figure-5 Apriori Algorithm Example
