Apriori Algorithm


Chapter 5: Association Rules "Apriori Algorithm"

By: Yara Ahmed Swaisa


Supervised by Dr. Sayed Badr
Subject: Application of Big Data
Agenda:

1. Introduction to data mining.
2. Overview of Association Rules.
3. Apriori Algorithm.
4. Evaluation of Candidate Rules.
5. Conclusion.
6. Exercises from the book.
What is data mining?

Data mining is the process of extracting knowledge, patterns, and insights from
large volumes of data. It involves using various computational techniques and
applying advanced algorithms to discover hidden relationships, patterns, and
trends within massive datasets that can help organizations make informed
decisions.
What are Association Rules?

• Association Rules are a fundamental concept in data mining and machine learning that capture relationships or associations among items in large datasets. They reveal patterns of co-occurrence or dependency between different items based on their presence in transactions or events.

• An association rule typically takes the form of an implication statement: "If X, then Y." X and Y represent sets of items or events, and the rule indicates that if X occurs in a transaction or event, then there is a high likelihood that Y will also occur.
Market basket analysis example:

• The goal is to understand the relationships between products that are frequently purchased together. In this example, the first three rules suggest that when cereal is purchased, 90% of the time milk is also purchased; when bread is purchased, 40% of the time milk is also purchased; and when milk is purchased, 23% of the time cereal is also purchased.

• This information helps businesses optimize product placement, cross-selling, and upselling strategies.
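To make percentages like these concrete: the confidence of a rule such as cereal → milk is simply the share of cereal-containing baskets that also contain milk. A minimal sketch, using made-up baskets for illustration rather than the book's data (which produced the 90%, 40%, and 23% figures):

```python
# Hypothetical mini-baskets for illustration; not the book's dataset
baskets = [
    {"cereal", "milk"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"cereal", "milk", "bread"},
    {"milk"},
]

def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent."""
    with_x = [b for b in baskets if antecedent <= b]
    return sum(1 for b in with_x if consequent <= b) / len(with_x)

conf_cereal_milk = confidence({"cereal"}, {"milk"}, baskets)
# 1.0 here, since both cereal baskets also contain milk
```

With real basket data, the same calculation yields figures like the 90% and 40% quoted above.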
What is the Apriori Algorithm?

• The Apriori algorithm is one of the fundamental algorithms for generating Association Rules. It is used for discovering frequent itemsets within a dataset and generating association rules based on those itemsets.

• An itemset is a collection of items that occur together as a set, such as {item1, item2, item3}.
Apriori algorithm steps:
• The Apriori algorithm is used to identify frequent itemsets in a set of transactions.

• The algorithm starts by considering each item in the transactions as a 1-itemset.

• It determines the support of each 1-itemset by checking if it meets the minimum support threshold.

• The algorithm then grows the itemsets by joining the frequent 1-itemsets with one another to form candidate 2-itemsets.

• It determines the support of each 2-itemset and prunes away those that do not meet the minimum
support threshold.

• This growing and pruning process is repeated until no new itemsets meet the minimum support threshold.

• Optionally, a maximum number of items or iterations can be specified to limit the size and runtime of the
algorithm.

• The output of the Apriori algorithm is a collection of all the frequent k-itemsets.

• Based on the frequent itemsets, a collection of candidate rules is formed.
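The grow-and-prune loop above can be sketched in Python. This is a minimal illustration assuming transactions are given as sets of items; it omits the subset-based candidate pruning of the full algorithm:

```python
def apriori(transactions, min_support):
    """Minimal sketch of the Apriori grow-and-prune loop.

    Returns every frequent itemset (as a frozenset) mapped to its support.
    """
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset
        return sum(1 for t in transactions if itemset <= t) / n

    # Start with all 1-itemsets observed in the data
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Prune: keep only candidates that meet the minimum support
        survivors = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                survivors[candidate] = s
        frequent.update(survivors)
        # Grow: join surviving k-itemsets to form (k+1)-item candidates
        k += 1
        current = {a | b for a in survivors for b in survivors if len(a | b) == k}
    return frequent

transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "D"}, {"A", "C", "D"}]
result = apriori(transactions, min_support=0.5)
# Frequent itemsets: {A} (0.8), {C} (0.8), {A, C} (0.6)
```

On these five transactions with a minimum support of 0.5, only {A}, {C}, and {A, C} survive, matching the worked exercise at the end of this chapter.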


Evaluation of Candidate Rules:
• The evaluation of candidate rules in association rule mining involves measures such as
confidence, lift, and leverage to assess the appropriateness and significance of these rules.

• Confidence is a measure of certainty or trustworthiness associated with a rule. It represents the percentage of transactions that contain both X and Y out of all the transactions that contain X. A higher confidence indicates a stronger relationship between the antecedent (X) and the consequent (Y) of the rule. Rules with confidence greater than or equal to a predefined minimum confidence threshold are considered interesting.
• Lift measures how many times more often X and Y occur together than expected if they
were statistically independent. A lift value greater than 1 indicates that there is a
meaningful relationship between X and Y. A higher lift suggests a stronger association
between the items.

• Leverage: similar to lift, compares the observed co-occurrence of X and Y with the
expected co-occurrence if they were statistically independent. It measures the difference
in the probability of X and Y appearing together. A leverage value greater than 0 indicates a
non-random relationship between X and Y.

• By considering lift and leverage along with confidence, it becomes possible to identify
interesting and meaningful rules while filtering out coincidental associations. These
measures help ensure that the discovered rules are not only trustworthy but also
statistically significant.
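As a sketch of how these three measures relate, the helper below computes confidence, lift, and leverage for a rule X → Y directly from the definitions above, using the small transaction set from the chapter's exercise:

```python
def supp(itemset, transactions):
    # Fraction of transactions containing the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def evaluate_rule(x, y, transactions):
    """Confidence, lift, and leverage for the rule X -> Y."""
    s_x = supp(x, transactions)
    s_y = supp(y, transactions)
    s_xy = supp(x | y, transactions)
    confidence = s_xy / s_x          # P(Y | X)
    lift = s_xy / (s_x * s_y)        # ratio of observed to expected co-occurrence
    leverage = s_xy - s_x * s_y      # observed minus expected co-occurrence
    return confidence, lift, leverage

transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "D"}, {"A", "C", "D"}]
conf, lift, lev = evaluate_rule({"B"}, {"C"}, transactions)
# For B -> C: confidence = 1.0, lift ≈ 1.25 (> 1), leverage ≈ 0.08 (> 0)
```

Here lift > 1 and leverage > 0 both point to a non-random association between B and C, while the confidence of 1.0 says every transaction containing B also contains C.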
Advantages and disadvantages of
the Apriori algorithm
• Advantages of the Apriori algorithm:

Simple and easy to understand.

Prunes the candidate search space effectively using the Apriori property.

Helps in decision-making by identifying frequent itemsets and association rules.

• Disadvantages of the Apriori algorithm:

High computational complexity, especially for large datasets.

Generates a large number of candidate itemsets, including many irrelevant ones.

Not suitable for dynamic or streaming data.


Improve the efficiency of the
Apriori algorithm
There are several approaches to improve the efficiency of the Apriori
algorithm:

1. Partitioning: Any itemset that is frequent in the full transaction database must be frequent in at least one partition of the database. By mining each partition separately and then validating the local frequent itemsets against the full database, the algorithm reduces the search
space.

2. Sampling: This approach involves extracting a subset of the data with a lower support threshold and
performing association rule mining on the subset. This reduces the computational overhead.

3. Transaction reduction: Transactions that do not contain frequent k-itemsets are irrelevant for subsequent
scans and can be ignored. This reduces the number of transactions that need to be processed.
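Of these, transaction reduction is the simplest to sketch. The helper below (an illustrative sketch, not code from the book) drops transactions that contain no frequent itemset before the next scan:

```python
def reduce_transactions(transactions, frequent_itemsets):
    """Transaction reduction: keep only transactions containing at least one
    frequent itemset; the rest cannot support any larger frequent itemset
    and can be skipped in subsequent scans."""
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_itemsets)]

transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "D"}, {"A", "C", "D"}]
frequent_2_itemsets = [frozenset({"A", "C"})]   # the only frequent 2-itemset here
reduced = reduce_transactions(transactions, frequent_2_itemsets)
# {B, C} and {A, D} are dropped; three transactions remain for the next scan
```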
Conclusion:

• Association rules are unsupervised analysis techniques used to uncover relationships among items in datasets.
They have various applications, including market basket analysis, clickstream analysis, and recommendation
engines. While association rules don't predict outcomes, they excel at identifying interesting and non-obvious
relationships that provide valuable insights for improving business operations.

• The Apriori algorithm is a fundamental algorithm for generating association rules. This chapter demonstrated
the steps of the Apriori algorithm using a grocery store example to generate frequent itemsets and useful rules.
Measures such as support, confidence, lift, and leverage were discussed to evaluate the rules and distinguish
interesting relationships from coincidental ones. The chapter also outlined the advantages and disadvantages of
the Apriori algorithm and suggested methods to enhance its efficiency.
Exercises from the book
• What is the Apriori property? The Apriori property is a principle in association rule mining that states if an itemset is
frequent, then all of its subsets must also be frequent. It means that if a set of items occurs frequently in a dataset, then
any subset of that set is also expected to occur frequently. This property is used to reduce the search space and improve
the efficiency of mining association rules by eliminating infrequent itemsets and their subsets. The Apriori algorithm
leverages this property to incrementally build larger itemsets from frequent smaller itemsets.
• Following is a list of five transactions that include items A, B, C, and D:
T1 : { A,B,C } T2 : { A,C } T3 : { B,C } T4 : { A,D } T5 : { A,C,D }
Which itemsets satisfy the minimum support of 0.5? (Hint: An itemset may include more than one item.)?
Let's calculate the support for each itemset:
Itemset {A}: Appears in transactions T1, T2, T4, T5. Support = 4/5 = 0.8 (80%)
Itemset {B}: Appears in transactions T1, T3. Support = 2/5 = 0.4 (40%)
Itemset {C}: Appears in transactions T1, T2, T3, T5. Support = 4/5 = 0.8 (80%)
Itemset {D}: Appears in transactions T4, T5. Support = 2/5 = 0.4 (40%)
Itemset {A, B}: Appears in transaction T1. Support = 1/5 = 0.2 (20%)
Itemset {A, C}: Appears in transactions T1, T2, T5. Support = 3/5 = 0.6 (60%)
Itemset {A, D}: Appears in transactions T4, T5. Support = 2/5 = 0.4 (40%)
Itemset {B, C}: Appears in transactions T1, T3. Support = 2/5 = 0.4 (40%)
Itemset {C, D}: Appears in transaction T5. Support = 1/5 = 0.2 (20%)
Itemset {A, B, C}: Appears in transaction T1. Support = 1/5 = 0.2 (20%)
Itemset {A, C, D}: Appears in transaction T5. Support = 1/5 = 0.2 (20%)
Itemsets that satisfy the minimum support of 0.5 are: {A} {C} {A, C}
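The support calculation above can be checked mechanically. The sketch below enumerates every non-empty itemset over A, B, C, and D and recomputes the supports listed:

```python
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "D"}, {"A", "C", "D"}]
items = sorted({item for t in transactions for item in t})

# Support of every possible non-empty itemset over the four items
supports = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        itemset = frozenset(combo)
        supports[itemset] = sum(1 for t in transactions if itemset <= t) / len(transactions)

frequent = {s for s, sup in supports.items() if sup >= 0.5}
# frequent == {frozenset({"A"}), frozenset({"C"}), frozenset({"A", "C"})}
```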
• How are interesting rules identified? How are interesting rules distinguished from coincidental rules?

Interesting rules are identified based on their significance and relevance to the analysis objectives. Several
measures are commonly used to evaluate the interestingness of association rules, including support, confidence, lift,
and leverage.
Support measures how frequently an itemset occurs in the dataset. Rules built on itemsets with higher support occur more often
and are more likely to be interesting.
Confidence measures the reliability or trustworthiness of a rule. It represents the conditional probability that the
consequent of the rule holds true given the antecedent. High-confidence rules suggest a strong association between
the antecedent and consequent.
Lift compares the observed frequency of the rule's antecedent and consequent occurring together to the expected
frequency under independence. A lift value greater than 1 indicates a meaningful relationship between the items and
suggests an interesting rule.
Leverage measures the difference between the observed frequency of the rule and the expected frequency if the
items were independent. A positive leverage value indicates a non-random relationship and suggests an interesting
rule.
To distinguish interesting rules from coincidental rules, a combination of these measures is employed. Rules with high
support, confidence, lift, and leverage are considered more likely to be meaningful and interesting. Additionally,
domain knowledge and human insights play a crucial role in evaluating the relevance and significance of rules. By
considering these measures and incorporating expert knowledge, analysts can filter out coincidental rules and focus
on those that provide valuable insights and actionable information.
Thanks
