DWDM Unit-4


DATA WAREHOUSING AND MINING

UNIT IV:
Association Analysis: Problem Definition, Frequent Itemset Generation, Rule Generation:
Confidence-Based Pruning, Rule Generation in Apriori Algorithm, Compact Representation of
Frequent Itemsets, FP-Growth Algorithm.
Association Analysis: Introduction
Many business enterprises accumulate large quantities of data from their day-to-day operations.
For example, huge amounts of customer purchase data are collected daily at the checkout
counters of grocery stores. Table 1 illustrates an example of such data, commonly known as
market basket transactions.

Table 1: An example of market basket transactions.


Each row in this table corresponds to a transaction, which contains a unique identifier labeled TID
and a set of items bought by a given customer. Retailers are interested in analyzing the data to
learn about the purchasing behavior of their customers. Such valuable information can be used to
support a variety of business-related applications such as marketing promotions, inventory
management, and customer relationship management.

A methodology known as association analysis is useful for discovering interesting relationships
hidden in large data sets. The uncovered relationships can be represented in the form of
association rules or sets of frequent items. For example, the following rule can be extracted from
the data set shown in Table 1:
{Diapers} → {Beer}.
The rule suggests that a strong relationship exists between the sale of diapers and beer because
many customers who buy diapers also buy beer. Retailers can use this type of rule to help them
identify new opportunities for cross-selling their products to customers.

Besides market basket data, association analysis is also applicable to other application domains
such as bioinformatics, medical diagnosis, Web mining, and scientific data analysis.

There are two key issues that need to be addressed when applying association analysis to
market basket data.
1. First, discovering patterns from a large transaction data set can be computationally
expensive.
2. Second, some of the discovered patterns are potentially spurious (false or fake) because
they may happen simply by chance.

1. Problem Definition
The basic terminology used in association analysis is described below.

Binary Representation Market basket data can be represented in a binary format as shown
in Table 2, where each row corresponds to a transaction and each column corresponds to an item.

Table 2: A binary 0/1 representation of market basket data.

An item can be treated as a binary variable whose value is one if the item is present in a
transaction and zero otherwise. Because the presence of an item in a transaction is often
considered more important than its absence, an item is an asymmetric binary variable. This
representation is perhaps a very simplistic view of real market basket data because it ignores
certain important aspects of the data such as the quantity of items sold or the price paid to
purchase them.

Itemset and Support Count Let I = {i1, i2, . . . , id} be the set of all items in market basket
data and T = {t1, t2, . . . , tN} be the set of all transactions. Each transaction ti contains a subset of
items chosen from I. In association analysis, a collection of zero or more items is termed an
itemset. If an itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is
an example of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.

The transaction width is defined as the number of items present in a transaction. A transaction tj
is said to contain an itemset X if X is a subset of tj . For example, the second transaction shown in
Table 2 contains the itemset {Bread, Diapers} but not {Bread, Milk}. An important property of an
itemset is its support count, which refers to the number of transactions that contain a particular
itemset. Mathematically, the support count, σ(X), for an itemset X can be stated as follows:

σ(X) = |{ti | X ⊆ ti, ti ∈ T}|,

where the symbol | · | denotes the number of elements in a set. In the data set shown in Table 2,
the support count for {Beer, Diapers, Milk} is equal to two because there are only two transactions
that contain all three items.

Association Rule An association rule is an implication expression of the form X → Y, where X
and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured in
terms of its support and confidence. Support determines how often a rule is applicable to a given
data set, while confidence determines how frequently items in Y appear in transactions that
contain X. The formal definitions of these metrics are

Support, s(X → Y) = σ(X ∪ Y) / N                      (1)

Confidence, c(X → Y) = σ(X ∪ Y) / σ(X)                (2)

Example 1: Consider the rule {Milk, Diapers} −→ {Beer}. Since the support count for {Milk,
Diapers, Beer} is 2 and the total number of transactions is 5, the rule’s support is 2/5 = 0.4. The
rule’s confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by the support
count for {Milk, Diapers}. Since there are 3 transactions that contain milk and diapers, the
confidence for this rule is 2/3 = 0.67.
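The computation in Example 1 is easy to express in code. The following Python sketch is only illustrative: the five transactions below are an assumption chosen to be consistent with the counts quoted in Example 1, since the contents of Table 1 are not reproduced in these notes.

```python
# A minimal sketch of computing rule support and confidence.
# The transactions are assumed; they match the counts quoted in Example 1.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(antecedent, consequent, transactions):
    """Support and confidence of the rule antecedent -> consequent."""
    union = support_count(antecedent | consequent, transactions)
    support = union / len(transactions)                           # s = sigma(X U Y) / N
    confidence = union / support_count(antecedent, transactions)  # c = sigma(X U Y) / sigma(X)
    return support, confidence

s, c = rule_metrics({"Milk", "Diapers"}, {"Beer"}, transactions)
print(round(s, 2), round(c, 2))   # 0.4 0.67, matching Example 1
```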

Why Use Support and Confidence?


• Support is an important measure because a rule that has very low support may occur
simply by chance. A low support rule is also likely to be uninteresting from a business
perspective because it may not be profitable to promote items that customers seldom buy
together. Support is often used to eliminate uninteresting rules. Support also has a
desirable property that can be exploited for the efficient discovery of association rules.

• Confidence, on the other hand, measures the reliability of the inference made by a rule.
For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present in
transactions that contain X. Confidence also provides an estimate of the conditional
probability of Y given X.

Association analysis results should be interpreted with caution. The inference made by an
association rule does not necessarily imply causality. Instead, it suggests a strong co-occurrence
relationship between items in the antecedent and consequent of the rule.

Causality, on the other hand, requires knowledge about the causal and effect attributes in the
data and typically involves relationships occurring over time (e.g., ozone depletion leads to global
warming).

Formulation of Association Rule Mining Problem The association rule mining problem can be
formally stated as follows:

• Definition 1 (Association Rule Discovery): Given a set of transactions T, find all the rules
having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the
corresponding support and confidence thresholds.

A brute-force approach for mining association rules is to compute the support and confidence for
every possible rule. This approach is prohibitively expensive because there are exponentially many
rules that can be extracted from a data set. More specifically, the total number of possible rules
extracted from a data set that contains d items is

R = 3^d − 2^(d+1) + 1                                 (3)

Even for the small data set shown in Table 1, this approach requires us to compute the support
and confidence for 3^6 − 2^7 + 1 = 602 rules. More than 80% of the rules are discarded after applying
minsup = 20% and minconf = 50%, thus making most of the computations wasted. To
avoid performing needless computations, it would be useful to prune the rules early without
having to compute their support and confidence values.
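As a quick numerical check of Equation (3), the count can be evaluated directly; the helper below is only an illustrative sketch.

```python
# Total number of rules that can be extracted from d items (Equation (3)).
def total_rules(d):
    return 3 ** d - 2 ** (d + 1) + 1

print(total_rules(6))    # 602, the figure quoted above for the six-item data set
print(total_rules(20))   # roughly 3.5 billion: brute-force enumeration quickly becomes infeasible
```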

An initial step toward improving the performance of association rule mining algorithms is to
decouple the support and confidence requirements. From Equation 1, notice that the support of a
rule X → Y depends only on the support of its corresponding itemset, X ∪ Y. For example, the
following rules have identical support because they involve items from the same itemset,
{Beer, Diapers, Milk}:

{Beer, Diapers} → {Milk}, {Beer, Milk} → {Diapers}, {Diapers, Milk} → {Beer},
{Beer} → {Diapers, Milk}, {Diapers} → {Beer, Milk}, {Milk} → {Beer, Diapers}.

If the itemset is infrequent, then all six candidate rules can be pruned immediately without having
to compute their confidence values.

Therefore, a common strategy adopted by many association rule mining algorithms is to
decompose the problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy the
minsup threshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from the
frequent itemsets found in the previous step. These rules are called strong rules.

The computational requirements for frequent itemset generation are generally more expensive
than those of rule generation.

2. Frequent Itemset Generation
A lattice structure can be used to enumerate the list of all possible itemsets. Figure 1 shows an
itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially
generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in
many practical applications, the search space of itemsets that need to be explored is exponentially
large.

Figure 1: An itemset lattice.

A brute-force approach for finding frequent itemsets is to determine the support count for every
candidate itemset in the lattice structure. To do this, we need to compare each candidate against
every transaction, an operation that is shown in Figure 2. If the candidate is contained in a
transaction, its support count will be incremented. For example, the support for {Bread, Milk} is
incremented three times because the itemset is contained in transactions 1, 4, and 5. Such an
approach can be very expensive because it requires O(NMw) comparisons, where N is the number
of transactions, M = 2^k − 1 is the number of candidate itemsets, and w is the maximum transaction
width.

Figure 2: Counting the support of candidate itemsets.
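A direct implementation of this brute-force strategy makes the O(NMw) cost concrete. The sketch below is illustrative only: it enumerates every non-empty itemset over the items and matches each one against every transaction, which is workable only for very small data sets.

```python
# A brute-force sketch: every itemset in the lattice is a candidate, and each
# candidate is compared against every transaction (roughly O(N * M * w) work).
from itertools import combinations

def brute_force_frequent(transactions, minsup_count):
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):              # all 2^d - 1 non-empty itemsets
        for candidate in map(frozenset, combinations(items, size)):
            count = sum(1 for t in transactions if candidate <= t)
            if count >= minsup_count:                  # keep only frequent itemsets
                frequent[candidate] = count
    return frequent
```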

There are several ways to reduce the computational complexity of frequent itemset generation.
1. Reduce the number of candidate itemsets (M). The Apriori principle is an effective
way to eliminate some of the candidate itemsets without counting their support
values.
2. Reduce the number of comparisons. Instead of matching each candidate itemset
against every transaction, we can reduce the number of comparisons by using more
advanced data structures, either to store the candidate itemsets or to compress the
data set.

The Apriori Principle


“How does the support measure help to reduce the number of candidate itemsets explored during
frequent itemset generation?”

The use of support for pruning candidate itemsets is guided by the following principle.

Theorem 1: (Apriori Principle). If an itemset is frequent, then all of its subsets must also be
frequent.
To illustrate the idea behind the Apriori principle, consider the itemset lattice shown in Figure 3.
Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also
contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all
subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must also be frequent.

Figure 3: An illustration of the Apriori principle. If {c, d, e} is frequent, then all subsets of this
itemset are frequent.
Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent
too. As illustrated in Figure 4, the entire subgraph containing the supersets of {a, b} can be pruned
immediately once {a, b} is found to be infrequent. This strategy of trimming the exponential
search space based on the support measure is known as support-based pruning. Such a pruning
strategy is made possible by a key property of the support measure, namely, that the support for
an itemset never exceeds the support for its subsets. This property is also known as the anti-
monotone property of the support measure.

Figure 4: An illustration of support-based pruning. If {a, b} is infrequent, then all supersets of
{a, b} are infrequent.

Any measure that possesses an anti-monotone property can be incorporated directly into the
mining algorithm to effectively prune the exponential search space of candidate itemsets.

Frequent Itemset Generation in the Apriori Algorithm
Apriori is the first association rule mining algorithm that pioneered the use of support-based
pruning to systematically control the exponential growth of candidate itemsets.

Figure 5: Illustration of frequent itemset generation using the Apriori algorithm.

Figure 5 provides a high-level illustration of the frequent itemset generation part of the Apriori
algorithm for the transactions shown in Table 1. We assume that the support threshold is 60%,
which is equivalent to a minimum support count equal to 3.

Initially, every item is considered as a candidate 1-itemset. After counting their supports, the
candidate itemsets {Cola} and {Eggs} are discarded because they appear in fewer than three
transactions. In the next iteration, candidate 2-itemsets are generated using only the frequent 1-
itemsets because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must
be infrequent. Because there are only four frequent 1-itemsets, the number of candidate 2-
itemsets generated by the algorithm is C(4, 2) = 6.

Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be
infrequent after computing their support values. The remaining four candidates are frequent, and
thus will be used to generate candidate 3-itemsets. Without support-based pruning, there are
C(6, 3) = 20 candidate 3-itemsets that can be formed using the six items given in this example.

With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent.
The only candidate that has this property is {Bread, Diapers, Milk}.
The effectiveness of the Apriori pruning strategy can be shown by counting the number of
candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as
candidates will produce

C(6, 1) + C(6, 2) + C(6, 3) = 6 + 15 + 20 = 41

candidates. With the Apriori principle, this number decreases to

C(6, 1) + C(4, 2) + 1 = 6 + 6 + 1 = 13

candidates, which represents a 68% reduction in the number of candidate itemsets even in this
simple example.

The pseudo code for the frequent itemset generation part of the Apriori algorithm is shown in
Algorithm 1. Let Ck denote the set of candidate k-itemsets and Fk denote the set of frequent k-
itemsets:
Algorithm 1: Frequent itemset generation of the Apriori algorithm

• The algorithm initially makes a single pass over the data set to determine the support of
each item. Upon completion of this step, the set of all frequent 1-itemsets, F1, will be
known (steps 1 and 2).
• Next, the algorithm will iteratively generate new candidate k-itemsets using the frequent
(k − 1)-itemsets found in the previous iteration (step 5). Candidate generation is
implemented using a function called Apriori-gen.
• To count the support of the candidates, the algorithm needs to make an additional pass
over the data set (steps 6–10). The subset function is used to determine all the candidate
itemsets in Ck that are contained in each transaction t.

• After counting their supports, the algorithm eliminates all candidate itemsets whose
support counts are less than minsup (step 12).
• The algorithm terminates when there are no new frequent itemsets generated, i.e., Fk = ∅
(step 13).
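Since the pseudocode of Algorithm 1 is not reproduced here, the support-counting pass (steps 6 to 10) and the elimination of infrequent candidates (step 12) can be sketched as follows. This is a simplified, illustrative version: the actual algorithm uses a hash tree for the subset test rather than the naive double loop, and the function names are not the textbook's.

```python
# A simplified sketch of one support-counting pass of Algorithm 1.
def count_supports(candidates, transactions):
    """candidates: iterable of frozensets (Ck). Returns a dict of support counts."""
    counts = {c: 0 for c in candidates}
    for t in transactions:                    # one additional pass over the data set
        t = frozenset(t)
        for c in counts:                      # subset test: is candidate c contained in t?
            if c <= t:
                counts[c] += 1
    return counts

def frequent_from_counts(counts, minsup_count):
    """Step 12: keep only candidates whose support count reaches minsup."""
    return {c: n for c, n in counts.items() if n >= minsup_count}
```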

The frequent itemset generation part of the Apriori algorithm has two important characteristics.
1. First, it is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time,
from frequent 1-itemsets to the maximum size of frequent itemsets.

2. Second, it employs a generate-and-test strategy for finding frequent itemsets. At each
iteration, new candidate itemsets are generated from the frequent itemsets found in the
previous iteration. The support for each candidate is then counted and tested against the
minsup threshold. The total number of iterations needed by the algorithm is kmax+1,
where kmax is the maximum size of the frequent itemsets.

Candidate Generation and Pruning


The Apriori-gen function shown in Step 5 of Algorithm 1 generates candidate itemsets by
performing the following two operations:
1. Candidate Generation. This operation generates new candidate k-itemsets based on
the frequent (k − 1)-itemsets found in the previous iteration.

2. Candidate Pruning. This operation eliminates some of the candidate k-itemsets using
the support-based pruning strategy.

To illustrate the candidate pruning operation, consider a candidate k-itemset, X = {i1, i2, . . . , ik}.
The algorithm must determine whether all of its proper subsets, X − {ij} (∀ j = 1, 2, . . . , k), are
frequent. If one of them is infrequent, then X is immediately pruned. This approach can effectively
reduce the number of candidate itemsets considered during support counting. The complexity of
this operation is O(k) for each candidate k-itemset. If m of the k subsets were used to generate a
candidate, we only need to check the remaining k −m subsets during candidate pruning.

In principle, there are many ways to generate candidate itemsets. The following is a list of
requirements for an effective candidate generation procedure:
1. It should avoid generating too many unnecessary candidates. A candidate itemset is
unnecessary if at least one of its subsets is infrequent. Such a candidate is guaranteed
to be infrequent according to the anti-monotone property of support.

2. It must ensure that the candidate set is complete, i.e., no frequent itemsets are left out
by the candidate generation procedure. To ensure completeness, the set of candidate
itemsets must subsume the set of all frequent itemsets, i.e., ∀k : Fk ⊆ Ck.

3. It should not generate the same candidate itemset more than once. For example, the
candidate itemset {a, b, c, d} can be generated in many ways—by merging {a, b, c} with
{d}, {b, d} with {a, c}, {c} with {a, b, d}, etc. Generation of duplicate candidates leads to
wasted computations and thus should be avoided for efficiency reasons.

Several candidate generation procedures, including the one used by the Apriori-gen function, are described below.

Figure 6: A brute-force method for generating candidate 3-itemsets.

Figure 7: Generating and pruning candidate k-itemsets by merging a frequent (k−1)-itemset with
a frequent item. Note that some of the candidates are unnecessary because their subsets are
infrequent.

This approach of extending a frequent (k−1)-itemset with a frequent item (Figure 7), however, does
not prevent the same candidate itemset from being generated more than once. For instance,
{Bread, Diapers, Milk} can be generated by merging {Bread,
Diapers} with {Milk}, {Bread, Milk} with {Diapers}, or {Diapers, Milk} with {Bread}. One way to
avoid generating duplicate candidates is by ensuring that the items in each frequent itemset are
kept sorted in their lexicographic order. Each frequent (k−1)-itemset X is then extended with
frequent items that are lexicographically larger than the items in X. For example, the itemset
{Bread, Diapers} can be augmented with {Milk} since Milk is lexicographically larger than Bread
and Diapers. However, we should not augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with
{Diapers} because they violate the lexicographic ordering condition.

While this procedure is a substantial improvement over the brute-force method, it can still
produce a large number of unnecessary candidates. For example, the candidate itemset obtained
by merging {Beer, Diapers} with {Milk} is unnecessary because one of its subsets, {Beer, Milk}, is
infrequent. There are several heuristics available to reduce the number of unnecessary
candidates. For example, note that, for every candidate k-itemset that survives the pruning step,
every item in the candidate must be contained in at least k −1 of the frequent (k −1)-itemsets.
Otherwise, the candidate is guaranteed to be infrequent. For example, {Beer, Diapers, Milk} is a
viable candidate 3-itemset only if every item in the candidate, including Beer, is contained in at
least two frequent 2-itemsets. Since there is only one frequent 2-itemset containing Beer, all
candidate itemsets involving Beer must be infrequent.

1. Fk−1×Fk−1 Method The candidate generation procedure in the Apriori-gen function merges a
pair of frequent (k−1)-itemsets only if their first k−2 items are identical. Let A = {a1, a2, . . . , ak−1}
and B = {b1, b2, . . . , bk−1} be a pair of frequent (k − 1)-itemsets. A and B are merged if they satisfy
the following conditions:
ai = bi (for i = 1, 2, . . . , k − 2) and ak−1 ≠ bk−1.

In Figure 8, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form a
candidate 3-itemset {Bread, Diapers, Milk}. The algorithm does not have to merge {Beer, Diapers}
with {Diapers, Milk} because the first item in both itemsets is different. Indeed, if {Beer, Diapers,
Milk} is a viable candidate, it would have been obtained by merging {Beer, Diapers} with {Beer,
Milk} instead. This example illustrates both the completeness of the candidate generation
procedure and the advantages of using lexicographic ordering to prevent duplicate candidates.
However, because each candidate is obtained by merging a pair of frequent (k−1)-itemsets, an
additional candidate pruning step is needed to ensure that the remaining k − 2 subsets of the
candidate are frequent.

Figure 8: Generating and pruning candidate k-itemsets by merging pairs of frequent (k−1)-
itemsets
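A compact sketch of this merge-and-prune procedure is shown below. Itemsets are kept as lexicographically sorted tuples, F2 holds the four frequent 2-itemsets from the running example, and the function name candidate_gen is illustrative rather than the textbook's.

```python
# A sketch of F(k-1) x F(k-1) candidate generation with candidate pruning.
from itertools import combinations

def candidate_gen(frequent_prev, k):
    """Merge frequent (k-1)-itemsets whose first k-2 items agree, then prune."""
    prev = sorted(frequent_prev)                      # sorted tuples of length k-1
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:k - 2] == b[:k - 2]:                # identical first k-2 items
                cand = tuple(sorted(set(a) | set(b)))
                # candidate pruning: every (k-1)-subset must itself be frequent
                if all(sub in frequent_prev for sub in combinations(cand, k - 1)):
                    candidates.append(cand)
    return candidates

# The four frequent 2-itemsets from the running example (support threshold of 3):
F2 = {("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")}
print(candidate_gen(F2, 3))   # [('Bread', 'Diapers', 'Milk')]
```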

Computational Complexity
The computational complexity of the Apriori algorithm can be affected by the following factors.

1. Support Threshold Lowering the support threshold often results in more itemsets being
declared as frequent. This has an adverse effect on the computational complexity of the algorithm
because more candidate itemsets must be generated and counted, as shown in Figure 13. The
maximum size of frequent itemsets also tends to increase with lower support thresholds. As the
maximum size of the frequent itemsets increases, the algorithm will need to make more passes
over the data set.
Figure 13: Effect of support threshold on the number of candidate and frequent itemsets.

2. Number of Items (Dimensionality) As the number of items increases, more space will be
needed to store the support counts of items. If the number of frequent items also grows with the
dimensionality of the data, the computation and I/O costs will increase because of the larger
number of candidate itemsets generated by the algorithm.

3. Number of Transactions Since the Apriori algorithm makes repeated passes over the data set,
its run time increases with a larger number of transactions.

4. Average Transaction Width For dense data sets, the average transaction width can be very
large. This affects the complexity of the Apriori algorithm in two ways.
1. First, the maximum size of frequent itemsets tends to increase as the average
transaction width increases. As a result, more candidate itemsets must be examined
during candidate generation and support counting, as illustrated in Figure 14.
2. Second, as the transaction width increases, more itemsets are contained in the
transaction. This will increase the number of hash tree traversals performed during
support counting.

Figure 14: Effect of average transaction width on the number of candidate and frequent
itemsets.

5. Candidate generation To generate candidate k-itemsets, pairs of frequent (k − 1)-itemsets are
merged to determine whether they have at least k − 2 items in common. Each merging operation
requires at most k − 2 equality comparisons. In the best-case scenario, every merging step
produces a viable candidate k-itemset. In the worst-case scenario, the algorithm must merge
every pair of frequent (k−1)-itemsets found in the previous iteration. Therefore, the overall cost of
merging frequent itemsets is

Σk (k − 2)|Ck| < Cost of merging < Σk (k − 2)|Fk−1|^2.

A hash tree is also constructed during candidate generation to store the candidate itemsets.
Because the maximum depth of the tree is k, the cost for populating the hash tree with candidate
itemsets is O(Σk k|Ck|).

During candidate pruning, we need to verify that the k−2 subsets of every candidate k-itemset are
frequent. Since the cost for looking up a candidate in a hash tree is O(k), the candidate pruning
step requires O(Σk k(k − 2)|Ck|) time.

6. Support counting Each transaction of length |t| produces C(|t|, k) itemsets of size k. This is also
the effective number of hash tree traversals performed for each transaction. The cost for support
counting is O(N Σk C(w, k) αk), where w is the maximum transaction width and αk is the cost for
updating the support count of a candidate k-itemset in the hash tree.

3. Rule Generation
“How to extract association rules efficiently from a given frequent itemset?”
Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules, ignoring rules that have
empty antecedents or consequents (∅ → Y or Y → ∅). An association rule can be extracted by
partitioning the itemset Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the
confidence threshold. Note that all such rules must have already met the support threshold
because they are generated from a frequent itemset.

Example 2: Let X = {1, 2, 3} be a frequent itemset. There are six candidate association rules that
can be generated from X: {1, 2} → {3}, {1, 3} → {2}, {2, 3} → {1}, {1} → {2, 3}, {2} → {1, 3}, and
{3} → {1, 2}. As the support of each rule is identical to the support of X, the rules must satisfy the
support threshold.

Computing the confidence of an association rule does not require additional scans of the
transaction data set. Consider the rule {1, 2} → {3}, which is generated from the frequent itemset
X = {1, 2, 3}. The confidence for this rule is σ({1, 2, 3})/σ({1, 2}). Because {1, 2, 3} is frequent, the
anti-monotone property of support ensures that {1, 2} must be frequent, too. Since the support
counts for both itemsets were already found during frequent itemset generation, there is no need
to read the entire data set again.

Confidence-Based Pruning
Unlike the support measure, confidence does not have any monotone property. For example, the
confidence for X → Y can be larger, smaller, or equal to the confidence for another rule X̃ → Ỹ,
where X̃ ⊆ X and Ỹ ⊆ Y. Nevertheless, if we compare rules generated from the same frequent
itemset Y, the following theorem holds for the confidence measure.

Theorem 2: If a rule X → Y − X does not satisfy the confidence threshold, then any rule X' → Y − X',
where X' is a subset of X, must not satisfy the confidence threshold as well.

To prove this theorem, consider the following two rules: X' → Y − X' and X → Y − X, where X' ⊂ X.
The confidences of these rules are σ(Y)/σ(X') and σ(Y)/σ(X), respectively. Since X' is a subset of X,
σ(X') ≥ σ(X). Therefore, the former rule cannot have a higher confidence than the latter rule.
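For example, consider the frequent itemset Y = {Milk, Diapers, Beer} from Example 1. The rule {Milk, Diapers} → {Beer} has confidence σ(Y)/σ({Milk, Diapers}) = 2/3. Shrinking the antecedent to X' = {Milk} gives the rule {Milk} → {Diapers, Beer}, whose confidence is σ(Y)/σ({Milk}). Since σ({Milk}) ≥ σ({Milk, Diapers}), this confidence can be at most 2/3, so if {Milk, Diapers} → {Beer} already fails the confidence threshold, {Milk} → {Diapers, Beer} is guaranteed to fail it as well.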

Rule Generation in Apriori Algorithm


The Apriori algorithm uses a level-wise approach for generating association rules, where each
level corresponds to the number of items that belong to the rule consequent. Initially, all the high-
confidence rules that have only one item in the rule consequent are extracted. These rules are
then used to generate new candidate rules.
For example, if {acd} → {b} and {abd} → {c} are high-confidence rules, then the candidate rule
{ad} → {bc} is generated by merging the consequents of both rules.

Figure 15 shows a lattice structure for the association rules generated from the frequent itemset
{a, b, c, d}. If any node in the lattice has low confidence, then according to Theorem 2, the entire
subgraph spanned by the node can be pruned immediately. Suppose the confidence for
{bcd}  {a} is low. All the rules containing item a in its consequent, including {cd}  {ab},
{bd} {ac}, {bc}  {ad}, and {d} {abc} can be discarded.

Figure 15: Pruning of association rules using the confidence measure.


Pseudocode for the rule generation step is shown in Algorithm 2 and Algorithm 3. Note the
similarity between the ap-genrules procedure given in Algorithm 3 and the frequent itemset
generation procedure given in Algorithm 1. The only difference is that, in rule generation, we do
not have to make additional passes over the data set to compute the confidence of the candidate
rules. Instead, we determine the confidence of each rule by using the support counts computed
during frequent itemset generation.

Algorithm 2: Rule generation of the Apriori algorithm.

Algorithm 3: Procedure ap-genrules (fk, Hm).
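Since the pseudocode of Algorithms 2 and 3 is not reproduced here, the following simplified Python sketch conveys the idea: it enumerates every non-empty proper subset of a frequent itemset as a consequent and computes confidence from the stored support counts, rather than performing the level-wise consequent merging and pruning of ap-genrules. The example support counts follow the transaction assumption used in the earlier sketch.

```python
# A simplified sketch of rule generation from a single frequent itemset.
from itertools import combinations

def generate_rules(freq_itemset, support_counts, minconf):
    """support_counts maps frozensets to sigma(); no extra pass over the data is needed."""
    items = sorted(freq_itemset)
    sigma_y = support_counts[frozenset(items)]
    rules = []
    for m in range(1, len(items)):                          # size of the consequent
        for consequent in combinations(items, m):
            antecedent = frozenset(items) - frozenset(consequent)
            conf = sigma_y / support_counts[antecedent]     # sigma(Y) / sigma(X)
            if conf >= minconf:
                rules.append((set(antecedent), set(consequent), round(conf, 2)))
    return rules

# Support counts found during frequent itemset generation (assumed data, see earlier sketch):
counts = {frozenset(["Milk", "Diapers", "Beer"]): 2,
          frozenset(["Milk", "Diapers"]): 3, frozenset(["Milk", "Beer"]): 2,
          frozenset(["Diapers", "Beer"]): 3, frozenset(["Milk"]): 4,
          frozenset(["Diapers"]): 4, frozenset(["Beer"]): 3}
print(generate_rules({"Milk", "Diapers", "Beer"}, counts, minconf=0.6))
```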

4 Compact Representations of Frequent Itemsets


In practice, the number of frequent itemsets produced from a transaction data set can be very
large. It is useful to identify a small representative set of itemsets from which all other frequent
itemsets can be derived. Two such representations are in the form of maximal and closed
frequent itemsets.

Maximal Frequent Itemsets


Definition 3: (Maximal Frequent Itemset). A maximal frequent itemset is defined as a frequent
itemset for which none of its immediate supersets are frequent.

To illustrate this concept, consider the itemset lattice shown in Figure 16. The itemsets in the
lattice are divided into two groups: those that are frequent and those that are infrequent. A
frequent itemset border, which is represented by a dashed line, is also illustrated in the diagram.

Figure 16: Maximal frequent itemset.

Every itemset located above the border is frequent, while those located below the border (the
shaded nodes) are infrequent. Among the itemsets residing near the border, {a, d}, {a, c, e}, and
{b, c, d, e} are considered to be maximal frequent itemsets because their immediate supersets are
infrequent. An itemset such as {a, d} is maximal frequent because all of its immediate supersets,
{a, b, d}, {a, c, d}, and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one of its
immediate supersets, {a, c, e}, is frequent. Maximal frequent itemsets effectively provide a
compact representation of frequent itemsets. In other words, they form the smallest set of
itemsets from which all frequent itemsets can be derived. For example, the frequent itemsets
shown in Figure 16 can be divided into two groups:
• Frequent itemsets that begin with item a and that may contain items c, d, or e. This group
includes itemsets such as {a}, {a, c}, {a, d}, {a, e}, and {a, c, e}.
• Frequent itemsets that begin with items b, c, d, or e. This group includes itemsets such as
{b}, {b, c}, {c, d}, {b, c, d, e}, etc.

Frequent itemsets that belong in the first group are subsets of either {a, c, e} or {a, d}, while those
that belong in the second group are subsets of {b, c, d, e}. Hence, the maximal frequent itemsets
{a, c, e}, {a, d}, and {b, c, d, e} provide a compact representation of the frequent itemsets shown in
Figure 16. Maximal frequent itemsets provide a valuable representation for data sets that can
produce very long, frequent itemsets, as there are exponentially many frequent itemsets in such
data. Nevertheless, this approach is practical only if an efficient algorithm exists to explicitly find
the maximal frequent itemsets without having to enumerate all their subsets.
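The definition translates directly into a check against the other frequent itemsets. The sketch below is illustrative and uses a partial list of the frequent itemsets mentioned for Figure 16; in a downward-closed collection of frequent itemsets, having no frequent proper superset is equivalent to having no frequent immediate superset.

```python
# A sketch of identifying maximal frequent itemsets: a frequent itemset is
# maximal if no other frequent itemset is a proper superset of it.
def maximal_frequent(frequent_itemsets):
    maximal = []
    for x in frequent_itemsets:
        if not any(x < y for y in frequent_itemsets):   # no frequent proper superset
            maximal.append(x)
    return maximal

# Illustrative subset of the frequent itemsets discussed for Figure 16:
frequent = [frozenset(s) for s in (
    {"a"}, {"b"}, {"c"}, {"d"}, {"e"},
    {"a", "c"}, {"a", "d"}, {"a", "e"}, {"a", "c", "e"},
    {"b", "c"}, {"c", "d"}, {"b", "c", "d", "e"},
)]
print(maximal_frequent(frequent))   # {a, d}, {a, c, e}, and {b, c, d, e}
```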

Despite providing a compact representation, maximal frequent itemsets do not contain the
support information of their subsets. For example, the support of the maximal frequent itemsets
{a, c, e}, {a, d}, and {b,c,d,e} do not provide any hint about the support of their subsets. An
additional pass over the data set is therefore needed to determine the support counts of the non-
maximal frequent itemsets. In some cases, it might be desirable to have a minimal representation
of frequent itemsets that preserves the support information.

Closed Frequent Itemsets


Closed itemsets provide a minimal representation of itemsets without losing their support
information. A formal definition of a closed itemset is as follows.

Definition 4 (Closed Itemset): An itemset X is closed if none of its immediate supersets has exactly
the same support count as X. Put another way, X is not closed if at least one of its immediate
supersets has the same support count as X. Examples of closed itemsets are shown in Figure 17.

To better illustrate the support count of each itemset, we have associated each node (itemset) in
the lattice with a list of its corresponding transaction IDs. For example, since the node {b, c} is
associated with transaction IDs 1, 2, and 3, its support count is equal to three. From the
transactions given in this diagram, notice that every transaction that contains b also contains c.
Consequently, the support for {b} is identical to the support for {b, c}, and {b} should not be considered a closed
itemset. Similarly, since c occurs in every transaction that contains both a and d, the itemset {a, d}
is not closed. On the other hand, {b, c} is a closed itemset because it does not have the same
support count as any of its supersets.

Figure 17: An example of the closed frequent itemsets
(with minimum support equal to 40%).

Definition 5 (Closed Frequent Itemset): An itemset is a closed frequent itemset if it is closed and
its support is greater than or equal to minsup. In the previous example, assuming that the support
threshold is 40%, {b,c} is a closed frequent itemset because its support is 60%. The rest of the
closed frequent itemsets are indicated by the shaded nodes.

Algorithms are available to explicitly extract closed frequent itemsets from a given data set.
We can use the closed frequent itemsets to determine the
support counts for the non-closed frequent itemsets. For example, consider the frequent itemset
{a, d} shown in Figure 17. Because the itemset is not closed, its support count must be identical to
one of its immediate supersets. The key is to determine which superset (among {a, b, d}, {a, c, d},
or {a, d, e}) has exactly the same support count as {a, d}. The Apriori principle states that any
transaction that contains the superset of {a, d} must also contain {a, d}. However, any transaction
that contains {a, d} does not have to contain the supersets of {a, d}. For this reason, the support
for {a, d} must be equal to the largest support among its supersets. Since {a, c, d} has a larger
support than both {a, b, d} and {a, d, e}, the support for {a, d} must be identical to the support for
{a, c, d}. Using this methodology, an algorithm can be developed to compute the support for the
non-closed frequent itemsets. The pseudocode for this algorithm is shown in Algorithm 4. The
algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest frequent
itemsets. This is because, in order to find the support for a non-closed frequent itemset, the
support for all of its supersets must be known.

Algorithm 4: Support counting using closed frequent itemsets.
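Because the pseudocode of Algorithm 4 is not reproduced here, the following sketch captures its specific-to-general idea: the supports of the closed frequent itemsets are given, and every non-closed frequent itemset inherits the largest support found among its immediate frequent supersets. The function name and data layout are assumptions for illustration.

```python
# A sketch of support counting using closed frequent itemsets (Algorithm 4's idea).
def supports_from_closed(closed_supports, all_frequent):
    """closed_supports: {frozenset: count} for closed frequent itemsets.
    all_frequent: every frequent itemset (closed or not), as frozensets."""
    support = dict(closed_supports)
    # process itemsets from the largest to the smallest (specific to general)
    for x in sorted(all_frequent, key=len, reverse=True):
        if x in support:
            continue                                   # closed: support already known
        # immediate frequent supersets have already been processed
        supersets = [s for s in support if x < s and len(s) == len(x) + 1]
        support[x] = max(support[s] for s in supersets)
    return support
```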

To illustrate the advantage of using closed frequent itemsets, consider the data set shown in
Table 5, which contains ten transactions and fifteen items. The items can be divided into three
groups: (1) Group A, which contains items a1 through a5; (2) Group B, which contains items
b1 through b5; and (3) Group C, which contains items c1 through c5.

Note that items within each group are perfectly associated with each other and they do not
appear with items from another group. Assuming the support threshold is 20%, the total number
of frequent itemsets is 3 × (2^5 − 1) = 93. However, there are only three closed frequent itemsets in
the data: ({a1, a2, a3, a4, a5}, {b1, b2, b3, b4, b5}, and {c1, c2, c3, c4, c5}). It is often sufficient to
present only the closed frequent itemsets to the analysts instead of the entire set of frequent
itemsets.

Table.5: A transaction data set for mining closed itemsets

Closed frequent itemsets are useful for removing some of the redundant association rules. An
association rule X → Y is redundant if there exists another rule X' → Y', where X is a subset of X'
and Y is a subset of Y’, such that the support and confidence for both rules are identical. In the
example shown in Figure 17, {b} is not a closed frequent itemset while {b, c} is closed. The
association rule {b} → {d, e} is therefore redundant because it has the same support and
confidence as {b, c} → {d, e}. Such redundant rules are not generated if closed frequent itemsets
are used for rule generation.

Finally, note that all maximal frequent itemsets are closed because none of the maximal frequent
itemsets can have the same support count as their immediate supersets. The relationships among
frequent, maximal frequent, and closed frequent itemsets are shown in Figure 18.

Figure 18: Relationships among frequent, maximal frequent, and closed frequent itemsets.

5 FP-Growth Algorithm
FP-growth is an alternative algorithm that takes a radically different approach to discovering
frequent itemsets. The algorithm does not subscribe to the generate-and-test paradigm of Apriori.
Instead, it encodes the data set using a compact data structure called an FP-tree and extracts
frequent itemsets directly from this structure.

FP-Tree Representation
An FP-tree is a compressed representation of the input data. It is constructed by reading the data
set one transaction at a time and mapping each transaction onto a path in the FP-tree. As
different transactions can have several items in common, their paths may overlap. The more the
paths overlap with one another, the more compression we can achieve using the FP-tree
structure. If the size of the FP-tree is small enough to fit into main memory, this will allow us to
extract frequent itemsets directly from the structure in memory instead of making repeated
passes over the data stored on disk.

Figure 19 shows a data set that contains ten transactions and five items. The structures of the FP-
tree after reading the first three transactions are also depicted in the diagram. Each node in the
tree contains the label of an item along with a counter that shows the number of transactions
mapped onto the given path. Initially, the FP-tree contains only the root node represented by the
null symbol. The FP-tree is subsequently extended in the following way:

Figure 19: Construction of an FP-tree.

1. The data set is scanned once to determine the support count of each item. Infrequent items are
discarded, while the frequent items are sorted in decreasing support counts. For the data set
shown in Figure 19, a is the most frequent item, followed by b, c, d, and e.

2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first
transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a
→ b to encode the transaction. Every node along the path has a frequency count of 1.

3. After reading the second transaction, {b,c,d}, a new set of nodes is created for items b, c, and d.
A path is then formed to represent the transaction by connecting the nodes null → b → c → d.
Every node along this path also has a frequency count equal to one. Although the first two
transactions have an item in common, which is b, their paths are disjoint because the transactions
do not share a common prefix.

4. The third transaction, {a,c,d,e}, shares a common prefix item (which is a) with the first
transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps with
the path for the first transaction, null → a → b. Because of their overlapping path, the frequency
count for node a is incremented to two, while the frequency counts for the newly created nodes,
c, d, and e, are equal to one.

5. This process continues until every transaction has been mapped onto one of the paths given in
the FP-tree. The resulting FP-tree after reading all the transactions is shown at the bottom of
Figure 19.
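The two passes described above can be sketched as follows. This is a minimal, illustrative implementation: the class and function names are assumptions, ties between items with equal support are broken arbitrarily, and a header table plays the role of the node-link pointers (the dashed lines in Figure 19).

```python
# A compact sketch of FP-tree construction following the two passes above.
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                     # item -> child FPNode

def build_fptree(transactions, minsup_count):
    # pass 1: support count of each item; keep frequent items, highest support first
    item_counts = defaultdict(int)
    for t in transactions:
        for item in t:
            item_counts[item] += 1
    order = {item: rank for rank, (item, n) in enumerate(
        sorted(item_counts.items(), key=lambda kv: -kv[1])) if n >= minsup_count}

    root = FPNode(None, None)                  # the null root node
    header = defaultdict(list)                 # item -> nodes carrying that item (node links)
    # pass 2: map each transaction onto a path of frequent items
    for t in transactions:
        path = sorted((i for i in t if i in order), key=order.get)
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:                  # new node: extend the tree
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1                   # shared prefixes increment the counter
            node = child
    return root, header

# usage: root, header = build_fptree(transactions, minsup_count=2)
```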

The size of an FP-tree is typically smaller than the size of the uncompressed data because many
transactions in market basket data often share a few items in common. In the best-case scenario,
where all the transactions have the same set of items, the FP-tree contains only a single branch of
nodes. The worst-case scenario happens when every transaction has a unique set of items.

As none of the transactions have any items in common, the size of the FP-tree is effectively the
same as the size of the original data. However, the physical storage requirement for the FP-tree is
higher because it requires additional space to store pointers between nodes and counters for each
item.

The size of an FP-tree also depends on how the items are ordered. If the ordering scheme in the
preceding example is reversed, i.e., from lowest to highest support item, the resulting FP-tree is
shown in Figure 20.

Figure 20: An FP-tree representation for the data set shown in Figure 19 with a different item
ordering scheme.

The tree appears to be denser because the branching factor at the root node has increased from 2
to 5 and the number of nodes containing the high support items such as a and b has increased
from 3 to 12. Nevertheless, ordering by decreasing support counts does not always lead to the
smallest tree. For example, suppose we augment the data set given in Figure 19 with 100
transactions that contain {e}, 80 transactions that contain {d}, 60 transactions that contain {c}, and
40 transactions that contain {b}. Item e is now most frequent, followed by d, c, b, and a. With the
augmented transactions, ordering by decreasing support counts will result in an FP-tree similar to
Figure 20, while a scheme based on increasing support counts produces a smaller FP-tree similar
to Figure 19(iv).

An FP-tree also contains a list of pointers connecting nodes that have the same items.
These pointers, represented as dashed lines in Figure 19 and Figure 20, help to facilitate the rapid
access of individual items in the tree.

Frequent Itemset Generation in FP-Growth Algorithm


FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree
in a bottom-up fashion. Given the example tree shown in Figure 19, the algorithm looks for
frequent itemsets ending in e first, followed by d, c, b, and finally, a. This bottom-up strategy for
finding frequent itemsets ending with a particular item is equivalent to the suffix-based approach.
Since every transaction is mapped onto a path in the FP-tree, we can derive the frequent itemsets
ending with a particular item, say, e, by examining only the paths containing node e. These paths
can be accessed rapidly using the pointers associated with node e. The extracted paths are shown
in Figure 21(a).

Figure 21: Decomposing the frequent itemset generation problem into multiple subproblems,
where each subproblem involves finding frequent itemsets ending in e, d, c, b, and a.

After finding the frequent itemsets ending in e, the algorithm proceeds to look for frequent
itemsets ending in d by processing the paths associated with node d. The corresponding paths are
shown in Figure 21(b). This process continues until all the paths associated with nodes c, b, and
finally a, are processed. The paths for these items are shown in Figures 21(c), (d), and (e), while
their corresponding frequent itemsets are summarized in Table 6.

Table 6: The list of frequent itemsets ordered by their corresponding suffixes.

FP-growth finds all the frequent itemsets ending with a particular suffix by employing a divide-
and-conquer strategy to split the problem into smaller subproblems. For example, suppose we are
interested in finding all frequent itemsets ending in e. To do this, we must first check whether the
itemset {e} itself is frequent. If it is frequent, we consider the subproblem of finding frequent
itemsets ending in de, followed by ce, be, and ae. In turn, each of these subproblems is further
decomposed into smaller subproblems. By merging the solutions obtained from the subproblems,
all the frequent itemsets ending in e can be found. This divide-and-conquer approach is the key
strategy employed by the FP-growth algorithm.

For a more concrete example on how to solve the subproblems, consider the task of finding
frequent itemsets ending with e.

Figure 22: Example of applying the FP-growth algorithm to find frequent itemsets ending in e.

1. The first step is to gather all the paths containing node e. These initial paths are called prefix
paths and are shown in Figure 22(a).

2. From the prefix paths shown in Figure 22(a), the support count for e is obtained by adding the
support counts associated with node e. Assuming that the minimum support count is 2, {e} is
declared a frequent itemset because its support count is 3.

3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent itemsets
ending in de, ce, be, and ae. Before solving these subproblems, it must first convert the prefix
paths into a conditional FP-tree, which is structurally similar to an FP-tree, except it is used to find
frequent itemsets ending with a particular suffix. A conditional FP-tree is obtained in the following
way:
(a) First, the support counts along the prefix paths must be updated because some of the counts
include transactions that do not contain item e. For example, the rightmost path shown in Figure
22(a), null −→ b:2 −→ c:2 −→ e:1, includes a transaction {b, c} that does not contain item e. The
counts along the prefix path must therefore be adjusted to 1 to reflect the actual number of
transactions containing {b, c, e}.
(b) The prefix paths are truncated by removing the nodes for e. These nodes can be removed
because the support counts along the prefix paths have been updated to reflect only transactions
that contain e and the subproblems of finding frequent itemsets ending in de, ce, be, and ae no
longer need information about node e.
(c) After updating the support counts along the prefix paths, some of the items may no longer be
frequent. For example, the node b appears only once and has a support count equal to 1, which
means that there is only one transaction that contains both b and e. Item b can be safely ignored
in subsequent analysis because all itemsets ending in be must be infrequent.

The conditional FP-tree for e is shown in Figure 22(b). The tree looks different from the original
prefix paths because the frequency counts have been updated and the nodes b and e have been
eliminated.

4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding frequent
itemsets ending in de, ce, and ae. To find the frequent itemsets ending in de, the prefix paths for d
are gathered from the conditional FP-tree for e (Figure 22(c)). By adding the frequency counts
associated with node d, we obtain the support count for {d, e}. Since the support count is equal to
2, {d, e} is declared a frequent itemset.

Next, the algorithm constructs the conditional FP-tree for de using the approach described in step
3. After updating the support counts and removing the infrequent item c, the conditional FP-tree
for de is shown in Figure 22(d). Since the conditional FP-tree contains only one item, a, whose
support is equal to minsup, the algorithm extracts the frequent itemset {a, d, e} and moves on to
the next subproblem, which is to generate frequent itemsets ending in ce. After processing the
prefix paths for c, only {c, e} is found to be frequent. The algorithm then proceeds to solve the last
subproblem and finds {a, e} to be the only remaining frequent itemset.

This example illustrates the divide-and-conquer approach used in the FP-growth algorithm. At each
recursive step, a conditional FP-tree is constructed by updating the frequency counts along the
prefix paths and removing all infrequent items. Because the subproblems are disjoint, FP-growth
will not generate any duplicate itemsets. In addition, the counts associated with the nodes allow
the algorithm to perform support counting while generating the common suffix itemsets.
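The suffix-based divide and conquer can also be sketched without materializing conditional FP-trees, by carrying conditional pattern bases as weighted itemsets. The version below is therefore a simplification of FP-growth rather than the tree-based implementation: items are ranked once by decreasing support, each recursion handles itemsets ending in one particular item, and the conditional base keeps only items ranked above that item, which keeps the subproblems disjoint.

```python
# A simplified, tree-free sketch of FP-growth's suffix-based decomposition.
from collections import defaultdict

def fpgrowth(transactions, minsup_count):
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    # global ordering: frequent items by decreasing support (rank 0 = most frequent)
    rank = {item: r for r, (item, n) in enumerate(
        sorted(counts.items(), key=lambda kv: -kv[1])) if n >= minsup_count}
    weighted = [(frozenset(i for i in t if i in rank), 1) for t in transactions]
    results = {}
    _mine(weighted, minsup_count, frozenset(), rank, results)
    return results

def _mine(weighted, minsup_count, suffix, rank, results):
    item_counts = defaultdict(int)
    for itemset, count in weighted:
        for item in itemset:
            item_counts[item] += count
    # process suffix items from least to most frequent (bottom-up over the tree)
    for item in sorted(item_counts, key=lambda i: -rank[i]):
        if item_counts[item] < minsup_count:
            continue
        new_suffix = suffix | {item}
        results[new_suffix] = item_counts[item]          # a frequent itemset and its support
        # conditional pattern base: prefix-path items ranked above the current item
        conditional = [(frozenset(j for j in itemset if rank[j] < rank[item]), count)
                       for itemset, count in weighted if item in itemset]
        _mine(conditional, minsup_count, new_suffix, rank, results)

# e.g. the first three transactions from Figure 19 with a minimum support count of 2:
print(fpgrowth([{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}], 2))
```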

FP-growth is an interesting algorithm because it illustrates how a compact representation of the
transaction data set helps to efficiently generate frequent itemsets. In addition, for certain
transaction data sets, FP-growth outperforms the standard Apriori algorithm by several orders of
magnitude. The run-time performance of FP-growth depends on the compaction factor of the
data set. If the resulting conditional FP-trees are very bushy (in the worst case, a full prefix tree),
then the performance of the algorithm degrades significantly because it has to generate a large
number of subproblems and merge the results returned by each subproblem.
