Data Analytics - Unit 4
Frequent Itemsets and Clustering

CONTENTS
Part-1 : Mining Frequent Itemsets, Market Based Modelling, Apriori Algorithm
Part-2 : Handling Large Data Sets in Main Memory, Limited Pass Algorithm, Counting Frequent Itemsets in a Stream
Part-3 : Clustering Techniques : Hierarchical, k-means
Part-4 : Clustering High Dimensional Data, CLIQUE and ProCLUS, Frequent Pattern Based Clustering Methods
Part-5 : Clustering in Non-Euclidean Space, Clustering for Streams and Parallelism

PART-1
Mining Frequent Itemsets, Market Based Modelling, Apriori Algorithm.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.1. Write short notes on frequent patterns in data mining.

Answer
1. Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear frequently in a dataset.
2. A substructure can refer to different structural forms, such as sub-graphs, sub-trees, or sub-lattices, which may be combined with itemsets or subsequences.
3. If a substructure occurs frequently, it is called a (frequent) structured pattern.
4. Finding frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data.
5. It also helps in data classification, clustering, and other data mining tasks.
6. Frequent pattern mining searches for recurring relationships in a given dataset.
7. For example, a set of items, such as milk and bread, that appears frequently together in a grocery transaction dataset is a frequent itemset.
8. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, is a (frequent) sequential pattern if it occurs frequently in a shopping-history database.

Que 4.2. Explain frequent itemset mining.

Answer
1. Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional or relational datasets.
2. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining such patterns from their databases.
3. The discovery of interesting correlation relationships among huge amounts of business transaction records can help in many business decision-making processes, such as catalogue design, cross-marketing, and customer shopping behaviour analysis.
4. A typical example of frequent itemset mining is market basket analysis.
5. This process analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets".
6. The discovery of these associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers.
7. For instance, if customers are buying some products at the same time, how likely are they to also buy other products? This information can lead to increased sales by helping retailers do selective marketing.

Que 4.3. Write short notes on market based modelling.

Answer
1. The market-basket model of data is used to describe a common form of many-to-many relationship between two kinds of objects.
2. On the one hand we have items, and on the other we have baskets, sometimes called "transactions".
3. Each basket consists of a set of items (an itemset), and usually we assume that the number of items in a basket is small, much smaller than the total number of items.
4. The number of baskets is usually assumed to be very large, bigger than what can fit in main memory.
5. The data is assumed to be represented in a file consisting of a sequence of baskets.

Que 4.4. Write short notes on algorithm for finding frequent itemsets.

Answer
1. The Apriori algorithm takes a bottom-up iterative approach to find the frequent itemsets by first determining all the possible items and then identifying which among them are frequent.
2. Let variable C_k be the set of candidate k-itemsets and variable L_k be the set of k-itemsets that satisfy the minimum support.
3. Given a transaction database D, a minimum support threshold δ, and an optional parameter N indicating the maximum length an itemset could reach, Apriori iteratively computes frequent itemsets L_(k+1) based on L_k.
Apriori algorithm :
Apriori (D, δ, N) :
1.  k ← 1
2.  L_k ← {1-itemsets that satisfy minimum support δ}
3.  while L_k ≠ ∅
4.    if ∄N ∨ (∃N ∧ k < N)
5.      C_(k+1) ← candidate itemsets generated from L_k
6.      for each transaction t in database D do
7.        increment the counts of candidates in C_(k+1) contained in t
8.      L_(k+1) ← candidates in C_(k+1) that satisfy minimum support δ
9.      k ← k + 1
10. return ∪_k L_k
4. At each iteration, the algorithm checks whether the support criterion can be met; if it can, the algorithm grows the itemsets, repeating the process until it runs out of support or until the itemsets reach a predefined length.
5. The first step of the Apriori algorithm is to identify the frequent items by starting with each item in the transactions that meets the predefined minimum support threshold δ.
6. These itemsets are 1-itemsets, denoted L_1, as each 1-itemset contains only one item. Next, the algorithm grows the itemsets by joining L_1 onto itself to form new, grown 2-itemsets, denoted L_2, and determines the support of each 2-itemset in L_2. Those itemsets that do not meet the minimum support threshold δ are pruned away.
7. The growing and pruning process is repeated until no itemsets meet the minimum support threshold.
8. A threshold N can be set up to specify the maximum number of items an itemset can reach or the maximum number of iterations of the algorithm. Once completed, the output of the Apriori algorithm is the collection of all the frequent k-itemsets.
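The pseudocode above can be written as a short program. The sketch below is illustrative only, not the textbook's code: it assumes transactions are given as Python sets of items and that min_sup is an absolute support count.

    from itertools import combinations

    def apriori(transactions, min_sup, max_len=None):
        """Return all frequent itemsets (as frozensets) with their support counts."""
        # L1: frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        L = {s: c for s, c in counts.items() if c >= min_sup}
        frequent = dict(L)
        k = 1
        while L and (max_len is None or k < max_len):
            # Join step: merge frequent k-itemsets that differ in exactly one item
            prev = list(L)
            candidates = set()
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k + 1:
                        # Prune step: every k-subset must itself be frequent
                        if all(frozenset(sub) in L for sub in combinations(union, k)):
                            candidates.add(union)
            # Count candidate supports with one scan of the database
            counts = dict.fromkeys(candidates, 0)
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            L = {s: c for s, c in counts.items() if c >= min_sup}
            frequent.update(L)
            k += 1
        return frequent

    # Example use on a small illustrative database with min_sup = 2
    db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
          {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"}]
    print(apriori(db, min_sup=2))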
Que 4.5. How is the Apriori property used in the algorithm ?

Answer
A two-step process is followed, consisting of join and prune actions.
1. The join step :
   a. To find L_k, a set of candidate k-itemsets is generated by joining L_(k-1) with itself. This set of candidates is denoted C_k.
   b. Let l_1 and l_2 be itemsets in L_(k-1). The notation l_i[j] refers to the jth item in l_i (e.g., l_1[k-2] refers to the second to the last item in l_1).
   c. For efficient implementation, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. For the (k-1)-itemset l_i, this means that the items are sorted such that l_i[1] < l_i[2] < ... < l_i[k-1].
   d. The join, L_(k-1) ⋈ L_(k-1), is performed, where members l_1 and l_2 of L_(k-1) are joinable if their first (k-2) items are in common. That is, l_1 and l_2 are joined if (l_1[1] = l_2[1]) ∧ (l_1[2] = l_2[2]) ∧ ... ∧ (l_1[k-2] = l_2[k-2]) ∧ (l_1[k-1] < l_2[k-1]). The condition l_1[k-1] < l_2[k-1] simply ensures that no duplicates are generated. The resulting itemset formed by joining l_1 and l_2 is {l_1[1], l_1[2], ..., l_1[k-2], l_1[k-1], l_2[k-1]}.
2. The prune step :
   a. C_k is a superset of L_k; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in C_k.
   b. A database scan to determine the count of each candidate in C_k would result in the determination of L_k (i.e., all candidates having a count no less than the minimum support count are frequent by definition and therefore belong to L_k). C_k, however, can be huge, and so this could involve heavy computation.
   c. To reduce the size of C_k, the Apriori property is used as follows: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in L_(k-1), then the candidate cannot be frequent either, and so can be removed from C_k.

Que 4.6. Write short note on generating association rules from frequent itemsets.

Answer
1. Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence).
2. This can be done using equation (4.6.1) for confidence, which is shown here for completeness :
   confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)    ...(4.6.1)
3. The conditional probability is expressed in terms of itemset support count, where support_count(A ∪ B) is the number of transactions containing the itemsets A ∪ B, and support_count(A) is the number of transactions containing the itemset A.
4. Based on equation (4.6.1), association rules can be generated as follows :
   a. For each frequent itemset l, generate all non-empty subsets of l.
   b. For every non-empty subset s of l, output the rule s ⇒ (l - s) if (support_count(l) / support_count(s)) ≥ min_conf, where min_conf is the minimum confidence threshold.
5. Because the rules are generated from frequent itemsets, each one automatically satisfies the minimum support.
6. Frequent itemsets can be stored ahead of time in hash tables along with their counts so that they can be accessed quickly.
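A compact sketch of the rule-generation step in point 4 is given below. It is illustrative only: it assumes the dictionary returned by an Apriori-style routine (itemset to support count), as in the earlier sketch, and the names gen_rules and min_conf are not from the text.

    from itertools import combinations

    def gen_rules(frequent, min_conf):
        """frequent: dict mapping frozenset itemsets to support counts."""
        rules = []
        for itemset, sup_l in frequent.items():
            if len(itemset) < 2:
                continue
            # every non-empty proper subset s gives a candidate rule s => (l - s)
            for r in range(1, len(itemset)):
                for sub in combinations(itemset, r):
                    s = frozenset(sub)
                    conf = sup_l / frequent[s]   # support_count(l) / support_count(s)
                    if conf >= min_conf:
                        rules.append((set(s), set(itemset - s), conf))
        return rules

    # e.g. gen_rules(apriori(db, min_sup=2), min_conf=0.7)

Because every subset of a frequent itemset is itself frequent (the Apriori property), the lookup frequent[s] is always available.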
Que 4.7. How can we improve the efficiency of Apriori-based mining ?

Answer
Many variations of the Apriori algorithm have been proposed that focus on improving the efficiency of the original algorithm. Several of these variations are as follows :
1. Hash-based technique (hashing itemsets into corresponding buckets) :
   a. A hash-based technique can be used to reduce the size of the candidate k-itemsets, C_k, for k > 1.
   b. For example, when scanning each transaction in the database to generate the frequent 1-itemsets, L_1, we can generate all the 2-itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash table structure, and increase the corresponding bucket counts (Fig. 4.7.1).
   Table 4.7.1 : Transactional data for an AllElectronics branch
   TID   | List of item IDs
   T100  | I1, I2, I5
   T200  | I2, I4
   T300  | I2, I3
   T400  | I1, I2, I4
   T500  | I1, I3
   T600  | I2, I3
   T700  | I1, I3
   T800  | I1, I2, I3, I5
   T900  | I1, I2, I3
   c. Fig. 4.7.1 shows a hash table for the candidate 2-itemsets generated by scanning the transactions of Table 4.7.1 while determining L_1. If the minimum support count is 3, then the itemsets in buckets 0, 1, 2, and 4 cannot be frequent, so they should not be included in C_2.
   d. A 2-itemset whose corresponding bucket count in the hash table is below the support threshold cannot be frequent and should be removed from the candidate set.
   e. Such a hash-based technique may substantially reduce the number of candidate k-itemsets examined (especially when k = 2).
2. Transaction reduction (reducing the number of transactions scanned in future iterations) :
   a. A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets.
   b. Therefore, such a transaction can be marked or removed from further consideration, because subsequent database scans for j-itemsets, where j > k, will not need to consider it.
3. Partitioning (partitioning the data to find candidate itemsets) :
   a. A partitioning technique can be used that requires just two database scans to mine the frequent itemsets, as shown in Fig. 4.7.2.
   b. It consists of two phases :
   Phase I :
      i. In phase I, the algorithm divides the transactions of D into n non-overlapping partitions. If the minimum relative support threshold for transactions in D is min_sup, then the minimum support count for a partition is min_sup × the number of transactions in that partition.
      ii. For each partition, all the local frequent itemsets are found.
      iii. A local frequent itemset may or may not be frequent with respect to the entire database, D. However, any itemset that is potentially frequent with respect to D must occur as a frequent itemset in at least one of the partitions.
      iv. Therefore, all local frequent itemsets are candidate itemsets with respect to D. The collection of frequent itemsets from all partitions forms the global candidate itemsets with respect to D.
   Phase II :
      i. In phase II, a second scan of D is conducted in which the actual support of each candidate is assessed to determine the global frequent itemsets.
      ii. Partition size and the number of partitions are set so that each partition can fit into main memory and therefore be read only once in each phase.
4. Sampling (mining on a subset of the given data) :
   a. The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.
   b. In this way, we trade off some degree of accuracy against efficiency.
   c. The sample size of S is such that the search for frequent itemsets in S can be done in main memory, and so only one scan of the transactions in S is required overall.
   d. In this technique, it is possible that we will miss some of the global frequent itemsets.
5. Dynamic itemset counting (adding candidate itemsets at different points during a scan) :
   a. A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points.
   b. In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan.
Fig. 4.7.2 : Mining by partitioning the data.
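The hash-based pruning in technique 1 can be sketched as follows. This is an illustrative outline only: the number of buckets and the simple hash function are assumptions, not values from the text. While counting single items on the first scan, every 2-itemset of each transaction is hashed into a small table of bucket counts; a pair is kept as a candidate only if both items are frequent and its bucket reached the support threshold.

    from itertools import combinations

    def candidate_pairs_hashed(transactions, min_sup, n_buckets=997):
        """transactions: list of sets of items; returns candidate 2-itemsets."""
        bucket = [0] * n_buckets
        item_count = {}
        # First scan: count items, and hash every pair of each basket into a bucket.
        for t in transactions:
            for item in t:
                item_count[item] = item_count.get(item, 0) + 1
            for pair in combinations(sorted(t), 2):
                bucket[hash(pair) % n_buckets] += 1
        freq_items = {i for i, c in item_count.items() if c >= min_sup}
        # A pair is a candidate only if both items are frequent and its bucket
        # reached the threshold; pairs in light buckets cannot be frequent.
        candidates = set()
        for t in transactions:
            for pair in combinations(sorted(t & freq_items), 2):
                if bucket[hash(pair) % n_buckets] >= min_sup:
                    candidates.add(pair)
        return candidates

The PCY algorithm of Que 4.10 builds on the same bucket-counting idea.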
Que 4.8. What are the applications of frequent itemset analysis ?

Answer
Applications of frequent itemset analysis :
1. Related concepts :
   a. Let items be words, and let baskets be documents (e.g., Web pages, blogs, tweets). A basket/document contains those items/words that are present in the document.
   b. If we look for sets of words that appear together in many documents, the sets will be dominated by the most common words (stop words).
   c. However, if we ignore all the most common words, then we would hope to find among the frequent pairs some pairs of words that represent a joint concept.
2. Plagiarism :
   a. Let the items be documents and the baskets be sentences. An item/document is "in" a basket/sentence if the sentence is in the document.
   b. This arrangement appears backwards, but it is exactly what we need; the relationship between items and baskets is an arbitrary many-to-many relationship.
   c. In this application, we look for pairs of items that appear together in several baskets.
   d. If we find such a pair, then we have two documents that share several sentences in common, which often indicates plagiarism.
3. Biomarkers :
   a. Let the items be of two types : biomarkers such as genes or blood proteins, and diseases.
   b. Each basket is the set of data about a patient : their genome and blood-chemistry analysis, as well as their medical history of diseases.
   c. A frequent itemset that consists of one disease and one or more biomarkers suggests a test for the disease.

Que 4.9. What are the different methods for storing itemset counts in main memory ?

Answer
Different methods for storing itemset counts in main memory :
1. The triangular-matrix method :
   a. Even after coding items as integers, we still have the problem that we must count a pair {i, j} in only one place.
   b. For example, we could order the pair so that i < j, and only use the entry a[i, j] in a two-dimensional array a. That strategy would make half the array useless.
   c. A more space-efficient way is to use a one-dimensional triangular array. We store in a[k] the count for the pair {i, j}, with 1 ≤ i < j ≤ n, where k = (i - 1)(n - i/2) + j - i.
   d. The result of this layout is that the pairs are stored in lexicographic order, that is, first {1,2}, {1,3}, ..., {1,n}, then {2,3}, {2,4}, ..., {2,n}, and so on, down to {n-2, n-1}, {n-2, n}, and finally {n-1, n}.
2. The triples method :
   a. This approach to storing counts may be more appropriate, depending on the fraction of the possible pairs of items that actually appear in some basket.
   b. We can store counts as triples [i, j, c], meaning that the count of pair {i, j}, with i < j, is c. A data structure, such as a hash table with i and j as the search key, is used so we can tell if there is a triple for a given i and j and, if so, to find it quickly.
   c. We call this approach the triples method of storing counts.
   d. The triples method does not require us to store anything if the count for a pair is 0.
   e. On the other hand, the triples method requires us to store three integers, rather than one, for every pair that does appear in some basket.
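A worked illustration of the triangular-matrix index is given below. It is a sketch under the assumption that item codes run from 1 to n.

    def pair_index(i, j, n):
        """Index of pair {i, j} (1 <= i < j <= n) in the 1-D triangular array,
        using k = (i - 1)(n - i/2) + j - i from the triangular-matrix method."""
        assert 1 <= i < j <= n
        return int((i - 1) * (n - i / 2) + j - i)

    n = 5
    counts = [0] * (n * (n - 1) // 2 + 1)   # 1-based slots for the n(n-1)/2 pairs
    counts[pair_index(2, 4, n)] += 1        # count an occurrence of the pair {2, 4}
    # pairs are laid out lexicographically: {1,2},{1,3},{1,4},{1,5},{2,3},{2,4},...
    print(pair_index(1, 2, n), pair_index(2, 4, n), pair_index(4, 5, n))  # 1 6 10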
PART-2
Handling Large Data Sets in Main Memory, Limited Pass Algorithm, Counting Frequent Itemsets in a Stream.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.10. Explain PCY algorithm for handling large dataset in main memory.

Answer
1. In the first pass of Apriori, there may be much unused space in main memory.
2. The PCY algorithm uses that unused space for an array of integers that generalizes the idea of a Bloom filter. The idea is shown schematically in Fig. 4.10.1.
   Fig. 4.10.1 : Organization of main memory in the two passes of PCY (Pass 1 : item names to integers, item counts, hash table for bucket counts; Pass 2 : item names to integers, frequent-item counts, bitmap of frequent buckets, data structure for counts of candidate pairs).
3. The array is considered as a hash table, whose buckets hold integers rather than sets of keys or bits. Pairs of items are hashed to buckets of this hash table. As we examine a basket during the first pass, we not only add 1 to the count for each item in the basket, but we also generate all the pairs, using a double loop.
4. We hash each pair, and we add 1 to the bucket into which that pair hashes.
5. At the end of the first pass, each bucket has a count, which is the sum of the counts of all the pairs that hash to that bucket.
6. If the count of a bucket is at least as great as the support threshold s, it is called a frequent bucket. We can say nothing about the pairs that hash to a frequent bucket; they could all be frequent pairs from the information available to us.
7. But if the count of the bucket is less than s (an infrequent bucket), we know no pair that hashes to this bucket can be frequent, even if the pair consists of two frequent items.
8. We can therefore define the set of candidate pairs C_2 to be those pairs {i, j} such that :
   a. i and j are frequent items.
   b. {i, j} hashes to a frequent bucket.
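A two-pass sketch along these lines is given below. It is illustrative only (the bucket count and the hash function are assumptions); between the passes the bucket counts are compressed into a per-bucket flag, which a real implementation would pack into a bitmap.

    from itertools import combinations

    def pcy(transactions, s, n_buckets=1_000_003):
        # Pass 1: count items and hash every pair of each basket to a bucket.
        item_count, bucket = {}, [0] * n_buckets
        for t in transactions:
            for i in t:
                item_count[i] = item_count.get(i, 0) + 1
            for p in combinations(sorted(t), 2):
                bucket[hash(p) % n_buckets] += 1
        freq_items = {i for i, c in item_count.items() if c >= s}
        # Between passes: replace bucket counts by a frequent-bucket flag (bitmap).
        frequent_bucket = [c >= s for c in bucket]
        # Pass 2: count only candidate pairs (both items frequent, frequent bucket).
        pair_count = {}
        for t in transactions:
            for p in combinations(sorted(t & freq_items), 2):
                if frequent_bucket[hash(p) % n_buckets]:
                    pair_count[p] = pair_count.get(p, 0) + 1
        return {p: c for p, c in pair_count.items() if c >= s}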
Que 4.11. Explain simple and randomized algorithm to find most frequent itemsets using at most two passes.

Answer
Simple and randomized algorithm :
1. In the simple and randomized algorithm, we pick a random subset of the baskets and pretend it is the entire dataset, instead of using the entire file of baskets.
2. We must adjust the support threshold to reflect the smaller number of baskets.
3. For instance, if the support threshold for the full dataset is s, and we choose a sample of 1 % of the baskets, then we should examine the sample for itemsets that appear in at least s/100 of the baskets.
4. The best way to pick the sample is to read the entire dataset, and for each basket, select that basket for the sample with some fixed probability p.
5. Suppose there are m baskets in the entire file. At the end, we shall have a sample whose size is very close to pm baskets.
6. However, if the baskets appear in random order in the file already, then we do not even have to read the entire file.
7. We can select the first pm baskets for our sample. Or, if the file is part of a distributed file system, we can pick some chunks at random to serve as the sample.
8. Having selected our sample of the baskets, we use part of main memory to store these baskets.
9. The remaining main memory is used to execute one of the algorithms such as A-Priori or PCY. However, the algorithm must run passes over the main-memory sample for each itemset size, until we find a size with no frequent itemsets.

Que 4.12. Explain SON algorithm to find all or most frequent itemsets using at most two passes.

Answer
SON algorithm :
1. The idea is to divide the input file into chunks.
2. Treat each chunk as a sample, and run the simple and randomized algorithm on that chunk.
3. We use ps as the threshold, if each chunk is fraction p of the whole file and s is the support threshold.
4. Store on disk all the frequent itemsets found for each chunk.
5. Once all the chunks have been processed in that way, take the union of all the itemsets that have been found frequent for one or more chunks. These are the candidate itemsets.
6. If an itemset is not frequent in any chunk, then its support is less than ps in each chunk. Since the number of chunks is 1/p, we conclude that its total support is less than (1/p)ps = s. Thus, every itemset that is frequent in the whole file is frequent in at least one chunk, and we can be sure that all the truly frequent itemsets are among the candidates; i.e., there are no false negatives.
7. We have made a total of one pass through the data as we read each chunk and process it.
8. In a second pass, we count all the candidate itemsets and select those that have support at least s as the frequent itemsets.
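A single-machine sketch of the two SON passes is shown below, reusing the earlier apriori routine on each chunk. The chunking and threshold handling are assumptions for illustration; the distributed MapReduce formulation follows in Que 4.13.

    def son(transactions, s, n_chunks=4):
        m = len(transactions)
        chunks = [transactions[i::n_chunks] for i in range(n_chunks)]
        # Pass 1: mine each chunk with the scaled-down threshold, roughly p*s.
        candidates = set()
        for chunk in chunks:
            local_sup = max(1, int(s * len(chunk) / m))
            candidates |= set(apriori(chunk, local_sup))
        # Pass 2: count every candidate over the whole file, keep support >= s.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= s}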
Que 4.13. Explain SON algorithm using MapReduce.

Answer
1. The SON algorithm works well in a parallel-computing environment.
2. Each of the chunks can be processed in parallel, and the frequent itemsets from each chunk combined to form the candidates.
3. We can distribute the candidates to many processors, have each processor count the support for each candidate in a subset of the baskets, and finally sum those supports to get the support for each candidate itemset in the whole dataset.
4. There is a natural way of expressing each of the two passes as a MapReduce operation.
MapReduce-MapReduce sequence :
1. First Map function :
   a. Take the assigned subset of the baskets and find the itemsets frequent in the subset using the simple and randomized algorithm.
   b. Lower the support threshold from s to ps if each Map task gets fraction p of the total input file.
   c. The output is a set of key-value pairs (F, 1), where F is a frequent itemset from the sample.
2. First Reduce function :
   a. Each Reduce task is assigned a set of keys, which are itemsets.
   b. The value is ignored, and the Reduce task simply produces those keys (itemsets) that appear one or more times. Thus, the output of the first Reduce function is the candidate itemsets.
3. Second Map function :
   a. The Map tasks for the second Map function take all the output from the first Reduce function (the candidate itemsets) and a portion of the input data file.
   b. Each Map task counts the number of occurrences of each of the candidate itemsets among the baskets in the portion of the dataset that it was assigned.
   c. The output is a set of key-value pairs (C, v), where C is one of the candidate sets and v is the support for that itemset among the baskets that were input to this Map task.
4. Second Reduce function :
   a. The Reduce tasks take the itemsets they are given as keys and sum the associated values. The result is the total support for each of the itemsets that the Reduce task was assigned to handle.
   b. Those itemsets whose sum of values is at least s are frequent in the whole dataset, so the Reduce task outputs these itemsets with their counts.
   c. Itemsets that do not have total support at least s are not transmitted to the output of the Reduce task.

Que 4.14. Explain Toivonen's algorithm.

Answer
1. Toivonen's algorithm is a heuristic algorithm for finding frequent itemsets from a given set of data.
2. For many frequent itemset algorithms, main memory is considered a critical resource.
3. This is typically because itemset counting over large data sets results in very large data structures that quickly begin to strain the limits of main memory.
4. Toivonen's algorithm presents an interesting approach to discovering frequent itemsets in large data sets. The algorithm's deceptive simplicity allows us to discover all frequent itemsets through a sampling process.
5. Negative border : An itemset is in the negative border if it is not frequent in the sample, but all its immediate subsets are frequent in the sample.
Passes of Toivonen's algorithm :
Pass 1 :
   a. Start with the random sample, but lower the threshold for the sample.
   b. Add to the itemsets that are frequent in the sample the negative border of these itemsets.
Pass 2 :
   a. Count all the candidate frequent itemsets from the first pass, and their negative border.
   b. If no itemset from the negative border turns out to be frequent in the whole dataset, then we have found all the frequent itemsets.
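The negative border of point 5 can be computed directly from the sample's frequent itemsets. The sketch below is illustrative only; it assumes the set of all items and the frequent itemsets of the sample are available as Python sets.

    def negative_border(frequent, items):
        """frequent: set of frozensets frequent in the sample; items: all items."""
        border = set()
        candidates = {frozenset([i]) for i in items}                  # size-1 candidates
        candidates |= {f | {i} for f in frequent for i in items if i not in f}
        for c in candidates:
            if c in frequent:
                continue
            # c joins the negative border if every immediate subset of c is frequent
            subsets = [c - {x} for x in c]
            if all(s in frequent or len(s) == 0 for s in subsets):
                border.add(c)
        return border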
Que 4.15. Discuss sampling techniques to extract frequent itemsets from a stream.

Answer
1. We assume that stream elements are baskets of items.
2. The simplest approach to maintaining a current estimate of the frequent itemsets in a stream is to collect some number of baskets and store them as a file.
3. Run one of the frequent-itemset algorithms on this file, meanwhile ignoring the stream elements that arrive, or storing them as another file to be analyzed later.
4. When the frequent-itemsets algorithm finishes, we have an estimate of the frequent itemsets in the stream.
5. We can use this collection of frequent itemsets for the application, but start running another iteration of the chosen frequent-itemset algorithm immediately. This algorithm can either :
   a. Use the file that was collected while the first iteration of the algorithm was running. At the same time, collect yet another file to be used at another iteration of the algorithm, when this current iteration finishes.
   b. Start collecting another file of baskets, and run the algorithm when an adequate number of baskets has been collected.

PART-3
Clustering Techniques : Hierarchical, k-means.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.16. Write short notes on clustering.

Answer
1. Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity, but are very dissimilar to objects in other clusters.
2. Dissimilarities and similarities are assessed based on the attribute values describing the objects and often involve distance measures.
3. The "quality" of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster.
4. Centroid distance is an alternative measure of cluster quality and is defined as the average distance of each cluster object from the cluster centroid.
5. Cluster analysis, or simply clustering, is the process of partitioning a set of data objects (or observations) into subsets.
6. The set of clusters resulting from a cluster analysis can be referred to as a clustering.
7. Clustering can lead to the discovery of previously unknown groups within the data.
8. Cluster analysis has been widely used in many applications such as business intelligence, image pattern recognition, Web search, biology, and security.

Que 4.17. What are the requirements for clustering in data mining ?

Answer
Following are the requirements of clustering in data mining :
1. Scalability :
   a. Many clustering algorithms work well on small data sets containing fewer than several hundred data objects.
   b. Clustering on only a sample of a given large data set may lead to biased results. Therefore, highly scalable clustering algorithms are needed.
2. Ability to deal with different types of attributes :
   a. Many algorithms are designed to cluster numeric (interval-based) data.
   b. However, applications may require clustering other data types, such as nominal (categorical) and ordinal data, or mixtures of these data types.
3. Discovery of clusters with arbitrary shape :
   a. Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures.
   b. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape.
   c. It is important to develop algorithms that can detect clusters of arbitrary shape.
4. Requirements for domain knowledge to determine input parameters :
   a. Many clustering algorithms require users to provide domain knowledge in the form of input parameters, such as the desired number of clusters.
   b. The clustering results may be sensitive to such parameters.
5. Ability to deal with noisy and/or missing data :
   a. Most real-world data sets contain outliers and missing, unknown, or erroneous data.
   b. Clustering algorithms can be sensitive to noise and may produce poor-quality clusters. Therefore, we need clustering methods that are robust to noise.
6. Capability of clustering high-dimensionality data :
   a. A data set can contain numerous dimensions or attributes.
   b. Most clustering algorithms are good at handling low-dimensional data, such as data sets involving only two or three dimensions.
   c. Finding clusters of data objects in a high-dimensional space is challenging, especially considering that such data can be very sparse and highly skewed.
7. Constraint-based clustering :
   a. Real-world applications may need to perform clustering under various kinds of constraints.
   b. A challenging task is to find data groups with good clustering behaviour that satisfy the specified constraints.

Que 4.18. Write short notes on hierarchical method of clustering.

Answer
1. A hierarchical method creates a hierarchical decomposition of the given set of data objects.
2. A hierarchical method can be classified as :
   a. Agglomerative approach :
      i. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group.
      ii. It successively merges the objects or groups close to one another, until all the groups are merged into one (the topmost level of the hierarchy), or a termination condition holds.
   b. Divisive approach :
      i. The divisive approach, also called the top-down approach, starts with all the objects in the same cluster.
      ii. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in one cluster, or a termination condition holds.
3. Hierarchical clustering methods can be distance-based or density- and continuity-based.
4. Various extensions of hierarchical methods consider clustering in subspaces.
5. Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices.
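A minimal sketch of the agglomerative (bottom-up) approach is shown below. It assumes points are 2-D tuples and uses single-linkage (minimum pairwise Euclidean distance) as the merge criterion; both choices are illustrative, not prescribed by the text.

    import math

    def single_link_agglomerative(points, n_clusters):
        clusters = [[p] for p in points]            # start: each object is its own group
        def dist(a, b):
            return min(math.dist(p, q) for p in a for q in b)
        while len(clusters) > n_clusters:           # merge until the termination condition
            i, j = min(((i, j) for i in range(len(clusters))
                                for j in range(i + 1, len(clusters))),
                       key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
            clusters[i] += clusters[j]              # merge the two closest groups
            del clusters[j]
        return clusters

    print(single_link_agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)], 2))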
Que 4.19. Write short notes on partitioning method of clustering.

Answer
1. Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it divides the data into k groups such that each group must contain at least one object.
2. In other words, partitioning methods conduct one-level partitioning on datasets. The basic partitioning methods typically adopt exclusive cluster separation, i.e., each object must belong to exactly one group.
3. Most partitioning methods are distance-based. Given k, the number of partitions to construct, a partitioning method creates an initial partitioning.
4. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.
5. There are various criteria for judging the quality of a partitioning. The general criterion is that objects in the same cluster are "close" or related to each other, whereas objects in different clusters are "far apart" or very different.
6. Achieving global optimality in partitioning-based clustering is often computationally prohibitive, potentially requiring an exhaustive enumeration of all the possible partitions.
Que 4.20. Explain k-means (or centroid-based partitioning technique) clustering method.

Answer
1. Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C_1, ..., C_k, that is, C_i ⊂ D and C_i ∩ C_j = ∅ for (1 ≤ i, j ≤ k).
2. An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters.
3. A centroid-based partitioning technique uses the centroid of a cluster, c_i, to represent that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can be defined in various ways, such as by the mean of the objects (or points) assigned to the cluster.
4. The difference between an object p ∈ C_i and c_i, the representative of the cluster, is measured by dist(p, c_i), where dist(x, y) is the Euclidean distance between two points x and y.
5. The quality of cluster C_i can be measured by the within-cluster variation, which is the sum of squared error between all objects in C_i and the centroid c_i, defined as :
   E = Σ_{i=1}^{k} Σ_{p ∈ C_i} dist(p, c_i)^2
   Where, E is the sum of the squared error for all objects in the data set; p is the point in space representing a given object; and c_i is the centroid of cluster C_i (both p and c_i are multi-dimensional).
6. In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed. This objective function tries to make the resulting k clusters as compact and as separate as possible.

Que 4.21. How does the k-means algorithm work ? Write k-means algorithm for partitioning.

Answer
1. First, it randomly selects k of the objects in D, each of which initially represents a cluster mean or center.
2. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the Euclidean distance between the object and the cluster mean.
3. The k-means algorithm then iteratively improves the within-cluster variation.
4. For each cluster, it computes the new mean using the objects assigned to the cluster in the previous iteration.
5. All the objects are then reassigned using the updated means as the new cluster centers.
6. The iterations continue until the assignment is stable, that is, the clusters formed in the current round are the same as those formed in the previous round.
Algorithm :
Input :
   k : the number of clusters,
   D : a data set containing n objects.
Output : A set of k clusters.
Method :
   1. arbitrarily choose k objects from D as the initial cluster centers;
   2. repeat
   3.    (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
   4.    update the cluster means, that is, calculate the mean value of the objects for each cluster;
   5. until no change;
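A compact sketch of this procedure is given below. It is illustrative only: it assumes 2-D points as tuples and Euclidean distance, and it stops when the assignment no longer changes.

    import math, random

    def k_means(points, k, seed=0):
        random.seed(seed)
        centers = random.sample(points, k)           # step 1: arbitrary initial centers
        assignment = None
        while True:
            # step 3: (re)assign every object to the nearest center
            new_assignment = [min(range(k), key=lambda i: math.dist(p, centers[i]))
                              for p in points]
            if new_assignment == assignment:          # step 5: stop when nothing changes
                return centers, assignment
            assignment = new_assignment
            # step 4: recompute each cluster mean from its assigned objects
            for i in range(k):
                members = [p for p, a in zip(points, assignment) if a == i]
                if members:
                    centers[i] = tuple(sum(c) / len(members) for c in zip(*members))

    data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
    print(k_means(data, k=2))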
Que 4.22. What are the characteristics of different clustering techniques/methods ?

Answer
Characteristics of different clustering techniques/methods are :
Characteristics of partitioning methods :
1. Find mutually exclusive clusters of spherical shape
2. Distance-based
3. May use mean or medoid to represent cluster center
4. Effective for small to medium size data sets
Characteristics of hierarchical methods :
1. Clustering is a hierarchical decomposition (i.e., multiple levels)
2. Cannot correct erroneous merges or splits
3. May incorporate other techniques like micro-clustering or consider object "linkages"
Characteristics of density-based methods :
1. Can find arbitrarily shaped clusters
2. Clusters are dense regions of objects in space that are separated by low-density regions
3. May filter out outliers
Characteristics of grid-based methods :
1. Use a multi-resolution grid data structure
2. Fast processing time

PART-4
Clustering High Dimensional Data, CLIQUE and ProCLUS, Frequent Pattern Based Clustering Methods.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.23. What are the approaches for high dimensional data clustering ?

Answer
Approaches for high dimensional data clustering are :
1. Subspace clustering :
   a. Subspace clustering algorithms localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, and possibly overlapping, subspaces.
   b. This technique is an extension of feature selection that attempts to find clusters in different subspaces of the same dataset.
   c. Subspace clustering requires a search method and evaluation criteria.
   d. It limits the scope of the evaluation criteria so as to consider different subspaces for identifying each different cluster.
2. Projected clustering :
   a. In high-dimensional spaces, even though a good partition cannot be defined on all the dimensions because of the sparsity of the data, some subset of the dimensions can always be obtained on which some subsets of data form high quality and significant clusters.
   b. Projected clustering methods are aimed at finding clusters specific to a particular group of dimensions. Each cluster may refer to different subsets of dimensions.
   c. The output of a typical projected clustering algorithm, searching for k clusters in subspaces of dimension l, is twofold :
      i. A partition of data into k + 1 different clusters, where the first k clusters are well shaped, while the (k + 1)th cluster's elements are outliers, which by definition do not cluster well.
      ii. A possibly different set of l dimensions for each of the first k clusters, such that the points in each of those clusters are well clustered in the subspaces defined by these vectors.
3. Biclustering :
   a. Biclustering (or two-way clustering) is a methodology allowing for clustering of the feature set and data points simultaneously, i.e., finding clusters of samples possessing similar characteristics together with the features creating these similarities.
   b. The output of biclustering is not a partition or hierarchy of partitions of either rows or columns, but a partition of the whole matrix into sub-matrices or patches.
   c. The goal of biclustering is to find as many patches as possible, and to have them as large as possible, while maintaining strong homogeneity within patches.

Que 4.24. Write short note on CLIQUE.

Answer
1. CLIQUE is a subspace clustering method.
2. CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-based clusters in subspaces.
3. CLIQUE partitions each dimension into non-overlapping intervals, thereby partitioning the entire embedding space of the data objects into cells. It uses a density threshold to identify dense cells and sparse ones.
4. A cell is dense if the number of objects mapped to it exceeds the density threshold.
5. The main strategy behind CLIQUE for identifying a candidate search space uses the monotonicity of dense cells with respect to dimensionality. This is based on the Apriori property used in frequent pattern and association rule mining.
6. In the context of clusters in subspaces, the monotonicity says the following: a k-dimensional cell c (k > 1) can have at least l points only if every (k-1)-dimensional projection of c, which is a cell in a (k-1)-dimensional subspace, has at least l points.
7. CLIQUE performs clustering in the following two steps :
   a. First step :
      i. In the first step, CLIQUE partitions the d-dimensional data space into non-overlapping rectangular units, identifying the dense units among these.
      ii. CLIQUE finds dense cells in all of the subspaces.
      iii. To do so, CLIQUE partitions every dimension into intervals, and identifies intervals containing at least l points, where l is the density threshold.
      iv. CLIQUE then iteratively joins two k-dimensional dense cells, c_1 and c_2, in subspaces (D_i1, ..., D_ik) and (D_j1, ..., D_jk), respectively, if D_i1 = D_j1, ..., D_i(k-1) = D_j(k-1), and c_1 and c_2 share the same intervals in those dimensions. The join operation generates a new (k+1)-dimensional candidate cell c in space (D_i1, ..., D_i(k-1), D_ik, D_jk).
      v. CLIQUE checks whether the number of points in c passes the density threshold. The iteration terminates when no candidate cells can be generated or no candidate cells are dense.
   b. Second step :
      i. In the second step, CLIQUE uses the dense cells in each subspace to assemble clusters, which can be of arbitrary shape.
      ii. The idea is to apply the Minimum Description Length (MDL) principle to use maximal regions to cover connected dense cells, where a maximal region is a hyper-rectangle in which every cell falling into the region is dense, and the region cannot be extended further in any dimension of the subspace.

Que 4.25. Write short notes on PROCLUS.

Answer
1. Projected clustering (PROCLUS) is a top-down subspace clustering algorithm.
2. PROCLUS samples the data and then selects a set of k medoids and iteratively improves the clustering.
3. PROCLUS is actually faster than CLIQUE due to the sampling of large data sets.
4. The three phases of PROCLUS are as follows :
   a. Initialization phase : Select a set of potential medoids that are far apart using a greedy algorithm.
   b. Iteration phase :
      i. Select a random set of k medoids from this reduced data set to determine if clustering quality improves by replacing current medoids with randomly chosen new medoids.
      ii. Cluster quality is based on the average distance between instances and the nearest medoid.
      iii. For each medoid, a set of dimensions is chosen whose average distances are small compared to statistical expectation.
      iv. Once the subspaces have been selected for each medoid, average Manhattan segmental distance is used to assign points to medoids, forming clusters.
   c. Refinement phase :
      i. Compute a new list of relevant dimensions for each medoid based on the clusters formed, and reassign points to medoids, removing outliers.
5. The distance-based approach of PROCLUS is biased toward clusters that are hyper-spherical in shape.

Que 4.26. Discuss the basic subspace clustering approaches.

Answer
Basic subspace clustering approaches are :
1. Grid-based subspace clustering :
   a. In this approach, the data space is divided into axis-parallel cells. Then the cells containing objects above a predefined threshold value, given as a parameter, are merged to form subspace clusters. The number of intervals is another input parameter, which defines the range of values in each grid cell.
   b. The Apriori property is used to prune non-promising cells and to improve efficiency : if a unit is found to be dense in k-1 dimensions, then it is considered for finding dense units in k dimensions.
   c. If grid boundaries are strictly followed to separate objects, the accuracy of the clustering result is decreased, as the method may miss neighbouring objects that get separated by a strict grid boundary. Clustering quality is highly dependent on the input parameters.
2. Window-based subspace clustering :
   a. Window-based subspace clustering overcomes a drawback of cell-based subspace clustering, namely that it may omit significant results.
   b. Here a window slides across the attribute values, and the overlapping intervals obtained are used to form subspace clusters.
   c. The size of the sliding window is one of the parameters. These algorithms generate axis-parallel subspace clusters.
3. Density-based subspace clustering :
   a. Density-based subspace clustering algorithms overcome the drawbacks of grid-based subspace clustering algorithms by not using grids.
   b. A cluster is defined as a collection of objects forming a chain which fall within a given distance of one another and exceed a predefined object-count threshold. Adjacent dense regions are then merged to form bigger clusters.
   c. As no grids are used, these algorithms can find arbitrarily shaped subspace clusters.
   d. Clusters are built by joining together the objects from adjacent dense regions.
   e. These approaches are prone to the values of the distance parameters. The effect of the curse of dimensionality is overcome in adaptive density-based algorithms by utilizing a density measure which is adaptive to the subspace size.

Que 4.27. What are the major tasks of clustering evaluation ?

Answer
The major tasks of clustering evaluation include the following :
1. Assessing clustering tendency :
   a. In this task, for a given data set, we assess whether a non-random structure exists in the data.
   b. Blindly applying a clustering method on a data set will return clusters; however, the clusters mined may be misleading.
   c. Clustering analysis on a data set is meaningful only when there is a non-random structure in the data.
2. Determining the number of clusters in a data set :
   a. A few algorithms, such as k-means, require the number of clusters in a data set as a parameter.
   b. Moreover, the number of clusters can be regarded as an interesting and important summary statistic of a data set.
   c. Therefore, it is desirable to estimate this number even before a clustering algorithm is used to derive detailed clusters.
3. Measuring clustering quality :
   a. After applying a clustering method on a data set, we want to assess how good the resulting clusters are.
   b. A number of measures can be used.
   c. Some methods measure how well the clusters fit the data set, while others measure how well the clusters match the ground truth, if such truth is available.
   d. There are also measures that score clusterings and thus can compare two sets of clustering results on the same data set.

PART-5
Clustering in Non-Euclidean Space, Clustering for Streams and Parallelism.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.28. Explain representation of clusters in GRGPF algorithm.

Answer
1. The representation of a cluster in main memory consists of several features.
2. Before listing these features, if p is any point in a cluster, let ROWSUM(p) be the sum of the squares of the distances from p to each of the other points in the cluster.
3. The following features form the representation of a cluster :
   a. N, the number of points in the cluster.
   b. The clustroid of the cluster, which is defined specifically to be the point in the cluster that minimizes the sum of the squares of the distances to the other points; that is, the clustroid is the point in the cluster with the smallest ROWSUM.
   c. The rowsum of the clustroid of the cluster.
   d. For some chosen constant k, the k points of the cluster that are closest to the clustroid, and their rowsums. These points are part of the representation in case the addition of points to the cluster causes the clustroid to change. The assumption is made that the new clustroid would be one of these k points near the old clustroid.
   e. The k points of the cluster that are furthest from the clustroid, and their rowsums. These points are part of the representation so that we can consider whether two clusters are close enough to merge. The assumption is made that if two clusters are close, then a pair of points distant from their respective clustroids would be close.

Que 4.29. Explain initialization of cluster tree in GRGPF algorithm.

Answer
1. The clusters are organized into a tree, and the nodes of the tree may be very large, perhaps disk blocks or pages, as in the case of a B-tree, which the cluster-representing tree resembles.
2. Each leaf of the tree holds as many cluster representations as can fit.
3. A cluster representation has a size that does not depend on the number of points in the cluster.
4. An interior node of the cluster tree holds a sample of the clustroids of the clusters represented by each of its subtrees, along with pointers to the roots of those subtrees.
5. The samples are of fixed size, so the number of children that an interior node may have is independent of its level.
6. As we go up the tree, the probability that a given cluster's clustroid is part of the sample diminishes.
7. We initialize the cluster tree by taking a main-memory sample of the dataset and clustering it hierarchically.
8. The result of this clustering is a tree T, but T is not exactly the tree used by the GRGPF algorithm. Rather, we select from T certain nodes that represent clusters of approximately some desired size n.
9. These are the initial clusters for the GRGPF algorithm, and we place their representations at the leaves of the cluster-representing tree. We then group clusters with a common ancestor in T into interior nodes of the cluster-representing tree. In some cases, rebalancing of the cluster-representing tree will be necessary.
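A small sketch of the per-cluster features from Que 4.28 (ROWSUM, clustroid, and the k closest and furthest points) is shown below. It is illustrative only; it assumes an arbitrary distance function dist is supplied, since GRGPF is meant for non-Euclidean spaces.

    def cluster_features(points, dist, k=3):
        """points: list of objects in one cluster; dist: distance function."""
        rowsum = {p: sum(dist(p, q) ** 2 for q in points if q is not p) for p in points}
        clustroid = min(points, key=lambda p: rowsum[p])      # smallest ROWSUM
        by_dist = sorted((p for p in points if p is not clustroid),
                         key=lambda p: dist(p, clustroid))
        return {
            "N": len(points),
            "clustroid": clustroid,
            "rowsum_clustroid": rowsum[clustroid],
            "closest": [(p, rowsum[p]) for p in by_dist[:k]],   # for clustroid updates
            "furthest": [(p, rowsum[p]) for p in by_dist[-k:]], # for merge decisions
        }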
Que 4.30. Write short note on BDMO stream clustering algorithm.

Answer
1. In the BDMO algorithm, the points of the stream are partitioned into, and summarized by, buckets whose sizes are a power of two. Here, the size of a bucket is the number of points it represents, rather than the number of stream elements that are 1.
2. The sizes of buckets obey the restriction that there are one or two of each size, up to some limit. They are required only to form a sequence where each size is twice the previous size, such as 3, 6, 12, 24, ... .
3. The contents of a bucket consist of :
   a. The size of the bucket.
   b. The timestamp of the bucket, that is, the most recent point that contributes to the bucket.
   c. A collection of records that represent the clusters into which the points of that bucket have been partitioned. These records contain :
      i. The number of points in the cluster.
      ii. The centroid or clustroid of the cluster.
      iii. Any other parameters necessary to enable us to merge clusters and maintain approximations to the full set of parameters for the merged cluster.
