Data Analytics : Unit - 4
Frequent Itemsets and Clustering

CONTENTS
Part-1 : Mining Frequent Itemsets, Market Based Modelling, Apriori Algorithm
Part-2 : Handling Large Data Sets in Main Memory, Limited Pass Algorithm, Counting Frequent Itemsets in a Stream
Part-3 : Clustering Techniques : Hierarchical, k-means
Part-4 : Clustering High Dimensional Data, CLIQUE and ProCLUS, Frequent Pattern Based Clustering Methods
Part-5 : Clustering in Non-Euclidean Space, Clustering for Streams and Parallelism

PART-1
Mining Frequent Itemsets, Market Based Modelling, Apriori Algorithm.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.1. Write short notes on frequent patterns in data mining.

Answer
1. Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear frequently in a dataset.
2. A substructure can refer to different structural forms, such as sub-graphs, sub-trees, or sub-lattices, which may be combined with itemsets or subsequences.
3. If a substructure occurs frequently, it is called a (frequent) structured pattern.
4. Finding frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data.
5. It also helps in data classification, clustering, and other data mining tasks.
6. Frequent pattern mining searches for recurring relationships in a given dataset.
7. For example, a set of items, such as milk and bread, that appears frequently together in a grocery transaction dataset is a frequent itemset.
8. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.

Que 4.2. Explain frequent itemset mining.

Answer
1 Frequent itemset mining leads to the discovery of associations and
correlations among items in large transactional or relational datasets.
2 With massive amounts of data continuously being collected and stored,
many industries are becoming interested in mining such patterns from
their databases.
3. The discovery of interesting correlation relationships among huge amounts of business transaction records can help in many business decision-making processes, such as catalogue design, cross-marketing, and customer shopping behaviour analysis.
4. A typical example of frequent itemset mining is market basket analysis.
5. This process analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets".
6. The discovery of these associations can help retailers to develop marketing strategies by gaining insight into which items are frequently purchased together by customers.
7. For instance, if customers are buying some products, how likely are they to also buy other products at the same time? This information can lead to increased sales by helping retailers do selective marketing.

Que 4.3. Write short notes on market based modelling.

Answer
1. The market-basket model of data is used to describe a common form of many-to-many relationship between two kinds of objects.
2. On the one hand, we have items, and on the other we have baskets, sometimes called "transactions".
3. Each basket consists of a set of items (an itemset), and usually we assume that the number of items in a basket is small, much smaller than the total number of items.
4. The number of baskets is usually assumed to be very large, bigger than what can fit in main memory.
5. The data is assumed to be represented in a file consisting of a sequence of baskets.

Que 4.4. Write short notes on algorithm for finding frequent itemsets.

Answer
1. The Apriori algorithm takes a bottom-up iterative approach to find the frequent itemsets by first determining all the possible items and then identifying which among them are frequent.
2. Let variable Ck be the set of candidate k-itemsets and variable Lk be the set of k-itemsets that satisfy the minimum support.
3. Given a transaction database D, a minimum support threshold δ, and an optional parameter N indicating the maximum length an itemset could reach, Apriori iteratively computes frequent itemsets Lk+1 based on Lk.
Apriori algorithm :
Apriori (D, δ, N) :
1.  k ← 1
2.  Lk ← {1-itemsets that satisfy minimum support δ}
3.  while Lk ≠ ∅
4.    if ∄N ∨ (∃N ∧ k < N)
5.      Ck+1 ← candidate itemsets generated from Lk
6.      for each transaction t in database D do
7.        increment the counts of Ck+1 contained in t
8.      Lk+1 ← candidates in Ck+1 that satisfy minimum support δ
9.      k ← k + 1
10. return ∪k Lk
4. At each iteration, the algorithm checks whether the support criterion can be met; if it can, the algorithm grows the itemsets, repeating the process until it runs out of support or until the itemsets reach a predefined length.
5. The first step of the Apriori algorithm is to identify the frequent itemsets by starting with each item in the transactions that meets the predefined minimum support threshold δ.
6. These itemsets are 1-itemsets denoted as L1, as each 1-itemset contains only one item. Next, the algorithm grows the itemsets by joining L1 onto itself to form new, grown 2-itemsets denoted as L2 and determines the support of each 2-itemset in L2. Those itemsets that do not meet the minimum support threshold δ are pruned away.
7. The growing and pruning process is repeated until no itemsets meet the minimum support threshold.
8. A threshold N can be set up to specify the maximum number of items the itemsets can reach or the maximum number of iterations of the algorithm. Once completed, the output of the Apriori algorithm is the collection of all the frequent k-itemsets.
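The following Python sketch (not part of the original text; function and variable names are illustrative) shows one way the grow-count-prune loop described above could be implemented for small datasets.

from itertools import combinations

def apriori(transactions, min_support, max_len=None):
    """Toy Apriori: returns {itemset (frozenset): support count}."""
    transactions = [set(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(L)
    k = 1
    while L and (max_len is None or k < max_len):
        # Grow: candidate (k+1)-itemsets whose k-subsets are all frequent
        items = sorted({i for s in L for i in s})
        candidates = [frozenset(c) for c in combinations(items, k + 1)
                      if all(frozenset(sub) in L for sub in combinations(c, k))]
        # Count: one pass over the transactions
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # Prune: keep only candidates meeting the minimum support
        L = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(L)
        k += 1
    return frequent

# Example usage on the AllElectronics-style data of Table 4.7.1
data = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"],
        ["I1","I3"], ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
print(apriori(data, min_support=2))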
Que 4.5. How is the Apriori property used in the algorithm?

Answer
A two-step process is followed, consisting of join and prune actions.
1. The join step :
a. To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck.
b. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li (e.g., l1[k-2] refers to the second to the last item in l1).
c. For efficient implementation, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. For the (k-1)-itemset li, this means that the items are sorted such that li[1] < li[2] < ... < li[k-1].
d. The join, Lk-1 ⋈ Lk-1, is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 of Lk-1 are joined if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]). The condition l1[k-1] < l2[k-1] simply ensures that no duplicates are generated.
e. The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2], ..., l1[k-2], l1[k-1], l2[k-1]}.
2. The prune step :
a. Ck is a superset of Lk, that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck.
b. A database scan to determine the count of each candidate in Ck would result in the determination of Lk (i.e., all candidates having a count no less than the minimum support count belong to Lk and are therefore frequent by definition).
c. Ck, however, can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows : any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
d. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent and can be removed from Ck.
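A small Python sketch of the join and prune steps described above (illustrative helper names; each itemset is assumed to be kept as a sorted tuple):

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets L_prev.

    L_prev : set of sorted tuples of length k-1.
    """
    candidates = set()
    L_list = sorted(L_prev)
    # Join step: merge itemsets whose first k-2 items agree
    for i, l1 in enumerate(L_list):
        for l2 in L_list[i + 1:]:
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                candidates.add(l1 + (l2[k - 2],))
    # Prune step: drop candidates having an infrequent (k-1)-subset
    pruned = set()
    for c in candidates:
        subsets = [c[:i] + c[i + 1:] for i in range(k)]
        if all(s in L_prev for s in subsets):
            pruned.add(c)
    return pruned

# Example: frequent 2-itemsets -> candidate 3-itemsets
L2 = {("I1", "I2"), ("I1", "I3"), ("I2", "I3"), ("I2", "I5"), ("I1", "I5")}
print(apriori_gen(L2, 3))   # contains ('I1','I2','I3') and ('I1','I2','I5')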
Que 4.6. Write short note on generating association rules from frequent itemsets.

Answer
1. Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence).
2. This can be done using equation (4.6.1) for confidence, which is shown here for completeness :
   confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)    ...(4.6.1)
3. The conditional probability is expressed in terms of itemset support count, where support_count(A ∪ B) is the number of transactions containing the itemsets A ∪ B, and support_count(A) is the number of transactions containing the itemset A.
4. Based on equation (4.6.1), association rules can be generated as follows :
a. For each frequent itemset l, generate all non-empty subsets of l.
b. For every non-empty subset s of l, output the rule "s ⇒ (l − s)" if (support_count(l) / support_count(s)) ≥ min_conf, where min_conf is the minimum confidence threshold.
5. Because the rules are generated from frequent itemsets, each one automatically satisfies the minimum support.
6. Frequent itemsets can be stored ahead of time in hash tables along with their counts so that they can be accessed quickly.
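As an illustration of step 4, the sketch below (hypothetical helper names; it assumes the support counts of all frequent itemsets are already available, for example from the Apriori sketch shown earlier) enumerates the rules s ⇒ (l − s) and keeps those meeting the confidence threshold of equation (4.6.1).

from itertools import combinations

def generate_rules(support_counts, min_conf):
    """support_counts: {frozenset itemset: support count} for all frequent itemsets."""
    rules = []
    for l, count_l in support_counts.items():
        if len(l) < 2:
            continue
        # All non-empty proper subsets s of l
        for r in range(1, len(l)):
            for s in combinations(l, r):
                s = frozenset(s)
                conf = count_l / support_counts[s]   # equation (4.6.1)
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

counts = {frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I1", "I2"]): 4}
for antecedent, consequent, conf in generate_rules(counts, min_conf=0.5):
    print(antecedent, "=>", consequent, round(conf, 2))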
Que 4.7. How can we improve the efficiency of Apriori-based mining?

Answer
Many variations of the Apriori algorithm have been proposed that focus on improving the efficiency of the original algorithm. Several of these variations are as follows :
1. Hash-based technique (hashing itemsets into corresponding buckets) :
a. A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1.
b. For example, when scanning each transaction in the database to generate the frequent 1-itemsets, L1, we can generate all the 2-itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash table structure, and increase the corresponding bucket counts (Fig. 4.7.1).

Table 4.7.1 : Transactional data for an AllElectronics branch
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

c. A 2-itemset whose corresponding bucket count is below the minimum support count cannot be frequent and should therefore be removed from the candidate set. For the transactions of Table 4.7.1, the 2-itemsets hashed to buckets 0, 1, 2, and 4 have bucket counts below the minimum support count while L1 is being determined, so they should not be included in C2.
d. Such a hash-based technique may substantially reduce the number of candidate k-itemsets examined (especially when k = 2).
2. Transaction reduction (reducing the number of transactions scanned in future iterations) :
a. A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k + 1)-itemsets.
b. Therefore, such a transaction can be marked or removed from further consideration, because subsequent database scans for j-itemsets, where j > k, will not need to consider it.
3. Partitioning (partitioning the data to find candidate itemsets) :
a. A partitioning technique can be used that requires just two database scans to mine the frequent itemsets, as shown in Fig. 4.7.2.
[Fig. 4.7.2 : Mining frequent itemsets by partitioning the data]
Phase I :
i. In phase I, the algorithm divides the transactions of D into n non-overlapping partitions. If the minimum relative support threshold for transactions in D is min_sup, then the minimum support count for a partition is min_sup × the number of transactions in that partition.
ii. For each partition, all the local frequent itemsets are found.
iii. A local frequent itemset may or may not be frequent with respect to the entire database, D. However, any itemset that is potentially frequent with respect to D must occur as a frequent itemset in at least one of the partitions.
iv. Therefore, all local frequent itemsets are candidate itemsets with respect to D. The collection of frequent itemsets from all partitions forms the global candidate itemsets with respect to D.
Phase II :
i. In phase II, a second scan of D is conducted in which the actual support of each candidate is assessed to determine the global frequent itemsets.
ii. Partition size and the number of partitions are set so that each partition can fit into main memory and therefore be read only once in each phase.
4. Sampling (mining on a subset of the given data) :
a. The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.
b. In this way, we trade off some degree of accuracy against efficiency.
c. The sample size of S is such that the search for frequent itemsets in S can be done in main memory, and so only one scan of the transactions in S is required overall.
d. In this technique, it is possible that we will miss some of the global frequent itemsets.
5. Dynamic itemset counting (adding candidate itemsets at different points during a scan).
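A minimal sketch of the hash-based technique (point 1), assuming a simple illustrative hash function over item pairs; pairs that land only in light buckets can be dropped from C2:

from itertools import combinations

def bucket_counts(transactions, num_buckets):
    """Count, per hash bucket, how many 2-itemsets from the transactions hash there."""
    counts = [0] * num_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[hash(pair) % num_buckets] += 1
    return counts

def keep_pair(pair, counts, num_buckets, min_support):
    # A pair can only be frequent if its bucket count reaches min_support
    return counts[hash(tuple(sorted(pair))) % num_buckets] >= min_support

data = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"]]
c = bucket_counts(data, num_buckets=8)
print(keep_pair(("I1", "I2"), c, 8, min_support=2))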
Applications of frequent itemset analysis :
1. Related concepts :
a. Let the items be words, and let the baskets be documents. A basket/document contains those items/words that are present in the document.
b. If we look for sets of words that appear together in many documents, the sets will be dominated by the most common words (stop words).
c. Even if we ignore all the most common words, we would hope to find, among the frequent pairs, some pairs of words that represent a joint concept.
2. Plagiarism :
a. Let the items be documents and the baskets be sentences. An item/document is "in" a basket/sentence if the sentence is in the document.
b. This arrangement appears backwards, but we should remember that the relationship between items and baskets is an arbitrary many-to-many relationship.
c. In this application, we look for pairs of items that appear together in several baskets.
d. If we find such a pair, then we have two documents that share several sentences in common, which suggests plagiarism.
3. Biomarkers :
a. Let the items be of two types : biomarkers such as genes or blood proteins, and diseases.
b. Each basket is the set of data about a patient : their genome and blood-chemistry analysis, as well as their medical history of disease.
c. A frequent itemset that consists of one disease and one or more biomarkers suggests a test for the disease.

Que 4.9. What are the different methods for storing itemset counts in main memory?

Answer
Different methods for storing itemset counts in main memory are :
1. The triangular-matrix method :
a. Even after coding items as integers, we still have the problem that we must count a pair {i, j} in only one place.
b. For example, we could order the pair so that i < j, and only use the entry a[i, j] in a two-dimensional array a. That strategy would make half the array useless.
c. A more space-efficient way is to use a one-dimensional triangular array. We store in a[k] the count for the pair {i, j}, with 1 ≤ i < j ≤ n, where k = (i − 1)(n − i/2) + j − i.
d. The pairs are stored in lexicographic order, that is, first {1, 2}, {1, 3}, ..., {1, n}, then {2, 3}, {2, 4}, ..., {2, n}, and so on, down to {n − 2, n − 1}, {n − 2, n}, and finally {n − 1, n}.
2. The triples method :
a. There is another approach to storing counts that may be more appropriate, depending on the fraction of the possible pairs of items that actually appear in some basket.
b. We can store counts as triples [i, j, c], meaning that the count of pair {i, j}, with i < j, is c. A data structure, such as a hash table with i and j as the search key, is used so we can tell if there is a triple for a given i and j and, if so, to find it quickly.
c. We call this approach the triples method of storing counts.
d. The triples method does not require us to store anything if the count for a pair is 0.
e. On the other hand, the triples method requires us to store three integers, rather than one, for every pair that does appear in some basket.
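A small sketch of the triangular-matrix indexing (the index formula from point 1c, with 1-based item numbers; names are illustrative):

def pair_index(i, j, n):
    """1-based position k of pair {i, j} (i < j) among n items, in lexicographic order."""
    assert 1 <= i < j <= n
    return (i - 1) * (n - i / 2) + j - i

n = 5
counts = [0] * (n * (n - 1) // 2)     # one counter per unordered pair
k = int(pair_index(2, 4, n))          # pair {2, 4} -> position 6
counts[k - 1] += 1                    # the array itself is 0-based
print(k, counts)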
PART-2
Handling Large Data Sets in Main Memory, Limited Pass Algorithm, Counting Frequent Itemsets in a Stream.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.10. Explain PCY algorithm for handling large dataset in main memory.
Answer
1. In the first pass of the Apriori algorithm, there may be much unused main memory.
2. The PCY algorithm uses the unused space for an array of integers that generalizes the idea of a Bloom filter. The idea is shown schematically in Fig. 4.10.1.
[Fig. 4.10.1 : Organization of main memory in the two passes of the PCY algorithm. Pass 1 : item names to integers, item counts, and a hash table for bucket counts. Pass 2 : item names to integers, frequent items, a bitmap over the buckets, and a data structure for counts of pairs.]
3. The array is considered as a hash table, whose buckets hold integers rather than sets of keys or bits. Pairs of items are hashed to buckets of this hash table. As we examine a basket during the first pass, we not only add 1 to the count for each item in the basket, but we generate all the pairs, using a double loop.
4. We hash each pair, and we add 1 to the bucket into which that pair hashes.
5. At the end of the first pass, each bucket has a count, which is the sum of the counts of all the pairs that hash to that bucket.
6. If the count of a bucket is at least as great as the support threshold s, it is called a frequent bucket. We can say nothing about the pairs that hash to a frequent bucket; they could all be frequent pairs from the information available to us.
7. But if the count of the bucket is less than s (an infrequent bucket), we know no pair that hashes to this bucket can be frequent, even if the pair consists of two frequent items.
8. We can define the set of candidate pairs C2 to be those pairs {i, j} such that :
a. i and j are frequent items.
b. {i, j} hashes to a frequent bucket.
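A compact Python sketch of the two PCY passes described above (illustrative names; the hash function and the number of buckets are arbitrary choices for the example):

from itertools import combinations
from collections import Counter

def pcy(baskets, s, num_buckets=50):
    """Frequent pairs via PCY: item and bucket counts on pass 1, then count only
    candidate pairs on pass 2."""
    bucket = lambda pair: hash(pair) % num_buckets

    # Pass 1: count items and hash every pair into a bucket
    item_counts, bucket_counts = Counter(), [0] * num_buckets
    for b in baskets:
        item_counts.update(b)
        for pair in combinations(sorted(b), 2):
            bucket_counts[bucket(pair)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in bucket_counts]      # frequent-bucket bitmap

    # Pass 2: count only pairs of frequent items hashing to a frequent bucket
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(set(b) & frequent_items), 2):
            if bitmap[bucket(pair)]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}

baskets = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"],
           ["I1","I3"], ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
print(pcy(baskets, s=2))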
Que 4.11. Explain simple and randomized algorithm to find most frequent itemsets using at most two passes.

Answer
Simple and randomized algorithm :
1. In the simple, randomized algorithm, we pick a random subset of the baskets and pretend it is the entire dataset, instead of using the entire file of baskets.
2. We must adjust the support threshold to reflect the smaller number of baskets.
3. For instance, if the support threshold for the full dataset is s, and we choose a sample of 1 % of the baskets, then we should examine the sample for itemsets that appear in at least s/100 of the baskets.
4. The best way to pick the sample is to read the entire dataset, and for each basket, select that basket for the sample with some fixed probability p.
5. Suppose there are m baskets in the entire file. At the end, we shall have a sample whose size is very close to pm baskets.
6. However, if the baskets appear in random order in the file already, then we do not even have to read the entire file.
7. We can select the first pm baskets for our sample. Or, if the file is part of a distributed file system, we can pick some chunks at random to serve as the sample.
8. Having selected our sample of the baskets, we use part of main memory to store these baskets.
9. Remaining main memory is used to execute one of the algorithms such as A-Priori or PCY. However, the algorithm must run passes over the main-memory sample for each itemset size, until we find a size with no frequent items.
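A minimal sketch of the sampling step (illustrative names). It only draws the sample and scales the threshold; any in-memory miner, for example the Apriori sketch shown earlier, can then be run on the sample:

import random

def sample_baskets(baskets, p, seed=0):
    """Keep each basket independently with probability p (one pass over the data)."""
    rng = random.Random(seed)
    return [b for b in baskets if rng.random() < p]

baskets = [["I1", "I2"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3"]] * 250
s = 400                     # support threshold for the full dataset
p = 0.01                    # 1 % sample
sample = sample_baskets(baskets, p)
scaled_s = s * p            # examine the sample with threshold p*s (here s/100)
print(len(sample), scaled_s)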
Que 4.12. Explain SON algorithm to find all or most frequent itemsets using at most two passes.

Answer
SON algorithm :
1. The idea is to divide the input file into chunks.
2. Treat each chunk as a sample, and run the simple and randomized algorithm on that chunk.
3. We use ps as the threshold, if each chunk is fraction p of the whole file and s is the support threshold.
4. Store on disk all the frequent itemsets found for each chunk.
5. Once all the chunks have been processed in that way, take the union of all the itemsets that have been found frequent for one or more chunks. These are the candidate itemsets.
6. If an itemset is not frequent in any chunk, then its support is less than ps in each chunk. Since the number of chunks is 1/p, we conclude that its total support is less than (1/p)ps = s.
7. Thus, every itemset that is frequent in the whole is frequent in at least one chunk, and we can be sure that all the truly frequent itemsets are among the candidates; i.e., there are no false negatives. We have made a total of one pass through the data as we read each chunk and process it.
8. In a second pass, we count all the candidate itemsets and select those that have support at least s as the frequent itemsets.
Those itemsets whose sum of values is at leasts are frequent in the
Que 4.13.Explain SON algorithm usng MapReduce. whole dataset, so the Reduce task outputs
these itemsets with their
counts.
Answer
Itemsets that do not have total support at
least s are not transmitted to
The SON algorithm work well in a parallel-computing environment.
d.
1
the output of the Reduee task.
2 Each of the chunks can be processed in parallel, and the frequent
itemsets from each chunk combined to form the candidates. Que 4.14. Explain Toivonen's algorith m.
We can distribute the candidates to many processors, have each
processor count the support for each candidate in a subset of the Answer
baskets, and finally sum those supports to get the support for each finding frequent
candidate itemset in the whole dataset. 1 Toivonen's algorithm is a heuristic algorithm for
itemsets from a given set of data.
4 There is a natural way of expressing each of the two passes as a algorithms, main memory is considered a
2 For many frequent itemset
MapReduce operation. critical resource.
counting over large data sets results
MapReduce-MapReduce sequence : 3 This is typically because itemset quickly begin to strain the limits of
that
First Map function : in very large data structures
Take the assigned subset of thebaskets and find the itemsets frequent main memory.
approach to discovering
in the subeet using the simple and randomized algorithm. Toivonen's algorithm presents an interesting
algorithm's deceptive simplicity
frequent itemsets in large data sets. The
Lower the support threshold froms tops if ench Map task gets fraction through a sampling process.
pef the total input file. allows us to discover all frequent itemsets border if it is not
in the negative
Th outpu is a set of key-value pairs (P, ), where F is a frequent 5 Negative border : An itemset is immediate subsets are frequent in
itenaet frem the sample. frequent in the sample, but all its
Pirst Reduce Function : the sample.
Each Reduce task is assigned aset ofkeys, which are itemsets.
DataAnaltics
Toivonen'y
algorithm t
-15J(C8-M 416J(CS-MTT4) Frequent Itemsets and Clustering
Passesof
& butlower the threshold
therandomsample, Questions-Answers
Start with
the subset. frequent in the sample
itemsets
the itemsets.
that are the negati Long Answer Type and Medium Answer Type Questions
Addtoofthese
border
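A small sketch of computing the negative border from the itemsets found frequent in the sample (illustrative names; itemsets are represented as frozensets):

from itertools import combinations

def negative_border(frequent, items):
    """Itemsets not frequent in the sample whose immediate subsets are all frequent."""
    border = set()
    sizes = {len(f) for f in frequent} | {0}
    for k in range(1, max(sizes) + 2):
        for cand in combinations(sorted(items), k):
            c = frozenset(cand)
            if c in frequent:
                continue
            subsets = [c - {x} for x in c]
            if all(s in frequent or len(s) == 0 for s in subsets):
                border.add(c)
    return border

frequent = {frozenset(x) for x in [["A"], ["B"], ["C"], ["A", "B"]]}
print(negative_border(frequent, items={"A", "B", "C", "D"}))
# here the border contains {D}, {A,C} and {B,C}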
PART-3
Clustering Techniques : Hierarchical, k-means.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

2. Ability to deal with different types of data :
a. Many clustering algorithms are designed to cluster numeric data. However, applications may require clustering other data types, such as binary, nominal (categorical), and ordinal data, or mixtures of these data types.
3. Discovery of clusters with arbitrary shape :
a. Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures.
b. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape.
c. It is important to develop algorithms that can detect clusters of arbitrary shape.
4. Requirements for domain knowledge to determine input parameters :
a. Many clustering algorithms require users to provide domain knowledge in the form of input parameters, such as the desired number of clusters.
b. The clustering results may be sensitive to such parameters.
5. Ability to deal with noisy and/or missing data :
a. Most real-world data sets contain outliers and/or missing, unknown, or erroneous data.
b. Clustering algorithms can be sensitive to such noise and may produce poor-quality clusters. Therefore, we need clustering methods that are robust to noise.
6. Capability of clustering high-dimensionality data :
a. A data set can contain numerous dimensions or attributes.
b. Most clustering algorithms are good at handling low-dimensional data such as data sets involving only two or three dimensions.
c. Finding clusters of data objects in a high-dimensional space is challenging, especially considering that such data can be very sparse and highly skewed.
7. Constraint-based clustering :
a. Real-world applications may need to perform clustering under various kinds of constraints.
b. A challenging task is to find data groups with good clustering behaviour that satisfy specified constraints.

Que 4.18. Write short notes on hierarchical method of clustering.

Answer
1. A hierarchical method creates a hierarchical decomposition of the given set of data objects.
2. A hierarchical method can be classified as :
a. Agglomerative approach : The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups close to one another, until all the groups are merged into one (the topmost level of the hierarchy), or a termination condition holds.
b. Divisive approach : The divisive approach, also called the top-down approach, starts with all the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in one cluster, or a termination condition holds.
3. Hierarchical clustering methods can be distance-based or density- and continuity-based.
4. Various extensions of hierarchical methods consider clustering in subspaces.
5. Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices.
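A tiny sketch of the agglomerative (bottom-up) idea from Que 4.18, using single-linkage merges on one-dimensional points (purely illustrative; practical implementations rely on optimized libraries):

def agglomerative(points, num_clusters):
    """Repeatedly merge the two closest clusters (single linkage) until
    only num_clusters remain."""
    clusters = [[p] for p in points]          # start: every object is its own group
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # merge the closest pair
    return clusters

print(agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], num_clusters=2))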
Que 4.19. Write short notes on partitioning method of clustering.

Answer
1. Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it divides the data into k groups such that each group must contain at least one object.
2. In other words, partitioning methods conduct one-level partitioning on datasets. The basic partitioning methods typically adopt exclusive cluster separation, i.e., each object must belong to exactly one group.
3. Most partitioning methods are distance-based. Given k, the number of partitions to construct, a partitioning method creates an initial partitioning.
4. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.
5. The general criterion of a good partitioning is that objects in the same cluster are close or related to each other, whereas objects in different clusters are far apart or very different. There are various other criteria for judging the quality of partitions.
6. Achieving global optimality in partitioning-based clustering is often computationally prohibitive, potentially requiring an exhaustive enumeration of all the possible partitions.

Que 4.20. Explain k-means (a centroid-based partitioning technique) clustering method.

Answer
1. First, it randomly selects k of the objects in D, each of which initially represents a cluster mean or center.
2. For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the Euclidean distance between the object and the cluster mean.
3. The k-means algorithm then iteratively improves the within-cluster variation.
4. For each cluster, it computes the new mean using the objects assigned to the cluster in the previous iteration, reassigns all the objects using the updated means as the new cluster centers, and repeats until the assignment is stable.
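A short Python sketch of the k-means loop described above (illustrative, for one-dimensional points; real applications would use an optimized library):

import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # step 1: pick k initial centers
    for _ in range(iters):
        # Step 2: assign each object to the nearest center (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Steps 3-4: recompute each center as the mean of its cluster
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:            # assignment is stable: stop
            break
        centers = new_centers
    return centers, clusters

print(kmeans([1.0, 1.1, 0.9, 5.0, 5.2, 4.8], k=2))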
The k points of the cluster that are furthest from the clustroid are part of the representation, so that we can consider whether two clusters are close enough to merge. The assumption is made that if two clusters are close, then points distant from their respective clustroids would also be close.

Que 4.29. Explain initialization of cluster tree in GRGPF algorithm.

Answer
1. The clusters are organized into a tree, and the nodes of the tree may be very large, perhaps disk blocks or pages, as in the case of a B-tree, which the cluster-representing tree resembles.
2. Each leaf of the tree holds as many cluster representations as can fit.
3. A cluster representation has a size that does not depend on the number of points in the cluster.
4. An interior node of the cluster tree holds a sample of the clustroids of the clusters represented by each of its subtrees, along with pointers to the roots of those subtrees.

Clustering for streams :
3. The sizes of buckets obey the restriction that there is one or two of each size, up to some limit. They are required only to form a sequence where each size is twice the previous size, such as 3, 6, 12, 24.
4. The contents of a bucket consist of :
a. The size of the bucket.
b. The timestamp of the bucket, that is, the most recent point that contributes to the bucket.
c. A collection of records that represent the clusters into which the points of that bucket have been partitioned. These records contain :
i. The number of points in the cluster.
ii. The centroid or clustroid of the cluster.
iii. Any other parameters necessary to enable us to merge clusters and maintain approximations to the full set of parameters for the merged cluster.
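A sketch of the bucket record described above as a data structure (field names are illustrative, not from the text):

from dataclasses import dataclass, field
from typing import List

@dataclass
class ClusterRecord:
    n_points: int                 # number of points in the cluster
    centroid: List[float]         # centroid (or clustroid) of the cluster
    extra: dict = field(default_factory=dict)   # parameters needed for merging

@dataclass
class Bucket:
    size: int                     # number of stream points covered (3, 6, 12, ...)
    timestamp: int                # most recent point contributing to the bucket
    clusters: List[ClusterRecord] = field(default_factory=list)

b = Bucket(size=6, timestamp=1024,
           clusters=[ClusterRecord(n_points=4, centroid=[0.5, 1.2])])
print(b)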