A Sampling of Various Other Learning Methods


1
Decision Tree Induction

2
Decision Tree Induction
An example decision tree to solve the problem of how to spend
my free time (play soccer or go to the movies?)

Outlook
  Sunny    -> Humidity
                High   -> Movies
                Normal -> Soccer
  Overcast -> Soccer
  Rain     -> Wind
                Strong -> Movies
                Normal -> Soccer

3
Decision Tree Induction

The Decision Tree can alternatively be thought of as a collection of rules:

R1: Outlook=sunny, Humidity=high → Decision=Movies

R2: Outlook=sunny, Humidity=normal → Decision=Soccer

R3: Outlook=overcast → Decision=Soccer

R4: Outlook=rain, Wind=strong → Decision=Movies

R5: Outlook=rain, Wind=normal → Decision=Soccer

4
Decision Tree Induction

The Decision Tree can also be thought of as concept learning:

For example, the concept “good day for soccer” is the disjunction
of the following conjunctions:

(Outlook=sunny and Humidity=normal) or

(Outlook=overcast) or

(Outlook=rain and Wind=normal)
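
To make the rule and concept views concrete, here is a minimal sketch (in Python; not part of the original slides) that encodes rules R1–R5 and the “good day for soccer” concept; the function names are illustrative only.

def decide(outlook, humidity, wind):
    # Rules R1-R5 from the example tree; returns "Soccer" or "Movies".
    if outlook == "sunny":
        return "Movies" if humidity == "high" else "Soccer"    # R1, R2
    if outlook == "overcast":
        return "Soccer"                                        # R3
    if outlook == "rain":
        return "Movies" if wind == "strong" else "Soccer"      # R4, R5

def good_day_for_soccer(outlook, humidity, wind):
    # The same concept as a disjunction of conjunctions.
    return ((outlook == "sunny" and humidity == "normal")
            or outlook == "overcast"
            or (outlook == "rain" and wind == "normal"))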

5
Decision Tree Induction

Decision Trees can be automatically learned from data using either
information-theoretic criteria or a measure of classification performance.
The basic induction procedure is very simple in principle:

1. Start with an empty tree
2. Put at the root of the tree the variable that best classifies the
training examples
3. Create branches under the variable corresponding to its values
4. Under each branch repeat the process with the remaining
variables
5. Until we run out of variables or sample
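
The sketch below (Python; illustrative only) shows one way this procedure can be realized, assuming discrete-valued attributes stored as dictionaries and using entropy reduction as the “best classifies” criterion (one of the options discussed on the next slide); names such as induce_tree and best_variable are made up for the example.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of the outcome distribution at a node.
    counts = Counter(labels)
    return -sum(c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

def best_variable(rows, labels, variables):
    # "Best classifies" here = the variable whose split leaves the least
    # weighted entropy in the resulting subgroups (maximum homogeneity).
    def remaining_entropy(var):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[var], []).append(y)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return min(variables, key=remaining_entropy)

def induce_tree(rows, labels, variables):
    # Step 5: stop when the node is pure or we run out of variables/sample.
    if len(set(labels)) == 1 or not variables:
        return Counter(labels).most_common(1)[0][0]         # majority-vote leaf
    var = best_variable(rows, labels, variables)            # step 2
    tree = {var: {}}
    for value in {row[var] for row in rows}:                # step 3: one branch per value
        subset = [(r, y) for r, y in zip(rows, labels) if r[var] == value]
        tree[var][value] = induce_tree([r for r, _ in subset],          # step 4: recurse
                                       [y for _, y in subset],
                                       [v for v in variables if v != var])
    return tree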

6
Decision Tree Induction

Notes:
o “best classifies” can be determined on the basis of maximizing
homogeneity of outcome in the resulting subgroups, cross-validated
accuracy, best fit of some linear regressor, etc.
o DTI is best for:
  - Discrete domains
  - Target functions with discrete outputs
  - Disjunctive/conjunctive descriptions required
  - Training data may be noisy
  - Training data may have missing values

7
Decision Tree Induction

Notes (cont’d):
o DTI can represent any finite discrete-valued function
o Extensions for continuous variables do exist
o Search is typically greedy and thus can be trapped in local
minima
o DTI is very sensitive to high feature-to-sample ratios; when
many features each contribute a little to classification, DTI does
not do well
o DT models are highly intuitive, and easy to explain and
use, even without computing equipment available

8
Supplementary Readings
– S.K. Murthy: “Automatic Construction of decision trees from data: A
multi-disciplinary survey”. Data Mining and Knowledge Discovery, 1997

9
Genetic Algorithms

10
Genetic Algorithms
Evolutionary Computation (Genetic Algorithms & Genetic
Programming) is motivated by the success of evolution as a robust
method for adaptation found in nature.
The standard/prototypical genetic algorithm is simple:

1. Generate randomly a population P of p hypotheses
2. Compute the fitness of each member hi of P
3. Repeat
a. Create a random sample Ps from P by choosing each hi with probability
proportional to the relative fitness of hi to the total fitness of all hj
b. Augment Ps with cross-over offspring of the remaining hypotheses,
chosen with the same probability as in step (a)
c. Change members of Ps at random by bit-mutations
d. Replace P by Ps and compute the new fitness of each member of P
4. Until enough generations have been created or a good enough hypothesis
has been generated
5. Return the best hypothesis
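
A minimal sketch of this prototypical loop (Python; hypotheses are assumed to be bitstrings and fitness is a user-supplied function; parameter names and default values are illustrative, not prescribed by the slides):

import random

def select(population, fitnesses):
    # Fitness-proportionate ("roulette wheel") selection of one hypothesis.
    return random.choices(population, weights=fitnesses, k=1)[0]

def crossover(h1, h2):
    # Single-point crossover of two bitstrings.
    point = random.randrange(1, len(h1))
    return h1[:point] + h2[point:], h2[:point] + h1[point:]

def mutate(h, rate):
    # Flip each bit independently with probability `rate`.
    return "".join(b if random.random() > rate else "10"[int(b)] for b in h)

def genetic_algorithm(fitness, p=100, n_bits=6, crossover_frac=0.6,
                      mutation_rate=0.01, generations=100):
    P = ["".join(random.choice("01") for _ in range(n_bits)) for _ in range(p)]
    for _ in range(generations):                        # step 4: generation budget
        F = [fitness(h) for h in P]                     # step 2: fitness of each member
        # (a) sample from P with probability proportional to relative fitness
        Ps = [select(P, F) for _ in range(int((1 - crossover_frac) * p))]
        # (b) augment Ps with cross-over offspring of fitness-selected parents
        while len(Ps) < p:
            c1, c2 = crossover(select(P, F), select(P, F))
            Ps.extend([c1, c2])
        # (c) random bit mutations; (d) replace P by Ps
        P = [mutate(h, mutation_rate) for h in Ps[:p]]
    return max(P, key=fitness)                          # step 5: best hypothesis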

11
Genetic Algorithms
Representation of hypotheses in GAs is typically a bitstring so that the
mutation and cross-over operations can be achieved easily.
E.g., consider encoding clinical decision-making rules:
variable1: fever {yes, no}
variable2: x_ray {positive, negative}
variable3: diagnosis {flu, pneumonia}
Rule1: fever=yes and x_ray=positive → diagnosis=pneumonia
Rule2: fever=no and x_ray=negative → diagnosis=flu or pneumonia

Bitstring representation:
R1: 10 10 01
R2: 01 01 11

(note: we can constrain this representation by using fewer bits, the fitness
function, and syntactic checks)
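
One possible way to realize this encoding (a Python sketch; the two-bits-per-variable layout follows the slide, while everything else, such as the FIELDS table and the encode function, is an illustrative assumption):

# Two bits per variable, one bit per allowed value, in the order listed above.
FIELDS = [("fever",     ["yes", "no"]),
          ("x_ray",     ["positive", "negative"]),
          ("diagnosis", ["flu", "pneumonia"])]

def encode(rule):
    # `rule` maps each variable to the set of values the rule allows.
    return "".join("1" if value in rule[name] else "0"
                   for name, values in FIELDS for value in values)

print(encode({"fever": {"yes"}, "x_ray": {"positive"},
              "diagnosis": {"pneumonia"}}))              # 101001 = 10 10 01 (Rule1)
print(encode({"fever": {"no"}, "x_ray": {"negative"},
              "diagnosis": {"flu", "pneumonia"}}))       # 010111 = 01 01 11 (Rule2)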
12
Genetic Algorithms
Let’s cross over these rules at a random point:

R1: 10 10 01
R2: 01 01 11
Gives:

R1’: 10 01 11
R2’: 01 10 01

And mutation at one random bit may give:

R1’’: 10 00 11
R2’’: 01 10 01
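
Before interpreting these, the same manipulations can be reproduced in code (a sketch; the crossover point and mutated bit are fixed here to match the example rather than drawn at random):

R1, R2 = "101001", "010111"               # 10 10 01  and  01 01 11

point = 2                                 # crossover after the first two bits
R1p = R1[:point] + R2[point:]             # "100111" = 10 01 11
R2p = R2[:point] + R1[point:]             # "011001" = 01 10 01

bit = 3                                   # mutate the 4th bit of R1'
R1pp = R1p[:bit] + ("0" if R1p[bit] == "1" else "1") + R1p[bit + 1:]
print(R1pp)                               # "100011" = 10 00 11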

Which is interpreted as:

Rule1’’: fever=yes and x_ray=unknown → diagnosis=flu or pneumonia
Rule2’’: fever=no and x_ray=positive → diagnosis=pneumonia

13
Genetic Algorithms

Notes:
• The population size, cross-over rate, and mutation rate are
parameters that are set empirically
• There exist variations of how to do cross over, how to select
hypotheses for mutation/cross-over, how to isolate subpopulations,
etc.
• Although it may appear at first that the process of finding better
hypotheses relies totally on chance, this is not the case. Several
theoretical results (the most famous being the “Schema Theorem”)
prove that exponentially more better-fit hypotheses are being
considered than worse-fit ones (as a function of the number of generations).
• Furthermore, due to the discrete nature of the optimization, local minima
will trap the algorithm less, but it also becomes more difficult to find
the global optimum.
• It has been shown that GAs perform an implicit parallel search over
hypothesis templates without explicitly generating them (“Implicit
Parallelism”).
14
Genetic Algorithms
Notes:
• GAs are “black box” optimizers (i.e., applied without any special
knowledge about the problem structure); sometimes they are applied
appropriately to learn models when no better alternative can be
reasonably found, and when they do have a chance for finding a
good solution.
• There exist cases however when much faster and provably sound
algorithms can (and should) be used, as well as cases where
uninformed heuristic search is provably not capable of finding a
good solution or scale up to large problem inputs (and thus should
not be used).

Consider, for example, the problems of finding the shortest path
between two cities on a map, of sorting numbers, of solving a linear
program, of fitting a linear model, etc.; for all these cases better and
faster special-purpose algorithms exist and should be used instead.

15
In addition:
– The No Free Lunch Theorem (NFLT) for Optimization states that
no black box optimizer is better than any other averaged over all
possible distributions and objective functions
– There are broad classes of problems for which problem-solving
with GAs is NP-hard
– There are types of target functions that GAs cannot learn
effectively (e.g., “Royal Road” functions as well as highly
epistatic functions)
– The choice of parameters is critical in producing a solution; yet
finding the right parameters is NP-hard in many cases
– Due to extensive evaluation of hypotheses it is easy to overfit
– The “Biological” metaphor is conceptually useful but not crucial;
there have been equivalent formulations of GAs that do not use
concepts such as “mutation”, “cross-over” etc.

16
Supplementary Readings

– Belew et al.: “Optimizing an arbitrary function is hard for the genetic
algorithm”. Proc. Intl. Conf. on Genetic Algorithms, 1991

– O.J. Sharp: “Towards a rational methodology for using evolutionary
search algorithms”. PhD thesis, University of Essex, 2000

– R. Salomon: “Raising theoretical questions about the utility of genetic
algorithms”. 6th Annual Conf. on Evolutionary Programming, 1997

– R. Salomon: “Derandomization of Genetic Algorithms”. EUFIT ’97 – 5th
European Congress on Intelligent Techniques and Soft Computing

– S. Baluja et al.: “Removing the Genetics from the Standard Genetic
Algorithm”. Proceedings of the 12th Annual Conference on Machine
Learning, 1995, pp. 38–46.

17
K-Nearest Neighbors

18
K-Nearest Neighbors
Assume we wish to model patient response to treatment; suppose we have
seen the following cases:

Patient#  Treatment type  Genotype  Survival
---------------------------------------------
1         1               1         1
2         1               2         2
3         1               1         1
4         1               2         2
5         2               1         2
6         2               2         1
7         2               1         2
8         2               2         1
(Notice the very strong interaction between treatment and genotype in
determining survival)
19
K-Nearest Neighbors
Say we want to predict the outcome for a patient i that received treatment 1
and is of genotype class 2. KNN searches for the K most similar cases in
the training database (using Euclidean Distance or another similarity
metric):

ED(xi, xj) = √( Σk (xi,k − xj,k)² )

For example, for patient #1 and the new patient i:

Patient#  Treatment type  Genotype  Survival
---------------------------------------------
1         1               1         1
i         1               2         ?

ED = √((1−1)² + (1−2)²) = 1

20
K-Nearest Neighbors

Similarly the distances of case i to all training cases are:

Patient#  ED(Patient#, Pi)  Survival
-------------------------------------
1         1                 1
2         0                 2
3         1                 1
4         0                 2
5         1.4               2
6         1                 1
7         1.4               2
8         1                 1

Now let’s rank training cases according to distance to case i


21
K-Nearest Neighbors
Patient#  ED(Patient#, Pi)  Survival
-------------------------------------
2         0                 2
4         0                 2
3         1                 1
1         1                 1
6         1                 1
8         1                 1
5         1.4               2
7         1.4               2

As we can see, the training case most similar to i has outcome 2. The 2 training cases
most similar to i have a median outcome 2. The 3 training cases most similar to i
have a median outcome 2, and so on. We say that for K=1 the KNN predicted
value is 2, for K=2 the predicted value is 2, and so on.
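
A minimal sketch of the computation above (Python; the data, the Euclidean distance, and the use of the median as the prediction rule all follow the slides):

import math
from statistics import median

# (treatment, genotype, survival) for the 8 training patients
train = [(1, 1, 1), (1, 2, 2), (1, 1, 1), (1, 2, 2),
         (2, 1, 2), (2, 2, 1), (2, 1, 2), (2, 2, 1)]
new = (1, 2)                                     # patient i: treatment 1, genotype 2

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# rank training cases by distance to the new patient
ranked = sorted(train, key=lambda row: euclidean(row[:2], new))

def knn_predict(k):
    return median(row[2] for row in ranked[:k])

print(knn_predict(1), knn_predict(2), knn_predict(3))    # 2  2.0  2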

22
K-Nearest Neighbors
To summarize:

KNN is based on a case-based reasoning framework.
It has good asymptotic properties for K>1.
It is straightforward to implement (of course care has to be given to
variable encoding, variable relevance, and the distance metric); efficient
implementation is not easy since it requires specialized data structures.
It is used in practice as:
o a baseline comparison for new methods
o a component algorithm for “wrapper” feature selection methods
o a non-parametric density estimator

23
Clustering

24
Clustering
Unsupervised class of methods
Basic idea: group similar items together and different items
apart
Countless variations:
o of what constitutes “similarity” (may be distance in feature space, may
be other measures of association)
o of what will be clustered (patients, features, time series, cell lines,
combinations thereof, etc.)
o of whether clusters are “hard” (no multi-membership) or “fuzzy”
o of how clusters will be built and organized (partitional,
agglomerative, non-hierarchical methods)
Uses:
o Taxonomy (e.g., identify molecular subtypes of disease)
o Classification (e.g., classify patients according to genomic
information)
o Hypothesis generation (e.g., if genes are highly “co-expressed” then
this may suggest they are in the same pathway)
25
Clustering
K-means clustering: We want to partition the data into k most-similar
groups

1. Choose k cluster centers (“centroids”) to coincide with k randomly
chosen patterns (or arbitrarily chosen points in the pattern space)
2. Repeat
3. Assign each pattern in the data to the cluster with the closest centroid
4. Recompute the centroids
5. Until convergence (i.e., few or no re-assignments, or a small decrease in an
error function such as the total sum of squared errors of each pattern in a
cluster from the centroid of that cluster)

Variations:
- Selection of good initial partitions
- Splitting/merging of resulting clusters
- Various similarity measures and convergence criteria
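
A minimal sketch of this loop for one-dimensional data, of the kind used in the example on the next slide (Python; the convergence test used here, “no change in centroids”, is one of the criteria listed in step 5):

import random

def kmeans_1d(data, k, max_iter=100):
    centroids = random.sample(data, k)                # step 1: k random patterns
    for _ in range(max_iter):
        # step 3: assign each pattern to the cluster with the closest centroid
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # step 4: recompute centroids (keep the old one if a cluster is empty)
        new_centroids = [sum(c) / len(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:                # step 5: convergence
            break
        centroids = new_centroids
    return clusters, centroids

# e.g. the data from the next slide:
print(kmeans_1d([2, 3, 9, 10, 11, 12], k=2))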

26
Clustering (k-means)
e.g., (K=2)
A B C D E F
2 3 9 10 11 12

Step 1: (arbitrarily)

[A B C D] [E F]

Centroid1=6, centroid2=11.5

Step 2:

[A B] [C D E F]

Centroid1=2.5, centroid2=10.5

-------(a further pass produces no re-assignments; algorithm stops)--------
27
Clustering
Agglomerative Single Link:

1. Start with each pattern belonging to its own cluster


2. Repeat
3. Join the two clusters that have the smallest pair-wise distance
4. Until all patterns are in one cluster

Note:
- Inter-cluster distance between clusters A and B is computed as
the minimum distance of all pattern pairs (a,b) s.t. a belongs to A
and b to B

28
Clustering (ASL)
e.g.,
A B C D
1 2 5 7

Step 1:
[A] [B] [C] [D] smallest distance [A] [B]=1

Step 2:
[A B] [C] [D] smallest distance [C] [D]=2

Step 3:
[A B] [C D] smallest distance [A B] [C D]=3

Step 4:
[A B C D]

-------(algorithm stops)--------

29
Clustering (ASL)
e.g.,
A B C D E F
1 2 5 7 11 12

Step 1: [A] [B] [C] [D] [E] [F] smallest distance [A] [B]=1 OR [E] [F]=1

Step 2: [A B] [C] [D] [E] [F] smallest distance [E] [F]=1


Step 3: [A B] [C] [D] [E F] smallest distance [C] [D]=2
Step 4: [A B] [C D] [E F] smallest distance [A B] [C D]=3
Step 5: [A B C D] [E F] smallest distance [A B C D] [E F]=4
Step 6: [A B C D E F] -------(algorithm stops)--------

Schematic representation via the “dendrogram”:

[dendrogram with leaves A B C D E F]
30
Clustering
Agglomerative Complete Link:

1. Start with each pattern belonging to its own cluster


2. Repeat
3. Join the two clusters that have the smallest pair-wise distance
4. Until all patterns are in one cluster

Note:
- Inter-cluster distance between clusters A and B is computed as the
maximum distance of all pattern pairs (a,b) s.t. a belongs to A and b to B
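
Both agglomerative variants differ only in the inter-cluster distance, so a single sketch covers them (Python; linkage=min gives single link, linkage=max gives complete link; it prints the merge sequence, from which the dendrograms can be read off):

def agglomerate(points, linkage=min):
    # linkage=min -> single link, linkage=max -> complete link (1-D points).
    clusters = [[p] for p in points]             # start: each pattern on its own
    while len(clusters) > 1:
        # find the pair of clusters with the smallest inter-cluster distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        print(f"merge {clusters[i]} and {clusters[j]} at distance {d}")
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters[0]

agglomerate([1, 2, 5, 7, 11, 12], linkage=min)   # single link example above
agglomerate([1, 2, 5, 7, 11, 12], linkage=max)   # complete link example below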

31
Clustering (ACL)
e.g.,
A B C D E F
1 2 5 7 11 12

Step 1: [A] [B] [C] [D] [E] [F] smallest distance [A] [B]=1 OR [E] [F]=1

Step 2: [A B] [C] [D] [E] [F] smallest distance [E] [F]=1


Step 3: [A B] [C] [D] [E F] smallest distance [C] [D]=2
Step 4: [A B] [C D] [E F] smallest distance [A B] [C D]=6
Step 5: [A B C D] [E F] smallest distance [A B C D] [E F]=11
Step 6: [A B C D E F] -------(algorithm stops)--------

With dendrogram:

[dendrogram with leaves A B C D E F]

32
Clustering

Clustering has been very prevalent so far in bioinformatics:

• Papers with a MeSH indexing keyword of “statistics” account for 5.9%
of all PubMed articles.
• In oligo array papers this jumps to 13.3%.
• Cluster analysis accounts for 26% of statistics-related papers on
oligo arrays, and 16.7% of genetic network-related papers.
• In Nature Genetics, cluster analysis is used in 71.4% of all statistics-
related papers.
• In CAMDA 2000, cluster analysis was used in 27% of all papers.

33
Clustering
Caveats:
a. There does not exist a good understanding of how to translate from
“A and B cluster together” to “A and B are dependent/independent,
causally/non-causally”
b. There exist very few studies outlining what can be learned or cannot
be learned with clustering methods (learnability), how reliably
(validity, stability), with what sample (sample complexity). Such
analyses exist for a variety of other methods. The few existing
theoretical results point to significant limitations of clustering
methods.
c. Other comments: visual appeal, familiarity, small sample, no explicit
assumptions to check, accessibility, tractability.

34
