Regression, Classification and Clustering
Topics to be discussed
DATA MINING
REGRESSION
CLASSIFICATION
CLUSTERING
DATA MINING
Definition
Exploring data to discover hidden information
Models of data mining
Predictive Model
Makes predictions about data values using known results found from different data objects.
Descriptive Model
Identifies patterns or relationships in data. Explores the properties of the data examined; does not predict new properties.
REGRESSION
Definition
Numeric prediction of the value of a dependent variable.
Relationship
The relationship between the dependent and independent variable(s) can be expressed through a mathematical equation.
Types of regression
Linear regression: y = c + mx, where c and m are regression coefficients.
Multiple linear regression: y = c0 + c1x1 + c2x2 + ... + cnxn, where c0, c1, ..., cn are regression coefficients and x1, x2, ..., xn are independent variables.
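The slides do not say how the coefficients c and m are estimated. A minimal sketch, assuming ordinary least squares and a small made-up data set (the function name fit_line and the data points are illustrative, not from the slides):

```python
def fit_line(xs, ys):
    """Least-squares estimates (c, m) for the line y = c + m*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line passes through the point of means
    c = mean_y - m * mean_x
    return c, m

# Made-up illustrative data lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
c, m = fit_line(xs, ys)
print(c, m)  # 1.0 2.0
```

With noisy data the points would not lie exactly on the line, and the same formulas would return the line that minimizes the squared vertical distances.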
Regression Continued
A regression model is selected when:
- Prediction of a continuous or numerical value is needed.
- The relationship between predictor and response can be expressed as a curve or a mathematical equation.
Regression is not suitable when:
- The data may not fit a linear model.
- The linear fit is poor because of noise or outliers in the data.
- The data is non-numeric.
CLASSIFICATION
Definition
Predicts the class membership of data instances.
Classes are non-overlapping.
Height-based classification
The output follows these division criteria:
- Tall: Height ≥ 2 m
- Medium: 1.7 m < Height < 2 m
- Short: Height ≤ 1.7 m
Classify <Pat, F, 1.6> using KNN with K = 5.
- The five nearest neighbours are {<Kristina, F, 1.6>, <Kathy, F, 1.6>, <Stephanie, F, 1.7>, <Dave, M, 1.7>, <Wynette, F, 1.75>}.
- Four of the five neighbours are Short and one is Medium, so Pat is Short.
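The KNN vote in this example can be sketched as follows. This is a minimal illustration, assuming the thresholds read as Tall at 2 m or more, Medium between 1.7 m and 2 m, and Short at 1.7 m or less, and using distance in height only. The full training set is not shown in the slides, so the code uses only the five listed neighbours; the function names are hypothetical.

```python
from collections import Counter

def height_class(height_m):
    """Label a height using the assumed division criteria."""
    if height_m >= 2.0:
        return "Tall"
    if height_m > 1.7:
        return "Medium"
    return "Short"

def knn_classify(query_height, training, k):
    """Majority vote among the k training rows closest in height."""
    nearest = sorted(training, key=lambda row: abs(row[2] - query_height))[:k]
    votes = Counter(height_class(row[2]) for row in nearest)
    return votes.most_common(1)[0][0]

# Only the five neighbours named in the example; the rest of the data set is not given.
neighbours = [
    ("Kristina", "F", 1.6),
    ("Kathy", "F", 1.6),
    ("Stephanie", "F", 1.7),
    ("Dave", "M", 1.7),
    ("Wynette", "F", 1.75),
]

print(knn_classify(1.6, neighbours, k=5))  # Short
```

Four of the five neighbours fall in the Short class and one in Medium, so the majority vote returns Short.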
Validation Criteria
CLUSTERING
Definition
Grouping of similar items
Groups are not predefined
[Figure: four clusters]
Clustering Algorithms
E.g., suppose we have to group the students in a class, and the grouping is based on the students' intelligence level.
Similarity Measure
The intelligence level of the students can be estimated by giving a quiz.
Marks obtained by the nine students: {2, 4, 10, 12, 3, 20, 30, 11, 25}
Students whose marks differ only slightly should be grouped together.
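The slides do not name the algorithm used to group these marks. As one possible sketch, the code below applies k-means to the nine marks, assuming k = 2 and taking the first two marks as the initial means. Both choices are assumptions made for illustration, as is the function name kmeans_1d.

```python
def kmeans_1d(values, k, max_iter=100):
    """Plain 1-D k-means; the first k values serve as the initial means."""
    means = [float(v) for v in values[:k]]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: put each value in the cluster with the nearest mean
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - means[i]))
            clusters[nearest].append(v)
        # Update step: recompute each mean (keep the old one if a cluster is empty)
        new_means = [
            sum(c) / len(c) if c else means[i] for i, c in enumerate(clusters)
        ]
        if new_means == means:
            break  # assignments are stable
        means = new_means
    return clusters, means

marks = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(marks, k=2)
print(clusters)  # [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
print(means)     # [7.0, 25.0]
```

Under these assumptions the students with marks up to 12 end up in one group and those with 20 or more in the other. A different k or different starting means can give a different grouping.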
Clustering Algorithm
Result Validation
If the clusters do not make sense, go back to the prior stage.
Check the data set for clustering tendency.
Selection Criteria
Simplification
Useful in data concept construction
Unsupervised learning
Validation Criteria
External criteria: Entropy, F-Measure, NMI-Measure, Purity
Internal criteria: Sum of Squared Error (SSE), BIC, CH, DB, SIL, DUNN
Relative criteria: Entropy, SSE
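As an illustration of one internal criterion, the Sum of Squared Error adds up the squared distance of every point from the mean of its cluster. The sketch below computes it for the nine quiz marks under two hypothetical groupings. The groupings are assumptions chosen for illustration, not results stated in the slides.

```python
def sse(clusters):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        centroid = sum(cluster) / len(cluster)
        total += sum((x - centroid) ** 2 for x in cluster)
    return total

# Hypothetical grouping of the quiz marks into two clusters
two_groups = [[2, 3, 4, 10, 11, 12], [20, 25, 30]]
# The same marks treated as a single cluster
one_group = [[2, 3, 4, 10, 11, 12, 20, 25, 30]]

print(sse(two_groups))  # 150.0
print(sse(one_group))   # 798.0
```

A lower SSE means tighter clusters. Because SSE can only decrease or stay the same as clusters are split further, it is most informative when comparing groupings with the same number of clusters.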