JMP SUMMIT EUROPE 2018 - Data Mining Under The Curse of Dimensionality (Gianpaolo Polsinelli - LFoundry Italy)


Data Mining for Asymmetric Data Sets under the Curse of Dimensionality

Finding the most influential yield predictors in a big and noisy data set at a semiconductor fab

"I really don't trust statistics much. A man with his head in a hot oven and his feet in a freezer has, statistically, an average body temperature."
Charles Bukowski
Introduction
Gianpaolo Polsinelli, Felice Russo
Semiconductor manufacturing is one of the most technologically complicated manufacturing processes. Because of the high number of process steps and the high number of sensors, this industry faces a huge torrent of data. On top of the sheer volume of production data, the imbalance between passing and failing parts makes these data sets difficult to analyze.

With so many variables, the standard one-variable-at-a-time technique can fail because of the combined influence of a large number of manufacturing variables. Data by itself isn't useful: to be useful it must be converted into actionable information to drive yield and product quality improvement. This is where Machine Learning (ML) comes in.

To avoid model over-fitting, the dimensionality of the sample must also be reduced. In other words, to increase the signal-to-noise ratio of the available data we need to reduce the number of features before applying any ML model. Once the interesting patterns have been extracted from the database, they are validated against the engineers' experience.
Workflow: Explore and clean data → Stratification and dimensionality reduction → Find relationships → Predict future observations

The entire process was implemented using JMP 13 JSL features.
Curse of Dimensionality
Gianpaolo Polsinelli, Felice Russo
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds
or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.
The expression was coined by Richard E. Bellman when considering problems in dynamic programming.
There are multiple phenomena referred to by this name in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining,
and databases. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the
available data become sparse.
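
This sparsity effect is easy to demonstrate numerically. The short sketch below (illustrative Python, not part of the talk's JSL workflow) samples points uniformly in the unit hypercube and shows how the ratio between the nearest and farthest distances from a query point approaches 1 as the dimension grows, so "near" and "far" lose their meaning:

import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    points = rng.uniform(size=(200, dim))   # 200 points in the unit hypercube
    query = rng.uniform(size=dim)           # one query point
    dists = np.linalg.norm(points - query, axis=1)
    # As dim grows, min and max distances become nearly equal:
    # no point is meaningfully "nearer" than any other.
    print(f"dim={dim:5d}  min/max distance ratio = {dists.min() / dists.max():.3f}")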

Machine learning algorithms tend to over-fit the data when the sample has a lot of predictor variables: there is a number of features above which the performance of an ML model will degrade rather than improve.

Besides the curse of dimensionality, the data mining engineer has to deal with unbalanced samples too. Example: the analysis of a few failing wafers against a big set of good wafers.

Solution for unbalanced and high-dimensionality data:
• Unbalanced data → Stratification: rows reduction, bad/good group balancing
• High dimensionality → Dimensionality reduction: feature selection, columns reduction
Data Cleaning
Gianpaolo Polsinelli, Felice Russo

Anyway, before starting we need to clean the data…

• Remove empty and stagnant (constant-value) columns
• Impute missing values
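
The slide shows these steps only as a diagram. As a rough illustration of the same cleaning steps (the talk implements them in JSL; the pandas version below is an assumption, not the authors' script, and the median/mode imputation choice is illustrative):

import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Remove completely empty columns.
    df = df.dropna(axis="columns", how="all")
    # Remove stagnant columns (a single value, i.e. no information).
    stagnant = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    df = df.drop(columns=stagnant)
    # Impute missing values: column median for numeric, mode for categorical.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df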


Weighted Stratification
Gianpaolo Polsinelli, Felice Russo
Database Size Reduction

In order to balance the class distributions (good vs. bad), random under-sampling can be used: good-class examples are randomly eliminated until the two classes have a comparable size. This is done using the JMP Subset function with the random/stratify option.

Target frequencies, original table:
Level       Count   Prob
bad            12   0.06000
good          188   0.94000
Total         200   1.00000
N Missing       0
2 Levels: bad, good

Target frequencies, stratified table:
Level       Count   Prob
bad            12   0.19355
good           50   0.80645
Total          62   1.00000
N Missing       0
2 Levels
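
Outside JMP, the same under-sampling step can be sketched in a few lines of pandas (an illustration, not the authors' JSL script; the column name Target and the 50-wafer quota are taken from the tables above):

import pandas as pd

def undersample(df: pd.DataFrame, target: str, n_good: int, seed: int = 1) -> pd.DataFrame:
    # Keep every bad wafer; randomly keep only n_good of the good wafers.
    bad = df[df[target] == "bad"]
    good = df[df[target] == "good"].sample(n=n_good, random_state=seed)
    return pd.concat([bad, good])

# Example: reduce 12 bad / 188 good to the 12 bad / 50 good shown above.
# reduced = undersample(dt, target="Target", n_good=50)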

Weighting Stratification

By selecting the Weight column option, a new COLUMN is created. The following values are loaded into it:
• 1 for BAD samples
• less than 1 for GOOD samples (according to the formula below)

$$\mathrm{Weight\_Column} = \frac{\#\ \mathrm{of\ Bad\ samples}}{\#\ \mathrm{of\ Good\ samples}}$$

The user then chooses either the new sampling percentage:
• % of random rows reduction
• % of rows reduction stratified by the Weight Column

or the new table size:
• number of random rows reduction
• number of rows reduction stratified by the Weight Column
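
The weight column itself is simple to reproduce. The sketch below is an illustrative pandas version of the formula above (the actual script builds the column in JSL):

import numpy as np
import pandas as pd

def add_weight_column(df: pd.DataFrame, target: str) -> pd.DataFrame:
    # Weight = 1 for bad samples, (# bad / # good) < 1 for good samples.
    n_bad = (df[target] == "bad").sum()
    n_good = (df[target] == "good").sum()
    out = df.copy()
    out["Weight"] = np.where(out[target] == "bad", 1.0, n_bad / n_good)
    return out

# With 12 bad and 188 good wafers each good wafer gets weight 12/188 ≈ 0.064,
# so the weighted class sizes are equal (188 × 12/188 = 12).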

Now let's see how to minimize the over-fitting issue…


Feature Selection
Gianpaolo Polsinelli, Felice Russo
Feature Selection allows selecting the most influential parameters before creating any model. Three techniques are implemented:

"Predictor Screening"
Uses bootstrap forest partitioning to evaluate the contribution of predictors on the response. Predictor Screening can identify predictors that might be weak alone but strong enough when used in combination with other predictors. In case the response is numerical, inter-correlated predictors can decrease the accuracy of the ML model; to avoid that, the script uses the VIF parameter.
Predictors type: Numeric/Categorical. Response type: Numeric/Categorical.

"Fisher's Score"
The Fisher distance is a measure of dissimilarity between two probability distributions:

$$FS(j) = \frac{\big(\mu_j(\mathrm{bad}) - \mu_j(\mathrm{good})\big)^2}{\sigma_j^2(\mathrm{bad}) + \sigma_j^2(\mathrm{good})}$$

Each feature receives a score using the above formula, and a subset of features is obtained by ranking them.
Predictors type: Numeric. Response type: Categorical [Good/Bad].

"Clustering - Cluster Variables"
Constructs components that are linear combinations of variables in a cluster of similar variables. A substantial part of the variation in a large set of variables can be represented by cluster components or by the most representative variable in each cluster.
Predictors type: Numeric. Response type: Categorical [Good/Bad].
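
As a concrete reading of the Fisher score formula, and of the VIF check mentioned under Predictor Screening, here is an illustrative numpy/pandas sketch; it is not the JSL implementation used in the talk, and the VIF helper is one standard way to compute it, not necessarily JMP's:

import numpy as np
import pandas as pd

def fisher_scores(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    # FS(j) = (mu_j(bad) - mu_j(good))^2 / (sigma_j^2(bad) + sigma_j^2(good))
    bad, good = X[y == "bad"], X[y == "good"]
    scores = (bad.mean() - good.mean()) ** 2 / (bad.var() + good.var())
    return scores.sort_values(ascending=False)  # rank features, best first

def vif(X: pd.DataFrame) -> pd.Series:
    # VIF_j = 1 / (1 - R^2) from regressing feature j on all the others;
    # large values flag inter-correlated predictors.
    out = {}
    for col in X.columns:
        a = np.column_stack([X.drop(columns=col).to_numpy(),
                             np.ones(len(X))])   # other features + intercept
        target = X[col].to_numpy()
        coef, *_ = np.linalg.lstsq(a, target, rcond=None)
        resid = target - a @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

Features at the top of the Fisher ranking are kept; predictors with a large VIF can be dropped before fitting a numeric-response model.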

Going to the next step, the user can choose among different MACHINE LEARNING models.
The X variables will be the main predictors found by the selected algorithm.
Machine Learning and Model Validation
Gianpaolo Polsinelli, Felice Russo

Three models are implemented… The example below relates to modeling with the Cluster Variables technique.
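
The original slide illustrates this step with a screenshot of the JMP output, which is not reproduced here. As a rough stand-in, the sketch below fits one possible classifier on synthetic data with a stratified validation split; scikit-learn, the random forest choice, and the data shapes are all illustrative assumptions, since the talk builds its three models in JMP:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (the real input is the fab data set after
# cleaning, stratification, and feature selection; 62 rows as above).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(62, 5)),
                 columns=[f"pred_{i}" for i in range(5)])
y = np.array(["bad"] * 12 + ["good"] * 50)
X.loc[y == "bad", "pred_0"] += 2.0   # make one predictor informative

# Stratified split preserves the bad/good ratio in the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_val, model.predict(X_val)))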

The script collects the resulting graphs on a dashboard.
