CSIS 5420 Final Exam - Answers (13 Jul 05)


CSIS 5420 Final Exam Answers

1. (50 Points) Data pre-processing and conditioning is one of the key factors that determine
whether a data mining project will be a success. For each of the following topics, describe the
effect this issue can have on our data mining session and what techniques we can use to
counter the problem.

a. Noisy data

b. Missing data

c. Data normalization and scaling

d. Data type conversion

e. Attribute and instance selection

A. Noisy data
Noisy data can be thought of as random error in data (174) and is especially a problem in very large
data sets. Noise can take many forms. First, true duplicate records, which may be splintered into a variety
of disparate entries, cause a great deal of noise (153). This stems from the fact that one key piece of
information may be recorded in a variety of ways: John Smith, Smith, John, J. Smith, Smith, J.,
J.A. Smith, J. Anthony Smith, and Jonathan Smith may all describe the same person. Second, noise may
be attributable to mis-entered field information (154); serious errors occur when words are present in a
number-only field. Finally, incorrect values that fall far outside the normal range of an attribute will
surely cause noise.

We can attack the data noise problem with a number of techniques. First, running basic statistics on a
field can locate both incorrect values and entries; an unusually high variance or a data type error will flag
some candidate noise (154). Second, we can remove noise using data smoothing, both externally and
internally. It is possible to smooth data using another data mining technique such as neural networks, which
perform natural smoothing in their processing (154). It is also possible to remove or replace atypical data
instances with mean values or null values (154). While this forces the analyst to deal with the problem of
missing data, it prevents the harm done by noisy values. Finally, it is possible to use off-the-shelf
data cleansing tools to help correct noisy data. Such tools are valuable because they can be customized
(e.g., to recognize candidate names that refer to the same entity).
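To make the statistics-based approach concrete, here is a minimal sketch (my own illustration, not drawn from the text) that flags values far from a field's mean and replaces them, leaving the choice between a mean fill and a null to the analyst. The z-score cutoff and the example ages are invented.

```python
# A minimal sketch: flag numeric values far from the field mean and smooth
# them, leaving the analyst to decide whether a null would be better.
from statistics import mean, stdev

def smooth_field(values, z_cutoff=3.0):
    """Replace values more than z_cutoff standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    cleaned = []
    for v in values:
        if s > 0 and abs(v - m) / s > z_cutoff:
            cleaned.append(m)      # or None, to treat it as missing data
        else:
            cleaned.append(v)
    return cleaned

ages = [34, 29, 41, 37, 980, 33]   # 980 is almost certainly mis-entered
print(smooth_field(ages, z_cutoff=2.0))
```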

B. Missing data
The main difficulty inherent in missing data is figuring out whether a value is missing due to omission, an inability
to find or enter a value, or whether a null value represents something tangible. Missing values are not necessarily
harmless. A null value in a field like sex would typically mean the value was not filled in
(everyone must be M or F, whether it is recorded or not); however, a null value in a field like veteran
status may indicate an unclear answer, an unwillingness to answer, or something else entirely.

We typically handle missing data in three distinct ways so that we may continue with the process
of data mining (155). First, we can ignore the missing values as if they have no weight on the
outcome. Bayes classifiers are excellent at ignoring missing data; they simply build a probability value
from the complete, existing information. This, however, makes us even more reliant on the quality of
the rest of the data. Second, we may try to compare and classify instances with missing values against existing
complete entries; that is, if a complete and an incomplete instance match except for the missing
field, they are considered similar. This is dangerous with very noisy data, in that dissimilar instances may
appear to be very much alike (155). Third, we may take the pessimistic approach of treating incomplete
instances as entirely different from all other complete and incomplete instances. In this fashion we make
sure that misclassifications do not occur.
Your comment about null values is quite true. This is the reason that many data modelers will
often create entry codes to represent "unknown" or "other", so that a null value can only mean
that the value was not entered and is truly missing.
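As a small illustration of the first two strategies (and of the point about explicit "unknown" codes), here is a hedged sketch in Python; the field names and records are invented.

```python
# A hedged sketch of two simple strategies: ignore nulls when computing a
# statistic, or fill them with an explicit code so "missing" stays visible.
records = [
    {"sex": "M", "veteran": "Y"},
    {"sex": None, "veteran": "N"},
    {"sex": "F", "veteran": None},
]

# Strategy 1: ignore instances with a null in the field of interest.
known_sex = [r["sex"] for r in records if r["sex"] is not None]
print("known sex values:", known_sex)

# Strategy 2: replace nulls with an explicit code so "missing" is visible.
fill = "UNKNOWN"   # or the most common known value, to impute the mode instead
for r in records:
    if r["veteran"] is None:
        r["veteran"] = fill
print(records)
```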

C. Data normalization and scaling


Many data mining techniques require that data be transformed or standardized into a more palatable
format for input into the technique (like neural networks). There are many reasons for normalizing. First,
we may want to make the data and calculations more manageable for the technique. This may involve
scaling data into a computationally easier format. Second, we may want to make the data more
meaningful. By standardizing input data based upon statistical processes (like mean and standard
deviation) we are better able to compare seemingly disparate data to a normed standard.

Most of the techniques for data normalization and scaling rely on basic algebra and statistics (156).
First, decimal scaling maps a larger real-valued range to a more manageable interval (such as
(0,1) or (-1,1)) (156). Not only does this make calculations easier, it also prepares the data for entry into
techniques such as neural networks. Second, min-max normalization frames the data points
in the context of the demonstrated range of values; that is, if data exists only between 10 and 20 we are
not concerned with values outside that area. Third, we can normalize using statistical distributions such as
the standard normal distribution. Finally, we can use logarithmic transformations to flatten out disparate
values or make categorical data more meaningful.
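A minimal sketch of three of the scaling approaches named above (decimal scaling, min-max, and z-score standardization), written as plain functions rather than taken from any particular tool; the income values are invented.

```python
# Three common scaling approaches, written as small standalone functions.
from statistics import mean, stdev

def decimal_scale(values):
    """Divide by a power of 10 large enough to map values into (-1, 1)."""
    factor = 10 ** len(str(int(max(abs(v) for v in values))))
    return [v / factor for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Map values from their observed range onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Standardize values by the field's mean and standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

incomes = [18000, 52000, 61000, 250000]
print(decimal_scale(incomes))
print(min_max(incomes))
print(z_score(incomes))
```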

D. Data type conversion


Data type conversions can be an extraordinarily tricky proposition. First, when working on a data mining
project you may encounter legacy (i.e., non-current) data formats that do not mesh well with one
another. When transforming data to current standards we must ensure that no detail or
granularity is lost in the process. Second, in converting the data we must ensure that the
original data is not destroyed. If we transformed an all-categorical data set into numerical data for entry into a
neural network, it would be foolish to discard the original data formats or the transformations; just
because we begin working with one representation does not mean we will stay with it.

Techniques to counter such conversion problems begin with a clear statement of how data will
be transformed and to what standard it should conform. By having an agreed-upon goal, and a
written plan for how to attain it, there is at least a precedent for how data transformations should
occur.
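As one concrete example of a conversion that preserves the original data, here is a hypothetical sketch that turns a categorical field into numeric indicator columns (for, say, a neural network) while keeping the original value alongside; the field name and records are invented.

```python
# A small illustration: convert a categorical field to numeric indicator
# columns while deliberately keeping the original value, so nothing is lost.
def one_hot(records, field):
    categories = sorted({r[field] for r in records})
    for r in records:
        for c in categories:
            r[f"{field}_{c}"] = 1.0 if r[field] == c else 0.0
        # the original categorical value stays in the record untouched
    return records

data = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]
print(one_hot(data, "color"))
```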

E. Attribute and instance selection


Attribute and instance selection certainly will dictate the quality of the model and results that you obtain
through your data mining technique. Certainly, we cannot enter every instance and every attribute into a
training or test set due to time and validity constraints. Thus, we must make choices about what is important
to use. First, by realizing what technique we are utilizing (and its corresponding requirements) we can
obtain a better idea of what transformations and attributes will be necessary. Second, we can whittle
down the number of attributes based on what our desired output is and what variables are of no interest
to us. It is important not to dismiss candidate input variables that might illuminate certain relationships in
the data, so this must be done with caution.

We can address these problems with a number of techniques; however, the best way to learn how to deal with
these issues may simply be experience. In genetic algorithms and neural networks it is possible to pit
attributes against other attributes in a survival-of-the-fittest contest using the given goodness functions.
Once we have a filtered, smaller set we can then run the remaining candidate attributes through a
supervised technique (such as a decision tree) to see how good they truly are. Also, we can reduce the
number of attributes by combining two or more desired attributes in a ratio or equation. Finally, in
dealing with attributes and instances we must again state that the instances (and attributes) present in training
data must truly be representative of the whole; otherwise, we risk modeling something entirely different from
our goal.
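One simple way to act on the ideas above is to score candidate attributes against the desired output and keep only the strongest; the sketch below uses correlation as the "goodness" measure, and the attribute names and threshold are invented for illustration.

```python
# A hedged sketch of attribute filtering: score each candidate input by its
# correlation with the output and keep only the strongest candidates.
from statistics import mean

def pearson(xs, ys):
    """Sample correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

attributes = {
    "age":    [25, 40, 35, 50, 23],
    "income": [30, 80, 60, 90, 28],
    "shoe":   [9, 10, 8, 11, 9],       # presumably irrelevant to the output
}
output = [0, 1, 1, 1, 0]

scores = {name: abs(pearson(vals, output)) for name, vals in attributes.items()}
selected = [name for name, s in scores.items() if s > 0.5]
print(scores, selected)
```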
2. (80 points) We've covered several data mining techniques in this course. For each
technique identified below, describe the technique, identify which problems it is best suited
for, identify which problems it has difficulties with, and describe any issues or limitations of
the technique.

a. Decision trees

b. Association rules

c. K-Means algorithm

d. Linear regression

e. Logistic regression

f. Bayes classifier

g. Neural networks

h. Genetic algorithms

For the following 8 data mining techniques I will summarize the technique (1), identify which problems it is
best suited for (2), identify which problems it has difficulties with (3), and describe any issues or limitations
of the technique (4):

A. Decision trees
1. Decision trees are simple structures where non-terminal nodes represent tests on one or more
attributes and terminal nodes reflect decision outcomes (9). They are a very popular supervised learning
technique involving training. A set of instances provides a subset of training data to help the model learn
how data should be classified. Additional test data instances are then categorized using the model (68). If the
instances are classified correctly, we are done; if they are not, the tree is altered until the test data
set is classified correctly or is exhausted (68).
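A minimal sketch of this train-then-test cycle, using scikit-learn's DecisionTreeClassifier as a stand-in for the text's tree-building algorithm; the attributes (income, age) and labels are invented.

```python
# Train a small decision tree on labeled instances, then classify test data.
from sklearn.tree import DecisionTreeClassifier

train_X = [[45000, 30], [82000, 45], [27000, 22], [61000, 38]]
train_y = ["no", "yes", "no", "yes"]        # single-valued categorical output

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(train_X, train_y)                  # learn tests on the attributes

test_X = [[70000, 41], [30000, 25]]
print(tree.predict(test_X))                 # classify unseen test instances
```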

2. Decision trees are a popular technique because they can tackle a wide array of problems (9). They can
model both categorical (i.e., discrete) data and continuous data by creating branches based upon ranges
of values (WWW1). Moreover, they are very easy to understand and apply to solving real-world problems
(78, WWW2). Problems that need some formal display of logic, rules, or classifications are good candidates for
decision trees. Decision trees are easily converted into sets of production rules (43), which carry basic
probabilistic quantifications. They are also good at indicating which fields are most significant in
differentiating groups; these are the fields at or near the root of the tree (WWW1).

3. While they might overall be the easiest of the data mining techniques, decision trees are not ideal for
solving every problem. The output attribute of a tree must be strictly categorical and single valued (78).
The choice of training data may cause slight to major instabilities in the algorithm's choice of branching
(78). Moreover, decision trees may become complex and less intuitive as the number of input variables
dramatically increases (78). Increased complexity may lead to unreasonable time, funds, and resources
spent on training (WWW1).

4. As noted in 3, output attributes of decision trees are limited to single-valued, categorical attributes (78).
While easy to visualize on a small scale, when data becomes more complex or has a time
dimension (i.e., time series), the resulting tree might not benefit the problem (WWW1). In
many fields where continuous output variables are frequent, such as finance and medicine, another
method may be better (WWW1).

B. Association Rules
1. Association rules are a mining technique used to discover interesting associations between attributes
contained in a database (49). They are unique in that a single variable may be considered as an
input and/or an output attribute (78). The most popular use of association rules is in the technique also
known as market basket analysis (78). The main ways of gauging association rules are (101-102; a small sketch of the first two follows the list):
--confidence (the conditional probability that B is true when A is true in an A => B rule)
--support (the minimum percentage of instances in the database that contain all items listed in a
given association rule)
--accuracy (how often A => B really holds)
--coverage (what percentage of instances having characteristic B also have characteristic A)
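A small sketch (my own, not the text's) of how support and confidence for a candidate rule A => B can be computed from a set of market baskets; the baskets themselves are invented.

```python
# Compute support and confidence for a candidate rule A => B from baskets.
baskets = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer", "diapers"},
    {"milk", "chips"},
]

def support(itemset, baskets):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Conditional probability of the consequent given the antecedent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_a, rule_b = {"beer"}, {"chips"}
print("support:", support(rule_a | rule_b, baskets))       # 2/4 = 0.50
print("confidence:", confidence(rule_a, rule_b, baskets))  # 2/3 ~ 0.67
```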

2. Association rules are frequently used in problems with several desired, multi-valued outputs (49).
Problems requiring clear, understandable results with practical answers are good candidates for
this technique (WWW2). Association rules are a good method for problems where unsupervised mining is
required (WWW1). Depending on the size and complexity of the data set, one can literally search out all
possible combinations of input and output attributes. Also, if the data set tends to contain records of variable
length, this technique might be a good fit (WWW1).

3. The association rules technique is a poor choice on extremely large data sets due to the
combinatorially taxing process of comparing the various inputs and outputs (WWW1); in this case, finding rules
by exhaustion is not feasible. Problems with many data attributes are also poor candidates
(WWW2). Problems where rare or outlier-like occurrences are of great importance are also poor
candidates for association rules (49). Association rules typically have cutoff levels for their basic measures,
and rare occurrences that fall under such a threshold may be discarded for not meeting the minimum
criteria even though they are nonetheless important.

4. The power of association rules may be in the eye of the beholder. Requirements for minimum
coverage, support, and confidence can all be set based on the tolerance or desire of an individual data
miner. Thus, the ability of the technique to shed light on a problem may indeed rest in the hands of the
experimenter. Moreover, even numerically valued rule measures may be of little value when applied to the
practical world. Knowing that there is an 80% chance that someone who buys beer on a Sunday might
also buy potato chips may be trivial to a company or individual; it is the odd occurrences (e.g., buying beer
and diapers) that are of great value. Additionally, one may want to use experience and other techniques to
broaden the data set and its investigation to determine whether events are causal, correlative, or due to
random chance, which may or may not be important.

When they work, they are also easy to explain to the business.

C. K-Means Algorithm

1. The K-means algorithm is a popular means of performing unsupervised clustering when not much is
known about the data set. The iterative algorithm can be broken down into five steps (84):
--Choose the number of clusters to be formed (K)
--Choose K random points from the set of N instances to be used as the initial centers of the K clusters
--Use the Euclidean distance formula to assign each of the remaining N-K points to the nearest center
--Calculate new cluster centers based on the mean of the points in each cluster
--If the new centers equal the old centers, we are done; otherwise, repeat steps 3-5
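A hedged sketch of the five steps above for two-dimensional points in plain Python; the value of K and the sample points are invented.

```python
# K-means for 2-D points: pick random centers, assign points by Euclidean
# distance, recompute centers from cluster means, repeat until stable.
import math
import random

def k_means(points, k, max_iter=100):
    centers = random.sample(points, k)                      # step 2
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                    # step 3: nearest center
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]                                                   # step 4: cluster means
        if new_centers == centers:                          # step 5: stable, done
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
print(k_means(points, k=2))
```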

2. The K-means algorithm is very helpful in cases where little is known about the data set except for a
specific goal: the number of clusters. This kind of clustering can be performed on a wide array of real-valued
data types (WWW1). The computation process can also be adapted to problems that require only a
good-enough solution by using cutoff values for closeness or for the number of iterations. It works well using
brute force to determine cluster centers.
3. The algorithm has problems with categorical data (88). If categorical data needs to be clustered in this
fashion, it requires good data transformations to make the data real valued and consistent. If the miner
has no idea how many clusters are desired, the algorithm may produce useless results (88): too high an
estimate will create singleton-like clusters, while too low an estimate will create large clusters that are
grouped without practical meaning. Finally, problems that require an optimal solution may not be good
candidates for this technique (87). While the clusters are guaranteed to form stably, the optimization may
be local (good enough) rather than globally optimal (87). Problems with a great many data points, or
with large n-tuples, may be computationally taxing and require a large amount of time and resources to
reach a stable solution.

4. The K-means algorithm, for all its positives, is a somewhat clumsy technique. Many more adept clustering
algorithms obtain clusters faster or with better initial center choices. Another limitation of this technique is
that while the clusters are stable, they are not consistent: the choice of initial centers greatly
influences the final clustering solution (233). Since the K-means algorithm does a poor job of providing an
explanation for its results, it is often difficult to determine which variables are relevant or irrelevant to the
solution (89). This can be remedied by testing the unsupervised clusters with a supervised data mining
technique (such as decision trees or association rules) to gauge the goodness of the model.

D. Linear Regression

1. Linear regression is a popular and powerful statistical technique that relates one or more input
variables to a dependent variable. The general and simple linear equations appear below (292):

f(x1, x2, ..., xn) = a1*x1 + a2*x2 + ... + an*xn + c
and
y=ax+b

where the x's are independent variables, the a's are coefficients to be determined, b is the y-intercept, and c is
a constant. The mathematics and linear algebra that support such endeavors are powerful, reasonably
easy to understand, and applicable in a wide array of cases. Further, many common statistics and
measures can be used with confidence to gauge the goodness of such a model.
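A minimal sketch of fitting the simple model y = ax + b by the standard least-squares formulas; the data points are invented and chosen to lie near the line y = 2x.

```python
# Fit y = ax + b by ordinary least squares using the closed-form solution.
from statistics import mean

def fit_line(xs, ys):
    """Return the slope a and intercept b minimizing squared error."""
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]
a, b = fit_line(xs, ys)
print(f"y = {a:.2f}x + {b:.2f}")    # roughly y = 2x + 0
```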

2. Linear regression, naturally, works quite well on distributions where the implicit or explicit relationships
are approximately linear. Linear regression works well on problems where the data is in real-valued,
numerical form. Data with continuous ranges also works quite well in linear regression (298).

3. Linear regression tends not to work well on problems with curvilinear, polynomial, logistic, or scattered
tendencies (i.e., most problems using real data; simplifying assumptions are often made to
support the use of linear regression) (298-299). The technique also tends to work less effectively as the
number of independent variables grows, because adding more and more less-important
variables to the model drags down the performance of its predictions. Another
group of problems that linear regression has difficulty with is categorical-variable problems. Categorical
variables, especially binary variables, cannot be properly captured by a model that emphasizes real-valued
coefficients and estimates (299).

4. One limitation of the technique, dealing with non-linear data, can be mitigated by using weighted
values or transformations of the input variables (WWW3). In this fashion, while the input variables may be
adjusted, the linear model is still one of many valid ways to get a handle on the data. One issue with
measuring the sample correlation coefficient, r, in a model is that a value between -1 and +1 may not be
sufficient to determine a relationship in a larger data set. While an r-value might not be near 1, it may still
be significant once the model and a graphical representation of it are evaluated.
Very Good

E. Logistic regression
1. Logistic regression is an extension of the linear regression technique. Logistic regression is defined by
the equation (300):

p(y = 1 | x) = e^(ax + c) / (1 + e^(ax + c))
where the exponent ax + c corresponds to the linear equation in D.1. This process still models a dependent variable based upon one or more
independent variables. However, this technique allows a wider array of problems to be solved and reduces the
sum of squared errors (300).
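A small sketch of the logistic curve defined above, with hypothetical (not fitted) values for a and c, showing how p(y=1|x) approaches 1 as ax + c grows.

```python
# Evaluate the logistic function p(y=1|x) = e^(ax+c) / (1 + e^(ax+c)).
import math

def logistic(x, a, c):
    return math.exp(a * x + c) / (1 + math.exp(a * x + c))

a, c = 1.5, -3.0          # hypothetical coefficients from a fitted model
for x in (0, 2, 4, 6):
    print(x, round(logistic(x, a, c), 3))   # climbs toward 1.0 as x grows
```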

2. Logistic regression works very well on problems that simple linear regression cannot normally handle:
categorical or binary problems (299). As seen earlier in the class, the equation in E.1 leads to a
cumulative probability curve where, as x increases, the P value gets asymptotically close to 1.0
(301). A similar trend can be shown with x decreasing and P going towards 0.0. Thus, if a variable can only
take on 2, ..., n values, this scale gives an intuitive feel for the probability. In fact, each coefficient can
be thought of as a measure of probability for that given input variable, which leads to very powerful results.

3. Logistic regression may not be the best choice for problems whose input variables are mostly
continuous; the time spent on transformation could be better used on goodness-of-fit calculations. Along
those lines, logistic regression does not have as many goodness-of-fit measures (WWW4). Also, the logistic
machinery may be intimidating to novice or casual analysts dealing with data mining.

4. One limitation of logistic regression is that it works best over iterative attempts. Once a first trial model is
created, one might be better off limiting the input variables to a select few that have proven to be
probabilistically significant (i.e., have high logistic coefficients).

F. Bayes Classifier

1. This technique is a way to classify instances in a supervised manner based upon one thing: conditional
probability. In statistics, Bayes' theorem is stated as (302):

P(H | E) = P(E | H) * P(H) / P(E)

This equation yields the probability that a hypothesis is true based on given evidence that we know is
true. Bayesian probability can easily be calculated as long as we know (or have reliable estimates of) the
hypothesis, evidence, and conditional probabilities.
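A hedged sketch of the theorem applied to counts from a small, invented table; the churn/complaint scenario is purely illustrative.

```python
# Estimate the component probabilities from counts, then apply Bayes' theorem.
def bayes(p_e_given_h, p_h, p_e):
    return p_e_given_h * p_h / p_e

# Hypothesis H: "customer churns"; evidence E: "customer filed a complaint".
p_h = 40 / 200                    # 40 of 200 customers churned
p_e = 50 / 200                    # 50 of 200 filed a complaint
p_e_given_h = 25 / 40             # 25 of the 40 churners filed a complaint

print(bayes(p_e_given_h, p_h, p_e))   # P(churn | complaint) = 0.5
```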

2. Bayesian techniques work very well on small to moderately sized, countable data sets where calculations of
probability are based on work already done. This technique also works surprisingly well on a number of
common, problematic issues. First, the technique accounts well for zero-valued attribute counts (305):
if an attribute count is zero, the Bayesian equations add a small nonzero constant to
ensure nonzero answers. Thus, a 0 in a field still has some probability associated with it. Second, the
technique similarly deals well with problems containing missing data. If an instance with multiple input values
has a null field, the Bayes classifier simply discards that variable and creates a probability based on
all the non-null input variables (306).

3. This technique does not work well with problems where multiple input variables have a degree of
dependence on one another (302). Such a situation would cause conditional probabilities to
be falsely elevated. Another type of problem that Bayesian techniques do not work well with is sets with
unequally weighted input attributes (302). The simple Bayesian approach assumes that all attributes are of
equal importance; if this is not true, the final probabilities will be in error.
4. One point of interest is that Bayesian classification can be done on numerical data utilizing a slightly
altered version of the normal probability distribution. All we truly need to know, in this case, is the class
mean, the standard deviation, and the value of the numerical attribute in question (307). However, we must note
that this approach assumes the attribute is approximately normally distributed (an assumption that becomes
more reasonable as the number of instances increases).

G. Neural networks

1. Neural networks are a mathematical model that attempts to mimic the human brain (246). This data
mining technique is built upon interconnected layers of nodes which allow input data to be transformed
and altered into a final output answer by feeding information forward (246). The layers typically
consist of a larger input layer followed by one (or more) hidden layers and a final one-node output layer.
Random initial weights are assigned to the connections, and the values are literally fed forward via a series
of calculations. As the input and output data are in the form of real-valued numbers in [0,1], it is relatively
easy to see what the output is truly telling us.
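A minimal sketch of feeding one instance forward through a tiny untrained network (two inputs, two hidden nodes, one output node) with sigmoid activations; the weights are random and no training is shown.

```python
# Feed one instance forward through a 2-2-1 network with sigmoid activations.
import math
import random

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def feed_forward(inputs, hidden_weights, output_weights):
    hidden = [sigmoid(sum(w * i for w, i in zip(ws, inputs))) for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

random.seed(0)
hidden_weights = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
output_weights = [random.uniform(-1, 1) for _ in range(2)]

print(feed_forward([0.2, 0.9], hidden_weights, output_weights))  # value in [0, 1]
```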

An interesting aside: neural networks were developed to model physical neural
activity and better understand the mechanics of the biology. ANN purists frown
on the use of neural networks as data analysis tools. I side with the broader
interpretation of ANN use. I believe that since they model the categorization and
classification methods we use to learn, they validly mimic the conclusions
we could make with our wetware (brains).
2. Neural networks can be used on a variety of problems because of their ability to do both unsupervised and
supervised learning. One can certainly use neural networks on data sets containing large amounts of
noisy input data (256). This is a large advantage on practical problems, where even the
best data sets may be incomplete. The technique is also good at dealing with categorical or numerical
data (WWW1). Neural networks are applicable in a wide range of business, science, economic, and more
liberal domains (WWW2). Finally, their practicality on problems involving time series makes them popular,
due to their ability to account for the dimension of time (256).

3. Despite all their positives, neural networks do have difficulty with some types of problems. They
require transforming data onto the real-valued [0,1] interval, which may be difficult to do
meaningfully (WWW2). Also, neural networks may converge to a solution that is less than optimal
(257). Further, the solutions to some problems with limited data may perform very well under training
circumstances but fail on test data due to overtraining (257). Finally, problems that require easily
explainable results may not be the best candidates for a neural network (WWW1). Neural
networks may produce results, even consistently, that may not have much significance (without using other
functions to gauge those results).

4. Since explaining a neural network's output is a huge issue, there are a number of techniques for gauging
its solutions. First, one can use sensitivity analysis, a process of making slight changes to the inputs and
observing how the network responds, to better understand and construct the network (255). Second, one may
use the average member technique to ensure that the data set is well represented in the training data (255).
Finally, one can use backpropagation, which adjusts the network's weights based on the error between the
actual and the desired output.

H. Genetic Algorithms

1. Genetic algorithms are a data mining technique based on Charles Darwin's theory of evolution (101).
This technique is quite popular due to its ability to form novel, unique solutions through supervised and
unsupervised applications. The basic algorithm can be described as (90):
--Transform desired elements into a chromosome string, most often coded in binary
--Create an initial population of possible solution chromosomes
--Create a fitness function to measure the quality of chromosomes
--Use the fitness function to either keep chromosomes intact, or reject them and create new
strings using one of three genetic operators:
1. Crossover: creating new elements by copying/combining bits of existing elements
2. Mutation: randomly flipping a 0/1 in an existing chromosome to introduce novelty
3. Selection: replacing unfit chromosomes with entire copies of good chromosomes
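A hedged sketch of this loop on a toy problem (evolving a string of all 1s), showing selection, crossover, and mutation; the fitness function and all parameters are invented for illustration.

```python
# Toy genetic algorithm: evolve a binary chromosome toward all 1s.
import random

LENGTH, POP, GENERATIONS = 12, 8, 40
fitness = lambda chrom: sum(chrom)                 # goodness of a chromosome

def crossover(a, b):
    cut = random.randrange(1, LENGTH)              # copy/combine bits of parents
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.05):
    return [bit ^ 1 if random.random() < rate else bit for bit in chrom]

random.seed(1)
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    survivors = population[: POP // 2]             # selection: drop the least fit
    children = [mutate(crossover(*random.sample(survivors, 2)))
                for _ in range(POP - len(survivors))]
    population = survivors + children

print(max(population, key=fitness))                # best chromosome found
```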

2. Genetic algorithms work well on problems that require explainable results (WWW2). They can also
handle a vast array of data (with some transformation) and can be used in a supervised or unsupervised
manner (WWW2). Because of this natural ability to work with other techniques, genetic algorithms mesh
well with neural network problems (98). They also work well on practical problems that traditional
methods find very time-consuming and difficult, like scheduling and optimization (98).

3. To begin, genetic algorithms do not work well on problems where a great deal of transformation is
required to get the data into a sufficient state for chromosome-based learning (WWW2).
The technique is guaranteed to converge on optimized solutions; however, these may be
local rather than global optima (98). Problems that require quick answers based on easily available
techniques may not be the best fit for this method. Training and explanation of how to build and use
these algorithms may be too costly (in terms of time and money) for a novice analyst (WWW2).

4. The main limitation of genetic algorithms is that their performance and results are only as good as their
fitness function (98). To benefit the user, a fitness function must reflect what was originally intended
and must have a significant amount of practical value associated with it. Another difficulty in the iterative
process is deciding which genetic operator (crossover, mutation, selection) to use, when, and
how often. While mutation finds novel solutions that might not appear otherwise, the process is
computationally draining and yields diminishing returns, since most of the trials will be in error. Moreover,
overusing an operator like selection may lead to a model appearing to be viable when in fact a certain
family of chromosomes is dominating the rest of the population.

A variation of genetic algorithms that is worth considering is evolutionary algorithms (EAs). Where a GA
can be thought of as modeling a population of individual genomes to develop the fittest solution
(a being), EAs model a population of solutions (individuals) and develop the fittest solution from
this population. EAs work well in problems where it may be difficult or undesirable to express our
model as a binary string. The members of an EA population are represented as the set of
coefficients of our fitness equation. Think about how much easier it would be to solve a traveling
salesman problem as an EA rather than as a GA.
