Machine Learning
Rarely a Straight Line
With machine learning there’s rarely a straight line from start to finish
—you’ll find yourself constantly iterating and trying different ideas and
approaches. This chapter describes a systematic machine learning
workflow, highlighting some key decision points along the way.
Machine Learning Challenges
It takes time to find the best model to fit the data. Choosing the
right model is a balancing act. Highly flexible models tend to overfit
data by modeling minor variations that could be noise. On the
other hand, simple models may assume too much. There are always
tradeoffs between model speed, accuracy, and complexity.
In the next sections we’ll look at the steps in more detail, using a
health monitoring app for illustration. The entire workflow will be
completed in MATLAB®.
The trained model (or classifier) will be integrated into an app to
help users track their activity levels throughout the day.
1. Sit down holding the phone, log data from the phone,
and store it in a text file labeled “Sitting.”
We store the labeled data sets in a text file. A flat file format such
as text or CSV is easy to work with and makes it straightforward to
import data.
We import the data into MATLAB and plot each labeled set.
[Figure: plots of the raw data, showing outliers in the activity-tracking data.]
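As a quick illustration, here is a minimal sketch of importing and plotting one labeled log with readtable; the file name Sitting.txt matches step 1, but the column name AccX is an assumption:

    % Import one labeled log and plot the raw signal.
    sitting = readtable('Sitting.txt');   % flat text file from step 1
    plot(sitting.AccX)                    % AccX is an assumed column name
    title('Sitting - raw accelerometer data')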
To preprocess the data, we check for outliers and missing values. We could simply ignore the missing values, but this would reduce the size of the data set. Alternatively, we could substitute approximations for the missing values by interpolating or using comparable data from another sample.
Finally, we divide the data into two sets. We save part of the data for testing (the test set) and use the rest (the training set) to build models. This is referred to as holdout validation, a useful cross-validation technique.
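A minimal sketch of a holdout split using cvpartition; the table activityData and its Activity label column are assumed names:

    % Hold out 25% of the observations for testing, stratified by label.
    cv = cvpartition(activityData.Activity, 'HoldOut', 0.25);
    trainData = activityData(training(cv), :);   % training set
    testData  = activityData(test(cv), :);       % test set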
For the activity tracker, we want to extract features that capture the
frequency content of the accelerometer data. These features will
help the algorithm distinguish between walking (low frequency)
and running (high frequency). We create a new table that includes
the selected features.
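One hedged way to compute such a frequency feature is with fft; the signal accel (one axis of accelerometer samples) and the sample rate fs are assumed names:

    % Dominant frequency of the signal as one candidate feature.
    N = numel(accel);
    Y = abs(fft(accel - mean(accel)));   % magnitude spectrum, mean removed
    f = (0:N-1) * fs / N;                % frequency axis in Hz
    half = 2:floor(N/2);                 % keep positive frequencies, skip DC
    [~, idx] = max(Y(half));
    domFreq = f(half(idx));              % higher for running than for walking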
The number of features that you could derive is limited only by your imagination. However, many techniques are commonly used for different types of data.
When building a model, it’s a good idea to start with something simple; it will be faster to run and easier to interpret. We start with a basic decision tree.
To see how well it performs, we plot the confusion matrix, a table that compares the classifications made by the model with the actual class labels that we created in step 1.
[Figure: a simple decision tree splitting on feat53 at 335.449, and its confusion matrix of true class vs. predicted class.]
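A minimal sketch of this step with fitctree and confusionmat, reusing the assumed trainData and testData tables from the holdout split:

    % Train a simple decision tree and evaluate it on the test set.
    treeMdl = fitctree(trainData, 'Activity');
    pred    = predict(treeMdl, testData);
    cm      = confusionmat(testData.Activity, pred)   % rows: true class, columns: predicted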
Next we try a K-nearest neighbors (KNN) classifier, a simple algorithm that stores all the training data, compares new points to the training data, and returns the most frequent class of the “K” nearest points. That gives us 98% accuracy, compared to 94.1% for the simple decision tree. The confusion matrix looks better, too. However, KNNs take a considerable amount of memory to run, since they require all the training data to make a prediction.
We try a linear discriminant model, but that doesn’t improve the results. Finally, we try a multiclass support vector machine (SVM). The SVM does very well: we now get 99% accuracy.
[Figure: confusion matrices for the KNN model (98% accuracy) and the SVM model (99% accuracy), with true class on the rows and predicted class on the columns for Sitting, Standing, Walking, Running, and Dancing.]
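A hedged sketch of this model comparison using fitcknn, fitcdiscr, and fitcecoc (MATLAB's multiclass SVM wrapper); the accuracy line assumes the Activity labels are categorical:

    % Try progressively different models on the same training data.
    knnMdl = fitcknn(trainData, 'Activity', 'NumNeighbors', 5);
    ldaMdl = fitcdiscr(trainData, 'Activity');    % linear discriminant
    svmMdl = fitcecoc(trainData, 'Activity');     % multiclass SVM (one-vs-one by default)
    acc = mean(predict(svmMdl, testData) == testData.Activity)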
Improving a model can take two different directions: make the model simpler or add complexity. A good model includes only the features with the most predictive power. A simple model that generalizes well is better than a complex model that may not generalize well to new data. In machine learning, as in many other computational processes, simplifying the model makes it easier to understand, more robust, and more computationally efficient.
Simplify
First, we look for opportunities to reduce the number of features. Popular feature reduction techniques include:
• Correlation matrix – shows the relationships between variables, so that features that are not highly correlated with the response can be removed.
• Principal component analysis (PCA) – eliminates redundancy by finding a combination of features that captures key distinctions between the original features and brings out strong patterns in the dataset (see the sketch after this list).
• Sequential feature reduction – iteratively removes features until there is no improvement in model performance.
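As an example of the second technique, a minimal PCA sketch with the pca function; features is an assumed numeric matrix of the derived features:

    % Standardize, then keep enough components to explain 95% of the variance.
    [coeff, score, ~, ~, explained] = pca(zscore(features));
    numComp = find(cumsum(explained) >= 95, 1);
    reducedFeatures = score(:, 1:numComp);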
Next, we look at ways to reduce the size of the model itself.
Add Complexity
If a simple model cannot reliably separate the classes, we can instead add complexity, for example by combining several simple models into an ensemble or by adding more informative features.
If the model can reliably classify activities on the test data, we’re
ready to move it to the phone and start tracking.
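One possible route to the phone is generating C code from the trained model; saveLearnerForCoder and loadLearnerForCoder support this workflow, though the snippet below is only a sketch with assumed names:

    % Save the trained model in a code-generation-compatible form.
    saveLearnerForCoder(svmMdl, 'activityModel');

    % In predictActivity.m, an entry-point function for MATLAB Coder:
    function label = predictActivity(features)  %#codegen
        mdl   = loadLearnerForCoder('activityModel');
        label = predict(mdl, features);
    end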
Ready for a deeper dive? Explore these resources to learn more about
machine learning methods, examples, and tools.
Watch
Machine Learning Made Easy 34:34
Signal Processing and Machine Learning Techniques for Sensor Data Analytics 42:45
Read
Supervised Learning Workflow and Algorithms
Data-Driven Insights with MATLAB Analytics: An Energy Load Forecasting Case Study
Explore
MATLAB Machine Learning Examples
Classify Data with the Classification Learner App
© 2016 The MathWorks, Inc. MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See mathworks.com/trademarks for a list of additional trademarks.
Other product or brand names may be trademarks or registered trademarks of their respective holders.