Machine Learning with R: Learn techniques for building and improving machine learning models, from data preparation to model tuning, evaluation, and working with big data
By Brett Lantz
About this ebook
Learn how to solve real-world data problems using machine learning and R
Purchase of the print or Kindle book includes a free eBook in PDF format.
Key Features
The 10th Anniversary Edition of the bestselling R machine learning book, updated with 50% new content for R 4.0.0 and beyond
Harness the power of R to build flexible, effective, and transparent machine learning models
Learn quickly with this clear, hands-on guide by machine learning expert Brett Lantz
Book Description
Machine learning, at its core, is concerned with transforming data into actionable knowledge. R offers a powerful set of machine learning methods to quickly and easily gain insight from your data.
Machine Learning with R, Fourth Edition, provides a hands-on, accessible, and readable guide to applying machine learning to real-world problems. Whether you are an experienced R user or new to the language, Brett Lantz teaches you everything you need to know for data pre-processing, uncovering key insights, making new predictions, and visualizing your findings. This 10th Anniversary Edition features several new chapters that reflect the progress of machine learning in the last few years and help you build your data science skills and tackle more challenging problems, including making successful machine learning models, advanced data preparation, building better learners, and making use of big data.
You'll also find this classic R data science book updated to R 4.0.0 with newer and better libraries, advice on ethical and bias issues in machine learning, and an introduction to deep learning. Whether you're looking to take your first steps with R for machine learning or to make sure your skills and knowledge are up to date, this is an unmissable read that will help you find powerful new insights in your data.
What you will learn
Learn the end-to-end process of machine learning from raw data to implementation
Classify important outcomes using nearest neighbor and Bayesian methods
Predict future events using decision trees, rules, and support vector machines
Forecast numeric data and estimate financial values using regression methods
Model complex processes with artificial neural networks
Prepare, transform, and clean data using the tidyverse
Evaluate your models and improve their performance
Connect R to SQL databases and emerging big data technologies such as Spark, Hadoop, H2O, and TensorFlow
Who this book is for
This book is designed to help data scientists, actuaries, data analysts, financial analysts, social scientists, business and machine learning students, and any other practitioners who want a clear, accessible guide to machine learning with R. No R experience is required, although prior exposure to statistics and programming is helpful.
Brett Lantz
"Brett Lantz has spent the past 10 years using innovative data methods to understand human behavior. A sociologist by training, he was first enchanted by machine learning while studying a large database of teenagers' social networking website profiles. Since then, he has worked on interdisciplinary studies of cellular telephone calls, medical billing data, and philanthropic activity, among others. When he's not spending time with family, following college sports, or being entertained by his dachshunds, he maintains dataspelunking.com, a website dedicated to sharing knowledge about the search for insight in data."
Machine Learning with R
Fourth Edition
Learn techniques for building and improving machine learning models, from data preparation to model tuning, evaluation, and working with big data
Brett Lantz
BIRMINGHAM—MUMBAI
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Lead Senior Publishing Product Manager: Tushar Gupta
Acquisition Editor – Peer Reviews: Saby Dsilva
Project Editor: Janice Gonsalves
Content Development Editors: Bhavesh Amin and Elliot Dallow
Copy Editor: Safis Editor
Technical Editor: Karan Sonawane
Indexer: Hemangini Bari
Presentation Designer: Pranit Padwal
Developer Relations Marketing Executive: Monika Sangwan
First published: October 2013
Second edition: July 2015
Third edition: April 2019
Fourth edition: May 2023
Production reference: 1190523
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80107-132-1
www.packt.com
Contributors
About the author
Brett Lantz (@DataSpelunking) has spent more than 15 years using innovative data methods to understand human behavior. A sociologist by training, Brett was first captivated by machine learning while studying a large database of teenagers’ social network profiles. Brett is a DataCamp instructor and has taught machine learning workshops around the world. He is known to geek out about data science applications for sports, video games, autonomous vehicles, and foreign language learning, among many other subjects, and hopes to eventually blog about such topics at dataspelunking.com.
It is hard to describe how much my world has changed since the first edition of this book was published nearly ten years ago! My sons Will and Cal were born amidst the first and second editions, respectively, and have grown alongside my career. This edition, which consumed two years of weekends, would have been impossible without the backing of my wife, Jessica. Many thanks are due also to the friends, mentors, and supporters who opened the doors that led me along this unexpected data science journey.
About the reviewer
Daniel D. Gutierrez is an independent consultant in data science through his firm AMULET Analytics. He’s also a technology journalist, serving as Editor-in-Chief for insideBIGDATA.com, where he enjoys keeping his finger on the pulse of this fast-paced industry. Daniel is also an educator, having taught data science, machine learning and R classes at university level for many years. He currently teaches data science for UCLA Extension. He has authored four computer industry books on database and data science technology, including his most recent title, Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R. Daniel holds a BS in Mathematics and Computer Science from UCLA.
Join our book’s Discord space
Join our Discord community to meet like-minded people and learn alongside more than 4000 people at:
https://packt.link/r
Contents
Preface
Who this book is for
What this book covers
What you need for this book
Get in touch
Introducing Machine Learning
The origins of machine learning
Uses and abuses of machine learning
Machine learning successes
The limits of machine learning
Machine learning ethics
How machines learn
Data storage
Abstraction
Generalization
Evaluation
Machine learning in practice
Types of input data
Types of machine learning algorithms
Matching input data to algorithms
Machine learning with R
Installing R packages
Loading and unloading R packages
Installing RStudio
Why R and why R now?
Summary
Managing and Understanding Data
R data structures
Vectors
Factors
Lists
Data frames
Matrices and arrays
Managing data with R
Saving, loading, and removing R data structures
Importing and saving datasets from CSV files
Importing common dataset formats using RStudio
Exploring and understanding data
Exploring the structure of data
Exploring numeric features
Measuring the central tendency – mean and median
Measuring spread – quartiles and the five-number summary
Visualizing numeric features – boxplots
Visualizing numeric features – histograms
Understanding numeric data – uniform and normal distributions
Measuring spread – variance and standard deviation
Exploring categorical features
Measuring the central tendency – the mode
Exploring relationships between features
Visualizing relationships – scatterplots
Examining relationships – two-way cross-tabulations
Summary
Lazy Learning – Classification Using Nearest Neighbors
Understanding nearest neighbor classification
The k-NN algorithm
Measuring similarity with distance
Choosing an appropriate k
Preparing data for use with k-NN
Why is the k-NN algorithm lazy?
Example – diagnosing breast cancer with the k-NN algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Transformation – normalizing numeric data
Data preparation – creating training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Transformation – z-score standardization
Testing alternative values of k
Summary
Probabilistic Learning – Classification Using Naive Bayes
Understanding Naive Bayes
Basic concepts of Bayesian methods
Understanding probability
Understanding joint probability
Computing conditional probability with Bayes’ theorem
The Naive Bayes algorithm
Classification with Naive Bayes
The Laplace estimator
Using numeric features with Naive Bayes
Example – filtering mobile phone spam with the Naive Bayes algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – cleaning and standardizing text data
Data preparation – splitting text documents into words
Data preparation – creating training and test datasets
Visualizing text data – word clouds
Data preparation – creating indicator features for frequent words
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
Divide and Conquer – Classification Using Decision Trees and Rules
Understanding decision trees
Divide and conquer
The C5.0 decision tree algorithm
Choosing the best split
Pruning the decision tree
Example – identifying risky bank loans using C5.0 decision trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating random training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Boosting the accuracy of decision trees
Making some mistakes cost more than others
Understanding classification rules
Separate and conquer
The 1R algorithm
The RIPPER algorithm
Rules from decision trees
What makes trees and rules greedy?
Example – identifying poisonous mushrooms with rule learners
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
Forecasting Numeric Data – Regression Methods
Understanding regression
Simple linear regression
Ordinary least squares estimation
Correlations
Multiple linear regression
Generalized linear models and logistic regression
Example – predicting auto insurance claims costs using linear regression
Step 1 – collecting data
Step 2 – exploring and preparing the data
Exploring relationships between features – the correlation matrix
Visualizing relationships between features – the scatterplot matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Model specification – adding nonlinear relationships
Model specification – adding interaction effects
Putting it all together – an improved regression model
Making predictions with a regression model
Going further – predicting insurance policyholder churn with logistic regression
Understanding regression trees and model trees
Adding regression to trees
Example – estimating the quality of wines with regression trees and model trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Visualizing decision trees
Step 4 – evaluating model performance
Measuring performance with the mean absolute error
Step 5 – improving model performance
Summary
Black-Box Methods – Neural Networks and Support Vector Machines
Understanding neural networks
From biological to artificial neurons
Activation functions
Network topology
The number of layers
The direction of information travel
The number of nodes in each layer
Training neural networks with backpropagation
Example – modeling the strength of concrete with ANNs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Understanding support vector machines
Classification with hyperplanes
The case of linearly separable data
The case of nonlinearly separable data
Using kernels for nonlinear spaces
Example – performing OCR with SVMs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Changing the SVM kernel function
Identifying the best SVM cost parameter
Summary
Finding Patterns – Market Basket Analysis Using Association Rules
Understanding association rules
The Apriori algorithm for association rule learning
Measuring rule interest – support and confidence
Building a set of rules with the Apriori principle
Example – identifying frequently purchased groceries with association rules
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating a sparse matrix for transaction data
Visualizing item support – item frequency plots
Visualizing the transaction data – plotting the sparse matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Sorting the set of association rules
Taking subsets of association rules
Saving association rules to a file or data frame
Using the Eclat algorithm for greater efficiency
Summary
Finding Groups of Data – Clustering with k-means
Understanding clustering
Clustering as a machine learning task
Clusters of clustering algorithms
The k-means clustering algorithm
Using distance to assign and update clusters
Choosing the appropriate number of clusters
Finding teen market segments using k-means clustering
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – dummy coding missing values
Data preparation – imputing the missing values
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
Evaluating Model Performance
Measuring performance for classification
Understanding a classifier’s predictions
A closer look at confusion matrices
Using confusion matrices to measure performance
Beyond accuracy – other measures of performance
The kappa statistic
The Matthews correlation coefficient
Sensitivity and specificity
Precision and recall
The F-measure
Visualizing performance tradeoffs with ROC curves
Comparing ROC curves
The area under the ROC curve
Creating ROC curves and computing AUC in R
Estimating future performance
The holdout method
Cross-validation
Bootstrap sampling
Summary
Being Successful with Machine Learning
What makes a successful machine learning practitioner?
What makes a successful machine learning model?
Avoiding obvious predictions
Conducting fair evaluations
Considering real-world impacts
Building trust in the model
Putting the science in data science
Using R Notebooks and R Markdown
Performing advanced data exploration
Constructing a data exploration roadmap
Encountering outliers: a real-world pitfall
Example – using ggplot2 for visual data exploration
Summary
Advanced Data Preparation
Performing feature engineering
The role of human and machine
The impact of big data and deep learning
Feature engineering in practice
Hint 1: Brainstorm new features
Hint 2: Find insights hidden in text
Hint 3: Transform numeric ranges
Hint 4: Observe neighbors’ behavior
Hint 5: Utilize related rows
Hint 6: Decompose time series
Hint 7: Append external data
Exploring R’s tidyverse
Making tidy table structures with tibbles
Reading rectangular files faster with readr and readxl
Preparing and piping data with dplyr
Transforming text with stringr
Cleaning dates with lubridate
Summary
Challenging Data – Too Much, Too Little, Too Complex
The challenge of high-dimension data
Applying feature selection
Filter methods
Wrapper methods and embedded methods
Example – Using stepwise regression for feature selection
Example – Using Boruta for feature selection
Performing feature extraction
Understanding principal component analysis
Example – Using PCA to reduce highly dimensional social media data
Making use of sparse data
Identifying sparse data
Example – Remapping sparse categorical data
Example – Binning sparse numeric data
Handling missing data
Understanding types of missing data
Performing missing value imputation
Simple imputation with missing value indicators
Missing value patterns
The problem of imbalanced data
Simple strategies for rebalancing data
Generating a synthetic balanced dataset with SMOTE
Example – Applying the SMOTE algorithm in R
Considering whether balanced is always better
Summary
Building Better Learners
Tuning stock models for better performance
Determining the scope of hyperparameter tuning
Example – using caret for automated tuning
Creating a simple tuned model
Customizing the tuning process
Improving model performance with ensembles
Understanding ensemble learning
Popular ensemble-based algorithms
Bagging
Boosting
Random forests
Gradient boosting
Extreme gradient boosting with XGBoost
Why are tree-based ensembles so popular?
Stacking models for meta-learning
Understanding model stacking and blending
Practical methods for blending and stacking in R
Summary
Making Use of Big Data
Practical applications of deep learning
Beginning with deep learning
Choosing appropriate tasks for deep learning
The TensorFlow and Keras deep learning frameworks
Understanding convolutional neural networks
Transfer learning and fine tuning
Example – classifying images using a pre-trained CNN in R
Unsupervised learning and big data
Representing highly dimensional concepts as embeddings
Understanding word embeddings
Example – using word2vec for understanding text in R
Visualizing highly dimensional data
The limitations of using PCA for big data visualization
Understanding the t-SNE algorithm
Example – visualizing data’s natural clusters with t-SNE
Adapting R to handle large datasets
Querying data in SQL databases
The tidy approach to managing database connections
Using a database backend for dplyr with dbplyr
Doing work faster with parallel processing
Measuring R’s execution time
Enabling parallel processing in R
Taking advantage of parallel with foreach and doParallel
Training and evaluating models in parallel with caret
Utilizing specialized hardware and algorithms
Parallel computing with MapReduce concepts via Apache Spark
Learning via distributed and scalable algorithms with H2O
GPU computing
Summary
Other Books You May Enjoy
Index
Preface
Machine learning, at its core, describes algorithms that transform data into actionable intelligence. This fact makes machine learning well suited to the present-day era of big data. Without machine learning, it would be nearly impossible to make sense of the massive streams of information that are now all around us.
The cross-platform, zero-cost statistical programming environment called R provides an ideal pathway to start applying machine learning. R offers powerful but easy-to-learn tools that can assist you with finding insights in your own data.
By combining hands-on case studies with the essential theory needed to understand how these algorithms work, this book delivers all the knowledge you need to get started with machine learning and to apply its methods to your own projects.
Who this book is for
This book is aimed at people in applied fields—business analysts, social scientists, and others—who have access to data and hope to use it for action. Perhaps you already know a bit about machine learning, but have never used R; or, perhaps you know a little about R, but are new to machine learning. Maybe you are completely new to both! In any case, this book will get you up and running quickly. It would be helpful to have a bit of familiarity with basic math and programming concepts, but no prior experience is required. All you need is curiosity.
What this book covers
Chapter 1, Introducing Machine Learning, presents the terminology and concepts that define and distinguish machine learners, as well as a method for matching a learning task with the appropriate algorithm.
Chapter 2, Managing and Understanding Data, provides an opportunity to get your hands dirty working with data in R. Essential data structures and procedures used for loading, exploring, and understanding data are discussed.
Chapter 3, Lazy Learning – Classification Using Nearest Neighbors, teaches you how to understand and apply a simple yet powerful machine learning algorithm to your first real-world task: identifying malignant samples of cancer.
Chapter 4, Probabilistic Learning – Classification Using Naive Bayes, reveals the essential concepts of probability that are used in cutting-edge spam filtering systems. You’ll learn the basics of text mining in the process of building your own spam filter.
Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, explores a couple of learning algorithms whose predictions are not only accurate, but also easily explained. We’ll apply these methods to tasks where transparency is important.
Chapter 6, Forecasting Numeric Data – Regression Methods, introduces machine learning algorithms used for making numeric predictions. As these techniques are heavily embedded in the field of statistics, you will also learn the essential metrics needed to make sense of numeric relationships.
Chapter 7, Black-Box Methods – Neural Networks and Support Vector Machines, covers two complex but powerful machine learning algorithms. Though the math may appear intimidating, we will work through examples that illustrate their inner workings in simple terms.
Chapter 8, Finding Patterns – Market Basket Analysis Using Association Rules, exposes the algorithm used in the recommendation systems employed by many retailers. If you’ve ever wondered how retailers seem to know your purchasing habits better than you know yourself, this chapter will reveal their secrets.
Chapter 9, Finding Groups of Data – Clustering with k-means, is devoted to a procedure that locates clusters of related items. We’ll utilize this algorithm to identify profiles within an online community.
Chapter 10, Evaluating Model Performance, provides information on measuring the success of a machine learning project and obtaining a reliable estimate of the learner’s performance on future data.
Chapter 11, Being Successful with Machine Learning, describes the common pitfalls faced when transitioning from textbook datasets to real-world machine learning problems, as well as the tools, strategies, and soft skills needed to combat these issues.
Chapter 12, Advanced Data Preparation, introduces the set of tidyverse packages, which help wrangle large datasets to extract meaningful information to aid the machine learning process.
Chapter 13, Challenging Data – Too Much, Too Little, Too Complex, considers solutions to a common set of problems that can derail a machine learning project when the useful information is lost within a massive dataset, much like a needle in a haystack.
Chapter 14, Building Better Learners, reveals the methods employed by the teams at the top of machine learning competition leaderboards. If you have a competitive streak, or simply want to get the most out of your data, you’ll need to add these techniques to your repertoire.
Chapter 15, Making Use of Big Data, explores the frontiers of machine learning. From working with extremely large datasets to making R work faster, the topics covered will help you push the boundaries of what is possible with R, and even allow you to utilize the sophisticated tools developed by large organizations like Google for image recognition and understanding text data.
What you need for this book
The examples in this book were tested with R version 4.2.2 on Microsoft Windows, Mac OS X, and Linux, although they are likely to work with any recent version of R. R can be downloaded at no cost at https://cran.r-project.org/.
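To see how an installed copy of R compares with the tested version mentioned above, a quick check along the following lines may help. This is a minimal sketch using base R only; the variable name tested_version is an illustrative assumption and does not come from the book.

```r
# Print the version string of the running R session
print(R.version.string)

# Version the examples were tested with, per the text above
# (tested_version is a hypothetical name used only for this sketch)
tested_version <- "4.2.2"

# getRversion() returns a version object that can be compared to a string
if (getRversion() < tested_version) {
  message("This R is older than ", tested_version, "; consider upgrading.")
} else {
  message("This R is at least version ", tested_version, ".")
}
```

An older version does not necessarily mean the examples will fail, since they are described as likely to work with any recent release.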
The RStudio interface, which is described in more detail in Chapter 1, Introducing Machine Learning, is a highly recommended add-on for R that greatly enhances the user experience. The RStudio Open Source Edition is available free of charge from Posit (https://www.posit.co/) alongside a paid RStudio Pro Edition that offers priority support and additional features for commercial organizations.
Download the example code files
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-with-R-Fourth-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/TZ7os.
Conventions used
Code in text: function names, filenames, file extensions, and R package names are shown as follows: "The knn() function in the class package provides a standard, classic implementation of the k-NN algorithm."
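For readers who have not yet met the knn() function named in the sample sentence, a call might look roughly like the sketch below. It is not an example from the book: the built-in iris dataset, the 100/50 split, the seed, and k = 5 are all assumptions chosen only for illustration.

```r
library(class)  # recommended package included with standard R installations

set.seed(123)                    # arbitrary seed so the random split is repeatable
idx <- sample(nrow(iris), 100)   # 100 rows for training; the remaining 50 for testing

train_x <- iris[idx, 1:4]        # the four numeric measurement columns
test_x  <- iris[-idx, 1:4]
train_y <- iris$Species[idx]     # class labels for the training rows
test_y  <- iris$Species[-idx]

# Classify each test row by majority vote among its 5 nearest training rows
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# Cross-tabulate predictions against the true labels and compute accuracy
print(table(predicted = pred, actual = test_y))
print(mean(pred == test_y))
```

The iris measurements share similar scales, so this sketch skips rescaling; with features on very different scales, normalizing before computing distances is usually advisable.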
R user input and output is written as follows:
> reg(y = launch$distress_ct, x = launch[2:4])
                         estimate
Intercept             3.527093383
temperature          -0.051385940
field_check_pressure  0.001757009
flight_num            0.014292843
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In RStudio, a new file can be created using the File menu, selecting New File, and choosing the R Notebook option."
References to additional resources or background information appear like this.
Helpful tips and important caveats appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: Email
, and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at
.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit http://www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at
with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.
Share your thoughts
Once you’ve read Machine Learning with R - Fourth Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Download a free PDF copy of this book
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there. You can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below
https://packt.link/free-ebook/978-1-80107-132-1
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits directly to your email.
1
Introducing Machine Learning
If science-fiction stories are to be believed, the invention of artificial intelligence inevitably leads to apocalyptic wars between machines and their makers. The stories begin with today’s reality: computers being taught to play simple games like tic-tac-toe and to automate routine tasks. As the stories go, machines are later given control of traffic lights and communications, followed by military drones and missiles. The machines’ evolution takes an ominous turn once the computers become sentient and learn how to teach themselves. Having no more need for human programmers, humankind is then deleted.
Thankfully, at the time of writing, machines still require user input.
Though your impressions of machine learning may be colored by these mass-media depictions, today’s algorithms have little danger of becoming self-aware. The goal of today’s machine learning is not to create an artificial brain, but rather to assist us with making sense of and acting on the world’s rapidly accumulating data stores.
Putting popular misconceptions aside, by the end of this chapter, you will gain a more nuanced understanding of machine learning. You will also be introduced to the fundamental concepts that define and differentiate the most common machine learning approaches. You will learn:
The origins, applications, ethics, and pitfalls of machine learning
How computers transform data into knowledge and action
The steps needed to match a machine learning algorithm with your data
The field of machine learning provides a set of algorithms that transform data into actionable knowledge. Keep reading to see how easy it is to use R to start applying machine learning to real-world problems.
The origins of machine learning
Beginning at birth, we are inundated with data. Our body’s sensors—the eyes, ears, nose, tongue, and nerves—are continually assailed with raw data that our brain translates into sights, sounds, smells, tastes, and textures. Using language, we can share these experiences with others.
Since the advent of written language, humans have recorded their observations. Hunters monitored the movement of animal herds; early astronomers recorded the alignment of planets and stars; and cities recorded tax payments, births, and deaths. Today, such observations, and many more, are increasingly automated and recorded systematically in ever-growing computerized databases.
The invention of electronic sensors has additionally contributed to an explosion in the volume and richness of recorded data. Specialized sensors, such as cameras, microphones, chemical noses, electronic tongues, and pressure sensors mimic the human ability to see, hear, smell, taste, and feel. These sensors process the data far differently than a human being would. Unlike a human’s limited and subjective attention, an electronic sensor never takes a break and has no emotions to skew its perception.
Although sensors are not clouded by subjectivity, they do not necessarily report a single, definitive depiction of reality. Some have an inherent measurement error due to hardware limitations. Others are limited by their scope. A black-and-white photograph provides a different depiction of its subject than one shot in color. Similarly, a microscope provides a far different depiction of reality than a telescope.
Between databases and sensors, many aspects of our lives are recorded. Governments, businesses, and individuals are recording and reporting information, from the monumental to the mundane. Weather sensors obtain temperature and pressure data; surveillance cameras watch sidewalks and subway tunnels; and all manner of electronic behaviors are monitored: transactions, communications, social media relationships, and many others.
This deluge of data has led some to state that we have entered an era of big data, but this may be a bit of a misnomer. Human beings have always been surrounded by large amounts of data—one would need only to look to the sky and attempt to count its stars to discover a virtually endless supply. What makes the current era unique is that we have vast amounts of recorded data, much of which can be directly accessed by computers. Larger and more interesting datasets are increasingly accessible at the tips of our fingers, only a web search away. This wealth of information has the potential to inform action, given a systematic way of making sense of it all.
The field of study dedicated to the development of computer algorithms for transforming data into intelligent action is known as machine learning. This field originated in an environment where the available data, statistical methods, and computing power rapidly and simultaneously evolved. Growth in the volume of data necessitated additional computing power, which in turn spurred the development of statistical methods for analyzing large datasets. This created a cycle of advancement, allowing even larger and more interesting data to be collected, and enabled today’s environment in which endless streams of data are available on virtually any topic.
Figure 1.1: The cycle of advancement that enabled machine learning
A closely related sibling of machine learning, data mining, is concerned with the generation of novel insight from large databases. As the term implies, data mining involves a systematic hunt for nuggets of actionable intelligence. Although there is some disagreement over how widely machine learning and data mining overlap, one point of distinction is that machine learning focuses on teaching computers how to use data to solve a problem, while data mining focuses on teaching computers to identify patterns that humans then use to solve a problem.
Virtually all data mining involves the use of machine learning, but not all machine learning requires data mining. For example, you might apply machine learning to data mine automobile traffic data for patterns related to accident rates. On the other hand, if the computer is learning how to identify traffic signs, this is purely machine learning without data mining.
The phrase "data mining" is also sometimes used as a pejorative to describe the deceptive practice of cherry-picking data to support a theory.
Machine learning is also intertwined with the field of artificial intelligence (AI), which is a nebulous discipline and, depending on whom you might ask, is simply machine learning with a strong marketing spin or a distinct field of study altogether. A cynic might suggest that the field of AI tends to exaggerate its importance such as by calling a simple predictive model an "AI bot," while an AI proponent may point out that the field tends to tackle the most challenging learning tasks while aiming for human-level performance. The truth is somewhere in between.
Just as machine learning itself depends on statistical methods, artificial intelligence depends a great deal on machine learning, but the business contexts and applications tend to differ. The table that follows highlights some differentiators among traditional statistics, machine learning, and artificial intelligence; however, keep in mind that the lines between the three disciplines are often less rigid than they may appear.
In this formulation, machine learning sits firmly at the intersection of human and computer partnership, whereas traditional statistics relies primarily on the human to drive insights and AI seeks to minimize human involvement as much as possible. Learning how to maximize the human-machine partnership and apply learning algorithms to real-world problems is the focus of this book. Understanding the use cases and limitations of machine learning is an important starting point in this journey.
Uses and abuses of machine learning
Most people have heard of Deep Blue, the chess-playing computer that in 1997 was the first to win a match against a reigning world champion. Another famous computer, Watson, defeated two human opponents on the television trivia game show Jeopardy! in 2011. Based on these stunning accomplishments, some have speculated that computer intelligence will replace workers in information technology occupations, just as automobiles replaced horses and machines replaced workers in fields and assembly lines. Recently, these fears have become more pronounced as artificial intelligence-based algorithms, such as GPT-3 and DALL·E 2 from the OpenAI research group (https://openai.com/), have reached impressive milestones and are proving that computers are capable of writing text and creating artwork that is virtually indistinguishable from that produced by humans. Ultimately, this may lead to massive shifts in occupations like marketing, customer support, illustration, and so on, as creativity is outsourced to machines that can produce endless streams of material more cheaply than the former employees.
Even in this case, humans may still be necessary. The truth is that even as machines reach such impressive milestones, they remain relatively limited in their ability to thoroughly understand a problem or to understand how the work will be applied toward a real-world goal. Learning algorithms are pure intellectual horsepower without direction. A computer may be more capable than a human of finding subtle patterns in large databases, but it still needs a human to motivate the analysis and turn the result into meaningful action. In most cases, the human will determine whether the machine’s output is valuable and will help the machine avoid creating a limitless supply of nonsense.
Without completely discounting the achievements of Deep Blue and Watson, it is important to note that neither is even as intelligent as a typical five-year-old.
For more on why "comparing smarts is a slippery business," see the Popular Science article FYI, Which Computer Is Smarter, Watson Or Deep Blue?, by Will Grunewald, 2012: https://www.popsci.com/science/article/2012-12/fyi-which-computer-smarter-watson-or-deep-blue.
Machines are not good at asking questions or even knowing what questions to ask. They are much better at answering them, provided the question is stated in a way that the computer can comprehend. Present-day machine learning algorithms partner with people much like a bloodhound works with its trainer: the dog’s sense of smell may be many times stronger than its master’s, but without being carefully directed, the hound may end up chasing its tail.
Figure 1.2: Machine learning algorithms are powerful tools that require careful direction
In the worst-case scenario, if machine learning were implemented carelessly, it might lead to what controversial tech billionaire Elon Musk provocatively called "summoning the demon." This perspective suggests that we may be unleashing forces outside our control, despite the hubristic sense that we will be able to rein them in when needed. Given the power of artificial intelligence to automate processes and react to changing conditions much faster and more objectively than humans, there may come a point at which Pandora’s box has been opened and it is difficult or impossible to return to the old ways of life where humans are in control. As Musk describes:
If AI has a goal and humanity just happens to be in the way, it will destroy humanity as a matter of course without even thinking about it. No hard feelings… It’s just like, if we’re building a road and an anthill just happens to be in the way, we don’t hate ants, we’re just building a road, and so, goodbye anthill.
While this may seem to be a bleak portrayal, it is still the realm of far-future science fiction, as you will soon learn when reading about the present day’s state-of-the-art machine learning successes.
However, Musk’s warning does help emphasize that machine learning and AI are likely to be a double-edged sword. For all of their benefits, there are some places where they still have room for improvement, and some situations where they may do more harm than good. If machine learning practitioners cannot be trusted to act ethically, it may be necessary for governments to intervene to prevent the greatest harm to society.
For more on Musk’s fears of "summoning the demon," see the following 2018 article from CNBC: https://www.cnbc.com/2018/04/06/elon-musk-warns-ai-could-create-immortal-dictator-in-documentary.html.
Machine learning successes
Machine learning is most successful when it augments the specialized knowledge of a subject-matter expert rather than replacing the expert altogether. It works with medical doctors at the forefront of the fight to eradicate cancer; assists engineers with efforts to create smarter homes and automobiles; helps social scientists and economists build better societies; and provides business and marketing professionals with valuable insights. Toward these ends, it is employed in countless scientific laboratories, hospitals, companies, and governmental organizations. Any effort that generates or aggregates data likely employs at least one machine learning algorithm to help make sense of it.
Though it is impossible to list every successful application of machine learning, a selection of prominent examples is as follows:
Identification of unwanted spam messages in email
Segmentation of customer behavior for targeted advertising
Forecasts of weather behavior and long-term climate changes
Preemptive interventions for customers likely to churn (stop purchasing)
Reduction of fraudulent credit card transactions
Actuarial estimates of financial damage from storms and natural disasters
Prediction of and influence over election outcomes
Development of algorithms for auto-piloting drones and self-driving cars
Optimization of energy use in homes and office buildings
Projection of areas where criminal activity is most likely
Discovery of genetic sequences useful for precision medicine
By the end of this book, you will understand the basic machine learning algorithms that are employed to teach computers to perform these tasks. For now, it suffices to say that no matter the context, the fundamental machine learning process is the same. In every task, an algorithm takes data and identifies patterns that form the basis for further action.
The limits of machine learning
Although machine learning is used widely and has tremendous potential, it is important to understand its limits. The algorithms used today—even those on the cutting edge of artificial intelligence—emulate a relatively limited subset of the capabilities of the human brain. They offer little flexibility to extrapolate outside of strict parameters and know no common sense. Considering this, one should be extremely careful to recognize exactly what an algorithm has learned before setting it loose in the real world.
Without a lifetime of past experiences to build upon, computers are limited in their ability to make simple inferences about logical next steps. Consider the banner advertisements on websites, which are served according to patterns learned by data mining the browsing history of millions of users. Based on this data, someone who views websites selling mattresses is interested in buying a mattress and should therefore see advertisements for mattresses. The problem is that this becomes a never-ending cycle in which, even after a mattress has been purchased, additional mattress advertisements are shown, rather than advertisements for pillows and bed sheets.
Many people are familiar with the deficiencies of machine learning’s ability to understand or translate language, or to recognize speech and handwriting. Perhaps the earliest example of this type of failure is in a 1994 episode of the television show The Simpsons, which showed a parody of the Apple Newton tablet. In its time, the Newton was known for its state-of-the-art handwriting recognition. Unfortunately for Apple, it would occasionally fail to great effect. The television episode illustrated this through a sequence in which a bully’s note to "Beat up Martin" was misinterpreted by the Newton as "Eat up Martha."
Figure 1.3: Screen captures from Lisa on Ice, The Simpsons, 20th Century Fox (1994)
Machine language processing has improved enough in the time since the Apple Newton that Google, Apple, and Microsoft are all confident in their ability to offer voice-activated virtual concierge services, such as Google Assistant, Siri, and Cortana. Still, these services routinely struggle to answer relatively simple questions. Furthermore, online translation services sometimes misinterpret sentences that a toddler would readily understand, and the predictive text feature on many devices has led to humorous "autocorrect fail" websites that illustrate computers’ ability to understand basic language but completely misunderstand context.
Some of these mistakes are to be expected. Language is complicated, with multiple layers of text and subtext, and even human beings sometimes misunderstand context. Although machine learning is rapidly improving at language processing, and current state-of-the-art algorithms like GPT-3 are quite good in comparison to prior generations, machines still make mistakes that are obvious to humans who know where to look. These predictable shortcomings illustrate the important fact that machine learning is only as good as the data it has learned from. If context is not explicit in the input data, then just like a human, the computer will have to make its best guess from its set of past experiences. However, the computer’s past experiences are usually much more limited than the human’s.
Machine learning ethics
At its core, machine learning is simply a tool that assists us with making sense of the world’s complex data. Like any tool, it can be used for good or evil. Machine learning goes wrong mostly when it is applied so broadly, or so callously, that humans are treated as lab rats, automata, or mindless consumers. A process that may seem harmless can lead to unintended consequences when automated by an emotionless computer. For this reason, those using machine learning or data mining would be remiss not to at least briefly consider the ethical implications of the art.
Due to the relative youth of machine learning as a discipline and the speed at which it is progressing, the associated legal issues and social norms are often quite uncertain, and constantly in flux. Caution should be exercised when obtaining or analyzing data in order to avoid breaking laws, violating terms of service or data use agreements, or abusing the trust or violating the privacy of customers or the public. The informal corporate motto of Google, an organization that collects perhaps more data on individuals than any other, was at one time, "don’t be evil." While this seems clear enough, it may not be sufficient. A better approach may be to follow the Hippocratic Oath, a medical principle that states, "above all, do no harm."
Following the principle of "do no harm" may have helped avoid recent scandals at Facebook and other companies, such as the Cambridge Analytica controversy, which alleged that social media data was being used to manipulate elections.
Retailers routinely use machine learning for advertising, targeted promotions, inventory management, or the layout of items in a store. Many have equipped checkout lanes with devices that print coupons for promotions based on a customer’s buying history. In exchange for a bit of personal data, the customer receives discounts on the specific products they want to buy. At first, this may appear relatively harmless, but consider what happens when this practice is taken a bit further.
One possibly apocryphal tale concerns a large retailer in the United States that employed machine learning to identify expectant mothers for coupon mailings. The retailer hoped that if these mothers-to-be received substantial discounts, they would become loyal customers who would later purchase profitable items such as diapers, baby formula, and toys. Equipped with machine learning methods, the retailer identified items in the customer purchase history that could be used to predict with a high degree of certainty not only whether a woman was pregnant, but also the approximate timing for when the baby was due.
After the retailer used this data for a promotional mailing, an angry man contacted the chain and demanded to know why his young daughter received coupons for maternity items. He was furious that the retailer seemed to be encouraging teenage pregnancy! As the story goes, when the retail chain called to offer an apology, it was the father who ultimately apologized after confronting his daughter and discovering that she was indeed pregnant!
For more detail on how retailers use machine learning to identify pregnancies, see the New York Times Magazine article titled How Companies Learn Your Secrets, by Charles Duhigg, 2012: https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html.
Whether the story was completely true or not, the lesson learned from the preceding tale is that common sense should be applied before blindly applying the results of a machine learning analysis. This is particularly true in cases where sensitive information, such as health data, is concerned. With a bit more care, the retailer could have foreseen this scenario and used greater discretion when choosing how to reveal the pregnancy status its machine learning analysis had discovered. Unfortunately, as history tends to repeat itself, social media companies have been under fire recently for targeting expectant mothers with advertisements for baby products even after these mothers experience the tragedy of a miscarriage.
Because machine learning algorithms are developed with historical data, computers may learn some unfortunate behaviors of human societies. Sadly, this sometimes includes perpetuating race or gender discrimination and reinforcing negative stereotypes. For example, researchers found that Google’s online advertising service was more likely to show ads for high-paying jobs to men than women and was more likely to display ads for criminal background checks to black people than white people. Although the machine may have correctly learned that men once held jobs that were not offered to most women, it is not desirable to have the algorithm perpetuate such injustices. Instead, it may be necessary to teach the machine to reflect society not as it currently is, but how it ought to be.
Sometimes, algorithms that are specifically designed with the intention of being "content-neutral" eventually come to reflect undesirable beliefs or ideologies. In one egregious case, a Twitter chatbot service developed by Microsoft was quickly taken offline after it began spreading Nazi and anti-feminist propaganda, which it may have learned from so-called "trolls" posting inflammatory content on internet forums and chat rooms. In another case, an algorithm created to reflect an objective conception of human beauty sparked controversy when it favored almost exclusively white people. Imagine the consequences if this had been applied to facial recognition software for criminal activity!
For more information about the real-world consequences of machine learning and discrimination see the Harvard Business Review article Addressing the Biases Plaguing Algorithms, by Michael Li, 2019: https://hbr.org/2019/05/addressing-the-biases-plaguing-algorithms.
To limit the ability of algorithms to discriminate illegally, certain jurisdictions have well-intentioned laws that prevent the use of racial, ethnic, religious, or other protected class data for business reasons. However, excluding this data from a project may not be enough because machine learning algorithms can still inadvertently learn to discriminate. If a certain segment of people tends to live in a certain region, buys a certain product, or otherwise behaves in a way that uniquely identifies them as a group, machine learning algorithms can infer the protected information from other factors. In such cases, you may need to completely de-identify these people by excluding any potentially identifying data in addition to the already-protected statuses.
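To make the idea of inferring protected information from other factors more concrete, the following is a minimal sketch using simulated data. Everything in it is an assumption made for illustration: the data frame people, the variables group and region, and the 90 percent overlap between them are hypothetical and do not come from any real dataset. The protected attribute group is treated as something we intend to withhold, yet a simple logistic regression recovers it from the seemingly neutral region feature alone.

```r
# Simulated illustration only: all names and numbers here are hypothetical
set.seed(123)
n <- 1000

# A protected attribute (coded 0/1) that we intend to exclude from modeling
group <- rbinom(n, size = 1, prob = 0.5)

# A seemingly neutral feature that is correlated with the protected attribute:
# members of group 1 live in region "A" about 90% of the time, others about 10%
lives_in_a <- rbinom(n, size = 1, prob = ifelse(group == 1, 0.9, 0.1))
region <- ifelse(lives_in_a == 1, "A", "B")

people <- data.frame(group = group, region = region)

# A simple model that tries to recover the protected attribute from region alone
proxy_model <- glm(group ~ region, data = people, family = binomial)
predicted_group <- as.integer(predict(proxy_model, type = "response") > 0.5)

# Proportion of people whose protected status is inferred correctly
accuracy <- mean(predicted_group == people$group)
accuracy
```

Because the simulation ties region to group roughly 90 percent of the time, the accuracy should land near 0.9, far above the 0.5 expected from guessing. In a real project, a check of this kind is one possible way to look for features that act as proxies for a protected status before deciding what to exclude.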
In a recent example of this type of alleged algorithmic bias, the Apple credit card, which debuted in 2019, was almost immediately accused of providing substantially higher credit limits to men than to women—sometimes by 10 to 20 times the amount—even for spouses with joint assets and similar credit histories. Although Apple and the issuing bank, Goldman Sachs, denied that gender bias was at play and confirmed that no legally protected applicant characteristics were used in the algorithm, this did not slow speculation that perhaps some bias crept in unintentionally. It did not help matters that for competitive reasons, Apple and Goldman Sachs chose to keep the details of the algorithm secret, which led people to assume the worst. If the systematic bias allegations were untrue, being able to explain what was truly happening and exactly how the decisions were made might have alleviated much of the outrage. A potential worst-case scenario would have occurred if Apple and Goldman Sachs were investigated yet couldn’t explain the result to regulators, due to the algorithm’s complexity!
The Apple credit card fiasco is described in a 2019 BBC article, Apple’s ‘sexist’ credit card investigated by US regulator: https://www.bbc.com/news/business-50365609.
Apart from the legal consequences, customers may feel uncomfortable or become upset if aspects of their lives they consider private are made public. The challenge is that privacy expectations differ across people and contexts. To illustrate this fact, imagine driving by someone’s house and incidentally glancing through the window. This is unlikely to offend most people. In contrast, using a camera to take a picture from across the street is likely to make most feel uncomfortable; walking up to the house and pressing a face against the glass to peer inside is likely to anger virtually everybody. Although all three of these scenarios are arguably using "public" information, two of the three cross a line that will upset most people. In much the same way, it is possible to go too far with the use of data and cross a threshold that many will see as inconsiderate at best and creepy at worst.
Just as computing hardware and statistical methods kicked off the big data era, these methods also unlocked a post-privacy era in which many aspects of our lives that were once private are now public, or available to the public at a price. Even prior to the big data era, it would have been possible to learn a great deal about someone by observing public information. Watching their comings and goings may reveal information about their occupation or leisure activity, and a quick glance at their trash and recycling bins may reveal what they eat, drink, and read. A private investigator could learn even more with a bit of focused digging and observation. Companies applying machine learning methods to large datasets are essentially acting as large-scale private investigators, and while they claim to be working on anonymized datasets, many still argue that the companies have gone too far with their digital surveillance.
In recent years, some high-profile web applications have experienced a mass exodus of users who felt exploited when the applications’ terms of service agreements changed, or their data was used for purposes beyond what the users had originally intended. The fact that privacy expectations differ by context, age cohort, and locale adds complexity to deciding the appropriate use of personal data. It would be wise to consider the cultural implications of your work before you begin on your project, in addition to being aware of ever-more-restrictive regulations such as the European Union’s General Data Protection Regulation (GDPR) and the inevitable policies that will follow in its footsteps.
The fact that you can use data for a particular end does not always mean that you should.
Finally, it is important to note that as machine learning algorithms become progressively more important to our everyday lives, there are greater incentives for nefarious actors to work to exploit them. Sometimes, attackers simply want to disrupt algorithms for laughs or notoriety, as with "Google bombing," the crowdsourced method of tricking Google’s algorithms into ranking a desired page highly. Other times, the effects are more dramatic. A timely example of this is the recent wave of so-called fake news and election meddling, propagated via the manipulation of advertising and recommendation algorithms that target people according to their personality. To avoid giving such control to outsiders, when building machine learning systems, it is crucial to consider how they may be influenced by a determined individual or crowd.
Social media scholar danah boyd (styled lowercase) presented a keynote at the Strata Data Conference 2017 in New York City that discussed the importance of hardening machine learning algorithms against attackers. For a recap, refer to https://points.datasociety.net/your-data-is-being-manipulated-a7e31a83577b.
The consequences of malicious attacks on machine learning algorithms can also be deadly. Researchers have shown that by creating an "adversarial attack" that subtly distorts a road sign with carefully chosen graffiti, an attacker might cause an autonomous vehicle to misinterpret a stop sign, potentially resulting in a fatal crash. Even in the absence of ill intent, software bugs and human errors have already led to fatal accidents in autonomous vehicle technology from Uber and Tesla. With such examples in mind, it is of the utmost ethical importance that machine learning practitioners consider how their algorithms will be used and abused in the real world.
How machines learn
A formal definition of machine learning, attributed to computer scientist Tom M. Mitchell, states that a machine learns whenever it utilizes its experience such that its performance improves on similar experiences in the future. Although this definition makes sense intuitively, it completely ignores the process of exactly how experience is translated into future action—and, of course, learning is always easier said than done!
Where human brains are naturally capable of learning from birth, the conditions necessary for computers to learn must be made explicit by the programmer hoping to utilize machine learning methods. For this reason, although it is not strictly necessary to understand the theoretical basis for learning, having a strong theoretical foundation helps the practitioner to understand, distinguish, and implement machine learning algorithms.
As you relate machine learning to human learning, you may find yourself examining your own mind in a different light.
Regardless of whether the learner is a human or a machine, the basic learning process is the same. It can be divided into four interrelated components:
Data storage utilizes observation, memory, and recall to provide a factual basis for further reasoning.
Abstraction involves the translation of stored data into broader representations and concepts.
Generalization uses abstracted data to create knowledge and inferences that drive action in new contexts.
Evaluation provides a feedback mechanism to measure the utility of learned knowledge and inform potential improvements.
Figure 1.4: The four steps in the learning process
Although the learning process has been conceptualized here as four distinct components, they are merely organized this way for illustrative purposes. In reality, the entire learning process is inextricably linked. In human beings, the process occurs subconsciously. We recollect, deduce, induct, and intuit within the confines of our mind’s eye, and because this process is hidden, any differences from person to person are attributed to a vague notion of subjectivity. In contrast, computers make these processes explicit, and because the entire process is transparent, the learned knowledge can be examined, transferred, utilized for future action, and treated as a data science.
The data science buzzword suggests a relationship between the data, the machine, and the people who guide the learning process. The term’s growing use in job descriptions and academic degree programs reflects its operationalization as a field of study concerned with both statistical and computational theory, as well as the technological infrastructure enabling machine learning and its applications. The field often asks its practitioners to be compelling storytellers, balancing an audacity in the use of data with the limitations of what one may infer and forecast from it. To be a strong data scientist, therefore, requires a strong understanding of how the learning algorithms work in the context of a business application, as we will discuss in greater detail in Chapter 11, Being Successful with Machine Learning.
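As a minimal sketch, the four components described above can be made concrete in a few lines of R. The example assumes the built-in mtcars dataset and a simple linear regression fitted with lm(); the odd/even row split and the variable names are illustrative choices rather than a prescribed workflow.

```r
# Data storage: observations held in memory as a data frame
cars_data <- mtcars[, c("mpg", "wt")]

# Hold out every second row so the learner can be evaluated on unseen cases
# (an arbitrary, deterministic split chosen for illustration)
is_train <- seq_len(nrow(cars_data)) %% 2 == 1
train <- cars_data[is_train, ]
test  <- cars_data[!is_train, ]

# Abstraction: summarize the stored data as a model (here, a fitted line)
fit <- lm(mpg ~ wt, data = train)

# Generalization: apply the abstraction to cases not seen during fitting
pred <- predict(fit, newdata = test)

# Evaluation: measure how useful the learned knowledge is on the new cases
rmse <- sqrt(mean((test$mpg - pred)^2))
print(rmse)
```

Although the code runs top to bottom, the steps are not independent: a poor evaluation result would typically send the practitioner back to revisit the data or the choice of abstraction.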
Data storage
All learning begins with data. Humans and computers alike utilize data storage as a foundation for more advanced reasoning. In a human being, this consists of a brain that uses electrochemical signals in a network of biological cells to store and process observations for short- and long-term future recall. Computers have similar capabilities of short- and long-term recall using hard disk drives, flash memory, and random-access memory (RAM) in combination with a central processing unit (CPU).
It may seem obvious, but the ability to store and retrieve data alone is insufficient for learning. Stored data is merely ones and zeros on a disk. It is a collection of memories, meaningless without a broader context. Without a higher level of understanding, knowledge is purely recall, limited to what has been seen before and nothing else.
To better understand the nuances of this idea, it may help to think about the last time you studied for a difficult test, perhaps for a university final exam or a career certification. Did you wish for an eidetic (photographic) memory? If so, you may be disappointed to know that perfect recall would be unlikely to be of much assistance. Even if you could memorize material perfectly, this rote learning would provide no benefit without knowing the exact questions and answers that would appear on the exam. Otherwise, you would need to memorize answers to every question that could conceivably be asked, on a subject in which there is likely to be an infinite number of questions. Obviously, this is an unsustainable strategy.
Instead, a better approach is to spend time selectively and memorize a relatively small set of representative ideas, while developing an understanding of how the ideas relate and apply to unforeseen circumstances. In this way, important broader patterns are identified, rather than memorizing every detail, nuance, and potential application.
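The difference between rote recall and understanding can be sketched in R. The example below is an illustration under assumed data: a handful of Celsius-to-Fahrenheit pairs stand in for memorized exam answers, and the helper functions recall() and generalize() are hypothetical names invented for this sketch.

```r
# A few "memorized" question-and-answer pairs: Celsius -> Fahrenheit
celsius    <- c(0, 10, 20, 30, 100)
fahrenheit <- c(32, 50, 68, 86, 212)

# Rote learner: pure recall from a lookup table keyed by the question
memorized <- setNames(fahrenheit, celsius)
recall <- function(x) unname(memorized[as.character(x)])

recall(20)   # seen before, so recall succeeds
recall(37)   # never seen, so recall returns NA

# Learner that abstracts the relationship between the pairs
fit <- lm(fahrenheit ~ celsius)
generalize <- function(x) {
  unname(predict(fit, newdata = data.frame(celsius = x)))
}

generalize(37)   # answers a question that was never memorized
```

The lookup table stores exactly as much as it was given and nothing more, whereas the fitted line captures the pattern connecting the pairs and can therefore respond to unforeseen inputs.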
Abstraction
This work of assigning a broader meaning to stored data occurs during the abstraction process, in which raw data comes to represent a wider, more abstract concept or idea. This type of connection, say between an object and its representation, is exemplified by the famous René Magritte painting The Treachery of Images.
Figure 1.5: “This is not a pipe.”
Source: http://collections.lacma.org/node/239578
The painting depicts a tobacco pipe with the caption Ceci n’est pas une pipe (“This is not a pipe”). The point Magritte was illustrating is that a representation of a pipe is not truly a pipe. Yet, despite the fact that the pipe is not real, anybody viewing the painting easily recognizes it as a pipe. This suggests that observers can connect the picture of a pipe to the idea of a pipe, to a memory of a physical pipe that can be held in the hand. Abstracted connections like this are the basis of knowledge representation, the formation of logical structures that assist with turning raw sensory information into meaningful insight.
Bringing this concept full circle, knowledge representation is what allows artificial intelligence-based tools like Midjourney (https://www.midjourney.com) to paint, virtually, in the style of René Magritte. The following image was generated entirely by artificial intelligence based on the algorithm’s understanding of concepts like “robot,” “pipe,” and “smoking.”
If he were still alive today, Magritte himself might find it surreal that his own surrealist work, which challenged human conceptions of reality and the connections between images and ideas, is now incorporated into the minds of computers and, in a roundabout way, is connecting machines’ ideas and images to reality. Machines learned what a pipe is, in part, by viewing images of pipes in artwork like Magritte’s.
Figure 1.6: “Am I a pipe?” image created by the Midjourney AI with the prompt of “robot smoking a pipe in the style of a René Magritte painting”
To reify the process of knowledge representation within an algorithm, the computer summarizes stored raw data using a model, an explicit description of the patterns within the data. Just like Magritte’s pipe, the model representation takes on a life beyond the raw data. It represents an idea greater than the sum of its parts.
There are many different types of models. You may already be familiar with some. Examples include:
Mathematical equations
Relational diagrams, such as trees and graphs
Logical if/else rules
Groupings of data known as clusters
The choice of model is typically not left up to the machine. Instead, the learning task and the type of data on hand inform model selection. Later in this chapter, we will discuss in more detail the methods for choosing the appropriate model type.
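To make the list of model types more tangible, here is a rough sketch of each form using only functions and datasets that ship with R. The datasets (cars and iris), the 2.5 cm threshold, and the choice of three clusters are assumptions made for illustration. A dendrogram from hierarchical clustering stands in for tree-structured models because hclust() is available without add-on packages.

```r
# Mathematical equation: a fitted line, dist = a + b * speed
equation <- lm(dist ~ speed, data = cars)
coef(equation)

# Relational diagram: a tree (dendrogram) built by hierarchical clustering
tree <- hclust(dist(iris[, 1:4]))

# Logical if/else rule (the 2.5 cm threshold is hand-picked, not learned)
rule <- ifelse(iris$Petal.Length < 2.5, "setosa", "other")
table(rule, iris$Species)

# Clusters: group similar rows without using the species labels
set.seed(123)  # k-means starts from random centers
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
table(clusters$cluster)
```

Each object is a different kind of summary of the same sort of raw table: a pair of coefficients, a branching structure, a single condition, or a set of group assignments.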
Fitting a model to a dataset is known as training. When the model has been trained, the data has been transformed into an abstract form that summarizes the original information. The fact that this step is called “training” rather than “learning” reveals a couple of interesting aspects of the process. First, note that the process of learning does not end with data abstraction: the learner must still generalize and evaluate its training. Second, the word “training” better connotes the fact that the human teacher trains the machine student to use the data toward a specific end.
The distinction between training and learning is subtle but important. The computer doesn’t “learn” a model, because this would imply that there is a single correct model to be learned. Of course, the computer must learn something about the data