Machine Learning with R: Learn techniques for building and improving machine learning models, from data preparation to model tuning, evaluation, and working with big data
About this ebook

Learn how to solve real-world data problems using machine learning and R


Purchase of the print or Kindle book includes a free eBook in PDF format.


Key Features


The 10th Anniversary Edition of the bestselling R machine learning book, updated with 50% new content for R 4.0.0 and beyond


Harness the power of R to build flexible, effective, and transparent machine learning models


Learn quickly with this clear, hands-on guide by machine learning expert Brett Lantz


Book Description


Machine learning, at its core, is concerned with transforming data into actionable knowledge. R offers a powerful set of machine learning methods to quickly and easily gain insight from your data.


Machine Learning with R, Fourth Edition, provides a hands-on, accessible, and readable guide to applying machine learning to real-world problems. Whether you are an experienced R user or new to the language, Brett Lantz teaches you everything you need to know for data pre-processing, uncovering key insights, making new predictions, and visualizing your findings. This 10th Anniversary Edition features several new chapters that reflect the progress of machine learning in the last few years, helping you build your data science skills and tackle more challenging problems, including being successful with machine learning, advanced data preparation, building better learners, and making use of big data.


You'll also find this classic R data science book updated to R 4.0.0 with newer and better libraries, advice on ethical and bias issues in machine learning, and an introduction to deep learning. Whether you're looking to take your first steps with R for machine learning or making sure your skills and knowledge are up to date, this is an unmissable read that will help you find powerful new insights in your data.


What you will learn


Learn the end-to-end process of machine learning from raw data to implementation


Classify important outcomes using nearest neighbor and Bayesian methods


Predict future events using decision trees, rules, and support vector machines


Forecast numeric data and estimate financial values using regression methods


Model complex processes with artificial neural networks


Prepare, transform, and clean data using the tidyverse


Evaluate your models and improve their performance


Connect R to SQL databases and emerging big data technologies such as Spark, Hadoop, H2O, and TensorFlow


Who this book is for


This book is designed to help data scientists, actuaries, data analysts, financial analysts, social scientists, business and machine learning students, and any other practitioners who want a clear, accessible guide to machine learning with R. No R experience is required, although prior exposure to statistics and programming is helpful.


 



    Machine Learning with R

    Fourth Edition

    Learn techniques for building and improving machine learning models, from data preparation to model tuning, evaluation, and working with big data

    Brett Lantz

    BIRMINGHAM—MUMBAI

    Machine Learning with R

    Fourth Edition

    Copyright © 2023 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Lead Senior Publishing Product Manager: Tushar Gupta

    Acquisition Editor – Peer Reviews: Saby Dsilva

    Project Editor: Janice Gonsalves

    Content Development Editors: Bhavesh Amin and Elliot Dallow

    Copy Editor: Safis Editor

    Technical Editor: Karan Sonawane

    Indexer: Hemangini Bari

    Presentation Designer: Pranit Padwal

    Developer Relations Marketing Executive: Monika Sangwan

    First published: October 2013

    Second edition: July 2015

    Third edition: April 2019

    Fourth edition: May 2023

    Production reference: 1190523

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham

    B3 2PB, UK.

    ISBN 978-1-80107-132-1

    www.packt.com

    Contributors

    About the author

    Brett Lantz (@DataSpelunking) has spent more than 15 years using innovative data methods to understand human behavior. A sociologist by training, Brett was first captivated by machine learning while studying a large database of teenagers’ social network profiles. Brett is a DataCamp instructor and has taught machine learning workshops around the world. He is known to geek out about data science applications for sports, video games, autonomous vehicles, and foreign language learning, among many other subjects, and hopes to eventually blog about such topics at dataspelunking.com.

    It is hard to describe how much my world has changed since the first edition of this book was published nearly ten years ago! My sons Will and Cal were born amidst the first and second editions, respectively, and have grown alongside my career. This edition, which consumed two years of weekends, would have been impossible without the backing of my wife, Jessica. Many thanks are due also to the friends, mentors, and supporters who opened the doors that led me along this unexpected data science journey.

    About the reviewer

    Daniel D. Gutierrez is an independent consultant in data science through his firm AMULET Analytics. He’s also a technology journalist, serving as Editor-in-Chief for insideBIGDATA.com, where he enjoys keeping his finger on the pulse of this fast-paced industry. Daniel is also an educator, having taught data science, machine learning, and R classes at the university level for many years. He currently teaches data science for UCLA Extension. He has authored four computer industry books on database and data science technology, including his most recent title, Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R. Daniel holds a BS in Mathematics and Computer Science from UCLA.

    Join our book’s Discord space

    Join our Discord community to meet like-minded people and learn alongside more than 4000 people at:

    https://packt.link/r

    Contents

    Preface

    Who this book is for

    What this book covers

    What you need for this book

    Get in touch

    Introducing Machine Learning

    The origins of machine learning

    Uses and abuses of machine learning

    Machine learning successes

    The limits of machine learning

    Machine learning ethics

    How machines learn

    Data storage

    Abstraction

    Generalization

    Evaluation

    Machine learning in practice

    Types of input data

    Types of machine learning algorithms

    Matching input data to algorithms

    Machine learning with R

    Installing R packages

    Loading and unloading R packages

    Installing RStudio

    Why R and why R now?

    Summary

    Managing and Understanding Data

    R data structures

    Vectors

    Factors

    Lists

    Data frames

    Matrices and arrays

    Managing data with R

    Saving, loading, and removing R data structures

    Importing and saving datasets from CSV files

    Importing common dataset formats using RStudio

    Exploring and understanding data

    Exploring the structure of data

    Exploring numeric features

    Measuring the central tendency – mean and median

    Measuring spread – quartiles and the five-number summary

    Visualizing numeric features – boxplots

    Visualizing numeric features – histograms

    Understanding numeric data – uniform and normal distributions

    Measuring spread – variance and standard deviation

    Exploring categorical features

    Measuring the central tendency – the mode

    Exploring relationships between features

    Visualizing relationships – scatterplots

    Examining relationships – two-way cross-tabulations

    Summary

    Lazy Learning – Classification Using Nearest Neighbors

    Understanding nearest neighbor classification

    The k-NN algorithm

    Measuring similarity with distance

    Choosing an appropriate k

    Preparing data for use with k-NN

    Why is the k-NN algorithm lazy?

    Example – diagnosing breast cancer with the k-NN algorithm

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Transformation – normalizing numeric data

    Data preparation – creating training and test datasets

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Transformation – z-score standardization

    Testing alternative values of k

    Summary

    Probabilistic Learning – Classification Using Naive Bayes

    Understanding Naive Bayes

    Basic concepts of Bayesian methods

    Understanding probability

    Understanding joint probability

    Computing conditional probability with Bayes’ theorem

    The Naive Bayes algorithm

    Classification with Naive Bayes

    The Laplace estimator

    Using numeric features with Naive Bayes

    Example – filtering mobile phone spam with the Naive Bayes algorithm

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Data preparation – cleaning and standardizing text data

    Data preparation – splitting text documents into words

    Data preparation – creating training and test datasets

    Visualizing text data – word clouds

    Data preparation – creating indicator features for frequent words

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Summary

    Divide and Conquer – Classification Using Decision Trees and Rules

    Understanding decision trees

    Divide and conquer

    The C5.0 decision tree algorithm

    Choosing the best split

    Pruning the decision tree

    Example – identifying risky bank loans using C5.0 decision trees

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Data preparation – creating random training and test datasets

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Boosting the accuracy of decision trees

    Making some mistakes cost more than others

    Understanding classification rules

    Separate and conquer

    The 1R algorithm

    The RIPPER algorithm

    Rules from decision trees

    What makes trees and rules greedy?

    Example – identifying poisonous mushrooms with rule learners

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Summary

    Forecasting Numeric Data – Regression Methods

    Understanding regression

    Simple linear regression

    Ordinary least squares estimation

    Correlations

    Multiple linear regression

    Generalized linear models and logistic regression

    Example – predicting auto insurance claims costs using linear regression

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Exploring relationships between features – the correlation matrix

    Visualizing relationships between features – the scatterplot matrix

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Model specification – adding nonlinear relationships

    Model specification – adding interaction effects

    Putting it all together – an improved regression model

    Making predictions with a regression model

    Going further – predicting insurance policyholder churn with logistic regression

    Understanding regression trees and model trees

    Adding regression to trees

    Example – estimating the quality of wines with regression trees and model trees

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Step 3 – training a model on the data

    Visualizing decision trees

    Step 4 – evaluating model performance

    Measuring performance with the mean absolute error

    Step 5 – improving model performance

    Summary

    Black-Box Methods – Neural Networks and Support Vector Machines

    Understanding neural networks

    From biological to artificial neurons

    Activation functions

    Network topology

    The number of layers

    The direction of information travel

    The number of nodes in each layer

    Training neural networks with backpropagation

    Example – modeling the strength of concrete with ANNs

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Understanding support vector machines

    Classification with hyperplanes

    The case of linearly separable data

    The case of nonlinearly separable data

    Using kernels for nonlinear spaces

    Example – performing OCR with SVMs

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Changing the SVM kernel function

    Identifying the best SVM cost parameter

    Summary

    Finding Patterns – Market Basket Analysis Using Association Rules

    Understanding association rules

    The Apriori algorithm for association rule learning

    Measuring rule interest – support and confidence

    Building a set of rules with the Apriori principle

    Example – identifying frequently purchased groceries with association rules

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Data preparation – creating a sparse matrix for transaction data

    Visualizing item support – item frequency plots

    Visualizing the transaction data – plotting the sparse matrix

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Sorting the set of association rules

    Taking subsets of association rules

    Saving association rules to a file or data frame

    Using the Eclat algorithm for greater efficiency

    Summary

    Finding Groups of Data – Clustering with k-means

    Understanding clustering

    Clustering as a machine learning task

    Clusters of clustering algorithms

    The k-means clustering algorithm

    Using distance to assign and update clusters

    Choosing the appropriate number of clusters

    Finding teen market segments using k-means clustering

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Data preparation – dummy coding missing values

    Data preparation – imputing the missing values

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Summary

    Evaluating Model Performance

    Measuring performance for classification

    Understanding a classifier’s predictions

    A closer look at confusion matrices

    Using confusion matrices to measure performance

    Beyond accuracy – other measures of performance

    The kappa statistic

    The Matthews correlation coefficient

    Sensitivity and specificity

    Precision and recall

    The F-measure

    Visualizing performance tradeoffs with ROC curves

    Comparing ROC curves

    The area under the ROC curve

    Creating ROC curves and computing AUC in R

    Estimating future performance

    The holdout method

    Cross-validation

    Bootstrap sampling

    Summary

    Being Successful with Machine Learning

    What makes a successful machine learning practitioner?

    What makes a successful machine learning model?

    Avoiding obvious predictions

    Conducting fair evaluations

    Considering real-world impacts

    Building trust in the model

    Putting the science in data science

    Using R Notebooks and R Markdown

    Performing advanced data exploration

    Constructing a data exploration roadmap

    Encountering outliers: a real-world pitfall

    Example – using ggplot2 for visual data exploration

    Summary

    Advanced Data Preparation

    Performing feature engineering

    The role of human and machine

    The impact of big data and deep learning

    Feature engineering in practice

    Hint 1: Brainstorm new features

    Hint 2: Find insights hidden in text

    Hint 3: Transform numeric ranges

    Hint 4: Observe neighbors’ behavior

    Hint 5: Utilize related rows

    Hint 6: Decompose time series

    Hint 7: Append external data

    Exploring R’s tidyverse

    Making tidy table structures with tibbles

    Reading rectangular files faster with readr and readxl

    Preparing and piping data with dplyr

    Transforming text with stringr

    Cleaning dates with lubridate

    Summary

    Challenging Data – Too Much, Too Little, Too Complex

    The challenge of high-dimension data

    Applying feature selection

    Filter methods

    Wrapper methods and embedded methods

    Example – Using stepwise regression for feature selection

    Example – Using Boruta for feature selection

    Performing feature extraction

    Understanding principal component analysis

    Example – Using PCA to reduce highly dimensional social media data

    Making use of sparse data

    Identifying sparse data

    Example – Remapping sparse categorical data

    Example – Binning sparse numeric data

    Handling missing data

    Understanding types of missing data

    Performing missing value imputation

    Simple imputation with missing value indicators

    Missing value patterns

    The problem of imbalanced data

    Simple strategies for rebalancing data

    Generating a synthetic balanced dataset with SMOTE

    Example – Applying the SMOTE algorithm in R

    Considering whether balanced is always better

    Summary

    Building Better Learners

    Tuning stock models for better performance

    Determining the scope of hyperparameter tuning

    Example – using caret for automated tuning

    Creating a simple tuned model

    Customizing the tuning process

    Improving model performance with ensembles

    Understanding ensemble learning

    Popular ensemble-based algorithms

    Bagging

    Boosting

    Random forests

    Gradient boosting

    Extreme gradient boosting with XGBoost

    Why are tree-based ensembles so popular?

    Stacking models for meta-learning

    Understanding model stacking and blending

    Practical methods for blending and stacking in R

    Summary

    Making Use of Big Data

    Practical applications of deep learning

    Beginning with deep learning

    Choosing appropriate tasks for deep learning

    The TensorFlow and Keras deep learning frameworks

    Understanding convolutional neural networks

    Transfer learning and fine tuning

    Example – classifying images using a pre-trained CNN in R

    Unsupervised learning and big data

    Representing highly dimensional concepts as embeddings

    Understanding word embeddings

    Example – using word2vec for understanding text in R

    Visualizing highly dimensional data

    The limitations of using PCA for big data visualization

    Understanding the t-SNE algorithm

    Example – visualizing data’s natural clusters with t-SNE

    Adapting R to handle large datasets

    Querying data in SQL databases

    The tidy approach to managing database connections

    Using a database backend for dplyr with dbplyr

    Doing work faster with parallel processing

    Measuring R’s execution time

    Enabling parallel processing in R

    Taking advantage of parallel with foreach and doParallel

    Training and evaluating models in parallel with caret

    Utilizing specialized hardware and algorithms

    Parallel computing with MapReduce concepts via Apache Spark

    Learning via distributed and scalable algorithms with H2O

    GPU computing

    Summary

    Other Books You May Enjoy

    Index


    Preface

    Machine learning, at its core, describes algorithms that transform data into actionable intelligence. This fact makes machine learning well suited to the present-day era of big data. Without machine learning, it would be nearly impossible to make sense of the massive streams of information that are now all around us.

    The cross-platform, zero-cost statistical programming environment called R provides an ideal pathway to start applying machine learning. R offers powerful but easy-to-learn tools that can assist you with finding insights in your own data.

    By combining hands-on case studies with the essential theory needed to understand how these algorithms work, this book delivers all the knowledge you need to get started with machine learning and to apply its methods to your own projects.

    Who this book is for

    This book is aimed at people in applied fields—business analysts, social scientists, and others—who have access to data and hope to use it for action. Perhaps you already know a bit about machine learning, but have never used R; or, perhaps you know a little about R, but are new to machine learning. Maybe you are completely new to both! In any case, this book will get you up and running quickly. It would be helpful to have a bit of familiarity with basic math and programming concepts, but no prior experience is required. All you need is curiosity.

    What this book covers

    Chapter 1, Introducing Machine Learning, presents the terminology and concepts that define and distinguish machine learners, as well as a method for matching a learning task with the appropriate algorithm.

    Chapter 2, Managing and Understanding Data, provides an opportunity to get your hands dirty working with data in R. Essential data structures and procedures used for loading, exploring, and understanding data are discussed.

    Chapter 3, Lazy Learning – Classification Using Nearest Neighbors, teaches you how to understand and apply a simple yet powerful machine learning algorithm to your first real-world task: identifying malignant samples of cancer.

    Chapter 4, Probabilistic Learning – Classification Using Naive Bayes, reveals the essential concepts of probability that are used in cutting-edge spam filtering systems. You’ll learn the basics of text mining in the process of building your own spam filter.

    Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, explores a couple of learning algorithms whose predictions are not only accurate, but also easily explained. We’ll apply these methods to tasks where transparency is important.

    Chapter 6, Forecasting Numeric Data – Regression Methods, introduces machine learning algorithms used for making numeric predictions. As these techniques are heavily embedded in the field of statistics, you will also learn the essential metrics needed to make sense of numeric relationships.

    Chapter 7, Black-Box Methods – Neural Networks and Support Vector Machines, covers two complex but powerful machine learning algorithms. Though the math may appear intimidating, we will work through examples that illustrate their inner workings in simple terms.

    Chapter 8, Finding Patterns – Market Basket Analysis Using Association Rules, exposes the algorithm used in the recommendation systems employed by many retailers. If you’ve ever wondered how retailers seem to know your purchasing habits better than you know yourself, this chapter will reveal their secrets.

    Chapter 9, Finding Groups of Data – Clustering with k-means, is devoted to a procedure that locates clusters of related items. We’ll utilize this algorithm to identify profiles within an online community.

    Chapter 10, Evaluating Model Performance, provides information on measuring the success of a machine learning project and obtaining a reliable estimate of the learner’s performance on future data.

    Chapter 11, Being Successful with Machine Learning, describes the common pitfalls faced when transitioning from textbook datasets to real world machine learning problems, as well as the tools, strategies, and soft skills needed to combat these issues.

    Chapter 12, Advanced Data Preparation, introduces the set of tidyverse packages, which help wrangle large datasets to extract meaningful information to aid the machine learning process.

    Chapter 13, Challenging Data – Too Much, Too Little, Too Complex, considers solutions to a common set of problems that can derail a machine learning project when the useful information is lost within a massive dataset, much like a needle in a haystack.

    Chapter 14, Building Better Learners, reveals the methods employed by the teams at the top of machine learning competition leaderboards. If you have a competitive streak, or simply want to get the most out of your data, you’ll need to add these techniques to your repertoire.

    Chapter 15, Making Use of Big Data, explores the frontiers of machine learning. From working with extremely large datasets to making R work faster, the topics covered will help you push the boundaries of what is possible with R, and even allow you to utilize the sophisticated tools developed by large organizations like Google for image recognition and understanding text data.

    What you need for this book

    The examples in this book were tested with R version 4.2.2 on Microsoft Windows, Mac OS X, and Linux, although they are likely to work with any recent version of R. R can be downloaded at no cost at https://cran.r-project.org/.
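    If you would like to confirm which version of R you are running before working through the examples, a quick check in the R console is shown below (a minimal illustration, not code from the book itself); your output will vary depending on the version installed:

    > R.version.string
    [1] "R version 4.2.2 (2022-10-31)"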

    The RStudio interface, which is described in more detail in Chapter 1, Introducing Machine Learning, is a highly recommended add-on for R that greatly enhances the user experience. The RStudio Open Source Edition is available free of charge from Posit (https://www.posit.co/) alongside a paid RStudio Pro Edition that offers priority support and additional features for commercial organizations.

    Download the example code files

    The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-with-R-Fourth-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Download the color images

    We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/TZ7os.

    Conventions used

    Code in text: function names, filenames, file extensions, and R package names are shown as follows: "The knn() function in the class package provides a standard, classic implementation of the k-NN algorithm."
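    As a brief, self-contained illustration of such a function in action (a minimal sketch using R’s built-in iris data, not an example drawn from the book itself):

    library(class)
    set.seed(1)
    idx <- sample(150, 100)                    # 100 rows for training, 50 for testing
    pred <- knn(train = iris[idx, 1:4],        # numeric features of labeled examples
                test  = iris[-idx, 1:4],       # rows to classify
                cl    = iris$Species[idx],     # true labels for the training rows
                k     = 3)                     # examine the 3 nearest neighbors
    table(pred, actual = iris$Species[-idx])   # compare predictions to true labels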

    R user input and output is written as follows:

    > reg(y = launch$distress_ct, x = launch[2:4])
                              estimate
    Intercept              3.527093383
    temperature           -0.051385940
    field_check_pressure   0.001757009
    flight_num             0.014292843

    New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In RStudio, a new file can be created using the File menu, selecting New File, and choosing the R Notebook option."

    References to additional resources or background information appear like this.

    Helpful tips and important caveats appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: Email [email protected], and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit http://www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

    Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

    Share your thoughts

    Once you’ve read Machine Learning with R - Fourth Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application. 

    The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    Scan the QR code or visit the link below


    https://packt.link/free-ebook/978-1-80107-132-1

    Submit your proof of purchase

    That’s it! We’ll send your free PDF and other benefits to your email directly.

    1

    Introducing Machine Learning

    If science-fiction stories are to be believed, the invention of artificial intelligence inevitably leads to apocalyptic wars between machines and their makers. The stories begin with today’s reality: computers being taught to play simple games like tic-tac-toe and to automate routine tasks. As the stories go, machines are later given control of traffic lights and communications, followed by military drones and missiles. The machines’ evolution takes an ominous turn once the computers become sentient and learn how to teach themselves. Having no more need for human programmers, humankind is then deleted.

    Thankfully, at the time of writing, machines still require user input.

    Though your impressions of machine learning may be colored by these mass-media depictions, today’s algorithms have little danger of becoming self-aware. The goal of today’s machine learning is not to create an artificial brain, but rather to assist us with making sense of and acting on the world’s rapidly accumulating data stores.

    Putting popular misconceptions aside, by the end of this chapter, you will gain a more nuanced understanding of machine learning. You will also be introduced to the fundamental concepts that define and differentiate the most common machine learning approaches. You will learn:

    The origins, applications, ethics, and pitfalls of machine learning

    How computers transform data into knowledge and action

    The steps needed to match a machine learning algorithm with your data

    The field of machine learning provides a set of algorithms that transform data into actionable knowledge. Keep reading to see how easy it is to use R to start applying machine learning to real-world problems.

    The origins of machine learning

    Beginning at birth, we are inundated with data. Our body’s sensors—the eyes, ears, nose, tongue, and nerves—are continually assailed with raw data that our brain translates into sights, sounds, smells, tastes, and textures. Using language, we can share these experiences with others.

    Since the advent of written language, humans have recorded their observations. Hunters monitored the movement of animal herds; early astronomers recorded the alignment of planets and stars; and cities recorded tax payments, births, and deaths. Today, such observations, and many more, are increasingly automated and recorded systematically in ever-growing computerized databases.

    The invention of electronic sensors has additionally contributed to an explosion in the volume and richness of recorded data. Specialized sensors, such as cameras, microphones, chemical noses, electronic tongues, and pressure sensors mimic the human ability to see, hear, smell, taste, and feel. These sensors process the data far differently than a human being would. Unlike a human’s limited and subjective attention, an electronic sensor never takes a break and has no emotions to skew its perception.

    Although sensors are not clouded by subjectivity, they do not necessarily report a single, definitive depiction of reality. Some have an inherent measurement error due to hardware limitations. Others are limited by their scope. A black-and-white photograph provides a different depiction of its subject than one shot in color. Similarly, a microscope provides a far different depiction of reality than a telescope.

    Between databases and sensors, many aspects of our lives are recorded. Governments, businesses, and individuals are recording and reporting information, from the monumental to the mundane. Weather sensors obtain temperature and pressure data; surveillance cameras watch sidewalks and subway tunnels; and all manner of electronic behaviors are monitored: transactions, communications, social media relationships, and many others.

    This deluge of data has led some to state that we have entered an era of big data, but this may be a bit of a misnomer. Human beings have always been surrounded by large amounts of data—one would need only to look to the sky and attempt to count its stars to discover a virtually endless supply. What makes the current era unique is that we have vast amounts of recorded data, much of which can be directly accessed by computers. Larger and more interesting datasets are increasingly accessible at the tips of our fingers, only a web search away. This wealth of information has the potential to inform action, given a systematic way of making sense of it all.

    The field of study dedicated to the development of computer algorithms for transforming data into intelligent action is known as machine learning. This field originated in an environment where the available data, statistical methods, and computing power rapidly and simultaneously evolved. Growth in the volume of data necessitated additional computing power, which in turn spurred the development of statistical methods for analyzing large datasets. This created a cycle of advancement, allowing even larger and more interesting data to be collected, and enabled today’s environment in which endless streams of data are available on virtually any topic.


    Figure 1.1: The cycle of advancement that enabled machine learning

    A closely related sibling of machine learning, data mining, is concerned with the generation of novel insight from large databases. As the term implies, data mining involves a systematic hunt for nuggets of actionable intelligence. Although there is some disagreement over how widely machine learning and data mining overlap, one point of distinction is that machine learning focuses on teaching computers how to use data to solve a problem, while data mining focuses on teaching computers to identify patterns that humans then use to solve a problem.

    Virtually all data mining involves the use of machine learning, but not all machine learning requires data mining. For example, you might apply machine learning to data mine automobile traffic data for patterns related to accident rates. On the other hand, if the computer is learning how to identify traffic signs, this is purely machine learning without data mining.

    The phrase data mining is also sometimes used as a pejorative to describe the deceptive practice of cherry-picking data to support a theory.

    Machine learning is also intertwined with the field of artificial intelligence (AI), which is a nebulous discipline and, depending on whom you ask, is either simply machine learning with a strong marketing spin or a distinct field of study altogether. A cynic might suggest that the field of AI tends to exaggerate its importance, such as by calling a simple predictive model an AI bot, while an AI proponent may point out that the field tends to tackle the most challenging learning tasks while aiming for human-level performance. The truth is somewhere in between.

    Just as machine learning itself depends on statistical methods, artificial intelligence depends a great deal on machine learning, but the business contexts and applications tend to differ. Broadly speaking, traditional statistics relies primarily on the human to drive insights, AI seeks to minimize human involvement as much as possible, and machine learning sits firmly at the intersection of the two, in a human and computer partnership; keep in mind, however, that the lines between the three disciplines are often less rigid than they may appear.

    Learning how to maximize the human-machine partnership and apply learning algorithms to real-world problems is the focus of this book. Understanding the use cases and limitations of machine learning is an important starting point in this journey.

    Uses and abuses of machine learning

    Most people have heard of Deep Blue, the chess-playing computer that in 1997 was the first to win a game against a world champion. Another famous computer, Watson, defeated two human opponents on the television trivia game show Jeopardy in 2011. Based on these stunning accomplishments, some have speculated that computer intelligence will replace workers in information technology occupations, just as automobiles replaced horses and machines replaced workers in fields and assembly lines. Recently, these fears have become more pronounced as artificial intelligence-based algorithms, such as GPT-3 and DALL·E 2 from the OpenAI research group (https://openai.com/), have reached impressive milestones and are proving that computers are capable of writing text and creating artwork that is virtually indistinguishable from that produced by humans. Ultimately, this may lead to massive shifts in occupations like marketing, customer support, illustration, and so on, as creativity is outsourced to machines that can produce endless streams of material more cheaply than the former employees.

    In this case, humans may still be necessary because the truth is that even as machines reach such impressive milestones, they are still relatively limited in their ability to thoroughly understand a problem, or understand how the work is going to be applied toward a real-world goal. Learning algorithms are pure intellectual horsepower without direction. A computer may be more capable than a human of finding subtle patterns in large databases, but it still needs a human to motivate the analysis and turn the result into meaningful action. In most cases, the human will determine whether the machine’s output is valuable and will help the machine avoid creating a limitless supply of nonsense.

    Without completely discounting the achievements of Deep Blue and Watson, it is important to note that neither is even as intelligent as a typical five-year-old. For more on why comparing smarts is a slippery business, see the Popular Science article FYI, Which Computer Is Smarter, Watson Or Deep Blue?, by Will Grunewald, 2012: https://www.popsci.com/science/article/2012-12/fyi-which-computer-smarter-watson-or-deep-blue.

    Machines are not good at asking questions or even knowing what questions to ask. They are much better at answering them, provided the question is stated in a way that the computer can comprehend. Present-day machine learning algorithms partner with people much like a bloodhound works with its trainer: the dog’s sense of smell may be many times stronger than its master’s, but without being carefully directed, the hound may end up chasing its tail.


    Figure 1.2: Machine learning algorithms are powerful tools that require careful direction

    In the worst-case scenario, if machine learning were implemented carelessly, it might lead to what controversial tech billionaire Elon Musk provocatively called summoning the demon. This perspective suggests that we may be unleashing forces outside our control, despite the hubristic sense that we will be able to rein them in when needed. Given the power of artificial intelligence to automate processes and react to changing conditions much faster and more objectively than humans, there may come a point at which Pandora’s box has been opened and it is difficult or impossible to return to the old ways of life where humans are in control. As Musk describes:

    If AI has a goal and humanity just happens to be in the way, it will destroy humanity as a matter of course without even thinking about it. No hard feelings… It’s just like, if we’re building a road and an anthill just happens to be in the way, we don’t hate ants, we’re just building a road, and so, goodbye anthill.

    While this may seem to be a bleak portrayal, it is still the realm of far-future science fiction, as you will soon learn when reading about the present day’s state-of-the-art machine learning successes.

    However, Musk’s warning does help emphasize the importance of understanding the likelihood of machine learning and AI being a double-edged sword. For all of its benefits, there are some places where it still has room for improvement, and some situations where it may do more harm than good. If machine learning practitioners cannot be trusted to act ethically, it may be necessary for governments to intervene to prevent the greatest harm to society.

    For more on Musk’s fears of summoning the demon see the following 2018 article from CNBC: https://www.cnbc.com/2018/04/06/elon-musk-warns-ai-could-create-immortal-dictator-in-documentary.html.

    Machine learning successes

    Machine learning is most successful when it augments the specialized knowledge of a subject-matter expert rather than replacing the expert altogether. It works with medical doctors at the forefront of the fight to eradicate cancer; assists engineers with efforts to create smarter homes and automobiles; helps social scientists and economists build better societies; and provides business and marketing professionals with valuable insights. Toward these ends, it is employed in countless scientific laboratories, hospitals, companies, and governmental organizations. Any effort that generates or aggregates data likely employs at least one machine learning algorithm to help make sense of it.

    Though it is impossible to list every successful application of machine learning, a selection of prominent examples is as follows:

    Identification of unwanted spam messages in email

    Segmentation of customer behavior for targeted advertising

    Forecasts of weather behavior and long-term climate changes

    Preemptive interventions for customers likely to churn (stop purchasing)

    Reduction of fraudulent credit card transactions

    Actuarial estimates of financial damage from storms and natural disasters

    Prediction of and influence over election outcomes

    Development of algorithms for auto-piloting drones and self-driving cars

    Optimization of energy use in homes and office buildings

    Projection of areas where criminal activity is most likely

    Discovery of genetic sequences useful for precision medicine

    By the end of this book, you will understand the basic machine learning algorithms that are employed to teach computers to perform these tasks. For now, it suffices to say that no matter the context, the fundamental machine learning process is the same. In every task, an algorithm takes data and identifies patterns that form the basis for further action.

    The limits of machine learning

    Although machine learning is used widely and has tremendous potential, it is important to understand its limits. The algorithms used today—even those on the cutting edge of artificial intelligence—emulate a relatively limited subset of the capabilities of the human brain. They offer little flexibility to extrapolate outside of strict parameters and know no common sense. Considering this, one should be extremely careful to recognize exactly what an algorithm has learned before setting it loose in the real world.

    Without a lifetime of past experiences to build upon, computers are limited in their ability to make simple inferences about logical next steps. Consider the banner advertisements on websites, which are served according to patterns learned by data mining the browsing history of millions of users. Based on this data, someone who views websites selling mattresses is interested in buying a mattress and should therefore see advertisements for mattresses. The problem is that this becomes a never-ending cycle in which, even after a mattress has been purchased, additional mattress advertisements are shown, rather than advertisements for pillows and bed sheets.

    Many people are familiar with the deficiencies of machine learning’s ability to understand or translate language, or to recognize speech and handwriting. Perhaps the earliest example of this type of failure is in a 1994 episode of the television show The Simpsons, which showed a parody of the Apple Newton tablet. In its time, the Newton was known for its state-of-the-art handwriting recognition. Unfortunately for Apple, it would occasionally fail to great effect. The television episode illustrated this through a sequence in which a bully’s note to Beat up Martin was misinterpreted by the Newton as Eat up Martha.


    Figure 1.3: Screen captures from Lisa on Ice, The Simpsons, 20th Century Fox (1994)

    Machine language processing has improved enough in the time since the Apple Newton that Google, Apple, and Microsoft are all confident in their ability to offer voice-activated virtual concierge services, such as Google Assistant, Siri, and Cortana. Still, these services routinely struggle to answer relatively simple questions. Furthermore, online translation services sometimes misinterpret sentences that a toddler would readily understand, and the predictive text feature on many devices has led to humorous autocorrect fail websites that illustrate how computers can grasp basic language yet completely misunderstand context.

    Some of these mistakes are to be expected. Language is complicated, with multiple layers of text and subtext, and even human beings sometimes misunderstand context. Although machine learning is rapidly improving at language processing, and current state-of-the-art algorithms like GPT-3 are quite good in comparison to prior generations, machines still make mistakes that are obvious to humans who know where to look. These predictable shortcomings illustrate the important fact that machine learning is only as good as the data it has learned from. If context is not explicit in the input data, then just like a human, the computer will have to make its best guess from its set of past experiences. However, the computer’s past experiences are usually much more limited than the human’s.

    Machine learning ethics

    At its core, machine learning is simply a tool that assists us with making sense of the world’s complex data. Like any tool, it can be used for good or evil. Machine learning goes wrong mostly when it is applied so broadly, or so callously, that humans are treated as lab rats, automata, or mindless consumers. A process that may seem harmless can lead to unintended consequences when automated by an emotionless computer. For this reason, those using machine learning or data mining would be remiss not to at least briefly consider the ethical implications of the art.

    Due to the relative youth of machine learning as a discipline and the speed at which it is progressing, the associated legal issues and social norms are often quite uncertain, and constantly in flux. Caution should be exercised when obtaining or analyzing data in order to avoid breaking laws, violating terms of service or data use agreements, or abusing the trust or violating the privacy of customers or the public. The informal corporate motto of Google, an organization that collects perhaps more data on individuals than any other, was, at one time, don’t be evil. While this seems clear enough, it may not be sufficient. A better approach may be to follow the Hippocratic Oath, a medical principle that states, above all, do no harm. Following the principle of do no harm may have helped avoid recent scandals at Facebook and other companies, such as the Cambridge Analytica controversy, in which social media data was allegedly used to manipulate elections.

    Retailers routinely use machine learning for advertising, targeted promotions, inventory management, or the layout of items in a store. Many have equipped checkout lanes with devices that print coupons for promotions based on a customer’s buying history. In exchange for a bit of personal data, the customer receives discounts on the specific products they want to buy. At first, this may appear relatively harmless, but consider what happens when this practice is taken a bit further.

    One possibly apocryphal tale concerns a large retailer in the United States that employed machine learning to identify expectant mothers for coupon mailings. The retailer hoped that if these mothers-to-be received substantial discounts, they would become loyal customers who would later purchase profitable items such as diapers, baby formula, and toys. Equipped with machine learning methods, the retailer identified items in the customer purchase history that could be used to predict with a high degree of certainty not only whether a woman was pregnant, but also the approximate timing for when the baby was due.

    After the retailer used this data for a promotional mailing, an angry man contacted the chain and demanded to know why his young daughter received coupons for maternity items. He was furious that the retailer seemed to be encouraging teenage pregnancy! As the story goes, when the retail chain called to offer an apology, it was the father who ultimately apologized after confronting his daughter and discovering that she was indeed pregnant!

    For more detail on how retailers use machine learning to identify pregnancies, see the New York Times Magazine article titled How Companies Learn Your Secrets, by Charles Duhigg, 2012: https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html.

    Whether the story was completely true or not, the lesson learned from the preceding tale is that common sense should be applied before blindly applying the results of a machine learning analysis. This is particularly true in cases where sensitive information, such as health data, is concerned. With a bit more care, the retailer could have foreseen this scenario and used greater discretion when choosing how to reveal the pregnancy status its machine learning analysis had discovered. Unfortunately, as history tends to repeat itself, social media companies have been under fire recently for targeting expectant mothers with advertisements for baby products even after these mothers experience the tragedy of a miscarriage.

    Because machine learning algorithms are developed with historical data, computers may learn some unfortunate behaviors of human societies. Sadly, this sometimes includes perpetuating race or gender discrimination and reinforcing negative stereotypes. For example, researchers found that Google’s online advertising service was more likely to show ads for high-paying jobs to men than women and was more likely to display ads for criminal background checks to black people than white people. Although the machine may have correctly learned that men once held jobs that were not offered to most women, it is not desirable to have the algorithm perpetuate such injustices. Instead, it may be necessary to teach the machine to reflect society not as it currently is, but how it ought to be.

    Sometimes, algorithms that are specifically designed with the intention of being content-neutral eventually come to reflect undesirable beliefs or ideologies. In one egregious case, a Twitter chatbot service developed by Microsoft was quickly taken offline after it began spreading Nazi and anti-feminist propaganda, which it may have learned from so-called trolls posting inflammatory content on internet forums and chat rooms. In another case, an algorithm created to reflect an objective conception of human beauty sparked controversy when it favored almost exclusively white people. Imagine the consequences if this had been applied to facial recognition software for criminal activity!

    For more information about the real-world consequences of machine learning and discrimination see the Harvard Business Review article Addressing the Biases Plaguing Algorithms, by Michael Li, 2019: https://hbr.org/2019/05/addressing-the-biases-plaguing-algorithms.

    To limit the ability of algorithms to discriminate illegally, certain jurisdictions have well-intentioned laws that prevent the use of racial, ethnic, religious, or other protected class data for business reasons. However, excluding this data from a project may not be enough because machine learning algorithms can still inadvertently learn to discriminate. If a certain segment of people tends to live in a certain region, buys a certain product, or otherwise behaves in a way that uniquely identifies them as a group, machine learning algorithms can infer the protected information from other factors. In such cases, you may need to completely de-identify these people by excluding any potentially identifying data in addition to the already-protected statuses.
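    To make this idea of proxy inference concrete, the following is a minimal, hypothetical simulation in R (not an example from the book; the variable names and numbers are invented purely for illustration). Even though the protected attribute is never given to the model, a correlated proxy lets the model recover it:

    set.seed(123)
    n <- 1000
    protected <- rbinom(n, 1, 0.5)            # protected class membership (0/1)
    # suppose class members tend to live in regions 1-3, and others in regions 3-5
    region <- ifelse(protected == 1,
                     sample(1:3, n, replace = TRUE),
                     sample(3:5, n, replace = TRUE))
    # a model trained only on the proxy still predicts the protected attribute
    fit <- glm(protected ~ factor(region), family = binomial)
    pred <- as.numeric(predict(fit, type = "response") > 0.5)
    mean(pred == protected)                   # accuracy well above the 0.5 baseline

    In this toy setup, region alone recovers the protected status for most individuals, which is precisely why excluding the protected column is not, by itself, a guarantee of fairness.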

    In a recent example of this type of alleged algorithmic bias, the Apple credit card, which debuted in 2019, was almost immediately accused of providing substantially higher credit limits to men than to women, sometimes by a factor of 10 or 20, even for spouses with joint assets and similar credit histories. Although Apple and the issuing bank, Goldman Sachs, denied that gender bias was at play and confirmed that no legally protected applicant characteristics were used in the algorithm, this did not slow speculation that some bias may have crept in unintentionally. It did not help matters that, for competitive reasons, Apple and Goldman Sachs chose to keep the details of the algorithm secret, which led people to assume the worst. If the allegations of systematic bias were untrue, being able to explain what was truly happening and exactly how the decisions were made might have alleviated much of the outrage. A potential worst-case scenario would have occurred if Apple and Goldman Sachs had been investigated yet could not explain the results to regulators due to the algorithm’s complexity!

    The Apple credit card fiasco is described in a 2019 BBC article, Apple’s ‘sexist’ credit card investigated by US regulator: https://www.bbc.com/news/business-50365609.

    Apart from the legal consequences, customers may feel uncomfortable or become upset if aspects of their lives they consider private are made public. The challenge is that privacy expectations differ across people and contexts. To illustrate, imagine driving by someone’s house and incidentally glancing through the window; this is unlikely to offend most people. Using a camera to take a picture from across the street, in contrast, is likely to make most people feel uncomfortable, and walking up to the house and pressing a face against the glass to peer inside is likely to anger virtually everybody. Although all three scenarios arguably use public information, two of the three cross a line that will upset most people. In much the same way, it is possible to go too far with the use of data and cross a threshold that many will see as inconsiderate at best and creepy at worst.

    Just as computing hardware and statistical methods kicked off the big data era, these same advances have also unlocked a post-privacy era in which many aspects of our lives that were once private are now public, or available to the public at a price. Even prior to the big data era, it would have been possible to learn a great deal about someone by observing public information. Watching their comings and goings may reveal information about their occupation or leisure activity, and a quick glance at their trash and recycling bins may reveal what they eat, drink, and read. A private investigator could learn even more with a bit of focused digging and observation. Companies applying machine learning methods to large datasets are essentially acting as large-scale private investigators, and while they claim to be working on anonymized datasets, many still argue that the companies have gone too far with their digital surveillance.

    In recent years, some high-profile web applications have experienced a mass exodus of users who felt exploited when the applications’ terms of service changed or their data was used for purposes beyond what the users had originally intended. The fact that privacy expectations differ by context, age cohort, and locale adds complexity to deciding on the appropriate use of personal data. It would be wise to consider the cultural implications of your work before you begin a project, in addition to being aware of ever-more-restrictive regulations such as the European Union’s General Data Protection Regulation (GDPR) and the inevitable policies that will follow in its footsteps.

    The fact that you can use data for a particular end does not always mean that you should.

    Finally, it is important to note that as machine learning algorithms become progressively more important to our everyday lives, there are greater incentives for nefarious actors to exploit them. Sometimes, attackers simply want to disrupt algorithms for laughs or notoriety, as with Google bombing, the crowdsourced method of tricking Google’s algorithms into ranking a desired page highly. Other times, the effects are more dramatic. A timely example is the recent wave of so-called fake news and election meddling, propagated via the manipulation of advertising and recommendation algorithms that target people according to their personalities. To avoid handing such control to outsiders, when building machine learning systems it is crucial to consider how they may be influenced by a determined individual or crowd.

    Social media scholar danah boyd (styled lowercase) presented a keynote at the Strata Data Conference 2017 in New York City that discussed the importance of hardening machine learning algorithms against attackers. For a recap, refer to https://points.datasociety.net/your-data-is-being-manipulated-a7e31a83577b.

    The consequences of malicious attacks on machine learning algorithms can also be deadly. Researchers have shown that an adversarial attack that subtly distorts a road sign with carefully chosen graffiti might cause an autonomous vehicle to misinterpret a stop sign, potentially resulting in a fatal crash. Even in the absence of ill intent, software bugs and human errors have already led to fatal accidents in autonomous vehicle technology from Uber and Tesla. With such examples in mind, it is of the utmost practical and ethical importance that machine learning practitioners consider how their algorithms will be used and abused in the real world.

    How machines learn

    A formal definition of machine learning, attributed to computer scientist Tom M. Mitchell, states that a machine learns whenever it utilizes its experience such that its performance improves on similar experiences in the future. Although this definition makes sense intuitively, it completely ignores the process of exactly how experience is translated into future action—and, of course, learning is always easier said than done!

    Whereas human brains are naturally capable of learning from birth, the conditions necessary for computers to learn must be made explicit by the programmer hoping to utilize machine learning methods. For this reason, although it is not strictly necessary to understand the theoretical basis of learning, a strong theoretical foundation helps the practitioner understand, distinguish, and implement machine learning algorithms.

    As you relate machine learning to human learning, you may find yourself examining your own mind in a different light.

    Regardless of whether the learner is a human or a machine, the basic learning process is the same. It can be divided into four interrelated components, illustrated together in the brief R sketch that follows Figure 1.4:

    Data storage utilizes observation, memory, and recall to provide a factual basis for further reasoning.

    Abstraction involves the translation of stored data into broader representations and concepts.

    Generalization uses abstracted data to create knowledge and inferences that drive action in new contexts.

    Evaluation provides a feedback mechanism to measure the utility of learned knowledge and inform potential improvements.


    Figure 1.4: The four steps in the learning process
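    To make these four components concrete, the following is a minimal sketch in R. It is not an example from this book’s case studies; the built-in iris dataset and the choice of a decision tree from the rpart package are assumptions made purely for illustration:

    # minimal sketch of the four learning components
    # assumes the rpart package is installed: install.packages("rpart")
    library(rpart)

    # 1. data storage: load observations into working memory
    data(iris)

    # 2. abstraction: translate the stored data into a model
    model <- rpart(Species ~ ., data = iris)

    # 3. generalization: apply the abstraction to a previously unseen case
    new_flower <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.5,
                             Petal.Length = 1.5, Petal.Width = 0.25)
    predict(model, new_flower, type = "class")

    # 4. evaluation: measure the utility of the learned knowledge
    # (checked against the training data here only for brevity; a proper
    # evaluation would use unseen test data)
    mean(predict(model, iris, type = "class") == iris$Species)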

    Although the learning process has been conceptualized here as four distinct components, they are merely organized this way for illustrative purposes. In reality, the entire learning process is inextricably linked. In human beings, the process occurs subconsciously. We recollect, deduce, induct, and intuit within the confines of our mind’s eye, and because this process is hidden, any differences from person to person are attributed to a vague notion of subjectivity. In contrast, computers make these processes explicit, and because the entire process is transparent, the learned knowledge can be examined, transferred, utilized for future action, and studied as a science of data.

    The data science buzzword suggests a relationship between the data, the machine, and the people who guide the learning process. The term’s growing use in job descriptions and academic degree programs reflects its operationalization as a field of study concerned with both statistical and computational theory, as well as the technological infrastructure enabling machine learning and its applications. The field often asks its practitioners to be compelling storytellers, balancing an audacity in the use of data with the limitations of what one may infer and forecast from it. To be a strong data scientist, therefore, requires a strong understanding of how the learning algorithms work in the context of a business application, as we will discuss in greater detail in Chapter 11, Being Successful with Machine Learning.

    Data storage

    All learning begins with data. Humans and computers alike utilize data storage as a foundation for more advanced reasoning. In a human being, this consists of a brain that uses electrochemical signals in a network of biological cells to store and process observations for short- and long-term future recall. Computers have similar capabilities of short- and long-term recall using hard disk drives, flash memory, and random-access memory (RAM) in combination with a central processing unit (CPU).

    It may seem obvious, but the ability to store and retrieve data alone is insufficient for learning. Stored data is merely ones and zeros on a disk. It is a collection of memories, meaningless without a broader context. Without a higher level of understanding, knowledge is purely recall, limited to what has been seen before and nothing else.

    To better understand the nuances of this idea, it may help to think about the last time you studied for a difficult test, perhaps for a university final exam or a career certification. Did you wish for an eidetic (photographic) memory? If so, you may be disappointed to learn that perfect recall would be unlikely to offer much assistance. Even if you could memorize material perfectly, this rote learning would provide no benefit without knowing the exact questions and answers that would appear on the exam. Otherwise, you would need to memorize answers to every question that could conceivably be asked, on a subject for which there is likely an infinite number of possible questions. Obviously, this is an unsustainable strategy.

    Instead, a better approach is to spend your time selectively, memorizing a relatively small set of representative ideas while developing an understanding of how those ideas relate and apply to unforeseen circumstances. In this way, you identify important broader patterns rather than memorizing every detail, nuance, and potential application.

    Abstraction

    This work of assigning a broader meaning to stored data occurs during the abstraction process, in which raw data comes to represent a wider, more abstract concept or idea. This type of connection, say between an object and its representation, is exemplified by the famous René Magritte painting The Treachery of Images.


    Figure 1.5: This is not a pipe. Source: http://collections.lacma.org/node/239578

    The painting depicts a tobacco pipe with the caption Ceci n’est pas une pipe (This is not a pipe). The point Magritte was illustrating is that a representation of a pipe is not truly a pipe. Yet, despite the fact that the pipe is not real, anybody viewing the painting easily recognizes it as a pipe. This suggests that observers can connect the picture of a pipe to the idea of a pipe, to a memory of a physical pipe that can be held in the hand. Abstracted connections like this are the basis of knowledge representation, the formation of logical structures that assist with turning raw sensory information into meaningful insight.

    Bringing this concept full circle, knowledge representation is what allows artificial intelligence-based tools like Midjourney (https://www.midjourney.com) to paint, virtually, in the style of René Magritte. The following image was generated entirely by artificial intelligence based on the algorithm’s understanding of concepts like robot, pipe, and smoking. If he were alive today, Magritte himself might find it surreal that his own surrealist work, which challenged human conceptions of reality and the connections between images and ideas, is now incorporated into the minds of computers and, in a roundabout way, is connecting machines’ ideas and images to reality. Machines learned what a pipe is, in part, by viewing images of pipes in artwork like Magritte’s.

    Figure 1.6: “Am I a pipe?” Image created by the Midjourney AI with the prompt “robot smoking a pipe in the style of a René Magritte painting”

    To reify the process of knowledge representation within an algorithm, the computer summarizes stored raw data using a model, an explicit description of the patterns within the data. Just like Magritte’s pipe, the model representation takes on a life beyond the raw data. It represents an idea greater than the sum of its parts.

    There are many different types of models, a few of which are demonstrated in the short R sketch after this list. You may already be familiar with some. Examples include:

    Mathematical equations

    Relational diagrams, such as trees and graphs

    Logical if/else rules

    Groupings of data known as clusters
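    As a rough sketch of how two of these model types look in R, consider the following. The built-in cars and iris datasets used here are illustrative assumptions rather than examples from this book:

    # a mathematical equation: stopping distance as a linear function of speed
    eq_model <- lm(dist ~ speed, data = cars)
    coef(eq_model)    # the learned intercept and slope

    # groupings of data known as clusters: k-means on two flower measurements
    set.seed(123)     # k-means starts from random centers, so fix the seed
    clusters <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3)
    table(clusters$cluster)

    In the first case, the model is literally an equation of the form dist = a + b * speed; in the second, the model is a set of three cluster centers that group similar flowers together.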

    The choice of model is typically not left up to the machine. Instead, the learning task and the type of data on hand inform model selection. Later in this chapter, we will discuss in more detail the methods for choosing the appropriate model type.

    Fitting a model to a dataset is known as training. When the model has been trained, the data has been transformed into an abstract form that summarizes the original information. The fact that this step is called training rather than learning reveals a couple of interesting aspects of the process. First, note that the process of learning does not end with data abstraction—the learner must still generalize and evaluate its training. Second, the word training better connotes the fact that the human teacher trains the machine student to use the data toward a specific end.
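    As a small illustration of this idea, the following sketch (again using a built-in R dataset as an assumption, not an example from this book) shows how training condenses raw observations into an abstraction that can act on their behalf:

    # training a simple model: the built-in women dataset stores
    # 15 raw observations of height and weight
    nrow(women)

    # fitting (training) a linear model abstracts those 15 rows
    # into just two numbers, an intercept and a slope
    trained <- lm(weight ~ height, data = women)
    coef(trained)

    # the trained model can now act in place of the raw data
    predict(trained, newdata = data.frame(height = 66))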

    The distinction between training and learning is subtle but important. The computer doesn’t learn a model, because this would imply that there is a single correct model to be learned. Of course, the computer must learn something about the data
