Machine Learning with R: Learn techniques for building and improving machine learning models, from data preparation to model tuning, evaluation, and working with big data

Ebook, 1,743 pages, 13 hours

Author: Brett Lantz
ISBN: 9781801076050

About this ebook

Learn how to solve real-world data problems using machine learning and R

Purchase of the print or Kindle book includes a free eBook in PDF format.

Key Features

The 10th Anniversary Edition of the bestselling R machine learning book, updated with 50% new content for R 4.0.0 and beyond

Harness the power of R to build flexible, effective, and transparent machine learning models

Learn quickly with this clear, hands-on guide by machine learning expert Brett Lantz

Book Description

Machine learning, at its core, is concerned with transforming data into actionable knowledge. R offers a powerful set of machine learning methods to quickly and easily gain insight from your data.

Machine Learning with R, Fourth Edition, provides a hands-on, accessible, and readable guide to applying machine learning to real-world problems. Whether you are an experienced R user or new to the language, Brett Lantz teaches you everything you need to know for data pre-processing, uncovering key insights, making new predictions, and visualizing your findings. This 10th Anniversary Edition features several new chapters that reflect the progress of machine learning in the last few years and help you build your data science skills and tackle more challenging problems, including making successful machine learning models and advanced data preparation, building better learners, and making use of big data.

You'll also find this classic R data science book updated to R 4.0.0 with newer and better libraries, advice on ethical and bias issues in machine learning, and an introduction to deep learning. Whether you're looking to take your first steps with R for machine learning or to make sure your skills and knowledge are up to date, this is an unmissable read that will help you find powerful new insights in your data.

What you will learn

Learn the end-to-end process of machine learning from raw data to implementation

Classify important outcomes using nearest neighbor and Bayesian methods

Predict future events using decision trees, rules, and support vector machines

Forecast numeric data and estimate financial values using regression methods

Model complex processes with artificial neural networks

Prepare, transform, and clean data using the tidyverse

Evaluate your models and improve their performance

Connect R to SQL databases and emerging big data technologies such as Spark, Hadoop, H2O, and TensorFlow

Who this book is for

This book is designed to help data scientists, actuaries, data analysts, financial analysts, social scientists, business and machine learning students, and any other practitioners who want a clear, accessible guide to machine learning with R. No R experience is required, although prior exposure to statistics and programming is helpful.

Language: English
Publisher: Packt Publishing
Release date: May 29, 2023

Author

Brett Lantz

Brett Lantz has spent the past 10 years using innovative data methods to understand human behavior. A sociologist by training, he was first enchanted by machine learning while studying a large database of teenagers' social networking website profiles. Since then, he has worked on interdisciplinary studies of cellular telephone calls, medical billing data, and philanthropic activity, among others. When he's not spending time with family, following college sports, or being entertained by his dachshunds, he maintains dataspelunking.com, a website dedicated to sharing knowledge about the search for insight in data.

Skip carousel

How And Where You Use Machine-learning
APC
Article
How And Where You Use Machine-learning
Oct 7, 2019
4 min read
Why We Need To Fear The Risk Of AI Model Collapse
Evening Standard
Article
Why We Need To Fear The Risk Of AI Model Collapse
Dec 17, 2023
4 min read
Quantum Leap
Marketing
Article
Quantum Leap
Jul 11, 2019
6 min read
Q&A
Rotman Management
Article
Q&A
May 1, 2023
Describe the capability that companies like Netflix, UPS, Amazon and Caesars Entertainment have in common. These are all leading firms in their industries with respect to leveraging analytics as a source of competitive advantage. We now have so much
7 min read
Leadership Forum: Making Digital Transformation A Reality
Rotman Management
Article
Leadership Forum: Making Digital Transformation A Reality
Jan 1, 2018
Glenda Crisp Senior Vice President and Chief Data Officer, TD Bank Group + Connie Bonello Associate Partner, Financial Services, IBM Canada IN MOST OF TODAY’S ORGANIZATIONS, data underpins every transaction, operation and interaction. And yet, the ab
8 min read
Q&A: OPENAI CTO MIRA MURATI ON SHEPHERDING CHATGPT
TechLife News
Article
Q&A: OPENAI CTO MIRA MURATI ON SHEPHERDING CHATGPT
Apr 29, 2023
4 min read
Q&A: OPENAI CTO MIRA MURATI ON SHEPHERDING CHATGPT
AppleMagazine
Article
Q&A: OPENAI CTO MIRA MURATI ON SHEPHERDING CHATGPT
Apr 28, 2023
4 min read
Questions for Angela Zutavern, Machine Intelligence Expert, Booz Allen Hamilton
Rotman Management
Article
Questions for Angela Zutavern, Machine Intelligence Expert, Booz Allen Hamilton
Jan 1, 2018
You believe that the world of leadership has hit an inflection point. How so? As useful as popular mental models and heuristics are, machine models now outstrip human performance in about half of the portfolio of cognitive tasks. Going forward, we wi
6 min read
Why Your Organisation Needs To Lift Its Data Game
NZBusiness and Management
Article
Why Your Organisation Needs To Lift Its Data Game
Oct 22, 2019
From problems stemming from the recent New Zealand census to data collected by Facebook, data has been in the news a lot lately. It may seem obvious that large organisations such as Statistics New Zealand and Facebook need to continually improve thei
3 min read
The Era of Human + Machine Innovation
Rotman Management
Article
The Era of Human + Machine Innovation
Jan 1, 2019
Interview by Karen Christensen In today's environment, organizations that don't keep up with customers' evolving needs are doomed. What is the best way to get a handle on these evolving needs? The first step in understanding your customers is to acce
5 min read
Pivoting To First-party Data
NZ Marketing
Article
Pivoting To First-party Data
Jun 9, 2021
5 min read
Getting The edge
The European Business Review
Article
Getting The edge
Feb 25, 2021
7 min read
Jobs Of The Future
True Love
Article
Jobs Of The Future
Jan 26, 2023
5 min read
There’s A New Career In Town
True Love
Article
There’s A New Career In Town
Oct 21, 2019
2 min read
‘MBAs THAT DON’T FOCUS ON DATA & TECH WON’T DO WELL’
Business Today
Article
‘MBAs THAT DON’T FOCUS ON DATA & TECH WON’T DO WELL’
Oct 28, 2022
6 min read
AI Industry Is Influencing The World. Mozilla Adviser Abeba Birhane Is Challenging Its Core Values
The Independent
Article
AI Industry Is Influencing The World. Mozilla Adviser Abeba Birhane Is Challenging Its Core Values
Jul 22, 2024
3 min read
How To Make Sense From And With AI ?
The European Business Review
Article
How To Make Sense From And With AI ?
Sep 25, 2021
4 min read
Adoption of Cognitive Computing Across Various Industries
Techfastly
Article
Adoption of Cognitive Computing Across Various Industries
Dec 1, 2021
5 min read
Generative AI: What Leaders Need To Know
Rotman Management
Article
Generative AI: What Leaders Need To Know
Jan 1, 2024
12 min read
Searching For Privacy
NZ Marketing
Article
Searching For Privacy
Dec 8, 2021
6 min read
The Deep Learning Revolution For Artificial Intelligence
Facility Management
Article
The Deep Learning Revolution For Artificial Intelligence
Mar 28, 2019
3 min read
Inform And Enhance Your Business With Open Data
PC Pro Magazine
Article
Inform And Enhance Your Business With Open Data
Jun 10, 2021
7 min read
Understanding 'Big Data' and What It Means to Your Business
Entrepreneur
Article
Understanding 'Big Data' and What It Means to Your Business
May 1, 2013
2 min read
AI Industry Is Influencing The World. Mozilla Adviser Abeba Birhane Is Challenging Its Core Values
AppleMagazine
Article
AI Industry Is Influencing The World. Mozilla Adviser Abeba Birhane Is Challenging Its Core Values
Jul 26, 2024
2 min read
AI Industry Is Influencing The World. Mozilla Adviser Abeba Birhane Is Challenging Its Core Values
TechLife News
Article
AI Industry Is Influencing The World. Mozilla Adviser Abeba Birhane Is Challenging Its Core Values
Jul 27, 2024
2 min read
Signals Of Change: how To Evolve For The New Global Reality
Rotman Management
Article
Signals Of Change: how To Evolve For The New Global Reality
May 1, 2022
11 min read
Finding Your Data
APC
Article
Finding Your Data
Sep 9, 2019
4 min read
Federated Learning Uses The Data Right On Our Devices
Futurity
Article
Federated Learning Uses The Data Right On Our Devices
Jul 21, 2022
2 min read
Things Get Strange When AI Starts Training Itself
The Atlantic
Article
Things Get Strange When AI Starts Training Itself
Feb 16, 2024
7 min read
Data Fabric
PC Pro Magazine
Article
Data Fabric
Aug 13, 2020
3 min read

Related categories

Skip carousel

Reviews for Machine Learning with R

Rating: 0 out of 5 stars

0 ratings

0 ratings0 reviews

Book preview

Machine Learning with R - Brett Lantz

cover.png

Machine Learning with R

Fourth Edition

Learn techniques for building and improving machine learning models, from data preparation to model tuning, evaluation, and working with big data

Brett Lantz

BIRMINGHAM—MUMBAI

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Lead Senior Publishing Product Manager: Tushar Gupta

Acquisition Editor – Peer Reviews: Saby Dsilva

Project Editor: Janice Gonsalves

Content Development Editors: Bhavesh Amin and Elliot Dallow

Copy Editor: Safis Editor

Technical Editor: Karan Sonawane

Indexer: Hemangini Bari

Presentation Designer: Pranit Padwal

Developer Relations Marketing Executive: Monika Sangwan

First published: October 2013

Second edition: July 2015

Third edition: April 2019

Fourth edition: May 2023

Production reference: 1190523

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80107-132-1

www.packt.com

Contributors

About the author

Brett Lantz (@DataSpelunking) has spent more than 15 years using innovative data methods to understand human behavior. A sociologist by training, Brett was first captivated by machine learning while studying a large database of teenagers’ social network profiles. Brett is a DataCamp instructor and has taught machine learning workshops around the world. He is known to geek out about data science applications for sports, video games, autonomous vehicles, and foreign language learning, among many other subjects, and hopes to eventually blog about such topics at dataspelunking.com.

It is hard to describe how much my world has changed since the first edition of this book was published nearly ten years ago! My sons Will and Cal were born amidst the first and second editions, respectively, and have grown alongside my career. This edition, which consumed two years of weekends, would have been impossible without the backing of my wife, Jessica. Many thanks are due also to the friends, mentors, and supporters who opened the doors that led me along this unexpected data science journey.

About the reviewer

Daniel D. Gutierrez is an independent consultant in data science through his firm AMULET Analytics. He’s also a technology journalist, serving as Editor-in-Chief for insideBIGDATA.com, where he enjoys keeping his finger on the pulse of this fast-paced industry. Daniel is also an educator, having taught data science, machine learning and R classes at university level for many years. He currently teaches data science for UCLA Extension. He has authored four computer industry books on database and data science technology, including his most recent title, Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R. Daniel holds a BS in Mathematics and Computer Science from UCLA.

Join our book’s Discord space

Join our Discord community to meet like-minded people and learn alongside more than 4000 people at:

https://packt.link/r

Preface

Who this book is for

What this book covers

What you need for this book

Get in touch

Introducing Machine Learning

The origins of machine learning

Uses and abuses of machine learning

Machine learning successes

The limits of machine learning

Machine learning ethics

How machines learn

Data storage

Abstraction

Generalization

Evaluation

Machine learning in practice

Types of input data

Types of machine learning algorithms

Matching input data to algorithms

Machine learning with R

Installing R packages

Loading and unloading R packages

Installing RStudio

Why R and why R now?

Summary

Managing and Understanding Data

R data structures

Vectors

Factors

Lists

Data frames

Matrices and arrays

Managing data with R

Saving, loading, and removing R data structures

Importing and saving datasets from CSV files

Importing common dataset formats using RStudio

Exploring and understanding data

Exploring the structure of data

Exploring numeric features

Measuring the central tendency – mean and median

Measuring spread – quartiles and the five-number summary

Visualizing numeric features – boxplots

Visualizing numeric features – histograms

Understanding numeric data – uniform and normal distributions

Measuring spread – variance and standard deviation

Exploring categorical features

Measuring the central tendency – the mode

Exploring relationships between features

Visualizing relationships – scatterplots

Examining relationships – two-way cross-tabulations

Summary

Lazy Learning – Classification Using Nearest Neighbors

Understanding nearest neighbor classification

The k-NN algorithm

Measuring similarity with distance

Choosing an appropriate k

Preparing data for use with k-NN

Why is the k-NN algorithm lazy?

Example – diagnosing breast cancer with the k-NN algorithm

Step 1 – collecting data

Step 2 – exploring and preparing the data

Transformation – normalizing numeric data

Data preparation – creating training and test datasets

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Transformation – z-score standardization

Testing alternative values of k

Summary

Probabilistic Learning – Classification Using Naive Bayes

Understanding Naive Bayes

Basic concepts of Bayesian methods

Understanding probability

Understanding joint probability

Computing conditional probability with Bayes’ theorem

The Naive Bayes algorithm

Classification with Naive Bayes

The Laplace estimator

Using numeric features with Naive Bayes

Example – filtering mobile phone spam with the Naive Bayes algorithm

Step 1 – collecting data

Step 2 – exploring and preparing the data

Data preparation – cleaning and standardizing text data

Data preparation – splitting text documents into words

Data preparation – creating training and test datasets

Visualizing text data – word clouds

Data preparation – creating indicator features for frequent words

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Summary

Divide and Conquer – Classification Using Decision Trees and Rules

Understanding decision trees

Divide and conquer

The C5.0 decision tree algorithm

Choosing the best split

Pruning the decision tree

Example – identifying risky bank loans using C5.0 decision trees

Step 1 – collecting data

Step 2 – exploring and preparing the data

Data preparation – creating random training and test datasets

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Boosting the accuracy of decision trees

Making some mistakes cost more than others

Understanding classification rules

Separate and conquer

The 1R algorithm

The RIPPER algorithm

Rules from decision trees

What makes trees and rules greedy?

Example – identifying poisonous mushrooms with rule learners

Step 1 – collecting data

Step 2 – exploring and preparing the data

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Summary

Forecasting Numeric Data – Regression Methods

Understanding regression

Simple linear regression

Ordinary least squares estimation

Correlations

Multiple linear regression

Generalized linear models and logistic regression

Example – predicting auto insurance claims costs using linear regression

Step 1 – collecting data

Step 2 – exploring and preparing the data

Exploring relationships between features – the correlation matrix

Visualizing relationships between features – the scatterplot matrix

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Model specification – adding nonlinear relationships

Model specification – adding interaction effects

Putting it all together – an improved regression model

Making predictions with a regression model

Going further – predicting insurance policyholder churn with logistic regression

Understanding regression trees and model trees

Adding regression to trees

Example – estimating the quality of wines with regression trees and model trees

Step 1 – collecting data

Step 2 – exploring and preparing the data

Step 3 – training a model on the data

Visualizing decision trees

Step 4 – evaluating model performance

Measuring performance with the mean absolute error

Step 5 – improving model performance

Summary

Black-Box Methods – Neural Networks and Support Vector Machines

Understanding neural networks

From biological to artificial neurons

Activation functions

Network topology

The number of layers

The direction of information travel

The number of nodes in each layer

Training neural networks with backpropagation

Example – modeling the strength of concrete with ANNs

Step 1 – collecting data

Step 2 – exploring and preparing the data

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Understanding support vector machines

Classification with hyperplanes

The case of linearly separable data

The case of nonlinearly separable data

Using kernels for nonlinear spaces

Example – performing OCR with SVMs

Step 1 – collecting data

Step 2 – exploring and preparing the data

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Changing the SVM kernel function

Identifying the best SVM cost parameter

Summary

Finding Patterns – Market Basket Analysis Using Association Rules

Understanding association rules

The Apriori algorithm for association rule learning

Measuring rule interest – support and confidence

Building a set of rules with the Apriori principle

Example – identifying frequently purchased groceries with association rules

Step 1 – collecting data

Step 2 – exploring and preparing the data

Data preparation – creating a sparse matrix for transaction data

Visualizing item support – item frequency plots

Visualizing the transaction data – plotting the sparse matrix

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Sorting the set of association rules

Taking subsets of association rules

Saving association rules to a file or data frame

Using the Eclat algorithm for greater efficiency

Summary

Finding Groups of Data – Clustering with k-means

Understanding clustering

Clustering as a machine learning task

Clusters of clustering algorithms

The k-means clustering algorithm

Using distance to assign and update clusters

Choosing the appropriate number of clusters

Finding teen market segments using k-means clustering

Step 1 – collecting data

Step 2 – exploring and preparing the data

Data preparation – dummy coding missing values

Data preparation – imputing the missing values

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Summary

Evaluating Model Performance

Measuring performance for classification

Understanding a classifier’s predictions

A closer look at confusion matrices

Using confusion matrices to measure performance

Beyond accuracy – other measures of performance

The kappa statistic

The Matthews correlation coefficient

Sensitivity and specificity

Precision and recall

The F-measure

Visualizing performance tradeoffs with ROC curves

Comparing ROC curves

The area under the ROC curve

Creating ROC curves and computing AUC in R

Estimating future performance

The holdout method

Cross-validation

Bootstrap sampling

Summary

Being Successful with Machine Learning

What makes a successful machine learning practitioner?

What makes a successful machine learning model?

Avoiding obvious predictions

Conducting fair evaluations

Considering real-world impacts

Building trust in the model

Putting the science in data science

Using R Notebooks and R Markdown

Performing advanced data exploration

Constructing a data exploration roadmap

Encountering outliers: a real-world pitfall

Example – using ggplot2 for visual data exploration

Summary

Advanced Data Preparation

Performing feature engineering

The role of human and machine

The impact of big data and deep learning

Feature engineering in practice

Hint 1: Brainstorm new features

Hint 2: Find insights hidden in text

Hint 3: Transform numeric ranges

Hint 4: Observe neighbors’ behavior

Hint 5: Utilize related rows

Hint 6: Decompose time series

Hint 7: Append external data

Exploring R’s tidyverse

Making tidy table structures with tibbles

Reading rectangular files faster with readr and readxl

Preparing and piping data with dplyr

Transforming text with stringr

Cleaning dates with lubridate

Summary

Challenging Data – Too Much, Too Little, Too Complex

The challenge of high-dimension data

Applying feature selection

Filter methods

Wrapper methods and embedded methods

Example – Using stepwise regression for feature selection

Example – Using Boruta for feature selection

Performing feature extraction

Understanding principal component analysis

Example – Using PCA to reduce highly dimensional social media data

Making use of sparse data

Identifying sparse data

Example – Remapping sparse categorical data

Example – Binning sparse numeric data

Handling missing data

Understanding types of missing data

Performing missing value imputation

Simple imputation with missing value indicators

Missing value patterns

The problem of imbalanced data

Simple strategies for rebalancing data

Generating a synthetic balanced dataset with SMOTE

Example – Applying the SMOTE algorithm in R

Considering whether balanced is always better

Summary

Building Better Learners

Tuning stock models for better performance

Determining the scope of hyperparameter tuning

Example – using caret for automated tuning

Creating a simple tuned model

Customizing the tuning process

Improving model performance with ensembles

Understanding ensemble learning

Popular ensemble-based algorithms

Bagging

Boosting

Random forests

Gradient boosting

Extreme gradient boosting with XGBoost

Why are tree-based ensembles so popular?

Stacking models for meta-learning

Understanding model stacking and blending

Practical methods for blending and stacking in R

Summary

Making Use of Big Data

Practical applications of deep learning

Beginning with deep learning

Choosing appropriate tasks for deep learning

The TensorFlow and Keras deep learning frameworks

Understanding convolutional neural networks

Transfer learning and fine tuning

Example – classifying images using a pre-trained CNN in R

Unsupervised learning and big data

Representing highly dimensional concepts as embeddings

Understanding word embeddings

Example – using word2vec for understanding text in R

Visualizing highly dimensional data

The limitations of using PCA for big data visualization

Understanding the t-SNE algorithm

Example – visualizing data’s natural clusters with t-SNE

Adapting R to handle large datasets

Querying data in SQL databases

The tidy approach to managing database connections

Using a database backend for dplyr with dbplyr

Doing work faster with parallel processing

Measuring R’s execution time

Enabling parallel processing in R

Taking advantage of parallel with foreach and doParallel

Training and evaluating models in parallel with caret

Utilizing specialized hardware and algorithms

Parallel computing with MapReduce concepts via Apache Spark

Learning via distributed and scalable algorithms with H2O

GPU computing

Summary

Other Books You May Enjoy

Index

Preface

Machine learning, at its core, describes algorithms that transform data into actionable intelligence. This fact makes machine learning well suited to the present-day era of big data. Without machine learning, it would be nearly impossible to make sense of the massive streams of information that are now all around us.

The cross-platform, zero-cost statistical programming environment called R provides an ideal pathway to start applying machine learning. R offers powerful but easy-to-learn tools that can assist you with finding insights in your own data.

By combining hands-on case studies with the essential theory needed to understand how these algorithms work, this book delivers all the knowledge you need to get started with machine learning and to apply its methods to your own projects.

Who this book is for

This book is aimed at people in applied fields—business analysts, social scientists, and others—who have access to data and hope to use it for action. Perhaps you already know a bit about machine learning, but have never used R; or, perhaps you know a little about R, but are new to machine learning. Maybe you are completely new to both! In any case, this book will get you up and running quickly. It would be helpful to have a bit of familiarity with basic math and programming concepts, but no prior experience is required. All you need is curiosity.

What this book covers

Chapter 1, Introducing Machine Learning, presents the terminology and concepts that define and distinguish machine learners, as well as a method for matching a learning task with the appropriate algorithm.

Chapter 2, Managing and Understanding Data, provides an opportunity to get your hands dirty working with data in R. Essential data structures and procedures used for loading, exploring, and understanding data are discussed.

Chapter 3, Lazy Learning – Classification Using Nearest Neighbors, teaches you how to understand and apply a simple yet powerful machine learning algorithm to your first real-world task: identifying malignant samples of cancer.

Chapter 4, Probabilistic Learning – Classification Using Naive Bayes, reveals the essential concepts of probability that are used in cutting-edge spam filtering systems. You’ll learn the basics of text mining in the process of building your own spam filter.

Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, explores a couple of learning algorithms whose predictions are not only accurate, but also easily explained. We’ll apply these methods to tasks where transparency is important.

Chapter 6, Forecasting Numeric Data – Regression Methods, introduces machine learning algorithms used for making numeric predictions. As these techniques are heavily embedded in the field of statistics, you will also learn the essential metrics needed to make sense of numeric relationships.

Chapter 7, Black-Box Methods – Neural Networks and Support Vector Machines, covers two complex but powerful machine learning algorithms. Though the math may appear intimidating, we will work through examples that illustrate their inner workings in simple terms.

Chapter 8, Finding Patterns – Market Basket Analysis Using Association Rules, exposes the algorithm used in the recommendation systems employed by many retailers. If you’ve ever wondered how retailers seem to know your purchasing habits better than you know yourself, this chapter will reveal their secrets.

Chapter 9, Finding Groups of Data – Clustering with k-means, is devoted to a procedure that locates clusters of related items. We’ll utilize this algorithm to identify profiles within an online community.

Chapter 10, Evaluating Model Performance, provides information on measuring the success of a machine learning project and obtaining a reliable estimate of the learner’s performance on future data.

Chapter 11, Being Successful with Machine Learning, describes the common pitfalls faced when transitioning from textbook datasets to real-world machine learning problems, as well as the tools, strategies, and soft skills needed to combat these issues.

Chapter 12, Advanced Data Preparation, introduces the set of tidyverse packages, which help wrangle large datasets to extract meaningful information to aid the machine learning process.

Chapter 13, Challenging Data – Too Much, Too Little, Too Complex, considers solutions to a common set of problems that can derail a machine learning project when the useful information is lost within a massive dataset, much like a needle in a haystack.

Chapter 14, Building Better Learners, reveals the methods employed by the teams at the top of machine learning competition leaderboards. If you have a competitive streak, or simply want to get the most out of your data, you’ll need to add these techniques to your repertoire.

Chapter 15, Making Use of Big Data, explores the frontiers of machine learning. From working with extremely large datasets to making R work faster, the topics covered will help you push the boundaries of what is possible with R, and even allow you to utilize the sophisticated tools developed by large organizations like Google for image recognition and understanding text data.

What you need for this book

The examples in this book were tested with R version 4.2.2 on Microsoft Windows, Mac OS X, and Linux, although they are likely to work with any recent version of R. R can be downloaded at no cost at https://cran.r-project.org/.
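
To see how an installed copy of R compares with the version used for testing, a quick check along the following lines may help. This is a minimal sketch rather than anything taken from the book's code bundle: the variable names tested_version, installed_version, and is_recent_enough are illustrative assumptions, while getRversion() and R.version.string are part of base R.

```r
# Version of R that the examples were tested with (taken from the text above)
tested_version <- "4.2.2"

# getRversion() returns the running R version as a comparable version object
installed_version <- getRversion()

# TRUE if the installed R is at least as new as the tested version
is_recent_enough <- installed_version >= tested_version

# Print a short, human-readable summary
cat("Installed:", R.version.string, "\n")
cat("At least version", tested_version, "?", is_recent_enough, "\n")
```

An older installation may still run most of the examples, so a FALSE result here is a prompt to consider upgrading rather than a hard failure.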

The RStudio interface, which is described in more detail in Chapter 1, Introducing Machine Learning, is a highly recommended add-on for R that greatly enhances the user experience. The RStudio Open Source Edition is available free of charge from Posit (https://www.posit.co/) alongside a paid RStudio Pro Edition that offers priority support and additional features for commercial organizations.

Download the example code files

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-with-R-Fourth-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/TZ7os.

Conventions used

Code in text: function names, filenames, file extensions, and R package names are shown as follows: "The knn() function in the class package provides a standard, classic implementation of the k-NN algorithm."

R user input and output is written as follows:

> reg(y = launch$distress_ct, x = launch[2:4])
                         estimate
Intercept             3.527093383
temperature          -0.051385940
field_check_pressure  0.001757009
flight_num            0.014292843

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In RStudio, a new file can be created using the File menu, selecting New File, and choosing the R Notebook option."

References to additional resources or background information appear like this.

Helpful tips and important caveats appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected], and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit http://www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

Share your thoughts

Once you’ve read Machine Learning with R - Fourth Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/978-1-80107-132-1

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits directly to your email.

1 Introducing Machine Learning

If science-fiction stories are to be believed, the invention of artificial intelligence inevitably leads to apocalyptic wars between machines and their makers. The stories begin with today’s reality: computers being taught to play simple games like tic-tac-toe and to automate routine tasks. As the stories go, machines are later given control of traffic lights and communications, followed by military drones and missiles. The machines’ evolution takes an ominous turn once the computers become sentient and learn how to teach themselves. Having no more need for human programmers, the machines then delete humankind.

Thankfully, at the time of writing, machines still require user input.

Though your impressions of machine learning may be colored by these mass-media depictions, today’s algorithms have little danger of becoming self-aware. The goal of today’s machine learning is not to create an artificial brain, but rather to assist us with making sense of and acting on the world’s rapidly accumulating data stores.

Putting popular misconceptions aside, by the end of this chapter, you will gain a more nuanced understanding of machine learning. You will also be introduced to the fundamental concepts that define and differentiate the most common machine learning approaches. You will learn:

The origins, applications, ethics, and pitfalls of machine learning

How computers transform data into knowledge and action

The steps needed to match a machine learning algorithm with your data

The field of machine learning provides a set of algorithms that transform data into actionable knowledge. Keep reading to see how easy it is to use R to start applying machine learning to real-world problems.

The origins of machine learning

Beginning at birth, we are inundated with data. Our body’s sensors—the eyes, ears, nose, tongue, and nerves—are continually assailed with raw data that our brain translates into sights, sounds, smells, tastes, and textures. Using language, we can share these experiences with others.

Since the advent of written language, humans have recorded their observations. Hunters monitored the movement of animal herds; early astronomers recorded the alignment of planets and stars; and cities recorded tax payments, births, and deaths. Today, such observations, and many more, are increasingly automated and recorded systematically in ever-growing computerized databases.

The invention of electronic sensors has additionally contributed to an explosion in the volume and richness of recorded data. Specialized sensors, such as cameras, microphones, chemical noses, electronic tongues, and pressure sensors mimic the human ability to see, hear, smell, taste, and feel. These sensors process the data far differently than a human being would. Unlike a human’s limited and subjective attention, an electronic sensor never takes a break and has no emotions to skew its perception.

Although sensors are not clouded by subjectivity, they do not necessarily report a single, definitive depiction of reality. Some have an inherent measurement error due to hardware limitations. Others are limited by their scope. A black-and-white photograph provides a different depiction of its subject than one shot in color. Similarly, a microscope provides a far different depiction of reality than a telescope.

Between databases and sensors, many aspects of our lives are recorded. Governments, businesses, and individuals are recording and reporting information, from the monumental to the mundane. Weather sensors obtain temperature and pressure data; surveillance cameras watch sidewalks and subway tunnels; and all manner of electronic behaviors are monitored: transactions, communications, social media relationships, and many others.

This deluge of data has led some to state that we have entered an era of big data, but this may be a bit of a misnomer. Human beings have always been surrounded by large amounts of data—one would need only to look to the sky and attempt to count its stars to discover a virtually endless supply. What makes the current era unique is that we have vast amounts of recorded data, much of which can be directly accessed by computers. Larger and more interesting datasets are increasingly accessible at the tips of our fingers, only a web search away. This wealth of information has the potential to inform action, given a systematic way of making sense of it all.

The field of study dedicated to the development of computer algorithms for transforming data into intelligent action is known as machine learning. This field originated in an environment where the available data, statistical methods, and computing power rapidly and simultaneously evolved. Growth in the volume of data necessitated additional computing power, which in turn spurred the development of statistical methods for analyzing large datasets. This created a cycle of advancement, allowing even larger and more interesting data to be collected, and enabled today’s environment in which endless streams of data are available on virtually any topic.

Figure 1.1: The cycle of advancement that enabled machine learning

A closely related sibling of machine learning, data mining, is concerned with the generation of novel insight from large databases. As the term implies, data mining involves a systematic hunt for nuggets of actionable intelligence. Although there is some disagreement over how widely machine learning and data mining overlap, one point of distinction is that machine learning focuses on teaching computers how to use data to solve a problem, while data mining focuses on teaching computers to identify patterns that humans then use to solve a problem.

Virtually all data mining involves the use of machine learning, but not all machine learning requires data mining. For example, you might apply machine learning to data mine automobile traffic data for patterns related to accident rates. On the other hand, if the computer is learning how to identify traffic signs, this is purely machine learning without data mining.

The phrase data mining is also sometimes used as a pejorative to describe the deceptive practice of cherry-picking data to support a theory.

Machine learning is also intertwined with the field of artificial intelligence (AI), which is a nebulous discipline and, depending on whom you might ask, is simply machine learning with a strong marketing spin or a distinct field of study altogether. A cynic might suggest that the field of AI tends to exaggerate its importance such as by calling a simple predictive model an AI bot, while an AI proponent may point out that the field tends to tackle the most challenging learning tasks while aiming for human-level performance. The truth is somewhere in between.

Just as machine learning itself depends on statistical methods, artificial intelligence depends a great deal on machine learning, but the business contexts and applications tend to differ. The table that follows highlights some differentiators among traditional statistics, machine learning, and artificial intelligence; however, keep in mind that the lines between the three disciplines are often less rigid than they may appear.

In this formulation, machine learning sits firmly at the intersection of human and computer partnership, whereas traditional statistics relies primarily on the human to drive insights and AI seeks to minimize human involvement as much as possible. Learning how to maximize the human-machine partnership and apply learning algorithms to real-world problems is the focus of this book. Understanding the use cases and limitations of machine learning is an important starting point in this journey.

Uses and abuses of machine learning

Most people have heard of Deep Blue, the chess-playing computer that in 1997 became the first to win a match against a reigning world champion. Another famous computer, Watson, defeated two human opponents on the television trivia game show Jeopardy! in 2011. Based on these stunning accomplishments, some have speculated that computer intelligence will replace workers in information technology occupations, just as automobiles replaced horses and machines replaced workers in fields and assembly lines. Recently, these fears have become more pronounced as artificial intelligence-based algorithms, such as GPT-3 and DALL·E 2 from the OpenAI research group (https://openai.com/), have reached impressive milestones and are proving that computers are capable of writing text and creating artwork that is virtually indistinguishable from that produced by humans. Ultimately, this may lead to massive shifts in occupations like marketing, customer support, illustration, and so on, as creativity is outsourced to machines that can produce endless streams of material more cheaply than the former employees.

In this case, humans may still be necessary because the truth is that even as machines reach such impressive milestones, they are still relatively limited in their ability to thoroughly understand a problem, or understand how the work is going to be applied toward a real-world goal. Learning algorithms are pure intellectual horsepower without direction. A computer may be more capable than a human of finding subtle patterns in large databases, but it still needs a human to motivate the analysis and turn the result into meaningful action. In most cases, the human will determine whether the machine’s output is valuable and will help the machine avoid creating a limitless supply of nonsense.

Without completely discounting the achievements of Deep Blue and Watson, it is important to note that neither is even as intelligent as a typical five-year-old. For more on why comparing smarts is a slippery business, see the Popular Science article FYI, Which Computer Is Smarter, Watson Or Deep Blue?, by Will Grunewald, 2012: https://www.popsci.com/science/article/2012-12/fyi-which-computer-smarter-watson-or-deep-blue.

Machines are not good at asking questions or even knowing what questions to ask. They are much better at answering them, provided the question is stated in a way that the computer can comprehend. Present-day machine learning algorithms partner with people much like a bloodhound works with its trainer: the dog’s sense of smell may be many times stronger than its master’s, but without being carefully directed, the hound may end up chasing its tail.

Figure 1.2: Machine learning algorithms are powerful tools that require careful direction

In the worst-case scenario, if machine learning were implemented carelessly, it might lead to what controversial tech billionaire Elon Musk provocatively called summoning the demon. This perspective suggests that we may be unleashing forces outside our control, despite the hubristic sense that we will be able to rein them in when needed. Given the power of artificial intelligence to automate processes and react to changing conditions much faster and more objectively than humans, there may come a point at which Pandora’s box has been opened and it is difficult or impossible to return to the old ways of life where humans are in control. As Musk describes:

If AI has a goal and humanity just happens to be in the way, it will destroy humanity as a matter of course without even thinking about it. No hard feelings… It’s just like, if we’re building a road and an anthill just happens to be in the way, we don’t hate ants, we’re just building a road, and so, goodbye anthill.

While this may seem to be a bleak portrayal, it is still the realm of far-future science fiction, as you will soon learn when reading about the present day’s state-of-the-art machine learning successes.

However, Musk’s warning does help emphasize the importance of understanding that machine learning and AI can be a double-edged sword. For all of their benefits, there are some places where they still have room for improvement, and some situations where they may do more harm than good. If machine learning practitioners cannot be trusted to act ethically, it may be necessary for governments to intervene to prevent the greatest harm to society.

For more on Musk’s fears of summoning the demon, see the following 2018 article from CNBC: https://www.cnbc.com/2018/04/06/elon-musk-warns-ai-could-create-immortal-dictator-in-documentary.html.

Machine learning successes

Machine learning is most successful when it augments the specialized knowledge of a subject-matter expert rather than replacing the expert altogether. It works with medical doctors at the forefront of the fight to eradicate cancer; assists engineers with efforts to create smarter homes and automobiles; helps social scientists and economists build better societies; and provides business and marketing professionals with valuable insights. Toward these ends, it is employed in countless scientific laboratories, hospitals, companies, and governmental organizations. Any effort that generates or aggregates data likely employs at least one machine learning algorithm to help make sense of it.

Though it is impossible to list every successful application of machine learning, a selection of prominent examples is as follows:

Identification of unwanted spam messages in email

Segmentation of customer behavior for targeted advertising

Forecasts of weather behavior and long-term climate changes

Preemptive interventions for customers likely to churn (stop purchasing)

Reduction of fraudulent credit card transactions

Actuarial estimates of financial damage from storms and natural disasters

Prediction of and influence over election outcomes

Development of algorithms for auto-piloting drones and self-driving cars

Optimization of energy use in homes and office buildings

Projection of areas where criminal activity is most likely

Discovery of genetic sequences useful for precision medicine

By the end of this book, you will understand the basic machine learning algorithms that are employed to teach computers to perform these tasks. For now, it suffices to say that no matter the context, the fundamental machine learning process is the same. In every task, an algorithm takes data and identifies patterns that form the basis for further action.

The limits of machine learning

Although machine learning is used widely and has tremendous potential, it is important to understand its limits. The algorithms used today—even those on the cutting edge of artificial intelligence—emulate a relatively limited subset of the capabilities of the human brain. They offer little flexibility to extrapolate outside of strict parameters and know no common sense. Considering this, one should be extremely careful to recognize exactly what an algorithm has learned before setting it loose in the real world.

Without a lifetime of past experiences to build upon, computers are limited in their ability to make simple inferences about logical next steps. Consider the banner advertisements on websites, which are served according to patterns learned by data mining the browsing history of millions of users. Based on this data, someone who views websites selling mattresses is interested in buying a mattress and should therefore see advertisements for mattresses. The problem is that this becomes a never-ending cycle in which, even after a mattress has been purchased, additional mattress advertisements are shown, rather than advertisements for pillows and bed sheets.

Many people are familiar with the deficiencies of machine learning’s ability to understand or translate language, or to recognize speech and handwriting. Perhaps the earliest example of this type of failure is in a 1994 episode of the television show The Simpsons, which showed a parody of the Apple Newton tablet. In its time, the Newton was known for its state-of-the-art handwriting recognition. Unfortunately for Apple, it would occasionally fail to great effect. The television episode illustrated this through a sequence in which a bully’s note to Beat up Martin was misinterpreted by the Newton as Eat up Martha.

Figure 1.3: Screen captures from Lisa on Ice, The Simpsons, 20th Century Fox (1994)

Machine language processing has improved enough in the time since the Apple Newton that Google, Apple, and Microsoft are all confident in their ability to offer voice-activated virtual concierge services, such as Google Assistant, Siri, and Cortana. Still, these services routinely struggle to answer relatively simple questions. Furthermore, online translation services sometimes misinterpret sentences that a toddler would readily understand, and the predictive text feature on many devices has led to humorous autocorrect fail websites that illustrate computers’ ability to understand basic language but completely misunderstand context.

Some of these mistakes are to be expected. Language is complicated, with multiple layers of text and subtext, and even human beings sometimes misunderstand context. Although machine learning is rapidly improving at language processing, and current state-of-the-art algorithms like GPT-3 are quite good in comparison to prior generations, machines still make mistakes that are obvious to humans who know where to look. These predictable shortcomings illustrate the important fact that machine learning is only as good as the data it has learned from. If context is not explicit in the input data, then just like a human, the computer will have to make its best guess from its set of past experiences. However, the computer’s past experiences are usually much more limited than the human’s.

Machine learning ethics

At its core, machine learning is simply a tool that assists us with making sense of the world’s complex data. Like any tool, it can be used for good or evil. Machine learning goes wrong mostly when it is applied so broadly, or so callously, that humans are treated as lab rats, automata, or mindless consumers. A process that may seem harmless can lead to unintended consequences when automated by an emotionless computer. For this reason, those using machine learning or data mining would be remiss not to at least briefly consider the ethical implications of the art.

Due to the relative youth of machine learning as a discipline and the speed at which it is progressing, the associated legal issues and social norms are often quite uncertain, and constantly in flux. Caution should be exercised when obtaining or analyzing data in order to avoid breaking laws, violating terms of service or data use agreements, or abusing the trust or violating the privacy of customers or the public. The informal corporate motto of Google, an organization that collects perhaps more data on individuals than any other, was at one time, don’t be evil. While this seems clear enough, it may not be sufficient. A better approach may be to follow the Hippocratic Oath, a medical principle that states, above all, do no harm. Following the principle of do no harm may have helped avoid recent scandals at Facebook and other companies, such as the Cambridge Analytica controversy, which alleged that social media data was being used to manipulate elections.

Retailers routinely use machine learning for advertising, targeted promotions, inventory management, or the layout of items in a store. Many have equipped checkout lanes with devices that print coupons for promotions based on a customer’s buying history. In exchange for a bit of personal data, the customer receives discounts on the specific products they want to buy. At first, this may appear relatively harmless, but consider what happens when this practice is taken a bit further.

One possibly apocryphal tale concerns a large retailer in the United States that employed machine learning to identify expectant mothers for coupon mailings. The retailer hoped that if these mothers-to-be received substantial discounts, they would become loyal customers who would later purchase profitable items such as diapers, baby formula, and toys. Equipped with machine learning methods, the retailer identified items in the customer purchase history that could be used to predict with a high degree of certainty not only whether a woman was pregnant, but also the approximate timing for when the baby was due.

After the retailer used this data for a promotional mailing, an angry man contacted the chain and demanded to know why his young daughter received coupons for maternity items. He was furious that the retailer seemed to be encouraging teenage pregnancy! As the story goes, when the retail chain called to offer an apology, it was the father who ultimately apologized after confronting his daughter and discovering that she was indeed pregnant!

For more detail on how retailers use machine learning to identify pregnancies, see the New York Times Magazine article titled How Companies Learn Your Secrets, by Charles Duhigg, 2012: https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html.

Whether the story was completely true or not, the lesson learned from the preceding tale is that common sense should be applied before blindly applying the results of a machine learning analysis. This is particularly true in cases where sensitive information, such as health data, is concerned. With a bit more care, the retailer could have foreseen this scenario and used greater discretion when choosing how to reveal the pregnancy status its machine learning analysis had discovered. Unfortunately, as history tends to repeat itself, social media companies have been under fire recently for targeting expectant mothers with advertisements for baby products even after these mothers experience the tragedy of a miscarriage.

Because machine learning algorithms are developed with historical data, computers may learn some unfortunate behaviors of human societies. Sadly, this sometimes includes perpetuating race or gender discrimination and reinforcing negative stereotypes. For example, researchers found that Google’s online advertising service was more likely to show ads for high-paying jobs to men than women and was more likely to display ads for criminal background checks to black people than white people. Although the machine may have correctly learned that men once held jobs that were not offered to most women, it is not desirable to have the algorithm perpetuate such injustices. Instead, it may be necessary to teach the machine to reflect society not as it currently is, but as it ought to be.

Sometimes, algorithms that are specifically designed with the intention of being content-neutral eventually come to reflect undesirable beliefs or ideologies. In one egregious case, a Twitter chatbot service developed by Microsoft was quickly taken offline after it began spreading Nazi and anti-feminist propaganda, which it may have learned from so-called trolls posting inflammatory content on internet forums and chat rooms. In another case, an algorithm created to reflect an objective conception of human beauty sparked controversy when it favored almost exclusively white people. Imagine the consequences if this had been applied to facial recognition software for criminal activity!

For more information about the real-world consequences of machine learning and discrimination, see the Harvard Business Review article Addressing the Biases Plaguing Algorithms, by Michael Li, 2019: https://hbr.org/2019/05/addressing-the-biases-plaguing-algorithms.

To limit the ability of algorithms to discriminate illegally, certain jurisdictions have well-intentioned laws that prevent the use of racial, ethnic, religious, or other protected class data for business reasons. However, excluding this data from a project may not be enough because machine learning algorithms can still inadvertently learn to discriminate. If a certain segment of people tends to live in a certain region, buys a certain product, or otherwise behaves in a way that uniquely identifies them as a group, machine learning algorithms can infer the protected information from other factors. In such cases, you may need to completely de-identify these people by excluding any potentially identifying data in addition to the already-protected statuses.
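
The proxy effect described above can be illustrated with a minimal sketch in R. The data is entirely hypothetical, and the names applicants, group, and region are assumptions made for illustration: group stands in for a protected attribute that would be withheld from the algorithm, while region stands in for an ordinary field that remains available to it.

```r
# Hypothetical applicants: 'group' is a protected attribute, 'region' is not
applicants <- data.frame(
  group  = rep(c("A", "B"), each = 10),
  region = c(rep("north", 9), "south", rep("south", 9), "north"),
  stringsAsFactors = FALSE
)

# Cross-tabulate to see how closely region tracks group membership
print(table(applicants$region, applicants$group))

# Guess the protected attribute using nothing but the region column
guessed_group <- ifelse(applicants$region == "north", "A", "B")
proxy_accuracy <- mean(guessed_group == applicants$group)
print(proxy_accuracy)  # 18 of the 20 made-up rows are guessed correctly
```

In this made-up data, dropping the group column would not remove the information it carries, because a learner that sees region can recover most of it.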

In a recent example of this type of alleged algorithmic bias, the Apple credit card, which debuted in 2019, was almost immediately accused of providing substantially higher credit limits to men than to women—sometimes by 10 to 20 times the amount—even for spouses with joint assets and similar credit histories. Although Apple and the issuing bank, Goldman Sachs, denied that gender bias was at play and confirmed that no legally protected applicant characteristics were used in the algorithm, this did not slow speculation that perhaps some bias crept in unintentionally. It did not help matters that for competitive reasons, Apple and Goldman Sachs chose to keep the details of the algorithm secret, which led people to assume the worst. If the systematic bias allegations were untrue, being able to explain what was truly happening and exactly how the decisions were made might have alleviated much of the outrage. A potential worst-case scenario would have occurred if Apple and Goldman Sachs were investigated yet couldn’t explain the result to regulators, due to the algorithm’s complexity!

The Apple credit card fiasco is described in a 2019 BBC article, Apple’s ‘sexist’ credit card investigated by US regulator: https://www.bbc.com/news/business-50365609.

Apart from the legal consequences, customers may feel uncomfortable or become upset if aspects of their lives they consider private are made public. The challenge is that privacy expectations differ across people and contexts. To illustrate this fact, imagine driving by someone’s house and incidentally glancing through the window. This is unlikely to offend most people. In contrast, using a camera to take a picture from across the street is likely to make most feel uncomfortable; walking up to the house and pressing a face against the glass to peer inside is likely to anger virtually everybody. Although all three of these scenarios are arguably using public information, two of the three cross a line that will upset most people. In much the same way, it is possible to go too far with the use of data and cross a threshold that many will see as inconsiderate at best and creepy at worst.

Just as computing hardware and statistical methods kicked off the big data era, these methods also unlocked a post-privacy era in which many aspects of our lives that were once private are now public, or available to the public at a price. Even prior to the big data era, it would have been possible to learn a great deal about someone by observing public information. Watching their comings and goings may reveal information about their occupation or leisure activity, and a quick glance at their trash and recycling bins may reveal what they eat, drink, and read. A private investigator could learn even more with a bit of focused digging and observation. Companies applying machine learning methods to large datasets are essentially acting as large-scale private investigators, and while they claim to be working on anonymized datasets, many still argue that the companies have gone too far with their digital surveillance.

In recent years, some high-profile web applications have experienced a mass exodus of users who felt exploited when the applications’ terms of service agreements changed, or their data was used for purposes beyond what the users had originally intended. The fact that privacy expectations differ by context, age cohort, and locale adds complexity to deciding the appropriate use of personal data. It would be wise to consider the cultural implications of your work before you begin on your project, in addition to being aware of ever-more-restrictive regulations such as the European Union’s General Data Protection Regulation (GDPR) and the inevitable policies that will follow in its footsteps.

The fact that you can use data for a particular end does not always mean that you should.

Finally, it is important to note that as machine learning algorithms become progressively more important to our everyday lives, there are greater incentives for nefarious actors to work to exploit them. Sometimes, attackers simply want to disrupt algorithms for laughs or notoriety—such as Google bombing, the crowdsourced method of tricking Google’s algorithms to highly rank a desired page. Other times, the effects are more dramatic. A timely example of this is the recent wave of so-called fake news and election meddling, propagated via the manipulation of advertising and recommendation algorithms that target people according to their personality. To avoid giving such control to outsiders, when building machine learning systems, it is crucial to consider how they may be influenced by a determined individual or crowd.

Social media scholar danah boyd (styled lowercase) presented a keynote at the Strata Data Conference 2017 in New York City that discussed the importance of hardening machine learning algorithms against attackers. For a recap, refer to https://points.datasociety.net/your-data-is-being-manipulated-a7e31a83577b.

The consequences of malicious attacks on machine learning algorithms can also be deadly. Researchers have shown that by creating an adversarial attack that subtly distorts a road sign with carefully chosen graffiti, an attacker might cause an autonomous vehicle to misinterpret a stop sign, potentially resulting in a fatal crash. Even in the absence of ill intent, software bugs and human errors have already led to fatal accidents in autonomous vehicle technology from Uber and Tesla. With such examples in mind, it is of the utmost importance, and an ethical obligation, that machine learning practitioners consider how their algorithms will be used and abused in the real world.

How machines learn

A formal definition of machine learning, attributed to computer scientist Tom M. Mitchell, states that a machine learns whenever it utilizes its experience such that its performance improves on similar experiences in the future. Although this definition makes sense intuitively, it completely ignores the process of exactly how experience is translated into future action—and, of course, learning is always easier said than done!
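
To make this definition concrete, here is a minimal sketch in R under assumptions that are not from the text: the task is to predict a hypothetical target function (x squared), the learner is a simple nearest-neighbor rule, and experience is measured as the number of stored training examples. Performance is measured on the same held-out inputs at each level of experience.

```r
# Hypothetical task: predict y = x^2 for inputs the learner has not stored
target <- function(x) x^2

# Nearest-neighbor learner: answer with the y value of the closest stored x
predict_nearest <- function(train_x, train_y, new_x) {
  sapply(new_x, function(x) train_y[which.min(abs(train_x - x))])
}

# Fixed set of inputs used to measure performance
test_x <- seq(0, 1, length.out = 1001)

# Experience: the number of evenly spaced training examples
experience <- c(5, 17, 65, 257)
errors <- sapply(experience, function(n) {
  train_x <- seq(0, 1, length.out = n)
  train_y <- target(train_x)
  mean(abs(predict_nearest(train_x, train_y, test_x) - target(test_x)))
})

# Mean absolute error shrinks as experience grows
print(data.frame(experience, mean_abs_error = errors))
```

The sketch only shows that performance improves with experience. It says nothing about how experience should be turned into future action, which is the gap in the definition noted above.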

Where human brains are naturally capable of learning from birth, the conditions necessary for computers to learn must be made explicit by the programmer hoping to utilize machine learning methods. For this reason, although it is not strictly necessary to understand the theoretical basis for learning, having a strong theoretical foundation helps the practitioner to understand, distinguish, and implement machine learning algorithms.

As you relate machine learning to human learning, you may find yourself examining your own mind in a different light.

Regardless of whether the learner is a human or a machine, the basic learning process is the same. It can be divided into four interrelated components:

Data storage utilizes observation, memory, and recall to provide a factual basis for further reasoning.

Abstraction involves the translation of stored data into broader representations and concepts.

Generalization uses abstracted data to create knowledge and inferences that drive action in new contexts.

Evaluation provides a feedback mechanism to measure the utility of learned knowledge and inform potential improvements.

Figure 1.4: The four steps in the learning process
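
As a rough illustration, the four components can be mapped onto a few lines of R. This is only a sketch under assumptions of our own: it uses R's built-in cars dataset (speed and stopping distance), a simple linear model as the abstraction, and an arbitrary odd/even row split to hold back unseen examples.

```r
# 1. Data storage: observations held in memory as a data frame
stored <- cars                                    # columns: speed, dist
train  <- stored[seq(1, nrow(stored), by = 2), ]  # odd rows are studied
test   <- stored[seq(2, nrow(stored), by = 2), ]  # even rows are held back

# 2. Abstraction: summarize the stored rows as a general relationship
model <- lm(dist ~ speed, data = train)

# 3. Generalization: apply the abstraction to rows it has not seen
predicted <- predict(model, newdata = test)

# 4. Evaluation: measure how useful the learned knowledge is,
#    compared with always guessing the average training distance
rmse          <- sqrt(mean((test$dist - predicted)^2))
baseline_rmse <- sqrt(mean((test$dist - mean(train$dist))^2))
print(c(model = rmse, baseline = baseline_rmse))
```

Comparing against a naive baseline is one simple way to provide the feedback that evaluation calls for. Other error measures and data splits would serve the same purpose.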

Although the learning process has been conceptualized here as four distinct components, they are merely organized this way for illustrative purposes. In reality, the entire learning process is inextricably linked. In human beings, the process occurs subconsciously. We recollect, deduce, induct, and intuit within the confines of our mind’s eye, and because this process is hidden, any differences from person to person are attributed to a vague notion of subjectivity. In contrast, computers make these processes explicit, and because the entire process is transparent, the learned knowledge can be examined, transferred, utilized for future action, and treated as a data science.

The data science buzzword suggests a relationship between the data, the machine, and the people who guide the learning process. The term’s growing use in job descriptions and academic degree programs reflects its operationalization as a field of study concerned with both statistical and computational theory, as well as the technological infrastructure enabling machine learning and its applications. The field often asks its practitioners to be compelling storytellers, balancing an audacity in the use of data with the limitations of what one may infer and forecast from it. To be a strong data scientist, therefore, requires a strong understanding of how the learning algorithms work in the context of a business application, as we will discuss in greater detail in Chapter 11, Being Successful with Machine Learning.

Data storage

All learning begins with data. Humans and computers alike utilize data storage as a foundation for more advanced reasoning. In a human being, this consists of a brain that uses electrochemical signals in a network of biological cells to store and process observations for short- and long-term future recall. Computers have similar capabilities of short- and long-term recall using hard disk drives, flash memory, and random-access memory (RAM) in combination with a central processing unit (CPU).

It may seem obvious, but the ability to store and retrieve data alone is insufficient for learning. Stored data is merely ones and zeros on a disk. It is a collection of memories, meaningless without a broader context. Without a higher level of understanding, knowledge is purely recall, limited to what has been seen before and nothing else.

To better understand the nuances of this idea, it may help to think about the last time you studied for a difficult test, perhaps for a university final exam or a career certification. Did you wish for an eidetic (photographic) memory? If so, you may be disappointed to know that perfect recall would be unlikely to be of much assistance. Even if you could memorize material perfectly, this rote learning would provide no benefit without knowing the exact questions and answers that would appear on the exam. Otherwise, you would need to memorize answers to every question that could conceivably be asked, on a subject in which there is likely to be an infinite number of questions. Obviously, this is an unsustainable strategy.

Instead, a better approach is to spend time selectively and memorize a relatively small set of representative ideas, while developing an understanding of how the ideas relate and apply to unforeseen circumstances. In this way, important broader patterns are identified, rather than you memorizing every detail, nuance, and potential application.
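The limits of pure recall can be sketched in a few lines of R. The example below is only an illustration: the observed data frame and the recall_score() function are hypothetical names, and the values are invented. It shows that a learner relying on storage alone can answer only for cases it has already seen.

```r
# Hypothetical stored observations: hours studied and the exam score seen
# (values are invented for illustration)
observed <- data.frame(
  hours = c(1, 2, 3, 4, 6),
  score = c(52, 58, 65, 70, 83)
)

# Rote recall: return the score stored for an exact number of hours
recall_score <- function(hours) {
  observed$score[observed$hours == hours]
}

seen   <- recall_score(4)  # a stored case can be recalled
unseen <- recall_score(5)  # an unseen case yields an empty result

seen
length(unseen)
```

Because no row was stored for five hours of study, the lookup has nothing to return. Answering that question requires some summary of the pattern across the stored rows, not just the rows themselves.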

Abstraction

This work of assigning a broader meaning to stored data occurs during the abstraction process, in which raw data comes to represent a wider, more abstract concept or idea. This type of connection, say between an object and its representation, is exemplified by the famous René Magritte painting The Treachery of Images.


Figure 1.5: This is not a pipe. Source: http://collections.lacma.org/node/239578

The painting depicts a tobacco pipe with the caption Ceci n’est pas une pipe (This is not a pipe). The point Magritte was illustrating is that a representation of a pipe is not truly a pipe. Yet, despite the fact that the pipe is not real, anybody viewing the painting easily recognizes it as a pipe. This suggests that observers can connect the picture of a pipe to the idea of a pipe, to a memory of a physical pipe that can be held in the hand. Abstracted connections like this are the basis of knowledge representation, the formation of logical structures that assist with turning raw sensory information into meaningful insight.

Bringing this concept full circle, knowledge representation is what allows artificial intelligence-based tools like Midjourney (https://www.midjourney.com) to paint, virtually, in the style of René Magritte. The following image was generated entirely by artificial intelligence based on the algorithm’s understanding of concepts like robot, pipe, and smoking. If he were alive today, Magritte himself might find it surreal that his own surrealist work, which challenged human conceptions of reality and the connections between images and ideas, is now incorporated into the minds of computers and, in a roundabout way, is connecting machines’ ideas and images to reality. Machines learned what a pipe is, in part, by viewing images of pipes in artwork like Magritte’s.

Figure 1.6: “Am I a pipe?” Image created by the Midjourney AI with the prompt “robot smoking a pipe in the style of a René Magritte painting”

To reify the process of knowledge representation within an algorithm, the computer summarizes stored raw data using a model, an explicit description of the patterns within the data. Just like Magritte’s pipe, the model representation takes on a life beyond the raw data. It represents an idea greater than the sum of its parts.

There are many different types of models. You may already be familiar with some. Examples include:

Mathematical equations

Relational diagrams, such as trees and graphs

Logical if/else rules

Groupings of data known as clusters

The choice of model is typically not left up to the machine. Instead, the learning task and the type of data on hand inform model selection. Later in this chapter, we will discuss in more detail the methods for choosing the appropriate model type.

Fitting a model to a dataset is known as training. When the model has been trained, the data has been transformed into an abstract form that summarizes the original information. The fact that this step is called training rather than learning reveals a couple of interesting aspects of the process. First, note that the process of learning does not end with data abstraction—the learner must still generalize and evaluate its training. Second, the word training better connotes the fact that the human teacher trains the machine student to use the data toward a specific end.
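To make the idea of training concrete, here is a minimal sketch using base R's lm() function. It assumes a small, invented dataset (the observed data frame and its values are hypothetical) and uses a linear equation as the model type; this is one possible choice among the model types listed earlier, not a recommendation.

```r
# Hypothetical data: hours studied and exam score
# (values are invented for illustration)
observed <- data.frame(
  hours = c(1, 2, 3, 4, 6),
  score = c(52, 58, 65, 70, 83)
)

# Training: fit a linear model that summarizes the stored rows
model <- lm(score ~ hours, data = observed)

# The abstraction: an intercept and a slope stand in for the raw data
model_coefs <- coef(model)
model_coefs

# The trained model can respond to an input that was never stored
predicted <- predict(model, newdata = data.frame(hours = 5))
predicted
```

The two coefficients are the abstract form of the original five rows. Whether the prediction for an unseen input is any good is a separate question, which is why the learner must still generalize and be evaluated after training.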

The distinction between training and learning is subtle but important. The computer doesn’t learn a model, because this would imply that there is a single correct model to be learned. Of course, the computer must learn something about the data
