Hands-on Scikit-Learn for Machine Learning Applications: Data Science Fundamentals with Python
By David Paper
About this ebook
All applied math and programming skills required to master the content are covered in this book. In-depth knowledge of object-oriented programming is not required because working, complete examples are provided and explained. Coding examples are in-depth and complex when necessary, yet concise, accurate, and complete, and they complement the machine learning concepts introduced. Working the examples helps build the skills necessary to understand and apply complex machine learning algorithms.
Hands-on Scikit-Learn for Machine Learning Applications is an excellent starting point for those pursuing a career in machine learning. Students of this book will learn the fundamentals that are a prerequisite to competency. Readers will be exposed to the Anaconda distribution of Python that is designed specifically for data science professionals, and will build skills in the popular Scikit-Learn library that underlies many machine learning applications in the world of Python.
What You'll Learn
- Work with simple and complex datasets common to Scikit-Learn
- Manipulate data into vectors and matrices for algorithmic processing
- Become familiar with the Anaconda distribution used in data science
- Apply machine learning with Classifiers, Regressors, and Dimensionality Reduction
- Tune algorithms and find the best algorithms for each dataset
- Load data from and save to CSV, JSON, NumPy, and pandas formats
Who This Book Is For
The aspiring data scientist yearning to break into machine learning through mastering the underlying fundamentals that are sometimes skipped over in the rush to be productive. Some knowledge of object-oriented programming and very basic applied linear algebra will make learning easier, although anyone can benefit from this book.
Book preview
Hands-on Scikit-Learn for Machine Learning Applications - David Paper
© David Paper 2020
D. Paper, Hands-on Scikit-Learn for Machine Learning Applications, https://doi.org/10.1007/978-1-4842-5373-1_1
1. Introduction to Scikit-Learn
David Paper, Logan, UT, USA
Scikit-Learn is a Python library that provides simple and efficient tools for implementing supervised and unsupervised machine learning algorithms. The library is accessible to everyone because it is open source and commercially usable. It is built on the NumPy, SciPy, and Matplotlib libraries, which makes it reliable, robust, and well integrated with the scientific Python ecosystem.
Scikit-Learn is focused on data modeling rather than data loading, cleansing, munging, or manipulating. It is also very easy to use and relatively free of programming bugs.
Machine Learning
Machine learning is getting computers to program themselves. We use algorithms to make this happen. An algorithm is a set of rules used to calculate or solve problems with a computer.
Machine learning advocates create, study, and apply algorithms to improve performance on data-driven tasks. They use tools and technology to answer questions about data by training a machine how to learn.
The goal is to build robust algorithms that can manipulate input data to predict an output while continually updating outputs as new data becomes available. Any information or data sent to a computer is considered input. Data produced by a computer is considered output.
In the machine learning community, input data is referred to as the feature set and output data is referred to as the target. The feature set is also referred to as the feature space. Sample data is typically referred to as training data. Once the algorithm is trained with sample data, it can make predictions on new data. New data is typically referred to as test data.
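As a hedged sketch of this terminology, the snippet below splits the Iris samples into training and test data with Scikit-Learn's train_test_split helper; the 25% holdout and random_state value are illustrative choices, not the book's:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# feature set X (input data) and target y (output data)
X, y = load_iris(return_X_y=True)

# hold out 25% of the samples as test data; the algorithm
# trains on the rest and then predicts on the unseen test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)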
Machine learning is divided into two main areas: supervised and unsupervised learning. Since machine learning typically focuses on prediction based on known properties learned from training data, our focus is on supervised learning.
Supervised learning is when the data set contains both inputs (or the feature set) and desired outputs (or targets). That is, we know the properties of the data. The goal is to make predictions. This ability to supervise algorithm training is a big part of why machine learning has become so popular.
To classify or regress new data, we must train on data with known outcomes. We classify data by organizing it into relevant categories. We regress data by finding the relationship between feature set data and target data.
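As a hedged illustration of the difference, the sketch below fits a classifier and a regressor on the Iris data; the estimator choices and the petal-width regression are illustrative, not the book's examples:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

X, y = load_iris(return_X_y=True)

# classification: predict a discrete category (an iris species)
clf = KNeighborsClassifier()
clf.fit(X, y)
print(clf.predict(X[:1]))  # a class label such as 0 (setosa)

# regression: predict a continuous value; here we (artificially)
# predict petal width from the other three features
reg = LinearRegression()
reg.fit(X[:, :3], X[:, 3])
print(reg.predict(X[:1, :3]))  # a continuous estimate in cm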
With unsupervised learning, the data set contains only inputs but no desired outputs (or targets). The goal is to explore the data and find some structure or way to organize it. Although not the focus of the book, we will explore a few unsupervised learning scenarios.
Anaconda
You can use any Python installation, but I recommend installing Python with Anaconda for several reasons. First, it has over 15 million users. Second, Anaconda allows easy installation of the desired version of Python. Third, it preinstalls many useful libraries for machine learning including Scikit-Learn. Follow this link to see the Anaconda package lists for your operating system and Python version: https://docs.anaconda.com/anaconda/packages/pkg-docs/. Fourth, it includes several very popular editors including IDLE, Spyder, and Jupyter Notebooks. Fifth, Anaconda is reliable and well-maintained and removes compatibility bottlenecks.
You can easily download and install Anaconda from this link: https://www.anaconda.com/download/. You can update from this link: https://docs.anaconda.com/anaconda/install/update-version/. Just open Anaconda and follow the instructions. I recommend updating to the current version.
Scikit-Learn
Python’s Scikit-Learn is one of the most popular machine learning libraries. It is built on Python libraries NumPy, SciPy, and Matplotlib. The library is well-documented, open source, commercially usable, and a great vehicle to get started with machine learning. It is also very reliable and well-maintained, and its vast collection of algorithms can be easily incorporated into your projects. Scikit-Learn is focused on modeling data rather than loading, manipulating, visualizing, and summarizing data. For such activities, other libraries such as NumPy, pandas, Matplotlib, and seaborn are covered as encountered. The Scikit-Learn library is imported into a Python script as sklearn.
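For example, the import convention looks like this (displaying the version is just a quick sanity check):

import sklearn

# the library installs as scikit-learn but imports as sklearn
print(sklearn.__version__)

# individual modules are imported from sklearn, e.g.:
from sklearn import datasets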
Data Sets
A great way to understand machine learning applications is by working through Python data-driven code examples. We use Scikit-Learn, UCI Machine Learning Repository, and seaborn data sets for all examples. The Scikit-Learn data sets package embeds some small data sets for getting started and includes helpers to fetch larger data sets commonly used by the machine learning community to benchmark algorithms on data from the world at large. The UCI Machine Learning Repository maintains 468 data sets to serve the machine learning community. Seaborn provides an API on top of Matplotlib that offers simplicity when working with plot styles, color defaults, and high-level functions for common statistical plot types that facilitate visualization. It also integrates nicely with pandas DataFrame functionality.
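A minimal sketch of the three loading styles (assuming seaborn's sample data is reachable online and that the book's data/bank.csv file is on disk):

import pandas as pd
import seaborn as sns
from sklearn import datasets

# Scikit-Learn: small data sets ship embedded in the library
wine = datasets.load_wine()

# seaborn: sample data sets load directly into a pandas DataFrame
tips = sns.load_dataset('tips')

# UCI data sets are read from local CSV files in this book
bank = pd.read_csv('data/bank.csv')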
We chose the data sets for our examples because the machine learning community uses them for learning, exploring, benchmarking, and validating, so we can compare our results to others while learning how to apply machine learning algorithms.
Our data sets are categorized as either classification or regression data. Classification data complexity ranges from simple to relatively complex. Simple classification data sets include load_iris, load_wine, bank.csv, and load_digits. Complex classification data sets include fetch_20newsgroups, MNIST, and fetch_lfw_people. Regression data sets include tips, redwine.csv, whitewine.csv, and load_boston.
Characterize Data
Before working with algorithms, it is best to understand the data characterization. Each data set was carefully chosen to help you gain experience with the most common aspects of machine learning. We begin by describing the characteristics of each data set to better understand its composition and purpose. Data sets are organized by classification and regression data.
Classification data is further organized by complexity. That is, we begin with simple classification data sets that are not complex so that the reader can focus on the machine learning content rather than on the data. We then move onto more complex data sets.
Simple Classification Data
Classification is a machine learning technique for predicting the class to which a dependent variable belongs. A class is a discrete response. In machine learning, the dependent variable is typically referred to as the target. A class is predicted based upon the independent variables of a data set. Independent variables are typically referred to as the feature set or feature space. Feature space is the collection of features used to characterize the data.
Simple data sets are those with a limited number of features. Such a data set is referred to as one with a low-dimensional feature space.
Iris Data
The first data set we characterize is load_iris, which consists of Iris flower data. Iris is a multivariate data set consisting of 50 samples from each of three species of iris (Iris setosa, Iris virginica, and Iris versicolor). Each sample contains four features, namely, length and width of sepals and petals in centimeters. Iris is a typical test case for machine learning classification. It is also one of the best known data sets in the data science literature, which means you can test your results against many other verifiable examples.
The first code example shown in Listing 1-1 loads Iris data, displays its keys, shape of the feature set and target, feature and target names, a slice from the DESCR key, and feature importance (from most to least).
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

if __name__ == "__main__":
    br = '\n'
    # load the embedded Iris data set
    iris = datasets.load_iris()
    keys = iris.keys()
    print (keys, br)
    # feature set X and target y (machine learning convention)
    X = iris.data
    y = iris.target
    print ('features shape:', X.shape)
    print ('target shape:', y.shape, br)
    features = iris.feature_names
    targets = iris.target_names
    print ('feature set:')
    print (features, br)
    print ('targets:')
    print (targets, br)
    # display a slice of the data set description
    print (iris.DESCR[525:900], br)
    # train a random forest to rank feature importance
    rnd_clf = RandomForestClassifier(random_state=0,
                                     n_estimators=100)
    rnd_clf.fit(X, y)
    rnd_name = rnd_clf.__class__.__name__
    feature_importances = rnd_clf.feature_importances_
    importance = sorted(zip(feature_importances, features),
                        reverse=True)
    print ('most important features' + ' (' + rnd_name + '):')
    [print (row) for i, row in enumerate(importance)]
Listing 1-1
Characterize the Iris data set
Go ahead and execute the code from Listing 1-1. Remember that you can find the example in the book's example download, so you don't need to type it by hand. It's easier to access the download and copy/paste.
Your output from executing Listing 1-1 should resemble the following:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
features shape: (150, 4)
target shape: (150,)
feature set:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
targets:
['setosa' 'versicolor' 'virginica']
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:
most important features (RandomForestClassifier):
(0.4604447396171521, 'petal length (cm)')
(0.4241162651271012, 'petal width (cm)')
(0.09090795402103086, 'sepal length (cm)')
(0.024531041234715754, 'sepal width (cm)')
The code begins by importing the datasets and RandomForestClassifier packages. RandomForestClassifier is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees.
In this example, we are only using it to return feature importance. The main block begins by loading data and displaying its characteristics. Loading feature set data into variable X and target data into variable y is convention in the machine learning community.
The code concludes by training RandomForestClassifier on the data so that it can return feature importance. Note that the feature set here is already a NumPy array; when we model data held in pandas, we convert it to NumPy for optimum performance. Keep in mind that the keys are available because the data set is embedded in Scikit-Learn.
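As a minimal sketch of that pandas-to-NumPy conversion (the DataFrame here is hypothetical):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# both return the underlying data as a NumPy array;
# .to_numpy() is the preferred spelling in newer pandas
X = df.values
X = df.to_numpy()
print(type(X))  # <class 'numpy.ndarray'>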
Notice that we only took a small slice from DESCR, which holds a lot of information about the data set. I always recommend displaying at least the shape of the original data set before embarking on any machine learning experiment.
Tip
RandomForestClassifier is a powerful machine learning algorithm that not only models training data, but returns feature importance.
Wine Data
The next data set we characterize is load_wine. The load_wine data set consists of 178 data elements. Each element has thirteen features, and each element belongs to one of three target classes. The data set is considered a classic in the machine learning community and offers an easy multiclass classification problem.
The next code example shown in Listing 1-2 loads wine data and displays its keys, shape of the feature set and target, feature and target names, a slice from the DESCR key, and feature importance (from most to least).
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

if __name__ == "__main__":
    br = '\n'
    data = load_wine()
    keys = data.keys()
    print (keys, br)
    # feature set X and target y
    X, y = data.data, data.target
    print ('features:', X.shape)
    print ('targets', y.shape, br)
    # first feature vector, to verify all features are numeric
    print (X[0], br)
    features = data.feature_names
    targets = data.target_names
    print ('feature set:')
    print (features, br)
    print ('targets:')
    print (targets, br)
    # train a random forest to rank feature importance
    rnd_clf = RandomForestClassifier(random_state=0,
                                     n_estimators=100)
    rnd_clf.fit(X, y)
    rnd_name = rnd_clf.__class__.__name__
    feature_importances = rnd_clf.feature_importances_
    importance = sorted(zip(feature_importances, features),
                        reverse=True)
    # display only the n most important features
    n = 6
    print (n, 'most important features' + ' (' + rnd_name + '):')
    [print (row) for i, row in enumerate(importance) if i < n]
Listing 1-2
Characterize load_wine
After executing code from Listing 1-2, your output should resemble the following:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
features: (178, 13)
targets (178,)
[1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00
2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]
feature set:
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
targets:
['class_0' 'class_1' 'class_2']
6 most important features (RandomForestClassifier):
(0.19399882779940295, 'proline')
(0.16095401215681593, 'flavanoids')
(0.1452667364559143, 'color_intensity')
(0.11070045042456281, 'alcohol')
(0.1097465262717493, 'od280/od315_of_diluted_wines')
(0.08968972021098301, 'hue')
Tip
To create (instantiate) a machine learning algorithm (model), just assign it to a variable (e.g., model = algorithm()). To train based on the model, just fit it to the data (e.g., model.fit(X, y)).
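Concretely, the pattern from the tip looks like this (the estimator choice is just illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(random_state=0)  # instantiate
model.fit(X, y)                                 # train
print(model.predict(X[:3]))                     # predict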
The code begins by importing load_wine and RandomForestClassifier. The main block displays keys, loads data into X and y, displays shapes, displays the first vector from feature set X, and displays feature set and target information. The code concludes by training RandomForestClassifier on X and y so that we can display the six most important features. Notice that we display the first vector from feature set X to verify that all features are numeric.
Bank Data
The next code example shown in Listing 1-3 works with bank data. The bank.csv data set is composed of direct marketing campaigns from a Portuguese banking institution. The target describes whether a client will subscribe (yes/no) to a term deposit (target label y). The data set consists of 41188 data elements with 20 features for each element. A 10% random sample of 4119 data elements is also available from the UCI Machine Learning Repository for more computationally expensive algorithms such as SVM and KNeighborsClassifier.
import pandas as pd

if __name__ == "__main__":
    br = '\n'
    # load the bank marketing data from CSV
    f = 'data/bank.csv'
    bank = pd.read_csv(f)
    # list() applied to a DataFrame returns its column names
    features = list(bank)
    print (features, br)
    # drop the target column to build feature set X
    X = bank.drop(['y'], axis=1).values
    y = bank['y'].values
    print (X.shape, y.shape, br)
    # display a few choice features
    print (bank[['job', 'education', 'age', 'housing',
                 'marital', 'duration']].head())
Listing 1-3
Characterize bank data
After executing code from Listing 1-3, your output should resemble the following:
['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y']
(41188, 20) (41188,)
job education age housing marital duration
0 housemaid basic.4y 56 no married 261
1 services high.school 57 no married 149
2 services high.school 37 yes married 226
3 admin. basic.6y 40 no married 151
4 services high.school 56 no married 307
The code example begins by importing the pandas package. The main block loads bank data from a CSV file into a pandas DataFrame and displays the column names (or features). To retrieve column names from a DataFrame, all we need to do is pass the DataFrame to Python's list() function and assign the result to a variable. Next, feature set X and target y are created. Finally, X and y shapes are displayed as well as a few choice features.
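A minimal sketch of that column-name idiom (with a tiny hypothetical DataFrame):

import pandas as pd

df = pd.DataFrame({'age': [56, 57], 'job': ['housemaid', 'services']})

# list() applied to a DataFrame yields its column names
print(list(df))             # ['age', 'job']

# an equivalent, more explicit alternative
print(df.columns.tolist())  # ['age', 'job']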
Digits Data
The final code example in this group characterizes load_digits.