Hands-on Scikit-Learn for Machine Learning Applications: Data Science Fundamentals with Python

Ebook · 348 pages · 2 hours

About this ebook

With this book, aspiring data science professionals can learn the Scikit-Learn library along with the fundamentals of machine learning. The book combines the Anaconda Python distribution with the popular Scikit-Learn library to demonstrate a wide range of supervised and unsupervised machine learning algorithms. Care is taken to walk you through the principles of machine learning with clear examples written in Python that you can try out and experiment with at home on your own machine.
All applied math and programming skills required to master the content are covered in this book. In-depth knowledge of object-oriented programming is not required, as working and complete examples are provided and explained. The coding examples are in-depth and complex when necessary, yet concise, accurate, and complete, and they complement the machine learning concepts introduced. Working through the examples helps build the skills necessary to understand and apply complex machine learning algorithms.
Hands-on Scikit-Learn for Machine Learning Applications is an excellent starting point for those pursuing a career in machine learning. Students of this book will learn the fundamentals that are a prerequisite to competency. Readers will be exposed to the Anaconda distribution of Python, which is designed specifically for data science professionals, and will build skills in the popular Scikit-Learn library that underlies many machine learning applications in the world of Python.

What You'll Learn
  • Work with simple and complex datasets common to Scikit-Learn
  • Manipulate data into vectors and matrices for algorithmic processing
  • Become familiar with the Anaconda distribution used in data science
  • Apply machine learning with Classifiers, Regressors, and Dimensionality Reduction
  • Tune algorithms and find the best algorithms for each dataset
  • Load data from and save to CSV, JSON, NumPy, and pandas formats

Who This Book Is For
Aspiring data scientists who want to break into machine learning by mastering the underlying fundamentals that are sometimes skipped over in the rush to be productive. Some knowledge of object-oriented programming and very basic applied linear algebra will make learning easier, although anyone can benefit from this book.

Language: English
Publisher: Apress
Release date: Nov 16, 2019
ISBN: 9781484253731

    Book preview

    Hands-on Scikit-Learn for Machine Learning Applications - David Paper

    © David Paper 2020

    D. Paper, Hands-on Scikit-Learn for Machine Learning Applications, https://doi.org/10.1007/978-1-4842-5373-1_1

    1. Introduction to Scikit-Learn

    David Paper, Logan, UT, USA

    Scikit-Learn is a Python library that provides simple and efficient tools for implementing supervised and unsupervised machine learning algorithms. The library is accessible to everyone because it is open source and commercially usable. It is built on the NumPy, SciPy, and Matplotlib libraries, which makes it reliable, robust, and well integrated with the scientific Python ecosystem.

    Scikit-Learn is focused on data modeling rather than data loading, cleansing, munging, or manipulating. It is also easy to use and relatively free of programming bugs.

    Machine Learning

    Machine learning is about getting computers to program themselves. We use algorithms to make this happen. An algorithm is a set of rules a computer follows to calculate or solve a problem.

    Machine learning practitioners create, study, and apply algorithms to improve performance on data-driven tasks. They use tools and technology to answer questions about data by training a machine how to learn.

    The goal is to build robust algorithms that can manipulate input data to predict an output while continually updating outputs as new data becomes available. Any information or data sent to a computer is considered input. Data produced by a computer is considered output.

    In the machine learning community, input data is referred to as the feature set and output data is referred to as the target. The feature set is also referred to as the feature space. Sample data is typically referred to as training data. Once the algorithm is trained with sample data, it can make predictions on new data. New data is typically referred to as test data.
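
    To make this terminology concrete, here is a minimal sketch (my own illustration, not code from the book) that splits the Iris feature set and target into training data and test data with Scikit-Learn's train_test_split:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # feature set X and target y
    X, y = load_iris(return_X_y=True)
    # hold out 20% of the samples as test data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    print (X_train.shape, X_test.shape)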

    Machine learning is divided into two main areas: supervised and unsupervised learning. Since machine learning typically focuses on prediction based on known properties learned from training data, our focus is on supervised learning.

    Supervised learning is when the data set contains both inputs (or the feature set) and desired outputs (or targets). That is, we know the properties of the data. The goal is to make predictions. This ability to supervise algorithm training is a big part of why machine learning has become so popular.

    To classify or regress new data, we must train on data with known outcomes. We classify data by organizing it into relevant categories. We regress data by finding the relationship between feature set data and target data.
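
    As a quick illustration of the difference (again a sketch of my own; the estimator choices are arbitrary), a classifier predicts a discrete class while a regressor predicts a continuous value, but both follow the same Scikit-Learn pattern:

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

    X, y = load_iris(return_X_y=True)

    # classify: predict a discrete category (the iris species)
    clf = KNeighborsClassifier().fit(X, y)
    print (clf.predict(X[:1]))

    # regress: predict a continuous value (petal width from the
    # other three measurements)
    reg = KNeighborsRegressor().fit(X[:, :3], X[:, 3])
    print (reg.predict(X[:1, :3]))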

    With unsupervised learning, the data set contains only inputs but no desired outputs (or targets). The goal is to explore the data and find some structure or way to organize it. Although not the focus of the book, we will explore a few unsupervised learning scenarios.
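
    For contrast, here is a minimal unsupervised sketch (my own illustration) that clusters the Iris feature set without ever seeing the targets:

    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans

    # unsupervised learning: only the feature set, no targets
    X, _ = load_iris(return_X_y=True)
    kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
    labels = kmeans.fit_predict(X)
    # cluster assignments, not predefined classes
    print (labels[:10])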

    Anaconda

    You can use any Python installation, but I recommend installing Python with Anaconda for several reasons. First, it has over 15 million users. Second, Anaconda allows easy installation of the desired version of Python. Third, it preinstalls many useful libraries for machine learning including Scikit-Learn. Follow this link to see the Anaconda package lists for your operating system and Python version: https://docs.anaconda.com/anaconda/packages/pkg-docs/. Fourth, it includes several very popular editors including IDLE, Spyder, and Jupyter Notebooks. Fifth, Anaconda is reliable and well-maintained and removes compatibility bottlenecks.

    You can easily download and install Anaconda with this link: https://www.anaconda.com/download/. You can update with this link: https://docs.anaconda.com/anaconda/install/update-version/. Just open Anaconda and follow the instructions. I recommend updating to the current version.

    Scikit-Learn

    Python’s Scikit-Learn is one of the most popular machine learning libraries. It is built on Python libraries NumPy, SciPy, and Matplotlib. The library is well-documented, open source, commercially usable, and a great vehicle to get started with machine learning. It is also very reliable and well-maintained, and its vast collection of algorithms can be easily incorporated into your projects. Scikit-Learn is focused on modeling data rather than loading, manipulating, visualizing, and summarizing data. For such activities, other libraries such as NumPy, pandas, Matplotlib, and seaborn are covered as encountered. The Scikit-Learn library is imported into a Python script as sklearn.
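
    A quick sanity check (a sketch of my own, not from the book) confirms that the libraries this chapter relies on are installed and importable:

    import sys
    import sklearn, numpy, scipy, matplotlib

    print ('Python:', sys.version.split()[0])
    print ('Scikit-Learn:', sklearn.__version__)
    print ('NumPy:', numpy.__version__)
    print ('SciPy:', scipy.__version__)
    print ('Matplotlib:', matplotlib.__version__)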

    Data Sets

    A great way to understand machine learning applications is by working through Python data-driven code examples. We use Scikit-Learn, UCI Machine Learning Repository, and seaborn data sets for all examples. The Scikit-Learn data sets package embeds some small data sets for getting started and provides helpers to fetch larger data sets commonly used by the machine learning community to benchmark algorithms on data from the world at large. The UCI Machine Learning Repository maintains 468 data sets to serve the machine learning community. Seaborn provides an API on top of Matplotlib that offers simplicity when working with plot styles, color defaults, and high-level functions for common statistical plot types, which facilitates visualization. It also integrates nicely with pandas DataFrame functionality.
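
    The sketch below (my own illustration; it assumes seaborn is installed and, for the seaborn data set, that a network connection is available on first use) shows the loading routes used throughout the book:

    from sklearn import datasets
    import seaborn as sns

    # small data set embedded in Scikit-Learn
    iris = datasets.load_iris()
    print (iris.data.shape)

    # larger data sets are downloaded by fetchers, e.g.,
    # datasets.fetch_20newsgroups()

    # seaborn data set (downloaded on first use)
    tips = sns.load_dataset('tips')
    print (tips.head())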

    We chose the data sets for our examples because the machine learning community uses them for learning, exploring, benchmarking, and validating, so we can compare our results to others while learning how to apply machine learning algorithms.

    Our data sets are categorized as either classification or regression data. Classification data complexity ranges from simple to relatively complex. Simple classification data sets include load_iris, load_wine, bank.csv, and load_digits. Complex classification data sets include fetch_20newsgroups, MNIST, and fetch_lfw_people. Regression data sets include tips, redwine.csv, whitewine.csv, and load_boston.

    Characterize Data

    Before working with algorithms, it is best to understand each data set's characteristics. Each data set was carefully chosen to help you gain experience with the most common aspects of machine learning. We begin by describing the characteristics of each data set to better understand its composition and purpose. The data sets are organized by classification and regression data.

    Classification data is further organized by complexity. That is, we begin with simple classification data sets that are not complex so that the reader can focus on the machine learning content rather than on the data. We then move on to more complex data sets.

    Simple Classification Data

    Classification is a machine learning technique for predicting the class to which a dependent variable belongs. A class is a discrete response. In machine learning, a dependent variable is typically referred to as the target. A class is predicted based upon the independent variables of a data set. Independent variables are typically referred to as the feature set or feature space. Feature space is the collection of features used to characterize the data.

    Simple data sets are those with a limited number of features. Such a data set is referred to as one with a low-dimensional feature space.

    Iris Data

    The first data set we characterize is load_iris, which consists of Iris flower data. Iris is a multivariate data set consisting of 50 samples from each of three species of iris (Iris setosa, Iris virginica, and Iris versicolor). Each sample contains four features, namely, length and width of sepals and petals in centimeters. Iris is a typical test case for machine learning classification. It is also one of the best known data sets in the data science literature, which means you can test your results against many other verifiable examples.

    The first code example, shown in Listing 1-1, loads the Iris data and displays its keys, the shapes of the feature set and target, the feature and target names, a slice from the DESCR key, and the feature importances (from most to least important).

    from sklearn import datasets
    from sklearn.ensemble import RandomForestClassifier

    if __name__ == "__main__":
        br = '\n'
        # load the Iris data set embedded in Scikit-Learn
        iris = datasets.load_iris()
        keys = iris.keys()
        print (keys, br)
        # feature set X and target y
        X = iris.data
        y = iris.target
        print ('features shape:', X.shape)
        print ('target shape:', y.shape, br)
        features = iris.feature_names
        targets = iris.target_names
        print ('feature set:')
        print (features, br)
        print ('targets:')
        print (targets, br)
        # display a slice of the data set description
        print (iris.DESCR[525:900], br)
        # train a random forest to compute feature importances
        rnd_clf = RandomForestClassifier(random_state=0,
                                         n_estimators=100)
        rnd_clf.fit(X, y)
        rnd_name = rnd_clf.__class__.__name__
        feature_importances = rnd_clf.feature_importances_
        importance = sorted(zip(feature_importances, features),
                            reverse=True)
        print ('most important features' + ' (' + rnd_name + '):')
        [print (row) for i, row in enumerate(importance)]

    Listing 1-1

    Characterize the Iris data set

    Go ahead and execute the code from Listing 1-1. Remember that you can find the example in the book's example download. You don't need to type the example by hand; it's easier to access the example download and copy and paste.

    Your output from executing Listing 1-1 should resemble the following:

    dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

    features shape: (150, 4)
    target shape: (150,)

    feature set:
    ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

    targets:
    ['setosa' 'versicolor' 'virginica']

        ============== ==== ==== ======= ===== ====================
                        Min  Max   Mean    SD   Class Correlation
        ============== ==== ==== ======= ===== ====================
        sepal length:   4.3  7.9   5.84   0.83    0.7826
        sepal width:    2.0  4.4   3.05   0.43   -0.4194
        petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
        petal width:

    most important features (RandomForestClassifier):
    (0.4604447396171521, 'petal length (cm)')
    (0.4241162651271012, 'petal width (cm)')
    (0.09090795402103086, 'sepal length (cm)')
    (0.024531041234715754, 'sepal width (cm)')

    The code begins by importing the datasets and RandomForestClassifier packages. RandomForestClassifier is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees.

    In this example, we are only using it to return feature importances. The main block begins by loading the data and displaying its characteristics. Loading feature set data into variable X and target data into variable y is a convention in the machine learning community.

    The code concludes by training RandomForestClassifier on the data so that it can return feature importances. When data arrives as a pandas DataFrame, we convert it to NumPy arrays before modeling for optimum performance. Keep in mind that the keys are available because the data set is embedded in Scikit-Learn.
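
    A tiny sketch of that conversion (my own illustration with made-up values):

    import pandas as pd

    # hypothetical DataFrame standing in for loaded data
    df = pd.DataFrame({'sepal length (cm)': [5.1, 4.9],
                       'petal length (cm)': [1.4, 1.4]})
    X = df.values  # NumPy array holding the same data
    print (type(X), X.shape)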

    Notice that we only took a small slice from DESCR, which holds a lot of information about the data set. I always recommend displaying at least the shape of the original data set before embarking on any machine learning experiment.

    Tip

    RandomForestClassifier is a powerful machine learning algorithm that not only models training data but also returns feature importances.

    Wine Data

    The next data set we characterize is load_wine. The load_wine data set consists of 178 data elements. Each element has thirteen features, and each element belongs to one of three target classes. The data set is considered a classic in the machine learning community and offers an easy multiclass classification problem.

    The next code example, shown in Listing 1-2, loads the wine data and displays its keys, the shapes of the feature set and target, the first feature vector, the feature and target names, and the feature importances (from most to least important).

    from sklearn.datasets import load_wine
    from sklearn.ensemble import RandomForestClassifier

    if __name__ == "__main__":
        br = '\n'
        # load the wine data set embedded in Scikit-Learn
        data = load_wine()
        keys = data.keys()
        print (keys, br)
        # feature set X and target y
        X, y = data.data, data.target
        print ('features:', X.shape)
        print ('targets', y.shape, br)
        # display the first feature vector
        print (X[0], br)
        features = data.feature_names
        targets = data.target_names
        print ('feature set:')
        print (features, br)
        print ('targets:')
        print (targets, br)
        # train a random forest to compute feature importances
        rnd_clf = RandomForestClassifier(random_state=0,
                                         n_estimators=100)
        rnd_clf.fit(X, y)
        rnd_name = rnd_clf.__class__.__name__
        feature_importances = rnd_clf.feature_importances_
        importance = sorted(zip(feature_importances, features),
                            reverse=True)
        # display the n most important features
        n = 6
        print (n, 'most important features' + ' (' + rnd_name + '):')
        [print (row) for i, row in enumerate(importance) if i < n]

    Listing 1-2

    Characterize load_wine

    After executing code from Listing 1-2, your output should resemble the following:

    dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

    features: (178, 13)
    targets (178,)

    [1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00
     2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]

    feature set:
    ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

    targets:
    ['class_0' 'class_1' 'class_2']

    6 most important features (RandomForestClassifier):
    (0.19399882779940295, 'proline')
    (0.16095401215681593, 'flavanoids')
    (0.1452667364559143, 'color_intensity')
    (0.11070045042456281, 'alcohol')
    (0.1097465262717493, 'od280/od315_of_diluted_wines')
    (0.08968972021098301, 'hue')

    Tip

    To create (instantiate) a machine learning algorithm (model), just assign it to a variable (e.g., model = algorithm()). To train the model, just fit it to the data (e.g., model.fit(X, y)).
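
    A minimal sketch of that pattern (my own illustration; the choice of estimator is arbitrary):

    from sklearn.datasets import load_wine
    from sklearn.linear_model import LogisticRegression

    X, y = load_wine(return_X_y=True)
    model = LogisticRegression(max_iter=5000)  # instantiate
    model.fit(X, y)                            # train
    print (model.score(X, y))                  # accuracy on the training data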

    The code begins by importing load_wine and RandomForestClassifier. The main block displays the keys, loads data into X and y, displays the shapes, displays the first vector from feature set X, and displays feature set and target information. The code concludes by training RandomForestClassifier on X and y so that we can display the six most important features. Notice that we display the first vector from feature set X to verify that all features are numeric.

    Bank Data

    The next code example, shown in Listing 1-3, works with bank data. The bank.csv data set is composed of data from direct marketing campaigns of a Portuguese banking institution. The target describes whether a client will subscribe (yes/no) to a term deposit (target label y). The data set consists of 41188 data elements with 20 features for each element. A 10% random sample of 4119 data elements is also available from the UCI Machine Learning Repository for more computationally expensive algorithms such as svm and KNeighborsClassifier.

    import pandas as pd

    if __name__ == "__main__":
        br = '\n'
        # load bank data from CSV into a pandas DataFrame
        f = 'data/bank.csv'
        bank = pd.read_csv(f)
        # column names (the features plus target column y)
        features = list(bank)
        print (features, br)
        # feature set X and target y as NumPy arrays
        X = bank.drop(['y'], axis=1).values
        y = bank['y'].values
        print (X.shape, y.shape, br)
        # display a few choice features
        print (bank[['job', 'education', 'age', 'housing',
                     'marital', 'duration']].head())

    Listing 1-3

    Characterize bank data

    After executing code from Listing 1-3, your output should resemble the following:

    ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y']

    (41188, 20) (41188,)

             job    education  age housing  marital  duration
    0  housemaid     basic.4y   56      no  married       261
    1   services  high.school   57      no  married       149
    2   services  high.school   37     yes  married       226
    3     admin.     basic.6y   40      no  married       151
    4   services  high.school   56      no  married       307

    The code example begins by importing the pandas package. The main block loads bank data from a CSV file into a pandas DataFrame and displays the column names (or features). To retrieve the column names, all we need to do is pass the DataFrame to list() and assign the result to a variable. Next, feature set X and target y are created. Finally, the X and y shapes are displayed along with a few choice features.

    Digits Data

    The final code example in this
