Python For Data Analysis - The Ultimate Beginner's Guide To Learn Programming in Python For Data Science With Pandas and NumPy, Master Statistical Analysis, and Visualization (2020)
Matt Foster
© Copyright 2019 - All rights reserved.
The content contained within this book may not be reproduced, duplicated, or transmitted without
direct written permission from the author or the publisher.
Under no circumstances will any blame or legal responsibility be held against the publisher, or
author, for any damages, reparation, or monetary loss due to the information contained within this
book. Either directly or indirectly.
Legal Notice:
This book is copyright protected. This book is only for personal use. You cannot amend, distribute,
sell, use, quote or paraphrase any part, or the content within this book, without the consent of the
author or publisher.
Disclaimer Notice:
Please note the information contained within this document is for educational and entertainment
purposes only. Every effort has been made to present accurate, up-to-date, reliable, and complete
information. No warranties of any kind are declared or implied. Readers acknowledge that the author
is not engaging in the rendering of legal, financial, medical, or professional advice. The content
within this book has been derived from various sources. Please consult a licensed professional before
attempting any techniques outlined in this book.
By reading this document, the reader agrees that under no circumstances is the author responsible for
any losses, direct or indirect, which are incurred as a result of the use of information contained within
this document, including, but not limited to, errors, omissions, or inaccuracies.
Table of Contents
Introduction
Conclusion
Introduction
Collecting, storing, and analyzing vast amounts of data is part of a bank's obligations. Data science applications turn that obligation into an opportunity for banks to learn more about their customers and to drive new revenue, rather than treating the data as a mere compliance exercise. Digital banking is now widely used, and this influx of activity produces terabytes of customer data; isolating the genuinely relevant data is therefore the first task for data scientists. Working from customers' preferences, interactions, and behaviors, data science applications extract the most relevant client information and process it to improve the decision-making of the business.
Personalized marketing
Providing a customized offer that fits the preferences and needs of a particular customer is crucial to success in marketing, and it is now possible to make the right offer on the right device to the right customer at the right time. When launching a new product, target selection identifies potential customers with the help of data science applications. Data scientists build models that predict the probability of a customer responding to an offer or promotion from demographic, purchase-history, and behavioral data. In this way, banks have improved their customer relations, personalized their outreach, and made their marketing more efficient through data science applications.
Drug creation
The process of drug discovery involves many disciplines and is highly complicated. Even the best ideas must pass through enormous amounts of testing, time, and financial expenditure, and getting a drug officially approved can typically take up to twelve years. Data science applications now shorten and simplify the process by adding insight at every stage, from the screening of drug compounds to the prediction of success rates based on biological factors. Using advanced mathematical modeling and simulations rather than lab experiments alone, these applications can forecast how a compound will act in the body. Computational drug discovery models the compound as part of a biologically relevant network, which simplifies the prediction of outcomes with high accuracy.
Industry knowledge
To offer the best possible treatment and improve services, knowledge management in healthcare is vital: it brings together externally generated information and internal expertise. With new technologies appearing and the industry changing rapidly every day, gathering, storing, and distributing different facts effectively is essential. Data science applications help healthcare organizations integrate various sources of knowledge and use them together in the treatment process to achieve progressive results.
Recurrent neural networks and time series forecasting are part of optimizing oil and gas production. Predicting oil rates and gas-to-oil ratios is a significant KPI. Using feature-extraction models, operators can predict bottom-hole pressure, choke, wellhead temperature, and the daily oil rate from data on nearby wells. To predict production decline, they make use of fracture parameters, and for pattern recognition on sucker-rod dynamometer cards, they use neural networks and deep learning.
Downstream optimization
To process gas and crude oil, refineries use massive volumes of water, and there are now systems that tackle water management in the oil and gas industry. In addition, by analyzing distribution data effectively, cloud-based services have increased the speed of modeling for revenue forecasting.
The Internet
Whenever anyone thinks about data science, the first idea that comes to mind is the internet. It is typical to think of Google when we talk about searching for something online, but Bing, Yahoo, AOL, Ask, and others are also search engines. What they all have in common are data science algorithms, which allow them to return results in a fraction of a second when you run a search. Google alone processes more than 20 petabytes of data every day; these search engines are what they are today because of data science.
Targeted advertising
Of all the data science applications, the digital marketing spectrum rivals even the search engines in significance. Data science algorithms decide the distribution of digital billboards and banner displays on different websites, and compared with traditional advertisements, they have helped marketers achieve higher click-through rates. Using a user's behavior, marketers can target them with specific adverts, so that at the same time and in the same place online, one user might see an ad on anger management while another sees an ad on a keto diet.
Website recommendations
This case is familiar to everyone: you see suggestions for similar products on eBay and Amazon. Doing this adds a great deal to the user experience while helping users discover appropriate products from the many available. Leaning on users' relevant information and interests, many businesses promote their products and services with this kind of engine. To improve user experience, internet giants including Google Play, Amazon, and Netflix use this system, deriving recommendations from the results of a user's previous searches.
Speech recognition
Siri, Google Voice, Cortana, and many others are among the best-known speech recognition products. Speech recognition tools make it easy for those who are not in a position to type a message: their speech is converted to text as they speak. The accuracy of speech recognition, however, is not always guaranteed.
Recommendation engine
According to some experts, this concept is one of the most promising and efficient. Some central booking and travel web platforms use recommendation engines in their everyday work, mainly to match the needs and wishes of customers with the available offers. Based on preferences and previous searches, travel and tourism companies that apply data-powered recommendation engine solutions can suggest alternative travel dates, rental deals, new routes, attractions, and destinations. Booking service providers and travel agencies use recommendation engines to offer suitable provisions to all of these customers.
Route optimization
In the travel and tourism industry, route optimization plays a significant
role. It can be quite challenging to account for several destinations and to plan trips, schedules, working hours, and distances. With route optimization, it becomes easier to do some of the following:
Time management
Minimization of the travel costs
Minimization of distance
Data science certainly improves lives and continues to change the face of several industries, giving them the opportunity to provide unique experiences for their customers with high satisfaction rates. Beyond shifting our attitudes, data science has become one of the most promising technologies bringing change to different businesses. With the many solutions data science applications provide, there is no doubt that their benefits cannot be over-emphasized.
Chapter 1 - What is Data Analysis
Orange
Orange is an open-source data visualization and analysis tool designed for people who do not have expertise in data science. It helps the user build an interactive workflow for analyzing and visualizing data, using a simple interface and an advanced toolbox. Its output ranges from mainstream scatter plots and bar charts to dendrograms.
Knime
Knime is another open-source tool that enables the user to explore data and interpret hidden insights effectively. One of its good attributes is that it contains more than 1,000 modules, along with numerous examples that help the user understand how to apply the tool effectively. It is equipped with advanced integrated tools and some complex algorithms.
R-programming
R is the most common and widely used tool and has become a standard for statistical programming. It is free, open-source software that any user can install, use, upgrade, modify, clone, and even resell. It can easily and effectively be used for statistical computing and graphics, and it is compatible with operating systems such as Windows, macOS, and UNIX. It is a high-performance language that lets the user manage big data. Since it is free and regularly updated, it keeps technological projects cost-effective. Along with data mining, it lets users apply their statistical and graphical knowledge, including common techniques such as statistical tests, clustering, and linear and non-linear modeling.
Rapidminer
Rapidminer is similar to KNIME with respect to dealing with visual
programming for data modeling, analysis, and manipulation. It helps to
improve the overall yield of data science project teams. It offers an open-
source platform that permits Machine Learning, model deployment, and
data preparation. It is responsible for speeding up the development of an
entire analytical workflow, right from the steps of model validation to
deployment.
Pentaho
Pentaho tackles the issues organizations face in drawing value from their various data sources. It simplifies data preparation and data blending, and it provides tools for analysis, visualization, reporting, exploration, and prediction of data. It lets every member of a team derive meaning from the data.
Weka
Weka is another open-source software package, designed to handle machine learning algorithms and simplify data mining tasks. The user can apply these algorithms directly to process a data set. Since it is implemented in Java, it can also be used for developing new machine learning schemes. Its simple graphical user interface allows an easy transition into the field of data science, and any user acquainted with Java can invoke the library from their own code.
NodeXL
NodeXL is open-source data visualization and analysis software capable of displaying relationships in datasets. It has numerous modules, such as social network data importers and automation.
Gephi
Gephi is an open-source visualization and network analysis tool written in Java.
Talend
Talend is one of the leading open-source software providers that most data-driven companies go for. It enables customers to connect easily, irrespective of where they are.
Data Visualization
Data Wrapper
It is an online data-visualization tool that can be used to build interactive charts. Data can be uploaded as CSV, Excel, or PDF files, and the tool can generate maps, bar charts, and line charts. The graphs created with it come with ready-to-use embed codes and can be placed on any website.
Tableau Public
Tableau Public is a powerful tool that can create stunning visualizations that
can be used in any type of business. Data insights can be identified with the
help of this tool. Using the visualization tools in Tableau Public, a data scientist can explore data before applying any complex statistical processing.
Infogram
Infogram offers more than 35 interactive chart types and 500 maps that allow the user to visualize data. It can make various charts, such as word clouds, pie charts, and bar charts.
Google Fusion Tables
Google Fusion Tables is one of the most powerful data analysis tools. It is
widely used when an individual has to deal with massive datasets.
Solver
Solver supports effective financial reporting, budgeting, and analysis. It provides a button-driven interface that lets you interact with a company's profit-related data.
Sentiment Tools
Opentext
This specialized classification engine makes it possible to identify and evaluate expressions and patterns. It carries out analysis at various levels: document, sentence, and topic.
Trackur
Trackur is an automated sentiment analysis software emphasizing a specific
keyword that is tracked by an individual. It can draw vital insights by
monitoring social media and mainstream news. In short, it identifies and
discovers different trends.
Opinion Crawl
Opinion Crawl is another online sentiment analysis tool that analyses the latest news, products, and companies. Every visitor is free to access Web sentiment on a specific topic; anyone can enter a topic and receive an assessment. A pie chart reflecting the latest real-time sentiment is displayed for every topic, and the different concepts that people associate with it are represented by thumbnails and tag clouds. The positive and negative weight of the sentiment is also displayed. Web crawlers search the up-to-date content published on recent subjects and issues to create a comprehensive analysis.
Sage Live
Sage Live is a cloud-based accounting platform aimed at small and mid-sized businesses. It enables the user to create invoices and pay bills from a smartphone. It is a good choice if you want a data visualization tool that supports multiple companies, currencies, and banks.
Gawk GNU
Gawk is the GNU implementation of the AWK language. It is free software that interprets a special-purpose programming language, enabling users to handle simple data-reformatting jobs. The following are its main attributes:
➢ It is not procedural; it is data-driven.
➢ Writing programs is easy.
➢ It can search for a variety of patterns in text.
GraphLab Create
GraphLab Create can be used by data scientists as well as developers. It enables the user to build state-of-the-art data products that use Machine Learning to create smart applications.
Its attributes include the integration of automatic feature engineering, Machine Learning visualizations, and model selection into the application. It can identify and link records within and across data sources, and it simplifies the development of Machine Learning models.
Apache Spark
Apache Spark is designed to run in-memory and in real time.
The top 5 data analytics tools and techniques
Visual analytics
Visual analytics brings together different methods of data analysis through an integrated effort involving human interaction, data analysis, and visualization.
Business Experiments
Business experiments cover all the techniques used to test the validity of certain processes, including A/B testing and experimental design.
Regression Analysis
Regression Analysis allows the identification of factors that make two
different variables related to each other.
Correlation Analysis
Correlation Analysis is a statistical technique that detects whether a
relationship exists between two different variables.
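As a rough sketch of what a correlation analysis can look like in Python (the data frame and its column names, 'ad_spend' and 'revenue', are invented purely for illustration), pandas exposes a built-in .corr() method:
In []: import pandas as pd
data = pd.DataFrame({'ad_spend': [10, 20, 30, 40, 50],
                     'revenue': [12, 24, 33, 41, 55]})
data['ad_spend'].corr(data['revenue'])   # Pearson correlation coefficient
data.corr()                              # full pairwise correlation matrix
A value close to +1 indicates a strong positive relationship, a value close to -1 a strong negative one, and a value near 0 suggests no linear relationship.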
The Process
All data science projects are different in one way or another; however, they can all be broken down into typical stages. The very first step in this process is
acquiring data. This can be done in many ways. Your data can come from
databases, HTML, images, Excel files, and many other sources, and
uploading data is an important step every data scientist needs to go through.
Data munging comes after uploading the data because, at this point, the raw data cannot be used for any kind of analysis: it can be chaotic, filled with senseless information or gaps. This is why, as an aspiring data scientist, you solve this problem with Python data structures that
will turn this data into a data set that contains variables. You will need these
data sets when working with any kind of statistical or machine learning
analysis. Data munging might not be the most exciting phase in data
science, but it is the foundation for your project and much needed to extract
the valuable data you seek to obtain.
In the next phase, once you observe the data you obtain, you will begin to
create a hypothesis that will require testing. You will examine variables
graphically, and come up with new variables. You will use various data
science methodologies such as machine learning or graph analysis in order
to establish the most effective variables and their parameters. In other
words, in this phase you process all the data you obtain from the previous
phase and you create a model from it. You will undoubtedly realize in your
testing that corrections are needed and you will return to the data munging
phase to try something else. It’s important to keep in mind that most of the
time, the solution for your hypothesis will be nothing like the actual
solution you will have at the end of a
successful project. This is why you
cannot work purely theoretically. A good data scientist is required to
prototype a large variety of potential solutions and put them all to the test
until the best course of action is revealed.
One of the most essential parts of the data science process is visualizing the results through tables, charts, and plots. The overall process is often referred to as "OSEMN", which stands for "Obtain, Scrub, Explore, Model, iNterpret". While this abbreviation doesn't entirely illustrate the process behind data science, it captures the most important stages you should be aware of as an aspiring data scientist. Just keep in mind that data munging will often take the majority of your efforts when working on a project.
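To make the data munging stage a little more concrete, here is a minimal, hypothetical sketch using pandas; the column names and the messy values are invented for the example rather than taken from a real project.
In []: import pandas as pd
import numpy as np
# Hypothetical raw data with a gap, a nonsense value, and inconsistent text
raw = pd.DataFrame({'age': [25, np.nan, 31, 119],
                    'city': ['Austin', 'austin', None, 'Boston']})
clean = raw.copy()
clean['age'] = clean['age'].fillna(clean['age'].median())    # fill the gap
clean = clean[clean['age'] < 100]                            # drop an implausible age
clean['city'] = clean['city'].fillna('Unknown').str.title()  # normalize the text
clean    # a tidy data set ready for analysis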
#Example of inheritance
#base class
class Student(object):
    def __init__(self, name, rollno):
        self.name = name
        self.rollno = rollno

#GraduateStudent class inherits or is derived from the Student class
class GraduateStudent(Student):
    def __init__(self, name, rollno, graduate):
        Student.__init__(self, name, rollno)
        self.graduate = graduate

    def DisplayGraduateStudent(self):
        print("Student Name:", self.name)
        print("Student Rollno:", self.rollno)
        print("Study Group:", self.graduate)

#PostGraduate class inherits from the Student class
class PostGraduate(Student):
    def __init__(self, name, rollno, postgrad):
        Student.__init__(self, name, rollno)
        self.postgrad = postgrad

    def DisplayPostGraduateStudent(self):
        print("Student Name:", self.name)
        print("Student Rollno:", self.rollno)
        print("Study Group:", self.postgrad)

#instantiate from the GraduateStudent and PostGraduate classes
objGradStudent = GraduateStudent("Mainu", 1, "MS-Mathematics")
objPostGradStudent = PostGraduate("Shainu", 2, "MS-CS")
objGradStudent.DisplayGraduateStudent()
objPostGradStudent.DisplayPostGraduateStudent()
When you type this into your interpreter, you will get the following results:
Student Name: Mainu
Student Rollno: 1
Study Group: MS-Mathematics
Student Name: Shainu
Student Rollno: 2
Study Group: MS-CS
Overloading
Another process that you may want to consider when you’re working with
inheritances is learning how to ‘overload.’ When you work on the process
known as overloading, you can take one of the identifiers that you are
working with and then use that to define at least two methods, if not more.
For the most part, there will only be two methods that are inside of each
class, but sometimes this number will be higher. The two methods should
be inside the exact same class, but they need to have different parameters so
that they can be kept separate in this process. You will find that it is a good
idea to use this method when you want the two matched methods to do the
same tasks, but you would like them to do that task while having different
parameters.
This is not something that is common to work with, and as a beginner, you
will have very little need to use this since many experts don’t actually use it
either. But it is still something that you may want to spend your time
learning about just in case you do need to use it inside of your code. There
are some extra modules available for you that you can download so you can
make sure that overloading will work for you.
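Plain Python does not keep two same-named methods apart by their parameters; the later definition simply replaces the earlier one, which is why those extra modules exist. As one hedged sketch of the idea using only the standard library (it requires Python 3.8 or later), functools.singledispatchmethod picks an implementation based on the type of the first argument:
In []: from functools import singledispatchmethod

class Greeter:
    @singledispatchmethod
    def greet(self, arg):
        # Fallback version, used for any type not registered below
        print("Hello,", arg)

    @greet.register
    def _(self, arg: int):
        # Version used when greet() receives an integer
        print("Hello, customer number", arg)

g = Greeter()
g.greet("Matt")   # prints: Hello, Matt
g.greet(42)       # prints: Hello, customer number 42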
The pd.read_<file_type>('file_name') family of methods is the default way to read files into the Pandas framework. After import, pandas displays the content as a data frame for manipulation using all the methods we have practiced so far, and more.
CSV (comma-separated values) & Excel
Create a CSV file in Excel and save it in your Python working directory. You can check where your working directory is in a Jupyter notebook by typing pwd. If you want to change to another directory containing your files (e.g. Desktop), you can use the following code:
In []: import os
os.chdir('C:\\Users\\Username\\Desktop')
To import your CSV file, type pd.read_csv('csv_file_name'). Pandas will automatically detect the data stored in the file and display it as a data frame. A better approach would be to assign the imported data to a variable like this:
In []: Csv_data = pd.read_csv('example file.csv')
Csv_data
# show
Running this cell will assign the data in ‘example file.csv’ to the variable
Csv_data, which is of the type data frame. Now it can be called later or used
for performing some of the data frame operations.
For Excel files (.xlsx and .xls), the same approach is taken. To read an Excel file named 'class data.xlsx', we use the following code:
In []: Xl_data = pd.read_excel('class data.xlsx')
Xl_data
# show
This returns a data frame of the required values. You may notice that an
index starting from 0 is automatically assigned at the left side. This is
similar to declaring a data frame without explicitly including the index
field. You can add index names, like we did in previous examples.
Tip: if the Excel spreadsheet has multiple sheets filled, you can specify the sheet you need to import. Say we need only Sheet 1, we pass sheet_name = 'Sheet 1' to read_excel(). For extra functionality, you may check the documentation for read_excel() by using shift+tab.
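Putting that tip together into a single call (still assuming the same 'class data.xlsx' file used above), the sketch looks like this:
In []: sheet1_data = pd.read_excel('class data.xlsx', sheet_name='Sheet 1')
sheet1_data.head()    # preview the first rows of the selected sheet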
Write
After working with our imported or pandas-built data frames, we can write
the resulting data frame back into various formats. We will, however, only
consider writing back to CSV and excel. To write a data frame to CSV, use
the following syntax:
In []: Csv_data.to_csv('file name', index=False)
This writes the data frame Csv_data to a CSV file with the specified filename in the Python working directory. If the file does not exist, it creates it.
For writing to an Excel file, a similar syntax is used, but with a sheet name specified for the data frame being exported.
In []: Xl_data.to_excel('file name.xlsx', sheet_name='Sheet 1')
This writes the data frame Xl_data to 'Sheet 1' of 'file name.xlsx'.
HTML
Reading HTML files through pandas requires a few libraries to be installed: html5lib, lxml, and BeautifulSoup4. Since we installed the latest Anaconda, these libraries are likely to be included. Use conda list to verify, and conda install to install any missing ones.
HTML tables can be read directly into pandas using the pd.read_html('sheet url') method. The sheet url is a web link to the data set to be imported. As an example, let us import the 'Failed bank lists' dataset from the FDIC's website and call it w_data.
In []: w_data = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
w_data[0]
To display the result, we used w_data[0] here. This is because the table we need is the first table element in the webpage source code. If you are familiar with HTML, you can easily identify where each element lies. To inspect a web page's source code in the Chrome browser, right-click on the page and select 'view page source'. Since what we are looking for is table-like data, it will be specified as a table in the source code; that is where the data set is defined in the FDIC page source.
Exercises:
We will be applying all we have learned here.
1. Import pandas as pd
2. Import the CSV file into Jupyter notebook, assign it to a
variable ‘Sal’, and display the first 5 values.
Hint: use the .head() method to display the first 5 values of a data frame. Likewise, .tail() is used for displaying the last 5 results. To specify more values, pass 'n=value' into the method.
3. What is the highest pay (including benefits)? Answer:
567595.43
Hint: Use data frame column indexing and the .max() method.
4. According to the data, what is ‘MONICA FIELDS’s Job
title, and how much does she make plus benefits? Answer:
Deputy Chief of the Fire Department, and $ 261,366.14.
Hint: Data frame column selection and conditional selection work (conditional selection can be found under Example 72; use column index == 'string' for the Boolean condition).
5. Finally, who earns the highest basic salary (minus benefits), and by how much is their salary higher than the average basic salary? Answer: NATHANIEL FORD earns the highest. His salary is higher than the average by $492827.1080282971.
Hint: Use the .max() and .mean() methods for the pay gap. Conditional selection with column indexing also works for finding the employee name with the highest pay.
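For reference, here is a hedged sketch of how those steps might look in code. The file name 'Salaries.csv' and the column names 'EmployeeName', 'BasePay', and 'TotalPayBenefits' are assumptions about the data set; adjust them to match your own file.
In []: import pandas as pd
Sal = pd.read_csv('Salaries.csv')              # assumed file name
Sal.head()                                     # first 5 rows
Sal['TotalPayBenefits'].max()                  # highest pay including benefits
Sal[Sal['EmployeeName'] == 'MONICA FIELDS']    # job title and total pay for one employee
# gap between the highest and the average basic salary
Sal['BasePay'].max() - Sal['BasePay'].mean()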
Chapter 8 - The Different Types
There are two main types of data, namely structured and unstructured, and the types
of algorithms and models that we can run on them will depend on what kind
of data we are working with. Both can be valuable, but it often depends on
what we are trying to learn, and which one will serve us the best for the
topic at hand. With that in mind, let’s dive into some of the differences
between structured and unstructured data and why each can be so important
to our data analysis.
Structured Data
The first type of data that we will explore is known as structured data. This
is often the kind that will be considered traditional data. This means that we
will see it consisting mainly of lots of text files that are organized and have
a lot of useful information. We can quickly glance through this information
and see what kind of data is there, without having to look up more
information, labeling it, or looking through videos to find what we want.
Structured data is going to be the kind that we can store inside one of the
options for warehouses of data, and we can then pull it up any time that we
want for analysis. Before the era of big data, and some of the emerging
sources of data that we are using on a regular basis now, structured data was
the only option that most companies would use to make their business
decisions.
Many companies still love to work with this structured data. The data is
very organized and easy to read through, and it is easier to digest. This
ensures that our analysis is going to be easier to go through with legacy
solutions to data mining. To make this more specific, this structured data is
going to be made up largely of some of the customer data that is the most
basic and could provide us with some information including the contact
information, address, names, geographical locations and more of the
customers.
In addition to all of this, a business may decide to collect some transactional
data and this would be a source of structured data as well. Some of the
transactional data that the company could choose to work with would
include financial information, but we must make sure that when this is used,
it is stored in the appropriate manner so it meets the standards of
compliance for the industry.
There are several methods we can use in order to manage this structured
data. For the most part, though, this type of data is going to be managed
with legacy solutions of analytics because it is already well organized and
we do not need to go through and make adjustments and changes to the data
at all. This can save a lot of time and hassle in the process and ensures that we can get the data we want, working the way that we want.
Of course, even with some of the rapid rise that we see with new sources of
data, companies are still going to work at dipping into the stores of
structured data that they have. This helps them to produce higher-quality
insights, ones that are easier to gather and will not be as hard to look
through the model for insights either. These insights are going to help the
company learn some of the new ways that they can run their business.
While companies that are driven by data all over the world have been able
to analyze this structured data for a long period of time, over many decades,
they are just now starting to really take some of the new and emerging
sources of data as seriously as they should. The good news with this one
though is that it is creating a lot of new opportunities in their company, and
helping them to gain some of the momentum and success that they want.
Even with all of the benefits that come with structured data, this is often not
the only source of data that companies are going to rely on. First off,
finding this kind of data can take a lot of time and can be a waste if you
need to get the results in a quick and efficient manner. Collecting structured
data is something that takes some time, simply because it is so structured
and organized.
Another issue that we need to watch out for when it comes to structured
data is that it can be more expensive. It takes someone a lot of time to sort
through and organize all of that data. And while it may make the model that
we are working on more efficient than other forms, it can often be
expensive to work with this kind of data. Companies need to balance their
cost and benefit ratio here and determine if they want to use any structured
data at all, and if they do, how much of this structured data they are going
to add to their model.
Unstructured Data
The next option of data that we can look at is known as unstructured data.
This kind of data is a bit different than what we talked about before, but it is
really starting to grow in influence as companies are trying to find ways to
leverage the new and emerging data sources. Some companies choose to
work with just unstructured data on their own, and others choose to do
some mixture of unstructured data and structured data. This provides them
with some of the benefits of both and can really help them to get the
answers they need to provide good customer service and other benefits to
their business.
There are many sources from which we are able to get this kind of data, but mainly it comes from streaming data. This streaming data comes in from
mobile applications, social media platforms, location services, and the
Internet of Things. Since the diversity that is there among unstructured
sources of data is so prevalent, and it is likely that those businesses who
choose to use unstructured data will rely on many different sources,
businesses may find that it is harder to manage this data than it was with
structured data.
Because of this trouble with managing the unstructured data, there are many
times when a company will be challenged by this data, in ways that they
weren’t in the past. And many times, they have to add in some creativity in
order to handle the data and to make sure they are pulling out the relevant
data, from all of those sources, for their analytics.
The growth and the maturation of things known as data lakes, and even the
platform known as Hadoop, are going to be a direct result of the expanding
collection of unstructured data. The traditional environments that were used
with structured data are not going to cut it at this point, and they are not
going to be a match when it comes to the unstructured data that most
companies want to collect right now and analyze.
Because it is hard to handle the new sources and types of data, we can’t use
the same tools and techniques that we did in the past. Companies who want
to work with unstructured data have to pour additional resources into
various programs and human talent in order to handle the data and actually
collect relevant insights and data from it.
The lack of any structure that is easily defined inside of this type of data can
sometimes turn businesses away from this kind of data in the first place.
But there really is a lot of potential hidden in that data. We just
need to learn the right methods to use to pull that data out. The unstructured
data is certainly going to keep the data scientist busy overall because they
can’t just take the data and record it in a data table or a spreadsheet. But
with the right tools and a specialized set of skills to work with, those who
are trying to use this unstructured data to find the right insights, and are
willing to make some investments in time and money, will find that it can
be so worth it in the end.
Both of these types of data, the structured and the unstructured, are going to
be so important when it comes to the success you see with your business.
Sometimes our project just needs one or the other of these data types, and
other times it needs a combination of both of them.
For a company to reach success though, they need to be able to analyze, in a
proper and effective manner, all of their data, regardless of the type of the
source. Given the experience that the enterprise has with data, it is not a big
surprise that all of this buzz already surrounds data that comes from sources
that may be seen as unstructured. And as new technologies begin to surface
that can help enterprises of all sizes analyze their data in one place, it is
more important than ever for us to learn what this kind of data is all about,
and how to combine it with some of the more traditional forms of data,
including structured data.
WHY PYTHON FOR DATA ANALYSIS?
The next thing that we need to spend some of our time on in this guidebook
is the Python language. There are a lot of options that you can choose when
working on your own data analysis, and bringing out all of these tools can
really make a big difference in how much information you are able to get
out of your analysis. But if you want to pick a programming language that
is easy to learn, has a lot of power, and can handle pretty much all of the
tasks that you need to handle with data analysis and machine learning, then
Python is the choice for you. Let’s dive into the Python language a little bit
and see how this language can be used to help us see some great results
with our data analysis.
The process of data visualization is going to help us change up the way that
we can work with the data that we are using. Data analysis is supposed to
respond to any issues that are found in the company in a faster manner than
ever before.
And they need to be able to dig through and find more insights as well, look
at data in a different manner, and learn how to be more imaginative and
creative in the process. This is exactly something that data visualization is
able to help us out with.
How Can We Use Data Visualization?
The next thing that we need to take some time to look at is how companies
throughout many industries are able to use data visualization for their own
needs. No matter the size of the company or what kind of industry they are
in, it is possible to use some of the basics of data visualization in order to
help make more sense of the data at hand. And there are a variety of ways
that this data visualization will be able to help you succeed.
The first benefit that we can look at is the fact that these visuals are going to
be a great way for us to comprehend the information that we see in a faster
fashion. When we are able to use a graphical representation of all that data
on our business, rather than reading through charts and spreadsheets, we
will be able to see these large amounts of data in a clear and cohesive
manner.
It is much easier to go through all of that information and see what is found
inside, rather than having to try and guess and draw the conclusions on our
own.
And since it is often much faster for us to analyze this kind of information
in a graphical format, rather than analyzing it on a spreadsheet, it becomes
easier for us to understand what is there. When we are able to do it in this
manner, it is so much easier for a business to address problems or answer
some of their big questions in a timely manner so that things are fixed
without issue or without having to worry about more damage.
The second benefit that comes with using data visuals to help out with the
process of data science is that they can really make it easy to pinpoint some
of the emerging trends that we need to focus on. These trends are within the data, and we are going to be able to find them even if we just read
through the spreadsheets and the documents.
But this takes a lot of time, can be boring, and often it is hard for us to
really see these correlations and relationships, and we may miss out on
some of the more important information that we need.
Using the idea of these visuals to handle the data, and to discover trends,
whether this is the trends just in the individual business or in the market as a
whole, can really help to ensure that your business gains some big
advantages over others in your competition base. And of course, any time
that you are able to beat out the competition, it is going to positively affect
your bottom line.
When you use the right visual to help you get the work done, it is much
easier to spot some of the outliers that are present, the ones that are more
likely to affect the quality of your product, the customer churn, or other
factors that will change your business. In addition, it is going to help you to
address issues before they are able to turn into much bigger problems that
you have to work with.
Next on the list is that these visuals are going to be able to help you identify
some relationships and patterns that are found in all of that data that you are
using. Even with extensive amounts of data that is complicated, we can find
that the information starts to make more sense when it is presented in a
graphic format, rather than in just a spreadsheet or another format.
With the visuals, it becomes so much easier for a business to recognize
some of the different parameters that are there and how these are highly
correlated with one another. Some of the correlations that we are able to see
within our data are going to be pretty obvious, but there are others that
won’t be as obvious. When we use these visuals to help us find and know
about these relationships, it is going to make it much easier for our business
to really focus on the areas that are the most likely to influence some of our
most important goals.
We may also find that working with these visuals can help us to find some
of the outliers in the information that is there. Sometimes these outliers
mean nothing. If you are looking at the charts and graphs and find just a
few random outliers that don’t seem to connect with each other, it is best to
cut these out of the system and not worry about them.
But there are times when these outliers are going to be important and we
should pay more attention to them.
If you are looking at some of the visuals that you have and you notice that
there are a substantial amount of them that fall in the same area, then you
will need to pay closer attention. This could be an area that you can focus
your attention on to reach more customers, a problem that could grow into a
major challenge if you are not careful, or something else that you need to
pay some attention to.
These visuals can also help us to learn more about our customers. We can
use them to figure out where our customers are, what kinds of products our
customers would be the happiest with, how we can provide better services
to our customers, and more. Many companies decide to work with data
visualization to help them learn more about their customers and to ensure
that they can really stand out from the crowd with the work they do.
And finally, we need to take a look at how these visuals are a great way to
communicate a story to someone else. Once your business has had the time
to uncover some new insights from visual analytics, the next step here is to
communicate some of those insights to others. It isn’t going to do you much
good to come up with all of those insights, and then not actually show them
to the people responsible for key decisions in the business.
Now, we could just hand these individuals, the ones who make some of the
big decisions, the spreadsheets and some of the reports that we have. And
they will probably be able to learn a lot of information from that. But this is
not always the best way to do things.
Instead, we need to make sure that we set things up with the help of a
visual, ensuring that these individuals who make the big decisions can look
it over and see some of the key relationships and information at a glance.
Using graphs, charts, and some of the other visuals that are really impactful
as a representation of our data is going to be so important in this step
because it is going to be engaging and can help us to get our message across
to others in a faster manner than before.
As we can see, there are a lot of benefits that come in when we talk about
data visualizations and all of the things that we are able to do with them.
Being able to figure out the best kind of visualization that works for your
needs, and ensuring that you can actually turn that data into a graph or chart
or another visualization is going to be so important when it is time to work
with your data analysis.
We can certainly do the analysis without data visualization. But when it
comes to showcasing the findings in an attractive and easy to understand
format, nothing is going to be better than data visualization.
Once you have been able to go through and answer all of the initial
questions that we had about the data type that we would like to work with,
and you know what kind of audience is going to be there to consume the
information, it is time for us to make some preparations for the amount of
data that we plan to work with in this process.
Keep in mind here that big data is great for many businesses and is often
necessary to make data science work. But it is also going to bring in a few
new challenges to the visualization that we are doing. Large volumes,
varying velocities, and different varieties are all going to be taken into
account with this one.
Plus, data is often going to be generated at a rate that is much faster than it
can be managed and analyzed so we have to figure out the best way to deal
with this problem.
There are factors that we need to consider in this process as well, including
the cardinality of the columns that we want to be able to work with.
We have to be aware of whether there is a high level of cardinality in the
process or a low level. If we are dealing with high cardinality, this is a sign
that we are going to have a lot of unique values in our data. A good
example of this would include bank account numbers since each individual
would have a unique account number.
Then it is possible that your data is going to have a low cardinality. This
means that the column of data that you are working with will come with a
large percentage of repeat values. This is something that we may notice
when it comes to the gender column on our system. The algorithm is going
to handle the amount of cardinality, whether it is high or low, in a different
manner, so we always have to take this into some consideration when we do
our work.
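One quick way to check the cardinality of each column in practice is pandas' nunique() method; the data frame and column names below are invented for illustration.
In []: import pandas as pd
df = pd.DataFrame({'account_number': [1001, 1002, 1003, 1004],
                   'gender': ['F', 'M', 'F', 'F']})
df.nunique()    # unique values per column
Out[]: account_number    4
gender            2
dtype: int64
A column whose count of unique values is close to the number of rows (like the account numbers) has high cardinality, while a column with only a handful of repeated values (like gender) has low cardinality.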
Exercise:
Try to create a larger array, and use these
indexing techniques to grab certain elements from the
array. For example, here is a larger array:
In []: # 5 x 10 array of even numbers between 0 and 100
large_array = np.arange(0,100,2).reshape(5,10)
large_array
# show
Out[]: array([[ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18],
[20, 22, 24, 26, 28, 30, 32, 34, 36, 38],
[40, 42, 44, 46, 48, 50, 52, 54, 56, 58],
[60, 62, 64, 66, 68, 70, 72, 74, 76, 78],
[80, 82, 84, 86, 88, 90, 92, 94, 96, 98]])
Tip:
Try grabbing single elements and rows from random arrays you
create. After getting very familiar with this, try selecting columns.
The point is to try as many combinations as possible to get you
familiar with the approach. If the slicing and indexing notations are
confusing, try to revisit the section under list or string slicing and
indexing.
To revisit the examples on slicing, see the earlier section on list indexing.
Conditional selection
Consider a case where we need to extract certain values from an array that
meet a Boolean criterion. NumPy offers a convenient way of doing this
without having to use loops.
Example 3:
Using conditional selection
Consider this array of odd numbers between 0 and 20. Assuming we need to grab the elements above 11, we first have to create the conditional array that selects them:
In []: odd_array = np.arange(1,20,2)    # Vector of odd numbers
odd_array                               # Show vector
bool_array = odd_array > 11             # Boolean conditional array
bool_array
Out[]: array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19])
Out[]: array([False, False, False, False, False, False, True, True, True,
True])
Notice how the bool_array evaluates to True at all instances where the
elements of the odd_array meet the Boolean criterion.
The Boolean array itself is not usually so useful. To return the values that we need, we pass bool_array into the original array to get our results.
In []: useful_Array = odd_array[bool_array]
# The values we want
useful_Array
Out[]: array([13, 15, 17, 19])
Now, that is how to grab elements using conditional selection. There is
however a more compact way of doing this. It is the same idea, but it
reduces typing.
Instead of first declaring a Boolean_array to hold our truth values, we just
pass the condition into the array itself, like we did for useful_array.
In []: # This code is more compact
compact = odd_array[odd_array>11]
# One line
compact
Out[]: array([13, 15, 17, 19])
See how we achieved the same result with just two lines? It is
recommended to use this second method, as it saves coding time and
resources. The first method helps explain how it all works. However, we
would be using the second method for all other instances in this book.
Exercise: Conditional selection works on all arrays (vectors and matrices alike). Create a 3 x 3 array of the elements greater than 80 from the 'large_array' given in the last exercise.
Hint: use the reshape method to convert the resulting array into a 3 x 3 matrix.
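One possible solution, given the large_array defined in the previous exercise, is sketched below.
In []: large_array[large_array > 80].reshape(3,3)    # conditional selection, then reshape
Out[]: array([[82, 84, 86],
       [88, 90, 92],
       [94, 96, 98]])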
NumPy Array Operations
Finally, we will be exploring basic arithmetical operations with NumPy
arrays. These operations are not unlike that of integer or float Python lists.
Array – Array Operations
In NumPy, arrays can operate with and on each other using various
arithmetic operators. Things like the addition of two arrays, division, etc.
Example 4:
In []: # Array - Array Operations
# Declaring two arrays of 10 elements
Array1 = np.arange(10).reshape(2,5)
Array2 = np.random.randn(10).reshape(2,5)
Array1;Array2 # Show the arrays
# Addition
Array_sum = Array1 + Array2
Array_sum # show result array
#Subtraction
Array_minus = Array1 - Array2
Array_minus # Show array
# Multiplication
Array_product = Array1 * Array2
Array_product # Show
# Division
Array_divide = Array1 / Array2
Array_divide # Show
Out[]: array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
array([[ 2.09122638,  0.45323217, -0.50086442,  1.00633093,  1.24838264],
       [ 1.64954711, -0.93396737,  1.05965475,  0.78422255, -1.84595505]])
array([[2.09122638, 1.45323217, 1.49913558, 4.00633093, 5.24838264],
       [6.64954711, 5.06603263, 8.05965475, 8.78422255, 7.15404495]])
array([[-2.09122638,  0.54676783,  2.50086442,  1.99366907,  2.75161736],
       [ 3.35045289,  6.93396737,  5.94034525,  7.21577745, 10.84595505]])
array([[  0.        ,   0.45323217,  -1.00172885,   3.01899278,   4.99353055],
       [  8.24773555,  -5.60380425,   7.41758328,   6.27378038, -16.61359546]])
array([[ 0.        ,  2.20637474, -3.99309655,  2.9811267 ,  3.20414581],
       [ 3.03113501, -6.42420727,  6.60592516, 10.20118591, -4.875525  ]])
Each of the arithmetic operations performed is element-wise. The division operations, however, require extra care. In Python, most arithmetic errors in code throw a run-time error, which helps in debugging. With NumPy, however, the code may still run, with only a warning issued.
Array – Scalar operations
NumPy also supports scalar operations with arrays. A scalar in this context is just a single numeric value of either integer or float type. The scalar-array operations are also element-wise, by virtue of the broadcasting feature of NumPy arrays.
Example 5:
In []: # Scalar - Array Operations
new_array = np.arange(0,11)      # Array of values from 0-10
print('New_array')
new_array                        # Show
Sc = 100                         # Scalar value

# let us make an array with a range from 100 - 110 (using +)
add_array = new_array + Sc       # Adding 100 to every item
print('\nAdd_array')
add_array                        # Show

# Let us make an array of 100s (using -)
centurion = add_array - new_array
print('\nCenturion')
centurion                        # Show

# Let us do some multiplication (using *)
multiplex = new_array * 100
print('\nMultiplex')
multiplex                        # Show

# division [take care], let us deliberately generate
# an error. We will do a divide by zero.
err_vec = new_array / new_array
print('\nError_vec')
err_vec                          # Show
New_array
Out[]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Add_array
Out[]: array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])
Centurion
Out[]: array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100])
Multiplex
Out[]: array([ 0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])
Error_vec
C:\Users\Oguntuase\Anaconda3\lib\site-
packages\ipykernel_launcher.py:27:
RuntimeWarning: invalid value encountered in true_divide
array([nan, 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Notice the runtime warning generated? It was caused by the division of the first element of new_array by itself, i.e. 0/0. This would raise a ZeroDivisionError in a normal Python environment and the code would not run. NumPy, however, ran the code and marked the 0/0 result in the Error_vec array as 'nan' (not-a-number). The same goes for values that evaluate to infinity, which are represented by '+/- inf' (try 1/0 using a NumPy array-scalar or array-array operation).
Tip:
Always take caution when using division to avoid such runtime
errors that could later bug your code.
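As a quick sketch of the infinity case mentioned above, dividing a NumPy array by zero runs with a warning and returns inf rather than raising an error:
In []: np.array([1, 2, 3]) / 0    # issues a RuntimeWarning: divide by zero
Out[]: array([inf, inf, inf])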
Universal Array functions
These are built-in functions designed to operate in an element-wise fashion on NumPy arrays. They include mathematical, comparison, trigonometric, Boolean, and other operations, and they are called using the np.function_name(array) syntax.
Example 6
: A few Universal Array functions (U-Func)
In []: # Using U-Funcs
U_add = np.add(new_array,Sc)               # addition
U_add                                      # Show
U_sub = np.subtract(add_array,new_array)   # subtraction
U_sub                                      # Show
U_log = np.log(new_array)                  # Natural log
U_log                                      # Show
sinusoid = np.sin(new_array)               # Sine wave
sinusoid                                   # Show
# Alternatively, we can use the .method
new_array.max()                            # find maximum
np.max(new_array)                          # same thing
Out[]: array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])
Out[]: array([100, 100, 100, 100, 100, 100, 100, 100, 100,
100, 100])
C:\Users\Oguntuase\Anaconda3\lib\site-
packages\ipykernel_launcher.py:8: RuntimeWarning: divide by
zero encountered in log
Out[]: array([ -inf, 0. , 0.69314718, 1.09861229, 1.38629436,
1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458,
2.30258509])
Out[]: array([ 0. , 0.84147098, 0.90929743, 0.14112001, -0.7568025
, -0.95892427, -0.2794155 , 0.6569866
, 0.98935825, 0.41211849, -0.54402111])
Out[]: 10
Out[]: 10
There are still many more functions available, and a full reference can be
found in the NumPy documentation for Universal functions here:
https://docs.scipy.org/doc/numpy/reference/ufuncs.html
Now that we have explored NumPy for creating arrays, we will consider the Pandas framework for manipulating these arrays and organizing them into data frames.
Pandas
This is an open-source library that extends the capabilities of NumPy. It supports data cleaning and preparation, with fast analysis capabilities. It is rather like working with Microsoft Excel, but within Python. Unlike NumPy, it has its own built-in visualization features and can work with data from a variety of sources. It is one of the most versatile packages for data science with Python, and we will be exploring how to use it effectively.
To use pandas, make sure it is currently part of your installed packages by verifying with the conda list command. If it is not installed, then you can install it using the conda install pandas command; you need an internet connection for this.
Now that Pandas is available on your PC, you can start working with the
package. First, we start with the Pandas series.
Series
This is an extension of the NumPy array. It has a lot of similarities, but with
a difference in indexing capacity. NumPy arrays are only indexed via
number notations corresponding to the desired rows and columns to be
accessed. For Pandas series, the axes have labels that can be used for
indexing their elements. Also, while NumPy arrays, like Python lists, are essentially used for holding numeric data, Pandas series can hold any form of Python data/object.
Example 7
: Let us illustrate how to create and use the Pandas series
First, we have to import the Pandas package into our workspace. We will
use the variable name pd for Pandas, just as we used np for NumPy in the
previous section.
In []: import numpy as np
#importing numpy for use
import pandas as pd # importing the Pandas package
We also imported the numpy package because this example involves a
numpy array.
In []: # python objects for use
labels = ['First','Second','Third']            # string list
values = [10,20,30]                            # numeric list
array = np.arange(10,31,10)                    # numpy array
dico = {'First':10,'Second':20,'Third':30}     # Python dictionary

# create various series
A = pd.Series(values)
print('Default series')
A    # show

B = pd.Series(values,labels)
print('\nPython numeric list and label')
B    # show

C = pd.Series(array,labels)
print('\nUsing python arrays and labels')
C    # show

D = pd.Series(dico)
print('\nPassing a dictionary')
D    # show
Default series
Out[]: 0    10
1    20
2    30
dtype: int64
Python numeric list and label
Out[]: First 10
Second 20
Third 30
dtype: int64
Using python arrays and labels
Out[]: First 10
Second 20
Third 30
dtype: int32
Passing a dictionary
Out[]: First 10
Second 20
Third 30
dtype: int64
We have just explored a few ways of creating a Pandas Series: from a
NumPy array, a Python list, and a dictionary. Notice how the labels
correspond to the values? Also, notice that the dtypes differ. Since the data
is numeric and of integer type, Pandas assigns an appropriate integer dtype
to it. Creating a Series from the NumPy array keeps the array's own dtype
(int32 on the system used here), while the list and the dictionary default to
int64. The difference between 32-bit and 64-bit signed integers is the
corresponding memory allocation: 32 bits requires less memory (4 bytes,
since 8 bits make a byte), while 64 bits requires double that (8 bytes). On
the other hand, 32-bit integers can hold a much smaller range of values than
64-bit integers.
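As a minimal sketch of this (the attributes used are standard NumPy; the exact int32/int64 split you see depends on your operating system), you can inspect the dtype and per-element memory yourself:
In []: array = np.arange(10,31,10)
print(array.dtype) # e.g. int32 on Windows builds, int64 on many Linux builds
print(array.itemsize) # bytes per element: 4 for int32, 8 for int64
print(pd.Series(array).dtype) # the Series keeps the array's dtype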
Pandas series also support the assignment of any data type or object as its
data points.
In []: pd.Series(labels,values)
Out[]: 10 First
20 Second
30 Third
dtype: object
Here, the string elements of the label list are now the data points. Also,
notice that the dtype is now ‘object’.
This kind of versatility in item operation and storage is what makes pandas
series very robust. Pandas series are indexed using labels. This is illustrated
in the following examples:
Example 8:
In []: # series of WWII countries
pool1 = pd.Series([1,2,3,4],['USA','Britain','France','Germany'])
pool1 #show
print('grabbing the first element')
pool1['USA'] # first label index
Out[]: USA 1
Britain 2
France 3
Germany 4
dtype: int64
grabbing the first element
Out[]: 1
As shown in the code above, to grab a Series element, use the same
approach as NumPy array indexing, but pass the label corresponding to that
data point. The data type of the label also matters: notice that the ‘USA’
label was passed as a string to grab the data point 1. If the labels are
numeric, then the indexing looks just like that of a NumPy array. Consider
numeric indexing in the following example:
In []: pool2 = pd.Series(['USA','Britain','France','Germany'],[1,2,3,4])
pool2
#show
print('grabbing the first element')
pool2[1] #numeric indexing
Out[]: 1 USA
2 Britain
3 France
4 Germany
dtype: object
grabbing the first element
Out[]: 'USA'
Tip:
You can easily tell what kind of data a Series holds by checking its
dtype. Notice how the dtypes for pool1 and pool2 differ, even though
both were created from the same lists. The difference is that pool2
holds strings as its data points, while pool1 holds integers (int64).
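A minimal sketch of that check, reusing the pool1 and pool2 Series defined above:
In []: print(pool1.dtype) # int64, because the data points are integers
print(pool2.dtype) # object, because the data points are strings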
Pandas Series can be added together. This works best when the two Series
have matching labels and compatible data points.
Example 9: Adding series
Let us create a third series, ‘pool3’. It is similar to pool1, but ‘Britain’ has
been replaced with ‘USSR’, with a corresponding data point value of 5.
In []: pool3 = pd.Series([1,5,3,4],['USA','USSR','France',
'Germany'])
pool3
Out[]: USA 1
USSR 5
France 3
Germany 4
dtype: int64
Now adding series:
In []:# Demonstrating series addition
double_pool = pool1 + pool1
print('Double Pool')
double_pool
mixed_pool = pool1 + pool3
print('\nMixed Pool')
mixed_pool
funny_pool = pool1 + pool2
print('\nFunny Pool')
funny_pool
Double Pool
Out[]: USA 2
Britain 4
France 6
Germany 8
dtype: int64
Mixed Pool
Out[]: Britain NaN
France 6.0
Germany 8.0
USA 2.0
USSR NaN
dtype: float64
Funny Pool
C:\Users\Oguntuase\Anaconda3\lib\site-
packages\pandas\core\indexes\base.py:3772: RuntimeWarning: '<'
not supported between instances of 'str' and 'int', sort order is
undefined for incomparable objects
return this.join(other, how=how, return_indexers=return_indexers)
Out[]: USA NaN
Britain NaN
France NaN
Germany NaN
1 NaN
2 NaN
3 NaN
4 NaN
dtype: object
When two Series are added, the data point values of matching labels (or
indexes) are summed. A NaN is returned wherever the labels do not match.
Notice the difference between mixed_pool and funny_pool: in mixed_pool,
some labels match, and their values are added together. In funny_pool, no
labels match and the data points are of different types, so a warning message
is printed and the output is a vertical concatenation of the two Series with
NaN data points.
Tip:
As long as two Series contain the same labels and data points of
the same type, basic array operations such as addition and subtraction
can be performed. The order of the labels does not matter; values are
aligned by label and combined according to the operator used. To
fully grasp this, try running variations of the examples given above.
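A minimal sketch of that label alignment, reusing pool3 from above (the reordered Series here is introduced purely for illustration):
In []: reordered = pd.Series([4,3,1,5],['Germany','France','USA','USSR'])
pool3 - reordered # values are matched by label, not by position
# every label lines up, so each difference is 0 despite the different ordering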
Chapter 11 - Common Debugging Commands
Starting
Debugging begins by launching the debugger on its target. For the Python
debugger, this means typing the name of the debugger module followed by
the name of the file, object, or program to debug (for example, python -m pdb
followed by the script name). Inside the debugging tool, a prompt appears,
offering several commands from which to choose in order to make the
necessary corrections.
Running
The commands used here are ‘[!]statement’, which executes a single
statement in the context of the program being debugged, and ‘r(un)’, which
runs the program up to the intended lines so that any errors can be identified.
Arguments for the program are supplied after the command. For example, if
the application is named ‘prog1’, a command such as “r prog1 < infile” runs
it with its standard input redirected from the named file.
Breakpoints
As essential components of debugging, breakpoints use the command
‘b(reak) [[filename:]lineno|function[, condition]]’ to make the debugger
stop when program execution reaches that point. When execution hits a
breakpoint, the process is suspended and the debugger prompt appears on
the screen. This gives you time to inspect variables and identify any errors
or mistakes that might affect the process. Breakpoints can therefore be set
to halt at any line number, or at any function name, that program execution
passes through.
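A minimal sketch of setting breakpoints at the (Pdb) prompt; the file name my_script.py, the function process, and the condition total > 100 are hypothetical and shown only for illustration:
(Pdb) b my_script.py:12 # stop whenever line 12 of my_script.py is reached
(Pdb) b process, total > 100 # stop inside process() only when the condition holds
(Pdb) b # with no argument, list all breakpoints currently set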
Back Trace
Backtrace is issued with the command ‘bt’ (‘w(here)’ in the Python
debugger) and prints the list of pending function calls at the point where the
program stopped. Backtrace commands are only meaningful while execution
is suspended at a breakpoint, or after the program has exited abnormally
with a runtime error (the kind of crash called a segmentation fault in
lower-level languages). This form of debugging is most useful after such a
crash, because the chain of pending function calls points to the source of the
error.
Printing
Printing is primarily used to examine the value of variables or expressions
in the current frame. In the Python debugger this uses the command
‘p expression’ (or ‘pp’ for pretty-printing), and it is useful after the running
program has been stopped at a breakpoint or by a runtime error. Any legal
Python expression can be evaluated, including function calls. Besides
printing, resuming execution after a breakpoint or runtime error uses the
command ‘c(ont(inue))’.
Single Step
Single stepping uses the commands ‘s(tep)’ and ‘n(ext)’ after a breakpoint to
move through the source lines one at a time. The two commands behave
differently: ‘step’ descends into function calls and executes them line by
line, while ‘next’ executes a function call in one go and stops at the next line
of the current function. Running the program line by line in this way is often
the most effective way to trace errors during execution.
Trace Search
With the commands ‘u(p)’ and ‘d(own)’, you can move up and down the
stack of pending function calls. This lets you examine the variables at
different levels of the call stack, so you can readily seek out mistakes and
eliminate errors with the debugging tool.
File Select
Another basic debugger command is file select, which uses ‘l(ist) [first[,
last]]’ to display lines of the currently selected source file. Many programs
are made up of two or more source files, especially in more complex
projects, and debugging tools are particularly valuable in such cases. The
debugger should be pointed at the main source file so that breakpoints and
runtime errors can be examined against the lines that are actually executing.
With Python, the relevant source file can be readily selected and treated as
the working file.
Alias
Alias debugging entails creating an alias name that executes a command;
the command must not be enclosed in single or double quotes. The syntax is
‘alias [alias [command]]’. Replaceable parameters can appear in the
command and are substituted with the arguments given when the alias is
invoked. If no command is supplied, the current definition of the named
alias is shown; with no arguments at all, every defined alias is listed. An
alias may consist of anything that can legally be typed at the pdb prompt.
Python Debugger
In the Python programming language, the module pdb defines an interactive
source-code debugger. It supports setting (conditional) breakpoints, single
stepping at the source line level, listing source code, inspecting stack
frames, and evaluating arbitrary Python code in the context of any frame.
Post-mortem debugging is also supported, and the debugger can be called
under program control. The debugger is extensible; it is defined as the class
Pdb, and the interface uses the bdb and cmd modules.
The pdb prompt is used to run programs under the control of the debugging
tools; for instance, pdb can be invoked as a script (python -m pdb followed
by the script name) to debug other programs. It can also be entered after a
crash to examine the program post mortem. Some of the functions the
module provides are run(statement[, globals[, locals]]) for running a Python
statement under the debugger and runeval(expression[, globals[, locals]])
for evaluating an expression. There are several other functions, not
mentioned here, for executing Python programs under debugger control.
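A minimal sketch of these entry points (the function buggy and the script name my_script.py are hypothetical; pdb.run, pdb.set_trace, and the python -m pdb invocation are the standard ones described above):
In []: import pdb

def buggy(n):
    total = 0
    for i in range(n):
        total += i * i # suppose we suspect this accumulation is wrong
    return total

# run a statement under debugger control; pdb stops before the first line executes
pdb.run('buggy(5)')

# alternatively, pause at an exact spot inside your own code:
# pdb.set_trace() # execution stops here and the (Pdb) prompt appears

# a whole script can also be debugged from the operating system shell with:
# python -m pdb my_script.py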
Debugging Session
Debugging in Python is usually a repetitive process: you write code, run it,
find that it does not work, bring in the debugging tools, fix the errors, and
repeat the cycle again and again. Because a debugging session tends to
reuse the same techniques, there are some key points worth noting. The
sequence below streamlines the process and minimizes the repetition seen
during program development; a short sketch of such a session follows the
list.
Set breakpoints
Run the program under the relevant debugging tool
Check variable values and compare them with what the function is
expected to produce
When everything looks correct, either resume the program or wait
for the next breakpoint and repeat if need be
When something is wrong, determine the source of the problem,
alter the offending line of code, and begin the process once more
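A minimal sketch of that cycle, assuming a hypothetical function average that fails on an empty list:
In []: import pdb

def average(values):
    return sum(values) / len(values) # raises ZeroDivisionError when values is empty

pdb.set_trace() # step 1-2: breakpoint; the (Pdb) prompt appears here
result = average([]) # step into this with 's', then inspect the argument with 'p values'
# step 3-5: compare what you see with what you expect, continue with 'c' when it
# looks right, or fix the offending line and run the whole program again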
Ask Question
If you know developers who use Python or other platforms, ask them
questions about debugging, since they are likely to use these tools heavily.
If you are just beginning and have no one to ask, go online and find forums,
of which there are many today. Interact with them by seeking answers to
your debugging problems, and experiment with the programs you create
while using the debugger tools. Avoid making assumptions about any part
of your Python program, especially when debugging, as assumptions can
lead to failures in program development.
Be Clever
When you create programs and eliminate errors with the help of debuggers,
the outcome can feel exciting, even overwhelming. Stay level-headed,
though, and keep an eye on your current work as well as your future
projects. Successfully building a realistic and useful program does not mean
you will never fail again. Remaining in control will prepare you to use
Python's debugging tools wisely and build on your accomplishments.
Chapter 12 - Neural Network and What to Use
for?
Regular deep neural networks commonly receive a single vector as an input
and then transform it through a series of multiple hidden layers. Every
hidden layer in regular deep neural networks, in fact, is made up of a
collection of neurons in which every neuron is fully connected to all
contained neurons from the previous layers. In addition, all neurons
contained in a deep neural network are completely independent as they do
not share any relations or connections.
The last fully-connected layer in regular deep neural networks is called the
output layer and in every classification setting, this output layer represents
the overall class score.
Due to these properties, regular deep neural nets do not scale well to full
images. For instance, in CIFAR-10, every image has a size of 32x32x3,
meaning each image has 3 color channels and is 32 pixels wide and 32
pixels high. A single fully-connected neuron in the first hidden layer of a
regular neural net would therefore have 32x32x3 = 3,072 weights. That
number is still manageable for such a small image, but fully-connected
structures clearly do not scale to larger images.
In addition, you would almost certainly want more than one such neuron, so
the parameters add up quickly. For computer vision and similar problems,
this full connectivity is wasteful, and the huge number of parameters would
lead to over-fitting of your model very quickly. Convolutional neural
networks therefore take advantage of the fact that their inputs consist of
images when solving these kinds of deep learning problems.
Due to their structure, convolutional neural networks constrain the
architecture of images in a much more sensible way. Unlike a regular deep
neural network, the layers contained in the convolutional neural network are
comprised of neurons that are arranged in three dimensions including depth,
height, and width. For instance, the CIFAR-10 input images form the input
volume of activations for the first layer, and this volume has dimensions of
32x32x3 (width, height, and depth).
The neurons in such a layer are connected to only a small region of the
layer before it, instead of being fully connected as in regular deep neural
networks. In addition, the final output layer for CIFAR-10 would have
dimensions of 1x1x10, because by the end of the convolutional network
architecture the full image has been reduced to a single vector of class
scores arranged along the depth dimension.
To summarize, unlike a regular three-layer deep neural network, a ConvNet
arranges its neurons in three dimensions. Each layer in a convolutional
neural network transforms a 3D input volume into a 3D output volume of
neuron activations.
Every layer of a convolutional neural network has a simple API: it
transforms an input 3D volume into an output 3D volume with a
differentiable function that may or may not have parameters.
A convolutional neural network is composed of several convolutional and
subsampling layers that are at times followed by fully-connected (dense)
layers. As you already know, the input of a convolutional neural network is
an n x n x r image, where n represents the height and width of the input
image and r is the number of channels present. A convolutional layer
additionally has k filters, known as kernels, of size m x m x q, where m is
smaller than the image dimension and q can be equal to, or smaller than, the
number of channels r.
Each feature map is then subsampled with max or mean pooling over a
p x p contiguous region, where p commonly ranges from 2 for small images
up to 5 or more for larger images. Either before or after the subsampling
layer, an additive bias and a sigmoidal non-linearity are applied to every
feature map. After these convolutional layers there may be several
fully-connected layers, whose structure is the same as that of standard
multilayer neural networks.
Parameter Sharing
You can use a parameter-sharing scheme in your convolutional layers to
drastically control the number of parameters. If you denote a single two-
dimensional slice of the depth as a depth slice, you can constrain the
neurons in each depth slice to use the same weights and bias. With
parameter sharing, you get one unique collection of weights per depth slice
rather than one per neuron, which significantly reduces the number of
parameters in the first layer of your ConvNet. With this step, all neurons in
a given depth slice of your ConvNet use the same parameters.
In other words, during backpropagation every neuron in the volume will
compute the gradient for its weights as usual. However, these gradients are
added up across each depth slice, so only a single collection of weights per
depth slice is updated. Note that all neurons in one depth slice use the exact
same weight vector. Therefore, the forward pass of the convolutional layer
in each depth slice can be computed as a convolution of the neurons’
weights with the input volume. This is why we refer to the resulting
collection of weights as a kernel or a filter, which is convolved with the
input.
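A minimal sketch of what this sharing saves, using the 227x227x3 input with 96 filters of size 11x11x3 at a stride of 4 that also appears in the matrix multiplication example below:
In []: # output spatial size: (227 - 11) / 4 + 1 = 55, with 96 depth slices
neurons = 55 * 55 * 96 # 290,400 neurons in the layer
weights_per_neuron = 11 * 11 * 3 # 363 weights (plus 1 bias) each
without_sharing = neurons * (weights_per_neuron + 1) # roughly 105 million parameters
with_sharing = 96 * (weights_per_neuron + 1) # 34,944 parameters, one weight set per depth slice
print(without_sharing, with_sharing)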
However, there are a few cases in which this parameter-sharing assumption
does not make sense. This is commonly the case when the input images to a
convolutional layer have a specific centered structure, where completely
different features should be learned depending on the location in the image.
For instance, when you have an input of several faces which have been
centered in your image, you probably expect to get different hair-specific or
eye-specific features that could be easily learned at many spatial locations.
When this is the case, it is very common to just relax this parameter sharing
scheme and simply use a locally-connected layer.
Matrix Multiplication
The convolution operation essentially performs dot products between the
filters and local regions of the input. A common implementation technique
for convolutional layers is to take advantage of this fact and formulate the
forward pass of a convolutional layer as one large matrix multiply.
The implementation starts by stretching the local regions of the input image
out into separate columns, in an operation commonly called im2col. For
instance, if you have an input of size 227x227x3 and you convolve it with
filters of size 11x11x3 at a stride of 4, you take blocks of pixels of size
11x11x3 from the input and stretch each block into a column vector of size
11*11*3 = 363. Iterating this process over the input at a stride of 4 gives 55
locations along both the width and the height, leading to an output matrix
X_col in which every column is a stretched-out receptive field, with
55*55 = 3,025 such columns in total.
Note that since the receptive fields overlap, a number in the input volume
may be duplicated across multiple distinct columns. Also remember that the
weights of the convolutional layer are similarly stretched out into rows. For
instance, if you have 96 filters of size 11x11x3, you get a matrix W_row of
size 96x363.
The result of the convolution is then equivalent to performing one large
matrix multiply of W_row with X_col, which evaluates the dot product
between every filter and every receptive field, giving the output of every
filter at every location. Once you have the result, you must reshape it back
to its proper output dimension, which in this case is 55x55x96.
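A minimal NumPy sketch of those shapes (random data stands in for the real image and filters, since the point here is only the single large matrix multiply and the reshape):
In []: import numpy as np
W_row = np.random.randn(96, 363) # each filter stretched out into a row
X_col = np.random.randn(363, 3025) # each receptive field stretched out into a column
out = W_row.dot(X_col) # one large matrix multiply, shape (96, 3025)
out = out.reshape(96, 55, 55).transpose(1, 2, 0) # reshape back to 55x55x96
# out.shape is now (55, 55, 96)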
This is a great approach, but it has a downside: it can use a lot of memory,
since values in the input volume are replicated multiple times in X_col. The
benefit is that there are many very efficient implementations of matrix
multiplication that can be leveraged. In addition, the same im2col idea can
be reused when performing the pooling operation.
Conclusion
Thank you for making it through to the end! The next step is to start putting
the information and examples that we talked about in this guidebook to
good use. There is a lot of information inside all that data that we have been
collecting for some time now. But all of that data is worthless if we are not
able to analyze it and find out what predictions and insights are in there.
This is what the process of data science is all about, and when it is combined
with the Python language, we can achieve some amazing results.
This guidebook took some time to explore data science and what it entails.
It is an in-depth and complex process, one that often includes more steps
than data scientists expect when they first get started. But if a business
wants to learn the insights that are in its data and gain a competitive edge, it
needs to be willing to take on these steps of data science and make them
work for its needs.
This guidebook went through all of the steps that you need to know in order
to get started with data science and some of the basic parts of the Python
code. We can then put all of this together in order to create the right
analytical algorithm that, once it is trained properly and tested with the right
kinds of data, will work to make predictions, provide information, and even
show us insights that were never possible before. And all that you need to
do to get this information is to use the steps that we outline and discuss in
this guidebook.
There are so many great ways that you can use the data you have been
collecting for some time now, and being able to complete the process of
data visualization will ensure that you get it all done. When you are ready to
get started with Python data science, make sure to check out this guidebook
to learn how.
Loops are next on the list of topics to explore when working with Python.
They are a great way to clean up your code: you can pack a great deal of
processing into a program without writing out every line by hand. For
example, if you wanted a program that counts out all the numbers from one
to one hundred, you would not want to write out a hundred lines of code.
Likewise, writing a multiplication table line by line would take forever. A
loop gets all of this done in just a few lines of code, saving a lot of time and
typing; a short sketch follows this paragraph.
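A minimal sketch of both ideas, counting to one hundred and printing a small multiplication table (the table size of 5 is chosen only for illustration):
In []: # count from one to one hundred with a single loop
for number in range(1, 101):
    print(number)

# a small multiplication table, here 1 through 5
for row in range(1, 6):
    for col in range(1, 6):
        print(row * col, end=' ')
    print() # start a new line after each row of the table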
You can pack a lot of work into the loops you write, and even then they
remain easy to work with. A loop tells the interpreter to execute the same
block of code over and over again until the condition you set has been
reached. This simplifies the code you are working on while still ensuring
that it behaves the way you want when executed.
As you write these loops, remember to set up the condition you want met
before you ever run the program. If you write a loop without such a
condition, it has no idea when to stop and will keep going on and on; the
code will keep cycling through the loop and can freeze your program. So,
before you execute the code, double-check that the stopping condition is in
place. A short sketch of this appears below.
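A minimal sketch of a loop with an explicit stopping condition (the limit of 5 is arbitrary); removing the counter update would create the endless loop described above:
In []: counter = 0
while counter < 5: # the stopping condition: keep looping only while counter is below 5
    print('counter is', counter)
    counter += 1 # without this update the condition never changes and the loop never stops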
As you work on these loops and create your own Python code, there are a
few loop options available to you. We will spend our time on the three main
kinds of loops that most programmers use, the ones that are the easiest and
most efficient.