Python For Data Analysis - The Ultimate Beginner's Guide To Learn Programming in Python For Data Science With Pandas and NumPy, Master Statistical Analysis, and Visualization (2020)
Matt Foster
© Copyright 2019 - All rights reserved.
The content contained within this book may not be reproduced, duplicated, or transmitted without
direct written permission from the author or the publisher.
Under no circumstances will any blame or legal responsibility be held against the publisher, or
author, for any damages, reparation, or monetary loss due to the information contained within this
book. Either directly or indirectly.
Legal Notice:
This book is copyright protected. This book is only for personal use. You cannot amend, distribute,
sell, use, quote or paraphrase any part, or the content within this book, without the consent of the
author or publisher.
Disclaimer Notice:
Please note the information contained within this document is for educational and entertainment
purposes only. Every effort has been made to present accurate, up-to-date, reliable, and complete
information. No warranties of any kind are declared or implied. Readers acknowledge that the author
is not engaging in the rendering of legal, financial, medical, or professional advice. The content
within this book has been derived from various sources. Please consult a licensed professional before
attempting any techniques outlined in this book.
By reading this document, the reader agrees that under no circumstances is the author responsible for
any losses, direct or indirect, which are incurred as a result of the use of information contained within
this document, including, but not limited to, errors, omissions, or inaccuracies.
Table of Contents
Introduction
Conclusion
Introduction
Collecting, storing, and analyzing vast amounts of data is part of a bank's obligations. Data science applications turn that obligation into an opportunity for banks to learn more about their customers and to drive new revenue, rather than treating the data as a mere compliance exercise. Digital banking is now widely used, and this influx of activity produces terabytes of customer data; isolating the genuinely relevant data is therefore the first task for data scientists. Working from customers' preferences, interactions, and behaviors, data science applications extract the most relevant client information and process it to improve the decision-making of the business.
Personalized marketing
Providing a customized offer that fits the preferences and needs of a particular customer is crucial to success in marketing, and it is now possible to make the right offer on the right device to the right customer at the right time. When launching a new product, target selection identifies potential customers with the help of data science applications. Data scientists build models that predict the probability of a customer responding to an offer or promotion from demographic, purchase-history, and behavioral data. In this way, banks have improved their customer relations, personalized their outreach, and made their marketing more efficient through data science applications.
Drug creation
The process of drug discovery involves many disciplines and is highly complicated. Even the best ideas must pass through enormous amounts of testing, time, and financial expenditure, and getting a drug officially approved can typically take up to twelve years. Data science applications now shorten and simplify the process by adding insight at every stage, from the screening of drug compounds to the prediction of success rates based on biological factors. Using advanced mathematical modeling and simulations rather than lab experiments alone, these applications can forecast how a compound will act in the body. Computational drug discovery models the compound as part of a biologically relevant network, which simplifies the prediction of outcomes with high accuracy.
Industry knowledge
To offer the best possible treatment and improve services, knowledge management in healthcare is vital: it brings together externally generated information and internal expertise. With new technologies appearing and the industry changing rapidly every day, gathering, storing, and distributing different facts effectively is essential. Data science applications help healthcare organizations integrate various sources of knowledge and use them together in the treatment process to achieve progressive results.
Recurrent neural networks and time series forecasting are part of optimizing oil and gas production. Predicting oil rates and gas-to-oil ratios is a significant KPI. Using feature-extraction models, operators can predict bottom-hole pressure, choke, wellhead temperature, and the daily oil rate from data on nearby wells. To predict production decline, they make use of fracture parameters, and for pattern recognition on sucker-rod dynamometer cards, they use neural networks and deep learning.
Downstream optimization
To process gas and crude oil, refineries use massive volumes of water, and there are now systems that tackle water management in the oil and gas industry. In addition, by analyzing distribution data effectively, cloud-based services have increased the speed of modeling for revenue forecasting.
The Internet
Whenever anyone thinks about data science, the first idea that comes to mind is the internet. It is typical to think of Google when we talk about searching for something online, but Bing, Yahoo, AOL, Ask, and others are also search engines. What they all have in common are data science algorithms, which allow them to return results in a fraction of a second when you run a search. Google alone processes more than 20 petabytes of data every day; these search engines are what they are today because of data science.
Targeted advertising
Of all the data science applications, the digital marketing spectrum rivals even the search engines in significance. Data science algorithms decide the distribution of digital billboards and banner displays on different websites, and compared with traditional advertisements, they have helped marketers achieve higher click-through rates. Using a user's behavior, marketers can target them with specific adverts, so that at the same time and in the same place online, one user might see an ad on anger management while another sees an ad on a keto diet.
Website recommendations
This case is familiar to everyone: you see suggestions for similar products on eBay and Amazon. Doing this adds a great deal to the user experience while helping users discover appropriate products from the many available. Leaning on users' relevant information and interests, many businesses promote their products and services with this kind of engine. To improve user experience, internet giants including Google Play, Amazon, and Netflix use this system, deriving recommendations from the results of a user's previous searches.
Speech recognition
Siri, Google Voice, Cortana, and many others are among the best-known speech recognition products. Speech recognition tools make it easy for those who are not in a position to type a message: their speech is converted to text as they speak. The accuracy of speech recognition, however, is not always guaranteed.
Recommendation engine
According to some experts, this concept is one of the most promising and efficient. Some central booking and travel web platforms use recommendation engines in their everyday work, mainly to match the needs and wishes of customers with the available offers. Based on preferences and previous searches, travel and tourism companies that apply data-powered recommendation engine solutions can suggest alternative travel dates, rental deals, new routes, attractions, and destinations. Booking service providers and travel agencies use recommendation engines to offer suitable provisions to all of these customers.
Route optimization
In the travel and tourism industry, route optimization plays a significant
role. It can be quite challenging to account for several destinations and to plan trips, schedules, working hours, and distances. With route optimization, it becomes easier to do some of the following:
Time management
Minimization of the travel costs
Minimization of distance
Data science certainly improves lives and continues to change the face of several industries, giving them the opportunity to provide unique experiences for their customers with high satisfaction rates. Beyond shifting our attitudes, data science has become one of the most promising technologies bringing change to different businesses. With the many solutions data science applications provide, there is no doubt that their benefits cannot be over-emphasized.
Chapter 1 - What is Data Analysis
Orange
Orange is an open-source data visualization and analysis tool designed for people who do not have expertise in data science. It helps the user build an interactive workflow for analyzing and visualizing data, using a simple interface and an advanced toolbox. Its output ranges from mainstream scatter plots and bar charts to dendrograms.
Knime
Knime is another open-source tool that enables the user to explore data and interpret hidden insights effectively. One of its good attributes is that it contains more than 1,000 modules, along with numerous examples that help the user understand how to apply the tool effectively. It is equipped with advanced integrated tools and some complex algorithms.
R-programming
R is the most common and widely used tool and has become a standard for statistical programming. It is free, open-source software that any user can install, use, upgrade, modify, clone, and even resell. It can easily and effectively be used for statistical computing and graphics, and it is compatible with operating systems such as Windows, macOS, and UNIX. It is a high-performance language that lets the user manage big data. Since it is free and regularly updated, it keeps technological projects cost-effective. Along with data mining, it lets users apply their statistical and graphical knowledge, including common techniques such as statistical tests, clustering, and linear and non-linear modeling.
Rapidminer
Rapidminer is similar to KNIME with respect to dealing with visual
programming for data modeling, analysis, and manipulation. It helps to
improve the overall yield of data science project teams. It offers an open-
source platform that permits Machine Learning, model deployment, and
data preparation. It is responsible for speeding up the development of an
entire analytical workflow, right from the steps of model validation to
deployment.
Pentaho
Pentaho tackles the issues organizations face in drawing value from their various data sources. It simplifies data preparation and data blending, and it provides tools for analysis, visualization, reporting, exploration, and prediction of data. It lets every member of a team derive meaning from the data.
Weka
Weka is another open-source software package, designed to handle machine learning algorithms and simplify data mining tasks. The user can apply these algorithms directly to process a data set. Since it is implemented in Java, it can also be used for developing new machine learning schemes. Its simple graphical user interface allows an easy transition into the field of data science, and any user acquainted with Java can invoke the library from their own code.
NodeXL
NodeXL is open-source data visualization and analysis software capable of displaying relationships in datasets. It has numerous modules, such as social network data importers and automation.
Gephi
Gephi is an open-source visualization and network analysis tool written in Java.
Talend
Talend is one of the leading open-source software providers that most data-driven companies go for. It enables customers to connect easily, irrespective of where they are.
Data Visualization
Data Wrapper
It is an online data-visualization tool that can be used to build interactive charts. Data can be uploaded as CSV, Excel, or PDF files, and the tool can generate maps, bar charts, and line charts. The graphs created with it come with ready-to-use embed codes and can be placed on any website.
Tableau Public
Tableau Public is a powerful tool that can create stunning visualizations that
can be used in any type of business. Data insights can be identified with the
help of this tool. Using the visualization tools in Tableau Public, a data scientist can explore data before applying any complex statistical processing.
Infogram
Infogram offers more than 35 interactive chart types and 500 maps that allow the user to visualize data. It can make various charts, such as word clouds, pie charts, and bar charts.
Google Fusion Tables
Google Fusion Tables is one of the most powerful data analysis tools. It is
widely used when an individual has to deal with massive datasets.
Solver
Solver supports effective financial reporting, budgeting, and analysis. It provides a button-driven interface that lets you interact with a company's profit-related data.
Sentiment Tools
Opentext
This specialized classification engine makes it possible to identify and evaluate expressions and patterns. It carries out analysis at various levels: document, sentence, and topic.
Trackur
Trackur is an automated sentiment analysis software emphasizing a specific
keyword that is tracked by an individual. It can draw vital insights by
monitoring social media and mainstream news. In short, it identifies and
discovers different trends.
Opinion Crawl
Opinion Crawl is another online sentiment analysis tool that analyses the latest news, products, and companies. Every visitor is free to access Web sentiment on a specific topic; anyone can enter a topic and receive an assessment. A pie chart reflecting the latest real-time sentiment is displayed for every topic, and the different concepts that people associate with it are represented by thumbnails and tag clouds. The positive and negative weight of the sentiment is also displayed. Web crawlers search the up-to-date content published on recent subjects and issues to create a comprehensive analysis.
Sage Live
Sage Live is a cloud-based accounting platform aimed at small and mid-sized businesses. It enables the user to create invoices and pay bills from a smartphone. It is a good choice if you want a data visualization tool that supports multiple companies, currencies, and banks.
Gawk GNU
Gawk is the GNU implementation of the AWK language. It is free software that interprets a special-purpose programming language, enabling users to handle simple data-reformatting jobs. The following are its main attributes:
➢ It is not procedural; it is data-driven.
➢ Writing programs is easy.
➢ It can search for a variety of patterns in text.
GraphLab Create
GraphLab Create can be used by data scientists as well as developers. It enables the user to build state-of-the-art data products that use Machine Learning to create smart applications.
Its attributes include the integration of automatic feature engineering, Machine Learning visualizations, and model selection into the application. It can identify and link records within and across data sources, and it simplifies the development of Machine Learning models.
Apache Spark
Apache Spark is designed to run in-memory and in real time.
The top 5 data analytics tools and techniques
Visual analytics
Visual analytics brings together different methods of data analysis through an integrated effort involving human interaction, data analysis, and visualization.
Business Experiments
Business experiments cover all the techniques used to test the validity of certain processes, including A/B testing and experimental design.
Regression Analysis
Regression Analysis allows the identification of factors that make two
different variables related to each other.
Correlation Analysis
Correlation Analysis is a statistical technique that detects whether a
relationship exists between two different variables.
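As a rough sketch of what a correlation analysis can look like in Python (the data frame and its column names, 'ad_spend' and 'revenue', are invented purely for illustration), pandas exposes a built-in .corr() method:
In []: import pandas as pd
data = pd.DataFrame({'ad_spend': [10, 20, 30, 40, 50],
                     'revenue': [12, 24, 33, 41, 55]})
data['ad_spend'].corr(data['revenue'])   # Pearson correlation coefficient
data.corr()                              # full pairwise correlation matrix
A value close to +1 indicates a strong positive relationship, a value close to -1 a strong negative one, and a value near 0 suggests no linear relationship.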
The Process
All data science projects are different in one way or another; however, they can all be broken down into typical stages. The very first step in this process is
acquiring data. This can be done in many ways. Your data can come from
databases, HTML, images, Excel files, and many other sources, and
uploading data is an important step every data scientist needs to go through.
Data munging comes after uploading the data because, at this point, the raw data cannot be used for any kind of analysis: it can be chaotic, filled with senseless information or gaps. This is why, as an aspiring data scientist, you solve this problem with Python data structures that
will turn this data into a data set that contains variables. You will need these
data sets when working with any kind of statistical or machine learning
analysis. Data munging might not be the most exciting phase in data
science, but it is the foundation for your project and much needed to extract
the valuable data you seek to obtain.
In the next phase, once you observe the data you obtain, you will begin to
create a hypothesis that will require testing. You will examine variables
graphically, and come up with new variables. You will use various data
science methodologies such as machine learning or graph analysis in order
to establish the most effective variables and their parameters. In other
words, in this phase you process all the data you obtain from the previous
phase and you create a model from it. You will undoubtedly realize in your
testing that corrections are needed and you will return to the data munging
phase to try something else. It’s important to keep in mind that most of the
time, the solution for your hypothesis will be nothing like the actual
solution you will have at the end of a
successful project. This is why you
cannot work purely theoretically. A good data scientist is required to
prototype a large variety of potential solutions and put them all to the test
until the best course of action is revealed.
One of the most essential parts of the data science process is visualizing the results through tables, charts, and plots. The overall process is often referred to as "OSEMN", which stands for "Obtain, Scrub, Explore, Model, iNterpret". While this abbreviation doesn't entirely illustrate the process behind data science, it captures the most important stages you should be aware of as an aspiring data scientist. Just keep in mind that data munging will often take the majority of your efforts when working on a project.
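To make the data munging stage a little more concrete, here is a minimal, hypothetical sketch using pandas; the column names and the messy values are invented for the example rather than taken from a real project.
In []: import pandas as pd
import numpy as np
# Hypothetical raw data with a gap, a nonsense value, and inconsistent text
raw = pd.DataFrame({'age': [25, np.nan, 31, 119],
                    'city': ['Austin', 'austin', None, 'Boston']})
clean = raw.copy()
clean['age'] = clean['age'].fillna(clean['age'].median())    # fill the gap
clean = clean[clean['age'] < 100]                            # drop an implausible age
clean['city'] = clean['city'].fillna('Unknown').str.title()  # normalize the text
clean    # a tidy data set ready for analysis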
#Example of inheritance
#base class
class Student(object):
    def __init__(self, name, rollno):
        self.name = name
        self.rollno = rollno

#GraduateStudent class inherits or is derived from the Student class
class GraduateStudent(Student):
    def __init__(self, name, rollno, graduate):
        Student.__init__(self, name, rollno)
        self.graduate = graduate

    def DisplayGraduateStudent(self):
        print("Student Name:", self.name)
        print("Student Rollno:", self.rollno)
        print("Study Group:", self.graduate)

#PostGraduate class inherits from the Student class
class PostGraduate(Student):
    def __init__(self, name, rollno, postgrad):
        Student.__init__(self, name, rollno)
        self.postgrad = postgrad

    def DisplayPostGraduateStudent(self):
        print("Student Name:", self.name)
        print("Student Rollno:", self.rollno)
        print("Study Group:", self.postgrad)

#instantiate from the GraduateStudent and PostGraduate classes
objGradStudent = GraduateStudent("Mainu", 1, "MS-Mathematics")
objPostGradStudent = PostGraduate("Shainu", 2, "MS-CS")
objGradStudent.DisplayGraduateStudent()
objPostGradStudent.DisplayPostGraduateStudent()
When you type this into your interpreter, you will get the following results:
Student Name: Mainu
Student Rollno: 1
Study Group: MS-Mathematics
Student Name: Shainu
Student Rollno: 2
Study Group: MS-CS
Overloading
Another process that you may want to consider when you’re working with
inheritances is learning how to ‘overload.’ When you work on the process
known as overloading, you can take one of the identifiers that you are
working with and then use that to define at least two methods, if not more.
For the most part, there will only be two methods that are inside of each
class, but sometimes this number will be higher. The two methods should
be inside the exact same class, but they need to have different parameters so
that they can be kept separate in this process. You will find that it is a good
idea to use this method when you want the two matched methods to do the
same tasks, but you would like them to do that task while having different
parameters.
This is not something that is common to work with, and as a beginner, you
will have very little need to use this since many experts don’t actually use it
either. But it is still something that you may want to spend your time
learning about just in case you do need to use it inside of your code. There
are some extra modules available for you that you can download so you can
make sure that overloading will work for you.
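Plain Python does not keep two same-named methods apart by their parameters; the later definition simply replaces the earlier one, which is why those extra modules exist. As one hedged sketch of the idea using only the standard library (it requires Python 3.8 or later), functools.singledispatchmethod picks an implementation based on the type of the first argument:
In []: from functools import singledispatchmethod

class Greeter:
    @singledispatchmethod
    def greet(self, arg):
        # Fallback version, used for any type not registered below
        print("Hello,", arg)

    @greet.register
    def _(self, arg: int):
        # Version used when greet() receives an integer
        print("Hello, customer number", arg)

g = Greeter()
g.greet("Matt")   # prints: Hello, Matt
g.greet(42)       # prints: Hello, customer number 42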
The pd.read_<file_type>('file_name') family of methods is the default way to read files into the Pandas framework. After import, pandas displays the content as a data frame for manipulation using all the methods we have practiced so far, and more.
CSV (comma-separated values) & Excel
Create a CSV file in Excel and save it in your Python working directory. You can check where your working directory is in a Jupyter notebook by typing pwd. If you want to change to another directory containing your files (e.g. Desktop), you can use the following code:
In []: import os
os.chdir('C:\\Users\\Username\\Desktop')
To import your CSV file, type pd.read_csv('csv_file_name'). Pandas will automatically detect the data stored in the file and display it as a data frame. A better approach would be to assign the imported data to a variable like this:
In []: Csv_data = pd.read_csv('example file.csv')
Csv_data
# show
Running this cell will assign the data in ‘example file.csv’ to the variable
Csv_data, which is of the type data frame. Now it can be called later or used
for performing some of the data frame operations.
For Excel files (.xlsx and .xls), the same approach is taken. To read an Excel file named 'class data.xlsx', we use the following code:
In []: Xl_data = pd.read_excel('class data.xlsx')
Xl_data
# show
This returns a data frame of the required values. You may notice that an
index starting from 0 is automatically assigned at the left side. This is
similar to declaring a data frame without explicitly including the index
field. You can add index names, like we did in previous examples.
Tip: if the Excel spreadsheet has multiple sheets filled, you can specify the sheet you need to import. Say we need only Sheet 1, we pass sheet_name = 'Sheet 1' to read_excel(). For extra functionality, you may check the documentation for read_excel() by using shift+tab.
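Putting that tip together into a single call (still assuming the same 'class data.xlsx' file used above), the sketch looks like this:
In []: sheet1_data = pd.read_excel('class data.xlsx', sheet_name='Sheet 1')
sheet1_data.head()    # preview the first rows of the selected sheet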
Write
After working with our imported or pandas-built data frames, we can write
the resulting data frame back into various formats. We will, however, only
consider writing back to CSV and excel. To write a data frame to CSV, use
the following syntax:
In []: Csv_data.to_csv('file name', index=False)
This writes the data frame Csv_data to a CSV file with the specified filename in the Python working directory. If the file does not exist, it creates it.
For writing to an Excel file, a similar syntax is used, but with a sheet name specified for the data frame being exported.
In []: Xl_data.to_excel('file name.xlsx', sheet_name='Sheet 1')
This writes the data frame Xl_data to 'Sheet 1' of 'file name.xlsx'.
HTML
Reading HTML files through pandas requires a few libraries to be installed: html5lib, lxml, and BeautifulSoup4. Since we installed the latest Anaconda, these libraries are likely to be included. Use conda list to verify, and conda install to install any missing ones.
HTML tables can be read directly into pandas using the pd.read_html('sheet url') method. The sheet url is a web link to the data set to be imported. As an example, let us import the 'Failed bank lists' dataset from the FDIC's website and call it w_data.
In []: w_data = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
w_data[0]
To display the result, we used w_data[0] here. This is because the table we need is the first table element in the webpage source code. If you are familiar with HTML, you can easily identify where each element lies. To inspect a web page's source code in the Chrome browser, right-click on the page and select 'view page source'. Since what we are looking for is table-like data, it will be specified as a table in the source code; that is where the data set is defined in the FDIC page source.
Exercises:
We will be applying all we have learned here.
1. Import pandas as pd
2. Import the CSV file into Jupyter notebook, assign it to a
variable ‘Sal’, and display the first 5 values.
Hint: use the .head() method to display the first 5 values of a data frame. Likewise, .tail() is used for displaying the last 5 results. To specify more values, pass 'n=value' into the method.
3. What is the highest pay (including benefits)? Answer:
567595.43
Hint: Use data frame column indexing and the .max() method.
4. According to the data, what is ‘MONICA FIELDS’s Job
title, and how much does she make plus benefits? Answer:
Deputy Chief of the Fire Department, and $ 261,366.14.
Hint: Data frame column selection and conditional selection work (conditional selection can be found under Example 72; use column index == 'string' for the Boolean condition).
5. Finally, who earns the highest basic salary (minus benefits), and by how much is their salary higher than the average basic salary? Answer: NATHANIEL FORD earns the highest. His salary is higher than the average by $492827.1080282971.
Hint: Use the .max() and .mean() methods for the pay gap. Conditional selection with column indexing also works for finding the employee name with the highest pay.
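For reference, here is a hedged sketch of how those steps might look in code. The file name 'Salaries.csv' and the column names 'EmployeeName', 'BasePay', and 'TotalPayBenefits' are assumptions about the data set; adjust them to match your own file.
In []: import pandas as pd
Sal = pd.read_csv('Salaries.csv')              # assumed file name
Sal.head()                                     # first 5 rows
Sal['TotalPayBenefits'].max()                  # highest pay including benefits
Sal[Sal['EmployeeName'] == 'MONICA FIELDS']    # job title and total pay for one employee
# gap between the highest and the average basic salary
Sal['BasePay'].max() - Sal['BasePay'].mean()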
Chapter 8 - The Different Types
There are two main types of data, namely structured and unstructured, and the types
of algorithms and models that we can run on them will depend on what kind
of data we are working with. Both can be valuable, but it often depends on
what we are trying to learn, and which one will serve us the best for the
topic at hand. With that in mind, let’s dive into some of the differences
between structured and unstructured data and why each can be so important
to our data analysis.
Structured Data
The first type of data that we will explore is known as structured data. This
is often the kind that will be considered traditional data. This means that we
will see it consisting mainly of lots of text files that are organized and have
a lot of useful information. We can quickly glance through this information
and see what kind of data is there, without having to look up more
information, labeling it, or looking through videos to find what we want.
Structured data is going to be the kind that we can store inside one of the
options for warehouses of data, and we can then pull it up any time that we
want for analysis. Before the era of big data, and some of the emerging
sources of data that we are using on a regular basis now, structured data was
the only option that most companies would use to make their business
decisions.
Many companies still love to work with this structured data. The data is
very organized and easy to read through, and it is easier to digest. This
ensures that our analysis is going to be easier to go through with legacy
solutions to data mining. To make this more specific, this structured data is
going to be made up largely of some of the customer data that is the most
basic and could provide us with some information including the contact
information, address, names, geographical locations and more of the
customers.
In addition to all of this, a business may decide to collect some transactional
data and this would be a source of structured data as well. Some of the
transactional data that the company could choose to work with would
include financial information, but we must make sure that when this is used,
it is stored in the appropriate manner so it meets the standards of
compliance for the industry.
There are several methods we can use in order to manage this structured
data. For the most part, though, this type of data is going to be managed
with legacy solutions of analytics because it is already well organized and
we do not need to go through and make adjustments and changes to the data
at all. This can save a lot of time and hassle in the process and ensures that we can get the data we want, working the way that we want.
Of course, even with some of the rapid rise that we see with new sources of
data, companies are still going to work at dipping into the stores of
structured data that they have. This helps them to produce higher-quality
insights, ones that are easier to gather and will not be as hard to look
through the model for insights either. These insights are going to help the
company learn some of the new ways that they can run their business.
While companies that are driven by data all over the world have been able
to analyze this structured data for a long period of time, over many decades,
they are just now starting to really take some of the new and emerging
sources of data as seriously as they should. The good news with this one
though is that it is creating a lot of new opportunities in their company, and
helping them to gain some of the momentum and success that they want.
Even with all of the benefits that come with structured data, this is often not
the only source of data that companies are going to rely on. First off,
finding this kind of data can take a lot of time and can be a waste if you
need to get the results in a quick and efficient manner. Collecting structured
data is something that takes some time, simply because it is so structured
and organized.
Another issue that we need to watch out for when it comes to structured
data is that it can be more expensive. It takes someone a lot of time to sort
through and organize all of that data. And while it may make the model that
we are working on more efficient than other forms, it can often be
expensive to work with this kind of data. Companies need to balance their
cost and benefit ratio here and determine if they want to use any structured
data at all, and if they do, how much of this structured data they are going
to add to their model.
Unstructured Data
The next option of data that we can look at is known as unstructured data.
This kind of data is a bit different than what we talked about before, but it is
really starting to grow in influence as companies are trying to find ways to
leverage the new and emerging data sources. Some companies choose to
work with just unstructured data on their own, and others choose to do
some mixture of unstructured data and structured data. This provides them
with some of the benefits of both and can really help them to get the
answers they need to provide good customer service and other benefits to
their business.
There are many sources from which we are able to get this kind of data, but mainly it comes from streaming data. This streaming data comes in from
mobile applications, social media platforms, location services, and the
Internet of Things. Since the diversity that is there among unstructured
sources of data is so prevalent, and it is likely that those businesses who
choose to use unstructured data will rely on many different sources,
businesses may find that it is harder to manage this data than it was with
structured data.
Because of this trouble with managing the unstructured data, there are many
times when a company will be challenged by this data, in ways that they
weren’t in the past. And many times, they have to add in some creativity in
order to handle the data and to make sure they are pulling out the relevant
data, from all of those sources, for their analytics.
The growth and the maturation of things known as data lakes, and even the
platform known as Hadoop, are going to be a direct result of the expanding
collection of unstructured data. The traditional environments that were used
with structured data are not going to cut it at this point, and they are not
going to be a match when it comes to the unstructured data that most
companies want to collect right now and analyze.
Because it is hard to handle the new sources and types of data, we can’t use
the same tools and techniques that we did in the past. Companies who want
to work with unstructured data have to pour additional resources into
various programs and human talent in order to handle the data and actually
collect relevant insights and data from it.
The lack of any structure that is easily defined inside of this type of data can
sometimes turn businesses away from this kind of data in the first place.
But there really is a lot of potential hidden in that data. We just
need to learn the right methods to use to pull that data out. The unstructured
data is certainly going to keep the data scientist busy overall because they
can’t just take the data and record it in a data table or a spreadsheet. But
with the right tools and a specialized set of skills to work with, those who
are trying to use this unstructured data to find the right insights, and are
willing to make some investments in time and money, will find that it can
be so worth it in the end.
Both of these types of data, the structured and the unstructured, are going to
be so important when it comes to the success you see with your business.
Sometimes our project just needs one or the other of these data types, and
other times it needs a combination of both of them.
For a company to reach success though, they need to be able to analyze, in a
proper and effective manner, all of their data, regardless of the type of the
source. Given the experience that the enterprise has with data, it is not a big
surprise that all of this buzz already surrounds data that comes from sources
that may be seen as unstructured. And as new technologies begin to surface
that can help enterprises of all sizes analyze their data in one place, it is
more important than ever for us to learn what this kind of data is all about,
and how to combine it with some of the more traditional forms of data,
including structured data.
WHY PYTHON FOR DATA ANALYSIS?
The next thing that we need to spend some of our time on in this guidebook
is the Python language. There are a lot of options that you can choose when
working on your own data analysis, and bringing out all of these tools can
really make a big difference in how much information you are able to get
out of your analysis. But if you want to pick a programming language that
is easy to learn, has a lot of power, and can handle pretty much all of the
tasks that you need to handle with data analysis and machine learning, then
Python is the choice for you. Let’s dive into the Python language a little bit
and see how this language can be used to help us see some great results
with our data analysis.
The process of data visualization is going to help us change up the way that
we can work with the data that we are using. Data analysis is supposed to
respond to any issues that are found in the company in a faster manner than
ever before.
And they need to be able to dig through and find more insights as well, look
at data in a different manner, and learn how to be more imaginative and
creative in the process. This is exactly something that data visualization is
able to help us out with.
How Can We Use Data Visualization?
The next thing that we need to take some time to look at is how companies
throughout many industries are able to use data visualization for their own
needs. No matter the size of the company or what kind of industry they are
in, it is possible to use some of the basics of data visualization in order to
help make more sense of the data at hand. And there are a variety of ways
that this data visualization will be able to help you succeed.
The first benefit that we can look at is the fact that these visuals are going to
be a great way for us to comprehend the information that we see in a faster
fashion. When we are able to use a graphical representation of all that data
on our business, rather than reading through charts and spreadsheets, we
will be able to see these large amounts of data in a clear and cohesive
manner.
It is much easier to go through all of that information and see what is found
inside, rather than having to try and guess and draw the conclusions on our
own.
And since it is often much faster for us to analyze this kind of information
in a graphical format, rather than analyzing it on a spreadsheet, it becomes
easier for us to understand what is there. When we are able to do it in this
manner, it is so much easier for a business to address problems or answer
some of their big questions in a timely manner so that things are fixed
without issue or without having to worry about more damage.
The second benefit that comes with using data visuals to help out with the
process of data science is that they can really make it easy to pinpoint some
of the emerging trends that we need to focus on. These trends are within the data, and we are going to be able to find them even if we just read
through the spreadsheets and the documents.
But this takes a lot of time, can be boring, and often it is hard for us to
really see these correlations and relationships, and we may miss out on
some of the more important information that we need.
Using the idea of these visuals to handle the data, and to discover trends,
whether this is the trends just in the individual business or in the market as a
whole, can really help to ensure that your business gains some big
advantages over others in your competition base. And of course, any time
that you are able to beat out the competition, it is going to positively affect
your bottom line.
When you use the right visual to help you get the work done, it is much
easier to spot some of the outliers that are present, the ones that are more
likely to affect the quality of your product, the customer churn, or other
factors that will change your business. In addition, it is going to help you to
address issues before they are able to turn into much bigger problems that
you have to work with.
Next on the list is that these visuals are going to be able to help you identify
some relationships and patterns that are found in all of that data that you are
using. Even with extensive amounts of data that is complicated, we can find
that the information starts to make more sense when it is presented in a
graphic format, rather than in just a spreadsheet or another format.
With the visuals, it becomes so much easier for a business to recognize
some of the different parameters that are there and how these are highly
correlated with one another. Some of the correlations that we are able to see
within our data are going to be pretty obvious, but there are others that
won’t be as obvious. When we use these visuals to help us find and know
about these relationships, it is going to make it much easier for our business
to really focus on the areas that are the most likely to influence some of our
most important goals.
We may also find that working with these visuals can help us to find some
of the outliers in the information that is there. Sometimes these outliers
mean nothing. If you are looking at the charts and graphs and find just a
few random outliers that don’t seem to connect with each other, it is best to
cut these out of the system and not worry about them.
But there are times when these outliers are going to be important and we
should pay more attention to them.
If you are looking at some of the visuals that you have and you notice that
there are a substantial amount of them that fall in the same area, then you
will need to pay closer attention. This could be an area that you can focus
your attention on to reach more customers, a problem that could grow into a
major challenge if you are not careful, or something else that you need to
pay some attention to.
These visuals can also help us to learn more about our customers. We can
use them to figure out where our customers are, what kinds of products our
customers would be the happiest with, how we can provide better services
to our customers, and more. Many companies decide to work with data
visualization to help them learn more about their customers and to ensure
that they can really stand out from the crowd with the work they do.
And finally, we need to take a look at how these visuals are a great way to
communicate a story to someone else. Once your business has had the time
to uncover some new insights from visual analytics, the next step here is to
communicate some of those insights to others. It isn’t going to do you much
good to come up with all of those insights, and then not actually show them
to the people responsible for key decisions in the business.
Now, we could just hand these individuals, the ones who make some of the
big decisions, the spreadsheets and some of the reports that we have. And
they will probably be able to learn a lot of information from that. But this is
not always the best way to do things.
Instead, we need to make sure that we set things up with the help of a
visual, ensuring that these individuals who make the big decisions can look
it over and see some of the key relationships and information at a glance.
Using graphs, charts, and some of the other visuals that are really impactful
as a representation of our data is going to be so important in this step
because it is going to be engaging and can help us to get our message across
to others in a faster manner than before.
As we can see, there are a lot of benefits that come in when we talk about
data visualizations and all of the things that we are able to do with them.
Being able to figure out the best kind of visualization that works for your
needs, and ensuring that you can actually turn that data into a graph or chart
or another visualization is going to be so important when it is time to work
with your data analysis.
We can certainly do the analysis without data visualization. But when it
comes to showcasing the findings in an attractive and easy to understand
format, nothing is going to be better than data visualization.
Once you have been able to go through and answer all of the initial
questions that we had about the data type that we would like to work with,
and you know what kind of audience is going to be there to consume the
information, it is time for us to make some preparations for the amount of
data that we plan to work with in this process.
Keep in mind here that big data is great for many businesses and is often
necessary to make data science work. But it is also going to bring in a few
new challenges to the visualization that we are doing. Large volumes,
varying velocities, and different varieties are all going to be taken into
account with this one.
Plus, data is often going to be generated at a rate that is much faster than it
can be managed and analyzed so we have to figure out the best way to deal
with this problem.
There are factors that we need to consider in this process as well, including
the cardinality of the columns that we want to be able to work with.
We have to be aware of whether there is a high level of cardinality in the
process or a low level. If we are dealing with high cardinality, this is a sign
that we are going to have a lot of unique values in our data. A good
example of this would include bank account numbers since each individual
would have a unique account number.
Then it is possible that your data is going to have a low cardinality. This
means that the column of data that you are working with will come with a
large percentage of repeat values. This is something that we may notice
when it comes to the gender column on our system. The algorithm is going
to handle the amount of cardinality, whether it is high or low, in a different
manner, so we always have to take this into some consideration when we do
our work.
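One quick way to check the cardinality of each column in practice is pandas' nunique() method; the data frame and column names below are invented for illustration.
In []: import pandas as pd
df = pd.DataFrame({'account_number': [1001, 1002, 1003, 1004],
                   'gender': ['F', 'M', 'F', 'F']})
df.nunique()    # unique values per column
Out[]: account_number    4
gender            2
dtype: int64
A column whose count of unique values is close to the number of rows (like the account numbers) has high cardinality, while a column with only a handful of repeated values (like gender) has low cardinality.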
Exercise:
Try to create a larger array, and use these
indexing techniques to grab certain elements from the
array. For example, here is a larger array:
In []: # 5 x 10 array of even numbers between 0 and 100
large_array = np.arange(0,100,2).reshape(5,10)
large_array
# show
Out[]: array([[ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18],
[20, 22, 24, 26, 28, 30, 32, 34, 36, 38],
[40, 42, 44, 46, 48, 50, 52, 54, 56, 58],
[60, 62, 64, 66, 68, 70, 72, 74, 76, 78],
[80, 82, 84, 86, 88, 90, 92, 94, 96, 98]])
Tip:
Try grabbing single elements and rows from random arrays you
create. After getting very familiar with this, try selecting columns.
The point is to try as many combinations as possible to get you
familiar with the approach. If the slicing and indexing notations are
confusing, try to revisit the section under list or string slicing and
indexing.
To revisit the examples on slicing, see the earlier section on list indexing.
Conditional selection
Consider a case where we need to extract certain values from an array that
meet a Boolean criterion. NumPy offers a convenient way of doing this
without having to use loops.
Example 3:
Using conditional selection
Consider this array of odd numbers between 0 and 20. Assuming we need to grab the elements above 11, we first have to create the conditional array that selects them:
In []: odd_array = np.arange(1,20,2)    # Vector of odd numbers
odd_array                               # Show vector
bool_array = odd_array > 11             # Boolean conditional array
bool_array
Out[]: array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19])
Out[]: array([False, False, False, False, False, False, True, True, True,
True])
Notice how the bool_array evaluates to True at all instances where the
elements of the odd_array meet the Boolean criterion.
The Boolean array itself is not usually so useful. To return the values that we need, we pass bool_array into the original array to get our results.
In []: useful_Array = odd_array[bool_array]
# The values we want
useful_Array
Out[]: array([13, 15, 17, 19])
Now, that is how to grab elements using conditional selection. There is
however a more compact way of doing this. It is the same idea, but it
reduces typing.
Instead of first declaring a Boolean_array to hold our truth values, we just
pass the condition into the array itself, like we did for useful_array.
In []: # This code is more compact
compact = odd_array[odd_array>11]
# One line
compact
Out[]: array([13, 15, 17, 19])
See how we achieved the same result with just two lines? It is
recommended to use this second method, as it saves coding time and
resources. The first method helps explain how it all works. However, we
would be using the second method for all other instances in this book.
Exercise: Conditional selection works on all arrays (vectors and matrices alike). Create a 3 x 3 array of the elements greater than 80 from the 'large_array' given in the last exercise.
Hint: use the reshape method to convert the resulting array into a 3 x 3 matrix.
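One possible solution, given the large_array defined in the previous exercise, is sketched below.
In []: large_array[large_array > 80].reshape(3,3)    # conditional selection, then reshape
Out[]: array([[82, 84, 86],
       [88, 90, 92],
       [94, 96, 98]])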
NumPy Array Operations
Finally, we will be exploring basic arithmetical operations with NumPy
arrays. These operations are not unlike that of integer or float Python lists.
Array – Array Operations
In NumPy, arrays can operate with and on each other using various
arithmetic operators. Things like the addition of two arrays, division, etc.
Example 4:
In []: # Array - Array Operations
# Declaring two arrays of 10 elements
Array1 = np.arange(10).reshape(2,5)
Array2 = np.random.randn(10).reshape(2,5)
Array1;Array2 # Show the arrays
# Addition
Array_sum = Array1 + Array2
Array_sum # show result array
#Subtraction
Array_minus = Array1 - Array2
Array_minus # Show array
# Multiplication
Array_product = Array1 * Array2
Array_product # Show
# Division
Array_divide = Array1 / Array2
Array_divide # Show
Out[]: array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
array([[ 2.09122638,  0.45323217, -0.50086442,  1.00633093,  1.24838264],
       [ 1.64954711, -0.93396737,  1.05965475,  0.78422255, -1.84595505]])
array([[2.09122638, 1.45323217, 1.49913558, 4.00633093, 5.24838264],
       [6.64954711, 5.06603263, 8.05965475, 8.78422255, 7.15404495]])
array([[-2.09122638,  0.54676783,  2.50086442,  1.99366907,  2.75161736],
       [ 3.35045289,  6.93396737,  5.94034525,  7.21577745, 10.84595505]])
array([[  0.        ,   0.45323217,  -1.00172885,   3.01899278,   4.99353055],
       [  8.24773555,  -5.60380425,   7.41758328,   6.27378038, -16.61359546]])
array([[ 0.        ,  2.20637474, -3.99309655,  2.9811267 ,  3.20414581],
       [ 3.03113501, -6.42420727,  6.60592516, 10.20118591, -4.875525  ]])
Each of the arithmetic operations performed is element-wise. The division operations, however, require extra care. In Python, most arithmetic errors in code throw a run-time error, which helps in debugging. With NumPy, however, the code may still run, with only a warning issued.
Array – Scalar operations
NumPy also supports scalar operations with arrays. A scalar in this context is just a single numeric value of either integer or float type. The scalar-array operations are also element-wise, by virtue of the broadcasting feature of NumPy arrays.
Example 5:
In []: # Scalar - Array Operations
new_array = np.arange(0,11)      # Array of values from 0-10
print('New_array')
new_array                        # Show
Sc = 100                         # Scalar value

# let us make an array with a range from 100 - 110 (using +)
add_array = new_array + Sc       # Adding 100 to every item
print('\nAdd_array')
add_array                        # Show

# Let us make an array of 100s (using -)
centurion = add_array - new_array
print('\nCenturion')
centurion                        # Show

# Let us do some multiplication (using *)
multiplex = new_array * 100
print('\nMultiplex')
multiplex                        # Show

# division [take care], let us deliberately generate
# an error. We will do a divide by zero.
err_vec = new_array / new_array
print('\nError_vec')
err_vec                          # Show
New_array
Out[]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Add_array
Out[]: array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])
Centurion
Out[]: array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100])
Multiplex
Out[]: array([ 0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])
Error_vec
C:\Users\Oguntuase\Anaconda3\lib\site-
packages\ipykernel_launcher.py:27:
RuntimeWarning: invalid value encountered in true_divide
array([nan, 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Notice the runtime warning generated? It was caused by the division of the first element of new_array by itself, i.e. 0/0. This would raise a ZeroDivisionError in a normal Python environment and the code would not run. NumPy, however, ran the code and marked the 0/0 result in the Error_vec array as 'nan' (not-a-number). The same goes for values that evaluate to infinity, which are represented by '+/- inf' (try 1/0 using a NumPy array-scalar or array-array operation).
Tip:
Always take caution when using division to avoid such runtime
errors that could later bug your code.
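As a quick sketch of the infinity case mentioned above, dividing a NumPy array by zero runs with a warning and returns inf rather than raising an error:
In []: np.array([1, 2, 3]) / 0    # issues a RuntimeWarning: divide by zero
Out[]: array([inf, inf, inf])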
Universal Array functions
These are built-in functions designed to operate in an element-wise fashion on NumPy arrays. They include mathematical, comparison, trigonometric, Boolean, and other operations, and they are called using the np.function_name(array) syntax.
Example 6
: A few Universal Array functions (U-Func)
In []: # Using U-Funcs
U_add = np.add(new_array,Sc)               # addition
U_add                                      # Show
U_sub = np.subtract(add_array,new_array)   # subtraction
U_sub                                      # Show
U_log = np.log(new_array)                  # Natural log
U_log                                      # Show
sinusoid = np.sin(new_array)               # Sine wave
sinusoid                                   # Show
# Alternatively, we can use the .method
new_array.max()                            # find maximum
np.max(new_array)                          # same thing
Out[]: array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])
Out[]: array([100, 100, 100, 100, 100, 100, 100, 100, 100,
100, 100])
C:\Users\Oguntuase\Anaconda3\lib\site-
packages\ipykernel_launcher.py:8: RuntimeWarning: divide by
zero encountered in log
Out[]: array([ -inf, 0. , 0.69314718, 1.09861229, 1.38629436,
1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458,
2.30258509])
Out[]: array([ 0. , 0.84147098, 0.90929743, 0.14112001, -0.7568025
, -0.95892427, -0.2794155 , 0.6569866
, 0.98935825, 0.41211849, -0.54402111])
Out[]: 10
Out[]: 10
There are still many more functions available, and a full reference can be
found in the NumPy documentation for Universal functions here:
https://docs.scipy.org/doc/numpy/reference/ufuncs.html
Now that we have explored NumPy for creating arrays, we will consider the Pandas framework for manipulating these arrays and organizing them into data frames.
Pandas
This is an open-source library that extends the capabilities of NumPy. It supports data cleaning and preparation, with fast analysis capabilities. It is rather like working with Microsoft Excel, but within Python. Unlike NumPy, it has its own built-in visualization features and can work with data from a variety of sources. It is one of the most versatile packages for data science with Python, and we will be exploring how to use it effectively.
To use pandas, make sure it is currently part of your installed packages by verifying with the conda list command. If it is not installed, then you can install it using the conda install pandas command; you need an internet connection for this.
Now that Pandas is available on your PC, you can start working with the
package. First, we start with the Pandas series.
Series
This is an extension of the NumPy array. It has a lot of similarities, but with
a difference in indexing capacity. NumPy arrays are only indexed via
number notations corresponding to the desired rows and columns to be
accessed. For Pandas series, the axes have labels that can be used for
indexing their elements. Also, while NumPy arrays, like Python lists, are essentially used for holding numeric data, Pandas series can hold any form of Python data/object.
Example 7
: Let us illustrate how to create and use the Pandas series
First, we have to import the Pandas package into our workspace. We will
use the variable name pd for Pandas, just as we used np for NumPy in the
previous section.
In []: import numpy as np
#importing numpy for use
import pandas as pd # importing the Pandas package
We also imported the numpy package because this example involves a
numpy array.
In []: # python objects for use
labels = ['First','Second','Third']            # string list
values = [10,20,30]                            # numeric list
array = np.arange(10,31,10)                    # numpy array
dico = {'First':10,'Second':20,'Third':30}     # Python dictionary

# create various series
A = pd.Series(values)
print('Default series')
A    # show

B = pd.Series(values,labels)
print('\nPython numeric list and label')
B    # show

C = pd.Series(array,labels)
print('\nUsing python arrays and labels')
C    # show

D = pd.Series(dico)
print('\nPassing a dictionary')
D    # show
Default series
Out[]: 0    10
1    20
2    30
dtype: int64
Python numeric list and label
Out[]: First 10
Second 20
Third 30
dtype: int64
Using python arrays and labels
Out[]: First 10
Second 20
Third 30
dtype: int32
Passing a dictionary
Out[]: First 10
Second 20
Third 30
dtype: int64
We have just explored a few ways of creating a Pandas Series: from a
NumPy array, a Python list, and a dictionary. Notice how the labels
correspond to the values? Also, notice that the dtypes differ. Since the data
is numeric and of integer type, Pandas assigns an appropriate integer dtype
to it. Creating a Series from the NumPy array keeps the array's own dtype
(int32 on the system used here), while the list and the dictionary default to
int64. The difference between 32-bit and 64-bit signed integers is the
corresponding memory allocation: 32 bits requires less memory (4 bytes,
since 8 bits make a byte), while 64 bits requires double that (8 bytes). On
the other hand, 32-bit integers can hold a much smaller range of values than
64-bit integers.
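As a minimal sketch of this (the attributes used are standard NumPy; the exact int32/int64 split you see depends on your operating system), you can inspect the dtype and per-element memory yourself:
In []: array = np.arange(10,31,10)
print(array.dtype) # e.g. int32 on Windows builds, int64 on many Linux builds
print(array.itemsize) # bytes per element: 4 for int32, 8 for int64
print(pd.Series(array).dtype) # the Series keeps the array's dtype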
Pandas series also support the assignment of any data type or object as its
data points.
In []: pd.Series(labels,values)
Out[]: 10 First
20 Second
30 Third
dtype: object
Here, the string elements of the label list are now the data points. Also,
notice that the dtype is now ‘object’.
This kind of versatility in item operation and storage is what makes pandas
series very robust. Pandas series are indexed using labels. This is illustrated
in the following examples:
Example 8:
In []: # series of WWII countries
pool1 = pd.Series([1,2,3,4],['USA','Britain','France','Germany'])
pool1 #show
print('grabbing the first element')
pool1['USA'] # first label index
Out[]: USA 1
Britain 2
France 3
Germany 4
dtype: int64
grabbing the first element
Out[]: 1
As shown in the code above, to grab a Series element, use the same
approach as NumPy array indexing, but pass the label corresponding to that
data point. The data type of the label also matters: notice that the ‘USA’
label was passed as a string to grab the data point 1. If the labels are
numeric, then the indexing looks just like that of a NumPy array. Consider
numeric indexing in the following example:
In []: pool2 = pd.Series(['USA','Britain','France','Germany'],[1,2,3,4])
pool2
#show
print('grabbing the first element')
pool2[1] #numeric indexing
Out[]: 1 USA
2 Britain
3 France
4 Germany
dtype: object
grabbing the first element
Out[]: 'USA'
Tip:
You can easily tell what kind of data a Series holds by checking its
dtype. Notice how the dtypes for pool1 and pool2 differ, even though
both were created from the same lists. The difference is that pool2
holds strings as its data points, while pool1 holds integers (int64).
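A minimal sketch of that check, reusing the pool1 and pool2 Series defined above:
In []: print(pool1.dtype) # int64, because the data points are integers
print(pool2.dtype) # object, because the data points are strings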
Pandas Series can be added together. This works best when the two Series
have matching labels and compatible data points.
Example 9: Adding series
Let us create a third series, ‘pool3’. It is similar to pool1, but ‘Britain’ has
been replaced with ‘USSR’, with a corresponding data point value of 5.
In []: pool3 = pd.Series([1,5,3,4],['USA','USSR','France',
'Germany'])
pool3
Out[]: USA 1
USSR 5
France 3
Germany 4
dtype: int64
Now adding series:
In []:# Demonstrating series addition
double_pool = pool1 + pool1
print('Double Pool')
double_pool
mixed_pool = pool1 + pool3
print('\nMixed Pool')
mixed_pool
funny_pool = pool1 + pool2
print('\nFunny Pool')
funny_pool
Double Pool
Out[]: USA 2
Britain 4
France 6
Germany 8
dtype: int64
Mixed Pool
Out[]: Britain NaN
France 6.0
Germany 8.0
USA 2.0
USSR NaN
dtype: float64
Funny Pool
C:\Users\Oguntuase\Anaconda3\lib\site-
packages\pandas\core\indexes\base.py:3772: RuntimeWarning: '<'
not supported between instances of 'str' and 'int', sort order is
undefined for incomparable objects
return this.join(other, how=how, return_indexers=return_indexers)
Out[]: USA NaN
Britain NaN
France NaN
Germany NaN
1 NaN
2 NaN
3 NaN
4 NaN
dtype: object
When two Series are added, the data point values of matching labels (or
indexes) are summed. A NaN is returned wherever the labels do not match.
Notice the difference between mixed_pool and funny_pool: in mixed_pool,
some labels match, and their values are added together. In funny_pool, no
labels match and the data points are of different types, so a warning message
is printed and the output is a vertical concatenation of the two Series with
NaN data points.
Tip:
As long as two Series contain the same labels and data points of
the same type, basic array operations such as addition and subtraction
can be performed. The order of the labels does not matter; values are
aligned by label and combined according to the operator used. To
fully grasp this, try running variations of the examples given above.
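A minimal sketch of that label alignment, reusing pool3 from above (the reordered Series here is introduced purely for illustration):
In []: reordered = pd.Series([4,3,1,5],['Germany','France','USA','USSR'])
pool3 - reordered # values are matched by label, not by position
# every label lines up, so each difference is 0 despite the different ordering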
Chapter 11 - Common Debugging Commands
Starting
Debugging begins by launching the debugger on its target. For the Python
debugger, this means typing the name of the debugger module followed by
the name of the file, object, or program to debug (for example, python -m pdb
followed by the script name). Inside the debugging tool, a prompt appears,
offering several commands from which to choose in order to make the
necessary corrections.
Running
The commands used here are ‘[!]statement’, which executes a single
statement in the context of the program being debugged, and ‘r(un)’, which
runs the program up to the intended lines so that any errors can be identified.
Arguments for the program are supplied after the command. For example, if
the application is named ‘prog1’, a command such as “r prog1 < infile” runs
it with its standard input redirected from the named file.
Breakpoints
As essential components of debugging, breakpoints use the command
‘b(reak) [[filename:]lineno|function[, condition]]’ to make the debugger
stop when program execution reaches that point. When execution hits a
breakpoint, the process is suspended and the debugger prompt appears on
the screen. This gives you time to inspect variables and identify any errors
or mistakes that might affect the process. Breakpoints can therefore be set
to halt at any line number, or at any function name, that program execution
passes through.
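A minimal sketch of setting breakpoints at the (Pdb) prompt; the file name my_script.py, the function process, and the condition total > 100 are hypothetical and shown only for illustration:
(Pdb) b my_script.py:12 # stop whenever line 12 of my_script.py is reached
(Pdb) b process, total > 100 # stop inside process() only when the condition holds
(Pdb) b # with no argument, list all breakpoints currently set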
Back Trace
Backtrace is issued with the command ‘bt’ (‘w(here)’ in the Python
debugger) and prints the list of pending function calls at the point where the
program stopped. Backtrace commands are only meaningful while execution
is suspended at a breakpoint, or after the program has exited abnormally
with a runtime error (the kind of crash called a segmentation fault in
lower-level languages). This form of debugging is most useful after such a
crash, because the chain of pending function calls points to the source of the
error.
Printing
Printing is primarily used to examine the value of variables or expressions
in the current frame. In the Python debugger this uses the command
‘p expression’ (or ‘pp’ for pretty-printing), and it is useful after the running
program has been stopped at a breakpoint or by a runtime error. Any legal
Python expression can be evaluated, including function calls. Besides
printing, resuming execution after a breakpoint or runtime error uses the
command ‘c(ont(inue))’.
Single Step
Single stepping uses the commands ‘s(tep)’ and ‘n(ext)’ after a breakpoint to
move through the source lines one at a time. The two commands behave
differently: ‘step’ descends into function calls and executes them line by
line, while ‘next’ executes a function call in one go and stops at the next line
of the current function. Running the program line by line in this way is often
the most effective way to trace errors during execution.
Trace Search
With the commands ‘u(p)’ and ‘d(own)’, you can move up and down the
stack of pending function calls. This lets you examine the variables at
different levels of the call stack, so you can readily seek out mistakes and
eliminate errors with the debugging tool.
File Select
Another basic debugger command is file select, which uses ‘l(ist) [first[,
last]]’ to display lines of the currently selected source file. Many programs
are made up of two or more source files, especially in more complex
projects, and debugging tools are particularly valuable in such cases. The
debugger should be pointed at the main source file so that breakpoints and
runtime errors can be examined against the lines that are actually executing.
With Python, the relevant source file can be readily selected and treated as
the working file.
Alias
Alias debugging entails creating an alias name that executes a command;
the command must not be enclosed in single or double quotes. The syntax is
‘alias [alias [command]]’. Replaceable parameters can appear in the
command and are substituted with the arguments given when the alias is
invoked. If no command is supplied, the current definition of the named
alias is shown; with no arguments at all, every defined alias is listed. An
alias may consist of anything that can legally be typed at the pdb prompt.
Python Debugger
In the Python programming language, the module pdb defines an interactive
source-code debugger. It supports setting (conditional) breakpoints, single
stepping at the source line level, listing source code, inspecting stack
frames, and evaluating arbitrary Python code in the context of any frame.
Post-mortem debugging is also supported, and the debugger can be called
under program control. The debugger is extensible; it is defined as the class
Pdb, and the interface uses the bdb and cmd modules.
The pdb prompt is used to run programs under the control of the debugging
tools; for instance, pdb can be invoked as a script (python -m pdb followed
by the script name) to debug other programs. It can also be entered after a
crash to examine the program post mortem. Some of the functions the
module provides are run(statement[, globals[, locals]]) for running a Python
statement under the debugger and runeval(expression[, globals[, locals]])
for evaluating an expression. There are several other functions, not
mentioned here, for executing Python programs under debugger control.
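A minimal sketch of these entry points (the function buggy and the script name my_script.py are hypothetical; pdb.run, pdb.set_trace, and the python -m pdb invocation are the standard ones described above):
In []: import pdb

def buggy(n):
    total = 0
    for i in range(n):
        total += i * i # suppose we suspect this accumulation is wrong
    return total

# run a statement under debugger control; pdb stops before the first line executes
pdb.run('buggy(5)')

# alternatively, pause at an exact spot inside your own code:
# pdb.set_trace() # execution stops here and the (Pdb) prompt appears

# a whole script can also be debugged from the operating system shell with:
# python -m pdb my_script.py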
Debugging Session
Debugging in Python is usually a repetitive process: you write code, run it,
find that it does not work, bring in the debugging tools, fix the errors, and
repeat the cycle again and again. Because a debugging session tends to
reuse the same techniques, there are some key points worth noting. The
sequence below streamlines the process and minimizes the repetition seen
during program development; a short sketch of such a session follows the
list.
Set breakpoints
Run the program under the relevant debugging tool
Check variable values and compare them with what the function is
expected to produce
When everything looks correct, either resume the program or wait
for the next breakpoint and repeat if need be
When something is wrong, determine the source of the problem,
alter the offending line of code, and begin the process once more
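A minimal sketch of that cycle, assuming a hypothetical function average that fails on an empty list:
In []: import pdb

def average(values):
    return sum(values) / len(values) # raises ZeroDivisionError when values is empty

pdb.set_trace() # step 1-2: breakpoint; the (Pdb) prompt appears here
result = average([]) # step into this with 's', then inspect the argument with 'p values'
# step 3-5: compare what you see with what you expect, continue with 'c' when it
# looks right, or fix the offending line and run the whole program again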
Ask Question
If you know developers who use Python or other platforms, ask them
questions about debugging, since they are likely to use these tools heavily.
If you are just beginning and have no one to ask, go online and find forums,
of which there are many today. Interact with them by seeking answers to
your debugging problems, and experiment with the programs you create
while using the debugger tools. Avoid making assumptions about any part
of your Python program, especially when debugging, as assumptions can
lead to failures in program development.
Be Clever
When you create programs and eliminate errors with the help of debuggers,
the outcome can feel exciting, even overwhelming. Stay level-headed,
though, and keep an eye on your current work as well as your future
projects. Successfully building a realistic and useful program does not mean
you will never fail again. Remaining in control will prepare you to use
Python's debugging tools wisely and build on your accomplishments.
Chapter 12 - Neural Network and What to Use
for?
Regular deep neural networks commonly receive a single vector as an input
and then transform it through a series of multiple hidden layers. Every
hidden layer in regular deep neural networks, in fact, is made up of a
collection of neurons in which every neuron is fully connected to all
contained neurons from the previous layers. In addition, all neurons
contained in a deep neural network are completely independent as they do
not share any relations or connections.
The last fully-connected layer in regular deep neural networks is called the
output layer and in every classification setting, this output layer represents
the overall class score.
Due to these properties, regular deep neural nets do not scale well to full
images. For instance, in CIFAR-10, every image has a size of 32x32x3,
meaning each image has 3 color channels and is 32 pixels wide and 32
pixels high. A single fully-connected neuron in the first hidden layer of a
regular neural net would therefore have 32x32x3 = 3,072 weights. That
number is still manageable for such a small image, but fully-connected
structures clearly do not scale to larger images.
In addition, you would almost certainly want more than one such neuron, so
the parameters add up quickly. For computer vision and similar problems,
this full connectivity is wasteful, and the huge number of parameters would
lead to over-fitting of your model very quickly. Convolutional neural
networks therefore take advantage of the fact that their inputs consist of
images when solving these kinds of deep learning problems.
Due to their structure, convolutional neural networks constrain the
architecture of images in a much more sensible way. Unlike a regular deep
neural network, the layers contained in the convolutional neural network are
comprised of neurons that are arranged in three dimensions including depth,
height, and width. For instance, the CIFAR-10 input images form the input
volume of activations for the first layer, and this volume has dimensions of
32x32x3 (width, height, and depth).
The neurons in such a layer are connected to only a small region of the
layer before it, instead of being fully connected as in regular deep neural
networks. In addition, the final output layer for CIFAR-10 would have
dimensions of 1x1x10, because by the end of the convolutional network
architecture the full image has been reduced to a single vector of class
scores arranged along the depth dimension.
To summarize, unlike a regular three-layer deep neural network, a ConvNet
arranges its neurons in three dimensions. Each layer in a convolutional
neural network transforms a 3D input volume into a 3D output volume of
neuron activations.
Every layer of a convolutional neural network has a simple API: it
transforms an input 3D volume into an output 3D volume with a
differentiable function that may or may not have parameters.
A convolutional neural network is composed of several convolutional and
subsampling layers that are at times followed by fully-connected (dense)
layers. As you already know, the input of a convolutional neural network is
an n x n x r image, where n represents the height and width of the input
image and r is the number of channels present. A convolutional layer
additionally has k filters, known as kernels, of size m x m x q, where m is
smaller than the image dimension and q can be equal to, or smaller than, the
number of channels r.
Each feature map is then subsampled with max or mean pooling over a
p x p contiguous region, where p commonly ranges from 2 for small images
up to 5 or more for larger images. Either before or after the subsampling
layer, an additive bias and a sigmoidal non-linearity are applied to every
feature map. After these convolutional layers there may be several
fully-connected layers, whose structure is the same as that of standard
multilayer neural networks.
Parameter Sharing
You can use a parameter-sharing scheme in your convolutional layers to
drastically control the number of parameters. If you denote a single two-
dimensional slice of the depth as a depth slice, you can constrain the
neurons in each depth slice to use the same weights and bias. With
parameter sharing, you get one unique collection of weights per depth slice
rather than one per neuron, which significantly reduces the number of
parameters in the first layer of your ConvNet. With this step, all neurons in
a given depth slice of your ConvNet use the same parameters.
In other words, during backpropagation every neuron in the volume will
compute the gradient for its weights as usual. However, these gradients are
added up across each depth slice, so only a single collection of weights per
depth slice is updated. Note that all neurons in one depth slice use the exact
same weight vector. Therefore, the forward pass of the convolutional layer
in each depth slice can be computed as a convolution of the neurons’
weights with the input volume. This is why we refer to the resulting
collection of weights as a kernel or a filter, which is convolved with the
input.
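A minimal sketch of what this sharing saves, using the 227x227x3 input with 96 filters of size 11x11x3 at a stride of 4 that also appears in the matrix multiplication example below:
In []: # output spatial size: (227 - 11) / 4 + 1 = 55, with 96 depth slices
neurons = 55 * 55 * 96 # 290,400 neurons in the layer
weights_per_neuron = 11 * 11 * 3 # 363 weights (plus 1 bias) each
without_sharing = neurons * (weights_per_neuron + 1) # roughly 105 million parameters
with_sharing = 96 * (weights_per_neuron + 1) # 34,944 parameters, one weight set per depth slice
print(without_sharing, with_sharing)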
However, there are a few cases in which this parameter-sharing assumption
does not make sense. This is commonly the case when the input images to a
convolutional layer have a specific centered structure, where completely
different features should be learned depending on the location in the image.
For instance, when you have an input of several faces which have been
centered in your image, you probably expect to get different hair-specific or
eye-specific features that could be easily learned at many spatial locations.
When this is the case, it is very common to just relax this parameter sharing
scheme and simply use a locally-connected layer.
Matrix Multiplication
The convolution operation essentially performs dot products between the
filters and local regions of the input. A common implementation technique
for convolutional layers is to take advantage of this fact and formulate the
forward pass of a convolutional layer as one large matrix multiply.
The implementation starts by stretching the local regions of the input image
out into separate columns, in an operation commonly called im2col. For
instance, if you have an input of size 227x227x3 and you convolve it with
filters of size 11x11x3 at a stride of 4, you take blocks of pixels of size
11x11x3 from the input and stretch each block into a column vector of size
11*11*3 = 363. Iterating this process over the input at a stride of 4 gives 55
locations along both the width and the height, leading to an output matrix
X_col in which every column is a stretched-out receptive field, with
55*55 = 3,025 such columns in total.
Note that since the receptive fields overlap, a number in the input volume
may be duplicated across multiple distinct columns. Also remember that the
weights of the convolutional layer are similarly stretched out into rows. For
instance, if you have 96 filters of size 11x11x3, you get a matrix W_row of
size 96x363.
The result of the convolution is then equivalent to performing one large
matrix multiply of W_row with X_col, which evaluates the dot product
between every filter and every receptive field, giving the output of every
filter at every location. Once you have the result, you must reshape it back
to its proper output dimension, which in this case is 55x55x96.
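A minimal NumPy sketch of those shapes (random data stands in for the real image and filters, since the point here is only the single large matrix multiply and the reshape):
In []: import numpy as np
W_row = np.random.randn(96, 363) # each filter stretched out into a row
X_col = np.random.randn(363, 3025) # each receptive field stretched out into a column
out = W_row.dot(X_col) # one large matrix multiply, shape (96, 3025)
out = out.reshape(96, 55, 55).transpose(1, 2, 0) # reshape back to 55x55x96
# out.shape is now (55, 55, 96)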
This is a great approach, but it has a downside: it can use a lot of memory,
since values in the input volume are replicated multiple times in X_col. The
benefit is that there are many very efficient implementations of matrix
multiplication that can be leveraged. In addition, the same im2col idea can
be reused when performing the pooling operation.
Conclusion
Thank you for making it through to the end! The next step is to start putting
the information and examples that we talked about in this guidebook to
good use. There is a lot of information inside all that data that we have been
collecting for some time now. But all of that data is worthless if we are not
able to analyze it and find out what predictions and insights are in there.
This is what the process of data science is all about, and when it is combined
with the Python language, we can achieve some amazing results.
This guidebook took some time to explore data science and what it entails.
It is an in-depth and complex process, one that often includes more steps
than data scientists expect when they first get started. But if a business
wants to learn the insights that are in its data and gain a competitive edge, it
needs to be willing to take on these steps of data science and make them
work for its needs.
This guidebook went through all of the steps that you need to know in order
to get started with data science and some of the basic parts of the Python
code. We can then put all of this together in order to create the right
analytical algorithm that, once it is trained properly and tested with the right
kinds of data, will work to make predictions, provide information, and even
show us insights that were never possible before. And all that you need to
do to get this information is to use the steps that we outline and discuss in
this guidebook.
There are so many great ways that you can use the data you have been
collecting for some time now, and being able to complete the process of
data visualization will ensure that you get it all done. When you are ready to
get started with Python data science, make sure to check out this guidebook
to learn how.
Loops are next on the list of topics to explore when working with Python.
They are a great way to clean up your code: you can pack a great deal of
processing into a program without writing out every line by hand. For
example, if you wanted a program that counts out all the numbers from one
to one hundred, you would not want to write out a hundred lines of code.
Likewise, writing a multiplication table line by line would take forever. A
loop gets all of this done in just a few lines of code, saving a lot of time and
typing; a short sketch follows this paragraph.
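A minimal sketch of both ideas, counting to one hundred and printing a small multiplication table (the table size of 5 is chosen only for illustration):
In []: # count from one to one hundred with a single loop
for number in range(1, 101):
    print(number)

# a small multiplication table, here 1 through 5
for row in range(1, 6):
    for col in range(1, 6):
        print(row * col, end=' ')
    print() # start a new line after each row of the table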
You can pack a lot of work into the loops you write, and even then they
remain easy to work with. A loop tells the interpreter to execute the same
block of code over and over again until the condition you set has been
reached. This simplifies the code you are working on while still ensuring
that it behaves the way you want when executed.
As you write these loops, remember to set up the condition you want met
before you ever run the program. If you write a loop without such a
condition, it has no idea when to stop and will keep going on and on; the
code will keep cycling through the loop and can freeze your program. So,
before you execute the code, double-check that the stopping condition is in
place. A short sketch of this appears below.
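A minimal sketch of a loop with an explicit stopping condition (the limit of 5 is arbitrary); removing the counter update would create the endless loop described above:
In []: counter = 0
while counter < 5: # the stopping condition: keep looping only while counter is below 5
    print('counter is', counter)
    counter += 1 # without this update the condition never changes and the loop never stops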
As you work on these loops and create your own Python code, there are a
few loop options available to you. We will spend our time on the three main
kinds of loops that most programmers use, the ones that are the easiest and
most efficient.