A Python Data
Analyst’s Toolkit
Learn Python and Python-based
Libraries with Applications in Data
Analysis and Statistics
—
Gayathri Rajagopalan
Table of Contents

Introduction

Indexing
  Type of an index object
  Creating a custom index and using columns as indexes
  Indexes and speed of data retrieval
  Immutability of an index
  Alignment of indexes
  Set operations on indexes
Data types in Pandas
  Obtaining information about data types
Indexers and selection of subsets of data
  Understanding loc and iloc indexers
  Other (less commonly used) indexers for data access
  Boolean indexing for selecting subsets of data
  Using the query method to retrieve data
Operators in Pandas
Representing dates and times in Pandas
  Converting strings into Pandas Timestamp objects
  Extracting the components of a Timestamp object
Grouping and aggregation
  Examining the properties of the groupby object
  Filtering groups
  Transform method and groupby
  Apply method and groupby
How to combine objects in Pandas
  Append method for adding rows
  Concat function (adding rows or columns from other objects)
  Join method – index to index
  Merge method – SQL type join based on common columns

lmplot
Strip plot
Swarm plot
Catplot
Pair plot
Joint plot
Summary
Review Exercises

Index
About the Author
Gayathri Rajagopalan works for a leading Indian
multinational organization, with ten years of experience
in the software and information technology industry.
She has degrees in computer engineering and business
administration, and is a certified Project Management
Professional (PMP). Some of her key focus areas include
Python, data analytics, machine learning, statistics, and
deep learning. She is proficient in Python, Java, and C/C++
programming. Her hobbies include reading, music, and
teaching programming and data science to beginners.
About the Technical Reviewer
Manohar Swamynathan is a data science practitioner
and an avid programmer, with over 14 years of experience
in various data science related areas that include data
warehousing, Business Intelligence (BI), analytical tool
development, ad hoc analysis, predictive modeling, data
science product development, consulting, formulating
strategy, and executing analytics programs. He’s had a
career covering the life cycle of data across different
domains such as US mortgage banking, retail/ecommerce,
insurance, and industrial IoT. He has a bachelor’s degree
with a specialization in physics, mathematics, and
computers, and a master’s degree in project management. He’s currently living in
Bengaluru, the Silicon Valley of India.
Acknowledgments
This book is a culmination of a year-long effort and would not have been possible
without my family’s support. I am indebted to them for their patience, kindness, and
encouragement.
I would also like to thank my readers for investing their time and money in this book. It is
my sincere hope that this book adds value to your learning experience.
Introduction
I had two main reasons for writing this book. When I first started learning data science,
I could not find a centralized overview of all the important topics on this subject.
A practitioner of data science needs to be proficient in at least one programming
language, learn the various aspects of data preparation and visualization, and also
be conversant with various aspects of statistics. The goal of this book is to provide
a consolidated resource that ties these interconnected disciplines together and
introduces these topics to the learner in a graded manner. Secondly, I wanted to provide
material to help readers appreciate the practical aspects of the seemingly abstract
concepts in data science, and also help them to be able to retain what they have learned.
There is a section on case studies to demonstrate how data analysis skills can be applied
to make informed decisions to solve real-world challenges. One of the highlights of
this book is the inclusion of practice questions and multiple-choice questions to help
readers practice and apply whatever they have learned. Most readers read a book and
then forget what they have read or learned, and the addition of these exercises will help
readers avoid this pitfall.
The book helps readers learn three important topics from scratch – the Python
programming language, data analysis, and statistics. It is a self-contained introduction
for anybody looking to start their journey with data analysis using Python, as it focuses
not just on theory and concepts but on practical applications and retention of concepts.
This book is meant for anybody interested in learning Python and Python-based libraries
like Pandas, NumPy, SciPy, and Matplotlib for descriptive data analysis, visualization,
and statistics. The broad categories of skills that readers learn from this book include
programming skills, analytical skills, and problem-solving skills.
The book is broadly divided into three parts – programming with Python, data analysis
and visualization, and statistics. The first part of the book comprises three chapters. It
starts with an introduction to Python – the syntax, functions, conditional statements,
data types, and different types of containers. Subsequently, we deal with advanced
concepts like regular expressions, handling of files, and solving mathematical problems
with Python. Python is covered in detail before moving on to data analysis to ensure that
the readers are comfortable with the programming language before they learn how to
use it for purposes of data analysis.
The second part of the book, comprising five chapters, covers the various aspects of
descriptive data analysis, data wrangling and visualization, and the respective Python
libraries used for each of these. There is an introductory chapter covering basic concepts
and terminology in data analysis, and one chapter each on NumPy (the scientific
computation library), Pandas (the data wrangling library), and the visualization
libraries (Matplotlib and Seaborn). A separate chapter is devoted to case studies to
help readers understand some real-world applications of data analysis. Among these
case studies is one on air pollution, using data drawn from an air quality monitoring
station in New Delhi, which has seen alarming levels of pollution in recent years. This
case study examines the trends and patterns of major air pollutants like sulfur dioxide,
nitrogen dioxide, and particulate matter for five years, and comes up with insights and
recommendations that would help with designing mitigation strategies.
The third section of this book focuses on statistics, elucidating important principles in
statistics that are relevant to data science. The topics covered include probability, Bayes'
theorem, permutations and combinations, hypothesis testing (ANOVA, chi-squared
test, z-test, and t-test), and the use of various functions in the SciPy library to
simplify the tedious calculations involved in statistics.
By the end of this book, the reader will be able to confidently write code in Python, use
various Python libraries and functions for analyzing any dataset, and understand basic
statistical concepts and tests. The code is presented in the form of Jupyter notebooks
that can further be adapted and extended. Readers get the opportunity to test their
understanding with a combination of multiple-choice and coding questions. They
also get an idea about how to use the skills and knowledge they have learned to make
evidence-based decisions for solving real-world problems with the help of case studies.
CHAPTER 1
Getting Familiar
with Python
Python is an open source programming language created by the Dutch programmer
Guido van Rossum. Named after the British comedy group Monty Python, it is a
high-level, interpreted language and one of the most sought-after and rapidly growing
programming languages in the world today. It is also the language of preference for
data science and machine learning.
In this chapter, we first introduce the Jupyter notebook – a web application for running
code in Python. We then cover the basic concepts in Python, including data types,
operators, containers, functions, classes, file handling, exception handling, standards
for writing code, and modules.
The code examples for this book have been written using Python version 3.7.3 and
Anaconda version 4.7.10.
Technical requirements
Anaconda is an open source platform used widely by Python programmers and data
scientists. Installing this platform installs Python, the Jupyter notebook application, and
hundreds of libraries. The following are the steps you need to follow for installing the
Anaconda distribution.
2. Click the installer for your operating system, as shown in Figure 1-1.
The installer gets downloaded to your system.
© Gayathri Rajagopalan 2021
G. Rajagopalan, A Python Data Analyst’s Toolkit, https://doi.org/10.1007/978-1-4842-6399-0_1
3. Open the installer (file downloaded in the previous step) and run it.
Follow these steps to download all the data files used in this book:
Now that we have installed and launched Jupyter, let us understand how to use this
application in the next section.
JupyterLab is the IDE for Jupyter notebooks. Jupyter notebooks are web applications that
run locally on a user’s machine. They can be used for loading, cleaning, analyzing, and
modeling data. You can add code, equations, images, and markdown text in a Jupyter
notebook. Jupyter notebooks serve the dual purpose of running your code as well as
serving as a platform for presenting and sharing your work with others. Let us look at the
various features of this application.
Type “jupyter notebook” in the search bar next to the start menu.
This will open the Jupyter dashboard. The dashboard can be used
to create new notebooks or open an existing one.
Click inside the first cell in your notebook and type a simple line
of code, as shown in Figure 1-4. Execute the code by selecting Run
Cells from the “Cell” menu, or use the shortcut keys Ctrl+Enter.
5. Renaming a notebook
Click the default name of the notebook and type a new name, as
shown in Figure 1-6.
Table 1-1 gives some of the familiar icons found in Jupyter notebooks, the corresponding
menu functions, and the keyboard shortcuts.
Function: Adding a new cell to a Jupyter notebook
Keyboard shortcut: Esc+b (adding a cell below the current cell), or Esc+a (adding a cell above the current cell)
Menu function: Insert ➤ Insert Cell Above or Insert ➤ Insert Cell Below

Function: Running a given cell
Keyboard shortcut: Ctrl+Enter (to run selected cell); Shift+Enter (to run selected cell and insert a new cell)
Menu function: Cell ➤ Run Selected Cells
If you are not sure about which keyboard shortcut to use, go to: Help ➤ Keyboard
Shortcuts, as shown in Figure 1-8.
• Shift+Enter to run the code in the current cell and move to the next
cell.
Tab Completion
Tab completion is a Jupyter notebook feature that completes the code you are typing.
It speeds up your workflow and reduces typos, and it saves you from having to
remember the exact names of all the modules and functions.
For example, if you want to import the Matplotlib library but don’t remember the
spelling, you could type the first three letters, mat, and press Tab. You would see a drop-
down list, as shown in Figure 1-9. The correct name of the library is the second name in
the drop-down list.
One commonly used magic command, shown in the following, is used to display
Matplotlib graphs inside the notebook. Adding this magic command avoids the need
to call the plt.show function separately for showing graphs (the Matplotlib library is
discussed in detail in Chapter 7).
CODE:
%matplotlib inline
Magic commands, like timeit, can also be used to time the execution of a script, as shown
in the following.
CODE:
%%timeit
for i in range(100000):
i*i
Output:
16.1 ms ± 283 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
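The %%timeit magic is available only inside IPython or Jupyter. Outside a notebook, a
roughly equivalent measurement can be sketched with the timeit module from the standard
library. The function name square_all and the number of runs are illustrative assumptions,
and the reported time will differ from machine to machine.

```python
import timeit


def square_all(n=100_000):
    """Square every integer below n and discard the results."""
    for i in range(n):
        i * i


# Call the function 10 times and report the average time per call in milliseconds.
runs = 10
total_seconds = timeit.timeit(square_all, number=runs)
print("{:.2f} ms per run".format(total_seconds / runs * 1000))
```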
Now that you understand the basics of using Jupyter notebooks, let us get started with
Python and understand the core aspects of this language.
Python Basics
In this section, we get familiar with the syntax of Python, commenting, conditional
statements, loops, and functions.
Comments
A comment explains what a line of code does, and is used by programmers to help others
understand the code they have written. In Python, a comment starts with the # symbol.
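As a minimal sketch, the following snippet shows a comment on its own line and a comment
placed after a statement. The variable names and values are made up for illustration.

```python
# A comment on its own line: compute the area of a rectangle.
length = 5
width = 3
area = length * width  # a comment can also follow a statement on the same line
print(area)
```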
Proper spacing and indentation are critical in Python. While other languages like Java
and C++ use braces to enclose blocks of code, Python uses indentation, conventionally
four spaces per level, to delimit code blocks. Inconsistent indentation causes errors, so
it needs care. Applications like Jupyter generally take care of indentation and
automatically add four spaces at the beginning of a block of code.
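The role of indentation can be sketched with a small conditional. The temperature value
and the labels are hypothetical and chosen only to show which lines belong to which block.

```python
temperature = 31  # hypothetical value chosen for illustration

if temperature > 30:
    # These two lines share the same four-space indent, so they form one block.
    label = "hot"
    print("It is", label)
else:
    label = "mild"
    print("It is", label)

# This line is not indented, so it runs regardless of the branch taken.
print("Done")
```

Removing the indent from one of the lines inside the if block would raise an
IndentationError, or silently change which statements the condition controls.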
Printing
The print function prints content to the screen or any other output device.
CODE:
print("Hello!")
To print a string that spans multiple lines, we use triple quotes at the beginning and
end of the string, for example:
CODE:
Output:
Note that we do not use semicolons in Python to end statements, unlike some other
languages.
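A triple-quoted string can be sketched as follows. The message text is an illustrative
assumption; any text placed between the triple quotes keeps its line breaks when printed.

```python
# Line breaks inside the triple quotes are part of the string.
message = """Hello!
Welcome to Python.
This string spans three lines."""
print(message)
```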
The format method can be used in conjunction with the print function for embedding
variables within a string. It uses curly braces as placeholders for variables that are passed
as arguments to the method.
Let us look at a simple example where we print variables using the format method.
9
Chapter 1 Getting Familiar with Python
CODE:
weight=4.5
name="Simi"
print("The weight of {} is {}".format(name,weight))
Output:
The weight of Simi is 4.5
The preceding statement can also be rewritten as follows without the format method:
CODE:
Note that only the string portion of the print argument is enclosed within quotes. The name
of the variable does not come within quotes. Similarly, if you have any constants in your
print arguments, they also do not come within quotes. In the following example, a Boolean
constant (True), an integer constant (1), and strings are combined in a print statement.
CODE:
Output:
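As a hedged sketch of the points above, the print function can take several
comma-separated arguments, with only the string portions in quotes. The values reuse the
earlier illustrative variables; the wording of the second message is an assumption.

```python
name = "Simi"
weight = 4.5

# Strings are quoted; variable names and constants are passed without quotes.
print("The weight of", name, "is", weight)

# A Boolean constant (True) and an integer constant (1) combined with strings.
print("Is this correct?", True, "Number of records:", 1)
```

By default, print separates its arguments with a single space.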
The format fields can specify precision for floating-point numbers. Floating-point
numbers are numbers with decimal points, and the number of digits after the decimal
point can be specified using format fields as follows.
CODE:
x=91.234566
print("The value of x up to 3 decimal points is {:.3f}".format(x))
Output:
The value of x up to 3 decimal points is 91.235
We can specify the position of the variables passed to the method. In this example, we
use position “1” to refer to the second object in the argument list, and position “0” to
specify the first object in the argument list.
CODE:
y='Jack'
x='Jill'
print("{1} and {0} went up the hill to fetch a pail of water".format(x,y))
Output:
Jack and Jill went up the hill to fetch a pail of water
Input
The input function accepts input from the user. The input provided by the user is always
returned as a string (type str). If you want to do any mathematical calculations with
numeric input, you need to convert it to int or float, as follows.
CODE:
Output:
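The conversion can be sketched as shown here. So that the snippet runs unattended, a
fixed string stands in for the value the input function would return; the variable names
and values are illustrative assumptions.

```python
# In a notebook you would write: raw_age = input("Enter your age: ")
# A fixed string stands in for the user's reply in this sketch.
raw_age = "25"
print(type(raw_age))      # the value is a str, even though it looks numeric

age = int(raw_age)        # convert to an integer before doing arithmetic
print(age + 5)

height = float("1.68")    # use float for values with a decimal point
print(type(height))
```

Passing text that is not a valid number, such as int("abc"), raises a ValueError.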
Variables and Constants
A constant or a literal is a value that does not change, while a variable contains a value
that can be changed. Unlike languages like Java and C/C++, Python does not require us to
declare a variable, that is, to specify its data type. We define a variable by giving it a
name and assigning it a value. Based on the value, a data type is automatically assigned
to it. Values are stored in variables using the assignment operator (=). The rules for
naming a variable in Python are as follows:
• a variable name cannot have spaces
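The automatic assignment of data types can be sketched with the built-in type function.
The names count and unit_price are hypothetical.

```python
count = 10            # an int is assigned, so count refers to an int
print(type(count))

count = "ten"         # the same name can later refer to a str
print(type(count))

unit_price = 2.5      # an underscore, not a space, separates the words in a name
print(type(unit_price))
```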
Operators
The following are some commonly used operators in Python.
Arithmetic operators: Take two integer or float values, perform an operation, and return
a value.
• ** (exponent)
• % (modulo or remainder)
• // (quotient)
• * (multiplication)
• - (subtraction)
• + (addition)
CODE:
(1+9)/2-3
Output:
2.0
In the preceding expression, the operation inside the parentheses is performed first,
which gives 10, followed by division, which gives 5.0, and then subtraction, which gives the
final output of 2.0.
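The arithmetic operators can be tried out on a pair of small integers. The operands 7 and
2 are arbitrary values chosen for illustration.

```python
a = 7
b = 2

print(a + b)    # addition
print(a - b)    # subtraction
print(a * b)    # multiplication
print(a / b)    # division; the result is always a float
print(a // b)   # quotient, with the fractional part discarded
print(a % b)    # remainder
print(a ** b)   # exponent
```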
Comparison operators: These operators compare two values and evaluate to a true or
false value. The following comparison operators are supported in Python:
• >: Greater than
• <: Less than
• <=: Less than or equal to
• >=: Greater than or equal to
• ==: Equal to. Please note that this is different from the assignment
operator (=)
• !=: Not equal to
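A short sketch of these comparisons, using two arbitrary integers, shows that each
expression evaluates to either True or False.

```python
x = 5
y = 8

print(x > y)
print(x < y)
print(x <= 5)
print(x >= y)
print(x == y)   # a comparison; writing x = y would instead assign y's value to x
print(x != y)
```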
Logical (or Boolean) operators: These operators are similar to comparison operators in
that they also evaluate to a true or false value. They operate on Boolean variables or
expressions. The following logical operators are supported in Python:
Output:
False
CODE:
(2>1) or (1>3)
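Python's logical operators are and, or, and not. The sketch below applies them to two
hypothetical Boolean variables and to a pair of comparisons.

```python
has_ticket = True     # hypothetical Boolean variables
is_late = False

print(has_ticket and not is_late)   # and: True only if both sides are True
print(has_ticket or is_late)        # or: True if at least one side is True
print(not has_ticket)               # not: flips the Boolean value
print((2 > 1) and (1 > 3))          # comparisons can be combined in the same way
```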