Thinking in Pandas: How to Use the Python Data Analysis Library the Right Way (2020)
Hannah Stepanek
Portland, OR, USA
Table of Contents
Introduction
Chapter 1: Introduction
    About pandas
    How pandas helped build an image of a black hole
    How pandas helps financial institutions make more informed predictions about the future market
    How pandas helps improve discoverability of content
Chapter 7: Groupby
    Using groupby correctly
    Indexing
    Avoiding groupby
Index
About the Author
Hannah Stepanek is a software developer
with a passion for performance and an
open source advocate. She has over seven
years of industry experience programming
in Python and spent about two of those years
implementing a data analysis project using
pandas.
Hannah was born and raised in Corvallis,
OR, and graduated from Oregon State
University with a major in Electrical Computer Engineering. She enjoys
engaging with the software community, often giving talks at local meetups
as well as larger conferences. In early 2019, she spoke at PyCon US about
the pandas library and at OpenCon Cascadia about the benefits of open
source software. In her spare time, she enjoys riding her horse Sophie and
playing board games.
About the Technical Reviewer
Jaidev Deshpande is a senior data scientist
at Gramener, where he works on automating
insight generation from data. He has a decade
of experience in delivering machine learning
solutions with the scientific Python stack.
His research interests lie at the intersection of
machine learning and signal processing.
Introduction
Using the pandas Python library requires a shift in thinking that is not
always intuitive for those who use it. For beginners, pandas’ rich API can
often be overwhelming and unclear when determining which solution is
optimal. This book aims to give you an intuition for using pandas correctly
by explaining how its operations work underneath. We will establish a
foundation of knowledge covering information such as Python and NumPy
data structures, computer architecture, and performance differences
between Python and C. With this foundation, we will then be able to
explain why certain pandas operations perform the way they do under
certain circumstances. We’ll learn when to use certain operations and
when to use a more performant alternative. And near the end we’ll cover
what improvements can be and are being made to make pandas even more
performant.
CHAPTER 1
Introduction
We live in a world full of data. In fact, there is so much data that it’s
nearly impossible to comprehend it all. We rely more heavily than ever
on computers to assist us in making sense of this massive amount of
information. Whether it’s data discovery via search engines, presentation
via graphical user interfaces, or aggregation via algorithms, we use
software to process, extract, and present the data in ways that make sense
to us. pandas has become an increasingly popular package for working
with big data sets. Whether it’s analyzing large amounts of data, presenting
it, or normalizing it and re-storing it, pandas has a wide range of features
that support big data needs. While pandas is not the most performant
option available, it’s written in Python, so it’s easy for beginners to learn,
quick to write, and has a rich API.
About pandas
pandas is the go-to package for working with big data sets in Python. It’s
made for working with data sets generally below or around 1 GB in size,
but really this limit varies depending on the memory constraints of the
device you run it on. A good rule of thumb is have at least five to ten times
the amount of memory on the device as your data set. Once the data set
starts to exceed the single-digit gigabyte range, it’s generally recommended
to use a different library such as Vaex.
The name pandas came from the term panel data referring to tabular
data. The idea is that you can make panels out of a larger panel of the data,
as shown in Figure 1-1.
It's easy to forget that 50 years ago computers took up whole rooms and took several
seconds just to add two numbers together. A lot of programs are simply
fast enough and still meet performance requirements even though they
are not written in the most optimal way. Compute resources for big data
processing take up a significant amount of energy compared to a simple
web service; they require large amounts of memory and CPU, often
requiring large machines to run at their resource limits over multiple
hours. These programs are taxing on the hardware, potentially resulting
in faster aging, and require a large amount of energy both to keep the
machines cool and also to keep the computation running. As developers
we have a responsibility to write efficient programs, not just because
they are faster and cost less but also because they will reduce compute
resources which means less electricity, less hardware, and in general more
sustainability.
It is the goal of this book in the coming chapters to assist developers in
implementing performant pandas programs and to help them develop an
intuition for choosing efficient data processing techniques. Before we deep
dive into the underlying data structures that pandas is built on, let’s take a
look at how some existing impactful projects utilize pandas.
each telescope could act as more than one mirror, filling in a significant
portion of the theoretical larger telescope image. Figure 1-2 demonstrates
this technique. These pieces of the larger theoretical image were then passed
through several different image prediction algorithms trained to recognize
different types of images. The idea was if each of these different image
reproduction techniques outputs the same image, then they could be confident
that the image of the black hole was the real image (or reasonably close).
1. https://github.com/achael/eht-imaging
2. https://solarsystem.nasa.gov/resources/2319/first-image-of-a-black-hole/
a bubble (i.e., recommended content isn’t just the same type of content
they’ve been watching before or presenting the same opinions). Often this is
done by avoiding content silos from the business side.
Now that we’ve looked at some interesting use cases for pandas, in
Chapter 2 we’ll take a look at how to use pandas to access and merge data.
CHAPTER 2
Basic Data Access and Merging
name balance
0 Bob 123
1 Mary 3972
2 Mita 7209
>> account_info.iloc[0:2]
name account balance
0 Bob 123846 123
1 Mary 123972 3972
>> account_info.iloc[:]
name account balance
0 Bob 123846 123
1 Mary 123972 3972
2 Mita 347209 7209
iloc also accepts a Boolean array. In Listing 2-5, we grab all odd rows
by taking the modulus of each row index and converting it to a Boolean.
iloc also accepts a function; however, this function is called once with
the entire DataFrame, and there’s little difference between passing it in
and simply calling the function beforehand so we won’t go over that here.
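The Boolean-array form described in Listing 2-5 might look roughly like the following sketch, reusing the account_info example data from above (the exact listing code may differ):

import pandas as pd

account_info = pd.DataFrame({
    "name": ["Bob", "Mary", "Mita"],
    "account": [123846, 123972, 347209],
    "balance": [123, 3972, 7209],
})
# Take the modulus of each row position and turn it into a Boolean
# mask that is True only for the odd row positions.
odd_rows = account_info.index.values % 2 == 1
>> account_info.iloc[odd_rows]
name account balance
1 Mary 123972 3972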
iloc can come in quite handy when working with multi-indexed and
multi-level column DataFrames since levels are integer values. Let’s
review an example and break it down. Here we specify the rows we want
to grab as “:” meaning we want all rows, and we use a Boolean array to
specify the columns. We grab the values for the multi-level column “data”
which are [“score”, “date”, “score”, “date”] and then create a Boolean array
by specifying that the value must equal “score”. This is broken down into
stages in Listing 2-6 so it is easier to follow.
>> score_columns = (
restaurant_inspections.columns.get_level_values("data")
== "score")
>> score_columns
[True, False, True, False]
>> restaurant_inspections.iloc[:, score_columns]
inspection 0 1
data score score
restaurant location
Diner (4, 2) 90 100
Pandas (5, 4) 55 76
>> account_info.loc[
("Mary", "mj100"), pd.IndexSlice[:, "balance"]
]
0 balance 3972
1 balance 222
At the end of the dictionary syntax section, it was mentioned that the
loc method is preferred over the dictionary syntax for complex DataFrames.
Let’s look at what’s happening underneath when we use each syntax to
explain why that is. Listing 2-9 shows what each access method translates
into underneath when operating on a more complex DataFrame. Note in
the second half of Listing 2-9 where the dictionary syntax is used, the code
underneath uses the __getitem__ method and then calls __setitem__ on it.
"""
The code below is equivalent to:
account_info.__setitem__(
    (slice(None), (0, 'balance')),
    NEW_BALANCE,
)
"""
account_info.loc[:, (0, "balance")] = NEW_BALANCE

"""
The code below is equivalent to:
account_info.__getitem__(0).__setitem__('balance', NEW_BALANCE)
"""
account_info[0]["balance"] = NEW_BALANCE
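To make the practical difference concrete, here is a small sketch (the multi-level column layout is an assumption chosen to mirror the account_info example above): the chained form may write to a temporary copy, while loc always writes to the original.

import pandas as pd

columns = pd.MultiIndex.from_tuples([(0, "name"), (0, "balance")])
account_info = pd.DataFrame(
    [["Bob", 123], ["Mary", 3972]], columns=columns,
)

# Chained access: account_info[0] can hand back a copy, so the
# following __setitem__ may update that copy instead of the original
# (pandas emits a SettingWithCopyWarning).
account_info[0]["balance"] = 0

# loc resolves the whole row/column selection in a single __setitem__,
# so the assignment always lands on the original DataFrame.
account_info.loc[:, (0, "balance")] = 0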
Quite often you may have data from multiple sources that you need
to combine into a single DataFrame. Now that you know how to do some
basic data access, we’ll look at different methods for combining data from
different DataFrames together.
Listing 2-10. Finding 1844 buildings that are still standing in 2020
using an inner merge
>> import pandas as pd
>> building_records_1844
established
building
Grande Hotel 1830
Jone's Farm 1842
Public Library 1836
Marietta House 1823
>> building_records_2020
established
building
Sam's Bakery 1962
Grande Hotel 1830
Public Library 1836
Mayberry's Factory 1924
>> cols = building_records_2020.columns.difference(
building_records_1844.columns
)
>> pd.merge(
building_records_1844,
building_records_2020[cols],
how='inner',
on=["building"],
)
established
building
Grande Hotel 1830
Public Library 1836
>> pd.merge(
gene_group1,
gene_group2,
how='outer',
on=["id"],
)
FC1 P1 FC2 P2
id
Myc 2 0.05 2 0.05
BRCA1 3 0.01 3 0.01
BRCA2 8 0.02 8 0.02
Notch1 NaN NaN 2 0.03
adds an additional column called _merge into the resulting DataFrame that
reports whether the key is present in left_only, right_only, or both DataFrames.
This comes in handy in this particular case as we wish to do a somewhat
unconventional merge. Using the query method, we are able to select rows
where the _merge value is not both and then drop the _merge column. This
can be done all in one line as shown at the end of Listing 2-13 but is broken up
into two steps beforehand so you can see how it works underneath.
>> trial_b_records
name
patient
210858 Abi
237340 May
240932 Catherine
154093 Julia
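The code for this kind of anti-join might look roughly like the following sketch; trial_a_records here is an assumed frame shaped like trial_b_records above, and the exact Listing 2-13 code may differ.

import pandas as pd

trial_a_records = pd.DataFrame(
    {"name": ["Abi", "May", "Julia", "Nancy"]},
    index=pd.Index([210858, 237340, 154093, 143325], name="patient"),
)
trial_b_records = pd.DataFrame(
    {"name": ["Abi", "May", "Catherine", "Julia"]},
    index=pd.Index([210858, 237340, 240932, 154093], name="patient"),
)

# indicator=True adds a _merge column marking each row as
# left_only, right_only, or both.
merged = pd.merge(
    trial_a_records.reset_index(),
    trial_b_records.reset_index(),
    how="outer",
    indicator=True,
)

# Keep only the patients that did not participate in both trials,
# then drop the bookkeeping column.
exclusive = merged.query('_merge != "both"').drop(columns="_merge")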
Listing 2-14 plays off of a previous inner merge example in Listing 2-10, but
unlike the previous example where the records of the buildings in common
matched, this time there are discrepancies. A join is desirable for a couple
reasons in this scenario. Firstly, the data has already been indexed according
to the unique building and join will automatically pick up the indexes and use
those to join the two sets of data. Secondly, there are discrepancies in the data,
and thus we wish to see columns from both DataFrames side by side in the
output DataFrame so we can compare them.
established established_2000
building location
Grande Hotel (4,5) 1831 1830
Public Library (6,4) 1836 1835
>> temp_county_b
temp
location
(6,4) 34.2
(0,4) 33.7
(3,8) 38.1
(1,5) 37.0
>> pd.concat(
[temp_device_a, temp_device_b],
keys=["device_a", "device_b"],
axis=1,
)
device_a device_b
temp temp
location
(4,5) 35.6 34.2
(1,2) 37.4 36.7
(6,4) 36.3 37.1
(1,7) 40.2 39.0
1. https://pandas.pydata.org/pandas-docs/version/0.25/user_guide/merging.html
CHAPTER 3
How pandas Works Under the Hood
(Figure: a dictionary's hash table is an array of slots holding (hash, key, value) entries, for example hash("a"), "a", "apple", with None marking the empty slots; a set's hash table stores only (hash, key) pairs such as hash("a"), "a".)
There are also many other data structures including integers, floats,
Booleans, and strings. These pretty much directly translate into their
C-type equivalents underneath and aren't really worth going over here.
Something that is worth mentioning, though, is that some of these have
special built-in caching in Python.
Python has a string and integer cache. Take, for example, str1 and
str2 in Listing 3-5. They are both set to the value “foo” but underneath
they are pointing at the same memory location. This means that rather
than creating a new string that is an exact copy of str1 and duplicating the
memory, the new string will simply point to the existing string value. This
is demonstrated here by the assertion line where the “is” property is used
to compare the references or pointers of the two strings for equality.
Listing 3-5. str1 and str2 are pointing to the same memory location
str1 = "foo"
str2 = "foo"
assert(str1 is str2)
Listing 3-6. str1 and str2 are not pointing to the same memory
location
str1 = "foo bar"
str2 = "foo bar"
assert(str1 is not str2)
Listing 3-7. int1 and int2 are pointing to the same memory location
but int3 and int4 are not
int1 = 22
int2 = 22
int3 = 257
int4 = 257
assert(int1 is int2)
assert(int3 is not int4)
ref1 = "foo"
ref2 = "foo"
Recall that the string cache is at play here, and because of that, both
ref1 and ref2 point to the same value underneath.
When we delete ref2, string foo’s reference count is 1, and when we
delete ref1, string foo’s reference count is 0 and the memory can be freed.
This is demonstrated in Listing 3-9.
Not all objects are freed when their references reach 0 though because
some never reach 0. Take, for example, the scenario presented in Listing 3-10
which tends to happen quite often when working with classes and objects in
Python. In this scenario, exec_info is a tuple and the value at the third index is
the traceback object. The traceback object contains a reference to the frame,
but the frame also contains a reference to the exc_info variable. This is what’s
known as a circular reference, and since there is no way to delete one without
breaking the other, these two objects must be garbage collected. Periodically
the garbage collector will run, identify, and delete circular referenced objects
like this.
import sys

try:
    raise Exception("Something went wrong.")
except Exception as e:
    exc_info = sys.exc_info()

frame = exc_info[2].tb_frame  # create a third reference
assert(sys.getrefcount(frame) == 3)
del(exc_info)
assert(sys.getrefcount(frame) == 3)
Keeping track of these references does not come for free. Each object has
an associated reference counter which takes up space, and each reference
made in the code takes up CPU cycles to compute the appropriate increment
or decrement of the object’s reference count. This is partially why, if you
compare the size of an object in Python to the size of an object in C, the sizes
are so much larger in Python and also why Python is slower to execute than C.
Part of those extra bytes and extra CPU cycles are due to the reference count
tracking. While the garbage collector does have performance implications, it
also makes Python a simple language to program in. As a developer, you don’t
have to worry about keeping track of memory allocation and deallocation; the
Python garbage collector does that for you.
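The overhead is easy to see from the interpreter itself; sys.getsizeof and sys.getrefcount expose the per-object bookkeeping described above (the exact numbers are illustrative and vary by CPython version and platform):

import sys

# A Python int is a full object carrying a reference count and a type
# pointer, so it is much larger than a 4- or 8-byte C integer.
print(sys.getsizeof(1))        # typically 28 bytes on 64-bit CPython

value = 1_000_000
print(sys.getrefcount(value))  # includes the temporary reference that
                               # getrefcount itself holds
another = value                # binding another name to the object...
print(sys.getrefcount(value))  # ...bumps the count by one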
In a multi-threaded application, reference counts have the same
problem as the total has in Figure 3-3. A thread may create a new reference
to an object in the shared memory space at the same time as another thread
and a race condition occurs where the reference count ends up only being
incremented once instead of twice. When this happens, it can ultimately
lead to the object being freed from memory before it should be (because
the race condition leads to the object’s reference count being incremented
by one instead of two). In other cases when there is a race condition on
(Figure: with a lock protecting the shared total, one thread's increment finishes before the other thread's begins, so the total moves from 5 to 6 to 7 as expected.)
This is great! We’ve solved the problem! Or have we? Consider the
scenario in Figure 3-5 where there are instead two locks and two totals.
(Figure 3-5: two threads each hold one of two locks, one guarding total1 and the other guarding total2, and each waits forever for the lock the other thread holds.)
Figure 3-5 is what’s known as deadlock. This happens when two threads
require multiple pieces of data to execute, but they request them in different
orders. In order to avoid these kinds of issues altogether, the author of
Python implemented a lock at the thread level which only allowed one
thread to run at any given time. This was a simple and elegant way to solve
this problem. At the time, since multi-core CPUs were quite uncommon,
it didn’t really impact performance since in the CPU, these threads’
instructions would be run serially anyway. However, as computers have
become more advanced and computations have become more intensive,
multi-core CPUs have become the standard in pretty much all modern
memory for the input data arrays and the output data array in C.
Listing 3-11 shows an example of how an array of floats is created in
C. Note the memory must be explicitly allocated using malloc, and it is of
fixed size 100—having only enough room for 100 floats.
import numpy as np

groups_waiting_for_a_table = np.ndarray(
    (3,),
    buffer=np.array([4, 7, 21], dtype=np.uint8),
    dtype=np.uint8,
)
1. www.youtube.com/watch?v=ObUcgEO4N8w
(Figure: the category data type stores the column as integer codes, here [0, 1, 2, 0], alongside a separate array of the unique categories ["apple", "banana", "carrot"].)
in Python and not in C. We’ll dig more into this example and the apply
function specifically in Chapter 6.
def grade(values):
    if 70 <= values["score"] < 80:
        values["score"] = "C"
    elif 80 <= values["score"] < 90:
        values["score"] = "B"
    elif 90 <= values["score"]:
        values["score"] = "A"
    else:
        values["score"] = "F"
    return values

scores = pd.DataFrame(
    {"score": [89, 70, 71, 65, 30, 93, 100, 75]}
)
scores.apply(grade, axis=1)
Since pandas is built on NumPy, it uses NumPy arrays as the building
blocks for a pandas DataFrame, which ultimately translate into ndarrays
deep down during computations.
"score": [90,100,55,60]})
>> restaurant_inspections
restaurant location date score
Diner (4, 2) 02/18 90
Diner (4, 2) 05/18 100
Pandas (5, 4) 04/18 55
Pandas (5, 4) 01/18 76
Index Blocks
restaurant Diner Diner Pandas Pandas
location (4, 2) (4, 2) (5, 4) (5, 4)
date 02/18 05/18 04/18 01/18
score 90 100 55 76
("Diner", (4,2)),
("Pandas", (5,4)),
("Pandas", (5,4)),
),
names = ["restaurant", "location"]
)
restaurant_inspections = pd.DataFrame(
{
"date": ["02/18", "05/18", "04/18", "01/18"],
"score": [90, 100, 55, 76],
},
index=restaurants,
)
>> restaurant_inspections
date score
restaurant location
Diner (4, 2) 02/18 90
05/18 100
Pandas (5, 4) 04/18 55
01/18 76
Levels       Names              Labels
restaurant   Diner, Pandas      0 0 1 1
location     (4, 2), (5, 4)     0 0 1 1

Index   Blocks
date    02/18  05/18  04/18  01/18
score   90     100    55     76
There is still a NumPy array called Levels that holds the index names;
however, instead of a simple two-dimensional NumPy array of data, the
data undergoes a form of compression. The Names is a two-dimensional
NumPy array that keeps track of the unique values within the index, and
Labels is a two-dimensional NumPy array of integers whose values are the
indexes of the unique index values in the Names NumPy array. This is the
same memory saving technique used by the pandas category data type,
and in fact, since category came later, they probably copied this technique
from the pandas multi-index.
The DataFrame in Listing 3-16 ends up being about two-thirds the size
of the single-index DataFrame in Listing 3-15 due to the data compression
incurred by the use of the multi-index. pandas is able to save memory by
using an integer type instead of another larger type to keep track of and
represent index data. This of course is advantageous when there is a lot
of repeated data in the index and less advantageous when there is little to
no repeated data in the index. This is also why it is important to normalize
the data. If, for example, there were multiple representations for the same
restaurant name (DINER, Diner, diner), we would not be able to take
advantage of the compression as we have done here. We would also not be
able to take as large of an advantage of the Python string cache either.
Similar to multi-level indexes, pandas also permits multi-level columns.
The multi-level columns are implemented the same as the multi-level
indexes with the same data compression technique. Listing 3-17 shows an
example of how to create a multi-index multi-level column DataFrame.
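A construction along the lines of Listing 3-17 might look like the following sketch (the values reuse the restaurant inspection example from this chapter; the exact listing code may differ):

import pandas as pd

restaurants = pd.MultiIndex.from_tuples(
    (("Diner", (4, 2)), ("Pandas", (5, 4))),
    names=["restaurant", "location"],
)
inspections = pd.MultiIndex.from_tuples(
    ((0, "score"), (0, "date"), (1, "score"), (1, "date")),
    names=["inspection", "data"],
)
restaurant_inspections = pd.DataFrame(
    [[90, "02/18", 100, "05/18"], [55, "04/18", 76, "01/18"]],
    index=restaurants,
    columns=inspections,
)

Both the index and the columns are built with the same MultiIndex machinery, so the same levels and labels compression applies to each.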
memory during the merge, and if the original DataFrame is very large, this
could cause a slowdown or even a memory crash if we are very close to our
max memory usage.
If instead we represent the data as a multi-index DataFrame as shown
in Listing 3-19, the data is already grouped uniquely by restaurant. This
means the groupby will be faster since the data is already grouped in the
index. It also means the DataFrame will take up less memory since, as you
recall from the previous section, the data in the index is compressed. Most
significantly, however, we don’t have to do the kind of finagling that we
had to do when using a single-index DataFrame. We are able to run the
calculation and put it back into the original DataFrame without creating a
copy which is a huge time and memory saver. The code you’ll notice is also
simpler and easier to follow. This DataFrame takes up approximately 880
bytes underneath. Recall that when we create a multi-index, the index data
is compressed, which is why this multi-index DataFrame is smaller than its
single-index counterpart.
restaurant_inspections = pd.DataFrame(
    {
        "date": ["02/18", "05/18", "02/18", "05/18"],
        "score": [90, 100, 55, 76],
    },
    index=restaurants,
)
>> restaurant_inspections
date score
restaurant location
Diner (4, 2) 02/18 90
05/18 100
Pandas (5, 4) 02/18 55
05/18 76
>> restaurant_inspections["total"] = \
restaurant_inspections["score"].groupby(
["restaurant","location"],
).count()
>> restaurant_inspections.set_index(
["total"],
append=True,
inplace=True,
)
date score
restaurant location total
Diner (4, 2) 2 02/18 90
05/18 100
Pandas (5, 4) 2 02/18 55
05/18 76
What if we take this one step further? If we make the dates the column
names, then all the scores will be on the same row and the calculation
becomes trivial. Here the unique restaurants are indexes, and the unique
inspection dates are columns. Note the score is now the only data. This
makes each row a unique restaurant, and thus the count can simply be
performed across each row. See Listing 3-20.
>> restaurant_inspections["total"] = \
restaurant_inspections.count(axis=1)
>> restaurant_inspections.set_index(
["total"],
append=True,
inplace=True,
)
date 02/18 05/18
restaurant location total
Diner (4, 2) 2 90 100
Pandas (5, 4) 2 55 76
These holes are potentially a big problem. Recall that the score data
was represented as an unsigned 8-bit integer, now because there are NaNs
in the data, the type must accommodate the NaN type size which forces
the type to be a 32-bit float. That’s four times more memory for each score.
Not only that, but now we have a bunch of gaps in our data that wasted
space and ultimately wasted memory. The fewer dates in common there
are between the restaurants, the worse this problem becomes. Multi-level
column index to the rescue! See Listing 3-22.
restaurant_inspections = pd.DataFrame(
    [[90, "02/18", 100, "05/18"], [55, "04/18", 76, "01/18"]],
    index=restaurants,
    columns=inspections,
)
>> restaurant_inspections
inspection 0 1
score date score date
restaurant location
Diner (4, 2) 90 02/18 100 05/18
Pandas (5, 4) 55 04/18 76 01/18
>> total = \
    restaurant_inspections.iloc[
        :,
        restaurant_inspections.columns.get_level_values("data") \
        == "score"
    ].count()
>> new_index = pd.DataFrame(
    total.values,
    columns=["total"],
    index=restaurant_inspections.index,
)
>> new_index.set_index("total", append=True, inplace=True)
>> restaurant_inspections.index = new_index.index
>> restaurant_inspections
inspection 0 1
score date score date
restaurant location total
Diner (4, 2) 2 90 02/18 100 05/18
Pandas (5, 4) 2 55 04/18 76 01/18
This is probably the most optimal we can get with this DataFrame
format for this particular use case. We have compressed our data as much
as possible taking advantage of both multi-level indexes and multi-level
columns and organized the DataFrame in such a way as to achieve the
fastest calculation possible. Note the main disadvantage of this particular
format is it requires a bit of finagling to get the total back onto the index,
and for that reason, this solution is less readable. If this was the solution
you were going to go with, you might consider making two custom
functions: one that puts data onto the index and another that puts data
onto the columns. These functions would improve code readability by
hiding the finer details of appending level data to the DataFrame.
Once you have decided on a DataFrame format that makes sense,
you will likely need to load your raw data into pandas, normalize it,
and convert it to that particular DataFrame format. In Chapter 4, we’ll
dive into some common pandas data loading methods and discuss the
normalization options they provide in more detail.
CHAPTER 4
Loading and Normalizing Data
Raw data comes in many forms: CSV, JSON, SQL, HTML, and so on.
pandas provides data input and output functions for loading data into a
pandas DataFrame and outputting data from a pandas DataFrame into
various common formats. In this chapter, we’ll deep dive into some of
these input functions and explore the various loading and normalization
options they provide.
The functions that load data into pandas provide a wide range
of normalization and optimization capabilities that can improve the
performance of a program, even to the point where it means the difference
between being able to load the data into pandas and running out of
memory. Each input function is different however, so it really depends
on the input/output format that you are working with and it’s always
worthwhile to check the documentation of the particular functions you are
using. Table 4-1 lists the various input and output functions that pandas
supports.
Input function    Output function
read_csv          to_csv
read_excel        to_excel
read_hdf          to_hdf
read_sql          to_sql
read_json         to_json
read_html         to_html
read_stata        to_stata
read_clipboard    to_clipboard
read_pickle       to_pickle
pd.read_csv
The pandas CSV loader pd.read_csv is the most widely used of the loaders
and by far the most complete in terms of data normalization options.
Because the Python standard library has a built-in CSV loader and the
pandas loader has some fairly fancy Pythonic options, it has two different
parsing engines: the C engine and the Python engine. As you can probably
guess by now, the C engine is more performant than the Python engine,
but depending on what options you specify, you may have no choice but
to use the Python engine for parsing. Thus, it’s advisable to be careful
which options you use and the values you provide to those options so that
you guarantee you are using the C parsing engine and get the best load
performance possible. The CSV loader has an explicit engine parameter
that lets you force the parsing engine to be Python or C. Explicitly setting
engine='c' guarantees that, rather than silently falling back to the Python
engine, pandas will raise an error when an option is not supported by the C
engine, as Listing 4-1 demonstrates.
Listing 4-1. read_csv will raise a ValueError when engine is set to ‘c’
and other settings are not compatible
>> data = io.StringIO(
"""
id,age,height,weight
129237,32,5.4,126
123083,20,6.1,145
"""
)
>> df = pd.read_csv(data, sep=None, engine='c')
ValueError: the 'c' engine does not support sep=None
with delim_whitespace=False
1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_hdf.html
The parameter usecols narrows down the list of columns to load. It’s
possible to have columns within the CSV file that you don’t care about, and thus
this can be an efficient way of eliminating them upon load as opposed to loading
all the data and removing them after. Note that usecols can also be a function
where the column name is an input and the output is a Boolean indicating
whether to include that column or discard it upon loading. A function however
is less ideal as it requires calls between the C parsing engine and the custom
function which will slow the loader down. Listing 4-3 shows an example of using
usecols to eliminate columns id and age during load.
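A sketch of that idea, reusing the small id/age/height/weight example from earlier (the exact Listing 4-3 code may differ):

import io
import pandas as pd

data = io.StringIO(
"""
id,age,height,weight
129237,32,5.4,126
123083,20,6.1,145
"""
)
# Only height and weight are parsed; id and age are discarded during
# the load instead of being dropped afterward.
df = pd.read_csv(data, usecols=["height", "weight"])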
The skiprows parameter allows you to skip certain rows in the file.
In its simplest form, it can be used to skip the first n number of rows in a
file; however, it can also be used to skip particular rows by specifying a list
of indexes to skip. It can also be a function that accepts a row index and
returns True if that row should be skipped. Note if a function is passed
here, it will have the unfortunate consequence of jumping between the
C parsing engine and the skiprows Python function which may lead to a
substantial slowdown when parsing large data sets. For this reason, it’s
recommended to keep the skiprows a simple integer or list value.
The skipfooter parameter lets you specify the number of lines at the
end of the file to skip. The documentation notes that this is unsupported
with the C parsing engine. Since the Python engine uses the Python CSV
parser, the CSV parser runs and then the last lines of the file are dropped.
This makes sense if you think about this problem a little more deeply:
How would the parser know which lines to skip without knowing how
many lines there are in the file first (which would require first parsing the
file)? This behavior can be somewhat surprising for some users when,
for example, they are actively trying to avoid lines in the file because they
break the parser and find that the parser is still trying to parse those lines
they configured the parser to skip. If you run into this situation in your own
program, nrows is a nice alternative. Listing 4-4 demonstrates an example
of running into a parsing error even though the loader was configured to
skip that line.
import pandas as pd
site_data = pd.read_csv('site1.csv')
site_data['site2'] = pd.read_csv('site2.csv', squeeze=True)
The dtype parameter allows you to specify a type for each column
in the data. If this is not specified, read_csv will attempt to infer the data
type which typically results in the inferred type being an object which is
the largest size that a data type can be. Specifying the dtype during load
can be a huge performance improvement, but that also means you have
to have some knowledge at load time about the columns in the data set. If
you don’t know exactly what to expect until you look at the data, you might
consider loading the header of the data first or the first couple rows using
nrows, identifying the column types, and then loading the whole data file
with the appropriate types specified.
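That two-pass approach might look roughly like this sketch ('data.csv' and its column names are assumptions):

import numpy as np
import pandas as pd

# First pass: peek at a handful of rows to learn the column names and
# the rough types pandas infers for them.
sample = pd.read_csv('data.csv', nrows=5)
print(sample.dtypes)

# Second pass: load the whole file with explicit, smaller types.
df = pd.read_csv(
    'data.csv',
    dtype={'age': np.int8, 'height': np.float16, 'weight': np.int16},
    index_col=['id'],
)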
Index 16
age 16
height 16
weight 16
>> df.dtypes
age int64
height float64
weight int64
>> df.index.dtype
dtype('int64')
>> df = pd.read_csv(
    data,
    dtype={
        'age': np.int8,
        'height': np.float16,
        'weight': np.int16},
    index_col=[0],
)
age height weight
id
129237 32 5.398438 126
123083 20 6.101562 145
>> df.memory_usage(deep=True)
Index 16
age 2
height 4
weight 4
>> df.dtypes
age int8
height float16
weight int16
>> df.index.dtype
dtype('int64')
import pandas as pd
data = io.StringIO(
"""
id,age,height,weight,med
129237,32,5.4,126,bta
123083,20,6.1,145,aftg
"""
)
>> treatments = pd.read_csv(
data,
converters={'med': medication_converter},
)
id age height weight med
129237 32 5.4 126 bta
123083 20 6.1 145 atg
The nrows parameter allows you to specify the number of rows to read
from the file. Something that may be unintuitive here is that nrows doesn’t
actually skip reading the rows when using the Python parsing engine. This
is because the Python parsing engine reads the whole file first. This means
that if there are lines after the number of rows you intended to read from
the file that result in parsing errors, when running with the Python parsing
engine, you will not be able to avoid them by using nrows. Since the Python
parsing engine reads the whole file first, it will still throw a parsing error on
those lines, even though you told the CSV loader not to read those rows. So,
this is yet another reason to avoid the Python parsing engine, particularly
when using this setting. Note that skipfooter, on the other hand, even in the
C parsing engine does in fact read the footer row. This is simply because in
order to identify it as the footer of the file, it has to read it and reach the end
of the file to identify it as the footer. Listing 4-10 shows an example of how to
avoid lines that would otherwise cause parsing errors using nrows and the C
parsing engine.
import pandas as pd
data = io.StringIO(
"""
student_id, grade
1045,"a"
2391,"b"
8723,"c"
1092,"a"
"""
)
grades = pd.read_csv(
data,
nrows=3,
)
The nrows parameter in combination with skiprows and header can also
be useful for reading a file into memory in pieces, processing it, and then
reading the next chunk. This is particularly useful with huge sets of data
that you may otherwise be unable to read all at once due to memory
constraints. Listing 4-11 shows an example of this. Note process is a function
that is wrapping the read_csv function. It takes the loaded data from
read_csv and does some processing on it to reduce the memory footprint
and/or normalize it beyond the capabilities of read_csv and returns it to be
concatenated with the rest of the data. In Listing 4-11, we load the first 1000
rows, process them, and use those first 1000 rows to initialize data. Then we
continue reading in rows, processing 1000 at a time until we read in less than
1000 rows. Once we read in less than 1000 rows, we know we’ve read the
entire file and exit the loop.
ROWS_PER_CHUNK = 1000
data = process(pd.read_csv(
    'data.csv',
    nrows=ROWS_PER_CHUNK,
))
read_rows = len(data)
chunk = 1
while chunk * ROWS_PER_CHUNK == read_rows:
    chunk_data = process(pd.read_csv(
        'data.csv',
        skiprows=chunk * ROWS_PER_CHUNK,
        nrows=ROWS_PER_CHUNK,
        header=None,
        names=data.columns,
    ))
    read_rows += len(chunk_data)
    data = data.append(chunk_data, ignore_index=True)
    chunk += 1
memory footprint, meaning while you aren’t able to read all the data into
memory all at once using read_csv because the resulting DataFrame would
be too large, you are able to read it in a chunk at a time and reduce the
memory footprint on each chunk such that the resulting DataFrame will fit
in memory. Note using iterator and chunksize is a better alternative if you
are reading the whole file chunks at a time than using nrows and skiprows
as it keeps the file open at the correct location instead of constantly
re-opening it and scrolling to the next position. Listing 4-12 shows an
example of this.
import pandas as pd

ROWS_PER_CHUNK = 1000
data = pd.DataFrame({})
reader = pd.read_csv(
    'data.csv',
    chunksize=ROWS_PER_CHUNK,
    iterator=True,
)
for data_chunk in reader:
    processed_data_chunk = process(data_chunk)
    data = data.append(processed_data_chunk)
waiting for the next chunk of the file to be loaded into memory. Generally,
accessing memory mapped files is faster because the memory is local to the
program and the memory mapped is already in the page cache so there is
no need to load it on the fly. In practice, memory mapping the file generally
doesn’t provide much of a performance advantage in the typical use case
of loading a file serially from beginning to end. If you are experiencing a
lot of cache misses, meaning the file data that would normally be loaded
into cache (memory closer to the CPU) is not present and must be loaded
from main memory, this may hold a performance improvement. Cache
misses may happen if other programs are running concurrently which add
their memory into the cache and consequently knock your file data out of
the cache. See Chapter 8 for a more detailed explanation of the memory
hierarchy and cache misses. This might also hold a performance advantage
if you are reading this file many times over the course of your program or
your program runs periodically and you don’t want to keep having to load
the same file into memory each time it runs. So, while this feature sounds
like it can provide you with a substantial speedup, the reality is unless you
are working outside of the standard read a file from start to finish workflow,
it’s unlikely to do so.
The na_values parameter allows you to specify values to interpret as
Not a Number, also known as NaNs. This type comes from NumPy which,
if you recall from Chapter 2, is a dependency of pandas. It’s commonly
used in NumPy as a placeholder for a value resulting from a computation
that is invalid such as divide by 0. Note by default pandas interprets any
string Nan or nan as a NaN type automatically. This automatic conversion
may be problematic if you are working with data where Nan or nan may
actually be a valid name, for example. This is where keep_default_na
comes in handy. Setting the parameter keep_default_na to False turns off
pandas automatic interpretation of certain values to NaNs. For a complete
list of values that pandas automatically converts to the NaN type, see the
Appendix.
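For example, in a data set where "Nan" is a legitimate name, the loader might be configured like this sketch (the columns here are illustrative):

import io
import pandas as pd

data = io.StringIO(
"""
id,name,score
1,Nan,78
2,Bob,
"""
)
# keep_default_na=False stops the name "Nan" from being read as a
# missing value; na_values then opts back in to treating empty fields
# as NaN.
df = pd.read_csv(
    data,
    keep_default_na=False,
    na_values=[""],
)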
The parameter na_filter when set to False disables checking for NaNs
altogether, and the documentation notes this can lead to a performance
improvement when you know for certain there are no NaNs in the data.
The parameter na_values, on the other hand, lets you specify additional values
other than the default set that you would also like to be converted to NaNs.
The parameter verbose outputs the number of NaN values in each
column that contains NaNs when the Python parsing engine is used
and parsing performance metrics when the C parsing engine is used.
The pandas documentation states it outputs NaN values explicitly for
non-numeric columns. This can be somewhat deceiving however, as
the non-numeric determination is made at the time the parsing engine
runs and not based on the final type of the column in the resulting
DataFrame. Any column with a NaN in it at parsing time is considered a
non-numeric column, even if the type of that column ultimately ends up
being a numeric type (such as a float64 in the following example). The
Python parser must parse all the values in the column and convert them
appropriately to NaNs before assigning the final type. This means the NaN
values are counted during parsing before the final type of the column has
been assigned. Listing 4-13 demonstrates this behavior.
Instead of letting pandas infer the data type, let’s convert all the
placeholder values to NaNs using na_values. Although ideally we would like
them to be int16s, float16s take up the same amount of memory, and pandas
supports NaNs being stored as floats whereas it does not support them being
stored as integers during loading, so we set the dtype of the weight column
to be float16. Note if we do not specify the dtype, it will be a float64. If we
really need them to be integers, we can replace the NaNs with zeros using
fillna and convert them using astype after loading as shown in Listing 4-15.
123083 20 6.101562 NaN
123083 20 6.101562 NaN
>> df.memory_usage(deep=True)
Index 16
age 3
height 6
weight 6
>> df.dtypes
age int8
height float16
weight float16
>> df.index.dtype
dtype('int64')
>> df["weight"].fillna(0, inplace=True)
>> df["weight"] = df["weight"].astype(np.int16)
>> df
age height weight
id
129237 32 5.398438 126
123083 20 6.101562 0
123083 20 6.101562 0
>> df.memory_usage(deep=True)
Index 16
age 3
height 6
weight 6
>> df.dtypes
age int8
height float16
weight int16
>> df.dtypes
age int8
height float16
weight datetime64[ns]
>> df.index.dtype
dtype('int64')
Note Listing 4-17 assumes that there are no NaNs or placeholder values
in the column. If there are, like in Listing 4-18, na_values must be specified
to convert all the placeholder values to NaNs; otherwise, the column will
be an object rather than a datetime because the placeholder values will be
left as strings.
birth height weight
id
129237 1999-04-10 5.398438 26
123083 NaT 6.101562 150
123083 1989-11-23 6.101562 111
>> df.memory_usage(deep=True)
Index 24
birth 24
height 6
weight 6
>> df.dtypes
birth datetime64[ns]
height float16
weight int16
>> df.index.dtype
dtype('int64')
2. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
keep_date_col keeps both the combined column and the original separate
date columns if parse_dates specified that multiple date columns should
be combined together. The date_parser parameter lets you specify a date
parsing function. The documentation notes this function may be called in
several different ways ranging from calling it once on each row or passing
in all rows and columns at once. Since this function is jumping between
C and Python, it’s advantageous to call it the least amount of times as
possible. This means it’s best to implement this function to operate on all
datetime rows and columns and output an array of datetime instances.
This function could be an existing parser (the default is dateutil.parser.
parser), or it could be a custom function. This might be useful if you need
to do some special timezone handling or the data is stored in a special
datetime format. Not all countries specify the day before the month so
pandas provides a dayfirst parameter so you can specify whether the day
comes first in the dates you are parsing. The parameter cache_dates which
is enabled by default keeps a cache lookup of the converted dates, so that
if the same date appears multiple times in the data set, it does not have to
run the conversion again and can just use the cached value.
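A date_parser that works on whole arrays might look like this sketch ('patient_data.csv' and the birth column are assumptions, and the format string needs to match how the dates are actually stored):

import pandas as pd

def parse_birthdates(values):
    # to_datetime converts an entire array of strings in one call, so
    # the parser is not invoked once per row.
    return pd.to_datetime(values, format="%m/%d/%Y", cache=True)

df = pd.read_csv(
    'patient_data.csv',
    parse_dates=["birth"],
    date_parser=parse_birthdates,
)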
The parameter escapechar lets you escape certain characters. For
example, in most programming languages, a commonly used escape
character is a backslash (\) so it may be desirable to escape certain quote
characters inside of a quote with a \” or element delimiter characters with \,.
Listing 4-19 illustrates this use case. If the temperature recordings were
recorded by a country that uses commas as a decimal point delimiter and
also uses commas as a CSV element delimiter, read_csv will not be able to
parse this file with its default configuration and will raise a parsing error,
“pandas.errors.ParserError: Error tokenizing data. C error: Expected 2
fields in line 5, saw 3”. If, instead, the backslash character is used to escape
all the commas delimiting decimal places (\,), then read_csv can be
configured in such a way to correctly parse the data.
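A sketch of that configuration might look like the following; the temperature data here is made up, and the escaped commas (\,) mark decimal separators inside the values:

import io
import pandas as pd

data = io.StringIO(
r"""
location,temp
(4\,5),35\,2
(1\,2),37\,4
"""
)
# escapechar tells the parser that a comma preceded by a backslash is
# part of the value rather than a field delimiter.
temperatures = pd.read_csv(data, escapechar="\\")

# The temp values are still strings like "35,2", so swap the decimal
# comma for a point and convert them to floats.
temperatures["temp"] = (
    temperatures["temp"].str.replace(",", ".").astype(float)
)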
pd.read_json
The read_json loader parses entirely in C unlike read_csv which may use
the Python parser under certain conditions.
The parameter orient defines how the JSON format will be converted
into a pandas DataFrame. There are six different options: split, records,
index, columns, values, and table. If the JSON is formatted such that there
are columns, data, and an index already defined as keys, as is the case in
Listing 4-20, the split option should be used. It’s also worth noting that the
JSON parser is particularly picky about spacing including whitespace.
If the JSON is formatted such that each value is a row in the data with
the column names as keys, as is the case in Listing 4-21, the records option
should be used.
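A records-oriented file along the lines of Listing 4-21 might look like this sketch (the sensor readings are the same illustrative temperature data used in the next example):

import io
import pandas as pd

data = io.StringIO(
"""
[
{"id": "234unf923", "temp": 35.2},
{"id": "340inf351", "temp": 32.5},
{"id": "234abe045", "temp": 33.1}
]
"""
)
temperatures = pd.read_json(data, orient="records")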
If the JSON is formatted such that each key is the index value and the
value of each key is a dictionary of the columns and values for the row, as
is the case in Listing 4-22, the index option should be used.
)
>> temperatures = pd.read_json(
data,
orient="index",
)
temp
234unf923 35.200000
340inf351 32.500000
234abe045 33.100000
If the JSON is formatted such that each key is the column and each
value is a dictionary where the key is the index and the value is the column
value, as is the case in Listing 4-23, the columns option should be used.
temp
234unf923 35.200000
340inf351 32.500000
234abe045 33.100000
Similar to read_csv, read_json has a chunksize that lets you read the
files in chunks at a time. This only is permitted however if the lines option
is also set to True, meaning the JSON format is oriented as records without
the list brackets. Listing 4-26 demonstrates this.
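A sketch of what that might look like (the line-delimited records here are illustrative):

import io
import pandas as pd

data = io.StringIO(
    '{"id": "234unf923", "temp": 35.2}\n'
    '{"id": "340inf351", "temp": 32.5}\n'
    '{"id": "234abe045", "temp": 33.1}'
)
reader = pd.read_json(
    data,
    orient="records",
    lines=True,
    chunksize=2,
)
chunks = []
for chunk in reader:
    # Each chunk arrives as a DataFrame of at most two rows and could
    # be processed here before being combined.
    chunks.append(chunk)
temperatures = pd.concat(chunks)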
the JSON schema. Just like reading a CSV file, if the types are not specified
in the JSON file, it saves a lot of memory to provide them. Listing 4-27 shows
what the memory footprint might be of a JSON file being loaded without
types specified vs. Listing 4-28 which shows the memory footprint of the
same JSON file with types specified. Note in Listing 4-28 where types are
explicitly specified, the memory of the resulting DataFrame decreased by
about 40%.
birth height weight
129237 04/10/1999 5.4 126
123083 05/18/1989 6.1 130
>> df.dtypes
birth object
height float64
weight int64
>> df.index.dtype
dtype('int64')
>> df.memory_usage()
Index 16
birth 16
height 16
weight 16
},
}
"""
)
>> patient_info = pd.read_json(
data,
orient="columns",
convert_dates=["birth"],
dtype={"height": np.float16, "weight": np.int16},
)
birth height weight
129237 1999-04-10 5.4 126
123083 1989-05-18 6.1 130
>> df.dtypes
birth datetime64[ns]
height float16
weight int16
>> df.index.dtype
dtype('int64')
>> df.memory_usage()
Index 16
birth 16
height 4
weight 4
Listing 4-29. Using a raw SQL string query vs. the SQLAlchemy
ORM to generate a query
cur.execute(
"""
SELECT * FROM temperature_readings
WHERE temperature_readings.temp > 45
"""
)
session.query(TemperatureReadings).filter(
    TemperatureReadings.temp > 45
)
Listing 4-30 shows an example of how you might build a database and
insert data into it. In this example, we are creating a user table with two
columns, id and name, and inserting a new user with an id of zero and
name Eric into that table. Note two different URLs are defined in the code,
the one in use connects to sqlite and the other connects to a local Postgres
database instance.
SQLITE_URL = "sqlite://"
POSTGRES_URL = "postgresql://postgres@localhost:5432"
class User(Base):
__tablename__ = 'user'
id = Column(Integer, primary_key=True)
name = Column(String(50))
engine = create_engine(SQLITE_URL)
Session = sessionmaker(bind=engine)
def create_tables():
Base.metadata.create_all(engine)
def add_user():
session = Session()
user = User(id=0, name="Eric")
session.add(user)
    session.commit()
    session.close()
>> create_tables()
>> add_user()
The SQL loaders generally can either load the whole table or load parts of
the table based on a query. Although the SQL loaders accept a SQLAlchemy
engine, they only accept a select statement rather than a query object. This
means while you can use SQLAlchemy’s query API, you must convert it to a
selectable before passing it into the loader as shown in Listing 4-34. A selectable
is essentially the raw SQL query string. Listing 4-33 shows an example of how
you would load all the user data from the database into a DataFrame, while
Listing 4-34 shows how you might load the user with id=0 into a DataFrame.
Listing 4-33. Loading all the users into a DataFrame using read_sql
>> pd.read_sql(
sql=User.__tablename__,
con=engine,
columns=["id", "name"],
)
id name
0 0 Eric
Listing 4-34. Loading the user with id=0 into a DataFrame using
read_sql
>> select_user0 = session.query(User).filter_by(id=0).
selectable
>> pd.read_sql(
sql=select_user0,
con=engine,
columns=["id", "name"],
)
id name
0 0 Eric
The SQL loaders have similar options as the other loaders we’ve looked
at, for example, loading the data chunks at a time or datetime conversion.
However, there are differences as well. Unlike some of the other loaders
we’ve looked at, the SQL loaders do not have an option for data type
specification. This often poses a problem for pandas users working with
databases as they may store a normalized version of the data in a database
and then wish to load it back out only to find the data types are not the
same. If you run into this situation, SQLAlchemy and some custom loading
code can help. SQLAlchemy provides a custom types option which lets
you convert between the database type and the Python type. As we’ve seen
with other loaders where the types are not explicitly specified, pandas will
store the id column as an int64. Listing 4-35 shows an example of how we
might specify the Python type for the integer id column as an int32 instead
of a more generic and larger integer type. Using this table definition, now
when we add a user, the id will be stored as an integer inside the database,
but when we read it out, it will be a NumPy int32 type.
class Int32(types.TypeDecorator):
    impl = types.Integer

    def process_result_value(self, value, dialect):
        # convert the database integer to a NumPy int32 on the way out
        return np.int32(value)

class User(Base):
    __tablename__ = 'user'
    id = Column(Int32, primary_key=True)
    name = Column(String(50))
The fetchall function returns the data as a list of tuples, for example, [(0, ‘Eric’)].
This implementation is relevant for the next step in how we will get pandas to
use the correct data types we defined in Listing 4-35.
data = result.fetchall()
self.frame = DataFrame.from_records(
data, columns=column_names, coerce_float=coerce_float
)
Listing 4-37. Custom SQL loader code that maintains the data types
defined on the SQLAlchemy table in Listing 4-35
>> sql = session.query(User).selectable
>> results = engine.execute(sql).fetchall()
>> data = {
columns[col]: np.array(
[row[col] for row in results],
dtype=type(results[0][col]))
for col, v in enumerate(results[0])}
>> df = pd.DataFrame(data)
>> df.dtypes
0 int32
1 object
We’ve covered several of the most popular loaders and their options
in this chapter, but there are still many more. Be sure to read the
documentation for the particular loader you are using and see what kinds
of normalization during load features are at your disposal, and if not, you
may have to write some custom code yourself. Keep in mind performance
savings can come from reducing memory overhead and reducing steps
during the load and normalization process. pandas provides many ways
of improving normalization and load performance depending on the
bottlenecks you have in your particular situation. In Chapter 5, we’ll
explore how to reshape the data into the desired DataFrame format once
it’s loaded and normalized.
CHAPTER 5
Basic Data Transformation in pandas
The pandas library has a huge API that provides many ways of
transforming data. In this chapter, we’ll cover some of the most powerful
and most popular ways to transform data in pandas.
There are a couple performance issues in Listing 5-1. Pivot table does
not have an option to limit memory duplication so it creates an entirely new
DataFrame each time it is used. If your DataFrame is quite large, this can be a
big performance hit to your program. Internally, pivot table is grouping the data
by unique restaurant and location combinations which takes time, particularly
with a large amount of combinations. If this was being used as part of a data
normalization step, it would be far better than if it was used many times
throughout a program as part of data analysis. This is because the performance
hit of uniquely grouping and copying all that memory would happen only once
compared to it happening many times throughout the program. It is far better
to normalize and orient a DataFrame once in such a way that it optimizes all
the analysis you plan to perform on it than leave it in a somewhat unoptimized
raw form and have to re-orient it at each analysis step. Note, if instead the
DataFrame was already uniquely grouped, we could run a groupby to calculate
the average score like in Listing 5-2 which would be twice as fast. We’ll discuss
the performance of groupby in depth in Chapter 7. It is very likely that other
analysis needs the data grouped by unique restaurant, so the grouping in this
example at the very least should be part of the normalization step, in which
case it becomes unnecessary to use pivot table at all.
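For reference, the already-grouped version described above might look roughly like this sketch (using the multi-index restaurant data from Chapter 3; the exact Listing 5-2 code may differ):

import pandas as pd

restaurants = pd.MultiIndex.from_tuples(
    (("Diner", (4, 2)), ("Diner", (4, 2)),
     ("Pandas", (5, 4)), ("Pandas", (5, 4))),
    names=["restaurant", "location"],
)
restaurant_inspections = pd.DataFrame(
    {"date": ["02/18", "05/18", "04/18", "01/18"],
     "score": [90, 100, 55, 76]},
    index=restaurants,
)

# With the data already uniquely grouped in the index, the average
# score per restaurant is a single groupby away.
average_score = restaurant_inspections.groupby(
    ["restaurant", "location"]
)["score"].mean()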
Pivot does the same thing as pivot_table, but it does not allow you to aggregate data. Any column and index value combination that results in multiple values must be aggregated together when using pivot_table; pivot, on the other hand, simply throws a ValueError if it runs into such a scenario. Note that in the DataFrame used in Listing 5-3, no combination of drug and date results in multiple values, whereas in the DataFrame used in Listing 5-4 there are multiple rows for the same drug and date, which is why Listing 5-4 throws a ValueError. So, a regrettable limitation of both pivot and pivot_table is that they cannot output data where there are multiple values for an index and column combination: pivot_table forces you to aggregate the multiple values together or select one, and pivot simply throws a ValueError.
>> df
date tumor_size drug dose
02/18 90 01384 10
02/25 80 01384 10
03/07 65 01384 10
03/21 60 01384 10
02/18 30 01389 7
02/25 20 01389 7
03/07 25 01384 10
03/21 25 01389 7
>> df.pivot(
index="drug",
columns="date",
values="tumor_size"
)
date 02/18 02/25 03/07 03/21
drug
01384 90 80 65 60
01389 30 20 25 25
>> df.pivot(
index="drug",
columns="date",
values="tumor_size",
)
ValueError: Index contains duplicate entries, cannot reshape
Pivot is more performant than pivot_table because it does not allow specification and generation of multi-level columns and multi-indexes, and thus it does not have the overhead of generating and handling this more complex DataFrame format. Regardless of whether the resulting DataFrame is a multi-index or multi-level column DataFrame, pivot_table still runs the various computations as if it were multi-level, which adds a fair amount of overhead, up to six times more than pivot in some cases. While pivot will allow you to specify multiple values and create a multi-level column for them, it does not allow you to provide an explicit list of columns to generate multi-level columns or a list of indexes to generate multi-level indexes. pivot_table, on the other hand, supports this type of multi-level DataFrame. It also has some other nice options, such as adding a subtotal of all rows and columns and dropping columns with NaNs. In summary, if you can get away with using pivot, you should, as it's more performant than using pivot_table.
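A minimal sketch of the pivot_table features mentioned above (the column names and data are illustrative, not from the book's listings):

import pandas as pd

df = pd.DataFrame({
    "restaurant": ["Diner", "Diner", "Pandas", "Pandas"],
    "location": ["(4, 2)", "(4, 2)", "(5, 4)", "(5, 4)"],
    "year": [2018, 2019, 2018, 2019],
    "score": [90, 100, 55, 76],
})

# pivot_table accepts lists of index labels (building a multi-index),
# can add subtotals of all rows and columns (margins), and can drop
# columns that are entirely NaN (dropna)
df.pivot_table(
    index=["restaurant", "location"],
    columns="year",
    values="score",
    aggfunc="mean",
    margins=True,
    dropna=True,
)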
Stack and unstack
Stack and unstack reshape a DataFrame's column level into an innermost index and vice versa. An example of this is shown in Listing 5-5, where each column is a restaurant health inspection, the value is the health inspection score, and the index represents the restaurant that was inspected. Stack is used to reshape the data so that the health inspection scores for each restaurant occur across each row rather than each column. Note that stack converts the column names across the top into column values, which are then ultimately dropped from the DataFrame.
You might recognize the shape of the original DataFrame in Listing 5-5 from Listing 3-22. The shape of the DataFrame before it's reshaped in Listing 5-5 is the orientation that was deemed the most optimal in the “Choosing the right DataFrame” section at the end of Chapter 3. Listing 5-5 shows how to convert from that optimal shape to the original non-optimal shape. Now let's look at how we might take the original non-optimal shape and turn it into the optimal shape. Listing 5-6 adds a new column called inspection to the DataFrame, whose values become the column names of the new DataFrame. We also make use of a handy groupby aggregation function called cumcount that creates a row number for each row in each group.
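The code for Listing 5-6 is not reproduced in this extraction; a minimal sketch of the approach it describes, assuming a DataFrame indexed by restaurant and location with a single score column, might look like this:

# number each row within its restaurant/location group
df["inspection"] = df.groupby(["restaurant", "location"]).cumcount()
# move the inspection number into the index as the innermost level
df = df.set_index("inspection", append=True)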
                                score
restaurant location inspection
Diner      (4, 2)    0             90
                     1            100
Pandas     (5, 4)    0             55
                     1             76
>> df = df.unstack()
>> df
                    score
inspection              0    1
restaurant location
Diner      (4, 2)      90  100
Pandas     (5, 4)      55   76
So how performant are stack and unstack? They both duplicate memory since they are not in-place operations, which can be costly, and thus they should really only be used during data normalization. They are, however, fairly unique in the way they can transform the data, so it is difficult to find a more performant alternative other than melt, which is what we'll explore in the next section.
Melt
An example of using melt is shown in Listing 5-7. Note that this is very similar to the stack example: we are essentially doing in one line what took approximately four lines with stack. While melt does the same thing as stack and a bit more, it does it in a slightly more performant way. This is mainly due to the slight overhead advantage it has in not calling into all the various data transformations at a high level; rather than calling stack underneath, melt performs the lower-level data manipulations beneath stack directly, thus avoiding the middle code layers. If you compare a raw stack to melt, stack is about four times faster; the drawback of using stack, however, is that it often requires other manipulation around it, such as the extra reshaping steps we saw in Listing 5-5, which is why the one-line melt ends up being the better option overall.
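The code of Listing 5-7 is also not reproduced here; a sketch of the melt call it describes, assuming the wide DataFrame from Listing 5-5 with restaurant and location as regular columns and one column per inspection, might look like this:

df = df.melt(
    id_vars=["restaurant", "location"],   # identifier columns to keep
    value_name="score",                   # name for the melted values
).drop(columns="variable")                # discard the old column labels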
>> df
restaurant location score
Diner (4, 2) 90
100
Pandas (5, 4) 55
76
Transpose
Transpose is a useful trick. It simply turns the columns into rows and the rows into columns. In Listing 5-8, there is a list of patients who need to be treated for a certain disease and a table that provides a list of drugs used to treat the disease based on blood type. We need to add the list of drugs that can be used to treat the given patient into the patient table based on the patient's blood type. The first step is to index both the patient list and the drug table by blood type, and then we can do a simple join to add the drug data into the patient list. Because the drug table is oriented such that the blood types are across the columns instead of the rows, we first do a transpose. Note that when we do this, the index, which is provided by default when creating the DataFrame, turns into the columns, and the columns turn directly into the index. This means that in Listing 5-8 we don't explicitly have to set the index, as the transpose already sets the index to the blood type for us.
>> patient_list
               id history
blood_type
0+          02394     hbp
B+          02312     NaN
0-          23409     lbp
>> drug_table
index   0+   0-   A+   A-   B+   B-
0      ADF  ADF  ACB  DCB  ACE  BAB
1      GCB  RAB   DF  EFR  NaN  HEF
2      RAB  NaN  NaN  NaN  NaN  NaN
>> drug_table = drug_table.transpose(copy=False)
>> drug_table
blood_type    0    1    2
0+          ADF  GCB  RAB
0-          ADF  RAB  NaN
A+          ACB   DF  NaN
A-          DCB  EFR  NaN
B+          ACE  NaN  NaN
B-          BAB  HEF  NaN
>> patient_list.join(drug_table)
               id history    0    1    2
blood_type
0+          02394     hbp  ADF  GCB  RAB
B+          02312     NaN  ACE  NaN  NaN
0-          23409     lbp  ADF  RAB  NaN
CHAPTER 6
The apply Method
While the example in Listing 6-1 is simple and illustrates how to use apply, it is also a textbook example of when not to use apply: np.sum is available as a built-in method on the DataFrame itself, and the built-in should be used because it is much more performant. But why is it so much more performant? Let's explore that in more detail.
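Listing 6-1 is not reproduced in this extraction; a minimal sketch of the pattern it illustrates, with illustrative column names, might look like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# what Listing 6-1 demonstrates: applying NumPy's sum to each row
row_sums_slow = df.apply(np.sum, axis=1)

# the equivalent built-in, which is vectorized and much faster
row_sums_fast = df.sum(axis=1)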
The answer to the question of why the built-in pandas sum is so much
more performant than applying the NumPy sum to each row lies in where
the iteration over the rows takes place. The following loop in Listing 6-2 is
the underlying implementation of the pandas apply method.
for i, v in enumerate(series_gen):
    results[i] = self.f(v)
    keys.append(v.name)
As you can see in Listing 6-2, the looping over the rows takes place in Python. Here, series_gen holds either the columns or the rows that the function to be applied (held in self.f) will be applied to. This is in contrast to the built-in pandas sum function, which simply passes an ndarray to the NumPy sum function; NumPy then iterates and sums the data in C and returns the resulting ndarray back to Python. This process of running the operation on the data in C instead of Python is known as vectorization. Essentially, vectorization is able to achieve a huge speedup over the alternative of running the operation in Python. For all the reasons covered in Chapter 3, looping and performing operations in C is much more performant than doing so in Python. However, the speedup doesn't always come from just looping in C.
Vectorized operations allow you to apply a mathematical operation to a sequence of numbers. For example, if you want to add 4 to each element in an ndarray, you specify that using the syntax arr + 4. In the case of NumPy ufuncs (see the Appendix for a comprehensive list), they actually make use of specialized vector registers in the CPU itself. Vector registers are registers that can contain a series of values, and when an operation is performed on them, it is performed on each value in the register at once. So, what would have been a loop over an array of eight values and eight consecutive add instructions in the CPU becomes one add instruction operating on eight values at once. As you can imagine, this leads to a huge speedup. Vectorization will also pad arrays of mismatched dimensions so that the dimensions match and an operation can run; this process is known as broadcasting. When you add a new column in pandas via df["new_col"] = 4, the 4 is broadcast to have the same number of rows as all the other columns in the DataFrame. Similarly, aggregation functions like sum operate over a sequence of numbers using vectorization.

What all of this boils down to is that apply is not a vectorized operation: it loops in Python and should be avoided whenever possible. It becomes effectively the same thing as iterating over the rows and applying the function yourself, as illustrated in Listing 6-3.
results = [0]*len(df)
for i, v in df.iterrows():
    results[i] = v.sum()
df["sum"] = results
Let’s look at another example. Say you have a data set with one column
named A but that column has incomplete data and you wish to replace
the values that are missing with the max of columns B and C. This could
be implemented using apply as demonstrated in Listing 6-5, or this could
be implemented in a much more performant way using the where method
demonstrated in Listing 6-6.
# Listing 6-5: filling in the missing values of column A using apply
def replace_missing(series):
    if np.isnan(series["A"]):
        series["A"] = max(series["B"], series["C"])
    return series

df = df.apply(replace_missing, axis=1)

# Listing 6-6: the same replacement using the where method
df["A"].where(
    ~df["A"].isna(),
    df[["B", "C"]].max(axis=1),
    inplace=True,
)
The where method keeps the values where the condition in the first parameter is True and replaces the rest with the value provided in the second parameter. This means that, in Listing 6-6, all the NaN values in column A are being replaced with the max of columns B and C. Note that we are also specifying inplace=True so that the replacement happens on the current DataFrame, as opposed to creating a new DataFrame, which would result in duplicated memory.
Let's look at a trickier example in Listings 6-7 and 6-8. Suppose you have a DataFrame with two columns, fruit and order, and you want to drop all the rows where the fruit is not present in the order text. pandas does have string operations, including Series.str.find, which will locate a substring within each string value of a Series (and Series.str.contains, which reports whether it is present). However, these only allow you to pass in a constant; in other words, you cannot specify a Series of substrings, only a single string value, so they will not work in this case. There is also no “in” check built into pandas that operates on two Series objects, so although this is exactly what we want, pandas does not support it. This means we must implement some kind of customized solution ourselves, so let's explore the performance of various options.
Listing 6-7. Dropping rows whose order column does not contain the substring in the fruit column using apply

def test_fruit_in_order(series):
    if (series["fruit"].lower() in
            series["order"].lower()):
        return series
    return np.nan

>> data.apply(
       test_fruit_in_order,
       axis=1,
       result_type="reduce",
   ).dropna()
    fruit                order
0  orange   I'd like an orange
2   mango  May I have a mango?
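Listing 6-8 is not reproduced in this extraction; a minimal sketch of the list comprehension approach it refers to, operating on the same data DataFrame, might look like this:

# keep only the rows where the fruit appears in the order text
data = data[
    [fruit.lower() in order.lower()
     for fruit, order in zip(data["fruit"], data["order"])]
]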
Using apply to solve this problem as in Listing 6-7 takes about 14 seconds
on 100,000 rows, whereas using a list comprehension as in Listing 6-8 takes
about 100 milliseconds. But why is a list comprehension so much faster than
apply? Don’t they both loop in Python? List comprehensions are specially
optimized loops within the Python interpreter. The bytecode that they
translate into more closely resembles a loop written in C as they do not load a
bunch of specialized Python list attributes. What follows is the bytecode for a
for loop (Listing 6-9) vs. a list comprehension (Listing 6-10). Notice how much
simpler and smaller the bytecode is for a list comprehension than for a for
loop even though they are doing the same thing.
def for_loop():
    l = []
    for x in range(5):
        l.append(x % 2)
0 0 BUILD_LIST 0
2 STORE_FAST 0 (l)
1 4 SETUP_LOOP 30 (to 36)
6 LOAD_GLOBAL 0 (range)
8 LOAD_CONST 1 (5)
10 CALL_FUNCTION 1
12 GET_ITER
>> 14 FOR_ITER 18 (to 34)
16 STORE_FAST 1 (x)
2 18 LOAD_FAST 0 (l)
20 LOAD_METHOD 1 (append)
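The list comprehension whose bytecode follows is not reproduced in this extraction; it is presumably equivalent to something like the following, and the bytecode itself can be printed with the dis module:

import dis

def list_comprehension():
    l = [x % 2 for x in range(5)]

dis.dis(list_comprehension)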
0 0 LOAD_CONST 1
2 LOAD_CONST 2
4 MAKE_FUNCTION 0
6 LOAD_GLOBAL 0 (range)
8 LOAD_CONST 3 (5)
10 CALL_FUNCTION 1
12 GET_ITER
14 CALL_FUNCTION 1
16 STORE_FAST 0 (l)
18 LOAD_CONST 0 (None)
20 RETURN_VALUE None
Listing 6-12 shows the output DataFrame we would get by applying scipy.stats.percentileofscore to each row of an input DataFrame using the pandas apply function.
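Listing 6-12 is not reproduced in this extraction; a minimal sketch of applying the SciPy function row by row, with illustrative data, might look like this:

import pandas as pd
from scipy.stats import percentileofscore

data = pd.DataFrame({
    "a": [1, 6], "b": [2, 7], "c": [3, 8], "d": [4, 9], "e": [5, 10],
})

# for each value in a row, compute its percentile rank within that row
data.apply(
    lambda row: [percentileofscore(row, value) for value in row],
    axis=1,
    result_type="expand",
)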
def percentileofscore(df):
    res_df = pd.DataFrame({})
    for col in df.columns:
        # replicate column col across every column so each value in a row
        # can be compared against it (the example data has 5 columns)
        score = pd.DataFrame([df[col]]*5, index=df.columns).T
        # per row: count of values strictly less than the score
        left = df[df < score].count(axis=1)
        # per row: count of values less than or equal to the score
        right = df[df <= score].count(axis=1)
        right_is_greater = (
            df[df <= score].count(axis=1)
            > df[df < score].count(axis=1)
        ).astype(int)
        res_df[f'res{col}'] = (
            left + right + right_is_greater
        ) * 50.0 / len(df.columns)
    return res_df

percentileofscore(data)
# percentileofscore.pyx
# assumption: pctofscore is the SciPy percentileofscore function imported
# under a shorter name (the text notes the SciPy function is used as is);
# the import is not shown in this excerpt
def percentileofscore(values):
    percentiles = [0]*len(values[0])
    num_rows = len(values)
    for row_index in range(num_rows):
        row_vals = values[row_index]
        for col_index, col_val in enumerate(row_vals):
            percentiles[col_index] = \
                pctofscore(row_vals, col_val)
        values[row_index] = percentiles

# setup.py used to compile the Cython extension
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("percentileofscore.pyx")
)
Note that the Cython function accepts values and not the full
pandas DataFrame; this is because values is a two-dimensional array
and something that is easily translatable into C, whereas the pandas
DataFrame is a Python object and is not. Also note that the function
modifies the data in place as opposed to returning a whole new two-
dimensional array. This is a performance benefit as we do not have to
allocate new memory for the new array, and once the data has been
converted, we no longer need the original data set (at least in this
particular case).
So how different is the performance of these approaches when run over 100,000 rows? Using apply as in Listing 6-12 averages around 58 seconds. Using pandas operations to effectively re-implement the SciPy equivalent as in Listing 6-13 averages around 24 seconds. The third approach of building a custom Cython function averages around 4 seconds. There are also advantages to the Cython approach beyond performance: the SciPy function could be used as is and did not have to be re-implemented, so from an implementation-effort and readability perspective it looks very appealing as well.

In conclusion, apply should only be used when all other options have been exhausted. It is equivalent in performance to iterrows and iteritems and should be treated with the same level of precaution. In cases where apply needs to be used over a large data set and is causing a slowdown of a second or more, a customized Cython apply equivalent should be implemented instead so as not to degrade data analysis performance.
CHAPTER 7
Groupby
Chances are at some point when working with data in pandas, you will
need to do some sort of grouping and aggregation of data. This is what
Groupby is for. It allows you to cluster your data into groups and run
aggregated calculations on those groups.
>> arrivals_by_destination
            number
date place
2015 LON        10
2015 BER        20
2015 LON         5
2016 LON        10
2016 BER        15
2016 BER        10
>> groups = arrivals_by_destination.groupby(["date","place"])
>> for idx, grp in groups:
       arrivals_by_destination.loc[idx, "total"] = \
           grp["number"].sum()
>> arrivals_by_destination
            number  total
date place
2015 LON        10     15
2015 BER        20     20
2015 LON         5     15
2016 LON        10     10
2016 BER        15     25
2016 BER        10     25
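The loop above spells out what groupby does under the hood; a more concise sketch of the same total calculation (not one of the book's listings) uses transform:

arrivals_by_destination["total"] = (
    arrivals_by_destination
    .groupby(["date", "place"])["number"]
    .transform("sum")
)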
Indexing
Working with a sorted index provides a substantial speedup when there are many different values in each index. You may encounter the warning “PerformanceWarning: indexing past lexsort depth may impact performance.” The lexsort depth refers to the number of levels of a multi-level index that are sorted lexically, or alphabetically.
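A minimal sketch of avoiding that warning by sorting the index up front:

# sort the MultiIndex once so that subsequent lookups and groupbys stay fast
arrivals_by_destination = arrivals_by_destination.sort_index()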
Avoiding groupby
So far, we’ve explored how to get the best performance when running a
groupby operation. Sometimes, however, the most performant option is to
not use a groupby at all. If you find yourself having to do a lot of groupby
operations on your DataFrame, you may consider re-orienting your
DataFrame so that you don’t need to use groupby. Since groupby groups
the data and then runs an aggregate function on each group of data, it is
essentially doing a loop over the number of groups. Even though in the
most performant case the groups are already pre-computed, the indexes
are fast to access, and the looping is run at the C level, all of that still takes
time. It’s much more performant in pandas to run simple row-wise or
column-wise operations.
Let’s take a look at how we can reformat the DataFrame in Listing 7-3
so that we can avoid using groupby. If we keep the index columns where
they are but instead break out the multiple values for each index across the
row, we can do two things to optimize this sum by groups operation. The
first thing this does is eliminate the groupby sum operation and turn it into
a simple sum across the columns. The second thing this does is make the
indexes unique. Note we are taking on some additional memory overhead
by doing this as the gaps in the data will be filled with zeros. Integers,
however, take up little space even in a very large DataFrame so the overall
performance speedup is worth the additional memory usage.
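The reformatted DataFrame of Listing 7-3 is not reproduced in this extraction; a sketch of the reshaping described above, reusing the arrivals data from earlier in the chapter, might look like this:

# number each repeated (date, place) entry, spread the entries across
# columns, and fill the gaps with zeros
occurrence = arrivals_by_destination.groupby(["date", "place"]).cumcount()
wide = (
    arrivals_by_destination
    .set_index(occurrence, append=True)["number"]
    .unstack(fill_value=0)
)
# the groupby sum becomes a simple sum across the columns of unique indexes
wide["total"] = wide.sum(axis=1)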
CHAPTER 8
Performance Improvements Beyond pandas
You may have heard another pandas user mention using eval and query to speed up evaluation of expressions in pandas. While these functions can speed up the evaluation of expressions, they cannot do so without the help of a very important library: NumExpr. Using these functions without installing NumExpr can actually cause a performance hit. In order to understand how NumExpr is able to speed up calculations, however, we need to take a deep dive into the architecture of a computer.
Computer architecture
CPUs are broken up into multiple cores, where each core has a dedicated cache. Each core evaluates one instruction at a time. These instructions are very basic compared to what you might see in a Python program; one line of Python is often broken up into many CPU instructions. Some examples include loading data (such as storing an array value into a temporary variable when looping), jumping to a new instruction location (such as when calling a function), and evaluating an expression (such as adding two values together).
During the memory access phase, data for an upcoming instruction is fetched, and it is then loaded into a register in the writeback phase. This means that if you wish to add two values together, those values must first be loaded with a load instruction into two different registers before an add instruction can be run. So, the line of Python code in Listing 8-1 consists of three instructions inside the CPU.
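Listing 8-1 is not reproduced in this extraction; a sketch of the idea, with hypothetical pseudo-instructions, might look like this:

a, b = 1, 2
# the single line of Python below ...
c = a + b
# ... roughly becomes three CPU instructions:
#   LOAD  R1, a         ; load a into a register
#   LOAD  R2, b         ; load b into another register
#   ADD   R3, R1, R2    ; add the two registers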
While the CPU instructions in Listing 8-1 may look similar to Python
bytecode, it’s important to note that they are not the same. Remember
that bytecode is run on the Python Virtual Machine, whereas CPU
instructions are run on the CPU. While you can use the dis module (dis
standing for disassembly) to output the bytecode and it may give you
some idea of what the machine code might look like, it is not machine
code. The Python Virtual Machine contains a giant switch statement
that translates a bytecode instruction into a function call which then
executes CPU instructions. So, while we may think of Python as being an
interpreted language that runs bytecode instructions in a software virtual
machine, the fact is at some point that add instruction makes its way to
the CPU. Eventually that add becomes a series of CPU instructions that are
shown in Listing 8-1.
It's very common for the memory access phase of the instruction pipeline to take longer than all the other pipeline phases. Rather than making all the other pipeline phases as slow as the memory access phase or inserting NOPs (commonly called no ops or no operations), other instructions that do not depend on the data being loaded are used to fill the time. This enables the processor to keep evaluating instructions even though one phase of an instruction may take hundreds of cycles to complete. Compilers also play a part in keeping the instruction pipeline full.
When working with large data sets, as is the expected use case in pandas, all of that data cannot be stored inside the cache. It typically takes about three clock cycles (or instruction phases) to access the level 1 cache, and the latency increases roughly exponentially at each level. Accessing the level 3 cache takes about 21 clock cycles, and if the data we wish to load is not in any of the caches and has to be fetched all the way from main memory, it takes anywhere from 150 to 400 clock cycles. At around 21 clock cycles, the performance hit incurred by accessing the level 3 cache will likely exceed what the pipelines in our core can hide. If we have to delay the instructions in our pipeline until the data is retrieved, without re-ordering to pad the delay, that could stall our entire program for 21 clock cycles. Twenty-one clock cycles of delay on a 4 GHz processor is about 5.25 ns. This might seem insignificant, and it is if we only incur this delay a couple of times in our program. However, keep in mind that we are typically operating on megabytes of data in pandas, and since not all of that is going to fit in the caches, we will likely incur many performance hits like this. In fact, we're even more likely to incur larger performance hits all the way out to main memory when running an operation over the entire data set.
Caches are generally designed for the best performance of the average case. In software, this means things like looping over arrays, which are sequential data structures. Because of this, when caches have to load something, they load sequential blocks of memory at a time, called cache lines. This helps to offset the performance hit of loading something into the cache. The idea is that programs typically operate on sequential memory, so by loading the memory that follows the memory the core needs right now, the cache saves having to load that memory later.
In order to make the most effective use of caching, data that is located sequentially or close together in memory should be repeatedly referred to in a short time span. Sequential data results in fewer cache loads, and repeatedly referring to the same data in a short time span prevents new data from bumping the older data out of the cache and causing a cache miss that would require the same data to be loaded into the cache again. Arrays, as we learned in Chapter 3, are sequential data types, meaning the first element occurs at address A and the last element occurs at A plus the length of the array. When you create a bunch of objects in memory with many attributes that point to other objects and reference those attributes, each object has an address that is not sequential, and thus you will not be able to utilize your cache well, as you will be loading a bunch of different cache lines from a bunch of different memory locations. Figure 8-3 demonstrates these two types of memory accesses.
Note that in order to run an evaluation like this all at once on chunks of the pandas DataFrame(s), we must communicate the whole expression to NumExpr prior to computation. (A + B) * 3 must be specified in such a way that NumExpr knows the operations can be combined together.
This is where query and eval come in. eval allows you to specify a complex expression as a string, signaling to NumExpr that it can be run on a chunk of the DataFrame(s) at a time. query is effectively another form of eval, as it calls eval underneath.
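A minimal sketch, with illustrative column names, of handing a whole expression to eval and query:

import pandas as pd

df = pd.DataFrame({"A": range(1_000_000), "B": range(1, 1_000_001)})

# the whole expression is passed as a string, so NumExpr (if installed)
# can evaluate it in cache-sized chunks rather than materializing A + B
df.eval("(A + B) * 3")

# query is a thin wrapper around eval for boolean row selection
df.query("A > B")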
Depending on the computation, the shape and size of the data, the
operating system, and the hardware you are using, you may find that
using NumExpr and eval actually results in a significant performance
degradation. It’s always good to run a performance comparison before
blindly combining computations into an eval or query. NumExpr really
only works well for computations that exceed the size of the level 3 cache.
Typically, this is greater than 256,000 array elements. As we’ve seen with
other pandas functions, it also requires the data type and computation
be easily translatable into C. So, for example, datetimes will not yield a
performance improvement as they cannot be evaluated in NumExpr.
It’s also worth noting that using NumExpr directly can be much more
performant than using eval or query in pandas. Listing 8-3 demonstrates
such an example.
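Listing 8-3 is not reproduced in this extraction; a sketch of calling NumExpr directly on the underlying arrays might look like this:

import numexpr as ne
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(1_000_000.0), "B": np.arange(1_000_000.0)})
a = df["A"].to_numpy()
b = df["B"].to_numpy()

# operate on the raw ndarrays, skipping the pandas eval machinery entirely
result = ne.evaluate("(a + b) * 3")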
BLAS and LAPACK

NumPy uses Basic Linear Algebra Subprograms (BLAS) underneath to implement very performant linear algebra operations such as matrix multiplication and vector addition. These subprograms are typically written in assembly, a very low-level and performant language that closely resembles CPU instructions. The Linear Algebra Package (LAPACK) provides routines for solving linear equations; it is typically written in Fortran and, just like NumPy, calls into BLAS underneath. There are many different implementations of these libraries.
We've left out one important detail here, which is that data can typically only be loaded into vector registers if it is sequential in memory. This poses a slight problem for most complex vector operations, which typically happen on the rows of one matrix and the columns of another, or vice versa. BLAS is the opposite of Python in that its arrays are column major instead of row major. BLAS also does not have two-dimensional arrays; they are stored as a single-dimensional array. Listing 8-6 shows an example of a Python array and how it would be stored in BLAS.
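Listing 8-6 is not reproduced in this extraction; a sketch of the idea it illustrates:

# a two-dimensional Python array (row major):
matrix = [
    [1, 2, 3],
    [4, 5, 6],
]
# the same matrix as BLAS stores it: a single one-dimensional array laid
# out column by column (column major)
blas_storage = [1, 4, 2, 5, 3, 6]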
So, going back to the dot product example in Listing 8-5: because these arrays are both represented as single-dimensional arrays, despite one being a bunch of rows and the other being a bunch of columns, they both have contiguous memory addresses, and so they both can be loaded into the vector registers. The issue of consecutive memory addresses only comes into play when working with more complex matrices and more complex operations, so let's look at a more complex example.
There are many ways to perform a matrix multiply. One way, using the dot product, is shown in Figure 8-4: taking the dot product of a row of the first matrix with a column of the second matrix yields the value of a single element in the result of the matrix multiply.
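A minimal sketch of the idea in Figure 8-4:

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# each element of the result is the dot product of a row of the first
# matrix with a column of the second
result = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        result[i, j] = np.dot(a[i, :], b[:, j])

assert (result == a @ b).all()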
CHAPTER 9
The Future of pandas

pandas 1.0
The pandas community has been feverishly working on pandas 1.0, the first big upgrade since the initial release of pandas. It addresses a lot of the shortcomings of previous versions.

pandas 1.0 adds a new pandas-specific NA type. This new type makes null values consistent across all types of columns. As you may recall from Chapter 4, NaNs in pandas 0.25 must be stored as floats; they cannot be Booleans, integers, or strings. Previously, it was not possible to load a column containing NaNs as an integer type; you had to convert it to an integer after it was loaded. Now, with pandas 1.0, it's possible to load a column containing NaNs as an integer type. Listing 9-1 is the same example presented earlier in the text in Listing 4-15, only now making use of the new nullable integer type available in pandas 1.0. Note that the memory usage of this new type takes up one more byte per element than is indicated by the data type: while the type is set to Int16Dtype, each element actually takes up three bytes instead of two. The extra byte corresponds to a Boolean mask in the IntegerArray implementation that marks which values are NA.
Listing 9-1. Example of how pandas handles NaNs in the data in 1.0
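The code of Listing 9-1 is not reproduced in this extraction; a sketch of the read_csv call it describes follows, where the dtype mapping is taken from the output below and data is the CSV input from Listing 4-15 (also not reproduced):

df = pd.read_csv(
    data,
    dtype={
        "age": "int8",
        "height": "float16",
        "weight": "Int16",   # capital "I": the new nullable integer type
    },
    index_col=[0],
)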
        age  height  weight
123083   20     6.1    <NA>
123087   25     4.5    <NA>
>> df.memory_usage(deep=True)
Index     24
age        3
height     6
weight     9
>> df.dtypes
age        int8
height  float16
weight    Int16
>> df.index.dtype
dtype('int64')
The new dedicated string data type also guarantees consistency within the column and identifies as a text type, rather than lumping text values in with all values that are of the generic object container type. Listing 9-3 demonstrates how much less memory the new pandas string type uses. When using the new string type, each value takes up only 8 bytes, which is a huge decrease in memory compared to previous versions, where each object value took up about 60 bytes.
Listing 9-3. Memory usage of the pandas 1.0 string type compared to using object in previous versions
>> data = io.StringIO(
"""
id,name
129237,Mary
123083,Lacey
123087,Bob
"""
)
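The rest of Listing 9-3 is not reproduced in this extraction; a sketch of loading the name column with the new string type might look like the following, while loading the same data without specifying the type falls back to object, as the surviving fragment of the listing's output shows below:

>> df = pd.read_csv(data, dtype={"name": "string"}, index_col=[0])
>> df.dtypes
name    string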
>> df.dtypes
name object
Nullable Booleans are also a huge win for pandas users. Previously,
Boolean columns could not have a nullable state; only True and False were
allowed. This meant users had to use an integer representation or an object
to represent a Boolean with a third NaN state, but now they can use the
pandas BooleanArray type.
The introduction of the new types in pandas, namely the nullable Boolean, the pandas NA type, and the dedicated string type, yields marked improvements to pandas type casting in pandas 1.0. Now, integers, Booleans, and strings will be recognized and stored as smaller data types even when they contain null values. This is a huge win for performance and for saving memory on load. Note that while these new types exist and are inferred when creating pandas arrays, they are not inferred when creating DataFrames. You must explicitly specify the types for pandas to use them when creating pandas DataFrames. This is why in Listings 9-1 through 9-3 the new pandas types were explicitly specified when loading data using read_csv. If the types were not explicitly specified, they would be inferred to be the same types as in previous versions of pandas.
Rolling apply methods also now support an engine argument that gives the option of using Numba instead of Cython. Numba converts the custom apply function into optimized, compiled machine code, similar to Cython, but for data sets with millions of rows and custom functions that operate on NumPy ndarrays, the pandas team found Numba to produce more optimized code than Cython. It only makes sense to use Numba, of course, when you are running the calculation many times, since Numba has the overhead of compiling the function the first time it is used.
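A minimal sketch, with illustrative data, of selecting the Numba engine for a rolling apply:

import pandas as pd

ser = pd.Series(range(1_000_000), dtype="float64")

# raw=True hands each window to the function as an ndarray, which is what
# Numba needs in order to compile it efficiently
ser.rolling(100).apply(lambda window: window.mean(), raw=True, engine="numba")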
There has been a lot of work to clean up the Categorical data type in pandas 1.0. As you may recall from Chapter 2, the Categorical data type is used to hold metadata with a unique set of values. Deprecations within the API have been removed, previous operations on the data type that did not return a Categorical now do, and there is improved handling of null values. There are also performance improvements; for example, all the values passed into searchsorted are now converted to the same data type before running a comparison. Listing 9-4 shows an example of using searchsorted on a Categorical. This operation in pandas 1.0 is about 24 times faster than in previous versions.
import pandas as pd

metadata = pd.Categorical(
    ['Mary'] * 100000 + ['Boby'] * 100000 + ['Joe'] * 100000
)
metadata.searchsorted(['Mary', 'Joe'])
There have also been a lot of refactors and bug fixes made to groupby.
This used to be a complex bit of code to look at with a fair number of
bugs, but there have been many improvements in pandas 1.0 including
improved handling of null values, offering a selection by column name
when axis is one, allowing multiple custom aggregate functions for the
same column to match series groupby behavior, and many more.
The support of load and dump options for reading CSV data in pandas far exceeds the options for other loaders. While supporting so many options leads to complicated code for developers, it is very nice for users. Some of the loaders have a nice balance of options, but some fall short in load and dump capabilities that could lead to performance speedups for users. As we saw in Chapter 4, read_sql is missing the ability to specify data types during load, which can be fairly critical for performance. The CSV loader, on the other hand, has so many options that some of them can result in a performance slowdown if you aren't careful. A lot of work has been done to address this and standardize the options for input and output data methods in pandas 1.0. For example, both read_json and read_csv are now able to parse and interpret Infinity, –Infinity, +Infinity, and NaNs as expected. In previous versions, read_json didn't handle NaN or Infinity strings, and read_csv didn't cast Infinity strings as floats. The usecols parameter in read_excel has also been standardized to behave more like read_csv's usecols parameter. Previously, usecols was allowed to be a single integer value, whereas now it's a list of integer values, just like in read_csv.
There have been a lot of other subtle performance improvements to
pandas 1.0 as well. We’ll look at a couple of them here just to give you some
idea of what methods are being used to improve performance.
A regression in performance of the infer_type method was fixed in
pandas 1.0. An if statement was moved down in the implementation to
avoid a performance slowdown introduced by converting data types to
objects when running an isnaobj comparison prematurely as shown in
Figure 9-1.
Another performance fix was made to the replace method which is used
to replace values with a different value. Here, some additional code was
inserted above the original to take advantage of some early exit conditions.
If the list of values to replace is empty, simply return the original values or a
copy of the original values if inplace is False. If there is only one valid value,
replace that single value with the new value. The values were also converted
to a list of valid values as opposed to being left as a list of values that may or
may not even be legal for the given column. Note while it is not explicitly
shown in Figure 9-2, the new to_replace list was also used in the final replace
call. By doing so, this reduced the number of replaces that were needed
and improved the overall performance over large data sets where several
columns did not contain any of the values that were to be replaced.
There is talk from the pandas team of removing the inplace option from all pandas methods, and for that reason they have generally recommended not using it. The inplace option, contrary to what its name suggests, does not always operate in place without duplicating memory. This typically happens as a result of pandas type inference, where the operation results in a data type change and the data therefore has to be reconstructed with the new type. Listing 9-5 illustrates this. When a NaN value is replaced with 0.0, the type is still a float and the value can be directly replaced in the NumPy array without having to create a new one and copy memory. When 0.0 is replaced with the string null, however, the float64 type cannot hold a string, so the NumPy array must be rebuilt and the memory must be copied into a new array of type object. Both operations were specified with inplace=True, yet the latter resulted in a memory copy because the type of the underlying data structure had to change.
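Listing 9-5 is not reproduced in this extraction; a sketch of the behavior described above:

import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [90.0, np.nan]})

# float -> float: the NaN can be overwritten inside the existing float64
# block without copying memory
df.fillna(0.0, inplace=True)

# float -> string: float64 cannot hold "null", so even with inplace=True
# the underlying array is rebuilt as object and the memory is copied
df.replace(0.0, "null", inplace=True)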
Conclusion
Because of pandas' diverse user base, it supports many different options and many different methods for doing the same thing. pandas' API has a large and ever-expanding set of features and options, which can be incredibly overwhelming and often leads users to implement things in a suboptimal way. It's a difficult decision to make: limit the number of features and options so that users can't do the wrong thing, or provide a rich set of features so that users can find a way to do whatever they want. pandas has certainly erred on the side of the latter, which makes it a very powerful tool applicable to many different types of big data problems. For those users who don't care whether their program takes a minute or an hour, a suboptimal implementation is not an issue; for users who do care, however, it can be difficult to reason about and understand. Hopefully, this book has left you with a better understanding of how pandas works underneath and an intuition for which method to use in which scenario.
APPENDIX
Useful Reference Tables
Table A-1. Conversion between NumPy and C types (see https://docs.scipy.org/doc/numpy/user/basics.types.html)
NumPy universal functions (ufuncs)

add(x1, x2, /[, out, where, casting, order, …])        Add arguments element-wise
subtract(x1, x2, /[, out, where, casting, …])          Subtract arguments, element-wise
multiply(x1, x2, /[, out, where, casting, …])          Multiply arguments element-wise
divide(x1, x2, /[, out, where, casting, …])            Return a true division of the inputs, element-wise
logaddexp(x1, x2, /[, out, where, casting, …])         Logarithm of the sum of exponentiations of the inputs
logaddexp2(x1, x2, /[, out, where, casting, …])        Logarithm of the sum of exponentiations of the inputs in base 2
true_divide(x1, x2, /[, out, where, …])                Return a true division of the inputs, element-wise
floor_divide(x1, x2, /[, out, where, …])               Return the largest integer smaller or equal to the division of the inputs
negative(x, /[, out, where, casting, order, …])        Numerical negative, element-wise
positive(x, /[, out, where, casting, order, …])        Numerical positive, element-wise
power(x1, x2, /[, out, where, casting, …])             First array elements raised to powers from second array, element-wise
remainder(x1, x2, /[, out, where, casting, …])         Return element-wise remainder of division
mod(x1, x2, /[, out, where, casting, order, …])        Return element-wise remainder of division
fmod(x1, x2, /[, out, where, casting, …])              Return element-wise remainder of division
divmod(x1, x2[, out1, out2], / [[, out, …])            Return element-wise quotient and remainder simultaneously
absolute(x, /[, out, where, casting, order, …])        Calculate the absolute value element-wise
fabs(x, /[, out, where, casting, order, …])            Compute the absolute value element-wise
rint(x, /[, out, where, casting, order, …])            Round elements of the array to the nearest integer
sign(x, /[, out, where, casting, order, …])            Return an element-wise indication of the sign of a number
heaviside(x1, x2, /[, out, where, casting, …])         Compute the Heaviside step function
conj(x, /[, out, where, casting, order, …])            Return the complex conjugate, element-wise
conjugate(x, /[, out, where, casting, …])              Return the complex conjugate, element-wise
exp(x, /[, out, where, casting, order, …])             Calculate the exponential of all elements in the input array
exp2(x, /[, out, where, casting, order, …])            Calculate 2**p for all p in the input array
log(x, /[, out, where, casting, order, …])             Natural logarithm, element-wise
log2(x, /[, out, where, casting, order, …])            Base 2 logarithm of x
log10(x, /[, out, where, casting, order, …])           Return the base 10 logarithm of the input array, element-wise
expm1(x, /[, out, where, casting, order, …])           Calculate exp(x) - 1 for all elements in the array
log1p(x, /[, out, where, casting, order, …])           Return the natural logarithm of one plus the input array, element-wise
sqrt(x, /[, out, where, casting, order, …])            Return the non-negative square root of an array, element-wise
square(x, /[, out, where, casting, order, …])          Return the element-wise square of the input
cbrt(x, /[, out, where, casting, order, …])            Return the cube root of an array, element-wise
reciprocal(x, /[, out, where, casting, …])             Return the reciprocal of the argument, element-wise
gcd(x1, x2, /[, out, where, casting, order, …])        Return the greatest common divisor of |x1| and |x2|
lcm(x1, x2, /[, out, where, casting, order, …])        Return the lowest common multiple of |x1| and |x2|
Strings interpreted as NaN by pandas read_csv (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):

NULL
-1.#IND
NaN
-NaN
#N/A
NA
#N/A N/A
n/a
#NA
1.#QNan
-1.#QNan
NaN
-NaN
Null
N/A
1.#IND