Python For Accounting: A Modern Guide. Python Programming in Accounting
by Horatio Bota
with Adrian Gosa
Library of Congress Control Number: 2021901161
ISBN: 978-973-0-33892-8
Version identifier: 115431b
Disclaimer
Although the authors have made every effort to ensure that the information in this book was correct at
publication time, the authors do not assume and hereby disclaim any liability to any party for any loss,
damage, or disruption caused by errors or omissions, whether such errors or omissions result from
negligence, accident, or any other cause.
The authors have endeavored to provide trademark information about all of the companies and products
mentioned in this book by the appropriate use of nomenclature. However, the authors cannot guarantee the
accuracy of this information.
Copyright
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means, without the prior written permission of the authors, except in the case of brief
quotations embedded in critical articles or reviews. This work is registered with the U.S. Copyright Office.
Horatio Bota is a freelance Data Scientist with over seven years of experience in data analytics
and data science. He has previously worked at Microsoft Research, J.P. Morgan Chase, and several
startups in the U.K. He holds a B.Sc. and a Ph.D. in Computing Science from the University of
Glasgow.
Adrian Gosa is a Senior Accountant at Nike Europe, with over seven years of business and finance
experience. He has previously worked at PricewaterhouseCoopers and Deloitte, covering a wide
variety of industries. Adrian holds an M.A. in Business Economics and an M.Sc. in Quantitative
Finance from the University of Glasgow.
Alexandra Vtyurina is a Research Scientist at Kira Systems in Toronto, where she uses Python to
analyze user behavior. She holds a B.Sc. and M.Sc. in Computer Science from the Southern Federal
University, and is about to get her Ph.D. in Computer Science from the University of Waterloo.
Contact
Did you like the book? Did you find it helpful? We’d love to add your name to our list of testimonials
on the website! Please email us at contact@pythonforaccounting.com.
If you’d like to report any mistakes, typos, or offer suggestions on how we can improve this book,
please email us at errata@pythonforaccounting.com.
Contents
1 Introduction
    What this book is about
    Whom this book is for
    Why read this book
    A quick tour of Python’s data tools
    How to use this book
    Roadmap
2 Getting set up
    Installing Python on your computer
    Setting up your local workspace
    Using JupyterLab
    Using Anaconda Navigator
7 Control flow
    If-else statements
    Loops
    List comprehensions
8 Functions
    Defining functions
    Parameters and arguments
    Return values
12 Pandas in a nutshell
    Getting started
    Reading data from spreadsheets
    Preparing and transforming data
    Visualizing data
    Writing data to a spreadsheet
pythonforaccounting.com/chapter1
1 Introduction 3
Some of you may have used R or Python³ before. In that case,
the first part of the book will serve as a refresher for general
programming concepts and guide you through setting up the tools
you need to run Python data analysis code. If this is you, feel free
to jump ahead and read the chapters you find most useful.
3: Both R and Python are programming languages used widely for data
analysis and statistics.
The first drawback is that the work you do in Excel is not easily
reproducible. With Excel, you don’t have a record of all the steps you
took in your analysis, so you can’t rerun those steps if something
goes wrong⁵ or if you get a new dataset that needs the same kind
of analysis. Repeating the same actions over and over, every time
the data changes or Excel crashes, is not only time consuming and
annoying, but exceedingly error-prone.
5: And Excel does have a fondness for crashing without saving your work,
at the worst of times.
What Python is
• The Python standard library is a collection of Python packages¹²
that come with the Python interpreter. You can think of packages as
extensions to the main application, which is the Python interpreter
12: More details on what exactly these packages are in the following
chapters.
def multiply_by_two(number):
    return number * 2

multiply_by_two(10)
20
• Work well with the software you already use (e.g., SAP, Excel);
• Allow you to handle data in explicit and easy-to-check ways;
• Work with large datasets that Excel can’t even open;
• Enable you to automate repetitive manual tasks.
We’ll cover how to install and use libraries in your Python code
in the following chapters, but here is a quick example of how you
can import the pandas Python library and read data from an Excel
file using the read_excel function²⁰ (which is part of pandas):
import pandas as pd
pd.read_excel('Q1Sales.xlsx')
20: We’ll also review functions in the next chapter.
Figure 1.2: A screenshot of JupyterLab, the program you’ll be using to write Python code. This example shows how to use
pandas to read data from an Excel file, and what a table looks like in JupyterLab.
Two other tools, which are not Python libraries, will be essential
in your Python adventure:
• Anaconda²² is an installer that bundles all the open-source Python
libraries and tools I mentioned so far into a single package that
you can use to get up-and-running with Python data analysis code
fast. What the Microsoft Office suite is to Excel, Anaconda is to the
Python libraries for data analysis you’ll be using throughout this
book (except it’s free to use and open-source).
22: Or the Anaconda Distribution.
These are just a few of Python’s libraries and tools for data analysis
but are perhaps the most relevant for accounting. There are many
others online, and you’ll soon discover them on your own.
This book has a lot of code in it, shown directly in the main text.
Try to read code examples, but also type them out by yourself in
JupyterLab. You learn much faster when you type code by yourself,
and you also get used to the mechanics of writing and running
code (which will serve you well when you finish the book and
have to navigate without a guide).
The gray box at the top shows one or more Python code statements
— the code here just prints some text, but all code examples will
be similar in style to this one. When more than one line of code is
listed in a code box, you’ll see line numbers on the right of the box
to make it easier to reference a particular line in the main text (e.g.,
line 1 prints a hello message).
The gray box code is what you need to run to produce the result
shown in the blue box beneath it, which is the code output.
Sometimes code doesn’t produce any output (e.g., when you create a
new variable), so there’s no output box for that code. How exactly
you run code and what the In [1]: and Out [1]: labels mean is
covered in detail in the next chapter.
For the data analysis code you’ll be writing, code output will often
be a table. Output tables look like this:
In [2]: import pandas as pd 1
2
pd.read_excel('Q1Sales.xlsx') 3
Out [2]:        InvoiceNo         Channel         Product Name  ...  Unit Price  Quantity   Total
         0           1532      Shoppe.com  Cannon Water Bom...  ...       20.11        14  281.54
         1           1533         Walcart  LEGO Ninja Turtl...  ...        6.70         1    6.70
         2           1534        Bullseye                  NaN  ...       11.67         5   58.35
         3           1535        Bullseye  Transformers Age...  ...       13.46         6   80.76
         4           1535        Bullseye  Transformers Age...  ...       13.46         6   80.76
         ...          ...             ...                  ...  ...         ...       ...     ...
         14049      15581        Bullseye  AC Adapter/Power...  ...       28.72         8  229.76
         14050      15582        Bullseye  Cisco Systems Gi...  ...       33.39         1   33.39
         14051      15583  Understock.com  Philips AJ3116M/...  ...        4.18         1    4.18
         14052      15584        iBay.com                  NaN  ...        4.78        25  119.50
         14053      15585  Understock.com  Sirius Satellite...  ...       33.16         2   66.32
The table above has 14 054 rows and 12 columns (you’ll have to
believe me), but you only see 10 of its rows and 6 of its columns
(and an indicator at the bottom left that tells you how many rows
and columns there are). I show tables in the book in this truncated
form, with ... instead of actual rows or columns, to make them
fit in the main text, and because tables in JupyterLab also get
truncated like this. At first, you might find this annoying, but you’ll
soon discover you can work with data even without seeing it all.
Roadmap
• Part two: Working with tables is where the rubber meets the road
in using Python for accounting. This part of the book introduces
the main features of pandas, the Python library you’ll be using to
work with tables, whether Excel spreadsheets, CSV files or any
other kind of tabular data.
• Part three: Visualizing data shows you how to turn data into plots
using some of Python’s data visualization libraries (matplotlib,
seaborn, and hvplot).
The first three parts of the book have several chapters that introduce
new programming ideas. These chapters are short and focused,
and all build on top of each other; ideally you should try to read
them one after the other. There are also five project chapters that
don’t introduce new programming ideas but guide you through
applying what you’ve been learning in an accounting setting. These
chapters will walk you through (in order): filtering and splitting a
large Excel file into multiple sheets, reading and cleaning a general
ledger exported from QuickBooks, mining product reviews, filling
missing values in a sales dataset, and making a waterfall plot from
a cash flow statement. The last part of the book is a sales analysis
project in its entirety.
Throughout the book, you’ll find several sections and even a few
chapters whose titles start with the word “Overthinking”. You can
ignore these overthinking sections and chapters if you want to
because they cover ideas that aren’t strictly necessary in your
learning journey. Still, you might find them interesting if you get
bitten by the Python bug.
2 Getting set up
This chapter guides you through getting set up with Python and
its data analysis tools on your computer. You will likely want to
use Python at work, so I’ll show you how to install everything you
need to get started on a computer running Windows 10 without
any special privileges. If you have at least one folder on your work
computer where you can copy files, you can install Python. If you’re
using another operating system (e.g., another version of Windows
or macOS), the steps you need to take might be different, but you
can easily adapt the instructions here for your setup.
Figure 2.1: A screenshot of the Anaconda installer website, as of August 2020. To get the installer, click the download button
at the top of the page and then select 64-bit Graphical Installer under the Windows and Python 3.8 labels.
2 Getting set up 14
Figure 2.2: A screenshot of Anaconda Navigator. To get started, click “Launch” on the JupyterLab card – in this version of
Anaconda Navigator, JupyterLab is the top left card.
When the download completes, run the installer; it will guide you
through getting Python and the libraries you need up-and-running
on your computer. The default options set things up so that you can
run Python on your computer without any special privileges.
The installer will put all Python files in your local user directory
(probably something similar to C:\Users\<YOURUSERNAME>\Anaconda3),
but you can change this location to any other folder when you get
prompted. The entire installation takes up about 3 GB of hard disk
space, so make sure there is enough space on your drive. The whole
process takes a few minutes.
pythonforaccounting.com/chapter2
Figure 2.3: A screenshot of JupyterLab. This is the editor you’ll be using to write Python code. It is to Python data analysis
code files what Excel is to spreadsheets.
Using JupyterLab
JupyterLab interface
pd.read_excel('Q1Sales.xlsx')
Out [2]:        InvoiceNo         Channel           Product Name  ...  Unit Price  Quantity   Total
         0           1532      Shoppe.com  Cannon Water Bomb ...  ...       20.11        14  281.54
         1           1533         Walcart  LEGO Ninja Turtles...  ...        6.70         1    6.70
         2           1534        Bullseye                    NaN  ...       11.67         5   58.35
         3           1535        Bullseye  Transformers Age o...  ...       13.46         6   80.76
         4           1535        Bullseye  Transformers Age o...  ...       13.46         6   80.76
         ...          ...             ...                    ...  ...         ...       ...     ...
         14049      15581        Bullseye  AC Adapter/Power S...  ...       28.72         8  229.76
         14050      15582        Bullseye  Cisco Systems Giga...  ...       33.39         1   33.39
         14051      15583  Understock.com  Philips AJ3116M/37...  ...        4.18         1    4.18
         14052      15584        iBay.com                    NaN  ...        4.78        25  119.50
         14053      15585  Understock.com  Sirius Satellite R...  ...       33.16         2   66.32
Kernel is a new menu that you probably haven’t seen elsewhere and
contains commands related to managing the underlying Python
interpreter process linked to your notebook. I mentioned before
that each notebook has a Python process attached to it (that’s what
the kernel is). Sometimes this kernel process gets stuck, or you
want to stop whatever it’s doing because it is taking too long. When
that happens, you can use the Interrupt Kernel or Restart Kernel...
commands from this menu.
Figure 2.5: The JupyterLab interface. File navigator and sidebar on the left, main work area on the right. You can open
notebooks, images, PDF or plain text files in the main work area and organize them in a grid layout using drag-and-drop.
How do you know your code is taking too long? There’s an indicator
in the top right of your notebook: it looks like an empty circle
when the kernel is idle, and a full circle when the kernel
is running some code. To the left of this indicator you can see
several buttons that you can use to add or remove cells in your
notebook (and some other similar controls).
The left sidebar (highlighted with a red border in figure 2.5) has
some utilities, such as a file browser, a list of running Python
processes (i.e., kernels) and terminals, the command palette where
you can search for JupyterLab commands (e.g., Restart Kernel), and
a list of open tabs. The left sidebar can be collapsed (to give you
more screen space) or expanded by selecting “Show Left Sidebar” in
the View menu.
The main work area (highlighted with a blue border in figure 2.5) is
where you will be working with Jupyter notebooks. You can open
notebooks, images, PDF or HTML files in the main work area, and
you can arrange open tabs in a grid layout with drag-and-drop.
Jupyter notebooks
Jupyter notebooks are documents that keep all your data analysis
work together. They allow you to combine text, images, tables,
Code cells
Besides running code, these cells let you see what variables look
like. We’ll go over Python variables in more detail in the following
chapter, but for now, consider the following example:
In [3]: message = "hello, python for accounting!"
This code cell creates a variable called message that stores the same
text we’ve been printing so far. If you inspect message in a separate
cell (i.e., type message in a new cell⁶ and run the cell), you’ll see
its value as the cell output:
In [4]: message
6: Remember you can add new cells by clicking on the button right
above the notebook.
welcome_message
goodbye_message
You may have already noticed that code cells have a numbered
label in front of them — similar to the In [1]: label shown before
code examples in this book. These labels indicate the order in
which you run code cells in your notebook.
Jupyter notebooks keep track of the order you run code cells because
cell execution order is more important than the relative position of
cells in a Jupyter notebook. The easiest way to understand this is
through an example — consider the following cells:
In [1]: message = "hello, python for accounting!"
In [2]: print(message)
But you can run notebook cells in any order (i.e., you can easily
select any cell in your notebook and run it, regardless of its place
in the notebook). In the example below, I ran the last cell in the
notebook right after running the first one, then ran the middle cell
— notice the cell execution counters on the left of each cell are not
in order:
In [1]: message = "hello, python for accounting!"
In [3]: print(message)
The middle cell above prints goodbye, python for accounting!
because I assigned the value in the last cell to message before printing
it (i.e., I didn’t run the code cells in sequence, one after the other).
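If you want to convince yourself of this outside a notebook, the sketch below imitates what a kernel does: it keeps one shared set of variables and runs each cell's code in whatever order you choose. The cells list and the run_cells helper are my own stand-ins, not JupyterLab features, and the goodbye assignment in the last cell is reconstructed from the description above.

```python
import contextlib
import io

# Three "cells" in the order they appear in the notebook (top to bottom).
cells = [
    'message = "hello, python for accounting!"',
    'print(message)',
    'message = "goodbye, python for accounting!"',
]


def run_cells(order):
    """Run cells in the given order, sharing one namespace like a kernel does."""
    namespace = {}
    output = io.StringIO()
    with contextlib.redirect_stdout(output):
        for position in order:
            exec(cells[position], namespace)
    return output.getvalue().strip()


in_sequence = run_cells([0, 1, 2])    # top to bottom
out_of_order = run_cells([0, 2, 1])   # first cell, last cell, then middle cell

print(in_sequence)
print(out_of_order)
```

The same print cell produces different text in the two runs, because what matters is which assignment ran most recently, not where the cells sit on the page.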
way: you can add new cells anywhere in your notebook to inspect
or modify variables, regardless of where you defined them initially.
However, this feature can become confusing when you have a lot
of code cells in your notebooks.
In general, even though you can create and run code cells in any
order you want to, it is good practice to keep code cells in the same
order as they need to run so that you can follow the sequence of
steps in your notebooks.
Markdown cells
Markdown cells are useful in adding text that describes your work
to your notebooks. Documentation is essential when working with
data because analyses often involve intricate steps that need to be
explained in plain English.⁷
7: You can use Markdown cells to describe the business context for your
analysis (e.g., who the stakeholders are, what the business goals of your
analysis are), or explain complex logic in your process (e.g., why a
metric is computed using only a sample of the original data).
To add a Markdown cell to your notebook, click on the button
right above your code cell (you can also move cells around using
drag-and-drop if you don’t like where they are). By default, all new
cells are code cells. You can change the type of your new cell by
going to the notebook toolbar and selecting Markdown, as shown
in figure 2.7.
You can now start writing Markdown text in this cell. For example,
to give the notebook a title and some context, add the following
content to this empty cell:
# My First Notebook
You can now either click the button in the notebook toolbar
or press Shift + Enter to run the cell and render this text. You
should see the title rendered in large, bold font. If you want to
change this text, just double-click anywhere on the rendered text.
Figure 2.8: An almost complete Markdown reference for Jupyter notebooks. This reference shows side-by-side views of
the same notebook in JupyterLab: the left view shows the Markdown text in edit mode, whereas the right view shows the
rendered Markdown.
The error above tells you that JupyterLab tried to run the contents
of your cell as Python code but couldn’t (because it’s not code, it’s
just some text).
of packages that have the term “hvplot” in their name — select the
library called just “hvplot” by clicking the checkbox in front of its
name, and then click the “Apply” button at the bottom right of the
screen. You will see a dialog showing you what packages will be
installed; click “Apply” again and wait for the new library to get
downloaded and installed on your computer.
Summary
Python ABCs part one
Python has a large ecosystem of libraries and tools for data analysis.
However, to tap into this vast ecosystem of tools, you first need to
be familiar with the Python language itself.
This part of the book introduces the building blocks of the Python
programming language (e.g., how to define variables or functions,
how to work with lists and loops, and more). These building
blocks are the foundation on which the data analysis tools you’ll
be using later in the book are built, and they are also the glue
that makes them all work together.
• The line above an indented code block must end with a colon
(i.e., :). Notice the use of colons after each conditional statement in
the if-else block or after the for statement on line 12.
• Whitespace on the same line does not matter. The following lines
of code are equivalent:
my_sum = 1+2+3
my_sum = 1 + 2 + 3
my_sum = 1   +   2   +   3
Most Python users would agree that the second line is easier to read
than the other two — and, indeed, Python style guides recommend
using a single space around operators to make code easier to read.
However, all three lines have the same effect.
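A quick way to check this claim yourself is to compare the results. This is a minimal sketch, and the variable names are mine:

```python
# The same sum written with three different spacing styles.
compact = 1+2+3
spaced = 1 + 2 + 3
stretched = 1   +   2   +   3

# All three expressions produce the same value.
print(compact, spaced, stretched)
print(compact == spaced == stretched)
```

Keep in mind this only applies to spaces inside a line; whitespace at the start of a line (indentation) does carry meaning in Python.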
pythonforaccounting.com/chapter3
3 A quick look at Python code 31
The syntax rules we just reviewed are part of the Python programming
language (much like punctuation rules are part of the English
language). They describe how to write valid Python code, but not
what different Python language constructs mean and what you
can do with them. Next, let’s take a look at variables and how you
create them in Python.
4 Variables and operators
Variables are the building blocks of programming. They are names
given to pieces of data that you reference throughout your code.
You just created a variable called message with the value of 'hello,
python for accounting world!'.
Everything is an object
In Python, every variable is an object.¹ This means that all Python
variables have associated attributes and methods you can access by
typing a period after their name. For instance, to make the message
variable you created earlier uppercase, you can type a period after
its name and use the upper method:
In [2]: message.upper()
1: In fact, everything is an object in Python, even functions or libraries.
Similarly, to add a number to a list² you can use the append method
on a list variable:
In [3]: numbers = [10, 20, 30] 1
numbers.append(100) 2
3
numbers 4
2: More on lists in chapter 6.
Notice the dot after numbers and before append on line 2, the
parentheses after append, and the method argument (which is the
number 100) in between. This style of calling methods on objects
is called dot syntax.
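Here are the two method calls from above put together in one runnable sketch, so you can see what each one produces. The shouting variable is a name I made up to hold the result:

```python
message = 'hello, python for accounting world!'

# Call the upper method on the string; it returns a new, uppercase string.
shouting = message.upper()
print(shouting)

# Call the append method on the list; it changes the list in place.
numbers = [10, 20, 30]
numbers.append(100)
print(numbers)
```

Notice the difference in behavior: upper hands back a new value and leaves message as it was, whereas append modifies numbers directly.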
The parentheses at the end of a method name tell Python to run the
method code. If you try to call a method but forget the parentheses
at the end, Python doesn’t run the method code; instead, it gives
you a description of the method:³
In [4]: numbers = [10, 20, 30]
        numbers.append
Out [4]: <function list.append(object, /)>
3: When you see this kind of output after running a code cell, you
forgot the parentheses after the method name.
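To see that nothing happens to the list when the parentheses are missing, you can compare the list before and after. This is a small sketch; the method and unchanged variables are names I introduced for illustration:

```python
numbers = [10, 20, 30]

# Without parentheses you only get a reference to the method; nothing runs.
method = numbers.append
unchanged = numbers.copy()
print(callable(method))   # the method exists and can be called later
print(unchanged)          # the list still has its original three items

# With parentheses (and an argument) the method code actually runs.
numbers.append(100)
print(numbers)
```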
In [5]: x = 5
        print(x.real)
        print(x.imag)
Out [5]: 5
         0
These are available in the unlikely case you need to work with complex
numbers.
In [6]: x.bit_length()
Out [6]: 3
pythonforaccounting.com/chapter4
Here, the df variable contains all the information in the first sheet
of “Q1Sales.xlsx”: the values in each column of the table,
information about what type of data is stored in each column,
the number of rows, the table headers, and more. It encapsulates
many different pieces of data into a single variable and makes it
easy to work with those data all at once. Methods that are available
on the df object have access to its internal data (e.g., column names,
individual values in the table) and often produce a result using
some of them. For example, you can compute the column-wise
sum of this table by calling the sum method on the DataFrame
variable:
In [8]: df.sum()
We’ll go into more detail about DataFrame objects in the next part
of the book, and you’ll get a clearer idea of how they work then. For
now, remember that all variables in Python are objects, including
the ones we’ve been using so far.
I will go over Python lists in more detail later but, for now, consider
this simple example that uses two list variables:
In [10]: a = [10, 20, 30]
b = a
print(a)
print(b)
This example creates two variables (i.e., a and b) that both point
to the same sequence of values. If you modify this sequence by
appending a new number to it, through either of the variables, the
other variable “changes” as well:
print(a)
print(b)
print(a)
print(b)
x = x + 10
print(x)
print(y)
Out [13]: 20
10
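The sketch below puts the two behaviors side by side: lists that share changes, and numbers that don't. The appended value 40 and the copy step at the end are my own additions for illustration, not part of the examples above.

```python
# Two names for the same list: a change made through one shows up in the other.
a = [10, 20, 30]
b = a
b.append(40)
print(a)
print(b)
print(a is b)   # both names point to the same object

# Numbers behave differently: x + 10 creates a new value and points x at it.
x = 10
y = x
x = x + 10
print(x)
print(y)

# To get an independent list, make an explicit copy.
c = a.copy()
c.append(50)
print(a)
print(c)
```

The is operator checks whether two names point to the very same object, which is a handy way to tell a shared list from a copy.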
When you create a new variable in a notebook cell, you can use
that variable in any of the other code cells in your notebook. It
remains available for you to use until you shut down or restart
your notebook (i.e., from the “Kernel” menu in JupyterLab).
Variables⁷ occupy space in your computer’s working memory (not
on your actual drive, but in your computer’s random access memory),
so the more variables you create, the less free working memory
you will have on your computer.⁸ When you stop your notebook
(e.g., by going to the “Kernel” menu and selecting “Shut Down
Kernel”), Python erases all notebook variables from your computer
memory, freeing it up for other programs that need it.
7: Variables and the data they point to. You can think of the data
variables point to as a temporary file stored in your computer’s working
memory: it has a name, a path and some content, just like a regular file.
The only difference is that you cannot access it through a file navigator,
only reference it through the variable name.
8: This means you can run out of working memory if you create a lot
of variables in your notebooks. While this is not a problem for the tiny
variables we will be working with in this chapter, as you start working
with large tables loaded as DataFrame variables, you might reach the
limits of your computer’s memory.
At times, you will want to delete a variable manually because you
don’t need it anymore or because you created it by mistake (e.g.,
you made a typo in the variable name). To delete a variable from
your notebook (and also remove the data it points to from your
computer’s memory), you can use the del keyword:
In [14]: message = 'hello, python for accounting!'

         del message
After you delete a variable using the del keyword, if you try to
access it or use it in any way, you will see the following error:
In [15]: message
NameErrors tell you that the name (of a variable or function) you
are trying to use is not defined. When you see one, check that you
defined a variable or function with that name in your notebook or
that you haven’t mistyped its name when trying to use it — often,
you will find it is the latter.
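If you'd like to see a NameError without your code stopping, you can catch it. This is a minimal sketch that uses try and except, which this chapter hasn't covered; error_text is a name I picked for illustration.

```python
message = 'hello, python for accounting!'
del message

# Trying to use a deleted (or never defined) name raises a NameError.
try:
    print(message)
except NameError as error:
    error_text = str(error)
    print(error_text)
```

The error text includes the name Python couldn't find, which is usually enough to spot a typo.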
Operators
Now that you know how to create variables, let’s take a look at what
you can do with them. Python has several operators⁹ you can use
to work with variables. Many of these are familiar mathematical
operators.
9: Of the smooth variety.
Arithmetic operators
Out [16]: 12
Out [17]: 10
Table 4.1: Arithmetic operators available in Python. Examples use x=5 and y=2.
Name      Example  Result  Description
Addition  x + y    7       Sum of x and y.
Negation  -x       -5      Negative of x.
In [19]: (-10) ** 2
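The parentheses in (-10) ** 2 are doing real work. In Python, the ** operator is applied before a leading minus sign, so dropping the parentheses changes the result. A minimal sketch (the variable names are mine):

```python
# Parentheses first: minus ten, squared.
with_parentheses = (-10) ** 2

# No parentheses: ten squared, then negated.
without_parentheses = -10 ** 2

print(with_parentheses)
print(without_parentheses)
```

When in doubt about the order in which operators run, add parentheses; they cost nothing and make your intent obvious.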
Comparison operators
Table 4.2: Comparison operators available in Python. Examples use x=5, y=2.
Name Example Result Description
In [20]: x = 5
1 < x < 10
There are a few other operators available in Python that you can use
with conditional statements (e.g., and and or to connect multiple
comparison operators). These operators are most useful in if-else
statements; we’ll come back to them in a few chapters.
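As a small preview, here is how the chained comparison above relates to and and or. This is a sketch with variable names I chose for illustration:

```python
x = 5

# A chained comparison reads like the math notation...
chained = 1 < x < 10

# ...and means the same as two comparisons joined with "and".
spelled_out = (1 < x) and (x < 10)

# "or" is true when at least one side is true.
outside_range = (x < 1) or (x > 10)

print(chained, spelled_out, outside_range)
```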
Assignment operators
You might come across enhanced versions of the equals sign that
include one of the arithmetic operators I mentioned earlier. For
instance, the following examples are equivalent:
In [23]: x = 5
x = x + 2
Out [23]: 7
Out [24]: 7
# equivalent to:
x = 5
x **= 2
In [26]: x
Out [26]: 25
You will come across the more compact style in code examples
elsewhere, which is why I briefly mention it here.
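To make the equivalence concrete, the sketch below runs the long form and the compact form side by side. The variables long_form, compact_form, and squared are names I added so the results can be compared at the end:

```python
# Long form: compute a new value and assign it back to the same name.
x = 5
x = x + 2
long_form = x

# Compact form: the same update written with an augmented assignment.
x = 5
x += 2
compact_form = x

# The pattern works with other arithmetic operators too, such as **.
x = 5
x **= 2
squared = x

print(long_form, compact_form, squared)
```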
Summary
In this chapter, you saw that all Python variables are objects:
they have attributes and methods associated with them. Calling
methods on objects is how you’ll do most of your work when
using Python code. We also looked at Python’s operators, which
you’ll use to work with numbers, either by themselves or in table
columns. Let’s see how you create different types of variables in
Python next.
5 Python’s built-in data types
Python has five simple types of data built-in — most of them
you’ve already used. They are listed below, together with code
examples for each.
These types of data are built into Python, which means they are
always available for you to use — unlike, for instance, the DataFrame
type, which is available only after you import the pandas library,
as you will see later. All Python libraries and tools extend these
simple types into more complicated ones.
You’ll often forget what type your variables are. When that happens,
you can use the built-in type function to find out:
In [1]: x = 10
type(x)
In [2]: y = 'hello!'
type(y)
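Run as a plain script, the two checks above look like the sketch below. The last two lines go one step further than the text and show two common ways to test for a specific type:

```python
x = 10
y = 'hello!'

print(type(x))
print(type(y))

# type returns the type itself, so you can compare it or check it.
print(type(x) is int)
print(isinstance(y, str))
```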
Integers
One thing that makes Python integers stand out is that they can be
huge. For example, in Python you can easily compute:
In [3]: 2 ** 2000
The numbers you work with likely aren’t as big, but this feature of
Python integers is available to you if you need it.
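To get a feel for how large that number is, one option is to count its digits instead of printing it. This is only an illustration; the names big and digits are made up for the example.

```python
big = 2 ** 2000

# Convert the integer to text and count its characters.
digits = len(str(big))
print(digits)  # 603
```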
Floating-point numbers
To deal with this issue in Python, you can round decimal numbers
yourself (assuming you don’t need this level of decimal precision)
with the built-in round function:
In [8]: round(0.30000000000000004, 10)
pythonforaccounting.com/chapter5
decimal package that is part of the Python standard library.3
3: You can read more about high-precision arithmetic with decimal numbers in Python at docs.python.org/library/decimal.
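The sketch below puts the two approaches side by side, assuming the issue in question is the usual floating-point one (0.1 + 0.2 not being exactly 0.3). The variable names are illustrative.

```python
from decimal import Decimal

total = 0.1 + 0.2
print(total)             # 0.30000000000000004
print(round(total, 10))  # 0.3

# Decimal values built from strings keep the digits exactly as written.
exact = Decimal('0.1') + Decimal('0.2')
print(exact)             # 0.3
```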
Booleans
Most often you do not work with booleans directly; they get
produced by some other operation. You can create them yourself if you need to:
In [10]: t = True
f = False
There are three operators you can use with booleans in Python
(and, or, and not):
In [11]: x = 5
y = 2
In [12]: t or (x == 5)
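A small sketch of the three boolean operators, reusing the variables defined above; the result names are made up for the example.

```python
x = 5
y = 2
t = True
f = False

either = t or (x == 5)       # True if at least one side is True
both = f and (y == 2)        # True only if both sides are True
opposite = not f             # flips the value

print(either)    # True
print(both)      # False
print(opposite)  # True
```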
Out [15]: 2
The None value is different from the #N/A value in Excel. In Excel,
an #N/A value generally tells you that a formula couldn’t run —
it’s a “cannot-compute” error message. In Python, None is used to
indicate the absence of a value.
In [18]: value = 1
no_value = None
In [19]: value
Out [19]: 1
In [20]: no_value
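One common way to test for a missing value is the is None comparison. This is a brief sketch, not necessarily how the book checks for it.

```python
value = 1
no_value = None

print(value is None)     # False
print(no_value is None)  # True
```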
Strings
Python strings have many different methods you can use to ma-
nipulate them. For example, to make the first letter of each word
in message uppercase, you can use:
In [23]: message = 'python for accounting'

message.title()
Method      Example               Output                 Description
capitalize  message.capitalize()  Python for accounting  Makes the first character uppercase.
title       message.title()       Python For Accounting  Makes the first character in each word uppercase.
swapcase    message.swapcase()    PYTHON FOR ACCOUNTING  Swaps character case.
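The methods in the table can be tried directly on the message variable. String methods return new strings, so message itself stays unchanged.

```python
message = 'python for accounting'

print(message.capitalize())  # Python for accounting
print(message.title())       # Python For Accounting
print(message.swapcase())    # PYTHON FOR ACCOUNTING
print(message)               # python for accounting
```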
You can also call different methods on the strings you combine or
mix-and-match methods and concatenation as you need to:
In [25]: first_name.capitalize() + " " + last_name.upper()
first_name + " " + last_name + " is " + age + " years old!"
To fix this error, you can convert the age variable (or any other
variable) to a string value by using the built-in str function:
In [28]: age = 83
first_name + " " + last_name + " is " + str(age) + " years old!"
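The following sketch shows both the failure and the fix in one place. The values of first_name and last_name are hypothetical, since they are not shown in this part of the text.

```python
first_name = 'jane'  # hypothetical value
last_name = 'doe'    # hypothetical value
age = 83

try:
    sentence = first_name + " " + last_name + " is " + age + " years old!"
except TypeError as error:
    # Python refuses to join a string and an integer with +.
    print('Cannot concatenate:', error)

sentence = first_name.capitalize() + " " + last_name.upper() + " is " + str(age) + " years old!"
print(sentence)  # Jane DOE is 83 years old!
```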
The next section covers a few more built-in functions that you can
use to convert Python variables from one type to another.
Out [30]: 10
In [31]: float(-10)
In [32]: bool(10)
In [33]: bool(None)
In [34]: str(100)
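As a quick reference, here is a sketch of the conversions above with their results. The variable names are illustrative.

```python
as_int = int('10')       # string to integer
as_float = float(-10)    # integer to floating-point number
truthy = bool(10)        # any non-zero number is True
falsy = bool(None)       # None is False
as_text = str(100)       # integer to string

print(as_int, as_float, truthy, falsy, as_text)
```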
Summary
Python’s built-in collections 6
Now that you’re familiar with simple data types in Python, we can
take a look at its built-in collection types, which are listed in table
6.1 below.
Tuple (tuple), e.g. x = (1, 4, 8)
    Sequence of ordered items that cannot be changed after you define it.
Set (set), e.g. x = {1, 4, 8}
    Sequence of unique values in which the order of items does not matter (i.e., {1, 4, 8} is the same as {4, 1, 8}).
Dictionary (dict), e.g. x = {'first': 1, 'second': 4, 'third': 8}
    Mapping of keys to values. As with set above, and unlike list or tuple, the order of items in a dictionary does not matter (i.e., you access items by key, rather than position).
Lists
Lists1 are collections of items stored together under the same name.
Instead of giving each item a separate name (i.e., assigning each
item to a different variable), you can give the entire collection a
name and access individual items through that common name.
They are useful for working with related data: you can think of a
table column as a list of values and of an entire table as a list of
1: You may be familiar with arrays from Excel’s VBA. Lists in Python are similar to arrays, the difference being that, in Python, you do not have to declare how large your lists are up-front, or what type of values they contain.
Lists usually contain more than one item2 so it’s a good idea to
name them using a plural (e.g., prices, accounts).
2: You can mix data types in the same list if you need to (e.g., strings and integers), but I recommend holding items of the same type in your lists.
Lists are ordered sequences, which means you can access elements
in a list by specifying their position (i.e., by their index). To access
elements in a list, type the list variable name followed by the element
index you want to access, surrounded by square brackets:
In [2]: accounts = ['Assets', 'Equity', 'Income', 'Expenses']

accounts[0]
This gives you the first value in the accounts list. Notice that
the first element in the list is at position 0, not 1, which is the
default way items in a sequence get counted in most programming
languages (this is called zero-based indexing). The second item
has index 1, and so on until the last item in the list, which has
index equal to the number of elements in the list minus one.
To find out how many elements there are in a list, you can use the
built-in len3 function:
3: Short for length.
In [3]: len(accounts)
Out [3]: 4
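A short sketch that ties zero-based indexing and len together. The names first, last and also_last are illustrative.

```python
accounts = ['Assets', 'Equity', 'Income', 'Expenses']

first = accounts[0]                  # position 0 is the first item
last = accounts[len(accounts) - 1]   # last index is the length minus one
also_last = accounts[-1]             # negative indexes count from the end

print(first)      # Assets
print(last)       # Expenses
print(also_last)  # Expenses
```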
Once you access an element from a list, you can use it as you would
any other variable of the same type:
In [4]: accounts[0]
Expenses
Income
Equity
Assets
pythonforaccounting.com/chapter6
If you leave out the first number in the slice (i.e., the number before
the colon), it defaults to 0, meaning you will get everything in the
list up to the number on the right of the colon:
In [8]: accounts[:2]
When you omit the position on the right of the colon, you get all
the items in the list after (and including) the position on the left of
the colon.
You can even use negative indexes in the same way. For example,
to get the last two elements in the accounts list:
In [10]: accounts = ['Assets', 'Equity', 'Income', 'Expenses']

accounts[-2:]
Slicing always gives you a new list, so you can assign a slice of
your original list to another variable, without worrying about the
variables-as-pointers issue I mentioned in chapter 4:
In [11]: a = [1, 2, 3]
b = a[:2]
b.append(10)
print(a)
print(b)
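The slicing rules described above can be checked in one go. The sketch below also shows that changing a slice leaves the original list alone.

```python
accounts = ['Assets', 'Equity', 'Income', 'Expenses']

print(accounts[:2])   # ['Assets', 'Equity']
print(accounts[2:])   # ['Income', 'Expenses']
print(accounts[-2:])  # ['Income', 'Expenses']

a = [1, 2, 3]
b = a[:2]       # b is a new list, not another name for a
b.append(10)

print(a)  # [1, 2, 3]
print(b)  # [1, 2, 10]
```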
You have already seen the append method in action earlier. It adds
a new element at the end of a list. However, if you need to insert an
element at a certain position in your list, you can use the insert
method, which allows you to specify the index at which you want
to insert the new element:
In [13]: accounts = ['Assets', 'Equity', 'Income', 'Expenses']
accounts.insert(1, 'Liabilities')

accounts
Here, you inserted the new item at position 1 in the list and all
elements to the right of it are shifted by one position.
When you don’t know the value of the element you want to remove from
your list, but you know its position in the list, you can call the pop
method on your list variable and specify the index of the item you
want to remove:
In [15]: accounts = ['Assets', 'Equity', 'Income', 'Expenses']
accounts.pop(2)

accounts
In [17]: accounts
If you don’t know the position of the item you want to change, but
you know its value, you can use the index method on the accounts
variable to find its index first:
In [19]: accounts = ['Assets', 'Equity', 'Income', 'Expenses']
accounts.index('Income')
Out [19]: 2
Then you can use the index value to change the element at that
position in the list:
In [20]: accounts = ['Assets', 'Equity', 'Income', 'Expenses']
accounts[accounts.index('Income')] = 'Revenue'

accounts
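To see how insert, pop and index work together, the sketch below applies them one after the other to the same list. The names removed and position are illustrative.

```python
accounts = ['Assets', 'Equity', 'Income', 'Expenses']

accounts.insert(1, 'Liabilities')
# accounts is now ['Assets', 'Liabilities', 'Equity', 'Income', 'Expenses']

removed = accounts.pop(2)   # removes and returns the item at index 2
print(removed)              # Equity

position = accounts.index('Income')  # find the item by value
accounts[position] = 'Revenue'       # then replace it by position

print(accounts)  # ['Assets', 'Liabilities', 'Revenue', 'Expenses']
```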
Organizing a list
The sort method sorts the list by comparing list items to each
other: string items are sorted alphabetically, in ascending order
(i.e., from A to z); numbers are sorted from low to high. If you
want to sort the list in descending order (from high to low), you
can pass a reverse=True argument4 to the sort method:
4: More on functions and arguments in the following chapters.
In [22]: numbers = [9, 1, 16, 4, 36, 25]
numbers.sort(reverse=True)

numbers
Note that both these methods modify the original list you start
with (i.e., they modify the accounts list in-place), so after you use
these methods, you lose the initial ordering of the list. If the initial
order is important, and you need to keep track of it, you can use the
built-in sorted function instead. The sorted function gives you a
sorted copy of the original list to work with, and does not modify
the original list. Similarly, the built-in reversed function goes over
the items in reverse order without modifying the original list; wrap
its result in list to get a new list:
In [24]: accounts = ['Income', 'Assets', 'Expenses', 'Equity']
sorted_accounts = sorted(accounts)
reversed_accounts = list(reversed(accounts))

print(accounts)
print(sorted_accounts)
print(reversed_accounts)
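The difference between sorting in place and sorting a copy is easy to see with numbers. In this sketch, sorted and reversed leave the list untouched, while the sort method changes it.

```python
numbers = [9, 1, 16, 4, 36, 25]

ascending = sorted(numbers)
descending = sorted(numbers, reverse=True)
backwards = list(reversed(numbers))

print(numbers)     # [9, 1, 16, 4, 36, 25] (unchanged so far)
print(ascending)   # [1, 4, 9, 16, 25, 36]
print(descending)  # [36, 25, 16, 9, 4, 1]
print(backwards)   # [25, 36, 4, 16, 1, 9]

numbers.sort()     # in-place: the original order is now lost
print(numbers)     # [1, 4, 9, 16, 25, 36]
```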
Python tuples are very similar to lists, the only difference being
that after you define a tuple, you can’t change its contents (i.e.,
you cannot add, remove, or change its items in any way). Tuples
are useful when the relative order of items in your collection is
important, and you don’t want to modify it accidentally (e.g., by
calling sort on it). You define tuples the same way you define lists
in Python, the difference being that you use parentheses rather
than square brackets:
In [25]: values = (1, 4, 5, 2)

values
Out [26]: 1
But you can’t assign new values to tuple items once the tuple is
defined:
In [27]: values[0] = 9
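The sketch below shows what happens when you try: reading an item works as with lists, while assigning to one raises a TypeError. The try/except block is only there so the example runs to the end.

```python
values = (1, 4, 5, 2)

print(values[0])  # 1

try:
    values[0] = 9
except TypeError as error:
    print('Tuples cannot be changed:', error)

print(values)  # (1, 4, 5, 2)
```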
Because the name “tuple” starts with the sound two, I used to think
tuples can only hold two items, but they can hold as many items
as you need them to.
Sets are sequences of unique items — while Python lists can contain
duplicate items, sets can only have unique items. Sets are a bit
trickier to work with (i.e., you can’t access elements in a set using
their position), but they’re useful when you need to perform set
arithmetic (i.e., union, intersection, or difference of items).
accounts
Notice that even if you try to include duplicates (the 'Assets' item
is duplicated above), they will be discarded from the set.
If you need to perform set arithmetic, you can use the union,
difference or intersection methods, the output of which will be
another set:
In [29]: accounts = {'Assets', 'Liabilities', 'Revenue'}
ledgers = {'Income', 'Assets', 'Equity'}

accounts.union(ledgers)
In [30]: accounts.intersection(ledgers)
In [31]: accounts.difference(ledgers)
Sets are handy when you want to remove duplicates from a list, or
when you want to count the number of unique items in a Python
list:
In [32]: values = [1, 2, 2, 3, 3, 3]

set(values)
Even though you can’t access items in a set by their index, you can
easily convert lists to sets and back to lists (which are much easier
to work with):
In [33]: values = [1, 2, 2, 3, 3, 3]
unique_values = list(set(values))

unique_values
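A compact sketch of set arithmetic and de-duplication follows. Because sets have no guaranteed order, the example uses sorted (rather than list) when turning a set back into a list, which is a choice made here for predictable output.

```python
accounts = {'Assets', 'Liabilities', 'Revenue'}
ledgers = {'Income', 'Assets', 'Equity'}

either = accounts.union(ledgers)           # items in at least one set
both = accounts.intersection(ledgers)      # items in both sets
only_accounts = accounts.difference(ledgers)  # items only in accounts

print(both)  # {'Assets'}

values = [1, 2, 2, 3, 3, 3]
unique_values = sorted(set(values))
print(unique_values)       # [1, 2, 3]
print(len(set(values)))    # 3
```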
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-69-fa8b578b7c9b> in <module>
----> 1 accounts[-1]
Dictionaries
Dictionaries are another useful Python data structure that can store
multiple items. Just like real-world dictionaries that map words to
their description (or translation), Python dictionaries map a key
to a value. Dictionary keys are often string values, but they can
be numbers or boolean values as well; dictionary values can be
anything, including lists or other dictionaries.
In [38]: account['value'] 1
KeyError: 'updated_at'
Or to remove a key and its value, you can use the pop method and
specify the key you want to remove as its argument:5
In [41]: account.pop('created_at')

account
5: Similar to using pop with a list.
In [43]: account = {}

account['name'] = 'Income'
account['value'] = 20100
account['created_at'] = '10/03/2020'

account
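The sketch below walks through the dictionary operations from this section in sequence. The get method is an addition not covered above; it is a built-in dictionary method that returns None instead of raising a KeyError when the key is missing.

```python
account = {}
account['name'] = 'Income'
account['value'] = 20100
account['created_at'] = '10/03/2020'

print(account['value'])           # 20100
print(account.get('updated_at'))  # None (account['updated_at'] would raise KeyError)

removed = account.pop('created_at')  # removes the key and returns its value
print(removed)   # 10/03/2020
print(account)   # {'name': 'Income', 'value': 20100}
```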
Much of Python’s power comes from the fact that you can easily
manipulate its building blocks into data structures that work for
you and your use-cases. Dictionaries can be powerful tools, but
as with most programming concepts, it takes practice to build
intuition around when and how to use them effectively. We’ll
come back to Python dictionaries when we start working with
Excel spreadsheets, mostly to replace values in a column (by
specifying mappings from old values to new ones), or to rename
table columns.
Membership operators
Python has two keyword operators you can use to check whether
a certain value is in a sequence: in and not in. These operators
work with all the collections we have covered so far: lists, tuples,
sets, or dictionaries. For example, you can use them to check
whether a value is in a list:
In [48]: accounts = ['Assets', 'Equity', 'Income', 'Expenses']

'Income' in accounts
You can use these operators together with boolean operators (i.e.,
and and or ) and comparison operators to construct complex
conditional statements about item membership:
In [52]: account = {'name': 'Income', 'value': 20100, 'created_at': '10/03/2020'}

'value' in account and account['value'] > 10000
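A few membership checks side by side. Note that with dictionaries, in looks at keys, not values. The result names are illustrative.

```python
accounts = ['Assets', 'Equity', 'Income', 'Expenses']
account = {'name': 'Income', 'value': 20100, 'created_at': '10/03/2020'}

has_income = 'Income' in accounts               # True
no_liabilities = 'Liabilities' not in accounts  # True
large_value = 'value' in account and account['value'] > 10000  # True
value_as_key = 'Income' in account              # False: 'Income' is a value, not a key

print(has_income, no_liabilities, large_value, value_as_key)
```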
set(values)
In [54]: tuple(values)
In [55]: list(set(values))
Summary
Control flow 7
In this chapter, we’ll take a closer look at if-else statements and
loops — both of which you’ve already seen in the example at the
beginning of chapter 3.
If-else statements
x is positive
The code above prints a message based on x’s value. Notice the
use of indentation and colons after each branch of the if-else
statement. There can be zero or more elif1 branches, and the
final else is optional (i.e., you can have just a single if in your
conditional statement).
1: Short for else if.
In [2]: if x > 0:
print('x is positive')
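For comparison, here is a sketch of a full if-elif-else statement. The messages for the negative and zero branches are assumptions; only 'x is positive' appears in the text above.

```python
x = 5

if x > 0:
    message = 'x is positive'
elif x < 0:
    message = 'x is negative'   # assumed wording
else:
    message = 'x is zero'       # assumed wording

print(message)  # x is positive
```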
Loops
1 is odd
2 is even
3 is odd
4 is even
5 is odd
Assets
Equity
Income
Expenses
To start a for loop, after the for keyword, you specify a variable
name2, the in keyword and a sequence you want to loop over.
You can even read the second example above as “For every account
in the accounts list, print the account”. Python code is not that far
from plain English.
2: Which can be any name you want, as long as it does not start with a number or have spaces in it.
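The two loops below are a sketch that would produce output like the one shown earlier in this section. The exact code behind that output is not reproduced here, so treat the loop bodies and the labels list as assumptions.

```python
labels = []
for number in [1, 2, 3, 4, 5]:
    if number % 2 == 0:
        label = str(number) + ' is even'
    else:
        label = str(number) + ' is odd'
    labels.append(label)
    print(label)

accounts = ['Assets', 'Equity', 'Income', 'Expenses']
for account in accounts:
    print(account)
```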
pythonforaccounting.com/chapter7
Line 10 prints an empty line after each account. Notice that lines
8, 9, 10 are all indented at the same level — this tells Python
those lines are part of the for loop and should be executed at each
step in the iteration.
Besides for loops, Python also has while loops. This type of loop
is useful when you don’t know up-front how many times you need
to repeat an action (i.e., they run until a condition is met).
In [7]: counter = 0
Out [7]: 0
1
2
3
4
You won’t use while loops anywhere in this book, but you might
come across them elsewhere, which is why I briefly mentioned
them here.
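A minimal while loop, assuming a counter that starts at 0 and stops before 5, which matches the numbers printed above. The stopping condition is an assumption.

```python
counter = 0

while counter < 5:      # keep going until the condition becomes False
    print(counter)      # prints 0, 1, 2, 3, 4 on separate lines
    counter += 1        # without this line the loop would never end
```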
Overthinking: break
Once started, loops keep running until they finish all their work
(i.e., while loops keep running until the condition they check
becomes False , for loops until they run out of items to go over).
However, you will sometimes want to end a loop early. In those
cases, you can use the break keyword to stop a loop:
In [8]: accounts = ['Assets', 'Equity', 'Income', 'Expenses']

for account in accounts:
    if account.startswith('I'):
        print('Account name starts with I: ' + account)
        break

    print('Account name does not start with I: ' + account)
In this example, you loop through each item in the accounts list
and check if the item starts with the letter 'I'. If it does not, you
print a message that says that. If it does start with 'I', you print
the account and then run the break command, which ends the
for loop early, before its last iteration (i.e., the one for the
'Expenses' item in the accounts list).
You can use the break keyword to stop your computer from doing
unnecessary work, once you know the result you wanted was
computed. The break keyword can be used with both for and
while loops.
List comprehensions
You can also apply transformations to the values you want to put
in a list comprehension (remember that ** is the exponentiation
operator, so i**2 squares i):
In [10]: squares = [i**2 for i in [2, 3, 4, 5]]

squares
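To make the comprehension less mysterious, the sketch below shows the same result built with a regular for loop, plus a filtered variant. The names squares_loop and even_squares are illustrative.

```python
squares = [i**2 for i in [2, 3, 4, 5]]

# The same list, built step by step with a for loop.
squares_loop = []
for i in [2, 3, 4, 5]:
    squares_loop.append(i**2)

# A comprehension can also skip items with an if at the end.
even_squares = [i**2 for i in [2, 3, 4, 5] if i % 2 == 0]

print(squares)       # [4, 9, 16, 25]
print(squares_loop)  # [4, 9, 16, 25]
print(even_squares)  # [4, 16]
```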
Summary
This chapter went over if-else statements, loops, and list compre-
hensions. Loops let you repeat parts of your code, while if-else
statements let you choose which parts of your code run and which
don’t. These constructs are often used in custom functions to define
bespoke operations on data — let’s look at how you create a custom
function in Python next.
Functions 8
Programming is powerful because you can write a sequence of
actions once and repeat it as many times as you need to. You do
that with loops and functions.
Defining functions
The colon and line indentation you’re already familiar with from
if and for statements are required for function definitions too.
To run the function defined above (i.e., call the function), you need
to type its name followed by an empty set of parentheses; you can
call this function as many times as you need to:
In [2]: print_book_info()
print_book_info()
print_book_info()
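For reference, a function without parameters could look like the sketch below. The function name comes from the text; the body is hypothetical, because the original definition is not reproduced here.

```python
def print_book_info():
    # Hypothetical body: prints a fixed message each time it is called.
    print('Python for Accounting')

print_book_info()
print_book_info()
```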
Most often, you will want to pass some values to a function when
calling it. The values you pass can then be used by the function code
— for instance, to pass a book title to print_book_info, you can
re-define the function above and add a parameter to its definition:
Most functions will need more than one parameter to be useful. You
can use positional arguments or keyword arguments to call functions
with multiple parameters; let’s take a quick look at both.
Keyword arguments
pythonforaccounting.com/chapter8
Positional arguments
If you want to type less, you can use positional instead of keyword
arguments when calling functions. For example, you can call the
print_book_info function above with:
Notice the order of arguments when you call the function is the
same order in which you specified function parameters when you
defined it. These arguments are called positional because the order
in which they are specified (i.e., their position in the sequence of
arguments) is how Python links them to their respective parameter
names inside the function code. If you want your function to work
as you designed it, you need to be mindful of the order you pass
arguments to it.
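The sketch below contrasts the two calling styles. It assumes print_book_info takes title and subtitle parameters and prints them in the format shown later in this chapter; the exact definition is an assumption.

```python
def print_book_info(title, subtitle):
    # Assumed definition, for illustration only.
    print('Book title: ' + title + ', book subtitle: ' + subtitle)

# Keyword arguments: names are explicit, so the order does not matter.
print_book_info(subtitle='A modern guide to Python!', title='Python for Accounting')

# Positional arguments: less typing, but the order must match the definition.
print_book_info('Python for Accounting', 'A modern guide to Python!')

# Swapping positional arguments runs without error but gives the wrong result.
print_book_info('A modern guide to Python!', 'Python for Accounting')
```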
Default values
In this case, the SyntaxError message informs you that the argu-
ment that does not have a default value needs to come before the
one that does have a default value.
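A sketch of the rule follows, again assuming a print_book_info function with title and subtitle parameters. The invalid definition is kept in a string and passed to the built-in compile function, so the example can show the SyntaxError without stopping.

```python
def print_book_info(title, subtitle='A modern guide to Python!'):
    # Assumed definition: subtitle falls back to its default when omitted.
    print('Book title: ' + title + ', book subtitle: ' + subtitle)

print_book_info('Python for Accounting')

# Invalid: a parameter without a default comes after one with a default.
bad_definition = "def print_book_info(title='Python for Accounting', subtitle):\n    pass\n"

syntax_error_raised = False
try:
    compile(bad_definition, '<example>', 'exec')
except SyntaxError:
    syntax_error_raised = True

print(syntax_error_raised)  # True
```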
Argument errors
Return values
Out [11]: 'Book title: Python for Accounting, book subtitle: A modern guide to Python!'
Book title: Python for Accounting, book subtitle: A modern guide to Python!
You can also return more than one value from your functions:
In [13]: def get_book_info(title, subtitle):
return title, subtitle
If you inspect the return value of this function, you will see that
Python wraps the two values it returns inside a tuple:
In [14]: get_book_info('Python for Accounting', 'A modern guide to Python!')
print(book_title)
print(book_subtitle)
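Putting the pieces together: the sketch below calls get_book_info, inspects the tuple it returns, and then unpacks the tuple into two variables. The variable info is an illustrative name.

```python
def get_book_info(title, subtitle):
    return title, subtitle

info = get_book_info('Python for Accounting', 'A modern guide to Python!')
print(type(info))  # <class 'tuple'>

# Unpacking assigns each item in the tuple to its own variable.
book_title, book_subtitle = get_book_info('Python for Accounting', 'A modern guide to Python!')
print(book_title)     # Python for Accounting
print(book_subtitle)  # A modern guide to Python!
```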
Summary
Modules, packages, and libraries 9
Modules, packages, and libraries are different ways to organize
Python code into separate files and folders.1 They’re useful when
you want to share functions or variables across notebooks or with
other Python users. To access code from modules, packages, or
libraries inside your notebooks, you need to import them:
1: They enable Python’s vast ecosystem of tools, which is driven by developers sharing their code files online.
In [1]: import pandas as pd
This line of code gives you access to functions from the pandas
library, which is a collection of Python files stored somewhere on
your computer (i.e., you installed it with Anaconda earlier).
Modules are text files that have some Python code in them. The best
way to understand Python modules is to create one by yourself:
in JupyterLab, under the “File” menu, open the “New” sub-menu
and then select “Text File”. This should create a new file in your
workspace folder called “untitled.txt”. To turn this empty file into a
Python module, first, you need to rename it: right-click it in the file
navigator and select “Rename”. Change its name to “my_module.py”
(make sure the file extension is changed to .py instead of .txt).2
2: Any filename works, as long as it doesn’t have spaces, but the file extension must be .py.
The second step in turning this file into a Python module is adding
some code to it. Open “my_module.py” in JupyterLab and define a
simple function that adds two numbers:
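A minimal sketch of what my_module.py could contain at this point. The function name add_numbers is the one used later in this chapter; the parameter names a and b are assumptions.

```python
# Contents of my_module.py (parameter names are illustrative)
def add_numbers(a, b):
    return a + b
```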
Save the file and close it. In this chapter’s Jupyter notebook (or any
other notebook that’s in the same folder as “my_module.py”) you
can import the module using its filename:
In [3]: import my_module
Out [4]: 7
You just created your own Python module. The main takeaway
from this simple example is that modules are just text files which
contain re-usable Python code. Whenever you use the import
keyword — and we’ll use it often in the following chapters —
remember that all it does is give you access to some code written
in a different file.
Note that whenever you change code inside a module file (by
adding a new function, for example), you have to restart your
notebook kernel and run the import statement again to get access
to your module’s latest changes. This is related to how Jupyter
notebooks work more than it is to Python modules. As an example,
let’s extend my_module with a subtract_numbers function — edit
your my_module.py file to contain the following functions:
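After the edit, my_module.py might look like the sketch below. As before, the parameter names are assumptions; only the function names come from the text.

```python
# Contents of my_module.py after adding a second function
def add_numbers(a, b):
    return a + b


def subtract_numbers(a, b):
    return a - b
```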
After the kernel restarts, run the import my_module cell again. You
should be able to run the following code now:
In [6]: import my_module
my_module.subtract_numbers(2, 5)
Out [6]: -3
add_numbers(2, 5)
pythonforaccounting.com/chapter9
Out [7]: 7
mm.subtract_numbers(2, 5)
Out [8]: -3
Here, you gave my_module an alias called mm, which you can then use
throughout your notebook to refer to my_module and its functions.
You’ve already seen this style of importing used with the
pandas library:
Now you know what this import statement does. Aliases are
usually short identifiers you can use to access module functions
without having to type too much while also making it clear in
your code what module a certain function is from. In the example
above, the pd alias is used to refer to the pandas library — pd is
commonly used for pandas, but you can use any alias you want as
long as it is a valid Python name (i.e., it doesn’t start with a number
or contain spaces).
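The three import styles can be compared using the math module from the standard library, chosen here only because it is always available. The alias m is arbitrary.

```python
import math                 # access functions as math.sqrt(...)
import math as m            # same module, shorter alias
from math import sqrt       # import one function by name

print(math.sqrt(16))  # 4.0
print(m.sqrt(16))     # 4.0
print(sqrt(16))       # 4.0
```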
Similar to how Python modules are just text files (with a .py ex-
tension), Python packages are folders which contain many Python
modules. Packages can also be organized hierarchically, meaning
that Python packages can contain sub-packages (i.e., other folders
containing Python modules) and regular Python modules.
Python modules and packages are tools for organizing and sharing
code across multiple notebooks or with other Python users; they’re
just files and folders with Python code. In practical terms, you’ll
use both modules and packages in the same way, by importing
them into your notebooks and calling their functions. Even so, it’s
a good idea to know what they are so you don’t get confused when
you read about them online or when you see more intricate import
statements in other code examples.
Another library that you will use frequently is the Python standard
library. It is a collection of unrelated modules and packages that
provide all sorts of functionality: from working with large lists
efficiently, to accessing web pages directly in your code. It is
standard because it comes with most Python interpreters (so when
you install Python, you already get these modules and packages).
The os module is helpful when you want to access local files. For
example, to list all the files in your current folder:6
In [10]: import os
os.listdir()
Out [10]: ['my_module.py',
'__pycache__',
'Python ABCs.ipynb',
'.ipynb_checkpoints']
6: In the code output, there are two names you might not recognize. The __pycache__ folder is created by the Python interpreter and is not in any way useful to you. The .ipynb_checkpoints is where JupyterLab auto-saves temporary copies of your notebooks, in case anything happens and your work is not saved (similar to how Excel saves temporary copies of your spreadsheets in case it crashes).
You can use the output of listdir to get a list of files and then go
through it (e.g., using a for loop) to process each file. You can
also pass a folder path as an argument to listdir to get all files
at that path (e.g., os.listdir('C:/Documents/Python workspace')
would list all files and folders at that path).
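As a sketch of looping over listdir output, the example below creates a temporary folder with a few empty files and then collects the Excel ones. The file names are hypothetical, and the temporary folder is used only so the example runs anywhere.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a few empty files to list (hypothetical names).
    for name in ['ledger.xlsx', 'invoices.xlsx', 'notes.txt']:
        with open(os.path.join(folder, name), 'w') as file:
            file.write('')

    excel_files = []
    for name in sorted(os.listdir(folder)):
        if name.endswith('.xlsx'):
            excel_files.append(name)

print(excel_files)  # ['invoices.xlsx', 'ledger.xlsx']
```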
from datetime import date, timedelta

today = date.today()
days_of_holiday = timedelta(days=14)
today + days_of_holiday
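The sketch below repeats the date arithmetic with a fixed date, so the result is the same every time you run it, and adds a string conversion in each direction. Reading '10/03/2020' as day/month/year is an assumption.

```python
from datetime import date, datetime, timedelta

start = date(2020, 3, 10)           # fixed date for a reproducible result
end = start + timedelta(days=14)
print(end)                          # 2020-03-24

# String to date (assuming day/month/year), and date back to string.
parsed = datetime.strptime('10/03/2020', '%d/%m/%Y').date()
as_text = parsed.strftime('%Y-%m-%d')
print(as_text)                      # 2020-03-10
```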
You can also use it to convert string values to date values and back.
We’ll take a closer look at date values when we start working with
tables, in the next part of the book.8
8: More about the datetime module at docs.python.org/library/datetime.
The json module is useful when you want to convert Python
dictionaries to JavaScript object notation (i.e., JSON) or vice-versa.
Many software tools or online services use JSON for various features
(e.g., often JSON is used to store configuration options), so you
might find the json module useful when you want to interact with
such services. If you are not quite sure what JSON is, don’t worry,
we won’t use it anywhere in this book.
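For the curious, a round trip between a dictionary and JSON text takes two function calls. The account dictionary is an illustrative example.

```python
import json

account = {'name': 'Income', 'value': 20100}

text = json.dumps(account)     # dictionary to JSON text
print(text)                    # {"name": "Income", "value": 20100}

restored = json.loads(text)    # JSON text back to a dictionary
print(restored == account)     # True
```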
Besides the libraries I mentioned so far, and the ones we’ll be using
later in the book (e.g., pandas, matplotlib, seaborn), Python’s
ecosystem of libraries is vast (that’s one of Python’s main strengths).
It can also be overwhelming at times: there are so many libraries,
many of them with similar features, it can be hard to figure out
which one to use.
You can use Anaconda Navigator to search for libraries and read
more about each one. Besides Anaconda Navigator, the Python
Package Index (PyPI)9 or GitHub10 are online homes for many
Python libraries. Whenever you need more information about a
library or want to know if it can help with your problem, check
these resources.
9: pypi.org.
10: github.com.
Summary
This chapter showed you how to make your own Python module.
Modules, packages, and libraries are Python features that help you
share code between notebooks or with other Python users.
How to find help 10
If you’ve made it this far, you might be wondering “How do people
remember all this stuff?!”. The answer is they don’t: everyone
searches for help on Google1 all the time. Finding information
about functions and modules is as essential a skill as knowing
Python basics.
1: Or their search engine of choice.
All the tools you’ll be using in this book have extensive documenta-
tion available online. When you get stuck, that’s where you’ll find
how to fix your issues. For instance, Python’s functions, modules,
and packages are documented at docs.python.org — searching for
anything Python-related online will often send you there. The other
libraries we’ll use later in the book have similar documentation
websites that I’ll reference as we progress.
message.title?
More specifically, words start with uppercased characters and all remaining
cased characters have lower case.
Type: builtin_function_or_method
You can use the ? character with any variable or function name,
regardless of what library or module it comes from. If there’s doc-
umentation available, JupyterLab will show it in your notebook.
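The ? shortcut is a Jupyter feature. Outside a notebook, plain Python offers similar help through docstrings and the built-in dir function; the sketch below is one way to use them and is not taken from the book.

```python
message = 'hello, python for accounting!'

# The documentation text that ? displays is stored on the method itself.
print(str.title.__doc__)

# dir lists the names available on a value; skip the internal ones.
methods = [name for name in dir(message) if not name.startswith('_')]
print(methods)
```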
But what if you don’t know what functions or methods are avail-
able? In that case, you can use the autocomplete feature in Jupyter
notebooks. If you type the following code:
In [2]: message = 'hello, python for accounting!'
message.
And instead of running the code cell, you press the TAB key
right after the dot character in your code cell, you should see the
JupyterLab autocomplete menu:
The two online resources you’ll make the most use of as you
continue on your Python journey are Google2 and Stack Overflow.
2: DuckDuckGo or Bing are good alternatives.
pythonforaccounting.com/chapter10
However, there is a catch: if you always have to search the web to get
unstuck, you’ll quickly become frustrated, and using Python will
be a nuisance instead of a benefit. You’ll forget method names or
what arguments to use because everybody forgets the details, but
if you develop the mental models and learn the right vocabulary
and concept names4, you can solve anything with Python, Google
and Stack Overflow; unfortunately, all the googling in the world
4: You can do that by finishing this book!
Summary
Python and all its libraries are well documented. This short chap-
ter showed you how to access documentation directly in your
Jupyter notebook, either by using the ? character or through the
autocomplete menu.
Overthinking: code style 11
This overthinking chapter covers coding style (i.e., how to name
variables, how much whitespace between operators to use, how
to keep code comments useful). As it happens, Python has unambiguous
and widely adopted style guidelines; these guidelines
are explicitly laid out in a document called Python Enhancement
Proposal 8 (also known as PEP 8).1
1: You can read it all at python.org/dev/peps/pep-0008.
Writing code using best practices adopted by other Python users
can make your code easier to understand (by others and by yourself,
when you revisit it). Even more, because you’ll often read other
people’s code when looking for examples online, adopting these
best practices for your code will make it easier for you to understand
their code as well.
Python Enhancement Proposals (commonly known as PEPs) are documents created by developers in the Python community in which new Python features are discussed. PEPs are the primary mechanism through which the Python programming language evolves and is adapted to meet the ever-changing requirements of modern technology. All PEPs are publicly available at python.org/dev/peps.
PEP 8 is a lengthy document, covering all sorts of formatting issues
related to Python code (e.g., how many empty lines between function
definitions, how many whitespace characters after opening
parentheses, and many others). However, in this chapter, I’m going
to introduce the essential style rules. You don’t have to follow all
these practices for your code to run without errors, but knowing
about these practices will help you write cleaner code.
The Python examples you find in this book follow PEP 8 guidelines
so if you write your code in the same style, it will follow tried-and-
tested practices that most Python users have adopted. As a general
rule, if you have to choose between code that is easier (or faster) to
write, and code that is easier to read, always choose the latter.
name. Not following this rule won’t cause any apparent errors at
first but it will make your code behave in strange ways.
It’s also a good idea to keep all your variable or function names
lowercase, with distinct words separated by underscores, such as
account_name. This style of writing compound words is known as
snake case.2 Python uses snake case almost everywhere (in variable
names, function names, module or package names), so it makes
2: Another instance of programmer humor: Python uses snake case.
Most people use English as their coding language (i.e., their variable
or function names use English words). This is because almost
all Python resources (i.e., official documentation, tutorials, Q&A
content) and most programming tools (e.g., JupyterLab, Python
itself) are available in English only. You don’t have to make the
same choice, particularly if your code is for your eyes only, but I
recommend you try to use characters from the Latin alphabet in
your code — so no ñ, ô or any other accented characters.
Sometimes you will need to write a long line of code that involves
many different variables:
In [1]: income = (gross_wages + taxable_interest + (dividends - qualified_dividends) -
ira_deduction - student_loan_interest)
Having very long code lines can make it hard to understand what
the code does exactly (in the example above, it’s difficult to follow
which variables get added and which get subtracted to get the
income variable). To improve readability, you should break lines
before mathematical operators and indent each subsequent line:
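A minimal sketch of that layout, reusing the variable names from the example above. The amounts are made up purely so the snippet runs on its own:

```python
# Made-up amounts, for illustration only.
gross_wages = 52_000.00
taxable_interest = 310.50
dividends = 1_200.00
qualified_dividends = 800.00
ira_deduction = 3_000.00
student_loan_interest = 1_150.25

# Each operator starts its own line, so the additions and subtractions
# line up and are easy to scan. The parentheses keep it one expression.
income = (gross_wages
          + taxable_interest
          + (dividends - qualified_dividends)
          - ira_deduction
          - student_loan_interest)

print(income)
```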
pythonforaccounting.com/chapter11
Notice that breaking lines this way requires the use of parentheses
around the formula. Leaving the parentheses out will prompt you
with an indentation error:
In [3]: income = gross_wages
            + taxable_interest
            + (dividends - qualified_dividends)
            - ira_deduction
            - student_loan_interest
# No:
i=i+1
# No:
def get_account(name, id = 0.0):
    return find_account(n = name, i = id)
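For contrast, here is a sketch of how the two snippets marked "No" above might look when written the way PEP 8 recommends: single spaces around the assignment and arithmetic operators, and no spaces around = in default values or keyword arguments. The find_account function below is a stand-in written only so the example runs; it is an assumption, not part of any real library.

```python
def find_account(n, i=0):
    # Hypothetical helper: returns a small dictionary describing an account.
    return {"name": n, "id": i}


# Yes: single spaces around = and +
i = 0
i = i + 1


# Yes: no spaces around = in default values or keyword arguments
def get_account(name, id=0):
    return find_account(n=name, i=id)


account = get_account("Cash", id=1010)
print(i, account)
```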
More generally, avoid extra space on the same line when calling
functions or when defining variables:
In [6]: # Yes:
long_function_name(list_argument[1], dictionary_argument['key'])
# No:
long_function_name( list_argument[ 1 ] , dictionary_argument[ 'key' ] )
In [7]: # Yes:
x = 1
y = 2
long_variable = 3
# No:
x             = 1
y             = 2
long_variable = 3
There’s much more to styling Python code than I can cover here.3
As with naming variables, the best way to acquire code styling
best practices is to read and write more code.
3: Companies like Google or Facebook typically document their coding
style practices in entire books.
Code comments
It’s up to you to figure out the level of detail needed in your code,
and as with naming variables, getting this right takes some practice.
At first, you can add a cell at the top of your notebook with your
name, date, and short description of what you want to achieve
in that notebook. Even simple comments like these will help you
keep your work organized:
In [10]: # author: horatio
         # date: 08/17/2020
         # description: python ABCs
Jupyter notebooks also allow you to write styled text using Mark-
down cells. You might not be sure whether to include code expla-
nations in comments or Markdown cells. A good rule of thumb
is to keep code explanations or meta-data (such as your name
and date) in comments and longer narrative text that explains the
data analysis process or the business context for the analysis in
Markdown cells.
Summary
This chapter showed you some tips for how to style your Python
code: how to name variables, how to break long lines of code so
they’re easier to read, how to use whitespace and comments, and
a few others. These style guidelines are a part of Python, just like
its keywords and operators; following them will pay off.
Working with tables part two
Working with tables is where the rubber meets the road in using
Python for accounting. This part of the book introduces the main
features of the pandas1 Python library: a powerful and straightforward
data manipulation tool that you will be using to work with tables
(whether Excel spreadsheets, CSV files, or any other tabular data)
in Python. If you used Anaconda to install Python, pandas is already
on your computer.
1: The pandas library has been in active development since 2010 and now
has almost two thousand developers and enthusiasts from various fields
contributing to it. It is a mature piece of software that has become an
essential data science tool. It was initially designed for financial time
series analysis but has slowly taken over many other domains.
Many of the things you do in Excel right now can be done in pandas,
but often in an easier, faster, and reproducible way. Learning how
to use pandas will give you the most return on the investment
you’re making by reading this book.
Setup
There are a few problems with this dataset, some of which you may
have already noticed (e.g., missing values, duplicate rows, and a
few others). As I introduce more of Python’s table handling tools
in the following chapters, you’ll see how easy it is to fix these
problems with Python and pandas.
Later in this part of the book, you’ll see how you can join the two
datasets on their common ProductID column using pandas.
I designed these datasets to suit the tasks you’ll be doing, but they
are based on real-world public data.2 Some of the project chapters
ahead use other datasets instead of the ones mentioned above;
I’ll introduce those datasets as they’re needed.
2: The products dataset I used to create these data was made available
by Julian McAuley at jmcauley.ucsd.edu/data/amazon.
When you’re ready, launch JupyterLab, go to your workspace, open
the first notebook in part two, and let’s see what pandas can do.
Pandas in a nutshell 12
This chapter is a brief tour of pandas’s main features. It aims to
show you what Excel tables look like in Jupyter notebooks and get
you acquainted with pandas syntax — which is slightly different
from the Python code you saw in the previous chapters. Even
though pandas has its particular code style, it is designed around
the same principles of code readability and simplicity at the core
of Python, so you’ll quickly learn its ropes.
- Reading and writing data: from and to various data sources, including
  Excel spreadsheets, CSV or HTML files, or different databases.
The pandas library has built-in features that support all these high-
level tasks. We’ll go over these features in plenty of detail later; for
now, let’s take a quick look at how pandas works.
Getting started
You already know that the import keyword gives you access to
Python code stored somewhere on your computer (i.e., the Python
libraries you installed with Anaconda). In this case, the import
statement gives you access to pandas’s code. It is common to use
the pd alias when importing pandas, as above. This alias makes it
easier to reference pandas functions without having to type pandas
You can read the “Q1Sales.xlsx” Excel file from your workspace
folder and assign it to a variable called ledger_df using pandas’s
read_excel function:
Out [3]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bom... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtl... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age... ... 13.46 6 80.76
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Gi... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/... ... 4.18 1 4.18
14052 15584 iBay.com NaN ... 4.78 25 119.50
14053 15585 Understock.com Sirius Satellite... ... 33.16 2 66.32
The output above looks slightly different from the one in your
Jupyter notebook, which should look more like figure 12.1 below.
I use a compact display style for tables in the book, but they still
refer to the same underlying data as those you see in your Jupyter
notebook when you run the code examples.
By default, large tables (those with more than 60 rows, unless you
change pandas’s display settings)1 are shown this way to keep them
from filling up your entire notebook; the underlying data is still
there, just not shown in your notebook.
1: If your tables have more than twenty columns, the middle columns
will also be truncated when you try to see them in your notebook.
pythonforaccounting.com/chapter12
Figure 12.1: You can view a DataFrame variable in your Jupyter notebook by typing its name in a cell and running that cell.
By default, the output is truncated, so you do not see the entire DataFrame, but rather only its top five and bottom five rows
(if your DataFrame has more than twenty columns, the middle columns will also be truncated when you try to view it). This
is by design because working with tables using pandas is different from working with tables in Excel: you do not edit table
cells directly, you manipulate entire columns or the whole table by writing code. In such a setting, seeing the entire table in
your notebook is less useful and just occupies a lot of visual space.
The key thing to remember is that ledger_df still contains all the
data read from “Q1Sales.xlsx”3 even though when you display it in
your Jupyter notebook, you do not see all of its contents.
3: Right now it contains data from the first sheet in “Q1Sales.xlsx”, but
you can read data from multiple sheets just as easily.
Now that you know what a DataFrame looks like in your Jupyter
notebook, let’s go through some common ways to filter, prepare,
Unlike Excel, in pandas you reference columns by name4, not by
label: instead of column B, you have column 'Channel'.
4: Column names are the ones listed in the table header.
To select all values from the 'Channel' column, run the following code:
In [4]: ledger_df['Channel']
In [5]: ledger_df['Channel'].value_counts()
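To make these two cells concrete without the workbook at hand, here is a small sketch that builds a stand-in table in memory (the rows are made up) and then selects a column and tallies its values:

```python
import pandas as pd

# A tiny stand-in for the sales table (values are made up).
ledger_df = pd.DataFrame({
    "InvoiceNo": [1532, 1533, 1534, 1535],
    "Channel": ["Shoppe.com", "Walcart", "Bullseye", "Bullseye"],
})

# Square brackets with a column name return that column as a Series.
channels = ledger_df["Channel"]

# value_counts tallies how many times each distinct value appears.
channel_counts = channels.value_counts()
print(channel_counts)
```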
The power of using Python and pandas is even more apparent when
you use the two together. For instance, you can define a custom
Python function and apply it to all the values in a column:
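In [7]: def make_upper(value):
            # replace removes the '.com' text; strip('.com') would instead
            # trim any leading or trailing '.', 'c', 'o', or 'm' characters
            return value.replace('.com', '').upper()

        ledger_df['Channel'].apply(make_upper)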
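The same idea can be tried on a handful of made-up channel names. This sketch uses str.replace to drop the '.com' text, because str.strip treats its argument as a set of characters to trim from both ends rather than as a suffix:

```python
import pandas as pd

channels = pd.Series(["Shoppe.com", "Walcart", "iBay.com"], name="Channel")


def make_upper(value):
    # Drop the '.com' text, then convert to upper case.
    return value.replace(".com", "").upper()


# apply calls make_upper once for every value in the Series.
upper_channels = channels.apply(make_upper)
print(upper_channels.tolist())
```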
14050 BULLSEYE
14051 UNDERSTOCK
14052 IBAY
14053 UNDERSTOCK
Name: Channel, Length: 14054, dtype: object
Out [9]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 SHOPPE Cannon Water Bom... ... 20.11 14 281.54
1 1533 WALCART LEGO Ninja Turtl... ... 6.70 1 6.70
2 1534 BULLSEYE NaN ... 11.67 5 58.35
3 1535 BULLSEYE Transformers Age... ... 13.46 6 80.76
4 1535 BULLSEYE Transformers Age... ... 13.46 6 80.76
... ... ... ... ... ... ... ...
14049 15581 BULLSEYE AC Adapter/Power... ... 28.72 8 229.76
14050 15582 BULLSEYE Cisco Systems Gi... ... 33.39 1 33.39
14051 15583 UNDERSTOCK Philips AJ3116M/... ... 4.18 1 4.18
14052 15584 IBAY NaN ... 4.78 25 119.50
14053 15585 UNDERSTOCK Sirius Satellite... ... 33.16 2 66.32
You can define and apply any custom function you need on any of
the table columns — even on multiple columns at once.
Out [10]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
1342 2874 WALCART 3 in 1 Camera Le... ... 15.01 34 510.34
5812 7344 WALCART Olympus SP-820UZ... ... 19.48 16 311.68
6148 7680 WALCART Casio EXILIM Dig... ... 14.91 18 268.38
8358 9890 WALCART 3 in 1 Camera Le... ... 15.01 23 345.23
10577 12109 WALCART Foscam FI8910W W... ... 23.69 40 947.60
11713 13245 WALCART Foscam New Versi... ... 4.12 38 156.56
[6 rows x 12 columns]
The code above filters the original table to keep rows where
values in the 'Channel' column are 'WALCART' exactly, values in the
'Product Name' column contain the word 'Camera', and values in
the 'Quantity' column are greater than 10.
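One way to express those three conditions is sketched below on a made-up four-row table. This is an illustration under those assumptions and may differ from the exact code used to produce the output above:

```python
import pandas as pd

# Made-up rows standing in for the sales table.
ledger_df = pd.DataFrame({
    "Channel": ["WALCART", "WALCART", "BULLSEYE", "WALCART"],
    "Product Name": ["3 in 1 Camera Lens", "LEGO Ninja Turtles",
                     "Olympus Camera", None],
    "Quantity": [34, 12, 20, 15],
})

is_walcart = ledger_df["Channel"] == "WALCART"
# na=False treats missing product names as "does not contain".
has_camera = ledger_df["Product Name"].str.contains("Camera", na=False)
is_large = ledger_df["Quantity"] > 10

# & keeps only the rows where all three conditions are True.
filtered_df = ledger_df[is_walcart & has_camera & is_large]
print(filtered_df)
```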
If any of these examples are unclear right now, don’t worry, we’ll
revisit them in the following chapters. But first, let’s quickly turn
daily_totals_df into a plot.
Visualizing data
As you can see, plots are displayed underneath code cells, directly
in your notebook. The plot here uses the daily_totals_df table
you created earlier with the pivot_table method. It shows the total
daily revenues for each of the channels in these sales data. While
the details of how to customize this plot will have to wait, you can
already see how simple it is to create the plot.
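As a rough sketch of how a table like daily_totals_df can be produced with pivot_table, the example below uses a made-up four-row table. The 'Date' column name is an assumption chosen for illustration; the real dataset may name its date column differently.

```python
import pandas as pd

# Made-up sales rows; 'Date' is a hypothetical column name.
sales_df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01",
                            "2020-01-02", "2020-01-02"]),
    "Channel": ["WALCART", "BULLSEYE", "WALCART", "WALCART"],
    "Total": [100.0, 50.0, 25.0, 75.0],
})

# One row per day, one column per channel, each cell the sum of 'Total'.
daily_totals_df = sales_df.pivot_table(
    index="Date", columns="Channel", values="Total", aggfunc="sum"
)
print(daily_totals_df)
```

Calling daily_totals_df.plot() on the result would then draw one line per channel, provided matplotlib is installed.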
You can make various types of plots using pandas, not just line
plots. In addition, there are other Python visualization libraries
you can use for specific use-cases — we’ll cover the most popular
ones later in the book.
After you run the code, open “DailyTotals.xlsx” in Excel5 to see that
it has the same contents as daily_totals_df.
5: The file will be in the same folder as your Jupyter notebook.
This concludes our quick tour of pandas. The aim of this tour was
to show you what tables look like in Jupyter notebooks and briefly
illustrate how pandas can be used for everyday operations with
tabular data (e.g., filtering, pivoting, plotting). As you’ll see over
the next chapters, pandas provides out-of-the-box tools for almost
any kind of data manipulation you can imagine. On top of that,
pandas is engineered to be fast even with data many times larger
than what Excel can handle.
Now that we covered the quick tour of pandas, let’s take a more
scenic route and look at the details of how pandas works.
Tables, columns, and values 13
If you work with Excel, you already know what tables and columns
are; pandas has equivalent data structures1 that go by different
names but are otherwise similar to the tables and columns you
know well. These are the DataFrame, Series, and Index objects.
1: A data structure is an abstraction that refers to a collection of values
and the functions, methods, or operations that can be applied to those
values. For example, a Python list is a data structure.
As you saw in the previous chapter, a DataFrame is a table represented
in Python code. It is made up of multiple Series objects
(i.e., a DataFrame is made up of one or more Series objects just like
a table is made up of one or more columns). Each row or column in
a DataFrame object has an associated label you can use to access its
values (e.g., a column or row name). These column and row labels
are stored in an Index object, which is also part of the DataFrame.
Series
values['second_row']
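A minimal sketch of a labeled Series like the one accessed above. The labels and numbers are assumptions borrowed from this chapter's later DataFrame example, since the original cell that defines values is not shown here:

```python
import pandas as pd

# Dictionary keys become row labels; dictionary values become the data.
values = pd.Series({"first_row": 3, "second_row": -20, "third_row": 121})

# A label in square brackets returns the value stored under it.
print(values["second_row"])
```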
DataFrame
DataFrame objects are tables represented in Python code.3 Besides
being containers for their values, they are also a collection of
methods you can apply to those values.
3: Because you’ll be using pandas exclusively in the chapters ahead, I’ll
sometimes use the terms table and DataFrame interchangeably.
You can put multiple Series objects in a Python dictionary to get
a DataFrame. You won’t create DataFrame objects this way often
because pandas puts everything together for you whenever you
read data from a file.4
4: Using its read_excel or any of the other read functions.
However, for this example:
In [6]: df = pd.DataFrame({
            "first_column": pd.Series({"first_row": 3, "second_row": -20, "third_row": 121}),
            "second_column": pd.Series({"first_row": 0.23, "second_row": 2, "third_row": -3.5}),
            "third_column": pd.Series({"second_row": 'Assets'}),
        })
        df
In [7]: df['first_column']
pythonforaccounting.com/chapter13
In [8]: df['first_column']['second_row']
You may have noticed that our hand-crafted DataFrame has a few
NaN5 values (if not, take a look at the last column in the table above).
When you put multiple Series together in a DataFrame, they get
aligned around their row labels. Wherever the Series row labels
don’t match, pandas adds NaN values so that each column in the
final DataFrame has the same number of entries. The NaN is similar
to Python’s None in that both are used to indicate the absence of
a value, unlike Excel’s #N/A, which tells you something went
wrong when calling a function.
5: NaN stands for not a number, not the bread.
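The alignment described above can be seen in a two-column sketch that reuses the labels from this chapter's example. The isna method marks the cells that pandas filled with NaN:

```python
import pandas as pd

first = pd.Series({"first_row": 3, "second_row": -20, "third_row": 121})
third = pd.Series({"second_row": "Assets"})

# The two Series are aligned on their row labels when combined.
df = pd.DataFrame({"first_column": first, "third_column": third})

# isna marks the cells pandas filled in with NaN during alignment.
missing = df["third_column"].isna()
print(df)
print(missing.sum())
```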
Every DataFrame has labels associated with its rows and columns.
These labels are accessible through the index, columns, and axes
DataFrame attributes.6 For example, to access df’s column names,
you can run:
In [9]: df.columns
In [12]: df.index[1]
Index objects are not simple Python lists because they are designed
to make accessing values in a DataFrame efficient and are more
complicated in their internal machinery than Python lists. However,
there is no practical difference: you will often use them as you use
Python lists.
In [14]: df.transpose()
In the table above, the first row in df is now the first column. You
probably won’t need to rotate tables often, but this example helps
illustrate that DataFrame objects have the same kind of labels on
both columns and rows, and there is no difference between the
way they are stored or how they work.
Axes
You can also access a DataFrame’s row and column labels through
its axes attribute:
In [15]: df.axes
The output above is a Python list containing df’s row and column
Index objects. There’s no practical benefit in using the axes attribute
instead of columns or index; you get the same result by running:
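A short sketch of that equivalence, using a small table with the same labels as this chapter's example (built here from plain lists so it runs on its own):

```python
import pandas as pd

df = pd.DataFrame(
    {"first_column": [3, -20, 121], "second_column": [0.23, 2, -3.5]},
    index=["first_row", "second_row", "third_row"],
)

# axes is a two-item list: row labels first, column labels second.
row_labels, column_labels = df.axes

print(row_labels.equals(df.index))
print(column_labels.equals(df.columns))
```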
Similarly, you can use the same method to find out how many
non-empty values there are in each of df’s rows with:
In [18]: df.count(axis='columns')
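To see how the axis argument changes what count returns, here is a sketch on a two-column version of this chapter's example table:

```python
import pandas as pd

df = pd.DataFrame({
    "first_column": pd.Series({"first_row": 3, "second_row": -20, "third_row": 121}),
    "third_column": pd.Series({"second_row": "Assets"}),
})

# By default, count works down each column.
per_column = df.count()
# With axis='columns', count works across each row instead.
per_row = df.count(axis="columns")

print(per_column)
print(per_row)
```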
[Figure 13.1 panel labels: DataFrame, Series; axis labels: axis='columns', axis='rows']
Figure 13.1: Visual representation of a pandas DataFrame (on the left) and of a pandas Series (on the right). Data values
are shown in pink, whereas column and row labels are shown in blue. In both figures, blue boxes at the top represent the
column Index and blue boxes on the left represent the row Index.
If you work with Excel, you probably know that you can change
column or cell type through the “Format” menu.7 Figure 13.2 below
shows Excel’s cell formatting dialog, with the different data types
you can choose from.
7: At least in Microsoft Excel 16.
In [20]: df.info()
The output above has several pieces of information about your table:
how many rows it has (i.e., Index: 3 entries), their labels (i.e.,
first_row to third_row), how many columns (i.e., Data columns
(total 3 columns)) and for each column, its name, the number of
non-empty values (under Non-Null Count) and the type of data it
stores (under Dtype, for data type). In this example, the first column
stores whole numbers (i.e., integers), the second column stores
decimal numbers (i.e., floats), and the last column has a mix of
missing values (i.e., NaN) and string values. The object type is a
generic data type that pandas uses when there are different types
of values in the same column.
The last two lines in the output above show how many columns
there are with each data type, and how much of your computer’s
memory is used to store the table (in this case, not much).
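If you only need the column types rather than the full info report, the dtypes attribute lists them directly. A small sketch using the first two columns of this chapter's example table:

```python
import pandas as pd

df = pd.DataFrame({
    "first_column": pd.Series({"first_row": 3, "second_row": -20, "third_row": 121}),
    "second_column": pd.Series({"first_row": 0.23, "second_row": 2, "third_row": -3.5}),
})

# dtypes lists the data type pandas assigned to each column.
print(df.dtypes)

# Helper functions in pd.api.types check a column's type family.
is_int = pd.api.types.is_integer_dtype(df["first_column"])
is_float = pd.api.types.is_float_dtype(df["second_column"])
print(is_int, is_float)
```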
Table 13.1: pandas data types and their Excel equivalents. These (and more) types are described in the official pandas
documentation at pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes.
Type      Alias                                       Excel equivalent   Description
Float     float64                                     Number             Columns storing decimal numbers (i.e., floating-point numbers).
Integer   int32, int64, Int32, Int64, Int32, Int64    Number             Columns storing whole numbers (i.e., positive or negative integers).
But if you change9 the column data type to 'string', you can’t
compute the sum anymore:
9: More on changing column types in the following chapters.
In [22]: df['first_column'].astype('string').sum()
can’t figure out the right types because there’s something wrong
with the data themselves (e.g., numbers are stored with currency
symbols), and it assigns the most generic data type to columns (i.e.,
the 'object' type). In that case, some methods might not work as
you expect them to, and you will need to clean the data and assign
types yourself — something we’ll look at later.
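As a preview of what such cleaning can look like, here is one possible approach sketched on made-up prices stored as text with a currency symbol. It is an illustration only, not necessarily the method the later chapters use:

```python
import pandas as pd

# Amounts stored as text with a currency symbol (made-up values).
raw_prices = pd.Series(["$20.11", "$6.70", "$11.67"])

# Remove the symbol, then convert the remaining text to decimal numbers.
prices = raw_prices.str.replace("$", "", regex=False).astype(float)

print(raw_prices.dtype)
print(prices.dtype)
print(prices.sum())
```

Passing regex=False matters here because "$" has a special meaning in regular expressions.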
Summary
The following chapter goes over the pandas methods you can use
to read and write Excel files.
Reading and writing Excel files 14
Now that you know how to represent tables in Python code, let’s
take a look at some of the ways you can read and write data from
or to Excel files using pandas.
You probably work with spreadsheets1 often.
1: Whether from an Excel file or any other file that has one of the
following extensions: .xls, .xlsx, .xlsm, .xlsb, or .odf.
pandas comes with a read_excel function to read data from a
spreadsheet into a DataFrame variable; you’ve already seen it in
action earlier, in our brief tour of pandas. To read data from the
“Q1Sales.xlsx” file into a DataFrame, you can use:
In [2]: ledger_df = pd.read_excel('Q1Sales.xlsx')
Out [3]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb ... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age o... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age o... ... 13.46 6 80.76
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power S... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Giga... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/37... ... 4.18 1 4.18
14052 15584 iBay.com NaN ... 4.78 25 119.50
14053 15585 Understock.com Sirius Satellite R... ... 33.16 2 66.32
Out [4]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 29486 Walcart Vic Firth American... ... 22.39 6 134.34
1 29487 Walcart Archives Spiral Bo... ... 25.65 1 25.65
2 29488 Bullseye AKG WMS40 Mini Dua... ... 8.98 2 17.96
3 29489 Shoppe.com LE Blue Case for A... ... 8.33 9 74.97
4 29490 Understock.com STARFISH Cookie Cu... ... 17.96 80 1436.80
... ... ... ... ... ... ... ...
9749 39235 iBay.com Nature's Bounty Ga... ... 5.55 2 11.10
9750 39216 Shoppe.com Funko Wonder Woman... ... 28.56 1 28.56
9751 39219 Shoppe.com MONO GS1 GS1-BTY-B... ... 3.33 1 3.33
9752 39238 Shoppe.com NaN ... 34.76 10 347.60
9753 39239 Understock.com 3 Collapsible Bowl... ... 6.39 15 95.85
If you need to read data from different sheets and keep those data
as separate variables, you can use read_excel multiple times:
In [5]: jan_ledger_df = pd.read_excel('Q1Sales.xlsx')
        feb_ledger_df = pd.read_excel('Q1Sales.xlsx', sheet_name='February')
        mar_ledger_df = pd.read_excel('Q1Sales.xlsx', sheet_name='March')
You won’t always remember the actual sheet names in your Excel
files, but you might remember their position in the file. In that case,
you can specify their position instead of their names in the value
you pass to sheet_name:
In [6]: ledger_df = pd.read_excel('Q1Sales.xlsx', sheet_name=0)
Table 14.1: Keyword arguments available with pandas’s read_excel function. You can use any combination of keyword
arguments when reading data, as needed.
Parameter name   Example                                              Description
sheet_name       pd.read_excel('Q1Sales.xlsx', sheet_name='March')    Reads all data from a sheet named “March” in “Q1Sales.xlsx”.
pythonforaccounting.com/chapter14
You can use other pandas functions to read tabular data from
different kinds of files, not just spreadsheets; read_csv might
be useful if you work with CSV files often.2
2: Others include read_html, read_spss, and read_sas.
You can find the other pandas data-reading functions with the
autocomplete feature in JupyterLab by typing pd.read_ in a separate
cell and pressing the TAB key:
In [7]: pd.read_<TAB>
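As a small illustration of one of these functions, the sketch below reads CSV text with read_csv. The text is held in memory (via io.StringIO) as a stand-in for a file on disk, and the rows are made up:

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (made-up rows).
csv_text = """InvoiceNo,Channel,Total
1532,Shoppe.com,281.54
1533,Walcart,6.70
"""

# read_csv accepts a file path or any file-like object.
ledger_df = pd.read_csv(io.StringIO(csv_text))
print(ledger_df)
```

With a real file, you would pass its path instead, for example pd.read_csv('Q1Sales.csv') (a hypothetical file name).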
Now that you have the sales data in a DataFrame, you can start
slicing, selecting, and sorting it any way you want to; but first, let’s
take a closer look at this dataset.
Inspecting data
You already know how to view a DataFrame variable: type its name
in a cell and run it. By default, this will display the top and bottom
five rows of your DataFrame.5
5: Later, we will go over changing the default number of rows shown when
inspecting a DataFrame variable, if you want to see more or fewer rows.
In [8]: ledger_df
Out [8]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb ... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age o... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age o... ... 13.46 6 80.76
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power S... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Giga... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/37... ... 4.18 1 4.18
14052 15584 iBay.com NaN ... 4.78 25 119.50
14053 15585 Understock.com Sirius Satellite R... ... 33.16 2 66.32
When the displayed table is truncated (i.e., when your table has
more than 60 rows, pandas’s default display limit), the last line in
the output tells you how many rows and columns there are in total.
If you want to take a quick look at your data, there are two
DataFrame methods you can use to display the top or bottom rows:
head and tail. For example, to display just the first 3 rows in your
DataFrame, you can use the head method:
In [9]: ledger_df.head(3)
Out [9]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb Ba... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles S... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
If you want to see more of the table, you can pass a larger number
as an argument to head (e.g., head(20) to see the top 20 rows).
Similarly, to display the last 3 rows, you can use:
In [10]: ledger_df.tail(3)
Out [10]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
14051 15583 Understock.com Philips AJ3116M/37... ... 4.18 1 4.18
14052 15584 iBay.com NaN ... 4.78 25 119.50
14053 15585 Understock.com Sirius Satellite R... ... 33.16 2 66.32
[3 rows x 12 columns]
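The same two methods can be tried on a small stand-in table. This sketch uses twenty made-up invoice numbers so it runs without the workbook:

```python
import pandas as pd

# Twenty made-up invoices, numbered 1532 through 1551.
ledger_df = pd.DataFrame({"InvoiceNo": range(1532, 1552)})

first_rows = ledger_df.head(3)
last_rows = ledger_df.tail(3)
# With no argument, head and tail return five rows.
default_rows = ledger_df.head()

print(first_rows)
print(last_rows)
```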
there are in each column, what type of values they hold), and how
much memory the entire table uses up.
In [11]: ledger_df.info()
Just as the read_excel function reads data from Excel files, pandas
DataFrame objects have a to_excel method that writes their
contents to an Excel file. For example, to write ledger_df to an
Figure 14.1: Screenshot of “JanQ1Sales.xlsx” opened in Excel. Notice that ledger_df’s row labels are now column A of the
output spreadsheet. Using the index=False keyword argument with to_excel removes them from the final spreadsheet.
The file path above uses forward slashes instead of the more
familiar backslashes used in Windows file paths. If you find using
forward slashes annoying, you can also write file paths using
backslashes, but you’ll have to put an r in front of them:
In [15]: ledger_df.to_excel(r'C:\Users\Horatio\Documents\Python for Accounting\JanQ1Sales.xlsx')
Both read_excel and to_excel work with full paths. Keep in mind
that to_excel will overwrite any existing file with the same name
and path as the one you specify without asking twice, so make
sure you don’t overwrite anything important.
You can also specify the name of the sheet to write to, if you pass a
value as the sheet_name argument when using to_excel:
In [16]: ledger_df.to_excel('JanQ1Sales.xlsx', sheet_name='Sales')
This will write all data in ledger_df to a single8 sheet named
“Sales” in “JanQ1Sales.xlsx”. If you open the new file with Excel, you
should see something similar to figure 14.1.
8: How to write data to multiple sheets in the same Excel file will have
to wait for the first project chapter.
Notice in figure 14.1 that ledger_df’s row labels are part of the
spreadsheet (column A). Row labels are not particularly useful in
Excel, so you can exclude them from the spreadsheet with:
In [17]: ledger_df.to_excel('JanQ1Sales.xlsx', sheet_name='Sales', index=False)
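The effect of index=False is easy to preview without opening Excel. The sketch below swaps in to_csv, which accepts the same index keyword and returns the file contents as text when no path is given. The two-row table is made up:

```python
import pandas as pd

ledger_df = pd.DataFrame({"InvoiceNo": [1532, 1533], "Total": [281.54, 6.70]})

# With no path, to_csv returns the file contents as text.
with_labels = ledger_df.to_csv()
without_labels = ledger_df.to_csv(index=False)

# The first version starts each line with the row label.
print(with_labels)
print(without_labels)
```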
Table 14.2: Keyword arguments available with pandas’s to_excel function. You can use any combination of keyword
arguments when writing data.
Parameter name   Example                                                       Description
sheet_name       ledger_df.to_excel('JanQ1Sales.xlsx', sheet_name='Sales')     Writes all data from ledger_df to a sheet named “Sales” in “JanQ1Sales.xlsx”.
This will open an autocomplete menu, and show you all the
DataFrame methods you can use for writing data to a file. You
The same goes for any of the methods or functions we’ve already
covered. Whenever you get stuck, use JupyterLab’s autocomplete
feature or the documentation helper (i.e., the ? at the end of a
method name) to find help.
The pandas library has a lot of functions, each with several argu-
ments that control how they work. Remembering all of them is
impossible — fortunately, it isn’t necessary either: use the autocom-
plete menu, look up documentation, or search the web to figure
out the details whenever you need to.
Summary
This chapter showed you how to read and write Excel files using
pandas’s read_excel and to_excel functions. Reading data from
an Excel spreadsheet into a pandas DataFrame creates a copy of the
spreadsheet data — when you finish working on your data, you
can write the DataFrame variable back to an Excel file. However,
everything you do between reading and writing data leaves a code
trace in your notebook. If you don’t save your DataFrame to a file,
you can always return to your code and re-run it.
Reading and writing data from and to Excel files is great, but you
probably want to do something with those data in between. Let’s
take a look at how you can slice, filter, and sort tables next.
Slicing, filtering, and
sorting tables 15
What’s the first thing you do when you start working with a new
dataset in Excel? You probably copy the few columns you need
for your task in a separate “working” sheet or delete the columns
you don’t need. You then put filters on columns (i.e., you put your
data in a table) to enable quick filtering and perhaps sort the entire
table differently.
To run the code examples in this chapter, you first need to import
pandas and read the sales data:
import pandas as pd

ledger_df = pd.read_excel('Q1Sales.xlsx')
Selecting columns
The same kind of code can be used to access values from any other
column in ledger_df by name:
In [3]: ledger_df['Unit Price']
When you want to select more than one column from a DataFrame
you can use a Python list with all the column names you want
instead of a single name:
In [6]: ledger_df[['ProductID', 'Product Name', 'Unit Price', 'Total']]
pythonforaccounting.com/chapter15
Notice the two sets of square brackets that surround the column
names. The first set of brackets is just how you access columns in a
DataFrame, whereas the second set of brackets defines a Python list
of column names. If you need the list of column names elsewhere
in your code, you can store it as a separate variable and use it to
select columns in the same way:
In [7]: column_names = ['ProductID', 'Product Name', 'Unit Price', 'Total']

        ledger_df[column_names]
products_df
You will often forget column names (I do); remember you can list
all column names by running:
In [9]: ledger_df.columns
In [11]: ledger_df[ledger_df.columns[:3]]
You can even use list comprehensions to select just the columns
that contain a specific keyword in their name or satisfy some other
condition — for example, to select all columns that have the word
'Product' in their name:
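A minimal sketch of this kind of selection, using a small stand-in table instead of the full ledger_df (the rows are a made-up subset; only the column names follow the sales data):

```python
import pandas as pd

# Small stand-in for ledger_df (illustrative rows only)
ledger_df = pd.DataFrame({
    'InvoiceNo': [1532, 1533],
    'Product Name': ['Cannon Water Bomb', 'LEGO Ninja Turtles'],
    'ProductID': ['T&G/CAN-97509', 'T&G/LEG-37777'],
    'Total': [281.54, 6.70],
})

# The inner brackets build a list of matching column names;
# the outer brackets select those columns from the DataFrame
product_df = ledger_df[[name for name in ledger_df.columns if 'Product' in name]]

print(product_df.columns.tolist())
```

The condition after `if` can be any expression that is True for the columns you want to keep.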
In this last example, notice again the double square brackets, one
set for accessing columns in a DataFrame, the other for creating a
Python list comprehension (if you can’t remember what those are
exactly, see chapter 6 for a refresher).
Removing columns
Often after reading a large spreadsheet, you will want to keep only
a few of its columns. If you know which columns you want to keep,
you can select them using square brackets, as you saw earlier, and
then assign them to another variable (or the same variable if you
don’t need the entire dataset):
In [14]: products_df = ledger_df[['ProductID', 'Product Name', 'Unit Price', 'Total']]
On the other hand, if you want to remove a few columns, it’s easier
to list the columns you want to remove rather than the ones you
want to keep. In that case, you can define a list of columns you
don’t need anymore and then use the drop method:
In [15]: columns_to_remove = ['InvoiceNo', 'Account', 'AccountNo', 'Currency']
ledger_df.drop(columns_to_remove, axis='columns')
Out [15]: Channel Product Name ProductID ... Unit Price Quantity Total
0 Shoppe.com Cannon Water Bom... T&G/CAN-97509 ... 20.11 14 281.54
1 Walcart LEGO Ninja Turtl... T&G/LEG-37777 ... 6.70 1 6.70
2 Bullseye NaN T&G/PET-14209 ... 11.67 5 58.35
3 Bullseye Transformers Age... T&G/TRA-20170 ... 13.46 6 80.76
4 Bullseye Transformers Age... T&G/TRA-20170 ... 13.46 6 80.76
... ... ... ... ... ... ... ...
14049 Bullseye AC Adapter/Power... E/AC-63975 ... 28.72 8 229.76
14050 Bullseye Cisco Systems Gi... E/CIS-74992 ... 33.39 1 33.39
14051 Understock.com Philips AJ3116M/... E/PHI-08100 ... 4.18 1 4.18
14052 iBay.com NaN E/POL-61164 ... 4.78 25 119.50
14053 Understock.com Sirius Satellite... E/SIR-83381 ... 33.16 2 66.32
Although you can’t see the entire table, notice on the last line in
the output above that it now has 8 columns instead of 12.
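Keep in mind that drop returns a new DataFrame and leaves the original untouched, so you need to assign its output to a variable if you want to keep the result. A short sketch on a stand-in table (the rows, the 'USD' values, and the trimmed_df name are made up for illustration):

```python
import pandas as pd

# Small stand-in for ledger_df (illustrative rows only)
ledger_df = pd.DataFrame({
    'InvoiceNo': [1532, 1533],
    'Channel': ['Shoppe.com', 'Walcart'],
    'Currency': ['USD', 'USD'],
    'Total': [281.54, 6.70],
})

columns_to_remove = ['InvoiceNo', 'Currency']

# drop returns a new DataFrame without the listed columns
trimmed_df = ledger_df.drop(columns_to_remove, axis='columns')

print(trimmed_df.columns.tolist())  # columns that remain after dropping
print(ledger_df.columns.tolist())   # the original table still has all four
```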
Using loc
Using the loc operator is similar to how you select cell ranges in
Excel. Let’s start with an Excel example and then translate that into
pandas — figure 15.1 shows how you can select the top ten rows of
the first four columns in “Q1Sales.xlsx” using Excel.
Figure 15.1: Selecting the top ten rows of the first four columns in “Q1Sales.xlsx” using Excel.
In pandas, you can make the same selection using the loc operator.
You use loc like a regular DataFrame method by typing the
DataFrame variable name, followed by a period, the loc keyword,
and a pair of square brackets. Inside the brackets you need to pass
the row and column labels you want to access. To make the same
range selection as above using loc, you need to run:
In [16]: ledger_df.loc[0:9, 'InvoiceNo':'ProductID']
Unlike Excel’s formula, with loc you first have to specify the
row labels you want to select, followed by the column labels: so
A1:D10 in Excel becomes .loc[0:9, 'InvoiceNo': 'ProductID'] in
pandas. Because the row labels in ledger_df start at 0, the row
range in the example above goes from 0 to 9 instead of 1 to 10 as it
does in Excel.
Besides label ranges, you can use Python lists to enumerate the
rows or columns you want to select:
In [17]: ledger_df.loc[[0, 1, 2, 3, 4], ['InvoiceNo', 'Channel', 'Unit Price', 'Total']]
If you want to select all rows in a DataFrame, you can use a single
colon instead of a range of row labels:
In [18]: ledger_df.loc[:, 'Channel':'Date']
Out [19]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb ... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age o... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age o... ... 13.46 6 80.76
[5 rows x 12 columns]
Using iloc
The iloc indexing operator does the same thing as loc, but instead
of using row and column labels, it uses row and column positions in
the DataFrame specified as integers (the i stands for integer). With
iloc, you can make the same selection as in the previous Excel
example (i.e., A1:D10) using:
In [21]: ledger_df.iloc[0:10, 0:4]
Row and column positions start with 0, so you need to use 0 as the
start of your ranges if you want to include the first row or column
in your selections when using iloc. The eagle-eyed among you
may have noticed that the row range above goes up to 10, not 9 as
in the previous example using loc, yet you still get ten rows in the
output. This is because iloc ranges are inclusive only on the left
side, not on both, as are loc ranges.3
3: There’s a good reason for this difference between iloc and loc.
Label ranges used with loc are inclusive at both ends because you
don’t always know the order of labels in your tables. Excluding the
right end of your label range when using loc would require you to
know what label comes right after it, if you wanted to include the
right end of your range in your table slice. This can be inconvenient
if you don’t know the order of labels in your table. On the other
hand, excluding the right end of an integer range (as iloc does) is
consistent with slicing Python lists.
To select multiple rows and columns from ledger_df, you can also
enumerate the positions you want to select using Python lists:
In [22]: ledger_df.iloc[[0, 1, 2, 3, 4], [1, 2, 9, 11]]
Out [22]: Channel Product Name Unit Price Total
0 Shoppe.com Cannon Water Bom... 20.11 281.54
1 Walcart LEGO Ninja Turtl... 6.70 6.70
2 Bullseye NaN 11.67 58.35
3 Bullseye Transformers Age... 13.46 80.76
4 Bullseye Transformers Age... 13.46 80.76
You don’t have to write a number before or after the colon if you
want to specify a range that starts with the first or ends with the
last possible item (row or column):
In [23]: ledger_df.iloc[:, :3]
The two indexing operators can be used in intricate ways and can
even be chained one after the other for more complicated table
slices. Depending on your selection, their output is a DataFrame,
a Series or a single value — all of which you can assign to other
variables and use elsewhere in your code. We’ll revisit loc and
iloc often throughout the rest of the book.
Slicing using iloc and loc can get tricky when DataFrame rows
have integer labels, as is the case with ledger_df. Consider the
following example:
In [25]: ledger_df.loc[0:4, 'Channel']
Here, you use a label range with loc to select the first five rows in
ledger_df. The reason this works is that row labels in ledger_df
are, indeed, integers. However, if you use the same range with
iloc, you get a different result:
In [26]: ledger_df.iloc[0:4, 1]
The first example uses loc and selects the first five rows of the
'Channel' column (i.e., the second column in ledger_df). In contrast,
the second example uses iloc and selects the first four rows
of the same column, even though the specified row range seems to
be the same.
The reason for this difference is that loc selects all rows between
and including the row labeled 0 and row labeled 4, whereas iloc
selects rows based on their position in ledger_df and excludes
the right limit of the row range (i.e., the row position specified
after the colon). This might seem like a minor detail, but most
of the DataFrame variables you will work with will have integer
row labels; being mindful of this difference between loc and iloc
might save you a headache later on.
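You can check the difference for yourself on a small table. The sketch below uses a six-row stand-in for ledger_df (an illustrative subset with default integer row labels) and the by_label and by_position names, which are made up for this example:

```python
import pandas as pd

# Six-row stand-in for ledger_df with default integer row labels (0 to 5)
ledger_df = pd.DataFrame({
    'InvoiceNo': [1532, 1533, 1534, 1535, 1535, 1537],
    'Channel': ['Shoppe.com', 'Walcart', 'Bullseye',
                'Bullseye', 'Bullseye', 'Bullseye'],
})

# loc works with labels: rows labeled 0 through 4, both ends included
by_label = ledger_df.loc[0:4, 'Channel']

# iloc works with positions: rows at positions 0 through 3, right end excluded
by_position = ledger_df.iloc[0:4, 1]

print(len(by_label))     # 5
print(len(by_position))  # 4
```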
Filtering data
Out [27]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
1 1533 Walcart LEGO Ninja Turtl... ... 6.70 1 6.70
7 1539 Walcart Zen-Ray ED3 8x43... ... 28.29 1 28.29
10 1542 Walcart Logisys Red 5 LE... ... 19.49 1 19.49
11 1543 Walcart Magline GMK81UA4... ... 18.42 4 73.68
15 1547 Walcart Totally Bamboo 2... ... 4.25 1 4.25
... ... ... ... ... ... ... ...
14012 15408 Walcart OXO Good Grips L... ... 10.10 1 10.10
14022 15427 Walcart Applied Nutritio... ... 7.72 1 7.72
14028 15459 Walcart Update Internati... ... 3.91 1 3.91
14034 15471 Walcart Kikkerland Biode... ... 17.18 16 274.88
14039 15499 Walcart Anchor Hocking 4... ... 7.56 44 332.64
If you run just the equality check between the square brackets
above, by itself, in a separate cell, you will see that it returns a
Series with the same number of rows as ledger_df, containing
only boolean values (i.e., True or False values):
In [28]: ledger_df['Channel'] == 'Walcart'
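To see how the boolean Series and the filtered table relate, here is a sketch on a four-row stand-in for ledger_df (an illustrative subset; the is_walcart and walcart_df names are made up for this example):

```python
import pandas as pd

# Four-row stand-in for ledger_df (illustrative rows only)
ledger_df = pd.DataFrame({
    'InvoiceNo': [1532, 1533, 1534, 1539],
    'Channel': ['Shoppe.com', 'Walcart', 'Bullseye', 'Walcart'],
    'Total': [281.54, 6.70, 58.35, 28.29],
})

# The equality check produces one True or False value per row
is_walcart = ledger_df['Channel'] == 'Walcart'
print(is_walcart.tolist())

# Passing the boolean Series in square brackets keeps only the True rows
walcart_df = ledger_df[is_walcart]
print(walcart_df)
```

The filtered table keeps the original row labels of the rows that passed the check.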
Out [30]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
1 1533 Walcart LEGO Ninja Turtl... ... 6.70 1 6.70
7 1539 Walcart Zen-Ray ED3 8x43... ... 28.29 1 28.29
10 1542 Walcart Logisys Red 5 LE... ... 19.49 1 19.49
11 1543 Walcart Magline GMK81UA4... ... 18.42 4 73.68
15 1547 Walcart Totally Bamboo 2... ... 4.25 1 4.25
... ... ... ... ... ... ... ...
14012 15408 Walcart OXO Good Grips L... ... 10.10 1 10.10
14022 15427 Walcart Applied Nutritio... ... 7.72 1 7.72
14028 15459 Walcart Update Internati... ... 3.91 1 3.91
14034 15471 Walcart Kikkerland Biode... ... 17.18 16 274.88
14039 15499 Walcart Anchor Hocking 4... ... 7.56 44 332.64
Out [31]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
28 1560 Walcart Conntek 14422 RV... ... 4.47 27 120.69
144 1676 Walcart [ Strip of 6 ] E... ... 11.83 61 721.63
213 1604 Walcart Conntek 14422 RV... ... 4.47 27 120.69
237 1769 Walcart Child Constructi... ... 19.68 43 846.24
292 1824 Walcart Ultima Replenish... ... 13.34 62 827.08
... ... ... ... ... ... ... ...
13751 15144 Walcart Anchor Hocking 4... ... 7.56 44 332.64
13759 15291 Walcart AC Adapter/batte... ... 14.49 68 985.32
13827 15359 Walcart AKG Pro Audio K9... ... 8.10 31 251.10
13897 15429 Walcart NaN ... 13.34 83 1107.22
14039 15499 Walcart Anchor Hocking 4... ... 7.56 44 332.64
Out [32]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bom... ... 20.11 14 281.54
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age... ... 13.46 6 80.76
5 1537 Bullseye 3x Anti-Spy Priv... ... 7.39 8 59.12
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Gi... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/... ... 4.18 1 4.18
14052 15584 iBay.com NaN ... 4.78 25 119.50
14053 15585 Understock.com Sirius Satellite... ... 33.16 2 66.32
You can use the output of these methods to filter DataFrame rows,
just like boolean Series that result from conditional statements.
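As an illustration, isin and between are two pandas Series methods that return boolean Series (that these are among the methods meant here is an assumption). The sketch below uses them on a made-up four-row stand-in for ledger_df and combines their outputs with the & operator:

```python
import pandas as pd

# Four-row stand-in for ledger_df (illustrative rows only)
ledger_df = pd.DataFrame({
    'Channel': ['Shoppe.com', 'Walcart', 'Bullseye', 'Walcart'],
    'Total': [281.54, 6.70, 58.35, 28.29],
})

# True where the channel is one of the listed values
in_channels = ledger_df['Channel'].isin(['Walcart', 'Bullseye'])

# True where Total is between 10 and 100 (both ends included by default)
mid_total = ledger_df['Total'].between(10, 100)

# & keeps rows where both conditions are True
filtered_df = ledger_df[in_channels & mid_total]
print(filtered_df)
```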
Another way to filter rows and select columns at the same time is
with the loc indexing operator:
In [38]: ledger_df.loc[
             ledger_df['Channel'] == 'Walcart',
             ['Channel', 'Quantity', 'Total']
         ]
Notice that instead of a list of row labels, the first argument passed
to loc is a boolean Series (created using a conditional statement),
whereas the second argument is a list of column labels. So loc
actually works with row labels and boolean Series for selecting
rows (but iloc doesn’t).
Sorting data
Out [39]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
7613 9145 iBay.com Magic: the Gathe... ... 0.06 15 0.90
12877 14409 Understock.com Magic: the Gathe... ... 0.06 17 1.02
7810 9342 iBay.com Magic: the Gathe... ... 0.06 26 1.56
7812 9270 iBay.com Magic: the Gathe... ... 0.06 26 1.56
9822 11354 Understock.com Urban Rebounding... ... 1.69 1 1.69
... ... ... ... ... ... ... ...
4757 6162 Understock.com AC Adapter/batte... ... 14.88 226 3362.88
6163 7526 iBay.com Large Display Di... ... 64.15 56 3592.40
6141 7673 iBay.com Large Display Di... ... 64.15 56 3592.40
8006 9538 iBay.com Large Display Di... ... 64.15 61 3913.15
5212 6744 iBay.com Large Display Di... ... 64.15 68 4362.20
Out [40]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
5212 6744 iBay.com Large Display Di... ... 64.15 68 4362.20
8006 9538 iBay.com Large Display Di... ... 64.15 61 3913.15
6141 7673 iBay.com Large Display Di... ... 64.15 56 3592.40
6163 7526 iBay.com Large Display Di... ... 64.15 56 3592.40
8797 10329 Understock.com AC Adapter/batte... ... 14.88 226 3362.88
... ... ... ... ... ... ... ...
9822 11354 Understock.com Urban Rebounding... ... 1.69 1 1.69
7810 9342 iBay.com Magic: the Gathe... ... 0.06 26 1.56
Notice in the output above that, after sorting, row labels are not
consecutive anymore. This is because row labels (even if they are
numbers) don’t indicate a row’s position, but are just row identifiers
in a DataFrame. After sorting, these labels get shuffled around with
their associated rows, as above.
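A short sketch of this behavior, using sort_values on a four-row stand-in for ledger_df (an illustrative subset). The reset_index call at the end is an optional extra step, not something the chapter requires, in case you want consecutive labels again:

```python
import pandas as pd

# Four-row stand-in for ledger_df (illustrative rows only)
ledger_df = pd.DataFrame({
    'Channel': ['Shoppe.com', 'Walcart', 'Bullseye', 'Bullseye'],
    'Total': [281.54, 6.70, 58.35, 80.76],
})

# Sort by 'Total', largest values first; row labels travel with their rows
sorted_df = ledger_df.sort_values('Total', ascending=False)
print(sorted_df.index.tolist())  # [0, 3, 2, 1]

# Optional: replace the shuffled labels with fresh consecutive ones
renumbered_df = sorted_df.reset_index(drop=True)
print(renumbered_df.index.tolist())  # [0, 1, 2, 3]
```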
Out [41]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
4747 6279 Understock.com AC Adapter/batte... ... 14.88 226 3362.88
4757 6162 Understock.com AC Adapter/batte... ... 14.88 226 3362.88
4969 6397 Understock.com AC Adapter/batte... ... 14.88 226 3362.88
8797 10329 Understock.com AC Adapter/batte... ... 14.88 226 3362.88
5499 7031 iBay.com NaN ... 13.28 184 2443.52
... ... ... ... ... ... ... ...
616 2148 Understock.com Kodak EasyShare ... ... 166.30 1 166.30
4248 5780 Understock.com Kodak EasyShare ... ... 166.30 1 166.30
4485 5839 Understock.com Kodak EasyShare ... ... 166.30 1 166.30
6152 7684 Understock.com Kodak EasyShare ... ... 166.30 1 166.30
6164 7531 Understock.com Kodak EasyShare ... ... 166.30 1 166.30
Keep in mind that if you have any missing values (i.e., NaNs) in
the column or columns you sort by, these values will be placed,
by default, at the bottom of the sorted DataFrame (regardless of
whether you pass ascending=False or not).
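The sketch below shows where missing values end up, on a made-up three-row table with one missing 'Total'. The na_position keyword argument of sort_values controls this placement, in case you would rather see missing values first:

```python
import pandas as pd

# Three-row stand-in table; None becomes NaN in a numeric column
ledger_df = pd.DataFrame({
    'InvoiceNo': [1532, 1533, 1534],
    'Total': [58.35, None, 281.54],
})

# Missing values go last whether you sort ascending or descending
ascending_df = ledger_df.sort_values('Total')
descending_df = ledger_df.sort_values('Total', ascending=False)

# na_position='first' moves missing values to the top instead
nan_first_df = ledger_df.sort_values('Total', na_position='first')

print(ascending_df.index.tolist())   # [0, 2, 1]
print(descending_df.index.tolist())  # [2, 0, 1]
print(nan_first_df.index.tolist())   # [1, 0, 2]
```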
Summary
This chapter showed you how to slice, filter, and sort pandas
DataFrame objects. With what you learned in the previous chapters,
you can now read, write, sort, and slice Excel spreadsheets using
pandas.6 Next, let’s look at how you can put the pandas features
we’ve covered so far together, in a quick project that re-organizes
“Q1Sales.xlsx” by channel and revenue.
6: Well done for making it this far!
16 Project: Organizing sales data by channel
If you’ve ever needed to split a large Excel file into multiple
spreadsheets — to share them with different people or upload
them as separate files on one of the platforms you use — you’ve
likely mastered the art of copying and pasting rows. Copy-and-
paste can often be the right tool for the job, but when you need to
repeat the same file-splitting steps every other week, or when your
data is too large and unwieldy for Excel, it can quickly become a
headache.
This short project chapter shows you how to split an Excel file
into multiple sheets with pandas. You’ll use the “Q1Sales.xlsx” file
you’re already familiar with and split it into five spreadsheets,
each one containing sales from one of the five sales channels
in the data. You’ll also sort the data in each spreadsheet by the
'Total' and 'Quantity' columns so that each sheet displays the
highest-grossing sales at the top.
jan_sales_df
Out [2]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtle... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age ... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age ... ... 13.46 6 80.76
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power ... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Gig... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/3... ... 4.18 1 4.18
14052 15584 iBay.com NaN ... 4.78 25 119.50
14053 15585 Understock.com Sirius Satellite ... ... 33.16 2 66.32
The code above reads data from the first sheet in “Q1Sales.xlsx”,
but there are two more spreadsheets in that workbook. To read all
the data in “Q1Sales.xlsx”, you can repeat the code above for the
remaining sheets and assign the output of each read_excel call to
separate DataFrame variables:
In [3]: jan_sales_df = pd.read_excel('Q1Sales.xlsx', sheet_name='January')
        feb_sales_df = pd.read_excel('Q1Sales.xlsx', sheet_name='February')
        mar_sales_df = pd.read_excel('Q1Sales.xlsx', sheet_name='March')
Now that you have all the data in “Q1Sales.xlsx” loaded in your
notebook, let’s simplify the code a bit. Instead of having three table
variables to work with, let’s put all their data in a single DataFrame.
You can do that by using pandas’s concat function, which is the
pandas equivalent of copy-pasting rows:
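A minimal sketch of how concat stacks tables, using three tiny stand-in tables in place of the monthly sheets (all values are illustrative, and the February and March rows are made up). Passing ignore_index=True is an assumption here: it renumbers the rows of the combined table from 0, which is one way to get consecutive row labels like the ones in the output that follows:

```python
import pandas as pd

# Tiny stand-ins for the three monthly tables (illustrative values only)
jan_sales_df = pd.DataFrame({'InvoiceNo': [1532, 1533], 'Total': [281.54, 6.70]})
feb_sales_df = pd.DataFrame({'InvoiceNo': [2001, 2002], 'Total': [10.00, 20.00]})
mar_sales_df = pd.DataFrame({'InvoiceNo': [3001, 3002], 'Total': [30.00, 40.00]})

# Stack the rows of all three tables, one after the other
sales_df = pd.concat(
    [jan_sales_df, feb_sales_df, mar_sales_df],
    ignore_index=True,  # renumber rows 0..5 instead of repeating 0, 1
)

print(sales_df)
```

Without ignore_index=True, each table keeps its own row labels, so the combined table would contain the labels 0 and 1 three times.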
Out [5]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtle... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age ... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age ... ... 13.46 6 80.76
... ... ... ... ... ... ... ...
37703 39235 iBay.com Nature's Bounty G... ... 5.55 2 11.10
37704 39216 Shoppe.com Funko Wonder Woma... ... 28.56 1 28.56
37705 39219 Shoppe.com MONO GS1 GS1-BTY-... ... 3.33 1 3.33
37706 39238 Shoppe.com NaN ... 34.76 10 347.60
37707 39239 Understock.com 3 Collapsible Bow... ... 6.39 15 95.85
Having lots of variables you don’t need can make code unwieldy.
In the code above, you don’t need one DataFrame variable for each
sheet in “Q1Sales.xlsx”. You can change the data-reading code
above (and skip all intermediate variables) by passing multiple
read_excel calls to concat directly:
pythonforaccounting.com/chapter16
You get the same sales_df but without the extra variables.
The goal of this short project is to create an Excel file with five
different sheets, each one containing sales from one of the five
channels in sales_df. To get where we’re going, let’s first select
all 'Understock.com' sales from sales_df and sort the filtered
DataFrame by its 'Total' and 'Quantity' columns:
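A sketch of this filter-and-sort step on a four-row stand-in for sales_df (illustrative rows only), combining the filtering and sorting techniques from the previous chapter:

```python
import pandas as pd

# Four-row stand-in for sales_df (illustrative rows only)
sales_df = pd.DataFrame({
    'Channel': ['Understock.com', 'Walcart', 'Understock.com', 'Understock.com'],
    'Quantity': [226, 100, 54, 17],
    'Total': [3362.88, 1449.00, 3473.28, 1.02],
})

# Keep only the 'Understock.com' rows
channel_df = sales_df[sales_df['Channel'] == 'Understock.com']

# Sort by 'Total' first, then by 'Quantity', largest values at the top
channel_df = channel_df.sort_values(['Total', 'Quantity'], ascending=False)

print(channel_df)
```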
In [8]: channel_df
Out [8]: ProductID Product Name Channel Unit Price Quantity Total
23620 H&PC/LAR-98606 Large Display Dig... Understock.com 64.32 54 3473.28
36621 H&PC/LAR-98606 Large Display Dig... Understock.com 64.32 54 3473.28
4747 E/AC-44106 AC Adapter/batter... Understock.com 14.88 226 3362.88
4757 E/AC-44106 AC Adapter/batter... Understock.com 14.88 226 3362.88
4969 E/AC-44106 AC Adapter/batter... Understock.com 14.88 226 3362.88
... ... ... ... ... ... ...
15679 M&T/URB-83617 Urban Rebounding ... Understock.com 1.69 1 1.69
12877 T&G/MAG-22549 Magic: the Gather... Understock.com 0.06 17 1.02
15855 T&G/MAG-22549 Magic: the Gather... Understock.com 0.06 11 0.66
24012 T&G/MAG-22549 Magic: the Gather... Understock.com 0.06 11 0.66
24022 T&G/MAG-22549 Magic: the Gather... Understock.com 0.06 11 0.66
The code above should look familiar: it filters sales_df and assigns
the filtered DataFrame to a variable called channel_df. You can
write channel_df to a new Excel file using:
In [9]: channel_df.to_excel('Q1ChannelSales.xlsx', sheet_name='Understock.com', index=False)
Running the code above will create an Excel file called
“Q1ChannelSales.xlsx” in the same folder as your Jupyter notebook,
with one sheet called “Understock.com” containing the same data as
channel_df. However, for this project, you need to repeat the filter
and sort operation above for all the sales channels in sales_df;
whenever you want to repeat something in code, you need a loop.
Let’s set up a for loop that goes through each channel in our sales
data and prints its name:
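A sketch of such a loop is shown below. The channel names come from the sales data; the order of the list (with 'Walcart' last) is an assumption made for illustration:

```python
channels = ['Shoppe.com', 'iBay.com', 'Understock.com', 'Bullseye', 'Walcart']

# Go through the channel names one at a time and print each one
for channel in channels:
    print(channel)
```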
The loop above doesn’t do anything useful yet: it goes through each
value in the channels list defined on line 1 and prints it. However,
instead of printing channel names, you can use the loop to do
some real work, like filtering sales_df and creating a different
channel_df for every channel:
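A sketch of the loop doing this filtering and sorting, run on a four-row stand-in for sales_df (illustrative rows only; the order of the channels list is an assumption):

```python
import pandas as pd

# Four-row stand-in for sales_df (illustrative rows only)
sales_df = pd.DataFrame({
    'Channel': ['Understock.com', 'Walcart', 'Shoppe.com', 'Walcart'],
    'Quantity': [226, 100, 14, 23],
    'Total': [3362.88, 1449.00, 281.54, 1482.81],
})

channels = ['Shoppe.com', 'iBay.com', 'Understock.com', 'Bullseye', 'Walcart']

for channel in channels:
    # Each pass through the loop overwrites channel_df with a new table
    channel_df = sales_df[sales_df['Channel'] == channel]
    channel_df = channel_df.sort_values(['Total', 'Quantity'], ascending=False)
```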
The code above doesn’t output anything, but if you inspect
channel_df now, you’ll see it contains all sales_df rows where values
in the 'Channel' column are 'Walcart':
In [12]: channel_df
Out [12]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
2461 3993 Walcart Large Display Dig... ... 64.47 23 1482.81
8332 9864 Walcart AC Adapter/batter... ... 14.49 100 1449.00
15293 16825 Walcart AC Adapter/batter... ... 14.49 100 1449.00
15509 16895 Walcart AC Adapter/batter... ... 14.49 100 1449.00
23888 25420 Walcart AC Adapter/batter... ... 14.49 100 1449.00
... ... ... ... ... ... ... ...
936 2332 Walcart Vibrating Slim Je... ... 2.10 1 2.10
11222 12754 Walcart Vibrating Slim Je... ... 2.10 1 2.10
14186 15718 Walcart Blackberry Q10 Wh... ... 1.87 1 1.87
14311 15712 Walcart Blackberry Q10 Wh... ... 1.87 1 1.87
33627 35159 Walcart Red Dragon VT 3-3... ... 0.19 6 1.14
Why 'Walcart' sales? Because the for loop goes through all values
in the channels list and keeps assigning a different DataFrame to
channel_df. The last value in the channels list is 'Walcart', so at
the last iteration of the loop, channel_df gets assigned only those
rows in sales_df where values in the 'Channel' column are equal
to 'Walcart'.
The last step you need to add to the for loop above is writing each
channel_df to a different sheet in “Q1ChannelSales.xlsx”. You can
do that by extending the previous example with the code below:
There are two new lines of code in the code above. On line 3, you
use pandas’s ExcelWriter object to open an Excel file and assign
it to a variable called output_file. The output_file variable is an
object that represents the opened file in Python code. When you
need to write data to multiple sheets in the same Excel file, you
have to first open the Excel file as I did above.
As before, the loop then goes through each channel, filters and sorts
rows from sales_df, and assigns them to channel_df. The loop
also calls to_excel, passing the channel name as the sheet_name
keyword argument — this is how each channel’s sales get written
to a different sheet in the open “Q1ChannelSales.xlsx” Excel file.
When the loop completes, you call output_file.save() (on line 10),
which makes sure the data in the open file gets written to your
computer’s disk (you have to call save this way if you want to make
sure your Excel file and its data get saved). Note that save was
removed from ExcelWriter in pandas 2.0; with a recent pandas
version, call output_file.close() instead.
You might come across another way of writing the same code that
uses Python’s with statement.1 In the code below, the with statement
opens and saves files for you, so you don’t have to remember to
call save on your open file variable once you’re done writing data
to it.
1: Python’s with statement is nothing like VBA’s With statement. In
Python, with creates a context that takes care of cleanup (here,
saving and closing the file) for you, even if an error occurs.
Using with, you get the same result as above by running:
In [14]: with pd.ExcelWriter('Q1ChannelSales.xlsx') as output_file:
             for channel in channels:
                 channel_df = sales_df[sales_df['Channel'] == channel]
                 channel_df = channel_df.sort_values(['Total', 'Quantity'], ascending=False)
                 channel_df.to_excel(output_file, sheet_name=channel, index=False)
Notice in both examples that instead of a file name, you pass the
output_file variable as the first argument to to_excel. This tells
pandas to use the already open file when writing data and is the
only way to write data to multiple spreadsheets in the same Excel
file. If you open “Q1ChannelSales.xlsx” in Excel after running the
code above, you will see something similar to figure 16.1 below.
The entire code you need to read data from “Q1Sales.xlsx”, combine
it in a single DataFrame, and write it to another Excel file, with one
sheet for each sales channel is shown below:
You can extend the block of code in the for loop to add other
slicing or sorting steps, depending on what you need from your
data. If you need to split large datasets into multiple sheets often,
code like the one above can save you considerable time.
Summary
This quick project chapter showed you how to split the “Q1Sales.xlsx”
Excel file into multiple spreadsheets. Once you start getting
comfortable with pandas and its functions, you’ll use it to replace all
your manual data-handling, including copying-and-pasting rows
from one sheet to another.
Next, let’s head back to the scenic tour of pandas and see how you
can add and modify DataFrame columns.
17 Adding and modifying columns
When you open a spreadsheet in Excel, you get endless empty
columns just waiting to be filled up with data. In pandas you don’t
get any empty columns, but you can still easily add new data to
your DataFrame variables — this chapter shows you how.
Adding columns
Out [1]: InvoiceNo Channel Product Name ... Quantity Total Quarter
0 1532 Shoppe.com Cannon Water Bom... ... 14 281.54 Q1
1 1533 Walcart LEGO Ninja Turtl... ... 1 6.70 Q1
2 1534 Bullseye NaN ... 5 58.35 Q1
3 1535 Bullseye Transformers Age... ... 6 80.76 Q1
4 1535 Bullseye Transformers Age... ... 6 80.76 Q1
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power... ... 8 229.76 Q1
14050 15582 Bullseye Cisco Systems Gi... ... 1 33.39 Q1
14051 15583 Understock.com Philips AJ3116M/... ... 1 4.18 Q1
14052 15584 iBay.com NaN ... 25 119.50 Q1
14053 15585 Understock.com Sirius Satellite... ... 2 66.32 Q1
Using the same code, you can assign new values to an existing
column, which overwrites all its current values:
In [2]: ledger_df['Quarter'] = 1
ledger_df
Out [2]: InvoiceNo Channel Product Name ... Quantity Total Quarter
0 1532 Shoppe.com Cannon Water Bom... ... 14 281.54 1
1 1533 Walcart LEGO Ninja Turtl... ... 1 6.70 1
2 1534 Bullseye NaN ... 5 58.35 1
3 1535 Bullseye Transformers Age... ... 6 80.76 1
4 1535 Bullseye Transformers Age... ... 6 80.76 1
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power... ... 8 229.76 1
14050 15582 Bullseye Cisco Systems Gi... ... 1 33.39 1
14051 15583 Understock.com Philips AJ3116M/... ... 1 4.18 1
14052 15584 iBay.com NaN ... 25 119.50 1
14053 15585 Understock.com Sirius Satellite... ... 2 66.32 1
Most often, you won’t want to fill a column with a single value
but rather use some of your table’s existing values to create a new
column. For example, to calculate the tax amount (assuming a 19%
sales tax) for each row in ledger_df you can use:
In [3]: ledger_df['Total'] * (19 / 100)
Out [5]: InvoiceNo Channel Product Name ... Total Quarter Sales Tax
0 1532 Shoppe.com Cannon Water Bom... ... 281.54 1 53.4926
1 1533 Walcart LEGO Ninja Turtl... ... 6.70 1 1.2730
2 1534 Bullseye NaN ... 58.35 1 11.0865
3 1535 Bullseye Transformers Age... ... 80.76 1 15.3444
pythonforaccounting.com/chapter17
You can use several columns in the same expression (e.g., multiply
two columns together) and assign the output to another column:
In [6]: ledger_df['Quantity'] * ledger_df['Unit Price']
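To keep the result of a calculation like these, assign it to a column name. The sketch below does that on a two-row stand-in for ledger_df (illustrative rows only); the 'Check Total' column name is made up for this example:

```python
import pandas as pd

# Two-row stand-in for ledger_df (illustrative rows only)
ledger_df = pd.DataFrame({
    'Unit Price': [20.11, 6.70],
    'Quantity': [14, 1],
    'Total': [281.54, 6.70],
})

# One calculation per row, stored in a new column (19% sales tax)
ledger_df['Sales Tax'] = ledger_df['Total'] * (19 / 100)

# Multiply two columns together, row by row
ledger_df['Check Total'] = ledger_df['Quantity'] * ledger_df['Unit Price']

print(ledger_df)
```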
New column names can be anything you want, but short names
will save you from having to type a lot, and it’s a good idea to
keep them consistent with your table’s existing columns (e.g., all
lowercase letters, title case, etc.).
In pandas, you work with entire columns at once, not just individual
values (as you do in regular Python). Thinking “in columns” takes
some getting used to, so don’t worry if it all feels like a mental
workout right now. As you read and write more code, it will soon
become second nature.
Renaming columns
Out [8]: InvoiceNo Channel Product Name ... Total Quarter Tax
0 1532 Shoppe.com Cannon Water Bom... ... 281.54 1 53.4926
1 1533 Walcart LEGO Ninja Turtl... ... 6.70 1 1.2730
2 1534 Bullseye NaN ... 58.35 1 11.0865
3 1535 Bullseye Transformers Age... ... 80.76 1 15.3444
4 1535 Bullseye Transformers Age... ... 80.76 1 15.3444
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power... ... 229.76 1 43.6544
14050 15582 Bullseye Cisco Systems Gi... ... 33.39 1 6.3441
14051 15583 Understock.com Philips AJ3116M/... ... 4.18 1 0.7942
14052 15584 iBay.com NaN ... 119.50 1 22.7050
14053 15585 Understock.com Sirius Satellite... ... 66.32 1 12.6008
In the output above, the last column is now called 'Tax' instead of
'Sales Tax'. Notice that you pass the old-to-new name mapping
to rename as the columns keyword argument. This method doesn’t
change the original DataFrame, but instead returns a new DataFrame
— to update ledger_df with this new column name, you need to
assign the output of rename back to ledger_df:
In [9]: ledger_df = ledger_df.rename(columns={'Sales Tax': 'Tax'})

        ledger_df.columns
Now ledger_df’s last column is called 'Tax'. You can also rename
multiple columns at once:
In [10]: ledger_df = ledger_df.rename(columns={
             'Product Name': 'Name',
             'Unit Price': 'Price',
             'Hello': 'World'
         })

         ledger_df
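By default, rename ignores any names in the mapping that don’t match an existing column, such as 'Hello' above. The sketch below shows this on a two-row stand-in table (illustrative rows only), along with the errors='raise' keyword argument, which makes rename complain about unknown names instead. The renamed_df and missing_key_raised names are made up for this example:

```python
import pandas as pd

# Two-row stand-in for ledger_df (illustrative rows only)
ledger_df = pd.DataFrame({
    'Product Name': ['Cannon Water Bomb', 'LEGO Ninja Turtles'],
    'Unit Price': [20.11, 6.70],
})

# 'Hello' is not a column, so that entry is silently skipped
renamed_df = ledger_df.rename(columns={
    'Product Name': 'Name',
    'Unit Price': 'Price',
    'Hello': 'World',
})
print(renamed_df.columns.tolist())

# With errors='raise', an unknown column name raises a KeyError
missing_key_raised = False
try:
    ledger_df.rename(columns={'Hello': 'World'}, errors='raise')
except KeyError:
    missing_key_raised = True
print(missing_key_raised)
```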
Replacing values
You may have noticed that some of the values in the 'Channel'
column end with '.com' and some don’t. It might make sense to
remove this ending and keep all channel identifiers using the same
naming style. To do that, you can use the replace method:
In [11]: ledger_df['Channel'].replace({
             'Shoppe.com': 'Shoppe',
             'Understock.com': 'Understock',
             'iBay.com': 'iBay'
         })
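Like rename, replace returns a new Series rather than changing the table, so to keep the new values you assign its output back to the column. A sketch on a five-row stand-in table (illustrative rows only):

```python
import pandas as pd

# Five-row stand-in for ledger_df with one row per channel
ledger_df = pd.DataFrame({
    'Channel': ['Shoppe.com', 'Walcart', 'Understock.com', 'iBay.com', 'Bullseye'],
})

# Assign the output of replace back to the 'Channel' column
ledger_df['Channel'] = ledger_df['Channel'].replace({
    'Shoppe.com': 'Shoppe',
    'Understock.com': 'Understock',
    'iBay.com': 'iBay',
})

print(ledger_df['Channel'].tolist())
```

Values that don’t appear in the mapping ('Walcart' and 'Bullseye' here) are left as they are.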
Now, if you take a look at ledger_df, you will see the new values:
In [13]: ledger_df
To do that, you have to use the loc or iloc operators. First, you
need to specify the location you want to modify and then assign
a new value to that selection. Let’s say you want to change the
'ProductID' value in the second row of ledger_df because you
know it’s wrong — right now, it is T&G/LEG-37777:
In [14]: ledger_df.head(2)
Out [14]: InvoiceNo Channel Name ProductID ... Quantity Total Quarter Tax
0 1532 Shoppe Cannon Wate... T&G/CAN-97509 ... 14 281.54 1 53.4926
1 1533 Walcart LEGO Ninja ... T&G/LEG-37777 ... 1 6.70 1 1.2730
[2 rows x 14 columns]
To modify this table entry, you first have to select it using loc or
iloc, then assign it a new value:
ledger_df.loc[1, 'ProductID'] = 'T&G/LEGO-0190'
# or with .iloc
ledger_df.iloc[1, 3] = 'T&G/LEGO-0190'
Now, if you inspect ledger_df, you’ll see the new value in the
second row of the 'ProductID' column:
In [17]: ledger_df.head(2)
Out [17]: InvoiceNo Channel Name ProductID ... Total Quarter Total W/Out Tax
0 1532 Shoppe Cannon Wat... T&G/CAN-97509 ... 281.54 1 228.0474
1 1533 Walcart LEGO Ninja... T&G/LEGO-0190 ... 6.70 1 5.4270
[2 rows x 14 columns]
This is the standard way to assign new values to table slices using
pandas: first, you make a selection using loc or iloc, then you
assign it a new value. Although this may seem like a complicated
way of achieving something easily done in Excel, changing values
in a DataFrame using this kind of selection leaves a clear trace of
the operations applied to your data, whereas manually modifying
a sheet in Excel does not. However, if you need to modify many
table entries manually, it will be easier to do that in Excel than it is
with pandas.
You can also select ranges of values to modify, not just single entries.
For example, you can set the first three entries in the 'Quarter'
column to 'One' using:
In [18]: ledger_df.loc[0:2, 'Quarter'] = 'One'

         ledger_df.head()
Out [18]: InvoiceNo Channel Name ... Total Quarter Total W/Out Tax
0 1532 Shoppe Cannon Water Bom... ... 281.54 One 228.0474
1 1533 Walcart LEGO Ninja Turtl... ... 6.70 One 5.4270
2 1534 Bullseye NaN ... 58.35 One 47.2635
3 1535 Bullseye Transformers Age... ... 80.76 1 65.4156
4 1536 Understock Ty Beanie Boos S... ... 780.92 1 632.5452
[5 rows x 14 columns]
Out [19]: InvoiceNo Channel Name ... Total Quarter Total W/Out Tax
0 1532 Shoppe Cannon Water Bom... ... 281.54 One 228.0474
1 1533 Walcart LEGO Ninja Turtl... ... 6.70 Two 5.4270
2 1534 Bullseye NaN ... 58.35 Three 47.2635
3 1535 Bullseye Transformers Age... ... 80.76 1 65.4156
4 1536 Understock Ty Beanie Boos S... ... 780.92 1 632.5452
[5 rows x 14 columns]
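The single-entry and range assignments above can be sketched on a small stand-in table (the values are invented for illustration, and 'Quarter' is stored as text here so the column keeps a single data type). Note that a loc range such as 0:2 includes its end label, and that a range can receive either one value or a list with one value per selected row:

```python
import pandas as pd

# Stand-in table; the values are made up for this sketch.
df = pd.DataFrame({
    'ProductID': ['A-1', 'B-2', 'C-3', 'D-4'],
    'Quarter': ['1', '1', '1', '1'],
})

df.loc[1, 'ProductID'] = 'B-9'   # one entry, selected by row and column label
df.iloc[1, 0] = 'B-9'            # the same entry, selected by position

df.loc[0:2, 'Quarter'] = 'One'   # same value for rows labeled 0, 1 and 2
df.loc[0:2, 'Quarter'] = ['One', 'Two', 'Three']  # one value per selected row

print(df)
```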
One problem with using loc to change table values is that you
have to know their row and column labels. What if you want to
modify several values in a column and don’t know the row labels
for these values? In that case, you can use conditional filtering with
loc to first find the rows you need, then assign new values in the
column you want to change.
The first thing you need to do is filter the sales data on the 'Channel'
column, using loc, and select values in the 'Total' column:
In [20]: ledger_df.loc[ledger_df['Channel'] == 'Walcart', 'Total']
This selects the total sales amount for those rows in ledger_df
where the sales channel is 'Walcart'; nothing new yet. To
calculate the new tax amount (9% of each total), you can use:
In [21]: ledger_df.loc[ledger_df['Channel'] == 'Walcart', 'Total'] * (9 / 100)
This is what we wanted to calculate, but the code above doesn’t up-
date ledger_df with the new tax amounts. To update the 'Tax' col-
umn, you need to assign these values back to a slice of ledger_df:
In [22]: walcart_rows = ledger_df['Channel'] == 'Walcart'
         ledger_df.loc[walcart_rows, 'Tax'] = ledger_df.loc[walcart_rows, 'Total'] * (9 / 100)
Or if you want to do this in one go, you can run (the extra paren-
theses are there to split the long line of code in two):
In [23]: ledger_df.loc[ledger_df['Channel'] == 'Walcart', 'Tax'] = (
             ledger_df.loc[ledger_df['Channel'] == 'Walcart', 'Total'] * (9 / 100)
         )
Here, you use loc twice: first, to select the values you want to
update; second, to calculate new values from another column in
the table. Because both selections have the same size (i.e., the same
number of columns and rows), you can assign one to the other
using the = operator. You can use similar code to update Bullseye
tax amounts (but with a 16% sales tax):
In [24]: ledger_df.loc[ledger_df['Channel'] == 'Bullseye', 'Tax'] = (
             ledger_df.loc[ledger_df['Channel'] == 'Bullseye', 'Total'] * (16 / 100)
         )
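The filter-and-edit steps above can be sketched end to end on a stand-in table (the rows are invented for illustration; the 19% starting rate mirrors the tax values shown in the earlier outputs). Looping over a dictionary of rates is just one way to avoid repeating the same statement per channel, and is an assumption of this sketch rather than the approach used in the text:

```python
import pandas as pd

# Stand-in table; the values are made up for this sketch.
df = pd.DataFrame({
    'Channel': ['Walcart', 'Bullseye', 'Walcart', 'iBay'],
    'Total': [100.0, 200.0, 50.0, 10.0],
})
df['Tax'] = df['Total'] * (19 / 100)   # starting point: 19% on every row

# Channel-specific rates taken from the examples above.
rates = {'Walcart': 9 / 100, 'Bullseye': 16 / 100}

for channel, rate in rates.items():
    rows = df['Channel'] == channel                     # boolean filter
    df.loc[rows, 'Tax'] = df.loc[rows, 'Total'] * rate  # edit only those rows

print(df)
```

Rows that match none of the filters (the 'iBay' row here) keep their original tax amount.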
You can use conditional filtering with loc to define even more
complicated table slices and update table values in a similar way.
While it may seem like a lot of typing to modify some values,
remember that every change you make to your data this way
remains visible in your code, and you can re-run these changes
whenever you need to.1
1: Unlike Excel, where after several filter and edit operations,
there’s no record of what you did, and if Excel crashes, you have to
do them all over again.
Summary
This chapter showed you how to add, rename, and modify table
columns with pandas. There are a few pandas quirks you need to
get used to, but the benefits of using code to modify table entries
versus manually changing them are worth the learning effort.
So far, we’ve been looking at our data in close detail — but what if
you want to get a high-level view of your tables? The next chapter
takes a closer look at the DataFrame and Series methods you can
use to summarize data in pandas.
Summarizing data 18
In addition to slicing a table or adding new columns to it, you will
often want to extract specific pieces of information from its data:
how many unique values there are in a column, how frequently
they appear in the data, what the average value of a column is, etc.
These operations can be labeled as data summaries — let’s take a
look at how you compute them in pandas.
To find out how many unique values there are in each column of
your table, you can use the nunique method:
In [1]: ledger_df.nunique()
The same method works with Series objects, so you can compute
the number of unique values in a specific column using:
In [2]: ledger_df['Channel'].nunique()
Out [2]: 5
For single columns, you can also list unique values with:
In [3]: ledger_df['Channel'].unique()
The output above is a list-like array: it’s not a regular Python list,
but for our purposes, you can use it like one: you can loop over it
or access items from it using their index. Another way you can use
it is to check whether a particular value is among its items:
In [4]: 'Walcart' in ledger_df['Channel'].unique()
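A minimal sketch of these three operations on a stand-in column (the values are invented for illustration). It also shows why the membership check goes through unique: using in directly on a Series looks at its row labels, not its values:

```python
import pandas as pd

# Stand-in column; the values are made up for this sketch.
channels = pd.Series(['Shoppe.com', 'Walcart', 'Walcart', 'iBay.com'])

n_unique = channels.nunique()       # number of distinct values
values = channels.unique()          # list-like array of the distinct values

in_values = 'Walcart' in values     # checks the values
in_labels = 'Walcart' in channels   # checks the row labels (0, 1, 2, 3)

print(n_unique, list(values), in_values, in_labels)
```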
Even more useful than listing unique items is counting how many
times each value appears in a column using the value_counts
method:
In [6]: ledger_df['Channel'].value_counts()
To get each value’s share of the rows instead of a raw count, you
can pass normalize=True:
In [8]: ledger_df['Channel'].value_counts(normalize=True)
Notice that the values returned now sum to 1. If you want per-
centages (i.e., make the values sum to 100 instead of 1), you can
multiply the output above by 100:
In [9]: ledger_df['Channel'].value_counts(normalize=True) * 100
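The three variations above can be compared side by side on a stand-in column (the values are invented for illustration):

```python
import pandas as pd

# Stand-in column; the values are made up for this sketch.
channels = pd.Series(['Walcart', 'Walcart', 'iBay', 'Shoppe'])

counts = channels.value_counts()                # how many times each value appears
shares = channels.value_counts(normalize=True)  # fractions that sum to 1
percentages = shares * 100                      # the same fractions as percentages

print(counts)
print(percentages)
```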
pythonforaccounting.com/chapter18
In [11]: ledger_df['Quantity'].mean()
Just like value_counts works with multiple columns, you can use
mean on several numerical columns at the same time as well:
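As a minimal sketch on a stand-in table (the rows are invented for illustration), selecting several numerical columns with a list of names and calling mean returns one average per column:

```python
import pandas as pd

# Stand-in table; the values are made up for this sketch.
df = pd.DataFrame({
    'Quantity': [14, 1, 5],
    'Total': [281.54, 6.70, 58.35],
})

single = df['Quantity'].mean()          # one number
several = df[['Quantity', 'Total']].mean()  # a Series with one entry per column

print(single)
print(several)
```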
Table 18.1: Mathematical methods you can use to summarize numerical values in both Series and DataFrame objects.
Method    Example                           Output      Description
sum       ledger_df['Total'].sum()          1823206.56  Returns the sum of values in the 'Total' column.
mean      ledger_df['Quantity'].mean()      11.2498     Returns the average of values in the 'Quantity' column.
max       ledger_df['Unit Price'].max()     166.3       Returns the maximum value in the 'Unit Price' column.
idxmax    ledger_df['Unit Price'].idxmax()  616         Returns the row label for the maximum value in the 'Unit Price' column.
idxmin    ledger_df['Total'].idxmin()       7613        Returns the row label for the minimum value in the 'Total' column.
count     ledger_df['Channel'].count()      14054       Returns the number of non-empty (i.e., non-NaN) values in the 'Channel' column. This method works with non-numerical values as well.
quantile  ledger_df['Total'].quantile(.25)  19.26       Computes the value at a quantile passed as an argument (values between 0 and 1 are valid arguments). This example tells you that 25% of values in the 'Total' column are less than 19.26.
In [15]: ledger_df['Total'].describe()
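The methods in table 18.1, together with describe, can be tried on a short stand-in column (the values are invented for illustration, and one of them is deliberately missing to show that these methods skip empty values):

```python
import pandas as pd

# Stand-in column; the values are made up for this sketch.
totals = pd.Series([281.54, 6.70, None, 80.76], name='Total')

total_sum = totals.sum()         # the missing value is skipped
largest_label = totals.idxmax()  # row label of the largest value
smallest_label = totals.idxmin() # row label of the smallest value
non_empty = totals.count()       # counts non-empty values only
first_quartile = totals.quantile(.25)

summary = totals.describe()      # count, mean, std, min, quartiles and max at once
print(summary)
```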
Summary
This chapter went over several pandas methods you can use to
summarize your data. Most of these methods are available on both
Series and DataFrame objects, so you can use them to summarize
single columns or entire tables at once.
Cleaning data 19
Missing values, duplicate rows, and numbers or dates stored as text
are just some of the common problems you face when working
with any data, including balance sheets or ledgers. Fixing these
problems is generally labeled as data cleaning, and it infamously
takes up most of a data worker’s time. Fortunately, the designers
of pandas know what real-world data looks like and equipped it
with a set of data cleaning tools.
You can either keep missing values as they are, remove the rows
or columns they appear in, or fill them in with some other
appropriate value. With pandas, all of these tasks are
straightforward.
In [1]: ledger_df.head()
Out [1]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb Ba... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles S... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age of ... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age of ... ... 13.46 6 80.76
[5 rows x 12 columns]
To make things even more confusing, when you inspect this value
in a separate cell, it gets printed as nan instead of NaN:
In [2]: ledger_df.iloc[2, 2]
Together with Python’s built-in None value, there sure are a lot of
ways to represent nothing in Python and pandas.
However, all these sentinel values mean the same thing: value
not available. In numerical or text columns, pandas uses NaN
(which stands for not a number) to indicate empty values, whereas
in date columns, it uses NaT (which stands for not a timestamp).
Recently, pandas introduced the <NA> value, which is supposed to
replace both NaN and NaT as the only “not available” value in future
versions.1
1: The reason for so many variations of the same thing is fairly
technical, but you can read more about it at
pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.
The closest Excel equivalent to pandas’s sentinel values is the #N/A
error value. However, Excel’s #N/A is a cannot-compute value,
which gets shown whenever a function cannot produce a value,
rather than a missing value indicator (missing values in Excel are
just empty cells). There is no cannot-compute value in pandas;
whenever pandas can’t compute something, you’ll get an error.
After much ado about nothing, let’s see how you can find these
missing values in ledger_df.
You can use the isna method, available on both Series and
DataFrame objects, to detect missing values:
pythonforaccounting.com/chapter19
Out [4]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
2 1534 Bullseye NaN ... 11.67 5 58.35
6 1538 Understock.com NaN ... 31.36 9 282.24
16 1548 iBay.com NaN ... 8.07 5 40.35
35 1567 Understock.com NaN ... 12.30 5 61.50
36 1568 Walcart NaN ... 3.52 1 3.52
... ... ... ... ... ... ... ...
14021 15553 iBay.com NaN ... 17.28 23 397.44
14025 15443 Understock.com NaN ... 16.49 103 1698.47
14036 15495 Bullseye NaN ... 14.85 9 133.65
14040 15572 iBay.com NaN ... 16.76 6 100.56
14052 15584 iBay.com NaN ... 4.78 25 119.50
Or if you just want to know how many missing values there are,
because booleans are just numbers in disguise, you can sum the
output of isna:
In [5]: ledger_df['Product Name'].isna().sum()
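A minimal sketch of detecting, filtering and counting missing values on a stand-in table (the rows are invented for illustration):

```python
import pandas as pd

# Stand-in table; the values are made up for this sketch.
df = pd.DataFrame({
    'InvoiceNo': [1532, 1533, 1534],
    'Product Name': ['Water Balloons', 'Ninja Turtles Set', None],
})

missing = df['Product Name'].isna()  # boolean Series: True where a value is missing
rows_missing = df[missing]           # only the rows without a product name
n_missing = missing.sum()            # True counts as 1, False as 0

print(rows_missing)
print(n_missing)
```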
The sales ledger has a lot of missing values in the 'Product Name'
column. Let’s see what you can do with them.
The first option you might consider when coming across missing
values is to remove them from your DataFrame — especially if you
have entire rows or columns made up of empty values.
You can filter out all rows in ledger_df where 'Product Name' values
are empty using:
In [7]: ledger_df[ledger_df['Product Name'].notna()]
Out [7]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb Ba... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles S... ... 6.70 1 6.70
3 1535 Bullseye Transformers Age of ... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age of ... ... 13.46 6 80.76
5 1537 Bullseye 3x Anti-Spy Privacy ... ... 7.39 8 59.12
... ... ... ... ... ... ... ...
14048 15580 iBay.com Lauri Toddler Tote ... 14.46 1 14.46
14049 15581 Bullseye AC Adapter/Power Sup... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Gigabi... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/37 D... ... 4.18 1 4.18
14053 15585 Understock.com Sirius Satellite Rad... ... 33.16 2 66.32
As with most examples so far, you need to assign this output back
to ledger_df if you want to make this filter permanent.
More generally, you can discard all empty values using the dropna
method. By default, the dropna method discards all rows that have
empty values in any of the columns:
In [9]: ledger_df.dropna()
Out [9]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb Ba... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles S... ... 6.70 1 6.70
3 1535 Bullseye Transformers Age of ... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age of ... ... 13.46 6 80.76
5 1537 Bullseye 3x Anti-Spy Privacy ... ... 7.39 8 59.12
... ... ... ... ... ... ... ...
14048 15580 iBay.com Lauri Toddler Tote ... 14.46 1 14.46
14049 15581 Bullseye AC Adapter/Power Sup... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Gigabi... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/37 D... ... 4.18 1 4.18
14053 15585 Understock.com Sirius Satellite Rad... ... 33.16 2 66.32
In this case, the code above does the same thing as the previous
example: discards all rows in ledger_df where 'Product Name'
values are empty.
Out [10]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb Ba... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles S... ... 6.70 1 6.70
3 1535 Bullseye Transformers Age of ... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age of ... ... 13.46 6 80.76
5 1537 Bullseye 3x Anti-Spy Privacy ... ... 7.39 8 59.12
... ... ... ... ... ... ... ...
14048 15580 iBay.com Lauri Toddler Tote ... 14.46 1 14.46
14049 15581 Bullseye AC Adapter/Power Sup... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Gigabi... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/37 D... ... 4.18 1 4.18
14053 15585 Understock.com Sirius Satellite Rad... ... 33.16 2 66.32
This looks for empty values in only the two columns specified as
the subset argument and discards all rows that have an empty
value in either of those columns.
In [11]: ledger_df.dropna(how='all')
This example would discard only those rows from ledger_df that
are entirely empty (i.e., values in all columns are NaN).
You can also tell dropna to discard columns with empty values
instead of rows, using the axis keyword argument. To discard
all columns in ledger_df that have any empty values in them, you
can use:
In [12]: ledger_df.dropna(how='any', axis='columns')
Out [12]: InvoiceNo Channel ProductID ... Unit Price Quantity Total
0 1532 Shoppe.com T&G/CAN-97509 ... 20.11 14 281.54
1 1533 Walcart T&G/LEG-37777 ... 6.70 1 6.70
2 1534 Bullseye T&G/PET-14209 ... 11.67 5 58.35
3 1535 Bullseye T&G/TRA-20170 ... 13.46 6 80.76
4 1535 Bullseye T&G/TRA-20170 ... 13.46 6 80.76
... ... ... ... ... ... ... ...
14049 15581 Bullseye E/AC-63975 ... 28.72 8 229.76
14050 15582 Bullseye E/CIS-74992 ... 33.39 1 33.39
14051 15583 Understock.com E/PHI-08100 ... 4.18 1 4.18
14052 15584 iBay.com E/POL-61164 ... 4.78 25 119.50
14053 15585 Understock.com E/SIR-83381 ... 33.16 2 66.32
Notice in the output above that instead of rows with empty values
being dropped, the 'Product Name' column was discarded entirely
because it contained several NaN values.
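The dropna variations discussed above can be compared on a small stand-in table (the rows are invented for illustration; the last row is entirely empty so that how='all' has something to remove):

```python
import pandas as pd

# Stand-in table; the values are made up for this sketch.
df = pd.DataFrame({
    'InvoiceNo': [1532, 1533, 1534, None],
    'Product Name': ['Water Balloons', None, 'Tote', None],
    'Total': [281.54, 6.70, None, None],
})

complete_rows = df.dropna()                      # rows with no empty values at all
named_rows = df.dropna(subset=['Product Name'])  # only 'Product Name' is checked
non_blank_rows = df.dropna(how='all')            # drops rows that are entirely empty

# Drop columns instead of rows (using the first three rows only, because
# the fully empty last row would make every column count as incomplete).
full_columns = df.iloc[:3].dropna(axis='columns')

print(complete_rows)
print(full_columns)
```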
As with all the other DataFrame methods, you can combine dropna’s
keyword arguments in any way to get the result you need. Right
now, they might not seem particularly useful, but in the next project
chapter we’ll take a look at how you can work with a general ledger
in pandas, and you will see how practical dropna and its keyword
arguments are.
Notice that the third value in the output above, which previously
was a NaN, is now a copy of the value right above it. Similarly, in the
row labeled 14052 (i.e., second row from the bottom), the value is
a copy of the value directly above it.
2: The two keyword values, 'ffill' and 'bfill', stand for forward fill
and back fill.
You can tell fillna to use the first valid value found below, rather
than above, to fill in missing values:2
In [15]: ledger_df['Product Name'].fillna(method='bfill')
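Newer pandas releases (2.1 onward) deprecate the method keyword of fillna in favor of the dedicated ffill and bfill methods, which fill values the same way. A minimal sketch on a stand-in column (the values are invented for illustration):

```python
import pandas as pd

# Stand-in column; the values are made up for this sketch.
names = pd.Series(['Water Balloons', None, 'Tote', None])

filled_forward = names.ffill()   # copies the last valid value above each gap
filled_backward = names.bfill()  # copies the first valid value below each gap

print(filled_forward)
print(filled_backward)
```

The last entry stays empty after bfill because there is no valid value below it to copy.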
Notice in the output above that the value in the fifth row is True.
The duplicated method goes through each row in the DataFrame and
checks it against all the rows above it. If it finds another row
with the same values (in all columns) above any given row, it puts
a True value in the boolean Series it returns. In this case, rows
labeled 3 and 4 in ledger_df are duplicates:
In [17]: ledger_df.loc[3] == ledger_df.loc[4]
Out [19]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb Ba... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles S... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age of ... ... 13.46 6 80.76
5 1537 Bullseye 3x Anti-Spy Privacy ... ... 7.39 8 59.12
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power Sup... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Gigabi... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/37 D... ... 4.18 1 4.18
14052 15584 iBay.com NaN ... 4.78 25 119.50
14053 15585 Understock.com Sirius Satellite Rad... ... 33.16 2 66.32
Take a look at the row labels above and notice that label 4 (and its
associated row) is no longer in the table.
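One way to get a table like the one above is to use the boolean Series returned by duplicated as a filter, negated with the ~ operator. A minimal sketch on a stand-in table (the rows are invented for illustration):

```python
import pandas as pd

# Stand-in table; the last two rows are identical on purpose.
df = pd.DataFrame({
    'InvoiceNo': [1532, 1533, 1535, 1535],
    'Total': [281.54, 6.70, 80.76, 80.76],
})

is_duplicate = df.duplicated()  # True for rows that repeat an earlier row
unique_rows = df[~is_duplicate] # keep only the rows that are not duplicates

print(is_duplicate.tolist())
print(unique_rows)
```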
Just like the dropna method simplifies discarding empty values, the
drop_duplicates method makes removing duplicate rows easier
than using duplicated. To get the same result as above, and discard
duplicate rows from ledger_df, you can use:
In [20]: ledger_df.drop_duplicates()
Out [20]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb Ba... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles S... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age of ... ... 13.46 6 80.76
5 1537 Bullseye 3x Anti-Spy Privacy ... ... 7.39 8 59.12
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power Sup... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Gigabi... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/37 D... ... 4.18 1 4.18
14052 15584 iBay.com NaN ... 4.78 25 119.50
14053 15585 Understock.com Sirius Satellite Rad... ... 33.16 2 66.32
You get the same result as above, but with less typing. You can also
tell pandas to check for duplicates in a subset of columns only, by
passing a list of column names to the subset keyword argument:
In [21]: ledger_df.drop_duplicates(subset=['InvoiceNo', 'ProductID'])
Out [21]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb Ba... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles S... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age of ... ... 13.46 6 80.76
5 1537 Bullseye 3x Anti-Spy Privacy ... ... 7.39 8 59.12
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power Sup... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Gigabi... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/37 D... ... 4.18 1 4.18
14052 15584 iBay.com NaN ... 4.78 25 119.50
14053 15585 Understock.com Sirius Satellite Rad... ... 33.16 2 66.32
You can use the subset and keep keyword arguments with both
duplicated and drop_duplicates (i.e., to specify which columns to
use when checking for duplicates and whether to keep one of the
duplicate rows or not).
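A minimal sketch of how subset and keep change the result of drop_duplicates, on a stand-in table (the rows are invented for illustration). The first two rows share an invoice number and product ID but differ in quantity, so they only count as duplicates when the check is limited to those two columns:

```python
import pandas as pd

# Stand-in table; the values are made up for this sketch.
df = pd.DataFrame({
    'InvoiceNo': [1535, 1535, 1536],
    'ProductID': ['T-1', 'T-1', 'T-2'],
    'Quantity': [6, 7, 1],
})

columns = ['InvoiceNo', 'ProductID']

all_columns = df.drop_duplicates()                         # nothing is dropped
keep_first = df.drop_duplicates(subset=columns)            # keeps the first of each group
keep_last = df.drop_duplicates(subset=columns, keep='last')
keep_none = df.drop_duplicates(subset=columns, keep=False) # drops every duplicated row

print(keep_first)
```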
Sometimes pandas can’t figure out the right data type for a column
— because there’s something wrong with the column values (e.g.,
numbers and text values mixed in the same column) — and assigns
it the most generic type it can: the object data type.
If you know that a column should have a certain data type (e.g., it
has only numerical values), you can change its assigned type using
the astype method. For instance, to change the 'Quantity' column
type to float64 instead of int64, you can use:
In [22]: ledger_df['Quantity'].astype('float')
This turns all values in the 'Quantity' column from whole numbers
to decimal numbers. The astype method is particularly useful after
cleaning columns that store numbers as text values.
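As a minimal sketch of both uses of astype, the stand-in table below (invented for illustration) has one column of whole numbers and one column of prices stored as text:

```python
import pandas as pd

# Stand-in table; 'Unit Price' holds numbers stored as text.
df = pd.DataFrame({
    'Quantity': [14, 1, 5],
    'Unit Price': ['20.11', '6.70', '11.67'],
})

# astype returns a new column, so assign it back to keep the change.
df['Quantity'] = df['Quantity'].astype('float')
df['Unit Price'] = df['Unit Price'].astype('float')

print(df.dtypes)
print(df['Unit Price'].sum())  # arithmetic now works on the prices
```

If a column contains text that can't be read as a number, astype raises an error, which is why it is most useful after the column has been cleaned.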
Another useful DataFrame method that can fix issues with column
data types is convert_dtypes. This method asks pandas to try and
figure out better data types for each column in a DataFrame, and
can often reduce the amount of computer memory needed to store
your data:
In [23]: ledger_df.convert_dtypes().info()
Out [24]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb Ba... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles S... ... 6.70 1 6.70
2 1534 Bullseye <NA> ... 11.67 5 58.35
3 1535 Bullseye Transformers Age of ... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age of ... ... 13.46 6 80.76
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power Sup... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Gigabi... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/37 D... ... 4.18 1 4.18
14052 15584 iBay.com <NA> ... 4.78 25 119.50
14053 15585 Understock.com Sirius Satellite Rad... ... 33.16 2 66.32
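A minimal sketch of convert_dtypes on a stand-in table (the rows are invented for illustration). The exact data type names you see depend on your pandas version, so the comments below describe what we would expect rather than a guaranteed output:

```python
import pandas as pd

# Stand-in table; the values are made up for this sketch.
df = pd.DataFrame({
    'Product Name': ['Water Balloons', None],
    'Quantity': [14, 1],
})

converted = df.convert_dtypes()  # returns a new table with re-inferred types

print(df.dtypes)         # types assigned when the table was created
print(converted.dtypes)  # e.g., 'Quantity' becomes the nullable Int64 type
print(converted)         # missing text values may now print as <NA>
```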
Summary
This chapter showed you how to deal with missing values and
duplicates in a pandas DataFrame. We also took a quick look at the
convert_dtypes method, which you can use to nudge pandas into
finding better column data types for your tables.
Project: Reading and cleaning a
QuickBooks general ledger 20
A company’s general ledger is rich with information: revenues, ex-
penses, balances, adjustments are all recorded in the GL. However,
as rich as it is with information, the general ledger is often tricky
to work with: its format and organization make turning records
into insights far more complicated than it needs to be.
This project chapter shows you how to use the pandas tools we’ve
covered so far to clean and reformat a general ledger exported
from QuickBooks. At the end of this chapter, you’ll have a clean
general ledger DataFrame that is easy to slice, filter, or handle in
any way.
You may have noticed above that “QuickBooks GL.xlsx” isn’t analysis-
friendly: there are empty columns between data columns, the table
headers are on the fifth row, there are lots of missing values
throughout the data, etc. If you load “QuickBooks GL.xlsx” with
read_excel, you’ll get an unwieldy DataFrame:
ledger_df
Out [2]: Unnamed: 0 Carl's Design and Landscaping Services ... Unnamed: 18 Unnamed: 19
0 NaN General Ledger ... NaN NaN
1 NaN All Dates ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN Acct ... NaN Balance
4 NaN NaN ... NaN NaN
.. ... ... ... ... ...
476 NaN Total for Miscellaneous ... NaN NaN
477 NaN NaN ... NaN NaN
478 NaN Not Specified ... NaN NaN
479 NaN NaN ... NaN 0
480 NaN Total for Not Specified ... NaN NaN
ledger_df
pythonforaccounting.com/chapter20
ledger_df
Passing the method='ffill' keyword argument2 to fillna tells
pandas to go through the 'Account' column and, wherever it finds
a missing value, fill it in with the last valid value above it.
2: The 'ffill' value passed to fillna stands for forward-fill.
Now that you filled in all missing values in the 'Account' column,
you can count the number of postings in each account by running:
In [6]: ledger_df['Account'].value_counts()
Once the data is clean, you can easily compute account subtotals
yourself; for now, let’s remove subtotal rows from the table. There
are several ways to do that with pandas — as you read the following
chapters you’ll discover most of them. For now, let’s define a custom
Python function that tells us if an account name contains the word
'Total':
The function above first checks if name is not NaN (using pandas’s
notna3 helper function), then checks if name contains the word
'Total'. It returns the combined value of these checks as a boolean
(i.e., either True if both checks are True, or False if either of them
is False). The first check isn’t strictly necessary, but it prevents
the function from triggering an error if it gets a NaN value as its
input.
3: Whenever you need to check whether a value is NaN, use pandas’s
isna or notna helper functions. The isna function returns True if your
value is NaN, whereas notna returns True if your value isn’t.
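A function matching this description can be sketched as follows. The name is_subtotal is the one used in the for loop later in this chapter; the body is a reconstruction from the description above, and the sample names are invented for illustration:

```python
import pandas as pd

def is_subtotal(name):
    # pd.notna guards against NaN, which is a float and doesn't
    # support the `in` check that follows.
    return bool(pd.notna(name) and 'Total' in name)

print(is_subtotal('Total for Utilities'))  # a subtotal row
print(is_subtotal('Utilities'))            # a regular account name
print(is_subtotal(float('nan')))           # an empty value
```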
List comprehensions need some getting used to, but once they click
you’ll never stop using them — if the list comprehension above is
confusing, it is equivalent to the following for loop:
valid_accounts = []
for name in ledger_df['Account'].unique():
    if not is_subtotal(name):
        valid_accounts.append(name)
ledger_df
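A list comprehension equivalent to that loop can be sketched on a stand-in ledger (the rows are invented for illustration). Using isin to keep only rows with valid account names is an assumption of this sketch about how the list is then applied:

```python
import pandas as pd

def is_subtotal(name):
    # True for non-empty names that contain the word 'Total'.
    return bool(pd.notna(name) and 'Total' in name)

# Stand-in ledger; the values are made up for this sketch.
ledger = pd.DataFrame({
    'Account': ['Utilities', 'Utilities', 'Total for Utilities',
                'Miscellaneous', 'Total for Miscellaneous'],
    'Amount': [10.0, 5.0, 15.0, 7.0, 7.0],
})

# One line instead of the four-line for loop.
valid_accounts = [
    name for name in ledger['Account'].unique() if not is_subtotal(name)
]

# Keep only the rows whose account name is in the list.
ledger = ledger[ledger['Account'].isin(valid_accounts)]
print(ledger)
```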
You’ll see a much easier way to get the same result in one line of code
in the following chapter, but for now, let’s apply the same approach
to filtering subtotals from the 'SubAccount' column. Altogether,
the code to find and filter subtotals from both the 'Account' and
'SubAccount' columns is:
The problem here is that you can’t use the fillna method like
you did for the 'Account' column earlier. Consider the following
example_df
example_df
The code above is a lot to chew on: the for loop goes through
all unique values in the 'Account' column and uses each account
name to filter and edit ledger_df.4
4: Head back to “Filter and Edit” in chapter 17 for a reminder on how
you can use loc to modify a DataFrame slice.
However, it does the trick: if you check the 'Utilities' and
'Miscellaneous' accounts, you’ll see their sub-accounts filled in
without spillover:
In [20]: ledger_df
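A loop of the kind described above can be sketched on a stand-in ledger (the rows are invented for illustration, and the loop body is a reconstruction from the description rather than the original cell). Filling forward inside one account at a time stops a sub-account name from spilling over into the next account:

```python
import pandas as pd

# Stand-in ledger; the values are made up for this sketch.
ledger = pd.DataFrame({
    'Account': ['Utilities', 'Utilities', 'Utilities',
                'Miscellaneous', 'Miscellaneous'],
    'SubAccount': ['Gas', None, None, None, 'Bank fees'],
})

# A plain forward fill spills 'Gas' into the 'Miscellaneous' account.
spilled = ledger['SubAccount'].ffill()

# Filling one account at a time keeps values inside their own account.
for account in ledger['Account'].unique():
    rows = ledger['Account'] == account
    ledger.loc[rows, 'SubAccount'] = ledger.loc[rows, 'SubAccount'].ffill()

print(ledger)
```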
One last step in cleaning the general ledger is removing rows with
non-empty values only in their 'Account' or 'SubAccount' columns
(e.g., the first two rows above). You can do that by calling dropna
and specifying the subset of columns you want pandas to check
for empty values:
In [21]: ledger_df = ledger_df.dropna(subset=ledger_df.columns[2:], how='all')
Notice that instead of typing column names for the subset keyword
argument, you can use a slice of ledger_df’s columns attribute,
which already contains all column names in order (the first two
names being 'Account' and 'SubAccount' that we want to ignore
in this case).
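The same dropna call can be sketched on a stand-in ledger (the rows and the 'Date' and 'Amount' column names are invented for illustration). The first row has values only in its first two columns, so it is the one that gets removed:

```python
import pandas as pd

# Stand-in ledger; the values are made up for this sketch.
ledger = pd.DataFrame({
    'Account': ['Utilities', 'Utilities'],
    'SubAccount': ['Gas', 'Gas'],
    'Date': [None, '2018-01-05'],
    'Amount': [None, 45.0],
})

checked_columns = ledger.columns[2:]  # every column after the first two

# Drop rows where all of the checked columns are empty.
cleaned = ledger.dropna(subset=checked_columns, how='all')

print(list(checked_columns))
print(cleaned)
```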
After all this scrubbing, what can you do with ledger_df? For
instance, you can easily slice it to get postings in various accounts
or sub-accounts:
In [22]: ledger_df[
             (ledger_df['Account'] == 'Landscaping Services') &
             (ledger_df['SubAccount'] == 'Fountains and Garden Lighting')
         ]
You can also save the scrubbed GL back to an Excel file that is
much easier to work with by running:
In [25]: ledger_df.to_excel('project_data/Clean QuickBooks GL.xlsx', index=False)
Summary
This quick project chapter showed you how to use pandas to read
and clean a general ledger file. In the following chapters, you’ll
uncover more of pandas’s tools that not only let you clean and
reshape tables faster and with less code but also enable you to
extract insights from your data. One such set of tools is pandas’s
text handling methods, which we look at next.
Working with text columns 21
A large part of working with data revolves around handling text,
which Python calls strings. Python has many built-in tools1 for
manipulating strings, many of which have been adapted by pandas
to work on entire columns of string values. Before we look at how
you can use these tools with the data in ledger_df, let’s quickly
revisit Python strings.
1: Python’s flexible tools for manipulating strings are a big
contributor to its popularity.
You may remember from chapter 5 that you can create a new
string variable in Python by assigning a sequence of (one or more)
characters wrapped in quotes2 to a name:
2: Single or double quotes.
message
Another feature that both Python lists and strings have in common
is that you can use the in keyword to check for membership of an
item or a substring. For instance, you can check whether a piece of
text is part of message with:
In [5]: 'PYTHON' in message
Or if you have several variables that you want to put into a single
piece of text, you can use a Python f-string:
In [8]: first_name = "margaret"
last_name = "hamilton"
age = 83
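A minimal sketch of both ideas, the in check and an f-string built from the three variables above. The text assigned to message and the wording of the sentence are assumptions made for this sketch:

```python
message = "Python is great!"   # assumed value, for illustration only

# Membership checks on strings are case-sensitive.
found_exact = 'PYTHON' in message
found_any_case = 'PYTHON' in message.upper()

first_name = "margaret"
last_name = "hamilton"
age = 83

# Expressions inside the curly braces are evaluated and inserted into the text.
sentence = f"{first_name.title()} {last_name.title()} is {age} years old."

print(found_exact, found_any_case)
print(sentence)
```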
All of the operations above can be applied to string values that are
part of DataFrame or Series objects, but with a slight pandas twist.
Let’s take a closer look.
pythonforaccounting.com/chapter21
If you try to call the upper method directly, without typing str in
front of it, you’ll get the following error message:
In [10]: ledger_df['Channel'].upper()
You can call string methods on any column that contains text values
— you will get an error if you call them on columns that have other
types of data (e.g., numbers or mixed values). Besides the upper
method above, there are many other string methods available; table
21.1 lists some of them, together with code examples. Most of these
Series methods have the same name as Python’s built-in string
methods (e.g., upper, lower, title); if you discover an interesting
method on a Python string, a Series equivalent that works on an
entire column of strings is likely available.
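The difference between calling a string method through str and calling it directly can be sketched on a stand-in column (the values are invented for illustration):

```python
import pandas as pd

# Stand-in column; the values are made up for this sketch.
channels = pd.Series(['Shoppe.com', 'Walcart', None])

upper = channels.str.upper()  # applies upper to every value; empty values stay empty

try:
    channels.upper()          # Series objects have no upper method of their own
except AttributeError as error:
    error_message = str(error)

print(upper)
print(error_message)
```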
Table 21.1: Some of the methods available on Series objects containing text values. These and more methods are described
in the official pandas documentation available at pandas.pydata.org/pandas-docs/stable/user_guide/text.html.
Method     Example                                                   Description
contains   df['Product Name'].str.contains('shoe')                   Returns a boolean Series indicating whether a given string occurs in each value of the Series.
           df['Product Name'].str.contains('shoe', case=False)       Same as above, but ignores string case.
title df['Product Name'].str.title() Converts strings in a Series to title case (i.e., each
word in the string starts with a capital letter).
strip df['Product Name'].str.strip() Removes leading and trailing characters in each
value of a Series. By default, removes whitespace
characters (i.e., spaces, tabs or newline characters).
df['Product Name'].str.strip(to_strip='x') Same as above, but specifies which leading and
trailing characters to remove (e.g., 'xShirt (x-small)x'
becomes 'Shirt (x-small)').
replace (df['Product Name'] Replaces occurrences of a given string in each
.str value of a Series; this example ignores case when
.replace("w/", "with", case=False)) looking for characters to replace.
slice df['Product Name'].str.slice(10, 20) Returns a slice from each string in the Series. Start
and stop positions are specified as arguments. If the
string is shorter than start characters, the returned
value is an empty string.
(df['Product Name'] Same as above, but returns every second character.
.str
.slice(10, 20, step=2))
split df['Product Name'].str.split() Splits each string in the Series around a delimiter
into a list of strings. By default, the delimiter is any
(one or more) whitespace character.
df['Product Name'].str.split(pat="x") Same as above, but uses the x character as a delim-
iter.
isalpha df['Product Name'].str.isalpha() Returns a boolean Series indicating whether each
value contains only alphabetic symbols (i.e., 'abc'
will return True , whereas 'a.b.c.' or 'a2' will
return False ).
isnumeric df['Product Name'].str.isnumeric() Returns a boolean Series indicating whether each
value contains only numeric symbols (i.e., '1' or
'½' will return True , whereas '1/2' or '1.5' will
return False ).
It’s worth pointing out that these string methods skip empty values
(i.e., NaNs or <NA>). For instance, to make all values in the 'Product
Name' column uppercase, you can use the upper method as before
— but notice in the output of upper that missing values remain
unchanged (i.e., NaNs remain NaNs):
In [11]: ledger_df['Product Name']
...
14049 False
14050 False
14051 False
14052 NaN
14053 False
Name: Product Name, Length: 14054, dtype: object
Out [15]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
1 1533 Walcart LEGO Ninja Turtles S... ... 6.70 1 6.70
43 1575 iBay.com LEGO Star Wars Clone... ... 14.68 2 29.36
105 1637 Bullseye LEGO LOTR 79006 The ... ... 7.67 6 46.02
176 1708 Shoppe.com LEGO City Fire Chief... ... 24.95 1 24.95
228 1608 iBay.com LEGO Star Wars Clone... ... 14.68 2 29.36
... ... ... ... ... ... ... ...
13525 15007 Understock.com LEGO City Trains Hig... ... 7.41 18 133.38
13550 15082 Walcart LEGO City Fire Chief... ... 25.11 8 200.88
13731 15131 Walcart LEGO City Fire Chief... ... 25.11 8 200.88
13753 15285 Understock.com LEGO Star Wars Clone... ... 14.84 2 29.68
14031 15468 Understock.com LEGO Star Wars Clone... ... 14.84 2 29.68
Notice you can chain the fillna method to the output of contains;
in general, you can chain as many Series methods as you need. In
this case, I filled missing values in the boolean Series with False ,
because I wanted to discard rows with missing product names,
but you can fill them in with True if you want to keep them.
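The chaining described above can be sketched on a tiny made-up table (the DataFrame below and the search term are assumptions for illustration). As far as I know, contains also accepts an na argument that fills missing results directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Product Name": ["LEGO City Fire Chief", np.nan, "Nikon Camera"],
    "Total": [24.95, 58.35, 119.50],
})

# contains leaves NaN where the product name is missing...
is_lego = df["Product Name"].str.contains("LEGO")

# ...so fill those gaps with False before using the Series as a filter.
mask = is_lego.fillna(False).astype(bool)
lego_df = df[mask]

# Alternative: let contains fill the missing results itself.
mask_alt = df["Product Name"].str.contains("LEGO", na=False)
print(lego_df)
```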
14052 Anazon.com
14053 Understock.com
Name: Channel, Length: 14054, dtype: object
In some cases, you can get the same result with either method —
but in those cases where you can’t, knowing about the two replace
methods will save you some headache.
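A short sketch of how the two replace methods differ, using made-up channel names: Series.replace matches whole values, whereas Series.str.replace matches text inside each value.

```python
import pandas as pd

channels = pd.Series(["Shoppe.com", "Understock.com", "Bullseye"])

# Series.replace only swaps values that match exactly, so nothing changes
# here because no value is equal to ".com".
whole_values = channels.replace(".com", "")

# Series.str.replace looks inside each string.
inside_text = channels.str.replace(".com", "", regex=False)

print(whole_values.tolist())
print(inside_text.tolist())
```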
Out [20]: 0 1
0 T&G CAN-97509
1 T&G LEG-37777
2 T&G PET-14209
3 T&G TRA-20170
4 T&G TRA-20170
... ... ...
24019 E AC-63975
24020 E CIS-74992
24021 E PHI-08100
24022 E POL-61164
24023 E SIR-83381
The code above returns a DataFrame with one column for each item
in the list of splits (had the 'ProductID' values contained multiple
forward slash characters, the number of splits would have been
greater and the DataFrame returned by split would have had more
than two columns).
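Here is a minimal sketch of splitting into columns and joining them back, assuming the split uses the forward slash as a delimiter together with expand=True (the sample product IDs are modeled on the output above):

```python
import pandas as pd

df = pd.DataFrame({"ProductID": ["T&G/CAN-97509", "T&G/LEG-37777", "E/AC-63975"]})

# expand=True returns a DataFrame with one column per split item.
parts = df["ProductID"].str.split("/", expand=True)

df["CategoryID"] = parts[0]
df["ItemID"] = parts[1]

# The + operator joins text columns value by value.
rebuilt = df["CategoryID"] + "/" + df["ItemID"]
print(df)
```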
ledger_df
Out [21]: InvoiceNo Channel Product Name ... Total CategoryID ItemID
0 1532 Shoppe.com Cannon Water Bom... ... 281.54 T&G CAN-97509
1 1533 Walcart LEGO Ninja Turtl... ... 6.70 T&G LEG-37777
2 1534 Bullseye NaN ... 58.35 T&G PET-14209
3 1535 Bullseye Transformers Age... ... 80.76 T&G TRA-20170
4 1535 Bullseye Transformers Age... ... 80.76 T&G TRA-20170
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power... ... 229.76 E AC-63975
14050 15582 Bullseye Cisco Systems Gi... ... 33.39 E CIS-74992
14051 15583 Understock.com Philips AJ3116M/... ... 4.18 E PHI-08100
14052 15584 iBay.com NaN ... 119.50 E POL-61164
14053 15585 Understock.com Sirius Satellite... ... 66.32 E SIR-83381
You can concatenate text columns using the same operator. For
instance, you can re-create product IDs from the new 'CategoryID'
and 'ItemID' columns using:
In [23]: ledger_df['CategoryID'] + '/' + ledger_df['ItemID']
ledger_df.info()
ledger_df.info()
For all practical uses, columns with the object or string data type
work the same way. However, keep in mind that columns with
different types of values (e.g., strings, numbers, or booleans in the
same Series) also get assigned the object data type. For instance,
the following Series has the object data type:
In [29]: pd.Series([1011, '$1320', "$980", 645, 340])
The code above tries to strip the dollar sign in front of the two
values that have it. However, because there is no strip method
available on integers, the numbers in the Series turned into NaN
values. If you want to keep them in the output, you first have to
convert all values in the Series to the string data type:
In [31]: pd.Series([1011, '$1320', "$980", 645, 340]).astype('string').str.strip('$')
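The difference between the two approaches can be checked side by side. The last step, turning the cleaned text into numbers with pd.to_numeric, is my own addition for illustration:

```python
import pandas as pd

amounts = pd.Series([1011, "$1320", "$980", 645, 340])

# On a mixed Series, the integers have no strip method and become NaN.
stripped_mixed = amounts.str.strip("$")

# Converting everything to strings first keeps all five values.
stripped_all = amounts.astype("string").str.strip("$")

# Once the text is clean, it can be converted to numbers.
as_numbers = pd.to_numeric(stripped_all)
print(as_numbers.tolist())
```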
All of the text values below would match this regular expression:
Nikon Camera
Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom
D3100 14.2MP Nikon Camera Digital SLR with 18-55mm f/3.5-5.6 VR
Regular expressions have their own symbols and operators that
are different from Python’s or pandas’s. You can think of regular
expressions as an entirely different, highly specialized programming
language that is embedded in Python. As such, this section
can only be a concise introduction to this new language; you’ll
have to read more about regular expressions on your own if you
want to add them to your toolkit.4
4: Python’s official documentation includes a tutorial on regular
expressions that is an excellent place to start learning more about
them. It’s available at docs.python.org/3/howto/regex.html#regex-howto.
However, keep in mind that for most use cases (and with some
creativity) you will be able to get the results you need with
pandas’s string methods without using regular expressions at all.
The regular expression above can be reproduced by combining
multiple uses of the contains string method. Nevertheless, regular
expressions are available in Python and pandas, and you can use
them to find and extract complicated patterns from text whenever
you need to.
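As a rough sketch of that equivalence, assume the pattern is something like 'Nikon.*Camera' (my guess at a pattern that matches the three product names listed earlier, not necessarily the one used in this chapter). Note that the two approaches only agree when "Nikon" comes before "Camera"; the combined contains version ignores word order.

```python
import pandas as pd

names = pd.Series([
    "Nikon Camera",
    "Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom",
    "D3100 14.2MP Nikon Camera Digital SLR with 18-55mm f/3.5-5.6 VR",
    "Canon PowerShot",
])

# One regular expression: "Nikon", then anything, then "Camera".
with_regex = names.str.contains(r"Nikon.*Camera", regex=True)

# Two plain-text checks combined with the & operator.
without_regex = (
    names.str.contains("Nikon", regex=False)
    & names.str.contains("Camera", regex=False)
)
print(with_regex.tolist())
```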
cameras_df = ledger_df[is_camera]
cameras_df = cameras_df[['ProductID', 'Product Name', 'Total']]
cameras_df
cameras_df['Product Name'].str.extract(pattern)
Out [36]: 0 1
117 Kodak ZM1
287 Kodak ZM1
616 Kodak EasyShare
2151 Canon PowerShot
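For reference, a two-column result like this one comes from a pattern with two capture groups. The pattern and product names below are hypothetical stand-ins, chosen only to show the mechanics of extract:

```python
import pandas as pd

# Made-up product names, for illustration only.
names = pd.Series(["Kodak ZM1 Digital Camera", "Canon PowerShot A2500"])

# Each pair of parentheses is a capture group; extract returns one
# column per group (numbered 0, 1, ...).
pattern = r"(Kodak|Canon)\s+(\w+)"
extracted = names.str.extract(pattern)
print(extracted)
```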
or hard to figure out, you are in the company of everyone who has
ever used them.
Summary
Working with date columns 22
Handling dates is another typical task when working with data.
Fortunately, pandas comes with a wide range of tools for manipu-
lating dates — this chapter explores some of the most useful ones.
First, as we did with strings, let’s quickly revisit Python dates.
The code from chapter 9 illustrating how you can use the datetime
module and its objects is shown (and slightly expanded) below:
In [1]: from datetime import date, time, datetime, timedelta
days_of_holiday = timedelta(days=14)
minutes_of_nap = timedelta(minutes=30)
The datetime values above are just as easy to work with as Python
strings or numbers. You can use comparison operators with date
or datetime values, or perform any kind of date arithmetic by
adding and subtracting timedelta values to or from dates:
In [2]: today + days_of_holiday
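A self-contained sketch of this kind of date arithmetic follows. It fixes today to a specific date so the results are reproducible; that choice is an assumption made for the example.

```python
from datetime import date, timedelta

today = date(2020, 9, 30)  # fixed date, chosen for illustration
days_of_holiday = timedelta(days=14)

# Adding a timedelta to a date gives another date.
back_at_work = today + days_of_holiday

# Dates can be compared, and subtracting two dates gives a timedelta.
is_later = back_at_work > today
days_away = (back_at_work - today).days
print(back_at_work, is_later, days_away)
```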
Table 22.1: Some of the format specifiers available for datetime objects. These are described in the datetime documentation
available at docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior. Examples in this table are based on
the following datetime value: datetime(2020, 9, 30, 7, 6, 5) — which represents 7:06 AM on September 30th 2020.
Specifier  Description                                          Output
'%a'       Weekday abbreviated name.                            Wed
'%A'       Weekday full name.                                   Wednesday
'%d'       Day of the month as a zero-padded whole number.      30
'%-d'      Day of the month as a whole number.                  30
'%b'       Abbreviated month name.                              Sep
'%B'       Full month name.                                     September
'%m'       Month as zero-padded whole number.                   09
'%-m'      Month as a whole number.                             9
'%y'       Year without century as zero-padded whole number.    20
'%Y'       Year with century as whole number.                   2020
'%H'       Hour (24-hour clock) as zero-padded whole number.    07
'%-H'      Hour (24-hour clock) as whole number.                7
'%I'       Hour (12-hour clock) as zero-padded whole number.    07
'%-I'      Hour (12-hour clock) as whole number.                7
'%p'       AM or PM.                                            AM
'%M'       Minute as a zero-padded whole number.                06
'%-M'      Minute as whole number.                              6
'%S'       Second as zero-padded whole number.                  05
'%-S'      Second as whole number.                              5
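The specifiers can be combined freely in a single format string. The sketch below uses the same datetime value as the table and sticks to the zero-padded specifiers, because the dash variants (such as '%-d') depend on the platform and, as far as I know, are not available everywhere:

```python
from datetime import datetime

moment = datetime(2020, 9, 30, 7, 6, 5)

# strftime turns a datetime into text using format specifiers.
label = moment.strftime("%A, %d %B %Y at %I:%M %p")

# strptime goes the other way: it parses text using the same specifiers.
parsed = datetime.strptime("30/09/20 07:06", "%d/%m/%y %H:%M")
print(label)
print(parsed)
```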
pythonforaccounting.com/chapter22
All this is great, but you didn’t make it this far to work with date
values one by one. You can easily apply the operations above to
date columns in a DataFrame, but as usual, with a slight pandas
twist. Let’s take a closer look.
Date columns
In [10]: ledger_df['Date'].iloc[0]
timestamp_date == datetime_date
Data is often sourced from separate systems that use different date
formats and then combined into one dataset, so fixing date formats
is a common data cleaning task. Let’s see how you can
use pandas to fix values in the 'Deadline' column.
Values in the 'Deadline' column are strings (i.e., text values) that
look like dates in various formats (e.g., '2-03-20', '04/23/20', 'Sun
Mar 29 00:00:00 2020', 'August 02 2020' are all values from the
'Deadline' column). While you may think it takes a lot of work to
convert these strings to date values, pandas doesn’t — to convert
them, you can use pandas’s to_datetime function:
In [13]: pd.to_datetime(ledger_df['Deadline'])
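How to_datetime treats a column with several formats can depend on the pandas version: to my knowledge, recent releases expect one format for the whole column by default and offer format='mixed' for columns like this one. A version-tolerant sketch, assuming a few sample values in the style of the 'Deadline' column, is to parse each value on its own:

```python
import pandas as pd

# Sample values modeled on the formats mentioned above.
deadlines = pd.Series(["04/23/20", "Sun Mar 29 00:00:00 2020", "August 02 2020"])

# Applying to_datetime value by value lets each string use its own format.
# This is slower than a single call on the column, but it does not rely
# on one shared format.
parsed = deadlines.apply(pd.to_datetime)
print(parsed)
```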
pd.to_datetime(dates)
might want to investigate why. For now, these dates seem correct,
and you can use them to fix values in the 'Deadline' column by
assigning the output of to_datetime back to the column:
In [17]: ledger_df['Deadline'] = pd.to_datetime(ledger_df['Deadline'])
If you run info again on ledger_df, you will see that the 'Deadline'
column now has the datetime64 data type (column 7 below):
In [18]: ledger_df.info()
You can use to_datetime to convert any column that pandas doesn’t
recognize as containing dates. Converting strings to dates (i.e.,
object or string columns to datetime64 columns) is important be-
cause it allows you to perform date arithmetic with those columns
and enables a range of date-specific methods. Let’s take a look at
some of these date methods next.
Just like you use the str attribute to access string methods on a
Series of string values, you can use the dt attribute on datetime64
columns to access date-specific attributes or methods. For instance,
if you want to extract the year from each value in the 'Deadline'
column, you can use:
In [20]: ledger_df['Deadline'].dt.year
Table 22.2: Attributes available on date columns. These attributes are documented in the official pandas documentation
available at pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html.
Attribute name Example Description
year ledger_df['Deadline'].dt.year Returns the year of each date as a Series of
integers.
month ledger_df['Deadline'].dt.month Returns the month of each date as a Series
of integers.
day ledger_df['Deadline'].dt.day Returns the day of each date as a Series of
integers.
date ledger_df['Deadline'].dt.date Returns the date component (without time) of
each date as a Series of Python date values.
time ledger_df['Deadline'].dt.time Returns the time component (without date) of
each date as a Series of Python time values.
dayofyear ledger_df['Deadline'].dt.dayofyear Returns the day of year for each date as a
Series of integers (from 1 to 366).
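A quick sketch of the dt attribute on a small made-up Series of dates (the dates are illustrative; quarter and day_name are included because they come up again later in this chapter):

```python
import pandas as pd

deadlines = pd.Series(pd.to_datetime(["2020-02-03", "2019-12-26", "2020-10-24"]))

years = deadlines.dt.year          # attribute: no parentheses
quarters = deadlines.dt.quarter    # attribute
day_names = deadlines.dt.day_name()  # method: needs parentheses

print(years.tolist(), quarters.tolist(), day_names.tolist())
```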
With day_name, you can even specify the language you want your
day names in — for instance, to get German day names for all dates
in the 'Deadline' column:
In [22]: ledger_df['Deadline'].dt.day_name('de_DE')
Table 22.3: Methods available on date columns. These (and more) methods are documented in the official pandas
documentation available at pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html.
You’ll often need to filter tables based on date ranges or date values.
Fortunately, filtering a DataFrame on one of its date columns is as
straightforward as working with any other type of data.
Out [23]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
7224 8756 Bullseye Miele Type U AirCl... ... 11.71 23 269.33
7225 8757 Shoppe.com Verizon LG G2 Chev... ... 22.21 5 111.05
7226 8758 Understock.com Coleman 5620B718G ... ... 9.67 3 29.01
7227 8759 Understock.com 12-Inch & 9-Inch S... ... 25.85 1 25.85
7228 8760 iBay.com Coaster Oriental S... ... 2.40 2 4.80
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power S... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Giga... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/37... ... 4.18 1 4.18
14052 15584 iBay.com NaN ... 4.78 25 119.50
14053 15585 Understock.com Sirius Satellite R... ... 33.16 2 66.32
shift = dt.timedelta(days=1)
ledger_df[
(ledger_df['Date'] > start_date - shift) &
(ledger_df['Date'] < end_date + shift)
]
Out [25]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
6748 8280 Shoppe.com Fender 005-3191-00... ... 43.45 6 260.70
6749 8281 Understock.com 3M 6897 Black Head... ... 4.40 14 61.60
6750 8282 Understock.com Tarantula Sleeve W... ... 10.57 6 63.42
6751 8283 Walcart Hubsan X4 H107C 2.... ... 5.46 1 5.46
6752 8284 Understock.com Reusable Particula... ... 20.68 2 41.36
... ... ... ... ... ... ... ...
9331 10800 Understock.com Samsung Galaxy S3 ... ... 12.28 17 208.76
9332 10818 iBay.com Bushnell Velocity ... ... 5.26 3 15.78
9333 10821 iBay.com Vivitar V69379-SIL... ... 7.37 6 44.22
9334 10823 Understock.com Cat People / The C... ... 14.82 76 1126.32
9335 10833 Understock.com BANG ... 10.08 6 60.48
The benefit of this approach is that if you want to change the range
used for filtering, you can simply modify the shift variable and
re-run the cell, instead of having to edit date-strings manually.
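Here is a self-contained sketch of the same filter on a tiny made-up table. The start_date and end_date values are assumptions for the example, and dt here is the datetime module (not the dt attribute of a Series):

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-06-30", "2020-07-01", "2020-07-15", "2020-08-01"]),
    "Total": [10.0, 20.0, 30.0, 40.0],
})

start_date = pd.Timestamp("2020-07-01")
end_date = pd.Timestamp("2020-07-31")
shift = dt.timedelta(days=1)

# Widening the range by one day on each side makes the strict
# comparisons include both start_date and end_date.
in_range = df[
    (df["Date"] > start_date - shift) &
    (df["Date"] < end_date + shift)
]
print(in_range)
```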
For more complex date filters, you can even use the date attributes
and methods I mentioned earlier. For instance, to filter ledger_df
and keep all sales with deadlines on Thursdays, in the fourth quarter of
2018 or 2019, you can use:
In [26]: ledger_df[
(ledger_df['Deadline'].dt.year.isin([2018, 2019])) &
(ledger_df['Deadline'].dt.quarter == 4) &
(ledger_df['Deadline'].dt.day_name() == 'Thursday')
]
Out [26]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
6 1538 Understock.com NaN ... 31.36 9 282.24
22 1554 iBay.com Nerf N-Sports Weat... ... 25.95 3 77.85
38 1570 Walcart DR Strings Nickel ... ... 13.08 1 13.08
65 1597 Walcart Tork Dispenser Nap... ... 5.23 17 88.91
85 1617 Walcart Battery Back Door ... ... 23.86 4 95.44
... ... ... ... ... ... ... ...
13806 15338 Understock.com Nokia Lumia 520 8G... ... 16.99 3 50.97
13873 15236 Shoppe.com The Sephra 16-Inch... ... 3.90 12 46.80
13900 15250 Shoppe.com The Sephra 16-Inch... ... 3.90 12 46.80
13943 15475 Understock.com Small Ooze Tube - ... ... 5.95 7 41.65
14000 15388 Understock.com NaN ... 11.98 23 275.54
Notice in the example above that you can chain regular Series
methods to the output of dt methods — because many dt methods
or attributes return numbers or strings, not datetime64 values.
The date format specifiers you can use with strftime are the same
ones listed earlier in table 22.1.
There are several pandas objects you can use to transform date
values in arbitrary ways:
Timedeltas
In [28]: ledger_df['Deadline']
The arguments available when you create a Timedelta object are
weeks, days, hours, minutes and seconds6 and you can use any
combination of arguments to define the time interval you need:
6: And milliseconds, microseconds, nanoseconds if you need to be
that precise with your dates.
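A brief sketch of combining several arguments, with made-up deadline dates:

```python
import pandas as pd

# One week, two days and twelve hours, as a single time interval.
interval = pd.Timedelta(weeks=1, days=2, hours=12)

deadlines = pd.Series(pd.to_datetime(["2020-02-03", "2020-10-24"]))

# Adding the Timedelta shifts every date in the Series by the same amount.
extended = deadlines + interval
print(extended)
```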
Perhaps more useful is that you can divide Timedelta objects to get
a time interval in a different unit. For instance, if you subtract the
two date columns in ledger_df, you get a time difference between
dates expressed in days, as you saw above (days is the default time
interval for Timedelta objects). To express this difference in hours,
you can simply divide the output of subtracting the two columns
by a Timedelta that uses the hour time interval:
In [35]: (ledger_df['Date'] - ledger_df['Deadline']) / pd.Timedelta(hours=1)
Now you have the difference between the two date columns ex-
pressed in hours instead of days. Notice, however, that the output
is a numerical Series (i.e., it has the float64 data type), and is no
longer a Series of Timedelta values.
Date offsets
14053 2020-02-03
Name: Deadline, Length: 14054, dtype: datetime64[ns]
Even more useful is that you can use it to modify part of your dates.
For instance, if you want to set the year of all dates in the 'Deadline'
column to 1999, you can use the following DateOffset operation:
In [38]: ledger_df['Deadline'] + pd.DateOffset(year=1999)
This returns a new datetime64 Series with all dates set in 1999.
The slight difference from the previous example where you added
2 years to each date is that the keyword argument used with
pd.DateOffset above is in the singular: year not years.
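The plural and singular forms are easy to mix up, so here is a side-by-side sketch on two made-up dates:

```python
import pandas as pd

deadlines = pd.Series(pd.to_datetime(["2020-02-03", "2019-12-26"]))

# Plural keyword: shift each date forward by two years.
two_years_later = deadlines + pd.DateOffset(years=2)

# Singular keyword: replace the year part of each date with 1999.
in_1999 = deadlines + pd.DateOffset(year=1999)

print(two_years_later.tolist())
print(in_1999.tolist())
```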
that are available in the singular form are year, day, hour, minute,
second, microsecond and nanosecond — and, as before, you can use
any combination to get the results you need.
After you run the code above, remember you can use JupyterLab’s
autocomplete feature to see what predefined offsets are available
— in a separate cell, type offsets. and press the TAB key:
In [40]: offsets.<TAB>
reference_date.day_name()
Besides the offsets used above, there are several other predefined
date offsets you can use (and each accepts different keyword
arguments) — table 22.4 lists some of them.
Table 22.4: Predefined DateOffset types available from the pandas.tseries.offsets submodule. These offset
types (and more) are documented in the official pandas documentation available at
pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects.
All examples use date = pd.Timestamp(year=2020, month=3, day=12, hour=12, minute=30) which represents 12:30
PM on March 12th 2020.
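As a small sketch of two predefined offsets applied to the reference date above (MonthEnd and BDay are my picks for illustration; the import style is an assumption):

```python
import pandas as pd
from pandas.tseries import offsets

date = pd.Timestamp(year=2020, month=3, day=12, hour=12, minute=30)

# Roll forward to the last day of the month (the time of day is kept).
month_end = date + offsets.MonthEnd()

# Move forward by one business day (March 12th 2020 was a Thursday).
next_business_day = date + offsets.BDay()

print(month_end, next_business_day)
```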
Keep in mind that if you try to use these predefined offsets with
a date column that contains missing values (i.e., NaN, NaT or NA),
you will get an error, whereas Timedelta date arithmetic will work,
but keeps missing values in the output.
Periods
Besides dates and date offsets, you can also create regular intervals
of time (e.g., a number of business days, a calendar month, or a
quarter) using pandas’s Period object. For accounting, Period
objects are particularly useful for converting calendar dates to fiscal
quarters in different year-end setups.
In [45]: ledger_df['Deadline'].dt.to_period(freq='Q')
In [46]: ledger_df['Deadline'].dt.to_period(freq='Q-SEP')
[Figure: the date 2020-10-24 located on rows of periods, including freq='M' (JAN to DEC) and freq='A-DEC' (2020).]
Figure 22.1: Illustration of converting 2020-10-24 to periods. Each row represents a succession of periods (i.e., time spans)
that have different lengths and year-end setups. The dotted line crosses each row to show the period representation of
2020-10-24 when using different frequency values with to_period.
Now the quarter labels above are based on a year (i.e., a 12-month
period) that ends in September. By default, December is used as the
year-end month; however, any other month label is valid, as long
as it is specified in the same style (i.e., 'Q-JAN' through 'Q-DEC').
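To see how the year-end month changes the labels, here is a sketch with two made-up dates. With a September year-end, October 2020 falls in the first quarter of the fiscal year that ends in September 2021:

```python
import pandas as pd

deadlines = pd.Series(pd.to_datetime(["2020-10-24", "2020-02-03"]))

# Calendar quarters (year ends in December).
calendar_quarters = deadlines.dt.to_period(freq="Q")

# Fiscal quarters for a year that ends in September.
fiscal_quarters = deadlines.dt.to_period(freq="Q-SEP")

print(calendar_quarters.astype(str).tolist())
print(fiscal_quarters.astype(str).tolist())
```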
The values in both Series output above are quarter labels, but you
can use different frequencies with to_period to convert dates to
other time intervals. For instance, to convert dates in the 'Deadline'
column to yearly period labels:
In [47]: # A stands for annual
ledger_df['Deadline'].dt.to_period(freq='A')
14051 2020
14052 2020
14053 2020
Name: Deadline, Length: 14054, dtype: period[A-DEC]
And as with quarters, you can also specify the year-end month:
In [48]: ledger_df['Deadline'].dt.to_period(freq='A-MAR')
periods
In [50]: periods.dt.to_timestamp()
14053 2020-01-01
Name: Deadline, Length: 14054, dtype: datetime64[ns]
The to_timestamp method will convert each period to the first date
of the period (in the example above, each quarter becomes the first
date of the quarter). Its output is a familiar datetime64 Series (i.e.,
a column of dates), that you can manipulate using date offsets or
any other pandas date methods.
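A round trip from dates to quarterly periods and back can be sketched as follows (the two dates are made up for illustration):

```python
import pandas as pd

deadlines = pd.Series(pd.to_datetime(["2020-10-24", "2020-02-03"]))

periods = deadlines.dt.to_period(freq="Q")

# Each quarter becomes the first date of that quarter.
first_days = periods.dt.to_timestamp()
print(first_days.tolist())
```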
Overthinking: Timezones
Excel can’t store timezones with any of its dates: all dates in your
spreadsheets are in the same timezone as your computer. You
can, of course, have timezone information as a separate column
in a spreadsheet, but if you have lots of different timezones flying
around, things get tricky fast.
us_pacific_deadlines
The data type associated with the Series above now shows you
what timezone dates are in (i.e., datetime64[ns, US/Pacific]). Note,
however, that this doesn’t shift the original dates (or times) to the
new timezone in any way; it merely assigns them a timezone. The
dates are the same as before, but with an extra label that says
they’re in the 'US/Pacific' timezone.
In [53]: us_pacific_deadlines.dt.tz_convert('Europe/Berlin')
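The difference between assigning a timezone and converting to another one can be sketched with a single made-up deadline, assuming tz_localize is the method used to assign the timezone:

```python
import pandas as pd

deadlines = pd.Series(pd.to_datetime(["2020-02-03 12:00"]))

# tz_localize attaches a timezone label; the clock time stays at 12:00.
us_pacific = deadlines.dt.tz_localize("US/Pacific")

# tz_convert shifts the clock time to another timezone; it is still
# the same moment in time.
berlin = us_pacific.dt.tz_convert("Europe/Berlin")

print(us_pacific.iloc[0], berlin.iloc[0])
```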
Summary
Applying custom functions 23
You’ve already seen by now that pandas has a lot of functions in
its toolkit — for working with strings or dates, for complex table
filters, for reading and writing Excel files, etc. However, even a
toolkit as diverse as this can’t have everything you need: you will
often have to write your own data transformation functions to
get the results you want. Luckily, pandas makes it straightforward
to use custom functions on the rows or columns of a DataFrame
through the apply method.
This chapter explores how you can use the apply method in its
various configurations — let’s start by applying a custom function
on a single DataFrame column.
You might remember from chapter 8 that you can create custom
functions in Python using the def keyword:1
In [1]: def process_channel(channel):
            return 'Name: ' + channel.upper()
1: Remember that you need to indent the code that is in the function
block. You also need to put a colon (i.e., :) after the function name
and the parentheses.
The process_channel function above is just a block of code with a
name. You can use this name to call the function and run its code
as many times as you need to:
In [2]: process_channel('Shoppe.com')
In [3]: process_channel('Bullseye')
The value you pass between the parentheses when calling the
function is called the function argument. This value gets bound
to the channel2 variable inside the function body. This value is
used throughout the function code and any result gets returned as
the function output. The process_channel function above is fairly
basic, but you can craft any custom function using the Python
building blocks we covered in part 1 of the book.
2: channel is just a name I chose, you can use any variable or
parameter name inside the function definition, as long as it is a
valid Python name (i.e., it starts with a letter and doesn’t contain
spaces).
If you wanted to call this function for each value in the 'Channel'
column, you could use a loop — something similar to:
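A minimal sketch of such a loop, using a made-up Series in place of the 'Channel' column (the processed list name is an assumption):

```python
import pandas as pd

def process_channel(channel):
    return 'Name: ' + channel.upper()

channels = pd.Series(['Shoppe.com', 'Bullseye'])

# Call the function once per value and collect the results in a list.
processed = []
for channel in channels:
    processed.append(process_channel(channel))

print(processed)
```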
However, pandas makes it much easier to get the same result with-
out a loop — you can call the process_channel function for each
value in a DataFrame column at once using the apply method:
In [5]: ledger_df['Channel'].apply(process_channel)
Notice in the example above that just the function name is passed to
apply, without any parentheses after it (i.e., apply’s argument is
process_channel, not process_channel()). This is because pandas
handles calling the function for you; you just need to tell it which
function to call.
You can apply custom functions this way to any DataFrame column,
regardless of its data type. However, when defining a custom
function, you need to be mindful of the values you want to apply it
to. In this example, I knew I wanted to apply the process_channel
function to a column of strings. As such, I knew that each value
passed to the function will be a Python string, and that Python
strings have an upper method. If you try to apply process_channel
on the 'AccountNo' column, you’ll get an error because values in
that column are not strings but integers, and process_channel is
designed to work only with strings:
In [6]: ledger_df['AccountNo'].apply(process_channel)
pythonforaccounting.com/chapter23
Even if you are mindful of the data you apply your custom function
on, you’ll still get an error if the data contains NaN values. Consider
this example:
In [7]: def process_product(product):
            return 'Product: ' + product.upper()

        ledger_df['Product Name'].apply(process_product)
To fix process_product, you can make3 it handle NaNs explicitly:
In [8]: def process_product(product):
            if pd.isna(product):
                return 'EMPTY PRODUCT NAME'
            else:
                return 'Product: ' + product.upper()
3: Or you can fill in missing values in the 'Product Name' column with
a dummy string, using fillna, and then apply the function.
Notice that all missing values in the 'Product Name' column have
been replaced with EMPTY PRODUCT NAME.
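The behavior is easy to confirm on a two-value Series (the sample values are made up for illustration):

```python
import numpy as np
import pandas as pd

def process_product(product):
    # pd.isna is True for NaN, None and other missing-value markers.
    if pd.isna(product):
        return 'EMPTY PRODUCT NAME'
    else:
        return 'Product: ' + product.upper()

products = pd.Series(['Nikon Camera', np.nan])
processed = products.apply(process_product)
print(processed.tolist())
```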
You saw in the previous chapters that there are a lot of string and
date methods built into pandas. Many of those methods can be
written as custom functions and applied to your data’s string or
date columns. There are many roads through the pandas woods
and which one you choose is up to you. However, keep in mind
that pandas’s built-in methods, such as the string or date meth-
ods I mentioned in the previous chapters, typically skip missing
values for you (i.e., they return NaN whenever they get a NaN as
input), whereas with your own functions, you have to handle NaNs
yourself.
The two functions you defined in the previous section can only be
applied to one column at a time. But what if you need to use values
from several columns in your custom function? For instance, you
might want to calculate a different tax amount for each sale item,
depending on what channel it occurred in — which means you
need both 'Total' and 'Channel' values in your function. In that
case, you can define a custom function and apply it to each row in
your DataFrame, instead of applying it to a column.
To figure out how you can do that, let’s start by selecting the first
row in ledger_df (using iloc) and assigning it to a variable:
In [12]: first_row = ledger_df.iloc[0]
first_row
In [14]: first_row['Channel']
In [17]: calculate_tax(ledger_df.iloc[10])
Out [17]: 0
In [18]: calculate_tax(ledger_df.iloc[100])
You could loop through each row in ledger_df and call this function
yourself (e.g., using a for loop) but it’s much easier to use the
apply method. To apply calculate_tax on each row in ledger_df,
you can use:
In [19]: ledger_df.apply(calculate_tax, axis='columns')
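To show the mechanics of a row-wise apply end to end, here is a sketch with a hypothetical calculate_tax (the 16% rate and the rule of taxing only 'Shoppe.com' sales are assumptions, not necessarily the rule used in this chapter):

```python
import pandas as pd

def calculate_tax(row):
    # Hypothetical rule: 16% tax on Shoppe.com sales, none elsewhere.
    if row['Channel'] == 'Shoppe.com':
        return row['Total'] * (16 / 100)
    return 0

df = pd.DataFrame({
    'Channel': ['Shoppe.com', 'Walcart'],
    'Total': [100.0, 50.0],
})

# axis='columns' passes each row to the function as a Series, so the
# function can read several columns of that row by name.
taxes = df.apply(calculate_tax, axis='columns')
print(taxes)
```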
You can make your functions more flexible (and more general)
by adding more parameters to their definition. For instance, to
add channel-specific tax levels as a separate parameter to
calculate_tax, you can use:
This is clearly not the same result as before. The output is all zeros
because we didn’t pass a dictionary mapping channel names to
tax rates to calculate_tax. To make our function work as before,
Now the output is the same as before. The benefit of this version of
calculate_tax is that you can easily change or add new tax rates
when applying it, without changing its code at all. For instance, if
you want to add different tax rates for all channels in our data, you
can use:
In [23]: ledger_df.apply(
             calculate_tax,
             levels={
                 'Shoppe.com': (16 / 100),
                 'iBay.com': (11 / 100),
                 'Understock.com': (9 / 100),
                 'Bullseye': (6 / 100),
                 'Walcart': (4 / 100),
             },
             axis='columns'
         )
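A self-contained sketch of this pattern follows. The calculate_tax body below is my own guess at a function with a levels parameter (defaulting to no tax when no levels are given); the key point is that apply forwards extra keyword arguments to the function:

```python
import pandas as pd

def calculate_tax(row, levels=None):
    # Hypothetical version: look up the channel's rate, defaulting to 0.
    levels = levels or {}
    rate = levels.get(row['Channel'], 0)
    return row['Total'] * rate

df = pd.DataFrame({
    'Channel': ['Shoppe.com', 'Walcart'],
    'Total': [100.0, 50.0],
})

# Keyword arguments that apply doesn't recognize are passed on to the
# function for every row.
taxes = df.apply(
    calculate_tax,
    levels={'Shoppe.com': (16 / 100), 'Walcart': (4 / 100)},
    axis='columns',
)

# Without levels, every rate falls back to 0.
no_levels = df.apply(calculate_tax, axis='columns')
print(taxes.tolist(), no_levels.tolist())
```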
Summary
Project: Mining product reviews 24
Free-form1 text is standard in data. Even the QuickBooks general
ledger you saw in chapter 20 had a Memo column with comments
about each record in the ledger. This project chapter shows you
how to use the pandas tools you’ve been reading about in the last
few chapters to extract insights from a product reviews dataset.
1: As opposed to structured text, like product IDs or sales channel names.
The reviews dataset you’ll use for this project is in the “project_data”
folder. The dataset is a CSV file, so you’ll need pandas’s read_csv
function instead of read_excel to load the data into a DataFrame:
In [2]: reviews_df = pd.read_csv('project_data/reviews.csv')
reviews_df
You can call the info method on reviews_df to see what data type
each column is and if there are any missing values:
In [3]: reviews_df.info()
There are three parts to this project: first, you’ll need to find the
most recent product reviews. There are a lot of reviews in the table
above, and many of them are outdated. Second, you’ll need to
process each review and turn it into a list of descriptive words.
Last, you’ll count review words to figure out what makes people
like or dislike the products they purchase.
pythonforaccounting.com/chapter24
You can use several of pandas’s date methods to check how spread
out review dates are. One option is to extract the year from each date,
then count how many times each year appears in the data:
In [7]: reviews_df['Date'].dt.year.value_counts().sort_index()
There are many outdated reviews — the earliest ones are from 2006,
which wouldn’t tell you much about buyers’ current preferences.
Let’s filter reviews_df and keep reviews from 2020 or later only:
In [8]: reviews_df = reviews_df[reviews_df['Date'].dt.year >= 2020]
reviews_df
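Both date steps above assume that the 'Date' column already holds datetime values, a conversion that is not shown in this part of the chapter. As a rough sketch, assuming the dates were read from the CSV file as text, the whole sequence could look like this (the sample rows and the recent_df name are made up):

```python
import pandas as pd

# Made-up rows standing in for reviews_df.
reviews_df = pd.DataFrame({
    'Date': ['2006-03-14', '2020-01-05', '2020-07-19', '2021-02-02'],
    'Rating': [2, 5, 4, 1],
})

# read_csv keeps dates as text unless told otherwise, so convert them first.
reviews_df['Date'] = pd.to_datetime(reviews_df['Date'])

# Count reviews per year, ordered by year rather than by count.
reviews_per_year = reviews_df['Date'].dt.year.value_counts().sort_index()

# Keep only the reviews from 2020 or later.
recent_df = reviews_df[reviews_df['Date'].dt.year >= 2020]
```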
There are plenty of reviews left to work with; let’s now use pandas
to dig into what people are saying.
As you may have noticed, there’s a lot of variation in the way people
write reviews (i.e., what words they use, how long their reviews
are, how they use punctuation, letter case, etc.). Most free-form
text is like this; therefore, the first thing you need to do before
analyzing text is to make it more uniform. Making text uniform
typically involves the following steps:
There are other steps you can take when preparing text (e.g.,
extracting the stem of each word so that “loving it” and “love it”
become the same string), but for now, let’s apply the steps above
to the reviews data using pandas and Python.
string.punctuation
return review
If you call this function and pass a string value as its argument,
you’ll get the same string back but without any punctuation
characters:
In [12]: remove_punctuation('Great quality!!! Much nicer than expected -- but expensive.')
Out [12]: 'Great quality much nicer than expected but expensive'
You can now apply this custom function to all reviews with:
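Only fragments of the remove_punctuation definition (string.punctuation and return review) are visible above. A minimal sketch consistent with those fragments could look like the following; the loop body and the sample reviews are assumptions, and the exact spacing and letter case of the result depend on how the function is actually written.

```python
import string

import pandas as pd

def remove_punctuation(review):
    # Assumed implementation: drop every character listed in string.punctuation.
    for char in string.punctuation:
        review = review.replace(char, '')
    return review

# Made-up reviews standing in for the 'Review' column.
reviews_df = pd.DataFrame({
    'Review': ['Works well, no complaints!', 'Broke after a week...'],
})

# Series.apply calls the function once for each review.
reviews_df['Review'] = reviews_df['Review'].apply(remove_punctuation)
```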
Next, you need to split each review into a list of words. You can do
that with pandas’s str.split method; the code below assigns
the list of words to a new column in reviews_df:
In [14]: reviews_df['Review Words'] = reviews_df['Review'].str.split()
The reviews seem cleaner now. However, you still have a lot of
words that don’t carry meaning (e.g., “it”, “but”, “so”, etc.). In
natural language processing, these words are called stopwords —
they’re noise, not information. Just like you used a list of characters
to remove punctuation, you can use a list of stopwords to remove
unhelpful words from the 'Review Words' column. I’ve added a
stopwords dataset in the “project_data” folder so you don’t have to
create one yourself — to load it into a Python list, you can run:
In [16]: stopwords = pd.read_csv('project_data/stopwords.csv').squeeze('columns')
         stopwords = list(stopwords)
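With the stopwords in a list, the next step is to drop them from each list in 'Review Words'. One way to sketch it, with a tiny hand-written stopword list and made-up reviews standing in for the project files, is a custom function applied to each list (the function name remove_stopwords is hypothetical):

```python
import pandas as pd

# Tiny stand-ins for the stopwords file and the reviews table.
stopwords = ['it', 'but', 'so', 'the', 'is']
reviews_df = pd.DataFrame({
    'Review Words': [['the', 'coffee', 'is', 'great'], ['cheap', 'but', 'flimsy']],
})

def remove_stopwords(words):
    # Keep only the words that are not in the stopword list.
    return [word for word in words if word not in stopwords]

reviews_df['Review Words'] = reviews_df['Review Words'].apply(remove_stopwords)
```

For long stopword lists, converting the list to a Python set first makes the membership test faster.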
Now that you have a clean list of informative words for each review,
let’s see why people like or dislike the products they bought by
counting words.
Counting words
Counting words to figure out what text is about may seem simple,
but it’s the most common technique used to unravel human
language with computers; even advanced natural language
processing algorithms count words one way or another.
Before you start counting words, let’s first break the reviews
dataset into two separate DataFrame variables: one with the most
positive reviews and another with the most negative. It’s easier
to understand why someone liked a product when you’re sure
they liked it (based on the product rating). Luckily, for reviews_df
you can use values in the 'Rating' column to determine reviewers’
sentiment about the product they bought, and create the two
DataFrame variables:
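How you split the reviews depends on the rating scale. As a sketch, assuming ratings run from 1 to 5 and that only the two extremes count as clearly negative or clearly positive (both are assumptions, as are the sample rows), the two variables could be created like this:

```python
import pandas as pd

# Made-up rows standing in for reviews_df.
reviews_df = pd.DataFrame({
    'Rating': [1, 5, 3, 5, 1],
    'Review Words': [['junk'], ['great'], ['ok'], ['love'], ['poor']],
})

# Assumed thresholds: 1 star is clearly negative, 5 stars clearly positive.
negative_reviews_df = reviews_df[reviews_df['Rating'] == 1]
positive_reviews_df = reviews_df[reviews_df['Rating'] == 5]
```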
To count review words, let’s create two Series variables with all the
words used for the negative and positive reviews, respectively:
In [23]: negative_words = pd.Series(negative_reviews_df['Review Words'].sum())
positive_words = pd.Series(positive_reviews_df['Review Words'].sum())
You can now use these variables to count how often different words
appear in the product reviews:
In [24]: negative_words.value_counts()
In [25]: positive_words.value_counts()
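Summing a column of lists joins them into one long list, which is what the code above relies on, but it can get slow on large tables. A possible alternative, shown here on made-up data, is the explode method, which puts every list item on its own row:

```python
import pandas as pd

# Made-up word lists standing in for the 'Review Words' column.
positive_reviews_df = pd.DataFrame({
    'Review Words': [['great', 'coffee'], ['great', 'price'], ['love', 'coffee']],
})

# explode turns each list element into its own row; value_counts then tallies them.
positive_words = positive_reviews_df['Review Words'].explode()
positive_counts = positive_words.value_counts()
```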
It’s not surprising to see words like money, junk, and poor used
frequently with negative reviews, and words like great, love, and
good used for positive reviews. However, you can use this simple
count to dive deeper into the reviews and find the ones that speak
about specific product features or qualities.
For instance, you can try to find out what’s so 'great' about the
products that people review. To do that, you can use a simple
regular expression and extract all words that follow 'great' in
positive reviews, with the following pandas code:
In [26]: positive_pattern = '(great .*)'
         positive_reviews_df['Review'].str.extract(positive_pattern).value_counts()
The same idea works for negative reviews, where you can see why
people think products are 'poor':
In [27]: negative_pattern = '(poor .*)'
negative_reviews_df['Review'].str.extract(negative_pattern).value_counts()
You can dig deeper into the reviews and design more complicated
regular expressions that match different product features or
qualities (e.g., '(great coffee .*)').
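To see what such a pattern captures, here is a small sketch on made-up reviews. The parentheses mark the part of the match to keep, and .* matches the rest of the line. Reviews that don't contain the word get a missing value, which value_counts leaves out by default.

```python
import pandas as pd

# Made-up reviews standing in for the 'Review' column.
reviews = pd.Series([
    'great coffee every morning',
    'arrived late but great coffee every morning',
    'nothing special',
])

# expand=False returns a Series instead of a one-column DataFrame.
matches = reviews.str.extract('(great .*)', expand=False)
counts = matches.value_counts()
```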
Summary
This quick project chapter showed you how to use pandas’s date
and string functions to analyze product reviews.³ Now, let’s get
back to uncovering more of pandas’s features and see how you can
use it to combine, merge, and pivot tables.
3: The nltk and spaCy Python libraries are worth exploring if you work with text often.
Concatenating tables 25
Working with a single file or dataset can only get you so far. To make
the most of your data, you will often need to combine multiple
spreadsheets or Excel files into a single DataFrame and leverage
connections between datasets.
The two main tools for combining tables in pandas are its concat
(short for concatenate) and merge functions. While concat is the
pandas equivalent of copy-pasting rows (or columns) from one
sheet to another, merge is similar to Excel’s VLOOKUP function. This
chapter shows you how to combine multiple tables into a single
DataFrame using the concat function. The following chapter takes
a look at joining tables on their common columns with pandas’s
merge function.
Row-wise concatenation
You can use the concat function for row-wise or column-wise
concatenation of DataFrame or Series objects. Row-wise concatenation
is the pandas equivalent of copying rows from one spreadsheet
and pasting them at the bottom of another.
For the code examples in this section, let’s read a small subset of our
sales data into several DataFrame variables and then concatenate
them using concat. To read each sheet of “Q1Sales.xlsx” as a
separate DataFrame, you can use:
In [1]: import pandas as pd

        cols = ['ProductID', 'Quantity', 'Total']

        jan_df = pd.read_excel('Q1Sales.xlsx', sheet_name='January', usecols=cols, nrows=5)
        feb_df = pd.read_excel('Q1Sales.xlsx', sheet_name='February', usecols=cols, nrows=5)
        mar_df = pd.read_excel('Q1Sales.xlsx', sheet_name='March', usecols=cols, nrows=5)
After you run the code above, you should have three DataFrame
variables in your notebook, each with five rows and three columns:
In [2]: jan_df
In [3]: feb_df
In [4]: mar_df
Notice that row labels are preserved from the original DataFrame
objects (i.e., row labels in the output above are not consecutive). If
you need to make row labels consecutive and discard the original
labels, you can use the ignore_index=True keyword argument:
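The effect of ignore_index is easiest to see on small tables. A minimal sketch, using made-up stand-ins for two of the monthly tables:

```python
import pandas as pd

# Made-up stand-ins for two of the monthly tables.
jan_df = pd.DataFrame({'ProductID': ['A1', 'B2'], 'Quantity': [1, 2], 'Total': [10.0, 20.0]})
feb_df = pd.DataFrame({'ProductID': ['C3', 'D4'], 'Quantity': [3, 4], 'Total': [30.0, 40.0]})

# Original row labels are kept, so 0 and 1 each appear twice.
stacked = pd.concat([jan_df, feb_df])

# ignore_index=True discards them and numbers the rows from 0 upwards.
renumbered = pd.concat([jan_df, feb_df], ignore_index=True)
```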
pythonforaccounting.com/chapter25
This DataFrame looks slightly different from all the other ones
you’ve seen so far: each row has two associated labels (notice
the two left-most columns above). These row labels are stored in
a MultiIndex — you can assign the DataFrame above to another
variable and inspect its index to see what it looks like:
In [8]: df = pd.concat({'Jan': jan_df, 'Feb': feb_df, 'Mar': mar_df})
df.index
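As a small sketch of what the two-level labels look like and how they can be used, here is the same dictionary-based concat on made-up tables. Selecting by the outer label with loc is one common use of a MultiIndex, shown here as an illustration rather than as the chapter's own example.

```python
import pandas as pd

# Made-up stand-ins for two of the monthly tables.
jan_df = pd.DataFrame({'ProductID': ['A1', 'B2'], 'Total': [10.0, 20.0]})
feb_df = pd.DataFrame({'ProductID': ['C3', 'D4'], 'Total': [30.0, 40.0]})

# Dictionary keys become the outer level of the row labels.
df = pd.concat({'Jan': jan_df, 'Feb': feb_df})

# Each row label is now a (key, original label) pair.
labels = df.index.tolist()

# Selecting by the outer key returns that month's rows with their original labels.
feb_rows = df.loc['Feb']
```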
Going back to concat, notice that all the previous examples combined
DataFrame objects with the same set of columns (i.e., each
one of the three DataFrame objects had the same columns). If you
concatenate DataFrame objects that have different columns, the
resulting DataFrame will have a union of columns from each separate
DataFrame. For example, you can select a subset of columns from
two of the DataFrame variables you created earlier and concatenate
them with:
In [11]: pd.concat([
             jan_df[['ProductID', 'Quantity']],
             feb_df[['ProductID', 'Total']]
         ])
Now the output of concat contains just the one column that appears
in each of the input DataFrame variables (i.e., the 'ProductID'
column).
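The difference between the two behaviors can be sketched on one-row tables. The data below is made up; join='outer' is the concat default and join='inner' keeps only the shared columns:

```python
import pandas as pd

# Made-up one-row stand-ins for the monthly tables.
jan_df = pd.DataFrame({'ProductID': ['A1'], 'Quantity': [1], 'Total': [10.0]})
feb_df = pd.DataFrame({'ProductID': ['B2'], 'Quantity': [2], 'Total': [20.0]})

left = jan_df[['ProductID', 'Quantity']]
right = feb_df[['ProductID', 'Total']]

# Default (join='outer'): union of columns, with NaN where a table lacks a column.
union_df = pd.concat([left, right])

# join='inner': only the columns present in every input.
common_df = pd.concat([left, right], join='inner')
```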
Column-wise concatenation
[5 rows x 9 columns]
The result now is a DataFrame with nine columns (and with triplicate
column names). While this is not a particularly useful table, in
some cases column-wise concatenation might be the solution you
need, which is why I’m mentioning it here.
The result above contains a union of all row labels and NaN values
where there is a mismatch. The only row label that appears in both
input DataFrame variables is 2 — which is why the middle row has
values in all columns. As with row-wise concatenation, you can
use the join='inner' keyword argument to keep only the row
labels that are common to all inputs:
In [15]: pd.concat([jan_df.head(3), mar_df.tail(3)], axis='columns', join='inner')
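Column-wise concatenation aligns rows on their labels, which the following sketch makes visible with two made-up single-column tables:

```python
import pandas as pd

# Made-up single-column tables with row labels 0 to 4.
jan_df = pd.DataFrame({'Quantity': [1, 2, 3, 4, 5]})
mar_df = pd.DataFrame({'Total': [10.0, 20.0, 30.0, 40.0, 50.0]})

# head(3) has labels 0-2 and tail(3) has labels 2-4, so only label 2 is shared.
outer_df = pd.concat([jan_df.head(3), mar_df.tail(3)], axis='columns')

# With join='inner', only the shared label survives.
inner_df = pd.concat([jan_df.head(3), mar_df.tail(3)], axis='columns', join='inner')
```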
Instead of using concat, you can also use the append method to
add rows or columns to an existing DataFrame. For example, to
add the rows from feb_df to jan_df, you can use:
In [16]: jan_df.append(feb_df)
The append method does not modify the calling DataFrame in place²
but rather returns a new DataFrame with a concatenation of
rows; it is just a shortcut for concat. Note that append was deprecated
in pandas 1.4 and removed in pandas 2.0, so with recent versions of
pandas you need to call concat directly.
2: Unlike the append method available on Python lists.
If you want to update jan_df with rows from feb_df, you need to
assign the output of append back to jan_df. Or if you want to add
rows from both feb_df and mar_df to jan_df, you can use:
You can create rows using Python dictionaries (with column names
as keys) and pass them in a list to append — pandas will transform
the dictionaries into Series objects (i.e., DataFrame rows). Note that
if you use append this way, you must specify ignore_index=True.
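DataFrame.append is not available in pandas 2.0 and later, so there the same results have to come from concat. A minimal sketch with made-up rows (the q1_df and new_rows names are hypothetical):

```python
import pandas as pd

# Made-up one-row stand-ins for the monthly tables.
jan_df = pd.DataFrame({'ProductID': ['A1'], 'Quantity': [1]})
feb_df = pd.DataFrame({'ProductID': ['B2'], 'Quantity': [2]})
mar_df = pd.DataFrame({'ProductID': ['C3'], 'Quantity': [3]})

# Equivalent of jan_df.append([feb_df, mar_df]); assign the result to keep it.
q1_df = pd.concat([jan_df, feb_df, mar_df])

# Rows written as dictionaries (column names as keys) become a DataFrame first.
new_rows = pd.DataFrame([
    {'ProductID': 'D4', 'Quantity': 4},
    {'ProductID': 'E5', 'Quantity': 5},
])
q1_df = pd.concat([q1_df, new_rows], ignore_index=True)
```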
Summary
Joining tables 26
When working with multiple related datasets, you will often want
to combine them into a single DataFrame, based on the values in
one or more of the columns they have in common. Consider the
two tables below:
Table joins can be puzzling at first. If you get confused at any point
in this chapter, it’s because you’re trying to understand something
complicated. The mental workout is, I think, worth it: once joins
click, you’ll see how powerful and straightforward they are.
The next section briefly introduces how table joins work in general;
you’ll see how to use pandas’s merge function right after.
2: If you have used SQL before, the merge function is the pandas equivalent
to SQL’s JOIN operator. Even though the pandas function is called merge, I use
the term “join” (instead of “merge”) because it is commonly used in other
domains for this type of table operation (e.g., it’s used in SQL).
Table joins take as input two tables with one or more columns in
common (let’s call them the left and right table) and produce a
single output table. The way joins work changes slightly depending
on whether you have duplicate values in the joining column (or
columns) in each table or not. For now, let’s take the two tables
above as our first example: these tables can be joined on their
common column (i.e., 'ProductID') — and because both tables
have unique product IDs in their respective 'ProductID' columns
(i.e., there are no duplicate product IDs in either table), joining
them is an example of a one-to-one join.
After that, values from these two rows get copied as the first
row in the output table — the value they have in common (i.e.,
'ProductID') is copied just once. At the end of this first step, the
output table looks like this:
The next step in the join is identical: pandas goes to the following
product ID in the left table, finds the corresponding row on the
right, then copies a merged version of the two rows to the output
table.
And so the join continues for all remaining rows. The entire join
operation and its resulting table are shown below:
pythonforaccounting.com/chapter26
Output table:
T&G/PET-14209 11.67 5 Pete the Cat and His Four... Merry Makers
In a nutshell, this is what joins are all about: merging two tables
on the values they have in common. However, there is one detail
we still need to iron out: what happens when you have duplicate
values in the join column?
Many-to-one joins
You might need to join two tables on a common column, but one
of the tables has duplicate values in that column. Consider the
following example:
T&G/PAP-51200 39.14 6
T&G/GRE-17530 5.81 1
The table on the left has duplicate values in its 'ProductID' column,
whereas the table on the right has unique product IDs. Joining
these two tables is an example of a many-to-one join.³
3: If you swap the two tables, you get a one-to-many join, which works the same way.
As before, the first step starts with finding rows in the two tables
that share a common product ID value, beginning with the first
value in the 'ProductID' column of the left table. In this case, there
are two rows in the left table and one row in the right; these
rows are highlighted above.
As with the previous example, both left and right rows get merged
and copied in the output table. However, the difference here is that
each row on the left gets merged with its corresponding row on the
right. After the first step, the output table looks like this:
The output above has two rows because there are two rows in the
left table where the product ID is T&G/GRE-17530, and each row
on the left gets merged with its corresponding row on the right.
The next steps in the join are the same — the entire join and the
resulting table are shown below:
T&G/PAP-51200 39.14 6
T&G/GRE-17530 5.81 1
Output table:
Many-to-many joins
As with the previous cases, joining these two tables starts with
finding rows in each table that share a common product ID (e.g.,
the highlighted rows above) and merging them. After merging the
first set of rows, the output table looks like this:
Here, the first row in the left table gets merged with each of its
corresponding rows in the right table (i.e., rows three and four).
There are two more rows on the left with the same product ID (i.e.,
the third and fourth row), and they both get merged with the same
rows from the right table as well.
The following steps in the join are identical — the entire operation
is illustrated below:
Output table:
T&G/LEG-37777 6.66 1 A Fantastic Set!! And the Turtle minifig ... 4.0
T&G/LEG-37777 6.69 1 A Fantastic Set!! And the Turtle minifig ... 4.0
All the examples above illustrate how the join operation works
in general — pandas or any other tool that lets you join tables
implements them as described above. Let’s see how you make
them happen with code next.
Joins in pandas
For the code examples in this section, you’ll need to use two datasets:
the sales data from “Q1Sales.xlsx” that you’re now familiar with and
the products dataset from “products.csv” that I briefly introduced
at the beginning of part two. To read the two datasets, run:
In [1]: import pandas as pd
ledger_df = pd.read_excel('Q1Sales.xlsx')
products_df = pd.read_csv('products.csv')
- Line 1: you select the top 5 rows of the ledger DataFrame and
  assign them to a new variable called left_df.
- Line 2: you select three of the columns in left_df and reassign
  the selection to left_df.
- Line 5: from the products dataset, you select only those products
  that appear in left_df, based on product IDs, and assign those
  rows to a new variable called right_df.
- Line 6: you select three columns from right_df and assign the
  selection back to right_df.
In [3]: left_df
In [4]: right_df
The two DataFrame variables have the same data as the exam-
ple tables you saw earlier. They have a common column (i.e.,
'ProductID') with unique values in both left_df and right_df.
To join them on their common column, you can use pandas’s merge
function:
In [5]: pd.merge(left_df, right_df, on='ProductID')
The first two arguments passed to merge are the DataFrame variables
you want to join. The on keyword argument specifies what column
to use for joining the two tables.⁴ The output is a new DataFrame.⁵
The merge function maps each row in left_df to a row in right_df,
based on their corresponding 'ProductID' values (as illustrated
in the previous section). It then merges the two datasets into a
new DataFrame, which has the same columns as the left and right
tables used as input. This is a join in pandas: even though the join
operation can be complex, the pandas code to run it is simple.
4: You don’t have to specify a column, but it’s a good idea to be explicit about
the join column with the on keyword argument. If you don’t use the on
argument, the two tables are joined on all the columns they have in common
(i.e., all columns with the same name in both tables).
5: The original row labels from either left_df or right_df are not kept in
the output DataFrame.
Whether your join is one-to-one or many-to-many, you still use
pandas’s merge function the same way (i.e., as in the code example
above). The columns you use for joining and whether they have
duplicate values influence the output table, not how you use the
merge function.
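To make that concrete, the sketch below runs the same merge call on three pairs of made-up tables (the table names, product IDs, and brands are invented for illustration). Only the number of output rows changes with the kind of join:

```python
import pandas as pd

# Made-up tables: sales and reviews repeat P1, products has unique IDs.
sales = pd.DataFrame({'ProductID': ['P1', 'P1', 'P2'], 'Quantity': [1, 6, 2]})
products = pd.DataFrame({'ProductID': ['P1', 'P2'], 'Brand': ['Brand A', 'Brand B']})
reviews = pd.DataFrame({'ProductID': ['P1', 'P1', 'P2'], 'Rating': [4.0, 5.0, 3.0]})

# One-to-one: unique IDs on both sides, one output row per matching ID.
one_to_one = pd.merge(sales.drop_duplicates('ProductID'), products, on='ProductID')

# Many-to-one: every sales row is paired with its single product row.
many_to_one = pd.merge(sales, products, on='ProductID')

# Many-to-many: each P1 sale is paired with each P1 review (2 x 2), plus 1 x 1 for P2.
many_to_many = pd.merge(sales, reviews, on='ProductID')
```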
One common issue when joining tables is assuming that the joining
column values are unique (in one or both input tables) when they
aren’t. This typically leads to a many-to-many join, which creates a
lot of rows and takes a long time to run — and for large datasets, it
can freeze your computer because it uses up all of its memory.
In [7]: left_df
In [8]: right_df
The error message tells you exactly that keys are not unique (in
either dataset) and that the merge is not one-to-one. You can set
the value of validate to one of:
Joins on large tables can take a long time to run, which is why using
the validate argument is a good idea: it can stop unwanted joins
early (i.e., joins on values that you think are unique but aren’t).
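The validate check can be tried on a small made-up example. The option names used below ('one_to_one' and 'many_to_one') are two of the values pandas accepts, alongside 'one_to_many' and 'many_to_many'. The tables and the failed flag are assumptions for illustration.

```python
import pandas as pd

# Made-up tables: the left one repeats P1, the right one has unique IDs.
left_df = pd.DataFrame({'ProductID': ['P1', 'P1', 'P2'], 'Quantity': [1, 6, 2]})
right_df = pd.DataFrame({'ProductID': ['P1', 'P2'], 'Brand': ['Brand A', 'Brand B']})

# The left table repeats P1, so a one-to-one check raises an error
# and no output table is returned.
try:
    pd.merge(left_df, right_df, on='ProductID', validate='one_to_one')
    failed = False
except pd.errors.MergeError:
    failed = True

# The same join passes a many-to-one check.
checked_df = pd.merge(left_df, right_df, on='ProductID', validate='many_to_one')
```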
However, in pandas you can change this default behavior by setting
the how keyword argument to 'left', 'right', or 'outer' when using
merge. Before we look at some code that does that, let’s quickly
see what the other types of joins look like.
A left join⁶ keeps all rows from the left table in the output. However,
because not all left rows have a corresponding row in the right
table, only a subset of them get merged with values from the right
table, whereas rows that don’t have a match get filled with NaNs.
Left joining the two tables above produces the following output:
6: Sometimes called a left outer join.
A right join⁷ keeps all rows from the right table in the output. Like
before, because not all rows have a corresponding row in the left
table, only a subset of them get merged, whereas rows that don’t
have a match get filled with NaNs. Right joining the two tables
produces the following output:
T&G/LEG-60816 21.40 1.0 LEGO Star Wars Mandalorian Battle ... LEGO
7: Sometimes called a right outer join.
An outer join⁸ keeps all rows from both left and right tables
in the output. Like left and right joins, only those rows with a
corresponding row in the other table get merged, whereas missing
values get filled with NaNs. The output of an outer join on the two
example tables is shown below:
T&G/LEG-60816 21.40 1.0 LEGO Star Wars Mandalorian Battle ... LEGO
8: Sometimes called a full outer join.
In all the examples above, notice that the order of rows in the
output depends on the type of join (e.g., the output of a left join
has product IDs in the same order as the left table).
Let’s get back to pandas and see how to change the default join
behavior. First, let’s create two DataFrame variables to use as an
example:
In [10]: left_ids = ['T&G/LEG-60816', 'T&G/PLA-85805', 'T&G/DIS-51236', 'T&G/THE-82687']
right_ids = ['T&G/THO-09600', 'T&G/PLA-29969', 'T&G/LEG-60816', 'T&G/PLA-85805']
left_df = ledger_df[ledger_df['ProductID'].isin(left_ids)]
left_df = left_df[['ProductID', 'Unit Price', 'Quantity']]
right_df = products_df[products_df['ProductID'].isin(right_ids)]
right_df = right_df[['ProductID', 'Product Name', 'Brand']]
In [11]: left_df
In [12]: right_df
However, discarding rows is not always what you need. You can
use the how keyword argument to tell merge to use one of the
joining behaviors illustrated above:
In [14]: pd.merge(left_df, right_df, on='ProductID', how='left')
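To compare all four join types side by side, the sketch below runs each one on two small made-up tables and records which product IDs end up in the output. The kept_ids and left_join_df names are hypothetical.

```python
import pandas as pd

# Made-up tables that share P2 and P3 only.
left_df = pd.DataFrame({'ProductID': ['P1', 'P2', 'P3'], 'Quantity': [1, 2, 3]})
right_df = pd.DataFrame({'ProductID': ['P2', 'P3', 'P4'], 'Brand': ['B', 'C', 'D']})

# Collect the product IDs that each type of join keeps.
kept_ids = {
    how: pd.merge(left_df, right_df, on='ProductID', how=how)['ProductID'].tolist()
    for how in ['inner', 'left', 'right', 'outer']
}

# In a left join, P1 has no match on the right, so its Brand is NaN.
left_join_df = pd.merge(left_df, right_df, on='ProductID', how='left')
```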
It might seem like all the details discussed in this chapter are fairly
specific (and not that common when working with real data). You’d
be surprised how often joins — in their various configurations —
can solve real problems, which is why I think learning what they
do is worth your time.
In some cases, when using a left, right, or outer join, you might
want to know whether a given row comes from both tables, only the
left table, or only the right table. You can use the indicator keyword
argument with merge to add a new column in the output table that
tells you where each row is originally from:
In [17]: pd.merge(left_df, right_df, on='ProductID', how='outer', indicator='Source')
Out [17]: ProductID Unit Price Quantity Product Name Brand Source
0 T&G/THE-82687 5.36 6.0 NaN NaN left_only
1 T&G/PLA-85805 3.31 1.0 Playskool Mrs. Potat... Mr Potato Head both
2 T&G/DIS-51236 14.47 12.0 NaN NaN left_only
3 T&G/LEG-60816 21.40 1.0 LEGO Star Wars Manda... LEGO both
4 T&G/THO-09600 NaN NaN Thomas the Train: My... Fisher-Price right_only
5 T&G/PLA-29969 NaN NaN Plan Toy Pull-Along ... Plan Toys right_only
As with the validate keyword argument, the indicator column⁹
can help you double-check that the assumptions you had before the
merge are correct. For instance, you can assign the output of merge
above to a new variable and check how many rows originated from
each of the two input variables:
In [18]: merged_df = pd.merge(left_df, right_df, on='ProductID', how='outer', indicator='Source')
         merged_df['Source'].value_counts()
9: You can pass any string value to the indicator argument, which will
then be used as the indicator column name. I used 'Source' here, but any
string value is valid.
By design, our two tables had two product IDs that appeared in
both, two product IDs that appeared only in left_df, and two that
appeared only in right_df — the merge output seems correct.
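The indicator column can also be used as a filter. As a sketch on made-up tables, the rows that exist only in the left table (for example, sales without product details) can be selected like this:

```python
import pandas as pd

# Made-up tables that share P2 and P3 only.
left_df = pd.DataFrame({'ProductID': ['P1', 'P2', 'P3'], 'Quantity': [1, 2, 3]})
right_df = pd.DataFrame({'ProductID': ['P2', 'P3', 'P4'], 'Brand': ['B', 'C', 'D']})

merged_df = pd.merge(left_df, right_df, on='ProductID', how='outer', indicator='Source')

# Rows that exist only in the left table.
unmatched_df = merged_df[merged_df['Source'] == 'left_only']
```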
In [20]: left_df
In [21]: right_df
right_df
You get the same DataFrame as before, but with different column
names. If you want to run the join above, you can rename the
columns back to their original names (which wouldn’t really help
with this example), or you can specify the left and right columns
to use for the join separately, using:
In [24]: pd.merge(
             left_df, right_df,
             left_on=['ProductID', 'Channel'],
             right_on=['productid', 'channel']
         )
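When the key columns have different names, merge keeps both sets of keys in the output. A minimal sketch on made-up tables (the 'Deadline' column is invented for illustration) shows the join and one way to drop the duplicated keys afterwards:

```python
import pandas as pd

# Made-up tables whose key columns differ only in letter case.
left_df = pd.DataFrame({
    'ProductID': ['P1', 'P2'], 'Channel': ['Walcart', 'Bullseye'], 'Quantity': [1, 2],
})
right_df = pd.DataFrame({
    'productid': ['P1', 'P2'], 'channel': ['Walcart', 'Bullseye'], 'Deadline': ['Mon', 'Tue'],
})

joined_df = pd.merge(
    left_df, right_df,
    left_on=['ProductID', 'Channel'],
    right_on=['productid', 'channel'],
)

# Both sets of key columns are kept, so drop the copies from the right table.
joined_df = joined_df.drop(columns=['productid', 'channel'])
```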
Summary
It might seem like joins are specialized tools, not that useful when
dealing with accounting data. However, you’d be surprised how
often joins can solve real-world problems. One problem they can
solve is filling in missing product names in ledger_df, which is
what we look at in the following project chapter.
Project: Filling missing product
names in the sales data 27
The sales data you’ve been working with have a specific problem
that table joins can help with: the 'Product Name' column has a lot
of missing values. This quick project chapter shows you how to fill
in missing product names in “Q1Sales.xlsx” with values from the
“products.csv” file.
To get this project started, open the project notebook (which should
be in your Python for Accounting workspace) and import pandas.
Load the sales and products data by running:
In [1]: import pandas as pd 1
2
sales_df = pd.concat(pd.read_excel('Q1Sales.xlsx', sheet_name=None), ignore_index=True) 3
products_df = pd.read_csv('products.csv') 4
Line 3 above may seem strange, but it uses pandas’s concat and
read_excel functions, both of which you’ve seen before. Passing
sheet_name=None to read_excel makes the function read data from
all sheets in an Excel file. However, when you use sheet_name=None,
read_excel no longer returns a single DataFrame, but a Python
dictionary mapping sheet names to DataFrame objects. If you run
the function in a separate cell, you’ll see this dictionary:
In [2]: pd.read_excel('Q1Sales.xlsx', sheet_name=None)
Out [2]: {'January': InvoiceNo Channel Product Name ... Unit Price Quantity
Total
0 1532 Shoppe.com Cannon Water Bom... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtl... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age... ... 13.46 6 80.76
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Gi... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/... ... 4.18 1 4.18
14052 15584 iBay.com NaN ... 4.78 25 119.50
14053 15585 Understock.com Sirius Satellite... ... 33.16 2 66.32
...
'March': InvoiceNo Channel Product Name ... Unit Price Quantity
Total
0 29486 Walcart Vic Firth Americ... ... 22.39 6 134.34
1 29487 Walcart Archives Spiral ... ... 25.65 1 25.65
2 29488 Bullseye AKG WMS40 Mini D... ... 8.98 2 17.96
3 29489 Shoppe.com LE Blue Case for... ... 8.33 9 74.97
4 29490 Understock.com STARFISH Cookie ... ... 17.96 80 1436.80
... ... ... ... ... ... ... ...
9749 39235 iBay.com Nature's Bounty ... ... 5.55 2 11.10
The concat function works with dictionaries (like the one above)
as well as lists of DataFrame objects. If you pass the output of
read_excel above to concat, you get a single DataFrame with all the
data in your Excel file. The ignore_index keyword argument makes
concat discard row labels so that labels in the output DataFrame
are consecutive numbers. You’ll often need to read all sheets in an
Excel file and keep their data in a single DataFrame; the shortest
way to do that is with code like the one above.
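The same pattern can be tried without an Excel file by building the dictionary by hand. In the sketch below, the sheets variable stands in for the output of read_excel with sheet_name=None, and the second part shows one possible way to keep the sheet name as a column (the 'Sheet' and 'Row' level names are assumptions):

```python
import pandas as pd

# Stand-in for pd.read_excel('Q1Sales.xlsx', sheet_name=None): sheet names mapped to tables.
sheets = {
    'January': pd.DataFrame({'InvoiceNo': [1532, 1533], 'Total': [281.54, 6.70]}),
    'February': pd.DataFrame({'InvoiceNo': [15581], 'Total': [229.76]}),
}

# One table with consecutive labels; the sheet names are discarded.
sales_df = pd.concat(sheets, ignore_index=True)

# Alternative that keeps the sheet name as a regular column.
with_sheet_df = (
    pd.concat(sheets, names=['Sheet', 'Row'])
    .reset_index(level='Sheet')
    .reset_index(drop=True)
)
```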
You can now check both DataFrame variables for missing values:
In [3]: sales_df.isna().sum()
The products dataset doesn’t have any missing values, but that
doesn’t mean all the products in the sales data appear in the
products data. Let’s quickly check if all product IDs in sales_df
have a corresponding ID in products_df:
pythonforaccounting.com/chapter27
In [6]: sales_df['ProductID'].isin(products_df['ProductID'])
The code above runs the product ID check, but it returns a long
Series, with a boolean value for each row in sales_df. To quickly
check whether all values in this Series are True, you can use the
all Series method:
In [7]: sales_df['ProductID'].isin(products_df['ProductID']).all()
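If the check ever comes back False, the same boolean Series can tell you which IDs are missing. A small sketch on made-up tables, where one sales row deliberately has no matching product:

```python
import pandas as pd

# Made-up tables: P9 appears in sales but not in products.
sales_df = pd.DataFrame({'ProductID': ['P1', 'P2', 'P2', 'P9']})
products_df = pd.DataFrame({'ProductID': ['P1', 'P2', 'P3']})

has_match = sales_df['ProductID'].isin(products_df['ProductID'])

# all() is True only if every sales row has a matching product.
all_matched = has_match.all()

# ~ flips the boolean values, selecting the IDs without a match.
missing_ids = sales_df.loc[~has_match, 'ProductID'].unique().tolist()
```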
This tells you that all product IDs in sales_df have a corresponding
ID in products_df. Now you can go ahead with joining the two
datasets:
In [8]: pd.merge(
sales_df,
products_df,
on='ProductID',
suffixes=['-Sales', '-Products'],
validate='many_to_one'
)
Out [8]: InvoiceNo Channel Product Name-Sales ... Total Product Name-Products
0 1532 Shoppe.com Cannon Water Bom... ... 281.54 Cannon Water Bom...
1 1949 Walcart NaN ... 60.72 Cannon Water Bom...
2 5401 Understock.com Cannon Water Bom... ... 101.00 Cannon Water Bom...
3 8601 Understock.com Cannon Water Bom... ... 161.60 Cannon Water Bom...
4 9860 Understock.com Cannon Water Bom... ... 101.00 Cannon Water Bom...
... ... ... ... ... ... ...
37703 38956 Understock.com New Waterproof S... ... 42.89 New Waterproof S...
37704 39053 Understock.com Violinsmart 3/4 ... ... 52.12 Violinsmart 3/4 ...
37705 39030 Understock.com Violinsmart 3/4 ... ... 52.12 Violinsmart 3/4 ...
37706 39038 Understock.com Violinsmart 3/4 ... ... 52.12 Violinsmart 3/4 ...
37707 39045 Understock.com Violinsmart 3/4 ... ... 52.12 Violinsmart 3/4 ...
The example above uses pandas’s merge function to join the two
datasets on their common 'ProductID' column. Because both tables
have a 'Product Name' column that isn’t used as a joining column
in the merge operation, their join will have two 'Product Name'
columns (one from the left table, another from the right one). The
suffixes keyword argument above tells pandas to add a different
suffix to each 'Product Name' column and make the joined table
easier to understand. If you don’t pass custom suffixes as I did above,
pandas uses '_x' and '_y' as default suffixes ('Product Name-Sales'
seems easier to understand than 'Product Name_x').
There are two 'Product Name-' columns in the output table above:
'Product Name-Sales' contains the original product names from
“Q1Sales.xlsx” (with missing values), and 'Product Name-Products'
contains product names from “products.csv”. Let’s combine the
two columns into a single 'Product Name' column; first, assign the
joined table back to sales_df:
In [9]: sales_df = pd.merge(
sales_df,
products_df,
on='ProductID',
suffixes=['-Sales', '-Products'],
validate='many_to_one'
)
To combine the two columns into a single one, let’s define a custom
function that chooses a valid product name from the two options:
In [10]: def combine_product_names(row):
             if pd.notna(row['Product Name-Sales']):
                 return row['Product Name-Sales']
             else:
                 return row['Product Name-Products']
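A function like this is meant to be applied to each row. The sketch below does that on made-up rows and, as an assumed alternative that is not part of the original text, shows that fillna can produce the same column in one step:

```python
import pandas as pd

# Made-up rows standing in for the joined sales table.
sales_df = pd.DataFrame({
    'Product Name-Sales': ['Water Balloons', None, 'Violin'],
    'Product Name-Products': ['Water Balloons', 'Water Balloons', 'Violin'],
})

def combine_product_names(row):
    if pd.notna(row['Product Name-Sales']):
        return row['Product Name-Sales']
    else:
        return row['Product Name-Products']

# Row by row, with the custom function.
sales_df['Product Name'] = sales_df.apply(combine_product_names, axis='columns')

# Assumed alternative: fillna takes missing values from another column in one step.
filled = sales_df['Product Name-Sales'].fillna(sales_df['Product Name-Products'])
```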
Notice that I used a different file name for the output data (i.e.,
I didn’t overwrite “Q1Sales.xlsx”). In my experience, it’s a good
idea to keep your input files unchanged until you’re sure the code
works. After you check “Q1SalesClean.xlsx” (whether in Excel or
with pandas) and you’re confident the results of your code are
what you want them to be, you can overwrite “Q1Sales.xlsx” by
changing the file name above.
Summary
This quick project chapter showed you how to join two datasets to
fill in missing product names in “Q1Sales.xlsx”. Next, let’s see how
you can group and pivot your tables with pandas.
pythonforaccounting.com/chapter27
Groups and pivot tables 28
This chapter takes a look at group-based operations: splitting a
table into several groups of rows and calculating statistics for
each group (e.g., the sum of values in a column). This type of
table summarization is widespread in any data work, accounting
included.
The table below is a small subset of the sales data you’ve been
working with throughout the book:
One of the first questions you might ask of these data is which
channel generates the largest sales revenue. Coincidentally, the
easiest way to answer that question is by applying a group operation
on the table above: first split the table into groups based on values
in the Channel column, for each group sum the Total column, then
combine the results back into another table. This entire process is
shown below (I omitted some of the columns above):
[Figure: the split-apply-combine process. The table is split into groups by Channel, the Total column is summed within each group, and the sums are combined into one table (e.g., Shoppe.com 1026.72, Understock.com 199.98, iBay.com 188.56).]
For the code examples in this section, let’s create the same table as
above from our sales data. This will help keep output tables concise
and easy to check — we will use the entire sales data towards the
end of the chapter to apply the same concepts on a larger dataset.
To create a DataFrame with the same rows as the table from the
previous section, you can use the following code:
In [1]: columns = ['ProductID', 'Product Name', 'Channel', 'Unit Price', 'Quantity', 'Total']
sample_df = ledger_df[columns].tail(10)
In [2]: sample_df
pythonforaccounting.com/chapter28
Out [2]: ProductID Product Name Channel Unit Price Quantity Total
14044 MI/SEN-01085 Sennheiser EW 112P... Understock.com 18.58 6 111.48
14045 I&S/WIH-08645 Wiha 26598 Nut Dri... Shoppe.com 16.56 62 1026.72
14046 H&K/KIK-91404 Kikkerland Magneti... iBay.com 3.64 15 54.60
14047 T&G/YU--76445 Yu-Gi-Oh! - Light-... Understock.com 4.50 4 18.00
14048 T&G/LAU-88048 Lauri Toddler Tote iBay.com 14.46 1 14.46
14049 E/AC-63975 AC Adapter/Power S... Bullseye 28.72 8 229.76
14050 E/CIS-74992 Cisco Systems Giga... Bullseye 33.39 1 33.39
14051 E/PHI-08100 Philips AJ3116M/37... Understock.com 4.18 1 4.18
14052 E/POL-61164 NaN iBay.com 4.78 25 119.50
14053 E/SIR-83381 Sirius Satellite R... Understock.com 33.16 2 66.32
The sample_df DataFrame you just created contains the bottom ten
rows of our sales data (notice the row labels above). As promised,
you can run the entire group operation illustrated earlier with one
line of pandas code:
In [3]: sample_df.groupby('Channel').agg({'Total': 'sum'})
There are two methods used in the example above: groupby, which
tells pandas what column to use when splitting the input table
into groups, and agg, which tells pandas what to include in the
output table (here, the sum of values in the 'Total' column for each
group).2 You can group by multiple columns, and compute one or
more aggregates for one or more columns using the two methods
above. Before we go over the different ways to use groupby, we
need to take a quick look inside the pandas machinery for group
operations, because knowing how groupby works will help you develop
an intuition for when and how to use it effectively.
2: If you are familiar with SQL, the groupby method is similar to SQL’s GROUP BY operator, but much more flexible.
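The chained groupby and agg call can be tried without the full sales file. The sketch below rebuilds only the 'Channel' and 'Total' columns of the ten sample rows shown earlier (the `sample` and `channel_totals` names are assumptions for illustration):

```python
import pandas as pd

# 'Channel' and 'Total' values from the ten sample rows.
sample = pd.DataFrame({
    'Channel': ['Understock.com', 'Shoppe.com', 'iBay.com', 'Understock.com',
                'iBay.com', 'Bullseye', 'Bullseye', 'Understock.com',
                'iBay.com', 'Understock.com'],
    'Total': [111.48, 1026.72, 54.60, 18.00, 14.46,
              229.76, 33.39, 4.18, 119.50, 66.32],
})

# Split by channel, sum 'Total' within each group, combine into one table.
channel_totals = sample.groupby('Channel').agg({'Total': 'sum'})
print(channel_totals.round(2))

# The channel with the largest revenue in this sample.
print(channel_totals['Total'].idxmax())
```

The result has one row per unique channel, with the channel names as row labels.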
Instead of chaining groupby and agg, as above, let’s take it one step
at a time by assigning the output of groupby to another variable:
In [4]: groups = sample_df.groupby('Channel')
In the output above, the numbers associated with each key are
the actual row labels from sample_df. In short, what groupby does
is create a mapping between unique column values (in this case,
unique values from the 'Channel' column) and the row labels
where those values are in the original DataFrame. You can access
any group as a separate DataFrame using get_group:
In [7]: groups.get_group('Bullseye')
Out [7]: ProductID Product Name Channel Unit Price Quantity Total
14049 E/AC-63975 AC Adapter/Power Sup... Bullseye 28.72 8 229.76
14050 E/CIS-74992 Cisco Systems Gigabi... Bullseye 33.39 1 33.39
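The mapping between group keys and row labels can be inspected directly. A minimal sketch on three made-up rows (the row labels mimic those in sample_df; the data is otherwise an assumption):

```python
import pandas as pd

sample = pd.DataFrame(
    {'Channel': ['Bullseye', 'iBay.com', 'Bullseye'],
     'Total': [229.76, 119.50, 33.39]},
    index=[14049, 14052, 14050],
)
groups = sample.groupby('Channel')

# Mapping from each unique channel to the row labels where it appears.
print(groups.groups)

# Number of rows in each group.
print(groups.size())

# One group as its own DataFrame, with its original row labels.
bullseye = groups.get_group('Bullseye')
print(bullseye)
```

Nothing is computed until you ask for a group or an aggregate: the groups object only records which rows belong to which key.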
And you can access one or more columns from each separate group
of rows, using the same square bracket notation you use to access
columns in DataFrame objects:
In [8]: groups['Total'].get_group('Bullseye')
What’s more interesting is that after you select columns you can
use any Series method with the groups object to call that method
on each group independently:
In [10]: groups['Total'].sum()
This code is slightly different from the agg example you saw earlier,
but it produces the same result. You will come across both styles
of applying group operations in code examples you find online:
this style is not the easiest to understand, and it takes a while to
get used to.
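To compare the two styles side by side, here is a short sketch on made-up data (variable names are assumptions). The numbers are the same; the difference is the type of the result:

```python
import pandas as pd

sample = pd.DataFrame({
    'Channel': ['Bullseye', 'iBay.com', 'Bullseye', 'iBay.com'],
    'Total': [229.76, 54.60, 33.39, 14.46],
})
groups = sample.groupby('Channel')

# Column selection first, then a Series method: the result is a Series.
totals_series = groups['Total'].sum()

# Dictionary passed to agg: the result is a DataFrame with a 'Total' column.
totals_frame = groups.agg({'Total': 'sum'})

print(type(totals_series).__name__, type(totals_frame).__name__)
```

The agg style scales better when you need several columns or several aggregates at once; the column-selection style is shorter for a single summary.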
There are two specialized methods4 available on the group object
you can use to compute group summaries:
• agg (or its alias aggregate), which we used earlier and produces
a single value from each group of rows (e.g., the sum of a column
for each group);
4: There are two other methods available on the group object: transform and filter. However, you can achieve the same functionality using just the two methods mentioned here.
Table 28.1: Common pandas aggregating functions for groups. You can find all of them in the pandas documentation at
pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#aggregation.
Function name   Example                                                Description
'mean'          sample_df.groupby('Channel').agg({'Total': 'mean'})    Computes the mean of the 'Total' column for each group.
})
)

# selects the max column under Quantity
aggregate_df.loc[:, ('Quantity', 'max')]
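Selecting a column with a tuple works because asking agg for several aggregates per column produces a two-level column header. A self-contained sketch on made-up data (the particular aggregates chosen here are assumptions for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    'Channel': ['Bullseye', 'Bullseye', 'iBay.com', 'iBay.com', 'iBay.com'],
    'Quantity': [8, 1, 15, 1, 25],
    'Total': [229.76, 33.39, 54.60, 14.46, 119.50],
})

# A list of function names per column yields one output column per function.
aggregate_df = sample.groupby('Channel').agg({
    'Quantity': ['min', 'max'],
    'Total': ['sum', 'mean'],
})

# Column labels are (input column, function name) pairs.
print(aggregate_df.columns.tolist())

# Select the 'max' column under 'Quantity' with a tuple.
max_quantity = aggregate_df.loc[:, ('Quantity', 'max')]
print(max_quantity)
```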
Let’s say you want to calculate the difference between the largest
sale and the smallest sale in each channel (i.e., a single summary
value for each channel). You can use groupby, but pandas doesn’t
come with a predefined aggregating function that does what you
want. No need to panic: you can define and use your own
aggregating function:
In [17]: def total_diff(column):
             return column.max() - column.min()
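A custom function is passed to agg the same way as a function name, just without quotes. A minimal sketch on made-up data (the `sample` and `spread` names are assumptions):

```python
import pandas as pd

sample = pd.DataFrame({
    'Channel': ['Bullseye', 'Bullseye', 'iBay.com', 'iBay.com',
                'iBay.com', 'Shoppe.com'],
    'Total': [229.76, 33.39, 54.60, 14.46, 119.50, 1026.72],
})

def total_diff(column):
    # column is a Series holding one group's 'Total' values.
    return column.max() - column.min()

# Pass the function object itself as the aggregate.
spread = sample.groupby('Channel').agg({'Total': total_diff})
print(spread.round(2))
```

A channel with a single sale, such as 'Shoppe.com' here, gets a difference of zero because its largest and smallest sale are the same row.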
Inside the custom function, you can manipulate the Series object
any way you need to, using any Series methods (including the
string or date methods we looked at in previous chapters, if the
column you want to summarize contains strings or dates). However,
your custom function must return a single value; otherwise, you
will get an error:6
In [18]: def custom_aggregating_function(column):
             return column

         sample_df.groupby('Channel').agg({'Total': custom_aggregating_function})
6: A function that takes a sequence of values as an argument (e.g., a Series object) and returns a single number computed from that sequence of values (e.g., the sum) is sometimes called a reducer.
Group filters allow you to select groups of rows from your table
that meet a certain group-level condition. Say you want to select
rows from those channels that have total sales over 200 dollars —
you have several options to get there, but perhaps the shortest path
is by defining a custom filtering function and using it with groupby
and the apply method:
In [19]: def filter_group(group_df):
             return group_df if group_df['Total'].sum() > 200 else None

         sample_df.groupby('Channel').apply(filter_group)
Out [19]: ProductID Product Name Channel Unit Price Quantity Total
Channel
Bullseye 14049 E/AC-63975 AC Adapter... Bullseye 28.72 8 229.76
14050 E/CIS-74992 Cisco Syst... Bullseye 33.39 1 33.39
Shoppe.com 14045 I&S/WIH-08645 Wiha 26598... Shoppe.com 16.56 62 1026.72
This returns the rows of the original DataFrame, with the same
columns, but removes the rows that belong to channels that totaled
less than 200 dollars in sales (i.e., removes sales from 'iBay.com' and
'Understock.com' channels). When you call apply with a custom
function, as you did here, pandas passes each group to your function
as a DataFrame argument, one at a time (notice you didn’t
select a column after using groupby, but called apply directly).
Inside the function, you can manipulate the group DataFrame (in
this case, the parameter called group_df) any way you need to.
Row labels in the output above have multiple levels: the first level
indicates which group each row is part of (based on the grouping
columns you used with groupby); the second level has the original
row labels from sample_df. If you don’t need these group labels,
you can discard them using reset_index(drop=True):
In [20]: sample_df.groupby('Channel').apply(filter_group).reset_index(drop=True)
Out [20]: ProductID Product Name Channel Unit Price Quantity Total
0 E/AC-63975 AC Adapter/Power Sup... Bullseye 28.72 8 229.76
1 E/CIS-74992 Cisco Systems Gigabi... Bullseye 33.39 1 33.39
2 I&S/WIH-08645 Wiha 26598 Nut Drive... Shoppe.com 16.56 62 1026.72
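As an alternative to apply, the group object also has a dedicated filter method, which expects a function that returns True or False for each group. The sketch below is a different route to the same rows, not the approach used above, and runs on the 'Channel' and 'Total' values of the ten sample rows:

```python
import pandas as pd

sample = pd.DataFrame({
    'Channel': ['Understock.com', 'Shoppe.com', 'iBay.com', 'Understock.com',
                'iBay.com', 'Bullseye', 'Bullseye', 'Understock.com',
                'iBay.com', 'Understock.com'],
    'Total': [111.48, 1026.72, 54.60, 18.00, 14.46,
              229.76, 33.39, 4.18, 119.50, 66.32],
})

# Keep only rows from channels whose sales total more than 200 dollars.
kept = sample.groupby('Channel').filter(
    lambda group_df: group_df['Total'].sum() > 200
)
print(kept)
```

With filter, the surviving rows keep their original order and their original single-level row labels, so no reset_index step is needed.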
Out [21]: ProductID Product Name Channel Unit Price Quantity Total % Group Total
14044 MI/SEN-0... Sennheis... Understo... 18.58 6 111.48 55.75
14045 I&S/WIH-... Wiha 265... Shoppe.com 16.56 62 1026.72 100.00
14046 H&K/KIK-... Kikkerla... iBay.com 3.64 15 54.60 28.96
14047 T&G/YU--... Yu-Gi-Oh... Understo... 4.50 4 18.00 9.00
14048 T&G/LAU-... Lauri To... iBay.com 14.46 1 14.46 7.67
14049 E/AC-63975 AC Adapt... Bullseye 28.72 8 229.76 87.31
14050 E/CIS-74992 Cisco Sy... Bullseye 33.39 1 33.39 12.69
14051 E/PHI-08100 Philips ... Understo... 4.18 1 4.18 2.09
14052 E/POL-61164 NaN iBay.com 4.78 25 119.50 63.38
14053 E/SIR-83381 Sirius S... Understo... 33.16 2 66.32 33.16
Notice that values in the new '% Group Total' column sum to
100 for each separate channel (e.g., '% Group Total' for the one
'Shoppe.com' row is 100.00).
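One way to compute a '% Group Total' column like the one above is the transform method, which returns a value for every input row rather than one per group. This is a sketch under that assumption, not necessarily the code that produced the output shown:

```python
import pandas as pd

sample = pd.DataFrame({
    'Channel': ['Understock.com', 'Shoppe.com', 'iBay.com', 'Understock.com',
                'iBay.com', 'Bullseye', 'Bullseye', 'Understock.com',
                'iBay.com', 'Understock.com'],
    'Total': [111.48, 1026.72, 54.60, 18.00, 14.46,
              229.76, 33.39, 4.18, 119.50, 66.32],
})

# Each row receives the sum of 'Total' for its own channel.
group_totals = sample.groupby('Channel')['Total'].transform('sum')

# Share of each sale in its channel's total, as a percentage.
sample['% Group Total'] = (sample['Total'] / group_totals * 100).round(2)
print(sample)
```

Because transform keeps the original row labels, the result can be assigned straight back to the table as a new column.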
On the road to pivot tables, we need to make a quick stop at
stacking and unstacking. Stacking and unstacking are operations
that change the shape of a table, but don’t modify its data in any
way. Stacking refers to placing all values in a table on top of each
Let’s first create an even smaller sample table to show how stacking
works in pandas:
In [22]: columns = ['ProductID', 'Channel', 'Total']
sample_df = ledger_df[columns].head()
sample_df
To stack this table (which has the same data as the illustration
above), you use the stack method:
In [23]: sample_df.stack()
stacked_sample.unstack()
All this may seem riveting on its own, but unstacking is most useful
when you want to reshape a table output by groupby. As it turns
out, grouping, aggregating and then unstacking values are the
three steps in creating a pivot table.
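The round trip between the two shapes can be checked on a tiny table. A minimal sketch with made-up numeric columns (names and values are assumptions for illustration):

```python
import pandas as pd

wide = pd.DataFrame({
    'Unit Price': [20.11, 6.70],
    'Total': [281.54, 6.70],
})

# Stacking moves column names into a second level of row labels,
# leaving one value per (row label, column name) pair.
stacked = wide.stack()
print(stacked)

# Unstacking rotates the inner level of row labels back into columns.
restored = stacked.unstack()
print(restored)
```

The stacked result is a Series with four values, one for each cell of the original two-by-two table; unstacking gives back the original shape and values.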
Pivot tables
There are probably more pivot table tutorials available on the web
than there are about all other Excel features combined. The reason
for all the tutorials is that pivot tables are just as powerful as they
are confusing. Much of the confusion comes from the fact that their
internal machinery is hidden (i.e., you get to see just the input and
the output) and involves several distinct steps, so you can’t easily
develop intuition about how they work.
First, let’s go back to ledger_df and create a new column that tells
us by what quarter a product is considered “aged” (remember that
the 'Deadline' column holds the date after which a product is
considered “aged” for inventory tracking purposes) — you can do
that using:
In [25]: ledger_df = pd.read_excel('Q1Sales.xlsx')

         ledger_df['Deadline'] = pd.to_datetime(ledger_df['Deadline'])
         ledger_df['Deadline Quarter'] = ledger_df['Deadline'].dt.to_period(freq='Q-DEC')

         ledger_df
Out [25]: InvoiceNo Channel Product Name ... Quantity Total Deadline Quarter
0 1532 Shoppe.com Cannon Water... ... 14 281.54 2019Q4
1 1533 Walcart LEGO Ninja T... ... 1 6.70 2020Q2
2 1534 Bullseye NaN ... 5 58.35 2020Q2
3 1535 Bullseye Transformers... ... 6 80.76 2019Q4
4 1535 Bullseye Transformers... ... 6 80.76 2019Q4
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/P... ... 8 229.76 2020Q1
14050 15582 Bullseye Cisco System... ... 1 33.39 2020Q1
14051 15583 Understock.com Philips AJ31... ... 1 4.18 2020Q1
14052 15584 iBay.com NaN ... 25 119.50 2020Q2
14053 15585 Understock.com Sirius Satel... ... 2 66.32 2020Q1
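The quarter conversion can be tried on a handful of dates without loading the sales file. In this sketch the dates are made up; 'Q-DEC' means quarters of a year that ends in December, i.e., calendar quarters:

```python
import pandas as pd

# Made-up deadline dates.
deadlines = pd.to_datetime(pd.Series(['2019-11-25', '2020-05-02', '2020-02-14']))

# Convert each date to the calendar quarter it falls in.
quarters = deadlines.dt.to_period(freq='Q-DEC')
print(quarters)
```

The result is a Series of period values that display as '2019Q4', '2020Q2' and '2020Q1', which sort chronologically and can be used directly as grouping keys.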
In this case, row grouping is done using values from two input
columns (i.e., not just 'Channel', as in the previous examples). How-
The result above is a DataFrame with multi-level row labels: the first
level has unique values from the 'Channel' column; the second
level has unique values from the 'Deadline Quarter' column. It’s
the result we wanted, but in long instead of wide format, making it
hard to compare values from the same quarter across channels. To
fix that, you can unstack the inner-most level of row labels (the
default for unstack), turning it into a set of columns:
In [27]: ledger_df.groupby(['Channel', 'Deadline Quarter']).agg({'Quantity': 'sum'}).unstack()
Now the output is a DataFrame with one row for every channel in
the sales data and a column for each deadline quarter — the values
in the table are the sum of the 'Quantity' column for different
combinations of 'Channel' and 'Deadline Quarter'. You may have
noticed that, even though we used a combination of groupby, agg
and unstack to get here, the result above is a pivot table based on
ledger_df’s data.
This is how pivot tables work in pandas: first, they group rows
using values in two (or more) columns, then compute an aggregate
value for each group of rows, and finally unstack (i.e., “rotate”)
values from one of the grouping columns into a header for the
output.
As I mentioned earlier, you can get the same result with less code
by using pandas’s pivot_table function:
In [28]: pd.pivot_table(ledger_df,
                        index='Channel',
                        columns='Deadline Quarter',
                        values='Quantity',
                        aggfunc='sum')
The code above is perhaps more similar to how you construct pivot
tables in Excel, but it is just a shortcut for the same groupby-agg-unstack
operation you saw earlier. The first argument passed to
pivot_table is the DataFrame you want to summarize, followed by
several keyword arguments (all of which are needed to get the table above):
• index and columns tell pandas what columns to use for grouping
rows. Unique values from these two columns will become the row
labels and column header in the output DataFrame, respectively;
• values and aggfunc tell pandas what values from the input table
to aggregate and what function to use when aggregating them (in
this case, the sum function, but you can use any of the aggregating
functions mentioned in table 28.1 or your own functions).
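The equivalence between pivot_table and the groupby, agg and unstack chain can be checked on a small table. In this sketch the data is made up, and quarters are plain strings rather than period values to keep it short:

```python
import pandas as pd

df = pd.DataFrame({
    'Channel': ['Bullseye', 'Bullseye', 'iBay.com', 'iBay.com', 'Bullseye'],
    'Deadline Quarter': ['2019Q4', '2020Q1', '2019Q4', '2020Q1', '2019Q4'],
    'Quantity': [5, 8, 15, 25, 6],
})

# The long way: group, aggregate, then rotate the inner level into columns.
by_hand = (df
    .groupby(['Channel', 'Deadline Quarter'])
    .agg({'Quantity': 'sum'})
    .unstack()
)

# The shortcut.
pivot = pd.pivot_table(df,
                       index='Channel',
                       columns='Deadline Quarter',
                       values='Quantity',
                       aggfunc='sum')

print(by_hand)
print(pivot)
```

The only visible difference is the header: the long way keeps 'Quantity' as an extra level above the quarter labels, while pivot_table drops it when a single values column is given.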
In [29]: pd.pivot_table(ledger_df,
                        index='Channel',
                        columns='Deadline Quarter',
                        values='Quantity',
                        aggfunc='sum',
                        margins=True,
                        margins_name='TOTAL')
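The margins argument adds a totals row and a totals column to the pivot table, labeled with the value of margins_name. A small sketch on made-up data (quarters are plain strings here, an assumption made to keep the example short):

```python
import pandas as pd

df = pd.DataFrame({
    'Channel': ['Bullseye', 'Bullseye', 'iBay.com', 'iBay.com', 'Bullseye'],
    'Deadline Quarter': ['2019Q4', '2020Q1', '2019Q4', '2020Q1', '2019Q4'],
    'Quantity': [5, 8, 15, 25, 6],
})

with_totals = pd.pivot_table(df,
                             index='Channel',
                             columns='Deadline Quarter',
                             values='Quantity',
                             aggfunc='sum',
                             margins=True,
                             margins_name='TOTAL')
print(with_totals)
```

The bottom-right cell holds the grand total, computed with the same aggregating function as the rest of the table.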
Summary
Overthinking: Changing how DataFrames are displayed 29
The previous chapters have taken you through a detailed tour
of pandas’s table handling features. The following part of the
book guides you through Python’s data visualization landscape.
However, before we leave pandas territory, there’s one more feature
I want to cover: changing how DataFrame objects look in your
notebooks (e.g., how tables get truncated, how decimal values are
formatted, how to change table text size, etc.).
Now that you have a DataFrame, let’s change how it’s displayed in
your notebook. With pandas you can configure both table display
options and table styling options — the difference between these
options will become apparent as we work through some examples.
Let’s look at table display options first.
Similarly, you can set the minimum and maximum number of rows
that get shown when you view a DataFrame:
In [3]: pd.options.display.min_rows = 20
pd.options.display.max_rows = 40
If you inspect sales_df now, you’ll see ten of its top and bottom
rows (i.e., min_rows altogether). The max_rows option sets the max-
imum number of rows that get displayed — if a DataFrame has
fewer rows than the value of max_rows, all of its rows get displayed.
Once the number of rows in a DataFrame exceeds max_rows, the
min_rows value determines how many rows are shown.
To display all the data in your DataFrame (i.e., to show all its rows
and columns), you can set the display options above to None:
In [4]: pd.options.display.max_columns = None
pd.options.display.max_rows = None
Now, if you inspect sales_df, you’ll see all of its rows and columns.
This might seem like an improvement over the truncated tables
you’re used to by now. However, if you work with large datasets,
setting the display options above to None will make your notebooks
challenging to work with (you’ll have to scroll a lot to find your
code cells, and your notebooks will likely become slower to use).
Setting display options this way affects the display of all tables in
your notebook. You can set several other display options, including
max_colwidth, which sets the maximum width of columns, and precision,
which sets how many decimal places get shown in your tables.
Table 29.1 lists some of the other options.
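If you want different display options for a single table only, pandas also provides option_context, which applies options inside a with block and restores the previous values afterwards. A minimal sketch on a made-up 100-row table (the `df` and `text` names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({'n': range(100)})

# Remember the current setting to show that it is restored afterwards.
previous_max_rows = pd.get_option('display.max_rows')

# Options are given as name, value pairs and apply only inside the block.
with pd.option_context('display.max_rows', 40, 'display.min_rows', 20):
    text = repr(df)

print(text)
print(pd.get_option('display.max_rows') == previous_max_rows)
```

Because the table has more rows than max_rows, it is truncated to min_rows rows: ten from the top and ten from the bottom.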
Table 29.1: Display options available with pandas. These (and more) options are documented in the official pandas
documentation at pandas.pydata.org/pandas-docs/stable/user_guide/options.html#available-options.
Option name Default value Description
chop_threshold None If set to a float value, all decimal numbers smaller than the given
threshold will be displayed as exactly 0 in the table.
colheader_justify 'right' Controls the justification of column headers. Valid options are
'left' or 'right'.
date_dayfirst False When True, prints and parses dates with the day first (e.g.,
20/01/2005).
date_yearfirst False When True, prints and parses dates with the year first (e.g.,
2005/01/20).
float_format None The value assigned to this option should be a function that accepts
a floating point number as its only argument and returns a string
with the desired format of the number.
max_columns 20 Sets the number of columns displayed. Setting this option to None
means all columns get displayed.
max_colwidth 50 The maximum width in characters of a column. When the column has
more characters than max_colwidth, a placeholder (i.e., an ellipsis)
is added to the column value. Setting this option to None means all
characters get displayed.
pythonforaccounting.com/chapter29
max_rows 60 Sets the maximum number of rows displayed. Setting this option to
None means all rows get displayed.
min_rows 10 Sets the number of rows to show in a truncated table (when the
number of rows in the table exceeds max_rows). Ignored when
max_rows is set to None. When set to None, follows the value of
max_rows.
precision 6 The number of places after the decimal used when displaying
decimal numbers in a table.
show_dimensions 'truncate' Whether to print out dimensions at the end of a displayed DataFrame.
If this option is set to 'truncate', only print out the dimensions if
the table is truncated (i.e., not all rows or columns in the table are
displayed); True or False are valid options.
pd.options.display.float_format = float_format_function
You can use any custom function to format decimal values; for
instance, you can even add a message to your formatting function:
In [7]: def float_format_function(value):
            return f'My value is £{value:,}'

        pd.options.display.float_format = float_format_function
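A more typical formatting function adds a thousands separator and fixes the number of decimal places. The sketch below uses made-up values and applies the option temporarily with option_context (an assumption made so the setting doesn't leak into other examples):

```python
import pandas as pd

def float_format_function(value):
    # Thousands separator and two decimal places.
    return f'{value:,.2f}'

df = pd.DataFrame({'Total': [1026.72, 1234567.891]})

# The format applies only to how the table is rendered as text.
with pd.option_context('display.float_format', float_format_function):
    formatted = df.to_string()

print(formatted)

# The underlying data is unchanged.
print(df['Total'].tolist())
```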
Keep in mind that the options above change the way tables look in
your notebooks, not their data. If you save a DataFrame to an Excel
file, you’ll see the same data when you open the file with Excel
regardless of your notebook display options.
Styling tables
The pandas options above change the display of all tables in your
notebook. However, you can use pandas to apply conditional styling
to single DataFrame variables as well (e.g., set text or background
color, change font size, add borders, etc.). Even better, you can
also save your styled DataFrame variable (together with its custom
colors or fonts) as an Excel spreadsheet.
If you run the code above, you should see a pink table with blue
text. The code is somewhat strange: set_properties is a method
that accepts an arbitrary number of keyword arguments. The ** operator
above unpacks a Python dictionary into keyword arguments that
set_properties can work with. Regardless of its details, the code
above is what you need to run to change the visual style of an
entire DataFrame.
Keep in mind that running the code above doesn’t return another
DataFrame but a Styler object so you can’t chain other DataFrame
methods to the output of set_properties above:
In [10]: table = sales_df.style.set_properties(**{'color':'blue', 'background-color': 'pink'})
type(table)
As such, styling your tables should be the last step in your data
wrangling process.
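The ** operator used with set_properties earlier is plain Python and can be tried without pandas. In this sketch, `show_keyword_arguments` is a hypothetical stand-in written for illustration; it is not a pandas function:

```python
def show_keyword_arguments(**properties):
    # Collects whatever keyword arguments it receives into a dictionary.
    return properties

style = {'color': 'blue', 'background-color': 'pink'}

# ** unpacks the dictionary into separate keyword arguments.
received = show_keyword_arguments(**style)
print(received)

# Writing the arguments by hand works only for names that are valid
# Python identifiers, which 'background-color' is not.
print(show_keyword_arguments(color='blue'))
```

This is why the styling code passes a dictionary: CSS property names such as 'background-color' contain a hyphen and can't be typed directly as keyword arguments.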
Table 29.2: Properties available with pandas’s styling feature. These (and more) properties are documented in the official
pandas documentation at pandas.pydata.org/pandas-docs/stable/user_guide/style.
'background-color' Any valid CSS color name (the Sets the table or cell background color.
next chapter lists CSS colors).
'border-style' One of 'dotted', 'dashed', Sets the table border line style. You can control bor-
'solid', or 'double'. der lines separately by changing the property name
to one of 'border-left-style', 'border-top-style',
'border-right-style', or 'border-bottom-style'.
'border-width' One of 'thin', 'medium', Sets the table border line width. You can control bor-
'thick', or a specific value der lines separately by changing the property name
such as '4pt'. You must set to one of 'border-left-width', 'border-top-width',
a 'border-style' for this to 'border-right-width', or 'border-bottom-width'.
work.
'border-color' Any valid CSS color name (the Sets the table border line color. You can control bor-
next chapter lists CSS colors). der lines separately by changing the property name
to one of 'border-left-color', 'border-top-color',
'border-right-color', or 'border-bottom-color'.
'color' Any valid CSS color name (the Sets text color.
next chapter lists CSS colors).
'font-weight' Any value between 100 and 900. Sets how thick or thin characters in table text are displayed.
'text-align' One of 'left', 'right', Sets the horizontal alignment of text in table cells.
'center', or 'justify'.
'number-format' One of Excel’s custom formats Sets the number format using Excel formatting style.
such as '#,##0' or '#,##0
£_);(#,##0 £)'.
return colors
The function above goes through a Series (i.e., the values argu-
ment) and constructs a list of strings (i.e., either 'color: red' or
'color: green') with the same number of items as values. You can
apply the function and style the 'Quantity' column by running:
In [12]: sales_df.style.apply(color_quantity_column, subset=['Quantity'])
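A minimal sketch of a styling function of the kind described above. The rule used here (red below 10 units, green otherwise) and the sample values are assumptions made for illustration:

```python
import pandas as pd

def color_quantity_column(values):
    # Build one CSS string per value, in the same order as the input.
    colors = []
    for value in values:
        if value < 10:
            colors.append('color: red')
        else:
            colors.append('color: green')
    return colors

# Calling the function directly on a Series shows what the styler receives.
quantities = pd.Series([6, 62, 15, 4], name='Quantity')
print(color_quantity_column(quantities))
```

Calling the function on a plain Series like this is a quick way to check its output before handing it to style.apply, which expects exactly one style string per cell.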
return colors
Similarly, you can set cell background color in the 'Total' column
with another function:
In [14]: def color_total_column(values):
In [15]: (sales_df
.style
.apply(color_quantity_column, subset=['Quantity'])
.apply(color_total_column, subset=['Total'])
)
Summary
This chapter showed you how to change the display and style of
your DataFrame variables. This chapter also marks the end of our
pandas tour. The following part of the book shows you how to turn
a DataFrame into a plot using some of Python’s most popular data
visualization libraries.
Visualizing data part three
This part of the book looks at turning data into plots.1 Plots often carry the weight of arguments in your documents or presentations, so knowing how to make them attractive and convincing is surprisingly important in business.
Both Excel and Python2 allow you to turn tables into plots. While Excel offers sensible styles and options for making plots with one click, Python allows you to craft and customize your plots down to the smallest details.
The Python ecosystem has many different libraries for data visualization, and it can sometimes be frustrating to find the right one for your needs.3 In the following chapters, I’ll introduce you to Python’s most versatile and popular visualization library, matplotlib, and a few others that extend it. Matplotlib is popular because it works well with other libraries from the Python ecosystem, including pandas, as you will soon see.
Entire books have been written about making plots with Python, so the following chapters can only briefly introduce a vast topic.4 Even so, they will help you understand how to make plots with code and set you up to explore Python’s data visualization universe on your own.
1: Charts, graphs, or plots are terms that loosely refer to the same concept: a graphic representation of data. While Excel uses chart, I use plot because it’s more common in Python’s data world.
2: Through its many data visualization libraries.
3: There are so many different visualization libraries in Python that there’s a website dedicated to helping you choose which one covers your use-case: pyviz.org.
4: There’s also a yearly matplotlib plotting competition called the John Hunter Excellence in Plotting Contest. Plotting can get competitive.
Setup
The “Q1DailySales.csv” file contains the same sales data we’ve been
working with so far, but grouped by channel and date. If you open
the file, you’ll see it has 91 rows (one for each day in the first quarter
of 2020) and one column for each of the sales channels:
Date Understock.com Shoppe.com iBay.com Walcart Bullseye
0 2020-01-01 20707.62 6911.72 5637.54 13593.17 9179.39
1 2020-01-02 18280.59 17351.46 5959.61 12040.16 5652.32
2 2020-01-03 17191.15 10578.60 8346.60 9876.21 6127.92
3 2020-01-04 17034.69 6052.03 10168.41 12811.26 10370.95
4 2020-01-05 17074.18 11866.74 12462.30 8318.34 4641.02
.. ... ... ... ... ... ...
86 2020-03-27 20183.41 5111.76 7703.85 3026.12 503.20
87 2020-03-28 8190.09 1392.89 4456.91 2776.78 1772.25
88 2020-03-29 7267.96 3966.57 6717.35 1195.16 1142.15
89 2020-03-30 14712.07 3826.95 11044.34 1702.39 676.08
90 2020-03-31 12343.37 10372.53 13687.49 157.49 848.51
The values in this dataset represent the total daily revenue per
channel for our wholesale supplier. We’ll use these data to create
several plots over the next chapters using Python’s data visualiza-
tion libraries.
When you’re ready, launch JupyterLab, and let’s see how you turn
the revenue table above into a line plot using matplotlib.
Plotting with matplotlib 30
The matplotlib library1 is central to Python’s data visualization
universe: there’s no way around learning how to use it if you want
to make plots with Python. In addition, several other visualization
libraries extend matplotlib to provide their specialized plotting
tools, so knowing how matplotlib works makes using those other
libraries easier.
1: It comes with Anaconda, so you already have it installed if you followed the setup guide in chapter 2.
In this chapter, you’ll use matplotlib to turn the daily sales data I
introduced earlier into the line plot shown below:
Figure 30.1 highlights the main elements of a matplotlib plot:
on the left, you have a plot of two lines and some scatter points,
and on the right, you have the same plot but with its matplotlib
components highlighted and labeled.2
2: The data used in the plots were generated for this example only and don’t represent anything.
[Figure 30.1 panels, both titled "Elements of a matplotlib plot". Labeled components: Figure, Axes, Axes title, Legend (Blue line, Red line), Grid, Spines, Line (line plot), X axis, X axis label, Y axis label, Major tick, Minor tick, Major tick label, Minor tick label.]
Figure 30.1: On the left, a simple plot with two lines and scattered points; on the right, the same plot with its main
components highlighted and named. The plots are adapted from an example available on the official matplotlib website,
accessible at matplotlib.org/gallery/showcase/anatomy.html.
• The x-axis and y-axis are the horizontal and vertical dimensions
of the plot. They are part of the axes component, and each has
associated labels, minor and major ticks, and tick labels.
• The main plot elements (in this case, the two lines and the scatter
points) are collections of x and y coordinates (i.e., data points) represented
visually within the bounds of the axes. You can represent
the same collection of data points in different ways, depending on
what type of plot you need (line plot, scatter plot, bar plot, etc.).
If you already use Excel to create charts, many of these plot elements
are familiar to you. The methods you’ll use in the following
sections are named after the plot components they modify (e.g.,
set_xticklabels), so clarifying what these plot components are
(and what their matplotlib name is) will help you understand
what those methods do.
pythonforaccounting.com/chapter30
Plotting basics
To start working on the daily sales plot shown earlier, you first
need to import the two libraries we’ll be using:
In [1]: import pandas as pd
        import matplotlib.pyplot as plt
You’re already familiar with pandas — let’s use it to load the daily
sales dataset into a DataFrame:
In [2]: daily_sales_df = pd.read_csv('Q1DailySales.csv')
daily_sales_df['Date'] = pd.to_datetime(daily_sales_df['Date'])
daily_sales_df.head()
Just like pandas’s workhorse objects are the DataFrame and the
Series, matplotlib’s core objects are its Figure and Axes. Every
plot you make with matplotlib is a Figure object containing one or
more Axes objects. The Figure object is a placeholder for everything
displayed on the screen (which can include multiple axes, legends,
annotations, or anything you can think of to put in a plot). You can
create a Figure object using:
In [3]: fig = plt.figure()
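A Figure on its own is empty; to draw anything you also need an Axes inside it. The later listings in this chapter do this with fig.add_subplot, so a minimal sketch of that step (not necessarily the exact original listing) could look like this:

```python
import matplotlib.pyplot as plt

# Create an empty Figure, then add a single Axes to it
fig = plt.figure()
ax = fig.add_subplot()
```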
Calling fig.add_subplot() creates an Axes object and adds it to your
Figure; assigning the result gives you the ax variable used in the
rest of this chapter. You can add multiple Axes to the same Figure to
create side-by-side plots or complicated plot grids. We’ll come back
to this; for now, let’s continue with one plot.
By default, the entire figure will be 6.4 inches wide by 4.8 inches
tall.3 If you want to change the size of your figure, you can pass
the figsize argument when calling plt.figure.
3: And 100 dots per inch (dpi), in case you need to print your plots.
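A minimal sketch of such a call, assuming the 12 by 6 inch size that the later examples in this chapter use:

```python
import matplotlib.pyplot as plt

# figsize is a (width, height) tuple, both values in inches
fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()
```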
The first value in the tuple passed to figsize is the figure width,
followed by its height (both in inches). Besides figsize, you can
use a few other keyword arguments when creating a figure to
customize its look; table 30.1 lists some of them.
Table 30.1: Optional keyword arguments available when creating a Figure. Read more about the Figure object in
matplotlib’s documentation at matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html.
Argument    Example                                 Description
dpi         plt.figure(figsize=(12, 6), dpi=200)    Sets the resolution of the Figure. If not provided, defaults to 100. Higher-resolution images use more computer memory.
frameon     plt.figure(frameon=False)               Enables or disables the Figure frame. By default, the frame is visible.
Plotting data
Now that you have a Figure with Axes set up, you need to plot
some data. Depending on what type of plot you want, there are
several methods you can call on the ax variable you created earlier.
As a first example, let’s create a line plot using the plot method:
In [6]: fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()
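The plot call itself is an ordinary Axes method call. As a sketch, with made-up coordinates (placeholders, not sales data) and the red, 4-point line described in this section:

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()

# x-coordinates first, then y-coordinates (placeholder values)
lines = ax.plot([1, 2, 3], [2, 4, 4], color='red', linewidth=4)
```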
The plot method is called on an Axes object. Its first two
arguments are sequences of numbers: the first is a list of
coordinates for the x-axis, the second a list of coordinates for
the y-axis. matplotlib goes through each pair of coordinates, one
by one, and draws a straight line from one to the next. The color
and linewidth keyword arguments set the line color (e.g., 'red') and
the line width in points (e.g., 4; by default, the line width is
1.5 points). We’ll go over plot colors and styles later.
The plot method connects all data points passed as input with a
straight line. You can use several other methods to draw different
kinds of plot elements from your data (besides straight lines).
All these methods require a list of x-coordinates and a list of y-
coordinates as their first arguments. Table 30.2 lists some of the
other Axes methods, with examples of the plots they produce.
Table 30.2: Some of the Axes methods you can use to create various types of plots. Read more about all available plotting
methods in the official matplotlib documentation available at matplotlib.org/api/axes_api.html#plotting.
Method     Example
plot       ax.plot([1, 2, 3], [2, 4, 4], color='red')
scatter    ax.scatter([1, 2, 3], [2, 4, 4], color='purple')
barh       ax.barh(['a', 'b', 'c'], [3, 2, 1], color='olive')
step       ax.step([1, 2, 3], [1, 0, 1], color='teal', linewidth=4)
To plot the daily sales data, you use the same plot method by
passing a list of x and y-coordinates as arguments. To draw just
one of the columns in daily_sales_df as a line on the plot:
In [7]: fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()

ax.plot(daily_sales_df['Date'], daily_sales_df['Bullseye'])
You used the same kind of data with plot as in the previous section:
a sequence of x-coordinates as the first argument (i.e., values from
the 'Date' column), and a sequence of y-coordinates for the y-axis
as the second argument (i.e., values from the 'Bullseye' column).
If you don’t specify a color for the line, matplotlib cycles through
a predefined set of colors and uses a different color every time you
call plot on the same Axes object.
To draw another line on the same plot, you can just use the plot
method again:
In [8]: fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()

ax.plot(daily_sales_df['Date'], daily_sales_df['Bullseye'])
ax.plot(daily_sales_df['Date'], daily_sales_df['Walcart'])
And if you want to plot all of daily_sales_df, you can use the
plot method multiple times, for each one of its columns:
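One way to sketch this is a loop that calls plot once per sales column. The DataFrame below is a stand-in with made-up values and only two channels, so the example runs without Q1DailySales.csv; with the real data you would keep your own daily_sales_df.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for Q1DailySales.csv: made-up values, two channels only
dates = pd.date_range('2020-01-01', '2020-03-31', freq='D')
daily_sales_df = pd.DataFrame({
    'Date': dates,
    'Bullseye': [1000 + 10 * i for i in range(len(dates))],
    'Walcart': [1500 - 5 * i for i in range(len(dates))],
})

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()

# One plot call per sales column; 'Date' supplies the x-coordinates
for column in daily_sales_df.columns.drop('Date'):
    ax.plot(daily_sales_df['Date'], daily_sales_df[column])
```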
It’s starting to look like a useful plot, but we still need to fix a few
things: add a legend, so you know which line corresponds to which
channel, add a title, and some other details. However, before we fix
any of those, let’s first improve the code above slightly by making
it less repetitive. Because pandas works well with matplotlib, you
can get the same plot as above in fewer lines of code by running:
In [10]: fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()

ax.plot(daily_sales_df.set_index('Date'))
The sales plot you have now needs a few more things to make it
easier to read: a title, x and y-axis labels, a legend. There is an Axes
method you can use to add each of these elements to your plot —
this section takes a closer look at some of these methods.
To set a title for a plot, you can use the set_title method. Similarly,
you can set axis labels to help readers understand what the axes
on your plot mean with set_xlabel and set_ylabel:
In [12]: fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()

ax.plot(daily_sales_df)

ax.set_title('Daily Sales Q12020', loc='left', pad=10)
ax.set_xlabel('Date')
ax.set_ylabel('Revenue (USD)');
We’ll take a look at how to change font style (i.e., font size and font
family) for the plot title and labels later. For now, let’s see how to
set ticks and their labels.
On most plots, both the x-axis and the y-axis have several short,
evenly spaced dashes that help readers understand the range of
data that is being plotted. These highlighted dashes are called ticks,
and their associated labels are called tick labels (Excel calls them
tick marks).
There are two types of ticks on a plot: major ticks, which indicate
larger (equally spaced) intervals on an axis and are more prominent,
and minor ticks, which indicate smaller intervals between major
ticks and are typically not labeled (figure 30.2 in the margin shows
the different types of ticks on a matplotlib plot).
Figure 30.2: Fragment from Elements of a matplotlib plot highlighting major and minor ticks and their labels.
Right now, the x-axis tick labels on our sales plot are not particularly
easy to read. By default, matplotlib will try and place ticks and
their labels as evenly spaced and nicely formatted as possible.
However, if you want to change the location of ticks on the x-axis,
you can use the Axes set_xticks method and pass a list of values
at which to place the ticks (in our case, a list of dates). You can
define a list of dates any way you want to; you can even manually
type Timestamp values for each tick if you want to be very specific
about your tick placement:
In [13]: # an easier way to get this list of dates
# is by using pandas's date_range function
# pd.date_range('01 January 2020',
# '1 April 2020', freq='MS')
xtick_values = [
pd.Timestamp('2020-01-01'), pd.Timestamp('2020-02-01'),
pd.Timestamp('2020-03-01'), pd.Timestamp('2020-04-01')
]
The xtick_values variable is a list of dates representing the first
of each month between January and April 2020 (I used pandas’s
top-level date_range function5 to create this list and copied the
values).
5: You can read more about date_range at pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.
In [14]: fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()

ax.plot(daily_sales_df)

xtick_values = [
pd.Timestamp('2020-01-01'), pd.Timestamp('2020-02-01'),
pd.Timestamp('2020-03-01'), pd.Timestamp('2020-04-01')
]

ax.set_xticks(xtick_values)
To change the labels associated with each tick, you need to create
another list with the labels you want to use and pass it to the
set_xticklabels method. The easiest way to make a list of labels
is by using the date values in xtick_values and a Python list
comprehension:
In [15]: xtick_labels = [date.strftime('%a %d/%m') for date in xtick_values]
xtick_labels
Out [15]: ['Wed 01/01', 'Sat 01/02', 'Sun 01/03', 'Wed 01/04']
The output is just a list of strings, but if the code above seems
strange, head back to chapter 6 for a refresher on Python list com-
prehensions. Inside the list comprehension, the strftime method
is called on each date value in xtick_values to convert it to a string.
The format specifiers passed to strftime (i.e., '%a %d/%m') are the
same specifiers mentioned in chapter 22 (I never remember what
these format specifiers do, so if you’re like me, you can see all of
them explained at strftime.org).
Putting all the steps together, you can set the ticks and tick labels
on the sales plot using the following code:
In [16]: fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()

ax.plot(daily_sales_df)

xtick_values = pd.date_range('01 January 2020', '1 April 2020', freq='MS')
xtick_labels = [date.strftime('%a %d/%m') for date in xtick_values]

ax.set_xticks(xtick_values)
ax.set_xticklabels(xtick_labels);
Note that axis limits will change depending on the ticks you set
(i.e., if you set ticks all the way to December, matplotlib will widen
the x-axis range to show all of them).
You can use the same approach to format and position the y-axis
ticks. In this case, because we are plotting numerical values on the
y-axis (i.e., not dates), you need to use a list of numbers with the
set_yticks method. The code below extends the previous example
(i.e., you need to type it in the same cell as the code above):
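A sketch of the y-axis step is shown here as a self-contained cell. The tick range (0 to 2,000 in steps of 500) is an assumption chosen to suit the made-up stand-in data, not the real sales figures, so adapt it to your own values.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the daily sales data (made-up values, two channels)
dates = pd.date_range('2020-01-01', '2020-03-31', freq='D')
daily_sales_df = pd.DataFrame({
    'Bullseye': [1000 + 10 * i for i in range(len(dates))],
    'Walcart': [1500 - 5 * i for i in range(len(dates))],
}, index=dates)

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()
ax.plot(daily_sales_df)

# Tick positions: 0, 500, ..., 2000 (range excludes its end value)
ytick_values = range(0, 2500, 500)
# Tick labels: the same numbers with thousands separators
ytick_labels = [f'{value:,}' for value in ytick_values]

ax.set_yticks(ytick_values)
ax.set_yticklabels(ytick_labels)
```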
This code follows the same logic as with the x-axis ticks: you create
a list of numbers using the Python range function (which takes as
arguments a start value, an end value, and a step) and a list of labels
created using the tick values. You then set the y-axis ticks and their
labels with the set_yticks and set_yticklabels methods.
To set minor ticks (or their labels) on either axis, you can pass the
minor=True keyword argument to the same tick methods. Extending
the previous examples further, you can add minor ticks on the
y-axis of the sales plot using:
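As a minimal sketch, assuming major ticks every 500 and minor ticks every 100 (both spacings are illustrative, as is the placeholder data):

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()
ax.plot([0, 1, 2], [200, 1200, 1900])  # placeholder data

# Major ticks every 500, then minor ticks every 100 in between
ax.set_yticks(range(0, 2500, 500))
ax.set_yticks(range(0, 2500, 100), minor=True)
```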
There are many different ways to customize your plot ticks and their
associated labels (including color, rotation, padding, and others);
run ax.tick_params? in a separate code cell in your notebook to see them.
Finally, you can extend the axis ticks into your plot’s main area
by adding a grid. Plot grids are useful in guiding readers when
reading values on your plots. To add a grid, you can use the grid
Axes method:
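A sketch of such a call, using the options discussed in this section; the alpha value of 0.5 and the placeholder data are assumptions for illustration:

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()
ax.plot([0, 1, 2], [200, 1200, 1900])  # placeholder data

# Extend major ticks from both axes into a dotted, half-transparent grid
ax.grid(which='major', axis='both', linestyle='dotted', alpha=0.5)
```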
In this example, you specify which axis ticks to extend into a grid
(here, major ticks from both the x and the y axis; 'x' or 'y' are also
valid options for the axis keyword argument), and set the line
style to 'dotted' (also valid are 'solid' or 'dashed'). The alpha
keyword argument sets the transparency of the grid lines (with 0
being fully transparent and 1 being fully opaque) — I often make
grid lines transparent in my plots so they don’t distract readers
from the main visual elements.
The result after customizing ticks, their labels, and adding a plot
grid is shown below:
Legend
The sales plot right now doesn’t have a legend that explains which
line corresponds to which sales channel. To add one, you can use
the legend Axes method and pass a list of labels (i.e., string values)
that you want to associate with each line on the plot. Because
we used daily_sales_df to plot all the lines at once, you can
use daily_sales_df.columns as the legend labels. Extending the
previous examples, you can add a legend with:
ax.legend(daily_sales_df.columns);
By default, matplotlib will try to place the legend in the plot area
that has the most empty space (the 'best' location). You can change
this location by passing the loc keyword argument, and specifying one
of the predefined locations listed below as a value:
ax.legend(daily_sales_df.columns, loc='upper left');
The predefined legend locations are:
• 'best'
• 'upper right'
• 'upper left'
• 'lower left'
• 'lower right'
• 'right'
• 'center left'
• 'center right'
• 'lower center'
• 'upper center'
• 'center'
If you want to move the legend outside of the main plot area,
you can use the bbox_to_anchor keyword argument to specify a
point on the plot to use as an anchor for the legend, together with
the loc keyword argument. When using bbox_to_anchor, the loc
keyword argument specifies which corner of the legend to place at
the anchor point. For example, to move the legend outside of the
axes, the following code places the upper left corner of the legend
at the upper right corner of the axes:
ax.legend(
daily_sales_df.columns,
title='Channel', frameon=False,
loc='upper left', bbox_to_anchor=(1, 1)
);
Annotations
The last element we need to add to the sales plot is a vertical line
marking a sales event on the 11th of February, as well as some text
describing what this vertical line is. To add a vertical line to the plot,
you can use the ax.axvline method, and to add a text annotation
to the plot, right next to the line, you can use the ax.annotate
method:
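A sketch of these two calls follows. The stand-in data, the label text 'Sales event', the y-position of the label, and the 5-point offset are all assumptions for illustration; only the event date and the red color come from the text.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the daily sales data (made-up values, two channels)
dates = pd.date_range('2020-01-01', '2020-03-31', freq='D')
daily_sales_df = pd.DataFrame({
    'Bullseye': [1000 + 10 * i for i in range(len(dates))],
    'Walcart': [1500 - 5 * i for i in range(len(dates))],
}, index=dates)

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()
ax.plot(daily_sales_df)

# The sales event date is the x-coordinate of the vertical line
event_date = pd.Timestamp('2020-02-11')
ax.axvline(event_date, color='red')

# Place the label at the line, nudged 5 points to the right
ax.annotate('Sales event', xy=(event_date, 1800), xytext=(5, 0),
            textcoords='offset points', color='red')
```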
The ax.axvline method draws a vertical line that spans the plot’s
entire height at a certain x-coordinate (similarly, you can use
ax.axhline to draw a horizontal line). Because we are using dates
on the x-axis, you first create a pd.Timestamp variable for the
sales event date, and then use that variable as the x-coordinate for
the vertical line drawn with ax.axvline.
To add a text label at an arbitrary position on the plot, you can use
ax.annotate and specify what text you want to add to the plot, as
well as the x and y-coordinates for the label.
As with any other plot element, there are several keyword argu-
ments you can use with both ax.axvline and ax.annotate (in the
example above, you set the color of both elements to 'red') — as
always, run ax.axvline? or ax.annotate? in a separate cell to find
out more.
The entire code needed to create the sales plot shown at the
beginning of the chapter is listed below:
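As a consolidated sketch, the steps from this chapter can be combined into one cell. It is assembled from the listings above rather than copied from a single original listing, and it runs on made-up stand-in data with two channels; the y-axis tick range, the annotation text, and its position are assumptions you would adapt to the real sales figures.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for Q1DailySales.csv (made-up values, two channels only)
dates = pd.date_range('2020-01-01', '2020-03-31', freq='D')
daily_sales_df = pd.DataFrame({
    'Bullseye': [1000 + 10 * i for i in range(len(dates))],
    'Walcart': [1500 - 5 * i for i in range(len(dates))],
}, index=dates)

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()
ax.plot(daily_sales_df)

# Title and axis labels
ax.set_title('Daily Sales Q12020', loc='left', pad=10)
ax.set_xlabel('Date')
ax.set_ylabel('Revenue (USD)')

# x-axis ticks on the first day of each month
xtick_values = pd.date_range('01 January 2020', '1 April 2020', freq='MS')
xtick_labels = [date.strftime('%a %d/%m') for date in xtick_values]
ax.set_xticks(xtick_values)
ax.set_xticklabels(xtick_labels)

# y-axis ticks (assumed range; adapt to your data)
ytick_values = range(0, 2500, 500)
ax.set_yticks(ytick_values)
ax.set_yticklabels([f'{value:,}' for value in ytick_values])

# Grid and a legend placed outside the axes
ax.grid(which='major', axis='both', linestyle='dotted', alpha=0.5)
ax.legend(daily_sales_df.columns, title='Channel', frameon=False,
          loc='upper left', bbox_to_anchor=(1, 1))

# Vertical line and label marking the sales event
event_date = pd.Timestamp('2020-02-11')
ax.axvline(event_date, color='red')
ax.annotate('Sales event', xy=(event_date, 1800), xytext=(5, 0),
            textcoords='offset points', color='red')
```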
It may seem like a lot of code for a relatively simple plot. But after
you write this code once, you can reuse it later in any similar plot
you create — and of course, you don’t have to use all the plot
customization options we went through if you don’t need them.
Another option for saving your plots is to use the savefig method
on the fig variable (not the ax variable). For instance, you can
save the plot above by calling:
In [26]: fig.savefig('Q1DailySales.png')
The first argument to savefig is a file name (by default, the image
file is saved in the same folder as your Jupyter notebook, but you
can specify a full path instead if you want to). The savefig method
is handy when you need to create multiple plots in a for loop and
save them to your drive (without having to right-click and save
each one manually).
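A sketch of that loop, saving one plot per sales channel. The stand-in data is made up, and the temporary output folder is an assumption so the example does not write into your workspace; in practice you would point it at a folder of your choice.

```python
import tempfile
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the daily sales data (made-up values, two channels)
dates = pd.date_range('2020-01-01', '2020-03-31', freq='D')
daily_sales_df = pd.DataFrame({
    'Bullseye': [1000 + 10 * i for i in range(len(dates))],
    'Walcart': [1500 - 5 * i for i in range(len(dates))],
}, index=dates)

# Temporary folder used only for this sketch
output_dir = Path(tempfile.mkdtemp())

saved_files = []
for channel in daily_sales_df.columns:
    fig = plt.figure(figsize=(12, 6))
    ax = fig.add_subplot()
    ax.plot(daily_sales_df[channel])
    ax.set_title(channel, loc='left')

    file_path = output_dir / f'{channel}.png'
    fig.savefig(file_path)
    plt.close(fig)  # release the memory held by the figure
    saved_files.append(file_path)
```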
The sales plot example shows you how to add content to your
matplotlib plots, but what if you want to change their style: different
background color, larger fonts, more padding around labels? This
overthinking section walks you through some of matplotlib’s
styling options.
Figure 30.3 shows a few examples of plots using the default, ggplot
and seaborn styles (the style names are set as the y-label in each row
of plots in the figure); you can see more examples of plots using
different styles in matplotlib’s figure gallery.6 The differences
between plot styles can be subtle but typically involve different
values for many of the same plot options you set in the sales plot
code (e.g., tick frequency, line colors, grid style, etc.).
6: At matplotlib.org/gallery/style_sheets/style_sheets_reference.
If you want to use a different plot style, you should select it right
Figure 30.3: Several plot examples using the default, ggplot and seaborn matplotlib styles. Each row of plots uses a
different style. You can see more examples using different matplotlib styles in the figure gallery available at
matplotlib.org/gallery/style_sheets/style_sheets_reference.
Figure 30.4: Sales plot using the 'fivethirtyeight' matplotlib plot style.
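Selecting a style is a single call to plt.style.use, made before you create any figures. A minimal sketch, assuming the 'fivethirtyeight' style named in figure 30.4 and placeholder data:

```python
import matplotlib.pyplot as plt

# Names of the styles bundled with your matplotlib installation
print(plt.style.available)

# Select a style before creating any figures
plt.style.use('fivethirtyeight')

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()
ax.plot([1, 2, 3], [2, 4, 4])  # placeholder data
```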
Modifying styles
If you find yourself manually setting the same options every time
you create a plot (e.g., setting major tick length to 10), you may
want to update the plot style you’re using instead of calling the
same Axes methods every time you create a new plot.
You can list all available plot parameters for the current style, and
their values, by running the code below (the output will be long,
as there are over 300 different plot parameters you can customize;
I shortened the output here to make it fit).8
8: You can read about each parameter (and what values you can assign to it) in matplotlib’s documentation at
matplotlib.org/tutorials/introductory/customizing.
In [29]: plt.rcParams
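Changing a parameter is a dictionary-style assignment on plt.rcParams. A minimal sketch, assuming you want the major tick length of 10 mentioned above and a larger title size (both values are illustrative):

```python
import matplotlib.pyplot as plt

# These settings apply to every plot created after this point
plt.rcParams['xtick.major.size'] = 10
plt.rcParams['ytick.major.size'] = 10
plt.rcParams['axes.titlesize'] = 'large'
```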
It can also be tricky to figure out the values you can set for
different parameters. However, if you set an invalid value for a plot
parameter, you’ll get an error, and most error messages include a
list of valid options for the parameter you tried to set. For instance,
if you try to set 'axes.titlesize' to an invalid option:
In [33]: plt.rcParams['axes.titlesize'] = 'not too small, not too big'
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Key axes.titlesize: not too small, not too big is not a valid font size. Valid
font sizes are xx-small, x-small, small, medium, large, x-large, xx-large, smaller, larger.
You can use the same approach to set any of the styling parameters
listed when running plt.rcParams,9 and you can set as many as
you need to. Remember that styling options set using this approach
will be applied to all plots in your notebook (so it’s a good idea to
configure styling options right after importing matplotlib if you
want to make your plots share a common look).
9: The rcParams name comes from run configuration parameters. The rc letters appearing at the beginning or
the end of a file name or variable name typically mean that the file or variable stores some configuration options.
If you need to reset all styling options to their default values, you
can either run the method below, or remove all styling code from
your notebook, and then restart your notebook kernel.
In [34]: plt.rcdefaults()
Colors
When you don’t specify a value for the color keyword argument,
matplotlib cycles through a predefined list of colors and uses a
different color for each new plot element.
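You can replace that predefined list through the 'axes.prop_cycle' parameter. A sketch, assuming the three colors discussed in this section and using the cycler package that is installed alongside matplotlib:

```python
import matplotlib.pyplot as plt
from cycler import cycler

# Replace the default color cycle with three colors
plt.rcParams['axes.prop_cycle'] = cycler(color=['red', 'blue', 'gray'])

fig = plt.figure()
ax = fig.add_subplot()

# Four lines without explicit colors: the fourth wraps around to 'red'
lines = [ax.plot([0, 1], [i, i])[0] for i in range(4)]
```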
After you run this code, matplotlib will cycle through the list of
'red', 'blue', and 'gray' whenever it adds a new element to a
plot (i.e., whenever you call an Axes method to add a new line, bar,
or point to a plot, without explicitly specifying a color). When it
runs out of colors, it starts over from the beginning.
Fonts
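Font options are plot parameters too, so they can be set through plt.rcParams. A sketch of one way to do it, assuming the Palatino font is installed on your computer (if it is not, matplotlib falls back to another font when drawing):

```python
import matplotlib.pyplot as plt

# Use a serif font family, with Palatino as the preferred serif font
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = ['Palatino']

# Base font size in points, plus weight and style
plt.rcParams['font.size'] = 18
plt.rcParams['font.weight'] = 'normal'
plt.rcParams['font.style'] = 'normal'
```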
If you run this code, you set all plots in your notebook to use a serif
font called Palatino, with a base size of 18 points, normal weight,
and style. All text elements in your plots are scaled using the base
font size set here. You can list all font-related plot parameters
with:
In [37]: [(key, value) for key, value in plt.rcParams.items() if 'font' in key]
To get a list of fonts available on your computer, you can use the
following code (which outputs a long list):
In [38]: import matplotlib.font_manager
matplotlib.font_manager.fontManager.ttflist
Out [38]: [<Font 'DejaVu Sans' (DejaVuSans-Bold.ttf) normal normal 700 normal>,
<Font 'cmmi10' (cmmi10.ttf) normal normal 400 normal>,
<Font 'STIXNonUnicode' (STIXNonUniBolIta.ttf) italic normal 700 normal>,
...
<Font 'Arial Black' (Arial Black.ttf) normal normal 900 normal>,
<Font 'STIXSizeTwoSym' (STIXSizTwoSymBol.otf) normal normal 700 normal>,
<Font '.SF NS Display Condensed' (SFNSDisplayCondensed.otf) normal normal 400 condensed>]
Summary
There are many functions and arguments you can use with
matplotlib, and they can often be frustrating to figure out. However,
just knowing what matplotlib’s Figure and Axes objects are and
how to use a few of their methods is enough to get you started with
Python-based data visualization. The following project chapter
shows you how to use matplotlib’s main features to turn a cash-
flow statement into a waterfall plot.
31 Project: Making a waterfall plot from a cash flow statement
In the previous chapter, you saw that matplotlib is a data visual-
ization toolkit more than a gallery of plot templates you can re-use.
However, you can easily use it to build up any plot from scratch.
This chapter shows you how to use matplotlib and pandas to turn
a cash flow statement into a waterfall plot. Waterfall plots1 are
useful for showing how an initial value is affected by a series of
interventions.
1: Waterfall plots (also known as cascade charts) are mainstream visualization tools in accounting.
The data you’ll be using is the “Cash Flow Statement.xlsx” file
in the “P3 - Visualizing data” folder of your Python
for Accounting workspace. When you open this file with Excel, you
should see something similar to figure 31.1 below:
You’ll use pandas to read, extract, and reshape these data and
matplotlib to turn them into the waterfall plot below:
Figure 31.2: Waterfall plot made with matplotlib using data from “Cash Flow Statement.xlsx”.
[Plot: monthly bars from Apr to Mar labeled with ending cash balances; y-axis from -75,000 to 200,000; legend entries: Monthly Cash Inflow, Monthly Cash Outflow.]
This project has three parts: loading and reshaping the data,
creating the main plot elements, and styling the plot by adding tick
labels, legend entries, a title, and a few other visual elements.
Now let’s load the cash flow data into a DataFrame and reshape it
into something matplotlib can work with.
df
For the waterfall plot, you’ll need the beginning and ending cash
balances for each month, as well as the monthly in and out cash
flows. You can easily spot these values in Excel, but figuring out
where they are in the DataFrame above needs some trial-and-error.
For instance, month labels are in the third row of df (among many
NaN values):
In [3]: df.loc[2]
pythonforaccounting.com/chapter31
Once you figure out which rows you need to keep, you can discard
all the other ones by reassigning your selection to df:
In [5]: df = df.loc[[2, 3, 11, 43, 44]]
df
[5 rows x 27 columns]
Even with fewer rows, it’s still hard to tell what these rows are:
there are lots of missing values, missing headers or headers mixed
with values. You need to apply several reshaping and cleaning
steps to make these data useful:
In [6]: df = df.transpose().dropna().reset_index(drop=True)
df
Out [6]: 2 3 11 43 44
0 Apr 0 79252 -55770 23482
1 May 23482 75510 -59782 39210
2 Jun 39210 21143 -61808 -1455
3 Jul -1455 52418 -67978 -17015
4 Aug -17015 56142 -49909 -10782
5 Sep -10782 11863 -63000 -61919
6 Oct -61919 77979 -52576 -36516
7 Nov -36516 80284 -68756 -24988
8 Dec -24988 121274 -54361 41925
9 Jan 41925 60215 -71973 30167
10 Feb 30167 128583 -47855 110895
11 Mar 110895 76412 -50366 136941
The code above rotates the table from wide format to tall format
(i.e., a tall table has more rows than columns); it then discards all
rows that contain (any) missing values and resets df’s row
labels to be consecutive numbers. The table is easier to understand
now, but adding more descriptive column names would be even
better; you can do that by running:
In [7]: df.columns = [
'Month', 'Beginning Cash Balance', 'Total Cash Inflow',
'Total Cash Outflow', 'Ending Cash Balance'
]
df
Out [7]: Month Beginning Cash Balance Total Cash Inflow Total Cash Outflow Ending Cash Balance
0 Apr 0 79252 -55770 23482
1 May 23482 75510 -59782 39210
2 Jun 39210 21143 -61808 -1455
3 Jul -1455 52418 -67978 -17015
4 Aug -17015 56142 -49909 -10782
5 Sep -10782 11863 -63000 -61919
6 Oct -61919 77979 -52576 -36516
7 Nov -36516 80284 -68756 -24988
8 Dec -24988 121274 -54361 41925
9 Jan 41925 60215 -71973 30167
10 Feb 30167 128583 -47855 110895
11 Mar 110895 76412 -50366 136941
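The reshaping chain used earlier (transpose, dropna, reset_index) can be tried on a toy table to see what each step contributes. The sketch below uses two months of values from the table above; the layout of the wide table, including the empty spacer column, is an assumption about how such Excel sheets typically look.

```python
import numpy as np
import pandas as pd

# A toy wide table: one row per item, one column per month, plus
# an empty spacer column in the middle
wide_df = pd.DataFrame(
    [['Apr', np.nan, 'May'],
     [79252, np.nan, 75510],
     [-55770, np.nan, -59782]]
)

# Rotate to tall format, drop the all-missing row, renumber the rows
tall_df = wide_df.transpose().dropna().reset_index(drop=True)
tall_df.columns = ['Month', 'Total Cash Inflow', 'Total Cash Outflow']
print(tall_df)
```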
Tall tables2 are usually easier to manipulate and visualize in
Python’s data universe. If your Excel data has more columns
than rows, consider reshaping it as you did here.
2: In statistical data analysis, tidying data is a well-defined process for making data ready for analysis. One
of the steps in this process is reformatting wide tables into tall ones. Tidy datasets are often easier to
manipulate and visualize, regardless of the tools you use to analyze them. You can read more about data
tidying at vita.had.co.nz/papers/tidy-data.pdf.
There’s one last data preparation step you need to do. All cash
outflows are negative values, but it will be easier to plot these
values later if they’re positive. You can multiply the 'Total Cash
Outflow' column by −1 to turn them into positive values:
In [8]: df['Total Cash Outflow'] = df['Total Cash Outflow'] * (-1)
To check that nothing went wrong in the cleanup, you can verify
that the ending cash balance for each month is equal to the starting
balance plus inflow minus outflow:
In [9]: df['Ending Cash Balance'] == (
df['Beginning Cash Balance']
+ df['Total Cash Inflow']
- df['Total Cash Outflow']
)
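The comparison returns one True or False value per month. If you prefer a single answer, you can reduce it with the Series all method; a sketch using the first three months from the table above (with outflows already made positive):

```python
import pandas as pd

# First three months from the cleaned table, outflows as positive values
df = pd.DataFrame({
    'Month': ['Apr', 'May', 'Jun'],
    'Beginning Cash Balance': [0, 23482, 39210],
    'Total Cash Inflow': [79252, 75510, 21143],
    'Total Cash Outflow': [55770, 59782, 61808],
    'Ending Cash Balance': [23482, 39210, -1455],
})

balances_match = df['Ending Cash Balance'] == (
    df['Beginning Cash Balance']
    + df['Total Cash Inflow']
    - df['Total Cash Outflow']
)

# True only if every month reconciles
print(balances_match.all())
```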
It looks like you have a cash flow dataset that is clean and ready
for plotting with matplotlib.
You saw in the previous chapter that the first step in making a
matplotlib plot is to set up the Figure and Axes objects:
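A sketch of that setup, using the same 12 by 6 inch size as the one-line version shown a little later in this chapter:

```python
import matplotlib.pyplot as plt

# An extra-wide Figure with a single Axes
fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot()
```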
I made the figure extra wide (i.e., by passing the figsize keyword
argument to figure) so all the waterfall bars fit, and the plot doesn’t
look too crowded. The first value in the tuple passed to figsize is
the plot width, followed by its height.
In [11]: # you get the same Figure and Axes variables as before
# with the code below, but in one line of code
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.bar(df.index,
df['Total Cash Inflow'],
color='green')
You now have some green bars representing monthly cash inflows.
Because df has 12 rows of values, the code above draws 12 green
bars. However, all the bars in the plot above start at 0. To turn
this into a waterfall plot, you need to place the bottom of each
green bar at the beginning cash balance for its month. You can do
that by passing a sequence of “bottom” y-coordinates (in this case,
the beginning cash balance column) with the bottom keyword
argument:
In [12]: fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.bar(df.index,
df['Total Cash Inflow'],
bottom=df['Beginning Cash Balance'],
color='green')
The plot is starting to look like the waterfall we wanted. The next
step is adding the red bars — you can use the same bar method as
with the green bars (but the bottom of red bars needs to be at the
monthly ending cash balance, not the starting cash balance):
In [13]: fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.bar(df.index,
df['Total Cash Inflow'],
bottom=df['Beginning Cash Balance'],
color='green')
ax.bar(df.index,
df['Total Cash Outflow'],
bottom=df['Ending Cash Balance'],
color='red')
The code above adds red bars, but they overlap with the green bars
because both sets of bars use the same list of x-coordinates (i.e.,
df’s row labels). You need to pass a different set of x-coordinate
values to either the green bars or the red ones if you want them
not to overlap. Luckily, you can modify the x-coordinates of the
green bars by subtracting a constant value from the df’s index:
In [14]: fig, ax = plt.subplots(1, 1, figsize=(12, 6))

bar_width = 0.4

ax.bar(df.index - bar_width,
df['Total Cash Inflow'],
bottom=df['Beginning Cash Balance'],
color='green',
width=bar_width,
label='Monthly Cash Inflow')

ax.bar(df.index,
df['Total Cash Outflow'],
bottom=df['Ending Cash Balance'],
color='red',
width=bar_width,
label='Monthly Cash Outflow')
The code above sets an explicit bar_width (by passing the width
keyword argument to bar) and shifts all green bars to the left by one
bar_width. Once you understand what matplotlib’s functions and
methods do, using them is simple geometry. You can place different
elements on your plots by specifying their x and y-coordinates
relative to the plot origin (which is the bottom left corner of the
plot with coordinates (0, 0)).
The figure above looks like the waterfall plot we wanted, but it’s
only halfway there. Let’s make it easier to understand by adding a title,
a legend, and month names as x-axis tick labels next.
Crafting the details on your plots is where all the matplotlib fun
is. The waterfall bars we have now are in the right place, but the
plot is crowded (especially on its vertical axis) — let’s enlarge the
y-axis limits to add some space above and below the bars. You can
do that by adding the following code to the previous code cell:
In [15]: ax.set_ylim([-75000, 220000])
Another confusing part of the plot is its current x-axis tick labels.
Instead of numbers, you can place the values in df’s 'Month'
column as the x-axis tick labels with:
In [16]: xticks = df.index
xticklabels = df['Month']
ax.set_xticks(xticks)
ax.set_xticklabels(xticklabels);
Similarly, you can increase the frequency and improve the format-
ting of y-axis tick labels with:
In [17]: yticks = range(-75000, 225000, 25000)
yticklabels = [f'{ytick:,}' for ytick in yticks]
ax.set_yticks(yticks)
ax.set_yticklabels(yticklabels)
And you can also extend the y-axis ticks into the main plot area by
adding a grid:
In [18]: ax.grid(axis='y', alpha=0.5)
ax.set_axisbelow(True)
The set_axisbelow method makes sure the y-axis grid lines are
displayed beneath the bars and not on top. Finally, you can add a
title and a legend to the plot using:
In [19]: ax.set_title('Year one cashflow', loc='left');
ax.legend(loc='lower right');
The waterfall plot is almost there, but it’s still missing text annota-
tions above the bars listing the ending cash balance for each month.
For these text annotations, you’ll have to use a for loop — the
code is longer but no less explicit. If you add the code below to the
same plotting cell as before, you’ll get the extra annotations:
The for loop goes through each row in df and, for each row, calls
the ax.annotate method. Different xy coordinates for the text label
are computed and set as the xy keyword argument at each iteration
of the loop.
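The loop described above can be sketched on a small scale as follows. This is an illustration under assumptions rather than the chapter’s exact cell: demo_df holds made-up balances, the vertical offset of 20 is arbitrary, and the Agg backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, for illustration only
demo_df = pd.DataFrame({
    "Beginning Cash Balance": [1000, 1200],
    "Total Cash Inflow": [500, 300],
    "Ending Cash Balance": [1200, 900],
})

bar_width = 0.4
fig, ax = plt.subplots()

label_positions = []
for index, row in demo_df.iterrows():
    # Center the label over the gap between the inflow and outflow bars,
    # slightly above the top of the inflow bar
    x = index - bar_width / 2
    y = row["Beginning Cash Balance"] + row["Total Cash Inflow"] + 20

    ax.annotate(f"{int(row['Ending Cash Balance']):,}",
                xy=(x, y),
                horizontalalignment="center")
    label_positions.append((x, y))
```

Each call to ax.annotate adds one text label to the Axes, so after the loop there is one label per row in the DataFrame.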
The entire code needed to create the waterfall plot shown at the
beginning of this chapter is listed below:
In [21]: fig, ax = plt.subplots(1, 1, figsize=(12, 5.5))

         # Adding the main plot elements
         bar_width = 0.4
         ax.bar(df.index - bar_width,
                df['Total Cash Inflow'],
                bottom=df['Beginning Cash Balance'],
                color='green',
                width=bar_width,
                label='Monthly Cash Inflow')

         ax.bar(df.index,
                df['Total Cash Outflow'],
                bottom=df['Ending Cash Balance'],
                color='red',
                width=bar_width,
                label='Monthly Cash Outflow')

         # Styling the plot
         ax.set_ylim([-75000, 220000])

         xticks = df.index
         xticklabels = df['Month'].unique()

         ax.set_xticks(xticks)
         ax.set_xticklabels(xticklabels)

         yticks = range(-75000, 225000, 25000)
         yticklabels = [f'{ytick:,}' for ytick in yticks]

         ax.set_yticks(yticks)
         ax.set_yticklabels(yticklabels)

         ax.grid(axis='y', alpha=0.5)
         ax.set_axisbelow(True)

         ax.set_title('Year one cashflow', loc='left')
         ax.legend(loc='lower right')

         # Adding text annotation and lines
         for index, row in df.iterrows():

             beginning_balance = row['Beginning Cash Balance']
             ending_balance = row['Ending Cash Balance']
             inflow = row['Total Cash Inflow']

             ax.annotate(f"{ending_balance:,}",
                         xy=(index - bar_width / 2, beginning_balance + inflow + 2000),
                         horizontalalignment='center')

             if index <= 10:
                 ax.hlines(ending_balance,
                           index - bar_width / 2,
                           index + 2 * bar_width,
                           color='black', linewidth=1, linestyle='dashed')
Figure 31.3: Waterfall plot made with matplotlib using “Cash Flow Statement.xlsx” data. The x-axis lists the months from Apr to Mar, the y-axis runs from -75,000 to 200,000, and the legend entries are Monthly Cash Inflow and Monthly Cash Outflow.
That’s a lot of code for a relatively simple plot. If you need a quick
waterfall chart, you’re likely better off with Excel. However, if you
need to make a waterfall plot every other month using a different
cash flow statement, you can re-use the code above with minimum
effort. The real value of using Python and matplotlib is that you
can easily scale and customize your plot making.
Summary
This chapter showed you how to turn a cash flow statement into a
waterfall plot using Python, pandas, and matplotlib. As you saw
above, working with matplotlib can sometimes require a lot of
coding. However, there are faster ways to create plots with Python,
and the following chapter introduces some of them.
Other plotting libraries 32
In the previous chapter, you created a matplotlib plot from scratch,
setting up Figure and Axes objects, adding plot elements one-by-
one, adjusting labels, ticks, legend entries, and everything in-
between. The flexibility you get with matplotlib is a benefit if you
need to customize your plots down to the smallest details, but
it can also get annoying if you just need to visualize your data
quickly. This chapter shows you how to use the pandas, seaborn,
and hvplot libraries to turn data into plots with far less code than
you need with matplotlib.
You’ll be using the same Q1 daily sales data you used in the
matplotlib chapter; you can load it in your chapter notebook by
running the following code:
In [1]: import pandas as pd

        daily_sales_df = pd.read_csv('Q1DailySales.csv', parse_dates=['Date'])
        daily_sales_df = daily_sales_df.set_index('Date')

        daily_sales_df
In addition to its data slicing tools, pandas lets you quickly turn
Series or DataFrame objects into plots by calling their plot method.
When you call the plot method, pandas actually creates a
matplotlib plot for you: it deals with a lot of the boilerplate (e.g.,
creates Figure and Axes objects) and makes sensible styling choices
for you so that you can visualize data with less code.
To turn the daily sales data into a plot, instead of setting up Figure
and Axes objects, all you need to run is:
In [2]: daily_sales_df.plot()
pythonforaccounting.com/chapter32
Table 32.1: Optional keyword arguments available when calling plot on a Series or DataFrame object.
Read more about plotting with pandas in the official documentation available at
pandas.pydata.org/pandas-docs/stable/user_guide/visualization.

Argument: kind
Example: daily_sales_df.plot(kind='bar')
Description: Sets the kind of plot to create. Valid options are 'line', 'bar', 'barh', 'hist', 'box',
'area', 'density', 'pie', 'scatter' and 'hexbin'. By default, a 'line' plot is created. We will cover
different kinds of plots in the following section.

Argument: figsize
Example: daily_sales_df.plot(figsize=(12, 6))
Description: Sets the figure width and height, in inches (same as matplotlib’s plt.figure).

Argument: grid
Example: daily_sales_df.plot(grid=True)
Description: Enables the plot grid. By default, the plot grid is not visible.

Argument: legend
Example: daily_sales_df.plot(legend=False)
Description: Disables the plot legend. By default, the legend is visible. Another value you can
assign to legend is 'reverse', which reverses the legend item order.

Argument: xticks, yticks
Example:
    daily_sales_df.plot(
        xticks=pd.date_range(
            '31 January 2020',
            '01 March 2020',
            freq='W-MON'
        ),
        yticks=[
            0, 20000, 40000,
            60000, 80000, 100000
        ]
    )
Description: Sets the values to use for the x-axis or y-axis ticks. Both arguments require a
sequence of values (e.g., a Python list).

Argument: xlim, ylim
Example:
    daily_sales_df.plot(
        xlim=[
            pd.Timestamp('31 Jan 2020'),
            pd.Timestamp('15 Feb 2020')
        ],
        ylim=[0, 40000]
    )
Description: Sets the lower and upper plot limits on either axis. If there is data beyond these
limits, it will not be shown on the plot. Both arguments require a sequence of two values (e.g., a
Python list or tuple containing two values, representing the lower and upper bounds of the axes).

Argument: rot
Example: daily_sales_df.plot(rot=30)
Description: Sets tick label rotation in degrees (x-ticks for vertical, y-ticks for horizontal plots).
Suppose your data isn’t suited for a line plot. In that case, you can tell
pandas to draw a different kind of plot by passing the kind keyword
argument to plot — valid options for the kind keyword argument
are: 'line', 'bar', 'barh', 'hist', 'box', 'area', 'density', 'pie',
'scatter' and 'hexbin'. Most of these plot types have equivalents
in Excel, so they should be familiar, even though they might not
share the same name. Table 32.2 below shows an example of each
created using the daily sales data (for some of the examples, I
summed daily_sales_df column-wise before creating the plot).
Table 32.2: Different plot types you can create using the pandas plot method. Read more about plotting with pandas in
the official documentation available at pandas.pydata.org/pandas-docs/stable/user_guide/visualization.

Plot type: 'line'
Example: daily_sales_df.plot(kind='line')

Plot type: 'area'
Example:
    daily_sales_df.plot(
        kind='area',
        legend='reverse'
    )

Plot type: 'pie'
Example:
    daily_sales_df.sum().plot(
        kind='pie',
        explode=[0, 0, 0.1, 0, 0]
    )

Plot type: 'bar'
Example:
    daily_sales_df.sum().plot(
        kind='bar',
        rot=20
    )

Plot type: 'barh'
Example: daily_sales_df.sum().plot(kind='barh')

Plot type: 'hist'
Example:
    daily_sales_df.plot(
        kind='hist',
        bins=20
    )

Plot type: 'density'
Example: daily_sales_df.plot(kind='density')

Plot type: 'box'
Example:
    daily_sales_df.plot(
        kind='box',
        rot=30
    )

Plot type: 'scatter'
Example:
    daily_sales_df.plot(
        kind='scatter',
        x='iBay.com',
        y='Understock.com'
    )

Plot type: 'hexbin'
Example:
    daily_sales_df.plot(
        kind='hexbin',
        x='iBay.com',
        y='Understock.com',
        gridsize=10,
        sharex=False
    )
You may have noticed that, depending on the kind of plot you
Similarly, you can use plot as an accessor for all the different kinds
of plots mentioned above (e.g., daily_sales_df.sum().plot.pie()
to draw a pie plot from the sales data). Whichever notation you end
up using, you will likely come across both styles in other examples
Figure 32.1: A few plots from seaborn’s example gallery (from top to bottom, these are called stacked histogram, line plot,
hexbin plot with marginal distributions and ridge plot). You can view these plots (and the code for making them) and many more
at seaborn.pydata.org/examples/index.html#example-gallery. The data used for these plots is either randomly generated,
or sourced from publicly available datasets.
You can customize any plot you create with pandas using the
same matplotlib methods we covered earlier. Together, pandas
and matplotlib likely cover most simple data visualization needs
— but if you want to make a complex plot with matplotlib, you’ll
need to roll up your sleeves. Or you can use seaborn instead.
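As a small illustration of customizing a pandas plot with matplotlib methods, the sketch below calls plot on a DataFrame, keeps the Axes object that pandas returns, and then styles it with the same Axes methods used in the previous chapter. The demo_sales_df values are made up for this example, and the Agg backend is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import pandas as pd

# Hypothetical daily sales, for illustration only
demo_sales_df = pd.DataFrame(
    {"iBay.com": [100, 150, 120], "Shoppe.com": [80, 90, 140]},
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)

# The plot method returns a matplotlib Axes object
ax = demo_sales_df.plot(figsize=(8, 4))

# From here on, these are regular matplotlib Axes methods
ax.set_title("Daily sales (made-up data)", loc="left")
ax.set_ylabel("Sales")
ax.grid(axis="y", alpha=0.5)
```

Keeping the returned Axes in a variable is what connects the two libraries: pandas draws the lines, and matplotlib handles the finishing touches.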
you can still use the same Axes methods to customize your seaborn
plots further, if you need to.
Because we have a 3-by-3 grid of plots, grid.axes is a list² of lists,
each list representing one row of plots on the grid and containing
three matplotlib Axes objects.

2: It is actually an array, but you can use it like a Python list.

You can use any of the matplotlib methods we explored in the
previous chapter with these Axes. For example, if you want to add
titles to each of the plots on the top row of the grid, you can use
the Axes set_title method:
In [8]: grid = sns.pairplot(daily_sales_df[['iBay.com', 'Understock.com', 'Shoppe.com']])

        first_row = grid.axes[0]
        first_row[0].set_title('First')
        first_row[1].set_title('Second')
        first_row[2].set_title('Third');
Here, you use add_subplot twice to add two Axes to the same
figure, both of which are assigned to separate variables.⁴

4: You can use any names for the Axes variables; ax_left and
ax_right fit this example.

The arguments passed to add_subplot above tell matplotlib to place
The subplots function returns a Figure object and a sequence of
Axes objects (that is shaped according to the plot layout you want
to create). In this example, you create one row and two columns of
Axes (the first two arguments passed to plt.subplots) and unpack⁵
them into two variables called ax_left and ax_right.

5: Unpacking allows you to expand a Python sequence into separately
named variables. Because plt.subplots returns a sequence with two
Axes objects, you can unpack them as separate variables as I did
here.

However, when you want to create more complicated layouts, it’s
easier to store the sequence of Axes as a separate variable and
access individual Axes using indices. For instance, to create a
3-by-3 plot grid, you can use the following code:
In [11]: fig, grid = plt.subplots(3, 3, figsize=(12, 6))
The grid variable above is similar to our earlier seaborn plot grid:
it is a list of lists,⁶ each list representing one row of plots on
the grid and containing three matplotlib Axes objects.

6: It is a numpy array, but you can use it like a Python list.

You can see this if you inspect grid in a separate cell:
In [12]: grid
ax_top_left = grid[0][0]
ax_mid_center = grid[1][1]
ax_bottom_right = grid[2][2]
In the code example above, when indexing grid, the first index
value is used to determine a row in the grid of plots, and the
second value is used to determine a column. Together, they access
an Axes object, on which you can use any of the (now familiar)
matplotlib plotting methods — in this case, the plot, scatter and
bar methods.
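To see both ways of working with the Axes returned by plt.subplots side by side, here is a minimal sketch. The plotted numbers are arbitrary, the variable names fig_pair and fig_grid are introduced only for this example, and the Agg backend is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# One row, two columns: unpack the two Axes into separate variables
fig_pair, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(12, 4))

# Three rows, three columns: keep the whole grid and index into it
fig_grid, grid = plt.subplots(3, 3, figsize=(12, 6))

ax_top_left = grid[0][0]        # first row, first column
ax_mid_center = grid[1][1]      # second row, second column
ax_bottom_right = grid[2][2]    # third row, third column

# Because grid is a numpy array, grid[1, 1] refers to the same Axes
same_axes = grid[1, 1]

# Arbitrary numbers, for illustration only
ax_top_left.plot([1, 2, 3], [2, 4, 8])
ax_mid_center.scatter([1, 2, 3], [3, 1, 2])
ax_bottom_right.bar(["A", "B"], [3, 5])
```

Indexing is usually the more convenient option for larger grids because you can loop over rows and columns instead of naming every Axes.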
One area where matplotlib is limited⁸ is creating interactive plots
that you can zoom, pan or hover over to change the display of
information. Let’s quickly take a look at a Python visualization
library that lets you do all of those things.

8: You can use matplotlib to create interactive plots, but I think
the alternative mentioned below is easier to use and more powerful.
You can read about interactive matplotlib plots at
matplotlib.org/users/interactive.
Like with matplotlib and its extension libraries (i.e., pandas and
seaborn), there are several other core libraries for creating
interactive plots in Python, each with its own extension libraries that
try to simplify the process of creating complex plots. One of these
core libraries is Bokeh⁹ with pandas-bokeh, holoviews or hvplot as
some of its extension libraries. You’ll use hvplot in this section,
but you can head over to pyviz.org if you want to read more about
the other visualization libraries.

9: More on Bokeh at bokeh.org.
In [16]: daily_sales_df.hvplot(
             kind='line',
             width=800,
             height=400,
             title='Q1 total daily sales'
         )
The keyword arguments available with hvplot are not the same
ones available with pandas’s plot method, but they are similar. The
hvplot method can be used as an accessor as well, just like pandas’s
plot method. If you want to learn what keyword arguments you
can use with it, run daily_sales_df.hvplot.line? in a separate
code cell.¹³

13: Alternatively, you can read more about hvplot options at
hvplot.holoviz.org/user_guide.

There are several other types of plots available with hvplot: type
daily_sales_df.hvplot. in a separate code cell and press TAB to
get a list of available plotting methods or visit the hvplot gallery
at hvplot.holoviz.org/reference for more examples.
Summary
This chapter showed you how to use pandas and seaborn to create
plots from DataFrame objects. Both pandas and seaborn extend
matplotlib to make plotting code shorter and easier to understand.
Because they’re extensions of matplotlib, you can still use
matplotlib functions to customize pandas or seaborn plots.
The last part of the book starts next: in the chapters ahead, you’ll
put together all the Python tools you’ve been reading about to
complete a typical management accounting analysis project.
Sales analysis project part four
You’ll explore the data files more in the following chapters. When
you’re ready, let’s set up the project and define its goals.
Setting up your project 33
When starting a new project, it can be tempting to jump into writing
code and wrangling data straight away. However, having a project
structure and clearly defined goals for your analysis (that are well
documented) will help you stay on track when writing code.
This short chapter looks at how you can keep your project notebooks
and data files organized and how you can use Markdown cells in
Jupyter notebooks to document your analysis goals.
All data analysis projects spread across multiple files (e.g., they
source data from several files, use multiple notebooks, generate
several presentations or reports, etc.). Keeping files organized
needs practice, regardless of the tools you use to handle them. In
this section, you’ll find a few tips for keeping your data and code
files organized when working with Python code.
Data files
It’s often a good idea to keep your data files in a separate folder in
your project workspace. This keeps your top-level project folder
from getting cluttered with too many files and makes it easier for
you to recognize which file does what (or contains what data).
Right now, the “P4 - Sales analysis” folder contains several Excel and
CSV files. To move these files into a separate folder, create a new
folder by clicking the New Folder button in JupyterLab (which
you’ll find right above the file navigator) and name it data. Now
you can move all the data files into the data folder, either with
drag-and-drop or cut-and-paste, directly in JupyterLab.
At times you’ll need to write new data files that can be used in
various parts of your analysis. For instance, after cleaning and
reshaping a dataset, you can save the clean dataset to a new file
in the data folder so you can re-use it in other project notebooks
without repeating all the cleaning steps. You can organize your
data files further, by creating sub-folders in the data folder to
distinguish between the data files you started with (i.e., files from
external sources) and data files you created during the analysis.
However, for this project, let’s keep all data files in the data folder.
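If you prefer to do this tidying up with code rather than drag-and-drop, a sketch along the following lines could work. It is only an illustration: it runs in a temporary folder with empty stand-in files (the file names mirror this project’s, but nothing here touches your real workspace), and it assumes that every .xlsx or .csv file in the project folder is a data file.

```python
import shutil
import tempfile
from pathlib import Path

# Temporary stand-in for the project folder, with empty placeholder files
project_folder = Path(tempfile.mkdtemp())
for name in ["Q1Sales.xlsx", "products.csv", "01 - Setup.ipynb"]:
    (project_folder / name).touch()

# Create the data folder (no error if it already exists)
data_folder = project_folder / "data"
data_folder.mkdir(exist_ok=True)

# Move Excel and CSV files into the data folder; leave everything else alone
for path in list(project_folder.iterdir()):
    if path.is_file() and path.suffix.lower() in {".xlsx", ".csv"}:
        shutil.move(str(path), str(data_folder / path.name))

moved = sorted(p.name for p in data_folder.iterdir())
remaining = sorted(p.name for p in project_folder.iterdir())
```

After running it, moved holds the names of the data files, and remaining holds what is left at the top level (the notebook and the data folder itself).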
Jupyter notebooks
While you can develop your entire project in a single Jupyter note-
book, it’s common to break analysis code into several notebooks,
each one performing a specific part of the analysis. For example,
you can have a notebook for data cleaning and preparation, one
for the actual analysis, another for making plots, etc.
You can keep Jupyter notebooks in your main project folder (i.e.,
not in a separate folder) and you can use a simple naming trick to
keep them organized. In your top-level project folder, create a new
notebook¹ and name it “01 - Setup”. You’ll use this notebook to
document the business goals for the sales analysis in the following
section. The 01 at the beginning of the file name helps identify
this notebook as the project’s first notebook. You’ll increment this
indicator for the next notebooks you create (e.g., the second project
notebook will be called “02 - Data preparation”). This simple naming
trick not only keeps notebooks nicely listed in your file explorer,
but it also indicates the order in which notebooks need to run to
reproduce your project results.

1: Either by going to File -> New -> Notebook or through the New
Launcher button.
Common code
A common code file is just a text file with a .py extension (i.e., a
Python file). To create a common code file, go to File -> New ->
Text File and rename it to common_code.py (you can give it any
name you want to, but the file extension has to be .py). Open the
file, add the following line of Python code and save it:
pythonforaccounting.com/chapter33
data_folder = 'data'
The benefit of having the data folder path defined in the
common_code.py file (rather than in every notebook that uses it) is that
if you ever want to move your data files to a different location, you
can update this variable and your notebooks will work just the
same, without having to modify their code. Later in this chapter,
you’ll also add a custom function to this file, which you can then
similarly import and re-use in any of your project notebooks.
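The sketch below shows the mechanics of this on a small scale: it writes a one-line common_code.py file and then imports data_folder from it. The temporary folder and the sys.path line are assumptions needed only to make the example self-contained; in your project, the notebooks and common_code.py sit in the same folder, so the import line on its own should be enough.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Temporary stand-in for the project folder
project_folder = Path(tempfile.mkdtemp())

# The common code file is plain text with a .py extension
(project_folder / "common_code.py").write_text("data_folder = 'data'\n")

# Make the folder importable (not needed when the notebook lives next to the file)
sys.path.insert(0, str(project_folder))
importlib.invalidate_caches()

# This is the line you would use in a project notebook
from common_code import data_folder
```

If you later move your data files, only the string inside common_code.py has to change; every notebook that imports data_folder picks up the new value the next time it runs.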
The first step in any data analysis project is having clear goals
and questions you want to answer. It can be tempting to jump
into writing code straight away, but without clear goals, you can
quickly get drawn into a cycle of slicing, pivoting and plotting data
that doesn’t lead anywhere (especially if you have lots of data to
work with).
Figure 33.2: An almost complete Markdown reference for Jupyter notebooks. This reference shows side-by-side views of
the same notebook in JupyterLab: the left view shows the Markdown text in edit mode, whereas the right view shows the
rendered Markdown.
does. However, for larger projects like our sales analysis, you can
even keep an entire notebook for documenting the project, which
is what we’ll do here. Not all projects follow the same pattern, but
in general, you can document several aspects of your data projects
before writing any actual code:
- Description: what the project is about, why it’s needed, and who
the key stakeholders are;
- Questions: a list of specific questions you want to answer through
your analysis;
- Metrics: a list of metrics (e.g., gross profit) that are used in the
analysis and how they’re calculated;
- Data sources: what datasets you use in the project and how
they’re obtained;
- Timelines: a list of events related to the project and their
description (e.g., when you started, when you last updated the code, how
often the analysis is repeated, and why).
To document your sales analysis, open the “01 - Setup” Jupyter
notebook you created earlier, add a new cell and change its type to
Markdown.²

2: By pressing m in Command mode, or by selecting Markdown from the
cell type dropdown menu in the toolbar right above the cell.

Add the following text to the Markdown cell:
For the upcoming board meeting, the CFO is asking for an in-depth analysis of sales
performance in 2020. In particular, management wants to know which sales channels and which
product categories are most profitable.
This project brings together data from different sources (accounting, marketing, sales),
and looks at sales profitability across channels and product categories.
## Questions
## Metrics
To determine product profitability, I use `Gross Profit` and `Margin per Unit` as
profitability metrics. For each line item in the sales data, the following metrics are
calculated:
## Data sources
- Quarterly sales data is stored in Excel files exported from SAP (`Q1Sales.xlsx` through
`Q4Sales.xlsx`);
- `products.csv` is a dataset containing information about all products, including their
brand and category labels;
- `standard costs.csv` is a Standard Cost of Goods dataset with costing information about
each sold product.
## Timelines
When you run this cell, you should see the text displayed inline
and styled based on the Markdown styling commands you used.
Documenting³ your analysis this way not only helps you stay on
track when writing code, it also helps you when you return to the
analysis to understand the details of your project.

3: Documenting your analysis is an iterative process. Often the
questions you want to answer change as the project progresses, and
new questions come up. As your project evolves, make sure you return
to your documentation and update it often, as this keeps your project
organized and your analysis goals in focus.

Not all projects require this setup: if you just need to combine
several Excel workbooks into a single one, there are no analysis
metrics to describe, but you might still want to document where
the original workbooks come from, why you are combining them,
and who is interested in the combined output.
Summary
This short chapter showed you a few tricks you can use to keep your
data files, notebooks, and common code organized when analyzing
data with Python. It also showed you how to use Markdown cells to
document your analysis goals. Documenting your analysis keeps
your project goals in focus and helps you stay on track when
writing code.
With clear questions for your sales analysis written down, let’s
move on to cleaning and preparing the sales data next.
Preparing data 34
The second step in our sales analysis is preparing data: exploring
the available datasets, cleaning and fixing issues with any of them,
and merging them into a single working dataset.
For this step of the analysis, create a new notebook in your project
folder and name it “02 - Data preparation” — you’ll use this notebook
to explore and prepare data using pandas code. But before writing
any code, let’s add a title and description to this notebook. Just like
in the previous chapter, create a new cell at the top of the notebook,
set its type to Markdown, and add the following content:
# Exploring and preparing sales data
This notebook prepares the original sales data, including costs and product datasets, for
the analysis process.
Now in a new code cell, import pandas and the os Python module.
Also, import the data_folder variable from the common_code file
you created earlier — you’ll use the os module together with the
data_folder variable to list and read Excel files from the data folder
later:
In [1]: import os
        import pandas as pd

        from common_code import data_folder
Notice that the full file path is constructed with the data_folder
variable you imported from common_code, using a Python f-string
— if you don’t remember what a Python f-string is, head back to
chapter 5 for a quick refresher.
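As a quick reminder of how such a path is put together, the sketch below builds the same kind of path in two ways. The data_folder value is set directly here (instead of being imported) so the example stands on its own, and os.path.join is shown as an alternative that this chapter doesn’t use but that picks the right separator for your operating system.

```python
import os

# Set directly here so the example is self-contained
data_folder = "data"

# An f-string replaces {data_folder} with the variable's value
products_path = f"{data_folder}/products.csv"

# An alternative: let the os module join the parts for you
costs_path = os.path.join(data_folder, "standard costs.csv")

# File names can contain spaces; the whole path is still one string
costs_file_name = os.path.basename(costs_path)
```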
To read the costs dataset, you can use the same read_csv function
as above:
In [5]: costs_df = pd.read_csv(f'{data_folder}/standard costs.csv')
costs_df
pythonforaccounting.com/chapter34
The output above is not very useful because the file values are
separated by a tab character, not a comma. You need to tell pandas
to use the tab character as a value separator by specifying the sep
keyword argument:
In [6]: costs_df = pd.read_csv(f'{data_folder}/standard costs.csv', sep='\t')
costs_df
Out [6]: ProductID FOB Duty Freight Other Standard Unit Cost
0 A/DAN-29859 2.85 0.49 0.25 0.29 3.88
1 A/DAN-94863 2.97 0.54 0.08 0.48 4.07
2 AC&S/10X-13891 8.47 1.48 0.28 1.51 11.74
3 AC&S/4 N-48073 2.24 0.40 0.19 0.32 3.15
4 AC&S/ALV-45850 14.12 2.42 1.08 2.30 19.92
... ... ... ... ... ... ...
3576 T&G/ZIN-76306 2.18 0.38 0.07 0.42 3.05
3577 T&G/`DE-22075 5.67 0.99 0.08 0.56 7.30
3578 VG/NIN-87997 12.38 2.04 0.78 1.40 16.60
3579 VG/SEG-25084 8.55 1.42 0.61 0.33 10.91
3580 VG/SKQ-07575 9.57 1.60 0.75 0.66 12.58
costs_df
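You can reproduce the effect of the sep keyword argument without the actual file. The sketch below reads a short tab-separated string (two rows and three columns copied from the output above) with and without sep, using io.StringIO as a stand-in for the file on disk.

```python
import io

import pandas as pd

# Tab-separated text standing in for the contents of the costs file
raw_text = (
    "ProductID\tFOB\tStandard Unit Cost\n"
    "A/DAN-29859\t2.85\t3.88\n"
    "A/DAN-94863\t2.97\t4.07\n"
)

# Default separator (a comma): every line ends up in a single column
wrong_df = pd.read_csv(io.StringIO(raw_text))

# Tab separator: values are split into the columns you expect
right_df = pd.read_csv(io.StringIO(raw_text), sep="\t")
```

Comparing wrong_df.shape with right_df.shape is a quick way to confirm that the separator is the problem: the first has one column, the second has three.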
These datasets need to be merged with the actual sales data, but
first, you need to read and fix any issues with the sales data
themselves, which is what the following section looks at.
Sales data
This project’s sales data is stored in four Excel files in the data
folder, one file for every quarter in 2020; each of these files has
three separate sheets, one for every month in its respective quarter.
If you open any of them in Excel, you’ll see something similar to
figure 34.1.
To analyze all sales at once, you need to read data from each file
and concatenate it into a single DataFrame. Because you need to
repeat the same steps for each file (i.e., read the file, select relevant
First, let’s define a simple function that takes as input a file name
(i.e., for any of the sales Excel files), reads data from that file and
returns a DataFrame with:
In [9]: def get_sales(file_name):
            df = pd.read_excel(f'{data_folder}/{file_name}')
            return df
As with the previous datasets, the full file path is constructed with
the data_folder variable you imported from common_code and the
file_name parameter, using a Python f-string.
Out [10]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb B... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles ... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age of... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age of... ... 13.46 6 80.76
... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power Su... ... 28.72 8 229.76
14050 15582 Bullseye Cisco Systems Gigab... ... 33.39 1 33.39
14051 15583 Understock.com Philips AJ3116M/37 ... ... 4.18 1 4.18
14052 15584 iBay.com NaN ... 4.78 25 119.50
14053 15585 Understock.com Sirius Satellite Ra... ... 33.16 2 66.32
function again, you’ll get data from all the sheets in “Q1Sales.xlsx”
in a single DataFrame:
In [12]: sales_df = get_sales('Q1Sales.xlsx')
sales_df
Out [12]: InvoiceNo Channel Product Name ... Unit Price Quantity Total
0 1532 Shoppe.com Cannon Water Bomb B... ... 20.11 14 281.54
1 1533 Walcart LEGO Ninja Turtles ... ... 6.70 1 6.70
2 1534 Bullseye NaN ... 11.67 5 58.35
3 1535 Bullseye Transformers Age of... ... 13.46 6 80.76
4 1535 Bullseye Transformers Age of... ... 13.46 6 80.76
... ... ... ... ... ... ... ...
37703 39235 iBay.com Nature's Bounty Gar... ... 5.55 2 11.10
37704 39216 Shoppe.com Funko Wonder Woman ... ... 28.56 1 28.56
37705 39219 Shoppe.com MONO GS1 GS1-BTY-BL... ... 3.33 1 3.33
37706 39238 Shoppe.com NaN ... 34.76 10 347.60
37707 39239 Understock.com 3 Collapsible Bowl ... ... 6.39 15 95.85
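One common way to combine several sheets into a single DataFrame, which may differ from the approach used in this chapter, relies on the fact that pd.read_excel returns a dictionary of DataFrame objects (one per sheet) when you pass sheet_name=None. The sketch below builds such a dictionary by hand, with made-up sheet names and values, so it runs without an Excel file.

```python
import pandas as pd

# Stand-in for the result of pd.read_excel(path, sheet_name=None):
# a dictionary mapping each sheet name to a DataFrame.
# Sheet names and values here are hypothetical.
sheets = {
    "January": pd.DataFrame({"InvoiceNo": [1532, 1533], "Total": [281.54, 6.70]}),
    "February": pd.DataFrame({"InvoiceNo": [1601], "Total": [58.35]}),
    "March": pd.DataFrame({"InvoiceNo": [1702, 1703], "Total": [80.76, 33.39]}),
}

# Stack the sheets on top of each other and renumber the rows
quarter_df = pd.concat(list(sheets.values()), ignore_index=True)
```

The combined DataFrame has as many rows as all the sheets put together, which matches the jump in row count you see between the two outputs above.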
If you call info on the sales_df variable you’ll see it has several
columns you won’t need for the sales analysis. For instance, if you
take a look at the number of unique values in each column in
sales_df, some of its columns have only one unique value (which
means they don’t contain useful information for the analysis):
In [13]: sales_df.nunique()
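Once you have the unique value counts, you can also use them to pick out the single-valued columns with code instead of reading them off the output. The sketch below shows one way to do that on a tiny made-up DataFrame; the 'Currency' values are hypothetical.

```python
import pandas as pd

# Hypothetical sales rows, for illustration only
demo_df = pd.DataFrame({
    "InvoiceNo": [1532, 1533, 1534],
    "Channel": ["Shoppe.com", "Walcart", "Bullseye"],
    "Currency": ["USD", "USD", "USD"],
})

# Number of distinct values in each column
unique_counts = demo_df.nunique()

# Columns with a single distinct value carry no information for the analysis
constant_columns = unique_counts[unique_counts == 1].index.tolist()

trimmed_df = demo_df.drop(columns=constant_columns)
```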
Besides, there are a few other issues with these data: the 'Product
Name' column contains missing values, and there are several dupli-
cate rows in the DataFrame.
In [14]: sales_df.isna().sum()
Date 0
Deadline 0
Currency 0
Unit Price 0
Quantity 0
Total 0
dtype: int64
In [15]: sales_df.duplicated().sum()
sales_df
Out [17]: InvoiceNo Channel ProductID ... Product Name Category Unit Cost
0 1532 Shoppe.com T&G/CAN-9... ... Cannon Wa... Toys & Games 17.64
29 1533 Walcart T&G/LEG-3... ... LEGO Ninj... Toys & Games 5.84
50 1534 Bullseye T&G/PET-1... ... Pete the ... Toys & Games 10.32
63 1535 Bullseye T&G/TRA-2... ... Transform... Toys & Games 11.90
98 1537 Bullseye CP&A/3X-0... ... 3x Anti-S... Cell Phon... 5.13
... ... ... ... ... ... ... ...
1355 39227 Bullseye CP&A/MUS-... ... Music Jog... Cell Phon... 6.41
28481 39229 Understoc... K&D/ZER-9... ... ZeroWater... Kitchen &... 3.81
4517 39235 iBay.com H&PC/NAT-... ... Nature's ... Health & ... 5.41
21855 39238 Shoppe.com T&G/MAG-6... ... Magic: th... Toys & Games 30.49
15116 39239 Understoc... K&D/3 C-0... ... 3 Collaps... Kitchen &... 6.18
Notice that the output of get_sales now has fewer rows (i.e., 37532
instead of 37708) because you dropped duplicates. Also, row labels
are not consecutive anymore because get_sales sorts rows based
on the 'InvoiceNo' column.
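Both effects are easy to see on a small example. The sketch below uses four made-up rows (one of them a duplicate) to show how dropping duplicates reduces the row count and how sorting leaves the row labels out of order. It also shows reset_index as an optional extra step, not necessarily part of this chapter’s get_sales function, in case you want consecutive labels again.

```python
import pandas as pd

# Hypothetical rows; the third row repeats the first
demo_df = pd.DataFrame({
    "InvoiceNo": [1535, 1532, 1535, 1533],
    "Total": [80.76, 281.54, 80.76, 6.70],
})

n_duplicates = demo_df.duplicated().sum()

# Drop the repeated row, then order rows by invoice number
clean_df = demo_df.drop_duplicates().sort_values("InvoiceNo")

# Row labels still reflect the original positions
labels_after_sorting = clean_df.index.tolist()

# Optional: renumber the rows from zero
renumbered_df = clean_df.reset_index(drop=True)
```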
In general, you can craft your custom functions to include any data
preparation steps you need. For this project, you can now use the
get_sales function with all the sales data files. First, you need to
get a list of all the Excel files that are in the data folder using the
os module:
In [19]: os.listdir(data_folder)
The output above lists all files in the “data” folder, but there are
a few files you need to exclude from this list if you want to use
them with the get_sales function (i.e., all the files with names
that don’t end in '.xlsx'). Luckily, all the items in the list above
are Python strings so you can work with them as with any other
Python strings — for example, you can use the endswith string
method and a Python list comprehension to keep only file names
that end in '.xlsx':
In [20]: [name for name in os.listdir(data_folder) if name.endswith('.xlsx')]
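To try the same filter without touching your project files, you can run it on a throwaway folder. This is only a sketch: the folder and the file names in it are invented, and Python's tempfile module creates (and later deletes) the folder for you.

```python
import os
import tempfile

# Throwaway folder standing in for the data folder; file names are invented.
with tempfile.TemporaryDirectory() as toy_folder:
    for name in ['Q1 sales.xlsx', 'Q2 sales.xlsx', 'notes.txt', 'products.csv']:
        open(os.path.join(toy_folder, name), 'w').close()

    # Same endswith filter as in the text; sorted() makes the order
    # predictable, because os.listdir returns names in arbitrary order.
    excel_files = sorted(
        name for name in os.listdir(toy_folder) if name.endswith('.xlsx')
    )

print(excel_files)
```

Keep in mind that endswith is case-sensitive, so a file called 'Q3 sales.XLSX' would be left out by this filter.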
Now that you have a list of files to process, you can go through
this list and use the get_sales function for each file:
In [21]: sales_df = pd.concat(
             [get_sales(name) for name in os.listdir(data_folder) if name.endswith('.xlsx')],
             ignore_index=True
         )
Out [22]: InvoiceNo Channel ProductID ... Product Name Category Unit Cost
0 1532 Shoppe.com T&G/CAN-9... ... Cannon Wa... Toys & Games 17.64
1 1533 Walcart T&G/LEG-3... ... LEGO Ninj... Toys & Games 5.84
2 1534 Bullseye T&G/PET-1... ... Pete the ... Toys & Games 10.32
3 1535 Bullseye T&G/TRA-2... ... Transform... Toys & Games 11.90
4 1537 Bullseye CP&A/3X-0... ... 3x Anti-S... Cell Phon... 5.13
... ... ... ... ... ... ... ...
119164 89228 Shoppe.com MI/FEN-18728 ... Fender 09... Musical I... 20.96
119165 89229 Understoc... C&P/CAS-0... ... Casio EXI... Camera & ... 12.04
119166 89232 iBay.com M&T/LEG-3... ... Legally B... Movies & TV 2.69
119167 89233 Shoppe.com MI/ELE-71119 ... Electro-H... Musical I... 45.24
119168 89237 Understoc... T&G/MEL-5... ... Melissa &... Toys & Games 2.55
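The row labels in the combined table run from 0 upward because of ignore_index=True. As a small sketch of what that argument does, here are two made-up tables (first_df and second_df are hypothetical stand-ins for the output of get_sales on two files):

```python
import pandas as pd

# Two made-up tables standing in for the output of get_sales on two files;
# their row labels are deliberately non-consecutive and overlapping.
first_df = pd.DataFrame({'InvoiceNo': [1532, 1533]}, index=[0, 29])
second_df = pd.DataFrame({'InvoiceNo': [39240, 39241]}, index=[0, 5])

# Without ignore_index, each table keeps its own labels (label 0 appears twice).
kept_labels = pd.concat([first_df, second_df])

# With ignore_index=True, the combined table gets fresh labels 0, 1, 2, ...
new_labels = pd.concat([first_df, second_df], ignore_index=True)

print(kept_labels.index.tolist())
print(new_labels.index.tolist())
```

Fresh labels matter here because repeated labels make it ambiguous which row you mean when you select rows by label later.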
You now have all the sales data, clean and in shape, in one DataFrame
variable. This DataFrame contains product and cost information, so
you can use it to answer all the analysis questions. The last thing
you need to do is compute the gross profit, unit profit, and unit
margin for each sale — you can do that with:
In [23]: sales_df['Gross Profit'] = (sales_df['Total'] -
                                     (sales_df['Quantity'] * sales_df['Unit Cost']))

         sales_df['Profit per Unit'] = sales_df['Gross Profit'] / sales_df['Quantity']
         sales_df['Margin per Unit'] = ((sales_df['Profit per Unit'] /
                                         sales_df['Unit Price']) * 100)
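To check the arithmetic by hand, you can run the same three formulas on a couple of made-up sales (the numbers below are invented so the results are easy to verify; toy_df is not part of the project data):

```python
import pandas as pd

# Two made-up sales with round numbers.
toy_df = pd.DataFrame({
    'Unit Price': [20.0, 8.0],
    'Quantity': [3, 2],
    'Total': [60.0, 16.0],      # Unit Price * Quantity
    'Unit Cost': [17.5, 6.0],
})

# Same formulas as in the text, applied to the toy table.
toy_df['Gross Profit'] = toy_df['Total'] - (toy_df['Quantity'] * toy_df['Unit Cost'])
toy_df['Profit per Unit'] = toy_df['Gross Profit'] / toy_df['Quantity']
toy_df['Margin per Unit'] = (toy_df['Profit per Unit'] / toy_df['Unit Price']) * 100

print(toy_df[['Gross Profit', 'Profit per Unit', 'Margin per Unit']])
```

For the first sale, the gross profit is 60 - (3 * 17.5) = 7.5, the profit per unit is 7.5 / 3 = 2.5, and the margin per unit is 2.5 / 20 * 100 = 12.5 percent.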
The current notebook is already quite long; adding more code for
the actual analysis steps will make it longer and more difficult
to handle. Besides, the analysis code you’ll write to answer each
question is conceptually unrelated to this notebook, which only
cleans and prepares the sales data for analysis. In the following
chapter, you’ll use a separate notebook to write the code that
answers the analysis questions.2
2: Why not keep everything in one notebook? You can, but this
separation of code into multiple notebooks makes it easier for you
to know where different parts of your analysis are and easier to
re-run notebooks if you need to, without repeating unnecessary steps.
However, to run that code, you’ll need the clean sales data you just
prepared. The easiest way to get the clean data in your new notebook
is by writing sales_df to a file, which you can then read later:
In [24]: sales_df.to_csv(f'{data_folder}/sales2020.csv', index=False)
Writing data to CSV files is much faster than writing it to Excel files,
which is why I used the to_csv function here — but you can use
to_excel instead if you need an Excel file with the clean data.
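As a quick sanity check of the write-then-read approach, the sketch below saves a tiny made-up table to a temporary CSV file and reads it back (toy_df, the temporary folder, and the file name are all invented for illustration):

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame({'InvoiceNo': [1532, 1533], 'Total': [25.0, 7.5]})

with tempfile.TemporaryDirectory() as toy_folder:
    path = os.path.join(toy_folder, 'toy_sales.csv')

    # index=False leaves the row labels out of the file ...
    toy_df.to_csv(path, index=False)

    # ... so reading the file back gives the same columns, with fresh labels.
    loaded_df = pd.read_csv(path)

print(loaded_df.columns.tolist())
print(loaded_df.equals(toy_df))
```

Without index=False, the old row labels are saved as an extra, unnamed column that shows up as 'Unnamed: 0' when you read the file. Also note that CSV files store dates as plain text, so date columns are only converted back to dates if you ask read_csv to do so (for example, through its parse_dates argument).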
Summary
This chapter showed you how to clean, reshape, and combine the
different data available for the sales analysis project to create a
“working” dataset that can answer the project questions. The
following chapter shows you how to run the actual analysis, visualize
your findings, and share your results with others.
Finding answers 35
Now that the sales data is ready for analysis and the project goals
are clearly defined, the last step in this project is finding answers
to the questions we started with.
Analysis of channel and category sales profits. The analysis looks at `Gross Profit` and
`Quantity` as indicators of sales volume, and `Margin per Unit` as an indicator of sales
performance.
**Questions:**
In this notebook, you’ll use pandas and the seaborn plotting library,
as well as the data_folder variable from common_code. You can
import them as before with:
You can use Markdown cells to split the notebook into separate
sections, keeping your notebook easier to navigate. For example,
you can mark a new section in the notebook with the following
content in a new Markdown cell:
## Sales data
And then read in the clean sales data using a code cell:
In [2]: sales_df = pd.read_csv(f'{data_folder}/sales2020.csv')
Channel profits
The first question you need to answer is which sales channels are
most profitable. Three distinct metrics measure different aspects
of profitability (i.e., quantity, gross profit, and average margin per
unit), so we want to look at each metric for every sales channel.
First, let’s add a new section to the notebook and briefly explain
the question we’re trying to answer in this section:
## 1. Which sales channels are the most profitable?
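As a minimal sketch of the aggregation this question calls for, assuming it mirrors the category aggregation used in the next section, you can group a few made-up sales rows by channel and compute the three metrics. The toy_sales_df table and the toy_channel_profits name are invented here so they don't clash with the real project variables.

```python
import pandas as pd

# A few made-up sales rows with the three metric columns used in the analysis.
toy_sales_df = pd.DataFrame({
    'Channel': ['Walcart', 'Walcart', 'iBay.com'],
    'Quantity': [2, 3, 10],
    'Gross Profit': [5.0, 7.5, 20.0],
    'Margin per Unit': [12.0, 13.0, 12.5],
})

toy_channel_profits = (
    toy_sales_df
    .groupby('Channel')
    .agg({
        'Quantity': 'sum',           # total units sold per channel
        'Gross Profit': 'sum',       # total gross profit per channel
        'Margin per Unit': 'mean'    # average margin across the channel's sales
    })
    .round(3)
    .sort_values('Gross Profit', ascending=False)
)

print(toy_channel_profits)
```

Each row of the result is one channel, with the most profitable channel first. Note that the average margin here is a plain average over sales rows, so a sale of one unit counts as much as a sale of a hundred units.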
The table output above gives you the answer you need, but it’s
always easier to understand data when you visualize it. You can
quickly draw bar plots from channel_profits_df using the pandas
plot method:
pythonforaccounting.com/chapter35
In [5]: channel_profits_df.plot(
            kind='bar', figsize=(15, 4), subplots=True,
            layout=(1, 3), legend=False, rot=30
        );
You can polish these plots further, using the matplotlib methods
we explored in chapter 30, but even in this rough format, they
highlight that sales profits in the two brick-and-mortar retailers
(Walcart and Bullseye) are much lower compared to any of the
online retailers, even though average margin per unit is almost
identical across channels.
Category profits
The second question you need to answer is similar to the first one,
and you can use the same approach as in the previous section,
grouping the sales data by Category instead of Channel to compute
the same aggregate metrics:
## 2. Which product categories are most profitable?
In [6]: category_profits_df = (
            sales_df
            .groupby('Category')
            .agg({
                'Quantity': 'sum',
                'Gross Profit': 'sum',
                'Margin per Unit': 'mean'
            })
            .round(3)
            .sort_values('Gross Profit', ascending=False)
        )
In [7]: category_profits_df
You can use the same pandas plotting approach as before to visualize
product category profits:
In [8]: category_profits_df.plot(
            kind='barh', figsize=(15, 6), subplots=True,
            layout=(1, 3), legend=False, sharex=False, sharey=True
        )
A few categories stand out: Electronics and Camera & Photo generate
relatively low gross profits, even though they have a high average
margin per unit. Increasing sales volume for these two categories
could increase overall profitability. In contrast, Patio, Lawn & Garden
More specifically, are there differences between the average `Margin per Unit` across
channels?
To get the sales data in the shape we need it, you can use pd.pivot_table:
In [9]: average_margin_per_category = pd.pivot_table(
            sales_df,
            index='Channel',
            columns='Category',
            values='Margin per Unit',
            aggfunc='mean'
        ).round(3)
In [10]: average_margin_per_category
The average margin per category varies across sales channels (e.g.,
the lowest average margin per unit for the Electronics product
category is 26%, whereas the highest is 30%). Optimizing some
categories by selling products through their most profitable channels
might have an impact on total profits.
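One way to read the most profitable channel for each category straight from a pivoted table is the idxmax method. The sketch below uses a handful of made-up margins (toy_sales_df, toy_margins, and best_channel are names invented for this example, and the numbers are not taken from the sales data):

```python
import pandas as pd

# Made-up margins for two channels and two categories.
toy_sales_df = pd.DataFrame({
    'Channel': ['Walcart', 'Walcart', 'iBay.com', 'iBay.com', 'iBay.com'],
    'Category': ['Electronics', 'Toys & Games', 'Electronics',
                 'Electronics', 'Toys & Games'],
    'Margin per Unit': [26.0, 12.0, 29.0, 31.0, 11.0],
})

# One row per channel, one column per category, average margin in each cell.
toy_margins = pd.pivot_table(
    toy_sales_df,
    index='Channel',
    columns='Category',
    values='Margin per Unit',
    aggfunc='mean'
)

# For each category (column), the channel (row label) with the highest average.
best_channel = toy_margins.idxmax()

print(toy_margins)
print(best_channel)
```

In this toy table, Electronics averages 30 on iBay.com (the mean of 29 and 31) against 26 on Walcart, so idxmax picks iBay.com for Electronics and Walcart for Toys & Games.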
Product profits
To answer the last question in this project, you need to find the
most and least profitable products in each product category. There
are many categories to go through, so manually filtering the sales
data for each would require a lot of typing — you can write a
custom function to help quickly explore products in each category
instead.
As with the custom function we used to read and clean the sales
data, you can start with a simple function to explore products in a
Out [13]: InvoiceNo Channel ... Profit per Unit Margin per Unit
0 1532 Shoppe.com ... 2.47 12.28
1 1949 Walcart ... 2.60 12.85
2 5401 Understock.com ... 2.56 12.67
3 8601 Understock.com ... 2.56 12.67
4 9860 Understock.com ... 2.56 12.67
... ... ... ... ... ...
87890 114220 Bullseye ... 0.58 11.65
87894 117532 Walcart ... 1.07 12.91
87896 118296 Understock.com ... 0.89 12.61
87897 118330 Understock.com ... 0.89 12.61
87898 118340 Understock.com ... 0.89 12.61
Out [15]: ProductID Product Name Unit Price Quantity Gross Profit Margin per Unit
0 T&G/CHI-38293 Child Construc... 19.68 2864 6963.08 12.385
1 T&G/DIS-87606 Disneys Frozen... 13.02 4240 6955.44 12.458
2 T&G/MEL-91223 Melissa & Doug... 11.44 4323 6115.61 12.373
3 T&G/LEG-28766 LEGO The Lord ... 13.31 3498 5775.40 12.421
4 T&G/GOL-13352 Goldie Blox an... 12.57 3506 5482.67 12.381
.. ... ... ... ... ... ...
502 T&G/WL-74772 Wl Products - ... 3.05 4 1.44 11.800
503 T&G/FIS-85290 Fisher Price G... 5.47 2 1.34 12.250
504 T&G/GRE-29463 Green Toys Roc... 8.29 1 1.07 12.910
505 T&G/LEA-17226 Learning Resou... 3.49 1 0.45 12.890
506 T&G/POW-45149 Power Wheels T... 2.10 1 0.26 12.380
The output above is sorted by gross profit, with the highest grossing
products in the Toys & Games category at the top of the table and
the least profitable ones at the bottom. However, if you want to sort
the table by margin values instead of gross profits or change the
sort order, you need to edit the function again (or sort its output
later). You can add a few more parameters to the function (with
sensible default values) to make it easier to change the sort column
and the sort order:
In [16]: def get_product_profits(df,
                                 category='All', channel='All',
                                 sort_column='Gross Profit', ascending=False):

             if category != 'All':
                 df = df[df['Category'] == category]

             if channel != 'All':
                 df = df[df['Channel'] == channel]

             return (
                 df.groupby('ProductID')
                 .agg({
                     'Product Name': 'first',
                     'Unit Price': 'first',
                     'Category': 'first',
                     'Quantity': 'sum',
                     'Gross Profit': 'sum',
                     'Margin per Unit': 'mean'
                 })
                 .sort_values(by=sort_column, ascending=ascending)
                 .reset_index()
                 .round(3)
             )
Out [17]: ProductID Product Name Unit Price Quantity Gross Profit Margin per Unit
0 T&G/LEG-31282 LEGO City Gre... 2.08 168 40.32 11.54
1 T&G/DUP-28439 DUPLO LEGO Vi... 2.76 15 4.80 11.59
2 T&G/RUB-64023 Rubber Pirate... 2.76 129 41.28 11.59
3 T&G/FUN-21237 Funko Pop! Di... 2.24 1128 293.28 11.61
4 T&G/MY-91022 My Little Pon... 3.70 193 82.99 11.62
.. ... ... ... ... ... ...
364 T&G/MY-60452 My Neighbor T... 2.70 176 56.32 11.85
365 T&G/PAW-31908 Paw Print Bal... 3.20 151 57.38 11.88
366 T&G/SWI-62367 Swimways Baby... 1.85 21 4.62 11.89
367 T&G/FUN-29714 Funko Mike My... 2.35 102 28.56 11.91
368 T&G/MAG-22549 Magic: the Ga... 0.06 326 3.26 16.67
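To see how the new parameters change the result, here is a small sketch that repeats the function so it can run on its own, and then calls it on a handful of made-up rows. The product IDs, names, and numbers are invented for illustration.

```python
import pandas as pd


def get_product_profits(df, category='All', channel='All',
                        sort_column='Gross Profit', ascending=False):
    # Copy of the function from the text, repeated so this sketch is runnable.
    if category != 'All':
        df = df[df['Category'] == category]

    if channel != 'All':
        df = df[df['Channel'] == channel]

    return (
        df.groupby('ProductID')
        .agg({
            'Product Name': 'first',
            'Unit Price': 'first',
            'Category': 'first',
            'Quantity': 'sum',
            'Gross Profit': 'sum',
            'Margin per Unit': 'mean'
        })
        .sort_values(by=sort_column, ascending=ascending)
        .reset_index()
        .round(3)
    )


# Made-up sales: product P1 is sold through two channels.
toy_sales_df = pd.DataFrame({
    'ProductID': ['P1', 'P1', 'P2', 'P3'],
    'Product Name': ['Toy A', 'Toy A', 'Toy B', 'Lens C'],
    'Unit Price': [10.0, 10.0, 5.0, 50.0],
    'Category': ['Toys & Games', 'Toys & Games', 'Toys & Games', 'Camera & Photo'],
    'Channel': ['Walcart', 'iBay.com', 'Walcart', 'Walcart'],
    'Quantity': [2, 3, 10, 1],
    'Gross Profit': [2.4, 3.9, 5.5, 15.0],
    'Margin per Unit': [12.0, 13.0, 11.0, 30.0],
})

# Default sort: highest gross profit first.
by_profit = get_product_profits(toy_sales_df, category='Toys & Games')

# Same category, lowest average margin first.
by_margin = get_product_profits(toy_sales_df, category='Toys & Games',
                                sort_column='Margin per Unit', ascending=True)

# All categories, but only sales made through one channel.
walcart_only = get_product_profits(toy_sales_df, channel='Walcart')

print(by_profit)
print(by_margin)
print(walcart_only)
```

Because the extra parameters have default values, calls that pass only a DataFrame and a category keep working exactly as before.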
After you run the code above, if you open “Product profits 2020.xlsx”
(which should be in your data folder) with Excel, you’ll see it
contains one sheet for every product category, with product profits
in each sheet sorted by their total gross profit.
You now have answers for all the analysis questions we started
with (both as tables and as plots). The last step of this project (or
any data analysis project) is communicating results with others.
In general, depending on how you need to share results, you
can either copy-paste tables or plots1 directly into your reports or
presentations or export Jupyter notebooks as HTML files, which
you can then share over email or store in a public folder where
others can access them. You can also create standalone, interactive
HTML files containing Python-based plots and DataFrame views.
Let’s take a quick look at both options next.
1: To copy-paste figures from JupyterLab, you can right-click them
while holding down the Shift key, which opens your browser’s native
context menu, with options to copy or save plots as images.
Sharing results
Sharing your data analysis files (i.e., Jupyter notebooks and data
files) with others can be tricky. While data can easily be written
to Excel workbooks and shared that way, notebooks with project
documentation, analysis steps, figures and all your results are not
as straightforward to share.
This will create an HTML file in your project folder called “Sales
analysis.html”. If you open it (in JupyterLab or directly in your
browser), you will see it has the same content as the sales analysis
notebook, but without any of the Python code. If you want to create
a PDF export instead, the easiest option is to export your notebook
to HTML, and then use your browser’s native print command to
print the HTML file to PDF.4
4: On the other hand, if you want to install LaTeX on your computer,
you can use the same nbconvert command to export to PDF directly by
changing the output file name to Sales analysis.pdf
The code above is a quick and easy way to generate HTML files
that you can share with others from your Jupyter notebooks. The
following overthinking section looks at how you can generate a
slightly improved HTML file from your Jupyter notebooks by using
the panel Python library to create a simple, interactive dashboard
that you can save and share.
This project’s last question looks at the most and least profitable
products from each product category. While the product tables
you produced (i.e., the Excel file with one product category per
sheet) contain all the information you need to determine product
profitability, it’s not easy to compare products within categories or
across channels using these tables. To help compare products with
each other, you can visualize product profits using an interactive
scatter plot, where each point on the plot is a different product. Even
better, you can turn this interactive scatterplot into a dashboard by
connecting it to dropdown menus that filter the points shown on
the plot by channel or category.
You’ll use pandas and hvplot for the interactive scatter plot. After
you have the plot, you’ll use panel to create a simple dashboard,
with dropdown menus linked to the get_product_profits function
to control which product categories or sales channels are shown
on the scatter plot. The pn.extension() function call on line 5
You’ll be using the clean sales data in our dashboard; you can read
it again in this notebook with:
In [21]: sales_df = pd.read_csv(f'{data_folder}/sales2020.csv')
For the scatter plot, you’re using gross profit as the x-coordinate for
each point and the average margin per product as the y-coordinate.
You also encode the total sale quantity for each product as the size
of the point marker and product category as each point’s color.
The other keyword arguments passed to hvplot control the plot’s
display in various ways (e.g., scale=0.2 makes the points on the
plot slightly smaller).
Now you can call the function and draw the scatter plot using:
In [23]: products_scatterplot()
Hover over any point on the plot. You’ll see a tooltip with extra
information about the product it represents, in addition to its x or
y-coordinates (i.e., gross profit and average margin), size, and cate-
gory. The tooltip information is controlled through the hover_cols
To turn this plot into a dashboard, you need to create two dropdown
menus and connect them to products_scatterplot’s category and
channel parameters. The panel library provides several widgets
you can use to connect Python functions to graphical interface
elements, including buttons, dropdown menus, or date pickers. To
create a simple dropdown menu (using the Select panel widget),
you can use:
In [24]: test_dropdown = pn.widgets.Select(name='Test dropdown', value='a', options=['a', 'b', 'c'])
This test menu shows how you can use panel’s Select widget.
To make a menu that is useful for the products dashboard, you
can delete this test_dropdown menu and create a new one that
displays different product sales channels. You’ll modify the
products_scatterplot function to use the values selected in this menu
through the variable defined below. Because the plotting function
will depend on the menu variable, it’s a good idea to define them in
cells above the products_scatterplot function. This way, when you
re-start or re-run your notebook, the menu variable is created before
In [27]: channel_dropdown
Notice that you create the options list on line 1, which includes
an 'All' value that will display all available products. The list
of options is just a sorted Python list, created directly from the
Channel column in our sales data, so that you don’t have to type the
different menu options yourself. You can use the same approach
for the product categories, and create a separate menu using:
In [28]: categories = ['All'] + sorted(sales_df['Category'].unique())
         category_dropdown = pn.widgets.Select(name='Category', value='All', options=categories)
In [29]: category_dropdown
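The options list for either menu is plain Python, so you can check how it is built without panel. This sketch uses a made-up Channel column (toy_sales_df is invented for illustration):

```python
import pandas as pd

# Made-up Channel column with repeated values, as in real sales data.
toy_sales_df = pd.DataFrame({
    'Channel': ['Walcart', 'Bullseye', 'Walcart', 'iBay.com', 'Bullseye']
})

# unique() removes repeats, sorted() orders the names, and 'All' goes first.
channels = ['All'] + sorted(toy_sales_df['Channel'].unique())

print(channels)
```

Python sorts uppercase letters before lowercase ones, which is why 'iBay.com' comes after the capitalized channel names in the resulting list.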
The two menus don’t work yet; you need to link them to the
products_scatterplot function. To do that, you need to (slightly) change
the plotting function, and include the menus and the products
scatter plot in a panel component. First, add the following line of
code above the products_scatterplot function:
In [30]: @pn.depends(category_dropdown, channel_dropdown)
         def products_scatterplot(category='All', channel='All'):
             df = get_product_profits(sales_df, category, channel)

             return df.hvplot(
                 kind='scatter',
                 x='Gross Profit',
                 y='Margin per Unit',
                 size='Quantity',
                 color='Category',
                 scale=0.2,
                 grid=True,
                 line_color='black',
                 width=900,
                 height=600,
                 hover_cols=['ProductID', 'Product Name', 'Unit Price']
             )
You can now use the dropdown menus to filter data shown on
the plot — the output above shows the plot after filtering for
Electronics products. The panel library uses columns or rows to
arrange widgets and plots: when placing components inside a
pn.Column, they will be displayed vertically, one below the other.
You can create more complex layouts by nesting rows and columns
— more information about panel components and layout options is
available at panel.holoviz.org/user_guide/Components.html.
You can also include Markdown text and commands inside your
panel components (as Python strings), and they will be displayed
in the final output. Let’s add a title to the panel component you
created above and assign it to a separate variable:
In [32]: dashboard = pn.Column(
             "# Products dashboard",
             "Select sales channel or product category using the menus below:",
             category_dropdown,
             channel_dropdown,
             products_scatterplot
         )
Now, if you want to view the dashboard variable, you can either
inspect it in a separate cell or use the show method to display the
component in a separate browser tab:
In [33]: dashboard.show()
When you run the code above, a new tab should open in your
browser, showing the scatter plot and dropdown menus as a
separate web page (i.e., outside of your Jupyter notebook). This
page still depends on your Jupyter notebook; to view this web
page, your notebook needs to be running. However, you can save
this page (including the interactive plot and dropdown menus) as
a standalone HTML file with the save method:
In [34]: dashboard.save('Products dashboard.html', embed=True)
The first argument to save specifies the name (or path) of the file
you want to save the dashboard to, and the embed=True argument
tells panel to embed all the different menu options inside the
HTML file — in this case, you save the dashboard to a file called
“Products dashboard.html” in your project folder. When you run
the code above, panel cycles through different values of the two
dropdown menus and saves each generated plot inside the HTML
file. Depending on how complex your dashboard is, this process
can take some time, but the output is a standalone file that doesn’t
depend on your Jupyter notebook and that you can share with
others.
Summary
This chapter showed you how to answer the sales analysis questions
using pandas. It also briefly showed you how to visualize your
results and export them from your Jupyter notebook to HTML files
that you can share with others.
This chapter marks the end of the analysis project, and the end
of all Python coding in this book. The following chapter wraps
things up and points you to a few resources that can guide your
Python learning further.
Next steps
This concludes our tour of Python. I hope you now have a clear
idea of how Python can help in your accounting work.
You already know that the pandas library can read and write Excel
files. However, in the previous chapters, I didn’t mention that
in its own code, pandas uses other Python libraries that enable
it to interact with Excel files. These libraries are openpyxl and
xlsxwriter.6
6: Because pandas uses these libraries in its code, they’re already
installed on your computer.
Both of these libraries provide advanced methods for interacting
with spreadsheets. For instance, with openpyxl, you can set
individual cell values in your Excel tables:
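In [35]: from openpyxl import load_workbook

         wb = load_workbook(filename='MyExcelFile.xlsx')
         ws = wb.active            # the workbook's active sheet
         ws['A1'] = 'New value'    # placeholder cell and value, for illustration only
         wb.save('MyExcelFile.xlsx')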
If you work with PDF files often, you’ve probably already wondered
if you can use Python to automate some of that work. The answer
is yes: several Python libraries can help, but my favorite is pikepdf.
With pikepdf you can edit, slice, transform, and extract text from
PDF files. Unfortunately, it’s not part of the Anaconda distribution,
so you’ll have to install it yourself using Anaconda Navigator.8
8: You can read more about pikepdf at github.com/pikepdf/pikepdf.
Statistics, machine learning, and forecasting tools
One of the most popular Python toolkits for statistics is the scipy11
library. The scipy library is a bundle of functions for mathematics,
statistics, and engineering. If you ever need to run statistical tests
on your data, you should look up methods in scipy’s stats
submodule.12
11: Pronounced "Sigh Pie".
12: scipy.org.