Python Data Wrangling Tutorial With Pandas
Bitcoin and cryptocurrency have been all the rage… but as data scientists, we’re
empiricists, right? We don’t want to just take others’ word for it… we want to look at the
data firsthand! In this tutorial, we’ll introduce common and powerful techniques for data
wrangling in Python.
For example, let’s say we wanted to run a step-forward analysis of a very rudimentary
momentum trading strategy that goes as follows:
1. At the start of every month, we buy the cryptocurrency that had the largest price
gain over the previous 7, 14, 21, or 28 days. We want to evaluate each of these
time windows.
2. Then, we hold for exactly 7 days and sell our position. Please note: this is a
purposefully simple strategy that is only meant for illustrative purposes.
This is a great question for showcasing data wrangling techniques because all the hard
work lies in molding your dataset into the proper format. Once you have the appropriate
analytical base table (ABT), answering the question becomes simple.
This is not a guide about investment or trading strategies, nor is it an endorsement for or
against cryptocurrency. Potential investors should form their own views independently, but
this guide will introduce tools for doing so.
Again, the focus of this tutorial is on data wrangling techniques and the ability to
transform raw datasets into formats that help you answer interesting questions.
This tutorial is designed to be streamlined, and it won’t cover any one topic in too
much detail. It may be helpful to have the Pandas library documentation open beside
you as a supplemental reference.
We strongly recommend installing the Anaconda Distribution, which comes with Python,
Pandas, and Jupyter, everything you'll need for this tutorial. Simply follow the instructions
on that download page.
Once you have Anaconda installed, simply start Jupyter (either through the command line
or the Navigator app) and open a new notebook:
Let's start by importing Pandas, the best Python library for wrangling relational (i.e. table-
format) datasets. Pandas will be doing most of the heavy lifting for this tutorial.
Tip: we'll give Pandas an alias. Later, we can invoke the library with pd.
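Here's a minimal version of that import (it's also the first line of the complete script at the end of this tutorial):

import pandas as pd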
Next, let's tweak the display options a bit. First, let's display floats with 2 decimal places
to make tables less crowded. Don't worry... this is only a display setting that doesn't
reduce the underlying precision. Let's also expand the limits for the number of rows and
columns displayed.
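These are the same display settings used in the complete script at the end of this tutorial:

# Display floats with 2 decimal places and relax the row/column display limits
pd.options.display.float_format = '{:,.2f}'.format
pd.options.display.max_rows = 200
pd.options.display.max_columns = 100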
For this tutorial, we'll be using a price dataset managed by Brave New Coin and
distributed on Quandl. The full version tracks price indices for 1,900+ fiat-crypto trading
pairs, but it requires a premium subscription, so we've provided a small sample with a
handful of cryptocurrencies.
To follow along, you can download BNC2_sample.csv. Clicking that link will take you to
Google Drive, and then simply click the download icon in the top right:
Once you've downloaded the dataset and put it in the same directory as your Jupyter
notebook, you can run the following code to read the dataset into a Pandas dataframe
and display example observations.
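The read_csv() call below mirrors the complete script at the end of this tutorial; the df.head() line is simply one way to display example observations:

# Read the BNC2 sample dataset, setting our own column names
df = pd.read_csv('BNC2_sample.csv',
                 names=['Code', 'Date', 'Open', 'High', 'Low',
                        'Close', 'Volume', 'VWAP', 'TWAP'])

# Display example observations
df.head()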
Note that we use the names= argument for pd.read_csv() to set our own column names
because the original dataset does not have any.
As a rule of thumb, every observation (row) in your dataset should be equivalent in
granularity and in units. There will be exceptions, but for the most part, this rule of thumb
can save you from many headaches.
Equivalence in Granularity - For example, you could have 10 rows of data from 10
different cryptocurrencies. However, you should not have an 11th row with average
or total values from the other 10 rows. That 11th row would be an aggregation, and
thus not equivalent in granularity to the other 10.
Equivalence in Units - You could have 10 rows with prices in USD collected at
different dates. However, you should not then have another 10 rows with prices
quoted in EUR. Any aggregations, distributions, visualizations, or statistics would
become meaningless.
Data stored in CSV files or databases are often in “stacked” or “record” format. They use
a single 'Code' column as a catch-all for metadata. For example, in the sample dataset,
we have the following codes:
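To list them yourself, one option is to print the unique values of the 'Code' column (output not shown here):

# Distinct codes in the sample dataset
print( df.Code.unique() )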
First, see how some codes start with GWA and others with MWA? These are actually
completely different types of indicators according to the documentation page.
MWA stands for "market-weighted average," and they show regional prices. There
are multiple MWA codes for each cryptocurrency, one for each local fiat currency.
On the other hand, GWA stands for "global-weighted average," which shows
globally indexed prices. GWA is thus an aggregation of MWA and not equivalent in
granularity. (Note: only a subset of regional MWA codes are included in the sample
dataset.)
As you can see, we have multiple entries for a cryptocurrency on a given date. To further
complicate things, the regional MWA data are denominated in their local currency (i.e.
nonequivalent units), so you would also need historical exchange rates.
Having different levels of granularity and/or different units makes analysis unwieldy at
best, or downright impossible at worst.
In the previous step, we learned that GWA codes are aggregations of the regional MWA
codes. Therefore, to perform our analysis, we only need to keep the global GWA codes:
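This filter is the same one used in the complete script at the end of this tutorial:

# Keep only the global (GWA) codes
gwa_codes = [code for code in df.Code.unique() if 'GWA_' in code]
df = df[df.Code.isin(gwa_codes)]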
Now that we only have GWA codes left, all of our observations are equivalent in
granularity and in units. We can confidently proceed.
However, calculating prior returns over our 7, 14, 21, and 28-day windows would be a huge
pain with the current "stacked" dataset. It would involve writing helper functions, loops, and
plenty of conditional logic. Instead, we'll take a more elegant approach...
First, we'll pivot the dataset while keeping only one price column. For this tutorial, let's
keep the VWAP (volume weighted average price) column, but you could make a good
case for most of them.
Pivot dataset
# Pivot dataset
pivoted_df = df.pivot(index='Date', columns='Code', values='VWAP')

# Display examples from pivoted dataset
pivoted_df.tail()
As you can see, each column in our pivoted dataset now represents the price for one
cryptocurrency and each row contains prices from one date. All the features are now
aligned by date.
Next, we'll use the shift() method, which shifts the index of a dataframe by some number of
periods. For example, here's what happens when we shift our pivoted dataset by 1:
Shift method
print( pivoted_df.tail(3) )
# Code         GWA_BTC  GWA_ETH  GWA_LTC  GWA_XLM  GWA_XRP
# Date
# 2018-01-21 12,326.23 1,108.90   197.36     0.48     1.55
# 2018-01-22 11,397.52 1,038.21   184.92     0.47     1.43
# 2018-01-23 10,921.00   992.05   176.95     0.47     1.42

print( pivoted_df.tail(3).shift(1) )
# Code         GWA_BTC  GWA_ETH  GWA_LTC  GWA_XLM  GWA_XRP
# Date
# 2018-01-21       nan      nan      nan      nan      nan
# 2018-01-22 12,326.23 1,108.90   197.36     0.48     1.55
# 2018-01-23 11,397.52 1,038.21   184.92     0.47     1.43
Notice how the shifted dataset now has values from 1 day before? We can take
advantage of this to calculate prior returns for our 7, 14, 21, 28 day windows.
For example, to calculate returns over the 7 days prior, we would need prices_today /
prices_7_days_ago - 1.0, which translates to:
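As a quick sketch (the delta_7 name is just illustrative):

# Prior 7-day returns: today's prices divided by prices from 7 days ago, minus 1
delta_7 = pivoted_df / pivoted_df.shift(7) - 1.0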
Calculating returns for all of our windows is as easy as writing a loop and storing them in
a dictionary:
# Calculate returns over each window and store them in dictionary
delta_dict = {}
for offset in [7, 14, 21, 28]:
    delta_dict['delta_{}'.format(offset)] = pivoted_df / pivoted_df.shift(offset) - 1.0
Note: Calculating returns by shifting the dataset requires 2 assumptions to be met: (1) the
observations are sorted ascending by date and (2) there are no missing dates. We
checked this "off-stage" to keep this tutorial concise, but we recommend confirming this
on your own.
We couldn't directly shift the original dataset because the data for different coins were
stacked on each other, so the boundaries would've overlapped. In other words, BTC data
would leak into ETH calculations, ETH data would leak into LTC calculations, and so on.
To melt each of the returns dataframes back into "stacked" format, we can simply loop through delta_dict, like so:
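(This is the same loop that appears in the complete script at the end of this tutorial.)

# Melt each returns dataframe back into "stacked" format
melted_dfs = []
for key, delta_df in delta_dict.items():
    melted_dfs.append( delta_df.reset_index().melt(id_vars=['Date'], value_name=key) )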
Finally, we can create another melted dataframe that contains the forward-looking 7-day
returns. This will be our "target variable" for evaluating our trading strategy.
Simply shift the pivoted dataset by -7 to get "future" prices, like so:
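(Again, this mirrors the complete script at the end of the tutorial.)

# Calculate forward-looking 7-day returns and melt into "stacked" format
return_df = pivoted_df.shift(-7) / pivoted_df - 1.0
melted_dfs.append( return_df.reset_index().melt(id_vars=['Date'], value_name='return_7') )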
We now have 5 melted dataframes stored in the melted_dfs list, one for each of the
backward-looking 7, 14, 21, and 28-day returns and one for the forward-looking 7-day
returns.
The first tool we'll need is Pandas's merge() function, which works like a SQL JOIN. For
example, to merge the first two melted dataframes...
Merge two dataframes
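A sketch of that merge; the .tail() call is just one way to display a few example rows:

# Merge the first two melted dataframes on their shared keys
pd.merge(melted_dfs[0], melted_dfs[1], on=['Date', 'Code']).tail()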
See how we now have delta_7 and delta_14 in the same row? This is the start of our
analytical base table. All we need to do now is merge all of our melted dataframes
together with a base dataframe of other features we might want.
The most elegant way to do this is using Python's built-in reduce function. First we'll need
to import it:
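As in the complete script at the end of this tutorial:

from functools import reduce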
Next, before we use that function, let's create a feature_dfs list that contains base
features from the original dataset plus the melted datasets.
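These two lines come from the complete script at the end of this tutorial:

# Base features from the original dataset, plus all of the melted dataframes
base_df = df[['Date', 'Code', 'Volume', 'VWAP']]
feature_dfs = [base_df] + melted_dfs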
Now we're ready to use the reduce function. Reduce applies a function of two arguments
cumulatively to the objects in a sequence (e.g. a list). For example, reduce(lambda x,y:
x+y, [1,2,3,4,5]) calculates ((((1+2)+3)+4)+5).
Reduce-merge features into ABT
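The reduce-merge below is taken from the complete script at the end of this tutorial; the abt.tail(10) line is just one way to display the last few observations:

# Merge all features into one analytical base table (ABT)
abt = reduce(lambda left, right: pd.merge(left, right, on=['Date', 'Code']), feature_dfs)

# Display the last few observations
abt.tail(10)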
By the way, notice how the last 7 observations don't have values for the
'return_7' feature? This is expected, as we cannot calculate "future 7-day returns" for the
last 7 days of the dataset.
Technically, with this ABT, we can already answer our original objective. For example, if
we wanted to pick the coin that had the biggest momentum on September 1st, 2017, we
could simply display the rows for that date and look at the 7, 14, 21, and 28-day prior
returns:
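One way to do that, assuming dates are stored as 'YYYY-MM-DD' strings (as they are in this dataset):

# Display all observations from September 1st, 2017
abt[abt.Date == '2017-09-01']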
And if you wanted to programmatically pick the crypto with the biggest momentum (e.g.
over the prior 28 days), you would write:
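Here's one possible sketch (the day_df name is just for this example):

# Filter to the trade date, then take the row with the largest 28-day prior return
day_df = abt[abt.Date == '2017-09-01']
day_df.loc[day_df.delta_28.idxmax()]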
However, since we're only interested in trading on the first day of each month, we can
make things even easier for ourselves...
1. First, create a new 'month' feature from the first 7 characters of the Date strings.
2. Then, group the observations by 'Code' and by 'month'. Pandas will create "cells"
of data that separate observations by Code and month.
3. Finally, within each group, simply take the .first() observation and reset the index.
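Putting those three steps together, exactly as in the complete script at the end of this tutorial:

# 1. Create a 'month' feature from the first 7 characters of the Date strings
abt['month'] = abt.Date.apply(lambda x: x[:7])

# 2-3. Group by Code and month, then keep the first observation in each group
gb_df = abt.groupby(['Code', 'month']).first().reset_index()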
As you can see, we now have a proper ABT, with one row per coin per month (its first observation of that month), complete with the prior 7, 14, 21, and 28-day returns and the forward-looking 7-day return.
In other words, we have exactly what we need to evaluate the simple trading strategy we
proposed at the beginning!
Congratulations... you've made it to the end of this Python data wrangling tutorial!
We introduced several key tools for filtering, manipulating, and transforming datasets in
Python, but we've only scratched the surface. Pandas is a very powerful library with
plenty of additional functionality.
Complete script
# 2. Import libraries and dataset
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
pd.options.display.max_rows = 200
pd.options.display.max_columns = 100

df = pd.read_csv('BNC2_sample.csv',
                 names=['Code', 'Date', 'Open', 'High', 'Low',
                        'Close', 'Volume', 'VWAP', 'TWAP'])

# 4. Filter unwanted observations
gwa_codes = [code for code in df.Code.unique() if 'GWA_' in code]
df = df[df.Code.isin(gwa_codes)]

# 5. Pivot the dataset
pivoted_df = df.pivot(index='Date', columns='Code', values='VWAP')

# 6. Shift the pivoted dataset
delta_dict = {}
for offset in [7, 14, 21, 28]:
    delta_dict['delta_{}'.format(offset)] = pivoted_df / pivoted_df.shift(offset) - 1

# 7. Melt the shifted dataset
melted_dfs = []
for key, delta_df in delta_dict.items():
    melted_dfs.append( delta_df.reset_index().melt(id_vars=['Date'], value_name=key) )

return_df = pivoted_df.shift(-7) / pivoted_df - 1.0
melted_dfs.append( return_df.reset_index().melt(id_vars=['Date'], value_name='return_7') )

# 8. Reduce-merge the melted data
from functools import reduce

base_df = df[['Date', 'Code', 'Volume', 'VWAP']]
feature_dfs = [base_df] + melted_dfs

abt = reduce(lambda left, right: pd.merge(left, right, on=['Date', 'Code']), feature_dfs)

# 9. Aggregate with group-by
abt['month'] = abt.Date.apply(lambda x: x[:7])
gb_df = abt.groupby(['Code', 'month']).first().reset_index()