
Python Data Wrangling Tutorial: Cryptocurrency Edition

elitedatascience.com/python-data-wrangling-tutorial

January 30, 2018

Bitcoin and cryptocurrency have been all the rage… but as data scientists, we’re
empiricists, right? We don’t want to just take others’ word for it… we want to look at the
data firsthand! In this tutorial, we’ll introduce common and powerful techniques for data
wrangling in Python.

Broadly speaking, data wrangling is the process of reshaping, aggregating, separating, or otherwise transforming your data from one format to a more useful one.

For example, let's say we wanted to run a step-forward analysis of a very rudimentary momentum trading strategy that goes as follows:

1. At the start of every month, we buy the cryptocurrency that had the largest price
gain over the previous 7, 14, 21, or 28 days. We want to evaluate each of these
time windows.
2. Then, we hold for exactly 7 days and sell our position. Please note: this is a
purposefully simple strategy that is only meant for illustrative purposes.

How would we go about evaluating this strategy?

This is a great question for showcasing data wrangling techniques because all the hard
work lies in molding your dataset into the proper format. Once you have the appropriate
analytical base table (ABT), answering the question becomes simple.

What this guide is not:

This is not a guide about investment or trading strategies, nor is it an endorsement for or
against cryptocurrency. Potential investors should form their own views independently, but
this guide will introduce tools for doing so.

Again, the focus of this tutorial is on data wrangling techniques and the ability to
transform raw datasets into formats that help you answer interesting questions.

A quick tip before we begin:

This tutorial is designed to be streamlined, and it won’t cover any one topic in too
much detail. It may be helpful to have the Pandas library documentation open beside
you as a supplemental reference.

Python Data Wrangling Tutorial Contents

Here are the steps we'll take for our analysis:

Step 1: Set up your environment.
Step 2: Import libraries and dataset.
Step 3: Understand the data.
Step 4: Filter unwanted observations.
Step 5: Pivot the dataset.
Step 6: Shift the pivoted dataset.
Step 7: Melt the shifted dataset.
Step 8: Reduce-merge the melted data.
Step 9: (Optional) Aggregate with group-by.

Step 1: Set up your environment.


First, make sure you have the following installed on your computer:

Python 2.7+ or Python 3
Pandas
Jupyter Notebook (optional, but recommended)

We strongly recommend installing the Anaconda Distribution, which comes with all of
those packages. Simply follow the instructions on that download page.

Once you have Anaconda installed, simply start Jupyter (either through the command line
or the Navigator app) and open a new notebook:

Python 3 or Python 2.7+ are both fine.

Step 2: Import libraries and dataset.

Let's start by importing Pandas, the best Python library for wrangling relational (i.e. table-
format) datasets. Pandas will be doing most of the heavy lifting for this tutorial.

Tip: we'll give Pandas an alias. Later, we can invoke the library with pd.

Pandas
Python

# Pandas for managing datasets
import pandas as pd

Next, let's tweak the display options a bit. First, let's display floats with 2 decimal places
to make tables less crowded. Don't worry... this is only a display setting that doesn't
reduce the underlying precision. Let's also expand the limits for the number of rows and
columns displayed.

Pandas display settings
Python

# Display floats with 2 decimal places
pd.options.display.float_format = '{:,.2f}'.format

# Expand display limits
pd.options.display.max_rows = 200
pd.options.display.max_columns = 100

For this tutorial, we'll be using a price dataset managed by Brave New Coin and
distributed on Quandl. The full version tracks price indices for 1,900+ fiat-crypto trading
pairs, but it requires a premium subscription, so we've provided a small sample with a
handful of cryptocurrencies.

To follow along, you can download BNC2_sample.csv. Clicking that link will take you to
Google Drive, and then simply click the download icon in the top right:

Once you've downloaded the dataset and put it in the same directory as your Jupyter notebook, you can run the following code to read the dataset into a Pandas dataframe and display example observations.

Read sample dataset

Python

# Read BNC2 sample dataset
df = pd.read_csv('BNC2_sample.csv',
                 names=['Code', 'Date', 'Open', 'High', 'Low',
                        'Close', 'Volume', 'VWAP', 'TWAP'])

# Display first 5 observations
df.head()

Note that we use the names= argument for pd.read_csv() to set our own column names
because the original dataset does not have any.

Data Dictionary (for code GWA_BTC):

Date: The day on which the index values were calculated.
Open: The day's opening price index for Bitcoin in US dollars.
High: The highest value for the price index for Bitcoin in US dollars that day.
Low: The lowest value for the price index for Bitcoin in US dollars that day.
Close: The day's closing price index for Bitcoin in US dollars.
Volume: The volume of Bitcoin traded that day.
VWAP: The volume weighted average price of Bitcoin traded that day.
TWAP: The time-weighted average price of Bitcoin traded that day.

Step 3: Understand the data.


One of the most common reasons to wrangle data is when there's "too much" information
packed into a single table, especially when dealing with time series data.

Generally, all observations should be equivalent in granularity and in units.

There will be exceptions, but for the most part, this rule of thumb can save you from many
headaches.

Equivalence in Granularity - For example, you could have 10 rows of data from 10
different cryptocurrencies. However, you should not have an 11th row with average
or total values from the other 10 rows. That 11th row would be an aggregation, and
thus not equivalent in granularity to the other 10.

Equivalence in Units - You could have 10 rows with prices in USD collected at
different dates. However, you should not then have another 10 rows with prices
quoted in EUR. Any aggregations, distributions, visualizations, or statistics would
become meaningless.

Our current raw dataset breaks both of these rules!

Data stored in CSV files or databases are often in “stacked” or “record” format. They use
a single 'Code' column as a catch-all for metadata. For example, in the sample dataset,
we have the following codes:

Unique codes in the dataset


Python

# Unique codes in the dataset
print( df.Code.unique() )

# ['GWA_BTC' 'GWA_ETH' 'GWA_LTC' 'GWA_XLM' 'GWA_XRP' 'MWA_BTC_CNY'
#  'MWA_BTC_EUR' 'MWA_BTC_GBP' 'MWA_BTC_JPY' 'MWA_BTC_USD' 'MWA_ETH_CNY'
#  'MWA_ETH_EUR' 'MWA_ETH_GBP' 'MWA_ETH_JPY' 'MWA_ETH_USD' 'MWA_LTC_CNY'
#  'MWA_LTC_EUR' 'MWA_LTC_GBP' 'MWA_LTC_JPY' 'MWA_LTC_USD' 'MWA_XLM_CNY'
#  'MWA_XLM_EUR' 'MWA_XLM_USD' 'MWA_XRP_CNY' 'MWA_XRP_EUR' 'MWA_XRP_GBP'
#  'MWA_XRP_JPY' 'MWA_XRP_USD']

First, see how some codes start with GWA and others with MWA? These are actually
completely different types of indicators according to the documentation page.

MWA stands for "market-weighted average," and they show regional prices. There
are multiple MWA codes for each cryptocurrency, one for each local fiat currency.
On the other hand, GWA stands for "global-weighted average," which shows
globally indexed prices. GWA is thus an aggregation of MWA and not equivalent in
granularity. (Note: only a subset of regional MWA codes are included in the sample
dataset.)

For instance, let's look at Bitcoin's codes on the same date:

Example of GWA and MWA relationship


Python

# Example of GWA and MWA relationship
df[ df.Code.isin(['GWA_BTC', 'MWA_BTC_JPY', 'MWA_BTC_EUR'])
    & (df.Date == '2018-01-01') ]

As you can see, we have multiple entries for a cryptocurrency on a given date. To further
complicate things, the regional MWA data are denominated in their local currency (i.e.
nonequivalent units), so you would also need historical exchange rates.

Having different levels of granularity and/or different units makes analysis unwieldy at
best, or downright impossible at worst.

Luckily, once we've spotted this issue, fixing it is actually trivial!

Step 4: Filter unwanted observations.


One of the simplest yet most useful data wrangling techniques is removing unwanted
observations.

In the previous step, we learned that GWA codes are aggregations of the regional MWA
codes. Therefore, to perform our analysis, we only need to keep the global GWA codes:

Filter out MWA codes


Python

# Number of observations in dataset
print( 'Before:', len(df) )
# Before: 31761

# Get all the GWA codes
gwa_codes = [code for code in df.Code.unique() if 'GWA_' in code]

# Only keep GWA observations
df = df[df.Code.isin(gwa_codes)]

# Number of observations left
print( 'After:', len(df) )
# After: 6309

Now that we only have GWA codes left, all of our observations are equivalent in
granularity and in units. We can confidently proceed.

Step 5: Pivot the dataset.


Next, in order to analyze our momentum trading strategy outlined above, we'll need to calculate returns over the prior 7, 14, 21, and 28 days for each cryptocurrency... for the first day of each month.

However, it would be a huge pain to do so with the current "stacked" dataset. It would
involve writing helper functions, loops, and plenty of conditional logic. Instead, we'll take a
more elegant approach....

First, we'll pivot the dataset while keeping only one price column. For this tutorial, let's
keep the VWAP (volume weighted average price) column, but you could make a good
case for most of them.

Pivot dataset
Python

# Pivot dataset
pivoted_df = df.pivot(index='Date', columns='Code', values='VWAP')

# Display examples from pivoted dataset
pivoted_df.tail()

As you can see, each column in our pivoted dataset now represents the price for one
cryptocurrency and each row contains prices from one date. All the features are now
aligned by date.
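If you'd like to double-check the result yourself, here's a quick optional sketch (not required for the analysis) that confirms there is one column per GWA code and reveals any gaps (NaNs) the pivot may have introduced:

Check the pivoted dataset
Python

# Optional sanity check on the pivoted dataset:
# one column per GWA code, and NaNs flag dates where a coin has no price data
print( pivoted_df.shape )             # (number of dates, number of coins)
print( pivoted_df.columns.tolist() )  # should list only GWA codes
print( pivoted_df.isnull().sum() )    # NaN count per coin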

Step 6: Shift the pivoted dataset.


To easily calculate returns over the prior 7, 14, 21, and 28 days, we can use Pandas's
shift method.

This method shifts the values of the dataframe by some number of periods along its index. For example, here's what happens when we shift our pivoted dataset by 1:

Shift method
Python

print( pivoted_df.tail(3) )
# Code         GWA_BTC   GWA_ETH   GWA_LTC   GWA_XLM   GWA_XRP
# Date
# 2018-01-21  12,326.23  1,108.90    197.36      0.48      1.55
# 2018-01-22  11,397.52  1,038.21    184.92      0.47      1.43
# 2018-01-23  10,921.00    992.05    176.95      0.47      1.42

print( pivoted_df.tail(3).shift(1) )
# Code         GWA_BTC   GWA_ETH   GWA_LTC   GWA_XLM   GWA_XRP
# Date
# 2018-01-21        nan       nan       nan       nan       nan
# 2018-01-22  12,326.23  1,108.90    197.36      0.48      1.55
# 2018-01-23  11,397.52  1,038.21    184.92      0.47      1.43

Notice how the shifted dataset now has values from 1 day before? We can take
advantage of this to calculate prior returns for our 7, 14, 21, 28 day windows.

For example, to calculate returns over the 7 days prior, we would need prices_today /
prices_7_days_ago - 1.0, which translates to:

Calculate returns over 7 days prior


Python

# Calculate returns over 7 days prior
delta_7 = pivoted_df / pivoted_df.shift(7) - 1.0

# Display examples
delta_7.tail()

Calculating returns for all of our windows is as easy as writing a loop and storing them in
a dictionary:

Loop over time windows


Python

# Calculate returns over each window and store them in dictionary
delta_dict = {}
for offset in [7, 14, 21, 28]:
    delta_dict['delta_{}'.format(offset)] = pivoted_df / pivoted_df.shift(offset) - 1.0

Note: Calculating returns by shifting the dataset requires 2 assumptions to be met: (1) the
observations are sorted ascending by date and (2) there are no missing dates. We
checked this "off-stage" to keep this tutorial concise, but we recommend confirming this
on your own.
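If you'd like to confirm those assumptions yourself, here's one minimal way to check them (a quick sketch, assuming the index holds daily date strings):

Check shift assumptions
Python

# A minimal sketch for confirming both assumptions:
# (1) dates are sorted ascending, (2) no calendar days are missing
dates = pd.to_datetime(pivoted_df.index)

print( dates.is_monotonic_increasing )                               # True if sorted ascending
print( pd.date_range(dates.min(), dates.max()).difference(dates) )   # empty if no days are missing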

Step 7: Melt the shifted dataset.


Now that we've calculated returns using the pivoted dataset, we're going to "unpivot" the
returns. By unpivoting, or melting the data, we can later create an analytical base table
(ABT) where each row contains all of the relevant information for a particular coin on a
particular date.

We couldn't directly shift the original dataset because the data for different coins were
stacked on each other, so the boundaries would've overlapped. In other words, BTC data
would leak into ETH calculations, ETH data would leak into LTC calculations, and so on.

To melt the data, we'll...

reset_index() so we can call the columns by name.
Call the melt() method.
Pass the column(s) to keep into the id_vars= argument.
Name the melted column using the value_name= argument.

Here's how that looks for one dataframe:

Melt shifted data


Python

# Melt delta_7 returns
melted_7 = delta_7.reset_index().melt(id_vars=['Date'], value_name='delta_7')

# Melted dataframe examples
melted_7.tail()

To do so for all of the returns dataframes, we can simply loop through delta_dict, like so:

Melt all the delta dataframes
Python

# Melt all the delta dataframes and store in list
melted_dfs = []
for key, delta_df in delta_dict.items():
    melted_dfs.append( delta_df.reset_index().melt(id_vars=['Date'], value_name=key) )

Finally, we can create another melted dataframe that contains the forward-looking 7-day
returns. This will be our "target variable" for evaluating our trading strategy.

Simply shift the pivoted dataset by -7 to get "future" prices, like so:

Calculate forward-looking 7-day returns


Python

# Calculate 7-day returns after the date
return_df = pivoted_df.shift(-7) / pivoted_df - 1.0

# Melt the return dataset and append to list
melted_dfs.append( return_df.reset_index().melt(id_vars=['Date'], value_name='return_7') )

We now have 5 melted dataframes stored in the melted_dfs list, one for each of the
backward-looking 7, 14, 21, and 28-day returns and one for the forward-looking 7-day
returns.
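If you'd like to verify that before merging, a quick peek at the list (a small optional sketch) might look like this:

Inspect the melted dataframes
Python

# Quick sanity check: 5 melted dataframes, each with 'Date', 'Code',
# and one value column (delta_7 ... delta_28, return_7)
print( len(melted_dfs) )
for melted_df in melted_dfs:
    print( melted_df.columns.tolist() )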

Step 8: Reduce-merge the melted data.


All that's left to do is join our melted dataframes into a single analytical base table. We'll
need two tools.

The first is Pandas's merge function, which works like SQL JOIN. For example, to merge
the first two melted dataframes...

Merge two dataframes
Python

# Merge two dataframes
pd.merge(melted_dfs[0], melted_dfs[1], on=['Date', 'Code']).tail()

See how we now have delta_7 and delta_14 in the same row? This is the start of our
analytical base table. All we need to do now is merge all of our melted dataframes
together with a base dataframe of other features we might want.

The most elegant way to do this is using Python's built-in reduce function. First we'll need
to import it:

Import reduce function


Python

from functools import reduce

Next, before we use that function, let's create a feature_dfs list that contains base
features from the original dataset plus the melted datasets.

Create feature_dfs list


Python

# Grab features from original dataset
base_df = df[['Date', 'Code', 'Volume', 'VWAP']]

# Create a list with all the feature dataframes
feature_dfs = [base_df] + melted_dfs

Now we're ready to use the reduce function. Reduce applies a function of two arguments
cumulatively to the objects in a sequence (e.g. a list). For example, reduce(lambda x,y:
x+y, [1,2,3,4,5]) calculates ((((1+2)+3)+4)+5).

Thus, we can reduce-merge all of the features like so:

Reduce-merge features into ABT
Python

# Reduce-merge features into analytical base table
abt = reduce(lambda left,right: pd.merge(left,right,on=['Date', 'Code']), feature_dfs)

# Display examples from the ABT
abt.tail(10)

Data Dictionary for our Analytical Base Table (ABT):

Date: The day on which the index values were calculated.
Code: Which cryptocurrency.
Volume: The volume of the cryptocurrency traded that day.
VWAP: The volume weighted average price traded that day.
delta_7: Return over the prior 7 days (1.0 = 100% return).
delta_14: Return over the prior 14 days (1.0 = 100% return).
delta_21: Return over the prior 21 days (1.0 = 100% return).
delta_28: Return over the prior 28 days (1.0 = 100% return).
return_7: Future return over the next 7 days (1.0 = 100% return).

By the way, notice how the last 7 observations don't have values for the
'return_7' feature? This is expected, as we cannot calculate "future 7-day returns" for the
last 7 days of the dataset.

Technically, with this ABT, we can already answer our original question. For example, if we wanted to pick the coin that had the biggest momentum on September 1st, 2017, we could simply display the rows for that date and look at the 7, 14, 21, and 28-day prior returns:

Data from Sept 1st, 2017
Python

# Data from Sept 1st, 2017
abt[abt.Date == '2017-09-01']

And if you wanted to programmatically pick the crypto with the biggest momentum (e.g.
over the prior 28 days), you would write:

Programmatically pick highest momentum


Python

max_momentum_id = abt[abt.Date == '2017-09-01'].delta_28.idxmax()
abt.loc[max_momentum_id, ['Code', 'return_7']]
# Code        GWA_LTC
# return_7      -0.10
# Name: 3543, dtype: object

However, since we're only interested in trading on the first day of each month, we can
make things even easier for ourselves...

Step 9: (Optional) Aggregate with group-by.


As a final step, if we wanted to only keep the first days of each month, we can use
a group-by followed by an aggregation.

1. First, create a new 'month' feature from the first 7 characters of the Date strings.
2. Then, group the observations by 'Code' and by 'month'. Pandas will create "cells"
of data that separate observations by Code and month.
3. Finally, within each group, simply take the .first() observation and reset the index.

Note: We're assuming your dataframe is still properly sorted by date.

Here's what it looks like all put together:

Aggregate with group-by


Python

# Create 'month' feature
abt['month'] = abt.Date.apply(lambda x: x[:7])

# Group by 'Code' and 'month' and keep first date
gb_df = abt.groupby(['Code', 'month']).first().reset_index()

# Display examples
gb_df.tail()

As you can see, we now have a proper ABT with:

Only relevant data from the 1st day of each month.
Momentum features calculated from the prior 7, 14, 21, and 28 days.
The future returns you would've made 7 days later.

In other words, we have exactly what we need to evaluate the simple trading strategy we
proposed at the beginning!
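For instance, here's one possible sketch of that evaluation, assuming we trade on the 28-day momentum window and skip months where the forward-looking return is unavailable (the names eval_df and picks are just for illustration):

Evaluate the strategy (sketch)
Python

# A minimal sketch of one possible evaluation, assuming the 28-day window
# and dropping months without a forward-looking return
eval_df = gb_df.dropna(subset=['delta_28', 'return_7'])

# For each month, pick the coin with the largest prior 28-day return...
picks = eval_df.loc[ eval_df.groupby('month').delta_28.idxmax() ]

# ...and look at the 7-day returns we would've earned on those picks
print( picks[['month', 'Code', 'delta_28', 'return_7']] )
print( 'Average 7-day return:', picks.return_7.mean() )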

Congratulations... you've made it to the end of this Python data wrangling tutorial!

We introduced several key tools for filtering, manipulating, and transforming datasets in
Python, but we've only scratched the surface. Pandas is a very powerful library with
plenty of additional functionality.

For continued learning, we recommend downloading more datasets for hands-on practice. Propose an interesting question, plan your approach, and fall back on documentation for help.

The complete code, from start to finish.


Here's all the main code in one place, in a single script.

Complete script
Python

# 2. Import libraries and dataset
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
pd.options.display.max_rows = 200
pd.options.display.max_columns = 100

df = pd.read_csv('BNC2_sample.csv',
                 names=['Code', 'Date', 'Open', 'High', 'Low',
                        'Close', 'Volume', 'VWAP', 'TWAP'])

# 4. Filter unwanted observations
gwa_codes = [code for code in df.Code.unique() if 'GWA_' in code]
df = df[df.Code.isin(gwa_codes)]

# 5. Pivot the dataset
pivoted_df = df.pivot(index='Date', columns='Code', values='VWAP')

# 6. Shift the pivoted dataset
delta_dict = {}
for offset in [7, 14, 21, 28]:
    delta_dict['delta_{}'.format(offset)] = pivoted_df / pivoted_df.shift(offset) - 1.0

# 7. Melt the shifted dataset
melted_dfs = []
for key, delta_df in delta_dict.items():
    melted_dfs.append( delta_df.reset_index().melt(id_vars=['Date'], value_name=key) )

return_df = pivoted_df.shift(-7) / pivoted_df - 1.0
melted_dfs.append( return_df.reset_index().melt(id_vars=['Date'], value_name='return_7') )

# 8. Reduce-merge the melted data
from functools import reduce

base_df = df[['Date', 'Code', 'Volume', 'VWAP']]
feature_dfs = [base_df] + melted_dfs

abt = reduce(lambda left,right: pd.merge(left,right,on=['Date', 'Code']), feature_dfs)

# 9. Aggregate with group-by
abt['month'] = abt.Date.apply(lambda x: x[:7])
gb_df = abt.groupby(['Code', 'month']).first().reset_index()
