Learn Python Pandas For Data Science: Quick Tutorial and Examples For All Primary Operations of DataFrames
This is a guide to help you quickly learn and use Python pandas for data science, machine learning.
Pandas is the most popular Python package for data manipulation and analysis. Data analysts or
scientists often spend 80% of their time exploring and transforming data, which makes pandas an
essential skill for data science.
In this tutorial, you’ll learn the most useful data science-related pandas techniques.
If you’ve just learned the basics of Python, this practical tutorial with examples will help you start doing real data analysis.
Note: if you are new to Python, take our FREE Python crash course for data science.
Let’s begin!
A note before the tutorial: it’s neither necessary nor possible to memorize all the functions or methods, and there will always be others you need in the future.
We suggest you follow the tutorial to learn the essentials systematically. When you’re practicing data science, a quick Google search will often solve the problem.
You can install the Anaconda distribution to start using Python for data science, which is FREE and
includes popular packages such as pandas.
If you want to get pandas, check out How to Install/Setup Python and Prep for Data Science NOW.
If you want to install pandas and learn the basics of Python, take our FREE Python crash course.
This guide assumes you have all the basic Python knowledge.
If you already have Python but not pandas, please use the command below to install it.
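The standard pip command does the job:
pip install pandas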
Once installed, we need to import the packages for each new Python session.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
We also import the matplotlib package, which allows us to show some simple data visualizations.
Series: a 1-dimensional labeled array that can hold any data type, such as integers, strings, floating-point numbers, or Python objects.
It has row/axis labels as the index.
DataFrame: a 2-dimensional labeled data structure with columns of potentially different types.
It also has row labels as the index.
In this tutorial, we will focus on DataFrames since they need a lot more data manipulation than Series.
Dictionary to DataFrame
There are different ways to construct DataFrames from scratch. Let’s use Python dictionaries as an
example.
After a raw dataset is preprocessed in Python, it is often convenient to compile the new dataset as a
dictionary. But dictionaries are not great objects for further analysis.
This is when we need to convert this dictionary object into a pandas DataFrame.
We have a dataset for housing sales records. First, we create two lists with values of the houses’
addresses and prices. Then we’ll combine these lists into a dictionary d with key ‘address’, value l1 and
key ‘price’, value l2. We can then transform this dictionary into a pandas DataFrame df2 as below.
l1 = ['123 first st.', '234 second st.', '345 third st.', '456 fourth st.', '567 fifth st.', '678 sixth st.']
l2 = [305000, 650000, 267000, 500000, 125000, 875000]
d = {'address': l1, 'price': l2}
df2 = pd.DataFrame(data=d)
df2
This is great!
We’ve created our first pandas DataFrame. As you can see, it has six rows and two columns with an
index from 0 to 5.
Yet more often, we import data from outside sources. Let’s look at the most common source: csv files.
read_csv
Within this tutorial, we’ll mainly use two sample datasets. Both can be downloaded following the links.
One is a YouTube dataset sydney_videos.csv; the other is a transactional dataset trans.csv. They are
both datasets we’ve done analyses on, which makes this guide more practical. Please make sure the two
files are stored in the same working directory as your Python code. This saves us from specifying the
path in read_csv.
Since both datasets are saved as csv files, let’s use the pandas read_csv function to import them into Python.
For the YouTube dataset, we can use the default settings of read_csv to load it as DataFrame df0.
df0 = pd.read_csv('sydney_videos.csv')
df0
We can see that df0 is a DataFrame with 837 rows and four columns. Each of the rows is a video with
columns:
title
posted_ago (how long ago the video was posted)
length (in seconds)
views (the number of views)
We can also change the settings of the input parameters to read_csv. For example, to read the
transactional dataset, we set:
sep=';' to tell read_csv that the file uses semicolons (;) instead of the default commas (,) as the delimiter.
dtype={'k_symbol': 'string', 'bank': 'string'} to set the data types of the columns k_symbol and bank to strings.
parse_dates=['date'] to parse the column date as a datetime.
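Putting these options together, the call looks like the sketch below (we’ll refer to this transactional DataFrame as df1 for the rest of the tutorial):
df1 = pd.read_csv('trans.csv', sep=';',
                  dtype={'k_symbol': 'string', 'bank': 'string'},
                  parse_dates=['date'])
df1.head()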
Further Reading: Pandas read_csv to DataFrames: Python Pandas Tutorial
Read this complete tutorial to learn more about Python pandas read_csv. Or take the last section as a
cheat sheet.
read_sql
Another common case is when we want to import data into Python from SQL servers. We can use read_sql to read the result of a SQL query or a database table into a DataFrame.
For example, suppose we have a sample database in Microsoft SQL Server. We first set up a connection with the server using the SQLAlchemy function create_engine. Then we can run a query within Python to get the data as a DataFrame.
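A minimal sketch of that workflow, assuming a SQL Server connection through pyodbc; the connection string, table name, and DataFrame name below are placeholders, not details of the sample database:
from sqlalchemy import create_engine

# Replace the user, password, server, and database with your own connection details.
engine = create_engine('mssql+pyodbc://user:password@my_server/my_database?driver=ODBC+Driver+17+for+SQL+Server')
df_sql = pd.read_sql('SELECT * FROM my_table', engine)
df_sql.head()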
Further Readings:
How to Download and Install SQL Server and Sample Database: Step-by-Step
SQL Tutorial for Beginners: Learn SQL for Data Analysis
SQL is another essential data science skill. Read these articles to master it.
Besides CSV files and SQL databases, we can read data from a variety of sources with pandas. Learn
more with pandas’ official IO tools documentation.
At the end of the tutorial, we’ll show how to save the DataFrames on your computer.
info: summary
The info method shows a helpful summary of our DataFrames. It is one of the first methods you should
apply after loading in the dataset.
df0.info()
We can see that it prints a concise summary, including the number of rows, the column names, the count of non-null values in each column, the data types, and the memory usage.
The head method is also used in every data analysis project. With its help, we can see exactly what the first n rows of the DataFrame look like.
Let’s print the first five rows of df0 using the default settings of the method.
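With the defaults, that’s simply:
df0.head()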
The tail method returns the last n rows of the DataFrame. It can be used as a quick verification after we
sort or append rows.
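For example, we can print the last five rows of df0:
df0.tail()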
shape: dimensionality
To get the number of rows and columns of the DataFrame, we can use shape.
df0.shape
(837, 4)
The method describe generates descriptive statistics for both numeric and object columns.
For example, for the numeric column views, we can print out its summary statistics such as:
count
mean
standard deviation
percentiles.
df0['views'].describe()
For the object column posted_ago, we can see its corresponding statistics:
count
unique counts
the most common value and its frequency (top, freq).
df0['posted_ago'].describe()
When the columns are numeric, it’s useful to explore their distributions by boxplot and histogram.
df0.boxplot('views')
df0.hist('views', bins=50)
Note: we prefer the seaborn library for data science since it has more features. But these methods in
pandas are good for quick and simple plots.
Further Readings: How to use Python Seaborn for Exploratory Data Analysis
The YouTube dataset was also used in this complete guide of using seaborn Python for Exploratory Data
Analysis. Discover how to use histograms, barplots, scatter plots, boxplots, heatmaps, etc.
The value_counts method is handy for categorical columns. It returns a pandas Series with counts of
unique values, in descending order. And it excludes NA values by default.
For example, we can use it to see frequency tables of column posted_ago. It provides more details than
the previous method describe().
df0['posted_ago'].value_counts()
We can also display the results in percentages by using the normalize parameter.
df0['posted_ago'].value_counts(normalize=True)
It’s also helpful to visualize this result. We can do this by using the plot method.
df0['posted_ago'].value_counts().plot(kind='bar')
Note: again, we prefer the seaborn library for data science since it has more features. But these methods
in pandas are good for quick and simple plots.
Further Readings: How to use Python Seaborn for Exploratory Data Analysis
Now with a better understanding of the DataFrames, let’s see what we can do with its columns and rows.
Columns of DataFrames
The columns of a DataFrame often represent the variables/features in the dataset. Each column has a name we can use to refer to it.
Select Columns
The most basic way of selecting a column is by calling its name within square brackets [].
For example, we can print out the columns length and views separately below.
df0['length']
df0['views']
Column Operations
We can apply arithmetic operations directly to numeric columns. For example, we can multiply the numeric column views by 100.
df0['views'] * 100
We can also create new columns in the DataFrame by assigning transformations of existing columns.
The column length stores the number of seconds of the videos. Sometimes it’s more convenient to look
at it in minutes. Let’s create a new column below.
df0['length_in_minutes'] = df0['length'] / 60
df0.head(10)
By printing out the head of df0, we can see the new column length_in_minutes being added to it.
Similarly, we can perform string operations on text columns and create new columns.
df0['title_lowercase'] = df0['title'].str.lower()
df0.head(10)
The new column title_lowercase, which contains the lowercase version of title, is added to the DataFrame.
Rows of DataFrames
Rows represent observations in the datasets. In this section, we’ll cover the basic methods for getting
subsets of the DataFrame and indexing.
The two main methods to locate rows are either by labels (loc) or integer-location based indexing (iloc).
Before that, let’s change the index of our dataset to contain special labels.
Like strings or lists, every pandas DataFrame has an index. By default, it is an integer ranging from 0 to
(number of observations – 1). But we can also set the index to be one of the columns.
The default index of the transaction dataset df1 is integers 0 to 1056319. Since it is a time series dataset,
the column date could be a better index.
df1 = df1.set_index('date')
df1.head(10)
We can see that the date column replaced the leftmost index column.
We’ll start with the loc method, which helps us locate rows by their labels/names.
For example, if we want to see all the transactions that happened on 1996-12-25, we can use loc to find
them.
df1.loc['1996-12-25']
We can also get a range of dates using slicing. Note the difference from usual Python slices: both the start and the stop are included.
df1.loc['1997-01-01':'1997-01-31']
The loc method also allows us to subset both rows and columns by their labels.
For example, instead of printing all the columns for the date index ‘1996-12-25’, we can specify the columns, as sketched below.
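A minimal sketch, choosing two of df1’s columns (type and amount) purely for illustration:
df1.loc['1996-12-25', ['type', 'amount']]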
Besides labels, we can also use boolean arrays (True/False) to subset rows, which will be shown in the
Filtering section later.
reset_index
One of the disadvantages of an index is that the data in it is not as accessible as the other columns. To
do better analysis on the dates in later sections, let’s reset the index of df1 to return the date as a column.
df1 = df1.reset_index()
df1.head()
iloc: locate by index
The iloc method allows us to locate rows based on integer-location based indexing.
For example, to call the 10th to 19th rows, we can use slicing with iloc.
df1.iloc[10:20]
Note that the stop index 20 is not included since iloc slicing is similar to the regular Python slices.
When selecting columns using iloc, we need to specify the column number (starting from 0) rather than
the names. For example, the column date is the first column, so it has the index 0.
df1.iloc[10:20, [2,0,4]]
Besides integers, we can also use boolean arrays (True/False) to subset rows with iloc, similar to loc.
Random Sampling
Random sampling is a common practice in data science. We can use the sample method in pandas to do
this.
For example, we can take a random sample of size ten from df1 and assign it to a new DataFrame
df1_samp1.
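A sketch of that call (sample draws rows at random, so your ten rows will differ from run to run):
df1_samp1 = df1.sample(n=10)
df1_samp1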
Or, we can take a certain proportion from the original dataset using the parameter frac.
df1_samp2 = df1.sample(frac=0.001)  # the exact fraction is illustrative; any value between 0 and 1 works
df1_samp2
Filtering
Filtering is another popular pandas technique to learn for data science. It’s useful when we want to subset the data based on some condition(s).
For example, we can grab the rows with both views over 50k and a length of over 50 minutes. The variable msk is a boolean array whose length equals the number of rows in the DataFrame. An element is True when the condition is satisfied, and False otherwise.
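A sketch of what that filter could look like, using the views column and the length_in_minutes column we created earlier:
msk = (df0['views'] > 50000) & (df0['length_in_minutes'] > 50)
df0[msk]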
As you may have noticed, we can use logical operators such as & (AND) and | (OR) in filtering. It is good practice to put parentheses around each condition to make the order of evaluation explicit. Otherwise, the expression might not work as expected.
Moreover, we can combine filtering with loc to select rows and columns as below. This is when loc is
used to locate rows based on boolean arrays.
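For instance, using the boolean mask from above (the choice of title and views is just an illustration):
df0.loc[msk, ['title', 'views']]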
Besides displaying the subset, we can also change the values in the DataFrame based on the filters.
We have non-English terms like ‘PRIJEM’ in the column type within the transactions dataset.
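A sketch of how such a replacement can be done with loc. Mapping ‘PRIJEM’ to ‘DEPOSIT’ is an assumption based on the English labels used later in this tutorial; the same pattern applies to the other categories:
msk = df1['type'] == 'PRIJEM'
df1.loc[msk, 'type'] = 'DEPOSIT'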
Now with the English categories in column type, we can more easily perform our analysis.
That’s a lot of rows and columns manipulation! Let’s move on to another critical pandas functionality:
Group by.
Groupby
Before learning pandas GroupBy, let’s look at some aggregating functions.
Below are some of the popular ones. We are not showing the results of the code since they are
straightforward.
df1['amount'].count()
df1['amount'].sum()
df1['amount'].min()
df1['amount'].median()
df1['amount'].max()
df1['amount'].quantile([0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
Instead of applying over the whole dataset, aggregating functions are more powerful on groups.
pandas Groupby is a popular manipulation tool for data science. It can be used to group large amounts of
data and compute operations based on groups. The idea should be familiar to those who have used
SQL’s SELECT and GROUP BY statements.
For example, we can get the sum and mean of the column amount, grouped by column type.
df1.groupby('type')['amount'].sum()
df1.groupby('type')['amount'].mean()
We can also group by multiple columns. For example, we can get the sum, mean of the column amount,
grouped by type and operation.
df1.groupby(['type', 'operation'])['amount'].sum()
df1.groupby(['type', 'operation'])['amount'].mean()
There is a lot more to learn about pandas Groupby. That’s why we wrote another tutorial below. Take a
look to dig into it deeper!
Data Cleaning/Manipulation
pandas is also a powerful tool for cleaning and transforming the dataset. Let’s see some useful functions.
With the pandas apply method, we can apply a function on the DataFrame.
The DataFrame df1’s type column has three main categories, as shown below.
df1['type'].value_counts()
Suppose we want to compute each transaction amount as a percentage of the account balance: for deposits, the amount divided by the balance; for withdrawals, the amount divided by the balance plus the amount. We can first define a function specifying this logic and then apply it to the DataFrame. The parameter axis is set to 1 so the function is applied to each row.
def pct_of_balance_trans(row):
    if row['type'] == 'DEPOSIT':
        return row['amount'] / row['balance']
    elif row['type'] == 'WITHDRAWAL':
        return row['amount'] / (row['balance'] + row['amount'])
    else:
        return -999

df1['pct_of_balance_trans'] = df1.apply(pct_of_balance_trans, axis=1)
df1
pandas also provides handy methods for dealing with missing values. For example, by calling the tail of df1, we can see it has some missing values in the operation, bank, and account columns.
df1.tail()
We can fill in the missing values of column operation with ‘UNKNOWN’ using the fillna method.
df1['operation'] = df1['operation'].fillna('UNKNOWN')
df1.tail()
We can also combine filtering with the isnull and notnull methods to subset rows with or without missing values.
The code below subsets all the rows with the missing values in the bank column.
msk = df1['bank'].isnull()
df1[msk]
The code below subsets all the rows with the bank column NOT missing.
msk = df1['bank'].notnull()
df1[msk]
Further Reading: Data Cleaning Techniques in Python: the Ultimate Guide
There is a lot more to learn about missing data. Take a look at this guide for how to deal with missing
data.
Remove Columns/Rows
When we want to drop rows or columns from a DataFrame, we use the drop method.
When we drop columns, we set the parameter axis = 1. The default is 0, which drops by index/rows.
df0.drop(['title'], axis=1)
# To keep the change, assign the result, e.g., df0 = df0.drop(['title'], axis=1)
Or we can drop a few rows.
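For instance, we can drop the first three rows by their index labels (which rows to drop is arbitrary here; the default axis=0 drops by index):
df0.drop([0, 1, 2])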
Note that the drop method does not change the DataFrame in place; we often assign the result to a new DataFrame for further analysis, for example, df0_new = df0.drop(['title'], axis=1). This makes sure we always keep the original dataset.
For details on drop_duplicates, please check out the duplicates section in Data Cleaning Guide.
Date functions
pandas has a lot of functionality for dates, which are a common type of variable in data science.
For example, we can get the year, month, day, and weekday from the date column.
df1['year'] = df1['date'].dt.year
df1['month'] = df1['date'].dt.month
df1['day'] = df1['date'].dt.day
df1['weekday'] = df1['date'].dt.weekday
df1[['date', 'year', 'month', 'day', 'weekday']]
We can then use methods such as Groupby and plot to analyze the data based on the date.
For example, we plot the bar chart of the average amount by month.
df1.groupby('month')['amount'].mean().plot(kind='bar')
Merge and Concatenate DataFrames
Similar to SQL tables, we can also merge or concatenate DataFrames. Let’s see how it works.
For the demonstration, we create two new DataFrames left and right below.
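A minimal sketch of two such DataFrames, assuming only that they share the key column col1 (the values and the other column names are made up for illustration):
left = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [1, 2, 3]})
right = pd.DataFrame({'col1': ['b', 'c', 'd'], 'col3': [4, 5, 6]})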
Merge
For example, we can merge the left and right DataFrames on their common column col1. The how
parameter specifies the type of merge to be performed. For example, when how = ‘left’, the new
DataFrame keeps all the rows from the left DataFrame and the corresponding values from the right
DataFrame.
The how parameter accepts four values: inner, left, outer, and right, each keeping a different subset of rows from the two DataFrames.
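With the sample left and right DataFrames above, the four merge types can be called as in the sketch below:
pd.merge(left, right, on='col1', how='inner')
pd.merge(left, right, on='col1', how='left')
pd.merge(left, right, on='col1', how='outer')
pd.merge(left, right, on='col1', how='right')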
Concatenate
When we have two DataFrames with the same row or column information, we can glue them together
using concat.
For example, if we create two new DataFrames with the same columns, we may want to combine these
two DataFrames.
d = {'account_id': [1, 2, 3, 11382], 'gender': ['female', 'male', 'female', 'male'], 'age': [55, 25, 29, 39]}
df4 = pd.DataFrame(d)
df4
d = {'account_id': [4, 5, 6, 7], 'gender': ['female', 'male', 'female', 'male'], 'age': [12, 13, 14, 15]}
df5 = pd.DataFrame(d)
df5
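A sketch of gluing the two DataFrames together row-wise with concat:
pd.concat([df4, df5])
# or, for a clean 0..n index on the combined DataFrame:
pd.concat([df4, df5], ignore_index=True)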
Sort DataFrames
What if we want to display the DataFrame in order of a column?
For example, we can use sort_values to order the DataFrame by the columns account_id and trans_id, in ascending order.
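A sketch of that call (ascending order is the default for sort_values):
df1.sort_values(['account_id', 'trans_id'])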
Finally, let’s see how to save DataFrames on your computer. We’ll introduce two common methods: pickles and csv files.
pickles
If we need to save the DataFrame and resume our analysis in Python, it’s better to output it as the pickle
format. As pickles, the DataFrame will be saved as it is in Python, i.e., we have the same index, data
types, etc.
When we want to pause the analysis, we use to_pickle to save the DataFrame df0 as videos.pkl.
df0.to_pickle('videos.pkl')
Next time, we can access the same DataFrame by using read_pickle. Let’s also print out its head and
info.
df0_from_pickle = pd.read_pickle('videos.pkl')
df0_from_pickle.head()
df0_from_pickle.info()
The DataFrame looks the same as before. We can keep analyzing from where we left off.
csv files
csv files are more common than pickles and can be opened by other software/programs. So if we want to
share the data with others, it’s better to save our DataFrames as csv files.
Note that some information might be lost compared to pickles. For example, we might have to specify the
column data types when loading the csv files into Python again.
df0.to_csv('videos.csv', index=False)
That’s it! You’ve learned a lot about Python pandas for data science.
Did you get a better idea of how to analyze and manipulate data?
Leave a comment for any questions you may have or anything else!
Python crash course: breaking into Data Science
A FREE Python online course, beginner-friendly tutorial. Start your successful data science career
journey.