Python Pandas - Comparison with SQL



Pandas is a powerful Python library for data manipulation and analysis, widely used in data science and engineering. Many potential Pandas users come from a background in SQL, a language designed for managing and querying relational databases. Understanding how to perform SQL-like operations using Pandas can significantly ease the transition and enhance productivity.

This tutorial provides a side-by-side comparison of common SQL operations and their equivalents in Pandas, using the popular "tips" dataset.

Importing the Necessary Libraries

Before we dive into the comparison, let's start by importing the necessary libraries.

import pandas as pd
import numpy as np

We will also load the "tips" dataset, which will be used throughout this tutorial.

import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'
tips = pd.read_csv(url)
print(tips.head())

Its output is as follows −

    total_bill   tip      sex  smoker  day     time  size
0        16.99  1.01   Female      No  Sun  Dinner      2
1        10.34  1.66     Male      No  Sun  Dinner      3
2        21.01  3.50     Male      No  Sun  Dinner      3
3        23.68  3.31     Male      No  Sun  Dinner      2
4        24.59  3.61   Female      No  Sun  Dinner      4

Selecting Columns

In SQL, the SELECT statement retrieves specific columns from a table. You list the columns you want, separated by commas, or use * to select all columns −

SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;

In Pandas, you can achieve the same result by selecting columns from a DataFrame using a list of column names −

tips[['total_bill', 'tip', 'smoker', 'time']].head(5)

Example

Let's check the full program of displaying the first five rows of the selected columns −

import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'
tips = pd.read_csv(url)
print(tips[['total_bill', 'tip', 'smoker', 'time']].head(5))

Its output is as follows −

   total_bill   tip  smoker     time
0       16.99  1.01      No   Dinner
1       10.34  1.66      No   Dinner
2       21.01  3.50      No   Dinner
3       23.68  3.31      No   Dinner
4       24.59  3.61      No   Dinner

Using the DataFrame on its own, without a list of column names, returns all columns (akin to SQL's *).
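
As a minimal sketch of the difference between selecting some columns and selecting all of them, the snippet below uses a small in-memory DataFrame named sample. This is a hypothetical stand-in for the first rows of the tips dataset, so it runs without a network connection.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the tips dataset
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "smoker": ["No", "No", "No"],
    "time": ["Dinner", "Dinner", "Dinner"],
})

# SELECT total_bill, tip FROM sample;
subset = sample[["total_bill", "tip"]]  # a list of names returns a DataFrame

# SELECT * FROM sample;
everything = sample  # no column list: all columns are kept

# A single name without the surrounding list returns a Series instead
tip_only = sample["tip"]

print(list(subset.columns))
print(list(everything.columns))
print(type(tip_only).__name__)
```

Note the double brackets in the first selection: the inner pair is an ordinary Python list of column names, and the outer pair is the indexing operation.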

Filtering Rows

In SQL, the WHERE clause is used to filter records based on specific conditions.

SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;

DataFrames can be filtered in multiple ways, the most intuitive of which is Boolean indexing.

tips[tips['time'] == 'Dinner'].head(5)

Example

Let's check the full program of displaying the first five records where the time is equal to 'Dinner' −

import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'
tips = pd.read_csv(url)
print(tips[tips['time'] == 'Dinner'].head(5))

Its output is as follows −

   total_bill   tip      sex  smoker  day    time  size
0       16.99  1.01   Female     No   Sun  Dinner    2
1       10.34  1.66     Male     No   Sun  Dinner    3
2       21.01  3.50     Male     No   Sun  Dinner    3
3       23.68  3.31     Male     No   Sun  Dinner    2
4       24.59  3.61   Female     No   Sun  Dinner    4

In the above statement, the expression tips['time'] == 'Dinner' produces a Series of True/False values. Passing that Series to the DataFrame returns only the rows where the value is True.
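
To make the Boolean Series visible, the following sketch builds the mask as a separate step and then combines two conditions, much as a SQL WHERE clause would with AND. The sample DataFrame and its values are hypothetical, chosen only so the example runs offline.

```python
import pandas as pd

# Hypothetical rows in the shape of the tips dataset
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 8.77],
    "tip": [1.01, 1.66, 3.50, 2.00],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch"],
})

# The comparison yields one True/False value per row
mask = sample["time"] == "Dinner"
print(mask.tolist())  # [True, True, False, False]

# SELECT * FROM sample WHERE time = 'Dinner';
dinner_rows = sample[mask]

# SELECT * FROM sample WHERE time = 'Dinner' AND tip > 1.50;
# Each condition needs its own parentheses; & means AND, | means OR
both = sample[(sample["time"] == "Dinner") & (sample["tip"] > 1.50)]
print(both)
```

The parentheses around each condition are required because & and | bind more tightly than comparison operators in Python.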

Grouping Data

SQL's GROUP BY clause groups rows that have the same values in the specified columns so that aggregate functions can be applied to each group. For example, to count the number of tips left by each gender −

SELECT sex, count(*) FROM tips GROUP BY sex;

In Pandas, the groupby() method is used to achieve the same result −

tips.groupby('sex').size()

Example

Let's check the full program of displaying the count of tips grouped by gender −

import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'
tips = pd.read_csv(url)
print(tips.groupby('sex').size())

Its output is as follows −

sex
Female     87
Male      157
dtype: int64
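
The example above uses size() rather than count(). A short sketch of why: size() counts rows per group, like COUNT(*), while count() counts only non-null values in a column, like COUNT(column). The sample data below is hypothetical and includes one missing tip to make the difference visible.

```python
import pandas as pd

# Hypothetical data with one missing tip value (None becomes NaN)
sample = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "tip": [1.01, 1.66, None, 3.31, 3.61],
})

# SELECT sex, COUNT(*) FROM sample GROUP BY sex;
row_counts = sample.groupby("sex").size()

# SELECT sex, COUNT(tip) FROM sample GROUP BY sex;
tip_counts = sample.groupby("sex")["tip"].count()

print(row_counts)  # Male has 3 rows
print(tip_counts)  # Male has only 2 non-null tips
```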

Limiting the Number of Rows

In SQL, the LIMIT clause is used to limit the number of rows returned by a query. For example −

SELECT * FROM tips LIMIT 5 ;

In Pandas, the head() method is used to achieve this −

tips.head(5)

Example

Let's check the full example of displaying the first five rows of the DataFrame, restricted here to the smoker, day and time columns −

import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'
tips = pd.read_csv(url)
tips = tips[['smoker', 'day', 'time']].head(5)
print(tips)

Its output is as follows −

   smoker   day     time
0      No   Sun   Dinner
1      No   Sun   Dinner
2      No   Sun   Dinner
3      No   Sun   Dinner
4      No   Sun   Dinner
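
As a rough sketch of how head() relates to positional slicing, the snippet below shows that head(n) returns the same rows as iloc[:n]. It also shows one possible way to express LIMIT with an OFFSET using iloc. The sample DataFrame is hypothetical, and the OFFSET mapping is an illustration rather than something covered in the text above.

```python
import pandas as pd

# Hypothetical four-row table
sample = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sun"],
    "size": [2, 3, 4, 2],
})

# SELECT * FROM sample LIMIT 2;
first_two = sample.head(2)
same_two = sample.iloc[:2]  # positional slice giving the same rows

# SELECT * FROM sample LIMIT 2 OFFSET 1;
next_two = sample.iloc[1:3]  # skip one row, then take two

print(first_two.equals(same_two))
print(next_two)
```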

These are a few of the basic SQL operations and their Pandas equivalents, building on what we learnt in the previous chapters.
