
Python Pandas - Comparison with SQL
Pandas is a powerful Python library for data manipulation and analysis, widely used in data science and engineering. Many potential Pandas users come from a background in SQL, a language designed for managing and querying relational databases. Understanding how to perform SQL-like operations using Pandas can significantly ease the transition and enhance productivity.
This tutorial provides a side-by-side comparison of common SQL operations and their equivalents in Pandas, using the popular "tips" dataset.
Importing the Necessary Libraries
Before we dive into the comparison, let's start by importing the necessary libraries.
import pandas as pd
import numpy as np
We will also load the "tips" dataset, which will be used throughout this tutorial.
import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'
tips = pd.read_csv(url)
print(tips.head())
Its output is as follows −
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
Selecting Columns
In SQL, the SELECT statement is used to retrieve specific columns from a table. You list the columns you want, separated by commas, or use * to select all columns −
SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
In Pandas, you can achieve the same result by selecting columns from a DataFrame using a list of column names −
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
Example
Let's check the full program of displaying the first five rows of the selected columns −
import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'
tips = pd.read_csv(url)
print(tips[['total_bill', 'tip', 'smoker', 'time']].head(5))
Its output is as follows −
   total_bill   tip smoker    time
0       16.99  1.01     No  Dinner
1       10.34  1.66     No  Dinner
2       21.01  3.50     No  Dinner
3       23.68  3.31     No  Dinner
4       24.59  3.61     No  Dinner
Referring to the DataFrame without a list of column names displays all columns (akin to SQL's *).
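As a minimal sketch of how column selection behaves, the snippet below uses a small hand-made frame named sample. It is a hypothetical stand-in for the first rows of tips, so the code runs without downloading the dataset. It shows that a list of names returns a DataFrame, while a single name returns a Series.

```python
import pandas as pd

# Hypothetical stand-in for the first three rows of the tips dataset.
sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'tip': [1.01, 1.66, 3.50],
    'smoker': ['No', 'No', 'No'],
    'time': ['Dinner', 'Dinner', 'Dinner'],
})

# A list of names returns a DataFrame (like SELECT total_bill, tip FROM ...).
subset = sample[['total_bill', 'tip']]

# A single name (no list) returns a one-dimensional Series instead.
tip_only = sample['tip']

print(type(subset).__name__)
print(type(tip_only).__name__)
print(list(subset.columns))
```

The double brackets matter: the outer pair indexes the DataFrame and the inner pair is the Python list of column names.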
Filtering Rows
In SQL, the WHERE clause is used to filter records based on specific conditions.
SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;
DataFrames can be filtered in multiple ways; the most intuitive is Boolean indexing.
tips[tips['time'] == 'Dinner'].head(5)
Example
Let's check the full program of displaying the first five records where the time is equal to 'Dinner' −
import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'
tips = pd.read_csv(url)
print(tips[tips['time'] == 'Dinner'].head(5))
Its output is as follows −
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
The above statement passes a Boolean Series (one True/False value per row) to the DataFrame, which returns only the rows where the value is True.
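To make the Boolean Series visible, here is a small sketch on a hand-made frame. The sample data is hypothetical and only mimics the shape of tips. It prints the mask on its own and then combines two conditions, which is roughly what a WHERE clause with AND does.

```python
import pandas as pd

# Hypothetical rows shaped like the tips dataset.
sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 27.20, 22.76],
    'tip': [1.01, 1.66, 4.00, 3.00],
    'time': ['Dinner', 'Dinner', 'Lunch', 'Lunch'],
})

# The comparison alone yields a Boolean Series, one value per row.
is_dinner = sample['time'] == 'Dinner'
print(is_dinner.tolist())

# Combine conditions with & (AND) or | (OR); wrap each one in parentheses.
# Roughly: SELECT * FROM sample WHERE time = 'Dinner' AND tip > 1.50;
dinner_good_tip = sample[(sample['time'] == 'Dinner') & (sample['tip'] > 1.50)]
print(dinner_good_tip)
```

The parentheses are required because & and | bind more tightly than comparison operators in Python.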
Grouping Data
SQL's GROUP BY clause groups rows that have the same values in the specified columns so that aggregate functions can be applied to each group. For example, to count the number of tips left by each gender −
SELECT sex, count(*) FROM tips GROUP BY sex;
In Pandas, the groupby() method is used to achieve the same result −
tips.groupby('sex').size()
Example
Let's check the full program of displaying the count of tips grouped by gender −
import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'
tips = pd.read_csv(url)
print(tips.groupby('sex').size())
Its output is as follows −
sex
Female     87
Male      157
dtype: int64
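The example uses size() rather than count(). A short sketch on hypothetical data (the sample frame and its missing value are made up for illustration) shows the difference: size() counts every row in a group, like COUNT(*), while count() on a column skips missing values, like COUNT(column).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing tip in the Male group.
sample = pd.DataFrame({
    'sex': ['Female', 'Male', 'Male', 'Female', 'Male'],
    'tip': [1.01, 1.66, np.nan, 3.61, 3.50],
})

# size() counts all rows per group, including rows with missing values.
by_size = sample.groupby('sex').size()

# count() on a column counts only the non-missing values per group.
by_count = sample.groupby('sex')['tip'].count()

print(by_size)
print(by_count)
```

Here the Male group has three rows but only two recorded tips, so the two results differ for that group.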
Limiting the Number of Rows
In SQL, the LIMIT clause is used to limit the number of rows returned by a query. For example −
SELECT * FROM tips LIMIT 5;
In Pandas, the head() method is used to achieve this −
tips.head(5)
Example
Let's check the full example, which displays the first five rows of three selected columns to keep the output compact −
import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'
tips = pd.read_csv(url)
# Keep three columns, then take the first five rows (like LIMIT 5)
subset = tips[['smoker', 'day', 'time']].head(5)
print(subset)
Its output is as follows −
  smoker  day    time
0     No  Sun  Dinner
1     No  Sun  Dinner
2     No  Sun  Dinner
3     No  Sun  Dinner
4     No  Sun  Dinner
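As a rough illustration of what head() does, the sketch below uses a hypothetical six-row frame named sample. It shows that head(n) returns the same rows as the positional slice iloc[:n], and that the original DataFrame is left unchanged.

```python
import pandas as pd

# Hypothetical six-row frame shaped like part of the tips dataset.
sample = pd.DataFrame({
    'smoker': ['No', 'No', 'Yes', 'Yes', 'No', 'No'],
    'day': ['Sun', 'Sun', 'Sat', 'Sat', 'Thur', 'Fri'],
})

first_two = sample.head(2)   # like LIMIT 2
same_rows = sample.iloc[:2]  # positional slice selecting the same rows

print(first_two.equals(same_rows))

# head() returns a new object; the original still has all six rows.
print(len(first_two), len(sample))
```

Because head() does not modify the original, assigning its result to a new name keeps the full dataset available for later steps.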
These are a few of the basic operations covered in the previous chapters of this Pandas tutorial, compared here with their SQL equivalents.