Efficient methods to iterate rows in Pandas DataFrame
When iterating over rows in a Pandas DataFrame, the method you choose can greatly impact performance. Avoid traditional row iteration methods like index-based for loops or .iterrows() when performance matters. Instead, use vectorization or, when a loop is genuinely needed, itertuples().
Vectorized operations are the fastest and most efficient approach in Pandas. They apply an operation to entire columns at once, so they are not row-wise iteration in the traditional sense, but they reach the same end result without an explicit loop. This article covers efficient ways to iterate over rows in a Pandas DataFrame, as well as vectorized operations.
The methods below run from explicit row-wise iteration to fully vectorized code, with notes on when each one is appropriate. Vectorization, covered last, is the fastest of the three.
1. Using itertuples() - Fastest Row-Wise Iteration
itertuples() returns each row as a lightweight named tuple, which is faster and more memory-efficient than the Series objects that iterrows() builds for every row. It also preserves each column's data type, making it a good fit for large datasets that genuinely need row-by-row, attribute-style access (row.A, row.B).
Example: We'll create a small dataset (10 rows) so the printed output stays readable. The same code works unchanged on larger datasets, where the performance differences between the methods become noticeable.
import pandas as pd
import numpy as np

data = {
    'A': np.random.randint(1, 20, 10),  # Random integers from 1 to 19 (upper bound is exclusive)
    'B': np.random.randint(10, 30, 10),  # Random integers from 10 to 29
    'C': np.random.choice(['X', 'Y', 'Z'], 10)  # Random categorical values
}
df = pd.DataFrame(data)
print(df)
Example to demonstrate iterating over rows using itertuples():
import pandas as pd
import numpy as np

data = {
    'A': np.random.randint(1, 20, 10),  # Random integers from 1 to 19 (upper bound is exclusive)
    'B': np.random.randint(10, 30, 10),  # Random integers from 10 to 29
    'C': np.random.choice(['X', 'Y', 'Z'], 10)  # Random categorical values
}
df = pd.DataFrame(data)

# Using itertuples for faster row-wise iteration
results = []
for row in df.itertuples(index=False):
    if row.C == 'X':
        results.append(row.A * row.B)  # Multiply when the category is 'X'
    else:
        results.append(row.A + row.B)  # Add for every other category
df['Result'] = results
print(df)
Output
    A   B  C  Result
0  15  25  Y      40
1   7  23  Y      30
2   2  22  X      44
3   2  12  Z      14
4   8  19  Z      27
5   9  16  X     144
6  13  21  Z      34
7   7  15  X     105
8   9  22  ...
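The point about itertuples() preserving data types can be checked with a small sketch. The two-column frame below is a hypothetical example, not the article's dataset; it mixes an integer column with a float column so the difference from iterrows() becomes visible.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column
mixed = pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5]})

# iterrows() packs each row into a single Series, so the integer is upcast to float
_, series_row = next(mixed.iterrows())
print(type(series_row["A"]))

# itertuples() takes each value from its own column, so the integer stays an integer
tuple_row = next(mixed.itertuples(index=False))
print(type(tuple_row.A))
```

If a loop relies on integer semantics (for example, using a value as a list index), the upcast in iterrows() is a correctness concern as well as a speed concern.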
2. apply() Method (Preferred for Complex Operations)
The .apply() function applies a custom function along rows (axis=1) or columns (axis=0). Use .apply() only when the operation needs complex logic that depends on several columns and cannot be written as a vectorized expression.
The function in this example takes a single row of the DataFrame as input and computes a value based on column C:
- If the value of C is 'X', the function returns A * 2.
- Otherwise, it returns B * 3.
import pandas as pd
import numpy as np

data = {
    'A': np.random.randint(1, 20, 10),  # Random integers from 1 to 19 (upper bound is exclusive)
    'B': np.random.randint(10, 30, 10),  # Random integers from 10 to 29
    'C': np.random.choice(['X', 'Y', 'Z'], 10)  # Random categorical values
}
df = pd.DataFrame(data)

# Applying a custom function row-wise
def custom_function(row):
    return row['A'] * 2 if row['C'] == 'X' else row['B'] * 3

result = df.apply(custom_function, axis=1)  # axis=1 passes one row at a time
print(result)
Output
0    81
1    84
2    39
3    51
4    84
5    66
6    78
7    22
8    63
9    54
dtype: int64
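The rule in custom_function is simple enough that it can also be written without .apply(). The sketch below compares the row-wise version with an np.where equivalent. It uses a small fixed dataset (hypothetical values, chosen so the result is reproducible) instead of the random data above.

```python
import numpy as np
import pandas as pd

# Small fixed dataset (hypothetical values) so the comparison is reproducible
df_fixed = pd.DataFrame({
    "A": [3, 7, 11, 5],
    "B": [10, 12, 20, 15],
    "C": ["X", "Y", "X", "Z"],
})

def custom_function(row):
    # Same rule as in the example above: A * 2 for 'X', otherwise B * 3
    return row["A"] * 2 if row["C"] == "X" else row["B"] * 3

applied = df_fixed.apply(custom_function, axis=1)

# Column-wise version of the same rule
vectorized = pd.Series(
    np.where(df_fixed["C"] == "X", df_fixed["A"] * 2, df_fixed["B"] * 3),
    index=df_fixed.index,
)

print(applied.tolist())     # → [6, 36, 22, 45]
print(vectorized.tolist())  # → [6, 36, 22, 45]
```

When a rule can be expressed this way, .apply() is better reserved for logic that has no column-wise equivalent.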
3. Vectorization (Preferred for Speed and Large Datasets)
Vectorized operations process entire columns at once and avoid explicit iteration, making them the fastest and most efficient approach for large datasets. They are best for transformations or calculations that can be expressed on whole columns, including simple conditional logic through np.where.
import pandas as pd
import numpy as np

data = {
    'A': np.random.randint(1, 20, 10),  # Random integers from 1 to 19 (upper bound is exclusive)
    'B': np.random.randint(10, 30, 10),  # Random integers from 10 to 29
    'C': np.random.choice(['X', 'Y', 'Z'], 10)  # Random categorical values
}
df = pd.DataFrame(data)

# Vectorized operations: A * B where C is 'X', otherwise A + B
df['Result'] = np.where(df['C'] == 'X', df['A'] * df['B'], df['A'] + df['B'])
print(df)
Output
    A   B  C  Result
0   1  12  Y      13
1  15  24  Z      39
2  19  12  Z      31
3  19  18  X     342
4   3  27  Y      30
5  13  27  X     351
6   6  15  X      90
7   7  25  Y      32
8  15  13  ...
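np.where covers a two-way choice. When a rule has more than two branches, np.select is one option that stays vectorized. The three-way rule below (multiply for 'X', add for 'Y', subtract otherwise) and the fixed values are assumptions made up for illustration; they are not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data with one row per category
df_multi = pd.DataFrame({
    "A": [2, 4, 6],
    "B": [10, 20, 30],
    "C": ["X", "Y", "Z"],
})

is_x = (df_multi["C"] == "X").to_numpy()
is_y = (df_multi["C"] == "Y").to_numpy()

# One condition per branch; the last one catches every remaining row
conditions = [is_x, is_y, ~(is_x | is_y)]
choices = [
    (df_multi["A"] * df_multi["B"]).to_numpy(),  # 'X': multiply
    (df_multi["A"] + df_multi["B"]).to_numpy(),  # 'Y': add
    (df_multi["B"] - df_multi["A"]).to_numpy(),  # anything else: subtract
]

df_multi["Result"] = np.select(conditions, choices)
print(df_multi["Result"].tolist())  # → [20, 24, 24]
```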
Conclusion – Efficiently Iterating over rows in Pandas DataFrame
For row-wise operations:
- Use itertuples() when iteration is unavoidable and structured data access is needed.
- Opt for .apply() when performing complex transformations that cannot be vectorized.
Avoid iterrows() and index-based iteration for large datasets because of their poor performance and significant per-row overhead. Vectorization is typically the fastest method for a dataset of any size and should be the default choice unless row-specific logic is mandatory.
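To see the ranking on your own machine, a rough timing sketch such as the one below can help. The 10,000-row size, the fixed seed, and the helper names (with_iterrows, with_itertuples, with_apply, with_where) are assumptions chosen for illustration, and a single time.perf_counter() measurement is only a coarse indicator, not a rigorous benchmark.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n_rows = 10_000                 # assumed size for the comparison
bench = pd.DataFrame({
    "A": rng.integers(1, 20, n_rows),
    "B": rng.integers(10, 30, n_rows),
    "C": rng.choice(["X", "Y", "Z"], n_rows),
})

def with_iterrows(frame):
    return [r["A"] * r["B"] if r["C"] == "X" else r["A"] + r["B"]
            for _, r in frame.iterrows()]

def with_itertuples(frame):
    return [r.A * r.B if r.C == "X" else r.A + r.B
            for r in frame.itertuples(index=False)]

def with_apply(frame):
    rule = lambda r: r["A"] * r["B"] if r["C"] == "X" else r["A"] + r["B"]
    return frame.apply(rule, axis=1).tolist()

def with_where(frame):
    return np.where(frame["C"] == "X",
                    frame["A"] * frame["B"],
                    frame["A"] + frame["B"]).tolist()

timings = {}
results = {}
for func in (with_iterrows, with_itertuples, with_apply, with_where):
    start = time.perf_counter()
    results[func.__name__] = func(bench)
    timings[func.__name__] = time.perf_counter() - start

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Absolute numbers depend on hardware and on the pandas version, so none are quoted here. All four functions return the same list, so the choice between them is purely a matter of speed and readability.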
How do you iterate over multiple rows in Pandas?
Use iterrows() or itertuples() to iterate over rows, or loop through DataFrame.index to access rows by index. Prefer vectorized operations whenever possible for better performance.
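As a minimal sketch of the index-based option, the loop below walks DataFrame.index and reads single cells with .at. The frame, its labels, and the variable names are hypothetical; the last line shows the vectorized expression that gives the same totals.

```python
import pandas as pd

# Hypothetical frame with string labels as the index
small = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]}, index=["r1", "r2", "r3"])

# Loop over the index labels and read one cell at a time with .at
totals = []
for label in small.index:
    totals.append(int(small.at[label, "A"] + small.at[label, "B"]))

print(totals)  # → [11, 22, 33]

# Vectorized equivalent: one column-wise addition, no loop
vectorized_totals = (small["A"] + small["B"]).tolist()
print(vectorized_totals)  # → [11, 22, 33]
```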
Is Pandas apply faster than iterrows?
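Usually yes, but not because it is vectorized. With axis=1, apply() still calls the Python function once per row and passes each row as a Series, so it remains a row-by-row loop. It tends to be somewhat faster than iterrows() because pandas runs the loop internally, but the gain is small compared with a truly vectorized column operation.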
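One way to confirm that apply() with axis=1 is still a per-row loop is to record every call to the function. The frame and the add_columns helper below are hypothetical and exist only for this check.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})

call_log = []  # one entry is appended each time pandas invokes the function

def add_columns(row):
    call_log.append(row.name)  # row.name is the index label of the current row
    return row["A"] + row["B"]

row_sums = frame.apply(add_columns, axis=1)

print(len(call_log))      # at least one call per row
print(row_sums.tolist())  # → [6, 8, 10, 12]
```

The function is invoked at least once for every row, which is why the cost of apply() grows with the number of rows in the same way an explicit loop does.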
How to make Pandas loop faster?
Avoid explicit loops. Use vectorized operations or NumPy functions first, and fall back to apply() only for logic that cannot be vectorized. When looping is unavoidable, prefer itertuples() over iterrows() for improved speed and efficiency.
Why are itertuples faster than iterrows?
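itertuples() converts rows into lightweight named tuples, which are more memory-efficient and faster to access than the pandas Series objects returned by iterrows(). iterrows() has to construct a new Series for every row, and that per-row construction is where most of its overhead comes from.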