Efficient methods to iterate over rows in a Pandas DataFrame

Last Updated : 27 Nov, 2024

When iterating over rows in a Pandas DataFrame, the method you choose can greatly impact performance. Avoid index-based for loops and .iterrows() when performance matters. Instead, use vectorized operations where the logic allows it, and itertuples() when you really need to visit each row.

Vectorized operations are the fastest and most efficient approach in Pandas, and they are preferred whenever an operation can be applied directly to entire columns. They are not row-wise iteration in the traditional sense, but they achieve the same end result without an explicit loop. In this article, we’ll cover efficient ways of iterating over rows in a Pandas DataFrame, as well as vectorized operations.

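The sections below cover each method and when to use it. The numbering is not a strict speed ranking: vectorization (section 3) is the fastest option whenever it applies, and itertuples() is typically the fastest way to loop over rows explicitly.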

1. Using `itertuples()` - Fastest Row-Wise Iteration

itertuples() returns each row as a lightweight named tuple, which is faster and more memory-efficient than the Series object that iterrows() builds for every row. Each field keeps the data type of its column, which makes itertuples() a good fit when you need structured, attribute-style access to the rows of a large dataset.

Example: We’ll create a small dataset (10 rows) so that the printed output stays readable. To observe meaningful performance differences between the methods, increase the size to something like 10,000 rows or more.

import pandas as pd
import numpy as np

data = {
    'A': np.random.randint(1, 20, 10),  # Random integers from 1 to 19 (upper bound is exclusive)
    'B': np.random.randint(10, 30, 10),
    'C': np.random.choice(['X', 'Y', 'Z'], 10)  # Random categorical values
}
df = pd.DataFrame(data)
print(df)

Example to demonstrate iterating over rows using itertuples():

import pandas as pd
import numpy as np

data = {
    'A': np.random.randint(1, 20, 10),  # Random integers from 1 to 19 (upper bound is exclusive)
    'B': np.random.randint(10, 30, 10),
    'C': np.random.choice(['X', 'Y', 'Z'], 10)  # Random categorical values
}
df = pd.DataFrame(data)

# Using itertuples for faster row-wise iteration
results = []
for row in df.itertuples(index=False):
    if row.C == 'X':
        results.append(row.A * row.B)
    else:
        results.append(row.A + row.B)

df['Result'] = results
print(df)

Output

    A   B  C  Result
0  15  25  Y      40
1   7  23  Y      30
2   2  22  X      44
3   2  12  Z      14
4   8  19  Z      27
5   9  16  X     144
6  13  21  Z      34
7   7  15  X     105
8   9  22  ...
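
The point about preserved data types can be checked directly. The sketch below is only an illustration: it uses a hypothetical two-column frame (`df_types`, with the made-up column names `qty` and `price`) and compares the first row produced by iterrows() with the first row produced by itertuples().

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column
df_types = pd.DataFrame({"qty": [1, 2, 3], "price": [0.5, 1.5, 2.5]})

# iterrows() packs each row into a single Series, so the integer is upcast to float
_, first_series = next(df_types.iterrows())

# itertuples() yields a named tuple whose fields keep their per-column types
first_tuple = next(df_types.itertuples(index=False))

print(type(first_series["qty"]).__name__)  # a float type
print(type(first_tuple.qty).__name__)      # an integer type
```

With columns of mixed types, this upcasting is one reason iterrows() can produce surprising values, in addition to being slower.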

2. `apply()` Method (Preferred for Complex Operations)

The .apply() function applies a custom function along rows (axis=1) or columns (axis=0). With axis=1 the function is still called once per row in Python, so reserve .apply() for complex logic that depends on multiple columns and cannot easily be vectorized.

The custom function in the example below takes a single row of the DataFrame as input and performs a calculation based on the value in column C:

If the value of C is 'X', the function returns A * 2.
Otherwise, it returns B * 3.

import pandas as pd
import numpy as np

data = {
    'A': np.random.randint(1, 20, 10),  # Random integers from 1 to 19 (upper bound is exclusive)
    'B': np.random.randint(10, 30, 10),
    'C': np.random.choice(['X', 'Y', 'Z'], 10)  # Random categorical values
}
df = pd.DataFrame(data)

# Applying a custom function row-wise
def custom_function(row):
    return row['A'] * 2 if row['C'] == 'X' else row['B'] * 3

result = df.apply(custom_function, axis=1)
print(result)

Output

0    81
1    84
2    39
3    51
4    84
5    66
6    78
7    22
8    63
9    54
dtype: int64
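
If the row function needs extra parameters, apply() can forward them through its `args` argument. The following is a minimal sketch under assumed names: `df_fixed`, `scaled_value`, `x_factor`, and `other_factor` are hypothetical, and the data is fixed so the result is reproducible.

```python
import pandas as pd

# Hypothetical fixed data so the result does not depend on random values
df_fixed = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30], "C": ["X", "Y", "X"]})

def scaled_value(row, x_factor, other_factor):
    """Return A * x_factor when C is 'X', otherwise B * other_factor."""
    if row["C"] == "X":
        return row["A"] * x_factor
    return row["B"] * other_factor

# args supplies the extra positional arguments that follow the row itself
df_fixed["Result"] = df_fixed.apply(scaled_value, axis=1, args=(2, 3))
print(df_fixed)
```

Storing the returned Series in a new column, as done here, keeps the result aligned with the original rows.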

3. Vectorization (Preferred for Speed and Large Datasets)

Vectorized operations process entire columns at once and avoid explicit iteration, making them the fastest and most efficient approach for large datasets. They are best for transformations or calculations that can be expressed on whole columns, including simple conditional logic through functions such as np.where().

import pandas as pd
import numpy as np

data = {
    'A': np.random.randint(1, 20, 10),  # Random integers from 1 to 19 (upper bound is exclusive)
    'B': np.random.randint(10, 30, 10),
    'C': np.random.choice(['X', 'Y', 'Z'], 10)  # Random categorical values
}
df = pd.DataFrame(data)

# Vectorized conditional: multiply A and B where C is 'X', otherwise add them
df['Result'] = np.where(df['C'] == 'X', df['A'] * df['B'], df['A'] + df['B'])
print(df)

Output

    A   B  C  Result
0   1  12  Y      13
1  15  24  Z      39
2  19  12  Z      31
3  19  18  X     342
4   3  27  Y      30
5  13  27  X     351
6   6  15  X      90
7   7  25  Y      32
8  15  13  ...
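
np.where() handles a single condition. When there are several branches, np.select() is one way to stay vectorized. The sketch below is an assumption-laden illustration: the frame `df_multi` and the extra rule (0 for any category other than 'X' or 'Y') are hypothetical and not part of the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data with one row per category
df_multi = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30], "C": ["X", "Y", "Z"]})

# Conditions are checked in order; the first match picks the corresponding choice
conditions = [df_multi["C"] == "X", df_multi["C"] == "Y"]
choices = [df_multi["A"] * df_multi["B"], df_multi["A"] + df_multi["B"]]

# Rows matching neither condition receive the default value 0
df_multi["Result"] = np.select(conditions, choices, default=0)
print(df_multi)
```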

Conclusion – Efficiently Iterating over rows in Pandas Dataframe

For row-wise operations:

Use itertuples() when iteration is unavoidable and structured data access is needed.
Opt for .apply() when performing complex transformations that cannot be vectorized.

Avoid iterrows() and index-based iteration for large datasets due to their poor performance and significant per-row overhead. Whenever the logic can be expressed on whole columns, vectorization is the fastest method and should be the default choice; reach for row-wise methods only when row-specific logic makes them unavoidable.
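
To see the differences on your own machine, a rough timing harness such as the one below can help. It is a sketch, not a rigorous benchmark: the 10,000-row frame `df_bench`, the seed, and the helper function names are all assumptions made for illustration, and the measured times will vary with hardware and Pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the generated data is reproducible
n = 10_000
df_bench = pd.DataFrame({
    "A": rng.integers(1, 20, n),
    "B": rng.integers(10, 30, n),
    "C": rng.choice(["X", "Y", "Z"], n),
})

def with_iterrows():
    return [r["A"] * r["B"] if r["C"] == "X" else r["A"] + r["B"]
            for _, r in df_bench.iterrows()]

def with_itertuples():
    return [r.A * r.B if r.C == "X" else r.A + r.B
            for r in df_bench.itertuples(index=False)]

def with_apply():
    row_func = lambda r: r["A"] * r["B"] if r["C"] == "X" else r["A"] + r["B"]
    return df_bench.apply(row_func, axis=1).tolist()

def with_where():
    return np.where(df_bench["C"] == "X",
                    df_bench["A"] * df_bench["B"],
                    df_bench["A"] + df_bench["B"]).tolist()

# Time a single run of each approach and print the elapsed seconds
for fn in (with_iterrows, with_itertuples, with_apply, with_where):
    seconds = timeit.timeit(fn, number=1)
    print(f"{fn.__name__}: {seconds:.4f} s")
```

All four functions compute the same result, so any difference in the printed times comes from the iteration strategy alone.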

How do you iterate over multiple rows in Pandas?

Use iterrows() or itertuples() to iterate over rows, or loop through DataFrame.index to access rows by index. Prefer vectorized operations whenever possible for better performance.
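
For completeness, looping through DataFrame.index might look like the sketch below. The frame `df_idx` and its row labels are hypothetical, and this pattern is shown only to illustrate index-based access; it is among the slower options for large frames.

```python
import pandas as pd

# Hypothetical frame with explicit row labels
df_idx = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]}, index=["r1", "r2", "r3"])

totals = {}
for label in df_idx.index:
    # .at reads a single cell by row label and column name
    totals[label] = df_idx.at[label, "A"] + df_idx.at[label, "B"]

print(totals)
```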

Is Pandas apply faster than iterrows?

Usually yes, but not because it is vectorized. With axis=1, apply() still calls your Python function once per row; it is typically faster than iterrows() because its internal row loop carries less overhead. Both are generally much slower than true vectorized operations.
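
One way to check how much per-row work apply() does is to count how often the row function is called. The sketch below uses hypothetical names (`df_calls`, `calls`, `add_columns`) and compares the result with the equivalent column-wise expression.

```python
import pandas as pd

df_calls = pd.DataFrame({"A": range(5), "B": range(5, 10)})
calls = {"count": 0}  # mutable counter updated inside the row function

def add_columns(row):
    calls["count"] += 1
    return row["A"] + row["B"]

# Row-wise: the Python function runs for every row
row_sums = df_calls.apply(add_columns, axis=1)

# Column-wise: one vectorized addition, no Python-level row function at all
vectorized_sums = df_calls["A"] + df_calls["B"]

print(calls["count"], "function calls for", len(df_calls), "rows")
```

The function is invoked at least once per row, which is why apply() with axis=1 scales like a loop rather than like a vectorized operation.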

How to make Pandas loop faster?

Avoid explicit loops where you can. Use vectorized Pandas operations or NumPy functions first, and fall back to apply() for logic that cannot be vectorized. When looping is unavoidable, prefer itertuples() over iterrows() for improved speed and efficiency.

Why is itertuples() faster than iterrows()?

itertuples() converts rows into lightweight named tuples, which are more memory-efficient and faster to access than the pandas Series objects returned by iterrows().
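
itertuples() also accepts a `name` parameter; passing `name=None` returns plain tuples instead of named tuples, which skips the named-tuple construction at the cost of attribute access. The sketch below uses a hypothetical frame `df_plain` to show both forms side by side.

```python
import pandas as pd

df_plain = pd.DataFrame({"A": [1, 2], "B": [10, 20]})

# Default: named tuples, so fields are read by attribute
named_sums = [row.A + row.B for row in df_plain.itertuples(index=False)]

# name=None: plain tuples, so fields are unpacked by position
plain_sums = [a + b for a, b in df_plain.itertuples(index=False, name=None)]

print(named_sums, plain_sums)
```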