Python Pandas - Indexing and Selecting Data



In pandas, indexing and selecting data means retrieving specific parts of a Series or DataFrame: individual values, rows, columns, or rectangular subsets. These operations are central to data analysis because they let you focus on the relevant data, apply transformations, and perform calculations on just the subset you need.

The axis labels of a pandas object (its index and column names) serve several purposes. They identify the data, acting as metadata that helps with analysis, visualization, and interactive display. They enable automatic alignment of data by label during operations. They also make getting and setting subsets of the data straightforward.
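
As a small illustration of label alignment, the sketch below adds two Series whose labels only partly overlap. The data and the variable names s1, s2, and total are made up for this example.

```python
import pandas as pd

# Two Series whose labels overlap only partially (illustrative data).
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition matches values by label, not by position.
# Labels present in only one Series produce NaN in the result.
total = s1 + s2
print(total)
```

Only the labels 'b' and 'c' exist in both Series, so only those two entries of the result hold a number.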

This tutorial will explore various methods to slice, dice, and manipulate data using Pandas, helping you understand how to access and modify subsets of your data.

Types of Indexing in Pandas

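The Python and NumPy indexing operator ([ ]) and the attribute operator (.) give quick access to pandas data structures. However, because the type of the data being accessed is not known in advance, these standard operators have some optimization limits. Attribute access also works only when a column name is a valid Python identifier that does not clash with an existing attribute or method. For production code, the dedicated indexers .loc and .iloc are generally preferable.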
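
The following sketch shows one way the attribute operator can surprise you. The column name 'count' is chosen for illustration because it collides with the existing DataFrame.count method.

```python
import pandas as pd

# Illustrative frame; the column 'count' shares its name with a method.
df = pd.DataFrame({'A': [1, 2, 3], 'count': [10, 20, 30]})

# Attribute access works for 'A', a valid identifier with no name clash.
print(df.A.tolist())

# df.count resolves to the DataFrame.count method, not to the column.
print(callable(df.count))

# Bracket indexing always refers to the column.
print(df['count'].tolist())
```

Bracket indexing is the safer default whenever column names come from external data.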

Pandas provides several methods for indexing and selecting data, such as −

  • Label-Based Indexing with .loc

  • Integer Position-Based Indexing with .iloc

  • Indexing with Brackets []

Label-Based Indexing with .loc

The .loc indexer is used for label-based indexing, which means you can access rows and columns by their labels. It also supports boolean arrays for conditional selection.

.loc accepts the following kinds of input −

  • Single Label: Selects a single row, e.g., df.loc['a'], or a single column, e.g., df.loc[:, 'A'].

  • List of Labels: Selects multiple rows or columns, e.g., df.loc[['a', 'b']].

  • Label Slicing: Use slices with labels, e.g., df.loc['a':'f'] (both start and end are included).

  • Boolean Arrays: Filter data based on conditions, e.g., df.loc[boolean_array].

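.loc takes up to two selectors separated by a comma, each of which can be a single label, a list of labels, or a label slice. The first selects rows and the second selects columns. If the second is omitted, all columns are returned.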

Example 1

Here is a basic example that selects all rows for a specific column using the loc indexer.

# Import the pandas library, aliased as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                  columns=['A', 'B', 'C', 'D'])

print("Original DataFrame:\n", df)

# Select all rows for a specific column
print('\nResult:\n', df.loc[:, 'A'])

Its output is as follows −

Original DataFrame:
           A         B         C         D
a  0.962766 -0.195444  1.729083 -0.701897
b -0.552681  0.797465 -1.635212 -0.624931
c  0.581866 -0.404623 -2.124927 -0.190193
d -0.284274  0.019995 -0.589465  0.914940
e  0.697209 -0.629572 -0.347832  0.272185
f -0.181442 -0.000983  2.889981  0.104957
g  1.195847 -1.358104  0.110449 -0.341744
h -0.121682  0.744557  0.083820  0.355442

Result:
 a    0.962766
b   -0.552681
c    0.581866
d   -0.284274
e    0.697209
f   -0.181442
g    1.195847
h   -0.121682
Name: A, dtype: float64

Note: The output generated will vary with each execution because the DataFrame is created using NumPy's random number generator.

Example 2

This example selects all rows for multiple columns.

# Import the pandas library, aliased as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                  columns=['A', 'B', 'C', 'D'])

# Select all rows for multiple columns by passing a list of labels
print(df.loc[:, ['A', 'C']])

Its output is as follows −

            A           C
a    0.391548    0.745623
b   -0.070649    1.620406
c   -0.317212    1.448365
d   -2.162406   -0.873557
e    2.202797    0.528067
f    0.613709    0.286414
g    1.050559    0.216526
h    1.122680   -1.621420

Example 3

This example selects specific rows and specific columns.

# Import the pandas library, aliased as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                  columns=['A', 'B', 'C', 'D'])

# Select a few rows and multiple columns by passing lists of labels
print(df.loc[['a', 'b', 'f', 'h'], ['A', 'C']])

Its output is as follows −

           A          C
a   0.391548   0.745623
b  -0.070649   1.620406
f   0.613709   0.286414
h   1.122680  -1.621420

Example 4

The following example selects a range of rows for all columns using the loc indexer.

# Import the pandas library, aliased as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                  columns=['A', 'B', 'C', 'D'])

# Select a range of rows for all columns (both 'c' and 'e' are included)
print(df.loc['c':'e'])

Its output is as follows −

          A         B         C         D
c  0.044589  1.966278  0.894157  1.798397
d  0.451744  0.233724 -0.412644 -2.185069
e -0.865967 -1.090676 -0.931936  0.214358
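
The examples above do not cover the boolean-array form of .loc. A minimal sketch of it follows, using a small fixed DataFrame so the result is reproducible. The data and the names mask and positive_b are illustrative.

```python
import pandas as pd

# Small fixed frame so the result does not depend on random numbers.
df = pd.DataFrame(
    {'A': [1, -2, 3, -4], 'B': [10, 20, 30, 40]},
    index=['a', 'b', 'c', 'd'],
)

# A comparison yields a boolean Series aligned with the row labels.
mask = df['A'] > 0

# Keep the rows where the mask is True, restricted to column 'B'.
positive_b = df.loc[mask, 'B']
print(positive_b)
```

Here the mask is True for the rows labelled 'a' and 'c', so only those two values of column B are returned.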

Integer Position-Based Indexing with .iloc

The .iloc indexer is used for integer position-based indexing, which selects rows and columns by their numerical position. Positions are 0-based, as in standard Python and NumPy indexing, and the end of a slice is excluded. .iloc accepts the following kinds of input −

  • Single Integer: Selects data by its position, e.g., df.iloc[0].

  • List of Integers: Select multiple rows or columns by their positions, e.g., df.iloc[[0, 1, 2]].

  • Integer Slicing: Use slices with integers, e.g., df.iloc[1:3].

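  • Boolean Arrays: Select by a boolean array whose length matches the axis, e.g., df.iloc[boolean_array]. Unlike .loc, .iloc does not accept a boolean Series, so convert it to a NumPy array first.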
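
The difference between labels and positions is easiest to see when the index itself holds integers. The sketch below assumes a small illustrative DataFrame whose labels are 10, 20, and 30.

```python
import pandas as pd

# Illustrative frame whose integer labels differ from the positions 0, 1, 2.
df = pd.DataFrame({'A': [1, 2, 3]}, index=[10, 20, 30])

# .loc interprets 10 as a label; .iloc interprets 0 as a position.
row_by_label = df.loc[10]
row_by_position = df.iloc[0]
print(row_by_label.equals(row_by_position))

# Label slices include the end label; position slices exclude the end.
print(len(df.loc[10:20]))   # labels 10 and 20
print(len(df.iloc[0:1]))    # position 0 only
```

Both lookups return the same first row, but the label slice returns two rows while the position slice returns one.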

Example 1

Here is a basic example that selects the first four rows for all columns using the iloc indexer.

# Import the pandas library, aliased as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

print("Original DataFrame:\n", df)

# Select the first four rows for all columns
print('\nResult:\n', df.iloc[:4])

Its output is as follows −

Original DataFrame:
           A         B         C         D
0 -1.152267  2.206954 -0.603874  1.275639
1 -0.799114 -0.214075  0.283186  0.030256
2 -1.823776  1.109537  1.512704  0.831070
3 -0.788280  0.961695 -0.127322 -0.597121
4  0.764930 -1.310503  0.108259 -0.600038
5 -1.683649 -0.602324 -1.175043 -0.343795
6  0.323984 -2.314158  0.098935  0.065528
7  0.109998 -0.259021 -0.429467  0.224148

Result:
           A         B         C         D
0 -1.152267  2.206954 -0.603874  1.275639
1 -0.799114 -0.214075  0.283186  0.030256
2 -1.823776  1.109537  1.512704  0.831070
3 -0.788280  0.961695 -0.127322 -0.597121

Example 2

The following example uses integer slicing to select the first four rows, and then a block of rows and columns.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

# Integer slicing
print(df.iloc[:4])
print(df.iloc[1:5, 2:4])

Its output is as follows −

           A          B           C           D
0   0.699435   0.256239   -1.270702   -0.645195
1  -0.685354   0.890791   -0.813012    0.631615
2  -0.783192  -0.531378    0.025070    0.230806
3   0.539042  -1.284314    0.826977   -0.026251

           C          D
1  -0.813012   0.631615
2   0.025070   0.230806
3   0.826977  -0.026251
4   1.423332   1.130568

Example 3

This example selects specific rows and columns by passing lists of integer positions.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

# Select rows 1, 3, 5 and columns 1, 3 by position
print(df.iloc[[1, 3, 5], [1, 3]])

Its output is as follows −

           B           D
1   0.890791    0.631615
3  -1.284314   -0.026251
5  -0.512888   -0.518930
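
The boolean-array form of .iloc is not covered by the examples above. The following is a minimal sketch of it, with illustrative data and names. It converts the mask to a NumPy array with to_numpy() before passing it to .iloc.

```python
import pandas as pd

# Small fixed frame with the default 0-based integer index.
df = pd.DataFrame({'A': [1, -2, 3, -4], 'B': [10, 20, 30, 40]})

# Build a plain NumPy boolean array, one entry per row.
mask = (df['A'] > 0).to_numpy()

# .iloc keeps the rows at the positions where the mask is True.
selected = df.iloc[mask]
print(selected)
```

The mask is True at positions 0 and 2, so those two rows are returned.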

Direct Indexing with Brackets "[]"

Direct indexing with [] is a quick and intuitive way to access data, similar to indexing Python dictionaries and NumPy arrays. It's often used for basic operations −

  • Single Column: Access a single column by its name.

  • Multiple Columns: Select multiple columns by passing a list of column names.

  • Row Slicing: Select a range of rows by passing a slice, e.g., df[0:3].

Example 1

This example demonstrates how to use direct indexing with brackets to access a single column.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

# Accessing a single column
print(df['A'])

Its output is as follows −

0   -0.850937
1   -1.588211
2   -1.125260
3    2.608681
4   -0.156749
5    0.154958
6    0.396192
7   -0.397918
Name: A, dtype: float64

Example 2

This example selects multiple columns using direct indexing.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

# Accessing multiple columns
print(df[['A', 'B']])

Its output is as follows −

          A         B
0  0.167211 -0.080335
1 -0.104173  1.352168
2 -0.979755 -0.869028
3  0.168335 -1.362229
4 -1.372569  0.360735
5  0.428583 -0.203561
6 -0.119982  1.228681
7 -1.645357  0.331438
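
Row slicing with brackets is listed above but not shown in an example. A minimal sketch follows, assuming a small fixed DataFrame with the default integer index. The names first_three and every_other are illustrative.

```python
import pandas as pd

# Small fixed frame with the default 0-based integer index.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

# A slice inside [] selects rows; the end position is excluded.
first_three = df[0:3]
print(first_three)

# A step works as it does for Python lists.
every_other = df[::2]
print(every_other)
```

Note that a slice inside [] selects rows, while a single name or a list of names selects columns.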