Groupby without aggregation in Pandas
Pandas is a great python package for manipulating data and some of the tools which we learn as a beginner are an aggregation and group by functions of pandas.
Groupby() is a function used to split the data in dataframe into groups based on a given condition. Aggregation on other hand operates on series, data and returns a numerical summary of the data. There are a lot of aggregation functions as count(),max(),min(),mean(),std(),describe(). We can combine both functions to find multiple aggregations on a particular column. For further details about this refer to this article How to combine Groupby and Multiple aggregation function in Pandas.
Instead of using groupby aggregation together, we can perform groupby without aggregation which is applicable to aggregate data separately. We will see this with an example where we will take a breast cancer dataset with different numerical features like mean area, worst texture, and many more. The target column has 0 which means the cancer is benign and 1 means the cancer is malignant.
Example 1:
Python3
# importing python libraries and breast_cancer dataset from sklearn import numpy as np import pandas as pd from sklearn import datasets from sklearn.datasets import load_breast_cancer # data is loaded in a DataFrame cancer_data = load_breast_cancer() df = pd.DataFrame(cancer_data.data, columns = cancer_data.feature_names) df[ 'target' ] = pd.Series(cancer_data.target) df.head() |
Output:
Thus, we can visualize the data which have all the columns, but all the columns are in numerical form and there are no categorical data instead of only the target column, so let’s have a look in target and another column named ‘worst texture’.
Python3
print (df[ 'target' ].describe(), df[ 'worst texture' ].describe()) |
Output:
count 569.000000 mean 0.627417 std 0.483918 min 0.000000 25% 0.000000 50% 1.000000 75% 1.000000 max 1.000000 Name: target, dtype: float64 count 569.000000 mean 25.677223 std 6.146258 min 12.020000 25% 21.080000 50% 25.410000 75% 29.720000 max 49.540000 Name: worst texture, dtype: float64
Here we can see the summary of target and worst texture column, we take only these columns to understand the groupby aggregate functions better.
Python3
df1 = df[[ 'worst texture' , 'worst area' , 'target' ]] gr1 = df1.groupby(df1[ 'target' ]).mean() gr1 |
Output:
So here we see the mean of worst texture and worst area grouped around benign and malignant cancer, now the normal data has been interfered by this method, and we have to add them separately there’s why groupby without aggregation becomes handy.
Python3
# function to take the data as group and perform aggregation def meanofTargets(group1): wt = group1[ 'worst texture' ].agg( 'mean' ) wa = group1[ 'worst area' ].agg( 'mean' ) group1[ 'Mean worst texture' ] = wt group1[ 'Mean worst area' ] = wa return group1 df2 = df1.groupby( 'target' ). apply (meanofTargets) df2 |
Output:
Thus, in the above dataset, we are able to join the mean of the worst area and worst texture in a separate column, and we do it with groupby method of the target column where it grouped ‘1’s and 0’s separately.
Example 2:
Similarly, let’s see another example of using groupby without aggregation. But as there is no categorical column we will have to make a categorical column myself. For this, let’s choose mean area which have a max value of 2500 and min value of 150, so we will classify them into 6 groups of range 400 using the pandas cut method to convert continuous to categorical. As this doesn’t concern subject of the article refer to the GitHub repo here for more info.
Thus, we make a categorical column ‘Cat_mean_area’ and we can perform the groupby aggregation method here too. But instead of grouping the whole dataset we can use some specific columns like mean area and target only.
Python3
# dataframe df_3 to contain only mean_area,Cat_mean_area and target df_3 = df_2[[ 'mean area' , 'Cat_mean_area' , 'target' ]] # applying groupby sum gr2 = df_3.groupby(df_2[ 'Cat_mean_area' ]). sum () gr2 |
Output:
Thus, by the steps mentioned above, we perform groupby without aggregation.
Python3
# function to take the data as group and perform aggregation def totalTargets(group): g = group[ 'target' ].agg( 'sum' ) group[ 'Total_targets' ] = g return group df_4 = df_3.groupby(df_3[ 'Cat_mean_area' ]). apply (totalTargets) df_4 |
Output: