Aggregate in R

In R, you can use the aggregate function to compute summary statistics for subsets of the data. This function is very similar to the tapply function, but it also accepts a formula or a time series object as input and, for data frames and formulas, the output is of class data.frame. In this tutorial you will learn how to use the R aggregate function, with several examples, to aggregate rows by a grouping factor.
The aggregate() function in R
The syntax of the R aggregate function depends on the input data. There are three possible input types: a data frame, a formula and a time series object. The arguments and their descriptions for each method are summarized in the following block:
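A sketch of the three call forms, written from the signatures documented in ?aggregate. The first argument of the formula method is named formula or x depending on the R version, so the last lines print the signatures from your own installation (the sig_* names are only illustrative):

```r
# Data frame method
# aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)
#   x   : data frame (or vector) to summarize
#   by  : list of grouping elements
#   FUN : function computing the summary statistic
#   ... : further arguments passed to FUN

# Formula method
# aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
#   formula : e.g. y ~ group or cbind(y1, y2) ~ group1 + group2
#   data    : data frame containing the variables in the formula

# Time series method
# aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1,
#           ts.eps = getOption("ts.eps"), ...)
#   x          : object of class ts
#   nfrequency : new number of observations per unit of time

# Signatures as defined in the installed version of R
sig_df      <- args(getS3method("aggregate", "data.frame"))
sig_formula <- args(getS3method("aggregate", "formula"))
sig_ts      <- args(getS3method("aggregate", "ts"))
print(sig_df)
print(sig_formula)
print(sig_ts)
```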
Remember that you can type help(aggregate) or ?aggregate for additional information.
In the following sections we will show examples and use cases of aggregating data, such as computing the mean, the count or the quantiles by group. Using aggregate in R is very simple, and it is worth mentioning that you can apply any function you want, even a custom function.
Aggregate mean in R by group
Consider, for instance, the following dataset, which contains the weight and the type of feed of a sample of chickens:
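The dataset itself is not shown here. A minimal sketch, assuming it is the built-in chickwts data frame (which holds the chicken weights and feed types) stored under the illustrative name df:

```r
# Assumption: the tutorial's dataset is the built-in chickwts data frame
df <- chickwts

head(df)  # first rows: numeric 'weight' and factor 'feed'
str(df)   # structure of the data frame
```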
To compute the mean by group with the aggregate function, specify the numerical variable as the first argument, the categorical variable (as a list) as the second and the function to be applied (in this case mean) as the third. An alternative is to specify a formula of the form numerical ~ categorical.
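Both call styles can be sketched as follows, assuming the data are the built-in chickwts data frame; the object names group_mean and group_mean_formula are illustrative:

```r
# Assumption: the data are the built-in chickwts data frame
df <- chickwts

# Numerical variable, grouping variable wrapped in a list, function to apply
group_mean <- aggregate(df$weight, list(df$feed), FUN = mean)
group_mean

# Equivalent call using a formula: numerical ~ categorical
group_mean_formula <- aggregate(weight ~ feed, data = df, FUN = mean)
group_mean_formula
```

The formula version names the output columns after the variables, while the first version uses the default names Group.1 and x.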
Note that, when using a formula, the grouping variable is coerced to a factor. Consequently, you can also use a numerical variable to represent the groups.
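As a small illustration of grouping by a numerical variable, the sketch below uses the built-in mtcars data frame, which is not part of this tutorial and is chosen only because its cyl column is numeric:

```r
# Assumption: mtcars is used only for illustration; cyl is a numeric column
is.numeric(mtcars$cyl)

# cyl is treated as a grouping factor with one group per distinct value
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpg_by_cyl
```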
However, you might have noticed that, when the variables are not passed as a formula, the column names of the resulting data frame don't represent the variables. To modify the column names of the output, you can use the colnames function as follows:
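A minimal sketch, assuming the chickwts data; the new column names "Group" and "Mean" are arbitrary choices:

```r
# Assumption: the data are the built-in chickwts data frame
df <- chickwts

group_mean <- aggregate(df$weight, list(df$feed), FUN = mean)
colnames(group_mean)  # default names: "Group.1" and "x"

# Replace the default names with descriptive ones
colnames(group_mean) <- c("Group", "Mean")
group_mean
```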
Aggregate count
Sometimes it is useful to know the number of elements in each group of a categorical variable. Although you could use the table function, if you want the output to be a data frame you can get the count by passing the length function to aggregate.
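For example, assuming the chickwts data again (group_count is an illustrative name):

```r
# Assumption: the data are the built-in chickwts data frame
df <- chickwts

# Number of observations per feed type, returned as a data frame
group_count <- aggregate(df$weight, list(df$feed), FUN = length)
group_count

# The same counts as a table, for comparison
table(df$feed)
```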
Aggregate quantile
In this section we are going to use a time series object of class xts as an example, although you could use a data frame instead to apply the function. Consider the following sample object that represents the daily returns of an investment fund over a year:
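The original sample object is not shown, so the following is only a sketch with simulated data. The year 2014, the random returns and all object names are assumptions, and the xts object is created only if the xts package is installed:

```r
# Assumption: simulated returns, one value per day of 2014
set.seed(1)
dates   <- seq(as.Date("2014-01-01"), as.Date("2014-12-31"), by = "day")
returns <- rnorm(length(dates))

# Data frame version, which needs base R only
returns_df <- data.frame(date = dates, ret = returns)
head(returns_df)

# xts version, built only when the xts package is available
if (requireNamespace("xts", quietly = TRUE)) {
  tserie <- xts::xts(returns, order.by = dates)
  print(head(tserie))
}
```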
In this scenario, you may be interested in aggregating the quantiles by date (aggregating daily data to monthly or weekly data, for instance). Hence, you can calculate the 5% and 95% quantiles of the returns for each month by typing:
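A sketch of the monthly quantiles using a data frame of simulated daily returns rather than an xts object, so that it runs with base R only. The data and the names returns_df and monthly_q are assumptions:

```r
# Assumption: simulated returns, one value per day of 2014
set.seed(1)
dates      <- seq(as.Date("2014-01-01"), as.Date("2014-12-31"), by = "day")
returns_df <- data.frame(date = dates, ret = rnorm(length(dates)))

# Grouping variable: month of each observation ("01", ..., "12")
returns_df$month <- format(returns_df$date, "%m")

# probs is not an argument of aggregate; it is passed on to quantile
monthly_q <- aggregate(ret ~ month, data = returns_df,
                       FUN = quantile, probs = c(0.05, 0.95))
monthly_q
```

Because quantile returns two values per group, the ret column of the result is a matrix with one column per requested probability.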
Note that you can pass additional arguments to the function you are applying by adding them, separated by commas, after the FUN argument.
Aggregate by multiple columns in R
Finally, it is worth mentioning that you can aggregate by more than one variable. There are three options: grouping by more than one categorical variable, summarizing multiple numerical variables, or both at the same time.
On the one hand, we are going to create a new categorical variable named cat_var.
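One way to create such a variable is sketched below. The levels "A", "B" and "C", the random assignment and the name df_2 are assumptions made for illustration:

```r
# Assumption: the data are the built-in chickwts data frame
df <- chickwts

# Hypothetical second grouping variable with three randomly assigned levels
set.seed(1)
cat_var <- sample(c("A", "B", "C"), nrow(df), replace = TRUE)

# New data frame with the extra column
df_2 <- cbind(df, cat_var)
head(df_2)
```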
Now, you can use the aggregate function to compute the sum of the numerical variable for each combination of the two categorical variables:
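A sketch of the call, which rebuilds the hypothetical cat_var column so that the block runs on its own:

```r
# Assumption: chickwts plus a hypothetical random grouping variable cat_var
set.seed(1)
df   <- chickwts
df_2 <- cbind(df, cat_var = sample(c("A", "B", "C"), nrow(df), replace = TRUE))

# Sum of weight for each combination of feed and cat_var
group_sum <- aggregate(weight ~ feed + cat_var, data = df_2, FUN = sum)
group_sum

# Equivalent call without a formula
group_sum_list <- aggregate(df_2$weight,
                            by = list(df_2$feed, df_2$cat_var), FUN = sum)
```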
When the aggregate function is applied to several categorical variables, the statistical summary is computed for each combination of their levels. By default, only the combinations that occur in the data are returned.
On the other hand, we are going to create a new numeric variable named num_var.
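A sketch of one way to do it; the values are simulated standard normal draws, and df_3 is an illustrative name:

```r
# Assumption: the data are the built-in chickwts data frame
df <- chickwts

# Hypothetical second numeric variable with simulated values
set.seed(1)
num_var <- rnorm(nrow(df))

# New data frame with the extra column
df_3 <- cbind(num_var, df)
head(df_3)
```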
In this scenario, when working with two or more numerical variables, you can use the cbind function on the left-hand side of the formula to combine them:
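For instance, assuming the chickwts data with a hypothetical simulated num_var column:

```r
# Assumption: chickwts plus a hypothetical simulated numeric variable num_var
set.seed(1)
df   <- chickwts
df_3 <- cbind(num_var = rnorm(nrow(df)), df)

# Mean of both numerical variables for each feed type
multi_mean <- aggregate(cbind(weight, num_var) ~ feed, data = df_3, FUN = mean)
multi_mean
```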
Thus, the statistical summary is created for the numeric variables based on the factor.
You can also apply the function to multiple numerical and categorical variables at once. In this situation, there is one summary column per numerical variable and one row per combination of groups.
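A sketch combining both cases, with hypothetical num_var and cat_var columns added to the chickwts data (df_4 and multi_both are illustrative names):

```r
# Assumption: chickwts plus hypothetical simulated columns num_var and cat_var
set.seed(1)
df   <- chickwts
df_4 <- data.frame(df,
                   num_var = rnorm(nrow(df)),
                   cat_var = sample(c("A", "B", "C"), nrow(df), replace = TRUE))

# Mean of two numerical variables for each combination of feed and cat_var
multi_both <- aggregate(cbind(weight, num_var) ~ feed + cat_var,
                        data = df_4, FUN = mean)
multi_both
```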