Summarizing Data


People often get confused about how to summarize data quickly in
R, because there are several options to choose from.

People who move to R from SAS or SQL are used to writing simple
queries to summarize data sets. For them, the biggest question is
how to do the same thing in R.
In this article I will cover the primary ways to summarize data sets.
Hopefully this will make your journey much easier than it looks.
Generally, summarizing data means computing statistical figures
such as the mean and median, or drawing a box plot. If you
understand data better through scatter plots and histograms, you
can refer to the guide on data visualization in R.

 
Methods to Summarise Data in R
1. apply
The apply function returns a vector, array, or list of values
obtained by applying a function to either the rows or the columns
of a matrix. It is the simplest of the functions that can do this
job, but it is limited to collapsing along rows (second argument
1) or columns (second argument 2).
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
apply(m, 1, mean)
[1]  6  7  8  9 10 11 12 13 14 15
apply(m, 2, mean)
[1]  5.5 15.5
 
2. lapply
lapply “returns a list of the same length as X, each element of
which is the result of applying FUN to the corresponding
element of X.”
l <- list(a = 1:10, b = 11:20)
lapply(l, mean)
$a
[1] 5.5
$b
[1] 15.5
 
3.  sapply
“sapply” does the same thing as lapply but simplifies the result
to a vector or matrix where possible. Let’s consider the last
example again.
l <- list(a = 1:10, b = 11:20)
l.mean <- sapply(l, mean)
class(l.mean)
[1] "numeric"
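To illustrate the "vector or matrix" part, here is a small sketch that reuses the same list. When the applied function returns more than one value per element, sapply binds the results into a matrix. The name l.range is only chosen for this illustration.

```r
l <- list(a = 1:10, b = 11:20)

# range() returns two values (minimum and maximum) per list element,
# so sapply() simplifies the result to a matrix with one column per element
l.range <- sapply(l, range)
l.range

# TRUE: the result is a matrix rather than a list
is.matrix(l.range)
```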
 
4. tapply
None of the functions discussed so far can do what SQL achieves
with grouping. Here is a function which completes the palette for
R. Usage is “tapply(X, INDEX, FUN = NULL, …, simplify =
TRUE)”, where X is “an atomic object, typically a vector” and
INDEX is “a list of factors, each of same length as X”. Here is
an example which will make the usage clear.
attach(iris)
# mean petal length by species
tapply(iris$Petal.Length, Species, mean)
    setosa versicolor  virginica 
     1.462      4.260      5.552
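Since INDEX can be a list of factors, tapply can also group by more than one variable at once. The sketch below is an illustration only and uses the built-in mtcars data set rather than iris, because iris has just one grouping column. The name mpg.mean is made up for this example.

```r
# mean miles per gallon by number of cylinders (cyl) and
# transmission type (am: 0 = automatic, 1 = manual)
mpg.mean <- tapply(mtcars$mpg,
                   list(cyl = mtcars$cyl, am = mtcars$am),
                   mean)

# the result is a matrix: one row per cyl value, one column per am value
mpg.mean
```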
 
5. by
Now comes a slightly more complicated function. The function
‘by’ is an object-oriented wrapper for ‘tapply’ applied to data
frames. Hopefully the example will make it clearer.
attach(iris)
by(iris[, 1:4], Species, colMeans)
Species: setosa
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       5.006        3.428        1.462        0.246 
------------------------------------------------------------ 
Species: versicolor
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       5.936        2.770        4.260        1.326 
------------------------------------------------------------ 
Species: virginica
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       6.588        2.974        5.552        2.026
What did the function do? It simply splits the data by a class
variable, which in this case is the species, and then creates a
summary for each group. In other words, it applies the function
to each of the split data frames.
The returned object is of class “by”.
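The split-then-apply idea can be spelled out by hand. The following is a rough sketch of the same computation using split and lapply, not a description of how by is implemented. The names pieces and species.means are made up for the illustration.

```r
# split the four numeric columns into one data frame per species
pieces <- split(iris[, 1:4], iris$Species)

# apply colMeans to each piece; the result is a plain list,
# not an object of class "by"
species.means <- lapply(pieces, colMeans)
species.means$setosa
```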
 
6. sqldf
If you found any of the above statements difficult, don’t panic. I
bring you a lifeline which you can use anytime: SQL queries
inside R, using the sqldf package. Here is how you can do the
same summary.
library(sqldf)
# avg() is the SQL aggregate for the mean; column names that
# contain a dot must be quoted inside the query
summarization <- sqldf('select Species, avg("Petal.Length") as Petal_Length_mean
  from iris where Species is not null group by Species')
And it’s done. Wasn’t it simple enough? One drawback of this
approach is the time it takes to execute. If you want the same
results faster, read the next section.
7. ddply
This is the fastest of all the methods we discussed. You will need
an additional package, plyr. Let’s do exactly what we did in the
tapply section.
library(plyr)
# mean petal length by species
ddply(iris, "Species", summarise, Petal.Length_mean = mean(Petal.Length))
 
Additional Notes: You can also use packages such as dplyr and
data.table to summarize data. See “Faster Data Manipulation
with these 7 R Packages”.
In general, if you are adding this summarization step in the
middle of a process and need a table as output, go for sqldf or
ddply. “ddply” is faster in these cases but will not give you
options beyond grouping. “sqldf” gives you everything SQL
statements offer for summarizing data.
If you are interested in functions similar to pivot tables or in
transposing tables, consider using “reshape”. We have covered a
few examples in our article “Comprehensive guide for data
exploration in R”.
Challenge: Here is a simple problem you can attempt to solve
using all the methods we have discussed. You have a table of
marks for all school kids in a particular city.
Write code to find the mean marks of each school for both
class 1 and class 2, for students with roll number less than 6.
Then print only the class whose mean score is higher for that
school. For instance, if school A has a mean score of 6 for class
1 and 4 for class 2, you will reject class 2 and only take the class 1
mean score for the school. In case of a tie, you can make a
random choice. Assume that the actual table is much bigger and
keep the code as generalized as possible.
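One possible way to start on the challenge in base R is sketched below. The table itself is not shown in this article, so the data frame scores, its column names (school, class, roll_no, marks) and all of its values are hypothetical and only there to make the code runnable.

```r
# hypothetical data: the real table is not available, these values are made up
set.seed(1)
scores <- data.frame(
  school  = rep(c("A", "B"), each = 20),
  class   = rep(rep(1:2, each = 10), times = 2),
  roll_no = rep(1:10, times = 4),
  marks   = sample(1:10, 40, replace = TRUE)
)

# keep roll numbers below 6, then average the marks per school and class
small <- scores[scores$roll_no < 6, ]
means <- aggregate(marks ~ school + class, data = small, FUN = mean)

# for each school keep the class with the highest mean;
# which.max() takes the first maximum, so ties go to the first class listed
best <- do.call(rbind, lapply(split(means, means$school),
                              function(d) d[which.max(d$marks), ]))
best
```

Nothing here depends on the number of schools or classes, so the same code should carry over to a larger table with these columns.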
 
When we have a data set and need a clear idea about each
parameter, a summary of the data is important. Summarized data
gives a clear picture of the data set.
In this tutorial we are going to talk about the summarize()
function from the dplyr package. Summarizing a data set by group
gives a better indication of the distribution of the data.
In this tutorial you will get an idea of summarise(), group_by()
summaries, and the important functions used within summarise().
Load Library
library(dplyr)
Let’s load the iris data set for summarization and store it in a
new variable, say df.
df <- iris
df1 <- summarise(df, mean(Sepal.Length))
Output:-
mean(Sepal.Length)
5.843333
Let’s compute the mean and standard deviation of Sepal.Length.
df2 <- summarise(df, Mean = mean(Sepal.Length),
                 SD = sd(Sepal.Length))
Output:-
Mean SD
5.843333 0.8280661
Now we try to summarize based on groups.
df3 <- summarise(group_by(df, Species),
                 Mean = mean(Sepal.Length),
                 SD = sd(Sepal.Length))
Output:-
Species Mean SD
1 setosa 5.01 0.352
2 versicolor 5.94 0.516
3 virginica 6.59 0.636
You can use the pipe operator to summarise the data set. The
pipe operator %>% comes from the magrittr package and is also
made available when dplyr is loaded. Let’s load the package.
library(magrittr)
df4<-df %>%
group_by(Species) %>%
summarise(Mean = mean(Sepal.Length),
SD=sd(Sepal.Length))
Output:-
Species Mean SD
1 setosa 5.01 0.352
2 versicolor 5.94 0.516
3 virginica 6.59 0.636
Using the pipe operator, you can easily summarize the data and
plot the result with the help of ggplot2.



library(ggplot2)
Plotting the data set involves four main steps:
Step 1: Select the appropriate data frame
Step 2: Group the data frame
Step 3: Summarize the data frame
Step 4: Plot the summary statistics based on your requirement
df %>%
group_by(Species) %>%
summarise(Mean = mean(Sepal.Length)) %>%
ggplot(aes(x = Species, y = Mean, fill = Species)) +
geom_bar(stat = "identity") +
theme_classic() +
labs(
x = "Species",
y = "Average Sepal.Length ",
title = paste(
"Summary Based on Groups"
)
)
Sum

Another useful function to aggregate the variable is sum().


df5<-df %>%
group_by(Species) %>%
summarise(sum = sum(Sepal.Length),
SD=sd(Sepal.Length))
Output:-
Species sum SD
1 setosa 250 0.352
2 versicolor 297 0.516
3 virginica 329 0.636
Minimum and maximum

Find the minimum and the maximum of a vector or variable with
the help of the functions min() and max().
df6<-df %>%
group_by(Species) %>%
summarise(Min = min(Sepal.Length),
Max=max(Sepal.Length))
Output:-
Species Min Max
1 setosa 4.3 5.8
2 versicolor 4.9 7
3 virginica 4.9 7.9
Count

If you want to count observations by group, you can aggregate
the number of occurrences with n().
df7<-df %>%
group_by(Species) %>%
summarise(Sepal.Length = n())%>%
arrange(desc(Sepal.Length))
Output:-
Species Sepal.Length
1 setosa 50
2 versicolor 50
3 virginica 50
First and Last

In some cases the first observation, or the observation at a given
position, is what matters. You can then make use of first(),
last(), or nth() to pick a position within a group.
df8<-df %>%
group_by(Species) %>%
summarise(First = first(Sepal.Length),
Last=last(Sepal.Length))
Output:- df8
Species First Last
1 setosa 5.1 5
2 versicolor 7 5.7
3 virginica 6.3 5.9
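The example above covers first() and last(). A minimal sketch for nth(), which takes the position as its second argument, could look like this. The names df9 and Second are made up for the illustration.

```r
library(dplyr)

# second Sepal.Length value within each species
df9 <- iris %>%
  group_by(Species) %>%
  summarise(Second = nth(Sepal.Length, 2))
df9
```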
In the same way you can make use of the following functions,
some of which have already been covered in this tutorial.
The important functions for summarizing a data set are listed
below.
Mean
summarise(df,mean = mean(x1))
Median
summarise(df,median = median(x1))
Sum
summarise(df,sum = sum(x1))
Standard Deviation
summarise(df,sd = sd(x1))
Interquartile
summarise(df,interquartile = IQR(x1))
Minimum
summarise(df,minimum = min(x1))
Maximum
summarise(df,maximum = max(x1))
Quantile (here the 25th percentile; quantile(x1) alone returns five values)
summarise(df,quantile = quantile(x1, probs = 0.25))
First Observation
summarise(df,first = first(x1))
Last observation
summarise(df,last = last(x1))
nth observation
summarise(df,nth = nth(x1, 2))
Number of occurrences (n() takes no arguments)
summarise(df,count = n())
Number of distinct occurrence
summarise(df,distinct = n_distinct(x1))
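In the list above, x1 stands for any column of df. As a small sketch, several of these functions can be combined in one summarise() call on the iris data. The name df10 and the output column names are chosen only for this illustration.

```r
library(dplyr)

# several summary functions in a single summarise() call, by species
df10 <- iris %>%
  group_by(Species) %>%
  summarise(
    Median   = median(Sepal.Length),      # middle value
    IQR      = IQR(Sepal.Length),         # interquartile range
    Distinct = n_distinct(Sepal.Length),  # number of distinct values
    Count    = n()                        # number of rows in the group
  )
df10
```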