R-Web-Appendix of
Foundations of Statistics for
Data Scientists
Contents

0 CHAPTER 0: BASICS OF R
  A0.1 Starting a Session, Entering Commands, and Quitting
  A0.2 Installing and Loading R Packages
  A0.3 R Functions and Data Structures
    A0.3.1 Vectors and Lists
    A0.3.2 Factors
    A0.3.3 Matrix and Array
    A0.3.4 Data Frames
    A0.3.5 User-Defined Functions
      A0.3.5.1 Example of user-defined functions: weighted mean
  A0.4 Data Input in R
  A0.5 Control Flows
  A0.6 Graphs in R

Bibliography
CHAPTER 0: BASICS OF R
R is free software for statistical computing and graphics that enjoys increasing popularity
among data scientists. It is available for Windows, Linux and macOS (see http://www.r-project.org/
for downloading R and for information about installation, help and documentation). R implements
the S programming language and provides a similar environment. It is continuously enriched and
updated by researchers who develop new statistical methods and supplement their published results
with the associated R code. In this way, one can find a variety of up-to-date add-on packages for
basic or advanced data analysis and visualization, most of them stored in CRAN (the Comprehensive
R Archive Network). R has meanwhile become the dominant statistical software, supported by a
strong volunteer R community that has developed an R culture. On the occasion of the 25th
anniversary of the creation of R, Significance published an article on its history and perspectives
(Thieme, 2018).
Many R users prefer to work in RStudio, which is an integrated development environ-
ment (IDE) for R that includes a console, syntax-highlighting editor supporting direct code
execution, as well as tools for plotting, history, debugging and workspace management.
RStudio is also available in an open-source edition (RStudio download).
This Appendix is not to be considered a kind of R manual. It supplements the chapters
of this book and is an extended version of its Appendix A, motivating and enabling the direct
application of all the discussed statistical analyses. For getting started with R and detailed
guidance on R programming, the freely available online R manuals are very helpful, especially
'An Introduction to R' and 'R Data Import/Export'. Furthermore, there exists a vast variety of
books and online notes; see for example the Swirl tutoring system at https://swirlstats.com,
books by Hothorn and Everitt (2014), Wickham and Grolemund (2017), and Baumer et al.
(2017), or Altham's Notes at http://www.statslab.cam.ac.uk/~pat/redwsheets.pdf.
Here, we briefly refer to some very basics, needed to follow the examples worked out in R.
Essentials to start with R are summarized in the first chapter of this appendix. The subsequent
chapters provide chapter-wise additional R examples and highlights.
An R session is terminated by typing q() at the command prompt or by selecting 'Exit' in the
'File' menu. Comments (inactive text) can be entered in an R command script after the symbol #,
within the same line. Values are assigned to variables by the operator '<-' or '=' (assignment
with '->' is also possible but not recommended, since variable definitions are then not easily
identifiable and code checking becomes more complicated). The assignment operator does not print
any output. In order to see the value of a variable x, you need to type x at the command prompt
or to enclose the assignment in parentheses.
You can provide multiple commands in a line by separating them with ';', while a command can
extend over more than one line. In this case, a + appears at the beginning of the additional
line(s), indicating that this is not a new command or assignment. Finally note that R is a
case-sensitive language.
> x <- 7; x; X <- 10; X # equivalent to: (x <- 7); (X <- 10)
[1] 7
[1] 10
Output starts with [1], indicating that this line starts with the first value of the results.
The provided R–code examples throughout this book resemble the R–console. Thus,
command lines start with a > (or + if a command expands along more lines). These > and
+ are not part of the command and should not be typed in the console while reproducing
the examples.
We list next the major R packages used in this book. A base R installation along with
these packages forms a powerful and broad toolbox for performing statistical analyses and
tackling data analysis problems from diverse fields. An overview of all contributed packages
in CRAN can be found at http://cran.r-project.org/web/packages/
Data values are organized in a wide variety of data structures, the most commonly used being
vectors, matrices, arrays, lists and data frames. Data structures, functions and more complex
structures built from such components are all known as objects. In fact, all entities created
and handled in R are objects. The names of all objects defined so far in your active workspace
are listed by typing objects() or ls(). Compact information on the structure of an arbitrary
R object is provided by the function str.
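For illustration, a minimal sketch (our own toy objects, assuming a workspace that contains only the two objects created here):

> x <- 7; y <- c(1, 2, 3)
> ls()             # names of all objects in the active workspace
[1] "x" "y"
> str(y)           # compact description of the structure of y
 num [1:3] 1 2 3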
The most common data value types treated in R are double (numeric, including -Inf:
−∞, Inf: ∞ and NaN: not a number), integer (defined by placing an L behind the number),
character (character string, given in single or double quotes) and logical (TRUE (T) and FALSE (F)).
Variables of all types can take the value NA (not available), which by default is logical.
The specific (R internal) type of any object is determined through the typeof function.
It can be also identified by class, which is an attribute of an object and can be assigned
by a user to an object, regardless of its internal storage mode. Furthermore, the functions
as.numeric, as.integer, as.logical and as.character can be used to convert a variable
to the stated type. For example consider:
> a <- TRUE; typeof(a) # same output as with: class(a)
[1] "logical"
> as.numeric(a) # same output as with: as.integer(a)
[1] 1
> b <- "color"; typeof(b) # same output as with: class(b)
[1] "character"
> c <- 2L; typeof(c); str(c) # same output as with: class(c)
[1] "integer"
int 2
> d <- 2; typeof(d); class(d) # different output for typeof() and class()
[1] "double"
[1] "numeric"
> as.logical(d)
[1] TRUE
The data structures used in this book are briefly described below.
Some helpful functions in creating vectors are the functions seq and rep, standing for
sequence and repeat, respectively:
> s1 <- seq(from=1, to=4, by=1); s1 # creates a double vector while
[1] 1 2 3 4 # 1:4 creates an integer vector
> s2 <- rep(1:4, 3); s2
[1] 1 2 3 4 1 2 3 4 1 2 3 4
> s3 <- rep(1:4, each=3); s3
[1] 1 1 1 2 2 2 3 3 3 4 4 4
Basic functions for handling vectors are length, min, max, sum and prod, providing for
a vector its length, minimal value, maximal value, sum and product of its elements, respec-
tively. Furthermore, sort sorts the elements of a vector in increasing order (or decreasing,
specified by the argument decreasing = TRUE) and rank provides the rank values for its
elements.
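A small illustration of these functions (our own toy vector):

> v <- c(4, 1, 7, 2)
> length(v); min(v); max(v); sum(v); prod(v)
[1] 4
[1] 1
[1] 7
[1] 14
[1] 56
> sort(v); rank(v)
[1] 1 2 4 7
[1] 3 1 4 2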
A vector cannot have components of mixed types. If values of different types are combined,
they are all coerced to the most flexible of these types. Thus,
> x5 <- c(2.1, TRUE, "A","B","C"); x5
[1] "2.1" "TRUE" "A" "B" "C"
> typeof(x5)
[1] "character"
A list is a flexible data structure that allows the combination of data components of
different value types. Continuing the example above, we define a list with the elements of
x5 and verify that the constructed list has 5 components of different type.
> y1 <- list(2.1, TRUE, "A","B","C"); y1
[[1]]
[1] 2.1
[[2]]
[1] TRUE
[[3]]
[1] "A"
[[4]]
[1] "B"
[[5]]
[1] "C"
A list can also combine vectors or lists of different lengths. Note the difference between the
lists y1 and y2 below.
> y2 <- list(2.1, TRUE, c("A","B","C")); y2
[[1]]
[1] 2.1
[[2]]
[1] TRUE
[[3]]
[1] "A" "B" "C"
> y1[3]
[[1]]
[1] "A"
> y2[3]
[[1]]
[1] "A" "B" "C"
A0.3.2 Factors
A factor is a special type of vector that corresponds to a nominal or ordinal categorical
variable and has a relatively small number of pre-specified possible outcomes (numeric or
character), known as levels. Factors are important in modeling with categorical data while
the definition of factors is also required for the creation of some plots, as we shall see later
on (e.g. in Section A1.2). The levels can be predefined or specified by the data, as illustrated
in the example that follows. Consider the quality categories (A to E) of a sample of 12
products, in a case where none of the sampled products was of category D.
> q <- c("A","B","B","A","C","A","B","E","C","A","A","B"); q
[1] "A" "B" "B" "A" "C" "A" "B" "E" "C" "A" "A" "B"
> q_f <- factor(q); q_f # levels specified by the data
[1] A B B A C A B E C A A B
Levels: A B C E
> q_fc <- factor(q,levels=c("A","B","C","D","E")) # prespecified levels
> q_fc
[1] A B B A C A B E C A A B
Levels: A B C D E
> table(q_f) # frequency table for variable quality
q_f
A B C E
5 4 2 1
> table(q_fc) # frequency table for variable quality
q_fc # (with prespecified domain)
A B C D E
5 4 2 0 1
Equivalently to q_fc, the quality factor can be defined through a numeric vector as follows.
> q_n <- c(1,2,2,1,3,1,2,5,3,1,1,2)
> q_n_fc <- factor(q_n, levels=c(1,2,3,4,5), labels=c("A","B","C","D","E"))
Note that the forcats package in tidyverse provides tools for simplifying certain
tasks for factors, such as reordering the levels of a factor by some criterion (like frequency
of another variable, useful for better visualization in graphs) or collapsing levels. For our
example, we can easily collapse the categories C to E:
> q3 <- fct_collapse(q_fc, CDE=c("C","D","E")); q3
[1] A B B A CDE A B CDE CDE A A B
Levels: A B CDE
# alternatively, levels can be collapsed by duplicating labels:
> qf <- factor(q_fc, levels=c("A","B","C","D","E"),
+ labels=c("A","B","C","C","C"))
Arrays of higher dimension are defined analogously by adjusting the dimension vector in
the array argument accordingly.
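Since the construction of z1 is not shown in this excerpt, the following minimal sketch assumes z1 <- 1:12 and illustrates a three-dimensional array:

> z1 <- 1:12                          # assumed definition (not shown above)
> z3 <- array(1:24, dim=c(4,3,2))     # a 4 x 3 x 2 array
> dim(z3)
[1] 4 3 2
> z3[2,3,1]                           # element in row 2, column 3 of the first layer
[1] 10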
A two-dimensional array can alternatively be defined by matrix. Thus, z2 of the example
above can alternatively be defined as follows.
> z2 <- matrix(z1, nrow=4, ncol=3)
By default, matrix fills in the vector by columns. Filling by rows is also possible.
> m2 <- matrix(z1, nrow=3, ncol=4, byrow=T); m2
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
Alternatively, a matrix can be constructed by binding vectors of the same length row-
or column-wise using the functions rbind or cbind:
> r1 <- 1:5; r2 <- 6:10; r12 <- rbind(r1,r2); r12
[,1] [,2] [,3] [,4] [,5]
r1 1 2 3 4 5 # cbind(r1,r2): the transpose of r12
r2 6 7 8 9 10
Useful functions for working with matrices include dim, nrow, ncol and t for providing
the dimension, number of rows, number of columns and the transpose of a matrix, respec-
tively. Additionally, the sums and means for each row or column of a matrix are obtained
applying functions rowSums and rowMeans or colSums and colMeans, respectively.
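A brief illustration on a small matrix (our own example):

> m <- matrix(1:6, nrow=2)     # a 2 x 3 matrix, filled by columns
> dim(m); t(m)
[1] 2 3
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
> rowSums(m); colMeans(m)
[1]  9 12
[1] 1.5 3.5 5.5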
Computational tasks on vectors and matrices can be simplified and sped up in R using the
provided matrix algebra operations. For instance, two matrices of the same dimension can be
multiplied component-wise using the * operator, or the sum of squares of the components of a
vector can easily be computed through matrix multiplication (%*%). Furthermore, it is handy for
calculations that standard functions and operators for numerical variables apply component-wise
to numerical vectors.
> a <-c(1,4,9) # computation of sum of squares
> sum_a2 <- t(a)%*%a ; sum_a2 # t: transpose of a matrix
[,1]
[1,] 98
> sqrt(a)
[1] 1 2 3
The command cbind, which combines vectors (or matrices) into a matrix, yields a data frame
when applied to objects of which at least one is a data frame; otherwise it yields a matrix.
> v4 <- 6:10
> df2 <- cbind(df,v4) # df2 is a data frame
> df2
v1 v2 v3 v4
1 1 1 FALSE 6
2 2 4 FALSE 7
3 3 9 FALSE 8
4 4 16 TRUE 9
5 5 25 TRUE 10
> v5 <- df$v1+v4
> v45 <- cbind(v4,v5) #column bind: v45 is a matrix
> v45
     v4 v5
[1,]  6  7
[2,]  7  9
[3,]  8 11
[4,]  9 13
[5,] 10 15
Note that the return statement is not compulsory for constructing a function. If it is omitted,
the value of the last evaluated expression in the function is returned. Furthermore, more than
one variable can be returned by replacing return with the function list. Functions are frequently
used in this Appendix (e.g., see Section A2.2).
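As a small illustration (with a hypothetical helper named range.stats), a function can return several values in a list, without an explicit return statement:

> range.stats <- function(x){
    list(minimum=min(x), maximum=max(x))   # value of the last expression is returned
  }
> range.stats(c(4, 1, 7, 2))
$minimum
[1] 1

$maximum
[1] 7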
A more elegant and practical way to program the above problem is through vectorization, which
is more compact and simultaneously allows the calculation of the weighted mean of an arbitrary
number of samples:
> wght.mean <- function(mean.vector, n.vector){
# mean.vector: the vector of sample means
# n.vector: a vector of the same length, containing the corresponding
# sample sizes
mean.all <- sum(mean.vector*n.vector)/sum(n.vector)
return(mean.all)
}
# Implementation example:
> mean.v <- c(12,15,21); n.v <- c(20,30,10)
> wght.mean(mean.v, n.v) # equivalently: wght.mean(c(12,15,21), c(20,30,10))
[1] 15
Simple data can be imported in R manually from the keyboard, using for example the c
or scan function. Thus, the values of a vector y can be read as follows:
> y <- scan()
1: 2 # start typing the data
2: 5 # one vector element is provided per line
3: 3
4: # enter a blank line to signal the end of data reading
Read 3 items
Alternatively, an R spreadsheet, activated by the edit function, can be used to type in the
data, as shown below:
> toy_example <- data.frame(salary=numeric(0), gender=character(0))
> toy_example <- edit(toy_example)
Most importantly, R provides functions for importing and exporting data, supporting
many data file formats (table-formatted data in plain-text files or data files from Excel,
SAS, SPSS, Stata and Systat).
For plain-text data files, the basic functions are read.table and write.table for importing
and exporting data, respectively. For these functions the default separator is 'white space'
(i.e. one or more spaces, tabs or newlines). Most commonly, data files have comma-separated
values (csv). Such formats are handled by the functions read.csv and write.csv, while read.csv2
and write.csv2 are for semicolon-separated data. The first argument of these functions is the
file name of the data to be read (or written), given by a full path or a path relative to the
current working directory. For the latter, the working directory can be retrieved or set by the
functions getwd and setwd, respectively. These functions further have a variety of arguments
that allow additional specifications and handling options for the data format. For example, the
logical argument header can be used to read/write data sets with (=TRUE) or without (=FALSE) a
header. By default, character string variables are converted to factor variables. You may change
this default setting using the arguments stringsAsFactors or as.is. For reading contingency
tables, there exist features that enable the reading of labels for the categories of the
classification variables from the source file, or the assignment of labels when they are not
provided in the source file. For further details you may consult the R documentation.
Consider for example the file 'drugs.dat', which is located in the local folder 'DS.Data'
and contains the data of a survey on the use of drugs at high schools, with values separated
by commas. This data set corresponds to a 2 × 2 × 2 contingency table, formed by
cross-classifying 2276 high school students according to whether they consume alcohol (a),
cigarettes (c) or marijuana (m). It can be read in a data.frame format as follows:
> setwd("C:/Users/.../DS.Data") # provide the full path of the folder
> drugs <- read.csv("drugs.dat", header=TRUE)
> drugs
  a   c   m count
1 yes yes yes 911
2 yes yes no 538
3 yes no yes 44
4 yes no no 456
5 no yes yes 3
6 no yes no 43
7 no no yes 2
8 no no no 279
> typeof(drugs$a) # character string variable a is converted to a factor
[1] "integer"
To read a data file from a website, we simply need to provide the full website path, as
illustrated for example in Section 1.4.1 of the book for the carbon dioxide emissions data
set, which was read by read.table, since data values are separated in file ‘Carbon.dat’ by
white spaces.
As is extensively commented in the 'R Data Import/Export' manual, these functions
are not to be used for reading large data files (they use a lot of memory and are slow). Thus,
when reading large data matrices (having many columns) it is preferable to use scan instead
of read.table.
An alternative option for data importing is the readr package, which is included in the
collection of R packages tidyverse. Its functions read_tsv, read_csv, and read_csv2 are
analogous to the basic R functions read.table, read.csv and read.csv2, respectively (the
argument header is now replaced by the argument col_names). It is claimed that for large
data sets they are typically much faster (up to 10 times). Further functions are available,
for example the general read_delim, which reads data values delimited by any separator.
These functions, in contrast to the basic ones, do not convert character vectors to factors.
They also provide additional handy options, for example the possibility to skip n lines of
meta data at the beginning of the file (argument skip=n) or to drop lines with comments,
signaled by the specific character they start with, e.g. # (argument comment="#"). The
data file in Section 1.4.1 could equivalently be read by:
> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.3.2 v purrr 0.3.4
v tibble 3.0.4 v dplyr 1.0.2
v tidyr 1.1.2 v stringr 1.4.0
v readr 1.4.0 v forcats 0.5.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
> Carbon <- read_tsv("http://stat4ds.rwth-aachen.de/data/Carbon.dat")
There are special packages available that facilitate data exchange between R and other
statistical software. For example, the foreign package can be used to import data from
(or export to) a variety of sources, including Minitab, SAS, SPSS, Stata and Systat. Excel
data can be read employing the xlsx package. The haven package offers functionality similar
to foreign, providing the functions read_sas, read_sav and read_dta for reading data files in
SAS, SPSS and Stata file formats, respectively, and can be faster than foreign. Finally, readxl
is a package analogous to xlsx for importing data from Excel, while data frames can be exported
to Excel files by the writexl package. The packages haven, readxl and writexl are among the
packages installed automatically with tidyverse.
For expressing conditions, the usual comparison operators employed are < (less), <= (less or
equal), > (greater), >= (greater or equal), == (equal) and != (not equal), together with the
Boolean operators | (OR), & (AND) and xor (element-wise exclusive OR).
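The remaining material of this section on control flows is not included here; as a minimal sketch (our own example), these operators can be combined with if/else to control the flow of execution:

> x <- c(2, 5, 8)
> (x > 4) & (x < 8)              # component-wise comparisons
[1] FALSE  TRUE FALSE
> if (sum(x) != 0) {
    y <- x/sum(x)                # executed only when the condition is TRUE
  } else {
    y <- x
  }
> y
[1] 0.1333333 0.3333333 0.5333333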
A0.6 Graphs in R
For data visualization and presentation of statistical analysis output, for example some
diagnostic plots, R provides powerful graphical tools.
The most common plotting function in R is the plot function, which may produce a
scatterplot, a time series plot, a bar plot or a box plot, depending on its arguments. Other
types of plots include Q-Q plots (qqnorm, qqplot), histograms (hist) and contour plots
(contour). For multivariate data, the pairs function is used for producing a matrix of
pairwise plots, while coplot(a~b|c), where c is a factor, produces a number of plots of a
against b, one for every level of c.
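A brief illustration with the built-in cars data set (speed and stopping distance); the plots themselves are not shown here:

> data(cars)
> plot(cars$speed, cars$dist,                # scatterplot of two numeric variables
       xlab="Speed (mph)", ylab="Stopping distance (ft)")
> hist(cars$dist)                            # histogram of a single variable
> pairs(cars)                                # matrix of pairwise scatterplots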
The ggplot2 package in tidyverse offers many options for elegant and very flexible
graphics that can be tailored to meet the user's expectations and advanced demands for
visualizing complex data sets. It is based on the grammar of graphics and the idea that the
construction of a graph is based on the same core components (a data set, a set of geoms and
a coordinate system). Thus, a graph is gradually built, starting with ggplot (or qplot) and
specifying (i) the data set, (ii) aesthetic mappings (by aes), and (iii) the type of visual
representation of the data (geom). Further layers with a geom_* or stat_* function
(e.g. geom_histogram) can then be gradually added (after the + sign) for setting further
specifications as well as controlling coordinate systems and faceting. A very helpful overview
of the logic, options and features of ggplot is provided in the RStudio Cheat Sheet.
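A minimal ggplot2 sketch of this layered logic, again for the cars data (our own example; the fitted-line layer is optional):

> library(ggplot2)
> ggplot(cars, aes(x=speed, y=dist)) +   # (i) data set and (ii) aesthetic mappings
    geom_point() +                       # (iii) geometric representation: points
    geom_smooth(method="lm")             # an additional layer: fitted regression line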
We shall introduce and explore some of the options of R graphical packages in this
appendix, motivated by specific examples.
CHAPTER 1: R FOR DESCRIPTIVE STATISTICS
summarize   provides user-specified descriptive statistics for one or more variables
group_by    groups cases (data in rows) having the same value of a specific variable
arrange     sorts the full data set according to the ordering of specified variables
slice       selects a subset of rows (cases) by position
filter      selects a subset of rows that fulfill a condition
select      selects a subset of columns (variables) based on a condition
mutate      computes and appends new variables (columns)
Important in applying dplyr is the "pipe" operator %>%, which passes the object on its
left-hand side as an argument to the function on its right-hand side, thus making the code
more convenient to read and write. In particular, x %>% f(y) is the same as f(x,y), and
y %>% f(x,.,z) is the same as f(x,y,z).
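A small illustration of the pipe (our own example):

> library(dplyr)
> x <- c(4, 1, 7, 2)
> x %>% sort() %>% head(2)     # equivalent to: head(sort(x), 2)
[1] 1 2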
All actions carried out in dplyr can also be obtained in standard R. The advantage is
that dplyr provides a more convenient way, using intuitive verbs that make the code more
user-friendly (but in some cases not necessarily shorter). Next, we provide some data
handling examples, illustrated on the well-known iris data frame.
A1.1.1 summarize
The summarize function provides summary statistics for a data frame, returning a single
value for each requested column (variable). Some characteristic examples follow. The functions
used (e.g. mean, sd) are interchangeable; any function is eligible as long as its output is a
single value.
> library(dplyr)
> data(iris)
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> iris %>% summarise(mean=mean(Sepal.Length)) # equivalent to:
mean # mean(iris$Sepal.Length)
1 5.843333
> iris%>%summarize(sample_size=n(), mean_SL=mean(Sepal.Length), mean_SW=
mean(Sepal.Width),mean_PL=mean(Petal.Length), mean_PW=mean(Petal.Width))
sample_size mean_SL mean_SW mean_PL mean_PW
1 150 5.843333 3.057333 3.758 1.199333
> iris%>%summarize(mean_SL=mean(Sepal.Length),sd_SL=sd(Sepal.Length))
mean_SL sd_SL
1 5.843333 0.8280661
A1.1.2 group_by
The group_by function is a convenient function that groups the sample units (i.e. the rows of a
tidy data set) according to their values on a variable (column) of the data set. Most data
operators applied on grouped data are then performed group-wise. This is very convenient
as a first step, since in most analyses we want to compare responses, profiles, etc. among
groups. Grouping is removed by ungroup.
Frequently it is applied in combination with summarize, to provide summary statistics
per group.
> iris_spec <- iris %>% group_by(Species)
> iris_spec %>% summarize(n=n(), mean_SL=mean(Sepal.Length), mean_SW=
mean(Sepal.Width), mean_PL=mean(Petal.Length), mean_PW=mean(Petal.Width))
Species n mean_SL mean_SW mean_PL mean_PW
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 setosa 50 5.01 3.43 1.46 0.246
2 versicolor 50 5.94 2.77 4.26 1.33
3 virginica 50 6.59 2.97 5.55 2.03
The %>% operator is very convenient for wrangling data, since it allows the successive use
of nested functions in one step. Thus, for deriving the within-group means above, we do not
need to create a new data frame (iris_spec). It is equivalent to:
> iris %>% group_by(Species) %>% summarize(n=n(), mean_SL=mean(Sepal.Length),
mean_SW=mean(Sepal.Width), mean_PL=mean(Petal.Length),
mean_PW=mean(Petal.Width))
A1.1.3 arrange
The arrange function rearranges the cases in a data frame in increasing order of a specified
variable, while decreasing order is also possible. In case of ties, cases can be ordered according
to other specified variables. There is also the option to sort within groups (specified by a
grouping variable). Do not simply run this function on a (large) data frame at the prompt,
since it will print the rearranged data set on screen; save the rearranged data frame instead.
> iris.ord <- iris%>%arrange(Sepal.Length) # Sepal.Length: increasing
> iris.ord[1:3,] # you could also try: head(iris.ord)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 4.4 2.9 1.4 0.2 setosa
3 4.4 3.0 1.3 0.2 setosa
# If ties, items are arranged by increasing order of Petal.Length, ... :
> iris.ord2 <- iris%>%arrange(Sepal.Length, Petal.Length, Sepal.Width)
> iris.ord2[1:3,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 4.4 3.0 1.3 0.2 setosa
3 4.4 3.2 1.3 0.2 setosa
> iris.ord3 <- iris%>%arrange(desc(Sepal.Length)) # Sepal.Length: decreasing
> iris1 <- iris_spec %>% arrange(desc(Sepal.Length), .by_group = TRUE)
> iris1[97:102,]
# A tibble: 6 x 5
# Groups: Species [2]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 2.5 3 1.1 versicolor
2 5 2 3.5 1 versicolor
3 5 2.3 3.3 1 versicolor
4 4.9 2.4 3.3 1 versicolor
5 7.9 3.8 6.4 2 virginica
6 7.7 3.8 6.7 2.2 virginica
# compare to:
> iris[iris$Sepal.Length >=6 & iris$Sepal.Width < 2.5, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
63 6.0 2.2 4.0 1.0 versicolor
69 6.2 2.2 4.5 1.5 versicolor
88 6.3 2.3 4.4 1.3 versicolor
120 6.0 2.2 5.0 1.5 virginica
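The subsection on filter is not included in this excerpt; the same rows can be selected with the dplyr filter verb, as in the following sketch:

> iris %>% filter(Sepal.Length >= 6 & Sepal.Width < 2.5)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          6.0         2.2          4.0         1.0 versicolor
2          6.2         2.2          4.5         1.5 versicolor
3          6.3         2.3          4.4         1.3 versicolor
4          6.0         2.2          5.0         1.5  virginica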
A1.1.5 select
This function returns a subset of the columns (variables) of a data frame, as demonstrated
below.
> iris2 <- select(iris, Sepal.Length, Sepal.Width) # equivalent to: iris[,1:2]
> slice(iris2, 2:3)
Sepal.Length Sepal.Width
1 4.9 3.0
2 4.7 3.2
> iris %>%filter(Sepal.Length==7)%>% select(Sepal.Width)
Sepal.Width # equivalent to: iris[iris$Sepal.Length==7,2]
1 3.2
A1.1.6 mutate
The mutate function creates new variables and adds them as columns to a data frame.
> iris%>%mutate(above.meanSL=Sepal.Length>mean(Sepal.Length))%>%slice(.,107:109)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species above.meanSL
1 4.9 2.5 4.5 1.7 virginica FALSE
2 7.3 2.9 6.3 1.8 virginica TRUE
3 6.7 2.5 5.8 1.8 virginica TRUE
> iris.SLW <- iris%>%mutate(SLW=Sepal.Length/Sepal.Width)
# equivalent to: cbind(iris,iris$Sepal.Length/iris$Sepal.Width)
> iris.SLW[1,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species SLW
1 5.1 3.5 1.4 0.2 setosa 1.457143
FIGURE A1.1: Histograms for the variable HDI in the UN data file with the number of classes
specified by the rule of Sturges (left) and of Scott (right).
Sturges, takes k = log2(n) + 1. Its derivation assumes a bell-shaped distribution, and outliers
can be problematic. The Scott rule takes h = 3.49 s/n^(1/3), for sample standard deviation s.
For large n, Sturges’ rule yields wider bins than Scott’s and may oversmooth the histogram.
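These bin-number rules are also available directly in base R through nclass.Sturges and nclass.scott; a small sketch with a hypothetical sample x of size n = 100:

> x <- rnorm(100)
> nclass.Sturges(x)    # number of classes by Sturges' rule
[1] 8
> nclass.scott(x)      # number of classes by Scott's rule (depends on s)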
We explore and illustrate the R facilities and options for histograms using the UN data file
from the book’s website, consisting of nine variables measured over 42 countries, described
in Exercise 1.24:
> UN <- read.table("http://stat4ds.rwth-aachen.de/data/UN.dat", header=T)
# use header=T (or header=TRUE) when file has variable names at top
> names(UN) # provides the names of the variables
[1] "Nation" "GDP" "HDI" "GII" "Fertility" "C02"
[7] "Homicide" "Prison" "Internet"
We illustrate histograms with the human development index (HDI), a summary measure
with components referring to life expectancy at birth, educational attainment, and income
per capita:
> hist(UN[,3], xlab="HDI", xlim=c(0.5,1), main=NULL) # breaks="Sturges"
> hist(UN[,3], breaks="Scott", xlab="HDI", ylim=c(0,20), main=NULL)
The derived histograms, with equal length classes and for the two considered choices for
the determination of the number of classes, are provided in Figure A1.1.
The histogram of age for prespecified classes of age, provided in Figure A1.2 (left), can
easily be produced using a function from the ggplot2 package.
> library(ggplot2)
> ggplot(GSS, aes(x=AGE)) + geom_histogram(breaks=
c(18,25,35,50,65,75,90), color="dodgerblue4", fill="lightsteelblue")
FIGURE A1.2: Histogram for the age of the GSS2018 participants for bins of unequal width
(left: incorrect, right: correct).
By default, the y-axis shows the counts of the classes. However, when the classes are of
unequal width (as here), this histogram is incorrect, since the area of each bin should be
proportional to the relative frequency of the corresponding class. In such cases, the correct
scale for the y-axis is the density, and the correct histogram is shown in Figure A1.2 (right).
> ggplot(GSS, aes(x=AGE)) + geom_histogram(breaks=c(18,25,35,50,65,75,90),
+ aes(y = ..density..), color="dodgerblue4", fill="lightsteelblue")
On the box plot, provided in Figure A1.3 (upper left), we added the sample mean and the
interval ‘sample mean +/- standard deviation’, as well. Replacing UN[,2] with UN[,3] to
UN[,9], the rest of the box plots in Figure A1.3 are produced.
The 10% trimmed mean, which excludes the 10% lowest and the 10% highest-valued observations,
is $\bar{y}_{0.8} = \frac{1}{8}\sum_{i=2}^{9} y_{(i)} = 12.75$, where $y_{(i)}$ denotes the
$i$-th ordered observation, i.e. $y_{(1)} \le y_{(2)} \le \dots \le y_{(10)}$. This calculation
can easily be implemented in R:
> y <- c(11,6,12,9,68,15,5,12,23,14)
> mean(y)      # mean of the elements of vector y
> median(y)    # median of the elements of vector y
> mean(y,0.1)  # trimmed mean (mean of the 80% 'central' observations)
1 The mean value of a data vector is influenced strongly by outliers, i.e. single observations that are
much lower or higher than the main body of the data. A more robust description of the central location of
the data is provided by their median (see Section 1.4.3). The p% trimmed mean is a corrected mean that
excludes the p% lowest and p% highest data points before computing the mean.
2 The data are based on a much larger sample taken by the U.S. Bureau of the Census and are analyzed
FIGURE A1.3: Box plots for the variable values of the 41 nations, with their corresponding
sample mean +/- standard deviation (in blue).
FIGURE A1.4: Bar plot and pie chart for the gender of the persons in the income data.
Categorical variables can be visualized by a bar plot or pie chart, as illustrated next with
the employees' gender for a certain sector of a company (Salaries data file³ from the book's
website). The graphs, produced in ggplot2, are given in Figure A1.4.
> library(ggplot2)
> library(scales)
> salaries <- read.table("http://stat4ds.rwth-aachen.de/data/Salaries.dat", header=T)
> salaries$gender <- factor(salaries$gender, levels=c(1,2),
labels=c("male", "female"))
# Bar plot of counts: (not shown)
> ggplot(data=salaries, aes(x=gender)) +
geom_bar(width=0.5, fill="steelblue")
# Bar plot of proportions: Figure A1.4 (left)
> ggplot(data=salaries, aes(gender)) +
geom_bar(aes(y=..prop.., group = 1),width=0.5, fill="steelblue") +
scale_y_continuous(labels=percent_format())
Notice that for calculating the correct descriptive statistics for categorical data in R, it is
important to treat categorical variables as factors. Hence, since gender in the Salaries data
file is a numeric variable, the corresponding frequency distribution is derived by:
> summary(as.factor(income$gender)) # compare to summary(income$gender)
1 2
14 12
3 Consists of three variables, providing the monthly salaries (in Euro), the seniority (in years) and the
gender of the employees.
In base R, group specific statistics are possible through the tapply function (as illustrated
in Section 1.4.5 with the Murder2 data file) and the aggregate function. Thus the above
results can equivalently be derived as follows:
> aggregate(salary ~ gender, data=salaries, mean)
> aggregate(salary ~ gender, data=salaries, sd)
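A minimal tapply sketch for the same group-wise statistics (output not shown):

> tapply(salaries$salary, salaries$gender, mean)   # mean salary per gender
> tapply(salaries$salary, salaries$gender, sd)     # standard deviation per gender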
The purrr library facilitates a straightforward way of obtaining descriptive statistics for
all variables in a data set and all levels of a categorical covariate:
> library(purrr)
> salaries %>% split(.$gender) %>% map(summary)
$`1`
salary years gender
Min. :1965 Min. : 8.00 Min. :1
1st Qu.:2720 1st Qu.:13.75 1st Qu.:1
Median :3172 Median :18.50 Median :1
Mean :3231 Mean :19.21 Mean :1
3rd Qu.:3902 3rd Qu.:25.00 3rd Qu.:1
Max. :4387 Max. :32.00 Max. :1
$`2`
salary years gender
Min. :2280 Min. :11.00 Min. :2
1st Qu.:3044 1st Qu.:19.00 1st Qu.:2
Median :3424 Median :22.50 Median :2
Mean :3394 Mean :22.00 Mean :2
3rd Qu.:3719 3rd Qu.:24.75 3rd Qu.:2
Max. :4438 Max. :33.00 Max. :2
FIGURE A1.5: Stacked histogram by gender (left) and side-by-side box plot (right) for the
salaries.
Side-by-side box plots can be constructed via the basic function boxplot as well, as
shown in Section 1.4.5 (Figure 1.6).
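A minimal sketch of such a basic-R side-by-side box plot for the salaries data (our own formatting choices):

> boxplot(salary ~ gender, data=salaries,
          ylab="Monthly Salary (euro)", col=c("lightsteelblue","lightpink"))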
FIGURE A1.6: Stacked histograms for the variable Internet with classes of length 20 (left)
and 10 (right).
Analogously, for the UN data file, we can derive a histogram for the values of the Internet
variable, stacked by different levels of GDP. For this, we categorize GDP into an ordinal
variable GDP.cat with 4 levels, corresponding to the intervals defined by the quartiles of GDP,
and produce the corresponding stacked histogram shown in Figure A1.6 (left) as follows:
> GDP.cat<-as.integer(cut(UN[,2], c(0,quantile(UN[,2]))))
> GDP.cat <- as.factor(ifelse(GDP.cat==1, 2, GDP.cat)-1)
> library(ggplot2)
> library(RColorBrewer)
> barlines <- "#1F3552" # the color for the barlines
> ggplot(UN, aes(x = Internet, fill = GDP.cat)) +
geom_histogram(aes(y = ..count..), binwidth = 20,
colour = barlines, position="stack") +
theme_bw() +
scale_fill_brewer(palette="Spectral")
The histogram for the same variables but for shorter classes (binwidth = 10) is shown in
Figure A1.6 (right).
For the GSS2018 data file, the age can easily be categorized (based on the intervals used
for the histogram), as follows:
> age.cat <- table(cut(age, c(18,25,35,50,65,75,90), include.lowest=T, right=T))
The age histogram, stacked by the survey participants' vote in the 2016 presidential
elections, is provided in Figure A1.7 (left) and produced as follows (the variable PRES16
first needs to be converted to a factor):
> pres16 <- factor(GSS$PRES16, levels=c(1,2,3,4), labels =
c("Clinton", "Trump", "Other", "Not vote"))
> ggplot(GSS, aes(x=age, fill=pres16)) +
geom_histogram(breaks=c(18,25,35,50,65,75,90),
aes(y = ..count../(n*width)), color="black") +
scale_fill_manual(values=c("royalblue2","red","green4","white"))
Note that this histogram (ignoring the pres16 factor) is the same as that of Figure
A1.2 (right). A high percentage of participants did not respond to the presidential elections
question. The category of missing values (NA) is treated as a level of the pres16 factor
and is thus included in the histogram constructed above. The histogram for the ages of the
respondents who did answer this question (i.e. excluding NA) is provided in Figure A1.7
(right) and is derived as given below.
FIGURE A1.7: Histogram for the age of the GSS2018 participants, stacked by their vote in
the 2016 presidential elections, with (left) and without (right) the NA category.
Beyond the basic descriptive statistics for all values in the data set, we also get information
about the number of non-available values (NA) per variable. NAs may cause problems in the
analysis; for example, the function mean returns NA. A way out is to apply the function with
the NA values removed (na.rm=T):
> mean(GSS$AGE, na.rm=T)
[1] 48.97138
One could restrict the analysis to the subsample of complete cases only, but this subsample
can be very small and lead to loss of valuable information. In our case the complete cases
are just 224 out of 2348:
> GSS2018full <- na.omit(GSS) # subsample of fully complete cases
> ind.df <- complete.cases(GSS) # indicator (logical: F for NA)
> sum(ind.df) # number of complete cases
[1] 224
For this reason, it is better to remove cases with NA values selectively (only for the variables
analyzed). For example, we can verify that there are only 7 missing age values in the sample.
If we want to use only the cases with known age in the subsequent analysis, we can restrict
our sample as shown below.
> sum(is.na(GSS$AGE)) # number of NA cases for variable age
# [1] 7
# indicator for NA-values in AGE (1st variable):
> ind.age <- complete.cases(GSS[,1])
# subsample without missing age values:
> GSS2018 <- GSS[ind.age,]
> mean(GSS2018$AGE) # equivalent to: mean(GSS$AGE, na.rm=T)
In large survey data sets, it is important to have an overview of the extent and distribution
of the missing values. A graphical visualization is provided in the visdat package by the
vis_miss function, which applies to data frames and provides a heatmap of missingness,
colouring cells according to whether they are missing or not. There is the additional option
to cluster the NA values and the columns (variables) by the rows (sample units), to identify
missingness patterns more easily, if present. For the GSS data frame, this missingness heatmap,
given in Figure A1.8, is produced by:
> GSS2 <- data.frame(GSS$AGE, GSS$INCOME, GSS$RINCOME, GSS$PARTYID, GSS$GUNLAW)
> library(visdat)
> vis_miss(GSS2)
> vis_dat(GSS2) # produces a variant of vis_miss(GSS2) with columns colored by
# the type of the corresponding variable (not shown)
> vis_miss(GSS2, cluster=TRUE)
The heatmap shows that AGE and PARTYID have little missing data, whereas the other
variables (e.g. RINCOME, the respondent's income) have many missing values. Overall, 19.1% of
the observations are missing for these five variables.
Identifying missing data is important in the data wrangling phase of a data analysis.
Missingness can be due to different causes and has to be treated accordingly. Advanced
methods of statistics, such as data imputation, analyze the data by estimating the missing
observations.
FIGURE A1.8: Missing data heatmaps for the GSS2018 data file, without and with clustering
according to missingness.
regression line. The data points are marked differently per gender, indicating that for males
and females of the same job seniority, the men are slightly better paid. The regression lines
fitted on the subsamples of males and females separately are also illustrated in Figure A1.9
(right). The R code for producing these scatterplots is given below.
> Gender <- factor(salaries$gender, labels=c("M","F"))
# scatterplot (with regression lines):
> plot(salaries$salary~salaries$years, pch=as.character(Gender),
       ylab="Monthly Salary (euro)", xlab="Job Seniority (years)")
FIGURE A1.9: Scatterplots of salaries by gender with the fitted linear regression line.
The regression lines fitted on the subsamples of males and females are shown on the right
scatterplot in blue and red, respectively.
The correlation matrix can be visualized by a heatmap, where the cells of the correlation
matrix are colored according to the strength and direction of the corresponding pairwise
correlations. Heatmaps can be produced by the function heatmap. Alternatively, a heatmap can
be constructed in ggplot2, as shown next. In order to apply the ggplot function, the
correlation matrix has to be reshaped (melted) into long format. This is achieved by the
function melt of the reshape2 package, which is installed with tidyverse.
> library(reshape2) # needed for melt()
> cm_melt <- melt(cm, na.rm = TRUE)
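The correlation matrix cm and the initial heatmap object ggheatmap used below are constructed in parts of the appendix not included here; a minimal sketch based on geom_tile (with our own color and scale choices, assuming cm is the Pearson correlation matrix of the numeric UN variables) could be:

> library(ggplot2)
> ggheatmap <- ggplot(cm_melt, aes(Var2, Var1, fill = value)) +
    geom_tile(color = "white") +
    scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0,
                         limits = c(-1, 1), name = "Pearson\ncorrelation")
> ggheatmap    # prints the basic heatmap (not shown here)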
A nice and very convenient feature of the ggplot2 package is that it allows a graph to be
built gradually. Thus, once the heatmap is printed (not shown here), we can update the graph
and add the values of the pairwise correlations in its cells by simply adding the corresponding
layer. The output of this command is shown in Figure A1.10.
> ggheatmap +
    geom_text(aes(Var2, Var1, label = value), color = "black", size = 2.5)
FIGURE A1.10: Heatmap of the Pearson correlation matrix for the UN data.
For alternative variants of heatmaps for the correlation matrix with ggplot2, also in
triangular form, we refer to the corresponding entry of STHDA (Statistical Tools for
High-throughput Data Analysis).
Furthermore, a visualization of all possible pairwise linear relationships among the vari-
ables is provided by all paired scatterplots. In the scatterplots given in Figure A1.11, nations
can be identified with respect to the level of their GDP, i.e. the level of GDP.cat. This set
of scatterplots is derived as follows.
> pairs(UN[2:9], main = "UN Data", pch = 21, bg = c("red",
+ "orange", "darkseagreen3","steelblue")[unclass(GDP.cat)])
FIGURE A1.11: Pairwise scatterplots for the variable values of the UN data, where the nations
are identifiable with respect to the quartiles of GDP, i.e. nations with GDP ≤ 13.18, in
(13.18, 27.45], (27.45, 40.33] or > 40.33 are marked in red, orange, green or blue, respectively.
We consider the prayer study discussed in Sections 4.5.5 and 5.4.2. The associated 2×2 contingency table,
provided in Table 4.3, can be constructed as follows:
> heart.surg <- matrix(c(315, 304,289,293), nrow=2, ncol=2)
> dimnames(heart.surg) <- list(Prayer=c("Yes","No"),
Complications=c("Yes","No"))
The prop.table function applies to matrices, and thus the corresponding tables of sample
proportions are produced as follows.
> hs.p <- prop.table(heart.surg) # equivalent to: heart.surg/sum(heart.surg)
> addmargins(hs.p)
Complications # proportions of the table sum to 1
Prayer Yes No Sum
Yes 0.2622814 0.2406328 0.5029142
No 0.2531224 0.2439634 0.4970858
Sum 0.5154038 0.4845962 1.0000000
For our GSS2018 data file, the relationship between the vote in the 2016 presidential elections
and the opinion towards the gun law can be visualized by stacked bar plots. The bar plots of
Figure A1.12 show the distribution of law support within the voters of each candidate and the
distribution of the voting behavior within the supporters and opponents of the law. These bar
plots are derived by the barplot function of basic R, as shown below.
> election16 <- prop.table(xtabs(~ gender + pres16, data = GSS))
> GunLaw <- factor(GSS$GUNLAW, levels=c(1,2), labels =
c("favor","oppose"))
> gun_elect16 <- prop.table(xtabs(~ pres16+ GunLaw, data = GSS))
Note that the bar plot in Figure A1.12 (right) is based on t(gun_elect16), the transpose of
the data matrix gun_elect16.
The most popular graphical display for contingency tables is the mosaic plot. For a
two–way contingency table, a mosaic plot displays the cells of the table as rectangular areas
of size proportional to the corresponding frequencies. Additionally, the alignment of these
areas is indicative of the underlying association between the classification variables: worse
alignment signals stronger association. Mosaic plots can be constructed by the mosaicplot
function of the graphics package or in the vcd package by mosaic.
The mosaic plot for the contingency table cross-classifying GSS2018 participants’ gender
and their vote in the presidential elections 2016, constructed as shown below, is provided
in Figure A1.13.
> library(vcd)
> vote16 <- factor(GSS$PRES16, levels=c(1,2,3,4),
labels = c("Clinton", "Trump", "Oth", "Nv"))
> ct <- table(gender,vote16)
> mosaic(ct)
FIGURE A1.12: Stacked bar plots of the GSS2018 participants' vote in the 2016 presidential
elections vs. their opinion with respect to a gun law.
FIGURE A1.13: Mosaic plot for the two-way table cross-classifying GSS2018 participants'
gender and their vote in the 2016 presidential elections.
, , alcohol = no
cigarette
marijuana yes no
yes 3 2
no 43 279
CHAPTER 2: R FOR PROBABILITY DISTRIBUTIONS
For each of these distributions, R provides four functions for computing values of its
cumulative distribution function (cdf), its inverse cdf (used for confidence interval
constructions and for hypothesis testing), its pdf or pmf, as well as for generating random
numbers from the distribution. In particular, each distribution has a 'base name' (for example
norm for the normal) and the names of the four basic functions mentioned above are derived by
prepending the corresponding prefix letter, as given below.
● p for ‘probability’: cdf
● q for ‘quantile’: inverse cdf
● d for ‘density’: pdf or pmf
● r for ‘random’: random sample generation
For example, for the normal distribution these are pnorm, qnorm, dnorm, and rnorm. In
Sections 2.5.2 and 2.5.3 we found cumulative probabilities and quantiles for normal distri-
butions using pnorm and qnorm. We used rnorm for simulations in Section 1.5.3.
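For illustration (our own numbers):

> pnorm(1.96)                  # cdf: P(Z <= 1.96) for Z ~ N(0,1)
[1] 0.9750021
> qnorm(0.975)                 # inverse cdf: 0.975 quantile of N(0,1)
[1] 1.959964
> dnorm(0)                     # pdf of N(0,1) evaluated at 0
[1] 0.3989423
> rnorm(3, mean=100, sd=16)    # three random numbers from N(100, 16^2); output not shown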
The multinomial distribution (see equation (2.14)) is also treated directly in R, with base
name multinom. However, for the multinomial distribution, which is multivariate, only
the dmultinom and rmultinom functions are available.
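A brief illustration (with our own probabilities):

> dmultinom(c(2,1,0), size=3, prob=c(0.5,0.3,0.2))   # pmf: P(N1=2, N2=1, N3=0)
[1] 0.225
> rmultinom(2, size=10, prob=c(0.5,0.3,0.2))         # two random count vectors (not shown)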
Some probability distributions have alternative parameterizations. For example, the
standard pdf parameterization in R of a gamma distribution is
$$f(y; \theta, k) = \frac{1}{\theta^k \Gamma(k)}\, e^{-y/\theta} y^{k-1}, \quad y \ge 0; \qquad f(y; \theta, k) = 0, \ y < 0,$$
for shape parameter $k$ and scale parameter $\theta$. It has $\mu = k\theta$ and $\sigma = \sqrt{k}\,\theta$. The scale
parameter relates to the rate parameter λ in equation (2.10) and the exponential special case
(2.12) by θ = 1/λ. The R functions for the gamma distribution provide the ability to choose
parameterization in terms of scale or rate. However only one of these parameters should
be used. For instance, rgamma(1000, shape=10, scale=2) is equivalent to rgamma(1000,
shape=10, rate=1/2).
Thus, when applying the R functions for probability distributions, attention has to be
paid to the parameterization used by the corresponding function.
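For example, the two calls below refer to the same gamma distribution and return the same pdf value (a small check of the parameterization; output not shown):

> dgamma(3, shape=10, scale=2)     # scale parameterization
> dgamma(3, shape=10, rate=1/2)    # rate parameterization: identical value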
The plots of the pdf and cdf of the uniform distribution on the interval [0,1] (see Figures 2.6
and 2.7, respectively) have been produced in R by the code given below:
# pdf of U(0,1) with highlighted P(a<Y<b):
> x<- seq(0,1,length=2)
> plot(x,dunif(x), type="l", xlim=c(-0.4,1.4), col="blue", lwd=2, xaxt="n",
ylim=c(0, 1), xlab="y", ylab="f(y)")
> x <- seq(0.2,0.6, length=2) # the borders of the interval [a, b]
> y <- dunif(x)
> polygon(c(0.2,x,0.6), c(0,y,0), col="gray")
> axis(1, at=c(0,0.2,0.6,1), labels=c(0,"a", "b", 1))
> text(0.4,0.5,paste("P(a<Y<b)"),cex=1,col="white")
> segments(-0.4, 0, 0, 0, col="blue", lwd=2)
> segments(1, 0, 1.4, 0, col="blue", lwd=2,lend=0.1)
# cdf of U(0,1) :
> x<- seq(0,1,length=2)
> plot(x,punif(x), type="l", xlim=c(-0.4,1.4), col="blue", lwd=2, xaxt="n",
ylim=c(0, 1), xlab="y", ylab="f(y)")
> axis(1, at = seq(-0.4, 1.4, by = 0.2), las=2)
> segments(-0.4, 0, 0, 0, col="blue", lwd=2,lend=0.1)
> segments(1, 1, 1.4, 1, col="blue", lwd=2,lend=0.1)
> segments(-0.45,0.8,0.8,0.8,lty=2)
> segments(0.8,-0.04,0.8,0.8,lty=2)
> text(-0.2,0.85, paste("P(Y<0.8)=0.8"), cex=1)
The plot in Figure 2.8 is given over the support (i.e. for y ≥ 0). Alternatively, the pdf
$$f(y) = \begin{cases} e^{-y}, & y \ge 0, \\ 0, & y < 0, \end{cases}$$
can be plotted over the whole real line. The associated R code follows. Replacing the dexp
function by pexp in the code, the graph of the cdf is derived. Both plots are given in
Figure A2.1.
> x<- seq(0,14,length=300)
> plot(x,dexp(x,1), xlim=c(-2,14), xaxt="n", type="l", col="blue", lwd=2,
xlab="y", ylab="f(y)")
> axis(1, at = seq(-2, 14, by = 2))
# x-axis values are listed vertically if axis() is replaced as follows:
# axis(1, at = seq(-2, 14, by = 2),las=2)
> segments(-2, 0, 0, 0, col="blue", lwd=2)
FIGURE A2.1: The pdf (left) and cdf (right) of an exponential distribution with parameter
λ = 1.
The R–code for producing the graph of gamma pdfs of different shape parameter values
but all having expected value equal to 10, provided in Figure 2.12 (right), follows below.
Replacing function dgamma by pgamma, the corresponding graph of the cdfs is derived.
> y = seq(0, 40, 0.0001)
> plot(y, dgamma(y, shape=10, scale=1), ylab="f(y)", type ="l", col="blue")
> lines(y, dgamma(y, shape=2, scale=5), col="green") # scale = 1/lambda
> lines(y, dgamma(y, shape=1, scale=10), col="red")
> legend(20,0.12, c("k=10","k=2","k=1"), lty=c(1,1,1), col=c("blue","green","red"))
FIGURE A2.2: The probability that a normal distribution with µ = 3 and σ² = 9 takes
values in the interval [2, 5] (left), and the 90% quantile of a normal distribution with µ = 5
and σ² = 9 (right).
Please notice the use of the expression() function for adding mathematical expressions
on an R plot. Here, we applied expression() for using Greek letters in the y-axis label.
The function above can be adjusted for other distributions or for calculating probabilities
of lower or upper tails. Caution is required when calculating probabilities of discrete random
variables: although for a continuous random variable Y it holds that P(Y ∈ (a, b]) = P(Y ∈ [a, b]),
this is not true when Y is discrete, since then in general P(Y = a) ≠ 0.
For a continuous random variable, Section 2.5.6 defined the pth quantile (100p percentile)
as the point q at which the cdf satisfies F (q) = p. For a discrete random variable, the cdf is
a step function, and the pth quantile is defined as the minimum q such that F (q) ≥ p. For
instance, for the binomial distribution in Table 2.3 with n = 12 and π = 0.50, the cdf has
F (5) = 0.3872 and F (6) = 0.6128, so the pth quantile is 6 for any 0.3872 < p ≤ 0.6128, such
as p = 0.40 and 0.60 as shown in the following R code:
> cbind(5:6, pbinom(5:6, 12, 0.50))
[,1] [,2]
[1,] 5 0.3872
[2,] 6 0.6128
> qbinom(0.40, 12, 0.50); qbinom(0.60, 12, 0.50)
[1] 6
[1] 6
Figure A2.3 illustrates. It shows, for instance, that 0.60 on the vertical cdf probability scale
maps to the 0.60 quantile of 6 on the horizontal scale of binomial random variable values.
FIGURE A2.3: The cdf of a binomial distribution with n = 12 and π = 0.50, with the 0.40
and 0.60 quantiles, both equal to 6.
The code that follows produces a plot placing the pth quantile of a normal distribution
on the horizontal axis (y) and shading in gray the probability p, as illustrated by the
resulting graph in Figure A2.2 (right).
> quantNormal <- function(a,mu,sigma){
    quant <- qnorm(a,mu,sigma)
    low <- min(mu-4*sigma,a)
    up <- max(mu+4*sigma,a)
    curve(dnorm(x,mu,sigma), xlim=c(low,up), main=" ", xlab="y",
          ylab=expression(phi(y)), col="blue", lwd=2)
    x=seq(low,quant,length=200)
    y=dnorm(x,mu,sigma)
    polygon(c(low,x,quant),c(0,y,0),col="gray")
  }
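For instance, the shaded plot of Figure A2.2 (right) would correspond to a call such as:

> quantNormal(0.90, 5, 3)    # 0.90 quantile of a N(5, 3^2) distribution (plot not shown)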
Some inferential statistical methods assume that the data come from a particular distri-
bution, often the normal. The Q-Q plot (quantile-quantile plot) is a graphical comparison of
the observed sample data distribution with a theoretical distribution. As explained in Exer-
cise 2.67, it plots the order statistics $y_{(1)} \le y_{(2)} \le \dots \le y_{(n)}$ of the data against the ordered
quantiles $q_{1/(n+1)} \le q_{2/(n+1)} \le \dots \le q_{n/(n+1)}$ of the reference distribution. If $\{q_{i/(n+1)}\}$ and $\{y_i\}$ come from
the same distribution, the points $\{(q_{1/(n+1)}, y_{(1)}), (q_{2/(n+1)}, y_{(2)}), \dots, (q_{n/(n+1)}, y_{(n)})\}$ should approximately
follow a straight line, more closely so when $n$ is large. With the standard normal
distribution for $\{q_{i/(n+1)}\}$ and a normal distribution assumed for $\{y_i\}$, the Q-Q plot is called
a normal quantile plot. With a standard normal distribution assumed for {yi }, the points
should approximately follow the straight line y = x having intercept 0 and slope 1, which R
plots with the command abline(0,1). When the points deviate greatly from a straight line,
this gives a visual indication of how the sample data distribution differs from the reference
distribution.
We illustrate by generating random samples from a standard normal distribution, a t
distribution (introduced in Section 4.4.1, symmetric around 0 like the standard normal but
with thicker tails), an exponential distribution (2.12) with λ = 1, and a uniform distribution
over (0, 1). The qqnorm function creates normal quantile plots:
> Y1 <- rnorm(1000); Y2 <- rt(1000, df=3) # generating random samples
> Y3 <- rexp(1000); Y4 <- runif(1000) # from four distributions
> par(mfrow=c(2, 2)) # plots 4 graphs in a 2x2 matrix format
> qqnorm(Y1, col='blue', main='Y1 ~ N(0,1)'); abline(0,1)
> qqnorm(Y2, col='blue', main='Y2 ~ t(3)'); abline(0,1)
> qqnorm(Y3, col='blue', main='Y3 ~ exp(1)')
> qqnorm(Y4, col='blue', main='Y4 ~ uniform(0,1)')
Figure A2.4 shows the normal quantile plots. The first plot shows excellent agreement,
as expected, between the normal sample and the normal quantiles, with the points falling
close to the straight line y = x. The plot for the sample from the t distribution indicates that
more observations occur well out in the tails (i.e., larger ∣t∣ values) than expected with a
standard normal distribution. The plot for the uniform distribution indicates the opposite,
fewer observations in the tails than expected with the normal distribution. The plot for the
sample from the exponential distribution reflects the right skew of that distribution, with
some quite large observations but no very small observations, reflecting its lower boundary
of 0 for possible values.
To illustrate the normal quantile plot for actual data, we construct it for the ages in the
GSS data frame created in Section A1.2.1:
> ind.age <- complete.cases(GSS[,2]) # subsample without missing ages (variable 2)
> GSSsub <- GSS[ind.age,] # new data frame without missing ages
> qqnorm(GSSsub$AGE, col="dodgerblue4"); qqline(GSSsub$AGE)
Figure A2.5 (left) shows the plot. The qqline function adds the straight line corresponding
to the trend the points would follow if the sample distribution were normal. It suggests that the distribution
has fewer observations in the tails than expected with the normal, reflecting subjects under
18 not being sampled in the GSS and very old subjects being in a smaller cohort and also
dropping out because of deaths. (A histogram also shows evidence of non-normality.) The
ggplot2 package provides options for controlling the format of the plot according to groups
defined by levels of a factor. The following shows the code for the Q-Q plots for the ages of
females and of males, which is shown in Figure A2.5 (right).
> library(ggplot2)
> GSS$SEX <- factor(GSS$SEX, labels = c("male","female"))
> p <- qplot(sample=AGE, data=GSS, color=SEX, shape=SEX); # no output
# one qq-plot for males and females:
> p + scale_color_manual(values=c("blue", "red")) + scale_shape_manual(values=c(2,20)) +
labs(x="quantiles of N(0,1)", y = "Age")
# separate qq-plot for males and females (facet_wrap):
> p + scale_color_manual(values=c("blue", "red")) + scale_shape_manual(values= c(2,20)) +
labs(x="quantiles of N(0,1)", y="Age") + facet_wrap(~ SEX)
The qqPlot function in the EnvStats library can construct Q-Q plots for reference
distributions other than the normal.
FIGURE A2.4: Normal quantile plots, plotting quantiles of the standard normal distribution
against quantiles of random samples from a N (0, 1) distribution, a t distribution with df =
3, an exponential distribution with λ = 1, and a uniform distribution over [0, 1].
FIGURE A2.5: Normal quantile (Q-Q) plots for the ages of respondents in the GSS2018
data file, overall and grouped by gender.
The joint probability table above is derived manually, dividing all cell entries of the
frequency table by the total sample size n. Alternatively, it is convenient to use the func-
tion prop.table, since it allows also to derive the conditional probabilities within rows or
columns. Thus we could produce joint.prob through the following command.
> joint.prob <- prop.table(fairsociety) # derives proportion table
# from a frequency table
> cond.prob1 <- prop.table(fairsociety, 1) # cond. prop. within rows
> cond.prob2 <- prop.table(fairsociety, 2) # cond. prop. within columns
smallgap
gender strongly agree agree neutral disagree strongly disagree
Male 0.448 0.473 0.437 0.535 0.594
Female 0.552 0.527 0.563 0.465 0.406
FIGURE A2.6: Barplots for the observed proportions (left) and conditional proportions
(right) of the attitude towards a fair society, stacked by gender.
Is the respondents' opinion on a fair society (Y) independent of gender (X)? If so, then (see
Section 2.6.6) it should hold that
P(X = x ∣ Y = y) = P(X = x), x = 1, 2, y = 1, . . . , 5.
We know from the GSS set-up that P(X = 1) = P(X = 2) = 0.5. We can observe that the
observed conditional gender proportions within each level of response on the small social
gap attitude are close to 50% (see the R results above, or the barplot in Figure A2.6, right).
Here, we illustrated a bivariate joint probability. In practical applications, joint probabilities
of higher dimensions (and high-dimensional contingency tables) occur quite often.
For a detailed discussion of contingency table analysis we refer to Agresti (2019) and Kateri
(2014).
Alternatively, the function dnorm2d from the package fMultivar can be used in combination
with persp of the graphics package, as follows, which produces the plot given in Figure A2.7.
> library(fMultivar); library(graphics)
> x <- (-40:40)/10; X <- grid2d(x)
> z <- dnorm2d(X$x, X$y, rho = -0.8)
> X <- list(x = x, y = x, z = matrix(z, ncol = length(x)))
# Perspective Density Plot:
> persp(X, theta = -30, phi = 25, expand = 0.5, col = "lightblue",
ltheta = 120, shade = 0.15, ticktype = "detailed",
xlab = "x", ylab = "y", zlab =" ")
FIGURE A2.7: Perspective plot of the bivariate normal density with correlation ρ = −0.8.
In Section 1.3.1, the function sample is applied to generate a random sample of 5 integers
from a discrete uniform distribution on {1, 2, . . . , 60}. This sampling is without replacement,
i.e. the same value cannot be considered more than once. Sampling with replacement is
carried out as shown next and is the basis of bootstrapping, a very important method in
computational statistics (Section 4.6 and A4.5).
> sample(1:60, 5, replace = TRUE)
[1] 56 27 54 24 27 # value 27 is observed twice
Randomization is also used in data jittering, which is the action of adding noise to
the data (see Section 7.2.4). Most commonly jittering is applied on scatterplots to reduce
overplotting, especially when the data are rounded. It further helps to visualize the data and
relationships among variables. However, caution is needed with jittering, since there is no
obvious way to generate the added random noise, and jittering can weaken the visual
impact of an underlying relationship between two variables. A scatterplot with jittering is
given in Figure A3.1, derived as shown below. The jitter function of base R provides this
option. You can try:
option. You can try:
> jitter(5)       # adds a small amount of uniform noise to 5
[1] 4.945889
> jitter(5,2)     # factor = 2 doubles the default noise amount
[1] 5.08814
x <- sample(1:10, 50, TRUE)
y <- 2 * x + rnorm(50)
plot(y~x, pch=1,col="blue") # presented in the figure
plot(y~jitter(x,2), pch=1,col="blue")
FIGURE A3.1: Scatterplot without (left) and with (right) jittering on the x-values.
For example, for generating randomly from a Bernoulli random variable with success
probability π, we generate a uniform X and set Y = 0 if X ≤ 1 − π (= π0) or Y = 1 if
X > 1 − π. Consequently, to simulate from a binomial random variable Y with parameters n and π, we
simulate n iid Bernoulli(π) random variables Y_1, . . . , Y_n and then set Y = ∑_{i=1}^{n} Y_i. The
corresponding R-function is given below, along with an R function for simulating from a
multinomial random variable.
# random number generation from Bernoulli(p):
> randbern <- function(s, p) {   # p: success probability,
                                 # s: number of values to be simulated
    randnumber1 <- function(){
      x <- runif(1)
      Y <- as.numeric(x > 1-p); Y }  # "x > 1-p" produces a logical value
    replicate(s, randnumber1()) }
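Building on randbern, a minimal sketch (ours, not the authors' original code) of the binomial step described above, simulating s binomial(n, π) values as sums of n Bernoulli draws, is:
# Sketch (ours): simulate s values from a binomial(n, p) distribution
# by summing n Bernoulli(p) values generated with randbern above.
> randbinom <- function(s, n, p) {
    replicate(s, sum(randbern(n, p))) }
> randbinom(5, 10, 0.3)   # e.g. five draws from binomial(10, 0.3)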
More than one multinomial random sample can be simulated using the replicate command,
as done above for the binomial distribution.
The discrete inverse transformation can be applied also for simulating from a Poisson
distribution. An alternative method is based on the connection between the Poisson and
the exponential distributions (see Section 2.5.5). Let Y be a random variable counting the
number of occurrences of an event over time that has a Poisson distribution with expected
number of occurrences in a time interval of length 1 equal to λ. Then the random times
T_1, T_2, . . . between successive events are independent and exponentially distributed with
parameter λ. Based on this, we simulate times between events, t_1, t_2, . . ., from independent
exponential distributions with mean 1 and set Y = i*, where i* is the smallest index such
that ∑_{i=1}^{i*+1} t_i > λ.
# random number generation from Poisson distribution (lambda)
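The Poisson-generation code that this comment introduces is not reproduced here; a minimal sketch of the exponential-interarrival method just described (our own, with the hypothetical function name randpois) is:
# Sketch (ours): generate s Poisson(lambda) values by counting how many
# exponential (mean 1) interarrival times fit before their sum exceeds lambda.
> randpois <- function(s, lambda) {   # s: number of values to be simulated
    one <- function() {
      total <- 0; count <- 0
      repeat {
        total <- total + rexp(1, 1)   # interarrival time with mean 1
        if (total > lambda) break
        count <- count + 1 }
      count }
    replicate(s, one()) }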
We generate 1,000,000 Poisson random variables with mean denoted by mu. We then organize
these in a matrix with 10 columns and 100,000 rows. Each row of the matrix contains a
simulated random sample of size 10 from the Poisson distribution. The apply function then
finds the mean within each row. (In the second argument of the apply function, 1 indicates
rows, 2 indicates columns, and c(1, 2) indicates rows and columns.) At this stage, the
vector Ymean is a vector of 100,000 means. The remaining code creates plots, showing the
sample data distribution for the first sample and the empirical sampling distribution of the
100,000 simulated values of ȳ.
> pois_CLT <- function(n, mu, B) {
# n: vector of 2 sample sizes [e.g. n <- c(10, 100)]
# mu: mean parameter of Poisson distribution
# B: number of simulated random samples from the Poisson
par(mfrow = c(2, 2))
for (i in 1:2){
Y <- numeric(length=n[i]*B)
Y <- matrix(rpois(n[i]*B, mu), ncol=n[i])
Ymean <- apply(Y, 1, mean) # or, can do this with rowMeans(Y)
barplot(table(Y[1,]), main=paste("n=", n[i]), xlab="y",
col="lightsteelblue") # sample data dist. for first sample
hist(Ymean, main=paste("n=",n[i]), xlab=expression(bar(y)),
col="lightsteelblue") # histogram of B sample mean values
} }
# implement: with 100,000 random samples of sizes 10 and 100, mean = 0.7
> n <- c(10, 100)
> pois_CLT(n, 0.7, 100000)
Figure A3.2 shows the results with µ = 0.7, which we used to find the sampling distri-
bution of the sample median in Section 3.4.3.
We use both n = 10 and n = 100 to show the impact of the Central Limit Theorem
(see Section 3.3) as n increases. The figures on the left are bar graphs of the sample data
distribution for the first of the 100,000 simulated random samples of size n. With µ = 0.7, a
typical sample has a mode of 0, few if any observations above 3, and severe skew to the right.
The sampling distributions are shown on the right. With random samples of size n = 10,
the sampling distribution has somewhat of a bell shape but is still skewed to the right. (It
is a re-scaling of a Poisson with mean 7, since adding 10 independent Poissons with mean
0.7 gives a Poisson with mean 7.) With n = 100, the sampling distribution is bell-shaped,
has a more symmetric appearance, and is narrower because the standard error decreases as
n increases.
By changing the function used in the apply command from the mean to another statistic,
you can simulate sampling distributions of other statistics, as we did for the sample median
in Section 3.4.3. By changing the rpois argument, you can simulate sampling distributions
for other probability distributions as well as other statistics.
FIGURE A3.2: Bar plot of sample data distribution (left) and histogram of empirical sam-
pling distribution of sample mean (right), for random samples of size n = 10 (upper figures)
and n = 100 (lower figures) from a Poisson distribution with µ = 0.7.
The theory of Monte Carlo simulations and their properties is an extended field, which is not
discussed here. We restrict ourselves to presenting the very basic idea.
If Y_1, Y_2, . . . , Y_B are independent random variables sampled from f, then
(1/B) ∑_{i=1}^{B} g(Y_i) is the Monte Carlo (MC) estimator of E[g(Y)].
For a parameter θ, estimated by θ̂, we can use the Monte Carlo method to estimate the
variance of the sampling distribution of θ̂ for a random sample (Y1 , . . . , Yn ). Here are the
steps for a basic MC algorithm:
(1) For j = 1, . . . , B, draw y(j) = (yj1 , . . . , yjn ) independently from the distribution of inter-
est.
(2) Use y(j) to find θ̂j for sample j of size n, j = 1, . . . , B.
(3) Compute the average of the estimates, θ̄ = (1/B) ∑_{j=1}^{B} θ̂_j.
(4) The Monte Carlo estimate of var(θ̂) is (1/B) ∑_{j=1}^{B} (θ̂_j − θ̄)².
An implementation of this algorithm for the case of binomial distributions with π = 0.3
and sample sizes n = 10, 30, 100 and 1000, based on 100000 replications, is shown next while
the derived plots are given in Figure A3.3.
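The CLT_binom function itself is not reproduced above; the following minimal sketch (ours, not necessarily the authors' code) is consistent with the calls and output shown next: it returns the theoretical variance p(1 − p)/n of the sample proportion, its MC estimates, and plots a histogram of the simulated sample proportions.
# Sketch (ours); B: MC replications, n: sample size, p: success probability
> CLT_binom <- function(B, n, p) {
    Ymean <- rbinom(B, n, p)/n                         # B simulated sample proportions
    hist(Ymean, prob=TRUE, main=paste("n=", n), col="lightsteelblue")
    list(var.mean = p*(1-p)/n,                         # theoretical variance of the proportion
         p.MC = mean(Ymean),                           # MC estimate of the mean
         varp.MC = var(Ymean)) }                       # MC estimate of the variance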
> par(mfrow=c(2,2)) # multiple graphs layout in a 2x2 table format
> CLT_binom(100000, 10, 0.3)
$var.mean
[1] 0.021
$p.MC
[1] 0.299555
$varp.MC
[1] 0.02093151
> CLT_binom(100000, 30, 0.3)
$var.mean
[1] 0.007
$p.MC
[1] 0.3002177
$varp.MC
[1] 0.0070047
> CLT_binom(100000, 100, 0.3)
$var.mean
[1] 0.0021
$p.MC
[1] 0.2998553
$varp.MC
[1] 0.002109033
> CLT_binom(100000, 1000, 0.3)
$var.mean
[1] 0.00021
$p.MC
[1] 0.3000596
$varp.MC
[1] 0.0002109015
FIGURE A3.3: Histograms of the 100,000 simulated values of the sample mean (Ymean) for binomial samples with π = 0.3 and n = 10, 30, 100, 1000.
We illustrated here the MC approach, though unnecessary in this case since closed
form expressions exist for the estimator and its variance. However, although we have a
standard error formula for a sample mean, most probability distributions do not have a
simple standard error formula for a sample median, so Monte Carlo approximation of it is
useful.
We approximate that the standard error is 1.99. In fact, for sampling from a normal popu-
lation, the standard error of the sample median is 25% larger than the standard error of a
sample mean. The sample mean tends to be closer than the sample median to the joint pop-
ulation mean and median of a normal distribution. Apparently the sample mean is a better
estimator than the sample median of the center of a normal distribution. Constructing good
estimators is the subject of Chapter 4.
We provide next a function for estimating via MC the median of a random sample from
a gamma distribution with shape and rate parameters equal to s and r, respectively. We
illustrate the algorithm for s = 2 and r = 0.5. If r and s are unknown, we replace them with
their estimates.
> median_gamma <- function(B,s,r) {
  # B: number of iterations used in the MC procedure
  # s: shape parameter
  # r: rate parameter
  # (function body not shown in the original; a minimal completion consistent
  #  with the illustration below:)
  median(rgamma(B, shape=s, rate=r)) }
# Illustration:
> median_gamma(100000,2,0.5)
[1] 3.35453
In case the cdf F is unknown, the MC algorithm does not apply and an alternative way
to estimate var(θ̂n ) is through the bootstrap (see Section 4.6 and A4.5).
4
CHAPTER 4: R FOR ESTIMATION
The maxLik package has a function for ML estimation when you supply the log-likelihood
function to be maximized. Output includes the standard errors of the ML estimates.
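As a minimal sketch (ours, with illustrative data), ML estimation of a Poisson mean with maxLik could look as follows; only the log-likelihood and a starting value need to be supplied.
# Sketch (ours): ML estimation of a Poisson mean lambda with maxLik
> library(maxLik)
> y <- rpois(50, 4)                                   # illustrative data (ours)
> loglik <- function(lambda) sum(dpois(y, lambda, log=TRUE))
> fit <- maxLik(loglik, start=c(lambda=1))
> summary(fit)                                        # ML estimate with standard error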
Notice that the score CI is called Wilson after the name of the statistician who originally
proposed it.
Of course, the sample size of 1497 is quite large and the two types of CIs do not differ in
terms of their actual coverage probability, which achieves the nominal level of 95%. Trying
a relatively small sample size (n = 25) and a probability π = 0.25, we can verify their poorer
performance (see Section 4.3.2) and the fact that the score CI is preferable to the Wald CI
(i.e., with coverage closer to 95%), as confirmed by a simulation study. Conducting a simulation
study to compare their performance in terms of actual coverage probability is straightforward
with the binom package, as shown next. The output provided is the mean, lower and
upper coverage probabilities based on M replications (here M = 1000).
One could further try the function BinomCI of the DescTools package, which provides a
variety of methods for constructing CIs beyond the score and Wald CIs, such as the
Clopper–Pearson and the Agresti–Coull CIs. The score CI can also be obtained by applying
the very basic prop.test function (of the stats package in base R). Both functions
also offer the option of one-sided CIs, as demonstrated below. For example, the 99% score
CI for the success probability of a therapy, based on observing 38 successes in a random
sample of 45 patients, is
> BinomCI(38,45, conf.level = 0.99, method = c("wilson", "wald"))
est lwr.ci upr.ci
wilson 0.8444444 0.6629331 0.9374360 # score CI
wald 0.8444444 0.7052765 0.9836124
As discussed in Section 4.3.7, the sample size of a survey can be determined by controlling
the error. The sample size n derivation described there can easily be calculated in R by a
simple user-defined function, as shown next. Note the use of the ceiling function to set
the sample size equal to the least integer greater than the resulting real value for n.
> nBinom <- function(error, p=0.5, alpha=0.05){
# error: margin of error allowed
# alpha: significance level, default set equal to 5%
# p: guess for p, default set equal to 0.5
# returns: the sample size n
n <- ceiling(qnorm(alpha/2)^2*p*(1-p)/(error^2))
return(n)}
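For instance, with the default settings a 3% margin of error should give the familiar sample size of 1068:
> nBinom(0.03)    # error = 0.03, p = 0.5, alpha = 0.05
[1] 1068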
We revisit the UN data file and construct a 95% CI for the mean GDP of these 42 nations,
based (i) on the t distribution with 41 degrees of freedom and (ii) on the standard normal
distribution, applying the above defined function zCI. As expected, since the degrees of
freedom are relatively high and thus the t distribution approximates the standard normal,
these two CIs are very close. The usual t CI, along with the sample mean, can also be
derived by the MeanCI function of the DescTools package, which also gives the option of
constructing bootstrap CIs as well as one-sided CIs.
> t.test(UN$GDP, conf.level=0.95)$conf.int
[1] 22.05311 31.60879
> zCI(UN$GDP)
[1] 22.19406 31.46784
> library(DescTools)
> MeanCI(UN$GDP, conf.level=0.95) # provides also the sampling mean
mean lwr.ci upr.ci
26.83095 22.05311 31.60879
Let us construct 95% CIs for the mean GDP for nations with low (≤ 2) and high (> 2)
homicide rates. In this case, the sample sizes of the two groups are 25 and 17, respectively, and
thus we shall not consider the standard normal based CI. We demonstrate this derivation
below, using the group_by function of tidyverse and summarizing the sample sizes and
the CIs for the mean GDP. Note that since the summarize function returns just one value,
we need to request the lower and upper bounds of the CI separately.
> library(tidyverse)
> UN %>% group_by(Homicide>2) %>% summarize(n=n(),
+ GDB_L=t.test(GDP)$conf.int[1],GDB_U=t.test(GDP)$conf.int[2])
# A tibble: 2 x 4
‘Homicide > 2‘ n GDB_L GDB_U
<lgl> <int> <dbl> <dbl>
1 FALSE 25 26.8 38.7
2 TRUE 17 11.8 24.5
# alternative derivations:
> length(UN$GDP[UN$Homicide<=2]);length(UN$GDP[UN$Homicide>2]) # sample sizes
> t.test(UN$GDP[UN$Homicide<=2], conf.level=0.95)$conf.int
> t.test(UN$GDP[UN$Homicide>2], conf.level=0.95)$conf.int
In Section 4.5.3, a CI for the difference of the means of two independent populations
is discussed and demonstrated on the Anorexia study, comparing the weight gain for the
cognitive behavioral therapy group to that of the control group. The sample weight differences
were saved under the vectors cogbehav and control, respectively. Alternatively, a
CI for the difference of two means can be constructed by the MeanDiffCI function of the
DescTools package. This provides the t CI without assuming equality of the standard devi-
ations of the two populations. Furthermore, it provides the estimate of the mean difference,
the option of constructing one-sided CIs, as well as bootstrap CIs.
> library(DescTools)
> MeanDiffCI(cogbehav,control)
meandiff lwr.ci upr.ci
3.9344828 -0.7044632 7.6182563
Asymptotic CIs for the difference of two proportions corresponding to two independent
populations are derived in Section 4.5.5 for the difference in the after surgery complica-
tions proportions for two groups of patients, one having prayers and the other not. The
CI provided by the classical prop.test function is a Wald CI while a score CI can be
obtained in the PropCIs package by the diffscoreci function. The DescTools package
has the BinomDiffCI function that provides Wald and score CI, along with a variety of
other methods, having analogous options with the corresponding BinomCI function for one
proportion.
> BinomDiffCI(315,604,304,597, method="wald") # (x1,n1,x2,n2)
est lwr.ci upr.ci
[1,] 0.01231045 -0.04421536 0.06883625
> BinomDiffCI(315,604,304,597, method="score")
[1,] 0.01231045 -0.04398169 0.06871075
We have derived in Section 4.4.3 a 95% CI for the mean weight change of young girls
suffering from anorexia after their therapy, for the group of cognitive behavioral (cb) therapy,
by calculating the differences of weight (change = weight after - weight before) and
then considering for the weight differences the classical t CI for a mean. Studies of this
"before - after" type are so-called paired design studies, and the CI can be obtained with
the t.test function or with the MeanDiffCI function of DescTools, both with the argument
paired=TRUE. The latter has the option of constructing the CI based on the standard normal
distribution (method = "norm"). Verify that the CI in Section 4.8.2 can equivalently be
derived as given next, while the normal CI is narrower than the t CI.
> t.test(Anor$after[Anor$therapy=="cb"], Anor$before[Anor$therapy=="cb"],
+ paired=TRUE)$conf.int
[1] 0.2268902 5.7869029
attr(,"conf.level")
[1] 0.95
> MeanDiffCI(Anor$after[Anor$therapy=="cb"],Anor$before[Anor$therapy=="cb"],
+ paired=TRUE, conf.level = 0.95, sides ="two.sided")
meandiff lwr.ci upr.ci
3.0068966 0.2268902 5.7869029 # t CI
> MeanDiffCI(Anor$after[Anor$therapy=="cb"],Anor$before[Anor$therapy=="cb"],
+ paired=TRUE, method = "norm", conf.level = 0.95, sides ="two.sided")
meandiff lwr.ci upr.ci
3.0068966 0.3319241 5.5965078 # normal CI
The plot of the cdfs for these t distributions is derived by replacing in the code above
dnorm and dt by pnorm and pt, respectively.
Furthermore, let us compare the quantiles (0.01, 0.05, . . . , 0.95, 0.99) of t distributions with
30 and 90 degrees of freedom to the corresponding ones of the standard normal. We can
verify that the quantiles of a t distribution approach those of a standard normal as the
degrees of freedom increase. For degrees of freedom higher than 30 they are quite close,
with the largest differences in the tails.
quant <- c(0.01,0.05,0.10,0.25,0.5,0.75,0.9,0.95,0.99)
qt(quant,30)-qnorm(quant)
[1] -0.130913668 -0.052407260 -0.028863460 -0.008265943 0.000000000
[6] 0.008265943 0.028863460 0.052407260 0.130913668
qt(quant,90)-qnorm(quant)
[1] -0.042149602 -0.017107457 -0.009477333 -0.002735750 0.000000000
[6] 0.002735750 0.009477333 0.017107457 0.042149602
The pdf and cdf of chi-squared distributions are provided in Figure A4.1 and are derived
by adjusting the code provided above for the t distribution.
The R-code for producing Figure 4.7 is
> y = seq(0,70)
> plot(y, dchisq(y,10), type="l", xlab="Chi-squared", ylab="Probability
density function")
> lines(y, dchisq(y, 20), col="red")
> lines(y, dchisq(y, 40), col="green")
> legend(40, 0.09, c("df=10","df=20","df=40"),lty=c(1,1,1),
col=c("black","red","green"))
FIGURE A4.1: Probability density functions (left) and cumulative distribution functions (right) of chi-squared distributions for various degrees of freedom (df = 3, 5, 10, 20).
The empirical cumulative distribution function (empirical cdf) of a random sample Y_1, . . . , Y_n is
F_n(y) = (1/n) ∑_{i=1}^{n} I(Y_i ≤ y),
where I(⋅) is the indicator function, with I(Y_i ≤ y) = 1 if Y_i ≤ y and 0 otherwise.
That is, Fn (y) is the sample proportion of the n observations that fall at or below y.
We illustrate by generating a random sample of size n = 10 from the N (100, 162 ) distri-
bution of IQ values and constructing the empirical cdf :
> y <- rnorm(10, 100, 16)
> plot(ecdf(y), xlab="y", ylab="Empirical CDF", col="dodgerblue4") # ecdf = empirical cdf
> lines(seq(50, 150, by=.1), pnorm(seq(50,150,by=.1), 100, 16), col="red4", lwd=2)
Figure A4.2 shows the empirical cdf and the cdf of the normal distribution from which
the data were simulated. The empirical cdf is the cdf of the discrete distribution having
probability 1/n at each observation, so it is a step function. The figure also shows an
empirical cdf for a random sample of size n = 50. As n increases, the empirical cdf converges
uniformly over y to the true underlying cdf.1
In medical applications that focus on the survival time of patients following some
1 This result is called the Glivenko–Cantelli Theorem, named after the probabilists who proved it in 1933.
FIGURE A4.2: Empirical cumulative distribution functions for random samples of sizes
n = 10 and n = 50 from the N (100, 162 ) distribution, showing also the N (100, 162 ) cdf.
● B bootstrap samples are observed and y*(j) = (y_1*(j), . . . , y_n*(j)) denotes the j-th observed
bootstrap sample, j = 1, . . . , B.
For the bootstrap variance estimation, the MC Algorithm (see Section A3.2) is applied,
replacing the unknown cdf F by its ecdf Fn . Hence, the variance var(θ̂n ), based on F , is
approximated by var(θ̂), based on Fn . The bootstrap estimate of var(θ̂) is
σ̂²_B = (1/B) ∑_{j=1}^{B} ( θ̂^(j) − (1/B) ∑_{j=1}^{B} θ̂^(j) )².
When the CLT applies, (θ̂_n − θ)/√var(θ̂_n) converges in distribution to N(0, 1).
Replacing var_F(θ̂_n) by its bootstrap estimate, the bootstrap approximate normal CI is
derived.
If θ̂n is not normally distributed and the CLT does not apply, then a bootstrap approxi-
mate normal CI is not adequate. There exist alternative bootstrap CIs, the simplest being
a ‘percentile’ or a ‘pivotal’ bootstrap CI.
The percentile bootstrap (1 − α) CI is [θ̂_(α/2), θ̂_(1−α/2)], where θ̂_(β) denotes the β-th
empirical quantile of the bootstrap estimates {θ̂^(j) : j = 1, . . . , B}.
The pivotal bootstrap CI assumes that the distribution of the random variable δn = θ̂n − θ
is a distribution not involving θ and that it is approximated by the bootstrap distribution
of δ = θ̂ − θ̂obs . Then, the quantiles of the cdf of δn are approximated by the quantiles of the
ecdf of δ.
The pivotal bootstrap (1 − α) CI is then [θ̂_obs − δ_(1−α/2), θ̂_obs − δ_(α/2)], where δ_(β) is the
β-th empirical quantile of {δ^(j) = θ̂^(j) − θ̂_obs : j = 1, . . . , B}.
Other variants of bootstrap CIs include ‘bias-corrected (BC)’ and ‘bias-corrected and accel-
erated (BCa)’ bootstrap CIs, having coverage probabilities closer to the nominal level than
the percentile-based method.
The bootstrap method for variance estimation and the derivation of confidence intervals can
easily be implemented in R using the function boot, as illustrated for the example in Section
4.6.2, where bootstrap CIs for the median and the standard deviation of the books' shelf
time are constructed. Alternatively, it is straightforward to program the required simulations
manually and construct a bootstrap CI. Thus, for the median shelf time of books, the 95%
bootstrap CI is derived as follows.
> Books <- read.table("http://stat4ds.rwth-aachen.de/data/Library.dat", header=TRUE)
> n <- 54; nboot <- 100000
> Psample <- matrix(0, nboot, n) # matrix of nboot rows and n columns
> for (i in 1:nboot) Psample[i,] <- sample(Books$P, n, replace=TRUE)
> MedianBoot <- apply(Psample, 1, median) # finds median in each row
> quantile(MedianBoot, c(0.025, 0.975))
2.5% 97.5% # 95% percentile CI for median
11 18.5
> sd(MedianBoot) # standard deviation of the bootstrapped medians
[1] 2.196998
> SDBoot <- apply(Psample, 1, sd)
> quantile(SDBoot, c(0.025, 0.975)) # 95% percentile CI for standard deviation
2.5% 97.5%
13.45806 35.79818
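Using the same bootstrap replicates, the pivotal ('basic') CI described above could be obtained as in the following sketch (ours):
> theta.obs <- median(Books$P)                      # observed sample median
> c(2*theta.obs - quantile(MedianBoot, 0.975),      # 95% pivotal (basic) CI:
    2*theta.obs - quantile(MedianBoot, 0.025))      # [2*obs - upper, 2*obs - lower]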
The decomposition of the mean squared error of an estimator (4.1) shows that one can find a
biased estimator that minimizes the mean squared error by having smaller variance than the
unbiased one. In machine learning and prediction modeling, the bias (variance) of a prediction
is decreasing (increasing) in the model complexity, and the problem of deciding optimally on
the model complexity is known as the bias-variance tradeoff. Thus, the bias of an estimator
θ̂_n, b(θ̂_n) = E_F(θ̂_n) − θ, is commonly of interest and quite often difficult to estimate. It can
be estimated via the bootstrap by
b̂(θ̂_n) = (1/B) ∑_{j=1}^{B} θ̂^(j) − θ̂_obs.
[Figure: bootstrap plots for the median (histogram of the bootstrap values t* and normal Q-Q plot), produced by plot(b_median) below.]
#.....................--...........................................
> set.seed(54321) # for reproducibility of the example
> b_median = boot(UN[,3], y.median, R=1000); b_median
#...... output ....................................................
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = UN[, 3], statistic = y.median, R = 1000)
Bootstrap Statistics :
original bias std. error # t1*: bootstrapped estimate of the median
t1* 0.86 -0.014815 0.03493223
#.................................................................
> boot.ci(b_median, conf=0.90); plot(b_median)
#...... output ...................................................
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = b_median, conf = 0.9)
Intervals :
Level Normal Basic
90% ( 0.8146, 0.9345 ) ( 0.8400, 0.9500 )
FIGURE A4.4: Bootstrap plots for the Pearson correlation between GDP and CO2 of UN
data.
The boot function can also be applied to a statistic that depends on more than one
variable, such as the Pearson correlation, defined by the function xy.cor that follows. We
illustrate it for the correlation between GDP and CO2 of the UN data.
> xy.cor <- function(y, i){
v <- y[i,]
return(cor(v[,1], v[,2], method=’p’))}
> b_cor = boot(cbind(UN[,2], UN[,6]), xy.cor, R=1000)
> boot.ci(b_cor, conf=0.90); plot(b_cor)
#..... output .........................................
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = b_cor, conf = 0.9)
Intervals :
Level Normal Basic
90% ( 0.5436, 0.8084 ) ( 0.5559, 0.8216 )
The parametric bootstrap is also implemented in the package boot, by setting the argument
sim="parametric" (the default value is "ordinary"). In this case, the function used to
generate the data is supplied in the ran.gen argument. The second argument of ran.gen is set
by the mle argument, which usually will be the vector of maximum likelihood estimates of
the parameters computed on the original data.
Revisiting the example on the median of a gamma distribution in Section A3.2.2, assume
that y is a realization of a random sample Y = (Y1 , . . . , Y25 ) of 25 iid random variables from
a gamma distribution with shape and rate parameters s = 2 and r = 0.5, respectively. Then,
based on y, the bootstrap estimate of the median of a gamma (2,0.5) distribution and its
standard error can be computed as follows. In this case, since we assume the parameters to
be known, we provide these values in the mle argument, which is passed as the second
argument of the function ran.gen.
> library(boot)
# ................ required functions ....................................
> y.median <- function(y, i){ # y: data vector; i: indices vector
return(median(y[i]))}
> gamma.rg <- function(y,p){ # function to generate random gamma variates
rgamma(length(y), shape=p[1], rate=p[2])}
#.........................................................................
> y <- c(5.88,5.55,5.40,1.83,2.31,1.32,1.52,6.79,4.99,3.87,1.21,10.44,
3.71,1.68,2.53,5.40,0.17,9.00,1.41,3.37,2.99,1.68,1.73,6.43,4.16)
> s <- 2; r <- 0.5
> p <- c(s,r)
> gamma_bootmed = boot(y, y.median, R=10000, sim = "parametric",
ran.gen=gamma.rg, mle=p)
> gamma_bootmed
#..... output ............................................................
PARAMETRIC BOOTSTRAP
Call:
boot(data = y, statistic = y.median, R = 10000, sim = "parametric",
ran.gen = gamma.rg, mle = p)
Bootstrap Statistics :
original bias std. error
t1* 3.37 0.02498518 0.6300741
#.........................................................................
# Bootstrap estimate of the median of gamma(2,0.5):
> 3.37 - 0.02 # =sample median - bias
[1] 3.35
The B sample medians of the simulated bootstrap samples are saved under the vector
gamma_bootmed$t. Thus the bootstrap estimate of the median in the example above can
equivalently be computed as follows. You may also verify the bootstrap estimate of the
standard error of the median estimator.
> round(median(gamma_bootmed$t),2)
[1] 3.35
> round(sd(gamma_bootmed$t),2)
[1] 0.63
With uniform beta(1.0, 1.0) priors, the posterior distributions are beta(12.0, 1.0) for π1 and
beta(1.0, 2.0) for π2. For inference about the risk ratio, we obtain with simulation:
> library(PropCIs) # EQT intervals
> rrci.bayes(11, 11, 0, 1, 1.0, 1.0, 1.0, 1.0, 0.95, nsim = 1000000)
[1] 1.078 73.379 # EQT interval for pi1/pi2
> rrci.bayes(0, 1, 11, 11, 1.0, 1.0, 1.0, 1.0, 0.95, nsim = 1000000)
[1] 0.01363 0.92771 # EQT for pi2/pi1; endpoints (1/73.379, 1/1.078)
> library(HDInterval) # HPD interval for ratio of probabilities
> pi1 <- rbeta(1000000, 12.0, 1.0) # random sample from beta posterior
> pi2 <- rbeta(1000000, 1.0, 2.0)
> hdi(pi1/pi2, credMass=0.95)
lower upper
0.6729 36.6303
> hdi(pi2/pi1, credMass=0.95)
lower upper
7.820e-07 8.506e-01 # quite different from (1/36.63, 1/0.67)
The HPD 95% interval for π1 /π2 is (0.673, 36.630). Taking reciprocals, this would suggest
(0.027, 1.486) as plausible values for π2 /π1 , but the HPD 95% interval is (0.000, 0.851). The
inference then depends on which group we identify as Group 1 and which we identify as
Group 2! Because of this, we prefer EQT intervals for nonlinear functions of parameters.
When you can specify the posterior distribution, the HPD interval is also available in R
with the hpd function of the TeachingDemos package. The LearnBayes package is a collec-
tion of functions helpful in learning the Bayesian approach to statistical inference. Albert
(2009) is a related excellent introduction to applied Bayesian analysis and computations.
FIGURE A4.5: The beta (1/2,1/2) prior (dotted line) and the beta posterior (solid line)
for parameter π, when n = 10 and y = 0 (left) or y = 10 (right). The 95% EQT and the 95%
HPD intervals are marked on the x-axis by red triangles and blue bullets, respectively.
For case (iii) y = 3, the posterior is initially increasing and then decreasing; the corre-
sponding graph is provided in Figure A4.6.
The code specifying the posterior, deriving the HPD credible intervals, and producing
these plots follows for case (i). For the other cases, we simply need to change the value of x
accordingly and repeat all the other commands.
> library(TeachingDemos)
> alpha <- 0.05 # significance level
> a0 <- 1/2; b0 <- 1/2 # beta prior parameters
> n <- 10 # n: number of trials
> x <- 0 # x: no. of successes in case (i)
# Plots:
> prior <- function(p,a,b) { # beta prior pdf
dens<-function(p) dbeta(p,a,b)
return(dens)}
> post<- function(p,a,b,x,n) { # beta posterior pdf
dens<-function(p) dbeta(p,x+a,n-x+b)
return(dens)}
> y0=c(0,0) # (needed in plots)
> p1<-plot(post(p,a0,b0,x,10), col="red", lwd=2, xlim=c(0,1),type="l",
xlab=expression(paste(pi)), ylab="pdf", main=expression(paste(
"beta prior: ", alpha,"=1/2, ", beta,"=1/2")))
> points(q,y0, pch=2, col="red");
## using h instead of q marks the HPD bounds on the plot:
> points(h,y0, pch=16, col="blue");
> par(new=T)
> plot(prior(p,a0,b0),axes=F,xlab="",ylab="", lwd=2, lty=2, col="blue");
> par(new=F)
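The vectors q (EQT bounds) and h (HPD bounds) used in the points commands are computed in the part of the code not shown here; a minimal sketch of how they could be obtained (our reconstruction, assuming the hpd interface of TeachingDemos that takes the posterior quantile function) is:
# Sketch (ours): 95% EQT and HPD bounds for the beta(x+a0, n-x+b0) posterior
> q <- qbeta(c(alpha/2, 1-alpha/2), x+a0, n-x+b0)              # equal-tail interval
> h <- hpd(qbeta, shape1=x+a0, shape2=n-x+b0, conf=1-alpha)    # HPD interval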
FIGURE A4.6: The beta (1/2,1/2) prior (dotted line) and the beta posterior (solid line) for
parameter π, when n = 10 and y = 3. The 95% EQT and the 95% HPD intervals are marked
on the x-axis by red triangles and blue bullets, respectively.
Sometimes the prior information is given in terms of the expected value and the standard
deviation of the parameter of interest. This information can be transformed into the parameters
of the beta prior via the following function.
> priorpar <- function(Ep,SDp) {
# Prior knowledge for expected mean (Ep) and SD (SDp) of pi
a <- (Ep/SDp^2)*(Ep*(1-Ep)-SDp^2); b <- a*(1-Ep)/Ep
par <- c(round(a,3),round(b,3))
return(par)}
# Example:
> priorpar(0.5, 0.3536) # corresponds to a=b=1/2
[1] 0.5 0.5
In this setting, the marginal posterior of σ² is that of (n − 1)s² divided by a chi-squared
random variable with n − 1 degrees of freedom, where s² is the sample variance. For details
and implementation in R we refer to Albert (2009, p. 57–58). Following his algorithm,
we provide next the R code for our example. As expected, the results are very close to
those based on a non-informative conjugate prior. In Figure A4.7 we provide the plot of
the simulated posterior pdf of µ along with the theoretical posterior, with σ² estimated
by σ̂² = (n − 1)s²/(n − 3), since E(χ⁻²_df) = 1/(df − 2), for df > 2.
FIGURE A4.7: The simulated posterior density function (blue) for µ, the expected weight
change for the girls of the "cb" therapy group, conditional on σ², along with the pdf of a
N(ȳ, σ²/n) (black), i.e. the theoretical posterior density conditional on σ², with estimated σ².
Note that for our sample σ̂² = 57.523 (compare with 7.511, the mean of the simulated values
of σ from the marginal posterior of σ²).
> change <- Anor$after - Anor$before
> y <- change[Anor$therapy=="cb"]; n=length(y)
> S = sum((y-mean(y))^2)
> sigma2 <- S/rchisq(500000,n-1)
> mu <- rnorm(500000, mean=mean(y), sd=sqrt(sigma2)/sqrt(n))
> Anor%>%summarize(n=n, mu.post=mean(mu), sd.mu=sd(mu),
sd.post=mean(sqrt(sigma2)), sd.sd=sd(sqrt(sigma2)))
n mu.post sd.mu sd.post sd.sd
1 29 2.997451 1.412274 7.511078 1.047082
> quantile(mu, c(0.025,0.975))
2.5% 97.5%
0.2059197 5.7869275
> quantile(sqrt(sigma2), c(0.025,0.975))
2.5% 97.5%
5.795127 9.895656
Analogously, the posterior distribution for the difference µ1 − µ2 of the means of two
independent normal populations with common variance σ 2 and improper priors can be
derived as follows. Assume priors f(µ_i) ∝ 1, i = 1, 2, and f(σ²) ∝ 1/σ². Let further ȳ_i,
i = 1, 2, be the sample means and s² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2) the pooled
estimate of σ², where s_i² and n_i denote the corresponding sample variance and sample size,
respectively. Then, the posterior of µ_i, conditional on σ², is N(ȳ_i, σ²/n_i) and the marginal
posterior of σ² is that of (n1 + n2 − 2)s² multiplied by a χ⁻²_{n1+n2−2} random variable.
Reconsidering the example in Section 4.5.3, we shall derive the EQT interval for the
difference of the mean weight gains for the cognitive behavioral therapy and the control
groups. Recall that the 95% t CI is (-0.68, 7.59).
> cogbehav <- Anor$after[Anor$therapy=="cb"]-Anor$before[Anor$therapy=="cb"]
> control <- Anor$after[Anor$therapy=="c"]-Anor$before[Anor$therapy=="c"]
> n1 <- length(cogbehav); n2 <- length(control)
FIGURE A4.8: The simulated posterior density function (blue) for µ1 − µ2, the expected
mean difference in weight gains for the "cb" and "control" groups, conditional on σ², along
with the pdf of N(ȳ1 − ȳ2, σ²(1/n1 + 1/n2)) (black), i.e. the theoretical posterior density
conditional on σ², with estimated σ².
5
CHAPTER 5: R FOR SIGNIFICANCE TESTING
A very nice feature of the BF is that interchanging the role of the null and alternative
hypothesis is straightforward, which is not the case in frequentist approaches. In particular,
the BF in favor of the null hypothesis is simply given by BF01(y) = 1/BF10(y).
It can easily be verified that BF10 (y) equals the ratio of the marginal distributions of the
data under H1 and H0 , which are derived by integrating the likelihoods under H1 and H0
over the parameter θ. There is a kind of correspondence to the likelihood ratio test, since
the likelihood ratio is computed by maximizing the likelihoods in terms of θ, while the BF
is computed by integrating over θ.
The BF was introduced by Jeffreys in 1961. The evaluation scale mostly used is an
adjustment of the initial evaluation scale of Jeffreys, proposed by Kass and Raftery 1 :
2 ln B10 (y) B10 (y) Evidence against H0
0-2 1-3 negligible
2-6 3 - 20 positive
6 - 10 20 - 150 strong
> 10 > 150 very strong
We shall consider next Bayesian t tests for comparing the means of two independent
normal distributed random variables with common variance, i.e. the Bayesian analogue to
the test presented in Section 5.3.4. Various versions of Bayesian t tests have been proposed
in the literature, offering options from very vague non-informative prior to strongly infor-
mative. These tests reparameterize the problem by expressing the means as deviations from
the common mean µ under the null hypothesis, i.e. they consider that µ1 = µ + σδ/2 and
µ2 = µ − σδ/2, where δ = (µ1 − µ2)/σ is the standardized effect size. Then the hypothesis testing
problem becomes H0: δ = 0 versus H1: δ ≠ 0, having a common nuisance parameter vector
1 Kass and Raftery (1995). Bayes factors, Journal of the American Statistical Association, 90, 773–795
(µ, σ) under both hypotheses, which facilitates the priors consideration and the Bayesian
calculations. Under this set-up, the considered prior for the parameter vector under H1 is
π1 (µ, σ, δ) = π0 (µ, σ)π(δ), where π0 (µ, σ) is the prior of the parameter vector under H0 .
Then a non-informative prior is used for the nuisance parameters π0 (µ, σ) ∝ σ −1 while the
decision for the prior on δ depends on the availability of prior information and differs among
various considered tests. The predominant options are a Cauchy non-informative prior 2 or
a normal prior 3 that, depending on the choice of its parameters, may be informative or not.
Such Bayesian t tests can easily be performed with the BayesFactor package, applying the
ttest.tstat or the ttestBF function. The first requires just the value of the t test statistic
and the sample sizes n1 and n2, while the second applies to the data and also offers the
possibility to sample from the posterior, which provides further output such as credible
intervals or plots of simulated posterior densities.
We reanalyze next the example of Sections 5.3.2 and 5.3.5, comparing cognitive be-
havioral and control therapies for anorexia,
√ using for δ a non-informative Cauchy prior
π(δ) = Cauchy(δ; 0, γ) with scale γ = 2/2, which is the default option in BayesFactor.
The R–code is provided below:
> y1 <- Anor$after[Anor$therapy=="cb"] - Anor$before[Anor$therapy=="cb"]
> y2 <- Anor$after[Anor$therapy=="c"] - Anor$before[Anor$therapy=="c"]
> t.anor <- t.test(y1, y2, var.equal=TRUE)
> t.anor
t = 1.676, df = 53, p-value = 0.09963 # classical t test
> library(BayesFactor)
# requires the vectors "cogbehav" and "control" computed in Section 5.3.4
> ttest.tstat(t = t.anor$statistic, n1=length(cogbehav), n2=length(control),
simple = TRUE)
B10
0.8630774
> ttestBF(x = cogbehav, y = control)
Bayes factor analysis
---------------------------------------
[1] Alt., r=0.707 : 0.8630774 ±0.01% # r=sqrt(2)/2 (scale of Cauchy prior)
Against denominator:
Null, mu1-mu2 = 0
---
Bayes factor type: BFindepSample, JZS
The estimated Bayes factor of BF10 (y) = 0.86 shows only weak evidence against H0 . The
2 Rouder, Speckman, Sun, Morey, Iverson (2009). Bayesian t-tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225–237.
3 Gönen, Johnson, Lu, Westfall (2005). The Bayesian two-sample t test. The American Statistician, 59, 252–257.
FIGURE A5.1: Simulated posterior density function for (i) the expected mean difference in
weight gains for the ‘cb’ and ‘control’ groups µ1 − µ2 (left) and (ii) the associated standard-
ized effect size δ (right), based on 5000 replications.
95% posterior interval for the standardized effect difference δ is (−0.11, 0.89). Although
some plausible values for δ are negative, the posterior probability of a negative value is only
0.064. (The classical one-sided P -value is 0.0996/2 = 0.05.) Figure A5.1 plots the posterior
densities of the difference µ1 − µ2 (left) and the effect size δ (right).
In an alternative recent approach 4 , the BF for a Bayesian test, corresponding to a t
test with observed statistic value tobs , under any proper prior for δ can be written as
BF10(t_obs) = [ ∫ T_df(t_obs ∣ √n_δ δ) π(δ) dδ ] / T_df(t_obs),
where Tdf (t∣c) denotes the density function of a t distribution with non-centrality parameter
c and T_df(t) = T_df(t∣0). BF10(t_obs) can easily be evaluated by numerical integration. Using
the corresponding R functions provided in informedTtest_functions.R and adjusting the
associated example code (https://osf.io/37vch/) for the anorexia example, we can verify the
value of BF10 and plot the posterior of the effect size along with its prior, as shown in Figure
A5.2.
> library(BayesFactor)
# requires the vectors "cogbehav" and "control" computed in Section 5.3.4
> source("informedTtest_functions.R")
> rscale_def <- 1/sqrt(2)
> t <- t.anor$statistic
4 Gronau, Ly, Wagenmakers (2020). Informed Bayesian t-tests, The American Statistician, 74, 137–143.
FIGURE A5.2: Prior (Cauchy with scale = √2/2) and posterior density functions for the
effect size δ corresponding to the difference in weight gains for the 'cb' and 'control' groups.
Another Bayesian alternative to the t test, not based on the Bayes Factor, is provided in
the BEST package, which simulates from the posterior and provides HPD intervals (denoted
in the package as HDI) for the means, the standard deviations and their differences, as well
as for the effect size. Its implementation is straightforward and provides handy output that
is an mcmc object and can thus be analyzed further with coda (for a brief discussion of mcmc
objects and coda, we refer to Section A6.3). Furthermore, the simulated draws from the
posterior of each of the parameters in the model can be used for drawing inference about
functions of the parameters, such as the ratio σ1/σ2, as illustrated below for our anorexia example.
# BEST uses JAGS via rjags; the JAGS program itself must be installed
# separately on your system (it is not an R package).
install.packages("rjags")
install.packages("coda")
install.packages("BEST")
library(BEST)
# requires the vectors "cogbehav" and "control" computed in Section 5.3.4
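The BEST code for the anorexia example is not reproduced here; a minimal sketch (ours) of how such an analysis might look, using the BESTmcmc function and the posterior draws it returns (column names sigma1 and sigma2 are assumed), is:
# Sketch (ours): Bayesian comparison of the two therapy groups with BEST
BESTout <- BESTmcmc(cogbehav, control)   # simulates from the posterior (may take a while)
summary(BESTout)                         # posterior summaries, including HDIs
plot(BESTout)                            # posterior of the difference in means
quantile(BESTout$sigma1/BESTout$sigma2,  # inference for the ratio sigma1/sigma2
         c(0.025, 0.975))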
FIGURE A5.3: Simulated posterior density function for (i) the expected mean difference in
weight gains for the ‘cb’ and ‘control’ groups µ1 − µ2 (left) and (ii) the difference of their
standard deviations (right), based on 100000 replications.
The simulated quantiles of the exact sampling distribution are close to the χ²₁ quantiles.
Figure A5.4 is a histogram of the 100,000 values of the likelihood-ratio test statistic and
the χ²₁ pdf. The approximation seems fairly good, even though n is relatively small.
# Function for computing the Poisson log(LRT):
#----------------------------------------------------------------#
> logLRT <- function(n,lamb0,lamb.hat){
2*n*((lamb0-lamb.hat)-lamb.hat*log(lamb0/lamb.hat))}
#----------------------------------------------------------------#
# Function returning a vector of length R with the logLRT-values
# for the R simulated Poisson(lambda_0) samples of size n:
#----------------------------------------------------------------#
> testat <- function(R,n,lamb0){ y <- rep(-1,R)
for (i in 1:R){x <- rpois(n,lamb0)   # simulate a Poisson(lambda_0) sample of size n
MLE <- mean(x)
y[i] <- logLRT(n,lamb0,MLE)}
return(y) }
#----------------------------------------------------------------#
# Application of the testat function for n=25 and lambda0=5:
> n<- 25; lamb0 <- 5
> R <- 10000 # number of replicates
> T25 <- testat(R,25,lamb0); stat<- T25
FIGURE A5.4: Histogram of 100,000 values of the likelihood-ratio test statistic and the χ²₁ pdf,
for random sampling from a Poisson distribution with µ0 = 5 and n = 25.
The nonparametric Wilcoxon test for comparing the mean ranks of two independent
groups, discussed in Section 5.8.3, can easily be implemented using the wilcox.test func-
tion of base R, as illustrated below on the same example as above. Notice that for this
function the sample values of the two groups need to be given in different vectors:
> x <- c(114, 203, 217, 254, 256, 284, 296)
> y <- c(4, 7, 24, 25, 48, 71, 294)
> wilcox.test(x,y, alternative ="greater", exact =TRUE)
data: x and y
W = 43, p-value = 0.008741
alternative hypothesis: true location shift is greater than 0
Linear models are fitted in R with the lm function, called as lm(formula, data), where the
argument formula defines the model to be fitted and data specifies the data frame to which
the model is applied. There are further optional arguments that can be specified in a call of
the lm function, for example to weight observations (weights) or to use only a subset
(subset) of the observations in the fitting process.
The formula argument specifies the model formula as, e.g., y ∼ x1 + x2, where y contains
the data of the response variable while x1 and x2 contain those of the explanatory variables,
which can be numeric (continuous) or categorical (factors). All variables appearing in this argument
must be in the workspace or in the data frame specified in the (optional) data argument.
Other symbols that can be used in the formula argument are the following; a short illustration follows the list.
Other symbols that can be used in the formula argument are
● x1:x2, for an interaction between x1 and x2
● x1*x2, which expands to x1+x2+x1:x2
(e.g. y∼x1*x2 is equivalent to y∼x1+x2+x1:x2)
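For illustration, with a hypothetical data frame mydata containing y, x1 and x2, these formula operators would be used as follows (a sketch, not from the text):
> fit1 <- lm(y ~ x1 + x2, data = mydata)    # main effects only
> fit2 <- lm(y ~ x1*x2, data = mydata)      # main effects plus the x1:x2 interaction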
Simply calling lm does not save any results but only prints a brief output on the screen.
To use the results of the model fit for further analysis, they must be saved in an lm object,
say fit (e.g. fit <- lm(...)). In this case, no output is provided automatically; it is
requested by the user via functions such as those listed next.
Among the functions that are available for displaying (or saving) components of an lm
(or glm) object, are the following.
● summary(fit): displays the results
● coef(fit): the vector of the model parameters’ estimates (coefficients)
● confint(fit, level=0.95): (1 − α) t approximate CIs for the coefficients of the specified
model (by default a 95% CI is provided).
● confint.default(fit, level=0.95): (1 − α) CIs for the coefficients of the specified
model based on asymptotic normality (by default α = 0.05).
● fitted.values(fit): the fitted mean response values
● residuals(fit): the residuals (observed response - fitted value)
(standardized and studentized residuals are obtained by the rstandard and rstudent
functions, respectively)
● df.residual(fit): the residual degrees of freedom
● predict(fit, newdata= ): the predicted expected responses for new data points, based
on the estimated model
● plot(fit): diagnostic plots (discussed next)
A selection of up to six plots of residuals: (1) residuals against fitted values, (2) scale-location
plot of √|residuals| against fitted values, (3) normal Q-Q plot, (4) plot of Cook's distances
versus row labels, (5) plot of residuals against leverages, and (6) plot of Cook's distances
against leverage/(1 − leverage). By default, plots (1), (2), (3) and (5) are provided. This
function requires the graphics package.
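Continuing the hypothetical example above, a few of these extractor functions in use (a sketch):
> fit <- lm(y ~ x1 + x2, data = mydata)     # hypothetical data frame, as above
> summary(fit)                              # full results table
> coef(fit); confint(fit)                   # coefficient estimates and 95% CIs
> head(fitted.values(fit)); head(residuals(fit))
> plot(fit)                                 # default diagnostic plots (1), (2), (3), (5)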
Figure A6.1 shows four of the available diagnostic plots. These help us check the model
assumptions that E(Y ) follows the linear model form and that Y has a normal distribution
about E(Y ) with constant variance σ 2 . The first display plots the residuals {yi − µ̂i } against
the fitted values {µ̂i }. If the linear trend holds for E(Y ) and the conditional variance of
Y is truly constant, these should fluctuate in a random manner, with similar variability
throughout. With only 40 observations, the danger is over-interpreting, but this plot does
not show any obvious abnormality. Observation 40 stands out as a relatively large residual,
with observed y more than 10 higher than the fitted value. The residuals can also be plotted
against each explanatory variable. Such a plot can reveal possible nonlinearity in an effect,
such as when the residuals exhibit a U-shaped pattern. They can also highlight nonconstant
variance, such as a fan-shaped pattern in which the residuals are markedly more spread out
as the values of an explanatory variable increase.
Figure A6.1 also shows Cook’s distance values. From Section 6.2.8, a large Cook’s dis-
tance highlights an observation that may be influential, because of having a relatively large
residual and leverage. In this display, cases 5, 39 and 40 stand out. The plot of the resid-
uals versus the leverage highlight these observations. In this plot, observations fall outside
red dashed lines if their Cook’s distance exceeds 0.5, which identifies them as potentially
influential. Here, no observation has that large a Cook’s distance, but when observation 40
is removed from the data file, the estimated life events effect weakens somewhat.
The figure also shows the normal quantile (Q-Q) plot. This enables us to check the
normal assumption for the conditional distribution of mental impairment that is the basis
for using the t distribution in statistical inference, including prediction intervals. The as-
sumption seems reasonable, as the trend in this Q-Q plot is not far from the straight line
expected with normality.
FIGURE A6.1: Diagnostic plots for the linear model fitted to the Mental data file.
Call:
lm(formula = salary ~ years, data = income)
Residuals:
Min 1Q Median 3Q Max
-168.02 -68.41 -24.68 82.95 176.79
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1323.160 63.962 20.69 <2e-16 ***
years 96.730 2.969 32.58 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = salary ~ years, data = income, subset = gender == 1)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1310.480 78.646 16.66 1.16e-09 ***
years 99.956 3.853 25.94 6.57e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = salary ~ years, data = income, subset = gender == 2)
Residuals:
Min 1Q Median 3Q Max
-112.022 -45.487 -4.634 42.635 117.268
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1276.938 71.829 17.78 6.77e-09 ***
years 96.215 3.151 30.54 3.33e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = salary ~ years + gender, data = income)
Residuals:
Min 1Q Median 3Q Max
-161.571 -55.325 -4.797 65.821 137.748
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1450.418 66.380 21.850 < 2e-16 ***
years 98.490 2.559 38.492 < 2e-16 ***
gender -111.771 34.024 -3.285 0.00324 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Observe that although model (i) provides an excellent fit, it does not take gender into
account and thus it is not informative with respect to our bias question. A first approach
would be to fit separate regression lines per gender (models (i)a and (i)b). However, though
this approach gives us an indication, it does not provide a direct way to test the significance
of the gender effect; furthermore, fitting two models, each using half of the available data,
reduces the accuracy of the parameter estimates based on these models. On the other
hand, model (ii) includes gender as an additional explanatory variable and thus estimates
a common job seniority effect based on the total sample size, and it explains the average
difference of the salaries between males and females by the coefficient of the gender effect,
which is highly significant and negative, indicating that females earn on average 111.8 euro
less than their male colleagues with the same seniority.
Residual diagnostics is a necessary step before accepting our model and using it for rela-
tional interpretation of the effect of the explanatory variables on the response or prediction.
Observing the plots in Figure A6.2 for our covariance model (ii), we verify that there is
no distinctive pattern of unexplained response variation in the residuals (upper, left) and
no indication for deviation from the normality assumption (middle, left). There is also no
indication for non-constant variance, since the residuals appear randomly spread and the
red line is almost horizontal (lower, left). For large sample sizes, a rule of thumb for interpreting
Cook's distance is to consider values higher than 1 as indicative of highly influential
observations. The plots in the right column serve to detect influential points. The Cook's
distances plot (upper, right) shows that the Cook's distances of cases 8, 17 and especially 23
are higher than those of the rest of the observations, so these cases might be influential.
These observations are also marked in the residuals vs. leverage plot (middle, right), which
serves for the identification of influential observations. Since all points, even cases 8, 17 and
23, lie within the Cook's distance lines (red dashed lines; the upper 0.5 line is hardly visible
in the upper right corner of the plot), it seems that there are no influential points. In the
presence of influential points, the red solid line would cross the Cook's distance lines. Finally,
in the Cook's distance vs. leverage plot (lower, right), the dashed contour lines correspond to
standardized residuals (of the labeled value). We verify that only case 23 is above the value 2
and no case exceeds the standardized residual value of 2.5. Thus, our model passes all residual
diagnostics successfully. The plots that the plot function provides by default are the three
on the left and the middle one on the right side.
FIGURE A6.2: Residual plots for the linear model (ii) fitted on the Salaries data.
For the same model, the scatter plot of the data along with the regression line and the
associated 99% confidence intervals and 99% prediction intervals are provided in Figure
A6.3 (left). Figure A6.3 (right) pictures the scatter plot of the same data along with the
regression lines per gender and the associated 95% confidence intervals.
Please note that although ggplot2 and other software visualize the confidence and prediction intervals through lower and upper continuous curves, these are pointwise intervals (as illustrated in Figure 6.8) and are not to be considered confidence bands for the entire regression line. To obtain a confidence band for the entire regression line, we need to take into account also the distribution of $s^2$ and the dimension of the parameter vector. For example, the confidence band for the regression line of the simple linear regression model ($p = 1$ explanatory variable) is
\[
\hat{\mu}(x_0) \;\pm\; s\,\sqrt{2F_{1-\alpha;\,2,\,n-2}}\;\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}\,. \qquad (6.1)
\]
> library(tidyverse)
> alpha <- 0.01
> pred.dat <- predict(model.all, interval="prediction", level=1-alpha)
#----------------------------------------------------------------------------
# calculation of the entire regression line CI:
x <- income$years; n <- nrow(income)
cf <- sqrt(sum(model.all$residuals^2)/(n-2))*
sqrt(2*qf(1-alpha, 2, n-2))*sqrt((1/n)+(mean(x)-x)^2/(n*var(x)))
ci.regr1 <- model.all$fitted.values - cf
ci.regr2 <- model.all$fitted.values + cf
names(ci.regr1) <- "lwr.regr"; names(ci.regr2) <- "uwr.regr"
#----------------------------------------------------------------------------
newdata <- cbind(income, pred.dat, ci.regr1, ci.regr2)
> ggplot(newdata, aes(years, salary)) +
geom_point() +
geom_smooth(method=lm, se=TRUE, level=1-alpha)+
geom_line(aes(y=lwr), color = "red", linetype = "dashed")+
geom_line(aes(y=upr), color = "red", linetype = "dashed")+
geom_line(aes(y=ci.regr1), color = "red")+
geom_line(aes(y=ci.regr2), color = "red")
FIGURE A6.3: Scatter plots of the salary (in euro) versus the seniority (in years) for the
income data along with (1) the regression fit, and 99% CIs (gray shadowed), 99% confidence
band for the entire regression line (red lines) and 99% prediction intervals (dashed red lines)
based on model (i) fitted on the income data (left) and (2) regression lines per gender (males:
blue, females: red) and the associated 95% confidence interval bands (right).
The linear regression model in Section 6.6.2 is fitted by the MCMCregress function of the MCMCpack package. Its syntax is similar to that of lm and, as all model-fitting functions of MCMCpack, it returns an mcmc object that contains the generated Markov chains for all parameters of the model. Data sets in vector or matrix form, with one column per variable, can be transformed to mcmc objects by the mcmc function. mcmc objects resemble time series objects and have optional arguments that control the start, the end and the thinning of the chain (e.g., keeping only one of every 10 iterations).
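For illustration, a minimal sketch of such a conversion, using simulated toy draws (not from the book's examples); the mcmc function with its start and thin arguments is part of the coda package:
> library(coda)
> draws <- matrix(rnorm(2000), ncol=2,
+                 dimnames=list(NULL, c("beta0","beta1")))  # toy matrix of draws
> chain <- mcmc(draws, start=1, thin=10)  # rows treated as iterations 1, 11, 21, ...
> summary(chain)  # posterior summaries per column (output not shown)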
With MCMCpack, we can plot a posterior pdf using the densplot command. We illustrate
this next for the life events effect β1 (the 2nd model parameter) in the Bayesian fitting
of the linear model for mental impairment that used improper prior distributions for {β1 }
(Section 6.6.2):
> library(MCMCpack)
> fit.bayes <- MCMCregress(impair ~ life + ses, mcmc=5000000,
+ b0=0, B0=10^(-10), c0=10^(-10), d0=10^(-10), data=Mental)
> fit.bayes[1:3,] # first three rows of the derived mcmc object
(Intercept) life ses sigma2
[1,] 26.66740 0.13160292 -0.09251524 23.22411
[2,] 30.17502 0.10860963 -0.12223697 20.57522
[3,] 27.04690 0.06959574 -0.06056731 17.20253
> densplot(fit.bayes[,2], xlab="Life events effect", ylab="Posterior pdf") # Figure A6.4
# Plot of the posterior pdf for the variance (not shown):
> densplot(fit.bayes[,4], xlab=expression(sigma^2))
Figure A6.4 shows the plot. For this effect, the posterior mean was 0.103, with a standard
error of 0.033.
FIGURE A6.4: Posterior pdf for the life events effect β1 on mental impairment, in the linear
model fitted to the Mental data file.
A crucial task in Bayesian inference through MCMC methods is assessing the convergence of the chain to the posterior distribution, for which convergence diagnostics are necessary. Such diagnostics are provided in the coda package, which operates on mcmc objects; thus, the results of a model-fitting function of MCMCpack can directly be further processed in coda. A detailed discussion of MCMC methods is beyond the scope of this book, and we refer to the specialized literature. A typical coda graphical display is the so-called trace plot, obtained by the traceplot function and used to monitor the terms of a Markov chain and verify whether the chain mixes well. The trace plot of a long chain will be too messy and ultimately uninformative, so we usually plot just a fraction of the chain, commonly its last part. A part of a column of an mcmc object is no longer an mcmc object, so to apply traceplot to a subchain we first need to define the selected part of the chain as an mcmc object. Continuing our example, the trace plot of the last 500 values of the posterior of β1 can be derived as given next (the figure is not shown).
> library(coda)
> U <- nrow(fit.bayes); L <- U-499
> beta1.post <- mcmc(fit.bayes[,2][L:U])
> traceplot(beta1.post, main="Trace of life events effect")
Applying the plot function on an mcmc object in coda delivers the trace plots and
posterior density plots for all chains in the columns of the object.
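For example, continuing with the objects defined above (a sketch; effectiveSize and geweke.diag are standard coda diagnostic functions):
> plot(beta1.post)          # trace plot and posterior density in one display
> effectiveSize(fit.bayes)  # effective sample size for each parameter
> geweke.diag(fit.bayes)    # Geweke convergence diagnostic for each parameter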
and unknown error variance $\sigma^2$, under the typical noninformative prior assumption
\[
g(\beta, \sigma^2) \propto \frac{1}{\sigma^2},
\]
for the parameter vector $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ and $\sigma^2$. The posterior distribution of $\beta$ conditional on $\sigma^2$, $g(\beta \mid y, \sigma^2)$, is multivariate normal with mean the ML estimate $\hat{\beta}$ and covariance matrix $\mathrm{var}(\hat{\beta})$ (see (6.1) and (6.13)). The marginal posterior distribution of $\sigma^2$ is inverse gamma$\bigl(\tfrac{n-(p+1)}{2}, \tfrac{1}{2}S\bigr)$, with $S = (y - X\hat{\beta})^{T}(y - X\hat{\beta})$. For this set-up, we refer to Albert (2009, Section 9.2), where the analysis is based on the functions blinreg, blinregexpected and blinregpred of the LearnBayes package, which allow direct derivation of results and insightful visualizations.
We next implement this approach for the Scottish hill races data, introduced in Section 6.1.4, for the linear regression model fitted with least squares in Section 6.2.2. Notice the way to simulate from (i) the posterior distribution of the expected response and (ii) the posterior predictive distribution of the response variable for a particular set of values of the explanatory variables. These simulated draws can then be used to derive summary statistics and credible or prediction intervals.
> fit.dc2 <- lm(timeW ~ distance + climb , data=Scots2,x=TRUE,y=TRUE)
> library(LearnBayes)
# basic function for Bayesian linear regression:
> theta.sample=blinreg(fit.dc2$y, fit.dc2$x, 5000)
#-----------------------------------------------------------------------
# histograms of posterior densities:
> par(mfrow=c(1,2)) # see Figure A6.5
> hist(theta.sample$beta[,2],main="distance", xlab=expression(beta[1]))
> hist(theta.sample$beta[,3],main="climb", xlab=expression(beta[2]))
#-----------------------------------------------------------------------
# posterior means/sd:
> b.mean <- apply(theta.sample$beta,2,mean); b.mean
X(Intercept) Xdistance Xclimb
-8.950305 4.173835 43.847641
> b.sd <- apply(theta.sample$beta,2,sd); b.sd
X(Intercept) Xdistance Xclimb
3.3197336 0.2435103 3.7584219
> sigma.mean <- mean(theta.sample$sigma); sigma.mean
[1] 12.37126
> sigma.sd <- sd(theta.sample$sigma); sigma.sd
[1] 1.116089
# posterior quantiles:
> apply(theta.sample$beta,2,quantile,c(.025,.25,.5,.75,.975))
X(Intercept) Xdistance Xclimb
2.5% -15.424176 3.693711 36.40000
25% -11.185153 4.012498 41.35451
50% -8.973437 4.173592 43.83825
75% -6.721816 4.335581 46.35683
97.5% -2.368920 4.655774 51.25069
> quantile(theta.sample$sigma,c(.025,.25,.5,.75,.975))
2.5% 25% 50% 75% 97.5%
10.42304 11.59083 12.29223 13.06214 14.79655
#-----------------------------------------------------------------------
# (i) Sampling from the posterior expected response at a particular X0:
#-----------------------------------------------------------------------
> cov1=c(1,min(Scots2$distance),min(Scots2$climb))
> cov2=c(1,quantile(Scots2$distance, 0.25),quantile(Scots2$climb,0.25))
> cov3=c(1,mean(Scots2$distance),mean(Scots2$climb))
> cov4=c(1,max(Scots2$distance),max(Scots2$climb))
> X0=rbind(cov1,cov2,cov3,cov4)
> p <- ncol(X0)
> mean.draws=blinregexpected(X0,theta.sample)
> par(mfrow=c(2,2)) # see Figure A6.6
> hist(mean.draws[,1],main="Covariate set at min",xlab="timeW")
> hist(mean.draws[,2],main="Covariate set at 0.25 quantile",xlab="timeW")
> hist(mean.draws[,3],main="Covariate set at mean",xlab="timeW")
> hist(mean.draws[,4],main="Covariate set at max",xlab="timeW")
#-----------------------------------------------------------------------
# (1-a) EQT Credible Intervals:
#-----------------------------------------------------------------------
> alpha <- 0.05
> t(apply(theta.sample$beta,2,quantile,c(alpha/2,1-alpha/2)))
2.5% 97.5%
X(Intercept) -15.424176 -2.368920
Xdistance 3.693711 4.655774
Xclimb 36.399998 51.250687
> quantile(theta.sample$sigma,c(alpha/2,1-alpha/2))
2.5% 97.5%
10.42304 14.79655
# EQT intervals for the expected response at new data points in X0:
> X0 <- data.frame(X0)
> names(X0) <- c("Intercept", "distance", "climb")
> eqt.EY <- apply(mean.draws,2,quantile,c(alpha/2,1-alpha/2))
> eqt.EY <- cbind(X0[,2:p], t(eqt.EY)); eqt.EY
distance climb 2.5% 97.5%
cov1 3.20000 0.1850000 7.087614 18.03510
cov2 10.10000 0.4300000 48.142879 55.97406
cov3 15.61343 0.8825672 91.929115 97.90148
cov4 43.00000 2.4000000 265.272248 286.23025
#----------------------------------------------------------------------------------
# (ii) Sampling from the posterior predictive distribution of {Y_f}
# (see future observation Y_f; compare with the prediction interval in Section 6.4.5):
#----------------------------------------------------------------------------------
# function blinregpred is here evaluated at fit.dc2$x (can be replaced
# by new data points)
> pred.draws=blinregpred(fit.dc2$x,theta.sample)
> pred.sum=apply(pred.draws,2,quantile,c(.025,.975))
> par(mfrow=c(1,1)) # see Figure A6.7
> ind=1:nrow(Scots2)
> matplot(rbind(ind,ind),pred.sum,type="l",lty=1,col=1, xlab="INDEX",
ylab="timeW")
> points(ind,Scots2$timeW,pch=19) # observed responses {yi} with solid points
FIGURE A6.5: Histograms of simulated draws from the marginal posterior distributions of
β1 (distance) and β2 (climb).
FIGURE A6.6: Histograms of simulated draws of the posterior of mean timeW for four
particular sets of values for the explanatory variables (distance, climb).
FIGURE A6.7: Posterior 95% prediction intervals of $\{y_i^{*}\}$ with the actual observed values of timeW indicated by points.
7
CHAPTER 7: R FOR GENERALIZED LINEAR
MODELS
where the formula and data arguments, along with some further optional arguments, are analogous to those of the lm function (see Section A6.1). An implementation of the weights argument in a glm set-up is provided for modeling grouped binary data in Section 7.2.3. Of special interest for modeling rates is the offset argument, which can be used to specify a component that is included in the linear predictor with a fixed (known) coefficient (see Section 7.4.3).
The additional family argument determines the error distribution and link function
of the model. The exponential dispersion family functions available in R, along with their
canonical link, are:
● binomial(link = "logit")
● gaussian(link = "identity")
● Gamma(link = "inverse")
● inverse.gaussian(link = "1/mu^2")
● poisson(link = "log")
However, other links beyond the canonical are possible. Thus, for the Covid-19 example in
Section 7.1.8, we used gaussian and gamma GLMs with a log link.
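As an illustration of specifying a non-canonical link, a sketch with a generic data frame dat containing a positive response y and an explanatory variable x (hypothetical names, not one of the book's data files):
> fit.gamma <- glm(y ~ x, family=Gamma(link="log"), data=dat)    # gamma GLM, log link
> fit.norm  <- glm(y ~ x, family=gaussian(link="log"), data=dat) # normal GLM, log link
> summary(fit.gamma)  # output not shown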
Similarly to the lm function, the results of the model fit can be saved as a glm ob-
ject and used for further analysis in the same manner as discussed for the lm function in
Section A6.1. For a saved object from fitting a model, say called fit, typing names(fit)
yields an overview of what is saved for the object, including characteristics such as the de-
viance, AIC, coefficients, fitted values, converged, and residuals. For instance, the command
fit$converged asks whether the Fisher scoring fitting algorithm converged. Useful follow-
up functions include confint for profile likelihood confidence intervals (the MASS package
is also required). Furthermore note that in the GLM case not all residual plots make sense.
For example, the Q − Q plot does not make sense for non–normally distributed responses.
Included in the output is the number of iterations needed for the Fisher scoring algorithm to converge, with a default maximum of 25. Normally this number is small, but it may be large (e.g., 17 for the endometrial cancer example in Section 7.3.2), or the algorithm may not converge at all when some ML estimates are infinite or do not exist. You can increase the maximum number of iterations, such as with the argument maxit = 50 in the glm function, but convergence may still fail. In that case, you should not trust the estimates shown in the output.
The goodness of fit (GOF) of a logistic regression model can be assessed by the Hosmer-Lemeshow GOF test 1 , which is an approximate χ2-test. In R it is implemented in the ResourceSelection package, as shown below for our example. The test requires the specification of the number of bins used to calculate the quantiles, which is set to 10 by default.
> library(ResourceSelection)
> hoslem.test(Beetles$y, fit$fitted.values, g = 10)
1 Hosmer, Lemeshow (2000). Applied Logistic Regression. New York, USA: John Wiley and Sons.
FIGURE A7.1: Estimated expected death probability for the flour beetles study as a function of x (dosage of exposure to gaseous carbon disulfide, in mg per liter).
We illustrate for the data on house selling prices introduced in Exercise 6.15 and analyzed
with a linear model and a gamma GLM in Section 7.1.3. There we observed that the
variability in selling prices increases noticeably as the mean selling price increases, so we’ll
use gamma GLMs. To predict selling prices (in thousands of dollars), we begin with all
five explanatory variables (size of house in square feet, annual property tax bill in dollars,
number of bedrooms, number of bathrooms, and whether the house is new) and their ten
two-way interaction terms as potential explanatory variables. The initial model has AIC =
1087.85. After several backward steps this reduces to 1077.31:
> Houses <- read.table("http://stat4ds.rwth-aachen.de/data/Houses.dat", header=TRUE)
> step(glm(price ~ (size+taxes+new+bedrooms+baths)^2, family=Gamma(link=identity),
+ data=Houses))
Start: AIC=1087.85
... # several steps not shown here
Step: AIC=1077.31 # final model chosen with backward elimination
price ~ size+taxes+new+bedrooms+baths+size:new+size:baths+taxes:new+bedrooms:baths
> fit <- glm(formula = price ~ size+taxes+new+bedrooms+baths + size:new+size:baths
+ + taxes:new+bedrooms:baths, family=Gamma(link=identity), data=Houses)
> summary(fit)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.486e+01 5.202e+01 1.631 0.106311
size 1.422e-01 4.153e-02 3.424 0.000932
taxes 6.030e-02 7.369e-03 8.183 1.71e-12
new -1.397e+02 7.491e+01 -1.865 0.065465
bedrooms -7.381e+01 2.737e+01 -2.697 0.008358
baths -1.673e+01 3.178e+01 -0.526 0.600004
size:new 2.520e-01 8.321e-02 3.028 0.003209
size:baths -3.601e-02 1.863e-02 -1.933 0.056394
taxes:new -1.130e-01 5.332e-02 -2.118 0.036902
bedrooms:baths 2.800e+01 1.478e+01 1.894 0.061392
---
(Dispersion parameter for Gamma family taken to be 0.0618078)
Null deviance: 31.9401 on 99 degrees of freedom
Residual deviance: 5.6313 on 90 degrees of freedom
AIC: 1077.3
The model chosen by backward elimination is still quite complex, having four interac-
tion terms. Only a slightly higher AIC value (1080.7) results from taking out all interactions
except the one between size and whether the house is new, which is by far the most signif-
icant among the interaction terms. Removing also the baths term gives the model selected
based on the BIC criterion, implemented in the step function by the additional argument
k=log(n):
> n <- nrow(Houses) # sample size
> step(glm(price ~ (size+taxes+new+bedrooms+baths)^2, family=Gamma(link=identity),
+ data=Houses), k=log(n)) # output not shown
The interpretation is much simpler, as taxes and bedrooms now enter only as main effects. Here is the R code:
> fit2 <- glm(price ~ size + taxes + new + bedrooms + size:new,
+ family = Gamma(link=identity), data=Houses)
> summary(fit2)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.923e+01 2.024e+01 1.938 0.0556
size 8.588e-02 1.837e-02 4.675 9.80e-06 # effect 0.086 for old homes
taxes 5.776e-02 7.563e-03 7.638 1.82e-11
new -1.273e+02 7.557e+01 -1.685 0.0954
bedrooms -2.135e+01 9.432e+00 -2.263 0.0259
size:new 8.956e-02 4.310e-02 2.078 0.0404 # size effect 0.086 + 0.090 for new homes
The estimated effect on selling price of a square-foot increase in size, adjusting for the other
predictors, is 0.086 (i.e., $86) for older homes and (0.086 + 0.090) = 0.176 (i.e., $176) for
newer homes.
Apart from the AIC not being much higher, how do we know that the fit of the simpler
model is essentially as good in practical terms? A measure of how well the model predicts is
given by the correlation between the observed response variable and the fitted values for the
model. For a linear model, this is the multiple correlation (Section 6.3.3). For these data,
the multiple correlation is 0.918 for the more complex model and 0.904 for the simpler one,
very nearly as high, and this is without adjusting for the more complex model having many
more terms.
> cor(Houses$price, fitted(fit)); cor(Houses$price, fitted(fit2))
[1] 0.9179869 # multiple correlation analog for model chosen by backward elimination
[1] 0.9037955 # multiple correlation analog for model with only size:new interaction
You can check that further simplification is possible without much change in AIC or the multiple correlation. For instance, the yet simpler model that also removes the bedrooms predictor has a multiple correlation of 0.903.
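For instance, assuming this yet simpler model keeps the remaining terms of fit2, the check could look as follows (a sketch; fit3 is a hypothetical object name):
> fit3 <- glm(price ~ size + taxes + new + size:new,
+             family=Gamma(link=identity), data=Houses)
> AIC(fit3)                        # compare with the AIC values of fit and fit2
> cor(Houses$price, fitted(fit3))  # multiple correlation analog, about 0.903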
\[
\log \mu_{ijk} = \lambda + \lambda_i^{L} + \lambda_j^{S} + \lambda_k^{G} + \lambda_{ij}^{LS} + \lambda_{ik}^{LG} + \lambda_{jk}^{SG} + \lambda_{ijk}^{LSG}, \qquad i, k = 1, 2, \; j = 1, \ldots, 5,
\]
and provides the perfect fit ($df = 0$, $\hat{\mu}_{ijk} = n_{ijk}$ for all cells). Applying a backward stepwise model selection procedure, we search among hierarchical models for the one that best fits our data. The next model to consider would be the model with no three-factor interaction, followed by models having only two of the three two-factor interactions, down to the model of complete independence of the three classification variables, i.e. the model consisting only of the main effects ($\lambda^L$, $\lambda^S$, $\lambda^G$).
Note that in order to apply a log–linear model through the glm function, we need to transform the contingency table, shown below, into a data frame:
Male
SmallGap
GunLaw strongly agr. agree neutral disagree strongly dis.
favor 29 63 60 66 10
oppose 4 31 34 43 9
Female
SmallGap
GunLaw strongly agr. agree neutral disagree strongly dis.
favor 36 94 97 73 11
oppose 6 16 33 38 4
In this data frame, the cell frequencies are expanded in vector form into one column (variable), while the levels of the corresponding classification variables are given in further columns. It is important, before applying glm, to define the classification variables as factors. The transformation of the contingency table into the appropriate data frame is illustrated for our example:
> freq <- as.vector(tab1)
> row <- rep(1:2, 10)
> col <- rep(rep(1:5, each=2),2)
> lay <- rep(1:2, each=10)
> L <- factor(row, levels=c(1,2), labels = c("favor", "oppose"))     # GunLaw
> S <- factor(col, levels=c(1,2,3,4,5), labels = c("strongly agr.",
    "agree", "neutral","disagree", "strongly dis."))                 # SmallGap
> G <- factor(lay, levels=c(1,2), labels = c("Male", "Female"))      # Gender
> table.df <- data.frame(freq, L, S, G)
For our example the model selection and model fitting procedure is given below. When fitting any GLM with the glm function, you should check in the summary output that the number of Fisher scoring iterations is below the default maximum of 25. Otherwise, the algorithm has not converged 2 and you cannot trust the output. You may then fit the model again with an increased maximum number of iterations (setting, for example, the argument maxit=50 in the glm function call).
> saturated <- glm(freq~L*S*G, poisson, data=table.df)
> step(saturated, direction="backward", k=2) # output not given
# k=2 for using AIC (default), k=log(n) for BIC
# further direction options: "both", "forward")
> fit <- glm(freq ~ L*S + L*G, poisson)
> fit
Call:
glm(formula = freq ~ L * S + L * G, family = poisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.12111 -0.57115 0.01851 0.59029 0.93383
2 see also Marschner (2011). glm2: Fitting generalized linear models with convergence problems, The R
Journal, 3, 12–15.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.31402 0.13385 24.760 < 2e-16 ***
Loppose -1.60014 0.34870 -4.589 4.46e-06 ***
Sagree 0.88186 0.14749 5.979 2.25e-09 ***
Sneutral 0.88186 0.14749 5.979 2.25e-09 ***
Sdisagree 0.76009 0.15026 5.058 4.23e-07 ***
Sstrongly dis. -1.12986 0.25101 -4.501 6.75e-06 ***
GFemale 0.31045 0.08719 3.561 0.000370 ***
Loppose:Sagree 0.66570 0.37819 1.760 0.078371 .
Loppose:Sneutral 1.02025 0.36970 2.760 0.005786 **
Loppose:Sdisagree 1.33178 0.36732 3.626 0.000288 ***
Loppose:Sstrongly dis. 1.39223 0.48982 2.842 0.004479 **
Loppose:GFemale -0.53153 0.16179 -3.285 0.001019 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Response: freq
LR Chisq Df Pr(>Chisq)
L 140.522 1 < 2.2e-16 ***
S 254.655 4 < 2.2e-16 ***
G 4.603 1 0.0319145 *
L:S 21.711 4 0.0002288 ***
L:G 10.877 1 0.0009739 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> library(vcd)
> mosaic(tab1, gp=shading_Friendly, residuals=stdres, # Figure A7.2
residuals_type="Std\nresiduals")
FIGURE A7.2: Mosaic plot for the contingency table of Table A7.1, with tiles shaded by
the standardized residuals for model (LS,LG).
In special cases, values of explanatory variables and some effects may be the same for each t.
For continuous responses, it is common to assume a multivariate normal distribution for Yi
to obtain a likelihood function that also has correlation parameters to account for within-
subject responses being correlated. The model is fitted simultaneously for all t. For discrete
3 1 1 3 1
4 1 1 1 2
5 1 1 2 2
6 1 1 3 2
> abortion[5545:5550,]
gender response question case
5545 0 0 1 1849
5546 0 0 2 1849
5547 0 0 3 1849
5548 0 0 1 1850
5549 0 0 2 1850
5550 0 0 3 1850
> question1 <- ifelse(abortion$question==1,1,0)
> question2 <- ifelse(abortion$question==2,1,0)
> question3 <- ifelse(abortion$question==3,1,0) # reference category
The marginal model is fitted on these data under independent and exchangeable corre-
lation structures:
> library(gee)
> fit.gee <- gee(response ~ gender + question1 + question2, id=case, family=binomial,
+ corstr="independence", data=abortion)
> summary(fit.gee)
Model:
Link: Logit
Variance to Mean Relation: Binomial
Correlation Structure: Independent
Summary of Residuals:
Min 1Q Median 3Q Max
-0.5068800 -0.4825552 -0.4686891 0.5174448 0.5313109
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -0.125407576 0.05562131 -2.25466795 0.06758236 -1.85562596
gender 0.003582051 0.05415761 0.06614123 0.08784012 0.04077921
question1 0.149347113 0.06584875 2.26803253 0.02973865 5.02198759
question2 0.052017989 0.06586692 0.78974374 0.02704704 1.92324166
Working Correlation
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
>
> fit.gee2 <- gee(response ~ gender + question1 + question2, id=case, family=binomial,
+ corstr="exchangeable", data=abortion)
> summary(fit.gee2)
Model:
Link: Logit
Variance to Mean Relation: Binomial
Correlation Structure: Exchangeable
Summary of Residuals:
Min 1Q Median 3Q Max
-0.5068644 -0.4825396 -0.4687095 0.5174604 0.5312905
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -0.125325730 0.06782579 -1.84775925 0.06758212 -1.85442135
gender 0.003437873 0.08790630 0.03910838 0.08784072 0.03913758
Working Correlation
[,1] [,2] [,3]
[1,] 1.0000000 0.8173308 0.8173308
[2,] 0.8173308 1.0000000 0.8173308
[3,] 0.8173308 0.8173308 1.0000000
Notice that the item effects are much stronger with the GLMM (β̂1 = 0.835, β̂2 = 0.292) than with the GEE model assuming an "exchangeable" working correlation structure (β̂1 = 0.149, β̂2 = 0.052). This is expected, due to the strong underlying correlation (large random-effects variability), stronger than assumed under the exchangeable correlation structure. Thus, based on the GLMM, regardless of gender, the estimated odds of supporting legalized abortion for situation (1) equal exp(0.835) = 2.3 times the estimated odds for situation (3). The estimated standard deviation of the random effects distribution is 8.736, indicating high heterogeneity among subjects.
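The GLMM estimates quoted above can be obtained, for instance, with the glmer function of the lme4 package; a sketch of one possible call (the book's own fitting code may differ):
> library(lme4)
> fit.glmm <- glmer(response ~ gender + question1 + question2 + (1 | case),
+                   family=binomial, data=abortion, nAGQ=10)  # adaptive Gauss-Hermite quadrature
> summary(fit.glmm)  # fixed effects and random-intercept standard deviation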
autocorrelation, and the values $(\rho_1, \rho_2, \rho_3, \ldots)$ are called the autocorrelation function. The sequence of observations
\[
Y_t - \mu = \phi\,(y_{t-1} - \mu) + \epsilon_t, \qquad t = 1, 2, 3, \ldots,
\]
where $|\phi| < 1$ and $\epsilon_t \sim N(0, \sigma^2)$ is independent of $\{Y_1, \ldots, Y_{t-1}\}$, is called an autoregressive process. The expected deviation from the mean at time $t$ is proportional to the previous deviation. This model satisfies $\rho_s = \phi^{s}$, with the autocorrelation decreasing exponentially as the distance $s$ between times increases. The process is a Markov chain (Section 2), because the distribution of $(Y_t \mid y_1, \ldots, y_{t-1})$ is the same as the distribution of $(Y_t \mid y_{t-1})$. In particular, $\mathrm{corr}(Y_t, Y_{t+1}) = \phi$ but $\mathrm{corr}(Y_t, Y_{t+2} \mid Y_{t+1}) = 0$.
To illustrate the use of R to generate an autoregressive process and to fit an autoregressive
model and predict future observations, we start at y1 = 100 and generate 200 observations
with µ = 100, φ = 0.90, and σ = 10:
> y <- rep(0, 200)
> y[1] <- 100
> for (t in 2:200){
+ y[t] <- 100 + 0.90*(y[t-1] - 100) + rnorm(1,0,10)
+ }
> plot(y, xlim=c(0,210)) # plots the time index t against y[t]
> acf(y, lag.max=10, plot=FALSE) # autocorrelation function
Autocorrelations of series ‘y’, by lag
0 1 2 3 4 5 6 7 8 9 10
1.000 0.917 0.856 0.779 0.706 0.644 0.574 0.517 0.437 0.355 0.290
> fit <- ar(y, order.max = 1, method = c("mle")) # fit autoregressive model by ML
> fit$ar
[1] 0.9180788 # ML estimate of phi parameter in model
> pred10 <- predict(fit, n.ahead=10) # predict next 10 observations on y
> pred10$pred # predicted y for next 10 observations
Start = 201 End = 210
[1] 66.146 68.350 70.374 72.231 73.937 75.503 76.940 78.260 79.472 80.584
> pred10$se # standard errors of predicted values
[1] 10.275 13.949 16.419 18.243 19.649 20.761 21.653 22.378 22.971 23.459
> points(201:210, pred10$pred, col="red") # add on plot next 10 predicted y values
Figure A7.3 shows the time series. Observations that are close together in time tend to be quite close in value. The plot looks very different from what we would obtain if the observations were generated independently, which is the special case of the model with φ = 0. After generating the data, we use the acf function in R to find the sample autocorrelation function, relating times t and t + s for lags s between 1 and 10. The first-order sample autocorrelation (s = 1) is 0.917, and the autocorrelations weaken as the lag s increases. We then fit the autoregressive model with the ar function in R. The ML estimate of φ = 0.90 is φ̂ = 0.918. One can use the fitted model to generate predictions of future observations, as well as standard errors that reflect the precision of those predictions. Predicting the next 10 observations, the R output shows that the predictions tend toward µ = 100, but with standard errors that increase as we look farther into the future. Figure A7.3 also shows, in red, the predictions for the next 10 observations.
A more general autoregressive model, having order p instead of 1, is
\[
Y_t - \mu = \sum_{j=1}^{p} \phi_j\,(y_{t-j} - \mu) + \epsilon_t, \qquad t = 1, 2, \ldots.
\]
It uses the p observations before time t in the linear predictor. This process is a higher-order
Markov chain, in which Yt and Yt−(p+1) are conditionally independent, given the observations
FIGURE A7.3: Observations from autoregressive time series process, with predicted values
of next 10 observations shown in red.
in between those two times. An alternative time series model, called a moving average
model of order p, has form
\[
Y_t - \mu = \sum_{j=1}^{p} \lambda_j\,\epsilon_{t-j} + \epsilon_t, \qquad t = 1, 2, \ldots.
\]
It has $\rho_s = 0$ for $s > p$, so it is appropriate if the sample autocorrelations drop off sharply after lag p. A more general ARMA model, which is more difficult to interpret but potentially useful for prediction, can include both autoregressive and moving-average terms. The models also generalize to include explanatory variables other than past observations of the response variable, and to allow observations at irregular time intervals. See Brockwell and Davis (2016) and Cryer and Chan (2008) for introductions to time series modeling.
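For instance, an ARMA(1,1) model, combining one autoregressive and one moving-average term, can be fitted to the simulated series y with the arima function of base R (a sketch; order = c(p, d, q), with d = 0 for no differencing):
> fit.arma <- arima(y, order=c(1,0,1))  # ARMA(1,1) with estimated mean
> fit.arma$coef                         # AR, MA and mean parameter estimates
> predict(fit.arma, n.ahead=10)$pred    # predictions for the next 10 observations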
8
CHAPTER 8: R FOR CLASSIFICATION AND
CLUSTERING
Figure A8.1 shows the plots. The first one shows the observed y values, at the various
width and color values. The second plot shows the predicted responses. At each color, the
crabs with greater widths were predicted to have y = 1, with the boundary for the predictions
moving to higher width values as color darkness increases. The third plot identifies the
crabs that were misclassified. The last plot shades the points according to their estimated
probabilities.
FIGURE A8.1: Scatterplots of width and color values for linear discriminant analysis to predict y = whether a female horseshoe crab has male satellites, colored by (i) the observed value of y (upper left), (ii) the predicted value of y (upper right), (iii) the misclassification indicator (lower left) and (iv) the estimated probability of having satellites (lower right).
A8.2 Cross-Validation in R
For classification methods, leave-one-out cross-validation (loocv) provides more realistic values for the probability of a correct classification (Section 8.1.3). For penalized likelihood methods, Section 7.7.2 mentioned that the choice of the tuning parameter λ can be based on k-fold cross-validation (k-fold cv). loocv is the extreme case of k-fold cv with k = n. Increasing k, especially for large n, may be computationally demanding. In practice, the values k = 5 or k = 10 are commonly used; the choice k = 10 is predominant, since it usually performs similarly to loocv. 1
Model training refers to randomly partitioning the data frame into a training sample (typically 70%-80% of the observations) and a test sample, fitting the model on the training sample, and checking the fit's accuracy when it is applied to the test sample. Cross-validation is typically used in fitting the model, commonly 10-fold cv. For a k-fold cv, the choice of k is controlled by the corresponding argument: set trControl = trainControl(method = "LOOCV") for loocv or trControl = trainControl(method = "cv", number=5) for a 5-fold cv. The caret package for classification and regression training applies to models fitted by basic R packages such as stats, MASS, and rpart; one specifies the desired modeling function (glm, lda, ...) in the method argument of the train function. The createDataPartition function randomly partitions the data frame into training and testing subsamples. We illustrate this in the linear discriminant analysis context, for the horseshoe crabs example in Section 8.1.2, using 70% of the observations in the training sample and 10-fold cv:
> library(caret); library(ggplot2); library(MASS)
> Crabs$y <- factor(Crabs$y) # need factor response variable for ggplot2
> index = createDataPartition(y=Crabs$y, p=0.7, list=FALSE)
> train = Crabs[index,] # training sample, 70% of observations (p=0.7 above)
> test = Crabs[-index,] # testing sample
> dim(train)
[1] 122 7 # 122 observations of 7 variables in training sample
> dim(test) # (8 variables if include pred.right in Crabs, as in above code)
[1] 51 7 # 51 observations of 7 variables in testing sample
> lda.fit = train(y ~ width + color, data=train, method="lda",
trControl = trainControl(method = "cv", number=10))
> names(lda.fit) # lists what is saved under "lda.fit"; not shown
> summary(lda.fit) # output not given
> predY.new = predict(lda.fit,test) # prediction for testing subsample
> table(predY.new, test$y)
predY.new 0 1 # output varies depending on training sample used
0 5 2
1 13 31
> round(mean(predY.new == test$y), 3) # prediction accuracy (proportion correctly classified)
[1] 0.706
> qplot(width, color, data=test, col=y) # first plot in Figure A8.2
> test$pred.right = predY.new == test$y
> qplot(width,color, data=test, col=pred.right) # second plot in Figure A8.2
Figure A8.2 provides plots that are analogous to the first and third in Figure A8.1, but
now only for the 51 cases in the testing subsample. The first plot shows the actual y values
and the second shows the misclassified observations.
Model training is an essential step with neural networks. The keras package (see Chollet
and Allaire 2018) is useful for this method.
1 e.g., see Molinaro, Simon, Pfeiffer (2005). Prediction error estimation: a comparison of resampling methods, Bioinformatics.
FIGURE A8.2: Scatterplots of width and color values for horseshoe crabs in the testing subsample, showing (i) the observed values of y and (ii) the misclassified observations.
FIGURE A8.3: Decision tree of depth 4 for the Presidential elections 2016 vote of the
participants in the GSS2018 survey (left) and cross–validated accuracy rate for different
values for the complexity parameter (right).
> GSS2018.tree
#------------------ output -------------------------------------------
CART
160 samples
12 predictor
4 classes: ’Clinton’, ’Trump’, ’Other’, ’Not vote’
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 144, 145, 145, 143, 144, 144, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.02857143 0.6811765 0.37562782
0.08571429 0.6451471 0.30188450
0.11428571 0.5876471 0.09631507
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.02857143.
#----------------------------------------------------------------------
> GSS2018.pred = predict(GSS2018.tree, newdata = test)
> table(GSS2018.pred, test$PRES16)
GSS2018.pred Clinton Trump Other Not vote
Clinton 33 8 1 0
Trump 5 15 3 0
Other 0 0 0 0
Not vote 0 0 0 0
> error.rate = round(mean(GSS2018.pred != test$PRES16),3)
> error.rate
[1] 0.262
> fancyRpartPlot(GSS2018.tree$finalModel) # Figure A8.3 (left)
> ggplot(GSS2018.tree) # Figure A8.3 (right)
> summary(GSS2018.tree$finalModel) # long output with detailed information (not shown)
The rpart and tree packages can also form trees when the response is quantita-
tive, in which case the display is called a regression tree. The predicted value at a
terminal node is the mean response for the subjects in the region of predictor values
for that node. The site https://cran.r-project.org/web/packages/rpart/vignettes/
longintro.pdf has useful examples of rpart for binary, multiple category, and quantitative
responses.
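As an illustration, a regression tree for a quantitative response could be grown as follows; a sketch using the house selling price data of Chapter 7, with an illustrative complexity parameter value:
> library(rpart); library(rattle)
> Houses <- read.table("http://stat4ds.rwth-aachen.de/data/Houses.dat", header=TRUE)
> house.tree <- rpart(price ~ size + taxes + new + bedrooms + baths,
+                     data=Houses, method="anova", cp=0.02)  # "anova" = regression tree
> fancyRpartPlot(house.tree)  # predicted price = mean response at each terminal node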
Figure A8.4 shows the produced dendrogram (also shown in Figure A.20 of the book)
and the heatmap of the dissimilarity matrix.
(The dendrogram was produced by hclust with the "ward.D2" linkage method; the heatmap is accompanied by a color key and histogram of the dissimilarities.)
FIGURE A8.4: Dendrogram (left) and heatmap of the dissimilarity matrix for the hierar-
chical clustering of the UN data.
Hierarchical clustering is not applicable to large data sets. In such cases k-means clustering is adopted, which finds an optimal clustering (according to a distance criterion) for a prespecified number of clusters k. The kmeans function (in the base stats package) provides insightful output, so even for small data sets it makes sense, after applying hierarchical clustering and deciding on the number of clusters, to continue with k-means clustering. Cluster analysis is a descriptive method, so there is no single 'optimal' number of clusters; the decision should also take into account the content of the data and the question we focus on. In Figure A8.5 we provide the clustering for two up to five clusters, to gain better insight into our data and into how the countries are clustered.
The two-cluster solution seems to differentiate between economically-advanced Western
nations and the other nations. The code for the two-means clustering and the construction
of the clusterplots in Figure A8.5 follows:
# k-means Clustering:
> fit2 <- kmeans(UN_scaled, 2) # k = 2 clusters
> attributes(fit2) # output not shown here
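The cluster plots of Figure A8.5 can be obtained, for example, with the clusplot function of the cluster package; a sketch for the two-cluster solution (the calls for k = 3, 4, 5 are analogous):
> library(cluster)
> clusplot(UN_scaled, fit2$cluster, color=TRUE, labels=2, lines=0) # Figure A8.5 (k=2)
  # countries plotted in the space of the first two principal components,
  # with one ellipse drawn per cluster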
(The cluster plots display the countries in the space of the first two principal components, which explain 77.84% of the point variability.)
FIGURE A8.5: Clustering plots of the UN data, corresponding to k-means clustering (from
2 up to 5 clusters).
[1] 17 25
In general, in k-means clustering the optimal number of clusters can be estimated by methods based on the total within-cluster sum of squares (wss), the average silhouette width 2 , or the gap statistic 3 . Figure A8.6 provides the optimal-number-of-clusters plots based on these three methods for the UN data file, produced as follows:
> library(factoextra)
> library(gridExtra)
> layout(matrix(c(1,2,3),1,3))
> pl1 <- fviz_nbclust(UN_scaled, FUNcluster = kmeans,
2 Rousseeuw (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics.
FIGURE A8.6: Optimal number of clusters using the method based on (i) the total within-cluster sum of squares (left), (ii) the average silhouette width (middle), and (iii) the gap statistic (right).
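A complete set of calls producing three such panels could look as follows (a sketch; the book's exact code may differ), using the method argument of fviz_nbclust and grid.arrange of gridExtra:
> library(factoextra); library(gridExtra)
> pl1 <- fviz_nbclust(UN_scaled, FUNcluster=kmeans, method="wss")
> pl2 <- fviz_nbclust(UN_scaled, FUNcluster=kmeans, method="silhouette")
> pl3 <- fviz_nbclust(UN_scaled, FUNcluster=kmeans, method="gap_stat")
> grid.arrange(pl1, pl2, pl3, nrow=1)  # wss, silhouette and gap statistic plots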
Beyond the types of clustering algorithms discussed above, which depend only on the data without assuming an underlying stochastic model, clustering methods are also available in a model-based context. For example, Gaussian mixture models assume that k multivariate normal distributions generate the observations. Model-based clustering is available in the mclust package.
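For example, a Gaussian mixture model clustering of the scaled UN data could be obtained as follows (a sketch; Mclust selects the number of mixture components by BIC):
> library(mclust)
> fit.mc <- Mclust(UN_scaled)          # Gaussian mixtures; number of components chosen by BIC
> summary(fit.mc)                      # selected model and classification table
> plot(fit.mc, what="classification")  # clusters shown in scatterplot matrix form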
Bibliography