R-programming - Unit 5
5.1 GENERAL CONCEPTS
A linear regression model aims to derive a function that predicts the average value of one
variable based on a specific value of another variable. The response variable, also known as the
"outcome" variable, represents the mean that you are trying to estimate. On the other hand, the
explanatory variable, referred to as the "predictor" variable, represents the value you already
possess.
For example, in the student survey, suppose we ask, “What is the expected height of a student whose handspan is 12.5 cm?” Here, height is the response variable and handspan is the explanatory variable.
5.1.1. Definition of the Model
If X is a given value of the explanatory variable and Y is the value of the response variable, then the simple linear regression model is
Y | X = β0 + β1 X + ε
Y | X is read as “the value of Y conditional upon the value of X.”
ε is a random error term. It is assumed to have a normal distribution with a mean of zero and a standard deviation of σ, and its variance, σ2, remains constant.
β0 is the intercept. In the case where the predictor is zero, the intercept, β0, denotes the expected value of the response variable.
β1 is the slope. The slope, β1, is the central point of interest as it denotes the change in the
average response for every one-unit increase in the predictor.
A positive slope indicates that the regression line rises from left to right, so the mean response is higher when the predictor is higher. A negative slope indicates that the line descends from left to right, so the mean response is lower when the predictor is higher. A slope of zero suggests that the predictor has no impact on the response. The more extreme the value of β1, the steeper the incline or decline of the line.
Estimating the regression parameters βˆ0 and βˆ1 from the n pairs of observations is called fitting the linear model. The estimated mean response, denoted yˆ, for a specific value of the predictor, x, is then written as follows:
yˆ = βˆ0 + βˆ1 x
The predictor and response variables in a simple linear regression function can be represented by
xi and yi, respectively, where i ranges from 1 to n. The estimates for the parameters can then be
calculated based on these n observed data pairs.
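For reference, the least-squares estimates computed from those n observed pairs have the standard closed form (a textbook result, stated here in the same notation):

```latex
\hat{\beta}_1 \;=\; \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}},
\qquad
\hat{\beta}_0 \;=\; \bar{y} \;-\; \hat{\beta}_1\,\bar{x}
```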
The model is fitted with lm and stored (here as survfit); printing the object shows the estimated coefficients.
R> survfit <- lm(Height ~ Wr.Hnd, data = survey)
R> survfit

Call:
lm(formula = Height ~ Wr.Hnd, data = survey)

Coefficients:
(Intercept)       Wr.Hnd
    113.954        3.117
The abline function is used to incorporate completely horizontal and vertical lines into an
existing plot. However, if an object of class "lm" is provided, which represents a simple linear
model such as survfit, the function will instead add the fitted regression line.
The observed data is fitted with a solid and bold simple linear regression line. To illustrate
positive and negative residuals, two dashed vertical line segments are provided, with the leftmost
segment representing a positive residual and the rightmost segment representing a negative
residual.
5.4 Illustrating Residuals
In statistics, the difference between the actual value and the predicted value of a dependent
variable is referred to as a residual. The line that is fitted to the data is commonly known as the
least-squares regression line, as it is the line that reduces the average squared deviation between
the observed data and the line itself.
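In symbols, the residual for observation i is the vertical gap between the observed and fitted values, and the least-squares line is the one minimizing the sum of their squares:

```latex
e_i \;=\; y_i - \hat{y}_i,
\qquad
\min_{\beta_0,\,\beta_1}\;\sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^{2}
```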
First, let’s extract two specific records from the Wr.Hnd and Height data vectors and call the resulting vectors obsA and obsB (the index of the second record below is illustrative).
R> obsA <- c(survey$Wr.Hnd[197],survey$Height[197])
R> obsB <- c(survey$Wr.Hnd[154],survey$Height[154])
R> obsA
Take a look at the names of the individuals listed in the survfit object.
R> names(survfit)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "na.action" "xlevels" "call" "terms"
[13] "model"
A fitted model object of class "lm" is automatically generated with various components. One of these is "coefficients", a numeric vector containing the estimates of the intercept and slope. The coef function extracts this component from an "lm" object.
R> mycoefs <- coef(survfit)
R> mycoefs
Store the two estimates separately (these are the quantities used below); finally, draw the vertical dashed lines using segments, as present in the figure.
R> beta0.hat <- mycoefs[1]
R> beta1.hat <- mycoefs[2]
R> segments(x0=c(obsA[1],obsB[1]),y0=beta0.hat+beta1.hat*c(obsA[1],obsB[1]),
x1=c(obsA[1],obsB[1]),y1=c(obsA[2],obsB[2]),lty=2)
Statistical inference involves hypothesis testing: a random sample of data is taken from the population and a test is conducted, and the hypothesis is either rejected or not rejected on the basis of the result.
The purpose of significance tests in linear regression is to determine the statistical significance of
the relationship between the dependent variable and one or more independent variables. In
simpler terms, these tests help us determine whether the independent variables are effective
predictors of the dependent variable.
Multiple tests can be employed to assess the significance of a linear regression model, but the t-
test is the most prevalent. The t-test is utilized to determine whether the slope coefficient(s) in
the linear regression model significantly differ from zero.
The R summary function is capable of conducting a linear regression significance test on a pre-
existing linear regression model. It furnishes a detailed account of the linear regression model,
such as anticipated coefficients, standard errors, t-statistics, and p-values, for every predictor
variable.
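A minimal sketch of that call (the model is refitted here so the snippet is self-contained; the survey data are in the MASS package):

```r
# Fit the simple linear model of height on handspan, then
# request the significance summary: coefficient estimates,
# standard errors, t-statistics, p-values, and R-squared.
library(MASS)  # provides the 'survey' data frame
survfit <- lm(Height ~ Wr.Hnd, data = survey)
summary(survfit)
```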
The output of summary also provides the values of Multiple R-squared and Adjusted R-squared, which are particularly interesting. Both are referred to as the coefficient of determination; they describe the proportion of the variation in the response that can be attributed to the predictor.
For the student height example, store the estimated correlation between Wr.Hnd and Height as rho.xy, and then square it (use="complete.obs" handles the missing values in survey):
R> rho.xy <- cor(survey$Wr.Hnd,survey$Height,use="complete.obs")
R> rho.xy^2
5.4 Prediction
The capability to fit a statistical model allows you to not only understand and quantify the
relationships within your data, but also to predict the values of the outcome of interest. This
holds true even when you have not directly observed the values of any explanatory variables in
the original dataset. However, it is essential to always accompany any point estimates or
predictions with a measure of variability, as is customary in any statistical analysis.
A confidence interval (CI) is a range of values that, with a stated degree of confidence, contains the true parameter value. The probability attached to the interval containing the true value is determined by the chosen confidence level.
Prediction interval (PI) for an observed response is distinct from that of a confidence interval.
While confidence intervals are utilized to depict the variability of the mean response, prediction
intervals are employed to offer the potential range of values that a single realization of the
response variable could assume, given x. This differentiation is subtle yet significant: the
confidence interval pertains to a mean, whereas the prediction interval pertains to an individual
observation.
R> as.numeric(beta0.hat+beta1.hat*14.5)
[1] 159.1446
R> as.numeric(beta0.hat+beta1.hat*24)
[1] 188.7524
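The interval objects used in the plotting calls below (xvals, mypred.ci, mypred.pi) can be produced with predict; a sketch, assuming the fitted object survfit from earlier:

```r
# Two predictor values of interest, named to match the model term
xvals <- data.frame(Wr.Hnd = c(14.5, 24))
# 95 percent confidence intervals for the mean response
mypred.ci <- predict(survfit, newdata = xvals,
                     interval = "confidence", level = 0.95)
# 95 percent prediction intervals for an individual response
mypred.pi <- predict(survfit, newdata = xvals,
                     interval = "prediction", level = 0.95)
```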
R> plot(survey$Height~survey$Wr.Hnd,xlim=c(13,24),ylim=c(140,205),
xlab="Writing handspan (cm)",ylab="Height (cm)")
R> abline(survfit,lwd=2)
Additionally, include the locations of the fitted values for x = 14.5 and x = 24, as well as two sets
of vertical lines indicating the confidence intervals (CIs) and prediction intervals (PIs).
R> points(xvals[,1],mypred.ci[,1],pch=8)
R> segments(x0=c(14.5,24),y0=c(mypred.pi[1,2],mypred.pi[2,2]),
x1=c(14.5,24),y1=c(mypred.pi[1,3],mypred.pi[2,3]),col="gray",lwd=3)
R> segments(x0=c(14.5,24),y0=c(mypred.ci[1,2],mypred.ci[2,2]),
x1=c(14.5,24),y1=c(mypred.ci[1,3],mypred.ci[2,3]),lwd=2)
The above figure gives a fitted regression line and point estimates at x = 14.5 and x = 24.
Additionally, it includes corresponding 95 percent confidence intervals (black vertical lines) and
prediction intervals (gray vertical lines). To provide a comprehensive understanding, dashed
black and dashed gray lines are incorporated, representing 95 percent confidence and prediction
bands for the response variable across the visible range of x value.
Interpolation is the term used to describe a prediction when the specified x value falls within the
range of observed data. On the other hand, extrapolation occurs when the x value of interest lies
outside this range.
Interpolation is generally more reliable than extrapolation: it is more logical to use a fitted model for making predictions in the vicinity of the observed data. However, if the extrapolation is not too far from that vicinity, it may still be considered reliable. The extrapolation for the student height
example at x = 24 serves as a prime example. Although it falls outside the range of observed
data, it is not significantly distant in terms of scale. The estimated intervals for the expected
value of yˆ = 188.75 cm do not appear unreasonable, at least visually, considering the distribution
of the other observations. On the other hand, it would be less sensible to use the fitted model for
predicting student height at a handspan of, for instance, 50 cm.
R> predict(survfit,newdata=data.frame(Wr.Hnd=50),interval="confidence",
level=0.95)
fit lwr upr
1 269.7845 251.9583 287.6106
5.5. MULTIPLE LINEAR REGRESSION
Multiple linear regression is a common extension of simple linear regression. It models the linear relationship between a response variable Y and several predictor variables.
Multiple regression can be applied in various scenarios:
1. When determining the selling price of a house, factors such as the location's desirability, the number of bedrooms and bathrooms, and the year of construction all influence the price.
2. When analyzing the height of a child, variables such as the mother's height, the father's height, nutrition, and environmental factors play a significant role in determining the child's height.
The overarching model ascertains the value of a continuous response variable Y from the values of p > 1 independent explanatory variables X1, X2, . . ., Xp.
Y = β0 + β1 X1 + β2 X2 + . . . + βp Xp + ε
In the case of n multivariate observations, this model can be written in matrix form as follows.
Y = X · β + ε,
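Written out, the matrix form stacks the n observations, with a leading column of ones in X for the intercept and the error terms stacked likewise:

```latex
\mathbf{Y}=\begin{bmatrix}Y_1\\\vdots\\Y_n\end{bmatrix},\quad
\mathbf{X}=\begin{bmatrix}1 & x_{11} & \cdots & x_{1p}\\
\vdots & \vdots & & \vdots\\
1 & x_{n1} & \cdots & x_{np}\end{bmatrix},\quad
\boldsymbol{\beta}=\begin{bmatrix}\beta_0\\\beta_1\\\vdots\\\beta_p\end{bmatrix},\quad
\boldsymbol{\epsilon}=\begin{bmatrix}\epsilon_1\\\vdots\\\epsilon_n\end{bmatrix}
```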
Omnibus F-Test
An omnibus test is a statistical test that assesses the significance of multiple parameters in a model simultaneously.
An instance of an omnibus test is when we have null and alternative hypotheses as follows:
H0: μ1 = μ2 = μ3 = … = μk (all the population means are equal) and
HA: At least one population mean is different from the rest.
The null hypothesis contains more than two parameters, making it an omnibus test. If we
reject the null hypothesis, we can conclude that at least one of the population means is
different from the rest, but we cannot determine which population means are different.
The F-test statistic can be calculated from the coefficient of determination, R2, obtained from the fitted regression model. If p is the number of regression parameters requiring estimation, excluding the intercept β0, then

F = R2 (n − p − 1) / ((1 − R2) p)
By analyzing the survey results, a model has been established to determine student height
using variables such as handspan, sex, and smoking status. The summary report contains
the coefficient of multiple determination, which can be extracted for further analysis.
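The fitted object survmult2 referred to below can be produced along these lines (a sketch; the exact formula used in the original is not shown, so the terms here are an assumption consistent with the variables named above):

```r
# Multiple regression of height on handspan, sex, and smoking status
library(MASS)  # provides the 'survey' data frame
survmult2 <- lm(Height ~ Wr.Hnd + Sex + Smoke, data = survey)
summary(survmult2)
```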
R> R2 <- summary(survmult2)$r.squared
R> R2
[1] 0.508469
The effective size of the dataset used in the fit can be determined by subtracting the number of observations deleted due to missingness (reported as 30 in the previous summary output) from the total number of rows.
R> n <- nrow(survey)-30
R> n
[1] 207
We get p as the number of estimated regression parameters minus 1, so that the intercept is excluded.
R> p <- length(coef(survmult2))-1
R> p
[1] 5
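With R2, n, and p in hand, the omnibus F statistic and its p-value follow directly from the formula above (R2, n, and p are assumed to hold the values computed in the preceding snippets):

```r
# Omnibus F statistic from the coefficient of determination
Fstat <- (R2 * (n - p - 1)) / ((1 - R2) * p)
Fstat                               # roughly 41.6 with the values above
# Upper-tail p-value on (p, n - p - 1) degrees of freedom
1 - pf(Fstat, df1 = p, df2 = n - p - 1)
```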
The plots are generated separately by the three lines of code provided, appearing from left
to right.
R> plot(x,y,type="l")
R> plot(x,y2,type="l")
R> plot(x,y3,type="l")
Logarithmic
In statistical modeling situations where positive numeric observations are present, it is customary
to perform a log transformation on the data. This transformation plays a crucial role in
significantly diminishing the overall range of the data and bringing extreme observations closer
to a measure of centrality. Consequently, adopting a logarithmic scale can effectively mitigate the
severity of heavily skewed data.
R> plot(1:1000,log(1:1000),type="l",xlab="x",ylab="",ylim=c(-8,8))
R> lines(1:1000,-log(1:1000),lty=2)
R> legend("topleft",legend=c("log(x)","-log(x)"),lty=c(1,2))
This graph illustrates the relationship between integers ranging from 1 to 1000 and their logarithms; the negative logarithm is also included. Notably, as the raw values increase, the logarithmically transformed values taper off and flatten out.
LINEAR MODEL SELECTION AND DIAGNOSTICS
The main aim of fitting any statistical model is to faithfully represent the data and the
relationships inherent within it. The process of fitting statistical models essentially involves
striking a balance between two key aspects: goodness-of-fit and complexity. Goodness-of-fit
refers to the objective of obtaining a model that accurately captures the relationships between the
response and the predictor (or predictors). Complexity, on the other hand, describes the level of
intricacy in a model, which is determined by the number of terms that require estimation.
The principle of parsimony
It is a concept in statistics that involves finding a balance between the goodness-of-fit and
complexity of a model. The aim is to select a model that is as simple as possible, with low
complexity, while still maintaining a reasonable level of goodness-of-fit.
Model Selection Algorithms
The aim of a model selection algorithm is to sift through your available explanatory variables in some systematic fashion in order to establish which of them are best able to jointly describe the response.
The use of model selection algorithms can be a topic of controversy due to the availability of
multiple methods. It is essential to recognize that no single approach can be universally
applicable for every regression model.
Forward Selection
The forward selection approach entails commencing with a model that consists solely of an intercept, followed by
conducting a set of independent tests to ascertain which predictor variables contribute
significantly to enhancing the goodness-of-fit. Subsequently, the model object is updated by
incorporating the identified term, and the series of tests is repeated for the remaining terms to
determine if any of them further enhance the fit. This iterative process continues until no
additional terms are found to significantly improve the fit in a statistically significant manner.
The ready-to-use R functions add1 and update facilitate the execution of these tests and the
subsequent update of the fitted regression model.
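A hypothetical sketch of one forward step, using the nuclear cost data from the boot package (object names and the candidate terms in scope are illustrative):

```r
library(boot)  # provides the 'nuclear' power station cost data
# Start from the intercept-only model
null.model <- lm(cost ~ 1, data = nuclear)
# Which single candidate term most improves the fit? Partial F-tests via add1:
add1(null.model, scope = . ~ . + date + cap + ne, test = "F")
# Incorporate the winning term, then repeat the tests on the updated model
step1 <- update(null.model, formula = . ~ . + cap)
add1(step1, scope = . ~ . + date + cap + ne, test = "F")
```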
Backward Selection
The process of forward selection initiates from a reduced model and gradually builds up to a
final model by incorporating additional terms. Conversely, backward selection commences with
the fullest model and systematically eliminates terms. In R, the functions drop1 and update are
utilized to inspect partial F-tests and update the models, respectively.
Revisit the nuclear example. First, define the fullest model as that which predicts cost by main
effects of all available covariates.
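The model summarized below can be fitted as follows (the object name is arbitrary; the nuclear data frame is in the boot package):

```r
library(boot)  # provides the 'nuclear' data
# Fullest model: cost predicted by main effects of all available covariates
nuc.full <- lm(cost ~ date + t1 + t2 + cap + pr + ne + ct + bw + cum.n + pt,
               data = nuclear)
summary(nuc.full)
```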
Call:
lm(formula = cost ~ date + t1 + t2 + cap + pr + ne + ct + bw +
cum.n + pt, data = nuclear)
Residuals:
Min 1Q Median 3Q Max
-128.608 -46.736 -2.668 39.782 180.365
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.135e+03 2.788e+03 -2.918 0.008222 **
date 1.155e+02 4.226e+01 2.733 0.012470 *
t1 5.928e+00 1.089e+01 0.545 0.591803
t2 4.571e+00 2.243e+00 2.038 0.054390 .
cap 4.217e-01 8.844e-02 4.768 0.000104 ***
pr -8.112e+01 4.077e+01 -1.990 0.059794 .
ne 1.375e+02 3.869e+01 3.553 0.001883 **
ct 4.327e+01 3.431e+01 1.261 0.221008
bw -8.238e+00 5.188e+01 -0.159 0.875354
cum.n -6.989e+00 3.822e+00 -1.829 0.081698 .
pt -1.925e+01 6.367e+01 -0.302 0.765401
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
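Backward steps then proceed by inspecting partial F-tests with drop1 and removing the least useful term; a sketch (the full model is refitted here so the snippet is self-contained, and the object name is illustrative):

```r
library(boot)  # provides the 'nuclear' data
nuc.full <- lm(cost ~ date + t1 + t2 + cap + pr + ne + ct + bw + cum.n + pt,
               data = nuclear)
# Partial F-tests for dropping each term from the full model
drop1(nuc.full, test = "F")
# Remove the least significant term (bw in the summary above) and re-test
nuc.1 <- update(nuc.full, formula = . ~ . - bw)
drop1(nuc.1, test = "F")
```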
Executing the plotting command generates a plot of the spatial locations of the 1000 seismic events in the quakes data frame. If Device 1, the null device, is the only device currently available, any plotting command that produces a new image will open a plotting window and draw into it.
R> dev.new()
The new window will be assigned the number 3 (typically, it positions itself above
the previously opened window, hence you might consider relocating it to a side
using your mouse).
At this point, you can enter the usual command to bring up the desired histogram
in Device 3:
R> hist(quakes$stations)
R> dev.set(2)
quartz
2
R> plot(quakes$long,quakes$lat,cex=0.02*quakes$stations,
xlab="Longitude",ylab="Latitude")
Switching back to Device 3, as a final tweak, add a vertical line marking off the mean
number of detecting stations.
R> dev.set(3)
quartz
3
R> abline(v=mean(quakes$stations),lty=2)
The two plots of the quakes data have been produced and manipulated, and the resulting
final outputs are showcased on my two visible graphics devices, specifically Device 2 (on
the left) and Device 3 (on the right).
Closing a Device
To conclude a graphics device, you have the option to click on the X button using your
mouse, just like closing any other window. Alternatively, you can make use of the dev.off
function.
R> dev.off(2)
quartz
3
Then repeat the call without an argument to close the remaining device:
R> dev.off()
null device
1
1. The plot region is the area in which the data are actually drawn, with respect to the user coordinates of the plot.
2. The figure region encompasses the area that accommodates the axes, their labels, and any titles. These spaces are also known as the figure margins.
3. The outer region, also known as the outer margins, is additional space surrounding the figure region that is not included by default but can be specified if necessary.
Default Spacing
Retrieve the default margin settings for your figure by calling the par function in R.
R> par()$oma
[1] 0 0 0 0
R> par()$mar
[1] 5.1 4.1 4.1 2.1
Note that oma=c(0, 0, 0, 0): there is no outer margin set by default. The default figure margin space is mar=c(5.1, 4.1, 4.1, 2.1); in other words, 5.1 lines at the bottom, 4.1 on the left, 4.1 at the top, and 2.1 on the right.
R> plot(1:10)
R> box(which="figure",lty=2)
In traditional (base) R graphics, these regions can be visualized with a solid-line box for the plot region, a dashed-line box for the figure region, and a dotted-line box for the outer region. On the left, the default settings are shown; on the right, the user has specified the outer and figure margin areas in "lines of text" through oma and mar, respectively, using par.
Custom Spacing
By modifying the spacing, we can create a plot that has customized outer margins. In this case,
the bottom, left, top, and right areas will have one, four, three, and two lines respectively.
Furthermore, the figure margins will be set to four, five, six, and seven lines. The code provided
will generate the corresponding result, which can be seen on the right side of above figure.
R> par(oma=c(1,4,3,2),mar=4:7)
R> plot(1:10)
R> box("figure",lty=2)
R> box("outer",lty=3)
When using R, the default square device may have its plot region compressed by irregular
margins, which are adjusted to make space for the defined spacing around the edges. However, if
you manipulate the graphical parameters excessively and squash the plot region too much, R will
generate an error message stating that the figure margins are too large.
To adjust the margin space for specific annotations of the plot, you can use the mtext function,
which is specifically designed to produce text in the figure or outer margins. By default, the text
is written in the figure margin, but you can position it in the outer region by setting outer=TRUE.
If you want to add more margin annotation to your plot, you can use the lines provided to do so.
Clipping
Clipping is a technique that enables you to add or draw elements to the margin regions of a plot
with respect to the user coordinates of the plot itself. For instance, you may want to place a
legend outside the plotting area or draw an arrow that extends beyond the plot region to highlight
a specific observation.
The xpd graphical parameter is responsible for controlling clipping in base R graphics. By
default, xpd is set to FALSE, which means that all drawing is clipped to the available plot region
only, except for special margin-addition functions like mtext. If you set xpd to TRUE, you can
draw things outside the formally defined plot region into the figure margins, but not into any
outer margins.
Drawing in all three areas, namely plot region, figure margins, and outer margins, can be
achieved by setting xpd to NA.
R> dev.new()
R> par(oma=c(1,1,5,1),mar=c(2,4,5,4))
R> boxplot(mtcars$mpg~mtcars$cyl,xaxt="n",ylab="MPG")
R> box("figure",lty=2)
R> box("outer",lty=3)
R> arrows(x0=c(2,2.5,3),y0=c(44,37,27),x1=c(1.25,2.25,3),y1=c(31,22,20), xpd=FALSE)
R> text(x=c(2,2.5,3),y=c(45,38,28),c("V4 cars","V6 cars","V8 cars"), xpd=FALSE)
The locator command allows you to find and return user coordinates.
To observe its functionality, start by executing the plot(1,1) command to generate a simple plot
with a single point at the center. To use the locator function, execute it without any arguments,
which will cause the console to "hang" without returning to the prompt. Afterwards, on an active
graphics device, your mouse cursor will change to a + symbol (you may need to click on your
device once to bring it to the forefront of your computer desktop). With the + cursor, you can
perform a series of left mouse clicks within the device, and R will silently record the precise user
coordinates. To end this process, simply right-click to terminate the command.
I detected four points located at arbitrary positions around the plotted point at (1, 1) on my
machine. These points were arranged in a clockwise order, starting from the top left and ending
at the bottom left. The output printed to my console is as follows:
R> plot(1,1)
R> locator()
$x
[1] 0.8275456 1.1737525 1.1440526 0.8201909
$y
[1] 1.1581795 1.1534442 0.9003221 0.8630254
You can also use locator to plot the points you select, either as individual points or as lines. Running the following code produces the result shown in the figure:
R> plot(1,1)
R> Rtist <- locator(type="o",pch=4,lty=2,lwd=3,col="red",xpd=TRUE)
R> Rtist
$x
[1] 0.5013189 0.6267149 0.7384407 0.7172250 1.0386740 1.2765699
[7] 1.4711542 1.2352573 1.2220592 0.8583484 1.0483300 1.0091491
$y
[1] 0.6966016 0.9941945 0.9636752 1.2819852 1.2766579 1.4891270
[7] 1.2439071 0.9630832 0.7625887 0.7541716 0.6394519 0.9618461
If you desire greater precision in an R plot, it is best to start with a "clean slate." This entails
understanding the default settings of specific graphical parameters when calling a plotting
function and how to remove elements like boxes and axes. This is the starting point. As an
example, let's plot MPG against horsepower (from the readily available mtcars data set) and size
each plotted point proportionally to the weight of each car. For convenience, create the following
objects:
The car weight vector is transformed by dividing it by its mean value. As a result, cars weighing less than the average have a value less than 1, while cars weighing more than the average have a value greater than 1. This property makes the vector an excellent choice for scaling the plotted points via the cex parameter.
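The three objects assumed by the plotting calls that follow can be set up as below (the names hp, mpg, and wtcex come from the later code; mtcars is built in):

```r
# Convenience objects for the MPG-versus-horsepower plots
hp    <- mtcars$hp
mpg   <- mtcars$mpg
# Weight scaled by its mean: < 1 for lighter-than-average cars, > 1 for heavier
wtcex <- mtcars$wt / mean(mtcars$wt)
```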
To plot the mtcars data, one can correlate MPG with horsepower while keeping the point size
proportional to the weight of the car. This can be done by using a plot call. The plot can be
displayed in three different ways. The first one is the default appearance. The second one can be
achieved by setting xaxs="i" and yaxs="i" to prevent buffer spacing on the limits of the axes.
The third one can be achieved by using xaxt, yaxt, xlab, ylab, and bty to suppress all box, axis,
and label drawing. Alternatively, this can be achieved by setting axes=FALSE and ann=FALSE.
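The three variants just described can be sketched as follows (self-contained, with the object names used in the text):

```r
hp    <- mtcars$hp
mpg   <- mtcars$mpg
wtcex <- mtcars$wt / mean(mtcars$wt)            # point sizes proportional to weight
plot(hp, mpg, cex = wtcex)                      # 1: default appearance
plot(hp, mpg, cex = wtcex,
     xaxs = "i", yaxs = "i")                    # 2: no buffer space at the axis limits
plot(hp, mpg, cex = wtcex, xaxt = "n", yaxt = "n",
     xlab = "", ylab = "", bty = "n")           # 3: box, axes, and labels suppressed
# equivalently for 3: plot(hp, mpg, cex = wtcex, axes = FALSE, ann = FALSE)
```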
Customizing Boxes
When initiating a suppressed-box or suppressed-axis plot, you can introduce a box that pertains
specifically to the current plot region in the active graphics device by utilizing the "box" function
and specifying its type using the "bty" parameter. For instance, if you commence with a plot
resembling the one depicted on the right side of the above figure (simply execute the most recent
line of code to obtain this), you can enhance it further by incorporating the following line,
resulting in the image presented on the left side of the below figure.
R> box(bty="l",lty=3,lwd=2)
R> box(bty="]",lty=2,col="gray")
Customizing Axes
Once you have adjusted the box according to your preferences, you can proceed to focus on the
axes. The axis function provides you with the ability to precisely manipulate and enhance the
presence of an axis on any of the four sides of the plot area. The initial parameter it requires is
the side, which is specified by a single integer: 1 (bottom), 2 (left), 3 (top), or 4 (right). These
integers correspond to the respective positions of the margin-spacing values that are relevant
when configuring graphical parameter vectors such as mar.
The built-in function pretty in R finds a “neat” sequence of values for the scale of each axis. However, we can set our own tick positions by passing the desired values through the at argument of axis.
R> hpseq <- seq(min(hp),max(hp),length=10)
R> plot(hp,mpg,cex=wtcex,xaxt="n",bty="n",ann=FALSE)
R> axis(side=1,at=hpseq)
R> axis(side=3,at=round(hpseq))
To begin with, a set of 10 values that are evenly spaced and cover the entire range of hp is stored
as hpseq. The initial plot command is used to hide the x-axis, the box, and any default axis
labels. However, the y-axis is allowed to be displayed as per its default settings. Then, the axis
function is called to draw the x-axis (side=1), with tick marks positioned at hpseq. Additionally,
another axis is drawn at the top (side=3), but this time the tick marks are placed at hpseq after
rounding it to the nearest integer.
Greek Symbols
Greek symbols or mathematical markup may be necessary for annotation in statistically or mathematically technical plots.
par(mar=c(3,3,3,3))
plot(1,1,type="n",xlim=c(-1,1),ylim=c(0.5,4.5),xaxt="n",yaxt="n", ann=FALSE)
text(0,4,label=expression(alpha),cex=1.5)
text(0,3,label=expression(paste("sigma: ",sigma ,"\t Sigma:",Sigma)), family="mono",cex=1.5)
text(0,2,label=expression(paste(beta," ",gamma," ",Phi)),cex=1.5)
text(0,1,label=expression(paste(Gamma,"(",tau,") = 24 when ",tau," = 5")),
family="serif",cex=1.5)
title(main=expression(paste("Gr",epsilon,epsilon,"k")),cex.main=2)
To obtain a single special character on its own, you can use expression(alpha) to generate α in the plot, as demonstrated in the first text call within the code chunk. The name of the desired symbol is specified without enclosing it in quotes. The title function, which allows you to add axis and main titles, accepts the same expressions.
Mathematical Expressions
The process of formatting complete mathematical expressions in R plots can prove to be slightly more complex, resembling the use of markup languages such as LaTeX.
R> expr1 <- expression(c^2==a[1]^2+b[1]^2)
R> expr2 <- expression(paste(pi^{x[i]},(1-pi)^(n-x[i])))
R> expr3 <- expression(paste("Sample mean: ", italic(n)^{-1}, sum(italic(x)[italic(i)],
italic(i)==1, italic(n)) ==frac(italic(x)[1]+...+italic(x)[italic(n)],
italic(n))))
R> expr4 <- expression(paste("f(x","|",alpha,",",beta,")"==frac(x^{alpha-1}~(1-x)^{beta-1},
B(alpha,beta))))
R> par(mar=c(3,3,3,3))
R> plot(1,1,type="n",xlim=c(-1,1),ylim=c(0.5,4.5),xaxt="n",yaxt="n", ann=FALSE)
R> text(0,4:1,labels=c(expr1,expr2,expr3,expr4),cex=1.5)
R> title(main="Math",cex.main=2)
Here are some important points to note:
1. Superscripts are indicated by ^ and subscripts by [ ]. For instance, c^2 renders as c² in expr1, and the a[1]^2 component renders as a₁².
2. Components can be grouped using parentheses ( ), which are visible, as in the (1-pi)^(n-x[i]) component of expr2. Alternatively, components can be grouped using braces { }, which are not visible, as in the pi^{x[i]} component.
3. Italicized alphabetic variables are displayed using the italic() function. For example, italic(n) displays an italic n in expr3.
Color plays a key role in many plots. Using color to differentiate between values and variables can greatly assist in interpreting your data and models effectively.
Palette
There are various conventional methods to generate and depict distinct colors, as well as
techniques to establish and employ a harmonious collection of colors, which is commonly known
as a palette.
Built-in Palettes
There are several color palettes included in the base R installation. These palettes are determined
by the functions rainbow, heat.colors, terrain.colors, topo.colors, cm.colors, gray.colors, and
gray.
The following code provided generates a total of 600 colors from every palette available.
R> N <- 600
R> rbow <- rainbow(N)
R> heat <- heat.colors(N)
R> terr <- terrain.colors(N)
R> topo <- topo.colors(N)
R> cm <- cm.colors(N)
R> gry1 <- gray.colors(N)
R> gry2 <- gray(level=seq(0,1,length=N))
In order to start a new plot, the process involves using vector repetition to place 600 points for
each palette. This is achieved by making a single call to the points function, which also ensures
the points are colored correctly based on the hex code vectors.
R> dev.new(width=8,height=3)
R> par(mar=c(1,8,1,1))
R> plot(1,1,xlim=c(1,N),ylim=c(0.5,7.5),type="n",xaxt="n",yaxt="n",ann=FALSE)
R> points(rep(1:N,7),rep(7:1,each=N),pch=19,cex=3,
col=c(rbow,heat,terr,topo,cm,gry1,gry2))
R> axis(2,at=7:1,labels=c("rainbow","heat.colors","terrain.colors",
"topo.colors","cm.colors","gray.colors","gray"), family="mono",las=1)
For additional information, access the help files ?gray.colors and ?gray, which provide details on
the grayscale palettes, while the rest of the options can be found under ?rainbow.
Custom Palettes
The colorRampPalette function enables the creation of custom palettes from two or more key colors, supplied via its colors argument. It returns a function that generates a palette smoothly transitioning between them.
A palette function is created by specifying the key colors to be interpolated, in the desired order, as a character vector of color names recognized by R. The following line of code creates a palette function that generates colors on a scale between purple and yellow.
R> puryel.colors <- colorRampPalette(colors=c("purple","yellow"))
The following code uses more than two key colors. A two-color ramp for the blues object used below can be defined the same way (its key colors here are chosen for illustration):
R> blues <- colorRampPalette(colors=c("navyblue","lightblue"))
R> fours <- colorRampPalette(colors=c("black","hotpink","seagreen4","tomato"))
R> patriot.colors <- colorRampPalette(colors=c("red","white","blue"))
We can now generate any number of colors from each range:
R> py <- puryel.colors(N)
R> bls <- blues(N)
R> frs <- fours(N)
R> pat <- patriot.colors(N)
R> dev.new(width=8,height=2)
R> par(mar=c(1,8,1,1))
R> plot(1,1,xlim=c(1,N),ylim=c(0.5,4.5),type="n",xaxt="n",yaxt="n",ann=FALSE)
R> points(rep(1:N,4),rep(4:1,each=N),pch=19,cex=3,col=c(py,bls,frs,pat))
R> axis(2,at=4:1,labels=c("puryel.colors","blues","fours","patriot.colors"),
family="mono",las=1)
Via Categorization
To color values based on a continuous variable, one approach is to transform it into a problem of
coloring points associated with a categorical variable. This can be achieved by dividing the
continuous values into a predetermined number of k categories, creating k colors from a chosen
palette, and assigning each observation to the corresponding color based on the bin it belongs to.
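As a sketch of this approach (the object names here are illustrative, not taken from the text): cut the continuous values into k bins, generate k colors from a palette, and index the palette by each observation's bin membership.

```r
# Continuous variable to color by: sepal length from the built-in iris data
sepal <- iris$Sepal.Length

# Divide the values into k = 5 equal-width bins; cut returns a factor
k <- 5
bins <- cut(sepal, breaks=k)

# Generate k colors from a custom palette, one per bin
pal <- colorRampPalette(c("lightblue","navyblue"))(k)

# Assign each observation the color of the bin it falls into
ptcols <- pal[as.numeric(bins)]

# Points are now shaded from light to dark with increasing sepal length
plot(iris$Petal.Length, iris$Sepal.Length, pch=19, col=ptcols,
     xlab="Petal length", ylab="Sepal length")
```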
The above plot is an example of assigning color to points based on a continuous value via categorization.
The following plot is an example of assigning color to points based on a continuous value via normalization.
Now that you have the ability to effectively utilize color in your plots, it is necessary to have a
legend that corresponds to the color scale. Although it is possible to create a legend using only
the fundamental tools in R, it is often easier to utilize contributed functionality within R.
This is found in the shape package, so first download and install shape from CRAN.
The above example is for implementing a color strip legend using the colorlegend
function from the contributed shape package.
Opacity
R assumes full opacity as the default when creating colors. However, if you explicitly set the
opacity using alpha, the hex codes will change slightly. Instead of six characters after the #, eight
characters will appear, with the last two representing the additional opacity information. Take a
look at the following lines of code that generate four different versions of the color red: default,
zero opacity, 40 percent opacity (0.4 × 255 = 102), and full opacity, respectively:
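The four lines of code referred to here are not reproduced in this excerpt; based on the rgb(...,maxColorValue=255) call used later in this section and rgb's alpha argument, they can be sketched as follows:

```r
# Default: no alpha channel, six hex digits after the #
rgb(255,0,0,maxColorValue=255)            # "#FF0000"

# Zero opacity: two extra hex digits "00" are appended
rgb(255,0,0,alpha=0,maxColorValue=255)    # "#FF000000"

# 40 percent opacity: 0.4 * 255 = 102, which is hex 66
rgb(255,0,0,alpha=102,maxColorValue=255)  # "#FF000066"

# Full opacity: alpha 255 (hex FF) stated explicitly
rgb(255,0,0,alpha=255,maxColorValue=255)  # "#FF0000FF"
```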
The initial and final colors are the same; however, the last hex code explicitly indicates full
opacity. The subsequent line of code converts the default red hex code generated by the first line
in the previous example into a version that has a 40 percent opacity.
R> adjustcolor(rgb(cbind(255,0,0),maxColorValue=255),alpha.f=0.4)
[1] "#FF000066"
This is another example showcasing the application of adjustcolor. In this demonstration, the
color sequence generated using depth.pal(20) is adjusted to have 60 percent opacity to match the
plotted points. The legend is positioned using posx and posy, and the optional argument left is set
to TRUE to display the tick marks and color legend labels on the left side of the strip. The final
outcome can be seen in following Figure.
RGB Alternatives and Further Functionality
3D Scatterplots
3D scatterplot tools enable the plotting of raw observations based on three continuous variables
concurrently, unlike the traditional 2D scatterplot, which can only handle two variables. There are
multiple approaches to generating scatterplots of three variables in R; however, the preferred
method is typically the scatterplot3d function from the contributed package of the same name.
Basic Syntax
The scatterplot3d function follows a syntax similar to the default plot function. In the default plot
function, you need to input a vector of x- and y-axis coordinates, whereas in the scatterplot3d
function, you only need to provide an extra third vector of values for the z-axis coordinates. This
additional dimension enables you to perceive the three axes in the following manner: the x-axis
increases from left to right, the y-axis increases from foreground to background, and the z-axis
increases from bottom to top.
The iris flower data, which was initially introduced in Section 14.4, comprises measurements on
four continuous variables (petal length/width and sepal length/width) and one categorical
variable (flower species). The iris data frame can be readily accessed from the R prompt,
eliminating the need to load any additional files. To swiftly access the measurement values
constituting the data (petal width, petal length, and sepal width, as extracted below), input the
following commands:
R> pwid <- iris$Petal.Width
R> plen <- iris$Petal.Length
R> swid <- iris$Sepal.Width
R> library("scatterplot3d")
R> scatterplot3d(x=pwid,y=plen,z=swid)
A general positive correlation can be observed among the three plotted variables in this context.
Additionally, there is a distinct cluster of observations in the foreground that exhibit relatively
large sepal widths but small petal measurements.
Fig : Two 3D scatterplots of the famous iris data with petal width, petal length, and sepal width
on the x-, y-, and z-axis, respectively. Left: Basic default appearance. Right: Tidying up titles and
adding visual enhancements to emphasize 3D depth and legibility via color and vertical line
marks.
Visual Enhancements
Perceiving depth in the plotted cloud of points can often pose a challenge, despite the inclusion
of default box and x-y plane grid lines. To address this issue, there are a few optional
enhancements available for scatterplot3d plots. One such enhancement involves coloring the
points to facilitate a clearer transition from the foreground to the background. Additionally,
setting the type="h" argument allows for the drawing of lines perpendicular to the x-y plane.
R> scatterplot3d(x=pwid,y=plen,z=swid,highlight.3d=TRUE,type="h",
lty.hplot=2,lty.hide=3,xlab="Petal width",
ylab="Petal length",zlab="Sepal width",
main="Iris Flower Measurements")
xlab, ylab, zlab, and main control the corresponding titles of the three axes and the plot itself.
The above figure is a 3D scatterplot of the famous iris data, displaying all five variables present,
with the additional use of color (for sepal length) and point character (for species).
A bivariate function can be used to create a smooth surface defined within the range of 1
to 6 on the x-axis and 1 to 4 on the y-axis. To achieve this, evenly spaced sequences can be
defined over each coordinate range (here, simply with the : operator):
R> xcoords <- 1:6
R> xcoords
[1] 1 2 3 4 5 6
R> ycoords <- 1:4
R> ycoords
[1] 1 2 3 4
When provided with two vectors, the expand.grid function generates all unique coordinate pairs
by repeating each value in the second vector against the entire length of the first vector.
The ready-to-use letters object in R allows you to generate letters of the alphabet quickly.
The 3D plots used to visualize a bivariate function require the z values corresponding to the x-y
evaluation grid in the form of an appropriately constructed matrix. The size of the z-matrix is
determined directly by the resolution of the evaluation grid; the number of rows corresponds to
the number of unique x grid values, and the number of columns corresponds to the number of
unique y grid values.
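Given the 6-by-4 grid above, a z-matrix of this kind can be built by filling a matrix column by column. This sketch uses the letters vector as the hypothetical "function result":

```r
xcoords <- 1:6
ycoords <- 1:4

# All unique (x, y) coordinate pairs; the first vector varies fastest
xy.grid <- expand.grid(x=xcoords, y=ycoords)

# Hypothetical "function result": one letter per grid point (24 in total)
z <- letters[1:nrow(xy.grid)]

# One row per unique x value, one column per unique y value; R fills
# matrices column by column, matching the expand.grid ordering
zmat <- matrix(z, nrow=length(xcoords), ncol=length(ycoords))
```

Printing zmat reproduces the 6-by-4 arrangement of "a" through "x" shown below.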
The following is the correct matrix representation of the hypothetical “function result” vector z:
R> zmat
[,1] [,2] [,3] [,4]
[1,] "a" "g" "m" "s"
[2,] "b" "h" "n" "t"
[3,] "c" "i" "o" "u"
[4,] "d" "j" "p" "v"
[5,] "e" "k" "q" "w"
[6,] "f" "l" "r" "x"
The R function contour is used to generate contours that connect x-y coordinates sharing the
same z value, based on a provided numeric z-matrix.
R> dim(volcano)
[1] 87 61
R> contour(x=1:nrow(volcano),y=1:ncol(volcano),z=volcano,asp=1)
Contours can show you not only the peaks and troughs in a surface like this but also the
"steepness" of any such features.
Color-Filled Contours
To introduce a slight variation to the contour plot, one can utilize color to fill the gaps between
the different levels that are illustrated. This approach, when combined with a color legend,
eliminates the requirement of labeling the contour lines and, in certain scenarios, facilitates the
visual interpretation of fluctuations in the plotted z-matrix surface. The filled.contour
function is used for this.
You are required to provide the ascending sequences of grid coordinates in both the x-axis and y-
axis directions, along with the corresponding z-matrix, to the parameters x, y, and z in a manner
similar to contour. The most convenient approach to indicate the colors is by supplying a color
palette to the color.palette argument.
R> filled.contour(x=hp.seq,y=wt.seq,z=car.pred.mat,
color.palette=colorRampPalette(c("white","red4")),
xlab="Horsepower",ylab="Weight",
key.title=title(main="Mean MPG",cex.main=0.8))
Figure 25-13: Filled contour plot of the response surface for the
fitted multiple linear model of the mtcars data
Pixel Images
A pixel image is the most accurate visual representation of a continuous surface that is
approximated by a finite evaluation grid. Its resemblance to a filled contour plot is evident, but
an image plot offers a more direct means of managing the display of each entry in the associated
z-matrix.
R> image(x=1:nrow(volcano),y=1:ncol(volcano),z=volcano,asp=1)
By setting asp=1, the aspect ratio of the horizontal and vertical axes is maintained at a one-to-one
ratio. The plot is composed of exactly 5307 pixels, with each pixel representing a specific entry
in the volcano matrix. The contour plot of the same data mirrors the features visible in this
image.
Perspective Plots
The perspective plot is sometimes also called a wireframe plot. In certain applications, it may be
necessary to accurately assess the degree of extremity of peaks and troughs in a plotted surface.
This task is more challenging with pixel images or contour plots, as they do not provide a clear
representation of relative extremity.
Coloring Facets
The optional col argument in traditional R plotting commands can be used to color the facets of a
perspective surface. If you want to color the surface with a constant color, you can simply
provide col with a single value.
However, if you're interested in using col to highlight the changing value of the bivariate
function, you would want to color the surface according to the fluctuating z-values. It's important
to note that the facets making up the surface are not the same as the pixels in a pixel image of the
z-matrix. The facets should be interpreted as the space between the border lines drawn at the
matrix entries, resulting in (m-1)(n-1) facets. In other words, each z-matrix entry lies at an
intersection of the drawn lines and is not situated in the middle of each facet.
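Because there are (m-1)(n-1) facets rather than one per matrix entry, a common idiom (adapted from the example in R's ?persp help page) averages the four corner z-values of each facet and bins the averages to pick a color per facet:

```r
z <- volcano
nrz <- nrow(z); ncz <- ncol(z)

# Average the four corners of every facet: an (nrz-1)-by-(ncz-1) matrix
zfacet <- (z[-1,-1] + z[-1,-ncz] + z[-nrz,-1] + z[-nrz,-ncz]) / 4

# Bin the facet heights and map each bin to one of nbcol terrain colors
nbcol <- 100
facetcol <- cut(zfacet, nbcol)

# col receives one color per facet, in the facets' drawing order
persp(x=1:nrz, y=1:ncz, z=z, col=terrain.colors(nbcol)[facetcol],
      theta=-30, phi=30)
```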
R> persprot(x=quak.dens$x,y=quak.dens$y,z=quak.dens$z,border="red3",shade=0.4,
ticktype="detailed",xlab="Longitude",ylab="Latitude", zlab="Kernel estimate")