R-programming - Unit 5
5.1 GENERAL CONCEPTS
A linear regression model aims to derive a function that predicts the average value of one
variable based on a specific value of another variable. The response variable, also known as the
"outcome" variable, represents the mean that you are trying to estimate. On the other hand, the
explanatory variable, referred to as the "predictor" variable, represents the value you already
possess.
For example, in the student survey, suppose we ask, “What is the expected height of a student whose handspan is 12.5 cm?” Here, height is the response variable and handspan is the explanatory variable.
5.1.1. Definition of the Model
If X is a given value of the explanatory variable and Y is the value of the response variable, then the simple linear regression model is
Y | X = β0 + β1 X + ε
Y | X is read as “the value of Y conditional upon the value of X.”
ε is a random error term. It is assumed to have a normal distribution with a mean of zero and a standard deviation of σ, and its variance, σ2, remains constant.
β0 is the intercept. In the case where the predictor is zero, the intercept, β0, denotes the expected value of the response variable.
β1 is the slope. The slope, β1, is the central point of interest as it denotes the change in the
average response for every one-unit increase in the predictor.
A positive slope indicates that the regression line rises from left to right, so the mean response is higher when the predictor is higher. A negative slope indicates that the line descends from left to right, so the mean response is lower when the predictor is higher. A slope of zero suggests that the predictor has no impact on the response. The more extreme the value of β1, the steeper the incline or decline of the line.
Estimating the regression parameters βˆ0 and βˆ1 from the n pairs of observations is called fitting the linear model. The estimated mean response, denoted yˆ, for a specific value of the predictor, x, is then written as follows:
yˆ = βˆ0 + βˆ1 x
The predictor and response variables in a simple linear regression function can be represented by
xi and yi, respectively, where i ranges from 1 to n. The estimates for the parameters can then be
calculated based on these n observed data pairs.
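For reference, the least-squares estimates computed from those n observed pairs have the standard closed form (a textbook result, stated here in the same notation):

```latex
\hat{\beta}_1 \;=\; \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}},
\qquad
\hat{\beta}_0 \;=\; \bar{y} \;-\; \hat{\beta}_1\,\bar{x}
```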
The model is fitted with lm and stored (here as survfit); printing the object shows the estimated coefficients.
R> survfit <- lm(Height ~ Wr.Hnd, data = survey)
R> survfit

Call:
lm(formula = Height ~ Wr.Hnd, data = survey)

Coefficients:
(Intercept)       Wr.Hnd
    113.954        3.117
The abline function is used to incorporate completely horizontal and vertical lines into an
existing plot. However, if an object of class "lm" is provided, which represents a simple linear
model such as survfit, the function will instead add the fitted regression line.
The observed data is fitted with a solid and bold simple linear regression line. To illustrate
positive and negative residuals, two dashed vertical line segments are provided, with the leftmost
segment representing a positive residual and the rightmost segment representing a negative
residual.
5.4 Illustrating Residuals
In statistics, the difference between the actual value and the predicted value of a dependent
variable is referred to as a residual. The line that is fitted to the data is commonly known as the
least-squares regression line, as it is the line that reduces the average squared deviation between
the observed data and the line itself.
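In symbols, the residual for observation i is the vertical gap between the observed and fitted values, and the least-squares line is the one minimizing the sum of their squares:

```latex
e_i \;=\; y_i - \hat{y}_i,
\qquad
\min_{\beta_0,\,\beta_1}\;\sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^{2}
```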
First, let’s extract two specific records from the Wr.Hnd and Height data vectors and call the resulting vectors obsA and obsB (the index of the second record below is illustrative).
R> obsA <- c(survey$Wr.Hnd[197],survey$Height[197])
R> obsB <- c(survey$Wr.Hnd[154],survey$Height[154])
R> obsA
Take a look at the names of the individuals listed in the survfit object.
R> names(survfit)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "na.action" "xlevels" "call" "terms"
[13] "model"
A fitted model object of class "lm" is automatically generated with various components. One of these is "coefficients", a numeric vector containing the estimates of the intercept and slope. The coef function extracts this component from an "lm" object.
R> mycoefs <- coef(survfit)
R> mycoefs
Store the two estimates separately (these are the quantities used below); finally, draw the vertical dashed lines using segments, as present in the figure.
R> beta0.hat <- mycoefs[1]
R> beta1.hat <- mycoefs[2]
R> segments(x0=c(obsA[1],obsB[1]),y0=beta0.hat+beta1.hat*c(obsA[1],obsB[1]),
x1=c(obsA[1],obsB[1]),y1=c(obsA[2],obsB[2]),lty=2)
Statistical inference involves hypothesis testing: a random sample of data is taken from the population and a test is conducted, and the hypothesis is either rejected or not rejected on the basis of the result.
The purpose of significance tests in linear regression is to determine the statistical significance of
the relationship between the dependent variable and one or more independent variables. In
simpler terms, these tests help us determine whether the independent variables are effective
predictors of the dependent variable.
Multiple tests can be employed to assess the significance of a linear regression model, but the t-
test is the most prevalent. The t-test is utilized to determine whether the slope coefficient(s) in
the linear regression model significantly differ from zero.
The R summary function is capable of conducting a linear regression significance test on a pre-
existing linear regression model. It furnishes a detailed account of the linear regression model,
such as anticipated coefficients, standard errors, t-statistics, and p-values, for every predictor
variable.
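A minimal sketch of that call (the model is refitted here so the snippet is self-contained; the survey data are in the MASS package):

```r
# Fit the simple linear model of height on handspan, then
# request the significance summary: coefficient estimates,
# standard errors, t-statistics, p-values, and R-squared.
library(MASS)  # provides the 'survey' data frame
survfit <- lm(Height ~ Wr.Hnd, data = survey)
summary(survfit)
```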
The output of summary also provides the values of Multiple R-squared and Adjusted R-squared, which are particularly interesting. Both are referred to as the coefficient of determination; they describe the proportion of the variation in the response that can be attributed to the predictor.
For the student height example, store the estimated correlation between Wr.Hnd and Height as rho.xy, and then square it (use="complete.obs" handles the missing values in survey):
R> rho.xy <- cor(survey$Wr.Hnd,survey$Height,use="complete.obs")
R> rho.xy^2
5.4 Prediction
The capability to fit a statistical model allows you to not only understand and quantify the
relationships within your data, but also to predict the values of the outcome of interest. This
holds true even when you have not directly observed the values of any explanatory variables in
the original dataset. However, it is essential to always accompany any point estimates or
predictions with a measure of variability, as is customary in any statistical analysis.
A confidence interval (CI) is a range of values that, with a stated degree of confidence, contains the true parameter value. The probability attached to the interval containing the true value is determined by the chosen confidence level.
Prediction interval (PI) for an observed response is distinct from that of a confidence interval.
While confidence intervals are utilized to depict the variability of the mean response, prediction
intervals are employed to offer the potential range of values that a single realization of the
response variable could assume, given x. This differentiation is subtle yet significant: the
confidence interval pertains to a mean, whereas the prediction interval pertains to an individual
observation.
R> as.numeric(beta0.hat+beta1.hat*14.5)
[1] 159.1446
R> as.numeric(beta0.hat+beta1.hat*24)
[1] 188.7524
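The interval objects used in the plotting calls below (xvals, mypred.ci, mypred.pi) can be produced with predict; a sketch, assuming the fitted object survfit from earlier:

```r
# Two predictor values of interest, named to match the model term
xvals <- data.frame(Wr.Hnd = c(14.5, 24))
# 95 percent confidence intervals for the mean response
mypred.ci <- predict(survfit, newdata = xvals,
                     interval = "confidence", level = 0.95)
# 95 percent prediction intervals for an individual response
mypred.pi <- predict(survfit, newdata = xvals,
                     interval = "prediction", level = 0.95)
```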
R> plot(survey$Height~survey$Wr.Hnd,xlim=c(13,24),ylim=c(140,205),
xlab="Writing handspan (cm)",ylab="Height (cm)")
R> abline(survfit,lwd=2)
Additionally, include the locations of the fitted values for x = 14.5 and x = 24, as well as two sets
of vertical lines indicating the confidence intervals (CIs) and prediction intervals (PIs).
R> points(xvals[,1],mypred.ci[,1],pch=8)
R> segments(x0=c(14.5,24),y0=c(mypred.pi[1,2],mypred.pi[2,2]),
x1=c(14.5,24),y1=c(mypred.pi[1,3],mypred.pi[2,3]),col="gray",lwd=3)
R> segments(x0=c(14.5,24),y0=c(mypred.ci[1,2],mypred.ci[2,2]),
x1=c(14.5,24),y1=c(mypred.ci[1,3],mypred.ci[2,3]),lwd=2)
The above figure gives a fitted regression line and point estimates at x = 14.5 and x = 24.
Additionally, it includes corresponding 95 percent confidence intervals (black vertical lines) and
prediction intervals (gray vertical lines). To provide a comprehensive understanding, dashed
black and dashed gray lines are incorporated, representing 95 percent confidence and prediction
bands for the response variable across the visible range of x value.
Interpolation is the term used to describe a prediction when the specified x value falls within the
range of observed data. On the other hand, extrapolation occurs when the x value of interest lies
outside this range.
Interpolation is generally more reliable than extrapolation: it is more logical to use a fitted model for making predictions in the vicinity of the observed data. However, if the extrapolation is not too far from that vicinity, it may still be considered reliable. The extrapolation for the student height
example at x = 24 serves as a prime example. Although it falls outside the range of observed
data, it is not significantly distant in terms of scale. The estimated intervals for the expected
value of yˆ = 188.75 cm do not appear unreasonable, at least visually, considering the distribution
of the other observations. On the other hand, it would be less sensible to use the fitted model for
predicting student height at a handspan of, for instance, 50 cm.
R> predict(survfit,newdata=data.frame(Wr.Hnd=50),interval="confidence",
level=0.95)
fit lwr upr
1 269.7845 251.9583 287.6106
5.5. MULTIPLE LINEAR REGRESSION
Multiple linear regression is a common extension of simple linear regression. It models the linear relationship between a response variable Y and several predictor variables.
Multiple regression can be applied in various scenarios:
1. When determining the selling price of a house, factors such as the location's desirability, the number of bedrooms and bathrooms, and the year of construction all influence the price.
2. When analyzing the height of a child, variables such as the mother's height, the father's height, nutrition, and environmental factors play a significant role in determining the child's height.
The overarching model ascertains the value of a continuous response variable Y from the values of p > 1 independent explanatory variables X1, X2, . . ., Xp.
Y = β0 + β1 X1 + β2 X2 + . . . + βp Xp + ε
In the case of n multivariate observations, this model can be written in matrix form as follows.
Y = X · β + ε,
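Written out, the matrix form stacks the n observations, with a leading column of ones in X for the intercept and the error terms stacked likewise:

```latex
\mathbf{Y}=\begin{bmatrix}Y_1\\\vdots\\Y_n\end{bmatrix},\quad
\mathbf{X}=\begin{bmatrix}1 & x_{11} & \cdots & x_{1p}\\
\vdots & \vdots & & \vdots\\
1 & x_{n1} & \cdots & x_{np}\end{bmatrix},\quad
\boldsymbol{\beta}=\begin{bmatrix}\beta_0\\\beta_1\\\vdots\\\beta_p\end{bmatrix},\quad
\boldsymbol{\epsilon}=\begin{bmatrix}\epsilon_1\\\vdots\\\epsilon_n\end{bmatrix}
```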
Omnibus F-Test
An omnibus test is a statistical test that assesses the significance of multiple parameters in a model simultaneously.
An instance of an omnibus test is when we have null and alternative hypotheses as follows:
H0: μ1 = μ2 = μ3 = … = μk (all the population means are equal) and
HA: At least one population mean is different from the rest.
The null hypothesis contains more than two parameters, making it an omnibus test. If we
reject the null hypothesis, we can conclude that at least one of the population means is
different from the rest, but we cannot determine which population means are different.
The F-test statistic can be calculated from the coefficient of determination, R2, obtained from the fitted regression model. If p is the number of regression parameters requiring estimation, excluding the intercept β0, then

F = R2 (n − p − 1) / ((1 − R2) p)
By analyzing the survey results, a model has been established to determine student height
using variables such as handspan, sex, and smoking status. The summary report contains
the coefficient of multiple determination, which can be extracted for further analysis.
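The fitted object survmult2 referred to below can be produced along these lines (a sketch; the exact formula used in the original is not shown, so the terms here are an assumption consistent with the variables named above):

```r
# Multiple regression of height on handspan, sex, and smoking status
library(MASS)  # provides the 'survey' data frame
survmult2 <- lm(Height ~ Wr.Hnd + Sex + Smoke, data = survey)
summary(survmult2)
```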
R> R2 <- summary(survmult2)$r.squared
R> R2
[1] 0.508469
The effective size of the dataset used in the fit can be determined by subtracting the number of observations deleted due to missingness (reported as 30 in the previous summary output) from the total number of rows.
R> n <- nrow(survey)-30
R> n
[1] 207
We get p as the number of estimated regression parameters minus 1, so that the intercept is excluded.
R> p <- length(coef(survmult2))-1
R> p
[1] 5
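With R2, n, and p in hand, the omnibus F statistic and its p-value follow directly from the formula above (R2, n, and p are assumed to hold the values computed in the preceding snippets):

```r
# Omnibus F statistic from the coefficient of determination
Fstat <- (R2 * (n - p - 1)) / ((1 - R2) * p)
Fstat                               # roughly 41.6 with the values above
# Upper-tail p-value on (p, n - p - 1) degrees of freedom
1 - pf(Fstat, df1 = p, df2 = n - p - 1)
```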
The plots are generated separately by the three lines of code provided, appearing from left
to right.
R> plot(x,y,type="l")
R> plot(x,y2,type="l")
R> plot(x,y3,type="l")
Logarithmic
In statistical modeling situations where positive numeric observations are present, it is customary
to perform a log transformation on the data. This transformation plays a crucial role in
significantly diminishing the overall range of the data and bringing extreme observations closer
to a measure of centrality. Consequently, adopting a logarithmic scale can effectively mitigate the
severity of heavily skewed data.
R> plot(1:1000,log(1:1000),type="l",xlab="x",ylab="",ylim=c(-8,8))
R> lines(1:1000,-log(1:1000),lty=2)
R> legend("topleft",legend=c("log(x)","-log(x)"),lty=c(1,2))
This graph illustrates the relationship between integers ranging from 1 to 1000 and their logarithms; the negative logarithm is also included. Notably, as the raw values increase, the logarithmically transformed values taper off and flatten out.
LINEAR MODEL SELECTION AND DIAGNOSTICS
The main aim of fitting any statistical model is to faithfully represent the data and the
relationships inherent within it. The process of fitting statistical models essentially involves
striking a balance between two key aspects: goodness-of-fit and complexity. Goodness-of-fit
refers to the objective of obtaining a model that accurately captures the relationships between the
response and the predictor (or predictors). Complexity, on the other hand, describes the level of
intricacy in a model, which is determined by the number of terms that require estimation.
The principle of parsimony
It is a concept in statistics that involves finding a balance between the goodness-of-fit and
complexity of a model. The aim is to select a model that is as simple as possible, with low
complexity, while still maintaining a reasonable level of goodness-of-fit.
Model Selection Algorithms
The aim of a model selection algorithm is to sift through your available explanatory variables in some systematic fashion in order to establish which of them are best able to jointly describe the response.
The use of model selection algorithms can be a topic of controversy due to the availability of
multiple methods. It is essential to recognize that no single approach can be universally
applicable for every regression model.
Forward Selection
The forward selection approach entails commencing with a model that consists solely of an intercept, followed by
conducting a set of independent tests to ascertain which predictor variables contribute
significantly to enhancing the goodness-of-fit. Subsequently, the model object is updated by
incorporating the identified term, and the series of tests is repeated for the remaining terms to
determine if any of them further enhance the fit. This iterative process continues until no
additional terms are found to significantly improve the fit in a statistically significant manner.
The ready-to-use R functions add1 and update facilitate the execution of these tests and the
subsequent update of the fitted regression model.
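A hypothetical sketch of one forward step, using the nuclear cost data from the boot package (object names and the candidate terms in scope are illustrative):

```r
library(boot)  # provides the 'nuclear' power station cost data
# Start from the intercept-only model
null.model <- lm(cost ~ 1, data = nuclear)
# Which single candidate term most improves the fit? Partial F-tests via add1:
add1(null.model, scope = . ~ . + date + cap + ne, test = "F")
# Incorporate the winning term, then repeat the tests on the updated model
step1 <- update(null.model, formula = . ~ . + cap)
add1(step1, scope = . ~ . + date + cap + ne, test = "F")
```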
Backward Selection
The process of forward selection initiates from a reduced model and gradually builds up to a
final model by incorporating additional terms. Conversely, backward selection commences with
the fullest model and systematically eliminates terms. In R, the functions drop1 and update are
utilized to inspect partial F-tests and update the models, respectively.
Revisit the nuclear example. First, define the fullest model as that which predicts cost by main
effects of all available covariates.
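The model summarized below can be fitted as follows (the object name is arbitrary; the nuclear data frame is in the boot package):

```r
library(boot)  # provides the 'nuclear' data
# Fullest model: cost predicted by main effects of all available covariates
nuc.full <- lm(cost ~ date + t1 + t2 + cap + pr + ne + ct + bw + cum.n + pt,
               data = nuclear)
summary(nuc.full)
```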
Call:
lm(formula = cost ~ date + t1 + t2 + cap + pr + ne + ct + bw +
cum.n + pt, data = nuclear)
Residuals:
Min 1Q Median 3Q Max
-128.608 -46.736 -2.668 39.782 180.365
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.135e+03 2.788e+03 -2.918 0.008222 **
date 1.155e+02 4.226e+01 2.733 0.012470 *
t1 5.928e+00 1.089e+01 0.545 0.591803
t2 4.571e+00 2.243e+00 2.038 0.054390 .
cap 4.217e-01 8.844e-02 4.768 0.000104 ***
pr -8.112e+01 4.077e+01 -1.990 0.059794 .
ne 1.375e+02 3.869e+01 3.553 0.001883 **
ct 4.327e+01 3.431e+01 1.261 0.221008
bw -8.238e+00 5.188e+01 -0.159 0.875354
cum.n -6.989e+00 3.822e+00 -1.829 0.081698 .
pt -1.925e+01 6.367e+01 -0.302 0.765401
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
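Backward steps then proceed by inspecting partial F-tests with drop1 and removing the least useful term; a sketch (the full model is refitted here so the snippet is self-contained, and the object name is illustrative):

```r
library(boot)  # provides the 'nuclear' data
nuc.full <- lm(cost ~ date + t1 + t2 + cap + pr + ne + ct + bw + cum.n + pt,
               data = nuclear)
# Partial F-tests for dropping each term from the full model
drop1(nuc.full, test = "F")
# Remove the least significant term (bw in the summary above) and re-test
nuc.1 <- update(nuc.full, formula = . ~ . - bw)
drop1(nuc.1, test = "F")
```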
Executing the plotting command generates a plot of the spatial locations of the 1000 seismic events in the quakes data frame. If Device 1, the null device, is the only device currently available, any plotting command that produces a new image will open a plotting window and draw into it.
R> dev.new()
The new window will be assigned the number 3 (typically, it positions itself above
the previously opened window, hence you might consider relocating it to a side
using your mouse).
At this point, you can enter the usual command to bring up the desired histogram
in Device 3:
R> hist(quakes$stations)
R> dev.set(2)
quartz
2
R> plot(quakes$long,quakes$lat,cex=0.02*quakes$stations,
xlab="Longitude",ylab="Latitude")
Switching back to Device 3, as a final tweak, add a vertical line marking off the mean
number of detecting stations.
R> dev.set(3)
quartz
3
R> abline(v=mean(quakes$stations),lty=2)
The two plots of the quakes data have been produced and manipulated, and the resulting
final outputs are showcased on my two visible graphics devices, specifically Device 2 (on
the left) and Device 3 (on the right).
Closing a Device
To conclude a graphics device, you have the option to click on the X button using your
mouse, just like closing any other window. Alternatively, you can make use of the dev.off
function.
R> dev.off(2)
quartz
3
Then repeat the call without an argument to close the remaining device:
R> dev.off()
null device
1
1. The plot region is the area in which the data are actually drawn, with respect to the user coordinates of the plot.
2. The figure region encompasses the area that accommodates the axes, their labels, and any titles. These spaces are also known as the figure margins.
3. The outer region, also known as the outer margins, is additional space surrounding the figure region that is not included by default but can be specified if necessary.
Default Spacing
Retrieve the default margin settings for your figure by calling the par function in R.
R> par()$oma
[1] 0 0 0 0
R> par()$mar
[1] 5.1 4.1 4.1 2.1
Note that oma=c(0, 0, 0, 0): there is no outer margin set by default. The default figure margin space is mar=c(5.1, 4.1, 4.1, 2.1); in other words, 5.1 lines at the bottom, 4.1 on the left, 4.1 at the top, and 2.1 on the right.
R> plot(1:10)
R> box(which="figure",lty=2)
In traditional (base) R graphics, these regions can be visualized with a solid-line box for the plot region, a dashed-line box for the figure region, and a dotted-line box for the outer region. On the left, the default settings are shown; on the right, the user has specified the outer and figure margin areas in "lines of text" through oma and mar, respectively, using par.
Custom Spacing
By modifying the spacing, we can create a plot that has customized outer margins. In this case,
the bottom, left, top, and right areas will have one, four, three, and two lines respectively.
Furthermore, the figure margins will be set to four, five, six, and seven lines. The code provided
will generate the corresponding result, which can be seen on the right side of above figure.
R> par(oma=c(1,4,3,2),mar=4:7)
R> plot(1:10)
R> box("figure",lty=2)
R> box("outer",lty=3)
When using R, the default square device may have its plot region compressed by irregular
margins, which are adjusted to make space for the defined spacing around the edges. However, if
you manipulate the graphical parameters excessively and squash the plot region too much, R will
generate an error message stating that the figure margins are too large.
To adjust the margin space for specific annotations of the plot, you can use the mtext function,
which is specifically designed to produce text in the figure or outer margins. By default, the text
is written in the figure margin, but you can position it in the outer region by setting outer=TRUE.
If you want to add more margin annotation to your plot, you can use the lines provided to do so.
Clipping
Clipping is a technique that enables you to add or draw elements to the margin regions of a plot
with respect to the user coordinates of the plot itself. For instance, you may want to place a
legend outside the plotting area or draw an arrow that extends beyond the plot region to highlight
a specific observation.
The xpd graphical parameter is responsible for controlling clipping in base R graphics. By
default, xpd is set to FALSE, which means that all drawing is clipped to the available plot region
only, except for special margin-addition functions like mtext. If you set xpd to TRUE, you can
draw things outside the formally defined plot region into the figure margins, but not into any
outer margins.
Drawing in all three areas, namely plot region, figure margins, and outer margins, can be
achieved by setting xpd to NA.
R> dev.new()
R> par(oma=c(1,1,5,1),mar=c(2,4,5,4))
R> boxplot(mtcars$mpg~mtcars$cyl,xaxt="n",ylab="MPG")
R> box("figure",lty=2)
R> box("outer",lty=3)
R> arrows(x0=c(2,2.5,3),y0=c(44,37,27),x1=c(1.25,2.25,3),y1=c(31,22,20), xpd=FALSE)
R> text(x=c(2,2.5,3),y=c(45,38,28),c("V4 cars","V6 cars","V8 cars"), xpd=FALSE)
The locator command allows you to find and return user coordinates.
To observe its functionality, start by executing the plot(1,1) command to generate a simple plot
with a single point at the center. To use the locator function, execute it without any arguments,
which will cause the console to "hang" without returning to the prompt. Afterwards, on an active
graphics device, your mouse cursor will change to a + symbol (you may need to click on your
device once to bring it to the forefront of your computer desktop). With the + cursor, you can
perform a series of left mouse clicks within the device, and R will silently record the precise user
coordinates. To end this process, simply right-click to terminate the command.
I detected four points located at arbitrary positions around the plotted point at (1, 1) on my
machine. These points were arranged in a clockwise order, starting from the top left and ending
at the bottom left. The output printed to my console is as follows:
R> plot(1,1)
R> locator()
$x
[1] 0.8275456 1.1737525 1.1440526 0.8201909
$y
[1] 1.1581795 1.1534442 0.9003221 0.8630254
You can also use locator to plot the points you select, either as individual points or as lines. Running the following code produces the result shown in the figure:
R> plot(1,1)
R> Rtist <- locator(type="o",pch=4,lty=2,lwd=3,col="red",xpd=TRUE)
R> Rtist
$x
[1] 0.5013189 0.6267149 0.7384407 0.7172250 1.0386740 1.2765699
[7] 1.4711542 1.2352573 1.2220592 0.8583484 1.0483300 1.0091491
$y
[1] 0.6966016 0.9941945 0.9636752 1.2819852 1.2766579 1.4891270
[7] 1.2439071 0.9630832 0.7625887 0.7541716 0.6394519 0.9618461
If you desire greater precision in an R plot, it is best to start with a "clean slate." This entails
understanding the default settings of specific graphical parameters when calling a plotting
function and how to remove elements like boxes and axes. This is the starting point. As an
example, let's plot MPG against horsepower (from the readily available mtcars data set) and size
each plotted point proportionally to the weight of each car. For convenience, create the following
objects:
The car weight vector is transformed by dividing it by its mean value. As a result, cars weighing less than the average have a value less than 1, while cars weighing more than the average have a value greater than 1. This property makes the vector an excellent choice for scaling the plotted points via the cex parameter.
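The three objects assumed by the plotting calls that follow can be set up as below (the names hp, mpg, and wtcex come from the later code; mtcars is built in):

```r
# Convenience objects for the MPG-versus-horsepower plots
hp    <- mtcars$hp
mpg   <- mtcars$mpg
# Weight scaled by its mean: < 1 for lighter-than-average cars, > 1 for heavier
wtcex <- mtcars$wt / mean(mtcars$wt)
```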
To plot the mtcars data, one can correlate MPG with horsepower while keeping the point size
proportional to the weight of the car. This can be done by using a plot call. The plot can be
displayed in three different ways. The first one is the default appearance. The second one can be
achieved by setting xaxs="i" and yaxs="i" to prevent buffer spacing on the limits of the axes.
The third one can be achieved by using xaxt, yaxt, xlab, ylab, and bty to suppress all box, axis,
and label drawing. Alternatively, this can be achieved by setting axes=FALSE and ann=FALSE.
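The three variants just described can be sketched as follows (self-contained, with the object names used in the text):

```r
hp    <- mtcars$hp
mpg   <- mtcars$mpg
wtcex <- mtcars$wt / mean(mtcars$wt)            # point sizes proportional to weight
plot(hp, mpg, cex = wtcex)                      # 1: default appearance
plot(hp, mpg, cex = wtcex,
     xaxs = "i", yaxs = "i")                    # 2: no buffer space at the axis limits
plot(hp, mpg, cex = wtcex, xaxt = "n", yaxt = "n",
     xlab = "", ylab = "", bty = "n")           # 3: box, axes, and labels suppressed
# equivalently for 3: plot(hp, mpg, cex = wtcex, axes = FALSE, ann = FALSE)
```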
Customizing Boxes
When initiating a suppressed-box or suppressed-axis plot, you can introduce a box that pertains
specifically to the current plot region in the active graphics device by utilizing the "box" function
and specifying its type using the "bty" parameter. For instance, if you commence with a plot
resembling the one depicted on the right side of the above figure (simply execute the most recent
line of code to obtain this), you can enhance it further by incorporating the following line,
resulting in the image presented on the left side of the below figure.
R> box(bty="l",lty=3,lwd=2)
R> box(bty="]",lty=2,col="gray")
Customizing Axes
Once you have adjusted the box according to your preferences, you can proceed to focus on the
axes. The axis function provides you with the ability to precisely manipulate and enhance the
presence of an axis on any of the four sides of the plot area. The initial parameter it requires is
the side, which is specified by a single integer: 1 (bottom), 2 (left), 3 (top), or 4 (right). These
integers correspond to the respective positions of the margin-spacing values that are relevant
when configuring graphical parameter vectors such as mar.
The built-in function pretty in R finds a “neat” sequence of values for the scale of each axis. However, we can set our own tick positions by passing the desired values through the at argument of axis.
R> hpseq <- seq(min(hp),max(hp),length=10)
R> plot(hp,mpg,cex=wtcex,xaxt="n",bty="n",ann=FALSE)
R> axis(side=1,at=hpseq)
R> axis(side=3,at=round(hpseq))
To begin with, a set of 10 values that are evenly spaced and cover the entire range of hp is stored
as hpseq. The initial plot command is used to hide the x-axis, the box, and any default axis
labels. However, the y-axis is allowed to be displayed as per its default settings. Then, the axis
function is called to draw the x-axis (side=1), with tick marks positioned at hpseq. Additionally,
another axis is drawn at the top (side=3), but this time the tick marks are placed at hpseq after
rounding it to the nearest integer.
Greek Symbols
Greek symbols or mathematical markup may be necessary for annotation in statistically or mathematically technical plots.
par(mar=c(3,3,3,3))
plot(1,1,type="n",xlim=c(-1,1),ylim=c(0.5,4.5),xaxt="n",yaxt="n", ann=FALSE)
text(0,4,label=expression(alpha),cex=1.5)
text(0,3,label=expression(paste("sigma: ",sigma ,"\t Sigma:",Sigma)), family="mono",cex=1.5)
text(0,2,label=expression(paste(beta," ",gamma," ",Phi)),cex=1.5)
text(0,1,label=expression(paste(Gamma,"(",tau,") = 24 when ",tau," = 5")),
family="serif",cex=1.5)
title(main=expression(paste("Gr",epsilon,epsilon,"k")),cex.main=2)
To obtain a single special character on its own, you can use expression(alpha) to generate α in the plot, as demonstrated in the first text call within the code chunk. The name of the desired symbol is specified without enclosing it in quotes. The title function, which allows you to add axis and main titles, accepts the same expressions.
Mathematical Expressions
The process of formatting complete mathematical expressions in R plots can prove to be slightly more complex, resembling the use of markup languages such as LaTeX.
R> expr1 <- expression(c^2==a[1]^2+b[1]^2)
R> expr2 <- expression(paste(pi^{x[i]},(1-pi)^(n-x[i])))
R> expr3 <- expression(paste("Sample mean: ", italic(n)^{-1}, sum(italic(x)[italic(i)],
italic(i)==1, italic(n)) ==frac(italic(x)[1]+...+italic(x)[italic(n)],
italic(n))))
R> expr4 <- expression(paste("f(x","|",alpha,",",beta,")"==frac(x^{alpha-1}~(1-x)^{beta-1},
B(alpha,beta))))
R> par(mar=c(3,3,3,3))
R> plot(1,1,type="n",xlim=c(-1,1),ylim=c(0.5,4.5),xaxt="n",yaxt="n", ann=FALSE)
R> text(0,4:1,labels=c(expr1,expr2,expr3,expr4),cex=1.5)
R> title(main="Math",cex.main=2)
Here are some important points to note:
1. Superscripts are indicated by ^ and subscripts by [ ]. For instance, c^2 renders as c² in expr1, and the a[1]^2 component renders as a₁².
2. Components can be grouped using parentheses ( ), which are visible, as in the (1-pi)^(n-x[i]) component of expr2. Alternatively, components can be grouped using braces { }, which are not visible, as in the pi^{x[i]} component.
3. Italicized alphabetic variables are displayed using the italic() function. For example, italic(n) displays an italic n in expr3.
Color plays a key role in many plots. Using color to differentiate between values and variables can greatly assist in interpreting your data and models effectively.
Palette
There are various conventional methods to generate and depict distinct colors, as well as
techniques to establish and employ a harmonious collection of colors, which is commonly known
as a palette.
Built-in Palettes
There are several color palettes included in the base R installation. These palettes are determined
by the functions rainbow, heat.colors, terrain.colors, topo.colors, cm.colors, gray.colors, and
gray.
The following code provided generates a total of 600 colors from every palette available.
R> N <- 600
R> rbow <- rainbow(N)
R> heat <- heat.colors(N)
R> terr <- terrain.colors(N)
R> topo <- topo.colors(N)
R> cm <- cm.colors(N)
R> gry1 <- gray.colors(N)
R> gry2 <- gray(level=seq(0,1,length=N))
In order to start a new plot, the process involves using vector repetition to place 600 points for
each palette. This is achieved by making a single call to the points function, which also ensures
the points are colored correctly based on the hex code vectors.
R> dev.new(width=8,height=3)
R> par(mar=c(1,8,1,1))
R> plot(1,1,xlim=c(1,N),ylim=c(0.5,7.5),type="n",xaxt="n",yaxt="n",ann=FALSE)
R> points(rep(1:N,7),rep(7:1,each=N),pch=19,cex=3,
col=c(rbow,heat,terr,topo,cm,gry1,gry2))
R> axis(2,at=7:1,labels=c("rainbow","heat.colors","terrain.colors",
"topo.colors","cm.colors","gray.colors","gray"), family="mono",las=1)
For additional information, access the help files ?gray.colors and ?gray, which provide details on
the grayscale palettes, while the rest of the options can be found under ?rainbow.
Custom Palettes
The colorRampPalette function enables the creation of custom palettes from two or more key colors, supplied via its colors argument. It returns a function that generates a palette smoothly transitioning between them.
A palette function is created by specifying the key colors to be interpolated, in the desired order, as a character vector of color names recognized by R. The following line of code creates a palette function that generates colors on a scale between purple and yellow.
R> puryel.colors <- colorRampPalette(colors=c("purple","yellow"))
The following code uses more than two key colors. A two-color ramp for the blues object used below can be defined the same way (its key colors here are chosen for illustration):
R> blues <- colorRampPalette(colors=c("navyblue","lightblue"))
R> fours <- colorRampPalette(colors=c("black","hotpink","seagreen4","tomato"))
R> patriot.colors <- colorRampPalette(colors=c("red","white","blue"))
We can now generate any number of colors from each range:
R> py <- puryel.colors(N)
R> bls <- blues(N)
R> frs <- fours(N)
R> pat <- patriot.colors(N)
R> dev.new(width=8,height=2)
R> par(mar=c(1,8,1,1))
R> plot(1,1,xlim=c(1,N),ylim=c(0.5,4.5),type="n",xaxt="n",yaxt="n",ann=FALSE)
R> points(rep(1:N,4),rep(4:1,each=N),pch=19,cex=3,col=c(py,bls,frs,pat))
R> axis(2,at=4:1,labels=c("puryel.colors","blues","fours","patriot.colors"),
family="mono",las=1)
Via Categorization
To color values based on a continuous variable, one approach is to transform it into a problem of
coloring points associated with a categorical variable. This can be achieved by dividing the
continuous values into a predetermined number of k categories, creating k colors from a chosen
palette, and assigning each observation to the corresponding color based on the bin it belongs to.
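As a sketch of this approach (the object names here are illustrative, not taken from the text): cut the continuous values into k bins, generate k colors from a palette, and index the palette by each observation's bin membership.

```r
# Continuous variable to color by: sepal length from the built-in iris data
sepal <- iris$Sepal.Length

# Divide the values into k = 5 equal-width bins; cut returns a factor
k <- 5
bins <- cut(sepal, breaks=k)

# Generate k colors from a custom palette, one per bin
pal <- colorRampPalette(c("lightblue","navyblue"))(k)

# Assign each observation the color of the bin it falls into
ptcols <- pal[as.numeric(bins)]

# Points are now shaded from light to dark with increasing sepal length
plot(iris$Petal.Length, iris$Sepal.Length, pch=19, col=ptcols,
     xlab="Petal length", ylab="Sepal length")
```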
The above plot is an example of assigning color to points based on a continuous value via categorization.
The following plot is an example of assigning color to points based on a continuous value via normalization.
Now that you have the ability to effectively utilize color in your plots, it is necessary to have a
legend that corresponds to the color scale. Although it is possible to create a legend using only
the fundamental tools in R, it is often easier to utilize contributed functionality within R.
This is found in the shape package, so first download and install shape from CRAN.
The above example is for implementing a color strip legend using the colorlegend
function from the contributed shape package.
Opacity
R assumes full opacity as the default when creating colors. However, if you explicitly set the
opacity using alpha, the hex codes will change slightly. Instead of six characters after the #, eight
characters will appear, with the last two representing the additional opacity information. Take a
look at the following lines of code that generate four different versions of the color red: default,
zero opacity, 40 percent opacity (0.4 × 255 = 102), and full opacity, respectively:
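The four lines of code referred to here are not reproduced in this excerpt; based on the rgb(...,maxColorValue=255) call used later in this section and rgb's alpha argument, they can be sketched as follows:

```r
# Default: no alpha channel, six hex digits after the #
rgb(255,0,0,maxColorValue=255)            # "#FF0000"

# Zero opacity: two extra hex digits "00" are appended
rgb(255,0,0,alpha=0,maxColorValue=255)    # "#FF000000"

# 40 percent opacity: 0.4 * 255 = 102, which is hex 66
rgb(255,0,0,alpha=102,maxColorValue=255)  # "#FF000066"

# Full opacity: alpha 255 (hex FF) stated explicitly
rgb(255,0,0,alpha=255,maxColorValue=255)  # "#FF0000FF"
```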
The initial and final colors are the same; however, the last hex code explicitly indicates full
opacity. The subsequent line of code converts the default red hex code generated by the first line
in the previous example into a version that has a 40 percent opacity.
R> adjustcolor(rgb(cbind(255,0,0),maxColorValue=255),alpha.f=0.4)
[1] "#FF000066"
This is another example showcasing the application of adjustcolor. In this demonstration, the
color sequence generated using depth.pal(20) is adjusted to have 60 percent opacity to match the
plotted points. The legend is positioned using posx and posy, and the optional argument left is set
to TRUE to display the tick marks and color legend labels on the left side of the strip. The final
outcome can be seen in following Figure.
RGB Alternatives and Further Functionality
3D Scatterplots
3D scatterplot tools enable the plotting of raw observations based on three continuous variables
concurrently, unlike the traditional 2D scatterplot, which can only handle two variables. There are
multiple approaches to generating scatterplots of three variables in R; however, the preferred
method is typically the scatterplot3d function from the contributed package of the same name.
Basic Syntax
The scatterplot3d function follows a syntax similar to the default plot function. In the default plot
function, you need to input a vector of x- and y-axis coordinates, whereas in the scatterplot3d
function, you only need to provide an extra third vector of values for the z-axis coordinates. This
additional dimension enables you to perceive the three axes in the following manner: the x-axis
increases from left to right, the y-axis increases from foreground to background, and the z-axis
increases from bottom to top.
The iris flower data, which was initially introduced in Section 14.4, comprises measurements on
four continuous variables (petal length/width and sepal length/width) and one categorical
variable (flower species). The iris data frame can be readily accessed from the R prompt,
eliminating the need to load any additional files. To swiftly access the measurement values
constituting the data (petal width, petal length, and sepal width, as extracted below), input the
following commands:
R> pwid <- iris$Petal.Width
R> plen <- iris$Petal.Length
R> swid <- iris$Sepal.Width
R> library("scatterplot3d")
R> scatterplot3d(x=pwid,y=plen,z=swid)
A general positive correlation can be observed among the three plotted variables in this context.
Additionally, there is a distinct cluster of observations in the foreground that exhibit relatively
large sepal widths but small petal measurements.
Fig : Two 3D scatterplots of the famous iris data with petal width, petal length, and sepal width
on the x-, y-, and z-axis, respectively. Left: Basic default appearance. Right: Tidying up titles and
adding visual enhancements to emphasize 3D depth and legibility via color and vertical line
marks.
Visual Enhancements
Perceiving depth in the plotted cloud of points can often pose a challenge, despite the inclusion
of default box and x-y plane grid lines. To address this issue, there are a few optional
enhancements available for scatterplot3d plots. One such enhancement involves coloring the
points to facilitate a clearer transition from the foreground to the background. Additionally,
setting the type="h" argument allows for the drawing of lines perpendicular to the x-y plane.
R> scatterplot3d(x=pwid,y=plen,z=swid,highlight.3d=TRUE,type="h",
lty.hplot=2,lty.hide=3,xlab="Petal width",
ylab="Petal length",zlab="Sepal width",
main="Iris Flower Measurements")
xlab, ylab, zlab, and main control the corresponding titles of the three axes and the plot itself.
The above figure is a 3D scatterplot of the famous iris data, displaying all five variables present,
with the additional use of color (for sepal length) and point character (for species).
A bivariate function can be used to create a smooth surface defined within the range of 1
to 6 on the x-axis and 1 to 4 on the y-axis. To achieve this, evenly spaced sequences can be
defined over each coordinate range (here, simply with the : operator):
R> xcoords <- 1:6
R> xcoords
[1] 1 2 3 4 5 6
R> ycoords <- 1:4
R> ycoords
[1] 1 2 3 4
When provided with two vectors, the expand.grid function generates all unique coordinate pairs
by repeating each value in the second vector against the entire length of the first vector.
The ready-to-use letters object in R allows you to generate letters of the alphabet quickly.
The 3D plots used to visualize a bivariate function require the z values corresponding to the x-y
evaluation grid in the form of an appropriately constructed matrix. The size of the z-matrix is
determined directly by the resolution of the evaluation grid; the number of rows corresponds to
the number of unique x grid values, and the number of columns corresponds to the number of
unique y grid values.
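Given the 6-by-4 grid above, a z-matrix of this kind can be built by filling a matrix column by column. This sketch uses the letters vector as the hypothetical "function result":

```r
xcoords <- 1:6
ycoords <- 1:4

# All unique (x, y) coordinate pairs; the first vector varies fastest
xy.grid <- expand.grid(x=xcoords, y=ycoords)

# Hypothetical "function result": one letter per grid point (24 in total)
z <- letters[1:nrow(xy.grid)]

# One row per unique x value, one column per unique y value; R fills
# matrices column by column, matching the expand.grid ordering
zmat <- matrix(z, nrow=length(xcoords), ncol=length(ycoords))
```

Printing zmat reproduces the 6-by-4 arrangement of "a" through "x" shown below.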
The following is the correct matrix representation of the hypothetical “function result” vector z:
R> zmat
[,1] [,2] [,3] [,4]
[1,] "a" "g" "m" "s"
[2,] "b" "h" "n" "t"
[3,] "c" "i" "o" "u"
[4,] "d" "j" "p" "v"
[5,] "e" "k" "q" "w"
[6,] "f" "l" "r" "x"
The R function contour is used to generate contours that connect x-y coordinates sharing the
same z value, based on a provided numeric z-matrix.
R> dim(volcano)
[1] 87 61
R> contour(x=1:nrow(volcano),y=1:ncol(volcano),z=volcano,asp=1)
Contours can show you not only the peaks and troughs in a surface like this but also the
"steepness" of any such features.
Color-Filled Contours
To introduce a slight variation to the contour plot, one can utilize color to fill the gaps between
the different levels that are illustrated. This approach, when combined with a color legend,
eliminates the requirement of labeling the contour lines and, in certain scenarios, facilitates the
visual interpretation of fluctuations in the plotted z-matrix surface. The filled.contour
function is used for this.
You are required to provide the ascending sequences of grid coordinates in both the x-axis and y-
axis directions, along with the corresponding z-matrix, to the parameters x, y, and z in a manner
similar to contour. The most convenient approach to indicate the colors is by supplying a color
palette to the color.palette argument.
R> filled.contour(x=hp.seq,y=wt.seq,z=car.pred.mat,
color.palette=colorRampPalette(c("white","red4")),
xlab="Horsepower",ylab="Weight",
key.title=title(main="Mean MPG",cex.main=0.8))
Figure 25-13: Filled contour plot of the response surface for the
fitted multiple linear model of the mtcars data
Pixel Images
A pixel image is the most accurate visual representation of a continuous surface that is
approximated by a finite evaluation grid. Its resemblance to a filled contour plot is evident, but
an image plot offers a more direct means of managing the display of each entry in the associated
z-matrix.
R> image(x=1:nrow(volcano),y=1:ncol(volcano),z=volcano,asp=1)
By setting asp=1, the aspect ratio of the horizontal and vertical axes is maintained at a one-to-one
ratio. The plot is composed of exactly 5307 pixels, with each pixel representing a specific entry
in the volcano matrix. The contour plot of the same data mirrors the features visible in this
image.
Perspective Plots
The perspective plot is sometimes also called a wireframe plot. In certain applications, it may be
necessary to accurately assess the degree of extremity of peaks and troughs in a plotted surface.
This task is more challenging with pixel images or contour plots, as they do not provide a clear
representation of relative extremity.
Coloring Facets
The optional col argument in traditional R plotting commands can be used to color the facets of a
perspective surface. If you want to color the surface with a constant color, you can simply
provide col with a single value.
However, if you're interested in using col to highlight the changing value of the bivariate
function, you would want to color the surface according to the fluctuating z-values. It's important
to note that the facets making up the surface are not the same as the pixels in a pixel image of the
z-matrix. The facets should be interpreted as the space between the border lines drawn at the
matrix entries, resulting in (m-1)(n-1) facets. In other words, each z-matrix entry lies at an
intersection of the drawn lines and is not situated in the middle of each facet.
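Because there are (m-1)(n-1) facets rather than one per matrix entry, a common idiom (adapted from the example in R's ?persp help page) averages the four corner z-values of each facet and bins the averages to pick a color per facet:

```r
z <- volcano
nrz <- nrow(z); ncz <- ncol(z)

# Average the four corners of every facet: an (nrz-1)-by-(ncz-1) matrix
zfacet <- (z[-1,-1] + z[-1,-ncz] + z[-nrz,-1] + z[-nrz,-ncz]) / 4

# Bin the facet heights and map each bin to one of nbcol terrain colors
nbcol <- 100
facetcol <- cut(zfacet, nbcol)

# col receives one color per facet, in the facets' drawing order
persp(x=1:nrz, y=1:ncz, z=z, col=terrain.colors(nbcol)[facetcol],
      theta=-30, phi=30)
```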
R> persprot(x=quak.dens$x,y=quak.dens$y,z=quak.dens$z,border="red3",shade=0.4,
ticktype="detailed",xlab="Longitude",ylab="Latitude", zlab="Kernel estimate")