Introduction To R
R is a system for statistical computation and graphics. R provides, among other things, a
programming language, high level graphics, interfaces to other languages and debugging
facilities.
The R language is a dialect of S which was designed in the 1980s and has been in widespread
use in the statistical community since. The language syntax has a superficial similarity with C.
The base R distribution contains functions and data to implement and illustrate most common
statistical procedures, including regression and ANOVA, classical parametric and nonparametric
tests, cluster analysis, density estimation and much more. R is open source software and its
home page is http://www.R-project.org/.
The system processes commands entered by the user, who types the commands at the
command prompt, or submits the commands from a file called a script to save retyping and to
separate commands from results. In a window system, users interact with R through the R console.
Basics
The R commands are entered at the prompt in the R console window. The prompt character is >
and when a line is continued the prompt changes to +. R is case sensitive.
Assignments
The right-to-left assignment operators are the left arrow <- and the equal sign =.
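As a small sketch of the two assignment operators and of case sensitivity (the object names x, y and X are made up for illustration):

```r
x <- 5      # left-arrow assignment
y = 10      # equal-sign assignment, equivalent here
x + y       # use the stored values
X <- 100    # X is a different object from x, because R is case sensitive
```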
Please NOTE: When specifying a file path, use / (forward slash) or \\ (double backslash), and not a single \.
Working directory: To view the current working directory, type getwd(). To change the
working directory, type setwd("pathname"). Change your working directory to the HSTS204
course folder you created.
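A minimal sketch of viewing and changing the working directory. Here tempdir() stands in for your own folder, and the Windows paths in the comments are hypothetical:

```r
old <- getwd()        # remember the current working directory
setwd(tempdir())      # tempdir() stands in for your own course folder
getwd()               # now shows the temporary folder
# On Windows both of these forms name the same (hypothetical) folder:
# setwd("C:/Users/student/HSTS204")
# setwd("C:\\Users\\student\\HSTS204")
setwd(old)            # return to the original directory
```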
Packages
An R installation contains a library of packages. Some of these packages are part of the basic
installation and others can be downloaded from Comprehensive R Archive Network (CRAN)
sites through mirror sites. We use the South African mirror site for downloads. You can create
your own packages.
A package is loaded into R using the library command, e.g. library(survival). The loaded
packages are not considered part of the workspace and if you terminate your session you need
to load them again when you start a new session.
Built in data
R has many built-in data sets, and some more are contained in the ISwR package.
To install this package you need to be connected to the internet and type the following command in an R
session:
install.packages("ISwR", .libPaths()[1])
Once installed, the package is loaded with library(ISwR).
The R Graphical User Interface has a Help menu to find and display online documentation for R
objects, methods, datasets, and functions. Through the Help menu one can find several manuals
in PDF form, an HTML help page, and help search utilities. The help search utilities are
also available at the command line: help("keyword") displays help for “keyword”, and
help.search("keyword") searches for all objects containing “keyword”. The corresponding
shortcuts are ? and ?? respectively. The quotes are optional in the help command, except for
special characters, and are required in the help.search command.
Example Type :
R example/Tutorial
R also provides a function example that runs all of the examples if any exist for the keyword. To
see the examples for the function mean, type example(mean).
Session management
The workspace
All variables stored in R are stored in a common workspace. To see the variables that are
defined in a workspace, type ls() (list).
It is possible to delete some of the objects using the command rm(x,y,z) (remove).
It is possible to save the workspace to a file at any time using: save.image() and it will be saved
with file extension .RData.
All the commands typed in an R session are saved upon exit in a file called .Rhistory under the
working directory. You can use a text editor to edit the .Rhistory.
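The workspace commands above might be tried as follows. The object names and the file name demo.RData are invented for the illustration, and the file is written to a temporary folder:

```r
a <- 1:3
b <- "some text"
ls()                                      # lists the workspace, including a and b
rm(a)                                     # remove a; b is still there
f <- file.path(tempdir(), "demo.RData")   # hypothetical file in a temporary folder
save.image(file = f)                      # save the whole workspace to that file
file.exists(f)
```

Calling save.image() with no arguments writes to the file .RData in the working directory.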
R objects
In every computer language variables provide a means of accessing the data stored in memory.
R does not provide direct access to the computer’s memory but rather provides a number
of specialized data structures we will refer to as objects. The entities that R creates and
manipulates are known as objects. These objects are referred to through
symbols or variables. In R, however, the symbols are themselves objects and can be
manipulated in the same way as any other object.
During an R session, objects are created and stored by name. The R command
> objects()
(alternatively, ls()) can be used to display the names of (most of) the objects which are
currently stored within R. The collection of objects currently stored is called the workspace.
1) Vectors
Vectors can be thought of as contiguous cells containing data. Cells are accessed through
indexing operations; for example, x[5] is the 5th element of the vector x.
R has six basic (‘atomic’) vector types: logical, integer, real, complex, string (or character)
and raw.
Single numbers, such as 4.2, and strings, such as "four point two" are still vectors, of length
1; there are no more basic types. Vectors with length zero are possible (and useful).
String vectors have mode and storage mode "character". A single element of a character
vector is often referred to as a character string.
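A short sketch of the six atomic types, using made-up values. Note that typeof() reports the real type as "double":

```r
x <- c(4.2, 7, 1.5, 9, 3)      # illustrative values
x[5]                           # the 5th element
typeof(TRUE)                   # "logical"
typeof(2L)                     # "integer"
typeof(4.2)                    # "double" (the real type)
typeof(1+2i)                   # "complex"
typeof("four point two")       # "character"
typeof(as.raw(255))            # "raw"
length(4.2)                    # a single number is a vector of length 1
length(numeric(0))             # a vector of length zero
mode("a"); storage.mode("a")   # both are "character"
```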
2) Lists
Lists (“generic vectors”) are another kind of data storage. Lists have elements, each of which
can contain any type of R object, i.e. the elements of a list do not have to be of the same type.
List elements are accessed through three different indexing operations.
Lists are vectors, and the basic vector types are referred to as atomic vectors where it is
necessary to exclude lists.
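The three indexing operations for lists are single brackets, double brackets and the dollar sign. A sketch with a hypothetical record:

```r
p <- list(name = "Peter", age = 30, marks = c(65, 72, 58))  # hypothetical record
p["age"]       # single brackets: a list containing the selected element
p[["age"]]     # double brackets: the element itself
p$marks        # dollar sign: the element selected by name
is.vector(p)   # lists are vectors
is.atomic(p)   # but they are not atomic vectors
```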
3) Language objects
There are three types of objects that constitute the R language. They are calls, expressions,
and names. These objects have modes "call", "expression", and "name", respectively.
They can be created directly from expressions using the quote mechanism and converted to
and from lists by the as.list and as.call functions.
Symbol objects
Symbols refer to R objects. The name of any R object is usually a symbol. Symbols can be
created through the functions as.name and quote.
4) Expression objects
An expression contains one or more statements.
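The language objects can be explored with a small sketch such as the following (the names cl, nm, ex and x are illustrative):

```r
cl <- quote(mean(x))               # a call object
mode(cl)                           # "call"
nm <- as.name("x")                 # a symbol (name) object
mode(nm)                           # "name"
ex <- expression(1 + 2, sin(pi))   # an expression holding two statements
mode(ex)                           # "expression"
parts <- as.list(cl)               # call to list: the function name, then the arguments
cl2 <- as.call(parts)              # list back to call
cl2
x <- c(2, 4, 6)
eval(cl)                           # evaluates mean(x) with the current x
```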
5) Function objects
In R functions are objects and can be manipulated in much the same way as any other object.
Functions (or more precisely, function closures) have three basic components:
- a formal argument list: the argument list is a comma-separated list of arguments;
- a body: the body is a parsed R statement, which is usually a collection of statements in braces
(‘{’ and ‘}’), but it can be a single statement, a symbol or even a constant;
- an environment: a function’s environment is the environment that was active at the time
that the function was created.
The syntax for writing a function is function ( arglist ) body
The function declaration is the keyword function, which indicates to R that you want to create a
function.
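The three components of a function can be inspected with formals, body and environment. A sketch using a hypothetical function called area:

```r
area <- function(width, height = 1) {   # hypothetical function for illustration
  width * height
}
formals(area)       # the formal argument list: width, and height with default 1
body(area)          # the body: the statements in braces
environment(area)   # the environment in which the function was created
area(3, 4)
```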
6) NULL
There is a special object called NULL. It is used whenever there is a need to indicate or specify
that an object is absent. It should not be confused with a vector or list of zero length.
The NULL object has no type and no modifiable properties.
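A quick check of how NULL differs from a zero-length list or vector (z is an illustrative name):

```r
z <- NULL
is.null(z)            # TRUE
length(z)             # 0
is.null(list())       # FALSE: an empty list is not NULL
is.null(numeric(0))   # FALSE: a zero-length vector is not NULL
```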
7) Builtin objects and special forms
These two kinds of object contain the builtin functions of R, i.e., those that are displayed as
.Primitive in code listings (as well as those accessed via the .Internal function and hence not
user-visible as objects). The difference between the two lies in the argument handling. Builtin
functions have all their arguments evaluated and passed to the internal function, in accordance
with call-by-value, whereas special functions pass the unevaluated arguments to the internal
function.
The other objects include: Promise objects, Dot-dot-dot, Pairlist objects and Environments.
Environments can be thought of as consisting of two things. A frame, consisting of a set of
symbol-value pairs, and an enclosure, a pointer to an enclosing environment.
(i) Factors
Factors are used to describe items that can have a finite number of values (categorical
variables). A factor may be purely nominal or may have ordered categories.
(ii) Data frame objects
Data frames are the R structures which most closely mimic the SAS or SPSS data set, i.e. a
“cases by variables” matrix of data.
A data frame is a list of vectors, factors, and/or matrices all having the same length (number
of rows in the case of matrices). In addition, a data frame generally has names for the
variables.
Objects Attributes
All objects except NULL can have one or more attributes attached to them. Attributes are stored
as a pairlist where all elements are named, but should be thought of as a set of name=value
pairs.
The following are the basic attributes of an object:
Names: A names attribute, when present, labels the individual elements of a vector or list. When
an object is printed, the names attribute, when present, is used to label the elements.
Dimensions: The dim attribute is used to implement arrays. The content of the array is stored in
a vector in column-major order and the dim attribute is a vector of integers specifying the
respective extents of the array. R ensures that the length of the vector is the product of the
lengths of the dimensions. For example, matrices and arrays are simply vectors with the
attribute dim attached to the vector. A dimension vector is a vector of non-negative integers.
Dimnames: Arrays may name each dimension separately using the dimnames attribute, which is
a list of character vectors.
Classes: R has an elaborate class system, principally controlled via the class attribute. This
attribute is a character vector containing the list of classes that an object inherits from. This forms the
basis of the “generic methods” functionality in R.
Time series attributes: The tsp attribute is used to hold parameters of time series, start, end,
and frequency. This construction is mainly used to handle series with periodic substructure
such as monthly or quarterly data.
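The basic attributes can be seen at work in a sketch like this one. All values and names are made up, and the quarterly series is purely illustrative:

```r
v <- c(a = 1, b = 2, c = 3)     # the names attribute labels the elements
names(v)
m <- 1:6
dim(m) <- c(2, 3)               # a vector with a dim attribute is a matrix
dimnames(m) <- list(c("r1", "r2"), c("c1", "c2", "c3"))
m["r2", "c3"]                   # 6, because the values are stored column by column
class(m)
q <- ts(c(5, 7, 6, 8), start = c(2020, 1), frequency = 4)  # quarterly series
tsp(q)                          # start, end and frequency
attributes(q)                   # all attributes of the object
```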
Execution of commands in R
When a user types a command at the prompt (or when an expression is read from a file), the
command is transformed by the parser/compiler into an internal representation and the
evaluator executes parsed R expressions and returns the value of the expression. All
expressions have a value. This is the core of the language.
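The two stages, parsing and evaluation, can be imitated by hand with the functions parse and eval. This is only a sketch of the idea, not a description of R's internals:

```r
e <- parse(text = "1 + 2 * 3")   # the parser turns text into an expression
class(e)                         # "expression"
e[[1]]                           # the parsed call
eval(e[[1]])                     # the evaluator returns its value, 7
```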
Data Entry
Basics
Recall that R has objects and modes. Objects are anything that you can give a name. There are
many different classes of objects. The main classes of interest here are vector, matrix, factor, list,
and data frame. The mode of an object tells what kind of things are in it. The main modes of
interest here are logical, numeric, and character.
(i) Typing
(a) c : Used to combine values into a vector.
(i) Numeric vectors- c(2,4,6,8,10)
(ii) Character vectors- c("Gerald", "Peter", "Alfred", "Mildred", "Tafadzwa") : a vector of text
string elements, which are specified and printed in quotes; it does not matter whether single
or double.
(iii) Logical vectors- c(T,T,F,T,F) : elements can take the value TRUE, FALSE or NA.
(b) seq (sequence) : Used for equidistant series of numbers, e.g. seq(2, 10) or seq(4,20,2)
(c) rep (replicate): Used to generate repeated values, e.g. x<-c(5,10,15), rep(x,4), rep(x, 1:3) or
rep(1:3, c(8,10,12))
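The three functions can be tried directly at the prompt. A short sketch:

```r
c(2, 4, 6, 8, 10)
seq(2, 10)               # 2 3 4 5 6 7 8 9 10
seq(4, 20, 2)            # from 4 to 20 in steps of 2
x <- c(5, 10, 15)
rep(x, 4)                # the whole vector repeated 4 times
rep(x, 1:3)              # 5 10 10 15 15 15
rep(1:3, c(8, 10, 12))   # 8 ones, 10 twos and 12 threes
```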
Vectorised arithmetic
You can do calculations with vectors just like ordinary numbers and operations are applied
element by element.
> bmi <- weight/height^2
> bmi
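For the calculation to run, weight and height must already exist as vectors of the same length. A sketch with made-up measurements (weight in kg, height in metres):

```r
weight <- c(60, 72, 57, 90, 95, 72)               # kg, illustrative values
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)   # metres, illustrative values
bmi <- weight / height^2                          # computed element by element
round(bmi, 1)
```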
Handling categorical vectors (factors)
A factor is a vector object used to specify a discrete classification (grouping) of the components
of other vectors of the same length. R provides both ordered and unordered factors. A factor is
similarly created using the factor() function applied on a vector of numbers or characters.
Example: We want to capture the sex of the respondents in the data set in the table below.
> sex = c("Female", "Male", "Female", "Female", "Male", "Female", "Female", "Male", "Male")
> sexf = factor(sex)
> sexf
The command below can be used to get the levels of the factor directly without listing the
factor.
> levels(sexf)
Alternatively, we can create the factor by specifying the vector of values, the levels and the labels.
> ben = c(2,1,1,1,2,1,1,1,1,1)
> possible.ben = c(1,2)
> labels.ben = c("Beneficiary", "Non Beneficiary")
> benf = factor(ben, levels = possible.ben, labels = labels.ben)
The function table() allows frequency tables to be calculated from equal length factors.
The function tapply allows one to do analysis by a categorising variable. For example, we may
require average income by sex of HHH.
Enter the income vector in the same order as the categorising variable; the two must have the same length:
> income = c(1200,380,900,482,2400,800,680,800,450,720)
> incmeans <- tapply(income, sexf, mean)
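A self-contained sketch of table and tapply, using hypothetical data for six respondents rather than the course data set:

```r
# Hypothetical data for six respondents
sex    <- factor(c("Female", "Male", "Female", "Male", "Female", "Male"))
income <- c(1200, 380, 900, 500, 600, 920)
table(sex)                            # frequency of each level
incmeans <- tapply(income, sex, mean) # mean income within each level of sex
incmeans
```

Here the mean is 900 for Female and 600 for Male.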
Reading from a text file is the most convenient way of getting data into R. Use the command: read.table(path,
header=T). It requires the data to be in ASCII (American Standard Code for Information
Interchange) format, which is the format created by any plain text editor such as Windows Notepad. This results
in a data frame. The first line of the data can contain a header.
The read.table command assumes fields are separated by whitespace. Variants of the
command are : read.csv and read.csv2 which assume that fields are separated by comma and
semicolons respectively. Another variant is read.delim or read.delim2 for reading delimited
files for which the default delimiter is the Tab character.
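To try these commands without preparing a file by hand, the following sketch writes a small text file to a temporary folder and reads it back. The file names and the values are invented:

```r
f <- file.path(tempdir(), "demo.txt")   # hypothetical file in a temporary folder
writeLines(c("id weight height",
             "1 60 1.75",
             "2 72 1.80",
             "3 57 1.65"), f)
d <- read.table(f, header = TRUE)       # whitespace-separated, first line is a header
d
g <- file.path(tempdir(), "demo.csv")
write.csv(d, g, row.names = FALSE)      # the same data, comma-separated
d2 <- read.csv(g)
str(d2)
```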
To import data from other statistical packages, the simplest way is to request the package to export
the data as a text file (one of the forms stated in (ii) above). Alternatively, the foreign package is
recommended for handling other formats like SPSS, SAS, Stata, Minitab etc.
Data frames
A data frame corresponds to what is commonly referred to as a data matrix or a data set. It is a
list of vectors and/or factors of the same length which are related across, so that values in the
same position belong to the same subject.
e.g.
> y1 = c(1,2,3,4,5)
> y2 = c(2,4,6,8,10) # illustrative values
> y3 = c(7,8,9,10,11)
> ydata = data.frame(y1,y2,y3)
> dim(dataframe_name) # displays the dimension of the data frame: number of rows and number of variables
> summary(dataframe_name) # gives appropriate summary statistics for all the variables
> attach(dataframe_name)
> variable_name
Indexing
Used for selection of data in a vector, e.g. z<-(5:12), z[6] will give the element at position
6 of the vector z.
Indexing can also be used to select data in a data frame, e.g. d[6,5], will report the value of the
5th variable for the 6th subject in the data frame d.
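A few more indexing forms, sketched on a small vector and on a hypothetical data frame d:

```r
z <- 5:12
z[6]                 # 10, the element at position 6
z[c(1, 3)]           # several positions at once
z[z > 9]             # elements satisfying a condition
d <- data.frame(id = 1:6, score = c(50, 62, 71, 48, 90, 66))  # hypothetical data
d[6, 2]              # row 6, column 2
d[d$score > 60, ]    # rows satisfying a condition
d$score[2]           # a column by name, then a position within it
```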
(i) Using the command data.entry : Allows you to edit numeric variables in the workspace.
(ii) Using the edit function: This command requires you to call the data frame to display using
the command : data(filename) then type newname<-edit(filename). This brings up a
spreadsheet-like editor with a column for each variable in the data frame. Inside the editor, you
can move around with a mouse or cursor keys and edit the current cell by typing in data. The
original data frame is left intact.
Missing values
R allows vectors to contain a special value NA (not available), and computations and operations
on it yield NA as the result.
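A short sketch of how NA behaves, using made-up values. The function is.na and the argument na.rm are the usual ways of detecting and skipping missing values:

```r
v <- c(12, NA, 7, 30)     # illustrative values with one missing entry
v + 1                     # the missing entry stays NA
mean(v)                   # NA, because one value is unknown
is.na(v)                  # flags the missing entry
mean(v, na.rm = TRUE)     # drops NA before computing the mean
```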
In R matrices and arrays are represented as vectors with dimensions. An array can be
considered as a multiply subscripted collection of data entries. Matrices can be created using
different functions:
(i) dim function, e.g.
> x <- 1:12
> dim(x) <- c(3,4)
(ii) matrix function, e.g. matrix(1:12,nrow=3,byrow=T)
(iii) cbind and rbind ‘glue’ vectors together columnwise or rowwise respectively.
Matrix operations:
> A %*% B # matrix product
> eigen(x) # eigenvalues and eigenvectors (x must be a square matrix)
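The matrix commands can be tried as follows. The matrices A and B are small illustrative examples chosen so that the results are easy to check by hand:

```r
x <- 1:12
dim(x) <- c(3, 4)                           # 3 x 4, filled column by column
m <- matrix(1:12, nrow = 3, byrow = TRUE)   # 3 x 4, filled row by row
cbind(1:3, 4:6)                             # two columns glued together
rbind(1:3, 4:6)                             # two rows glued together
A <- matrix(c(2, 0, 0, 3), nrow = 2)        # illustrative 2 x 2 matrix
B <- matrix(1:4, nrow = 2)
A %*% B                                     # matrix product
eigen(A)$values                             # eigenvalues of a square matrix
```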
R Scripts
R commands can be placed in a file, called an R script, and can be run using the source function or by copying and pasting them into the console.
Using the source function causes R to accept input from the named source, such as a file.
In the R GUI users can open a new script window through the File menu. R scripts will be saved
with extension .R.
When the source function is used, auto-printing of expressions does not happen, so we need to add
print statements to the script so that the values of objects will be printed. The command you
type is source("filename.R").
Example
summary(lm (y~x1+x2))#regression
anova(lm (response~factor))#anova
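To see the effect of source on printing, the sketch below writes a small script to a temporary folder and runs it. The file name demo.R and the data are made up:

```r
script <- file.path(tempdir(), "demo.R")   # hypothetical script file
writeLines(c(
  "x <- c(1, 2, 3, 4, 5)",
  "y <- c(2.1, 3.9, 6.2, 7.8, 10.1)",
  "fit <- lm(y ~ x)",
  "coef(fit)",          # not displayed when sourced: no auto-printing
  "print(coef(fit))"    # displayed: explicit print
), script)
source(script)
```

Calling source(script, echo = TRUE) instead echoes each command and prints its value, much as if the commands had been typed at the prompt.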
Functions
R users interact with the software primarily through functions. The syntax of a function is
f <- function(x, ...) { }
where f is the name of the function, x is the name of the first argument (there can be several
arguments), and ... indicates possible additional arguments.
Functions can be defined with no arguments, also. The curly brackets enclose the body of the
function. The return value of a function is the value of the last expression evaluated.
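A small sketch of these points. The functions f and hello are hypothetical; f centres a vector about its mean and passes any extra arguments on to mean:

```r
f <- function(x, ...) {
  m <- mean(x, ...)     # extra arguments in ... are passed on to mean
  x - m                 # last expression evaluated: this is the return value
}
f(c(1, 2, 3))                    # -1 0 1
f(c(1, NA, 3), na.rm = TRUE)     # -1 NA 1
hello <- function() "hello"      # a function with no arguments
hello()
```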
Graphics
One of the most attractive features of R is that it gives a fine control of graphic components. You
can specify the plotting parameters like, plotting characters, line types etc. Learn more using the
help function.
e.g. plot(x,y), using any two variables of your choice, or plot(x,y,pch=3) to change the plotting character.
Probability functions:
R provides functions for the density, cumulative distribution function (CDF), quantiles (percentiles), and for
generating random variates for many commonly applied distributions. For the Poisson
distribution these functions are dpois, ppois, qpois, and rpois, respectively. For the normal
distribution these functions are dnorm, pnorm, qnorm, and rnorm.
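The four kinds of function can be tried as below. The Poisson mean of 3 and the seed are arbitrary choices for the illustration:

```r
dnorm(0)                 # density of the standard normal at 0
pnorm(1.96)              # P(Z <= 1.96), about 0.975
qnorm(0.975)             # the 97.5th percentile, about 1.96
set.seed(1)              # arbitrary seed so the random draws can be repeated
rnorm(3)                 # three random standard normal variates
dpois(2, lambda = 3)     # P(X = 2) for a Poisson variable with mean 3
ppois(2, lambda = 3)     # P(X <= 2)
qpois(0.5, lambda = 3)   # the median of the distribution
rpois(3, lambda = 3)     # three random Poisson variates
```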