Introduction To R


R is a system for statistical computation and graphics. R provides, among other things, a
programming language, high level graphics, interfaces to other languages and debugging
facilities.
The R language is a dialect of S which was designed in the 1980s and has been in widespread
use in the statistical community since. The language syntax has a superficial similarity with C.

The base R distribution contains functions and data to implement and illustrate most common
statistical procedures, including regression and ANOVA, classical parametric and nonparametric
tests, cluster analysis, density estimation and much more. R is open source software and its
home page is http://www.R-project.org/.

The system processes commands entered by the user, who types the commands at the
command prompt, or submits the commands from a file called a script to save retyping and to
separate commands from results. In a window system, users interact with R through the R console.

Each command or expression to be evaluated is typed at the command prompt, and is
immediately evaluated when the Enter key is pressed at the end of a syntactically complete
statement.

-Press the up-arrow key to recall commands and edit them.


-Use the Esc (Escape) key to cancel a command.

Basics

The command prompt

The R commands are entered at the prompt in the R console window. The prompt character is >
and when a line is continued the prompt changes to +. R is case sensitive.

Comments: In R, comments begin with a # symbol.

Assignments

The right-to-left assignment operators are the left arrow <- and the equals sign =.

Please note: when specifying a file path, use / (forward slash) or \\ (double backslash), not a
single \.

Working directory: To view the current working directory, type getwd(). To change the
working directory, type setwd("pathname"). Change your working directory to the HSTS204
course folder you created.
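As a minimal sketch of the two commands above, the following switches to another directory and back again. The use of tempdir() is only a stand-in for your own course folder path, which is not known here.

```r
# Remember the current working directory so we can return to it.
old_dir <- getwd()

# tempdir() is a placeholder for this sketch; in practice you would pass
# the path to your own course folder, written with forward slashes.
setwd(tempdir())
print(getwd())

# Go back to where we started.
setwd(old_dir)
```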

Packages

An R installation contains a library of packages. Some of these packages are part of the basic
installation and others can be downloaded from Comprehensive R Archive Network (CRAN)
sites through mirror sites. We use the South African mirror site for downloads. You can create
your own packages.

A package is loaded into R using the library command, e.g. library(survival). The loaded
packages are not considered part of the workspace and if you terminate your session you need
to load them again when you start a new session.

Built in data

R has many built-in data sets, and some more are contained in the ISwR package.

To install this package you need to be connected to the internet. Type the following command
in an R session:

install.packages("ISwR", .libPaths()[1])

The R Help System

The R Graphical User Interface has a Help menu to find and display online documentation for R
objects, methods, datasets, and functions. Through the Help menu one can find several manuals
in PDF form, an html help page, and help search utilities. The help search utilities are also
available at the command line: help("keyword") displays help for “keyword”, and
help.search("keyword") searches the documentation for all entries matching “keyword”. The
corresponding shortcuts are ? and ?? respectively. The quotes are optional in the help
command, except for special characters, and are required in the help.search command.

Example: type

?barplot   # displays help for the barplot topic

??plot     # searches for anything containing "plot"

R example/Tutorial

R also provides a function example that runs all of the examples if any exist for the keyword. To
see the examples for the function mean, type example(mean).

Session management

The workspace

All variables stored in R are stored in a common workspace. To see the variables that are
defined in a workspace, type ls() (list).

It is possible to delete some of the objects using the command rm(x,y,z) (remove).

It is possible to save the workspace to a file at any time using: save.image() and it will be saved
with file extension .RData.

All the commands typed in an R session are saved upon exit in a file called .Rhistory under the
working directory. You can use a text editor to edit the .Rhistory.
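
The workspace commands above can be tried out as follows. This is only a sketch: the object names and the temporary file passed to save.image() are made up for the illustration (calling save.image() with no argument writes .RData in the working directory instead).

```r
# Create a few objects in the workspace.
x <- 1:3
y <- "a"
z <- TRUE

print(ls())      # lists the objects defined so far, including x, y and z

rm(x, y)         # remove two of them
print(ls())      # x and y are gone, z is still there

# Save what is left of the workspace to a file (hypothetical file name).
ws_file <- tempfile(fileext = ".RData")
save.image(ws_file)
```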
R objects

In every computer language variables provide a means of accessing the data stored in memory.
R does not provide direct access to the computer’s memory but rather provides a number
of specialized data structures we will refer to as objects. The entities that R creates and
manipulates are known as objects. These objects are referred to through
symbols or variables. In R, however, the symbols are themselves objects and can be
manipulated in the same way as any other object.

During an R session, objects are created and stored by name. The R command
> objects()
(alternatively, ls()) can be used to display the names of (most of) the objects which are
currently stored within R. The collection of objects currently stored is called the workspace.

The list below gives some of the basic R objects:

1) Vectors
Vectors can be thought of as contiguous cells containing data. Cells are accessed through
indexing operations, e.g. x[5] means the 5th element of the vector x.
R has six basic (‘atomic’) vector types: logical, integer, real (double), complex, string (or
character) and raw.
Single numbers, such as 4.2, and strings, such as "four point two" are still vectors, of length
1; there are no more basic types. Vectors with length zero are possible (and useful).
String vectors have mode and storage mode "character". A single element of a character
vector is often referred to as a character string.
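
A short sketch of these ideas, using typeof() and length() to inspect a few vectors (the variable name x is arbitrary):

```r
x <- c(2, 4, 6, 8, 10)
fifth <- x[5]                   # the 5th element of x
print(fifth)

print(typeof(x))                # "double" (the real type)
print(typeof(1:3))              # "integer"
print(typeof(c(TRUE, FALSE)))   # "logical"
print(typeof("four point two")) # "character"

print(length(4.2))              # a single number is a vector of length 1
print(length(numeric(0)))       # a vector of length zero
```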

2) Lists
Lists (“generic vectors”) are another kind of data storage. Lists have elements, each of which
can contain any type of R object, i.e. the elements of a list do not have to be of the same type.
List elements are accessed through three different indexing operations: [ ], [[ ]] and $.
Lists are vectors, and the basic vector types are referred to as atomic vectors where it is
necessary to exclude lists.
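
The following sketch shows a list holding elements of different types and the three ways of indexing it. The list contents and names are invented for the illustration.

```r
# A list whose elements have different types and lengths.
person <- list(name = "Betty", year = 1952, sizes = c(5, 7, 12))

part <- person["name"]     # single brackets return a list of length 1
nm   <- person[["name"]]   # double brackets return the element itself
yr   <- person$year        # $ extracts an element by name

print(part)
print(nm)
print(yr)
```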
3) Language objects
There are three types of objects that constitute the R language. They are calls, expressions,
and names. These objects have modes "call", "expression", and "name", respectively.
They can be created directly from expressions using the quote mechanism and converted to
and from lists by the as.list and as.call functions.
Symbol objects
Symbols refer to R objects. The name of any R object is usually a symbol. Symbols can be
created through the functions as.name and quote.
4) Expression objects
An expression contains one or more statements.
5) Function objects
In R functions are objects and can be manipulated in much the same way as any other object.
Functions (or more precisely, function closures) have three basic components:
-a formal argument list: the argument list is a comma-separated list of arguments;
-a body: the body is a parsed R statement, which is usually a collection of statements in braces
(‘{’ and ‘}’), but it can be a single statement, a symbol or even a constant;
-an environment: a function’s environment is the environment that was active at the time
that the function was created.
The syntax for writing a function is function ( arglist ) body. The function declaration starts
with the keyword function, which indicates to R that you want to create a function.
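The three components can be inspected with formals(), body() and environment(). The function square_plus below is a made-up example used only for this sketch.

```r
# A small function with two formal arguments; a has a default value.
square_plus <- function(x, a = 1) {
  x^2 + a
}

print(formals(square_plus))       # the formal argument list
print(body(square_plus))          # the body
print(environment(square_plus))   # the environment where it was created

result <- square_plus(3)          # 3^2 + 1
print(result)
```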
6) NULL
There is a special object called NULL. It is used whenever there is a need to indicate or specify
that an object is absent. It should not be confused with a vector or list of zero length.
The NULL object has no type and no modifiable properties.
7) Builtin objects and special forms
These two kinds of object contain the builtin functions of R, i.e., those that are displayed as
.Primitive in code listings (as well as those accessed via the .Internal function and hence not
user-visible as objects). The difference between the two lies in the argument handling. Builtin
functions have all their arguments evaluated and passed to the internal function, in accordance
with call-by-value, whereas special functions pass the unevaluated arguments to the internal
function.
The other objects include: Promise objects, Dot-dot-dot, Pairlist objects and Environments.
Environments can be thought of as consisting of two things. A frame, consisting of a set of
symbol-value pairs, and an enclosure, a pointer to an enclosing environment.

8) Special compound Objects

(i)Factors
Factors are used to describe items that can have a finite number of values (categorical
variables). A factor may be purely nominal or may have ordered categories.
(ii)Data frame objects
Data frames are the R structures which most closely mimic the SAS or SPSS data set, i.e. a
“cases by variables” matrix of data.
A data frame is a list of vectors, factors, and/or matrices all having the same length (number
of rows in the case of matrices). In addition, a data frame generally has names for the
variables.
Objects Attributes
All objects except NULL can have one or more attributes attached to them. Attributes are stored
as a pairlist where all elements are named, but should be thought of as a set of name=value
pairs.
The following are the basic attributes of an object:
Names: A names attribute, when present, labels the individual elements of a vector or list. When
an object is printed, the names attribute, when present, is used to label the elements.
Dimensions: The dim attribute is used to implement arrays. The content of the array is stored in
a vector in column-major order and the dim attribute is a vector of integers specifying the
respective extents of the array. R ensures that the length of the vector is the product of the
lengths of the dimensions. For example, matrices and arrays are simply vectors with the
attribute dim attached to the vector. A dimension vector is a vector of non-negative integers.
Dimnames: Arrays may name each dimension separately using the dimnames attribute, which is
a list of character vectors.
Classes: R has an elaborate class system, principally controlled via the class attribute. This
attribute is a character vector containing the list of classes that an object inherits from. This
forms the basis of the “generic methods” functionality in R.
Time series attributes: The tsp attribute is used to hold parameters of time series, start, end,
and frequency. This construction is mainly used to handle series with periodic substructure
such as monthly or quarterly data.
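A brief sketch of these attributes in use. All object names and values here are invented for the illustration, and ts() is used as one way of creating an object that carries a tsp attribute.

```r
# names: label the elements of a vector
v <- c(a = 1, b = 2, c = 3)
print(names(v))

# dim: turn a plain vector into a 2 x 3 matrix (filled column by column)
m <- 1:6
dim(m) <- c(2, 3)

# dimnames: a list of character vectors, one per dimension
dimnames(m) <- list(c("r1", "r2"), c("c1", "c2", "c3"))
print(m)
print(class(m))

# tsp: start, end and frequency of a monthly time series
series <- ts(1:24, start = c(2020, 1), frequency = 12)
print(tsp(series))
```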
Execution of commands in R
When a user types a command at the prompt (or when an expression is read from a file), the
command is transformed by the parser/compiler into an internal representation and the
evaluator executes parsed R expressions and returns the value of the expression. All
expressions have a value. This is the core of the language.

Data Entry

Basics

Recall that R has objects and modes. Objects are anything that you can give a name. There are
many different classes of objects. The main classes of interest here are vector, matrix, factor, list,
and data frame. The mode of an object tells what kind of things are in it. The main modes of
interest here are logical, numeric, and character.

(i) Typing

Creating Vectors (atomic)

There are 3 functions which are used for creating vectors:

(a) c (combine), e.g. c(2,4,6,8,10)

There are three types of vectors created this way:

(i) Numerical vectors, e.g. c(2,4,6,8,10): a vector of numerical elements.

(ii) Character vectors, e.g. c("Gerald", "Peter", "Alfred", "Mildred", "Tafadzwa"): a vector of
text string elements, which must be specified in quotes and are printed in quotes. It does not
matter whether single or double quotes are used.

(iii) Logical vectors, e.g. c(T,T,F,T,F): each element can take the value TRUE, FALSE or NA.

(b) seq (sequence): used for equidistant series of numbers, e.g. seq(2, 10) or seq(4,20,2)

(c) rep (replicate): used to generate repeated values, e.g. x<-c(5,10,15), then rep(x,4),
rep(x, 1:3) or rep(1:3, c(8,10,12))
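
The seq and rep calls mentioned above behave as sketched below; the names s1, s2, r1, r2 and r3 are only labels for this example.

```r
x <- c(5, 10, 15)

s1 <- seq(2, 10)              # 2, 3, ..., 10 in steps of 1
s2 <- seq(4, 20, 2)           # 4, 6, ..., 20 in steps of 2

r1 <- rep(x, 4)               # the whole vector repeated 4 times
r2 <- rep(x, 1:3)             # 5 once, 10 twice, 15 three times
r3 <- rep(1:3, c(8, 10, 12))  # 1 eight times, 2 ten times, 3 twelve times

print(s1)
print(s2)
print(r1)
print(r2)
print(table(r3))
```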

Vectorised arithmetic

The construct c(...) is used to define vectors. Example:

> height<-c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)

You can do calculations with vectors just like ordinary numbers; operations are applied
element by element.

> weight<-c(60, 72, 57, 90, 95, 72)

> bmi<-weight/height^2

> bmi

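The same calculation as a script, as a sketch. The units (metres and kilograms) are an assumption, since the notes do not state them; round() is added only to make the printed values easier to read.

```r
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)  # assumed to be in metres
weight <- c(60, 72, 57, 90, 95, 72)              # assumed to be in kilograms

# Division and exponentiation are applied element by element,
# so bmi has one value per person.
bmi <- weight / height^2
print(round(bmi, 1))
```
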
Handling categorical vectors (factors)

A factor is a vector object used to specify a discrete classification (grouping) of the components
of other vectors of the same length. R provides both ordered and unordered factors. A factor is
similarly created using the factor() function applied on a vector of numbers or characters.

Example: We want to capture the sex of the respondents in the data set in the table below.

Quesid  Head of Household name  Sex of Head of HH  Year of Birth of HHH  Marital Status of HH  Household size  Annual Income ($)  Beneficiary Status
1       Dube Betty              Female             1952                  Married               5               1200               Non Beneficiary
2       Hove Tom                Male               1988                  Single                7               380                Beneficiary
3       Sadza Hama              Male               1942                  Widowed               12              900                Beneficiary
4       Hope Alice              Female             1981                  Separated             8               482                Beneficiary
5       Ndlovu Thuli            Female             1988                  Married               4               2400               Non Beneficiary
6       Sibanda Iso             Male               1972                  Married               11              800                Beneficiary
7       Chaipa Helna            Female             1982                  Single                5               680                Beneficiary
8       Moyo Alpha              Female             1992                  Widowed               4               800                Beneficiary
9       Donga Zet               Male               1971                  Married               6               450                Beneficiary
10      Ncube Mark              Male               1938                  Widowed               10              720                Beneficiary

We would type the following at the command prompt, entering the sex of each head of household in
questionnaire order:

> sex=c("Female", "Male", "Male", "Female", "Female", "Male", "Female", "Female", "Male", "Male")

> sexf = factor(sex)

> sexf

The command below can be used to get the levels of the factor directly without listing the
factor.

> levels(sexf)

Alternatively, we can create the factor by specifying the vector of values, the levels and the
labels.

> ben=c(2,1,1,1,2,1,1,1,1,1)

> possible.ben=c(1,2)

> labels.ben=c("Beneficiary","Non Beneficiary")

> benf = factor(ben, levels=possible.ben, labels=labels.ben)

The function table() allows frequency tables to be calculated from equal length factors.

> s <- table(sexf)

The function tapply allows one to do analysis by a categorising variable. For example, we may
require the average income by sex of the head of household (HHH).

Enter the income vector in the same order as the categorising variable:

> income=c(1200,380,900,482,2400,800,680,800,450,720)
> incmeans <- tapply(income, sexf, mean)
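
Putting the factor commands together gives the sketch below. The sex, beneficiary and income values are read off the household table above, one entry per questionnaire, so all three vectors have length 10.

```r
# Values taken from the household table, in questionnaire order.
sex <- c("Female", "Male", "Male", "Female", "Female",
         "Male", "Female", "Female", "Male", "Male")
ben <- c(2, 1, 1, 1, 2, 1, 1, 1, 1, 1)
income <- c(1200, 380, 900, 482, 2400, 800, 680, 800, 450, 720)

sexf <- factor(sex)
benf <- factor(ben, levels = c(1, 2),
               labels = c("Beneficiary", "Non Beneficiary"))

print(levels(sexf))          # the distinct categories
print(table(sexf))           # frequency table for one factor
print(table(sexf, benf))     # two-way table from two equal-length factors

# Mean income within each level of sexf.
incmeans <- tapply(income, sexf, mean)
print(incmeans)
```

tapply requires the data vector and the factor to have the same length, which is why the income values must be entered in the same order as the sex values.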

(ii) Reading from a text file

This is the most convenient way of reading data into R. Use the command read.table(path,
header=T). It requires the data to be in ASCII (American Standard Code for Information
Interchange) format, which is the format created by any plain text editor such as Windows
NotePad. The result is a data frame. The first line of the file can contain a header with the
variable names.

The read.table command assumes fields are separated by whitespace. Variants of the
command are read.csv and read.csv2, which assume that fields are separated by commas and
semicolons respectively. Other variants are read.delim and read.delim2, for reading delimited
files for which the default delimiter is the Tab character.
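
A self-contained sketch of reading text files. So that it can run anywhere, it first writes two small temporary files; the file contents, the variable names hh and hh2, and the use of tempfile() and writeLines() are all choices made for this illustration.

```r
# Write a small comma-separated file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,sex,income",
             "1,Female,1200",
             "2,Male,380",
             "3,Male,900"), csv_path)

# read.csv expects commas between fields; the first line is the header.
hh <- read.csv(csv_path, header = TRUE)
str(hh)

# The same kind of data separated by whitespace, read with read.table.
txt_path <- tempfile(fileext = ".txt")
writeLines(c("id sex income",
             "1 Female 1200",
             "2 Male 380"), txt_path)
hh2 <- read.table(txt_path, header = TRUE)
print(hh2)
```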

(iii)Reading data from other statistical packages and spreadsheets

The simplest way is to request the package to export data as a text file (one of the forms stated
in (ii) above). Alternatively, the foreign package is recommended for handling other formats like
SPSS, SAS, STATA, Minitab etc.

Data frames

A data frame corresponds to what is commonly referred to as a data matrix or a data set. It is a
list of vectors and/or factors of the same length, which are related across, so that the values in
the same position belong to the same case.

Creating a data frame manually:

Enter your variables as vectors, then combine them as the columns of a data frame,

e.g. > y1=c(1,2,3,4,5)

> y2=c("Y", "Y", "N", "N", "Y")

> y3=c(7,8,9,10,11)

> ydata=data.frame(y1,y2,y3)

Importing a data frame from plain text files

> dataframe_name=read.csv(path, header=T) # to import from a csv format

Some basics on handling data frames

> str(dataframe_name) # displays the structure of the data frame

> names(dataframe_name) # displays variable names

> dim(dataframe_name) # displays the dimension of the data frame: number of rows and number of variables

> summary(dataframe_name) # gives appropriate summary statistics for all the variables

> dataframe_name$variable_name # extracts a variable from a data frame


Alternatively, if we attach the data frame, the variables can be referenced directly by name,
without the dollar sign operator.

> attach(dataframe_name)

> variable_name

We can detach the data frame when it is no longer needed:

> detach(dataframe_name)
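
The data frame commands above can be tried on a small example. In this sketch the data frame is built in a single call so that no separate vectors named y1, y2 or y3 are left in the workspace to clash with the attached columns.

```r
# A small data frame with three variables and five cases.
ydata <- data.frame(y1 = c(1, 2, 3, 4, 5),
                    y2 = c("Y", "Y", "N", "N", "Y"),
                    y3 = c(7, 8, 9, 10, 11))

str(ydata)              # structure: one line per variable
print(names(ydata))     # variable names
print(dim(ydata))       # 5 rows, 3 variables
print(summary(ydata))   # summary statistics per variable
print(ydata$y3)         # extract one variable with $

# After attach() the variables can be used by name alone.
attach(ydata)
total <- sum(y1)
detach(ydata)
print(total)
```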

Indexing

Indexing is used to select data in a vector, e.g. with z<-5:12, z[6] gives the element in position
6 of the vector z.

Indexing can also be used to select data in a data frame, e.g. d[6,5] will report the value of the
5th variable for the 6th subject in the data frame d.

Indexing can be used to modify values in a vector or data frame, e.g. z[6]<-25
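
A sketch of selecting and modifying values by index. The data frame d is invented for the example, with five variables and six rows so that d[6,5] exists.

```r
z <- 5:12
sixth <- z[6]      # the element in position 6, which is 10
z[6] <- 25         # replace it
print(z)

# A hypothetical data frame with 6 subjects and 5 variables.
d <- data.frame(v1 = 1:6, v2 = 11:16, v3 = 21:26, v4 = 31:36, v5 = 41:46)
val <- d[6, 5]     # row 6, column 5
d[6, 5] <- 0       # modify a single cell
print(d)
```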

The data Editor

R provides 2 ways of editing data interactively.

(i) Using the command data.entry : Allows you to edit numeric variables in the workspace.

(ii) Using the edit function: first make the data frame available, e.g. with the command
data(filename) for a built-in data set, then type newname<-edit(filename). This brings up a
spreadsheet-like editor with a column for each variable in the data frame. Inside the editor, you
can move around with a mouse or cursor keys and edit the current cell by typing in data. The
original data frame is left intact; the edited copy is stored in newname.

Missing values

R allows vectors to contain a special value NA (not available). Computations and operations
involving NA generally yield NA as the result.
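
A short sketch of how NA propagates. The is.na() function and the na.rm argument of mean() are not mentioned in the notes above; they are standard R tools for detecting missing values and leaving them out of a calculation.

```r
x <- c(1, NA, 3)

print(x + 1)                   # the missing element stays NA
m_all <- mean(x)               # NA, because one value is unknown
print(m_all)

print(is.na(x))                # which elements are missing
m_obs <- mean(x, na.rm = TRUE) # mean of the non-missing values
print(m_obs)
```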

Matrices and arrays

In R matrices and arrays are represented as vectors with dimensions. An array can be
considered as a multiply subscripted collection of data entries. Matrices can be created using
different functions:

(i) dim sets or changes the dimension attribute of an object, say y,

e.g. type the following:

y<-1:12

dim(y)<-c(3,4)

(ii) matrix function, e.g. matrix(1:12,nrow=3,byrow=T)

(iii) cbind and rbind ‘glue’ vectors together columnwise or rowwise respectively,

e.g. cbind(A=1:4, B=5:8, C=9:12), rbind(A=1:4, B=5:8, C=9:12)
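
The three approaches can be compared side by side, as in this sketch (the names m, cols and rows are arbitrary). Note that dim() fills the matrix column by column, while matrix() with byrow set to TRUE fills it row by row.

```r
# (i) Give a vector a dim attribute: 3 rows, 4 columns, filled by column.
y <- 1:12
dim(y) <- c(3, 4)
print(y)

# (ii) The matrix function, here filling row by row.
m <- matrix(1:12, nrow = 3, byrow = TRUE)
print(m)

# (iii) Glue named vectors together as columns or as rows.
cols <- cbind(A = 1:4, B = 5:8, C = 9:12)   # 4 x 3
rows <- rbind(A = 1:4, B = 5:8, C = 9:12)   # 3 x 4
print(cols)
print(rows)
```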

Matrix operations:

The matrix product of A and B is given by:

> A %*% B

> eigen(x) # eigenvalues and eigenvectors

> solve(x) # inverse matrix

> t(x) # transpose matrix
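
As a sketch, the operations can be checked on a small symmetric matrix. The matrices A and B here are chosen only for illustration; B is the 2 x 2 identity matrix, so the product A %*% B should reproduce A.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # a symmetric 2 x 2 matrix
B <- diag(2)                           # the 2 x 2 identity matrix

AB   <- A %*% B      # matrix product
Ainv <- solve(A)     # inverse of A
At   <- t(A)         # transpose of A
ev   <- eigen(A)     # list with components values and vectors

print(AB)
print(Ainv)
print(ev$values)

# Multiplying A by its inverse should give the identity (up to rounding).
print(round(A %*% Ainv, 10))
```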

R Scripts

R commands can be placed in a file, called an R script, and can be run using source or copy paste.
Using the source function causes R to accept input from the named source, such as a file.

In the R GUI users can open a new script window through the File menu. R scripts will be saved
with extension .R.

When the source function is used, auto-printing of expressions does not happen, so we need to add
print statements to the script so that the values of objects will be printed. The command you
type is source("filename.R"). The script file should be saved in your working directory;
otherwise give its full path.

Example

Create a script file and name it trialdata.R


Type the following in the script file:
# trialdata
k = c(0, 1, 2, 3, 4)
x = c(109, 65, 22, 3, 1)
p = x / sum(x)                  # relative frequencies
print(p)
r = sum(k * p)                  # mean
v = sum(x * (k - r)^2) / 199    # variance (199 = sum(x) - 1)
print(r)
print(v)
f = dpois(k, r)
print(cbind(k, p, f))

On the R console type the command source("trialdata.R")

Regression and ANOVA

summary(lm(y ~ x1 + x2))       # regression
anova(lm(response ~ factor))   # anova
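
As a sketch of both commands, the household table from the data entry section can be reused: income is regressed on household size, and a one-way ANOVA compares income between the sexes. The choice of these variables is only for illustration and is not an analysis suggested by the notes.

```r
# Values taken from the household table, in questionnaire order.
hhsize <- c(5, 7, 12, 8, 4, 11, 5, 4, 6, 10)
income <- c(1200, 380, 900, 482, 2400, 800, 680, 800, 450, 720)
sexf <- factor(c("Female", "Male", "Male", "Female", "Female",
                 "Male", "Female", "Female", "Male", "Male"))

# Simple linear regression of income on household size.
fit <- lm(income ~ hhsize)
print(summary(fit))

# One-way ANOVA of income by sex of head of household.
aov_tab <- anova(lm(income ~ sexf))
print(aov_tab)
```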

Functions

R users interact with the software primarily through functions. The syntax of a function is

f <- function(x, ...) {
    body
}

where f is the name of the function, x is the name of the first argument (there can be several
arguments), and ... indicates possible additional arguments.

Functions can be defined with no arguments, also. The curly brackets enclose the body of the
function. The return value of a function is the value of the last expression evaluated.
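
A minimal sketch of user-defined functions. The names bmi_calc and greet are made up for this example; bmi_calc passes any extra arguments given through ... on to round().

```r
# A function with two named arguments plus ... for extras.
bmi_calc <- function(weight, height, ...) {
  value <- weight / height^2
  round(value, ...)    # the last expression evaluated is the return value
}

# A function with no arguments.
greet <- function() {
  "hello"
}

b0 <- bmi_calc(60, 1.75)               # rounded to a whole number
b1 <- bmi_calc(60, 1.75, digits = 1)   # digits is passed on to round()
print(b0)
print(b1)
print(greet())
```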

Graphics

One of the most attractive features of R is that it gives fine control of graphic components. You
can specify plotting parameters such as plotting characters, line types etc. Learn more using the
help function.

e.g. plot(x,y), using any two variables of your choice, or plot(x,y,pch=3)
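
A sketch of the two plot calls. The data are invented, and the plots are written to a temporary PDF file with pdf() and dev.off() so that the example also runs where no graphics window is available; in an interactive session the plot calls alone are enough.

```r
x <- 1:10
y <- x^2

# Send the plots to a temporary PDF file (an assumption of this sketch).
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

plot(x, y)            # default plotting character
plot(x, y, pch = 3)   # pch = 3 draws plus signs instead

invisible(dev.off())  # close the file
print(plot_file)
```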

Probability functions:

R provides functions for the density, cumulative distribution function (CDF), percentiles, and for
generating random variates for many commonly applied distributions. For the Poisson
distribution these functions are dpois, ppois, qpois, and rpois, respectively. For the normal
distribution these functions are dnorm, pnorm, qnorm, and rnorm.
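
The naming pattern (d for density, p for CDF, q for quantiles, r for random variates) can be seen in this sketch. The parameter values are arbitrary, and set.seed() is used only so that the random draws are reproducible.

```r
# Poisson distribution with mean 1
d0 <- dpois(0, lambda = 1)    # P(X = 0), equal to exp(-1)
p2 <- ppois(2, lambda = 1)    # P(X <= 2)
q9 <- qpois(0.9, lambda = 1)  # smallest k with P(X <= k) >= 0.9

# Standard normal distribution
q <- qnorm(0.975)             # upper 2.5% point
p <- pnorm(q)                 # back to 0.975

# Random variates
set.seed(1)
draws  <- rnorm(5)
counts <- rpois(5, lambda = 2)

print(c(d0, p2, q9))
print(c(q, p))
print(draws)
print(counts)
```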
