R Quick Guide
R - Overview
R is a programming language and software environment for statistical analysis, graphics
representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand, and is currently developed by the R Development Core
Team.
The core of R is an interpreted computer language which allows branching and looping as well
as modular programming using functions. R allows integration with the procedures written in
the C, C++, .Net, Python or FORTRAN languages for efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary versions
are provided for various operating systems like Linux, Windows and Mac.
R is free software distributed under a GNU-style copyleft, and it is an official part of the GNU
project, where it is called GNU S.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of
the University of Auckland in Auckland, New Zealand. R made its first appearance in 1993.
A large group of individuals has contributed to R by sending code and bug reports.
Since mid-1997 there has been a core group (the "R Core Team") who can modify the
R source code archive.
Features of R
As stated earlier, R is a programming language and software environment for statistical
analysis, graphics representation and reporting. The following are the important features of R:
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display, either directly on the
computer screen or printed on paper.
In conclusion, R is one of the world's most widely used programming languages for statistics. It is a popular
choice among data scientists and is supported by a vibrant community of contributors.
R is taught in universities and deployed in mission-critical business applications. This tutorial
will teach you R programming with suitable examples in simple and easy steps.
R - Environment Setup
The following example prints a greeting. You can run it at any R prompt once R is installed.
# Print Hello World.
print("Hello World")
Windows Installation
You can download the Windows installer version of R from R-3.2.2 for Windows (32/64 bit)
and save it in a local directory.
It is a Windows installer (.exe) with a name of the form "R-version-win.exe". Double-click it
and run the installer, accepting the default settings. If your Windows is a 32-bit version, it installs
the 32-bit version. If your Windows is 64-bit, it installs both the 32-bit and 64-bit
versions.
After installation you can locate the icon to run the Program in a directory structure
"R\R3.2.2\bin\i386\Rgui.exe" under the Windows Program Files. Clicking this icon brings up
the R-GUI which is the R console to do R Programming.
Linux Installation
R is available as a binary for many versions of Linux at the location R Binaries.
The instructions to install R on Linux vary from flavor to flavor. The steps are listed under
each type of Linux version at the location mentioned above. However, if you are in a hurry, you can
use the yum command to install R as follows
$ yum install R
The above command installs the core functionality of R along with the standard
packages. If you still need an additional package, you can launch the R prompt as follows
$ R
>
Now you can use the install.packages() function at the R prompt to install the required package. For example,
the following command installs the plotrix package, which is required for 3D charts.
> install.packages("plotrix")
R - Basic Syntax
As a convention, we will start learning R programming by writing a "Hello, World!" program.
Depending on the needs, you can program either at R command prompt or you can use an R
script file to write your program. Let's check both one by one.
R Command Prompt
Once you have the R environment set up, it is easy to start the R command prompt by
typing the following command at your command prompt
$ R
This launches the R interpreter and gives you a prompt > where you can start typing your
program.
In a typical first program, the first statement defines a string variable myString, to which we assign the string "Hello,
World!", and the next statement uses print() to print the value stored in the variable
myString.
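The two statements described above might look like the following minimal sketch when typed at the > prompt (the variable name myString is taken from the text):

```r
# Assign a string to a variable, then print its value.
myString <- "Hello, World!"
print(myString)
```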
R Script File
Usually, you will write your programs in script files and then
execute those scripts at your command prompt with the help of the R interpreter called Rscript.
So let's start by writing the following code in a text file called test.R
myString <- "Hello, World!"
print ( myString)
Save the above code in a file test.R and execute it at Linux command prompt as given below.
Even if you are using Windows or other system, syntax will remain same.
$ Rscript test.R
Comments
Comments are explanatory text in your R program and are ignored by the interpreter while
executing your actual program. A single-line comment is written using # at the beginning of the
statement.
R does not support multi-line comments, but you can use a trick such as the
following
if(FALSE) {
"This is a demo for multi-line comments and it should be put inside either a single
OR double quote"
}
Although the R interpreter still parses the block above, the condition is FALSE, so the block is never executed and does not interfere with your
actual program. The comment text must be enclosed in either single or double quotes so that it parses as a string.
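As an illustration of both comment styles, here is a small sketch. The variable name total is hypothetical and only there to show that the surrounding code still runs.

```r
# This is a single-line comment; R ignores everything after the # sign.
total <- 1 + 2  # a comment can also follow code on the same line

if (FALSE) {
  "This string is parsed but never evaluated,
   so it behaves like a multi-line comment."
}

print(total)
```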
R - Data Types
Generally, while programming in any language, you need to use
variables to store information. Variables are named memory locations that
store values. This means that, when you create a variable, you reserve some space in memory.
You may want to store information of various data types such as character, wide character, integer,
floating point, double floating point, Boolean etc. Based on the data type of a variable, the
operating system allocates memory and decides what can be stored in the reserved memory.
In contrast to other programming languages like C and Java, in R the variables are not
declared as some data type. The variables are assigned R-objects, and the data type of
the R-object becomes the data type of the variable. There are many types of R-objects. The
frequently used ones are
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
The simplest of these objects is the vector object and there are six data types of these atomic
vectors, also termed as six classes of vectors. The other R-Objects are built upon the atomic
vectors.
v <- TRUE
print(class(v))
[1] "logical"
v <- 23.5
print(class(v))
[1] "numeric"
v <- 2L
print(class(v))
[1] "integer"
v <- 2+5i
print(class(v))
[1] "complex"
v <- "TRUE"
print(class(v))
[1] "character"
v <- charToRaw("Hello")   # "Hello" is stored as 48 65 6c 6c 6f
print(class(v))
[1] "raw"
In R programming, the very basic data types are the R-objects called vectors which hold
elements of different classes as shown above. Please note in R the number of classes is not
confined to only the above six types. For example, we can use many atomic vectors and
create an array whose class will become array.
Vectors
When you want to create a vector with more than one element, use the c() function,
which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
Lists
A list is an R-object which can contain many different types of elements inside it like vectors,
functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to
the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions.
The array function takes a dim attribute which creates the required number of dimensions. In
the example below we create an array with two elements which are 3x3 matrices each.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
Factors
Factors are R-objects which are created using a vector. A factor stores the vector along with the
distinct values of the elements in the vector as labels. The labels are always character,
irrespective of whether the input vector is numeric, character, Boolean etc. Factors are
useful in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
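Continuing from the vector above, a minimal sketch of creating the factor and counting its levels could look like this. The name factor_apple is an assumption chosen for illustration.

```r
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')

# Create a factor object from the vector.
factor_apple <- factor(apple_colors)

# Print the factor and the number of distinct levels.
print(factor_apple)
print(nlevels(factor_apple))
```

The factor keeps all seven values but records only the distinct ones as levels.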
Data Frames
Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can contain
a different mode of data. The first column can be numeric while the second column can be
character and the third column can be logical. A data frame is a list of vectors of equal length.
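A data frame with columns of different modes can be sketched as follows. The column names and values are hypothetical sample data, not taken from the tutorial.

```r
# Create a data frame whose columns have different modes.
BMI <- data.frame(
  gender   = c("Male", "Male", "Female"),   # character
  height   = c(152, 171.5, 165),            # numeric
  is_adult = c(TRUE, TRUE, FALSE),          # logical
  stringsAsFactors = FALSE
)

print(BMI)
# str() shows the mode of each column.
str(BMI)
```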
R - Variables
A variable provides us with named storage that our programs can manipulate. A variable in R
can store an atomic vector, a group of atomic vectors or a combination of many R-objects. A
valid variable name consists of letters, numbers and the dot or underscore characters. The
variable name starts with a letter, or with a dot that is not followed by a number.
Variable Name        Validity  Reason
var_name%            Invalid   Has the character '%'. Only dot (.) and underscore are allowed.
.var_name, var.name  Valid     Can start with a dot (.), but the dot should not be followed by a number.
Variable Assignment
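Variables can be assigned values using the leftward, rightward and equal-to operators. The
values of the variables can be printed using the print() or cat() function. The cat() function
combines multiple items into a continuous print output.
# Illustrative values; any vectors would do.
var.1 = c(0,1,2,3)        # assignment using the equal operator
var.2 <- c("learn","R")   # assignment using the leftward operator
c(TRUE,1) -> var.3        # assignment using the rightward operator
print(var.1)
cat ("var.1 is ", var.1 ,"\n")
cat ("var.2 is ", var.2 ,"\n")
cat ("var.3 is ", var.3 ,"\n")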
Note: The vector c(TRUE,1) mixes the logical and numeric classes, so the logical value is
coerced to numeric, turning TRUE into 1.
Finding Variables
To list all the variables currently available in the workspace, we use the ls() function.
print(ls())
Note: The output depends on which variables are declared in your environment.
The ls() function can also use patterns to match the variable names.
# List the variables starting with the pattern "var".
print(ls(pattern = "var"))
Variables starting with a dot (.) are hidden. They can be listed using the "all.names = TRUE"
argument to the ls() function.
print(ls(all.names = TRUE))
Deleting Variables
Variables can be deleted by using the rm() function. Below we delete the variable var.3. On
printing the value of the variable afterwards, an error is thrown.
rm(var.3)
print(var.3)
Error in print(var.3) : object 'var.3' not found
All the variables can be deleted by using the rm() and ls() function together.
rm(list = ls())
print(ls())
character(0)
R - Operators
An operator is a symbol that tells the interpreter to perform specific mathematical or logical
manipulations. The R language is rich in built-in operators and provides the following types of
operators.
Types of Operators
We have the following types of operators in R programming
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
Arithmetic Operators
The following table shows the arithmetic operators supported by the R language. The operators act
on each element of the vector. All examples use these two vectors:
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
Operator  Description                                                     Example
+         Adds two vectors                                                print(v+t)
-         Subtracts the second vector from the first                      print(v-t)
*         Multiplies both vectors                                         print(v*t)
/         Divides the first vector by the second                          print(v/t)
%%        Gives the remainder of the first vector divided by the second   print(v%%t)
%/%       Gives the integer quotient of the first vector and the second   print(v%/%t)
^         Raises the first vector to the exponent of the second vector    print(v^t)
For example, print(v%/%t) produces the following result
[1] 0 1 1
Relational Operators
The following table shows the relational operators supported by the R language. Each element of the
first vector is compared with the corresponding element of the second vector. The result of the
comparison is a Boolean value. All examples use these two vectors:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
Operator  Description (each element of the first vector against the corresponding element of the second)  Example
>         Checks if the element of the first vector is greater than that of the second vector.              print(v>t)
<         Checks if the element of the first vector is less than that of the second vector.                 print(v < t)
==        Checks if the element of the first vector is equal to that of the second vector.                  print(v == t)
<=        Checks if the element of the first vector is less than or equal to that of the second vector.     print(v<=t)
>=        Checks if the element of the first vector is greater than or equal to that of the second vector.  print(v>=t)
!=        Checks if the element of the first vector is unequal to that of the second vector.                print(v!=t)
Logical Operators
The following table shows the logical operators supported by the R language. They are applicable only to
vectors of type logical, numeric or complex. All non-zero numbers are considered as
logical value TRUE, and zero is considered FALSE.
Each element of the first vector is compared with the corresponding element of the second
vector. The result of the comparison is a Boolean value.
|  Called the element-wise logical OR operator. It combines each element of the first vector
with the corresponding element of the second vector and gives TRUE if at least one of the
elements is TRUE.
v <- c(3,0,TRUE,2+2i)
t <- c(4,0,FALSE,2+3i)
print(v|t)
it produces the following result
[1] TRUE FALSE TRUE TRUE
!  Called the logical NOT operator. It takes each element of the vector and gives the opposite
logical value.
v <- c(3,0,TRUE,2+2i)
print(!v)
it produces the following result
[1] FALSE TRUE FALSE FALSE
The logical operators && and || consider only the first element of each vector and give a
single-element vector as output. Note that from R 4.3.0 onwards, passing operands longer than one
element to && or || is an error, so the examples below apply to older versions of R.
&&  Called the logical AND operator. It takes the first element of both vectors and gives
TRUE only if both are TRUE.
v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
print(v&&t)
it produces the following result
[1] TRUE
||  Called the logical OR operator. It takes the first element of both vectors and gives
TRUE if at least one of them is TRUE.
v <- c(0,0,TRUE,2+2i)
t <- c(0,3,TRUE,2+3i)
print(v||t)
it produces the following result
[1] FALSE
Assignment Operators
These operators are used to assign values to vectors.
<-, <<- or =  Called Left Assignment.
v1 <- c(3,1,TRUE,2+3i)
v2 <<- c(3,1,TRUE,2+3i)
v3 = c(3,1,TRUE,2+3i)
print(v1)
print(v2)
print(v3)
it produces the following result
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
-> or ->>  Called Right Assignment.
c(3,1,TRUE,2+3i) -> v1
c(3,1,TRUE,2+3i) ->> v2
print(v1)
print(v2)
it produces the following result
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
Miscellaneous Operators
These operators are used for specific purposes and not for general mathematical or logical
computation.
:  Colon operator. It creates a series of numbers in sequence for a vector.
v <- 2:8
print(v)
it produces the following result
[1] 2 3 4 5 6 7 8
%in%  This operator is used to identify if an element belongs to a vector.
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
it produces the following result
[1] TRUE
[1] FALSE
%*%  This operator performs matrix multiplication. The example multiplies a matrix with
its transpose.
M = matrix( c(2,6,5,1,10,4), nrow = 2,ncol = 3,byrow = TRUE)
t = M %*% t(M)
print(t)
it produces the following result
     [,1] [,2]
[1,]   65   82
[2,]   82  117
R - Decision making
Decision making structures require the programmer to specify one or more conditions to be
evaluated or tested by the program, along with a statement or statements to be executed if
the condition is determined to be true, and optionally, other statements to be executed if the
condition is determined to be false.
R provides the following types of decision making statements.
1 if statement
An if statement consists of a Boolean expression followed by one or more
statements.
2 if...else statement
An if statement can be followed by an optional else statement, which executes
when the Boolean expression is false.
3 switch statement
A switch statement allows a variable to be tested for equality against a list of
values.
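The three statements listed above can be sketched in a few lines. The variable names x, size and day.type, and the values used, are hypothetical and chosen only for illustration.

```r
x <- 30L   # hypothetical input value

# if statement: the body runs only when the condition is TRUE.
if (is.integer(x)) {
  print("x is an integer")
}

# if...else statement: the else branch runs when the condition is FALSE.
if (x > 50) {
  size <- "large"
} else {
  size <- "small"
}
print(size)

# switch statement: selects the branch whose name matches the first argument.
day.type <- switch("sat", mon = "weekday", sat = "weekend", sun = "weekend")
print(day.type)
```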
R - Loops
There may be situations when you need to execute a block of code several times.
In general, statements are executed sequentially. The first statement in a function is executed
first, followed by the second, and so on.
Programming languages provide various control structures that allow for more complicated
execution paths.
A loop statement allows us to execute a statement or group of statements multiple times.
The R programming language provides the following kinds of loops to handle looping requirements.
1 repeat loop
Executes a statement or group of statements repeatedly until a break statement
is reached.
2 while loop
Repeats a statement or group of statements while a given condition is true. It tests
the condition before executing the loop body.
3 for loop
Executes the loop body once for each element of a vector or sequence.
R supports the following control statements.
1 break statement
Terminates the loop statement and transfers execution to the statement
immediately following the loop.
2 next statement
Skips the remainder of the current iteration of a loop and starts the next iteration.
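As a rough sketch of how R's loops and the break and next statements behave, consider the following. All variable names and bounds are hypothetical.

```r
# while loop: the condition is tested before each pass.
i <- 1
total <- 0
while (i <= 5) {
  total <- total + i
  i <- i + 1
}
print(total)

# for loop over a sequence, using next and break.
evens <- c()
for (n in 1:10) {
  if (n %% 2 == 1) next   # skip the rest of this iteration for odd numbers
  if (n > 8) break        # leave the loop once n exceeds 8
  evens <- c(evens, n)
}
print(evens)

# repeat loop: runs until a break statement is reached.
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}
print(count)
```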
R - Functions
A function is a set of statements organized together to perform a specific task. R has a large
number of built-in functions, and users can create their own functions.
In R, a function is an object, so the R interpreter is able to pass control to the function, along
with the arguments that the function needs to accomplish its actions.
The function in turn performs its task and returns control to the interpreter, along with any
result, which may be stored in other objects.
Function Definition
An R function is created by using the keyword function and is usually assigned to a name.
Function Components
The different parts of a function are
Function Body The function body contains a collection of statements that defines
what the function does.
Return Value The return value of a function is the last expression in the function
body to be evaluated.
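To make the parts of a function concrete, here is a minimal sketch. The function rect.area and its arguments are hypothetical and do not come from the tutorial.

```r
# Function name: rect.area. Arguments: width and height (height has a default).
rect.area <- function(width, height = 1) {
  area <- width * height   # function body
  area                     # the last evaluated expression is the return value
}

print(rect.area(3, 4))   # both arguments supplied
print(rect.area(5))      # height falls back to its default
```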
R has many built-in functions which can be called directly in the program without defining
them first. We can also create and use our own functions, referred to as user-defined functions.
Built-in Function
Simple examples of built-in functions are seq(), mean(), max(), sum(x) and paste(...) etc. They
are called directly by user-written programs.
[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526
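Calls along the following lines would produce results like the three shown above. The exact arguments are an assumption, chosen to be consistent with that output.

```r
# Create a sequence of numbers from 32 to 44.
print(seq(32, 44))

# Find the mean of the numbers from 25 to 82.
print(mean(25:82))

# Find the sum of the numbers from 41 to 68.
print(sum(41:68))
```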
User-defined Function
We can create user-defined functions in R. They are specific to what a user wants and once
created they can be used like the built-in functions. Below is an example of how a function is
created and used.
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}
Calling a Function
# Call the function new.function supplying 6 as an argument.
new.function(6)
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 26
[1] 58
[1] 18
[1] 45
Lazy Evaluation of Function
Arguments to functions are evaluated lazily, which means they are evaluated only when
needed by the function body.
[1] 36
[1] 6
Error in print(b) : argument "b" is missing, with no default
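The behaviour behind output like the above can be sketched as follows. The function name lazy.demo is hypothetical, and tryCatch() is used here only so that the script continues after the error.

```r
# 'b' is not touched until the last line of the body.
lazy.demo <- function(a, b) {
  print(a^2)   # uses only 'a'
  print(a)
  print(b)     # 'b' is evaluated here for the first time
}

# Call with 'b' missing: the first two prints succeed, the third fails.
msg <- tryCatch({
  lazy.demo(6)
  "no error"
}, error = function(e) conditionMessage(e))

print(msg)
```

The error is raised only when the missing argument is actually used, not when the function is called.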
R - Strings
Any value written within a pair of single quotes or double quotes in R is treated as a string.
Internally R stores every string within double quotes, even when you create it with single
quotes.
Double quotes can be inserted into a string starting and ending with single quotes.
Single quotes can be inserted into a string starting and ending with double quotes.
Double quotes cannot be inserted into a string starting and ending with double
quotes unless they are escaped with a backslash.
Single quotes cannot be inserted into a string starting and ending with single quotes unless they are escaped with a backslash.
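The quoting rules can be checked with a few valid strings. This is a small sketch, and the variable names are arbitrary.

```r
s1 <- 'Start and end with single quote'
s2 <- "Start and end with double quotes"
s3 <- "single quote ' in between double quotes"
s4 <- 'Double quotes " in between single quote'
s5 <- "escaped \" double quote inside double quotes"

print(s1)
print(s2)
print(s3)
print(s4)
print(s5)
```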
String Manipulation
Concatenating Strings - paste() function
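Strings in R are combined using the paste() function. It can take any number of
arguments to be combined together.
Syntax
The basic syntax for the paste function is
paste(..., sep = " ", collapse = NULL)
... represents any number of arguments to be combined.
sep is the separator placed between the arguments. It is optional and defaults to a single space.
collapse, when given, joins the elements of the result into a single string using the
given separator. It does not affect spaces inside an individual string.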
Example
a <- "Hello"
b <- 'How'
c <- "are you? "
print(paste(a,b,c))
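To show the effect of sep and collapse, here is a short sketch using the same three strings (the third is named d here to avoid reusing the name of the c() function):

```r
a <- "Hello"
b <- 'How'
d <- "are you? "

# Default separator is a single space.
s1 <- paste(a, b, d)
# Custom separator between the arguments.
s2 <- paste(a, b, d, sep = "-")
# collapse joins the elements of a vector into one string.
s3 <- paste(c(a, b, d), collapse = "")

print(s1)
print(s2)
print(s3)
```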
Syntax
The basic syntax for format function is
format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))
nsmall is the minimum number of digits to the right of the decimal point.
Example
# Total number of digits displayed. Last digit rounded off.
result <- format(23.123456789, digits = 9)
print(result)
[1] "23.1234568"
[1] "6.000000e+00" "1.314521e+01"
[1] "23.47000"
[1] "6"
[1] " 13.7"
[1] "Hello "
[1] " Hello "
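Calls along these lines exercise the main format() parameters and give results similar to those listed above. The specific arguments are assumptions for illustration.

```r
# Total number of digits displayed. Last digit rounded off.
r1 <- format(23.123456789, digits = 9)
# Display numbers in scientific notation.
r2 <- format(c(6, 13.14521), scientific = TRUE)
# Minimum number of digits to the right of the decimal point.
r3 <- format(23.47, nsmall = 5)
# Numbers are padded with blanks on the left up to the given width.
r4 <- format(13.7, width = 6)
# Left-justify and centre a string within a width of 8.
r5 <- format("Hello", width = 8, justify = "left")
r6 <- format("Hello", width = 8, justify = "centre")

print(r1); print(r2); print(r3); print(r4); print(r5); print(r6)
```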
Syntax
The basic syntax for nchar() function is
nchar(x)
Example
result <- nchar("Count the number of characters")
print(result)
[1] 30
Syntax
The basic syntax for toupper() & tolower() function is
toupper(x)
tolower(x)
Example
# Changing to Upper case.
result <- toupper("Changing To Upper")
print(result)
Syntax
The basic syntax for substring() function is
substring(x,first,last)
Example
# Extract characters from 5th to 7th position.
result <- substring("Extract", 5, 7)
print(result)
[1] "act"
R - Vectors
Vectors are the most basic R data objects and there are six types of atomic vectors. They are
logical, integer, double, complex, character and raw.
Vector Creation
Single Element Vector
Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of
the above vector types.
[1] "abc"
[1] 12.5
[1] 63
[1] TRUE
[1] 2+3i
[1] 68 65 6c 6c 6f
Multiple Elements Vector
Using colon operator with numeric data
# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)
[1] 5 6 7 8 9 10 11 12 13
[1] 6.6 7.6 8.6 9.6 10.6 11.6 12.6
[1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8
[1] 5.0 5.4 5.8 6.2 6.6 7.0 7.4 7.8 8.2 8.6 9.0
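Sequences like the four printed above can be generated as in the sketch below. The bounds are assumptions chosen to match that output.

```r
# Creating a sequence from 5 to 13.
v1 <- 5:13
# Creating a sequence from 6.6 to 12.6.
v2 <- 6.6:12.6
# The final value 11.4 cannot be reached in steps of 1, so the sequence stops before it.
v3 <- 3.8:11.4
# seq() allows a step other than 1.
v4 <- seq(5, 9, by = 0.4)

print(v1); print(v2); print(v3); print(v4)
```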
The non-character values are coerced to character type if one of the elements is a character.
Vector Manipulation
Vector arithmetic
Two vectors of the same length can be added, subtracted, multiplied or divided, giving the result
as a vector output.
# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)
# Vector addition.
add.result <- v1+v2
print(add.result)
# Vector subtraction.
sub.result <- v1-v2
print(sub.result)
# Vector multiplication.
multi.result <- v1*v2
print(multi.result)
# Vector division.
divi.result <- v1/v2
print(divi.result)
[1] 7 19 4 13 1 13
[1] -1 -3 4 -3 -1 9
[1] 12 88 0 40 0 22
[1] 0.7500000 0.7272727 Inf 0.6250000 0.0000000 5.5000000
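If two vectors of unequal length are used in arithmetic, the elements of the shorter vector are recycled to complete the operation.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11)
# v2 is recycled to c(4,11,4,11,4,11)
add.result <- v1+v2
print(add.result)
sub.result <- v1-v2
print(sub.result)
[1] 7 19 8 16 4 22
[1] -1 -3 0 -6 -4 0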
[1] -9 0 3 4 5 8 11 304
[1] 304 11 8 5 4 3 0 -9
[1] "Blue" "Red" "violet" "yellow"
[1] "yellow" "violet" "Red" "Blue"
R - Lists
Lists are R objects which contain elements of different types such as numbers, strings,
vectors and even another list. A list can also contain a matrix or a function as its elements.
A list is created using the list() function.
Creating a List
Following is an example to create a list containing strings, numbers, vectors and logical
values
[[1]]
[1] "Red"
[[2]]
[1] "Green"
[[3]]
[1] 21 32 11
[[4]]
[1] TRUE
[[5]]
[1] 51.23
[[6]]
[1] 119.1
$`1st_Quarter`
[1] "Jan" "Feb" "Mar"
$A_Matrix
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8
$A_Inner_list
$A_Inner_list[[1]]
[1] "green"
$A_Inner_list[[2]]
[1] 12.3
# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])
$`1st_Quarter`
[1] "Jan" "Feb" "Mar"
$A_Inner_list
$A_Inner_list[[1]]
[1] "green"
$A_Inner_list[[2]]
[1] 12.3
[[1]]
[1] "New element"
$
NULL
Merging Lists
You can merge many lists into one list by placing all the lists inside one c() function.
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")
# Merge the two lists.
merged.list <- c(list1,list2)
# Print the merged list.
print(merged.list)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] "Sun"
[[5]]
[1] "Mon"
[[6]]
[1] "Tue"
list2 <-list(10:14)
print(list2)
print(v1)
print(v2)
[[1]]
[1] 1 2 3 4 5
[[1]]
[1] 10 11 12 13 14
[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19
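The output above is consistent with converting two lists to vectors with unlist() and then adding them. A minimal sketch, assuming list1 holds 1:5:

```r
# Create lists.
list1 <- list(1:5)
list2 <- list(10:14)

# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)

# Arithmetic works on the vectors, not on the lists themselves.
result <- v1 + v2
print(result)
```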
R - Matrices
Matrices are R objects in which the elements are arranged in a two-dimensional
rectangular layout. They contain elements of the same atomic type. Though we can create a
matrix containing only characters or only logical values, such matrices are not of much use. We mostly use
matrices containing numeric elements in mathematical calculations.
Syntax
The basic syntax for creating a matrix in R is
matrix(data, nrow, ncol, byrow, dimnames)
data is the input vector which becomes the data elements of the matrix.
nrow is the number of rows to be created.
ncol is the number of columns to be created.
byrow is a logical flag. If TRUE, the input vector elements are arranged by row.
dimnames is the names assigned to the rows and columns.
Example
Create a matrix taking a vector of numbers as input
[1] 5
[1] 13
col1 col2 col3
6 7 8
row1 row2 row3 row4
5 8 11 14
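Element access results like those above can be reproduced with a sketch such as the following. The input 3:14 and the names row.labels and col.labels are assumptions chosen to be consistent with that output.

```r
# Define the row and column names.
row.labels <- c("row1", "row2", "row3", "row4")
col.labels <- c("col1", "col2", "col3")

# Create a 4x3 matrix filled by row.
P <- matrix(3:14, nrow = 4, byrow = TRUE,
            dimnames = list(row.labels, col.labels))

print(P[1, 3])   # element at 1st row, 3rd column
print(P[4, 2])   # element at 4th row, 2nd column
print(P[2, ])    # the whole 2nd row
print(P[, 3])    # the whole 3rd column
```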
Matrix Computations
Various mathematical operations are performed on the matrices using the R operators. The
result of the operation is also a matrix.
The dimensions (number of rows and columns) should be same for the matrices involved in
the operation.
R - Arrays
Arrays are R data objects which can store data in more than two dimensions. For example,
if we create an array of dimension (2, 3, 4), it creates 4 rectangular matrices, each with 2
rows and 3 columns. Arrays can store only one data type.
An array is created using the array() function. It takes vectors as input and uses the values in
the dim parameter to create an array.
Example
The following example creates an array of two matrices, each with 3 rows and 3 columns.
# Print the element in the 1st row and 3rd column of the 1st matrix.
print(result[1,3,1])
Syntax
apply(x, margin, fun)
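x is an array.
margin specifies the dimension over which the function is applied (1 for rows, 2 for columns).
fun is the function to be applied across the elements of the array.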
Example
We use the apply() function below to calculate the sum of the elements in the rows of an array
across all the matrices.
# Use apply to calculate the sum of the rows across all the matrices.
result <- apply(new.array, c(1), sum)
print(result)
[1] 56 68 60
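A self-contained sketch of array() and apply() follows. The input vectors are assumptions chosen to be consistent with the result shown above, and the names row.sums and col.sums are hypothetical.

```r
# Create two vectors of different lengths.
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)

# The 9 values are recycled to fill two 3x3 matrices.
new.array <- array(c(vector1, vector2), dim = c(3, 3, 2))

# Element in the 1st row and 3rd column of the 1st matrix.
print(new.array[1, 3, 1])

# margin = 1 sums each row across all the matrices.
row.sums <- apply(new.array, 1, sum)
# margin = 2 sums each column across all the matrices.
col.sums <- apply(new.array, 2, sum)

print(row.sums)
print(col.sums)
```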
R - Factors
Factors are data objects which are used to categorize data and store it as levels. They
can store both strings and integers. They are useful in columns which have a limited
number of unique values, such as "Male", "Female" and True, False etc. They are useful in data
analysis for statistical modeling.
Factors are created using the factor() function by taking a vector as input.
Example
# Create a vector as input.
data <- c("East","West","East","North","North","East","West","West","West","East","North")
print(data)
print(is.factor(data))
# Apply the factor function.
factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
[1] "East" "West" "East" "North" "North" "East" "West" "West" "West" "East" "North"
[1] FALSE
[1] East West East North North East West West West East North
Levels: East North West
[1] TRUE
[1] East West East North North East West West West East North
Levels: East North West
[1] East West East North North East West West West East North
Levels: East West North
Syntax
gl(n, k, labels)
Example
v <- gl(3, 4, labels = c("Tampa", "Seattle","Boston"))
print(v)
R - Data Frames
A data frame is a table or a two-dimensional array-like structure in which each column
contains values of one variable and each row contains one set of values from each column.
The data stored in a data frame can be of numeric, factor or character type.
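# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)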
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25
Extract 3rd and 5th row with 2nd and 4th column
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
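salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)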
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
emp_name start_date
3 Michelle 2014-11-15
5 Gary 2015-03-27
Add Column
Just add the column vector using a new column name.
To add more rows, create a data frame holding the new rows with the same structure as the existing
data frame and combine the two using the rbind() function.
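Both operations can be sketched on a reduced version of the employee data. The dept values and the names emp.newdata and emp.finaldata are hypothetical.

```r
# A small data frame to start from.
emp.data <- data.frame(
  emp_id = 1:3,
  emp_name = c("Rick", "Dan", "Michelle"),
  stringsAsFactors = FALSE
)

# Add a column by assigning a vector to a new column name.
emp.data$dept <- c("IT", "Operations", "IT")

# Add rows: build a second data frame with the same columns, then bind them.
emp.newdata <- data.frame(
  emp_id = 4:5,
  emp_name = c("Ryan", "Gary"),
  dept = c("HR", "Finance"),
  stringsAsFactors = FALSE
)
emp.finaldata <- rbind(emp.data, emp.newdata)

print(emp.finaldata)
```

rbind() requires both data frames to have the same column names.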
R - Packages
R packages are collections of R functions, compiled code and sample data. They are stored
under a directory called "library" in the R environment. By default, R installs a set of packages
during installation. More packages are added later, when they are needed for some specific
purpose. When we start the R console, only the default packages are available.
Other packages which are already installed have to be loaded explicitly to be used by the R
program that is going to use them.
Below is a list of commands to be used to check, verify and use the R packages.
.libPaths()
When we execute the above code, it lists the library locations containing R packages. The result may vary depending on
the local settings of your PC.
search()
When we execute the above code, it lists all packages currently loaded in the R environment. The result may vary depending on
the local settings of your PC.
You can run the following command to install a package in the R environment, replacing "Package Name" with the name of the package.
install.packages("Package Name")
R - Data Reshaping
Data Reshaping in R is about changing the way data is organized into rows and columns.
Most of the time data processing in R is done by taking the input data as a data frame. It is
easy to extract data from the rows and columns of a data frame but there are situations when
we need the data frame in a format that is different from format in which we received it. R has
many functions to split, merge and change the rows to columns and vice-versa in a data
frame.
# Print a header.
cat("# # # # The First data frame\n")
# Print a header.
cat("# # # The Second data frame\n")
# Print a header.
cat("# # # The combined data frame\n")
In the example below, we consider the data sets about diabetes in Pima Indian women
available in the library named "MASS". We merge the two data sets based on the values of
blood pressure ("bp") and body mass index ("bmi"). On choosing these two columns for
merging, the records where the values of these two variables match in both data sets are
combined to form a single data frame.
library(MASS)
merged.Pima <- merge(x = Pima.te, y = Pima.tr,
by.x = c("bp", "bmi"),
by.y = c("bp", "bmi")
)
print(merged.Pima)
nrow(merged.Pima)
bp bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y ped.y
1 60 33.8 1 117 23 0.466 27 No 2 125 20 0.088
2 64 29.7 2 75 24 0.370 33 No 2 100 23 0.368
3 64 31.2 5 189 33 0.583 29 Yes 3 158 13 0.295
4 64 33.2 4 117 27 0.230 24 No 1 96 27 0.289
5 66 38.1 3 115 39 0.150 28 No 1 114 36 0.289
6 68 38.5 2 100 25 0.324 26 No 7 129 49 0.439
7 70 27.4 1 116 28 0.204 21 No 0 124 20 0.254
8 70 33.1 4 91 32 0.446 22 No 9 123 44 0.374
9 70 35.4 9 124 33 0.282 34 No 6 134 23 0.542
10 72 25.6 1 157 21 0.123 24 No 4 99 17 0.294
11 72 37.7 5 95 33 0.370 27 No 6 103 32 0.324
12 74 25.9 9 134 33 0.460 81 No 8 126 38 0.162
13 74 25.9 1 95 21 0.673 36 No 8 126 38 0.162
14 78 27.6 5 88 30 0.258 37 No 6 125 31 0.565
15 78 27.6 10 122 31 0.512 45 No 6 125 31 0.565
16 78 39.4 2 112 50 0.175 24 No 4 112 40 0.236
17 88 34.5 1 117 24 0.403 40 Yes 4 127 11 0.598
age.y type.y
1 31 No
2 21 No
3 24 No
4 21 No
5 21 No
6 43 Yes
7 36 Yes
8 40 No
9 29 Yes
10 28 No
11 55 No
12 39 No
13 39 No
14 49 Yes
15 49 Yes
16 38 No
17 28 No
[1] 17
We consider the dataset called ships present in the library called "MASS".
library(MASS)
print(ships)
R - CSV Files
In R, we can read data from files stored outside the R environment. We can also write data
into files which will be stored and accessed by the operating system. R can read and write into
various file formats like csv, excel, xml etc.
In this chapter we will learn to read data from a csv file and then write data into a csv file. The
file should be present in current working directory so that R can read it. Of course we can also
set our own directory and read files from there.
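As a minimal sketch of checking and changing the working directory, the base R functions
getwd() and setwd() can be used as follows. The temporary directory is only an illustrative
stand-in for a directory of your own.

```r
# Get and print the current working directory.
old.dir <- getwd()
print(old.dir)

# Switch to another directory (tempdir() is used here only as an example).
setwd(tempdir())
new.dir <- getwd()
print(new.dir)

# Restore the original working directory.
setwd(old.dir)
```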
[1] "/web/com/1441086124_2016"
[1] "/web/com"
This result depends on your OS and your current directory where you are working.
You can create this file using Windows Notepad by copying and pasting this data. Save the file
as input.csv using the Save As option with the file type All files (*.*) in Notepad.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
[1] TRUE
[1] 5
[1] 8
Once we read data in a data frame, we can apply all the functions applicable to data frames
as explained in subsequent section.
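The following sketch reads the sample records with read.csv() and queries the resulting data
frame. To keep it self-contained it first writes the records to a temporary file, which is an
assumption standing in for input.csv in your working directory.

```r
# Write the sample records to a temporary csv file (a stand-in for input.csv).
csv.path <- tempfile(fileext = ".csv")
writeLines(c(
   "id,name,salary,start_date,dept",
   "1,Rick,623.3,2012-01-01,IT",
   "2,Dan,515.2,2013-09-23,Operations",
   "3,Michelle,611,2014-11-15,IT",
   "4,Ryan,729,2014-05-11,HR",
   ",Gary,843.25,2015-03-27,Finance",
   "6,Nina,578,2013-05-21,IT",
   "7,Simon,632.8,2013-07-30,Operations",
   "8,Guru,722.5,2014-06-17,Finance"
), csv.path)

# Read the file into a data frame.
data <- read.csv(csv.path)

# Get the maximum salary.
sal <- max(data$salary)
print(sal)

# Get the details of the person with the maximum salary.
top <- subset(data, salary == max(salary))
print(top)
```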
[1] 843.25
Here the column X comes from the data set newper. This can be dropped using additional
parameters while writing the file.
# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
R - Excel File
Microsoft Excel is the most widely used spreadsheet program, and it stores data in the .xls or
.xlsx format. R can read directly from these files using Excel-specific packages, such as
XLConnect, xlsx and gdata. We will be using the xlsx package. R can also
write into an Excel file using this package.
install.packages("xlsx")
[1] TRUE
Loading required package: rJava
Loading required package: methods
Loading required package: xlsxjars
Input as xlsx File
Open Microsoft excel. Copy and paste the following data in the work sheet named as sheet1.
Also copy and paste the following data to another worksheet and rename this worksheet to
"city".
name city
Rick Seattle
Dan Tampa
Michelle Chicago
Ryan Seattle
Gary Houston
Nina Boston
Simon Mumbai
Guru Dallas
Save the Excel file as "input.xlsx". You should save it in the current working directory of the R
workspace.
R - Binary Files
A binary file is a file that contains information stored only in the form of bits and bytes (0s
and 1s). Binary files are not human readable, because their bytes translate to characters and
symbols that include many non-printable characters. Attempting to read a binary file in a text
editor will show unreadable symbols.
A binary file has to be read by specific programs to be usable. For example, the binary file
of a Microsoft Word program can be rendered in a human-readable form only by the Word
program. This indicates that, besides the human-readable text, the file stores a lot more
information, such as character formatting and page numbers, along with the alphanumeric
characters. Finally, a binary file is a continuous sequence of bytes. The
line break we see in a text file is itself a character joining the first line to the next.
R has two functions, writeBin() and readBin(), to create and read binary files.
Syntax
writeBin(object, con)
readBin(con, what, n )
what is the mode like character, integer etc. representing the bytes to be read.
Example
We consider the R built-in data set "mtcars". First we create a csv file from it, then convert
it to a binary file and store it as an OS file. Next we read the binary file back into R.
Writing the Binary File
We read the data frame "mtcars" as a csv file and then write it as a binary file to the OS.
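# Write the "mtcars" data frame to a csv file.
write.table(mtcars, file = "mtcars.csv",row.names = FALSE, na = "",
col.names = TRUE, sep = ",")
# Read 5 records back and keep only the columns "cyl", "am" and "gear".
new.mtcars <- read.table("mtcars.csv", sep = ",", header = TRUE, nrows = 5)[, c("cyl", "am", "gear")]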
# Create a connection object to write the binary file using mode "wb".
write.filename = file("/web/com/binmtcars.dat", "wb")
# Write the column names of the data frame to the connection object.
writeBin(colnames(new.mtcars), write.filename)
# Close the file for writing so that it can be read by other program.
close(write.filename)
# Create a connection object to read the file in binary mode using "rb".
read.filename <- file("/web/com/binmtcars.dat", "rb")
# Next read the column values. n = 18 as we have 3 column names and 15 values.
bindata <- readBin(read.filename, integer(), n = 18)
# Read the 4th to 8th values, which represent "cyl".
cyldata = bindata[4:8]
print(cyldata)
# Read the 9th to 13th values, which represent "am".
amdata = bindata[9:13]
print(amdata)
# Read the 14th to 18th values, which represent "gear".
geardata = bindata[14:18]
print(geardata)
When we execute the above code, it produces the following result and chart
[1] 6 6 4 6 8
[1] 1 1 1 0 0
[1] 4 4 4 3 3
cyl am gear
[1,] 6 1 4
[2,] 6 1 4
[3,] 4 1 4
[4,] 6 0 3
[5,] 8 0 3
As we can see, we got the original data back by reading the binary file in R.
R - XML Files
XML is a file format for sharing both the structure and the data on the World Wide Web,
intranets, and elsewhere using standard ASCII text. It stands for Extensible Markup Language
(XML). Similar to HTML, it contains markup tags. But unlike HTML, where the markup tags
describe the structure of the page, in XML the markup tags describe the meaning of the data
contained in the file.
You can read an XML file in R using the "XML" package. This package can be installed using
the following command.
install.packages("XML")
Input Data
Create an XML file by copying the data below into a text editor like Notepad. Save the file with
a .xml extension, choosing the file type as All files (*.*).
<RECORDS>
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>2</ID>
<NAME>Dan</NAME>
<SALARY>515.2</SALARY>
<STARTDATE>9/23/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>3</ID>
<NAME>Michelle</NAME>
<SALARY>611</SALARY>
<STARTDATE>11/15/2014</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>4</ID>
<NAME>Ryan</NAME>
<SALARY>729</SALARY>
<STARTDATE>5/11/2014</STARTDATE>
<DEPT>HR</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>
<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
</RECORDS>
Reading XML File
The xml file is read by R using the function xmlParse(). It is stored as a list in R.
1
Rick
623.3
1/1/2012
IT
2
Dan
515.2
9/23/2013
Operations
3
Michelle
611
11/15/2014
IT
4
Ryan
729
5/11/2014
HR
5
Gary
843.25
3/27/2015
Finance
6
Nina
578
5/21/2013
IT
7
Simon
632.8
7/30/2013
Operations
8
Guru
722.5
6/17/2014
Finance
output
[1] 8
Details of the First Node
Let's look at the first record of the parsed file. It will give us an idea of the various elements
present in the top level node.
$EMPLOYEE
1
Rick
623.3
1/1/2012
IT
attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
As the data is now available as a dataframe we can use data frame related function to read
and manipulate the file.
R - JSON Files
A JSON file stores data as text in a human-readable format. JSON stands for JavaScript Object
Notation. R can read JSON files using the rjson package.
install.packages("rjson")
Input Data
Create a JSON file by copying the below data into a text editor like notepad. Save the file with
a .json extension and choosing the file type as all files(*.*).
{
"ID":["1","2","3","4","5","6","7","8" ],
"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
"Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],
"StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
"7/30/2013","6/17/2014"],
"Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
$ID
[1] "1" "2" "3" "4" "5" "6" "7" "8"
$Name
[1] "Rick" "Dan" "Michelle" "Ryan" "Gary" "Nina" "Simon" "Guru"
$Salary
[1] "623.3" "515.2" "611" "729" "843.25" "578" "632.8" "722.5"
$StartDate
[1] "1/1/2012" "9/23/2013" "11/15/2014" "5/11/2014" "3/27/2015" "5/21/2013"
"7/30/2013" "6/17/2014"
$Dept
[1] "IT" "Operations" "IT" "HR" "Finance" "IT"
"Operations" "Finance"
print(json_data_frame)
R - Web Data
Many websites provide data for consumption by their users. For example, the World Health
Organization (WHO) provides reports on health and medical information in the form of CSV, txt
and XML files. Using R programs, we can programmatically extract specific data from such
websites. Some packages in R which are used to scrape data from the web are "RCurl", "XML"
and "stringr". They are used to connect to the URLs, identify the required links for the files
and download them to the local environment.
Install R Packages
The following packages are required for processing the URLs and links to the files. If they are
not available in your R Environment, you can install them using following commands.
install.packages("RCurl")
install.packages("XML")
install.packages("stringr")
install.packages("plyr")
Input Data
We will visit the URL weather data and download the CSV files using R for the year 2015.
Example
We will use the function getHTMLLinks() to gather the URLs of the files. Then we will use the
function download.file() to save the files to the local system. As we will be applying the same
code again and again for multiple files, we will create a function to be called multiple times.
The filenames are passed as parameters in the form of an R list object to this function.
# Identify only the links which point to the JCMB 2015 files.
filenames <- links[str_detect(links, "JCMB_2015")]
# Create a function to download the files by passing the URL and filename list.
downloadcsv <- function (mainurl,filename) {
filedetails <- str_c(mainurl,filename)
download.file(filedetails,filename)
}
# Now apply the l_ply function and save the files into the current R working directory.
l_ply(filenames,downloadcsv,mainurl = "http://www.geos.ed.ac.uk/~weather/jcmb_ws/")
R - Databases
The data in relational database systems is stored in a normalized format. So, to carry out
statistical computing we would need very advanced and complex SQL queries. But R can connect
easily to many relational databases like MySQL, Oracle, SQL Server etc. and fetch records from
them as a data frame. Once the data is available in the R environment, it becomes a normal R
data set and can be manipulated or analyzed using all the powerful packages and functions.
In this tutorial we will be using MySql as our reference database for connecting to R.
RMySQL Package
R has a package named "RMySQL" which provides native connectivity with the
MySQL database. You can install this package in the R environment using the following
command.
install.packages("RMySQL")
Connecting R to MySql
Once the package is installed we create a connection object in R to connect to the database.
It takes the username, password, database name and host name as input.
# Store the result in an R data frame object. n = 5 is used to fetch the first 5 rows.
data.frame = fetch(result, n = 5)
print(data.frame)
After executing the above code we can see the table updated in the MySql Environment.
After executing the above code we can see the row inserted into the table in the MySql
Environment.
After executing the above code we can see the table created in the MySql Environment.
After executing the above code we can see the table is dropped in the MySql Environment.
R - Pie Charts
The R programming language has numerous libraries to create charts and graphs. A pie chart is a
representation of values as slices of a circle with different colors. The slices are labeled,
and the number corresponding to each slice is also represented in the chart.
In R the pie chart is created using the pie() function which takes positive numbers as a vector
input. The additional parameters are used to control labels, color, title etc.
Syntax
The basic syntax for creating a pie chart using R is
radius indicates the radius of the circle of the pie chart (a value between -1 and +1).
clockwise is a logical value indicating whether the slices are drawn clockwise or
anti-clockwise.
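A minimal sketch of a pie() call using the radius and clockwise parameters described above
might look like this. The values, labels, title and the temporary output file are illustrative
assumptions.

```r
# Input values and labels (illustrative data).
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

# Write the chart to a temporary png file.
chart.path <- tempfile(fileext = ".png")
png(file = chart.path)

# Draw the slices clockwise, with a title and one color per slice.
pie(x, labels = labels, radius = 0.8, main = "City pie chart",
   col = rainbow(length(x)), clockwise = TRUE)

# Close the device to save the file.
dev.off()
```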
Example
A very simple pie-chart is created using just the input vector and labels. The below script will
create and save the pie chart in the current R working directory.
Example
The below script will create and save the pie chart in the current R working directory.
piepercent<- round(100*x/sum(x), 1)
3D Pie Chart
A pie chart with 3 dimensions can be drawn using additional packages. The package plotrix
has a function called pie3D() that is used for this.
# Get the library.
library(plotrix)
R - Bar Charts
A bar chart represents data in rectangular bars with length of the bar proportional to the value
of the variable.
R uses the function barplot() to create bar charts. R can draw both vertical and horizontal
bars in the bar chart. In a bar chart each of the bars can be given a different color.
Syntax
The basic syntax to create a bar-chart in R is:
barplot(H,xlab,ylab,main, names.arg,col)
Following is the description of the parameters used :
Example
A simple bar chart is created using just the input vector and the name of each bar.
The below script will create and save the bar chart in the current R working directory.
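As a hedged sketch, a simple bar chart can be produced with barplot() as follows. The values,
bar names, axis labels and the temporary output file are illustrative assumptions.

```r
# Create the data for the chart (illustrative values).
H <- c(7, 12, 28, 3, 41)
M <- c("Mar", "Apr", "May", "Jun", "Jul")

# Write the chart to a temporary png file.
chart.path <- tempfile(fileext = ".png")
png(file = chart.path)

# Plot the bar chart. barplot() returns the midpoints of the bars.
mids <- barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue",
   col = "blue", main = "Revenue chart")

# Close the device to save the file.
dev.off()
```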
Example
The below script will create and save the bar chart in the current R working directory.
More than two variables are represented as a matrix which is used to create the group bar
chart and stacked bar chart.
#Create the input vectors.
colors=c("green","orange","brown")
months <- c("Mar","Apr","May","Jun","Jul")
regions <- c("East","West","North")
R - Boxplots
Boxplots are a measure of how well the data in a data set is distributed. A boxplot divides the
data set at three quartiles. The graph represents the minimum, maximum, median, first quartile
and third quartile of the data set. It is also useful for comparing the distribution of data
across data sets by drawing a boxplot for each of them.
Boxplots are created in R by using the boxplot() function.
Syntax
The basic syntax to create a boxplot in R is
x is a vector or a formula.
varwidth is a logical value. Set as true to draw width of the box proportionate to the
sample size.
names are the group labels which will be printed under each boxplot.
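The sketch below passes a formula and the varwidth parameter to boxplot() for the built-in
"mtcars" data. It uses plot = FALSE so that only the computed statistics are returned, which
is a choice made here to show the five summary values per group.

```r
# Compute the boxplot statistics of mpg for each cyl group without drawing.
stats <- boxplot(mpg ~ cyl, data = mtcars, varwidth = TRUE, plot = FALSE)

# One column per group: lower whisker, first quartile, median,
# third quartile and upper whisker.
print(stats$stats)

# Group labels and the number of observations in each group.
print(stats$names)
print(stats$n)
```

Dropping plot = FALSE draws the chart itself on the active graphics device.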
Example
We use the data set "mtcars" available in the R environment to create a basic boxplot. Let's
look at the columns "mpg" and "cyl" in mtcars.
mpg cyl
Mazda RX4 21.0 6
Mazda RX4 Wag 21.0 6
Datsun 710 22.8 4
Hornet 4 Drive 21.4 6
Hornet Sportabout 18.7 8
Valiant 18.1 6
The below script will create a boxplot graph with notch for each of the data group.
# Give the chart file a name.
png(file = "boxplot_with_notch.png")
R - Histograms
A histogram represents the frequencies of values of a variable bucketed into ranges.
A histogram is similar to a bar chart, but the difference is that it groups the values into
continuous ranges. The height of each bar in a histogram represents the number of values
present in that range.
R creates a histogram using the hist() function. This function takes a vector as an input and
uses some more parameters to plot histograms.
Syntax
The basic syntax for creating a histogram using R is
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Example
A simple histogram is created using input vector, label, col and border parameters.
The script given below will create and save the histogram in the current R working directory.
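A minimal sketch of such a script is given here. The input vector, the breaks, the colors and
the temporary output file are illustrative assumptions.

```r
# Create data for the graph (illustrative values).
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# Compute the histogram with explicit breaks, without drawing it yet.
h <- hist(v, breaks = seq(0, 50, by = 10), plot = FALSE)

# Number of values falling into each range.
print(h$counts)

# Draw the histogram and save it to a temporary png file.
chart.path <- tempfile(fileext = ".png")
png(file = chart.path)
plot(h, main = "Histogram of v", xlab = "Weight", col = "yellow", border = "blue")
dev.off()
```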
Syntax
The basic syntax to create a line chart in R is
plot(v,type,col,xlab,ylab)
type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to
draw both points and lines.
Example
A simple line chart is created using the input vector and the type parameter as "o". The below
script will create and save a line chart in the current R working directory.
Example
# Create the data for the chart.
v <- c(7,12,28,3,41)
After the first line is plotted, the lines() function can use an additional vector as input to
draw the second line in the chart.
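As a sketch of drawing two lines in one chart, plot() draws the first series and lines() adds
the second. The second vector, the colors, the labels and the temporary output file are
illustrative assumptions.

```r
# Create the data for the chart.
v <- c(7, 12, 28, 3, 41)
# A second series (illustrative values).
v2 <- c(14, 7, 6, 19, 3)

# Write the chart to a temporary png file.
chart.path <- tempfile(fileext = ".png")
png(file = chart.path)

# Plot the first line with both points and lines.
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
   main = "Rain fall chart")

# Add the second line to the same chart.
lines(v2, type = "o", col = "blue")

# Close the device to save the file.
dev.off()
```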
R - Scatterplots
Scatterplots show many points plotted in the Cartesian plane. Each point represents the
values of two variables. One variable is plotted on the horizontal axis and the other on the
vertical axis.
Syntax
The basic syntax for creating scatterplot in R is
Example
We use the data set "mtcars" available in the R environment to create a basic scatterplot. Let's
use the columns "wt" and "mpg" in mtcars.
wt mpg
Mazda RX4 2.620 21.0
Mazda RX4 Wag 2.875 21.0
Datsun 710 2.320 22.8
Hornet 4 Drive 3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant 3.460 18.1
Creating the Scatterplot
The below script will create a scatterplot graph for the relation between wt(weight) and
mpg(miles per gallon).
# Plot the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.
plot(x = input$wt,y = input$mpg,
xlab = "Weight",
ylab = "Mileage",
xlim = c(2.5,5),
ylim = c(15,30),
main = "Weight vs Mileage"
)
Scatterplot Matrices
When we have more than two variables and we want to find the correlation between one
variable and each of the remaining ones, we use a scatterplot matrix. We use the pairs()
function to create matrices of scatterplots.
Syntax
The basic syntax for creating scatterplot matrices in R is
pairs(formula, data)
data represents the data set from which the variables will be taken.
Example
Each variable is paired up with each of the remaining variables. A scatterplot is plotted for
each pair.
pairs(~wt+mpg+disp+cyl,data = mtcars,
main = "Scatterplot Matrix")
The functions we are discussing in this chapter are mean, median and mode.
Mean
It is calculated by taking the sum of the values and dividing by the number of values in a
data series.
Syntax
The basic syntax for calculating mean in R is
trim is used to drop some observations from both end of the sorted vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
[1] 8.22
When trim = 0.3, 3 values from each end will be dropped from the calculations to find mean.
In this case the sorted vector is (-21, -5, 2, 3, 4.2, 7, 8, 12, 18, 54) and the values removed
from the vector for calculating mean are (-21, -5, 2) from left and (12, 18, 54) from right.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x,trim = 0.3)
print(result.mean)
[1] 5.55
Applying NA Option
If there are missing values, then the mean function returns NA.
To drop the missing values from the calculation use na.rm = TRUE, which means remove the
NA values.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)
# Find mean.
result.mean <- mean(x)
print(result.mean)
# Find mean dropping NA values.
result.mean <- mean(x,na.rm = TRUE)
print(result.mean)
[1] NA
[1] 8.22
Median
The middle most value in a data series is called the median. The median() function is used in
R to calculate this value.
Syntax
The basic syntax for calculating median in R is
na.rm is used to remove the missing values from the input vector.
Example
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find the median.
print(median(x))
[1] 5.6
Mode
The mode is the value that has the highest number of occurrences in a set of data. Unlike mean
and median, mode can have both numeric and character data.
R does not have a standard built-in function to calculate mode. So we create a user function
to calculate the mode of a data set in R. This function takes the vector as input and gives the
mode value as output.
Example
# Create the function.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
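As a usage sketch, the function can be applied to a numeric vector and to a character vector.
The two input vectors here are illustrative assumptions. When several values tie for the
highest count, this function returns the one that appears first in the vector.

```r
# Create the function.
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Mode of a numeric vector (illustrative values).
v <- c(2, 1, 2, 3, 1, 2, 3, 4, 1, 5, 5, 3, 2, 3)
result <- getmode(v)
print(result)

# Mode of a character vector (illustrative values).
charv <- c("o", "it", "the", "it", "it")
result.char <- getmode(charv)
print(result.char)
```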
[1] 2
[1] "it"
R - Linear Regression
Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables. One of these variables is called the predictor variable, whose value is
gathered through experiments. The other variable is called the response variable, whose value
is derived from the predictor variable.
In linear regression these two variables are related through an equation in which the exponent
(power) of both variables is 1. Mathematically, a linear relationship represents a straight
line when plotted as a graph. A non-linear relationship, where the exponent of any variable is
not equal to 1, creates a curve.
y = ax + b
Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.
Find the coefficients from the model created and create the mathematical equation
using these
Get a summary of the relationship model to know the average error in prediction.
Also called residuals.
Input Data
Below is the sample data representing the observations
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in linear regression is
lm(formula,data)
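A minimal sketch of fitting the model to the height and weight sample data listed above is
given here. The variable names x, y, a and b follow the equation y = ax + b and are otherwise
arbitrary.

```r
# Predictor (height) and response (weight) values from the sample data.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y ~ x)

# Extract the slope a and the intercept b of y = ax + b.
a <- coef(relation)[["x"]]
b <- coef(relation)[["(Intercept)"]]
print(c(a = a, b = b))
```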
print(relation)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
print(summary(relation))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
predict() Function
Syntax
The basic syntax for predict() in linear regression is
predict(object, newdata)
object is the model which is already created using the lm() function.
newdata is the vector containing the new value for predictor variable.
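As a sketch, the weight of a person can be predicted from the fitted model. The height of 170
is an illustrative assumption. Note that newdata is passed as a data frame whose column name
matches the predictor used in the formula.

```r
# The predictor vector (height).
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
# The response vector (weight).
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y ~ x)

# Find the weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
```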
1
76.22869
We create the regression model using the lm() function in R. The model determines the value
of the coefficients using the input data. Next we can predict the value of the response variable
for a given set of predictor variables using these coefficients.
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in multiple regression is
lm(y ~ x1+x2+x3...,data)
formula is a symbol presenting the relation between the response variable and
predictor variables.
Example
Input Data
Consider the data set "mtcars" available in the R environment. It gives a comparison between
different car models in terms of mileage per gallon (mpg), cylinder displacement("disp"),
horse power("hp"), weight of the car("wt") and some more parameters.
The goal of the model is to establish the relationship between "mpg" as a response variable
with "disp","hp" and "wt" as predictor variables. We create a subset of these variables from the
mtcars data set for this purpose.
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460
a <- coef(model)[1]
print(a)
print(Xdisp)
print(Xhp)
print(Xwt)
Call:
lm(formula = mpg ~ disp + hp + wt, data = input)
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
Y = a + Xdisp*x1 + Xhp*x2 + Xwt*x3
or
Y = 37.105505 + (-0.000937)*x1 + (-0.031157)*x2 + (-3.800891)*x3
For a car with disp = 221, hp = 102 and wt = 2.91 the predicted mileage is
Y = 37.105505 + (-0.000937)*221 + (-0.031157)*102 + (-3.800891)*2.91 = 22.6598
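The same prediction can be obtained from the fitted model with predict() instead of
substituting the coefficients by hand. This is a sketch; the name newcar is a hypothetical
helper introduced here.

```r
# Create the subset of variables and fit the model.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)

# Show the fitted coefficients.
print(coef(model))

# Predict the mileage for a car with disp = 221, hp = 102 and wt = 2.91.
newcar <- data.frame(disp = 221, hp = 102, wt = 2.91)
mileage <- predict(model, newcar)
print(mileage)
```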
R - Logistic Regression
The Logistic Regression is a regression model in which the response variable (dependent
variable) has categorical values such as True/False or 0/1. It actually measures the
probability of a binary response as the value of response variable based on the mathematical
equation relating it with the predictor variables.
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))
The function used to create the regression model is the glm() function.
Syntax
The basic syntax for glm() function in logistic regression is
glm(formula,data,family)
family is an R object to specify the details of the model. Its value is binomial for logistic
regression.
Example
The in-built data set "mtcars" describes different models of a car with their various engine
specifications. In "mtcars" data set, the transmission mode (automatic or manual) is
described by the column am which is a binary value (0 or 1). We can create a logistic
regression model between the columns "am" and 3 other columns - hp, wt and cyl.
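A minimal sketch of this model is given below. The object name am.data matches the summary
call used in this section; the probs variable is an addition made here to show the fitted
probabilities.

```r
# Select the response and the three predictor columns.
input <- mtcars[, c("am", "cyl", "hp", "wt")]

# Fit the logistic regression model.
am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(coef(am.data))

# Fitted probability of am = 1 for each car.
probs <- predict(am.data, type = "response")
print(head(probs))
```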
print(head(input))
am cyl hp wt
Mazda RX4 1 6 110 2.620
Mazda RX4 Wag 1 6 110 2.875
Datsun 710 1 4 93 2.320
Hornet 4 Drive 0 6 110 3.215
Hornet Sportabout 0 8 175 3.440
Valiant 0 6 105 3.460
print(summary(am.data))
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Conclusion
In the summary as the p-value in the last column is more than 0.05 for the variables "cyl" and
"hp", we consider them to be insignificant in contributing to the value of the variable "am". Only
weight (wt) impacts the "am" value in this regression model.
R - Normal Distribution
In a random collection of data from independent sources, it is generally observed that the
distribution of data is normal. This means that, on plotting a graph with the value of the
variable on the horizontal axis and the count of the values on the vertical axis, we get a
bell-shaped curve. The center of the curve represents the mean of the data set. In the graph,
fifty percent of the values lie to the left of the mean and the other fifty percent lie to the
right of the mean. This is referred to as the normal distribution in statistics.
R has four built-in functions to generate normal distribution. They are described below.
x is a vector of numbers.
p is a vector of probabilities.
mean is the mean value of the sample data. Its default value is zero.
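As a brief sketch, the four functions can be called on single values of the standard normal
distribution. The seed and the sample size of 5 are illustrative assumptions.

```r
# Height of the standard normal density at 0.
d <- dnorm(0, mean = 0, sd = 1)

# Probability that a standard normal value is less than or equal to 0.
p <- pnorm(0, mean = 0, sd = 1)

# Value whose cumulative probability is 0.5 (the inverse of pnorm).
q <- qnorm(0.5, mean = 0, sd = 1)

# Five random values. set.seed() makes the draw reproducible.
set.seed(42)
r <- rnorm(5, mean = 0, sd = 1)

print(c(d, p, q))
print(r)
```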
dnorm()
This function gives height of the probability distribution at each point for a given mean and
standard deviation.
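# Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)
# Compute the density at each point (default mean and standard deviation assumed here).
y <- dnorm(x)
plot(x,y)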
pnorm()
This function gives the probability of a normally distributed random number being less than
the value of a given number. It is also called the "Cumulative Distribution Function".
# Create a sequence of numbers between -10 and 10 incrementing by 0.2.
x <- seq(-10,10,by = .2)
qnorm()
This function takes the probability value and gives a number whose cumulative value matches
the probability value.
# Create a sequence of probability values incrementing by 0.02.
x <- seq(0, 1, by = 0.02)
rnorm()
This function is used to generate random numbers whose distribution is normal. It takes the
sample size as input and generates that many random numbers. We draw a histogram to
show the distribution of the generated numbers.
# Create a sample of 50 numbers which are normally distributed.
y <- rnorm(50)
R - Binomial Distribution
The binomial distribution model deals with finding the probability of success of an event
which has only two possible outcomes in a series of experiments. For example, tossing a
coin always gives a head or a tail. The probability of finding exactly 3 heads when tossing a
coin 10 times is estimated using the binomial distribution.
R has four in-built functions to generate binomial distribution. They are described below.
x is a vector of numbers.
p is a vector of probabilities.
n is number of observations.
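A short sketch of the four functions on coin-toss examples is given here. The arguments passed
to rbinom() and the seed are illustrative assumptions.

```r
# Probability of exactly 3 heads in 10 tosses of a fair coin.
d <- dbinom(3, size = 10, prob = 0.5)

# Probability of 26 or fewer heads in 51 tosses.
p <- pbinom(26, size = 51, prob = 0.5)

# Smallest number of heads whose cumulative probability is at least 0.25.
q <- qbinom(0.25, size = 51, prob = 0.5)

# Eight random counts of successes out of 150 trials with probability 0.4.
set.seed(1)
r <- rbinom(8, size = 150, prob = 0.4)

print(c(d, p, q))
print(r)
```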
dbinom()
This function gives the probability density distribution at each point.
print(x)
[1] 0.610116
qbinom()
This function takes the probability value and gives a number whose cumulative value matches
the probability value.
# Number of heads whose cumulative probability reaches 0.25 when a coin is tossed 51 times.
x <- qbinom(0.25,51,1/2)
print(x)
When we execute the above code, it produces the following result
[1] 23
rbinom()
This function generates required number of random values of given probability from a given
sample.
print(x)
[1] 58 61 59 66 55 60 61 67
R - Poisson Regression
Poisson Regression involves regression models in which the response variable is in the form
of counts and not fractional numbers. For example, the count of number of births or number
of wins in a football match series. Also the values of the response variables follow a Poisson
distribution.
The function used to create the Poisson regression model is the glm() function.
Syntax
The basic syntax for glm() function in Poisson regression is
glm(formula,data,family)
family is an R object to specify the details of the model. Its value is poisson for
Poisson regression.
Example
We have the in-built data set "warpbreaks" which describes the effect of wool type (A or B)
and tension (low, medium or high) on the number of warp breaks per loom. Let's consider
"breaks" as the response variable which is a count of number of breaks. The wool "type" and
"tension" are taken as predictor variables.
Input Data
input <- warpbreaks
print(head(input))
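As a sketch, the Poisson regression model described above can be fitted as follows. The object
name output is an assumption introduced for illustration.

```r
# Fit the Poisson regression model on the built-in warpbreaks data.
output <- glm(formula = breaks ~ wool + tension, data = warpbreaks,
   family = poisson)

# Print the coefficients, standard errors and p-values.
print(summary(output))
```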
Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6871 -1.6503 -0.4269 1.1902 4.2616
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
woolB -0.20599 0.05157 -3.994 6.49e-05 ***
tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
In the summary we look for a p-value in the last column of less than 0.05 to consider that a
predictor variable has an impact on the response variable. As seen, wool type B and
tension types M and H have an impact on the count of breaks.
R - Analysis of Covariance
We use regression analysis to create models which describe the effect of variation in
predictor variables on the response variable. Sometimes we have a categorical variable
with values like Yes/No or Male/Female. A simple regression analysis then gives a separate
result for each value of the categorical variable. In such a scenario, we can study the effect
of the categorical variable by using it along with the predictor variable and comparing the
regression lines for each level of the categorical variable. Such an analysis is termed
Analysis of Covariance, also called ANCOVA.
Example
Consider the R built-in data set mtcars. In it we observe that the field "am" represents the type
of transmission (auto or manual). It is a categorical variable with values 0 and 1. The miles
per gallon value(mpg) of a car can also depend on it besides the value of horse power("hp").
We study the effect of the value of "am" on the regression between "mpg" and "hp". It is done
by using the aov() function followed by the anova() function to compare the multiple
regressions.
Input Data
Create a data frame containing the fields "mpg", "hp" and "am" from the data set mtcars. Here
we take "mpg" as the response variable, "hp" as the predictor variable and "am" as the
categorical variable.
input <- mtcars[,c("am","mpg","hp")]
print(head(input))
                  am  mpg  hp
Mazda RX4          1 21.0 110
Mazda RX4 Wag      1 21.0 110
Datsun 710         1 22.8  93
Hornet 4 Drive     0 21.4 110
Hornet Sportabout  0 18.7 175
Valiant            0 18.1 105
ANCOVA Analysis
We create a regression model taking "hp" as the predictor variable and "mpg" as the response
variable taking into account the interaction between "am" and "hp".
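A model of this kind might be fitted as in the sketch below. The formula mpg ~ hp * am matches "Model 1" in the comparison table later in this section; the variable name result1 is an assumption made for illustration.

```r
# Keep only the fields used in the analysis.
input <- mtcars[, c("am", "mpg", "hp")]

# hp * am expands to hp + am + hp:am, so the model includes
# both main effects and their interaction.
result1 <- aov(mpg ~ hp * am, data = input)

# One row per term, with the F statistic and p-value for each.
print(summary(result1))
```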
This result shows that both horse power and transmission type has significant effect on miles
per gallon as the p value in both cases is less than 0.05. But the interaction between these
two variables is not significant as the p-value is more than 0.05.
When the model is fitted again without the interaction term (mpg ~ hp + am), the result
shows that both horse power and transmission type have a significant effect on miles
per gallon, as the p-value in both cases is less than 0.05.
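The two models can then be compared with anova(). The following is a sketch under the assumption that the models are named result1 and result2; the formulas are those listed as "Model 1" and "Model 2" in the table that follows.

```r
input <- mtcars[, c("am", "mpg", "hp")]

# Model with the hp:am interaction.
result1 <- aov(mpg ~ hp * am, data = input)

# Model with main effects only.
result2 <- aov(mpg ~ hp + am, data = input)
print(summary(result2))

# F test of whether dropping the interaction worsens the fit.
comparison <- anova(result1, result2)
print(comparison)
```

A large p-value in the comparison suggests that the simpler model without the interaction describes the data about as well.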
Model 1: mpg ~ hp * am
Model 2: mpg ~ hp + am
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28 245.43
2 29 245.44 -1 -0.0052515 6e-04 0.9806
As the p-value is greater than 0.05 we conclude that the interaction between horse power and
transmission type is not significant. So the mileage per gallon will depend in a similar manner
on the horse power of the car in both auto and manual transmission mode.
Syntax
The basic syntax for the ts() function in time series analysis is
timeseries.object.name <- ts(data, start, end, frequency)
data is a vector or matrix containing the values used in the time series.
start specifies the start time for the first observation in the time series.
end specifies the end time for the last observation in the time series.
frequency specifies the number of observations per unit of time.
Example
Consider the monthly rainfall at a place starting from January 2012. We create an R time
series object for a period of 12 months and plot it.
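A minimal sketch of this step follows. The twelve values are taken from the "Series 1" column of the table shown later in this section; the variable names are assumptions.

```r
# Monthly rainfall values (Series 1 in the table further down).
rainfall <- c(799, 1174.8, 865.1, 1334.6, 635.4, 918.5,
              685.5, 998.6, 784.2, 985, 882.8, 1071)

# start = c(2012, 1) means January 2012; frequency = 12 means monthly data.
rainfall.timeseries <- ts(rainfall, start = c(2012, 1), frequency = 12)

print(rainfall.timeseries)

# Draw the series as a line chart.
plot(rainfall.timeseries)
```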
When we execute the above code, it produces the following result and chart
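The frequency parameter of ts() sets the number of observations per unit of time. For example,
frequency = 12 pegs the data points for every month of a year, frequency = 4 for every
quarter of a year, and frequency = 24*6 for every 10 minutes of a day.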
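Several series can be held in one time series object by passing a matrix to ts(). The sketch below uses the two columns of the table that follows as its data; the variable names are assumptions.

```r
# Rainfall at two places, one value per month of 2012.
rainfall1 <- c(799, 1174.8, 865.1, 1334.6, 635.4, 918.5,
               685.5, 998.6, 784.2, 985, 882.8, 1071)
rainfall2 <- c(655, 1306.9, 1323.4, 1172.2, 562.2, 824,
               822.4, 1265.5, 799.6, 1105.6, 1106.7, 1337.8)

# Each matrix column becomes one series.
combined.rainfall <- matrix(c(rainfall1, rainfall2), nrow = 12)

rainfall.timeseries <- ts(combined.rainfall, start = c(2012, 1), frequency = 12)
print(rainfall.timeseries)

# Plot both series in one chart.
plot(rainfall.timeseries, main = "Multiple Time Series")
```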
When we execute the above code, it produces the following result and chart
Series 1 Series 2
Jan 2012 799.0 655.0
Feb 2012 1174.8 1306.9
Mar 2012 865.1 1323.4
Apr 2012 1334.6 1172.2
May 2012 635.4 562.2
Jun 2012 918.5 824.0
Jul 2012 685.5 822.4
Aug 2012 998.6 1265.5
Sep 2012 784.2 799.6
Oct 2012 985.0 1105.6
Nov 2012 882.8 1106.7
Dec 2012 1071.0 1337.8
In Least Square regression, we establish a regression model in which the sum of the squares
of the vertical distances of the data points from the regression curve is minimized. We
generally start with a defined model and assume some values for the coefficients. We then
apply the nls() function of R to get more accurate values along with their confidence
intervals.
Syntax
The basic syntax for creating a nonlinear least square test in R is
nls(formula, data, start)
Example
We will consider a nonlinear model with assumed initial values for its coefficients. Next
we will look at the confidence intervals of the fitted coefficients so that we can judge
how well the assumed values fit the model.
a = b1*x^2+b2
Let's assume the initial coefficients to be 1 and 3 and fit these values into nls() function.
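A minimal sketch of such a fit is shown below. The data here are synthetic, generated from the model with known coefficients plus noise, and are not the data behind the output in this section, so the numbers it prints will differ. The names xvalues, yvalues and model follow the plotting code below.

```r
# Synthetic data (an assumption for illustration): y = 1.2 * x^2 + 2 + noise.
set.seed(1)
xvalues <- seq(1, 4, length.out = 20)
yvalues <- 1.2 * xvalues^2 + 2 + rnorm(20, sd = 0.5)

# Plot the raw points.
plot(xvalues, yvalues)

# Fit the model, starting from the assumed coefficients b1 = 1 and b2 = 3.
model <- nls(yvalues ~ b1 * xvalues^2 + b2, start = list(b1 = 1, b2 = 3))

# Sum of squared residuals of the fit.
print(sum(resid(model)^2))

# Confidence intervals for the fitted coefficients.
print(confint(model))
```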
# Overlay the fitted curve, predicted at 100 evenly spaced x values, on the existing plot.
new.data <- data.frame(xvalues = seq(min(xvalues), max(xvalues), length.out = 100))
lines(new.data$xvalues, predict(model, newdata = new.data))
[1] 1.081935
Waiting for profiling to be done...
2.5% 97.5%
b1 1.137708 1.253135
b2 1.497364 2.496484
We can conclude that the value of b1 is close to 1, while the value of b2 is closer to
2 than to the assumed 3.
R - Decision Tree
A decision tree is a graph that represents choices and their results in the form of a tree. The nodes in
the graph represent an event or choice and the edges of the graph represent the decision
rules or conditions. It is mostly used in Machine Learning and Data Mining applications using
R.
Examples of the use of decision trees are predicting whether an email is spam or not spam, predicting whether
a tumor is cancerous, or predicting whether a loan is a good or bad credit risk based on the factors in
each of these. Generally, a model is created with observed data, also called training data. Then
a set of validation data is used to verify and improve the model. R has packages which are
used to create and visualize decision trees. For a new set of predictor variables, we use this
model to arrive at a decision on the category (yes/no, spam/not spam) of the data.
Install R Package
Use the below command in R console to install the package. You also have to install the
dependent packages if any.
install.packages("party")
The package "party" has the function ctree() which is used to create and analyze decision trees.
Syntax
The basic syntax for creating a decision tree in R is
ctree(formula, data)
Input Data
We will use the data set named readingSkills, which ships with the party package, to create a
decision tree. It records the variables "age", "shoeSize" and "score" for each person, along with
whether the person is a native speaker or not.
# Load the party package. It will automatically load other dependent packages.
library(party)
# Print the first few records of the data set.
print(head(readingSkills))
When we execute the above code, it produces the following result and chart
# Load the party package. It will automatically load other dependent packages.
library(party)
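A tree for this data set might be built as in the sketch below. The formula is borrowed from the random forest call shown later in this guide, and the name output.tree is an assumption; this is an illustration rather than the original listing.

```r
# Load the party package, which provides ctree() and readingSkills.
library(party)

# Fit a conditional inference tree predicting nativeSpeaker
# from age, shoeSize and score.
output.tree <- ctree(nativeSpeaker ~ age + shoeSize + score,
                     data = readingSkills)

# Draw the fitted tree.
plot(output.tree)
```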
null device
1
Loading required package: methods
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo
as.Date, as.Date.numeric
R - Random Forest
In the random forest approach, a large number of decision trees are created. Every
observation is fed into every decision tree. The most common outcome for each observation
is used as the final output. A new observation is classified by feeding it into all the trees and
taking a majority vote over their predictions.
An error estimate is made for the cases which were not used while building the tree. That is
called an OOB (Out-of-bag) error estimate which is mentioned as a percentage.
Install R Package
Use the below command in R console to install the package. You also have to install the
dependent packages if any.
install.packages("randomForest")
The package "randomForest" has the function randomForest() which is used to create and
analyze random forests.
Syntax
The basic syntax for creating a random forest in R is
randomForest(formula, data)
Input Data
We will use the data set named readingSkills, which ships with the party package, to create a
random forest. It records the variables "age", "shoeSize" and "score" for each person, along with
whether the person is a native speaker.
# Load the party package. It will automatically load other required packages.
library(party)
# Print the first few records of the data set.
print(head(readingSkills))
When we execute the above code, it produces the following result and chart
Example
We will use the randomForest() function to create the random forest and inspect its output.
# Load the party package. It will automatically load other required packages.
library(party)
library(randomForest)
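The forest in the output that follows can be reproduced along these lines. The formula and data come from the Call shown in that output; the seed and the name output.forest are assumptions added for illustration, and exact error figures vary with the seed.

```r
# party supplies the readingSkills data; randomForest supplies the model.
library(party)
library(randomForest)

# Fix the random seed so repeated runs build the same trees.
set.seed(42)

# Grow a classification forest (500 trees by default).
output.forest <- randomForest(nativeSpeaker ~ age + shoeSize + score,
                              data = readingSkills)

# Summary including the OOB error estimate and confusion matrix.
print(output.forest)

# Mean decrease in Gini impurity for each predictor.
print(importance(output.forest, type = 2))
```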
Call:
randomForest(formula = nativeSpeaker ~ age + shoeSize + score,
data = readingSkills)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
Conclusion
From the random forest shown above we can conclude that shoe size and score are the
important factors in deciding whether someone is a native speaker or not. The model also has an
error of only about 1%, which means we can predict with about 99% accuracy.
R - Survival Analysis
Survival analysis deals with predicting the time when a specific event is going to occur. It is
also known as failure time analysis or analysis of time to death. Examples include predicting the
number of days a person with cancer will survive, or predicting the time when a mechanical
system is going to fail.
The R package named survival is used to carry out survival analysis. This package contains
the function Surv(), which takes the time and event data and creates a survival object
from the chosen variables for analysis. Then we use the function survfit() to fit the survival
curve that we plot for the analysis.
Install Package
install.packages("survival")
Syntax
The basic syntax for creating survival analysis in R is
Surv(time,event)
survfit(formula)
Example
We will consider the data set named "pbc" present in the survival package installed above. It
describes the survival data points about people affected with primary biliary cirrhosis (PBC) of
the liver. Among the many columns present in the data set we are primarily concerned with
the fields "time" and "status". Time represents the number of days between registration of the
patient and the earlier of two events: the patient receiving a liver transplant or the death of the
patient.
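The data set can be inspected with a short sketch like the following, which assumes only that the survival package is installed:

```r
# Load the survival package, which includes the pbc data set.
library(survival)

# Print the first few rows to see the available columns.
print(head(pbc))
```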
When we execute the above code, it produces the following result and chart
id time status trt age sex ascites hepato spiders edema bili chol
1 1 400 2 1 58.76523 f 1 1 1 1.0 14.5 261
2 2 4500 0 1 56.44627 f 0 1 1 0.0 1.1 302
3 3 1012 2 1 70.07255 m 0 0 0 0.5 1.4 176
4 4 1925 2 1 54.74059 f 0 1 1 0.5 1.8 244
5 5 1504 1 2 38.10541 f 0 1 1 0.0 3.4 279
6 6 2503 2 2 66.25873 f 0 1 0 0.0 0.8 248
albumin copper alk.phos ast trig platelet protime stage
1 2.60 156 1718.0 137.95 172 190 12.2 4
2 4.14 54 7394.8 113.52 88 221 10.6 3
3 3.48 210 516.0 96.10 55 151 12.0 4
4 2.54 64 6121.8 60.63 92 183 10.3 4
5 3.53 143 671.0 113.15 72 136 10.9 3
6 3.98 50 944.0 93.00 63 NA 11.0 3
From the above data we are considering time and status for our analysis.
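One way to carry out this step is sketched below. In the pbc data set, status is coded 0 for censored, 1 for transplant and 2 for death; treating only death (status == 2) as the event is an assumption made here for illustration, as is the name fit.

```r
library(survival)

# Build the survival object and fit a single Kaplan-Meier curve
# (~ 1 means no grouping variable).
fit <- survfit(Surv(time, status == 2) ~ 1, data = pbc)

# Number of subjects, number of events and median survival time.
print(fit)

# Plot the estimated survival curve with its confidence band.
plot(fit, xlab = "Days since registration", ylab = "Survival probability")
```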
When we execute the above code, it produces the following result and chart
For example, we can build a data set with observations on people's ice-cream buying patterns
and try to correlate the gender of a person with the flavor of ice-cream they prefer. If a
correlation is found, we can plan an appropriate stock of flavors by knowing the number of
people of each gender visiting.
Syntax
The function used for performing chi-Square test is chisq.test().
chisq.test(data)
Example
We will take the Cars93 data in the "MASS" library which represents the sales of different
models of car in the year 1993.
library("MASS")
# str() prints the structure itself, so no print() call is needed.
str(Cars93)
The above result shows the dataset has many Factor variables which can be considered as
categorical variables. For our model we will consider the variables "AirBags" and "Type". Here
we aim to find out whether there is any significant association between the type of car sold and the type of air
bags it has. If an association is observed, we can estimate which types of cars can sell better with
which types of air bags.
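The test can be run on a contingency table of the two variables, as in this sketch. The name car.data matches the "data:" line in the output that follows; the name result is an assumption.

```r
# Cars93 lives in the MASS package.
library(MASS)

# Cross-tabulate air bag type against car type.
car.data <- table(Cars93$AirBags, Cars93$Type)
print(car.data)

# Chi-square test of independence on the contingency table.
# R may warn that the approximation is unreliable when some
# expected cell counts are small.
result <- chisq.test(car.data)
print(result)
```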
data: car.data
X-squared = 33.001, df = 10, p-value = 0.0002723
Warning message:
In chisq.test(car.data) : Chi-squared approximation may be incorrect
Conclusion
The result shows a p-value of less than 0.05, which indicates a strong association.