
UNIT - 5

Uploaded by

Bulbul Sharma
UNIT - 5

Strings, Data in R, Graphics


String operations in R, Regular Expression, Data in R, Time in R
Graphics: one-dimensional plot, legends, Function plot, Box plot

String operations in R -
String manipulation basically refers to the process of handling and analyzing strings.
It involves various operations concerned with the modification and parsing of strings
to use and change their data. R offers a series of in-built functions to manipulate the
contents of a string. In this article, we will study different functions concerned with
the manipulation of strings in R.
Concatenation of Strings
String Concatenation is the technique of combining two strings. String Concatenation
can be done using many ways:
● paste() function
Any number of strings can be concatenated together using the paste()
function to form a larger string. The sep argument sets the separator
placed between the individual string elements. The collapse argument, when
it is not NULL, joins the elements of the resulting vector into a single
larger string using the given separator. By default, sep is a single space
and collapse is NULL.
Syntax:
paste(..., sep=" ", collapse = NULL)
Example:
# R program for String concatenation
# Concatenation using paste() function
str <- paste("Learn", "Code")
print(str)
Output:
"Learn Code"
If no separator is specified, the default separator " " (a single space) is
inserted between the individual strings.
Example:
str <- paste(c(1:3), "4", sep = ":")
print(str)
Output:
"1:4" "2:4" "3:4"
Since the objects to be concatenated are of different lengths, the shorter
one is recycled to match the longer one. The first argument is the sequence
1, 2, 3, and each of its elements is concatenated with the string "4" using
the separator ':'.
Example:
str <- paste(c(1:4), c(5:8), sep = "--")
print(str)
Output:
"1--5" "2--6" "3--7" "4--8"
Since both vectors are of the same length, their corresponding elements are
concatenated: the first element of the first vector is concatenated with the
first element of the second vector using the separator "--", and so on.
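The examples above leave collapse at its default. The following is a minimal sketch, using made-up example vectors, of how sep and collapse work together:

```r
# Hypothetical example vector
words <- c("Learn", "Code", "Teach")

# collapse joins the elements of one vector into a single string
joined <- paste(words, collapse = " ")
print(joined)    # "Learn Code Teach"

# sep is applied element-wise first, then collapse joins the results
paired <- paste(c("a", "b"), 1:2, sep = "-", collapse = "+")
print(paired)    # "a-1+b-2"
```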

● cat() function
Objects of different types can be concatenated and printed using the cat() function in R,
where sep specifies the separator placed between the elements and file names a file, in case
we wish to write the contents to a file instead of the console.
Syntax:
cat(..., sep=" ", file)
The output is printed without any quotes and the default separator is " " (a single space).
cat() returns NULL invisibly.
Example: cat(c(1:5), file ='sample.txt')
Output:
1 2 3 4 5

The output is written to the text file sample.txt in the current working directory.
Calculating Length of strings
● length() function
The length() function determines the number of strings specified in the
function.
Example:
# R program to calculate length
print(length(c("Learn to", "Code")))
Output:
2
There are two strings specified in the function.
● nchar() function
nchar() counts the number of characters in each of the strings specified as
arguments to the function individually.
Example: print(nchar(c("Learn", "Code")))
Output:
5 4
The output gives the number of characters in "Learn" and then in "Code".
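The difference between the two functions is easy to miss, so here is a small side-by-side sketch on the same vector (the variable names are illustrative only):

```r
words <- c("Learn to", "Code")

n_elements <- length(words)   # number of strings in the vector
n_chars    <- nchar(words)    # characters in each string; the space counts

print(n_elements)   # 2
print(n_chars)      # 8 4
```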

Case Conversion of strings


● Conversion to upper case
All the characters of the strings specified are converted to upper case.
Example: print (toupper(c("Learn Code", "hI")))
Output:
"LEARN CODE" "HI"
● Conversion to lowercase
All the characters of the strings specified are converted to lowercase.
Example: print (tolower(c("Learn Code", "hI")))
Output:
"learn code" "hi"
● casefold() function
All the characters of the strings specified are converted to lowercase or
uppercase according to the upper argument of casefold(…, upper = FALSE):
lowercase by default, uppercase when upper = TRUE.
Examples: print(casefold(c("Learn Code", "hI")))
Output:
"learn code" "hi"

● Character replacement - Characters can be translated using the
chartr(oldchar, newchar, …) function in R, where every instance of an old
character is replaced by the corresponding new character in the specified set of strings.

Example 1:

chartr("a", "A", "An honest man gave that")

Output:
"An honest mAn gAve thAt"
Every instance of ‘a’ is replaced by ‘A’.
Example 2:

chartr("is", "#@", c("This is it", "It is great"))

Output:
"Th#@ #@ #t" "It #@ great"
Each character of the old string is replaced by the character at the corresponding
position of the new string: "i" is replaced by "#" and "s" by "@".
The old string must not be longer than the new string.

Splitting the string


A string can be split into individual substrings using strsplit(), whose second
argument is the separator to split on, here " ". The result is a list with one
element per input string.
Example:

strsplit("Learn Code Teach !", " ")

Output:
[[1]]
[1] "Learn" "Code" "Teach" "!"
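Because strsplit() returns a list, a little extra handling is usually needed to get at the pieces. A short sketch with made-up input strings:

```r
# Hypothetical input: a vector of two sentences
sentences <- c("Learn Code Teach", "R is fun")

parts <- strsplit(sentences, " ")   # a list with one element per sentence
print(parts[[2]])                   # "R" "is" "fun"

# unlist() flattens the list into one character vector
all_words <- unlist(parts)
print(length(all_words))            # 6

# An empty separator splits a string into single characters
chars <- strsplit("Code", "")[[1]]
print(chars)                        # "C" "o" "d" "e"
```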
Working with substrings
substr(…, start, end) or substring(…, start, end) function in R extracts a substring out
of a string, beginning with the start index and ending with the end index (both 1-based and
inclusive). When used on the left-hand side of an assignment, it also replaces the specified
substring with a new set of characters.
Example:

substr("Learn Code Tech", 1, 4)

Output:
"Lear"
Extracts the first four characters from the string.

str <- c("program", "with", "new", "language")


substring(str, 3, 3) <- "%"
print(str)

Output:
"pr%gram" "wi%h" "ne%" "la%guage"
Replaces the third character of every string with a % sign.
str <- c("program", "with", "new", "language")
substr(str, 3, 3) <- c("%", "@")
print(str)

Output:
"pr%gram" "wi@h" "ne%" "la@guage"
Replaces the third character of each string alternately with the specified symbols, since
the shorter replacement vector is recycled.

Regular Expressions -
The concept of regular expressions, usually referred to as regex, exists in many
programming languages, such as R, Python, C, C++, Perl, Java, and JavaScript. You
can access the functionality of regex either in the base version of those languages or
via libraries. For most programming languages, the syntax of regex patterns is similar.
A regular expression, regex, in R is a sequence of characters (or even one character)
that describes a certain pattern found in a text. Regex patterns can be as short as ‘a’ or
run to many lines.

Broadly speaking, the above definition of the regex is related not only to R but also to
any other programming language supporting regular expressions.

Regex represents a very flexible and powerful tool widely used for processing and
mining unstructured text data. For example, they find their application in search
engines, lexical analysis, spam filtering, and text editors.

Tools and Functions to Work with R Regex


While regex patterns are similar for the majority of programming languages, the
functions for working with them are different.

In R, we can use the functions of the base R to detect, match, locate, extract, and
replace regex. Below are the main functions that search for regex matches in a
character vector and then do the following:

● grep(), grepl() – return the indices of strings containing a match (grep()) or a
logical vector showing which strings contain a match (grepl()).
● regexpr(), gregexpr() – return the index for each string where the match begins
and the length of that match. While regexpr() provides this information only for
the first match (from the left), gregexpr() does the same for all the matches.
● sub(), gsub() – replace a detected match in each string with a specified string
(sub() – only for the first match, gsub() – for all the matches).
● regexec() – works like regexpr() but returns the same information also for a
specified sub-expression inside the match.
● regmatches() – extracts the exact matched strings from the match data returned by
regexpr(), gregexpr(), or regexec() (for regexec(), the overall match and each
sub-expression).
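The base R functions listed above can be tried on a small character vector. This is only an illustrative sketch; the vector x is made up for the purpose:

```r
x <- c("unicorn", "horse", "popcorn")

idx   <- grep("corn", x)      # indices of the strings containing a match
found <- grepl("corn", x)     # logical vector, one value per string
print(idx)                    # 1 3
print(found)                  # TRUE FALSE TRUE

# Position of the first "o" in each string
first_o <- as.integer(regexpr("o", x))
print(first_o)                # 5 2 2

# sub() replaces only the first match, gsub() replaces all of them
print(sub("o", "0", x))       # "unic0rn" "h0rse" "p0pcorn"
print(gsub("o", "0", x))      # "unic0rn" "h0rse" "p0pc0rn"

# regmatches() extracts the matched text from regexpr() match data
hits <- regmatches(x, regexpr("corn", x))
print(hits)                   # "corn" "corn"
```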

However, instead of using the native R functions, a more convenient and consistent
way to work with R regex is to use a specialized stringr package of the tidyverse
collection. This library is built on top of the stringi package. In the stringr library, all
the functions start with str_ and have much more intuitive names (as well as the names
of their optional parameters) than those of the base R.

To install and load the stringr package, run the following:

install.packages('stringr')

library(stringr)

The table below shows the correspondence between the stringr functions and those of
the base R that we've discussed earlier in this section:

stringr             Base R
str_subset()        grep()
str_detect()        grepl()
str_extract()       regexpr(), regmatches(), grep()
str_match()         regexec()
str_locate()        regexpr()
str_locate_all()    gregexpr()
str_replace()       sub()
str_replace_all()   gsub()

You can find a full list of the stringr functions and regular expressions in these cheat
sheets, but we'll discuss some of them further in this tutorial.

Note: in the stringr functions, we pass in first the data and then a regex, while in the
base R functions – just the opposite.

R Regex Patterns - The simplest regex pattern is a literal string, which matches itself.
Namely, let's check if a unicorn has at least one corn:

str_detect('unicorn', 'corn')

Output:

TRUE
Character Escapes - There are a few characters that have a special meaning when
used in R regular expressions. More precisely, they don't match themselves, as all
letters and digits do, but they do something different:

str_extract_all('unicorn', '.')

Output:

1. 'u' 'n' 'i' 'c' 'o' 'r' 'n'

We clearly see that there are no dots in our unicorn. However, the str_extract_all()
function extracted every single character from this string. This is the exact mission of
the . character – to match any single character except for a new line.

What if we want to extract a literal dot? For this purpose, we have to use a regex
escape character before the dot – a backslash (\). However, there is a pitfall here to
keep in mind: a backslash is also used in the strings themselves as an escape character.
This means that we first need to "escape the escape character," by using a double
backslash. Let's see how it works:

str_extract_all('Eat. Pray. Love.', '\\.')

Output:
1. '.' '.' '.'

Hence, the backslash helps neglect a special meaning of some symbols in R regular
expressions and interpret them literally. It also has the opposite mission: to give a
special meaning to some characters that otherwise would be interpreted literally.
Below is a table of the most used character escape sequences:

R regex   What matches
\b        A word boundary (a boundary between a \w and a \W)
\B        A non-word boundary (\w-\w or \W-\W)
\n        A new line
\t        A tab
\v        A vertical tab

Let's take a look at some examples keeping in mind that also in such cases, we have to
use a double backslash. At the same time, we'll introduce two more stringr functions:
str_view() and str_view_all() (to view HTML rendering of the first match or all
matches). Note that in stringr 1.5.0 and later, str_view() itself shows all matches and
str_view_all() is deprecated:

str_view('Unicorns are so cute!', 's\\b')

str_view('Unicorns are so cute!', 's\\B')

Output (the highlighted match is marked with angle brackets):

Unicorn<s> are so cute!

Unicorns are <s>o cute!


In the string, Unicorns are so cute! there are two instances of the letter s. Above, the
first R regex pattern highlighted the first instance of the letter s (since it's followed by
a space), while the second one – the second instance (since it's followed by another
letter, not a word boundary).

Character Classes

A character class matches any character of a predefined set of characters. Built-in
character classes have the same syntax as the character escape sequences we saw in
the previous section: a backslash followed by a letter to which it gives a special
meaning rather than its literal one. The most popular of these constructions are given
below:

R regex   What matches
\w        Any word character (any letter, digit, or underscore)
\W        Any non-word character
\d        Any digit
\D        Any non-digit
\s        Any space character (a space, a tab, a new line, etc.)
\S        Any non-space character

Let's take a look at some self-explanatory examples:

str_view_all('Unicorns are so cute!', '\\w')

str_view_all('Unicorns are so cute!', '\\W')

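Output:
The first call highlights every letter of Unicorns are so cute! while the second
highlights only the three spaces and the exclamation mark.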

Built-in character classes can also appear in an alternative form –
[:character_class_name:]. Some of these character classes have an equivalent
among those with a backslash, others don't. The most common ones are:

R regex      What matches
[:alpha:]    Any letter
[:lower:]    Any lowercase letter
[:upper:]    Any uppercase letter
[:digit:]    Any digit (equivalent to \d)
[:alnum:]    Any letter or number
[:xdigit:]   Any hexadecimal digit
[:punct:]    Any punctuation character
[:graph:]    Any letter, number, or punctuation character
[:space:]    A space, a tab, a new line, etc. (equivalent to \s)

Let's explore some examples keeping in mind that we have to put any of the above R
regex patterns inside square brackets:

str_view('Unicorns are so cute!', '[[:upper:]]')

str_view('Unicorns are so cute!', '[[:lower:]]')


Output (the first match is marked with angle brackets):

<U>nicorns are so cute!

U<n>icorns are so cute!

Quantifiers

Often, we need to match a certain R regex pattern repetitively, instead of strictly once.
For this purpose, we use quantifiers. A quantifier always goes after the regex pattern
to which it's related. The most common quantifiers are given in the table below:

R regex   Number of pattern repetitions
*         0 or more
+         at least 1
?         at most 1
{n}       exactly n
{n,}      at least n
{n,m}     at least n and at most m

Let's try one of them:

str_extract('dog', 'dog\\d*')

Output:

'dog'
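The remaining quantifiers from the table can be sketched in the same spirit. The example below uses base R (grepl() and regmatches()) so that it runs without extra packages; the vector x is invented for illustration, and ^ and $ are added so that the whole string must fit the pattern:

```r
x <- c("dog", "dog1", "dog2024")

# * : zero or more digits after "dog"
star <- regmatches(x, regexpr("dog\\d*", x))
print(star)                           # "dog" "dog1" "dog2024"

# + : at least one digit
plus <- grepl("^dog\\d+$", x)         # FALSE TRUE TRUE

# ? : at most one digit
opt <- grepl("^dog\\d?$", x)          # TRUE TRUE FALSE

# {n} : exactly four digits
exact <- grepl("^dog\\d{4}$", x)      # FALSE FALSE TRUE

# {n,} : at least two digits
at_least <- grepl("^dog\\d{2,}$", x)  # FALSE FALSE TRUE

# {n,m} : between one and three digits
between <- grepl("^dog\\d{1,3}$", x)  # FALSE TRUE FALSE
```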

Anchors
By default, R regex will match any part of a provided string. We can change this
behavior by specifying a certain position of an R regex pattern inside the string. Most
often, we may want to impose the match from the start or end of the string. For this
purpose, we use the two main anchors in R regular expressions:

● ^ – matches from the beginning of the string (for multiline strings – the
beginning of each line)
● $ – matches from the end of the string (for multiline strings – the end of each
line)

Let's see how they work on the example of a palindrome stella won no wallets:

str_view('stella won no wallets', '^s')

str_view('stella won no wallets', 's$')

Output (the highlighted match is marked with angle brackets):

<s>tella won no wallets

stella won no wallet<s>

If we want to match the characters ^ or $ themselves, we need to precede the character
of interest with a backslash (doubling it), for example '\\^' or '\\$'.
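A brief base R sketch of anchors and of escaping them. The string price is a made-up example:

```r
x <- "stella won no wallets"

starts_s <- grepl("^s", x)     # TRUE: the string starts with "s"
starts_w <- grepl("^w", x)     # FALSE: "w" occurs, but not at the start
ends_s   <- grepl("s$", x)     # TRUE: the string ends with "s"

# Anchors also restrict where a replacement happens
capitalised <- sub("^s", "S", x)
print(capitalised)             # "Stella won no wallets"

# Escaped anchors match the literal characters
price <- "cost: $5 ^ 2"
has_dollar <- grepl("\\$", price)   # TRUE
has_caret  <- grepl("\\^", price)   # TRUE
```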

Alternation

Applying the alternation operator (|), we can match more than one R regex pattern in
the same string. Note that if we use this operator as a part of a user-defined character
class, it's interpreted literally, hence doesn't perform any alternation.
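As a small illustration of both points, here is a base R sketch with an invented vector x:

```r
x <- c("cat", "dog", "bird", "a|b")

# Alternation: match "cat" or "dog"
either <- grepl("cat|dog", x)
print(either)        # TRUE TRUE FALSE FALSE

# Inside a character class, | is just a literal character
literal_bar <- grepl("[|]", x)
print(literal_bar)   # FALSE FALSE FALSE TRUE
```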

Grouping

R regex patterns follow certain precedence rules. For example, repetition (using
quantifiers) is prioritized over anchoring, while anchoring takes precedence over
alternation. To override these rules and increase the precedence of a certain operation,
we should use grouping. This can be performed by enclosing a subexpression of
interest into round brackets.
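The effect of these precedence rules, and of grouping, can be seen in a short base R sketch (the input strings are made up):

```r
x <- c("cat", "catfish", "hotdog", "dog")

# Without grouping: "starts with cat" OR "ends with dog"
ungrouped <- grepl("^cat|dog$", x)
print(ungrouped)   # TRUE TRUE TRUE TRUE

# With grouping: the whole string must be "cat" or "dog"
grouped <- grepl("^(cat|dog)$", x)
print(grouped)     # TRUE FALSE FALSE TRUE

# A quantifier applies to the last character only, unless a group is used
s <- "ababab"
one_char  <- regmatches(s, regexpr("ab+", s))     # "ab"
group_rep <- regmatches(s, regexpr("(ab)+", s))   # "ababab"
```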

Advanced Applications of R Regular Expressions


There are many more things we can do with this powerful tool. Without getting into
detail, let's just mention some advanced operations that we can perform with R regex:

● Overriding the defaults of the stringr functions
● Matching grapheme clusters
● Group backreferencing
● Matching Unicode properties
● Applying advanced character escaping
● Verifying a pattern's existence without including it in the output (so-called
lookarounds)
● Making the pattern repetition mechanism lazy rather than greedy
● Working with atomic groups

Dates in R - R has developed a special representation for dates and times. Dates are
represented by the Date class, and times are represented by the POSIXct or the POSIXlt
class. Dates are stored internally as the number of days since 1970-01-01 while times
are stored internally as the number of seconds since 1970-01-01.

The as.Date() Function - The most basic function we use while dealing with
dates is the as.Date() function. This function allows us to create a date value (without
time) in R programming. It also accepts various input formats of the date value
through the format = argument.
Remember that this function expects the standard date format “YYYY-MM-DD” by default.
When we don’t have the input value in the standard date format, we can still use the
as.Date() function to create a date value: the format = argument describes the layout of
the input, and the function returns the date in the standard form. The most common
format codes are:
%d - the day of the month in number format
%m - the month in number format
%Y - the year in the “YYYY” format. If we have the year value in two digits, we
will use “%y” instead of “%Y.”
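A minimal sketch of as.Date() with and without the format = argument. The date strings are invented for illustration:

```r
# Standard format: no format argument needed
d1 <- as.Date("2023-03-15")

# Non-standard layouts: describe the input with format codes
d2 <- as.Date("15/03/2023", format = "%d/%m/%Y")
d3 <- as.Date("03-15-23", format = "%m-%d-%y")   # two-digit year

print(d1)   # "2023-03-15"
print(identical(d1, d2) && identical(d1, d3))   # TRUE: all three are the same date

# Dates are stored as the number of days since 1970-01-01
days_since_epoch <- as.numeric(as.Date("1970-01-10"))
print(days_since_epoch)   # 9

# format() turns a date back into text in any layout
as_text <- format(d1, "%d.%m.%Y")
print(as_text)   # "15.03.2023"
```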

Time in R - R has developed a special representation for dates and times. Dates are
represented by the Date class, and times are represented by the POSIXct or the POSIXlt class.
Dates are stored internally as the number of days since 1970-01-01 while times are
stored internally as the number of seconds since 1970-01-01.
Getting the Current Date and Time for System - The as.Date() function takes
a date value as an argument, meaning that we always have to give a date value as
input. What if we want to get the current system date and time? That is
possible in R, and we have two options to get it done.

Using the Sys.Date(), Sys.time() Function - In R programming, the Sys.Date()
function returns the current system date. You don’t need to pass any argument to this
function. There is also a function named Sys.timezone() that returns the name of the
timezone configured on the system on which the code is running. Finally, the Sys.time()
function returns the current date as well as the time of the system, printed with the
timezone details.
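The three functions can be called as follows. The printed values depend on the machine and the moment the code is run, so no output is shown:

```r
today <- Sys.Date()       # current date, class "Date"
now   <- Sys.time()       # current date and time, class "POSIXct"
tz    <- Sys.timezone()   # name of the system timezone (may be NA on some systems)

print(today)
print(now)
print(tz)

print(class(today))   # "Date"
print(class(now))     # "POSIXct" "POSIXt"
```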

Using the lubridate Package - The lubridate package has a function named now() that
gives us the current date, current time, and the current timezone details in a single
call (same as the Sys.time() function).

Extraction and Manipulation of the Parts of the Date - The “lubridate” package also makes
it easier to extract and manipulate parts of a date value. Various functions in the
package allow us to extract the year, month, week, etc. from a date.

Suppose we have a date variable named “x,” which contains three different date values.

1. In the first example, the year() function allows us to extract the year values for
each element of the vector.
2. Similarly, the month() function takes a single date value or a vector that
contains dates as elements and extracts the month from those as numbers.
3. What if we wanted the abbreviated names for each month from dates? We have
to add the “label = TRUE” argument under the month() function and could see
the month names in abbreviated form.
4. If we use the “abbr = FALSE” argument under the month function along with
the “label = TRUE,” we will get the full month names.
5. To extract the days from the given date values, you can use the mday()
function. You will get the days as numbers.
6. The wday() function allows us to get the weekdays in numbers by default.
However, when we use the “label = TRUE” and “abbr = FALSE” as additional
arguments under the function, we will come to know which day of the given
date has which weekday value.
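The calls described in the list above can be sketched as follows. The three dates in x are hypothetical, since the original values are not given. The base R lines at the top are an addition that extracts the same numeric parts without any package, and the lubridate calls run only if that package is installed:

```r
# Hypothetical date vector with three values
x <- as.Date(c("2021-01-15", "2022-06-30", "2023-12-25"))

# Base R counterparts (no package needed)
years      <- as.integer(format(x, "%Y"))   # 2021 2022 2023
months_num <- as.integer(format(x, "%m"))   # 1 6 12
days       <- as.integer(format(x, "%d"))   # 15 30 25

# The lubridate functions from the list above
if (requireNamespace("lubridate", quietly = TRUE)) {
  library(lubridate)
  print(year(x))                                # years as numbers
  print(month(x))                               # months as numbers
  print(month(x, label = TRUE))                 # abbreviated month names
  print(month(x, label = TRUE, abbr = FALSE))   # full month names
  print(mday(x))                                # day of the month
  print(wday(x, label = TRUE, abbr = FALSE))    # full weekday names
}
```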

Note: Every time you open a new RStudio/R session, you need to load the “lubridate”
package before these functions can be used. You can always do this with the
“library(lubridate)” line of code. Installing the package, however, is a one-time
process only.

Conclusion
● Dates are an essential aspect when we are dealing with real-life data problems.
● The as.Date() function allows us to convert the string date values into actual
dates. The standard date format is “YYYY-MM-DD.”
● To get the current system date, we can use the Sys.Date() function.
● Sys.timezone() function allows us to get the timezone of the system.
● The Sys.time() function allows us to get the current system date, and time, with
timezone in a single function.
● The lubridate package has various functions that allow us to work with dates
more efficiently.
Graphics - Graphics are a widely used feature of the R language for creating
various types of graphs and charts for visualization. R supports a rich set of
packages and functions to create graphs from an input data set for data
analytics.
Plot Function - The plot() function is used to draw points (markers) in a diagram. The
function takes parameters for specifying points in the diagram.

Parameter 1 specifies points on the x-axis.

Parameter 2 specifies points on the y-axis.

Plot Labels

The plot() function also accepts other parameters, such as main, xlab, and ylab if you
want to customize the graph with the main title and different labels for the x and
y-axis.
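A minimal sketch of plot() with its two positional parameters and the labelling parameters. The data values and label texts are made up for illustration:

```r
# Hypothetical data: x and y must have the same length
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)

# Parameter 1 gives the x-axis points, parameter 2 the y-axis points
plot(x, y,
     main = "My Graph",      # main title
     xlab = "The x-axis",    # label of the x-axis
     ylab = "The y-axis")    # label of the y-axis
```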
Box plot - A boxplot shows how the values in a data set are distributed. It
divides the data set into four parts using three quartiles. The graph represents the
minimum, maximum, median, first quartile and third quartile of the data set. It is also
useful for comparing the distribution of data across data sets by drawing a boxplot for
each of them.

Boxplots are created in R by using the boxplot() function.

Syntax -
The basic syntax to create a boxplot in R is −

boxplot(x, data, notch, varwidth, names, main)

Following is the description of the parameters used −

● x is a vector or a formula.
● data is the data frame.
● notch is a logical value. Set as TRUE to draw a notch.
● varwidth is a logical value. Set as TRUE to draw the width of the box
proportionate to the sample size.
● names are the group labels which will be printed under each boxplot.
● main is used to give a title to the graph.

Example -
We use the data set "mtcars" available in the R environment to create a basic boxplot.
Let's look at the columns "mpg" and "cyl" in mtcars.

input <- mtcars[,c('mpg','cyl')]
print(head(input))

When we execute the above code, it produces the following result −

                   mpg cyl
Mazda RX4         21.0   6
Mazda RX4 Wag     21.0   6
Datsun 710        22.8   4
Hornet 4 Drive    21.4   6
Hornet Sportabout 18.7   8
Valiant           18.1   6

Creating the Boxplot


The below script will create a boxplot graph for the relation between mpg (miles per
gallon) and cyl (number of cylinders).

# Give the chart file a name.
png(file = "boxplot.png")

# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
   ylab = "Miles Per Gallon", main = "Mileage Data")

# Save the file.
dev.off()

Data Visualization using R Programming - A boxplot shows how the values in a data set
are distributed, dividing it into four parts using three quartiles. The graph represents
the minimum, maximum, median, first quartile and third quartile of the data set.

R – Graph Plotting
The R Programming Language provides some easy and quick tools that let us convert
our data into visually insightful elements like graphs.

Graph plotting in R is of two types:


● One-dimensional Plotting: In one-dimensional plotting, we plot one
variable at a time. For example, we may plot a variable with the number of
times each of its values occurred in the entire dataset (frequency). So, it is
not compared to any other variable of the dataset. These are the 4 major
types of graphs that are used for One-dimensional analysis –
○ Five Point Summary
○ Box Plotting
○ Histograms
○ Bar Plotting
● Two-dimensional Plotting: In two-dimensional plotting, we visualize and
compare one variable with respect to the other. For example, in a dataset of
Air Quality measures, we would like to compare how the AQI varies with
the temperature at a particular place. So, temperature and AQI are two
different variables and we wish to see how one changes with respect to the
other. These are the 3 major kinds of graphs used for such kinds of analysis –

○ Box Plotting
○ Histograms
○ Scatter plots
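As a rough illustration of the two kinds of plotting, the sketch below computes a five point summary of one variable and draws a scatter plot of two variables. It assumes the built-in mtcars data set, and the choice of columns is arbitrary:

```r
# One-dimensional: five point summary of a single variable
mpg  <- mtcars$mpg
five <- fivenum(mpg)   # minimum, lower hinge, median, upper hinge, maximum
print(five)

# Two-dimensional: scatter plot of one variable against another
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles Per Gallon",
     main = "Mileage vs Weight")
```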

One-dimensional Histogram - A histogram is a graphical representation commonly used to
visualize the distribution of numerical data. It divides the values within a numerical
variable into “bins”, and counts the number of observations that fall into each bin.
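A short sketch using the base R hist() function on the built-in mtcars data set. The column, the suggested number of breaks, and the colour are arbitrary choices for illustration:

```r
# hist() draws the histogram and invisibly returns the bin information
h <- hist(mtcars$mpg,
          breaks = 5,                      # suggested number of bins
          main = "Distribution of Mileage",
          xlab = "Miles Per Gallon",
          col = "lightblue")

print(h$breaks)   # bin boundaries
print(h$counts)   # number of observations in each bin
```

Every observation falls into exactly one bin, so the counts add up to the number of rows in mtcars.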

Bar Plotting - R uses the function barplot() to create bar charts. R can draw both
vertical and Horizontal bars in the bar chart. In the bar chart, each of the bars can be
given different colors.
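For instance, the number of cars per cylinder count in the built-in mtcars data set can be drawn as a vertical and as a horizontal bar chart. The colours and titles below are arbitrary:

```r
# Frequency of each cylinder count: 4, 6 and 8 cylinders
counts <- table(mtcars$cyl)
print(counts)

# Vertical bars, one colour per bar
barplot(counts,
        main = "Cars by Number of Cylinders",
        xlab = "Number of Cylinders",
        ylab = "Count",
        col = c("red", "green", "blue"))

# Horizontal bars
barplot(counts,
        horiz = TRUE,
        main = "Cars by Number of Cylinders",
        col = c("red", "green", "blue"))
```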
Legends - Legends are useful to add more information to the plots and enhance user
readability. It involves the creation of titles, indexes, and placement of plot boxes in
order to create a better understanding of the graphs plotted. The in-built R function
legend() can be used to add a legend to the plot.

Syntax: legend(x, y, legend, fill, col, bg, lty, cex, title, text.font)

Parameters:

● x and y: These are co-ordinates to be used to position the legend
● legend: Text of the legend
● fill: Colors to use for filling the boxes beside the legend text
● col: Colors of lines
● bg: It defines the background color for the legend box.
● lty: Line types of the lines shown in the legend
● cex: Character expansion factor, which scales the size of the legend text
● title: Legend title (optional)
● text.font: An integer specifying the font style of the legend (optional)

Returns: The legend, drawn on the current plot

The legend box in the graph can be customized to suit the requirements in order to
convey more information and add aesthetics to the graph at the same time. Given
below are the properties through which a legend can be customized:

● title: The title of the legend box, which states what the legend entries
indicate
● position: Indicator of the placement of the legend box, passed as a keyword in
place of the x and y co-ordinates; the possible options are “bottomright”,
“bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right” and “center”
● bty (Default: o): The type of box to enclose the legend. Different types of
letters can be used, where the box shape is equivalent to the letter shape.
For instance, “n” can be used for no box.
● bg: A background colour can be assigned to the legend box
● box.lwd: Indicator of the line width of the legend box
● box.lty: Indicator of the line type of the legend box
● box.col: Indicator of the line color of the legend box

The text of the legend function can also be customized for better styling using the
following properties:

● text.font: A numerical value which is an indicator of the font style of the
legend text. It has the following values: 1 – normal, 2 – bold, 3 – italic, 4 –
bold and italic
● text.col: Which is used to indicate the color of the text used to write legend
text
● border (Default : black): Indicator of the border color of the boxes inside
the legend box
● fill: colors used for filling boxes
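Several of the properties above can be combined in one call. The following is a sketch only; the plotted curves, colours, and label texts are invented for illustration:

```r
# Hypothetical data: two curves on the same plot
x <- 1:10
plot(x, x^2, type = "l", col = "blue",
     main = "Two Curves", xlab = "x", ylab = "y")
lines(x, 10 * x, col = "red", lty = 2)

# Legend placed by keyword, with a title and customized text and box
info <- legend("topleft",
               legend = c("x squared", "10 times x"),   # legend text
               col = c("blue", "red"),                  # line colours
               lty = c(1, 2),                           # line types
               title = "Curves",
               text.font = 2,                           # bold text
               text.col = "black",
               bg = "lightyellow",                      # background of the box
               box.lwd = 1,
               cex = 0.8)                               # slightly smaller text
```

Replacing bg with bty = "n" would draw the same legend without a surrounding box.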
