UNIT - 5
String operations in R -
String manipulation basically refers to the process of handling and analyzing strings.
It involves various operations concerned with the modification and parsing of strings
to use and change their data. R offers a series of in-built functions to manipulate the
contents of a string. In this article, we will study different functions concerned with
the manipulation of strings in R.
Concatenation of Strings
String concatenation is the technique of combining two or more strings. It can be done
in several ways:
● paste() function
Any number of strings can be concatenated together using the paste()
function to form a larger string. The sep argument gives the separator placed
between the individual string elements. The collapse argument, when it is not
NULL, joins the elements of the resulting vector into a single larger string,
using its value as the separator. By default, the value of collapse is NULL.
Syntax:
paste(..., sep=" ", collapse = NULL)
Example:
# R program for string concatenation
# Concatenation using paste() function
str <- paste("Learn", "Code")
print(str)
Output:
"Learn Code"
In case no separator is specified the default separator ” ” is inserted between
individual strings.
Example:
str <- paste(c(1:3), "4", sep = ":")
print(str)
Output:
"1:4" "2:4" "3:4"
Since the objects to be concatenated are of different lengths, the shorter one is
recycled to match the longer one. The first argument is the sequence 1, 2, 3, and
each of its elements is concatenated with the other string “4” using the
separator ‘:’.
Example:
str <- paste(c(1:4), c(5:8), sep = "--")
print(str)
Output:
"1--5" "2--6" "3--7" "4--8"
Since both vectors are of the same length, the corresponding elements of
both are concatenated, that is, the first element of the first vector is
concatenated with the first element of the second vector using the separator '--'.
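The effect of the collapse argument is easier to see next to sep. The following is a minimal sketch; the sample words are made up for illustration.

```r
# Hypothetical sample data for illustration
words <- c("Learn", "to", "Code")

# sep joins the arguments element by element; the result is still a vector
with_sep <- paste(words, "!", sep = "")
print(with_sep)        # "Learn!" "to!" "Code!"

# collapse joins the elements of the result into one single string
collapsed <- paste(words, collapse = " ")
print(collapsed)       # "Learn to Code"

# sep is applied first, then collapse
both <- paste(1:3, c("a", "b", "c"), sep = "-", collapse = "+")
print(both)            # "1-a+2-b+3-c"
```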
● cat() function
Values of different types can be concatenated and printed using the cat() function in R,
where sep specifies the separator placed between the items and file names a file, in case we
wish to write the contents to a file.
Syntax:
cat(..., sep=" ", file)
The output is printed without any quotes and the default separator is a single space.
cat() itself returns NULL invisibly.
Example: cat(c(1:5), file ='sample.txt')
Output:
1 2 3 4 5
The output is written to a text file sample.txt in the current working directory.
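To check what cat() actually writes, the file can be read back. This is a small sketch that uses a temporary file (an assumption made here so the working directory is left untouched) instead of sample.txt.

```r
# Temporary file used only for this illustration
out_file <- tempfile(fileext = ".txt")

# Write the numbers 1 to 5, separated by the default single space
cat(1:5, file = out_file)
# Append a newline so the file ends with a complete line
cat("\n", file = out_file, append = TRUE)

written <- readLines(out_file)
print(written)   # "1 2 3 4 5"

# On the console, cat() prints without quotes and without the [1] prefix
cat("Learn", "Code", "\n")
```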
Calculating Length of strings
● length() function
The length() function determines the number of strings specified in the
function.
Example:
● # R program to calculate length
● print (length(c("Learn to", "Code")))
Output:
2
There are two strings specified in the function.
● nchar() function
nchar() counts the number of characters in each of the strings specified as
arguments to the function individually.
Example: print (nchar(c("Learn", "Code")))
Output:
5 4
The output gives the length of Learn and then of Code, separated by a space.
Example 1:
Output:
"An honest mAn gAve thAt"
Every instance of ‘a’ is replaced by ‘A’.
Example 2:
Output:
"Th#@ #@ #t" "It #@ great"
Every instance of an old string is replaced by the new specified string. “i” is replaced
by “#” by “s” by “@”, that is the corresponding positions of the old string is replaced
by new string.
The old string must not be longer than the new string.
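The calls behind the two outputs above are not shown. Assuming the function being described is base R's chartr(old, new, x), which translates characters position by position, a sketch that reproduces both outputs could look like this; the input strings are inferred from the outputs.

```r
# Example 1: every 'a' becomes 'A' (input string inferred from the output)
res1 <- chartr("a", "A", "An honest man gave that")
print(res1)   # "An honest mAn gAve thAt"

# Example 2: 'i' -> '#' and 's' -> '@', applied to each element of the vector
res2 <- chartr("is", "#@", c("This is it", "It is great"))
print(res2)   # "Th#@ #@ #t" "It #@ great"
```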
Output:
[1] "Learn" "Code" "Teach" "!"
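The code for this output is not shown. One call that yields these four pieces, assuming the intent is to split a sentence on spaces with base R's strsplit(), is sketched below; the input string is an assumption.

```r
# Assumed input; the original string is not given in the text
sentence <- "Learn Code Teach !"

# strsplit() returns a list with one element per input string
pieces <- strsplit(sentence, " ")
print(pieces[[1]])   # "Learn" "Code" "Teach" "!"
```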
Working with substrings
The substr(x, start, stop) or substring(text, first, last) function in R extracts substrings out
of a string beginning with the start index and ending with the end index. Used on the
left-hand side of an assignment, it also replaces the specified substring with a new set of
characters.
Example:
Output:
"Lear"
Extracts the first four characters from the string.
Output:
"pr%gram" "wi%h" "ne%" "la%guage"
Replaces the third character of every string with a % sign.
str <- c("program", "with", "new", "language")
substr(str, 3, 3) <- c("%", "@")
print(str)
Output:
"pr%gram" "wi@h" "ne%" "la@guage"
Replaces the third character of each string alternately with the two specified symbols,
because the replacement vector is recycled across the strings.
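The first two substring examples above appear without their code. A sketch that reproduces those outputs with substr() follows; the string "Learn Code" is an assumed input chosen to match the output "Lear".

```r
# Extraction: characters 1 to 4 (input string is an assumption)
s <- "Learn Code"
first_four <- substr(s, 1, 4)
print(first_four)   # "Lear"

# Replacement: overwrite the third character of every string with "%"
words <- c("program", "with", "new", "language")
substr(words, 3, 3) <- "%"
print(words)        # "pr%gram" "wi%h" "ne%" "la%guage"
```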
Regular Expressions -
The concept of regular expressions, usually referred to as regex, exists in many
programming languages, such as R, Python, C, C++, Perl, Java, and JavaScript. You
can access the functionality of regex either in the base version of those languages or
via libraries. For most programming languages, the syntax of regex patterns is similar.
A regular expression, regex, in R is a sequence of characters (or even one character)
that describes a certain pattern found in a text. Regex patterns can be as short as ‘a’ or
as long as the one mentioned in this StackOverflow thread.
Broadly speaking, the above definition of the regex is related not only to R but also to
any other programming language supporting regular expressions.
Regex represents a very flexible and powerful tool widely used for processing and
mining unstructured text data. For example, they find their application in search
engines, lexical analysis, spam filtering, and text editors.
In R, we can use the functions of the base R to detect, match, locate, extract, and
replace regex. Below are the main functions that search for regex matches in a
character vector and then do the following:
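The list of functions is not reproduced here. As a rough illustration, the sketch below runs the base R functions that are named in the correspondence table later in this section on a made-up character vector.

```r
# Hypothetical sample vector
x <- c("unicorn", "popcorn", "rainbow")

detected  <- grepl("corn", x)               # does each element match?
indices   <- grep("corn", x)                # indices of matching elements
matching  <- grep("corn", x, value = TRUE)  # the matching elements themselves
positions <- as.integer(regexpr("corn", x)) # start of first match, -1 if none
first_rep <- sub("o", "0", x)               # replace the first match only
all_rep   <- gsub("o", "0", x)              # replace every match

print(detected)    # TRUE TRUE FALSE
print(indices)     # 1 2
print(matching)    # "unicorn" "popcorn"
print(positions)   # 4 4 -1
print(first_rep)   # "unic0rn" "p0pcorn" "rainb0w"
print(all_rep)     # "unic0rn" "p0pc0rn" "rainb0w"
```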
However, instead of using the native R functions, a more convenient and consistent
way to work with R regex is to use a specialized stringr package of the tidyverse
collection. This library is built on top of the stringi package. In the stringr library, all
the functions start with str_ and have much more intuitive names (as well as the names
of their optional parameters) than those of the base R.
install.packages('stringr')
library(stringr)
The table below shows the correspondence between the stringr functions and those of
the base R that we've discussed earlier in this section:
stringr              Base R
str_subset()         grep()
str_detect()         grepl()
str_match()          regexec()
str_locate()         regexpr()
str_locate_all()     gregexpr()
str_replace()        sub()
str_replace_all()    gsub()
You can find a full list of the stringr functions and regular expressions in these cheat
sheets, but we'll discuss some of them further in this tutorial.
Note: in the stringr functions, we pass in first the data and then a regex, while in the
base R functions – just the opposite.
R Regex Patterns - Namely, let's check if a unicorn has at least one corn:
str_detect('unicorn', 'corn')
Output:
TRUE
Character Escapes - There are a few characters that have a special meaning when
used in R regular expressions. More precisely, they don't match themselves, as all
letters and digits do, but they do something different:
str_extract_all('unicorn', '.')
Output:
We clearly see that there are no dots in our unicorn. However, the str_extract_all()
function extracted every single character from this string. This is the exact mission of
the . character: to match any single character except for a new line.
What if we want to extract a literal dot? For this purpose, we have to use a regex
escape character before the dot – a backslash (\). However, there is a pitfall here to
keep in mind: a backslash is also used in the strings themselves as an escape character.
This means that we first need to "escape the escape character," by using a double
backslash. Let's see how it works:
Output:
1. '.' '.' '.'
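The call that produced this output is not shown. The sketch below illustrates the same idea with base R (regmatches() with gregexpr(), used here so it runs without extra packages; in stringr the counterpart would be str_extract_all()). The input sentence is a made-up example containing three dots.

```r
# Hypothetical input with three literal dots
s <- "Eat. Pray. Love."

# "\\." in R source is the regex \. which matches a literal dot
dots <- regmatches(s, gregexpr("\\.", s))[[1]]
print(dots)       # "." "." "."

# An unescaped dot matches every single character instead
any_char <- regmatches("unicorn", gregexpr(".", "unicorn"))[[1]]
print(any_char)   # "u" "n" "i" "c" "o" "r" "n"
```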
Hence, the backslash removes the special meaning of some symbols in R regular
expressions so that they are interpreted literally. It also has the opposite mission: to give a
special meaning to some characters that otherwise would be interpreted literally.
Below is a table of the most used character escape sequences:
\n A new line
\t A tab
\v A vertical tab
Let's take a look at some examples keeping in mind that also in such cases, we have to
use a double backslash. At the same time, we'll introduce two more stringr functions:
str_view() and str_view_all() (to view HTML rendering of the first match or all
matches):
Output:
Character Classes
\d Any digit
\D Any non-digit
Output:
Unicorns are so cute!
Let's explore some examples keeping in mind that we have to put any of the above R
regex patterns inside square brackets:
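The bracketed patterns referred to here are not shown. Assuming they are the POSIX classes such as [:digit:], [:upper:] and [:punct:], the following base R sketch shows them next to \d and \D on a made-up sentence.

```r
# Hypothetical sample string
s <- "Unicorns are so cute! 123"

# \d matches any digit, \D any non-digit (double backslash in R source)
digits      <- regmatches(s, gregexpr("\\d", s))[[1]]
only_digits <- gsub("\\D", "", s)

# POSIX classes must sit inside an extra pair of square brackets
upper_case <- regmatches(s, gregexpr("[[:upper:]]", s))[[1]]
no_punct   <- gsub("[[:punct:]]", "", s)

print(digits)        # "1" "2" "3"
print(only_digits)   # "123"
print(upper_case)    # "U"
print(no_punct)      # "Unicorns are so cute 123"
```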
Quantifiers
Often, we need to match a certain R regex pattern repetitively, instead of strictly once.
For this purpose, we use quantifiers. A quantifier always goes after the regex pattern
to which it's related. The most common quantifiers are given in the table below:
* 0 or more
+ at least 1
? at most 1
{n} exactly n
{n,} at least n
str_extract('dog', 'dog\\d*')
Output:
'dog'
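To compare the quantifiers from the table side by side, here is a small base R sketch on a made-up vector. regmatches() with regexpr() returns the first match of each element and drops elements without a match.

```r
# Hypothetical sample vector
x <- c("dog", "dog1", "dog123")

zero_or_more <- regmatches(x, regexpr("dog\\d*", x))     # *  : 0 or more digits
at_least_one <- regmatches(x, regexpr("dog\\d+", x))     # +  : at least 1 digit
at_most_one  <- regmatches(x, regexpr("dog\\d?", x))     # ?  : at most 1 digit
exactly_two  <- regmatches(x, regexpr("dog\\d{2}", x))   # {2}: exactly 2 digits
two_or_more  <- regmatches(x, regexpr("dog\\d{2,}", x))  # {2,}: at least 2 digits

print(zero_or_more)   # "dog" "dog1" "dog123"
print(at_least_one)   # "dog1" "dog123"
print(at_most_one)    # "dog" "dog1" "dog1"
print(exactly_two)    # "dog12"
print(two_or_more)    # "dog123"
```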
Anchors
By default, R regex will match any part of a provided string. We can change this
behavior by specifying a certain position of an R regex pattern inside the string. Most
often, we may want to impose the match from the start or end of the string. For this
purpose, we use the two main anchors in R regular expressions:
● ^ – matches from the beginning of the string (for multiline strings – the
beginning of each line)
● $ – matches from the end of the string (for multiline strings – the end of each
line)
Let's see how they work on the example of a palindrome stella won no wallets:
Output:
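A minimal base R sketch of the two anchors on that palindrome (grepl() and sub() are used here in place of the stringr calls so that no extra package is needed):

```r
s <- "stella won no wallets"

starts_with_s <- grepl("^s", s)   # TRUE: the string begins with "s"
ends_with_s   <- grepl("s$", s)   # TRUE: the string ends with "s"
starts_with_w <- grepl("^w", s)   # FALSE: "w" occurs, but not at the start

# Anchors also control which occurrence gets replaced
first_upper <- sub("^s", "S", s)
last_upper  <- sub("s$", "S", s)

print(first_upper)   # "Stella won no wallets"
print(last_upper)    # "stella won no walletS"
```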
Alternation
Applying the alternation operator (|), we can match more than one R regex pattern in
the same string. Note that if we use this operator as a part of a user-defined character
class, it's interpreted literally, hence doesn't perform any alternation.
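Both behaviours can be checked with a short base R sketch; the sample vectors are made up for illustration.

```r
# Hypothetical sample vector
x <- c("unicorn", "rainbow", "pony")

# Alternation: match either "corn" or "bow"
either <- grepl("corn|bow", x)
print(either)         # TRUE TRUE FALSE

# Inside a character class, | is just a literal character
literal_bar <- grepl("[a|b]", c("a", "|", "c"))
print(literal_bar)    # TRUE TRUE FALSE
```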
Grouping
R regex patterns follow certain precedence rules. For example, repetition (using
quantifiers) is prioritized over anchoring, while anchoring takes precedence over
alternation. To override these rules and increase the precedence of a certain operation,
we should use grouping. This can be performed by enclosing a subexpression of
interest into round brackets.
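The precedence rules can be observed directly. The following base R sketch, on made-up inputs, contrasts each pattern with its grouped version.

```r
# Without grouping, + applies only to the preceding "b"
no_group_rep <- grepl("^ab+$", c("abb", "abab"))
# With grouping, + applies to the whole "ab"
group_rep    <- grepl("^(ab)+$", c("abb", "abab"))

print(no_group_rep)   # TRUE FALSE
print(group_rep)      # FALSE TRUE

# Without grouping, "^a|b$" means: starts with "a" OR ends with "b"
no_group_alt <- grepl("^a|b$", c("ax", "xb", "xax"))
# With grouping, the anchors apply to both alternatives
group_alt    <- grepl("^(a|b)$", c("ax", "xb", "a"))

print(no_group_alt)   # TRUE TRUE FALSE
print(group_alt)      # FALSE FALSE TRUE
```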
Dates in R - R has developed a special representation for dates and times. Dates are
represented by the Date class and times are represented by the POSIXct or the POSIXlt
class. Dates are stored internally as the number of days since 1970-01-01 while times
are stored internally as the number of seconds since 1970-01-01.
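The internal representation can be inspected by converting the objects to plain numbers. A short sketch, with the time zone fixed to UTC so that the result does not depend on the local settings:

```r
# A date nine days after the origin
d <- as.Date("1970-01-10")
days_since_origin <- as.numeric(d)
print(days_since_origin)      # 9

# A time one minute after the origin, in UTC
t <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
seconds_since_origin <- as.numeric(t)
print(seconds_since_origin)   # 60

# POSIXlt stores the same instant as a list of components
lt <- as.POSIXlt(t)
print(lt$min)                 # 1
```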
The as.Date() Function - The most basic function we use while dealing with
dates is the as.Date() function. This function allows us to create a date value (without
time) in R programming. It also accepts various input formats of the date value
through the format = argument.
Remember that this function expects the standard date format “YYYY-MM-DD.”
When we don’t have the input value in the standard date format, we can still use the
as.Date() function to create a date value. The format = argument tells the function
how the input is laid out, so that it can parse the value and present it to
us in the standard form. The most common format codes are:
● %d - the day of the month as a number
● %m - the month as a number
● %Y - the year in the “YYYY” format. If we have the year value in two digits, we
will use “%y” instead of “%Y.”
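As an illustration of the format codes, the sketch below parses the same (made-up) date written in three different ways.

```r
# Standard format: no format argument needed
d1 <- as.Date("2021-03-15")

# Day/month/four-digit year
d2 <- as.Date("15/03/2021", format = "%d/%m/%Y")

# Day-month-two-digit year, hence %y instead of %Y
d3 <- as.Date("15-03-21", format = "%d-%m-%y")

print(d1)   # "2021-03-15"
print(d2)   # "2021-03-15"
print(d3)   # "2021-03-15"

# format() goes the other way: from a Date to a string
as_text <- format(d1, "%d/%m/%Y")
print(as_text)   # "15/03/2021"
```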
Time in R - R has developed a special representation for dates and times. Dates are
represented by the Date class and times are represented by the POSIXct or the POSIXlt class.
Dates are stored internally as the number of days since 1970-01-01 while times are
stored internally as the number of seconds since 1970-01-01.
Getting the Current Date and Time for System - Well, the as.Date() function takes
a date value as an argument. Meaning, we always have to give a date value as an
input. What if we wanted to get the current system date and time? Well, that is
possible in R. We have two options to get that done.
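The two options are not listed here. Going by the functions named in the Conclusion, they are presumably Sys.Date() and Sys.time(); a minimal sketch of both, together with Sys.timezone():

```r
today     <- Sys.Date()      # current date, class "Date"
right_now <- Sys.time()      # current date and time, class "POSIXct"
zone      <- Sys.timezone()  # name of the system time zone

print(today)
print(right_now)
print(zone)
```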
Using the lubridate Package - Well, there is a package named lubridate, which has a
function named now() that can give us the current date, current time, and the current
timezone details in a single call (same as the Sys.time() function).
Extraction and Manipulation of the Parts of the Date - Since you have already
seen how the “lubridate” package works, it becomes easier to use it for extraction and
manipulation of some parts of the date value. There are various functions under the
package that allow us to extract the year, month, week, etc. from the date.
We have created a date variable named “x,” which contains three different date values.
1. In the first example, the year() function allows us to extract the year values for
each element of the vector.
2. Similarly, the month() function takes a single date value or a vector that
contains dates as elements and extracts the month from those as numbers.
3. What if we wanted the abbreviated names for each month from dates? We have
to add the “label = TRUE” argument under the month() function and could see
the month names in abbreviated form.
4. If we use the “abbr = FALSE” argument under the month function along with
the “label = TRUE,” we will get the full month names.
5. To extract the days from the given date values, you can use the mday()
function. You will get the days as numbers.
6. The wday() function allows us to get the weekdays in numbers by default.
However, when we use the “label = TRUE” and “abbr = FALSE” as additional
arguments under the function, we will come to know which day of the given
date has which weekday value.
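The six steps above can be sketched as follows. The three dates in x are made up, since the original vector is not shown, and the lubridate package is assumed to be installed. The label outputs depend on the system locale, so they are not listed.

```r
library(lubridate)  # install once with install.packages("lubridate")

# Hypothetical dates; the original values of x are not given in the text
x <- as.Date(c("2020-01-15", "2021-06-30", "2022-12-25"))

years       <- year(x)                              # 1. 2020 2021 2022
months_num  <- month(x)                             # 2. 1 6 12
months_abbr <- month(x, label = TRUE)               # 3. abbreviated month names
months_full <- month(x, label = TRUE, abbr = FALSE) # 4. full month names
days        <- mday(x)                              # 5. 15 30 25
wdays_num   <- wday(x)                              # 6. 1 = Sunday by default
wdays_full  <- wday(x, label = TRUE, abbr = FALSE)  #    full weekday names

print(years)
print(months_num)
print(months_abbr)
print(months_full)
print(days)
print(wdays_num)
print(wdays_full)

# now() returns the current date, time and time zone in one call
print(now())
```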
Note: Every time you open a new session
of RStudio/R, you need to load the “lubridate” package to
let these functions do the work for you. You can always use “library(lubridate)” as
the piece of code that loads the lubridate package. However, installing the
Conclusion
● Dates are an essential aspect when we are dealing with real-life data problems.
● The as.Date() function allows us to convert the string date values into actual
dates. The standard date format is “YYYY-MM-DD.”
● To get the current system date, we can use the Sys.Date() function.
● Sys.timezone() function allows us to get the timezone of the system.
● The Sys.time() function allows us to get the current system date, and time, with
timezone in a single function.
● The lubridate package has various functions that allow us to work with dates
more efficiently.
Graphics - Graphs in the R language are a preferred feature that is used to create
various types of graphs and charts for visualizations. R language supports a rich set of
packages and functionalities to create graphs using the input data set for data
analytics.
Plot Function - The plot() function is used to draw points (markers) in a diagram. The
function takes parameters for specifying points in the diagram.
Plot Labels
The plot() function also accepts other parameters, such as main, xlab, and ylab if you
want to customize the graph with the main title and different labels for the x and
y-axis:
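A minimal sketch of plot() with and without labels; the data and the label texts are made up for illustration.

```r
# Hypothetical data
x <- 1:10
y <- x^2

# Draw the points with default axis labels and no title
plot(x, y)

# Same points with a main title and custom axis labels
plot(x, y,
     main = "Squares of the first ten integers",
     xlab = "x",
     ylab = "x squared")
```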
Box plot - Boxplots show how the data in a data set is distributed. The three
quartiles divide the data set into four parts. This graph represents the minimum,
maximum, median, first quartile and third quartile in the data set. It is also useful in
comparing the distribution of data across data sets by drawing boxplots for each of
them.
Syntax -
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
● x is a vector or a formula.
● data is the data frame.
● notch is a logical value. Set as TRUE to draw a notch.
● varwidth is a logical value. Set as TRUE to draw the width of the box
proportionate to the sample size.
● names are the group labels which will be printed under each boxplot.
● main is used to give a title to the graph.
Example -
We use the data set "mtcars" available in the R environment to create a basic boxplot.
Let's look at the columns "mpg" and "cyl" in mtcars.
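# Keep only the two columns of interest
input <- mtcars[, c("mpg", "cyl")]
print(head(input))
Output:
                   mpg cyl
Mazda RX4         21.0   6
Mazda RX4 Wag     21.0   6
Datsun 710        22.8   4
Hornet 4 Drive    21.4   6
Hornet Sportabout 18.7   8
Valiant           18.1   6
# Give the chart file a name
png(file = "boxplot.png")
# Plot mileage grouped by number of cylinders
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon", main = "Mileage Data")
# Save the file
dev.off()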
R – Graph Plotting
The R Programming Language provides some easy and quick tools that let us convert
our data into visually insightful elements like graphs.
Bar Plotting - R uses the function barplot() to create bar charts. R can draw both
vertical and Horizontal bars in the bar chart. In the bar chart, each of the bars can be
given different colors.
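A short sketch of barplot() covering the three points above: vertical bars, horizontal bars and per-bar colors. The values and labels are made up.

```r
# Hypothetical data
sales    <- c(12, 7, 15, 9)
quarters <- c("Q1", "Q2", "Q3", "Q4")

# Vertical bars, one color per bar; barplot() returns the bar midpoints
mids <- barplot(sales,
                names.arg = quarters,
                col = c("steelblue", "tomato", "gold", "seagreen"),
                main = "Sales per quarter",
                xlab = "Quarter",
                ylab = "Sales")

# Horizontal bars: same data with horiz = TRUE
barplot(sales, names.arg = quarters, horiz = TRUE, col = "gray60")
```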
Legends - Legends are useful to add more information to the plots and enhance user
readability. It involves the creation of titles, indexes, and placement of plot boxes in
order to create a better understanding of the graphs plotted. The in-built R function
legend() can be used to add a legend to the plot.
Syntax: legend(x, y, legend, fill, col, bg, lty, cex, title, text.font)
Parameters:
The legend box in the graph can be customized to suit the requirements in order to
convey more information and add aesthetics to the graph at the same time. Given
below are the properties of legends by which they can be customized:
● title: The title of the legend box that can be declared to understand what the
index indicates
● position: Indicator of the placement of the legend box ; which can have the
possible options : “bottomright”, “bottom”, “bottomleft”, “left”,
“topleft”, “top”, “topright”, “right” and “center”
● bty (Default: o): The type of box to enclose the legend. Different types of
letters can be used, where the box shape is equivalent to the letter shape.
For instance, “n” can be used for no box.
● bg: A background colour can be assigned to the legend box
● box.lwd: Indicator of the line width of the legend box
● box.lty: Indicator of the line type of the legend box
● box.col: Indicator of the line color of the legend box
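The box properties listed above can be combined in a single legend() call. The following is a sketch on made-up curves; the colors and texts are arbitrary choices.

```r
# Hypothetical data: two curves on the same axes
x <- 1:10
plot(x, x^2, type = "l", col = "blue", main = "Two curves", ylab = "y")
lines(x, 10 * x, col = "red", lty = 2)

# Position given by keyword; box drawn ("o") with custom background and border
lg <- legend("topleft",
             legend  = c("x squared", "10 times x"),
             col     = c("blue", "red"),
             lty     = c(1, 2),
             title   = "Curves",
             bty     = "o",
             bg      = "lightyellow",
             box.lwd = 2,
             box.lty = 1,
             box.col = "gray40")
```

Setting bty = "n" instead would suppress the box, in which case the bg and box.* settings have no visible effect.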
The text of the legend function can also be customized for better styling using the
following properties: