R Programming for Data Science

Last Updated : 27 Dec, 2024

R is an open-source programming language used for statistical computing and data analysis. It is an important tool for data science and a popular choice among statisticians and data scientists.

  • R includes powerful tools for creating aesthetic and insightful visualizations.
  • Facilitates data extraction, transformation, and loading, with interfaces for SQL, spreadsheets, and more.
  • Provides essential packages for cleaning and transforming data.
  • Enables the application of ML algorithms to predict future events.
  • Supports analysis of unstructured data through NoSQL database interfaces.

Syntax and Variables in R

In R, we use the <- operator to assign values to variables, though = is also commonly used. You can add comments to your code with the # symbol; everything after # on a line is ignored by R. Commenting your code is good practice because it makes the code easier to understand later.

x <- 5    # Assigns the value 5 to x
y <- 3    # Assigns the value 3 to y
sum_result <- x + y
product_result <- x * y

print(paste('Sum of x and y: ', sum_result))
print(paste('Product of x and y: ', product_result))

Output
[1] "Sum of x and y:  8"
[1] "Product of x and y:  15"

Data Types and Structure in R

In R, data is stored in various structures, such as vectors, matrices, lists, and data frames. Let’s break each one down.

1. Vectors: Vectors are like simple arrays that hold multiple values of the same type. You can create a vector using the c() function:

# Creating Vector in R 
vector <- c(1, 2, 3, 4, 5)  
print(vector)

Output
[1] 1 2 3 4 5

2. Matrices: Matrices are two-dimensional arrays where each element has the same data type. You create a matrix using the matrix() function:

# Creating Matrix in R 
matrix_data <- matrix(1:9, nrow = 3, ncol = 3) 
print(matrix_data)

Output
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

3. Lists: Lists can contain elements of different types, including numbers, strings, vectors, and even other lists. Lists are created using the list() function:

# Creating list in R 
list_data <- list("Red", 20, TRUE, 1:5)
print(list_data)

Output
[[1]]
[1] "Red"

[[2]]
[1] 20

[[3]]
[1] TRUE

[[4]]
[1] 1 2 3 4 5

4. Data Frames: Data frames are the most commonly used data structure in R. They’re like tables, where each column can contain different data types. Use data.frame() to create one:

# Creating DataFrame in R 
data_frame <- data.frame(Name = c("Alice", "Bob"), Age = c(24, 28))
print(data_frame)

Output
   Name Age
1 Alice  24
2   Bob  28

These foundational concepts are a great starting point for your journey into data science. To dive deeper, consider exploring the following tutorial: R Programming Tutorial

Data science work in R relies on several libraries for tasks such as data manipulation, statistical modeling, visualization, and machine learning. The sections below introduce some of the key ones.

Data Manipulation with R Programming

R libraries are effective for data manipulation, enabling analysts to clean, transform, and summarize datasets efficiently.

Using dplyr for Data Manipulation

The dplyr package provides a set of functions that make it easy to manipulate data frames in a clean and readable manner. Some of the key functions in dplyr include:

  • filter(): Filters rows based on conditions.
  • select(): Selects specific columns.
  • mutate(): Adds or modifies columns.
  • arrange(): Orders rows by specified columns.
  • summarize(): Summarizes data by applying functions (e.g., mean, sum).

Let’s apply some of the above functions to a sample dataset:

install.packages("dplyr")
library(dplyr)

data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Age = c(24, 28, 35, 40, 22),
  Salary = c(50000, 60000, 70000, 80000, 45000)
)

# Filters rows based on conditions
filtered_data <- filter(data, Age > 25)
print("Filtered Data (Age > 25):")
print(filtered_data)

# Selects specific columns
selected_data <- select(data, Name, Salary)
print("Selected Data (Name and Salary columns):")
print(selected_data)

Output:

[1] "Filtered Data (Age > 25):"
     Name Age Salary
1     Bob  28  60000
2 Charlie  35  70000
3   David  40  80000

[1] "Selected Data (Name and Salary columns):"
     Name Salary
1   Alice  50000
2     Bob  60000
3 Charlie  70000
4   David  80000
5     Eve  45000
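
The remaining verbs from the list above, mutate(), arrange(), and summarize(), can be sketched on the same sample data. This is a minimal illustration that assumes dplyr is installed; the Bonus column and its 10% rate are made up for the example and are not part of the original dataset.

```r
library(dplyr)

data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Age = c(24, 28, 35, 40, 22),
  Salary = c(50000, 60000, 70000, 80000, 45000)
)

# mutate(): add a hypothetical Bonus column (10% of Salary, an assumed rate)
with_bonus <- mutate(data, Bonus = Salary * 0.10)

# arrange(): order rows by Salary, highest first
by_salary <- arrange(with_bonus, desc(Salary))

# summarize(): collapse the data to a single row of summary values
salary_summary <- summarize(data, Mean_Salary = mean(Salary), Max_Age = max(Age))

print(by_salary)
print(salary_summary)
```

Each verb returns a new data frame and leaves the input unchanged, which is why the results are assigned to new names.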

Data Cleaning and Transformation

Data cleaning involves correcting or removing errors and transforming data into a usable format. Common transformations include adding derived columns with mutate() and renaming columns with rename().

Using the previous dataset, we add a derived salary column and rename two columns:

# Adding a derived column (Salary divided by 12)
data_mutated <- mutate(data, Salary_per_month = Salary / 12)

# Renaming columns
data_renamed <- rename(data_mutated, Employee_Name = Name, Employee_Age = Age)
print("Renamed Data (Name to Employee_Name, Age to Employee_Age):")
print(data_renamed)

Output

[1] "Renamed Data (Name to Employee_Name, Age to Employee_Age):"
  Employee_Name Employee_Age Salary Salary_per_month
1         Alice           24  50000         4166.667
2           Bob           28  60000         5000.000
3       Charlie           35  70000         5833.333
4         David           40  80000         6666.667
5           Eve           22  45000         3750.000

Handling Missing Values

Dealing with missing values is an essential part of data preparation. R provides several functions to identify, handle, and replace missing values in datasets. Key functions include:

  • is.na(): To identify missing values in the data.
  • na.omit(): To remove rows with missing values.
  • ifelse(): To replace missing values with a specific value or calculated result.
  • tidyr::fill(): To fill missing values using the previous or next non-missing value in the column.

data_missing <- data.frame(
  Name = c("Alice", "Bob", "Charlie", NA, "Eve"),
  Age = c(24, 28, 35, NA, 22),
  Salary = c(50000, NA, 70000, 80000, 45000)
)

# Identifying missing values
missing_data <- is.na(data_missing)
print("Identifying Missing Values:")
print(missing_data)

# Fill missing values 
install.packages("tidyr")
library(tidyr)
data_filled <- fill(data_missing, Age, .direction = "down")
print("Data After Filling Missing Values in Age (Downward Direction):")
print(data_filled)

Output:

[1] "Identifying Missing Values:"
      Name   Age Salary
[1,] FALSE FALSE  FALSE
[2,] FALSE FALSE   TRUE
[3,] FALSE FALSE  FALSE
[4,]  TRUE  TRUE  FALSE
[5,] FALSE FALSE  FALSE

[1] "Data After Filling Missing Values in Age (Downward Direction):"
     Name Age Salary
1   Alice  24  50000
2     Bob  28     NA
3 Charlie  35  70000
4    <NA>  35  80000
5     Eve  22  45000
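
The other two functions in the list, na.omit() and ifelse(), can be sketched with base R alone. The example below rebuilds the same data frame so it runs on its own; replacing missing salaries with the mean is just one possible imputation choice, assumed here for illustration.

```r
data_missing <- data.frame(
  Name = c("Alice", "Bob", "Charlie", NA, "Eve"),
  Age = c(24, 28, 35, NA, 22),
  Salary = c(50000, NA, 70000, 80000, 45000)
)

# na.omit(): drop every row that contains at least one NA
complete_rows <- na.omit(data_missing)

# ifelse(): replace missing salaries with the mean of the observed salaries
mean_salary <- mean(data_missing$Salary, na.rm = TRUE)
data_imputed <- data_missing
data_imputed$Salary <- ifelse(is.na(data_imputed$Salary),
                              mean_salary,
                              data_imputed$Salary)

print(complete_rows)
print(data_imputed)
```

Dropping rows is simplest but discards data, while imputation keeps every row at the cost of introducing estimated values.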

Statistical Analysis in R

R provides tools for performing both descriptive and inferential statistical analysis, making it a preferred choice for statisticians and data scientists.

Descriptive Statistics

Descriptive statistics provide a summary of the data’s key characteristics using measures like mean, median, variance, and standard deviation.

  • mean(): Calculates the average of a dataset.
  • median(): Identifies the middle value in a dataset.
  • sd(): Computes the standard deviation.
  • summary(): Provides a summary of key descriptive statistics.

# Define a vector with numeric values
vector <- c(10, 20, 30, 40, 50)

# Calculate the mean of the vector
mean_value <- mean(vector)
# Calculate the median of the vector
median_value <- median(vector) 
# Calculate the sum of the vector
total_sum <- sum(vector)

# Output the results
print(paste("Mean:", mean_value))
print(paste("Median:", median_value))
print(paste("Sum:", total_sum))

Output
[1] "Mean: 30"
[1] "Median: 30"
[1] "Sum: 150"
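
The list above also mentions sd() and summary(), which the example does not call. A minimal sketch on the same five values follows; the variable name values is chosen here only to avoid reusing the name vector.

```r
values <- c(10, 20, 30, 40, 50)

# sd(): sample standard deviation (divides by n - 1)
sd_value <- sd(values)

# var(): sample variance, the square of sd()
var_value <- var(values)

# summary(): minimum, quartiles, median, mean, and maximum in one call
values_summary <- summary(values)

print(sd_value)
print(var_value)
print(values_summary)
```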

Inferential Statistics

Inferential statistics allow you to make predictions or generalizations about a population based on sample data.

1. Hypothesis Testing

Hypothesis Testing evaluates assumptions (hypotheses) about population parameters. In R, common hypothesis tests include:

  • t.test(): Performs t-tests to compare means between two groups.
  • aov(): Conducts Analysis of Variance (ANOVA) to compare means among three or more groups.
  • chisq.test(): Performs Chi-Square tests for independence or goodness of fit.
  • wilcox.test(): A non-parametric test that compares two independent samples (Wilcoxon rank-sum test).
  • ks.test(): The Kolmogorov-Smirnov test compares two distributions to see if they are the same.
  • fisher.test(): Fisher’s exact test is used for small sample sizes in contingency tables.

# T-test to compare means between two groups
group1 <- c(1, 2, 3, 4, 5)
group2 <- c(6, 7, 8, 9, 10)
t_test_result <- t.test(group1, group2)
print("T-test Result:")
print(t_test_result)

# Chi-Square test for independence
data_chisq <- matrix(c(10, 20, 20, 40), nrow = 2, byrow = TRUE)
chisq_result <- chisq.test(data_chisq)
print("Chi-Square Test Result:")
print(chisq_result)

Output:

[1] "T-test Result:"

    Welch Two Sample t-test

data:  group1 and group2
t = -5, df = 8, p-value = 0.001053
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.306004 -2.693996
sample estimates:
mean of x mean of y
        3         8

[1] "Chi-Square Test Result:"

    Pearson's Chi-squared test

data:  data_chisq
X-squared = 0, df = 1, p-value = 1
2. Correlation and Regression Analysis

These techniques explore relationships between variables:

  1. Correlation Analysis: Measures the strength and direction of relationships using cor().
  2. Regression Analysis: Models relationships using lm() (linear regression).

# Correlation Analysis using cor(): Measure the strength and direction of a linear relationship
x <- c(1, 2, 3, 4, 5)
y <- c(5, 4, 3, 2, 1)
correlation_result <- cor(x, y)
print("Correlation Between x and y:")
print(correlation_result)

Output:

[1] "Correlation Between x and y:"
[1] -1
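
Regression analysis with lm() is mentioned above but not demonstrated in this section. The following is a minimal sketch on made-up data; the variables hours and score are hypothetical.

```r
# Hypothetical data: hours studied and exam score (made-up values)
hours <- c(1, 2, 3, 4, 5)
score <- c(52, 55, 61, 64, 68)

# Fit score = intercept + slope * hours by ordinary least squares
model <- lm(score ~ hours)

# coef() returns the estimated intercept and slope
coefficients <- coef(model)
print(coefficients)

# R-squared: share of the variance in score explained by hours
r_squared <- summary(model)$r.squared
print(r_squared)

# Predict the score for a new value of hours
predicted_score <- predict(model, newdata = data.frame(hours = 6))
print(predicted_score)
```

Where cor() gives a single number for the strength of a linear relationship, lm() also estimates how much the response changes per unit of the predictor.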

Machine Learning with R

Machine learning in R enables analysts to build predictive models, perform classification, and uncover patterns in data.

Supervised Learning

1. Linear Regression: Linear regression is used for predicting continuous numeric outcomes based on one or more predictors. In R, a linear regression model is fitted using lm(), and predictions are obtained with predict().

# Sample Dataset 
set.seed(123)
train_data <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  target = rnorm(100, mean = 100, sd = 15)
)

model_lr <- lm(target ~ predictor1 + predictor2, data = train_data)
pred_lr <- predict(model_lr, newdata = train_data)
head(pred_lr)
mse <- mean((train_data$target - pred_lr)^2)
mse

Output:

197.509197666493

2. Logistic Regression: Logistic regression is used for binary classification tasks where the outcome variable is categorical (e.g., 0 or 1). In R, it is performed using the glm() function with family = binomial.

set.seed(123)
train_data_logistic <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  target = sample(0:1, 100, replace = TRUE)
)

# Fit Logistic Regression model
model_logistic <- glm(target ~ predictor1 + predictor2, family = binomial, data = train_data_logistic)
pred_logistic <- predict(model_logistic, newdata = train_data_logistic, type = "response")
pred_logistic_class <- ifelse(pred_logistic > 0.5, 1, 0)  # Convert probabilities to binary predictions

accuracy_logistic <- mean(pred_logistic_class == train_data_logistic$target)
accuracy_logistic

Output:

0.63

3. Decision Trees: Decision trees are used for both classification and regression tasks. In this example, we perform classification using the rpart() function:

install.packages("rpart")
library(rpart)

set.seed(123)
train_data_tree <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  target = sample(0:1, 100, replace = TRUE)
)

# Fit Decision Tree model
model_tree <- rpart(target ~ predictor1 + predictor2, data = train_data_tree, method = "class")
pred_tree <- predict(model_tree, newdata = train_data_tree, type = "class")
accuracy_tree <- mean(pred_tree == train_data_tree$target)
accuracy_tree

Output:

0.72

4. Random Forest: Random forest is an ensemble learning technique for classification and regression. In R, it is available through the randomForest() function from the randomForest package.

install.packages("randomForest")
library(randomForest)

set.seed(123)
train_data_rf <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  target = sample(0:1, 100, replace = TRUE)  
)

train_data_rf$target <- factor(train_data_rf$target, levels = c(0, 1))

# Random Forest model
model_rf <- randomForest(target ~ predictor1 + predictor2, data = train_data_rf)
pred_rf <- predict(model_rf, newdata = train_data_rf)
accuracy_rf <- mean(pred_rf == train_data_rf$target)
print(paste("Random Forest Accuracy: ", accuracy_rf))

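Output:

[1] "Random Forest Accuracy:  1"

Because the accuracy is measured on the same data used for training, and the target here is randomly generated, a value of 1 reflects the model memorizing the training set rather than real predictive performance.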

Unsupervised Learning

Unsupervised learning involves learning patterns in data without labeled outputs. Common techniques include clustering and dimensionality reduction.

1. K-means Clustering: K-means partitions the data into K clusters based on the distance between data points. In R, the kmeans() function is used to perform clustering.

set.seed(123)
data <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  target = sample(0:1, 100, replace = TRUE)  
)

# Perform K-means clustering
model_kmeans <- kmeans(data[, -3], centers = 3)  
cluster_centers <- model_kmeans$centers  
cluster_assignments <- model_kmeans$cluster 
withinss <- model_kmeans$tot.withinss 

print("Cluster Centers:")
print(cluster_centers)

print("Cluster Assignments:")
print(cluster_assignments)

print("Total Within-Cluster Sum of Squares:")
print(withinss)

Output:

[1] "Cluster Centers:"
  predictor1 predictor2
1   62.48318   27.73121
2   51.24186   30.80630
3   41.05266   29.10471

[1] "Cluster Assignments:"
  [1] 3 2 1 2 2 1 2 3 3 2 1 2 2 2 3 1 2 3 1 3 3 2 3 3 3 3 1 2 3 1 2 2 1 1 1 2 1
 [38] 2 2 3 3 2 3 1 1 3 3 3 2 2 2 2 2 1 2 1 3 2 2 2 2 3 3 3 3 2 2 2 1 1 3 3 1 3
 [75] 3 1 2 3 2 2 2 2 3 1 2 2 1 2 2 1 1 2 2 3 1 3 1 1 2 3

[1] "Total Within-Cluster Sum of Squares:"
[1] 3809.048

2. Principal Component Analysis (PCA): PCA transforms the data into a new coordinate system where the axes represent the directions of maximum variance. In R, PCA is performed using the prcomp() function.

set.seed(123)
data_pca <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  predictor3 = rnorm(100, mean = 60, sd = 15)
)

# Perform PCA
pca_result <- prcomp(data_pca, center = TRUE, scale. = TRUE)
summary(pca_result)  

Output

Importance of components:
                          PC1    PC2    PC3
Standard deviation     1.0726 0.9900 0.9324
Proportion of Variance 0.3835 0.3267 0.2898
Cumulative Proportion  0.3835 0.7102 1.0000

Model Evaluation

After building a model, it’s essential to evaluate its performance. We can evaluate models using the following metrics:

1. Classification Evaluation Metrics
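
As a rough sketch of common classification metrics, the example below computes a confusion matrix, accuracy, precision, recall, and F1 score in base R. The label vectors are made up for illustration, and the choice of these four metrics is an assumption rather than a list taken from the article.

```r
# Hypothetical true labels and model predictions (made-up values)
actual    <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)

# Confusion matrix: rows are predictions, columns are true labels
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

tp <- sum(predicted == 1 & actual == 1)  # true positives
tn <- sum(predicted == 0 & actual == 0)  # true negatives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives

accuracy  <- (tp + tn) / length(actual)  # share of correct predictions
precision <- tp / (tp + fp)              # correct share of predicted positives
recall    <- tp / (tp + fn)              # share of actual positives found
f1        <- 2 * precision * recall / (precision + recall)

print(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1))
```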

2. Regression Evaluation Metrics
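
For regression models, typical metrics are based on the differences between observed and predicted values. The sketch below computes MSE, RMSE, MAE, and R-squared in base R on made-up numbers; the selection of metrics is an assumption rather than a list taken from the article.

```r
# Hypothetical observed values and model predictions (made-up values)
actual    <- c(3, 5, 7, 9, 11)
predicted <- c(2, 5, 8, 9, 13)

errors <- actual - predicted

mse  <- mean(errors^2)     # mean squared error
rmse <- sqrt(mse)          # root mean squared error, in the units of the target
mae  <- mean(abs(errors))  # mean absolute error

# R-squared: 1 minus residual sum of squares over total sum of squares
r_squared <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)

print(c(MSE = mse, RMSE = rmse, MAE = mae, R2 = r_squared))
```

To estimate how a model performs on new data, these metrics are best computed on observations that were held out from training.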

Time Series Analysis in R

R provides multiple functions for creating, manipulating and analyzing time series data.

ts() function in R

The ts() function is used to convert a numeric vector into a time series object, where you can specify the start date and the frequency of the data (e.g., monthly, quarterly).
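
A minimal sketch of ts() follows, using made-up monthly sales figures; the start date of January 2022 is an assumption chosen for the example.

```r
# Hypothetical monthly sales for two years (made-up values)
sales <- c(120, 135, 150, 160, 170, 180, 175, 165, 155, 145, 130, 125,
           130, 145, 160, 172, 181, 193, 188, 176, 166, 154, 140, 134)

# start = c(2022, 1) means January 2022; frequency = 12 means monthly data
sales_ts <- ts(sales, start = c(2022, 1), frequency = 12)
print(sales_ts)

# window() extracts a sub-period, here the year 2023
sales_2023 <- window(sales_ts, start = c(2023, 1), end = c(2023, 12))
print(sales_2023)
```

For quarterly data, frequency = 4 would be used instead, with start given as c(year, quarter).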

Decomposition of Time Series

In R, the decompose() function is used for decomposing time series into trend, seasonal, and residual components.

For more advanced decomposition, you can use STL (Seasonal and Trend decomposition using Loess), which can handle seasonal patterns that change over time and is more robust to outliers. It is implemented by the stl() function.
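
Both decomposition functions can be tried on AirPassengers, a monthly dataset that ships with R. This is a sketch only: the multiplicative type and the log transform before stl() are choices assumed here because the seasonal swings in this series grow with its level.

```r
# AirPassengers: monthly airline passenger totals, 1949-1960 (built into R)
y <- AirPassengers

# Classical decomposition into trend, seasonal, and random components
dec <- decompose(y, type = "multiplicative")
print(round(dec$figure, 3))  # the 12 estimated seasonal factors

# STL decomposition on the log scale, so the seasonal swing is roughly additive
stl_fit <- stl(log(y), s.window = "periodic")
print(head(stl_fit$time.series))
```

Calling plot(dec) or plot(stl_fit) draws the components in stacked panels, which is usually the quickest way to inspect a decomposition.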

Time Series Forecasting using R

  • ARIMA Model: The auto.arima() function from the forecast package can automatically select the best ARIMA model for the given time series data based on criteria like AIC (Akaike Information Criterion).
  • SARIMA Model: The auto.arima() function in R can also be used to fit a SARIMA model by automatically selecting the seasonal components.
  • Exponential Smoothing (ETS): Another popular forecasting technique is Exponential Smoothing, which is available in R through the ets() function from the forecast package.
  • Prophet: For handling seasonality and holidays, Facebook’s Prophet model can be used. In R, the model is fitted with prophet() from the prophet package, and forecasts are produced by calling predict() on a data frame created with make_future_dataframe(). It is particularly useful for time series with strong seasonal effects and missing data.
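
The ARIMA workflow described above can be sketched as follows, assuming the forecast package is installed. If it is not, the code falls back to base R's arima() with a fixed seasonal order; that fallback is a stand-in chosen for this example and does not perform automatic model selection.

```r
# AirPassengers: monthly airline passenger totals, 1949-1960 (built into R)
y <- AirPassengers

if (requireNamespace("forecast", quietly = TRUE)) {
  # auto.arima() searches over (seasonal) ARIMA orders using information criteria
  fit <- forecast::auto.arima(y)
  point_forecast <- forecast::forecast(fit, h = 12)$mean
} else {
  # Fallback (assumption): a fixed ARIMA(0,1,1)(0,1,1)[12] fitted with stats::arima()
  fit <- arima(y, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1)))
  point_forecast <- predict(fit, n.ahead = 12)$pred
}

# Point forecasts for the next 12 months
print(point_forecast)
```

With the forecast package, ets(y) can be passed to forecast() in the same way to obtain exponential smoothing forecasts.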

Difference Between R Programming and Python Programming

| Feature | R | Python |
|---|---|---|
| Introduction | R is a language and environment designed for statistical programming, computing, and graphics. | Python is a general-purpose programming language used for data analysis and scientific computing. |
| Objective | Focuses on statistical analysis and data visualization. | Supports a wide range of applications, including GUI development, web development, and embedded systems. |
| Workability | Offers numerous easy-to-use packages for statistical tasks. | Excels in matrix computation, optimization, and general-purpose tasks. |
| Integrated Development Environment (IDE) | Popular IDEs include RStudio, RKWard, and R Commander. | Common IDEs are Spyder, Eclipse+PyDev, Atom, and more. |
| Libraries and Packages | Includes packages like ggplot2 for visualization and caret for machine learning. | Features libraries like Pandas, NumPy, and SciPy for data manipulation and analysis. |
| Scope | Primarily used for complex statistical analysis and data science projects. | Offers a streamlined approach for data science, along with versatility in other domains. |

R is ideal for statistical computing and visualization, while Python provides a more versatile platform for diverse applications, including data science.

Top Companies Using R for Data Science

  • Google: Utilizes R for analytical operations, including the Google Flu Trends project, which analyzes flu-related search trends.
  • Facebook: Leverages R for social network analytics, gaining user insights and analyzing user relationships.
  • IBM: A major investor in R, IBM uses it for developing analytical solutions, including in IBM Watson.
  • Uber: Employs R’s Shiny package for interactive web applications and embedding dynamic visual graphics.

