Factors in R
Factor variables represent categories or groups in your data. The function factor() can be used to
create a factor variable.
Create a factor
# Create a factor variable
friend_groups <- factor(c(1, 2, 1, 2))
friend_groups
[1] 1 2 1 2
Levels: 1 2
You can access the factor levels with the function levels().
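As a minimal sketch, levels() can be used both to read the levels and to rename them. The labels best_friend and not_best_friend are taken from the output shown later in this section; the renaming step itself is an assumption about how those labels were set.

```r
# Create a factor variable
friend_groups <- factor(c(1, 2, 1, 2))

# Get the levels (returned as a character vector)
levels(friend_groups)
# [1] "1" "2"

# Rename the levels (assumed step; the labels come from later output)
levels(friend_groups) <- c("best_friend", "not_best_friend")
friend_groups
```

Renaming keeps the positions: every 1 becomes best_friend and every 2 becomes not_best_friend.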
Note that R orders factor levels alphabetically by default. If you want a different order, specify
the levels argument of the factor() function.
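The following sketch contrasts the default alphabetical order with an explicit one. The level order not_best_friend, best_friend matches the output printed later in this section; the variable name default_order is only illustrative.

```r
groups <- c("best_friend", "not_best_friend", "best_friend", "not_best_friend")

# Default: levels are sorted alphabetically
default_order <- factor(groups)
levels(default_order)
# [1] "best_friend"     "not_best_friend"

# Explicit order given through the levels argument
friend_groups <- factor(groups, levels = c("not_best_friend", "best_friend"))
levels(friend_groups)
# [1] "not_best_friend" "best_friend"
```

The level order matters for anything that reports results per level, such as summary(), table() or tapply().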
Note that:
The function is.factor() checks whether a variable is a factor. It returns TRUE if the
variable is a factor and FALSE otherwise.
The function as.factor() converts a variable to a factor.
# Check if friend_groups is a factor
is.factor(friend_groups)
[1] TRUE
# Check if "are_married" is a factor
is.factor(are_married)
[1] FALSE
# Convert "are_married" to a factor
as.factor(are_married)
[1] TRUE FALSE TRUE TRUE
Levels: FALSE TRUE
The number of individuals in each level can be computed with the function summary():
summary(friend_groups)
not_best_friend best_friend
2 2
In the following example, I want to compute the mean salary of my friends by group.
The function tapply() can be used to apply a function, here mean(), to each group.
# Salaries of my friends
salaries
Nicolas Thierry Bernard Jerome
2000 1800 2500 3000
# Friend groups
friend_groups
[1] best_friend not_best_friend best_friend not_best_friend
Levels: not_best_friend best_friend
# Compute the mean salaries by groups
mean_salaries <- tapply(salaries, friend_groups, mean)
mean_salaries
not_best_friend best_friend
2400 2250
# Compute the size/length of each group
tapply(salaries, friend_groups, length)
not_best_friend best_friend
2 2
It’s also possible to use the function table() to create a frequency table, also known as a
contingency table, of the counts at each combination of factor levels.
table(friend_groups)
friend_groups
not_best_friend best_friend
2 2
# Cross-tabulation between
# friend_groups and are_married variables
table(friend_groups, are_married)
                 are_married
friend_groups     FALSE TRUE
  not_best_friend     1    1
  best_friend         0    2
Data frames
A data frame is like a matrix but can have columns with different types (numeric, character,
logical). Rows are observations (individuals) and columns are variables.
To check whether an object is a data frame, use the is.data.frame() function. It returns TRUE if
the object is a data frame:
is.data.frame(friends_data)
[1] TRUE
is.data.frame(my_data)
[1] FALSE
The object “friends_data” is a data frame, but the object “my_data” is not. We can convert it to a
data frame using the as.data.frame() function.
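A minimal sketch of the conversion follows. The contents of my_data are not shown in this section, so a small matrix is used as a hypothetical stand-in, and the name my_data2 is illustrative.

```r
# Hypothetical stand-in for my_data: a small numeric matrix
my_data <- matrix(1:6, nrow = 2,
                  dimnames = list(NULL, c("a", "b", "c")))
is.data.frame(my_data)
# [1] FALSE

# Convert the matrix to a data frame
my_data2 <- as.data.frame(my_data)
is.data.frame(my_data2)
# [1] TRUE
```

The column names of the matrix become the variable names of the data frame.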
As described in the matrix section, you can use the function t() to transpose a data frame (the result is a matrix):
t(friends_data)
2. Negative indexing
# Exclude column 1
friends_data[, -1]
age height married
Nicolas 27 180 TRUE
Thierry 25 170 FALSE
Bernard 29 185 TRUE
Jerome 26 169 TRUE
3. Index by characteristics
A logical test such as friends_data$age >= 27 returns one TRUE or FALSE value per row.
TRUE specifies that the row contains a value of age >= 27.
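The logical vector behind this kind of selection can be sketched as follows. The data frame friends_data is rebuilt here from the values printed in this section so that the block runs on its own; its exact construction in the original tutorial is an assumption.

```r
# Rebuild friends_data from the printed output (assumed construction)
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  row.names = c("Nicolas", "Thierry", "Bernard", "Jerome")
)

# One TRUE/FALSE value per row
is_27_or_older <- friends_data$age >= 27
is_27_or_older
# [1]  TRUE FALSE  TRUE FALSE
```

Placing this vector in the row position of the brackets keeps only the rows marked TRUE.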
# Select the rows that meet the condition
friends_data[friends_data$age >= 27, ]
name age height married
Nicolas Nicolas 27 180 TRUE
Bernard Bernard 29 185 TRUE
The R code above tells R to get all rows from friends_data where age >= 27, and then to return
all the columns.
If you don’t want to see all the columns for the selected rows but are just interested in
displaying, for example, friend names and ages for friends with age >= 27, put the column
names in the column position of the brackets.
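One way to write this selection is sketched below. friends_data is rebuilt from the values printed in this section (an assumed construction), and the name selected is illustrative.

```r
# Rebuild friends_data from the printed output (assumed construction)
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  row.names = c("Nicolas", "Thierry", "Bernard", "Jerome")
)

# Rows where age >= 27, columns "name" and "age" only
selected <- friends_data[friends_data$age >= 27, c("name", "age")]
selected
```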
If your selection statement is becoming unwieldy, you can first store the row and column
selections in variables.
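The variables age27 and cols used in the next step could be defined as in this sketch. friends_data is rebuilt from the values printed in this section (an assumed construction).

```r
# Rebuild friends_data from the printed output (assumed construction)
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  row.names = c("Nicolas", "Thierry", "Bernard", "Jerome")
)

# Row selection: logical vector, TRUE where age >= 27
age27 <- friends_data$age >= 27

# Column selection: names of the columns to keep
cols <- c("name", "age")
```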
Then you can select the rows and columns with those variables:
friends_data[age27, cols]
name age
Nicolas Nicolas 27
Bernard Bernard 29
Another option is to use the functions attach() and detach(). The function attach() takes a data
frame and makes its columns accessible by simply giving their names.
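A minimal sketch of attach() and detach() follows. friends_data is rebuilt from the values printed in this section (an assumed construction), and mean_age is an illustrative name.

```r
# Rebuild friends_data from the printed output (assumed construction)
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  row.names = c("Nicolas", "Thierry", "Bernard", "Jerome")
)

# Make the columns accessible by name
attach(friends_data)
mean_age <- mean(age)   # same as mean(friends_data$age)
mean_age
# [1] 26.75

# Remove the data frame from the search path when done
detach(friends_data)
```

An attached column can be masked by an existing variable of the same name, so it is worth detaching as soon as the work is done. The base function with(), as in with(friends_data, mean(age)), gives the same shortcut for a single expression.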
It’s also possible to use the functions cbind() and rbind() to extend a data frame.
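As a sketch, cbind() adds columns and rbind() adds rows. friends_data is rebuilt from the values printed in this section (an assumed construction); the city column and the extra friend "Marc" are made-up values used purely for illustration.

```r
# Rebuild friends_data from the printed output (assumed construction)
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  row.names = c("Nicolas", "Thierry", "Bernard", "Jerome")
)

# Add a column (hypothetical values)
with_city <- cbind(friends_data, city = c("Paris", "Lyon", "Nice", "Lille"))

# Add a row: the new row must have the same column names (hypothetical friend)
new_friend <- data.frame(name = "Marc", age = 30, height = 175,
                         married = FALSE, row.names = "Marc")
with_marc <- rbind(friends_data, new_friend)

dim(with_city)   # 4 rows, 5 columns
dim(with_marc)   # 5 rows, 4 columns
```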