Factor in R

Factors in R are used to represent categorical data. You can think of them as integer vectors in which each integer has an associated label. Using factors with labels is preferable to using plain integer vectors, because labels are self-descriptive. In this lesson you will learn how to create and work with factors in R.
What is a factor in R programming?
A factor in R is a data structure that stores a vector as categorical data. A factor can therefore only take a limited set of distinct values, called levels. Factors are especially useful when working with character columns of data frames, creating bar plots, and computing statistical summaries of categorical variables.
The factor function
The factor function allows you to create factors in R. In the following block we show the arguments of the function with a summarized description.
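As a rough guide, the signature documented for factor in base R is reproduced in the comments below, with a one-line summary of each argument. The vector x and its values are made up for illustration.

```r
# Signature as documented in base R (see ?factor):
# factor(x = character(), levels, labels = levels,
#        exclude = NA, ordered = is.ordered(x), nmax = NA)
#
# x        input vector of data
# levels   optional vector of the values that x may take
# labels   optional names for the levels (same order as levels)
# exclude  values to leave out when forming the levels
# ordered  whether the levels should be treated as ordered
# nmax     upper bound on the number of levels

x <- c("low", "high", "medium", "low")  # illustrative data
f <- factor(x, levels = c("low", "medium", "high"), ordered = TRUE)
f
```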
You can get a more detailed description of the function and its arguments by calling ?factor or help(factor).
Convert character to factor in R
Now we will review an example where the input is a character vector. Suppose, for instance, that you have a vector containing the days of the week on which some event happened. You can convert this character vector to a factor with the factor function.
By default, converting a character vector to factor will order the levels alphabetically.
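A minimal sketch of this conversion, assuming a made-up vector named days (the name and values are illustrative only):

```r
# Hypothetical data: days on which some event happened
days <- c("Friday", "Tuesday", "Thursday", "Monday",
          "Wednesday", "Monday", "Wednesday")

my_factor <- factor(days)
my_factor

# With default arguments the levels are sorted alphabetically
levels(my_factor)
```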
If you want to preserve the order in which the levels appear in the input data, pass the following to the levels argument:
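One common way to do this is to pass unique(x) as the levels, since unique returns the distinct values in order of first appearance. The vector days below is an illustrative assumption.

```r
# Hypothetical data: days on which some event happened
days <- c("Friday", "Tuesday", "Thursday", "Monday",
          "Wednesday", "Monday", "Wednesday")

# unique() keeps the order of first appearance
my_factor <- factor(days, levels = unique(days))
levels(my_factor)
```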
Note that you can retrieve the factor levels as a character vector with the levels function.
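For instance, with a small made-up factor (the values are illustrative), levels returns a plain character vector:

```r
# Hypothetical factor built from a character vector
my_factor <- factor(c("Monday", "Friday", "Monday"))

lv <- levels(my_factor)
lv         # the distinct levels
class(lv)  # a character vector, not a factor
```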
Convert numeric to factor in R
Suppose you have recorded the birth city of six individuals using the following coding:
- 1: Dublin.
- 2: London.
- 3: Sofia.
- 4: Pontevedra.
Hence, you will have something like the following data stored in a numeric vector:
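The original values are not available here, so the following vector is only an illustrative assumption of what such data could look like (six individuals, codes 1 to 4):

```r
# Hypothetical birth-city codes for six individuals
city <- c(3, 2, 1, 4, 3, 2)
city
```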
Now you can call the factor function to convert the data into a factor, so that it is categorized for further analysis.
The output will have the following structure:
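A short sketch, assuming the same made-up vector of city codes as an example:

```r
# Hypothetical birth-city codes for six individuals
city <- c(3, 2, 1, 4, 3, 2)

my_factor <- factor(city)
my_factor
# [1] 3 2 1 4 3 2
# Levels: 1 2 3 4
```

The levels are the distinct codes, stored as character strings and sorted in increasing order.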
Change factor labels of the levels
If the input vector is numeric, as in the previous section, the corresponding label (the city) is not shown. To solve this, you can create the factor with the factor function and pass the corresponding labels to the labels argument, which renames the factor levels.
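As an illustration, assuming the made-up vector of codes used above, the labels are given in the same order as the sorted levels 1, 2, 3, 4:

```r
# Hypothetical birth-city codes for six individuals
city <- c(3, 2, 1, 4, 3, 2)

# Labels are matched to the levels 1, 2, 3, 4 in that order
city_factor <- factor(city,
                      labels = c("Dublin", "London", "Sofia", "Pontevedra"))
city_factor
levels(city_factor)
```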
In the previous code block you can see the final output. As you can observe, the data is now categorized using the cities as labels.
Difference between levels and labels in R
It is common to confuse the labels and levels arguments of the R factor function. Consider the following vector with a single group and create a factor from it with default arguments:
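A minimal example, assuming a made-up vector named gender in which every element belongs to the same group:

```r
# Hypothetical vector with a single group
gender <- c("female", "female", "female")

f <- factor(gender)
f
levels(f)  # only one level: "female"
```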
On the one hand, the labels argument allows you to modify the names of the factor levels, so it is related to the output. Note that the vector passed to the labels argument must have the same length as the number of levels (a single string is also accepted, in which case R appends sequence numbers to it).
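For instance, with the illustrative single-group vector gender (a made-up name), labels only changes how the existing level is displayed:

```r
# Hypothetical vector with a single group
gender <- c("female", "female", "female")

# One level in the input, so one label is supplied
f <- factor(gender, labels = c("f"))
f
levels(f)  # the level is now named "f"
```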
On the other hand, the levels argument is related to the input. This argument specifies which values of the input are treated as levels, and in which order. Moreover, it allows you to add levels that do not appear in the data:
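A sketch of adding an unobserved level, again assuming the made-up vector gender:

```r
# Hypothetical vector with a single group
gender <- c("female", "female", "female")

# "male" never appears in the data but is declared as a valid level
f <- factor(gender, levels = c("male", "female"))
levels(f)
table(f)  # "male" is counted with frequency 0
```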
Note that the levels must include at least every value present in the input vector. Any value that is missing from levels is converted to NA:
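The following sketch shows the pitfall with the illustrative vector gender: the level "f" does not match the value "female" in the data, so every element becomes NA.

```r
# Hypothetical vector with a single group
gender <- c("female", "female", "female")

# "female" is not listed among the levels, so it cannot be matched
f <- factor(gender, levels = c("male", "f"))
f
is.na(f)
```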
Relevel and reorder factor levels
You may be wondering how to change the order of the levels (which can be important, for instance, in some graphical representations). The order of the factor levels can be changed in several ways, described in the following subsections.
Custom order of factor levels
If you want to set a custom order for the levels, create a vector with the desired order and pass it to the levels argument. (Passing it to labels instead would rename the levels rather than reorder them.)
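A minimal sketch, assuming a made-up factor of cities; the chosen order is arbitrary and only serves as an example:

```r
# Hypothetical factor; default levels are alphabetical
city_factor <- factor(c("Sofia", "London", "Dublin",
                        "Pontevedra", "Sofia", "London"))

# Rebuild the factor with the levels in the desired order
custom <- factor(city_factor,
                 levels = c("Sofia", "Pontevedra", "London", "Dublin"))
levels(custom)
```

The data values themselves are unchanged; only the order of the levels differs.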
In addition, you can order the levels of the factor alphabetically by using the sort function:
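One way to do this, shown on a made-up factor whose levels start in order of appearance:

```r
# Hypothetical factor with levels in order of first appearance
x <- c("Sofia", "London", "Dublin", "Pontevedra", "Sofia", "London")
f <- factor(x, levels = unique(x))
levels(f)

# Sort the existing levels and rebuild the factor with them
f_sorted <- factor(f, levels = sort(levels(f)))
levels(f_sorted)
```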
Reorder factor levels
The reorder function is designed to order the levels of a factor based on a statistical measure of another variable. To demonstrate, consider a data frame where each row represents an individual, the ‘city’ column represents the city where they were born, and the ‘salary’ column represents their current annual wage in thousands of dollars.
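The original data is not available here, so the data frame below is an assumption made purely for illustration; the column names follow the description above and the values are invented.

```r
# Hypothetical data: birth city and annual salary (thousands of dollars)
df <- data.frame(
  city   = factor(c("Sofia", "London", "Dublin",
                    "Pontevedra", "Sofia", "London")),
  salary = c(20, 45, 52, 30, 24, 55)
)

df
levels(df$city)  # default alphabetical order
```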
You can reorder the factor levels based, for example, on the mean wage of the individuals in each city, using the reorder function as follows:
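A sketch using invented values (the data frame is an assumption, not the original data). reorder sorts the levels by increasing value of the summary computed by FUN:

```r
# Hypothetical data: birth city and annual salary (thousands of dollars)
df <- data.frame(
  city   = factor(c("Sofia", "London", "Dublin",
                    "Pontevedra", "Sofia", "London")),
  salary = c(20, 45, 52, 30, 24, 55)
)

# Order the city levels by mean salary (increasing)
df$city <- reorder(df$city, df$salary, FUN = mean)

levels(df$city)          # levels sorted by mean salary
attr(df$city, "scores")  # the per-city means used for sorting
```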
Reverse order of levels
Recall that you can use the levels function to obtain the levels of a factor. At this point, the levels of the factor are the following:
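The exact state of the factor at this point depends on the earlier steps, so the example below simply assumes a made-up factor of cities with default (alphabetical) levels:

```r
# Hypothetical factor with default alphabetical levels
city_factor <- factor(c("Sofia", "London", "Dublin",
                        "Pontevedra", "Sofia", "London"))

levels(city_factor)
```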
With this in mind, you can reverse the order of the levels of a factor with the rev function:
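A minimal sketch on a made-up factor of cities: rev reverses the vector of levels, and factor rebuilds the factor with that order.

```r
# Hypothetical factor with default alphabetical levels
city_factor <- factor(c("Sofia", "London", "Dublin",
                        "Pontevedra", "Sofia", "London"))

# Reverse the level order without changing the data
reversed <- factor(city_factor, levels = rev(levels(city_factor)))
levels(reversed)
```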
Relevel function
Moreover, if you want to move just one level and put it first, you can use the relevel function. For example, if you want the level ‘London’ to appear first while keeping the order of the others, you can use:
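For example, on a made-up factor of cities with alphabetical levels, relevel moves the level given in ref to the front:

```r
# Hypothetical factor with default alphabetical levels
city_factor <- factor(c("Sofia", "London", "Dublin",
                        "Pontevedra", "Sofia", "London"))

# Put "London" first; the remaining levels keep their relative order
releveled <- relevel(city_factor, ref = "London")
levels(releveled)
```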
In the following sections we will review how to convert factors to other data types in the most efficient way.
Convert factor in R to numeric
If you have a factor in R that you want to convert to numeric, the most efficient way is illustrated in the following code block, which uses the as.numeric and levels functions to convert the levels to numbers and then indexes them by the factor itself.
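A short sketch of this idiom, assuming a made-up numeric vector stored as a factor:

```r
# Hypothetical numeric data stored as a factor
x <- c(10, 20, 10, 30)
my_factor <- factor(x)

# Convert the (few) levels to numeric, then index them by the factor
recovered <- as.numeric(levels(my_factor))[my_factor]
recovered
```

Only the distinct levels are converted, which is why this tends to be cheaper than converting every element through as.character first.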
If you want to recover the original vector (with the same order), never use as.numeric(my_factor): it returns the internal integer codes of the levels rather than the original values.
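The pitfall can be seen on a small made-up example, where the internal codes differ from the original values:

```r
# Hypothetical numeric data stored as a factor
x <- c(10, 20, 10, 30)
my_factor <- factor(x)

# Returns the position of each value among the levels, not the value itself
codes <- as.numeric(my_factor)
codes
```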
Convert factor to string
You may need to convert a factor to a character string. For that purpose, you can use the as.character function.
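For instance, with a made-up factor of weekdays, as.character returns one string per element, in the original order:

```r
# Hypothetical factor of weekdays
days <- c("Friday", "Tuesday", "Monday", "Friday")
my_factor <- factor(days)

as_text <- as.character(my_factor)
as_text
class(as_text)
```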
Note that if you use the levels function instead, the output is a character vector containing only the distinct levels, in level order (alphabetical by default), as shown in one of the previous sections.
Convert factor to date
Also, if you need to convert your factor to a date, you can use the as.Date function, specifying in the format argument the date format you are working with.
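A minimal sketch, assuming made-up dates stored as a factor in day/month/year form (the values and the format string are illustrative):

```r
# Hypothetical dates stored as a factor, written as day/month/year
date_factor <- factor(c("15/01/2021", "03/02/2021", "15/01/2021"))

# The format string must describe how the dates are written
dates <- as.Date(date_factor, format = "%d/%m/%Y")
dates
class(dates)
```

If the format does not match the way the dates are written, as.Date typically returns NA for the affected elements, so it is worth checking the result.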