By John M Quick

The R Tutorial Series provides a collection of user-friendly tutorials to people who want to learn how to use R for statistical analysis.



R Tutorial Series: Regression With Categorical Variables

Categorical predictors can be incorporated into regression analysis, provided that they are properly prepared and interpreted. This tutorial will explore how categorical variables can be handled in R.

Tutorial Files

Before we begin, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. Note that all code samples in this tutorial assume that this data has already been read into an R variable and has been attached. This dataset contains variables for the following information related to NFL quarterback and team salaries in 1991.
  • TEAM: Name of team
  • QB: Starting quarterback salary in thousands of dollars
  • TOTAL: Total team salary in thousands of dollars
  • CONF: Conference (NFC or AFC)
In this dataset, the CONF variable is categorical. It can take on one of two values, either NFC or AFC. Suppose for the purposes of this tutorial that our research question is "how well do quarterback salary and conference predict total team salary?" The model that we use to answer this question will need to incorporate the categorical predictor for conference.

Dummy Coding

To perform regression with a categorical variable, it must first be coded numerically. Here, I will use the as.numeric(VAR) function, where VAR is the categorical variable, to dummy code the CONF predictor. As a result, CONF will represent NFC as 1 and AFC as 0. The sample code below demonstrates this process.
  > #represent a categorical variable numerically using as.numeric(VAR)
  > #dummy code the CONF variable into NFC = 1 and AFC = 0
  > dCONF <- as.numeric(CONF) - 1
Note that the -1 after as.numeric(CONF) causes the values to read 1 and 0 rather than 2 and 1, which is the default behavior. R numbers factor levels in alphabetical order, so AFC would otherwise be 1 and NFC would be 2. Also note that as.numeric() returns these codes only when CONF is stored as a factor. In R 4.0 and later, read.csv() keeps text columns as character strings by default, so convert the variable with factor(CONF) first if needed.
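As a minimal sketch of the coding step, the following uses a small made-up vector (not the tutorial dataset) to show how the factor codes arise. The names conf, conf_factor, d_conf, and d_conf_explicit are hypothetical and chosen only for illustration.

```r
# Hypothetical toy values, not the tutorial's dataset
conf <- c("NFC", "AFC", "AFC", "NFC")

# as.numeric() on a character vector yields NA, so convert to a factor first.
# Factor levels are sorted alphabetically by default: AFC = 1, NFC = 2.
conf_factor <- factor(conf)
levels(conf_factor)

# Subtracting 1 shifts the integer codes from 1/2 to 0/1
d_conf <- as.numeric(conf_factor) - 1

# An equivalent, more explicit coding that does not depend on level order
d_conf_explicit <- as.numeric(conf == "NFC")

# Both approaches give the same 0/1 vector
identical(d_conf, d_conf_explicit)
```

The explicit comparison can be easier to read later, since it states directly which category is coded as 1.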

Interpretation

Visual

One useful way to visualize the relationship between a categorical and a continuous variable is through a box plot. When the first argument is a factor, R automatically creates such a graph via the plot() function (see Scatterplots). The CONF variable is graphically compared to TOTAL in the following sample code.
  > #use the plot() function to create a box plot
  > #what does the relationship between conference and team salary look like?
  > plot(CONF, TOTAL, main="Team Salary by Conference", xlab="Conference", ylab="Salary ($1,000s)")
The resulting box plot is shown below.

From a box plot, we can derive many useful insights, such as the minimum, maximum, and median values. Our box plot of total team salary on conference suggests that, compared to AFC teams, NFC teams have slightly higher salaries on average and the range of these salaries is larger.
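The quantities that a box plot draws can also be inspected as numbers. The sketch below uses made-up salaries (not the tutorial dataset) and the hypothetical names conf, total, bp, and group_medians. It calls boxplot() with plot = FALSE so that only the summary statistics are returned.

```r
# Hypothetical toy data, not the tutorial's dataset
conf  <- factor(c("AFC", "AFC", "AFC", "NFC", "NFC", "NFC"))
total <- c(20000, 21000, 22000, 19000, 22000, 25000)

# Compute the box plot statistics without drawing the graph
bp <- boxplot(total ~ conf, plot = FALSE)
bp$names   # group labels, one per box
bp$stats   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker

# Group medians computed directly, for comparison with row 3 of bp$stats
group_medians <- tapply(total, conf, median)
group_medians
```

Reading the numbers alongside the graph can help when two boxes look nearly identical.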

Routine Analysis

Once a categorical variable has been quantified, it can be used in routine analyses, such as descriptive statistics and correlations. The following code depicts a few examples.
  > #what are the mean and standard deviation of conference?
  > mean(dCONF)
  [1] 0.5
  > sd(dCONF)
  [1] 0.5091751
  > #this makes sense... there is an equal number of teams in each conference and they are coded as either 0 or 1!
  > #what is the correlation between total team salary and conference?
  > cor(dCONF, TOTAL)
  [1] 0.007019319
The correlation between total team salary and conference indicates that there is little to no linear relationship between the variables.
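To see why the mean and standard deviation above make sense, here is a short check. It assumes 28 teams split evenly into 14 per conference, which is an assumption on my part that happens to reproduce the reported standard deviation. The names d, p, n, and sd_formula are hypothetical.

```r
# Assumption: 28 teams, 14 coded 0 (AFC) and 14 coded 1 (NFC)
d <- rep(c(0, 1), each = 14)

# The mean of a 0/1 variable is the proportion of 1s
p <- mean(d)
n <- length(d)

# Sample standard deviation of a 0/1 variable written out by hand
sd_formula <- sqrt(n / (n - 1) * p * (1 - p))

# The built-in sd() should agree with the hand-written formula
all.equal(sd(d), sd_formula)
round(sd(d), 7)
```

In other words, the descriptive statistics of a dummy variable describe only the split between the two groups.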

Linear Regression

Let's return to our original question of how well quarterback salary and conference predict team salary. With the categorical predictor quantified, we can create a regression model for this relationship, as demonstrated below.
  > #create a linear model using lm(FORMULA, DATAVAR)
  > #predict team salary using quarterback salary and conference
  > linearModel <- lm(TOTAL ~ QB + dCONF, datavar)
  > #generate model summary
  > summary(linearModel)
The model summary is pictured below.

Considering both the counterintuitive and statistically insignificant results of this model, our analysis of the conference variable would likely end or change directions at this point. However, there is one more interpretation method that is worth mentioning for future reference.
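As a side note, lm() can also accept a factor directly and will build the dummy variable internally. The sketch below, using made-up numbers in a hypothetical data frame named toy, suggests that the two approaches give the same coefficients for a two-level factor. This is one alternative, not the method used in this tutorial.

```r
# Hypothetical toy data, not the tutorial's dataset
toy <- data.frame(
  TOTAL = c(21000, 23500, 19800, 25100, 22400, 24000),
  QB    = c(800, 1500, 600, 2100, 1200, 1700),
  CONF  = factor(c("AFC", "NFC", "AFC", "NFC", "AFC", "NFC"))
)

# Manual dummy coding: AFC = 0, NFC = 1
toy$dCONF <- as.numeric(toy$CONF) - 1

# Model with the hand-made dummy variable
model_dummy <- lm(TOTAL ~ QB + dCONF, data = toy)

# Model with the factor itself; R creates the dummy column (named CONFNFC)
model_factor <- lm(TOTAL ~ QB + CONF, data = toy)

coef(model_dummy)
coef(model_factor)
```

Only the coefficient name differs: dCONF in the first model and CONFNFC in the second.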

Split Model

With a dummy coded predictor, a regression model can be split into two halves by substituting in the possible values for the categorical variable. For example, we can think of our model as a regression of total salary on quarterback salary for two states of the world - teams in the AFC and teams in the NFC. These derivative models are covered in the following sample code.
  > #input the categorical values to split the linear model into two representations
  > #the original model: TOTAL = 19099 + 2.5 * QB - 103 * dCONF
  > #substitute 0 for dCONF to derive the AFC model: TOTAL = 19099 + 2.5 * QB
  > #substitute 1 for dCONF to derive the NFC model: TOTAL = 18996 + 2.5 * QB
  > #what is the predicted salary for a team with a quarterback salary of $2,000,000 in the AFC and NFC conferences?
  > #AFC prediction
  > 19099 + 2.5 * 2000
  [1] 24099
  > #NFC prediction
  > 18996 + 2.5 * 2000
  [1] 23996
Based only on what we have modeled, we can further infer that conference was not a significant predictor of total team salaries in the NFL in 1991. The difference between the team salaries based on conference is less than one-half of one percent on average! Of course, only using quarterback salary and conference to predict an NFL team's overall salary is neglecting quite a few potentially significant predictors. Nonetheless, split model interpretation is a useful way to break down the perspectives captured by a categorical regression model.
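The split model arithmetic can be wrapped in a small helper. The function name predict_total and the variable names below are hypothetical; the coefficients are the rounded values from the model equation in this tutorial.

```r
# Rounded coefficients taken from the tutorial's model:
# TOTAL = 19099 + 2.5 * QB - 103 * dCONF
predict_total <- function(qb, dconf) {
  19099 + 2.5 * qb - 103 * dconf
}

# Quarterback salary of $2,000,000 is 2000 in thousands of dollars
afc <- predict_total(qb = 2000, dconf = 0)   # AFC: dCONF = 0
nfc <- predict_total(qb = 2000, dconf = 1)   # NFC: dCONF = 1

# Gap between the two predictions as a percentage of the AFC prediction
pct_gap <- (afc - nfc) / afc * 100

afc
nfc
pct_gap
```

With a fitted model object, predict() with a newdata argument would likely be the more usual route, since it uses the unrounded coefficients.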

More On Categorical Predictors

Certainly, much more can be done with categorical variables than the basic dummy coding that was demonstrated here. Individuals whose work requires a deeper inspection into the procedures of categorical regression are encouraged to seek additional resources (and to consider writing a guest tutorial for this series).
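For a categorical variable with more than two levels, the as.numeric() approach would impose an artificial order, so one common option is to create one 0/1 column per level except a reference level. The sketch below uses model.matrix() on a made-up three-level factor named division (hypothetical, not part of the tutorial dataset).

```r
# Hypothetical three-level factor, for illustration only
division <- factor(c("East", "West", "Central", "East", "Central"))

# model.matrix() builds an intercept column plus n-1 dummy columns.
# The first level alphabetically (Central) becomes the reference category.
dummies <- model.matrix(~ division)

colnames(dummies)
dummies
```

Here a row with 0 in both divisionEast and divisionWest belongs to the reference category, Central.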

Complete Categorical Regression Example

To see a complete example of how a categorical regression model can be created in R, please download the categorical regression example (.txt) file.

References

The Associated Press. (1991). Q-back and team salaries [Data File]. Retrieved December 14, 2009 from http://lib.stat.cmu.edu/DASL/Datafiles/qbacksalarydat.html

20 comments:

  1. What if you had more than two categories? The "as.numeric" trick you are using only works for binary categories doesn't it?

    1. Hi,

      as.numeric() is not limited to binary categories

      John

    2. Hello,

      While it is true that as.numeric() is not limited to binary categories, you cannot use it that way for a categorical regression. You need to create n-1 variables and make them all 1s or 0s. You will have category 1 as 10, category 2 as 01, and category 3 as 00. This makes sure you don't have interference or an implied relationship or order between the categories.

      In short, the function works to create inputs, but is statistically NOT correct.

    3. Yes, that is correct. For more than two levels, you need to create n-1 variables. Thanks for pointing that out. John

  2. How do you convert a variable with more than two categories into dummy variables?

  3. Why not simply run lm(variable ~ factor(cat_variable) + continous_var)?

    1. There are usually a large number of ways to execute things in R, so my tutorials focus on demonstrating just one way of doing things.

  4. Nice tutorial, but how can I create n-1 dummy variables?

  5. Is there a trick to using plot() to generate a boxplot? When I tried plot() on my own dataset, I still got a scatterplot instead of a boxplot.

    test <- c("yes","no","no","no","no","yes","yes");
    test <- as.factor(test)
    dtest <- as.numeric(test)-1
    test2 <- c(17256,23074,20666,24249,21992,19413,19545);
    plot(dtest,test2);

    1. There is a boxplot() function in R, so try using that to see if you can get the intended graph. Something may have changed in R since I wrote this tutorial that prevents the plot() function from working as demonstrated.

    2. As it turned out, the following works. Still not sure why plotting dtest doesn't work.
      test <- c("yes","no","no","no","no","yes","yes");
      test <- factor(test)
      test2 <- c(17256,23074,20666,24249,21992,19413,19545);
      plot(test,test2);

      Thanks for the tutorial. Didn't know you could generate a boxplot this way. Very useful.

    3. try
      test <- c("yes","no","no","no","no","yes","yes");
      test <- as.factor(test)
      dtest <- as.numeric(test)-1
      test2 <- c(17256,23074,20666,24249,21992,19413,19545);
      plot(as.factor(dtest),test2);

      By default, R 2.15 produces a boxplot when you plot a continuous variable against a categorical variable (a factor).

  6. So, is there a significant difference between salaries in the AFC and NFC? Since the p-value for dCONF is > 0.05, doesn't that mean there is no significant difference between conference salaries?

    1. This tutorial did not explore whether there is a statistically significant difference between the conferences, although that is something you could examine if you wanted to. Instead, the tutorial looked at how conference and QB salaries predicted the total team salaries. Ultimately, conference was not a good predictor of overall team salaries.

  7. What about the relationship between two categorical variables?

    say high blood pressure: yes/no
    Obese : yes/no

    Can a box plot work there?

  8. I have data in which the binary response, coded 0 and 1, needs to be changed. For example: since I'm going to run a logistic regression, the response I am interested in is coded 0, but the reference for the model in R is 1. So I need to switch the coding, so that the response coded 1 becomes 0 and vice versa. I'm sweating my pants here trying to change this. Does anyone know how to do it?

  9. Subtract 1 and take the square.

  10. There is a tutorial for creating those coding systems for categorical variables:

    http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
