R Tutorial Series: Multiple Linear Regression

In R, multiple linear regression is only a small step away from simple linear regression. In fact, the same lm() function can be used for this technique, with the addition of one or more predictors. This tutorial will explore how R can be used to perform multiple linear regression.

Tutorial Files

Before we begin, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains information used to estimate undergraduate enrollment at the University of New Mexico (Office of Institutional Research, 1990). Note that all code samples in this tutorial assume that this data has already been read into an R variable and has been attached.
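As a rough sketch of that setup step, the code below reads a CSV into a variable and attaches it. The variable name datavar matches the placeholder used in the code samples later in this tutorial. The file contents here are a made-up three-row stand-in written to a temporary path so the sketch runs anywhere, and the column names are assumed from the variables this tutorial refers to. With the downloaded file, you would pass its name to read.csv() instead.

```r
# Stand-in for the downloaded file: a tiny CSV with MADE-UP values,
# written to a temporary path (assumption: the real file has these columns).
csvPath <- tempfile(fileext = ".csv")
writeLines(c(
  "ROLL,UNEM,HGRAD,INC",
  "5500,8.1,9500,1900",
  "7000,7.0,11000,2200",
  "9000,7.5,14000,2600"
), csvPath)

# Read the CSV into a data frame
datavar <- read.csv(csvPath)
str(datavar)

# Attach the data frame so its columns can be used by bare name
attach(datavar)
mean(ROLL)

# Detach again once finished, to keep the search path tidy
detach(datavar)
```

Attaching is optional when a data frame is passed to lm() through its data argument, because the formula's variables are then looked up in that data frame.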

Creating A Linear Model With Two Predictors

The lm() function

In R, the lm(), or "linear model," function can be used to create a multiple regression model. The lm() function accepts a number of arguments ("Fitting Linear Models," n.d.). The following list explains the two most commonly used parameters.
  • formula: describes the model
    Note that the formula argument follows a specific format. For multiple linear regression, this is "YVAR ~ XVAR1 + XVAR2 + … + XVARi" where YVAR is the dependent, or predicted, variable and XVAR1, XVAR2, etc. are the independent, or predictor, variables.
  • data: the variable that contains the dataset
It is recommended that you save a newly created linear model into a variable. By doing so, the model can be used in subsequent calculations and analyses without having to retype the entire lm() function each time. The sample code below demonstrates how to create a linear model with two predictors and save it into a variable. In this particular case, we are using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD) to predict the fall enrollment (ROLL).
  > #create a linear model using lm(FORMULA, DATAVAR)
  > #predict the fall enrollment (ROLL) using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD)
  > twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, datavar)
  > #display model
  > twoPredictorModel
The output of the preceding function is pictured below.

From this output, we can determine that the intercept is -8255.8, the coefficient for the unemployment rate is 698.2, and the coefficient for the number of spring high school graduates is 0.9. Therefore, the complete regression equation is Fall Enrollment = -8255.8 + 698.2 * Unemployment Rate + 0.9 * Number of Spring High School Graduates. This equation tells us that, holding the other predictor constant, the predicted fall enrollment for the University of New Mexico increases by 698.2 students for every one percentage point increase in the unemployment rate and by 0.9 students for every additional spring high school graduate. Suppose that our research question asks what the expected fall enrollment is, given this year's unemployment rate of 9% and a spring high school graduating class of 100,000 students. We can use the regression equation to calculate the answer, as follows.
  > #what is the expected fall enrollment (ROLL) given this year's unemployment rate (UNEM) of 9% and spring high school graduating class (HGRAD) of 100,000
  > -8255.8 + 698.2 * 9 + 0.9 * 100000
  [1] 88028
  > #the predicted fall enrollment, given a 9% unemployment rate and 100,000 student spring high school graduating class, is 88,028 students.
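The hand calculation above uses rounded coefficients. As a hedged alternative, R's predict() function can apply the stored, full-precision coefficients to new values. The sketch below runs on simulated stand-in data, not the real enrollment dataset, so its numbers will not match the tutorial's output; the generating coefficients and the name newValues are arbitrary choices for illustration.

```r
# Simulated stand-in data (NOT the real enrollment dataset);
# the generating coefficients below are arbitrary.
set.seed(42)
n <- 30
datavar <- data.frame(
  UNEM  = runif(n, 5, 10),
  HGRAD = runif(n, 9000, 20000)
)
datavar$ROLL <- -8000 + 700 * datavar$UNEM + 0.9 * datavar$HGRAD +
  rnorm(n, sd = 300)

# Fit the two-predictor model
twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, data = datavar)

# New observation: 9% unemployment and 100,000 graduates.
# Column names must match the predictor names in the formula.
newValues <- data.frame(UNEM = 9, HGRAD = 100000)

# predict() applies the fitted coefficients to the new values
predicted <- predict(twoPredictorModel, newdata = newValues)

# The same quantity computed by hand from the stored coefficients
b <- coef(twoPredictorModel)
byHand <- b["(Intercept)"] + b["UNEM"] * 9 + b["HGRAD"] * 100000

# Both routes should agree up to floating-point tolerance
all.equal(unname(predicted), unname(byHand))
```

Whichever route is used, predictor values that lie far outside the range of the fitted data make the result an extrapolation, which deserves extra caution.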

Creating A Linear Model With Three or More Predictors

When creating a model with more than two predictors, the lm() function can again be used. Simply continue adding variables to the formula argument until all of them are accounted for. A three-predictor model is demonstrated below. It seeks to predict the fall enrollment (ROLL) via the unemployment rate (UNEM), number of spring high school graduates (HGRAD), and per capita income (INC).
  > #create a linear model using lm(FORMULA, DATAVAR)
  > #predict the fall enrollment (ROLL) using the unemployment rate (UNEM), number of spring high school graduates (HGRAD), and per capita income (INC)
  > threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, datavar)
  > #display model
  > threePredictorModel
The output of the preceding function is pictured below.

From this output, we can determine that the intercept is -9153.3, the coefficient for the unemployment rate is 450.1, the coefficient for the number of spring high school graduates is 0.4, and the coefficient for per capita income is 4.3. Therefore, the complete regression equation is Fall Enrollment = -9153.3 + 450.1 * Unemployment Rate + 0.4 * Number of Spring High School Graduates + 4.3 * Per Capita Income. This equation tells us that, holding the other predictors constant, the predicted fall enrollment for the University of New Mexico increases by 450.1 students for every one percentage point increase in the unemployment rate, by 0.4 students for every additional spring high school graduate, and by 4.3 students for every additional dollar of per capita income. Let's revisit our research question, this time including a per capita income of $30,000.
  > #what is the expected fall enrollment (ROLL) given this year's unemployment rate (UNEM) of 9%, spring high school graduating class (HGRAD) of 100,000, and a per capita income (INC) of $30,000
  > -9153.3 + 450.1 * 9 + 0.4 * 100000 + 4.3 * 30000
  [1] 163897.6
  > #the predicted fall enrollment, given a 9% unemployment rate, 100,000 student spring high school graduating class, and $30,000 per capita income, is 163,898 students.
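Adding a predictor does not require retyping the whole formula. One possible shortcut, sketched here on simulated stand-in data (not the real enrollment dataset, with arbitrary generating coefficients), is R's update() function, which refits an existing model with a modified formula. The name fullFormulaModel is introduced only for this comparison.

```r
# Simulated stand-in data (NOT the real enrollment dataset);
# the generating coefficients below are arbitrary.
set.seed(42)
n <- 30
datavar <- data.frame(
  UNEM  = runif(n, 5, 10),
  HGRAD = runif(n, 9000, 20000),
  INC   = runif(n, 1000, 3500)
)
datavar$ROLL <- -9000 + 450 * datavar$UNEM + 0.4 * datavar$HGRAD +
  4.3 * datavar$INC + rnorm(n, sd = 300)

# Start from the two-predictor model
twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, data = datavar)

# ". ~ . + INC" means: keep the existing response and predictors, then add INC
threePredictorModel <- update(twoPredictorModel, . ~ . + INC)

# Equivalent model with the formula written out in full
fullFormulaModel <- lm(ROLL ~ UNEM + HGRAD + INC, data = datavar)

# The two fits should have the same coefficients
all.equal(coef(threePredictorModel), coef(fullFormulaModel))
```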

Summarizing The Models

A multiple linear regression model can be used to do much more than just calculate expected values. Here, the summary(OBJECT) function is a useful tool. It is capable of generating a wealth of important information about a linear model. The example below demonstrates the use of the summary function on the two models created during this tutorial.
  > #use summary(OBJECT) to display information about the linear model
  > summary(twoPredictorModel)
  > summary(threePredictorModel)
The output of the preceding functions is pictured below.


The summary(OBJECT) function has provided us with t-test, F-test, R-squared, residual, and significance values. All of this data can be used to answer important questions related to our models.
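The values reported by summary() can also be pulled out individually for use in later calculations. The sketch below shows one way to do so, again on simulated stand-in data rather than the real enrollment dataset, so the numbers are illustrative only. The names modelSummary and comparison are introduced here for illustration.

```r
# Simulated stand-in data (NOT the real enrollment dataset);
# the generating coefficients below are arbitrary.
set.seed(42)
n <- 30
datavar <- data.frame(
  UNEM  = runif(n, 5, 10),
  HGRAD = runif(n, 9000, 20000),
  INC   = runif(n, 1000, 3500)
)
datavar$ROLL <- -9000 + 450 * datavar$UNEM + 0.4 * datavar$HGRAD +
  4.3 * datavar$INC + rnorm(n, sd = 300)

twoPredictorModel   <- lm(ROLL ~ UNEM + HGRAD, data = datavar)
threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, data = datavar)

# Store the summary so its components can be accessed by name
modelSummary <- summary(threePredictorModel)

# Coefficient table: estimates, standard errors, t values, p-values
coef(modelSummary)

# R-squared and adjusted R-squared
modelSummary$r.squared
modelSummary$adj.r.squared

# F statistic with its numerator and denominator degrees of freedom
modelSummary$fstatistic

# First few residuals of the fitted model
head(residuals(threePredictorModel))

# F-test comparing the nested two- and three-predictor models
comparison <- anova(twoPredictorModel, threePredictorModel)
comparison
```

The anova() comparison is one way to ask whether the added predictor improves the fit beyond what the smaller model already explains.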

Alternative Modeling Options

Although lm() was used in this tutorial, note that there are alternative modeling functions available in R, such as glm() and rlm() (the latter provided by the MASS package). Depending on your circumstances, it may be beneficial or necessary to investigate alternatives to lm() before choosing how to conduct your regression analysis.
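To give a feel for how these alternatives are called, the sketch below fits the same formula with lm(), glm(), and, if the MASS package is installed, rlm(). It uses simulated stand-in data (not the real enrollment dataset, with arbitrary generating coefficients), and the names lmFit, glmFit, and rlmFit are illustrative. With the gaussian family, glm() fits the same least-squares model as lm(), so the coefficients should agree.

```r
# Simulated stand-in data (NOT the real enrollment dataset);
# the generating coefficients below are arbitrary.
set.seed(42)
n <- 30
datavar <- data.frame(
  UNEM  = runif(n, 5, 10),
  HGRAD = runif(n, 9000, 20000)
)
datavar$ROLL <- -8000 + 700 * datavar$UNEM + 0.9 * datavar$HGRAD +
  rnorm(n, sd = 300)

# Ordinary least squares
lmFit <- lm(ROLL ~ UNEM + HGRAD, data = datavar)

# Generalized linear model; the gaussian family reproduces the lm() fit
glmFit <- glm(ROLL ~ UNEM + HGRAD, family = gaussian, data = datavar)
all.equal(coef(lmFit), coef(glmFit))

# Robust regression, only attempted if MASS is available
if (requireNamespace("MASS", quietly = TRUE)) {
  rlmFit <- MASS::rlm(ROLL ~ UNEM + HGRAD, data = datavar)
  print(coef(rlmFit))
}
```

The other glm() families (binomial or poisson, for example) are where it departs from lm(), and rlm() is aimed at data with outliers that would unduly influence a least-squares fit.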

Complete Multiple Linear Regression Example

To see a complete example of how multiple linear regression can be conducted in R, please download the multiple linear regression example (.txt) file.

References

Fitting Linear Models. (n.d.). Retrieved November 22, 2009, from http://sekhon.berkeley.edu/library/stats/html/lm.html
Office of Institutional Research. (1990). Enrollment Forecast [Data File]. Retrieved November 22, 2009, from http://lib.stat.cmu.edu/DASL/Datafiles/enrolldat.html

9 comments:

  1. Thanks John. Your blog and explanations are most helpful for a beginner. Bill Yarberry

  2. Hi Bill. Thanks for the comments. I'm glad the tutorials have been helpful to you.

    John

  3. Will you be making/can you direct me to a tutorial for running a Discriminant Function Analysis in R?

  4. Hi Ryane,

    Thanks for the recommendation. I do not currently have knowledge of discriminant function analysis, so I recommend searching Google for information on conducting it in R. Some other good sites to look at are Quick-R, Crantastic, the R Help Listserv archives, and the relevant package documentation. The odds are that someone has covered it in some form that you can use to sort out how to do it on your own. It may not be as clean as what I present here, but most things are out there in some form.

  5. Hi John,

    Congratulations on your blog. I'm a beginner in R and it's being absolutely essential!

    I'm trying to see the summary of the lm model, but I get the following message
    Error in function (classes, fdef, mtable) :
    unable to find an inherited method for function ‘Summary’ for signature ‘"lm"’

    Do you know what the problem is?

    Thank you very much!
    Cristina

    Replies
    1. Cristina,

      Make sure "summary" is lowercase. The error message indicates that it can't find "Summary." It's case-sensitive.

      -Ryan

    2. Hi Ryan,

      Thanks for helping a fellow R user on this question!

      John

  6. Hi John,
    I'm new to the R language. I would like to know how to simulate a multiple linear regression that fulfills all four regression assumptions.

    Replies
    1. Hi, take a look at the side links for the other posts on this blog. I have one dedicated to assessing regression assumptions. Thanks, John.
