By John M Quick

The R Tutorial Series provides a collection of user-friendly tutorials for people who want to learn how to use R for statistical analysis.



R Tutorial Series: Basic Polynomial Regression

Often, a scatterplot reveals a pattern that does not appear to be linear. Polynomial regression can be used to explore a predictor at different levels of curvilinearity. This tutorial will demonstrate how polynomial regression can be used in a hierarchical fashion to best represent a dataset in R.

Tutorial Files

Before we begin, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. Note that all code samples in this tutorial assume that this data has already been read into an R variable and has been attached. This dataset contains hypothetical student data that uses practice exam scores to predict final exam scores.
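As a minimal sketch of that setup step, the following code reads a CSV file into an R variable and attaches it. The temporary file and its four rows are made up so that the sketch runs on its own; in practice, you would pass the name of the downloaded file to read.csv() instead. The column names Practice and Final match the variables used later in this tutorial, and datavar is the same placeholder name that appears in the model code.

```r
# Made-up stand-in for the downloaded CSV (hypothetical values, not the tutorial data)
csvPath <- tempfile(fileext = ".csv")
writeLines(c("Practice,Final", "55,60", "70,74", "85,88", "95,90"), csvPath)

# read the data into an R variable
datavar <- read.csv(csvPath)

# attach the data so that its columns can be referenced by name
attach(datavar)

# the columns are now available directly
head(Practice)
head(Final)
```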

Scatterplot


The preceding scatterplot demonstrates that these data may not be linear. Notably, no one scored lower than 50 on the practice exam, and final exam scores taper off at practice scores of approximately 85 and above. These observations suggest that the relationship is curvilinear. Furthermore, since exam scores range from 0 to 100, it is neither possible to observe nor appropriate to predict a final exam score for an individual with a practice score of 150.
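A scatterplot of this kind can be produced with the plot() function. The sketch below uses simulated stand-in data, not the tutorial's dataset, so the pattern it shows is only an illustration. The seed, sample size, and the formula used to generate Final are all assumptions made for the example.

```r
# Simulated stand-in data (NOT the tutorial's dataset)
set.seed(1)
Practice <- runif(40, min = 50, max = 100)
Final <- pmin(100, 20 + 0.8 * Practice + rnorm(40, sd = 5))

# scatterplot of final exam scores against practice exam scores
plot(Practice, Final,
     xlab = "Practice Exam Score",
     ylab = "Final Exam Score",
     main = "Final vs. Practice Exam Scores")
```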

Creating The Higher Order Variables

A two-step process, identical to the one used to create interaction variables, can be followed to create higher order variables in R. First, the variables must be centered to mitigate multicollinearity. Second, the predictor must be multiplied by itself a certain number of times to create each higher order variable. In this tutorial, we will explore a linear, a quadratic, and a cubic model. Therefore, the predictor will need to be squared to create the quadratic model and cubed to create the cubic model.

Step 1: Centering

To center a variable, simply subtract its mean from each data point and save the result into a new R variable, as demonstrated below.
  > #center the dependent variable
  > FinalC <- Final - mean(Final)
  > #center the predictor
  > PracticeC <- Practice - mean(Practice)
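To see why centering helps, it can be useful to compare the correlation between a predictor and its square before and after centering. The following sketch does this on simulated practice scores (an assumption for illustration, not the tutorial data). It also shows that the built-in scale() function with scale = FALSE produces the same centered values as subtracting the mean by hand.

```r
# Simulated practice scores (NOT the tutorial's dataset)
set.seed(1)
Practice <- runif(40, min = 50, max = 100)

# center by subtracting the mean, as in the tutorial
PracticeC <- Practice - mean(Practice)

# equivalent centering with scale(); as.numeric() drops the matrix attributes
PracticeScaled <- as.numeric(scale(Practice, scale = FALSE))
all.equal(PracticeC, PracticeScaled)

# the centered variable has a mean of (numerically) zero
mean(PracticeC)

# correlation between the linear and squared terms, before and after centering
rawCor <- cor(Practice, Practice^2)
centeredCor <- cor(PracticeC, PracticeC^2)
rawCor
centeredCor
```

With uncentered scores that are all positive, the linear and squared terms rise together and are very highly correlated. Centering typically reduces that correlation substantially, which is the multicollinearity concern that this step addresses.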

Step 2: Multiplication

Once the input variable has been centered, the higher order terms can be created. Since a higher order variable is formed by the product of a predictor with itself, we can simply multiply our centered term from step one and save the result into a new R variable, as demonstrated below.
  > #create the quadratic variable
  > PracticeC2 <- PracticeC * PracticeC
  > #create the cubic variable
  > PracticeC3 <- PracticeC * PracticeC * PracticeC
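The same higher order variables can also be written with R's exponent operator, which scales more comfortably to higher powers. The short sketch below uses a handful of made-up centered values (they sum to zero) to confirm that both forms give the same result.

```r
# Made-up centered values for illustration (their mean is zero)
PracticeC <- c(-20, -5, 0, 10, 15)

# create the quadratic and cubic variables with the ^ operator
PracticeC2 <- PracticeC^2
PracticeC3 <- PracticeC^3

# both forms agree with repeated multiplication
all.equal(PracticeC2, PracticeC * PracticeC)
all.equal(PracticeC3, PracticeC * PracticeC * PracticeC)
```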

Creating The Models

Now we have all of the pieces necessary to assemble our linear and curvilinear models.
  > #create the models using lm(FORMULA, DATAVAR)
  > #linear model
  > linearModel <- lm(FinalC ~ PracticeC, datavar)
  > #quadratic model
  > quadraticModel <- lm(FinalC ~ PracticeC + PracticeC2, datavar)
  > #cubic model
  > cubicModel <- lm(FinalC ~ PracticeC + PracticeC2 + PracticeC3, datavar)
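As an alternative to saving each higher order variable separately, the powers can be written inside the model formula by wrapping them in I(), which tells R to treat ^ as arithmetic rather than as a formula operator. The sketch below fits the cubic model both ways on simulated stand-in data (an assumption for illustration, not the tutorial's dataset) and checks that the coefficients agree.

```r
# Simulated stand-in data (NOT the tutorial's dataset)
set.seed(1)
Practice <- runif(40, min = 50, max = 100)
Final <- pmin(100, 20 + 0.8 * Practice + rnorm(40, sd = 5))

# center the variables and create the higher order terms
FinalC <- Final - mean(Final)
PracticeC <- Practice - mean(Practice)
PracticeC2 <- PracticeC^2
PracticeC3 <- PracticeC^3

# cubic model using the precomputed variables
cubicModel <- lm(FinalC ~ PracticeC + PracticeC2 + PracticeC3)

# the same model with the powers written inline using I()
cubicModelInline <- lm(FinalC ~ PracticeC + I(PracticeC^2) + I(PracticeC^3))

# the two fits have the same coefficients (only the term names differ)
all.equal(unname(coef(cubicModel)), unname(coef(cubicModelInline)))
```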

Evaluating The Models

As is the case in other forms of regression, it can be helpful to summarize and compare our potential models using the summary(MODEL) and anova(MODEL1, MODEL2, …, MODELi) functions.
  > #display summary information about the models
  > summary(linearModel)
  > summary(quadraticModel)
  > summary(cubicModel)
  > #compare the models using ANOVA
  > anova(linearModel, quadraticModel, cubicModel)
The model summaries and ANOVA comparison chart are displayed below.

At this point, we can compare the models. In this case, neither the quadratic nor the cubic term is statistically significant, and neither higher order model explains significantly more variance than the linear model. However, in a real research study, there would be other practical considerations to make before deciding on a final model.
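When comparing models this way, the relevant numbers can also be pulled out of the result objects instead of being read off the printed output. The sketch below runs the comparison on simulated stand-in data (an assumption for illustration, so its values say nothing about the tutorial's dataset). In the ANOVA table, each row after the first compares a model with the one before it.

```r
# Simulated stand-in data (NOT the tutorial's dataset)
set.seed(1)
Practice <- runif(40, min = 50, max = 100)
Final <- pmin(100, 20 + 0.8 * Practice + rnorm(40, sd = 5))
FinalC <- Final - mean(Final)
PracticeC <- Practice - mean(Practice)

# fit the three nested models
linearModel <- lm(FinalC ~ PracticeC)
quadraticModel <- lm(FinalC ~ PracticeC + I(PracticeC^2))
cubicModel <- lm(FinalC ~ PracticeC + I(PracticeC^2) + I(PracticeC^3))

# compare the models using ANOVA
comparison <- anova(linearModel, quadraticModel, cubicModel)
comparison

# p-values for each added term; the first entry is NA because
# the linear model has nothing to be compared against
comparison[["Pr(>F)"]]

# coefficient table (estimates, standard errors, t values, p-values)
summary(cubicModel)$coefficients
```

A small p-value in the second or third row would suggest that the quadratic or cubic term improves on the simpler model, while large p-values would favor keeping the linear model.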

More On Interactions, Polynomials, and HLR

Certainly, much more can be done with these topics than I have covered in my tutorials. What I have provided is a basic discussion with guided examples. The regression topics covered in these tutorials can be mixed and matched to create exceedingly complex models. For example, multiple interactions and higher order variables could be contained in a single model. The good news is that more complex models can be created using the same techniques covered here. The basic principles remain the same.

Complete Polynomial Regression Example

To see a complete example of how polynomial regression models can be created in R, please download the polynomial regression example (.txt) file.
