By John M Quick

The R Tutorial Series provides a collection of user-friendly tutorials to people who want to learn how to use R for statistical analysis.



R Tutorial Series: Two-Way ANOVA with Unequal Sample Sizes

When the sample sizes within the levels of our independent variables are not equal, we have to handle our ANOVA differently than in the typical two-way case. This tutorial will demonstrate how to conduct a two-way ANOVA in R when the sample sizes within each level of the independent variables are not the same.

Tutorial Files

Before we begin, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains a hypothetical sample of 30 students, each of whom was exposed to one of two learning environments (offline or online) and one of two methods of instruction: classroom instruction (i.e. one to many) or instruction from a personal tutor (i.e. one to one). Each student was then tested on a math assessment, with possible scores ranging from 0 to 100.

Beginning Steps

To begin, we need to read our dataset into R and store its contents in a variable.
  > #read the dataset into an R variable using the read.csv(file) function
  > dataTwoWayUnequalSample <- read.csv("dataset_ANOVA_TwoWayUnequalSample.csv")
  > #display the data
  > dataTwoWayUnequalSample

The first ten rows of our dataset
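
If the full printout is long, a quick structural check can be easier to read. The sketch below is a minimal illustration and does not use the tutorial's dataset: it writes a small hypothetical CSV to a temporary file (the rows are invented; only the column names environment, instruction, and math follow the code used later in this tutorial), reads it back with read.csv(), and inspects it with head() and str().

```r
# Hypothetical stand-in for the tutorial's CSV; the values are invented
exampleFile <- tempfile(fileext = ".csv")
writeLines(c(
  "environment,instruction,math",
  "offline,classroom,60",
  "offline,tutor,90",
  "online,classroom,50",
  "online,tutor,76"
), exampleFile)

# read the file into a data frame, as in the tutorial
exampleData <- read.csv(exampleFile)

# show up to the first ten rows, then the column names and types
head(exampleData, n = 10)
str(exampleData)
```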

Unequal Sample Sizes

In our study, 16 students participated in the online environment, whereas only 14 participated in the offline environment. Further, 20 students received classroom instruction, whereas only 10 received personal tutor instruction. As such, we should take action to compensate for the unequal sample sizes in order to retain the validity of our analysis. Generally, this comes down to examining the correlation between the factors and the causes of the unequal sample sizes, then choosing whether to use weighted or unweighted means, a decision that can drastically change the results of an ANOVA. This tutorial demonstrates how to conduct an ANOVA using both weighted and unweighted means; the choice between them is left to each individual and his or her specific circumstances.
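
Counts like these can be checked directly in R. The following is a minimal sketch on a hypothetical data frame: the marginal counts (14 offline, 16 online, 20 classroom, 10 tutor) match the ones quoted above, but the four cell sizes are an assumption made for illustration, since the tutorial does not state them.

```r
# Hypothetical design: the cell sizes (10, 4, 10, 6) are assumed, chosen
# only so that the marginal counts match those quoted in the tutorial
design <- data.frame(
  environment = rep(c("offline", "offline", "online", "online"),
                    times = c(10, 4, 10, 6)),
  instruction = rep(c("classroom", "tutor", "classroom", "tutor"),
                    times = c(10, 4, 10, 6))
)

# cross-tabulate the two factors to get the size of each treatment cell
cellCounts <- table(design$environment, design$instruction)

# add row and column totals to see the marginal sample sizes
addmargins(cellCounts)

# a design is balanced only if every cell has the same count
isBalanced <- length(unique(as.vector(cellCounts))) == 1
isBalanced
```

With your own data, the same table() call on dataTwoWayUnequalSample$environment and dataTwoWayUnequalSample$instruction would show the actual cell sizes.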

Weighted Means

First, let's suppose that we decided to go with weighted means, which take into account the correlation between our factors that results from having treatment groups with different sample sizes. The weighted mean for a level is calculated by simply adding up all of the values in that level and dividing by the number of values. Consequently, we can easily derive the weighted means for each treatment group using the subset(data, condition) and mean(data) functions.
  > #use subset(data, condition) to create subsets for each treatment group
  > #offline subset
  > offlineData <- subset(dataTwoWayUnequalSample, dataTwoWayUnequalSample$environment == "offline")
  > #online subset
  > onlineData <- subset(dataTwoWayUnequalSample, dataTwoWayUnequalSample$environment == "online")
  > #classroom subset
  > classroomData <- subset(dataTwoWayUnequalSample, dataTwoWayUnequalSample$instruction == "classroom")
  > #tutor subset
  > tutorData <- subset(dataTwoWayUnequalSample, dataTwoWayUnequalSample$instruction == "tutor")
  > #use mean(data) to calculate the weighted means for each treatment group
  > #offline weighted mean
  > mean(offlineData$math)
  > #online weighted mean
  > mean(onlineData$math)
  > #classroom weighted mean
  > mean(classroomData$math)
  > #tutor weighted mean
  > mean(tutorData$math)

The weighted means for the environment and instruction conditions
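
The same weighted means can also be obtained in one step per factor with tapply(). This is a minimal sketch on an invented eight-row data frame (the name toy and all of its scores are hypothetical), kept small so that the arithmetic can be checked by hand.

```r
# Hypothetical unbalanced 2 x 2 data; cell sizes are 3, 1, 2, and 2
toy <- data.frame(
  environment = c("offline", "offline", "offline", "offline",
                  "online", "online", "online", "online"),
  instruction = c("classroom", "classroom", "classroom", "tutor",
                  "classroom", "classroom", "tutor", "tutor"),
  math = c(60, 70, 80, 90, 50, 60, 76, 86)
)

# weighted mean per environment: every score in the level counts equally
weightedEnvironment <- tapply(toy$math, toy$environment, mean)
weightedEnvironment   # offline = 300 / 4 = 75, online = 272 / 4 = 68

# weighted mean per instruction method
weightedInstruction <- tapply(toy$math, toy$instruction, mean)
weightedInstruction   # classroom = 320 / 5 = 64, tutor = 252 / 3 = 84
```

Because every score counts once, larger cells pull the level mean toward their own mean. Here the three offline classroom scores dominate the single offline tutor score.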

ANOVA using Type I Sums of Squares

When applying weighted means, it is suggested that we use Type I sums of squares (SS) in our ANOVA. Type I is the default SS used by the standard anova(object) function, which we will use to execute our analysis. Type I SS are computed sequentially, so each main effect is adjusted only for the terms entered before it. Consequently, in a two-way ANOVA with unequal sample sizes, the ordering of the independent variables matters. Therefore, we must run our ANOVA twice, once with each independent variable taking the lead. The interaction effect, however, is not affected by the ordering of the independent variables.
  > #use anova(object) to execute the Type I SS ANOVAs
  > #environment ANOVA
  > anova(lm(math ~ environment * instruction, dataTwoWayUnequalSample))
  > #instruction ANOVA
  > anova(lm(math ~ instruction * environment, dataTwoWayUnequalSample))

The Type I SS ANOVA results. Note the differences in main effects based on the ordering of the independent variables.

These results indicate statistically nonsignificant main effects for both the environment and instruction variables, as well as for the interaction between them.
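
To see the effect of ordering in isolation, the sketch below fits both orderings to a small invented data frame (toy, envFirst, and insFirst are hypothetical names, and the scores are not from the tutorial's dataset) and pulls out the sums of squares that change and the ones that do not.

```r
# Hypothetical unbalanced 2 x 2 data; cell sizes are 3, 1, 2, and 2
toy <- data.frame(
  environment = c("offline", "offline", "offline", "offline",
                  "online", "online", "online", "online"),
  instruction = c("classroom", "classroom", "classroom", "tutor",
                  "classroom", "classroom", "tutor", "tutor"),
  math = c(60, 70, 80, 90, 50, 60, 76, 86)
)

# Type I (sequential) SS with environment entered first
envFirst <- anova(lm(math ~ environment * instruction, data = toy))
# Type I (sequential) SS with instruction entered first
insFirst <- anova(lm(math ~ instruction * environment, data = toy))

# the environment SS changes with its position in the formula
envFirst["environment", "Sum Sq"]
insFirst["environment", "Sum Sq"]

# the interaction SS does not depend on the ordering
envFirst["environment:instruction", "Sum Sq"]
insFirst["instruction:environment", "Sum Sq"]
```

In both orderings the two main-effect SS add up to the same total, because together they always account for the fit of the model without the interaction. Only the split between the two factors moves.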

Unweighted Means

Now let's turn to using unweighted means, which essentially ignore the correlation between the independent variables that arises from unequal sample sizes. An unweighted mean is calculated by taking the average of the individual group means. Thus, for each level of an independent variable, we sum the group means across the levels of the other independent variable and divide by the number of those levels. For instance, to find the unweighted mean for the offline environment, we will add the means for our offline classroom and offline tutor groups, then divide by two.
  > #use mean(data) and subset(data, condition) to calculate the unweighted means for each treatment group
  > #offline unweighted mean = (classroom offline mean + tutor offline mean) / 2
  > (mean(subset(offlineData$math, offlineData$instruction == "classroom")) + mean(subset(offlineData$math, offlineData$instruction == "tutor"))) / 2
  > #online unweighted mean = (classroom online mean + tutor online mean) / 2
  > (mean(subset(onlineData$math, onlineData$instruction == "classroom")) + mean(subset(onlineData$math, onlineData$instruction == "tutor"))) / 2
  > #classroom unweighted mean = (offline classroom mean + online classroom mean) / 2
  > (mean(subset(classroomData$math, classroomData$environment == "offline")) + mean(subset(classroomData$math, classroomData$environment == "online"))) / 2
  > #tutor unweighted mean = (offline tutor mean + online tutor mean) / 2
  > (mean(subset(tutorData$math, tutorData$environment == "offline")) + mean(subset(tutorData$math, tutorData$environment == "online"))) / 2

The unweighted means for the environment and instruction conditions
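
An alternative way to get the same quantities is to build the table of group means once and then average its rows and columns. The sketch below does this on an invented eight-row data frame (toy and its scores are hypothetical), so the numbers can be verified by hand.

```r
# Hypothetical unbalanced 2 x 2 data; cell sizes are 3, 1, 2, and 2
toy <- data.frame(
  environment = c("offline", "offline", "offline", "offline",
                  "online", "online", "online", "online"),
  instruction = c("classroom", "classroom", "classroom", "tutor",
                  "classroom", "classroom", "tutor", "tutor"),
  math = c(60, 70, 80, 90, 50, 60, 76, 86)
)

# mean of each environment-by-instruction group
# (rows are environments, columns are instruction methods)
cellMeans <- tapply(toy$math, list(toy$environment, toy$instruction), mean)
cellMeans   # offline: 70, 90; online: 55, 81

# unweighted means: average the group means, ignoring group sizes
unweightedEnvironment <- rowMeans(cellMeans)
unweightedEnvironment   # offline = (70 + 90) / 2 = 80, online = 68

unweightedInstruction <- colMeans(cellMeans)
unweightedInstruction   # classroom = (70 + 55) / 2 = 62.5, tutor = 85.5
```

In this toy data the plain mean of all four offline scores is 75, while the unweighted offline mean is 80, because the single offline tutor score now carries as much weight as the three offline classroom scores combined.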

ANOVA using Type III Sums of Squares

When applying unweighted means, it is suggested that we use Type III sums of squares (SS) in our ANOVA. Type III SS can be set using the type argument in the Anova(mod, type) function, which is a member of the car package.
  > #load the car package (install first, if necessary)
  > library(car)
  > #use the Anova(mod, type) function to conduct the Type III SS ANOVA
  > Anova(lm(math ~ environment * instruction, dataTwoWayUnequalSample), type = "3")

The Type III SS ANOVA results.

Once again, our ANOVA results indicate statistically nonsignificant main effects for both the environment and instruction variables, as well as for the interaction between them. However, it is worth noting that both the means and the p-values differ between unweighted means with Type III SS and weighted means with Type I SS. In certain cases, this difference can be quite pronounced and lead to entirely different conclusions. Hence, the choice of means and SS for a given analysis should be made deliberately.
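
One caveat, raised by a reader in the comments: Type III tests depend on how the factors are coded. The sketch below is an illustration under stated assumptions, not the tutorial's analysis. It uses an invented eight-row data frame and base R's drop1() as a stand-in for car's Anova(), fitting the same model once with R's default treatment contrasts and once with sum-to-zero contrasts (contr.sum). The names toy, treatmentFit, sumFit, treatmentTests, and sumTests are hypothetical.

```r
# Hypothetical unbalanced 2 x 2 data; cell sizes are 3, 1, 2, and 2
toy <- data.frame(
  environment = c("offline", "offline", "offline", "offline",
                  "online", "online", "online", "online"),
  instruction = c("classroom", "classroom", "classroom", "tutor",
                  "classroom", "classroom", "tutor", "tutor"),
  math = c(60, 70, 80, 90, 50, 60, 76, 86),
  stringsAsFactors = TRUE
)

# the model under R's default treatment contrasts
treatmentFit <- lm(math ~ environment * instruction, data = toy)

# the same model under sum-to-zero contrasts, set per factor inside lm()
sumFit <- lm(math ~ environment * instruction, data = toy,
             contrasts = list(environment = contr.sum,
                              instruction = contr.sum))

# drop1() with scope . ~ . tests each term against the full model;
# it is used here as a base R stand-in for Type III tests
treatmentTests <- drop1(treatmentFit, scope = . ~ ., test = "F")
sumTests <- drop1(sumFit, scope = . ~ ., test = "F")

treatmentTests
sumTests
```

With this toy data the interaction row agrees under both codings, but the main-effect rows do not. Under treatment contrasts each main effect is tested at the reference level of the other factor, whereas under contr.sum it is tested averaged over the other factor's levels, which corresponds to comparing unweighted means. Note that the contrasts are supplied to lm(), where the model is fitted. Whether the coding changes the results for your own data is worth checking before reporting Type III tests.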

Pairwise Comparisons

Note that since our independent variables contain only two levels each, there is no need to conduct follow-up comparisons. However, should you reach this point with a statistically significant independent variable of more than two levels, you could conduct pairwise comparisons in the same manner as demonstrated in the Two-Way ANOVA with Comparisons tutorial.
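
The referenced tutorial is not reproduced here. As a rough idea of what such a follow-up can look like, the sketch below uses base R's pairwise.t.test() with a Bonferroni adjustment, which is one common option and an assumption on my part rather than necessarily the method used in that tutorial. The data frame comparisonData, its third level ("video"), and all scores are invented.

```r
# Hypothetical one-factor data with three levels; scores are invented
comparisonData <- data.frame(
  method = rep(c("classroom", "tutor", "video"), each = 4),
  math = c(60, 65, 70, 75, 80, 85, 90, 95, 62, 68, 71, 77)
)

# pairwise t tests with a Bonferroni adjustment for multiple comparisons
comparisons <- pairwise.t.test(comparisonData$math, comparisonData$method,
                               p.adjust.method = "bonferroni")
comparisons

# the adjusted p-values are stored as a lower-triangular matrix
comparisons$p.value
```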

Complete Two-Way ANOVA with Unequal Sample Sizes Example

To see a complete example of how two-way ANOVA with unequal sample sizes can be conducted in R, please download the two-way ANOVA with unequal sample sizes example (.txt) file.

8 comments:

  1. Type III sum of squares? I start to think about John Maindonald!

  2. Anova(lm(...), type="III") will not give SS type III unless one also sets options(contrasts=c(unordered="contr.sum", ordered="contr.poly")) beforehand, or uses Anova(lm(..., contrasts=list(environment=contr.sum, instruction=contr.sum))).

  3. Hi. Thanks for the tip, but I get the same results using what is provided in the tutorial:

    Anova(lm(math ~ environment * instruction, dataTwoWayUnequalSample), type = "3")

    and your suggestion:

    Anova(lm(math ~ environment * instruction, dataTwoWayUnequalSample), type = "3",contrasts=list(environment=contr.sum, instruction=contr.sum))

    Can you be more specific about the differences between the functions and how they impact the results?

    Thanks,
    John

  4. Hi, thanks for this post.

    I think it would be helpful to explain the different questions answered using the different Type I vs Type III analyses.

    Can you see any use for Type II SS?
    Perhaps when a priori no interaction is expected?

  5. Hi,

    Thanks for the feedback. Unfortunately, I'm not an expert on statistical methods nor am I qualified to offer statistical advice. The process of using weighted means with Type I SS and unweighted means with Type III SS for unequal sample sizes was taught to me in an ANOVA course for social scientists. I am not familiar with the use of Type II SS.

    I recommend talking to a professional statistician about the merits and circumstances surrounding the different tests. You should also consult a senior member of your field and the published research, since different fields often have different standards and procedures for handling statistics.

  6. for good explanations, look at Falk Scholer's site:
    http://goanna.cs.rmit.edu.au/~fscholer/anova.php

    and the paper by Ista Zahn, referenced there.

  7. Hi Ruediger,

    Thanks for the website and paper reference.

    John

  8. Is this the same procedure for MANOVA with unequal sample sizes?

    Thank you

    Christos
