R Tutorial Series: Exploratory Factor Analysis

Exploratory factor analysis (EFA) is a common technique in the social sciences for explaining the correlations among several measured variables in terms of a smaller set of latent variables. EFA is often used to consolidate survey data by revealing the groupings (factors) that underlie individual questions. This will be the context for the demonstration in this tutorial.

Tutorial Files

Before we begin, you may want to download the dataset (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains a hypothetical sample of 300 responses on 6 items from a survey of college students' favorite subject matter. The items range in value from 1 to 5, which represent a scale from Strongly Dislike to Strongly Like. Our 6 items asked students to rate their liking of different college subject matter areas, including biology (BIO), geology (GEO), chemistry (CHEM), algebra (ALG), calculus (CALC), and statistics (STAT). This is where our tutorial ends, because all students rated all of these content areas as Strongly Dislike, thereby rendering insufficient variance for conducting EFA (just kidding).

Beginning Steps

To begin, we need to read our dataset into R and store its contents in a variable.
  > #read the dataset into an R variable using the read.csv(file) function
  > data <- read.csv("dataset_EFA.csv")

First 10 rows of the dataset
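Before going further, it can help to confirm that the data were read as expected. The following is only a minimal sketch and is not part of the tutorial files: so that it runs without the CSV, it builds a random stand-in data frame (the name `demo` and its values are assumptions) with the same six column names, then applies the checks you would apply to `data`.

```r
# Stand-in for read.csv("dataset_EFA.csv"): random 1-5 ratings, NOT the real data
set.seed(1)
items <- c("BIO", "GEO", "CHEM", "ALG", "CALC", "STAT")
demo <- as.data.frame(matrix(sample(1:5, 300 * 6, replace = TRUE),
                             nrow = 300, dimnames = list(NULL, items)))

head(demo, 10)        # first 10 rows
dim(demo)             # number of responses (rows) and items (columns)
sapply(demo, range)   # each item should stay between 1 and 5

# Count missing values; by default cor() returns NA for any variable that has them
n_missing <- sum(is.na(demo))
n_missing
```

With the real file, replace `demo` with `data` in each call.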

Psych Package

Next, we need to install and load the psych package, which I prefer to use when conducting EFA. In this tutorial, we will make use of the package's fa() function.
  > #install the package
  > install.packages("psych")
  > #load the package
  > library(psych)

Number of Factors

For this tutorial, we will assume that the appropriate number of factors has already been determined to be 2, such as through eigenvalues, scree tests, and a priori considerations. Most often, you will also want to test solutions with one factor more and one factor fewer than the number you settled on, to confirm that the best number of factors was selected.
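As a rough illustration of the eigenvalue and scree checks mentioned above, here is a minimal sketch in base R. It does not use the tutorial dataset: the data frame `sim` is simulated under the assumption of two latent traits with three items each, and all names in it are hypothetical. The psych package also offers helpers such as scree() and fa.parallel() for this step, which are not shown here.

```r
# Simulated stand-in data (NOT the tutorial dataset): two independent latent
# traits, each driving three observed items plus noise
set.seed(42)
n <- 300
science <- rnorm(n)
math <- rnorm(n)
sim <- data.frame(
  BIO  = science + rnorm(n, sd = 0.6),
  GEO  = science + rnorm(n, sd = 0.6),
  CHEM = science + rnorm(n, sd = 0.6),
  ALG  = math + rnorm(n, sd = 0.6),
  CALC = math + rnorm(n, sd = 0.6),
  STAT = math + rnorm(n, sd = 0.6)
)

# Eigenvalues of the correlation matrix, largest first
ev <- eigen(cor(sim))$values
ev

# Kaiser criterion: count the eigenvalues greater than 1
n_kaiser <- sum(ev > 1)
n_kaiser

# Scree plot: look for the point where the curve levels off
plot(ev, type = "b", xlab = "Component number", ylab = "Eigenvalue")
abline(h = 1, lty = 2)
```

These rules of thumb do not always agree with one another, which is one reason to compare neighboring solutions as well.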

Factor Solution

To derive the factor solution, we will use the fa() function from the psych package, which receives the following primary arguments.
  • r: the correlation matrix
  • nfactors: number of factors to be extracted (default = 1)
  • rotate: one of several matrix rotation methods, such as "varimax" or "oblimin"
  • fm: one of several factoring methods, such as "pa" (principal axis) or "ml" (maximum likelihood)
Note that several rotation and factoring methods are available when conducting EFA. Rotation methods can be described as orthogonal, which do not allow the resulting factors to be correlated, or oblique, which do allow the resulting factors to be correlated. Factoring methods can be described as common, which are used when the goal is to better describe data, or component, which are used when the goal is to reduce the amount of data. The fa() function is used for common factoring. For component analysis, see princomp(). The best methods will vary by circumstance, and it is therefore recommended that you seek professional counsel in determining the optimal parameters for your future EFAs.
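The difference between orthogonal and oblique rotation can be stated compactly with the usual common factor model. The notation below is not from the psych documentation or the tutorial files; the symbols are assumptions chosen for illustration, with $x$ the vector of standardized item responses, $\Lambda$ the loading matrix, $f$ the factors, $\varepsilon$ the unique parts, $\Phi$ the factor correlation matrix, and $\Psi$ the diagonal matrix of unique variances.

```latex
\begin{aligned}
x &= \Lambda f + \varepsilon,
  \qquad \operatorname{Cov}(f) = \Phi,
  \qquad \operatorname{Cov}(\varepsilon) = \Psi,
  \qquad \operatorname{Cov}(f, \varepsilon) = 0 \\
\Sigma &= \Lambda \Phi \Lambda^{\top} + \Psi
  && \text{(oblique: off-diagonal entries of } \Phi \text{ are estimated)} \\
\Sigma &= \Lambda \Lambda^{\top} + \Psi
  && \text{(orthogonal: } \Phi = I \text{)}
\end{aligned}
```

Here $\Sigma$ is the model-implied correlation matrix of the items, so an orthogonal rotation is the special case in which the factor correlations are fixed at zero.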
In this tutorial, we will use oblique rotation (rotate = "oblimin"), which recognizes that there is likely to be some correlation between students' latent subject matter preference factors in the real world. We will use principal axis factoring (fm = "pa"), because we are most interested in identifying the underlying constructs in the data.
  > #calculate the correlation matrix
  > corMat <- cor(data)
  > #display the correlation matrix
  > corMat

The correlation matrix
  > #use fa() to conduct an oblique principal-axis exploratory factor analysis
  > #save the solution to an R variable
  > solution <- fa(r = corMat, nfactors = 2, rotate = "oblimin", fm = "pa")
  > #display the solution output
  > solution

Complete solution output
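The printed output is long, so it can be convenient to pull out individual pieces of the solution. The sketch below is an illustration under several assumptions: it uses simulated stand-in data (`sim`, not the tutorial dataset), it only runs the analysis if both psych and GPArotation are installed (to my knowledge, psych relies on GPArotation for oblique rotations), and it passes n.obs so that fa() knows the sample size behind the correlation matrix.

```r
# Simulated stand-in data (NOT the tutorial dataset): two mildly correlated
# latent traits, each driving three observed items plus noise
set.seed(42)
n <- 300
science <- rnorm(n)
math <- 0.3 * science + sqrt(1 - 0.3^2) * rnorm(n)
sim <- data.frame(
  BIO  = science + rnorm(n, sd = 0.6),
  GEO  = science + rnorm(n, sd = 0.6),
  CHEM = science + rnorm(n, sd = 0.6),
  ALG  = math + rnorm(n, sd = 0.6),
  CALC = math + rnorm(n, sd = 0.6),
  STAT = math + rnorm(n, sd = 0.6)
)

# Run the analysis only when the required packages are available
if (requireNamespace("psych", quietly = TRUE) &&
    requireNamespace("GPArotation", quietly = TRUE)) {
  solution_sim <- psych::fa(r = cor(sim), nfactors = 2, n.obs = nrow(sim),
                            rotate = "oblimin", fm = "pa")
  print(solution_sim$loadings, cutoff = 0.3)  # hide loadings below 0.3 in absolute value
  print(solution_sim$Phi)                     # factor correlation matrix
  print(solution_sim$communality)             # h2 for each item
}
```

With the tutorial data, the same elements can be read from `solution` in the same way.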

By looking at our factor loadings, we can begin to assess our factor solution. We can see that BIO, GEO, and CHEM all have high factor loadings around 0.8 on the first factor (PA1). Therefore, we might call this factor Science and consider it representative of a student's interest in science subject matter. Similarly, ALG, CALC, and STAT load highly on the second factor (PA2), which we might call Math. Note that STAT has a much lower loading on PA2 than ALG or CALC and that it has a slight loading on factor PA1. This suggests that statistics is less related to the concept of Math than algebra and calculus. Just below the loadings table, we can see that each factor accounted for around 30% of the variance in responses, leading to a factor solution that accounted for 66% of the total variance in students' subject matter preference. Lastly, notice that our factors are correlated at 0.21 and recall that our choice of oblique rotation allowed for the recognition of this relationship.
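To make the link between loadings, factor correlation, and variance accounted for more concrete, here is a small sketch in base R. The loading values are hypothetical numbers picked to resemble the pattern described above and are not the tutorial's actual output; only the factor correlation of 0.21 comes from the text. For an oblique solution, an item's communality (h2) is the corresponding diagonal entry of the loadings times the factor correlations times the transposed loadings.

```r
# Hypothetical pattern loadings (illustrative values, NOT the real output)
L <- matrix(c(0.80,  0.02,
              0.82, -0.01,
              0.79,  0.03,
              0.01,  0.85,
             -0.02,  0.88,
              0.15,  0.55),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("BIO", "GEO", "CHEM", "ALG", "CALC", "STAT"),
                            c("PA1", "PA2")))

# Factor correlation matrix, using the 0.21 reported in the text
Phi <- matrix(c(1, 0.21,
                0.21, 1), ncol = 2)

# Communality of each item: diagonal of L %*% Phi %*% t(L)
h2 <- diag(L %*% Phi %*% t(L))
round(h2, 2)

# Uniqueness: the share of each standardized item not explained by the factors
u2 <- 1 - h2
round(u2, 2)

# Proportion of total variance accounted for by the whole solution
prop_total <- sum(h2) / nrow(L)
round(prop_total, 2)
```

In this sketch, the lower loading for STAT translates directly into a lower communality than for the other items.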
Of course, there are many other considerations in developing and assessing an EFA that will not be presented here. The intent of this tutorial was simply to demonstrate the basic execution of EFA in R. For a detailed and digestible overview of EFA, I recommend the Factor Analysis chapter of Multivariate Data Analysis by Hair, Black, Babin, and Anderson.

Complete EFA Example

To see a complete example of how EFA data can be organized using the psych package in R, please download the EFA example (.txt) file. For the code used in this tutorial, download the EFA Example (.R) file.

References

Revelle, W. (2011). psych: Procedures for Personality and Psychological Research. http://personality-project.org/r/psych.manual.pdf

8 comments:

  1. Thank you. Just last week I was trying to learn factor analysis for machine learning.

  2. Thank you! This was helpful for getting started with EFA.
    Can you, by chance, provide any assistance on the topic of EFA with panel data?

    Replies
    1. Thanks, Cassie. I provide these tutorials to demonstrate how analyses can be conducted in R. However, I do not provide specific advice on conducting analyses or fundamental instruction on the statistical methods themselves. I recommend that you seek professional statistical assistance with these topics.

    2. Ok. Thank you, again, for providing these!

  3. Thanks a lot. However, do you know how to extract the correlation or distance between observations and factors? In fact, I want to make a 3D plot with all my individuals in my 3-factor space... Thanks

  4. Thank you so much for your tutorial. It was extremely helpful. I was wondering if there was a limit to how many variables could be processed in an EFA? I have a dataset with 770 observations and 25 variables.

    After I read in

    solution <- fa (r=corMat, nfactors=5,rotate="oblimin",fm="pa")

    I receive the following error: "The estimated weights for the factor scores are probably incorrect. Try a different factor extraction method." Could you provide a suggestion for how to proceed? Thank you so much! I very much appreciate it.

  5. If you carry out an oblique rotation using the fa() function, you get an additional column labelled 'com', but I thought the h2 column was the communality?
