By John M Quick

The R Tutorial Series provides a collection of user-friendly tutorials to people who want to learn how to use R for statistical analysis.


R Tutorial Series: Zero-Order Correlations

One of the most common and basic techniques for analyzing the relationships between variables is zero-order correlation, which measures the association between two variables without controlling for any others. This tutorial explores how R can be used to apply this method.

Tutorial Files

Before we start, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains pretest and posttest scores for 66 subjects on a series of reading comprehension tests (Moore & McCabe, 1989). Note that all code samples in this tutorial assume that the data have already been read into an R variable and attached.

Correlation Between Two Variables

The most fundamental way to calculate correlations is to directly operate on two variables. In R, this can be done using the cor() function. The cor() function accepts the following arguments ("Correlation, Variance...", n.d.).
  • x: the first variable to correlate
  • y: the second variable to correlate
  • use (optional): determines how missing values are handled; accepts "everything" (the default), "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs"
  • method (optional): determines the statistical method used; accepts "pearson" (the default), "kendall", or "spearman"
In most cases, x and y are the only arguments that you will use when running the cor() function. The basic format for calculating a correlation is cor(VAR1, VAR2), where VAR1 and VAR2 are the variables that you would like to correlate.
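
The optional arguments can be illustrated with a minimal sketch. The vectors scoreA and scoreB below are made up for this illustration and are not part of the tutorial dataset; scoreB deliberately contains one missing value.

```r
# hypothetical example vectors; scoreB contains one missing value
scoreA <- c(4, 6, 9, 12, 16, 15)
scoreB <- c(5, 9, NA, 8, 10, 9)

# with the default handling of missing values, the result is NA
rDefault <- cor(scoreA, scoreB)

# use = "complete.obs" drops any pair that contains a missing value
rPearson <- cor(scoreA, scoreB, use = "complete.obs")

# method = "spearman" computes a rank-based correlation instead of Pearson's r
rSpearman <- cor(scoreA, scoreB, use = "complete.obs", method = "spearman")

print(c(default = rDefault, pearson = rPearson, spearman = rSpearman))
```

When the data contain no missing values and Pearson's r is what you want, the two optional arguments can be left out entirely.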

cor(VAR1, VAR2) Example

Suppose that our research question is: "How does a subject's pretest 1 score relate to his or her posttest 1 score?" The following example demonstrates how to use the cor() function to calculate the correlation between pretest 1 (PRE1) and posttest 1 (POST1).
  > #use cor(VAR1, VAR2) to calculate the correlation between variable 1 and variable 2
  > cor(PRE1, POST1)
  [1] 0.5659026
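
The example above relies on the dataset having been attached. As a rough sketch of the same call without attach(), the code below builds a small data frame from the first six rows of the dataset as posted in the comments (for illustration only, so the result will not match the full-dataset value above) and refers to the columns explicitly. The names rDollar and rWith are hypothetical.

```r
# first six rows of PRE1 and POST1 from the dataset (illustration only)
datavar <- data.frame(
  PRE1  = c(4, 6, 9, 12, 16, 15),
  POST1 = c(5, 9, 5, 8, 10, 9)
)

# refer to the columns with $ instead of relying on attach()
rDollar <- cor(datavar$PRE1, datavar$POST1)

# with() evaluates the expression using the columns of the data frame
rWith <- with(datavar, cor(PRE1, POST1))

print(rDollar)
```

Both forms compute the same value; which one to use is largely a matter of taste.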

Correlations Between Multiple Variables

When beginning to analyze a dataset, researchers often want to get a complete picture of all correlations, rather than just a single one. Conveniently, the cor() function can also be run on an entire set of data. The format for this operation is cor(DATAVAR), where DATAVAR is the name of the R variable containing the data.

cor(DATAVAR) Example


Note that the behavior of cor(DATAVAR) has changed in recent versions of R. The function no longer accepts datasets that contain non-numeric values. If your dataset contains any, you will receive an error to the effect of "'x' must be numeric" and should ensure that all of your data are in numeric form before using the function.
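
One possible way to avoid this error is to keep only the numeric columns before calling cor(). The following is a minimal sketch rather than the only approach; the small data frame is made up from a handful of dataset rows with the original text group labels, and the names numericColumns and numericData are hypothetical.

```r
# a small illustrative data frame with one text column, mimicking the error case
datavar <- data.frame(
  Group = c("Basal", "Basal", "DRTA", "DRTA", "Strat", "Strat"),
  PRE1  = c(4, 6, 7, 7, 11, 7),
  POST1 = c(5, 9, 7, 5, 11, 4),
  stringsAsFactors = FALSE
)

# identify which columns are numeric
numericColumns <- sapply(datavar, is.numeric)

# keep only the numeric columns before calling cor()
numericData <- datavar[, numericColumns]

print(cor(numericData))
```

This drops the text column from the analysis entirely, which is appropriate only if you do not need that column in the correlation matrix.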

Suppose now that our research question is: "How do all of the test scores in the dataset relate to each other?" The following example demonstrates how to use the cor() function to calculate all of the correlations in a dataset.
  > #use cor(DATAVAR) to get the correlations between all variables
  > cor(datavar)
The output of the preceding function is pictured below.

Complete Correlational Analysis

To see a complete example of how correlational analysis can be conducted in R, please download the correlational analysis example (.txt) file.

References

Correlation, Variance and Covariance (Matrices). (n.d.). Retrieved October 27, 2009, from http://sekhon.berkeley.edu/stats/html/cor.html
Moore, D., & McCabe, G. (1989). Introduction to the practice of statistics [Data file]. Retrieved October 27, 2009, from http://lib.stat.cmu.edu/DASL/Datafiles/ReadingTestScores.html

27 comments:

  1. Hi, I have a question.
    When the dataset contains multiple groups and you want to calculate the correlation between two variables for each group, how do you do that?

    Thank you in advance.

    Patrick

  2. Hi Patrick,

    Thanks for your question. I am not exactly sure what you are asking. Perhaps you could give me a few more details? For example, are you asking how to break a large dataset into subsets for analysis?

    If you can give me some more information, I can help point you in the right direction.

    John
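
For readers with the same question as Patrick, one possible reading is a separate correlation for each level of a grouping column. The following is a minimal sketch under that assumption, using a few rows of the tutorial dataset for illustration; the names groups and groupCors are hypothetical.

```r
# a few rows from two groups of the dataset (illustration only)
datavar <- data.frame(
  Group = c(0, 0, 0, 0, 1, 1, 1, 1),
  PRE1  = c(4, 6, 9, 12, 7, 7, 12, 10),
  POST1 = c(5, 9, 5, 8, 7, 5, 13, 5)
)

# split the data frame into one data frame per group
groups <- split(datavar, datavar$Group)

# calculate the PRE1-POST1 correlation within each group
groupCors <- sapply(groups, function(g) cor(g$PRE1, g$POST1))

print(groupCors)
```

The result is a named vector with one correlation per group, where the names are the group values.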

  3. Dear John,
    I was trying to replicate your example using cor() for calculating all the correlations with R 2.11.0, but I got the following error message:

    > cor(datavar)
    Error in cor(datavar) : 'x' must be numeric

    Any suggestions how to fix it?
    Many thanks in advance,
    Ruben

  4. Hi,

    It sounds like your 'x' variable is not numeric and therefore R is unable to correlate it. Try making your 'x' data numeric following the dummy coding technique demonstrated here: http://rtutorialseries.blogspot.com/2010/02/r-tutorial-series-regression-with.html

  5. Hi John,
    thanks for the suggestion but unfortunately it didn't work for me.
    I tried the following:

    > datavar <- read.csv("dataset_readingTests.csv")
    > attach(datavar)
    > Group<-as.numeric(Group)
    > cor(datavar)
    Error in cor(datavar) : 'x' must be numeric

    The funny thing is that when I checked all the fields with the is.numeric() function, the answers were all TRUE, and I could actually calculate the correlations pairwise.
    Any other suggestions would be greatly appreciated.
    Regards,
    Ruben

  6. Hi Ruben,

    I see what is happening now. You have created a variable named "Group" that contains the numeric version of the Group column from the dataset. However, this does not modify the original Group column in the dataset. So, when you try to run cor() on the datavar, it still sees the original text values for the Group column and cannot form a correlation. Instead, use your new Group variable and the original dataset inside the cor() function. Here is an example:

    > #read in the data
    > datavar <- read.csv("dataset_readingTests.csv")
    > #create a variable containing the numeric version of the Group column
    > numericGroup <- as.numeric(Group)
    > #correlate the numeric Group variable with the original dataset
    > cor(datavar, numericGroup)
                   [,1]
    Subject  0.94291728
    Group            NA
    PRE1    -0.18571906
    PRE2    -0.05915376
    POST1    0.13223726
    POST2    0.43986753
    POST3    0.19983331
    Warning message:
    In cor(datavar, numericGroup) : NAs introduced by coercion

    This will get you a correlation between Group (numeric) and all of the other columns in the dataset. Of course, you will get an NA still on the Group-Group correlation since the original dataset still contains text values.

  7. Hi John,
    thanks for the suggestion. Unfortunately, I'm still having the same problem (I'm using a MacBook Pro with OS X 10.5.8 and R 2.11.0).
    This is what I tried:

    > datavar<-read.csv("dataset_readingTests.csv")
    > numericGroup<-as.numeric(Group)
    Error: object 'Group' not found
    > attach(datavar)
    > numericGroup<-as.numeric(Group)
    > cor(datavar,numericGroup)
    Error in cor(datavar, numericGroup) : 'x' must be numeric
    > cor(numericGroup,datavar)
    Error in cor(numericGroup, datavar) : 'y' must be numeric

    Could it be a problem with the actual file, or with my hardware/software combination?

    Thanks in advance for any suggestions.
    Regards,
    Ruben

  8. Hi John,
    I executed the same commands using the R 2.10.1 for Windows in my virtual machine and everything worked as expected.
    Now that I know it's related to OS X, do you have any ideas how to solve it?
    Many thanks in advance,
    Ruben

  9. Hi Ruben,

    Another reader commented that attach() can cause unexpected console errors, although I have never experienced problems with it up to this point. So, one other thing to try might be typing out the entire column name without using attach(). You could do this:

    > #read in the data
    > datavar <- read.csv("dataset_readingTests.csv")
    > #create a variable containing the numeric version of the Group column
    > numericGroup <- as.numeric(datavar$Group)
    > #correlate the numeric Group variable with the original dataset
    > cor(datavar, numericGroup)

    Otherwise, I'm not sure what to do at this point. I have never encountered the error message that you have posted. For the record, I am using Mac OS X 10.6.3 and R 2.10.0 GUI 1.30 Leopard build 64-bit (5511).

  10. Hi John,
    I tried your suggestion but unfortunately I got the same error message.
    I'm really at a loss as to what is causing the problem so I will try to ask the R community.
    Anyway, thanks a lot for your help and for creating such great tutorials.
    Regards,
    Ruben

  11. Thanks, Ruben.

    Please come back and share the solution once you find it.

  12. Hi John,
    I think I found the problem.
    The error message only appears with R version 2.11.0 for OS X.
    I tried to execute the code with R 2.10.1 for OS X and it worked perfectly.
    I'm going to report the issue in the R user groups.
    Regards,
    Ruben

  13. Hi, I have a question about correlation in R.
    I am trying to compare time-varying correlations between an asset and the S&P 500. I want to find the correlation for each date that I have data for.
    For example, say X = 1, 4, 6, 7, 8, 3, 2, 9, 1, 2, 3, 3 and Y = 5, 2, 3, 4, 4, 8, 3, 5, 9, 10, 3, 4. How can I find the correlation between X and Y for every point, starting from where X = 4 and Y = 2?

    Thanks for the help!
    Kurt

  14. Hi Kurt,

    You can select just the rows that you want to use from your dataset and save them into a new variable. Then you can perform your operations on that dataset. Here is an example.

    > #read data into R
    > dataset <- read.csv("xyData.csv")
    > dataset
    x y
    1 1 5
    2 4 2
    3 6 3
    4 7 4
    5 8 4
    6 3 8
    7 2 3
    8 9 5
    9 1 9
    10 2 10
    11 3 3
    12 3 4
    > #create a second dataset that excludes the first row
    > subset <- dataset[2:12,]
    > subset
    x y
    2 4 2
    3 6 3
    4 7 4
    5 8 4
    6 3 8
    7 2 3
    8 9 5
    9 1 9
    10 2 10
    11 3 3
    12 3 4
    > #calculate a correlation on the new subset
    > cor(subset)
               x          y
    x  1.0000000 -0.3958011
    y -0.3958011  1.0000000

  15. Hello
    I had the same problem as Ruben above, but I solved it by loading the file below.
    As you can see, I replaced:
    Basal with 0
    DRTA with 1
    and Strat with 2
    R did not like having text there.
    If you load the CSV below, it will work.
    -----
    Subject,Group,PRE1,PRE2,POST1,POST2,POST3
    1,0,4,3,5,4,41
    2,0,6,5,9,5,41
    3,0,9,4,5,3,43
    4,0,12,6,8,5,46
    5,0,16,5,10,9,46
    6,0,15,13,9,8,45
    7,0,14,8,12,5,45
    8,0,12,7,5,5,32
    9,0,12,3,8,7,33
    10,0,8,8,7,7,39
    11,0,13,7,12,4,42
    12,0,9,2,4,4,45
    13,0,12,5,4,6,39
    14,0,12,2,8,8,44
    15,0,12,2,6,4,36
    16,0,10,10,9,10,49
    17,0,8,5,3,3,40
    18,0,12,5,5,5,35
    19,0,11,3,4,5,36
    20,0,8,4,2,3,40
    21,0,7,3,5,4,54
    22,0,9,6,7,8,32
    23,1,7,2,7,6,31
    24,1,7,6,5,6,40
    25,1,12,4,13,3,48
    26,1,10,1,5,7,30
    27,1,16,8,14,7,42
    28,1,15,7,14,6,48
    29,1,9,6,10,9,49
    30,1,8,7,13,5,53
    31,1,13,7,12,7,48
    32,1,12,8,11,6,43
    33,1,7,6,8,5,55
    34,1,6,2,7,0,55
    35,1,8,4,10,6,57
    36,1,9,6,8,6,53
    37,1,9,4,8,7,37
    38,1,8,4,10,11,50
    39,1,9,5,12,6,54
    40,1,13,6,10,6,41
    41,1,10,2,11,6,49
    42,1,8,6,7,8,47
    43,1,8,5,8,8,49
    44,1,10,6,12,6,49
    45,2,11,7,11,12,53
    46,2,7,6,4,8,47
    47,2,4,6,4,10,41
    48,2,7,2,4,4,49
    49,2,7,6,3,9,43
    50,2,6,5,8,5,45
    51,2,11,5,12,8,50
    52,2,14,6,14,12,48
    53,2,13,6,12,11,49
    54,2,9,5,7,11,42
    55,2,12,3,5,10,38
    56,2,13,9,9,9,42
    57,2,4,6,1,10,34
    58,2,13,8,13,1,48
    59,2,6,4,7,9,51
    60,2,12,3,5,13,33
    61,2,6,6,7,9,44
    62,2,11,4,11,7,48
    63,2,14,4,15,7,49
    64,2,8,2,9,5,33
    65,2,5,3,6,8,45
    66,2,8,3,4,6,42
    ------

    Luca

  16. Hi Luca,

    Thanks for your contribution.

    John

  17. Hi John,

    Attempting this specific tutorial on R 2.12.0 running on a Windows box, and I get the same error as Ruben above:

    > dataset <- read.csv("dataset_readingTests.csv")
    > cor(dataset)
    Error in cor(dataset) : 'x' must be numeric
    > numericGroup <- as.numeric(dataset$Group)
    > ls()
    [1] "dataset" "numericGroup"
    > cor(dataset,numericGroup)
    Error in cor(dataset, numericGroup) : 'x' must be numeric

    I suppose it would work if I swapped the Group text for numeric codes, but I see the advantage in getting the results you initially achieved. Have there been any clues as to why this is the case, or is there a way to force Group to be numeric somehow?

    Thanks,

    Jon

  18. I am new to R and faithfully following this tutorial. I copied into R one by one the commands from "correlational analysis example (.txt)". Unfortunately, for the command

    > cor(datavar)

    I get the message

    Error in cor(datavar) : 'x' must be numeric

    Any suggestions?

    Thanx

  19. Hi,

    I recommend reading the above posts, as others have also experienced this problem. Ruben reported it as happening on a particular version of R for OSX, while Luca offered a modified CSV file which replaced the text terms with numeric ones.

    John

  20. I have updated the data file to use 0, 1, and 2, rather than Basal, DRTA, and Strat. Hopefully, this will help those of you experiencing the "'x' must be numeric" error.
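
For anyone still working from a copy of the file with the text labels, the same recoding can be sketched in R itself. This is only an illustration on a made-up four-row data frame; the lookup vector groupCodes is a hypothetical name, and the 0, 1, 2 codes follow the mapping described above.

```r
# a small illustrative data frame that still uses the text group labels
datavar <- data.frame(
  Group = c("Basal", "DRTA", "Strat", "Basal"),
  PRE1  = c(4, 7, 11, 6),
  stringsAsFactors = FALSE
)

# map each text label to the numeric code used in the updated file
groupCodes <- c(Basal = 0, DRTA = 1, Strat = 2)

# replace the Group column inside the data frame itself
datavar$Group <- unname(groupCodes[datavar$Group])

print(datavar)
```

Because the column is replaced inside the data frame, rather than copied into a separate variable, a later call to cor(datavar) sees only numeric columns.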

  21. Hi John,
    I have a lot of correlations to run, but I need significance values. How do I get a table with the r-values for the correlations between my variables but also p-values? (The cor.test() function doesn't work for multiple variables like cor() does).

    Thanks,
    Elspeth

  22. Hi Elspeth,

    I have never done what you describe, but I see that a similar question was posted almost ten years ago on the R Help listserv: https://stat.ethz.ch/pipermail/r-help/2001-November/016201.html

    You may want to search around for an answer in the archives/Google or consider pursuing it further on the listserv. This seems like a question that would be valuable for the community to answer.

    For a makeshift solution, perhaps you could use a For loop to run each x and y variable combination through the cor.test() function?

    John
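
The makeshift loop suggested above might look something like the following. This is a rough sketch, not a tested recipe: it uses the first eight rows of three columns from the dataset for illustration, and the names varPairs and results are hypothetical.

```r
# first eight rows of three columns from the dataset (illustration only)
datavar <- data.frame(
  PRE1  = c(4, 6, 9, 12, 16, 15, 14, 12),
  PRE2  = c(3, 5, 4, 6, 5, 13, 8, 7),
  POST1 = c(5, 9, 5, 8, 10, 9, 12, 5)
)

# every unique pair of column names, one pair per matrix column
varPairs <- combn(names(datavar), 2)

# table to hold one row of results per pair
results <- data.frame(
  var1 = varPairs[1, ],
  var2 = varPairs[2, ],
  r = NA_real_,
  p = NA_real_,
  stringsAsFactors = FALSE
)

# run cor.test() on each pair and store the r-value and p-value
for (i in seq_len(ncol(varPairs))) {
  test <- cor.test(datavar[[varPairs[1, i]]], datavar[[varPairs[2, i]]])
  results$r[i] <- unname(test$estimate)
  results$p[i] <- test$p.value
}

print(results)
```

The p-values here are not adjusted for multiple comparisons, which may matter when many pairs are tested.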

  23. P.S. I was also notified that the rcorr function in the Hmisc package may be useful for what you are doing, since it accepts a matrix X and returns a correlation matrix with p-values. I haven't used the function myself.

    John

  24. Hi, I am getting the error message "datavar not found". Do I need to attach a library?

  25. You have to create a variable containing the data. Datavar is a placeholder name for my example, although you could use any name you wish.

    For all tutorials, also be sure to read through the entire tutorial and complete all steps without jumping ahead in the code.

  26. Hi there,

    I want to assign different shapes to 4 groups of data that I'm plotting together. I know how to put in the shapes but not how to assign them to a specific group (e.g., C2 = square, C3 = circle, etc.). Thanks so much for your help.

    Phoebe

  27. Hi Phoebe,

    Check out the Scatterplots tutorial. There is an explanation by a commenter on how to plot markers by group.

    John
