Tutorial Files
Before we start, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains pretest and posttest scores for 66 subjects on a series of reading comprehension tests (Moore & McCabe, 1989). Note that all code samples in this tutorial assume that this data has already been read into an R variable and has been attached.
Correlation Between Two Variables
The most fundamental way to calculate correlations is to directly operate on two variables. In R, this can be done using the cor() function. The cor() function accepts the following arguments ("Correlation, Variance...", n.d.).
- x: the first variable to correlate
- y: the second variable to correlate
- use (optional): determines how missing values are handled; accepts "all.obs", "complete.obs", or "pairwise.complete.obs"
- method (optional): determines the statistical method used; accepts "pearson" (the default), "kendall", or "spearman"
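As a rough illustration of the optional arguments, the sketch below uses two short made-up vectors (not the tutorial's dataset), one of which contains a missing value. The vector names pre and post are assumptions chosen for this example only.

```r
# Illustrative vectors (hypothetical values); one pretest score is missing
pre  <- c(4, 6, 9, 12, 16, 15, NA, 12)
post <- c(5, 9, 5, 8, 10, 9, 12, 5)

# Without a 'use' argument, current R versions propagate the missing value
r_default <- cor(pre, post)

# Drop incomplete pairs, then compute the Pearson correlation
r_pearson <- cor(pre, post, use = "complete.obs", method = "pearson")

# Rank-based alternatives on the same complete pairs
r_spearman <- cor(pre, post, use = "complete.obs", method = "spearman")
r_kendall  <- cor(pre, post, use = "complete.obs", method = "kendall")

# Show the three coefficients side by side
c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall)
```

The use argument only matters when the data contain missing values; with complete data, all of its settings should give the same result.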
cor(VAR1, VAR2) Example
Suppose that our research question is: "How does a subject's pretest 1 score relate to his or her posttest 1 score?" The following example demonstrates how to use the cor() function to calculate the correlation between pretest 1 (PRE1) and posttest 1 (POST1).
- > #use cor(VAR1, VAR2) to calculate the correlation between variable 1 and variable 2
- > cor(PRE1, POST1)
- [1] 0.5659026
Correlations Between Multiple Variables
When beginning to analyze a dataset, researchers often want to get a complete picture of all correlations, rather than just a single one. Conveniently, the cor() function can also be run on an entire set of data. The format for this operation is cor(DATAVAR), where DATAVAR is the name of the R variable containing the data.
cor(DATAVAR) Example
Note that the underlying code for the cor(datavar) function has changed in recent versions of R. The function can no longer accept datasets that contain non-numeric values. If your dataset contains such values, you will receive an error to the effect of "x must be numeric," and you should ensure that all of your data are in numeric form before using the function.
Suppose now that our research question is: "How do all of the test scores in the dataset relate to each other?" The following example demonstrates how to use the cor() function to calculate all of the correlations in a dataset.
- > #use cor(DATAVAR) to get the correlations between all variables
- > cor(datavar)
The output of the preceding function is a correlation matrix with one row and one column for each variable in the dataset.
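If a dataset still contains a text column, one possible way to avoid the "x must be numeric" error is to pass only the numeric columns to cor(). The following is a minimal sketch under that assumption; the small data frame and the names numeric_cols and cor_matrix are made up for illustration and are not part of the tutorial's files.

```r
# Small illustrative data frame with one text column (hypothetical values)
datavar <- data.frame(
  Group = c("Basal", "Basal", "DRTA", "DRTA", "Strat", "Strat"),
  PRE1  = c(4, 6, 7, 7, 11, 7),
  POST1 = c(5, 9, 7, 5, 11, 4)
)

# Flag which columns are numeric (Group is not)
numeric_cols <- vapply(datavar, is.numeric, logical(1))

# Correlate only the numeric columns; drop = FALSE keeps a data frame
# even if just one numeric column remains
cor_matrix <- cor(datavar[, numeric_cols, drop = FALSE])
cor_matrix
```

This leaves the original data frame unchanged, so the text column remains available for other analyses.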
Complete Correlational Analysis
To see a complete example of how correlational analysis can be conducted in R, please download the correlational analysis example (.txt) file.
References
Correlation, Variance and Covariance (Matrices). (n.d.). Retrieved October 27, 2009 from http://sekhon.berkeley.edu/stats/html/cor.html
Moore, D., and McCabe, G. (1989). Introduction to the practice of statistics [Data File]. Retrieved October 27, 2009 from http://lib.stat.cmu.edu/DASL/Datafiles/ReadingTestScores.html
Hi, I have a question.
When the dataset contains multiple groups and you want to calculate the correlation between two variables for each group, how do you do that?
Thank you in advance.
Patrick
Hi Patrick,
Thanks for your question. I am not exactly sure what you are asking. Perhaps you could give me a few more details? For example, are you asking how to break a large dataset into subsets for analysis?
If you can give me some more information, I can help point you in the right direction.
John
Dear John,
I was trying to replicate your example using cor() for calculating all the correlations with R 2.11.0, but I got the following error message:
> cor(datavar)
Error in cor(datavar) : 'x' must be numeric
Any suggestions how to fix it?
Many thanks in advance,
Ruben
Hi,
It sounds like your 'x' variable is not numeric and therefore R is unable to correlate it. Try making your 'x' data numeric following the dummy coding technique demonstrated here: http://rtutorialseries.blogspot.com/2010/02/r-tutorial-series-regression-with.html
Hi John,
Thanks for the suggestion, but unfortunately it didn't work for me.
I tried the following:
> datavar <- read.csv("dataset_readingTests.csv")
> attach(datavar)
> Group<-as.numeric(Group)
> cor(datavar)
Error in cor(datavar) : 'x' must be numeric
The funny thing is that when I checked all the fields with the is.numeric() function, the answers were all TRUE, and I could actually calculate correlations pairwise.
Any other suggestions would be greatly appreciated.
Regards,
Ruben
Hi Ruben,
I see what is happening now. You have created a variable named "Group" that contains the numeric version of the Group column from the dataset. However, this does not modify the original Group column in the dataset. So, when you try to run cor() on the datavar, it still sees the original text values for the Group column and cannot form a correlation. Instead, use your new Group variable and the original dataset inside the cor() function. Here is an example:
> #read in the data
> datavar <- read.csv("dataset_readingTests.csv")
> #create a variable containing the numeric version of the Group column
> numericGroup <- as.numeric(Group)
> #correlate the numeric Group variable with the original dataset
> cor(datavar, numericGroup)
[,1]
Subject 0.94291728
Group NA
PRE1 -0.18571906
PRE2 -0.05915376
POST1 0.13223726
POST2 0.43986753
POST3 0.19983331
Warning message:
In cor(datavar, numericGroup) : NAs introduced by coercion
This will get you a correlation between Group (numeric) and all of the other columns in the dataset. Of course, you will get an NA still on the Group-Group correlation since the original dataset still contains text values.
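A minimal sketch of one alternative, assuming a small illustrative data frame rather than the full dataset: overwrite the Group column inside the data frame itself, so that cor() sees only numeric columns. The values below are made up for illustration, and converting through factor() assigns the codes 1, 2, and 3 in alphabetical order of the group labels.

```r
# Illustrative data frame with a text Group column (hypothetical values)
datavar <- data.frame(
  Group = c("Basal", "Basal", "Basal", "DRTA", "DRTA", "DRTA",
            "Strat", "Strat", "Strat"),
  PRE1  = c(4, 6, 9, 7, 7, 12, 11, 7, 4),
  POST1 = c(5, 9, 5, 7, 5, 13, 11, 4, 4)
)

# Replace the text column in place with numeric codes
# (Basal = 1, DRTA = 2, Strat = 3)
datavar$Group <- as.numeric(factor(datavar$Group))

# Every column is now numeric, so the full correlation matrix can be computed
group_cor <- cor(datavar)
group_cor
```

Because the conversion happens inside the data frame, the Group-Group entry is 1 rather than NA.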
Hi John,
Thanks for the suggestion. Unfortunately, I'm still having the same problem (I'm using a MacBook Pro with OS X 10.5.8 and R 2.11.0).
This is what I tried:
> datavar<-read.csv("dataset_readingTests.csv")
> numericGroup<-as.numeric(Group)
Error: object 'Group' not found
> attach(datavar)
> numericGroup<-as.numeric(Group)
> cor(datavar,numericGroup)
Error in cor(datavar, numericGroup) : 'x' must be numeric
> cor(numericGroup,datavar)
Error in cor(numericGroup, datavar) : 'y' must be numeric
Could it be a problem with the actual file? Or with my hardware/software combination?
Thanks in advance for any suggestions.
Regards,
Ruben
Hi John,
I executed the same commands using R 2.10.1 for Windows in my virtual machine and everything worked as expected.
Now that I know it's related to OS X, do you have any ideas how to solve it?
Many thanks in advance,
Ruben
Hi Ruben,
Another reader commented that attach() can cause unexpected console errors, although I have never experienced problems with it up to this point. So, one other thing to try might be typing out the entire column name without using attach(). You could do this:
> #read in the data
> datavar <- read.csv("dataset_readingTests.csv")
> #create a variable containing the numeric version of the Group column
> numericGroup <- as.numeric(datavar$Group)
> #correlate the numeric Group variable with the original dataset
> cor(datavar, numericGroup)
Otherwise, I'm not sure what to do at this point. I have never encountered the error message that you have posted. For the record, I am using Mac OS X 10.6.3 and R 2.10.0 GUI 1.30 Leopard build 64-bit (5511).
Hi John,
I tried your suggestion, but unfortunately I got the same error message.
I'm really at a loss as to what is causing the problem so I will try to ask the R community.
Anyway, thanks a lot for your help and for creating such great tutorials.
Regards,
Ruben
Thanks, Ruben.
Please come back and share the solution once you find it.
Hi John,
I think I found the problem.
The error message only appears with R version 2.11.0 for OS X.
I tried to execute the code with R 2.10.1 for OS X and it worked perfectly.
I'm going to report the issue in the R user groups.
Regards,
Ruben
Hi, I have a question about correlation in R.
I am trying to compare time-varying correlations between an asset and the S&P 500. I want to find the correlation for each date that I have data for.
For example, say X = 1,4,6,7,8,3,2,9,1,2,3,3 and Y = 5,2,3,4,4,8,3,5,9,10,3,4. How can I find the correlation between X and Y for every point, starting with when X = 4 and Y = 2?
Thanks for the help!
Kurt
Hi Kurt,
You can select just the rows that you want to use from your dataset and save them into a new variable. Then you can perform your operations on that dataset. Here is an example.
> #read data into R
> dataset <- read.csv("xyData.csv")
> dataset
x y
1 1 5
2 4 2
3 6 3
4 7 4
5 8 4
6 3 8
7 2 3
8 9 5
9 1 9
10 2 10
11 3 3
12 3 4
> #create a second dataset that excludes the first row
> subset <- dataset[2:12,]
> subset
x y
2 4 2
3 6 3
4 7 4
5 8 4
6 3 8
7 2 3
8 9 5
9 1 9
10 2 10
11 3 3
12 3 4
> #calculate a correlation on the new subset
> cor(subset)
x y
x 1.0000000 -0.3958011
y -0.3958011 1.0000000
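If the goal is a correlation for every starting point rather than a single subset, one possible reading of the question can be sketched by repeating the same subsetting step in a loop. The sketch below types in the X and Y values from the question directly; the names starts and cor_from_start are made up for illustration, and the last two starting points are skipped on the assumption that at least three pairs should remain in each calculation.

```r
# X and Y values from the question above
x <- c(1, 4, 6, 7, 8, 3, 2, 9, 1, 2, 3, 3)
y <- c(5, 2, 3, 4, 4, 8, 3, 5, 9, 10, 3, 4)
n <- length(x)

# Starting rows 2 through n - 2, so each calculation keeps at least 3 pairs
starts <- 2:(n - 2)

# For each starting row i, correlate rows i through n
cor_from_start <- sapply(starts, function(i) cor(x[i:n], y[i:n]))
names(cor_from_start) <- starts
cor_from_start
```

The first element, which starts at row 2, corresponds to the subset used in the example above.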
Hello
I had the same problem as Ruben above, but I solved it by using the modified file below.
As you can see, I replaced:
Basal with 0
DRTA with 1
and Strat with 2
R did not like having text there.
If you load the CSV below, it will work.
-----
Subject,Group,PRE1,PRE2,POST1,POST2,POST3
1,0,4,3,5,4,41
2,0,6,5,9,5,41
3,0,9,4,5,3,43
4,0,12,6,8,5,46
5,0,16,5,10,9,46
6,0,15,13,9,8,45
7,0,14,8,12,5,45
8,0,12,7,5,5,32
9,0,12,3,8,7,33
10,0,8,8,7,7,39
11,0,13,7,12,4,42
12,0,9,2,4,4,45
13,0,12,5,4,6,39
14,0,12,2,8,8,44
15,0,12,2,6,4,36
16,0,10,10,9,10,49
17,0,8,5,3,3,40
18,0,12,5,5,5,35
19,0,11,3,4,5,36
20,0,8,4,2,3,40
21,0,7,3,5,4,54
22,0,9,6,7,8,32
23,1,7,2,7,6,31
24,1,7,6,5,6,40
25,1,12,4,13,3,48
26,1,10,1,5,7,30
27,1,16,8,14,7,42
28,1,15,7,14,6,48
29,1,9,6,10,9,49
30,1,8,7,13,5,53
31,1,13,7,12,7,48
32,1,12,8,11,6,43
33,1,7,6,8,5,55
34,1,6,2,7,0,55
35,1,8,4,10,6,57
36,1,9,6,8,6,53
37,1,9,4,8,7,37
38,1,8,4,10,11,50
39,1,9,5,12,6,54
40,1,13,6,10,6,41
41,1,10,2,11,6,49
42,1,8,6,7,8,47
43,1,8,5,8,8,49
44,1,10,6,12,6,49
45,2,11,7,11,12,53
46,2,7,6,4,8,47
47,2,4,6,4,10,41
48,2,7,2,4,4,49
49,2,7,6,3,9,43
50,2,6,5,8,5,45
51,2,11,5,12,8,50
52,2,14,6,14,12,48
53,2,13,6,12,11,49
54,2,9,5,7,11,42
55,2,12,3,5,10,38
56,2,13,9,9,9,42
57,2,4,6,1,10,34
58,2,13,8,13,1,48
59,2,6,4,7,9,51
60,2,12,3,5,13,33
61,2,6,6,7,9,44
62,2,11,4,11,7,48
63,2,14,4,15,7,49
64,2,8,2,9,5,33
65,2,5,3,6,8,45
66,2,8,3,4,6,42
------
Luca
Hi Luca,
Thanks for your contribution.
John
Hi John,
I am attempting this specific tutorial on R 2.12.0 running on a Windows box, and I get the same error as Ruben above:
> dataset <- read.csv("dataset_readingTests.csv")
> cor(dataset)
Error in cor(dataset) : 'x' must be numeric
> numericGroup <- as.numeric(dataset$Group)
> ls()
[1] "dataset" "numericGroup"
> cor(dataset,numericGroup)
Error in cor(dataset, numericGroup) : 'x' must be numeric
I suppose it would work if I replaced the Group text with numeric values, but I see the advantage in getting the results you initially achieved. Have there been any clues as to why this is the case, or is there a way to force Group to be defined as numeric somehow?
Thanks,
Jon
I am new to R and faithfully following this tutorial. I copied into R one by one the commands from "correlational analysis example (.txt)". Unfortunately, for the command
> cor(datavar)
I get the message
Error in cor(datavar) : 'x' must be numeric
Any suggestions?
Thanx
Hi,
I recommend reading the above posts, as others have also experienced this problem. Ruben reported it as happening on a particular version of R for OS X, while Luca offered a modified CSV file which replaced the text terms with numeric ones.
John
I have updated the data file to use 0, 1, and 2, rather than Basal, DRTA, and Strat. Hopefully, this will help those of you experiencing the x must be numeric error.
Hi John,
I have a lot of correlations to run, but I need significance values. How do I get a table with the r-values for the correlations between my variables but also p-values? (The cor.test() function doesn't work for multiple variables like cor() does.)
Thanks,
Elspeth
Hi Elspeth,
I have never done what you describe, but I see that a similar question was posted almost ten years ago on the R Help listserv: https://stat.ethz.ch/pipermail/r-help/2001-November/016201.html
You may want to search around for an answer in the archives/Google or consider pursuing it further on the listserv. This seems like a question that would be valuable for the community to answer.
For a makeshift solution, perhaps you could use a for loop to run each x and y variable combination through the cor.test() function?
John
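The for loop idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the small scores data frame holds made-up values, and the names r_values and p_values are hypothetical. It relies only on cor.test() from base R, whose result includes the estimate and p.value components.

```r
# Illustrative numeric data frame (hypothetical values)
scores <- data.frame(
  PRE1  = c(4, 6, 9, 7, 7, 12, 11, 7, 4),
  PRE2  = c(3, 5, 4, 2, 6, 4, 7, 6, 6),
  POST1 = c(5, 9, 5, 7, 5, 13, 11, 4, 4)
)

vars <- names(scores)
k <- length(vars)

# Empty matrices to hold the r-values and p-values
r_values <- matrix(NA_real_, k, k, dimnames = list(vars, vars))
p_values <- matrix(NA_real_, k, k, dimnames = list(vars, vars))

# Run cor.test() on every pair of distinct variables
for (i in seq_len(k)) {
  for (j in seq_len(k)) {
    if (i != j) {
      result <- cor.test(scores[[i]], scores[[j]])
      r_values[i, j] <- result$estimate
      p_values[i, j] <- result$p.value
    }
  }
}

# A variable correlates perfectly with itself; its p-value is left as NA
diag(r_values) <- 1

r_values
p_values
```

With many variables, the number of tests grows quickly, so the p-values may need a multiple-comparison adjustment such as p.adjust().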
P.S. I was also notified that the rcorr function in the Hmisc package may be useful for what you are doing, since it accepts a matrix X and returns a correlation matrix with p-values. I haven't used the function myself.
John
Hi, I am getting an error message: datavar not found. Do I need to attach a library?
You have to create a variable containing the data. Datavar is a placeholder name for my example, although you could use any name you wish.
For all tutorials, also be sure to read through the entire tutorial and complete all steps without jumping ahead in the code.
Hi there,
I want to assign different shapes to 4 groups of data that I'm plotting together. I know how to put in the shapes but not how to assign them to a specific group (e.g., C2 = square, C3 = circle, etc.). Thanks so much for your help.
Phoebe
Hi Phoebe,
Check out the Scatterplots tutorial. There is an explanation by a commenter on how to plot markers by group.
John