R Tutorial Series: Zero-Order Correlations

One of the most common and basic techniques for analyzing the relationships between variables is zero-order correlation. This tutorial will explore the ways in which R can be used to employ this method.

Tutorial Files

Before we start, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains pretest and posttest scores for 66 subjects on a series of reading comprehension tests (Moore & McCabe, 1989). Note that all code samples in this tutorial assume that these data have already been read into an R variable and attached.

Correlation Between Two Variables

The most fundamental way to calculate correlations is to directly operate on two variables. In R, this can be done using the cor() function. The cor() function accepts the following arguments ("Correlation, Variance...", n.d.).

x: the first variable to correlate
y: the second variable to correlate
use (optional): determines how missing values are handled; accepts "everything" (the default), "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs"
method (optional): determines the statistical method used; accepts "pearson" (the default), "kendall", or "spearman"

In most cases, x and y are the only arguments that you will use when running the cor() function. The basic format for calculating a correlation is cor(VAR1, VAR2), where VAR1 and VAR2 are the variables that you would like to correlate.
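The optional arguments matter mainly when the data contain missing values or when a rank-based method is preferred. The following is a minimal sketch of how use and method change the call. The vectors pre and post are hypothetical values invented for illustration and are not part of the tutorial dataset.

```r
# Hypothetical scores, invented for illustration; not taken from the tutorial dataset
pre  <- c(4, 6, 9, NA, 16, 15)
post <- c(5, 9, 5, 8, 10, NA)

# With missing values and the default use = "everything", the result is NA
r_default <- cor(pre, post)

# Drop incomplete pairs before correlating
r_complete <- cor(pre, post, use = "complete.obs")

# Rank-based alternative to the default Pearson method
r_spearman <- cor(pre, post, use = "complete.obs", method = "spearman")

print(c(default = r_default, complete = r_complete, spearman = r_spearman))
```

If your data have no missing values, the use argument can be left out entirely.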

cor(VAR1, VAR2) Example

Suppose that our research question is: "How does a subject's pretest 1 score relate to his or her posttest 1 score?" The following example demonstrates how to use the cor() function to calculate the correlation between pretest 1 (PRE1) and posttest 1 (POST1).

> #use cor(VAR1, VAR2) to calculate the correlation between variable 1 and variable 2
> cor(PRE1, POST1)
[1] 0.5659026
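To see what cor() computes with its default Pearson method, the sketch below reproduces the calculation by hand: the covariance of the two variables divided by the product of their standard deviations. It uses only five PRE1 and POST1 pairs as a small illustration, so its result will not match the full-sample value shown above. The names pre1 and post1 are placeholders.

```r
# Five PRE1 and POST1 pairs, used only as a small illustration
pre1  <- c(4, 6, 9, 12, 16)
post1 <- c(5, 9, 5, 8, 10)

# Pearson's r is the covariance divided by the product of the standard deviations
r_manual  <- cov(pre1, post1) / (sd(pre1) * sd(post1))
r_builtin <- cor(pre1, post1)

print(c(manual = r_manual, builtin = r_builtin))
```

The two numbers should agree up to floating-point rounding.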

Correlations Between Multiple Variables

When beginning to analyze a dataset, researchers often want to get a complete picture of all correlations, rather than just a single one. Conveniently, the cor() function can also be run on an entire set of data. The format for this operation is cor(DATAVAR), where DATAVAR is the name of the R variable containing the data.

cor(DATAVAR) Example

Note that the behavior of cor() changed in more recent versions of R. The function no longer accepts datasets that contain non-numeric values. If your data include such values, you will receive an error to the effect of "'x' must be numeric," and you should ensure that all of your data are in numeric form before calling the function.
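One way to avoid this error, sketched here under the assumption that you simply want to leave non-numeric columns out of the analysis, is to select the numeric columns before calling cor(). The small data frame below is a hypothetical stand-in for the tutorial data with a text Group column.

```r
# Hypothetical stand-in for the tutorial data, with one text column
datavar <- data.frame(
  Group = c("Basal", "Basal", "DRTA", "DRTA", "Strat", "Strat"),
  PRE1  = c(4, 6, 7, 7, 11, 7),
  POST1 = c(5, 9, 7, 5, 11, 4)
)

# Identify the numeric columns, then correlate only those
numeric_cols <- sapply(datavar, is.numeric)
cor_matrix <- cor(datavar[, numeric_cols])
print(cor_matrix)
```

This drops Group from the matrix rather than converting it, which may or may not be what your research question calls for.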

Suppose now that our research question is: "How do all of the test scores in the dataset relate to each other?" The following example demonstrates how to use the cor() function to calculate all of the correlations in a dataset.

> #use cor(DATAVAR) to get the correlations between all variables
> cor(datavar)

The output of the preceding function is pictured below.

Complete Correlational Analysis

To see a complete example of how correlational analysis can be conducted in R, please download the correlational analysis example (.txt) file.

References

Correlation, Variance and Covariance (Matrices). (n.d.). Retrieved October 27, 2009, from http://sekhon.berkeley.edu/stats/html/cor.html
Moore, D., & McCabe, G. (1989). Introduction to the practice of statistics [Data file]. Retrieved October 27, 2009, from http://lib.stat.cmu.edu/DASL/Datafiles/ReadingTestScores.html

27 comments:

Unknown, April 13, 2010 at 7:50 PM
Hi, I have a question.
When the dataset contains multiple groups and you want to calculate the correlation between two variables for each group, how do you do that?

Thank you in advance.

Patrick

John, April 14, 2010 at 3:59 PM
Hi Patrick,

Thanks for your question. I am not exactly sure what you are asking. Perhaps you could give me a few more details? For example, are you asking how to break a large dataset into subsets for analysis?

If you can give me some more information, I can help point you in the right direction.

John
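
If the question is how to get one correlation per group, a possible approach is to split the data by the grouping column and apply cor() to each piece. This is only a sketch under that reading of the question. The data frame below is a hypothetical stand-in with a Group column and two score columns.

```r
# Hypothetical stand-in data with a grouping column
datavar <- data.frame(
  Group = rep(c("Basal", "DRTA", "Strat"), each = 4),
  PRE1  = c(4, 6, 9, 12, 7, 7, 12, 10, 11, 7, 4, 7),
  POST1 = c(5, 9, 5, 8, 7, 5, 13, 5, 11, 4, 4, 4)
)

# Split the rows by group, then correlate PRE1 and POST1 within each piece
by_group <- sapply(
  split(datavar, datavar$Group),
  function(g) cor(g$PRE1, g$POST1)
)
print(by_group)
```

The result is a named vector with one correlation per group.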

Ruben, May 1, 2010 at 3:30 AM
Dear John,
I was trying to replicate your example using cor() for calculating all the correlations with R 2.11.0, but I got the following error message:

> cor(datavar)
Error in cor(datavar) : 'x' must be numeric

Any suggestions how to fix it?
Many thanks in advance,
Ruben

John, May 1, 2010 at 8:29 AM
Hi,

It sounds like your 'x' variable is not numeric and therefore R is unable to correlate it. Try making your 'x' data numeric following the dummy coding technique demonstrated here: http://rtutorialseries.blogspot.com/2010/02/r-tutorial-series-regression-with.html

Ruben, May 1, 2010 at 7:43 PM
Hi John,
thanks for the suggestion but unfortunately it didn't work for me.
I tried the following:

> datavar <- read.csv("dataset_readingTests.csv")
> attach(datavar)
> Group<-as.numeric(Group)
> cor(datavar)
Error in cor(datavar) : 'x' must be numeric

The funny thing is that when I checked all the fields with the is.numeric() function, the answers were all TRUE, and I could actually calculate correlations pairwise.
Any other suggestions would be greatly appreciated.
Regards,
Ruben

John, May 1, 2010 at 8:06 PM
Hi Ruben,

I see what is happening now. You have created a variable named "Group" that contains the numeric version of the Group column from the dataset. However, this does not modify the original Group column in the dataset. So, when you try to run cor() on the datavar, it still sees the original text values for the Group column and cannot form a correlation. Instead, use your new Group variable and the original dataset inside the cor() function. Here is an example:

> #read in the data
> datavar <- read.csv("dataset_readingTests.csv")
> #create a variable containing the numeric version of the Group column
> numericGroup <- as.numeric(Group)
> #correlate the numeric Group variable with the original dataset
> cor(datavar, numericGroup)
               [,1]
Subject  0.94291728
Group            NA
PRE1    -0.18571906
PRE2    -0.05915376
POST1    0.13223726
POST2    0.43986753
POST3    0.19983331
Warning message:
In cor(datavar, numericGroup) : NAs introduced by coercion

This will get you a correlation between Group (numeric) and all of the other columns in the dataset. Of course, you will get an NA still on the Group-Group correlation since the original dataset still contains text values.
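
An alternative, sketched here on the assumption that Group is stored as text, is to overwrite the column inside the data frame itself so that cor() sees only numeric values. The small data frame is a hypothetical stand-in for the tutorial data.

```r
# Hypothetical stand-in for the tutorial data
datavar <- data.frame(
  Group = c("Basal", "Basal", "DRTA", "DRTA", "Strat", "Strat"),
  PRE1  = c(4, 6, 7, 7, 11, 7),
  POST1 = c(5, 9, 7, 5, 11, 4),
  stringsAsFactors = FALSE
)

# Overwrite the column inside the data frame so cor() sees numeric codes
datavar$Group <- as.numeric(factor(datavar$Group))

cor_matrix <- cor(datavar)
print(cor_matrix)
```

Keep in mind that the numeric codes assigned to unordered groups are arbitrary, so correlations involving Group should be interpreted with caution.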

Ruben, May 1, 2010 at 9:31 PM
Hi John,
thanks for the suggestion. Unfortunately, I'm still having the same problem (I'm using a MacBook Pro with OS X 10.5.8 and R 2.11.0).
This is what I tried:

> datavar<-read.csv("dataset_readingTests.csv")
> numericGroup<-as.numeric(Group)
Error: object 'Group' not found
> attach(datavar)
> numericGroup<-as.numeric(Group)
> cor(datavar,numericGroup)
Error in cor(datavar, numericGroup) : 'x' must be numeric
> cor(numericGroup,datavar)
Error in cor(numericGroup, datavar) : 'y' must be numeric

Could it be a problem with the actual file, or with my hardware/software combination?

Thanks in advance for any suggestions.
Regards,
Ruben

Ruben, May 1, 2010 at 9:52 PM
Hi John,
I executed the same commands using the R 2.10.1 for Windows in my virtual machine and everything worked as expected.
Now that I know it's related to OS X, do you have any ideas how to solve it?
Many thanks in advance,
Ruben

John, May 2, 2010 at 11:09 AM
Hi Ruben,

Another reader commented that attach() can cause unexpected console errors, although I have never experienced problems with it up to this point. So, one other thing to try might be typing out the entire column name without using attach(). You could do this:

> #read in the data
> datavar <- read.csv("dataset_readingTests.csv")
> #create a variable containing the numeric version of the Group column
> numericGroup <- as.numeric(datavar$Group)
> #correlate the numeric Group variable with the original dataset
> cor(datavar, numericGroup)

Otherwise, I'm not sure what to do at this point. I have never encountered the error message that you have posted. For the record, I am using Mac OS X 10.6.3 and R 2.10.0 GUI 1.30 Leopard build 64-bit (5511).

Ruben, May 5, 2010 at 2:48 AM
Hi John,
I tried your suggestion but unfortunately I got the same error message.
I'm really at a loss as to what is causing the problem so I will try to ask the R community.
Anyway, thanks a lot for your help and for creating such great tutorials.
Regards,
Ruben

John, May 5, 2010 at 7:58 AM
Thanks, Ruben.

Please come back and share the solution once you find it.

Ruben, May 7, 2010 at 9:41 PM
Hi John,
I think I found the problem.
The error message only appears with R version 2.11.0 for OS X.
I tried to execute the code with R 2.10.1 for OS X and it worked perfectly.
I'm going to report the issue in the R user groups.
Regards,
Ruben

Anonymous, June 19, 2010 at 5:25 AM
Hi, I have a question about correlation in R.
I am trying to compare time varying correlations between an asset and the S&P 500. I want to find the correlation for each date that I have data for.
For example, say X = 1, 4, 6, 7, 8, 3, 2, 9, 1, 2, 3, 3 and Y = 5, 2, 3, 4, 4, 8, 3, 5, 9, 10, 3, 4. How can I find the correlation between X and Y for every point, starting from X = 4 and Y = 2?

Thanks for the help!
Kurt

John, June 19, 2010 at 8:53 AM
Hi Kurt,

You can select just the rows that you want to use from your dataset and save them into a new variable. Then you can perform your operations on that dataset. Here is an example.

> #read data into R
> dataset <- read.csv("xyData.csv")
> dataset
   x  y
1  1  5
2  4  2
3  6  3
4  7  4
5  8  4
6  3  8
7  2  3
8  9  5
9  1  9
10 2 10
11 3  3
12 3  4
> #create a second dataset that excludes the first row
> subset <- dataset[2:12,]
> subset
   x  y
2  4  2
3  6  3
4  7  4
5  8  4
6  3  8
7  2  3
8  9  5
9  1  9
10 2 10
11 3  3
12 3  4
> #calculate a correlation on the new subset
> cor(subset)
           x          y
x  1.0000000 -0.3958011
y -0.3958011  1.0000000
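
If "time-varying" means a separate correlation for each date, one possible reading is a rolling window: for each row, correlate only the most recent few observations. The sketch below uses the X and Y values from the question. The window length of 5 is a hypothetical choice made only for illustration.

```r
# X and Y values from the question above
x <- c(1, 4, 6, 7, 8, 3, 2, 9, 1, 2, 3, 3)
y <- c(5, 2, 3, 4, 4, 8, 3, 5, 9, 10, 3, 4)

window <- 5  # hypothetical window length, chosen only for illustration

# For each end point i, correlate the most recent `window` observations
ends <- window:length(x)
rolling_r <- sapply(ends, function(i) {
  rows <- (i - window + 1):i
  cor(x[rows], y[rows])
})

print(data.frame(end_row = ends, r = rolling_r))
```

A shorter window reacts faster to change but gives noisier estimates, so the choice depends on the data.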

Anonymous, July 28, 2010 at 5:55 AM
Hello
I had the same problem as Ruben above but I solved it by uploading this file:
as you can see I replaced:
Basal with 0
DRTA with 1
and Strat with 2
it did not like to have text there
if you upload the csv below it will work
-----
Subject,Group,PRE1,PRE2,POST1,POST2,POST3
1,0,4,3,5,4,41
2,0,6,5,9,5,41
3,0,9,4,5,3,43
4,0,12,6,8,5,46
5,0,16,5,10,9,46
6,0,15,13,9,8,45
7,0,14,8,12,5,45
8,0,12,7,5,5,32
9,0,12,3,8,7,33
10,0,8,8,7,7,39
11,0,13,7,12,4,42
12,0,9,2,4,4,45
13,0,12,5,4,6,39
14,0,12,2,8,8,44
15,0,12,2,6,4,36
16,0,10,10,9,10,49
17,0,8,5,3,3,40
18,0,12,5,5,5,35
19,0,11,3,4,5,36
20,0,8,4,2,3,40
21,0,7,3,5,4,54
22,0,9,6,7,8,32
23,1,7,2,7,6,31
24,1,7,6,5,6,40
25,1,12,4,13,3,48
26,1,10,1,5,7,30
27,1,16,8,14,7,42
28,1,15,7,14,6,48
29,1,9,6,10,9,49
30,1,8,7,13,5,53
31,1,13,7,12,7,48
32,1,12,8,11,6,43
33,1,7,6,8,5,55
34,1,6,2,7,0,55
35,1,8,4,10,6,57
36,1,9,6,8,6,53
37,1,9,4,8,7,37
38,1,8,4,10,11,50
39,1,9,5,12,6,54
40,1,13,6,10,6,41
41,1,10,2,11,6,49
42,1,8,6,7,8,47
43,1,8,5,8,8,49
44,1,10,6,12,6,49
45,2,11,7,11,12,53
46,2,7,6,4,8,47
47,2,4,6,4,10,41
48,2,7,2,4,4,49
49,2,7,6,3,9,43
50,2,6,5,8,5,45
51,2,11,5,12,8,50
52,2,14,6,14,12,48
53,2,13,6,12,11,49
54,2,9,5,7,11,42
55,2,12,3,5,10,38
56,2,13,9,9,9,42
57,2,4,6,1,10,34
58,2,13,8,13,1,48
59,2,6,4,7,9,51
60,2,12,3,5,13,33
61,2,6,6,7,9,44
62,2,11,4,11,7,48
63,2,14,4,15,7,49
64,2,8,2,9,5,33
65,2,5,3,6,8,45
66,2,8,3,4,6,42
------

Luca

John, July 28, 2010 at 9:30 PM
Hi Luca,

Thanks for your contribution.

John

Jon, November 18, 2010 at 9:26 PM
Hi John,

Attempting this specific tutorial on R 2.12.0 running on a Windows box, and I get the same error as Ruben above:

> dataset <- read.csv("dataset_readingTests.csv")
> cor(dataset)
Error in cor(dataset) : 'x' must be numeric
> numericGroup <- as.numeric(dataset$Group)
> ls()
[1] "dataset" "numericGroup"
> cor(dataset,numericGroup)
Error in cor(dataset, numericGroup) : 'x' must be numeric

I suppose it would work if I substituted a numeric code for the Group text, but I see the advantage in getting the results you initially achieved. Have there been any clues as to why this is the case, or any way of forcing Group to be treated as numeric?

Thanks,

Jon

Anonymous, November 24, 2010 at 8:13 PM
I am new to R and faithfully following this tutorial. I copied into R one by one the commands from "correlational analysis example (.txt)". Unfortunately, for the command

> cor(datavar)

I get the message

Error in cor(datavar) : 'x' must be numeric

Any suggestions?

Thanx

John, November 27, 2010 at 7:56 AM
Hi,

I recommend reading the above posts, as others have also experienced this problem. Ruben reported it as happening on a particular version of R for OSX, while Luca offered a modified CSV file which replaced the text terms with numeric ones.

John

John, November 27, 2010 at 8:21 AM
I have updated the data file to use 0, 1, and 2, rather than Basal, DRTA, and Strat. Hopefully, this will help those of you experiencing the "x must be numeric" error.

Anonymous, January 4, 2011 at 3:42 PM
Hi John,
I have a lot of correlations to run, but I need significance values. How do I get a table with the r-values for the correlations between my variables but also p-values? (The cor.test() function doesn't work for multiple variables like cor() does).

Thanks,
Elspeth

John, January 4, 2011 at 8:59 PM
Hi Elspeth,

I have never done what you describe, but I see that a similar question was posted almost ten years ago on the R Help listserv: https://stat.ethz.ch/pipermail/r-help/2001-November/016201.html

You may want to search around for an answer in the archives/Google or consider pursuing it further on the listserv. This seems like a question that would be valuable for the community to answer.

For a makeshift solution, perhaps you could use a for loop to run each combination of x and y variables through the cor.test() function?

John
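
The loop idea could look something like the sketch below, which runs cor.test() on every pair of columns and collects the r-values and p-values in two matrices. It assumes an all-numeric data frame. The small datavar here is a hypothetical stand-in, and r_values and p_values are illustrative names.

```r
# Hypothetical stand-in data; any all-numeric data frame would do
datavar <- data.frame(
  PRE1  = c(4, 6, 9, 12, 16, 15, 14, 12),
  PRE2  = c(3, 5, 4, 6, 5, 13, 8, 7),
  POST1 = c(5, 9, 5, 8, 10, 9, 12, 5)
)

vars <- names(datavar)
r_values <- matrix(NA, length(vars), length(vars), dimnames = list(vars, vars))
p_values <- r_values

# Run cor.test() on every pair of columns and store r and p
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i != j) {
      result <- cor.test(datavar[[i]], datavar[[j]])
      r_values[i, j] <- result$estimate
      p_values[i, j] <- result$p.value
    }
  }
}

print(round(r_values, 3))
print(round(p_values, 3))
```

The p-values here are not adjusted for multiple comparisons, which may matter when many pairs are tested at once.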

John, January 4, 2011 at 9:12 PM
P.S. I was also notified that the rcorr function in the Hmisc package may be useful for what you are doing, since it accepts a matrix X and returns a correlation matrix with p-values. I haven't used the function myself.

John

Anonymous, May 30, 2011 at 5:36 AM
Hi, I am getting an error message saying datavar was not found. Do I need to attach a library?

John, June 3, 2011 at 4:29 PM
You have to create a variable containing the data. Datavar is a placeholder name for my example, although you could use any name you wish.

For all tutorials, also be sure to read through the entire tutorial and complete all steps without jumping ahead in the code.

Anonymous, August 23, 2011 at 9:20 AM
Hi there,

I want to assign different shapes to 4 groups of data that I'm plotting together. I know how to put in the shapes but not how to assign them to a specific group (e.g., C2 = square, C3 = circle, etc.). Thanks so much for your help.

Phoebe

John, August 23, 2011 at 9:30 AM
Hi Phoebe,

Check out the Scatterplots tutorial. There is an explanation by a commenter on how to plot markers by group.

John