By John M Quick

The R Tutorial Series provides a collection of user-friendly tutorials to people who want to learn how to use R for statistical analysis.




R Tutorial Series: Scatterplots

A scatterplot is a useful way to visualize the relationship between two variables. Similar to correlations, scatterplots are often used to make initial diagnoses before any statistical analyses are conducted. This tutorial will explore the ways in which R can be used to create scatterplots.

Tutorial Files

Before we start, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains pretest and posttest scores for 66 subjects on a series of reading comprehension tests (Moore & McCabe, 1989). Note that all code samples in this tutorial assume that the data has already been read into an R variable and that the variable has been attached.
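The read-and-attach step assumed above might look like the following sketch. The file name reading_scores.csv is a placeholder rather than the name of the actual download, and the scores written to it are made up so that the snippet runs on its own. The variable name datavar matches the one used later in this tutorial.

```r
# Stand-in file so the sketch is self-contained; these scores are made up
# and are NOT the Moore & McCabe data.
csv_path <- file.path(tempdir(), "reading_scores.csv")  # hypothetical file name
write.csv(
  data.frame(PRE1 = c(4, 6, 9, 12), POST1 = c(5, 7, 11, 13)),
  csv_path,
  row.names = FALSE
)

# Read the file into an R variable
datavar <- read.csv(csv_path, header = TRUE)

# Attach it so that columns such as PRE1 can be referenced by name alone
attach(datavar)

# Quick check that the columns are visible
head(PRE1)

# Detach when finished to avoid name clashes later in the session
detach(datavar)
```

With a real download, only the read.csv() and attach() lines would be needed, with csv_path replaced by the saved file's name.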

Plotting Two Variables

The simplest way to create a scatterplot is to directly graph two variables using the default settings. In R, this can be accomplished with the plot(XVAR, YVAR) function, where XVAR is the variable to plot along the x-axis and YVAR is the variable to plot along the y-axis. Suppose that we want to get a picture of the relationship between pretest 1 (PRE1) and posttest 1 (POST1). The following example demonstrates how to use the plot(XVAR, YVAR) function to visualize this relationship.
  #create a scatterplot of Y on X using plot(XVAR, YVAR)
  #what does the relationship between pretest 1 and posttest 1 look like?
  plot(PRE1, POST1)
The output of the preceding function is pictured below.
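If the dataset has not been attached, R will not find PRE1 or POST1 by name. As a hedged alternative, the columns can be reached through the data frame itself. The sketch below uses a small made-up data frame named datavar in place of the tutorial data.

```r
# Made-up scores standing in for the tutorial dataset
datavar <- data.frame(
  PRE1  = c(4, 6, 9, 12, 15),
  POST1 = c(5, 7, 11, 13, 14)
)

# Option 1: refer to the columns through the data frame with $
plot(datavar$PRE1, datavar$POST1)

# Option 2: formula interface, read as "POST1 as a function of PRE1"
plot(POST1 ~ PRE1, data = datavar)
```

Both calls draw the same points; the formula version also labels the axes with the column names.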

Plotting All Variables

When beginning to analyze a dataset, researchers often want to get a complete picture of all relationships, rather than just a single one. Conveniently, the plot() function can also be run on an entire set of data. The format for this operation is plot(DATAVAR), where DATAVAR is the name of the R variable containing the data. Suppose now that our interest is in visualizing all of the scatterplots at once, in order to diagnose the various relationships present in our data. The following example demonstrates how to use the plot(DATAVAR) function.
  #create scatterplots of all variables using plot(DATAVAR)
  #what do all of the relationships in the data look like?
  plot(datavar)
The output of the preceding function is pictured below.

Note that the image above has been resized to fit on this page. In the R Quartz Window, the scatterplots could be made much larger for easier viewing.
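For a data frame with more than two columns, plot() draws a scatterplot matrix, the same kind of display that pairs() produces. One way to get larger panels is to plot fewer columns. The sketch below assumes a made-up data frame with a Group column and three score columns; the values are illustrative only.

```r
# Made-up data standing in for the tutorial dataset
datavar <- data.frame(
  Group = factor(c("A", "A", "B", "B", "C", "C")),
  PRE1  = c(4, 6, 9, 12, 15, 8),
  PRE2  = c(3, 5, 8, 6, 7, 4),
  POST1 = c(5, 7, 11, 13, 14, 9)
)

# Keep only the numeric columns so every panel is a true scatterplot
numeric_cols <- datavar[sapply(datavar, is.numeric)]

# Scatterplot matrix of the numeric columns
pairs(numeric_cols)
```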

Custom Plotting

Additional Plot() Arguments

Up to this point, we have been using the default values for all of our scatterplots' elements. However, R also allows for the customization of scatterplots. In addition to x and y axis variables, the plot() function also accepts the following arguments ("The Default Scatterplot Function", n.d.).
  • main: the title for the plot (displayed at the top)
  • sub: the subtitle for the plot (displayed at the bottom)
  • xlim: the x-axis scale; uses the format c(min, max); automatically determined by default
  • ylim: the y-axis scale; uses the format c(min, max); automatically determined by default
  • xlab: the x-axis title
  • ylab: the y-axis title
Even more arguments are accepted by the plot() function. Take a look at the referenced page if you wish to explore further options.
Now let's recreate the original plot depicting the relationship between pretest 1 and posttest 1 with more detailed and meaningful parameters.
  #create a detailed scatterplot of Y on X incorporating the optional arguments of the plot() function
  #set axis scales for x and y to range between 0 and 20
  #set main title and subtitle
  #set x and y axis labels
  plot(PRE1, POST1, xlim = c(0, 20), ylim = c(0, 20), main = "Posttest 1 on Pretest 1", sub = "A Scattered Tale", xlab = "Pretest 1 Score", ylab = "Posttest 1 Score")
The output of the preceding function is pictured below.
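The c(min, max) limits do not have to be typed by hand. As a minimal sketch, assuming made-up PRE1 and POST1 vectors, range() can compute one pair of limits that covers both variables so that the two axes share a scale.

```r
# Made-up scores standing in for the tutorial data
PRE1  <- c(4, 6, 9, 12, 15)
POST1 <- c(5, 7, 11, 13, 14)

# Smallest and largest value across both variables, returned as c(min, max)
shared_limits <- range(c(PRE1, POST1))

# Use the same limits on both axes
plot(PRE1, POST1,
     xlim = shared_limits, ylim = shared_limits,
     main = "Posttest 1 on Pretest 1",
     xlab = "Pretest 1 Score", ylab = "Posttest 1 Score")
```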

Advanced Plotting

There are numerous graphical arguments available to functions in R. This tutorial addresses just a few of the common aesthetic options ("Set or Query Graphical Parameters", n.d.).
  • col: determines the colors used for points and lines; accepts character strings of color names (e.g. "red", "green", etc.)
  • pch: the type of point to use (e.g. circle, square, triangle, etc.); accepts values 0-25 for symbols and 32-255 for characters
  • cex: the amount to scale the size of points; accepts a numeric value; default is 1
  • lty: defines the line type; accepts various character strings (e.g. "solid", "dashed", "dotted", etc.)
  • lwd: defines the line width; accepts a positive number; default is 1
Many more graphical parameters are available. Take a look at the referenced page if you wish to explore further options.
Now let's recreate the plot of posttest 1 on pretest 1 yet again, but this time with the inclusion of customized aesthetic parameters.
  #create a scatterplot of Y on X incorporating the custom aesthetic parameters of the plot() function
  #set point colors to dark green, red, and orange
  #set point markers to circle, square, and diamond
  #set point size to three times the default
  #set lines to be solid and three times the default thickness
  plot(PRE1, POST1, xlim = c(0, 20), ylim = c(0, 20), main = "Posttest 1 on Pretest 1", sub = "A Scattered Tale", xlab = "Pretest 1 Score", ylab = "Posttest 1 Score", col = c("dark green", "red", "orange"), pch = c(21, 22, 23), cex = 3, lty = "solid", lwd = 3)
The output of the preceding function is pictured below.

Note that the c() function is used for several of the parameters in the plot() call above. It combines multiple values into a "vector" that can be passed to a single argument. For example, to use only a single color, col = "red" is sufficient. To use multiple colors, however, all of them must be placed into a vector, such as col = c("red", "green", "blue"). Writing col = "red", "green", "blue" without c() would cause an error, because the colors would be treated as separate arguments rather than a single entity.
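When a vector such as col = c("dark green", "red", "orange") is shorter than the number of points, R reuses its values in order, so the colors follow the row order of the data rather than any grouping. The sketch below, which uses made-up scores, shows the color each point would receive.

```r
# Made-up scores standing in for the tutorial data
PRE1  <- c(4, 6, 9, 12, 15, 8, 10)
POST1 <- c(5, 7, 11, 13, 14, 9, 12)

point_colors <- c("dark green", "red", "orange")

# Repeat the three colors until there is one per point,
# which mirrors how plot() reuses a short col vector
recycled <- rep_len(point_colors, length(PRE1))
print(recycled)

# The plot assigns colors in exactly that order
plot(PRE1, POST1, col = point_colors, pch = 16)
```

Here the fourth point gets "dark green" again simply because it is fourth in the data.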

Complete Plot Examples

To see a complete example of how scatterplots can be created in R, please download the plot examples (.txt) file.

Even More Visualizations

R has much more sophisticated graphic capabilities than have been demonstrated in this tutorial. In fact, opportunities exist to make very complex and unique visuals. To see examples of the kinds of charts that can be generated with R, I recommend that you visit the R Graph Gallery (François, 2006).

References

François, R. (2006). R graph gallery: Enhance your data visualization with R. Retrieved November 11, 2009 from http://addictedtor.free.fr/graphiques
Moore, D., and McCabe, G. (1989). Introduction to the practice of statistics [Data File]. Retrieved October 27, 2009 from http://lib.stat.cmu.edu/DASL/Datafiles/ReadingTestScores.html
Set or Query Graphical Parameters. (n.d.). Retrieved November 11, 2009 from http://sekhon.berkeley.edu/graphics/html/par.html
The Default Scatterplot Function. (n.d.). Retrieved November 11, 2009 from http://sekhon.berkeley.edu/graphics/html/plotdefault.html

21 comments:

  1. For some reason it doesn't recognize the object PRE1 in the second example. Why might that be?

  2. Hi Dane,

    It is hard to tell without seeing it, but maybe the dataset is not attached. Could you please provide the error message that you received?

  3. I'm a bit confused by the colour coded scatter plot example here. I would have thought colours would be assigned according to a categorical variable (i.e. "Group" in this case), but it seems that the colours here are assigned randomly to data points? Could you perhaps post an example of how to assign colours based on a third variable? I think this would have far more practical application.... thanks!

  4. For group based coding of col, just type the following as part of the call to plot() function:
    col=as.numeric(groupvariable)

    How this works is....
    Groupvariable is converted to a numeric (starting from 1) and color is selected as per group level


    If you want specific colors, then for a groupvariable with three levels, add argument:

    col= c("red", "green", "blue")[as.numeric(groupvariable)]

    This should do.
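A minimal sketch of the approach described in this comment, assuming a made-up data frame whose Group column is a factor with three levels. The data and the name palette_by_level are illustrative only.

```r
# Made-up data with a three-level grouping factor
datavar <- data.frame(
  Group = factor(c("A", "A", "B", "B", "C", "C")),
  PRE1  = c(4, 6, 9, 12, 15, 8),
  POST1 = c(5, 7, 11, 13, 14, 9)
)

# One color per factor level, in level order (A, B, C)
palette_by_level <- c("red", "green", "blue")

# as.numeric() turns the factor into level codes 1, 2, 3,
# which then pick the matching color for every row
point_colors <- palette_by_level[as.numeric(datavar$Group)]

plot(datavar$PRE1, datavar$POST1, col = point_colors, pch = 16)
```

If the grouping column is stored as text rather than as a factor, wrapping it in factor() first gives the level codes this indexing relies on.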

  5. Hi Heretic,

    Thanks for posting this tip.

  6. Thank you so much for posting all of this!!! I'm re-learning R and it's super helpful. I wish I'd had this as a reference when I was learning it the first time...

  7. Hi Aly,

    I'm glad that the tutorials are helpful for you.

  8. Hi,

    Is there a way not only to assign colors to specific points, but to label them, e.g. by an ID variable (like 1,2,3... or A,B,C...) ?

    thanks.

  9. Hi Daniel,

    See Labeling Data Points link in the Data Visualization menu on the right-hand side of the page.

    John

  10. col=c("red", "green")[as.numeric(matrix$type)]

    where type is a column in a matrix with value of 0 or 1 does not lead to a scatterplot with red points for type=0 and green points for type is 1.

    Confused!

  11. See Heretic's and others' comments above for coloring points by group.

  12. Hi John, I tried that, but
    col=c("red", "green")[as.numeric(matrix$type)]
    seems to only show a red color for all data points and not the green color at all.

    I found a solution that works, using if-else. But it probably won't work if you have more than 2 groups.

    col=ifelse(matrix$type == "0", "red", "green")
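One likely explanation for the all-red result, assuming type is stored as the numbers 0 and 1: an index of 0 selects nothing in R, so only the 1s pick a color, and they all pick the first one. The sketch below uses a made-up type vector and two possible workarounds that also extend to more than two groups.

```r
# Made-up 0/1 codes standing in for the type column
type <- c(0, 1, 1, 0, 1)
two_colors <- c("red", "green")

# Index 0 is silently dropped, so only the three 1s remain, all mapped to "red"
dropped <- two_colors[as.numeric(type)]
print(dropped)

# Workaround 1: convert to a factor first, so 0 becomes level 1 and 1 becomes level 2
by_factor <- two_colors[as.numeric(factor(type))]

# Workaround 2: for 0/1 codes, shift the codes up by one
by_offset <- two_colors[type + 1]

print(by_factor)
```

The shortened all-red vector is then reused across every point, which matches the plot described in the comment.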

  13. I think using a grouping variable would help in the format that Heretic explains. The grouping variable would be the column in your dataset that identifies which group each object belongs to. For example, with students you might have freshman, sophomore, junior, and senior in your "class" grouping variable. Since each student in your dataset has a class, R could use this information to plot a color for each student that matches his/her class.

    John

  14. Hi John,

    Thanks for the awesome tutorial!

    One question: How do I include a legend on the graph?

    Replies
    1. Hi Jonathan, take a look at the help documentation for the legend() function, which adds a legend to a plot. John
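A minimal sketch of legend() in use, assuming a made-up data frame with a Group factor and the same color-per-level scheme discussed in the earlier comments. The data and the name palette_by_level are illustrative only.

```r
# Made-up data with a three-level grouping factor
datavar <- data.frame(
  Group = factor(c("A", "A", "B", "B", "C", "C")),
  PRE1  = c(4, 6, 9, 12, 15, 8),
  POST1 = c(5, 7, 11, 13, 14, 9)
)

# One color per factor level, in level order
palette_by_level <- c("red", "green", "blue")

# Color each point by its group
plot(datavar$PRE1, datavar$POST1,
     col = palette_by_level[as.numeric(datavar$Group)], pch = 16)

# Add a legend that lists each level next to its color and symbol
legend("topleft",
       legend = levels(datavar$Group),
       col = palette_by_level,
       pch = 16,
       title = "Group")
```

Because the legend labels come from levels() and the colors are listed in level order, each label lines up with the color used for that group.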

  15. Hello, thanks for the tutorial. How can I remove the plot borders and keep only the x and y axes? I have tried the function frame.front but it does not work. x and y appear separately. Please help me.

  16. Hi
    I have two questions
    1) How do I change the color of the background of plots ?

    2) More precisely how can we paste an image as background picture of a plot ?
    I mean I have gps coordinates, and I plot them (y axis=lat, and x axis=long).
    I thus get a scattered point pattern on a 500x500m white scatterplot.
    I want the background to be the google earth image of the landscape on which the gps coords were recorded and not the default.
    I was thinking that if we can make the background of the plot transparent then I can superpose the images, if well georeferenced
    Any script for this ? any ideas ?

    Replies
    1. Did you get answer for question 2 .. I was also looking to do the same ..many thanks

  17. Hi John - what do you suggest when you have a large x-axis range e.g. 0-1,000,000 and the values come up as e+06? What is available to adjust this without collapsing the range using xlim?
    Thanks

  18. Hi John,

    If you've used the different symbols/colours for groups in a scatterplot how do you show what they are in the legend?
