R Tutorial Series: Introduction to The R Project for Statistical Computing (Part 2)

Welcome to part two of the Introduction to The R Project for Statistical Computing tutorial. If you missed part one, it can be found here. In this segment, we will explore the following topics.
  • Importing Data
  • Variables
  • Workspace Files
  • Console Files
  • Finding Help

Tutorial Files

Before we start, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory.

Importing Data

While values can be input directly into R, the most common method for obtaining data is to import it from preexisting sources. Most spreadsheets can be converted to CSV (comma-separated values) files, which are recommended for use with R. However, by way of the foreign package, a variety of alternative data files can be imported, such as ones generated in SPSS. Below are examples demonstrating how to import data using both methods.
To import data from a CSV file, use the read.csv("FILENAME") command, where FILENAME is the name of the file that you would like to import.
  read.csv("dataset_intro_pt2.csv")
When a file is read, the console displays its contents, as depicted in the screenshot below.
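If you do not have the sample file at hand, the same behavior can be tried with a throwaway file. The sketch below is only an illustration: the file is written to a temporary location, the name csvPath is made up for this example, and only the first three rows of the sample data are used.

```r
# Write a tiny stand-in CSV to a temporary file
# (hypothetical path; the values are the first three rows of the sample data)
csvPath <- tempfile(fileext = ".csv")
writeLines(c("Age,Income", "10,0", "25,35000", "43,75000"), csvPath)

# read.csv() treats the first line as column names by default (header = TRUE)
imported <- read.csv(csvPath)

# Display the contents, just as the console does when read.csv() is called directly
print(imported)
```

Because the first line of the file holds the column names, the result has three rows and two columns named Age and Income.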


Similarly, the foreign package can be used to import files from other spreadsheet and statistical analysis programs. A hypothetical example of loading data from an SPSS file (.sav) follows.
  > #first, load the foreign package
  > library(foreign)
  > #then, import the data file
  > read.spss("newData.sav")
Note that there are a variety of read.FUNCTION commands available in R. Depending on your source file, you may be better off using a different version of the command than what has been presented here. Nonetheless, the process of importing data will remain the same.
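As one illustration of the other read commands, here is a minimal sketch using read.delim() and read.table() from base R on a tab-separated file. The temporary file and the variable names (tsvPath, tabData, sameData) are assumptions made for this example and are not part of the tutorial files.

```r
# Create a small tab-separated file in a temporary location (hypothetical data)
tsvPath <- tempfile(fileext = ".txt")
writeLines(c("Age\tIncome", "10\t0", "25\t35000"), tsvPath)

# read.delim() is the tab-separated counterpart of read.csv()
tabData <- read.delim(tsvPath)

# read.table() is the general form; the header and separator are stated explicitly
sameData <- read.table(tsvPath, header = TRUE, sep = "\t")

print(tabData)
```

Both calls produce the same two-row table, which shows that the import process stays the same even when the command changes.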

Variables

Creating Variables

An important aspect of conducting statistical analyses in R concerns the use of variables. As in other programming languages, variables can be thought of as containers that store information and allow it to be manipulated. This contrasts with merely displaying information, which is all that the previous demonstrations of the read commands did. For example, when the command read.csv("dataset_intro_pt2.csv") was used, age and income data for 20 subjects were read and displayed in the console. The numbers can be seen, but what if you want to conduct statistical analyses on the data? To do this, you have to save the information into a variable using the <- operator. The <- characters assign a value to a variable and can be remembered as meaning "is equal to the contents of." Accordingly, the format for creating a variable is NAME <- VALUE or, in words, "the variable named NAME is equal to the contents of the value VALUE."


  > dataSet <- read.csv("dataset_intro_pt2.csv")
Thus, the line of code above creates a new variable named dataSet and sets it to equal the contents of the imported CSV file.
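Once data is stored in a variable, nothing is printed unless you ask for it, which helps with large files. The following sketch assumes a five-row stand-in for the sample file, supplied inline through the text argument of read.csv() so that it runs without any download. The functions head(), str(), and nrow() are standard base R tools for a quick look at a variable.

```r
# Stand-in for the tutorial's CSV file: the first five rows, supplied inline
dataSet <- read.csv(text = "Age,Income
10,0
25,35000
43,75000
32,55000
70,25000")

# The assignment above prints nothing; inspect the variable only when needed
head(dataSet, 3)   # show only the first three rows
str(dataSet)       # show column names and types
nrow(dataSet)      # count the rows
```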

Accessing Data Stored In Variables

Now that the contents of the spreadsheet have been stored in a variable, the individual data elements can be accessed. In the sample provided, age and income values were collected for 20 subjects and entered into a two-column spreadsheet. Since both age and income have their own column of values, each can be accessed individually using the format DATASET$COLUMN, where DATASET is the name of the variable that contains all of the data (e.g. dataSet) and $COLUMN is the name of the column within the data (e.g. $Age or $Income). The following code demonstrates how individual variables within a dataset can be accessed and displayed.
  > dataSet$Age
  [1] 10 25 43 32 70 19 5 21 35 24 12 14 49 62 48 40 33 67 9 28
  > dataSet$Income
  [1] 0 35000 75000 55000 25000 20000 0 20000 60000 30000 0 10000 35000 80000 80000 0 0 55000 0 100000
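A column accessed this way is an ordinary vector, so it can be passed straight to other functions. Here is a small sketch that assumes only the first five rows of the sample data. The variable names ages, meanAge, sameAges, and firstThreeIncomes are made up for illustration.

```r
# First five rows of the sample data, entered directly (stand-in for the CSV import)
dataSet <- data.frame(
  Age    = c(10, 25, 43, 32, 70),
  Income = c(0, 35000, 75000, 55000, 25000)
)

# A column pulled out with $ is a plain numeric vector
ages <- dataSet$Age

# It can be used in calculations: (10 + 25 + 43 + 32 + 70) / 5 = 36
meanAge <- mean(dataSet$Age)

# Double brackets with the column name are an equivalent way to get the column
sameAges <- dataSet[["Age"]]

# Square brackets after the column pick out individual values
firstThreeIncomes <- dataSet$Income[1:3]
```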

Data Frames

Tabular data in R is stored in a data frame, which holds one named column per variable. Its columns can be accessed with the $ format shown above, and it is displayed in a tabular format when printed in the R console. In fact, read.csv() already returns a data frame, so the dataSet variable created earlier is one. Data held in other structures, such as a list or a matrix, can be converted via the data.frame(DATASET) command, where DATASET is the name of the variable containing the data. This matters because some operations can be conducted on data frames that cannot be done on those other structures.
  > dataFrame <- data.frame(dataSet)
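The data.frame() command can also build a table from separate vectors. The sketch below assumes three made-up rows taken from the sample values and uses hypothetical names (age, income, adults). It also shows one typical data frame operation, which is selecting rows by a condition.

```r
# Two plain vectors of equal length (values borrowed from the sample data)
age    <- c(10, 25, 43)
income <- c(0, 35000, 75000)

# Combine them into a data frame with named columns
dataFrame <- data.frame(Age = age, Income = income)

# Confirm the type and print in tabular form
is.data.frame(dataFrame)
print(dataFrame)

# Keep only the rows where Age is at least 18
adults <- dataFrame[dataFrame$Age >= 18, ]
```

Here adults keeps the two rows with ages 25 and 43.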

Attaching Data Variables

A convenient method for accessing variables comes thanks to the ability to attach datasets in R. This is accomplished through the attach(NAME) command, where NAME is the name of the dataset variable that you want to attach. This allows you to refer to variables within the dataset without the need to list the name of the dataset and the $ symbol. Hence, the example below accomplishes the same tasks as in the previous section, but with less code.
  > #first, attach the dataset
  > attach(dataSet)
  > #now you can access variables using the shorthand method
  > Age
  [1] 10 25 43 32 70 19 5 21 35 24 12 14 49 62 48 40 33 67 9 28
  > Income
  [1] 0 35000 75000 55000 25000 20000 0 20000 60000 30000 0 10000 35000 80000 80000 0 0 55000 0 100000
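Note that each time R is run, the dataset must be reattached. This method is most useful when you know that you will be working with a single dataset for an entire session. Be aware that assigning a new value to an attached name (e.g. Age) creates a separate variable and does not change the column inside the dataset, so use the DATASET$COLUMN format whenever you modify data. Any data frame can be attached in this manner.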
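For readers who want the shorthand without attaching anything, base R also offers with(). The sketch below is an aside rather than part of the original walkthrough. It assumes a three-row stand-in dataset and a made-up variable name (meanIncome), and its second half shows what happens when an attached name is reassigned.

```r
# Three-row stand-in for the sample data
dataSet <- data.frame(Age = c(10, 25, 43), Income = c(0, 35000, 75000))

# with() evaluates one expression using the column names, with no attach() needed
meanIncome <- with(dataSet, mean(Income))

# Reassigning an attached name makes a new variable instead of editing the dataset
attach(dataSet)
Age <- Age + 1   # creates a separate variable named Age
detach(dataSet)

print(Age)          # the new variable holds the changed values
print(dataSet$Age)  # the column inside the dataset is unchanged
```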

Workspace Files

Every time that you create a variable to store values in R, it is saved to the current Workspace. A Workspace is a repository for all of the objects managed during a session. For instance, when you assigned the contents of the sample CSV file to the variable "dataSet", the dataSet object, complete with Age and Income data, was entered into the R Workspace. A Workspace can be saved at any time and loaded during a future session. Workspace files conventionally end with the extension ".RData" and are a useful way to pick up your work where you left off at the end of a previous session. The essential functions related to Workspaces are demonstrated below.
To save a Workspace file, use the save.image("PATH/FILENAME.RData") command, where PATH represents the directory path where you would like to save the new file (the working directory is used by default) and FILENAME is the name of the new file.
  > save.image("/Users/Admin/Desktop/NewSaveFile.RData")
Similarly, to load a Workspace file, use the load("PATH/FILENAME.RData") command, where PATH represents the directory path to the previously saved file (the working directory is used by default) and FILENAME is the name of the previously saved file.
  > load("/Users/Admin/Desktop/PreviouslySavedFile.RData")
Furthermore, a list of all of the objects currently held in the Workspace can be displayed via the ls() function.
  > ls()
  [1] "dataSet"
Note that R also features a Workspace menu where each of the above tasks can be handled. The Workspace Browser (pictured) is especially useful for visualizing the contents of your current Workspace.
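The save and load cycle can be tried safely with a temporary file. This sketch makes a few assumptions: it uses save(), which stores only the objects you name, whereas save.image() stores everything in the Workspace. The file name example_workspace.RData and the variable names workspaceFile and loadedNames are invented for this example.

```r
# Three-row stand-in for the sample data
dataSet <- data.frame(Age = c(10, 25, 43), Income = c(0, 35000, 75000))

# Build a path inside R's temporary directory (hypothetical file name)
workspaceFile <- file.path(tempdir(), "example_workspace.RData")

# Save the named object to the file
save(dataSet, file = workspaceFile)

# Remove the object to imitate starting a fresh session
rm(dataSet)

# load() restores the saved objects and returns their names
loadedNames <- load(workspaceFile)
print(loadedNames)
```

After the call to load(), dataSet is available again exactly as it was saved.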


Console Files

As discussed in part one of this tutorial, the R Console is where commands are issued and subsequent outputs are displayed. In contrast to the Workspace, where all of the objects in use are stored, the Console is a running record of the commands issued on those objects and the output they produced.
Consider a meeting between people as an analogy to further explain the relationship between the Workspace and the Console. All of the individuals who attend the meeting are contained in a single room (i.e. the Workspace). Everything that the participants do and say is recorded in the meeting minutes (i.e. the Console). Thus, the Workspace contains objects (such as the people who attend a meeting) and the Console consists of a log of interactions between objects (such as what people say to each other during a meeting).
The contents of the Console can be saved to a text file using File > Save As… from the menu. In fact, the same procedure can be executed from the Quartz window to produce a PDF of a particular graphic. Moreover, the contents in any of the R windows can be copied and pasted into another program, such as a word processor. Unlike a Workspace, which may be saved and reloaded from session to session to continue work, a Console is most useful for keeping track of what you have done in previous sessions. This history can be a reminder of where you left off during the last session, the results of prior analyses, how to execute certain functions, or an array of other items. A sample Console output is pictured below. Take notice of the contrast between this and the previous image of the Workspace Browser.
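If you prefer to keep a record from code rather than from the menu, one possible approach is capture.output() from base R, which writes what a command would print to a text file. This is an alternative to the menu route described above, not a replacement for it, and it records output only, not the commands themselves. The names ages, logFile, and savedLines are made up for this sketch.

```r
# A few ages from the sample data
ages <- c(10, 25, 43, 32, 70)

# Temporary text file to hold the output (hypothetical location)
logFile <- tempfile(fileext = ".txt")

# Write what summary(ages) would print in the console to the file
capture.output(summary(ages), file = logFile)

# Read the file back to confirm what was stored
savedLines <- readLines(logFile)
print(savedLines)
```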

Finding Help

When getting started with R for the first time, or when exploring new facets of the program, it can be useful to get help from more experienced users. Fortunately, R has a large community with a strong online presence. Help documentation, FAQs, tutorials, and discussions can be found covering nearly every aspect of R that one would ever need or want to become familiar with. The following list represents just a few of the excellent R resources that have assisted me thus far.
In spite of the abundance of R information available online, I have decided to create a series of my own tutorials for three main reasons. First, the R knowledge base is scattered across the internet, making it difficult for users to find what they need, when they need it. Second, information about R has been written by many people, in many places, at many times, causing inconsistencies in language and format to exist that challenge users' ability to easily comprehend and apply the solutions that they find. Third, there is no cohesive set of R tutorials that appeals directly to my own (and others') usage of the program, which leaves me searching for small bits of answers in many different places rather than finding holistic solutions. Thus, my goal in creating this series of tutorials is to provide fellow researchers with a coherent and unified set of essential statistical analyses that can be applied to diverse projects using the R system.
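Alongside the online resources, R ships with its own help system, which is often the quickest place to check how a command works. The sketch below uses only functions from a standard R installation. The variable name readFunctions is made up, and the commented lines are meant to be typed in an interactive session.

```r
# In an interactive session, either line opens the documentation for read.csv():
#   ?read.csv
#   help("read.csv")

# apropos() lists objects whose names match a pattern, handy for finding commands
readFunctions <- apropos("^read\\.")
print(readFunctions)

# args() shows the arguments a function accepts without leaving the console
args(read.csv)

# help.search("spss"), or ??spss, searches the installed documentation by keyword
```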

28 comments:

  1. Great tutorial! Exactly what I was looking for.

  2. Thanks. I'm glad I could help. I have a few more tutorials coming up next week, so be sure to check those out as well. Also, let me know if you have any requests.

    1. Thanks John. This was very helpful to "learn and get going" with RScript. Perfect for my needs.

  3. The use of attach is deprecated by most experienced R programmers. Changes to a variable within an attached data frame will disappear when it is detached, which is an enormous gotcha. It also creates difficult-to-debug errors. The time it saves in typing is not worth the downside problems.

  4. The saved data set's name is not the same name listed in the text.

  5. Thanks, I updated the link. It is actually the same dataset, but there are two copies/names.

  6. great job, you have a great gift for explaining.

  7. This website is going to make my life easier to work with R.

  8. Excellent work, I find it really helpful! Thanks!

  9. Is there a different command for loading .csv files other than read.csv? I don't want the whole huge data file to show up in my workspace every time I download new data. Thanks so much, this is sooooooo helpful!!

  10. You could save the dataset to a variable, as shown in the "Creating Variables" section. This way, the data will only display if you call the variable. This is something you would normally want to do anyway, since you want your data in variable form to further manipulate it in R.

  11. First off: this is excellent.

    That being said, is there any tutorial you have, or are planning to have, that breaks down the manipulation of data (for example, formatting a column/row to be binary, or categorical, vs. just numbers?) I remember doing this a long time ago, but now I am struggling to recall.

    Thanks again!

  12. Thanks.

    I do not have articles specifically dedicated to manipulating data in the way you mentioned, but I do cover similar topics as they apply to specific analyses. For example, dummy coding and binary variables are addressed in the Regression with Categorical Variables tutorial.

  13. Very helpful introduction. Thanks!

  14. I'm working with MacOs lion, have downloaded R, can work within it but I'm unable to import data to R no matter what I do?

  15. Really useful! Cheers!

  16. This is very useful. I am an old time SAS programmer trying to learn in detail (as much as I can) the functionality of R. In the tutorial above, it appears to me that whenever you call a variable, all the values get printed. Isn't this cumbersome if you are looking at 100000 records?

    1. The values are only printed when you call a variable directly in the console. Usually, you won't be doing this. If you did need to preview the dataset in R, you could choose to display specific columns/rows or create a subset of the data.

  17. Hi,

    Obviously I'd like to echo others with my own thanks. I've picked up R today and I'm just flying along thanks to this tutorial.

    That said, I cannot seem to open a workspace browser as you did under workspace files. I am using R version 2.15.0 in windows (vista). Any ideas?

    Thanks

    1. Hi. R is not consistent across platforms and I use the Mac version, so it is entirely possible that some of the GUI elements that I demonstrate cannot be found or are in other places than I show. To be honest, I never use the GUI myself (or in the subsequent tutorials), so you might consider doing everything via the console. Alternatively, there are more advanced GUIs for R than the default ones that you could look into online, though I have not used any myself.

    2. I just recently started using Rstudio, which has a workspace browser, and I really like it...

      --Paul Baer
      Georgia Tech

  18. Hi, the link to the file is different from the one in the code - I changed the line in the tutorial to:

    read.csv("dataset_intro_pt2.csv")

    1. Hi Mark,

      I recently relocated the files for this blog and reorganized/named the datasets and text files. However, I did not edit all of the read statements, so if you make sure to use the updated names in your code, you will be fine.

      John

  19. I teach stats to undergrads in the public policy school at Georgia Tech, and I just found this site while looking for data for a two-way ANOVA.

    I used to do it all in R but my students demanded (via my faculty colleagues) to learn something more practical - Excel! But I still sneak R in the back door.

    This looks like a great series. I have a couple of individual comments on other comments up above.

    --Paul Baer
    Assistant Professor
    School of Public Policy
    Georgia Tech

  20. John
    This is a very commendable job you did here. I am new to R, but your tutorials plus the book that you wrote have helped me tremendously.
    Keep up the good work
    Chisomo Kumbuyo
    Tottori University
    Japan
