R Tutorial Series: Summary and Descriptive Statistics

Summary (or descriptive) statistics are the first figures used to represent nearly every dataset. They also form the foundation for much more complicated computations and analyses. Thus, in spite of being composed of simple methods, they are essential to the analysis process. This tutorial will explore the ways in which R can be used to calculate summary statistics, including the mean, standard deviation, range, and percentiles. Also introduced is the summary function, which is one of the most useful tools in the R set of commands.

Tutorial Files

Before we start, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains hypothetical age and income data for 20 subjects. Note that all code samples in this tutorial assume that this data has already been read into an R variable and has been attached.
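The setup assumed above might look roughly like the following sketch. The data values are made up for illustration (they are not the tutorial's sample file), and the temporary file path stands in for wherever you saved the .csv.

```r
# Hypothetical stand-in for the downloaded sample file: a tiny CSV
# written to a temporary location so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Age,Income", "25,30000", "40,52000", "31,41000"), csv_path)

# Read the CSV into a data frame (the "R variable" holding the data)
dataset <- read.csv(csv_path)

# Attach it so that columns can be referenced by name, e.g. Age
attach(dataset)
age_mean <- mean(Age)
age_mean

# Detach when finished to avoid name clashes later in the session
detach(dataset)
```

If you prefer not to attach the data, the same column can be reached as dataset$Age.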

Mean

In R, a mean can be calculated on an isolated variable via the mean(VAR) command, where VAR is the name of the variable whose mean you wish to compute. To calculate a mean for each of the variables in a dataset, use the colMeans(DATAVAR) command, where DATAVAR is the name of the variable containing the data. (Older versions of R accepted mean(DATAVAR) for this purpose, but since R 3.0.0 that call returns NA with a warning.) The code sample below demonstrates both approaches.
  > #calculate the mean of a variable with mean(VAR)
  > #what is the mean Age in the sample?
  > mean(Age)
  [1] 32.3
  > #calculate the mean of all variables in a dataset with colMeans(DATAVAR)
  > #what is the mean of each variable in the dataset?
  > colMeans(dataset)
      Age  Income
     32.3 34000.0
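As a quick check on what the mean represents, the sketch below computes it by hand and compares it with the built-in function. The data frame is a made-up stand-in, not the tutorial's sample data.

```r
# Hypothetical stand-in data
dataset <- data.frame(Age = c(25, 40, 31), Income = c(30000, 52000, 41000))

# The mean is the sum of the values divided by how many there are
manual_mean <- sum(dataset$Age) / length(dataset$Age)
all.equal(manual_mean, mean(dataset$Age))

# Per-variable means, computed column by column
col_means <- sapply(dataset, mean)
col_means
```

Applying mean to each column with sapply is one way to get per-variable results when every column is numeric.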

Standard Deviation

Within R, standard deviations are calculated in much the same way as means. The standard deviation of a single variable can be computed with the sd(VAR) command, where VAR is the name of the variable whose standard deviation you wish to retrieve. To calculate a standard deviation for each of the variables in a dataset, apply sd to every column with the sapply(DATAVAR, sd) command, where DATAVAR is the name of the variable containing the data. (Older versions of R accepted sd(DATAVAR) directly, but that usage no longer works in current versions.) The code sample below demonstrates both approaches.
  > #calculate the standard deviation of a variable with sd(VAR)
  > #what is the standard deviation of Age in the sample?
  > sd(Age)
  [1] 19.45602
  > #calculate the standard deviation of all variables in a dataset with sapply(DATAVAR, sd)
  > #what is the standard deviation of each variable in the dataset?
  > sapply(dataset, sd)
          Age      Income
     19.45602 32306.10175
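It may help to see what sd computes. R's sd returns the sample standard deviation, which divides by n - 1 rather than n. The sketch below reproduces it by hand on made-up values.

```r
# Hypothetical stand-in values
age <- c(25, 40, 31)

# Sample standard deviation by hand: sum the squared deviations from
# the mean, divide by n - 1, then take the square root
n <- length(age)
manual_sd <- sqrt(sum((age - mean(age))^2) / (n - 1))

# Agrees with the built-in sd()
all.equal(manual_sd, sd(age))
```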

Range

Minimum and Maximum

Keeping with the pattern, a minimum can be computed on a single variable using the min(VAR) command. The maximum, via max(VAR), operates identically. However, in contrast to the mean and standard deviation functions, min(DATAVAR) or max(DATAVAR) will retrieve the minimum or maximum value from the entire dataset, not from each individual variable. Therefore, it is recommended that minimums and maximums be calculated on individual variables, rather than entire datasets, in order to produce more useful information. The sample code below demonstrates the use of the min and max functions.
  > #calculate the min of a variable with min(VAR)
  > #what is the minimum age found in the sample?
  > min(Age)
  [1] 5
  > #calculate the max of a variable with max(VAR)
  > #what is the maximum age found in the sample?
  > max(Age)
  [1] 70
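The difference between the dataset-wide and per-variable results can be sketched as follows. The data frame is a made-up stand-in, and sapply is just one way to apply a function to each column.

```r
# Hypothetical stand-in data
dataset <- data.frame(Age = c(25, 40, 31), Income = c(30000, 52000, 41000))

# On the whole data frame: a single value taken across all columns
overall_min <- min(dataset)
overall_max <- max(dataset)

# Per variable: apply min and max to each column separately
col_min <- sapply(dataset, min)
col_max <- sapply(dataset, max)
col_min
col_max
```

Here the dataset-wide minimum comes from Age and the maximum from Income, which is rarely what you want to report.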

Range

The range of a particular variable, that is, its maximum and minimum, can be retrieved using the range(VAR) command. As with the min and max functions, using range(DATAVAR) is not very useful, since it considers the entire dataset, rather than each individual variable. Consequently, it is recommended that ranges also be computed on individual variables. This operation is demonstrated in the following code sample.
  > #calculate the range of a variable with range(VAR)
  > #what range of age values are found in the sample?
  > range(Age)
  [1]  5 70
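Because range returns the minimum followed by the maximum, the width of the range can be obtained with diff. The sketch below uses made-up values and also shows one way to get a range per variable.

```r
# Hypothetical stand-in data
dataset <- data.frame(Age = c(25, 40, 31), Income = c(30000, 52000, 41000))

# range() returns two values: the minimum, then the maximum
age_range <- range(dataset$Age)

# The width of the range is the maximum minus the minimum
age_spread <- diff(age_range)

# One range per variable: a matrix with a column for each variable,
# minimum in row 1 and maximum in row 2
col_ranges <- sapply(dataset, range)
col_ranges
```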

Percentiles

Values from Percentiles (Quantiles)

Given a dataset and a desired percentile, a corresponding value can be found using the quantile(VAR, c(PROB1, PROB2,…)) command. Here, VAR refers to the variable name and PROB1, PROB2, etc., are probability values. The probabilities must be between 0 and 1, making them the decimal versions of the desired percentiles (e.g. 50% = 0.5). The following example shows how this function can be used to find the data value that corresponds to a desired percentile.
  > #calculate desired percentile values using quantile(VAR, c(PROB1, PROB2,...))
  > #what are the 25th and 75th percentiles for age in the sample?
  > quantile(Age, c(0.25, 0.75))
    25%   75%
  17.75 44.25
Note that the quantile(VAR) command can also be used on its own. When probabilities are not specified, the function defaults to computing the 0, 25, 50, 75, and 100 percentile values, as shown in the following example.
  > #calculate the default percentile values using quantile(VAR)
  > #what are the 0, 25, 50, 75, and 100 percentiles for age in the sample?
  > quantile(Age)
     0%   25%   50%   75%  100%
   5.00 17.75 30.00 44.25 70.00
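The probabilities do not have to be typed out one by one. As a small sketch on made-up values, seq can generate a whole set of them, and the 50th percentile can be checked against median. By default, quantile interpolates between data points when a percentile falls between two observations.

```r
# Hypothetical stand-in values
age <- c(10, 20, 30, 40, 50)

# Quartiles requested explicitly
quartiles <- quantile(age, c(0.25, 0.75))

# Deciles (10th through 90th percentiles) generated with seq()
deciles <- quantile(age, seq(0.1, 0.9, by = 0.1))

# The 50th percentile is the median
q50 <- quantile(age, 0.5)
unname(q50) == median(age)
```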

Percentiles from Values (Percentile Rank)

In the opposite situation, where a percentile rank corresponding to a given value is needed, one has to devise a custom method. To begin, consider the steps involved in calculating a percentile rank.
  1. count the number of data points that are at or below the given value
  2. divide by the total number of data points
  3. multiply by 100
From the preceding steps, the formula for calculating a percentile rank can be derived: percentile rank = length(VAR[VAR <= VAL]) / length(VAR) * 100, where VAR is the name of the variable and VAL is the given value. This formula makes use of the length function in two variations. The first, length(VAR[VAR <= VAL]), counts the number of data points in a variable that are at or below the given value. Note that the "<=" operator can be replaced with other comparison operators (<, >, >=, or ==) if the formula is applied to different scenarios. The second, length(VAR), counts the total number of data points in the variable. Together, they accomplish steps one and two of the percentile rank computation process. The final step is to multiply the result of the division by 100 to transform the decimal value into a percentage. A sample percentile rank calculation is demonstrated below.
  > #calculate the percentile rank for a given value using the custom formula: length(VAR[VAR <= VAL]) / length(VAR) * 100
  > #in the sample, an age of 45 is at what percentile rank?
  > length(Age[Age <= 45]) / length(Age) * 100
  [1] 75
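If the calculation is needed more than once, the formula can be wrapped in a small function. The sketch below is one possible version. The name percentile_rank and the data values are made up for illustration.

```r
# Hypothetical helper wrapping the formula from this section
percentile_rank <- function(x, value) {
  # count of points at or below value, over total count, as a percentage
  length(x[x <= value]) / length(x) * 100
}

# Hypothetical stand-in values
age <- c(5, 30, 45, 70)

# 3 of the 4 values are at or below 45
rank_45 <- percentile_rank(age, 45)
rank_45

# Equivalent shortcut: the mean of a logical vector is the share of TRUEs
shortcut_45 <- mean(age <= 45) * 100
```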

Summary

A very useful multipurpose function in R is summary(X), where X can be one of many kinds of objects, including datasets, variables, and linear models, to name a few. The command returns summary information about the object that was passed to it, so its output depends on the kind of object it takes as an argument. Besides being widely applicable, this method is valuable because it often provides exactly what is needed in terms of summary statistics. A couple of examples of how summary(X) can be used are displayed in the following code sample. I encourage you to use the summary command often when exploring ways to analyze your data in R. This function will be revisited throughout the R Tutorial Series.
  > #summarize a variable with summary(VAR)
  > summary(Age)
The output of the preceding summary is pictured below.


  > #summarize a dataset with summary(DATAVAR)
  > summary(dataset)
The output of the preceding summary is pictured below.
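As a rough text illustration of how summary adapts to its argument, the sketch below runs it on a numeric variable, a data frame, and a factor. All values are made up and are not the tutorial's sample data.

```r
# Hypothetical stand-in data
dataset <- data.frame(Age = c(25, 40, 31), Income = c(30000, 52000, 41000))

# A numeric variable: minimum, quartiles, median, mean, and maximum
age_summary <- summary(dataset$Age)
names(age_summary)

# Individual figures can be pulled out by name
age_median <- age_summary[["Median"]]

# A data frame: the same figures, reported for each column
summary(dataset)

# A factor: counts per level instead of numeric statistics
group <- factor(c("a", "b", "a"))
group_counts <- summary(group)
group_counts
```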

Complete Summary Statistics Analysis

To see a complete example of how summary statistics can be used to analyze data in R, please download the summary statistics analysis example (.txt) file.

Up Next: Zero-Order Correlations

Thank you for participating in the Summary and Descriptive Statistics tutorial. I hope that it has been useful to your work with R and statistics. Please let me know of any feedback, questions, or requests that you have in the comments section of this article. Our next guide will be on the topic of Zero-Order Correlations.

16 comments:

  1. Thanks so much, this is just what I need :)

  2. Excellent stuff. I was about to give up on learning. Having a non-programming background, I find your blog very helpful.

  3. Thanks, I'm glad the tutorials are helpful to you.

  4. Thanks for posting. This is very helpful.

  5. I agree. So helpful. You explain everything really clearly. Thanks!

  6. John, this was really helpful. I wish I could see all the sessions though, since I am struggling with renaming the variable at the moment. I suspect you have that explained under Data Visualization, but I cannot access it. Is there a way I could get to this chapter? Thank you!

  7. Thank you for your tutorial. I am able to clarify what the word "input" of a summary means.

  8. ===

    "Data Visualization" is a header and not a link to an actual tutorial. The links to tutorials are contained under each header on the right-hand side of the page.

    ===

    Sorry, I don't know what you mean by "input." The only place it appears on this page is in your comment.

  9. Many thanks John, and happy new year.

  10. Hi John, I'm glad I came across this tutorial. I am starting with R today and I've read my data in from a .dta (STATA) file.

    I wanted to test out descriptive stat functions but I get "NA", maybe because my variables were read in as something other than numbers? Any tips would be appreciated!

    Replies
    1. Hi Cait,

      Try this tutorial, which uses the Foreign package and has a specific example for importing Stata files: http://www.ats.ucla.edu/stat/r/faq/inputdata_R.htm

      John

  11. Nice, thank you very much. It's fun to learn like that :)

  12. Hi John,

    This tutorial series is incredibly clear and helpful. You are right, along the lines of your stated purpose to aggregate the bits and pieces of good R guidance out there and using consistent language - super useful. Thank you for your work!

  13. john may i know what are the other reasons why we always run the zero-order correlations of the variables as part of our preliminary analysis of the data? can't think of any except that its the basic technique for analyzing relationships between variables.

  14. How can I use summary() to read a numeric variable as categorical? If each number represents a group rather than an actual number, how can I force the numbers to be categorical in R?

    Replies
    1. summary (as.factor(VAR))
