### Tutorial Files

Before we start, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains hypothetical age and income data for 20 subjects. Note that all code samples in this tutorial assume that this data has already been read into an R variable and has been attached.

### Mean

In R, a mean can be calculated on an isolated variable via the mean(VAR) command, where VAR is the name of the variable whose mean you wish to compute. A mean can also be calculated for each of the variables in a dataset. In older versions of R this could be done with mean(DATAVAR), where DATAVAR is the name of the variable containing the data. That form is no longer supported, so in current versions use sapply(DATAVAR, mean) instead (colMeans(DATAVAR) is equivalent when every variable is numeric). The code sample below demonstrates both calculations.

- > #calculate the mean of a variable with mean(VAR)
- > #what is the mean Age in the sample?
- > mean(Age)
- [1] 32.3
- > #calculate the mean of all variables in a dataset with sapply(DATAVAR, mean)
- > #what is the mean of each variable in the dataset?
- > sapply(dataset, mean)
- Age Income
- 32.3 34000.0
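
The per-variable calculation can also be tried without the sample file. The sketch below builds a small made-up data frame (its values are illustrative assumptions, not the tutorial's data) and computes the mean of one variable and of every variable. It uses sapply(dataset, mean) and colMeans(dataset), since calling mean() directly on a data frame returns NA with a warning in current versions of R.

```r
# Hypothetical stand-in data; these values are NOT the tutorial's sample file
dataset <- data.frame(
  Age    = c(20, 30, 40, 50),
  Income = c(10000, 20000, 30000, 40000)
)

# Mean of a single variable (referenced with $ instead of attach())
age_mean <- mean(dataset$Age)

# Mean of every variable: apply mean() to each column in turn
col_means <- sapply(dataset, mean)

# colMeans() gives the same result when every column is numeric
col_means_alt <- colMeans(dataset)

print(age_mean)
print(col_means)
```

sapply() also works with other summary functions, so the same pattern carries over to the rest of this tutorial.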

### Standard Deviation

Within R, standard deviations are calculated in the same way as means. The standard deviation of a single variable can be computed with the sd(VAR) command, where VAR is the name of the variable whose standard deviation you wish to retrieve. A standard deviation can also be calculated for each of the variables in a dataset. In older versions of R this could be done with sd(DATAVAR), where DATAVAR is the name of the variable containing the data. That form is no longer supported, so in current versions use sapply(DATAVAR, sd) instead. The code sample below demonstrates both calculations.

- > #calculate the standard deviation of a variable with sd(VAR)
- > #what is the standard deviation of Age in the sample?
- > sd(Age)
- [1] 19.45602
- > #calculate the standard deviation of all variables in a dataset with sapply(DATAVAR, sd)
- > #what is the standard deviation of each variable in the dataset?
- > sapply(dataset, sd)
- Age Income
- 19.45602 32306.10175
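
As a rough, self-contained sketch, the example below computes standard deviations on a small made-up data frame (the values are assumptions for illustration, not the tutorial's data). It also checks sd() against the sample formula written out by hand, to show that sd() divides by n - 1 rather than n.

```r
# Hypothetical stand-in data; these values are NOT the tutorial's sample file
dataset <- data.frame(
  Age    = c(20, 30, 40, 50),
  Income = c(10000, 20000, 30000, 40000)
)

# Standard deviation of a single variable
age_sd <- sd(dataset$Age)

# The same value written out: sample standard deviation with an n - 1 denominator
x <- dataset$Age
age_sd_manual <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))

# Standard deviation of every variable, one column at a time
col_sds <- sapply(dataset, sd)

print(age_sd)
print(col_sds)
```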

### Range

#### Minimum and Maximum

Keeping with the pattern, a minimum can be computed on a single variable using the min(VAR) command. The maximum, via max(VAR), operates identically. However, in contrast to the mean and standard deviation functions, min(DATAVAR) or max(DATAVAR) will retrieve the minimum or maximum value from the entire dataset, *not from each individual variable*. Therefore, it is recommended that minimums and maximums be calculated on individual variables, rather than entire datasets, in order to produce more useful information. The sample code below demonstrates the use of the min and max functions.

- > #calculate the min of a variable with min(VAR)
- > #what is the minimum age found in the sample?
- > min(Age)
- [1] 5
- > #calculate the max of a variable with max(VAR)
- > #what is the maximum age found in the sample?
- > max(Age)
- [1] 70
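
To illustrate the difference between the whole-dataset and per-variable results, here is a minimal sketch on a made-up data frame (the values are assumptions, not the tutorial's data). Using sapply() to get one minimum or maximum per variable is a suggestion that goes beyond the original text.

```r
# Hypothetical stand-in data; these values are NOT the tutorial's sample file
dataset <- data.frame(
  Age    = c(20, 30, 40, 50),
  Income = c(10000, 20000, 30000, 40000)
)

# On the whole data frame: a single value taken across all variables
overall_min <- min(dataset)
overall_max <- max(dataset)

# Per variable: apply min() and max() to each column separately
col_mins <- sapply(dataset, min)
col_maxs <- sapply(dataset, max)

print(overall_min)
print(col_mins)
print(col_maxs)
```

Here the overall minimum comes from Age and the overall maximum from Income, which is why the whole-dataset result is rarely informative when variables are on different scales.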

#### Range

The range of a particular variable, that is, its maximum and minimum, can be retrieved using the range(VAR) command. As with the min and max functions, using range(DATAVAR) is not very useful, since it considers the entire dataset, rather than each individual variable. Consequently, it is recommended that ranges also be computed on individual variables. This operation is demonstrated in the following code sample.

- > #calculate the range of a variable with range(VAR)
- > #what range of age values are found in the sample?
- > range(Age)
- [1] 5 70
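
A short sketch of range() on made-up data follows (the values are assumptions, not the tutorial's data). It also shows two optional extras that are not part of the original text: diff(range(VAR)) for the width of the range, and sapply(DATAVAR, range) for one minimum and maximum pair per variable.

```r
# Hypothetical stand-in data; these values are NOT the tutorial's sample file
dataset <- data.frame(
  Age    = c(20, 30, 40, 50),
  Income = c(10000, 20000, 30000, 40000)
)

# range() returns the minimum and the maximum as a two-element vector
age_range <- range(dataset$Age)

# The width of the range is the difference between those two values
age_width <- diff(age_range)

# Per variable: a matrix with one column per variable (row 1 = min, row 2 = max)
col_ranges <- sapply(dataset, range)

print(age_range)
print(age_width)
print(col_ranges)
```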

### Percentiles

#### Values from Percentiles (Quantiles)

Given a dataset and a desired percentile, a corresponding value can be found using the quantile(VAR, c(PROB1, PROB2, ...)) command. Here, VAR refers to the variable name and PROB1, PROB2, etc., are probability values. The probabilities must be between 0 and 1, making them decimal versions of the desired percentiles (e.g. 50% = 0.5). The following example shows how this function can be used to find the data value that corresponds to a desired percentile.

- > #calculate desired percentile values using quantile(VAR, c(PROB1, PROB2,...))
- > #what are the 25th and 75th percentiles for age in the sample?
- > quantile(Age, c(0.25, 0.75))
- 25% 75%
- 17.75 44.25

Note that the quantile(VAR) command can also be used. When probabilities are not specified, the function defaults to computing the 0, 25, 50, 75, and 100 percentile values, as shown in the following example.

- > #calculate the default percentile values using quantile(VAR)
- > #what are the 0, 25, 50, 75, and 100 percentiles for age in the sample?
- > quantile(Age)
- 0% 25% 50% 75% 100%
- 5.00 17.75 30.00 44.25 70.00
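
The following sketch runs quantile() on a small made-up vector (the values are assumptions, not the tutorial's data). By default quantile() interpolates linearly between neighbouring data points, so the values it returns do not have to appear in the data.

```r
# Hypothetical ages; these values are NOT the tutorial's sample file
ages <- c(20, 30, 40, 50)

# Specific percentiles: probabilities are given as decimals between 0 and 1
q_selected <- quantile(ages, c(0.25, 0.75))

# Default call: the 0, 25, 50, 75, and 100 percentile values
q_default <- quantile(ages)

print(q_selected)
print(q_default)
```

With these four values the 25th percentile falls between 20 and 30, which is how a result such as 27.5 can come out of data that contains only multiples of ten.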

#### Percentiles from Values (Percentile Rank)

In the opposite situation, where a percentile rank corresponding to a given value is needed, one has to devise a custom method. To begin, consider the steps involved in calculating a percentile rank.

- count the number of data points that are at or below the given value
- divide by the total number of data points
- multiply by 100

- > #calculate the percentile rank for a given value using the custom formula: length(VAR[VAR <= VAL]) / length(VAR) * 100
- > #in the sample, an age of 45 is at what percentile rank?
- > length(Age[Age <= 45]) / length(Age) * 100
- [1] 75
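
If the calculation is needed more than once, the steps above can be wrapped in a small helper. The function name percentile_rank and the sample values below are hypothetical, chosen only for illustration, and the sketch assumes the data contain no missing values.

```r
# Hypothetical helper: percentile rank of `value` within the vector `x`
# Assumes x has no NA values
percentile_rank <- function(x, value) {
  at_or_below <- length(x[x <= value])  # count points at or below the value
  at_or_below / length(x) * 100         # divide by the total, multiply by 100
}

# Hypothetical ages; these values are NOT the tutorial's sample file
ages <- c(20, 30, 40, 50)

rank_40 <- percentile_rank(ages, 40)
print(rank_40)

# An equivalent shortcut: the mean of a logical vector is the proportion of TRUEs
rank_40_alt <- mean(ages <= 40) * 100
```

Three of the four made-up ages are at or below 40, so both forms give a percentile rank of 75.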

### Summary

A very useful multipurpose function in R is summary(X), where X can be one of any number of objects, including datasets, variables, and linear models, just to name a few. When used, the command provides summary data related to the individual object that was fed into it. Thus, the summary function has different outputs depending on what kind of object it takes as an argument. Besides being widely applicable, this method is valuable because it often provides exactly what is needed in terms of summary statistics. A couple of examples of how summary(X) can be used are displayed in the following code sample. I encourage you to use the summary command often when exploring ways to analyze your data in R. This function will be revisited throughout the R Tutorial Series.

- > #summarize a variable with summary(VAR)
- > summary(Age)

The output of the preceding summary is pictured below.

- > #summarize a dataset with summary(DATAVAR)
- > summary(dataset)

Thanks so much, this is just what I need :)

Excellent stuff. I was about to give up on learning. Having a non-programming background, I find your blog very helpful.

Thanks, I'm glad the tutorials are helpful to you.

Thanks for posting. This is very helpful.

I agree. So helpful. You explain everything really clearly. Thanks!

John, this was really helpful. I wish I could see all the sessions though, since I am struggling with renaming a variable at the moment. I suspect you have that explained under Data Visualization, but I cannot access it. Is there a way I could get to this chapter? Thank you!

Thank you for your tutorial. I am able to clarify what the word "input" of a summary means.

===

"Data Visualization" is a header and not a link to an actual tutorial. The links to tutorials are contained under each header on the right-hand side of the page.

===

Sorry, I don't know what you mean by "input." The only place it appears on this page is in your comment.

Many thanks John, and happy new year.

Hi John, I'm glad I came across this tutorial. I am starting with R today and I've read my data in from a .dta (STATA) file.

I wanted to test out descriptive stat functions but I get "NA", maybe because my variables were read in as something other than numbers? Any tips would be appreciated!

Hi Cait,

Try this tutorial, which uses the Foreign package and has a specific example for importing Stata files: http://www.ats.ucla.edu/stat/r/faq/inputdata_R.htm

John

Nice, thank you very much. It is fun to learn like that :)

Hi John,

This tutorial series is incredibly clear and helpful. You are right, along the lines of your stated purpose to aggregate the bits and pieces of good R guidance out there and using consistent language - super useful. Thank you for your work!

john may i know what are the other reasons why we always run the zero-order correlations of the variables as part of our preliminary analysis of the data? can't think of any except that its the basic technique for analyzing relationships between variables.

How can I use summary() to read a numeric variable as categorical? If each number represents a group, not an actual number, how can I force the numbers to be categorical in R?
