Tutorial Files
Before we start, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains hypothetical age and income data for 20 subjects. Note that all code samples in this tutorial assume that this data has already been read into an R variable and has been attached.
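For readers who have not yet loaded the data, here is a minimal sketch of the setup that the code samples assume. The file contents and the three rows shown are hypothetical stand-ins, not the tutorial's actual sample data; only the column names Age and Income and the variable name dataset come from the tutorial.

```r
# Hypothetical stand-in for the tutorial's sample file: write a tiny CSV
# to a temporary location so this sketch runs without the download.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Age,Income", "25,30000", "40,52000", "31,41000"), csv_path)

# Read the file into a data frame (read.csv treats the first row as headers).
dataset <- read.csv(csv_path)

# attach() places the columns on the search path, so Age and Income can be
# used directly instead of dataset$Age and dataset$Income.
attach(dataset)
mean_age <- mean(Age)   # equivalent to mean(dataset$Age)
detach(dataset)         # undo attach() when finished

print(mean_age)
```

With the real file saved in your working directory, the call would presumably be read.csv with that file's name in place of csv_path.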
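Mean
In R, a mean can be calculated on an isolated variable via the mean(VAR) command, where VAR is the name of the variable whose mean you wish to compute. Alternatively, a mean can be calculated for each of the variables in a dataset by using the mean(DATAVAR) command, where DATAVAR is the name of the variable containing the data. Note that the dataset form relies on older versions of R; in current versions, mean(DATAVAR) on a data frame returns NA with a warning, and colMeans(DATAVAR) or sapply(DATAVAR, mean) gives the per-variable means instead. The code sample below demonstrates both uses of the mean function.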
- > #calculate the mean of a variable with mean(VAR)
- > #what is the mean Age in the sample?
- > mean(Age)
- [1] 32.3
- > #calculate the mean of all variables in a dataset with mean(DATAVAR)
- > #what is the mean of each variable in the dataset?
- > mean(dataset)
- Age      Income
- 32.3    34000.0
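If you are running a recent version of R, where mean() is not applied column by column to a data frame, the per-variable means can be obtained as in the following sketch. The data frame here is a small hypothetical example, not the tutorial's sample data.

```r
# Hypothetical data standing in for the tutorial's dataset.
dataset <- data.frame(Age = c(20, 30, 40),
                      Income = c(10000, 20000, 60000))

# colMeans() returns the mean of every column of an all-numeric data frame.
col_means <- colMeans(dataset)

# sapply() applies mean() to each column and yields the same named vector.
col_means_alt <- sapply(dataset, mean)

print(col_means)
```

Both calls return a named numeric vector with one entry per variable, which matches the shape of the output shown above.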
Standard Deviation
Within R, standard deviations are calculated in the same way as means. The standard deviation of a single variable can be computed with the sd(VAR) command, where VAR is the name of the variable whose standard deviation you wish to retrieve. Similarly, a standard deviation can be calculated for each of the variables in a dataset by using the sd(DATAVAR) command, where DATAVAR is the name of the variable containing the data. Note that the dataset form relies on older versions of R; in current versions, sd(DATAVAR) on a data frame is no longer supported, and sapply(DATAVAR, sd) gives the per-variable standard deviations instead. The code sample below demonstrates both uses of the standard deviation function.
- > #calculate the standard deviation of a variable with sd(VAR)
- > #what is the standard deviation of Age in the sample?
- > sd(Age)
- [1] 19.45602
- > #calculate the standard deviation of all variables in a dataset with sd(DATAVAR)
- > #what is the standard deviation of each variable in the dataset?
- > sd(dataset)
- Age          Income
- 19.45602    32306.10175
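On recent versions of R, one way to get a standard deviation for each variable is to apply sd() to every column with sapply(). The sketch below uses a small hypothetical data frame rather than the tutorial's sample data.

```r
# Hypothetical data standing in for the tutorial's dataset.
dataset <- data.frame(Age = c(20, 30, 40),
                      Income = c(10000, 20000, 60000))

# Apply sd() to each column; the result is a named numeric vector.
col_sds <- sapply(dataset, sd)

# sd() is the sample standard deviation (n - 1 in the denominator),
# so it can be reproduced by hand for a single variable.
age <- dataset$Age
sd_by_hand <- sqrt(sum((age - mean(age))^2) / (length(age) - 1))

print(col_sds)
```

The hand calculation is included only to make clear which formula sd() uses.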
Minimum and Maximum
Keeping with the pattern, a minimum can be computed on a single variable using the min(VAR) command. The maximum, via max(VAR), operates identically. However, in contrast to the mean and standard deviation functions, min(DATAVAR) or max(DATAVAR) will retrieve the minimum or maximum value from the entire dataset, not from each individual variable. Therefore, it is recommended that minimums and maximums be calculated on individual variables, rather than entire datasets, in order to produce more useful information. The sample code below demonstrates the use of the min and max functions.
- > #calculate the min of a variable with min(VAR)
- > #what is the minimum age found in the sample?
- > min(Age)
- [1] 5
- > #calculate the max of a variable with max(VAR)
- > #what is the maximum age found in the sample?
- > max(Age)
- [1] 70
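The difference between whole-dataset and per-variable results can be seen in a short sketch. The data frame below is hypothetical, and the use of sapply() to get one value per variable is my suggestion rather than something the sample above relies on.

```r
# Hypothetical data standing in for the tutorial's dataset.
dataset <- data.frame(Age = c(20, 30, 40),
                      Income = c(10000, 20000, 60000))

# On a whole data frame, min() and max() return a single value taken
# across every column, which mixes ages with incomes.
overall_min <- min(dataset)
overall_max <- max(dataset)

# Applying the functions column by column gives one value per variable.
col_mins <- sapply(dataset, min)
col_maxs <- sapply(dataset, max)

print(col_mins)
print(col_maxs)
```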
Range
The range of a particular variable, that is, its maximum and minimum, can be retrieved using the range(VAR) command. As with the min and max functions, using range(DATAVAR) is not very useful, since it considers the entire dataset, rather than each individual variable. Consequently, it is recommended that ranges also be computed on individual variables. This operation is demonstrated in the following code sample.
- > #calculate the range of a variable with range(VAR)
- > #what range of age values are found in the sample?
- > range(Age)
- [1]  5 70
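As a small extension, the sketch below computes the range of every variable at once and also the width of a range as a single number. The data frame is hypothetical, and the variable names col_ranges and age_spread are introduced here purely for illustration.

```r
# Hypothetical data standing in for the tutorial's dataset.
dataset <- data.frame(Age = c(20, 30, 40),
                      Income = c(10000, 20000, 60000))

# sapply() with range() returns a matrix: one column per variable,
# with the minimum in row 1 and the maximum in row 2.
col_ranges <- sapply(dataset, range)

# range() returns two values; diff() turns them into max - min.
age_spread <- diff(range(dataset$Age))

print(col_ranges)
```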
Values from Percentiles (Quantiles)
Given a dataset and a desired percentile, a corresponding value can be found using the quantile(VAR, c(PROB1, PROB2,…)) command. Here, VAR refers to the variable name and PROB1, PROB2, etc., are probability values. The probabilities must be between 0 and 1, so they are the decimal forms of the desired percentiles (e.g. 50% = 0.5). The first example below shows how this function can be used to find the data values that correspond to desired percentiles.
Note that the quantile(VAR) command can also be used on its own. When probabilities are not specified, the function defaults to computing the 0, 25, 50, 75, and 100 percentile values, as shown in the second example below.
- > #calculate desired percentile values using quantile(VAR, c(PROB1, PROB2,...))
- > #what are the 25th and 75th percentiles for age in the sample?
- > quantile(Age, c(0.25, 0.75))
- 25%      75%
- 17.75    44.25
- > #calculate the default percentile values using quantile(VAR)
- > #what are the 0, 25, 50, 75, and 100 percentiles for age in the sample?
- > quantile(Age)
- 0%      25%     50%     75%    100%
- 5.00   17.75   30.00   44.25   70.00
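It may help to see why quantile() can return values that do not appear in the data. By default the function interpolates linearly between neighboring data points. The sketch below uses four hypothetical values, not the tutorial's sample data, to show this.

```r
# Hypothetical values chosen so the interpolation is easy to follow.
x <- c(10, 20, 30, 40)

# With the default method, the 25th percentile falls three quarters of
# the way from the 1st to the 2nd sorted value: 10 + 0.75 * 10 = 17.5.
q <- quantile(x, c(0.25, 0.75))

# The 50th percentile equals the median.
q_half <- quantile(x, 0.5)

print(q)
```

The result is a named vector, so an individual value can be pulled out with, for example, q["25%"].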
Percentiles from Values (Percentile Rank)
In the opposite situation, where a percentile rank corresponding to a given value is needed, one has to devise a custom method. To begin, consider the steps involved in calculating a percentile rank.
- count the number of data points that are at or below the given value
- divide by the total number of data points
- multiply by 100
- > #calculate the percentile rank for a given value using the custom formula: length(VAR[VAR <= VALUE]) / length(VAR) * 100
- > #in the sample, an age of 45 is at what percentile rank?
- > length(Age[Age <= 45]) / length(Age) * 100
- [1] 75
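If you need percentile ranks often, the three steps above can be wrapped in a small helper. This is only a sketch: the function name percentile_rank and the ages vector are hypothetical, and missing values are not handled.

```r
# Hypothetical helper that follows the three steps listed above.
percentile_rank <- function(x, value) {
  at_or_below <- sum(x <= value)   # count points at or below the value
  at_or_below / length(x) * 100    # divide by n, then multiply by 100
}

# Hypothetical ages, not the tutorial's sample data.
ages <- c(5, 12, 30, 45, 70)

rank_45 <- percentile_rank(ages, 45)   # 4 of the 5 values are <= 45

# The built-in ecdf() gives the same proportion, before scaling by 100.
rank_45_ecdf <- ecdf(ages)(45) * 100

print(rank_45)
```

Here sum(x <= value) counts the TRUE entries, which is equivalent to the length(VAR[VAR <= ...]) form used in the sample.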
Summary
A very useful multipurpose function in R is summary(X), where X can be one of any number of objects, including datasets, variables, and linear models, just to name a few. When used, the command provides summary data related to the individual object that was fed into it. Thus, the summary function has different outputs depending on what kind of object it takes as an argument. Besides being widely applicable, this method is valuable because it often provides exactly what is needed in terms of summary statistics. A couple of examples of how summary(X) can be used are displayed in the following code samples. I encourage you to use the summary command often when exploring ways to analyze your data in R. This function will be revisited throughout the R Tutorial Series.
- > #summarize a variable with summary(VAR)
- > summary(Age)
The output of the preceding summary is pictured below.
- > #summarize a dataset with summary(DATAVAR)
- > summary(dataset)
The output of the preceding summary is pictured below.
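To illustrate how summary() adapts to its argument, here is a minimal sketch that calls it on a variable, a dataset, and a linear model. The data frame and the model Income ~ Age are hypothetical and chosen only to show the different kinds of result.

```r
# Hypothetical data standing in for the tutorial's dataset.
dataset <- data.frame(Age = c(20, 30, 40, 50),
                      Income = c(10000, 20000, 60000, 50000))

# On a numeric variable: minimum, quartiles, median, mean, and maximum.
age_summary <- summary(dataset$Age)

# On a data frame: the same statistics, laid out per variable.
data_summary <- summary(dataset)

# On a linear model: coefficients, residuals, R-squared, and more.
fit <- lm(Income ~ Age, data = dataset)
fit_summary <- summary(fit)

print(age_summary)
print(data_summary)
```

Individual statistics can be extracted by name, for example age_summary["Median"].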