# Summarising Data In R

R provides a number of simple summary functions for use on R objects. Typical ones are length(), mean(), median(), max(), min(), range() and summary(). For example:

```
> x <- c(5,3,3,6,7,9,2,4,1,9,7,9,4,4,8)
> length(x)
[1] 15
> mean(x)
[1] 5.4
> median(x)
[1] 5
> min(x)
[1] 1
> max(x)
[1] 9
> range(x)
[1] 1 9
```

On simple vectors, summary() gives the minimum and maximum, the median, the mean, and the first and third quartiles (whose difference is the interquartile range):

```
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     3.5     5.0     5.4     7.5     9.0
```
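The quartiles reported by summary() can also be requested directly. As a small sketch using the same vector x (quantile() and IQR() are base R functions that this page does not otherwise cover):

```r
x <- c(5, 3, 3, 6, 7, 9, 2, 4, 1, 9, 7, 9, 4, 4, 8)

# First and third quartiles, as reported by summary(x)
q <- quantile(x, c(0.25, 0.75))
print(q)

# The interquartile range is the distance between the two quartiles
iqr <- IQR(x)
print(iqr)
```

Here the quartiles come out as 3.5 and 7.5, matching the summary() output, so the interquartile range is 4.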

The summary() command can also be used on data frames and other more complicated objects. For example, here is the Santa Claus example once again:

```
> summary(Santa)
  Believe             Age           Gender      Presents    Behaviour
 Mode :logical   Min.   : 4.00   female:23   Min.   : 3.0   bad :26
 FALSE:25        1st Qu.: 5.00   male  :27   1st Qu.:20.0   good:24
 TRUE :25        Median : 7.00               Median :26.5
                 Mean   : 6.86               Mean   :27.0
                 3rd Qu.: 9.00               3rd Qu.:33.5
                 Max.   :10.00               Max.   :57.0
```

Other useful commands for matrices and data frames include dim(), which gives the dimensions, and rownames() and colnames(), which find out (or set) the row and column names:

```
> dim(Santa)
[1] 50  5
> colnames(Santa)
[1] "Believe"   "Age"       "Gender"    "Presents"  "Behaviour"
```
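The example above only reads the names. As a sketch of how they can be set, the following uses a small made-up data frame (the name kids and its values are purely illustrative, not part of the Santa data):

```r
# A small, made-up data frame for illustration only
kids <- data.frame(Age = c(4, 7, 9), Presents = c(20, 26, 34))

d <- dim(kids)   # number of rows, then number of columns
print(d)

# Assigning to colnames() or rownames() replaces the existing names
colnames(kids) <- c("AgeYears", "NumPresents")
rownames(kids) <- c("a", "b", "c")
print(kids)
```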

## Standard Deviation & Variance

You get the standard deviation of a vector x with sd(x), or its variance with var(x). For example, using the ages in the Santa data:

```
> sd(Santa$Age)
[1] 2.11901
> var(Santa$Age)
[1] 4.490204
```

We can also check that squaring the standard deviation gives the variance:

```
> sd(Santa$Age) ^ 2
[1] 4.490204
```
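To see what var() and sd() compute, here is a minimal sketch that works out the sample variance by hand. It uses the vector x from the top of the page rather than the Santa data (which is not reproduced here), and relies on the fact that R divides by n - 1 rather than n:

```r
x <- c(5, 3, 3, 6, 7, 9, 2, 4, 1, 9, 7, 9, 4, 4, 8)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

# all.equal() compares numbers with a tolerance for floating point error
print(all.equal(manual_var, var(x)))
print(all.equal(sqrt(manual_var), sd(x)))
```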

What if we wanted to break up the data by gender?

```
> table(Santa$Gender)

female   male
    23     27
```

On the page about manipulating data you were shown that a vector of logical values can be used to select elements from a matrix or data frame. We can use this to extract only the "male" data.

```
> Santa$Gender == "male"
 [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
[10] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
[19] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
[28] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
[37]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[46]  TRUE FALSE FALSE  TRUE  TRUE
> Santa[Santa$Gender == "male",]
   Believe Age Gender Presents Behaviour
1    FALSE   9   male       25   naughty
2     TRUE   5   male       20      nice
4     TRUE   4   male       34   naughty
...
46    TRUE   4   male       34   naughty
49    TRUE  10   male       57      nice
50   FALSE   4   male        3   naughty
```

In particular, we can get the boys' ages like this:

```
> Santa[Santa$Gender == "male","Age"]
 [1]  9  5  4  4  6  5  7  5  5  4  8  8  9  4 10  7  5  6
[19]  7  8  4  8  9  6  4 10  4
```

Or like this, which you might find clearer:

```
> Santa$Age[Santa$Gender == "male"]
 [1]  9  5  4  4  6  5  7  5  5  4  8  8  9  4 10  7  5  6
[19]  7  8  4  8  9  6  4 10  4
```
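The two forms select the same values. Since the Santa data is not reproduced here, the following sketch checks this on a small made-up data frame (kids and its values are illustrative only):

```r
# Made-up data for illustration only (not the real Santa data)
kids <- data.frame(
  Age    = c(9, 5, 8, 4, 10),
  Gender = factor(c("male", "male", "female", "male", "female"))
)

is_male <- kids$Gender == "male"   # logical vector, one entry per row
print(is_male)

# Two equivalent ways to pick out the ages of the male rows
a1 <- kids[is_male, "Age"]
a2 <- kids$Age[is_male]
print(identical(a1, a2))

# Summing a logical vector counts its TRUE values
print(sum(is_male))
```

Storing the logical vector in a variable such as is_male avoids repeating the comparison when several columns are needed for the same group.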

Now that we have a vector of the boys' ages, it is trivial to get their standard deviation and a quick summary of the distribution:

```
> sd(Santa[Santa$Gender == "male","Age"])
[1] 2.038099
> summary(Santa[Santa$Gender == "male","Age"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   4.500   6.000   6.333   8.000  10.000
```

For the girls this becomes:

```
> sd(Santa[Santa$Gender == "female","Age"])
[1] 2.086092
> summary(Santa[Santa$Gender == "female","Age"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   5.500   8.000   7.478   9.000  10.000
```

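As you can see, the range and the standard deviation are about the same for the two groups, but the girls seem to be older. Rather than subsetting by hand for each group, we can compute a statistic for every group in a single call: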

```
> tapply(Santa$Age, Santa$Gender, sd)
  female     male
2.086092 2.038099
```

Or, using summary(), we get:

```
> tapply(Santa$Age, Santa$Gender, summary)
$female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   5.500   8.000   7.478   9.000  10.000

$male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   4.500   6.000   6.333   8.000  10.000
```

This used the "table apply" function, tapply(), to apply the function sd() or summary() to the vector Santa$Age after splitting it into groups according to the factor Santa$Gender (!).
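As a self-contained sketch of the same idea, the following applies tapply() to a small made-up data frame (kids and its values are illustrative only, not the Santa data) and compares the result with subsetting each group by hand:

```r
# Made-up data for illustration only (not the real Santa data)
kids <- data.frame(
  Age    = c(9, 5, 8, 4, 10),
  Gender = factor(c("male", "male", "female", "male", "female"))
)

# One mean per level of the factor, with each value named by its level
group_means <- tapply(kids$Age, kids$Gender, mean)
print(group_means)

# The same numbers, obtained by subsetting each group by hand
print(mean(kids$Age[kids$Gender == "female"]))
print(mean(kids$Age[kids$Gender == "male"]))
```

The first argument is the data to summarise, the second is the grouping factor, and the third is the function to apply to each group.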