28 September 2016
Download infant_tidy.rds from the course website and import it. This is a tidied version of the infant data that we used in the Data Handling practical.
Use the complete.cases function (from the stats package) to create a logical vector indicating cases (rows) with complete data (no missing values).
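As a minimal sketch of how complete.cases behaves, the example below uses the built-in airquality data rather than the infant data, so the object name ok is illustrative only.

```r
# Illustration on the built-in airquality data (not the infant data)
data(airquality)

# TRUE for rows with no missing value in any column, FALSE otherwise
ok <- complete.cases(airquality)

# One element per row of the data frame
length(ok) == nrow(airquality)

# Number of incomplete (FALSE) and complete (TRUE) rows
table(ok)
```

The resulting logical vector can be passed directly to arguments that expect a row selection.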
Fit a linear model regressing the infant birth weight on gestation and the maternal characteristics, including race, using the subset argument to fit the model to the complete cases only.
Use update to add the smoking variables, including number, to the model. Compare the two models.
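The fit, update and compare steps can be sketched on the built-in airquality data. The variables below stand in for the infant variables, and anova is assumed as the comparison function, since it is a common choice for nested linear models.

```r
# Illustration on airquality; the variables are stand-ins for the infant data
data(airquality)
ok <- complete.cases(airquality)

# Base model, fitted to the complete cases only
fit1 <- lm(Ozone ~ Temp, data = airquality, subset = ok)

# Add further terms; `. ~ .` keeps the existing response and predictors
fit2 <- update(fit1, . ~ . + Wind + Solar.R)

# F test comparing the two nested models (assumed comparison method)
anova(fit1, fit2)
```

Restricting both fits to the complete cases matters here: anova needs the two models to be fitted to the same rows, which would not be guaranteed if each model dropped missing values separately.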
The following code plots a histogram of a variable from the trees example data set, then overlays a density curve:
    his <- hist(trees$Girth, freq = FALSE)
    dens <- density(trees$Girth)
    ymax <- max(his$density, dens$y)
    plot(his, freq = FALSE, ylim = c(0, ymax),
         xlab = "Girth", main = "Histogram of Girth")
    lines(dens)
This code is provided in the file Practical_5_Starter_Code.R on the course website, along with the other code chunks displayed in this worksheet. Run the code to try it out. Use this starter code to create a function enabling you to make this type of plot for any variable x, with a custom label for the x axis and a custom title.
Use your new function to re-create the plot for the Girth variable from the trees data. Then use it again to create a similar plot for the Height variable. Make sure you can update the label for the x axis and the title using your function!
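One possible shape for such a function is sketched below. The name hist_dens and its argument defaults are hypothetical. Unlike the starter code, the sketch computes the histogram with plot = FALSE, so that only the final plot is drawn.

```r
# hist_dens is a hypothetical name; xlab and main set the axis label and title
hist_dens <- function(x, xlab = "x", main = "Histogram of x") {
  his <- hist(x, plot = FALSE)        # compute the bins without drawing
  dens <- density(x)                  # kernel density estimate
  ymax <- max(his$density, dens$y)    # make room for the taller of the two
  plot(his, freq = FALSE, ylim = c(0, ymax), xlab = xlab, main = main)
  lines(dens)                         # overlay the density curve
  invisible(his)                      # return the histogram object quietly
}

hist_dens(trees$Girth, xlab = "Girth", main = "Histogram of Girth")
hist_dens(trees$Height, xlab = "Height", main = "Histogram of Height")
```

Passing the variable itself (rather than its name) keeps the function independent of any particular data frame.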
The following code creates a ggplot version of a plot from the R Orientation notes:
    library(ggplot2)
    mycol <- c("blue", "orange", "green")
    ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
      geom_point() +
      scale_color_manual(values = mycol) +
      xlab("Petal Length") +
      ylab("Petal Width") +
      theme_light() +
      theme(legend.key = element_blank()) # remove boxes in legend
Use this code to create a function enabling you to make this type of plot for x and y variables from any data set, with any grouping variable and with custom colours. Put the call to library(ggplot2) at the start of your function to ensure this package is loaded when the function is called. Use aes_string to specify the aesthetics. Set the default value for the point colours to the colours used in the starter code.
Use your new function to re-create the starter plot of Petal.Length and Petal.Width. Then use it to create a similar plot for the sepal variables, Sepal.Width and Sepal.Length. Try changing the colours.
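A sketch of one way to write this function follows. The name scatter_by and the arguments x_label and y_label are hypothetical. Note that aes_string is deprecated in recent ggplot2 releases, so it may produce a warning there even though it still works.

```r
# scatter_by is a hypothetical name; x, y and group are column names as strings
scatter_by <- function(data, x, y, group,
                       cols = c("blue", "orange", "green"),
                       x_label = x, y_label = y) {
  library(ggplot2)
  ggplot(data, aes_string(x = x, y = y, color = group)) +
    geom_point() +
    scale_color_manual(values = cols) +
    xlab(x_label) +
    ylab(y_label) +
    theme_light() +
    theme(legend.key = element_blank())  # remove boxes in legend
}

# Re-create the starter plot, then a sepal version with other colours
scatter_by(iris, "Petal.Length", "Petal.Width", "Species",
           x_label = "Petal Length", y_label = "Petal Width")
scatter_by(iris, "Sepal.Length", "Sepal.Width", "Species",
           cols = c("purple", "darkgreen", "red"))
```

The colour vector needs one entry per level of the grouping variable, which is three for Species.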
The football data presented in the talk was originally stored in text files with fixed width columns. Some example files are given on the course website: “2008-9.txt” and “2009-10.txt”. Open one of the files (e.g. in your web browser or in a text editor) to look at the format. This is a fixed width format: each column has 3 characters and spaces delimit the columns. It is tricky to read in because the missing values are given as 3 spaces.
This data can be read in with import, but in this case it is easier to use the read_table function from readr:
    library(readr)
    library(dplyr)
    read_table("2008-9.txt") %>%
      rename(Home = X1)
Since rio depends on readr you will already have readr installed. As there is no column name for the first column, it is automatically named X1; the above code renames the column as Home, since it represents the home team.
The scores for each game are spread across multiple columns, with one column per away team. Load the tidyr package, then extend the data pipeline above to gather the scores into a column named Score, keyed by a variable named Away for the away team. When the home team and away team are the same, the score is an empty character string "". Continue the pipeline to filter out these values. Finally, separate the score into two new variables.
Create a new function to run the whole data pipeline, from importing the data to separating the scores. Load readr, tidyr and dplyr inside the function. Write the function with one argument that allows you to change the name of the data file. Test your function on the 2008-9 season data.
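Before wrapping everything in a function, the gather, filter and separate steps can be tried on a small made-up table. In the sketch below the teams and scores are invented, the scores are assumed to look like "2-1", and HomeGoals and AwayGoals are hypothetical names for the two new variables.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the imported football table; all values are made up
scores <- tibble(
  Home = c("A", "B", "C"),
  A = c("", "2-1", "0-0"),
  B = c("1-1", "", "3-2"),
  C = c("0-2", "1-0", "")
)

long <- scores %>%
  gather(key = "Away", value = "Score", -Home) %>%  # one row per home/away pair
  filter(Score != "") %>%                           # drop home team vs itself
  separate(Score, into = c("HomeGoals", "AwayGoals"),
           sep = "-", convert = TRUE)               # split and convert to integer

long
```

Setting convert = TRUE turns the separated pieces into integers, which makes later summaries of goals easier.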
Load the purrr package. Use map_df to map a list of the names of the example football files to the argument of your new function, so that each data file is read in, processed and added to a combined data frame. Name the elements of the list passed to map_df, so that you can use the .id argument of map_df to add a column identifying the data for each season.
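The effect of naming the list elements can be seen in the following sketch. Here read_season is a hypothetical stand-in for your pipeline function: it only records the file name rather than reading a file, so that the example runs without the data files.

```r
library(purrr)

# Hypothetical stand-in for the pipeline function; it does not read the file
read_season <- function(file) {
  data.frame(File = file, stringsAsFactors = FALSE)
}

# The names of the list elements become the values of the .id column
files <- list("2008-9" = "2008-9.txt", "2009-10" = "2009-10.txt")

# Apply the function to each file name and row-bind the results
all_seasons <- map_df(files, read_season, .id = "Season")
all_seasons
```

With an unnamed list, the .id column would hold the position of each element instead of a season label.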
Optional extra - for those of you that are still keen!
Load the knitr and forcats packages. If you did not do the extra activity in Practical 1, you may need to install forcats.
We are going to create some frequency tables of variables in the infant data. The smoke variable contains some missing values and we would like to include these in the table. The fct_explicit_na function in forcats will expand the levels of a factor to create a new level for the missing values; see the help file for more detail.
Start a data pipeline with the infant data, then use mutate to create a new factor based on smoke, with an extra level for the missing values. Continuing the pipeline, group by your new factor, then summarise each group by counting the number of values with the function n. End your pipeline with a call to kable to create a markdown version of the frequency table, in which the first column is left-aligned, the second column is centre-aligned and the columns are given descriptive labels.
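The pipeline described above can be sketched on a toy data frame. The values are made up, smoke2 is a hypothetical name for the new factor, and the column label "Frequency" is an assumption. In recent forcats releases fct_explicit_na is deprecated in favour of fct_na_value_to_level, so a warning may appear.

```r
library(dplyr)
library(forcats)
library(knitr)

# Toy stand-in for the infant data; the values are made up
toy <- data.frame(
  smoke = factor(c("never", "now", NA, "never", NA, "now", "never"))
)

tab <- toy %>%
  mutate(smoke2 = fct_explicit_na(smoke, na_level = "(Missing)")) %>%
  group_by(smoke2) %>%
  summarise(Frequency = n())   # n() counts the rows in each group

# Left-align the categories, centre the counts, and label the columns
kable(tab, format = "markdown", align = c("l", "c"),
      col.names = c("Smoking history", "Frequency"))
```

Without the explicit missing level, the missing values would not appear as a named category in the table.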
Create a function from your data pipeline so that you can create a kable for any variable, with a custom label for the category column. Use the rename trick to be able to use a character string to specify the variable to be tabulated. Use your function to recreate the table for the smoking category.
Use pmap to map values in parallel to the two arguments of your function, as follows:

    var         label
    “smoke”     “Smoking history”
    “time”      “Time since quitting”
    “number”    “Cigarettes/day”
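A sketch of the parallel mapping is given below, under several assumptions: freq_table is a hypothetical name, the toy data are made up, and the rename trick is taken to mean renaming the chosen column to a fixed name (here value) so that the rest of the pipeline can refer to it directly. The sketch uses rename with all_of, which may differ from the approach shown in the talk.

```r
library(dplyr)
library(forcats)
library(knitr)
library(purrr)

# freq_table is a hypothetical name; var is a column name given as a string
freq_table <- function(data, var, label) {
  data %>%
    rename(value = all_of(var)) %>%                      # fixed name for later steps
    mutate(value = fct_explicit_na(factor(value))) %>%   # explicit missing level
    count(value) %>%                                     # frequency of each level
    kable(format = "markdown", align = c("l", "c"),
          col.names = c(label, "Frequency"))
}

# Toy stand-in for the infant data; the values are made up
toy <- data.frame(
  smoke = c("never", "now", NA, "never"),
  time = c("never", "<1 year", NA, "never"),
  number = c("0", "1-4", "1-4", NA)
)

# One list element per function argument; pmap steps through them together
args <- list(
  var = c("smoke", "time", "number"),
  label = c("Smoking history", "Time since quitting", "Cigarettes/day")
)

tables <- pmap(args, function(var, label) freq_table(toy, var, label))
tables[[1]]
```

Because the list elements are named var and label, pmap matches them to the function arguments by name rather than by position.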