This homework is going to focus on cleaning data. We have data on UK gas consumption and UK driver deaths, I want to look at the relationship between the two. Unfortunately, we have monthly data for deaths and quarterly data for gas consumption. We will need to do some work before we can compare them.

1 Setup

Download some data:

library(tidyverse)
ukg = read_csv("https://codowd.com/bigdata/hw/hw6/ukgas.csv")
ukdd  = read_csv("https://codowd.com/bigdata/hw/hw6/ukdriverdeaths.csv")

2 Warning

This may not be easy or straightforward. The end product is a plot and a regression. But cleaning data is a right pain. We will convert from wide to long, deal with differing frequencies of data, deal with poorly formatted variables, etc.

There are dozens of correct ways to deal with each step. I expect the hardest subparts to be (4), then (2), then (3). For (2) and (3), a for loop with no more than 12 ‘if’ statements can solve the problem – though there are certainly much neater ways. For (4), there are several ways to solve the problem, but it would be very easy to get stuck here - if you do, please email me or post on canvas with your thoughts.

3 Work

Plot quarterly deaths and quarterly gas consumption over time (on the same plot), then regress one against the other.

  1. Convert both datasets from wide to long data.
  2. Add a variable “quarter” to ukdd, which indicates which quarter (1,2,3, or 4) each month is in.
  3. Edit the variable containing “quarter” in ukg so that it it contains only a number (1,2,3, or 4), not (Q1, Q2, Q3, Q4). [hint: str_sub could help, and the package stringr more generally is your friend] Be sure to convert it to a number with “as.numeric”.
  4. Create a new dataset from ukdd, such that we have the total deaths in each quarter and year. (Jan, Feb, Mar are in Q1; Apr, May, Jun are in Q2; Jul, Aug, Sep in Q3; Oct, Nov, Dec in Q4. Instead of knowing deaths in Jan, Feb, Mar, we want deaths in Q1, which is the sum of deaths in Jan, Feb, Mar)
  5. Merge the two quarterly datasets together, merging on year and quarter. (do an inner join)
  6. Create a new variable “time” in the merged data which is the year + (quarter-1)/4 – so that Q1 in 1975 has a value of 1975.0 and Q3 is 1975.5.
  7. Plot quarterly UK deaths (as a line) and quarterly UK gas consumption (as a line) against time (on the same plot).
  8. Make a scatterplot of deaths against gas consumption, (optional: coloring each point by the quarter).
  9. Run a linear regression of deaths against gas consumption, and show the regression summary.
  10. In 2 sentences, describe any relationship.

4 Submission

Due Wed May 12th at 11:59:59 pm.