This homework is going to focus on cleaning data. We have data on UK gas consumption and UK driver deaths, I want to look at the relationship between the two. Unfortunately, we have monthly data for deaths and quarterly data for gas consumption. We will need to do some work before we can compare them.
Download some data:
library(tidyverse)
ukg = read_csv("https://codowd.com/bigdata/hw/hw6/ukgas.csv")
ukdd = read_csv("https://codowd.com/bigdata/hw/hw6/ukdriverdeaths.csv")
This may not be easy or straightforward. The end product is a plot and a regression. But cleaning data is a right pain. We will convert from wide to long, deal with differing frequencies of data, deal with poorly formatted variables, etc.
There are dozens of correct ways to deal with each step. I expect the hardest subparts to be (4), then (2), then (3). For (2) and (3), a for loop with no more than 12 ‘if’ statements can solve the problem – though there are certainly much neater ways. For (4), there are several ways to solve the problem, but it would be very easy to get stuck here - if you do, please email me or post on canvas with your thoughts.
Plot quarterly deaths and quarterly gas consumption over time (on the same plot), then regress one against the other.
ukdd
, which indicates which quarter (1,2,3, or 4) each month is in.ukg
so that it it contains only a number (1,2,3, or 4), not (Q1, Q2, Q3, Q4). [hint: str_sub
could help, and the package stringr
more generally is your friend] Be sure to convert it to a number with “as.numeric”.ukdd
, such that we have the total deaths in each quarter and year. (Jan, Feb, Mar are in Q1; Apr, May, Jun are in Q2; Jul, Aug, Sep in Q3; Oct, Nov, Dec in Q4. Instead of knowing deaths in Jan, Feb, Mar, we want deaths in Q1, which is the sum of deaths in Jan, Feb, Mar)Due Wed May 12th at 11:59:59 pm.