```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` This homework is going to focus on cleaning data. We have data on UK gas consumption and UK driver deaths, I want to look at the relationship between the two. Unfortunately, we have monthly data for deaths and quarterly data for gas consumption. We will need to do some work before we can compare them. # Setup Download some data: ```{r, message=F} library(tidyverse) ukg = read_csv("https://codowd.com/bigdata/hw/hw6/ukgas.csv") ukdd = read_csv("https://codowd.com/bigdata/hw/hw6/ukdriverdeaths.csv") ``` # Warning This may not be easy or straightforward. The end product is a plot and a regression. But cleaning data is a right pain. We will convert from wide to long, deal with differing frequencies of data, deal with poorly formatted variables, etc. There are dozens of correct ways to deal with each step. I expect the hardest subparts to be (4), then (2), then (3). For (2) and (3), a for loop with no more than 12 'if' statements can solve the problem -- though there are certainly much neater ways. For (4), there are several ways to solve the problem, but it would be very easy to get stuck here - if you do, please email me or post on canvas with your thoughts. # Work Plot quarterly deaths and quarterly gas consumption over time (on the same plot), then regress one against the other. 1. Convert both datasets from wide to long data. 2. Add a variable "quarter" to `ukdd`, which indicates which quarter (1,2,3, or 4) each month is in. 3. Edit the variable containing "quarter" in `ukg` so that it it contains only a number (1,2,3, or 4), not (Q1, Q2, Q3, Q4). [hint: `str_sub` could help, and the package `stringr` more generally is your friend] Be sure to convert it to a number with "as.numeric". 4. Create a new dataset from `ukdd`, such that we have the total deaths in each quarter and year. (Jan, Feb, Mar are in Q1; Apr, May, Jun are in Q2; Jul, Aug, Sep in Q3; Oct, Nov, Dec in Q4. Instead of knowing deaths in Jan, Feb, Mar, we want deaths in Q1, which is the sum of deaths in Jan, Feb, Mar) 4. Merge the two quarterly datasets together, merging on year and quarter. (do an inner join) 5. Create a new variable "time" in the merged data which is the year + (quarter-1)/4 -- so that Q1 in 1975 has a value of 1975.0 and Q3 is 1975.5. 6. Plot quarterly UK deaths (as a line) and quarterly UK gas consumption (as a line) against time (on the same plot). 7. Make a scatterplot of deaths against gas consumption, (optional: coloring each point by the quarter). 8. Run a linear regression of deaths against gas consumption, and show the regression summary. 9. In 2 sentences, describe any relationship. # Submission Due Wed May 12th at 11:59:59 pm.