```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
setwd("~/Github/bigdata/predictions/")
library(tidyverse)
# Data from CDC trend page.
cases = read_csv("cdc_cases.csv", skip = 2)
```

Most recent data is from Thurs May 6th.

I'm going to start with a very basic model. Let's call it the log diff-in-diff model. Basically: this Sunday / last Sunday = this Thursday / last Thursday. Three of those four inputs are known, so solving for this Sunday gives the forecast.

- *Basic intuition*: This naturally builds in a Sunday fixed effect. It uses one week of changes in the most recent data to predict.
- *Natural complications*:
  - Use the percent change in the 7-day moving average for the forecast instead.

# Cleaning

```{r}
length(unique(cases$State)) # Only US. Drop the col.
cases = cases %>% select(-State)

# Check the back end of the rolling avg.
tail(cases) # Interesting.

# Need to make Date more usable.
library(lubridate)
cases = cases %>%
  mutate(Date = mdy(Date)) %>% # Convert to Date from chr
  mutate(dow = wday(Date))     # Day of week (Sun = 1, Thurs = 5).

# Fix names
cases = cases %>% rename(cases = `New Cases`, avg = `7-Day Moving Avg`, date = Date)

small = cases %>% filter(dow == 5 | dow == 1) # Grab Thurs and Sun

# Build predictor vars -- one-week-old cases and avgs.
# (Data runs newest-first, so lead() looks one week back.)
small = small %>% mutate(week_old_cases = lead(cases, n = 2),
                         week_old_avg   = lead(avg, n = 2))

# Basic model input
small = small %>% mutate(ratio.cases = cases / week_old_cases,
                         ratio.avg   = avg / week_old_avg)

# Compare the ratios
ggplot(small, aes(x = ratio.cases, y = ratio.avg)) + geom_point()
```

Extreme outliers (ratios > 20) -- probably very early in the pandemic.

```{r}
small %>% filter(ratio.cases > 20)
# Yup -- those are all from before April 2020. Just drop dates before May 2020.
small = small %>% filter(date > ymd("2020-04-29"))

# Plot again
ggplot(small, aes(x = ratio.cases, y = ratio.avg, col = date)) +
  geom_point() +
  geom_smooth(method = "lm", col = 3) +
  geom_abline(slope = 1, intercept = 0, col = 1, lty = 2)
# Really expect it to lie on that 45-degree line.
# Not a huge problem.
# But probably a good reason to think about the avg,
# not the daily. Will compare shortly.
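
# (Added sketch) Quantify how tightly the two ratios agree after the
# filter: correlation and median absolute gap between the daily and
# 7-day-avg ratios. Assumes the ratio columns built above; the summary
# names are arbitrary.
small %>% summarize(cor     = cor(ratio.cases, ratio.avg, use = "complete.obs"),
                    med_gap = median(abs(ratio.cases - ratio.avg), na.rm = TRUE))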
```

# Modelling

```{r}
small = small %>% mutate(pred.avg   = ratio.avg   * lead(cases, n = 1),
                         pred.cases = ratio.cases * lead(cases, n = 1))

truth = small %>% filter(dow == 1) %>% select(cases) %>% as.matrix()
preds = small %>% filter(dow == 5) %>% select(pred.avg, pred.cases) %>% as.matrix()

# Differing numbers of rows. Need to make sure I line up dates correctly.
tail(small)
tail(small %>% select(date, dow, cases, pred.avg))
# Looks like I drop the first and last rows in preds, and the last in truth.
preds = preds[-c(1, nrow(preds)), ]
truth = truth[-length(truth)]
truth2 = cbind(truth, truth) # Duplicate truth to match preds' two columns.

errs = preds / truth2 - 1 # Multiplicative model: a perfect prediction has ratio 1, so err = 0.
colMeans(errs^2)
# Looks like the last-day ratio outperforms the avg.
# Closer to the 'trend', I guess?

dates = small %>% filter(dow == 1) %>% select(date) %>% as.matrix()
dates = dates[-nrow(dates), ]
dates = ymd(dates)

df = data.frame(preds = preds[, 2], truth = truth, errs = errs[, 2], dates = dates)
ggplot(df, aes(x = log(preds), y = log(truth), col = errs)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, col = 2) +
  geom_smooth(method = "lm")

# Is there any patterning over time?
ggplot(df, aes(x = log(preds), y = log(truth), col = dates)) +
  geom_point()

# Maybe percentage errors are getting smaller?
# (Careful: ^ is right-associative in R, so errs^2^0.5 is errs^sqrt(2),
# which is NaN for negative errors. Use abs() instead.)
ggplot(df, aes(x = dates, y = abs(errs))) +
  geom_point() + geom_smooth()
# They got smaller... but seem to be growing again?
```

**Notice**:

- I came to the data with two _full_ models.
  - No coefficients to estimate. Nothing.
  - They were "ratio * last Sunday" and "alt ratio * last Sunday".
- That means every one of these predictions is out of sample.
  - None of them were used for fitting the model.
  - (Ish -- I've obviously seen the gist of this data before.)

*A couple of further notes*:

- Looks like in August there was some kind of fundamental shift. I should probably subset the data to the post-August period.
- I should also _in principle_ consider models that are midpoints between the two here.
  - Instead of all the weight on the Thursday ratio, or 1/7 weight on each of the last seven days (the moving average),
  - I should eyeball some other weights.
- Right now my mean absolute percentage error is approximately 12%. That leaves... a lot of room to improve.
- My root mean squared percentage error is closer to 17%.
- And my 95% interval is [-30%, +40%], which is... huge.

```{r}
quantile(errs[1:41, 2], c(0.025, 0.975))
```

- Dropping pre-August observations (rows 42:52) makes it more like plus-or-minus 18%.
  - RMSPE = 10% and MAPE = 9%. Much more reasonable.
- A little risky, though -- it's unclear whether that variation went away over time, or went away as cases rose. If it went away as cases rose, we should expect it back as they fall.

# Base Forecast

My ballpark first forecast is...

```{r}
small$pred.cases[1]
# (drumroll): 28173.96
```

My price-is-right forecast is at the 20th percentile of the error distribution:

```{r}
quantile(errs[1:41, 2], 0.2)
# 8.5% less than the point forecast, for:
(1 + quantile(errs[1:41, 2], 0.2)) * small$pred.cases[1]
# 25783.53
```

And the implied chance of coming in below 25k?

```{r}
25000 / 28173 - 1
# That's 11.3% below the prediction, which occurs...
mean(errs[1:41, 2] < -0.1126)
```

...just shy of 10% of the time (9.75%, exactly).

This is a dumb model, and there is a high chance I screwed something up, so I'll call it a 1/6 chance (16.66%).

> This is the exact difference between in-model uncertainty and my uncertainty after accounting for model uncertainty. The model says "less than a 10% chance". I say "but there's also an 8% chance you're garbage, so let's call it 16% overall."

# Is Sunday the lowest day?

I'm going to make blocks of weeks with Sundays in the middle, so the weeks I'm defining run "Thurs, Fri, Sat, Sun, Mon, Tues, Weds".

```{r}
# Starts with an empty group, ends with one.
weeks = c(NA, rep(1:67, each = 7), NA)
cases$weeks = weeks
cases = cases %>% filter(date > "2020-05-01")

low_day = cases %>%
  group_by(weeks) %>%
  summarize(low_day = dow[which.min(cases)])
table(low_day$low_day) / nrow(low_day)
```
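
Following up on the earlier note about midpoint models: since both predictions scale the same last-Sunday count, any weighted model is just a convex combination of the two prediction columns. A minimal sketch, assuming the `preds` and `truth` objects built in the Modelling chunk (the grid of weights is arbitrary):

```{r}
# Blend the two models: weight w on the daily-ratio prediction (column 2),
# (1 - w) on the moving-avg prediction (column 1). w = 1 and w = 0 recover
# the two models already evaluated above.
for (w in seq(0, 1, by = 0.25)) {
  blend = w * preds[, 2] + (1 - w) * preds[, 1]
  err = blend / truth - 1
  cat(sprintf("w = %.2f  RMSPE = %.3f  MAPE = %.3f\n",
              w, sqrt(mean(err^2)), mean(abs(err))))
}
```

If an interior weight wins on RMSPE, that would argue for the "eyeball some other weights" idea over either pure model.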