```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
setwd("~/Github/bigdata/predictions/")
library(tidyverse)
# Data from CDC trend page.
cases = read_csv("cdc_cases.csv", skip = 2)
```

Most recent data is from Thurs May 6th.

I'm going to start with a very basic model. Let's call it the log diff-in-diff model. Basically: this Sunday / last Sunday = this Thursday / last Thursday. Three of those four inputs are known, so solving for this Sunday gives the forecast.

- *Basic intuition*: This naturally builds in a Sunday fixed effect. It uses one week of changes in the most recent data to predict.
- *Natural complications*:
  - Use the percent change in the 7-day moving average for the forecast instead.

# Cleaning

```{r}
length(unique(cases$State)) # Only US. Drop the col.
cases = cases %>% select(-State)

# Check the back end of the rolling avg.
tail(cases) # Interesting.

# Need to make Date more usable.
library(lubridate)
cases = cases %>%
  mutate(Date = mdy(Date)) %>% # Convert to Date from chr
  mutate(dow = wday(Date))     # Day of week (Sun = 1, Thurs = 5).

# Fix names
cases = cases %>% rename(cases = `New Cases`, avg = `7-Day Moving Avg`, date = Date)

small = cases %>% filter(dow == 5 | dow == 1) # Grab Thurs and Sun

# Build predictor vars -- one-week-old cases and avgs.
# (Data runs newest-first, so lead() looks one week back.)
small = small %>% mutate(week_old_cases = lead(cases, n = 2),
                         week_old_avg   = lead(avg, n = 2))

# Basic model input
small = small %>% mutate(ratio.cases = cases / week_old_cases,
                         ratio.avg   = avg / week_old_avg)

# Compare the ratios
ggplot(small, aes(x = ratio.cases, y = ratio.avg)) + geom_point()
```

Extreme outliers (ratios > 20) -- probably very early in the pandemic.

```{r}
small %>% filter(ratio.cases > 20)
# Yup -- those are all from before April 2020. Just drop dates before May 2020.
small = small %>% filter(date > ymd("2020-04-29"))

# Plot again
ggplot(small, aes(x = ratio.cases, y = ratio.avg, col = date)) +
  geom_point() +
  geom_smooth(method = "lm", col = 3) +
  geom_abline(slope = 1, intercept = 0, col = 1, lty = 2)
# Really expect it to lie on that 45-degree line.
# Not a huge problem.
# But probably a good reason to think about the avg,
# not the daily. Will compare shortly.
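
# (Added sketch) Quantify how tightly the two ratios agree after the
# filter: correlation and median absolute gap between the daily and
# 7-day-avg ratios. Assumes the ratio columns built above; the summary
# names are arbitrary.
small %>% summarize(cor     = cor(ratio.cases, ratio.avg, use = "complete.obs"),
                    med_gap = median(abs(ratio.cases - ratio.avg), na.rm = TRUE))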
```

# Modelling

```{r}
small = small %>% mutate(pred.avg   = ratio.avg   * lead(cases, n = 1),
                         pred.cases = ratio.cases * lead(cases, n = 1))

truth = small %>% filter(dow == 1) %>% select(cases) %>% as.matrix()
preds = small %>% filter(dow == 5) %>% select(pred.avg, pred.cases) %>% as.matrix()

# Differing numbers of rows. Need to make sure I line up dates correctly.
tail(small)
tail(small %>% select(date, dow, cases, pred.avg))
# Looks like I drop the first and last rows in preds, and the last in truth.
preds = preds[-c(1, nrow(preds)), ]
truth = truth[-length(truth)]
truth2 = cbind(truth, truth) # Duplicate truth to match preds' two columns.

errs = preds / truth2 - 1 # Multiplicative model: a perfect prediction has ratio 1, so err = 0.
colMeans(errs^2)
# Looks like the last-day ratio outperforms the avg.
# Closer to the 'trend', I guess?

dates = small %>% filter(dow == 1) %>% select(date) %>% as.matrix()
dates = dates[-nrow(dates), ]
dates = ymd(dates)

df = data.frame(preds = preds[, 2], truth = truth, errs = errs[, 2], dates = dates)
ggplot(df, aes(x = log(preds), y = log(truth), col = errs)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, col = 2) +
  geom_smooth(method = "lm")

# Is there any patterning over time?
ggplot(df, aes(x = log(preds), y = log(truth), col = dates)) +
  geom_point()

# Maybe percentage errors are getting smaller?
# (Careful: ^ is right-associative in R, so errs^2^0.5 is errs^sqrt(2),
# which is NaN for negative errors. Use abs() instead.)
ggplot(df, aes(x = dates, y = abs(errs))) +
  geom_point() + geom_smooth()
# They got smaller... but seem to be growing again?
```

**Notice**:

- I came to the data with two _full_ models.
  - No coefficients to estimate. Nothing.
  - They were "ratio * last Sunday" and "alt ratio * last Sunday".
- That means every one of these predictions is out of sample.
  - None of them were used for fitting the model.
  - (Ish -- I've obviously seen the gist of this data before.)

*A couple of further notes*:

- Looks like in August there was some kind of fundamental shift. I should probably subset the data to the post-August period.
- I should also _in principle_ consider models that are midpoints between the two here.
  - Instead of all the weight on the Thursday ratio, or 1/7 weight on each of the last seven days (the moving average),
  - I should eyeball some other weights.
- Right now my mean absolute percentage error is approximately 12%. That leaves... a lot of room to improve.
- My root mean squared percentage error is closer to 17%.
- And my 95% interval is [-30%, +40%], which is... huge.

```{r}
quantile(errs[1:41, 2], c(0.025, 0.975))
```

- Dropping pre-August observations (rows 42:52) makes it more like plus-or-minus 18%.
  - RMSPE = 10% and MAPE = 9%. Much more reasonable.
- A little risky, though -- it's unclear whether that variation went away over time, or went away as cases rose. If it went away as cases rose, we should expect it back as they fall.

# Base Forecast

My ballpark first forecast is...

```{r}
small$pred.cases[1]
# (drumroll): 28173.96
```

My price-is-right forecast is at the 20th percentile of the error distribution:

```{r}
quantile(errs[1:41, 2], 0.2)
# 8.5% less than the point forecast, for:
(1 + quantile(errs[1:41, 2], 0.2)) * small$pred.cases[1]
# 25783.53
```

And the implied chance of coming in below 25k?

```{r}
25000 / 28173 - 1
# That's 11.3% below the prediction, which occurs...
mean(errs[1:41, 2] < -0.1126)
```

...just shy of 10% of the time (9.75%, exactly).

This is a dumb model, and there is a high chance I screwed something up, so I'll call it a 1/6 chance (16.66%).

> This is the exact difference between in-model uncertainty and my uncertainty after accounting for model uncertainty. The model says "less than a 10% chance". I say "but there's also an 8% chance you're garbage, so let's call it 16% overall."

# Is Sunday the lowest day?

I'm going to make blocks of weeks with Sundays in the middle, so the weeks I'm defining run "Thurs, Fri, Sat, Sun, Mon, Tues, Weds".

```{r}
# Starts with an empty group, ends with one.
weeks = c(NA, rep(1:67, each = 7), NA)
cases$weeks = weeks
cases = cases %>% filter(date > "2020-05-01")

low_day = cases %>%
  group_by(weeks) %>%
  summarize(low_day = dow[which.min(cases)])
table(low_day$low_day) / nrow(low_day)
```
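
Following up on the earlier note about midpoint models: since both predictions scale the same last-Sunday count, any weighted model is just a convex combination of the two prediction columns. A minimal sketch, assuming the `preds` and `truth` objects built in the Modelling chunk (the grid of weights is arbitrary):

```{r}
# Blend the two models: weight w on the daily-ratio prediction (column 2),
# (1 - w) on the moving-avg prediction (column 1). w = 1 and w = 0 recover
# the two models already evaluated above.
for (w in seq(0, 1, by = 0.25)) {
  blend = w * preds[, 2] + (1 - w) * preds[, 1]
  err = blend / truth - 1
  cat(sprintf("w = %.2f  RMSPE = %.3f  MAPE = %.3f\n",
              w, sqrt(mean(err^2)), mean(abs(err))))
}
```

If an interior weight wins on RMSPE, that would argue for the "eyeball some other weights" idea over either pure model.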