```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
setwd("~/Github/bigdata/lectures/l3/")
library(tidyverse)
```

## Today's Class | Time permitting {.smaller}

1. Quick Review
    - Linear Regression
    - Logistic Regression
2. Deviance
3. Out-of-Sample performance
4. GLM more broadly:
    - Poisson, etc.
5. Inference
    - Bootstrap?
6. Preview HW2
7. Review HW1 Answers. 

# Quick Review

## Linear Regression

Many problems involve a response or outcome (`y`),  
And a bunch of covariates or predictors (`x`) to be used for regression.

A general tactic is to deal in averages and lines. 

$$E[y|x] = f(x'\beta)$$

Where <span style="color:maroon"> $x = [1,x_1,x_2,x_3,...,x_p]$ </span> is our vector of covariates.  <span style="color: lightblue"> (Our number of covariates is $p$ again)</span>  
$\beta = [\beta_0,\beta_1,\beta_2,...,\beta_p]$ are the corresponding coefficients.  
The product $x'\beta = \beta_0+\beta_1 x_1 + \beta_2 x_2+\cdots+\beta_p x_p$.

 <span style="color: lightblue"> For simplicity we denote $x_0 = 1$ to estimate intercepts</span> 


# Example

---

```{r, echo=F,out.width="100%",fig.align="center", message=F}
set.seed(203)
x = rnorm(200,mean=1.5,sd=0.5)
y = 0.8*x+rnorm(200,sd=0.5)+2
df = data.frame(x=x,y=y)
df2 = data.frame(x=c(1,2,2),y=c(2.8,2.8,3.6))

ggplot() + 
  geom_point(data=df,aes(x=x,y=y),alpha=0.1)+
  geom_abline(aes(intercept = 2,slope=0.8)) +
  geom_path(data=df2,aes(x=x,y=y),col=2,lty=1,lwd=2) + 
  xlim(0,3)+ylim(0,5) + 
  annotate("errorbar",x=2.2,ymin=2.8,ymax=3.6,width=0.2)+
  annotate("text",x=2.4,y=3.2,label="beta",parse=T,size=10)+
  annotate("text",x=2.7,y=3.2,label="= 0.8",size=10)+
  annotate("errorbar",y=2.5,xmin=1,xmax=2,width=0.2) +
  annotate("text",y=2.1,x=1.5,label="1",size=10) +
  annotate("errorbar",x=0.2,ymin=0,ymax=2,width=0.2) + 
  annotate("text",x=0.7,y=1.2,label="Intercept = 2",size=7)+
  annotate("point",x=0,y=c(0,2),col=2,size=5)
```

---


```{r, echo=F,out.width="100%",fig.align="center",message=F}
mod = lm(y~x)
coefs = coef(mod)
coefsa = round(coefs,digits=2)
ggplot(df,aes(x=x,y=y)) +
  geom_point(alpha=0.4) +
  geom_abline(intercept=2,slope=0.8)+
  geom_abline(intercept=coefsa[1],slope=coefsa[2],col="maroon")
```

## Logistic Regression

Building a linear model for **binary data**.  
Recall our original specification: $E[Y|X] = f(x'\beta)$

The Response $y$ is 0 or 1, leading to a conditional mean:
$$E[y|x] = P[y=1|x]\times 1 + P[y=0|x]\times 0 = P[y=1|x]$$
$\implies$ The expectation is a probability. 

The 'logit' link is common, for a few reasons. One big reason? 
$$log\left(\frac{p}{1-p}\right) =\beta_0+\beta_1x_1+...+\beta_Kx_K$$
This is a linear model for log odds.

## Deviance
<span style="color:maroon">Deviance</span> refers to the distance between our fit and the data. You generally want to minimize it. 

$$Dev(\beta) = -2 log(L(\beta|data))+C$$
<span style="color:lightblue">We can ignore C for now.</span> 

Deviance is useful for comparing models. It is a measure of GOF that is similar to the residual sum of squares, for a broader class of models (logistic regression, etc). 

We'll think about deviance as a cost to be minimized. 

<span style="color:maroon">Minimize Deviance $\iff$ Maximize likelihood</span> 

---

Bringing this full circle.

$$\phi(x) = \frac{1}{\sqrt{2\pi}}exp\left(- {x^2 \over 2}\right)$$
Given $n$ independent observations, the likelihood becomes:
$$ \prod_{i=1}^n \phi\left(y-x'\beta \over \sigma \right) \propto \prod_{i=1}^n exp\left(-{(y-x'\beta)^2 \over 2 \sigma^2} \right)$$
$$ \propto exp \left(-{1 \over 2\sigma^2} \sum_{i=1}^n (y-x'\beta)^2 \right)$$

---

This leads to <span style="color:maroon">Deviance</span> of:
$$Dev(\beta) = -2log(L(\beta|data)) + C = {1 \over \sigma^2} \sum_{i=1}^n (y-x'\beta)^2 + C'$$

So:  
<span style="color:maroon">Min Deviance $\iff$ Max likelihood $\iff$ Min $l_2$ loss </span> 

This is just a particular loss function, which is driven by the distribution of the $\epsilon$ terms. 

## MLE for Logistic Regression {.smaller}
Our logistic regression has the following likelihood:
$$L(\beta) = \prod_{i=1}^n P[y_i|x_i] = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}$$
$$ = \prod_{i=1}^n \left({exp(x_i'\beta) \over 1+exp(x_i'\beta)}\right)^{y_i} \left({1 \over 1+exp(x_i'\beta)}\right)^{1-y_i} $$
Thus the deviance to minimize is:
$$Dev(\beta) = -2 \sum_{i=1}^n (y_ilog(p_i)+(1-y_i)log(1-p_i))$$
$$\propto \sum_{i=1}^n [ log(1+exp(x_i'\beta))-y_ix_i'\beta]$$

<span style="color:lightblue">This is just taking the logs and removing the factor of 2.</span> 

---

Back to our summary outputs. We can print the same output for both linear and logistic regressions. 

But the "dispersion parameter" is always 1 for the logistic regression. `summary(spam)`  
![](../l3/logit_deviance.png){width=80%}

<span style="color:maroon">'degrees of freedom'</span>  is actually 'number of observations - df' where df is the number of coefficients estimated in the model. 

Specifically `df(deviance) = nobs - df(regression)`

You should be able to back out number of observations from the R output. 

## Dispersion parameter for Linear regression? {.smaller}

`summary(reg.bse)`
![](../l3/linear_deviance.png){width=80%}

Remember our basic gaussian model was:
$$Y|X \sim N(X'\beta,\sigma^2)$$
And the implied deviance was:
$$Dev(\beta) = {1 \over \sigma^2}\sum_{i=1}^n (y-x'\beta)^2 +C'$$

$\sigma$ is the dispersion parameter, and it is critical here. The logit has a mean-variance link, so we don't need the separate param.

## Estimating $\sigma$

$$y_i = x_i'\beta+\epsilon_i; \quad \sigma^2 = Var(\epsilon)$$
Denote the residuals, $r_i = y_i-x_i'\hat{\beta}$.

$$ \hat{\sigma}^2 = {1 \over n-p-1} \sum_{i=1}^n r_i^2 $$

R calls $\hat{\sigma}^2$ the dispersion parameter. 

Critically, even if we know $\beta$, we only predict sales with uncertainty.  
E.g., approximately a 95% chance of sales in $x'\beta \pm 2\sqrt{0.48}$

## $R^2$ {.smaller}

<span style="color:maroon">Residual Deviance</span>, $D$ is what we've minimized using $x$.  
<span style="color:maroon">Null Deviance</span> $D_0$ is for the model without $x$ (or more generally, the model under the null).  
i.e. $\hat{y}_i = \bar{y}$

- $D_0 = \sum (y_i-\bar{y})^2$ in linear regression
- $D_0 = -2\sum[y_ilog(\bar{y})+(1-y_i)log(1-\bar{y})]$ in logits

The difference between $D$ and $D_0$ comes from information in $x$.

Proportion of deviance explained by $x$ is called the $R^2$ in a linear regression, "Pseudo-$R^2$" in logit.
$$ R^2 = {D_0-D \over D_0} = 1-{D \over D_0}$$

This measures how much variability you explain with your model. 

- In `spam`: $R^2 = 1-1549/6170 = 0.75$
- In OJ -- `reg.bse`: $R^2 = 1-13975/30079 = 0.54$

## $R^2$ in linear regression

Recall that for linear model, deviance is the sum of squared errors (SSE) and $D_0$ is the total sum of squares (TSS). 
$$ R^2 = 1-{SSE \over TSS}$$

You may also recall that $R^2 = corr(y,\hat{y})^2$. 
```{r, echo=F, message=F}
oj = read_csv("~/Github/bigdata/lectures/l3/oj/oj.csv")
oj$brand = as.factor(oj$brand)
levels(oj$brand)[2] = "Minute Maid"
levels(oj$brand) = tools::toTitleCase(levels(oj$brand))
oj$feat = as.logical(oj$feat)
reg.bse = glm(logmove~feat*log(price)*brand,data=oj)
```

```{r}
cor(reg.bse$fitted,oj$logmove)^2
```

For linear regression, min deviance $=$ max corr$(y,\hat{y})$. $\implies$ if $y$ vs $\hat{y}$ is a straight line, you have a perfect fit. 

Also implies that $R^2$ (weakly) increases whenever we add another variable. 

## Fit plots {.smaller}

```{r, echo=T,out.width="70%",fig.align="center",message=F}
fitplotdf = data.frame(y = oj$logmove,yhat= predict(reg.bse),brand=oj$brand)
ggplot(fitplotdf,aes(y=y,x=yhat,col=brand)) + 
  geom_jitter(alpha=0.2,size=0.2,width=0.03)+
  geom_abline(intercept=0,slope=1)
```

It is good practice to plot $y$ vs $\hat{y}$. It helps you check for specification issues (e.g. non-constant variance, non-linearities, etc)

## Fit plots -- logit {.smaller}
```{r, echo=F, message=F, warning=F}
email = read_csv("~/Github/bigdata/lectures/l3/spam/spam.csv")
spam = glm(spam~.,data=email,family=binomial)
```

```{r, echo=T,out.width="70%",fig.align="center",message=F}
logit.fit.df = data.frame(spam=email$spam==1,yhat=predict(spam,type="response"))
ggplot(logit.fit.df,aes(y=yhat,x=spam,)) + 
  geom_violin()+ylab("Predicted P[Spam]")
```

New question: how do you choose the classification threshold? When do you send to the spam folder?

## Prediction

Prediction is easy with `glm`. 
```{r}
predict(spam,newdata=email[1:4,])
```
But this output is $x'\beta$. To get probabilities $exp(x'\beta)/(1+exp(x'\beta))$, add `type="response"`
```{r}
predict(spam,newdata=email[1:4,],type="response")
```

<span style="color:maroon">Newdata must match the format of the original data.</span> 

## Out-of-Sample Prediction {.smaller}

We care about how well our model works <span style="color:maroon">out-of-sample</span>.

One way to test this is to use a "validation sample".

1. Randomly split your data into two samples
    - usually named "testing" and "training".
2. Fit your model using the "training" data
3. Test predictions on the left-out "testing" data.

```{r, warning=F}
#Sample 1000 indices
leaveout = sample.int(nrow(email),1000)
#Train the model
spam_train = glm(spam~.,data=email[-leaveout,],family="binomial")
#Predict performance on the test data.
test_preds = predict(spam_train,newdata=email[leaveout,],type="response")
```

## Out-of-Sample Prediction {.smaller}

Fit plots for those left-out observations.
```{r, echo=F,out.width="70%",fig.align="center",message=F}
logit.fit.df = data.frame(spam=email$spam[leaveout]==1,yhat=test_preds)
ggplot(logit.fit.df,aes(y=yhat,x=spam,)) + 
  geom_violin()+ylab("Predicted P[Spam]")
```
For the leave out data, we get $D_0 = 1360$ and $D=362$ for $R^2 = 0.73$.

<span style="color:lightblue">Since the leave-out sample is random, your results may vary.</span> 

<span style="color:maroon">Note: OOS $R^2$ is lower than in-sample $R^2 = 0.75$</span> 

## Out-of-Sample

More on this next week during model selection.

You might see notions of this in my code for predicting vaccinations.

# More GLM

## Link Functions Aren't everything

We spoken about GLM giving a basic model of:
$$E[y|x] = f(x'\beta)$$
This is the link function, allowing us to model non-linear functions as "linear in some other domain". 

But we haven't spoken explicitly about the error distribution yet.

## Error distributions

This lack of commentary on error distributions is odd -- because the error distribution is critical to the likelihood, and the likelihood is how we are estimating our model.

We have in fact been baking in basic assumptions about the distributions.

- For the logit: $\epsilon \sim Binomial(1,p)$
- For the linear: $\epsilon \sim N(0,\sigma^2)$

<span style="color:lightblue">Asymptotically, for linear regressions we can make much weaker statements. See "quasi-maximum likelihood" models. Beyond this course.</span> 

## Generalizing Further

If we have errors with other distributions, it is useful to take advantage of our knowledge of those other distributions. 

A common alternative distribution is the poisson. 

```{r, fig.align="center", out.width="80%", echo=F}
hist(rpois(1000,5))
```

## GLM with Poisson

Once we say: 
$\epsilon \sim Poisson(\lambda)$, it implies a likelihood function, and thus can help our estimation routine. It is easy to incorporate this new information.

```{r, eval=F}
mod = glm(y~.,data=df,family=poisson)
```

We can do the same with many other distributions:  

- Easily in base-R for some: Poisson, binomial, Gamma.
- In principle with any distribution: t, Cauchy, etc. 

<span style="color:lightblue">Be aware, that sometimes this imposes a constant variance assumption.</span> 


# Inference

## Sample vs Population

As we've seen, even with very well behaved data, our estimate is not the same as the true value of a parameter.

A large vein of statistics is dedicated to estimating how large a gap there might be. 

Typically, this relies on some notion of asymptotic normality of the parameter. With that in place, you can estimate a standard error, and do fairly well. 

## MOAR DATA

We don't want to do difficult theory work. We just want to know how far away we can expect our parameter to be. 

If we had a few more similar samples of data, we could estimate it directly. We could just calculate our parameter in each sample, and look at how much it varies. 

We could break our sample into pieces, but each one will be much smaller, and thus higher variance. How can we get more information without doing theory? 

<span style="color:maroon">The BOOTSTRAP. </span> 

## Pretend we have 10 thousand samples
Each observation in our data was independently sampled from some distribution. 

What if every time we wanted a new sample, we just put all our observations in an urn, drew our $n$ of them (with replacement), or $n-10$ (without replacement) and calculated the parameter in that sample? 

We could do this ten-thousand times. And have a theory-free distribution of possible values our sample could have taken. 

This is the simplest form of the bootstrap. And it is powerful.

## Example
```{r,echo=F}
obs = rnorm(1000,10)
```
```{r}
summary(obs)
bootstrap_means = replicate(10000,{
  new_sample = sample(obs,length(obs),replace=T)
  mean(new_sample)
})
```

## Example

```{r, fig.align="center", out.width="80%"}
hist(bootstrap_means,breaks=100)
abline(v=quantile(bootstrap_means,c(0.025,0.975)),col=2)
```


## Bootstrap Comments

This relying on the independence between observations. If there is some dependence structure, you can still bootstrap, you just bootstrap clusters.

This will only work well when the parameter of interest has finite variance. 

Will only capture all the sampling variation -- biased samples are still biased.

## IF Time

Go over HW 2 introduction.

Go over HW 1 answers. 

# Wrap up

## Things to do

Before Wednesday: 

- HW 2

## Rehash

- Regressions minimize prediction errors, loss, or deviance, or they maximize some likelihood function. 
- Deviance is a measure of model fit. 
- GLM is a fairly flexible tool for modelling in a wide variety of situations -- linear only in some dimension. 
- Out-of-sample performance is critical. Measuring this is a pain.
- Bootstraps help us simulate the notion of repeated samples. 


# Bye!