![](../l2/sims/BH.png){width=100%}
We started with the notion that a given $\alpha$ (p-value cutoff) can lead to a big FDR: $\alpha \rightarrow q(\alpha)$.
BH reverses that: fix the FDR at $q$, and find the relevant cutoff. The algorithm is the key to doing that: $q \rightarrow \alpha^*(q)$

## Loss
- Loss is a function both of our prediction and the true outcome
- More importantly, the driving feature of loss is our experience of making a certain error. Do we lose money? Time? Prestige?
- Our choice of procedure is driven by this loss.
- $l_p(Y,\hat{Y}) = l_p(Y-\hat{Y}) = l_p(e) = \left( \frac1n \sum_{i=1}^n |e_i|^p\right)^{\frac{1}{p}}$
- E.g. $l_2(e) = \sqrt{\frac1n \sum_{i=1}^n e_i^2}$.
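A minimal sketch of the $l_p$ loss above, as a hypothetical helper function (the name `lp_loss` and the example errors are mine, not from the slides):

```{r}
# l_p loss as defined above: generalized mean of absolute errors
lp_loss = function(y, yhat, p = 2) mean(abs(y - yhat)^p)^(1/p)

e = c(-1, 0.5, 2, -0.5)   # some example errors (prediction yhat = 0)
lp_loss(e, 0, p = 1)      # l_1: mean absolute error, 1
lp_loss(e, 0, p = 2)      # l_2: root mean squared error, about 1.17
```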
# Regression
## Motivation
```{r, echo=F, message=F}
oj = read.csv("oj/oj.csv")
library(tidyverse)
oj$brand = as.factor(oj$brand)
levels(oj$brand)[2] = "Minute Maid"
levels(oj$brand) = tools::toTitleCase(levels(oj$brand))
oj$feat = as.logical(oj$feat)
oj = as_tibble(oj)
g1 = ggplot(oj %>% filter(store==2),aes(x=week,y=logmove)) +
geom_line() +
facet_grid(cols=vars(brand)) +
ylab("Log Sales") + xlab("Week") + labs(title="Log Orange Juice Sales at Store 2 by Brand")
g1
```
What is driving sales? Brand differences? Price changes? Ads?
## Motivation
```{r, echo=F}
g1+geom_point(aes(col=feat),show.legend=F)
```
Blue points are based on ongoing promotional activity.
It looks like ads are important.
## Motivation
Fit a line for sales by brand controlling for promotional activity.
$$ log(Sales) \approx \alpha + \gamma Brand + \beta Ads $$
$\alpha+\gamma_b$ is our baseline sales level for brand $b$. But we can bring in $\beta$ more (log) sales with some promotional activity.
## Regression
- Regression through linear models
- Implementation in R
- Complications:
  - Interaction
  - Factors
- Logistic Regression
- Estimation: Maximum likelihood, Minimum Deviance
This should be mostly review, but perhaps with a different emphasis.
## Linear Models
Many problems involve a response or outcome (`y`),
And a bunch of covariates or predictors (`x`) to be used for regression.
A general tactic is to deal in averages and lines.
$$E[y|x] = f(x'\beta)$$
Where $x = [1,x_1,x_2,x_3,...,x_p]$ is our vector of covariates. (Our number of covariates is $p$ again)
$\beta = [\beta_0,\beta_1,\beta_2,...,\beta_p]$ are the corresponding coefficients.
The product $x'\beta = \beta_0+\beta_1 x_1 + \beta_2 x_2+\cdots+\beta_p x_p$.
For simplicity we denote $x_0 = 1$, so the intercept is estimated like any other coefficient.
## Marginals and Conditional Distributions
```{r,echo=F, message=F,out.width="100%"}
g1 = ggplot(oj,aes(x=log(price))) +
geom_density(adjust=3)
g1
```
Marginal Distribution shows us that prices are widely distributed
```{r, echo=F,out.width="100%"}
ggplot(oj,aes(y=log(price),fill=brand))+
geom_blank()
```

## Marginals and Conditional Distributions
```{r,echo=F, message=F,out.width="100%"}
g1
```
Marginal Distribution shows us that prices are widely distributed
```{r, echo=F,out.width="100%"}
ggplot(oj,aes(y=log(price),x=brand,fill=brand))+
geom_violin(adjust=2,draw_quantiles=c(0.25,0.5,0.75))
```
The conditionals show us that brands are doing different things.

## Marginal vs Conditional
The Marginal mean (aka unconditional mean) is a simple number.
The conditional mean is a function that depends on covariates.
The data is distributed randomly around these means.
---
In a Gaussian Linear regression,
$$ y|x \sim N(x'\beta,\sigma^2) $$
The conditional mean is $E[y|x] = x'\beta$.
With just one $x$, $E[y|x]$ increases by $\beta$ when $x$ increases by 1.
# Example
---
Generate some well-behaved data to demo.
$$ X \sim N(1.5,0.5^2)$$
$$ \epsilon \sim N(0,0.5^2)$$
$$ Y = 2 + 0.8X + \epsilon$$
$$\therefore \quad Y|X \sim N(2+0.8X,0.5^2)$$
---
```{r, echo=F,out.width="100%",fig.align="center"}
set.seed(203)
x = rnorm(200,mean=1.5,sd=0.5)
y = 0.8*x+rnorm(200,sd=0.5)+2
df = data.frame(x=x,y=y)
g1 = ggplot(df,aes(x=x,y=y)) +
geom_point(alpha=0.5) +
xlim(0,3)+ylim(0,5)
g1
```
---
```{r, echo=F,out.width="100%",fig.align="center"}
df2 = data.frame(x=c(1,2,2),y=c(2.8,2.8,3.6))
g1 = g1 +
geom_abline(aes(intercept = 2,slope=0.8))
g1
```
---
```{r, echo=F,out.width="100%",fig.align="center"}
ggplot() +
geom_point(data=df,aes(x=x,y=y),alpha=0.1)+
geom_abline(aes(intercept = 2,slope=0.8)) +
geom_path(data=df2,aes(x=x,y=y),col=2,lty=1,lwd=2) +
xlim(0,3)+ylim(0,5) +
annotate("errorbar",x=2.2,ymin=2.8,ymax=3.6,width=0.2)+
annotate("text",x=2.4,y=3.2,label="beta",parse=T,size=10)+
annotate("text",x=2.7,y=3.2,label="= 0.8",size=10)+
annotate("errorbar",y=2.5,xmin=1,xmax=2,width=0.2) +
annotate("text",y=2.1,x=1.5,label="1",size=10) +
annotate("errorbar",x=0.2,ymin=0,ymax=2,width=0.2) +
annotate("text",x=0.7,y=1.2,label="Intercept = 2",size=7)+
annotate("point",x=0,y=c(0,2),col=2,size=5)
```
---
```{r, echo=F,out.width="100%",fig.align="center"}
g1
```
---
```{r, echo=F,out.width="100%",fig.align="center",message=F}
mod = lm(y~x)
coefs = coef(mod)
coefsa = round(coefs,digits=2)
g1 +
geom_abline(intercept=coefsa[1],slope=coefsa[2],col="maroon")
```
## Coefficient Selection
The simplest form of regression is known as "Ordinary Least Squares" (OLS).
We select coefficents $\beta$ to minimize the sum of the squared prediction errors.
$$\hat{\beta} = \underset{\beta}{argmin} \sum_{i=1}^N (y_i-\hat{y}_i)^2 = \underset{\beta}{argmin} \sum_{i=1}^N (y_i-x_i'\beta)^2 $$
$$ = \underset{\beta}{argmin} \sum_{i=1}^N (y_i-(\beta_0+\beta_1x_{i1}+\cdots+\beta_Kx_{iK}))^2$$
$$ = \underset{\beta}{argmin} \left(\frac1N \sum_{i=1}^N (y_i-x_i'\beta)^2\right)^{\frac{1}{2}}$$
The last line is the $l_2$ loss: the square root is monotone, so minimizing it gives the same $\hat{\beta}$.
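A small sketch of the argmin above on simulated data: direct numerical minimization of the sum of squared errors should land on the same coefficients as `lm` (the data and function names here are mine, for illustration):

```{r}
set.seed(1)
x = rnorm(100)
y = 2 + 0.8*x + rnorm(100, sd = 0.5)
# Numerically minimize the sum of squared errors over (intercept, slope)
sse = function(b) sum((y - (b[1] + b[2]*x))^2)
bhat = optim(c(0, 0), sse)$par
rbind(optim = bhat, lm = coef(lm(y ~ x)))  # the two should nearly agree
```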
## Implications
Standard regressions have a loss function built in already: the $l_2$ loss. And they minimize it.
We minimized our $l_2$ loss and didn't wind up at the true value $\beta$. So the loss at the true $\beta$ must be higher.
$\implies$ The error variance of the true model is usually higher than the variance of our fitted model's residuals.
- See *overfit* next week
- This is why we compare to a $t$-distribution.
- We had to *estimate* the variance as well.
## OLS vs MLE
There is a different way to choose coefficients, known as Maximum Likelihood.
Instead of minimizing a loss function, it makes assumptions about the error distribution. More later.
# Basic Regressions in R
## Orange Juice Sales
Three brands.
- Tropicana
- Dominicks
- Minute Maid
83 Chicagoland stores.
Demographics for each.
Price, sales (log units moved), and advertising activity (`feat`) on a number of dates.
Data in `oj.csv`.

## Price, Brand, Sales
```{r, out.width="75%", fig.align="center"}
ggplot(oj, aes(y=logmove,x=log(price),col=brand))+
geom_point()
```
Sales decrease with price. Each brand has its own price strategy.
## Price, Brand, Sales
```{r, echo=T,out.width="70%",fig.align="center",message=F}
ggplot(oj,aes(y=logmove,x=log(price),col=brand))+
geom_jitter(alpha=0.3,size=0.3,width=0.03)
```
Sales decrease with price. Each brand has its own price strategy.
## Scale
Why is this a log-log plot?
## Scale
Why is this a log-log plot?
```{r, echo=F,out.width="90%",fig.align="center",message=F}
ggplot(oj,aes(y=exp(logmove),x=price,col=brand))+
geom_jitter(alpha=0.3,size=0.3,width=0.03)
```
## Log Linear {.smaller}
We often model the mean for $log(y)$ rather than $y$.
$$log(y) = log(a) + x\beta \iff y = ae^{x\beta} $$
Predicted $y$ is multiplied by $e^\beta$ after $x$ increases by 1.
*Recall:* $log(y) = z \iff e^z = y$ where $e \approx 2.718$.
Further, $log(ab) = log(a)+log(b)$ and $log(a^b) = b\,log(a)$.
We define $log=ln$, the natural log. Using, e.g. $log_2$ would only change our intercept.
Whenever $y$ changes on a percentage scale, use $log(y)$.
- prices: "Orange Juice is 10% off"
- sales: "Sales are up 10%"
- Most other things that are strictly positive may warrant a $log$ scale.
## Price Elasticity
A simple 'elasticity' model for orange juice sales $y$:
$$ E[log(y)] = \gamma log(price) + x'\beta $$
Elasticities and log-log regressions: We can interpret $\gamma$ as the % change in $y$ associated with a 1% change in $x$.
In R:
```{r}
coef(glm(logmove~log(price)+brand,data=oj))
```
We see that (with brand FEs), sales drop 3.1% for every 1% increase in price.
## Regression in R {.smaller}
That one command is capable of doing a lot of work.
```{r}
reg = glm(logmove~log(price)+brand,data=oj)
```
- `glm` stands for "Generalized Linear Model".
- For this simple model, `lm` works too.
- `y ~ a + b` is the formula defining our regression.
- `y ~ .` will regress against everything in the dataset.
- Intercepts are implicitly included by default.
The variable `reg` is a list of useful things. (try `names(reg)`).
- `summary(reg)` prints a ton of information
- `coef(reg)` gives coefficients
- `predict(reg, newdata=newdataframe)` gives predictions for new data
- `newdataframe` must be a dataframe with exactly the same format as the original data (variable names, factor levels, etc.).
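The workflow above, sketched on a tiny hypothetical dataset (the dataframe `df` and model `m` here are made up, standing in for `oj` and `reg`):

```{r}
# Toy data, standing in for the OJ data
df = data.frame(y = c(1.2, 2.3, 3.1, 4.8), x = 1:4)
m = glm(y ~ x, data = df)
names(m)[1:4]                                  # a few of the stored pieces
coef(m)                                        # intercept and slope
predict(m, newdata = data.frame(x = c(5, 6)))  # new data: same column names
```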
# Factors
## The Design Matrix {.smaller}
```{r}
coef(reg)
```
What is `brandTropicana`?
Our regression formulas look like $\beta_0+\beta_1 x_1 +\cdots$. But `brand` is not a number, so $brand \times \beta$ isn't sensible.
`glm`, `lm`, most other standard routines start by creating a numeric design matrix, which converts the factor `brand` into a lot of Boolean variables.
```{r}
inds = sample(nrow(oj),10) #Grabbing some random observations
oj$brand[inds]
```
## The Design Matrix {.smaller}
They convert with `model.matrix`, which gives us these variables.
```{r}
model.matrix(reg)[inds,]
```
These are numeric values $x$ which we can multiply against coefficients $\beta$.
## Intercepts {.smaller}
```{r}
coef(reg)
model.matrix(reg)[1,]
```
Each factor's reference level is absorbed by the intercept. The coefficients are now "changes relative to intercept".
In this case, the intercept is the "Dominicks" brand OJ.
## Releveling Factors
To check your factor's reference level, look at the first level in the factor levels.
```{r}
levels(oj$brand)
```
You can change this with `relevel` (base R) or `fct_relevel` and `fct_reorder` in the `tidyverse`'s `forcats` package.
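A quick sketch of `relevel` on a standalone factor (the brand strings here are illustrative, not read from `oj`):

```{r}
b = factor(c("dominicks", "tropicana", "minute.maid"))
levels(b)    # alphabetical by default; the first level is the reference
b2 = relevel(b, ref = "tropicana")
levels(b2)   # "tropicana" now comes first, so it is absorbed by the intercept
```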
# Interactions
## Interactions
Beyond additive effects: variables change how other variables act on $y$.
An Interaction is the product of two covariates.
$$ E[y|x] = ... + \beta_jx_j+x_jx_k\beta_{jk}$$
So the effect on $E[y]$ of a unit increase in $x_j$ is $\beta_j+x_k\beta_{jk}$
$\implies$ It depends on $x_k$!
---
Interactions play a massive role in statistical learning, and are often central to social science and business questions.
- Does gender change the effect of education?
- Do older patients benefit more from a vaccine?
- How does advertisement affect price sensitivity?
---
Fitting interactions in R: use `*` in your formula.
```{r}
reg.int = glm(logmove~log(price)*brand,data=oj)
coef(reg.int)
```
This model is $E[log(y)] = \alpha_b+\beta_b log(price)$.
A separate slope and intercept for each brand!
Elasticities wind up as:
Dominicks: -3.4, Minute Maid: -3.3, Tropicana: -2.7.
Where do these numbers come from?
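They come from summing the base slope and the brand interaction. A sketch on simulated data (brands "A" and "B" and the slopes -3 and -2 are made up; the same arithmetic applies to `coef(reg.int)`):

```{r}
set.seed(7)
brand = factor(sample(c("A", "B"), 500, replace = TRUE))
logp = rnorm(500)
# True slopes: -3 for brand A (the reference), -2 for brand B
y = 1 + 0.5*(brand == "B") + (-3 + 1*(brand == "B"))*logp + rnorm(500, sd = 0.1)
b = coef(glm(y ~ logp*brand))
b["logp"]                     # slope for the reference brand A (near -3)
b["logp"] + b["logp:brandB"]  # slope for brand B: base + interaction (near -2)
```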
## Advertisements
What changes when a brand is featured?
Here, we mean an in-store display promo or flier.
We could model an additive effect on sales:
$E[log(sales)] = \alpha_b + 1_{[feat]}\alpha_{feat}+\beta_b log(p)$
Or that and an elasticity effect:
$E[log(sales)] = \alpha_b +\beta_b log(p)+ 1_{[feat]}(\alpha_{feat}+\beta_{feat}log(p))$
Or a brand-specific effect on elasticity.
$E[log(sales)] = \alpha_b +\beta_b log(p)+ 1_{[feat]}(\alpha_{feat}+\beta_{b,feat}log(p))$
See the R code for each of these online.
## Brand-specific Elasticities - Coefs
```{r}
reg.bse = glm(logmove~feat*log(price)*brand,data=oj)
coefs = coef(reg.bse)
coefs
```
## Brand-specific Elasticities - Coefs
```{r, echo=F}
coefs = data.frame(beta=coefs,names=names(coefs))
coefs$price = c(0,0,1,0,0,1,0,0,1,1,1,1)
coefs$feat = c(0,1,0,0,0,1,1,1,0,0,1,1)
coefs$feat = coefs$feat == 1
coefs$brand = c("Dominicks","Dominicks","Dominicks","Minute Maid","Tropicana","Dominicks","Minute Maid","Tropicana","Minute Maid","Tropicana","Minute Maid","Tropicana")
coefs$brand = as.factor(coefs$brand)
coefs = coefs %>% filter(price == 1) %>% select(-price,-names)
elasts = coefs = coefs %>% pivot_wider(names_from=brand,values_from=beta)
elasts[2,2:4] = elasts[2,2:4]+elasts[1,2:4]
elasts = elasts %>% mutate(`Minute Maid`=`Minute Maid`+Dominicks,Tropicana=Tropicana+Dominicks)
coefs
```
Ads seem to increase the price-sensitivity.
Minute Maid and Tropicana have big jumps.
## Actual Elasticities
```{r, echo=F}
elasts
```
After featuring, every brand is more price-sensitive.
Why does marketing increase the price sensitivity? How will this influence strategy?
## Confounding
There are differential rates of advertising, which makes going from these static coefficients, to statements about the companies, difficult.
```{r}
plot(table(oj$brand,oj$feat))
```
# Logistic Regression
## Logits
Linear regression is just one linear model, and probably not the most used one.
Logistic regression applies *when $y$ is TRUE or FALSE (1/0).*
Binary Response as a prediction target:
- Profit or loss, greater or less than, pay or default
- Thumbs up or down, buy or not
- Win or not, Sick or healthy, Spam or not?
In high dimensions, simplifying to binary variables can help us interpret.
---
Building a linear model for **binary data**.
Recall our original specification: $E[Y|X] = f(x'\beta)$
The Response $y$ is 0 or 1, leading to a conditional mean:
$$E[y|x] = P[y=1|x]\times 1 + P[y=0|x]\times 0 = P[y=1|x]$$
$\implies$ The expectation is a probability.
We will choose $f(x'\beta)$ to give values between 0 and 1.
Also note that the variance becomes related to the mean.
---
We want a binary choice model
$$p = P[y=1|x] = f(\beta_0+\beta_1x_1+...+\beta_Kx_K)$$
Where $f$ is a function increasing in value from 0 to 1.
```{r, echo=F, fig.align="center", out.width="80%"}
x = seq(-5,5,length.out=1001)
y = exp(x)/(1+exp(x))
df = data.frame(x=x,y=y)
ggplot(df,aes(x=x,y=y))+
geom_path() +
xlab("x' beta") +
annotate("text",x=2.5,y=0.5,label="f(x' beta)",size=5)
```
---
We will use the '**logit link**' function and do '**logistic regression**'.
$$P[y=1|x] = \frac{e^{x'\beta}}{1+e^{x'\beta}} = \frac{exp(\beta_0+\beta_1x_1+...+\beta_Kx_K)}{1+exp(\beta_0+\beta_1x_1+...\beta_Kx_K)} $$
The 'logit' link is common, for a few reasons. One big reason? Some math shows:
$$log\left(\frac{p}{1-p}\right) =\beta_0+\beta_1x_1+...+\beta_Kx_K$$
So this is a linear model for log odds.
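The link and its inverse, sketched numerically (the values of $x'\beta$ here are arbitrary examples):

```{r}
xb = c(-2, 0, 2)              # some values of x'beta
p = exp(xb)/(1 + exp(xb))     # the logit link squashes onto (0,1)
p
log(p/(1 - p))                # the log odds recover x'beta exactly
# Base R ships these as plogis(xb) and qlogis(p)
```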
## Spam Filters
Most inboxes use binary regression to predict whether an email is spam.
Say y=1 for spam, and y=0 for not spam.
`spam.csv` has 4600 emails with word and character presence indicators (1 if the word or character appears in the message) and related info.
Logistic Regression fits $p(y=1)$ as a function of email content.
---
```{r,echo=F}
email = read.csv("spam/spam.csv")
```
Very easy in R.
```{r, eval=F}
glm(Y ~ X+Z, data=df, family=binomial)
```
The argument `family=binomial` tells `glm` that `spam` is binary.
The response can take a few forms:
- $y = 1,1,0,....$ numeric vector
- $y = TRUE, FALSE, TRUE,...$ logical vector
- $y = 'spam','not', 'spam',... $ factor vector
Everything else is the same as for linear regression.
## Perfect Separation
```{r}
spam = glm(spam~.,data=email,family=binomial)
```
Some emails are very easy to predict: clearly spam or not. This is called 'perfect separation'. It can cause numerical instability (convergence issues, inflated standard errors, unreliable p-values), but mostly doesn't matter much for prediction.
We see it here, because some words are excellent predictors.
```{r}
table(email$word_george, email$spam,dnn=c("word_george","spam"))
```
## Interpreting Coefficients
The model is:
$$\frac{p}{1-p} =exp\left(\beta_0+\beta_1x_1+...+\beta_Kx_K\right)$$
So $exp(\beta_j)$ is the odds multiplier for a unit increase in $x_j$.
```{r}
coef(spam)['word_george']
```
So if the word 'george' occurs in an email, the odds of it being spam are multiplied by $exp(-5.8) \approx 0.003$.
What is the odds multiplier for a coefficient of 0?
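A quick numeric check with hypothetical coefficients (-5.8 matches `word_george` above; `word_offer` and `zero` are made up for illustration):

```{r}
beta = c(word_george = -5.8, word_offer = 0.9, zero = 0)
exp(beta)   # odds multipliers: about 0.003, about 2.46, and exactly 1
# exp(0) = 1, so a zero coefficient leaves the odds unchanged
```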
# Deviance
## Deviance in R Summaries
The summary function gives coefficients, and a bunch more. `summary(spam)`
![](logit_deviance.png){width=80%}
---
The linear OJ regression does the same `summary(reg.bse)`
![](linear_deviance.png){width=80%}
These are statistics telling us about our model fit. They are important in both linear and logistic regression. Understanding deviance will tie them together.
---
```{r, echo=F,out.width="90%",fig.align="center",message=F}
g1 +
geom_abline(intercept=coefsa[1],slope=coefsa[2],col="maroon")
```
## Estimation and Goodness-of-Fit
**Two related concepts**
Likelihood is a function of the unknown parameters $\beta$ of a *statistical model*, given data:
$$ L(\beta|data) = P[data|\beta]$$
Maximum Likelihood Estimation (MLE)
$$ \hat{\beta} = argmax_{\beta} L(\beta|data)$$
MLE estimates $\hat{\beta}$ are the parameters that make our data *most likely*.
Maximum likelihood estimation *requires* making statements about the distribution of the error terms.
## Error Distributions {.smaller}
Back to Gaussian linear model for a moment:
$$ Y|X \sim N(x'\beta,\sigma^2)$$
$$\implies P[Y=y|X=x] = \phi\left({y-x'\beta \over \sigma}\right)$$
Then we can aggregate across all observations and pick a $\beta$.
$$\implies L(\beta|data) = P[data | \beta] = \prod_{i=1}^N P[Y=y_i|X=x_i] $$
$$ = \prod_{i=1}^N \phi\left({y_i-x_i'\beta \over \sigma}\right)$$
Where $\phi$ is the normal distribution's pdf.
## Estimation and Goodness-of-Fit
**Two related concepts**
Deviance refers to the distance between our fit and the data. You generally want to minimize it.
$$Dev(\beta) = -2 log(L(\beta|data))+C$$
We can ignore C for now.
Deviance is useful for comparing models. It is a measure of GOF that is similar to the residual sum of squares, for a broader class of models (logistic regression, etc).
We'll think about deviance as a cost to be minimized.
Minimize Deviance $\iff$ Maximize likelihood
---
Bringing this full circle.
$$\phi(x) = \frac{1}{\sqrt{2\pi}}exp\left(- {x^2 \over 2}\right)$$
Given $n$ independent observations, the likelihood becomes:
$$ \prod_{i=1}^n \phi\left({y_i-x_i'\beta \over \sigma} \right) \propto \prod_{i=1}^n exp\left(-{(y_i-x_i'\beta)^2 \over 2 \sigma^2} \right)$$
$$ \propto exp \left(-{1 \over 2\sigma^2} \sum_{i=1}^n (y_i-x_i'\beta)^2 \right)$$
---
This leads to Deviance of:
$$Dev(\beta) = -2log(L(\beta|data)) + C = {1 \over \sigma^2} \sum_{i=1}^n (y_i-x_i'\beta)^2 + C'$$
So:
Min Deviance $\iff$ Max likelihood $\iff$ Min $l_2$ loss
This is just a particular loss function, which is driven by the distribution of the $\epsilon$ terms.
## MLE for Logistic Regression {.smaller}
Our logistic regression has the following likelihood:
$$L(\beta) = \prod_{i=1}^n P[y_i|x_i] = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}$$
$$ = \prod_{i=1}^n \left({exp(x_i'\beta) \over 1+exp(x_i'\beta)}\right)^{y_i} \left({1 \over 1+exp(x_i'\beta)}\right)^{1-y_i} $$
Thus the deviance to minimize is:
$$Dev(\beta) = -2 \sum_{i=1}^n (y_ilog(p_i)+(1-y_i)log(1-p_i))$$
$$\propto \sum_{i=1}^n [ log(1+exp(x_i'\beta))-y_ix_i'\beta]$$
This follows by substituting $p_i = e^{x_i'\beta}/(1+e^{x_i'\beta})$, taking logs, and dropping the factor of 2.
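A sketch verifying the deviance formula on simulated binary data (the data-generating numbers are arbitrary; the identity itself is exact for 0/1 responses):

```{r}
set.seed(42)
x = rnorm(300)
y = rbinom(300, 1, plogis(-0.5 + x))
m = glm(y ~ x, family = binomial)
p = fitted(m)
# For 0/1 data the saturated model has likelihood 1, so C = 0 and the
# residual deviance equals -2 times the log likelihood exactly:
c(manual = -2*sum(y*log(p) + (1 - y)*log(1 - p)), glm = deviance(m))
```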
---
Back to our summary outputs. We can print the same output for both linear and logistic regressions.
But the "dispersion parameter" is always 1 for the logistic regression. `summary(spam)`
![](logit_deviance.png){width=80%}
'degrees of freedom' is actually 'number of observations - df' where df is the number of coefficients estimated in the model.
Specifically `df(deviance) = nobs - df(regression)`
You should be able to back out number of observations from the R output.
## Dispersion parameter for Linear regression? {.smaller}
`summary(reg.bse)`
![](linear_deviance.png){width=80%}
Remember our basic gaussian model was:
$$Y|X \sim N(X'\beta,\sigma^2)$$
And the implied deviance was:
$$Dev(\beta) = {1 \over \sigma^2}\sum_{i=1}^n (y_i-x_i'\beta)^2 +C'$$
$\sigma$ is the dispersion parameter, and it is critical here. The logit has a mean-variance link, so we don't need a separate parameter.
## Estimating $\sigma$
$$y_i = x_i'\beta+\epsilon_i; \quad \sigma^2 = Var(\epsilon)$$
Denote the residuals, $r_i = y_i-x_i'\hat{\beta}$.
$$ \hat{\sigma}^2 = {1 \over n-p-1} \sum_{i=1}^n r_i^2 $$
R calls $\hat{\sigma}^2$ the dispersion parameter.
Critically, even if we know $\beta$, we only predict sales with uncertainty.
E.g., approximately a 95% chance of sales in $x'\beta \pm 2\sqrt{0.48}$
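A sketch checking the estimator above against what R reports (simulated data; the true $\sigma = 0.7$ is an arbitrary choice):

```{r}
set.seed(9)
x = rnorm(100)
y = 2 + 0.8*x + rnorm(100, sd = 0.7)
m = glm(y ~ x)
r = residuals(m)
sigma2_hat = sum(r^2)/(length(y) - 2)   # n - p - 1, with p = 1 covariate
c(manual = sigma2_hat, glm = summary(m)$dispersion)  # identical
```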
## $R^2$ {.smaller}
Residual Deviance, $D$ is what we've minimized using $x$.
Null Deviance $D_0$ is for the model without $x$ (or more generally, the model under the null).
i.e. $\hat{y}_i = \bar{y}$
- $D_0 = \sum (y_i-\bar{y})^2$ in linear regression
- $D_0 = -2\sum[y_ilog(\bar{y})+(1-y_i)log(1-\bar{y})]$ in logits
The difference between $D$ and $D_0$ comes from information in $x$.
Proportion of deviance explained by $x$ is called the $R^2$ in a linear regression, "Pseudo-$R^2$" in logit.
$$ R^2 = {D_0-D \over D_0} = 1-{D \over D_0}$$
This measures how much variability you explain with your model.
- In `spam`: $R^2 = 1-1549/6170 = 0.75$
- In OJ -- `reg.bse`: $R^2 = 1-13975/30079 = 0.54$
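The same arithmetic from a fitted `glm` object, sketched on simulated data (`deviance` and `null.deviance` are stored on every `glm` fit):

```{r}
set.seed(11)
x = rnorm(500)
y = rbinom(500, 1, plogis(2*x))
m = glm(y ~ x, family = binomial)
1 - m$deviance/m$null.deviance   # pseudo-R^2, between 0 and 1
```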
## $R^2$ in linear regression
Recall that for linear model, deviance is the sum of squared errors (SSE) and $D_0$ is the total sum of squares (TSS).
$$ R^2 = 1-{SSE \over TSS}$$
You may also recall that $R^2 = corr(y,\hat{y})^2$.
```{r}
cor(reg.bse$fitted,oj$logmove)^2
```
For linear regression, min deviance $=$ max corr$(y,\hat{y})$. $\implies$ if $y$ vs $\hat{y}$ is a straight line, you have a perfect fit.
Also implies that $R^2$ (weakly) increases whenever we add another variable.
## Fit plots {.smaller}
```{r, echo=T,out.width="70%",fig.align="center",message=F}
fitplotdf = data.frame(y = oj$logmove,yhat= predict(reg.bse),brand=oj$brand)
ggplot(fitplotdf,aes(y=y,x=yhat,col=brand)) +
geom_jitter(alpha=0.2,size=0.2,width=0.03)+
geom_abline(intercept=0,slope=1)
```
It is good practice to plot $y$ vs $\hat{y}$. It helps you check for specification issues (e.g. non-constant variance, non-linearities, etc)
## Fit plots -- logit {.smaller}
```{r, echo=T,out.width="70%",fig.align="center",message=F}
logit.fit.df = data.frame(spam=email$spam==1,yhat=predict(spam,type="response"))
ggplot(logit.fit.df,aes(y=yhat,x=spam)) +
geom_violin()+ylab("Predicted P[Spam]")
```
New question: how do you choose the classification threshold? When do you send to the spam folder?
## Prediction
Prediction is easy with `glm`.
```{r}
predict(spam,newdata=email[1:4,])
```
But this output is $x'\beta$. To get probabilities $exp(x'\beta)/(1+exp(x'\beta))$, add `type="response"`
```{r}
predict(spam,newdata=email[1:4,],type="response")
```
Newdata must match the format of the original data.
## Out-of-Sample Prediction {.smaller}
We care about how well our model works out-of-sample.
One way to test this is to use a "validation sample".
1. Randomly split your data into two samples
- usually named "testing" and "training".
2. Fit your model using the "training" data
3. Test predictions on the left-out "testing" data.
```{r, warning=F}
#Sample 1000 indices
leaveout = sample.int(nrow(email),1000)
#Train the model
spam_train = glm(spam~.,data=email[-leaveout,],family="binomial")
#Predict performance on the test data.
test_preds = predict(spam_train,newdata=email[leaveout,],type="response")
```
## Out-of-Sample Prediction {.smaller}
Fit plots for those left-out observations.
```{r, echo=F,out.width="70%",fig.align="center",message=F}
logit.fit.df = data.frame(spam=email$spam[leaveout]==1,yhat=test_preds)
ggplot(logit.fit.df,aes(y=yhat,x=spam)) +
geom_violin()+ylab("Predicted P[Spam]")
```
For the leave out data, we get $D_0 = 1360$ and $D=362$ for $R^2 = 0.73$.
Since the leave-out sample is random, your results may vary.
Note: OOS $R^2$ is lower than in-sample $R^2 = 0.75$
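The OOS $R^2$ computation, sketched end-to-end on simulated data (the split size and data-generating model are arbitrary; note $\bar{y}$ comes from the training sample only):

```{r}
set.seed(5)
df = data.frame(x = rnorm(2000))
df$y = rbinom(2000, 1, plogis(df$x))
test = sample(nrow(df), 500)
m = glm(y ~ x, data = df[-test, ], family = binomial)
phat = predict(m, newdata = df[test, ], type = "response")
yt = df$y[test]
ybar = mean(df$y[-test])   # null model fit, estimated on training data only
D  = -2*sum(yt*log(phat) + (1 - yt)*log(1 - phat))
D0 = -2*sum(yt*log(ybar) + (1 - yt)*log(1 - ybar))
1 - D/D0                   # out-of-sample R^2
```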
## Out-of-Sample
More on this next week during model selection.
You might see notions of this in my code for predicting vaccinations.
# Wrap up
## Things to do
Before Thursday:
- Literally nothing
## Rehash
- Regressions minimize prediction errors, loss, or deviance, or they maximize some likelihood function.
- Interpretation of coefficients depends on variables: elasticities, levels, logs
- Factor coefficients wind up being "differences from some reference level"
- Interactions allow us to model the effect of one covariate on another
- Logistic regression lets us predict binary variables using a linear model of log-odds
- Deviances are a measure of prediction error which generalizes beyond linear regression.
- Out-of-sample performance is key.
# Bye!