```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(glmnet)
setwd("~/Github/bigdata/hw/hw3/")
```
This homework examines the use of (a) stepwise regression and (b) LASSO for model selection with AIC and cross-validation. We will be re-using the home sale data we saw in HW2.
# Data
Load in the data. Do the same things we did last week to clean it a bit.
```{r, message=F}
homes = read_csv("https://codowd.com/bigdata/hw/hw2/homes2004.csv")
# Function that detects character columns whose only values are "Y" and "N"
temp_fn = function(x) is.character(x) & all(levels(as.factor(x)) %in% c("Y","N"))
# Make a function that takes those levels and makes a logical variable
temp_fn2 = function(x) x == "Y"
# Apply temp_fn2 to the columns where temp_fn is TRUE
homes = homes %>% mutate(across(where(temp_fn),temp_fn2))
# Convert remaining character columns to factors
homes = homes %>% mutate(across(where(is.character),as.factor))
# Relevel the factors so NA is an explicit level (so the LASSO model matrix keeps rows with missing values)
na_relevel = function(x) factor(x,levels=c(NA,levels(x)),exclude=NULL)
homes = homes %>% mutate(across(where(is.factor),na_relevel))
```
Make a binary variable indicating whether the numbers of bedrooms and bathrooms are both at least two. This is what we will be trying to predict.
```{r}
homes = homes %>% mutate(twotwo = BATHS>=2 & BEDRMS >=2)
```
# Questions
All questions asking you to predict `twotwo` want you to use all the variables in the data. All questions not regarding `twotwo` should *include* `twotwo` in the model selection process (i.e. don't go out of your way to exclude it).
## Q1
Model 1: Use a forward stepwise linear regression (with AIC based steps) to predict `log(LPRICE)` in the data. What is the AIC of the final regression (hint: it's part of the standard output for `glm` but not `lm`)? How many coefficients does it have?
```{r}
null = glm(log(LPRICE)~1,data=homes)
full = formula(glm(log(LPRICE)~.,data=homes))
stepwise = step(null, scope = full, direction = "forward", trace = 0)
summary(stepwise)$aic #Prints AIC
length(coef(stepwise)) #Prints number of coefficients
```
Model 2: Use a forward stepwise linear regression (with BIC based steps - hint: look at the help page for `step`) to predict `log(LPRICE)` in the data. How many coefficients does this model have?
```{r}
stepwise_bic = step(null, scope = full, direction = "forward", trace = 0, k = log(nrow(homes)))
length(coef(stepwise_bic))
```
In two sentences or less, explain the reason for the difference in number of coefficients between models 1&2.
> BIC imposes a stronger penalty on additional parameters, charging roughly 10 per parameter (`log(nrow(homes))` in this data) rather than 2. Fewer parameters are sufficiently predictive to overcome that larger penalty, and so fewer variables get included.
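As a quick sanity check (a sketch using the two models fit above): `extractAIC` is the helper `step` uses under the hood, and it reports the penalized criterion for any choice of `k`.

```{r}
n = nrow(homes)
extractAIC(stepwise, k = 2)           # (edf, AIC) for the AIC-selected model
extractAIC(stepwise_bic, k = log(n))  # (edf, BIC) for the BIC-selected model
log(n)                                # the per-parameter penalty BIC uses here
```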
## Q2
```{r}
xmatrix = sparse.model.matrix(log(LPRICE)~.,data=homes)
```
Use a cross-validated LASSO to predict `log(LPRICE)` (`cv.glmnet` should make this relatively easy, but you will want `xmatrix` from above). Show the deviance plot. How many possible subsets of the variables in our data are there? (hint: `ncol(xmatrix)` $\neq$ `ncol(homes)`, and `ncol(xmatrix)` is the relevant number)
Model 3: Find the $\lambda$ that minimizes OOS deviance and the deviance at that $\lambda$ (no need to report either). What is the number of non-zero coefficients with that $\lambda_{min}$?
Model 4: Find the biggest $\lambda$ within 1 standard error of the minimum OOS deviance and the deviance at that $\lambda$ (no need to report either). How many non-zero coefficients with that $\lambda_{1se}$?
We know that, relative to the null deviance, the pseudo $R^2$ is $1-D/D_0$. What we haven't discussed is that we could use a similar method to compare any two models (with the same likelihood function). What is the ratio of the deviance of the smaller model (Model 4) to the deviance of the larger model (Model 3)? In two sentences or less, interpret that ratio.
```{r}
cv.mod = cv.glmnet(xmatrix,log(homes$LPRICE)) #Build model
2^ncol(xmatrix) # Number of possible subsets of the predictors
```
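For reference (a quick sketch), the factor expansion is why `ncol(xmatrix)` exceeds `ncol(homes)` -- each factor becomes a block of dummy columns in the model matrix.

```{r}
c(ncol(homes), ncol(xmatrix))  # raw data columns vs. expanded model-matrix columns
```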
```{r}
plot(cv.mod)
```
```{r}
cv.mod$lambda.min #Minimizing Lambda
min(cv.mod$cvm) #Min OOS deviance. Could also read off plot.
sum(coef(cv.mod,s="lambda.min")!=0) #nonzero coefs with minimizing lambda.
```
Mostly the same code for the 1se model (Model 4):
```{r}
cv.mod$lambda.1se #1se Lambda
cv.mod$cvm[cv.mod$lambda == cv.mod$lambda.1se] #deviance of 1se Lambda
sum(coef(cv.mod,s="lambda.1se")!=0) #nonzero coefs with 1se lambda
```
```{r}
cv.mod$cvm[cv.mod$lambda == cv.mod$lambda.1se]/min(cv.mod$cvm) # OOS deviance of the smaller (1se) model over that of the larger (min) model
```
> This ratio says the smaller model's OOS deviance is about 6% larger than the larger model's: relative to the larger model, the smaller model leaves roughly 6% more of the variation in the data unexplained.
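An equivalent way to read that ratio (a sketch reusing `cv.mod`; the names `dev_1se` and `dev_min` are just chosen here): treating the smaller model as the baseline, $1 - D_{min}/D_{1se}$ is the share of its OOS deviance that the larger model removes.

```{r}
dev_1se = cv.mod$cvm[cv.mod$lambda == cv.mod$lambda.1se]  # OOS deviance of the smaller (1se) model
dev_min = min(cv.mod$cvm)                                 # OOS deviance of the larger (min-lambda) model
1 - dev_min/dev_1se  # share of the smaller model's OOS deviance removed by the larger model
```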
## Q3
Model 5: Add interactions (warning: this may be slow; hint: you'll need to modify/recreate `xmatrix`) and re-estimate the same cross-validated model as in Q2. How many possible subsets are there now (or at least, what does R say)? Show the CV error plot. What is the estimated number of non-zero coefficients when $\lambda=0.2$? (hint: look at the help page for `cv.glmnet`) In one sentence, how do these OOS deviances compare to the OOS deviances in Q2?
```{r}
xmatrix = sparse.model.matrix(log(LPRICE)~.^2,data=homes)
cv.mod2 = cv.glmnet(xmatrix,log(homes$LPRICE))
2^ncol(xmatrix) # Number of possible subsets
```
```{r}
plot(cv.mod2)
```
```{r}
sum(coef(cv.mod2,s=0.2)!=0)
```
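One way to make the one-sentence comparison concrete (a sketch using `cv.mod` from Q2, which is still in memory): look at the minimum CV error from each fit.

```{r}
min(cv.mod$cvm)   # best OOS deviance without interactions (Q2)
min(cv.mod2$cvm)  # best OOS deviance with interactions (Q3)
```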
## Q4
Using a straightforward linear model to predict a binary outcome is a model structure known as a "linear probability model" (LPM). This is because $E[Y]$ is still $P(Y=1)$ and the linear model still targets $E[Y]$, but it does so purely linearly, so we have a linear model of the probability.
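As a tiny illustration (a sketch, not one of the required models -- `lpm_toy` is just a name chosen here): regressing the binary `twotwo` on `log(LPRICE)` alone gives fitted values that we read directly as probabilities.

```{r}
lpm_toy = lm(as.numeric(twotwo) ~ log(LPRICE), data = homes)
head(fitted(lpm_toy))  # fitted values interpreted as P(twotwo = TRUE)
```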
Model 6: Cross-validate a linear LASSO (no interactions or anything) to predict `twotwo`.
```{r}
xmatrix = sparse.model.matrix(twotwo~.,data=homes)
mod6 = cv.glmnet(xmatrix,homes$twotwo)
```
Model 7: Cross-validate a logistic LASSO to predict `twotwo` (hint: `family="binomial"`).
```{r}
mod7 = cv.glmnet(xmatrix,homes$twotwo,family="binomial")
```
Which model has the smaller deviance? In at most two sentences, why can't we use deviances to compare Model 6 and Model 7? Rerun Model 7 but set `type.measure="mse"` in `cv.glmnet`. In at most two sentences: can we compare these models now, and if so, which is better?
```{r}
mod7 = cv.glmnet(xmatrix,homes$twotwo,family="binomial",type.measure="mse")
```
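To actually make the comparison (a sketch using the two cross-validated fits above), compare the minimum CV errors, which are now both mean-squared errors:

```{r}
min(mod6$cvm)  # Model 6 (LPM); for the gaussian family the CV "deviance" is the MSE
min(mod7$cvm)  # Model 7 (logistic), now also scored by MSE
```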
> Model 7 has the smaller reported deviance. But Model 7 is reporting the binomial deviance, not the mean-squared error, so the comparison is not apples-to-apples. After setting `type.measure="mse"`, we can compare the two models -- we now have an MSE for both -- and Model 7 is still better.
## Q5
A well-known issue with LPMs is that they frequently predict probabilities that are nonsensical (e.g. they might predict that $P(Y=1) = 110\%$ or $P(Y=1) = -10\%$). Check your predictions from Model 6 and you'll see that some of them are like this.
One solution to this problem is to use a link function that constrains predictions to $[0,1]$ -- as logistic regression does. A different solution is to say "we know those probabilities are nonsense, but for prediction purposes we can easily set them to reasonable values". Specifically, predictions greater than 1 we set to 1, and predictions less than 0 we set to 0. I've written a simple function below that does this.
```{r}
lpm_prediction_editor = function(preds) {
  out = preds
  out[out > 1] = 1
  out[out < 0] = 0
  out
}
```
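To see the problem and the fix in action (a sketch; `preds6` is just a name chosen here for Model 6's predictions):

```{r}
preds6 = drop(predict(mod6, newx = xmatrix, s = "lambda.min"))  # Model 6 predictions
range(preds6)                          # some fall outside [0,1], as described above
range(lpm_prediction_editor(preds6))   # after clipping, all lie in [0,1]
```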
Would using this function for our predictions with Model 6 improve the MSE?
Could this change to our predictions change the value of $\lambda$ we should select?
Given a $\lambda$, could this change to our predictions change the variables we should choose? Could it change the coefficients on variables we already chose?
> Yes, using the function would improve our MSE. All values of $y$ are inside $[0,1]$, so clipping values of $\hat y$ into that interval moves each prediction weakly closer to its $y$, which can only improve the MSE.
> Properly speaking, if we are adjusting our predictions like this, it changes every step at which we optimized against prediction errors. For a given $\lambda$ that means our coefficients and variables may change. It also means the optimal $\lambda$ may change.
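A quick empirical check of the first claim (a sketch, reusing the `preds6` predictions computed above):

```{r}
mean((homes$twotwo - preds6)^2)                         # in-sample MSE, raw LPM predictions
mean((homes$twotwo - lpm_prediction_editor(preds6))^2)  # in-sample MSE, clipped predictions
```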
## Q6 -- purely R
Changing our predictions to "certainly FALSE" and "certainly TRUE" [$\hat y$ either 0 or 1] as the function above does seems extreme. Rewrite the above function so that it takes an input argument `eps_err`, and adds `eps_err` probability of being wrong to all the predictions which claim to be certain.
```{r}
lpm_prediction_editor = function(preds, eps_err = 0) {
  out = preds
  out[out >= 1] = 1 - eps_err
  out[out <= 0] = eps_err
  out
}
```
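A quick usage example (a sketch with made-up predictions):

```{r}
# The formerly "certain" predictions now admit a 1% chance of being wrong.
lpm_prediction_editor(c(-0.2, 0.4, 1.3), eps_err = 0.01)  # 0.01 0.40 0.99
```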
> It is important to note that while clipping our predictions to 0 and 1 could only improve MSE, shifting them to 0.99 and 0.01 (or whatever `eps_err` implies) could make our MSE worse -- mostly in situations where we have a lot of predictive power.
# Submission
This homework is due by midnight on Wednesday, April 21. As before, submit on Canvas in groups; solutions will be discussed in class on April 22.
# Optional Exercises
1. Build a function that calculates the binomial deviance given predictions and truth (a sketch is given after this list).
2. Take the code from lecture 6, and modify it to run the cross-validation for Model 6 by hand.
    1. Modify the predictions inside the cross-validation so that they are all within $[0,1]$, using `eps_err=0.001`.
    2. Instead of returning the MSE of each cross-validated lambda, return the binomial deviance.
3. Build a fit plot for each of the models in Q4.
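For exercise 1, a minimal sketch (assuming the standard definition, deviance $= -2 \times$ log-likelihood; `binomial_deviance` is just a name chosen here):

```{r}
binomial_deviance = function(p, y, eps = 1e-12) {
  p = pmin(pmax(p, eps), 1 - eps)  # keep the logs finite at p = 0 or 1
  -2 * sum(y * log(p) + (1 - y) * log(1 - p))
}
```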