```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(glmnet)
setwd("~/Github/bigdata/hw/hw3/")
```

This homework is going to examine the use of a) stepwise regression and b) LASSO for model selection with AIC and cross-validation. We will be re-using the home sale data we saw in HW2. 

# Data
Load in the data. Do the same things we did last week to clean it a bit. 

```{r, message=F}
homes = read_csv("https://codowd.com/bigdata/hw/hw2/homes2004.csv")

#Make a function that tells us if there are two levels -- Y and N
temp_fn = function(x) is.character(x) & all(levels(as.factor(x)) %in% c("Y","N"))
# Make a function that takes those levels and makes a logical variable
temp_fn2 = function(x) x == "Y"
#Run temp fn2 on columns where temp_fn 1 is true
homes = homes %>% mutate(across(where(temp_fn),temp_fn2))
#Convert remaining character functions to 
homes = homes %>% mutate(across(where(is.character),as.factor))
# But we want to relevel them for LASSO
na_relevel = function(x) factor(x,levels=c(NA,levels(x)),exclude=NULL)
homes = homes %>% mutate(across(where(is.factor),na_relevel))
```


Make a binary for if the number of bedrooms and bathrooms are both at least two. This is what we will be trying to predict. 
```{r}
homes = homes %>% mutate(twotwo = BATHS>=2 & BEDRMS >=2)
```

# Questions

All questions asking you to predict `twotwo` in the data want you to use all variables. All questions not regarding `twotwo` should *include* `twotwo` in the model selection process (e.g. just don't try to exclude it). 


## Q1 -- Stepwise Regression: AIC & BIC

Model 1: Use a forward stepwise linear regression (with AIC based steps) to predict `log(LPRICE)` in the data. What is the AIC of the final regression (hint: its part of the standard output for `glm` but not `lm`)? How many coefficients does it have? 

Model 2: Use a forward stepwise linear regression (with BIC based steps - hint: look at the help page for `step`) to predict `log(LPRICE)` in the data. How many coefficients does this model have?

In two sentences or less, explain the reason for the difference in number of coefficients between models 1&2.

## Q2 -- BASIC Cross-Validated LASSO

```{r}
xmatrix = sparse.model.matrix(log(LPRICE)~.,data=homes)
```

Use a cross-validated LASSO to predict `log(LPRICE)` (`cv.glmnet` should make this relatively easy, but you will want `xmatrix` from above).  Show the deviance plot. How many possible subsets of the variables in our data are there? (hint: `ncol(xmatrix)` $\neq$ `ncol(homes)`, and `ncol(xmatrix)` is the relevant number)

Model 3: Find the $\lambda$ that minimizes OOS deviance (hint: see `cv.glmnet` help page) and the deviance at that $\lambda$ (no need to report either). What is the number of non-zero coefficients with that $\lambda_{min}$? 

Model 4: Find the biggest $\lambda$ within 1 standard error of the minimum OOS deviance and the deviance at that $\lambda$ (no need to report either). How many non-zero coefficients with that $\lambda_{1se}$?

We know that relative to the null distribution, the pseudo $R^2$ is $1-D/D_0$. What we haven't discussed is that we could use a similar method to compare any two models (with the same likelihoods). What is the ratio of deviance of the larger model to deviance of the smaller model? In two sentences or less interpret that ratio.

## Q3 -- GIANT Models

Model 5: Add interactions (warning: may be slow, hint: you'll need to modify/recreate `xmatrix`) and re-estimate the same cross-validation model as in Q2. How many possible subsets are there now (or at least, what does R say)? Show the CV error plot. In one sentence, how do these OOS deviances compare to the OOS deviances in Q2? 

## Q4 -- LPM, Deviances, Loss

Using a straightforward linear model to predict a binary is a model structure known as a "linear probability model" (LPM). This is because the $E[Y]$ is still $P(Y=1)$, and the linear model is still optimizing for $E[Y]$, but is purely linear, so we have a linear model of the probability.

Model 6: Cross-validate a linear LASSO (no interactions or anything) to predict `twotwo`.

Model 7: Cross-validate a logistic LASSO to predict `twotwo` (hint: `family="binomial"`). 

Which model has the smaller deviance? In at most two sentences, why can't we use deviances to compare Model 6 and Model 7? Rerun model 7 but set `type.measure="mse"` in `cv.glmnet`. In at most two sentences, Can we compare these models now? and if so, which is better? 

## Q5 -- Changes to Predictions

A well known issue with LPMs is that they frequently predict probabilities that are non-sensical (E.g. they might predict that $P(Y=1) = 110%$, or $P(Y=1) = -10%$). Check your predictions from Model 6 and you'll see that some of them are like this.

One solution to this problem is to use a link function which is constrained to $[0,1]$ -- as the logistic regression does. A different solution is to say "we know those probabilities are nonsense, but for prediction purposes we can easily set them to reasonable values". Specifically, predictions that are greater than 1, we set to 1, and predictions that are less than 0, we set to 0. I've written a simple function below that does this. 

```{r}
lpm_prediction_editor = function(preds) {
  out = preds
  out[out>1] = 1
  out[out<0] = 0
  out
}
```

Would using this function for our predictions with Model 6 improve the MSE? 
Could this change to our predictions change the value of $\lambda$ we should select? 
Given a $\lambda$, could this change to our predictions change the variables we should choose? Could it change the coefficients on variables we already chose? 

## Q6 -- purely R

Changing our predictions to "certainly FALSE" and "certainly TRUE"  [$\hat y$ either 0 or 1] as the function above does seems extreme. Rewrite the above function so that it takes an input argument `eps_err`, and adds `eps_err` probability of being wrong to all the predictions which claim to be certain. 


# Submission

This homework is due by Midnight on Wednesday April 21. As before, submit on canvas in groups, solutions will be discussed in class on April 22. 

# Optional Exercises

1. Build a function that calculates the binomial deviance given predictions and truth.
2. Take the code from lecture 6, and modify it to run the cross-validation for model 6 by hand.
    1. Modify the predictions inside the cross-validation so that they are all between $[0,1]$, using an `eps_err=0.001`.
    2. Instead of returning the MSE of each cross-validated lambda, return the binomial deviance.
3. Build a fit plot for each of the models in Q4.