This homework is going to examine the use of a) stepwise regression and b) LASSO for model selection with AIC and cross-validation. We will be re-using the home sale data we saw in HW2.

# 1 Data

Load in the data. Do the same things we did last week to clean it a bit.

homes = read_csv("https://codowd.com/bigdata/hw/hw2/homes2004.csv")

#Make a function that tells us if there are two levels -- Y and N
temp_fn = function(x) is.character(x) & all(levels(as.factor(x)) %in% c("Y","N"))
# Make a function that takes those levels and makes a logical variable
temp_fn2 = function(x) x == "Y"
#Run temp fn2 on columns where temp_fn 1 is true
homes = homes %>% mutate(across(where(temp_fn),temp_fn2))
#Convert remaining character functions to
homes = homes %>% mutate(across(where(is.character),as.factor))
# But we want to relevel them for LASSO
na_relevel = function(x) factor(x,levels=c(NA,levels(x)),exclude=NULL)
homes = homes %>% mutate(across(where(is.factor),na_relevel))

Make a binary for if the number of bedrooms and bathrooms are both at least two. This is what we will be trying to predict.

homes = homes %>% mutate(twotwo = BATHS>=2 & BEDRMS >=2)

# 2 Questions

All questions asking you to predict twotwo in the data want you to use all variables. All questions not regarding twotwo should include twotwo in the model selection process (e.g. just don’t try to exclude it).

## 2.1 Q1 – Stepwise Regression: AIC & BIC

Model 1: Use a forward stepwise linear regression (with AIC based steps) to predict log(LPRICE) in the data. What is the AIC of the final regression (hint: its part of the standard output for glm but not lm)? How many coefficients does it have?

Model 2: Use a forward stepwise linear regression (with BIC based steps - hint: look at the help page for step) to predict log(LPRICE) in the data. How many coefficients does this model have?

In two sentences or less, explain the reason for the difference in number of coefficients between models 1&2.

## 2.2 Q2 – BASIC Cross-Validated LASSO

xmatrix = sparse.model.matrix(log(LPRICE)~.,data=homes)

Use a cross-validated LASSO to predict log(LPRICE) (cv.glmnet should make this relatively easy, but you will want xmatrix from above). Show the deviance plot. How many possible subsets of the variables in our data are there? (hint: ncol(xmatrix) $$\neq$$ ncol(homes), and ncol(xmatrix) is the relevant number)

Model 3: Find the $$\lambda$$ that minimizes OOS deviance (hint: see cv.glmnet help page) and the deviance at that $$\lambda$$ (no need to report either). What is the number of non-zero coefficients with that $$\lambda_{min}$$?

Model 4: Find the biggest $$\lambda$$ within 1 standard error of the minimum OOS deviance and the deviance at that $$\lambda$$ (no need to report either). How many non-zero coefficients with that $$\lambda_{1se}$$?

We know that relative to the null distribution, the pseudo $$R^2$$ is $$1-D/D_0$$. What we haven’t discussed is that we could use a similar method to compare any two models (with the same likelihoods). What is the ratio of deviance of the larger model to deviance of the smaller model? In two sentences or less interpret that ratio.

## 2.3 Q3 – GIANT Models

Model 5: Add interactions (warning: may be slow, hint: you’ll need to modify/recreate xmatrix) and re-estimate the same cross-validation model as in Q2. How many possible subsets are there now (or at least, what does R say)? Show the CV error plot. In one sentence, how do these OOS deviances compare to the OOS deviances in Q2?

## 2.4 Q4 – LPM, Deviances, Loss

Using a straightforward linear model to predict a binary is a model structure known as a “linear probability model” (LPM). This is because the $$E[Y]$$ is still $$P(Y=1)$$, and the linear model is still optimizing for $$E[Y]$$, but is purely linear, so we have a linear model of the probability.

Model 6: Cross-validate a linear LASSO (no interactions or anything) to predict twotwo.

Model 7: Cross-validate a logistic LASSO to predict twotwo (hint: family="binomial").

Which model has the smaller deviance? In at most two sentences, why can’t we use deviances to compare Model 6 and Model 7? Rerun model 7 but set type.measure="mse" in cv.glmnet. In at most two sentences, Can we compare these models now? and if so, which is better?

## 2.5 Q5 – Changes to Predictions

A well known issue with LPMs is that they frequently predict probabilities that are non-sensical (E.g. they might predict that $$P(Y=1) = 110%$$, or $$P(Y=1) = -10%$$). Check your predictions from Model 6 and you’ll see that some of them are like this.

One solution to this problem is to use a link function which is constrained to $$[0,1]$$ – as the logistic regression does. A different solution is to say “we know those probabilities are nonsense, but for prediction purposes we can easily set them to reasonable values”. Specifically, predictions that are greater than 1, we set to 1, and predictions that are less than 0, we set to 0. I’ve written a simple function below that does this.

lpm_prediction_editor = function(preds) {
out = preds
out[out>1] = 1
out[out<0] = 0
out
}

Would using this function for our predictions with Model 6 improve the MSE? Could this change to our predictions change the value of $$\lambda$$ we should select? Given a $$\lambda$$, could this change to our predictions change the variables we should choose? Could it change the coefficients on variables we already chose?

## 2.6 Q6 – purely R

Changing our predictions to “certainly FALSE” and “certainly TRUE” [$$\hat y$$ either 0 or 1] as the function above does seems extreme. Rewrite the above function so that it takes an input argument eps_err, and adds eps_err probability of being wrong to all the predictions which claim to be certain.

# 3 Submission

This homework is due by Midnight on Wednesday April 21. As before, submit on canvas in groups, solutions will be discussed in class on April 22.

# 4 Optional Exercises

1. Build a function that calculates the binomial deviance given predictions and truth.
2. Take the code from lecture 6, and modify it to run the cross-validation for model 6 by hand.
1. Modify the predictions inside the cross-validation so that they are all between $$[0,1]$$, using an eps_err=0.001.
2. Instead of returning the MSE of each cross-validated lambda, return the binomial deviance.
3. Build a fit plot for each of the models in Q4.