This homework is going to examine the use of a) stepwise regression and b) LASSO for model selection with AIC and cross-validation. We will be re-using the home sale data we’ve saw in HW2.

Load in the data. Do the same things we did last week to clean it a bit.

```
homes = read_csv("https://codowd.com/bigdata/hw/hw2/homes2004.csv")
#Make a function that tells us if there are two levels -- Y and N
temp_fn = function(x) is.character(x) & all(levels(as.factor(x)) %in% c("Y","N"))
# Make a function that takes those levels and makes a logical variable
temp_fn2 = function(x) x == "Y"
#Run temp fn2 on columns where temp_fn 1 is true
homes = homes %>% mutate(across(where(temp_fn),temp_fn2))
#Convert remaining character functions to
homes = homes %>% mutate(across(where(is.character),as.factor))
# But we want to relevel them for LASSO
na_relevel = function(x) factor(x,levels=c(NA,levels(x)),exclude=NULL)
homes = homes %>% mutate(across(where(is.factor),na_relevel))
```

Make a binary for if the number of bedrooms and bathrooms are both at least two. This is what we will be trying to predict.

`homes = homes %>% mutate(twotwo = BATHS>=2 & BEDRMS >=2)`

All questions asking you to predict `twotwo`

in the data want you to use all variables. All questions not regarding `twotwo`

should *include* `twotwo`

in the model selection process (e.g. just don’t try to exclude it).

Model 1: Use a forward stepwise linear regression (with AIC based steps) to predict `log(LPRICE)`

in the data. What is the AIC of the final regression (hint: its part of the standard output for `glm`

but not `lm`

)? How many coefficients does it have?

```
null = glm(log(LPRICE)~1,data=homes)
full = formula(glm(log(LPRICE)~.,data=homes))
stepwise = step(null,full,dir="forward",trace=0)
summary(stepwise)$aic #Prints AIC
```

`## [1] 30290.06`

`length(coef(stepwise)) #Prints number of coefficients`

`## [1] 41`

Model 2: Use a forward stepwise linear regression (with BIC based steps - hint: look at the help page for `step`

) to predict `log(LPRICE)`

in the data. How many coefficients does this model have?

```
stepwise_bic = step(null,full,dir="forward",trace=0,k=log(nrow(homes)))
length(coef(stepwise_bic))
```

`## [1] 32`

In two sentences or less, explain the reason for the difference in number of coefficients between models 1&2.

BIC imposes a stronger penalty on additional parameters, multiplying each additional parameter by ~10 (

`log(nrow(homes))`

in this data) rather than 2. This means fewer parameters are sufficiently predictive to overcome the penalty, and so fewer variables get included.

`xmatrix = sparse.model.matrix(log(LPRICE)~.,data=homes)`

Use a cross-validated LASSO to predict `log(LPRICE)`

(`cv.glmnet`

should make this relatively easy, but you will want `xmatrix`

from above). Show the deviance plot. How many possible subsets of the variables in our data are there? (hint: `ncol(xmatrix)`

\(\neq\) `ncol(homes)`

, and `ncol(xmatrix)`

is the relevant number)

Model 3: Find the \(\lambda\) that minimizes OOS deviance and the deviance at that \(\lambda\) (no need to report either). What is the number of non-zero coefficients with that \(\lambda_{min}\)?

Model 4: Find the biggest \(\lambda\) within 1 standard error of the minimum OOS deviance and the deviance at that \(\lambda\) (no need to report either). How many non-zero coefficients with that \(\lambda_{1se}\)?

We know that relative to the null distribution, the pseudo \(R^2\) is \(1-D/D_0\). What we haven’t discussed is that we could use a similar method to compare any two models (with the same likelihoods). What is the ratio of deviance of the larger model to deviance of the smaller model? In two sentences or less interpret that ratio.

```
cv.mod = cv.glmnet(xmatrix,log(homes$LPRICE)) #Build model
2^ncol(xmatrix) #Find n subsets.
```

`## [1] 1.1259e+15`

`plot(cv.mod)`

`cv.mod$lambda.min #Minimizing Lambda`

`## [1] 0.0007139252`

`min(cv.mod$cvm) #Min OOS deviance. Could also read off plot. `

`## [1] 0.4108812`

`sum(coef(cv.mod,s="lambda.min")!=0) #nonzero coefs with minimizing lambda.`

`## [1] 46`

Mostly the same code for the second bits

`cv.mod$lambda.1se #1se Lambda`

`## [1] 0.05155117`

`cv.mod$cvm[cv.mod$lambda == cv.mod$lambda.1se] #deviance of 1se Lambda`

`## [1] 0.441502`

`sum(coef(cv.mod,s="lambda.1se")!=0) #nonzero coefs with 1se lambda`

`## [1] 18`

`cv.mod$cvm[cv.mod$lambda == cv.mod$lambda.1se]/min(cv.mod$cvm)`

`## [1] 1.074525`

This ratio says that relative to the larger model, the smaller model fails to explain ~6% of the variation in the data.

Model 5: Add interactions (warning: may be slow, hint: you’ll need to modify/recreate `xmatrix`

) and re-estimate the same cross-validation model as in Q2. How many possible subsets are there now (or at least, what does R say)? Show the CV error plot. What is the estimated number of coefficients when \(\lambda=0.2\)? (hint: look at the help page for `cv.glmnet`

) In one sentence, how do these OOS deviances compare to the OOS deviances in Q2?

```
xmatrix = sparse.model.matrix(log(LPRICE)~.^2,data=homes)
cv.mod2 = cv.glmnet(xmatrix,log(homes$LPRICE))
2^ncol(xmatrix)
```

`## [1] Inf`

`plot(cv.mod2)`