```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
setwd("~/Github/bigdata/hw/hw2/")
```
# Introduction
We have a dataset consisting of a lot of descriptive information about a number of home sales. We're going to focus on predicting sale prices and whether or not the down payment is at least 20%.
# Data
The data is available at [https://codowd.com/bigdata/hw/hw2/homes2004.csv](https://codowd.com/bigdata/hw/hw2/homes2004.csv).
There is a "codebook", which describes all the variables, posted at [https://codowd.com/bigdata/hw/hw2/homes2004code.txt](https://codowd.com/bigdata/hw/hw2/homes2004code.txt).
## Importing Data
```{r, message=F}
library(tidyverse)
homes = read_csv("https://codowd.com/bigdata/hw/hw2/homes2004.csv")
homes
```
## Cleaning it up
Lots of the data is stored as characters, even though they look like binary variables. Lets fix that.
First, I'll make sure that is really what is happening.
```{r}
# I could look at every observation, but that will be time consuming.
# Instead I'll look at each column, and see how many values it takes.
levels(as.factor(homes$EABAN))
# So EABAN only takes values Y and N.
# Look at another Variable.
levels(as.factor(homes$STATE))
# Other variables take many levels. So we want to see which columns only take two values.
length(levels(as.factor(homes$BATHS)))
#Baths takes a lot of values. So it isn't binary.
# But typing things out and running this code for every column is also time consuming.
# Let's do it all at once using "apply".
#First we build a function that does the above. It takes a variable "x", and finds how many values that variable takes.
apply_helper = function(x) length(levels(as.factor(x)))
# As an aside, the following would also work.
apply_helper = function(x) length(unique(x))
#Tell apply to look at the data frame "homes". Then tell it to look at each column (rather than each row) (this is the "2"). Then run our function on each column of the dataframe
apply(homes,2,apply_helper)
```
So we can conclude that for a sizeable number of variables, there are only two values, "Y" and "N". Let's convert them into Binary's that we can interpret easily.
There are more complicated and "prettier" ways to do this, but sometimes simplicity is its own value.
```{r}
for (i in 1:ncol(homes)) { #For each column
uniques = unique(homes[[i]]) #Find the unique values
uniques = sort(uniques) #Sort the values (so "N" is before "Y")
#If there are too many unique values, move to the next column
if (length(uniques) != 2) next
#If the 2 unique values are correct, (match: "N","Y")
if (uniques[1] == "N" & uniques[2] == "Y") {
homes[[i]] = (homes[[i]] == "Y") #Replace with a binary.
} else { #Otherwise
print(i) #Print the column number
print(uniques) #And the values
}
}
```
Okay, lets look at the misbehaving columns
```{r}
colnames(homes)[c(11,12,21,27)]
```
They seem to be a mix of things, which aren't all that cleanly converted into binaries.
There are still a few other oddball variables around. In particular, we have a few character vectors with more than 2 values (e.g. State). Lets convert all the characters to factors (they all take a limited number of values anyhow).
```{r}
# In the homes data, look "across" columns and change ("mutate") the ones that are characters (is.character) into factors ("as.factor")
homes = homes %>% mutate(across(where(is.character),as.factor))
```
# Questions
## Q1 - Regression
Regress log(price) against all the variables except mortgage and ETRANS (hint: `y~. -VarNAME` will regress on everything except VarNAME). What is the $R^2$? How many coefficients are there?
```{r}
mod1 = glm(log(LPRICE)~.-AMMORT-ETRANS,data=homes)
1-mod1$deviance/mod1$null.deviance
length(coef(mod1)) #number of coefs
```
## Q2 - FDR, variable selection
Rerun the regression with just those variables that are significant at a 5% FDR (hint: `summary(mod1)$coefficients[,4]` may be helpful assuming your regression in Q1 was named `mod1`). If a factor has some significant levels, keep the entire factor in. What is the new $R^2$? What happened with the $R^2$? Why? (2 sentences total)
```{r}
#Find cut first.
fdr_cut = function(pvals, q){
pvals = pvals[!is.na(pvals)] # Ditch NAs
p = length(pvals) #Count hypotheses
k = rank(pvals, ties.method="min") #Rank pvals
alpha = max(pvals[ pvals <= (q*k/p) ]) #Find relevant pval
alpha #Return
}
smod = summary(mod1)
pvals = smod$coefficients[,4]
alpha_star = fdr_cut(pvals,0.05)
indices = pvals <= alpha_star
#Look at variables
which(!indices)
#Can't ditch state -- it takes many (significant) values.
mod2 = glm(log(LPRICE)~.-AMMORT-ETRANS-NUNITS-BEDRMS,data=homes)
1-mod2$deviance/mod2$null.deviance
```
$R^2$ fell -- very slightly, because we removed two variables that were being used to maximize the $R^2$.
Some people may have ditched State in entirety as well, giving the following, larger fall in $R^2$.
```{r}
mod2 = glm(log(LPRICE)~.-AMMORT-ETRANS-NUNITS-BEDRMS-STATE,data=homes)
1-mod2$deviance/mod2$null.deviance
```
It is even possible some people ditched just CO and CT states. Code for that is funky and I'm not bothering. People who managed it, and find a reduced $R^2$, well done.
## Q3 - Logit
Make a binary variable indicating whether or not buyers had at least a 20% down payment (i.e. the mortgage value is less than 80% of the price). Fit a logit to predict this binary using all variables except mortgage, price. Fit a logit using the variables interacted (once) with eachother. (Hint: `y~ .^2` will interact everything, and parenthesis may help) (warning: this may take a while. ~2 minutes on my laptop).
How many more coefficients does the second model have? What are the $R^2$ values of each model (hint: the model output stores deviance and null deviance)? Which model would you prefer for predictions at this stage? (1 sentence)
```{r, cache=T}
homes$down20 = (homes$AMMORT/homes$LPRICE) <= 0.8
mod3 = glm(down20 ~ .-AMMORT-LPRICE,data=homes,family=binomial)
mod4 = glm(down20 ~ (.-AMMORT-LPRICE)^2,data=homes,family=binomial)
```
```{r}
# How many more coefficients does the second model have?
length(coef(mod4))-length(coef(mod3))
#R^2 values
# Mod3 (not interacted)
1-mod3$deviance/mod3$null.deviance
# Mod4 (interacted)
1-mod4$deviance/mod4$null.deviance
```
Looks like model 4 (interacted) has much more predictive power.
## Q4 - Out of Sample A
Estimate the model in Q1 using only data where ETRANS is TRUE. Then test how well that model performs by making predictions for the data where ETRANS is FALSE. Show the out-of-sample fitted vs real outcome plot (hint: it may be helpful to add both a 45 degree line, and the best fit line). Describe what happened here (max 2 sentences, variable codebook may help you).
```{r}
mod5 = glm(log(LPRICE)~.-AMMORT-ETRANS,data=homes %>% filter(ETRANS))
preds = predict(mod5,newdata = homes %>% filter(!ETRANS))
real = log(homes$LPRICE[!homes$ETRANS])
df = data.frame(preds=preds,real=real)
ggplot(df,aes(x=preds,y=real))+
geom_point(alpha=0.02,size=0.1)+
geom_abline(col=2)+
geom_smooth(method="lm",col="blue")
```
Just eyeballing the plot with a 45 degree line it looks pretty good, but if we add a line showing the best fit between predictions and real values, we see a substantial slope difference -- indicating a bias out of sample. This is sensible, as the training sample and test sample have fundamental differences in some attributes (most notably, transit proximity).
## Q5 - Out of Sample B
Randomly select a holdout sample of 1000 observations (hint: the `sample` function). Fit both models from Q3 again using the remaining observations (hint: `homes[-indices,]` will give `homes` but without the observations indexed by the vector `indices`). Make predictions for the holdout sample using each model. Calculate the prediction error for each observation in the holdout sample. What are the mean squared errors for each of these models out of sample? Which model would you prefer at this stage?
```{r, warning=F}
#Build holdout
holdout = sample.int(nrow(homes),1000,replace=F)
#Fit models
mod3a = glm(down20 ~ .-AMMORT-LPRICE,data=homes[-holdout,],family=binomial)
mod4a = glm(down20 ~ (.-AMMORT-LPRICE)^2,data=homes[-holdout,],family=binomial)
```
```{r, warning=F}
#Make Predictions on holdout
preds3a = predict(mod3a,newdata=homes[holdout,],type="response")
preds4a = predict(mod4a,newdata=homes[holdout,],type="response")
#Calculate errors
err3a = preds3a-as.numeric(homes$down20[holdout])
err4a = preds4a-as.numeric(homes$down20[holdout])
#Mean Squared Errors
mean(err3a^2)
mean(err4a^2)
```
Looks like MSE for the smaller (non-interacted) model is better, and I'd prefer it.
Commentary: I'd prefer it more if we used the out-of-sample deviance for this comparison rather than the MSE. Does doing so change our conclusions?
# Submission
As before, submit on canvas in groups. Due Date is Wednesday April 14th at midnight. Solutions will be discussed in class on April 15th.
# Optional Exercises
1. Use a random holdout sample for Q4.
```{r}
modA1 = glm(log(LPRICE)~.-AMMORT-ETRANS,data=homes[-holdout,])
preds = predict(modA1,newdata = homes[holdout,])
real = log(homes$LPRICE[holdout])
df = data.frame(preds=preds,real=real)
ggplot(df,aes(x=preds,y=real))+
geom_point(alpha=0.5,size=0.5)+
geom_abline(col=2)+
geom_smooth(method="lm",col="blue")
```
True line is close to 45 degree line -- good sign. $R^2$ probably changed.
2. Instead of selecting using FDR, install the 'glmnet' package and run a LASSO for Q2.
```{r, eval=F}
install.packages("glmnet")
```
```{r}
library(glmnet)
data = homes %>% select(-AMMORT,-ETRANS)
data = model.matrix(LPRICE~.,data=data)
mod = cv.glmnet(data,log(homes$LPRICE),nfolds=5)
betas = coef(mod,lambda="1se")
betas
sum(betas == 0)
```
3. Calculate the out-of-sample deviance for each model in Q5. Which is better now?
```{r}
binom_deviance = function(y, pred){
if(is.factor(y)) y = as.numeric(y)>1
-2*sum( y*log(pred) + (1-y)*log(1-pred))
}
binom_deviance(homes$down20[holdout],preds3a)
binom_deviance(homes$down20[holdout],preds4a)
```
Model 3 still comes out ahead.