# 1 Introduction

We have a dataset containing descriptive information about roughly 15,000 home sales. We're going to focus on predicting sale prices and whether or not the down payment is at least 20%.

# 2 Data

The data is available at https://codowd.com/bigdata/hw/hw2/homes2004.csv.

There is a “codebook”, which describes all the variables, posted at https://codowd.com/bigdata/hw/hw2/homes2004code.txt.

## 2.1 Importing Data

```r
library(tidyverse)
homes <- read_csv("https://codowd.com/bigdata/hw/hw2/homes2004.csv")
homes
## # A tibble: 15,565 x 29
##    AMMORT EAPTBL ECOM1 ECOM2 EGREEN EJUNK ELOW1 ESFD  ETRANS EABAN HOWH  HOWN
##     <dbl> <chr>  <chr> <chr> <chr>  <chr> <chr> <chr> <chr>  <chr> <chr> <chr>
##  1  50000 N      N     N     Y      N     N     Y     N      N     good  good
##  2  70000 N      N     N     N      N     N     Y     N      N     good  bad
##  3 117000 N      N     N     N      N     N     Y     N      N     good  good
##  4 100000 N      N     N     N      N     Y     Y     N      N     good  good
##  5 100000 N      Y     N     Y      N     N     Y     N      N     good  good
##  6  96000 N      N     N     N      N     N     Y     N      N     good  good
##  7 130500 N      N     N     Y      N     N     Y     N      N     good  good
##  8 120000 N      N     N     Y      N     N     Y     N      N     good  good
##  9 189900 N      N     N     Y      N     N     Y     N      N     good  good
## 10  99000 N      N     N     Y      N     N     Y     N      N     good  good
## # … with 15,555 more rows, and 17 more variables: ODORA <chr>, STRNA <chr>,
## #   ZINC2 <dbl>, PER <dbl>, ZADULT <dbl>, HHGRAD <chr>, NUNITS <dbl>,
## #   INTW <dbl>, METRO <chr>, STATE <chr>, LPRICE <dbl>, BATHS <dbl>,
## #   BEDRMS <dbl>, MATBUY <chr>, DWNPAY <chr>, VALUE <dbl>, FRSTHO <chr>
```

## 2.2 Cleaning it up

Much of the data is stored as character columns, even though the variables look binary. Let's fix that.

First, I’ll make sure that is really what is happening.

```r
# I could look at every observation, but that would be time consuming.
# Instead, I'll look at each column and see how many values it takes.
levels(as.factor(homes$EABAN))
## [1] "N" "Y"
# So EABAN only takes the values "Y" and "N". Look at another variable.
levels(as.factor(homes$STATE))
##  [1] "CA" "CO" "CT" "GA" "IL" "IN" "LA" "MO" "OH" "OK" "PA" "TX" "WA"
# Other variables take many levels, so we want to find the columns that
# only take two values.
```
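One way to do this (a sketch, assuming `homes` is the tibble loaded above; the helper names are my own):

```r
# Count the distinct values each column takes.
n_vals <- sapply(homes, function(x) length(unique(x)))
names(homes)[n_vals == 2]

# Safer than converting every two-value column: only convert columns whose
# entries are literally "Y"/"N" (e.g. HOWH takes "good"/"bad" instead).
yn_cols <- names(homes)[sapply(homes, function(x) all(x %in% c("Y", "N")))]
homes <- homes %>%
  mutate(across(all_of(yn_cols), ~ .x == "Y"))
```

After this, the Y/N columns are logical (`TRUE`/`FALSE`), which `glm` handles directly.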

## 3.3 Q3 - Logit

Make a binary variable indicating whether or not buyers had at least a 20% down payment (i.e. the mortgage value is less than 80% of the price). Fit a logit predicting this binary from all variables except the mortgage and the price. Then fit a logit using those variables interacted (once) with each other. (Hint: `y ~ .^2` will interact everything, and parentheses may help.) (Warning: this may take a while; roughly 2 minutes on my laptop.)

How many more coefficients does the second model have? What are the $R^2$ values of each model (hint: the model output stores the deviance and the null deviance)? Which model would you prefer for predictions at this stage? (1 sentence)
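A sketch of one approach. The column names `AMMORT` (mortgage) and `LPRICE` (price) come from the data above; the variable name `gt20dwn` and the deviance-based $R^2$ helper are my own choices:

```r
# At least 20% down: the mortgage is less than 80% of the price.
homes$gt20dwn <- homes$AMMORT < 0.8 * homes$LPRICE

# Logit on all variables except mortgage and price.
fit1 <- glm(gt20dwn ~ . - AMMORT - LPRICE, data = homes, family = "binomial")

# Logit with all pairwise interactions (this is the slow one).
fit2 <- glm(gt20dwn ~ (. - AMMORT - LPRICE)^2, data = homes, family = "binomial")

# Deviance-based R^2, using the stored deviance and null deviance.
dev_r2 <- function(fit) 1 - fit$deviance / fit$null.deviance
dev_r2(fit1)
dev_r2(fit2)

# How many more coefficients the interacted model has.
length(coef(fit2)) - length(coef(fit1))
```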

## 3.4 Q4 - Out of Sample A

Estimate the model from Q1 using only the data where ETRANS is TRUE. Then test how well that model performs by making predictions for the data where ETRANS is FALSE. Show the out-of-sample fitted-vs-actual outcome plot (hint: it may help to add both a 45-degree line and the best-fit line). Describe what happened here (max 2 sentences; the variable codebook may help you).
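The Q1 model is not shown in this section, so the sketch below uses a placeholder regression of log price on the other variables; swap in your actual Q1 specification. It also assumes ETRANS has been converted to a logical as in the cleaning step:

```r
# Split on ETRANS: fit where TRUE, predict where FALSE.
fit_q1 <- glm(log(LPRICE) ~ . - LPRICE, data = homes[homes$ETRANS, ])
preds  <- predict(fit_q1, newdata = homes[!homes$ETRANS, ])

# Out-of-sample fitted vs actual, with a 45-degree line and the best-fit line.
actual <- log(homes$LPRICE[!homes$ETRANS])
plot(preds, actual)
abline(0, 1, col = "red")              # 45-degree line
abline(lm(actual ~ preds), col = "blue")  # best-fit line
```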

## 3.5 Q5 - Out of Sample B

Randomly select a holdout sample of 1000 observations (hint: the `sample` function). Fit both models from Q3 again using the remaining observations (hint: `homes[-indices,]` will give `homes` without the observations indexed by the vector `indices`). Make predictions for the holdout sample using each model. Calculate the prediction error for each observation in the holdout sample. What is the out-of-sample mean squared error for each model? Which model would you prefer at this stage?
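A self-contained sketch, reusing the `gt20dwn` construction and formulas from Q3 (the seed and object names are my own):

```r
set.seed(101)  # so the random holdout is reproducible

# Binary outcome from Q3.
homes$gt20dwn <- homes$AMMORT < 0.8 * homes$LPRICE

# Hold out 1000 random observations; fit on the rest.
indices <- sample(nrow(homes), 1000)
train <- homes[-indices, ]
test  <- homes[indices, ]

fit1 <- glm(gt20dwn ~ . - AMMORT - LPRICE, data = train, family = "binomial")
fit2 <- glm(gt20dwn ~ (. - AMMORT - LPRICE)^2, data = train, family = "binomial")

# Predicted probabilities for the holdout, then per-observation errors.
p1 <- predict(fit1, newdata = test, type = "response")
p2 <- predict(fit2, newdata = test, type = "response")
mean((test$gt20dwn - p1)^2)  # out-of-sample MSE, model 1
mean((test$gt20dwn - p2)^2)  # out-of-sample MSE, model 2
```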

# 4 Submission

As before, submit on Canvas in groups. The due date is Wednesday, April 14th, at midnight. Solutions will be discussed in class on April 15th.

# 5 Optional Exercises

1. Use a random holdout sample for Q4. How does this change your results?
2. Instead of selecting variables using FDR in Q2, install the ‘glmnet’ package and run a LASSO. How many variables do you drop?
3. Calculate the out-of-sample deviance for each model in Q5. Which is better now?