1 Introduction

We have a dataset consisting of a lot of descriptive information about a number of home sales. We’re going to focus on predicting sale prices and whether or not the down payment is at least 20%.

2 Data

The data is available at https://codowd.com/bigdata/hw/hw2/homes2004.csv.

There is a “codebook”, which describes all the variables, posted at https://codowd.com/bigdata/hw/hw2/homes2004code.txt.

2.1 Importing Data

library(tidyverse)
homes = read_csv("https://codowd.com/bigdata/hw/hw2/homes2004.csv")
homes
## # A tibble: 15,565 x 29
##    AMMORT EAPTBL ECOM1 ECOM2 EGREEN EJUNK ELOW1 ESFD  ETRANS EABAN HOWH  HOWN 
##     <dbl> <chr>  <chr> <chr> <chr>  <chr> <chr> <chr> <chr>  <chr> <chr> <chr>
##  1  50000 N      N     N     Y      N     N     Y     N      N     good  good 
##  2  70000 N      N     N     N      N     N     Y     N      N     good  bad  
##  3 117000 N      N     N     N      N     N     Y     N      N     good  good 
##  4 100000 N      N     N     N      N     Y     Y     N      N     good  good 
##  5 100000 N      Y     N     Y      N     N     Y     N      N     good  good 
##  6  96000 N      N     N     N      N     N     Y     N      N     good  good 
##  7 130500 N      N     N     Y      N     N     Y     N      N     good  good 
##  8 120000 N      N     N     Y      N     N     Y     N      N     good  good 
##  9 189900 N      N     N     Y      N     N     Y     N      N     good  good 
## 10  99000 N      N     N     Y      N     N     Y     N      N     good  good 
## # … with 15,555 more rows, and 17 more variables: ODORA <chr>, STRNA <chr>,
## #   ZINC2 <dbl>, PER <dbl>, ZADULT <dbl>, HHGRAD <chr>, NUNITS <dbl>,
## #   INTW <dbl>, METRO <chr>, STATE <chr>, LPRICE <dbl>, BATHS <dbl>,
## #   BEDRMS <dbl>, MATBUY <chr>, DWNPAY <chr>, VALUE <dbl>, FRSTHO <chr>

2.2 Cleaning it up

Lots of the data is stored as characters, even though they look like binary variables. Lets fix that.

First, I’ll make sure that is really what is happening.

# I could look at every observation, but that will be time consuming.
# Instead I'll look at each column, and see how many values it takes.
levels(as.factor(homes$EABAN))
## [1] "N" "Y"
# So EABAN only takes values Y and N. 
# Look at another Variable.
levels(as.factor(homes$STATE))
##  [1] "CA" "CO" "CT" "GA" "IL" "IN" "LA" "MO" "OH" "OK" "PA" "TX" "WA"
# Other variables take many levels. So we want to see which columns only take two values.
length(levels(as.factor(homes$BATHS)))
## [1] 8
#Baths takes a lot of values. So it isn't binary.

# But typing things out and running this code for every column is also time consuming. 

# Let's do it all at once using "apply".
#First we build a function that does the above. It takes a variable "x", and finds how many values that variable takes.
apply_helper = function(x) length(levels(as.factor(x)))
# As an aside, the following would also work. 
apply_helper = function(x) length(unique(x))
#Tell apply to look at the data frame "homes". Then tell it to look at each column (rather than each row) (this is the "2"). Then run our function on each column of the dataframe 
apply(homes,2,apply_helper)
## AMMORT EAPTBL  ECOM1  ECOM2 EGREEN  EJUNK  ELOW1   ESFD ETRANS  EABAN   HOWH 
##   1739      2      2      2      2      2      2      2      2      2      2 
##   HOWN  ODORA  STRNA  ZINC2    PER ZADULT HHGRAD NUNITS   INTW  METRO  STATE 
##      2      2      2   3818     13     10      5     73     20      2     13 
## LPRICE  BATHS BEDRMS MATBUY DWNPAY  VALUE FRSTHO 
##   1347      8      9      2      2    697      2

So we can conclude that for a sizeable number of variables, there are only two values, “Y” and “N”. Let’s convert them into Binary’s that we can interpret easily.

There are more complicated and “prettier” ways to do this, but sometimes simplicity is its own value.

for (i in 1:ncol(homes)) { #For each column
  uniques = unique(homes[[i]]) #Find the unique values
  uniques = sort(uniques) #Sort the values (so "N" is before "Y")
  #If there are too many unique values, move to the next column
  if (length(uniques) != 2) next 
  #If the 2 unique values are correct, (match: "N","Y")
  if (uniques[1] == "N" & uniques[2] == "Y") { 
    homes[[i]] = (homes[[i]] == "Y") #Replace with a binary.
  } else { #Otherwise 
    print(i) #Print the column number
    print(uniques) #And the values
  }
}
## [1] 11
## [1] "bad"  "good"
## [1] 12
## [1] "bad"  "good"
## [1] 21
## [1] "rural" "urban"
## [1] 27
## [1] "other"     "prev home"

Okay, lets look at the misbehaving columns

colnames(homes)[c(11,12,21,27)]
## [1] "HOWH"   "HOWN"   "METRO"  "DWNPAY"

They seem to be a mix of things, which aren’t all that cleanly converted into binaries. There are still a few other oddball variables around. In particular, we have a few character vectors with more than 2 values (e.g. State). Lets convert all the characters to factors (they all take a limited number of values anyhow).

# In the homes data, look "across" columns and change ("mutate") the ones that are characters (is.character) into factors ("as.factor")
homes = homes %>% mutate(across(where(is.character),as.factor))

3 Questions

3.1 Q1 - Regression

Regress log(price) against all the variables except mortgage and ETRANS (hint: y~. -VarNAME will regress on everything except VarNAME). What is the \(R^2\)? How many coefficients are there?

3.2 Q2 - FDR, variable selection

Rerun the regression with just those variables that are significant at a 5% FDR (hint: summary(mod1)$coefficients[,4] may be helpful assuming your regression in Q1 was named mod1). If a factor has some significant levels, keep the entire factor in. What is the new \(R^2\)? What happened with the \(R^2\)? Why? (2 sentences total)

3.3 Q3 - Logit

Make a binary variable indicating whether or not buyers had at least a 20% down payment (i.e. the mortgage value is less than 80% of the price). Fit a logit to predict this binary using all variables except mortgage, price. Fit a logit using the variables interacted (once) with eachother. (Hint: y~ .^2 will interact everything, and parenthesis may help) (warning: this may take a while. ~2 minutes on my laptop).

How many more coefficients does the second model have? What are the \(R^2\) values of each model (hint: the model output stores deviance and null deviance)? Which model would you prefer for predictions at this stage? (1 sentence)

3.4 Q4 - Out of Sample A

Estimate the model in Q1 using only data where ETRANS is TRUE. Then test how well that model performs by making predictions for the data where ETRANS is FALSE. Show the out-of-sample fitted vs real outcome plot (hint: it may be helpful to add both a 45 degree line, and the best fit line). Describe what happened here (max 2 sentences, variable codebook may help you).

3.5 Q5 - Out of Sample B

Randomly select a holdout sample of 1000 observations (hint: the sample function). Fit both models from Q3 again using the remaining observations (hint: homes[-indices,] will give homes but without the observations indexed by the vector indices). Make predictions for the holdout sample using each model. Calculate the prediction error for each observation in the holdout sample. What are the mean squared errors for each of these models out of sample? Which model would you prefer at this stage?

4 Submission

As before, submit on canvas in groups. Due Date is Wednesday April 14th at midnight. Solutions will be discussed in class on April 15th.

5 Optional Exercises

  1. Use a random holdout sample for Q4. How does this change your results?
  2. Instead of selecting variables using FDR in Q2, install the ‘glmnet’ package and run a LASSO. How many variables do you drop?
  3. Calculate the out-of-sample deviance for each model in Q5. Which is better now?