1 Introduction

We have a dataset with extensive descriptive information about a large number of home sales. We will focus on two targets: predicting sale prices, and predicting whether or not the down payment is at least 20%.

2 Data

The data is available at https://codowd.com/bigdata/hw/hw2/homes2004.csv.

There is a “codebook”, which describes all the variables, posted at https://codowd.com/bigdata/hw/hw2/homes2004code.txt.

2.1 Importing Data

library(tidyverse)
homes = read_csv("https://codowd.com/bigdata/hw/hw2/homes2004.csv")
homes
## # A tibble: 15,565 x 29
##    AMMORT EAPTBL ECOM1 ECOM2 EGREEN EJUNK ELOW1 ESFD  ETRANS EABAN HOWH  HOWN 
##     <dbl> <chr>  <chr> <chr> <chr>  <chr> <chr> <chr> <chr>  <chr> <chr> <chr>
##  1  50000 N      N     N     Y      N     N     Y     N      N     good  good 
##  2  70000 N      N     N     N      N     N     Y     N      N     good  bad  
##  3 117000 N      N     N     N      N     N     Y     N      N     good  good 
##  4 100000 N      N     N     N      N     Y     Y     N      N     good  good 
##  5 100000 N      Y     N     Y      N     N     Y     N      N     good  good 
##  6  96000 N      N     N     N      N     N     Y     N      N     good  good 
##  7 130500 N      N     N     Y      N     N     Y     N      N     good  good 
##  8 120000 N      N     N     Y      N     N     Y     N      N     good  good 
##  9 189900 N      N     N     Y      N     N     Y     N      N     good  good 
## 10  99000 N      N     N     Y      N     N     Y     N      N     good  good 
## # … with 15,555 more rows, and 17 more variables: ODORA <chr>, STRNA <chr>,
## #   ZINC2 <dbl>, PER <dbl>, ZADULT <dbl>, HHGRAD <chr>, NUNITS <dbl>,
## #   INTW <dbl>, METRO <chr>, STATE <chr>, LPRICE <dbl>, BATHS <dbl>,
## #   BEDRMS <dbl>, MATBUY <chr>, DWNPAY <chr>, VALUE <dbl>, FRSTHO <chr>

2.2 Cleaning it up

Many of the columns are stored as characters, even though they look like binary variables. Let's fix that.

First, I’ll make sure that is really what is happening.

# I could look at every observation, but that will be time consuming.
# Instead I'll look at each column, and see how many values it takes.
levels(as.factor(homes$EABAN))
## [1] "N" "Y"
# So EABAN only takes values Y and N. 
# Look at another variable.
levels(as.factor(homes$STATE))
##  [1] "CA" "CO" "CT" "GA" "IL" "IN" "LA" "MO" "OH" "OK" "PA" "TX" "WA"
# Other variables take many levels. So we want to see which columns only take two values.
length(levels(as.factor(homes$BATHS)))
## [1] 8
#Baths takes a lot of values. So it isn't binary.

# But typing things out and running this code for every column is also time consuming. 

# Let's do it all at once using "apply".
#First we build a function that does the above. It takes a variable "x", and finds how many values that variable takes.
apply_helper = function(x) length(levels(as.factor(x)))
# As an aside, the following would also work. 
apply_helper = function(x) length(unique(x))
# Tell apply to work over the data frame "homes", column by column (that's what the "2" means; 1 would be rows), running our helper function on each column.
apply(homes,2,apply_helper)
## AMMORT EAPTBL  ECOM1  ECOM2 EGREEN  EJUNK  ELOW1   ESFD ETRANS  EABAN   HOWH 
##   1739      2      2      2      2      2      2      2      2      2      2 
##   HOWN  ODORA  STRNA  ZINC2    PER ZADULT HHGRAD NUNITS   INTW  METRO  STATE 
##      2      2      2   3818     13     10      5     73     20      2     13 
## LPRICE  BATHS BEDRMS MATBUY DWNPAY  VALUE FRSTHO 
##   1347      8      9      2      2    697      2

So we can conclude that a sizeable number of variables take only two values, mostly "Y" and "N". Let's convert those into logical (binary) variables that we can interpret easily.

There are more complicated and “prettier” ways to do this, but sometimes simplicity is its own value.
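
For reference, here is a hedged sketch of one such "prettier" approach, using across() to convert every character column whose values are all "Y"/"N" in a single step (the selection logic and the homes_alt name are my own, not from the assignment). I'll still use the explicit loop below, since it also flags the columns that don't fit the pattern.

# Sketch only: convert every all-"Y"/"N" character column to a logical in one shot.
homes_alt = homes %>%
  mutate(across(
    where(function(x) is.character(x) && all(x %in% c("Y","N"))),
    function(x) x == "Y"
  ))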

for (i in 1:ncol(homes)) { #For each column
  uniques = unique(homes[[i]]) #Find the unique values
  uniques = sort(uniques) #Sort the values (so "N" is before "Y")
  #If there are too many unique values, move to the next column
  if (length(uniques) != 2) next 
  #If the 2 unique values are correct, (match: "N","Y")
  if (uniques[1] == "N" & uniques[2] == "Y") { 
    homes[[i]] = (homes[[i]] == "Y") #Replace with a binary.
  } else { #Otherwise 
    print(i) #Print the column number
    print(uniques) #And the values
  }
}
## [1] 11
## [1] "bad"  "good"
## [1] 12
## [1] "bad"  "good"
## [1] 21
## [1] "rural" "urban"
## [1] 27
## [1] "other"     "prev home"

Okay, let's look at the misbehaving columns.

colnames(homes)[c(11,12,21,27)]
## [1] "HOWH"   "HOWN"   "METRO"  "DWNPAY"

They seem to be a mix of things which aren't all that cleanly converted into binaries. There are still a few other oddball variables around; in particular, we have a few character vectors with more than 2 values (e.g. STATE). Let's convert all the characters to factors (they all take a limited number of values anyhow).

# In the homes data, look "across" columns and change ("mutate") the ones that are characters (is.character) into factors ("as.factor")
homes = homes %>% mutate(across(where(is.character),as.factor))
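
As a quick sanity check (my own aside, not part of the assignment), no column should be left as plain character after this:

# Count column classes: everything should now be numeric, logical, or factor.
table(sapply(homes, class))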

3 Questions

3.1 Q1 - Regression

Regress log(price) against all the variables except mortgage and ETRANS (hint: y~. -VarNAME will regress on everything except VarNAME). What is the \(R^2\)? How many coefficients are there?

mod1 = glm(log(LPRICE)~.-AMMORT-ETRANS,data=homes)
1-mod1$deviance/mod1$null.deviance
## [1] 0.4472979
length(coef(mod1)) #number of coefs
## [1] 41
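
As an aside on the coefficient count (my own note, not asked for): every factor expands into one dummy coefficient per non-reference level, which is why there are more coefficients than columns. STATE alone, with its 13 levels, contributes 12 of them.

# One dummy per non-reference level of STATE.
nlevels(homes$STATE) - 1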

3.2 Q2 - FDR, variable selection

Rerun the regression with just those variables that are significant at a 5% FDR (hint: summary(mod1)$coefficients[,4] may be helpful, assuming your regression in Q1 was named mod1). If a factor has some significant levels, keep the entire factor in. What is the new \(R^2\)? What happened to the \(R^2\), and why? (2 sentences total)
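
As a reminder, the rule the fdr_cut function below implements is the Benjamini-Hochberg cutoff: with \(p\) hypotheses and sorted p-values \(p_{(1)} \le \dots \le p_{(p)}\), the cutoff is \(\alpha^* = \max\{p_{(k)} : p_{(k)} \le qk/p\}\), and we keep every variable whose p-value is at or below \(\alpha^*\).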

#Find cut first.
fdr_cut = function(pvals, q){
  pvals = pvals[!is.na(pvals)]   # Ditch NAs
  p = length(pvals) #Count hypotheses
  k = rank(pvals, ties.method="min") #Rank pvals
  alpha = max(pvals[ pvals <= (q*k/p) ]) #Find relevant pval
  alpha #Return
}
smod = summary(mod1)
pvals = smod$coefficients[,4]
alpha_star = fdr_cut(pvals,0.05)
indices = pvals <=  alpha_star
#Look at variables
which(!indices)
##  NUNITS STATECO STATECT  BEDRMS 
##      21      24      25      37
#Can't ditch state -- it takes many (significant) values.
mod2 = glm(log(LPRICE)~.-AMMORT-ETRANS-NUNITS-BEDRMS,data=homes)
1-mod2$deviance/mod2$null.deviance
## [1] 0.4471949

\(R^2\) fell, but only very slightly: we removed two variables that the fit was using (however weakly) to explain in-sample variation, and in-sample \(R^2\) can never increase when predictors are dropped.

Some people may have ditched STATE entirely as well, giving the following, larger fall in \(R^2\).

mod2 = glm(log(LPRICE)~.-AMMORT-ETRANS-NUNITS-BEDRMS-STATE,data=homes)
1-mod2$deviance/mod2$null.deviance
## [1] 0.4237065

It is even possible some people ditched just the CO and CT levels of STATE. The code for that is a bit fiddly, so I won't walk through it in detail, but a sketch of one approach follows. People who managed it and found a slightly reduced \(R^2\): well done.
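
Here is a hedged sketch of one way to do it, under the interpretation that dropping the STATECO and STATECT coefficients amounts to treating CO and CT the same as the reference level (CA). The forcats::fct_collapse call and the homes_coct / mod2b names are my own additions, not part of the assignment.

# Sketch: merge CO and CT into the reference level ("CA"), then refit without NUNITS and BEDRMS.
homes_coct = homes %>% mutate(STATE = forcats::fct_collapse(STATE, CA = c("CA","CO","CT")))
mod2b = glm(log(LPRICE)~.-AMMORT-ETRANS-NUNITS-BEDRMS,data=homes_coct)
1-mod2b$deviance/mod2b$null.deviance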

3.3 Q3 - Logit

Make a binary variable indicating whether or not buyers had at least a 20% down payment (i.e. the mortgage value is less than 80% of the price). Fit a logit to predict this binary using all variables except mortgage and price. Fit a logit using the variables interacted (once) with each other. (Hint: y~ .^2 will interact everything, and parentheses may help.) (Warning: this may take a while, ~2 minutes on my laptop.)

How many more coefficients does the second model have? What are the \(R^2\) values of each model (hint: the model output stores deviance and null deviance)? Which model would you prefer for predictions at this stage? (1 sentence)

homes$down20 = (homes$AMMORT/homes$LPRICE) <= 0.8
mod3 = glm(down20 ~ .-AMMORT-LPRICE,data=homes,family=binomial)
mod4 = glm(down20 ~ (.-AMMORT-LPRICE)^2,data=homes,family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# How many more coefficients does the second model have? 
length(coef(mod4))-length(coef(mod3))
## [1] 748
#R^2 values
# Mod3 (not interacted)
1-mod3$deviance/mod3$null.deviance
## [1] 0.1038013
# Mod4 (interacted)
1-mod4$deviance/mod4$null.deviance
## [1] 0.1619313

Looks like model 4 (interacted) has a much better in-sample fit, so at this stage it is the one I'd prefer for predictions. (The warning about fitted probabilities of 0 or 1 hints that the interacted model separates some observations perfectly, an early sign it may be overfitting.)

3.4 Q4 - Out of Sample A

Estimate the model in Q1 using only data where ETRANS is TRUE. Then test how well that model performs by making predictions for the data where ETRANS is FALSE. Show the out-of-sample fitted vs real outcome plot (hint: it may be helpful to add both a 45 degree line, and the best fit line). Describe what happened here (max 2 sentences, variable codebook may help you).

mod5 = glm(log(LPRICE)~.-AMMORT-ETRANS,data=homes %>% filter(ETRANS))
preds = predict(mod5,newdata = homes %>% filter(!ETRANS))
real = log(homes$LPRICE[!homes$ETRANS])
df = data.frame(preds=preds,real=real)
ggplot(df,aes(x=preds,y=real))+
  geom_point(alpha=0.02,size=0.1)+
  geom_abline(col=2)+
  geom_smooth(method="lm",col="blue")
## `geom_smooth()` using formula 'y ~ x'

Just eyeballing the plot against the 45 degree line, the fit looks pretty good, but once we add the best-fit line between predictions and realized values, we see a substantial slope difference, indicating the model is biased out of sample. This is sensible, as the training and test samples differ fundamentally in some attributes (most notably, transit proximity).
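
To put a hedged number on that slope claim (this check is my own, not part of the question), we can regress the realized log prices on the out-of-sample predictions; a slope noticeably below 1 reflects the bias described above.

# Intercept and slope of the blue best-fit line in the plot above.
coef(lm(real~preds,data=df))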

3.5 Q5 - Out of Sample B

Randomly select a holdout sample of 1000 observations (hint: the sample function). Fit both models from Q3 again using the remaining observations (hint: homes[-indices,] will give homes but without the observations indexed by the vector indices). Make predictions for the holdout sample using each model. Calculate the prediction error for each observation in the holdout sample. What are the mean squared errors for each of these models out of sample? Which model would you prefer at this stage?

#Build holdout
holdout = sample.int(nrow(homes),1000,replace=F)
#Fit models
mod3a = glm(down20 ~ .-AMMORT-LPRICE,data=homes[-holdout,],family=binomial)
mod4a = glm(down20 ~ (.-AMMORT-LPRICE)^2,data=homes[-holdout,],family=binomial)
#Make Predictions on holdout
preds3a = predict(mod3a,newdata=homes[holdout,],type="response")
preds4a = predict(mod4a,newdata=homes[holdout,],type="response")
#Calculate errors
err3a = preds3a-as.numeric(homes$down20[holdout])
err4a = preds4a-as.numeric(homes$down20[holdout])

#Mean Squared Errors
mean(err3a^2)
## [1] 0.1777864
mean(err4a^2)
## [1] 0.1869091

Looks like the smaller (non-interacted) model has the lower out-of-sample MSE, so it's the one I'd prefer at this stage.

Commentary: I’d prefer it more if we used the out-of-sample deviance for this comparison rather than the MSE. Does doing so change our conclusions?
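
Here is a minimal sketch of that check; the dev_oos helper and the probability clamping are my own additions. The out-of-sample binomial deviance is \(-2\sum_i \left[y_i \log(\hat p_i) + (1-y_i)\log(1-\hat p_i)\right]\), evaluated on the holdout predictions.

# Out-of-sample binomial deviance; clamp predicted probabilities to avoid log(0).
dev_oos = function(y, p) {
  p = pmin(pmax(p, 1e-12), 1 - 1e-12)
  -2*sum(y*log(p) + (1-y)*log(1-p))
}
y_hold = as.numeric(homes$down20[holdout])
dev_oos(y_hold, preds3a) # non-interacted model
dev_oos(y_hold, preds4a) # interacted model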

4 Submission

As before, submit on Canvas in groups. The due date is Wednesday, April 14th, at midnight. Solutions will be discussed in class on April 15th.

5 Optional Exercises

  1. Use a random holdout sample for Q4.
modA1 = glm(log(LPRICE)~.-AMMORT-ETRANS,data=homes[-holdout,])
preds = predict(modA1,newdata = homes[holdout,])
real = log(homes$LPRICE[holdout])
df = data.frame(preds=preds,real=real)
ggplot(df,aes(x=preds,y=real))+
  geom_point(alpha=0.5,size=0.5)+
  geom_abline(col=2)+
  geom_smooth(method="lm",col="blue")
## `geom_smooth()` using formula 'y ~ x'