We have a dataset consisting of a lot of descriptive information about a number of home sales. We’re going to focus on predicting sale prices and whether or not the down payment is at least 20%.
The data is available at https://codowd.com/bigdata/hw/hw2/homes2004.csv.
There is a “codebook”, which describes all the variables, posted at https://codowd.com/bigdata/hw/hw2/homes2004code.txt.
library(tidyverse)
homes = read_csv("https://codowd.com/bigdata/hw/hw2/homes2004.csv")
homes
## # A tibble: 15,565 x 29
## AMMORT EAPTBL ECOM1 ECOM2 EGREEN EJUNK ELOW1 ESFD ETRANS EABAN HOWH HOWN
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 50000 N N N Y N N Y N N good good
## 2 70000 N N N N N N Y N N good bad
## 3 117000 N N N N N N Y N N good good
## 4 100000 N N N N N Y Y N N good good
## 5 100000 N Y N Y N N Y N N good good
## 6 96000 N N N N N N Y N N good good
## 7 130500 N N N Y N N Y N N good good
## 8 120000 N N N Y N N Y N N good good
## 9 189900 N N N Y N N Y N N good good
## 10 99000 N N N Y N N Y N N good good
## # … with 15,555 more rows, and 17 more variables: ODORA <chr>, STRNA <chr>,
## # ZINC2 <dbl>, PER <dbl>, ZADULT <dbl>, HHGRAD <chr>, NUNITS <dbl>,
## # INTW <dbl>, METRO <chr>, STATE <chr>, LPRICE <dbl>, BATHS <dbl>,
## # BEDRMS <dbl>, MATBUY <chr>, DWNPAY <chr>, VALUE <dbl>, FRSTHO <chr>
Lots of the data is stored as characters, even though they look like binary variables. Lets fix that.
First, I’ll make sure that is really what is happening.
# I could look at every observation, but that will be time consuming.
# Instead I'll look at each column, and see how many values it takes.
levels(as.factor(homes$EABAN))
## [1] "N" "Y"
# So EABAN only takes values Y and N.
# Look at another Variable.
levels(as.factor(homes$STATE))
## [1] "CA" "CO" "CT" "GA" "IL" "IN" "LA" "MO" "OH" "OK" "PA" "TX" "WA"
# Other variables take many levels. So we want to see which columns only take two values.
length(levels(as.factor(homes$BATHS)))
## [1] 8
#Baths takes a lot of values. So it isn't binary.
# But typing things out and running this code for every column is also time consuming.
# Let's do it all at once using "apply".
#First we build a function that does the above. It takes a variable "x", and finds how many values that variable takes.
apply_helper = function(x) length(levels(as.factor(x)))
# As an aside, the following would also work.
apply_helper = function(x) length(unique(x))
#Tell apply to look at the data frame "homes". Then tell it to look at each column (rather than each row) (this is the "2"). Then run our function on each column of the dataframe
apply(homes,2,apply_helper)
## AMMORT EAPTBL ECOM1 ECOM2 EGREEN EJUNK ELOW1 ESFD ETRANS EABAN HOWH
## 1739 2 2 2 2 2 2 2 2 2 2
## HOWN ODORA STRNA ZINC2 PER ZADULT HHGRAD NUNITS INTW METRO STATE
## 2 2 2 3818 13 10 5 73 20 2 13
## LPRICE BATHS BEDRMS MATBUY DWNPAY VALUE FRSTHO
## 1347 8 9 2 2 697 2
So we can conclude that for a sizeable number of variables, there are only two values, “Y” and “N”. Let’s convert them into Binary’s that we can interpret easily.
There are more complicated and “prettier” ways to do this, but sometimes simplicity is its own value.
for (i in 1:ncol(homes)) { #For each column
uniques = unique(homes[[i]]) #Find the unique values
uniques = sort(uniques) #Sort the values (so "N" is before "Y")
#If there are too many unique values, move to the next column
if (length(uniques) != 2) next
#If the 2 unique values are correct, (match: "N","Y")
if (uniques[1] == "N" & uniques[2] == "Y") {
homes[[i]] = (homes[[i]] == "Y") #Replace with a binary.
} else { #Otherwise
print(i) #Print the column number
print(uniques) #And the values
}
}
## [1] 11
## [1] "bad" "good"
## [1] 12
## [1] "bad" "good"
## [1] 21
## [1] "rural" "urban"
## [1] 27
## [1] "other" "prev home"
Okay, lets look at the misbehaving columns
colnames(homes)[c(11,12,21,27)]
## [1] "HOWH" "HOWN" "METRO" "DWNPAY"
They seem to be a mix of things, which aren’t all that cleanly converted into binaries. There are still a few other oddball variables around. In particular, we have a few character vectors with more than 2 values (e.g. State). Lets convert all the characters to factors (they all take a limited number of values anyhow).
# In the homes data, look "across" columns and change ("mutate") the ones that are characters (is.character) into factors ("as.factor")
homes = homes %>% mutate(across(where(is.character),as.factor))
Regress log(price) against all the variables except mortgage and ETRANS (hint: y~. -VarNAME
will regress on everything except VarNAME). What is the \(R^2\)? How many coefficients are there?
Rerun the regression with just those variables that are significant at a 5% FDR (hint: summary(mod1)$coefficients[,4]
may be helpful assuming your regression in Q1 was named mod1
). If a factor has some significant levels, keep the entire factor in. What is the new \(R^2\)? What happened with the \(R^2\)? Why? (2 sentences total)
Make a binary variable indicating whether or not buyers had at least a 20% down payment (i.e. the mortgage value is less than 80% of the price). Fit a logit to predict this binary using all variables except mortgage, price. Fit a logit using the variables interacted (once) with eachother. (Hint: y~ .^2
will interact everything, and parenthesis may help) (warning: this may take a while. ~2 minutes on my laptop).
How many more coefficients does the second model have? What are the \(R^2\) values of each model (hint: the model output stores deviance and null deviance)? Which model would you prefer for predictions at this stage? (1 sentence)
Estimate the model in Q1 using only data where ETRANS is TRUE. Then test how well that model performs by making predictions for the data where ETRANS is FALSE. Show the out-of-sample fitted vs real outcome plot (hint: it may be helpful to add both a 45 degree line, and the best fit line). Describe what happened here (max 2 sentences, variable codebook may help you).
Randomly select a holdout sample of 1000 observations (hint: the sample
function). Fit both models from Q3 again using the remaining observations (hint: homes[-indices,]
will give homes
but without the observations indexed by the vector indices
). Make predictions for the holdout sample using each model. Calculate the prediction error for each observation in the holdout sample. What are the mean squared errors for each of these models out of sample? Which model would you prefer at this stage?
As before, submit on canvas in groups. Due Date is Wednesday April 14th at midnight. Solutions will be discussed in class on April 15th.