This homework focuses on ensemble models. But first, a tree. To do this, you will want to install the following packages:

install.packages(c("rpart","ranger"))
library(rpart)
library(ranger)

1 Setup

Load the homes data from HW 3 and apply the same cleaning routine from the start of HW 3. You should wind up with a variable called ‘twotwo’, as well as a bunch of other cleaned variables.

2 Questions

2.1 Q1 - Tree

Using the rpart package (and the same-named function), build a tree to predict twotwo using all the data in homes. Then add the argument cp=0 and build another tree. Plot that tree (plot(mod) will do the trick).

In two sentences or less, describe why this tree is shaped the way it is. (hint: think about what we are predicting)
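One possible sketch of this question, assuming the cleaned `homes` data frame and its `twotwo` variable from the setup (the `twotwo ~ .` formula is an assumption about how you choose to use "all the data"):

```r
library(rpart)

# Default tree: rpart's complexity penalty (cp = 0.01) prunes aggressively.
tree_default <- rpart(twotwo ~ ., data = homes)

# Setting cp = 0 removes that penalty, so the tree grows much deeper.
tree_full <- rpart(twotwo ~ ., data = homes, cp = 0)

plot(tree_full)
```

The shape of the plotted tree is what the two-sentence answer should address.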

2.2 Q2 - Bagging

Steps:

  1. Set aside a 20% holdout sample.
set.seed(14432)
ind = sample(nrow(homes),0.2*nrow(homes))
holdout = homes[ind,]
train = homes[-ind,]
  2. Follow the code in Lecture 10 to create 20 resampled datasets and build trees predicting log(LPRICE) with them. (hint: you will need to change the formula used by rpart inside my resampled_mod function, as well as the data used by the functions - use the training data.)
  3. Make predictions on the holdout sample for each of the 20 tree models. (pred_helper = function(x,xdata=holdout) predict(x,newdata=xdata) may help.)
  4. Average those predictions to make a ‘bagged model’ prediction. (rowMeans may help you here.)
  5. Find the errors for each prediction. truth = matrix(rep(log(holdout$LPRICE),20),ncol=20,byrow=F) may help with the tree models.
  6. Find the MSE for each of the tree models as well as the bagged model.

What fraction of the tree models does the bagged model outperform? Give a two-sentence explanation of that performance, and a one-sentence explanation of what the truth = matrix(...) line of code did.
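The steps above can be sketched as follows, assuming the `train` and `holdout` splits from step 1. The body of `resampled_mod` is an assumption; your version should follow the Lecture 10 code:

```r
library(rpart)

B <- 20

# Build one tree on a bootstrap resample of the training data
# (sketch of resampled_mod; adapt the Lecture 10 version).
resampled_mod <- function(i) {
  idx <- sample(nrow(train), nrow(train), replace = TRUE)
  rpart(log(LPRICE) ~ ., data = train[idx, ])
}
trees <- lapply(1:B, resampled_mod)

# Holdout predictions: one column per tree.
pred_helper <- function(x, xdata = holdout) predict(x, newdata = xdata)
preds <- sapply(trees, pred_helper)

# Bagged prediction = row-wise average across the 20 trees.
bagged <- rowMeans(preds)

# Errors and MSEs: per-tree and bagged.
truth <- matrix(rep(log(holdout$LPRICE), B), ncol = B, byrow = FALSE)
tree_mse <- colMeans((preds - truth)^2)
bagged_mse <- mean((bagged - log(holdout$LPRICE))^2)

mean(bagged_mse < tree_mse)  # fraction of trees the bagged model beats
```

Note that `sapply` stacks the 20 prediction vectors into a matrix with one column per tree, which is why the `truth` matrix is built with matching dimensions.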

2.3 Q3 - Forest

Using the same training data as above:

  1. Make a 20-tree forest predicting log(LPRICE). (hint: the function ranger and the argument num.trees are your friends.)
  2. Now make a 100-tree forest with that training data.
  3. Make predictions for both forests on the holdout sample. predict(mod,data=holdout)$predictions may help.
  4. Find the out-of-sample MSE for each forest.

How did the forest compare to the bagged model? In 3 sentences or less, try to explain why.
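A minimal sketch of the forest steps, assuming the `train` and `holdout` splits from Q2:

```r
library(ranger)

# Two forests differing only in the number of trees.
rf20  <- ranger(log(LPRICE) ~ ., data = train, num.trees = 20)
rf100 <- ranger(log(LPRICE) ~ ., data = train, num.trees = 100)

# ranger's predict returns a list; $predictions holds the fitted values.
p20  <- predict(rf20,  data = holdout)$predictions
p100 <- predict(rf100, data = holdout)$predictions

# Out-of-sample MSE for each forest.
y <- log(holdout$LPRICE)
mse20  <- mean((p20  - y)^2)
mse100 <- mean((p100 - y)^2)
```

These MSEs are what you compare against the bagged model's MSE from Q2.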

3 Optional

  1. Build 100 different trees, then plot the OOS MSE for the bagged model as you increase the number of trees in the model from 1 to 100 (or 1000). Compare the asymptote to the 20-tree forest OOS MSE.
  2. Look at tree 1 from Q2. How many nodes does it have? The parameter cp is a cost that controls the number of leaves. Use cross-validation to find the optimal single-tree value for cp.
  3. Use that optimal single-tree value for cp to build a new bagged model with 20 trees. How does the MSE of that model compare to the MSE from the 20-tree bagged model in Q2?
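For the cross-validation in optional item 2, one approach is to lean on rpart's built-in cross-validation, which it runs automatically when fitting; the results are stored in the model's cptable. This is a sketch under that assumption, using `train` from Q2:

```r
library(rpart)

# Grow a large tree; rpart cross-validates over a grid of cp values as it fits.
big_tree <- rpart(log(LPRICE) ~ ., data = train, cp = 0)

# xerror is the cross-validated relative error at each cp value;
# pick the cp that minimizes it.
cpt <- big_tree$cptable
best_cp <- cpt[which.min(cpt[, "xerror"]), "CP"]

# Prune back to the CV-optimal complexity.
pruned <- prune(big_tree, cp = best_cp)
```

`plotcp(big_tree)` gives a visual version of the same cp-versus-error curve.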

4 Submission

Due Wednesday May 12 at midnight.