This homework focuses on ensemble models. But first, a tree. To do this, you will want to install the following packages:
install.packages(c("rpart","ranger"))
library(rpart)
library(ranger)
Load the homes data from HW 3 and do the same cleaning routine that the start of HW3 has. You should wind up with a variable called ‘twotwo’ – as well as a bunch of other cleaned variables.
Using the rpart package (and same-named function), build a tree to predict twotwo using all the data in homes. Add the argument cp=0, and make another tree. Plot that tree (plot(mod) will do the trick).
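To see what cp=0 changes, here is a minimal sketch on the kyphosis data that ships with rpart. That dataset, the formula, and the helper n_leaves are stand-ins chosen for illustration; they are not the homes data or part of the assignment.

```r
library(rpart)

# Illustrative data only: kyphosis ships with rpart and is NOT the homes data.
# With the default cp (0.01), a split is attempted only if it improves the
# overall fit by at least that factor.
mod_default = rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# cp = 0 removes the complexity penalty, so the tree keeps splitting until
# the other stopping rules (minsplit, minbucket, maxdepth) halt it.
mod_full = rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, cp = 0)

# Hypothetical helper: count terminal nodes (leaves) in a fitted tree.
n_leaves = function(m) sum(m$frame$var == "<leaf>")
n_leaves(mod_default)
n_leaves(mod_full)

# Draw the unpruned tree and label its splits.
plot(mod_full)
text(mod_full, use.n = TRUE)
```

Comparing the two leaf counts is a quick way to check how much the default penalty was holding the tree back.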
In two sentences or less, describe why this tree is shaped the way it is. (hint: think about what we are predicting)
Steps:
1. Split the data into holdout and training sets:
set.seed(14432)
ind = sample(nrow(homes),0.2*nrow(homes))
holdout = homes[ind,]
train = homes[-ind,]
2. [...] log(LPRICE) with them. (hint: you will need to change the formula used by rpart inside my resampled_mod function, as well as the data used by the functions - use the training data).
3. [...] (hint: pred_helper = function(x,xdata=holdout) predict(x,newdata=xdata) may help)
4. [...] (hint: rowMeans may help you here)
5. [...] truth = matrix(rep(log(holdout$LPRICE),20),ncol=20,byrow=F) may help with the tree models.

What fraction of the tree models does the bagged model outperform? Give a two-sentence explanation of that performance. Give a one-sentence explanation of what the line of code I gave in #5 did.
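The overall shape of a bagging workflow like the one in these steps can be sketched as follows. This is an illustration under assumptions, not the course's resampled_mod function: mtcars stands in for the homes data, boot_tree is a hypothetical helper, and the formula and tuning values (cp = 0, minsplit = 5) are arbitrary.

```r
library(rpart)

set.seed(1)
# Illustrative data only: mtcars stands in for the homes data.
ind = sample(nrow(mtcars), floor(0.2 * nrow(mtcars)))
holdout = mtcars[ind, ]
train = mtcars[-ind, ]

# Hypothetical helper: fit one tree to a bootstrap resample of the training rows.
boot_tree = function(i) {
  rows = sample(nrow(train), replace = TRUE)
  rpart(log(mpg) ~ wt + hp + disp, data = train[rows, ], cp = 0, minsplit = 5)
}
trees = lapply(1:20, boot_tree)

# One column of holdout predictions per tree (holdout rows x 20 trees).
pred_helper = function(x, xdata = holdout) predict(x, newdata = xdata)
preds = sapply(trees, pred_helper)

# The bagged prediction averages the 20 trees for each holdout row.
bagged = rowMeans(preds)

# Lay the true values out in 20 columns so they line up with preds.
truth = matrix(rep(log(holdout$mpg), 20), ncol = 20, byrow = FALSE)
tree_mse = colMeans((preds - truth)^2)
bag_mse = mean((bagged - log(holdout$mpg))^2)

# Share of individual trees with a higher MSE than the bagged model.
mean(tree_mse > bag_mse)
```

Keeping the predictions in a matrix with one column per tree means both the per-tree MSEs (colMeans) and the bagged prediction (rowMeans) come from the same object.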
Using the same training data as above:
1. Make a 20-tree forest predicting log(LPRICE). (hint: the function ranger and the argument num.trees are your friends.)
2. Now make a 100-tree forest with that training data.
3. Make predictions for both forests on the holdout sample. predict(mod,data=holdout)$predictions may help.
4. Find the out-of-sample MSE for each forest.
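A minimal sketch of the forest steps above, again on mtcars rather than the homes data. The column log_mpg, the helper oos_mse, and the choice of predictors are assumptions made for the example; the outcome is logged into its own column first to keep the formula simple.

```r
library(ranger)

set.seed(1)
# Illustrative data only: mtcars stands in for the homes data.
dat = mtcars
dat$log_mpg = log(dat$mpg)
ind = sample(nrow(dat), floor(0.2 * nrow(dat)))
holdout = dat[ind, ]
train = dat[-ind, ]

# Fit forests of two sizes; num.trees sets how many trees are grown.
forest20 = ranger(log_mpg ~ wt + hp + disp, data = train, num.trees = 20)
forest100 = ranger(log_mpg ~ wt + hp + disp, data = train, num.trees = 100)

# Hypothetical helper: out-of-sample MSE on the holdout rows.
oos_mse = function(mod) {
  pred = predict(mod, data = holdout)$predictions
  mean((pred - holdout$log_mpg)^2)
}
oos_mse(forest20)
oos_mse(forest100)
```

Note that predict for a ranger model takes the new data through the data argument, not newdata as in rpart.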
How did the forest compare to the bagged model? In 3 sentences or less, try to explain why.
cp is a cost that controls the number of leaves. Use cross-validation to find the optimal single-tree value for cp.
Use that value of cp to build a new bagged model with 20 trees. How does the MSE of that model compare to the MSE from the 20-tree bagged model in Q2?

Due Wednesday May 12 at midnight.
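For the cross-validation step on cp, one possible route is the table rpart computes while it fits. The sketch below assumes mtcars as stand-in data and arbitrary settings (minsplit = 5, xval = 10); other cross-validation approaches would work as well.

```r
library(rpart)

set.seed(1)
# Illustrative data only: mtcars stands in for the homes training data.
# xval = 10 asks rpart for 10-fold cross-validation while it grows the tree.
mod = rpart(log(mpg) ~ wt + hp + disp, data = mtcars,
            cp = 0, minsplit = 5, xval = 10)

# cptable has one row per candidate cp; xerror is the cross-validated error.
cp_table = mod$cptable
best_cp = cp_table[which.min(cp_table[, "xerror"]), "CP"]
best_cp

# Prune the large tree back to the cross-validated choice.
pruned = prune(mod, cp = best_cp)
```

Because the folds are drawn at random, the chosen cp can change from seed to seed.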