This homework focuses on using R to make KNN predictions and roughly reproduce the plots I used in lecture 7.

1 Setup

For the lecture I used the following code to generate data:

n = 50
x1 = runif(n)
x2 = runif(n)
prob  = ifelse(x1 < 0.5 & x1 > 0.25 & x2 > 0.25 & x2<0.75,0.8,0.3)
y  = as.factor(rbinom(n,1,prob))
levels(y) = c("1","2")
df = data.frame(y=y,x1=x1,x2=x2)

And then I used the following function to make KNN predictions:

knn_pred = function(point,x1,x2,y,k=5) {
  dists = sqrt((x1-point[1])^2+(x2-point[2])^2) #Find all distances to current obs
  bound = sort(dists)[k]                #Find kth smallest distance
  indices = which(dists <= bound)       #Find the k nearest obs (more if distances tie)
  outcomes = as.integer(y[indices])     #Find corresponding outcomes y, coded 1 or 2
  round(mean(outcomes)) #Taking advantage of 2 outcomes. If more 2s, this gives 2, if more 1s this gives 1.
}
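
To see what each step inside knn_pred does, here is a minimal sketch on a hand-made toy data set. The five points, the query point, and k=3 are made up for illustration and are not the lecture data.

```r
# Toy data (made up): three class-"1" points near the origin,
# two class-"2" points near (1, 1).
x1 = c(0.0, 0.1, 0.2, 0.9, 1.0)
x2 = c(0.0, 0.1, 0.0, 1.0, 0.9)
y  = factor(c("1", "1", "1", "2", "2"))

point = c(0.05, 0.05)   # query point, close to the class-"1" cluster
k = 3

dists    = sqrt((x1 - point[1])^2 + (x2 - point[2])^2)  # distance to every obs
bound    = sort(dists)[k]                               # kth smallest distance
indices  = which(dists <= bound)                        # the k nearest obs
outcomes = as.integer(y[indices])                       # factor codes: 1 or 2
vote     = round(mean(outcomes))                        # majority vote

print(indices)  # which observations voted
print(vote)     # predicted class
```

One thing to keep in mind with an even k: R's round() sends 1.5 to 2, so an exact tie between the two classes is resolved in favor of class 2.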

This code builds a grid of points, and then makes predictions for each of those points.

grid.fineness = 201
sequence = seq(0,1,length.out=grid.fineness)
grid = expand.grid(sequence,sequence)
colnames(grid) = c("x1","x2")
yhat = apply(grid,1,knn_pred,x1=x1,x2=x2,y=y,k=5)
yhat = as.factor(yhat)

With those predictions, we can build a dataframe, and plot.

df =
df$y = yhat
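
As a rough sketch of the build-a-dataframe-and-plot step: the code below assumes the plotting data frame starts from the grid itself, and it uses base graphics, which may not be what the lecture plots used. The name plot_df, the coarse 21-point grid, and the placeholder predictions are stand-ins so the sketch runs on its own.

```r
# Stand-in inputs so this runs by itself; in the homework, `grid` and `yhat`
# come from the code above.
sequence = seq(0, 1, length.out = 21)
grid = expand.grid(x1 = sequence, x2 = sequence)
yhat = as.factor(ifelse(grid$x1 < 0.5, 1, 2))   # placeholder predictions

# Assumption: the plotting frame is the grid plus a column of predictions.
plot_df = data.frame(grid)
plot_df$y = yhat

# One square per grid point, coloured by predicted class (1 = black, 2 = red).
plot(plot_df$x1, plot_df$x2, col = as.integer(plot_df$y), pch = 15, cex = 0.5,
     xlab = "x1", ylab = "x2", main = "Predicted class on the grid")
```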

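If we drop the round and subtract 1 in our knn_pred function, we can get probabilities out: the outcomes are coded 1 and 2, so mean(outcomes)-1 is the fraction of the nearest neighbors that are in class 2.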

knn_prob = function(point,x1,x2,y,k=5) {
  dists = sqrt((x1-point[1])^2+(x2-point[2])^2) #Find all distances to current obs
  bound = sort(dists)[k]                #Find kth smallest distance
  indices = which(dists <= bound)       #Find the k nearest obs (more if distances tie)
  outcomes = as.integer(y[indices])     #Find corresponding outcomes y, coded 1 or 2
  mean(outcomes)-1 #Taking advantage of 2 outcomes. This is the share of 2s among the neighbors.
}

We can predict those probabilities at each point:

phat = apply(grid,1,knn_prob,x1=x1,x2=x2,y=y,k=5)
df$phat = phat

This is, in essence, the beginnings of a simulation study. We generated data, and we can look at how our predictions perform. We can do this with either the classifications or the underlying probabilities.

2 Questions

We are going to extend this simulation study in a few ways.

2.1 Q1 - Bigger Sample

Resetting the seed with:


Run the same data generation code, but with a sample size of 1000. Plot the resulting probabilities when we use \(K=1\), \(K=5\), and \(K=25\).

Plot the classification predictions when \(K=10\), using a probability threshold of 0.2 for our predictions instead of the standard 0.5.
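
Changing the threshold only changes how probabilities are turned into classes. A minimal sketch, using made-up probabilities in place of the real knn_prob output; the name yhat_thresh and the choice of a strict > at the boundary are assumptions, since the question does not say how to treat a probability of exactly 0.2.

```r
# Placeholder probabilities; in the homework these would come from knn_prob
# with k = 10 on the grid.
phat = c(0.0, 0.1, 0.2, 0.3, 0.9)

threshold = 0.2

# Predict class 2 whenever the estimated probability exceeds the threshold.
yhat_thresh = as.factor(ifelse(phat > threshold, 2, 1))
print(yhat_thresh)
```

With a threshold below 0.5, class 2 gets predicted over a larger part of the grid than under the standard majority vote.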

2.2 Q2 – Logit Comparison

Fit an interacted logit to this data. (i.e. model \(Y\sim x1+x2+x1:x2\) – using glm, not a LASSO). Find the predicted probabilities for every point in our grid, and plot those predicted probabilities.
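
A minimal sketch of the fit-then-predict pattern, under some assumptions: the seed, the sample size of 200, the coarse grid, and the names df_train, fit, and phat_logit are all chosen here for illustration. With a two-level factor outcome, glm treats the first level as failure, so the fitted probabilities are for class "2".

```r
set.seed(1)   # arbitrary seed, for this sketch only
n = 200       # smaller than the homework sample, just to keep the sketch quick
x1 = runif(n)
x2 = runif(n)
prob = ifelse(x1 < 0.5 & x1 > 0.25 & x2 > 0.25 & x2 < 0.75, 0.8, 0.3)
y = as.factor(rbinom(n, 1, prob))
levels(y) = c("1", "2")
df_train = data.frame(y = y, x1 = x1, x2 = x2)

# Logit with main effects and the x1:x2 interaction.
fit = glm(y ~ x1 + x2 + x1:x2, data = df_train, family = binomial)

# Predict on a grid; the column names must match the model's variable names.
sequence = seq(0, 1, length.out = 21)
grid = expand.grid(x1 = sequence, x2 = sequence)
phat_logit = predict(fit, newdata = grid, type = "response")  # P(y == "2")
```

Without type = "response", predict returns values on the log-odds scale rather than probabilities.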

2.3 Q3 – ROC

Plot the (in-sample) ROC curves for both the logit model and the KNN with \(K=10\). (hint: I have a function for doing this, given outcomes and probabilities in the lecture)

Which of these models looks better?
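
If you want to check your understanding of what an ROC curve traces out, here is a minimal sketch. The helper roc_points is hypothetical and is not the function from lecture, and the six outcomes and probabilities are made up. It sweeps a threshold from high to low and records the true and false positive rates at each step.

```r
# Hypothetical helper (not the lecture's function).
# outcome: logical vector, TRUE for the positive class.
# prob:    predicted probability of the positive class.
roc_points = function(outcome, prob) {
  # Start above every probability so the curve begins at (0, 0).
  thresholds = c(Inf, sort(unique(prob), decreasing = TRUE))
  tpr = sapply(thresholds, function(t) mean(prob[outcome] >= t))   # true positive rate
  fpr = sapply(thresholds, function(t) mean(prob[!outcome] >= t))  # false positive rate
  data.frame(fpr = fpr, tpr = tpr)
}

# Made-up example data.
outcome = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
prob    = c(0.9, 0.8, 0.7, 0.6, 0.3, 0.1)

roc = roc_points(outcome, prob)
plot(roc$fpr, roc$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # reference line for a classifier that guesses at random
```

A curve that sits further above the dashed diagonal corresponds to a model that separates the two classes better.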

2.4 Q4 – Survey

Please complete the surveys posted last week (and in announcements on canvas). They will help me ensure the last 5 weeks of class are as useful to you as possible.

3 Optional

  1. Plot the resulting predictions from a \(k=30\) KNN for Q4.
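  2. Find OOS ROC curves in Q2 – build a holdout, then use predictions on it.
  3. Do K-fold cross validation for the OOS ROC curves.
  4. Add squared terms to the logit, remake predictions and plots (e.g. y~x1+I(x1^2)+x2+I(x2^2)+x1:x2; the I() wrapper is needed because ^ has a special meaning inside an R formula, so a bare x1^2 would not add a squared term).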
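
For the holdout and K-fold items, one possible way to split the rows is sketched below. The seed, the 200-row holdout, the 5 folds, and the names holdout_idx, train_idx, and fold_id are assumptions made for illustration.

```r
set.seed(2)   # arbitrary seed, for this sketch only
n = 1000      # sample size from Q1

# Single holdout: set aside 200 row indices, fit on the rest.
holdout_idx = sample(n, size = 200)
train_idx   = setdiff(seq_len(n), holdout_idx)

# K-fold: shuffle the labels 1..5 so every row lands in exactly one fold.
n_folds = 5
fold_id = sample(rep(seq_len(n_folds), length.out = n))

# Rows used for fitting and for evaluation when fold 1 is held out.
fit_rows  = which(fold_id != 1)
eval_rows = which(fold_id == 1)
```

In both cases the model is fit only on the training rows, and the ROC curve is computed from predictions on the rows that were left out.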

3.1 Optional Long Q

We have a new output with three categories.

# x1,x2,y from your sample with 1k observations need to exist to run this.
z = ifelse((x1>0.8 | x2 < 0.4),rbinom(length(y),1,0.8),(y==1)*2)

Modify knn_pred so that it predicts the most likely category out of 3 categories (hint: the functions table, which.max, names, and as.integer are how I did this) (hint2: maybe start by building a function that takes a vector and finds the most common element, then fit it into the rest of this). Plot the grid-predictions with this new classifier.
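
As a starting point for the second hint, here is a minimal sketch of a most-common-element helper built from table, which.max, and names. The function name most_common is hypothetical, and this is only the vote-counting piece, not the full modified knn_pred.

```r
# Hypothetical helper: return the most frequent value in v, as a character label.
most_common = function(v) {
  counts = table(v)                  # frequency of each distinct value
  names(counts)[which.max(counts)]   # label of the largest count
}

neighbors = c(0, 2, 2, 1, 2, 0)      # made-up categories of some nearest neighbors
winner = as.integer(most_common(neighbors))
print(winner)
```

which.max returns the first maximum, so when two categories tie this picks whichever comes first in table's sorted order.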