```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) setwd("~/Github/bigdata/lectures/l7/") library(tidyverse) library(glmnet) ``` ## Today's Class {.smaller} 1. Review - LASSO - Deviance vs MSE - Cross-Validation - R help docs -- An explainer 2. KNN 3. Binomial Classificiation: Probabilities to Predictions 4. Misclassification: Sensitivity, Specificity, ROC Curve 5. Multinomial Regression # Quick Review ## LASSO The lasso penalty is the $l_1$ norm of our parameters: $pen(\beta) = \sum |\beta_j|$, which penalizes larger coefficients and non-zero coefficients. So our estimates are: $$\hat{\beta}_\lambda = \underset{\beta}{\text{argmin}}\left( Dev(\beta)+\lambda \sum_{j=1}^p |\beta_j|\right)$$ For a given model of the error terms, $Dev(\beta)$ is held fixed, so we can make comparisons within the logistic or linear regressions, but not between them easily. When we switch logistic $\rightarrow$ linear, we change $Dev$. We can still make comparisons betwen these models but we need to ensure we are comparing the same thing. ## LASSO -- MSE vs Deviance For the logistic LASSO, our deviance is approximately: $$Dev_{logit}(\beta) \propto \sum_{i=1}^n [ log(1+exp(x_i'\beta))-y_ix_i'\beta]$$ While for our linear LASSO, our deviance is approximately: $$Dev_{linear}(\beta) \approx {1 \over \sigma^2} \sum_{i=1}^n (y-x'\beta)^2 \propto \widehat{MSE}$$ This difference derives from the different *likelihoods* each model uses to model their different ideas about the error distribution. But there is only one true error distribution. ## Model Choice When deciding which model to use, we used the out-of-sample deviance to select a value of $\lambda$ for each LASSO type. But we still need to choose between these two very different models. That choice is still driven by out of sample prediction errors. We need to compare the prediction errors using the same yardstick. So we could compare all the prediction errors using the binomial deviance, or using the MSE, or using some other loss function. ## Model Choice Critically (and we will see this again), cross-validation gives us a good understanding of our out-of-sample error. We can answer questions like "When I build a model in this way, what do its out of sample errors look like?" Which means we can also answer questions like "Which model building process has better out-of-sample predictions?" All of this is because of the very trustworthy error distributions we get from Cross-validation. ## R help documents. I've realized I should give you a quick overview of how the R help documents are structured. There are three main questions I turn to the help documents with. 1. Where in the output is $x$? 2. What do I put in the input to do $y$? 3. How do I do $z$? By far, the help documents are most useful for (1) and (2). ```{r, eval=F} ?cv.glmnet ``` ## Basic Structure of Help Docs {.smaller} > 1. **Description**: Basic explanation of the function in question. > 2. **Usage**: This shows the function (and close relatives), as well as most of the arguments to the function, and their defaults. > 3. **Arguments**: This lists out every argument you can enter, describing what it must be, and what it controls. > 4. **Details**: In-depth description of the function, often including further detail about inputs, sometimes math. > 5. **Value**: In depth description of the output of the function. > 6. **Authors/Refs**: People and places with even more details > 7. **See Also**: other closely related functions. Many model tools will mention "`plot`, `predict`, and `coef`" methods for `cv.glmnet`. That implies functions like `coef.cv.glmnet` and `plot.cv.glmnet` you can look up. > 8. **Examples**: example code running the function on basic data. # Classification ## Basic Setting Just as in our basic prediction problems, we have data with $n$ observations $(\textbf{x}_i,y_i)$ of something. But now $y_i$ is qualitative rather than quantitative. Membership in some category $\{1,...,M\}$ The basic problem then is the following: Given new observation covariates $\textbf{x}_i^{new}$, what is the class label $y_i^{new}$? The quality of any classifier can be determined by its misclassification risk: $$P[\hat{y}_i^{new} \neq y_i^{new}]$$ ## Motivation How does this differ from basic logistic question of predicting $P[y_i^{new}=1]$? 1. We may have many categories, not just two. 2. We may have different discrete actions we take depending on classification. - In some domains, as $P[y_i=1]$ changes, our actions change smoothly. - e.g. as $P[$stock x goes up$]$ we buy more of it. - In other domains, as $P[y_i=1]$ changes, our actions change suddenly. - e.g. as $P[$get into college i$|$I apply$]$ goes up, we switch from "don't apply" to "apply", there is mostly no "apply a little bit more" ## Motivation This leads to situations where we want to categorize, as it affects our decisions. We face decisions not on a spectrum, so the most useful interpretation of our predictions may not be on that spectrum. $\implies$ classification. ## Optimal Classifier Presuming you have no preference between different types of misclassification (**LOSS FUNCTION CLAIM**), there is an optimal classifier, known as the Bayes Classifier. $$ \hat{y}_i^{new} = \underset{j \in \{1,...,M\}}{\text{argmax}} P[y_i^{new} = j |\textbf{x}_i^{new}]$$ Find the prediction which is *most* likely (not necessarily all that likely -- e.g. $P[\hat y = y] < 0.5$ is common). This will minimize the misclassification risk. ## Bayes Classifier Unfortunately, we don't know $P[y = j | \textbf{x}_i^{new}]$. So the Bayes classifier is in some ways an unattainable standard. But we can estimate it! ## Estimating P There are many tools for estimating $P[y|x]$ given the training data. - We can use parametric tools. - Assume $P[y|x]$ is some function of unknown parameters $\beta$, estimate those, and make predictions. - Sound familiar? Logistic Regression does this. - We can use non-parametric tools. - Estimate $P[y|x]$ directly without any parameters. - K Nearest Neighbors (KNN) # KNN ## KNN Basics **Basic Idea**: Estimate $P[y|x]$ locally using the labels of similar observations in the training data. KNN: What is the most common class near $x^{new}$? 1. Take the $K$ nearest neighbors $x_{i,1},...,x_{i,K}$ of $x^{new}$ in the training data - Nearness is (usually) Euclidean distance: $\sqrt{\sum_{j=1}^p (x^{new}_j-x_{i,k,j})^2}$ 2. Estimate $P[y=j|x] = \sum_{i=1}^K 1(y_i=j)$ 3. Select the class $j$ with the highest probability. ## KNN -- Details This (again) is sensitive to the scale of each covariate $x$. So we will rescale them all by standard deviations. This will be sensitive to $K$, and we need to pick it. - $K=n$ -- we just take the mean across the entire training data. - $K=1$ -- Whatever observation happens to be closest will be our prediction. Cross-validation here will help. ## KNN Example Data ```{r, echo=F} n = 50 set.seed(100) set.seed(101) x1 = runif(n) x2 = runif(n) prob = ifelse(x1 < 0.5 & x1 > 0.25 & x2 > 0.25 & x2<0.75,0.8,0.3) newx = data.frame(y="New Observation",x1=0.7,x2=0.6) y = as.factor(rbinom(n,1,prob)) levels(y) = c("1","2","New Observation") df = data.frame(y=y,x1=x1,x2=x2) df = rbind(df,newx) ``` ```{r,echo=F} ggplot(df,aes(x=x1,y=x2,col=y)) + geom_point()+ theme(aspect.ratio = 1) ``` ## KNN Example $K=3$ ```{r,echo=F} dists = sqrt((x1-0.7)^2+(x2-0.6)^2) k3 = sort(dists)[3] gg_circle <- function(r, xc, yc, color="black", fill=NA, ...) { x <- xc + r*cos(seq(0, pi, length.out=100)) ymax <- yc + r*sin(seq(0, pi, length.out=100)) ymin <- yc + r*sin(seq(0, -pi, length.out=100)) annotate("ribbon", x=x, ymin=ymin, ymax=ymax, color=color, fill=fill, ...) } ggplot(df,aes(x=x1,y=x2,col=y))+ gg_circle(k3,0.7,0.6,fill="salmon",alpha=0.3,lwd=0)+ geom_point()+ theme(aspect.ratio = 1) ``` ## KNN Example $K=7$ ```{r,echo=F} k7 = sort(dists)[7] ggplot(df,aes(x=x1,y=x2,col=y))+ gg_circle(k7,0.7,0.6,fill="green",alpha=0.3,lwd=0)+ geom_point()+ theme(aspect.ratio = 1) ``` The relative 'vote counts' are a very crude estimate of probability. ## KNN Example $K=1$ ```{r, echo=F} k1 = sort(dists)[1] ggplot(df,aes(x=x1,y=x2,col=y))+ gg_circle(k1,0.7,0.6,fill="salmon",alpha=0.3,lwd=0)+ geom_point()+ theme(aspect.ratio = 1) ``` ## KNN Example ```{r,echo=F} grid.fineness = 81 sequence = seq(0,1,length.out=grid.fineness) grid = expand.grid(sequence,sequence) knn_pred = function(x,k=5) { dists = sqrt((x1-x[1])^2+(x2-x[2])^2) #Find all distances to current obs bound = sort(dists)[k] #Find kth smallest distance indices = which(dists <= bound) #Find which obs have dists 1:k outcomes = as.integer(y[indices]) #Find corresponding outcomes y round(mean(outcomes)) #Taking advantage of 2 outcomes. } colnames(grid) = c("x1","x2") yhat = as.factor(apply(grid,1,knn_pred)) levels(yhat) = c(levels(yhat),"New Observation") df_grid = data.frame(x1=grid$x1,x2=grid$x2,y=yhat) ggplot(df,aes(x=x1,y=x2,col=y))+ geom_point(data=df_grid,aes(x=x1,y=x2,col=y),alpha=0.5,size=0.2)+ geom_point()+ theme(aspect.ratio=1) ``` ## KNN Example: More Data ```{r,echo=F} n = 2000 x1 = runif(n) x2 = runif(n) prob = ifelse(x1 < 0.5 & x1 > 0.25 & x2 > 0.25 & x2<0.75,0.8,0.3) newx = data.frame(y="New Observation",x1=0.7,x2=0.6) y = as.factor(rbinom(n,1,prob)) levels(y) = c("1","2","New Observation") df = data.frame(y=y,x1=x1,x2=x2) df = rbind(df,newx) dists = sqrt((x1-0.7)^2+(x2-0.6)^2) ggplot(df,aes(x=x1,y=x2,col=y))+geom_point(alpha=0.5)+theme(aspect.ratio=1) ``` ## KNN Example $K=3$ ```{r,echo=F} grid.fineness = 101 sequence = seq(0,1,length.out=grid.fineness) grid = expand.grid(sequence,sequence) colnames(grid) = c("x1","x2") yhat = as.factor(apply(grid,1,knn_pred,k=3)) levels(yhat) = c(levels(yhat),"New Observation") df_grid = data.frame(x1=grid$x1,x2=grid$x2,y=yhat) ggplot(df_grid,aes(x=x1,y=x2,col=y))+ geom_point(size=0.2)+ theme(aspect.ratio=1) ``` ## KNN Example $K=11$ ```{r, echo=F} yhat = as.factor(apply(grid,1,knn_pred,k=11)) levels(yhat) = c(levels(yhat),"New Observation") df_grid = data.frame(x1=grid$x1,x2=grid$x2,y=yhat) ggplot(df_grid,aes(x=x1,y=x2,col=y))+ geom_point(size=0.2)+ theme(aspect.ratio=1) ``` ## KNN Example $K=31$ ```{r,echo=F} yhat = as.factor(apply(grid,1,knn_pred,k=31)) levels(yhat) = c(levels(yhat),"New Observation") df_grid = data.frame(x1=grid$x1,x2=grid$x2,y=yhat) ggplot(df_grid,aes(x=x1,y=x2,col=y))+ geom_point(size=0.2)+ theme(aspect.ratio=1) ``` ## KNN -- Example Base Truth The optimal prediction scheme ```{r,echo=F} opt_pred = function(x){ out = 1 if(x[1] < 0.5 & x[1]>0.25 & x[2] < 0.75 & x[2] > 0.25) out = 2 out } opt_preds = apply(grid,1,opt_pred) df_grid_opt = df_grid df_grid_opt$y = as.factor(opt_preds) ggplot(df_grid_opt,aes(x=x1,y=x2,col=y))+ geom_point(size=0.2)+ theme(aspect.ratio=1) ``` ## KNN Higher $K$ leads to higher *in-sample* training error. (Proportion incorrect). Lower $K$ leads to higher flexibility $\implies$ overfitting and poor OOS misclassification. ## KNN {.smaller} - Pros: - Straightforward - Easily adjust to multiple categories. - Outperform other classifiers in settings where local details matter - Cons: - Computing neighbors (just finding them) can be costly with big $n$ or $p$ - No variable selection built in - Choosing $k$ is tricky -- Cross-validation can work -- but it is unstable. - Remember, LASSO benefitted from *stability* of estimates - Classification is very sensitive to $k$ - Only other output is rough local probabilities. - Without good probabilities, assessing uncertainty is hard. ## Binary Classification Many problems can be reduced to binary classification (as above). KNNs are a useful non-parametric classification tool. *Logits* are a useful parametric classification tool. - Remember Spam? - Logits yield parametric decision boundaries. Easy to interpret. - Logits are global methods. Use *all* the training data to inform predictions. - The probability estimates are more useful and more stable. - Logits can do variable selection. ## Credit Classification Example German loan/default data. "Predict performance of new loans" -- want to predict default probability. Going to try to use borrower and loan characteristics. Messy Data. ## Beware This is not 'randomly sampled' data. ```{r, echo=F} ## read data and create some `interesting' variables credit = read.csv("~/Github/bigdata/lectures/l7/credit/credit.csv") ## re-level the credit history and checking account status credit$history = factor(credit$history, levels=c("A30","A31","A32","A33","A34")) levels(credit$history) = c("good","good","poor","poor","terrible") ## a few others credit$foreign = factor(credit$foreign, levels=c("A201","A202"), labels=c("foreign","german")) credit$rent = factor(credit$housing=="A151") credit$purpose = factor(credit$purpose, levels=c("A40","A41","A42","A43","A44","A45","A46","A47","A48","A49","A410")) levels(credit$purpose) = c("newcar","usedcar",rep("goods/repair",4),"edu",NA,"edu","biz","biz") ``` ```{r, echo=F} plot(factor(Default) ~ history, data=credit, col=c(8,2), ylab="Default") ## surprise! ``` ## Beware Consider your data sources carefully. ```{r, echo=F} plot(factor(Default) ~ purpose, data=credit, col=c(8,2), ylab="") ``` ## German Credit LASSO ```{r, echo=F} xnaref = function(x){ if(is.factor(x)) if(!is.na(levels(x)[1])) x = factor(x,levels=c(NA,levels(x)),exclude=NULL) x } naref = function(DF){ if(is.null(dim(DF))) return(xnaref(DF)) if(!is.data.frame(DF)) stop("You need to give me a data.frame or a factor") DF = lapply(DF, xnaref) return(as.data.frame(DF)) } credit = naref(credit) credx = sparse.model.matrix( ~ (duration + amount + installment + age + history + purpose + foreign + rent)^2, data=credit) default = credit$Default ``` ```{r} credscore = cv.glmnet(credx, default, family="binomial") plot(credscore) ``` ## German Credit LASSO ```{r} sum(coef(credscore, s="lambda.1se")!=0) # 1se sum(coef(credscore, s="lambda.min")!=0) # min ``` ## Decision Making There are two ways to be wrong in this binary problem. - False Positive predict $\hat y=1$ when $y=0$. - Classify as defaulters when they are not - False Negatives predict $\hat y=0$ when $y=1$. - Classify as non-defaulters when they are. Both mistakes are bad, but sometimes one is worse than the other. Logistic regression gives us an estimate of $P[y=1|x]$. And the Bayes decision rule classifies purely based on probabilities. When $P[y=1|x] > 0.5$, classify as 1. But instead of minimizing misclassification risk, we want to minimize our **loss**. ## Using probabilities for Decisions To make optimal decisions you need to account for probabilities and costs. If for each loan, you make \$0.25 when repaid, and lose \$1 when not, then we only *expect* to make 'profits' when $P[y=1] < 0.2$. So we may want our classifier to use a different threshold. ## False Positive Rate Much like in lecture 1, we we can think about our *rate* of false positives. - False Positive Rate = #Misclassified as 1 / #classified as 1 - False Negative Rate = #Misclassified as 0/ #Classified as 0 ## Sensitivity and Specificity But we may also want to think about sensitivity and specificity. - Sensitivity: proportion of true $y=1$ classified as such. - Specificity: proportion of true $y=0$ classified as such. A rule is sensitive if it mostly gets the 1s right. A rule is specific if it mostly gets the 0s right. ## ROC Curve We can plot the ROC curve for different choices of threshold. ```{r, echo=F} pred = predict(credscore, credx, type="response") pred = drop(pred) roc = function(p,y, ...){ y = factor(y) n = length(p) p = as.vector(p) Q = p > matrix(rep(seq(0,1,length=100),n),ncol=100,byrow=TRUE) specificity = colMeans(!Q[y==levels(y)[1],]) sensitivity = colMeans(Q[y==levels(y)[2],]) plot(1-specificity, sensitivity, type="l", ...) abline(a=0,b=1,lty=2,col=8) } roc(p=pred, y=default, bty="n") ``` # Wrap up ## Things to do HW 3 is due tomorrow night. See you Thursday ## Rehash - Classification is a different type of problem - KNN is a useful tool for using local data to make predictions. - Logits can also be used for classifications - Prediction Errors in this setting are closely related to eachother # Bye!