```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
setwd("~/Github/bigdata/lectures/l8/")
library(tidyverse)
library(glmnet)
```

## Today's Class {.smaller}

1. Review
    - KNN
    - Binomial Classification
    - ROC Curve
2. Multinomial Regression
3. HW3 solns/discussion
4. HW4

# Quick Review

## KNN Basics

**Basic Idea**: Estimate $P[y|x]$ locally using the labels of similar observations in the training data.

KNN: What is the most common class near $x^{new}$?

1. Take the $K$ nearest neighbors $x_{i,1},...,x_{i,K}$ of $x^{new}$ in the training data.
    - Nearness is (usually) Euclidean distance: $\sqrt{\sum_{j=1}^p (x^{new}_j-x_{i,k,j})^2}$
2. Estimate $P[y=j|x^{new}] = \frac{1}{K} \sum_{k=1}^K 1(y_{i,k}=j)$
3. Select the class $j$ with the highest probability.

## KNN Example Data

```{r, echo=F}
n = 50
set.seed(101)
x1 = runif(n)
x2 = runif(n)
prob = ifelse(x1 < 0.5 & x1 > 0.25 & x2 > 0.25 & x2 < 0.75, 0.8, 0.3)
newx = data.frame(y="New Observation", x1=0.7, x2=0.6)
y = as.factor(rbinom(n, 1, prob))
levels(y) = c("1", "2", "New Observation")
df = data.frame(y=y, x1=x1, x2=x2)
df = rbind(df, newx)
```

```{r, echo=F}
ggplot(df, aes(x=x1, y=x2, col=y)) +
  geom_point() +
  theme(aspect.ratio = 1)
```

## KNN Example $K=7$

```{r, echo=F}
dists = sqrt((x1-0.7)^2 + (x2-0.6)^2)
k7 = sort(dists)[7]
gg_circle <- function(r, xc, yc, color="black", fill=NA, ...) {
  x <- xc + r*cos(seq(0, pi, length.out=100))
  ymax <- yc + r*sin(seq(0, pi, length.out=100))
  ymin <- yc + r*sin(seq(0, -pi, length.out=100))
  annotate("ribbon", x=x, ymin=ymin, ymax=ymax, color=color, fill=fill, ...)
}
ggplot(df, aes(x=x1, y=x2, col=y)) +
  gg_circle(k7, 0.7, 0.6, fill="green", alpha=0.3, lwd=0) +
  geom_point() +
  theme(aspect.ratio = 1)
```

The relative 'vote counts' are a very crude estimate of probability.

## KNN Example

```{r, echo=F}
grid.fineness = 81
sequence = seq(0, 1, length.out=grid.fineness)
grid = expand.grid(sequence, sequence)
knn_pred = function(x, k=5) {
  dists = sqrt((x1-x[1])^2 + (x2-x[2])^2) # Distances from every training obs to x
  bound = sort(dists)[k]                  # kth smallest distance
  indices = which(dists <= bound)         # Which obs are among the k nearest
  outcomes = as.integer(y[indices])       # Their outcomes y
  round(mean(outcomes))                   # Majority vote (taking advantage of having 2 classes)
}
colnames(grid) = c("x1", "x2")
yhat = as.factor(apply(grid, 1, knn_pred))
levels(yhat) = c(levels(yhat), "New Observation")
df_grid = data.frame(x1=grid$x1, x2=grid$x2, y=yhat)
ggplot(df, aes(x=x1, y=x2, col=y)) +
  geom_point(data=df_grid, aes(x=x1, y=x2, col=y), size=0.8) +
  theme(aspect.ratio=1)
```

## Binary Classification

Many problems can be reduced to binary classification (as above).

KNNs are a useful non-parametric classification tool. *Logits* are a useful parametric classification tool.

- Remember Spam?
- Logits yield parametric decision boundaries, which are easy to interpret.
- Logits are global methods: they use *all* the training data to inform predictions.
- Their probability estimates are more useful and more stable.
- Logits can do variable selection.

A short code sketch contrasting the two approaches follows on the next slide.
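## Aside: KNN vs. Logit in Code

This slide is not part of the original example -- it is a minimal sketch, assuming the simulated `x1`, `x2`, and `y` from the chunks above are still in the workspace and that the `class` package (shipped with R) is available. It puts the local 7-NN vote at the new point $(0.7, 0.6)$ next to a plain, global logit fit to the same data.

```{r}
library(class)  # for knn(); an addition -- not loaded in the setup chunk above

train_x = cbind(x1, x2)              # the 50 simulated training points
new_x   = cbind(x1 = 0.7, x2 = 0.6)  # the "new observation"

## Local, non-parametric: majority vote among the 7 nearest neighbours.
## droplevels() strips the unused "New Observation" level; prob=TRUE also
## returns the winning vote share -- the crude probability estimate.
knn(train = train_x, test = new_x, cl = droplevels(y), k = 7, prob = TRUE)

## Global, parametric: an unpenalized logit fit to all of the training data,
## then P[y = "2" | x] evaluated at the new point.
logit_fit = glm(droplevels(y) ~ x1 + x2, family = "binomial")
predict(logit_fit, newdata = data.frame(x1 = 0.7, x2 = 0.6), type = "response")
```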
## Credit Classification Example

```{r, echo=F}
## read data and create some `interesting' variables
credit = read.csv("~/Github/bigdata/lectures/l7/credit/credit.csv")

## re-level the credit history
credit$history = factor(credit$history, levels=c("A30","A31","A32","A33","A34"))
levels(credit$history) = c("good","good","poor","poor","terrible")

## a few others
credit$foreign = factor(credit$foreign, levels=c("A201","A202"), labels=c("foreign","german"))
credit$rent = factor(credit$housing=="A151")
credit$purpose = factor(credit$purpose, levels=c("A40","A41","A42","A43","A44","A45","A46","A47","A48","A49","A410"))
levels(credit$purpose) = c("newcar","usedcar",rep("goods/repair",4),"edu",NA,"edu","biz","biz")

## make NA an explicit reference level for every factor
xnaref = function(x){
  if(is.factor(x))
    if(!is.na(levels(x)[1]))
      x = factor(x, levels=c(NA, levels(x)), exclude=NULL)
  x
}
naref = function(DF){
  if(is.null(dim(DF))) return(xnaref(DF))
  if(!is.data.frame(DF)) stop("You need to give me a data.frame or a factor")
  DF = lapply(DF, xnaref)
  return(as.data.frame(DF))
}
credit = naref(credit)

## design matrix with all pairwise interactions, then a cross-validated lasso logit
credx = sparse.model.matrix( ~ (duration + amount + installment + age + history + purpose + foreign + rent)^2, data=credit)
default = credit$Default
credscore = cv.glmnet(credx, default, family="binomial")
plot(credscore)

credx2 = model.matrix( ~ (duration + amount + installment + age + history + purpose + foreign + rent)^2, data=credit)
credit_mod = glmnet(credx2, default, family="binomial")
```

## Sensitivity and Specificity

Beyond raw accuracy, we may also want to think about sensitivity and specificity.

- Sensitivity: the proportion of true $y=1$ observations classified as such.
- Specificity: the proportion of true $y=0$ observations classified as such.

A rule is sensitive if it mostly gets the 1s right. A rule is specific if it mostly gets the 0s right.

## Error Types

![](../l1/fdr.png)

## ROC Curve

We can plot the ROC curve -- sensitivity against $1-$specificity -- across different choices of the classification threshold.

```{r, echo=F}
pred = predict(credscore, credx, type="response")
pred = drop(pred)

roc = function(p, y, ...){
  y = factor(y)
  n = length(p)
  p = as.vector(p)
  probs = seq(0, 1, length=1001)
  Q = p > matrix(rep(probs, n), ncol=length(probs), byrow=TRUE)
  specificity = colMeans(!Q[y==levels(y)[1],])
  sensitivity = colMeans(Q[y==levels(y)[2],])
  plot(1-specificity, sensitivity, type="l", ...)
  abline(a=0, b=1, lty=2, col=8)
  inds = c(251, 501, 751)  # mark the thresholds 0.25, 0.5, 0.75
  points(1-specificity[inds], sensitivity[inds], col=2, cex=3)
}
roc(p=pred, y=default, bty="n")
```

## AUC

The AUC -- the Area Under the (ROC) Curve -- is a common metric for choosing between classifiers: select the classifier with the largest AUC.

```{r}
roc(p=pred, y=default, bty="n")
```

## ROC & Decisions

What if we face a tradeoff with differential costs for the two error types? Find the point where the slope of the ROC curve equates those costs.

```{r}
roc(p=pred, y=default, bty="n")
```

# Multinomial

## Multiple Options

What if more than two classes are possible? E.g. not just $\{0,1\}$ but $\{0,1,2,...,M\}$.

## KNN Still Works

We can still just ask which category is most common among the neighbors nearest to a point, though our code may need to change. (Recommended optional long question in HW4.)

## Multinomial Logit

We can also switch to a multinomial logit. At its simplest, this combines $M$ models of $P[Y=1|x]$, $P[Y=2|x]$, ..., $P[Y=M|x]$ in a manageable manner.

We can estimate each of those models as a simple logit -- to get their coefficients. Then we make sure that our probabilities sum to 1:

$$\sum_{j=1}^M P[Y=j|x] = 1$$

## Multinomial Logit

Our full model is then:

$$P[Y_i = k|\textbf{x}_i] = p_{ik} = \frac{e^{\textbf{x}_i'\beta_k}}{\sum_{j=1}^M e^{\textbf{x}_i'\beta_j}}$$

NOTICE that there is a different coefficient vector $\beta_k$ for each class $k$.

## Multinomial Logit

This changes our likelihood slightly. Denoting by $k_i$ the class of observation $i$, we get:

$$L(\beta_1,...,\beta_M) \propto \prod_{i=1}^n p_{ik_i}$$

which implies that the deviance is

$$Dev(\beta_1,...,\beta_M) \propto -2 \sum_{i=1}^n \log(p_{ik_i})$$
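## Aside: Softmax and Deviance by Hand

This slide is not part of the original deck -- it is a tiny numeric sketch of the two formulas above, using made-up linear scores $\textbf{x}_i'\beta_k$ for two observations and three classes.

```{r}
## Made-up linear scores x_i'beta_k: one row per observation, one column per class
scores = rbind(obs1 = c(2.0, 0.5, -1.0),
               obs2 = c(-0.5, 1.5, 0.0))

## The softmax step: exponentiate, then normalize each row so its probabilities sum to 1
p = exp(scores) / rowSums(exp(scores))
round(p, 3)
rowSums(p)   # each row sums to 1

## If obs1 is truly class 1 and obs2 is truly class 2, the (unpenalized) deviance is
truth = c(1, 2)
-2 * sum(log(p[cbind(1:2, truth)]))
```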
## Multinomial

Now we can minimize the deviance (the standard multinomial logit) or a penalized version of it (make sure the penalty sums across *all* the coefficients, in every class!).

## Glass Example

We have shards of broken glass, and we want to determine which of a few common types of glass each one is:

- Headlight
- Vehicle window
- Standard window (float glass or not)
- Container (e.g. a glass bottle)
- Tableware (e.g. a drinking glass)

This is where a multinomial regression may be useful.

## Glass Example

```{r, echo=F, message=F, warning=F}
library(MASS)  ## a library of example datasets
data(fgl)      ## the glass data
x <- scale(fgl[,c("RI","Mg")])
gtype <- fgl$type
xfgl <- sparse.model.matrix(type~., data=fgl)[,-1]
```

```{r, message=F, warning=F}
glassfit <- cv.glmnet(xfgl, gtype, family="multinomial")
```

```{r, echo=F}
plot(glassfit)  # CV error; numbers across the top are the average # of nonzero coefficients across classes
```

## Multinomial

```{r, echo=F, out.width="30%", out.height="80%"}
plot(glassfit$glmnet.fit, xvar="lambda")  # coefficient paths, one panel per class
```

# Wrap up

## Things to do

HW3 is due next Wednesday night.

See you Tuesday!

## Rehash

- ROC curves are powerful tools for evaluating binary classifiers and choosing thresholds.
- Multinomial regressions extend the logit to more than two classes and are very doable.

# Bye!