```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
setwd("~/Github/bigdata/lectures/l8/")
library(tidyverse)
library(glmnet)
```
## Today's Class {.smaller}
1. Review
- KNN
- Binomial Classification
- ROC Curve
2. Multinomial Regression
3. HW3 solns/discussion
4. HW4
# Quick Review
## KNN Basics
**Basic Idea**: Estimate $P[y|x]$ locally using the labels of similar observations in the training data.
KNN: What is the most common class near $x^{new}$?
1. Take the $K$ nearest neighbors $x_{i,1},...,x_{i,K}$ of $x^{new}$ in the training data
- Nearness is (usually) Euclidean distance: $\sqrt{\sum_{j=1}^p (x^{new}_j-x_{i,k,j})^2}$
2. Estimate $P[y=j|x] = \frac1n \sum_{i=1}^K 1(y_i=j)$
3. Select the class $j$ with the highest probability.
## KNN Example Data
```{r, echo=F}
n = 50
set.seed(100)
set.seed(101)
x1 = runif(n)
x2 = runif(n)
prob = ifelse(x1 < 0.5 & x1 > 0.25 & x2 > 0.25 & x2<0.75,0.8,0.3)
newx = data.frame(y="New Observation",x1=0.7,x2=0.6)
y = as.factor(rbinom(n,1,prob))
levels(y) = c("1","2","New Observation")
df = data.frame(y=y,x1=x1,x2=x2)
df = rbind(df,newx)
```
```{r,echo=F}
ggplot(df,aes(x=x1,y=x2,col=y)) +
geom_point()+
theme(aspect.ratio = 1)
```
## KNN Example $K=7$
```{r,echo=F}
dists = sqrt((x1-0.7)^2+(x2-0.6)^2)
k7 = sort(dists)[7]
gg_circle <- function(r, xc, yc, color="black", fill=NA, ...) {
x <- xc + r*cos(seq(0, pi, length.out=100))
ymax <- yc + r*sin(seq(0, pi, length.out=100))
ymin <- yc + r*sin(seq(0, -pi, length.out=100))
annotate("ribbon", x=x, ymin=ymin, ymax=ymax, color=color, fill=fill, ...)
}
ggplot(df,aes(x=x1,y=x2,col=y))+
gg_circle(k7,0.7,0.6,fill="green",alpha=0.3,lwd=0)+
geom_point()+
theme(aspect.ratio = 1)
```
The relative 'vote counts' are a very crude estimate of probability.
## KNN Example
```{r,echo=F}
grid.fineness = 81
sequence = seq(0,1,length.out=grid.fineness)
grid = expand.grid(sequence,sequence)
knn_pred = function(x,k=5) {
dists = sqrt((x1-x[1])^2+(x2-x[2])^2) #Find all distances to current obs
bound = sort(dists)[k] #Find kth smallest distance
indices = which(dists <= bound) #Find which obs have dists 1:k
outcomes = as.integer(y[indices]) #Find corresponding outcomes y
round(mean(outcomes)) #Taking advantage of 2 outcomes.
}
colnames(grid) = c("x1","x2")
yhat = as.factor(apply(grid,1,knn_pred))
levels(yhat) = c(levels(yhat),"New Observation")
df_grid = data.frame(x1=grid$x1,x2=grid$x2,y=yhat)
ggplot(df,aes(x=x1,y=x2,col=y))+
geom_point(data=df_grid,aes(x=x1,y=x2,col=y),size=0.8)+
theme(aspect.ratio=1)
```
## Binary Classification
Many problems can be reduced to binary classification (as above).
KNNs are a useful non-parametric classification tool.
*Logits* are a useful parametric classification tool.
- Remember Spam?
- Logits yield parametric decision boundaries. Easy to interpret.
- Logits are global methods. Use *all* the training data to inform predictions.
- The probability estimates are more useful and more stable.
- Logits can do variable selection.
## Credit Classification Example
```{r, echo=F}
## read data and create some `interesting' variables
credit = read.csv("~/Github/bigdata/lectures/l7/credit/credit.csv")
## re-level the credit history and checking account status
credit$history = factor(credit$history, levels=c("A30","A31","A32","A33","A34"))
levels(credit$history) = c("good","good","poor","poor","terrible")
## a few others
credit$foreign = factor(credit$foreign, levels=c("A201","A202"), labels=c("foreign","german"))
credit$rent = factor(credit$housing=="A151")
credit$purpose = factor(credit$purpose, levels=c("A40","A41","A42","A43","A44","A45","A46","A47","A48","A49","A410"))
levels(credit$purpose) = c("newcar","usedcar",rep("goods/repair",4),"edu",NA,"edu","biz","biz")
xnaref = function(x){
if(is.factor(x))
if(!is.na(levels(x)[1]))
x = factor(x,levels=c(NA,levels(x)),exclude=NULL)
x
}
naref = function(DF){
if(is.null(dim(DF))) return(xnaref(DF))
if(!is.data.frame(DF))
stop("You need to give me a data.frame or a factor")
DF = lapply(DF, xnaref)
return(as.data.frame(DF))
}
credit = naref(credit)
credx = sparse.model.matrix( ~ (duration + amount +
installment + age + history +
purpose + foreign + rent)^2, data=credit)
default = credit$Default
credscore = cv.glmnet(credx, default, family="binomial")
plot(credscore)
credx2 = model.matrix( ~ (duration + amount +
installment + age + history +
purpose + foreign + rent)^2, data=credit)
credit_mod = glmnet(credx2,default,family="binomial")
```
## Sensitivity and Specificity
But we may also want to think about sensitivity and specificity.
- Sensitivity: proportion of true $y=1$ classified as such.
- Specificity: proportion of true $y=0$ classified as such.
A rule is sensitive if it mostly gets the 1s right. A rule is specific if it mostly gets the 0s right.
## Error Types
![](../l1/fdr.png)
## ROC Curve
We can plot the ROC curve for different choices of threshold.
```{r, echo=F}
pred = predict(credscore, credx, type="response")
pred = drop(pred)
roc = function(p,y, ...){
y = factor(y)
n = length(p)
p = as.vector(p)
probs = seq(0,1,length=1001)
Q = p > matrix(rep(probs,n),ncol=length(probs),byrow=TRUE)
specificity = colMeans(!Q[y==levels(y)[1],])
sensitivity = colMeans(Q[y==levels(y)[2],])
plot(1-specificity, sensitivity, type="l", ...)
abline(a=0,b=1,lty=2,col=8)
inds = c(251,501,751)
points(1-specificity[inds],sensitivity[inds],col=2,cex=3)
}
roc(p=pred, y=default, bty="n")
```
## AUC
The AUC is a common metric for choosing between classifiers, which selects the classifier that maximizes the Area Under the Curve -- specifically the area under the ROC curve.
```{r}
roc(p=pred,y=default,bty="n")
```
## ROC & Decisions
We face a tradeoff with differential costs? Find where the slope of the ROC equates those costs.
```{r}
roc(p=pred,y=default,bty="n")
```
# Multinomial
## Multiple Options
What if we have more than one class possible?
E.g. Not just $\{0,1\}$ but $\{0,1,2,...,M\}$
## KNN Still works
We can still just ask which category is most likely in the group nearest to a point. Though our code may need to change.
(Recommended optional long Q in HW4)
## Multinomial Logit
We can also switch to a multinomial logit. At its simplest, this combines $M$ models of $P[Y=1|x]$, $P[Y=2|x]$,...,$P[Y=M|x]$, in a manageable manner.
We can estimate each of those models as a simple logit -- to get their coefficients.
Then we make sure that our probabilities sum to 1.
$$\sum_{j=1}^M P[Y=j|x] = 1$$
## Multinomial Logit
Our full model then is something different:
$$P[Y_i = k|\textbf{x}_i] = p_{ik} = \frac{e^{\textbf{x}_i'\beta_k}}{\sum_{j=1}^M e^{\textbf{x}_i'\beta_j}} $$
NOTICE there are different coefficients for each class $K$
## Multinomial Logit
This will change our likelihood slightly. Denoting by $k_i$ the class of $y_i$, we get:
$$L(\beta_1,...,\beta_M) \propto \prod_{i=1}^n p_{ik_i}$$
Which implies the deviance is
$$Dev(\beta_1,...,\beta_M) \propto -2 \sum_{i=1}^n log(p_{ik_i})$$
## Multinomial
Now we can try to minimize the deviance (standard multinomial) or do some penalized version of this (make sure you sum across all the coefficients!)
## Glass Example
We have shards of broken glass, and we want to determine what of a few common types of glass it is.
- Headlight
- Vehicle Window
- Standard Window (floating or not)
- Container (e.g. glass bottle)
- Tableware (e.g. a glass)
This is where a multinomial regression may be useful.
## Glass Example
```{r, echo=F, message=F,warning=F}
library(MASS) ## a library of example datasets
data(fgl) ## The glass data
x <- scale(fgl[,c("RI","Mg")])
gtype <- fgl$type
xfgl <- sparse.model.matrix(type~., data=fgl)[,-1]
```
```{r,message=F, warning=F}
glassfit <- cv.glmnet(xfgl, gtype, family="multinomial")
```
```{r,echo=F}
plot(glassfit) # CV error; across top avg # nonzero across classes
```
## Multinomial
```{r,echo=F,out.width="30%",out.height="80%"}
plot(glassfit$glm, xvar="lambda")
```
# Wrap up
## Things to do
HW 3 is due next Wednesday night.
See you Tuesday
## Rehash
- ROC curves are powerful tools
- Multinomial regressions are very possible
# Bye!