- Review
- LASSO
- Deviance vs MSE
- Cross-Validation
- R help docs – An explainer
- KNN
- Binomial Classification: Probabilities to Predictions
- Misclassification: Sensitivity, Specificity, ROC Curve
- Multinomial Regression
April 20th, 2021
The lasso penalty is the \(l_1\) norm of our parameters: \(pen(\beta) = \sum |\beta_j|\), which penalizes larger coefficients and non-zero coefficients. So our estimates are: \[\hat{\beta}_\lambda = \underset{\beta}{\text{argmin}}\left( Dev(\beta)+\lambda \sum_{j=1}^p |\beta_j|\right)\]
For a given model of the error terms, the form of \(Dev(\beta)\) is fixed, so we can make comparisons within the logistic or within the linear regressions, but not easily between them.
When we switch logistic \(\rightarrow\) linear, we change \(Dev\). We can still make comparisons between these models, but we need to ensure we are comparing the same thing.
For the logistic LASSO, our deviance is approximately: \[Dev_{logit}(\beta) \propto \sum_{i=1}^n [ log(1+exp(x_i'\beta))-y_ix_i'\beta]\]
While for our linear LASSO, our deviance is approximately: \[Dev_{linear}(\beta) \approx {1 \over \sigma^2} \sum_{i=1}^n (y_i-x_i'\beta)^2 \propto \widehat{MSE}\]
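To see where the logistic expression comes from: with \(p_i = \frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}\), the Bernoulli log-likelihood is \[\ell(\beta) = \sum_{i=1}^n \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right] = \sum_{i=1}^n \left[ y_i x_i'\beta - \log\!\left(1+\exp(x_i'\beta)\right) \right]\] and the deviance is \(-2\ell(\beta)\) plus a constant, which gives the logistic expression above.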
This difference comes from the different likelihoods, which encode each model’s different assumptions about the error distribution.
But there is only one true error distribution.
When deciding which model to use, we used the out-of-sample deviance to select a value of \(\lambda\) for each LASSO type. But we still need to choose between these two very different models.
That choice is still driven by out-of-sample prediction errors. We need to compare the prediction errors using the same yardstick.
So we could compare all the prediction errors using the binomial deviance, or using the MSE, or using some other loss function.
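As a sketch (x and y01 below are placeholders for a model matrix and a 0/1 response), cv.glmnet's `type.measure` argument is what sets that yardstick:

```r
library(glmnet)

# x: model matrix, y01: 0/1 response (both placeholders for your own data)
set.seed(1)
foldid <- sample(rep(1:10, length.out = nrow(x)))  # same folds for every fit

# Same lasso path, three different out-of-sample yardsticks
cv_dev <- cv.glmnet(x, y01, family = "binomial", foldid = foldid, type.measure = "deviance")
cv_mse <- cv.glmnet(x, y01, family = "binomial", foldid = foldid, type.measure = "mse")
cv_cls <- cv.glmnet(x, y01, family = "binomial", foldid = foldid, type.measure = "class")
```

Holding `foldid` fixed makes the comparison about the yardstick rather than about the random fold assignment.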
Critically (and we will see this again), cross-validation gives us a good understanding of our out-of-sample error.
We can answer questions like “When I build a model in this way, what do its out of sample errors look like?”
Which means we can also answer questions like “Which model building process has better out-of-sample predictions?”
All of this is because of the very trustworthy error distributions we get from cross-validation.
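Concretely, a fitted cv.glmnet object carries that error summary around with it (continuing the sketch above, using cv_dev):

```r
cvfit <- cv_dev       # any cv.glmnet fit will do
cvfit$lambda          # the grid of penalty weights
cvfit$cvm             # mean out-of-sample error across the folds, per lambda
cvfit$cvsd            # estimated standard error of that mean
cvfit$lambda.min      # lambda with the smallest cvm
cvfit$lambda.1se      # largest lambda whose cvm is within one SE of the minimum
```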
I’ve realized I should give you a quick overview of how the R help documents are structured.
There are three main questions I turn to the help documents with.
By far, the help documents are most useful for (1) and (2).
?cv.glmnet
There are “plot”, “predict”, and “coef” methods for cv.glmnet. That implies there are functions like coef.cv.glmnet and plot.cv.glmnet that you can look up.
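For instance, a quick sketch of how to chase those methods down:

```r
library(glmnet)

?cv.glmnet                        # the main help page
methods(class = "cv.glmnet")      # every method registered for this class
getS3method("plot", "cv.glmnet")  # print the source of the plot method
```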
Just as in our basic prediction problems, we have data with \(n\) observations \((\textbf{x}_i,y_i)\) of something.
But now \(y_i\) is qualitative rather than quantitative: membership in some category \(\{1,...,M\}\).
The basic problem then is the following: Given new observation covariates \(\textbf{x}_i^{new}\), what is the class label \(y_i^{new}\)?
The quality of any classifier can be determined by its misclassification risk: \[P[\hat{y}_i^{new} \neq y_i^{new}]\]
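In practice we estimate that risk on held-out data; a sketch, with yhat_test and y_test as placeholder vectors of predicted and true labels:

```r
mean(yhat_test != y_test)                     # estimated misclassification risk
table(predicted = yhat_test, truth = y_test)  # which kinds of errors are being made
```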
How does this differ from the basic logistic question of predicting \(P[y_i^{new}=1]\)?
Often we want a category rather than a probability, because the category is what drives our decisions.
Many decisions are not on a spectrum, so the most useful form of our predictions may not be on a spectrum either.
\(\implies\) classification.
Presuming you have no preference between different types of misclassification (a claim about your loss function), there is an optimal classifier, known as the Bayes classifier.
\[ \hat{y}_i^{new} = \underset{j \in \{1,...,M\}}{\text{argmax}} P[y_i^{new} = j |\textbf{x}_i^{new}]\]
Find the prediction which is most likely (not necessarily all that likely – e.g. \(P[\hat y = y] < 0.5\) is common). This will minimize the misclassification risk.
Unfortunately, we don’t know \(P[y_i^{new} = j \mid \textbf{x}_i^{new}]\). So the Bayes classifier is in some ways an unattainable standard.
But we can estimate it!
There are many tools for estimating \(P[y|x]\) given the training data.
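Whatever the tool, once we have estimated probabilities, turning them into labels is just a row-wise argmax. A sketch, where prob_hat is a placeholder n-by-M matrix of estimated class probabilities with columns named by class:

```r
# For each new observation, pick the class with the largest estimated probability
yhat <- colnames(prob_hat)[max.col(prob_hat, ties.method = "first")]
```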
Basic Idea: Estimate \(P[y|x]\) locally using the labels of similar observations in the training data.
KNN (K-nearest neighbors): what is the most common class among the \(K\) training observations nearest to \(x^{new}\)?
This (again) is sensitive to the scale of each covariate \(x\). So we will rescale them all by standard deviations.
This will be sensitive to \(K\), and we need to pick it.
Cross-validation here will help.
The relative ‘vote counts’ are a very crude estimate of probability.
Applying the optimal prediction scheme (the Bayes classifier) to those estimates then amounts to a majority vote among the neighbors.
Higher \(K\) leads to higher in-sample training error (the proportion classified incorrectly).
Lower \(K\) leads to higher flexibility \(\implies\) overfitting and poor OOS misclassification.
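A minimal sketch with the class package, where xtrain, xtest, and ytrain are placeholders for numeric covariate matrices and training labels; note the rescaling by training-set standard deviations and the use of leave-one-out CV (knn.cv) to compare candidate \(K\):

```r
library(class)  # provides knn() and knn.cv()

# Rescale every covariate using the *training* means and standard deviations
ctr   <- colMeans(xtrain)
sds   <- apply(xtrain, 2, sd)
xtr_s <- scale(xtrain, center = ctr, scale = sds)
xte_s <- scale(xtest,  center = ctr, scale = sds)

# Leave-one-out CV misclassification for a few candidate K
for (K in c(1, 5, 15, 51)) {
  yhat_cv <- knn.cv(train = xtr_s, cl = ytrain, k = K)
  cat("K =", K, " LOO-CV misclassification:", mean(yhat_cv != ytrain), "\n")
}

# Predictions for the new observations at a chosen K
yhat_new <- knn(train = xtr_s, test = xte_s, cl = ytrain, k = 15)
```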
Many problems can be reduced to binary classification (as above).
KNNs are a useful non-parametric classification tool.
Logits are a useful parametric classification tool.
- Remember Spam?
German loan/default data. “Predict performance of new loans” – want to predict default probability.
Going to try to use borrower and loan characteristics.
Messy Data.
This is not ‘randomly sampled’ data.
Consider your data sources carefully.
credscore = cv.glmnet(credx, default, family="binomial")
plot(credscore)
sum(coef(credscore, s="lambda.1se")!=0) # 1se
## [1] 13
sum(coef(credscore, s="lambda.min")!=0) # min
## [1] 21
There are two ways to be wrong in this binary problem.
Both mistakes are bad, but sometimes one is worse than the other. Logistic regression gives us an estimate of \(P[y=1|x]\), and the Bayes decision rule classifies purely based on that probability: when \(P[y=1|x] > 0.5\), classify as 1.
But instead of minimizing misclassification risk, we want to minimize our loss.
To make optimal decisions you need to account for probabilities and costs.
If for each loan you make $0.25 when it is repaid, and lose $1 when it is not, then the expected profit is \(0.25\,(1-P[y=1]) - 1 \cdot P[y=1]\), which is positive only when \(P[y=1] < 0.2\).
So we may want our classifier to use a different threshold.
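A sketch using the credit fit from above (credscore and credx), and assuming \(y=1\) codes a default: pull out fitted default probabilities and compare the Bayes threshold with the profit-driven one.

```r
# In-sample fitted default probabilities from the cross-validated lasso logit
p_default <- predict(credscore, newx = credx, type = "response", s = "lambda.1se")

# Bayes rule (0.5) vs. the break-even threshold (0.2) implied by the payoffs above
classify_bayes  <- as.numeric(p_default > 0.5)   # predict "default"
classify_profit <- as.numeric(p_default > 0.2)   # refuse the loan
table(bayes = classify_bayes, profit = classify_profit)
```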
Much like in lecture 1, we can think about our rate of false positives.
But we may also want to think about sensitivity and specificity.
A rule is sensitive if it mostly gets the 1s right. A rule is specific if it mostly gets the 0s right.
We can plot the ROC curve for different choices of threshold.
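A base-R sketch (assuming default is a 0/1 vector and p_default is the fitted probability from above): sweep the threshold, compute sensitivity and specificity at each value, and plot the ROC curve.

```r
# Sensitivity and specificity across a grid of thresholds
# (computed in-sample here just for illustration; out-of-sample is the honest version)
thresholds <- seq(0, 1, by = 0.01)
sens <- spec <- numeric(length(thresholds))
for (i in seq_along(thresholds)) {
  yhat <- as.numeric(p_default > thresholds[i])
  sens[i] <- mean(yhat[default == 1] == 1)   # true positive rate
  spec[i] <- mean(yhat[default == 0] == 0)   # true negative rate
}

# ROC curve: true positive rate against false positive rate
plot(1 - spec, sens, type = "l",
     xlab = "False positive rate (1 - specificity)",
     ylab = "True positive rate (sensitivity)")
abline(0, 1, lty = 2)  # 45-degree line: a classifier no better than chance
```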
HW 3 is due tomorrow night.
See you Thursday.