- Quick Review
- Deviance
- Bootstrap
- Out-of-Sample
- Comparing Models – A primer
- T-tests
- F-tests
- AIC/AICc/BIC
- Stepwise Regression
- Regularization and LASSO
- Cross-Validation
April 13th, 2021
Deviance refers to the distance between our fit and the data. You generally want to minimize it.
\[Dev(\beta) = -2 log(L(\beta|data))+C\] We can ignore C for now.
Deviance is useful for comparing models. It is a measure of goodness-of-fit, similar to the residual sum of squares, that applies to a broader class of models (logistic regression, etc.). It is closely related to \(R^2\):
\[ R^2 = 1-\frac{D}{D_0} = 1- \frac{Dev(\hat{\beta})}{Dev(\textbf{0})}\]
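As a quick illustration (simulated data, not the lecture's example), a fitted glm in R stores both deviances, so this \(R^2\) is one line:

set.seed(1)
x = rnorm(200)
y = rbinom(200, 1, plogis(0.5*x))
fit = glm(y ~ x, family=binomial)
1 - fit$deviance/fit$null.deviance   # Dev(beta_hat) over the null (intercept-only) deviance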
The Bootstrap: basic procedure for data with \(n\) observations: resample \(n\) observations with replacement, re-estimate your model on the resample, and repeat many times; the spread of the resampled estimates approximates the sampling variability of your estimator.
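A minimal sketch of that resampling loop in R, assuming nothing from the course data (the simulated variables, B, and the lm fit are all illustrative):

set.seed(1)
n = 100
x = rnorm(n); y = 1 + 2*x + rnorm(n)
B = 1000
betas = numeric(B)
for (b in 1:B) {
  idx = sample(1:n, n, replace=TRUE)        # resample rows with replacement
  betas[b] = coef(lm(y[idx] ~ x[idx]))[2]   # re-estimate the slope on the resample
}
sd(betas)   # bootstrap standard error for the slope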
We care about how well our model works out-of-sample.
One way to test this is to use a “validation sample”.
The difference between in-sample and out-of-sample deviance is which observations it is computed on: the data used to fit the model, or new data the model has not seen.
Example: Linear Regression
For In-Sample (IS) \(R^2\), we have observations \([\textbf{x}_1,y_1],...,[\textbf{x}_n,y_n]\), which we use to fit \(\hat{\beta}_n\). The deviance is: \[ dev_{is}(\hat{\beta}_n) = \sum_{i=1}^n (y_i - \textbf{x}_i'\hat{\beta}_n)^2 \]
For the Out-of-Sample (OOS) \(R^2\), we use the same observations \(1,...,n\) to estimate \(\hat{\beta}_n\), but the deviance is calculated on \(m\) new observations.
\[ dev_{oos}(\hat{\beta}_n) = \sum_{i=n+1}^{n+m}(y_i-\textbf{x}_i'\hat{\beta}_n)^2\]
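A hedged sketch of these two quantities in R, using simulated data and a simple train/test split (all names here are illustrative assumptions):

set.seed(1)
n = 100; m = 50
x = rnorm(n + m); y = 1 + 2*x + rnorm(n + m)
train = 1:n; test = (n+1):(n+m)
fit = lm(y ~ x, subset=train)     # beta_hat estimated on the n training observations
b = coef(fit)
dev_is  = sum((y[train] - (b[1] + b[2]*x[train]))^2)   # deviance on the fitting data
dev_oos = sum((y[test]  - (b[1] + b[2]*x[test]))^2)    # deviance on the m new observations
c(dev_is=dev_is, dev_oos=dev_oos)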
Semiconductor manufacturing: extremely complicated, no margin for error, hundreds of diagnostics.
We want to focus on diagnostics that are predictive of failure.
\(\textbf{x}_i\) is a vector of 200 diagnostic signals, \(y_i\) is a binary (Pass/Fail) for a chip batch. There are 100 failures out of 1477 batches.
Logistic regression for failure of chip batch \(i\): \[ p_i = P[fail_i|\textbf{x}_i] = exp(\alpha+\textbf{x}_i' \beta)/(1+exp(\alpha+\textbf{x}_i'\beta)) \]
Full model (non-interacted, jointly estimated) has an in-sample \(R^2 \approx 0.56\) (based on binomial deviance).
We could use FDR control to slim down our model. \(q=0.1\) yields an \(\alpha \approx 0.012\) cutoff. That leaves 25 variables, of which we expect ~2.5 to be "false" discoveries.
A ‘cut’ model using only those 25 variables has an \(R^2 \approx 0.18\).
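For concreteness, here is a hedged sketch of how that cut might be built (the data frame SC with a FAIL column is assumed from the lecture, and the signal columns are assumed numeric so coefficient names match column names; the BH-style cutoff is one standard way to get the \(\alpha \approx 0.012\) threshold):

full = glm(FAIL ~ ., data=SC, family=binomial)            # full logistic regression
pvals = summary(full)$coefficients[-1, 4]                  # p-values, dropping the intercept
q = 0.1
ranked = sort(pvals)
below = which(ranked <= q*(1:length(ranked))/length(ranked))
alpha = if (length(below) > 0) ranked[max(below)] else 0   # FDR(q) cutoff, ~0.012 here
keep = names(pvals)[pvals <= alpha]                        # the ~25 surviving signals
cut = glm(FAIL ~ ., data=SC[, c("FAIL", keep)], family=binomial)   # the 'cut' model
cut.pvals = summary(cut)$coefficients[-1, 4]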
summary(cut.pvals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## 0.000000 0.001762 0.030080 0.086693 0.092763 0.495039
But now we have a bunch of large p-values again. One question: should we repeat this process?
We care about out of sample performance. Let’s examine it.
Comparing In-Sample \(R^2\) to Out-of-Sample \(R^2\):
We may gain predictive accuracy by dropping variables.
The OOS \(R^2\) for the cut model: 0.10. (~50% of in-sample \(R^2\))
The OOS \(R^2\) for the full model: -5.87.
THIS IS NEGATIVE. Out-of-Sample, we would be much BETTER off predicting the mean than with the full model.
These OOS values were estimated with our testing/training split method, a basic form of cross-validation.
IN SAMPLE \(R^2\) MISLEADS US
Cross-validation uses OOS experiments to choose the best model. This is a big part of big data.
Selection of ‘the best’ model is at the core of big data. We are moving into new territory here.
Before getting to selection, we need to come up with a good set of candidate models to choose from.
HOW?
Suppose, for the time being, that we ignore interactions between variables and nonlinearities. We are just interested in models which include some subset of our variables. With \(p\) variables, how many models are there?
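Each variable is either in or out of the model, so there are \(2^p\) candidate subsets. With \(p=200\), as in the semiconductor example, that is roughly \(1.6\times 10^{60}\) models.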
T-tests are the source for the p-values we’ve been using for FDR control.
We could (as we did above) estimate the full joint model, or many marginal models, and use the p-values from those t-tests to choose our covariates, dropping the ones that are insignificant under either standard rejection or FDR control.
But as we saw, the p-values shift as we change models. Do we iterate?
There are other problems: the p-values are not independent, with \(p>n\) we cannot even estimate them, and some covariates are useful only in the presence (or absence) of other variables.
Is “variable significance” really what we care about?
Stepwise regression is an important historical model-building technique. Once we accept that we can't estimate every possible subset model, we ask: "what if we start with an empty model and slowly build it up?"
Stepwise Algorithm (Forward)
- Start with the null model (intercept only).
- Consider all candidate models that add one more variable to the current model.
- Move to the candidate that most improves your model selection rule.
- Repeat.
When to stop? Stop when your model selection rule prefers the current model to all candidate models. Model selection rules in a moment.
This is a useful heuristic search procedure, which reduces the problem space significantly. \(\implies\) can be easy to estimate.
There are many variants: backward elimination (start full and drop variables), bidirectional search, and different selection rules.
How do we choose?
There is a standard tool for comparing two (nested) models in linear regression. The “F-test”.
For nested linear models, the F statistic is purely a function of the two models' \(R^2\) values, the number of variables, and the sample size.
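One standard form, stated here for reference: for nested linear models with \(k_0 < k_1\) covariates (plus an intercept) fit to \(n\) observations,
\[ F = \frac{(R^2_1 - R^2_0)/(k_1-k_0)}{(1-R^2_1)/(n-k_1-1)}, \]
which is compared to an \(F_{k_1-k_0,\ n-k_1-1}\) distribution.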
You could build a procedure out of iteratively comparing two models.
But this will again mostly be selecting for variables which are "significant", rather than optimizing for our predictive loss.
Again, this is not really what we want.
What we care about is Out-of-Sample predictive performance.
It turns out (after a lot of theoretical work) that we can estimate that using Akaike’s Information Criterion. For a model \(M\) with \(k\) variables estimated to be \(\hat{\beta}_M\)
\[ AIC = 2k + Dev(\hat{\beta}_M) = 2k - 2 log(L(\hat{\beta}_M|data))\] This was in our basic model output.
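As a check on that formula, here is a tiny self-contained illustration (simulated data; for 0/1 binomial data the saturated log-likelihood is 0, so the stored deviance plus \(2k\) reproduces the reported AIC):

set.seed(1)
x = matrix(rnorm(100*3), 100, 3)
y = rbinom(100, 1, plogis(x[,1]))
fit = glm(y ~ x, family=binomial)
fit$aic                              # AIC as reported by glm
fit$deviance + 2*length(coef(fit))   # deviance + 2k gives the same number for 0/1 data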
The AIC actually estimates OOS deviance – what your deviance would be in another independent sample of size \(n\).
In-sample deviance is too small, because the model minimizes that. Some incredible theory work shows us that \(AIC \approx OOS\ Deviance\).
This approximation is good when \(\frac{n}{k}\) is very big.
In big data, that may not apply. Often \(k \approx n\) or is bigger. In that case, AIC overfits!
We want to minimize the AIC.
There are many more information criteria. We’ll cover two briefly.
AIC struggles when the number of parameters is too big.
An improved approximation is:
\[AICc = Dev(\hat{\beta}_M) + 2k \frac{n}{n-k-1}\]
Notice that when \(\frac{n}{k}\) is big, \(AICc \approx AIC\), so we will generally prefer AICc. Notice also that it is easy to calculate AICc given AIC.
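Concretely, subtracting the two definitions gives
\[ AICc = AIC + \frac{2k(k+1)}{n-k-1}, \]
so the correction is just a term you add to the reported AIC.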
The Bayesian Information Criterion (BIC) is: \[ BIC = Dev(\hat{\beta}_M) + k\ log(n)\]
This looks like AIC (swapping a \(log(n)\) for a \(2\)) but it comes from a very different place.
\(BIC \approx -log(P[M|data])\), the probability that "model \(M\) is true", in a Bayesian setting.
Generally BIC will select smaller models. Why? For any \(n \geq 8\), \(log(n) > 2\), so BIC penalizes each additional parameter more heavily than AIC does.
Imperfect Rule of Thumb:
“AIC/AICc approximate deviance, BIC approximates truth”
Stepwise regression is not choosing the globally best subset of predictors.
Remember there are \(2^k\) such models
Instead, it takes a "greedy" approach: look at the best option for the immediate step and take it. This is computationally much easier, but it isn't likely to give you the globally optimal model.
Is it computationally easy?
null = glm(FAIL~1, data=SC, family=binomial)   # intercept-only model
full = glm(FAIL~., data=SC, family=binomial)   # all diagnostic signals
stepwise = step(null, scope=formula(full), direction="forward", trace=0)   # Default criterion is AIC
length(coef(stepwise))
## [1] 63
full is the largest model you would consider, and scope is its formula. Feel free to include interactions to make it huge.
But be warned: even though stepwise is faster than enumerating every possible model, it is not necessarily 'fast'.
Subset Selection: Enumerate every possible model \(M\), estimate \(\hat{\beta}_M\) for each model, and choose the best.
\(\implies\) Completely infeasible even for moderate \(p\) like 50 (\(2^{50} > 10^{15}\) models).
Stepwise Selection: A faster heuristic. Start with some base model, find the best ‘adjacent’ model, change to that, and continue iterating.
What’s next? Regularization
The general idea: Impose restrictions on \(\hat{\beta}\) to make them stable, purposeful, and quick to estimate.
Variable selection is achieved with Estimation, not p-values.
What are our goals?
We will penalize solutions \(\hat{\beta}\) that do not meet expectations.
Using all predictors \(x_1,...,x_p\), the most likely (MLE) values \(\hat{\beta}_{MLE}\) minimize deviance.
\[ \hat{\beta}_{MLE} = \underset{\beta}{\text{argmin}}\ Dev(\beta) = \underset{\beta}{\text{argmin}} -2log(L(\beta))/n \]
But \(\hat{\beta}_{MLE}\) can only be estimated when \(p < n\), may be unstable, and won’t have exact 0s.
The penalized version of this adds a penalty \(pen()\) which depends on the parameter \(\beta\) and is scaled by some weight \(\lambda > 0\).
\[\hat{\beta}_{PMLE} = \underset{\beta}{\text{argmin}} \left(Dev(\beta) + \lambda\ pen(\beta) \right)\]
This penalty imposes a cost for solutions that don’t meet our preferences. \(\implies \hat{\beta}_{PMLE}\) is biased.
What costs should we impose? What costs do we face?
LASSO is the most commonly used regularized (or penalized) regression model. The lasso penalty is the \(l_1\) norm of our parameters: \(pen(\beta) = \sum |\beta_j|\), which penalizes larger coefficients and non-zero coefficients. So our estimates are: \[\hat{\beta}_\lambda = \underset{\beta}{\text{argmin}}\left( Dev(\beta)+\lambda \sum_{j=1}^p |\beta_j|\right)\]
For linear regression this simplifies dramatically:
\[\hat{\beta}_\lambda = \underset{\beta}{\text{argmin}}\left(\sum_{i=1}^n (y_i-\textbf{x}_i'\beta)^2+\lambda \hat\sigma^2\sum_{j=1}^p |\beta_j|\right)\]
LASSO gives sparse solutions. It imposes a ‘spiky’ penalty for coefficients. The ‘spike’ at 0 yields a preference for coefficients that are 0. Non-zero coefficients need to be sufficiently informative to pay the penalty.
Even though the deviance (the other part of our minimization objective) is smooth, a smooth function plus a spiky one can be spiky, and the minimum can be at that spike.
The LASSO ‘shrinks’ some coefficients to 0.
\(\implies\) Automatic variable selection.
This is why LASSO is sometimes referred to as 'modern least squares'.
But LASSO also shrinks the parameters that wind up being non-zero.
In particular, it can be shown that for a given \(\lambda\),
\[\hat{\beta}_{j,\lambda} = \hat{\beta}_{j,OLS}\ \text{max}\left(0,1-\frac{n\lambda}{|\hat{\beta}_{j,OLS}|}\right)\] So when \(\hat\beta_{j,\lambda}\) is non-zero, it is \(\approx n\lambda\) closer to 0 than \(\hat\beta_{j,OLS}\).
This, as well as the zeroing out of coefficients, induces the bias.
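A tiny numeric illustration of the shrinkage formula above (the function and numbers are made up for illustration, not from the lecture):

soft_threshold = function(b_ols, n, lambda) b_ols * pmax(0, 1 - n*lambda/abs(b_ols))
soft_threshold(c(-2, -0.05, 0.01, 1.5), n=100, lambda=0.005)
# the small coefficients are set exactly to 0; the larger ones are pulled n*lambda = 0.5 toward 0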
Very easy to estimate in R, thanks to numerous useful packages.
library(glmnet)
mod = glmnet(as.matrix(SC[,-1]), SC$FAIL, family=binomial)   # ~1 second
sum(coef(mod, s=0.01) != 0)   # s = 0.01 is setting lambda
## [1] 53
So 147 coefficients are 0 with \(\lambda=0.01\).
We chose the penalty to be the sum of absolute values of \(\beta\). But there are other choices.
LASSO finds \(\hat{\beta}_\lambda\) which minimizes \(Dev(\beta)+\lambda\sum|\beta|\). How do we pick \(\lambda\)?
R will estimate a sequence of \(\lambda_1 < \lambda_2 < ... < \lambda_T\). We can apply model selection tools to choose the best \(\hat\lambda\)
LASSO path estimation: R does the following:
- Start with \(\lambda\) large enough that all coefficients are exactly zero.
- Step to the next, slightly smaller \(\lambda\) and update the estimate \(\hat\beta_\lambda\), using the previous solution as a warm start.
- Continue through the whole sequence of \(\lambda\) values.
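You can inspect the resulting path directly on the fitted object (assuming mod is the glmnet fit from above):

length(mod$lambda)   # number of lambda values on the path
dim(coef(mod))       # one column of coefficients per lambda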
Each lambda has a different set of coefficients. There is a classic plot illustrating this:
plot(mod)
That plot was against the sum of the absolute coefficients (the \(l_1\) norm); we can also plot against \(\lambda\):
plot(mod,xvar="lambda")
Or we can plot against the share of deviance explained (the pseudo-\(R^2\)):
plot(mod,xvar="dev")
Covariate scale matters here. This is because \(x'\beta\) has the same effect in predictions as \(2x'\beta/2\), but \(|\beta|\) is twice as large a penalty as \(|\beta/2|\).
We can recenter and rescale our covariates so that they all are mean 0 and standard deviation 1. This will eliminate any problems here.
Most major stats packages will do this for you.
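For example, glmnet standardizes the covariates internally by default and reports coefficients back on the original scale; the earlier call already relied on this, but you can make it explicit:

mod = glmnet(as.matrix(SC[,-1]), SC$FAIL, family=binomial, standardize=TRUE)   # TRUE is the default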
We can now estimate an entire path of coefficients for every value of \(\lambda\). But we still need to choose one.
Think of \(\lambda\) as a signal-to-noise filter.
If there is lots of signal in your data, you want a low \(\lambda\). If there is lots of noise, you want high \(\lambda\).
We will use Cross Validation to pick the best \(\lambda\), but you could also use some information criterion.
This works well because the path algorithms are relatively stable and fast. We can quickly enumerate many candidate models, and the one that ‘looks best’ is going to be quite near the one that is best.
Testing frameworks ask: is this model true?
We want to know: what is my best guess?
None of your models will be ‘true’ for complicated high dimensional systems. We just want a model that makes good predictions.
Parsimony, or Occam's Razor: if two models do similarly well on unseen data, choose the simpler one.
The goal is to find the sweet spot.
The recipe:
1. Enumerate a manageable set of candidate models.
2. Estimate how well each candidate predicts out-of-sample, and choose the best one.
LASSO path algorithms give us (1)
OOS prediction error methods give us (2)
In particular, K-FOLD CROSS VALIDATION
We need to estimate the prediction accuracy on unseen future data. We’ve seen that we can split our sample into a ‘test’ and ‘training’ data sets to do this.
But this is inefficient. If your test data has 10% of the observations:
- you only use 90% of the data to fit the model, and
- your estimate of OOS error comes from a single, small sample, so it is noisy.
K-Fold Cross-validation systematically improves on this.
We could, like the bootstrap, repeat our testing-training procedure many times. But we want to guarantee each observation gets used as an ‘out-of-sample’ observation at least once.
Instead, we will "fold" the data. We will partition it (split it into exclusive and exhaustive groups) into \(K\) different groups of observations. Then, for \(k=1:K\):
- fit the model using all of the data except fold \(k\), and
- record the OOS error (deviance) on fold \(k\).
This guarantees that each observation is left out once, and improves the performance of our routine.
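A hand-rolled sketch of that K-fold loop, for intuition only (SC and FAIL are assumed from the lecture; for the LASSO we will simply use cv.glmnet below):

K = 10
n = nrow(SC)
fold = sample(rep(1:K, length.out=n))     # random, exclusive and exhaustive partition
oos_dev = numeric(K)
for (k in 1:K) {
  fit = glm(FAIL ~ ., data=SC[fold != k, ], family=binomial)     # fit without fold k
  p = predict(fit, newdata=SC[fold == k, ], type="response")
  p = pmin(pmax(p, 1e-6), 1 - 1e-6)       # guard against log(0)
  y = SC$FAIL[fold == k]
  oos_dev[k] = -2*sum(y*log(p) + (1 - y)*log(1 - p))   # binomial deviance on fold k
}
mean(oos_dev)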
There are several options for choosing \(K\):
Most people set \(K \in [5,20]\). I’ll mostly use 10.
\(\implies\) Optimizing \(K\) is very 3rd order. Not worth worrying about too much beyond time considerations and some preference for larger \(K\).
We have a LASSO path indexed by \(\lambda_1 < \lambda_2 < ... < \lambda_T\).
Cross-Validation for \(\lambda\):
For each of \(k=1,...,K\) folds:
- fit the path of coefficients \(\hat\beta_{\lambda_t}\) using all of the data except fold \(k\);
- compute the deviance on fold \(k\) for each \(\lambda_t\).
This gives us \(K\) draws of the OOS deviance for each \(\lambda_t\).
Choose the best \(\hat\lambda\), and fit your model to all the data with that \(\hat\lambda\).
Again, Cross-validation is very easy and relatively fast for LASSO in R.
mod = cv.glmnet(as.matrix(SC[,-1]),SC$FAIL,family=binomial)
And there is a nice plot for it too
plot(mod)
Once we've done the cross-validation, it's easy to pick out our coefficients.
sum(coef(mod,s="lambda.1se") != 0)
## [1] 1
sum(coef(mod,s="lambda.min") != 0)
## [1] 23
"Min" vs "1se" reflects a concern about overfitting: lambda.min minimizes the estimated OOS deviance, while lambda.1se gives the smallest (most heavily penalized) model whose estimated OOS performance is within one standard error of that minimum.
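The two penalty choices are stored on the cv.glmnet object:

mod$lambda.min   # lambda minimizing the estimated OOS deviance
mod$lambda.1se   # largest lambda within one standard error of that minimum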
That plot wasn't the loveliest. Here is one from a new example using Comscore data, which records consumer spending on websites.
plot(cv.spender)
Before Wednesday:
Before Thursday: