- Assorted Business
- Predictions
- Questions
- Quick Review
- Regression
- False Discovery Intro
- False Discovery Rate, More than you wanted to know
- Loss functions
- My prediction walkthrough
- Homework intro (if time?)
April 1st, 2021
“How many people in the US will have had at least one dose by end of day on April 30th?”
Prediction: 148 million
90% CI: [130,169] million.
Based on CDC trend data – not what I gave you
You don’t always have the best data
But you could probably still do pretty well with the data I gave. CI calibration would be tough.
The basic model is as follows:
\(Perc.OneDose = \beta_0 + \beta_1 Delivered.100k + \beta_2 Perc.TwoDose + \epsilon\)
Where \(E[\epsilon] = 0\).
We care about \(\beta_1\) or perhaps \(\beta_2\). What are they?
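To make that concrete, here is a minimal R sketch of fitting this model with `lm()`. The data frame `vax` and every number in it are invented for illustration – this is not the class data.

```r
# A minimal sketch, not the course data: simulate columns with the names used
# in the model above, then fit it with lm().  All numbers below are made up.
set.seed(1)
n <- 50
vax <- data.frame(
  Delivered.100k = runif(n, 20, 60),   # doses delivered per 100k (invented)
  Perc.TwoDose   = runif(n, 5, 20)     # percent with two doses (invented)
)
vax$Perc.OneDose <- 5 + 0.4 * vax$Delivered.100k +
  0.8 * vax$Perc.TwoDose + rnorm(n, sd = 2)   # invented coefficients + noise

fit <- lm(Perc.OneDose ~ Delivered.100k + Perc.TwoDose, data = vax)
summary(fit)   # estimates of beta_1, beta_2 with standard errors and p-values
```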
We can compare p-values, which are measures of extremity, to a pre-set threshold (\(\alpha\)) that controls our chance of a false discovery.
But with lots of variables, how do we think about things?
Both seem aggressive. Want a middle ground.
Notation Changed
We wish to test \(K\) simultaneous null hypotheses: \[H0_1, H0_2, \ldots, H0_K \] Out of the \(K\) null hypotheses, \(N_0\) are true nulls and \(N_1 = K-N_0\) are false – i.e. there is an effect.
FD Proportion = # false positives / # significant = \(\frac{FD}{R}\), where \(R\) is the number of rejections.
We can’t know this.
We can control its expectation though: False Discovery Rate, \(FDR = E[FDP]\).
If every test is run at level \(\alpha\), we have \(\alpha = E[FD/N_0]\), whereas \(FDR = E[FD/R]\).
We can find in-sample analogues (ish) of these things.
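A quick simulation sketch of that distinction (all numbers invented): each true null rejects at rate \(\alpha\), but the false discovery proportion among rejections can be much larger.

```r
# Illustrative simulation (all numbers invented): K tests, N0 true nulls,
# each tested at level alpha.
set.seed(2)
K <- 1000; N0 <- 900; N1 <- K - N0; alpha <- 0.05

one_run <- function() {
  p_null <- runif(N0)                                       # uniform under the null
  p_alt  <- pnorm(rnorm(N1, mean = 3), lower.tail = FALSE)  # one-sided tests with a real effect
  reject <- c(p_null, p_alt) <= alpha
  FD <- sum(reject[1:N0])      # false discoveries: rejections among true nulls
  R  <- max(sum(reject), 1)    # total rejections (avoid dividing by zero)
  c(FD_over_N0 = FD / N0, FDP = FD / R)
}

rowMeans(replicate(5000, one_run()))
# The first average sits near alpha; the second (the FDR) is much larger.
```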
Suppose we want to guarantee that \(FDR \leq q = 0.1\).
Benjamini + Hochberg Algorithm
1. Rank your \(K\) p-values from smallest to largest: \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(K)}\).
2. Set the p-value cutoff \(\alpha^* = \max \{p_{(k)}: p_{(k)} \leq q \frac{k}{K}\}\).
3. Reject the null for every test with a p-value at or below \(\alpha^*\).
Then \(FDR \leq q\) – assuming approximate independence between tests.
Step two there is a mess. Let's look at it closely.
\[ \alpha^* = \max \{p_{(k)}: p_{(k)} \leq q \frac{k}{K}\} \] or (because \(k,K\) are both positive) \[ \alpha^* = \max \{p_{(k)}: \frac{p_{(k)}K}{k} \leq q \} \] The secret sauce here is that \(p_{(k)}K\) is the expected number of false discoveries, under the nulls, if \(p_{(k)}\) were our rejection threshold.
Dividing by \(k\) - the number of discoveries with that threshold - gives us an estimate of the FDR.
We started with the notion that a given \(\alpha\) (p-value cutoff) can lead to a big FDR: \(\alpha \rightarrow q(\alpha)\).
BH reverses that: they fix the FDR at \(q\) and find the relevant \(\alpha\). The algorithm is the key to doing that: \(q \rightarrow \alpha^*(q)\)
FDR is not the only way to think about these risks. But it is a very solid middle ground when we have many tests.
=> Principled bounds on overall errors, while maintaining power to detect.
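Here is a sketch of the BH rule above in R, run on made-up p-values. Base R's `p.adjust()` with `method = "BH"` implements the same procedure, so it gives the same rejections.

```r
# Sketch of the BH cutoff from step two, on made-up p-values.
bh_cutoff <- function(pvals, q) {
  K <- length(pvals)
  p_sorted <- sort(pvals)
  ok <- p_sorted <= q * seq_len(K) / K   # p_(k) <= q * k / K
  if (!any(ok)) return(NA)               # nothing gets rejected
  max(p_sorted[ok])                      # alpha* = largest such p_(k)
}

set.seed(3)
pvals <- c(runif(900), pnorm(rnorm(100, mean = 3), lower.tail = FALSE))  # fake p-values
alpha_star <- bh_cutoff(pvals, q = 0.1)
sum(pvals <= alpha_star)                    # number of discoveries at FDR 10%
sum(p.adjust(pvals, method = "BH") <= 0.1)  # same count via base R's p.adjust
```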
GWAS: genome-wide association studies.
Want to find genetic markers related to disease for early prevention and monitoring.
Single-nucleotide polymorphisms (SNPs) are paired DNA locations that vary across chromosomes. The allele that occurs most often is “major” (A) and the other is “minor” (a).
Question: Which ones increase risk?
Willer et al., Nat Gen 2013 describe a meta-analysis of GWAS for cholesterol levels. We will focus on LDL cholesterol.
At each of 2.5 million SNPs, they fit a linear regression \[ E[LDL] = \alpha+\beta AF \] Where \(AF\) is allele frequency for the ‘trait increasing allele’.
2.5 million SNP locations.
=> 2.5 million tests of \(\beta = 0\)
=> 2.5 million p-values.
We reject the null hypothesis at about 4500 SNPs, of which we expect only about 5 to be false positives.
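An illustrative sketch only – tiny fake data, not the Willer et al. study – of the per-SNP regression plus BH workflow:

```r
# Illustrative sketch (fake data): one regression per SNP, keep the p-value on
# the allele-frequency coefficient, then apply BH.
set.seed(4)
n_people <- 500; n_snps <- 200
AF  <- matrix(rbinom(n_people * n_snps, 2, 0.3) / 2, n_people, n_snps)  # fake allele data
LDL <- 130 + 12 * AF[, 1] + rnorm(n_people, sd = 15)  # only SNP 1 truly matters here

pvals <- apply(AF, 2, function(af) {
  summary(lm(LDL ~ af))$coefficients["af", "Pr(>|t|)"]
})

which(p.adjust(pvals, method = "BH") <= 0.05)  # should usually flag SNP 1 and little else
```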
Suppose you have an observation coming from some distribution. What do you predict?
Some natural questions:
At its simplest, a loss function maps the truth, and your prediction, into how unhappy you are about your error.
\[L(y,\hat{y}) = ????\] The most common class of loss functions only cares about the magnitude of the error, not its location.
\[L(y,\hat{y}) = l(y-\hat{y}) = l(e)\] This is not always a reasonable simplification.
We’re defining a norm against a distribution. So we need to think about all the possible values the outcome could take.
Naturally then, we’re going to plug the loss into an expectation. That will let us make statements about our expected loss. (With some iffy notation)
\[ E[l(e)] = E[l(y-\hat{y})] = \int l(y-\hat{y}) P[y]\, dy \]
This should look very familiar.
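A small numeric sketch (with a made-up skewed outcome) of how the loss function determines the best prediction: squared loss points you to the mean, absolute loss to the median.

```r
# Made-up skewed outcome: the prediction minimizing expected loss depends on the loss.
set.seed(5)
y <- rexp(1e5, rate = 1/10)   # skewed outcomes: mean 10, median about 6.9

expected_loss <- function(yhat, l) mean(l(y - yhat))

cand <- seq(0, 30, by = 0.1)  # candidate predictions
cand[which.min(sapply(cand, expected_loss, l = function(e) e^2))]  # near mean(y)
cand[which.min(sapply(cand, expected_loss, l = abs))]              # near median(y)
```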
Within a sample, we can use a loss function to dictate our predictions.
We choose parameters to minimize
\[ L(Y,\hat{Y}) = L(Y,\hat{\alpha}+\hat{\beta} X)\]
The most common norms here are known as the \(l_p\) norms. Within sample, (and with some iffy notation) this looks like:
\[ L(Y,\hat{Y}) = \left(\frac1n \sum_{i=1}^n |Y_i-\hat{Y}_i|^p \right)^{\frac{1}{p}}\]
Notice that we’ve thrown in a symmetry statement: the absolute value means that \(l_p(e) = l_p(-e)\).
Again: This is not always a reasonable simplification.
=> These are different statements about how much we care about a tradeoff between infrequent large errors and frequent small errors.
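A sketch of the in-sample \(l_p\) loss above, showing how the choice of \(p\) trades off one huge error against many small ones; the two error patterns are invented.

```r
# Sketch of the in-sample l_p loss above on two invented error patterns.
lp_loss <- function(y, yhat, p) mean(abs(y - yhat)^p)^(1 / p)

e_small <- rep(1, 100)          # many small errors
e_big   <- c(rep(0, 99), 100)   # one huge error, the rest perfect

for (p in c(1, 2, 4)) {
  cat("p =", p,
      " many small:", round(lp_loss(e_small, 0, p), 2),
      " one big:",    round(lp_loss(e_big, 0, p), 2), "\n")
}
# Larger p penalizes the single big error far more, relative to many small ones.
```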
How many fingers do you think our dean has?
Loss functions come from the context of a situation. No generalizable advice here.
Dataset of ~13k reviews for some products, collected in 2012.
Reviews include product details, ratings, and plain text comments.
We will look for words associated with good/bad ratings.
I will now go through the code at the start, introduce you to the datasets, run some things, and comment on various features that may help you understand R and large datasets.
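This is not the course script, but a tiny synthetic sketch of the kind of analysis: test each word for association with the rating and control the FDR with BH. All names and numbers here are invented.

```r
# Synthetic sketch only: per-word rating regressions, then BH on the p-values.
set.seed(6)
n_reviews <- 1000; n_words <- 50
X <- matrix(rbinom(n_reviews * n_words, 1, 0.2), n_reviews, n_words)  # word present/absent
colnames(X) <- paste0("word", 1:n_words)
rating <- pmin(5, pmax(1, round(3 + X[, 1] - X[, 2] + rnorm(n_reviews))))  # two real signals

pvals <- apply(X, 2, function(x) summary(lm(rating ~ x))$coefficients["x", "Pr(>|t|)"])
names(which(p.adjust(pvals, method = "BH") <= 0.1))  # should flag word1 and word2
```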
Before Tuesday: