## Benjamini + Hochberg Algorithm

1. Rank your p-values smallest to largest.
2. Set the p-value cutoff as $\alpha^* = \max\{p_{(k)}: p_{(k)} \leq q \frac{k}{K}\}$.

Then $FDR \leq q$, assuming approximate independence between tests.

## Rewriting that {.smaller}

Step two there is a mess. Let's look at it closely.

$$ \alpha^* = \max \{p_{(k)}: p_{(k)} \leq q \frac{k}{K}\} $$

or (because $k, K$ are both positive)

$$ \alpha^* = \max \{p_{(k)}: \frac{p_{(k)}K}{k} \leq q \} $$

The secret sauce here is that $p_{(k)}K$ is the expected number of false discoveries under the nulls if $p_{(k)}$ were our rejection threshold. Dividing by $k$, the number of discoveries at that threshold, gives us an estimate of the FDR.

## Understanding BH

![](sims/BH.png){width=110%}
- Under the null, p-values lie along the grey line (slope $1/K$).
- Our rejection threshold is set by the green line (slope $q/K$); we reject values under it.

## FDR Roundup
We started with the notion that a given $\alpha$ (p-value cutoff) can lead to a big FDR: $\alpha \rightarrow q(\alpha)$.
BH reverses that: fix the FDR at $q$, then find the corresponding $\alpha$. The algorithm is the key to doing that: $q \rightarrow \alpha^*(q)$.
FDR is not the only way to think about these risks, but it is a very solid middle ground when we have many tests.
=> Principled bounds on overall errors, while maintaining power to detect.
## Example: multiple testing in GWAS
GWAS: genome-wide association studies.
Want to find genetic markers related to disease for early prevention and monitoring.
Single-nucleotide polymorphisms (SNPs) are single base-pair locations in the DNA that vary across people. The allele that occurs most often is the "major" allele (A) and the other is the "minor" allele (a).
Question: Which ones increase risk?
## Cholesterol
Willer et al. (Nat Gen, 2013) describe a meta-analysis of GWAS for cholesterol levels. We will focus on LDL cholesterol.
At each of 2.5 million SNPs, they fit a linear regression
$$ E[LDL] = \alpha+\beta AF $$
where $AF$ is the allele frequency of the 'trait-increasing allele'.
2.5 million SNP locations.
=> 2.5 million tests of $\beta = 0$
=> 2.5 million p-values.
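In miniature, each of those tests is just a regression fit and a p-value. A sketch with simulated data (`af` and `ldl` here are made up for illustration, not the Willer et al. data):

```{r}
set.seed(1)
af  <- runif(500)                        # hypothetical allele frequencies
ldl <- 120 + 5 * af + rnorm(500, 0, 20)  # simulated LDL with a small effect
fit <- lm(ldl ~ af)
summary(fit)$coefficients["af", "Pr(>|t|)"]  # p-value for the test of beta = 0
```

Repeating this at every SNP location is what produces the 2.5 million p-values.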
## All the p-values
![](lipids/lipids_hist.png){height=80%}
## BH plot (log-log)
![](lipids/lipids_bh.png){width=100%}
We reject the null hypothesis at about 4,500 SNPs, and we expect about 5 of those to be false positives.

## BH Roundup
- p-values from the null distribution are uniform, and should lie along the $1/K$ line if there are $K$ of them.
- FDP is the number of false discoveries divided by the number of rejections. We **can't** know it.
- $FDR = E[FDP]$ we *can* control, though:
  - Fix it to be $\leq q$ for $K$ tests.
  - Rank and plot p-values against rank$/K$.
  - Draw a line with slope $q/K$.
  - Reject p-values under the line.
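The whole recipe fits in a few lines of R. A minimal sketch for intuition only; base R's `p.adjust(p, method = "BH")` is the standard tool in practice:

```{r}
# Find the BH cutoff alpha* for FDR level q
bh_cutoff <- function(p, q) {
  K <- length(p)
  p_sorted <- sort(p)                        # rank p-values smallest to largest
  below <- which(p_sorted <= q * (1:K) / K)  # p_(k) <= q * k / K
  if (length(below) == 0) return(0)          # nothing to reject
  p_sorted[max(below)]                       # alpha* = largest such p_(k)
}

set.seed(1)
p <- c(runif(950), rbeta(50, 1, 50))  # mostly nulls, plus a few small 'signal' p-values
alpha_star <- bh_cutoff(p, q = 0.1)
sum(p <= alpha_star)                  # number of discoveries
```

Rejections under this cutoff match `which(p.adjust(p, method = "BH") <= 0.1)`.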
# Loss
## Predictions
Suppose you have an observation coming from some distribution. What do you predict?
```{r, echo=FALSE}
set.seed(120)
s <- rpois(10000, 6)
# freq = FALSE plots density, matching the y-axis label
hist(s, main = "", xlab = "Value", ylab = "Density", freq = FALSE, breaks = 20)
```
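A few natural summaries of that draw can be computed directly (a sketch; base R has no built-in mode function, so we tabulate):

```{r}
set.seed(120)
s <- rpois(10000, 6)                    # same draw as the histogram above
mean(s)                                 # the average value
median(s)                               # the middle value
as.numeric(names(which.max(table(s))))  # the most frequent value (the mode)
```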
## Some typical choices
- Mean
- Median
- Mode
## Some typical choices
- Mean
- Median
- Mode
Some natural questions:
> 1. What if we're playing 'Price Is Right' rules for your prediction?
> 2. What would motivate a choice of median over mean or mode?
## Formalizing loss
At its simplest, a loss function maps the truth and your prediction into how unhappy you are about your error.
$$L(y,\hat{y}) = ????$$
The most common class of loss functions only cares about the magnitude of the error, not its location.
$$L(y,\hat{y}) = l(y-\hat{y}) = l(e)$$
This is not always a reasonable simplification.
## Using loss
*We're defining a norm against a distribution.* So we need to think about all the possible values the outcome could take.
Naturally then, we're going to plug the loss into an expectation. That will let us make statements about our expected loss. (With some iffy notation)
$$ E[l(e)] = E[l(y-\hat{y})] = \int l(y-\hat{y})\, P[y]\, dy $$
This should look very familiar.
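That expectation is easy to approximate by simulation: draw many outcomes and average the loss at a given prediction. A sketch, reusing the Poisson(6) example from earlier:

```{r}
set.seed(120)
y <- rpois(100000, 6)

# Monte Carlo estimate of E[l(y - yhat)], here with squared-error loss
expected_loss <- function(yhat, y, l = function(e) e^2) mean(l(y - yhat))

expected_loss(6, y)   # near the mean: low expected loss
expected_loss(10, y)  # far from the mean: much higher expected loss
```

Swapping in `l = abs` or `l = function(e) e != 0` gives the $l_1$ and $l_0$ analogues.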
## Insample
Within a sample, we can use a loss function to dictate our predictions.
We choose parameters to minimize
$$ L(Y,\hat{Y}) = L(Y,\hat{\alpha}+\hat{\beta} X)$$
## $l_p$ norm
The most common norms here are known as the $l_p$ norms. Within sample, (and with some iffy notation) this looks like:
$$ L(Y,\hat{Y}) = \left(\frac1n \sum_{i=1}^n |Y_i-\hat{Y}_i|^p \right)^{\frac{1}{p}}$$
Notice, we've thrown in a symmetry statement. The absolute value means that $l_p(e) = l_p(-e)$.
Again: This is not always a reasonable simplification.
## Back to typical answers:
- Mean: corresponds to the answer with lowest expected $l_2$ loss.
  - AKA: $\min \sqrt{\frac1n \sum_i (Y_i-\hat{Y}_i)^2}$ is the RMSE.
- Median: the answer with lowest $l_1$ loss.
  - AKA: $\min \frac1n \sum_i |Y_i-\hat{Y}_i|$ is the MAD.
- Mode: the answer with lowest $l_0$ loss.
  - AKA: $\min \frac1n \sum_i 1(Y_i\neq \hat{Y}_i)$ wants exact predictions only.
=> These are different statements about how much we care about the tradeoff between infrequent large errors and frequent small errors.
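These claims are easy to check numerically: scan candidate predictions and see which one minimizes each empirical loss. A sketch on the Poisson(6) draw from before:

```{r}
set.seed(120)
y <- rpois(10000, 6)
yhat <- 0:20  # candidate integer predictions

l2 <- sapply(yhat, function(h) sqrt(mean((y - h)^2)))  # RMSE
l1 <- sapply(yhat, function(h) mean(abs(y - h)))       # mean absolute error
l0 <- sapply(yhat, function(h) mean(y != h))           # miss rate

yhat[which.min(l2)]  # the integer nearest the mean
yhat[which.min(l1)]  # the median
yhat[which.min(l0)]  # the mode
```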
## How do we choose?
How many fingers do you think our dean has?
> - Would you guess 10? Median/Mode?
> - Would you guess 9.9? Mean?
> - How do *you* choose?
> - What if we were competing to be closest?
> - What if it was a random lumberjack?
## Loss function last thoughts _for now_
Loss functions come from the context of a situation. No generalizable advice here.
- Standard loss functions lean on symmetry and location-indifference type assumptions that may not be reasonable.
- Very important for making actionable predictions
- And important to bake in very early
- They drive every choice of statistic
- Price-is-right rules, and competitions more generally, are going to screw with this.
- "Winners curse"
# Homework Introduction
## Week 1 HW
![](amazon.jpg){width=60%}
Dataset of ~13k reviews for some products, collected in 2012.
Reviews include product details, ratings, and plain text comments.
We will look for words associated with good/bad ratings.
## Assignment online
I will now go through the code at the start, introduce you to the datasets, run some things, and comment on various features that may help you understand R and large datasets.
# Wrap up
## Things to do
Before Tuesday:
- Homework
## Rehash
- False Discovery Rates can be controlled
- Understanding our loss function is critical
- You have homework
# Bye!