```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## A Plea
You are not required to keep your cameras on.
Please do. It helps me immensely to be able to see how people respond as I say things.
If 15% of you are lost, I want to know about it, but with only 20 cameras on, I have to spot confusion on just three of those faces.
More generally though, please, jump in with questions when you have them. Use the chat if you prefer.
## I'm here to teach you.
- If you have questions,
- If you have trouble,
- If you have feedback for me,
- If you have ideas for improving the course,
Please let me know.
## Today's Class
1. Introduction
- Goals
- Material
- Syllabus
2. What is Big Data?
3. Computation: R, Big Data
4. Data Viz, Statistics, Dimension Reduction
5. Testing: False Discovery
# Big Data
## Introduction
Big data **is not** an end in itself.
We are doing Inference at LARGE scale!
The goal here is to learn what we can trust, how it can be used, and how to learn more.
We want to help you make good decisions with lots of messy data.
By necessity then, this class will be a mix of theory and practice.
## Theory and Practice
You need a solid foundation in statistical principles.
- We don't want to shoot ourselves in the foot
You also need a hefty dose of 'rules of thumb'.
- Naive procedures can be wildly inefficient
- **BUT** we don't want to reinvent the wheel.
This is hands on work.
## What is Big Data?
There are a lot of names out there for a large cluster of very similar disciplines:
- Econometrics
- Data Science
- Big Data
- Statistics
- Datamining
- Machine Learning
There are differences in focus and style here. But the similarities are larger than the distinctions.
## What is Big Data?
Big Data, as a name, originates with computer scientists working with data too large for any single computer:
"Big Data is one GB larger than my RAM".
But often it is associated with administrative data, where statistical notions of sampling error may start to fall apart.
More generally, it is in a nexus where a fairly tight connection between inference and prediction can start to break down.
## What is Big Data?
Big Data is focused on extracting useful truth from large datasets.
- Infer patterns in high dimensional data
- Simple and scalable algorithms
- Honest and humble model selection
- Manage conflict between "useful" and "true"
- Make some **decision**
## What is Big Data?
Big in number of observations (size $n$).
Big in number of variables (dimension $p$).
In these settings you *cannot*:
- Look at each variable and make a decision (t-tests).
- Choose from a small set of nested models (F-tests).
- Plot every variable to look for interactions and transformations.
Some of our tools are straight out of Stats 101 (regression, confidence intervals), some are close relatives of Stats 101 (PCA, MSE), some are new beasts (trees, Bagging).
## Why should you care?
> - Extremely employable.
> - ![Monster.com Job postings for "Big Data" last Friday](Monster Screenshot.png){width=90%}
## Why should you care? {}
- Extremely employable
- Avoid being hoodwinked by nutjobs with numbers.
> - Avoid hoodwinking yourself with numbers. *They want to lie to you*.
> - It's fun. Genuinely.
## Does data really want to lie to you? | YES {.smaller}
![Influence of Researcher Decisions can reverse results](Data wants to lie.png){width=70%}
# Admin
## Course Schedule
subject to change...
1. Data: Computing, plotting, principles. False Discovery. Loss.
2. Regression: Everything in two days. Linear and Logistic.
3. Model Selection: Penalties, information criteria, cross-validation
4. Treatment Effects: High Dimensional Controls, AB testing, Bootstrap.
5. Classification: Multinomials, KNN, sensitivity/specificity
6. Trees: CART, Random Forests, Ensembles
7. Networks?: Co-occurrence, directed graphs
8. Clustering?: k-means, mixture models, association rules
9. Factors?: Latent variables, PCA, PLS
## Big Data Team
TA: Yuxiao Li -- Booth PhD. She will primarily be grading your work.
Me: Booth PhD. First (and last) time teaching this course.
Office hours: Happy to meet with you. For now, email me if you want to meet. If that proves overwhelming for me, I may set formal office hours.
Structure: Today I'll see how much I get through; Thursday will cover whatever we don't get to, along with discussion of homework 1 (a group assignment due next week), and more.
## Course Design
This is also a good time to note that this course is based closely on the course of the same name taught by Professor Veronika Rockova.
But some things will be different.
This is the first time this course has been offered to undergrads, so we don't know what pace makes sense. Feedback is welcome.
## Textbooks
The primary text for this course is "Elements of Statistical Learning" by Hastie et al.
It is recommended as an excellent reference going forward.
PDFs are (legally) available for free at https://web.stanford.edu/~hastie/Papers/ESLII.pdf and on my website.
I will also try to recommend other textbooks as we go along for specific topics. If there is a topic you're interested in that I haven't recommended a textbook for, email me about it.
## Homeworks {.smaller}
Homeworks will be done in small groups of 3. Groups may not be smaller than 3.
There will be ~1 homework a week, graded from 1-5.
1. Oops
2. Eh
3. Fine
4. Good. -- Typical score for homework meeting my expectations. (Nothing wrong with it)
5. Nailed it. -- Somehow you impressed the TA.
I don't expect there to be many 5s. Literally nothing is wrong with a 4.
There will also be a prediction competition.
## Other Grade Components
1. Participation NEW -- I care a lot about asking and answering questions on the Canvas discussion board.
2. Midterm -- possible; I'm curious how you all feel about a 20-question multiple choice/numeric answer midterm.
3. Take home final
## Analysis
We will be working with real data. A core promise of big data is the ability to make good predictions from it.
To that end, we will also run a low-key prediction competition. For now, there is one thing to predict.
If there is interest in the prediction competition, we will have more questions.
First prediction assignment: US covid vaccination predictions.
## Details
"How many people in the US will have had at least one dose by end of day on April 30th?"
- Predictions due before class on Thursday.
- Full details in Assignment on website
- Along with 3 CDC datasets over time.
- You only need to provide 3 numbers:
- A prediction
- A 90% CI lower bound
- A 90% CI upper bound
- Enter on Canvas for full credit
- Enter on Google form to compete against me
- And register interest in more questions.
# Break
# Computation
## R
All of our analysis will be in R.
This is the premier data analysis platform. Other platforms are better at some things, but none is better at everything.
It's free, open source, cross-platform, and very capable.
## R
Used by academics (in statistics, marketing, finance, genetics, etc.), companies (eBay, Google, Microsoft, Boeing, Citadel, IBM, etc.), and governments (RAND, DOE Labs, US Navy, etc.).
R's big strength is that it is open source and has a large community contributing packages. E.g. it is very easy to install a package to do LASSO or other fancy regression tools.
But it can be unpolished. The non-standard packages are as varied as their contributors. Some are super specific, some are super general. Many are useless.
R is not flawless. But it has a large community that will help you overcome flaws.
I **strongly** recommend using the RStudio IDE.
(This lecture, your homeworks, and much else in this course were made with it.)
## R
The big barrier to entry is the command line interface.
- You type a command, R follows the command.
This can mean quite the learning curve. But it is well worth it.
- Once you write code for something, it becomes easy to repeat that operation.
- This is a critical skill for using modern statistical tools.
I will post all code online. I strongly recommend you look at my solutions to HWs, etc., if you run into trouble.
## R -- Introduction {.smaller}
Open R (or RStudio). Just like any other program.
In some sense, R is a jumped-up calculator (e.g. `*`, `/`, `+`, `-`, `^`, `%%`).
If you want to save a number, or anything else, give it a name.
```{r}
A = 2
```
If you want to see what a variable is, just type the name:
```{r}
A
```
## R -- Vectors {.smaller}
We can build vectors with `c()` or "`:`"
```{r}
1:5
B = c(1,5,3)
B
```
## R -- Vectors {.smaller}
We can grab subsets of vectors (and other groupings) with `[ ]`.
```{r}
B[2]
```
And then do math with them. ("`;`" ends a command, so two can share one line.)
```{r}
D = A+B[2]; D
```
Sometimes it's not clear what you want, and R takes a guess:
```{r}
E = A+B
```
## R -- Errors {.smaller}
```{r}
E
```
And sometimes you ask it to do something it can't do. So it tells you it can't do that, and tries to explain itself.
```{r, error=T}
G = A+b[2]
```
If something isn't working, googling often helps. (e.g. "R error object not found")
## R -- CSVs
We will usually work with data in .csv files. They are human-readable and easy to move between Excel and other programs.
![CSV examples](csv example.png){width=100%}
## R -- Loading Data
We can load the data into R using the `read.csv` command. It is good practice to set a working directory first.
```{r}
setwd("~/Github/bigdata/predictions/")
mar3 = read.csv("us_covid_vaccinations_mar3.csv",skip=3)
```
Now we can play with (some of) the data.
## R -- Dataframes
`mar3` is a dataframe. We can index columns with names as well as their position. Dataframes also improve on matrices by allowing different variable types in each column.
```{r, eval=F}
mar3[1,] # The first row
mar3[1:10,] # 10 rows
mar3[,2] # The second column
mar3$Total.Doses.Delivered # The second column, by name
mar3[mar3$Total.Doses.Delivered > 1000,] #Rows w/ >1k doses delivered
nrow(mar3)
ncol(mar3)
dim(mar3)
colnames(mar3)
```
## R -- Variable types
R stores different types of data differently.
- Numeric data are just numbers.
- Though they can take several types (integer, float, etc)
- Factor data take a small number of values.
- Think vaccine types, or car models.
- logical/boolean values are either `TRUE` or `FALSE`
- Characters and strings are words or letters.
- "hello". "Alaska". "aagta". "September 1st". "19".
We have many tools for converting between types and identifying types. `is.numeric`, `factor()`, `levels()`.
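A quick sketch of these tools on a toy vector (the values here are made up for illustration):

```{r}
x = c("12", "7", "Data not available")  # characters, as often read from a messy CSV
is.numeric(x)                           # FALSE: these are strings, not numbers
as.numeric(x)                           # 12, 7, NA -- non-numeric text becomes NA, with a warning
f = factor(c("Pfizer", "Moderna", "Pfizer"))
levels(f)                               # the distinct values: "Moderna" "Pfizer"
```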
## R -- Misc
Functions look like `f(arg1, arg2, ...)`
and are incredibly useful, e.g. for creating new variables:
```{r, eval=F}
mar3$log.doses = log(mar3$Total.Doses.Delivered)
```
Finding out how a function works is simple: `?log` will take you to the help page.
## R -- Plotting
Is straightforward for simple plots, but rapidly gets complicated.
```{r, warning=F}
plot(mar3[,10]~mar3[,14])
```
## R -- Plotting
Histograms.
```{r, error=T}
hist(mar3$Doses.Delivered.per.100K)
```
```{r}
typeof(mar3$Doses.Delivered.per.100K)
head(mar3$Doses.Delivered.per.100K)
```
## R -- Plotting
```{r, warning=F}
mar3$Doses.Delivered.per.100K =
as.numeric(mar3$Doses.Delivered.per.100K)
hist(mar3$Doses.Delivered.per.100K)
```
## R -- Terrible Colnames
```{r}
colnames(mar3)[10:15]
colnames(mar3) = c("State","Delivered","Delivered.100k",
"X18.Delivered.100k","Administered","Administered.100k",
"X18.Administered","X18.Administered.100k","OneDose",
"Perc.OneDose","X18.OneDose","X18.Perc.OneDose",
"TwoDose","Perc.TwoDose","X18.TwoDose","X18.Perc.TwoDose")
```
## R -- Plotting - In color?
```{r, echo=F, warning=F}
mar3$Perc.OneDose = as.numeric(mar3$Perc.OneDose)
mar3$Perc.TwoDose = as.numeric(mar3$Perc.TwoDose)
mar3$Delivered.100k = as.numeric(mar3$Delivered.100k)
```
```{r}
plot(mar3$Delivered.100k/1000,mar3$Perc.OneDose,
col=(mar3$Perc.TwoDose > 8.2)+1,
main = "One Dosed vs Delivered Doses per 100 people")
```
## R -- Scatterplots
Scatterplots are fundamental to statistics. If you can show a compelling scatterplot, you can convince people of anything.
In part, this is because they can contain nearly as many dimensions of information as we can interpret. We tend to max out at 3-4 dimensions.
Good visualization depends on reducing the dimensions to an interpretable level.
## R -- Tidyverse {.smaller}
```{r, message=F, warning=F}
library(tidyverse)
ggplot(mar3,aes(x=Delivered.100k/1000,y = Perc.OneDose,col=Perc.TwoDose)) + geom_point()
```
# Statistics
## R -- Regression {.smaller}
```{r}
fit = lm(Perc.OneDose~Delivered.100k+Perc.TwoDose, data=mar3)
summary(fit)
```
## Regression
The basic model is as follows:
$$E[\text{Perc.OneDose}] = \beta_0 + \beta_1 \text{Delivered.100k} + \beta_2 \text{Perc.TwoDose}$$
Or, similarly:
$$\text{Perc.OneDose} = \beta_0 + \beta_1 \text{Delivered.100k} + \beta_2 \text{Perc.TwoDose} + \epsilon$$
where $E[\epsilon] = 0$.
## Hypothesis Testing
Is "Delivered.100k" important?
$H_0$: $\beta_1 = 0$.
```{r}
summary(fit)
```
## A Single Test
The "p-value" is the probability of observing an event at least as extreme as the observed event, assuming that the null hypothesis is true.
**Testing Procedure**
Choose a cutoff $\alpha$. Conclude that there is a real effect if $p < \alpha$.
This procedure tries to give only $\alpha$ probability of a false positive.
$$P(p < \alpha \mid H_0) = \alpha$$
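This guarantee can be checked by simulation (toy data, not the course datasets): under the null, p-values are uniform, so the rejection rate matches $\alpha$.

```{r}
set.seed(4)
# 10,000 t-tests where H0 (mean zero) is actually true
pvals = replicate(10000, t.test(rnorm(20))$p.value)
mean(pvals < 0.05)  # close to 0.05, as promised
```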
## Challenges
Big Data situations pose two different problems for p-values.
1. With a big enough sample (i.e. large $n$), we wind up rejecting the null even for effects of no practical significance, which have a small but nonzero $\beta$.
2. With many parameters (i.e. large $p$), we are performing many tests, and need to account for the number of tests we perform.
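Problem 1 shows up in a small simulation (made-up data, not the course datasets): the same negligible slope is invisible at $n = 100$ but overwhelmingly significant at $n = 10^6$.

```{r}
set.seed(1)
pval_for_n = function(n, beta = 0.01) {  # beta is tiny: no practical significance
  x = rnorm(n)
  y = beta * x + rnorm(n)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
}
pval_for_n(100)      # typically far from significant
pval_for_n(1000000)  # essentially always significant, same tiny effect
```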
## Large Scale Testing
We wish to test $p$ simultaneous null hypotheses:
$$H0_1, H0_2, \ldots, H0_p$$
Out of the $p$ null hypotheses, $N_0$ are true nulls and $N_1 = p - N_0$ are false -- i.e. there is an effect.
![](fdr.png){width=80%}
## Multiplicity
$\alpha$ is the error rate for a single test. If we perform $p$ tests, we expect $p\alpha$ significant results even when all the null hypotheses are true.
Suppose we have 100 regression coefficients, of which 5 are true effects. Now suppose we test all 100 coefficients at $\alpha = 0.05$: all 5 true effects come up (if we're lucky), as well as $95\alpha \approx 5$ more. So of the significant variables, roughly 50% are fake.
This is the "false discovery proportion".
It depends closely on $\alpha$, the underlying truth, and how good our methods are. But it can be really big.
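The arithmetic above can be checked by simulation. This sketch uses made-up z-statistics (95 true nulls, 5 real effects) and typically lands near a 50% false discovery proportion:

```{r}
set.seed(2)
z = c(rnorm(95, mean = 0), rnorm(5, mean = 5))  # 95 true nulls, 5 real effects
pvals = 2 * pnorm(-abs(z))                      # two-sided p-values
sig = pvals < 0.05
sum(sig)                    # total "discoveries"
sum(sig[1:95]) / sum(sig)   # false discovery proportion
```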
## False Discovery Rate
Big data is about making choices.
Instead of single tests, we'll consider:
FD Proportion = False positives / Significant = $\frac{FD}{R}$
FDP is a property of the fitted model and the truth. We don't know it.
We can control its expectation though:
False Discovery Rate, $FDR = E[FDP]$.
This is the multiple-testing analogue of $\alpha$.
If all tests are tested at $\alpha$ level, we have $\alpha = E[FD/N_0]$, whereas $FDR = E[FD/R]$
## FDR Control
Suppose we want to know that $FDR \leq q = 0.1$.

Benjamini + Hochberg Algorithm
1. Rank your p-values smallest to largest: $p_{(1)} \leq \ldots \leq p_{(p)}$.
2. Set the p-value cutoff as $\alpha^* = \max\{p_{(k)} : p_{(k)} \leq q\frac{k}{p}\}$.
Then $FDR \leq q$ -- assuming approximate independence between tests.
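A minimal sketch of the algorithm on simulated p-values (made up here: 95 uniform nulls plus 5 tiny real-effect p-values). Base R's `p.adjust` implements the same procedure, so the two counts agree.

```{r}
set.seed(3)
pvals = c(runif(95), 1e-6 * runif(5))  # 95 nulls, 5 strong real effects
q = 0.1
m = length(pvals)
p_sorted = sort(pvals)
below = p_sorted <= q * (1:m) / m         # step 2: compare p_(k) to q*k/m
alpha_star = if (any(below)) max(p_sorted[below]) else 0
sum(pvals <= alpha_star)                  # discoveries at FDR <= 0.1
sum(p.adjust(pvals, method = "BH") <= q)  # same answer via base R
```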
# Wrap up
## Things to do
Before Thursday:
- Group up on Canvas
- Submit predictions
- Play with R, tidyverse, ggplot2
## Rehash
- I'm Connor Dowd
- Big Data brings new problems
- R is important
- Graphs are important
- False Discovery Rates can be controlled
# Bye!