```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
I'll start by apologizing. I meant to publish this much sooner. I'm sorry.
The intent of this document is to cover the variety of topics you ought to be familiar with before the course starts. The topics are not listed in any particular order, nor do you need to know *everything* here before the course begins.
Answers will be posted shortly. Feel free to browse them.
# Basic Statistics
You should feel very comfortable with everything in this section.
## Probability Statements
1. Interpret the statement "this is a fair coin."
2. Out of 25 coin tosses, how many do you anticipate will be heads?
3. What do we conclude if all 25 are heads?
4. I have a new coin. Will you make a bet with me about the next flip?
5. Find the flaw in the following statement: "The probability of a nuclear accident resulting in the death of at least one US resident in the next 5 years is either 0 or 1, but we won't know for 5 years."
6. P[x>6] = 0.4; P[x>10] = 0 = P[x<0]. Interpret in words.
7. If there is a 40% chance something happens, what are its odds?
## Expectations
1. What is the expected value of the sum of the 25 coin tosses above?
2. Question 6 in Probability Statements gives some information about a variable x. What range of possible values could the expectation of x take?
3. Interpret E[Y] = 0. E[Y|x=5] = 0. E[Y|Y=0] = ?
## Variance
1. What is the variance of the sum of the 25 coin tosses above?
2. What is the range of possible variances of the variable x in question 6 of Probability Statements?
## Density Functions
1. Interpret "the CDF at 2 is 0.6" and "the PDF at 2 is 0.6". How do they differ?
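
As a rough illustration on a different distribution (a Uniform(0, 0.5), which is my own choice here and not part of the question), base R's `punif` and `dunif` return the CDF and the density at a point. The variable names are hypothetical.

```r
# Uniform(0, 0.5) is an illustrative assumption, not taken from the exercise.
cdf_at_point <- punif(0.25, min = 0, max = 0.5)  # P[X <= 0.25]
pdf_at_point <- dunif(0.25, min = 0, max = 0.5)  # height of the density at 0.25

print(cdf_at_point)
print(pdf_at_point)
```

Comparing the two printed numbers, and asking which of them could ever exceed 1, may help with the question above.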
# Distributions
This section asks for details, but really you should just have some sense of these things, and know how to look up the true answers.
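
One way to "look up" answers is to ask R directly. The sketch below uses an Exponential(rate = 2), which is deliberately not one of the distributions in this section, and it assumes that "95% predictive interval" means the central interval between the 2.5% and 97.5% quantiles. The same pattern should carry over to the `norm`, `t`, `pois`, `binom`, and `cauchy` function families.

```r
# Exponential(rate = 2) is an illustrative stand-in for the distributions below.
rate <- 2

# Central 95% interval, assuming that is what "predictive interval" means here.
interval_95 <- qexp(c(0.025, 0.975), rate = rate)

# Median from the quantile function; mean approximated by simulation.
median_exact <- qexp(0.5, rate = rate)
set.seed(1)
mean_simulated <- mean(rexp(1e5, rate = rate))

print(interval_95)
print(median_exact)
print(mean_simulated)
```

Simulation is a blunt tool: for some of the distributions below the simulated mean or variance will not settle down, which is itself a hint.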
## Gaussian/Normal
1. Draw a Gaussian distribution.
2. What is the mean of the Gaussian distribution "N(0,1)"? Median? Mode? Variance? Standard deviation? Kurtosis? Skewness? 95% predictive interval?
## t
1. Draw a t distribution on top of your Gaussian distribution.
2. What is the mean of the t distribution with 2 degrees of freedom? Variance? Median? 95% predictive interval?
3. What is the variance of a t distribution with 1 degree of freedom? 95% predictive interval?
## Poisson
1. What values can a Poisson distribution take?
2. What is the mean of a Poisson(5) distribution? Variance? 95% predictive interval?
## Binomial
1. What values can a Binomial take?
2. What is the mean of a Binomial(100,0.05)? Variance? 95% predictive interval?
## Cauchy
1. What is the mean of a Cauchy distribution? Median? 95% predictive interval?
# Linear Algebra
I am not great at linear algebra. However, I excel at saying "so X'X is the variance matrix of the matrix X and that means...". I have no expectations beyond some ability to do that, and perhaps more importantly, to understand what is going on when I do that.
1. $\hat{\beta} = Cov(X,Y)/Var(X)$. Explain why $Var(X)$ shows up in the denominator.
2. $\hat{\beta} = (X'X)^{-1}X'Y$. Explain which bits are the same as above.
3. What would prevent the above equation from working? Can you describe that to your grandmother?
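
The two formulas above can be checked numerically. This is a minimal sketch, assuming the built-in `mtcars` data with `mpg` as Y and `wt` as X (both illustrative choices), and assuming a column of ones is added to the matrix for the intercept.

```r
# mtcars and the mpg ~ wt relationship are illustrative choices.
x <- mtcars$wt
y <- mtcars$mpg

# Univariate slope: Cov(X, Y) / Var(X)
slope_cov <- cov(x, y) / var(x)

# Matrix form: (X'X)^{-1} X'Y, with a column of ones for the intercept
X <- cbind(1, x)
beta_matrix <- solve(t(X) %*% X) %*% t(X) %*% y

# Reference fit from lm()
beta_lm <- coef(lm(mpg ~ wt, data = mtcars))

print(slope_cov)
print(beta_matrix)
print(beta_lm)
```

The slope from the first formula should match the second entry of the matrix solution. If `solve` ever throws an error on your own data, that is closely related to question 3.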
# Computing
This will run through some basic tasks you should be familiar with in R.
## Open R
## Assignment
1. Assign the value 2 to the variable x.
2. Print x.
3. Remove the variable x.
## Load the help files
Find the help file for the R command `rnorm`. (`?rnorm`)
## Run a basic regression
Using the built-in dataset `iris`, run a regression of Sepal.Length on Sepal.Width.
## Results of regression
Show the summary table for that regression.
## Run a multivariate regression
Using the same dataset, run a regression of Sepal.Length on all other variables. Print the summary.
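
For the regression exercises above, here is a sketch of the syntax on a different built-in dataset. The dataset `mtcars` and the variables `mpg` and `wt` are my illustrative choices; the exercises themselves use `iris`.

```r
# mtcars, mpg, and wt are illustrative; the exercises use iris.
fit_one <- lm(mpg ~ wt, data = mtcars)  # regression on a single predictor
fit_all <- lm(mpg ~ ., data = mtcars)   # "." stands for all other columns

print(summary(fit_one))
print(summary(fit_all))
```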
## Load a .csv
Download the .csv file for US Covid Vaccinations as of March 3 available [here](https://codowd.com/bigdata/predictions/us_covid_vaccinations_mar3.csv).
Load it into R and assign it to the variable "mar3". (hint: you may need to skip a few lines)
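
The hint about skipping lines can be sketched as follows. The file written here is a small hypothetical stand-in with two note lines above the header; the real file may need a different number of skipped lines, so open it in a text editor first.

```r
# Write a small hypothetical file whose first two lines are notes, not data.
tmp_path <- tempfile(fileext = ".csv")
writeLines(c("Example export (hypothetical)",
             "Generated for illustration",
             "state,doses",
             "A,10",
             "B,20"),
           tmp_path)

# skip = 2 tells read.csv to ignore the two note lines before the header.
example_df <- read.csv(tmp_path, skip = 2)
print(example_df)
```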
## Scrape a .csv
Combine these steps into one. Try to get R to directly read the csv for the Covid Vaccinations as of March 18th available at https://codowd.com/bigdata/predictions/us_covid_vaccinations_mar18.csv. Assign this to "mar18".
## For-loop
Using a for-loop, add all the numbers from -5 to 29.
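
A minimal sketch of the accumulate-in-a-loop pattern, using the numbers 1 to 10 rather than the range in the exercise:

```r
# Sum the integers 1 through 10 with an explicit loop.
total <- 0
for (i in 1:10) {
  total <- total + i
}
print(total)
```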
## if-else
Using a for-loop and an if-else statement, draw a random observation from a standard normal distribution 100 times. Find the sum of the values larger than 1.5.
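
Here is a sketch of an if-else inside a for-loop on a different problem: 50 uniform draws, summing those below 0.2 and counting the rest. The distribution, threshold, and variable names are all illustrative assumptions.

```r
set.seed(42)          # make the draws reproducible
sum_small <- 0        # running sum of draws below the threshold
count_other <- 0      # number of draws at or above the threshold
draws <- numeric(50)  # storage so the draws can be inspected afterwards

for (i in 1:50) {
  draws[i] <- runif(1)
  if (draws[i] < 0.2) {
    sum_small <- sum_small + draws[i]
  } else {
    count_other <- count_other + 1
  }
}

print(sum_small)
print(count_other)
```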
## Subsetting
1. Use subset notation (e.g. `iris[3,2]`) to print the 5th element of the 3rd column of the iris dataset.
2. Run the command `x = rnorm(100)` and find the sum of the values larger than 1.5 using subset notation.
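
Both kinds of subsetting can be sketched on other data. The dataset `mtcars` and the vector `v` are illustrative choices, not part of the exercises.

```r
# [row, column] indexing on a data frame: row 2, column 1 of mtcars.
first_entry <- mtcars[2, 1]

# Logical subsetting on a vector: keep only the positive entries, then sum.
v <- c(3, -1, 4, -1, 5)
positive_sum <- sum(v[v > 0])

print(first_entry)
print(positive_sum)
```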
## Functions
1. Write a function that returns double its input.
2. Extra credit: Write a function that returns the nth Fibonacci number using recursion.
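
As a sketch of the syntax only, here are a simple function and a recursive one. The functions `square` and `fact` are hypothetical examples on different problems than the two above.

```r
# A function that returns the square of its input.
square <- function(x) {
  x^2
}

# A recursive function: factorial of a positive integer n.
fact <- function(n) {
  if (n <= 1) {
    return(1)        # base case stops the recursion
  }
  n * fact(n - 1)    # recursive case calls the function on a smaller input
}

print(square(3))
print(fact(5))
```

The key design point in any recursive function is the base case; without it the function calls itself until R gives up.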
## Install a package
Install the package "glmnet".
## Load a package
Load the package "glmnet".
## Generate Random Numbers
Generate 100 random numbers and calculate their mean. Calculate the standard error of that mean.
## Run a simulation study
Using a for-loop, repeat the above generation of 100 random numbers 1000 times, calculating and storing the mean of the sample each time.
## Plot results
Plot a histogram of the output from the above simulation study. (`hist`)
## Extra Credit
Find the standard deviation of the means you generated. Compare it to the standard error from the first sample you generated.
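
The simulation steps above can be sketched on a smaller, different setup: 50 exponential draws repeated 200 times. The distribution, sample size, number of repetitions, and variable names are my assumptions, and the standard error is computed as the sample standard deviation divided by the square root of the sample size.

```r
set.seed(123)
n_obs <- 50    # observations per sample (illustrative)
n_reps <- 200  # number of repeated samples (illustrative)

# One sample: its mean and the standard error of that mean.
first_sample <- rexp(n_obs)
se_first <- sd(first_sample) / sqrt(n_obs)

# Simulation study: store the mean of each repeated sample.
sample_means <- numeric(n_reps)
for (r in seq_len(n_reps)) {
  sample_means[r] <- mean(rexp(n_obs))
}

hist(sample_means)                 # distribution of the simulated means
sd_of_means <- sd(sample_means)    # compare this with se_first

print(se_first)
print(sd_of_means)
```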
# Regressions
## Interpreting Univariate Regressions
Way back in "Results of regression", you printed the summary output of a univariate regression.
1. Interpret the coefficient on Sepal.Width.
2. Interpret the F-test output.
3. Interpret the $R^2$.
4. If I told you the Intercept was 6.49, would your estimate of the coefficient on Sepal.Width change? If yes, would you care to guess a direction?
## Interpreting Multivariate Regression
1. Interpret the coefficient on Petal.Width from the multivariate regression.
2. Interpret the F-test output.
3. Interpret the $R^2$. Explain why it is higher than in the univariate regression.
4. If you needed to predict a Sepal.Length for an iris, would you like to run a different regression? What would you change?
5. What do the words "heteroscedasticity-robust standard errors" mean to you?
6. Does this regression output assume that the noise in the observations is normally distributed? Did any of your answers above? What other assumptions are we making?
7. If I pull on a petal to make it wider, will the Sepal.Length grow?
# Hypothesis Testing/Confidence Intervals
## Null Hypothesis
1. What is the null hypothesis usually?
2. What does it mean to reject the null? What else might we do?
## P-values
1. What is a p-value?
2. What does it mean that a coefficient has a p-value of 0.03?
3. What would it mean if the p-value were 0.07?
## Confidence Intervals
1. What is a confidence interval?
2. How does it relate to a predictive interval?
3. What is the confidence interval for the coefficient on Sepal.Width (in the univariate regression)?
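
For the last question, base R's `confint` extracts confidence intervals from a fitted model. This is a sketch on a different regression (`mpg` on `wt` in `mtcars`, an illustrative choice), assuming a 95% level since the question does not state one.

```r
# mtcars, mpg, and wt are illustrative; the question refers to the iris regression.
fit <- lm(mpg ~ wt, data = mtcars)

# 95% confidence interval for the coefficient on wt (level is an assumption).
ci_wt <- confint(fit, "wt", level = 0.95)
print(ci_wt)
```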