This document provides answers to the questions in HW0. Feel free to peruse it or use it as a reference.

1 Basic Statistics

You should feel very comfortable with everything in this section.

1.1 Probability Statements

Interpret the statement “this is a fair coin.”

Any given toss of the coin is as likely to be heads as tails.

Out of 25 coin tosses, how many do you anticipate will be heads?

12.5

What do we conclude if all 25 are heads?

It is extremely unlikely the coin is fair.

I have a new coin. Will you make a bet with me about the next flip?

Absolutely not. But I might be willing to bet with someone you don’t know about its flip.

Find the flaw in the following statement “The probability of a nuclear accident resulting in the death of at least one US resident in the next 5 years is either 0 or 1 but we won’t know for 5 years”.

Most prominently, it begs the question, “what is the probability that probability is 1?”. More importantly, it is an utterly unhelpful prediction.

P[x>6] = 0.4; P[x>10] = 0 = P[x<0]. Interpet in words.

There is some variable X taking values between 0 and 10. 40% of the time it is greater than 6.

If there is a 40% chance something happens, what are its odds?

The odds are 2 to 3.

1.2 Expectations

What is the expected value of the sum of the 25 coin tosses above?

12.5

Question 6 above gives some information about a variable x. What range of possible values could the expectation of X take?

E[x] depends on the (mostly unknown) distribution. The max possible value would happen when the distribution is as far right as possible. In this case, that happens when P[x=6]=0.6 and P[x=10]=0.4, giving a max expectation of 7.6. The min is when the distribution is as far left as possible, at P[x=0]=0.6 and P[x=6+epsilon]=0.4, for a minimum expectation of 2.4+epsilon.

Interpret E[Y] = 0. E[Y|x=5] = 0. E[Y|Y=0] = ?

E[Y] = 0 says that the mean of Y is 0. E[Y|x=5] = 0 tells us that when x=5, the mean of Y is 0. E[Y|Y=0] is 0.

1.3 Variance

What is the variance of the sum of coin tosses above?

This is the variance of a binomial, which is np(1-p), or in this case: 25*0.25 = 6.25.

What is the range of possible variances of the variable X in question 6?

The variance is at a max when the distribution is as spread as possible. That would be when P[x=0]=0.6 and P[x=10]=0.4, for a variance of 24. The min variance is when the distribution is as narrow as possible, when P[x=6]=0.6 and P[x=6+epsilon]=0.4 for a variance of 0.24*epsilon^2. The range of variances possible is (0,24].

1.4 Density functions:

Interpret “the CDF of 2 is 0.6” and “the pdf of 2 is 0.6”. How do they differ?

The CDF of 2 is 0.6 means that 60% of values (of this unnamed random variable) are 2 or below. The pdf of 2 is 0.6 means that the density (of this unnamed RV) at 2 is 0.6, which is borderline uninterpretable by itself. The two are closely related in that the CDF is the running integral of the density, while the density is the slope of the CDF. Critically, “the pdf of 2 is 0.6” does not mean P[x=2]=0.6. Though “the cdf of 2 is 0.6” does mean P[x<=2] = 0.6.

2 Distributions

This section asks for details, but really you should just have some sense of these things, and know how to look up the true answers.

2.1 Gaussian/Normal

Draw a gaussian distribution.

I can’t easily draw this for you, so I’ll plot it in R. You certainly did not need to do that.

plot(density(rnorm(100000)))

What is the mean of the gaussian distribution “N(0,1)”? Median? Mode? Variance? Standard Deviation? Kurtosis? Skew? 95% predictive interval?

Mean = 0. Median = 0. Mode = 0. Variance = 1, SD = 1. Kurtosis = 3 [excess kurtosis = 0]. Skew = 0. 95% predictive interval = (-1.96,1.96) [I would have been happy with (-2,2)].

2.2 t

Draw a t distribution on top of your gaussian distribution.

Same again

plot(density(rnorm(100000)))
lines(density(rt(100000,5)),col=2)

What is the mean of the t distribution with 2 degrees of freedom? Variance? Median? 95% predictive interval?

Mean = 0. Variance = Inf. Median = 0. 95% Predictive interval = (-4.31,4.31)

What is the variance of a t-distribution with 1 degree of freedom? 95% predictive interval?

Variance = Inf (again). PI = (-12.71,12.71)

2.3 Poisson

What values can a poisson distribution take?

Any non-negative integer.

What is the mean of a Poisson(5) distribution? Variance? 95% predictive interval?

Mean = 5. Variance = 5. PI = [1,10]

2.4 Binomial

What values can a Binomial take?

A Binomial(n,p) can take values {0,1,…,n}.

What is the mean of a Binomial(100,0.05)? Variance? 95% predictive interval?

Mean = 5. Variance = 4.75. PI = [1,10]. For those of you paying attention, those numbers should look very similar to the numbers for the above Poisson(5) distribution.

2.5 Cauchy

What is the mean of a cauchy distribution? Median? 95% predictive interval?

Mean = ????. Median = 0. PI = (-12.71,12.71) [using qcauchy].

3 Linear Algebra

I am not great at linear algebra. However, I excel at saying “so X’X is the variance matrix of the matrix X and that means…”. I have no expectations beyond some ability to do that, and perhaps more importantly, to understand what is going on when I do that.

\(\hat{\beta} = Cov(X,Y)/Var(X)\). Explain why var(x) shows up in the denominator.

The beta is a measure of the relationship between X and Y. Suppose we double all values of X, how should the beta evolve? It should halve. But when we double all values of X, the Cov(X,Y) will also double. The Var(x) in the denominator is making sure the beta adjusts appropriately. In this case, because Var(2x) = 4Var(x), the beta will adopt correctly.

\(\hat{\beta} = (X'X)^{-1}X'Y\). Explain which bits are the same as above.

(X’X)^{-1} is the Variance in the denominator. Except now it is a full variance matrix including the covariances between different columns of X. X’Y is the Covariance between X and Y except now it is a full vector of covariances between Y and each column of X. And the beta has changed too, from a single number to a vector.

What would prevent the above equation from working? Can you describe that to your grandmother?

The above equation relies on the invertibility of X’X. For my grandmother, and myself, what that means is that when some of my variables are closely related to eachother, we won’t be able tell which ones are important, so we can’t estimate the beta. For instance, if we observe a state imposing a tax on alcohol and a massive ad campaign about its deleterious effects at the same time, it will be hard to attribute any changes in drinking to one or the other of those policies.

4 Computing

This will run through some basic tasks you should be familiar with in R.

4.1 Open R.

I did it.

4.2 Assignment

Assign the value 2 to the variable x.

x = 2

Print x.

## [1] 2

Remove the variable x.

rm(x)
x

## Error in eval(expr, envir, enclos): object 'x' not found

4.3 Load the help files

Find the help file for the R command rnorm. (?rnorm)

?rnorm

4.4 Run a basic regression

Using the builtin dataset iris, run a regression of Sepal.Length on Sepal.Width.

mod = lm(Sepal.Length~Sepal.Width,data=iris)

4.5 Results of regression

Show the summary table for that regression

summary(mod)

## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5561 -0.6333 -0.1120  0.5579  2.2226 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.5262     0.4789   13.63   <2e-16 ***
## Sepal.Width  -0.2234     0.1551   -1.44    0.152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8251 on 148 degrees of freedom
## Multiple R-squared:  0.01382,    Adjusted R-squared:  0.007159 
## F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519

4.6 Run a multivariate regression

Using the same dataset, run a regression of Sepal.Length on all other variables. Print the summary.

mod.all = lm(Sepal.Length~.,data=iris)
summary(mod.all)

## 
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79424 -0.21874  0.00899  0.20255  0.73103 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
## Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
## Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
## Petal.Width       -0.31516    0.15120  -2.084  0.03889 *  
## Speciesversicolor -0.72356    0.24017  -3.013  0.00306 ** 
## Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared:  0.8673, Adjusted R-squared:  0.8627 
## F-statistic: 188.3 on 5 and 144 DF,  p-value: < 2.2e-16

4.7 Load a .csv

Download the .csv file for US Covid Vaccinations as of March 3 available here. Load it into R and assign it to the variable “mar3”. (hint: you may need to skip a few lines)

I downloaded the file to a folder named “/Users/cdowd/Downloads/”

mar3 = read.csv("/Users/codowd/Downloads/us_covid_vaccinations_mar3.csv",skip=3)

4.8 Scrape a .csv

Combine these steps into one. Try to get R to directly read the csv for the Covid Vaccinations as of March 18th available at https://codowd.com/bigdata/predictions/us_covid_vaccinations_mar18.csv. Assign this to “mar18”.

mar18 = read.csv("https://codowd.com/bigdata/predictions/us_covid_vaccinations_mar18.csv",skip=3)

4.9 For-loop

Using a for-loop, add all the numbers from -5 to 29.

total = 0
for (i in -5:29) {
  total = total+i
}
total

## [1] 420

4.10 if-else

Using a for loop, draw a random observation from a standard normal distribution 100 times. Find the sum of the values larger than 1.5.

total = 0
for (i in 1:100) {
  obs = rnorm(1)
  if (obs > 1.5) {
    total = total+obs
  }
}
total

## [1] 14.84213

4.11 Subsetting

Use subset notation (iris[3,2]) to print the 5th element of the 3rd column of the iris dataset.

iris[5,3]

## [1] 1.4

Run the command (x = rnorm(100)) and find the sum of the values larger than 1.5 using subset notation.

x = rnorm(100)
sum(x[x>1.5])

## [1] 17.48829

4.12 Functions

Write a function that returns double its input.

double = function(x) {
  2*x
}

Extra credit: Write a function that returns the nth Fibbonacci number using recursion.

Warning: this function is stupidly slow for values of n greater than like 20.

fib_n = function(n) {
  if (n <= 1) return(0)
  if (n == 2) return(1)
  return(fib_n(n-1)+fib_n(n-2))
}

4.13 install a package

Install the package “glmnet”.

install.packages("glmnet")

4.14 Load a package

Load the package “glmnet”.

Make sure you install the dependencies

library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-1

4.15 Generate Random Numbers

Generate 100 random numbers, calculate their mean. Calculate the standard error of that mean.

x = rnorm(100)
mean(x)

## [1] -0.06445758

se = sd(x)/sqrt(100)
se

## [1] 0.1076501

4.16 Run a simulation study

Using a for-loop, repeat the above generation of 100 random numbers 1000 times, calculating and storing the mean of the sample each time.

nsamples = 1000
means = numeric(nsamples)
for (i in 1:nsamples) {
  means[i] = mean(rnorm(100))
}

4.17 Plot results

Plot a histogram of the output from the above simulation study. (hist)

hist(means)

4.18 Extra Credit:

Find the standard deviation of the means you generated. Compare it to the Standard error from the first sample you generated.

sd(means)

## [1] 0.09710419

This standard deviation is very close to the SE calculated above. They are of course both estimates of the same parameter – the standard deviation of the mean of 100 observations from a standard normal. The true SD of that mean is 0.1.

5 Regressions

5.1 Interpreting Univariate Regressions

Way back in univariate regression, you printed the summary output of a regression.

Interpret the coefficient on Sepal.Width.

A 1 unit increase in Sepal.Width is associated with a 0.22 decrease in the Sepal.Length.

Interpret the F-test output.

It is unclear if adding the variable Sepal.Width is helping our predictions at all.

Interpret the \(R^2\).

Our model can explain 1.4% of the variation in Sepal.Length.

If I told you the Intercept was 6.49, would your estimate of the coefficient on Sepal.Width change? If yes, would you care to guess a direction?

Yes. It would likely shrink towards 0.

5.2 Interpreting Multivariate Regression

Interpret the coefficient on Petal.Width from the multivariate regression.

Holding everything else constant, a 1 unit increase in Petal.Width is associated with a 0.31 unit decrease in Sepal.Length.

Interpret the F-test output.

Our model represents substantial improvement beyond a simple constant prediction.

Interpret the \(R^2\). Explain why it is higher.

This model can explain about 87% of the variation in Sepal.Length. As we added more explanatory variables, we nearly guaranteed that it would rise – the coefficient on a variable would have to 0 for it to not affect the R^2. But this rise is more than that, we’ve added variables with substantial explanatory power, and they are using that power.

If you needed to predict a Sepal.Length for an iris, Would you like to run a different regression? What would you change?

Eh. This is a nice model. I would consider adding some interactions between species and the other variables.

What do the words “heteroscedasticity robust standard errors” mean to you?

The range of values the coefficients may take is being determined under the assumption that the variation in Sepal.Length is constant across all other variables. This may be unreasonable, heteroscedasticity robust SEs let us relax that assumption.

Does this regression output assume that the noise in the observations is normally distributed? Did any of your answers above? What other assumptions are we making?

No and no. Though we do assume that the coefficient estimates are normally distributed, the observations themselves can take other distributions with finite variances. This is only the ‘best prediction’ in the class of linear models, if we don’t believe in linearity, we should consider other models. We are assuming the observations are independent from eachother. We are also assuming the data is trustworthy and not error prone.

If I pull on a petal to make it wider, will the Sepal.Length grow?

Maybe? My guess is no. More broadly, this isn’t necessarily a causal relationship we’ve found, much less one we might be able to manipulate. But it is possible there is a causal link here. Of course even then, the coefficient is negative. So “No” seems like a good bet.

6 Hypothesis Testing/Confidence Intervals.

6.1 Null Hypothesis

What is the null hypothesis usually?

Usually the null hypothesis is that some parameter is 0. Occasionally the default null is that some parameter is 1.

What does it mean to reject the null? What else might we do?

Rejecting the null means we’ve concluded that the parameter is not 0. We might also fail to reject the null.

6.2 P-values

What is a p-value?

The probability of observing data at least as extreme as the data we observed, under the assumption that the null hypothesis is true.

What does it mean that a coefficient has a p-value of 0.03?

It means that under the null distribution, such an extreme estimated value of the coefficient is unlikely.

What would it mean if it was a p-value of 0.07?

It is still twice as likely, but still unlikely.

6.3 Confidence Intervals

What is a confidence interval?

It is a range of values a parameter is likely to fall in.

How does it relate to a predictive interval?

Typically, but not always, a predictive interval is a range of values a specific observation will likely fall in. E.g. the range of heights of the next person to walk into a room. While the confidence interval is the range of values some parameter is likely to fall in – e.g. the average height of people walking into rooms.

What is the confidence interval for the coefficient on Sepal.Width (in the univariate regression)?

The CI is approximately (-0.52,0.08).

Homework 0 Answers

BUSN 20800 – Big Data

Connor Dowd

March 24, 2021