This document provides answers to the questions in HW0. Feel free to peruse it or use it as a reference.
You should feel very comfortable with everything in this section.
Any given toss of the coin is as likely to be heads as tails.
12.5
It is extremely unlikely the coin is fair.
Absolutely not. But I might be willing to bet with someone who doesn't know about its flips.
Most prominently, it raises the question, "what is the probability that the probability is 1?". More importantly, it is an utterly unhelpful prediction.
There is some variable X taking values between 0 and 10. 40% of the time it is greater than 6.
The odds are 2 to 3.
12.5
E[x] depends on the (mostly unknown) distribution. The max possible value would happen when the distribution is as far right as possible. In this case, that happens when P[x=6]=0.6 and P[x=10]=0.4, giving a max expectation of 7.6. The min is when the distribution is as far left as possible, at P[x=0]=0.6 and P[x=6+epsilon]=0.4, for a minimum expectation of 2.4+0.4*epsilon, which approaches 2.4 as epsilon shrinks.
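A quick arithmetic check in R, using the two boundary distributions described above (epsilon here is just a small illustrative number):
eps = 0.001
sum(c(6, 10) * c(0.6, 0.4))       # max expectation: 7.6
sum(c(0, 6 + eps) * c(0.6, 0.4))  # min expectation: 2.4 + 0.4*eps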
E[Y] = 0 says that the mean of Y is 0. E[Y|x=5] = 0 tells us that when x=5, the mean of Y is 0. E[Y|Y=0] = 0 is true by definition: once we know Y is 0, its mean is 0.
This is the variance of a binomial, which is np(1-p), or in this case: 25*0.25 = 6.25.
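A quick simulation check, assuming the setting is n = 25 trials with p = 0.5 (the values implied by 25*0.25 above):
n = 25; p = 0.5               # assumed values, implied by np(1-p) = 25*0.25
n * p * (1 - p)               # theoretical variance: 6.25
var(rbinom(100000, n, p))     # simulated variance, should be close to 6.25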
The variance is at a max when the distribution is as spread as possible. That would be when P[x=0]=0.6 and P[x=10]=0.4, for a variance of 24. The min variance is when the distribution is as narrow as possible, when P[x=6]=0.6 and P[x=6+epsilon]=0.4 for a variance of 0.24*epsilon^2. The range of variances possible is (0,24].
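The same style of check works for the variance bounds (again with a small illustrative epsilon):
probs = c(0.6, 0.4)
# max-variance distribution: P[x=0]=0.6, P[x=10]=0.4
vals = c(0, 10)
mu = sum(vals * probs)
sum((vals - mu)^2 * probs)    # 24
# min-variance distribution: P[x=6]=0.6, P[x=6+eps]=0.4
eps = 0.001
vals = c(6, 6 + eps)
mu = sum(vals * probs)
sum((vals - mu)^2 * probs)    # 0.24*eps^2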
The CDF of 2 is 0.6 means that 60% of values (of this unnamed random variable) are 2 or below. The pdf of 2 is 0.6 means that the density (of this unnamed RV) at 2 is 0.6, which is borderline uninterpretable by itself. The two are closely related in that the CDF is the running integral of the density, while the density is the slope of the CDF. Critically, “the pdf of 2 is 0.6” does not mean P[x=2]=0.6. Though “the cdf of 2 is 0.6” does mean P[x<=2] = 0.6.
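A small R illustration of that relationship, using the standard normal (my choice of distribution, not part of the question): pnorm is the CDF, dnorm is the density, and integrating the density up to 2 recovers the CDF at 2.
pnorm(2)                          # CDF at 2
integrate(dnorm, -Inf, 2)$value   # integral of the density up to 2: the same number
dnorm(2)                          # the density at 2 -- not a probability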
This section asks for details, but really you should just have some sense of these things, and know how to look up the true answers.
I can’t easily draw this for you, so I’ll plot it in R. You certainly did not need to do that.
plot(density(rnorm(100000)))
Mean = 0. Median = 0. Mode = 0. Variance = 1, SD = 1. Kurtosis = 3 [excess kurtosis = 0]. Skew = 0. 95% predictive interval = (-1.96,1.96) [I would have been happy with (-2,2)].
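Those interval endpoints come straight from qnorm:
qnorm(c(0.025, 0.975))   # roughly -1.96 and 1.96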
Same again
plot(density(rnorm(100000)))
lines(density(rt(100000,5)),col=2)
Mean = 0. Variance = Inf. Median = 0. 95% Predictive interval = (-4.31,4.31)
Variance = Inf (again). PI = (-12.71,12.71)
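For reference, these quantiles come from qt; the numbers quoted above line up with t distributions with 2 and 1 degrees of freedom respectively (my reading of which distributions are being asked about, so treat the df values as an assumption).
qt(0.975, df = 2)   # about 4.30 (df = 2 is my assumption)
qt(0.975, df = 1)   # about 12.71 (df = 1 is my assumption)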
Any non-negative integer.
Mean = 5. Variance = 5. PI = [1,10]
A Binomial(n,p) can take values {0,1,…,n}.
Mean = 5. Variance = 4.75. PI = [1,10]. For those of you paying attention, those numbers should look very similar to the numbers for the above Poisson(5) distribution.
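Both sets of numbers are easy to verify in R. For the binomial I'm assuming Binomial(100, 0.05), which is what a mean of 5 with variance 4.75 implies.
qpois(c(0.025, 0.975), lambda = 5)                 # Poisson(5): 1 and 10
qbinom(c(0.025, 0.975), size = 100, prob = 0.05)   # Binomial(100, 0.05) (assumed n and p): also 1 and 10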
Mean = undefined (the Cauchy has no mean). Median = 0. PI = (-12.71,12.71) [using qcauchy].
I am not great at linear algebra. However, I excel at saying “so X’X is the variance matrix of the matrix X and that means…”. I have no expectations beyond some ability to do that, and perhaps more importantly, to understand what is going on when I do that.
The beta is a measure of the relationship between X and Y. Suppose we double all values of X: how should the beta change? It should halve. But when we double all values of X, Cov(X,Y) will also double. The Var(x) in the denominator makes sure the beta adjusts appropriately. In this case, because Var(2x) = 4Var(x), the beta halves, as it should.
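A quick simulated illustration of that scaling argument (the data here is made up purely for the demo):
set.seed(1)
x = rnorm(1000)
y = 2*x + rnorm(1000)
cov(x, y) / var(x)       # the beta
cov(2*x, y) / var(2*x)   # doubling x halves the beta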
(X’X)^{-1} plays the role of the variance in the denominator, except now it is a full variance matrix including the covariances between the different columns of X. X’Y plays the role of Cov(X,Y), except now it is a full vector of covariances between Y and each column of X. And the beta has changed too, from a single number to a vector.
The above equation relies on the invertibility of X’X. For my grandmother, and myself, what that means is that when some of my variables are closely related to each other, we won't be able to tell which ones are important, so we can't estimate the beta. For instance, if we observe a state imposing a tax on alcohol and a massive ad campaign about its deleterious effects at the same time, it will be hard to attribute any changes in drinking to one or the other of those policies.
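To make the matrix version concrete, here is a small sketch on simulated data (my own made-up example): the matrix formula reproduces lm, and solve() fails when two columns of X are exact copies of each other.
set.seed(2)
n = 100
x1 = rnorm(n); x2 = rnorm(n)
y = 1 + 2*x1 - x2 + rnorm(n)
X = cbind(1, x1, x2)               # include an intercept column
solve(t(X) %*% X) %*% t(X) %*% y   # matches the lm coefficients below
coef(lm(y ~ x1 + x2))
X2 = cbind(1, x1, x1)              # perfectly collinear columns
# solve(t(X2) %*% X2)              # uncommenting this throws a singularity error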
This will run through some basic tasks you should be familiar with in R.
I did it.
x = 2
x
## [1] 2
rm(x)
x
## Error in eval(expr, envir, enclos): object 'x' not found
Find the help file for the R command rnorm. (?rnorm)
?rnorm
Using the builtin dataset iris, run a regression of Sepal.Length on Sepal.Width.
mod = lm(Sepal.Length~Sepal.Width,data=iris)
Show the summary table for that regression.
summary(mod)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5561 -0.6333 -0.1120 0.5579 2.2226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5262 0.4789 13.63 <2e-16 ***
## Sepal.Width -0.2234 0.1551 -1.44 0.152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8251 on 148 degrees of freedom
## Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
## F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
Using the same dataset, run a regression of Sepal.Length on all other variables. Print the summary.
mod.all = lm(Sepal.Length~.,data=iris)
summary(mod.all)
##
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79424 -0.21874 0.00899 0.20255 0.73103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.17127 0.27979 7.760 1.43e-12 ***
## Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
## Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
## Petal.Width -0.31516 0.15120 -2.084 0.03889 *
## Speciesversicolor -0.72356 0.24017 -3.013 0.00306 **
## Speciesvirginica -1.02350 0.33373 -3.067 0.00258 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
## F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
Download the .csv file for US Covid Vaccinations as of March 3 available here. Load it into R and assign it to the variable “mar3”. (hint: you may need to skip a few lines)
I downloaded the file to a folder named “/Users/codowd/Downloads/”
mar3 = read.csv("/Users/codowd/Downloads/us_covid_vaccinations_mar3.csv",skip=3)
Combine these steps into one. Try to get R to directly read the csv for the Covid Vaccinations as of March 18th available at https://codowd.com/bigdata/predictions/us_covid_vaccinations_mar18.csv. Assign this to “mar18”.
mar18 = read.csv("https://codowd.com/bigdata/predictions/us_covid_vaccinations_mar18.csv",skip=3)
Using a for-loop, add all the numbers from -5 to 29.
total = 0
for (i in -5:29) {
  total = total + i
}
total
## [1] 420
Using a for loop, draw a random observation from a standard normal distribution 100 times. Find the sum of the values larger than 1.5.
total = 0
for (i in 1:100) {
  obs = rnorm(1)
  if (obs > 1.5) {
    total = total + obs
  }
}
total
## [1] 14.84213
Use indexing notation (e.g. iris[3,2]) to print the 5th element of the 3rd column of the iris dataset.
iris[5,3]
## [1] 1.4
Generate 100 random numbers (x = rnorm(100)) and find the sum of the values larger than 1.5 using subset notation.
x = rnorm(100)
sum(x[x>1.5])
## [1] 17.48829
double = function(x) {
  2*x
}
Warning: this function is stupidly slow for values of n greater than like 20.
fib_n = function(n) {
  if (n <= 1) return(0)
  if (n == 2) return(1)
  return(fib_n(n-1) + fib_n(n-2))
}
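The slowness comes from the recursion recomputing the same values exponentially many times. A simple iterative version (same indexing as above; just a sketch, not part of the assignment) avoids that:
fib_n_fast = function(n) {
  if (n <= 1) return(0)
  if (n == 2) return(1)
  a = 0; b = 1               # fib_n(1) = 0, fib_n(2) = 1
  for (i in 3:n) {
    nxt = a + b
    a = b
    b = nxt
  }
  b
}
fib_n_fast(20)   # same answer as fib_n(20), computed almost instantly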
Install the package “glmnet”.
install.packages("glmnet")
Load the package “glmnet”.
Make sure you install the dependencies.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-1
Generate 100 random numbers, calculate their mean. Calculate the standard error of that mean.
x = rnorm(100)
mean(x)
## [1] -0.06445758
se = sd(x)/sqrt(100)
se
## [1] 0.1076501
Using a for-loop, repeat the above generation of 100 random numbers 1000 times, calculating and storing the mean of the sample each time.
nsamples = 1000
means = numeric(nsamples)
for (i in 1:nsamples) {
  means[i] = mean(rnorm(100))
}
Plot a histogram of the output from the above simulation study. (hist)
hist(means)
Find the standard deviation of the means you generated. Compare it to the Standard error from the first sample you generated.
sd(means)
## [1] 0.09710419
This standard deviation is very close to the SE calculated above. They are of course both estimates of the same parameter – the standard deviation of the mean of 100 observations from a standard normal. The true SD of that mean is 0.1.
Way back in univariate regression, you printed the summary output of a regression.
A 1 unit increase in Sepal.Width is associated with a 0.22 unit decrease in Sepal.Length.
It is unclear if adding the variable Sepal.Width is helping our predictions at all.
Our model can explain 1.4% of the variation in Sepal.Length.
Yes. It would likely shrink towards 0.
Holding everything else constant, a 1 unit increase in Petal.Width is associated with a 0.31 unit decrease in Sepal.Length.
Our model represents substantial improvement beyond a simple constant prediction.
This model can explain about 87% of the variation in Sepal.Length. As we added more explanatory variables, we nearly guaranteed that it would rise – the coefficient on a variable would have to be exactly 0 for it not to affect the R^2. But this rise is more than that: we've added variables with substantial explanatory power, and they are using that power.
Eh. This is a nice model. I would consider adding some interactions between species and the other variables.
The range of values the coefficients may take is being determined under the assumption that the variation in Sepal.Length is constant across all other variables. This may be unreasonable; heteroscedasticity-robust SEs let us relax that assumption.
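If you wanted to actually compute heteroscedasticity-robust SEs for the regression above, one common route (not required by the question) is the sandwich and lmtest packages:
# install.packages(c("sandwich", "lmtest"))   # if not already installed
library(sandwich)
library(lmtest)
coeftest(mod.all, vcov = vcovHC(mod.all, type = "HC1"))   # coefficient table with robust SEs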
No and no. Though we do assume that the coefficient estimates are normally distributed, the observations themselves can take other distributions with finite variances. This is only the ‘best prediction’ in the class of linear models; if we don’t believe in linearity, we should consider other models. We are assuming the observations are independent of each other. We are also assuming the data is trustworthy and not error prone.
Maybe? My guess is no. More broadly, this isn’t necessarily a causal relationship we’ve found, much less one we might be able to manipulate. But it is possible there is a causal link here. Of course even then, the coefficient is negative. So “No” seems like a good bet.
Usually the null hypothesis is that some parameter is 0. Occasionally the default null is that some parameter is 1.
Rejecting the null means we’ve concluded that the parameter is not 0. We might also fail to reject the null.
The probability of observing data at least as extreme as the data we observed, under the assumption that the null hypothesis is true.
It means that under the null distribution, such an extreme estimated value of the coefficient is unlikely.
It is twice as likely, but still unlikely.
It is a range of values a parameter is likely to fall in.
Typically, but not always, a predictive interval is a range of values a specific observation will likely fall in – e.g. the range of heights of the next person to walk into a room. A confidence interval, by contrast, is the range of values some parameter is likely to fall in – e.g. the average height of people walking into rooms.
The CI is approximately (-0.53,0.08).
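That matches what R reports directly, or the quick by-hand version of the coefficient plus or minus roughly two standard errors:
confint(mod, "Sepal.Width")                    # exact CI from R
coef(mod)["Sepal.Width"] + c(-2, 2) * 0.1551   # rough by-hand version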