# IV / DiD examples & SCM discussion
# Connor Dowd -- November 2018
# codowd.com
#Setting a seed, so I can discuss exact results
set.seed(24955)
###########################
# IV
###########################
#Generating data:
#----------------
n = 1000 #Sample size
z = rnorm(n) #Instrument
q = rnorm(n) #unobserved confound
x = 5*z + 3*q + rnorm(n) #variable of interest
y = 0.5*x + 2*q + rnorm(n) #outcome of interest
#Problem:
#---------------
#We care about the relationship between y and x (true value -- 0.5).
# But if we regress y on x, our coefficient will also
# be incorporating the fact that q affects both x and y, i.e. our
# estimate will be biased.
biased.mod = lm(y~x)
summary(biased.mod)
#You can see that if you build a confidence interval for the
#relationship, it doesn't include the true relationship -- it isn't even close.
beta_biased = summary(biased.mod)$coefficients[2,]
CI_biased = beta_biased[1]+c(-2,2)*beta_biased[2]
# I get a CI from 0.65 to 0.7. Does not come close to including 0.5
#This is the bias we care about.
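The size of that bias can be worked out from the DGP above. A sketch of the omitted-variable-bias arithmetic, assuming the standard plim formula for OLS with one omitted regressor:

```r
# plim of the OLS slope = 0.5 + cov(x, 2q)/var(x)
# with x = 5z + 3q + e: var(x) = 25 + 9 + 1 = 35, and cov(x, q) = 3
0.5 + 2*3/(5^2 + 3^2 + 1)  # about 0.671 -- right in the middle of that CI
```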
#Solution: IV (2SLS)
#-------------------
# We are going to use 2SLS -- two-stage least squares (google it).
# First we regress X on Z (our instrument). Z must be related to X,
# and unrelated to Q. It also can't affect Y other than through X.
first.stage = lm(x~z)
summary(first.stage)
#The large F-stat tells us Z is strongly related to X (no weak-instrument worries here).
#Now we regress y on the predictions of x from the first stage.
#This will give us the relationship we want.
x.pred = fitted(first.stage)
second.stage = lm(y~x.pred)
second.stage
#As you can see, the coefficient for x.pred is very close to the truth
# of 0.5.
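As a sanity check (assumed algebra, not part of the original script): with a single instrument, the 2SLS slope reduces exactly to the ratio cov(z,y)/cov(z,x), which you can verify on fresh data from the same DGP. The `2`-suffixed names are mine, to avoid clobbering the variables above:

```r
# Self-contained check on fresh data from the same DGP:
set.seed(1)
n2 = 1000; z2 = rnorm(n2); q2 = rnorm(n2)
x2 = 5*z2 + 3*q2 + rnorm(n2)
y2 = 0.5*x2 + 2*q2 + rnorm(n2)
stage1 = lm(x2 ~ z2)
stage2 = lm(y2 ~ fitted(stage1))
# With one instrument, the second-stage slope equals cov(z,y)/cov(z,x)
all.equal(unname(coef(stage2)[2]), cov(z2, y2)/cov(z2, x2))  # TRUE
```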
#WARNING: while this process gives the proper point estimate for beta,
# the lm output will not give proper standard errors -- they must be
# adjusted for the two-stage process. This is why I didn't print the full
# lm summary. Bootstrapping the whole procedure should work. Or you could
# look up the (fairly complicated) analytical correction.
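One way to do that bootstrap -- a sketch, not the only valid scheme. It resamples (z, x, y) rows jointly and re-runs both stages each time; the function names are mine:

```r
# Re-run both stages on one sample
two_stage = function(z, x, y) {
  x.hat = fitted(lm(x ~ z))
  unname(coef(lm(y ~ x.hat))[2])
}
# Bootstrap SE: resample rows with replacement, take sd of the estimates
boot_se_2sls = function(z, x, y, B = 500) {
  n = length(y)
  betas = replicate(B, {
    idx = sample(n, n, replace = TRUE)
    two_stage(z[idx], x[idx], y[idx])
  })
  sd(betas)
}
# boot_se_2sls(z, x, y)  # standard error for the 2SLS slope
```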
#Quick Discussion
# The essence here is that we have some third variable that drives
# our variable of interest but affects the outcome only through it.
# Example:
# Suppose we are interested in the effect of lead on crime.
# Lead pipes cause lead to leach into the water supply. If lead affects
# crime, we might want to regress crime on lead pipe usage.
# HOWEVER, because poor neighborhoods are more likely to use cheaper lead
# pipes, and more likely to have crime, we may be picking up the difference
# in wealth rather than a true effect.
# Solution: use distance to a lead smelter to predict usage of lead pipes.
# then use those predictions to look at the changes in crime. That way,
# the effect isn't being driven by the wealth of the counties in question
# (at least not as much).
# Paper: https://scholar.harvard.edu/files/jfeigenbaum/files/feigenbaum_muller_lead_crime.pdf
#################################################
# Difference-in-Differences
#################################################
set.seed(34531)
#Generating data
#----------------
#(panel data -- multiple cities over time)
n = 30 #number of cities
m = 50 #number of time periods (aside: both t and T are already defined in R)
data.matrix = replicate(n,{ # Each city
a = rnorm(m) #Noise in each period
a = a+2 #there is a time trend of 2/period.
a = cumsum(a) #This is a unit-root process
a = a + rnorm(1, sd=5) #each city has its own mean
})
#Make sure each city is a column and each time is a row
if (nrow(data.matrix) < ncol(data.matrix)) data.matrix = t(data.matrix)
treated.units = sample(n,n/2,replace=F) #picking 50% to treat at random
treatment.effect = 1 #Ground truth treatment effect is size 1
#Add treatment effect in treatment period (last time)
data.matrix[m,treated.units]= data.matrix[m,treated.units]+treatment.effect
#Problem
#---------------
# We want to identify the treatment effect.
# Idea 1
# We could just take the difference between treated and untreated units.
est1 = mean(data.matrix[m,treated.units])-mean(data.matrix[m,-treated.units])
est1
#-2 is way off. Why? Each city has its own mean, and that is being included.
mean(colMeans(data.matrix[-m,treated.units])) #47.2
mean(colMeans(data.matrix[-m,-treated.units])) #49.7
#This is causing a bias.
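To see that bias explicitly, using the group means printed above (back-of-the-envelope arithmetic, my addition):

```r
# est1 = true effect + difference in pre-existing city-level means + noise
# = 1 + (47.2 - 49.7) = -1.5, roughly the -2 we got
1 + (47.2 - 49.7)
```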
#Idea 2
#What if we just take the difference pre and post treatment for the
# treated units?
est2 = mean(data.matrix[m,treated.units])-mean(data.matrix[m-1,treated.units])
est2
#3.25 is still off. Why? There is a time trend of 2/period.
#Solution
#--------
#Diff-in-Diff
#Going to take the difference between treatment and control in pre and post periods.
pre_control = mean(data.matrix[m-1,-treated.units])
pre_treated = mean(data.matrix[m-1,treated.units])
post_control = mean(data.matrix[m,-treated.units])
post_treated = mean(data.matrix[m,treated.units])
est3 = (post_treated-post_control)-(pre_treated-pre_control)
est3
#1.12 -- this is actually close. And we can prove that under some conditions this is an unbiased estimate.
#Quick Discussion
# Broadly the problem diff-in-diff solves is that of simple time trends
# plus city fixed effects. Diff-in-diff relies on those time trends all
# being the same (i.e. every city added 2 per period) --
# AKA the parallel trends assumption.
# Examples are too numerous to be worth pointing out. Google it.
# WARNING: I've made no discussion of inference. Building confidence
# intervals/p-values will require more work. Google can help you here.
# Notably, bootstrapping may fall apart -- the bootstrap relies on
# independence assumptions that may not hold in panel data.
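The four-means calculation above can also be written as a regression -- the standard saturated 2x2 interaction formulation, where the `treated:post` coefficient equals the diff-in-diff exactly. A self-contained sketch on a fresh toy panel with the same structure (all names here are mine):

```r
# Fresh toy panel: unit-root cities with a common trend and own levels
set.seed(1)
n2 = 10; m2 = 8
panel = replicate(n2, cumsum(rnorm(m2) + 2) + rnorm(1, sd = 5))
tr = sample(n2, n2/2)
panel[m2, tr] = panel[m2, tr] + 1  # true effect = 1
# Four-means diff-in-diff (as in est3 above):
dd = (mean(panel[m2, tr]) - mean(panel[m2, -tr])) -
  (mean(panel[m2-1, tr]) - mean(panel[m2-1, -tr]))
# Same number from a saturated regression on the last two periods:
df = data.frame(y = c(panel[m2-1, ], panel[m2, ]),
  post = rep(c(0, 1), each = n2),
  treated = rep(as.numeric((1:n2) %in% tr), 2))
mod = lm(y ~ treated*post, data = df)
all.equal(unname(coef(mod)["treated:post"]), dd)  # TRUE
```

The regression form is what you'll usually see in applied work, because it extends naturally to more periods and controls.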
############################
# Synthetic Control Method
############################
#THIS IS WELL BEYOND ANYTHING YOU NEED TO KNOW FOR THE COURSE.
# I WAS MERELY ASKED FOR MORE REFERENCES.
#NB this section will use matrix multiplication (%*%)
# and define 4 functions.
# Synthetic Controls is best thought of as expanding on Diff-in-Diff.
# Setting is similar. Panel data:
set.seed(1554539)
# Generating Data
#----------------
n = 31 #Observed cities (one is treated, so 30 untreated)
m = 21 #Observed time periods (one is post-treatment, so 20 pre)
#again, t and T are both reserved characters in R
nfactor = 7 #number of unobserved, latent factors.
# Generate Factors
unobserved.factors = matrix(rnorm(m*nfactor),ncol=nfactor)
# Each column is a factor. Each row is a time period.
observations = matrix(nrow=m,ncol=n)
time.fixed.effects = rnorm(m)
for (i in 1:n) {
city.fixed.effect = rnorm(1) #each city has its own average
city.factor.loadings = rnorm(nfactor) #and its own relationships
#to the unobserved factors
#sum those up real quick
city = unobserved.factors%*%city.factor.loadings + city.fixed.effect
# Add some noise and the time.fixed.effects
city = city + time.fixed.effects + rnorm(m)
observations[,i] = city
}
#Treated city (number 1) gets a treatment effect in the post-treatment period
treatment.effect = 3 #THIS IS THE TRUTH WE WANT TO FIND
observations[m,1] = observations[m,1] + treatment.effect
# We don't observe these things
rm(unobserved.factors,nfactor,time.fixed.effects,treatment.effect,i)
rm(city,city.fixed.effect,city.factor.loadings)
# Problem
# -----------
# As discussed at the end of diff-in-diff, we need the cities to each have
# the same time trend for diff-in-diff to work.
# That doesn't hold in the data I've generated above.
# Let's try diff-in-diff anyhow.
pre_control = mean(observations[m-1,2:n])
pre_treated = observations[m-1,1]
post_control = mean(observations[m,2:n])
post_treated = observations[m,1]
diff_in_diff_estimate = (post_treated-post_control)-(pre_treated-pre_control)
diff_in_diff_estimate
# -1.6 which is way off from the truth of 3.
# Moreover, diff-in-diff is unsatisfying as a method. Do we really think
# a simple average of all the control cities is best? And can't we make use
# of the earlier pre-treatment data (here, 19 of the 20 pre-periods go unused)?
# Aside
# -----
# Notably, most SCM applications have only one treated unit.
# This makes the idea of trying to improve on the naive average
# particularly appealing.
# Solution
# ----------------
# So, how can we improve? Instead of weighting each control unit by 1/n in
# our average, we will use data-driven weights to build a weighted average.
# This takes advantage of those other pre-treatment periods.
# As it turns out, this will allow us to relax that parallel trends
# assumption and allow for much more complicated DGPs.
# That is SCM.
# Second Aside: it is standard (and what I'll do below) to impose two
# restrictions on our weights: they must be non-negative and sum to 1.
# I will also use an intercept term.
#Building the estimation routine
#To estimate the weights, I must first define a loss. It will be just
# Squared error loss. But the weights will be projected to meet our
# sum-to-one and non-negativity restrictions
scm_projection = function(x) {
intercept = x[1]
y = x[-1] #without intercept
y[y<0] = 0 #set negative weights to 0
y = y/sum(y) #scale so that it sums to 1
c(intercept,y)
}
#With the projection, I can define my loss function. It's just prediction
# errors, but I'm also making sure the intercept gets passed through.
scm_loss = function(x,treated.obs,control.obs) {
projected = scm_projection(x)
predictions = projected[1] + control.obs%*%projected[-1]
errors = predictions-treated.obs
mse = mean(errors**2)
mse
}
# Now we have a function which builds weights that optimize that loss
scm_weights = function(pre.treated,pre.control) {
m = nrow(pre.control) #pre periods b/c we don't use treatment period
# to figure out optimal weights
n = ncol(pre.control) #control units
#Going to use an optimization routine to set weights.
#need an initial condition. using diff-in-diff default
initial = c(mean(pre.treated),rep(1/n,n))
#Setting upper and lower bounds
upper.bound = c(Inf,rep(1,n))
lower.bound = c(-Inf,rep(0,n))
#Running optimization
opt = optim(initial, scm_loss, treated.obs=pre.treated, control.obs=pre.control,
            lower=lower.bound, upper=upper.bound, method="L-BFGS-B")
#returning weights -- properly scaled
scm_projection(opt$par)
}
# Now a function that actually does estimation
scm_estimate = function(observations) {
#assume first unit is treated, last period is treated.
m = nrow(observations)
n = ncol(observations)
#Divide up into treatment and control in pre-periods
pre.treated = observations[1:(m-1),1]
pre.control = observations[1:(m-1),2:n]
#use pre-periods to build weights
weights = scm_weights(pre.treated,pre.control)
#Divide into treatment and control in treatment period
post.control = observations[m,2:n]
post.treated = observations[m,1]
#Use control units and weights to make counterfactual prediction
counterfactual.prediction = weights[1]+weights[-1]%*%post.control
#Difference between outcome and counterfactual is our estimate
estimate = post.treated-counterfactual.prediction
estimate
}
scm_estimate(observations)
#2.23 --- a much better estimate than diff-in-diff's -1.6. Still off the
# truth of 3, but one can show that under suitable conditions the bias of
# this estimator vanishes as the number of pre-treatment periods grows.
# WARNING: Inference in SCM is not only not discussed here, but is an
# area of active research. Googling will hurt more than help.
# Building CIs and p-values is not an activity with agreed-upon standards,
# so please consult your statistician before using.
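One informal diagnostic that does appear in the SCM literature is the placebo (or permutation) test: re-run the estimator pretending each control city was the treated one, and see how unusual the real estimate looks against that distribution. A sketch -- `placebo_test` is my name, and it takes any estimator with scm_estimate's interface (treated unit in column 1, treatment in the last row):

```r
# Placebo test: how extreme is the real estimate vs. placebo estimates?
placebo_test = function(observations, estimator) {
  n = ncol(observations)
  real = estimator(observations)
  # Swap each control unit into the treated slot; drop the truly
  # treated unit from the donor pool to avoid contamination
  placebos = sapply(2:n, function(j) {
    estimator(observations[, c(j, setdiff(2:n, j))])
  })
  # Share of placebo effects at least as large as the real one
  mean(abs(placebos) >= abs(real))
}
# placebo_test(observations, scm_estimate)  # small value => effect looks unusual
```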