The goal of this course is to prepare you for future encounters with large datasets and complicated modelling tasks. For that reason, this course is not purely a statistics course, nor is it an econometrics or computer science course. We will draw on all three, with the objective of covering many basic frameworks that may be of value in the future. This means both covering the essence of many ideas and practicing their application to real-world data.
Techniques we will cover include an advanced overview of regression: linear and logistic; model choice: false discovery rates, information criteria and cross-validation; regularized regressions: ridge and lasso; computation techniques: bagging and the bootstrap; design questions: experiments and upsampling; machine learning questions: classification, decision trees and random forests. Depending on time constraints and student preferences, we may also cover neural nets, nonparametric regressions, time series, and text processing.
The emphasis of this course will be actual practice of techniques on real data. We will use Zillow housing data, public COVID data, and more. To this end, the course will focus on the use of R for actual data work.
I will assume you are familiar with all of the following: the fundamentals of probability, random variables (and functions thereof), normal and multinomial distributions, confidence/prediction intervals, hypothesis testing and sampling distributions. In particular, you should be comfortable with the basics of linear regression.
As all computing is conducted in R, I strongly encourage students to familiarize themselves with the basics of the language BEFORE class begins. See the computing section for more details.
I intend to publish (by February 26th) an optional assignment (homework 0), which will help you assess your own comfort with the required topics. If you are uncertain, please feel free to email me.
Tuesdays & Thursdays: 1:00 PM - 2:20 PM (Central)
This course will be fully remote. The plan is to break each 80-minute session into two 35-minute lectures with a 10-minute break. This plan is subject to change based on student feedback and other considerations.
This class uses R, which is available for free via www.r-project.org. You can download and install the software following directions there (do this ASAP). I also strongly recommend the use of RStudio, a graphical interface for using R. RStudio can be found at https://rstudio.com/products/rstudio/. Download the free desktop version.
Modern statistics is a computational endeavor. Learning to use a programming language like R can be quite challenging. However, it can also be immensely rewarding, and is critical to performing analysis of big data.
This is not a class on R. I will not assume you have a background using R coming into the course. However, as with any language, you will learn R best through practice. There are numerous great guides to learning R if you are new. I recommend www.codecademy.com/learn/learn-r, which should be free. There are many other great free resources available, but you could also purchase any of numerous books to help you. Adler’s R in a Nutshell comes highly recommended.
The primary course text will be “The Elements of Statistical Learning” (ESL) by Hastie, Tibshirani, and Friedman. However, for those of you less familiar with either R or statistics, I strongly recommend getting “An Introduction to Statistical Learning with Applications in R” (ISL) by James, Witten, Hastie, and Tibshirani instead of ESL. The two books cover much of the same material, and this course is likely to sit between them in terms of technical depth.
As we go through the course, I may name other supplementary textbooks, which may be of use to students who are either struggling with particular topics or wish to go into more depth on them, but they are by no means necessary for taking this course. Likely candidates:
Course grades will be based on a mix of small-group (2-3 students) homeworks, one individual homework, participation, and a take-home final. For the take-home final you will be given a question and a dataset, and you will have ~3 days to come to a conclusion and write a short (under 2 pages, plus graphs) summary of your methods and conclusion. Depending on student feedback, I may also add a short (under 20 questions) multiple-choice midterm.
Provisionally, the final grade will be 65% group homeworks, 5% individual homework, 10% participation, and 20% the take-home final. I plan for there to be 8 group homeworks and one individual homework (which is likely to consist of 2 short questions). In consultation with students, I may decide to make the individual homework optional.
All work will be submitted online. Working in groups is required for the group homeworks, and each group may submit only one document. Students pledge to adhere to Booth Honor Code standards on all work.
During the course, we will use the Canvas discussion board. You are of course welcome to email me with questions, but please try to post questions on Canvas as well. The board lets me answer questions publicly and gives your classmates the opportunity to answer them, which is often highly educational for them.
All elements of the course are open for discussion between me and the students. If you have questions, comments, or concerns, please reach out to me. I reserve the right to make changes to all elements of the curriculum.