March 30th 2021

A Plea

You are not required to keep your cameras on.

Please do. It helps me immensely to be able to see how people respond as I say things.

If 15% of you are lost, I want to know about it. But with 20 cameras on, that is only three lost faces, and I have to notice at least one of them.

More generally, though, please jump in with questions when you have them. Use the chat if you prefer.

I’m here to teach you.

  • If you have questions,
  • If you have trouble,
  • If you have feedback for me,
  • If you have ideas for improving the course,

Please let me know.

Today’s Class

  1. Introduction
    • Goals
    • Material
    • Syllabus
  2. What is Big Data?
  3. Computation: R, Big Data
  4. Data Viz, Statistics, Dimension Reduction
  5. Testing: False Discovery

Big Data


Big data is not an end in itself

We are doing Inference at LARGE scale!

The goal here is to learn what we can trust, how it can be used, and how to learn more.

We want to help you make good decisions with lots of messy data.

By necessity then, this class will be a mix of theory and practice.

Theory and Practice

You need a solid foundation in statistical principles.

  • We don’t want to shoot ourselves in the foot

You also need a hefty dose of ‘rules of thumb’.

  • Inefficient procedures can be wildly inefficient
  • BUT we don’t want to reinvent the wheel.

This is hands-on work.

What is Big Data?

There are a lot of names out there for a large cluster of very similar disciplines:

  • Econometrics
  • Data Science
  • Big Data
  • Statistics
  • Data Mining
  • Machine Learning

There are differences in focus and style here. But the similarities are larger than the distinctions.

What is Big Data?

Big Data, as a name, originates with computer scientists working with data too large for any single computer.

“Big Data is one GB larger than my RAM”.

But often it is associated with administrative data, where statistical notions of sampling error may start to fall apart.

More generally, it is in a nexus where a fairly tight connection between inference and prediction can start to break down.
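To make the RAM quip above concrete, here is a rough back-of-the-envelope sketch. It assumes a dense table of 8-byte floating point numbers; the row and column counts and the 16 GB laptop are hypothetical figures chosen for illustration, and real storage formats add overhead on top of this.

```python
def dense_size_gb(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Approximate in-memory size of a dense numeric table, in GB (10^9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9


# Hypothetical example: 100 million rows, 25 numeric columns, 16 GB of RAM.
ram_gb = 16
size_gb = dense_size_gb(100_000_000, 25)
fits_in_ram = size_gb <= ram_gb

print(f"Needs about {size_gb:.1f} GB; laptop has {ram_gb} GB; fits: {fits_in_ram}")
```

By this crude measure the table is "big" for that machine even though it would be routine on a larger server, which is why the definition is relative to your hardware.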

What is Big Data?

Big Data is focused on extracting useful truth from large datasets.

  • Infer patterns in high dimensional data
  • Simple and scalable algorithms
  • Honest and humble model selection
  • Manage conflict between “useful” and “true”
  • Make some decision

What is Big Data?

Big in number of observations (size \(n\)).

Big in number of variables (dimension \(p\)).

In these settings you cannot:

  • Look at each variable and make a decision (t-tests).
  • Choose from a small set of nested models (F-tests).
  • Plot every variable to look for interactions and transformations.
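A small simulation can suggest why testing each variable one at a time stops working when \(p\) is large. This is only a sketch under stated assumptions: every variable is pure noise (no true effects at all), the variance is treated as known so a z-test stands in for the t-test, and the sizes (1000 variables, 100 observations, a 5% level) are made up for illustration.

```python
import math
import random


def count_false_discoveries(n_vars=1000, n_obs=100, z_crit=1.96, seed=0):
    """Test H0: mean = 0 separately for each of n_vars pure-noise variables.

    Every null hypothesis is true by construction, so each rejection
    is a false discovery.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_vars):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        # z-statistic for the sample mean, assuming known unit variance
        z = (sum(sample) / n_obs) * math.sqrt(n_obs)
        if abs(z) > z_crit:
            rejections += 1
    return rejections


false_positives = count_false_discoveries()
print(f"{false_positives} of 1000 noise variables look 'significant' at the 5% level")
```

At a 5% level we should expect roughly 50 of the 1000 noise variables to pass the test by chance alone, so a list of "significant" variables cannot be taken at face value when \(p\) is large.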

Some of our tools are straight out of Stats 101 (regression, confidence intervals), some are close relatives of Stats 101 (PCA, MSE), and some are new beasts (trees, bagging).

Why should you care?

  • Extremely employable
  • Avoid being hoodwinked by nutjobs with numbers.
  • Avoid hoodwinking yourself with numbers. They want to lie to you.
  • It's fun. Genuinely.

Does data really want to lie to you?