As discussed in class, the central element of Big Data is making predictions. We want to predict all manner of things, using all manner of sources of data, and protect ourselves against all kinds of mistakes in making our predictions. To the extent that we can do a good job of predicting, we will have accomplished our goals.
As with many other skills, the simplest way to become good at making predictions is practice. To that end, we are going to have a simple prediction competition.
A topic of major concern, for the US at large and potentially for students personally, is Covid-19. I would therefore like to know how many US residents will test positive for Covid-19 on Sunday, May 9th. At the risk of being repetitive, the target is below:
How many people in the US will test positive for Covid-19 on Sunday May 9th?
Specifically, I want three numbers:
The answer to this question will be determined by the US CDC’s case tracker, looking at the number of people who tested positive on May 9th. Specifically, I’ll look at the Wayback Machine’s archive of that webpage which is closest to, but after, midnight (the morning of) May 17th (US Eastern time, i.e. the time zone the CDC is in). The week-long delay before looking back means that we should get an accurate assessment of the number (CDC numbers take up to 5 days to fully update).
The CDC provides a lot of data, which is available on the resolution page.
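To enter the competition proper, you must submit your prediction (consisting of 3 numbers) in two locations before midnight on May 7th. The first location is Canvas. The second location is a Google Form here.
The 3 numbers that make up your submission are the following: an estimate of the number of positive tests, a second estimate of that number that will be scored under “price is right” rules, and a probability that the number is below 25000.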
You are permitted to change your submission, up to the start of class.
You must also provide a 1 to 3 sentence description of how you made your predictions in order to enter the competition.
This is a graded homework assignment. To receive full credit, you simply need to submit your answers to Canvas by midnight Friday. However, as I’m not optimistic about being able to download all your answers from Canvas in a usable manner, you also need to complete the Google Form to actually compete.
This is a very difficult prediction task, so I’m no longer competing directly against you (sorry!). Prizes are also currently a source of some debate. I will keep you apprised.
Nevertheless, there will be at minimum some kind of certificate sent to the top three predictions on each of the first two targets.
The first prediction, an estimate of this number, will be ranked by the squared error between your prediction and the truth; the smaller the error, the better the rank.
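As a rough illustration of how this ranking could work, here is a minimal sketch in Python. With a single target, mean squared error reduces to the squared difference between a prediction and the truth. The function name, the student names, and all the numbers are hypothetical and chosen only for illustration; this is not the actual grading script.

```python
def rank_by_squared_error(predictions, truth):
    """Return (name, squared_error) pairs sorted from best (smallest error) to worst."""
    scored = [(name, (guess - truth) ** 2) for name, guess in predictions.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical entries and a hypothetical true value, for illustration only.
example_predictions = {"alice": 30000, "bob": 24000, "carol": 41000}
example_truth = 28000

squared_error_ranking = rank_by_squared_error(example_predictions, example_truth)
for name, error in squared_error_ranking:
    print(name, error)
```

Note that squaring the error penalizes large misses heavily: in this made-up example, a miss of 13000 scores far worse than twice a miss of 6500 would.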
The second prediction uses “price is right rules”. This means that predictions that are above the true value will be ranked below all predictions that are below the true value (google it if this doesn’t make sense). Predictions below the truth will be ranked by how far below the truth they are. The closest prediction that is not above the truth is the best prediction.
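The “price is right” ordering can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the student names, and the numbers are hypothetical, and the rule for ordering predictions that are above the truth among themselves (closest first) is my assumption, since the rules above only say that they rank below every prediction that is not above the truth.

```python
def rank_price_is_right(predictions, truth):
    """Return names ordered best to worst under "price is right" rules.

    Predictions at or below the truth come first, closest to the truth first.
    Predictions above the truth come last; ordering them by distance is an
    assumption made for this sketch.
    """
    def sort_key(item):
        _, guess = item
        went_over = guess > truth          # False sorts before True
        distance = abs(truth - guess)
        return (went_over, distance)

    return [name for name, _ in sorted(predictions.items(), key=sort_key)]


# Hypothetical entries and a hypothetical true value, for illustration only.
example_entries = {"alice": 30000, "bob": 24000, "carol": 27500, "dave": 28500}
example_true_value = 28000

price_is_right_ranking = rank_price_is_right(example_entries, example_true_value)
print(price_is_right_ranking)
```

In this made-up example, dave is only 500 away from the truth but went over, so he ranks below bob, who is 4000 under.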
The final prediction is your probability that the number is below 25000.
Awards are TBD.
I intend to do two more competitions. We will see if that works out.