1 Data: Amazon Reviews

The dataset consists of 13,319 reviews of selected products on Amazon from Jan-Oct 2012. Each review includes product information, a rating, and plain-text review content. The data consists of three tables (none of which is a true CSV; despite the extension, they are all space [i.e. " "] separated):

1.1 Review_subset.csv

This is a table containing, for each review, its

  • ProductId: Amazon ASIN product code
  • UserId: ID of the reviewer
  • Score: numeric 1-5 (the number of stars)
  • Time: date of the review
  • Summary: review summary in words
  • Nrev: number of reviews by the user
  • Length: number of words in the review
  • Prod_Category: Amazon product category
  • Prod_Group: Amazon product group

1.2 Word_freq.csv

This file is a simple triplet-matrix representation of word counts from the review text, with columns

  • Review ID: the row index of Review_subset.csv
  • Word ID: the row index of words.csv
  • Times Word: how many times the word occurred in the review

1.3 Words.csv

This file contains 1125 alphabetically ordered words that occur in the reviews.

2 Data exploration

The code below loads the data.

# READ REVIEWS

data = read.table("Review_subset.csv",header=TRUE)
dim(data)
## [1] 13319     9
colnames(data)
## [1] "ProductId"     "UserId"        "Score"         "Time"         
## [5] "Summary"       "Nrev"          "Length"        "Prod_Category"
## [9] "Prod_Group"
# READ WORDS
words = read.table("words.csv")
words = words[,1]
length(words)
## [1] 1125
# READ text-word pairings file
doc_word = read.table("word_freq.csv")
colnames(doc_word) = c("Review ID","Word ID","Times Word")
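
A quick sanity check at this point (a suggestion, not part of the original script) is to peek at the triplets you just read in:

# OPTIONAL: peek at the text-word pairings (one row per nonzero review-word count)
head(doc_word)
dim(doc_word)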

3 Marginal Regression Screening

We would like to pre-screen words that associate with ratings. To this end, we run a series of (separate) marginal regressions of review Score on word presence in review text for each of 1125 words.

In the starter script below, you will find code to run these marginal regressions (both in parallel and sequentially). The code gives you a p-value for the marginal effect of each word. That is, we fit \[ {\tt stars}_i = \alpha + \beta_j I[x_{ji}>0] + \epsilon_{ji} \] for each word term \(j\) with count \(x_{ji}\) in review \(i\), and return the p-value from a test of \(\beta_{j}=0\) against \(\beta_{j}\neq0\). We’ll use these 1125 independent regressions to screen words.
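
To make the formula concrete, here is a toy version of a single marginal regression on simulated data (this is just an illustration, not the Amazon reviews; the real version is built up in the sections below):

# TOY ILLUSTRATION: one marginal regression on simulated data
set.seed(1)
toy_stars    = sample(1:5, 100, replace=TRUE)  # fake 1-5 ratings
toy_presence = rbinom(100, 1, 0.2)             # fake indicator: word j present in review
toy_fit      = lm(toy_stars ~ toy_presence)
summary(toy_fit)$coef[2,4]                     # p-value for beta_j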

3.1 Word Presence Matrix

First we want a reviews-by-words matrix, where each column tells us in which reviews (and how often) a certain word was used; we will reduce these counts to simple presence indicators in the next section.

# Fill a dense reviews-by-words matrix from the (Review ID, Word ID, Times Word) triplets

spm = matrix(0L,nrow = nrow(data),ncol=length(words))
for (index in 1:nrow(doc_word)) {
  i = doc_word[index,1]          # review ID
  j = doc_word[index,2]          # word ID
  spm[i,j] = doc_word[index,3]   # count of word j in review i
}
colnames(spm) = words
dim(spm)
## [1] 13319  1125

3.1.1 An aside: this is big.

Use the function “object.size” to see sizes for things.

object.size(spm)
## 60009704 bytes

SPM is now ~60 MB. (If we had used 0 instead of 0L when we built our matrix, we would roughly double that, by storing everything as 8-byte doubles instead of 4-byte integers.)
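
As a quick check of that claim (a sketch, not part of the original script), you can compare the sizes of integer-filled and double-filled matrices of the same shape:

# Integer storage (4 bytes per entry) vs double storage (8 bytes per entry)
object.size(matrix(0L, nrow(data), length(words)))
object.size(matrix(0, nrow(data), length(words)))  # roughly twice as large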

Why is it so big? It is chock full of zeroes, each of which is being stored as a full 4-byte integer.

mean(spm == 0)
## [1] 0.9971154

There are ways of storing matrices that avoid this problem – sparse matrices. Example below, but not quite worth it today.

# DON'T RUN THIS CODE
library(gamlr) #Needs install. 
spm = sparseMatrix(i=doc_word[,1],
                  j=doc_word[,2],
                  x=doc_word[,3],
                  dimnames=list(id=1:nrow(data),words=words))
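
If you are curious about the savings, the sketch below builds a sparse copy under a different (hypothetical) name, so nothing above is overwritten, and checks its size. It assumes the Matrix package, which gamlr loads.

# SKETCH ONLY: size of a sparse version (Matrix package; loaded by gamlr)
library(Matrix)
spm_sparse = sparseMatrix(i=doc_word[,1], j=doc_word[,2], x=doc_word[,3],
                          dims=c(nrow(data), length(words)))
object.size(spm_sparse)  # stores only the nonzero counts plus their indices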

3.2 Simplify from Counts to presence

Create a dense data frame of word presence indicators (also going to be big), and a Y variable (the star ratings) for use later.

P = as.data.frame(as.matrix(spm>0)) 

stars = data$Score

Now we will build a function that will run each of our marginal regressions.

margreg = function(p){
    fit = lm(stars~P[[p]])   # regress stars on presence of word p
    sf = summary(fit)
    return(sf$coef[2,4])     # p-value for the word-presence coefficient
}
margreg(10) #the pval for the 10th word
## [1] 0.004430345

Coincidentally, word 10 looks pretty significant. What is it?

words[10]
## [1] "addictive"

Okay, that explains that.

3.3 Calculate P-values for every word

There are 1125 regressions to run here, which could be time consuming. So we will look at a few ways to do the procedure.

3.3.1 For loop

This is the simplest way. Make sure you initialize (preallocate) the vector you want to save output to; growing it inside the loop can make things substantially slower (see the toy comparison after the loop below).

pvals = numeric(length(words))

# About 15 seconds on my laptop.
for (i in 1:length(words)) {
  pvals[i] = margreg(i)
} 
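
If you are curious why preallocation matters, here is a toy comparison (unrelated to the regressions themselves): growing a vector with c() forces R to copy it on every iteration, while writing into a preallocated vector does not.

# TOY COMPARISON: growing a vector vs. preallocating (not part of the assignment)
n = 5e4
system.time({ x = c();        for (i in 1:n) x = c(x, i) })  # repeated copying
system.time({ x = numeric(n); for (i in 1:n) x[i] = i })     # preallocated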

But why a for loop? We are not doing anything differently as a result of prior iterations.
So we should use the apply() family. This is idiomatic R.

3.3.2 Apply version

We will use sapply, which returns a vector when it can. It goes through each element of its first argument, and runs the function margreg on that element. Just like the for loop above.

pvals = sapply(1:length(words),margreg)

Not any faster really. But it is much clearer you aren’t using the order of iterations. And it is much easier to parallelize if we want. And we don’t need to initialize up front.
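
A slightly stricter variant (not in the original script) is vapply, which makes you declare the expected output shape, so you get an error instead of a surprise if margreg ever returns something unexpected:

# vapply: like sapply, but with a declared output template (one numeric per word)
pvals = vapply(1:length(words), margreg, numeric(1))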

3.3.3 Parallel

So what if this were slower? We want to parallelize. And apply will help us. It shows us where embarrassingly parallel things are, and makes parallelizing them dead easy.

library(parallel) # BASE R package. Should come pre-installed
ncores = detectCores()-1 #good form on your own computer to lose 1
ncores
# Make a cluster
cl = makeCluster(ncores) 
# Export data to the cluster
clusterExport(cl,c("stars","P")) 
# Run the regressions in parallel
# The same syntax as sapply, but first we tell it the cluster name.
pvals = parSapply(cl,1:length(words),margreg)
# About 2 seconds on my computer w/ 7 cores. 
# Turn off cluster
stopCluster(cl)
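
On macOS or Linux you could instead use the fork-based mclapply from the same parallel package, which skips the explicit cluster setup and clusterExport step (this route is not supported on Windows):

# Fork-based alternative (macOS/Linux only); forked workers inherit stars and P automatically
pvals = unlist(mclapply(1:length(words), margreg, mc.cores=ncores))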

3.3.4 Wrapup

Just going to add names to my pvals, and remove the big variables.

rm(data,doc_word,spm,P)
names(pvals) = words

Now we have a vector of pvals, each of which corresponds to a word.
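
Because pvals is now a named vector, you can look up any single word directly, for example the one we checked earlier:

pvals["addictive"]  # the same p-value we got from margreg(10)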

4 Homework Questions:

If you have trouble, I strongly recommend you go through the example code posted alongside lecture 2.

4.1 Q1

Plot the p-values and comment on their distribution (2 sentence max).

4.2 Q2

Let’s do standard statistical testing. How many tests are significant at alpha levels 0.05 and 0.01?

4.3 Q3

What is the p-value cutoff for 1% FDR? Plot the rejection region.

4.4 Q4

How many discoveries do you find at q=0.01 and how many do you expect to be false?

4.5 Q5

What are the 10 most significant words? Was ‘addictive’ significant? Do these results make sense to you? (2 sentence max)

4.6 Q6

What are the advantages and disadvantages of our FDR analysis? (4 sentence max)

5 Submission

Submit a pdf, docx, or html file containing your responses to the above questions on Canvas before class on Tuesday, April 6th. One submission per group.

6 Optional Exercises

These are literally just for your entertainment and to motivate future weeks. Please do not submit responses to them. Full answers will not be provided; sketches of answers may be.

  1. Do the above regressions, but against word counts instead of word presences.
  2. As above, but explicitly model the outcome distribution (i.e. bounded discrete values)
  3. Use ggplot for your plotting
  4. Compare the rejections you get with word counts to the rejections you get with word presences. Are they similar?
  5. Consider methods for choosing between the multiple models we’ve now created.
  6. Look through the (posted) diabetes dataset. Replicate the work done there.