1 Data: Amazon Reviews

The dataset consists of 13,319 reviews of selected products on Amazon from January to October 2012. Reviews include product information, ratings, and a plain-text review. The data consist of three tables (despite the .csv extensions, none is a true comma-separated file; all three are space [i.e. " "] separated):

1.1 Review_subset.csv

This is a table containing, for each review, its

  • ProductId: Amazon ASIN product code
  • UserId: ID of the reviewer
  • Score: numeric 1-5 (the number of stars)
  • Time: date of the review
  • Summary: review summary in words
  • Nrev: number of reviews by the user
  • Length: number of words in the review
  • Prod_Category: Amazon product category
  • Prod_Group: Amazon product group

1.2 Word_freq.csv

This file is a simple triplet-matrix representation of the word counts in the review text. Its columns are

  • Review ID: the row index in Review_subset.csv
  • Word ID: the row index in words.csv
  • Times Word: how many times the word occurred in the review

1.3 Words.csv

This file contains 1125 alphabetically ordered words that occur in the reviews.

2 Data exploration

The code below loads the data.

# READ REVIEWS

data = read.table("Review_subset.csv",header=TRUE)
dim(data)
## [1] 13319     9
colnames(data)
## [1] "ProductId"     "UserId"        "Score"         "Time"         
## [5] "Summary"       "Nrev"          "Length"        "Prod_Category"
## [9] "Prod_Group"
# READ WORDS
words = read.table("words.csv")
words = words[,1]
length(words)
## [1] 1125
# READ text-word pairings file
doc_word = read.table("word_freq.csv")
colnames(doc_word) = c("Review ID","Word ID","Times Word")

3 Marginal Regression Screening

We would like to pre-screen words that associate with ratings. To this end, we run a series of (separate) marginal regressions of review Score on word presence in review text for each of 1125 words.

In the starter script below, you will find code to run these marginal regressions (both in parallel and sequentially). The code gives you a set of p-values for the marginal effect of each word. That is, we fit \[ {\tt stars}_i = \alpha + \beta_j \mathbb{1}_{[x_{ji}>0]} + \epsilon_{ji} \] for each word term \(j\) with count \(x_{ji}\) in review \(i\), and return the p-value associated with a test of \(\beta_{j}\neq0\). We’ll use these 1125 separate regressions to screen words.
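To make the model concrete, here is a minimal sketch of one such marginal regression on made-up data (a simulated score and a simulated presence indicator, not the review data):

```r
# Toy illustration of a single marginal regression, on simulated data.
set.seed(1)
n = 200
present = rbinom(n, 1, 0.3)          # plays the role of I[x_ji > 0]
score = 4 - 0.5*present + rnorm(n)   # reviews using the word rate lower
fit = lm(score ~ present)
summary(fit)$coef[2, 4]              # the p-value for beta_j
```

This is exactly what we will do below, once for each of the 1125 words.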

3.1 Word Presence Matrix

First we want a matrix of word counts, where each column tells us in which reviews a certain word was used (and how often). We will reduce it to simple presence/absence in a moment.

# Create a matrix of word counts

spm = matrix(0L,nrow = nrow(data),ncol=length(words))
for (index in 1:nrow(doc_word)) {
  i = doc_word[index,1]         # review (row) index
  j = doc_word[index,2]         # word (column) index
  spm[i,j] = doc_word[index,3]  # count of word j in review i
}
colnames(spm) = words
dim(spm)
## [1] 13319  1125

3.1.1 An aside: this is big.

Use the function “object.size” to see sizes for things.

object.size(spm)
## 60009704 bytes

SPM is now ~60 MB. (If we had used 0 instead of 0L when we built the matrix, we could have doubled that, by storing every entry as an 8-byte double instead of a 4-byte integer.)
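That figure is easy to check by hand, since each cell holds one 4-byte integer:

```r
# 13319 reviews x 1125 words, 4 bytes per integer cell:
13319 * 1125 * 4   # 59935500 bytes, plus a little object overhead
```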

Why is it so big? It is chock full of zeroes, each of which is being stored as a full 4-byte integer.

mean(spm == 0)
## [1] 0.9971154

There are ways of storing matrices that avoid this problem – sparse matrices. Example below, but not quite worth it today.

# DON'T RUN THIS CODE
library(gamlr) # Needs install; loads the Matrix package, which provides sparseMatrix.
spm = sparseMatrix(i=doc_word[,1],
                  j=doc_word[,2],
                  x=doc_word[,3],
                  dimnames=list(id=1:nrow(data),words=words))

3.2 Simplify from Counts to Presence

Create a dense dataframe of word presence (also going to be big), and a Y variable for use later.

P = as.data.frame(as.matrix(spm>0)) 

stars = data$Score

Now we will build a function that will run each of our marginal regressions.

margreg = function(p){
    fit = lm(stars~P[[p]])  # regress stars on presence of word p
    sf = summary(fit)
    return(sf$coef[2,4])    # p-value on the word-presence coefficient
}
margreg(10) # the p-value for the 10th word
## [1] 0.004430345

Coincidentally, word 10 looks pretty significant. What is it?

words[10]
## [1] "addictive"

Okay, that explains that.

3.3 Calculate Pvalues for every word

There are 1125 regressions to run here, which could be time consuming, so we will look at a few ways to do the procedure.

For loop

The simplest way. Make sure you initialize the vector you want to save output to; growing it inside the loop would be substantially slower.

pvals = numeric(length(words))

# About 15 seconds on my laptop.
for (i in 1:length(words)) {
  pvals[i] = margreg(i)
} 

But why a for loop? We are not doing anything differently as a result of prior iterations, so we should use the apply() family. This is idiomatic R.

Apply version

We will use sapply, which returns a vector when it can. It goes through each element of its first argument and runs the function margreg on that element, just like the for loop above.

pvals = sapply(1:length(words),margreg)

It is not really any faster, but it makes much clearer that you aren’t using the order of iterations, it is much easier to parallelize if we want, and we don’t need to initialize up front.

3.3.1 Parallel

So what if this was slower? We would want to parallelize, and apply will help us: it shows us where the embarrassingly parallel pieces are, and makes parallelizing them dead easy.

library(parallel) # BASE R package. Should come pre-installed
ncores = detectCores()-1 #good form on your own computer to lose 1
ncores
# Make a cluster
cl = makeCluster(ncores) 
# Export data to the cluster
clusterExport(cl,c("stars","P")) 
# Run the regressions in parallel
# The same syntax as sapply, but first we tell it the cluster name.
pvals = parSapply(cl,1:length(words),margreg)
# About 2 seconds on my computer w/ 7 cores. 
# Turn off cluster
stopCluster(cl)

3.3.2 Wrapup

Just going to add names to my pvals, and remove the big variables.

rm(data,doc_word,spm,P)
names(pvals) = words

Now we have a vector of pvals, each of which corresponds to a word.

4 Homework Questions:

4.1 Q1

Plot the p-values and comment on their distribution (2 sentence max).

hist(pvals,breaks=100,freq=F)

Two sentences:
Looks like there are a number of very small p-values and a long tail of larger p-values. This is not uniform, as it would be if the nulls were all true, which is good to see.

More commentary:
I like the histogram, but a density plot, a scatterplot or other plots could be good. More broadly, deviation from uniformity is the thing we want to see. In histograms I tend to like many thin bins rather than a few big ones, but it is a personal preference.
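To see what the all-nulls benchmark looks like, here is a quick simulation (made-up data, same number of tests): when there is no signal anywhere, the p-values come out uniform.

```r
# Under the null everywhere, p-values are uniform on [0,1].
set.seed(1)
null_pvals = replicate(1125, {
  y = rnorm(100); x = rnorm(100)        # no real relationship
  summary(lm(y ~ x))$coef[2, 4]
})
hist(null_pvals, breaks=100, freq=F)    # roughly flat at height 1
mean(null_pvals < 0.05)                 # close to 0.05
```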

4.2 Q2

Let’s do standard statistical testing. How many tests are significant at levels alpha = 0.05 and alpha = 0.01?

sum(pvals < 0.05)
## [1] 461
sum(pvals < 0.01)
## [1] 348
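Keep in mind that with 1125 tests, some of these are surely false discoveries; even if every null were true, we would still expect about alpha times 1125 rejections by chance:

```r
# Expected rejections under the global null:
0.05 * 1125   # about 56 at alpha = 0.05
0.01 * 1125   # about 11 at alpha = 0.01
```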

4.3 Q3

What is the p-value cutoff for 1% FDR? Plot the rejection region.

fdr_cut = function(pvals, q){
  pvals = pvals[!is.na(pvals)]   # Ditch NAs
  p = length(pvals) #Count hypotheses
  k = rank(pvals, ties.method="min") #Rank pvals
  alpha = max(pvals[ pvals <= (q*k/p) ]) #Find relevant pval
  alpha #Return
}
fdr_plot = function(pvals,q,log="",...) {
    alpha = fdr_cut(pvals,q) #Get the alpha using other fn
    p = length(pvals) # Count hypotheses
    sig = factor(pvals<=alpha) #Trick for coloring plot
    levels(sig) = c("grey","maroon") #Trick cont.
    o = order(pvals) #Helps sort pvals and colors...
    plot(pvals[o], col=sig[o], pch=20, ..., 
         ylab="p-values", xlab="tests ordered by p-value", 
         main = paste('FDR =',q),log=log) #Plot
    lines(1:p, q*(1:p)/p,col="blue") #Line for FDR threshold
}

q = 0.01
fdr_cut(pvals,q)
## [1] 0.002413249
fdr_plot(pvals,q)
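As a sanity check on the cutoff rule (run here on simulated p-values, not the review data): the q*k/p rule inside fdr_cut is the Benjamini-Hochberg procedure, so it should flag the same tests as R’s built-in p.adjust with method "BH".

```r
# Cross-check the BH cutoff rule against p.adjust, on simulated p-values.
set.seed(1)
sim = c(runif(900), rbeta(100, 1, 1000))   # mostly nulls plus some signals
q = 0.1
k = rank(sim, ties.method="min")
alpha = max(sim[sim <= q*k/length(sim)])   # the rule used in fdr_cut
all((sim <= alpha) == (p.adjust(sim, "BH") <= q))
```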