The dataset consists of 13,319 reviews of selected Amazon products from January through October 2012. Each review includes product information, a rating, and plain review text. The data come as three tables (none of which is a true CSV; they are all space-separated):
Review_subset.csv: a table containing, for each review, its product and user IDs, numeric Score, time, summary, and additional review and product attributes (the full set of columns is printed by the code below).
word_freq.csv: a simple triplet-matrix representation of the word counts in the review text, with three columns:
- Review ID: the row index into Review_subset.csv
- Word ID: the row index into words.csv
- Times Word: how many times that word occurred in that review
words.csv: a file containing 1125 alphabetically ordered words that occur in the reviews.
The code below loads the data.
# READ REVIEWS
data = read.table("Review_subset.csv",header=TRUE)
dim(data)
## [1] 13319 9
colnames(data)
## [1] "ProductId" "UserId" "Score" "Time"
## [5] "Summary" "Nrev" "Length" "Prod_Category"
## [9] "Prod_Group"
# READ WORDS
words = read.table("words.csv")
words = words[,1]
length(words)
## [1] 1125
# READ text-word pairings file
doc_word = read.table("word_freq.csv")
colnames(doc_word) = c("Review ID","Word ID","Times Word")
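A quick peek confirms the triplet structure described above, one row per review-word pair (the exact rows shown will depend on your copy of the file):
# Inspect the triplet table: one row per (review, word) pair
dim(doc_word)
head(doc_word)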
We would like to pre-screen words that associate with ratings. To this end, we run a series of separate marginal regressions of review Score on word presence in the review text, one for each of the 1125 words.
In the starter script below, you will find code to run these marginal regressions (both in parallel and sequentially). The code gives you a p-value for the marginal effect of each word. That is, for each word \(j\) with count \(x_{ji}\) in review \(i\), we fit \[ {\tt stars}_i = \alpha + \beta_j \, I\{x_{ji}>0\} + \epsilon_{ji} \] and return the p-value for the test of \(H_0: \beta_j = 0\). We'll use these 1125 separate regressions to screen words.
First we want a matrix of word counts, where entry (i, j) records how many times word j was used in review i; below we will threshold it to get word presence.
# Create a matrix of word counts (rows = reviews, columns = words)
spm = matrix(0L, nrow = nrow(data), ncol = length(words))
# Fill it in one (review, word, count) triplet at a time
for (index in 1:nrow(doc_word)) {
  i = doc_word[index, 1]   # review ID (row)
  j = doc_word[index, 2]   # word ID (column)
  spm[i, j] = doc_word[index, 3]   # times word j appears in review i
}
colnames(spm) = words
dim(spm)
## [1] 13319 1125
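As an aside (no need to run it, the loop above already did the work), the same fill can be written without a loop by indexing spm with a two-column matrix of (row, column) positions; this is usually much faster in R:
# Vectorized alternative to the fill loop above
spm = matrix(0L, nrow = nrow(data), ncol = length(words))
spm[cbind(doc_word[,1], doc_word[,2])] = doc_word[,3]
colnames(spm) = words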
Use the function object.size to see how much memory things take up.
object.size(spm)
## 60009704 bytes
spm is now ~60 MB. (If we had used 0 instead of 0L when we built the matrix, that would roughly double, since everything would be stored as 8-byte doubles instead of 4-byte integers.)
Why is it so big? It is chock full of zeroes, each of which is stored as a full 4-byte integer.
mean(spm == 0)
## [1] 0.9971154
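The size is easy to reconstruct by hand: every entry, zero or not, costs 4 bytes as an integer, so
# 13,319 reviews x 1,125 words x 4 bytes per integer entry
nrow(spm) * ncol(spm) * 4
## [1] 59935500
which is essentially the object.size figure above (the small difference is mostly the column-name attribute).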
There are ways of storing matrices that avoid this problem – sparse matrices. Example below, but not quite worth it today.
# DON'T RUN THIS CODE
library(gamlr) # Needs install; loads the Matrix package, which provides sparseMatrix()
spm = sparseMatrix(i = doc_word[,1],
                   j = doc_word[,2],
                   x = doc_word[,3],
                   dimnames = list(id = as.character(1:nrow(data)), words = words))
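Had you run the sparse construction, its footprint would be a small fraction of the dense 60 MB, since only the nonzero triplets (values plus their row and column positions) are stored. A sketch of the check, also not meant to be run today:
# (Also not run.) Sparse storage keeps only the nonzero triplets
object.size(spm)                                  # far smaller than the dense version
nrow(doc_word) / (nrow(data) * length(words))     # same ~0.3% density as before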
Create a dense data frame of word presence (also going to be big), and a Y variable for use later.
P = as.data.frame(as.matrix(spm>0))
stars = data$Score
Now we will build a function that will run each of our marginal regressions.
margreg = function(p){
  # Regress stars on presence of word p; return the p-value on the word coefficient
  fit = lm(stars ~ P[[p]])
  sf = summary(fit)
  return(sf$coef[2, 4])  # row 2 = word term, column 4 = Pr(>|t|)
}
margreg(10) #the pval for the 10th word
## [1] 0.004430345
Coincidentally, word 10 looks pretty significant. What is it?
words[10]
## [1] "addictive"
Okay, that explains that.
There are 1125 regressions to run here, which could be time consuming. So we will look at a few ways to do the procedure.
The simplest way is a plain for loop. Make sure you initialize the vector you want to save output to, or it will be substantially slower (see the sketch after the loop below).
pvals = numeric(length(words))
# About 15 seconds on my laptop.
for (i in 1:length(words)) {
pvals[i] = margreg(i)
}
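To see why initialization matters: the version below grows the output with c() inside the loop, which copies the whole vector on every iteration and makes the loop roughly quadratic in the number of words. A sketch of the anti-pattern, for illustration only:
# Slow anti-pattern: growing the result vector one element at a time
pvals = c()
for (i in 1:length(words)) {
  pvals = c(pvals, margreg(i))  # re-copies pvals on every pass
}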
But why a for loop? We are not doing anything differently as a result of prior iterations. So we should use the apply() family; this is idiomatic R. We will use sapply, which returns a vector when it can: it goes through each element of its first argument and runs the function margreg on that element, just like the for loop above.
pvals = sapply(1:length(words),margreg)
Not really any faster, but it is much clearer that you aren't relying on the order of iterations, it is much easier to parallelize if we want, and we don't need to initialize the output up front.
So what if this were slower? We would want to parallelize, and apply will help us: it shows us where the embarrassingly parallel pieces are and makes parallelizing them dead easy.
library(parallel) # BASE R package. Should come pre-installed
ncores = detectCores()-1 #good form on your own computer to lose 1
ncores
# Make a cluster
cl = makeCluster(ncores)
# Export data to the cluster
clusterExport(cl,c("stars","P"))
# Run the regressions in parallel
# The same syntax as sapply, but first we tell it the cluster name.
pvals = parSapply(cl,1:length(words),margreg)
# About 2 seconds on my computer w/ 7 cores.
# Turn off cluster
stopCluster(cl)
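On macOS or Linux (not Windows), the same parallel package offers a fork-based alternative that skips the cluster setup and export, since forked workers share the parent's memory. A sketch under that assumption:
# Fork-based alternative (Unix-alikes only); workers see stars, P, margreg directly
pvals = unlist(mclapply(1:length(words), margreg, mc.cores = ncores))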
Just going to add names to my pvals, and remove the big variables.
rm(data,doc_word,spm,P)
names(pvals) = words
Now we have a vector of pvals, each of which corresponds to a word.
If you have trouble, I strongly recommend you go through the example code posted alongside lecture 2.
Plot the p-values and comment on their distribution (2 sentence max).
Let’s do standard statistical testing. How many tests are significant at alpha levels 0.05 and 0.01?
What is the p-value cutoff for 1% FDR? Plot the rejection region.
How many discoveries do you find at q=0.01 and how many do you expect to be false?
What are the 10 most significant words? Was ‘addictive’ significant? Do these results make sense to you? (2 sentence max)
What are the advantages and disadvantages of our FDR analysis? (4 sentence max)
Submit a pdf, docx, or html file containing your responses to the above questions on Canvas before class on Tuesday, April 6th. One submission per group.
Literally just for your entertainment and to motivate future weeks. Please do not submit responses to this. Full answers will not be provided; sketches of answers may be.
Take the diabetes dataset and replicate the work done there.