This is a basic introduction to some of the features of the tidyverse.
The tidyverse is a collection of R packages, maintained by Hadley Wickham (and others), that share a unified syntax well suited to working with data.
library(tidyverse) #Load into memory, show list of packages
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
As you can see, it attaches the following packages as the “core Tidyverse”: ggplot2, purrr, tibble, dplyr, tidyr, stringr, readr, and forcats.
When you install the tidyverse package, it also installs a number of supplemental packages, including magrittr, which provides the pipe %>%.
Most of the time, the identifying feature of tidyverse code is the use of the pipe %>%, which saves us from having to store each intermediate step of a calculation.
E.g.
oj = read_csv("oj/oj.csv")
# Non tidyverse styling:
df1 = oj[oj$brand=="dominicks",1:6] #Pain to read, unclear
#Tidyverse style
df2 = oj %>% 
  filter(brand == "dominicks") %>% #filter for rows with a brand
  select(store,brand,week,logmove,feat,price)  #choose some variables
# Easy to read, very clear, step by step.
identical(df1,df2) #Same results.
## [1] TRUE
Tidyverse makes it easy to write code that is interpretable, clear, and straightforward. This is critical for re-using code, as well as for working on code in groups.
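To make the pipe itself concrete, here is a minimal sketch on a small made-up tibble (the data and the name `toy` are hypothetical, not taken from the OJ file). It illustrates that `x %>% f(y)` is another way of writing `f(x, y)`, so a chain of pipes replaces a set of nested calls.

```r
library(tidyverse)

# Hypothetical toy data standing in for the OJ data
toy <- tibble(brand = c("dominicks", "tropicana", "dominicks"),
              price = c(1.59, 2.69, 2.09))

# Nested calls: have to be read from the inside out
nested <- select(filter(toy, brand == "dominicks"), price)

# Piped calls: read from top to bottom, one step per line
piped <- toy %>%
  filter(brand == "dominicks") %>%
  select(price)

# Both approaches should give the same tibble
all.equal(nested, piped)
```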
The simplest way to get a grasp on the tidyverse is to explore the excellent online documentation. There are great cheatsheets, and the package websites are just as good. See the tidyverse homepage and click on a package for more.
The tibble package changes how data frames behave. From a user-facing perspective, it mostly changes how they print: you get a useful summary of the columns instead of 10k printed rows.
df2 #tibble
## # A tibble: 9,649 x 6
##    store brand      week logmove  feat price
##    <dbl> <chr>     <dbl>   <dbl> <dbl> <dbl>
##  1     2 dominicks    40    9.26     1  1.59
##  2     2 dominicks    46    8.99     0  2.69
##  3     2 dominicks    47    8.83     1  2.09
##  4     2 dominicks    48    7.97     0  2.09
##  5     2 dominicks    50    7.38     0  2.09
##  6     2 dominicks    51   10.1      1  1.89
##  7     2 dominicks    52    9.28     0  1.89
##  8     2 dominicks    53    8.80     0  1.89
##  9     2 dominicks    54    8.79     0  1.79
## 10     2 dominicks    57    7.45     0  2.69
## # … with 9,639 more rows
as.data.frame(df2) #df print -- I didn't run because it prints 10k rows. You're welcome to test it though.
There are also performance changes under the hood, but we can ignore those for now.
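Printing is not the only visible difference. As a small sketch on made-up values (the objects `df` and `tb` are hypothetical), selecting a single column with `[` drops a data.frame down to a plain vector, while a tibble stays a tibble:

```r
library(tibble)

# A plain data.frame built from made-up values
df <- data.frame(store = c(2, 2, 5), price = c(1.59, 2.69, 2.09))

# The same data converted to a tibble
tb <- as_tibble(df)

# Single-column subsetting: the data.frame returns a numeric vector...
class(df[, "price"])

# ...while the tibble returns a one-column tibble
class(tb[, "price"])
```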
readr makes it easy to read csv type files into tibbles.
#oldschool
oj = read.csv("oj/oj.csv") #output is a data.frame
# tidyverse
oj = read_csv("oj/oj.csv") #output is a tibble
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   store = col_double(),
##   brand = col_character(),
##   week = col_double(),
##   logmove = col_double(),
##   feat = col_double(),
##   price = col_double(),
##   AGE60 = col_double(),
##   EDUC = col_double(),
##   ETHNIC = col_double(),
##   INCOME = col_double(),
##   HHLARGE = col_double(),
##   WORKWOM = col_double(),
##   HVAL150 = col_double(),
##   SSTRDIST = col_double(),
##   SSTRVOL = col_double(),
##   CPDIST5 = col_double(),
##   CPWVOL5 = col_double()
## )
This is minor. I usually use read.csv and as_tibble separately.
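Both routes can be compared side by side. This is a minimal sketch that writes a tiny made-up CSV to a temporary file (the file contents and the names `tmp`, `a`, and `b` are assumptions for illustration), so it runs without the OJ data.

```r
library(readr)
library(tibble)

# Write a tiny made-up CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("store,brand,price",
             "2,dominicks,1.59",
             "5,tropicana,2.69"), tmp)

# One step: read_csv() returns a tibble directly.
# col_types = cols() lets readr guess the types without printing the specification.
a <- read_csv(tmp, col_types = cols())

# Two steps: base read.csv() returns a data.frame, which is then converted
b <- as_tibble(read.csv(tmp))

is_tibble(a)
is_tibble(b)
```

The two results are both tibbles, although the guessed column types can differ (for example, read.csv reads whole numbers as integers).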
tidyr makes cleaning data fairly easy. Do you want each brand to be its own column, with one row per store and week holding the sales of that brand? We can do that. Examples may come in the future.
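In the meantime, here is a minimal sketch of that reshaping with tidyr's `pivot_wider()`. The four rows are made-up values in the shape of the OJ data, not real records, and the names `long` and `wide` are hypothetical.

```r
library(tidyr)
library(tibble)

# Made-up long data: one row per store, week, and brand
long <- tribble(
  ~store, ~week, ~brand,      ~logmove,
  2,      40,    "dominicks", 9.26,
  2,      40,    "tropicana", 8.10,
  2,      46,    "dominicks", 8.99,
  2,      46,    "tropicana", 7.95
)

# Spread the brands into their own columns: one row per store-week
wide <- pivot_wider(long, names_from = brand, values_from = logmove)
wide
```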
dplyr makes manipulating data easy. Want to select only a few variables, filter the rows by some constraint, or add new variables? It is easy.
#Look only at dominicks sold at store 2. 
#Select only 4 variables, make two more from those.
oj %>% 
  filter(brand=="dominicks") %>%
  filter(store == 2) %>%
  select(logmove,week,price,feat) %>%
  mutate(sales = exp(logmove),logprice = log(price))
## # A tibble: 110 x 6
##    logmove  week price  feat  sales logprice
##      <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
##  1    9.26    40  1.59     1 10560.    0.464
##  2    8.99    46  2.69     0  8000.    0.990
##  3    8.83    47  2.09     1  6848.    0.737
##  4    7.97    48  2.09     0  2880.    0.737
##  5    7.38    50  2.09     0  1600.    0.737
##  6   10.1     51  1.89     1 25344.    0.637
##  7    9.28    52  1.89     0 10752.    0.637
##  8    8.80    53  1.89     0  6656.    0.637
##  9    8.79    54  1.79     0  6592.    0.582
## 10    7.45    57  2.69     0  1728.    0.990
## # … with 100 more rows
This is what the tidyverse is known for: very straightforward data manipulation. You will see me use %>% filter() and %>% mutate() a lot. The dplyr overview page is very helpful.
purrr provides functions for improving how your functions work. It is a functional programming toolkit and very valuable, but not really a beginner tool.
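For a first taste, the core idea is applying a function to every element of a list. This is a minimal sketch on made-up prices (the list `prices` and its values are hypothetical), using purrr's `map_dbl()` and `map_int()`.

```r
library(purrr)

# Made-up prices for three brands, stored as a named list
prices <- list(dominicks   = c(1.59, 2.69, 2.09),
               tropicana   = c(2.89, 3.19),
               minute.maid = c(2.49, 2.29, 2.59))

# map_dbl() applies a function to each element and returns a numeric vector
mean_price <- map_dbl(prices, mean)

# A formula is shorthand for an anonymous function; .x is the current element
n_above_2 <- map_int(prices, ~ sum(.x > 2))

mean_price
n_above_2
```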
ggplot2 is a tool for making great plots with ease. Want to plot price against sales, color by brand, and show ads and no-ads in two side-by-side plots with the axes matched? Come and see:
# Make a plot using OJ data. X-axis is log(price), y is log sales (logmove), color by brand.
ggplot(oj,aes(x=log(price),y=logmove,col=brand)) + 
  #Make it a scatter plot, and make each point fairly transparent
  geom_point(alpha=0.2) +
  #And separate it into two plots based on ad-presence
  facet_grid(cols = vars(feat))
This is a power tool for making nice plots, which I will use constantly. The reference page is almost always open on my computer, and each function on that page has a bunch of great examples at the bottom of its own page.
forcats and stringr provide nice tools for working with data that are factors (an anagram of forcats) and for data that are strings.
I won’t go through examples right now.
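Still, a very short sketch gives the flavor. The vector `brands` below is made up for illustration; the functions shown (`str_detect()`, `str_to_upper()`, `fct_infreq()`) are only a small sample of what each package offers.

```r
library(forcats)
library(stringr)

# Made-up vector of brand names
brands <- c("dominicks", "tropicana", "dominicks", "minute.maid", "dominicks")

# stringr: test each string against a pattern, and change case
str_detect(brands, "^dom")
str_to_upper(brands)

# forcats: reorder factor levels by how often each level appears
f <- fct_infreq(factor(brands))
levels(f)
```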