1 Introduction

This is a basic introduction to some of the features of the tidyverse.

The tidyverse is a collection of R packages, maintained by Hadley Wickham (and others), that share a unified syntax well suited to working with data.

library(tidyverse) #Load into memory, show list of packages
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

As you can see, it contains the following packages as “core Tidyverse”:

  1. ggplot2
    • Advanced plotting tools.
    • Solid defaults
  2. tibble
    • data type that improves in many small ways over data.frame
  3. tidyr
    • Basic data cleaning: unnest, pivot, missing values (NAs)
  4. readr
    • importing data: CSVs, TSVs, other flat files
  5. purrr
    • functional programming toolkit for applying functions to lists and vectors
  6. dplyr
    • Basic data manipulation. filter rows, select columns, make new variables, etc.
  7. stringr
    • tools for working with strings
  8. forcats
    • tools for working with factors

When you install the tidyverse package, it also installs these supplemental packages:

  1. Packages for importing data:
    • DBI: for relational databases
    • haven: for SPSS, SAS, and Stata data
    • httr: for web APIs
    • readxl: for XLSX and XLS files
    • rvest: for web scraping
    • jsonlite: for reading JSON files
    • xml2: for reading XML files
  2. Packages for data manipulation:
    • lubridate: for working with dates and times
    • hms: for time of day values
    • blob: for binary data
  3. Packages for function work:
    • magrittr: for the pipe %>%
    • glue: for combining strings
  4. As well as a few others.

2 Identifying tidyverse in the wild

Most of the time, the identifying feature of tidyverse code is the pipe %>%, which passes the result of one step straight into the next, so we do not have to save each intermediate step of a calculation.

E.g.

oj = read_csv("oj/oj.csv")
# Non tidyverse styling:
df1 = oj[oj$brand=="dominicks",1:6] #Pain to read, unclear
#Tidyverse style
df2 = oj %>% 
  filter(brand == "dominicks") %>% #keep only the rows for this brand
  select(store,brand,week,logmove,feat,price)  #choose some variables
# Easy to read, very clear, step by step.
identical(df1,df2) #Same results.
## [1] TRUE

Tidyverse makes it easy to make code that is interpretable, clear, and straightforward. This is critical for re-using code, as well as for working on code in groups.
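
To see what the pipe is doing mechanically, here is a minimal sketch that computes the same number three ways. It uses the built-in mtcars data as a stand-in (an assumption, so that it runs without oj.csv), and pull() is a dplyr helper not used elsewhere in these notes.

```r
library(dplyr)  # attaches the %>% pipe along with filter() and pull()

# Goal: mean mpg of the 4-cylinder cars in the built-in mtcars data.

# 1. Save every intermediate step
four_cyl   <- filter(mtcars, cyl == 4)
mpg_only   <- pull(four_cyl, mpg)      # extract one column as a vector
mean_steps <- mean(mpg_only)

# 2. Nest the calls (has to be read inside-out)
mean_nested <- mean(pull(filter(mtcars, cyl == 4), mpg))

# 3. Pipe: x %>% f(y) is just f(x, y), so this reads top to bottom
mean_piped <- mtcars %>%
  filter(cyl == 4) %>%
  pull(mpg) %>%
  mean()
```

All three give the same answer; the piped version is the one you can read in the order the work happens.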

3 Using tidyverse

The simplest way to get a grasp on the tidyverse is to explore the excellent documentation online. There are great cheatsheets and great package websites. See the tidyverse homepage and click on a package for more.

4 Each package – in a touch more depth

4.1 tibble

This package provides the tibble, a reworked version of the data frame. From a user-facing perspective, the main change is how tibbles print: you get a useful summary of the columns and only the first 10 rows, instead of all 10k.

df2 #tibble
## # A tibble: 9,649 x 6
##    store brand      week logmove  feat price
##    <dbl> <chr>     <dbl>   <dbl> <dbl> <dbl>
##  1     2 dominicks    40    9.26     1  1.59
##  2     2 dominicks    46    8.99     0  2.69
##  3     2 dominicks    47    8.83     1  2.09
##  4     2 dominicks    48    7.97     0  2.09
##  5     2 dominicks    50    7.38     0  2.09
##  6     2 dominicks    51   10.1      1  1.89
##  7     2 dominicks    52    9.28     0  1.89
##  8     2 dominicks    53    8.80     0  1.89
##  9     2 dominicks    54    8.79     0  1.79
## 10     2 dominicks    57    7.45     0  2.69
## # … with 9,639 more rows
as.data.frame(df2) #df print -- I didn't run because it prints 10k rows. You're welcome to test it though.

There are also performance changes under the hood, but we can ignore those for now.
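
Beyond printing, one small behavioral difference is easy to check for yourself. The sketch below is only an illustration, using the built-in mtcars data as a stand-in for the OJ data (an assumption, so it runs without the CSV file).

```r
library(tibble)

df <- mtcars              # a plain data.frame
tb <- as_tibble(mtcars)   # the same data as a tibble (row names are dropped)

class(df)   # just "data.frame"
class(tb)   # tibble classes, plus "data.frame"

# Selecting a single column with [ , ]:
col_df <- df[, "mpg"]     # data.frame silently drops to a plain vector
col_tb <- tb[, "mpg"]     # tibble stays a one-column tibble
```

Because a tibble is still a data.frame underneath, it should work with most functions that expect a data.frame.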

4.2 readr

readr makes it easy to read CSV-type files into tibbles.

#oldschool
oj = read.csv("oj/oj.csv") #output is a data.frame
# tidyverse
oj = read_csv("oj/oj.csv") #output is a tibble
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   store = col_double(),
##   brand = col_character(),
##   week = col_double(),
##   logmove = col_double(),
##   feat = col_double(),
##   price = col_double(),
##   AGE60 = col_double(),
##   EDUC = col_double(),
##   ETHNIC = col_double(),
##   INCOME = col_double(),
##   HHLARGE = col_double(),
##   WORKWOM = col_double(),
##   HVAL150 = col_double(),
##   SSTRDIST = col_double(),
##   SSTRVOL = col_double(),
##   CPDIST5 = col_double(),
##   CPWVOL5 = col_double()
## )

This is minor. I usually use read.csv and as_tibble separately.
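
For comparison, here is a minimal sketch of both routes side by side. The three-column file is made up and written to a temporary path (a hypothetical stand-in for oj/oj.csv), and the col_types argument is my addition to keep read_csv from printing its column specification.

```r
library(readr)
library(tibble)

# Made-up two-row file, standing in for oj/oj.csv
path <- tempfile(fileext = ".csv")
writeLines(c("store,brand,price",
             "2,dominicks,1.59",
             "2,tropicana,2.69"), path)

# Route 1: base read.csv, then convert to a tibble
via_base <- as_tibble(read.csv(path, stringsAsFactors = FALSE))

# Route 2: read_csv returns a tibble directly
via_readr <- read_csv(path,
                      col_types = cols(store = col_double(),
                                       brand = col_character(),
                                       price = col_double()))
```

One difference to be aware of: read.csv guesses integer for the whole-number store column, while the column specification above makes it a double, matching the col_double() that read_csv reported for store in the output above.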

4.3 tidyr

Makes cleaning and reshaping the data fairly easy. Do you want each brand to be its own column, with one row per store and week holding that brand's sales? We can do that. Examples may come in the future.
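
In the meantime, here is a minimal sketch of that brand-to-columns reshaping with pivot_wider(). The four rows are made-up values laid out like the OJ data (an assumption, so the sketch runs on its own).

```r
library(tibble)
library(tidyr)

# Long format: one row per store, week, and brand (values are made up)
long <- tribble(
  ~store, ~week, ~brand,      ~logmove,
  2,      40,    "dominicks", 9.3,
  2,      40,    "tropicana", 9.1,
  2,      46,    "dominicks", 9.0,
  2,      46,    "tropicana", 8.7
)

# Wide format: one row per store and week, one column per brand
wide <- pivot_wider(long, names_from = brand, values_from = logmove)
wide
```

The columns not named in names_from or values_from (here store and week) identify the rows of the result. pivot_longer() goes the other way.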

4.4 dplyr

Makes manipulating data easy. Want to select only a few variables, filter the rows by some constraint, or add new variables? It is easy.

#Look only at dominicks sold at store 2. 
#Select only 4 variables, make two more from those.
oj %>% 
  filter(brand=="dominicks") %>%
  filter(store == 2) %>%
  select(logmove,week,price,feat) %>%
  mutate(sales = exp(logmove),logprice = log(price))
## # A tibble: 110 x 6
##    logmove  week price  feat  sales logprice
##      <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
##  1    9.26    40  1.59     1 10560.    0.464
##  2    8.99    46  2.69     0  8000.    0.990
##  3    8.83    47  2.09     1  6848.    0.737
##  4    7.97    48  2.09     0  2880.    0.737
##  5    7.38    50  2.09     0  1600.    0.737
##  6   10.1     51  1.89     1 25344.    0.637
##  7    9.28    52  1.89     0 10752.    0.637
##  8    8.80    53  1.89     0  6656.    0.637
##  9    8.79    54  1.79     0  6592.    0.582
## 10    7.45    57  2.69     0  1728.    0.990
## # … with 100 more rows

This is what the tidyverse is known for. Very straightforward data manipulation. You will see me use %>% filter() and %>% mutate() a lot. The dplyr overview page is very helpful.

4.5 purrr

purrr is a functional programming toolkit: its map() family applies a function to each element of a list or vector and returns the results in a predictable type. It is very valuable, but not really a beginner toolkit.
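
Just to give a flavor, here is a minimal sketch of the map functions. The list of prices is made up for illustration, not taken from the OJ data.

```r
library(purrr)

# Made-up prices per brand, stored as a named list of numeric vectors
prices <- list(
  dominicks = c(1.59, 2.69, 2.09),
  tropicana = c(3.87, 3.19)
)

# Apply a function to every element; the suffix fixes the output type
mean_price <- map_dbl(prices, mean)     # named double vector
n_obs      <- map_int(prices, length)   # named integer vector

# A one-off function can be written inline with the ~ shorthand
price_range <- map_dbl(prices, ~ max(.x) - min(.x))
```

Unlike sapply(), map_dbl() and map_int() throw an error if a result is not a single value of the promised type, which makes bugs show up early.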

4.6 ggplot2

This is a tool for making great plots with ease. Want to plot price against sales, color by brand, and show ads and no-ads in two side-by-side plots with the axes matched? Come and see:

# Make a plot using OJ data. X-axis is log(price), y is log sales (logmove), color by brand.
ggplot(oj,aes(x=log(price),y=logmove,col=brand)) + 
  #Make it a scatter plot, and make each point fairly transparent
  geom_point(alpha=0.2) +
  #And separate it into two plots based on ad-presence
  facet_grid(cols = vars(feat))

This is a power tool for making nice plots, which I will use constantly. The reference page is almost always open on my computer, and each function on that page has a bunch of great examples at the bottom of its own page.

4.7 stringr and forcats

These packages provide nice tools for working with data that are strings (stringr) and data that are factors (forcats, an anagram of factors).

I won’t go through examples right now.
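
For a quick taste, though, here is a minimal sketch of one or two functions from each package. The brand vector is made up for illustration (the exact spellings are an assumption, not read from the OJ file).

```r
library(stringr)
library(forcats)

# Made-up brand labels
brands <- c("dominicks", "tropicana", "minute.maid", "tropicana")

# stringr: detect, replace, and re-case strings
has_dot <- str_detect(brands, fixed("."))      # which labels contain a dot?
labels  <- str_to_title(str_replace(brands, fixed("."), " "))

# forcats: reorder factor levels by how often each level appears
brand_fct <- fct_infreq(factor(brands))
levels(brand_fct)   # most frequent brand comes first
```

Reordering factor levels like this is mostly useful for plots and tables, where the level order controls the display order.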