This is a basic introduction to some of the features of the tidyverse.
The tidyverse is a collection of R packages, maintained by Hadley Wickham (and others), that share a unified syntax well suited to working with data.
library(tidyverse) #Load into memory, show list of packages
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
As you can see from the startup message, the “core tidyverse” consists of the following packages: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats.
When you install the tidyverse package, it also installs a number of supplemental packages, among them magrittr, which is where the pipe %>% comes from.
Most of the time, the identifying feature of tidyverse code is the use of the pipe %>%, which spares us from having to save each intermediate step of a calculation.
For example:
oj = read_csv("oj/oj.csv")
# Non-tidyverse style:
df1 = oj[oj$brand == "dominicks", 1:6] # Hard to read, intent unclear
# Tidyverse style:
df2 = oj %>%
  filter(brand == "dominicks") %>% # keep only rows for the dominicks brand
  select(store, brand, week, logmove, feat, price) # choose some variables
# Easy to read, very clear, step by step.
identical(df1, df2) # Same results.
## [1] TRUE
The tidyverse makes it easy to write code that is interpretable, clear, and straightforward. This is critical for re-using code, as well as for working on code in groups.
The simplest way to get a grasp on the tidyverse is to explore the excellent documentation online. There are great cheatsheets, and the package websites are just as good. See the tidyverse homepage and click on a package for more.
The tibble package changes data frames. From a user-facing perspective, it mostly changes how they print: you get a useful summary of the columns and only the first 10 rows, rather than all 10k.
df2 #tibble
## # A tibble: 9,649 x 6
## store brand week logmove feat price
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2 dominicks 40 9.26 1 1.59
## 2 2 dominicks 46 8.99 0 2.69
## 3 2 dominicks 47 8.83 1 2.09
## 4 2 dominicks 48 7.97 0 2.09
## 5 2 dominicks 50 7.38 0 2.09
## 6 2 dominicks 51 10.1 1 1.89
## 7 2 dominicks 52 9.28 0 1.89
## 8 2 dominicks 53 8.80 0 1.89
## 9 2 dominicks 54 8.79 0 1.79
## 10 2 dominicks 57 7.45 0 2.69
## # … with 9,639 more rows
as.data.frame(df2) # data.frame print. Not run here because it prints about 10k rows; you're welcome to test it though.
There are also performance changes under the hood, but we can ignore those for now.
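Beyond printing, one behavioral difference is worth a quick look: subsetting a single column with `[` returns a vector from a data.frame but stays a tibble for a tibble. The sketch below uses a small made-up data set (not the oj data) to show this; the names `df` and `tb` are only for illustration.

```r
library(tibble)

# Hypothetical toy data, standing in for the oj data set
df <- data.frame(store = c(2, 2, 5), price = c(1.59, 2.69, 2.09))
tb <- as_tibble(df)

class(df) # a plain data.frame
class(tb) # a tibble is still a data.frame, with extra classes on top

# Single-column subsetting: the data.frame drops to a bare vector,
# while the tibble stays a (one-column) tibble
df_col <- df[, "price"]
tb_col <- tb[, "price"]
is.data.frame(df_col)
is.data.frame(tb_col)
```

This is why code written for data frames occasionally needs `[[` or `$` when handed a tibble.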
readr makes it easy to read CSV-type files into tibbles.
#oldschool
oj = read.csv("oj/oj.csv") #output is a data.frame
# tidyverse
oj = read_csv("oj/oj.csv") #output is a tibble
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## store = col_double(),
## brand = col_character(),
## week = col_double(),
## logmove = col_double(),
## feat = col_double(),
## price = col_double(),
## AGE60 = col_double(),
## EDUC = col_double(),
## ETHNIC = col_double(),
## INCOME = col_double(),
## HHLARGE = col_double(),
## WORKWOM = col_double(),
## HVAL150 = col_double(),
## SSTRDIST = col_double(),
## SSTRVOL = col_double(),
## CPDIST5 = col_double(),
## CPWVOL5 = col_double()
## )
This is a minor difference. I usually use read.csv and as_tibble separately.
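A minimal sketch of that two-step approach, assuming a small inline CSV in place of "oj/oj.csv" (the `csv_text` and `oj_small` names and the tropicana row are made up for illustration):

```r
library(tibble)

# Hypothetical inline CSV standing in for the real file
csv_text <- "store,brand,price
2,dominicks,1.59
2,tropicana,2.69"

# Step 1: base R reads the CSV into a data.frame
# Step 2: as_tibble() converts it to a tibble
oj_small <- as_tibble(read.csv(text = csv_text))
oj_small
```

One thing to watch for: column types can differ from read_csv. For instance, read.csv reads whole numbers as integers, and in R versions before 4.0 it turns strings into factors by default.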
tidyr makes cleaning the data fairly easy. Do you want each brand to be its own column, with each row holding the sales at one store in one week? We can do that. Examples may come in the future.
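In the meantime, here is a rough sketch of the brand-per-column reshaping using `pivot_wider()`. The toy data is an assumption: it mimics the layout of the oj data, and the tropicana values are invented for illustration.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data in long form: one row per store, week, and brand
long <- tribble(
  ~store, ~week, ~brand,      ~logmove,
  2,      40,    "dominicks", 9.26,
  2,      40,    "tropicana", 8.50,
  2,      46,    "dominicks", 8.99,
  2,      46,    "tropicana", 8.72
)

# Spread the brands into their own columns: one row per store-week
wide <- pivot_wider(long, names_from = brand, values_from = logmove)
wide
```

On the full oj data you would probably first select only store, week, brand, and logmove, since columns such as price and feat differ by brand and would keep the rows from collapsing.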
dplyr makes manipulating data easy. Want to select only a few variables, filter the rows by some constraint, or add variables? It is easy.
# Look only at dominicks sold at store 2.
# Select only 4 variables, make two more from those.
oj %>%
  filter(brand == "dominicks") %>%
  filter(store == 2) %>% # store is numeric, so compare against a number
  select(logmove, week, price, feat) %>%
  mutate(sales = exp(logmove), logprice = log(price))
## # A tibble: 110 x 6
## logmove week price feat sales logprice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9.26 40 1.59 1 10560. 0.464
## 2 8.99 46 2.69 0 8000. 0.990
## 3 8.83 47 2.09 1 6848. 0.737
## 4 7.97 48 2.09 0 2880. 0.737
## 5 7.38 50 2.09 0 1600. 0.737
## 6 10.1 51 1.89 1 25344. 0.637
## 7 9.28 52 1.89 0 10752. 0.637
## 8 8.80 53 1.89 0 6656. 0.637
## 9 8.79 54 1.79 0 6592. 0.582
## 10 7.45 57 2.69 0 1728. 0.990
## # … with 100 more rows
This is what the tidyverse is known for: very straightforward data manipulation. You will see me use %>% filter() and %>% mutate() a lot. The dplyr overview page is very helpful.
purrr provides functions that work with other functions, for example by applying a function to every element of a list. It is a functional programming toolkit and it is very valuable, but not really a beginner toolkit.
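For a small taste, the sketch below uses `map_dbl()` to apply a function to each element of a list and collect the results in a numeric vector. The `prices` list is made-up toy data, not taken from the oj file.

```r
library(purrr)

# Hypothetical toy data: a few prices per brand
prices <- list(
  dominicks = c(1.59, 2.69, 2.09),
  tropicana = c(3.87, 3.19)
)

# Apply mean() to each element; the result is a named numeric vector
mean_price <- map_dbl(prices, mean)
mean_price

# The same idea with an anonymous function: mean of the log prices per brand
mean_logprice <- map_dbl(prices, function(p) mean(log(p)))
mean_logprice
```

Compared with a for loop, the `map_*()` functions state the return type in their name, which catches surprises early.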
ggplot2 is a tool for making great plots with ease. Want to plot price against sales, color by brand, and show ads and no-ads in two side-by-side plots with the axes matched? Come and see:
# Make a plot using OJ data. X-axis is log(price), y is log sales (logmove), color by brand.
ggplot(oj, aes(x = log(price), y = logmove, col = brand)) +
  # Make it a scatter plot, and make each point fairly transparent
  geom_point(alpha = 0.2) +
  # And separate it into two plots based on ad-presence
  facet_grid(cols = vars(feat))
This is a power tool for making nice plots, which I will use constantly. The reference page is almost always open on my computer, and each function on that page has a bunch of great examples at the bottom of its own page.
The forcats and stringr packages provide nice tools for working with data that are factors (forcats is an anagram of factors) and with data that are strings, respectively.
I won’t go through examples right now.