class: center, middle, inverse, title-slide

# Lab 02: CS631
## Working in the Tidyverse
### Alison Hill

---

# Tidyverse basics

In David Robinson's DataCamp course you learned:

- `<-` (variable assignment)
- `%>%` (then...)
- `dplyr`, `ggplot2` (packages)
- `install.packages("dplyr")` (1x per machine)
- `library(dplyr)` (1x per work session)

---
class: center, middle, inverse

# 📇
## Let's review

---

# Data for today

We'll use data from the Museum of Modern Art (MoMA)

- Publicly available on [GitHub](https://github.com/MuseumofModernArt/collection)
- As analyzed by [fivethirtyeight.com](https://fivethirtyeight.com/features/a-nerds-guide-to-the-2229-paintings-at-moma/)
- And by [others](https://medium.com/@foe/here-s-a-roundup-of-how-people-have-used-our-data-so-far-80862e4ce220)

---

# Get the data

Use this code chunk to import my cleaned CSV file:

```r
library(readr)
moma <- read_csv("http://bit.ly/cs631-moma")
```

---

# Data wrangling so far

All functions from `dplyr` package

.pull-left[
From DataCamp Chapter 1

- print a tibble
- `filter`
- `arrange`
- `mutate`
]

--

.pull-right[
From Lab 01

- `glimpse`
- `distinct`
- `count`
]

---
class: middle, center, inverse

![](../images/rladylego-pipe.jpg)

## Plus: `%>%`

*image courtesy [@LegoRLady](https://twitter.com/LEGO_RLady/status/986661916855754752)*

---
class: middle, center, inverse

# ⌛️
## Let's review some helpful functions for `filter`

---
class: inverse, bottom, center
background-image: url("../images/peapod.png")
background-size: 25%

## Base R + Tidyverse

---
class: middle, center, inverse

# 💡
## First:
## Logical Operators

---

```r
?base::Logic
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Operator </th>
   <th style="text-align:left;"> Description </th>
   <th style="text-align:left;"> Usage </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> & </td>
   <td style="text-align:left;"> and </td>
   <td style="text-align:left;"> x & y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> | </td>
   <td
style="text-align:left;"> or </td>
   <td style="text-align:left;"> x | y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> xor </td>
   <td style="text-align:left;"> exactly x or y </td>
   <td style="text-align:left;"> xor(x, y) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ! </td>
   <td style="text-align:left;"> not </td>
   <td style="text-align:left;"> !x </td>
  </tr>
 </tbody>
</table>

---

Logical or (`|`) is inclusive, so `x | y` really means:

* x or
* y or
* both x & y

Exclusive or (`xor`) is exclusive, so `xor(x, y)` really means:

* x or
* y...
* but not both x & y

```r
x <- c(0, 1, 0, 1)
y <- c(0, 0, 1, 1)
boolean_or <- x | y
exclusive_or <- xor(x, y)
cbind(x, y, boolean_or, exclusive_or)
```

```
     x y boolean_or exclusive_or
[1,] 0 0          0            0
[2,] 1 0          1            1
[3,] 0 1          1            1
[4,] 1 1          1            0
```

---
class: middle, center, inverse

# 💡
## Second:
## Comparisons

---

```r
?Comparison
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Operator </th>
   <th style="text-align:left;"> Description </th>
   <th style="text-align:left;"> Usage </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> < </td>
   <td style="text-align:left;"> less than </td>
   <td style="text-align:left;"> x < y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <= </td>
   <td style="text-align:left;"> less than or equal to </td>
   <td style="text-align:left;"> x <= y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> > </td>
   <td style="text-align:left;"> greater than </td>
   <td style="text-align:left;"> x > y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> >= </td>
   <td style="text-align:left;"> greater than or equal to </td>
   <td style="text-align:left;"> x >= y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> == </td>
   <td style="text-align:left;"> exactly equal to </td>
   <td style="text-align:left;"> x == y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> != </td>
   <td style="text-align:left;"> not equal to </td>
   <td style="text-align:left;"> x != y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> %in% </td>
   <td style="text-align:left;">
group membership* </td>
   <td style="text-align:left;"> x %in% y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> is.na </td>
   <td style="text-align:left;"> is missing </td>
   <td style="text-align:left;"> is.na(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> !is.na </td>
   <td style="text-align:left;"> is not missing </td>
   <td style="text-align:left;"> !is.na(x) </td>
  </tr>
 </tbody>
</table>

*(shortcut to using `|` repeatedly with `==`)

---

## Lab 02: Challenge 1 (`dplyr`)

1. How many paintings (rows) are in `moma`? How many variables (columns) are in `moma`?
1. What is the first painting acquired by MoMA? Which year? Which artist? What title?
    - *Hint: you may want to look into `select` + `arrange`*
1. What is the oldest painting in the collection? Which year? Which artist? What title? *(see above hint)*
1. How many distinct artists are there?
1. Which artist has the most paintings in the collection? How many paintings are by this artist?
1. How many paintings are by male vs female artists?

If you want more:

1. How many artists of each gender are there?
1. In what year were the most paintings acquired? Created?
1. In what year was the first painting by a (solo) female artist acquired? When was that painting created? Which artist? What title?

---

# From DataCamp Chapter 2

all `ggplot2`

- `aes(x = , y = )` (aesthetics)
- `aes(x = , y = , color = )` (add color)
- `aes(x = , y = , size = )` (add size)
- `+ facet_wrap(~ )` (faceting)

---

# Old School (Challenge 2)<sup>1</sup>

- Sketch the graphics below on paper, where the `x`-axis is variable `painted` and the `y`-axis is variable `acquired`

```
# A tibble: 4 x 4
  painted acquired  area gender
    <dbl>    <dbl> <dbl> <chr>
1   1980.    1985.    3. male
2   1990.    1995.    2. male
3   2000.    2005.    1. female
4   2010.    2015.    2. female
```

<!-- Copy to chalkboard/whiteboard -->

1. A scatter plot
1. A scatter plot where the `color` of the points corresponds to `gender`
1. A scatter plot where the `size` of the points corresponds to `area`

.footnote[
[1] Shamelessly borrowed, with much appreciation, from [Chester Ismay](https://ismayc.github.io/talks/ness-infer/slide_deck.html)
]

---

# 1. A scatterplot

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired)) +
  geom_point()
```

--

<img src="02-slides_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" />

---

# 2. `color` points by `gender`

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, color = gender)) +
  geom_point()
```

--

<img src="02-slides_files/figure-html/unnamed-chunk-13-1.png" width="80%" style="display: block; margin: auto;" />

---

# 3. `size` points by `area`

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, size = area)) +
  geom_point()
```

--

<img src="02-slides_files/figure-html/unnamed-chunk-15-1.png" width="80%" style="display: block; margin: auto;" />

---

# [The Five-Named Graphs](http://moderndive.com/3-viz.html#FiveNG)

- Scatterplot: `geom_point()`
- Line graph: `geom_line()`
- Histogram: `geom_histogram()`
- Boxplot: `geom_boxplot()`
- Bar graph: `geom_bar()` or `geom_col()` (see [Lab 01](../01-eda_hot_dogs.html))

---

# Lab 02: Plotting Challenges

Challenges 3-5 are in the [Lab 02 code-through](../02-moma.html)!

https://apreshill.github.io/data-vis-labs-2018/02-moma.html

---
class: inverse, middle, center

# 📊
## Basics of `ggplot2` and `dplyr`:

[R4DS `ggplot2` chapter](http://r4ds.had.co.nz/data-visualisation.html)

[ModernDive `ggplot2` chapter](http://moderndive.com/3-viz.html)

[RStudio `ggplot2` Cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)

[R4DS `dplyr` chapter](http://r4ds.had.co.nz/transform.html)

[ModernDive `dplyr` chapter](http://moderndive.com/5-wrangling.html)

[RStudio `dplyr` Cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)
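
---

# Extra: `%in%` vs repeated `|`

The Comparisons slide notes that `%in%` is a shortcut for chaining `==` with `|`. Here is a minimal sketch of that idea inside `filter`. The `paintings` tibble, its column names, and its values are made up for illustration; they are not taken from the MoMA file.

```r
library(dplyr)

# Hypothetical toy data (NOT the real MoMA data), with one missing artist
paintings <- tibble::tibble(
  title  = c("w", "x", "y", "z"),
  artist = c("A", "B", "C", NA),
  year   = c(1907, 1939, 1910, 1950)
)

# The long way: repeat `==` and join the tests with `|`
long_way <- paintings %>%
  filter(artist == "A" | artist == "C")

# The shortcut: one `%in%` test against a vector of values
short_way <- paintings %>%
  filter(artist %in% c("A", "C"))

# Both approaches keep the same rows
same_rows <- identical(long_way$title, short_way$title)

# Combine tests with `&`: artist is not missing AND year is 1910 or later
recent <- paintings %>%
  filter(!is.na(artist) & year >= 1910)
```

Both versions drop the row with a missing `artist`: `filter` keeps only rows where the test is `TRUE`, and a comparison against `NA` is never `TRUE`.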
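
---

# Extra: recreating `moma_ex`

The three scatter plot slides call `moma_ex`, but the deck does not show how it is built. A minimal sketch, assuming `moma_ex` is simply the four-row tibble printed on the Challenge 2 slide (the `tibble()` construction and the object name `p` are assumptions, not the original source code):

```r
library(tibble)
library(ggplot2)

# Assumption: moma_ex mirrors the 4-row tibble shown on the Challenge 2 slide
moma_ex <- tibble(
  painted  = c(1980, 1990, 2000, 2010),
  acquired = c(1985, 1995, 2005, 2015),
  area     = c(3, 2, 1, 2),
  gender   = c("male", "male", "female", "female")
)

# Same call as the "1. A scatterplot" slide, stored in `p` so it can be reused
p <- ggplot(moma_ex, aes(painted, acquired)) +
  geom_point()
p
```

With `moma_ex` defined this way, the `color = gender` and `size = area` variants on the later slides should run unchanged.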