ggplot2
errors- we all get them!theme_gray
or theme_bw
(or theme_minimal
)?These are important questions, and I want you to develop (well-informed) opinions on these matters!
This includes a reconstruction of Nathan Yau’s hot dog contest example, as interpreted by Jackie Wirz, ported into R and ggplot2
by Steven Bedrick for a workshop for the OHSU Data Science Institute, and finally adapted by Alison Hill for all you intrepid Data-Viz-onauts!
First, we load our packages:
library(tidyverse)
library(extrafont)
library(here)
Next, we load some data. You can use the following chunk to load it in from a link:
hot_dogs <- read_csv("http://bit.ly/cs631-hotdog",
col_types = cols(
gender = col_factor(levels = NULL)
))
Or you can save the file at the link to a local CSV file. I did this and saved my file in a folder called data
, then built up the file path to the CSV using here
:
hot_dogs <- read_csv(here::here("data", "hot_dog_contest.csv"),
col_types = cols(
gender = col_factor(levels = NULL)
))
Either way you do it, check it out once read in and make sure it looks like this!
glimpse(hot_dogs)
Observations: 49
Variables: 4
$ year <dbl> 2017, 2017, 2016, 2016, 2015, 2015, 2014, 2014, 2013...
$ gender <fct> male, female, male, female, male, female, male, fema...
$ name <chr> "Joey Chestnut", "Miki Sudo", "Joey Chestnut", "Miki...
$ num_eaten <dbl> 72.000, 41.000, 70.000, 38.000, 62.000, 38.000, 61.0...
hot_dogs
# A tibble: 49 x 4
year gender name num_eaten
<dbl> <fct> <chr> <dbl>
1 2017. male Joey Chestnut 72.0
2 2017. female Miki Sudo 41.0
3 2016. male Joey Chestnut 70.0
4 2016. female Miki Sudo 38.0
5 2015. male Matthew Stonie 62.0
6 2015. female Miki Sudo 38.0
7 2014. male Joey Chestnut 61.0
8 2014. female Miki Sudo 34.0
9 2013. male Joey Chestnut 69.0
10 2013. female Sonya Thomas 36.8
# ... with 39 more rows
We’ll be wanting to somehow include information about whether a given year was before or after the incorporation of the competitive eating league, so let’s add an indicator field to the data using mutate()
. Also, the data’s a little sketchy pre-1981 and for our purposes today we’ll be focusing on males only, so let’s do some filter
ing too:
hot_dogs <- hot_dogs %>%
mutate(post_ifoce = year >= 1997) %>%
filter(year >= 1981 & gender == 'male')
hot_dogs
# A tibble: 37 x 5
year gender name num_eaten post_ifoce
<dbl> <fct> <chr> <dbl> <lgl>
1 2017. male Joey Chestnut 72. TRUE
2 2016. male Joey Chestnut 70. TRUE
3 2015. male Matthew Stonie 62. TRUE
4 2014. male Joey Chestnut 61. TRUE
5 2013. male Joey Chestnut 69. TRUE
6 2012. male Joey Chestnut 68. TRUE
7 2011. male Joey Chestnut 62. TRUE
8 2010. male Joey Chestnut 54. TRUE
9 2009. male Joey Chestnut 68. TRUE
10 2008. male Joey Chestnut 59. TRUE
# ... with 27 more rows
Now let’s try making a first crack at a sketchy plot:
ggplot(hot_dogs, aes(x = year, y = num_eaten)) +
geom_col()
Note that our data is already in “counted” form, so we’re using geom_col()
instead of geom_bar()
.
ggplot(hot_dogs, aes(x = year, y = num_eaten)) +
geom_col() +
labs(x = "Year", y = "Hot Dogs and Buns Consumed") +
ggtitle("Nathan's Hot Dog Eating Contest Results, 1981-2017")
Make 3 versions of the last plot we just made:
post_ifoce
is TRUE or FALSE (use default colors for now).
ggplot(hot_dogs, aes(x = year, y = num_eaten)) +
geom_col(colour = "white") +
labs(x = "Year", y = "Hot Dogs and Buns Consumed") +
ggtitle("Nathan's Hot Dog Eating Contest Results, 1981-2017")
ggplot(hot_dogs, aes(x = year, y = num_eaten)) +
geom_col(colour = "white", fill = "navyblue") +
labs(x = "Year", y = "Hot Dogs and Buns Consumed") +
ggtitle("Nathan's Hot Dog Eating Contest Results, 1981-2017")
ggplot(hot_dogs, aes(x = year, y = num_eaten)) +
geom_col(aes(fill = post_ifoce), colour = "white") +
labs(x = "Year", y = "Hot Dogs and Buns Consumed") +
ggtitle("Nathan's Hot Dog Eating Contest Results, 1981-2017")
What if you want to change the legend in the last plot you made? Use google to figure out how to do the following:
ggplot(hot_dogs, aes(x = year, y = num_eaten)) +
geom_col(aes(fill = post_ifoce), colour = "white") +
labs(x = "Year", y = "Hot Dogs and Buns Consumed") +
ggtitle("Nathan's Hot Dog Eating Contest Results, 1981-2017") +
scale_fill_discrete(name = "",
labels=c("Pre-IFOCE", "Post-IFOCE"))
Now, let’s change the question a little bit. This looks at the creation of the IFOCE. What about the affiliation of the contestants? We’ll need some different data for this. Through the Magic Of Data Science™, we have dug that information up and put it into an expanded version of our CSV file available at http://bit.ly/cs631-hotdog-affiliated.
Let’s work with this new dataset! Do the following:
col_types
to read in affiliated
and gender
as factors.mutate
, create a new variable called post_ifoce
that is TRUE if year
is greater than or equal to 1997.filter
the new data for only years 1981 and after, and only for male competitors.
hdm_affil <- read_csv("http://bit.ly/cs631-hotdog-affiliated",
col_types = cols(
affiliated = col_factor(levels = NULL),
gender = col_factor(levels = NULL)
)) %>%
mutate(post_ifoce = year >= 1997) %>%
filter(year >= 1981 & gender == "male")
hdm_affil <- read_csv(here::here("data", "hot_dog_contest_with_affiliation.csv"),
col_types = cols(
affiliated = col_factor(levels = NULL),
gender = col_factor(levels = NULL)
)) %>%
mutate(post_ifoce = year >= 1997) %>%
filter(year >= 1981 & gender == "male")
glimpse(hdm_affil)
Observations: 37
Variables: 6
$ year <dbl> 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 200...
$ gender <fct> male, male, male, male, male, male, male, male, mal...
$ name <chr> "Joey Chestnut", "Joey Chestnut", "Matthew Stonie",...
$ num_eaten <dbl> 72.000, 70.000, 62.000, 61.000, 69.000, 68.000, 62....
$ affiliated <fct> current, current, current, current, current, curren...
$ post_ifoce <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU...
Let’s do some basic EDA with this new dataset! Do the following:
dplyr::distinct
to figure out how many unique values there are of affiliated
.dplyr::count
to count the number of rows for each unique value of affiliated
; use ?count
to figure out how to sort the counts in descending order.
hdm_affil %>%
distinct(affiliated)
# A tibble: 3 x 1
affiliated
<fct>
1 current
2 former
3 not affiliated
hdm_affil %>%
count(affiliated, sort = TRUE)
# A tibble: 3 x 2
affiliated n
<fct> <int>
1 not affiliated 20
2 current 11
3 former 6
Now let’s plot this new data, and fill the columns according to our new affiliated
column.
ggplot(hdm_affil, aes(x = year, y = num_eaten)) +
geom_col(aes(fill = affiliated)) +
labs(x = "Year", y = "Hot Dogs and Buns Consumed") +
ggtitle("Nathan's Hot Dog Eating Contest Results, 1981-2017")
Do the following updates to the last plot we just made:
c('#E9602B','#2277A0','#CCB683')
.affil_plot <- ggplot(hdm_affil, aes(x = year, y = num_eaten)) +
geom_col(aes(fill = affiliated)) +
labs(x = "Year", y = "Hot Dogs and Buns Consumed") +
ggtitle("Nathan's Hot Dog Eating Contest Results, 1981-2017") +
scale_fill_manual(values = c('#E9602B','#2277A0','#CCB683'),
name = "IFOCE-affiliation")
affil_plot
The spacing’s a little funky down near the origin of the plot. The documentation tells us that the defaults are c(0.05, 0)
for continuous variables. The first number is multiplicative and the second is additive.
The default was that 1.8 ((2017-1981)*.05+0) was added to the right and left sides of the x-axis as padding, so the effective default limits were c(1979, 2019)
.
Let’s tighten that up with the expand
property for the scale_y_continuous
(we’ll also change the breaks for y-axis tick marks here) and scale_x_continuous
settings:
affil_plot <- affil_plot +
scale_y_continuous(expand = c(0, 0),
breaks = seq(0, 70, 10)) +
scale_x_continuous(expand = c(0, 0))
affil_plot
But now the plot looks like it is wearing tight pants.
Let’s loosen things up a bit by updating the plot coordinates.
Use coord_cartesian
to:
Using coord_cartesian
is the preferred layer here because “setting limits on the coordinate system will zoom the plot (like you’re looking at it with a magnifying glass), and will not change the underlying data like setting limits
on a scale will.”
limits
unless you really know what you are doing! Most of the time, you want to change the coordinates instead.
affil_plot <- affil_plot +
coord_cartesian(xlim = c(1980, 2018), ylim = c(0, 80))
affil_plot
Let’s change some key theme settings:
affil_plot +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text = element_text(size = 12)) +
theme(panel.background = element_blank()) +
theme(axis.line.x = element_line(color = "gray80", size = 0.5)) +
theme(axis.ticks = element_line(color = "gray80", size = 0.5))
By default, plot titles in ggplot2
are left-aligned. For hjust
:
0
== left0.5
== centered1
== rightWe could also save all these as a custom theme. We are not fans of the default font, so we are also going to change this. To do this, you need to install the (extrafont
package)[https://github.com/wch/extrafont] and follow its setup instructions before doing this next step.
hot_diggity <- theme(plot.title = element_text(hjust = 0.5),
axis.text = element_text(size = 12),
panel.background = element_blank(),
axis.line.x = element_line(color = "gray80", size = 0.5),
axis.ticks = element_line(color = "gray80", size = 0.5),
text = element_text(family = "Lato") # need extrafont for this
)
affil_plot + hot_diggity
We could also use someone else’s theme:
library(ggthemes)
affil_plot + theme_fivethirtyeight(base_family = "Lato")
affil_plot + theme_tufte(base_family = "Palatino")
The final thing we have to mess with is the x-axis ticks and labels. We’ll do this in two steps, then override our previous layer scale_x_continuous
.
years_to_label <- seq(from = 1981, to = 2017, by = 4)
years_to_label
[1] 1981 1985 1989 1993 1997 2001 2005 2009 2013 2017
hd_years <- hdm_affil %>%
distinct(year) %>%
mutate(year_lab = ifelse(year %in% years_to_label, year, ""))
affil_plot +
hot_diggity +
scale_x_continuous(expand = c(0, 0),
breaks = hd_years$year,
labels = hd_years$year_lab)
Scale for 'x' is already present. Adding another scale for 'x', which
will replace the existing scale.
Don’t name your files “final” :)
All together in one chunk, here is our final (for now) plot! I’m also adding some additional elements here to show you options:
nathan_plot <- ggplot(hdm_affil, aes(x = year, y = num_eaten)) +
geom_col(aes(fill = affiliated)) +
labs(x = "Year", y = "Hot Dogs and Buns Consumed") +
ggtitle("Nathan's Hot Dog Eating Contest Results, 1981-2017") +
scale_fill_manual(values = c('#E9602B','#2277A0','#CCB683'),
name = "IFOCE-affiliation") +
hot_diggity +
scale_y_continuous(expand = c(0, 0),
breaks = seq(0, 70, 10)) +
scale_x_continuous(expand = c(0, 0),
breaks = hd_years$year,
labels = hd_years$year_lab) +
coord_cartesian(xlim = c(1980, 2018), ylim = c(0, 80))
nathan_plot
Adding some plot annotations rather than having a fill legend:
nathan_ann <- nathan_plot +
guides(fill = FALSE) +
coord_cartesian(xlim = c(1980, 2019), ylim = c(0, 85)) +
annotate('segment', x=1980.75, xend=2000.25, y= 30, yend=30, size=0.5, color="#CCB683") +
annotate('segment', x=1980.75, xend=1980.75, y= 30, yend=28, size=0.5, color="#CCB683") +
annotate('segment', x=2000.25, xend=2000.25, y= 30, yend=28, size=0.5, color="#CCB683") +
annotate('segment', x=1990, xend=1990, y= 33, yend=30, size=0.5, color="#CCB683") +
annotate('text', x=1990, y=36, label="No MLE/IFOCE Affiliation", color="#CCB683", family="Lato", hjust=0.5, size = 3) +
annotate('segment', x=2000.75, xend=2006.25, y= 58, yend=58, size=0.5, color="#2277A0") +
annotate('segment', x=2000.75, xend=2000.75, y= 58, yend=56, size=0.5, color="#2277A0") +
annotate('segment', x=2006.25, xend=2006.25, y= 58, yend=56, size=0.5, color="#2277A0") +
annotate('segment', x=2003.5, xend=2003.5, y= 61, yend=58, size=0.5, color="#2277A0") +
annotate('text', x=2003.5, y=65, label="MLE/IFOCE\nFormer Member", color="#2277A0", family="Lato", hjust=0.5, size = 3) +
annotate('segment', x=2006.75, xend=2017.25, y= 76, yend=76, size=0.5, color="#E9602B") +
annotate('segment', x=2006.75, xend=2006.75, y= 76, yend=74, size=0.5, color="#E9602B") +
annotate('segment', x=2017.25, xend=2017.25, y= 76, yend=74, size=0.5, color="#E9602B") +
annotate('segment', x=2012, xend=2012, y= 79, yend=76, size=0.5, color="#E9602B") +
annotate('text', x=2012, y=82, label="MLE/IFOCE Current Member", color="#E9602B", family="Lato", hjust=0.5, size = 3)
Coordinate system already present. Adding new coordinate system, which will replace the existing one.
nathan_ann
Finally, adding in another layer of data from female contestants:
hdm_females <- read_csv(here::here("data", "hot_dog_contest_with_affiliation.csv"),
col_types = cols(
affiliated = col_factor(levels = NULL),
gender = col_factor(levels = NULL)
)) %>%
mutate(post_ifoce = year >= 1997) %>%
filter(year >= 1981 & gender == "female")
glimpse(hdm_females)
Observations: 7
Variables: 6
$ year <dbl> 2017, 2016, 2015, 2014, 2013, 2012, 2011
$ gender <fct> female, female, female, female, female, female, female
$ name <chr> "Miki Sudo", "Miki Sudo", "Miki Sudo", "Miki Sudo",...
$ num_eaten <dbl> 41.00, 38.00, 38.00, 34.00, 36.75, 45.00, 40.00
$ affiliated <fct> current, current, current, current, current, curren...
$ post_ifoce <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
nathan_w_females <- nathan_ann +
# add in the female data, and manually set a fill color
geom_col(data = hdm_females,
width = 0.75,
fill = "#F68A39")
nathan_w_females
And adding a final caption:
caption <- paste(strwrap("* From 2011 on, separate Men's and Women's prizes have been awarded. All female champions to date have been MLE/IFOCE-affiliated.", 70), collapse="\n")
nathan_w_females +
# now an asterisk to set off the female scores, and a caption
annotate('text', x = 2018.5, y = 39, label="*", family = "Lato", size = 8) +
labs(caption = caption) +
theme(plot.caption = element_text(family = "Lato", size=8, hjust=0, margin=margin(t=15)))