ggplot
 hillali@ohsu.edu
 OHSU Gaines Hall 19
This workshop was presented at OHSU on June 24, 2016. The same dataset and plots were made for a parallel breakout session using Python; materials are all here: https://github.com/abalter/data-viz-jamboree
Review: Graphics and statistics for cardiology: comparing categorical and continuous variables Full text Authors: Kenneth Rice, Thomas Lumley
doi:10.1136/heartjnl-2015-308104
NHANES (National Health and Nutrition Examination Survey) http://www.cdc.gov/nchs/nhanes/
Dataset url: http://faculty.washington.edu/kenrice/heartgraphs/
Do this once per machine.
install.packages("readr")
install.packages("ggplot2")
install.packages("ggbeeswarm")
install.packages("dplyr")
install.packages("MASS")
install.packages("hexbin")
install.packages("ggalt")
Do this once per R session.
library(readr)
library(ggplot2)
library(ggbeeswarm)
library(dplyr)
library(MASS)
library(hexbin)
library(ggalt)
readr
package to import our csv with read_csv
functionread_csv()
parentheses is the url to the datasetna = "."
, specifies that missing data in this dataset is denoted by a period<-
and call that object something simple like heart
heart <- read_csv("http://faculty.washington.edu/kenrice/heartgraphs/nhaneslarge.csv", na = ".") #na= tells R that . is an na value
Use the head()
function in your console and check with your partner to see if you got the same output
head(heart)
Let's look at the variables in our heart
dataframe:
names(heart)
[1] "BPXSAR" "BPXDAR" "BPXDI1" "BPXDI2" "race_ethc"
[6] "gender" "DR1TFOLA" "RIAGENDR" "BMXBMI" "RIDAGEYR"
These are pretty cryptic variable names. For example, you may be wondering what DR1TFOLA
and BPXSAR
are. The data dictionary online helps us demystify the variables we are interested in:
BPXSAR
: systolic blood pressure (mmHg)BPXDAR
: diastolic blood pressure (mmHg)BPXDI1
, BPXDI2
: two diastolic blood pressure readingsrace_ethc
: race/ethnicity, coded as Hispanic, White non-Hispanic, Black non-Hispanic and Othergender
: sex, coded as Male/FemaleDR1TFOLA
: folate intake (μg/day)RIAGENDR
: sex, coded as 1/2BMXBMI
: body mass index (kg/m2)RIDAGEY
: age (years)Let's start with folate intake.
ggplot(heart, aes(x = DR1TFOLA))
The code above should produce an empty plot with DR1TFOLA
on the x-axis. The first argument in the ggplot()
function is your dataframe- we called ours heart
. Then we set 1 aesthestic: the x-axis, and mapped it only our folate intake variable.
ggplot
makes a plot by layering. So let's take this empty plot and add a visible layer to it- called a geometric object or geom
for short. Let's start with the geom_histogram
. You can always find out what arguments go along with a specific geom using the help function in the console:
?geom_histogram
Let's focus first on univariate plots by examining folate intake (a continuous variable).
Create a histogram by adding a geom_histogram()
layer and using the +
sign
ggplot(heart, aes(x = DR1TFOLA)) +
geom_histogram()
Small tweak #1: change x-axis title
ggplot(heart, aes(x = DR1TFOLA)) +
geom_histogram() +
labs(x = "folate intake (μg/day)")
Small tweak #2: change outline of bars
ggplot(heart, aes(x = DR1TFOLA)) +
geom_histogram(colour = "white") +
labs(x = "folate intake (μg/day)")
Small tweak #3: change fill of bars
ggplot(heart, aes(x = DR1TFOLA)) +
geom_histogram(colour = "white", fill = "peachpuff") +
labs(x = "folate intake (μg/day)")
Small tweak #4: change number of bins
All of our histograms have given us this warning:
#stat_bin() using bins = 30. Pick better value with binwidth.
This warning tells us that this is a default that ggplot chose for us given the data it has. For now, we will change the amount of bins to 50. Feel free to play around with this. You can change this to any number.
ggplot(heart, aes(x = DR1TFOLA)) +
geom_histogram(colour = "white", fill = "peachpuff", bins = 50) +
labs(x = "folate intake (μg/day)")
OK, let's add another variable to the mix. We'll keep looking at folate intake as the continuous variable, with gender
as the categorical variable. There are a few ways to do this in ggplot2
.
About facets (from facet_wrap
documentation):
"Most displays are roughly rectangular, so if you have a categorical variable with many levels, it doesn't make sense to try and display them all in one row (or one column). To solve this dilemma, facet_wrap wraps a 1d sequence of panels into 2d, making best use of screen real estate."
Let's add + facet_wrap(~variable)
to split the histograms based on gender
ggplot(heart, aes(x = DR1TFOLA)) +
geom_histogram(colour = "white", fill = "peachpuff", bins = 50) +
labs(x = "folate intake (μg/day)") +
facet_wrap(~gender)
Let's say you want the exact same plot, but want to see if density gives a different picture than a histogram. To do this, we pick a different geom
. The code is similar, but for geom_density
, make sure to take out the bin specification as a density plot doesn't have visible bins.
About density estimation (from geom_density
documentation):
"Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample."
Use the kernel density geom (geom_density
) instead of geom_histogram
ggplot(heart, aes(x = DR1TFOLA)) +
geom_density(colour = "white", fill = "peachpuff") +
labs(x = "folate intake (μg/day)") +
facet_wrap(~gender)
So far, when we played with colours, we set fill/colour without mapping them onto another variable (i.e., (colour = "white", fill = "peachpuff")
). Now, we want fill/colour to vary based on the value of a specific variable. Here we want colour to depend on gender
, so we map the colour aesthetic (aes(colour = variable)
) onto the variable gender
.
Notice moving the colour into aesthetics parentheses.
ggplot(heart, aes(x = DR1TFOLA)) +
geom_density(aes(colour = gender)) +
labs(x = "folate intake (μg/day)")
This is the exact same! Play around with changing global aesthetics (like below) versus geom-specific aesthetics (like above). It won't matter when you only have one layer, but once you start adding geoms
and stats
, this can be a powerful way to change your visualization.
ggplot(heart, aes(x = DR1TFOLA, colour = gender)) +
geom_density() +
labs(x = "folate intake (μg/day)")
Now let's make some side-by-side univariate plots, specifically dotcharts or stripcharts, to visualize systolic blood pressure by gender.
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_point() +
labs(x = "", y = "Systolic BP (mmHg)")
Too many points!
Small tweak #2: Add alpha in the geom layer to make points more transparent.
Alpha works on a scale from 0 (transparent) to 1 (opaque).
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_point(alpha = .3) +
labs(x = "", y = "Systolic BP (mmHg)")
Small tweak #3: try jittering the points
To do this, instead of geom_point
, we will switch to using geom_jitter
, which automatically adds both vertical and horizontal space (noise) to your datapoints.
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_jitter(alpha = .3) +
labs(x = "", y = "Systolic BP (mmHg)")
Small tweak #4: control the jitter
Sometimes you may only want to change the width of jitter but not the height.
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_jitter(alpha = .3, width = .2, height = 0) +
labs(x = "", y = "Systolic BP (mmHg)")
Now let's try a different geom in our side-by-side plot, which is available through the beeswarm
package you should have already loaded.
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_beeswarm() +
labs(x = "", y = "Systolic BP (mmHg)")
Small tweak #1: add alpha again
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_beeswarm(alpha = .2) +
labs(x = "", y = "Systolic BP (mmHg)")
Small tweak #2: add statistics
We'll include the mean plus 95% CI
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_beeswarm(alpha = .2) +
stat_summary(fun.y = "mean", geom = "point", colour = "orange") +
stat_summary(fun.data = mean_cl_boot, geom = "linerange", colour = "orange") +
labs(x = "", y = "Systolic BP (mmHg)")
Small tweak #3: add another geom layer
Try adding geom_boxplot
on top of your side-by-side beeswarm plot. Also try re-ordering the geoms to see what changes.
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_boxplot(outlier.shape = NA) +
geom_beeswarm(alpha = .2) +
labs(x = "", y = "Systolic BP (mmHg)")
New geom time- this time use geom_violin
.
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_violin(alpha = .2) +
labs(x = "", y = "Systolic BP (mmHg)")
Small tweak #1: add statistics
Let's include the sample mean and median
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_violin(alpha = .2) +
stat_summary(fun.y = "mean", geom = "point", colour = "orange") +
stat_summary(fun.y = "median", geom = "point", colour = "blue") +
labs(x = "", y = "Systolic BP (mmHg)")
Small tweak #2: add another geom layer
Add another layer to your violin plot: try geom_boxplot
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_violin(alpha = .2) +
geom_boxplot(width = .05) +
labs(x = "", y = "Systolic BP (mmHg)")
Now we'll create some bivariate plots to look at the association between age and systolic blood pressure, both of which are continuous variables.
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_point() +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
If you have big n, try hexbin plot
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_hex() +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
Make colors make more sense
library(viridis)
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_hex() +
labs(x = "Age (years)", y = "Systolic BP (mmHg)") +
scale_fill_gradientn(colours = viridis(256, begin = 1, end = 0))
library(ggalt)
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
stat_bkde2d(aes(fill = ..level.., alpha = ..level..), geom = "polygon", bandwidth = c(2,2)) +
labs(x = "Age (years)", y = "Systolic BP (mmHg)") +
scale_fill_gradientn(colours = viridis(256, begin = 1, end = 0))
Add linear regression line with SE
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
Default is loess line
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_point() +
geom_smooth() +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
Add splines
library(splines)
library(MASS)
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_point() +
stat_smooth(method = "lm", formula = y ~ ns(x, 3)) +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
We'll finish up by creating some multivariable plots that help us visualize how the association between body mass index & systolic BP varies by gender and age (so 4 variables!).
Just copy this code to create a categorical age variable:
library(dplyr)
heart2 <- heart %>%
mutate(age_cat = cut(RIDAGEYR, c(0, 30, 55, 100)))
Recreate theirs first
ggplot(heart2, aes(x = BMXBMI, y = BPXSAR)) +
geom_point() +
stat_smooth(aes(colour = gender), method = "lm") +
facet_wrap(~age_cat) +
labs(x = "Body Mass Index"~(kg/m^2), y = "Systolic BP (mmHg)")
Try with facet grid, update labels
ggplot(heart2, aes(x = BMXBMI, y = BPXSAR)) +
geom_point() +
stat_smooth(aes(colour = gender), method = "lm") +
facet_grid(gender~age_cat) +
labs(x = "Body Mass Index"~(kg/m^2), y = "Systolic BP (mmHg)")
Play with colors!
ggplot(heart2, aes(x = BMXBMI, y = BPXSAR, colour = gender)) +
geom_point(alpha = .5) +
stat_smooth(method = "lm") +
facet_grid(gender~age_cat) +
theme_minimal() +
labs(x = "Body Mass Index"~(kg/m^2), y = "Systolic BP (mmHg)") +
scale_color_manual(values = c("#B47CC7", "#D65F5F"), guide = FALSE)
my_colors <- c("#C4AD66", "#77BEDB")
ggplot(heart2, aes(x = BMXBMI, y = BPXSAR, colour = gender)) +
geom_point(alpha = .5) +
stat_smooth(method = "lm") +
facet_grid(gender~age_cat) +
theme_light() +
labs(x = "Body Mass Index"~(kg/m^2), y = "Systolic BP (mmHg)") +
scale_color_manual(values = my_colors, guide = FALSE)