Read all the way through step 6, and note that there is a file that needs to be turned in to Sakai before Wednesday at noon!
This guide will lead you through the steps to install and use R, a free and open-source software environment for statistical computing and graphics.
What is R?
What is RStudio?
Our end goal is to get you looking at a screen like this:
Install R from CRAN, the Comprehensive R Archive Network. Please choose a precompiled binary distribution for your operating system.
Launch R. You should see one console with a command line interpreter (>). Close R.
Install the free, open-source edition of RStudio: http://www.rstudio.com/products/rstudio/download/
RStudio provides a powerful user interface for R, called an integrated development environment. RStudio includes:
Launch RStudio. You should get a window similar to the screenshot you see here, but yours will be empty. Look at the bottom left pane: this is the same console window you saw when you opened R in step 1.15.
Place your cursor in the console where you see > and type x <- 2 + 2, hit enter or return, then type x, and hit enter/return again. If [1] 4 prints to the screen, you have successfully installed R and RStudio, and you can move on to installing packages.
The version of R that you just downloaded is considered base R, which provides you with good but basic statistical computing and graphics powers. For analytical and graphical super-powers, you’ll need to install add-on packages, which are user-written, to extend your R capabilities. Packages can live in one of two places:
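On CRAN, in which case you can install them with install.packages("name_of_package", dependencies = TRUE).
On GitHub, in which case you first need to install the devtools package. Note that install_github() expects the GitHub user name followed by the package name:
install.packages("devtools")
library(devtools)
install_github("username/name_of_package")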
Place your cursor in the console again (where you last typed x and [1] 4 printed on the screen). You can use the first method to install the following packages directly from CRAN, all of which we will use: dplyr, ggplot2, and babynames.
You can download all of these at once, too:
install.packages(c("dplyr", "ggplot2", "babynames"),
                 dependencies = TRUE)
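If you end up re-running an install line, it can help to check first which packages are already present. The following is a minimal sketch and not part of the original instructions: find_missing is a hypothetical helper name, and it relies only on base R's installed.packages().

```r
# Hypothetical helper (not from this guide): report which of the requested
# packages are not yet installed in your R library.
find_missing <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

wanted <- c("dplyr", "ggplot2", "babynames")
to_install <- find_missing(wanted)
to_install  # character(0) means everything is already installed

# Install only what is missing (uncomment to run):
# if (length(to_install) > 0) install.packages(to_install, dependencies = TRUE)
```

The install line is left commented out so that running the sketch never downloads anything by accident.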
Heads up: We should formally introduce the combine command, c(), used above. You will use this often: any time you want to combine things into a vector.
c("hello", "my", "name", "is", "alison")
[1] "hello" "my" "name" "is" "alison"
c(1:3, 20, 50)
[1] 1 2 3 20 50
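One thing worth knowing about c(): a vector holds a single type of value, so mixing numbers and text converts everything to text. A small sketch in base R (the object names nums and mixed are made up for illustration):

```r
# A vector holds one type of value, so c() converts mixed inputs to a common type.
nums  <- c(1:3, 20, 50)   # all numbers
mixed <- c(1, "two", 3)   # numbers and text together become text

class(nums)    # "numeric"
class(mixed)   # "character"
length(nums)   # 5

# c() also joins existing vectors end to end
c(nums, 100)
```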
Mind your use of quotes carefully with packages.
To install a package, put its name in quotes: install.packages("name_of_package").
To use an installed package, load it with library(name_of_package), leaving the name of the package bare. You only need to do this once per RStudio session.
To get help on a package, use help(name_of_package) or ?name_of_package.
To get the citation for a package, use citation("name_of_package").
install.packages("dplyr", dependencies = TRUE)
library(dplyr)
help("dplyr")
citation("ggplot2")
To cite ggplot2 in publications, please use:
H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
Springer-Verlag New York, 2016.
A BibTeX entry for LaTeX users is
@Book{,
author = {Hadley Wickham},
title = {ggplot2: Elegant Graphics for Data Analysis},
publisher = {Springer-Verlag New York},
year = {2016},
isbn = {978-3-319-24277-4},
url = {http://ggplot2.org},
}
Heads up: R is case-sensitive, so ?dplyr works but ?Dplyr will not. Likewise, a variable called A is different from a.
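A quick way to convince yourself of this, sketched in base R (the values 1 and 2 are arbitrary):

```r
# R treats upper and lower case as different names.
a <- 1
A <- 2

a == A   # FALSE: these are two separate variables
a + A    # 3

# Function names are case-sensitive too.
exists("library")   # TRUE
exists("Library")   # FALSE: there is no function with this capitalization
```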
Open a new R script in RStudio by going to File --> New File --> R Script. For this first foray into R, I’ll give you the code, so sit back and relax and feel free to copy and paste my code with some small tweaks.
First load the packages:
library(babynames) # contains the actual data
library(dplyr) # for manipulating data
library(ggplot2) # for plotting data
Next, we’ll follow best practices for inspecting a freshly read dataset. Also, see “What I do when I get a new data set as told through tweets” for more ideas about exploring a new dataset. Here are some critical commands to obtain a high-level overview (HLO) of your freshly read dataset in R. We’ll call it saying hello to your dataset:
glimpse(babynames) # dplyr
Observations: 1,858,689
Variables: 5
$ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 188...
$ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F...
$ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret"...
$ n <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 128...
$ prop <dbl> 0.072384329, 0.026679234, 0.020521700, 0.019865989, 0.017...
head(babynames) # base R
# A tibble: 6 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880. F Mary 7065 0.0724
2 1880. F Anna 2604 0.0267
3 1880. F Emma 2003 0.0205
4 1880. F Elizabeth 1939 0.0199
5 1880. F Minnie 1746 0.0179
6 1880. F Margaret 1578 0.0162
tail(babynames) # same
# A tibble: 6 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 2015. M Zyah 5 0.00000247
2 2015. M Zykell 5 0.00000247
3 2015. M Zyking 5 0.00000247
4 2015. M Zykir 5 0.00000247
5 2015. M Zyrus 5 0.00000247
6 2015. M Zyus 5 0.00000247
names(babynames) # same
[1] "year" "sex" "name" "n" "prop"
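Base R has a few more overview functions in the same spirit. The sketch below is an optional extra: to keep it runnable without any packages, it uses a stand-in data frame (first_rows, a made-up name) typed in from the first three rows that head(babynames) printed above. The same calls should work on babynames itself.

```r
# Stand-in data: the first three rows printed by head(babynames) above,
# typed in by hand so this runs without loading any packages.
first_rows <- data.frame(
  year = c(1880, 1880, 1880),
  sex  = c("F", "F", "F"),
  name = c("Mary", "Anna", "Emma"),
  n    = c(7065L, 2604L, 2003L),
  prop = c(0.0724, 0.0267, 0.0205)
)

dim(first_rows)      # number of rows, then number of columns
str(first_rows)      # compact structure view, similar in spirit to glimpse()
summary(first_rows)  # per-column summaries
```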
If you have done the above and produced sane-looking output, you are ready for the next bit. Use the code below to create a new data frame called alison.
alison <- babynames %>%
  filter(name == "Alison" | name == "Allison") %>%
  filter(sex == "F")
The first bit makes a new dataset called alison that starts as a copy of the babynames dataset; the %>% (the pipe) passes it along to the steps that follow.
The second bit filters our babynames to only keep rows where the name is either Alison or Allison (read | as “or”).
The third bit applies another filter to keep only those rows where sex is female.
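To see what the tests inside filter() actually produce, here is a small base R sketch. The three-element vector is made up for illustration; %in% is a base R operator that this guide does not otherwise use, shown only as a shorter alternative to chaining | tests.

```r
# A made-up mini vector of names, to show what the tests inside filter() produce.
name <- c("Alison", "Allison", "Mary")

name == "Alison"                       # TRUE FALSE FALSE
name == "Alison" | name == "Allison"   # TRUE TRUE FALSE
name %in% c("Alison", "Allison")       # same result, shorter to write

# Base R can use such a TRUE/FALSE vector to keep the matching elements
name[name %in% c("Alison", "Allison")]
```

Each test returns one TRUE or FALSE per row, and filter() keeps the rows where the answer is TRUE.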
Let’s check out the data.
alison
# A tibble: 214 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1905. F Alison 7 0.0000226
2 1907. F Alison 5 0.0000148
3 1908. F Allison 6 0.0000169
4 1910. F Alison 5 0.0000119
5 1910. F Allison 5 0.0000119
6 1911. F Allison 9 0.0000204
7 1912. F Allison 12 0.0000205
8 1912. F Alison 9 0.0000153
9 1913. F Alison 12 0.0000183
10 1913. F Allison 7 0.0000107
# ... with 204 more rows
glimpse(alison)
Observations: 214
Variables: 5
$ year <dbl> 1905, 1907, 1908, 1910, 1910, 1911, 1912, 1912, 1913, 191...
$ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F...
$ name <chr> "Alison", "Alison", "Allison", "Alison", "Allison", "Alli...
$ n <int> 7, 5, 6, 5, 5, 9, 12, 9, 12, 7, 22, 11, 16, 13, 24, 15, 2...
$ prop <dbl> 2.259012e-05, 1.481776e-05, 1.692367e-05, 1.191821e-05, 1...
Again, if you have sane-looking output here, move along to plotting the data!
plot <- ggplot(alison, aes(x = year,
                           y = prop,
                           group = name,
                           color = name)) +
  geom_line()
Now if you did this right, you will not see your plot! Because we saved the ggplot with a name (plot), R just saved the object for you. But check out the top right pane in RStudio again: under Values you should see plot, so it is there, you just have to ask for it. Here’s how:
plot
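The same save-now, show-later behavior applies to any object, not just plots. A quick base R sketch (y is just an example name):

```r
y <- 2 + 2     # assignment: nothing is printed
y              # typing the name prints the value
print(y)       # the same thing, spelled out
(y <- 2 + 2)   # wrapping an assignment in parentheses saves and prints in one step
```

For a ggplot object, print(plot) is the spelled-out form of typing plot on its own line.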
Edit my code above to create a new dataset. Pick 2 names to compare how popular they each are (these could be different spellings of your own name, like I did, but you can choose any 2 names that are present in the dataset). Make the new plot, changing the name of the first argument alison in ggplot() to the name of your new dataset.