Data Wrangling R Cheat Sheet



The Ultimate R Cheat Sheet connects the documentation for the R package ecosystem within the data science workflow. The R Cheat Sheet is a key component in learning the R programming language efficiently. We teach the cheat sheet in our Business Analysis With R (DS4B 101-R) Course, which is the first course in the Data Science For Business R-Track. This cheat sheet is inspired by the data wrangling cheat sheets from RStudio and pandas. Examples are based on the Kaggle Titanic data set. Created by Tom Kwong, September 2020. V0.21 rev3. gdf = groupby(df, :pclass) and gdf = groupby(df, :pclass, :sex) group a data frame by one or more columns.

Google Sheets are a useful way to collect, store, and collaboratively work with data. The googlesheets4 package wraps the Sheets API, making it easy for you to work with Google Sheets in R.

The “4” in googlesheets4 refers to the most recent version (v4) of the Google Sheets API. There’s also an R package called googlesheets, which uses an older version (v3) of the Google Sheets API. If you’ve worked with the googlesheets package previously, note that the Sheets API v3 will be shut down on March 3, 2020, so you’ll need to switch over to googlesheets4.

14.1 Reading


Reading data stored in a Google Sheet into R will probably be your most common use of googlesheets4. Here, we’ll read in the data from our example sheet, which contains data from Gapminder.

To read in the data, we need a way to identify the Google Sheet. googlesheets4 supports multiple ways of identifying sheets, but we recommend using the sheet ID, as it’s stable and concise. You can find the ID of a Google Sheet in its URL: it is the long string between /d/ and /edit.

If you want to extract an ID from a URL programmatically, you can also use the function as_sheets_id().
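As a minimal sketch of extracting an ID programmatically, the snippet below passes a URL to as_sheets_id(). The URL and the ID inside it are hypothetical placeholders, not the Gapminder sheet used in this chapter.

```r
library(googlesheets4)

# Hypothetical URL for illustration; the ID sits between "/d/" and "/edit"
url <- "https://docs.google.com/spreadsheets/d/1AbCdEfGhIjKlMnOpQrStUvWxYz0123456789abcdEFGH/edit#gid=0"

# as_sheets_id() parses the sheet ID out of the URL
id <- as_sheets_id(url)

# Convert to a plain character string if you want to store or print the ID
id_string <- as.character(id)
```

The resulting object can be passed wherever a googlesheets4 function expects the ss argument.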


We’ve stored the ID for the Gapminder sheet in the parameters section up at the top. Here it is:

Now, we can use the googlesheets4 function read_sheet() to read in the data. read_sheet()’s first argument, ss, takes the sheet ID.

Notice that the original Sheet contains multiple sheets, one for each continent. We can list all these sheets by using the function sheets_sheets() (named sheet_names() in later releases of googlesheets4).

By default, read_sheet() reads in the first sheet. Here, that’s the Africa sheet. If we want to read in Asia, we can specify the sheet argument.
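The reading steps above might look like the following sketch. It makes several assumptions: it uses the public Gapminder example sheet that ships with recent googlesheets4 releases (looked up with gs4_example()) rather than the sheet ID from this chapter, it uses the newer function name sheet_names(), and it needs an internet connection.

```r
library(googlesheets4)

# Assumption: the example sheet is public, so no Google sign-in is needed
gs4_deauth()

# Assumption: look up the package's public Gapminder example sheet
id_gapminder <- gs4_example("gapminder")

# List the worksheets inside the Sheet (one per continent)
continents <- sheet_names(id_gapminder)

# With no sheet argument, the first worksheet is read
africa <- read_sheet(id_gapminder)

# Name a worksheet explicitly to read a different one
asia <- read_sheet(id_gapminder, sheet = "Asia")
```

Both africa and asia are tibbles, so they can go straight into the usual dplyr or base R workflow.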

14.2 Writing

As of 2019-12-05, you cannot write to Google Sheets with the googlesheets4 package. Check back for updates.

14.3 Finding sheets

It can sometimes be difficult to find the exact Google Sheet you’re looking for. googlesheets4 includes a handy function, sheets_find(), that returns the names of all your sheets, alongside their IDs, in an object called a dribble. A dribble is a tibble specifically for storing metadata about Google Drive files.

Note that sheets_find() (named gs4_find() in later releases of googlesheets4) will list both sheets that you own and private sheets that you have access to. These are the same sheets that you can see on your Google Sheets homepage.

Now, you can easily search for a sheet by piping the results of sheets_find() into view().

14.4 Authentication

14.4.1 Interactive session

When you run R code in the console or in an R Markdown chunk, you’re in an interactive session. R understands that it’s interacting with a human, and so can prompt you for input or actions. In an interactive session, you don’t need to worry much about authentication. googlesheets4 will do most of the work for you.

The first time you call a googlesheets4 function that requires authentication (e.g., read_sheet(ss = id_gapminder)), a browser tab will open and prompt you to sign into Google. Sign into your account and then return to RStudio.

By default, your user credentials will now be stored as something called a gargle token. gargle is the name of an R package for working with Google APIs. The next time googlesheets4 requires authentication, it will use this token to authenticate you. On a Mac, you can locate your gargle token by looking in ~/.R/gargle/.

14.4.2 Non-interactive session

When you knit an R Markdown document, you’re using R non-interactively. googlesheets4 can’t prompt you to sign into Google, because it doesn’t assume that there’s a human standing by to do so. This should only be a problem if you’re trying to knit an R Markdown document that uses googlesheets4 and you’ve never authenticated with googlesheets4 before. The easiest way to quickly authenticate and set up your gargle token is to run googlesheets4::sheets_auth() (named gs4_auth() in later releases of googlesheets4). You can run this anywhere: console, R Markdown chunk, etc. Once you’ve signed into Google and returned to RStudio, try knitting your document.

If you’ve authenticated with googlesheets4 before, but your R Markdown document never finishes knitting, you may need to update your gargle token. Run googlesheets4::sheets_auth() and then try knitting again.

I reproduce some of the plots from RStudio’s ggplot2 cheat sheet using base R graphics. I didn’t try to pretty up these plots, but you should.

I use this dataset

The main functions that I generally use for plotting are

  • Plotting Functions
    • plot: Makes scatterplots, line plots, among other plots.
    • lines: Adds lines to an already-made plot.
    • par: Change plotting options.
    • hist: Makes a histogram.
    • boxplot: Makes a boxplot.
    • text: Adds text to an already-made plot.
    • legend: Adds a legend to an already-made plot.
    • mosaicplot: Makes a mosaic plot.
    • barplot: Makes a bar plot.
    • jitter: Adds a small value to data (so points don’t overlap on a plot).
    • rug: Adds a rugplot to an already-made plot.
    • polygon: Adds a shape to an already-made plot.
    • points: Adds a scatterplot to an already-made plot.
    • mtext: Adds text on the edges of an already-made plot.
  • Sometimes needed to transform data (or make new data) to make appropriate plots:
    • table: Builds frequency and two-way tables.
    • density: Calculates the density.
    • loess: Calculates a smooth line.
    • predict: Predicts new values based on a model.


All of the plotting functions have arguments that control the way the plot looks. You should read about these arguments. In particular, read carefully the help page ?plot.default. Useful ones are:

  • main: This controls the title.
  • xlab, ylab: These control the x and y axis labels.
  • col: This will control the color of the lines/points/areas.
  • cex: This will control the size of points.
  • pch: The type of point (circle, dot, triangle, etc…)
  • lwd: Line width.
  • lty: Line type (solid, dashed, dotted, etc…).
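To make these arguments concrete, here is a small sketch that uses each of them once. The dataset used in the original is not shown here, so this example assumes the built-in mtcars data instead.

```r
# Assumption: mtcars (ships with R) stands in for the original dataset
plot(mtcars$wt, mtcars$mpg,
     main = "Fuel economy by weight",   # title
     xlab = "Weight (1000 lbs)",        # x-axis label
     ylab = "Miles per gallon",         # y-axis label
     col  = "steelblue",                # point color
     pch  = 19,                         # solid circle
     cex  = 1.2)                        # slightly larger points

# Add a horizontal reference line at the mean mpg, drawn thick and dashed
mean_mpg <- mean(mtcars$mpg)
lines(range(mtcars$wt), c(mean_mpg, mean_mpg), lwd = 2, lty = 2)
```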

Discrete

Barplot
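A minimal sketch of a bar plot, assuming the built-in mtcars data in place of the original dataset: table() counts the cars in each cylinder group and barplot() draws one bar per count.

```r
# Assumption: mtcars stands in for the original dataset
counts <- table(mtcars$cyl)

# barplot() returns the x positions of the bar midpoints
bar_mids <- barplot(counts,
                    main = "Cars by number of cylinders",
                    xlab = "Cylinders",
                    ylab = "Count")
```

The returned midpoints are handy if you later want to place text labels above the bars.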

Different type of bar plot
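The original does not say which variant it shows, so as one possible reading, this sketch draws a grouped bar plot from a two-way table. It assumes the built-in mtcars data.

```r
# Assumption: mtcars stands in for the original dataset
tab <- table(Transmission = mtcars$am, Cylinders = mtcars$cyl)
rownames(tab) <- c("Automatic", "Manual")   # am: 0 = automatic, 1 = manual

# beside = TRUE places the bars side by side instead of stacking them;
# legend.text = TRUE labels the groups with the row names of the table
bar_mids <- barplot(tab,
                    beside = TRUE,
                    legend.text = TRUE,
                    args.legend = list(x = "topleft"),
                    main = "Cylinders by transmission",
                    xlab = "Cylinders",
                    ylab = "Count")
```

Dropping beside = TRUE gives the stacked version of the same plot.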

Continuous X, Continuous Y

Scatterplot

Jitter points to account for overlapping points.
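A short sketch of jittering, assuming the built-in mtcars data: cyl and gear take only a few distinct values, so many points would otherwise sit exactly on top of each other.

```r
# Assumption: mtcars stands in for the original dataset
set.seed(1)  # jitter() adds random noise, so fix the seed for reproducibility

x_jit <- jitter(mtcars$cyl)
y_jit <- jitter(mtcars$gear)

plot(x_jit, y_jit,
     main = "Gears by cylinders (jittered)",
     xlab = "Cylinders",
     ylab = "Gears")
```

The amount argument of jitter() controls how far points may move if the default is too small or too large.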

Add a rug plot
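A minimal sketch, assuming the built-in mtcars data: rug() draws small tick marks along an axis to show the marginal distribution of a variable.

```r
# Assumption: mtcars stands in for the original dataset
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")

rug(mtcars$wt, side = 1)   # ticks along the bottom (x) axis
rug(mtcars$mpg, side = 2)  # ticks along the left (y) axis
```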

Add a Loess Smoother
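One way to add a smoother, sketched with the built-in mtcars data as an assumed stand-in: fit the model with loess(), predict on a fine grid with predict(), and draw the result with lines().

```r
# Assumption: mtcars stands in for the original dataset
fit <- loess(mpg ~ wt, data = mtcars)

# Predict on an evenly spaced grid so the curve looks smooth
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100))
smooth <- predict(fit, newdata = grid)

plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")
lines(grid$wt, smooth, col = "blue", lwd = 2)
```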

Loess smoother with upper and lower 95% confidence bands
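A sketch of the confidence bands under two assumptions: the built-in mtcars data stands in for the original dataset, and the bands are approximate pointwise 95% intervals built from the standard errors that predict() returns when se = TRUE.

```r
# Assumption: mtcars stands in for the original dataset
fit <- loess(mpg ~ wt, data = mtcars)
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100))

# se = TRUE returns the fit, its standard error, and the degrees of freedom
pred <- predict(fit, newdata = grid, se = TRUE)
crit <- qt(0.975, df = pred$df)
upper <- pred$fit + crit * pred$se.fit
lower <- pred$fit - crit * pred$se.fit

plot(mtcars$wt, mtcars$mpg,
     ylim = range(c(lower, upper, mtcars$mpg)),
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")
lines(grid$wt, pred$fit, col = "blue", lwd = 2)
lines(grid$wt, upper, col = "blue", lty = 2)  # dashed upper band
lines(grid$wt, lower, col = "blue", lty = 2)  # dashed lower band
```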

Loess smoother with upper and lower 95% confidence bands and that fancy shading from ggplot2.
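The shaded version can be sketched with polygon(), again assuming the built-in mtcars data and approximate pointwise 95% bands. The trick is to trace the upper band left to right and then the lower band right to left, so the two together form a closed shape.

```r
# Assumption: mtcars stands in for the original dataset
fit <- loess(mpg ~ wt, data = mtcars)
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100))
pred <- predict(fit, newdata = grid, se = TRUE)
crit <- qt(0.975, df = pred$df)
upper <- pred$fit + crit * pred$se.fit
lower <- pred$fit - crit * pred$se.fit

# Set up an empty plot first so the shading ends up behind the points
plot(mtcars$wt, mtcars$mpg, type = "n",
     ylim = range(c(lower, upper, mtcars$mpg)),
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")

# Closed outline: upper band forward, lower band backward
polygon(c(grid$wt, rev(grid$wt)), c(upper, rev(lower)),
        col = adjustcolor("grey", alpha.f = 0.5), border = NA)

points(mtcars$wt, mtcars$mpg)
lines(grid$wt, pred$fit, col = "blue", lwd = 2)
```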


Add text to a plot
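As a small illustration, assuming the built-in mtcars data, this sketch labels the most fuel-efficient car on a scatterplot with text().

```r
# Assumption: mtcars stands in for the original dataset
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")

# Find the car with the highest mpg and label that point
best <- which.max(mtcars$mpg)
text(mtcars$wt[best], mtcars$mpg[best],
     labels = rownames(mtcars)[best],
     pos = 4,     # place the label to the right of the point
     cex = 0.8)   # slightly smaller text
```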

Discrete X, Discrete Y

Mosaic Plot
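A minimal sketch of a mosaic plot, assuming the built-in mtcars data: build a two-way table with table() and pass it to mosaicplot().

```r
# Assumption: mtcars stands in for the original dataset
tab <- table(Cylinders = mtcars$cyl, Gears = mtcars$gear)

# Tile areas are proportional to the cell counts of the table
mosaicplot(tab,
           main = "Cylinders by gears",
           color = TRUE)
```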


Color code a scatterplot by a categorical variable and add a legend.
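One way to do this, sketched with the built-in mtcars data as an assumed stand-in: convert the categorical variable to a factor, use it to index a vector of colors, and pass the same colors to legend().

```r
# Assumption: mtcars stands in for the original dataset
cyl_f <- factor(mtcars$cyl)
cols <- c("red", "blue", "darkgreen")   # one color per factor level

# Indexing by a factor uses its integer codes, so each point gets
# the color that belongs to its level
point_cols <- cols[cyl_f]

plot(mtcars$wt, mtcars$mpg,
     col = point_cols, pch = 19,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")

legend("topright",
       legend = levels(cyl_f),
       col = cols, pch = 19,
       title = "Cylinders")
```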


par sets the graphics options, where mfrow is the parameter controlling the layout of the facets (number of rows and columns of plots).

The first line sets the new options and saves the old options in the list old_options. The last line reinstates the old options.
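The pattern described above can be sketched as follows, assuming the built-in mtcars data and two arbitrary plots as the facets. par() returns the previous values of the options it changes, which is what makes the restore step possible.

```r
# Set a 1-row, 2-column layout and save the previous settings
old_options <- par(mfrow = c(1, 2))

# Assumption: mtcars stands in for the original dataset
hist(mtcars$mpg,
     main = "Histogram of mpg",
     xlab = "Miles per gallon")
boxplot(mpg ~ cyl, data = mtcars,
        main = "mpg by cylinders",
        xlab = "Cylinders",
        ylab = "Miles per gallon")

# Restore the previous settings
par(old_options)
```

Restoring the options matters because par() changes persist for the rest of the session on that graphics device.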


This R Markdown site was created with workflowr