Deep dive into ggplot2 layers - II

Lecture 2

Dr. Greg Chism

University of Arizona
INFO 526 - Fall 2024

Announcements

  • HW 1 due Wednesday (Sept 11).

Setup

# load packages
if(!require(pacman))
  install.packages("pacman")

pacman::p_load(tidyverse,
               countdown,
               hexbin,
               palmerpenguins,
               ggrepel,
               here,
               waffle,
               scales)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))

# set width of code output
options(width = 65)

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7,        # 7" width
  fig.asp = 0.618,      # the golden ratio
  fig.retina = 3,       # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300             # higher dpi, sharper image
)

# read in tucsonHousing.csv
tucsonHousing <- read_csv(here(
  "slides", "01", "data", "tucsonHousing.csv"))

From last time

tucsonHousing <- tucsonHousing |>
  mutate(
    decade_built = (year_built %/% 10) * 10,
    decade_built_cat = case_when(
      decade_built <= 1950 ~ "1950 or before",
      decade_built >= 2000 ~ "2000 or after",
      TRUE ~ as.character(decade_built)
    )
  )

mean_area_decade <- tucsonHousing |>
  group_by(decade_built_cat) |>
  summarise(mean_area = mean(area))

Geoms

  • Geometric objects, or geoms for short, perform the actual rendering of the layer, controlling the type of plot that you create

  • You can think of them as “the geometric shape used to represent the data”

One variable

  • Discrete

    • geom_bar(): display the distribution of a discrete variable

  • Continuous

    • geom_histogram(): bin and count a continuous variable, display with bars

    • geom_density(): smoothed density estimate

    • geom_dotplot(): stack individual points into a dot plot

    • geom_freqpoly(): bin and count a continuous variable, display with lines
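
As a rough side-by-side of the continuous one-variable geoms listed above, the sketch below draws the same variable three ways. It uses the diamonds data that ships with ggplot2 rather than the lecture data, and binwidth = 500 is an arbitrary choice made only for illustration.

```r
library(ggplot2)

# Same variable, three geoms. diamonds ships with ggplot2.
# binwidth = 500 is an illustrative assumption, not a recommendation.
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)   # bars of binned counts

p_freq <- ggplot(diamonds, aes(x = price)) +
  geom_freqpoly(binwidth = 500)    # the same binned counts, drawn as a line

p_dens <- ggplot(diamonds, aes(x = price)) +
  geom_density()                   # smoothed density estimate (no bins)

p_hist
p_freq
p_dens
```

The histogram and the frequency polygon share the same binning, so they differ only in how the counts are drawn. The density curve replaces bins with a smoothing bandwidth.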

Aside

Always use “typewriter text” (monospace font) when writing function names, and follow with (), e.g.,

geom_dotplot()

What does each point represent? How are their locations determined? What do the x and y axes represent?

ggplot(tucsonHousing, aes(x = price)) +
  geom_dotplot(binwidth = 50000)

Comparing across groups

Which of the following allows for easier comparison across groups?

ggplot(tucsonHousing, aes(x = price, fill = decade_built_cat)) +
  geom_histogram(binwidth = 100000)

ggplot(tucsonHousing, aes(x = price, color = decade_built_cat)) +
  geom_freqpoly(binwidth = 100000, linewidth = 1)

Two variables - both continuous

  • geom_point(): scatterplot

  • geom_quantile(): smoothed quantile regression

  • geom_rug(): marginal rug plots

  • geom_smooth(): smoothed line of best fit

  • geom_text(): text labels
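
Several of these geoms are commonly layered on the same plot. The sketch below is one possible combination, assuming the mpg data that ships with ggplot2 in place of the lecture data; the alpha values and the choice of a linear fit are illustrative only.

```r
library(ggplot2)

# Scatterplot + marginal rugs + a straight line of best fit.
# mpg ships with ggplot2; displ and hwy are both continuous.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +                    # one point per car
  geom_rug(alpha = 0.3) +                      # tick marks along both axes
  geom_smooth(method = "lm", formula = y ~ x)  # linear fit with a confidence band

p
```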

Two variables - show density

  • geom_bin2d(): bin into rectangles and count

  • geom_density2d(): smoothed 2d density estimate

  • geom_hex(): bin into hexagons and count

geom_hex()

Not very helpful for 112 observations:

ggplot(tucsonHousing, aes(x = area, y = price)) +
  geom_hex()

geom_hex()

More helpful for 53940 observations:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex()

geom_hex() and warnings

  • Requires installing the hexbin package separately!

install.packages("hexbin")

  • Otherwise you might see

Warning: Computation failed in `stat_binhex()`
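
One way to avoid the warning in shared code is to check for hexbin before adding the layer. This is a minimal sketch, not part of the lecture: the fallback to geom_bin2d() and the helper variable binning are assumptions made for illustration.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price))

# requireNamespace() returns TRUE/FALSE without attaching the package
if (requireNamespace("hexbin", quietly = TRUE)) {
  p <- p + geom_hex()
  binning <- "hex"      # hypothetical flag recording which geom was used
} else {
  message("hexbin is not installed; falling back to rectangular bins")
  p <- p + geom_bin2d()
  binning <- "bin2d"
}

p
```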

Two variables

  • At least one discrete

    • geom_count(): count the number of points at each distinct location

    • geom_jitter(): randomly jitter overlapping points

  • One continuous, one discrete

    • geom_col(): a bar chart of pre-computed summaries

    • geom_boxplot(): boxplots

    • geom_violin(): show density of values in each group
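
The "pre-computed summaries" wording for geom_col() is easiest to see next to a geom that summarises for you. The sketch below assumes the mpg data from ggplot2 instead of the lecture data, and the name mean_hwy_class is made up for this example.

```r
library(ggplot2)
library(dplyr)

# geom_col() draws the values it is given, so summarise first
mean_hwy_class <- mpg |>
  group_by(class) |>
  summarise(mean_hwy = mean(hwy))

p_col <- ggplot(mean_hwy_class, aes(x = class, y = mean_hwy)) +
  geom_col()

# geom_boxplot() computes its own five-number summary from the raw rows
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

p_col
p_box
```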

geom_jitter()

How are the following three plots different?

ggplot(tucsonHousing, aes(x = bed, y = price)) +
  geom_point()

ggplot(tucsonHousing, aes(x = bed, y = price)) +
  geom_jitter()

ggplot(tucsonHousing, aes(x = bed, y = price)) +
  geom_jitter()

geom_jitter() and set.seed()

set.seed(1234)

ggplot(tucsonHousing, aes(x = bed, y = price)) +
  geom_jitter()

set.seed(1234)

ggplot(tucsonHousing, aes(x = bed, y = price)) +
  geom_jitter()
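
An alternative to calling set.seed() before each plot is to hand the seed to the jitter itself with position_jitter(seed = ...). This is not what the slides use. The sketch assumes the mpg data from ggplot2, and the width and height values are arbitrary.

```r
library(ggplot2)

# The seed travels with the plot object, so every render jitters identically
p <- ggplot(mpg, aes(x = cyl, y = hwy)) +
  geom_point(
    position = position_jitter(width = 0.2, height = 0, seed = 1234)
  )

# Build the layer twice and compare the jittered x positions
first_draw  <- layer_data(p)
second_draw <- layer_data(p)
same_positions <- identical(first_draw$x, second_draw$x)

p
```

Keeping the seed inside the layer leaves the global random number stream untouched, which can matter when other code in the same document also relies on random draws.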

Two variables

  • One time, one continuous
    • geom_area(): area plot
    • geom_line(): line plot
    • geom_step(): step plot
  • Display uncertainty
    • geom_crossbar(): vertical bar with center
    • geom_errorbar(): error bars
    • geom_linerange(): vertical line
    • geom_pointrange(): vertical line with center
  • Spatial
    • geom_sf(): for map data (more on this later…)
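
The uncertainty geoms listed above all need ymin and ymax aesthetics in addition to a center. Below is a minimal sketch using mean ± 1 standard deviation; the choice of interval, the mpg data, and the name hwy_by_class are assumptions for illustration rather than anything from the lecture.

```r
library(ggplot2)
library(dplyr)

# Summarise to a center and a spread per group
hwy_by_class <- mpg |>
  group_by(class) |>
  summarise(
    mean_hwy = mean(hwy),
    sd_hwy = sd(hwy)
  )

# geom_pointrange(): vertical line from ymin to ymax with a point at y
p <- ggplot(hwy_by_class, aes(x = class, y = mean_hwy)) +
  geom_pointrange(
    aes(ymin = mean_hwy - sd_hwy, ymax = mean_hwy + sd_hwy)
  )

p
```

Swapping geom_pointrange() for geom_errorbar(), geom_linerange(), or geom_crossbar() keeps the same aesthetics and changes only the shape that is drawn.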

Average price per year built

mean_price_year <- tucsonHousing |>
  group_by(year_built) |>
  summarise(
    n = n(),
    mean_price = mean(price),
    sd_price = sd(price)
    )

mean_price_year
# A tibble: 56 × 4
   year_built     n mean_price sd_price
        <dbl> <int>      <dbl>    <dbl>
 1       1936     1    330000       NA 
 2       1943     1    260000       NA 
 3       1948     1    310000       NA 
 4       1950     2    270000        0 
 5       1951     2    172450.   60174.
 6       1952     3    382833.  104205.
 7       1953     3    365133.  132328.
 8       1954     1    295000       NA 
 9       1956     3    279333.   25891.
10       1957     1    299000       NA 
# ℹ 46 more rows

geom_line()

ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_line()

geom_area()

ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_area()

geom_step()

ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_step()

Let’s clean things up a bit!

ggplot(tucsonHousing, aes(x = area, y = price)) +
  geom_point(alpha = 0.6, size = 2, color = "#012169") +
  scale_x_continuous(labels = label_number(big.mark = ",")) +
  scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    title = "Sale prices of homes in Tucson",
    subtitle = "As of July 2023",
    caption = "Source: Zillow.com"
  )
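
The label_*() functions from scales return functions, so they can be tried on a few numbers before being passed to a scale. The sketch below reuses the two labellers from the plot above; the input values and the names label_area and label_price are made up for this example.

```r
library(scales)

# Each call returns a function that turns numbers into strings
label_area  <- label_number(big.mark = ",")
label_price <- label_dollar(scale = 1/1000, suffix = "K")

area_labels  <- label_area(c(1500, 12000))
price_labels <- label_price(c(250000, 1200000))

area_labels   # "1,500" "12,000"
price_labels  # "$250K" "$1,200K"
```

Because scale = 1/1000 divides the values before formatting, the "K" suffix is what tells the reader the axis is in thousands of dollars.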

Three variables

  • geom_contour(): contours
  • geom_tile(): tile the plane with rectangles
  • geom_raster(): fast version of geom_tile() for equal sized tiles

geom_tile()

ggplot(tucsonHousing, aes(x = bed, y = bath)) +
  geom_tile(aes(fill = price))
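
When several rows share the same x and y combination, their tiles are drawn on top of one another, so the fill shows only one of them. One option, offered here as an assumption about intent rather than the lecture's method, is to summarise to one value per tile first. The sketch uses the mpg data from ggplot2, and mean_hwy_cyl_drv is a made-up name.

```r
library(ggplot2)
library(dplyr)

# One row per (cyl, drv) combination, so each tile has exactly one fill value
mean_hwy_cyl_drv <- mpg |>
  group_by(cyl, drv) |>
  summarise(mean_hwy = mean(hwy), .groups = "drop")

p <- ggplot(mean_hwy_cyl_drv, aes(x = factor(cyl), y = drv)) +
  geom_tile(aes(fill = mean_hwy))

p
```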