Visualizing geospatial data I

Lecture 16

Dr. Greg Chism

University of Arizona
INFO 526 - Fall 2024

Announcements

  • HW 04 + RQ 05 due Wednesday, November 20th (live now)
  • Project 2 proposals for peer review due Wednesday, November 13th

Note

tidyverse is a meta-package

Loading the tidyverse loads the following packages:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

So you never need to load dplyr, ggplot2, readr, etc. separately if you’ve already loaded the tidyverse package.

Setup

# load packages
if(!require(pacman))
  install.packages("pacman")

pacman::p_load(countdown,
               tidyverse,
               mapproj,
               sf,
               geofacet,
               glue)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))

# set width of code output
options(width = 65)

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7,        # 7" width
  fig.asp = 0.618,      # the golden ratio
  fig.retina = 3,       # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300             # higher dpi, sharper image
)

John Snow and 1854 cholera epidemic

This John Snow knows things

10% of the population of Soho died in a week (!!)


Miasma theory said it was because the air was bad

The Broad Street pump

Outright lies

Outright lies

Fake maps and junk maps

Fake maps and junk maps

Chloropleths can be great

Chloropleths can distort

Cartograms

Projections

Visualizing geographic areas

Without any projection, on the cartesian coordinate system

world_map <- ggplot(map_data("world"), aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "#3c3b6e", linewidth = 0.3) +
  labs(x = NULL, y = NULL)

world_map

Mercator projection

Meridians are equally spaced and vertical, parallels are horizontal lines whose spacing increases the further we move away from the equator

world_map +
  coord_map(projection = "mercator")

Mercator projection

without the weird straight lines through the earth!

world_map +
  coord_map(
    projection = "mercator", 
    xlim = c(-180, 180)
    )

Sinusoidal projection

Parallels are equally spaced

world_map +
  coord_map(projection = "sinusoidal", xlim = c(-180, 180))

Orthographic projection

Viewed from infinity

world_map +
  coord_map(projection = "orthographic")

Mollweide projection

Equal-area, hemisphere is a circle

world_map +
  coord_map(projection = "mollweide", xlim = c(-180, 180))

Visualizing distances

Draw a line between Istanbul and Los Angeles.

cities <- tribble(
  ~city,         ~long,    ~lat,
  "istanbul",    28.9784,  41.0082,
  "los angeles", -118.243, 34.0522,
)

Visualizing distances

As if the earth is flat:

Visualizing distances

Based on a spherical model of the earth:

Intermediate points on the great circle

cities <- tribble(
  ~city,         ~long,    ~lat,
  "istanbul",    28.9784,  41.0082,
  "los angeles", -118.243, 34.0522,
)
gc <- geosphere::gcIntermediate(
  p1 = cities |> filter(city == "istanbul")    |> select(-city), 
  p2 = cities |> filter(city == "los angeles") |> select(-city), 
  n = 100, 
  addStartEnd = TRUE
  ) |>
  as_tibble()

Intermediate points on the great circle

gc
# A tibble: 102 × 2
     lon   lat
   <dbl> <dbl>
 1  29.0  41.0
 2  28.4  41.9
 3  27.8  42.8
 4  27.1  43.6
 5  26.5  44.5
 6  25.8  45.3
 7  25.1  46.2
 8  24.4  47.0
 9  23.7  47.9
10  22.9  48.7
# ℹ 92 more rows

Plotting both distances

world_map +
  geom_point(
    data = cities, aes(x = long, y = lat, group = NULL),
    size = 2, color = "red"
  ) +
  geom_line(
    data = gc, aes(x = lon, y = lat, group = NULL),
    linewidth = 1, color = "red", linetype = "dashed"
  ) +
  geom_line(
    data = cities, aes(x = long, y = lat, group = NULL),
    linewidth = 1, color = "red"
  ) +
  coord_map(projection = "mercator", xlim = c(-180, 180))

Plotting both distances

Another distance between two points

How long does it take to fly from the Western most point in the US to the Eastern most point? Guess.

  • The straight-line distance between these points is approximately 4,500 miles (7,242 km)
  • Nonstop commercial flights that travel at an average speed of about 500-550 miles per hour (805-885 km/h)
  • Would take roughly 8 to 9 hours

Dateline

Geospatial data in the real world:
Freedom index

Freedom index

  • Since 1973, Freedom House has assessed the condition of political rights and civil liberties around the world.

  • It is used on a regular basis by policymakers, journalists, academics, activists, and many others.

Bias warning

“Freedom Index” from any source have potential bias and is prone to miscalculations. While the index appears to cover many social issues including freedom of religion, expression, etc. this data (like any data) should be approached with skepticism. Quantifying complex issues like these is difficult and the process can oversimplify difficult to record/measure political nuances.

Data

freedom <- read_csv("data/freedom-2022.csv", na = c("", "-")) |>
  drop_na()
freedom
# A tibble: 195 × 4
   country                pr    cl status
   <chr>               <dbl> <dbl> <chr> 
 1 Afghanistan             7     7 NF    
 2 Albania                 3     3 PF    
 3 Algeria                 6     5 NF    
 4 Andorra                 1     1 F     
 5 Angola                  6     5 NF    
 6 Antigua and Barbuda     2     2 F     
 7 Argentina               2     2 F     
 8 Armenia                 4     4 PF    
 9 Australia               1     1 F     
10 Austria                 1     1 F     
# ℹ 185 more rows
  • pr: Political rights rating
  • cl: Civil liberties rating
  • status: The average of each pair of ratings on political rights and civil liberties determines the overall status of F (Free, 1.0 - 2.5), PF (Partly Free, 3.0 - 5.0), or NF (Not Free, 5.5 - 7.0)

Improve

The following visualization shows the distribution civil liberties ratings (1 - greatest degree of freedom to 7 - smallest degree of freedom). This is, undoubtedly, not the best visualization we can make of these data. How can we improve it?

Mapping the freedom data

  • Obtain country boundaries and store as a data frame
  • Join the freedom and country boundaries data frames
  • Plot the country boundaries, and fill by freedom scores

map_data()

The map_data() function easily turns data from the maps package in to a data frame suitable for plotting with ggplot2:

world_map <- map_data("world") |> as_tibble()
world_map
# A tibble: 99,338 × 6
    long   lat group order region subregion
   <dbl> <dbl> <dbl> <int> <chr>  <chr>    
 1 -69.9  12.5     1     1 Aruba  <NA>     
 2 -69.9  12.4     1     2 Aruba  <NA>     
 3 -69.9  12.4     1     3 Aruba  <NA>     
 4 -70.0  12.5     1     4 Aruba  <NA>     
 5 -70.1  12.5     1     5 Aruba  <NA>     
 6 -70.1  12.6     1     6 Aruba  <NA>     
 7 -70.0  12.6     1     7 Aruba  <NA>     
 8 -70.0  12.6     1     8 Aruba  <NA>     
 9 -69.9  12.5     1     9 Aruba  <NA>     
10 -69.9  12.5     1    10 Aruba  <NA>     
# ℹ 99,328 more rows

Mapping the world

ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "gray") +
  coord_quickmap()

Freedom and world map

freedom |> select(country)
# A tibble: 195 × 1
   country            
   <chr>              
 1 Afghanistan        
 2 Albania            
 3 Algeria            
 4 Andorra            
 5 Angola             
 6 Antigua and Barbuda
 7 Argentina          
 8 Armenia            
 9 Australia          
10 Austria            
# ℹ 185 more rows
world_map |> select(region)
# A tibble: 99,338 × 1
   region
   <chr> 
 1 Aruba 
 2 Aruba 
 3 Aruba 
 4 Aruba 
 5 Aruba 
 6 Aruba 
 7 Aruba 
 8 Aruba 
 9 Aruba 
10 Aruba 
# ℹ 99,328 more rows

Join freedom and world map

freedom_map <- freedom |>
  left_join(world_map, by = c("country" = "region"), multiple = "all")
glimpse(freedom_map)
Rows: 82,198
Columns: 9
$ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", …
$ pr        <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
$ cl        <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
$ status    <chr> "NF", "NF", "NF", "NF", "NF", "NF", "NF", "NF…
$ long      <dbl> 74.89131, 74.84023, 74.76738, 74.73896, 74.72…
$ lat       <dbl> 37.23164, 37.22505, 37.24917, 37.28564, 37.29…
$ group     <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ order     <int> 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 2…
$ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Mapping freedom

What is missing/misleading about the following map?

ggplot(freedom_map, mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = cl)) +
  coord_map(projection = "mercator", xlim = c(-180, 180))

Missing countries

freedom |> 
  anti_join(world_map, by = c("country" = "region")) |>
  select(country) |>
  print(n = 14)
# A tibble: 14 × 1
   country                       
   <chr>                         
 1 Antigua and Barbuda           
 2 Cabo Verde                    
 3 Congo (Brazzaville)           
 4 Congo (Kinshasa)              
 5 Cote d'Ivoire                 
 6 Eswatini                      
 7 St. Kitts and Nevis           
 8 St. Lucia                     
 9 St. Vincent and the Grenadines
10 The Gambia                    
11 Trinidad and Tobago           
12 Tuvalu                        
13 United Kingdom                
14 United States                 

Data cleanup - freedom

freedom_updated <- freedom |>
  mutate(country = case_when(
    country == "Cabo Verde" ~ "Cape Verde",
    country == "Congo (Brazzaville)" ~ "Republic of Congo",
    country == "Congo (Kinshasa)" ~ "Democratic Republic of the Congo",
    country == "Cote d'Ivoire" ~ "Ivory Coast",
    country == "St. Lucia" ~ "Saint Lucia",
    country == "The Gambia" ~ "Gambia",
    country == "United Kingdom" ~ "UK",
    country == "United States" ~ "USA",
    TRUE ~ country
    )
  )

Data cleanup - world_map

world_map_updated <- world_map |>
  mutate(region = case_when(
    region == "Antigua" ~ "Antigua and Barbuda",
    region == "Barbuda" ~ "Antigua and Barbuda",
    region == "Saint Kitts" ~ "St. Kitts and Nevis",
    region == "Nevis" ~ "St. Kitts and Nevis",
    region == "Saint Vincent" ~ "St. Vincent and the Grenadines",
    region == "Grenadines" ~ "St. Vincent and the Grenadines",
    region == "Trinidad" ~ "Trinidad and Tobago",
    region == "Tobago" ~ "Trinidad and Tobago",
    region == "Swaziland" ~ "Eswatini",
    TRUE ~ region
    )
  )

Check again

freedom_updated |> 
  anti_join(world_map_updated, by = c("country" = "region")) |>
  select(country)
# A tibble: 1 × 1
  country
  <chr>  
1 Tuvalu 

Tuvalu, formerly known as the Ellice Islands, is an island country and microstate in the Polynesian subregion of Oceania in the Pacific Ocean. Its islands are situated about midway between Hawaii and Australia. Tuvalu is composed of three reef islands and six atolls.

Let’s map!

Highlights from livecoding

  • When working through non-matching unique identifiers in a join, you might need to clean the data in both data frames being merged, depending on the context

  • Two ways to surface polygons with NAs:

    • left_join() map to data, layering with map at the bottom, data on top
    • left_join() data to map, set na.value in scale_fill_*() to desired color
  • Use na.value = "red" (or some other color that will stand out) to easily spot polygons with NAs

Geofaceting

Daily US vaccine data by state

us_state_vaccinations <- read_csv(here::here("data/us_state_vaccinations.csv"))
us_state_vaccinations
# A tibble: 17,501 × 14
   date       location total_vaccinations total_distributed
   <date>     <chr>                 <dbl>             <dbl>
 1 2021-01-12 Alabama               78134            377025
 2 2021-01-13 Alabama               84040            378975
 3 2021-01-14 Alabama               92300            435350
 4 2021-01-15 Alabama              100567            444650
 5 2021-01-16 Alabama                  NA                NA
 6 2021-01-17 Alabama                  NA                NA
 7 2021-01-18 Alabama                  NA                NA
 8 2021-01-19 Alabama              130795            444650
 9 2021-01-20 Alabama              139200            483275
10 2021-01-21 Alabama              165919            493125
# ℹ 17,491 more rows
# ℹ 10 more variables: people_vaccinated <dbl>,
#   people_fully_vaccinated_per_hundred <dbl>,
#   total_vaccinations_per_hundred <dbl>,
#   people_fully_vaccinated <dbl>,
#   people_vaccinated_per_hundred <dbl>,
#   distributed_per_hundred <dbl>, …

Facet by location

ggplot(
  us_state_vaccinations,
  aes(x = date, y = people_fully_vaccinated_per_hundred)
) +
  geom_area() +
  facet_wrap(~location)

Facet by location

Warning: Removed 1802 rows containing non-finite outside the scale range
(`stat_align()`).

Data cleaning

us_state_vaccinations <- us_state_vaccinations |>
  mutate(location = if_else(location == "New York State", "New York", location)) |>
  filter(location %in% c(state.name, "District of Columbia"))

Geofacet by state

Using geofacet::facet_geo():

ggplot(us_state_vaccinations, 
       aes(x = date, y = people_fully_vaccinated_per_hundred)) +
  geom_area() +
  facet_geo(~ location) +
  labs(
    x = NULL, y = NULL,
    title = "Covid-19 vaccination rate in the US",
    subtitle = "Daily number of people fully vaccinated, per hundred",
    caption = "Source: Our World in Data"
  )

Geofacet by state

Geofacet by state, with improvements

ggplot(us_state_vaccinations, aes(x = date, y = people_fully_vaccinated_per_hundred, group = location)) +
  geom_area() +
  facet_geo(~location) +
  scale_y_continuous(
    limits = c(0, 100),
    breaks = c(0, 50, 100),
    minor_breaks = c(25, 75)
    ) +
  scale_x_date(breaks = c(ymd("2021-01-01", "2021-05-01", "2021-09-01")), date_labels = "%b") +
  labs(
    x = NULL, y = NULL,
    title = "Covid-19 vaccination rate in the US",
    subtitle = "Daily number of people fully vaccinated, per hundred",
    caption = "Source: Our World in Data"
  ) +
  theme(
    strip.text.x = element_text(size = 7),
    axis.text = element_text(size = 8),
    plot.title.position = "plot"
  )

Geofacet by state, with improvements

Bring in 2020 Presidential election results

election_2020 <- read_csv(here::here("data/us-election-2020.csv"))
election_2020
# A tibble: 51 × 5
   state                electoal_votes biden trump win       
   <chr>                         <dbl> <dbl> <dbl> <chr>     
 1 Alabama                           9     0     9 Republican
 2 Alaska                            3     0     3 Republican
 3 Arizona                          11    11     0 Democrat  
 4 Arkansas                          6     0     6 Republican
 5 California                       55    55     0 Democrat  
 6 Colorado                          9     9     0 Democrat  
 7 Connecticut                       7     7     0 Democrat  
 8 Delaware                          3     3     0 Democrat  
 9 District of Columbia              3     3     0 Democrat  
10 Florida                          29     0    29 Republican
# ℹ 41 more rows

Geofacet by state + presidential election result

us_state_vaccinations |>
  left_join(election_2020, by = c("location" = "state")) |>
  ggplot(aes(x = date, y = people_fully_vaccinated_per_hundred)) +
  geom_area(aes(fill = win)) +
  facet_geo(~location) +
  scale_y_continuous(limits = c(0, 100), breaks = c(0, 50, 100), minor_breaks = c(25, 75)) +
  scale_x_date(breaks = c(ymd("2021-01-01", "2021-05-01", "2021-09-01")), date_labels = "%b") +
  scale_fill_manual(values = c("#2D69A1", "#BD3028")) +
  labs(
    x = NULL, y = NULL,
    title = "Covid-19 vaccination rate in the US",
    subtitle = "Daily number of people fully vaccinated, per hundred",
    caption = "Source: Our World in Data",
    fill = "2020 Presidential\nElection"
  ) +
  theme(
    strip.text.x = element_text(size = 7),
    axis.text = element_text(size = 8),
    plot.title.position = "plot",
    legend.position = c(0.93, 0.15),
    legend.text = element_text(size = 9), 
    legend.title = element_text(size = 11), 
    legend.background = element_rect(color = "gray", size = 0.5) 
  )

Geofacet by state + presidential election result