Networks

Lecture 20

Dr. Greg Chism

University of Arizona
INFO 526 - Spring 2024

Warm up

Setup

# load packages
library(tidyverse)
library(tidytext)
library(ggtext)
library(glue)
library(ggwordcloud)
library(ggraph)
library(igraph)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))

# set width of code output
options(width = 65)

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7, # 7" width
  fig.asp = 0.618, # the golden ratio
  fig.retina = 3, # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300 # higher dpi, sharper image
)

Text data

Do you recognize the following text?


Whenever I get gloomy with the state of the world, I think about the arrivals gate at Heathrow airport. General opinion started to make out that we live in a world of hatred and greed, but I don’t see that. Seems to me that love is everywhere. Often it’s not particularly dignified or newsworthy but it’s always there. Fathers and sons, mothers and daughters, husbands and wives, boyfriends, girlfriends, old friends. When the planes hit the Twin Towers, as far as I know, none of the phone calls from people on board were messages of hate or revenge, they were all messages of love. If you look for it, I’ve got a sneaky feeling, you’ll find that love actually is all around.

Text as data

Text can be represented as data in a variety of ways:

  • String: Character vector

  • Corpus: Raw strings annotated with additional metadata and details

  • Document-term matrix: Sparse matrix describing a collection (i.e., a corpus) of documents with one row for each document and one column for each term, with word counts (or another measure of how common the word is in that text) as values
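As a sketch of that last representation, here is a minimal, hypothetical example (a toy two-document corpus, not the lecture data) that lays out word counts in document-term-matrix shape. tidytext's `cast_dtm()` builds a true sparse matrix; `pivot_wider()` below shows the same layout as a plain tibble:

```r
library(tidyverse)
library(tidytext)

# toy corpus: two "documents" (illustrative, not the lecture data)
docs <- tibble(
  doc  = c("d1", "d2"),
  text = c("love is all around", "love actually is all around us")
)

# one row per document-term pair, then widen so each document is a
# row and each term is a column, with counts as values (0 = absent)
docs |>
  unnest_tokens(word, text) |>
  count(doc, word) |>
  pivot_wider(names_from = word, values_from = n, values_fill = 0)
```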

Tidy text

  • Each row is a token
    • A token can be a word, bigram (two words), ngram (n words), sentence, paragraph, etc.
  • Each column is a variable
  • Each type of observational unit is a table
love_actually |>
  slice_head(n = 6)
# A tibble: 6 × 4
  scene  line speaker dialogue                                   
  <dbl> <dbl> <chr>   <chr>                                      
1     1     1 (Man)   'Whenever I get gloomy with the state of t…
2     2     2 Billy   ♪ I feel it in my fingers ♪ I feel it in m…
3     2     3 Joe     I'm afraid you did it again, Bill.         
4     2     4 Billy   It's just I know the old version so well, …
5     2     5 Joe     Well, we all do. That's why we're making t…
6     2     6 Billy   Right, OK, let's go. ♪ I feel it in my fin…

Tokenize into words

With tidytext::unnest_tokens():

love_actually |>
  unnest_tokens(
    output = word,    # first argument is output
    input = dialogue, # second argument is input
    token = "words"   # third argument is token, with default "words"
    )
# A tibble: 9,899 × 4
   scene  line speaker word    
   <dbl> <dbl> <chr>   <chr>   
 1     1     1 (Man)   whenever
 2     1     1 (Man)   i       
 3     1     1 (Man)   get     
 4     1     1 (Man)   gloomy  
 5     1     1 (Man)   with    
 6     1     1 (Man)   the     
 7     1     1 (Man)   state   
 8     1     1 (Man)   of      
 9     1     1 (Man)   the     
10     1     1 (Man)   world   
# ℹ 9,889 more rows
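The `token` argument also handles the other token types listed earlier. A minimal sketch of bigram tokenization on a single toy line of dialogue (a stand-in for the `love_actually` data):

```r
library(tidyverse)
library(tidytext)

# one toy line of dialogue, standing in for the lecture data
lines <- tibble(line = 1, dialogue = "love actually is all around")

# overlapping two-word tokens (bigrams)
lines |>
  unnest_tokens(bigram, dialogue, token = "ngrams", n = 2)
# 4 rows: "love actually", "actually is", "is all", "all around"
```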

Most common words

Why do these words appear so commonly in Love Actually?

love_actually |>
  unnest_tokens(word, dialogue) |>
  count(word, sort = TRUE)
# A tibble: 1,770 × 2
   word      n
   <chr> <int>
 1 you     334
 2 i       300
 3 the     263
 4 a       201
 5 and     199
 6 to      199
 7 it      150
 8 is      124
 9 of      112
10 no      111
# ℹ 1,760 more rows

Stop words

  • In computing, stop words are words which are filtered out before or after processing of natural language data (text)

  • They usually refer to the most common words in a language, but there is no single list of stop words used by all natural language processing tools

English

get_stopwords(language = "en")
# A tibble: 175 × 2
   word      lexicon 
   <chr>     <chr>   
 1 i         snowball
 2 me        snowball
 3 my        snowball
 4 myself    snowball
 5 we        snowball
 6 our       snowball
 7 ours      snowball
 8 ourselves snowball
 9 you       snowball
10 your      snowball
# ℹ 165 more rows

Spanish

get_stopwords(language = "es")
# A tibble: 308 × 2
   word  lexicon 
   <chr> <chr>   
 1 de    snowball
 2 la    snowball
 3 que   snowball
 4 el    snowball
 5 en    snowball
 6 y     snowball
 7 a     snowball
 8 los   snowball
 9 del   snowball
10 se    snowball
# ℹ 298 more rows

Most common words

love_actually %>%
  unnest_tokens(word, dialogue) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
Joining with `by = join_by(word)`
# A tibble: 1,355 × 2
   word           n
   <chr>      <int>
 1 christmas     49
 2 yeah          48
 3 er            45
 4 love          40
 5 erm           39
 6 sir           28
 7 portuguese    25
 8 god           23
 9 bye           21
10 time          20
# ℹ 1,345 more rows

Portuguese?!

love_actually %>%
  filter(str_detect(dialogue, "[Pp]ortuguese"))
# A tibble: 25 × 4
   scene  line speaker dialogue                                  
   <dbl> <dbl> <chr>   <chr>                                     
 1    32   360 woman   Unfortunately, she cannot speak French, j…
 2    33   367 Jamie   (Pidgin Portuguese) Bello. Er, bella. Er,…
 3    38   421 Aurelia (Portuguese) Thank you very much but no. …
 4    38   423 Aurelia (Portuguese)Just don't go eating it all y…
 5    39   426 Aurelia (Portuguese) Nao! Eu peco imensa desculpa…
 6    39   430 Aurelia (Portuguese) Fuck - it's cold!            
 7    39   432 Aurelia (Portuguese) This stuff better be good.   
 8    39   434 Aurelia (Portuguese) I don't want to drown saving…
 9    39   436 Aurelia (Portuguese) What kind of an idiot doesn'…
10    39   438 Aurelia (Portuguese) Try not to disturb the eels. 
# ℹ 15 more rows

Data cleaning

  • Remove language identifiers
love_actually <- love_actually %>%
  # note: in a regex, ( ) form a capture group, so this pattern
  # matches the bare word "Portuguese"; to match the literal
  # parentheses too, use "\\(Portuguese\\)" or fixed("(Portuguese)")
  mutate(dialogue = str_remove(dialogue, "(Portuguese)"))
  • Take another look
love_actually %>%
  filter(str_detect(dialogue, "[Pp]ortuguese"))
# A tibble: 0 × 4
# ℹ 4 variables: scene <dbl>, line <dbl>, speaker <chr>,
#   dialogue <chr>

Most common words

without “Portuguese”

love_actually %>%
  unnest_tokens(word, dialogue) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
Joining with `by = join_by(word)`
# A tibble: 1,354 × 2
   word          n
   <chr>     <int>
 1 christmas    49
 2 yeah         48
 3 er           45
 4 love         40
 5 erm          39
 6 sir          28
 7 god          23
 8 bye          21
 9 time         20
10 ah           19
# ℹ 1,344 more rows

Visualizing the most common words (top 10)

Visualizing the most common words (freq)

Visualizing the most common words (color)

Visualizing the most common words

Use ggtext::element_textbox_simple() to add color to the title
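A hedged sketch of what such a plot could look like. The counts are copied from the slide output above; the bar color and title styling are illustrative choices, not the lecture's exact code:

```r
library(tidyverse)
library(ggtext)

# top words and counts, typed in from the output shown earlier
top_words <- tibble(
  word = c("christmas", "yeah", "er", "love", "erm"),
  n    = c(49, 48, 45, 40, 39)
)

ggplot(top_words, aes(x = n, y = fct_reorder(word, n))) +
  geom_col(fill = "firebrick") +
  labs(
    # the HTML span is rendered by element_textbox_simple()
    title = "Most common words in <span style='color:firebrick;'>Love Actually</span>",
    x = "Frequency", y = NULL
  ) +
  theme(plot.title = element_textbox_simple())
```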