Create a slide deck in presentation.qmd, replacing the placeholder text.
Add your write-up to index.qmd, replacing the placeholder text.
In both outputs, hide the code with echo: false prior to turning them in.
Due: Wednesday, October 16th, 2pm.
Warnings and messages
You should suppress package loading and data loading messages with message: false as a chunk option
You should also suppress warnings with warning: falseafter making sure you’re ok with them, or update your code to eliminate warnings
Setup
# load packagesif(!require(pacman))install.packages("pacman")pacman::p_load(countdown, tidyverse, glue, scales, ggthemes, gt)# set theme for ggplot2ggplot2::theme_set(ggplot2::theme_minimal(base_size =12))# set width of code outputoptions(width =65)# set figure parameters for knitrknitr::opts_chunk$set(fig.width =7,fig.asp =0.618,fig.retina =3,fig.align ="center",dpi =300)
Densities
Distributions and Motivating Example
There are many properties of a distribution of values
Center: Mean, Median, Modes
Spread: Variance, Range (Support), Interquartile range
Shape: Skewness, Kurtosis, Quantiles
Any statistic you can think of
Ultimately when analyzing data, the distribution is important to know how to proceed:
Parametric tests
Erratic Data
Outliers
Baseball! A home run in baseball occurs when a player hits a fair ball outside of the playing field.
Baseball is a game with a long rich history, but home runs have always been an integral part of it. By examining the distribution of home runs year-by-year we may be able to see the effect of various rule changes or events.
library(Lahman)home_runs <- Batting |>filter( G >=100|# We want to see how many home runs players that played most of the season hit (G >=40& yearID ==2020) |# COVID-shortened season (G >=70& yearID ==1994), # Strike-shortened season yearID >1920, # Beginning of live-ball era lgID %in%c("AL", "NL") ) # Most common leagues
Our dataset comes from the R package Lahman. Each row in the data frame is the hitting stats of a player for a given year. Today we will use the following columns:
Variable
Description
yearID
The year for the statistics
HR
The number of home runs a player hit in a given year
G
Number of games played; there are 162 games in a baseball season (154 before 1961)
In particular we are interested in the distribution of home runs per year!
Important notes from last time
Although density graphs are very useful and can display lots of information, they can be sensitive to bandwidth.
It was not clear how to properly determine if two distributions were significantly different.
Cumulative Distribution Functions (CDF)
For a random variable \(X\), the CDF describes the probability that \(X\) is below a certain value:
Between 0 and 1 (like all probabilities)
Non-decreasing
Derivative is the PDF, i.e. the larger the PDF the faster the CDF is increasing.
# Closest 4 yearsHRearly2010s<-home_runs|>filter(yearID%in%2011:2014)# Years up to COVID seasonHRlate2010s<-home_runs|>filter(yearID%in%2016:2019)ks.test(HRearly2010s$HR, HRlate2010s$HR)
Asymptotic two-sample Kolmogorov-Smirnov test
data: HRearly2010s$HR and HRlate2010s$HR
D = 0.16691, p-value = 6.463e-12
alternative hypothesis: two-sided
get_ks_df <-function(dat1, dat2) {# Make ECDF of each set of data ecdf1 <-ecdf(dat1) ecdf2 <-ecdf(dat2)# Calculate the absolute difference between the 2 ECDFs on the support grid_points <-seq(0, max(c(dat1, dat2)), length.out=1000) differences <-abs(ecdf1(grid_points) -ecdf2(grid_points))# Get the KS statistic and where it occurs ks_stat <-max(differences) first_max_location <- grid_points[which.max(differences)]# Return tibble to help with plottingtibble(x = first_max_location,xend = first_max_location,y =ecdf1(first_max_location),yend =ecdf2(first_max_location) )}ks_stat_2010s <-get_ks_df(HRearly2010s$HR, HRlate2010s$HR)ggplot(rbind(HRearly2010s, HRlate2010s), aes(HR, color =factor(yearID <2015))) +stat_ecdf(geom ="step") +geom_segment(data = ks_stat_2010s,aes(x = x,y = y,xend = xend,yend = yend ),color ="black",linetype ="dashed" ) +labs(x ="Homeruns per player per year",y ="Empirical CDF",title ="Empirical CDFs of player home runs per year in years 2011-2019 ",subtitle ="Dashed line is the Kolmogorov-Smirnov Statistic" ) +theme(legend.position ="bottom")
Major League Baseball was accused of replacing the standard baseballs with “juiced” baseballs (easier to hit home runs) secretly in the middle of 2015. Is there credence to this claim?
Asymptotic two-sample Kolmogorov-Smirnov test
data: HR2005$HR and HR2006$HR
D = 0.073827, p-value = 0.503
alternative hypothesis: two-sided
ks_stat_0506 <-get_ks_df(HR2005$HR, HR2006$HR)ggplot(rbind(HR2005, HR2006), aes(HR, color =factor(yearID))) +stat_ecdf(geom ="step") +geom_segment(data = ks_stat_0506,aes(x = x,y = y,xend = xend,yend = yend ),color ="black",linetype ="dashed" ) +labs(x ="Homeruns per player per year",y ="Empirical CDF",title ="Empirical CDFs of player home runs per year in years 2005 and 2006 ",subtitle ="Dashed line is the Kolmogorov-Smirnov Statistic",color ="Year" ) +theme(legend.position ="bottom")
2005 and 2006 are similar years in terms of home runs, so the Kolmogorov-Smirnov test does not reject.
Using the Kolmogorov-Smirnov statistic for visualization
Our earlier motivation was to compare the distribution of homeruns over time to see if rule changes made a difference.
We can’t use a ridge or violin plot with all 100 years; it would be too cramped.
What if we used the KS statistic to compare all pairs of years and emulated a correlation matrix?
Building the Matrix
ks_matrix <-tribble(~year1, ~year2, ~ks_stat, ~p_value)all_years <-unique(home_runs$yearID)# Save some memoryhome_runs_to_search <- home_runs |>select(yearID, HR)options(warn =-1) # Turn off ks.test warningfor (year1 in all_years) { year1HR <- home_runs_to_search |>filter(yearID == year1)for (year2 inmin(all_years):year1) { # Only do half since the test is symmetricif (year1 == year2) {next } year2HR <- home_runs_to_search |>filter(yearID == year2) test <-ks.test( year1HR$HR, year2HR$HR ) ks_matrix <- ks_matrix |>add_row(year1 = year1,year2 = year2,ks_stat = test$statistic,p_value = test$p.value ) }}ks_matrix <-bind_rows( ks_matrix, ks_matrix |>mutate(tmp_year1 = year1,year1 = year2,year2 = tmp_year1 ) |>select(-tmp_year1))options(warn =0)
When performing multiple hypothesis tests, we typically want to control the Family Wise Error Rate: the probability we make at least 1 Type I error (a false rejection).
Newer methods focus more on controlling the False Discovery Rate: the probability that any particular rejected null hypothesis is actually a false positive.
Both require adjusting p-values which is built in to R (?p.adjust).
holm, hochberg and hommel control for Family Wise Error Rate
bonferroni also controls for this, but is very conservative
fdr and BY methods control for False Discovery Rate
# Let's try all the adjustments and see how they change our visualizationhalf_matrix <- ks_matrix |>filter(year1 < year2) |>mutate(p_holm =p.adjust(p_value, "holm"),p_hochberg =p.adjust(p_value, "hochberg"),p_hommel =p.adjust(p_value, "hommel"),p_bonferroni =p.adjust(p_value, "bonferroni"),p_fdr =p.adjust(p_value, "fdr"),p_BY =p.adjust(p_value, "BY") )other_half <- half_matrix |>mutate(tmp_year1 = year1,year1 = year2,year2 = tmp_year1 ) |>select(-tmp_year1)ks_matrix <-bind_rows(half_matrix, other_half)
Controlling for the Family Wise Error Rate got rid of a lot of our interesting patterns. We don’t mind some false positives so we choose the BY adjustment as it controls False Discovery Rate but is a bit more conservative than FDR.
description <-"During World War II, many baseball players fought overseas,making for an atypical number of home runs during these years."|>str_wrap(width=40)mat_BY +annotate("rect",xmin =c(1920, 1941),ymin =c(1941, 1920),xmax =c(2022, 1945),ymax =c(1945, 2022),color ="#0000FFaa",alpha =0 ) +annotate("label",x =1990,y =1990,label = description,alpha =0.9,size =3 )
description <-"In 1994 less homeruns were hit due to the strike-shorted season. Similarly, COVID in 2020 called for a shorter season."|>str_wrap(width=40)mat_BY +annotate("rect",xmin =c(1920, 2019, 1920, 1993),ymin =c(2019, 1920, 1993, 1920),xmax =c(2022, 2021, 2022, 1995),ymax =c(2021, 2022, 1995, 2022),color ="#0000FFaa",alpha =0 ) +annotate("label",x =1950,y =1950,label = description,alpha =0.9,size =3 )
description <-"1968 is known as the Year of the Pitcher, when pitchers dominated the league causing less homeruns. The next year, the pitcher's mound was made smaller to give \npitchers a smaller advantage."|>str_wrap(width =30)mat_BY +annotate("rect",xmin =c(1920, 1967),ymin =c(1967, 1920),xmax =c(2022, 1969),ymax =c(1969, 2022),color ="#0000FFaa",alpha =0 ) +annotate("label",x =2000,y =1950,label = description,alpha =0.9,size =3 )
Summary
The Empirical CDF is a very useful tool in statistics, for both analysis and visualization.
The Kolmogorov-Smirnov Test is a good way to test if two distributions of values are different.
However, be careful as it is only an approximate test with discrete values.
Furthermore, be sure to use multiple testing corrections if testing many different distributions.
Baseball is interesting! (you may disagree on this one)