Visualization Case Study

In this chapter, we use a few case studies to demonstrate how data visualization techniques help us extract insights from data. The following R packages are needed for running the examples in this chapter.

library(tidyverse)
library(ggthemes)
library(grid)
library(gridExtra)
#library(Lahman)
#library(MASS) # conflict with dplyr package for select(), use dplyr::select
library(RColorBrewer)
library(ggridges)
library(ggrepel)
library(kableExtra)

1 Case Study 1: Simpson’s Paradox in Baseball

We analyze a data set containing the offense and defense statistics for Major League Baseball teams in the early years, specifically from 1871 through 1900. The data set includes baseball teams from several “major” leagues, such as the American Association, the Union Association, the Players League, the Federal League, and the National Association (of 1871-1875). This data set was created by Sean Lahman, who pioneered the effort to make baseball statistics freely available to the general public. The complete data set, including the most recent seasons, is available in the R package Lahman.

We first load the data and print the first few rows.

Teams = read.csv("data/baseballgame_bf1900.csv")
head(Teams)
##   yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1   1871 <NA>    BS1      BNA    NA    3 31    NA 20 10     NA    NA     N
## 2   1871 <NA>    CH1      CNA    NA    2 28    NA 19  9     NA    NA     N
## 3   1871 <NA>    CL1      CFC    NA    8 29    NA 10 19     NA    NA     N
## 4   1871 <NA>    FW1      KEK    NA    7 19    NA  7 12     NA    NA     N
## 5   1871 <NA>    NY2      NNA    NA    5 33    NA 16 17     NA    NA     N
## 6   1871 <NA>    PH1      PNA    NA    1 28    NA 21  7     NA    NA     Y
##   WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1  <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109 3.55 22   1  3
## 2  <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77 2.76 25   0  1
## 3  <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116 4.11 23   0  0
## 4  <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97 5.17 19   1  0
## 5  <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121 3.72 32   1  0
## 6  <NA> 376 1281 410  66  27  9 46 23 56 12  NA NA 266 137 4.95 27   0  0
##   IPouts  HA HRA BBA SOA   E DP    FP                    name
## 1    828 367   2  42  23 243 24 0.834    Boston Red Stockings
## 2    753 308   6  28  22 229 16 0.829 Chicago White Stockings
## 3    762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
## 4    507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
## 5    879 373   7  42  22 235 14 0.840        New York Mutuals
## 6    747 329   3  53  16 194 13 0.845  Philadelphia Athletics
##                           park attendance BPF PPF teamIDBR teamIDlahman45
## 1          South End Grounds I         NA 103  98      BOS            BS1
## 2      Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3 National Association Grounds         NA  96 100      CLE            CL1
## 4               Hamilton Field         NA 101 107      KEK            FW1
## 5     Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6     Jefferson Street Grounds         NA 102  98      ATH            PH1
##   teamIDretro
## 1         BS1
## 2         CH1
## 3         CL1
## 4         FW1
## 5         NY2
## 6         PH1

Each row in the data set represents the statistics of one team in a particular season. The statistics include the number of runs scored by the team in that year/season (R), the number of home runs (HR), the number of games (G), losses (L), wins (W), and many other variables. Suppose we are interested in the relationship between the number of runs (R) and the number of losses (L). We can easily generate a scatter plot to show this relationship.

Teams %>% 
  ggplot(aes(R, L)) +  
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm")
## 
## `geom_smooth()` using formula = 'y ~ x'

Usually, more runs indicate a stronger offense and hence should lead to fewer losses. However, when we visualize the data, we actually see a positive correlation, which is puzzling. Fitting a regression line to the data confirms this observation.

To understand this puzzle, recall that our data cover each team's statistics in the early years. Back in the 1870s and 1880s, the baseball league was in its infancy, and it experienced many ups and downs as it developed to its current scale. As a result, the seasons differ substantially from one another in almost every aspect, so pooling the data across all years tells a very different story.

For example, the numbers of runs and losses depend on how many games each team played during that season. Since the league was growing rapidly, the number of games per season varies considerably: some seasons had many more games than others.

Therefore, we could reexamine the relationship between the number of runs and the number of losses conditional on the number of games each team plays in that season. This is sometimes called stratification. Essentially, we divide the observations into groups according to the value of a third variable (the number of games in this case), so that the observations in each group have similar or identical values of that variable. Within each group, we investigate the two variables of interest, i.e., the numbers of runs and losses. Our findings are then conditional on the third variable by which we stratified the data.

Since we want to stratify by the number of games, which is a quantitative variable, we need to cut it into several bins/intervals. We first check the range of the number of games as follows.

range(Teams$G)
## [1]   6 158

It seems an interval of length 20 will give us about 8 strata. We use the function cut() to convert the quantitative variable into a categorical variable with values (0, 20], … , (140, 160]. To visualize the strata, we use the technique called faceting, which plots the relationship of interest in small multiples of panels corresponding to the strata. Here, we plot the scatter plot between the numbers of runs and losses within each stratum.
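
Before faceting, it may help to see exactly what cut() does. Below is a standalone sketch using a few hypothetical game counts (illustrative values, not taken from the Teams data):

```r
# A standalone sketch of how cut() maps game counts into strata.
# The game counts below are hypothetical values for illustration only.
g <- c(6, 25, 70, 140, 158)
G_strata <- cut(g, seq(0, 180, 20))
as.character(G_strata)
# "(0,20]" "(20,40]" "(60,80]" "(120,140]" "(140,160]"
levels(G_strata)   # all 9 half-open intervals, even those with no observations
```

By default the intervals are right-closed, so a team with exactly 140 games falls in (120, 140].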

Teams %>%
  mutate(G_strata = cut(G, seq(0,180,20))) %>% 
  ggplot(aes(R, L)) +  
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap( ~ G_strata)

We immediately see that the relationship becomes negative within each stratum. If we overlay the regression lines fitted within each stratum on the regression line fitted to the whole data set, the contrast is clear.

Teams %>% 
  mutate(G_strata = cut(G, seq(0,180,20))) %>%
  ggplot(aes(R, L, color=G_strata)) +  
  geom_point(alpha = 0.5) + 
  geom_smooth(aes(group=G_strata, color=G_strata),
              method = "lm")+
  scale_color_viridis_d()+
  geom_smooth(color = "blue", method = "lm")

Clearly, the relationship between the numbers of runs and losses within each stratum is the opposite of the relationship over the entire data set. This is a typical case of the so-called Simpson’s paradox. It is called a paradox because the sign of the correlation flips when we compare the entire population with specific groups (or strata). It usually happens when a confounding variable affects both variables of interest. Since the confounder’s variation dominates, the overall pattern between the two variables of interest, when marginalized over the confounder, shows an unexpected association.
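
The sign flip can also be verified numerically. Below is a minimal synthetic sketch (the data are simulated, not the Teams data): a confounder drives both variables upward, while the conditional relationship within each stratum is negative.

```r
set.seed(1)
games  <- rep(c(30, 80, 130), each = 50)   # confounder: games per season
eps    <- rnorm(150, sd = 15)
runs   <- 3 * games + eps                  # more games -> more runs
losses <- 0.5 * games - 0.2 * eps +        # more games -> more losses, but
          rnorm(150, sd = 2)               # given games, more runs -> fewer losses
cor(runs, losses)                          # pooled correlation: positive
sapply(split(data.frame(runs, losses), games),
       function(d) cor(d$runs, d$losses))  # within each stratum: negative
```

Pooling over the strata, the shared dependence on games dominates and the correlation turns positive, exactly as in the baseball data.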

In this case, the number of games in each season is the confounder. The more games a team plays, the more runs it scores and the more losses it accumulates. Once the analysis is stratified by the confounder, the true relationship appears.

We can further investigate this issue by plotting the number of games, the number of runs, and the number of losses against the year, as well as the numbers of runs and losses against the number of games. We can immediately see that, as the sport became more popular over the years, more games were played in each season, with a few exceptions around 1884-1885. Therefore, the increase in the number of games over the years is the main reason for the observed positive relationship between the numbers of runs and losses.

plot1 <- Teams %>% 
  ggplot(aes(yearID, G)) + geom_jitter(alpha = 0.5)
plot2 <- Teams %>% 
  ggplot(aes(yearID, R)) + geom_jitter(alpha = 0.5)
plot3 <- Teams %>%
  ggplot(aes(yearID, L)) + geom_jitter(alpha = 0.5)
plot4 <- Teams %>%
  ggplot(aes(G, R)) + geom_jitter(alpha = 0.5)
plot5 <- Teams %>%
  ggplot(aes(G, L)) + geom_jitter(alpha = 0.5)
grid.arrange(plot1, plot2, plot3, plot4, plot5, ncol=3)

This example shows how faceting in data visualization can help us identify, explain, and investigate Simpson’s paradox in a data set. One important tool for dealing with Simpson’s paradox is conditioning, or stratification. Conditioning/stratification is achieved in data visualization through faceting, where small multiples are generated for comparison.

2 Case Study 2: London’s Wind Speed and Direction

Next, we analyze a wind data set containing hourly measurements of the wind speed (ws) and direction (wd) in London in 1999. The data is available in the R package openair. We first load the data and print the first few rows.

wind = read.csv(file="data/LondonWind1999.csv", header = TRUE)
head(wind)
##                  date   ws  wd nox no2 o3 pm10      so2       co pm25
## 1 1999-01-01 00:00:00 5.04 140  88  35  4   21 3.835000 1.025000   18
## 2 1999-01-01 01:00:00 4.08 160 132  41  3   17 5.243333 2.700000   11
## 3 1999-01-01 02:00:00 4.80 160 168  40  4   17 6.513333 2.866667    8
## 4 1999-01-01 03:00:00 4.92 150  85  36  3   15 4.180000 1.625000   10
## 5 1999-01-01 04:00:00 4.68 150  93  37  3   16 4.250000 1.025000   11
## 6 1999-01-01 06:00:00 3.84 170 102  32  4   16 4.005000 0.800000    9

As we can see, each row represents one observation of the wind speed and direction at a particular hour of a particular day in 1999. We mostly focus on the wind speed (ws) and direction (wd) and ignore the other variables for now. The possible wind directions, listed below, are equally spaced, 10 degrees apart from each other.

sort(unique(wind$wd))
##  [1]   0  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150 160 170 180
## [20] 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360

For this data set, we start with the scatter plot of direction versus speed in the upper left panel of the figure below. It lets us quickly visualize the data, but it does not reveal many insights because of the overlapping observations. We address this issue by jittering the observations, and we immediately see that the wind direction is mostly concentrated in the 180-270 degree range. Meanwhile, the wind speed is also higher in those directions.

g1=ggplot(wind)+geom_point(aes(x=wd, y=ws),alpha=0.5, size=0.8)
g2=ggplot(wind)+geom_point(aes(x=factor(wd), y=ws),
                           position = position_jitter(0.3),
                           size = 0.8, alpha=0.2)+
  scale_x_discrete(breaks=seq(0,360,30))
grid.arrange(g1,g2,ncol=2)

Even though the scatter plot faithfully presents the data, it is hard to extract insights from it. To focus on the distribution of the wind speed in each direction, we can use side-by-side boxplots. In the left panel of the figure below, we can see that the wind speed distribution gradually shifts toward larger values as the wind direction moves toward 220 degrees. One improvement we can make to this visualization is to exploit the circular nature of wind direction: we can arrange the boxplots in a polar coordinate system instead of the Cartesian coordinate system.

g1=ggplot(wind)+geom_boxplot(aes(x=factor(wd), y=ws),outlier.size = 0.5)+
  scale_x_discrete(breaks=seq(0,360,30))
wind$wd[wind$wd==360]=0
g2=ggplot(wind)+geom_boxplot(aes(x=factor(wd), y=ws),outlier.size = 0.5)+coord_polar(start = -pi/32)
grid.arrange(g1,g2,ncol=2)

In the right panel, the boxplots are aligned in a circle. From the medians and the first and third quartiles, we can clearly see that the wind is much faster in the southwest directions, and easterly winds are also relatively fast. These insights are harder to extract from the Cartesian coordinate system, which illustrates the advantage of the polar coordinate system.

Using boxplots in the polar coordinate system is certainly useful; however, there is a shortcoming: we can only visualize the distribution of wind speed in each direction, but cannot account for the frequency of each direction. From the jittered scatter plot, we know that not only is the wind faster in the southwest direction, but there are also more observations of southwest wind than of any other direction. Such information cannot be shown in the figure above.
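
The missing frequency information is easy to quantify with a simple table. Below is a sketch on simulated directions (the real analysis would tabulate wind$wd; the sampling weights here are invented to mimic the prevailing southwesterly winds):

```r
set.seed(2)
# Simulated wind directions, oversampling the 180-260 degree band
# to mimic the prevailing southwesterly winds in the real data.
wd_sim <- sample(seq(0, 350, by = 10), size = 500, replace = TRUE,
                 prob = c(rep(1, 18), rep(3, 9), rep(1, 9)))
counts <- table(wd_sim)
# Share of observations coming from the 180-260 degree band:
sum(counts[as.numeric(names(counts)) %in% seq(180, 260, 10)]) / sum(counts)
```

A table like this reveals the direction frequencies that the boxplots hide, which motivates the bar-plot-based visualization below.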

Alternatively, to visualize the frequency of the wind direction as well as the distribution of the wind speed in each direction, we could use a bar plot. We first bin the wind speed into intervals. Then we use a stacked bar plot to show the frequencies of the wind speed intervals in each direction, as shown in the left panel of the figure below. Lastly, we convert the stacked bar plot to the polar coordinate system, shown in the right panel.

wind$binned_speed = cut(wind$ws, c(seq(0,14,2),Inf), include.lowest = TRUE)
wind$direction = 
  cut(wind$wd, 
      c(-1,11.25,33.75,56.25,78.75,101.25,123.75,146.25,168.75,191.25,213.75,236.25,258.75,281.25,303.75,326.25,348.75,361),
      labels = c("N","NNE","NE","ENE","E","ESE","SE","SSE","S","SSW","SW","WSW","W","WNW","NW","NNW","N"))
g3=ggplot(wind, aes(x=direction, fill = binned_speed)) +
  geom_bar(position = position_stack(reverse = TRUE)) +
  scale_fill_viridis_d()+
  theme(legend.position = "top")
g4=ggplot(wind, aes(x=direction, fill = binned_speed)) +
  geom_bar(width=1, position = position_stack(reverse = TRUE)) +
  scale_fill_viridis_d()+
  coord_polar(start = -pi/16)
grid.arrange(g3,g4,ncol=2)

In this visualization, we can see that the west, southwest, and south bars account for a large share of the observations, meaning the wind in London mostly comes from these directions. In addition, the wind blows faster in these directions, as the bars contain more yellow segments. This visualization is called a wind rose plot. It is powerful because it visualizes the wind speed and direction at the same time. Because of the polar coordinate system, the visualization is intuitive and easy to digest. There are also many extensions of this visualization, for example, in the choice of the color scheme.
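
The share of each direction visible in the wind rose can also be checked numerically with prop.table(). A small sketch on simulated compass labels (the labels and weights are invented for illustration; the real data would use wind$direction):

```r
set.seed(3)
# Simulated compass labels for illustration only.
direction_sim <- sample(c("N", "E", "S", "SW", "W"), size = 200, replace = TRUE,
                        prob = c(0.1, 0.1, 0.2, 0.4, 0.2))
# Proportion of observations per direction, largest first.
shares <- sort(prop.table(table(direction_sim)), decreasing = TRUE)
round(shares, 2)
```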

3 Case Study 3: Basketball Shots

This case study is adapted from Charlotte Wickham’s example, “Where the Heat and the Thunder Hit Their Shots.”

shots = read.csv("data/shots_sum.csv", header = TRUE)
head(shots)
##   x  y num_shots num_made prop_made avg_points total_points
## 1 0  4         2        1       0.5        1.5            3
## 2 0  5         3        0       0.0        0.0            0
## 3 0  6         1        0       0.0        0.0            0
## 4 0  7         1        0       0.0        0.0            0
## 5 0  9         2        0       0.0        0.0            0
## 6 0 10         1        0       0.0        0.0            0
shots = filter(shots, num_shots < 1000)
ggplot(shots, aes(x, y))+
  geom_point(aes(color = avg_points, 
                 size = num_shots)) +
  scale_color_distiller("Points", palette = "RdYlGn") +
  scale_size("Attempts", range = c(0.1, 5) ) +
  ylim(0, 35) +
  coord_equal() +
  theme_void() +
  theme(axis.ticks = element_blank(),
        axis.text  = element_blank(),
        axis.line  = element_blank(),
        axis.title = element_blank())
## Warning: Removed 54 rows containing missing values (`geom_point()`).

ggplot(shots, aes(x, y))+
  geom_point(aes(color = avg_points)) +
  scale_color_distiller("Points", palette = "RdYlGn") +
  ylim(0, 35) +
  coord_equal() +
  theme_void() +
  theme(axis.ticks = element_blank(),
        axis.text  = element_blank(),
        axis.line  = element_blank(),
        axis.title = element_blank())
## Warning: Removed 54 rows containing missing values (`geom_point()`).

4 Case Study 4: Houston Flight Departure Delays

This case study is also adapted from one of Charlotte Wickham’s examples.

library(dplyr)
library(hflights)
library(ggplot2)

hflights_df <- as_tibble(hflights)  # tbl_df() is deprecated since dplyr 1.0.0
hflights_df <- mutate(hflights_df, 
                      DepHour = floor(DepTime/100),
                      DayOfWeek = factor(DayOfWeek, 
                                         labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
                      Date = ISOdate(Year, Month, DayofMonth)
)
hou <- filter(hflights_df, Origin == "HOU")

hou_mon <- filter(hou, DayOfWeek == "Mon")

# over all mondays in 2011, avg delay of flights departing by hour
hou_mon_avg <- hou_mon %>%
  group_by(DepHour) %>%
  summarise(avg_delay = mean(DepDelay))

# initial plot
ggplot(hou_mon_avg, aes(DepHour, avg_delay)) + 
  geom_point() +
  geom_line() + 
  ylab("Average delay (mins)") +
  xlab("Departure time") +
  scale_x_continuous(breaks = seq(0, 24, 6),
                     labels = c("midnight", "6am", "noon", "6pm", "midnight")) +
  theme_bw(18)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).

#ggsave("08-monday.png", width = 6, height = 4)

# for each monday in 2011, avg delay of flights departing by hour
hou_mon_day <- filter(hou, DayOfWeek == "Mon") %>%
  group_by(Date, DepHour) %>%
  summarise(avg_delay = mean(DepDelay))
## `summarise()` has grouped output by 'Date'. You can override using the `.groups`
## argument.
# quantiles for delay by time
hou_mon_q <- hou_mon %>% group_by(DepHour) %>%
  summarise(n = n(),
            q25 = quantile(DepDelay, probs = 0.25, na.rm = TRUE),
            q50 = quantile(DepDelay, probs = 0.5, na.rm = TRUE),
            q75 = quantile(DepDelay, probs = 0.75, na.rm = TRUE))

hou_mon %>%
  ggplot(aes(DepHour, DepDelay)) + 
  geom_point() +
  #geom_line(aes(group = Date)) + 
  ylab("Departure delay (mins)") +
  xlab("Departure time") +
  scale_x_continuous(breaks = seq(0, 24, 6),
                     labels = c("midnight", "6am", "noon", "6pm", "midnight")) +
  theme_bw(18)
## Warning: Removed 156 rows containing missing values (`geom_point()`).

5 Case Study 5: Temperature by Cities

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(tidyverse)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
t = read_csv("data/temperature_airport.csv")
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 141294 Columns: 61
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (4): STATION, NAME, FMTM, PGTM
## dbl  (45): LATITUDE, LONGITUDE, ELEVATION, ACMH, ACSH, AWND, GAHT, PRCP, PSU...
## lgl  (11): ACMC, ACSC, DAPR, FRGT, MDPR, WT10, WT17, WT19, WV01, WV07, WV20
## date  (1): DATE
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
t2 = t %>%
  mutate(YEAR = year(DATE),
         MONTH = month(DATE),
         DAY = day(DATE),
         MONTHDAY = paste(ifelse(MONTH>=10, MONTH, paste("0",MONTH,sep="")), 
                          ifelse(  DAY>=10,   DAY, paste("0",  DAY,sep="")), sep="-"),
         NAME = case_when(NAME == "CHICAGO OHARE INTERNATIONAL AIRPORT, IL US" ~ "ORD", 
                          NAME == "CINCINNATI NORTHERN KENTUCKY INTERNATIONAL AIRPORT, KY US" ~ "CVG", 
                          NAME == "JFK INTERNATIONAL AIRPORT, NY US" ~ "JFK", 
                          NAME == "MIAMI INTERNATIONAL AIRPORT, FL US" ~ "MIA", 
                          NAME == "SAN FRANCISCO INTERNATIONAL AIRPORT, CA US" ~ "SFO",
                          NAME == "MINNEAPOLIS ST. PAUL INTERNATIONAL AIRPORT, MN US" ~ "MSP")) %>% 
  group_by(NAME, MONTHDAY) %>%
  summarize(TEMP = mean(TAVG, na.rm = TRUE) ) %>%
  ungroup()
## `summarise()` has grouped output by 'NAME'. You can override using the `.groups`
## argument.
t2
## # A tibble: 1,830 x 3
##    NAME  MONTHDAY  TEMP
##    <chr> <chr>    <dbl>
##  1 CVG   01-01     34.2
##  2 CVG   01-02     36.1
##  3 CVG   01-03     35.4
##  4 CVG   01-04     33.4
##  5 CVG   01-05     30  
##  6 CVG   01-06     27.8
##  7 CVG   01-07     25.6
##  8 CVG   01-08     29.8
##  9 CVG   01-09     32.9
## 10 CVG   01-10     32.6
## # ... with 1,820 more rows
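
As an aside, the nested ifelse() zero-padding used to build MONTHDAY above can be written more compactly with sprintf(); a small equivalent sketch:

```r
# sprintf("%02d", x) zero-pads a whole number to two digits, so the
# month-day key can be built in one call instead of nested ifelse().
month <- c(1, 10)
day   <- c(5, 23)
monthday <- sprintf("%02d-%02d", month, day)
monthday   # "01-05" "10-23"
```

Because sprintf() vectorizes over its arguments, this works directly inside mutate() as well.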
ggplot(t2) +
  geom_line(aes(x = MONTHDAY, y = TEMP, group = NAME, color = NAME)) +
  geom_smooth(aes(x = MONTHDAY, y = TEMP, group = NAME, color = NAME), method = "gam") +
  scale_x_discrete(breaks = c("01-01", "04-01", "07-01", "10-01"),
                   labels = c("01-01", "04-01", "07-01", "10-01")) +
  scale_y_continuous(limits = c(0,100)) +
  coord_polar()
## 
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
