In this chapter, we use a few case studies to demonstrate how data visualization techniques help us extract insights from data. The following R packages are needed for running the examples in this chapter.
library(tidyverse)
library(ggthemes)
library(grid)
library(gridExtra)
#library(Lahman)
#library(MASS) # conflict with dplyr package for select(), use dplyr::select
library(RColorBrewer)
library(ggridges)
library(ggrepel)
library(kableExtra)
We analyze a data set containing the offense and defense statistics
for Major League Baseball teams in the early years, specifically from
1871 through 1900. The data set includes baseball teams from several
“major” leagues, such as the American Association, the Union
Association, the Players League, the Federal League, and the National
Association (of 1871-1875). This data set was created by Sean Lahman,
who pioneered the effort to make baseball statistics freely available to
the general public. The complete data set, including the most recent
data, is available in the R package Lahman.
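As a side note, if the Lahman package is installed, a comparable early-era subset could be built directly from its Teams table rather than from the pre-built csv file used below (a sketch; the object name Teams_early is ours):

```r
# Sketch: construct the same early-era subset from the Lahman package's
# Teams table (assumes the Lahman package is installed).
library(Lahman)
Teams_early <- subset(Teams, yearID <= 1900)
```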
We first load the data and print the first few rows.
Teams = read.csv("data/baseballgame_bf1900.csv")
head(Teams)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 <NA> BS1 BNA NA 3 31 NA 20 10 NA NA N
## 2 1871 <NA> CH1 CNA NA 2 28 NA 19 9 NA NA N
## 3 1871 <NA> CL1 CFC NA 8 29 NA 10 19 NA NA N
## 4 1871 <NA> FW1 KEK NA 7 19 NA 7 12 NA NA N
## 5 1871 <NA> NY2 NNA NA 5 33 NA 16 17 NA NA N
## 6 1871 <NA> PH1 PNA NA 1 28 NA 21 7 NA NA Y
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 <NA> 401 1372 426 70 37 3 60 19 73 16 NA NA 303 109 3.55 22 1 3
## 2 <NA> 302 1196 323 52 21 10 60 22 69 21 NA NA 241 77 2.76 25 0 1
## 3 <NA> 249 1186 328 35 40 7 26 25 18 8 NA NA 341 116 4.11 23 0 0
## 4 <NA> 137 746 178 19 8 2 33 9 16 4 NA NA 243 97 5.17 19 1 0
## 5 <NA> 302 1404 403 43 21 1 33 15 46 15 NA NA 313 121 3.72 32 1 0
## 6 <NA> 376 1281 410 66 27 9 46 23 56 12 NA NA 266 137 4.95 27 0 0
## IPouts HA HRA BBA SOA E DP FP name
## 1 828 367 2 42 23 243 24 0.834 Boston Red Stockings
## 2 753 308 6 28 22 229 16 0.829 Chicago White Stockings
## 3 762 346 13 53 34 234 15 0.818 Cleveland Forest Citys
## 4 507 261 5 21 17 163 8 0.803 Fort Wayne Kekiongas
## 5 879 373 7 42 22 235 14 0.840 New York Mutuals
## 6 747 329 3 53 16 194 13 0.845 Philadelphia Athletics
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 South End Grounds I NA 103 98 BOS BS1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 National Association Grounds NA 96 100 CLE CL1
## 4 Hamilton Field NA 101 107 KEK FW1
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## 6 Jefferson Street Grounds NA 102 98 ATH PH1
## teamIDretro
## 1 BS1
## 2 CH1
## 3 CL1
## 4 FW1
## 5 NY2
## 6 PH1
Each row in the data set represents the statistics of one team in a particular season. The statistics include the number of runs scored by that team in that year/season (R), the number of home runs (HR), the number of games (G), losses (L), wins (W), and many other variables. Suppose we are interested in the relationship between the number of runs (R) and the number of losses (L). We can easily generate a scatter plot to show the relationship.
Teams %>%
ggplot(aes(R, L)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm")
##
## `geom_smooth()` using formula = 'y ~ x'
Usually, more runs lead to more wins and hence fewer losses. However, when we visualize the data, we actually see a positive correlation, which is puzzling. Fitting a regression line to the data confirms this observation.
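The slope of that line can also be checked numerically with base R's lm() (a quick sketch; the positive coefficient on R mirrors the upward-sloping line in the plot):

```r
# Fit the same linear model numerically; the coefficient on R is
# positive, matching the upward-sloping regression line.
coef(lm(L ~ R, data = Teams))
```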
To understand this puzzle, recall that our data consist of team statistics from the early years. Back in the 1870s and 1880s, the baseball league was in its infancy, and it experienced many ups and downs as it developed to its current scale. Each season is therefore quite different from the others in almost every aspect, and when we pool the data across all the years, the data tell a very different story.
For example, the number of runs and the number of losses both depend on how many games each team played during that season. Since the league was growing rapidly, the number of games varies considerably across seasons: some seasons had far more games than others.
Therefore, we can reexamine the relationship between the number of runs and the number of losses conditional on the number of games each team played in that season. This is sometimes called stratification. Essentially, we divide the observations into groups according to the value of a third variable (here, the number of games), so that the observations within each group have similar or identical values of that variable. Within each group, we investigate the two variables of interest, i.e., the numbers of runs and losses. Our results and findings are then conditional on the third variable by which we stratified the data.
Since we want to stratify by the number of games, which is a quantitative variable, we need to cut it into several bins/intervals. We first check the range of the number of games as follows.
range(Teams$G)
## [1] 6 158
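Before committing to a bin width, it can help to see how many team-seasons would fall into each candidate stratum (a sketch using cut() and table(); sparsely populated strata would suggest wider bins):

```r
# Count team-seasons per candidate stratum of games played.
table(cut(Teams$G, seq(0, 180, 20)))
```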
It seems an interval of length 20 will give us about 8 strata. We use
the function cut() to convert the quantitative variable into
a categorical variable with values (0, 20], … , (140, 160]. To
visualize the strata, we use the visualization technique called
faceting, which plots the relationship of interest in small multiples
of panels, one per stratum. Here, we draw the scatter plot of the
numbers of runs and losses within each stratum.
Teams %>%
mutate(G_strata = cut(G, seq(0,180,20))) %>%
ggplot(aes(R, L)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
facet_wrap( ~ G_strata)
We immediately see that the relationship becomes negative within each stratum. If we overlay the regression fitted within each stratum on the regression fitted to the whole data set, the contrast is clear.
Teams %>%
mutate(G_strata = cut(G, seq(0,180,20))) %>%
ggplot(aes(R, L, color=G_strata)) +
geom_point(alpha = 0.5) +
geom_smooth(aes(group=G_strata, color=G_strata),
method = "lm")+
scale_color_viridis_d()+
geom_smooth(color = "blue", method = "lm")
Clearly, the relationship between the numbers of runs and losses within strata is the opposite of that over the entire data set. This is a typical case of the so-called Simpson’s paradox. It is called a paradox because the sign of the correlation flips when comparing the entire population with specific groups (or strata). It usually happens because a confounder variable affects both variables of interest. When the confounder’s variation dominates, the overall pattern between the two variables of interest, marginalized over the confounder, shows an unexpected association.
In this case, the number of games in each season is the confounder. The more games a team plays, the more runs and the more losses it accumulates. When the analysis is stratified on the confounder, the true relationship appears.
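The sign flip can also be quantified directly by contrasting the overall correlation with the within-stratum correlations (a sketch; per the discussion above, the pooled value is positive while the per-stratum values are mostly negative):

```r
# Overall correlation of runs and losses (positive, as in the pooled plot).
cor(Teams$R, Teams$L)
# Within-stratum correlations (mostly negative, as in the faceted plot).
Teams %>%
  mutate(G_strata = cut(G, seq(0, 180, 20))) %>%
  group_by(G_strata) %>%
  summarize(correlation = cor(R, L), n = n())
```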
We can investigate this issue further by plotting the year against the number of games, the number of runs, and the number of losses. We immediately see that, as the sport became more popular over the years, there were more games in each season, with a few exceptions around 1884-1885. The increase in the number of games over the years is therefore the main reason for the observed positive relationship between the number of runs and the number of losses.
plot1 <- Teams %>%
ggplot(aes(yearID, G)) + geom_jitter(alpha = 0.5)
plot2 <- Teams %>%
ggplot(aes(yearID, R)) + geom_jitter(alpha = 0.5)
plot3 <- Teams %>%
ggplot(aes(yearID, L)) + geom_jitter(alpha = 0.5)
plot4 <- Teams %>%
ggplot(aes(G, R)) + geom_jitter(alpha = 0.5)
plot5 <- Teams %>%
ggplot(aes(G, L)) + geom_jitter(alpha = 0.5)
grid.arrange(plot1, plot2, plot3, plot4, plot5, ncol=3)
This example shows how faceting in data visualization can help us identify, explain, and investigate Simpson’s paradox in a data set. One important tool for dealing with Simpson’s paradox is conditioning, or stratification. Conditioning/stratification is achieved in data visualization through faceting, where small multiple figures are generated for comparison.
Next, we analyze a wind data set containing hourly measurements
of the wind speed (ws) and direction (wd) in London in 1999. The data
are available in the R package openair. We first load the data
and print the first few rows.
wind = read.csv(file="data/LondonWind1999.csv", header = TRUE)
head(wind)
## date ws wd nox no2 o3 pm10 so2 co pm25
## 1 1999-01-01 00:00:00 5.04 140 88 35 4 21 3.835000 1.025000 18
## 2 1999-01-01 01:00:00 4.08 160 132 41 3 17 5.243333 2.700000 11
## 3 1999-01-01 02:00:00 4.80 160 168 40 4 17 6.513333 2.866667 8
## 4 1999-01-01 03:00:00 4.92 150 85 36 3 15 4.180000 1.625000 10
## 5 1999-01-01 04:00:00 4.68 150 93 37 3 16 4.250000 1.025000 11
## 6 1999-01-01 06:00:00 3.84 170 102 32 4 16 4.005000 0.800000 9
As we can see, each row represents one observation of the wind speed and direction at a particular hour of a particular day in 1999. We mostly focus on the wind speed (ws) and direction (wd) and ignore the other variables for now. For the wind direction, we list all possible values, which are equally spaced 10 degrees apart.
sort(unique(wind$wd))
## [1] 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180
## [20] 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360
For this data set, we start with the scatter plot of direction versus speed in the left panel of the figure below. It is a quick way to visualize the data, but it does not reveal many insights because of the overlapping observations. We address this issue by jittering the observations, and we immediately see that the wind direction is mostly concentrated in the 180-270 degree range. The wind speed is also faster in these directions.
g1=ggplot(wind)+geom_point(aes(x=wd, y=ws),alpha=0.5, size=0.8)
g2=ggplot(wind)+geom_point(aes(x=factor(wd), y=ws),
position = position_jitter(0.3),
size = 0.8, alpha=0.2)+
scale_x_discrete(breaks=seq(0,360,30))
grid.arrange(g1,g2,ncol=2)
Even though the scatter plot faithfully presents the data, it is hard to obtain further insights from it. If we focus on the distribution of the wind speed in each direction, we can use side-by-side boxplots. In the left panel of the figure below, we can see that the wind speed distribution gradually shifts to larger values as the wind direction approaches 220 degrees. One improvement we can make to this visualization is to exploit the circular nature of wind direction: we can arrange these boxplots in a polar coordinate system instead of the Cartesian coordinate system.
g1=ggplot(wind)+geom_boxplot(aes(x=factor(wd), y=ws),outlier.size = 0.5)+
scale_x_discrete(breaks=seq(0,360,30))
wind$wd[wind$wd==360]=0
g2=ggplot(wind)+geom_boxplot(aes(x=factor(wd), y=ws),outlier.size = 0.5)+coord_polar(start = -pi/32)
grid.arrange(g1,g2,ncol=2)
In the right panel, the boxplots are aligned in a circle. From the medians and the first and third quartiles, we can clearly see that the wind speed is much faster in the southwest direction, and the east wind is also relatively fast. These insights are harder to extract from the Cartesian coordinate system, which illustrates the advantage of the polar coordinate system.
Using boxplots in the polar coordinate system is certainly useful, but there is a shortcoming: we can only visualize the distribution of the wind speed in each direction, not the frequency of each direction. From the jittered scatter plot, we know that not only is the wind faster in the southwest direction, but there are also more observations of southwest wind than of any other direction. Such information cannot be shown in the figure above.
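A quick tabulation confirms this frequency imbalance (a sketch using dplyr’s count()):

```r
# Count hourly observations per wind direction; the most frequent
# directions are in the southwest sector, matching the jittered plot.
wind %>%
  count(wd, sort = TRUE) %>%
  head()
```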
Alternatively, to visualize the frequency of the wind direction as well as the distribution of the wind speed in each direction, we can use a bar plot. We first convert the wind speed into intervals. Then we use a stacked bar plot to show the frequencies of the wind speed intervals in each direction, shown in the left panel of the figure below. Lastly, we convert the stacked bar plot to the polar coordinate system, shown in the right panel.
wind$binned_speed = cut(wind$ws, c(seq(0,14,2),Inf), include.lowest = TRUE)
wind$direction =
cut(wind$wd,
c(-1,11.25,33.75,56.25,78.75,101.25,123.75,146.25,168.75,191.25,213.75,236.25,258.75,281.25,303.75,326.25,348.75,361),
labels = c("N","NNE","NE","ENE","E","ESE","SE","SSE","S","SSW","SW","WSW","W","WNW","NW","NNW","N"))
g3=ggplot(wind, aes(x=direction, fill = binned_speed)) +
geom_bar(position = position_stack(reverse = TRUE)) +
scale_fill_viridis_d()+
theme(legend.position = "top")
g4=ggplot(wind, aes(x=direction, fill = binned_speed)) +
geom_bar(width=1, position = position_stack(reverse = TRUE)) +
scale_fill_viridis_d()+
coord_polar(start = -pi/16)
grid.arrange(g3,g4,ncol=2)
In this visualization, we can see that the west, southwest, and south bars take up a large portion of all bars, which means the wind mostly comes from these directions in London. In addition, the wind blows fast in these directions, as the bars contain more yellow segments. This visualization is called a wind rose. It is powerful because it visualizes the wind speed and direction at the same time, and thanks to the polar coordinate system it is intuitive and easy to digest. There are also many extensions of this visualization, for example, in the choice of the color scheme.
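As one such variation (a sketch, not from the original), the same wind rose can be drawn with a different sequential palette via scale_fill_brewer(), drawing on the RColorBrewer palettes loaded at the start of the chapter:

```r
# Identical geometry to the wind rose above; only the fill scale changes.
ggplot(wind, aes(x = direction, fill = binned_speed)) +
  geom_bar(width = 1, position = position_stack(reverse = TRUE)) +
  scale_fill_brewer(palette = "YlGnBu") +
  coord_polar(start = -pi/16)
```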
The next example is adapted from Charlotte Wickham’s example, which recreates the graphic “Where the Heat and the Thunder Hit Their Shots.”
shots = read.csv("data/shots_sum.csv", header = TRUE)
head(shots)
## x y num_shots num_made prop_made avg_points total_points
## 1 0 4 2 1 0.5 1.5 3
## 2 0 5 3 0 0.0 0.0 0
## 3 0 6 1 0 0.0 0.0 0
## 4 0 7 1 0 0.0 0.0 0
## 5 0 9 2 0 0.0 0.0 0
## 6 0 10 1 0 0.0 0.0 0
shots = filter(shots, num_shots < 1000)
ggplot(shots, aes(x, y))+
geom_point(aes(color = avg_points,
size = num_shots)) +
scale_color_distiller("Points", palette = "RdYlGn") +
scale_size("Attempts", range = c(0.1, 5) ) +
ylim(0, 35) +
coord_equal() +
theme_void() +
theme(axis.ticks = element_blank(),
axis.text = element_blank(),
axis.line = element_blank(),
axis.title = element_blank())
## Warning: Removed 54 rows containing missing values (`geom_point()`).
ggplot(shots, aes(x, y))+
geom_point(aes(color = avg_points)) +
scale_color_distiller("Points", palette = "RdYlGn") +
scale_size("Attempts", range = c(0.1, 5) ) +
ylim(0, 35) +
coord_equal() +
theme_void() +
theme(axis.ticks = element_blank(),
axis.text = element_blank(),
axis.line = element_blank(),
axis.title = element_blank())
## Warning: Removed 54 rows containing missing values (`geom_point()`).
The next example is also adapted from Charlotte Wickham’s example, this time examining flight departure delays from Houston using the hflights data set.
library(dplyr)
library(hflights)
library(ggplot2)
hflights_df <- as_tibble(hflights)
hflights_df <- mutate(hflights_df,
DepHour = floor(DepTime/100),
DayOfWeek = factor(DayOfWeek,
labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
Date = ISOdate(Year, Month, DayofMonth)
)
hou <- filter(hflights_df, Origin == "HOU")
hou_mon <- filter(hou, DayOfWeek == "Mon")
# over all mondays in 2011, avg delay of flights departing by hour
hou_mon_avg <- hou_mon %>%
group_by(DepHour) %>%
summarise(avg_delay = mean(DepDelay))
# initial plot
ggplot(hou_mon_avg, aes(DepHour, avg_delay)) +
geom_point() +
geom_line() +
ylab("Average delay (mins)") +
xlab("Departure time") +
scale_x_continuous(breaks = seq(0, 24, 6),
labels = c("midnight", "6am", "noon", "6pm", "midnight")) +
theme_bw(18)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).
#ggsave("08-monday.png", width = 6, height = 4)
# for each monday in 2011, avg delay of flights departing by hour
hou_mon_day <- filter(hou, DayOfWeek == "Mon") %>%
group_by(Date, DepHour) %>%
summarise(avg_delay = mean(DepDelay))
## `summarise()` has grouped output by 'Date'. You can override using the `.groups`
## argument.
# quantiles for delay by time
hou_mon_q <- hou_mon %>% group_by(DepHour) %>%
summarise(n = n(),
q25 = quantile(DepDelay, probs = 0.25, na.rm = TRUE),
q50 = quantile(DepDelay, probs = 0.5, na.rm = TRUE),
q75 = quantile(DepDelay, probs = 0.75, na.rm = TRUE))
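The quantiles computed above are not plotted by the code that follows; one possible way to visualize them (a sketch) is a median line with an interquartile ribbon:

```r
# Median Monday departure delay by hour, with a ribbon covering the
# 25th-75th percentile range computed in hou_mon_q above.
ggplot(hou_mon_q, aes(DepHour)) +
  geom_ribbon(aes(ymin = q25, ymax = q75), alpha = 0.3) +
  geom_line(aes(y = q50)) +
  ylab("Departure delay (mins)") +
  xlab("Departure time") +
  theme_bw(18)
```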
hou_mon %>%
ggplot(aes(DepHour, DepDelay)) +
geom_point() +
#geom_line(aes(group = Date)) +
ylab("Average delay (mins)") +
xlab("Departure time") +
scale_x_continuous(breaks = seq(0, 24, 6),
labels = c("midnight", "6am", "noon", "6pm", "midnight")) +
theme_bw(18)
## Warning: Removed 156 rows containing missing values (`geom_point()`).
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(tidyverse)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
t = read_csv("data/temperature_airport.csv")
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 141294 Columns: 61
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): STATION, NAME, FMTM, PGTM
## dbl (45): LATITUDE, LONGITUDE, ELEVATION, ACMH, ACSH, AWND, GAHT, PRCP, PSU...
## lgl (11): ACMC, ACSC, DAPR, FRGT, MDPR, WT10, WT17, WT19, WV01, WV07, WV20
## date (1): DATE
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
t2 = t %>%
mutate(YEAR = year(DATE),
MONTH = month(DATE),
DAY = day(DATE),
MONTHDAY = paste(ifelse(MONTH>=10, MONTH, paste("0",MONTH,sep="")),
ifelse( DAY>=10, DAY, paste("0", DAY,sep="")), sep="-"),
NAME = case_when(NAME == "CHICAGO OHARE INTERNATIONAL AIRPORT, IL US" ~ "ORD",
NAME == "CINCINNATI NORTHERN KENTUCKY INTERNATIONAL AIRPORT, KY US" ~ "CVG",
NAME == "JFK INTERNATIONAL AIRPORT, NY US" ~ "JFK",
NAME == "MIAMI INTERNATIONAL AIRPORT, FL US" ~ "MIA",
NAME == "SAN FRANCISCO INTERNATIONAL AIRPORT, CA US" ~ "SFO",
NAME == "MINNEAPOLIS ST. PAUL INTERNATIONAL AIRPORT, MN US" ~ "MSP")) %>%
group_by(NAME, MONTHDAY) %>%
summarize(TEMP = mean(TAVG, na.rm = TRUE) ) %>%
ungroup()
## `summarise()` has grouped output by 'NAME'. You can override using the `.groups`
## argument.
t2
## # A tibble: 1,830 x 3
## NAME MONTHDAY TEMP
## <chr> <chr> <dbl>
## 1 CVG 01-01 34.2
## 2 CVG 01-02 36.1
## 3 CVG 01-03 35.4
## 4 CVG 01-04 33.4
## 5 CVG 01-05 30
## 6 CVG 01-06 27.8
## 7 CVG 01-07 25.6
## 8 CVG 01-08 29.8
## 9 CVG 01-09 32.9
## 10 CVG 01-10 32.6
## # ... with 1,820 more rows
ggplot(t2) +
geom_line(aes(x = MONTHDAY, y = TEMP, group = NAME, color = NAME)) +
geom_smooth(aes(x = MONTHDAY, y = TEMP, group = NAME, color = NAME), method = "gam") +
scale_x_discrete(breaks = c("01-01", "04-01", "07-01", "10-01"),
labels = c("01-01", "04-01", "07-01", "10-01")) +
scale_y_continuous(limits = c(0,100)) +
coord_polar()
##
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'