Acknowledgement: the materials below are partially based on the book “Introduction to Data Science: Data Analysis and Prediction Algorithms with R” by Rafael A. Irizarry.

#library(tidyverse) #install.packages("tidyverse")
library(gridExtra) #install.packages("gridExtra")
library(Lahman) #install.packages("Lahman")
library(dplyr) #install.packages("dplyr")
library(tidyr) #install.packages("tidyr")
library(tibble) #install.packages("tibble")
library(readr) #install.packages("readr")
library(purrr) #install.packages("purrr")
library(stringr) #install.packages("stringr")
library(forcats) #install.packages("forcats")
library(dslabs) #install.packages("dslabs")
library(ggplot2) #install.packages("ggplot2")
library(broom) #install.packages("broom")
#ds_theme_set()

1 Correlation vs Causation

Correlation is not causation. The regression tools we have introduced are useful for quantifying correlations (i.e., associations) between covariates and the response variable. However, we must be careful not to over-interpret these correlations. Sometimes, correlations are caused by confounders.

2 Confounders

A confounder (also confounding variable, confounding factor, or lurking variable) is a variable that influences both the dependent variable and the independent variable, causing a spurious association.

If X and Y are correlated, we call Z a confounder if changes in Z cause changes in both X and Y. Confounding is a causal concept and, as such, cannot be described in terms of correlations or associations. In some cases, we can use linear models to account for confounders, but this is not always possible.

Incorrect interpretations due to confounders are ubiquitous, and confounders are often hard to detect.

3 Simpson’s Paradox

Because of confounders, we can run into a situation called Simpson’s paradox. It is called a paradox because the sign of the correlation flips when comparing the entire population with specific strata. Here is an example.

Here are a few other examples.
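
To make the paradox concrete, here is a minimal simulated sketch (not the example from the figures above): within each of three groups, x and y are negatively correlated, but the group means line up so that the overall correlation is positive.

set.seed(1)
N <- 100
group <- rep(1:3, each = N)
# Group means increase together, but within each group y decreases with x.
x <- rnorm(3 * N, mean = c(1, 3, 5)[group], sd = 0.3)
y <- -x + c(0, 4, 8)[group] + rnorm(3 * N, sd = 0.3)
cor(x, y)                                                   # positive overall
sapply(1:3, function(g) cor(x[group == g], y[group == g]))  # negative within each group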

4 Example: Moneyball

There are many confounding situations in our Moneyball example.

4.1 Runs vs Losses

Usually, more runs lead to more wins and fewer losses. However, when we regress runs on losses, we actually get a positive slope, which is puzzling.

RvsL_fit <- lm(R~L, data=Teams)
RvsL_fit
## 
## Call:
## lm(formula = R ~ L, data = Teams)
## 
## Coefficients:
## (Intercept)            L  
##    669.5492       0.1876

Note that we can write a function to extract the slope estimate as follows.

get_slope <- function(y, x) lm(y~x)$coef[2]
Teams %>% summarize(slope = get_slope(R,L))
##       slope
## 1 0.1876455
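
Since the broom package is loaded above, we could also extract the estimates as a tidy data frame (a sketch):

# tidy() returns the coefficients, standard errors, and p-values of the fit
# as a data frame.
tidy(RvsL_fit)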

Here we plot runs versus losses.

Teams %>% ggplot(aes(L, R)) +  
geom_point(alpha = 0.5) +
geom_smooth(method = "lm")

As we can see, the pattern seems to be a mixture of positive and negative correlations. To understand this puzzle, recall that our data are yearly statistics for each team, and these statistics depend on how many games a team plays that season; some seasons have more games and some have fewer. If we plot runs against the number of games, and losses against the number of games, here is what we find.

plot1 <- Teams %>% ggplot(aes(G, R)) + geom_point(alpha = 0.5)
plot2 <- Teams %>% ggplot(aes(G, L)) + geom_point(alpha = 0.5)
grid.arrange(plot1, plot2, ncol=2)

In fact, if we plot everything as a function of time, i.e., year, the pattern becomes more obvious.

plot1 <- Teams %>% ggplot(aes(yearID, G)) + geom_point(alpha = 0.5) + 
  geom_vline(xintercept = 1961)
plot2 <- Teams %>% ggplot(aes(yearID, R)) + geom_point(alpha = 0.5) + 
  geom_vline(xintercept = 1961)
plot3 <- Teams %>% ggplot(aes(yearID, L)) + geom_point(alpha = 0.5) + 
  geom_vline(xintercept = 1961)
grid.arrange(plot1, plot2, plot3, nrow=3)

As the years go by, there are more games in total and hence more runs and more losses. In other words, the number of games is the confounder that leads to a positive correlation between runs and losses.
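
Before stratifying, a quick check (a sketch, not part of the original analysis) is to remove the effect of season length by putting both variables on a per-game scale; we would then expect the slope to turn negative.

# Regress runs per game on losses per game with the get_slope() helper defined
# above; on a per-game scale the number of games no longer drives both variables.
Teams %>% summarize(slope = get_slope(R / G, L / G))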

To correctly evaluate the correlation between runs and losses, we need to stratify. Below we separate the data into groups according to the number of games: we round the number of games down to the nearest multiple of 20 (giving 9 bins) and fit a separate regression within each stratum. Here are the results.

Teams %>% mutate(G_strata = floor(G/20)*20) %>%
  ggplot(aes(L, R)) +  
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap( ~ G_strata)

We can also report the estimated slope in each stratum (below using finer bins of width 10). They are mostly negative.

get_slope <- function(y, x) lm(y~x)$coef[2]

Teams %>% mutate(G_strata = floor(G/10)*10) %>%  
  group_by(G_strata) %>%
  summarize(slope = get_slope(R,L))
## # A tibble: 17 x 2
##    G_strata   slope
##       <dbl>   <dbl>
##  1        0  29.0  
##  2       10  -0.524
##  3       20 -12.5  
##  4       30  -4.06 
##  5       40 -12.2  
##  6       50 -12.1  
##  7       60  -6.07 
##  8       70 -10.4  
##  9       80  -7.09 
## 10       90  -4.72 
## 11      100  -4.25 
## 12      110  -5.70 
## 13      120  -7.50 
## 14      130  -5.75 
## 15      140  -4.08 
## 16      150  -4.38 
## 17      160  -4.01

4.2 A More Subtle and Harder Case: Bases on Balls vs Home Runs

Another, more subtle example of a confounder is the effect of bases on balls. If you recall from our Moneyball example, bases on balls is one of the most important factors associated with runs. When we regress runs on bases on balls, we get the following. Note that we again analyze the data from 1961 through 2001. In addition, this example differs from the previous one in that the analysis is at the per-game level, which means all variables are “per game.”

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(BB_per_game = BB / G,
         R_per_game = R / G) %>%
  ggplot(aes(BB_per_game, R_per_game)) +  
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm") +
    xlim(2, 5) + ylim(2, 6)
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).

get_slope <- function(y, x) cov(x, y) / (sd(x) * sd(x))  # OLS slope: cov(x, y) / var(x)

bb_slope <- Teams %>% 
  filter(yearID %in% 1961:2001 ) %>% 
  mutate(BB_per_game = BB/G, R_per_game = R/G) %>% 
  summarize(slope = get_slope(R_per_game, BB_per_game))

bb_slope 
##       slope
## 1 0.7353288

The estimated slope for bases on balls is quite high. To see this, compare it with the slope we get when regressing runs on singles.

singles_slope <- Teams %>% 
  filter(yearID %in% 1961:2001 ) %>%
  mutate(Singles_per_game = (H-HR-X2B-X3B)/G, R_per_game = R/G) %>%
  summarize(slope = get_slope(R_per_game, Singles_per_game))
singles_slope 
##       slope
## 1 0.4494253

The bases on balls slope is almost twice the singles slope. This is puzzling because bases on balls and singles contribute to the team in almost the same way: in both cases, the batter ends up on first base.

Sometimes, when pitchers are afraid of home runs, they intentionally “walk” the really good hitters. As a result, home run hitters tend to have more bases on balls. So although bases on balls appear to be a strong predictor of runs, it is in fact the home runs that produce most of the runs. Home runs are the confounder that drives both bases on balls and runs.

However, do bases on balls help produce runs at all? To answer this, we have to adjust for the HR effect by regressing within strata. We divide the observations into bins in which home runs per game are roughly the same: we round HR per game to the nearest tenth and keep the strata from 0.4 to 1.2.

Teams %>% 
  filter(yearID %in% 1961:2001 ) %>% 
  mutate(Singles = (H-HR-X2B-X3B)/G, BB = BB/G, HR = HR/G) %>%  
  summarize(cor(BB, HR), cor(Singles, HR), cor(BB, Singles))
##   cor(BB, HR) cor(Singles, HR) cor(BB, Singles)
## 1   0.4039313       -0.1737435      -0.05603822
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(HR_strata = round(HR/G, 1), 
         BB_per_game = BB / G,
         R_per_game = R / G,
         BB_strata = round(BB/G, 1),
         HR_per_game = HR / G ) %>%
  filter(HR_strata >= 0.4 & HR_strata <=1.2) 
dat %>% 
  ggplot(aes(BB_per_game, R_per_game)) +  
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  xlim(2, 5) + ylim(2, 6) +
  facet_wrap( ~ HR_strata)

Compare the above figure with the simple linear regression fit at the beginning of this section.

We also report the slope of bases on balls in each stratum below. As we can see, they are much lower than the slope from the simple linear regression at the beginning of this section. In fact, they are much closer to the slope for singles, which is what we expect. This means that bases on balls do help produce runs, with about the same effect as singles.

dat %>%  
  group_by(HR_strata) %>%
  summarize(slope = get_slope(R_per_game,BB_per_game))
## # A tibble: 9 x 2
##   HR_strata slope
##       <dbl> <dbl>
## 1       0.4 0.734
## 2       0.5 0.566
## 3       0.6 0.412
## 4       0.7 0.285
## 5       0.8 0.365
## 6       0.9 0.261
## 7       1   0.511
## 8       1.1 0.454
## 9       1.2 0.440
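
As noted earlier, a linear model can sometimes account for a confounder directly. Here is a minimal sketch (not part of the original stratified analysis) that adjusts for home runs by including HR per game as a second covariate; we would expect the coefficient for bases on balls to land much closer to the stratified slopes above than to the unadjusted 0.735.

# Multiple regression adjusting for the home run confounder (a sketch);
# tidy() from broom formats the coefficients as a data frame.
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(BB_per_game = BB / G, HR_per_game = HR / G, R_per_game = R / G) %>%
  lm(R_per_game ~ BB_per_game + HR_per_game, data = .) %>%
  tidy()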

5 Example: UC Berkeley Admissions

Confounders are not limited to regression; there are many other settings in which confounders mask the true relationship. Below is the famous UC Berkeley admissions example.

We use admissions data from six U.C. Berkeley majors from 1973. The data show that men were admitted at a higher rate than women: the overall admission rate was 44% for men and 30% for women. Below is the data, after some light cleaning.

data(admissions)
names(admissions)[3]="admitted_percent" # some data cleaning
admissions <- admissions %>% 
  mutate(admitted_percent = admitted_percent/100,
         admitted_num=round(admitted_percent*applicants),
         not_admitted_num=applicants-admitted_num)
admissions
##    major gender admitted_percent applicants admitted_num not_admitted_num
## 1      A    men             0.62        825          512              313
## 2      B    men             0.63        560          353              207
## 3      C    men             0.37        325          120              205
## 4      D    men             0.33        417          138              279
## 5      E    men             0.28        191           53              138
## 6      F    men             0.06        373           22              351
## 7      A  women             0.82        108           89               19
## 8      B  women             0.68         25           17                8
## 9      C  women             0.34        593          202              391
## 10     D  women             0.35        375          131              244
## 11     E  women             0.24        393           94              299
## 12     F  women             0.07        341           24              317

We can calculate the overall admission rate for men and women as follows.

admission_rate <- admissions %>% group_by(gender) %>% 
  summarise(ttl_admitted_n=sum(admitted_num),
            ttl_not_admitted_n=sum(not_admitted_num),
            total_applicants=sum(applicants),
            admission_rate=sum(admitted_num)/sum(applicants))
admission_rate
## # A tibble: 2 x 5
##   gender ttl_admitted_n ttl_not_admitted_n total_applicants admission_rate
##   <chr>           <dbl>              <dbl>            <dbl>          <dbl>
## 1 men              1198               1493             2691          0.445
## 2 women             557               1278             1835          0.304

We can conduct a chi-square test to see if gender is associated with admission. The p-value is highly significant.

chisq.test(admission_rate[,c("ttl_admitted_n","ttl_not_admitted_n")])
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  admission_rate[, c("ttl_admitted_n", "ttl_not_admitted_n")]
## X-squared = 91.61, df = 1, p-value < 2.2e-16

However, if we compare men and women within each major, we find that for most majors, men’s admission rate is lower than women’s. Note that the last column below is mostly positive.

admissions %>% select(major, gender, admitted_percent) %>%
  spread(gender, admitted_percent) %>%
  mutate(women_minus_men = women - men)
##   major  men women women_minus_men
## 1     A 0.62  0.82            0.20
## 2     B 0.63  0.68            0.05
## 3     C 0.37  0.34           -0.03
## 4     D 0.33  0.35            0.02
## 5     E 0.28  0.24           -0.04
## 6     F 0.06  0.07            0.01
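
We can also repeat the chi-square test within each major using the counts computed earlier (a sketch, not in the original analysis). Once we stratify by major, we would expect most majors to show little evidence of a gender difference, with major A, where women’s rate is much higher, as the likely exception.

# Chi-square test of gender vs. admission within each major (a sketch).
admissions %>% group_by(major) %>%
  summarize(p_value = chisq.test(cbind(admitted_num, not_admitted_num))$p.value)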

So why is the overall conclusion the opposite? If we plot the overall admission rate of each major (its selectivity) against the proportion of women applicants, we find a striking pattern: the majors to which women applied in the largest proportions are also the most selective ones, i.e., the ones with the lowest admission rates.

admissions %>% 
  group_by(major) %>% 
  summarize(major_selectivity = sum(admitted_percent * applicants)/sum(applicants),
            percent_women_applicants = sum(applicants * (gender=="women")) /
                                             sum(applicants) * 100) %>%
  ggplot(aes(major_selectivity, percent_women_applicants, label = major)) +
  geom_text()

Lastly, we summarize the whole picture in the figure below, where we plot the admission rates of men and women for each major. As we can see, major A is not very selective, but few women applied to it; in fact, women’s admission rate in major A is much higher than men’s. The other majors are more selective and receive comparable numbers of men and women applicants; major E, a highly selective major, has more women applicants than men applicants. The overall admission rates of men and women are shown as two horizontal lines, which can be viewed as weighted averages of the per-major rates. Because many men applied to major A, the weighted average for men is “dragged upward.” So in this case, major is the confounder.

admissions %>% 
  ggplot(aes(major, admitted_percent, col = gender, size = applicants)) +
  geom_point() + geom_hline(yintercept = c(0.445,0.304))

admissions %>%  group_by(gender) %>% summarize(average = mean(admitted_percent))
## # A tibble: 2 x 2
##   gender average
##   <chr>    <dbl>
## 1 men      0.382
## 2 women    0.417
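
For comparison, the horizontal lines in the figure correspond to applicant-weighted averages of the per-major rates, not the simple averages above. Here is a sketch that reproduces the overall rates computed earlier.

# Weight each major's admission rate by its number of applicants.
admissions %>% group_by(gender) %>%
  summarize(weighted_average = sum(admitted_percent * applicants) / sum(applicants))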

6 Spurious Correlation

Here is a list of examples of spurious correlations. [http://tylervigen.com/spurious-correlations]

These cases are mostly instances of what is generally called data fishing, data snooping, or cherry picking. An example of data fishing is looking through many results produced by a random process and picking the one that shows a relationship supporting a theory you want to defend.

Here is a simulation example.

n = 10
y = rnorm(n)   # a single outcome vector of length 10
y
##  [1]  0.08041368 -0.42644117 -0.61059486  0.25601776 -0.56623463
##  [6] -0.03835481 -1.36430646  0.61230888 -1.05280615 -0.78603801
try = 100000                          # number of random predictors to try
x = matrix(rnorm(n * try), n, try)    # each column is an unrelated random predictor
cor_vec = rep(NA, try)
for (it in 1:try) {
  cor_vec[it] = cor(y, x[, it])       # correlation of y with each random predictor
}
hist(cor_vec, breaks = 20, xlab = "correlation")

Another form of data fishing is p-hacking, or selective reporting, which is much discussed in the context of scientific publications. Because journals tend to reward statistical significance, i.e., p-value < 0.05, researchers are incentivized to report significant results, and hence may perform many statistical analyses and then report only the significant ones.
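
Continuing the simulation above, here is a sketch of what selective reporting looks like in code: scan all the random predictors, keep only the one most correlated with y, and report its p-value as if it were the only analysis performed.

best <- which.max(abs(cor_vec))   # the "best" of 100,000 unrelated predictors
cor(y, x[, best])
cor.test(y, x[, best])$p.value    # looks "significant" purely by chance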

There is a saying: “if you torture the data long enough, it will confess to anything.”