Visualization Basics - Part II
1 Barplot to Visualize Amount, Count, and Other Discrete Data
2 Pie/Donut Chart and Their Limitations
3 Colors
4 Layout, Margin, and Save Visualization.

In this chapter, we continue to discuss the visualization basics. The following R packages are needed for running the examples in this chapter.

library(tidyverse)
library(grid)
library(gridExtra)
library(RColorBrewer)

1 Barplot to Visualize Amount, Count, and Other Discrete Data

So far, we have focused on the scatter plot, which mostly visualizes quantitative variables. To visualize qualitative variables, we need the barplot.

We use the barplot to show the frequency distribution of the categories of a qualitative variable and its composition. For example, in the college data set, suppose we are interested in the number of universities in each region of the US. We can use a barplot to visualize this information.

Note that it is always better to give the categorical axis of a barplot a meaningful order. For example, we could sort the bars according to their frequencies.

college = read.csv("data/college.csv", header = TRUE)
g1 = ggplot(data=college) +
  geom_bar(aes(x=region))
g2 = ggplot(data=college) + 
  geom_bar(aes(y=fct_infreq(region))) + 
  scale_x_continuous(expand = c(0, 0), limits = c(0, 500)) +
  ylab("region") + 
  xlab("Number of observations")
grid.arrange(g1,g2,ncol=2)

The code above first counts the number of universities in each region and then plots the frequencies. Alternatively, we could calculate the frequencies ourselves and plot them directly. To do so, we either set stat="identity" in geom_bar() or use geom_col(). Note that ordering the axis works slightly differently with geom_col() than with geom_bar().

region_freq = college %>% group_by(region) %>% summarise(count=n())
region_freq # frequencies of regions

## # A tibble: 4 x 2
##   region    count
##   <chr>     <int>
## 1 Midwest     353
## 2 Northeast   299
## 3 South       459
## 4 West        158

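Please try the following code on your computer. All six plots show the same counts as before; only the order of the bars differs. Because region_freq has one row per region, fct_infreq() sees every region exactly once and leaves the order unchanged, so we use fct_reorder() on count to sort pre-computed frequencies.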

g1 = ggplot(data=region_freq) + 
  geom_col(aes(y=region, x = count))
g2 = ggplot(data=region_freq) + 
  geom_col(aes(y=fct_reorder(region, -count), x = count))
g3 = ggplot(data=region_freq) + 
  geom_col(aes(y=fct_infreq(region), x = count))
g4 = ggplot(data=region_freq) + 
  geom_bar(aes(y=region, x = count), stat="identity")
g5 = ggplot(data=region_freq) +
  geom_bar(aes(y=fct_reorder(region, -count), x = count), stat="identity") + ylab("region")
g6 = ggplot(data=college) + 
  geom_bar(aes(y=fct_infreq(region))) + ylab("region")
grid.arrange(g1,g2,g3,g4,g5,g6, ncol=3)
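The difference in ordering between the two helpers can be checked without drawing anything, by looking at the factor levels each one produces. The sketch below is only an illustration: it types in the frequency table by hand from the counts printed earlier, so that it runs without data/college.csv, and the names levels_infreq and levels_reorder are our own.

```r
library(forcats)

# Frequency table typed in by hand from the counts printed earlier
region_freq = data.frame(
  region = c("Midwest", "Northeast", "South", "West"),
  count  = c(353, 299, 459, 158)
)

# fct_infreq() orders levels by how often each value occurs in the vector.
# Every region occurs exactly once here, so the existing order is kept.
levels_infreq = levels(fct_infreq(region_freq$region))

# fct_reorder() orders levels by a second variable; -count puts the largest first.
levels_reorder = levels(fct_reorder(region_freq$region, -region_freq$count))

print(levels_infreq)
print(levels_reorder)
```

On the raw college data, where each row is one university, fct_infreq() does see the frequencies, which is why it works together with geom_bar().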

When visualizing amounts across different categories, the barplot may not be the most suitable tool. Consider the following example, where we plot the average SAT score of each US state using a barplot, with the bars sorted by SAT score (left figure). SAT scores are mostly between 800 and 1400 (you get 400 just for putting your name down). When we average across all universities in each state, the variation across states becomes even smaller. Therefore, we mostly see many long bars of similar lengths. The visualization is accurate, but not informative.

We can emphasize the differences in SAT scores by restricting the range of the SAT axis, as in the middle figure. However, the length of each bar is then no longer proportional to the SAT score, which violates one of the most important principles in data visualization. We will discuss this principle in detail in the following chapter. For example, it seems that the SAT of North Carolina (NC) is about twice the SAT of West Virginia (WV), since the bar of the former is twice as long as the bar of the latter.

In this case, we can use a dot plot instead (right figure), which is just a scatter plot with one discrete axis. Each dot only indicates the location of the data point, so no length is read from it.

state_sat_df = college %>% 
  group_by(state) %>%
  summarize(state_sat = mean(sat_avg))
g1 = state_sat_df %>%
  ggplot() + 
  geom_col(aes(y=fct_reorder(state, state_sat), x = state_sat), width = 0.7) + 
  scale_y_discrete(name = "State") +
  xlab("Average SAT")
g2 = state_sat_df %>%
  ggplot() + 
  geom_col(aes(y=fct_reorder(state, state_sat), x = state_sat), width = 0.7) + 
  scale_y_discrete(name = "State") +
  coord_cartesian(xlim = c(950,1200) ) +
  scale_x_continuous(expand = c(0,0)) +
  xlab("Average SAT") +
  theme(axis.ticks.y = element_blank())
g3 = state_sat_df %>%
  ggplot() + 
  geom_point(aes(y=fct_reorder(state, state_sat), x = state_sat), size = 2) + 
  scale_y_discrete(name = "State") +
  coord_cartesian(xlim = c(950,1200) ) +
  scale_x_continuous(expand = c(0,0)) +
  xlab("Average SAT") +
  geom_vline(xintercept = mean(college$sat_avg), 
             color = "blue",
             linetype = "dotted")+
  theme(axis.ticks.y = element_blank(),
        panel.grid.major.y = element_line(color = "grey",
                                          linetype = "dashed"))
grid.arrange(g1,g2,g3,ncol=3)
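To see why a truncated axis misleads in a barplot, we can compare the ratio of two values with the ratio of the bar lengths that remain visible. The two averages below are hypothetical numbers chosen only for illustration; the axis start of 950 matches the xlim used in the code.

```r
# Hypothetical state averages, chosen only for illustration
sat_a = 1150
sat_b = 1050
axis_start = 950   # left end of the truncated axis, as in xlim = c(950, 1200)

# Ratio of the actual values
true_ratio = sat_a / sat_b

# Ratio of the visible bar lengths once the axis starts at 950
apparent_ratio = (sat_a - axis_start) / (sat_b - axis_start)

round(true_ratio, 2)
apparent_ratio
```

The values differ by about 10 percent, yet one visible bar is twice as long as the other. A dot plot avoids the problem because the reader compares positions, not lengths.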

The barplot can visualize not only the distribution of one qualitative variable but also, through its variations, the relationship between two qualitative variables. The position argument of geom_bar() controls how the bars of different groups are arranged. Its options include “identity”, “dodge”, “fill”, and “stack”.

“identity” makes no adjustment, so the bars of different groups are drawn on top of each other.
“dodge” places the bars of different groups side by side.
“fill” stacks the groups within one bar and rescales every bar to the same length, so the segments represent proportions.
“stack” stacks the groups within one bar, so the bar length represents the total count. This is the default.
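The effect of each position can also be inspected numerically. The following sketch uses a small made-up data frame (toy) and a helper of our own (top_of_bars, not part of ggplot2) that reads the computed bar tops with layer_data().

```r
library(ggplot2)

# Toy data (hypothetical): 3 schools in region A, 2 in region B
toy = data.frame(
  region  = c("A", "A", "A", "B", "B"),
  control = c("Public", "Private", "Private", "Public", "Public")
)

# Illustrative helper: the highest bar top under a given position
top_of_bars = function(pos) {
  p = ggplot(toy) +
    geom_bar(aes(x = region, fill = control), position = pos)
  max(layer_data(p)$ymax)
}

top_stack = top_of_bars("stack")  # counts are stacked: 2 + 1 = 3 in region A
top_fill  = top_of_bars("fill")   # every bar is rescaled to length 1
top_dodge = top_of_bars("dodge")  # side by side: the largest single count is 2

c(stack = top_stack, fill = top_fill, dodge = top_dodge)
```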

For the college data set example, suppose we want to visualize the association between the region of a university and its funding type (control). We can use the following barplots.

g1=ggplot(data=college) +
  geom_bar(aes(x=fct_relevel(region, "South", "Midwest", "Northeast", "West"), 
               fill = control), 
           width=0.75) + 
  scale_x_discrete(name = "Region") +
  scale_fill_discrete(name = "")+
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
g2=ggplot(data=college) +
  geom_bar(aes(x=region, 
               fill = control), 
           position="fill", 
           width = 0.75) + 
  scale_x_discrete(name = "Region") +
  scale_fill_discrete("")+
  scale_y_continuous(name="percent", labels = scales::percent)+
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
g3=ggplot(data=college) +
  geom_bar(aes(x=fct_relevel(region, "South", "Midwest", "Northeast", "West"), 
               fill = control), 
           position="dodge", 
           width=0.75) +
  scale_x_discrete(name = "Region") +
  scale_fill_discrete("") +
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
grid.arrange(g1,g2,g3,ncol=3)

ggplot(data=college) +
  geom_bar(aes(x=fct_relevel(region, "South", "Midwest", "Northeast", "West"),
               fill = control), 
           width=0.75) +
  scale_x_discrete(name = "Region") +
  scale_y_continuous(limits = c(0, 250)) +
  facet_wrap(~control)

Population Pyramid

We use Saudi Arabia’s population data to generate the population pyramid as follows.

saudi = read_csv("data/saudi_arabia.csv")

## Rows: 102 Columns: 12
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): FIPS, GENC, Country/Area Name, GROUP
## dbl (8): Year, Population, % of Population, Male Population, % of Males, Fem...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

saudi

## # A tibble: 102 x 12
##    FIPS  GENC  `Country/Area Name`  Year GROUP Population `% of Population`
##    <chr> <chr> <chr>               <dbl> <chr>      <dbl>             <dbl>
##  1 SA    SA    Saudi Arabia         2023 TOTAL   35939806            100   
##  2 SA    SA    Saudi Arabia         2023 0         494327              1.38
##  3 SA    SA    Saudi Arabia         2023 1         495889              1.38
##  4 SA    SA    Saudi Arabia         2023 2         502139              1.40
##  5 SA    SA    Saudi Arabia         2023 3         511583              1.42
##  6 SA    SA    Saudi Arabia         2023 4         522930              1.46
##  7 SA    SA    Saudi Arabia         2023 5         536550              1.49
##  8 SA    SA    Saudi Arabia         2023 6         551843              1.54
##  9 SA    SA    Saudi Arabia         2023 7         568419              1.58
## 10 SA    SA    Saudi Arabia         2023 8         584245              1.63
## # ... with 92 more rows, and 5 more variables: `Male Population` <dbl>,
## #   `% of Males` <dbl>, `Female Population` <dbl>, `% of Females` <dbl>,
## #   `Sex Ratio` <dbl>

saudi %>%
  select(GROUP, `Male Population`, `Female Population`) %>%
  rename(age = GROUP,
         male = `Male Population`,
         female = `Female Population`) %>%
  filter(!age %in% c("TOTAL", "100+")) %>% # drop the total row and the open-ended age group
  mutate(age = as.numeric(age)) %>%
  mutate(age_group = cut(age, breaks = seq(0, 100, 5), right = FALSE)) %>%
  pivot_longer(cols = c("male", "female"), 
               names_to = "gender", 
               values_to = "pop") %>%
  ggplot() +
  geom_col(aes(y = age_group, 
               x = ifelse(gender == "male", pop, -pop),
               fill = gender)) +
  scale_x_continuous(labels = abs) +
  xlab("Population") +
  ylab("Age")
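When a filter has to exclude several values at once, it helps to remember that != compares element by element and recycles the shorter vector, while %in% tests membership in the whole set. The base R sketch below uses made-up age labels, ordered so that the difference becomes visible.

```r
# Hypothetical age labels, ordered so that the problem shows up
age = c("0", "TOTAL", "100+", "1")

# != recycles the right-hand side: "0" vs "TOTAL", "TOTAL" vs "100+",
# "100+" vs "TOTAL", "1" vs "100+". No element is flagged for removal.
keep_recycled = age != c("TOTAL", "100+")

# %in% checks each label against the whole set, regardless of position.
keep_in = !(age %in% c("TOTAL", "100+"))

age[keep_recycled]  # all four labels survive
age[keep_in]        # only "0" and "1" remain
```

With != the result depends on where the unwanted rows happen to sit, so %in% is the safer choice inside filter().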

2 Pie/Donut Chart and Their Limitations

We briefly discuss the pie chart, which is used to visualize the composition of a qualitative variable. There are two forms of pie chart: the typical filled circle and a colored ring (the donut chart). The pie chart uses the angle or the arc length to represent the proportion of each category or unique value. Since these angles or arcs often have different orientations, comparing categories is often difficult. This is why we do not recommend the pie chart. Instead, we should use the barplot, which is more efficient.

Let us take a look at an example. Suppose we would like to compare the number of universities in five states in the Midwest: OH, MI, IN, IL, and WI. We generate the pie chart, the donut chart, and the barplot for comparison. As we can see, other than OH, which has the most universities, it is hard to compare the states because their angles and ring segments are almost the same. In the barplot, it is apparent that WI has the lowest number, while IL is higher than IN and IN is higher than MI. This insight cannot be easily obtained from the pie chart or the donut chart, which is why we should avoid using them. There are some remedies for these charts, such as adding the percentages next to the slices. But a visualization is meant to be self-explanatory, efficient, and faithful, and relying on added text conflicts with these goals.

g1 = ggplot(filter(college, state %in% c("OH", "MI", "IN", "IL", "WI")), 
            aes(x = 1, fill = state)) + 
  geom_bar() + 
  scale_fill_discrete("State")+
  coord_polar(theta = "y") + 
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank(),
        axis.title = element_blank(),
        legend.position = "top")
g2=ggplot(filter(college, state %in% c("OH", "MI", "IN", "IL", "WI")), 
          aes(x = 1, fill = state)) + 
  geom_bar(width = 0.8) +
  scale_fill_discrete("State")+
  coord_polar(theta = "y") + 
  scale_x_continuous(limits=c(0,1.5)) + # x limits from 0 to 1.5; the empty range below the bar becomes the hole of the ring
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank(),
        axis.title = element_blank(),
        legend.position = "top")
g3=ggplot(filter(college, state %in% c("OH", "MI", "IN", "IL", "WI")), 
          aes(x = state, fill=state)) + 
  scale_fill_discrete("State")+
  geom_bar() + 
  theme(legend.position = "top")
grid.arrange(g1,g2,g3,ncol=3)

3 Colors

So far, we have been using the default colors in ggplot(). Color is an important element of visualization, and in many situations we need to customize it to improve a plot. In this section, we discuss how to customize colors.

3.1 Color Coding

Colors can be specified in three main ways in R: color names, RGB triplets, and hexadecimal strings.

First, we can refer to colors by name, e.g., “red”, “orange”, “yellow”, “wheat”, “salmon”, “cyan”, and “chocolate”. R has 657 built-in color names. To see the list, type colors(). These colors are shown here¹.

sample(colors(), 20) # colors() returns all 657 names; here we print a random sample of 20

##  [1] "magenta2"       "purple2"        "darkseagreen"   "indianred"     
##  [5] "turquoise2"     "papayawhip"     "lightgoldenrod" "darkgreen"     
##  [9] "grey27"         "darkorchid4"    "orangered3"     "lightskyblue"  
## [13] "moccasin"       "lemonchiffon3"  "steelblue"      "dodgerblue4"   
## [17] "grey93"         "darkturquoise"  "steelblue1"     "gray63"

Alternatively, we can refer to each color by its RGB triplet, e.g., (255, 165, 0), which gives the intensity of red, green, and blue in the color. Each of the three numbers ranges from 0 to 255; this is called the RGB color system². Therefore, there are in total 256*256*256 = 16,777,216 possible colors in the RGB color system. The rgb() function converts a triplet into a hexadecimal string. By default it expects values between 0 and 1; set maxColorValue = 255 to use the 0 to 255 scale.

rgb(255, 165, 0, maxColorValue = 255)

## [1] "#FFA500"

rgb(1, 0.5, 0)

## [1] "#FF8000"

rgb(1, 0.5, 0, alpha = 0.5) # alpha represents the transparency level.

## [1] "#FF800080"

Lastly, we can refer to colors by their hexadecimal strings, e.g., “#FF0000”, “#FFA500”, “#FFFF00”. R internally uses hexadecimal (or hex) to represent colors. Hexadecimal is a base-16 number system. Red, green, and blue are each represented by two characters (#rrggbb), and each character is one of 16 symbols: 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F. Two hex characters therefore cover the values 0 to 255. For example, white has the RGB triplet (255,255,255), which is #FFFFFF = 255*65536+255*256+255; gray has the triplet (128,128,128), which is #808080 = 128*65536+128*256+128; and brown is (165,42,42), or #A52A2A.
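The three representations can be converted into one another with base R functions. The short sketch below uses col2rgb(), rgb(), and strtoi(); the variable names are our own.

```r
# Name or hex string to RGB triplet
col2rgb("brown")      # red 165, green 42, blue 42
col2rgb("#FFA500")    # red 255, green 165, blue 0

# RGB triplet to hex string
hex_brown = rgb(165, 42, 42, maxColorValue = 255)
hex_brown

# Hex string (without the #) read as one base-16 number
n_orange = strtoi("FFA500", base = 16L)
n_orange == 255 * 65536 + 165 * 256 + 0
```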

Here is an example where we use all three ways to refer to colors in a scatter plot.

d = data.frame(x = c(1,2,3,4,5), y = c(1,2,3,4,5), class = c("AA", "AA", "BB", "CC", "DD"))
ggplot(d) + geom_point(aes(x,y,color=class),size=4)+
  scale_color_manual(breaks = c("AA", "BB", "CC", "DD"),
                     values = c(rgb(250, 250, 10, maxColorValue=255),
                                rgb(0.1, .2, 0.7, alpha=0.3),
                                "#A52A2A",
                                "salmon"))

3.2 Color Palette

Choosing colors manually is time consuming, especially when there are many categories. Instead, we can rely on existing color palettes in R, which store sequences of pre-specified colors suitable for representing continuous and discrete data. Two widely used sources of palettes are the R packages RColorBrewer and viridis, displayed as follows.

The RColorBrewer package contains the following palettes. As we can see, there are three types of palettes: sequential, qualitative, and diverging. The sequential palettes can be used for continuous variables. The qualitative palettes can be used for discrete variables. The diverging palettes can be used for continuous variables that take both positive and negative values, such as correlations. To use the RColorBrewer palettes, we use the following three functions depending on the type of variable:

scale_colour_distiller() is a continuous color scale.
scale_colour_brewer() is a discrete color scale.
scale_colour_fermenter() is a binned color scale.
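The palettes can also be inspected directly in R. The following sketch lists how many RColorBrewer palettes fall into each category and extracts the hex codes of two of them; the object names are our own.

```r
library(RColorBrewer)

# Number of palettes per category:
# div = diverging, qual = qualitative, seq = sequential
palette_counts = table(brewer.pal.info$category)
palette_counts

# Hex codes of 5 colors from the sequential palette "YlOrRd"
ylorrd5 = brewer.pal(n = 5, name = "YlOrRd")
ylorrd5

# Hex codes of 3 colors from the qualitative palette "Set1"
set1 = brewer.pal(n = 3, name = "Set1")
set1
```

Calling display.brewer.all() draws every palette in one figure, which is convenient when choosing one.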

Meanwhile, the viridis package contains far fewer palettes, four of which are shown below. Each of them can be applied to both continuous and discrete variables. To use the viridis palettes, we use the following three functions depending on the type of variable (each also has a scale_color_*() counterpart):

scale_fill_viridis_c() is a continuous color scale.
scale_fill_viridis_d() is a discrete color scale.
scale_fill_viridis_b() is a binned color scale.

3.3 Colors for Continuous Variables

For quantitative/continuous variables, we can use the following functions to customize colors.

Use the pre-specified color palettes of viridis and RColorBrewer.
- scale_color_viridis_c() for viridis palettes such as viridis and magma.
- scale_color_distiller() for RColorBrewer palettes such as YlOrRd and YlOrBr.
Create a new sequence of colors by interpolating existing colors.
- scale_color_gradient() for interpolating between two colors.
- scale_color_gradient2() for interpolating between three colors (low, mid, high).
- scale_color_gradientn() for interpolating between any number of colors.
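To make the idea of interpolation concrete, the sketch below builds palette functions with the scales package and evaluates them at a few positions between 0 and 1. This is only meant to illustrate the concept behind the gradient scales, not to reproduce their internals; the names two_color and many_color are our own.

```r
library(scales)

# Palette function that interpolates between two colors,
# the same idea as scale_color_gradient(low = "black", high = "green")
two_color = seq_gradient_pal(low = "black", high = "green")
two_color(c(0, 0.5, 1))   # 0 gives the low color, 1 the high color

# Palette function that interpolates through several colors,
# the same idea as scale_color_gradientn()
many_color = gradient_n_pal(c("black", "red", "pink", "blue", "green"))
five_steps = many_color(seq(0, 1, length.out = 5))
five_steps
```

Each call returns hex strings, so a continuous value is first rescaled to the range 0 to 1 and then mapped to a color along the gradient.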

Back to the college data example, suppose we would like to color the points by admission rate. We show examples of the viridis color palettes, the RColorBrewer color palettes, and color sequences created by interpolating colors.

library(RColorBrewer)
college = read.csv("data/college.csv", header = TRUE)
g = ggplot(college) + geom_point(aes(x=sat_avg, y=admission_rate, color = admission_rate),size=2) 
g1 = g + scale_color_continuous(name = "Adm Rt")
g2 = g + scale_color_viridis_c(name = "Adm Rt")
g3 = g + scale_color_viridis_c(name = "Adm Rt", option = "magma") # try option = "plasma" "inferno" or "cividis"
g4 = g + scale_color_distiller(name = "Adm Rt", palette = "Spectral")
g5 = g + scale_color_distiller(name = "Adm Rt", palette = "Greys")
g6 = g + scale_color_gradient(name = "Adm Rt", low = "black", high = "green")
g7 = g + scale_color_gradient2(name = "Adm Rt", low = "blue", mid = "white", high = "yellow", midpoint = 0.5)
g8 = g + scale_color_gradientn(name = "Adm Rt", colors = c("black", "red", "pink", "blue", "green"))
g9 = g + scale_color_gradientn(name = "Adm Rt", colors = colorspace::diverge_hcl(7))
grid.arrange(g1,g2,g3,g4,g5,g6,g7,g8,g9,ncol=3)

We usually use diverging colors to represent numerical values that can be both positive and negative, for example, correlations.

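library(tidyverse)
college = read.csv("data/college.csv", header = TRUE)
# Correlation matrix of the numeric variables, using only rows without missing values
cor_mat = college %>%
  select(admission_rate, sat_avg, undergrads, tuition, faculty_salary_avg, loan_default_rate, median_debt) %>%
  drop_na() %>%
  cor()
# Reshape to long format, one row per pair of variables
# (the column names variable1, variable2, and correlation are illustrative)
cor_long = cor_mat %>%
  as_tibble() %>%
  mutate(variable1 = rownames(cor_mat)) %>%
  pivot_longer(cols = admission_rate:median_debt, 
               names_to = "variable2",
               values_to = "correlation")
cor_long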
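A correlation matrix in long format can be drawn as a heatmap with a diverging scale centered at zero. The sketch below is one possible way to do this; it uses the built-in mtcars data as a stand-in so that it runs without data/college.csv, and the column names variable1, variable2, and correlation are our own choice.

```r
library(ggplot2)

# Correlation matrix of four columns of the built-in mtcars data
cor_mat = cor(mtcars[, c("mpg", "hp", "wt", "qsec")])

# Long format: one row per pair of variables
cor_long = as.data.frame(as.table(cor_mat))
names(cor_long) = c("variable1", "variable2", "correlation")

# Diverging scale centered at zero:
# blue for negative, white near zero, red for positive correlations
p_cor = ggplot(cor_long) +
  geom_tile(aes(x = variable1, y = variable2, fill = correlation)) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1))
p_cor
```

Fixing the limits at -1 and 1 keeps the white midpoint at zero correlation regardless of the values in the data.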

3.4 Colors for Discrete Variables

For qualitative/discrete variables, we can use the following functions to customize colors.

Use the pre-specified color palettes of viridis and RColorBrewer.
- scale_color_viridis_d() for viridis palettes such as viridis and magma.
- scale_color_brewer() for RColorBrewer palettes such as Set1 and Set2.
Create a set of colors and assign them to each level of the variable.
- scale_color_manual() for manually assigning colors to the levels of a discrete variable.

Back to the college data example, suppose we would like to use the color of the points to represent the city. We first show examples of the viridis and RColorBrewer color palettes. We then assign colors to the levels manually with scale_color_manual().

g <- college %>%
  filter(city %in% c("New York", "Los Angeles", "Cincinnati", "Chicago")) %>%
  ggplot() + 
  geom_point(aes(x=sat_avg, y=tuition, color = city), size=2)
g1 = g + scale_color_viridis_d()
g2 = g + scale_color_brewer(palette = "Set3")
g3 = g + scale_color_manual(breaks = c("New York", "Los Angeles", "Cincinnati", "Chicago"),
                              values = c("red", "blue", "yellow", "pink"))
grid.arrange(g1,g2,g3,ncol=3)
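As an alternative to the breaks and values pair, scale_color_manual() also accepts a named vector, which ties each color to a level by name. The sketch below uses a small made-up data frame and reads back the colors that ggplot2 assigns; the object names are our own.

```r
library(ggplot2)

# Toy data (hypothetical values), one point per city
toy = data.frame(
  x = 1:3,
  y = c(2, 1, 3),
  city = c("Chicago", "Cincinnati", "New York")
)

# A named vector ties each color to a level by name, not by position
city_colors = c("New York" = "red", "Cincinnati" = "gold", "Chicago" = "steelblue")

p_city = ggplot(toy) +
  geom_point(aes(x = x, y = y, color = city), size = 4) +
  scale_color_manual(values = city_colors)

# Colors actually assigned to the three rows of toy
used_colors = unname(layer_data(p_city)$colour)
used_colors
```

With a named vector, the order in which the colors are listed no longer matters, which makes the mapping harder to get wrong when levels are added or reordered.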

Note that we can set colors in a similar way for other types of visualization. Here are some examples.

g1=ggplot(filter(college, state %in% c("OH", "MI", "IN", "IL", "WI")), 
          aes(x = state, fill=state)) + 
  geom_bar() + 
  scale_fill_viridis_d(option = "magma")
data("faithfuld")
erupt <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
  geom_raster() + scale_x_continuous(NULL, expand = c(0, 0)) + scale_y_continuous(NULL, expand = c(0, 0)) + 
  theme(legend.position = "none")
g2 = erupt + scale_fill_viridis_c(option = "magma")
grid.arrange(g1,g2,ncol=2)

Note that, by default, scale_color_continuous() is equivalent to scale_color_gradient(), and scale_color_discrete() is equivalent to scale_color_hue().

3.5 Other Resources

In ggplot2, there are many functions to adjust the colors. We mostly focus on the following functions.

scale_color_brewer(): for qualitative variable mapped to color, use R package RColorBrewer’s palette.
scale_color_distiller(): for quantitative variable mapped to color, use R package RColorBrewer’s palette.
scale_color_viridis_d(): for qualitative variable mapped to color, use viridis color palette.
scale_color_viridis_c(): for quantitative variable mapped to color, use viridis color palette.
scale_color_gradient(): for quantitative variable mapped to color, interpolate between two colors to get a palette (low-high).
scale_color_manual(): for qualitative variable mapped to color, manually specify the color for each level.

The first two use RColorBrewer color palettes. The next two use viridis color palettes. The last two, scale_color_gradient() and scale_color_manual(), are the most flexible functions for continuous and discrete variables, respectively. Note that each of these six functions has a counterpart for fill, such as scale_fill_brewer().

For the aesthetic dimension of color, a complete list of scale functions is given below. For the aesthetic dimension of fill, a similar set of functions is obtained by replacing scale_color_*() with scale_fill_*(). They behave in the same way.

scale_color_brewer(): for qualitative variable mapped to color, use R package RColorBrewer’s palette.
scale_color_distiller(): for quantitative variable mapped to color, use R package RColorBrewer’s palette.
scale_color_fermenter(): for binned variable mapped to color, use R package RColorBrewer’s palette.
scale_color_continuous(): default to scale_color_gradient().
scale_color_binned(): default to scale_color_steps().
scale_color_discrete(): default to scale_color_hue()/scale_color_brewer().
scale_color_gradient(): for quantitative variable mapped to color, interpolate between two colors to get a palette (low-high).
scale_color_gradient2(): for quantitative variable mapped to color, interpolate between three colors to get a palette (low-mid-high).
scale_color_gradientn(): for quantitative variable mapped to color, interpolate between n colors to get a palette.
scale_color_grey(): for qualitative variable mapped to color, use a sequence of greys from dark to light.
scale_color_hue(): for qualitative variable mapped to color, not colour-blind safe palette.
scale_color_identity(): for qualitative variable mapped to color, this variable has to already contain color as values.
scale_color_manual(): for qualitative variable mapped to color, manually specify the color for each level.
scale_color_steps(): for binned variable mapped to color, interpolate between two colors to get a palette (low-high).
scale_color_steps2(): for binned variable mapped to color, interpolate between three colors to get a palette (low-mid-high).
scale_color_stepsn(): for binned variable mapped to color, interpolate between n colors to get a palette.
scale_color_viridis_d(): for qualitative variable mapped to color, use viridis color palette.
scale_color_viridis_c(): for quantitative variable mapped to color, use viridis color palette.
scale_color_viridis_b(): for binned variable mapped to color, use viridis color palette.

Color is an incredibly complex topic, and we have only scratched the surface. Some additional resources on colors are photopea for color coding³, Adobe color wheel⁴, color palette⁵, colorbrewer⁶, colororacle⁷, and simulator for colorblind⁸.

4 Layout, Margin, and Save Visualization.

To enhance the visualization, we can display multiple figures side by side or in a grid for better comparison. In order to set up the layout, we use the R packages grid and gridExtra.

The grid package provides a low-level graphics system to access the graphics facilities in R. The gridExtra package provides a number of user-level functions that work with the grid package to arrange multiple figures on a page. More specifically, we use the grid.arrange() function to set up the layout. Here is an example.

g1 = ggplot(college, aes(sat_avg, tuition)) +
  geom_point(size=0.1)
g2 = ggplot(college, aes(faculty_salary_avg, tuition))+
  geom_point(size=0.1)
g3 = ggplot(college, aes(loan_default_rate, tuition))+
  geom_point(size=0.1)
g4 = ggplot(college, aes(undergrads, tuition))+
  geom_point(size=0.1)
g5 = ggplot(college, aes(admission_rate, tuition))+
  geom_point(size=0.1)
g6 = ggplot(college, aes(median_debt, tuition))+
  geom_point(size=0.1)
library(grid)
library(gridExtra)
plots<-list(g1,g2,g3,g4,g5,g6)#put 6 plots in one list
vp <- viewport(width =0.6, height =1) #create a viewport whose width is 0.6 and height is 1 
grid.arrange(grobs = plots, ncol = 2,vp=vp)

## Warning: Removed 2 rows containing missing values (`geom_point()`).

We can display different types of visualization together too. Here we draw a histogram, a stacked density plot, and a boxplot. To use a more complex layout with customized widths and heights for the panels, we use the following code.

p1 = ggplot(college, aes(x=tuition)) + geom_histogram() +
  labs(title="Histogram of tuition")
p2 = ggplot(college, aes(x=tuition, y=..density.., fill=control)) + 
  geom_density(position="stack")+
  labs(title="PDF of tuition",x="Tuition")
p3 = ggplot(college, aes(y=tuition, x=control,fill=control))+
  geom_boxplot(outlier.size=.3)+
  labs(title="Boxplot of tuition")
lay1 = rbind(c(1, 1),
            c(2, 3)) #matrix of layout
library(knitr)
knitr::kable(lay1)

1	1
2	3

plots1=list(p1,p2,p3)
grid.arrange(grobs = plots1,
             layout_matrix = lay1, # matrix of layout is lay1
             widths = c(2, 1),heights = c(0.5,1), 
             # widths 2:1, heights 1:2 
             # try delete this row and see what happens
             top="Distribution of tuition")

##

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## i Please use `after_stat(density)` instead.

To adjust the margins of the plot, we specify the arguments in the theme layer.

ggplot(college,aes(x=sat_avg, y=tuition))+
  geom_point()+ 
  theme(plot.margin = unit(c(3,3,3,3), "cm")) # all 4 margins of the plot are 3 cm (order: top, right, bottom, left)
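When the four margins differ, naming the sides can be easier to read than a plain vector. The sketch below uses the margin() helper from ggplot2 on a small made-up data frame; the margin values are arbitrary.

```r
library(ggplot2)

toy = data.frame(x = 1:10, y = (1:10)^2)   # hypothetical data

# margin() takes the four sides by name:
# t = top, r = right, b = bottom, l = left
p_margin = ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  theme(plot.margin = margin(t = 1, r = 3, b = 1, l = 3, unit = "cm"))
p_margin
```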

Finally, to save the plot, we use the ggsave and arrangeGrob functions.

m<-arrangeGrob(grobs = plots1, layout_matrix = lay1,
               widths = c(2, 1),  heights = c(1, 0.5),
               top="Distribution of Tuition")
ggsave(file="my_fig.png", m, width = 25, height = 20)
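Note that ggsave() reads width and height in inches unless told otherwise. The sketch below, which uses a made-up plot and a temporary file of our own choosing, shows how to state the unit explicitly; the file extension determines the output format.

```r
library(ggplot2)

toy = data.frame(x = 1:10, y = (1:10)^2)   # hypothetical data
p = ggplot(toy, aes(x = x, y = y)) + geom_point()

# Temporary file, so the example does not write into the working directory
out_file = file.path(tempdir(), "my_fig.pdf")

# width and height are read in the given units; the default unit is inches
ggsave(filename = out_file, plot = p, width = 25, height = 20, units = "cm")

file.exists(out_file)
```

For raster formats such as PNG, the dpi argument additionally controls the resolution of the saved image.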

Ch4P2 Visualization Basics Part II

Descriptive Analytics and Data Visualization

Yichen Qin (qinyn@ucmail.uc.edu), University of Cincinnati

2023-02-15