Descriptive analytics, predictive analytics, and prescriptive analytics use data to help analysts make better decisions. Therefore, all analytics projects are essentially data analysis projects. Following Wickham (2017), we would like to start by introducing the data analysis flowchart as follows.
This flowchart gives us an overview of the key components of data analysis. First, a data analysis project always starts with data collection and data import. As a data scientist, collecting data and organizing it in a format that can be efficiently stored and extracted is the foundation of effective data analysis. This step is often accompanied by data management and database queries. The data collection step depends heavily on the context and is not our focus. Instead, we mostly focus on data import using R, which comes with useful functions and packages. The most frequently used data format is comma-separated values (csv), which can be imported into R using `read.csv()`.
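As a minimal, self-contained sketch of `read.csv()`, we can write a small csv file to a temporary location and read it back; the file contents here are made up for illustration.

```r
# Create a tiny csv file for illustration (the data are hypothetical).
tmp <- tempfile(fileext = ".csv")
writeLines(c("city,visitors", "Paris,120", "Lyon,80"), tmp)

# Import it; header = TRUE is the default.
visits <- read.csv(tmp)
str(visits)    # inspect the structure: two observations of two variables
```

In practice, the argument to `read.csv()` is simply the path to your data file.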
Once the data is imported, we analyze it through an iterative process in which we switch among data wrangling, data visualization, and data modeling. Data wrangling cleans, reorganizes, and transforms the data into a structure that is easy to visualize and model. For example, data wrangling may create new variables based on existing ones or keep only the observations that satisfy certain requirements, so that our analysis can be targeted.
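The two wrangling steps just mentioned can be sketched in base R on the built-in `mtcars` data; the derived variable `kpl` (kilometers per liter) is our own illustrative choice.

```r
cars <- mtcars
cars$kpl <- cars$mpg * 0.425          # create a new variable from an existing one
four_cyl <- subset(cars, cyl == 4)    # keep only the targeted observations
nrow(four_cyl)                        # 11 four-cylinder cars remain
```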
After the data is “wrangled”, we can visualize it using various tools such as scatterplots and barplots. Visualization helps us better understand the data, such as the strength of the signals it contains. Based on the visualization, we can select appropriate models for the next step. Alternatively, the visualization may suggest a different data transformation, in which case we go back to the data wrangling step. The loop between data wrangling and data visualization allows us to examine the data more closely and understand its limitations.
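As a quick sketch of the two plot types mentioned above, base R can produce both in one line each; we again use the built-in `mtcars` data.

```r
# Scatterplot: relationship between car weight and fuel efficiency.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Barplot: how many cars have 4, 6, or 8 cylinders.
barplot(table(mtcars$cyl),
        xlab = "Number of cylinders", ylab = "Count")
```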
Another critical component is data modeling. This is where analytics models come in and help us with hypothesis testing, prediction, diagnosis, and much more. We do not discuss this modeling component extensively, as our focus is on data visualization and data wrangling.
The driving force behind the interaction among data wrangling, data visualization, and data modeling is the analytics goal, that is, the motivating questions or research questions we have about the data. These questions can be formed in different ways and at different stages. For example, the motivating questions may be formed before we conduct the experiments and collect the data. Then we try to answer these questions using data wrangling/visualization/modeling and may even revise the questions because of the data quality. On the other hand, we may be given a data set and asked to analyze it and report any interesting patterns and observations, which could lead to actionable decisions. In this case, we enter the data analysis with no specific questions in mind but gradually develop a sequence of questions through the iterative process. Either way, the questions propel the data analysis and help us decide the next step.
Once we finish the iterative process and are satisfied with what we find, we are ready to move to the final stage of data analysis, which is reporting. In this stage, we communicate our results to the audience in a structured way. Storytelling is the most important part of this stage. The results and conclusions from the iterative process are often fragmented and lack a theme or structure. In the reporting stage, we organize our findings organically and hierarchically so that the audience can understand and appreciate the impact of the findings, the context of the results, and the suggestions for next steps. Data visualization is also used in this stage.
Descriptive analytics, or exploratory data analysis (EDA), consists of everything in the figure with less emphasis on modeling and more emphasis on visualization, while predictive and prescriptive analytics focus more on modeling.
Given a data set, the first thing to do is to explore the data and see whether it is suitable for any modeling/analysis. This is called exploratory data analysis. This step should be done before any formal modeling starts. In fact, EDA is one of the most important parts of any data analysis project because it helps the data analyst understand the data and select the right model.
EDA is often an iterative process: you propose a few questions or ideas that you would like to understand, and then analyze the data to answer these questions or implement these ideas. Some questions can be sufficiently answered using the data; some cannot. Next, you form more questions and explore the data accordingly. After this iterative process, you find a few promising questions that can be answered by the data, which leads you to formal modeling.
This is similar to mining: before you take out the drill and do the work, you would like to survey all possible sites and decide which areas could be more productive. EDA can be considered the step that explores all possible sites and identifies the more promising ones. In other words, EDA tells the analyst what kind of analysis/modeling the data at hand can and cannot afford.
As we can see, visualization plays an important role in data analysis. Why is visualization so important and effective? To answer this, let us start with the example of Anscombe’s quartet. Suppose we have four data sets of two variables, x and y, and we are asked to explore the data and report interesting patterns and observations, such as modeling choices. Suppose we have imported and wrangled the data and present them as follows.
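Conveniently, Anscombe’s quartet ships with R as the built-in data frame `anscombe`, whose columns `x1`–`x4` and `y1`–`y4` hold the four (x, y) data sets.

```r
# The built-in anscombe data frame: 11 observations, 8 columns.
head(anscombe)
dim(anscombe)
```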
For these data sets, we can compute the summary and descriptive statistics, such as mean, variance, and correlation. Here is a list of common statistics.
However, as we can see below, some of these key statistics are identical across all four data sets. We simply cannot distinguish them using statistics.
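Computing these statistics on the built-in `anscombe` data confirms the claim: the means, variances, and x–y correlations agree across the four sets.

```r
sapply(anscombe[, 1:4], mean)   # mean of x: 9 in every set
sapply(anscombe[, 5:8], mean)   # mean of y: about 7.50 in every set
sapply(anscombe[, 1:4], var)    # variance of x: 11 in every set
mapply(cor, anscombe[, 1:4], anscombe[, 5:8])   # correlation: roughly 0.816 in every set
```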
This is because statistics are high-level summaries of the data that may overlook its intricate details. Alternatively, if we visualize the data, we immediately see the differences.
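Plotting the four sets side by side with base R graphics immediately reveals their different structures, even though their summary statistics are indistinguishable.

```r
# Draw the four scatterplots in a 2x2 grid.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Data set", i))
}
par(op)   # restore the previous plotting layout
```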
It seems that the first data set is suitable for linear regression analysis. The second data set is suitable for polynomial or nonparametric regression. The third data set clearly has an outlier; once it is removed, a linear model seems plausible. The last data set clearly shows an imbalance in the values of x and requires a different set of models. These insights would not be possible without data visualization, and they directly inform our modeling choices.
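A related modeling trap is worth sketching: fitting a simple linear regression to each set yields (nearly) identical coefficients, roughly y = 3 + 0.5x, even though a straight line is only appropriate for the first set.

```r
# Fit y_i ~ x_i for each of the four sets.
fits <- lapply(1:4, function(i) {
  lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
})
t(sapply(fits, coef))   # each row is approximately (3.0, 0.5)
```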
Data visualization helps us understand information faster. In fact, our brain processes visual information faster than descriptive text or numbers, so images are a more effective channel for conveying information than audio or text.
“Numerical quantities focus on expected values, graphical summaries on unexpected values.” — John Tukey
A picture is worth a thousand words
Therefore, the key is to find the easiest form of representation of the data so that our brain can quickly understand the data. Good visualization simplifies complex data for our brain, whereas bad visualization complicates the data and confuses our brain.
In this book, we mostly focus on data visualization. However, visualization is much more than that. Visualization in general can be divided into four quadrants as follows.
Note: declarative sometimes is replaced with explanatory.
Data visualization is the graphical representation of data and information using charts, graphs, maps, and so on. It is used in various places in descriptive analytics. Converting data from tons of numbers into figures and patterns makes it much easier for humans to digest. The human brain is much more efficient at processing visual information, such as shapes and colors, than verbal information, such as text and numbers. A picture is worth a thousand words.
According to the purpose of the visualization, it can be exploratory or explanatory/declarative. According to the information being visualized, it can be data-driven or conceptual. We mainly focus on data-driven visualization, in particular exploratory data visualization and explanatory data visualization. However, it is important to know that visualization is equally effective in the other quadrants.
Exploratory data visualization: The purpose of exploratory data visualization is to quickly get an idea of the data, such as patterns, trends, and anomalies, so that we can apply more complex tools such as linear regression and machine learning techniques. Exploratory data visualization often requires generating a figure that presents the data faithfully; the visualization does not have to be pretty, but it should always be accurate. See the zoo attendance example in Table 1.1 and Figure 1.1 on p. 5 of Camm et al. (2022).
Explanatory data visualization (or storytelling): Once the data analysis is complete, we communicate our findings to the audience using explanatory data visualization. Explanatory data visualization presents the data in a way that communicates with the audience, which requires careful design; it should be easy for the audience to digest. See the job seeker survey example in Table 1.2 on p. 8 of Camm et al. (2022).
Here we go over a few successful data visualization examples.
Soccer game USA vs Belgium in World Cup 2014: This is a visualization of the soccer match between USA and Belgium in World Cup 2014, in which Tim Howard was the goalkeeper for team USA. It is more like exploratory data visualization, where you do not know what you are about to find and would like to explore the data. This visualization was created by The New York Times. See here and here for details.
National parks visitors: See here for details. This is more like explanatory data visualization, where your main goal is to present the results to the audience and deliver your message, i.e., the patterns in visitor counts over the years.
Train schedule between Paris and Lyon: This is a visualization of the train schedule between Paris and Lyon. There is also a modern take on this kind of schedule visualization: the Boston MBTA subway system.
Life expectancy vs fertility rate over time: See the original YouTube video here and here.
Are there any limitations to data visualization? Data visualization is effective because human vision is highly sensitive to visual patterns. However, depending on the nature of the data, visualization may not be the most efficient way to present it. For example, in this video, the visualization presents the keystrokes of Pachelbel’s Canon. The visualization pales in comparison to listening to the music: even though it contains the same information as the music itself, looking at it does not help identify the patterns in the music.
Therefore, different data require different tools to understand. Luckily, many data sets can be visualized efficiently and vividly so that we understand the story behind the data, but there are certainly exceptions.
Data visualization examples
Data journalism
People in data visualization
Collections of data visualization and general discussion
Data visualization courses
There is extensive literature on data visualization. Camm et al. (2022) offers a comprehensive introduction to data visualization, covering important visualization principles as well as important visualization types. The same authors also provide another book, a general introduction to business analytics that focuses more on modeling (Camm et al. (2021)). Both books primarily use Excel for its easy access. For visualization in R, Wickham (2016) provides a detailed introduction to the R data visualization package `ggplot2`. Focusing more on practical data analysis, Healy (2019) offers an extensive introduction to the principles and best practices of visualization, with extensive examples in R. Furthermore, Wilke (2019) explains many detailed principles for visualizing various types of data; these principles are applicable regardless of the programming language or visualization tool. Kabacoff (2018) also offers an introduction to data visualization in R and provides much sample code. Chang (2018) is a detailed cookbook for R graphics and can be used as a reference book. From a broader perspective, Peng (2016) discusses exploratory data analysis using R; it includes various data visualization topics and also covers descriptive analytics topics such as clustering and dimension reduction. From an even broader perspective, Irizarry (2019) provides an extensive introduction to the entire subject of data science, covering a wide range of topics including data visualization, regression, inference, machine learning techniques, and much more. In addition, Wickham (2017) provides the R foundation for data science and includes a section on data visualization. Lastly, we use R Markdown, a very useful tool, to write the chapters; the details of R Markdown are explained in Xie (2015) and Xie, Allaire, and Grolemund (2018).
There are many other excellent data visualization books and articles. Their focus is not necessarily on R, but more on visualization principles and case studies; see, for example, Yau (2013), Tufte (2001), Tufte (1990), Tufte (1997), Gelman, Pasarica, and Dodhia (2002), Robbins (2013), Cleveland (1993), Cleveland (1994), Knaflic (2015), Cairo (2012), and Cairo (2019).