Descriptive analytics, predictive analytics, and prescriptive analytics use data to help analysts make better decisions. Therefore, all analytics projects are essentially data analysis projects. Following Wickham (2017), we would like to start by introducing the data analysis flowchart as follows.
This flowchart gives us an overview of the key components of data analysis. First, a data analysis project always starts with data collection and data import. As a data scientist, collecting data and organizing it in a format that can be efficiently stored and extracted is the foundation of effective data analysis. This step is often accompanied by data management and database queries. The data collection step depends heavily on the context and is not our focus. Instead, we mostly focus on data import using R, which comes with useful functions and packages. The most frequently used data format is comma-separated values (csv), which can be imported into R using `read.csv()`.
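As a minimal, self-contained sketch of `read.csv()`, we can write a small csv file to a temporary location and read it back; the file contents here are made up for illustration.

```r
# Create a tiny csv file for illustration (the data are hypothetical).
tmp <- tempfile(fileext = ".csv")
writeLines(c("city,visitors", "Paris,120", "Lyon,80"), tmp)

# Import it; header = TRUE is the default.
visits <- read.csv(tmp)
str(visits)    # inspect the structure: two observations of two variables
```

In practice, the argument to `read.csv()` is simply the path to your data file.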
Once the data is imported, we analyze it through an iterative process in which we switch among data wrangling, data visualization, and data modeling. Data wrangling cleans, reorganizes, and transforms the data into a structure that is easy to visualize and model. For example, data wrangling may create new variables based on existing ones or keep only the observations that satisfy certain requirements, so that our analysis can be targeted.
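The two wrangling steps just mentioned can be sketched in base R on the built-in `mtcars` data; the derived variable `kpl` (kilometers per liter) is our own illustrative choice.

```r
cars <- mtcars
cars$kpl <- cars$mpg * 0.425          # create a new variable from an existing one
four_cyl <- subset(cars, cyl == 4)    # keep only the targeted observations
nrow(four_cyl)                        # 11 four-cylinder cars remain
```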
After the data is “wrangled”, we can visualize it using various tools such as scatterplots and barplots. Visualization helps us better understand the data, such as the strength of the signals it contains. Based on the visualization, we can select appropriate models for the next step. Alternatively, the visualization may suggest a different data transformation, in which case we go back to the data wrangling step. The loop between data wrangling and data visualization allows us to examine the data more closely and understand its limitations.
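As a quick sketch of the two plot types mentioned above, base R can produce both in one line each; we again use the built-in `mtcars` data.

```r
# Scatterplot: relationship between car weight and fuel efficiency.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Barplot: how many cars have 4, 6, or 8 cylinders.
barplot(table(mtcars$cyl),
        xlab = "Number of cylinders", ylab = "Count")
```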
Another critical component is data modeling. This is where analytics models come in and help us with hypothesis testing, prediction, diagnosis, and much more. We do not discuss this modeling component extensively, as our focus is on data visualization and data wrangling.
The driving force behind the interaction among data wrangling, data visualization, and data modeling is the analytics goal, that is, the motivating questions or research questions we have about the data. These questions can be formed in different ways and at different stages. For example, the motivating questions may be formed before we conduct the experiments and collect the data. Then we try to answer these questions using data wrangling/visualization/modeling and may even revise the questions because of the data quality. On the other hand, we may be given a data set and asked to analyze it and report any interesting patterns and observations, which could lead to actionable decisions. In this case, we enter the data analysis with no specific questions in mind but gradually develop a sequence of questions through the iterative process. Either way, the questions propel the data analysis and help us decide the next step.
Once we finish the iterative process and are satisfied with what we find, we are ready to move to the final stage of data analysis, which is reporting. In this stage, we communicate our results to the audience in a structured way. Storytelling is the most important part of this stage. The results and conclusions from the iterative process are often fragmented and lack a theme or structure. In the reporting stage, we organize our findings organically and hierarchically so that the audience can understand and appreciate the impact of the findings, the context of the results, and the suggestions for next steps. Data visualization is also used in this stage.
Descriptive analytics, or exploratory data analysis (EDA), consists of everything in the figure with less emphasis on modeling and more emphasis on visualization, while predictive and prescriptive analytics focus more on modeling.
Given a data set, the first thing to do is to explore the data and see whether it is suitable for any modeling/analysis. This is called exploratory data analysis. This step should be done before any formal modeling starts. In fact, EDA is one of the most important parts of any data analysis project because it helps the data analyst understand the data and select the right model.
EDA is often an iterative process: you propose a few questions or ideas that you would like to understand, and then analyze the data to answer these questions or implement these ideas. Some questions can be sufficiently answered using the data; some cannot. Next, you form more questions and explore the data accordingly. After this iterative process, you find a few promising questions that can be answered by the data, which leads you to formal modeling.
This is similar to mining: before you take out the drill and do the work, you would like to survey all possible sites and decide which areas could be more productive. EDA can be considered the step that explores all possible sites and identifies the more promising ones. In other words, EDA tells the analyst what kind of analysis/modeling the data at hand can and cannot afford.
As we can see, visualization plays an important role in data analysis. Why is visualization so important and effective? To answer this, let us start with the example of Anscombe’s quartet. Suppose we have four data sets of two variables, x and y, and we are asked to explore the data and report interesting patterns and observations, such as modeling choices. Suppose we have imported and wrangled the data and present them as follows.
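Conveniently, Anscombe’s quartet ships with R as the built-in data frame `anscombe`, whose columns `x1`–`x4` and `y1`–`y4` hold the four (x, y) data sets.

```r
# The built-in anscombe data frame: 11 observations, 8 columns.
head(anscombe)
dim(anscombe)
```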
For these data sets, we can compute the summary and descriptive statistics, such as mean, variance, and correlation. Here is a list of common statistics.
However, as we can see below, some of these key statistics are identical across all four data sets. We simply cannot distinguish them using statistics.
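Computing these statistics on the built-in `anscombe` data confirms the claim: the means, variances, and x–y correlations agree across the four sets.

```r
sapply(anscombe[, 1:4], mean)   # mean of x: 9 in every set
sapply(anscombe[, 5:8], mean)   # mean of y: about 7.50 in every set
sapply(anscombe[, 1:4], var)    # variance of x: 11 in every set
mapply(cor, anscombe[, 1:4], anscombe[, 5:8])   # correlation: roughly 0.816 in every set
```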
This is because statistics are high-level summaries of the data that may overlook its intricate details. Alternatively, if we visualize the data, we immediately see the differences.
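Plotting the four sets side by side with base R graphics immediately reveals their different structures, even though their summary statistics are indistinguishable.

```r
# Draw the four scatterplots in a 2x2 grid.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Data set", i))
}
par(op)   # restore the previous plotting layout
```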
It seems that the first data set is suitable for linear regression analysis. The second data set is suitable for polynomial or nonparametric regression. The third data set clearly has an outlier; once it is removed, a linear model seems plausible. The last data set clearly shows an imbalance in the values of x and requires a different set of models. These insights would not be possible without data visualization, and they directly inform our modeling choices.
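A related modeling trap is worth sketching: fitting a simple linear regression to each set yields (nearly) identical coefficients, roughly y = 3 + 0.5x, even though a straight line is only appropriate for the first set.

```r
# Fit y_i ~ x_i for each of the four sets.
fits <- lapply(1:4, function(i) {
  lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
})
t(sapply(fits, coef))   # each row is approximately (3.0, 0.5)
```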
Data visualization helps us understand information faster. In fact, our brain processes visual information faster than descriptive text or numbers, so images are a more effective channel for conveying information than audio or text.
“Numerical quantities focus on expected values, graphical summaries on unexpected values.” — John Tukey
A picture is worth a thousand words
Therefore, the key is to find the easiest form of representation of the data so that our brain can quickly understand the data. Good visualization simplifies complex data for our brain, whereas bad visualization complicates the data and confuses our brain.
In this book, we mostly focus on data visualization. However, visualization is much more than that. Visualization in general can be divided into four quadrants as follows.
Note: declarative sometimes is replaced with explanatory.
Data visualization is the graphical representation of data and information using charts, graphs, maps, and so on. It is used in various places in descriptive analytics. Converting data from tons of numbers into figures and patterns makes it much easier for humans to digest. The human brain is much more efficient at processing visual information, such as shapes and colors, than verbal information, such as text and numbers. A picture is worth a thousand words.
According to the purpose of the visualization, it can be exploratory or explanatory/declarative. According to the information being visualized, it can be data-driven or conceptual. We mainly focus on data-driven visualization, in particular exploratory data visualization and explanatory data visualization. However, it is important to know that visualization is equally effective in the other quadrants.
Exploratory data visualization: The purpose of exploratory data visualization is to quickly get an idea of the data, such as patterns, trends, and anomalies, so that we can apply more complex tools such as linear regression and machine learning techniques. Exploratory data visualization often requires generating a figure that presents the data faithfully; the visualization does not have to be pretty, but it should always be accurate. See the zoo attendance example in Table 1.1 and Figure 1.1 on p. 5 of Camm et al. (2022).
Explanatory data visualization (or storytelling): Once the data analysis is complete, we communicate our findings to the audience using explanatory data visualization. Explanatory data visualization presents the data in a way that communicates with the audience, which requires careful design; it should be easy for the audience to digest. See the job seeker survey example in Table 1.2 on p. 8 of Camm et al. (2022).
Here we go over a few successful data visualization examples.
Soccer game USA vs Belgium in World Cup 2014: This is a visualization of the soccer match between USA and Belgium in World Cup 2014, in which Tim Howard was the goalkeeper for team USA. It is more like exploratory data visualization, where you do not know what you are about to find and would like to explore the data. This visualization was created by The New York Times. See here and here for details.
National parks visitors: See here for details. This is more like explanatory data visualization, where your main goal is to present the results to the audience and deliver your message, i.e., the patterns in visitor counts over the years.
Train schedule between Paris and Lyon: This is a visualization of the train schedule between Paris and Lyon. There is also a modern take on this kind of schedule visualization: the Boston MBTA subway system.
Life expectancy vs fertility rate over time: See the original YouTube video here and here.
Are there any limitations to data visualization? Data visualization is effective because human vision is highly sensitive to visual patterns. However, depending on the nature of the data, visualization may not be the most efficient way to present it. For example, in this video, the visualization presents the keystrokes of Pachelbel’s Canon. The visualization pales in comparison to listening to the music: even though it contains the same information as the music itself, looking at it does not help identify the patterns in the music.
Therefore, different data require different tools to understand. Luckily, many data sets can be visualized efficiently and vividly so that we understand the story behind the data, but there are certainly exceptions.
Data visualization examples
Data journalism
People in data visualization
Collections of data visualization and general discussion
Data visualization courses
There is extensive literature on data visualization. Camm et al. (2022) offers a comprehensive introduction to data visualization, covering important visualization principles as well as important visualization types. The same authors also provide another book, a general introduction to business analytics that focuses more on modeling (Camm et al. (2021)). Both books primarily use Excel for its easy access. For visualization in R, Wickham (2016) provides a detailed introduction to the R data visualization package `ggplot2`. Focusing more on practical data analysis, Healy (2019) offers an extensive introduction to the principles and best practices of visualization, with extensive examples in R. Furthermore, Wilke (2019) explains many detailed principles for visualizing various types of data; these principles are applicable regardless of the programming language or visualization tool. Kabacoff (2018) also offers an introduction to data visualization in R and provides much sample code. Chang (2018) is a detailed cookbook for R graphics and can be used as a reference book. From a broader perspective, Peng (2016) discusses exploratory data analysis using R; it includes various data visualization topics and also covers descriptive analytics topics such as clustering and dimension reduction. From an even broader perspective, Irizarry (2019) provides an extensive introduction to the entire subject of data science, covering a wide range of topics including data visualization, regression, inference, machine learning techniques, and much more. In addition, Wickham (2017) provides the R foundation for data science and includes a section on data visualization. Lastly, we use R Markdown, a very useful tool, to write the chapters; the details of R Markdown are explained in Xie (2015) and Xie, Allaire, and Grolemund (2018).
There are many other excellent data visualization books and articles. Their focus is not necessarily on R, but more on visualization principles and case studies; see, for example, Yau (2013), Tufte (2001), Tufte (1990), Tufte (1997), Gelman, Pasarica, and Dodhia (2002), Robbins (2013), Cleveland (1993), Cleveland (1994), Knaflic (2015), Cairo (2012), and Cairo (2019).