Visualization of Model Selection Uncertainty

Abstract:

In this paper, we introduce several graphical tools for visualizing the distribution of the selected model: the G-plot, the H-plot, the scatter plot, and the heatmap. To the best of our knowledge, this is the first attempt to visualize such a distribution.

Keywords:

distribution of the selected model

Our Purpose:

The model selected by a model selection procedure can be regarded as a random "point estimate" of the true model. It is therefore important to understand its random behavior through its distribution, i.e., the distribution of the selected model. As a first attempt, we introduce several graphical tools to visualize this distribution and to help understand model selection uncertainty. The proposed visualization is useful for graphically comparing different selection methods, giving analysts a good sense of the level of randomness that each method carries. We define the most frequently selected model as the mode model, denoted by \(m^\ast\): \[ m^\ast = \arg\max_{m \in \mathcal{M}} \mathbb{P}(\widehat{m}=m), \] where \(\widehat{m}\) is the selected model and \(\mathcal{M}\) is the set of all possible models.
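As a minimal sketch of the definition of the mode model, the probabilities \(\mathbb{P}(\widehat{m}=m)\) can be approximated by empirical frequencies over repeated runs of a selection procedure. How the repeated selections are obtained (for instance, by resampling the data) is an assumption here, and the names `selected`, `model_frequencies`, and `mode_model` are hypothetical.

```python
from collections import Counter

# Hypothetical output of a selection procedure run five times; each model is
# the set of indices of the variables it selected.
selected = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 3}),
    frozenset({1, 2}),
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3}),
]


def model_frequencies(models):
    """Empirical distribution of the selected model: model -> relative frequency."""
    counts = Counter(models)
    total = len(models)
    return {m: c / total for m, c in counts.items()}


def mode_model(freqs):
    """Most frequently selected model (ties broken by first occurrence)."""
    return max(freqs, key=freqs.get)


freqs = model_frequencies(selected)
m_star = mode_model(freqs)
print(sorted(m_star), freqs[m_star])
```

Representing each model as a `frozenset` makes it hashable, so it can be counted directly and compared with set operations later on.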

Our Main Results:

Naive Visualization of the Distribution of Selected Model

The distribution of the selected model is generally hard to visualize, because its support is the set of all possible models, and these models have complex relationships among themselves. We first present a naive visualization of such a distribution to show this difficulty. Here each circle represents one unique model. The vertical axis shows the model complexity. The models in the same row are arranged in descending order of model frequency. A line connects two models if the larger model \(m_2\) consists of the smaller model \(m_1\) plus one extra variable, i.e., \(m_2\supset m_1\) and \(| m_2 \setminus m_1|=1\).

This is a naive visualization of the distribution of the selected model.
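The layout rules of the naive visualization can be sketched as follows. This is only an illustration under assumed inputs: the frequencies in `example_freqs` are made up, and the helper names `covers` and `naive_layout` are hypothetical.

```python
def covers(m1, m2):
    """True if m2 contains m1 and has exactly one extra variable."""
    return m1 < m2 and len(m2 - m1) == 1


def naive_layout(freqs):
    """Return rows (complexity -> models by descending frequency) and edges."""
    rows = {}
    for m in freqs:
        rows.setdefault(len(m), []).append(m)
    for size in rows:
        # Descending frequency; the sorted variable list only breaks ties.
        rows[size].sort(key=lambda m: (-freqs[m], sorted(m)))
    edges = [(a, b) for a in freqs for b in freqs if covers(a, b)]
    return rows, edges


# Hypothetical model frequencies (they sum to 1).
example_freqs = {
    frozenset({1, 2}): 0.2,
    frozenset({1, 2, 3}): 0.5,
    frozenset({1, 2, 4}): 0.1,
    frozenset({1, 2, 3, 4}): 0.2,
}

rows, edges = naive_layout(example_freqs)
print(len(edges), [sorted(m) for m in rows[3]])
```

Even with four models there are four connecting lines, which hints at why the naive picture becomes unreadable as the number of models grows.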

The Distribution of the Selected Model by Groups (G-plot)

The model space is too large to display model by model, so we place models with similar structures into groups and visualize the group frequency, i.e., the sum of the model frequencies in the group. We call this visualization the G-plot. Grouping structurally similar models lets us visualize the distribution more efficiently and clearly, and reveals important patterns in the distribution that are not available through other types of analysis. In the figure, each model group is represented by a circle, and the group frequency is represented by the color intensity. The groups are placed according to their model complexities and their structural relationships; the vertical axis represents the model complexity.

This is an example of G-plot.

Group #1 contains only the mode model \(m^\ast\) and is placed at the xy-coordinates \((0, |m^\ast|)\), here \((0,5)\), as an "anchor". Other models are grouped according to their model complexities and their Hamming distances to the mode model. The Hamming distance between two models \(m_1\) and \(m_2\) is defined as \[ H(m_1 \| m_2)=| (m_1 \setminus m_2)\cup (m_2 \setminus m_1) |, \] which is the number of variables on which the two models differ, i.e., the cardinality of their symmetric difference. Models in the same group therefore have the same complexity and a similar structure. Conceptually, each group of models can be expressed as \(\{m: H(m^\ast \| m)=i,|m|=j \}\) for different \(i\) and \(j\), and the group is placed at the xy-coordinates \((i,j)\). The left histogram displays the frequency of model complexity, while the top histogram displays the frequency of Hamming distance.

Based on this criterion, group #2 consists of all the models that contain the mode model and have one extra variable. Group #2 is placed at the xy-coordinates \((1,6)\) because its complexity is 6 and its members all have a Hamming distance of 1 to the mode model. Similarly, group #3 consists of all sub-models of the mode model with one fewer variable. Group #4 and group #6 are defined in a similar way. Group #5 consists of all the models that miss one variable from the mode model but have one extra variable, so they have the same complexity as the mode model. Alternatively, every model in group #5 can be considered a sub-model of at least one model in group #2 or a super-model of at least one model in group #3. The rest of the groups are defined in the same fashion and numbered sequentially.
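The grouping rule \(\{m: H(m^\ast \| m)=i,|m|=j \}\) can be sketched in a few lines. The frequencies below are hypothetical, chosen so that the mode model has complexity 5 as in the text, and the function names are illustrative rather than part of any published implementation.

```python
from collections import defaultdict


def hamming(m1, m2):
    """Hamming distance: cardinality of the symmetric difference."""
    return len(m1 ^ m2)


def g_plot_groups(freqs):
    """Sum model frequencies per group keyed by (Hamming distance, complexity)."""
    m_star = max(freqs, key=freqs.get)  # mode model
    groups = defaultdict(float)
    for m, p in freqs.items():
        groups[(hamming(m_star, m), len(m))] += p
    return m_star, dict(groups)


# Hypothetical model frequencies (they sum to 1).
g_freqs = {
    frozenset({1, 2, 3, 4, 5}): 0.40,     # mode model, placed at (0, 5)
    frozenset({1, 2, 3, 4, 5, 6}): 0.15,  # one extra variable, (1, 6)
    frozenset({1, 2, 3, 4, 5, 7}): 0.10,  # one extra variable, (1, 6)
    frozenset({1, 2, 3, 4}): 0.15,        # one fewer variable, (1, 4)
    frozenset({1, 2, 3, 4, 6}): 0.20,     # one missing, one extra, (2, 5)
}

g_m_star, g_groups = g_plot_groups(g_freqs)
for (i, j), p in sorted(g_groups.items()):
    print(f"group at ({i}, {j}): frequency {p:.2f}")
```

Each key `(i, j)` is the plotting position of a circle, and the summed frequency would drive its color intensity.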

The Distribution of the Selected Model by Decomposed Hamming Distance (H-plot)

We further focus on the Hamming distance. Note that the Hamming distance between the mode model \(m^\ast\) and an arbitrary model \(m\) can be decomposed into two parts, \(H^-\) and \(H^+\), that is, \[ H^-(m^\ast \| m)=|m^\ast \setminus m|, \quad H^+(m^\ast \| m)=|m \setminus m^\ast|, \quad H(m^\ast \| m)=H^-(m^\ast \| m) + H^+(m^\ast \| m). \] Here \(H^-\) is the number of variables in the mode model that are missing from \(m\), while \(H^+\) is the number of redundant variables in \(m\), i.e., variables in \(m\) that are not in the mode model. In total, \(H=H^- + H^+\) is the number of variables on which \(m\) and the mode model differ. Therefore, we can form the model groups according to \(H^-\) and \(H^+\). Specifically, each model group can be expressed as \(\{m: H^-(m^\ast \| m)=i, H^+(m^\ast \| m)=j \}\) for different \(i\) and \(j\). As an alternative to the G-plot, we can plot each group at the xy-coordinates \((i,j)\), and we call this new visualization the H-plot.
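The decomposition \(H = H^- + H^+\) and the resulting H-plot groups can be illustrated with a short sketch. The model frequencies and the names `h_minus`, `h_plus`, and `h_plot_groups` are assumptions made for this example only.

```python
from collections import defaultdict


def h_minus(m_star, m):
    """Number of mode-model variables missing from m."""
    return len(m_star - m)


def h_plus(m_star, m):
    """Number of variables in m that are not in the mode model."""
    return len(m - m_star)


def h_plot_groups(freqs):
    """Sum model frequencies per group keyed by (H-, H+)."""
    m_star = max(freqs, key=freqs.get)  # mode model
    groups = defaultdict(float)
    for m, p in freqs.items():
        groups[(h_minus(m_star, m), h_plus(m_star, m))] += p
    return m_star, dict(groups)


# Hypothetical model frequencies (they sum to 1).
h_freqs = {
    frozenset({1, 2, 3, 4, 5}): 0.40,
    frozenset({1, 2, 3, 4, 5, 6}): 0.15,
    frozenset({1, 2, 3, 4, 5, 7}): 0.10,
    frozenset({1, 2, 3, 4}): 0.15,
    frozenset({1, 2, 3, 4, 6}): 0.20,
}

h_m_star, h_groups = h_plot_groups(h_freqs)
for m in h_freqs:
    # The two parts always add up to the full Hamming distance.
    assert h_minus(h_m_star, m) + h_plus(h_m_star, m) == len(h_m_star ^ m)
print(sorted(h_groups.items()))
```

Unlike the G-plot key `(H, |m|)`, the key `(H^-, H^+)` separates under-selection from over-selection directly on the two axes.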

This is an example of H-plot.

The Distribution of the Selected Model by Scatterplot (Scatter plot)

The H-plot focuses on the difference between the selected model and the mode model in terms of missing variables and redundant variables. However, for missing variables, the magnitude of the missing variable's coefficient is also important because it indicates the negative impact of missing such a variable. To take this information into account, we define the weighted Hamming distance as \[H^-_w(m^\ast \| m)=\sum_{j \in m^\ast \setminus m} |\beta^0_j|,\] where \(\beta^0_j\) is the \(j\)-th element of the coefficient vector \(\boldsymbol{\beta}^0\). Based on this distance, we further propose another type of visualization, namely the weighted Hamming distance based scatter plot.
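A minimal sketch of the weighted Hamming distance follows. The coefficient values in `beta0`, the example models, and the choice of \(H^-_w\) as the x-coordinate and \(H^+\) as the y-coordinate of each scatter point are assumptions for illustration.

```python
def weighted_h_minus(m_star, m, beta0):
    """Sum of |beta0_j| over the mode-model variables j missing from m."""
    return sum(abs(beta0[j]) for j in m_star - m)


# Hypothetical coefficients, indexed by variable.
beta0 = {1: 2.0, 2: -1.5, 3: 1.0, 4: 0.5, 5: 0.25}
w_m_star = frozenset({1, 2, 3, 4, 5})

# Hypothetical model frequencies (they sum to 1).
w_freqs = {
    w_m_star: 0.5,
    frozenset({1, 2, 3, 6}): 0.3,  # misses variables 4 and 5, adds 6
    frozenset({2, 3, 4, 5}): 0.2,  # misses variable 1
}

# One scatter point per model: (weighted H-, H+, frequency).
points = [
    (weighted_h_minus(w_m_star, m, beta0), len(m - w_m_star), p)
    for m, p in w_freqs.items()
]
print(points)
```

Both non-mode models here have small unweighted \(H^-\) (2 and 1), yet the model missing variable 1 lies farther along the weighted axis because that single coefficient is large.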

This is an example of Scatter plot.

The Distribution of the Selected Model by Heatmap (Heatmap)

Lastly, we propose the weighted Hamming distance based heatmap as an alternative to the scatter plot. As the data dimension increases, the number of unique models increases exponentially. As a result, the scatter plot becomes more difficult to read because many model groups overlap with each other. Therefore, we propose to divide the x- and y-axes of the scatter plot into equally spaced intervals and convert the scatter plot into a heatmap (or 2D histogram). The color of each rectangle in the heatmap represents the sum of the frequencies of the models whose \(H^-_w\) and \(H^+\) fall into the corresponding intervals.
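The conversion from scatter points to heatmap cells can be sketched as a frequency-weighted 2D histogram. The points, axis ranges, and bin counts below are hypothetical, and placing \(H^-_w\) on the x-axis is an assumption; values on the upper boundary are assigned to the last bin.

```python
def heatmap_bins(points, x_max, y_max, nx, ny):
    """Sum frequencies into an ny-by-nx grid of equally spaced bins.

    points: iterable of (x, y, frequency) with 0 <= x <= x_max, 0 <= y <= y_max.
    Assumes x_max > 0 and y_max > 0.
    """
    grid = [[0.0] * nx for _ in range(ny)]
    for x, y, p in points:
        ix = min(int(x / x_max * nx), nx - 1)  # clamp the upper edge
        iy = min(int(y / y_max * ny), ny - 1)
        grid[iy][ix] += p
    return grid


# Hypothetical scatter points: (weighted H-, H+, model frequency).
hm_points = [
    (0.0, 0, 0.4),
    (0.5, 0, 0.1),
    (2.0, 1, 0.2),
    (3.5, 2, 0.3),
]

hm_grid = heatmap_bins(hm_points, x_max=4.0, y_max=2, nx=4, ny=2)
for row in hm_grid:
    print(row)
```

Because every point lands in exactly one cell, the cell values sum to the total frequency of the plotted models.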

This is an example of Heatmap.

Conclusions:

In this article, we have proposed several new graphical tools to visualize the distribution of the selected model under various model selection procedures. The visualization helps us understand the behavior of a model selection procedure. To the best of our knowledge, this is the first attempt to visualize such a complex distribution. We further propose a few numerical attributes of the distribution to quantify its central tendency, dispersion, and skewness. Among them, the model selection deviation allows quantitative comparison of the model selection uncertainty of various model selection procedures.