Slice
DominoSlicer
- DominoSlicer(n_slices=5, covariance_type='diag', n_pca_components=128, n_mixture_components=25, y_log_likelihood_weight=1, y_hat_log_likelihood_weight=1, max_iter=100, init_params='confusion', confusion_noise=0.001, random_state=None)
Slice Discovery based on the Domino Mixture Model.
Discover slices by jointly modeling a mixture of input embeddings (e.g. activations from a trained model), class labels, and model predictions. This encourages slices that are homogeneous with respect to error type (e.g. all false positives).
Examples
Suppose you’ve trained a model and stored its predictions on a dataset in a Meerkat DataPanel with columns “emb”, “target”, and “pred_probs”. After loading the DataPanel, you can discover underperforming slices of the validation dataset with the following:
    from domino import DominoSlicer

    dp = ...  # Load dataset into a Meerkat DataPanel

    # split dataset
    valid_dp = dp.lz[dp["split"] == "valid"]
    test_dp = dp.lz[dp["split"] == "test"]

    # fit the slicer on the validation split
    domino = DominoSlicer()
    domino.fit(
        data=valid_dp, embeddings="emb", targets="target", pred_probs="pred_probs"
    )

    # store the slice assignments on the split they were computed for
    test_dp["domino_slices"] = domino.transform(
        data=test_dp, embeddings="emb", targets="target", pred_probs="pred_probs"
    )
- Parameters
n_slices (int, optional) – The number of slices to discover. Defaults to 5.
covariance_type (str, optional) – The type of covariance parameter \(\mathbf{\Sigma}\) to use. Same as in sklearn.mixture.GaussianMixture. Defaults to “diag”, which is recommended.
n_pca_components (Union[int, None], optional) – The number of PCA components to use. If None, then no PCA is performed. Defaults to 128.
n_mixture_components (int, optional) – The number of clusters in the mixture model, \(\bar{k}\). This differs from n_slices in that the DominoSDM only returns the top n_slices with the highest error rate of the n_mixture_components. Defaults to 25.
y_log_likelihood_weight (float, optional) – The weight \(\gamma\) applied to the \(P(Y=y_{i} | S=s)\) term in the log likelihood during the E-step. Defaults to 1.
y_hat_log_likelihood_weight (float, optional) – The weight \(\hat{\gamma}\) applied to the \(P(\hat{Y} = h_\theta(x_i) | S=s)\) term in the log likelihood during the E-step. Defaults to 1.
max_iter (int, optional) – The maximum number of iterations to run. Defaults to 100.
init_params (str, optional) – The initialization method to use. Options are the same as in sklearn.mixture.GaussianMixture plus one addition, “confusion”. If “confusion”, the clusters are initialized such that almost all of the examples in a cluster come from the same cell in the confusion matrix. See Notes below for more details. Defaults to “confusion”.
confusion_noise (float, optional) – Only used if init_params="confusion". The scale of noise added to the confusion matrix initialization. See Notes below for more details. Defaults to 0.001.
random_state (Union[int, None], optional) – The random seed to use when initializing the parameters.
Notes
The mixture model is an extension of a standard Gaussian Mixture Model. The model is based on the assumption that data is generated according to the following generative process.
1. Each example belongs to one of \(\bar{k}\) slices. This slice \(S\) is sampled from a categorical distribution \(S \sim Cat(\mathbf{p}_S)\) with parameter \(\mathbf{p}_S \in\{\mathbf{p} \in \mathbb{R}_+^{\bar{k}} : \sum_{i = 1}^{\bar{k}} p_i = 1\}\) (see DominoSDM.mm.weights_).
2. Given the slice \(S\), the embeddings are normally distributed \(Z | S \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})\) with parameters mean \(\mathbf{\mu} \in \mathbb{R}^d\) (see DominoSDM.mm.means_) and covariance \(\mathbf{\Sigma} \in \mathbb{S}^{d}_{++}\) (see DominoSDM.mm.covariances_; normally this parameter is constrained to the set of symmetric positive definite \(d \times d\) matrices, however the argument covariance_type allows for other constraints).
3. Given the slice, the labels vary as a categorical \(Y | S \sim Cat(\mathbf{p})\) with parameter \(\mathbf{p} \in \{\mathbf{p} \in \mathbb{R}^c_+ : \sum_{i = 1}^c p_i = 1\}\) (see DominoSDM.mm.y_probs).
4. Given the slice, the model predictions also vary as a categorical \(\hat{Y} | S \sim Cat(\mathbf{\hat{p}})\) with parameter \(\mathbf{\hat{p}} \in \{\mathbf{\hat{p}} \in \mathbb{R}^c_+ : \sum_{i = 1}^c \hat{p}_i = 1\}\) (see DominoSDM.mm.y_hat_probs).
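As a rough illustration of the generative process above, the following plain-Python sketch draws examples from it. The function name `sample_example` and the toy parameter values are hypothetical and not part of the domino API, and a diagonal covariance is assumed (matching the default covariance_type).

```python
import random


def sample_example(p_s, means, variances, y_probs, y_hat_probs, rng):
    """Draw one (slice, embedding, label, prediction) tuple."""
    # S ~ Cat(p_S)
    s = rng.choices(range(len(p_s)), weights=p_s)[0]
    # Z | S=s ~ N(mu_s, Sigma_s), with Sigma_s diagonal (assumption)
    z = [rng.gauss(mu, var ** 0.5) for mu, var in zip(means[s], variances[s])]
    # Y | S=s ~ Cat(p) and Y_hat | S=s ~ Cat(p_hat)
    y = rng.choices(range(len(y_probs[s])), weights=y_probs[s])[0]
    y_hat = rng.choices(range(len(y_hat_probs[s])), weights=y_hat_probs[s])[0]
    return s, z, y, y_hat


# Toy parameters (hypothetical): 2 slices, 2-d embeddings, binary labels.
p_s = [0.7, 0.3]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]
y_probs = [[0.9, 0.1], [0.0, 1.0]]      # slice 1 holds only positives
y_hat_probs = [[0.9, 0.1], [1.0, 0.0]]  # which are always predicted negative

rng = random.Random(0)
samples = [
    sample_example(p_s, means, variances, y_probs, y_hat_probs, rng)
    for _ in range(1000)
]
```

In this toy setup every example drawn from slice 1 is a false negative, which is the kind of error-homogeneous slice the model is meant to recover.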
The mixture model is thus parameterized by \(\phi = [\mathbf{p}_S, \mathbf{\mu}, \mathbf{\Sigma}, \mathbf{p}, \mathbf{\hat{p}}]\), corresponding to the attributes weights_, means_, covariances_, y_probs, and y_hat_probs, respectively. The log-likelihood over the \(n\) examples in the validation dataset \(D_v\) is given as follows and is maximized using expectation-maximization:
\[\ell(\phi) = \sum_{i=1}^n \log \sum_{s=1}^{\bar{k}} P(S=s) P(Z=z_i | S=s) P(Y=y_i | S=s) P(\hat{Y} = h_\theta(x_i) | S=s)\]
We include two optional hyperparameters \(\gamma, \hat{\gamma} \in \mathbb{R}_+\) (see y_log_likelihood_weight and y_hat_log_likelihood_weight above) that balance the importance of modeling the class labels and predictions against the importance of modeling the embedding. The modified log-likelihood over \(n\) examples is given as follows:
\[\ell(\phi) = \sum_{i=1}^n \log \sum_{s=1}^{\bar{k}} P(S=s) P(Z=z_i | S=s) P(Y=y_i | S=s)^{\gamma} P(\hat{Y} = h_\theta(x_i) | S=s)^{\hat{\gamma}}\]
Attention
Although we model the prediction \(\hat{Y}\) as a categorical random variable, in practice predictions are sometimes “soft” (e.g. the output of a softmax layer is a probability distribution over labels, not a single label). In these cases, the prediction \(\hat{Y}\) is technically a Dirichlet random variable (i.e. a distribution over distributions).
However, to keep the implementation simple while still leveraging the extra information provided by “soft” predictions, we naïvely plug the “soft” predictions directly into the categorical PMF in the E-step and the update in the M-step. Specifically, during the E-step, instead of computing the categorical PMF \(P(\hat{Y}=\hat{y}_i | S=s)\) we compute \(\sum_{j=1}^c \hat{y}_i(j) P(\hat{Y}=j | S=s)\), where \(\hat{y}_i(j)\) is the “soft” prediction for class \(j\) (this can be thought of as marginalizing out the uncertainty in the prediction). During the M-step, we compute a “soft” update for the categorical parameters \(p_j^{(s)} = \sum_{i=1}^n Q(s,i) \hat{y}_i(j)\), where \(Q(s,i)\) is the “responsibility” of slice \(s\) towards the data point \(i\).
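To make the weighted likelihood and the soft-prediction substitution concrete, here is a minimal sketch of the E-step computation for a single example. It is written in plain Python under simplifying assumptions (diagonal covariance, strictly positive probabilities) and is not the library's implementation. The helper names `log_gaussian_diag` and `responsibilities` and the toy parameters are hypothetical.

```python
import math


def log_gaussian_diag(z, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumption)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(z, mean, var)
    )


def responsibilities(z, y, y_hat_soft, params, gamma=1.0, gamma_hat=1.0):
    """Return (responsibility of each slice, log-likelihood term) for one example."""
    log_joint = []
    # params = (weights, means, variances, y_probs, y_hat_probs), one entry per slice
    for p_s, mean, var, y_p, y_hat_p in zip(*params):
        # Soft prediction plugged into the categorical PMF:
        # sum_j y_hat_i(j) * P(Y_hat = j | S = s)
        pred_term = sum(w * p for w, p in zip(y_hat_soft, y_hat_p))
        log_joint.append(
            math.log(p_s)
            + log_gaussian_diag(z, mean, var)
            + gamma * math.log(y_p[y])
            + gamma_hat * math.log(pred_term)
        )
    # log-sum-exp over slices for numerical stability
    top = max(log_joint)
    log_lik = top + math.log(sum(math.exp(v - top) for v in log_joint))
    return [math.exp(v - log_lik) for v in log_joint], log_lik


# Toy parameters (hypothetical): two slices with identical embedding distributions.
params = (
    [0.5, 0.5],                # weights
    [[0.0, 0.0], [0.0, 0.0]],  # means
    [[1.0, 1.0], [1.0, 1.0]],  # diagonal variances
    [[0.5, 0.5], [0.1, 0.9]],  # y_probs
    [[0.5, 0.5], [0.9, 0.1]],  # y_hat_probs
)
z_i, y_i, y_hat_i = [0.0, 0.0], 1, [0.8, 0.2]
q, log_lik = responsibilities(z_i, y_i, y_hat_i, params)
q_unweighted, _ = responsibilities(z_i, y_i, y_hat_i, params, gamma=0.0, gamma_hat=0.0)
```

Because the two slices share the same embedding distribution, only the label and prediction terms separate them: with the default weights the example (a positive predicted mostly negative) is attributed mainly to the second slice, while setting both weights to 0 removes that signal and splits the responsibility evenly.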
When using "confusion" initialization, each slice \(s^{(j)}\) is assigned a \(y^{(j)}\in \mathcal{Y}\) and \(\hat{y}^{(j)} \in \mathcal{Y}\) (i.e. each slice is assigned a cell in the confusion matrix). This is typically done in a round-robin fashion so that there are at least \(\left\lfloor \bar{k} / |\mathcal{Y}|^2 \right\rfloor\) slices assigned to each cell in the confusion matrix. Then, we fill in the initial responsibility matrix \(Q \in \mathbb{R}^{n \times \bar{k}}\), where each cell \(Q_{ij}\) corresponds to our model’s initial estimate of \(P(S=s^{(j)}|Y=y_i, \hat{Y}=\hat{y}_i)\). We do this according to
\[\begin{split}\bar{Q}_{ij} \leftarrow \begin{cases} 1 + \epsilon & y_i=y^{(j)} \land \hat{y}_i = \hat{y}^{(j)} \\ \epsilon & \text{otherwise} \end{cases}\end{split}\]
\[Q_{ij} \leftarrow \frac{\bar{Q}_{ij}}{\sum_{l=1}^{\bar{k}} \bar{Q}_{il}}\]
where \(\epsilon\) is random noise which ensures that slices assigned to the same confusion matrix cell won’t have the exact same initialization. We sample \(\epsilon\) uniformly from the range (0, confusion_noise].
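The initialization described above can be sketched in a few lines of plain Python. This is an illustrative reading of the formulas rather than the library's code: the function name `confusion_init` is hypothetical, predictions are assumed to be hard labels, cells are assumed to be visited in row-major order, and the noise is assumed to be drawn independently for each entry.

```python
import random


def confusion_init(y, y_hat, n_classes, n_components, confusion_noise=0.001, seed=0):
    """Build an initial responsibility matrix Q (n rows, n_components columns)."""
    rng = random.Random(seed)
    # Round-robin assignment of a confusion-matrix cell (y^(j), y_hat^(j)) to slice j.
    cells = [(a, b) for a in range(n_classes) for b in range(n_classes)]
    assigned = [cells[j % len(cells)] for j in range(n_components)]

    q = []
    for y_i, y_hat_i in zip(y, y_hat):
        # Unnormalized row: 1 + eps if the example falls in the slice's cell, else eps.
        # rng.random() lies in [0, 1), so the noise lies in (0, confusion_noise].
        row = [
            (1.0 if (y_i, y_hat_i) == cell else 0.0)
            + confusion_noise * (1.0 - rng.random())
            for cell in assigned
        ]
        total = sum(row)
        q.append([v / total for v in row])  # normalize over slices
    return q, assigned


# Toy data (hypothetical): binary task, 8 mixture components, 3 examples.
y = [0, 1, 1]
y_hat = [0, 0, 1]
q, assigned = confusion_init(y, y_hat, n_classes=2, n_components=8)
```

With 8 components and 4 confusion-matrix cells, each cell receives two slices, so the false negative at index 1 starts with roughly half of its responsibility on each of the two slices assigned to cell (1, 0).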
SpotlightSlicer
- SpotlightSlicer(spotlight_size=0.02, num_steps=1000, learning_rate=0.001, **kwargs)
Slice a dataset with The Spotlight algorithm [deon_2022].
TODO: add docstring similar to the Domino one
- deon_2022
d’Eon, G., d’Eon, J., Wright, J. R. & Leyton-Brown, K. The Spotlight: A General Method for Discovering Systematic Errors in Deep Learning Models. arXiv:2107.00758 [cs, stat] (2021).
- Parameters
spotlight_size (int) –
num_steps (int) –
learning_rate (float) –