Weighting non-IID batches for out-of-distribution detection

Zhao, Zhilin; Cao, Longbing

doi:10.1007/s10994-024-06605-z

Weighting non-IID batches for out-of-distribution detection

Open access
Published: 19 August 2024

Volume 113, pages 7371–7391, (2024)
Cite this article

Download PDF

You have full access to this open access article

Machine Learning Aims and scope Submit manuscript

Weighting non-IID batches for out-of-distribution detection

Download PDF

We’re sorry, something doesn't seem to be working properly.

Please try refreshing the page. If that doesn't work, please contact support so we can address the problem.

Abstract

A standard network pretrained on in-distribution (ID) samples could make high-confidence predictions on out-of-distribution (OOD) samples, leaving the possibility of failing to distinguish ID and OOD samples in the test phase. To address this over-confidence issue, the existing methods improve the OOD sensitivity from modeling perspectives, i.e., retraining it by modifying training processes or objective functions. In contrast, this paper proposes a simple but effective method, namely Weighted Non-IID Batching (WNB), by adjusting batch weights. WNB builds on a key observation: increasing the batch size can improve the OOD detection performance. This is because a smaller batch size may make its batch samples more likely to be treated as non-IID from the assumed ID, i.e., associated with an OOD. This causes a network to provide high-confidence predictions for all samples from the OOD. Accordingly, WNB applies a weight function to weight each batch according to the discrepancy between batch samples and the entire training ID dataset. Specifically, the weight function is derived by minimizing the generalization error bound. It ensures that the weight function assigns larger weights to batches with smaller discrepancies and makes a trade-off between ID classification and OOD detection performance. Experimental results show that incorporating WNB into state-of-the-art OOD detection methods can further improve their performance.

AUAAC: Area Under Accuracy-Accuracy Curve for Evaluating Out-of-Distribution Detection

Multi-label Adaptive Batch Selection by Highlighting Hard and Imbalanced Samples

Out-of-distribution Detection with Boundary Aware Learning

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

A basic assumption for deep neural networks (DNNs) is that training and test samples are independent and identically distributed (IID) drawn from the same distribution (Sun et al., 2021). Accordingly, a network is learned from samples drawn from an unknown distribution, i.e., in-distribution (ID), to predict labels for test samples that are assumed to be ID. However, test samples could be drawn from distributions different from that of the training ID samples, i.e., out-of-distribution (OOD) (Zhao et al., 2023; Ren et al., 2019; Salehi et al., 2022). The OOD samples are assigned with the same labels as ID samples by the trained network (Guo et al., 2017; Malinin & Gales, 2018). It limits the adaption of networks to OOD and may cause severe accidents in real-world applications (Shrivastava et al., 2017), e.g., sacrificing the safety of DNNs (Amodei et al., 2016). These issues can be addressed by involving OOD detection to identify OOD samples in the test phase.

OOD detection is a challenging problem because partial OOD samples could be assigned with high-confidence predictions due to the network vulnerability (Zhao et al., 2023). As a result, a trained network cannot distinguish between ID and OOD samples and predict labels for OOD samples in the test phase (Goodfellow et al., 2015). For a standard network pretrained on in-distribution samples, post-hoc methods (Salehi et al., 2022) design OOD detectors to calculate OOD scores for test samples according to the network outputs and distinguish OOD samples from ID samples according to the OOD scores. However, these post-hoc methods do not modify the standard network, making the OOD detection performance heavily rely on the knowledge learned from the standard network. Therefore, the main challenge of OOD detection is to improve the OOD sensitivity of standard networks.

Recently, the confidence enhancement methods (Salehi et al., 2022) enhance the modeling to improve the OOD sensitivity. Specifically, their success relies on retraining standard networks by modifying training processes and objective functions. However, modifying a standard network requires knowledge about its design principle and could significantly affect the performance of its main task when altering its model. Furthermore, the confidence enhancement methods modify standard networks using solid assumptions about the data characteristics of OOD samples, hence failing when these OOD assumptions are broken.

Both post-hoc and confidence enhancement methods ignore the affection of ID sample characteristics on OOD detection performance. This gap inspires to improve the OOD sensitivity by exploring the data characteristics of training ID samples rather than assuming the data characteristics of unobserved OOD samples. The impact of different batch sizes on detecting OOD samples of standard networks is illustrated in Fig. 1. The results show that a larger batch size leads to relatively better OOD detection performance. This phenomenon is attributed to the fact that the sampling of batch samples is often biased due to the limited batch sizes, and the IID assumption may not hold. Accordingly, the samples in a batch can be treated as non-IID from the distribution of training ID samples. In contrast, these samples can be treated as IID from another unknown distribution different from that of training ID samples. If the two distributions are significantly different, as shown in Fig. 2, some IID samples from the unknown distribution can be treated as OOD for the entire training ID dataset. Accordingly, a network will learn to assign high-confidence predictions on some OOD samples, e.g., located in the long tail of the distribution of training ID samples. This phenomenon implies that the OOD sensitivity can be improved by making batch samples aligned with the entire ID training samples without modifying the training processes and objective functions of standard networks.

The above observation inspires to propose a novel data approach: Weighted Non-IID Batching (WNB). WNB takes the idea that a batch whose samples are IID drawn from a distribution significantly different from the ID should have a lower weight, or vice versa. Accordingly, WNB weights each batch according to the discrepancy between the batch samples and the entire training ID dataset. Specifically, a batch with a larger discrepancy should be assigned with a smaller weight to discourage the network from providing high-confidence predictions on OOD samples, or vice versa. To balance the ID classification and OOD detection performance of the network, we derive the generalization error bound of WNB and obtain the expression of the weight function by minimizing this bound. A major advantage of WNB is that it can be easily incorporated into post-hoc and confidence enhancement methods to further improve their OOD detection performance, verified experimentally on various network architectures and datasets.

The main contributions of this work include the following:

WNB is proposed to improve the OOD sensitivity by weighting each batch according to the discrepancy between the batch samples and the entire training ID dataset.
The weight function of WNB is derived by minimizing its generalization error bound to make a trade-off between ID classification and OOD detection performance.
WNB improves OOD detection performance by weighting batches without modifying the training processes and objective functions, which indicates that WNB can be incorporated into post-hoc and confidence enhancement methods.

The rest of this paper is organized as follows. Section 2 briefly reviews related work of OOD detection. Section 3 describes the proposed WNB method and its theoretical guarantees in detail. Section 4 presents extensive experiments to verify the superiority of the proposed method. Section 5 draws the conclusion and discusses future work.

2 Related work

OOD detection (Ming et al., 2022) aims to identify OOD samples that are drawn from distributions different from that of ID samples for a network in the test phase. OOD detection is similar with predictive uncertainty (Wang et al., 2022), domain generalization (Li et al., 2021) and outlier detection (Krleza et al., 2021; Yuan et al., 2022). In the test phase, predictive uncertainty concentrates on providing high predictive uncertainty for misclassified ID samples, and domain generalization focuses on improving the classification accuracy for the ID samples with covariate shift. Unlike the two research settings, OOD detection aims to recognize samples with sematic shift that are treated as OOD samples and refuses to predict labels for these OOD samples. Outlier detection filters out-of-distribution samples that are different from the majority in the training dataset for learning downstream networks. Different from Outlier detection, OOD detection can only observe ID samples in the training phase and detect OOD samples from the unlabeled test samples according to network outputs in the test phase. The existing OOD detection methods can be categorized into post-hoc and confidence enhancement methods.

2.1 Post-hoc methods

Post-hoc methods apply a detector to calculate OOD scores for test samples. They are easily applicable to a standard network pretrained on ID samples without modifying its training process and objective function. Therefore, post-hoc methods can adopt the standard network for OOD detection in real-world applications. A baseline method involves Maximum over Softmax Probabilities (MSP) (Hendrycks et al., 2017) treating the confidence represented by the maximum softmax probability from a network as the OOD score of a test sample. To improve the OOD sensitivity of the baseline method, Out-of-DIstribution detector for Neural networks (ODIN) (Liang et al., 2018) considers the negative adversarial perturbations for test samples and temperature scaling for the softmax function. Mahalanobis Distance Detector (MDD) (Lee et al., 2018) method linearly combines the Mahalanobis distances of latent representations from different network layers to calculate an OOD score. Furthermore, Energy-Based Detector (EBD) (Liu et al., 2020) treats the energy function score represented by the softmax activations of predicted label probabilities as the OOD score for a test sample. Gram Matrices (GM) (Sastry & Oore, 2020) treats the feature correlation between activity patterns from all layers and the predicted class as the OOD score for a test sample. Deep Residual Flow (DRF) (Zisselman & Tamar, 2020) is based on ODIN and MLB, which calculates the residual flows of each layer and each class by normalizing the flows of an expressive density model. ReAct (Sun et al., 2021) aims to reduce the negative effect of noise by truncating activations on the penultimate layer of a network. GEN (Liu et al., 2023) presents a universal entropy scoring mechanism compatible with all pre-trained softmax-based classifiers. Meanwhile, ASH (Djurisic et al., 2023) proposes a unique, after-the-fact activation modification technique for identifying out-of-distribution data. This method operates by selectively reducing a major portion of the activation in the later stages of the model during the inference process, and it does so without the necessity of referring back to training data statistics. However, these post-hoc methods merely apply the network outputs to calculate OOD scores without network modification, making the OOD detection performance heavily rely on the knowledge extracted by the network.

2.2 Confidence enhancement methods

Confidence enhancement methods enhance the OOD sensitivity of a standard network pretrained on ID samples. They introduce the knowledge about OOD samples by retraining the standard network with the OOD-sensitive training process or objective function. Joint Confidence Loss (JCL) (Lee et al., 2018) co-trains a generative adversarial network with the softmax loss to generate OOD samples and penalizes them by encouraging their predicted label probabilities to satisfy a uniform distribution. Extending ODIN, DeConf-C (Hsu et al., 2020) applies a specialized training objective function which is the divisor structure of class probability confidence and searches the perturbation magnitude for test samples on training ID samples. Deep Gamblers (DG) (Liu et al., 2019) assigns an extra class to those low-confidence ID samples in the training process. Minimum Other Score (MOS) (Huang & Li, 2021) divides the training ID samples into groups with similar concepts to simplify the decision boundaries between ID and OOD samples. Confidence enhancement methods can improve the OOD sensitivity of a standard network. Self-Supervised outlier Detection (SSD) (Sehwag et al., 2021) and Contrasting Shifted Instances (CSI) (Tack et al., 2020) treat the augmented ID samples obtained by rotations as OOD samples and apply them to improve OOD sensitivity by contrastive losses. HEAT (Lafon et al., 2023) introduces a method for OOD detection by rectifying a mixture of class-conditional Gaussian distributions using an energy-based approach. This technique effectively addresses the problem of non-mixing in Markov Chain Monte Carlo (MCMC) sampling, a common challenge in training energy-based models. In a different approach, Dual Representation Learning (DRL) (Zhao & Cao, 2023) simultaneously leverages both robust and nuanced features associated with labels. It does so by training an auxiliary network to identify representations that can distinguish between different distributions. These representations are designed to complement the label-distinguishing capabilities of an existing network, thereby boosting the effectiveness of OOD detection. However, these methods require modifying the model, which could significantly affect the performance of the main task of the standard network. Besides, these methods assume specific data characteristics of OOD samples, leaving failure chances when their OOD assumptions are broken.

3 Weighted non-IID batching

Different from post-hoc and confidence enhancement methods designed to improve modeling capacity, Weighted Non-IID Batching (WNB) method improves the OOD sensitivity by considering the data characteristics of training ID samples. The limited number of batch samples causes sampling bias, i.e., the batch samples are non-IID with respect to the ID and may be treated as IID from another distribution different from the ID. If this distribution is significantly different from the ID, i.e., forming OOD, a training network tends to make high-confidence predictions on some OOD samples, as shown in Fig. 2. Taking this perspective, WNB weights each batch according to the discrepancy between the batch samples and the entire training ID dataset, where the weight function is inferred according to the generalization error bound of WNB. Specifically, a batch should be provided with a higher weight if its discrepancy is smaller, or vice versa.

3.1 Preliminaries

Let $\textbf{x} \in \mathcal {X}$ and $y \in \mathcal {Y}$ be an input and its corresponding label, respectively. All the inputs are bounded, i.e., $\Vert \textbf{x} \Vert \le B$. We assume the training dataset $\mathcal {S} = \{\textbf{z}_i\}_{i = 1}^m$ contains m ID samples IID drawn from an unknown distribution $\mathcal {D}$ on $\mathcal {X} \times \mathcal {Y}$, where $\textbf{z} = (\textbf{x}, y) \overset{\text {IID}}{\sim } \mathcal {D}$. Accordingly, $\mathcal {B}= \{\textbf{z}_i\}_{i = 1}^b$ is a batch containing $b \le m$ randomly selected samples from $\mathcal {S}$. Furthermore, we assume any function $h: \mathcal {X} \rightarrow \mathcal {Y}$ is from the hypothesis class $\mathcal {H}$, and the loss function $l: \mathcal {Y} \times \mathcal {Y} \rightarrow [-a, a]$ is L-Lipschitz continuous with respect to any $h \in \mathcal {H}$. For the hypothesis class $\mathcal {H}$, we further assume $\mathcal {H}$ is the class of real-valued networks of depth d over the domain $\mathcal {X}$. For the networks from $\mathcal {H}$, the Frobenius norm of the weight matrices are at most $M_1, \ldots , M_d$, and the activation function is 1-Lipschitz, positive-homogeneous, and applied element-wise (such as the ReLU).

We denote the expected risk on the training distribution $\mathcal {D}$ as,

$$\begin{aligned} L_\mathcal {D}(h) = \mathbb {E}_{(\textbf{x},y) \sim \mathcal {D}} \left[ l(h(\textbf{x}), y) \right] . \end{aligned}$$

(1)

Estimating the expected risk by the entire ID training dataset $\mathcal {S}$ and a batch $\mathcal {B}$, we obtain the corresponding empirical risks,

$$\begin{aligned} \begin{aligned} L_\mathcal {S}(h) = \frac{1}{m} \sum _{(\textbf{x},y) \in \mathcal {S}} l(h(\textbf{x}), y), \quad L_\mathcal {B}(h) = \frac{1}{b} \sum _{(\textbf{x},y) \in \mathcal {B}} l(h(\textbf{x}), y). \end{aligned} \end{aligned}$$

(2)

Accordingly, we define the optimal solution of the expected risk $L_\mathcal {D}(h)$,

$$\begin{aligned} h^* \in \arg \min _{h \in \mathcal {H}} L_\mathcal {D}(h). \end{aligned}$$

(3)

The classification learning task aim to search a minimizer from the hypothesis space $\mathcal {H}$ by optimizing an empirical risk to approximate the optimal solution.

3.2 Objective function

The basic idea of WNB is to align the batch samples with the entire ID training samples by weighting each batch. This is achieved by strategically adjusting the weights assigned to each batch during the training process, with the aim of reducing sampling bias inherent in smaller batches. Unlike methods that focus on determining the optimal batch size, our approach maintains a given batch size as a constant and instead concentrates on optimizing the weight distribution within these batches. Therefore, our approach is designed to be easily integrated with existing training frameworks that utilize batch gradient descent, providing a simple yet effective solution for improving OOD detection without the need for complex alterations to the training process.

Based on the empirical risks $L_\mathcal {S}(h)$ and $L_\mathcal {B}(h)$ in Eq. (2), the objective function of WNB is,

$$\begin{aligned} L_\mathcal {S}^{W}(h) = \frac{1}{\lceil m/b \rceil }\sum _{i = 1}^{\lceil m/b \rceil } w_{\mathcal {B}_i} L_{\mathcal {B}_i}(h), \end{aligned}$$

(4)

where $w_{\mathcal {B}}$ is the weight of the batch $\mathcal {B}$. The batch weight depends on the discrepancy $d_{\mathcal {H}}(\mathcal {B}, \mathcal {S})$ between the batch $\mathcal {B}$ and the entire ID dataset $\mathcal {S}$ with respect to the hypothesis space $\mathcal {H}$. Specifically, a batch with a larger discrepancy should be assigned with a smaller weight to discourage the network from providing high-confidence predictions on OOD samples, or vice versa. Accordingly, we define the weight as,

$$\begin{aligned} w_{\mathcal {B}} = g\circ d_{\mathcal {H}}(\mathcal {B}, \mathcal {S}), \end{aligned}$$

(5)

where $g(\cdot )$ is a weight function which maps a large $d_{\mathcal {H}}(\mathcal {B}, \mathcal {S})$ to a small weight, or vice versa. Accordingly, the empirical risk minimizer of Eq. (4) is defined as,

$$\begin{aligned} \widehat{h} \in \arg \min _{h \in \mathcal {H}} L_\mathcal {S}^{W}(h). \end{aligned}$$

(6)

The learning task aims to obtain an empirical risk minimizer $\widehat{h}$ to approximate the optimal solution $h^*$ and be aware of OOD samples. Therefore, $\widehat{h}$ should make a trade-off between ID classification and OOD detection performance. In the following, we will (1) define the discrepancy $d_{\mathcal {H}}(\mathcal {B}, \mathcal {S})$; (2) derive the generalization error bound of $\widehat{h}$ in terms of Rademacher complexity (Bartlett & Mendelson, 2002; Chen et al., 2022; 3) infer the weight function $g(\cdot )$ by minimizing the generalization error bound.

3.3 Dataset discrepancy

WNB weighs a batch per the discrepancy between the batch samples and the entire training ID dataset. A smaller discrepancy indicates the batch samples are more likely IID drawn from the ID, otherwise the samples tend to be non-IID with respect to the ID. We follow the idea of L-discrepancy (Ben-David et al., 2010; Kifer et al., 2004) between source and target domains, i.e., the samples from two datasets are drawn from the same distribution if any function makes the same expectation under the two datasets, or vice versa. Accordingly, we define the discrepancy between the training dataset $\mathcal {S}$ and a batch $\mathcal {B}$ with respect to the hypothesis class $\mathcal {H}$ as,

$$\begin{aligned} d_{\mathcal {H}}(\mathcal {B} , \mathcal {S}) = \sup _{h \in \mathcal {H}} \left( \vert L_{\mathcal {B}}(h) - L_{\mathcal {S}}(h)\vert \right) . \end{aligned}$$

(7)

If a hypothesis $h \in \mathcal {H}$ performs similarly on $\mathcal {S}$ and $\mathcal {B}$, the two datasets have a low discrepancy, i.e., samples in $\mathcal {\mathcal {B}}$ are IID drawn from the ID; otherwise, these batch samples can be treated as non-IID from the ID but IID from an OOD, respectively. We can obtain the following theorem revealing how the batch size b affects the discrepancy Eq. (7).

Theorem 1

The training dataset $\mathcal {S}$ and a batch $\mathcal {B}$ contain m and b samples, respectively. For a loss function $\vert l(h(\textbf{x}), y) \vert \le a$, with probability at least $1 - \delta$, we have,

$$\begin{aligned} d_{\mathcal {H}}(\mathcal {B} , \mathcal {S}) \le \frac{a \left( \sqrt{m} + \sqrt{b}\right) }{\sqrt{mb}} \sqrt{2 \log \frac{2}{\delta }}. \end{aligned}$$

(8)

The detailed proof is represent in Appendix 1. According to Theorem 1, we know that a smaller batch size b causes a lower probability that the discrepancy $d_{\mathcal {H}}(\mathcal {B}, \mathcal {S})$ is smaller than a predefined threshold. This is because the gradient of the function $f(b) = \frac{\sqrt{m} + \sqrt{b}}{\sqrt{mb}}$ is,

$$\begin{aligned} \frac{\partial f(b)}{ \partial b} = \frac{b - m}{2mb \sqrt{mn}}, \end{aligned}$$

which is less than 0 for $m > b$. A small batch size can cause a large discrepancy under the IID assumption, which indicates the batch samples can also be treated as IID samples from an OOD different from that of training ID samples. This causes a network to make high-confidence predictions on OOD samples. Intuitively, the over-confidence issue on OOD samples more likely occurs when the IID assumption does not hold because the empirical distribution of batch samples is OOD.

Evaluating the discrepancy Eq. (7) for batch $\mathcal {B}$ requires sampling infinite hypotheses from the hypothesis class $\mathcal {H}$, which is infeasible. According to Monte Carlo methods (Bardenet et al., 2017), we randomly select n networks $H = \{h_i\}_{i = 1}^n$ from the hypothesis class $\mathcal {H}$ and estimate the discrepancy $d_{\mathcal {H}}(\mathcal {B}, \mathcal {S})$ by the estimator $\widehat{d}_{\mathcal {H}}(\mathcal {B}, \mathcal {S})$, i.e.,

$$\begin{aligned} d_{\mathcal {H}}(\mathcal {B} , \mathcal {S}) \simeq \widehat{d}_{\mathcal {H}}(\mathcal {B} , \mathcal {S}) = \max _{h \in H} \left( \vert L_{\mathcal {B}}(h) - L_{\mathcal {S}}(h)\vert \right) . \end{aligned}$$

(9)

However, the estimation method requires large memory because the n networks have to be stored in the memory. Further, a more exact estimation of discrepancies Eq. (9) requires a larger n, which causes memory explosion. Alternatively, instead of storing n randomly selected networks, we store n randomly generated linear classifiers and combine these classifiers with the backbone of the training network to form n different networks.

3.4 Generalization error bound

Here, we present the generalization error bound of the proposed WNB method. According to the discrepancy $d_{\mathcal {H}}(\mathcal {B}, \mathcal {S})$ in Eq. (7) and the generalization bounds in terms of Rademacher complexity of deep neural networks (Golowich et al., 2018), we can obtain the following theorem.

Theorem 2

For any $\delta$, with probability at least $1 - \delta$, we obtain,

$$\begin{aligned} \begin{aligned} L_\mathcal {D}(\widehat{h}) \le&L_{\mathcal {D}}(h^*) + 8a \sqrt{\frac{2 \ln (8 / \delta )}{m}} + 4 \sqrt{m} BL(\sqrt{2d \ln 2} + 1) \prod _{i = 1}^d M_i\\&+ \frac{2}{\lceil m/b \rceil }\sum _{i = 1}^{\lceil m/b \rceil } \left[ g \circ d_{\mathcal {H}}(\mathcal {B}_i, \mathcal {S}) \right] d_{\mathcal {H}}(\mathcal {B}_i, \mathcal {S}). \end{aligned} \end{aligned}$$

(10)

The detailed proof is represent in Appendix 2. According to Theorem 2, the tightness of this bound indicates how closely $\widehat{h}$ approximates $h^*$, which in turn reflects the model ability to accurately categorize ID samples. The generalization error of WNB is related to four terms. The first term is the error of the optimal solution $h^*$ on the expected risk which can be treated as a constant. This is because this term is independent of the training dataset $\mathcal {D}$ and the objective function. The second term is due to the limited training samples. The third term depends on the properties of network architectures, which demonstrates a smaller Frobenius norm of weight matrices can lead to a tighter generalization error bound. The fourth term is caused by WNB which shows that the generalization error bound is related to the discrepancy and the weight function.

3.5 Weight function

Recall the weight for the batch $\mathcal {B}$ is $w_{\mathcal {B}} = g\circ d_{\mathcal {H}}(\mathcal {B}, \mathcal {S})$ where $g(\cdot )$ is the weight function. Theorem 1 indicates that a smaller batch size could cause a larger discrepancy from the entire training ID dataset, which may further lead to high-confidence predictions for OOD samples. Therefore, to improve the OOD sensitivity of a network, a natural idea is to weigh each batch according to its discrepancy $d_{\mathcal {H}}(\mathcal {B}, \mathcal {S})$. Specifically, the weight function $g(\cdot )$ should map a smaller discrepancy to a larger weight, or vice versa. Theorem 2 indicates that the weight function is essential for the generalization error. Its impact on the ID classification should also be considered because predicting labels for ID samples is the main task of a standard network pretrained on ID samples. To improve the OOD sensitivity while reducing its impact on ID classification, the weight function should be derived by minimizing the generalization error bound of WNB. According to Theorem 1 and Theorem 2, a simple and effective solution is to define the weight function as,

$$\begin{aligned} g(x) = \frac{c}{x}, \end{aligned}$$

(11)

where x is a random variable and $c > 0$ is a scale coefficient controlling the scale of the weight values.

The weight function Eq. (11) is monotonically decreasing, which ensures that a batch with a smaller discrepancy is assigned with a larger weight. Applying the weight function $g(\cdot )$ in Eq. (11) to the fourth term of the generalization error bound in Theorem 2, we obtain,

$$\begin{aligned} \frac{1}{\lceil m/b \rceil } \sum _{i = 1}^{\lceil m/b \rceil } \left[ g \circ d_{\mathcal {H}}(\mathcal {B}_i, \mathcal {S}) \right] d_{\mathcal {H}}(\mathcal {B}_i, \mathcal {S}) = c, \end{aligned}$$

(12)

which is a discrepancy-independent generalization error bound. Accordingly, the generalization error bound is easily controlled by varying the scale coefficient c. Furthermore, the discrepancy used for improving OOD sensitivity will not cause a negative effect on classifying ID samples. Consequently, we can obtain a desired ID classification performance by selecting a small c to minimize the generalization error bound. The discrepancy-independent property shown in Eq. (12) ensures that the weight function Eq. (11) can make a trade-off between ID classification and OOD detection performance.

For $c \rightarrow 0$, the bound Eq. (12) tends to be 0 and all weights are approximately equal, which indicates the objective function Eq. (4) degrades into the traditional cross-entropy loss. Therefore, $c \rightarrow 0$ can achieve a tight generalization bound but cannot improve the OOD sensitivity. For $c \rightarrow \infty$, the discrepancies of different batches would be significantly distinct, and the generalization bound becomes loose. Furthermore, a significantly large c enlarges the weight gap between batches, making the learning process unstable and leading to poor OOD detection and ID classification performance. According to Eq. (12), WNB finds a trade-off between ID classification and OOD detection by simply varying the scale coefficient c. We will further discuss the impact of c in analyzing the hyper-parameter effect.

The training process is summarized in Algorithm 1. Post-hoc methods can be applied to networks trained by WNB. Furthermore, WNB can be incorporated into the confidence enhancement methods by replacing the cross-entropy loss and considering the specific training process in Algorithm 1.

Table 1 The OOD detection performance of post-hoc methods

Full size table

Table 2 The OOD detection performance of confidence enhancement methods

Full size table

4 Experiments

Here, we demonstrate the effectiveness of our proposed WNB method. We incorporate WNB into post-hoc and confidence enhancement methods to verify its applicability and effectiveness. Furthermore, we run a set of ablation study and analyze the impact of its hyper-parameters, including the number of networks n in estimating discrepancies, the scale coefficient c which controls the scale of the weights, and the batch size b.

4.1 Experiment settings

Following the setups of the baseline (Hendrycks et al., 2017) and state-of-the-art methods (Zhao et al., 2023; Tack et al., 2020; Liang et al., 2018; Lee et al., 2018), we adopt the following datasets, network architectures, and metrics in experiments. If unspecific, we set $c = 0.001$ to balance ID classification and OOD detection performance and set $n = 100$ to balance the effectiveness and efficiency. The hyper-parameters will be discussed in this section.

4.1.1 Datasets

In the training phase, we solely utilize ID data, and then in the testing phase, we use the corresponding test data for this ID data along with OOD data to evaluate the OOD detection capability. The ID datasets used for pretraining standard and retraining networks include CIFAR10 (50,000) (Krizhevsky, 2009), CIFAR100 (Krizhevsky, 2009) (50,000), and SVHN (Netzer et al., 2011) (73,257). The numbers in parentheses indicate the number of samples contained in the dataset. To measure the OOD detection performance in the test phase, ID samples are selected from test samples in the corresponding ID dataset. OOD samples are selected from several real-world datasets whose labels are different from that of the ID dataset. Specifically, the considered ID datasets in the test phase include CIFAR10 (10,000), CIFAR100 (10,000), and SVHN (26,032). The considered OOD datasets include CUB200 (11,788) (Wah et al., 2011), StanfordDogs120 (20,580) (Khosla et al., 2011), OxfordPets37 (7349) (Parkhi et al., 2012), Oxfordflowers102 (8189) (Nilsback & Zisserman, 2006), Caltech256 (30,607) (Griffin et al., 2006), DTD47 (5640) (Cimpoi et al., 2014), and COCO (5000) (Lin et al., 2014). We resize all OOD samples to maintain the same size as the training ID samples.

4.1.2 Network architectures

The network architectures used in the comparison methods include ResNet18 (He et al., 2016), SENet (Hu et al., 2020), and ShuffleNet (Ma et al., 2018). If not specified, standard and retrained networks use the same training setup, and standard networks are optimized by minimizing the traditional cross entropy loss. Following the traditional setup (Zhang et al., 2018), the batch size is 128, and the maximal epoch is 200. Furthermore, the learning rate is initialized as 0.1 and divided by 10 after 100 and 150 epochs.

4.1.3 Metrics

To measure the OOD detection performance, we adopt the Area Under the Receiver Operating Characteristic curve (AUROC) (Davis & Goadrich, 2006) on the OOD scores of test ID and OOD samples. A larger AUROC, representing a larger OOD score gap between ID and OOD samples, indicates better OOD detection performance. To obtain the OOD scores for test samples, following the state-of-the-art OOD detection methods (Lee et al., 2018; Hendrycks et al., 2019), we adopt MSP (Hendrycks et al., 2017) for confidence enhancement methods if not specified. Furthermore, ID classification accuracy is also considered because predicting labels for ID samples is the main task.

4.2 Performance comparison

To verify the effectiveness and applicability of WNB, we incorporate it into post-hoc and confidence enhancement methods. The setups of the comparison methods follow their previous ones. For each method, we report their average AUROC across seven OOD datasets, including CUB200, StanfordDogs120, OxfordPets37, Oxfordflowers102, Caltech256, DTD47, and COCO. The training losses of WNB on different ID datasets with varying network architectures are presented in Fig. 3.

4.2.1 Ablation study

We consider the standard works pretrained on ID samples and the networks retrained by WNB. Using the baseline method Maximum over Softmax Probabilities (MSP) (Hendrycks et al., 2017) as the detector, comparing standard networks and retrained networks of WNB can be regarded as an ablation study. This is because this comparison can verify the effect of the weighted batches in WNB. The comparison results are shown in Table 1. WNB achieves a significant improvement ($3.58 \%$) over the standard networks with MSP. This indicates that assigning weights inversely proportional to discrepancies for batches can improve the OOD sensitivity of networks. This is attributed to our design of assigning a small weight to a batch with a large discrepancy, which reduces the learning rate of a network to make high-confidence predictions for OOD samples from the distribution of the batch samples. Furthermore, the network only focuses on learning to provide labels for samples from batches that can be treated as IID samples from the distribution of the entire training ID samples.

4.2.2 Comparison with post-hoc methods

We apply nine post-hoc methods, including Maximum over Softmax Probabilities (MSP) (Hendrycks et al., 2017), Out-of-DIstribution detector for Neural networks (ODIN) (Liang et al., 2018), Mahalanobis Distance Detector (MDD), Energy-Based Detector (EBD) (Liu et al., 2020), Gram Matrices (GM) (Sastry & Oore, 2020), Deep Residual Flow (DRF) (Zisselman & Tamar, 2020), ReAct (Sun et al., 2021), GEM (Liu et al., 2023), and ASH (Djurisic et al., 2023), to detect OOD samples for standard networks and retrained networks of WNB. We summarize their results in Table 1. With these different post-hoc methods, the retrained networks by WNB obtain significant improvements (0.48–29.15%) over their corresponding standard networks, showing that WNB can improve the OOD sensitivity of a standard network by merely weighting batches for loss functions without modifying their training processes. Furthermore, WNB can improve the OOD detection performance for all detectors, showing the WNB applicability to various detectors. This is because WNB prevents a network from providing high-confidence predictions for OOD samples by only weighting batches but still encourages the network to provide high-confidence predictions for ID samples. Therefore, slightly adjusting the contributions of different batches to classify ID samples would not cause the network to treat more unseen OOD samples as ID.

4.2.3 Comparison with confidence enhancement methods

The existing confidence enhancement methods improve OOD sensitivity by modifying training processes and objective functions, while WNB improves OOD sensitivity by adjusting the data, i.e., weighting batches. Accordingly, WNB can be combined with state-of-the-art confidence enhancement methods. The considered confidence enhancement methods include Joint Confidence Loss (JCL) (Lee et al., 2018), DeConf-C (Hsu et al., 2020), Deep Gamblers (DG) (Liu et al., 2019), and Minimum Other Score (MOS) (Huang & Li, 2021). DeConf-C applies its specific detector to calculate OOD scores, and the other methods apply the MSP. Self-Supervised outlier Detection (SSD) (Sehwag et al., 2021), Contrasting Shifted Instances (CSI) (Tack et al., 2020), HEAT (Lafon et al., 2023), and Dual Representation Learning (DRL) (Zhao & Cao, 2023) are not included in the experiments because they have specific data sampling methods, which causes that the proposed WNB method cannot be simply applied. We summarize the results in Table 2. By applying WNB to adjust the weights of batches, all the confidence enhancement methods can achieve further improvements (1.54–17.48%) in detecting OOD samples. This indicates that WNB is adaptive to confidence enhancement methods. Recall that confidence enhancement methods improve their OOD sensitivity by modifying training processes and objective functions. Differently, WNB improves OOD sensitivity by making networks aware of their batch discrepancies. The two methods apply different strategies to improve OOD samples from different perspectives: modifying models and adjusting batch sampling. Therefore, they are complementary and can be combined to further improve the OOD detection performance.

4.3 Effect of hyper-parameters

This section empirically shows the effect of hyper-parameters, including the number of networks n, the scale coefficient c, and the batch size b. We test the effect of the hyper-parameters by selecting n from $\{ 1, 10, 100, 1000 \}$, selecting c from $\{ 0.0001, 0.001, 0.01, 0.1 \}$, and selecting b from $\{32,64,128,256,512,1024,2048\}$. For analyzing n and c, we test the ID classification and OOD detection performance in terms of accuracy and AUROC, respectively, and perform the experiments on CIFAR10 and ResNet18. For analyzing b, we evaluate the OOD detection performance in terms of AUROC and perform the experiments on three network architectures, including ResNet18, SENet, and ShuffleNet. The training cost of a standard network serves as the baseline. The additional cost for WNB arises primarily from calculating the discrepancy. This is notably efficient because WNB does not necessitate modifying the training process or the objective functions of standard networks. In line with common practices in OOD research, we do not use a validation set for hyperparameter tuning, recognizing the diversity of OOD samples. Instead, we test our model on a range of OOD samples, employing seven distinct OOD datasets. This approach not only helps in estimating model performance but also in calculating the average OOD detection effectiveness, which is crucial for assessing the impact of hyperparameters. Such a methodology is practical in real-world scenarios, where it is often feasible to obtain a limited number of unlabeled OOD samples to evaluate a model OOD detection capabilities.

4.3.1 The number of networks n

Per Eq. (7), theoretically, an infinite number of networks from $\mathcal {H}$ is required for estimating discrepancies. However, this is not feasible in practice. Therefore, we select a value that is both sufficiently large to be effective and computationally feasible. This approach balances the theoretical requirements with practical limitations, ensuring a robust yet efficient estimation of discrepancies. The experimental results are presented in Fig. 4. The results show that increasing the number of networks n for estimating the discrepancy can improve the OOD detection performance. This is because more networks from the hypothesis class can more precisely estimate the discrepancies of batches, making the networks better avoid learning to provide high-confidence predictions for OOD samples. However, varying the number of networks n does not affect the ID classification performance, which indicates ID classification performance is independent of n. This is because networks still learn to predict labels for test samples from the same training ID dataset even though the weights of batches are different. This experimental results also match the theoretical result in Eq. (12), which indicates that the generalization error is independent of the discrepancies of batches.

4.3.2 The scale coefficient c

Based on Theorem 2 and Eq. (12), a smaller value of c tightens the upper bound on generalization error, enhancing ID classification performance. In contrast, a larger c amplifies the weight disparity across different batches, thereby improving OOD detection. Consequently, c is pivotal in balancing the trade-off between ID classification and OOD detection. A lower scale coefficient minimizes generalization error but may not markedly enhance OOD sensitivity. Conversely, a higher coefficient might induce a looser generalization bound and potentially unstable learning, impacting both OOD detection and ID classification. This nuanced approach underscores the importance of carefully selecting c to maintain equilibrium between the two objectives. The experimental results are presented in Fig. 4. The results show that decreasing the scale coefficient c can improve the OOD detection performance, and the performance diminishes when c is sufficiently small ($c < 0.001$). This is because a sufficiently small c causes all batch weights to be approximately equal, which further makes WNB degrade into the standard networks. However, increasing the scale coefficient c can decline the ID classification performance. This experimental result matches the theoretical result in Theorem 2, which reveals that a larger c can lead to a looser generalization error bound. A larger c makes the discrepancies of batches more distinct and leads to poorer ID classification performance. A network that cannot distinguish ID samples with different labels is less likely to distinguish OOD samples from ID samples. Furthermore, varying c from 0.0001 to 0.1 can be regarded as considering corresponding coefficients in learning with a fixed scale coefficient $c = 1$. A larger learning rate causes a more unstable learning process due to the large gap between the weights of batches. Therefore, both sufficiently small and large c could lead to poor OOD detection performance.

4.3.3 The batch size b

While batch size impacts the optimizer convergence, WNB primary objective is not to identify the optimal batch size but to improve OOD detection by modulating batch weights within a set size. We are investigating the performance of WNB across varying batch sizes. The experimental results are presented in Fig. 5. The results indicate that increasing the batch size can improve the OOD detection performance. This is because a larger batch size is less likely to cause sampling bias, i.e., the probability density function of a larger batch is more similar to that of the entire training ID dataset. Therefore, the samples in a larger batch can be treated as IID samples from the distribution of training ID samples with a higher probability, which discourages networks from providing high-confidence predictions for OOD samples located in the long tail of the distribution of training ID samples. Furthermore, applying the proposed WNB to weight batches can improve the OOD detection performance for any batch size, and the improvement is more significant for smaller batch sizes. This is because WNB can reduce the impact of batches with a large bias. A smaller batch size is more likely to cause bias, which indicates that WNB can adjust more batches to improve OOD sensitivity.

5 Conclusions

In this paper, we have proposed a simple and effective method, namely Weighted Non-IID Batching (WNB) by deeply characterizing the data characteristics to improve the OOD sensitivity of networks. WNB weighs each batch according to its discrepancy with the entire training ID dataset, where a batch with a higher discrepancy is assigned with a lower weight, or vice versa. The theoretical results present the generalization error bound in terms of the Rademacher complexity of a hypothesis class for real-valued networks. Based on that, WNB derives the weight function from obtaining a discrepancy-independent bound with a trade-off between ID classification and OOD detection. The experimental results show that incorporating WNB into post-hoc and confidence enhancement methods can further improve their OOD detection performance. Our approach also has its own limitations. According to the weight function and hyper-parameter analyses, the discrepancy is essential to the OOD detection performance. To balance effectiveness and efficiency, we estimate the discrepancy by randomly selecting limited classifiers and combining them with the backbone of a training network. Therefore, an interesting future direction is to estimate the discrepancy by exploring non-IID network sampling methods for a hypothesis class.

Data availability

The datasets are available at: https://github.com/Lawliet-zzl/WNB/.

Code availability

The source codes are available at: https://github.com/Lawliet-zzl/WNB/.

References

Amodei, D., Olah, C., Steinhardt, J., Christiano, P. F., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. CoRR abs/1606.06565, pp. 1–29.
Bardenet, R., Doucet, A., & Holmes, C. C. (2017). On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18, 1–47.
MathSciNet Google Scholar
Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.
MathSciNet Google Scholar
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79(1–2), 151–175.
Article MathSciNet Google Scholar
Chen, Q., Xue, B., & Zhang, M. (2022). Rademacher complexity for enhancing the generalization of genetic programming for symbolic regression. IEEE Transactions on Cybernetics, 52(4), 2382–2395.
Article Google Scholar
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing textures in the wild. in Conference on Computer Vision and Pattern Recognition.
Davis, J., & Goadrich, M. (2006). The relationship between precision-recall and ROC curves. in International Conference on Machine Learning, pp. 233–240.
Djurisic, A., Bozanic, N., Ashok, A., & Liu, R. (2023). Extremely simple activation shaping for out-of-distribution detection. in 11th International Conference on Learning Representations, pp. 1–22.
Golowich, N., Rakhlin, A., & Shamir, O. (2018). Size-independent sample complexity. Conference On Learning Theory, 75, 297–299.
Google Scholar
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. in 3th International Conference on Learning Representations, pp. 1–11.
Griffin, G., Holub, A., & Perona, P. (2006). The Caltech 256. Technical report.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. in International Conference on Machine Learning, pp. 1321–1330.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. in Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hendrycks, D., & Gimpel, K. (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks. in In: 5th International Conference on Learning Representations, pp. 1–12.
Hendrycks, D., Mazeika, M., & Dietterich, T. G. (2019). Deep anomaly detection with outlier exposure. in 7th International Conference on Learning Representations, pp. 1–18.
Hsu, Y., Shen, Y., Jin, H., & Kira, Z. (2020). Generalized ODIN detecting out-of-distribution image without learning from out-of-distribution data. in Conference on Computer Vision and Pattern Recognition, pp. 10948–10957.
Hu, J., Shen, L., Albanie, S., Sun, G., & Wu, E. (2020). Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8), 2011–2023.
Article Google Scholar
Huang, R., & Li, Y. (2021). MOS: towards scaling out-of-distribution detection for large semantic space. in Conference on Computer Vision and Pattern Recognition, pp. 8710–8719.
Khosla, A., Jayadevaprakash, N., Yao, B., & Fei-Fei, L. (2011). Novel dataset for fine-grained image categorization. in Conference on Computer Vision and Pattern Recognition Workshop.
Kifer, D., Ben-David, S., & Gehrke, J. (2004). Detecting change in data streams. in Proceedings of the Thirtieth International Conference on Very Large Data Bases, pp. 180–191.
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report.
Krleza, D., Vrdoljak, B., & Brcic, M. (2021). Statistical hierarchical clustering algorithm for outlier detection in evolving data streams. Machine Learning, 110(1), 139–184.
Article MathSciNet Google Scholar
Lafon, M., Ramzi, E., Rambour, C., & Thome, N. (2023). Hybrid energy based model in the feature space for out-of-distribution detection. International Conference on Machine Learning, 202, 18250–18268.
Google Scholar
Lee, K., Lee, K., Lee, H., & Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in Neural Information Processing Systems, 31, 7167–7177.
Google Scholar
Lee, K., Lee, H., Lee, K., & Shin, J. (2018). Training confidence-calibrated classifiers for detecting out-of-distribution samples. in 6th International Conference on Learning Representations, pp. 1–16.
Li, J., Shang, S., & Chen, L. (2021). Domain generalization for named entity boundary detection via metalearning. IEEE Transactions on Neural Networks and Learning Systems, 32, 3819–3830.
Article Google Scholar
Liang, S., Li, Y., & Srikant, R. (2018). Enhancing the reliability of out-of-distribution image detection in neural networks. in 6th International Conference on Learning Representations, pp. 1–27.
Lin, T., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. Proceedings of the European Conference on Computer Vision, 8693, 740–755.
Google Scholar
Liu, Z., Wang, Z., Liang, P. P., Salakhutdinov, R., Morency, L., & Ueda, M. (2019). Deep gamblers: Learning to abstain with portfolio theory. Advances in Neural Information Processing Systems, 32, 10622–10632.
Google Scholar
Liu, W., Wang, X., Owens, J. D., & Li, Y. (2020). Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems, 33, 21464–21475.
Google Scholar
Liu, X., Lochman, Y., & Zach, C. (2023). GEN: pushing the limits of softmax-based out-of-distribution detection. in Conference on Computer Vision and Pattern Recognition, pp. 23946–23955.
Ma, N., Zhang, X., Zheng, H., & Sun, J. (2018). Shufflenet V2: Practical guidelines for efficient CNN architecture design. in European Conferenc on Computer Vision, pp. 122–138.
Malinin, A., & Gales, M. J. F. (2018). Predictive uncertainty estimation via prior networks. Advances in Neural Information Processing Systems, 31, 7047–7058.
Google Scholar
Ming, Y., Yin, H., & Li, Y. (2022). On the impact of spurious correlation for out-of-distribution detection. in Thirty-Sixth AAAI Conference on Artificial Intelligence, pp. 10051–10059.
Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of machine learning. London: MIT Press.
Google Scholar
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. in NIPS workshop on deep learning and unsupervised feature learning, pp. 1-9.
Nilsback, M., & Zisserman, A. (2006). A visual vocabulary for flower classification. in IEEE Computer Society Conference on Computer Vision and Pattern.
Parkhi, O. M., Vedaldi, A., Zisserman, A., & Jawahar, C. V. (2012). Cats and dogs. in Conference on Computer Vision and Pattern Recognition.
Ren, J., Liu, P. J., Fertig, E., Snoek, J., Poplin, R., DePristo, M. A., Dillon, J. V., & Lakshminarayanan, B. (2019). Likelihood ratios for out-of-distribution detection. Advances in Neural Information Processing Systems, 32, 14680–14691.
Google Scholar
Salehi, M., Mirzaei, H., Hendrycks, D., Li, Y., Rohban, M. H., & Sabokrou, M. (2022). A unified survey on anomaly, novelty, open-set, and out of-distribution detection: Solutions and future challenges. in Transactions on Machine Learning Research, pp. 1-81.
Sastry, C. S., & Oore, S. (2020). Detecting out-of-distribution examples with gram matrices. in International Conference on Machine Learning, pp. 8491–8501.
Sehwag, V., Chiang, M., & Mittal, P. (2021). SSD: A unified framework for self-supervised outlier detection. in 9th International Conference on Learning Representations, pp. 1–17.
Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning From Theory to Algorithms. New York: Cambridge University Press.
Book Google Scholar
Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., & Webb, R. (2017). Learning from simulated and unsupervised images through adversarial training. in Conference on Computer Vision and Pattern Recognition, pp. 2242–2251.
Sun, Y., Guo, C., & Li, Y. (2021). ReAct: Out-of-distribution detection with rectified activations. Advances in Neural Information Processing Systems, 34, 144–157.
Google Scholar
Sun, T., Li, D., & Wang, B. (2021). Stability and generalization of decentralized stochastic gradient descent. in Thirty-Fifth AAAI Conference on Artificial Intelligence, pp. 9756–9764.
Tack, J., Mo, S., Jeong, J., & Shin, J. (2020). CSI: Novelty detection via contrastive learning on distributionally shifted instances. Advances in Neural Information Processing Systems, 33, 1–14.
Google Scholar
Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD birds-200-2011 dataset. Technical report
Wang, Y., He, P., Shi, P., & Zhang, H. (2022). Fault detection for systems with model uncertainty and disturbance via coprime factorization and gap metric. IEEE Transactions on Cybernetics, 52(8), 7765–7775.
Article Google Scholar
Yuan, Z., Chen, H., Li, T., Sang, B., & Wang, S. (2022). Outlier detection based on fuzzy rough granules in mixed attribute data. IEEE Transactions on Cybernetics, 52(8), 8399–8412.
Article Google Scholar
Zhang, H., Cissé, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). mixup: Beyond empirical risk minimization. in 6th International Conference on Learning Representations, pp. 1–13.
Zhao, Z., & Cao, L. (2023). Dual representation learning for out-of-distribution detection. Transactions on Machine Learning Research, 2023, 1–21.
Google Scholar
Zhao, Z., Cao, L., & Lin, K.-Y. (2023). Revealing the distributional vulnerability of discriminators by implicit generators. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7), 8888–8901.
Google Scholar
Zisselman, E., & Tamar, A. (2020). Deep residual flow for out-of-distribution detection. in Conference on Computer Vision and Pattern Recognition, pp. 13991–14000.

Download references

Funding

Open Access funding enabled and organized by CAUL and its Member Institutions. The work is partially sponsored by Australian Research Council Discovery and Future Fellowship grants (DP190101079 and FT190100734).

Author information

Authors and Affiliations

Data Science Lab, School of Computing and DataX Research Centre, Macquarie University, Sydney, NSW, 2109, Australia
Zhilin Zhao & Longbing Cao

Authors

Zhilin Zhao
View author publications
You can also search for this author inPubMed Google Scholar
Longbing Cao
View author publications
You can also search for this author inPubMed Google Scholar

Contributions

Zhilin Zhao: Conceptualization, Investigation, Methodology, Validation, Resources, Data Curation, Formal analysis, Writing—Original Draft, Writing—Review & Editing, Visualization; Longbing Cao: Writing—Original Draft, Writing—Review & Editing, Visualization, Supervision, Funding acquisition.

Corresponding author

Correspondence to Zhilin Zhao.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethics approval

Not applicable.

Consent to participate

The authors understand that participation is voluntary and that they are free to withdraw at any time, without giving a reason and without cost. The authors voluntarily agree to take part in this study.

Consent for publication

The authors give our consent for the publication of identifiable details which can include data and photographs to be published in Machine Learning.

Additional information

Editor: Joao Gama, Zhu Feida, Bin Yang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1 1.1 Proof of Theorem 1

We have,

$$\begin{aligned} \begin{aligned} d_{\mathcal {H}}(\mathcal {B} , \mathcal {S}) =&\sup _{h \in \mathcal {H}} \left( \vert L_{\mathcal {B}}(h) - L_{\mathcal {D}}(h) + L_{\mathcal {D}}(h) - L_{\mathcal {S}}(h)\vert \right) \\ \le&\sup _{h \in \mathcal {H}} \left( \vert L_{\mathcal {B}}(h) - L_{\mathcal {D}}(h) \vert \right) + \sup _{h \in \mathcal {H}} \left( \vert L_{\mathcal {D}}(h) - L_{\mathcal {S}}(h)\vert \right) \\ \le&a\sqrt{\frac{2}{b} \log \frac{4}{\delta }} + a\sqrt{\frac{2}{m} \log \frac{4}{\delta }}, \\ \end{aligned} \end{aligned}$$

(A1)

where the first inequality follows the absolute value inequality and the second inequality is owing to Hoeffding’s inequality (Shalev-Shwartz & Ben-David, 2014).

Appendix 2 1.1 Proof of Theorem 2

According to the generalization bound with respect to the Rademacher complexity (Bartlett & Mendelson, 2002), for any $h \in \mathcal {H}$, with probability at least $1 - \delta$,

$$\begin{aligned} \vert L_\mathcal {D}(h) - L_\mathcal {S}(h) \vert \le 2 \mathcal {R}\left( l \circ \mathcal {H} \circ \mathcal {S}\right) + 4a \sqrt{\frac{2 \ln (4 / \delta )}{m}}, \end{aligned}$$

(B2)

where $\mathcal {R}(l \circ \mathcal {H} \circ \mathcal {S})$ is the Rademacher complexity of $\mathcal {H}$ with respect to l and $\mathcal {S}$. According to the Talagrand’s contraction lemma (Mohri et al., 2018),

$$\begin{aligned} \mathcal {R}(l \circ \mathcal {H} \circ \mathcal {S})\le L \mathbb {E}_{\sigma \in \{\pm 1\}^{m}} \left[ \sup _{h \in \mathcal {H}} \sum _{(\textbf{x},y) \in \mathcal {S}} \sigma _i h(\textbf{x}) \right] , \end{aligned}$$

(B3)

where $\sigma$ is a Rademacher random variable. According to the Radermacher bound for neural networks (Golowich et al., 2018),

$$\begin{aligned} \mathbb {E}_{\sigma \in \{\pm 1\}^m} \left[ \sup _{h \in \mathcal {H}} \sum _{(\textbf{x},y) \in \mathcal {S}} \sigma _i h(\textbf{x}) \right] \le \sqrt{m} B(\sqrt{2d \ln 2} + 1) \prod _{i = 1}^d M_i. \end{aligned}$$

(B4)

Substituting Eq. (B3) and Eq. (B4) into Eq. (B2), we obtain a upper bound with respect to $\delta$ for $\vert L_\mathcal {D}(h) - L_\mathcal {S}(h) \vert$ and define as $\mathfrak {B}_1(\delta )$, i.e.,

$$\begin{aligned} \vert L_\mathcal {D}(h) - L_\mathcal {S}(h) \vert \le \mathfrak {B}_1(\delta ). \end{aligned}$$

(B5)

For any $h \in \mathcal {H}$,

$$\begin{aligned} \begin{aligned} \vert L_\mathcal {S}^{W}(h) - L_\mathcal {S}(h) \vert =&\left| \frac{1}{\lceil m/b \rceil } \sum _{i = 1}^{\lceil m/b \rceil } \left[ g \circ d_{\mathcal {H}}(\mathcal {B}_i, S) \right] L_{\mathcal {B}_i}(h) - L_\mathcal {S}(h) \right| \\ \le&\frac{1}{\lceil m/b \rceil } \sum _{i = 1}^{\lceil m/b \rceil } \left[ g \circ d_{\mathcal {H}}(\mathcal {B}_i, \mathcal {S}) \right] \left| L_{\mathcal {B}_i}(h) - L_\mathcal {S}(h) \right| \\ =&\frac{1}{\lceil m/b \rceil } \sum _{i = 1}^{\lceil m/b \rceil } \left[ g \circ d_{\mathcal {H}}(\mathcal {B}_i, \mathcal {S}) \right] d_{\mathcal {H}}(\mathcal {B}_i, \mathcal {S}) \triangleq \mathfrak {B}_2, \end{aligned} \end{aligned}$$

(B6)

where the first inequality use the absolute value inequality and the second inequality is owning to Eq. (7). Accordingly, with probability at least $1 - \delta$,

$$\begin{aligned} \begin{aligned} L_\mathcal {D}(\widehat{h}) \le&L_\mathcal {S}(\widehat{h}) + \mathfrak {B}_1(\delta ) \\ \le&L_\mathcal {S}^{W}(\widehat{h}) + \mathfrak {B}_1(\delta ) + \mathfrak {B}_2 \\ \le&L_\mathcal {S}^{W}(h^*) + \mathfrak {B}_1(\delta ) + \mathfrak {B}_2 \\ \le&L_{\mathcal {S}}(h^*) + \mathfrak {B}_1(\delta ) + 2\mathfrak {B}_2 \\ \le&L_{\mathcal {D}}(h^*) + 2\mathfrak {B}_1(\delta / 2) + 2\mathfrak {B}_2,\\ \end{aligned} \end{aligned}$$

(B7)

where the first and fifth inequality use Eq. (B5); the second and fourth inequality use Eq. (B6); the third inequality is because $\widehat{h}$ is the minimizer of $L_\mathcal {S}^{W}(h)$.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Zhao, Z., Cao, L. Weighting non-IID batches for out-of-distribution detection. Mach Learn 113, 7371–7391 (2024). https://doi.org/10.1007/s10994-024-06605-z

Download citation

Received: 23 February 2023
Revised: 30 January 2024
Accepted: 04 August 2024
Published: 19 August 2024
Issue Date: October 2024
DOI: https://doi.org/10.1007/s10994-024-06605-z

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Weighting non-IID batches for out-of-distribution detection

Abstract

Similar content being viewed by others

AUAAC: Area Under Accuracy-Accuracy Curve for Evaluating Out-of-Distribution Detection

Multi-label Adaptive Batch Selection by Highlighting Hard and Imbalanced Samples

Out-of-distribution Detection with Boundary Aware Learning

Explore related subjects

1 Introduction

2 Related work

2.1 Post-hoc methods

2.2 Confidence enhancement methods

3 Weighted non-IID batching

3.1 Preliminaries

3.2 Objective function

3.3 Dataset discrepancy

Theorem 1

3.4 Generalization error bound

Theorem 2

3.5 Weight function

4 Experiments

4.1 Experiment settings

4.1.1 Datasets

4.1.2 Network architectures

4.1.3 Metrics

4.2 Performance comparison

4.2.1 Ablation study

4.2.2 Comparison with post-hoc methods

4.2.3 Comparison with confidence enhancement methods

4.3 Effect of hyper-parameters

4.3.1 The number of networks n

4.3.2 The scale coefficient c

4.3.3 The batch size b

5 Conclusions

Data availability

Code availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Ethics approval

Consent to participate

Consent for publication

Additional information

Publisher's Note

Appendices

Appendix 1

1.1 Proof of Theorem 1

Appendix 2

1.1 Proof of Theorem 2

Rights and permissions

About this article

Cite this article

Share this article

Keywords