License: arXiv.org perpetual non-exclusive license
arXiv:2401.02611v1 [cs.CV] 05 Jan 2024

MOODv2: Masked Image Modeling for Out-of-Distribution Detection

Jingyao Li, Pengguang Chen, Shaozuo Yu, Shu Liu,  and Jiaya Jia Jingyao Li and Shaozuo Yu are with the Department of Computer Science and Engineering of the Chinese University of Hong Kong (CUHK)
Jiaya Jia’s E-mail: [email protected] Pengguang Chen, Shu Liu and Jiaya Jia are with SmartMore.
Abstract

The crux of effective out-of-distribution (OOD) detection lies in acquiring a robust in-distribution (ID) representation, distinct from OOD samples. While previous methods predominantly leaned on recognition-based techniques for this purpose, they often resulted in shortcut learning, lacking comprehensive representations. In our study, we conducted a comprehensive analysis, exploring distinct pretraining tasks and employing various OOD score functions. The results highlight that the feature representations pre-trained through reconstruction yield a notable enhancement and narrow the performance gap among various score functions. This suggests that even simple score functions can rival complex ones when leveraging reconstruction-based pretext tasks. Reconstruction-based pretext tasks adapt well to various score functions. As such, it holds promising potential for further expansion. Our OOD detection framework, MOODv2, employs the masked image modeling pretext task. Without bells and whistles, MOODv2 impressively enhances 14.30% AUROC to 95.68% on ImageNet and achieves 99.98% on CIFAR-10.

Index Terms:
Computer Vision, Out-of-Distribution Detection, Outlier Detection, Masked Image Modeling

1 Introduction

Areliable visual recognition system not only provides correct predictions on known context (also known as in-distribution data) but also detects unknown out-of-distribution (OOD) samples and rejects (or transfers) them to human intervention for safe handling. This motivates the applications of outlier detectors before feeding input to the downstream networks, which is the main task of OOD detection, also referred to as novelty or anomaly detection. OOD detection is the task of identifying whether a test sample is drawn far from the in-distribution (ID) data or not. It is at the cornerstone of various safety-critical applications, including medical diagnosis [1], fraud detection [2], autonomous driving [3], etc. A representative in-distribution feature space representation is crucial for out-of-distribution detection. A well-crafted feature representation significantly enhances the performance via most mainstream OOD detection score functions. Our research is dedicated to refining feature representations tailored for OOD detection, with the aim of advancing the entire field.

Existing methods perform contrastive learning [4, 5] or pretrain classification on a large dataset [6, 7, 8, 9] to detect OOD samples. The former methods classify images according to the pseudo labels while the latter classifies images based on ground truth, whose core tasks are both to fulfill the classification target. However, research on backdoor attack [10, 11] shows that when learning is represented by classifying data, networks tend to take a shortcut to classify images. In a typical backdoor attack scene [11], the attacker adds secret triggers on original training images with the visibly correct label. During the course of testing, the victim model classifies images with secret triggers into the wrong category. Research in this area demonstrates that networks only learn specific distinguishable patterns of different categories because it is a shortcut to fulfill the classification requirement. Nonetheless, learning these patterns is ineffective for OOD detection. Thus, learning representations by classifying ID data for OOD detection may not be satisfying. For example, when patterns similar to some ID categories appear in OOD samples, the network could easily interpret these OOD samples as the ID data and classify them into the wrong ID categories, as shown in Fig. 2.

Refer to caption
Figure 1: The average AUROC (%) tested on four OOD datasets applied to a ViT model with different pre-text tasks. Methods in blue use the feature space; methods in green use logits; methods in yellow use the softmax probability; and methods in red use both features and logits. The stars show the average performance of a category of methods.
Refer to caption
Figure 2: Comparison of reconstruction-based and classification-based methods. In the context of image classification, networks often take a shortcut when categorizing images [10, 11]. For example, ears are a distinctive feature for distinguishing between cats and dogs, and a classification model typically assumes that animals with pointed ears are cats, while those without are dogs. Consequently, when the network encounters an out-of-distribution animal, such as a fox with pointed ears, it readily misclassifies it as a cat. In contrast, reconstruction-based tasks effectively mitigate this issue. By randomly masking portions of images, the model avoids learning localized, stereotypical features (e.g., masked ears), thus preventing shortcuts and instead acquiring effective pixel-level representations for ID data. This significantly improves the model’s ability to detect OOD instances.

To remedy this issue, we introduce the reconstruction-based pretext task. Different from contrastive learning in existing OOD detection approaches [4, 5], our method forces the network to achieve the training purpose of reconstructing the image and thus makes it learn pixel-level feature representation. Specifically, we adopt the masked image modeling (MIM) [12] as our self-supervised pretext task, which has been demonstrated to have great potential in both natural language processing [13] and computer vision [12, 14]. In the MIM task, a proportion of image patches are randomly masked. The network learns information from the remaining patches to speculate the masked patches and restore tokens of the original image. The reconstruction process enables the model to learn from the prior effective ID feature representation rather than just learning different patterns among categories in the classification process. In our work, we observed that the pre-trained models effectively reconstruct ID images, whereas they exhibit distinct domain differences when it comes to the OOD domain (Fig. 4). This visual discrepancy clearly underscores the existing domain gap in model features between ID and OOD data, offering valuable insights for OOD detection.

To validate the effectiveness of our ID feature representation, we conduct experiments to test its performance with various mainstream OOD detection score functions. We employed OOD score functions encompassing probability-based [15, 16], logits-based [17, 16], features-based [7, 18, 19], and hybrid methods utilizing both logits and features [7]. In the context of a comparative analysis spanning classic classification [20], contrastive learning [21, 22], and masked image modeling pretext tasks [12, 23], our findings underscore the dominant role of reconstruction-based strategies in the field of OOD detection, as illustrated in Fig. 1.

Furthermore, we conduct a comprehensive analysis of the experimental results and observe that our approach not only significantly improves the overall results but also substantially reduces the disparities among score functions. This observation underscores that even simple score functions can perform on par with more complex ones when a representative ID feature representation is utilized. These findings further emphasize the critical importance of effective feature representation in OOD detection. More details are in Sec. 3.2. Ultimately, MOODv2 demonstrates remarkable enhancements, achieving a substantial 14.30% increase, reaching 95.68% AUROC on ImageNet. On CIFAR-10, our results significantly improved to an impressive 99.98%, marking a notable 0.35% enhancement compared to the previous state-of-the-art.

2 Related Works

2.1 Out-of-distribution Detection

Many scoring functions have been developed by researchers to distinguish between in-distribution and out-of-distribution examples. These functions are designed to exploit properties that are typically exhibited by ID examples but violated by OOD examples, and vice versa. These scores are primarily derived from three sources:

  1. 1.

    Probability-based: This category includes measures like the maximum softmax probabilities [15] and the minimum KL-divergence between the softmax and the mean class-conditional distributions [16], etc.

  2. 2.

    Logit-based: These functions rely on maximum logits [16] and the logsumexpnormal-logsumexp\mathrm{logsumexp}roman_logsumexp function computed over logits [17], etc.

  3. 3.

    Feature-based: These functions involve the norm of the residual between a feature and the pre-image of its low-dimensional embedding [24] and the minimum Mahalanobis distance between a feature and the class centroids [19], among others.

After a thorough analysis of the performance and their correlations with various score functions and pretext tasks, our work follows the hybrid methods combining logit and feature [7] and includes the reconstruction-based methods as a pretext task. We will explain the implementation details later in this paper.

2.2 Self-Supervised Pretext Task

In the ever-evolving landscape of computer vision and deep learning, a multitude of strategies and techniques have been devised to enhance the capacity of models to understand and process visual data:.

  1. 1.

    Classification task: Vision models are pre-trained via classical classification task [20].

  2. 2.

    Contrastive Learning tasks: MOCOv3 [21] and DINOv2 [22] are advanced contrastive learning methods used for self-supervised representation learning. These methods focus on learning representations by contrasting positive pairs (e.g., different augmentations of the same image) with negative pairs (e.g., augmentations from different images). MOCOv3 extends the MOCO framework [25] with a momentum encoder and dynamic queues for improved performance. DINOv2 introduces a clustered teacher network and an asymmetric loss to learn efficient representations.

  3. 3.

    Masked Image Modeling Tasks: Data-Efficient Image Transformer (BEiT series [12, 23]) are self-supervised learning tasks that involve masked image modeling. In these tasks, a portion of an image is randomly masked, and the model’s objective is to predict the masked pixels, effectively filling in the blanks.

These methods and tasks represent cutting-edge approaches in the field of computer vision and deep learning. They have led to substantial improvements in the ability of models to learn useful visual representations from unlabeled data, enabling better performance on various downstream vision tasks.

Multiple existing methods take advantage of self-supervised tasks to guide the learning of representation for OOD detection. Previous work [4, 5] presents contrastive learning models as feature extractors. However, existing approaches of classifying transformed images according to contrastive learning possess similar limitations – that is, the model tends to learn the specific patterns of categories [10, 26], which are beneficial for classification but do not help understand the intrinsic ID representation. In our work, we address this issue by performing the masked image modeling task for OOD detection.

2.3 Training Strategy

Numerous approaches have been developed to address OOD-awareness in training loss [27]. These methods often involve the introduction of regularization terms aimed at encouraging a clearer separation between ID and OOD features [28, 29]. In some cases, networks are augmented with confidence estimation branches, utilizing misclassified in-distribution examples as proxies for out-of-distribution ones [27]. MOS [29] adapts the loss function by incorporating a predefined group structure, enabling the minimum group-wise “else” class probability to serve as an indicator of OOD classification. An alternative approach [28] focuses on compelling ID samples to embed into a union of 1-dimensional subspaces during training, and it evaluates the minimum angular distance between the feature and class-wise subspaces.

In contrast to these approaches, our method belongs to the lightweight training-free methods [7, 30], which doesn’t necessitate retraining the model. Therefore, it not only offers a more straightforward application but also preserves the accuracy of in-distribution classification.

Refer to caption
(a) ImageNet
Refer to caption
(b) CIFAR-10
Figure 3: The AUROC (%) of MOODv2 and MOODv1 tested on four OOD datasets, including OpenImage-O [31], Texture [32], iNaturalist [33], and ImageNet-O [34].
Refer to caption
Figure 4: Each image pair consists of the original image (left) and reconstructed image (right). The rows of images are sourced from ImageNet [35], Texture [36], iNaturalist [37], ImageNet-O [38], and OpenImage-O [31]. The number in the top left corner of each image pair represents the Euclidean distance between the two images.
Methods prob feat logit feat+logit
ViT[20] 73.61±plus-or-minus\pm±21.36 82.61±plus-or-minus\pm±23.81 45.11±plus-or-minus\pm±4.45 99.63
MoCov3[21] 70.96±plus-or-minus\pm±23.68 79.17±plus-or-minus\pm±28.75 41.42±plus-or-minus\pm±3.50 99.73
DINOv2[22] 87.20±plus-or-minus\pm±10.62 84.73±plus-or-minus\pm±21.57 80.30±plus-or-minus\pm±0.10 99.98
BEiTv2[23] 79.96±plus-or-minus\pm±13.71 91.77±plus-or-minus\pm±11.47 72.87±plus-or-minus\pm±2.08 99.87
BEiT[12] 77.51±plus-or-minus\pm±17.83 89.05±plus-or-minus\pm±15.46 65.05±plus-or-minus\pm±2.06 99.98
(a) ID: CIFAR-10
Methods prob feat logit feat+logit
ViT[20] 78.52±plus-or-minus\pm±1.76 76.86±plus-or-minus\pm±3.20 70.61±plus-or-minus\pm±4.76 77.65
MoCov3[21] 78.36±plus-or-minus\pm±1.42 72.51±plus-or-minus\pm±6.21 70.61±plus-or-minus\pm±5.04 72.07
DINOv2[22] 59.64±plus-or-minus\pm±7.82 63.56±plus-or-minus\pm±2.89 60.70±plus-or-minus\pm±4.51 61.32
BEiTv2[23] 89.07±plus-or-minus\pm±0.24 92.96±plus-or-minus\pm±1.27 90.29±plus-or-minus\pm±0.13 95.42
BEiT[12] 89.47±plus-or-minus\pm±0.47 93.30±plus-or-minus\pm±1.89 89.84±plus-or-minus\pm±0.01 95.68
(b) ID: ImageNet
TABLE I: The AUROC (%) of four types of methods: probability-based methods MSP [15] and KL-Matching [16]; logits-based methods Energy [17] and MaxLogit [16]; features-based methods Residual [7], React [18] and Mahalanobis [19]; and methods using both logits and features include ViM [7]. The best method for each model is emphasized in bold.

2.4 MOODv1

Our previous version MOODv1 [30] has introduced masked image modeling pretraining strategy into the OOD detection (MOOD) and achieved promising results. However, there are still concerns:

Firstly, previous studies [30, 4, 5] have typically necessitated fine-tuning a model on each in-distribution dataset. The expense of training becomes notably high when dealing with a substantial number of ID datasets to be assessed, such as in one-class OOD detection [4, 30]. However, through experimental validation, we have discovered that a well-prepared masked image modeling model doesn’t require additional fine-tuning to achieve outstanding detection performance, conserving substantial fine-tuning resource consumption when dealing with a plethora of ID datasets that require evaluation.

Secondly, as the field has seen the emergence of more advanced OOD score functions [16, 7, 18, 15, 17, 19] and pretraining techniques [23, 22, 21, 12, 20], it raises the question of whether masked image modeling continues to maintain its leading role. In MOODv2, we integrate the latest advancements in pretraining methods and conduct experiments with an array of state-of-the-art OOD score functions. This broader spectrum of pretraining methods and score functions allows for a more comprehensive assessment of the MOODv2’s performance, better aligning MOODv2 with the increasingly intricate challenges of OOD detection.

Lastly, it is well known that if the network has seen similar samples in training, regardless of pre-training or fine-tuning, the OOD performance will be more or less trivial [31]. Previous works [6, 30] rely on pre-training on ImageNet-21K, so that the benchmark OOD dataset such as CIFAR [39], Places [40], etc., is unlikely to be untouched by the ImageNet-21K [35] dataset. In this work, MOODv2 introduces the latest unnatural datasets as OOD, which rules out the possibility of overlap between the OOD test set and the training set [31, 34].

In summary, MOODv2 incorporates improved score functions, advanced pretraining techniques, a wider range of unnatural OOD datasets, and a streamlined general framework. The performance improvement of MOODv2 compared to MOODv1 is depicted in Fig. 3. On ImageNet, MOODv2 exhibits a noteworthy 2.17% improvement in AUROC compared to MOODv1. Furthermore, on CIFAR-10, MOODv2, without finetuning on the ID dataset, achieves an exceptional AUROC score of up to 99.98%.

3 Methods

In this section, we initiate our exploration of reconstruction tasks for OOD detection by presenting the underlying motivation in Sec. 3.1. Following that, in Sec. 3.2, we delve into a comprehensive analysis of the essential attributes that play a pivotal role in OOD detection.

3.1 Motivation: seeking for effective ID representation

Most previous OOD methods learn the ID representation through classification [15, 6] or contrastive learning [4, 5] on ID samples, which take advantage of either the ground truth or pseudo labels to supervise the classification networks. On the other hand, work of [10, 11] shows that classification networks only learn different patterns among training categories because it is a shortcut to fulfill classification. It is indicated that the network actually does not learn the effective in-distribution representation. In comparison, the reconstruction-based pretext task forces the network to learn the pixel-level image representation of the ID images during training to reconstruct the image instead of the patterns for classification. In this way, the network can learn a more representative feature of the ID dataset.

To verify this, we reconstruct ID and OOD data and compute the Euclidean distance between the original and reconstructed images. A greater distance indicates a larger deviation of the reconstructed image from the original image. We collect recovery distances for ID and OOD data. Examples of the reconstruction are depicted in Fig. 4. In the first row, for ID images, pre-trained models reconstruct the images effectively. Instead, for unnatural OOD images in the following rows, clear domain discrepancies emerge. For instance, in the case of textured images, the models still apply lighting and shadows reminiscent of natural images. In the case of sketch images, the models render the images smoother and brighter. This discrepancy visually highlights the domain gap in model features between ID and OOD data, which can be leveraged for OOD detection.

Refer to caption
(a) ID: CIFAR-10
Refer to caption
(b) ID: ImageNet
Figure 5: The AUROC (%) tested on unnatural OOD datasets of various OOD detection algorithms applied to a ViT model. The pre-text tasks include classification task [20], contrastive learning tasks MoCov3 [21] and DINOv2 [22], and masked image modeling tasks BEiT series [12, 23]. Methods in blue utilize the feature space; methods in green use logits; methods in yellow make use of the softmax probability. and methods in red leverage both features and logits. Stars represent the average AUROC for methods in the corresponding colors; light vertical lines represent the standard deviation.

3.2 Reconstruction Tasks for OOD Detection

Methods Models Texture [32] iNaturalist [33] ImageNet-O [34] OpenImage-O [31] Average
AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow
MSP[15] ViT[20] 71.31 71.31 90.70 90.70 60.77 60.77 84.29 84.29 76.77 76.77
MoCov3[21] 66.85 66.85 90.68 90.68 64.80 64.80 85.42 85.42 76.94 76.94
DINOv2[22] 47.49 47.49 62.13 62.13 44.87 44.87 52.83 52.83 51.83 51.83
BEiTv2[23] 85.61 85.61 96.05 96.05 81.15 81.15 92.52 92.52 88.83 88.83
BEiT[12] 85.05 85.05 95.50 95.50 83.17 83.17 92.28 92.28 89.00 89.00
Energy[17] ViT[20] 54.11 54.11 76.61 76.61 61.63 61.63 71.06 71.06 65.85 65.85
MoCov3[21] 48.79 48.79 76.80 76.80 64.56 64.56 72.13 72.13 65.57 65.57
DINOv2[22] 73.89 73.89 80.34 80.34 49.98 49.98 56.64 56.64 65.21 65.21
BEiTv2[23] 85.32 85.32 96.95 96.95 85.27 85.27 94.14 94.14 90.42 90.42
BEiT[12] 83.04 83.04 96.48 96.48 86.36 86.36 93.50 93.50 89.85 89.85
MaxLogit[16] ViT[20] 67.22 67.22 89.88 89.88 61.68 61.68 82.73 82.73 75.37 75.37
MoCov3[21] 62.36 62.36 90.38 90.38 65.65 65.65 84.19 84.19 75.64 75.64
DINOv2[22] 54.70 54.70 69.98 69.98 45.60 45.60 54.52 54.52 56.20 56.20
BEiTv2[23] 85.94 85.94 96.90 96.90 83.97 83.97 93.82 93.82 90.16 90.16
BEiT[12] 84.17 84.17 96.48 96.48 85.34 85.34 93.31 93.31 89.83 89.83
KL-Matching[16] ViT[20] 82.59 82.59 87.63 87.63 66.55 66.55 84.34 84.34 80.28 80.28
MoCov3[21] 82.35 82.35 86.24 86.24 67.80 67.80 82.73 82.73 79.78 79.78
DINOv2[22] 80.51 80.51 56.93 56.93 69.77 69.77 62.63 62.63 67.46 67.46
BEiTv2[23] 87.14 87.14 95.13 95.13 82.87 82.87 92.10 92.10 89.31 89.31
BEiT[12] 87.87 87.87 94.82 94.82 84.56 84.56 92.48 92.48 89.93 89.93
Residual[7] ViT[20] 82.39 82.39 73.72 73.72 68.44 68.44 74.88 74.88 74.86 74.86
MoCov3[21] 75.25 75.25 73.80 73.80 57.69 57.69 67.82 67.82 68.64 68.64
DINOv2[22] 66.50 66.50 61.90 61.90 58.94 58.94 56.84 56.84 61.04 61.04
BEiTv2[23] 94.99 94.99 99.01 99.01 87.23 87.23 95.43 95.43 94.17 94.17
BEiT[12] 94.16 94.16 99.50 99.50 89.35 89.35 96.52 96.52 94.88 94.88
React[18] ViT[20] 62.09 62.09 91.20 91.20 63.66 63.66 80.43 80.43 74.34 74.34
MoCov3[21] 51.47 51.47 79.30 79.30 65.33 65.33 74.35 74.35 67.61 67.61
DINOv2[22] 76.73 76.73 74.25 74.25 56.26 56.26 63.17 63.17 67.60 67.60
BEiTv2[23] 86.10 86.10 98.09 98.09 85.69 85.69 94.96 94.96 91.21 91.21
BEiT[12] 84.32 84.32 96.99 96.99 87.04 87.04 94.21 94.21 90.64 90.64
Mahalanobis[19] ViT[20] 84.93 84.93 84.90 84.90 71.53 71.53 84.16 84.16 81.38 81.38
MoCov3[21] 84.29 84.29 86.95 86.95 70.33 70.33 83.54 83.54 81.28 81.28
DINOv2[22] 68.58 68.58 63.14 63.14 58.86 58.86 57.57 57.57 62.04 62.04
BEiTv2[23] 93.01 93.01 98.78 98.78 86.78 86.78 95.46 95.46 93.51 93.51
BEiT[12] 93.03 93.03 99.18 99.18 88.84 88.84 96.51 96.51 94.39 94.39
ViM[7] ViT[20] 83.51 83.51 77.75 77.75 71.04 71.04 78.31 78.31 77.65 77.65
MoCov3[21] 76.28 76.28 78.18 78.18 61.35 61.35 72.46 72.46 72.07 72.07
DINOv2[22] 66.90 66.90 62.53 62.53 58.93 58.93 56.93 56.93 61.32 61.32
BEiTv2[23] 95.35 95.35 99.31 99.31 90.06 90.06 96.96 96.96 95.42 95.42
BEiT[12] 94.25 94.25 99.59 99.59 91.47 91.47 97.41 97.41 95.68 95.68
Average ViT[20] 73.52 73.52 84.05 84.05 65.66 65.66 80.02 80.02 75.81 75.81
MoCov3[21] 68.45 68.45 82.79 82.79 64.69 64.69 77.83 77.83 73.44 73.44
DINOv2[22] 66.91 66.91 66.40 66.40 55.40 55.40 57.64 57.64 61.59 61.59
BEiTv2[23] 89.18 89.18 97.53 97.53 85.38 85.38 94.42 94.42 91.63 91.63
BEiT[12] 88.24 88.24 97.32 97.32 87.02 87.02 94.53 94.53 91.77 91.77
Best ViT[20] 84.93 84.93 91.20 91.20 71.53 71.53 84.34 84.34 81.38 81.38
MoCov3[21] 84.29 84.29 90.68 90.68 70.33 70.33 85.42 85.42 81.28 81.28
DINOv2[22] 80.51 80.51 80.34 80.34 69.77 69.77 63.17 63.17 67.60 67.60
BEiTv2[23] 95.35 95.35 99.31 99.31 90.06 90.06 96.96 96.96 95.42 95.42
BEiT[12] 94.25 94.25 99.59 99.59 91.47 91.47 97.41 97.41 95.68 95.68
TABLE II: Performance of OOD detection methods on ViT-B/16 model with 224×224224224224\times 224224 × 224-pixel inputs. The pre-text tasks include classification task [20], contrastive learning tasks MoCov3 [21] and DINOv2 [22], and masked image modeling tasks BEiT [12] and BEiTv2 [23]. All models are per-trained on ImageNet-21k and finetuned on ImageNet-1k. Both metrics AUROC and FPR95 are in percentage. The best method is emphasized in bold and a gray background indicates our choice.

In this section, we offer a comprehensive analysis of these key elements in the context of OOD detection. We employ ImageNet [35] as the in-distribution dataset and evaluate pre-task texts on challenging unnatural out-of-distribution datasets, including OpenImage-O [31], Texture [32], iNaturalist [33], and ImageNet-O [34]. Extensive validations with various pretraining methods and OOD score functions, including MSP [15], Energy [17], ODIN [41], MaxLogit [16], KL Matching [16], Residual [7], ReAct [18], Mahalanobis [19] and ViM [7].

Results are shown in  Tab. II. The results indicate that the masked image modeling pretext task surpasses classification and contrastive learning pretext tasks when employing all included score functions. The average AUROC across these score functions exhibits an improvement of 15.96% compared to the competition. Models when using the best-performing score function saw a 14.30% increase in performance. This remarkable achievement can be attributed to the representative ID feature space representation, thereby aiding in distinguishing between ID and OOD data. This discovery is highly significant as it enhances performance across mainstream OOD detection score functions, thus advancing the entire field. We also employ CIFAR-10 [39] as the ID dataset and provide results in the appendix. Our approach attains an impressive AUROC of 99.99% while concurrently reducing the FPR95 to a mere 0.03%.

To enhance the comprehensibility of our experimental findings, we conduct a thorough statistical analysis and illustrate them in visual representations. The outcomes are depicted in Fig. 5. Our approach not only leads to an overall enhancement in results but also notably minimizes the variations among different methods. For instance, the ViT, MoCov3, and DINOv2 models using logit-based methods exhibited standard deviations of 4.76%, 5.04%, and 4.51%, respectively, while BEiT and BEiTv2 displayed significantly lower standard deviations, reaching as low as 0.13% and 0.01%. This observation underscores that even uncomplicated score functions can perform equivalently to more intricate ones when an effective ID feature representation is applied.

In Tab. I, we underscore the optimal methods for each model. On CIFAR-10, all models achieved their best results when employing the feat and logit combination approach, achieving almost 100% accuracy. This suggests a highly effective grasp of CIFAR-10’s feature space. Conversely, with the larger ImageNet dataset, we observed variations in outcomes. Notably, the masked image modeling pretext-pretrained model achieved the best results when using the feat and logit combination method, while other models excelled in probability-based and feature-based methods. Additionally, our masked image modeling pretext demonstrated significantly superior performance compared to other pretraining methods, underscoring the limitations of classification-based pretraining strategies and their inadequacy in harnessing advanced score functions effectively. These discoveries reinforce the pivotal role of proficient feature representation in OOD detection. Furthermore, for more detailed information, we provide illustrations of the distribution curves of OOD scores for both ID and OOD datasets in the appendix.

3.3 Masked Image Modeling for Out-of-Distribution v2

To sum up, in this section, we observed that pre-trained models adeptly reconstruct ID images, yet manifest distinctive domain differences in the OOD scenario (Fig. 4). This visual incongruity starkly highlights the prevailing domain gap in model features between ID and OOD data. Additionally, a thorough analysis of experimental outcomes reveals that the pre-task of masked image modeling not only significantly enhances overall results but also markedly diminishes disparities among score functions. These findings emphasize the crucial significance of effective feature representation in OOD detection, highlighting the enhancement of features through masked image modeling tasks.

Finally, we propose our Masked Image Modeling for Out-of-Distribution Detection v2 (MOODv2). The algorithm of is shown in Algorithm 1, mainly including the following stages.

  1. 1.

    Pre-train the vision encoder with masked image modeling on the pretrain dataset.

  2. 2.

    Apply fine-tuning the backbone on the in-distribution dataset.

  3. 3.

    Extract features from the trained image encoder and calculate the OOD score distance score function for OOD detection.

In terms of the OOD score function, we adopt ViM[7] that combines features and logits, leveraging insights from the masked image modeling pre-trained model, which has demonstrated superior performance. Mathematically, the score is

s(x)=eαxTRRTxi=1Celi+eαxTRRTx.s𝑥superscript𝑒𝛼superscript𝑥𝑇𝑅superscript𝑅𝑇𝑥superscriptsubscript𝑖1𝐶superscript𝑒subscript𝑙𝑖superscript𝑒𝛼superscript𝑥𝑇𝑅superscript𝑅𝑇𝑥\text{s}(x)=\frac{e^{\alpha\sqrt{x^{T}RR^{T}x}}}{\sum_{i=1}^{C}e^{l_{i}}+e^{% \alpha\sqrt{x^{T}RR^{T}x}}}.s ( italic_x ) = divide start_ARG italic_e start_POSTSUPERSCRIPT italic_α square-root start_ARG italic_x start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_R italic_R start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_x end_ARG end_POSTSUPERSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_C end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + italic_e start_POSTSUPERSCRIPT italic_α square-root start_ARG italic_x start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_R italic_R start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_x end_ARG end_POSTSUPERSCRIPT end_ARG . (1)

where lisubscript𝑙𝑖l_{i}italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the i𝑖iitalic_i-th logit of feature x𝑥xitalic_x in the training set X𝑋Xitalic_X; α𝛼\alphaitalic_α is a per-model constant; RN×(ND)𝑅superscript𝑁𝑁𝐷R\in\mathbb{R}^{N\times(N-D)}italic_R ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × ( italic_N - italic_D ) end_POSTSUPERSCRIPT is the (D+1)𝐷1(D+1)( italic_D + 1 )-th column to the last column of the eigenvector matrix Q𝑄Qitalic_Q of X𝑋Xitalic_X and N𝑁Nitalic_N is the principal dimension; C𝐶Citalic_C is the number of classes.

Algorithm 1 MOODv2 Detection Algorithm
1:Pre-train set XPsubscript𝑋𝑃X_{P}italic_X start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT, in-distribution set XIDsubscript𝑋IDX_{\rm ID}italic_X start_POSTSUBSCRIPT roman_ID end_POSTSUBSCRIPT, test set Xtestsubscript𝑋testX_{\rm test}italic_X start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT, required True Positive Rate η𝜂\etaitalic_η%, backbone f𝑓fitalic_f.
2:Is xtestsubscript𝑥testx_{\rm test}italic_x start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT outlier or not? xtestXtestfor-allsubscript𝑥testsubscript𝑋test\forall x_{\rm test}\in X_{\rm test}∀ italic_x start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT.
3:Pre-train f𝑓fitalic_f on XPsubscript𝑋𝑃X_{P}italic_X start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT by maximizing
xXP𝔼M[iMlogpMIM(z|xM)]subscript𝑥subscript𝑋𝑃subscript𝔼𝑀delimited-[]subscript𝑖𝑀subscript𝑝MIMconditional𝑧superscript𝑥𝑀\sum_{x\in X_{P}}\mathbb{E}_{M}\left[\sum_{i\in M}\log p_{\rm MIM}(z|x^{M})\right]∑ start_POSTSUBSCRIPT italic_x ∈ italic_X start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_i ∈ italic_M end_POSTSUBSCRIPT roman_log italic_p start_POSTSUBSCRIPT roman_MIM end_POSTSUBSCRIPT ( italic_z | italic_x start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT ) ]
.
4:Fine-tune f𝑓fitalic_f on XPsubscript𝑋𝑃X_{P}italic_X start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT by minimizing
Lft=xpXPCrossEntropy(f(xp),yP(xp))subscript𝐿ftsubscriptsubscript𝑥𝑝subscript𝑋𝑃CrossEntropy𝑓subscript𝑥𝑝subscript𝑦𝑃subscript𝑥𝑝L_{\rm ft}=\sum_{x_{p}\in X_{P}}{\rm CrossEntropy}(f(x_{p}),y_{P}(x_{p}))italic_L start_POSTSUBSCRIPT roman_ft end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_CrossEntropy ( italic_f ( italic_x start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) , italic_y start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) )
5:Calculate d(xtest)𝑑subscript𝑥testd(x_{\rm test})italic_d ( italic_x start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT ) for xtestXtestsubscript𝑥testsubscript𝑋testx_{\rm test}\in X_{\rm test}italic_x start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT and d(xcal)𝑑subscript𝑥cald(x_{\rm cal})italic_d ( italic_x start_POSTSUBSCRIPT roman_cal end_POSTSUBSCRIPT ) for xcalXcalsubscript𝑥calsubscript𝑋calx_{\rm cal}\in X_{\rm cal}italic_x start_POSTSUBSCRIPT roman_cal end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT roman_cal end_POSTSUBSCRIPT.
6:Compute threshold T𝑇Titalic_T as the η𝜂\etaitalic_η percentile of d(xcal)𝑑subscript𝑥cald(x_{\rm cal})italic_d ( italic_x start_POSTSUBSCRIPT roman_cal end_POSTSUBSCRIPT ).
7:if d(xtest)>T𝑑subscript𝑥test𝑇d(x_{\rm test})>Titalic_d ( italic_x start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT ) > italic_T then
8:    xtestsubscript𝑥testx_{\rm test}italic_x start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT is an outlier.
9:end if
ID data Methods Texture [32] iNaturalist [33] ImageNet-O [34] OpenImage-O [31] Average
AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow
CIFAR-10 MSP[15] 45.67 95.17 71.07 81.76 32.52 98.85 59.74 91.45 52.25 91.81
Energy[17] 31.16 97.89 48.95 97.92 37.22 97.85 45.29 96.36 40.65 97.50
MaxLogit[16] 41.21 95.95 67.83 86.04 32.58 98.80 56.64 92.94 49.56 93.43
KL-Matching[16] 98.00 10.64 94.23 35.86 92.99 32.40 94.68 27.92 94.97 26.71
Residual[7] 99.91 0.21 99.68 0.45 99.36 2.85 99.42 2.46 99.59 1.49
React[18] 35.97 96.26 69.01 87.91 36.65 97.75 54.14 93.11 48.94 93.76
Mahalanobis[19] 99.77 0.60 99.39 1.11 98.93 4.90 99.14 3.26 99.31 2.47
ViM[7] 99.91 0.23 99.72 0.38 99.38 2.65 99.49 2.31 99.63 1.39
MOODv1[30] 99.95 0.06 99.99 0.02 99.61 1.90 99.82 0.77 99.84 0.69
MOODv2 (ours) 99.98 0.06 100.00 0.00 99.94 0.20 99.99 0.01 99.98 0.07
ImageNet MSP[15] 71.31 77.07 90.70 43.72 60.77 90.60 84.29 61.79 76.77 68.30
Energy[17] 54.11 86.28 76.61 72.70 61.63 81.00 71.06 73.99 65.85 78.49
MaxLogit[16] 67.22 77.98 89.88 45.57 61.68 88.60 82.73 62.52 75.37 68.67
KL-Matching[16] 82.59 67.27 87.63 69.71 66.55 88.15 84.34 74.23 80.28 74.84
Residual[7] 82.39 64.61 73.72 86.00 68.44 87.45 74.88 77.98 74.86 79.01
React[18] 62.09 80.47 91.20 38.74 63.66 81.00 80.43 60.41 74.34 65.15
Mahalanobis[19] 84.93 66.05 84.90 81.60 71.53 88.85 84.16 74.72 81.38 77.80
ViM[7] 83.51 62.71 77.75 81.72 71.04 86.60 78.31 74.55 77.65 76.40
MOODv1[30] 93.01 30.91 98.78 5.89 86.78 63.15 95.46 26.46 93.51 31.60
MOODv2 (ours) 94.25 24.69 99.59 1.83 91.47 40.80 97.41 13.55 95.68 20.22
TABLE III: Performance of OOD detection methods on ViT-B/16 model with 224×224224224224\times 224224 × 224-pixel inputs. All methods are pre-trained on ImageNet-21k and finetuned on ImageNet-1k. ID datasets include CIFAR-10 [39] and ImageNet-1k [35]. Both metrics AUROC and FPR95 are in percentage. The best method is emphasized in bold and a gray background indicates our methods.
Methods ID class Average
Plane Car Bird Cat Deer Dog Frog Horse Ship Truck
KL-Matching[16] 95.35 92.04 95.18 91.26 88.11 94.66 94.99 86.52 93.61 89.37 92.11
Residual[7] 97.62 95.88 97.06 96.30 89.18 94.33 96.73 91.46 94.89 92.36 94.58
Mahalanobis[19] 97.52 96.07 96.77 96.41 89.60 94.79 96.41 91.48 94.80 92.58 94.64
ViM[7] 97.61 96.36 97.19 96.50 88.78 94.21 96.70 91.60 94.97 92.35 94.63
MOODv1[30] 98.63 99.33 94.31 93.22 98.11 96.50 99.25 98.96 98.76 97.82 97.83
MOODv2 (ours) 99.14 99.03 99.51 98.37 97.12 97.20 98.53 98.07 98.35 96.68 98.20
(a) AUROC
Methods ID class Average
Plane Car Bird Cat Deer Dog Frog Horse Ship Truck
KL-Matching[16] 23.60 32.60 22.32 42.92 46.26 24.30 24.97 46.74 25.32 40.53 32.96
Residual[7] 12.06 25.58 16.71 21.17 48.33 22.12 17.42 36.72 17.30 30.76 24.82
Mahalanobis[19] 12.59 25.72 18.92 21.48 48.44 20.59 19.20 38.02 17.47 30.93 25.34
ViM[7] 12.43 24.83 15.77 20.13 48.68 21.77 17.63 36.63 17.60 30.78 24.63
MOODv1[30] 7.59 5.04 2.47 7.49 15.63 10.96 11.37 13.09 10.06 19.62 10.33
MOODv2 (ours) 4.82 4.50 1.79 8.80 15.59 11.00 8.46 12.43 8.60 18.96 9.49
(b) FPR95
TABLE IV: Performance of OOD detection methods on ViT-B/16 model with 224×224224224224\times 224224 × 224-pixel inputs. All methods are pre-trained on ImageNet-21k and finetuned on ImageNet-1k. We perform each category of CIFAR-10 [39] as the ID dataset and other classes as OOD datasets. We report the average results across OOD classes of each ID class. Both metrics AUROC and FPR95 are in percentage. The best method is emphasized in bold and a gray background indicates our methods.

4 Experiments

In this section, we conduct a thorough comparison of our algorithm with the latest OOD detection methods. We employ the ViT-B/16 model, pre-trained on ImageNet-21K and fine-tuned on ImageNet-1K at a resolution of 224×224224224224\times 224224 × 224.

ID/OOD Datasets. We select CIFAR-10 [39] and ImageNet-1K [35] as the ID datasets. Following established procedures [7], for estimating the principal space of ImageNet, we randomly sample 200,000200000200,000200 , 000 images from the training set. Our experiments include the following OOD datasets:

  1. 1.

    OpenImage-O is a newly collected large-scale OOD dataset [31].

  2. 2.

    Texture [36] comprises natural textural images, with four overlapping categories (bubbly, honeycombed, cobwebbed, spiraled) removed since they coincide with ImageNet.

  3. 3.

    iNaturalist [37] is a fine-grained species classification dataset, and we use a specific subset from previous works  [29].

  4. 4.

    ImageNet-O [38] contains images that are adversarially filtered to challenge OOD detectors.

Evaluation Metrics. We report two commonly used evaluation metrics AUROC and FPR95. The AUROC is a threshold-free metric, indicating the area under the receiver operating characteristic curve, with a higher value denoting better detection performance. FPR95, or FPR at TPR95, stands for the false positive rate when the true positive rate is 95%, and a smaller FPR95 is preferable. Both metrics are expressed as percentages.

Baseline Methods. Following previous works [7], we compare MOODv2 with the baseline algorithms that do not require fine-tuning including MSP [15], Energy [17], ODIN [41], MaxLogit [16], KL Matching [16], Residual, ReAct [18], and Mahalanobis [19].

Refer to caption
Figure 6: The distribution curves of OOD score functions for ID and OOD datasets obtained using various mainstream methods, including MSP [15], Energy [17], ODIN [41], MaxLogit [16], KL Matching [16], Residual [7], ReAct [18], Mahalanobis [19] and ViM [7]. The red line indicates the ID dataset ImageNet [35]; the blue line indicates Texture [32]; the green line indicates iNaturalist [33]; the purple line indicates ImageNet-O [34]; the orange line indicates OpenImage-O [31]

4.1 One-Class OOD Detection

We start with the one-class OOD detection. For a given multi-class dataset of Ncsubscript𝑁𝑐N_{c}italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT classes, we conduct Ncsubscript𝑁𝑐N_{c}italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT one-class OOD tasks, where each task regards one of the classes as in-distribution and the remaining classes as out-of-distribution. We run our experiments on CIFAR-10 [39]. Table IV summarizes the average results across OOD classes of each ID class and the detailed class-wize performance is in the appendix.

It’s worth noting that all methods were pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k, which may have had some influence on the results to varying degrees. Nevertheless, we ensure consistent training strategies for all methods to ensure a fair comparison. Experimental results have demonstrated that MOODv2 achieves significant improvements across all ID classes even without fine-tuning the ID dataset. Notably, we achieved a remarkable 3.56% increase in the AUROC, reaching 98.20%, while simultaneously reducing the FPR95 by 15.14% to achieve an impressive 9.49%.

4.2 Multi-Class OOD Detection

For multi-class OOD Detection, we assume that ID samples are from a multi-class dataset, either CIFAR-10 [39] or ImageNet [35]. They are tested on external datasets as out-of-distribution, including OpenImage-O [31], Texture [36], iNaturalist [37] and ImageNet-O [38].

Results are shown in Tab. III. MOODv2 delivers outstanding results on CIFAR-10, achieving an impressive AUROC of 99.98% (0.35% enhancement) and the FPR95 reaches an astonishingly low rate of 0.07%, marking a substantial 95% reduction compared to the prior SOTA (1.39%). On ImageNet, MOODv2 also exhibited significant improvements, showcasing a remarkable 14.30% increase in AUROC, resulting in 95.68%. Additionally, the FPR95 saw a substantial reduction of 44.93%, reaching 20.22%.

In Fig. 7, we illustrate the distribution curves of OOD scores for ID and OOD datasets using various mainstream methods. A smaller overlap between ID and OOD data indicates superior OOD detection performance, while a larger overlap signifies weaker detection results. The ID curve (in red) for MOODv2 features a distinct peak at a higher position, resulting in minimal overlap with other OOD data, indicating a notable OOD detection capability. This success can be attributed to the high-quality ID feature representation.

5 Conclusion

In our work, we focus on the critical aspect of effective out-of-distribution (OOD) detection, which involves acquiring a robust in-distribution (ID) representation that distinguishes it from OOD samples. We conduct comprehensive experiments with distinct pretraining tasks and employ various OOD score functions. The findings indicate that feature representations pre-trained through reconstruction significantly enhance performance and reduce the performance gap among different score functions. This implies that even simple score functions can perform as well as complex ones when utilizing reconstruction-based pretext tasks. These findings hold promise for further development in OOD detection. Ultimately, we introduce the MOODv2 OOD detection framework, employing the masked image modeling pretext task, which achieves a remarkable 14.30% increase in AUROC, reaching 95.68% on ImageNet, and substantially improving CIFAR-10 to 99.98%.

References

  • [1] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, “Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission,” in Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1721–1730.
  • [2] C. Phua, V. Lee, K. Smith, and R. Gayler, “A comprehensive survey of data mining-based fraud detection research,” arXiv preprint arXiv:1009.6119, 2010.
  • [3] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song, “Robust physical-world attacks on deep learning visual classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1625–1634.
  • [4] J. Tack, S. Mo, J. Jeong, and J. Shin, “Csi: Novelty detection via contrastive learning on distributionally shifted instances,” Advances in neural information processing systems, vol. 33, pp. 11 839–11 852, 2020.
  • [5] V. Sehwag, M. Chiang, and P. Mittal, “Ssd: A unified framework for self-supervised outlier detection,” arXiv preprint arXiv:2103.12051, 2021.
  • [6] S. Fort, J. Ren, and B. Lakshminarayanan, “Exploring the limits of out-of-distribution detection,” Advances in Neural Information Processing Systems, vol. 34, pp. 7068–7081, 2021.
  • [7] H. Wang, Z. Li, L. Feng, and W. Zhang, “Vim: Out-of-distribution with virtual-logit matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • [8] J. Yang, K. Zhou, Y. Li, and Z. Liu, “Generalized out-of-distribution detection: A survey,” arXiv preprint arXiv:2110.11334, 2021.
  • [9] M. B. Sariyildiz, K. Alahari, D. Larlus, and Y. Kalantidis, “Fake it till you make it: Learning transferable representations from synthetic imagenet clones,” 2023.
  • [10] A. Saha, A. Subramanya, and H. Pirsiavash, “Hidden trigger backdoor attacks,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, pp. 11 957–11 965, Apr. 2020. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/6871
  • [11] ——, “Hidden trigger backdoor attacks,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 11 957–11 965.
  • [12] H. Bao, L. Dong, and F. Wei, “Beit: Bert pre-training of image transformers,” arXiv preprint arXiv:2106.08254, 2021.
  • [13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [14] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 000–16 009.
  • [15] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” arXiv preprint arXiv:1610.02136, 2016.
  • [16] D. Hendrycks, S. Basart, M. Mazeika, A. Zou, J. Kwon, M. Mostajabi, J. Steinhardt, and D. Song, “Scaling out-of-distribution detection for real-world settings,” arXiv preprint arXiv:1911.11132, 2019.
  • [17] W. Liu, X. Wang, J. Owens, and Y. Li, “Energy-based out-of-distribution detection,” Advances in neural information processing systems, vol. 33, pp. 21 464–21 475, 2020.
  • [18] Y. Sun, C. Guo, and Y. Li, “React: Out-of-distribution detection with rectified activations,” Advances in Neural Information Processing Systems, vol. 34, pp. 144–157, 2021.
  • [19] K. Lee, K. Lee, H. Lee, and J. Shin, “A simple unified framework for detecting out-of-distribution samples and adversarial attacks,” Advances in neural information processing systems, vol. 31, 2018.
  • [20] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [21] X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 9640–9649.
  • [22] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without supervision,” 2023.
  • [23] Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, “Beit v2: Masked image modeling with vector-quantized visual tokenizers,” 2022.
  • [24] I. Ndiour, N. Ahuja, and O. Tickoo, “Out-of-distribution detection with subspace techniques and probabilistic modeling of features,” arXiv preprint arXiv:2012.04250, 2020.
  • [25] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738.
  • [26] Y. Li, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [27] T. DeVries and G. W. Taylor, “Learning confidence for out-of-distribution detection in neural networks,” arXiv preprint arXiv:1802.04865, 2018.
  • [28] A. Zaeemzadeh, N. Bisagno, Z. Sambugaro, N. Conci, N. Rahnavard, and M. Shah, “Out-of-distribution detection using union of 1-dimensional subspaces,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9452–9461.
  • [29] R. Huang and Y. Li, “MOS: Towards scaling out-of-distribution detection for large semantic space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8710–8719.
  • [30] J. Li, P. Chen, Z. He, S. Yu, S. Liu, and J. Jia, “Rethinking out-of-distribution (ood) detection: Masked image modeling is all you need,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 578–11 589.
  • [31] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy, “Openimages: A public dataset for large-scale multi-label and multi-class image classification.” Dataset available from https://github.com/openimages, 2017.
  • [32] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3606–3613.
  • [33] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie, “The inaturalist species classification and detection dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8769–8778.
  • [34] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song, “Natural adversarial examples,” CVPR, 2021.
  • [35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, “Imagenet large scale visual recognition challenge,” Int. J. Comput. Vis., 2015.
  • [36] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3606–3613.
  • [37] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie, “The iNaturalist species classification and detection dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8769–8778.
  • [38] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song, “Natural adversarial examples,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 262–15 271.
  • [39] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
  • [40] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1452–1464, 2017.
  • [41] S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-distribution image detection in neural networks,” arXiv preprint arXiv:1706.02690, 2017.
[Uncaptioned image] Jingyao Li received the B.Eng. degree from Xi’an Jiaotong University. She is currently a Ph.D. student at Department of Computer Science and Engineering of the Chinese University of Hong Kong (CUHK), under the supervision of Prof. Jiaya Jia. She serves as a reviewer for CVPR, ECCV, ICCV and etc. Her research interests include computer vision and large language models.
[Uncaptioned image] Pengguang Chen received the B.Eng. degree in Computer Science from Nanjing University and the Ph.D. degree from the Chinese University of Hong Kong (CUHK), under the supervision of Prof. Jiaya Jia. He is currently a researcher in SmartMore. He serves as a reviewer for CVPR, ICCV, ECCV, TPAMI. His research interests include neural architecture search, self-supervised learning, knowledge distillation and semantic segmentation.
[Uncaptioned image] Shaozuo Yu is a Ph.D. student at Department of Computer Science and Engineering of the Chinese University of Hong Kong. He served as a program chair of the workshop and challenge on “Out-of-Distribution Generalization in Computer Vision” at ECCV’22. He served as a reviewer for CVPR, Neurips, and ICML. His research interests include multimodality, generative models, and robust vision.
[Uncaptioned image] Shu Liu now serves as Co-Founder and Technical Head in SmartMore. He received the BS degree from Huazhong University of Science and Technology and the PhD degree from the Chinese University of Hong Kong. He was the winner of 2017 COCO Instance Segmentation Competition and received the Outstanding Reviewer of ICCV in 2019. He continuously served as a reviewer for TPAMI, CVPR, ICCV, NIPS, ICLR and etc. His research interests lie in deep learning and computer vision. He is a member of IEEE.
[Uncaptioned image] Jiaya Jia received the Ph.D. degree in Computer Science from Hong Kong University of Science and Technology in 2004 and is currently a full professor in Department of Computer Science and Engineering at the Chinese University of Hong Kong (CUHK). He assumes the position of Associate Editor-in-Chief of IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) and is in the editorial board of International Journal of Computer Vision (IJCV). He continuously served as area chairs for ICCV, CVPR, AAAI, ECCV, and several other conferences for the organization. He was on program committees of major conferences in graphics and computational imaging, including ICCP, SIGGRAPH, and SIGGRAPH Asia. He is a Fellow of the IEEE.

This supplementary material includes visualization of distribution curves, multi-class and one-class OOD detection results on CIFAR-10, etc., which are not included in the main paper due to page limitations.

Appendix A Distribution curves

For more comprehensive insights, we offer visual representations of distribution curves for OOD scores on both ID and OOD datasets in Fig. 7. A narrower overlap between ID and OOD data signifies superior OOD detection performance, whereas a wider overlap indicates weaker detection results. The ID curve, depicted in red, for the fine-tuned BEiT series [12] models, exhibits a distinctive peak at a higher position. This leads to minimal overlap with other OOD data, highlighting a remarkable OOD detection capability. This accomplishment can be attributed to the high-quality ID feature representation derived from masked image modeling.

Appendix B Details of Results on CIFAR-10

B.1 Multi-class OOD Detection

We employ CIFAR-10 [39] as the in-distribution dataset and evaluate pre-task texts on multiple challenging unnatural out-of-distribution datasets, including OpenImage-O [31], Texture [32], iNaturalist [33], and ImageNet-O [34]. Extensive validations with various pretraining methods and OOD score functions including MSP [15], Energy [17], ODIN [41], MaxLogit [16], KL Matching [16], Residual [7], ReAct [18], Mahalanobis [19] and ViM [7]. Results are in Tab. V. Our approach attains an impressive AUROC of 99.99% while concurrently reducing the FPR95 to a mere 0.03%.

B.2 One-class OOD Detection

We perform one-class OOD detection. In the context of a multi-class dataset with Ncsubscript𝑁𝑐N_{c}italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT classes, we conduct Ncsubscript𝑁𝑐N_{c}italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT one-class OOD tasks. Each task treats one of the classes as in-distribution and the remaining classes as out-of-distribution. Our experiments are conducted on CIFAR-10 [39] and provide the detailed class-wise performance of mainstream methods including KL-Marching (Tab. VI), Residual (Tab. VII), Mahalanobis (Tab. VIII) and ViM (Tab. IX).

It’s important to note that all methods were pre-trained on ImageNet-21k and subsequently fine-tuned on ImageNet-1k, which might have influenced the results to varying degrees. However, we ensure consistent training strategies for all methods to maintain a fair comparison. The experimental results demonstrate that MOODv2 achieves significant improvements across all ID classes, even without fine-tuning the ID dataset. Notably, we achieved a remarkable 3.56% increase in the state-of-the-art AUROC, reaching 98.20%, while simultaneously reducing FPR95 by 15.14%, achieving an impressive 9.49%.

Refer to caption
Figure 7: The distribution curves of OOD scores for ID and OOD datasets obtained using various mainstream methods, including MSP [15], Energy [17], ODIN [41], MaxLogit [16], KL Matching [16], Residual [7], ReAct [18], Mahalanobis [19] and ViM [7]. The red line indicates the ID dataset ImageNet [35]; the blue line indicates Texture [32]; the green line indicates iNaturalist [33]; the purple line indicates ImageNet-O [34]; the orange line indicates OpenImage-O [31]. Pretrained models include classification task [20], MoCov3 [21], DINOv2 [22], BEiTv2 [23] and BEiT [12].
Methods Models Texture iNaturalist ImageNet-O OpenImage-O Average
AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow
MSP[15] ViT[20] 45.67 95.17 71.07 81.76 32.52 98.85 59.74 91.45 52.25 91.81
MoCov3[21] 37.11 97.64 64.60 89.69 31.52 98.45 55.90 93.53 47.28 94.83
DINOv2[22] 70.37 58.33 87.10 37.77 70.61 63.25 78.27 51.58 76.59 52.73
BEiTv2[23] 57.67 88.31 82.53 55.54 52.06 89.55 72.72 70.71 66.24 76.03
BEiT[12] 51.64 91.09 73.85 74.02 47.82 90.75 65.40 80.81 59.68 84.17
Energy[17] ViT[20] 31.16 97.89 48.95 97.92 37.22 97.85 45.29 96.36 40.65 97.50
MoCov3[21] 24.97 98.93 44.74 98.29 38.06 95.45 43.90 95.49 37.92 97.04
DINOv2[22] 86.73 28.16 91.43 20.27 68.97 62.75 73.66 53.40 80.20 41.15
BEiTv2[23] 63.35 82.64 88.52 38.21 66.24 72.30 81.69 51.80 74.95 61.24
BEiT[12] 52.98 88.53 81.84 54.55 59.99 77.20 73.61 65.08 67.10 71.34
MaxLogit[16] ViT[20] 41.21 95.95 67.83 86.04 32.58 98.80 56.64 92.94 49.56 93.43
MoCov3[21] 32.94 98.22 61.79 92.21 31.65 98.45 53.32 94.26 44.92 95.78
DINOv2[22] 76.80 45.06 91.96 22.66 72.49 57.95 80.36 44.75 80.40 42.61
BEiTv2[23] 60.51 85.85 86.05 47.39 59.14 83.90 77.47 62.79 70.79 69.98
BEiT[12] 51.94 89.90 77.92 66.72 52.95 87.10 69.16 75.24 62.99 79.74
KL-Matching[16] ViT[20] 98.00 10.64 94.23 35.86 92.99 32.40 94.68 27.92 94.97 26.71
MoCov3[21] 97.61 13.97 94.65 35.51 92.05 38.25 94.22 33.48 94.64 30.30
DINOv2[22] 98.05 8.74 98.95 5.32 97.29 12.35 96.99 13.55 97.82 9.99
BEiTv2[23] 97.41 14.98 91.78 50.90 93.21 35.90 92.28 43.17 93.67 36.24
BEiT[12] 97.83 12.71 95.14 32.28 93.52 35.00 94.84 31.13 95.33 27.78
Residual[7] ViT[20] 99.91 0.21 99.68 0.45 99.36 2.85 99.42 2.46 99.59 1.49
MoCov3[21] 99.90 0.25 99.87 0.09 99.22 3.85 99.59 1.31 99.65 1.38
DINOv2[22] 99.98 0.04 100.00 0.01 99.99 0.05 99.97 0.18 99.98 0.07
BEiTv2[23] 99.98 0.04 100.00 0.00 99.79 0.90 99.92 0.27 99.92 0.30
BEiT[12] 99.99 0.02 100.00 0.00 99.96 0.10 99.99 0.01 99.99 0.03
React[18] ViT[20] 35.97 96.26 69.01 87.91 36.65 97.75 54.14 93.11 48.94 93.76
MoCov3[21] 25.74 98.90 46.11 98.29 37.60 95.55 44.63 95.46 38.52 97.05
DINOv2[22] 68.00 61.94 60.58 82.56 40.71 90.10 47.60 88.88 54.22 80.87
BEiTv2[23] 62.81 82.71 91.59 28.05 65.37 72.95 82.43 49.49 75.55 58.30
BEiT[12] 53.27 87.91 82.09 53.87 59.64 77.50 73.73 64.84 67.18 71.03
Mahalanobis[19] ViT[20] 99.77 0.60 99.39 1.11 98.93 4.90 99.14 3.26 99.31 2.47
MoCov3[21] 99.78 0.78 99.71 0.45 98.61 7.65 99.31 2.48 99.35 2.84
DINOv2[22] 99.98 0.06 100.00 0.00 99.99 0.00 99.97 0.16 99.99 0.05
BEiTv2[23] 99.95 0.06 99.99 0.02 99.61 1.90 99.82 0.77 99.84 0.69
BEiT[12] 99.99 0.00 100.00 0.00 99.96 0.05 99.98 0.05 99.98 0.03
ViM[7] ViT[20] 99.91 0.23 99.72 0.38 99.38 2.65 99.49 2.31 99.63 1.39
MoCov3[21] 99.93 0.16 99.92 0.03 99.40 2.75 99.69 1.03 99.73 0.99
DINOv2[22] 99.98 0.04 100.00 0.01 99.99 0.05 99.97 0.18 99.98 0.07
BEiTv2[23] 99.95 0.14 100.00 0.01 99.61 1.60 99.93 0.28 99.87 0.51
BEiT[12] 99.98 0.06 100.00 0.00 99.94 0.20 99.99 0.01 99.98 0.07
Best ViT[20] 99.91 0.21 99.72 0.38 99.38 2.65 99.49 2.31 99.63 1.39
MoCov3[21] 99.93 0.16 99.92 0.03 99.40 2.75 99.69 1.03 99.73 0.99
DINOv2[22] 99.98 0.04 100.00 0.00 99.99 0.00 99.97 0.16 99.99 0.05
BEiTv2[23] 99.98 0.04 100.00 0.00 99.79 0.90 99.93 0.27 99.92 0.30
BEiT[12] 99.99 0.00 100.00 0.00 99.96 0.05 99.99 0.01 99.99 0.03
TABLE V: AUROC (%) of OOD detection methods. The ID dataset is CIFAR-10 [39], and the OOD datasets are OpenImage-O [31], Texture [32], iNaturalist [33], and ImageNet-O [34]. The pre-text tasks include classical classification task [20], contrastive learning tasks MoCov3 [21] and DINOv2 [22], and masked image modeling tasks BEiT [12] and BEiT [23]. All pre-text tasks are performed on ImageNet-21k. Both metrics AUROC and FPR95 are in percentage. A pre-trained ViT-B/16 model with 224×224224224224\times 224224 × 224-pixel inputs is tested. The best method is emphasized in bold and a gray background indicates our choice.
Models ID class OOD class Average
0 1 2 3 4 5 6 7 8 9
ViT[20] 0 - 81.29 90.44 90.62 86.42 95.45 93.28 91.15 67.66 66.06 84.71
1 98.54 - 99.58 99.33 99.57 99.69 99.82 96.16 96.00 85.23 97.10
2 90.41 97.50 - 87.18 76.75 95.02 90.45 81.47 96.76 98.16 90.41
3 94.14 94.49 91.32 - 77.12 78.07 88.59 82.25 98.25 85.24 87.72
4 96.15 98.55 92.46 89.63 - 95.80 94.34 66.06 98.45 96.45 91.99
5 98.38 98.19 96.45 78.99 86.87 - 97.45 68.19 99.48 95.96 91.11
6 96.68 97.44 92.05 88.20 83.14 95.27 - 96.14 95.03 97.92 93.54
7 96.32 97.72 97.36 92.15 85.78 94.45 98.03 - 94.20 98.45 94.94
8 89.88 71.86 97.40 95.86 98.82 98.45 93.19 98.04 - 80.89 91.60
9 97.62 91.34 99.60 99.38 98.48 99.75 99.78 99.23 96.62 - 97.98
MoCov3[21] 0 - 87.10 87.79 90.33 89.95 94.07 91.11 72.18 69.31 69.45 83.48
1 93.05 - 98.12 97.41 98.20 98.26 99.03 94.47 93.38 75.58 94.17
2 82.96 95.82 - 83.13 78.83 93.45 82.30 73.48 94.46 96.90 86.81
3 91.08 93.10 90.56 - 74.01 74.04 84.32 79.70 93.94 93.13 85.99
4 93.82 96.02 86.55 88.85 - 92.48 92.56 66.43 96.26 96.97 89.99
5 94.32 95.45 91.58 74.17 88.34 - 94.31 69.63 97.36 98.02 89.24
6 95.09 96.76 88.10 87.53 80.44 90.97 - 94.57 92.31 97.94 91.52
7 91.62 96.60 92.38 91.56 80.46 92.20 96.91 - 92.62 95.95 92.25
8 80.42 88.78 96.30 93.92 96.62 95.98 95.00 95.97 - 79.94 91.44
9 92.33 79.04 97.95 97.19 98.14 98.28 98.73 97.74 88.23 - 94.18
DINOv2[22] 0 - 60.94 65.14 64.79 71.95 62.40 76.86 61.40 39.99 51.08 61.62
1 73.33 - 59.97 48.33 56.38 44.85 57.27 41.11 60.09 45.81 54.13
2 69.07 55.66 - 49.75 44.54 46.53 51.42 43.47 57.03 53.23 52.30
3 79.75 63.89 61.86 - 56.82 47.72 59.42 47.58 70.69 61.79 61.06
4 81.90 68.04 59.55 59.30 - 57.66 54.34 53.14 73.53 68.04 63.95
5 81.63 65.90 63.37 52.89 58.20 - 61.38 48.90 73.32 64.01 63.29
6 86.28 71.81 62.40 57.83 51.69 56.97 - 54.87 81.90 74.10 66.43
7 84.48 68.52 64.60 58.30 57.49 54.55 58.26 - 77.39 66.77 65.60
8 64.17 71.12 75.01 73.31 79.28 71.02 84.92 71.20 - 61.16 72.35
9 75.23 59.90 68.79 58.47 68.79 54.25 72.32 50.25 62.55 - 63.39
BEiTv2[23] 0 - 94.83 97.42 99.31 97.01 99.79 99.34 99.44 89.14 91.32 96.40
1 99.01 - 99.91 99.83 99.72 99.90 99.98 99.74 92.98 94.11 98.35
2 85.99 99.23 - 98.03 81.97 99.25 93.81 93.23 96.98 99.43 94.21
3 98.39 98.20 97.60 - 90.40 85.67 74.37 95.00 99.49 98.30 93.05
4 99.57 99.62 98.82 88.31 - 99.31 99.03 77.26 99.32 99.88 95.68
5 99.57 99.31 99.40 80.54 97.51 - 99.14 53.39 99.72 98.76 91.93
6 99.46 99.75 97.00 96.55 97.07 99.04 - 99.35 99.71 99.91 98.65
7 99.12 98.87 99.76 98.94 92.57 99.02 99.92 - 98.73 95.91 98.09
8 88.49 93.17 99.83 99.74 99.82 99.94 99.94 99.86 - 96.15 97.44
9 99.19 96.36 99.96 99.75 99.75 99.97 99.97 99.80 97.88 - 99.18
BEiT[12] 0 - 95.65 95.16 98.75 94.10 99.60 98.60 98.70 83.00 85.22 94.31
1 97.46 - 99.85 99.63 99.47 99.91 99.96 99.55 95.42 90.32 97.95
2 96.98 98.44 - 89.21 77.98 98.19 97.53 72.68 97.65 99.17 91.98
3 97.62 95.80 96.95 - 80.24 86.56 72.85 86.98 99.38 94.34 90.08
4 99.40 98.45 98.57 98.12 - 98.92 96.00 70.81 98.42 93.32 94.67
5 99.15 99.03 98.66 82.32 82.56 - 98.78 47.63 99.81 99.62 89.73
6 99.32 99.56 96.90 96.87 93.96 99.06 - 99.56 99.78 99.89 98.32
7 98.76 98.51 99.55 99.00 93.66 99.01 99.84 - 98.99 96.76 98.23
8 92.28 92.62 99.44 99.40 99.72 99.80 99.81 99.44 - 92.60 97.24
9 99.07 93.71 99.93 99.77 99.80 99.97 99.98 99.62 97.69 - 98.84
TABLE VI: AUROC (%) of one-class OOD Detection on CIFAR-10 [39] using KL-Matching [16]. Pretrained models include classification task [20], MoCov3 [21], DINOv2 [22], BEiTv2 [23] and BEiT [12].
Models ID class OOD class Average
0 1 2 3 4 5 6 7 8 9
ViT[20] 0 - 88.70 96.73 99.02 96.98 99.82 98.77 98.21 66.32 79.11 91.52
1 98.03 - 99.97 99.93 99.92 99.99 99.98 99.87 95.48 69.60 95.86
2 96.51 99.92 - 94.67 75.60 98.06 89.95 90.72 99.08 99.83 93.82
3 98.54 99.70 94.27 - 75.83 66.18 89.15 81.41 99.56 98.99 89.29
4 99.25 99.85 93.29 95.01 - 95.08 95.51 70.54 99.40 99.54 94.16
5 99.74 99.97 98.38 87.74 89.16 - 98.40 85.77 99.92 99.85 95.44
6 99.44 99.96 93.81 95.33 88.84 98.00 - 97.18 99.79 99.90 96.92
7 98.85 99.64 97.53 95.31 76.90 91.92 99.25 - 98.41 98.70 95.17
8 90.11 89.74 99.59 99.72 99.56 99.96 99.60 99.65 - 85.70 95.96
9 98.11 85.44 99.96 99.93 99.80 99.99 99.98 99.75 96.08 - 97.67
MoCov3[21] 0 - 96.11 97.08 99.13 98.40 99.75 98.56 99.00 73.05 87.16 94.25
1 96.02 - 99.79 99.76 99.76 99.90 99.82 99.71 92.94 68.77 95.16
2 93.25 99.90 - 94.04 80.49 97.94 88.20 91.06 98.09 99.60 93.62
3 97.15 99.60 93.31 - 79.89 71.78 87.59 86.66 98.02 98.72 90.30
4 98.28 99.91 92.84 93.95 - 96.47 94.59 73.27 98.35 99.53 94.13
5 99.12 99.90 97.62 84.00 89.42 - 98.45 89.13 99.38 99.45 95.16
6 99.38 99.97 95.00 95.30 93.13 98.72 - 98.83 99.66 99.90 97.77
7 97.23 99.70 96.70 95.15 76.84 93.22 98.94 - 97.06 98.53 94.82
8 87.65 95.39 99.26 99.45 99.47 99.76 99.27 99.57 - 89.66 96.61
9 96.24 86.59 99.84 99.83 99.84 99.91 99.88 99.77 92.68 - 97.18
DINOv2[22] 0 - 72.83 64.15 68.05 65.47 68.53 68.80 70.10 54.59 69.72 66.92
1 66.18 - 66.00 56.84 62.76 58.68 60.05 57.11 54.97 52.30 59.43
2 66.58 77.06 - 57.10 45.49 56.78 49.57 62.65 67.99 76.34 62.17
3 74.83 77.23 59.35 - 52.76 50.60 51.10 63.03 71.63 76.36 64.10
4 82.00 86.75 63.82 67.27 - 66.91 57.16 71.45 80.04 85.45 73.43
5 79.01 80.38 61.21 54.33 54.49 - 54.53 64.44 75.72 80.20 67.15
6 85.87 86.34 66.60 66.63 56.40 66.99 - 75.17 84.70 87.00 75.08
7 76.76 76.51 59.87 56.22 49.64 53.88 52.70 - 73.05 72.37 63.44
8 62.96 74.15 73.59 72.77 74.25 72.95 77.20 76.44 - 71.13 72.83
9 68.96 59.80 70.76 61.83 68.02 62.95 66.87 58.87 57.35 - 63.94
BEiTv2[23] 0 - 98.30 99.41 99.91 99.60 99.98 99.86 99.68 91.03 93.81 97.95
1 99.66 - 100.00 99.99 99.99 99.99 100.00 99.99 98.49 78.72 97.43
2 97.60 99.96 - 99.17 93.13 99.48 97.72 98.58 99.58 99.92 98.35
3 99.74 99.87 98.56 - 94.77 80.80 94.73 97.37 99.89 99.90 96.18
4 99.83 99.90 98.68 99.06 - 99.14 99.16 87.86 99.83 99.88 98.15
5 99.94 99.97 99.62 91.48 98.24 - 99.70 96.71 99.95 99.81 98.38
6 99.89 99.99 99.39 99.25 99.14 99.68 - 99.83 99.94 99.98 99.68
7 99.67 99.82 99.76 99.75 97.38 99.45 99.98 - 99.67 99.66 99.46
8 97.13 98.60 99.92 99.98 99.95 99.97 99.98 99.97 - 95.87 99.04
9 99.36 95.53 100.00 99.99 100.00 100.00 100.00 99.99 98.54 - 99.27
BEiT[12] 0 - 98.74 99.39 99.91 99.59 99.96 99.83 99.77 91.23 93.39 97.98
1 99.32 - 100.00 99.98 99.98 100.00 100.00 99.97 99.02 88.68 98.55
2 99.25 99.97 - 98.32 90.25 99.21 97.47 98.08 99.79 99.88 98.02
3 99.78 99.85 99.09 - 94.53 83.32 94.51 97.40 99.94 99.60 96.45
4 99.88 99.90 98.82 98.63 - 98.89 98.95 93.34 99.88 99.78 98.67
5 99.98 99.88 99.77 90.61 97.43 - 99.60 93.86 99.99 99.83 97.88
6 99.95 100.00 99.28 99.13 98.97 99.67 - 99.97 99.99 99.99 99.66
7 99.55 99.45 99.76 99.41 97.82 99.04 99.94 - 99.49 99.13 99.29
8 95.97 97.99 99.95 99.95 99.90 99.98 99.98 99.96 - 96.03 98.86
9 99.38 94.73 100.00 99.98 99.98 100.00 100.00 99.85 98.85 - 99.20
TABLE VII: AUROC (%) of one-class OOD Detection on CIFAR-10 [39] using Residual[7]. Pretrained models include classification task [20], MoCov3 [21], DINOv2 [22], BEiTv2 [23] and BEiT [12].
Models ID class OOD class Average
0 1 2 3 4 5 6 7 8 9
ViT[20] 0 - 89.17 95.53 98.62 95.81 99.70 97.89 97.15 65.74 78.81 90.94
1 98.20 - 99.94 99.91 99.86 99.98 99.96 99.79 96.03 71.41 96.12
2 96.19 99.84 - 94.76 77.37 98.03 89.80 90.19 98.81 99.68 93.85
3 98.08 99.40 93.07 - 75.92 68.55 87.75 81.75 99.21 98.35 89.12
4 99.19 99.81 93.59 95.16 - 96.03 95.52 70.51 99.35 99.54 94.30
5 99.65 99.91 98.30 88.92 91.23 - 98.31 87.68 99.86 99.72 95.95
6 99.30 99.93 93.91 95.23 89.50 98.04 - 96.95 99.64 99.82 96.93
7 98.62 99.50 97.22 95.53 77.46 92.86 99.03 - 98.20 98.59 95.22
8 90.26 90.45 99.48 99.67 99.47 99.93 99.48 99.57 - 87.26 96.18
9 98.16 86.56 99.93 99.91 99.76 99.99 99.97 99.70 96.39 - 97.82
MoCov3[21] 0 - 95.82 95.69 98.75 97.49 99.58 97.83 98.04 71.17 85.75 93.35
1 95.59 - 99.66 99.69 99.67 99.85 99.74 99.50 92.72 68.79 95.02
2 92.17 99.77 - 93.66 80.97 97.58 87.61 90.67 97.16 99.21 93.20
3 95.87 99.33 91.17 - 78.08 71.77 86.00 84.32 96.86 98.04 89.05
4 97.97 99.79 92.35 94.00 - 96.40 94.38 73.76 98.11 99.26 94.00
5 98.66 99.78 97.00 84.31 90.24 - 97.96 89.13 98.93 99.14 95.02
6 99.13 99.94 94.33 95.19 92.56 98.56 - 98.38 99.45 99.82 97.48
7 96.61 99.55 96.01 94.83 76.00 92.97 98.57 - 96.47 98.15 94.35
8 87.08 95.47 98.86 99.27 99.20 99.62 98.97 99.20 - 89.75 96.38
9 95.98 86.94 99.74 99.77 99.76 99.87 99.82 99.63 92.63 - 97.13
DINOv2[22] 0 - 73.26 64.39 68.07 65.83 68.69 69.50 70.14 54.72 70.07 67.19
1 66.64 - 66.59 57.46 63.19 58.81 60.71 57.10 55.24 52.26 59.78
2 67.18 77.06 - 57.44 45.86 57.11 49.82 63.01 68.56 76.62 62.52
3 75.38 77.23 59.51 - 52.41 50.47 50.53 63.05 72.42 76.62 64.18
4 82.15 86.71 63.94 67.52 - 67.12 57.42 71.56 80.55 86.07 73.67
5 77.37 79.22 60.55 53.76 54.45 - 54.10 63.90 75.10 78.91 66.37
6 86.06 86.59 66.51 66.79 56.39 67.45 - 75.36 84.95 87.54 75.29
7 77.61 76.62 59.83 56.58 49.58 54.03 53.04 - 74.18 72.88 63.82
8 63.06 74.53 74.59 73.16 75.12 73.67 78.29 76.87 - 71.34 73.40
9 69.38 60.16 70.57 62.54 67.91 63.32 67.09 58.82 58.01 - 64.20
BEiTv2[23] 0 - 98.36 99.38 99.94 99.70 99.99 99.90 99.70 91.92 94.07 98.11
1 99.69 - 100.00 99.99 99.99 100.00 100.00 99.99 98.96 80.71 97.70
2 97.78 99.94 - 99.37 94.03 99.54 98.04 98.32 99.54 99.88 98.49
3 99.72 99.89 98.56 - 94.80 81.26 95.70 96.85 99.85 99.87 96.28
4 99.79 99.92 98.69 99.20 - 99.23 99.34 88.58 99.84 99.88 98.27
5 99.94 99.98 99.57 92.56 98.39 - 99.73 97.18 99.92 99.85 98.57
6 99.87 100.00 99.44 99.37 99.25 99.71 - 99.80 99.92 99.96 99.70
7 99.68 99.80 99.74 99.77 97.64 99.50 99.96 - 99.69 99.62 99.49
8 97.51 98.74 99.90 99.97 99.92 99.95 99.95 99.91 - 96.76 99.18
9 99.41 95.56 100.00 99.99 99.99 100.00 100.00 99.98 98.60 - 99.28
BEiT[12] 0 - 98.65 99.41 99.91 99.59 99.96 99.86 99.72 91.50 94.10 98.08
1 99.45 - 100.00 99.98 99.98 100.00 100.00 99.96 99.14 89.24 98.64
2 99.25 99.95 - 98.42 90.61 99.25 97.69 98.32 99.78 99.87 98.13
3 99.77 99.80 99.09 - 95.07 84.11 94.70 97.80 99.94 99.63 96.66
4 99.89 99.89 98.88 98.85 - 99.06 99.09 93.79 99.90 99.77 98.79
5 99.98 99.83 99.75 91.54 97.74 - 99.61 95.89 99.99 99.79 98.23
6 99.95 100.00 99.33 99.19 99.06 99.67 - 99.96 99.98 99.98 99.68
7 99.61 99.48 99.78 99.46 97.85 99.14 99.94 - 99.53 99.18 99.33
8 96.51 98.01 99.95 99.96 99.92 99.98 99.98 99.95 - 96.05 98.92
9 99.47 94.84 100.00 99.99 99.98 100.00 100.00 99.90 98.97 - 99.24
TABLE VIII: AUROC (%) of one-class OOD Detection on CIFAR-10 [39] using Mahalanobis[19]. Pretrained models include classification task [20], MoCov3 [21], DINOv2 [22], BEiTv2 [23] and BEiT [12].
Models ID class OOD class Average
0 1 2 3 4 5 6 7 8 9
ViT[20] 0 - 90.64 97.03 99.20 97.19 99.86 98.92 98.45 67.16 80.51 92.11
1 97.97 - 99.98 99.93 99.93 99.99 99.99 99.89 95.22 67.49 95.60
2 96.31 99.93 - 94.88 74.69 98.10 89.81 90.86 99.10 99.85 93.72
3 98.61 99.76 94.39 - 74.04 64.74 88.69 81.14 99.59 99.07 88.89
4 99.25 99.86 93.62 95.27 - 95.28 95.60 71.21 99.41 99.54 94.34
5 99.75 99.98 98.47 88.47 89.05 - 98.42 86.11 99.93 99.87 95.56
6 99.46 99.97 94.05 95.60 88.80 98.08 - 97.32 99.80 99.91 97.00
7 98.90 99.69 97.61 95.49 75.97 91.93 99.27 - 98.47 98.79 95.12
8 90.10 90.70 99.63 99.77 99.59 99.97 99.63 99.68 - 86.14 96.14
9 98.12 86.67 99.96 99.93 99.79 99.99 99.99 99.75 96.03 - 97.80
MoCov3[21] 0 - 97.10 97.66 99.36 98.74 99.83 98.88 99.27 74.06 88.58 94.83
1 96.25 - 99.82 99.79 99.79 99.91 99.84 99.74 93.06 67.99 95.13
2 93.80 99.93 - 94.63 80.47 98.24 88.49 91.63 98.37 99.69 93.92
3 97.60 99.68 94.10 - 79.97 72.24 87.98 87.63 98.32 98.96 90.72
4 98.49 99.93 93.56 94.53 - 96.84 94.90 74.27 98.52 99.61 94.52
5 99.29 99.93 97.94 85.08 89.84 - 98.61 89.87 99.49 99.57 95.51
6 99.46 99.97 95.39 95.71 93.39 98.87 - 98.97 99.69 99.90 97.93
7 97.57 99.77 97.08 95.65 76.65 93.74 99.06 - 97.39 98.74 95.07
8 88.58 96.14 99.39 99.55 99.55 99.81 99.39 99.64 - 90.57 96.96
9 96.55 87.81 99.87 99.86 99.86 99.93 99.90 99.79 92.96 - 97.39
DINOv2[22] 0 - 72.84 64.24 68.20 65.56 68.69 68.90 70.20 54.60 69.76 67.00
1 66.30 - 66.15 57.07 62.92 58.93 60.22 57.30 55.07 52.37 59.59
2 66.61 77.05 - 57.20 45.49 56.89 49.57 62.70 68.01 76.37 62.21
3 74.84 77.16 59.30 - 52.69 50.62 51.02 63.03 71.61 76.33 64.06
4 82.04 86.74 63.85 67.37 - 67.00 57.18 71.49 80.07 85.48 73.47
5 79.00 80.30 61.17 54.32 54.41 - 54.44 64.39 75.69 80.16 67.10
6 85.90 86.33 66.63 66.71 56.41 67.08 - 75.22 84.73 87.03 75.12
7 76.80 76.47 59.90 56.30 49.63 53.98 52.69 - 73.07 72.36 63.47
8 63.00 74.18 73.68 72.92 74.35 73.11 77.30 76.54 - 71.19 72.92
9 69.06 59.79 70.88 62.04 68.16 63.16 67.02 59.00 57.41 - 64.06
BEiTv2[23] 0 - 97.86 99.24 99.90 99.48 99.97 99.73 99.61 87.59 89.06 96.94
1 99.49 - 99.99 99.99 99.98 99.98 100.00 99.99 97.62 66.23 95.92
2 95.66 99.95 - 98.84 90.50 98.98 95.46 97.71 99.07 99.77 97.33
3 99.44 99.79 98.00 - 93.19 72.65 90.51 96.39 99.65 99.32 94.33
4 99.72 99.87 98.37 98.87 - 98.76 98.78 84.32 99.72 99.75 97.57
5 99.86 99.95 99.47 91.21 97.83 - 99.52 95.97 99.88 99.49 98.13
6 99.85 99.99 99.26 99.14 98.93 99.60 - 99.81 99.91 99.92 99.60
7 99.55 99.78 99.68 99.72 96.47 99.17 99.96 - 99.45 99.34 99.24
8 96.02 98.31 99.90 99.97 99.94 99.96 99.96 99.97 - 93.04 98.56
9 99.17 95.57 100.00 99.99 99.99 99.99 100.00 99.99 98.21 - 99.21
BEiT[12] 0 - 98.76 99.35 99.90 99.48 99.95 99.76 99.78 88.74 91.93 97.52
1 99.15 - 100.00 99.98 99.97 100.00 100.00 99.97 98.78 84.81 98.07
2 99.12 99.98 - 97.89 88.04 98.88 96.49 97.93 99.73 99.85 97.55
3 99.77 99.83 98.95 - 93.35 78.82 92.42 97.56 99.94 99.38 95.56
4 99.87 99.90 98.65 98.48 - 98.66 98.75 93.22 99.85 99.73 98.57
5 99.97 99.88 99.73 90.68 97.10 - 99.49 94.34 99.99 99.76 97.88
6 99.96 100.00 99.27 99.08 98.83 99.64 - 99.98 99.99 99.99 99.64
7 99.53 99.46 99.75 99.39 97.50 98.89 99.92 - 99.46 99.03 99.22
8 95.53 98.13 99.94 99.95 99.87 99.97 99.98 99.96 - 95.62 98.77
9 99.32 95.30 100.00 99.98 99.97 100.00 100.00 99.86 98.67 - 99.23
TABLE IX: AUROC (%) of one-class OOD Detection on CIFAR-10 [39] using ViM[7]. Pretrained models include classification task [20], MoCov3 [21], DINOv2 [22], BEiTv2 [23] and BEiT [12].