MOODv2: Masked Image Modeling for Out-of-Distribution Detection

Jingyao Li, Pengguang Chen, Shaozuo Yu, Shu Liu, and Jiaya Jia
Jingyao Li and Shaozuo Yu are with the Department of Computer Science and Engineering of the Chinese University of Hong Kong (CUHK). Pengguang Chen, Shu Liu and Jiaya Jia are with SmartMore. Jiaya Jia’s E-mail: [email protected]

Abstract

The crux of effective out-of-distribution (OOD) detection lies in acquiring a robust in-distribution (ID) representation that is distinct from OOD samples. Previous methods predominantly rely on recognition-based techniques for this purpose, which often leads to shortcut learning and representations that lack comprehensiveness. In this study, we conduct a comprehensive analysis that explores distinct pretraining tasks and employs various OOD score functions. The results show that feature representations pre-trained through reconstruction yield a notable improvement and narrow the performance gap among score functions. This suggests that even simple score functions can rival complex ones when leveraging reconstruction-based pretext tasks, which adapt well to various score functions and therefore hold promising potential for further expansion. Our OOD detection framework, MOODv2, employs the masked image modeling pretext task. Without bells and whistles, MOODv2 improves AUROC by 14.30% to 95.68% on ImageNet and achieves 99.98% on CIFAR-10.

Index Terms:

Computer Vision, Out-of-Distribution Detection, Outlier Detection, Masked Image Modeling

1 Introduction

A reliable visual recognition system not only provides correct predictions on known contexts (also known as in-distribution data) but also detects unknown out-of-distribution (OOD) samples and rejects them or hands them over to human intervention for safe handling. This motivates applying outlier detectors before feeding inputs to downstream networks, which is the main task of OOD detection, also referred to as novelty or anomaly detection. OOD detection is the task of identifying whether a test sample is drawn far from the in-distribution (ID) data. It is a cornerstone of various safety-critical applications, including medical diagnosis [1], fraud detection [2], and autonomous driving [3]. A representative in-distribution feature representation is crucial for out-of-distribution detection: a well-crafted feature representation significantly enhances performance under most mainstream OOD detection score functions. Our research is dedicated to refining feature representations tailored for OOD detection, with the aim of advancing the entire field.

Existing methods perform contrastive learning [4, 5] or pretrain a classifier on a large dataset [6, 7, 8, 9] to detect OOD samples. The former classify images according to pseudo labels, while the latter classify images based on ground truth; the core task of both is to fulfill a classification target. However, research on backdoor attacks [10, 11] shows that when representations are learned by classifying data, networks tend to take shortcuts. In a typical backdoor attack scenario [11], the attacker adds secret triggers to original training images that carry visibly correct labels. At test time, the victim model classifies images with secret triggers into the wrong category. Research in this area demonstrates that networks learn only the specific distinguishable patterns of different categories, because doing so is a shortcut to fulfilling the classification requirement. Learning these patterns, however, is ineffective for OOD detection, so learning representations by classifying ID data may not be satisfactory. For example, when patterns similar to some ID categories appear in OOD samples, the network could easily interpret these OOD samples as ID data and classify them into the wrong ID categories, as shown in Fig. 2.

Figure 1: The average AUROC (%) tested on four OOD datasets applied to a ViT model with different pretext tasks. Methods in blue use the feature space; methods in green use logits; methods in yellow use the softmax probability; and methods in red use both features and logits. The stars show the average performance of a category of methods.

To remedy this issue, we introduce the reconstruction-based pretext task. Different from contrastive learning in existing OOD detection approaches [4, 5], our method forces the network to reconstruct the image and thus makes it learn a pixel-level feature representation. Specifically, we adopt masked image modeling (MIM) [12] as our self-supervised pretext task, which has been demonstrated to have great potential in both natural language processing [13] and computer vision [12, 14]. In the MIM task, a proportion of image patches are randomly masked. The network uses information from the remaining patches to infer the masked patches and restore the tokens of the original image. The reconstruction process enables the model to learn an effective prior over the ID feature representation rather than just the patterns that differ among categories, as in the classification process. In our work, we observed that the pre-trained models effectively reconstruct ID images, whereas they exhibit distinct domain differences when applied to OOD images (Fig. 4). This visual discrepancy underscores the domain gap in model features between ID and OOD data, offering valuable insights for OOD detection.

To validate the effectiveness of our ID feature representation, we conduct experiments to test its performance with various mainstream OOD detection score functions. We employed OOD score functions encompassing probability-based [15, 16], logits-based [17, 16], features-based [7, 18, 19], and hybrid methods utilizing both logits and features [7]. In the context of a comparative analysis spanning classic classification [20], contrastive learning [21, 22], and masked image modeling pretext tasks [12, 23], our findings underscore the dominant role of reconstruction-based strategies in the field of OOD detection, as illustrated in Fig. 1.

Furthermore, we conduct a comprehensive analysis of the experimental results and observe that our approach not only significantly improves the overall results but also substantially reduces the disparities among score functions. This observation underscores that even simple score functions can perform on par with more complex ones when a representative ID feature representation is utilized. These findings further emphasize the critical importance of effective feature representation in OOD detection. More details are in Sec. 3.2. Ultimately, MOODv2 demonstrates remarkable enhancements, achieving a substantial 14.30% increase, reaching 95.68% AUROC on ImageNet. On CIFAR-10, our results significantly improved to an impressive 99.98%, marking a notable 0.35% enhancement compared to the previous state-of-the-art.

2 Related Works

2.1 Out-of-distribution Detection

Many scoring functions have been developed by researchers to distinguish between in-distribution and out-of-distribution examples. These functions are designed to exploit properties that are typically exhibited by ID examples but violated by OOD examples, and vice versa. These scores are primarily derived from three sources:

1. Probability-based: This category includes measures like the maximum softmax probabilities [15] and the minimum KL-divergence between the softmax and the mean class-conditional distributions [16], etc.
2. Logit-based: These functions rely on maximum logits [16] and the $\mathrm{logsumexp}$ function computed over logits [17], etc.
3. Feature-based: These functions involve the norm of the residual between a feature and the pre-image of its low-dimensional embedding [24] and the minimum Mahalanobis distance between a feature and the class centroids [19], among others.
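As a rough illustration of the three score families listed above, the following NumPy sketch implements one representative of each. It is not the implementation used in the cited works: the function names, the shared-covariance form of the Mahalanobis score, and the convention that a higher score means "more ID-like" are all assumptions made for this example.

```python
import numpy as np

def logsumexp(logits):
    """Numerically stable log-sum-exp over the class axis."""
    m = logits.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)

def msp_score(logits):
    """Probability-based: maximum softmax probability."""
    return np.exp(logits.max(axis=-1) - logsumexp(logits))

def max_logit_score(logits):
    """Logit-based: maximum logit."""
    return logits.max(axis=-1)

def energy_score(logits):
    """Logit-based: logsumexp computed over the logits."""
    return logsumexp(logits)

def mahalanobis_score(features, class_means, shared_cov):
    """Feature-based: negative minimum Mahalanobis distance to the class centroids.

    Assumes a single covariance matrix shared by all classes (illustrative choice).
    """
    precision = np.linalg.inv(shared_cov)
    diffs = features[:, None, :] - class_means[None, :, :]      # (n, C, d)
    d2 = np.einsum("ncd,de,nce->nc", diffs, precision, diffs)   # squared distances
    return -np.sqrt(np.maximum(d2, 0.0).min(axis=1))
```

All four functions take a batch (one row per sample) and return one score per sample, so they can be swapped freely when comparing score functions on the same backbone.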

After a thorough analysis of the performance and their correlations with various score functions and pretext tasks, our work follows the hybrid methods combining logit and feature [7] and includes the reconstruction-based methods as a pretext task. We will explain the implementation details later in this paper.

2.2 Self-Supervised Pretext Task

In the ever-evolving landscape of computer vision and deep learning, a multitude of strategies and techniques have been devised to enhance the capacity of models to understand and process visual data:

1. Classification task: Vision models are pre-trained via the classical classification task [20].
2. Contrastive learning tasks: MoCov3 [21] and DINOv2 [22] are advanced contrastive learning methods used for self-supervised representation learning. These methods focus on learning representations by contrasting positive pairs (e.g., different augmentations of the same image) with negative pairs (e.g., augmentations from different images). MoCov3 extends the MoCo framework [25] with a momentum encoder and dynamic queues for improved performance. DINOv2 introduces a clustered teacher network and an asymmetric loss to learn efficient representations.
3. Masked image modeling tasks: The BEiT series (Bidirectional Encoder representation from Image Transformers) [12, 23] are self-supervised learning methods based on masked image modeling. In these tasks, a portion of the image patches is randomly masked, and the model’s objective is to predict the content of the masked patches, effectively filling in the blanks.

These methods and tasks represent cutting-edge approaches in the field of computer vision and deep learning. They have led to substantial improvements in the ability of models to learn useful visual representations from unlabeled data, enabling better performance on various downstream vision tasks.

Multiple existing methods take advantage of self-supervised tasks to guide the learning of representation for OOD detection. Previous work [4, 5] presents contrastive learning models as feature extractors. However, existing approaches of classifying transformed images according to contrastive learning possess similar limitations – that is, the model tends to learn the specific patterns of categories [10, 26], which are beneficial for classification but do not help understand the intrinsic ID representation. In our work, we address this issue by performing the masked image modeling task for OOD detection.
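To make the masking step of masked image modeling concrete, here is a minimal NumPy sketch that splits an image into patches and hides a random subset of them. It only illustrates the input corruption: the patch size of 16, the mask ratio of 0.4, and both helper names are assumptions for this example, and the tokenizer and masking scheme actually used by BEiT are not reproduced.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask with True at the positions the encoder is not allowed to see."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # stand-in for a 224x224 RGB input
patches = patchify(image, 16)                # 14 x 14 = 196 patches
mask = random_patch_mask(len(patches), 0.4, rng)
visible_patches = patches[~mask]             # what the model conditions on
masked_targets = patches[mask]               # what the pretext task must recover
```

During pretraining, the network would receive `visible_patches` (plus mask placeholders) and be trained to predict a target derived from `masked_targets`.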

2.3 Training Strategy

Numerous approaches have been developed to address OOD-awareness in training loss [27]. These methods often involve the introduction of regularization terms aimed at encouraging a clearer separation between ID and OOD features [28, 29]. In some cases, networks are augmented with confidence estimation branches, utilizing misclassified in-distribution examples as proxies for out-of-distribution ones [27]. MOS [29] adapts the loss function by incorporating a predefined group structure, enabling the minimum group-wise “else” class probability to serve as an indicator of OOD classification. An alternative approach [28] focuses on compelling ID samples to embed into a union of 1-dimensional subspaces during training, and it evaluates the minimum angular distance between the feature and class-wise subspaces.

In contrast to these approaches, our method belongs to the lightweight training-free methods [7, 30], which doesn’t necessitate retraining the model. Therefore, it not only offers a more straightforward application but also preserves the accuracy of in-distribution classification.

Methods	prob	feat	logit	feat+logit
ViT[20]	73.61 $\pm$ 21.36	82.61 $\pm$ 23.81	45.11 $\pm$ 4.45	99.63
MoCov3[21]	70.96 $\pm$ 23.68	79.17 $\pm$ 28.75	41.42 $\pm$ 3.50	99.73
DINOv2[22]	87.20 $\pm$ 10.62	84.73 $\pm$ 21.57	80.30 $\pm$ 0.10	99.98
BEiTv2[23]	79.96 $\pm$ 13.71	91.77 $\pm$ 11.47	72.87 $\pm$ 2.08	99.87
BEiT[12]	77.51 $\pm$ 17.83	89.05 $\pm$ 15.46	65.05 $\pm$ 2.06	99.98

(a) ID: CIFAR-10

Methods	prob	feat	logit	feat+logit
ViT[20]	78.52 $\pm$ 1.76	76.86 $\pm$ 3.20	70.61 $\pm$ 4.76	77.65
MoCov3[21]	78.36 $\pm$ 1.42	72.51 $\pm$ 6.21	70.61 $\pm$ 5.04	72.07
DINOv2[22]	59.64 $\pm$ 7.82	63.56 $\pm$ 2.89	60.70 $\pm$ 4.51	61.32
BEiTv2[23]	89.07 $\pm$ 0.24	92.96 $\pm$ 1.27	90.29 $\pm$ 0.13	95.42
BEiT[12]	89.47 $\pm$ 0.47	93.30 $\pm$ 1.89	89.84 $\pm$ 0.01	95.68

(b) ID: ImageNet

TABLE I: The AUROC (%) of four types of methods: probability-based methods MSP [15] and KL-Matching [16]; logits-based methods Energy [17] and MaxLogit [16]; features-based methods Residual [7], React [18] and Mahalanobis [19]; and methods using both logits and features include ViM [7]. The best method for each model is emphasized in bold.

2.4 MOODv1

Our previous version, MOODv1 [30], introduced the masked image modeling pretraining strategy into OOD detection (MOOD) and achieved promising results. However, several concerns remain:

Firstly, previous studies [30, 4, 5] have typically necessitated fine-tuning a model on each in-distribution dataset. The expense of training becomes notably high when dealing with a substantial number of ID datasets to be assessed, such as in one-class OOD detection [4, 30]. However, through experimental validation, we have discovered that a well-prepared masked image modeling model doesn’t require additional fine-tuning to achieve outstanding detection performance, conserving substantial fine-tuning resource consumption when dealing with a plethora of ID datasets that require evaluation.

Secondly, as the field has seen the emergence of more advanced OOD score functions [16, 7, 18, 15, 17, 19] and pretraining techniques [23, 22, 21, 12, 20], it raises the question of whether masked image modeling continues to maintain its leading role. In MOODv2, we integrate the latest advancements in pretraining methods and conduct experiments with an array of state-of-the-art OOD score functions. This broader spectrum of pretraining methods and score functions allows for a more comprehensive assessment of the MOODv2’s performance, better aligning MOODv2 with the increasingly intricate challenges of OOD detection.

Lastly, it is well known that if the network has seen similar samples in training, whether during pre-training or fine-tuning, the OOD detection task becomes more or less trivial [31]. Previous works [6, 30] rely on pre-training on ImageNet-21K, so benchmark OOD datasets such as CIFAR [39] and Places [40] are unlikely to be disjoint from the ImageNet-21K [35] dataset. In this work, MOODv2 introduces the latest unnatural datasets as OOD, which rules out the possibility of overlap between the OOD test set and the training set [31, 34].

In summary, MOODv2 incorporates improved score functions, advanced pretraining techniques, a wider range of unnatural OOD datasets, and a streamlined general framework. The performance improvement of MOODv2 compared to MOODv1 is depicted in Fig. 3. On ImageNet, MOODv2 exhibits a noteworthy 2.17% improvement in AUROC compared to MOODv1. Furthermore, on CIFAR-10, MOODv2, without finetuning on the ID dataset, achieves an exceptional AUROC score of up to 99.98%.

3 Methods

In this section, we initiate our exploration of reconstruction tasks for OOD detection by presenting the underlying motivation in Sec. 3.1. Following that, in Sec. 3.2, we delve into a comprehensive analysis of the essential attributes that play a pivotal role in OOD detection.

3.1 Motivation: seeking an effective ID representation

Most previous OOD methods learn the ID representation through classification [15, 6] or contrastive learning [4, 5] on ID samples, using either the ground truth or pseudo labels to supervise the classification networks. On the other hand, work on backdoor attacks [10, 11] shows that classification networks learn only the patterns that differ among training categories, because this is a shortcut to fulfilling the classification objective. This indicates that the network does not actually learn an effective in-distribution representation. In comparison, the reconstruction-based pretext task forces the network to learn a pixel-level representation of the ID images during training in order to reconstruct them, instead of the patterns needed for classification. In this way, the network can learn a more representative feature of the ID dataset.

To verify this, we reconstruct ID and OOD data and compute the Euclidean distance between the original and reconstructed images. A greater distance indicates a larger deviation of the reconstructed image from the original image. We collect these recovery distances for both ID and OOD data. Examples of the reconstruction are depicted in Fig. 4. In the first row, for ID images, pre-trained models reconstruct the images effectively. In contrast, for unnatural OOD images in the following rows, clear domain discrepancies emerge. For instance, in the case of textured images, the models still apply lighting and shadows reminiscent of natural images. In the case of sketch images, the models render the images smoother and brighter. This discrepancy visually highlights the domain gap in model features between ID and OOD data, which can be leveraged for OOD detection.
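The recovery distance described above can be sketched in a few lines of NumPy. This is an illustrative computation only: the function name and the `(n, H, W, C)` array layout are assumptions, and the reconstructions themselves would come from the pre-trained model, which is not shown.

```python
import numpy as np

def recovery_distance(originals, reconstructions):
    """Per-image Euclidean distance between originals and their reconstructions.

    Both inputs are arrays of shape (n, H, W, C); the result has shape (n,).
    """
    diff = originals.astype(np.float64) - reconstructions.astype(np.float64)
    return np.sqrt((diff.reshape(len(diff), -1) ** 2).sum(axis=1))
```

Computing this quantity separately for ID and OOD images gives two sets of distances whose distributions can then be compared.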

3.2 Reconstruction Tasks for OOD Detection

Methods	Models	Texture [32]		iNaturalist [33]		ImageNet-O [34]		OpenImage-O [31]		Average
Methods	Models	AUROC $\uparrow$	FPR95 $\downarrow$	AUROC $\uparrow$	FPR95 $\downarrow$	AUROC $\uparrow$	FPR95 $\downarrow$	AUROC $\uparrow$	FPR95 $\downarrow$	AUROC $\uparrow$	FPR95 $\downarrow$
MSP[15]	ViT[20]	71.31	71.31	90.70	90.70	60.77	60.77	84.29	84.29	76.77	76.77
	MoCov3[21]	66.85	66.85	90.68	90.68	64.80	64.80	85.42	85.42	76.94	76.94
	DINOv2[22]	47.49	47.49	62.13	62.13	44.87	44.87	52.83	52.83	51.83	51.83
	BEiTv2[23]	85.61	85.61	96.05	96.05	81.15	81.15	92.52	92.52	88.83	88.83
	BEiT[12]	85.05	85.05	95.50	95.50	83.17	83.17	92.28	92.28	89.00	89.00
Energy[17]	ViT[20]	54.11	54.11	76.61	76.61	61.63	61.63	71.06	71.06	65.85	65.85
	MoCov3[21]	48.79	48.79	76.80	76.80	64.56	64.56	72.13	72.13	65.57	65.57
	DINOv2[22]	73.89	73.89	80.34	80.34	49.98	49.98	56.64	56.64	65.21	65.21
	BEiTv2[23]	85.32	85.32	96.95	96.95	85.27	85.27	94.14	94.14	90.42	90.42
	BEiT[12]	83.04	83.04	96.48	96.48	86.36	86.36	93.50	93.50	89.85	89.85
MaxLogit[16]	ViT[20]	67.22	67.22	89.88	89.88	61.68	61.68	82.73	82.73	75.37	75.37
	MoCov3[21]	62.36	62.36	90.38	90.38	65.65	65.65	84.19	84.19	75.64	75.64
	DINOv2[22]	54.70	54.70	69.98	69.98	45.60	45.60	54.52	54.52	56.20	56.20
	BEiTv2[23]	85.94	85.94	96.90	96.90	83.97	83.97	93.82	93.82	90.16	90.16
	BEiT[12]	84.17	84.17	96.48	96.48	85.34	85.34	93.31	93.31	89.83	89.83
KL-Matching[16]	ViT[20]	82.59	82.59	87.63	87.63	66.55	66.55	84.34	84.34	80.28	80.28
	MoCov3[21]	82.35	82.35	86.24	86.24	67.80	67.80	82.73	82.73	79.78	79.78
	DINOv2[22]	80.51	80.51	56.93	56.93	69.77	69.77	62.63	62.63	67.46	67.46
	BEiTv2[23]	87.14	87.14	95.13	95.13	82.87	82.87	92.10	92.10	89.31	89.31
	BEiT[12]	87.87	87.87	94.82	94.82	84.56	84.56	92.48	92.48	89.93	89.93
Residual[7]	ViT[20]	82.39	82.39	73.72	73.72	68.44	68.44	74.88	74.88	74.86	74.86
	MoCov3[21]	75.25	75.25	73.80	73.80	57.69	57.69	67.82	67.82	68.64	68.64
	DINOv2[22]	66.50	66.50	61.90	61.90	58.94	58.94	56.84	56.84	61.04	61.04
	BEiTv2[23]	94.99	94.99	99.01	99.01	87.23	87.23	95.43	95.43	94.17	94.17
	BEiT[12]	94.16	94.16	99.50	99.50	89.35	89.35	96.52	96.52	94.88	94.88
React[18]	ViT[20]	62.09	62.09	91.20	91.20	63.66	63.66	80.43	80.43	74.34	74.34
	MoCov3[21]	51.47	51.47	79.30	79.30	65.33	65.33	74.35	74.35	67.61	67.61
	DINOv2[22]	76.73	76.73	74.25	74.25	56.26	56.26	63.17	63.17	67.60	67.60
	BEiTv2[23]	86.10	86.10	98.09	98.09	85.69	85.69	94.96	94.96	91.21	91.21
	BEiT[12]	84.32	84.32	96.99	96.99	87.04	87.04	94.21	94.21	90.64	90.64
Mahalanobis[19]	ViT[20]	84.93	84.93	84.90	84.90	71.53	71.53	84.16	84.16	81.38	81.38
	MoCov3[21]	84.29	84.29	86.95	86.95	70.33	70.33	83.54	83.54	81.28	81.28
	DINOv2[22]	68.58	68.58	63.14	63.14	58.86	58.86	57.57	57.57	62.04	62.04
	BEiTv2[23]	93.01	93.01	98.78	98.78	86.78	86.78	95.46	95.46	93.51	93.51
	BEiT[12]	93.03	93.03	99.18	99.18	88.84	88.84	96.51	96.51	94.39	94.39
ViM[7]	ViT[20]	83.51	83.51	77.75	77.75	71.04	71.04	78.31	78.31	77.65	77.65
	MoCov3[21]	76.28	76.28	78.18	78.18	61.35	61.35	72.46	72.46	72.07	72.07
	DINOv2[22]	66.90	66.90	62.53	62.53	58.93	58.93	56.93	56.93	61.32	61.32
	BEiTv2[23]	95.35	95.35	99.31	99.31	90.06	90.06	96.96	96.96	95.42	95.42
	BEiT[12]	94.25	94.25	99.59	99.59	91.47	91.47	97.41	97.41	95.68	95.68
Average	ViT[20]	73.52	73.52	84.05	84.05	65.66	65.66	80.02	80.02	75.81	75.81
	MoCov3[21]	68.45	68.45	82.79	82.79	64.69	64.69	77.83	77.83	73.44	73.44
	DINOv2[22]	66.91	66.91	66.40	66.40	55.40	55.40	57.64	57.64	61.59	61.59
	BEiTv2[23]	89.18	89.18	97.53	97.53	85.38	85.38	94.42	94.42	91.63	91.63
	BEiT[12]	88.24	88.24	97.32	97.32	87.02	87.02	94.53	94.53	91.77	91.77
Best	ViT[20]	84.93	84.93	91.20	91.20	71.53	71.53	84.34	84.34	81.38	81.38
	MoCov3[21]	84.29	84.29	90.68	90.68	70.33	70.33	85.42	85.42	81.28	81.28
	DINOv2[22]	80.51	80.51	80.34	80.34	69.77	69.77	63.17	63.17	67.60	67.60
	BEiTv2[23]	95.35	95.35	99.31	99.31	90.06	90.06	96.96	96.96	95.42	95.42
	BEiT[12]	94.25	94.25	99.59	99.59	91.47	91.47	97.41	97.41	95.68	95.68

TABLE II: Performance of OOD detection methods on ViT-B/16 model with $224\times 224$-pixel inputs. The pretext tasks include classification task [20], contrastive learning tasks MoCov3 [21] and DINOv2 [22], and masked image modeling tasks BEiT [12] and BEiTv2 [23]. All models are pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k. Both metrics, AUROC and FPR95, are in percentage. The best method is emphasized in bold and a gray background indicates our choice.

In this section, we offer a comprehensive analysis of these key elements in the context of OOD detection. We employ ImageNet [35] as the in-distribution dataset and evaluate pretext tasks on challenging unnatural out-of-distribution datasets, including OpenImage-O [31], Texture [32], iNaturalist [33], and ImageNet-O [34]. We conduct extensive validation with various pretraining methods and OOD score functions, including MSP [15], Energy [17], ODIN [41], MaxLogit [16], KL Matching [16], Residual [7], ReAct [18], Mahalanobis [19] and ViM [7].

Results are shown in Tab. II. The masked image modeling pretext task surpasses the classification and contrastive learning pretext tasks under every included score function. Averaged across these score functions, AUROC improves by 15.96% over the classification-pretrained ViT (91.77% for BEiT versus 75.81%). With the best-performing score function for each model, the gain is 14.30% (95.68% versus 81.38%). We attribute this improvement to the representative ID feature space, which aids in distinguishing between ID and OOD data. This finding is significant because it improves performance across mainstream OOD detection score functions, thus advancing the entire field. We also employ CIFAR-10 [39] as the ID dataset and provide results in the appendix. Our approach attains an AUROC of 99.99% while reducing the FPR95 to 0.03%.

To make the experimental findings easier to interpret, we conduct a statistical analysis and visualize the outcomes in Fig. 5. Our approach not only improves the overall results but also notably reduces the variation among different methods. For instance, on ImageNet the ViT, MoCov3, and DINOv2 models using logit-based methods exhibit standard deviations of 4.76%, 5.04%, and 4.51%, respectively, while BEiTv2 and BEiT display significantly lower standard deviations of 0.13% and 0.01% (Tab. I). This observation underscores that even simple score functions can perform on par with more intricate ones when an effective ID feature representation is applied.
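The spread quoted for the logit-based methods can be recomputed from the average AUROC column of Tab. II. The sketch below does so for the two logit-based scores (Energy and MaxLogit), assuming the population standard deviation (NumPy's default, `ddof=0`); under that assumption the output agrees with the logit column of Tab. I(b) up to rounding. The dictionary and variable names are only for illustration.

```python
import numpy as np

# Average AUROC (%) of [Energy, MaxLogit] per pretraining method, copied from Tab. II.
logit_auroc = {
    "ViT": [65.85, 75.37],
    "MoCov3": [65.57, 75.64],
    "DINOv2": [65.21, 56.20],
    "BEiTv2": [90.42, 90.16],
    "BEiT": [89.85, 89.83],
}

# Mean and population standard deviation across the logit-based score functions.
summary = {
    name: (float(np.mean(values)), float(np.std(values)))
    for name, values in logit_auroc.items()
}

for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.2f} +/- {std:.2f}")
```

The same computation applies to the probability-based and feature-based groups by swapping in the corresponding rows of Tab. II.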

In Tab. I, we highlight the best methods for each model. On CIFAR-10, all models achieve their best results with the combined feature and logit approach, reaching almost 100% AUROC. This suggests a highly effective grasp of CIFAR-10’s feature space. Conversely, on the larger ImageNet dataset, we observe variations in outcomes. Notably, the models pre-trained with the masked image modeling pretext task achieve their best results with the combined feature and logit method, while the other models perform best with probability-based or feature-based methods. Additionally, the masked image modeling pretext task performs significantly better than the other pretraining methods, underscoring the limitations of classification-based pretraining strategies and their inability to exploit advanced score functions effectively. These findings reinforce the pivotal role of feature representation in OOD detection. For more detail, we provide the distribution curves of OOD scores for both ID and OOD datasets in the appendix.

3.3 Masked Image Modeling for Out-of-Distribution v2

To sum up, in this section, we observed that pre-trained models adeptly reconstruct ID images, yet manifest distinctive domain differences in the OOD scenario (Fig. 4). This visual incongruity starkly highlights the prevailing domain gap in model features between ID and OOD data. Additionally, a thorough analysis of experimental outcomes reveals that the pre-task of masked image modeling not only significantly enhances overall results but also markedly diminishes disparities among score functions. These findings emphasize the crucial significance of effective feature representation in OOD detection, highlighting the enhancement of features through masked image modeling tasks.

Finally, we propose our Masked Image Modeling for Out-of-Distribution Detection v2 (MOODv2). The procedure is shown in Algorithm 1 and mainly includes the following stages.

1. Pre-train the vision encoder with masked image modeling on the pretrain dataset.
2. Fine-tune the backbone on the in-distribution dataset.
3. Extract features from the trained image encoder and compute the OOD score with the score function for OOD detection.

In terms of the OOD score function, we adopt ViM[7] that combines features and logits, leveraging insights from the masked image modeling pre-trained model, which has demonstrated superior performance. Mathematically, the score is

$$s(x)=\frac{e^{\alpha\sqrt{x^{T}RR^{T}x}}}{\sum_{i=1}^{C}e^{l_{i}}+e^{\alpha\sqrt{x^{T}RR^{T}x}}}. \tag{1}$$

where $l_{i}$ is the $i$-th logit of the feature $x$; $X$ is the matrix of training-set features; $\alpha$ is a per-model constant; $R\in\mathbb{R}^{N\times(N-D)}$ consists of the $(D+1)$-th to the last columns of the eigenvector matrix $Q$ of $X^{T}X$, where $N$ is the feature dimension and $D$ is the dimension of the principal space; and $C$ is the number of classes.
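The score in Eq. (1) can be read as the softmax probability of an extra "virtual" logit $\alpha\sqrt{x^{T}RR^{T}x}$ appended to the $C$ class logits. The NumPy sketch below follows that reading under stated assumptions: `residual_basis` and `vim_score` are hypothetical helper names, the eigenvectors are taken from $X^{T}X$ and sorted by decreasing eigenvalue, and any feature centering or calibration of $\alpha$ used in the reference ViM implementation [7] is left out because Eq. (1) does not show it.

```python
import numpy as np

def residual_basis(train_feats, principal_dim):
    """Return R: the (D+1)-th to last eigenvectors of X^T X, shape (N, N - D)."""
    eigvals, eigvecs = np.linalg.eigh(train_feats.T @ train_feats)  # ascending order
    order = np.argsort(eigvals)[::-1]                               # largest first
    return eigvecs[:, order[principal_dim:]]

def vim_score(feats, logits, R, alpha):
    """Eq. (1): softmax probability of the virtual logit alpha * ||R^T x||."""
    virtual = alpha * np.linalg.norm(feats @ R, axis=-1)   # sqrt(x^T R R^T x)
    all_logits = np.concatenate([logits, virtual[:, None]], axis=-1)
    all_logits -= all_logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(all_logits)
    return probs[:, -1] / probs.sum(axis=-1)
```

A feature that lies entirely in the principal space has zero residual norm and receives a low score, while a feature with a large component in the residual space pushes the score toward 1.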

Algorithm 1 MOODv2 Detection Algorithm

1: Input: pre-train set $X_{P}$, in-distribution set $X_{\rm ID}$, test set $X_{\rm test}$, required True Positive Rate $\eta$%, backbone $f$
2: Output: is $x_{\rm test}$ an outlier or not? $\forall x_{\rm test}\in X_{\rm test}$
3: Pre-train $f$ on $X_{P}$ by maximizing $\sum_{x\in X_{P}}\mathbb{E}_{M}\left[\sum_{i\in M}\log p_{\rm MIM}(z|x^{M})\right]$
4: Fine-tune $f$ on $X_{P}$ by minimizing $L_{\rm ft}=\sum_{x_{p}\in X_{P}}{\rm CrossEntropy}(f(x_{p}),y_{P}(x_{p}))$
5: Calculate $d(x_{\rm test})$ for $x_{\rm test}\in X_{\rm test}$ and $d(x_{\rm cal})$ for $x_{\rm cal}\in X_{\rm cal}$
6: Compute threshold $T$ as the $\eta$ percentile of $d(x_{\rm cal})$
7: if $d(x_{\rm test})>T$ then
8:     $x_{\rm test}$ is an outlier.
9: end if
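The thresholding stage of Algorithm 1 can be sketched as follows, assuming that the scores $d(\cdot)$ have already been computed, that a higher score means "more OOD", and that $X_{\rm cal}$ is a held-out set of ID samples (the algorithm does not define it explicitly). The function name is hypothetical.

```python
import numpy as np

def detect_outliers(cal_scores, test_scores, tpr_percent):
    """Flag test samples whose score exceeds the eta-th percentile of the calibration scores.

    Returns a boolean array (True = outlier) and the threshold T.
    """
    threshold = np.percentile(cal_scores, tpr_percent)
    return np.asarray(test_scores) > threshold, float(threshold)

# Example: with eta = 95, roughly 95% of calibration samples fall at or below T.
cal = np.arange(1.0, 101.0)
flags, T = detect_outliers(cal, np.array([50.0, 99.0]), 95)
```

Choosing the threshold from ID calibration scores fixes the fraction of ID data that is accepted, which is why $\eta$ is described as the required True Positive Rate.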

ID data	Methods	Texture [32]		iNaturalist [33]		ImageNet-O [34]		OpenImage-O [31]		Average
ID data	Methods	AUROC $\uparrow$	FPR95 $\downarrow$	AUROC $\uparrow$	FPR95 $\downarrow$	AUROC $\uparrow$	FPR95 $\downarrow$	AUROC $\uparrow$	FPR95 $\downarrow$	AUROC $\uparrow$	FPR95 $\downarrow$
CIFAR-10	MSP[15]	45.67	95.17	71.07	81.76	32.52	98.85	59.74	91.45	52.25	91.81
	Energy[17]	31.16	97.89	48.95	97.92	37.22	97.85	45.29	96.36	40.65	97.50
	MaxLogit[16]	41.21	95.95	67.83	86.04	32.58	98.80	56.64	92.94	49.56	93.43
	KL-Matching[16]	98.00	10.64	94.23	35.86	92.99	32.40	94.68	27.92	94.97	26.71
	Residual[7]	99.91	0.21	99.68	0.45	99.36	2.85	99.42	2.46	99.59	1.49
	React[18]	35.97	96.26	69.01	87.91	36.65	97.75	54.14	93.11	48.94	93.76
	Mahalanobis[19]	99.77	0.60	99.39	1.11	98.93	4.90	99.14	3.26	99.31	2.47
	ViM[7]	99.91	0.23	99.72	0.38	99.38	2.65	99.49	2.31	99.63	1.39
	MOODv1[30]	99.95	0.06	99.99	0.02	99.61	1.90	99.82	0.77	99.84	0.69
	MOODv2 (ours)	99.98	0.06	100.00	0.00	99.94	0.20	99.99	0.01	99.98	0.07
ImageNet	MSP[15]	71.31	77.07	90.70	43.72	60.77	90.60	84.29	61.79	76.77	68.30
	Energy[17]	54.11	86.28	76.61	72.70	61.63	81.00	71.06	73.99	65.85	78.49
	MaxLogit[16]	67.22	77.98	89.88	45.57	61.68	88.60	82.73	62.52	75.37	68.67
	KL-Matching[16]	82.59	67.27	87.63	69.71	66.55	88.15	84.34	74.23	80.28	74.84
	Residual[7]	82.39	64.61	73.72	86.00	68.44	87.45	74.88	77.98	74.86	79.01
	React[18]	62.09	80.47	91.20	38.74	63.66	81.00	80.43	60.41	74.34	65.15
	Mahalanobis[19]	84.93	66.05	84.90	81.60	71.53	88.85	84.16	74.72	81.38	77.80
	ViM[7]	83.51	62.71	77.75	81.72	71.04	86.60	78.31	74.55	77.65	76.40
	MOODv1[30]	93.01	30.91	98.78	5.89	86.78	63.15	95.46	26.46	93.51	31.60
	MOODv2 (ours)	94.25	24.69	99.59	1.83	91.47	40.80	97.41	13.55	95.68	20.22

TABLE III: Performance of OOD detection methods on ViT-B/16 model with $224\times 224$-pixel inputs. All methods are pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k. ID datasets include CIFAR-10 [39] and ImageNet-1k [35]. Both metrics, AUROC and FPR95, are in percentage. The best method is emphasized in bold and a gray background indicates our methods.

Methods	ID class										Average
Methods	Plane	Car	Bird	Cat	Deer	Dog	Frog	Horse	Ship	Truck	Average
KL-Matching[16]	95.35	92.04	95.18	91.26	88.11	94.66	94.99	86.52	93.61	89.37	92.11
Residual[7]	97.62	95.88	97.06	96.30	89.18	94.33	96.73	91.46	94.89	92.36	94.58
Mahalanobis[19]	97.52	96.07	96.77	96.41	89.60	94.79	96.41	91.48	94.80	92.58	94.64
ViM[7]	97.61	96.36	97.19	96.50	88.78	94.21	96.70	91.60	94.97	92.35	94.63
MOODv1[30]	98.63	99.33	94.31	93.22	98.11	96.50	99.25	98.96	98.76	97.82	97.83
MOODv2 (ours)	99.14	99.03	99.51	98.37	97.12	97.20	98.53	98.07	98.35	96.68	98.20

(a) AUROC

Methods	ID class										Average
Methods	Plane	Car	Bird	Cat	Deer	Dog	Frog	Horse	Ship	Truck	Average
KL-Matching[16]	23.60	32.60	22.32	42.92	46.26	24.30	24.97	46.74	25.32	40.53	32.96
Residual[7]	12.06	25.58	16.71	21.17	48.33	22.12	17.42	36.72	17.30	30.76	24.82
Mahalanobis[19]	12.59	25.72	18.92	21.48	48.44	20.59	19.20	38.02	17.47	30.93	25.34
ViM[7]	12.43	24.83	15.77	20.13	48.68	21.77	17.63	36.63	17.60	30.78	24.63
MOODv1[30]	7.59	5.04	2.47	7.49	15.63	10.96	11.37	13.09	10.06	19.62	10.33
MOODv2 (ours)	4.82	4.50	1.79	8.80	15.59	11.00	8.46	12.43	8.60	18.96	9.49

(b) FPR95

TABLE IV: Performance of OOD detection methods on ViT-B/16 model with $224\times 224$-pixel inputs. All methods are pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k. We treat each class of CIFAR-10 [39] in turn as the ID dataset and the other classes as OOD datasets. We report the average results across OOD classes for each ID class. Both metrics, AUROC and FPR95, are in percentage. The best method is emphasized in bold and a gray background indicates our methods.

4 Experiments

In this section, we conduct a thorough comparison of our algorithm with the latest OOD detection methods. We employ the ViT-B/16 model, pre-trained on ImageNet-21K and fine-tuned on ImageNet-1K at a resolution of $224\times 224$ .

ID/OOD Datasets. We select CIFAR-10 [39] and ImageNet-1K [35] as the ID datasets. Following established procedures [7], for estimating the principal space of ImageNet, we randomly sample $200,000$ images from the training set. Our experiments include the following OOD datasets:

1. OpenImage-O is a newly collected large-scale OOD dataset [31].
2. Texture [36] comprises natural textural images, with four overlapping categories (bubbly, honeycombed, cobwebbed, spiraled) removed since they coincide with ImageNet.
3. iNaturalist [37] is a fine-grained species classification dataset, and we use a specific subset from previous works [29].
4. ImageNet-O [38] contains images that are adversarially filtered to challenge OOD detectors.
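
The principal-space estimate mentioned under ID/OOD Datasets can be illustrated with a small sketch. The code below is only a minimal NumPy illustration and not the implementation of [7]: it assumes mean-centred features and a hypothetical subspace dimension `dim`, fits the top principal directions on training features, and measures how much of a test feature lies outside that subspace. The function names and the choice of centring are assumptions made for illustration.

```python
import numpy as np

def fit_principal_space(train_feats, dim):
    """Return the feature mean and the top-`dim` principal directions (as columns)."""
    train_feats = np.asarray(train_feats, dtype=float)
    mean = train_feats.mean(axis=0)
    centered = train_feats - mean
    # Covariance of the centred features; eigh returns eigenvalues in ascending order.
    cov = centered.T @ centered / len(centered)
    _, eigvecs = np.linalg.eigh(cov)
    principal = eigvecs[:, ::-1][:, :dim]  # keep the directions with largest variance
    return mean, principal

def residual_norm(feats, mean, principal):
    """Norm of the part of each feature that lies outside the principal space."""
    centered = np.asarray(feats, dtype=float) - mean
    projected = centered @ principal @ principal.T
    return np.linalg.norm(centered - projected, axis=1)
```

Under this assumption, a larger residual norm suggests that a sample is less well explained by the ID principal space, so its negative can serve as an ID score.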

Evaluation Metrics. We report two commonly used evaluation metrics, AUROC and FPR95. AUROC is a threshold-free metric that measures the area under the receiver operating characteristic curve, with a higher value denoting better detection performance. FPR95, or FPR at TPR95, is the false positive rate when the true positive rate is 95%; a smaller FPR95 is preferable. Both metrics are expressed as percentages.
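
As an illustration of the two metrics, the following minimal sketch computes them from raw scores. It assumes that ID samples are the positive class and that a higher score means "more in-distribution"; the function names are hypothetical, and both functions return fractions that would be multiplied by 100 to match the percentages reported here.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores higher than a random OOD sample."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Pairwise comparison of every ID score with every OOD score (fine for small arrays).
    diff = id_scores[:, None] - ood_scores[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted when about 95% of ID samples are accepted."""
    # The 5th percentile of ID scores keeps roughly 95% of ID samples above the threshold.
    threshold = np.percentile(np.asarray(id_scores, dtype=float), 5)
    return float((np.asarray(ood_scores, dtype=float) >= threshold).mean())

id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.65, 0.2, 0.1, 0.05]
print(auroc(id_scores, ood_scores))          # 0.9375
print(fpr_at_95_tpr(id_scores, ood_scores))  # 0.25
```

The pairwise formulation is quadratic in the number of samples; for large test sets a rank-based computation would be the more practical choice.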

Baseline Methods. Following previous works [7], we compare MOODv2 with baseline algorithms that do not require fine-tuning, including MSP [15], Energy [17], ODIN [41], MaxLogit [16], KL Matching [16], Residual [7], ReAct [18], and Mahalanobis [19].
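
To make the simplest of these baselines concrete, the sketch below computes three logit-based scores (MSP, MaxLogit, and Energy) with NumPy. It is a minimal illustration following the usual definitions in [15], [16], and [17], with higher scores meaning "more in-distribution"; the function names and the `temperature` default are assumptions, and `logits` is taken to be an array of shape (samples, classes).

```python
import numpy as np

def logsumexp(logits):
    """Numerically stable log-sum-exp over the class axis."""
    m = logits.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).squeeze(1)

def msp_score(logits):
    """Maximum softmax probability."""
    return np.exp(logits.max(axis=1) - logsumexp(logits))

def max_logit_score(logits):
    """Largest unnormalised logit."""
    return logits.max(axis=1)

def energy_score(logits, temperature=1.0):
    """Negative free energy, T * logsumexp(logits / T)."""
    return temperature * logsumexp(logits / temperature)

logits = np.array([[10.0, 0.0, 0.0],   # confident prediction
                   [1.0, 1.0, 1.0]])   # uniform prediction
print(msp_score(logits), max_logit_score(logits), energy_score(logits))
```

All three scores depend only on the classifier output, which is why they need no additional fine-tuning on the ID data.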

4.1 One-Class OOD Detection

We start with one-class OOD detection. For a given multi-class dataset of $N_{c}$ classes, we conduct $N_{c}$ one-class OOD tasks, where each task regards one of the classes as in-distribution and the remaining classes as out-of-distribution. We run our experiments on CIFAR-10 [39]. Table IV summarizes the average results across OOD classes for each ID class, and the detailed class-wise performance is in the appendix.
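
The one-class protocol described above can be sketched as an enumeration of (ID class, OOD class) pairs. The helper below is a hypothetical illustration that works on a label array only; how features and scores are obtained for each pair is left open.

```python
import numpy as np

def one_class_pairs(labels):
    """Yield (id_class, ood_class, id_indices, ood_indices) for every ordered class pair."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    for id_class in classes:
        id_indices = np.flatnonzero(labels == id_class)
        for ood_class in classes:
            if ood_class == id_class:
                continue  # a class is never OOD with respect to itself
            ood_indices = np.flatnonzero(labels == ood_class)
            yield int(id_class), int(ood_class), id_indices, ood_indices

# Toy label array with three classes; CIFAR-10 would give 10 * 9 = 90 pairs.
pairs = list(one_class_pairs([0, 0, 1, 2, 2, 2]))
print(len(pairs))  # 6
```

Averaging a metric over the OOD classes of one ID class gives one per-class entry of the kind reported in the one-class tables.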

It is worth noting that all methods were pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k, which may have influenced the results to varying degrees. Nevertheless, we use consistent training strategies for all methods to ensure a fair comparison. The experimental results demonstrate that MOODv2 achieves significant improvements across all ID classes even without fine-tuning on the ID dataset. Notably, we achieve a 3.56% increase in AUROC, reaching 98.20%, while reducing the FPR95 by 15.14%, from 24.63% (ViM [7]) to 9.49%.

4.2 Multi-Class OOD Detection

For multi-class OOD detection, we assume that ID samples are from a multi-class dataset, either CIFAR-10 [39] or ImageNet [35]. The models are tested against external datasets as out-of-distribution data, including OpenImage-O [31], Texture [36], iNaturalist [37] and ImageNet-O [38].

Results are shown in Tab. III. MOODv2 delivers strong results on CIFAR-10, reaching an AUROC of 99.98% (a 0.35% improvement) and an FPR95 of only 0.07%, a relative reduction of about 95% compared to the prior SOTA (1.39%). On ImageNet, MOODv2 also shows significant improvements, raising AUROC by 14.30% to 95.68% and lowering FPR95 by 44.93% to 20.22%.
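
The CIFAR-10 figures above mix an absolute gain (AUROC) with a relative reduction (FPR95). The short sketch below reproduces that arithmetic from the reported numbers; the helper names are hypothetical, and the prior AUROC of 99.63% is inferred from the stated 0.35% gain.

```python
def absolute_gain(old, new):
    """Difference in percentage points between two percentages."""
    return new - old

def relative_reduction(old, new):
    """Reduction expressed as a percentage of the old value."""
    return (old - new) / old * 100.0

print(round(absolute_gain(99.63, 99.98), 2))      # 0.35
print(round(relative_reduction(1.39, 0.07), 1))   # 95.0
```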

In Fig. 7, we illustrate the distribution curves of OOD scores for ID and OOD datasets using various mainstream methods. A smaller overlap between ID and OOD data indicates superior OOD detection performance, while a larger overlap signifies weaker detection results. The ID curve (in red) for MOODv2 features a distinct peak at a higher position, resulting in minimal overlap with other OOD data, indicating a notable OOD detection capability. This success can be attributed to the high-quality ID feature representation.
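
The notion of overlap between the ID and OOD score distributions can be quantified with a simple histogram overlap coefficient. The sketch below is one possible illustration and is not the procedure used to draw Fig. 7; the function name and the number of bins are assumptions.

```python
import numpy as np

def histogram_overlap(id_scores, ood_scores, bins=50):
    """Overlap coefficient in [0, 1] between two score histograms on shared bins."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    lo = min(id_scores.min(), ood_scores.min())
    hi = max(id_scores.max(), ood_scores.max())
    edges = np.linspace(lo, hi, bins + 1)
    id_hist, _ = np.histogram(id_scores, bins=edges)
    ood_hist, _ = np.histogram(ood_scores, bins=edges)
    # Normalise counts to probabilities and sum the bin-wise minimum.
    id_prob = id_hist / id_hist.sum()
    ood_prob = ood_hist / ood_hist.sum()
    return float(np.minimum(id_prob, ood_prob).sum())
```

A value near 0 corresponds to well-separated curves, while a value near 1 means the two distributions are almost indistinguishable.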

5 Conclusion

In this work, we focus on a critical aspect of effective out-of-distribution (OOD) detection: acquiring a robust in-distribution (ID) representation that is distinct from OOD samples. We conduct comprehensive experiments with distinct pretraining tasks and various OOD score functions. The findings indicate that feature representations pre-trained through reconstruction significantly enhance performance and narrow the performance gap among different score functions. This implies that even simple score functions can perform as well as complex ones when utilizing reconstruction-based pretext tasks. These findings hold promise for further development in OOD detection. Finally, we introduce the MOODv2 OOD detection framework, which employs the masked image modeling pretext task and achieves a 14.30% increase in AUROC on ImageNet, reaching 95.68%, and an AUROC of 99.98% on CIFAR-10.

References

[1] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, “Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission,” in Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1721–1730.
[2] C. Phua, V. Lee, K. Smith, and R. Gayler, “A comprehensive survey of data mining-based fraud detection research,” arXiv preprint arXiv:1009.6119, 2010.
[3] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song, “Robust physical-world attacks on deep learning visual classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1625–1634.
[4] J. Tack, S. Mo, J. Jeong, and J. Shin, “Csi: Novelty detection via contrastive learning on distributionally shifted instances,” Advances in neural information processing systems, vol. 33, pp. 11 839–11 852, 2020.
[5] V. Sehwag, M. Chiang, and P. Mittal, “Ssd: A unified framework for self-supervised outlier detection,” arXiv preprint arXiv:2103.12051, 2021.
[6] S. Fort, J. Ren, and B. Lakshminarayanan, “Exploring the limits of out-of-distribution detection,” Advances in Neural Information Processing Systems, vol. 34, pp. 7068–7081, 2021.
[7] H. Wang, Z. Li, L. Feng, and W. Zhang, “Vim: Out-of-distribution with virtual-logit matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[8] J. Yang, K. Zhou, Y. Li, and Z. Liu, “Generalized out-of-distribution detection: A survey,” arXiv preprint arXiv:2110.11334, 2021.
[9] M. B. Sariyildiz, K. Alahari, D. Larlus, and Y. Kalantidis, “Fake it till you make it: Learning transferable representations from synthetic imagenet clones,” 2023.
[10] A. Saha, A. Subramanya, and H. Pirsiavash, “Hidden trigger backdoor attacks,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, pp. 11 957–11 965, Apr. 2020. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/6871
[11] ——, “Hidden trigger backdoor attacks,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 11 957–11 965.
[12] H. Bao, L. Dong, and F. Wei, “Beit: Bert pre-training of image transformers,” arXiv preprint arXiv:2106.08254, 2021.
[13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[14] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 000–16 009.
[15] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” arXiv preprint arXiv:1610.02136, 2016.
[16] D. Hendrycks, S. Basart, M. Mazeika, A. Zou, J. Kwon, M. Mostajabi, J. Steinhardt, and D. Song, “Scaling out-of-distribution detection for real-world settings,” arXiv preprint arXiv:1911.11132, 2019.
[17] W. Liu, X. Wang, J. Owens, and Y. Li, “Energy-based out-of-distribution detection,” Advances in neural information processing systems, vol. 33, pp. 21 464–21 475, 2020.
[18] Y. Sun, C. Guo, and Y. Li, “React: Out-of-distribution detection with rectified activations,” Advances in Neural Information Processing Systems, vol. 34, pp. 144–157, 2021.
[19] K. Lee, K. Lee, H. Lee, and J. Shin, “A simple unified framework for detecting out-of-distribution samples and adversarial attacks,” Advances in neural information processing systems, vol. 31, 2018.
[20] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[21] X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 9640–9649.
[22] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without supervision,” 2023.
[23] Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, “Beit v2: Masked image modeling with vector-quantized visual tokenizers,” 2022.
[24] I. Ndiour, N. Ahuja, and O. Tickoo, “Out-of-distribution detection with subspace techniques and probabilistic modeling of features,” arXiv preprint arXiv:2012.04250, 2020.
[25] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738.
[26] Y. Li, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
[27] T. DeVries and G. W. Taylor, “Learning confidence for out-of-distribution detection in neural networks,” arXiv preprint arXiv:1802.04865, 2018.
[28] A. Zaeemzadeh, N. Bisagno, Z. Sambugaro, N. Conci, N. Rahnavard, and M. Shah, “Out-of-distribution detection using union of 1-dimensional subspaces,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9452–9461.
[29] R. Huang and Y. Li, “MOS: Towards scaling out-of-distribution detection for large semantic space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8710–8719.
[30] J. Li, P. Chen, Z. He, S. Yu, S. Liu, and J. Jia, “Rethinking out-of-distribution (ood) detection: Masked image modeling is all you need,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 578–11 589.
[31] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy, “Openimages: A public dataset for large-scale multi-label and multi-class image classification.” Dataset available from https://github.com/openimages, 2017.
[32] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3606–3613.
[33] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie, “The inaturalist species classification and detection dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8769–8778.
[34] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song, “Natural adversarial examples,” CVPR, 2021.
[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, “Imagenet large scale visual recognition challenge,” Int. J. Comput. Vis., 2015.
[36] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3606–3613.
[37] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie, “The iNaturalist species classification and detection dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8769–8778.
[38] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song, “Natural adversarial examples,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 262–15 271.
[39] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
[40] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1452–1464, 2017.
[41] S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-distribution image detection in neural networks,” arXiv preprint arXiv:1706.02690, 2017.

This supplementary material includes visualization of distribution curves, multi-class and one-class OOD detection results on CIFAR-10, etc., which are not included in the main paper due to page limitations.

Appendix A Distribution curves

For more comprehensive insights, we offer visual representations of distribution curves for OOD scores on both ID and OOD datasets in Fig. 7. A narrower overlap between ID and OOD data signifies superior OOD detection performance, whereas a wider overlap indicates weaker detection results. The ID curve, depicted in red, for the fine-tuned BEiT series [12] models, exhibits a distinctive peak at a higher position. This leads to minimal overlap with other OOD data, highlighting a remarkable OOD detection capability. This accomplishment can be attributed to the high-quality ID feature representation derived from masked image modeling.

Appendix B Details of Results on CIFAR-10

B.1 Multi-class OOD Detection

We employ CIFAR-10 [39] as the in-distribution dataset and evaluate pretext tasks on multiple challenging out-of-distribution datasets, including OpenImage-O [31], Texture [32], iNaturalist [33], and ImageNet-O [34]. We perform extensive validation with various pretraining methods and OOD score functions, including MSP [15], Energy [17], ODIN [41], MaxLogit [16], KL Matching [16], Residual [7], ReAct [18], Mahalanobis [19] and ViM [7]. Results are in Tab. V. Our approach attains an AUROC of 99.99% while reducing the FPR95 to 0.03%.
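
As an example of a feature-based score function from the list above, the sketch below follows the usual formulation of the Mahalanobis score [19]: class-conditional Gaussians with a shared covariance, scored by the negative distance to the closest class mean. It is a minimal NumPy illustration under that assumption, and the function names are hypothetical.

```python
import numpy as np

def fit_mahalanobis(feats, labels):
    """Estimate per-class means and a shared precision matrix from ID training features."""
    feats = np.asarray(feats, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    means = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    # Centre every feature with the mean of its own class, then pool the covariance.
    centered = feats - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(feats)
    precision = np.linalg.pinv(cov)
    return means, precision

def mahalanobis_score(feats, means, precision):
    """Negative squared Mahalanobis distance to the nearest class mean."""
    diff = np.asarray(feats, dtype=float)[:, None, :] - means[None, :, :]
    dist = np.einsum("ncd,de,nce->nc", diff, precision, diff)
    return -dist.min(axis=1)
```

Because the score depends only on features and class statistics, its quality is tied directly to the pretrained representation, which is the property the comparison in Tab. V examines.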

B.2 One-class OOD Detection

We perform one-class OOD detection. In the context of a multi-class dataset with $N_{c}$ classes, we conduct $N_{c}$ one-class OOD tasks. Each task treats one of the classes as in-distribution and the remaining classes as out-of-distribution. Our experiments are conducted on CIFAR-10 [39], and we provide the detailed class-wise performance of mainstream methods including KL-Matching (Tab. VI), Residual (Tab. VII), Mahalanobis (Tab. VIII) and ViM (Tab. IX).

It is important to note that all methods were pre-trained on ImageNet-21k and subsequently fine-tuned on ImageNet-1k, which might have influenced the results to varying degrees. However, we use consistent training strategies for all methods to maintain a fair comparison. The experimental results demonstrate that MOODv2 achieves significant improvements across all ID classes, even without fine-tuning on the ID dataset. Notably, we achieve a 3.56% increase in the state-of-the-art AUROC, reaching 98.20%, while reducing FPR95 by 15.14%, from 24.63% (ViM [7]) to 9.49%.

Methods	Models	Texture		iNaturalist		ImageNet-O		OpenImage-O		Average
Methods	Models	AUROC $\uparrow$	FPR95 $\downarrow$	AUROC $\uparrow$	FPR95 $\downarrow$	AUROC $\uparrow$	FPR95 $\downarrow$	AUROC $\uparrow$	FPR95 $\downarrow$	AUROC $\uparrow$	FPR95 $\downarrow$
MSP[15]	ViT[20]	45.67	95.17	71.07	81.76	32.52	98.85	59.74	91.45	52.25	91.81
	MoCov3[21]	37.11	97.64	64.60	89.69	31.52	98.45	55.90	93.53	47.28	94.83
	DINOv2[22]	70.37	58.33	87.10	37.77	70.61	63.25	78.27	51.58	76.59	52.73
	BEiTv2[23]	57.67	88.31	82.53	55.54	52.06	89.55	72.72	70.71	66.24	76.03
	BEiT[12]	51.64	91.09	73.85	74.02	47.82	90.75	65.40	80.81	59.68	84.17
Energy[17]	ViT[20]	31.16	97.89	48.95	97.92	37.22	97.85	45.29	96.36	40.65	97.50
	MoCov3[21]	24.97	98.93	44.74	98.29	38.06	95.45	43.90	95.49	37.92	97.04
	DINOv2[22]	86.73	28.16	91.43	20.27	68.97	62.75	73.66	53.40	80.20	41.15
	BEiTv2[23]	63.35	82.64	88.52	38.21	66.24	72.30	81.69	51.80	74.95	61.24
	BEiT[12]	52.98	88.53	81.84	54.55	59.99	77.20	73.61	65.08	67.10	71.34
MaxLogit[16]	ViT[20]	41.21	95.95	67.83	86.04	32.58	98.80	56.64	92.94	49.56	93.43
	MoCov3[21]	32.94	98.22	61.79	92.21	31.65	98.45	53.32	94.26	44.92	95.78
	DINOv2[22]	76.80	45.06	91.96	22.66	72.49	57.95	80.36	44.75	80.40	42.61
	BEiTv2[23]	60.51	85.85	86.05	47.39	59.14	83.90	77.47	62.79	70.79	69.98
	BEiT[12]	51.94	89.90	77.92	66.72	52.95	87.10	69.16	75.24	62.99	79.74
KL-Matching[16]	ViT[20]	98.00	10.64	94.23	35.86	92.99	32.40	94.68	27.92	94.97	26.71
	MoCov3[21]	97.61	13.97	94.65	35.51	92.05	38.25	94.22	33.48	94.64	30.30
	DINOv2[22]	98.05	8.74	98.95	5.32	97.29	12.35	96.99	13.55	97.82	9.99
	BEiTv2[23]	97.41	14.98	91.78	50.90	93.21	35.90	92.28	43.17	93.67	36.24
	BEiT[12]	97.83	12.71	95.14	32.28	93.52	35.00	94.84	31.13	95.33	27.78
Residual[7]	ViT[20]	99.91	0.21	99.68	0.45	99.36	2.85	99.42	2.46	99.59	1.49
	MoCov3[21]	99.90	0.25	99.87	0.09	99.22	3.85	99.59	1.31	99.65	1.38
	DINOv2[22]	99.98	0.04	100.00	0.01	99.99	0.05	99.97	0.18	99.98	0.07
	BEiTv2[23]	99.98	0.04	100.00	0.00	99.79	0.90	99.92	0.27	99.92	0.30
	BEiT[12]	99.99	0.02	100.00	0.00	99.96	0.10	99.99	0.01	99.99	0.03
ReAct[18]	ViT[20]	35.97	96.26	69.01	87.91	36.65	97.75	54.14	93.11	48.94	93.76
	MoCov3[21]	25.74	98.90	46.11	98.29	37.60	95.55	44.63	95.46	38.52	97.05
	DINOv2[22]	68.00	61.94	60.58	82.56	40.71	90.10	47.60	88.88	54.22	80.87
	BEiTv2[23]	62.81	82.71	91.59	28.05	65.37	72.95	82.43	49.49	75.55	58.30
	BEiT[12]	53.27	87.91	82.09	53.87	59.64	77.50	73.73	64.84	67.18	71.03
Mahalanobis[19]	ViT[20]	99.77	0.60	99.39	1.11	98.93	4.90	99.14	3.26	99.31	2.47
	MoCov3[21]	99.78	0.78	99.71	0.45	98.61	7.65	99.31	2.48	99.35	2.84
	DINOv2[22]	99.98	0.06	100.00	0.00	99.99	0.00	99.97	0.16	99.99	0.05
	BEiTv2[23]	99.95	0.06	99.99	0.02	99.61	1.90	99.82	0.77	99.84	0.69
	BEiT[12]	99.99	0.00	100.00	0.00	99.96	0.05	99.98	0.05	99.98	0.03
ViM[7]	ViT[20]	99.91	0.23	99.72	0.38	99.38	2.65	99.49	2.31	99.63	1.39
	MoCov3[21]	99.93	0.16	99.92	0.03	99.40	2.75	99.69	1.03	99.73	0.99
	DINOv2[22]	99.98	0.04	100.00	0.01	99.99	0.05	99.97	0.18	99.98	0.07
	BEiTv2[23]	99.95	0.14	100.00	0.01	99.61	1.60	99.93	0.28	99.87	0.51
	BEiT[12]	99.98	0.06	100.00	0.00	99.94	0.20	99.99	0.01	99.98	0.07
Best	ViT[20]	99.91	0.21	99.72	0.38	99.38	2.65	99.49	2.31	99.63	1.39
	MoCov3[21]	99.93	0.16	99.92	0.03	99.40	2.75	99.69	1.03	99.73	0.99
	DINOv2[22]	99.98	0.04	100.00	0.00	99.99	0.00	99.97	0.16	99.99	0.05
	BEiTv2[23]	99.98	0.04	100.00	0.00	99.79	0.90	99.93	0.27	99.92	0.30
	BEiT[12]	99.99	0.00	100.00	0.00	99.96	0.05	99.99	0.01	99.99	0.03

TABLE V: AUROC and FPR95 of OOD detection methods. The ID dataset is CIFAR-10 [39], and the OOD datasets are OpenImage-O [31], Texture [32], iNaturalist [33], and ImageNet-O [34]. The pretext tasks include the classical classification task [20], the contrastive learning tasks MoCov3 [21] and DINOv2 [22], and the masked image modeling tasks BEiT [12] and BEiTv2 [23]. All pretext tasks are performed on ImageNet-21k. Both metrics, AUROC and FPR95, are in percentage. A pre-trained ViT-B/16 model with $224\times 224$-pixel inputs is tested. The best method is emphasized in bold and a gray background indicates our choice.

Models	ID class	OOD class										Average
Models	ID class	0	1	2	3	4	5	6	7	8	9	Average
ViT[20]	0	-	81.29	90.44	90.62	86.42	95.45	93.28	91.15	67.66	66.06	84.71
	1	98.54	-	99.58	99.33	99.57	99.69	99.82	96.16	96.00	85.23	97.10
	2	90.41	97.50	-	87.18	76.75	95.02	90.45	81.47	96.76	98.16	90.41
	3	94.14	94.49	91.32	-	77.12	78.07	88.59	82.25	98.25	85.24	87.72
	4	96.15	98.55	92.46	89.63	-	95.80	94.34	66.06	98.45	96.45	91.99
	5	98.38	98.19	96.45	78.99	86.87	-	97.45	68.19	99.48	95.96	91.11
	6	96.68	97.44	92.05	88.20	83.14	95.27	-	96.14	95.03	97.92	93.54
	7	96.32	97.72	97.36	92.15	85.78	94.45	98.03	-	94.20	98.45	94.94
	8	89.88	71.86	97.40	95.86	98.82	98.45	93.19	98.04	-	80.89	91.60
	9	97.62	91.34	99.60	99.38	98.48	99.75	99.78	99.23	96.62	-	97.98
MoCov3[21]	0	-	87.10	87.79	90.33	89.95	94.07	91.11	72.18	69.31	69.45	83.48
	1	93.05	-	98.12	97.41	98.20	98.26	99.03	94.47	93.38	75.58	94.17
	2	82.96	95.82	-	83.13	78.83	93.45	82.30	73.48	94.46	96.90	86.81
	3	91.08	93.10	90.56	-	74.01	74.04	84.32	79.70	93.94	93.13	85.99
	4	93.82	96.02	86.55	88.85	-	92.48	92.56	66.43	96.26	96.97	89.99
	5	94.32	95.45	91.58	74.17	88.34	-	94.31	69.63	97.36	98.02	89.24
	6	95.09	96.76	88.10	87.53	80.44	90.97	-	94.57	92.31	97.94	91.52
	7	91.62	96.60	92.38	91.56	80.46	92.20	96.91	-	92.62	95.95	92.25
	8	80.42	88.78	96.30	93.92	96.62	95.98	95.00	95.97	-	79.94	91.44
	9	92.33	79.04	97.95	97.19	98.14	98.28	98.73	97.74	88.23	-	94.18
DINOv2[22]	0	-	60.94	65.14	64.79	71.95	62.40	76.86	61.40	39.99	51.08	61.62
	1	73.33	-	59.97	48.33	56.38	44.85	57.27	41.11	60.09	45.81	54.13
	2	69.07	55.66	-	49.75	44.54	46.53	51.42	43.47	57.03	53.23	52.30
	3	79.75	63.89	61.86	-	56.82	47.72	59.42	47.58	70.69	61.79	61.06
	4	81.90	68.04	59.55	59.30	-	57.66	54.34	53.14	73.53	68.04	63.95
	5	81.63	65.90	63.37	52.89	58.20	-	61.38	48.90	73.32	64.01	63.29
	6	86.28	71.81	62.40	57.83	51.69	56.97	-	54.87	81.90	74.10	66.43
	7	84.48	68.52	64.60	58.30	57.49	54.55	58.26	-	77.39	66.77	65.60
	8	64.17	71.12	75.01	73.31	79.28	71.02	84.92	71.20	-	61.16	72.35
	9	75.23	59.90	68.79	58.47	68.79	54.25	72.32	50.25	62.55	-	63.39
BEiTv2[23]	0	-	94.83	97.42	99.31	97.01	99.79	99.34	99.44	89.14	91.32	96.40
	1	99.01	-	99.91	99.83	99.72	99.90	99.98	99.74	92.98	94.11	98.35
	2	85.99	99.23	-	98.03	81.97	99.25	93.81	93.23	96.98	99.43	94.21
	3	98.39	98.20	97.60	-	90.40	85.67	74.37	95.00	99.49	98.30	93.05
	4	99.57	99.62	98.82	88.31	-	99.31	99.03	77.26	99.32	99.88	95.68
	5	99.57	99.31	99.40	80.54	97.51	-	99.14	53.39	99.72	98.76	91.93
	6	99.46	99.75	97.00	96.55	97.07	99.04	-	99.35	99.71	99.91	98.65
	7	99.12	98.87	99.76	98.94	92.57	99.02	99.92	-	98.73	95.91	98.09
	8	88.49	93.17	99.83	99.74	99.82	99.94	99.94	99.86	-	96.15	97.44
	9	99.19	96.36	99.96	99.75	99.75	99.97	99.97	99.80	97.88	-	99.18
BEiT[12]	0	-	95.65	95.16	98.75	94.10	99.60	98.60	98.70	83.00	85.22	94.31
	1	97.46	-	99.85	99.63	99.47	99.91	99.96	99.55	95.42	90.32	97.95
	2	96.98	98.44	-	89.21	77.98	98.19	97.53	72.68	97.65	99.17	91.98
	3	97.62	95.80	96.95	-	80.24	86.56	72.85	86.98	99.38	94.34	90.08
	4	99.40	98.45	98.57	98.12	-	98.92	96.00	70.81	98.42	93.32	94.67
	5	99.15	99.03	98.66	82.32	82.56	-	98.78	47.63	99.81	99.62	89.73
	6	99.32	99.56	96.90	96.87	93.96	99.06	-	99.56	99.78	99.89	98.32
	7	98.76	98.51	99.55	99.00	93.66	99.01	99.84	-	98.99	96.76	98.23
	8	92.28	92.62	99.44	99.40	99.72	99.80	99.81	99.44	-	92.60	97.24
	9	99.07	93.71	99.93	99.77	99.80	99.97	99.98	99.62	97.69	-	98.84

TABLE VI: AUROC (%) of one-class OOD Detection on CIFAR-10 [39] using KL-Matching [16]. Pretrained models include classification task [20], MoCov3 [21], DINOv2 [22], BEiTv2 [23] and BEiT [12].

Models	ID class	OOD class										Average
Models	ID class	0	1	2	3	4	5	6	7	8	9	Average
ViT[20]	0	-	88.70	96.73	99.02	96.98	99.82	98.77	98.21	66.32	79.11	91.52
	1	98.03	-	99.97	99.93	99.92	99.99	99.98	99.87	95.48	69.60	95.86
	2	96.51	99.92	-	94.67	75.60	98.06	89.95	90.72	99.08	99.83	93.82
	3	98.54	99.70	94.27	-	75.83	66.18	89.15	81.41	99.56	98.99	89.29
	4	99.25	99.85	93.29	95.01	-	95.08	95.51	70.54	99.40	99.54	94.16
	5	99.74	99.97	98.38	87.74	89.16	-	98.40	85.77	99.92	99.85	95.44
	6	99.44	99.96	93.81	95.33	88.84	98.00	-	97.18	99.79	99.90	96.92
	7	98.85	99.64	97.53	95.31	76.90	91.92	99.25	-	98.41	98.70	95.17
	8	90.11	89.74	99.59	99.72	99.56	99.96	99.60	99.65	-	85.70	95.96
	9	98.11	85.44	99.96	99.93	99.80	99.99	99.98	99.75	96.08	-	97.67
MoCov3[21]	0	-	96.11	97.08	99.13	98.40	99.75	98.56	99.00	73.05	87.16	94.25
	1	96.02	-	99.79	99.76	99.76	99.90	99.82	99.71	92.94	68.77	95.16
	2	93.25	99.90	-	94.04	80.49	97.94	88.20	91.06	98.09	99.60	93.62
	3	97.15	99.60	93.31	-	79.89	71.78	87.59	86.66	98.02	98.72	90.30
	4	98.28	99.91	92.84	93.95	-	96.47	94.59	73.27	98.35	99.53	94.13
	5	99.12	99.90	97.62	84.00	89.42	-	98.45	89.13	99.38	99.45	95.16
	6	99.38	99.97	95.00	95.30	93.13	98.72	-	98.83	99.66	99.90	97.77
	7	97.23	99.70	96.70	95.15	76.84	93.22	98.94	-	97.06	98.53	94.82
	8	87.65	95.39	99.26	99.45	99.47	99.76	99.27	99.57	-	89.66	96.61
	9	96.24	86.59	99.84	99.83	99.84	99.91	99.88	99.77	92.68	-	97.18
DINOv2[22]	0	-	72.83	64.15	68.05	65.47	68.53	68.80	70.10	54.59	69.72	66.92
	1	66.18	-	66.00	56.84	62.76	58.68	60.05	57.11	54.97	52.30	59.43
	2	66.58	77.06	-	57.10	45.49	56.78	49.57	62.65	67.99	76.34	62.17
	3	74.83	77.23	59.35	-	52.76	50.60	51.10	63.03	71.63	76.36	64.10
	4	82.00	86.75	63.82	67.27	-	66.91	57.16	71.45	80.04	85.45	73.43
	5	79.01	80.38	61.21	54.33	54.49	-	54.53	64.44	75.72	80.20	67.15
	6	85.87	86.34	66.60	66.63	56.40	66.99	-	75.17	84.70	87.00	75.08
	7	76.76	76.51	59.87	56.22	49.64	53.88	52.70	-	73.05	72.37	63.44
	8	62.96	74.15	73.59	72.77	74.25	72.95	77.20	76.44	-	71.13	72.83
	9	68.96	59.80	70.76	61.83	68.02	62.95	66.87	58.87	57.35	-	63.94
BEiTv2[23]	0	-	98.30	99.41	99.91	99.60	99.98	99.86	99.68	91.03	93.81	97.95
	1	99.66	-	100.00	99.99	99.99	99.99	100.00	99.99	98.49	78.72	97.43
	2	97.60	99.96	-	99.17	93.13	99.48	97.72	98.58	99.58	99.92	98.35
	3	99.74	99.87	98.56	-	94.77	80.80	94.73	97.37	99.89	99.90	96.18
	4	99.83	99.90	98.68	99.06	-	99.14	99.16	87.86	99.83	99.88	98.15
	5	99.94	99.97	99.62	91.48	98.24	-	99.70	96.71	99.95	99.81	98.38
	6	99.89	99.99	99.39	99.25	99.14	99.68	-	99.83	99.94	99.98	99.68
	7	99.67	99.82	99.76	99.75	97.38	99.45	99.98	-	99.67	99.66	99.46
	8	97.13	98.60	99.92	99.98	99.95	99.97	99.98	99.97	-	95.87	99.04
	9	99.36	95.53	100.00	99.99	100.00	100.00	100.00	99.99	98.54	-	99.27
BEiT[12]	0	-	98.74	99.39	99.91	99.59	99.96	99.83	99.77	91.23	93.39	97.98
	1	99.32	-	100.00	99.98	99.98	100.00	100.00	99.97	99.02	88.68	98.55
	2	99.25	99.97	-	98.32	90.25	99.21	97.47	98.08	99.79	99.88	98.02
	3	99.78	99.85	99.09	-	94.53	83.32	94.51	97.40	99.94	99.60	96.45
	4	99.88	99.90	98.82	98.63	-	98.89	98.95	93.34	99.88	99.78	98.67
	5	99.98	99.88	99.77	90.61	97.43	-	99.60	93.86	99.99	99.83	97.88
	6	99.95	100.00	99.28	99.13	98.97	99.67	-	99.97	99.99	99.99	99.66
	7	99.55	99.45	99.76	99.41	97.82	99.04	99.94	-	99.49	99.13	99.29
	8	95.97	97.99	99.95	99.95	99.90	99.98	99.98	99.96	-	96.03	98.86
	9	99.38	94.73	100.00	99.98	99.98	100.00	100.00	99.85	98.85	-	99.20

TABLE VII: AUROC (%) of one-class OOD Detection on CIFAR-10 [39] using Residual [7]. Pretrained models include classification task [20], MoCov3 [21], DINOv2 [22], BEiTv2 [23] and BEiT [12].

Models	ID class	OOD class										Average
Models	ID class	0	1	2	3	4	5	6	7	8	9	Average
ViT[20]	0	-	89.17	95.53	98.62	95.81	99.70	97.89	97.15	65.74	78.81	90.94
	1	98.20	-	99.94	99.91	99.86	99.98	99.96	99.79	96.03	71.41	96.12
	2	96.19	99.84	-	94.76	77.37	98.03	89.80	90.19	98.81	99.68	93.85
	3	98.08	99.40	93.07	-	75.92	68.55	87.75	81.75	99.21	98.35	89.12
	4	99.19	99.81	93.59	95.16	-	96.03	95.52	70.51	99.35	99.54	94.30
	5	99.65	99.91	98.30	88.92	91.23	-	98.31	87.68	99.86	99.72	95.95
	6	99.30	99.93	93.91	95.23	89.50	98.04	-	96.95	99.64	99.82	96.93
	7	98.62	99.50	97.22	95.53	77.46	92.86	99.03	-	98.20	98.59	95.22
	8	90.26	90.45	99.48	99.67	99.47	99.93	99.48	99.57	-	87.26	96.18
	9	98.16	86.56	99.93	99.91	99.76	99.99	99.97	99.70	96.39	-	97.82
MoCov3[21]	0	-	95.82	95.69	98.75	97.49	99.58	97.83	98.04	71.17	85.75	93.35
	1	95.59	-	99.66	99.69	99.67	99.85	99.74	99.50	92.72	68.79	95.02
	2	92.17	99.77	-	93.66	80.97	97.58	87.61	90.67	97.16	99.21	93.20
	3	95.87	99.33	91.17	-	78.08	71.77	86.00	84.32	96.86	98.04	89.05
	4	97.97	99.79	92.35	94.00	-	96.40	94.38	73.76	98.11	99.26	94.00
	5	98.66	99.78	97.00	84.31	90.24	-	97.96	89.13	98.93	99.14	95.02
	6	99.13	99.94	94.33	95.19	92.56	98.56	-	98.38	99.45	99.82	97.48
	7	96.61	99.55	96.01	94.83	76.00	92.97	98.57	-	96.47	98.15	94.35
	8	87.08	95.47	98.86	99.27	99.20	99.62	98.97	99.20	-	89.75	96.38
	9	95.98	86.94	99.74	99.77	99.76	99.87	99.82	99.63	92.63	-	97.13
DINOv2[22]	0	-	73.26	64.39	68.07	65.83	68.69	69.50	70.14	54.72	70.07	67.19
	1	66.64	-	66.59	57.46	63.19	58.81	60.71	57.10	55.24	52.26	59.78
	2	67.18	77.06	-	57.44	45.86	57.11	49.82	63.01	68.56	76.62	62.52
	3	75.38	77.23	59.51	-	52.41	50.47	50.53	63.05	72.42	76.62	64.18
	4	82.15	86.71	63.94	67.52	-	67.12	57.42	71.56	80.55	86.07	73.67
	5	77.37	79.22	60.55	53.76	54.45	-	54.10	63.90	75.10	78.91	66.37
	6	86.06	86.59	66.51	66.79	56.39	67.45	-	75.36	84.95	87.54	75.29
	7	77.61	76.62	59.83	56.58	49.58	54.03	53.04	-	74.18	72.88	63.82
	8	63.06	74.53	74.59	73.16	75.12	73.67	78.29	76.87	-	71.34	73.40
	9	69.38	60.16	70.57	62.54	67.91	63.32	67.09	58.82	58.01	-	64.20
BEiTv2[23]	0	-	98.36	99.38	99.94	99.70	99.99	99.90	99.70	91.92	94.07	98.11
	1	99.69	-	100.00	99.99	99.99	100.00	100.00	99.99	98.96	80.71	97.70
	2	97.78	99.94	-	99.37	94.03	99.54	98.04	98.32	99.54	99.88	98.49
	3	99.72	99.89	98.56	-	94.80	81.26	95.70	96.85	99.85	99.87	96.28
	4	99.79	99.92	98.69	99.20	-	99.23	99.34	88.58	99.84	99.88	98.27
	5	99.94	99.98	99.57	92.56	98.39	-	99.73	97.18	99.92	99.85	98.57
	6	99.87	100.00	99.44	99.37	99.25	99.71	-	99.80	99.92	99.96	99.70
	7	99.68	99.80	99.74	99.77	97.64	99.50	99.96	-	99.69	99.62	99.49
	8	97.51	98.74	99.90	99.97	99.92	99.95	99.95	99.91	-	96.76	99.18
	9	99.41	95.56	100.00	99.99	99.99	100.00	100.00	99.98	98.60	-	99.28
BEiT[12]	0	-	98.65	99.41	99.91	99.59	99.96	99.86	99.72	91.50	94.10	98.08
	1	99.45	-	100.00	99.98	99.98	100.00	100.00	99.96	99.14	89.24	98.64
	2	99.25	99.95	-	98.42	90.61	99.25	97.69	98.32	99.78	99.87	98.13
	3	99.77	99.80	99.09	-	95.07	84.11	94.70	97.80	99.94	99.63	96.66
	4	99.89	99.89	98.88	98.85	-	99.06	99.09	93.79	99.90	99.77	98.79
	5	99.98	99.83	99.75	91.54	97.74	-	99.61	95.89	99.99	99.79	98.23
	6	99.95	100.00	99.33	99.19	99.06	99.67	-	99.96	99.98	99.98	99.68
	7	99.61	99.48	99.78	99.46	97.85	99.14	99.94	-	99.53	99.18	99.33
	8	96.51	98.01	99.95	99.96	99.92	99.98	99.98	99.95	-	96.05	98.92
	9	99.47	94.84	100.00	99.99	99.98	100.00	100.00	99.90	98.97	-	99.24

TABLE VIII: AUROC (%) of one-class OOD Detection on CIFAR-10 [39] using Mahalanobis [19]. Pretrained models include classification task [20], MoCov3 [21], DINOv2 [22], BEiTv2 [23] and BEiT [12].

Models	ID class	OOD class										Average
Models	ID class	0	1	2	3	4	5	6	7	8	9	Average
ViT[20]	0	-	90.64	97.03	99.20	97.19	99.86	98.92	98.45	67.16	80.51	92.11
	1	97.97	-	99.98	99.93	99.93	99.99	99.99	99.89	95.22	67.49	95.60
	2	96.31	99.93	-	94.88	74.69	98.10	89.81	90.86	99.10	99.85	93.72
	3	98.61	99.76	94.39	-	74.04	64.74	88.69	81.14	99.59	99.07	88.89
	4	99.25	99.86	93.62	95.27	-	95.28	95.60	71.21	99.41	99.54	94.34
	5	99.75	99.98	98.47	88.47	89.05	-	98.42	86.11	99.93	99.87	95.56
	6	99.46	99.97	94.05	95.60	88.80	98.08	-	97.32	99.80	99.91	97.00
	7	98.90	99.69	97.61	95.49	75.97	91.93	99.27	-	98.47	98.79	95.12
	8	90.10	90.70	99.63	99.77	99.59	99.97	99.63	99.68	-	86.14	96.14
	9	98.12	86.67	99.96	99.93	99.79	99.99	99.99	99.75	96.03	-	97.80
MoCov3[21]	0	-	97.10	97.66	99.36	98.74	99.83	98.88	99.27	74.06	88.58	94.83
	1	96.25	-	99.82	99.79	99.79	99.91	99.84	99.74	93.06	67.99	95.13
	2	93.80	99.93	-	94.63	80.47	98.24	88.49	91.63	98.37	99.69	93.92
	3	97.60	99.68	94.10	-	79.97	72.24	87.98	87.63	98.32	98.96	90.72
	4	98.49	99.93	93.56	94.53	-	96.84	94.90	74.27	98.52	99.61	94.52
	5	99.29	99.93	97.94	85.08	89.84	-	98.61	89.87	99.49	99.57	95.51
	6	99.46	99.97	95.39	95.71	93.39	98.87	-	98.97	99.69	99.90	97.93
	7	97.57	99.77	97.08	95.65	76.65	93.74	99.06	-	97.39	98.74	95.07
	8	88.58	96.14	99.39	99.55	99.55	99.81	99.39	99.64	-	90.57	96.96
	9	96.55	87.81	99.87	99.86	99.86	99.93	99.90	99.79	92.96	-	97.39
DINOv2[22]	0	-	72.84	64.24	68.20	65.56	68.69	68.90	70.20	54.60	69.76	67.00
	1	66.30	-	66.15	57.07	62.92	58.93	60.22	57.30	55.07	52.37	59.59
	2	66.61	77.05	-	57.20	45.49	56.89	49.57	62.70	68.01	76.37	62.21
	3	74.84	77.16	59.30	-	52.69	50.62	51.02	63.03	71.61	76.33	64.06
	4	82.04	86.74	63.85	67.37	-	67.00	57.18	71.49	80.07	85.48	73.47
	5	79.00	80.30	61.17	54.32	54.41	-	54.44	64.39	75.69	80.16	67.10
	6	85.90	86.33	66.63	66.71	56.41	67.08	-	75.22	84.73	87.03	75.12
	7	76.80	76.47	59.90	56.30	49.63	53.98	52.69	-	73.07	72.36	63.47
	8	63.00	74.18	73.68	72.92	74.35	73.11	77.30	76.54	-	71.19	72.92
	9	69.06	59.79	70.88	62.04	68.16	63.16	67.02	59.00	57.41	-	64.06
BEiTv2[23]	0	-	97.86	99.24	99.90	99.48	99.97	99.73	99.61	87.59	89.06	96.94
	1	99.49	-	99.99	99.99	99.98	99.98	100.00	99.99	97.62	66.23	95.92
	2	95.66	99.95	-	98.84	90.50	98.98	95.46	97.71	99.07	99.77	97.33
	3	99.44	99.79	98.00	-	93.19	72.65	90.51	96.39	99.65	99.32	94.33
	4	99.72	99.87	98.37	98.87	-	98.76	98.78	84.32	99.72	99.75	97.57
	5	99.86	99.95	99.47	91.21	97.83	-	99.52	95.97	99.88	99.49	98.13
	6	99.85	99.99	99.26	99.14	98.93	99.60	-	99.81	99.91	99.92	99.60
	7	99.55	99.78	99.68	99.72	96.47	99.17	99.96	-	99.45	99.34	99.24
	8	96.02	98.31	99.90	99.97	99.94	99.96	99.96	99.97	-	93.04	98.56
	9	99.17	95.57	100.00	99.99	99.99	99.99	100.00	99.99	98.21	-	99.21
BEiT[12]	0	-	98.76	99.35	99.90	99.48	99.95	99.76	99.78	88.74	91.93	97.52
	1	99.15	-	100.00	99.98	99.97	100.00	100.00	99.97	98.78	84.81	98.07
	2	99.12	99.98	-	97.89	88.04	98.88	96.49	97.93	99.73	99.85	97.55
	3	99.77	99.83	98.95	-	93.35	78.82	92.42	97.56	99.94	99.38	95.56
	4	99.87	99.90	98.65	98.48	-	98.66	98.75	93.22	99.85	99.73	98.57
	5	99.97	99.88	99.73	90.68	97.10	-	99.49	94.34	99.99	99.76	97.88
	6	99.96	100.00	99.27	99.08	98.83	99.64	-	99.98	99.99	99.99	99.64
	7	99.53	99.46	99.75	99.39	97.50	98.89	99.92	-	99.46	99.03	99.22
	8	95.53	98.13	99.94	99.95	99.87	99.97	99.98	99.96	-	95.62	98.77
	9	99.32	95.30	100.00	99.98	99.97	100.00	100.00	99.86	98.67	-	99.23

TABLE IX: AUROC (%) of one-class OOD Detection on CIFAR-10 [39] using ViM [7]. Pretrained models include classification task [20], MoCov3 [21], DINOv2 [22], BEiTv2 [23] and BEiT [12].