License: CC BY 4.0
arXiv:2404.06065v1 [cs.CV] 09 Apr 2024

Unified Entropy Optimization for Open-Set Test-Time Adaptation

Zhengqing Gao 1,2, Xu-Yao Zhang 1,2 (corresponding author), Cheng-Lin Liu 1,2
1 MAIS, Institute of Automation, Chinese Academy of Sciences
2 School of Artificial Intelligence, University of Chinese Academy of Sciences
[email protected] {xyz, liucl}@nlpr.ia.ac.cn
Abstract

Test-time adaptation (TTA) aims at adapting a model pre-trained on the labeled source domain to the unlabeled target domain. Existing methods usually focus on improving TTA performance under covariate shifts, while neglecting semantic shifts. In this paper, we delve into a realistic open-set TTA setting where the target domain may contain samples from unknown classes. Many state-of-the-art closed-set TTA methods perform poorly when applied to open-set scenarios, which can be attributed to the inaccurate estimation of data distribution and model confidence. To address these issues, we propose a simple but effective framework called unified entropy optimization (UniEnt), which is capable of simultaneously adapting to covariate-shifted in-distribution (csID) data and detecting covariate-shifted out-of-distribution (csOOD) data. Specifically, UniEnt first mines pseudo-csID and pseudo-csOOD samples from test data, followed by entropy minimization on the pseudo-csID data and entropy maximization on the pseudo-csOOD data. Furthermore, we introduce UniEnt+ to alleviate the noise caused by hard data partition by leveraging sample-level confidence. Extensive experiments on CIFAR benchmarks and Tiny-ImageNet-C show the superiority of our framework. The code is available at https://github.com/gaozhengqing/UniEnt.

1 Introduction

Figure 1: Existing TTA methods exhibit performance degradation with unknown classes included, while our methods can improve them significantly. We compare BN Adapt [33], CoTTA [46], TENT [44], EATA [35], and OSTTA [27].

Deep neural networks (DNNs) have achieved great success in recent years when the training and test data are drawn i.i.d. from the same distribution. However, in many real-world applications, this strict assumption rarely holds. Models deployed in practice can encounter different types of distribution shifts. On the one hand, the model needs to be able to address semantic shifts, i.e., identify samples from unknown classes, which has given rise to problems such as out-of-distribution (OOD) detection [15, 16, 32, 19, 56] and open-set recognition [4, 5, 43, 21]. On the other hand, the model needs to be robust to covariate shifts and generalize well to different styles and domains. Many efforts have been devoted to reducing the performance gap of DNNs under covariate shifts, such as domain generalization [58, 47, 59, 45] and domain adaptation [11, 50]. Among various studies addressing covariate shifts, test-time adaptation (TTA) has recently received increasing attention because of its practicality: neither source domain data nor target domain labels are required [33, 44, 46, 35, 27, 28].

Nevertheless, most existing TTA methods [33, 44, 46, 35] focus only on the covariate shift and ignore the semantic shift. We believe that this is impractical, since we cannot guarantee that the test samples contain only the classes seen in the training phase. Recent works [27, 28] have recognized this and made some initial attempts. Figure 2 illustrates the differences between the traditional closed-set TTA and the novel open-set TTA settings. First, we need to clarify that in the literature on OOD detection, “out” refers specifically to “outside the semantic space”, whereas in the literature on OOD generalization, “out” refers specifically to “outside the covariate space”. Here we follow the terminology used in [56]. According to the different types of distribution shifts, we divide the real-world data into four types:

  • In-distribution (ID) data is the most common data we typically use to train a model, with a limited number of classes.

  • Out-of-distribution (OOD) data contains some open classes that have not been seen before in ID data, with the same style and domain as ID data.

  • Covariate-shifted ID (csID) data and ID data have the same classes and differ in styles and domains.

  • Covariate-shifted OOD (csOOD) data is different from ID data in both classes and domains.

The open-set TTA setting takes into account both csID data and csOOD data.

Figure 2: Comparison between closed-set TTA and open-set TTA.

Existing TTA methods make extensive use of entropy objectives, which prove to be very effective. We first experimentally verify that existing TTA methods [54, 33, 44, 46, 35, 27] degrade the classification accuracy of known classes when open-set classes are included, which is consistent with the conclusions drawn from recent studies [27, 28]. In addition, as shown in Fig. 1, the detection performance on unknown classes is also impaired, which has not received enough attention in previous studies. We attribute the performance degradation to two causes. First, the presence of open-set samples causes the model to estimate normalization statistics incorrectly, which in turn leads to errors in updating the affine parameters. Second, entropy minimization on samples from unknown classes forces the model to output confident predictions for them, which makes the model’s confidence unreliable and reduces its ability to distinguish between known and unknown classes.

With the aforementioned causes in mind, we propose three techniques to enhance the robustness of existing TTA methods under the open-set setting. We first propose a distribution-aware filter to preliminarily distinguish between csID samples and csOOD samples. Specifically, we observe that the cosine similarity between the features extracted by the source model and the source domain prototypes can reflect the semantic shift, and we use this property to distinguish samples. We then propose a unified entropy optimization framework (UniEnt) to address the aforementioned challenges. UniEnt simultaneously minimizes the entropy of csID samples and maximizes the entropy of csOOD samples. Furthermore, we propose UniEnt+, which uses a sample-level weighting strategy to avoid the errors caused by a noisy data partition.

We summarize the contributions of this paper as follows.

  • We first delve into the performance of existing methods under closed-set TTA and open-set TTA settings. We then summarize two reasons for the performance degradation of existing methods with open-set classes included.

  • We propose a unified entropy optimization framework, which consists of a distribution-aware filter to distinguish csID and csOOD samples, entropy minimization on csID samples to obtain good classification performance on known classes and entropy maximization on csOOD samples to obtain good detection performance on unknown classes.

  • Our proposed framework can be flexibly applied to many existing TTA methods and substantially improves their performance under the open-set setting. Comprehensive experiments demonstrate the effectiveness of our approach.

Figure 3: Illustration of the unified entropy optimization (UniEnt) framework. At timestamp $t$, mini-batch $\mathcal{B}_t$ may contain samples from csID and csOOD. First, we filter csOOD samples by the csOOD score $S(\mathbf{x})$. Then, we perform entropy minimization for csID samples and entropy maximization for csOOD samples; we also adopt marginal entropy maximization to prevent model collapse. After optimization, we obtain a better trade-off between classification and detection performance.

2 Related Work

Test-time adaptation.

Among all the approaches to solving covariate shifts, test-time adaptation has received much attention because of its challenging setting of accessing only the source model and unlabeled target data. Some of the initial work [33, 39, 44, 51, 24, 31] focused on improving TTA performance by estimating batch normalization statistics using test data and designing unsupervised objective functions. For example, TENT [44] proposed to optimize the affine parameters of batch normalization by minimizing the entropy of model outputs. These works mainly focus on static TTA and do not take into account changes in the domain: after adapting to a target domain, the adapted model is reset to the one pre-trained on the source domain before adapting to the next domain. Later, some work [46, 35] proposed the continual TTA setting, where the model needs to adapt to a series of continuously changing target domains without knowing the domain labels. This poses new challenges for TTA: catastrophic forgetting and error accumulation. CoTTA [46] addresses these issues through a teacher-student structure with data augmentation and stochastic recovery, while EATA [35] addresses them through sample selection and an anti-forgetting regularizer.

Robust test-time adaptation.

Recently, several works have paid more attention to the robustness of TTA methods. LAME [2], NOTE [12] and RoTTA [53] focus on the performance of TTA methods under non-i.i.d. correlated sampling of test data. SITA [24] and MEMO [57] explore techniques for performing TTA on a single image. ODS [60] addresses the case of label shift. OSTTA [27] pays attention to the performance degradation caused by long-term TTA. OWTTT [28] and OSTTA [27] consider scenarios where the test data includes unknown classes. SAR [36] comprehensively analyzes the impact of mixed domain shifts, small batch sizes, and online imbalanced label distribution shifts on TTA performance. It is worth noting that the settings proposed by OWTTT [28] and OSTTA [27] differ: the samples of unknown classes in OWTTT [28] are drawn from OOD, while those in OSTTA [27] are drawn from csOOD. We adopt the setting proposed in OSTTA [27] because it is both practical and challenging. First, the unknown class samples we encounter during TTA are likely to experience the same covariate shift. Second, it is more difficult to distinguish between csID samples and csOOD samples than between csID samples and OOD samples.

OOD detection.

For models deployed in real-world scenarios, the ability to detect OOD samples is crucial. Recent studies in OOD detection can be roughly divided into two categories. The first type of approach [15, 30, 20, 32, 19] is devoted to designing sophisticated score functions and input-output transformations. MSP [15] uses the maximum softmax probability to detect OOD samples. ODIN [30] and generalized ODIN [20] further introduce temperature scaling, input preprocessing and confidence decomposition to improve OOD detection performance. The second type of approach instead regularizes the model by exploiting additional outlier data [16, 52, 48, 23, 55]. For example, OE [16] encourages the model to output low-confidence predictions for anomalous data, whereas WOODS [23] utilizes unlabeled wild data to improve the detection performance. SCONE [1] considers both OOD detection and OOD generalization for the first time. It is worth noting that all the methods mentioned above are designed for the training phase. Recently, AUTO [49] proposes to optimize the network using unlabeled test data at test time to improve OOD detection performance.

3 Methodology

3.1 Problem Setup

Let $\mathcal{D}_s=\{\mathbf{x}_i,y_i\}_{i=1}^{N_s}$ be the source domain dataset with label space $\mathcal{Y}_s=\{1,\cdots,C_s\}$, and $\mathcal{D}_t=\{\mathbf{x}_j,y_j\}_{j=1}^{N_t}$ be the target domain dataset with label space $\mathcal{Y}_t=\{1,\cdots,C_t\}$, where $C_s$ and $C_t$ denote the number of classes in the source and target domain datasets, respectively. $C_s$ is equal to $C_t$ for closed-set TTA, while $C_s<C_t$ always holds for open-set TTA. Given a model $f_{\theta_0}$ pre-trained on $\mathcal{D}_s$, TTA aims to adapt the model to $\mathcal{D}_t$ without access to target labels. To be specific, we denote the mini-batch of test samples at timestamp $t$ as $\mathcal{B}_t$ and the adapted model as $f_{\theta_t}$. The main objective of open-set TTA is to correctly predict the classes in $\mathcal{Y}_s$ while rejecting the classes in $\mathcal{Y}_t\setminus\mathcal{Y}_s$ using the adapted model $f_{\theta_t}$, especially in the presence of large data distribution shifts.

3.2 Preliminaries

For closed-set TTA, a common practice [44] is to adapt the model by minimizing the unsupervised entropy objective:

$$\min_{\theta_t}\mathcal{L}_t=\frac{1}{\|\mathcal{B}_t\|}\sum_{\mathbf{x}\in\mathcal{B}_t}H(f_{\theta_t}(\mathbf{x}))-\lambda H(\bar{f}_{\theta_t}), \tag{1}$$

where $H(f_{\theta_t}(\mathbf{x}))=-\sum_{c=1}^{C}f_{\theta_t}^{c}(\mathbf{x})\log f_{\theta_t}^{c}(\mathbf{x})$ denotes the entropy of the softmax output $f_{\theta_t}(\mathbf{x})$, $\bar{f}_{\theta_t}=\frac{1}{\|\mathcal{B}_t\|}\sum_{\mathbf{x}\in\mathcal{B}_t}f_{\theta_t}(\mathbf{x})$ represents the average softmax output over the mini-batch $\mathcal{B}_t$, and $\lambda$ is a hyperparameter used to balance the two terms in the loss function. In previous studies [29, 3, 6, 31], marginal entropy $H(\bar{f}_{\theta_t})$ has been widely adopted to prevent model collapse, i.e., predicting all input samples to the same class.
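To make the role of the two terms in Eq. (1) concrete, the following is a minimal sketch in plain Python that evaluates the objective on a batch of logits. The function names and the default value of `lam` are illustrative assumptions rather than the released implementation, and the sketch only computes the loss value; it does not perform the gradient update of $\theta_t$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy H(p) = -sum_c p_c log p_c (terms with p_c = 0 contribute 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def closed_set_loss(batch_logits, lam=1.0):
    """Value of Eq. (1): mean per-sample entropy minus lam * marginal entropy.

    lam=1.0 is a placeholder default, not a value taken from the paper.
    """
    probs = [softmax(z) for z in batch_logits]
    n = len(probs)
    num_classes = len(probs[0])
    mean_entropy = sum(entropy(p) for p in probs) / n
    # Average softmax output over the mini-batch (f-bar in Eq. (1)).
    mean_probs = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    return mean_entropy - lam * entropy(mean_probs)

diverse = [[10.0, 0.0], [0.0, 10.0]]    # confident, spread over both classes
collapsed = [[10.0, 0.0], [10.0, 0.0]]  # confident, all in one class
print(closed_set_loss(diverse), closed_set_loss(collapsed))
```

Both batches are equally confident per sample, but the collapsed batch has low marginal entropy, so it receives the higher loss. This is the behavior the marginal entropy term is meant to encourage.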

3.3 Motivation

Because no labels of the test data are available to provide supervised information during TTA, existing methods widely adopt entropy minimization or self-training strategies. While previous studies [54, 33, 46, 44, 35, 27] focused on improving the performance of closed-set TTA, we empirically find that they exhibit performance degradation when open-set samples are included. As shown in Fig. 4, we first compare the performance of existing TTA methods under different settings. Specifically, we conduct closed-set experiments on CIFAR-100-C [14], i.e., updating the model and measuring the performance of the adapted model with only the test samples from known classes, and the open-set counterparts are taken from Tab. 1. Experimental results show that applying existing methods to open-set TTA leads to the degradation of both the classification performance on known classes and the detection performance on unknown classes. We argue that the degradation has two causes. First, the introduction of samples from unknown classes leads to the incorrect estimation of normalization statistics by the model, which results in unreliable updating of the model parameters. Second, entropy minimization-based methods achieve competitive closed-set results by making the model confident in its predictions. However, minimizing entropy on samples from unknown classes destroys the model confidence, which is undesirable. We believe that reliable model confidence is very important, especially in open-set TTA, because it tells us how much we can trust the adapted model’s predictions.

Figure 4: Performance comparison of existing TTA methods under closed-set and open-set settings.

3.4 Distribution-aware Filter

We first model the open-set data distribution as shown in Eq. (2):

$$\mathcal{P}_{\text{OPEN}}:=\pi\mathcal{P}_{\text{csID}}+(1-\pi)\mathcal{P}_{\text{csOOD}}, \tag{2}$$

where $\pi\in[0,1]$. Equation (2) contains two distributions that the model may encounter during TTA:

  • Covariate-shifted ID $\mathcal{P}_{\text{csID}}$ shares the label space with the training data, whereas the input space suffers from style and domain shifts.

  • Covariate-shifted OOD $\mathcal{P}_{\text{csOOD}}$ differs from the training data in both the label space and the input space.

We define the csOOD score for each test sample as:

$$S(\mathbf{x})=\nu\left(\max_{c}\frac{g_{\theta_0}(\mathbf{x})\cdot p_c}{\|g_{\theta_0}(\mathbf{x})\|\|p_c\|}\right), \tag{3}$$

where $\nu(\cdot)$ denotes min-max normalization with the range of $[0,1]$, $g_{\theta_0}$ denotes the feature extractor of the source domain pre-trained model, and $p_c$ denotes the source domain prototype of class $c$.
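As an illustration of Eq. (3), the sketch below computes the score from pre-extracted feature vectors and class prototypes. Two points are assumptions on our part and are not specified in the text above: the prototypes are taken as given inputs, and the min-max normalization $\nu(\cdot)$ is applied over the batch of samples passed to the function. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def csood_scores(features, prototypes):
    """Eq. (3): max cosine similarity to any prototype, min-max scaled to [0, 1].

    Assumption: the normalization range is taken over the given batch.
    """
    raw = [max(cosine(f, p) for p in prototypes) for f in features]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        # Degenerate batch (all similarities equal): return a neutral score.
        return [0.5 for _ in raw]
    return [(r - lo) / (hi - lo) for r in raw]

prototypes = [[1.0, 0.0], [0.0, 1.0]]               # one prototype per known class
features = [[2.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]   # hypothetical source-model features
print(csood_scores(features, prototypes))
```

A feature aligned with a prototype receives the highest score and a feature pointing away from every prototype receives the lowest, which matches the convention used below, where the component with the larger mean corresponds to csID samples.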

As shown in Fig. 5, we empirically found that $S(\mathbf{x})$ can distinguish between csID samples and csOOD samples. To be more specific, the distribution of $S(\mathbf{x})$ appears to be bimodal, and its two peaks indicate csID and csOOD modes, respectively. In order to select the optimal threshold, we model the distribution of $S(\mathbf{x})$ as a Gaussian mixture model (GMM) with two components, where the component with larger mean corresponds to the csID samples, and vice versa:

$$\begin{split}\mathcal{P}(\mathbf{x})=&\pi(\mathbf{x})\mathcal{N}(\mathbf{x}\mid\mu_{\text{csID}},\sigma_{\text{csID}}^{2})\\&+(1-\pi(\mathbf{x}))\mathcal{N}(\mathbf{x}\mid\mu_{\text{csOOD}},\sigma_{\text{csOOD}}^{2}),\end{split} \tag{4}$$

where $\pi(\mathbf{x})$ denotes the probability that $S(\mathbf{x})$ belongs to the csID component, and $\mu_{\text{csID}}$, $\sigma_{\text{csID}}^{2}$ and $\mu_{\text{csOOD}}$, $\sigma_{\text{csOOD}}^{2}$ represent the mean and variance of the csID and csOOD components, respectively. Further, $\pi(\mathbf{x})$ can be easily obtained using the EM algorithm.

Figure 5: The csOOD score $S(\mathbf{x})$ presents a bimodal distribution.

Then, we can split $\mathcal{B}_t$ into $\mathcal{B}_{t,\text{csID}}$ and $\mathcal{B}_{t,\text{csOOD}}$ through Eq. (5):

$$\begin{split}\mathcal{B}_{t,\text{csID}}&=\{\mathbf{x}\mid\mathbf{x}\in\mathcal{B}_t\wedge\pi(\mathbf{x})\geq 0.5\}\\\mathcal{B}_{t,\text{csOOD}}&=\{\mathbf{x}\mid\mathbf{x}\in\mathcal{B}_t\wedge\pi(\mathbf{x})<0.5\},\end{split} \tag{5}$$

where $\mathcal{B}_{t,\text{csID}}$ and $\mathcal{B}_{t,\text{csOOD}}$ are the mini-batches of pseudo csID and pseudo csOOD samples at timestamp $t$, respectively.
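The filtering step of Eqs. (4) and (5) can be sketched as follows: fit a two-component one-dimensional GMM to the scores with EM, read off the posterior of the higher-mean component as $\pi(\mathbf{x})$, and threshold it at 0.5. This is a minimal illustration under our own assumptions (initialization at the minimum and maximum score, a fixed number of iterations, a variance floor `min_var`); the function names are hypothetical and the released code may fit the mixture differently.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posteriors(scores, weights, means, variances):
    """E-step: responsibility of each of the two components for every score."""
    resp = []
    for s in scores:
        dens = [weights[k] * normal_pdf(s, means[k], variances[k]) for k in range(2)]
        total = sum(dens)
        # Fall back to an even split if both densities underflow to zero.
        resp.append([d / total for d in dens] if total > 0.0 else [0.5, 0.5])
    return resp

def csid_probability(scores, iters=50, min_var=1e-6):
    """Fit a two-component GMM with EM and return pi(x) for every score."""
    lo, hi = min(scores), max(scores)
    means = [lo, hi]  # assumed initialization: one component at each extreme
    init_var = max((hi - lo) ** 2 / 4.0, min_var)
    variances = [init_var, init_var]
    weights = [0.5, 0.5]
    for _ in range(iters):
        resp = posteriors(scores, weights, means, variances)
        for k in range(2):  # M-step
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var_k = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(var_k, min_var)
            weights[k] = nk / len(scores)
    resp = posteriors(scores, weights, means, variances)
    id_comp = 0 if means[0] > means[1] else 1  # larger mean corresponds to csID
    return [r[id_comp] for r in resp]

def split_batch(samples, pi):
    """Eq. (5): hard partition into pseudo-csID and pseudo-csOOD samples."""
    csid = [x for x, p in zip(samples, pi) if p >= 0.5]
    csood = [x for x, p in zip(samples, pi) if p < 0.5]
    return csid, csood

scores = [0.05, 0.1, 0.15, 0.1, 0.85, 0.9, 0.95, 0.9]
pi = csid_probability(scores)
print(split_batch(scores, pi))
```

On this toy input the two clusters are well separated, so the posteriors are close to 0 or 1. For scores near the decision boundary the posteriors are softer, which is the case the sample-level weighting of UniEnt+ is designed to handle.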

3.5 Unified Entropy Optimization

UniEnt.

Based on the previous sections, we minimize the entropy of the model’s predictions only on samples from known classes, which addresses the inaccurate estimation of the data distribution and yields more reliable adaptation. However, the samples from unknown classes have not been explored effectively. Inspired by previous work [16, 23, 49], we propose to make the model produce approximately uniform predictions on these samples via entropy maximization, which addresses the inaccurate estimation of the model confidence and helps distinguish known-class samples from unknown-class samples. The overall test-time optimization objective can be written as:

$$\mathcal{L}_{t,\text{csID}}=\frac{1}{\|\mathcal{B}_{t,\text{csID}}\|}\sum_{\mathbf{x}\in\mathcal{B}_{t,\text{csID}}}H(f_{\theta_t}(\mathbf{x})), \tag{6}$$

$$\mathcal{L}_{t,\text{csOOD}}=\frac{1}{\|\mathcal{B}_{t,\text{csOOD}}\|}\sum_{\mathbf{x}\in\mathcal{B}_{t,\text{csOOD}}}H(f_{\theta_t}(\mathbf{x})), \tag{7}$$

$$\min_{\theta_t}\mathcal{L}_t=\mathcal{L}_{t,\text{csID}}-\lambda_1\mathcal{L}_{t,\text{csOOD}}-\lambda_2H(\bar{f}_{\theta_t}), \tag{8}$$

where $\lambda_1$ and $\lambda_2$ are trade-off hyperparameters.
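Eqs. (6) to (8) can be evaluated as in the sketch below, which takes softmax outputs and the posteriors $\pi(\mathbf{x})$ as inputs. The default values of `lam1` and `lam2` are placeholders rather than the paper's settings, treating an empty partition as contributing zero is our assumption, and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(prob_rows):
    """Average entropy over a list of probability vectors (0.0 if the list is empty)."""
    if not prob_rows:
        return 0.0  # assumption: an empty partition contributes nothing
    return sum(entropy(p) for p in prob_rows) / len(prob_rows)

def unient_loss(probs, pi, lam1=1.0, lam2=1.0):
    """Value of Eq. (8) for softmax outputs `probs` and csID posteriors `pi`."""
    csid = [p for p, w in zip(probs, pi) if w >= 0.5]   # pseudo-csID, Eq. (5)
    csood = [p for p, w in zip(probs, pi) if w < 0.5]   # pseudo-csOOD, Eq. (5)
    n, num_classes = len(probs), len(probs[0])
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    return mean_entropy(csid) - lam1 * mean_entropy(csood) - lam2 * entropy(marginal)

pi = [0.9, 0.1]
desired = [[0.99, 0.01], [0.5, 0.5]]    # confident on csID, uniform on csOOD
undesired = [[0.5, 0.5], [0.99, 0.01]]  # the reverse
print(unient_loss(desired, pi), unient_loss(undesired, pi))
```

The first configuration yields the lower loss, since the objective rewards confident predictions on pseudo-csID samples and near-uniform predictions on pseudo-csOOD samples.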

UniEnt+.

In the distribution-aware filter, we distinguish csID samples from csOOD samples roughly, which inevitably introduces some noise. To address this problem, we propose a weighting scheme to achieve entropy minimization for known classes and entropy maximization for unknown classes at the same time. The objective can be reformulated as follows:

$$\begin{split}\min_{\theta_t}\mathcal{L}_t=&\frac{1}{\|\mathcal{B}_t\|}\sum_{\mathbf{x}\in\mathcal{B}_t}\pi(\mathbf{x})H(f_{\theta_t}(\mathbf{x}))\\&-\lambda_1\frac{1}{\|\mathcal{B}_t\|}\sum_{\mathbf{x}\in\mathcal{B}_t}(1-\pi(\mathbf{x}))H(f_{\theta_t}(\mathbf{x}))\\&-\lambda_2H(\bar{f}_{\theta_t}).\end{split} \tag{9}$$
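The weighted objective in Eq. (9) replaces the hard partition with per-sample weights. The sketch below evaluates it for given softmax outputs and posteriors; as before, the defaults for `lam1` and `lam2` are placeholders and the function name is hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def unient_plus_loss(probs, pi, lam1=1.0, lam2=1.0):
    """Value of Eq. (9): entropies weighted by pi(x) and by 1 - pi(x)."""
    n, num_classes = len(probs), len(probs[0])
    ents = [entropy(p) for p in probs]
    id_term = sum(w * h for w, h in zip(pi, ents)) / n            # minimized
    ood_term = sum((1.0 - w) * h for w, h in zip(pi, ents)) / n   # maximized
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    return id_term - lam1 * ood_term - lam2 * entropy(marginal)

probs = [[0.9, 0.1], [0.6, 0.4]]
# A sample with pi(x) = 0.5 is pulled equally in both directions when lam1 = 1.
print(unient_plus_loss(probs, [0.5, 0.5], lam1=1.0, lam2=0.0))
```

With $\lambda_1=1$, a sample whose posterior is exactly 0.5 contributes nothing to the first two terms, so ambiguous samples have little influence on the update, whereas the hard split of Eq. (5) would assign them fully to one side.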

4 Experiments

| Method | CIFAR-10-C Acc ↑ | CIFAR-10-C AUROC ↑ | CIFAR-10-C FPR@TPR95 ↓ | CIFAR-10-C OSCR ↑ | CIFAR-100-C Acc ↑ | CIFAR-100-C AUROC ↑ | CIFAR-100-C FPR@TPR95 ↓ | CIFAR-100-C OSCR ↑ | Average Acc ↑ | Average AUROC ↑ | Average FPR@TPR95 ↓ | Average OSCR ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Source [54] | 81.73 | 77.89 | 79.45 | 68.44 | 53.25 | 60.55 | 94.98 | 39.87 | 67.49 | 69.22 | 87.22 | 54.16 |
| BN Adapt [33] | 84.20 | 80.40 | 76.84 | 72.13 | 57.16 | 72.45 | 84.29 | 47.10 | 70.68 | 76.43 | 80.57 | 59.62 |
| CoTTA [46] | 85.77 | 85.89 | 72.40 | 77.26 | 56.46 | 77.04 | 80.96 | 48.95 | 71.12 | 81.47 | 76.68 | 63.11 |
| TENT [44] | 79.38 | 65.39 | 95.94 | 56.73 | 54.74 | 65.00 | 94.79 | 42.24 | 67.06 | 65.20 | 95.37 | 49.49 |
| + UniEnt | 84.31 (+4.93) | 92.28 (+26.89) | 36.74 (-59.20) | 80.32 (+23.59) | 59.07 (+4.33) | 89.28 (+24.28) | 51.14 (-43.65) | 56.26 (+14.02) | 71.69 (+4.63) | 90.78 (+25.59) | 43.94 (-51.43) | 68.29 (+18.81) |
| + UniEnt+ | 84.03 (+4.65) | 93.18 (+27.79) | 32.74 (-63.20) | 80.62 (+23.89) | 58.58 (+3.84) | 91.39 (+26.39) | 41.09 (-53.70) | 56.36 (+14.12) | 71.31 (+4.25) | 92.29 (+27.09) | 36.92 (-58.45) | 68.49 (+19.01) |
| EATA [35] | 80.92 | 84.32 | 71.66 | 72.63 | 60.63 | 88.64 | 50.18 | 57.24 | 70.78 | 86.48 | 60.92 | 64.94 |
| + UniEnt | 84.31 (+3.39) | 97.15 (+12.83) | 13.25 (-58.41) | 82.99 (+10.36) | 59.75 (-0.88) | 93.42 (+4.78) | 30.36 (-19.82) | 57.99 (+0.75) | 72.03 (+1.26) | 95.29 (+8.81) | 21.81 (-39.12) | 70.49 (+5.55) |
| + UniEnt+ | 85.18 (+4.26) | 96.97 (+12.65) | 14.28 (-57.38) | 83.67 (+11.04) | 59.71 (-0.92) | 94.23 (+5.59) | 26.87 (-23.31) | 58.19 (+0.95) | 72.45 (+1.67) | 95.60 (+9.12) | 20.58 (-40.35) | 70.93 (+6.00) |
| OSTTA [27] | 84.44 | 72.74 | 77.02 | 65.17 | 60.03 | 75.37 | 82.75 | 51.35 | 72.24 | 74.06 | 79.89 | 58.26 |
| + UniEnt | 82.46 (-1.98) | 96.20 (+23.46) | 16.37 (-60.65) | 80.51 (+15.34) | 58.69 (-1.34) | 94.84 (+19.47) | 22.95 (-59.80) | 57.28 (+5.93) | 70.58 (-1.66) | 95.52 (+21.47) | 19.66 (-60.23) | 68.90 (+10.64) |
| + UniEnt+ | 84.30 (-0.14) | 97.38 (+24.64) | 11.56 (-65.46) | 82.91 (+17.74) | 58.93 (-1.10) | 95.42 (+20.05) | 20.59 (-62.16) | 57.69 (+6.34) | 71.62 (-0.62) | 96.40 (+22.35) | 16.08 (-63.81) | 70.30 (+12.04) |

Table 1: Results of different methods on CIFAR benchmarks. ↑ indicates that larger values are better, and ↓ indicates that smaller values are better. All values are percentages. The bold values indicate the best results, and the underlined values indicate the second best results.
| Method | Tiny-ImageNet-C Acc ↑ | Tiny-ImageNet-C AUROC ↑ | Tiny-ImageNet-C FPR@TPR95 ↓ | Tiny-ImageNet-C OSCR ↑ |
| --- | --- | --- | --- | --- |
| Source [54] | 22.29 | 53.79 | 93.41 | 16.29 |
| BN Adapt [33] | 37.00 | 61.06 | 90.90 | 28.50 |
| TENT [44] | 28.96 | 49.78 | 95.96 | 19.02 |
| + UniEnt | 37.23 (+8.27) | 63.92 (+14.14) | 89.72 (-6.24) | 30.18 (+11.16) |
| + UniEnt+ | 37.31 (+8.35) | 63.83 (+14.05) | 89.12 (-6.84) | 30.12 (+11.10) |
| EATA [35] | 37.09 | 57.55 | 93.22 | 27.91 |
| + UniEnt | 37.54 (+0.45) | 64.34 (+6.79) | 89.23 (-3.99) | 30.59 (+2.68) |
| + UniEnt+ | 38.65 (+1.56) | 62.30 (+4.75) | 90.88 (-2.34) | 30.95 (+3.04) |
| OSTTA [27] | 37.29 | 55.66 | 94.34 | 27.74 |
| + UniEnt | 33.72 (-3.57) | 62.69 (+7.03) | 89.67 (-4.67) | 26.63 (-1.11) |
| + UniEnt+ | 34.47 (-2.82) | 61.28 (+5.62) | 89.56 (-4.78) | 26.65 (-1.09) |

Table 2: Results of different methods on Tiny-ImageNet-C.
| Method | $\mathcal{L}_{t,\text{csID}}$ | $\mathcal{L}_{t,\text{csOOD}}$ | CIFAR-10-C Acc ↑ | CIFAR-10-C AUROC ↑ | CIFAR-10-C FPR@TPR95 ↓ | CIFAR-10-C OSCR ↑ | CIFAR-100-C Acc ↑ | CIFAR-100-C AUROC ↑ | CIFAR-100-C FPR@TPR95 ↓ | CIFAR-100-C OSCR ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TENT [44] | | | 79.38 | 65.39 | 95.94 | 56.73 | 54.74 | 65.00 | 94.79 | 42.24 |
| | ✓ | | 85.04 (+5.66) | 81.80 (+16.41) | 68.89 (-27.05) | 73.57 (+16.84) | 59.30 (+4.56) | 86.09 (+21.09) | 63.65 (-31.14) | 55.55 (+13.31) |
| | ✓ | ✓ | 84.31 (+4.93) | 92.28 (+26.89) | 36.74 (-59.20) | 80.32 (+23.59) | 59.07 (+4.33) | 89.28 (+24.28) | 51.14 (-43.65) | 56.26 (+14.02) |
| EATA [35] | | | 80.92 | 84.32 | 71.66 | 72.63 | 60.63 | 88.64 | 50.18 | 57.24 |
| | ✓ | | 85.53 (+4.61) | 82.94 (-1.38) | 67.95 (-3.71) | 74.85 (+2.22) | 60.46 (-0.17) | 88.53 (-0.11) | 54.30 (+4.12) | 57.26 (+0.02) |
| | ✓ | ✓ | 84.31 (+3.39) | 97.15 (+12.83) | 13.25 (-58.41) | 82.99 (+10.36) | 59.75 (-0.88) | 93.42 (+4.78) | 30.36 (-19.82) | 57.99 (+0.75) |
| OSTTA [27] | | | 84.44 | 72.74 | 77.02 | 65.17 | 60.03 | 75.37 | 82.75 | 51.35 |
| | ✓ | | 84.86 (+0.42) | 84.96 (+12.22) | 62.66 (-14.36) | 75.84 (+10.67) | 58.95 (-1.08) | 90.62 (+15.25) | 44.79 (-37.96) | 56.50 (+5.15) |
| | ✓ | ✓ | 82.46 (-1.98) | 96.20 (+23.46) | 16.37 (-60.65) | 80.51 (+15.34) | 58.69 (-1.34) | 94.84 (+19.47) | 22.95 (-59.80) | 57.28 (+5.93) |

Table 3: Ablation study on CIFAR benchmarks. We investigate the effectiveness of $\mathcal{L}_{t,\text{csID}}$ and $\mathcal{L}_{t,\text{csOOD}}$ in Eq. (8) for UniEnt.
| Method | $\lambda_1 = 0.1$ | $\lambda_1 = 0.2$ | $\lambda_1 = 0.5$ | $\lambda_1 = 1.0$ | $\Delta$ |
| --- | --- | --- | --- | --- | --- |
| TENT [44] + UniEnt | (59.09, 89.11, 51.68, 56.20) | (59.07, 89.28, 51.14, 56.26) | (58.92, 89.59, 50.16, 56.22) | (58.76, 89.95, 48.92, 56.21) | (0.33, 0.84, 2.76, 0.06) |
| TENT [44] + UniEnt+ | (58.64, 91.18, 41.79, 56.34) | (58.58, 91.39, 41.09, 56.36) | (58.41, 91.68, 40.22, 56.33) | (58.12, 91.89, 39.68, 56.13) | (0.52, 0.71, 2.11, 0.23) |
| EATA [35] + UniEnt | (59.50, 93.34, 30.72, 57.72) | (59.75, 93.42, 30.36, 57.99) | (59.37, 92.56, 34.98, 57.40) | (59.58, 93.82, 28.29, 57.97) | (0.38, 1.26, 6.69, 0.59) |
| EATA [35] + UniEnt+ | (59.73, 93.47, 30.25, 58.00) | (59.81, 93.88, 27.84, 58.17) | (59.71, 94.23, 26.87, 58.19) | (59.62, 93.47, 30.37, 57.91) | (0.19, 0.76, 3.50, 0.28) |
| OSTTA [27] + UniEnt | (58.85, 93.89, 26.59, 57.14) | (58.82, 94.32, 24.94, 57.24) | (58.69, 94.84, 22.95, 57.28) | (57.88, 94.80, 23.51, 56.51) | (0.97, 0.95, 3.64, 0.77) |
| OSTTA [27] + UniEnt+ | (59.25, 94.19, 24.62, 57.54) | (59.15, 94.84, 22.29, 57.69) | (58.93, 95.42, 20.59, 57.69) | (58.20, 95.65, 20.12, 57.06) | (1.05, 1.46, 4.50, 0.63) |

Table 4: Performance of UniEnt and UniEnt+ with varying $\lambda_1$ on CIFAR-100-C. The values in the table are presented as (Acc, AUROC, FPR@TPR95, OSCR). $\Delta$ is the difference between the maximum and minimum values when $\lambda_1$ takes different values. Smaller $\Delta$ values represent better robustness.
| Method | $\lambda_2 = 0.1$ | $\lambda_2 = 0.2$ | $\lambda_2 = 0.5$ | $\lambda_2 = 1.0$ | $\Delta$ |
| --- | --- | --- | --- | --- | --- |
| TENT [44] + UniEnt | (59.44, 87.02, 60.32, 55.93) | (59.07, 89.28, 51.14, 56.26) | (58.09, 92.87, 33.24, 56.23) | (56.62, 94.53, 25.26, 55.24) | (2.82, 7.51, 35.06, 1.02) |
| TENT [44] + UniEnt+ | (59.19, 87.95, 57.31, 56.04) | (58.58, 91.39, 41.09, 56.36) | (56.71, 94.57, 25.02, 55.34) | (53.13, 94.93, 24.19, 52.01) | (6.06, 6.98, 33.12, 4.35) |
| EATA [35] + UniEnt | (60.54, 88.14, 55.48, 57.15) | (60.06, 89.45, 50.99, 57.16) | (59.75, 93.42, 30.36, 57.99) | (58.26, 95.07, 22.18, 57.02) | (2.28, 6.93, 33.30, 0.97) |
| EATA [35] + UniEnt+ | (60.35, 89.49, 50.20, 57.44) | (60.51, 91.03, 42.50, 58.02) | (59.71, 94.23, 26.87, 58.19) | (59.03, 95.28, 21.20, 57.81) | (1.48, 5.79, 29.00, 0.75) |
| OSTTA [27] + UniEnt | (58.69, 94.84, 22.95, 57.28) | (56.63, 95.43, 21.02, 55.46) | (49.85, 93.77, 32.12, 48.59) | (43.89, 91.19, 47.50, 42.41) | (14.80, 4.24, 26.48, 14.87) |
| OSTTA [27] + UniEnt+ | (59.15, 94.84, 22.29, 57.69) | (57.55, 95.82, 18.91, 56.43) | (50.31, 94.09, 30.05, 49.11) | (43.66, 91.78, 43.35, 42.28) | (15.49, 4.04, 24.44, 15.41) |

Table 5: Performance of UniEnt and UniEnt+ with varying $\lambda_2$ on CIFAR-100-C. $\Delta$ is the difference between the maximum and minimum values when $\lambda_2$ takes different values.

4.1 Setup

Datasets.

Following previous studies, we evaluate our proposed methods on the widely used corruption benchmark datasets CIFAR-10-C, CIFAR-100-C, and Tiny-ImageNet-C [14]. Each dataset contains 15 types of corruption at 5 severity levels; all our experiments are conducted at the most severe level, 5. Models are pre-trained on the clean training set, then adapted and tested on the corrupted test set. Following OSTTA [27], we apply the same corruption types to the original SVHN [34] and ImageNet-O [18] test sets to generate the SVHN-C and ImageNet-O-C datasets. We use SVHN-C and ImageNet-O-C as the covariate-shifted OOD datasets for CIFAR-10/100-C and Tiny-ImageNet-C, respectively.

Evaluation protocols.

Following recent research [46, 44, 35, 27], we evaluate TTA methods under continuously changing domains without resetting the parameters after each domain. At test time, the corrupted images are provided to the model in an online fashion. After encountering a mini-batch of test data, the model makes predictions and updates its parameters immediately. The predictions for test data arriving at timestamp $t$ are not affected by any test data arriving after timestamp $t$. We construct each mini-batch from equal numbers of csID and csOOD samples. To measure the model's adaptation performance on csID data, we use accuracy (Acc). To evaluate whether the adapted model can detect csOOD data robustly, we measure the area under the receiver operating characteristic curve (AUROC) and the false positive rate of csOOD samples when the true positive rate of csID samples is 95% (FPR@TPR95). As we pursue a good trade-off between classification accuracy on csID data and detection accuracy on csOOD data, we also report the open-set classification rate (OSCR) [9] to measure the balanced performance.
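The two detection metrics can be computed directly from per-sample detection scores. The sketch below is illustrative only and rests on stated assumptions: csID is treated as the positive class, higher scores mean "more likely csID", and the function names are hypothetical. A library routine would normally replace the quadratic pairwise loop.

```python
import math


def auroc(id_scores, ood_scores):
    """Probability that a random csID sample scores above a random csOOD one."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5  # ties count as half a win
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of csOOD samples accepted when `tpr` of csID samples are accepted."""
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of accepted csID samples that reaches the requested TPR.
    k = math.ceil(tpr * len(ranked))
    threshold = ranked[k - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


# Toy scores: csID samples mostly score higher than csOOD samples.
toy_auroc = auroc([0.9, 0.8, 0.7, 0.6], [0.65, 0.1])
toy_fpr = fpr_at_tpr([0.9, 0.8, 0.7, 0.6], [0.65, 0.1])
```

Both metrics depend only on the ranking of the scores, so any monotone rescaling of the detection score leaves them unchanged.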

Baseline methods.

We mainly compare our method with two types of previous TTA methods: 1) entropy-free methods: Source directly evaluates the test data using the source model without adaptation. BN Adapt [33] updates batch normalization statistics with the test data during TTA. CoTTA [46] adopts a teacher-student architecture to provide weight-averaged and augmentation-averaged pseudo-labels that reduce error accumulation, combined with stochastic restoration to avoid catastrophic forgetting. 2) entropy-based methods: TENT [44] estimates normalization statistics and optimizes channel-wise affine transformations through entropy minimization. EATA [35] selects reliable and non-redundant samples for model adaptation: the former have prediction entropy lower than a pre-defined threshold, and the latter have diverse model outputs. In addition, a Fisher regularization is introduced to prevent catastrophic forgetting. OSTTA [27] uses the wisdom of crowds to filter out samples whose confidence values are lower in the adapted model than in the original model. Our methods can be easily applied to existing entropy-based methods without additional modification. When applying our methods to EATA and OSTTA, we retain their filtering methods and keep everything else the same.
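To make the two sample-selection rules described above concrete, the following sketch filters a batch of softmax outputs. It is a simplified reading of the descriptions, not the baselines' code: the function names are hypothetical, the entropy threshold of `margin * ln(C)` with `margin=0.4` is an assumed default, and "confidence" is assumed to be the maximum softmax probability.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_reliable(batch_probs, margin=0.4):
    """EATA-style reliability filter: keep samples with low prediction entropy.

    Assumption: the threshold is margin * ln(number of classes).
    """
    threshold = margin * math.log(len(batch_probs[0]))
    return [i for i, p in enumerate(batch_probs) if entropy(p) < threshold]


def select_confident(source_probs, adapted_probs):
    """OSTTA-style filter: drop samples whose confidence fell after adaptation.

    Assumption: confidence is the maximum softmax probability.
    """
    return [
        i
        for i, (p_src, p_adapt) in enumerate(zip(source_probs, adapted_probs))
        if max(p_adapt) >= max(p_src)
    ]


# Toy batch of three samples over three classes.
toy_batch = [[0.98, 0.01, 0.01], [1 / 3, 1 / 3, 1 / 3], [0.7, 0.2, 0.1]]
reliable_indices = select_reliable(toy_batch)
```

Either index list can then restrict the samples that contribute to the entropy-minimization term, which is the sense in which these filters are kept when our methods are combined with EATA and OSTTA.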

Implementation details.

For experiments on CIFAR benchmarks, following previous studies [6, 31, 27], we use WideResNet [54] with 40 layers and a widening factor of 2. The model pre-trained with AugMix [17] is available from RobustBench [7]. For Tiny-ImageNet-C, we pre-train ResNet50 [13] on the Tiny-ImageNet [26] training set, as in OSTTA [27]. The model is initialized with weights pre-trained on ImageNet [8] and optimized for 50 epochs using SGD [38] with a batch size of 256. The initial learning rate is set to 0.01 and adjusted using a cosine annealing schedule. During TTA, we use the Adam [25] optimizer with a batch size of 200 for all experiments. The learning rate is set to 0.001 for entropy-based methods (TENT [44], EATA [35], OSTTA [27]) and 0.01 for CoTTA [46]. We use the energy score [32] to measure the ability of the adapted model to detect unknown classes. Furthermore, following T3A [22], we use the weights of the linear classifier as the source domain prototypes, and thus our approach is source-free. Entropy-based methods update only the affine parameters, while CoTTA updates all parameters.
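For reference, the energy score of [32] is the negative, temperature-scaled log-sum-exp of the logits. The sketch below computes it for a single sample; the function name and the default temperature of 1 are illustrative assumptions rather than settings taken from our experiments.

```python
import math


def energy_score(logits, temperature=1.0):
    """Energy E(x) = -T * log(sum_c exp(logit_c / T)), computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    # Subtract the maximum before exponentiating to avoid overflow.
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return -temperature * log_sum_exp


# A confident prediction yields a lower energy than a flat one.
confident_energy = energy_score([10.0, 0.0, 0.0])
flat_energy = energy_score([0.0, 0.0, 0.0])
```

Lower energy corresponds to a more confident, in-distribution-like prediction, so the negative energy can serve as the "higher means csID" score when computing AUROC and FPR@TPR95.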

4.2 Results

CIFAR benchmarks.

We first conduct experiments on the most common CIFAR benchmarks, and the results are presented in Tab. 1. From Tab. 1, we can see that UniEnt and UniEnt+ significantly improve the open-set performance of three different existing TTA methods. For example, on CIFAR-10-C, UniEnt improves the Acc, AUROC, FPR@TPR95 and OSCR of TENT [44] by 4.93%, 26.89%, 59.20% and 23.59% respectively (for FPR@TPR95, the improvement is a reduction), while UniEnt+ improves the same metrics of TENT by 4.65%, 27.79%, 63.20% and 23.89% respectively.

In more detail, we can observe that in some cases TENT [44] and OSTTA [27] perform even worse than the Source and BN Adapt baselines, which do not update model parameters (on CIFAR-10-C, OSCR decreases by 3.27% to 15.40%). This indicates that some existing TTA methods cannot effectively update model parameters when open-set classes are included. This can be attributed to the fact that these methods ignore the distribution variations introduced by open-set samples, resulting in unreliable estimation of normalization statistics and model confidence.

Tiny-ImageNet-C.

We then conduct experiments on a more challenging dataset, Tiny-ImageNet-C, and the results are summarized in Tab. 2. As shown in Tab. 2, consistent with the previous analysis, UniEnt and UniEnt+ still achieve better performance in most cases. Numerically, UniEnt improves the Acc, AUROC, FPR@TPR95 and OSCR of TENT [44] by 8.27%, 14.14%, 6.24% and 11.16% respectively, while UniEnt+ improves the same metrics of TENT by 8.35%, 14.05%, 6.84% and 11.10% respectively.

4.3 Analysis

Ablation study.

To verify the effectiveness of different components in $\mathcal{L}_t$ (Eq. (8)), we conduct extensive ablation studies on CIFAR benchmarks. The results are summarized in Tab. 3. Compared with the baselines without $\mathcal{L}_{t,\text{csID}}$ and $\mathcal{L}_{t,\text{csOOD}}$ (the same as TENT [44], EATA [35] and OSTTA [27]), introducing $\mathcal{L}_{t,\text{csID}}$ improves the classification accuracy of known classes in most cases, which indicates that our proposed distribution-aware filter can distinguish the samples of known classes from the samples of unknown classes well. It is worth noting that the introduction of $\mathcal{L}_{t,\text{csID}}$ also generally leads to better detection performance on unknown classes, which is consistent with the findings of a recent study [43]. With the addition of $\mathcal{L}_{t,\text{csOOD}}$, the model's detection performance on unknown classes is further improved. Considering the trade-off between the two, UniEnt achieves the best OSCR values in most cases.

Hyperparameter sensitivity.

We perform sensitivity analyses on the hyperparameters $\lambda_1$ and $\lambda_2$, as summarized in Tab. 4 and Tab. 5. We first investigate the effect of $\lambda_1$ on CIFAR-100-C, with $\lambda_1$ taking values from $\{0.1, 0.2, 0.5, 1.0\}$ while $\lambda_2$ is held constant. The experimental results show that our methods are robust to the value of $\lambda_1$: across all settings, the largest gaps between the best and worst values of Acc, AUROC, FPR@TPR95 and OSCR are 1.05%, 1.46%, 6.69% and 0.77%, respectively. We then examine how $\lambda_2$ affects csID classification and csOOD detection, with $\lambda_2$ taking values from $\{0.1, 0.2, 0.5, 1.0\}$ while $\lambda_1$ is held constant. The results suggest that a larger $\lambda_2$ generally leads to better csOOD detection performance at the cost of some csID classification performance, and vice versa. Numerically, different values of $\lambda_2$ result in maximum performance differences of 15.49%, 7.51%, 35.06% and 15.41% for Acc, AUROC, FPR@TPR95 and OSCR, respectively.
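The $\Delta$ values used in this analysis are simple per-metric ranges. As a worked example, the sketch below recomputes $\Delta$ for the TENT [44] + UniEnt row of Tab. 4; the function name is hypothetical, and rounding to two decimals is assumed to match the precision of the tables.

```python
# Tab. 4, TENT + UniEnt on CIFAR-100-C:
# (Acc, AUROC, FPR@TPR95, OSCR) for each value of lambda_1.
tent_unient = {
    0.1: (59.09, 89.11, 51.68, 56.20),
    0.2: (59.07, 89.28, 51.14, 56.26),
    0.5: (58.92, 89.59, 50.16, 56.22),
    1.0: (58.76, 89.95, 48.92, 56.21),
}


def metric_ranges(results_by_value):
    """Per-metric difference between the maximum and minimum values."""
    columns = zip(*results_by_value.values())
    return tuple(round(max(col) - min(col), 2) for col in columns)


delta = metric_ranges(tent_unient)
# (0.33, 0.84, 2.76, 0.06), matching the Delta column of Tab. 4
```

The figures quoted in the text are the largest such range per metric over all six rows of the corresponding table.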

Performance under different number of unknown classes.

The number of unknown classes is an important measure of the complexity of the open-set setting. We examine the impact of different numbers of unknown classes. Specifically, we perform experiments on the CIFAR-10-C dataset and vary the number of unknown classes from 2 to 10 while keeping the number of samples constant. From Tab. 6, we can see that TENT [44] fluctuates with the number of unknown classes, while the proposed UniEnt and UniEnt+ are more robust to it.

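| Method | 2 | 4 | 6 | 8 | 10 | $\Delta$ |
| --- | --- | --- | --- | --- | --- | --- |
| Source [54] | 70.84 | 69.28 | 69.32 | 69.18 | 68.44 | 2.40 |
| BN Adapt [33] | 72.56 | 72.48 | 72.52 | 72.44 | 72.14 | 0.42 |
| TENT [44] | 49.51 | 48.29 | 51.74 | 49.53 | 50.97 | 3.45 |
| + UniEnt | 78.71 | 78.39 | 78.28 | 78.13 | 77.82 | 0.89 |
| + UniEnt+ | 78.65 | 78.23 | 78.23 | 78.07 | 77.68 | 0.97 |

Table 6: OSCR of UniEnt and UniEnt+ on CIFAR-10-C under different numbers of unknown classes (column headers give the number of unknown classes).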

Performance under different ratios of csOOD to csID samples.

We also perform experiments with different ratios of the number of csOOD samples to the number of csID samples, and the results are displayed in Tab. 7. We vary the data ratio from 0.2 to 1.0. It can be observed that our proposed methods are insensitive to the data ratio and can thus be applied across different ratios, whereas TENT [44] is more sensitive.

| Method | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 | $\Delta$ |
| --- | --- | --- | --- | --- | --- | --- |
| Source [54] | 40.00 | 40.03 | 39.98 | 39.92 | 39.87 | 0.16 |
| BN Adapt [33] | 49.91 | 49.55 | 48.92 | 47.97 | 47.10 | 2.81 |
| TENT [44] | 47.68 | 44.12 | 44.06 | 42.90 | 42.16 | 5.52 |
| + UniEnt | 56.84 | 57.48 | 57.13 | 56.77 | 56.26 | 1.22 |
| + UniEnt+ | 57.15 | 57.59 | 57.24 | 56.88 | 56.33 | 1.26 |

Table 7: OSCR of UniEnt and UniEnt+ on CIFAR-100-C under different ratios of csOOD to csID samples (column headers give the ratio).

T-SNE visualization.

To illustrate the effects of different methods on csID classification and csOOD detection, we visualize the feature representations of CIFAR-10-C test samples, with SVHN-C test samples as csOOD samples, via T-SNE [42] in Fig. 6. It can be observed that the features from known classes and unknown classes adapted by TENT [44] are mixed together, while UniEnt and UniEnt+ separate them better. Furthermore, we observe that filtering out csOOD samples (w/ $\mathcal{L}_{t,\text{csID}}$) improves not only the classification performance on known classes but also the detection performance on unknown classes.

Figure 6: T-SNE visualization on CIFAR-10-C test set with SVHN-C as csOOD, shown in four panels (a) to (d). Red → blue denotes csID samples and yellow denotes csOOD samples.

5 Conclusion

This paper presents a unified entropy optimization framework for open-set test-time adaptation that can be flexibly applied to various existing TTA methods. We first examine the performance of existing methods under the open-set TTA setting and attribute the performance degradation to the unreliable estimation of normalization statistics and model confidence. To address these issues, we then propose a distribution-aware filter to preliminarily distinguish csID samples from csOOD samples, followed by entropy minimization on csID samples and entropy maximization on csOOD samples. In addition, we propose to leverage sample-level confidence to reduce the noise from hard data partition. Extensive experiments show that our methods outperform state-of-the-art TTA methods in open-set scenarios. We hope that more studies will focus on the robustness of TTA methods in open-set settings, which can facilitate the application of these methods in real scenarios.

Acknowledgements.

This work has been supported by the National Science and Technology Major Project (2022ZD0116500), National Natural Science Foundation of China (U20A20223, 62222609, 62076236), CAS Project for Young Scientists in Basic Research (YSBR-083), and Key Research Program of Frontier Sciences of CAS (ZDBS-LY-7004).

References

  • Bai et al. [2023] Haoyue Bai, Gregory Canal, Xuefeng Du, Jeongyeol Kwon, Robert D Nowak, and Yixuan Li. Feed two birds with one scone: Exploiting wild data for both out-of-distribution generalization and detection. In ICML, 2023.
  • Boudiaf et al. [2022] Malik Boudiaf, Romain Mueller, Ismail Ben Ayed, and Luca Bertinetto. Parameter-free online test-time adaptation. In CVPR, 2022.
  • Chen et al. [2022] Dian Chen, Dequan Wang, Trevor Darrell, and Sayna Ebrahimi. Contrastive test-time adaptation. In CVPR, 2022.
  • Chen et al. [2020] Guangyao Chen, Limeng Qiao, Yemin Shi, Peixi Peng, Jia Li, Tiejun Huang, Shiliang Pu, and Yonghong Tian. Learning open set network with discriminative reciprocal points. In ECCV, 2020.
  • Chen et al. [2021] Guangyao Chen, Peixi Peng, Xiangqian Wang, and Yonghong Tian. Adversarial reciprocal points learning for open set recognition. IEEE TPAMI, 2021.
  • Choi et al. [2022] Sungha Choi, Seunghan Yang, Seokeon Choi, and Sungrack Yun. Improving test-time adaptation via shift-agnostic weight regularization and nearest source prototypes. In ECCV, 2022.
  • Croce et al. [2021] Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized adversarial robustness benchmark. In NeurIPS Datasets and Benchmarks Track, 2021.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • Dhamija et al. [2018] Akshay Raj Dhamija, Manuel Günther, and Terrance Boult. Reducing network agnostophobia. In NeurIPS, 2018.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • Ganin et al. [2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario March, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
  • Gong et al. [2022] Taesik Gong, Jongheon Jeong, Taewon Kim, Yewon Kim, Jinwoo Shin, and Sung-Ju Lee. Note: Robust continual test-time adaptation against temporal correlation. In NeurIPS, 2022.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Hendrycks and Dietterich [2019] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.
  • Hendrycks and Gimpel [2017] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.
  • Hendrycks et al. [2019] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. In ICLR, 2019.
  • Hendrycks et al. [2020] Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In ICLR, 2020.
  • Hendrycks et al. [2021] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021.
  • Hendrycks et al. [2022] Dan Hendrycks, Steven Basart, Mantas Mazeika, Andy Zou, Joseph Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, and Dawn Song. Scaling out-of-distribution detection for real-world settings. In ICML, 2022.
  • Hsu et al. [2020] Yen-Chang Hsu, Yilin Shen, Hongxia Jin, and Zsolt Kira. Generalized odin: Detecting out-of-distribution image without learning from out-of-distribution data. In CVPR, 2020.
  • Huang et al. [2022] Hongzhi Huang, Yu Wang, Qinghua Hu, and Ming-Ming Cheng. Class-specific semantic reconstruction for open set recognition. IEEE TPAMI, 2022.
  • Iwasawa and Matsuo [2021] Yusuke Iwasawa and Yutaka Matsuo. Test-time classifier adjustment module for model-agnostic domain generalization. In NeurIPS, 2021.
  • Katz-Samuels et al. [2022] Julian Katz-Samuels, Julia B Nakhleh, Robert Nowak, and Yixuan Li. Training ood detectors in their natural habitats. In ICML, 2022.
  • Khurana et al. [2021] Ansh Khurana, Sujoy Paul, Piyush Rai, Soma Biswas, and Gaurav Aggarwal. Sita: Single image test-time adaptation. arXiv preprint arXiv:2112.02355, 2021.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Le and Yang [2015] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. 2015.
  • Lee et al. [2023] Jungsoo Lee, Debasmit Das, Jaegul Choo, and Sungha Choi. Towards open-set test-time adaptation utilizing the wisdom of crowds in entropy minimization. In ICCV, 2023.
  • Li et al. [2023] Yushu Li, Xun Xu, Yongyi Su, and Kui Jia. On the robustness of open-world test-time training: Self-training with dynamic prototype expansion. In ICCV, 2023.
  • Liang et al. [2020] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In ICML, 2020.
  • Liang et al. [2018] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In ICLR, 2018.
  • Lim et al. [2023] Hyesu Lim, Byeonggeun Kim, Jaegul Choo, and Sungha Choi. Ttn: A domain-shift aware batch normalization in test-time adaptation. In ICLR, 2023.
  • Liu et al. [2020] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. In NeurIPS, 2020.
  • Nado et al. [2020] Zachary Nado, Shreyas Padhy, D Sculley, Alexander D’Amour, Balaji Lakshminarayanan, and Jasper Snoek. Evaluating prediction-time batch normalization for robustness under covariate shift. arXiv preprint arXiv:2006.10963, 2020.
  • Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • Niu et al. [2022] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. Efficient test-time model adaptation without forgetting. In ICML, 2022.
  • Niu et al. [2023] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world. In ICLR, 2023.
  • Press et al. [2023] Ori Press, Steffen Schneider, Matthias Kümmerer, and Matthias Bethge. Rdumb: A simple approach that questions our progress in continual test-time adaptation. arXiv:2306.05401, 2023.
  • Ruder [2016] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  • Schneider et al. [2020] Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. In NeurIPS, 2020.
  • Tian et al. [2022] Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, and Yugang Jiang. Deeper insights into vits robustness towards common corruptions. arXiv:2204.12143, 2022.
  • Touvron et al. [2021] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
  • Van der Maaten and Hinton [2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 2008.
  • Vaze et al. [2022] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Open-set recognition: A good closed-set classifier is all you need. In ICLR, 2022.
  • Wang et al. [2021] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In ICLR, 2021.
  • Wang et al. [2022a] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip Yu. Generalizing to unseen domains: A survey on domain generalization. TKDE, 2022a.
  • Wang et al. [2022b] Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. Continual test-time domain adaptation. In CVPR, 2022b.
  • Xu et al. [2021] Qinwei Xu, Ruipeng Zhang, Ya Zhang, Yanfeng Wang, and Qi Tian. A fourier-based framework for domain generalization. In CVPR, 2021.
  • Yang et al. [2021] Jingkang Yang, Haoqi Wang, Litong Feng, Xiaopeng Yan, Huabin Zheng, Wayne Zhang, and Ziwei Liu. Semantically coherent out-of-distribution detection. In ICCV, 2021.
  • Yang et al. [2023] Puning Yang, Jian Liang, Jie Cao, and Ran He. Auto: Adaptive outlier optimization for online test-time ood detection. arXiv preprint arXiv:2303.12267, 2023.
  • Yang and Soatto [2020] Yanchao Yang and Stefano Soatto. Fda: Fourier domain adaptation for semantic segmentation. In CVPR, 2020.
  • You et al. [2021] Fuming You, Jingjing Li, and Zhou Zhao. Test-time batch statistics calibration for covariate shift. arXiv preprint arXiv:2110.04065, 2021.
  • Yu and Aizawa [2019] Qing Yu and Kiyoharu Aizawa. Unsupervised out-of-distribution detection by maximum classifier discrepancy. In ICCV, 2019.
  • Yuan et al. [2023] Longhui Yuan, Binhui Xie, and Shuang Li. Robust test-time adaptation in dynamic scenarios. In CVPR, 2023.
  • Zagoruyko and Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
  • Zhang et al. [2023a] Jingyang Zhang, Nathan Inkawhich, Randolph Linderman, Yiran Chen, and Hai Li. Mixture outlier exposure: Towards out-of-distribution detection in fine-grained environments. In WACV, 2023a.
  • Zhang et al. [2023b] Jingyang Zhang, Jingkang Yang, Pengyun Wang, Haoqi Wang, Yueqian Lin, Haoran Zhang, Yiyou Sun, Xuefeng Du, Kaiyang Zhou, Wayne Zhang, et al. Openood v1. 5: Enhanced benchmark for out-of-distribution detection. arXiv preprint arXiv:2306.09301, 2023b.
  • Zhang et al. [2022] Marvin Zhang, Sergey Levine, and Chelsea Finn. Memo: Test time robustness via adaptation and augmentation. In NeurIPS, 2022.
  • Zhou et al. [2021] Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang. Domain generalization with mixstyle. In ICLR, 2021.
  • Zhou et al. [2022] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. IEEE TPAMI, 2022.
  • Zhou et al. [2023] Zhi Zhou, Lan-Zhe Guo, Lin-Han Jia, Dingchu Zhang, and Yu-Feng Li. Ods: Test-time adaptation in the presence of open-world data shift. In ICML, 2023.

Supplementary Material

6 Pseudo Code

For a better understanding of our proposed methods, we summarize UniEnt and UniEnt+ as Algorithm 1 and Algorithm 2, respectively.

Input: Source model $f_{\theta_0}$ pre-trained on the source domain dataset, testing samples $\mathcal{B}_t = \{\mathbf{x}\}, t = 1, \cdots, T$.
for $t \leftarrow 1$ to $T$ do
    for $\mathbf{x} \in \mathcal{B}_t$ do
        Compute csOOD score for each testing sample via Eq. (3);
    end for
    Obtain $\pi(x)$ via the EM algorithm;
    Split $\mathcal{B}_t$ into $\mathcal{B}_{t,\text{csID}}$ and $\mathcal{B}_{t,\text{csOOD}}$ via Eq. (5);
    Update model via Eq. (8);
end for
Output: The predictions $\mathop{\arg\max}_c f_{\theta_t}(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{B}_t, t = 1, \cdots, T$.
Algorithm 1 UniEnt
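
The steps "Obtain $\pi(x)$ via the EM algorithm" and "Split $\mathcal{B}_t$" can be illustrated with a minimal sketch. It assumes that the per-sample csOOD scores are modeled by a two-component one-dimensional Gaussian mixture, that lower scores are more csID-like, and that the split thresholds the posterior at 0.5. The function names, the initialization, and the threshold are hypothetical; the actual score and split rule are those of Eq. (3) and Eq. (5).

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(scores, n_iter=50, eps=1e-6):
    """EM for a 1-D mixture of two Gaussians.

    Returns, for each score, the posterior probability of the component
    with the lower mean (assumed here to be the csID component).
    """
    lo, hi = min(scores), max(scores)
    means = [lo, hi]                      # initialize at the extremes
    spread = max((hi - lo) ** 2 / 4.0, eps)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in scores]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each score.
        for i, s in enumerate(scores):
            p = [weights[k] * gaussian_pdf(s, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp[i] = [p[0] / total, p[1] / total] if total > 0.0 else [0.5, 0.5]
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < eps:
                continue
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(var, eps)
            weights[k] = nk / len(scores)
    id_component = 0 if means[0] <= means[1] else 1
    return [r[id_component] for r in resp]


def split_batch(pi, threshold=0.5):
    """Indices of pseudo-csID and pseudo-csOOD samples (threshold is an assumption)."""
    csid = [i for i, p in enumerate(pi) if p > threshold]
    csood = [i for i, p in enumerate(pi) if p <= threshold]
    return csid, csood


scores = [0.10, 0.12, 0.09, 0.11, 0.80, 0.82, 0.79, 0.81]
pi = fit_two_gaussians(scores)
csid_idx, csood_idx = split_batch(pi)
```

On this toy batch the four low scores end up in the pseudo-csID set and the four high scores in the pseudo-csOOD set. If the score convention is reversed, the component selection in `fit_two_gaussians` has to be flipped accordingly.
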
Input: Source model $f_{\theta_0}$ pre-trained on the source domain dataset, testing samples $\mathcal{B}_t = \{\mathbf{x}\}, t = 1, \cdots, T$.
for $t \leftarrow 1$ to $T$ do
    for $\mathbf{x} \in \mathcal{B}_t$ do
        Compute csOOD score for each testing sample via Eq. (3);
    end for
    Obtain $\pi(x)$ via the EM algorithm;
    Update model via Eq. (9);
end for
Output: The predictions $\mathop{\arg\max}_c f_{\theta_t}(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{B}_t, t = 1, \cdots, T$.
Algorithm 2 UniEnt+
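
The two algorithms differ only in whether the batch is split before the update: Algorithm 1 applies a hard partition, while Algorithm 2 uses $\pi(x)$ directly. The sketch below contrasts the two ideas on raw logits. It is a simplified illustration, not Eq. (8) or Eq. (9) themselves: the function names, the trade-off weight `lam`, the 0.5 threshold, and the normalization of the weighted terms are assumptions, and any additional terms of the full objectives are omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def hard_split_loss(batch_logits, pi, threshold=0.5, lam=1.0):
    """UniEnt-style idea: minimize entropy on pseudo-csID samples and
    maximize it on pseudo-csOOD samples after a hard partition."""
    id_h = [entropy(l) for l, p in zip(batch_logits, pi) if p > threshold]
    ood_h = [entropy(l) for l, p in zip(batch_logits, pi) if p <= threshold]
    id_term = sum(id_h) / len(id_h) if id_h else 0.0
    ood_term = sum(ood_h) / len(ood_h) if ood_h else 0.0
    return id_term - lam * ood_term


def soft_weighted_loss(batch_logits, pi, lam=1.0):
    """UniEnt+-style idea: weight each sample's entropy by pi(x) and
    1 - pi(x) instead of partitioning the batch."""
    hs = [entropy(l) for l in batch_logits]
    w_id = sum(pi)
    w_ood = sum(1.0 - p for p in pi)
    id_term = sum(p * h for p, h in zip(pi, hs)) / w_id if w_id > 0.0 else 0.0
    ood_term = sum((1.0 - p) * h for p, h in zip(pi, hs)) / w_ood if w_ood > 0.0 else 0.0
    return id_term - lam * ood_term


batch_logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pi = [1.0, 0.0]
hard = hard_split_loss(batch_logits, pi)
soft = soft_weighted_loss(batch_logits, pi)
```

When every $\pi(x)$ is exactly 0 or 1, the two losses coincide; they differ for samples near the decision boundary, where the soft weighting avoids committing an uncertain sample to either group.
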

7 More Analysis

Scalability to large-scale datasets.

To demonstrate that our methods scale to large datasets, we conduct experiments on ImageNet-C [14]. Specifically, we use a ResNet-50 [13] pre-trained with AugMix [17], whose weights are available from RobustBench [7]. For optimization, we use the SGD optimizer [38] with a learning rate of 0.00025 and a batch size of 64. To construct the csOOD data, ImageNet-O-C, we apply common corruptions and perturbations to ImageNet-O [18] using the official code of [14]. Table 8 shows that UniEnt and UniEnt+ improve AUROC, FPR@TPR95, and OSCR for every baseline in the open-set setting, while accuracy changes by at most 1.80 points in either direction.

| Method | Acc↑ | AUROC↑ | FPR@TPR95↓ | OSCR↑ |
| --- | --- | --- | --- | --- |
| Source [54] | 28.21 | 49.63 | 94.74 | 19.81 |
| BN Adapt [33] | 43.57 | 55.89 | 93.39 | 30.42 |
| CoTTA [46] | 47.67 | 55.58 | 94.51 | 33.80 |
| TENT [44] | 45.82 | 51.34 | 96.47 | 30.33 |
| + UniEnt | 47.53 (+1.71) | 56.33 (+4.99) | 95.21 (-1.26) | 34.42 (+4.09) |
| + UniEnt+ | 46.87 (+1.05) | 55.86 (+4.52) | 95.10 (-1.37) | 33.73 (+3.40) |
| EATA [35] | 51.40 | 53.10 | 95.18 | 34.87 |
| + UniEnt | 49.60 (-1.80) | 58.29 (+5.19) | 93.63 (-1.55) | 36.28 (+1.41) |
| + UniEnt+ | 51.57 (+0.17) | 59.45 (+6.35) | 93.60 (-1.58) | 38.27 (+3.40) |
| OSTTA [27] | 47.91 | 52.93 | 96.15 | 32.77 |
| + UniEnt | 47.92 (+0.01) | 56.02 (+3.09) | 95.23 (-0.92) | 34.47 (+1.70) |
| + UniEnt+ | 47.47 (-0.44) | 55.67 (+2.74) | 95.16 (-0.99) | 34.03 (+1.26) |

Table 8: Results of different methods on ImageNet-C. ↑ indicates that larger values are better, and ↓ indicates that smaller values are better. All values are percentages. The bold values indicate the best results, and the underlined values indicate the second best results. The values in parentheses indicate the improvements of our methods over the baseline methods.
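
For readers less familiar with the detection metrics reported in these tables, the following sketch computes AUROC and FPR@TPR95 from per-sample scores. It assumes that csID samples are the positive class and that a higher score means a sample looks more csID-like; the function names and the toy scores are illustrative only, and the evaluation code used for the reported numbers may differ in details such as tie handling.

```python
import math


def auroc(pos_scores, neg_scores):
    """Probability that a random positive scores higher than a random
    negative, counting ties as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def fpr_at_tpr(pos_scores, neg_scores, tpr=0.95):
    """False positive rate at the highest threshold that still accepts
    at least a fraction `tpr` of the positives."""
    ranked = sorted(pos_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))      # positives that must be accepted
    threshold = ranked[k - 1]
    accepted_negatives = sum(1 for n in neg_scores if n >= threshold)
    return accepted_negatives / len(neg_scores)


csid_scores = [0.9, 0.8, 0.7, 0.6]        # toy scores for csID samples
csood_scores = [0.65, 0.3, 0.2, 0.1]      # toy scores for csOOD samples
toy_auroc = auroc(csid_scores, csood_scores)
toy_fpr = fpr_at_tpr(csid_scores, csood_scores)
```

In this toy example, 15 of the 16 csID/csOOD pairs are ranked correctly, and accepting all four csID samples also admits one of the four csOOD samples.
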

Scalability across model architectures.

Because Vision Transformers (ViTs) [10] have recently demonstrated better performance than convolutional neural networks (CNNs), we also perform experiments with a ViT backbone on ImageNet-C. Specifically, we use the DeiT-Base [41] model from [40], which proposes several training techniques that improve the robustness of the model to common corruptions. The pre-trained weights are also available from RobustBench. During adaptation, we update the affine parameters of the model's layer normalization. Table 9 shows that our approaches are compatible with ViT: with DeiT-Base, they improve AUROC, FPR@TPR95, and OSCR for every baseline, at the cost of 1.46 to 6.02 points of accuracy.

| Method | ResNet-50 Acc↑ | ResNet-50 AUROC↑ | ResNet-50 FPR@TPR95↓ | ResNet-50 OSCR↑ | DeiT-Base Acc↑ | DeiT-Base AUROC↑ | DeiT-Base FPR@TPR95↓ | DeiT-Base OSCR↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Source [54] | 28.21 | 49.63 | 94.74 | 19.81 | 56.59 | 56.01 | 91.55 | 36.13 |
| CoTTA [46] | 47.67 | 55.58 | 94.51 | 33.80 | 60.73 | 53.51 | 93.14 | 37.33 |
| TENT [44] | 45.82 | 51.34 | 96.47 | 30.33 | 62.85 | 59.51 | 93.47 | 43.52 |
| + UniEnt | 47.53 (+1.71) | 56.33 (+4.99) | 95.21 (-1.26) | 34.42 (+4.09) | 58.81 (-4.04) | 67.10 (+7.59) | 90.90 (-2.57) | 47.40 (+3.88) |
| + UniEnt+ | 46.87 (+1.05) | 55.86 (+4.52) | 95.10 (-1.37) | 33.73 (+3.40) | 58.40 (-4.45) | 66.69 (+7.18) | 90.43 (-3.04) | 46.74 (+3.22) |
| EATA [35] | 51.40 | 53.10 | 95.18 | 34.87 | 65.38 | 57.95 | 92.92 | 44.29 |
| + UniEnt | 49.60 (-1.80) | 58.29 (+5.19) | 93.63 (-1.55) | 36.28 (+1.41) | 59.36 (-6.02) | 67.22 (+9.27) | 91.63 (-1.29) | 48.23 (+3.94) |
| + UniEnt+ | 51.57 (+0.17) | 59.45 (+6.35) | 93.60 (-1.58) | 38.27 (+3.40) | 61.50 (-3.88) | 66.96 (+9.01) | 89.99 (-2.93) | 48.79 (+4.50) |
| OSTTA [27] | 47.91 | 52.93 | 96.15 | 32.77 | 60.19 | 60.69 | 92.42 | 43.19 |
| + UniEnt | 47.92 (+0.01) | 56.02 (+3.09) | 95.23 (-0.92) | 34.47 (+1.70) | 58.73 (-1.46) | 67.62 (+6.93) | 90.51 (-1.91) | 47.64 (+4.45) |
| + UniEnt+ | 47.47 (-0.44) | 55.67 (+2.74) | 95.16 (-0.99) | 34.03 (+1.26) | 58.72 (-1.47) | 67.28 (+6.59) | 90.02 (-2.40) | 47.32 (+4.13) |

Table 9: Results of different methods on ImageNet-C using diverse architectures.

Performance under long-term open-set test-time adaptation.

Models deployed in real-world scenarios are exposed to test samples for long periods and need to make reliable predictions at any time. Recent work [37, 27] points out that most existing TTA methods perform poorly in long-term settings, even worse than non-updating models. Following [27], we simulate long-term TTA by repeating adaptation for 10 rounds. During adaptation, the domain changes continuously and the model is never reset. The results are summarized in Table 10. We observe that in most cases the performance degradation of our methods is very slight compared to the baseline methods.
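
The long-term protocol can be summarized by the following schematic loop, in which the adaptation state is carried across all domains and all rounds without ever being reset. The function `long_term_adaptation`, its `adapt_step` argument, and the toy step used to exercise it are hypothetical placeholders for an actual TTA update; only the control flow is meant to reflect the protocol described above.

```python
def long_term_adaptation(state, domains, adapt_step, rounds=10):
    """Run `rounds` passes over the domain sequence without resetting `state`.

    `adapt_step(state, batch)` must return the updated state and the
    predictions for `batch`.
    """
    history = []
    for r in range(rounds):
        for domain in domains:            # the domain changes continuously
            for batch in domain:
                state, preds = adapt_step(state, batch)
                history.append((r, preds))
    return state, history


def toy_step(state, batch):
    """Stand-in for a real update: counts the steps taken so far."""
    return state + 1, [state] * len(batch)


toy_domains = [[[1, 2], [3]], [[4]]]      # two domains, three batches per round
final_state, history = long_term_adaptation(0, toy_domains, toy_step, rounds=10)
```

With three batches per round and ten rounds, the toy state is updated 30 times, which makes explicit that errors introduced early can accumulate over the whole stream.
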

| Method | CIFAR-10-C Acc↑ | CIFAR-10-C AUROC↑ | CIFAR-10-C FPR@TPR95↓ | CIFAR-10-C OSCR↑ | CIFAR-100-C Acc↑ | CIFAR-100-C AUROC↑ | CIFAR-100-C FPR@TPR95↓ | CIFAR-100-C OSCR↑ | Average Acc↑ | Average AUROC↑ | Average FPR@TPR95↓ | Average OSCR↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Source [54] | 81.73 | 77.89 | 79.45 | 68.44 | 53.25 | 60.55 | 94.98 | 39.87 | 67.49 | 69.22 | 87.22 | 54.16 |
| BN Adapt [33] | 84.20 | 80.40 | 76.84 | 72.13 | 57.16 | 72.45 | 84.29 | 47.10 | 70.68 | 76.43 | 80.57 | 59.62 |
| CoTTA [46] | 85.77 | 85.89 | 72.40 | 77.26 | 56.46 | 77.04 | 80.96 | 48.95 | 71.12 | 81.47 | 76.68 | 63.11 |
| TENT [44] | 79.38 | 65.39 | 95.94 | 56.73 | 54.74 | 65.00 | 94.79 | 42.24 | 67.06 | 65.20 | 95.37 | 49.49 |
| + UniEnt | 84.31 (+4.93) | 92.28 (+26.89) | 36.74 (-59.20) | 80.32 (+23.59) | 59.07 (+4.33) | 89.28 (+24.28) | 51.14 (-43.65) | 56.26 (+14.02) | 71.69 (+4.63) | 90.78 (+25.59) | 43.94 (-51.43) | 68.29 (+18.81) |
| + UniEnt+ | 84.03 (+4.65) | 93.18 (+27.79) | 32.74 (-63.20) | 80.62 (+23.89) | 58.58 (+3.84) | 91.39 (+26.39) | 41.09 (-53.70) | 56.36 (+14.12) | 71.31 (+4.25) | 92.29 (+27.09) | 36.92 (-58.45) | 68.49 (+19.01) |
| EATA [35] | 80.92 | 84.32 | 71.66 | 72.63 | 60.63 | 88.64 | 50.18 | 57.24 | 70.78 | 86.48 | 60.92 | 64.94 |
| + UniEnt | 84.31 (+3.39) | 97.15 (+12.83) | 13.25 (-58.41) | 82.99 (+10.36) | 59.75 (-0.88) | 93.42 (+4.78) | 30.36 (-19.82) | 57.99 (+0.75) | 72.03 (+1.26) | 95.29 (+8.81) | 21.81 (-39.12) | 70.49 (+5.55) |
| + UniEnt+ | 85.18 (+4.26) | 96.97 (+12.65) | 14.28 (-57.38) | 83.67 (+11.04) | 59.71 (-0.92) | 94.23 (+5.59) | 26.87 (-23.31) | 58.19 (+0.95) | 72.45 (+1.67) | 95.60 (+9.12) | 20.58 (-40.35) | 70.93 (+6.00) |
| OSTTA [27] | 84.44 | 72.74 | 77.02 | 65.17 | 60.03 | 75.37 | 82.75 | 51.35 | 72.24 | 74.06 | 79.89 | 58.26 |
| + UniEnt | 82.46 (-1.98) | 96.20 (+23.46) | 16.37 (-60.65) | 80.51 (+15.34) | 58.69 (-1.34) | 94.84 (+19.47) | 22.95 (-59.80) | 57.28 (+5.93) | 70.58 (-1.66) | 95.52 (+21.47) | 19.66 (-60.23) | 68.90 (+10.64) |
| + UniEnt+ | 84.30 (-0.14) | 97.38 (+24.64) | 11.56 (-65.46) | 82.91 (+17.74) | 58.93 (-1.10) | 95.42 (+20.05) | 20.59 (-62.16) | 57.69 (+6.34) | 71.62 (-0.62) | 96.40 (+22.35) | 16.08 (-63.81) | 70.30 (+12.04) |

(a)

| Method | CIFAR-10-C Acc↑ | CIFAR-10-C AUROC↑ | CIFAR-10-C FPR@TPR95↓ | CIFAR-10-C OSCR↑ | CIFAR-100-C Acc↑ | CIFAR-100-C AUROC↑ | CIFAR-100-C FPR@TPR95↓ | CIFAR-100-C OSCR↑ | Average Acc↑ | Average AUROC↑ | Average FPR@TPR95↓ | Average OSCR↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Source [54] | 81.73 | 77.89 | 79.45 | 68.44 | 53.25 | 60.55 | 94.98 | 39.87 | 67.49 | 69.22 | 87.22 | 54.16 |
| BN Adapt [33] | 84.20 | 80.40 | 76.84 | 72.13 | 57.16 | 72.45 | 84.28 | 47.09 | 70.68 | 76.43 | 80.57 | 59.62 |
| CoTTA [46] | 35.90 | 47.27 | 97.52 | 19.95 | 13.34 | 48.34 | 91.61 | 8.19 | 24.62 | 47.81 | 94.57 | 14.07 |
| TENT [44] | 32.61 | 60.86 | 93.24 | 20.86 | 37.49 | 53.73 | 95.07 | 25.02 | 35.05 | 57.30 | 94.16 | 22.94 |
| + UniEnt | 84.07 (+51.46) | 88.53 (+27.67) | 51.48 (-41.76) | 77.87 (+57.01) | 57.93 (+20.44) | 90.62 (+36.89) | 46.18 (-48.89) | 55.67 (+30.65) | 71.00 (+35.95) | 89.58 (+32.28) | 48.83 (-45.33) | 66.77 (+43.83) |
| + UniEnt+ | 84.17 (+51.56) | 88.21 (+27.35) | 52.57 (-40.67) | 77.75 (+56.89) | 57.92 (+20.43) | 90.63 (+36.90) | 45.10 (-49.97) | 55.59 (+30.57) | 71.05 (+36.00) | 89.42 (+32.13) | 48.84 (-45.32) | 66.67 (+43.73) |
| EATA [35] | 40.94 | 64.52 | 88.41 | 29.07 | 48.75 | 73.26 | 80.83 | 41.27 | 44.85 | 68.89 | 84.62 | 35.17 |
| + UniEnt | 81.22 (+40.28) | 91.05 (+26.53) | 30.59 (-57.82) | 76.42 (+47.35) | 57.07 (+8.32) | 98.59 (+25.33) | 5.85 (-74.98) | 56.70 (+15.43) | 69.15 (+24.30) | 94.82 (+25.93) | 18.22 (-66.40) | 66.56 (+31.39) |
| + UniEnt+ | 80.41 (+39.47) | 92.49 (+27.97) | 30.00 (-58.41) | 77.00 (+47.93) | 58.02 (+9.27) | 98.05 (+24.79) | 7.92 (-72.91) | 57.47 (+16.20) | 69.22 (+24.37) | 95.27 (+26.38) | 18.96 (-65.66) | 67.24 (+32.07) |
| OSTTA [27] | 83.83 | 71.93 | 76.12 | 63.90 | 57.39 | 75.46 | 82.47 | 49.61 | 70.61 | 73.70 | 79.30 | 56.76 |
| + UniEnt | 80.74 (-3.09) | 88.94 (+17.01) | 35.66 (-40.46) | 74.52 (+10.62) | 56.13 (-1.26) | 95.20 (+19.74) | 21.15 (-61.32) | 54.89 (+5.28) | 68.44 (-2.18) | 92.07 (+18.38) | 28.41 (-50.89) | 64.71 (+7.95) |
| + UniEnt+ | 82.42 (-1.41) | 90.15 (+18.22) | 31.18 (-44.94) | 76.46 (+12.56) | 57.45 (+0.06) | 95.91 (+20.45) | 17.33 (-65.14) | 56.32 (+6.71) | 69.94 (-0.67) | 93.03 (+19.34) | 24.26 (-55.04) | 66.39 (+9.64) |

(b)

Table 10: Results of different methods on CIFAR benchmarks.

Effects of learning rate and batch size.

We explore the impact of learning rate and batch size on our approaches in Table 11. A learning rate that is too large or too small can hurt performance, while a larger batch size results in better performance. Compared to TENT [44] and EATA [35], our methods are more robust to learning rate and batch size. Nonetheless, our methods share the same limitation as the baseline methods: they rely on a large batch size to estimate the distribution accurately. Moreover, we observe that OSTTA [27] is less sensitive to learning rate and batch size.

| Method | Learning rate 0.005 | Learning rate 0.001 | Learning rate 0.0005 | Learning rate 0.0001 | Δ |
| --- | --- | --- | --- | --- | --- |
| Source [54] | 39.87 | 39.87 | 39.87 | 39.87 | 0.00 |
| BN Adapt [33] | 47.10 | 47.10 | 47.10 | 47.10 | 0.00 |
| TENT [44] | 10.60 | 42.24 | 42.38 | 48.36 | 37.76 |
| + UniEnt | 53.82 (+43.22) | 56.20 (+13.96) | 56.06 (+13.68) | 54.51 (+6.15) | 2.38 |
| + UniEnt+ | 54.44 (+43.84) | 56.36 (+14.12) | 56.27 (+13.89) | 54.65 (+6.29) | 1.92 |
| EATA [35] | 40.96 | 57.00 | 56.91 | 53.60 | 16.04 |
| + UniEnt | 49.36 (+8.40) | 57.76 (+0.76) | 57.10 (+0.19) | 53.63 (+0.03) | 8.40 |
| + UniEnt+ | 49.05 (+8.09) | 58.07 (+1.07) | 57.39 (+0.48) | 53.40 (-0.20) | 9.02 |
| OSTTA [27] | 49.43 | 51.35 | 51.98 | 52.37 | 2.94 |
| + UniEnt | 51.41 (+1.98) | 56.93 (+5.58) | 57.22 (+5.24) | 55.58 (+3.21) | 5.81 |
| + UniEnt+ | 53.39 (+3.96) | 57.69 (+6.34) | 57.68 (+5.70) | 56.06 (+3.69) | 4.30 |

(c)

| Method | Batch size 64 | Batch size 32 | Batch size 16 | Batch size 8 | Δ |
| --- | --- | --- | --- | --- | --- |
| Source [54] | 39.87 | 39.87 | 39.87 | 39.87 | 0.00 |
| BN Adapt [33] | 46.38 | 45.25 | 42.94 | 38.61 | 7.77 |
| TENT [44] | 33.27 | 8.10 | 2.51 | 0.95 | 32.32 |
| + UniEnt | 55.17 (+21.90) | 53.05 (+44.95) | 48.87 (+46.36) | 31.47 (+30.52) | 23.70 |
| + UniEnt+ | 55.17 (+21.90) | 53.13 (+45.03) | 49.27 (+46.76) | 28.35 (+27.40) | 26.82 |
| EATA [35] | 53.09 | 47.78 | 40.57 | 31.57 | 21.52 |
| + UniEnt | 57.08 (+3.99) | 54.52 (+6.74) | 50.71 (+10.14) | 43.89 (+12.32) | 13.19 |
| + UniEnt+ | 56.79 (+3.70) | 54.29 (+6.51) | 50.49 (+9.92) | 43.17 (+11.60) | 13.62 |
| OSTTA [27] | 50.35 | 48.82 | 46.07 | 39.75 | 10.60 |
| + UniEnt | 54.54 (+4.19) | 50.49 (+1.67) | 44.97 (-1.10) | 36.72 (-3.03) | 17.82 |
| + UniEnt+ | 55.76 (+5.41) | 52.66 (+3.84) | 47.94 (+1.87) | 41.45 (+1.70) | 14.31 |

(d)

Table 11: OSCR of different methods on CIFAR-100-C with diverse learning rates and batch sizes. Δ is the difference between the largest and smallest values. Smaller Δ values represent better robustness.
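
As a worked example of the Δ column, the snippet below recomputes it for two rows of the learning-rate table, using the OSCR values reported there. The helper name `robustness_gap` is ours and is only used for illustration.

```python
def robustness_gap(values):
    """Difference between the largest and smallest value in a row."""
    return max(values) - min(values)


# OSCR at learning rates 0.005, 0.001, 0.0005, 0.0001 (learning-rate part of Table 11).
tent_lr = [10.60, 42.24, 42.38, 48.36]
tent_unient_lr = [53.82, 56.20, 56.06, 54.51]

print(round(robustness_gap(tent_lr), 2))         # 37.76
print(round(robustness_gap(tent_unient_lr), 2))  # 2.38
```

The gap shrinks from 37.76 for TENT to 2.38 for TENT + UniEnt, which is the sense in which the combined method is more robust to the learning rate.
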

Effects of OOD score.

By default, we use the energy score [32] to measure the model's detection performance on csOOD data. Table 12 additionally reports results with MSP [15] and Max Logit [19], from which we make two observations. First, our methods consistently improve the performance regardless of the OOD score. Second, for TENT, EATA, OSTTA, and their UniEnt and UniEnt+ variants, Max Logit and Energy yield higher OSCR than MSP.

| Method | MSP [15] | Max Logit [19] | Energy [32] | Δ |
| --- | --- | --- | --- | --- |
| Source [54] | 39.65 | 40.24 | 39.87 | 0.59 |
| BN Adapt [33] | 48.75 | 48.04 | 47.10 | 1.65 |
| CoTTA [46] | 49.44 | 49.73 | 48.99 | 0.74 |
| TENT [44] | 36.86 | 41.79 | 42.24 | 5.38 |
| + UniEnt | 55.42 (+18.56) | 56.20 (+14.41) | 56.26 (+14.02) | 0.84 |
| + UniEnt+ | 55.24 (+18.38) | 56.31 (+14.52) | 56.36 (+14.12) | 1.12 |
| EATA [35] | 55.20 | 57.52 | 57.55 | 2.35 |
| + UniEnt | 56.94 (+1.74) | 57.88 (+0.36) | 57.87 (+0.32) | 0.94 |
| + UniEnt+ | 57.37 (+2.17) | 58.33 (+0.81) | 58.33 (+0.78) | 0.96 |
| OSTTA [27] | 49.14 | 51.42 | 51.35 | 2.28 |
| + UniEnt | 56.52 (+7.38) | 57.23 (+5.81) | 57.25 (+5.90) | 0.73 |
| + UniEnt+ | 57.12 (+7.98) | 57.69 (+6.27) | 57.69 (+6.34) | 0.57 |

Table 12: OSCR of different methods on CIFAR-100-C using diverse OOD scores.
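
For reference, the three OOD scores compared above can be computed from a logit vector as in the sketch below. It follows the standard definitions of MSP, Max Logit, and the energy score, and assumes the convention that a higher score indicates a more in-distribution sample (so the energy score is returned as the negative free energy); the function names and the `temperature` default are our own choices.

```python
import math


def msp(logits):
    """Maximum softmax probability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return max(exps) / sum(exps)


def max_logit(logits):
    """Largest unnormalized logit."""
    return max(logits)


def energy_score(logits, temperature=1.0):
    """Negative free energy: T * logsumexp(logits / T), computed stably."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    return temperature * (m + math.log(sum(math.exp(v - m) for v in scaled)))


confident = [2.0, 0.0, 0.0]   # toy logits with one dominant class
uniform = [0.0, 0.0, 0.0]     # toy logits with no preferred class

scores_confident = (msp(confident), max_logit(confident), energy_score(confident))
scores_uniform = (msp(uniform), max_logit(uniform), energy_score(uniform))
```

All three scores are larger for the confident logits than for the uniform ones. Unlike MSP, Max Logit and Energy are not normalized by the softmax, so they retain information about the magnitude of the logits.
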