Open-World Test-Time Training: Self-Training with Contrast Learning

Houcheng Su, Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; Mengzhu Wang, Hebei University of Technology, Tianjin, China; Jiao Li, University of Electronic Science and Technology of China, Chengdu, China; Bingli Wang, Sichuan Agricultural University, Yaan, China; Daixian Liu, Sichuan Agricultural University, Yaan, China; and Zeheng Wang, Harbin Engineering University, Harbin, China
Abstract.

Traditional test-time training (TTT) methods address domain shift but usually assume that the source and target domains share the same class set, which limits their applicability in real-world scenarios with effectively unbounded variety. Open-World Test-Time Training (OWTTT) addresses the challenge of generalizing deep learning models to unknown target domain distributions, especially in the presence of strong Out-of-Distribution (OOD) data, under which existing TTT methods often struggle to maintain performance. Work on OWTTT has so far focused mainly on separating strong OOD data from weak OOD data as a whole. However, during the early stages of TTT, initial feature extraction is hampered by interference from strong OOD data and corruptions, resulting in diminished contrast and the premature classification of certain classes as strong OOD. To address this, we introduce Open World Dynamic Contrastive Learning (OWDCL), an approach that uses contrastive learning to augment positive sample pairs. This strategy not only strengthens contrast in the early stages but also improves model robustness in subsequent stages. On the comparison datasets, OWDCL achieves state-of-the-art performance.

Open World, Test-time Training, Self-Training, Contrast Learning

1. Introduction

Deep neural networks (DNNs) have demonstrated remarkable performance across many application scenarios with well-prepared datasets (Amodei et al., 2016; He et al., 2016; Liu et al., 2021b). These successes typically hinge on the assumption of independent and identically distributed (i.i.d.) data, meaning that training and testing data are drawn from the same distribution. However, in real-world settings, satisfying this requirement is impractical (Mirza et al., 2023). For instance, a self-driving model built on this assumption may fail due to unpredictable elements like fog, snow, rain, rare traffic incidents, or unusual obstacles like sandstorms and people in strange costumes. In medical diagnosis, variance in equipment noise and the diverse physiological characteristics of patients may compromise the model’s efficacy.

Figure 1. In an experimental setup involving 15 types of corruption within the ImageNet-C dataset and employing the MNIST dataset as a benchmark for Strong OOD analysis, we conduct a performance comparison between OWDCL and OWTTT.

In real-world scenarios, the i.i.d. assumption often collapses due to variable noise from different device sensors, weather, and climate conditions, leading to a domain shift between the training and test sets. This shift results in models performing well on training data but failing on real-world test data (Hendrycks and Dietterich, 2019). Addressing this discrepancy is crucial for developing robust models capable of handling real-world variability.

Table 1. Characteristics of problem settings that adapt a trained model to a potentially shifted test domain. ‘Offline’ adaptation assumes access to the entire source or target dataset, while ‘Online’ adaptation can automatically predict a single incoming test sample or a batch of them.
Setting | Source | Target | Train Loss | Test Loss | Offline | Online | Strong OOD
Fine-tuning | | $x^t, y^t$ | $\mathcal{L}(x^s, y^s)$ | - | | |
Unsupervised Domain Adaptation | $x^s, y^s$ | $x^t$ | $\mathcal{L}(x^s, y^s) + \mathcal{L}(x^s, x^t)$ | - | | |
Universal Domain Adaptation | $x^s, y^s$ | $x^t$ | $\mathcal{L}(x^s, y^s) + \mathcal{L}(x^s)$ | - | | |
Domain Generalization | $x^s, y^s$ | | $\mathcal{L}(x^s, y^s)$ | - | | |
Source-free Domain Adaptation | | $x^t$ | $\mathcal{L}(x^s, x^t)$ | - | | |
Test-time training (TTT) | $x^s, y^s$ | $x^t$ | $\mathcal{L}(x^s, y^s) + \mathcal{L}(x^s)$ | $\mathcal{L}(x^t)$ | | |
Test-time adaptation (TTA) | | $x^t$ | | $\mathcal{L}(x^t)$ | | |
Open-World Test-time training (OWTTT) | $x^s, y^s$ | $x^t$ | $\mathcal{L}(x^s, y^s) + \mathcal{L}(x^s)$ | $\mathcal{L}(x^t)$ | | |

In practical scenarios, target domain data is often unavailable until inference, necessitating immediate, reliable test data predictions without extra interventions. This is vital in time-sensitive or resource-limited settings where rapid adaptation is key. Test-time training/adaptation (TTT/TTA) tackles this by rapidly reducing domain shift and boosting model performance, using unlabeled target domain data during inference (Liu et al., 2021a; Wang et al., 2020; Sun et al., 2020). Recent TTT advancements show promise, employing meta-learning (Bartler et al., 2022) for swift task adaptation, student-teacher frameworks (Sinha et al., 2023) for knowledge distillation under domain shift, and adversarial sample techniques (Croce et al., 2022) for enhanced robustness and adaptability.

Nevertheless, traditional TTT methods mostly rely on the assumption that while there is a domain shift between source and target domains, they share the same class set. However, in the real world, a limited source domain cannot possibly encompass the infinite variety of real-world scenes (Scheirer et al., 2012; Bendale and Boult, 2015, 2016; Geng et al., 2020). To better align with real-world complexities, the focus of TTT is shifting towards addressing domain shifts within the context of Open-World scenarios. In such scenarios, TTT methods must contend with continually evolving distributions. More importantly, they need to recognize and adapt to strong OOD data, such as unprecedented events or entities, rather than merely adjusting to weaker, more predictable shifts like common corruptions (weak OOD data) (Li et al., 2023). For example, while self-driving cars might be trained to recognize the sight of brown bears on the road, they might not anticipate encountering a panda that has escaped from a zoo. Such unpredicted occurrences exemplify the strong OOD data that pose significant challenges in Open-World settings.

TTT methods, relying on unlabeled target domain data to address domain shifts during testing, may struggle with varying degrees of strong OOD data. Recent OWTTT advancements tackle this by dynamically expanding prototypes based on the source domain’s feature distribution, improving the distinction between weak and strong OOD data (Li et al., 2023). However, a key prerequisite for these methods is the model’s ability to initially extract features from weak OOD data. Without this, weak OOD data, potentially indistinguishable from strong OOD under significant domain shifts, may be mistakenly treated as noise, leading to its misclassification as strong OOD during the TTT phase.

In this paper, we tackle the challenge of initial domain shifts during testing, where the model encounters a scarcity of positive samples, often leading to misclassification of weak OOD data as strong OOD noise. Inspired by contrastive learning, we propose that augmented samples should maintain the same feature distribution as their originals. To address early TTT stage challenges, where samples lacking contrast are indistinguishable from strong OOD, our approach employs simple data augmentation to generate positive sample pairs. We incorporate the NT-XENT contrastive learning loss function, using these pairs to aid the model’s adaptation and prevent premature classification of classes as strong OOD due to initial feature extraction challenges. Subsequently, we align these pairs with the source domain class cluster centers, enhancing our method’s robustness and enabling basic clustering for strong OODs. We term this methodology Open World Dynamic Contrastive Learning (OWDCL).

The contributions of this paper are as follows:

  • In open-world TTT, our method effectively solves the problem of inaccurate classification of weak OOD samples due to lack of contrast.

  • Our approach is the first work to introduce contrastive learning to reduce domain shifts in open-world TTT problems.

  • OWDCL exhibits superior performance compared to existing state-of-the-art models across a variety of datasets.

2. Related Work

2.1. Unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) (Ganin and Lempitsky, 2015; Wang and Deng, 2018; Liu et al., 2022) aims to adapt models trained on a source domain to unlabeled target domain data. UDA typically employs strategies like difference loss (Long et al., 2015), adversarial training (Ganin and Lempitsky, 2015), and self-supervised training (Liu et al., 2021c) to learn invariant properties across domains. Despite considerable progress in enhancing target domain generalizability, UDA’s reliance on both source and target domains during adaptation is often impractical, e.g., due to data privacy concerns. Consequently, source-free domain adaptation (Xia et al., 2021; Liu et al., 2021d; Yang et al., 2021; Kundu et al., 2020) has emerged, eliminating the need for source domain data and relying solely on a pre-trained model and target domain data.

Figure 2. Overall framework of our model OWDCL. (1) $\mathcal{L}_{ps}$: improves the feature extraction ability of the model by contrasting each sample with its augmented counterpart. (2) $\mathcal{L}_{cs}$: optimizes classification accuracy through a comprehensive comparison between the augmented sample pair and the class centroids.

2.2. Test-Time Training

In scenarios requiring adaptation to arbitrary unknown target domains with low inference latency and without source domain data access, Test-Time Training/Adaptation (TTT/TTA) (Liu et al., 2021a; Wang et al., 2020; Sun et al., 2020) has emerged as a new paradigm. TTT/TTA can be achieved not only by adjusting model weights to align features with the source domain distribution (Liu et al., 2021a; Su et al., 2022) but also through self-training that reinforces model predictions on unlabeled data (Wang et al., 2020; Chen et al., 2022; Su et al., 2023; Niu et al., 2022). However, TTT/TTA, limited by the absence of target domain labels, often relies on summarizing the target domain’s feature distribution to approximate and align with the correct source domain distribution, enhancing model performance. This approach, while reducing uncertainty, is prone to errors, especially under strong OOD interference in open-world scenarios (Li et al., 2023).

2.3. Open-Set Domain Adaptation

To address open-world scenarios, Open-Set Domain Adaptation (OSDA) has been proposed (Panareda Busto and Gall, 2017). Existing OSDA methods include strategies like transforming logits of unknown class samples into a recognizable constant (Saito et al., 2018), and defining and maximizing the distance between the open set and the closed set (Panareda Busto and Gall, 2017). Additionally, Universal Adaptation Network (UAN) approaches consider scenarios where unknown classes exist in both source and target domains (You et al., 2019). Further, in scenarios lacking access to source domain data, Universal source-free Domain Adaptation has been explored (Kundu et al., 2020). Research on open-world test-time training (OWTTT) remains scarce (Li et al., 2023). In particular, little work addresses the loss of weak OOD accuracy caused by the initial model's limited feature extraction ability.

3. Methods

3.1. Problem Formulation

Test-time training aims to adapt a model pre-trained on the source domain to a target domain that may be subject to a distribution shift from the source domain. We denote the source domain data as $\mathcal{X}_s$ and the target domain data as $\mathcal{X}_t$. We also define the source label set as $Y_s = \{1, 2, \ldots, m\}$, the strong OOD label set as $Y_{str} = \{m+1, \ldots, m+n\}$, and the target label set as $Y_t = Y_s \cup Y_{str}$.

To clarify, we define weak Out-of-Distribution (weak OOD) as those classes that align with the source domain yet are subjected to alterations like noise or other forms of corruption. In contrast, strong Out-of-Distribution (strong OOD) encompasses categories that are entirely new and distinct from those of the source domain.

Before the TTT stage, we extract the features of the source domain $\mathcal{X}_s$ with the pre-trained model $f(\cdot)$ and summarize the per-class feature distribution of the source domain as $\mathcal{D}_s = \{d^s_1, \ldots, d^s_m\}$. Once the TTT stage begins, we apply data augmentation to each sample $x_i$ to obtain its positive counterpart $x'_i$; the two share the same label $y_i \in Y_t$. Given the threshold $\tau$, the label of $x_i$ is determined from $\mathcal{D}_s$ and the comprehensive comparison between $x_i$ and $x'_i$. If the sample does not fall in $\mathcal{D}_s$, it is assigned to $\mathcal{D}_{str} = \{d^{str}_{m+1}, \ldots, d^{str}_{m+n}\}$. Since no labels are available in open-world TTT, we assign a pseudo-label $\hat{y}_i \in Y_t$ to each sample $x_i$.

3.2. Overall Test-Time Training Framework

In comparison with Test-Time Adaptation, Test-Time Training allows for the use of a subset of source domain data. However, due to the requirement for low latency, it does not permit access to the entire source domain dataset. Considering this constraint and the demonstrated effectiveness of cluster structures in domain adaptation tasks (Saito et al., 2018), their application is maintained in open-world TTT (Li et al., 2023). Feature extraction from the source domain $\mathcal{X}_s$ is performed using the pre-trained model $f(\cdot)$. The cluster center of each class is defined as follows:

(1) $d_m = \frac{1}{M} \sum_{i=1}^{M} f(x_i), \quad y_i \in Y_s$

Here, $M$ represents the number of samples for a class in the source domain.
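As a minimal sketch of Eq. (1), the snippet below averages the source features of each class to obtain one cluster center per class. It assumes the features $f(x_i)$ have already been extracted and are stored as plain Python lists; the function name `class_centroids` and the toy values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def class_centroids(features, labels):
    """Return {label: mean feature vector} over the samples of each class."""
    sums = {}
    counts = defaultdict(int)
    for feat, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
        # Accumulate the per-dimension sum of features for this class.
        sums[label] = [s + x for s, x in zip(sums[label], feat)]
        counts[label] += 1
    # Divide each sum by M, the number of samples of that class.
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


# Toy source features f(x_i) for two classes (made-up values).
features = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
labels = [1, 1, 2]
centroids = class_centroids(features, labels)
```

Because only these per-class means are kept, the full source dataset does not need to be available during the TTT stage.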

In open-world test-time training, existing research (Li et al., 2023) shows excellent performance in most scenarios. However, in certain cases, while the discrimination of strong OOD instances improves, there is a noticeable decline in handling weak OOD instances, as illustrated in Figure 1.

At the onset of TTT, some classes are ineffectively classified, with accuracy deteriorating as TTT progresses. This is common in TTT/TTA, where models, lacking target domain labels and facing corruption interference, often use entropy-like methods to minimize output confusion (Wang et al., 2020; Su et al., 2023). Ineffective initial feature extraction of specific classes leads to misclassification as noise. This challenge is exacerbated in open-world TTT, compounded by corruption and strong OOD disturbances, making the unsupervised process more complex.

Current research often fails to enhance feature extraction capabilities for each sample, focusing instead on differentiating between strong and weak OOD scenarios. We believe this issue originates from early model stages, where the absence of labels and class corruption hinders effective feature extraction, lacking necessary comparison and feedback.

Inspired by contrastive learning (He et al., 2020; Chen et al., 2020; Chen and He, 2021), we use simple data augmentation techniques to build a second view of each input sample. Complex augmentations, such as contrast and brightness adjustments, can impede model convergence when combined with already corrupted data. Therefore, for $x_i$, we employ flipping and a random rotation ranging from 0 to 30%, resulting in augmented data $x'_i$. We deliberately opt for simple rather than novel or complex augmentations to facilitate contrastive learning with sample pairs. Our experiments show that several sets of basic augmentations yield similar effects; specifically, a combination of vertical flipping and rotation within 0-15/45 degrees appears to be most effective. We advise against using contrast adjustments or adding other forms of noise as augmentation, because weak OOD samples may already exhibit such corruptions and complex augmentations could lead to convergence difficulties during testing.
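The augmentation described above can be sketched as follows. This is an illustrative, dependency-free version that operates on a single-channel image stored as a list of rows; the helper names, the nearest-neighbor rotation, the 0.5 flip probability, and the 15-degree default are assumptions made for the example, not settings reported in the paper.

```python
import math
import random


def vertical_flip(image):
    """Flip a 2D image (list of rows) top to bottom."""
    return image[::-1]


def rotate(image, degrees, fill=0.0):
    """Rotate a 2D image about its center using nearest-neighbor sampling."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse-map each output pixel back to its source location.
            dx, dy = x - cx, y - cy
            sx = cos_t * dx + sin_t * dy + cx
            sy = -sin_t * dx + cos_t * dy + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = image[iy][ix]
    return out


def augment(image, max_degrees=15.0, rng=random):
    """Build the second view x'_i: optional vertical flip, then a small rotation."""
    view = vertical_flip(image) if rng.random() < 0.5 else [row[:] for row in image]
    return rotate(view, rng.uniform(0.0, max_degrees))
```

Neither operation changes contrast, brightness, or noise level, which is in line with the recommendation to avoid augmentations that resemble the corruptions already present in weak OOD samples.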

The following hypothesis is proposed: for a sample $x_i$ and its augmented counterpart $x'_i$, the model $f'(\cdot)$, obtained by iteratively updating the pre-trained model $f(\cdot)$ during the Test-Time Training (TTT) process, is conjectured to satisfy the following relation:

(2) $f'(x_i) = f'(x'_i)$

Based on this hypothesis, we implement contrastive alignment by positive sample pairs and contrastive alignment by cluster and sample pairs, and the overall framework is depicted in Figure 2.

3.3. Contrastive Alignment by Positive Sample Pairs

For each sample $x_i$ and its augmented counterpart $x'_i$ in the current batch, we extract features $f'(x_i)$ and $f'(x'_i)$ using the model $f'(\cdot)$. The first step involves normalizing these features with the L2 norm, calculated as:

(3) $\|\mathbf{v}\|_2 = \sqrt{v_1^2 + v_2^2 + \ldots + v_n^2}$

The result post-normalization using the L2 norm is articulated as:

(4) $v_i = \frac{f'(x_i)}{\sqrt{\sum_{i=1}^{B} f'(x_i)^2}}, \quad v'_i = \frac{f'(x'_i)}{\sqrt{\sum_{i=1}^{B} f'(x'_i)^2}}$

Here, $B$ is the number of samples in the current batch.
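A minimal sketch of the normalization step is given below. It implements the L2 norm of Eq. (3) and divides each feature vector by its own norm, which is the usual reading of L2 normalization in contrastive learning; note that Eq. (4) is written with a sum over the batch index, so this per-vector form is an assumption. The `eps` guard against division by zero is also an addition for the example.

```python
import math


def l2_norm(v):
    """Eq. (3): square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))


def l2_normalize(v, eps=1e-12):
    """Scale a feature vector to unit L2 length (eps avoids division by zero)."""
    n = max(l2_norm(v), eps)
    return [x / n for x in v]


# Toy feature vector standing in for f'(x_i) (made-up values).
v_i = l2_normalize([3.0, 4.0])
```

After this step, the dot product of two normalized features equals their cosine similarity, which is what the similarity terms in the loss rely on.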

We then compute the similarity among pairs of positive samples within the normalized vectors as follows:

(5) $\mathcal{S}(v_i, v'_j)_{pos} = \exp\left(\frac{\sum_{i,j=1}^{B} v_i \cdot v'_j}{\gamma_1}\right)$

Here, $\gamma_1$ represents the temperature normalization factor, which scales the outcome.

Following this, the similarity among pairs of negative samples is also computed, employing a distinct formula, which is delineated below:

(6) $\mathcal{S}(v_i, v'_j)_{neg} = \exp\left(\frac{v_i \cdot {v'_j}^{T}}{\gamma_1}\right), \quad \mathcal{S}(v'_i, v_j)_{neg} = \exp\left(\frac{v'_i \cdot v_j^{T}}{\gamma_1}\right)$

Finally, using the similarities of both positive and negative sample pairs, we optimize the Normalized Temperature-Scaled Cross-Entropy Loss (NT-XENT) (Chen et al., 2020). This loss captures relations between data points in the absence of labeled data, while avoiding comparisons between identical samples. The final loss formulation for the initial phase is expressed as:

(7) $\mathcal{L}_{ps} = -\alpha_1 \left( \log\left(\frac{\mathcal{S}(v_i, v'_j)_{pos}}{\sum_{k \neq i}^{B} \mathcal{S}(v'_i, v_k)_{neg} + \mathcal{S}(v_i, v'_j)_{pos}}\right) + \log\left(\frac{\mathcal{S}(v_i, v'_j)_{pos}}{\sum_{k \neq j}^{B} \mathcal{S}(v'_k, v_j)_{neg} + \mathcal{S}(v_i, v'_j)_{pos}}\right) \right)$

Here, $\alpha_{1}$ is a hyperparameter that adjusts the impact magnitude of the loss.

Optimizing the $\mathcal{L}_{ps}$ loss function enables the model to defer classifying a class as strong OOD until it has effectively extracted features from that class's samples. This approach enhances the efficacy of each sample within the weak OOD class, ensuring more precise and discriminative feature extraction.

3.4. Contrastive Alignment by Cluster and Sample Pairs

For each sample $x_{i}$, the strong OOD score is quantified by its similarity to the nearest cluster center $d_{k}$ in the source domain, where $\langle\cdot,\cdot\rangle$ denotes cosine similarity. The score is defined as follows:

$$os_{i}=1-\max_{d_{k}\in\mathcal{D}_{s}}\left\langle f'(x_{i}),d_{k}\right\rangle \tag{8}$$
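The score in Eq. (8) can be illustrated with a minimal pure-Python sketch. The helper names `cosine` and `strong_ood_score` and the toy two-dimensional prototypes are hypothetical and chosen only for illustration; nonzero feature vectors are assumed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def strong_ood_score(feature, prototypes):
    """os_i = 1 - max_k <f'(x_i), d_k> over the source cluster centers."""
    return 1.0 - max(cosine(feature, d) for d in prototypes)

# Toy source cluster centers d_k (hypothetical values).
prototypes = [[1.0, 0.0], [0.0, 1.0]]

print(strong_ood_score([2.0, 0.0], prototypes))    # aligned with a center: score 0
print(strong_ood_score([-1.0, -1.0], prototypes))  # far from every center: score above 1
```

A feature that points along a source cluster center receives a score near 0, while a feature dissimilar to every center receives a larger score, which is what makes the score usable for separating weak and strong OOD samples.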

Drawing on insights from prior research, we establish the optimal threshold as the demarcation that distinguishes between two distinct distribution patterns. This approach is conceptualized as classifying outliers into two separate clusters, which can be delineated as follows:

$$N^{+}=\sum_{i}\mathbbm{1}(os_{i}>\tau),\qquad N^{-}=\sum_{i}\mathbbm{1}(os_{i}\leq\tau) \tag{9}$$

Here, $\mathbbm{1}(\cdot)$ is the indicator function. The optimal threshold $\tau^{*}$ is identified by optimizing:

$$\min_{\tau}\ \frac{1}{N^{+}}\sum_{i}\Big[os_{i}-\frac{1}{N^{+}}\sum_{j}\mathbbm{1}(os_{j}>\tau)\,os_{j}\Big]^{2}+\frac{1}{N^{-}}\sum_{i}\Big[os_{i}-\frac{1}{N^{-}}\sum_{j}\mathbbm{1}(os_{j}\leq\tau)\,os_{j}\Big]^{2} \tag{10}$$

To ensure a stable estimate of the outlier distribution, the distribution is updated as an exponential moving average with a length of $N_{a}$. The threshold $\tau$ is searched over the range from 0 to 1 with a step size of 0.01.
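The threshold search can be sketched as a simple grid search. This is a minimal illustration under stated assumptions: the objective is read as the sum of the within-group variances of the scores above and below the threshold, degenerate splits with an empty group are skipped, and the moving-average update of the score distribution is omitted. The function names are hypothetical.

```python
def within_variance(values):
    """Mean squared deviation of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def find_threshold(scores, step=0.01):
    """Grid-search tau in [0, 1] that minimizes the summed within-group variance."""
    best_tau, best_cost = None, float("inf")
    n_steps = int(round(1.0 / step))
    for s in range(n_steps + 1):
        tau = s * step
        high = [v for v in scores if v > tau]   # candidate strong OOD group
        low = [v for v in scores if v <= tau]   # candidate weak OOD group
        if not high or not low:
            continue  # assumption of this sketch: ignore one-sided splits
        cost = within_variance(high) + within_variance(low)
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau

# Toy bimodal scores: three low values and three high values.
scores = [0.10, 0.12, 0.15, 0.80, 0.85, 0.90]
tau_star = find_threshold(scores)
print(tau_star)  # a value that separates the two groups
```

Because every threshold between the two modes yields the same cost, the sketch returns the first such value on the grid; any tie-breaking rule would serve equally well here.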

Upon confirming the effective feature extraction of class samples, resulting in $f'(x_{i})$ and $f'(x'_{i})$, we obtain the feature distribution $\mathcal{D}_{s}$ of the weak OOD in the source domain, ascertained during the pre-TTT stage.

For handling weak OOD samples, we employ a strategy that integrates the contrastive learning loss NT-XENT with negative log-likelihood loss. This approach aims to embed the test sample $x_{i}$ nearer to the cluster center of its respective class while distancing it from the cluster centers of other classes. The formulation of the negative log-likelihood loss is detailed below:

$$\mathcal{L}^{wea}_{PC}=-\sum_{k\in Y_{s}}\mathbbm{1}(\hat{y}=k)\log\frac{\exp\left(\langle d_{k},f'(x_{i})\rangle/\delta\right)}{\sum_{l}\exp\left(\langle d_{l},f'(x_{i})\rangle/\delta\right)} \tag{11}$$

where $\delta$ is a hyperparameter, set to 0.1 in all experiments.
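As a rough illustration, the negative log-likelihood over cluster centers can be computed for a single sample as a temperature-scaled softmax over cosine similarities. The function names and toy prototypes below are hypothetical; only the temperature value 0.1 comes from the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def prototype_nll(feature, prototypes, pseudo_label, delta=0.1):
    """-log softmax_k(<d_k, f'(x_i)> / delta) evaluated at the pseudo-label."""
    logits = [cosine(d, feature) / delta for d in prototypes]
    m = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[pseudo_label]

prototypes = [[1.0, 0.0], [0.0, 1.0]]  # toy cluster centers
feature = [1.0, 0.0]                   # toy test feature
print(prototype_nll(feature, prototypes, pseudo_label=0))  # small: feature matches center 0
print(prototype_nll(feature, prototypes, pseudo_label=1))  # large: feature is far from center 1
```

Minimizing this quantity pulls the feature toward the center of its pseudo-labelled class and away from the other centers, which is the behaviour the paragraph above describes.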

To bolster the robustness of sample classification and streamline the computation, the feature distribution for the current batch is quantified based on pseudo-labels $\hat{y}=k$. The corresponding formula is as follows:

$$d_{k}^{c}=\frac{1}{2K}\sum_{i=1}^{K}\left(f'(x_{i})+f'(x'_{i})\right) \tag{12}$$

In the current batch, there are $K$ sample pairs in class $k$, and their average feature distribution is $d_{k}^{c}$.
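A minimal sketch of this per-class averaging is given below, assuming each sample comes with the feature of its augmented view and a pseudo-label. The function name `batch_centers` and the toy values are hypothetical.

```python
def batch_centers(feats, aug_feats, pseudo_labels):
    """Average the original and augmented features of each pseudo-labelled class."""
    sums, counts = {}, {}
    for f, f_aug, k in zip(feats, aug_feats, pseudo_labels):
        acc = sums.setdefault(k, [0.0] * len(f))
        for idx, (a, b) in enumerate(zip(f, f_aug)):
            acc[idx] += a + b               # accumulate f'(x_i) + f'(x'_i)
        counts[k] = counts.get(k, 0) + 1    # number of sample pairs in class k
    # Divide by twice the pair count, since every pair contributes two features.
    return {k: [s / (2 * counts[k]) for s in acc] for k, acc in sums.items()}

feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]      # toy features f'(x_i)
aug_feats = [[1.0, 2.0], [3.0, 2.0], [0.0, 4.0]]  # toy features of augmented views
labels = [0, 0, 1]                                # toy pseudo-labels

print(batch_centers(feats, aug_feats, labels))  # {0: [2.0, 1.0], 1: [0.0, 3.0]}
```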

Initially, positive sample pairs are normalized employing the L2 norm. The specific formula utilized for this normalization is detailed below:

$$v^{c}_{i}=\frac{d_{i}^{c}}{\sqrt{\sum_{i=1}^{M}(d_{i}^{c})^{2}}},\qquad v^{s}_{i}=\frac{d_{i}^{s}}{\sqrt{\sum_{i=1}^{M}(d_{i}^{s})^{2}}} \tag{13}$$

Using normalized vectors $v^{c}_{i}$ and $v^{s}_{i}$, the NT-XENT loss is computed:

$$\mathcal{L}_{NT}=-\alpha_{2}\Bigg(\log\frac{\mathcal{S}(v^{c}_{i},v^{s}_{j})_{pos}}{\sum_{k\neq i}^{M}\mathcal{S}(v^{c}_{k},v^{s}_{j})_{neg}+\mathcal{S}(v^{c}_{i},v^{s}_{j})_{pos}}+\log\frac{\mathcal{S}(v^{c}_{i},v^{s}_{j})_{pos}}{\sum_{k\neq j}^{M}\mathcal{S}(v^{c}_{i},v^{s}_{k})_{neg}+\mathcal{S}(v^{c}_{i},v^{s}_{j})_{pos}}\Bigg) \tag{14}$$

$\alpha_{2}$ adjusts the impact magnitude of the loss. The similarity computation incorporates a temperature normalization factor $\gamma_{2}$, which sets the scale of the similarity measures within the model.
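The cluster-level contrastive loss can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the text: the similarity is taken to be the exponential of the temperature-scaled dot product of unit vectors, the positive pair for a batch center is the source center with the same class index, the per-class terms are averaged, and the default values of `alpha` and `gamma` are placeholders.

```python
import math

def unit(v):
    """Scale a nonzero vector to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def sim(u, v, gamma):
    """Assumed similarity: exp(dot product of unit vectors / temperature)."""
    return math.exp(sum(a * b for a, b in zip(u, v)) / gamma)

def nt_xent(batch_centers, source_centers, alpha=1.0, gamma=0.1):
    """Symmetric NT-Xent between batch centers and same-index source centers."""
    vc = [unit(d) for d in batch_centers]
    vs = [unit(d) for d in source_centers]
    m = len(vc)
    total = 0.0
    for i in range(m):
        pos = sim(vc[i], vs[i], gamma)
        # Each denominator is the positive term plus all negatives along one axis.
        over_batch = sum(sim(vc[k], vs[i], gamma) for k in range(m))
        over_source = sum(sim(vc[i], vs[k], gamma) for k in range(m))
        total -= alpha * (math.log(pos / over_batch) + math.log(pos / over_source))
    return total / m

aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
swapped = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)  # the aligned case gives a much smaller loss
```

The loss is small when each batch center points in the same direction as the source center of its class and large when the centers are mismatched, which is the alignment pressure the loss is meant to provide.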

For categorizing samples as strong OOD, the following conditions or mathematical criteria must be met:

$$\hat{os}_{i}=1-\max_{d_{k}\in\mathcal{D}_{s}\cup\mathcal{D}_{str}}\left\langle f'(x_{i}),d_{k}\right\rangle \tag{15}$$

When strong OOD samples fulfill a certain criterion, they are incorporated into the existing strong OOD class. If not, a new strong OOD cluster center is established. In the real-world application of machine learning models, the classes known and trained on in the source domain are finite and predetermined. However, the emergence of new classes in practical scenarios is theoretically infinite. To prevent the unbounded growth of OOD cluster centers, the distribution $\mathcal{D}_{str}$ is managed as a queue with a fixed capacity of $N_{q}$. The value of $N_{q}$ is 100. As new OOD prototypes are introduced, the oldest prototypes are phased out.
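The fixed-capacity queue behaviour can be illustrated with Python's `collections.deque`. Only the capacity of 100 comes from the text; the helper name and the one-dimensional stand-in prototypes are hypothetical, and the criterion that decides when a new prototype is created is not modelled.

```python
from collections import deque

N_Q = 100  # queue capacity stated in the text

def make_prototype_queue(capacity=N_Q):
    """Fixed-capacity FIFO: appending to a full deque drops the oldest item."""
    return deque(maxlen=capacity)

queue = make_prototype_queue()
for step in range(150):          # toy stream of 150 new strong OOD prototypes
    queue.append([float(step)])  # stand-in for a real prototype feature vector

print(len(queue))  # 100
print(queue[0])    # [50.0]: prototypes 0..49 have been phased out
```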

Concurrently, the negative log-likelihood loss for these samples is computed as follows:

$$\mathcal{L}^{str}_{PC}=-\sum_{k\in Y_{str}}\mathbbm{1}(\hat{y}=k)\log\frac{\exp\left(\langle d_{k},f'(x_{i})\rangle/\delta\right)}{\sum_{l}\exp\left(\langle d_{l},f'(x_{i})\rangle/\delta\right)} \tag{16}$$

Self-training (ST) is susceptible to incorrect pseudo-labels, an issue known as confirmation bias. This bias can worsen over time and significantly degrade performance. In particular, when strong OOD samples are present in the target domain, the model may erroneously assign them to known categories, even with low confidence, thereby intensifying the confirmation bias. To mitigate the risk of ST failure, we adopt distribution alignment as a form of self-training regularization, drawing on insights from previous studies. This approach aims to reduce the adverse effects of confirmation bias by keeping the model's predictions aligned with the actual distribution of the data.

The features in the source domain are assumed to follow a Gaussian distribution $\mathcal{N}(\mu_{s},\Sigma_{s})$. In the target domain, the feature distribution $\mathcal{N}(\mu_{t},\Sigma_{t})$ is estimated using a momentum parameter $\beta$, incorporating only test samples pruned via strong OOD criteria. To refine clustering in the target domain, we use the Kullback-Leibler divergence loss $\mathcal{L}_{KLD}$:

$$\mathcal{L}_{KLD}=D_{KL}\big(\mathcal{N}(\mu_{s},\Sigma_{s})\,\|\,\mathcal{N}(\mu_{t},\Sigma_{t})\big) \tag{17}$$
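For reference, the KL divergence between two multivariate Gaussians has a standard closed form. Writing the source and target covariances as $\Sigma_{s}$ and $\Sigma_{t}$ and the feature dimension as $d$ (a symbol introduced here for illustration), and assuming both covariances are invertible, it reads:

```latex
D_{KL}\big(\mathcal{N}(\mu_s,\Sigma_s)\,\|\,\mathcal{N}(\mu_t,\Sigma_t)\big)
=\frac{1}{2}\Big[\operatorname{tr}\big(\Sigma_t^{-1}\Sigma_s\big)
+(\mu_t-\mu_s)^{\top}\Sigma_t^{-1}(\mu_t-\mu_s)
-d+\ln\frac{\det\Sigma_t}{\det\Sigma_s}\Big]
```

The expression vanishes exactly when the two distributions share the same mean and covariance, so minimizing it draws the estimated target statistics toward the source statistics.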

For the sake of aesthetics, we have simplified the formula. As a result, the final loss function for the phase of contrastive alignment by cluster centers and sample pairs can be articulated as follows:

$$\begin{aligned}
\mathcal{L}_{cs}&=\mathcal{L}_{NT}+\mathcal{L}^{wea}_{PC}+\mathcal{L}^{str}_{PC}+\mathcal{L}_{KLD}\\
&=-\alpha_{2}\log\frac{\mathcal{S}(v^{c}_{i},v^{s}_{j})_{pos}}{\sum_{k\neq i}^{M}\mathcal{S}(v^{c}_{k},v^{s}_{j})_{neg}+\mathcal{S}(v^{c}_{i},v^{s}_{j})_{pos}}\\
&\quad-\alpha_{2}\log\frac{\mathcal{S}(v^{c}_{i},v^{s}_{j})_{pos}}{\sum_{k\neq j}^{M}\mathcal{S}(v^{c}_{i},v^{s}_{k})_{neg}+\mathcal{S}(v^{c}_{i},v^{s}_{j})_{pos}}\\
&\quad-\sum_{k\in Y_{s}}\mathbbm{1}(\hat{y}=k)\log\frac{\exp\left(\langle d_{k},f'(x_{i})\rangle/\delta\right)}{\sum_{l}\exp\left(\langle d_{l},f'(x_{i})\rangle/\delta\right)}\\
&\quad-\sum_{k\in Y_{str}}\mathbbm{1}(\hat{y}=k)\log\frac{\exp\left(\langle d_{k},f'(x_{i})\rangle/\delta\right)}{\sum_{l}\exp\left(\langle d_{l},f'(x_{i})\rangle/\delta\right)}\\
&\quad+D_{KL}\big(\mathcal{N}(\mu_{s},\Sigma_{s})\,\|\,\mathcal{N}(\mu_{t},\Sigma_{t})\big)
\end{aligned} \tag{18}$$

Table 2. Open-world test time training results on CIFAR10-C. All numbers are in %. The best results are shown in bold.

| Method | Noise AccS | Noise AccN | Noise AccH | MNIST AccS | MNIST AccN | MNIST AccH | SVHN AccS | SVHN AccN | SVHN AccH | Tiny-ImageNet AccS | Tiny-ImageNet AccN | Tiny-ImageNet AccH | CIFAR100-C AccS | CIFAR100-C AccN | CIFAR100-C AccH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TEST | 68.59 | 99.97 | 81.36 | 60.48 | 88.81 | 71.96 | 60.94 | 86.44 | 71.48 | 57.41 | 79.63 | 66.72 | 52.74 | 74.24 | 61.67 |
| BN | 76.63 | 95.69 | 85.11 | 76.15 | 95.75 | 84.83 | 79.18 | 94.71 | 86.25 | 67.66 | 82.67 | 74.42 | 68.44 | 81.38 | 74.35 |
| TTT++ | 41.09 | 57.31 | 47.86 | 59.52 | 77.52 | 67.34 | 68.77 | 85.80 | 76.34 | 66.70 | 79.28 | 72.44 | 65.69 | 77.47 | 71.10 |
| TENT | 32.24 | 33.30 | 32.77 | 55.64 | 68.27 | 61.31 | 66.70 | 82.50 | 73.77 | 66.54 | 79.32 | 72.37 | 64.80 | 76.40 | 70.12 |
| SHOT | 63.54 | 71.37 | 67.23 | 56.92 | 53.26 | 55.03 | 70.01 | 72.58 | 71.27 | 67.78 | 82.25 | 74.32 | 67.73 | 72.87 | 70.21 |
| TTAC | 64.46 | 77.42 | 70.35 | 77.60 | 84.53 | 80.92 | 77.30 | 81.10 | 79.16 | 71.64 | 77.14 | 74.29 | 71.94 | 75.44 | 73.65 |
| OWTTT | 85.46 | 98.60 | 91.56 | 83.89 | 97.83 | 90.32 | 84.99 | 87.94 | 86.44 | 71.77 | 84.71 | 77.70 | 74.08 | 84.64 | 79.01 |
| OWDCL (Ours) | 87.16 | 99.99 | 93.08 | 85.59 | 99.14 | 91.82 | 85.35 | 89.74 | 87.49 | 76.57 | 86.34 | 81.20 | 78.47 | 85.47 | 81.82 |

4. Experiments

Table 3. Open-world test time training results on CIFAR100-C. All numbers are in %. The best results are shown in bold.

| Method | Noise AccS | Noise AccN | Noise AccH | MNIST AccS | MNIST AccN | MNIST AccH | SVHN AccS | SVHN AccN | SVHN AccH | Tiny-ImageNet AccS | Tiny-ImageNet AccN | Tiny-ImageNet AccH | CIFAR10-C AccS | CIFAR10-C AccN | CIFAR10-C AccH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TEST | 36.75 | 99.87 | 53.73 | 25.99 | 49.59 | 34.11 | 30.01 | 81.62 | 43.89 | 25.41 | 70.06 | 37.30 | 25.55 | 73.28 | 37.89 |
| BN | 50.21 | 98.72 | 66.56 | 36.21 | 84.69 | 50.73 | 45.69 | 90.45 | 60.71 | 34.88 | 82.18 | 48.97 | 37.00 | 83.54 | 51.28 |
| TTT++ | 23.47 | 70.26 | 35.19 | 28.31 | 86.74 | 42.68 | 37.56 | 90.45 | 53.08 | 34.67 | 81.25 | 48.60 | 33.78 | 81.12 | 47.70 |
| TENT | 22.57 | 66.60 | 33.72 | 27.85 | 80.92 | 41.43 | 37.08 | 89.90 | 52.51 | 35.51 | 77.34 | 48.60 | 35.20 | 80.26 | 48.94 |
| SHOT | 51.52 | 98.21 | 67.58 | 35.35 | 81.71 | 49.35 | 45.87 | 89.72 | 60.70 | 35.72 | 81.11 | 49.59 | 38.00 | 82.13 | 51.96 |
| TTAC | 51.11 | 98.66 | 67.34 | 37.78 | 86.66 | 52.62 | 47.29 | 91.42 | 62.33 | 32.04 | 80.46 | 45.83 | 38.83 | 83.68 | 53.05 |
| OWTTT | 56.76 | 97.25 | 71.68 | 40.77 | 82.91 | 54.66 | 54.32 | 81.98 | 65.34 | 38.90 | 81.92 | 52.75 | 38.97 | 83.20 | 53.08 |
| OWDCL (Ours) | 58.20 | 99.93 | 73.23 | 44.01 | 81.85 | 56.69 | 55.38 | 82.80 | 66.36 | 40.91 | 81.53 | 54.48 | 41.46 | 83.73 | 55.46 |

Table 4. Open-world test time training results on ImageNet-C. All numbers are in %. The best results are shown in bold.

| Method | Noise AccS | Noise AccN | Noise AccH | MNIST AccS | MNIST AccN | MNIST AccH | SVHN AccS | SVHN AccN | SVHN AccH |
|---|---|---|---|---|---|---|---|---|---|
| TEST | 18.51 | 100.00 | 31.24 | 18.66 | 98.27 | 31.36 | 18.94 | 87.75 | 31.15 |
| BN | 36.34 | 99.97 | 53.31 | 30.77 | 74.53 | 43.55 | 33.26 | 84.54 | 47.74 |
| TENT | 22.54 | 10.47 | 14.29 | 27.53 | 10.01 | 14.68 | 41.16 | 45.51 | 43.22 |
| SHOT | 46.79 | 100.00 | 63.75 | 27.47 | 55.25 | 36.70 | 34.00 | 75.94 | 46.97 |
| TTAC | 42.60 | 94.52 | 58.73 | 30.43 | 72.11 | 42.80 | 31.59 | 74.07 | 44.29 |
| OWTTT | 41.40 | 100.00 | 58.56 | 38.86 | 93.35 | 54.87 | 38.60 | 98.06 | 55.40 |
| OWDCL (Ours) | 41.96 | 100.00 | 59.11 | 41.70 | 99.92 | 57.00 | 42.23 | 99.25 | 57.70 |

4.1. Datasets and Evaluation Metric

Several datasets are used to demonstrate the validity of our method. For the corruption datasets, we use CIFAR10-C/CIFAR100-C (Hendrycks and Dietterich, 2019), each containing 10,000 corrupt images with 10/100 classes, and ImageNet-C (Hendrycks and Dietterich, 2019), which contains 5000 corrupt images within 1000 classes. For the style transfer dataset, we use Tiny-ImageNet (Le and Yang, 2015), which consists of 200 classes, each containing 500 training images and 50 validation images. We also use two other common datasets. MNIST (LeCun et al., 1998) is a handwritten digit dataset that contains 60,000 training images and 10,000 testing images. SVHN (Netzer et al., 2011) is a digit dataset in a real street context, including 50,000 training images and 10,000 testing images.

To evaluate open-world test-time training, we adopt the same evaluation metrics as OWTTT (Li et al., 2023). For a fair comparison with existing methods, we take all the classes in the TTT benchmark dataset as seen classes and add classes from additional datasets as unseen classes. In the later experiments, we set the number of known-class samples equal to the number of unknown-class samples. We then follow the "One Pass" protocol (Su et al., 2022): first, the training objective cannot be changed during the source-domain training procedure; second, testing data in the target domain is sequentially streamed and predicted. Under this setting, we evaluate both how accurately samples from source-domain classes are classified and how reliably strong OOD samples are rejected. First, the accuracy on the source-domain classes is recorded as $Acc_{S}$:

$$Acc_{S}=\frac{\sum_{x_{i},y_{i}\in\mathcal{D}_{t}}\mathbbm{1}(y_{i}=\hat{y}_{i})\cdot\mathbbm{1}(y_{i}\in\mathcal{C}_{s})}{\sum_{x_{i},y_{i}\in\mathcal{D}_{t}}\mathbbm{1}(y_{i}\in\mathcal{C}_{s})} \tag{19}$$

Next, the accuracy of correctly rejecting strong OOD samples is recorded as $Acc_{N}$:

$$Acc_{N}=\frac{\sum_{x_{i},y_{i}\in\mathcal{D}_{t}}\mathbbm{1}(\hat{y}_{i}\in\mathcal{C}_{t}\setminus\mathcal{C}_{s})\cdot\mathbbm{1}(y_{i}\in\mathcal{C}_{t}\setminus\mathcal{C}_{s})}{\sum_{x_{i},y_{i}\in\mathcal{D}_{t}}\mathbbm{1}(y_{i}\in\mathcal{C}_{t}\setminus\mathcal{C}_{s})} \tag{20}$$

Finally, the trade-off between the two is measured by their harmonic mean, denoted $Acc_{H}$:

(21) $Acc_{H}=2\cdot\frac{Acc_{S}\cdot Acc_{N}}{Acc_{S}+Acc_{N}}$

where $\hat{y}_{i}$ refers to the predicted label and $\mathbbm{1}(y_{i}\in\mathcal{C}_{s})$ is true if $y_{i}$ is in the set $\mathcal{C}_{s}$.
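The three metrics can be illustrated with a minimal Python sketch. It assumes that labels are plain integers and that every label outside the known class set (for example a hypothetical `-1` for "rejected") counts as belonging to $\mathcal{C}_{t}\setminus\mathcal{C}_{s}$; the function name and the toy labels are illustrative only and are not taken from our implementation.

```python
def open_world_accuracies(y_true, y_pred, known_classes):
    """Return (acc_s, acc_n, acc_h) as fractions in [0, 1].

    Assumption: any label outside `known_classes` is treated as strong OOD,
    i.e. as a member of C_t minus C_s.
    """
    known = set(known_classes)
    n_known = n_known_correct = 0
    n_ood = n_ood_rejected = 0
    for y, y_hat in zip(y_true, y_pred):
        if y in known:
            # Weak OOD sample: counts toward Acc_S when classified correctly.
            n_known += 1
            n_known_correct += int(y_hat == y)
        else:
            # Strong OOD sample: counts toward Acc_N when it is rejected.
            n_ood += 1
            n_ood_rejected += int(y_hat not in known)
    acc_s = n_known_correct / n_known if n_known else 0.0
    acc_n = n_ood_rejected / n_ood if n_ood else 0.0
    # Harmonic mean of the two accuracies (Acc_H).
    acc_h = 2 * acc_s * acc_n / (acc_s + acc_n) if (acc_s + acc_n) > 0 else 0.0
    return acc_s, acc_n, acc_h


# Toy example: three known-class samples and four strong OOD samples.
y_true = [0, 1, 2, -1, -1, -1, -1]
y_pred = [0, 1, 1, -1, -1, 0, -1]
acc_s, acc_n, acc_h = open_world_accuracies(y_true, y_pred, {0, 1, 2})
# Here acc_s = 2/3, acc_n = 3/4 and acc_h = 12/17.
print(acc_s, acc_n, acc_h)
```

Because $Acc_{H}$ is a harmonic mean, it stays low whenever either $Acc_{S}$ or $Acc_{N}$ is low, so a model cannot score well by rejecting everything or by rejecting nothing.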

4.2. Comparison Methods and Settings

Because open-world test-time training (OWTTT) remains relatively unexplored, with few published studies, our comparison also includes conventional test-time training (TTT) methods, following previous research. Note that although TTT is designed for real-time testing, it differs from test-time adaptation in that it uses some source-domain information, such as small batches of source samples or source-domain BN layer statistics, under real-time constraints. This information includes the feature distribution of the source domain, as used in OWTTT and our OWDCL model. Including conventional TTT methods in our experimental comparison is therefore justified. The compared methods are as follows:

TEST: Evaluating the source domain model on testing data.

BN (Ioffe and Szegedy, 2015): Updating batch norm statistics on the testing data for test-time adaptation.

TTT++ (Liu et al., 2021a): Aligns the source and target domain distributions by minimizing the F-norm between their means and covariances.

TENT (Wang et al., 2020): This method fine-tunes scale and bias parameters of the batch normalization layers using an entropy minimization loss during inference.

SHOT (Liang et al., 2020): Implements test-time training by entropy minimization and self-training. SHOT assumes the target domain is class balanced and introduces an entropy loss to encourage uniform distribution of the prediction results.

TTAC (Su et al., 2022): Employs distribution alignment at both global and class levels to facilitate test-time training.

OWTTT (Li et al., 2023): Combines self-training with prototype expansion to accommodate strong OOD samples.
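Both TENT and SHOT in the list above rely on entropy minimization. As a rough illustration of the quantity being minimized, the sketch below computes the Shannon entropy of a softmax prediction and averages it over a batch. It is written in plain Python with hypothetical function names; it is not the implementation of either method, both of which update network parameters by backpropagation through this kind of loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax prediction for one sample."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def batch_entropy_loss(batch_logits):
    """Mean prediction entropy over a batch; lower means more confident."""
    return sum(prediction_entropy(l) for l in batch_logits) / len(batch_logits)


# A uniform prediction has maximal entropy; a peaked one has low entropy.
print(prediction_entropy([0.0, 0.0, 0.0, 0.0]))
print(prediction_entropy([10.0, 0.0, 0.0, 0.0]))
```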

For all competing methods, we keep their default settings and equip them with the same strong OOD detector introduced in (Li et al., 2023). For all models, ResNet-50 (He et al., 2016) is used as the backbone and SGD as the optimizer. On CIFAR10-C/CIFAR100-C, the learning rate is set to 0.01/0.001, respectively, with a batch size of 256. On ImageNet-C, the learning rate is set to 0.001 and the batch size to 128. The remaining hyperparameters of each model follow the default settings of the original papers. For the data augmentation that produces the positive samples of OWDCL (ours), we apply only rotation (0-30 degrees) followed by horizontal flipping. More complex augmentation, combined with the noise introduced by domain shift, would make the model difficult to fit.
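The augmentation described above can be sketched as follows. This is a minimal illustration under several assumptions that the text does not specify: the image is a single-channel 2-D list, the rotation uses nearest-neighbour sampling with zero fill, the angle is drawn uniformly from [0, 30] degrees, and the horizontal flip is applied with a hypothetical probability `flip_prob`. The function names are illustrative and do not come from our code.

```python
import math
import random


def hflip(img):
    """Mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in img]


def rotate(img, degrees, fill=0):
    """Rotate a 2-D image about its centre using nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel for each output pixel.
            dx, dy = x - cx, y - cy
            sx = cos_t * dx + sin_t * dy + cx
            sy = -sin_t * dx + cos_t * dy + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out


def make_positive_view(img, max_degrees=30.0, flip_prob=0.5, rng=random):
    """Build one positive view: random rotation, then an optional flip.

    `flip_prob` is an assumption; the text only states that flipping is used.
    """
    angle = rng.uniform(0.0, max_degrees)
    view = rotate(img, angle)
    if rng.random() < flip_prob:
        view = hflip(view)
    return view


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
positive = make_positive_view(image)
```

Keeping the transformation this mild means the positive view stays close to the original sample, which matches the motivation stated above that heavier augmentation on top of corruption noise makes fitting harder.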

For the CIFAR10-C/CIFAR100-C datasets, the hyperparameters are configured as follows: $\gamma_{1}$ is set to 0.8, $\gamma_{2}$ to 0.4, $\alpha_{1}$ to 1, and $\alpha_{2}$ to 2. For the ImageNet-C dataset, both $\gamma_{1}$ and $\gamma_{2}$ are set to 1. $\alpha_{1}$ is initially set to 1 and reduced to 0.1 after the 20th batch to mitigate the potential overfitting observed on more complex datasets, where $\mathcal{L}_{ps}$ remains impactful in the initial stages. The other parameters keep the values given where they are first introduced and are used consistently throughout the paper. These configurations follow established practice from previous research (Li et al., 2023).
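The step schedule for $\alpha_{1}$ described above can be written as a small helper. This is a sketch under the assumption that batches are counted from 1, so that batches 1 to 20 use the initial value and batch 21 onward uses the reduced value; the function name and its keyword arguments are hypothetical.

```python
def alpha1_at(batch_index, initial=1.0, reduced=0.1, switch_after=20):
    """Weight alpha_1 for a given 1-based batch index.

    Assumption: "after the 20th batch" means batches 21, 22, ... use the
    reduced value, while batches 1..20 keep the initial value.
    """
    if batch_index < 1:
        raise ValueError("batch_index is 1-based and must be >= 1")
    return reduced if batch_index > switch_after else initial


# Weights for the batches around the switch point.
for b in (1, 20, 21, 100):
    print(b, alpha1_at(b))
```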

Figure 3. Visual analysis experiment. Black is strong OOD, while the others are weak OOD.

4.3. Comparative experiments

We first evaluate open-world test-time training under noise-corrupted target domains. We treat CIFAR10/CIFAR100 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009) as the source domains and perform test-time adaptation to CIFAR10-C, CIFAR100-C, and ImageNet-C as the respective target domains.

For experiments on CIFAR10/100, we introduce random noise, MNIST, SVHN, Tiny-ImageNet with non-overlapping classes, and CIFAR100 as strong OOD testing samples. Table 2 compares our proposed method against recent TTT methods on the CIFAR10-C dataset, and Table 3 shows the comparison on the CIFAR100-C dataset. Across the different strong OOD sources, our model performs strongly, improving accuracy by more than 2% in nearly every case. On CIFAR10-C with Tiny-ImageNet as strong OOD, a particularly complex case, the accuracy gain is nearly 5%.

In CIFAR100-C, owing to the larger number of categories and the interference of strong OOD samples, many models achieve markedly higher recognition accuracy on strong OOD ($Acc_{N}$). However, their weak OOD accuracy ($Acc_{S}$) drops sharply: strong OOD interference causes them to lose the ability to recognize the source domain classes. OWDCL not only demonstrates significant performance improvements over traditional TTT models but also incorporates contrastive learning to enhance the model’s feature extraction capabilities. This enhancement helps to prevent weak OOD samples from being misclassified as strong OOD. Compared to OWTTT, OWDCL generally achieves an accuracy improvement of about 1-4%, highlighting the effectiveness of integrating contrastive learning for more robust feature discrimination and OOD handling.

For ImageNet-C, we introduce random noise, MNIST, and SVHN as strong OOD samples. Encouraging results are also obtained on the large-scale, more complex ImageNet-C dataset, as shown in Table 4, where our model shows a similar effect. With random noise as strong OOD, our method is inferior to SHOT. We believe that random noise prevents the model from extracting useful features from the strong OOD samples, which affects the final performance. In the experiments with MNIST and SVHN as strong OOD samples, the weak OOD classification accuracy ($Acc_{S}$) of our OWDCL model increased by approximately 4% compared to OWTTT, a more pronounced improvement than observed on the CIFAR10-C/CIFAR100-C datasets. This suggests that dataset complexity significantly raises the demands on the model’s feature extraction, making weak OOD samples more susceptible to being misclassified as strong OOD. Our method’s enhancements address this issue effectively, and the more complex the dataset, the more pronounced the benefits of our model become.

Overall, our proposed method outperforms all competing methods under most experimental settings, which supports its effectiveness.

4.4. Further Performance Analysis

4.4.1. Ablation Study

Table 5. Model ablation experiment
$\mathcal{PS}$ $\mathcal{CS}$ $Acc_{S}$ $Acc_{N}$ $Acc_{H}$
85.46 98.60 91.56
86.54 99.99 92.78
86.89 99.99 92.93
87.16 99.99 93.08

In our ablation study on the CIFAR10-C dataset, we use noise as the representative strong OOD source, together with the 15 corruption types present in the original dataset. Owing to space constraints, we report only the final averaged results, shown in Table 5. Here, $\mathcal{PS}$ denotes the Contrastive Alignment by Positive Sample Pairs component, and $\mathcal{CS}$ denotes the Contrastive Alignment by Cluster and Sample Pairs component. The baseline, OWTTT, incorporates neither component. Our findings indicate that each component outperforms the baseline. The gain is particularly notable because strong OOD samples are effectively rejected while weak OOD samples are still classified accurately.

Figure 4. Parameter Robustness Analysis.

4.4.2. Visualized Analysis

We conducted a visual analysis on the CIFAR10-C dataset, using Gaussian noise as the corruption and the MNIST dataset as the strong OOD source. Three models (TEST, OWTTT, and OWDCL) were assessed using the data from their last five batches. This data was reduced in dimensionality with t-SNE and then visualized. In Figure 3, black indicates the strong OOD class, while the ten other colors represent the ten CIFAR-10 classes. Compared to TEST, OWTTT showed improved classification accuracy but still exhibited a high misclassification rate. OWDCL went further by enlarging the spatial separation between distinct classes, indicating superior performance. Notably, OWDCL demonstrated remarkable feature extraction capabilities for unknown strong OOD samples during the test-time training (TTT) process, despite never having been trained on MNIST. This ability is evidenced by the emergence of distinct clusters, even though the model does not precisely classify each of the ten MNIST classes.

4.4.3. Parameter Robustness Analysis

Because OWDCL extends OWTTT, our parameter settings follow the OWTTT configuration and are kept consistent throughout the paper. Since our method involves numerous secondary parameters, their values are given where they are first introduced and are shared across all experiments. In the parameter robustness analysis, we examine the primary parameters $\alpha_{1}$ and $\alpha_{2}$. The experiments were conducted under the noise condition on the CIFAR10-C dataset, as depicted in Figure 4. The model’s accuracy remains high within a certain range of values, confirming that the method is robust to both parameters over that interval.

5. Conclusion

In conclusion, our study introduces Open World Dynamic Contrastive Learning (OWDCL), a novel approach that effectively addresses the limitations of traditional Test-Time Training (TTT) methods in open-world scenarios. By innovatively employing contrastive learning to generate positive sample pairs, OWDCL significantly enhances initial feature extraction and reduces the misclassification of weak OOD data as strong OOD. This methodology not only improves contrast in early TTT stages but also strengthens the overall robustness of the model against strong OOD data. Demonstrating superior performance across various datasets, OWDCL sets a new benchmark in the field of Open-World Test-Time Training.

References

  • Amodei et al. (2016) Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. 2016. Deep speech 2: End-to-end speech recognition in english and mandarin. In International conference on machine learning. PMLR, 173–182.
  • Bartler et al. (2022) Alexander Bartler, Andre Bühler, Felix Wiewel, Mario Döbler, and Bin Yang. 2022. Mt3: Meta test-time training for self-supervised test-time adaption. In International Conference on Artificial Intelligence and Statistics. PMLR, 3080–3090.
  • Bendale and Boult (2015) Abhijit Bendale and Terrance Boult. 2015. Towards open world recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1893–1902.
  • Bendale and Boult (2016) Abhijit Bendale and Terrance E Boult. 2016. Towards open set deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1563–1572.
  • Chen et al. (2022) Dian Chen, Dequan Wang, Trevor Darrell, and Sayna Ebrahimi. 2022. Contrastive test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 295–305.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597–1607.
  • Chen and He (2021) Xinlei Chen and Kaiming He. 2021. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15750–15758.
  • Croce et al. (2022) Francesco Croce, Sven Gowal, Thomas Brunner, Evan Shelhamer, Matthias Hein, and Taylan Cemgil. 2022. Evaluating the adversarial robustness of adaptive test-time defenses. In International Conference on Machine Learning. PMLR, 4421–4435.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248–255.
  • Ganin and Lempitsky (2015) Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In International conference on machine learning. PMLR, 1180–1189.
  • Geng et al. (2020) Chuanxing Geng, Sheng-jun Huang, and Songcan Chen. 2020. Recent advances in open set recognition: A survey. IEEE transactions on pattern analysis and machine intelligence 43, 10 (2020), 3614–3631.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729–9738.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Hendrycks and Dietterich (2019) Dan Hendrycks and Thomas Dietterich. 2019. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261 (2019).
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning. PMLR, 448–456.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009).
  • Kundu et al. (2020) Jogendra Nath Kundu, Naveen Venkat, R Venkatesh Babu, et al. 2020. Universal source-free domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4544–4553.
  • Le and Yang (2015) Ya Le and Xuan Yang. 2015. Tiny imagenet visual recognition challenge. CS 231N 7, 7 (2015), 3.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
  • Li et al. (2023) Yushu Li, Xun Xu, Yongyi Su, and Kui Jia. 2023. On the Robustness of Open-World Test-Time Training: Self-Training with Dynamic Prototype Expansion. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11836–11846.
  • Liang et al. (2020) Jian Liang, Dapeng Hu, and Jiashi Feng. 2020. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International conference on machine learning. PMLR, 6028–6039.
  • Liu et al. (2021c) Hong Liu, Jianmin Wang, and Mingsheng Long. 2021c. Cycle self-training for domain adaptation. Advances in Neural Information Processing Systems 34 (2021), 22968–22981.
  • Liu et al. (2022) Xiaofeng Liu, Chaehwa Yoo, Fangxu Xing, Hyejin Oh, Georges El Fakhri, Je-Won Kang, Jonghye Woo, et al. 2022. Deep unsupervised domain adaptation: A review of recent advances and perspectives. APSIPA Transactions on Signal and Information Processing 11, 1 (2022).
  • Liu et al. (2021a) Yuejiang Liu, Parth Kothari, Bastien Van Delft, Baptiste Bellot-Gurlet, Taylor Mordan, and Alexandre Alahi. 2021a. Ttt++: When does self-supervised test-time training fail or thrive? Advances in Neural Information Processing Systems 34 (2021), 21808–21820.
  • Liu et al. (2021d) Yuang Liu, Wei Zhang, and Jun Wang. 2021d. Source-free domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1215–1224.
  • Liu et al. (2021b) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021b. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision. 10012–10022.
  • Long et al. (2015) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. 2015. Learning transferable features with deep adaptation networks. In International conference on machine learning. PMLR, 97–105.
  • Mirza et al. (2023) Muhammad Jehanzeb Mirza, Pol Jané Soneira, Wei Lin, Mateusz Kozinski, Horst Possegger, and Horst Bischof. 2023. ActMAD: Activation Matching to Align Distributions for Test-Time-Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 24152–24161.
  • Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. 2011. Reading digits in natural images with unsupervised feature learning. (2011).
  • Niu et al. (2022) Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. 2022. Efficient test-time model adaptation without forgetting. In International conference on machine learning. PMLR, 16888–16905.
  • Panareda Busto and Gall (2017) Pau Panareda Busto and Juergen Gall. 2017. Open set domain adaptation. In Proceedings of the IEEE international conference on computer vision. 754–763.
  • Saito et al. (2018) Kuniaki Saito, Shohei Yamamoto, Yoshitaka Ushiku, and Tatsuya Harada. 2018. Open set domain adaptation by backpropagation. In Proceedings of the European conference on computer vision (ECCV). 153–168.
  • Scheirer et al. (2012) Walter J Scheirer, Anderson de Rezende Rocha, Archana Sapkota, and Terrance E Boult. 2012. Toward open set recognition. IEEE transactions on pattern analysis and machine intelligence 35, 7 (2012), 1757–1772.
  • Sinha et al. (2023) Samarth Sinha, Peter Gehler, Francesco Locatello, and Bernt Schiele. 2023. TeST: Test-time Self-Training under Distribution Shift. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2759–2769.
  • Su et al. (2023) Houcheng Su, Daixian Liu, Mengzhu Wang, and Wei Wang. 2023. Singular Value Penalization and Semantic Data Augmentation for Fully Test-Time Adaptation. arXiv preprint arXiv:2312.08378 (2023).
  • Su et al. (2022) Yongyi Su, Xun Xu, and Kui Jia. 2022. Revisiting realistic test-time training: Sequential inference and adaptation by anchored clustering. Advances in Neural Information Processing Systems 35 (2022), 17543–17555.
  • Sun et al. (2020) Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. 2020. Test-time training with self-supervision for generalization under distribution shifts. In International conference on machine learning. PMLR, 9229–9248.
  • Wang et al. (2020) Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. 2020. Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726 (2020).
  • Wang and Deng (2018) Mei Wang and Weihong Deng. 2018. Deep visual domain adaptation: A survey. Neurocomputing 312 (2018), 135–153.
  • Xia et al. (2021) Haifeng Xia, Handong Zhao, and Zhengming Ding. 2021. Adaptive adversarial network for source-free domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9010–9019.
  • Yang et al. (2021) Shiqi Yang, Yaxing Wang, Joost Van De Weijer, Luis Herranz, and Shangling Jui. 2021. Generalized source-free domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8978–8987.
  • You et al. (2019) Kaichao You, Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. 2019. Universal domain adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2720–2729.