Cross-modulated Attention Transformer for RGBT Tracking

Yun Xiao1   Jiacong Zhao1   Andong Lu2   Chenglong Li1   Yin Lin3   Bing Yin3   Cong Liu3
1 School of Artificial Intelligence, Anhui University, Hefei, China
2 School of Computer Science and Technology, Anhui University, Hefei, China
3 iFLYTEK CO.LTD., Hefei, China
Abstract

Existing Transformer-based RGBT trackers achieve remarkable performance by leveraging self-attention to extract uni-modal features and cross-attention to enhance multi-modal feature interaction and template-search correlation computation. Nevertheless, computing search-template correlations independently in each modality branch ignores the consistency between branches, which can result in ambiguous and inappropriate correlation weights. This not only limits the intra-modal feature representation, but also harms the robustness of cross-attention for multi-modal feature interaction and search-template correlation computation. To address these issues, we propose a novel approach called Cross-modulated Attention Transformer (CAFormer), which performs intra-modality self-correlation, inter-modality feature interaction, and search-template correlation computation in a unified attention model for RGBT tracking. In particular, we first independently generate correlation maps for each modality and feed them into the designed Correlation Modulated Enhancement module, which modulates inaccurate correlation weights by seeking consensus between modalities. This design unifies the self-attention and cross-attention schemes, which not only alleviates inaccurate attention weight computation in self-attention but also eliminates the redundant computation introduced by an extra cross-attention scheme. In addition, we propose a collaborative token elimination strategy to further improve tracking inference efficiency and accuracy. Extensive experiments on five public RGBT tracking benchmarks show the outstanding performance of the proposed CAFormer against state-of-the-art methods.

1 Introduction

Figure 1: Comparison of performance and speed for state-of-the-art tracking methods on RGBT234 [18]. We plot the Success Rate (SR) against Frames Per Second (FPS). Closer to the top means higher performance, and closer to the right means faster. CAFormer ranks first in SR while running at 83.6 FPS.

RGBT tracking [20, 21, 25, 15, 13], which involves fusing information from both visible and thermal infrared (TIR) modalities for visual tracking, has become an active research field in the computer vision community. Recently, with the success of Transformers in visual object tracking (VOT) [3, 39, 2], RGBT trackers based on Transformers have gradually gained advantages in terms of speed and performance.

The Transformer has been successfully applied to RGBT tracking thanks to its attention mechanism, which allows it to focus selectively on relevant information and ignore irrelevant information. Existing Transformer-based RGBT trackers [15, 13, 12] achieve remarkable performance by leveraging self-attention to extract uni-modal features and cross-attention to enhance multi-modal feature interaction. However, we observe that the correlation computation in self-attention is sensitive to low-quality data, resulting in ambiguous and inappropriate correlation weights, as shown in the second row of Figure 2. Importantly, existing works [39, 8, 7] suggest that proper correlation is important for tracking. We therefore argue that the strategy of modeling self-attention independently in each modality, which is widely adopted in existing methods, is limited. This limitation not only impairs intra-modality feature representation, but also affects subsequent multi-modal feature interaction and the robustness of template-search cross-correlation. Moreover, computing self-attention and cross-attention separately introduces redundant computation, which limits the speed of existing RGBT trackers.

Figure 2: Illustration of correlation maps with different fusion methods under different modal quality inputs. The score map is the output of the location branch in the tracking head.

To address these issues, we propose a novel approach called Cross-modulated Attention Transformer (CAFormer), which performs intra-modality feature extraction and inter-modality feature interaction in a unified attention model for RGBT tracking. Visible and infrared images in RGBT tracking are highly spatially aligned, so their correlations between search frames and target templates should also be consistent. Consequently, the self-correlations of different modalities exhibit similar interaction properties with multi-modal image features. This motivates an intuitive idea: enhance and correct the correlations of the low-quality modality using the correlations of the high-quality modality. To adapt to the dynamic change of modality quality in RGBT tracking, we design a bidirectional Cross-Modulated Attention (CMA) to achieve adaptive correlation modulation. In particular, we first compute the correlation maps for each modality independently, and then feed them into the designed Correlation Modulated Enhancement (CME) module for cross-correlation modeling, which seeks a correlation agreement between the two modalities and thereby corrects the inaccurate correlations of the preceding self-attention, as shown in the third row of Figure 2. Moreover, CMA is more efficient in fusion. Taking ViT-Base as an example backbone, feature fusion must process input features of dimension 768. CMA only needs to process the search-template part of the correlation map, whose correlation vectors have a dimension determined by the number of template tokens, which is usually 64. By avoiding computation on higher-dimensional features, CAFormer far outperforms existing feature fusion methods in terms of efficiency. In summary, the proposed CMA unifies the self-attention and cross-attention schemes, which not only mitigates inaccurate correlations in self-attention, but also avoids the computational burden of additional cross-attention.

In addition, inspired by the candidate elimination method in OSTrack [39], we propose a collaborative token elimination strategy to further improve tracking inference efficiency and accuracy. Specifically, within the search region, we consider each token as a potential candidate for the target and treat each template token as a constituent of the target object. Leveraging the prior knowledge about the similarity between the target and each candidate provided by the correlations in the individual modality branches, we add the similarities of the two modalities to obtain an overall similarity and then discard the tokens with lower similarity. In this way, we coordinate the initial elimination results from both modalities to improve background elimination precision. Consequently, our module not only enhances tracking efficiency but also maintains robust performance.
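The elimination step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the per-token similarity vectors `sim_rgb` and `sim_tir`, the function name `collaborative_token_elimination`, and the `keep_ratio` parameter are hypothetical and are not taken from the paper, which does not specify here how the similarities are pooled or how many tokens are kept.

```python
import numpy as np

def collaborative_token_elimination(sim_rgb, sim_tir, keep_ratio=0.7):
    """Keep the search tokens with the highest summed similarity.

    sim_rgb, sim_tir: arrays of shape (N_x,), one similarity score per
    search token in each modality (assumed to be given).
    keep_ratio: fraction of tokens to keep (hypothetical parameter).
    Returns the sorted indices of the kept search tokens.
    """
    total = sim_rgb + sim_tir                        # overall similarity
    n_keep = max(1, int(np.ceil(keep_ratio * total.shape[0])))
    order = np.argsort(-total, kind="stable")        # descending similarity
    return np.sort(order[:n_keep])                   # restore spatial order

# Toy example: token 3 looks weak in RGB but strong in TIR, so it survives.
sim_rgb = np.array([0.9, 0.1, 0.4, 0.2])
sim_tir = np.array([0.8, 0.3, 0.1, 0.7])
kept = collaborative_token_elimination(sim_rgb, sim_tir, keep_ratio=0.5)
```

Because the decision uses the sum of both modalities, a token that one modality alone would discard can still be retained when the other modality supports it, which is the coordination effect the paragraph describes.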

Figure 1 compares CAFormer with existing state-of-the-art methods in tracking accuracy and speed; CAFormer achieves excellent performance on both metrics simultaneously. This demonstrates the superiority and potential of the proposed cross-modulated attention.

The contributions of this paper can be summarized as follows.

  • We reveal that correlations are consistent across modalities, a property brought by spatio-temporally aligned multimodal image pairs.

  • We propose a novel Cross-modulated Attention Transformer called CAFormer for accurate and efficient RGBT tracking.

  • We propose a collaborative token elimination strategy, which improves the inference efficiency with further performance enhancement.

  • The proposed method achieves an impressive tracking speed of 83.6 FPS while achieving state-of-the-art results on three mainstream public datasets.

2 Related Work

Figure 3: Overall framework of Cross-modulated Attention Transformer (CAFormer) for RGBT tracking.

2.1 Attention Mechanism

Attention mechanisms have been widely used in computer vision tasks over the past decade [14, 34, 31, 10]. The Transformer [31] is favored among these attention mechanisms due to the powerful representation ability of its self-attention and cross-attention. Existing attention studies can be broadly classified into two categories. One category focuses on lightweight attention [22, 29, 43, 45]. For example, Liu et al. [22] reduce the computational complexity of attention by introducing local windows into self-attention. Schlatt et al. [29] design a sparse interaction strategy between query tokens and key tokens, which improves the efficiency of cross-attention. However, these acceleration methods may harm performance because they remove global relationship modeling. Another category of studies [8, 37] is devoted to improving the quality of attention maps. For example, Gao et al. [8] refine the original self-attention by constructing a second-order relation matrix of the self-attention map. Xu et al. [37] propose a self-calibrated cross-attention to enhance discrimination between foreground and background images. However, these schemes struggle to model attention weights accurately from their own information alone when they encounter low-quality inputs. In contrast, this paper proposes a multi-modal cross-modulated attention for the first time, which enhances the attention quality of each modality by establishing a strong association between the attentions of the RGB and thermal modalities.

2.2 RGBT Tracking

Due to the highly complementary nature of the RGB and thermal infrared (TIR) modalities, using TIR as an additional modality can effectively improve the robustness of tracking. RGBT tracking has therefore been proposed and has attracted wide attention. With the publication of large-scale RGBT datasets [21, 28], the Transformer has been widely used in RGBT tracking. For example, Xiao et al. [36] design attribute-specific fusion branches and utilize the Transformer to enhance attribute aggregation features and modality-specific features. Hui et al. [15] extend ViT [6] to a multi-modal backbone and propose using fusion templates as a medium for modal interactions to enhance feature fusion with target-related contexts. Luo et al. [26] employ three distinct Transformer backbones to extract both modality-specific and modality-shared features. Other works [11, 44] explore the application of prompt learning to multimodal tracking. However, the correlation calculation of each modality in these methods is performed independently, which makes it challenging to avoid inaccurate correlations for low-quality inputs and thus limits further performance improvement. Moreover, existing fusion modules are typically designed for high-dimensional modal features and demand substantial computational resources, which is not conducive to efficient tracking.

3 Method

3.1 Overview

The proposed approach, named Cross-modulated Attention Transformer (CAFormer), is designed to address the challenges of RGBT tracking by performing intra-modality self-correlation and inter-modality feature interaction in a unified attention model. As illustrated in Figure 3, the framework consists of a backbone network comprising Transformer and CAFormer blocks that process flattened and embedded tokens of RGB and TIR image pairs. The cross-modulated attention mechanism employs correlation maps from both modalities to enhance interaction in the Correlation Modulated Enhancement (CME) module. Furthermore, to filter out non-target tokens, we employ the Collaborative Token Elimination (CTE) strategy in certain layers, which improves reliability by adding the correlation maps of the two modalities. Subsequently, we complete the RGB and TIR token sequences of the search region using a token padding scheme, concatenate them along the channel dimension, and feed them into the tracking head for target state prediction.

Figure 4: The proposed Cross-modulated Attention with the Correlation Modulated Enhancement (CME) module. The split symbol in the figure denotes dividing the features of the two modalities, $\bigotimes$ denotes matrix multiplication, and $\bigoplus$ denotes element-wise addition. The numbers beside the arrows are feature dimensions that do not include the batch size. Linear projections in (a) and matrix transpose operations are omitted for brevity. $N$, $N_x$, and $N_z$ represent the numbers of all tokens, search region tokens, and template region tokens, respectively.

3.2 RGBT Baseline Tracker

We adopt a similar approach to recent SOT (Single Object Tracking) methods [39, 2] by concatenating the template and search frames and feeding them jointly into the Transformer backbone, and then extend this backbone to serve as the multi-modal backbone of our tracker.

Specifically, given the input RGB and TIR template image pair $\bm{I}_r^z, \bm{I}_t^z \in \mathbb{R}^{H_z \times W_z \times 3}$ and search region image pair $\bm{I}_r^x, \bm{I}_t^x \in \mathbb{R}^{H_x \times W_x \times 3}$, we first divide these images into patches of size $P \times P$ and then flatten them to obtain sequences of patches $\bm{P}_r^z, \bm{P}_t^z \in \mathbb{R}^{N_z \times (3P^2)}$ and $\bm{P}_r^x, \bm{P}_t^x \in \mathbb{R}^{N_x \times (3P^2)}$, where $N_z = H_z W_z / P^2$ and $N_x = H_x W_x / P^2$ denote the number of patches for the template and search frames, respectively. A patch embedding layer with parameter $\bm{W}^0 \in \mathbb{R}^{(3P^2) \times C}$ and learnable positional encoding $\bm{E}_z \in \mathbb{R}^{N_z \times C}$ and $\bm{E}_x \in \mathbb{R}^{N_x \times C}$ is then applied to obtain template features $\bm{Z}_r^0$, $\bm{Z}_t^0$ and search region features $\bm{X}_r^0$, $\bm{X}_t^0$ as follows:

\begin{equation}
\begin{split}
&\bm{Z}_r^0 = \bm{P}_r^z \bm{W}^0 + \bm{E}_z,\; \bm{X}_r^0 = \bm{P}_r^x \bm{W}^0 + \bm{E}_x; \\
&\bm{Z}_t^0 = \bm{P}_t^z \bm{W}^0 + \bm{E}_z,\; \bm{X}_t^0 = \bm{P}_t^x \bm{W}^0 + \bm{E}_x.
\end{split}
\tag{1}
\end{equation}
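As a minimal sketch of the patch embedding in Eq. 1, the NumPy snippet below splits an image into $P \times P$ patches, flattens them, and applies a shared projection plus positional encoding. The concrete sizes (128×128 template, 256×256 search region, $P = 16$, $C = 768$) are assumptions chosen for illustration, and the helper name `patchify` is hypothetical; random arrays stand in for images and learned parameters.

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, 3) image into flattened P x P patches: (N, 3*P*P)."""
    H, W, ch = img.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by P"
    x = img.reshape(H // P, P, W // P, P, ch)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/P, W/P, P, P, ch)
    return x.reshape((H // P) * (W // P), P * P * ch)

rng = np.random.default_rng(0)
P, C = 16, 768                                # assumed patch size and width
H_z = W_z = 128                               # assumed template size
H_x = W_x = 256                               # assumed search region size
N_z = H_z * W_z // P**2                       # 64 template tokens
N_x = H_x * W_x // P**2                       # 256 search tokens

W0 = rng.standard_normal((3 * P * P, C)) * 0.02   # shared patch embedding
E_z = rng.standard_normal((N_z, C)) * 0.02        # positional encodings
E_x = rng.standard_normal((N_x, C)) * 0.02

I_r_z, I_t_z = rng.random((H_z, W_z, 3)), rng.random((H_z, W_z, 3))
I_r_x, I_t_x = rng.random((H_x, W_x, 3)), rng.random((H_x, W_x, 3))

# Eq. 1: both modalities share W0, E_z, and E_x.
Z_r0 = patchify(I_r_z, P) @ W0 + E_z
X_r0 = patchify(I_r_x, P) @ W0 + E_x
Z_t0 = patchify(I_t_z, P) @ W0 + E_z
X_t0 = patchify(I_t_x, P) @ W0 + E_x
```

With these assumed sizes the template yields 64 tokens, which matches the "usually 64" template-token count mentioned in the introduction.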

Subsequently, concatenating these features yields the token sequences $\bm{F}_r^0 = [\bm{Z}_r^0; \bm{X}_r^0]$ and $\bm{F}_t^0 = [\bm{Z}_t^0; \bm{X}_t^0]$, which we feed into the $L$ stacked Transformer blocks $T^l$ ($l = 1, 2, \ldots, L$), whose structure is shown in Figure 3. For simplicity, we use a tracking head consistent with OSTrack [39] and denote it as $\phi$. The forward propagation process is formulated as follows:

\begin{equation}
\bm{F}_r^l = T^l(\bm{F}_r^{l-1}),\; \bm{F}_t^l = T^l(\bm{F}_t^{l-1}),\quad l = 1, 2, \ldots, L,
\tag{2}
\end{equation}

where $\bm{F}_r^L, \bm{F}_t^L$ are outputs of the last Transformer block. We merge these features along the channel dimension and feed them into the tracking head $\phi$ to derive the final predicted bounding box $\bm{B} = \phi(\bm{F}_r^L, \bm{F}_t^L)$. At this point, we have a basic multi-modal tracker composed of two branches that share parameters and process different modalities independently.
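The two-branch forward pass of Eq. 2 can be illustrated with a toy NumPy sketch. The block below is a deliberately simplified stand-in for $T^l$ (single-head self-attention with a residual connection, no MLP or layer normalization), and the function names, sizes, and random weights are all assumptions for illustration. The tracking head $\phi$ is not implemented; the sketch only builds its channel-concatenated input, assuming that the head consumes the search-region tokens.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def toy_block(F, Wq, Wk, Wv):
    """Simplified stand-in for T^l: single-head self-attention + residual."""
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return F + A @ V

rng = np.random.default_rng(0)
N_z, N_x, C, L = 4, 9, 8, 3                   # toy sizes (assumed)
params = [tuple(rng.standard_normal((C, C)) * 0.1 for _ in range(3))
          for _ in range(L)]

F_r = rng.standard_normal((N_z + N_x, C))     # F_r^0 = [Z_r^0; X_r^0]
F_t = rng.standard_normal((N_z + N_x, C))     # F_t^0 = [Z_t^0; X_t^0]

# Eq. 2: the same block weights process each modality independently.
for Wq, Wk, Wv in params:
    F_r = toy_block(F_r, Wq, Wk, Wv)
    F_t = toy_block(F_t, Wq, Wk, Wv)

# Merge the search-region tokens of both branches along the channel axis.
head_input = np.concatenate([F_r[N_z:], F_t[N_z:]], axis=-1)
```

Note that no information crosses between `F_r` and `F_t` inside the loop, which is exactly the independence that the cross-modulated attention is later introduced to address.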

3.3 Cross-modulated Attention

The attention mechanism is a key component of Transformer trackers [4, 30], and the correlation map is an intermediate result of Transformer attention that measures the similarity between tokens [39]. To prevent low-quality data from corrupting the correlation computation in self-attention, we use the correlations of the high-quality modality to enhance and correct those of the low-quality modality. Because the quality of modality correlations changes dynamically, we adopt a bidirectional cross-modulation strategy to make the modulation adaptive. To this end, we design a cross-modulated attention mechanism that employs the correlation maps from both modalities to enhance interaction in the Correlation Modulated Enhancement (CME) module.

Recalling the backbone in our base tracker, the inputs to layer $l$ are $\bm{F}_r^l = [\bm{Z}_r^l; \bm{X}_r^l]$ and $\bm{F}_t^l = [\bm{Z}_t^l; \bm{X}_t^l]$; here we omit the superscript $l$ and write $\bm{F}_r$, $\bm{F}_t$ for simplicity. Let $\bm{Q}_r = \bm{F}_r \bm{W}_q = [\bm{Z}_r; \bm{X}_r] \bm{W}_q = [\bm{Q}_r^z; \bm{Q}_r^x]$ and $\bm{K}_r = \bm{F}_r \bm{W}_k = [\bm{Z}_r; \bm{X}_r] \bm{W}_k = [\bm{K}_r^z; \bm{K}_r^x]$ denote the query and key matrices of the RGB modality, where $\bm{W}_q$ and $\bm{W}_k$ denote the linear projection weights for queries and keys, respectively. For the RGB branch, the computation of the correlation map $\bm{M}_r \in \mathbb{R}^{N \times N}$ from the RGB features can be expressed as:

\begin{equation}
\begin{split}
\bm{M}_r &= \bm{Q}_r \bm{K}_r^{\top} = [\bm{Q}_r^z; \bm{Q}_r^x][\bm{K}_r^z; \bm{K}_r^x]^{\top} \\
&= [\bm{Q}_r^z \bm{K}_r^{z\top}, \bm{Q}_r^z \bm{K}_r^{x\top}; \bm{Q}_r^x \bm{K}_r^{z\top}, \bm{Q}_r^x \bm{K}_r^{x\top}] \\
&= [\bm{M}_r^{zz}, \bm{M}_r^{zx}; \bm{M}_r^{xz}, \bm{M}_r^{xx}].
\end{split}
\tag{3}
\end{equation}

Note that $\bm{M}_{r}$ must still be scaled and passed through softmax to become an attention map in the usual sense. The TIR features are processed symmetrically to the RGB features: from $\bm{Q}_{t}$ and $\bm{K}_{t}$ we obtain the correlation map $\bm{M}_{t} \in \mathbb{R}^{N \times N}$ in the same way. As shown in Eq. 3, $\bm{M}_{r}$ and $\bm{M}_{t}$ can each be partitioned into four parts, $\bm{M}_{r}^{zz}, \bm{M}_{r}^{zx}, \bm{M}_{r}^{xz}, \bm{M}_{r}^{xx}$ and $\bm{M}_{t}^{zz}, \bm{M}_{t}^{zx}, \bm{M}_{t}^{xz}, \bm{M}_{t}^{xx}$, which play different roles in tracking, as proposed by [30]. To simplify the description, we name these parts $\bm{TT}, \bm{TS}, \bm{ST}, \bm{SS}$ according to the query-key pairs used to calculate the correlation. Among them, $\bm{ST}$ is special: it controls the information flow from the template to the search frame. Specifically, in most Transformer trackers [39, 2], the tracking head accepts features from the search region but relies heavily on the template features to output its results. The effect of $\bm{ST}$ on tracking results is therefore significant. Importantly, because the multimodal image pairs are spatio-temporally aligned, the $\bm{ST}$ parts of the different branches are strongly associated.
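The four-way partition can be illustrated with a short plain-Python sketch. It assumes that queries and keys are ordered template-first, as the block layout of Eq. 3 suggests; the function name `partition_correlation` and the toy sizes are hypothetical and only serve the illustration.

```python
def partition_correlation(M, n_z):
    """Split an N x N correlation map into the TT, TS, ST, SS blocks.

    Assumes rows (queries) and columns (keys) are ordered as
    [template tokens (n_z), search tokens (N - n_z)].
    """
    TT = [row[:n_z] for row in M[:n_z]]   # template query, template key
    TS = [row[n_z:] for row in M[:n_z]]   # template query, search key
    ST = [row[:n_z] for row in M[n_z:]]   # search query, template key
    SS = [row[n_z:] for row in M[n_z:]]   # search query, search key
    return TT, TS, ST, SS


# Toy example: 1 template token and 2 search tokens (N = 3).
M = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
TT, TS, ST, SS = partition_correlation(M, n_z=1)
```

In this layout, each search-query row contains one $\bm{ST}$ segment followed by one $\bm{SS}$ segment, which is why the two parts are described as horizontally adjacent.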

Existing methods [26, 15] compute correlations separately within each modality, which ignores the crucial cross-modality associations. To modulate the correlations adaptively, we design a cross-modulated attention mechanism whose CME module employs the correlation maps from both modalities to enhance their interaction. The purpose of CME is to modulate $\bm{ST}$, but we also need to take $\bm{SS}$ into account so that the final attention map is modulated. Specifically, we obtain the aggregated information $\bm{U}$ for the two horizontally adjacent parts as follows:

$$\bm{U} = \mathrm{LN}\big(\mathrm{LN}([\bm{ST}_{r}, \bm{SS}_{r}; \bm{ST}_{t}, \bm{SS}_{t}])\,\bm{W}_{e}\big) \triangleq [\bm{U}_{r}; \bm{U}_{t}] \tag{4}$$

$\mathrm{LN}$ denotes the LayerNorm [1] layer, and $\bm{W}_{e}$ is a learnable linear projection weight for embedding the two correlation parts. We then perform an attention operation on $\bm{U}$ to obtain the modulated correlation map $\bm{M}'$:

$$\bm{M}' = \mathrm{Softmax}\left(\frac{(\bm{U}\bm{W}_{q}')(\bm{U}\bm{W}_{k}')^{\top}}{\sqrt{N_{z}}}\right)[\bm{ST}_{r}; \bm{ST}_{t}] \tag{5}$$

where $N_{z}$ is the number of template tokens, and $\bm{W}_{q}'$ and $\bm{W}_{k}'$ denote the linear projection weights for queries and keys in the CME module. Next, we separate $\bm{ST}_{r}'$ and $\bm{ST}_{t}'$ from the initial modulated correlation maps $\bm{M}_{r}'$ and $\bm{M}_{t}'$, respectively.

$$CME(\bm{M}_{r}; \bm{M}_{t}) = \bm{M}'(1 + \bm{W}') = [\bm{ST}_{r}'; \bm{ST}_{t}'], \tag{6}$$
$$\bm{M}_{r}' = [0 \cdot \bm{TT}_{r}, 0 \cdot \bm{TS}_{r}; \bm{ST}_{r}', 0 \cdot \bm{SS}_{r}], \tag{7}$$

where $\bm{W}'$ is a learnable linear projection.
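Eqs. 4 to 6 can be traced with a minimal plain-Python sketch. This is one reading of the equations rather than the authors' implementation: it assumes $\bm{ST}$ has shape $N_x \times N_z$ and $\bm{SS}$ has shape $N_x \times N_x$, applies LayerNorm without learnable affine parameters, and interprets $\bm{M}'(1+\bm{W}')$ as the residual projection $\bm{M}' + \bm{M}'\bm{W}'$. The helper names, the embedding width `d`, and the random toy weights are hypothetical.

```python
import math
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax_rows(A):
    out = []
    for row in A:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def layer_norm(A, eps=1e-5):
    # LayerNorm over the last dimension, without learnable scale and shift.
    out = []
    for row in A:
        mu = sum(row) / len(row)
        var = sum((v - mu) ** 2 for v in row) / len(row)
        out.append([(v - mu) / math.sqrt(var + eps) for v in row])
    return out


def cme(ST_r, SS_r, ST_t, SS_t, W_e, W_q, W_k, W_p):
    n_x, n_z = len(ST_r), len(ST_r[0])
    # Eq. 4: join ST and SS horizontally per modality, then stack the modalities.
    stacked = [a + b for a, b in zip(ST_r, SS_r)] + [a + b for a, b in zip(ST_t, SS_t)]
    U = layer_norm(matmul(layer_norm(stacked), W_e))
    # Eq. 5: attention across all stacked rows, applied to [ST_r; ST_t].
    scores = matmul(matmul(U, W_q), transpose(matmul(U, W_k)))
    attn = softmax_rows([[v / math.sqrt(n_z) for v in row] for row in scores])
    M_prime = matmul(attn, ST_r + ST_t)
    # Eq. 6: M'(1 + W'), read here as M' + M'W' (an assumption).
    proj = matmul(M_prime, W_p)
    out = [[m + p for m, p in zip(mr, pr)] for mr, pr in zip(M_prime, proj)]
    return out[:n_x], out[n_x:]   # ST_r', ST_t'


random.seed(0)
n_z, n_x, d = 2, 3, 4
rand = lambda r, c: [[random.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
ST_r, SS_r = rand(n_x, n_z), rand(n_x, n_x)
ST_t, SS_t = rand(n_x, n_z), rand(n_x, n_x)
W_e, W_q, W_k = rand(n_z + n_x, d), rand(d, d), rand(d, d)
W_p = [[0.0] * n_z for _ in range(n_z)]   # zero projection: output equals M'
ST_r_new, ST_t_new = cme(ST_r, SS_r, ST_t, SS_t, W_e, W_q, W_k, W_p)
```

Because the attention in Eq. 5 runs over the rows of both modalities at once, every modulated row mixes $\bm{ST}$ rows from the RGB and TIR branches. With the zero projection used in the toy call, each output row is a convex combination of the rows of $[\bm{ST}_{r}; \bm{ST}_{t}]$.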

Finally, we add the obtained $\bm{M}_{r}'$ to the original correlation map $\bm{M}_{r}$ to get the final modulated correlation map. The final RGB attention map $\bm{A}_{r}$ is computed as follows:

$$\bm{A}_{r} = \mathrm{Softmax}\left(\frac{\bm{M}_{r}' + \bm{M}_{r}}{\sqrt{C}}\right), \tag{8}$$

where $C$ denotes the dimension size of the token.
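Eq. 7 and Eq. 8 amount to adding the modulated $\bm{ST}$ block back into the lower-left corner of the correlation map before the usual scaled softmax. The sketch below shows this in plain Python, assuming a template-first token order; the function name `final_attention` and the toy values are hypothetical.

```python
import math


def final_attention(M, ST_new, n_z, C):
    """Eq. 7-8 sketch: add the modulated ST block to M, scale, row-wise softmax."""
    A = []
    for i, row in enumerate(M):
        logits = list(row)
        if i >= n_z:                      # search-query rows hold ST and SS
            for j in range(n_z):          # only the ST columns are modulated
                logits[j] += ST_new[i - n_z][j]
        logits = [v / math.sqrt(C) for v in logits]
        m = max(logits)
        e = [math.exp(v - m) for v in logits]
        s = sum(e)
        A.append([v / s for v in e])
    return A


# Toy example: 1 template token, 2 search tokens, token dimension C = 4.
M_r = [[0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0]]
ST_r_new = [[2.0], [0.0]]   # modulated ST block, one row per search token
A_r = final_attention(M_r, ST_r_new, n_z=1, C=4)
```

Only the first search-query row changes here: its template weight rises above the uniform value, and its two search weights shrink accordingly because the softmax normalizes over the whole row.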

In addition, as illustrated in Figure 4 (a), the proposed Cross-modulated Attention is a symmetric structure in which the parameters at the corresponding positions on the left and right sides of the figure are shared. For a multi-head attention block, we share the parameters of the CME module between the parallel multi-heads. It is worth noting that our CME module can be easily applied to other parts in the attention map.

3.4 Collaborative Token Elimination

Efficiency is an important metric for evaluating tracking methods [38, 5]. Ye et al. [39] employ an early candidate elimination strategy to speed up inference in some blocks. This mechanism requires accurate attention weights between the target and each candidate, which are difficult to obtain from a low-quality modality. To solve this problem, we propose a Collaborative Token Elimination (CTE) strategy that combines the attention weights from the two modalities to make the decision.

Given the query vector $\bm{q}_{r}^{z}$ from $\bm{Q}_{r}^{z}$ and $\bm{q}_{t}^{z}$ from $\bm{Q}_{t}^{z}$ (following [39], we choose the token at the center of the template), each search region token at absolute position $i$ is assigned a scalar $h_{i}$:

$$\bm{h} = \mathrm{softmax}(\bm{q}_{r}^{z}\bm{K}_{r}^{x}) + \mathrm{softmax}(\bm{q}_{t}^{z}\bm{K}_{t}^{x}), \tag{9}$$

where $\bm{K}_{r}^{x}$ and $\bm{K}_{t}^{x}$ are the key vectors of the search region tokens. We then sort the search region tokens by $h_{i}$ and keep the top-k tokens. This makes token elimination more stable, particularly when the quality of one modality declines, and it accelerates inference while maintaining robustness.
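The scoring and selection step of Eq. 9 can be sketched as follows. This is an illustrative plain-Python version, not the authors' code: keys are stored as one vector per search token, the kept indices are returned in their original spatial order (an assumption), and the function name `collaborative_token_elimination` is hypothetical.

```python
import math


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def collaborative_token_elimination(q_r, K_r, q_t, K_t, k):
    """Score each search token with Eq. 9 and keep the k highest-scoring ones."""
    dot = lambda q, key: sum(a * b for a, b in zip(q, key))
    h_r = softmax([dot(q_r, key) for key in K_r])   # RGB attention weights
    h_t = softmax([dot(q_t, key) for key in K_t])   # TIR attention weights
    h = [a + b for a, b in zip(h_r, h_t)]
    ranked = sorted(range(len(h)), key=lambda i: h[i], reverse=True)
    return sorted(ranked[:k]), h


# Toy case: the RGB keys are uninformative (all zero), the TIR keys are not.
q_r, K_r = [1.0, 0.0], [[0.0, 0.0]] * 4
q_t, K_t = [1.0, 0.0], [[3.0, 0.0], [0.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
kept, h = collaborative_token_elimination(q_r, K_r, q_t, K_t, k=2)
```

In the toy case the degraded RGB branch contributes the same weight to every token, so the ranking is decided by the TIR branch alone. This is the situation the collaborative score is meant to handle.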

Table 1: Comparison with state-of-the-art methods. The top 3 results are highlighted with red, blue, and green fonts, respectively. "*" indicates that speed was tested on the same GPU (Nvidia 3080ti).

| Method | Backbone | Pub. Info. | GTOT PR ↑ | GTOT SR ↑ | RGBT210 PR ↑ | RGBT210 SR ↑ | RGBT234 PR ↑ | RGBT234 SR ↑ | LasHeR PR ↑ | LasHeR NPR ↑ | LasHeR SR ↑ | VTUAV PR ↑ | VTUAV SR ↑ | FPS ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DAPNet [46] | VGG-M | ACM MM’19 | 88.2 | 70.7 | - | - | 76.6 | 53.7 | 43.1 | 38.3 | 31.4 | - | - | - |
| MANet [19] | VGG-M | ICCVW’19 | 89.4 | 72.4 | - | - | 77.7 | 53.9 | 45.5 | - | 32.6 | - | - | 2.1* |
| DAFNet [9] | VGG-M | ICCVW’19 | 89.1 | 71.6 | - | - | 79.6 | 54.4 | 44.8 | 39.0 | 31.1 | 62.0 | 45.8 | 20.5* |
| mfDiMP [41] | ResNet-50 | ICCVW’19 | 83.6 | 69.7 | 84.9 | 59.3 | 84.2 | 59.1 | 59.9 | - | 46.7 | 67.3 | 55.4 | 34.6* |
| CAT [20] | VGG-M | ECCV’20 | 88.9 | 71.7 | 79.2 | 53.3 | 80.4 | 56.1 | 45.0 | 39.5 | 31.4 | - | - | - |
| MaCNet [40] | VGG-M | Sensors’20 | - | - | - | - | 79.0 | 55.4 | 48.2 | 42.0 | 35.0 | - | - | 1.6* |
| CMPP [32] | VGG-M | CVPR’20 | 92.6 | 73.8 | - | - | 82.3 | 57.5 | - | - | - | - | - | - |
| FANet [47] | VGG-M | TIV’21 | - | - | - | - | 78.7 | 55.3 | 44.1 | 38.4 | 30.9 | - | - | - |
| MANet++ [24] | VGG-M | TIP’21 | 88.2 | 70.7 | - | - | 80.0 | 55.4 | 46.7 | 40.4 | 31.4 | - | - | - |
| SiamCDA [42] | ResNet-50 | TCSVT’21 | 87.7 | 73.2 | - | - | 76.0 | 56.9 | - | - | - | - | - | - |
| DMCNet [25] | VGG-M | TNNLS’22 | - | - | 79.7 | 55.5 | 83.9 | 59.3 | 49.0 | 43.1 | 35.5 | - | - | - |
| APFNet [36] | VGG-M | AAAI’22 | 90.5 | 73.7 | - | - | 82.7 | 57.9 | 50.0 | 43.9 | 36.2 | - | - | 1.9* |
| MIRNet [12] | VGG-M | ICME’22 | 90.9 | 74.4 | - | - | 81.6 | 58.9 | - | - | - | - | - | - |
| TFNet [48] | VGG-M | TCSVT’22 | 88.6 | 72.9 | 77.7 | 52.9 | 80.6 | 56.0 | - | - | - | - | - | - |
| HMFT [28] | ResNet-50 | CVPR’22 | 91.2 | 74.9 | - | - | 78.8 | 56.8 | - | - | - | 75.8 | 62.7 | 30.2 |
| ViPT [44] | ViT-B | CVPR’23 | - | - | - | - | 83.5 | 61.7 | 65.1 | - | 52.5 | - | - | - |
| MACFT [26] | ViT-B | Sensors’23 | 90.0 | 72.7 | - | - | 85.7 | 62.2 | 65.3 | - | 51.4 | 80.1 | 66.8 | 33.3 |
| TBSI [15] | ViT-B | CVPR’23 | - | - | 85.3 | 62.5 | 87.1 | 63.7 | 69.2 | 65.7 | 55.6 | - | - | 36.2* |
| OneTracker [11] | ViT-B | CVPR’24 | - | - | - | - | 85.7 | 64.2 | 67.2 | - | 53.8 | - | - | - |
| Un-Track [35] | ViT-B | CVPR’24 | - | - | - | - | 84.2 | 62.5 | 66.7 | - | 53.6 | - | - | - |
| SDSTrack [13] | ViT-B | CVPR’24 | - | - | - | - | 84.8 | 62.5 | 66.5 | - | 53.1 | - | - | - |
| TATrack [33] | ViT-B | AAAI’24 | - | - | 85.3 | 61.8 | 87.2 | 64.4 | 70.2 | - | 56.1 | - | - | 26.1 |
| CAFormer | ViT-B | - | 91.8 | 76.9 | 85.6 | 63.2 | 88.3 | 66.4 | 70.0 | 66.1 | 55.6 | 88.6 | 76.2 | 83.6* |

4 Experiments

4.1 Implementation Details

To give a more concrete understanding of the proposed method, we present the implementation details here. The proposed CAFormer blocks are integrated into the last 3 layers of the backbone. The search regions are resized to 256 × 256, while the templates are resized to 128 × 128. CAFormer is trained on 2 NVIDIA 2080ti GPUs with a global batch size of 32. We set the learning rates of the backbone network and of the other parameters to 5e-6 and 5e-5, respectively. The optimizer is AdamW [23] with a weight decay of 1e-4. We train our model for 10 epochs on the training set of LasHeR [21], and each epoch consists of 60K image pairs. For GTOT [16], RGBT210 [17], and RGBT234 [18], we directly evaluate our model without any further fine-tuning. For the VTUAV [28] dataset, we train on the VTUAV training set and reduce the number of training epochs to 5. Following previous work [15], all models in this paper are initialized with pre-trained weights from the public SOT method [39].
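For reference, the hyperparameters above can be collected in a small configuration object. This is only a sketch: the class and field names are hypothetical, and the two derived quantities assume that the global batch is split evenly across the GPUs and that one training sample corresponds to one image pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in Section 4.1 (field names are illustrative)."""
    search_size: int = 256          # search region resized to 256 x 256
    template_size: int = 128        # template resized to 128 x 128
    num_gpus: int = 2
    global_batch_size: int = 32
    lr_backbone: float = 5e-6
    lr_other: float = 5e-5
    weight_decay: float = 1e-4      # used with AdamW
    epochs: int = 10                # 5 when training on VTUAV
    pairs_per_epoch: int = 60_000
    num_caformer_layers: int = 3    # CAFormer blocks in the last 3 layers

    @property
    def per_gpu_batch(self) -> int:
        # Assumes the global batch is split evenly across GPUs.
        return self.global_batch_size // self.num_gpus

    @property
    def iters_per_epoch(self) -> int:
        # Assumes one training sample is one image pair.
        return self.pairs_per_epoch // self.global_batch_size


lasher_cfg = TrainConfig()
vtuav_cfg = TrainConfig(epochs=5)
```

Under these assumptions, the LasHeR setting corresponds to 16 samples per GPU and 1875 iterations per epoch.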

4.2 Evaluation on Public Datasets

Our experiments are performed on five public datasets: GTOT [16], RGBT210 [17], RGBT234 [18], LasHeR [21], and VTUAV [28]. For evaluation, we adopt the commonly used Precision Rate (PR) and Success Rate (SR) metrics. Following previous work [21], we also adopt the Normalized Precision Rate (NPR) [27] metric for LasHeR. In addition, since the GTOT [16], RGBT234 [18], and VTUAV [28] datasets provide ground truth for both modalities, we follow prior works [18, 28, 13] and use the best result over the two modalities to circumvent small alignment errors.
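The two main metrics can be sketched in plain Python as they are commonly defined in tracking benchmarks: PR is the fraction of frames whose center location error is within a pixel threshold, and SR is the area under the success curve over overlap thresholds. The 20-pixel default, the 21 overlap thresholds, and the per-frame "best of two ground truths" rule are assumptions made for this illustration, and the function names are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def center_error(a, b):
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))


def precision_rate(preds, gts_rgb, gts_tir, threshold=20.0):
    # Per frame, use the smaller error over the two modality ground truths.
    errs = [min(center_error(p, g1), center_error(p, g2))
            for p, g1, g2 in zip(preds, gts_rgb, gts_tir)]
    return sum(e <= threshold for e in errs) / len(errs)


def success_rate(preds, gts_rgb, gts_tir, steps=21):
    # Per frame, use the larger overlap over the two modality ground truths.
    ious = [max(iou(p, g1), iou(p, g2))
            for p, g1, g2 in zip(preds, gts_rgb, gts_tir)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    curve = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return sum(curve) / len(curve)   # area under the success curve


preds = [(0, 0, 10, 10), (100, 100, 10, 10)]
gts_rgb = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts_tir = [(1, 0, 10, 10), (0, 0, 10, 10)]
pr = precision_rate(preds, gts_rgb, gts_tir)
sr = success_rate(preds, gts_rgb, gts_tir)
```

In the toy sequence the first frame is a perfect match against the RGB annotation even though the TIR annotation is shifted by one pixel, which shows how taking the best of the two annotations absorbs small alignment errors.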

GTOT [16] contains 50 video sequence pairs. As shown in Table 1, compared to previous state-of-the-art trackers [28, 32], our method outperforms HMFT [28] by 0.6%/2.0% in PR/SR, and its SR score is 3.1% higher than that of CMPP [32].

RGBT210 [17] is a challenging RGBT dataset that contains 210 video sequence pairs, 210K frames, and 12 tracking challenge attributes. On RGBT210, our method achieves the best PR/SR score of 85.6%/63.2%. Compared with TBSI [15], the improvement is a minor 0.3%/0.7% in PR/SR, but the proposed method runs more than twice as fast (83.6 vs. 36.2 FPS in Table 1). In addition, our method has a significant advantage over other methods, outperforming CAT [20] and TFNet [48] by 6.4%/9.9% and 7.9%/10.3% in PR/SR, respectively.

RGBT234 [18] extends RGBT210 and contains 12 challenge attributes, 234K frames, and 234 video sequence pairs. As shown in Table 1, we compare our method with recently proposed RGBT trackers and achieve the best result. TBSI [15] is a state-of-the-art method based on feature fusion, and our method outperforms it by a significant margin of 1.2%/2.7% in PR/SR. Our method also outperforms mfDiMP [41] and ViPT [44] by 4.1%/7.3% and 4.8%/4.7% in PR/SR, respectively.

LasHeR [21] contains 19 challenge attributes, 734.8K frames, and 1224 video sequence pairs. Our tracker significantly outperforms mfDiMP and ViPT, by 10.1%/8.9% and 4.9%/3.1% in PR/SR, respectively. Although our method leads TBSI [15] by only 0.8%/0.4% in PR/NPR, TBSI clearly lags behind our method in tracking efficiency because of its bulky multi-level feature interaction.

VTUAV [28] is a large-scale RGBT dataset designed specifically for UAV perspectives. It contains 500 video sequence pairs with 1.7M image pairs at 1920 × 1080 resolution. Our method outperforms all previous methods on this dataset. Specifically, it leads MACFT [26], the previous state-of-the-art method, by 8.5%/9.4% in PR/SR. This indicates that the proposed method is equally applicable to UAV scenarios and that its efficiency suits their needs.

4.3 Ablation Study

Table 2: Evaluation results for different structures.

| Method | RGBT234 PR | RGBT234 SR | LasHeR PR | LasHeR SR |
|---|---|---|---|---|
| RGBT baseline | 86.4 | 64.5 | 67.8 | 54.0 |
| w/o $\bm{SS}$ | 87.6 | 65.6 | 69.2 | 55.1 |
| w/o Cross-modal | 87.5 | 65.9 | 68.3 | 54.3 |
| Full model (CAFormer) | 88.3 | 66.4 | 70.0 | 55.6 |

4.3.1 Component Analysis

As shown in Table 2, we compare different designs for the proposed CMA module.

w/o $\bm{SS}$. Since the softmax operation spans two horizontally neighboring parts of the correlation map, the two parts affect each other. Therefore, when we process $\bm{ST}$, we also take $\bm{SS}$ into account. When $\bm{SS}$ is removed from the input, compared with feeding both $\bm{ST}$ and $\bm{SS}$ as in the full model, the PR/SR scores decrease by 0.7%/0.8% on RGBT234 and 0.8%/0.5% on LasHeR, as shown in Table 2. This indicates that $\bm{SS}$ plays an important role in adjusting the weights of $\bm{ST}$.
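The coupling between the two parts can be seen in a tiny numeric example. The sketch below takes one search-query row with a single $\bm{ST}$ logit followed by two $\bm{SS}$ logits (toy values chosen for illustration) and changes only the $\bm{ST}$ logit.

```python
import math


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


row = [0.0, 1.0, 1.0]                 # [ST | SS, SS] logits of one search query
before = softmax(row)
row_mod = [row[0] + 1.0] + row[1:]    # modulate only the ST logit
after = softmax(row_mod)
```

Although the $\bm{SS}$ logits are untouched, their attention weights drop once the $\bm{ST}$ logit is raised, because the softmax normalizes over the full row. A module that modulates $\bm{ST}$ therefore changes the $\bm{SS}$ weights as well.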

w/o Cross-modal. The proposed CME module aims to exploit the association of correlations between modalities. Removing this mechanism means that the correlation weights of each modality only interact with themselves. As shown in Table 2, this leads to a significant decrease of 1.7% and 1.3% in PR and SR, respectively, compared to the full model on LasHeR. This suggests that the main performance gain of our method comes from the cross-modal interaction of correlation weights.

4.3.2 CMA Utilization in Different Parts

Besides modulating $\bm{ST}$, we also deploy the CME module on other parts of the correlation map. The results on RGBT234 [18] and LasHeR [21] are summarized in Table 3, where "($\bm{ST},\bm{SS}$)" means that the two parts are treated as a whole. The best result is obtained when CME is applied to $\bm{ST}$, which confirms our view that $\bm{ST}$ has a stronger cross-modal correlation and is critical for tracking. When we do not distinguish the parts, i.e., "($\bm{ST},\bm{SS}$)", the PR/SR scores decrease by 1.1%/0.9% on LasHeR and the model is less efficient than with $\bm{ST}$ alone. This shows that it is necessary to distinguish different parts of the correlation map.

Table 3: Modulating different parts of the correlation map.

| Part | RGBT234 PR | RGBT234 SR | LasHeR PR | LasHeR SR |
|---|---|---|---|---|
| $\bm{ST}$ | 88.3 | 66.4 | 70.0 | 55.6 |
| $\bm{SS}$ | 87.8 | 65.8 | 69.1 | 55.0 |
| $\bm{TS}$ | 87.7 | 65.7 | 68.2 | 54.4 |
| ($\bm{ST}$, $\bm{SS}$) | 86.4 | 64.4 | 68.9 | 54.7 |

Table 4: Apply layers of the proposed CAFormer block.

| Layers | RGBT234 PR | RGBT234 SR | LasHeR PR | LasHeR SR |
|---|---|---|---|---|
| last 1 layer | 87.5 | 65.6 | 69.2 | 55.1 |
| last 3 layers | 88.3 | 66.4 | 70.0 | 55.6 |
| last 6 layers | 86.6 | 65.0 | 68.1 | 54.2 |
| 4,7,10 layers | 88.5 | 65.8 | 69.6 | 55.1 |

Table 5: Different candidate elimination strategies.

| CE | CTE | LasHeR PR | LasHeR SR | FPS | MACs (G) |
|---|---|---|---|---|---|
| | | 69.3 | 55.1 | 76.4 | 58.43 |
| ✓ | | 69.5 | 55.2 | 83.6 | 42.91 |
| | ✓ | 70.0 | 55.6 | 83.6 | 42.91 |

4.3.3 CMA Insertion in Different Layers

Here we insert the CAFormer block into different layers and summarize the experimental results on RGBT234 [18] and LasHeR [21] in Table 4. The results show that applying only one CAFormer block already brings a significant improvement, which shows the necessity of correlation fusion. When the number of blocks increases to 3, the additional gain is smaller, and when it increases further to 6, the results become worse. This suggests that insufficient prior information makes it difficult to distinguish potentially correct correlation weights, which yields erroneous interactions and degrades performance. Finally, we choose the last 3 layers as the final solution.
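The "last k layers" settings of Table 4 can be made concrete with a small helper. The sketch assumes a 12-block encoder, which is the depth of the standard ViT-B, and uses 0-based block indices; the helper name `last_k_layers` is hypothetical.

```python
def last_k_layers(depth, k):
    """Return the 0-based indices of the last k encoder blocks."""
    if not 0 < k <= depth:
        raise ValueError("k must be between 1 and depth")
    return list(range(depth - k, depth))


DEPTH = 12  # assumed: the standard ViT-B encoder has 12 blocks
configs = {f"last {k}": last_k_layers(DEPTH, k) for k in (1, 3, 6)}
```

With this convention, the "last 6 layers" setting reaches down to the middle of the backbone, while the chosen "last 3 layers" setting touches only the final quarter of the blocks.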

4.3.4 Different Token Elimination Schemes

To verify the effectiveness of the proposed Collaborative Token Elimination (CTE) strategy, we also evaluate the Candidate Elimination (CE) strategy of [39]. As shown in Table 5, CTE not only improves the inference speed but also significantly enhances performance, whereas the CE strategy primarily improves efficiency. Specifically, adding either the CTE or the CE strategy improves the tracking speed by 9.4% and decreases the MACs by 26.6%. In terms of tracking performance, CTE obtains a 0.7%/0.5% improvement in PR/SR, which is significantly larger than the 0.2%/0.1% improvement of CE. This indicates that the proposed method is capable of mitigating the effect of noisy weights on the learning of the CME modules. More importantly, it ensures that the weights at corresponding locations of the different modalities can interact and thus better adapt to the CME module. We provide a visual comparison of CE and CTE in the supplementary material.
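The relative figures quoted above follow from the entries of Table 5, taking the first row (76.4 FPS, 58.43 G MACs, 69.3/55.1 PR/SR) as the setting without token elimination, which is our reading of the table. As a worked check:

```latex
\begin{align*}
\text{speed gain} &= \frac{83.6 - 76.4}{76.4} = \frac{7.2}{76.4} \approx 9.4\% \\
\text{MACs reduction} &= \frac{58.43 - 42.91}{58.43} = \frac{15.52}{58.43} \approx 26.6\% \\
\text{CTE: } \Delta\mathrm{PR} &= 70.0 - 69.3 = 0.7, \qquad \Delta\mathrm{SR} = 55.6 - 55.1 = 0.5 \\
\text{CE: } \Delta\mathrm{PR} &= 69.5 - 69.3 = 0.2, \qquad \Delta\mathrm{SR} = 55.2 - 55.1 = 0.1
\end{align*}
```

The speed and MACs figures are relative changes, while the PR/SR figures are absolute differences in percentage points.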

Figure 5: Attribute-based evaluation on RGBT234 dataset. In parentheses, the value on the left indicates the minimum success rate, and on the right the maximum success rate.

4.4 Attribute-based Performance

We evaluate the performance of our proposed method in various scenarios by conducting experiments on different challenge attribute subsets of the RGBT234 dataset [18], including no occlusion (NO), partial occlusion (PO), heavy occlusion (HO), low illumination (LI), low resolution (LR), thermal crossover (TC), deformation (DEF), fast motion (FM), scale change (SC), motion blur (MB), camera moving (CM), and background clutter (BC). All results are summarized in Figure 5. Our method exhibits significant improvement over the CNN-based method [36] in challenges such as HO and MB, owing to the long-range modeling capability of the Transformer. Furthermore, our method outperforms the Transformer-based feature fusion methods [15, 44] under all challenges. This demonstrates the advantages of the proposed correlation fusion scheme.

5 Conclusion

In this paper, we reveal a consistency in the correlations of different modal branches and exploit it to design a correlation fusion module. The proposed method also provides a novel fusion idea for multi-modal tracking that differs from feature fusion. Experimental results indicate that the performance of correlation fusion is competitive with or surpasses that of state-of-the-art feature fusion methods. Additionally, we introduce a Collaborative Token Elimination strategy that enhances the differentiation between foreground and background and further improves efficiency and performance. In the future, we plan to combine correlation fusion and feature fusion to further improve tracking performance.

References

  • Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Chen et al. [2022] Boyu Chen, Peixia Li, Lei Bai, Lei Qiao, Qiuhong Shen, Bo Li, Weihao Gan, Wei Wu, and Wanli Ouyang. Backbone is all your need: A simplified architecture for visual object tracking. In Proceedings of the European Conference on Computer Vision, pages 375–392, 2022.
  • Chen et al. [2021] Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8126–8135, 2021.
  • Cui et al. [2022] Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13608–13618, 2022.
  • Cui et al. [2024] Yutao Cui, Tianhui Song, Gangshan Wu, and Limin Wang. Mixformerv2: Efficient fully transformer tracking. Advances in Neural Information Processing Systems, 36, 2024.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  • Fu et al. [2022] Zhihong Fu, Zehua Fu, Qingjie Liu, Wenrui Cai, and Yunhong Wang. Sparsett: Visual tracking with sparse transformers. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022.
  • Gao et al. [2022] Shenyuan Gao, Chunluan Zhou, Chao Ma, Xinggang Wang, and Junsong Yuan. Aiatrack: Attention in attention for transformer visual tracking. In Proceedings of the European Conference on Computer Vision, 2022.
  • Gao et al. [2019] Yuan Gao, Chenglong Li, Yabin Zhu, Jin Tang, Tao He, and Futian Wang. Deep adaptive fusion network for high performance rgbt tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
  • Guo et al. [2022] Meng-Hao Guo, Tian-Xing Xu, Jiang-Jiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang Mu, Song-Hai Zhang, Ralph R Martin, Ming-Ming Cheng, and Shi-Min Hu. Attention mechanisms in computer vision: A survey. Computational visual media, 8(3):331–368, 2022.
  • Hong et al. [2024] Lingyi Hong, Shilin Yan, Renrui Zhang, Wanyun Li, Xinyu Zhou, Pinxue Guo, Kaixun Jiang, Yiting Chen, Jinglun Li, Zhaoyu Chen, et al. Onetracker: Unifying visual object tracking with foundation models and efficient tuning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2024.
  • Hou et al. [2022] Ruichao Hou, Tongwei Ren, and Gangshan Wu. Mirnet: A robust rgbt tracking jointly with multi-modal interaction and refinement. In IEEE International Conference on Multimedia and Expo, pages 1–6, 2022.
  • Hou et al. [2024] Xiaojun Hou, Jiazheng Xing, Yijie Qian, Yaowei Guo, Shuo Xin, Junhao Chen, Kai Tang, Mengmeng Wang, Zhengkai Jiang, Liang Liu, et al. Sdstrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2024.
  • Hu et al. [2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • Hui et al. [2023] Tianrui Hui, Zizheng Xun, Fengguang Peng, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Jiao Dai, Jizhong Han, and Si Liu. Bridging search region interaction with template for rgb-t tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13630–13639, 2023.
  • Li et al. [2016] Chenglong Li, Hui Cheng, Shiyi Hu, Xiaobai Liu, Jin Tang, and Liang Lin. Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Transactions on Image Processing, 25(12):5743–5756, 2016.
  • Li et al. [2017] Chenglong Li, Nan Zhao, Yijuan Lu, Chengli Zhu, and Jin Tang. Weighted sparse representation regularized graph learning for rgb-t object tracking. Proceedings of the 25th ACM international conference on Multimedia, 2017.
  • Li et al. [2019a] Chenglong Li, Xinyan Liang, Yijuan Lu, Nan Zhao, and Jin Tang. Rgb-t object tracking: Benchmark and baseline. Pattern Recognition, 96:106977, 2019a.
  • Li et al. [2019b] Chenglong Li, Andong Lu, Aihua Zheng, Zhengzheng Tu, and Jin Tang. Multi-adapter rgbt tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshop, pages 2262–2270, 2019b.
  • Li et al. [2020] Chenglong Li, Lei Liu, Andong Lu, Qing Ji, and Jin Tang. Challenge-aware rgbt tracking. In Proceedings of the European Conference on Computer Vision, pages 222–237, 2020.
  • Li et al. [2021] Chenglong Li, Wanlin Xue, Yaqing Jia, Zhichen Qu, Bin Luo, Jin Tang, and Dengdi Sun. Lasher: A large-scale high-diversity benchmark for rgbt tracking. IEEE Transactions on Image Processing, 31:392–404, 2021.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Lu et al. [2021] Andong Lu, Chenglong Li, Yuqing Yan, Jin Tang, and Bin Luo. Rgbt tracking via multi-adapter network with hierarchical divergence loss. IEEE Transactions on Image Processing, 30:5613–5625, 2021.
  • Lu et al. [2022] Andong Lu, Cun Qian, Chenglong Li, Jin Tang, and Liang Wang. Duality-gated mutual condition network for rgbt tracking. IEEE Transactions on Neural Networks and Learning Systems, pages 1–14, 2022.
  • Luo et al. [2023] Yang Luo, Xiqing Guo, Mingtao Dong, and Jin Yu. Learning modality complementary features with mixed attention mechanism for rgb-t tracking. Sensors, 23(14):6609, 2023.
  • Muller et al. [2018] Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision, pages 300–317, 2018.
  • Pengyu et al. [2022] Zhang Pengyu, Jie Zhao, Dong Wang, Huchuan Lu, and Xiang Ruan. Visible-thermal uav tracking: A large-scale benchmark and new baseline. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2022.
  • Schlatt et al. [2024] Ferdinand Schlatt, Maik Fröbe, and Matthias Hagen. Investigating the effects of sparse attention on cross-encoders. In European Conference on Information Retrieval, pages 173–190, 2024.
  • Song et al. [2023] Zikai Song, Run Luo, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Compact transformer tracker with correlative masked modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • Wang et al. [2020] Chaoqun Wang, Chunyan Xu, Zhen Cui, Ling Zhou, Tong Zhang, Xiaoya Zhang, and Jian Yang. Cross-modal pattern-propagation for rgb-t tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2020.
  • Wang et al. [2024] Hongyu Wang, Xiaotao Liu, Yifan Li, Meng Sun, Dian Yuan, and Jing Liu. Temporal adaptive rgbt tracking with modality prompt. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
  • Wang et al. [2018] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • Wu et al. [2024] Zongwei Wu, Jilai Zheng, Xiangxuan Ren, Florin-Alexandru Vasluianu, Chao Ma, Danda Pani Paudel, Luc Van Gool, and Radu Timofte. Single-model and any-modality for video object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024.
  • Xiao et al. [2022] Yun Xiao, Mengmeng Yang, Chenglong Li, Lei Liu, and Jin Tang. Attribute-based progressive fusion network for rgbt tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  • Xu et al. [2023] Qianxiong Xu, Wenting Zhao, Guosheng Lin, and Cheng Long. Self-calibrated cross attention network for few-shot segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 655–665, 2023.
  • Yan et al. [2021] Bin Yan, Houwen Peng, Kan Wu, Dong Wang, Jianlong Fu, and Huchuan Lu. Lighttrack: Finding lightweight neural networks for object tracking via one-shot architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • Ye et al. [2022] Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, pages 341–357, 2022.
  • Zhang et al. [2020] Hui Zhang, Lei Zhang, Li Zhuo, and Jing Zhang. Object tracking in rgb-t videos using modal-aware attention network and competitive learning. Sensors, 20(2):393, 2020.
  • Zhang et al. [2019] Lichao Zhang, Martin Danelljan, Abel Gonzalez-Garcia, Joost van de Weijer, and Fahad Shahbaz Khan. Multi-modal fusion for end-to-end rgb-t tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
  • Zhang et al. [2021] Tianlu Zhang, Xueru Liu, Qiang Zhang, and Jungong Han. Siamcda: Complementarity- and distractor-aware rgb-t tracking based on siamese network. IEEE Transactions on Circuits and Systems for Video Technology, 32(3):1403–1417, 2021.
  • Zhou et al. [2024] Quan Zhou, Huimin Shi, Weikang Xiang, Bin Kang, and Longin Jan Latecki. Dpnet: Dual-path network for real-time object detection with lightweight attention. IEEE Transactions on Neural Networks and Learning Systems, 2024.
  • Zhu et al. [2023] Jiawen Zhu, Simiao Lai, Xin Chen, Dong Wang, and Huchuan Lu. Visual prompt multi-modal tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9516–9526, 2023.
  • Zhu et al. [2020a] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020a.
  • Zhu et al. [2019] Yabin Zhu, Chenglong Li, Bin Luo, Jin Tang, and Xiao Wang. Dense feature aggregation and pruning for rgbt tracking. In Proceedings of the ACM International Conference on Multimedia, pages 465–472, 2019.
  • Zhu et al. [2020b] Yabin Zhu, Chenglong Li, Jin Tang, and Bin Luo. Quality-aware feature aggregation network for robust rgbt tracking. IEEE Transactions on Intelligent Vehicles, 6(1):121–130, 2020b.
  • Zhu et al. [2021] Yabin Zhu, Chenglong Li, Jin Tang, Bin Luo, and Liang Wang. Rgbt tracking by trident fusion network. IEEE Transactions on Circuits and Systems for Video Technology, 32(2):579–592, 2021.