Cross-modulated Attention Transformer for RGBT Tracking

Yun Xiao¹ Jiacong Zhao¹ Andong Lu² Chenglong Li¹ Yin Lin³ Bing Yin³ Cong Liu³
¹ School of Artificial Intelligence, Anhui University, Hefei, China
² School of Computer Science and Technology, Anhui University, Hefei, China
³ iFLYTEK CO.LTD., Hefei, China

Abstract

Existing Transformer-based RGBT trackers achieve remarkable performance benefits by leveraging self-attention to extract uni-modal features and cross-attention to enhance multi-modal feature interaction and template-search correlation computation. Nevertheless, the independent search-template correlation calculations ignore the consistency between branches, which can result in ambiguous and inappropriate correlation weights. This not only limits the intra-modal feature representation, but also harms the robustness of cross-attention for multi-modal feature interaction and search-template correlation computation. To address these issues, we propose a novel approach called Cross-modulated Attention Transformer (CAFormer), which performs intra-modality self-correlation, inter-modality feature interaction, and search-template correlation computation in a unified attention model, for RGBT tracking. In particular, we first independently generate correlation maps for each modality and feed them into the designed Correlation Modulated Enhancement module, which modulates inaccurate correlation weights by seeking the consensus between modalities. This design unifies the self-attention and cross-attention schemes, which not only alleviates inaccurate attention weight computation in self-attention but also eliminates the redundant computation introduced by an extra cross-attention scheme. In addition, we propose a collaborative token elimination strategy to further improve tracking inference efficiency and accuracy. Extensive experiments on five public RGBT tracking benchmarks show the outstanding performance of the proposed CAFormer against state-of-the-art methods.

1 Introduction

Figure 1: Comparison of performance and speed for state-of-the-art tracking methods on RGBT234 [18]. We plot the Success Rate (SR) against Frames Per Second (FPS). Closer to the top means higher performance, and closer to the right means faster. CAFormer ranks first in SR while running at 83.6 FPS.

RGBT tracking [20, 21, 25, 15, 13], which involves fusing information from both visible and thermal infrared (TIR) modalities for visual tracking, has become an active research field in the computer vision community. Recently, with the success of Transformers in visual object tracking (VOT) [3, 39, 2], RGBT trackers based on Transformers have gradually gained advantages in terms of speed and performance.

The Transformer is successfully applied in RGBT tracking thanks to its attention mechanism, which allows it to selectively focus on relevant information and ignore irrelevant information. Existing Transformer-based RGBT trackers [15, 13, 12] achieve remarkable performance benefits by leveraging self-attention to extract uni-modal features and cross-attention to enhance multi-modal feature interaction. However, we observe that the correlation computation in self-attention is sensitive to low-quality data, resulting in ambiguous and inappropriate correlation weights, as shown in the second row of Figure 2. Importantly, existing works [39, 8, 7] suggest that proper correlation is important for tracking. We therefore argue that the strategy of modeling self-attention independently for each modality, widely adopted in existing methods, is limited. This limitation not only impairs intra-modality feature representation, but also affects subsequent multi-modal feature interaction and the robustness of template-search cross-correlation. Moreover, computing self-attention and cross-attention separately introduces redundant computation, which limits the speed of existing RGBT trackers.

To address these issues, we propose a novel approach called Cross-modulated Attention Transformer (CAFormer), which performs intra-modality feature extraction and inter-modality feature interaction in a unified attention model, for RGBT tracking. Visible and infrared images in RGBT tracking are highly spatially aligned, so the correlations between search frames and target templates should also be consistent across modalities. Consequently, the self-correlations of different modalities exhibit similar interaction properties with multi-modal image features. This motivates an intuitive idea: use high-quality modal correlations to enhance and correct low-quality ones. To adapt to the dynamic change of modality quality in RGBT tracking, we design a bidirectional Cross-Modulated Attention (CMA) to achieve adaptive correlation modulation. In particular, we first compute the correlation maps for each modality independently, and then feed them into the designed Correlation Modulated Enhancement (CME) module for cross-correlation modeling to seek a correlation agreement between the two modalities, which corrects inaccurate correlations produced by the preceding self-attention, as shown in the third row of Figure 2. Moreover, CMA is more efficient in fusion. Taking ViT-Base as an example backbone, feature fusion must process input features of dimension 768. CMA only needs to process the search-template part of the correlation map, where the dimension of the correlation vectors is determined by the number of template tokens, which is usually 64. By avoiding computation on higher-dimensional features, CAFormer far outperforms existing feature fusion methods in terms of efficiency. In summary, the proposed CMA unifies the self-attention and cross-attention schemes, which not only mitigates inaccurate correlations in self-attention, but also avoids the computational burden of additional cross-attention.

In addition, inspired by the candidate elimination method in OSTrack [39], we propose a collaborative token elimination strategy to further improve tracking inference efficiency and accuracy. Specifically, within the search region, we consider each token as a potential candidate for the target and treat each template token as a constituent of the target object. Leveraging the prior knowledge about the similarity between the target and each candidate provided by the correlations in the individual modality branches, we sum the similarities of the two modalities to obtain an overall similarity, and then discard the tokens with lower similarity. In this way, we coordinate the initial elimination results from both modalities to improve background elimination precision. Consequently, our module not only enhances tracking efficiency but also maintains robust performance.

Figure 1 compares CAFormer with existing state-of-the-art methods in tracking accuracy and speed; CAFormer achieves excellent performance on both metrics simultaneously. This demonstrates the superiority and potential of the proposed cross-modulated attention.

The contributions of this paper can be summarized as follows.

• We reveal the consistency that exists in the correlations between modalities, which is brought by spatio-temporally aligned multimodal image pairs.
• We propose a novel Cross-modulated Attention Transformer called CAFormer for accurate and efficient RGBT tracking.
• We propose a collaborative token elimination strategy, which improves the inference efficiency with further performance enhancement.
• The proposed method achieves an impressive tracking speed of 83.6 FPS while achieving state-of-the-art results on three mainstream public datasets.

2 Related Work

2.1 Attention Mechanism

Attention mechanisms have been widely used in computer vision tasks over the past decade [14, 34, 31, 10]. Transformer [31] is favored among these attention mechanisms due to its powerful representation of self-attention and cross-attention. Existing attention studies can be broadly classified into two categories. One category focuses on lightweight attention [22, 29, 43, 45]. For example, Liu et al. [22] reduce the computational complexity of attention by introducing local windows into self-attention. Schlatt et al. [29] design a sparse interaction strategy between query tokens and key tokens, which improves the efficiency of cross-attention. However, these methods of accelerating attention may harm performance because they remove global relationship modelling. Another category of studies [8, 37] is devoted to improving the quality of attention maps. For example, Gao et al. [8] refine the original self-attention by constructing a second-order relation matrix of the self-attention map. Xu et al. [37] propose a self-calibrated cross-attention to enhance discrimination between foreground and background images. However, these schemes rely only on their own information and therefore struggle to model attention weights accurately when they encounter low-quality inputs. In contrast, this paper proposes a multi-modal cross-modulated attention for the first time, which enhances the attention quality of each modality by establishing a strong association between the attentions of the RGB and thermal modalities.

2.2 RGBT Tracking

Due to the highly complementary nature of the RGB and thermal infrared (TIR) modalities, using TIR as an additional modality can effectively improve the robustness of tracking. Therefore, RGBT tracking has been proposed and has attracted wide attention. With the publication of large-scale RGBT datasets [21, 28], Transformers have been widely used in RGBT tracking. For example, Xiao et al. [36] design attribute-specific fusion branches and utilize Transformer to enhance attribute aggregation features and modality-specific features. Hui et al. [15] extend ViT [6] to a multi-modal backbone and propose using fusion templates as a medium for modal interactions to enhance feature fusion with target-related contexts. Luo et al. [26] employ three distinct Transformer backbones to extract both modality-specific and modality-shared features. Other works [11, 44] explore the application of prompt learning to multimodal tracking. However, the correlation calculation of each modality in these methods is performed independently, which makes it challenging to avoid inaccurate correlations for low-quality inputs, thus limiting further performance improvement. Moreover, existing fusion modules are typically designed for high-dimensional modal features and demand substantial computational resources, which is not conducive to efficient tracking.

3 Method

3.1 Overview

The proposed approach, named Cross-modulated Attention Transformer (CAFormer), is designed to address the challenges of RGBT tracking by performing intra-modality self-correlation and inter-modality feature interaction in a unified attention model. As illustrated in Figure 3, the framework consists of a backbone network comprising Transformer and CAFormer blocks that process flattened and embedded tokens of RGB and TIR image pairs. The cross-modulated attention mechanism employs correlation maps from both modalities to enhance interaction in the Correlation Modulated Enhancement (CME) module. Furthermore, to filter out non-target tokens, we employ the Collaborative Token Elimination (CTE) strategy in certain layers, which improves reliability by adding the correlation maps of the two modalities. Subsequently, we restore the RGB and TIR tokens belonging to the search region using a token padding scheme, concatenate them along the channel dimension, and feed them into the tracking head for target state prediction.
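The final padding and channel-concatenation step can be illustrated with a minimal NumPy sketch. The function names `pad_search_tokens` and `fuse_for_head` are hypothetical, and filling the eliminated positions with zeros is an assumption, since the text does not specify the padding value. Both modalities share one index set here because the elimination decision is made jointly.

```python
import numpy as np

def pad_search_tokens(tokens, kept_idx, n_x):
    """Scatter the kept search tokens back to their original positions.

    Assumption: eliminated positions are filled with zeros.
    """
    full = np.zeros((n_x, tokens.shape[1]), dtype=tokens.dtype)
    full[kept_idx] = tokens
    return full

def fuse_for_head(X_r, X_t, kept_idx, n_x):
    """Pad both modalities, then concatenate them along the channel axis."""
    padded_r = pad_search_tokens(X_r, kept_idx, n_x)
    padded_t = pad_search_tokens(X_t, kept_idx, n_x)
    return np.concatenate([padded_r, padded_t], axis=1)  # (n_x, 2C)

# Toy example: 6 search positions, 3 of them survive elimination, C = 2.
kept = np.array([0, 2, 5])
X_r = np.ones((3, 2))        # RGB search tokens that were kept
X_t = 2.0 * np.ones((3, 2))  # TIR search tokens that were kept
fused = fuse_for_head(X_r, X_t, kept, n_x=6)
print(fused.shape)  # (6, 4)
```

The fused array has one row per original search position, so it can be reshaped to a spatial feature map before the tracking head.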

3.2 RGBT Baseline Tracker

We adopt a similar approach to recent SOT (Single Object Tracking) methods [39, 2] by concatenating the template frames and search frames together into the Transformer backbone and then extending it to be the multi-modal backbone of our tracker.

Specifically, given the input RGB and TIR template image pair $\bm{I}_{r}^{z},\bm{I}_{t}^{z}\in\mathbb{R}^{H_{z}\times W_{z}\times 3}$ , and search region image pair $\bm{I}_{r}^{x},\bm{I}_{t}^{x}\in\mathbb{R}^{H_{x}\times W_{x}\times 3}$ respectively, we first divide these images into patches of size $P\times P$ and then flatten them to obtain sequences of patches $\bm{P}_{r}^{z},\bm{P}_{t}^{z}\in\mathbb{R}^{N_{z}\times(3P^{2})}$ and $\bm{P}_{r}^{x},\bm{P}_{t}^{x}\in\mathbb{R}^{N_{x}\times(3P^{2})}$ , where $N_{z}=H_{z}W_{z}/P^{2}$ and $N_{x}=H_{x}W_{x}/P^{2}$ denote the number of patches for the template and search frames, respectively. A patch embedding layer with parameter $\bm{W}^{0}\in\mathbb{R}^{(3P^{2})\times C}$ and learnable positional encoding $\bm{E}_{z}\in\mathbb{R}^{N_{z}\times C}$ and $\bm{E}_{x}\in\mathbb{R}^{N_{x}\times C}$ is then applied to obtain template features $\bm{Z}_{r}^{0}$ , $\bm{Z}_{t}^{0}$ and search region features $\bm{X}_{r}^{0}$ , $\bm{X}_{t}^{0}$ as follows:

$$
\begin{split}
&\bm{Z}_{r}^{0}=\bm{P}_{r}^{z}\bm{W}^{0}+\bm{E}_{z},\;\bm{X}_{r}^{0}=\bm{P}_{r}^{x}\bm{W}^{0}+\bm{E}_{x};\\
&\bm{Z}_{t}^{0}=\bm{P}_{t}^{z}\bm{W}^{0}+\bm{E}_{z},\;\bm{X}_{t}^{0}=\bm{P}_{t}^{x}\bm{W}^{0}+\bm{E}_{x}.
\end{split}
\tag{1}
$$
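As a rough illustration of Eq. 1, the sketch below patchifies an image and applies a shared linear embedding plus positional encoding. The patch size $P=16$ and width $C=768$ are assumptions chosen to match a ViT-Base-style backbone; the image sizes follow Section 4.1, and the random weights stand in for learned parameters.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, 3) image into N = HW / patch^2 flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/P, W/P, P, P, 3)
    return x.reshape(-1, patch * patch * c)  # (N, 3 * P^2)

rng = np.random.default_rng(0)
P, C = 16, 768     # assumed patch size and embedding width
Hz = Wz = 128      # template size
Hx = Wx = 256      # search-region size

# Random stand-ins for the learned parameters W^0, E_z and E_x.
W0 = rng.standard_normal((3 * P * P, C)) * 0.02
Ez = rng.standard_normal((Hz * Wz // P**2, C)) * 0.02
Ex = rng.standard_normal((Hx * Wx // P**2, C)) * 0.02

def embed(template, search):
    """Eq. 1 for one modality; W0, Ez and Ex are shared by RGB and TIR."""
    Z0 = patchify(template, P) @ W0 + Ez
    X0 = patchify(search, P) @ W0 + Ex
    return Z0, X0

Zr, Xr = embed(rng.random((Hz, Wz, 3)), rng.random((Hx, Wx, 3)))
Zt, Xt = embed(rng.random((Hz, Wz, 3)), rng.random((Hx, Wx, 3)))
Fr = np.concatenate([Zr, Xr], axis=0)  # token sequence [Z; X] for RGB
print(Zr.shape, Xr.shape, Fr.shape)    # (64, 768) (256, 768) (320, 768)
```

With these sizes the template yields $N_z=64$ tokens and the search region $N_x=256$ tokens.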

Subsequently, concatenating these features yields the token sequences $\bm{F}_{r}^{0}=[\bm{Z}_{r}^{0};\bm{X}_{r}^{0}]$ and $\bm{F}_{t}^{0}=[\bm{Z}_{t}^{0};\bm{X}_{t}^{0}]$, which we feed through $L$ stacked Transformer blocks $T^{l}$ ($l=1,2,...,L$), whose structure is shown in Figure 3. For simplicity, we use a tracking head consistent with OSTrack [39] and denote it as $\phi$. The forward propagation process is formulated as follows:

$$
\bm{F}_{r}^{l}=T^{l}(\bm{F}_{r}^{l-1}),\;\bm{F}_{t}^{l}=T^{l}(\bm{F}_{t}^{l-1}),\quad l=1,2,...,L, \tag{2}
$$

where $\bm{F}_{r}^{L},\bm{F}_{t}^{L}$ are outputs of the last Transformer block. We merge these features along the channel dimension and feed them into the tracking head $\phi$ to derive the final predicted bounding box $\bm{B}=\phi(\bm{F}_{r}^{L},\bm{F}_{t}^{L})$ . At this point, we have a basic multi-modal tracker composed of two branches that share parameters and process different modalities independently.

3.3 Cross-modulated Attention

The attention mechanism is a key component of Transformer trackers [4, 30], and the correlation map is an intermediate result of Transformer attention that measures the similarity between tokens [39]. To prevent low-quality data from affecting the correlation calculation in self-attention, we use high-quality modal correlations to enhance and correct low-quality modal correlations. Considering the dynamic changes in the quality of modal correlations, a bidirectional cross-modulated strategy is used to achieve an adaptive correlation modulation process. We design a cross-modulated attention mechanism that employs correlation maps from both modalities to enhance interaction in the Correlation Modulated Enhancement (CME) module.

Recalling the backbone of our baseline tracker, the inputs to layer $l$ are $\bm{F}_{r}^{l}=[\bm{Z}_{r}^{l};\bm{X}_{r}^{l}]$ and $\bm{F}_{t}^{l}=[\bm{Z}_{t}^{l};\bm{X}_{t}^{l}]$; here we omit the superscript $l$ and write $\bm{F}_{r}$, $\bm{F}_{t}$ for simplicity. Let $\bm{Q}_{r}=\bm{F}_{r}\bm{W}_{q}=[\bm{Z}_{r};\bm{X}_{r}]\bm{W}_{q}=[\bm{Q}_{r}^{z};\bm{Q}_{r}^{x}]$ and $\bm{K}_{r}=\bm{F}_{r}\bm{W}_{k}=[\bm{Z}_{r};\bm{X}_{r}]\bm{W}_{k}=[\bm{K}_{r}^{z};\bm{K}_{r}^{x}]$ denote the query and key matrices of the RGB modality, where $\bm{W}_{q}$ and $\bm{W}_{k}$ are the linear projection weights for queries and keys, respectively. For the RGB branch, the correlation map $\bm{M}_{r}\in\mathbb{R}^{N\times N}$, with $N=N_{z}+N_{x}$, is computed as:

$$
\begin{aligned}
\bm{M}_{r}&=\bm{Q}_{r}\bm{K}_{r}^{\top}=[\bm{Q}_{r}^{z};\bm{Q}_{r}^{x}][\bm{K}_{r}^{z};\bm{K}_{r}^{x}]^{\top}\\
&=[\bm{Q}_{r}^{z}\bm{K}_{r}^{z\top},\bm{Q}_{r}^{z}\bm{K}_{r}^{x\top};\bm{Q}_{r}^{x}\bm{K}_{r}^{z\top},\bm{Q}_{r}^{x}\bm{K}_{r}^{x\top}]\\
&=[\bm{M}_{r}^{zz},\bm{M}_{r}^{zx};\bm{M}_{r}^{xz},\bm{M}_{r}^{xx}].
\end{aligned}
\tag{3}
$$

Note that $\bm{M}_{r}$ must still be scaled and passed through softmax to become an attention map in the usual sense. The TIR branch is processed symmetrically: from $\bm{Q}_{t}$ and $\bm{K}_{t}$ we obtain the correlation map $\bm{M}_{t}\in\mathbb{R}^{N\times N}$ in the same way. As shown in Eq. 3, $\bm{M}_{r}$ and $\bm{M}_{t}$ can each be partitioned into four parts, $\bm{M}_{r}^{zz},\bm{M}_{r}^{zx},\bm{M}_{r}^{xz},\bm{M}_{r}^{xx}$ and $\bm{M}_{t}^{zz},\bm{M}_{t}^{zx},\bm{M}_{t}^{xz},\bm{M}_{t}^{xx}$, with different roles in tracking, as proposed by [30]. To simplify the description, we name these parts $\bm{TT},\bm{TS},\bm{ST},\bm{SS}$ according to the query-key pair used to compute the correlation. Among them, $\bm{ST}$ is special because it controls the information flow from the template to the search frame. Specifically, in most Transformer trackers [39, 2], the tracking head receives only search-region features, yet it relies heavily on template information to produce its output. Thus the effect of $\bm{ST}$ on tracking results is significant. More importantly, because the multimodal image pairs are spatio-temporally aligned, the $\bm{ST}$ parts of the two branches are strongly associated.
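The partition in Eq. 3 amounts to slicing the raw correlation map at the template/search boundary. The following NumPy sketch shows this for a single head; the helper name `correlation_parts` and the toy sizes are illustrative assumptions.

```python
import numpy as np

def correlation_parts(F, Wq, Wk, n_z):
    """Return the raw correlation map M = Q K^T and its four blocks (Eq. 3).

    F stacks template tokens (first n_z rows) on top of search tokens.
    """
    Q, K = F @ Wq, F @ Wk
    M = Q @ K.T
    parts = {
        "TT": M[:n_z, :n_z],  # template queries, template keys
        "TS": M[:n_z, n_z:],  # template queries, search keys
        "ST": M[n_z:, :n_z],  # search queries, template keys
        "SS": M[n_z:, n_z:],  # search queries, search keys
    }
    return M, parts

rng = np.random.default_rng(0)
n_z, n_x, C = 4, 9, 8  # toy sizes, far smaller than a real tracker
F_r = rng.standard_normal((n_z + n_x, C))
Wq = rng.standard_normal((C, C))
Wk = rng.standard_normal((C, C))

M_r, parts_r = correlation_parts(F_r, Wq, Wk, n_z)
for name, block in parts_r.items():
    print(name, block.shape)
```

Each row of $\bm{ST}$ is the correlation vector of one search token against all template tokens, which is why its length equals $N_z$.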

Existing methods [26, 15] compute the correlations of each modality separately, which ignores the crucial cross-modality associations. To achieve an adaptive correlation modulation process, we design a cross-modulated attention mechanism that employs correlation maps from both modalities to enhance interaction in the CME module. The purpose of CME is to modulate $\bm{ST}$, but we need to take $\bm{SS}$ into account as well so that we can modulate the final attention map. Specifically, we obtain the aggregated information $\bm{U}$ for the two horizontally adjacent parts as follows:

$$
\begin{aligned}
\bm{U}&=LN(LN([\bm{ST_{r}},\bm{SS_{r}};\bm{ST_{t}},\bm{SS_{t}}])\bm{W}_{e})\\
&\triangleq[\bm{U}_{r};\bm{U}_{t}].
\end{aligned}
\tag{4}
$$

$LN$ denotes the LayerNorm [1] layer, and $\bm{W}_{e}$ is a learnable linear projection weight for embedding two correlation parts. Then we perform an attention operation on $\bm{U}$ to obtain the modulated correlation map $\bm{M}^{{}^{\prime}}$ .

$$
\bm{M}^{{}^{\prime}}=Softmax\left(\frac{(\bm{U}\bm{W}_{q}^{{}^{\prime}})(\bm{U}\bm{W}_{k}^{{}^{\prime}})^{\top}}{\sqrt{N_{z}}}\right)[\bm{ST_{r}};\bm{ST_{t}}], \tag{5}
$$

where $N_{z}$ is the number of template tokens, and $\bm{W}_{q}^{{}^{\prime}}$ and $\bm{W}_{k}^{{}^{\prime}}$ denote the linear projection weights for queries and keys in the CME module. Next, we separate $\bm{ST_{r}^{{}^{\prime}}}$ and $\bm{ST_{t}^{{}^{\prime}}}$ from the initial modulated correlation maps $\bm{M}^{{}^{\prime}}_{r}$ and $\bm{M}^{{}^{\prime}}_{t}$, respectively.

$$
\begin{aligned}
CME(\bm{M}_{r};\bm{M}_{t})&=\bm{M}^{{}^{\prime}}(1+\bm{W}^{{}^{\prime}})\\
&=[\bm{ST_{r}^{{}^{\prime}}};\bm{ST_{t}^{{}^{\prime}}}],
\end{aligned}
\tag{6}
$$

$$
\bm{M}_{r}^{{}^{\prime}}=[0\cdot\bm{TT_{r}},0\cdot\bm{TS_{r}};\bm{ST_{r}^{{}^{\prime}}},0\cdot\bm{SS_{r}}], \tag{7}
$$

where $\bm{W}^{{}^{\prime}}$ is a learnable linear projection.

Finally, we add the obtained $\bm{M}_{r}^{{}^{\prime}}$ to the original correlation map $\bm{M}_{r}$ to get the final modulated correlation map. The process of yielding the final RGB attention map $\bm{A}_{r}$ can be described as follows:

$$
\bm{A}_{r}=Softmax\left(\frac{\bm{M}_{r}^{{}^{\prime}}+\bm{M}_{r}}{\sqrt{C}}\right), \tag{8}
$$

where $C$ denotes the dimension size of the token.
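To make the data flow of Eqs. 4 to 8 concrete, here is a minimal single-head NumPy sketch. It rests on several assumptions not stated in the text: LayerNorm is applied without learnable affine parameters, the embedding width of $\bm{W}_{e}$ is set to $N_{z}$, the factor $(1+\bm{W}^{\prime})$ in Eq. 6 is read as a residual linear projection $\bm{M}^{\prime}+\bm{M}^{\prime}\bm{W}^{\prime}$, and all weights are random placeholders. It is an illustration of the shapes involved, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Assumption: plain normalization over the last axis, no affine terms.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cme(M_r, M_t, n_z, We, Wq2, Wk2, Wp):
    """Cross-modal modulation of the ST blocks (Eqs. 4 to 6)."""
    ST_r, SS_r = M_r[n_z:, :n_z], M_r[n_z:, n_z:]
    ST_t, SS_t = M_t[n_z:, :n_z], M_t[n_z:, n_z:]
    rows = np.concatenate([np.concatenate([ST_r, SS_r], axis=1),
                           np.concatenate([ST_t, SS_t], axis=1)], axis=0)
    U = layer_norm(layer_norm(rows) @ We)                    # Eq. 4
    attn = softmax((U @ Wq2) @ (U @ Wk2).T / np.sqrt(n_z))   # (2Nx, 2Nx)
    M_mod = attn @ np.concatenate([ST_r, ST_t], axis=0)      # Eq. 5
    out = M_mod + M_mod @ Wp   # Eq. 6, read as a residual projection
    n_x = ST_r.shape[0]
    return out[:n_x], out[n_x:]                              # ST_r', ST_t'

def modulated_attention(M, ST_mod, n_z, C):
    """Place ST' in an otherwise zero map (Eq. 7) and apply Eq. 8."""
    M_mod = np.zeros_like(M)
    M_mod[n_z:, :n_z] = ST_mod
    return softmax((M + M_mod) / np.sqrt(C))

rng = np.random.default_rng(0)
n_z, n_x, C = 4, 6, 8          # toy sizes
N, d = n_z + n_x, n_z          # d: assumed embedding width of W_e
M_r = rng.standard_normal((N, N))
M_t = rng.standard_normal((N, N))
We = rng.standard_normal((N, d)) * 0.1
Wq2 = rng.standard_normal((d, d)) * 0.1
Wk2 = rng.standard_normal((d, d)) * 0.1
Wp = rng.standard_normal((n_z, n_z)) * 0.1

ST_r_mod, ST_t_mod = cme(M_r, M_t, n_z, We, Wq2, Wk2, Wp)
A_r = modulated_attention(M_r, ST_r_mod, n_z, C)
A_t = modulated_attention(M_t, ST_t_mod, n_z, C)
print(A_r.shape, np.allclose(A_r.sum(axis=1), 1.0))
```

Because the attention in Eq. 5 runs over the stacked rows of both modalities, each search token's template correlation can draw on the corresponding rows of the other modality. Only the search-query rows of the final attention map are affected; template-query rows remain those of plain self-attention.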

In addition, as illustrated in Figure 4 (a), the proposed Cross-modulated Attention is a symmetric structure in which the parameters at the corresponding positions on the left and right sides of the figure are shared. For a multi-head attention block, we share the parameters of the CME module between the parallel multi-heads. It is worth noting that our CME module can be easily applied to other parts in the attention map.

3.4 Collaborative Token Elimination

Efficiency is an important metric for evaluating tracking methods [38, 5]. Ye et al. [39] employ an early candidate elimination strategy to speed up the inference process in some blocks. This mechanism requires constructing accurate attention weights between the target and each candidate, but it is difficult to achieve from low-quality modalities. To solve the above problem, we propose a Collaborative Token Elimination (CTE) strategy that combines the attention weights from two modalities to make judgments.

Given the query vector $\bm{q}_{r}^{z}$ from $\bm{Q}_{r}^{z}$ and $\bm{q}_{t}^{z}$ from $\bm{Q}_{t}^{z}$ (here we follow [39] to choose the token in the center of the template), each search region token at absolute position $i$ can be given a scalar $h_{i}$ :

$$
\bm{h}=softmax(\bm{q}_{r}^{z}\bm{K}_{r}^{x})+softmax(\bm{q}_{t}^{z}\bm{K}_{t}^{x}), \tag{9}
$$

where $\bm{K}_{r}^{x}$ and $\bm{K}_{t}^{x}$ are the key vectors of search region tokens. After that, we use $h_{i}$ to sort the search region tokens and keep the top-k tokens. Our method enhances the stability of token elimination, specifically in cases where the quality of one modality declines. It accelerates the network’s inference speed while maintaining robustness.
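A minimal NumPy sketch of Eq. 9 followed by top-k selection is given below. The function name is hypothetical, the query is treated as a length-$C$ vector so the product is computed as $\bm{K}^{x}\bm{q}$, and returning the kept indices in ascending order (to preserve spatial order) is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def collaborative_token_elimination(q_r, K_r_x, q_t, K_t_x, keep):
    """Score search tokens with both modalities (Eq. 9) and keep the top-k.

    q_r, q_t:     (C,) query of the central template token per modality.
    K_r_x, K_t_x: (N_x, C) keys of the search-region tokens per modality.
    """
    h = softmax(K_r_x @ q_r) + softmax(K_t_x @ q_t)
    # Indices of the `keep` highest scores, sorted to preserve token order.
    kept = np.sort(np.argsort(-h)[:keep])
    return kept, h

rng = np.random.default_rng(0)
n_x, C = 16, 8  # toy sizes
q_r, q_t = rng.standard_normal(C), rng.standard_normal(C)
K_r_x = rng.standard_normal((n_x, C))
K_t_x = rng.standard_normal((n_x, C))

kept, h = collaborative_token_elimination(q_r, K_r_x, q_t, K_t_x, keep=8)
print(kept.shape, h.shape)  # (8,) (16,)
```

Since the same index set is applied to both branches, the RGB and TIR tokens that survive always refer to the same spatial positions.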

Table 1: Comparison with state-of-the-art methods. The top 3 results are highlighted with red, blue, and green fonts, respectively. "*" indicates that the speed was measured on the same GPU (Nvidia 3080ti).

Method	Backbone	Pub. Info.	GTOT PR $\uparrow$	GTOT SR $\uparrow$	RGBT210 PR $\uparrow$	RGBT210 SR $\uparrow$	RGBT234 PR $\uparrow$	RGBT234 SR $\uparrow$	LasHeR PR $\uparrow$	LasHeR NPR $\uparrow$	LasHeR SR $\uparrow$	VTUAV PR $\uparrow$	VTUAV SR $\uparrow$	FPS $\uparrow$
DAPNet [46]	VGG-M	ACM MM’19	88.2	70.7	-	-	76.6	53.7	43.1	38.3	31.4	-	-	-
MANet [19]	VGG-M	ICCVW’19	89.4	72.4	-	-	77.7	53.9	45.5	-	32.6	-	-	2.1*
DAFNet [9]	VGG-M	ICCVW’19	89.1	71.6	-	-	79.6	54.4	44.8	39.0	31.1	62.0	45.8	20.5*
mfDiMP [41]	ResNet-50	ICCVW’19	83.6	69.7	84.9	59.3	84.2	59.1	59.9	-	46.7	67.3	55.4	34.6*
CAT [20]	VGG-M	ECCV’20	88.9	71.7	79.2	53.3	80.4	56.1	45.0	39.5	31.4	-	-	-
MaCNet [40]	VGG-M	Sensors’20	-	-	-	-	79.0	55.4	48.2	42.0	35.0	-	-	1.6*
CMPP [32]	VGG-M	CVPR’20	92.6	73.8	-	-	82.3	57.5	-	-	-	-	-	-
FANet [47]	VGG-M	TIV’21	-	-	-	-	78.7	55.3	44.1	38.4	30.9	-	-	-
MANet++ [24]	VGG-M	TIP’21	88.2	70.7	-	-	80.0	55.4	46.7	40.4	31.4	-	-	-
SiamCDA [42]	ResNet-50	TCSVT’21	87.7	73.2	-	-	76.0	56.9	-	-	-	-	-	-
DMCNet [25]	VGG-M	TNNLS’22	-	-	79.7	55.5	83.9	59.3	49.0	43.1	35.5	-	-	-
APFNet [36]	VGG-M	AAAI’22	90.5	73.7	-	-	82.7	57.9	50.0	43.9	36.2	-	-	1.9*
MIRNet [12]	VGG-M	ICME’22	90.9	74.4	-	-	81.6	58.9	-	-	-	-	-	-
TFNet [48]	VGG-M	TCSVT’22	88.6	72.9	77.7	52.9	80.6	56.0	-	-	-	-	-	-
HMFT [28]	ResNet-50	CVPR’22	91.2	74.9	-	-	78.8	56.8	-	-	-	75.8	62.7	30.2
ViPT [44]	ViT-B	CVPR’23	-	-	-	-	83.5	61.7	65.1	-	52.5	-	-	-
MACFT [26]	ViT-B	Sensors’23	90.0	72.7	-	-	85.7	62.2	65.3	-	51.4	80.1	66.8	33.3
TBSI [15]	ViT-B	CVPR’23	-	-	85.3	62.5	87.1	63.7	69.2	65.7	55.6	-	-	36.2*
OneTracker [11]	ViT-B	CVPR’24	-	-	-	-	85.7	64.2	67.2	-	53.8	-	-	-
Un-Track [35]	ViT-B	CVPR’24	-	-	-	-	84.2	62.5	66.7	-	53.6	-	-	-
SDSTrack [13]	ViT-B	CVPR’24	-	-	-	-	84.8	62.5	66.5	-	53.1	-	-	-
TATrack [33]	ViT-B	AAAI’24	-	-	85.3	61.8	87.2	64.4	70.2	-	56.1	-	-	26.1
CAFormer	ViT-B	-	91.8	76.9	85.6	63.2	88.3	66.4	70.0	66.1	55.6	88.6	76.2	83.6*

4 Experiments

4.1 Implementation Details

We now present the implementation details of the proposed method. The proposed CAFormer blocks are integrated into the last 3 layers of the backbone. The search regions are resized to 256 × 256, while the templates are resized to 128 × 128. CAFormer is trained on 2 NVIDIA 2080ti GPUs with a global batch size of 32. We set the learning rates of the backbone network and other parameters to 5e-6 and 5e-5, respectively. The optimization algorithm employed is AdamW [23] with a weight decay of 1e-4. We train our model for 10 epochs on the training set of LasHeR [21], and each epoch consists of 60K image pairs. For GTOT [16], RGBT210 [17], and RGBT234 [18], we directly evaluate our model without any further fine-tuning. For the VTUAV [28] dataset, we train on the VTUAV training set and adjust the number of training epochs to 5. Following previous work [15], all models in this paper are initialized with pre-trained weights from the public SOT method [39].

4.2 Evaluation on Public Datasets

Our experiments are performed on five public datasets: GTOT [16], RGBT210 [17], RGBT234 [18], LasHeR [21], and VTUAV [28]. For evaluation, we adopt the commonly used Precision Rate (PR) and Success Rate (SR) metrics. Following previous work [21], we also adopt the Normalized Precision Rate (NPR) [27] metric for LasHeR. In addition, the GTOT [16], RGBT234 [18], and VTUAV [28] datasets provide ground truth for both modalities; following prior works [18, 28, 13], we use the best result of the two modalities to circumvent small alignment errors.

GTOT [16] contains 50 video sequence pairs. As shown in Table 1, compared to previous state-of-the-art trackers [28, 32], our method outperforms HMFT [28] by 0.6%/2.0% in PR/SR, and its SR score is 3.1% higher than that of CMPP [32].

RGBT210 [17] is a challenging RGBT dataset, which contains 210 video sequence pairs, 210K frames, and 12 tracking challenge attributes. In the evaluation of the RGBT210 dataset, our method gets the best PR/SR score with 85.6%/63.2%. Compared with TBSI [15], there is a minor improvement of 0.3%/0.7% on PR/SR, but in terms of efficiency, the proposed method is twice as efficient as TBSI. In addition, our method has a significant advantage over other methods, outperforming CAT [20], TFNet [48] 6.4%/9.9% and 7.9%/10.3% in terms of PR/SR, respectively.

RGBT234 [18] is extended from RGBT210 and contains 12 challenge attributes, 234K frames, and 234 video sequence pairs. As shown in Table 1, we compare our method with recently proposed RGBT trackers and achieve the best result. TBSI [15] is a state-of-the-art method that relies on feature fusion. Our method outperforms TBSI by a significant margin of 1.2%/2.7% in PR/SR and obtains the best performance. Among other trackers, our method outperforms mfDiMP [41] and ViPT [44] in PR/SR by 4.1%/7.3% and 4.8%/4.7%, respectively.

LasHeR [21] contains 19 challenge attributes, 734.8K frames, and 1224 video sequence pairs. Our tracker significantly outperforms mfDiMP and ViPT, by 10.1%/8.9% and 4.9%/3.1% in PR/SR, respectively. Although our method has only a 0.8%/0.4% advantage over TBSI [15] in PR/NPR, TBSI clearly lags behind our method in tracking efficiency because of its bulky multi-level feature interaction.

VTUAV [28] stands out as a large-scale RGBT dataset specifically designed for UAV perspectives. VTUAV contains 500 video sequence pairs having 1.7M image pairs with 1920 × 1080 resolution. It can be seen that our method outperforms all previous methods. Specifically, compared to MACFT [26], which is the previous state-of-the-art method, our method leads by 8.5%/9.4% in PR/SR. This indicates that the proposed method is equally applicable to UAV scenarios and its efficiency is suitable for the needs of UAV scenarios.

4.3 Ablation Study

Table 2: Evaluation results for different structures.

Method	RGBT234 PR	RGBT234 SR	LasHeR PR	LasHeR SR
RGBT baseline	86.4	64.5	67.8	54.0
w/o $\bm{SS}$	87.6	65.6	69.2	55.1
w/o Cross-modal	87.5	65.9	68.3	54.3
Full model (CAFormer)	88.3	66.4	70.0	55.6

4.3.1 Component Analysis

As shown in Table 2, we compare different designs for the proposed CMA module.

w/o $\bm{SS}$. Since the softmax operation spans the two horizontally neighboring parts of the correlation map, the two parts affect each other. So when we process $\bm{ST}$, we also take $\bm{SS}$ into account. When we remove $\bm{SS}$ from the input, instead of feeding both $\bm{ST}$ and $\bm{SS}$ as in the full model, the PR/SR scores decrease by 0.7%/0.8% on RGBT234 and 0.8%/0.5% on LasHeR, as shown in Table 2. This shows that $\bm{SS}$ plays an important role in adjusting the weights of $\bm{ST}$.

w/o Cross-modal. The proposed CME module aims to exploit the association of correlations between modalities. Removing this mechanism means that the correlation weights of each modality interact only with themselves. As shown in Table 2, this leads to a significant decrease of 1.7% and 1.3% in PR and SR on LasHeR compared to the full model, respectively. This suggests that the main performance gain of our method comes from the cross-modal interaction of correlation weights.

4.3.2 CMA Utilization in Different Parts

Besides modulating $\bm{ST}$, we attempt to deploy the CME module on other parts of the correlation map. The results on RGBT234 [18] and LasHeR [21] are summarized in Table 3, where "($\bm{ST},\bm{SS}$)" means that the two parts are treated as a whole. We observe that the best result is obtained when CME is applied to $\bm{ST}$, which confirms our view that $\bm{ST}$ has a stronger cross-modal correlation and is critical for tracking. When we do not distinguish different parts, i.e. "($\bm{ST},\bm{SS}$)", the PR/SR scores on LasHeR decrease by 1.1%/0.9% and the model is less efficient than with $\bm{ST}$ alone. This shows that it is necessary to distinguish different parts of the correlation map.

Table 3: Modulating different parts of the correlation map.

Part	RGBT234 PR	RGBT234 SR	LasHeR PR	LasHeR SR
$\bm{ST}$	88.3	66.4	70.0	55.6
$\bm{SS}$	87.8	65.8	69.1	55.0
$\bm{TS}$	87.7	65.7	68.2	54.4
( $\bm{ST}$ , $\bm{SS}$ )	86.4	64.4	68.9	54.7

Table 4: Apply layers of the proposed CAFormer block.

Layers	RGBT234 PR	RGBT234 SR	LasHeR PR	LasHeR SR
last 1 layer	87.5	65.6	69.2	55.1
last 3 layers	88.3	66.4	70.0	55.6
last 6 layers	86.6	65.0	68.1	54.2
4,7,10 layers	88.5	65.8	69.6	55.1

Table 5: Different candidate elimination strategies.

CE	CTE	LasHeR PR	LasHeR SR	FPS	MACs (G)
		69.3	55.1	76.4	58.43
✓		69.5	55.2	83.6	42.91
	✓	70.0	55.6	83.6	42.91

4.3.3 CMA Insertion in Different Layers

Here we insert the CAFormer block into different layers and summarize the experimental results on RGBT234 [18] and LasHeR [21] in Table 4. The results show that applying only one CAFormer block already brings a significant improvement, which shows the necessity of correlation fusion. When the number of blocks increases to 3, the additional gain is smaller, and when it increases further to 6, the results become worse. This suggests that with less prior information it becomes difficult to distinguish potentially correct correlation weights, which yields erroneous interactions and degrades performance. Finally, we choose the last 3 layers as the final solution.

4.3.4 Different Token Elimination Schemes

To verify the effectiveness of the proposed Collaborative Token Elimination (CTE) strategy, we also evaluate the Candidate Elimination (CE) strategy in [39]. As shown in Table 5, CTE not only helps to improve the inference speed, but also significantly enhances performance, whereas the CE strategy primarily improves efficiency. Specifically, adding either the CTE or the CE strategy improves the tracking speed by 9.4% and decreases the MACs by 26.6%. In terms of tracking performance, CTE obtains a 0.7%/0.5% improvement in PR/SR, which is significantly larger than the 0.2%/0.1% obtained by CE. This indicates that the proposed method is capable of mitigating the effect of noisy weights on the learning of the CME module. More importantly, it ensures that the weights at the corresponding locations of different modalities can interact and thus better adapt to the CME module. We provide a comparison of the visualization results of CE and CTE in the supplementary material.
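The relative gains quoted above follow directly from the FPS and MACs columns of Table 5, as this short arithmetic check shows (variable names are illustrative):

```python
# Values from Table 5: no elimination vs. CE/CTE enabled.
base_fps, fast_fps = 76.4, 83.6
base_macs, fast_macs = 58.43, 42.91

speedup_pct = (fast_fps - base_fps) / base_fps * 100
macs_drop_pct = (base_macs - fast_macs) / base_macs * 100

print(f"{speedup_pct:.1f} {macs_drop_pct:.1f}")  # 9.4 26.6
```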

4.4 Attribute-based Performance

We evaluate the performance of our proposed method in various scenarios by conducting experiments on different challenge attribute subsets of the RGBT234 dataset [18], including no occlusion (NO), partial occlusion (PO), heavy occlusion (HO), low illumination (LI), low resolution (LR), thermal crossover (TC), deformation (DEF), fast motion (FM), scale change (SC), motion blur (MB), camera moving (CM) and background clutter (BC). All results are summarized in Figure 5. Our method improves significantly over the CNN-based method [36] on challenges such as HO and MB, owing to the long-range modeling capability of the Transformer. Furthermore, it outperforms the Transformer-based feature fusion methods [15, 44] under all challenges, which demonstrates the advantages of the proposed correlation fusion scheme.

5 Conclusion

In this paper, we reveal a consistency in the correlations of different modality branches and exploit it to design a correlation fusion module. The proposed method also provides a novel fusion idea for multi-modal tracking that differs from feature fusion. Experimental results indicate that the performance of correlation fusion is competitive with or surpasses that of state-of-the-art feature fusion methods. Additionally, we introduce a Collaborative Token Elimination strategy that enhances the differentiation between foreground and background and further improves efficiency and performance. In the future, we plan to combine correlation fusion and feature fusion to further improve tracking performance.

References

Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Chen et al. [2022] Boyu Chen, Peixia Li, Lei Bai, Lei Qiao, Qiuhong Shen, Bo Li, Weihao Gan, Wei Wu, and Wanli Ouyang. Backbone is all your need: A simplified architecture for visual object tracking. In Proceedings of the European Conference on Computer Vision, pages 375–392, 2022.
Chen et al. [2021] Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8126–8135, 2021.
Cui et al. [2022] Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13608–13618, 2022.
Cui et al. [2024] Yutao Cui, Tianhui Song, Gangshan Wu, and Limin Wang. Mixformerv2: Efficient fully transformer tracking. Advances in Neural Information Processing Systems, 36, 2024.
Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
Fu et al. [2022] Zhihong Fu, Zehua Fu, Qingjie Liu, Wenrui Cai, and Yunhong Wang. Sparsett: Visual tracking with sparse transformers. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022.
Gao et al. [2022] Shenyuan Gao, Chunluan Zhou, Chao Ma, Xinggang Wang, and Junsong Yuan. Aiatrack: Attention in attention for transformer visual tracking. In Proceedings of the European Conference on Computer Vision, 2022.
Gao et al. [2019] Yuan Gao, Chenglong Li, Yabin Zhu, Jin Tang, Tao He, and Futian Wang. Deep adaptive fusion network for high performance rgbt tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
Guo et al. [2022] Meng-Hao Guo, Tian-Xing Xu, Jiang-Jiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang Mu, Song-Hai Zhang, Ralph R Martin, Ming-Ming Cheng, and Shi-Min Hu. Attention mechanisms in computer vision: A survey. Computational visual media, 8(3):331–368, 2022.
Hong et al. [2024] Lingyi Hong, Shilin Yan, Renrui Zhang, Wanyun Li, Xinyu Zhou, Pinxue Guo, Kaixun Jiang, Yiting Chen, Jinglun Li, Zhaoyu Chen, et al. Onetracker: Unifying visual object tracking with foundation models and efficient tuning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2024.
Hou et al. [2022] Ruichao Hou, Tongwei Ren, and Gangshan Wu. Mirnet: A robust rgbt tracking jointly with multi-modal interaction and refinement. In IEEE International Conference on Multimedia and Expo, pages 1–6, 2022.
Hou et al. [2024] Xiaojun Hou, Jiazheng Xing, Yijie Qian, Yaowei Guo, Shuo Xin, Junhao Chen, Kai Tang, Mengmeng Wang, Zhengkai Jiang, Liang Liu, et al. Sdstrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2024.
Hu et al. [2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
Hui et al. [2023] Tianrui Hui, Zizheng Xun, Fengguang Peng, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Jiao Dai, Jizhong Han, and Si Liu. Bridging search region interaction with template for rgb-t tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13630–13639, 2023.
Li et al. [2016] Chenglong Li, Hui Cheng, Shiyi Hu, Xiaobai Liu, Jin Tang, and Liang Lin. Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Transactions on Image Processing, 25(12):5743–5756, 2016.
Li et al. [2017] Chenglong Li, Nan Zhao, Yijuan Lu, Chengli Zhu, and Jin Tang. Weighted sparse representation regularized graph learning for rgb-t object tracking. In Proceedings of the 25th ACM International Conference on Multimedia, 2017.
Li et al. [2019a] Chenglong Li, Xinyan Liang, Yijuan Lu, Nan Zhao, and Jin Tang. Rgb-t object tracking: Benchmark and baseline. Pattern Recognition, 96:106977, 2019a.
Li et al. [2019b] Chenglong Li, Andong Lu, Aihua Zheng, Zhengzheng Tu, and Jin Tang. Multi-adapter rgbt tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshop, pages 2262–2270, 2019b.
Li et al. [2020] Chenglong Li, Lei Liu, Andong Lu, Qing Ji, and Jin Tang. Challenge-aware rgbt tracking. In Proceedings of the European Conference on Computer Vision, pages 222–237, 2020.
Li et al. [2021] Chenglong Li, Wanlin Xue, Yaqing Jia, Zhichen Qu, Bin Luo, Jin Tang, and Dengdi Sun. Lasher: A large-scale high-diversity benchmark for rgbt tracking. IEEE Transactions on Image Processing, 31:392–404, 2021.
Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
Lu et al. [2021] Andong Lu, Chenglong Li, Yuqing Yan, Jin Tang, and Bin Luo. Rgbt tracking via multi-adapter network with hierarchical divergence loss. IEEE Transactions on Image Processing, 30:5613–5625, 2021.
Lu et al. [2022] Andong Lu, Cun Qian, Chenglong Li, Jin Tang, and Liang Wang. Duality-gated mutual condition network for rgbt tracking. IEEE Transactions on Neural Networks and Learning Systems, pages 1–14, 2022.
Luo et al. [2023] Yang Luo, Xiqing Guo, Mingtao Dong, and Jin Yu. Learning modality complementary features with mixed attention mechanism for rgb-t tracking. Sensors, 23(14):6609, 2023.
Muller et al. [2018] Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision, pages 300–317, 2018.
Pengyu et al. [2022] Zhang Pengyu, Jie Zhao, Dong Wang, Huchuan Lu, and Xiang Ruan. Visible-thermal uav tracking: A large-scale benchmark and new baseline. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2022.
Schlatt et al. [2024] Ferdinand Schlatt, Maik Fröbe, and Matthias Hagen. Investigating the effects of sparse attention on cross-encoders. In European Conference on Information Retrieval, pages 173–190, 2024.
Song et al. [2023] Zikai Song, Run Luo, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Compact transformer tracker with correlative masked modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
Wang et al. [2020] Chaoqun Wang, Chunyan Xu, Zhen Cui, Ling Zhou, Tong Zhang, Xiaoya Zhang, and Jian Yang. Cross-modal pattern-propagation for rgb-t tracking. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 7064–7073, 2020.
Wang et al. [2024] Hongyu Wang, Xiaotao Liu, Yifan Li, Meng Sun, Dian Yuan, and Jing Liu. Temporal adaptive rgbt tracking with modality prompt. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
Wang et al. [2018] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
Wu et al. [2024] Zongwei Wu, Jilai Zheng, Xiangxuan Ren, Florin-Alexandru Vasluianu, Chao Ma, Danda Pani Paudel, Luc Van Gool, and Radu Timofte. Single-model and any-modality for video object tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2024.
Xiao et al. [2022] Yun Xiao, Mengmeng Yang, Chenglong Li, Lei Liu, and Jin Tang. Attribute-based progressive fusion network for rgbt tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
Xu et al. [2023] Qianxiong Xu, Wenting Zhao, Guosheng Lin, and Cheng Long. Self-calibrated cross attention network for few-shot segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 655–665, 2023.
Yan et al. [2021] Bin Yan, Houwen Peng, Kan Wu, Dong Wang, Jianlong Fu, and Huchuan Lu. Lighttrack: Finding lightweight neural networks for object tracking via one-shot architecture search. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.
Ye et al. [2022] Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, pages 341–357, 2022.
Zhang et al. [2020] Hui Zhang, Lei Zhang, Li Zhuo, and Jing Zhang. Object tracking in rgb-t videos using modal-aware attention network and competitive learning. Sensors, 20(2):393, 2020.
Zhang et al. [2019] Lichao Zhang, Martin Danelljan, Abel Gonzalez-Garcia, Joost van de Weijer, and Fahad Shahbaz Khan. Multi-modal fusion for end-to-end rgb-t tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
Zhang et al. [2021] Tianlu Zhang, Xueru Liu, Qiang Zhang, and Jungong Han. Siamcda: Complementarity-and distractor-aware rgb-t tracking based on siamese network. IEEE Transactions on Circuits and Systems for Video Technology, 32(3):1403–1417, 2021.
Zhou et al. [2024] Quan Zhou, Huimin Shi, Weikang Xiang, Bin Kang, and Longin Jan Latecki. Dpnet: Dual-path network for real-time object detection with lightweight attention. IEEE Transactions on Neural Networks and Learning Systems, 2024.
Zhu et al. [2023] Jiawen Zhu, Simiao Lai, Xin Chen, Dong Wang, and Huchuan Lu. Visual prompt multi-modal tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9516–9526, 2023.
Zhu et al. [2020a] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020a.
Zhu et al. [2019] Yabin Zhu, Chenglong Li, Bin Luo, Jin Tang, and Xiao Wang. Dense feature aggregation and pruning for rgbt tracking. In Proceedings of the ACM International Conference on Multimedia, pages 465–472, 2019.
Zhu et al. [2020b] Yabin Zhu, Chenglong Li, Jin Tang, and Bin Luo. Quality-aware feature aggregation network for robust rgbt tracking. IEEE Transactions on Intelligent Vehicles, 6(1):121–130, 2020b.
Zhu et al. [2021] Yabin Zhu, Chenglong Li, Jin Tang, Bin Luo, and Liang Wang. Rgbt tracking by trident fusion network. IEEE Transactions on Circuits and Systems for Video Technology, 32(2):579–592, 2021.