License: CC BY-NC-SA 4.0
arXiv:2402.05370v1 [cs.LG] 08 Feb 2024

Attention as Robust Representation for Time Series Forecasting

PeiSong Niu    Tian Zhou    Xue Wang    Liang Sun    Rong Jin
Abstract

Time series forecasting is essential for many practical applications, with the adoption of transformer-based models on the rise due to their impressive performance in NLP and CV. Transformers’ key feature, the attention mechanism, dynamically fuses embeddings to enhance data representation, often relegating attention weights to a byproduct role. Yet time series data, characterized by noise and non-stationarity, pose significant forecasting challenges. Our approach elevates attention weights to the primary representation for time series, capitalizing on the temporal relationships among data points to improve forecasting accuracy. Our study shows that an attention map, structured using global landmarks and local windows, acts as a robust kernel representation for data points, withstanding noise and shifts in distribution. Our method outperforms state-of-the-art models, reducing mean squared error (MSE) in multivariate time series forecasting by a notable 3.6% without altering the core neural network architecture. It serves as a versatile component that can readily replace recent patching-based embedding schemes in transformer-based models, boosting their performance. The source code for our work is available at: https://anonymous.4open.science/r/AttnEmbed-7430.

Machine Learning, ICML

1 Introduction

Time series forecasting is a vital problem that has played an important role in many real-world applications (Wen et al., 2022; Courty & Li, 1999; Böse et al., 2017; Li et al., 2019), ranging from energy, weather, and traffic to economics. In recent years, traditional statistical and machine learning methods (Box & Jenkins, 1968; Box & Pierce, 1970) have been gradually replaced by deep learning models in time series forecasting. In particular, CNN and MLP-based models (Wu et al., 2023; Zeng et al., 2023) have shown great performance improvement in time series analysis. Moreover, following the successes in NLP (Vaswani & etc., 2017; Devlin et al., 2019; Radford et al., 2019) and CV (Dosovitskiy et al., 2021; Bao et al., 2022), transformer models (Wen et al., 2023; Zhou et al., 2021; Wu et al., 2021; Zhou et al., 2022b; Nie et al., 2022; Liu et al., 2023; Xue et al., 2023) have demonstrated impressive results. Among the transformer models, PatchTST (Nie et al., 2022) successfully applies the idea of the vision transformer (ViT) (Dosovitskiy et al., 2021) to time series by segmenting the time series into multiple patches that serve as input tokens for transformers. While segmentation is beneficial for reducing information redundancy, it overlooks the relationship between a time point and its neighbors, making it insufficient for reducing noise and handling rapid distribution drifts.

The attention mechanism is a pivotal component that underpins the transformative success of the transformer model across various domains. It is widely regarded as the linchpin behind monumental advancements such as ChatGPT and Midjourney, although other elements like feed-forward networks (FFNs) and positional embeddings also play significant roles. Essentially, the attention mechanism functions as a dynamic, weighted feed-forward layer. Within a self-attention layer, for instance, queries (Q) and keys (K) are used to calculate an attention matrix, which subsequently serves as a weighting matrix that synthesizes the values (V). The resultant attention matrix is usually viewed as a byproduct to reveal the quantitative influence of each input token and to effectively aggregate information across different tokens.

Although time series forecasting can be naturally viewed as a sequence modeling problem, it differs significantly from token sequences in CV or NLP: only limited information can be found in patches because every data point in a time series is simply a scalar. In contrast, tokens in both CV and NLP encompass significantly richer information, and we often find a considerable amount of redundant information across different tokens, as evidenced by the high masking rates used in self-supervised learning in CV and NLP. Furthermore, many time series contain noise and distribution shifts, partly due to high sampling rates (Wen et al., 2022), making forecasting more challenging. These observations inspire us to develop a richer and more robust representation for time series data. Since weights in the attention matrix reveal the pairwise relationship between different patches in time series, motivated by the theory of kernel learning (Wilson et al., 2015) and reproducing kernel Hilbert spaces (Ghojogh et al., 2021), we propose a novel and robust data representation based on the attention matrix that captures the relationship among different data points in the same time series. One obvious advantage of using attention weights for data representation is that it helps capture the overall seasonality of time series, a special complex relationship. In Section 2.2 and Appendix A, we demonstrate that, based on kernel learning theory, employing attention weights as representations more effectively captures the intricate relationships among data points.

We also note previous efforts that connect the attention mechanism with kernel functions. For instance, several studies (Tsai et al., 2019; Katharopoulos et al., 2020; Song et al., 2021) have explored attention from the perspective of kernel functions, either to propose a new paradigm for transformers or to reduce the computational complexity. In addition, Mika et al. (1998) exploited non-linear kernel functions to reduce noise in time series while preserving the relationship between different time points. In this study, we also show that using kernel functions, such as polynomial kernels, in attention matrix computation can be more effective for time series forecasting than the standard softmax.

Several studies have leveraged the attention matrix’s pairwise relationships for time series anomaly detection. For example, Anomaly Transformer (Xu et al., 2021) introduces an association discrepancy by measuring the Kullback-Leibler divergence between the attention matrix and a learnable Gaussian kernel to identify anomalies. Similarly, DCdetector (Yang et al., 2023) suggests that the divergence between attention matrices is a dependable indicator. However, our work is not limited by the anomaly detection framework. Instead, we have developed a generalized data representation from the attention matrix, which presents versatile potential for tasks involving embeddings.

Our contributions in this paper are summarized as follows:

  • Attention as robust representation: We propose a novel time series representation method called AttnEmbed, which utilizes attention weights as representation of time segments. The resilience of AttnEmbed to both noise and non-stationary distributions is verified by our empirical studies of synthetic datasets, and is also verified by our theoretical analysis.

  • Outstanding performance for time series forecasting: Our innovative embedding schema, AttnEmbed, integrates a global landscape and smoothing design to adeptly handle distribution shifts. When paired with a vanilla transformer, this approach significantly outperforms state-of-the-art methods in time series forecasting, as evidenced by our comprehensive experimental analysis.

  • Kernel functions for better attention: We illustrate that the polynomial kernel can effectively replace traditional similarity measures in attention mechanisms, yielding representations that enhance performance in forecasting tasks.

  • General plug-in: AttnEmbed can be seamlessly integrated as a general plug-in module. We have effectively integrated it into multiple methods, yielding performance enhancements over the patching method.

2 Attention as Robust Representation

To verify the resilience of attention as a data representation to both noise and non-stationary distributions, we first conduct experiments on synthetic data, and then examine the robustness of attention based representation by a theoretical analysis.

2.1 An Empirical Study on Synthetic Data

We develop two synthetic datasets, one for non-stationary time series and one for noisy data, and compare our approach (i.e., AttnEmbed) against a method that feeds patches of the original data through a linear projection for embedding (i.e., PatchTST, ViT).

Synthetic Data.

The synthetic data is generated by the aggregation of 10 sinusoids and cubic functions, each characterized by distinct random parameters:

$$f_1(x) = \sum A \sin(\omega x + \phi) + \sum (a x^3 + b x^2 + c x + d),$$
$$f_2(x) = \sum A \sin(\omega x + \phi) + \sum (a x^3 + b x^2 + c x + d) + \sigma,$$

where $f_1(x)$ is designed for non-stationary distribution and $f_2(x)$ is designed for noisy data. All the parameters in $f_1(\cdot)$ and $f_2(\cdot)$ are randomly chosen. A total of 2000 time steps are sampled, with a lookback window size of 192 and a forecast horizon of 96. Figure 1 shows the plots of the two functions, together with all experimental results for comparison.
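The data generation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' script: the parameter ranges, the reading of "10 sinusoids and cubic functions" as ten of each, the Gaussian form of the noise term $\sigma$, and the helper names `make_synthetic` and `sliding_pairs` are all assumptions.

```python
import numpy as np

def make_synthetic(n_steps=2000, n_terms=10, noise_std=0.0, seed=0):
    """Sum of `n_terms` random sinusoids and `n_terms` random cubics on [0, 1]."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_steps)
    # Assumed parameter ranges; the text only states that they are random.
    A = rng.uniform(0.5, 1.5, n_terms)
    omega = rng.uniform(10.0, 100.0, n_terms)
    phi = rng.uniform(0.0, 2.0 * np.pi, n_terms)
    coeffs = rng.uniform(-1.0, 1.0, (n_terms, 4))          # a, b, c, d per cubic
    sines = (A[:, None] * np.sin(omega[:, None] * x + phi[:, None])).sum(axis=0)
    powers = np.stack([x**3, x**2, x, np.ones_like(x)])    # shape (4, n_steps)
    cubics = (coeffs @ powers).sum(axis=0)
    # Noise is drawn last, so the same seed gives f2 = f1 + noise.
    noise = rng.normal(0.0, noise_std, n_steps) if noise_std > 0 else 0.0
    return sines + cubics + noise

def sliding_pairs(y, lookback=192, horizon=96):
    """Cut a series into (lookback window, forecast target) pairs."""
    n = len(y) - lookback - horizon + 1
    X = np.stack([y[i:i + lookback] for i in range(n)])
    Y = np.stack([y[i + lookback:i + lookback + horizon] for i in range(n)])
    return X, Y

f1 = make_synthetic()                 # non-stationary series
f2 = make_synthetic(noise_std=0.3)    # same series plus assumed Gaussian noise
X, Y = sliding_pairs(f1)
```

With 2000 steps, a lookback of 192, and a horizon of 96, this yields 2000 - 192 - 96 + 1 = 1713 input/target pairs.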

Figure 1: Comparison between AttnEmbed (ours) and PatchTST on synthetic data. (a) Non-stationary. (b) Noise reduction.
Non-stationary Data.

Figure 1 (a) shows time series data generated by $f_1(\cdot)$, which clearly shows a noticeable shift. We can observe that the proposed representation AttnEmbed is able to better capture the overall drift than PatchTST for the first 50 time points of forecasting, and the advantage disappears after time point $250$.

Noise Reduction.

Figure 1 (b) shows the time series generated by $f_2(\cdot)$ (i.e., $f_1(\cdot)$ plus noise). Experimental results indicate that while both AttnEmbed and PatchTST effectively mitigate noise, AttnEmbed delivers more accurate predictions for imminent time points in forecasting.

In conclusion, our empirical studies using synthetic data demonstrate that attention-based embedding is an effective schema for addressing noise and non-stationary distributions.

2.2 Theoretical Analysis for Robustness of Attention based Representation

To demonstrate that AttnEmbed is more resilient to noise, we first show that, after adding a significant amount of noise to the input patterns, the distance between “similar” data pairs can be very close to that between “dissimilar” pairs. In contrast, by using attention-based representations, we can ensure that the distance for “similar” data pairs remains significantly smaller than that for “dissimilar” pairs even after adding large noise to the input patterns. Below, we provide a sketch of the overall results and postpone the full analysis to the appendix.

Consider $n$ vectors $x_i \in \mathbb{R}^d$ in a sequence that are generated from $m < d$ Gaussian distributions $\mathcal{N}(\mu_i, I_d), i = 1, \ldots, m$. We assume $\langle \mu_i, \mu_j \rangle = \delta_{i,j} s$. It is easy to show that for two “similar” data points $x_i^+$ and $x_j^+$ that are generated from the same distribution, their expected distance is $\mathrm{E}[|x_i^+ - x_j^+|^2] = 2d$, whereas for two “dissimilar” data points $x_i^-$ and $x_j^-$ that are generated from different distributions, their expected distance is $\mathrm{E}[|x_i^- - x_j^-|^2] = 2d + 2s$. When $s \ll d$, i.e., the noise is much larger than the signal, we have $\mathrm{E}[|x_i^+ - x_j^+|^2] \approx \mathrm{E}[|x_i^- - x_j^-|^2]$, implying that there is a significant chance that $|x_i^- - x_j^-|^2$ can be noticeably smaller than $|x_i^+ - x_j^+|^2$. Now, if we use the attention weights as the representation, denoted by $f(x)$, using the same notation for “similar” and “dissimilar” data pairs, we can show that

$$\frac{\mathrm{E}[|f(x_i^-) - f(x_j^-)|^2] - \mathrm{E}[|f(x_i^+) - f(x_j^+)|^2]}{\mathrm{E}[|f(x_i^-) - f(x_j^-)|^2]} = \Omega(1)$$

with an appropriate choice of temperature. This implies that even after adding large noise to the input patterns, we can still clearly distinguish “similar” data pairs from “dissimilar” data pairs, thus verifying the robustness of the proposed attention-based data representation. The full theoretical analysis is in Appendix A.
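The raw-input half of this argument is easy to check numerically. The sketch below is a Monte Carlo illustration under assumed values ($d = 64$, $s = 4$, two orthogonal means along coordinate axes); it covers only the distances between raw inputs and does not reproduce the attention-based bound, whose proof is deferred to the appendix. The helper name `mean_sq_distance` is hypothetical.

```python
import numpy as np

def mean_sq_distance(mu_a, mu_b, n_pairs, rng):
    """Average squared distance between x_a ~ N(mu_a, I) and x_b ~ N(mu_b, I)."""
    d = mu_a.shape[0]
    xa = mu_a + rng.standard_normal((n_pairs, d))
    xb = mu_b + rng.standard_normal((n_pairs, d))
    return np.mean(np.sum((xa - xb) ** 2, axis=1))

rng = np.random.default_rng(0)
d, s, n_pairs = 64, 4.0, 20000        # assumed values with s << d

# Orthogonal means with <mu_i, mu_j> = delta_ij * s.
mu_1 = np.zeros(d)
mu_1[0] = np.sqrt(s)
mu_2 = np.zeros(d)
mu_2[1] = np.sqrt(s)

similar = mean_sq_distance(mu_1, mu_1, n_pairs, rng)      # theory: 2d
dissimilar = mean_sq_distance(mu_1, mu_2, n_pairs, rng)   # theory: 2d + 2s
relative_gap = (dissimilar - similar) / dissimilar        # theory: s / (d + s)
```

In theory the relative gap equals $s/(d+s)$, which vanishes as $s/d \to 0$. This is the sense in which raw inputs fail to separate similar from dissimilar pairs, in contrast to the $\Omega(1)$ gap claimed for the attention-based representation.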

3 Related Work

In this section, we provide brief reviews of literature in the areas of time series forecasting and the relationship between attention mechanism and kernel function.

3.1 Time Series Forecasting

Recently, inspired by great success in NLP and CV, transformer models have also been widely used in time series forecasting (Wen et al., 2023). Informer (Zhou et al., 2021) proposes a probability sparse attention mechanism to deal with long-term dependencies. Autoformer (Wu et al., 2021) introduces a decomposition transformer architecture and replaces the attention module with an Auto-Correlation mechanism. FEDformer (Zhou et al., 2022b) employs a Fourier-enhanced architecture to improve computational efficiency, achieving linear complexity. PatchTST (Nie et al., 2022) segments time series into individual patches, which successfully increases input length and reduces information redundancy. GPT4TS (Zhou et al., 2023) utilizes a frozen GPT-2 and achieves promising performance in several time series tasks. CARD (Xue et al., 2023) and iTransformer (Liu et al., 2023) integrate the correlations among multiple variables to enhance the performance in multivariate time series forecasting. Moreover, TimesNet (Wu et al., 2023) treats time series as a 2D signal and utilizes a convolution-based inception network as its backbone. The simple MLP-based DLinear (Zeng et al., 2023) outperforms many transformer models in time series forecasting with channel independence and seasonal-trend decomposition.

3.2 Attention and Kernel Function

The perspective of viewing attention as a kernel function is widely recognized in the literature, encompassing modifications to transformers and attention mechanisms (Tsai et al., 2019; Song et al., 2021), acceleration of computational processes (Katharopoulos et al., 2020), and time series anomaly detection (Yang et al., 2023; Xu et al., 2021). Tsai et al. (2019) propose that the attention mechanisms in transformers can be interpreted as applying a kernel smoother across the input data, where the kernel scores are the similarities between inputs. Song et al. (2021) derive that attention is a product of an RBF kernel and the exponential of the $\ell_2$-norm. Also, given that kernel functions are computationally efficient for distance calculations, Katharopoulos et al. (2020) reformulate self-attention as a linear operation involving the dot-product of kernelized feature maps. The exploration of kernel functions has also extended into time series anomaly detection. Anomaly Transformer (Xu et al., 2021) and DCdetector (Yang et al., 2023) both utilize the Kullback-Leibler divergence to calculate the distance between the attention matrix and a Gaussian kernel, establishing a novel linkage between attention mechanisms and kernel functions in the domain of time series. Although the methods mentioned previously utilize attention weights primarily for token mixing, our work is among the first to explore attention as an end in itself, not just a means, for embedding schema in the field of time series forecasting. To our knowledge, such an approach has rarely been investigated.

4 Methodology

Figure 2: The architecture of (a) AttnEmbed and a comparison with (b) PatchTST. Unlike PatchTST, AttnEmbed considers the relationship of time steps within each window.

Consider a multivariate time series with look-back window $L$: $(x_1, \ldots, x_t, \ldots, x_L)$, where $x_t \in \mathbb{R}^M$ is the observation at time $t$ with $M$ channels. Our objective is to forecast future steps with a horizon of $T$, denoted by $(x_{L+1}, \ldots, x_{L+T})$.

4.1 Overall Architecture

The architecture of AttnEmbed is illustrated in Figure 2. AttnEmbed contains a Pre-processing module, which consists of instance normalization and channel independence, an Attention Embedding module, and a Transformer Encoder. It is important to note that our proposed method serves as a model-agnostic alternative for embedding, with the transformer employed merely as an illustrative example. As demonstrated in Table 7, we have conducted experiments with various baseline models, including PatchTST (Nie et al., 2022) and CARD (Xue et al., 2023).

Pre-process Module.

The input time series in Pre-process module is first normalized by instance normalization (Kim et al., 2022). This normalization block performs a simple normalization of the input time series with mean and variance, and subsequently integrates these values back to the output. We then employ the channel independence technique, as used in DLinear (Zeng et al., 2023) and PatchTST (Nie et al., 2022), which has been widely validated for its effectiveness in time series forecasting. This technique essentially transforms a multivariate time series forecasting problem into a univariate one.
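The two pre-processing steps can be sketched as follows. This is a simplified illustration, not the authors' code: it omits the learnable affine parameters that instance normalization layers of this kind may carry, and the function names and the `(batch, length, channels)` layout are assumptions.

```python
import numpy as np

def instance_normalize(x, eps=1e-5):
    """x: (batch, L, M). Normalize each series over time and keep its statistics."""
    mean = x.mean(axis=1, keepdims=True)
    std = np.sqrt(x.var(axis=1, keepdims=True) + eps)
    return (x - mean) / std, mean, std

def instance_denormalize(y, mean, std):
    """Restore the stored mean and scale on outputs in normalized space."""
    return y * std + mean

def to_channel_independent(x):
    """(batch, L, M) -> (batch * M, L): each channel becomes a univariate series."""
    b, length, m = x.shape
    return x.transpose(0, 2, 1).reshape(b * m, length)

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(4, 192, 7))     # toy batch with 7 channels
x_norm, mean, std = instance_normalize(x)
univariate = to_channel_independent(x_norm)    # shape (28, 192)
restored = instance_denormalize(x_norm, mean, std)
```

After the reshape, the same model weights are applied to every channel, which is what turns the multivariate problem into a batch of univariate ones.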

Attention Embedding Module.

The Attention Embedding module is critical in the architecture of AttnEmbed. The pre-processed time series is split into multiple windows in the Attention Embedding module. Within each window, we utilize a shared Embedding self-attention block with $L$ layers to extract the mutual relationships between time steps. Specifically, for each window, we extract the intermediate computational outputs generated by the Embedding module, obtaining a set of attention matrices. Then, the last rows of all attention matrices are concatenated to form the embedding for the respective window.

Transformer Encoder.

Similar to PatchTST (Nie et al., 2022), next we employ a Transformer Encoder based on the generated embeddings for the forecasting task. As shown in Figure 2, compared to PatchTST, the primary distinction is that AttnEmbed integrates the interaction between time steps within a single window or patch, which is essential for addressing distribution shift by capturing local dynamics.

4.2 Attention Embedding

We now delve into the specifics of computing the attention embedding, as depicted in Figure 5. The pre-processed univariate time series is represented as $U = [u_1, \ldots, u_t, \ldots, u_L] \in \mathbb{R}^L$, where $L$ is the length of the series.

Figure 3: Detail of attention embedding.

4.2.1 Tokenization and Global Landmark

Each input univariate time series is split into several overlapping or non-overlapping windows with window size $W$ and stride length $S$. Thus, the raw tokens are generated as $\hat{\mathcal{X}} = [x^w_1, \ldots, x^w_i, \ldots, x^w_N] \in \mathbb{R}^{W \times 1 \times N}$, where $N = \lfloor \frac{L-W}{S} \rfloor + 1$. Each window, comprising $W$ time steps, is processed through a common set of self-attention layers to yield a concatenated attention matrix, which is then utilized as the embedding.

While the above attention embedding method benefits from capturing local information, it overlooks the global information of the time series. Thus, we introduce global landmarks designed to incorporate the information from the entire series. We utilize Conv1D to calculate the global landmarks:

$$x^w_g = \mathrm{Conv1D}(x^w_i), \tag{1}$$

where $x^w_g \in \mathbb{R}^G$ and $G$ represents the number of landmarks, which is dictated by the parameters of Conv1D. Subsequently, the embedding matrix formed by the shared attention layers is assembled by concatenating each local feature representation $x^w_t$ with the corresponding global feature representation $x^w_g$:

$$\mathcal{X} = [[x^w_g, x^w_1], \ldots, [x^w_g, x^w_N]], \tag{2}$$

where $\mathcal{X} \in \mathbb{R}^{(G+W) \times 1 \times N}$.

For each individual window, the attention score of the $h$-th head in the $l$-th layer is denoted as $A^l_h \in \mathbb{R}^{(G+W) \times (G+W)}$. Through the combination and projection of these attentions, an embedding can be generated that characterizes the local information of the window. Specifically, the final rows from all attention matrices are concatenated and subsequently passed through a projection layer:

$$x^{emb} = \mathrm{Proj}(A^{cat}), \tag{3}$$
$$A^{cat} = \mathop{\mathrm{Concat}}\limits_{l \in [1, L^a],\, h \in [1, H^a]} (A^l_h[-1, :]), \tag{4}$$

where $A^{cat} \in \mathbb{R}^{L^a H^a (G+W)}$, and $L^a$ and $H^a$ represent the number of layers and the number of heads in the set of self-attention layers of the embedding module, respectively. Thus the final output embedding can be denoted as $\mathcal{X}^{emb} = [x^{emb}_1, \ldots, x^{emb}_N]$.
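Equations (1) to (4) can be traced end to end in a compact NumPy sketch. Several pieces are assumptions made only so the example runs: random matrices stand in for learned weights, the Conv1D is replaced by a fixed strided averaging kernel over the whole series, and the layer update is a plain residual attention step. The names `split_windows`, `global_landmarks`, and `attn_embed` are hypothetical.

```python
import numpy as np

def split_windows(u, W, S):
    """Return an (N, W) array of windows with N = floor((L - W) / S) + 1."""
    N = (len(u) - W) // S + 1
    return np.stack([u[i * S:i * S + W] for i in range(N)])

def global_landmarks(u, k):
    """Assumed stand-in for Conv1D: averaging kernel of size k with stride k."""
    G = len(u) // k
    return u[:G * k].reshape(G, k).mean(axis=1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attn_embed(u, W=16, S=8, k=24, n_layers=2, n_heads=2,
               d_head=8, d_model=32, seed=0):
    rng = np.random.default_rng(seed)
    windows = split_windows(u, W, S)                 # (N, W)
    g = global_landmarks(u, k)                       # (G,)
    N, T = windows.shape[0], g.shape[0] + W
    # Eq. (2): landmarks first, then the window's own time steps.
    tokens = np.concatenate([np.tile(g, (N, 1)), windows], axis=1)[..., None]
    h = tokens @ rng.normal(size=(1, d_model))       # (N, T, d_model)
    last_rows = []
    for _layer in range(n_layers):
        head_outputs = []
        for _head in range(n_heads):
            Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
                          for _ in range(3))
            Q, K, V = h @ Wq, h @ Wk, h @ Wv
            A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))  # (N, T, T)
            last_rows.append(A[:, -1, :])            # A_h^l[-1, :] in Eq. (4)
            head_outputs.append(A @ V)
        Wo = rng.normal(size=(n_heads * d_head, d_model)) / np.sqrt(n_heads * d_head)
        h = h + np.concatenate(head_outputs, axis=-1) @ Wo
    A_cat = np.concatenate(last_rows, axis=-1)       # (N, L^a * H^a * (G + W))
    proj = rng.normal(size=(A_cat.shape[1], d_model)) / np.sqrt(A_cat.shape[1])
    return A_cat, A_cat @ proj                       # Eq. (3)

u = np.sin(np.linspace(0.0, 12.0 * np.pi, 192))
A_cat, x_emb = attn_embed(u)
```

For a series of length 192 with $W = 16$ and $S = 8$, there are $N = 23$ windows. With $G = 8$ landmarks, two layers, and two heads, each window's concatenated attention vector has $2 \cdot 2 \cdot (8 + 16) = 96$ entries before projection.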

4.2.2 Exponential Moving Average (EMA)

To enhance the capture of local information, we have incorporated an Exponential Moving Average (EMA) within the self-attention blocks of the embedding module. EMA is a special case of moving average that responds to changes more quickly in time and can smooth out the output for noise reduction. Specifically, EMA utilizes factors exponentially decaying weighting factors as:

yt=αxt+(1α)yt1,subscript𝑦𝑡𝛼subscript𝑥𝑡1𝛼subscript𝑦𝑡1y_{t}=\alpha x_{t}+(1-\alpha)y_{t-1},italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_α italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + ( 1 - italic_α ) italic_y start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , (5)

where $\alpha\in(0,1)$ is the degree of weighting decrease. Many works (Ma et al., 2023; Xue et al., 2023) have explored the application of EMA in the attention module. We integrate EMA into the queries and keys within the attention mechanism, opting for a non-parametric approach to reinforce stability during the training process.
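A minimal sketch of the recursion in Eq. (5), applied along the time axis of a query or key matrix, is given below. The initialisation $y_{0}=x_{0}$ and the function name `ema` are assumptions made for illustration; the paper does not specify them.

```python
import numpy as np

def ema(x, alpha, axis=0):
    """Non-parametric EMA: y_t = alpha * x_t + (1 - alpha) * y_{t-1}.

    The first output is set to the first input (an assumed initialisation).
    """
    x = np.moveaxis(np.asarray(x, dtype=float), axis, 0)
    y = np.empty_like(x)
    y[0] = x[0]
    for t in range(1, len(x)):
        y[t] = alpha * x[t] + (1.0 - alpha) * y[t - 1]
    return np.moveaxis(y, 0, axis)

# Scalar series example.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = ema(x, alpha=0.5)

# Smoothing hypothetical queries of shape (tokens, d) along the token axis.
rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 4))
Q_smooth = ema(Q, alpha=0.5, axis=0)
```

A smaller `alpha` puts more weight on the history and therefore smooths more strongly, while `alpha` close to 1 leaves the input almost unchanged.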

4.3 Kernel Function for Attention based Representation

Inspired by earlier studies (Choromanski et al., 2021; Katharopoulos et al., 2020) that pioneered a kernel-based interpretation of the attention matrix, we propose the adoption of advanced kernel methods. We utilize both the Radial Basis Function (RBF) and polynomial kernels to assess the degree of similarity between time steps within a given window. This methodological innovation underpins the output of our attention embedding module, thereby replacing the $\mathcal{X}^{emb}$ described in Section 4.2.

We generate queries $Q$ and keys $K$ by linearly projecting the token tensor $\mathcal{X}_{i}=[x^{w}_{g},x^{w}_{i}]^{T}\in\mathbb{R}^{(G+W)\times 1}$ as follows:

$$Q=F_{q}(\mathcal{X}_{i}),\quad K=F_{k}(\mathcal{X}_{i}), \tag{6}$$

with both $Q,K\in\mathbb{R}^{(G+W)\times d}$, where $F_{q},F_{k}$ map from dimension $1$ to $d$ through MLP layers. These matrices are further processed to obtain $Q_{h},K_{h}\in\mathbb{R}^{(G+W)\times d_{head}}$, corresponding to the queries and keys for the $h^{th}$ attention head, with $d=H^{a}\times d_{head}$.

Kernel-based embeddings are then computed as:

$$x^{emb}_{kernel}=\mathrm{Proj}(A^{cat}_{kernel}), \tag{7}$$
$$A^{cat}_{kernel}=\mathop{\rm Concat}\limits_{h\in[1,H^{a}]}\mathcal{K}(Q_{h}[-1,:],K_{h}), \tag{8}$$

where $\mathcal{K}$ is the kernel function. In this paper, we introduce two kernel functions, the RBF kernel and the polynomial kernel.
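Eqs. (6)-(8) can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: a single linear map stands in for each of the MLPs $F_{q}$ and $F_{k}$, Proj is taken to be a linear map, and the kernel hyperparameters (`gamma`, `degree`, `c`) are hypothetical since the text does not specify them.

```python
import numpy as np

def rbf_kernel(q, K, gamma=1.0):
    """RBF similarity between one query q (d_head,) and each key row of K (T, d_head)."""
    return np.exp(-gamma * np.sum((K - q) ** 2, axis=-1))

def poly_kernel(q, K, degree=2, c=1.0):
    """Polynomial similarity (<q, k> + c)^degree for each key row k."""
    return (K @ q + c) ** degree

def split_heads(M, n_heads):
    """Reshape (T, d) into (n_heads, T, d_head), with d = n_heads * d_head."""
    T, d = M.shape
    return M.reshape(T, n_heads, d // n_heads).transpose(1, 0, 2)

def kernel_embedding(tokens, W_q, W_k, W_proj, n_heads, kernel):
    # Eq. (6): project tokens to queries and keys (linear stand-in for the MLPs).
    Q = tokens @ W_q
    K = tokens @ W_k
    Qh, Kh = split_heads(Q, n_heads), split_heads(K, n_heads)
    # Eq. (8): similarity of the last token's query to every key, per head.
    a_cat = np.concatenate([kernel(Qh[h, -1], Kh[h]) for h in range(n_heads)])
    # Eq. (7): project the concatenated similarities (assumed linear Proj).
    return a_cat @ W_proj

# Toy sizes and random weights (assumptions for illustration only).
rng = np.random.default_rng(0)
G, W, d, H_a, d_model = 2, 4, 8, 2, 16
tokens = rng.normal(size=(G + W, 1))
W_q = rng.normal(size=(1, d))
W_k = rng.normal(size=(1, d))
W_proj = rng.normal(size=(H_a * (G + W), d_model))

emb_rbf = kernel_embedding(tokens, W_q, W_k, W_proj, H_a, rbf_kernel)
emb_poly = kernel_embedding(tokens, W_q, W_k, W_proj, H_a, poly_kernel)
```

Unlike softmax attention, the kernel scores are not normalised to sum to one, so each entry of the concatenated vector is a direct pairwise similarity.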

5 Experiments

5.1 Experiments on Real-world Datasets

Datasets.

We conduct experiments on seven popular real-world benchmark datasets: the four ETT datasets (Zhou et al., 2021) (comprising two hourly datasets, ETTh1 and ETTh2, and two 15-minute datasets, ETTm1 and ETTm2), the Electricity dataset (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) for hourly electricity consumption, the Weather dataset (https://www.bgc-jena.mpg.de/wetter/) for 10-minute weather forecasting, and the Traffic dataset (http://pems.dot.ca.gov) for hourly road occupancy rates.

Baselines.

In our comparison, we have chosen a range of representative baselines, including transformer-based models such as PatchTST (Nie et al., 2022), FiLM (Zhou et al., 2022a), FEDformer (Zhou et al., 2022b), Autoformer (Wu et al., 2021), and Informer (Zhou et al., 2021); the MLP-based DLinear (Zeng et al., 2023); and the CNN-based TimesNet (Wu et al., 2023). Our research is particularly concerned with exploring interactions within individual channels, so we’ve limited our benchmarking to cutting-edge models that adopt a channel-independent structure. This selection criterion ensures a focused and pertinent benchmarking against our research aims. Consequently, models like iTransformer (Liu et al., 2023) and CARD (Xue et al., 2023) are excluded from the primary experiments. Nonetheless, in Section 5.1.2, we demonstrate how AttnEmbed can be effectively integrated as a plug-in to enhance CARD’s performance.

Main Results.

For better comparison, we follow the experimental settings in (Wu et al., 2023), maintaining the lookback length at 96 and the horizon length at 96, 192, 336, and 720, respectively. The main results of multivariate forecasting are summarized in Table 1. A lower MSE/MAE indicates better forecasting results. AttnEmbed achieves state-of-the-art results, outperforming the previously top-performing PatchTST model. Crucially, this is achieved while preserving a similar main model architecture, specifically the vanilla transformer encoder used by PatchTST. The performance gain can therefore be attributed to our shift from traditional patching to the AttnEmbed method. AttnEmbed achieves the best performance on 6 out of 7 datasets in both MSE and MAE. Compared with PatchTST, AttnEmbed yields an overall 3.6% relative MSE reduction and 2.1% relative MAE reduction. On datasets with noisier and more frequently shifting distributions, such as ETTh1 and Traffic, the improvement from PatchTST to AttnEmbed is more pronounced, with MSE reductions of 6.2% and 8.2%, respectively. In general, the improvements made by AttnEmbed are consistent across various horizons, indicating that attention effectively represents time series with resilience to both noise and non-stationary distributions.

Table 1: Multivariate forecasting with a lookback length of 96. All models are averaged from 4 different horizons. A lower MSE indicates better performance. The best ones are in Bold, and the second ones are underlined. Detailed results are provided in Appendix B.1
| Dataset | AttnEmbed MSE | AttnEmbed MAE | PatchTST MSE | PatchTST MAE | TimesNet MSE | TimesNet MAE | DLinear MSE | DLinear MAE | FiLM MSE | FiLM MAE | FEDformer MSE | FEDformer MAE | Autoformer MSE | Autoformer MAE | Informer MSE | Informer MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weather | 0.252 | 0.278 | 0.257 | 0.280 | 0.259 | 0.287 | 0.265 | 0.317 | 0.269 | 0.339 | 0.309 | 0.360 | 0.338 | 0.382 | 0.634 | 0.548 |
| ETTh1 | 0.422 | 0.430 | 0.450 | 0.440 | 0.458 | 0.450 | 0.456 | 0.452 | 0.461 | 0.456 | 0.440 | 0.460 | 0.496 | 0.487 | 1.040 | 0.795 |
| ETTh2 | 0.361 | 0.393 | 0.366 | 0.404 | 0.414 | 0.427 | 0.559 | 0.515 | 0.384 | 0.406 | 0.437 | 0.449 | 0.450 | 0.459 | 4.431 | 1.729 |
| ETTm1 | 0.377 | 0.395 | 0.381 | 0.395 | 0.400 | 0.406 | 0.403 | 0.407 | 0.408 | 0.399 | 0.448 | 0.452 | 0.588 | 0.517 | 0.961 | 0.734 |
| ETTm2 | 0.286 | 0.331 | 0.285 | 0.327 | 0.291 | 0.333 | 0.350 | 0.401 | 0.287 | 0.328 | 0.305 | 0.349 | 0.327 | 0.371 | 1.410 | 0.810 |
| ECL | 0.189 | 0.274 | 0.196 | 0.280 | 0.192 | 0.295 | 0.212 | 0.300 | 0.223 | 0.303 | 0.214 | 0.327 | 0.227 | 0.338 | 0.311 | 0.397 |
| Traffic | 0.447 | 0.282 | 0.487 | 0.308 | 0.620 | 0.336 | 0.625 | 0.383 | 0.639 | 0.389 | 0.610 | 0.376 | 0.628 | 0.379 | 0.311 | 0.397 |
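The relative reductions quoted in the main results can be recomputed from the averaged AttnEmbed and PatchTST values in Table 1. The sketch below assumes that the overall figures compare the two models' errors averaged over the seven datasets; the text does not state the aggregation explicitly, so this is one plausible reading that happens to reproduce the reported numbers.

```python
# (MSE, MAE) pairs copied from Table 1.
patchtst = {
    "Weather": (0.257, 0.280), "ETTh1": (0.450, 0.440), "ETTh2": (0.366, 0.404),
    "ETTm1": (0.381, 0.395), "ETTm2": (0.285, 0.327), "ECL": (0.196, 0.280),
    "Traffic": (0.487, 0.308),
}
attnembed = {
    "Weather": (0.252, 0.278), "ETTh1": (0.422, 0.430), "ETTh2": (0.361, 0.393),
    "ETTm1": (0.377, 0.395), "ETTm2": (0.286, 0.331), "ECL": (0.189, 0.274),
    "Traffic": (0.447, 0.282),
}

def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline

def mean_metric(table, idx):
    """Average one metric (0 = MSE, 1 = MAE) over all datasets."""
    return sum(v[idx] for v in table.values()) / len(table)

overall_mse = relative_reduction(mean_metric(patchtst, 0), mean_metric(attnembed, 0))
overall_mae = relative_reduction(mean_metric(patchtst, 1), mean_metric(attnembed, 1))
etth1_mse = relative_reduction(patchtst["ETTh1"][0], attnembed["ETTh1"][0])
traffic_mse = relative_reduction(patchtst["Traffic"][0], attnembed["Traffic"][0])

print(f"overall MSE {overall_mse:.1%}, overall MAE {overall_mae:.1%}")
print(f"ETTh1 MSE {etth1_mse:.1%}, Traffic MSE {traffic_mse:.1%}")
```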

Recent works have shown that extending the lookback length can enhance performance. Thus, we have also demonstrated that AttnEmbed’s effectiveness is not constrained by the lookback window size and can outperform PatchTST with longer inputs, in Section 5.1.2.

5.1.1 Kernel Functions

Our experiments with real-world datasets (ETTh1, ETTm1, and Weather) using RBF and polynomial kernels demonstrate that kernel functions can achieve results comparable to softmax-based attention mechanisms. The summarized outcomes in Table 2 indicate that both kernels not only meet but, on average, surpass the performance of previous state-of-the-art (SOTA) models like PatchTST and TimesNet in terms of MSE and MAE. In particular, the polynomial kernel attains a relative MSE reduction of 4.2% and a relative MAE reduction of 2.0% compared to PatchTST on ETTh1. These findings suggest that the effectiveness of attention weights can be replicated through a kernel approach, where similarities between tokens are calculated. This supports the integration of kernel functions into time series forecasting and underscores their promise for such applications.

Table 2: Multivariate forecasting results with RBF kernel and polynomial kernel with a lookback length of 96. All models are averaged over 4 different horizons. A lower MSE indicates better performance. The best ones are in Bold, and the second ones are underlined. Detailed results are provided in Appendix B.2.

| Dataset | AttnEmbed MSE | AttnEmbed MAE | RBF Kernel MSE | RBF Kernel MAE | Polynomial Kernel MSE | Polynomial Kernel MAE | PatchTST MSE | PatchTST MAE | TimesNet MSE | TimesNet MAE |
|---|---|---|---|---|---|---|---|---|---|---|
| Weather | 0.252 | 0.278 | 0.255 | 0.280 | 0.254 | 0.279 | 0.257 | 0.280 | 0.259 | 0.287 |
| ETTh1 | 0.422 | 0.430 | 0.445 | 0.439 | 0.431 | 0.431 | 0.450 | 0.440 | 0.458 | 0.450 |
| ETTm1 | 0.377 | 0.395 | 0.378 | 0.393 | 0.379 | 0.393 | 0.381 | 0.395 | 0.400 | 0.406 |

Table 3: Utilize AttnEmbed as a plug-in. The lookback length is 336 for PatchTST and 96 for CARD. All models are averaged over 4 different horizons. A lower MSE indicates better performance. Detailed results are provided in Appendix B.3.

| Dataset | PatchTST(42) MSE | PatchTST(42) MAE | PatchTST(42)+AttnEmbed MSE | PatchTST(42)+AttnEmbed MAE | CARD MSE | CARD MAE | CARD+AttnEmbed MSE | CARD+AttnEmbed MAE |
|---|---|---|---|---|---|---|---|---|
| Weather | 0.229 | 0.265 | 0.227 | 0.263 | 0.239 | 0.265 | 0.238 | 0.260 |
| ETTh1 | 0.417 | 0.430 | 0.409 | 0.426 | 0.442 | 0.428 | 0.436 | 0.427 |
| ETTh2 | 0.352 | 0.381 | 0.347 | 0.379 | 0.383 | 0.384 | 0.377 | 0.379 |

5.1.2 Utilize AttnEmbed as A Plug-in

As depicted in Figure 5, the primary distinction between AttnEmbed and PatchTST lies in the embedding module. Consequently, AttnEmbed could potentially serve as a plug-in module to replace patching. Here, we primarily investigate two aspects of AttnEmbed’s versatility: the extension of the lookback window and the incorporation of multi-channel relationships. We incorporate the AttnEmbed module into the PatchTST(42) framework with a lookback window of 336 time steps, and into the CARD model, which utilizes a 96-step lookback window and is tailored to improve cross-channel interactions. To ensure a fair comparison, the window size, stride settings, and lookback window are kept in line with those used in the respective original models. The results are summarized in Table 3. After substituting the patching with AttnEmbed, we observe a performance improvement even when retaining the same window size and stride. Notably, on the ETTh1 dataset, AttnEmbed achieves a relative MSE reduction of 1.9% when integrated with PatchTST(42), and a relative MSE reduction of 1.3% for CARD. This indicates that AttnEmbed is versatile beyond the constraints of lookback window and channel-independent settings, demonstrating its potential as an adaptable plug-in module for various models.

5.2 Model Analysis

5.2.1 Ablations

Here, we carry out ablation studies for the architectural design of AttnEmbed, with the aim of demonstrating the performance impact of omitting the global landmark or EMA components. Two ablated versions of AttnEmbed are evaluated on the ETTh1 and ETTm1 datasets: 1) AttnEmbed without global landmark, to assess the significance of incorporating global information; and 2) AttnEmbed without EMA, to ascertain the contribution of EMA to time series forecasting. As shown in Table 4, the fully-equipped AttnEmbed model, which integrates both landmark and EMA, outperforms its two ablated variants, achieving an average MSE reduction of 2.2%. This highlights the roles that landmarks and EMA play within the AttnEmbed framework in capturing global information and local time-dependent dynamics.

Table 4: Ablation on EMA and landmark with a lookback length of 96. All models are averaged from 4 different horizons. A lower MSE indicates better performance. Detailed results are provided in Appendix B.4

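| Dataset | AttnEmbed MSE | AttnEmbed MAE | AttnEmbed w/o EMA MSE | AttnEmbed w/o EMA MAE | AttnEmbed w/o Landmark MSE | AttnEmbed w/o Landmark MAE |
|---|---|---|---|---|---|---|
| ETTh1 | 0.422 | 0.430 | 0.432 | 0.427 | 0.423 | 0.429 |
| ETTm1 | 0.377 | 0.395 | 0.384 | 0.397 | 0.386 | 0.397 |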

5.2.2 Parameter Analysis

Figure 4: Parameter analysis on ETTh1 with a lookback window of 96 and a horizon of 96. (a) Window size. (b) Decrease coefficients of EMA. (c) Stride sizes of global Conv1D. (d) Layer numbers of the attention embedding module.

We examined AttnEmbed’s parameter sensitivity, presenting the forecasting MSE for varying configurations in Figure 4. These parameters include window sizes ([5, 10, 25, 50]), EMA decay coefficients ([0.3, 0.5, 0.7, 0.9]), Conv1D stride sizes ([24, 48, 96]), and attention embedding module layers ([1, 2, 3]), all tested on the ETTh1 dataset with a 96-period lookback and forecast horizon.

Figure 4(a) reveals that larger window sizes struggle with quick distribution changes. In contrast, as shown in Figure 4(b), the proper EMA decay coefficient can enhance results and mitigate noise, although too low a coefficient may over-smooth and degrade performance. Figure 4(c) suggests that a stride size around half the lookback window optimizes the capture of global patterns. Lastly, Figure 4(d) indicates that deeper attention embedding layers improve outcomes, with three layers being selected for their balance of performance and computational efficiency.

5.2.3 Alleviating Rank Collapse

Rank collapse is a notable challenge in the application of transformer models, wherein the attention matrix’s rank decreases during training. This contraction in rank can constrain the model’s ability to fit data, leading to weaker generalization. Although transformers for time series (TS) are less prone to this issue than the deeper Large Language Models (LLMs), it remains a relevant concern for TS model robustness. Since the attention matrix is closely related to a kernel matrix, which often exhibits a higher rank than the input matrix, we expect that the attention based representation may help alleviate the problem of rank collapse.

Following (Dong et al., 2021), we compare the relative norm of the residual to evaluate the ‘rankness’ of layers in a model. As shown in Figure 5, using the attention based representation, where pairwise similarities are computed using RBF and polynomial kernels, can efficiently mitigate the issue of rank collapse observed in time series data. More information about the kernel functions for attention based representation can be found in Section 4.3.
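As a rough illustration of the quantity being compared, the sketch below measures how far a layer output is from a matrix whose rows are all identical. It is a simplification under stated assumptions: the residual is taken with respect to the column-wise mean and measured in the Frobenius norm, which may differ from the exact norm used by Dong et al. (2021), and the function name `relative_residual_norm` is hypothetical.

```python
import numpy as np

def relative_residual_norm(X):
    """||X - 1 x^T||_F / ||X||_F, with x the column-wise mean of X.

    A value near 0 means all rows (tokens) are nearly identical,
    i.e. the representation has collapsed towards rank one.
    """
    X = np.asarray(X, dtype=float)
    residual = X - X.mean(axis=0, keepdims=True)
    return np.linalg.norm(residual) / np.linalg.norm(X)

# A collapsed representation: every token has the same feature vector.
collapsed = np.ones((6, 1)) @ np.array([[1.0, -2.0, 3.0]])

# A generic representation with distinct tokens.
rng = np.random.default_rng(0)
generic = rng.normal(size=(6, 3))

r_collapsed = relative_residual_norm(collapsed)
r_generic = relative_residual_norm(generic)
print(r_collapsed, r_generic)
```

Tracking this ratio layer by layer gives a depth profile of the kind plotted in Figure 5.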

Figure 5: Relative norm of the residual along the depth for PatchTST, Attention, RBF kernel and polynomial kernel with different layers ([3, 6]) of transformer encoder on ETTh1.

6 Conclusion

The paper addresses the inherent nature of time series data, such as its low information density and the prevalence of distribution shifts and noise. By leveraging an attention mechanism tailored to time series, where the attention weights play a central role in representing data, we propose a novel and robust embedding strategy that utilizes global landmarks and a localized window to enrich the data representation. Our tailored attention map outperforms the patching embedding-based SOTA transformer model in time series forecasting, yielding an average 3.6% improvement in MSE for multivariate time series prediction. The approach improves predictive precision and introduces a modular component that is engineered for seamless integration within existing architectures, potentially reinforcing their resilience in generating embeddings from noisy signals.

Impact Statements

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Bao et al. (2022) Bao, H., Dong, L., Piao, S., and Wei, F. BEiT: BERT pre-training of image transformers. In International Conference on Learning Representations, 2022.
  • Böse et al. (2017) Böse, J.-H., Flunkert, V., Gasthaus, J., Januschowski, T., Lange, D., Salinas, D., Schelter, S., Seeger, M., and Wang, Y. Probabilistic demand forecasting at scale. Proceedings of the VLDB Endowment, 10(12):1694–1705, 2017.
  • Box & Jenkins (1968) Box, G. E. and Jenkins, G. M. Some recent advances in forecasting and control. Journal of the Royal Statistical Society. Series C (Applied Statistics), 17(2):91–109, 1968.
  • Box & Pierce (1970) Box, G. E. and Pierce, D. A. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association, 65(332):1509–1526, 1970.
  • Choromanski et al. (2021) Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH.
  • Courty & Li (1999) Courty, P. and Li, H. Timing of seasonal sales. The Journal of Business, 72(4):545–572, 1999.
  • Devlin et al. (2019) Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, June 2-7, 2019, pp.  4171–4186, 2019.
  • Dong et al. (2021) Dong, Y., Cordonnier, J.-B., and Loukas, A. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. ArXiv, abs/2103.03404, 2021. URL https://api.semanticscholar.org/CorpusID:232134936.
  • Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations (ICLR), Austria, May 3-7, 2021, 2021.
  • Ghojogh et al. (2021) Ghojogh, B., Ghodsi, A., Karray, F., and Crowley, M. Reproducing kernel Hilbert space, Mercer’s theorem, eigenfunctions, Nyström method, and use of kernels in machine learning: Tutorial and survey. ArXiv, abs/2106.08443, 2021. URL https://api.semanticscholar.org/CorpusID:235446387.
  • Katharopoulos et al. (2020) Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
  • Kim et al. (2022) Kim, T., Kim, J., Tae, Y., Park, C., Choi, J.-H., and Choo, J. Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations, 2022.
  • Li et al. (2019) Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.-X., and Yan, X. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. arXiv preprint arXiv:1907.00235, 2019.
  • Liu et al. (2023) Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., and Long, M. iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625, 2023.
  • Ma et al. (2023) Ma, X., Zhou, C., Kong, X., He, J., Gui, L., Neubig, G., May, J., and Zettlemoyer, L. Mega: Moving average equipped gated attention. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=qNLe3iq2El.
  • Mika et al. (1998) Mika, S., Schölkopf, B., Smola, A., Müller, K.-R., Scholz, M., and Rätsch, G. Kernel PCA and de-noising in feature spaces. In Kearns, M., Solla, S., and Cohn, D. (eds.), Advances in Neural Information Processing Systems, volume 11. MIT Press, 1998. URL https://proceedings.neurips.cc/paper_files/paper/1998/file/226d1f15ecd35f784d2a20c3ecf56d7f-Paper.pdf.
  • Nie et al. (2022) Nie, Y., Nguyen, N. H., Sinthong, P., and Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. ArXiv, abs/2211.14730, 2022.
  • Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.
  • Song et al. (2021) Song, K., Jung, Y., Kim, D., and Moon, I.-C. Implicit kernel attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  9713–9721, 2021.
  • Tsai et al. (2019) Tsai, Y.-H. H., Bai, S., Yamada, M., Morency, L.-P., and Salakhutdinov, R. Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  4344–4353, 2019.
  • Vaswani & etc. (2017) Vaswani, A. and etc. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
  • Wen et al. (2022) Wen, Q., Yang, L., Zhou, T., and Sun, L. Robust time series analysis and applications: An industrial perspective. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp.  4836–4837, 2022.
  • Wen et al. (2023) Wen, Q., Zhou, T., Zhang, C., Chen, W., Ma, Z., Yan, J., and Sun, L. Transformers in time series: A survey. In International Joint Conference on Artificial Intelligence (IJCAI), 2023.
  • Wilson et al. (2015) Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep kernel learning. In International Conference on Artificial Intelligence and Statistics, 2015. URL https://api.semanticscholar.org/CorpusID:1443279.
  • Wu et al. (2021) Wu, H., Xu, J., Wang, J., and Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Advances in Neural Information Processing Systems (NeurIPS), pp.  101–112, 2021.
  • Wu et al. (2023) Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., and Long, M. TimesNet: Temporal 2D-variation modeling for general time series analysis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ju_Uqw384Oq.
  • Xu et al. (2021) Xu, J., Wu, H., Wang, J., and Long, M. Anomaly transformer: Time series anomaly detection with association discrepancy. arXiv preprint arXiv:2110.02642, 2021.
  • Xue et al. (2023) Xue, W., Zhou, T., Wen, Q., Gao, J., Ding, B., and Jin, R. Make transformer great again for time series forecasting: Channel aligned robust dual transformer, 2023.
  • Yang et al. (2023) Yang, Y., Zhang, C., Zhou, T., Wen, Q., and Sun, L. Dcdetector: Dual attention contrastive representation learning for time series anomaly detection. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023. URL https://api.semanticscholar.org/CorpusID:259203116.
  • Zeng et al. (2023) Zeng, A., Chen, M., Zhang, L., and Xu, Q. Are transformers effective for time series forecasting? 2023.
  • Zhou et al. (2021) Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., and Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of AAAI, 2021.
  • Zhou et al. (2022a) Zhou, T., Ma, Z., Wang, X., Wen, Q., Sun, L., Yao, T., Yin, W., and Jin, R. FiLM: Frequency improved Legendre memory model for long-term time series forecasting. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 12677–12690. Curran Associates, Inc., 2022a.
  • Zhou et al. (2022b) Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., and Jin, R. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proc. 39th International Conference on Machine Learning (ICML 2022), 2022b.
  • Zhou et al. (2023) Zhou, T., Niu, P., Wang, X., Sun, L., and Jin, R. One fits all: Power general time series analysis by pretrained LM. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=gMS6FVZvmF.

Appendix A Full Theoretical Analysis of Attention as Robust Representation

In this section, we show that using the attention map as an alternative representation can be significantly more robust than the original inputs, particularly to noise. In other words, the attention map helps reduce the impact of noise compared to the original inputs.

Let $x_{i}\in\mathbb{R}^{d},i=1,\ldots,n$ be $n$ vectors. For simplicity of analysis, we assume that each vector is generated from one of $m<d$ Gaussian distributions, denoted by $\mathcal{N}\left(\mu_{i},I_{d}\right),i=1,\ldots,m$. For the convenience of analysis, we assume that $\langle\mu_{i},\mu_{j}\rangle=\delta_{i,j}s$ for any $i,j\in[m]$. Let $n=mK$, and we choose to generate $K$ vectors from each Gaussian distribution. In particular, vector $x_{mj+i}$, with $j=0,\ldots,K-1$, $i=1,\ldots,m$, is generated from Gaussian distribution $\mathcal{N}\left(\mu_{i},I_{d}\right)$.
Then, if we choose two vectors $x_{i}^{+}$ and $x_{j}^{+}$ that are sampled from the same Gaussian distribution, we have

$$\mathrm{E}[|x^{+}_{i}-x^{+}_{j}|^{2}]=2d$$

and if $x^{-}_{i}$ and $x^{-}_{j}$ are sampled from different distributions, we have

$$\mathrm{E}[|x^{-}_{i}-x^{-}_{j}|^{2}]=2d+2s$$
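The two expectations can be checked numerically with a small Monte Carlo simulation. The sketch below is illustrative only: the dimension `d`, the scale `s`, and the choice of scaled standard basis vectors as orthogonal means are assumptions chosen to satisfy $\langle\mu_{i},\mu_{j}\rangle=\delta_{i,j}s$. It also reports how often a same-distribution pair ends up farther apart than a different-distribution pair.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, n_pairs = 64, 4.0, 20_000

def basis_mean(index):
    """Mean vector sqrt(s) * e_index, so that <mu_i, mu_j> = s if i == j else 0."""
    mu = np.zeros(d)
    mu[index] = np.sqrt(s)
    return mu

mu_a, mu_b, mu_c = basis_mean(0), basis_mean(1), basis_mean(2)

def squared_distances(mu_i, mu_j):
    """Squared distances between n_pairs independent draws from N(mu_i, I) and N(mu_j, I)."""
    x_i = mu_i + rng.standard_normal((n_pairs, d))
    x_j = mu_j + rng.standard_normal((n_pairs, d))
    return np.sum((x_i - x_j) ** 2, axis=1)

same = squared_distances(mu_a, mu_a)   # both points from the same Gaussian
diff = squared_distances(mu_b, mu_c)   # points from two different Gaussians

mean_same, mean_diff = same.mean(), diff.mean()
frac_same_larger = np.mean(same > diff)

print(f"same: {mean_same:.2f} (theory {2 * d})")
print(f"diff: {mean_diff:.2f} (theory {2 * d + 2 * s})")
print(f"fraction with same-pair distance larger: {frac_same_larger:.2f}")
```

With these settings the gap $2s$ between the two means is small relative to the spread of the squared distances, so a substantial fraction of same-distribution pairs is farther apart than different-distribution pairs.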

It is clear that the relative difference between the two expected squared distances is $O(s/d)$. In fact, we can further show that there is a significant chance for the distance between two data points sampled from the same distribution to be larger than the distance between two data points sampled from different distributions, implying that the added noise can significantly affect the geometrical relationship among the sampled data points. To this end, we write $x_{i}^{+}=\mu_{a}+z^{+}_{i}$ and $x_{j}^{+}=\mu_{a}+z^{+}_{j}$, where $z^{+}_{i},z^{+}_{j}\sim\mathcal{N}(0,I_{d})$. Hence

\[ |x_i^+ - x_j^+|^2 = |\underbrace{z_i^+ - z_j^+}_{:=u_+}|^2 \]

It is clear that $|u_+|^2 \sim 2\chi^2_d$. In the meantime, by writing $x_i^- = \mu_b + z_i^-$ and $x_j^- = \mu_c + z_j^-$, with $z_i^-, z_j^- \sim \mathcal{N}(0, I_d)$, we have

\[ |x_i^- - x_j^-|^2 = 2s + 2\underbrace{\langle \mu_b - \mu_c, z_i^- - z_j^- \rangle}_{:=v_-} + |\underbrace{z_i^- - z_j^-}_{:=u_-}|^2 \]

It is clear that $v_- \sim \mathcal{N}(0, 2s)$ and $|u_-|^2 \sim 2\chi_d^2$. We want to bound the probability

\[ \Pr\left(|u_-|^2 + 2v_- - |u_+|^2 \leq -2s\right) \]

First, using the standard concentration inequality for $\chi^2_d$ distributions, we have

\[ \Pr\left(\frac{1}{2}|u_-|^2 \leq d + 2\sqrt{d\delta} + 2\delta\right) \geq 1 - \exp(-\delta) \]

By setting $\delta = s^2/(16d)$, under the assumption $s^2 \leq 16d$, we have

\[ \Pr\left(|u_-|^2 \leq 2d + s + \frac{s^2}{4d}\right) \geq 1 - \exp\left(-\frac{s^2}{16d}\right) \geq \frac{s^2}{32d} \]
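The substitution can be checked term by term. The last inequality uses the elementary bound $1 - e^{-x} \geq x/2$ on $[0, 1]$, which is a standard fact added here for convenience rather than something stated in the source; it is where the assumption $s^2 \leq 16d$ enters.

\[
2\sqrt{d\delta} = 2\sqrt{\frac{s^2}{16}} = \frac{s}{2}, \qquad 2\delta = \frac{s^2}{8d}, \qquad \text{so} \qquad 2\left(d + 2\sqrt{d\delta} + 2\delta\right) = 2d + s + \frac{s^2}{4d}.
\]

\[
1 - e^{-x} \geq \left(1 - e^{-1}\right)x \geq \frac{x}{2} \quad \text{for } 0 \leq x \leq 1, \qquad \text{applied with } x = \frac{s^2}{16d}.
\]

The first inequality in the second display holds because $1 - e^{-x}$ is concave and therefore lies above its chord between $x = 0$ and $x = 1$.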

Since $v_- \sim \mathcal{N}(0, 2s)$, we have

\[ \Pr\left(v_- \geq \frac{2s}{3}\right) \leq \frac{3}{2\sqrt{2\pi}\,s}\exp\left(-\frac{s}{9}\right) \leq \frac{e^{-s/9}}{\sqrt{2}\,s} \]

and therefore

\[ \Pr\left(|u_-|^2 + 2v_- \leq 2d + \frac{7s}{3} + \frac{s^2}{4d}\right) \geq \frac{s^2}{32d} - \frac{e^{-s/9}}{\sqrt{2}\,s} \]

In the meantime, we can also lower-bound $|u_+|^2$. Using the fact that the CDF of $\chi^2_2$ is $1 - \exp(-x/2)$, we have

\[ \Pr\left(|u_+|^2 \geq 2(1+\varepsilon)d\right) \geq \exp\left(-\frac{\varepsilon d}{2}\right) \]

By choosing $\varepsilon = 3s/(2d)$, we have

\[ \Pr\left(|u_+|^2 \geq 2d + 3s\right) \geq \exp\left(-\frac{3s}{4}\right) \]

Combining the above two inequalities, we have

\[ \Pr\left(|u_-|^2 - |u_+|^2 + 2v_- \leq -\frac{2s}{3} + \frac{s^2}{4d}\right) \geq \exp\left(-\frac{3s}{4}\right)\left(\frac{s^2}{32d} - \frac{e^{-s/9}}{\sqrt{2}\,s}\right) \]

When

\[ s < d \leq \frac{\sqrt{2}\,s^3}{32} \]

we have

\[ \Pr\left(|u_-|^2 - |u_+|^2 + 2v_- \leq -\frac{s}{3}\right) \geq \frac{1}{\sqrt{2}\,s}\exp\left(-\frac{3s}{4}\right)\left(1 - e^{-s/9}\right) \]

implying that there is a decent chance that $|x_i^- - x_j^-| < |x_i^+ - x_j^+|$.
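The effect can also be observed empirically. The following is a minimal Monte Carlo sketch, not the authors' code, that estimates how often a pair drawn from two different distributions ends up closer than a pair drawn from the same distribution. It assumes, purely for illustration, that the means are $\mu_a = \sqrt{s}\,e_1$, $\mu_b = \sqrt{s}\,e_2$, $\mu_c = \sqrt{s}\,e_3$ (orthogonal, with squared norm $s$, so that $|\mu_b - \mu_c|^2 = 2s$); the function names are hypothetical.

```python
import math
import random


def sample_point(mean, rng):
    """Draw x = mean + z with z ~ N(0, I_d)."""
    return [m + rng.gauss(0.0, 1.0) for m in mean]


def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def flip_probability(d, s, trials=20000, seed=0):
    """Estimate Pr(|x_i^- - x_j^-|^2 < |x_i^+ - x_j^+|^2) by simulation.

    Assumption (not from the source): the three means are sqrt(s) times
    the first three standard basis vectors, which requires d >= 3.
    """
    rng = random.Random(seed)
    root = math.sqrt(s)
    means = []
    for k in range(3):
        m = [0.0] * d
        m[k] = root
        means.append(m)
    mu_a, mu_b, mu_c = means

    flips = 0
    for _ in range(trials):
        # Pair from the same distribution N(mu_a, I_d).
        same = sq_dist(sample_point(mu_a, rng), sample_point(mu_a, rng))
        # Pair from two different distributions N(mu_b, I_d), N(mu_c, I_d).
        diff = sq_dist(sample_point(mu_b, rng), sample_point(mu_c, rng))
        if diff < same:
            flips += 1
    return flips / trials
```

Under these assumptions, a call such as `flip_probability(64, 4.0)` should return a value well above zero when $s$ is small relative to $d$, while a large separation such as `flip_probability(16, 100.0)` should return a value close to zero. The simulation estimates the probability itself and is not a check of the specific constants in the bound.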

Now, let us examine the attention-based representation: for any vector $x$, we represent it by $f(x)$ given below

\[ f(x) = \left(\exp\left(\lambda\langle x, x_1\rangle\right), \ldots, \exp\left(\lambda\langle x, x_n\rangle\right)\right) \]

For simplicity, we sample a pair of data points $x_i^+, x_j^+ \sim \mathcal{N}(\mu_a, I_d)$ from the same distribution, and another pair of data points $x_i^- \sim \mathcal{N}(\mu_b, I_d)$ and $x_j^- \sim \mathcal{N}(\mu_c, I_d)$ from different distributions. We first represent each of these four data points by its attention map. We then compute the expected squared distance
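As an illustration of the map $f$, the sketch below evaluates the unnormalized attention scores of a vector against a list of reference points $x_1, \ldots, x_n$. The function name and the plain-list data layout are assumptions made for this example; as in the formula, no softmax normalization is applied.

```python
import math


def attention_representation(x, references, lam):
    """Return f(x) = (exp(lam * <x, x_1>), ..., exp(lam * <x, x_n>)).

    `references` is the list of points x_1, ..., x_n; `lam` is the
    scale parameter lambda in the formula.
    """
    return [
        math.exp(lam * sum(a * b for a, b in zip(x, x_k)))
        for x_k in references
    ]


# Example: a 2-dimensional point against three reference points.
rep = attention_representation([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]], 0.5)
```

The representation has one coordinate per reference point, so its dimension is $n$ regardless of the dimension $d$ of the input.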

\[
\begin{aligned}
\mathrm{E}\left[|f(x_i^+) - f(x_j^+)|^2\right] &= \sum_{k=1}^{n} \mathrm{E}\left[\left|\exp\left(\lambda\langle x_i^+, x_k\rangle\right) - \exp\left(\lambda\langle x_j^+, x_k\rangle\right)\right|^2\right] \\
&= K\sum_{k=1}^{m} \mathrm{E}\left[\left|\exp\left(\lambda\langle x_i^+, x\rangle\right) - \exp\left(\lambda\langle x_j^+, x\rangle\right)\right|^2\right] \\
&= K\sum_{k=1}^{m} \mathrm{E}\left[2\exp\left(2\lambda\langle x_i^+, x\rangle\right) - 2\exp\left(\lambda\langle x, x_i^+ + x_j^+\rangle\right)\right]
\end{aligned}
\]

To compute the above expectation, we first consider $k = a$. Define $z_i^+ = x_i^+ - \mu_a$, $z_j^+ = x_j^+ - \mu_a$, and $z = x - \mu_a$. We have

\[
\begin{aligned}
&\mathrm{E}\left[2\exp\left(2\lambda\langle x_i^+, x\rangle\right) - 2\exp\left(\sqrt{2}\lambda\langle x, x_j^+\rangle\right)\right] \\
&\quad= \mathrm{E}\left[2\exp\left(2\lambda\left(s + \langle z, z_i^+\rangle + \langle\mu_a, z_i^+ + z\rangle\right)\right) - 2\exp\left(\sqrt{2}\lambda\left(\sqrt{2}s + \langle z, z_j^+\rangle + \langle\mu_a, z_j^+ + z\rangle\right)\right)\right]
\end{aligned}
\]

By taking the expectation over $z_i^+$ and $z_j^+$, given that $\langle z_i^+, z + \mu_a\rangle \sim \mathcal{N}(0, |z + \mu_a|^2)$ and $\langle z_j^+, z + \mu_a\rangle \sim \mathcal{N}(0, |z + \mu_a|^2)$, we have

\[ \mathrm{E}_{z_i^+}\left[\exp\left(2\lambda\langle z_i^+, z + \mu_a\rangle\right)\right] = \exp\left(2\lambda^2|z + \mu_a|^2\right), \quad \mathrm{E}_{z_j^+}\left[\exp\left(\sqrt{2}\lambda\langle z_j^+, z + \mu_a\rangle\right)\right] = \exp\left(\lambda^2|z + \mu_a|^2\right) \]

and therefore

\[
\begin{aligned}
&\mathrm{E}\left[2\exp\left(2\lambda\langle x_i^+, x\rangle\right) - 2\exp\left(\sqrt{2}\lambda\langle x, x_j^+\rangle\right)\right] \\
&\quad= \mathrm{E}\left[\exp\left(2\lambda\left(s + \langle\mu_a, z\rangle + \lambda|z + \mu_a|^2\right)\right)\right] - \mathrm{E}\left[\exp\left(\sqrt{2}\lambda\left(s + \langle\mu_a, z\rangle + \frac{\lambda}{\sqrt{2}}|z + \mu_a|^2\right)\right)\right]
\end{aligned}
\]
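The two conditional expectations over $z_i^+$ and $z_j^+$ used above are instances of the Gaussian moment generating function $\mathrm{E}[\exp(tX)] = \exp(t^2\sigma^2/2)$ for $X \sim \mathcal{N}(0, \sigma^2)$, with $\sigma = |z + \mu_a|$ and $t = 2\lambda$ or $t = \sqrt{2}\lambda$. The sketch below checks this identity by numerical integration; the function name, the integration range, and the example values of $\lambda$ and $\sigma$ are arbitrary choices made for illustration.

```python
import math


def gaussian_mgf_numeric(t, sigma, half_width=12.0, steps=20001):
    """Approximate E[exp(t * X)] for X ~ N(0, sigma^2).

    Uses the trapezoid rule on [-half_width * sigma, half_width * sigma].
    """
    lo = -half_width * sigma
    hi = half_width * sigma
    h = (hi - lo) / (steps - 1)
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(steps):
        x = lo + i * h
        weight = 0.5 if i in (0, steps - 1) else 1.0
        density = math.exp(-x * x / (2.0 * sigma * sigma)) / norm
        total += weight * math.exp(t * x) * density
    return total * h


lam, sigma = 0.2, 1.5  # illustrative values only
lhs_i = gaussian_mgf_numeric(2.0 * lam, sigma)             # t = 2 * lambda
rhs_i = math.exp(2.0 * lam ** 2 * sigma ** 2)              # exp(2 lambda^2 sigma^2)
lhs_j = gaussian_mgf_numeric(math.sqrt(2.0) * lam, sigma)  # t = sqrt(2) * lambda
rhs_j = math.exp(lam ** 2 * sigma ** 2)                    # exp(lambda^2 sigma^2)
```

In both cases the numerical value and the closed form should agree to within the integration error.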

Since

\[
\begin{aligned}
&\mathrm{E}\left[\exp\left(2\lambda^2|z + \mu_a|^2 + 2\lambda\langle\mu_a, z\rangle\right)\right] \\
&\quad= \frac{e^{2\lambda^2 s}}{(2\pi)^{d/2}}\int\exp\left(2\lambda^2|z|^2 + 2\lambda(2\lambda + 1)\langle\mu_a, z\rangle - \frac{|z|^2}{2}\right)dz \\
&\quad= \frac{e^{2\lambda^2 s}}{(2\pi)^{(d-1)/2}}\int\exp\left(-(1 - 4\lambda^2)\frac{|z_{d-1}|^2}{2}\right)dz_{d-1} \times \frac{1}{\sqrt{2\pi}}\int\exp\left(-(1 - 4\lambda^2)\frac{z^2}{2} + 2\lambda(2\lambda + 1)s^{1/2}z\right)dz \\
&\quad= \frac{e^{2\lambda^2 s}}{(1 - 4\lambda^2)^{d/2}}\exp\left(\frac{2\lambda^2(1 + 2\lambda)^2 s}{1 - 4\lambda^2}\right) = \frac{e^{2\lambda^2 s}}{(1 - 4\lambda^2)^{d/2}}\exp\left(\frac{2\lambda^2(1 + 2\lambda)s}{1 - 2\lambda}\right) := C_1
\end{aligned}
\]

and

\[
\begin{aligned}
&\mathrm{E}\left[\exp\left(\sqrt{2}\lambda\langle\mu_a, z\rangle + \lambda^2|z + \mu_a|^2\right)\right] \\
&\quad= \frac{e^{\lambda^2 s}}{(2\pi)^{d/2}}\int\exp\left(-(1 - 2\lambda^2)\frac{|z|^2}{2} + \lambda(\sqrt{2} + 2\lambda)\langle z, \mu_a\rangle\right)dz \\
&\quad= \frac{e^{\lambda^2 s}}{(1 - 2\lambda^2)^{d/2}}\exp\left(\frac{\lambda^2(1 + \sqrt{2}\lambda)^2 s}{2(1 - 2\lambda^2)}\right) = \frac{e^{\lambda^2 s}}{(1 - 2\lambda^2)^{d/2}}\exp\left(\frac{\lambda^2(1 + \sqrt{2}\lambda)s}{2(1 - \sqrt{2}\lambda)}\right) := C_2
\end{aligned}
\]

we have

\[
\begin{aligned}
&\mathrm{E}\left[2\exp\left(2\lambda\langle x_i^+, x\rangle\right) - 2\exp\left(\lambda\langle x, x_i^+ + x_j^+\rangle\right)\right] \\
&\quad= \frac{2e^{2\lambda(1 + \lambda)s}}{(1 - 4\lambda^2)^{d/2}}\exp\left(\frac{2\lambda^2(1 + 2\lambda)s}{1 - 2\lambda}\right) - \frac{2e^{\lambda(2 + \lambda)s}}{(1 - 2\lambda^2)^{d/2}}\exp\left(\frac{\lambda^2(1 + \sqrt{2}\lambda)s}{2(1 - \sqrt{2}\lambda)}\right)
\end{aligned}
\]
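The closed form $C_1$ rests on the one-dimensional Gaussian integral obtained by completing the square. As a reminder (a standard identity, stated here for convenience rather than taken from the source), for $a > 0$:

\[
\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\left(-\frac{a z^2}{2} + b z\right)dz = \frac{1}{\sqrt{a}}\exp\left(\frac{b^2}{2a}\right).
\]

With $a = 1 - 4\lambda^2$ and $b = 2\lambda(2\lambda + 1)s^{1/2}$, this gives

\[
\frac{b^2}{2a} = \frac{2\lambda^2(1 + 2\lambda)^2 s}{1 - 4\lambda^2} = \frac{2\lambda^2(1 + 2\lambda)s}{1 - 2\lambda},
\]

and each of the remaining $d - 1$ coordinates contributes a factor $(1 - 4\lambda^2)^{-1/2}$, which together with the $a^{-1/2}$ above yields the prefactor $(1 - 4\lambda^2)^{-d/2}$. The integrals converge only when $4\lambda^2 < 1$, that is, $|\lambda| < 1/2$, so the expressions should be read under that restriction on $\lambda$.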

For $k \neq a$, we have

$$
\begin{aligned}
&\mathrm{E}\left[2\lambda\exp\left(2\lambda\langle x_i^+, x\rangle\right) - 2\exp\left(\sqrt{2}\langle x, x_j^+\rangle\right)\right] \\
&\quad= \frac{2e^{2\lambda^2 s}}{(1-4\lambda^2)^{d/2}}\exp\left(\frac{2\lambda^2(1+2\lambda)s}{1-2\lambda}\right) - \frac{2e^{\lambda^2 s}}{(1-2\lambda^2)^{d/2}}\exp\left(\frac{\lambda^2(1+\sqrt{2}\lambda)s}{2(1-\sqrt{2}\lambda)}\right)
\end{aligned}
$$

By combining the above results, we have

$$
\mathrm{E}\left[\left|f(x_i^+) - f(x_j^+)\right|^2\right] = 2K\left(e^{2\lambda s} + m - 1\right)\left(C_1 - C_2\right)
$$

We then compute the distance between $f(x_i^-)$ and $f(x_j^-)$, which is given by

$$
\mathrm{E}\left[\left|f(x_i^-) - f(x_j^-)\right|^2\right] = K\sum_{k=1}^{m}\mathrm{E}\left[\exp\left(2\lambda\langle x_i^-, x\rangle\right) + \exp\left(2\lambda\langle x_j^-, x\rangle\right) - 2\exp\left(\lambda\langle x, x_i^- + x_j^-\rangle\right)\right]
$$
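The expansion above is the identity $(a-b)^2 = a^2 + b^2 - 2ab$ applied coordinate by coordinate. The following minimal sketch checks it numerically, assuming the feature map has the form $f_k(x) = \sqrt{K}\exp(\lambda\langle x, x_k\rangle)$ for landmarks $x_1,\dots,x_m$. This form is inferred from the expansion itself; the function names, the Gaussian test points, and the values of `d`, `m`, and `K` are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def inner(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def feature_map(x, landmarks, lam, K=1.0):
    """Assumed feature map: f_k(x) = sqrt(K) * exp(lam * <x, x_k>)."""
    return [math.sqrt(K) * math.exp(lam * inner(x, xk)) for xk in landmarks]


def squared_distance_direct(xi, xj, landmarks, lam, K=1.0):
    """|f(xi) - f(xj)|^2 computed directly from the feature vectors."""
    fi = feature_map(xi, landmarks, lam, K)
    fj = feature_map(xj, landmarks, lam, K)
    return sum((a - b) ** 2 for a, b in zip(fi, fj))


def squared_distance_expanded(xi, xj, landmarks, lam, K=1.0):
    """The same quantity written as the sum over landmarks used in the proof."""
    xsum = [a + b for a, b in zip(xi, xj)]
    total = 0.0
    for xk in landmarks:
        total += (math.exp(2 * lam * inner(xi, xk))
                  + math.exp(2 * lam * inner(xj, xk))
                  - 2 * math.exp(lam * inner(xk, xsum)))
    return K * total


rng = random.Random(0)          # fixed seed so the run is repeatable
d, m = 8, 5                     # illustrative dimension and landmark count
lam = 1 / math.sqrt(d)          # lambda^2 = 1/d, as in the analysis
landmarks = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
xi = [rng.gauss(0, 1) for _ in range(d)]
xj = [rng.gauss(0, 1) for _ in range(d)]

direct = squared_distance_direct(xi, xj, landmarks, lam)
expanded = squared_distance_expanded(xi, xj, landmarks, lam)
print(direct, expanded)
```

The two printed values should agree up to floating-point rounding; the proof then takes the expectation of each summand separately.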

First, for the case $k = b$, we have

$$
\begin{aligned}
&\mathrm{E}\left[\exp\left(2\lambda\langle x_i^-, x\rangle\right) + \exp\left(2\lambda\langle x_j^-, x\rangle\right) - 2\exp\left(\lambda\langle x, x_i^- + x_j^-\rangle\right)\right] \\
&= \mathrm{E}\left[\exp\left(2\lambda s + 2\lambda\langle\mu_b, z + z_i^-\rangle + 2\lambda\langle z, z_i^-\rangle\right) + \exp\left(2\lambda\langle\mu_b, z\rangle + 2\lambda\langle\mu_c, z_i^-\rangle + 2\lambda\langle z, z_i^-\rangle\right)\right] \\
&\quad - 2\mathrm{E}\left[\exp\left(\lambda s + \lambda\langle\mu_b, z_i^- + z_j^-\rangle + \lambda\langle\mu_b + \mu_c, z_j^-\rangle + \lambda\langle z, z_i^- + z_j^-\rangle\right)\right] \\
&= \mathrm{E}\left[\exp\left(2\lambda s + 2\lambda^2|z + \mu_b|^2 + 2\lambda\langle\mu_b, z\rangle\right)\right] + \mathrm{E}\left[\exp\left(2\lambda\langle\mu_b, z\rangle + 2\lambda^2|\mu_c + z|^2\right)\right] \\
&\quad - 2\mathrm{E}\left[\exp\left(\lambda s + \lambda\langle\mu_b + \mu_c, z_j^-\rangle + \frac{\lambda^2|z_i^- + z_j^-|^2}{2}\right)\right] \\
&= e^{2\lambda s}C_1 + \mathrm{E}\left[\exp\left(2\lambda^2 s + 2\lambda\langle\mu_b + 2\lambda\mu_c, z\rangle + 2\lambda^2|z|^2\right)\right] - 2\mathrm{E}\left[\exp\left(\lambda s + \lambda\langle\mu_b + \mu_c, z_j^-\rangle + \lambda^2|z_i^-|^2\right)\right] \\
&= e^{2\lambda s}C_1 + \frac{e^{2\lambda^2 s}}{(1-4\lambda^2)^{d/2}}\exp\left(\frac{2\lambda^2(1+4\lambda^2)s}{1-4\lambda^2}\right) - \frac{2e^{\lambda s + \lambda^2 s}}{(1-2\lambda^2)^{d/2}} \\
&\geq \left(e^{2\lambda s} + e^{-8\lambda^3}\right)C_1 - 2\exp\left(-\frac{\lambda^2 s}{2}\right)C_2
\end{aligned}
$$
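The expectations over the noise terms in this chain reduce to two standard Gaussian identities. Assuming each noise vector is standard normal, $z \sim \mathcal{N}(0, I_d)$ (an assumption stated here because the noise model is defined outside this part of the proof), for any fixed vector $v$ and any $t < 1/2$:

$$
\begin{aligned}
\mathrm{E}\left[\exp\left(\langle v, z\rangle\right)\right] &= \exp\left(\frac{|v|^2}{2}\right), \\
\mathrm{E}\left[\exp\left(t|z|^2 + \langle v, z\rangle\right)\right] &= (1-2t)^{-d/2}\exp\left(\frac{|v|^2}{2(1-2t)}\right).
\end{aligned}
$$

Taking $t = 2\lambda^2$ in the second identity gives the factor $(1-4\lambda^2)^{-d/2}$, and $t = \lambda^2$ gives $(1-2\lambda^2)^{-d/2}$, which are the two denominators that appear throughout the derivation.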

Second, for $k \neq b$ and $k \neq c$, we have

$$
\begin{aligned}
&\mathrm{E}\left[\exp\left(2\lambda\langle x_i^-, x\rangle\right) + \exp\left(2\lambda\langle x_j^-, x\rangle\right) - 2\exp\left(\lambda\langle x, x_i^- + x_j^-\rangle\right)\right] \\
&= 2\mathrm{E}\left[\exp\left(2\lambda\langle\mu_b, z\rangle + 2\lambda\langle\mu_c, z_i^-\rangle + 2\lambda\langle z, z_i^-\rangle\right)\right] \\
&\quad - 2\mathrm{E}\left[\exp\left(\lambda\langle\mu_k, z_i^- + z_j^-\rangle + \lambda\langle\mu_b + \mu_c, z_j^-\rangle + \lambda\langle z, z_i^- + z_j^-\rangle\right)\right] \\
&= 2\mathrm{E}\left[\exp\left(2\lambda\langle\mu_b, z\rangle + 2\lambda^2|\mu_c + z|^2\right)\right] - 2\mathrm{E}\left[\exp\left(\sqrt{2}\lambda\langle\mu_k + z, z_i^-\rangle + \lambda\langle\mu_b + \mu_c, z_j^-\rangle\right)\right] \\
&= 2\mathrm{E}\left[\exp\left(2\lambda^2 s + 2\lambda\langle\mu_b + 2\lambda\mu_c, z\rangle + 2\lambda^2|z|^2\right)\right] - 2\mathrm{E}\left[\exp\left(\lambda^2 s + \lambda^2|\mu_k + z|^2\right)\right] \\
&= \frac{2e^{2\lambda^2 s}}{(1-4\lambda^2)^{d/2}}\exp\left(\frac{2\lambda^2(1+4\lambda^2)s}{1-4\lambda^2}\right) - \frac{2e^{\lambda^2 s}}{(1-2\lambda^2)^{d/2}} \\
&\geq 2e^{-8\lambda^3}C_1 - 2\exp\left(-\frac{\lambda^2 s}{2}\right)C_2
\end{aligned}
$$

Thus, we have

$$
\mathrm{E}\left[\left|f(x_i^-) - f(x_j^-)\right|^2\right] \geq K\left(2\left(e^{2\lambda s} + me^{-8\lambda^3}\right)C_1 - m\exp\left(-\frac{\lambda^2 s}{2}\right)C_2\right) \geq 2Ke^{2\lambda s}C_1
$$

and hence

$$
\frac{\mathrm{E}\left[|f(x_i^-) - f(x_j^-)|^2\right] - \mathrm{E}\left[|f(x_i^+) - f(x_j^+)|^2\right]}{\mathrm{E}\left[|f(x_i^+) - f(x_j^+)|^2\right]} \geq \frac{C_1/(C_1 - C_2) - \left(1 + (m-1)e^{-2\lambda s}\right)}{1 + (m-1)e^{-2\lambda s}}
$$

Since

$$
\frac{C_1}{C_1 - C_2} = \frac{1}{1 - \underbrace{e^{-\lambda^2 s}\left(\frac{1-4\lambda^2}{1-2\lambda^2}\right)^{d/2}\exp\left(\lambda^2 s\left[\frac{1+\sqrt{2}\lambda}{2(1-\sqrt{2}\lambda)} - \frac{2(1+2\lambda)}{1-2\lambda}\right]\right)}_{:=\Gamma}}
$$

It is easy to verify that when $\lambda^2 = 1/d$, $\Gamma = \Omega(1)$. By further assuming that $s$ is sufficiently large that $e^{-2\lambda s} \leq \Gamma/(2(m-1))$, we have

$$
\frac{\mathrm{E}\left[|f(x_i^-) - f(x_j^-)|^2\right] - \mathrm{E}\left[|f(x_i^+) - f(x_j^+)|^2\right]}{\mathrm{E}\left[|f(x_i^+) - f(x_j^+)|^2\right]} \geq \frac{\Gamma}{2+\Gamma} \geq \frac{\Gamma}{3}
$$
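The last step can be spelled out as follows, reading the threshold in the assumption on $s$ as $\Gamma/(2(m-1))$ and writing $\epsilon := (m-1)e^{-2\lambda s}$. The symbol $\epsilon$ is shorthand introduced only for this remark. Since $C_1/(C_1 - C_2) = 1/(1-\Gamma)$ with $0 < \Gamma < 1$, and the assumption gives $\epsilon \leq \Gamma/2$:

$$
\begin{aligned}
\frac{C_1}{C_1 - C_2} - 1 &= \frac{\Gamma}{1-\Gamma} \geq \Gamma, \\
\frac{C_1/(C_1 - C_2) - (1+\epsilon)}{1+\epsilon} &\geq \frac{\Gamma - \epsilon}{1+\epsilon} \geq \frac{\Gamma - \Gamma/2}{1 + \Gamma/2} = \frac{\Gamma}{2+\Gamma} \geq \frac{\Gamma}{3}.
\end{aligned}
$$

The final inequality uses $\Gamma \leq 1$, so that $2 + \Gamma \leq 3$.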

This analysis shows that, by using attention as the representation, we can easily distinguish whether two data points come from the same distribution even under very large noise, which is difficult when the original inputs are used.
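To get a feel for the size of the guaranteed gap, the closed-form expression for $\Gamma$ and the resulting lower bound can be evaluated numerically. The sketch below does this with $\lambda^2 = 1/d$. The function names and the values of `d`, `s`, and `m` are illustrative assumptions chosen so that the condition on $s$ holds; they are not settings reported in the paper.

```python
import math


def gamma_term(d, s):
    """Evaluate Gamma from the appendix at lambda^2 = 1/d (needs d > 4)."""
    lam = 1.0 / math.sqrt(d)
    ratio = ((1 - 4 * lam**2) / (1 - 2 * lam**2)) ** (d / 2)
    r2 = math.sqrt(2) * lam
    bracket = (1 + r2) / (2 * (1 - r2)) - 2 * (1 + 2 * lam) / (1 - 2 * lam)
    return math.exp(-lam**2 * s) * ratio * math.exp(lam**2 * s * bracket)


def relative_gap_bound(d, s, m):
    """Return (Gamma, eps, bound), where eps = (m - 1) * exp(-2 * lam * s)
    and bound is the lower bound on the relative distance gap."""
    lam = 1.0 / math.sqrt(d)
    g = gamma_term(d, s)
    eps = (m - 1) * math.exp(-2 * lam * s)
    bound = (1 / (1 - g) - (1 + eps)) / (1 + eps)
    return g, eps, bound


# Illustrative values only (not taken from the paper).
d, s, m = 64, 32.0, 8
g, eps, bound = relative_gap_bound(d, s, m)
print(f"Gamma = {g:.4f}, eps = {eps:.5f}")
print(f"bound = {bound:.4f}, Gamma/(2+Gamma) = {g / (2 + g):.4f}, Gamma/3 = {g / 3:.4f}")
```

For these values the condition $\epsilon \leq \Gamma/2$ holds, so the computed bound sits above $\Gamma/(2+\Gamma)$, which in turn sits above $\Gamma/3$. Because $\Gamma$ decays exponentially in $s/d$, the guarantee is only meaningful when $s$ grows no faster than $d$.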

Appendix B Detailed Results

B.1 Detailed Results of Multivariate Forecasting

Table 5: Multivariate forecasting results. The lookback length is set to 96. All models are evaluated on 4 prediction horizons {96, 192, 336, 720}. A lower MSE indicates better performance. The best results are in bold and the second-best are underlined.
| Dataset | Horizon | AttnEmbed MSE | AttnEmbed MAE | PatchTST MSE | PatchTST MAE | TimesNet MSE | TimesNet MAE | DLinear MSE | DLinear MAE | FiLM MSE | FiLM MAE | FEDformer MSE | FEDformer MAE | Autoformer MSE | Autoformer MAE | Informer MSE | Informer MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weather | 96 | 0.171 | 0.215 | 0.178 | 0.219 | 0.172 | 0.220 | 0.196 | 0.255 | 0.193 | 0.234 | 0.217 | 0.296 | 0.266 | 0.336 | 0.300 | 0.384 |
| Weather | 192 | 0.218 | 0.257 | 0.224 | 0.259 | 0.219 | 0.261 | 0.237 | 0.296 | 0.236 | 0.269 | 0.276 | 0.336 | 0.307 | 0.367 | 0.598 | 0.544 |
| Weather | 336 | 0.274 | 0.297 | 0.278 | 0.298 | 0.280 | 0.306 | 0.283 | 0.335 | 0.288 | 0.304 | 0.339 | 0.380 | 0.359 | 0.395 | 0.578 | 0.523 |
| Weather | 720 | 0.348 | 0.346 | 0.350 | 0.346 | 0.365 | 0.359 | 0.345 | 0.381 | 0.358 | 0.350 | 0.403 | 0.428 | 0.419 | 0.428 | 1.059 | 0.741 |
| Weather | Avg | 0.252 | 0.278 | 0.257 | 0.280 | 0.259 | 0.287 | 0.265 | 0.317 | 0.269 | 0.339 | 0.309 | 0.360 | 0.338 | 0.382 | 0.634 | 0.548 |
| ETTh1 | 96 | 0.367 | 0.398 | 0.393 | 0.408 | 0.384 | 0.402 | 0.386 | 0.400 | 0.388 | 0.401 | 0.376 | 0.419 | 0.449 | 0.459 | 0.865 | 0.713 |
| ETTh1 | 192 | 0.420 | 0.428 | 0.445 | 0.434 | 0.436 | 0.429 | 0.437 | 0.432 | 0.443 | 0.439 | 0.420 | 0.448 | 0.500 | 0.482 | 1.008 | 0.792 |
| ETTh1 | 336 | 0.448 | 0.438 | 0.484 | 0.451 | 0.491 | 0.469 | 0.481 | 0.459 | 0.488 | 0.466 | 0.459 | 0.465 | 0.521 | 0.496 | 1.107 | 0.809 |
| ETTh1 | 720 | 0.454 | 0.459 | 0.480 | 0.471 | 0.521 | 0.500 | 0.519 | 0.516 | 0.525 | 0.519 | 0.506 | 0.507 | 0.514 | 0.512 | 1.181 | 0.865 |
| ETTh1 | Avg | 0.422 | 0.430 | 0.450 | 0.440 | 0.458 | 0.450 | 0.456 | 0.452 | 0.461 | 0.456 | 0.440 | 0.460 | 0.496 | 0.487 | 1.040 | 0.795 |
| ETTh2 | 96 | 0.296 | 0.346 | 0.294 | 0.343 | 0.340 | 0.374 | 0.333 | 0.387 | 0.296 | 0.344 | 0.358 | 0.397 | 0.346 | 0.388 | 3.755 | 1.525 |
| ETTh2 | 192 | 0.369 | 0.392 | 0.377 | 0.393 | 0.402 | 0.414 | 0.477 | 0.476 | 0.389 | 0.402 | 0.429 | 0.439 | 0.456 | 0.452 | 5.602 | 1.931 |
| ETTh2 | 336 | 0.376 | 0.405 | 0.381 | 0.409 | 0.452 | 0.452 | 0.594 | 0.541 | 0.418 | 0.430 | 0.496 | 0.487 | 0.482 | 0.486 | 4.721 | 1.835 |
| ETTh2 | 720 | 0.405 | 0.432 | 0.412 | 0.471 | 0.462 | 0.468 | 0.831 | 0.657 | 0.433 | 0.448 | 0.463 | 0.474 | 0.515 | 0.511 | 3.647 | 1.625 |
| ETTh2 | Avg | 0.361 | 0.393 | 0.366 | 0.404 | 0.414 | 0.427 | 0.559 | 0.515 | 0.384 | 0.406 | 0.437 | 0.449 | 0.450 | 0.459 | 4.431 | 1.729 |
| ETTm1 | 96 | 0.317 | 0.356 | 0.321 | 0.360 | 0.338 | 0.375 | 0.345 | 0.372 | 0.348 | 0.367 | 0.379 | 0.416 | 0.505 | 0.475 | 0.672 | 0.571 |
| ETTm1 | 192 | 0.357 | 0.381 | 0.362 | 0.384 | 0.371 | 0.387 | 0.380 | 0.389 | 0.387 | 0.385 | 0.426 | 0.441 | 0.553 | 0.496 | 0.795 | 0.669 |
| ETTm1 | 336 | 0.387 | 0.404 | 0.392 | 0.402 | 0.410 | 0.411 | 0.413 | 0.413 | 0.418 | 0.405 | 0.445 | 0.459 | 0.621 | 0.537 | 1.212 | 0.871 |
| ETTm1 | 720 | 0.448 | 0.439 | 0.450 | 0.435 | 0.478 | 0.450 | 0.474 | 0.453 | 0.479 | 0.440 | 0.543 | 0.490 | 0.671 | 0.561 | 1.166 | 0.823 |
| ETTm1 | Avg | 0.377 | 0.395 | 0.381 | 0.395 | 0.400 | 0.406 | 0.403 | 0.407 | 0.408 | 0.399 | 0.448 | 0.452 | 0.588 | 0.517 | 0.961 | 0.734 |
| ETTm2 | 96 | 0.181 | 0.265 | 0.178 | 0.260 | 0.187 | 0.267 | 0.193 | 0.292 | 0.183 | 0.266 | 0.203 | 0.287 | 0.255 | 0.339 | 0.365 | 0.453 |
| ETTm2 | 192 | 0.245 | 0.304 | 0.249 | 0.307 | 0.249 | 0.309 | 0.284 | 0.362 | 0.247 | 0.305 | 0.269 | 0.328 | 0.281 | 0.340 | 0.533 | 0.563 |
| ETTm2 | 336 | 0.309 | 0.349 | 0.313 | 0.346 | 0.321 | 0.351 | 0.369 | 0.427 | 0.309 | 0.343 | 0.325 | 0.366 | 0.339 | 0.372 | 1.363 | 0.887 |
| ETTm2 | 720 | 0.409 | 0.407 | 0.400 | 0.398 | 0.408 | 0.403 | 0.554 | 0.522 | 0.407 | 0.398 | 0.421 | 0.415 | 0.433 | 0.432 | 3.379 | 1.338 |
| ETTm2 | Avg | 0.286 | 0.331 | 0.285 | 0.327 | 0.291 | 0.333 | 0.350 | 0.401 | 0.287 | 0.328 | 0.305 | 0.349 | 0.327 | 0.371 | 1.410 | 0.810 |
| ECL | 96 | 0.166 | 0.252 | 0.174 | 0.259 | 0.168 | 0.272 | 0.197 | 0.282 | 0.198 | 0.276 | 0.193 | 0.308 | 0.201 | 0.317 | 0.274 | 0.368 |
| ECL | 192 | 0.172 | 0.259 | 0.178 | 0.265 | 0.184 | 0.289 | 0.196 | 0.285 | 0.198 | 0.279 | 0.201 | 0.315 | 0.222 | 0.334 | 0.296 | 0.386 |
| ECL | 336 | 0.191 | 0.277 | 0.196 | 0.282 | 0.198 | 0.300 | 0.209 | 0.301 | 0.217 | 0.301 | 0.214 | 0.329 | 0.254 | 0.361 | 0.300 | 0.394 |
| ECL | 720 | 0.231 | 0.309 | 0.237 | 0.316 | 0.220 | 0.320 | 0.245 | 0.333 | 0.279 | 0.357 | 0.246 | 0.355 | 0.254 | 0.361 | 0.373 | 0.439 |
| ECL | Avg | 0.189 | 0.274 | 0.196 | 0.280 | 0.192 | 0.295 | 0.212 | 0.300 | 0.223 | 0.303 | 0.214 | 0.327 | 0.227 | 0.338 | 0.311 | 0.397 |
| Traffic | 96 | 0.428 | 0.276 | 0.477 | 0.305 | 0.593 | 0.321 | 0.650 | 0.396 | 0.649 | 0.391 | 0.587 | 0.366 | 0.613 | 0.388 | 0.274 | 0.368 |
| Traffic | 192 | 0.434 | 0.274 | 0.471 | 0.299 | 0.617 | 0.336 | 0.598 | 0.370 | 0.603 | 0.366 | 0.604 | 0.373 | 0.616 | 0.382 | 0.296 | 0.386 |
| Traffic | 336 | 0.448 | 0.282 | 0.485 | 0.305 | 0.629 | 0.336 | 0.605 | 0.373 | 0.613 | 0.371 | 0.621 | 0.383 | 0.622 | 0.337 | 0.300 | 0.394 |
| Traffic | 720 | 0.478 | 0.299 | 0.518 | 0.325 | 0.640 | 0.350 | 0.645 | 0.394 | 0.692 | 0.427 | 0.626 | 0.382 | 0.660 | 0.408 | 0.373 | 0.439 |
| Traffic | Avg | 0.447 | 0.282 | 0.487 | 0.308 | 0.620 | 0.336 | 0.625 | 0.383 | 0.639 | 0.389 | 0.610 | 0.376 | 0.628 | 0.379 | 0.311 | 0.397 |
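The Avg rows of Table 5 can be turned into a single relative-improvement figure. The sketch below compares AttnEmbed with PatchTST using the averaged MSE values copied from the table. The aggregation rule (averaging MSE over the seven datasets before taking the ratio) is an assumption made here for illustration; this appendix does not state how the headline reduction was computed.

```python
# Avg MSE per dataset, copied from the "Avg" rows of Table 5:
# (AttnEmbed, PatchTST)
avg_mse = {
    "Weather": (0.252, 0.257),
    "ETTh1": (0.422, 0.450),
    "ETTh2": (0.361, 0.366),
    "ETTm1": (0.377, 0.381),
    "ETTm2": (0.286, 0.285),
    "ECL": (0.189, 0.196),
    "Traffic": (0.447, 0.487),
}

# Relative MSE reduction per dataset (positive means AttnEmbed is better).
per_dataset = {
    name: (baseline - ours) / baseline
    for name, (ours, baseline) in avg_mse.items()
}

# Assumed aggregation: compare MSE averaged over all datasets.
mean_ours = sum(ours for ours, _ in avg_mse.values()) / len(avg_mse)
mean_baseline = sum(base for _, base in avg_mse.values()) / len(avg_mse)
overall = (mean_baseline - mean_ours) / mean_baseline

for name, reduction in per_dataset.items():
    print(f"{name:8s} {100 * reduction:+.2f}%")
print(f"overall  {100 * overall:+.2f}%")
```

Under this aggregation the overall reduction comes out at roughly 3.6%, in line with the figure quoted in the abstract. The per-dataset numbers vary widely, and ETTm2 is the one dataset where PatchTST has the slightly lower averaged MSE.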

B.2 Detailed Results of Multivariate Forecasting with RBF Kernel and Polynomial Kernel

Table 6: Multivariate forecasting results with RBF kernel and polynomial kernel. The lookback length is set to 96. All models are evaluated on 4 prediction horizons {96, 192, 336, 720}. A lower MSE indicates better performance. The best results are in bold and the second-best are underlined.
Methods AttnEmbed RBF Kernel Polynomial Kernel PatchTST TimesNet
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Weather 96 0.171 0.215 0.175 0.220 0.174 0.216 0.178 0.219 0.172 0.220
192 0.218 0.257 0.222 0.258 0.221 0.257 0.224 0.259 0.219 0.261
336 0.274 0.297 0.276 0.297 0.272 0.296 0.278 0.298 0.280 0.306
720 0.348 0.346 0.347 0.346 0.350 0.346 0.350 0.346 0.365 0.359
Avg 0.252 0.278 0.255 0.280 0.254 0.279 0.257 0.280 0.259 0.287
ETTh1 96 0.367 0.398 0.374 0.398 0.380 0.400 0.393 0.408 0.384 0.402
192 0.420 0.428 0.441 0.436 0.437 0.431 0.445 0.434 0.436 0.429
336 0.448 0.438 0.475 0.452 0.457 0.442 0.484 0.451 0.491 0.469
720 0.454 0.459 0.491 0.472 0.450 0.453 0.480 0.471 0.521 0.500
Avg 0.422 0.430 0.445 0.439 0.431 0.431 0.450 0.440 0.458 0.450
ETTm1 96 0.317 0.356 0.316 0.356 0.318 0.355 0.321 0.360 0.338 0.375
192 0.357 0.381 0.358 0.380 0.354 0.377 0.362 0.384 0.371 0.387
336 0.387 0.404 0.389 0.403 0.391 0.403 0.392 0.402 0.410 0.411
720 0.448 0.439 0.450 0.435 0.453 0.437 0.450 0.435 0.478 0.450
Avg 0.377 0.395 0.378 0.393 0.379 0.393 0.381 0.395 0.400 0.406
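
For readers unfamiliar with the two baselines in Table 6, the following is a minimal sketch of the textbook RBF and polynomial kernels evaluated between sliding windows of a series. It is not the authors' implementation: the window construction, the toy series, and the names `rbf_kernel`, `polynomial_kernel`, `sliding_windows`, `gram_matrix`, `gamma`, `degree`, and `coef0` are all assumptions chosen for illustration.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Textbook RBF kernel: exp(-gamma * ||x - y||^2). gamma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Textbook polynomial kernel: (<x, y> + coef0) ** degree. Parameters are assumed."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def sliding_windows(series, width):
    """Hypothetical helper: all contiguous windows of the given width."""
    return [series[i:i + width] for i in range(len(series) - width + 1)]


def gram_matrix(points, kernel):
    """Pairwise kernel values between all points (a kernel map)."""
    return [[kernel(p, q) for q in points] for p in points]


# Toy series, not taken from any benchmark dataset.
series = [0.0, 0.1, 0.3, 0.2, 0.5, 0.4]
windows = sliding_windows(series, width=3)

rbf_map = gram_matrix(windows, rbf_kernel)
poly_map = gram_matrix(windows, polynomial_kernel)
```

Both maps are symmetric by construction, and the RBF map has ones on its diagonal because each window has zero distance to itself.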

B.3 Detailed Results of Utilizing AttnEmbed as a Plug-in

Table 7: Utilizing AttnEmbed as a plug-in. The lookback length is 336 for PatchTST(42) and 96 for CARD. All models are evaluated on 4 different prediction horizons {96, 192, 336, 720}. A lower MSE indicates better performance.
Methods PatchTST(42) PatchTST(42)+AttnEmbed CARD CARD+AttnEmbed
Metric MSE MAE MSE MAE MSE MAE MSE MAE
Weather 96 0.152 0.199 0.151 0.199 0.150 0.188 0.152 0.190
192 0.197 0.243 0.195 0.240 0.202 0.238 0.200 0.237
336 0.249 0.283 0.246 0.282 0.260 0.282 0.259 0.282
720 0.320 0.335 0.320 0.331 0.343 0.353 0.341 0.334
Avg 0.229 0.265 0.227 0.263 0.239 0.265 0.238 0.260
ETTh1 96 0.375 0.399 0.374 0.397 0.383 0.391 0.379 0.390
192 0.414 0.421 0.412 0.423 0.435 0.420 0.428 0.421
336 0.431 0.436 0.420 0.432 0.479 0.442 0.472 0.440
720 0.449 0.466 0.430 0.455 0.471 0.461 0.469 0.459
Avg 0.417 0.430 0.409 0.426 0.442 0.428 0.436 0.427
ETTm1 96 0.290 0.342 0.286 0.341 0.319 0.347 0.316 0.344
192 0.332 0.369 0.331 0.369 0.363 0.370 0.356 0.365
336 0.366 0.392 0.363 0.390 0.392 0.390 0.386 0.386
720 0.420 0.424 0.410 0.416 0.458 0.425 0.450 0.422
Avg 0.352 0.381 0.347 0.379 0.383 0.384 0.377 0.379
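
To make the plug-in effect in Table 7 easier to read, the sketch below converts the Avg MSE rows into relative reductions. The values are copied from the table; the helper name `relative_reduction` and the dictionary layout are illustrative assumptions, and the baseline column is taken to be the one without "+AttnEmbed".

```python
# Avg MSE copied from Table 7 as (baseline, baseline + AttnEmbed).
avg_mse = {
    ("PatchTST(42)", "Weather"): (0.229, 0.227),
    ("PatchTST(42)", "ETTh1"): (0.417, 0.409),
    ("PatchTST(42)", "ETTm1"): (0.352, 0.347),
    ("CARD", "Weather"): (0.239, 0.238),
    ("CARD", "ETTh1"): (0.442, 0.436),
    ("CARD", "ETTm1"): (0.383, 0.377),
}


def relative_reduction(baseline, plugged):
    """Percentage by which the plug-in variant lowers MSE relative to the baseline."""
    return 100.0 * (baseline - plugged) / baseline


reductions = {key: relative_reduction(*pair) for key, pair in avg_mse.items()}

for (model, dataset), pct in reductions.items():
    print(f"{model:13s} {dataset:8s} {pct:5.2f}%")
```

With these Avg values, the reduction is positive for all six model and dataset pairs and stays below 2% in each case.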

B.4 Detailed Results of Ablation on EMA and Landmark

Table 8: Ablation on EMA and landmark. The lookback length is 96. All models are evaluated on 4 different prediction horizons {96, 192, 336, 720}. A lower MSE indicates better performance.

Methods AttnEmbed AttnEmbed(w/o EMA) AttnEmbed(w/o Landmark)
Metric MSE MAE MSE MAE MSE MAE
ETTh1 96 0.367 0.398 0.378 0.398 0.370 0.400
192 0.420 0.428 0.423 0.425 0.421 0.427
336 0.448 0.438 0.456 0.430 0.440 0.436
720 0.454 0.459 0.472 0.455 0.461 0.454
Avg 0.422 0.430 0.432 0.427 0.423 0.429
ETTm1 96 0.317 0.356 0.319 0.360 0.326 0.365
192 0.357 0.381 0.374 0.391 0.370 0.384
336 0.387 0.404 0.396 0.402 0.396 0.403
720 0.448 0.439 0.449 0.437 0.455 0.436
Avg 0.377 0.395 0.384 0.397 0.386 0.397
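
The Avg rows in Table 8 appear to be the mean over the four horizons; this is an inference from the numbers, not something the table states. The sketch below recomputes the MSE averages from the per-horizon values copied from the table and measures the gap to the reported Avg entries. The variable names are illustrative.

```python
# Per-horizon MSE copied from Table 8, ordered as horizons 96, 192, 336, 720.
mse = {
    ("ETTh1", "AttnEmbed"): [0.367, 0.420, 0.448, 0.454],
    ("ETTh1", "w/o EMA"): [0.378, 0.423, 0.456, 0.472],
    ("ETTh1", "w/o Landmark"): [0.370, 0.421, 0.440, 0.461],
    ("ETTm1", "AttnEmbed"): [0.317, 0.357, 0.387, 0.448],
    ("ETTm1", "w/o EMA"): [0.319, 0.374, 0.396, 0.449],
    ("ETTm1", "w/o Landmark"): [0.326, 0.370, 0.396, 0.455],
}

# Avg MSE entries as reported in Table 8.
reported_avg = {
    ("ETTh1", "AttnEmbed"): 0.422,
    ("ETTh1", "w/o EMA"): 0.432,
    ("ETTh1", "w/o Landmark"): 0.423,
    ("ETTm1", "AttnEmbed"): 0.377,
    ("ETTm1", "w/o EMA"): 0.384,
    ("ETTm1", "w/o Landmark"): 0.386,
}


def mean(values):
    return sum(values) / len(values)


recomputed = {key: mean(vals) for key, vals in mse.items()}

# Largest absolute difference between recomputed and reported averages.
max_gap = max(abs(recomputed[key] - reported_avg[key]) for key in mse)
print(f"max gap: {max_gap:.5f}")
```

The largest gap is below 0.001, which is consistent with the averages having been computed from unrounded per-horizon values. On both datasets the full AttnEmbed model has the lowest recomputed average MSE of the three variants.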