Are Transformers Effective for Time Series Forecasting?
long-term forecasting is only feasible for those time series with a relatively clear trend and periodicity. As linear models can already extract such information, we introduce a set of embarrassingly simple models named LTSF-Linear as a new baseline for comparison. LTSF-Linear regresses historical time series with a one-layer linear model to forecast future time series directly. We conduct extensive experiments on nine widely-used benchmark datasets that cover various real-life applications: traffic, energy, economics, weather, and disease predictions. Surprisingly, our results show that LTSF-Linear outperforms existing complex Transformer-based models in all cases, and often by a large margin (20% ∼ 50%). Moreover, we find that, in contrast to the claims in existing Transformers, most of them fail to extract temporal relations from long sequences, i.e., the forecasting errors are not reduced (and sometimes even increase) as the look-back window size grows. Finally, we conduct various ablation studies on existing Transformer-based TSF solutions to study the impact of various design elements in them.

To sum up, the contributions of this work include:
• To the best of our knowledge, this is the first work to challenge the effectiveness of the booming Transformers for the long-term time series forecasting task.
• To validate our claims, we introduce a set of embarrassingly simple one-layer linear models, named LTSF-Linear, and compare them with existing Transformer-based LTSF solutions on nine benchmarks. LTSF-Linear can serve as a new baseline for the LTSF problem.
• We conduct comprehensive empirical studies on various aspects of existing Transformer-based solutions, including the capability of modeling long inputs, the sensitivity to time series order, the impact of positional encoding and sub-series embedding, and efficiency comparisons. Our findings would benefit future research in this area.

With the above, we conclude that the temporal modeling capabilities of Transformers for time series are exaggerated, at least for the existing LTSF benchmarks. At the same time, while LTSF-Linear achieves better prediction accuracy than existing works, it merely serves as a simple baseline for future research on the challenging long-term TSF problem. With our findings, we also advocate revisiting the validity of Transformer-based solutions for other time series analysis tasks (e.g., anomaly detection) in the future.

Preliminaries: TSF Problem Formulation

For time series containing C variates, given historical data X = {X_1^t, ..., X_C^t}_{t=1}^L, where L is the look-back window size and X_i^t is the value of the i-th variate at the t-th time step, the time series forecasting task is to predict the values X̂ = {X̂_1^t, ..., X̂_C^t}_{t=L+1}^{L+T} at the T future time steps. When T > 1, iterated multi-step (IMS) forecasting (Taieb, Hyndman et al. 2012) learns a single-step forecaster and iteratively applies it to obtain multi-step predictions. Alternatively, direct multi-step (DMS) forecasting (Chevillon 2007) directly optimizes the multi-step forecasting objective at once.

Compared to DMS forecasting results, IMS predictions have smaller variance thanks to the autoregressive estimation procedure, but they inevitably suffer from error accumulation effects. Consequently, IMS forecasting is preferable when a highly accurate single-step forecaster exists and T is relatively small. In contrast, DMS forecasting generates more accurate predictions when it is hard to obtain an unbiased single-step forecasting model, or when T is large.

Transformer-Based LTSF Solutions

Transformer-based models (Vaswani et al. 2017) have achieved unparalleled performance in many long-standing AI tasks in natural language processing and computer vision, thanks to the effectiveness of the multi-head self-attention mechanism. This has also triggered considerable research interest in Transformer-based time series modeling techniques (Wen et al. 2022). In particular, a large number of works are dedicated to the LTSF task (e.g., Li et al. 2019; Liu et al. 2021a; Xu et al. 2021; Zhou et al. 2021, 2022). Considering the ability of Transformer models to capture long-range dependencies, most of them focus on the less-explored long-term forecasting problem (T ≫ 1).¹

¹ Due to the page limit, we leave the discussion of non-Transformer forecasting solutions to the Appendix.

When applying the vanilla Transformer model to the LTSF problem, it has some limitations, including the quadratic time/memory complexity of the original self-attention scheme and the error accumulation caused by its autoregressive decoder design. Informer (Zhou et al. 2021) addresses these issues and proposes a novel Transformer architecture with reduced complexity and a DMS forecasting strategy. Later, more Transformer variants introduce various time series features into their models for performance or efficiency improvements (Liu et al. 2021a; Xu et al. 2021; Zhou et al. 2022). We summarize the design elements of existing Transformer-based LTSF solutions as follows (see Figure 1).

Time series decomposition: For data preprocessing, normalization with zero-mean is common in TSF. Besides, Autoformer (Xu et al. 2021) first applies seasonal-trend decomposition behind each neural block, a standard method in time series analysis to make raw data more predictable (Cleveland 1990; Hamilton 2020). Specifically, it uses a moving average kernel on the input sequence to extract the trend-cyclical component of the time series. The difference between the original sequence and the trend component is regarded as the seasonal component. On top of the decomposition scheme of Autoformer, FEDformer (Zhou et al. 2022) further proposes a mixture-of-experts strategy to mix the trend components extracted by moving average kernels with various kernel sizes.

Input embedding strategies: The self-attention layer in the Transformer architecture cannot preserve the positional information of the time series. However, local positional information, i.e., the ordering of the time series, is important. Besides, global temporal information, such as hierarchical timestamps (week, month, year) and agnostic timestamps (holidays and events), is also informative (Zhou et al. 2021). To enhance the temporal context of time-series inputs, a practical design in the SOTA Transformer-based methods is injecting several embeddings, like a fixed positional encoding, a
[Figure 1 diagram: (a) input preparation: normalization, seasonal-trend decomposition; (b) embedding: channel projection, fixed position, local timestamp, global timestamp; (c) self-attention variants: LogSparse and convolutional self-attention @LogTrans, ProbSparse and distilling self-attention @Informer, series auto-correlation with decomposition @Autoformer, multi-resolution pyramidal attention @Pyraformer, frequency enhanced block with decomposition @FEDformer; (d) output: Iterated Multi-Step (IMS) @LogTrans, Direct Multi-Step (DMS) @Informer, DMS with auto-correlation and decomposition @Autoformer, DMS along the spatio-temporal dimension @Pyraformer, DMS with frequency attention and decomposition @FEDformer]
Figure 1: The pipeline of existing Transformer-based TSF solutions. In (a) and (b), the solid boxes are essential operations, and
the dotted boxes are applied optionally. (c) and (d) are distinct for different methods (Li et al. 2019; Zhou et al. 2021; Xu et al.
2021; Liu et al. 2021a; Zhou et al. 2022).
channel projection embedding, and learnable temporal embeddings into the input sequence. Moreover, temporal embeddings with a temporal convolution layer (Li et al. 2019) or learnable timestamps (Xu et al. 2021) are introduced.

Self-attention schemes: Transformers rely on the self-attention mechanism to extract the semantic dependencies between paired elements. Motivated by reducing the O(L^2) time and memory complexity of the vanilla Transformer, recent works propose two strategies for efficiency. On the one hand, LogTrans and Pyraformer explicitly introduce a sparsity bias into the self-attention scheme. Specifically, LogTrans uses a LogSparse mask to reduce the computational complexity to O(L log L), while Pyraformer adopts pyramidal attention that captures hierarchical multi-scale temporal dependencies with O(L) time and memory complexity. On the other hand, Informer and FEDformer use the low-rank property of the self-attention matrix. Informer proposes a ProbSparse self-attention mechanism and a self-attention distilling operation to decrease the complexity to O(L log L), and FEDformer designs a Fourier enhanced block and a wavelet enhanced block with random selection to obtain O(L) complexity. Lastly, Autoformer designs a series-wise auto-correlation mechanism to replace the original self-attention layer.

Decoders: The vanilla Transformer decoder outputs sequences in an autoregressive manner, resulting in slow inference and error accumulation effects, especially for long-term predictions. Informer designs a generative-style decoder for DMS forecasting. Other Transformer variants employ similar DMS strategies. For instance, Pyraformer uses a fully-connected layer concatenating spatio-temporal axes as the decoder. Autoformer sums up two refined decomposed features, from the trend-cyclical components and the stacked auto-correlation mechanism for the seasonal components, to get the final prediction. FEDformer also uses a decomposition scheme with the proposed frequency attention block to decode the final results.

The premise of Transformer models is the semantic correlations between paired elements, while the self-attention mechanism itself is permutation-invariant, and its capability of modeling temporal relations largely depends on positional encodings associated with input tokens. Considering the raw numerical data in time series (e.g., stock prices or electricity values), there are hardly any point-wise semantic correlations between them. In time series modeling, we are mainly interested in the temporal relations among a continuous set of points, and the order of these elements, rather than the paired relationships, plays the most crucial role. While employing positional encoding and using tokens to embed sub-series facilitates preserving some ordering information, the nature of the permutation-invariant self-attention mechanism inevitably results in temporal information loss. Due to the above observations, we are interested in revisiting the effectiveness of Transformer-based LTSF solutions.

An Embarrassingly Simple Baseline for LTSF

In the experiments of existing Transformer-based LTSF solutions (T ≫ 1), all the compared (non-Transformer) baselines are IMS forecasting techniques, which are known to suffer from significant error accumulation effects. We hypothesize that the performance improvements in these works are largely due to the DMS strategy used in them.

Figure 2: Illustration of the basic linear model.

To validate this hypothesis, we present the simplest DMS model via a temporal linear layer, named LTSF-Linear, as a baseline for comparison. The basic formulation of LTSF-Linear directly regresses historical time series for future prediction via a weighted sum operation (as illustrated in Figure 2). The mathematical expression is X̂_i = W X_i, where
Datasets ETTh1&ETTh2 ETTm1 &ETTm2 Traffic Electricity Exchange-Rate Weather ILI
Variates 7 7 862 321 8 21 7
Timesteps 17,420 69,680 17,544 26,304 7,588 52,696 966
Granularity 1 hour 5 min 1 hour 1 hour 1 day 10 min 1 week
Table 1: The statistics of the nine popular datasets for the LTSF problem.
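As a side note on the Preliminaries, the IMS/DMS distinction can be sketched with a toy one-step rule. This is an illustrative sketch only: `one_step` and `multi_step` are hypothetical stand-ins for learned forecasters, not the models compared in this paper.

```python
# Toy contrast of IMS vs. DMS forecasting from the Preliminaries.
# The one-step rule y[t+1] = 0.5 * y[t] is a hypothetical stand-in
# for a learned single-step forecaster; nothing is trained here.

def ims_forecast(last_value, one_step, T):
    """Iterated multi-step: apply a single-step forecaster T times,
    feeding each prediction back as input (errors accumulate)."""
    preds, y = [], last_value
    for _ in range(T):
        y = one_step(y)
        preds.append(y)
    return preds

def dms_forecast(last_value, multi_step, T):
    """Direct multi-step: one model emits all T future steps at once."""
    return multi_step(last_value, T)

one_step = lambda y: 0.5 * y
multi_step = lambda y, T: [y * 0.5 ** h for h in range(1, T + 1)]

print(ims_forecast(8.0, one_step, 3))   # [4.0, 2.0, 1.0]
print(dms_forecast(8.0, multi_step, 3))  # [4.0, 2.0, 1.0]
```

The two agree here only because the toy one-step rule is exact; with an imperfect one-step forecaster, the IMS errors compound over the T iterations while the DMS model is trained against the multi-step objective directly.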
Methods IMP. Linear* NLinear* DLinear* FEDformer Autoformer Informer Pyraformer* Repeat*
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Electricity
96 27% 0.140 0.237 0.141 0.237 0.140 0.237 0.193 0.308 0.201 0.317 0.274 0.368 0.386 0.449 1.588 0.946
192 24% 0.153 0.250 0.154 0.248 0.153 0.249 0.201 0.315 0.222 0.334 0.296 0.386 0.386 0.443 1.595 0.950
336 21% 0.169 0.268 0.171 0.265 0.169 0.267 0.214 0.329 0.231 0.338 0.300 0.394 0.378 0.443 1.617 0.961
720 17% 0.203 0.301 0.210 0.297 0.203 0.301 0.246 0.355 0.254 0.361 0.373 0.439 0.376 0.445 1.647 0.975
Exchange
96 45% 0.082 0.207 0.089 0.208 0.081 0.203 0.148 0.278 0.197 0.323 0.847 0.752 0.376 1.105 0.081 0.196
192 42% 0.167 0.304 0.180 0.300 0.157 0.293 0.271 0.380 0.300 0.369 1.204 0.895 1.748 1.151 0.167 0.289
336 34% 0.328 0.432 0.331 0.415 0.305 0.414 0.460 0.500 0.509 0.524 1.672 1.036 1.874 1.172 0.305 0.396
720 46% 0.964 0.750 1.033 0.780 0.643 0.601 1.195 0.841 1.447 0.941 2.478 1.310 1.943 1.206 0.823 0.681
Traffic
96 30% 0.410 0.282 0.410 0.279 0.410 0.282 0.587 0.366 0.613 0.388 0.719 0.391 2.085 0.468 2.723 1.079
192 30% 0.423 0.287 0.423 0.284 0.423 0.287 0.604 0.373 0.616 0.382 0.696 0.379 0.867 0.467 2.756 1.087
336 30% 0.436 0.295 0.435 0.290 0.436 0.296 0.621 0.383 0.622 0.337 0.777 0.420 0.869 0.469 2.791 1.095
720 26% 0.466 0.315 0.464 0.307 0.466 0.315 0.626 0.382 0.660 0.408 0.864 0.472 0.881 0.473 2.811 1.097
Weather
96 19% 0.176 0.236 0.182 0.232 0.176 0.237 0.217 0.296 0.266 0.336 0.300 0.384 0.896 0.556 0.259 0.254
192 21% 0.218 0.276 0.225 0.269 0.220 0.282 0.276 0.336 0.307 0.367 0.598 0.544 0.622 0.624 0.309 0.292
336 23% 0.262 0.312 0.271 0.301 0.265 0.319 0.339 0.380 0.359 0.395 0.578 0.523 0.739 0.753 0.377 0.338
720 20% 0.326 0.365 0.338 0.348 0.323 0.362 0.403 0.428 0.419 0.428 1.059 0.741 1.004 0.934 0.465 0.394
ILI
24 48% 1.947 0.985 1.683 0.858 2.215 1.081 3.228 1.260 3.483 1.287 5.764 1.677 1.420 2.012 6.587 1.701
36 36% 2.182 1.036 1.703 0.859 1.963 0.963 2.679 1.080 3.103 1.148 4.755 1.467 7.394 2.031 7.130 1.884
48 34% 2.256 1.060 1.719 0.884 2.130 1.024 2.622 1.078 2.669 1.085 4.763 1.469 7.551 2.057 6.575 1.798
60 34% 2.390 1.104 1.819 0.917 2.368 1.096 2.857 1.157 2.770 1.125 5.264 1.564 7.662 2.100 5.893 1.677
ETTh1
96 1% 0.375 0.397 0.374 0.394 0.375 0.399 0.376 0.419 0.449 0.459 0.865 0.713 0.664 0.612 1.295 0.713
192 4% 0.418 0.429 0.408 0.415 0.405 0.416 0.420 0.448 0.500 0.482 1.008 0.792 0.790 0.681 1.325 0.733
336 7% 0.479 0.476 0.429 0.427 0.439 0.443 0.459 0.465 0.521 0.496 1.107 0.809 0.891 0.738 1.323 0.744
720 13% 0.624 0.592 0.440 0.453 0.472 0.490 0.506 0.507 0.514 0.512 1.181 0.865 0.963 0.782 1.339 0.756
ETTh2
96 20% 0.288 0.352 0.277 0.338 0.289 0.353 0.346 0.388 0.358 0.397 3.755 1.525 0.645 0.597 0.432 0.422
192 20% 0.377 0.413 0.344 0.381 0.383 0.418 0.429 0.439 0.456 0.452 5.602 1.931 0.788 0.683 0.534 0.473
336 26% 0.452 0.461 0.357 0.400 0.448 0.465 0.496 0.487 0.482 0.486 4.721 1.835 0.907 0.747 0.591 0.508
720 14% 0.698 0.595 0.394 0.436 0.605 0.551 0.463 0.474 0.515 0.511 3.647 1.625 0.963 0.783 0.588 0.517
ETTm1
96 21% 0.308 0.352 0.306 0.348 0.299 0.343 0.379 0.419 0.505 0.475 0.672 0.571 0.543 0.510 1.214 0.665
192 21% 0.340 0.369 0.349 0.375 0.335 0.365 0.426 0.441 0.553 0.496 0.795 0.669 0.557 0.537 1.261 0.690
336 17% 0.376 0.393 0.375 0.388 0.369 0.386 0.445 0.459 0.621 0.537 1.212 0.871 0.754 0.655 1.283 0.707
720 22% 0.440 0.435 0.433 0.422 0.425 0.421 0.543 0.490 0.671 0.561 1.166 0.823 0.908 0.724 1.319 0.729
ETTm2
96 18% 0.168 0.262 0.167 0.255 0.167 0.260 0.203 0.287 0.255 0.339 0.365 0.453 0.435 0.507 0.266 0.328
192 18% 0.232 0.308 0.221 0.293 0.224 0.303 0.269 0.328 0.281 0.340 0.533 0.563 0.730 0.673 0.340 0.371
336 16% 0.320 0.373 0.274 0.327 0.281 0.342 0.325 0.366 0.339 0.372 1.363 0.887 1.201 0.845 0.412 0.410
720 13% 0.413 0.435 0.368 0.384 0.397 0.421 0.421 0.415 0.433 0.432 3.379 1.338 3.625 1.451 0.521 0.465
- Methods* are implemented by us; Other results are from FEDformer (Zhou et al. 2022).
Table 2: Multivariate long-term forecasting errors in terms of MSE and MAE; the lower, the better. The ILI dataset uses forecasting horizons T ∈ {24, 36, 48, 60}; the others use T ∈ {96, 192, 336, 720}. The best results are highlighted in bold, and the best results among Transformers are underlined. IMP. is the improvement of the best linear model over the best result of the Transformer-based solutions.
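As a sanity check on how the IMP. column is read (our interpretation, spelled out in code rather than taken from the paper's scripts), the Electricity horizon-96 row reproduces the reported 27%:

```python
# Reproducing the IMP. entry for Electricity, T=96, from Table 2.
# IMP. is read here as the relative MSE reduction of the best linear
# model over the best Transformer-based result.

def improvement(best_linear, best_transformer):
    """Relative MSE reduction of the best linear model."""
    return (best_transformer - best_linear) / best_transformer

best_linear = min(0.140, 0.141, 0.140)              # Linear, NLinear, DLinear
best_transformer = min(0.193, 0.201, 0.274, 0.386)  # FEDformer, Autoformer, Informer, Pyraformer
print(f"{improvement(best_linear, best_transformer):.0%}")  # 27%
```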
W ∈ R^{T×L} is a linear layer along the temporal axis, and X̂_i and X_i are the prediction and input for the i-th variate. Note that LTSF-Linear shares weights across different variates and does not model any spatial correlations.

LTSF-Linear is a set of linear models. Vanilla Linear is a one-layer linear model. To handle time series across different domains (e.g., finance, traffic, and energy), we further introduce two variants with two preprocessing methods, named DLinear and NLinear.

• Specifically, DLinear is a combination of the decomposition scheme used in Autoformer and FEDformer with linear layers. It first decomposes a raw data input into a trend component, extracted by a moving average kernel, and a remainder (seasonal) component. Then, two one-layer linear layers are applied to each component, and we sum up the two features to get the final prediction. By explicitly handling the trend, DLinear enhances the performance of a vanilla linear model when there is a clear trend in the data.

• Meanwhile, to boost the performance of LTSF-Linear when there is a distribution shift in the dataset, NLinear first subtracts the last value of the sequence from the input. Then, the input goes through a linear layer, and the subtracted part is added back before making the final prediction.
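The LTSF-Linear variants above (Linear, DLinear, NLinear) can be sketched in a few lines of numpy. This is an illustrative, untrained sketch with randomly initialized weights for the single-variate case (weights are shared across variates); the moving-average boundary padding below simply repeats the edge values, which may differ in detail from the released implementation.

```python
# Minimal single-variate sketch of the LTSF-Linear family. W has shape
# (T, L); training (least squares, SGD, ...) is intentionally omitted.
import numpy as np

rng = np.random.default_rng(0)
L, T = 8, 4                      # look-back window and forecast horizon
W = rng.normal(size=(T, L))      # the only learnable parameters of Linear

def linear(x):                   # vanilla Linear: X_hat = W @ X
    return W @ x

def nlinear(x):                  # NLinear: subtract the last value, add it back
    last = x[-1]
    return W @ (x - last) + last

def moving_average(x, k=3):      # trend via a centered, edge-padded moving average
    pad = k // 2
    xp = np.concatenate([np.full(pad, x[0]), x, np.full(pad, x[-1])])
    return np.array([xp[i:i + k].mean() for i in range(len(x))])

W_trend = rng.normal(size=(T, L))
W_season = rng.normal(size=(T, L))

def dlinear(x):                  # DLinear: one linear layer per component, then sum
    trend = moving_average(x)
    return W_trend @ trend + W_season @ (x - trend)

x = rng.normal(size=L)
assert linear(x).shape == nlinear(x).shape == dlinear(x).shape == (T,)
```

Note how NLinear's preprocessing makes the forecast equivariant to a constant level shift: `nlinear(x + c)` equals `nlinear(x) + c`, which is exactly what helps under a distribution shift in the level of the series.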
Experiments
[Figure 4 plot: MSE vs. look-back window size for Transformer, Informer, Autoformer, FEDformer, Pyraformer, Linear, NLinear, and DLinear]

Figure 4: The MSE results (Y-axis) of models with different look-back window sizes (X-axis) for long-term forecasting (T=720) on Electricity.

We compare the forecasting accuracy for the same future 720 time steps with data from two different look-back windows: (i) the original input L=96 setting (called Close) and (ii) the far input L=96 setting (called Far), which lies before the original 96 time steps. The performance of the SOTA Transformers drops only slightly, indicating that these models capture only similar temporal information from the adjacent time series sequence. Capturing the intrinsic characteristics of a dataset generally does not require a large number of parameters; e.g., one parameter can represent the periodicity. Using too many parameters can even cause overfitting.

Are the self-attention schemes effective for LTSF? We verify whether the complex designs in the existing Transformers (e.g., Informer) are essential. In Table 4, we gradually transform Informer into Linear. First, we replace each self-attention layer with a linear layer, called Att.-Linear, since a self-attention layer can be regarded as a fully-connected layer whose weights change dynamically. Furthermore, we discard the other auxiliary designs (e.g., the FFN) in Informer to leave only embedding layers and linear layers, named Embed + Linear. Finally, we simplify the model to one linear layer. As can be observed, the performance of Informer improves with the gradual simplification, thereby challenging the necessity of these modules.

Methods Informer Att.-Linear Embed + Linear Linear
Exchange
192 1.008 0.759 0.686 0.438
336 1.107 0.921 0.821 0.479
720 1.181 0.902 1.051 0.515

Table 4: The MSE comparisons of gradually transforming Informer into a Linear from the left to right columns.

Can existing LTSF-Transformers preserve temporal order well? Self-attention is inherently permutation-invariant, i.e., agnostic to the order. However, in time-series forecasting, the sequence order often plays a crucial role. We argue that even with positional and temporal embeddings, existing Transformer-based methods still suffer from temporal information loss. In Table 5, we shuffle the raw input before the embedding strategies. Two shuffling strategies are presented: Shuf. randomly shuffles the whole input sequence, and Half-Ex. exchanges the first half of the input sequence with the second half. Interestingly, compared with the original setting (Ori.) on the Exchange Rate, the performance of all Transformer-based methods does not fluctuate even when the input sequence is randomly shuffled. By contrast, the performance of LTSF-Linear is damaged significantly. These results indicate that LTSF-Transformers with different positional and temporal embeddings preserve quite limited temporal relations and are prone to overfitting on noisy financial data, while the simple LTSF-Linear models the order naturally and avoids overfitting with fewer parameters.

For the ETTh1 dataset, FEDformer and Autoformer introduce time series inductive bias into their models, enabling them to extract certain temporal information when the dataset has clearer temporal patterns (e.g., periodicity) than the Exchange Rate. Therefore, the average drops of these two Transformers are 73.28% and 56.91% under the Shuf. setting, where the whole order information is lost. Moreover, Informer suffers less from both the Shuf. and Half-Ex. settings since it has no such temporal inductive bias. Overall, the average drops of LTSF-Linear are larger than those of the Transformer-based methods in all cases, indicating that the existing Transformers do not preserve the temporal order well.

How effective are different embedding strategies? In Table 6, the forecasting errors of Informer largely increase without positional embeddings (wo/Pos.). Without timestamp embeddings (wo/Temp.), the performance of Informer gradually degrades as the forecasting length increases. Since Informer uses a single time step for each token, it is necessary to introduce temporal information in the tokens.

Rather than using a single time step in each token, FEDformer and Autoformer input a sequence of timestamps to embed the temporal information. Hence, they can achieve comparable or even better performance without fixed positional embeddings. However, without timestamp embeddings, the performance of Autoformer declines rapidly because of the loss of global temporal information. Instead, thanks to the frequency-enhanced module proposed in FEDformer to introduce a temporal inductive bias, it suffers less from removing any position/timestamp embeddings.

Is training data size a limiting factor for existing LTSF-Transformers? One may attribute the inferior performance of Transformer-based methods to the small sizes of the benchmark datasets. Unlike computer vision or natural language processing tasks, TSF is performed on collected time series, and it is difficult to scale up the training data size. In fact, the size of the training data would indeed have a significant impact on the model performance. Accordingly, we conduct experiments on Traffic, comparing the performance of the model trained on a full dataset (17,544*0.7 hours), named Ori., with that trained on a shortened dataset
Methods Linear FEDformer Autoformer Informer
Predict Length
Ori. Shuf. Half-Ex. Ori. Shuf. Half-Ex. Ori. Shuf. Half-Ex. Ori. Shuf. Half-Ex.
Exchange
96 0.080 0.133 0.169 0.161 0.160 0.162 0.152 0.158 0.160 0.952 1.004 0.959
192 0.162 0.208 0.243 0.274 0.275 0.275 0.278 0.271 0.277 1.012 1.023 1.014
336 0.286 0.320 0.345 0.439 0.439 0.439 0.435 0.430 0.435 1.177 1.181 1.177
720 0.806 0.819 0.836 1.122 1.122 1.122 1.113 1.113 1.113 1.198 1.210 1.196
Average Drop N/A 27.26% 46.81% N/A -0.09% 0.20% N/A 0.09% 1.12% N/A -0.12% -0.18%
ETTh1
96 0.395 0.824 0.431 0.376 0.753 0.405 0.455 0.838 0.458 0.974 0.971 0.971
192 0.447 0.824 0.471 0.419 0.730 0.436 0.486 0.774 0.491 1.233 1.232 1.231
336 0.490 0.825 0.505 0.447 0.736 0.453 0.496 0.752 0.497 1.693 1.693 1.691
720 0.520 0.846 0.528 0.468 0.720 0.470 0.525 0.696 0.524 2.720 2.716 2.715
Average Drop N/A 81.06% 4.78% N/A 73.28% 3.44% N/A 56.91% 0.46% N/A 1.98% 0.18%
Table 5: The MSE comparisons of models when shuffling the raw input sequence. Shuf. randomly shuffles the whole input sequence; Half-Ex. exchanges the first half of the input sequence with the second half. Each setting is run five times.
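The shuffling results in Table 5 match a basic property one can verify directly: without positional information, self-attention is permutation-equivariant, so any order-agnostic readout of its output cannot distinguish a shuffled window from the original, whereas a linear layer over the flattened window is order-sensitive. A small numpy check of this property (ours, not code from the paper):

```python
# Self-attention without positional encoding is permutation-equivariant:
# shuffling the input rows only shuffles the output rows, so a pooled
# readout sees no difference. A linear map on the raw window does.
import numpy as np

def self_attention(X):
    """Single-head dot-product attention with Q = K = V = X."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # a window of 6 time steps, 4 features
perm = np.arange(6)[::-1]            # a fixed "Shuf.": reverse the window

out, out_shuf = self_attention(X), self_attention(X[perm])
assert np.allclose(out_shuf, out[perm])            # equivariant
assert np.allclose(out_shuf.mean(0), out.mean(0))  # pooled readout is identical

W = rng.normal(size=(2, 24))         # a linear layer on the flattened window
assert not np.allclose(W @ X[perm].ravel(), W @ X.ravel())  # order-sensitive
```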
Table 6: The MSE comparisons of different embedding strategies on Transformer-based methods with look-back window size 96 and forecasting lengths {96, 192, 336, 720}.

Table 8: Comparison of the practical efficiency of LTSF-Transformers under L=96 and T=720 on Electricity. MACs are the number of multiply-accumulate operations. The inference time is averaged over 5 runs.
(8,760 hours, i.e., 1 year), called Short. Unexpectedly, Table 7 presents that the prediction errors with the reduced training data are usually lower. This might be because the whole-year data maintain clearer temporal features than the longer but incomplete data. While we cannot conclude that we should use less data for training, it demonstrates that the training data scale is not the limiting factor.

Methods FEDformer Autoformer
Ori. Short Ori. Short
96 0.587 0.568 0.613 0.594
192 0.604 0.584 0.616 0.621
336 0.621 0.601 0.622 0.621
720 0.626 0.608 0.660 0.650

Table 7: The MSE comparisons of two training data sizes.

Is efficiency really a top-level priority? Existing LTSF-Transformers claim that the O(L^2) complexity of the vanilla Transformer is unaffordable for the LTSF problem. Although they prove able to improve the theoretical time and memory complexity from O(L^2) to O(L), it is unclear whether 1) the actual inference time and memory cost on devices are improved, and 2) the memory issue is unacceptable and urgent for today's GPUs (e.g., an NVIDIA Titan XP here). In Table 8, we compare the average practical efficiencies.

Conclusion and Future Work

Conclusion. This work questions the effectiveness of the emerging favored Transformer-based solutions for the long-term time series forecasting problem. We use an embarrassingly simple linear model, LTSF-Linear, as a DMS forecasting baseline to verify our claims. Note that our contributions do not come from proposing a linear model, but rather from raising an important question, showing surprising comparisons, and demonstrating why LTSF-Transformers are not as effective as claimed in these works from various perspectives. We sincerely hope our comprehensive studies can benefit future work in this area.

Future work. LTSF-Linear has a limited model capacity, and it merely serves as a simple yet competitive baseline with strong interpretability for future research. Consequently, we believe there is great potential for new model designs, data processing, and benchmarks to tackle the LTSF problem.

Acknowledgments

This work was supported in part by Alibaba Group Holding Ltd. under Grant No. TA2015393. We thank the anonymous reviewers for their constructive comments and suggestions.
References

Ariyo, A. A.; Adewumi, A. O.; and Ayo, C. K. 2014. Stock price prediction using the ARIMA model. In 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, 106–112. IEEE.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
Bai, S.; Kolter, J. Z.; and Koltun, V. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
Chevillon, G. 2007. Direct multi-step estimation and forecasting. Journal of Economic Surveys, 21(4): 746–785.
Cirstea, R.-G.; Guo, C.; Yang, B.; Kieu, T.; Dong, X.; and Pan, S. 2022. Triformer: Triangular, Variable-Specific Attentions for Long Sequence Multivariate Time Series Forecasting–Full Version. arXiv preprint arXiv:2204.13767.
Cleveland, R. B. 1990. STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dong, L.; Xu, S.; and Xu, B. 2018. Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5884–5888. IEEE.
Friedman, J. H. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232.
Hamilton, J. D. 2020. Time Series Analysis. Princeton University Press.
Lai, G.; Chang, W.-C.; Yang, Y.; and Liu, H. 2017. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. In International ACM SIGIR Conference on Research and Development in Information Retrieval.
Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.-X.; and Yan, X. 2019. Enhancing the locality and breaking the memory bottleneck of Transformer on time series forecasting. Advances in Neural Information Processing Systems, 32.
Liu, M.; Zeng, A.; Chen, M.; Xu, Z.; Lai, Q.; Ma, L.; and Xu, Q. 2022. SCINet: Time Series Modeling and Forecasting with Sample Convolution and Interaction. In Thirty-sixth Conference on Neural Information Processing Systems.
Liu, S.; Yu, H.; Liao, C.; Li, J.; Lin, W.; Liu, A. X.; and Dustdar, S. 2021a. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; and Guo, B. 2021b. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022.
Salinas, D.; Flunkert, V.; and Gasthaus, J. 2017. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. International Journal of Forecasting.
Taieb, S. B.; Hyndman, R. J.; et al. 2012. Recursive and direct multi-step forecasting: the best of both worlds, volume 19. Citeseer.
Taylor, S. J.; and Letham, B. 2017. Forecasting at Scale. PeerJ Preprints.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; and Sun, L. 2022. Transformers in Time Series: A Survey. arXiv preprint arXiv:2202.07125.
Xu, J.; Wang, J.; Long, M.; et al. 2021. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34.
Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; and Zhang, W. 2021. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, volume 35, 11106–11115. AAAI Press.
Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; and Jin, R. 2022. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning.