Are Transformers Effective for Time Series Forecasting?
long-term forecasting is only feasible for those time series with a relatively clear trend and periodicity. As linear models can already extract such information, we introduce a set of embarrassingly simple models named LTSF-Linear as a new baseline for comparison. LTSF-Linear regresses historical time series with a one-layer linear model to forecast future time series directly. We conduct extensive experiments on nine widely-used benchmark datasets that cover various real-life applications: traffic, energy, economics, weather, and disease predictions. Surprisingly, our results show that LTSF-Linear outperforms existing complex Transformer-based models in all cases, and often by a large margin (20% ∼ 50%). Moreover, we find that, in contrast to the claims in existing Transformers, most of them fail to extract temporal relations from long sequences, i.e., the forecasting errors are not reduced (and sometimes even increase) as the look-back window size grows. Finally, we conduct various ablation studies on existing Transformer-based TSF solutions to study the impact of various design elements in them.

To sum up, the contributions of this work include:
• To the best of our knowledge, this is the first work to challenge the effectiveness of the booming Transformers for the long-term time series forecasting task.
• To validate our claims, we introduce a set of embarrassingly simple one-layer linear models, named LTSF-Linear, and compare them with existing Transformer-based LTSF solutions on nine benchmarks. LTSF-Linear can serve as a new baseline for the LTSF problem.
• We conduct comprehensive empirical studies on various aspects of existing Transformer-based solutions, including the capability of modeling long inputs, the sensitivity to time series order, the impact of positional encoding and sub-series embedding, and efficiency comparisons. Our findings would benefit future research in this area.

With the above, we conclude that the temporal modeling capabilities of Transformers for time series are exaggerated, at least for the existing LTSF benchmarks. At the same time, while LTSF-Linear achieves better prediction accuracy than existing works, it merely serves as a simple baseline for future research on the challenging long-term TSF problem. With our findings, we also advocate revisiting the validity of Transformer-based solutions for other time series analysis tasks (e.g., anomaly detection) in the future.

Preliminaries: TSF Problem Formulation

For time series containing C variates, given historical data X = {X_1^t, ..., X_C^t}_{t=1}^L, where L is the look-back window size and X_i^t is the value of the i-th variate at the t-th time step, the time series forecasting task is to predict the values X̂ = {X̂_1^t, ..., X̂_C^t}_{t=L+1}^{L+T} at the T future time steps. When T > 1, iterated multi-step (IMS) forecasting (Taieb, Hyndman et al. 2012) learns a single-step forecaster and iteratively applies it to obtain multi-step predictions. Alternatively, direct multi-step (DMS) forecasting (Chevillon 2007) directly optimizes the multi-step forecasting objective at once.

Compared to DMS forecasting results, IMS predictions have smaller variance thanks to the autoregressive estimation procedure, but they inevitably suffer from error accumulation effects. Consequently, IMS forecasting is preferable when a highly accurate single-step forecaster exists and T is relatively small. In contrast, DMS forecasting generates more accurate predictions when it is hard to obtain an unbiased single-step forecasting model, or when T is large.

Transformer-Based LTSF Solutions

Transformer-based models (Vaswani et al. 2017) have achieved unparalleled performance in many long-standing AI tasks in natural language processing and computer vision, thanks to the effectiveness of the multi-head self-attention mechanism. This has also triggered considerable research interest in Transformer-based time series modeling techniques (Wen et al. 2022). In particular, a large number of works are dedicated to the LTSF task (e.g., Li et al. 2019; Liu et al. 2021a; Xu et al. 2021; Zhou et al. 2021, 2022). Considering the ability of Transformer models to capture long-range dependencies, most of them focus on the less-explored long-term forecasting problem (T ≫ 1).¹

¹ Due to the page limit, we leave the discussion of non-Transformer forecasting solutions to the Appendix.

When applying the vanilla Transformer model to the LTSF problem, it has some limitations, including the quadratic time/memory complexity of the original self-attention scheme and the error accumulation caused by its autoregressive decoder design. Informer (Zhou et al. 2021) addresses these issues and proposes a novel Transformer architecture with reduced complexity and a DMS forecasting strategy. Later, more Transformer variants introduce various time series features into their models for performance or efficiency improvements (Liu et al. 2021a; Xu et al. 2021; Zhou et al. 2022). We summarize the design elements of existing Transformer-based LTSF solutions as follows (see Figure 1).

Time series decomposition: For data preprocessing, normalization with zero-mean is common in TSF. Besides, Autoformer (Xu et al. 2021) first applies seasonal-trend decomposition behind each neural block, a standard method in time series analysis to make raw data more predictable (Cleveland 1990; Hamilton 2020). Specifically, it uses a moving average kernel on the input sequence to extract the trend-cyclical component of the time series. The difference between the original sequence and the trend component is regarded as the seasonal component. On top of the decomposition scheme of Autoformer, FEDformer (Zhou et al. 2022) further proposes a mixture-of-experts strategy to mix the trend components extracted by moving average kernels with various kernel sizes.

Input embedding strategies: The self-attention layer in the Transformer architecture cannot preserve the positional information of the time series. However, local positional information, i.e., the ordering of the time series, is important. Besides, global temporal information, such as hierarchical timestamps (week, month, year) and agnostic timestamps (holidays and events), is also informative (Zhou et al. 2021). To enhance the temporal context of time-series inputs, a practical design in the SOTA Transformer-based methods is injecting several embeddings, like a fixed positional encoding, a
[Figure 1 diagram: (a) input preparation: normalization, seasonal-trend decomposition; (b) embedding: channel projection, fixed position, local timestamp, global timestamp; (c) self-attention variants: LogSparse and convolutional self-attention @LogTrans, ProbSparse and distilling self-attention @Informer, series auto-correlation with decomposition @Autoformer, multi-resolution pyramidal attention @Pyraformer, frequency enhanced block with decomposition @FEDformer; (d) output: Iterated Multi-Step (IMS) @LogTrans, Direct Multi-Step (DMS) @Informer, DMS with auto-correlation and decomposition @Autoformer, DMS along the spatio-temporal dimension @Pyraformer, DMS with frequency attention and decomposition @FEDformer]
Figure 1: The pipeline of existing Transformer-based TSF solutions. In (a) and (b), the solid boxes are essential operations, and
the dotted boxes are applied optionally. (c) and (d) are distinct for different methods (Li et al. 2019; Zhou et al. 2021; Xu et al.
2021; Liu et al. 2021a; Zhou et al. 2022).
channel projection embedding, and learnable temporal embeddings into the input sequence. Moreover, temporal embeddings with a temporal convolution layer (Li et al. 2019) or learnable timestamps (Xu et al. 2021) are introduced.

Self-attention schemes: Transformers rely on the self-attention mechanism to extract the semantic dependencies between paired elements. Motivated by reducing the O(L^2) time and memory complexity of the vanilla Transformer, recent works propose two strategies for efficiency. On the one hand, LogTrans and Pyraformer explicitly introduce a sparsity bias into the self-attention scheme. Specifically, LogTrans uses a LogSparse mask to reduce the computational complexity to O(L log L), while Pyraformer adopts pyramidal attention that captures hierarchical multi-scale temporal dependencies with O(L) time and memory complexity. On the other hand, Informer and FEDformer use the low-rank property of the self-attention matrix. Informer proposes a ProbSparse self-attention mechanism and a self-attention distilling operation to decrease the complexity to O(L log L), and FEDformer designs a Fourier enhanced block and a wavelet enhanced block with random selection to obtain O(L) complexity. Lastly, Autoformer designs a series-wise auto-correlation mechanism to replace the original self-attention layer.

Decoders: The vanilla Transformer decoder outputs sequences in an autoregressive manner, resulting in slow inference and error accumulation effects, especially for long-term predictions. Informer designs a generative-style decoder for DMS forecasting. Other Transformer variants employ similar DMS strategies. For instance, Pyraformer uses a fully-connected layer concatenating spatio-temporal axes as the decoder. Autoformer sums up two refined decomposed features, from the trend-cyclical components and the stacked auto-correlation mechanism for the seasonal components, to get the final prediction. FEDformer also uses a decomposition scheme with the proposed frequency attention block to decode the final results.

The premise of Transformer models is the semantic correlations between paired elements, while the self-attention mechanism itself is permutation-invariant, and its capability of modeling temporal relations largely depends on positional encodings associated with input tokens. Considering the raw numerical data in time series (e.g., stock prices or electricity values), there are hardly any point-wise semantic correlations between them. In time series modeling, we are mainly interested in the temporal relations among a continuous set of points, and the order of these elements, rather than the paired relationships, plays the most crucial role. While employing positional encoding and using tokens to embed sub-series facilitates preserving some ordering information, the nature of the permutation-invariant self-attention mechanism inevitably results in temporal information loss. Due to the above observations, we are interested in revisiting the effectiveness of Transformer-based LTSF solutions.

An Embarrassingly Simple Baseline for LTSF

In the experiments of existing Transformer-based LTSF solutions (T ≫ 1), all the compared (non-Transformer) baselines are IMS forecasting techniques, which are known to suffer from significant error accumulation effects. We hypothesize that the performance improvements in these works are largely due to the DMS strategy used in them.

Figure 2: Illustration of the basic linear model.

To validate this hypothesis, we present the simplest DMS model via a temporal linear layer, named LTSF-Linear, as a baseline for comparison. The basic formulation of LTSF-Linear directly regresses historical time series for future prediction via a weighted sum operation (as illustrated in Figure 2). The mathematical expression is X̂_i = W X_i, where
Datasets ETTh1&ETTh2 ETTm1 &ETTm2 Traffic Electricity Exchange-Rate Weather ILI
Variates 7 7 862 321 8 21 7
Timesteps 17,420 69,680 17,544 26,304 7,588 52,696 966
Granularity 1 hour 5 min 1 hour 1 hour 1 day 10 min 1 week
Table 1: The statistics of the nine popular datasets for the LTSF problem.
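As a side note on the Preliminaries, the IMS/DMS distinction can be sketched with a toy one-step rule. This is an illustrative sketch only: `one_step` and `multi_step` are hypothetical stand-ins for learned forecasters, not the models compared in this paper.

```python
# Toy contrast of IMS vs. DMS forecasting from the Preliminaries.
# The one-step rule y[t+1] = 0.5 * y[t] is a hypothetical stand-in
# for a learned single-step forecaster; nothing is trained here.

def ims_forecast(last_value, one_step, T):
    """Iterated multi-step: apply a single-step forecaster T times,
    feeding each prediction back as input (errors accumulate)."""
    preds, y = [], last_value
    for _ in range(T):
        y = one_step(y)
        preds.append(y)
    return preds

def dms_forecast(last_value, multi_step, T):
    """Direct multi-step: one model emits all T future steps at once."""
    return multi_step(last_value, T)

one_step = lambda y: 0.5 * y
multi_step = lambda y, T: [y * 0.5 ** h for h in range(1, T + 1)]

print(ims_forecast(8.0, one_step, 3))   # [4.0, 2.0, 1.0]
print(dms_forecast(8.0, multi_step, 3))  # [4.0, 2.0, 1.0]
```

The two agree here only because the toy one-step rule is exact; with an imperfect one-step forecaster, the IMS errors compound over the T iterations while the DMS model is trained against the multi-step objective directly.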
Methods IMP. Linear* NLinear* DLinear* FEDformer Autoformer Informer Pyraformer* Repeat*
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Electricity
96 27% 0.140 0.237 0.141 0.237 0.140 0.237 0.193 0.308 0.201 0.317 0.274 0.368 0.386 0.449 1.588 0.946
192 24% 0.153 0.250 0.154 0.248 0.153 0.249 0.201 0.315 0.222 0.334 0.296 0.386 0.386 0.443 1.595 0.950
336 21% 0.169 0.268 0.171 0.265 0.169 0.267 0.214 0.329 0.231 0.338 0.300 0.394 0.378 0.443 1.617 0.961
720 17% 0.203 0.301 0.210 0.297 0.203 0.301 0.246 0.355 0.254 0.361 0.373 0.439 0.376 0.445 1.647 0.975
Exchange
96 45% 0.082 0.207 0.089 0.208 0.081 0.203 0.148 0.278 0.197 0.323 0.847 0.752 0.376 1.105 0.081 0.196
192 42% 0.167 0.304 0.180 0.300 0.157 0.293 0.271 0.380 0.300 0.369 1.204 0.895 1.748 1.151 0.167 0.289
336 34% 0.328 0.432 0.331 0.415 0.305 0.414 0.460 0.500 0.509 0.524 1.672 1.036 1.874 1.172 0.305 0.396
720 46% 0.964 0.750 1.033 0.780 0.643 0.601 1.195 0.841 1.447 0.941 2.478 1.310 1.943 1.206 0.823 0.681
Traffic
96 30% 0.410 0.282 0.410 0.279 0.410 0.282 0.587 0.366 0.613 0.388 0.719 0.391 2.085 0.468 2.723 1.079
192 30% 0.423 0.287 0.423 0.284 0.423 0.287 0.604 0.373 0.616 0.382 0.696 0.379 0.867 0.467 2.756 1.087
336 30% 0.436 0.295 0.435 0.290 0.436 0.296 0.621 0.383 0.622 0.337 0.777 0.420 0.869 0.469 2.791 1.095
720 26% 0.466 0.315 0.464 0.307 0.466 0.315 0.626 0.382 0.660 0.408 0.864 0.472 0.881 0.473 2.811 1.097
Weather
96 19% 0.176 0.236 0.182 0.232 0.176 0.237 0.217 0.296 0.266 0.336 0.300 0.384 0.896 0.556 0.259 0.254
192 21% 0.218 0.276 0.225 0.269 0.220 0.282 0.276 0.336 0.307 0.367 0.598 0.544 0.622 0.624 0.309 0.292
336 23% 0.262 0.312 0.271 0.301 0.265 0.319 0.339 0.380 0.359 0.395 0.578 0.523 0.739 0.753 0.377 0.338
720 20% 0.326 0.365 0.338 0.348 0.323 0.362 0.403 0.428 0.419 0.428 1.059 0.741 1.004 0.934 0.465 0.394
ILI
24 48% 1.947 0.985 1.683 0.858 2.215 1.081 3.228 1.260 3.483 1.287 5.764 1.677 1.420 2.012 6.587 1.701
36 36% 2.182 1.036 1.703 0.859 1.963 0.963 2.679 1.080 3.103 1.148 4.755 1.467 7.394 2.031 7.130 1.884
48 34% 2.256 1.060 1.719 0.884 2.130 1.024 2.622 1.078 2.669 1.085 4.763 1.469 7.551 2.057 6.575 1.798
60 34% 2.390 1.104 1.819 0.917 2.368 1.096 2.857 1.157 2.770 1.125 5.264 1.564 7.662 2.100 5.893 1.677
ETTh1
96 1% 0.375 0.397 0.374 0.394 0.375 0.399 0.376 0.419 0.449 0.459 0.865 0.713 0.664 0.612 1.295 0.713
192 4% 0.418 0.429 0.408 0.415 0.405 0.416 0.420 0.448 0.500 0.482 1.008 0.792 0.790 0.681 1.325 0.733
336 7% 0.479 0.476 0.429 0.427 0.439 0.443 0.459 0.465 0.521 0.496 1.107 0.809 0.891 0.738 1.323 0.744
720 13% 0.624 0.592 0.440 0.453 0.472 0.490 0.506 0.507 0.514 0.512 1.181 0.865 0.963 0.782 1.339 0.756
ETTh2
96 20% 0.288 0.352 0.277 0.338 0.289 0.353 0.346 0.388 0.358 0.397 3.755 1.525 0.645 0.597 0.432 0.422
192 20% 0.377 0.413 0.344 0.381 0.383 0.418 0.429 0.439 0.456 0.452 5.602 1.931 0.788 0.683 0.534 0.473
336 26% 0.452 0.461 0.357 0.400 0.448 0.465 0.496 0.487 0.482 0.486 4.721 1.835 0.907 0.747 0.591 0.508
720 14% 0.698 0.595 0.394 0.436 0.605 0.551 0.463 0.474 0.515 0.511 3.647 1.625 0.963 0.783 0.588 0.517
ETTm1
96 21% 0.308 0.352 0.306 0.348 0.299 0.343 0.379 0.419 0.505 0.475 0.672 0.571 0.543 0.510 1.214 0.665
192 21% 0.340 0.369 0.349 0.375 0.335 0.365 0.426 0.441 0.553 0.496 0.795 0.669 0.557 0.537 1.261 0.690
336 17% 0.376 0.393 0.375 0.388 0.369 0.386 0.445 0.459 0.621 0.537 1.212 0.871 0.754 0.655 1.283 0.707
720 22% 0.440 0.435 0.433 0.422 0.425 0.421 0.543 0.490 0.671 0.561 1.166 0.823 0.908 0.724 1.319 0.729
ETTm2
96 18% 0.168 0.262 0.167 0.255 0.167 0.260 0.203 0.287 0.255 0.339 0.365 0.453 0.435 0.507 0.266 0.328
192 18% 0.232 0.308 0.221 0.293 0.224 0.303 0.269 0.328 0.281 0.340 0.533 0.563 0.730 0.673 0.340 0.371
336 16% 0.320 0.373 0.274 0.327 0.281 0.342 0.325 0.366 0.339 0.372 1.363 0.887 1.201 0.845 0.412 0.410
720 13% 0.413 0.435 0.368 0.384 0.397 0.421 0.421 0.415 0.433 0.432 3.379 1.338 3.625 1.451 0.521 0.465
- Methods* are implemented by us; Other results are from FEDformer (Zhou et al. 2022).
Table 2: Multivariate long-term forecasting errors in terms of MSE and MAE; the lower, the better. The ILI dataset uses forecasting horizons T ∈ {24, 36, 48, 60}; the others use T ∈ {96, 192, 336, 720}. The best results are highlighted in bold, and the best results among Transformers are underlined. IMP. is the improvement of the best linear model over the best result of the Transformer-based solutions.
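As a sanity check on how the IMP. column is read (our interpretation, spelled out in code rather than taken from the paper's scripts), the Electricity horizon-96 row reproduces the reported 27%:

```python
# Reproducing the IMP. entry for Electricity, T=96, from Table 2.
# IMP. is read here as the relative MSE reduction of the best linear
# model over the best Transformer-based result.

def improvement(best_linear, best_transformer):
    """Relative MSE reduction of the best linear model."""
    return (best_transformer - best_linear) / best_transformer

best_linear = min(0.140, 0.141, 0.140)              # Linear, NLinear, DLinear
best_transformer = min(0.193, 0.201, 0.274, 0.386)  # FEDformer, Autoformer, Informer, Pyraformer
print(f"{improvement(best_linear, best_transformer):.0%}")  # 27%
```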
W ∈ R^{T×L} is a linear layer along the temporal axis, and X̂_i and X_i are the prediction and input for the i-th variate. Note that LTSF-Linear shares weights across different variates and does not model any spatial correlations.

LTSF-Linear is a set of linear models. Vanilla Linear is a one-layer linear model. To handle time series across different domains (e.g., finance, traffic, and energy), we further introduce two variants with two preprocessing methods, named DLinear and NLinear.

• Specifically, DLinear is a combination of the decomposition scheme used in Autoformer and FEDformer with linear layers. It first decomposes a raw data input into a trend component, extracted by a moving average kernel, and a remainder (seasonal) component. Then, two one-layer linear layers are applied to each component, and we sum up the two features to get the final prediction. By explicitly handling the trend, DLinear enhances the performance of a vanilla linear model when there is a clear trend in the data.

• Meanwhile, to boost the performance of LTSF-Linear when there is a distribution shift in the dataset, NLinear first subtracts the last value of the sequence from the input. Then, the input goes through a linear layer, and the subtracted part is added back before making the final prediction.
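The LTSF-Linear variants above (Linear, DLinear, NLinear) can be sketched in a few lines of numpy. This is an illustrative, untrained sketch with randomly initialized weights for the single-variate case (weights are shared across variates); the moving-average boundary padding below simply repeats the edge values, which may differ in detail from the released implementation.

```python
# Minimal single-variate sketch of the LTSF-Linear family. W has shape
# (T, L); training (least squares, SGD, ...) is intentionally omitted.
import numpy as np

rng = np.random.default_rng(0)
L, T = 8, 4                      # look-back window and forecast horizon
W = rng.normal(size=(T, L))      # the only learnable parameters of Linear

def linear(x):                   # vanilla Linear: X_hat = W @ X
    return W @ x

def nlinear(x):                  # NLinear: subtract the last value, add it back
    last = x[-1]
    return W @ (x - last) + last

def moving_average(x, k=3):      # trend via a centered, edge-padded moving average
    pad = k // 2
    xp = np.concatenate([np.full(pad, x[0]), x, np.full(pad, x[-1])])
    return np.array([xp[i:i + k].mean() for i in range(len(x))])

W_trend = rng.normal(size=(T, L))
W_season = rng.normal(size=(T, L))

def dlinear(x):                  # DLinear: one linear layer per component, then sum
    trend = moving_average(x)
    return W_trend @ trend + W_season @ (x - trend)

x = rng.normal(size=L)
assert linear(x).shape == nlinear(x).shape == dlinear(x).shape == (T,)
```

Note how NLinear's preprocessing makes the forecast equivariant to a constant level shift: `nlinear(x + c)` equals `nlinear(x) + c`, which is exactly what helps under a distribution shift in the level of the series.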
Experiments
[Figure 4 plot: MSE vs. look-back window size for Transformer, Informer, Autoformer, FEDformer, Pyraformer, Linear, NLinear, and DLinear]

Figure 4: The MSE results (Y-axis) of models with different look-back window sizes (X-axis) for long-term forecasting (T=720) on Electricity.

We compare the forecasting accuracy for the same future 720 time steps with data from two different look-back windows: (i) the original input L=96 setting (called Close) and (ii) the far input L=96 setting (called Far), which lies before the original 96 time steps. The performance of the SOTA Transformers drops only slightly, indicating that these models capture only similar temporal information from the adjacent time series sequence. Capturing the intrinsic characteristics of a dataset generally does not require a large number of parameters; e.g., one parameter can represent the periodicity. Using too many parameters can even cause overfitting.

Are the self-attention schemes effective for LTSF? We verify whether the complex designs in the existing Transformers (e.g., Informer) are essential. In Table 4, we gradually transform Informer into Linear. First, we replace each self-attention layer with a linear layer, called Att.-Linear, since a self-attention layer can be regarded as a fully-connected layer whose weights change dynamically. Furthermore, we discard the other auxiliary designs (e.g., the FFN) in Informer to leave only embedding layers and linear layers, named Embed + Linear. Finally, we simplify the model to one linear layer. As can be observed, the performance of Informer improves with the gradual simplification, thereby challenging the necessity of these modules.

Methods Informer Att.-Linear Embed + Linear Linear
Exchange
192 1.008 0.759 0.686 0.438
336 1.107 0.921 0.821 0.479
720 1.181 0.902 1.051 0.515

Table 4: The MSE comparisons of gradually transforming Informer into a Linear from the left to right columns.

Can existing LTSF-Transformers preserve temporal order well? Self-attention is inherently permutation-invariant, i.e., agnostic to the order. However, in time-series forecasting, the sequence order often plays a crucial role. We argue that even with positional and temporal embeddings, existing Transformer-based methods still suffer from temporal information loss. In Table 5, we shuffle the raw input before the embedding strategies. Two shuffling strategies are presented: Shuf. randomly shuffles the whole input sequence, and Half-Ex. exchanges the first half of the input sequence with the second half. Interestingly, compared with the original setting (Ori.) on the Exchange Rate, the performance of all Transformer-based methods does not fluctuate even when the input sequence is randomly shuffled. By contrast, the performance of LTSF-Linear is damaged significantly. These results indicate that LTSF-Transformers with different positional and temporal embeddings preserve quite limited temporal relations and are prone to overfitting on noisy financial data, while the simple LTSF-Linear models the order naturally and avoids overfitting with fewer parameters.

For the ETTh1 dataset, FEDformer and Autoformer introduce time series inductive bias into their models, enabling them to extract certain temporal information when the dataset has clearer temporal patterns (e.g., periodicity) than the Exchange Rate. Therefore, the average drops of these two Transformers are 73.28% and 56.91% under the Shuf. setting, where the whole order information is lost. Moreover, Informer suffers less from both the Shuf. and Half-Ex. settings since it has no such temporal inductive bias. Overall, the average drops of LTSF-Linear are larger than those of the Transformer-based methods in all cases, indicating that the existing Transformers do not preserve the temporal order well.

How effective are different embedding strategies? In Table 6, the forecasting errors of Informer largely increase without positional embeddings (wo/Pos.). Without timestamp embeddings (wo/Temp.), the performance of Informer gradually degrades as the forecasting length increases. Since Informer uses a single time step for each token, it is necessary to introduce temporal information in the tokens.

Rather than using a single time step in each token, FEDformer and Autoformer input a sequence of timestamps to embed the temporal information. Hence, they can achieve comparable or even better performance without fixed positional embeddings. However, without timestamp embeddings, the performance of Autoformer declines rapidly because of the loss of global temporal information. Instead, thanks to the frequency-enhanced module proposed in FEDformer to introduce a temporal inductive bias, it suffers less from removing any position/timestamp embeddings.

Is training data size a limiting factor for existing LTSF-Transformers? One may attribute the inferior performance of Transformer-based methods to the small sizes of the benchmark datasets. Unlike computer vision or natural language processing tasks, TSF is performed on collected time series, and it is difficult to scale up the training data size. In fact, the size of the training data would indeed have a significant impact on the model performance. Accordingly, we conduct experiments on Traffic, comparing the performance of the model trained on a full dataset (17,544*0.7 hours), named Ori., with that trained on a shortened dataset
Methods Linear FEDformer Autoformer Informer
Predict Length
Ori. Shuf. Half-Ex. Ori. Shuf. Half-Ex. Ori. Shuf. Half-Ex. Ori. Shuf. Half-Ex.
Exchange
96 0.080 0.133 0.169 0.161 0.160 0.162 0.152 0.158 0.160 0.952 1.004 0.959
192 0.162 0.208 0.243 0.274 0.275 0.275 0.278 0.271 0.277 1.012 1.023 1.014
336 0.286 0.320 0.345 0.439 0.439 0.439 0.435 0.430 0.435 1.177 1.181 1.177
720 0.806 0.819 0.836 1.122 1.122 1.122 1.113 1.113 1.113 1.198 1.210 1.196
Average Drop N/A 27.26% 46.81% N/A -0.09% 0.20% N/A 0.09% 1.12% N/A -0.12% -0.18%
ETTh1
96 0.395 0.824 0.431 0.376 0.753 0.405 0.455 0.838 0.458 0.974 0.971 0.971
192 0.447 0.824 0.471 0.419 0.730 0.436 0.486 0.774 0.491 1.233 1.232 1.231
336 0.490 0.825 0.505 0.447 0.736 0.453 0.496 0.752 0.497 1.693 1.693 1.691
720 0.520 0.846 0.528 0.468 0.720 0.470 0.525 0.696 0.524 2.720 2.716 2.715
Average Drop N/A 81.06% 4.78% N/A 73.28% 3.44% N/A 56.91% 0.46% N/A 1.98% 0.18%
Table 5: The MSE comparisons of models when shuffling the raw input sequence. Shuf. randomly shuffles the whole input sequence; Half-Ex. exchanges the first half of the input sequence with the second half. Each setting is run five times.
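The shuffling results in Table 5 match a basic property one can verify directly: without positional information, self-attention is permutation-equivariant, so any order-agnostic readout of its output cannot distinguish a shuffled window from the original, whereas a linear layer over the flattened window is order-sensitive. A small numpy check of this property (ours, not code from the paper):

```python
# Self-attention without positional encoding is permutation-equivariant:
# shuffling the input rows only shuffles the output rows, so a pooled
# readout sees no difference. A linear map on the raw window does.
import numpy as np

def self_attention(X):
    """Single-head dot-product attention with Q = K = V = X."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # a window of 6 time steps, 4 features
perm = np.arange(6)[::-1]            # a fixed "Shuf.": reverse the window

out, out_shuf = self_attention(X), self_attention(X[perm])
assert np.allclose(out_shuf, out[perm])            # equivariant
assert np.allclose(out_shuf.mean(0), out.mean(0))  # pooled readout is identical

W = rng.normal(size=(2, 24))         # a linear layer on the flattened window
assert not np.allclose(W @ X[perm].ravel(), W @ X.ravel())  # order-sensitive
```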
Table 6: The MSE comparisons of different embedding strategies on Transformer-based methods with look-back window size 96 and forecasting lengths {96, 192, 336, 720}.

Table 8: Comparison of the practical efficiency of LTSF-Transformers under L=96 and T=720 on Electricity. MACs are the number of multiply-accumulate operations. The inference time is averaged over 5 runs.
(8,760 hours, i.e., 1 year), called Short. Unexpectedly, Table 7 presents that the prediction errors with the reduced training data are usually lower. This might be because the whole-year data maintain clearer temporal features than the longer but incomplete data. While we cannot conclude that we should use less data for training, it demonstrates that the training data scale is not the limiting factor.

Methods FEDformer Autoformer
Ori. Short Ori. Short
96 0.587 0.568 0.613 0.594
192 0.604 0.584 0.616 0.621
336 0.621 0.601 0.622 0.621
720 0.626 0.608 0.660 0.650

Table 7: The MSE comparisons of two training data sizes.

Is efficiency really a top-level priority? Existing LTSF-Transformers claim that the O(L^2) complexity of the vanilla Transformer is unaffordable for the LTSF problem. Although they prove able to improve the theoretical time and memory complexity from O(L^2) to O(L), it is unclear whether 1) the actual inference time and memory cost on devices are improved, and 2) the memory issue is unacceptable and urgent for today's GPUs (e.g., an NVIDIA Titan XP here). In Table 8, we compare the average practical efficiencies.

Conclusion and Future Work

Conclusion. This work questions the effectiveness of the emerging favored Transformer-based solutions for the long-term time series forecasting problem. We use an embarrassingly simple linear model, LTSF-Linear, as a DMS forecasting baseline to verify our claims. Note that our contributions do not come from proposing a linear model, but rather from raising an important question, showing surprising comparisons, and demonstrating why LTSF-Transformers are not as effective as claimed in these works from various perspectives. We sincerely hope our comprehensive studies can benefit future work in this area.

Future work. LTSF-Linear has a limited model capacity, and it merely serves as a simple yet competitive baseline with strong interpretability for future research. Consequently, we believe there is great potential for new model designs, data processing, and benchmarks to tackle the LTSF problem.

Acknowledgments

This work was supported in part by Alibaba Group Holding Ltd. under Grant No. TA2015393. We thank the anonymous reviewers for their constructive comments and suggestions.
References

Ariyo, A. A.; Adewumi, A. O.; and Ayo, C. K. 2014. Stock price prediction using the ARIMA model. In 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, 106–112. IEEE.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
Bai, S.; Kolter, J. Z.; and Koltun, V. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
Chevillon, G. 2007. Direct multi-step estimation and forecasting. Journal of Economic Surveys, 21(4): 746–785.
Cirstea, R.-G.; Guo, C.; Yang, B.; Kieu, T.; Dong, X.; and Pan, S. 2022. Triformer: Triangular, Variable-Specific Attentions for Long Sequence Multivariate Time Series Forecasting–Full Version. arXiv preprint arXiv:2204.13767.
Cleveland, R. B. 1990. STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dong, L.; Xu, S.; and Xu, B. 2018. Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5884–5888. IEEE.
Friedman, J. H. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232.
Hamilton, J. D. 2020. Time Series Analysis. Princeton University Press.
Lai, G.; Chang, W.-C.; Yang, Y.; and Liu, H. 2017. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. In International ACM SIGIR Conference on Research and Development in Information Retrieval.
Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.-X.; and Yan, X. 2019. Enhancing the locality and breaking the memory bottleneck of Transformer on time series forecasting. Advances in Neural Information Processing Systems, 32.
Liu, M.; Zeng, A.; Chen, M.; Xu, Z.; Lai, Q.; Ma, L.; and Xu, Q. 2022. SCINet: Time Series Modeling and Forecasting with Sample Convolution and Interaction. In Thirty-sixth Conference on Neural Information Processing Systems.
Liu, S.; Yu, H.; Liao, C.; Li, J.; Lin, W.; Liu, A. X.; and Dustdar, S. 2021a. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; and Guo, B. 2021b. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022.
Salinas, D.; Flunkert, V.; and Gasthaus, J. 2017. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. International Journal of Forecasting.
Taieb, S. B.; Hyndman, R. J.; et al. 2012. Recursive and direct multi-step forecasting: the best of both worlds, volume 19. Citeseer.
Taylor, S. J.; and Letham, B. 2017. Forecasting at Scale. PeerJ Preprints.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; and Sun, L. 2022. Transformers in Time Series: A Survey. arXiv preprint arXiv:2202.07125.
Xu, J.; Wang, J.; Long, M.; et al. 2021. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34.
Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; and Zhang, W. 2021. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, volume 35, 11106–11115. AAAI Press.
Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; and Jin, R. 2022. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning.