From Known to Unknown: Knowledge-guided Transformer for Time-Series Sales Forecasting in Alibaba

by   Xinyuan Qi, et al.

Time series forecasting (TSF) is fundamental to many real-world applications, such as electricity consumption planning and sales forecasting. In e-commerce, accurate time-series sales forecasting (TSSF) can significantly increase economic benefits. TSSF in e-commerce aims to predict the future sales of millions of products. The trend and seasonality of products vary a lot, and promotion activity heavily influences sales. Besides the above difficulties, some future knowledge can be known in advance in addition to the historical statistics. Such future knowledge may reflect the influence of upcoming promotion activities on current sales and help achieve better accuracy. However, most existing TSF methods predict the future based only on historical information. In this work, we make up for this omission of future knowledge. Beyond introducing future knowledge for prediction, we propose Aliformer, based on the bidirectional Transformer, which can utilize historical information, current factors, and future knowledge to predict future sales. Specifically, we design a knowledge-guided self-attention layer that uses the consistency of known knowledge to guide the transmission of timing information, and we propose a future-emphasized training strategy to make the model focus more on the utilization of future knowledge. Extensive experiments on four public benchmark datasets and a proposed large-scale industrial dataset from Tmall demonstrate that Aliformer performs much better than state-of-the-art TSF methods. Aliformer has been deployed for goods selection on Tmall Industry Tablework, and the dataset will be released upon approval.






Time series forecasting (TSF) plays a fundamental guiding role in many real-world applications, such as weather prediction Karevan and Suykens (2020), financial investment Alhnaity and Abbod (2020), and sales forecasting Qi et al. (2019). In e-commerce, accurate time-series sales forecasting (TSSF) helps merchants stock inventory scientifically, and it helps the platform optimize the turnover volume by assigning more exposure to the products with higher predicted sales value. Taking the 2020 Tmall 11.11 Global Shopping Festival as an example, the total turnover of 498.2 billion RMB reflects the long-term collaboration among the supply chain, industries, and e-commerce platforms behind this seemingly short promotion activity. These cooperative operations, including factory production, merchant stocking, and marketing strategy planning, cannot be achieved without a time-series sales forecasting algorithm.

Classical time series models Brockwell and Davis (2009); Box et al. (2015); Seeger, Salinas, and Flunkert (2016); Seeger et al. (2017) make predictions based only on historical observations by analyzing the sequence's periodicity/seasonality and trend. Recent deep learning methods Bai, Kolter, and Koltun (2018) demonstrate effectiveness in both accuracy and generalization ability by encoding abundant features to represent the hidden status at each timestamp, enabling a decoder to capture the long-range dependencies of the hidden status sequence. Regardless of the design differences, both kinds of approaches obey the principle of causality, i.e., the model should consider only the information before timestamp t when predicting the observation value at t.

Figure 1: The time series of sales and price of an eye-shadow in Tmall, which experienced two promotion activities: one ranging from Jan 20 to Jan 25 and another from Mar 5 to Mar 8.

However, some changeable external information can be critical to the final results, and ignoring it can even hurt the model's performance. To illustrate this point, we again use the sales forecasting task as an example: Figure 1 shows the price of an eye-shadow over the past several months in Tmall, along with its corresponding sales volume. Note that the eye-shadow experienced two relatively significant price reductions during promotion activities. Whenever a promotion approaches, the sales volume first decreases (dotted in green) and then bursts when the activity begins (dotted in red), suggesting that future price information can largely influence purchase behaviors: consumers won't buy the product until the promotion starts. Since existing methods can only predict the future through historical information, they will fail to predict the sales volume during promotion periods shaped by such changeable external factors, even though they work well on regular time series. Fortunately, a large part of future knowledge can be known ahead of time, especially for sales forecasting in e-commerce, since sales value is heavily influenced by marketing operations such as setting discount coupons, advertising investment, or participating in a live stream. All these operations can usually be known before they are implemented because they generally require advance application and approval by the platform.

In this work, we consider the impact of known future knowledge on TSSF and make up for the shortcoming of existing methods that cannot leverage future knowledge in advance. We contribute the Knowledge-guided Transformer in Alibaba (Aliformer) to fully make use of the knowledge information to guide the forecasting procedure. Firstly, we define what information can be known in advance in our e-commerce scenario; it can be divided into two categories based on its source, product-related or platform-related, and will be described in more detail later.

Secondly, we describe our model given the future knowledge above. It is mainly built on the encoder architecture of the vanilla Transformer Vaswani et al. (2017), considering the recent prevalence and effectiveness of Transformer-based methods in many sequence-modeling applications. The significant discrepancy between these methods and our Aliformer, shown in Figure 2, is that we allow leakage of future information by applying bidirectional self-attention to the input sequence, with the known future information kept and the unknown information masked by a trainable token embedding. However, directly applying bidirectional self-attention to the whole input sequence can lead to unsatisfactory results. The masked-off part of the series can be seen as embeddings without any semantic information, and taking them as key vectors to compute attention scores against the unmasked representations brings noise into the attention map. Even though our Aliformer compensates the masked token embedding with part of the known future knowledge, the problem still exists if we simply add the knowledge embedding and the masked token embedding together. Therefore, we present AliAttention, which adds a knowledge-guided branch that revises the attention map to minimize the impact of noise. In general, to be consistent with the final task, we should mask off the last part of the sequence, but this may hinder the bidirectional learning ability of our approach because it makes the model rely on historical information rather than future information. Thus, we present the future-emphasized training strategy, which adds span masking in the middle of the sequence to emphasize the importance of future knowledge. The hyper-parameter analysis further validates the effectiveness of this strategy.

Finally, we conduct extensive experiments on both an e-commerce dataset and public time-series forecasting datasets. The evaluation shows that Aliformer sets the new state-of-the-art performance for time-series sales forecasting. We also notice that there is a lack of real-world benchmark datasets for the TSSF task. Therefore, we elaborately construct the Tmall Merchandise Sales (TMS) dataset, collected from Tmall (the B2C e-commerce platform of Alibaba). TMS includes millions of time-series sales records spanning over 200 days. To our knowledge, it is the first public e-commerce dataset tailored to the time-series sales forecasting task. The dataset will be released upon approval.

In addition, we have implemented Aliformer in a prototype and deployed it on the ODPS system of Alibaba since May 1, 2021. We select the top 1 million products out of billions of products to participate in promotion activities according to the predicted sales value, maximizing the overall turnover for the platform. The evaluation of Aliformer on GMV coverage rate, with a 4.73 absolute-percentage-point gain over the well-known Informer, further illustrates its effectiveness.

Contributions To summarize, the key contributions of this paper are:

  • Method We propose a knowledge-guided transformer (Aliformer) to fully make use of the future knowledge that largely affects sales value. Some strategies are proposed to enhance the ability of knowledge guidance. Extensive evaluations show Aliformer sets the new state-of-the-art performance for time-series sales forecasting.

  • Dataset A large benchmark dataset is collected from the Tmall platform to make up for the lack of an e-commerce dataset for the sales forecasting problem. The dataset will be released upon approval.

  • Application We deploy Aliformer in Alibaba, the largest e-commerce platform in China, and achieve significant performance improvements in the real-world application.

Figure 2: Illustration of the difference between time series forecasting task and time series forecasting with knowledge task.

Related Work

Time Series Forecasting

Various methods have been well developed for TSF. In the past years, a bunch of deep learning-based methods have been proposed to model the temporal dependencies for accurate TSF Lai et al. (2018); Huang et al. (2019); Chang et al. (2018). LSTNet Lai et al. (2018) introduces a convolutional neural network (CNN) with a recurrent-skip structure to extract both local dependencies and long-term trends. DeepAR Salinas et al. (2020) combines traditional autoregressive models and recurrent neural networks (RNNs) to model a probabilistic distribution of future time series. Temporal convolution networks (TCNs) Bai, Kolter, and Koltun (2018); Sen, Yu, and Dhillon (2019) attempt to model the temporal causality with causal convolution and dilated convolution. In addition, attention-based RNNs Shih, Sun, and Lee (2019); Qin et al. (2017) incorporate temporal attention to capture long-term dependencies for forecasting.

Recently, the well-known Transformer Vaswani et al. (2017) has achieved great success in language modeling Devlin et al. (2018), computer vision Parmar et al. (2018), and other fields. There are also many Transformer-based methods for TSF Zhou et al. (2021); Li et al. (2019); Wu et al. (2021). LogTrans Li et al. (2019) introduces causal convolutions into the Transformer and proposes the efficient LogSparse attention. Informer Zhou et al. (2021) proposes another efficient ProbSparse self-attention for long sequence time-series forecasting. Autoformer Wu et al. (2021) empowers the deep forecasting model with inherent progressive decomposition capacity through an Auto-Correlation mechanism.

Sales Forecasting

As a typical application of TSF, sales forecasting plays a vital role in e-commerce platforms and has attracted significant interest. For general business forecasting tasks, Taylor and Letham (2018) propose an analyst-in-the-loop framework named Prophet. For extensive promotion sales forecasting in e-commerce, Qi et al. (2019) propose a GRU-based algorithm to explicitly model the competitive relationship between the target product and its substitute products. Another work, presented by Xin et al. (2019), fuses heterogeneous information into a modified GRU cell to be aware of the status of the pre-sales stage before promotion activities. For new product sales forecasting, Ekambaram et al. (2020) utilize several attention-based multi-modal encoder-decoder models to cope with the sparsity of the historical statistics of new products.

Yet all these approaches mainly focus on leveraging historical statistics or enhanced multi-modal information to improve accuracy, while ignoring the impact of critical known future information on sales forecasting. Unlike the above methods, we allow some future information leakage, yet without breaking the principle of causality, to capture the interesting correlation between sales value and changeable factors.


Unlike the general TSF task, which forecasts future observations based only on historical statistics, future sales in e-commerce are heavily influenced by changeable external factors such as an approaching promotion activity. Fortunately, we can know part of these external factors in advance, so it is essential to predict sales with this future knowledge. In this section, we specify the definition of the research problem and explain the future knowledge and the vanilla self-attention layer for time series forecasting with knowledge. Then we present the proposed AliAttention layer, which treats the knowledge information as guidance. A future-emphasized training strategy is also proposed to make the model focus more on future knowledge. Finally, the panorama of Aliformer is illustrated.

Problem Formulation

Given a sequence of historical statistics and knowledge information, the time-series sales forecasting task aims to forecast the statistics in a future period. Let i denote a product; its historical statistics and knowledge information can be represented as a chronological sequence:

X_i = {(s_1, k_1), (s_2, k_2), ..., (s_T, k_T)},

where s_t represents the historical statistics at the t-th time step (e.g., the historical paid amount) and k_t denotes the knowledge information at the t-th time step (e.g., the price at time t, which may vary at each time step).

For time t, the input x_t can be represented with an embedding layer Emb:

x_t = Emb(s_t, k_t),

where Emb applies an FC layer to the numerical features and a lookup table to the id features, mapping each into R^d. The sum of the numerical and id feature embeddings makes up the input x_t.
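As a concrete illustration, the embedding step can be sketched as follows. This is a minimal sketch with hypothetical shapes and names (`W_num`, `id_table`); the paper does not specify these implementation details:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding size (illustrative)

# Hypothetical embedding layer: an FC projection for numerical features
# and a lookup table for id features; both map into R^D and are summed.
W_num = rng.normal(size=(3, D))       # 3 numerical features -> R^D
id_table = rng.normal(size=(100, D))  # lookup table for 100 ids

def emb(numerical, item_id):
    """Embed one timestamp: FC on numerical features + id lookup."""
    return numerical @ W_num + id_table[item_id]

x_t = emb(np.array([0.5, 1.2, -0.3]), item_id=42)
print(x_t.shape)  # (8,)
```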

Given the historical information of product i, the time-series sales forecasting method that predicts the future sales can be formulated as:

(ŝ_{T+1}, ..., ŝ_{T+τ}) = F(x_1, ..., x_T),

where F means predicting the future based on historical information.

Future Knowledge

Knowledge information can be any factor that potentially determines sales and can be known in advance, categorized into two types: product-related or platform-related.

Product-related knowledge information is intrinsic and describes the product itself, including price, advertising investment, whether it participates in a promotion activity, etc. Platform-related knowledge information is related to the promotion activity, such as the level, time, and category of the activity. We can usually know the knowledge of a future period in advance because the promotion activity is scheduled, and products should follow certain rules in the campaign (e.g., give a certain discount). That is, the future knowledge k_{T+1}, ..., k_{T+τ} can be known in advance.

With the future knowledge involved, the input can be represented as:

x_t = Emb(s_t, k_t) for t ≤ T, and x_t = Emb(m, k_t) for T < t ≤ T + τ,

and the time-series sales forecasting can be formulated as:

(ŝ_{T+1}, ..., ŝ_{T+τ}) = F'(x_1, ..., x_{T+τ}),

where F' denotes predicting the future based on both the historical information and the future knowledge, and m means a default value or a learnable token.
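The construction of this knowledge-augmented input can be sketched as follows. All names are illustrative assumptions, and the frozen `mask_token` stands in for the learnable token m:

```python
import numpy as np

rng = np.random.default_rng(1)
T_hist, T_fut, D = 200, 15, 8  # history length, horizon, embedding size

stat_emb = rng.normal(size=(T_hist, D))          # embedded historical statistics
know_emb = rng.normal(size=(T_hist + T_fut, D))  # knowledge, known for ALL steps
mask_token = rng.normal(size=(D,))               # stand-in for learnable token m

# Future statistics are unknown: substitute the shared mask token, then
# add the (known) future knowledge embedding on top of it.
stats_full = np.vstack([stat_emb, np.tile(mask_token, (T_fut, 1))])
x = stats_full + know_emb  # input sequence of length T_hist + T_fut
print(x.shape)  # (215, 8)
```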

Figure 3: AliAttention layer. The knowledge-guided branch utilizes the pure knowledge information and outputs the knowledge-guided attention to revise the final attention map

Vanilla Self-Attention

The bidirectional encoder from Transformers was first introduced in BERT Devlin et al. (2018), which serves as a language representation model and focuses on learning word embeddings. As shown in Figure 2 (b), we introduce the bidirectional framework into the TSSF task. Statistics and knowledge are represented as vectors with the embedding technique. During training, the future statistics are replaced with a learnable token m. The bidirectional model tries to predict the future sales based on the historical statistics and knowledge with the vanilla self-attention (VSA):

Att^l = softmax( Q^l (K^l)^T / √d ) V^l, with Q^l = W_Q^l H^{l-1}, K^l = W_K^l H^{l-1}, V^l = W_V^l H^{l-1},

where h_t^l is the representation in the l-th VSA layer at time t, √d is a scale factor, and W_Q^l, W_K^l, W_V^l represent the linear layers that compute the query, key, and value in vanilla self-attention, respectively. With vanilla self-attention, information transmission can be bidirectional, and the current observations can be influenced by historical information, future knowledge, and current factors simultaneously. In such a framework, representations can be updated layer by layer with the self-attention mechanism:

H^l = VSA(H^{l-1}).

Since VSA is a position-invariant function, we add a position embedding to x_t to encode the position information explicitly and to assist the time knowledge in k_t. From this perspective, position and time information can serve as knowledge in any time-series dataset.
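A minimal single-head, batch-free sketch of the bidirectional VSA computation described above (weight shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def vsa(H, Wq, Wk, Wv):
    """One vanilla (bidirectional) self-attention layer: every position
    attends to past, present, and future positions alike (no causal mask)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    A = softmax(Q @ K.T / np.sqrt(H.shape[-1]))  # full T x T attention map
    return A @ V

rng = np.random.default_rng(2)
T, D = 10, 8
H = rng.normal(size=(T, D))
Ws = [rng.normal(size=(D, D)) for _ in range(3)]
out = vsa(H, *Ws)
print(out.shape)  # (10, 8)
```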

AliAttention Layer

The encoder of the Transformer can be viewed as stacked VSA layers. Statistics and knowledge are the components of the input x_t. However, we do not actually know the future statistics. The tokenized training strategy therefore introduces noise while computing the attention map, and the learnable token can distort it: mixing information from deterministic values and padded variables makes it unnecessarily difficult for the model to compute the relevance of each time step.

Figure 4: The overview of Aliformer. Original knowledge information is provided to each AliAttention layer.

Therefore, we propose the AliAttention layer, which utilizes the certainty of general knowledge and revises the attention map to minimize the impact of noise. In detail, we modify the vanilla self-attention mechanism and add a knowledge-guided branch that revises the final attention map in the AliAttention layer. As shown in Figure 3, in addition to the vanilla self-attention, we introduce a branch for the pure knowledge information:

The proposed AliAttention takes both the integrated information and the knowledge information as input and computes attention values from each. In practice, for the integrated input H ∈ R^{B×T×d} together with the knowledge series G ∈ R^{B×T×d}, where B is the batch size, T is the length of the time series, and d is the size of the embedding vectors, AliAttention can be formalized as:

Att^l = softmax( Q^l (K^l)^T / √d ), Att_k^l = softmax( Q_g^l (K_g^l)^T / √d ), H^l = ( Att^l + Att_k^l ) V^l,

where h_t^l and g_t^l are the representations in the l-th AliAttention layer at time t of H and G, respectively; G is not updated layer by layer. √d is a scale factor, W_Q^l, W_K^l, W_V^l represent the linear layers that compute the vanilla Q^l, K^l, V^l, and W_{Q_g}^l, W_{K_g}^l are the linear layers that compute the Q_g^l, K_g^l of the knowledge-guided branch. The knowledge series G contains clean, known knowledge without any noise (padded values or learnable tokens); thus the knowledge-guided attention Att_k^l acts as a reviser of the final attention map.

The panorama of Aliformer is presented in Figure 4. Each AliAttention layer takes the integrated information and the original knowledge information as input, and outputs representations with the same dimension, which are fed into the next layer. The integrated information at the first layer is the embedded statistics and knowledge information. To guarantee the guidance of knowledge, the identical original knowledge is explicitly provided to each AliAttention layer.
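A minimal sketch of one AliAttention layer along these lines. How the vanilla map and the knowledge-guided map are combined (summed after the softmax here) is an assumption of this sketch, not a confirmed detail of the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ali_attention(H, G, W, Wg):
    """Sketch of one AliAttention layer (single head, no batch).

    H  : mixed input (history + mask tokens + knowledge), shape (T, d)
    G  : clean knowledge series, fed identically to every layer, shape (T, d)
    W  : (Wq, Wk, Wv) for the vanilla branch
    Wg : (Wq_g, Wk_g) for the knowledge-guided branch
    """
    d = H.shape[-1]
    Q, K, V = (H @ w for w in W)
    Qg, Kg = G @ Wg[0], G @ Wg[1]
    att = softmax(Q @ K.T / np.sqrt(d))      # vanilla attention map
    att_k = softmax(Qg @ Kg.T / np.sqrt(d))  # knowledge-guided reviser
    return (att + att_k) @ V                 # combination is an assumption

rng = np.random.default_rng(3)
T, d = 10, 8
H = rng.normal(size=(T, d))
G = rng.normal(size=(T, d))
W = [rng.normal(size=(d, d)) for _ in range(3)]
Wg = [rng.normal(size=(d, d)) for _ in range(2)]
print(ali_attention(H, G, W, Wg).shape)  # (10, 8)
```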

Future-emphasized Training Strategy

The TSSF task focuses on predicting the sales of a future period. In the training stage, a naive strategy is to train the model with the future statistics replaced by the learnable token and predict the future sales. However, the learnable token may weaken the model's ability to utilize the future knowledge, because it can act as a bias for the future information and make the model rely more on the history while predicting. To mitigate this phenomenon, we propose a future-emphasized training strategy: in the training stage, we introduce span masking Joshi et al. (2020) in addition to the naive strategy. A series trained with span masking tries to predict a period of sales in the middle of the sequence, which urges the model to emphasize the future information. Each series is trained with the naive and span masking strategies with probabilities 1 − p and p, respectively. The effectiveness of our future-emphasized training strategy is confirmed in the experiments.
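The mask-selection logic of the future-emphasized training strategy can be sketched as follows (function name and signature are hypothetical):

```python
import random

def choose_mask(T, horizon, p_span, rng=random):
    """Pick the masked region for one training series: with probability
    p_span, mask a span in the middle of the sequence (span masking);
    otherwise mask the final `horizon` steps, matching the inference-time
    forecasting task (the naive strategy)."""
    if rng.random() < p_span:
        start = rng.randrange(0, T - horizon)  # span somewhere before the tail
    else:
        start = T - horizon                    # tail, as at inference
    return list(range(start, start + horizon))

random.seed(0)
masked = choose_mask(T=215, horizon=15, p_span=0.5)
print(len(masked))  # 15
```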



We conducted extensive experiments to evaluate the proposed model on four public benchmark datasets (ETTh1, ETTm1, ECL, and Kaggle-M5) and one real-world dataset (the Tmall Merchandise Sales dataset, TMS). More details about the datasets are given in the Appendix; here we only give a brief introduction of the TMS dataset.

Tmall Merchandise Sales Dataset

We collect 1.2 million samples of product sales from the Tmall platform, the B2C e-commerce platform of Alibaba. The time-series sales data span from Sep 13, 2020, to Apr 15, 2021 (215 days). Each sample contains 86-dimensional features, including price, category, brand, date, and statistical information on the product, seller, and category. The five features most relevant to products are listed below; they are also the targets in our experiments:

  • item page view (ipv): The daily views count of the product details page.

  • unique visitor (ipv_uv): The unique daily visitors’ count of the product details page.

  • gross merchandise volume (gmv): The daily deal amount of the product.

  • count (ord): The order count of the product sold per day.

  • buyer (byr): The buyer count of the product sold per day.

The scale of TMS is much larger and its content more complex than current standard public datasets. In the actual sales forecasting scenario, richer future information is available in TMS, such as product prices and marketing activity status. We mainly conduct the analysis of our method on TMS in the following sections.

Experimental Details

We select several recent time series forecasting methods as our baselines: (1) LSTNet Lai et al. (2018); (2) LSTMa Bahdanau, Cho, and Bengio (2015); (3) LogTrans Li et al. (2019); (4) Informer Zhou et al. (2021). More experimental details are given in the Appendix to keep the description concise.

Results and Analysis

Comparison with the state-of-the-art

Dataset (dim)   Horizon | Aliformer      | Informer       | LogTrans       | LSTMa          | LSTNet
                        | MSE     MAE    | MSE     MAE    | MSE     MAE    | MSE     MAE    | MSE     MAE
TMS (5)         15      | 0.154   0.229  | 0.321   0.353  | 0.327   0.368  | 0.313   0.354  | 0.283   0.336
Kaggle-M5 (1)   28      | 0.526   0.555  | 0.552   0.568  | 0.544   0.570  | 0.533   0.556  | 0.528   0.561
ETTh (7)        48      | 0.767   0.694  | 1.575   1.086  | 1.952   1.122  | 1.805   1.094  | 3.629   1.697
                168     | 1.480   0.957  | 3.166   1.480  | 3.693   1.642  | 4.449   1.559  | 3.218   1.984
                336     | 1.604   1.056  | 2.933   1.446  | 4.173   1.902  | 3.706   1.500  | 4.606   2.027
                720     | 2.455   1.383  | 2.903   1.410  | 3.276   1.485  | 3.961   1.659  | 4.972   3.847
ETTm (7)        48      | 0.309   0.371  | 0.413   0.444  | 0.424   0.515  | 0.647   0.607  | 1.672   1.072
                96      | 0.324   0.376  | 0.574   0.538  | 0.651   0.694  | 0.788   0.677  | 2.340   1.351
                288     | 0.430   0.452  | 0.823   0.687  | 1.140   1.154  | 1.190   0.908  | 0.980   1.814
                672     | 0.581   0.551  | 1.134   0.868  | 1.588   1.370  | 2.107   1.306  | 1.824   2.758
ECL (321)       48      | 0.304   0.369  | 0.387   0.428  | 0.399   0.455  | 0.486   0.572  | 0.443   0.446
                168     | 0.330   0.396  | 0.393   0.435  | 0.394   0.443  | 0.574   0.602  | 0.381   0.420
                336     | 0.322   0.386  | 0.389   0.424  | 0.380   0.432  | 0.886   0.795  | 0.419   0.477
                720     | 0.396   0.408  | 0.424   0.454  | 0.427   0.466  | 1.676   1.095  | 0.556   0.565

Table 1: Time series forecasting results (MSE / MAE) of all methods on five datasets. The number in brackets is the dimension of the target value; the horizon is the prediction length. The best results are highlighted.

The evaluation results of all the methods on the five datasets are summarized in Table 1. As can be seen, the Transformer-based methods (Aliformer, Informer, and LogTrans) show better results than LSTMa and LSTNet.

The proposed Aliformer achieves consistent state-of-the-art performance on all five datasets and all prediction lengths. In particular, Aliformer greatly outperforms the other methods on the real-world product sales dataset TMS: our method obtains an MSE improvement of 52% (0.321 → 0.154) over Informer and 53% (0.327 → 0.154) over LogTrans. This large improvement reveals that the proposed knowledge-guided AliAttention layer and future-emphasized training strategy can fully use future knowledge for better predictions. The large gain on the TMS dataset and the relatively smaller gains on the other datasets (whose knowledge information includes only position and time information) also reflect that future knowledge is vital for accurate forecasting.
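For reference, the MSE and MAE reported above are the standard metric definitions; a minimal sketch (metric definitions only, not the authors' evaluation code):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over all elements."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error over all elements."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
print(mse(y_true, y_pred), mae(y_true, y_pred))  # 0.4166666666666667 0.5
```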

Ablation Analysis

To verify the effectiveness of the future knowledge and the AliAttention mechanism, we conduct ablation experiments on the TMS dataset for each target value, i.e., ipv, ipv_uv, gmv, ord, and byr. The results for each target value and for all target values are reported in Table 2.

Target    | wo/ future     | wo/ AliAttention | Aliformer (full)
          | MSE     MAE    | MSE     MAE      | MSE     MAE
ipv       | 0.065   0.185  | 0.059   0.176    | 0.059   0.176
ipv_uv    | 0.055   0.168  | 0.049   0.159    | 0.049   0.159
gmv       | 0.639   0.487  | 0.494   0.413    | 0.456   0.391
ord       | 0.130   0.245  | 0.110   0.219    | 0.105   0.213
byr       | 0.124   0.238  | 0.104   0.212    | 0.100   0.206
Overall   | 0.203   0.265  | 0.163   0.236    | 0.154   0.229

Table 2: Ablation results (MSE / MAE) of the proposed Aliformer on the TMS dataset.
  • Effect of future knowledge In this study, we modify the proposed Aliformer to predict without future knowledge, i.e., based only on historical information. As shown in Table 2, this ablation model "wo/ future" achieves a worse overall prediction of 0.203/0.265 MSE/MAE. The significant gap verifies that future knowledge is essential for future sales prediction.

  • Effect of AliAttention layer The knowledge-guided AliAttention layer is a core component of our Aliformer, which efficiently utilizes the consistency of general knowledge in future forecasting. The results in Table 2 between the ablation model "wo/ AliAttention" and the full model show that the knowledge-guided AliAttention layer leverages the known future knowledge with more comprehensive depictions for better prediction.

(a) AliAttention layer numbers.
(b) Span masking probability.
Figure 5: The parameter sensitivity of two components in the proposed Aliformer.

Parameter Sensitivity

We provide the sensitivity analysis of the Aliformer model on TMS

dataset for two important hyperparameters.

AliAttention Layer Numbers: In Figure 5 (a), we find that the Aliformer model achieves better prediction results with more AliAttention layers, but the improvement from adding layers becomes minimal once the model contains 12 AliAttention layers. Thus, we set the number of AliAttention layers to 12 on the TMS dataset. Span Masking Probability p: As shown in Figure 5 (b), increasing this probability first brings lower MSE and MAE, but increasing it further degrades the performance: when the probability becomes too large, the model learns to leverage the future knowledge but neglects the actual task of predicting the final period of sales. There is no significant performance change within a moderate range around the chosen value, and we set the span masking probability to 50% in practice. These parameter sensitivity experiments demonstrate the capacity and robustness of the Aliformer model.

(a) Forecasting comparison results.
(b) Forecasting results with varying sales price.
Figure 6: Visualization results for case study on TMS.

Visualization analysis for case study

We provide visualization results for a case study on the TMS dataset, including the forecasting comparison results and the forecasting results of the proposed Aliformer when varying the target product's sales price. In Figure 6 (a), we show the 15-day-ahead forecasts of LSTMa, Informer, and the proposed Aliformer. As we can see, Aliformer achieves better predictions than the other methods. When predicting the target value for the next 15 days, Aliformer catches the sudden alteration before the marketing activity and produces good forecasts with the help of both the historical temporal patterns and the future knowledge.

In Figure 6 (b), we study the effect of the product's sales price on its gmv. We perform this analysis by varying one product's sales price during a promotion activity, showing the prediction results when we set the sales price to the real price, the daily price (the average price of the preceding seven days), and a discount price (20% off). The results show that a discount price increases sales during the promotion activity but inhibits sales before the activity. This phenomenon conforms to the general observation that customers tend to buy products at a lower sales price and delay their purchases until the promotion campaign starts. Such an ability can also advise sellers on setting prices in marketing activities to achieve their desired sales.

More explanations on AliAttention mechanism

(a) PDF of attention weights.
(b) Weight’s proportion.
Figure 7: Visualization results for the AliAttention's two major components: the vanilla attention Att and the knowledge-guided attention Att_k.

Furthermore, we provide a visualization analysis of the proposed AliAttention mechanism to study how its two components, the vanilla attention Att and the knowledge-guided attention Att_k, work for forecasting. In Figure 7 (a), we show the distributions of Att's and Att_k's attention weights in the bottom layer (Layer_1) and the top layer (Layer_12). Additionally, we compute the proportion of Att_k's weights to Att's weights in different stacked layers, exhibited in Figure 7 (b). As shown, the knowledge-guided attention plays a greater role in the bottom layers and a smaller one in the top layers (larger attention values in the bottom layer, smaller in the top layer). This indicates that, as the stacked layers increase, the vanilla attention captures sufficient information under the guidance of general knowledge.

Deployment in Alibaba

To verify the effectiveness of the proposed method in a real-world scenario, Aliformer has been deployed since May 1, 2021, for goods selection for extensive promotion activities in Tmall. Specifically, we sort billions of items according to the predicted sales value for the next 15 days, based on the past 200 days of historical statistics and extra knowledge. The top 1 million products are then selected to participate in the platform's promotion activities to obtain additional user exposure and clicks, maximizing the overall GMV of the platform. In practice, Informer was chosen as our baseline since it significantly improved upon our original tree-based algorithm. We conducted the comparison during the past Tmall 618 shopping festival, from Jun 1, 2021 to Jun 20, 2021, by pre-selecting the candidates sorted by Aliformer and Informer. The evaluation shows that the overall sales volume of products selected by Aliformer covers 74.96% of the overall GMV, a 4.73 absolute-percentage-point gain over Informer (70.23% of the overall GMV), suggesting that Aliformer can bring tremendous profit to the e-commerce platform.
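The GMV coverage rate used above can be illustrated with a small sketch (a hypothetical toy computation, not the production pipeline): rank products by predicted sales, take the top k, and measure the share of realized GMV they capture:

```python
def gmv_coverage(predicted, realized, k):
    """Share of total realized GMV captured by the top-k products
    ranked by predicted sales value."""
    order = sorted(range(len(predicted)), key=lambda i: -predicted[i])
    top = order[:k]
    return sum(realized[i] for i in top) / sum(realized)

# Toy numbers: 4 products, predicted vs. realized GMV.
pred = [5.0, 1.0, 3.0, 0.5]
real = [40.0, 10.0, 30.0, 20.0]
print(round(gmv_coverage(pred, real, k=2), 2))  # 0.7
```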


In this work, we present a knowledge-guided transformer (Aliformer) for the TSSF task. The proposed method makes full use of future knowledge and significantly improves sales forecasting accuracy in e-commerce. Extensive experiments demonstrate its effectiveness. We also tailor the TMS dataset to make up for the lack of e-commerce benchmark datasets for the TSSF problem. The deployment of Aliformer further achieves significant performance improvements in a real-world application.


Datasets Description

Here we provide a detailed introduction to the four public datasets used in our experiments and to the TMS dataset collected from a real-world e-commerce application.

Public Benchmark Datasets

  • Electricity Transformer Temperature (ETT) datasets contain the load and oil temperature of electricity transformers from one county of China. Two separate datasets, ETTh at 1-hour granularity and ETTm at 15-minute granularity, were collected between July 2016 and July 2018 Zhou et al. (2021). The training, validation, and test sets are split in chronological order by the ratio of 6:2:2.

  • Electricity Consuming Load (ECL) dataset records the hourly electricity consumption (kWh) of 321 clients from 2012 to 2014. The training, validation, and test sets are split by the ratio of 7:1:2.

  • Kaggle-M5 dataset, generously made available by Walmart in the Kaggle M5 competition, involves the unit sales of various products sold in the USA. This dataset contains the past unit sales as well as calendar-related features. Kaggle-M5 covers the unit sales of 3,049 products sold across ten stores; we therefore convert the dataset into 30,490 product series. Since Kaggle-M5's test set is unavailable, we train models on the first 1,913 days and validate them on the following 28 days.
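The chronological splits used above (6:2:2 for ETT, 7:1:2 for ECL) can be sketched as follows; the helper name and integer-ratio interface are our own, not from the paper.

```python
def chronological_split(series, ratios=(6, 2, 2)):
    """Split a time series into train/val/test sets in chronological order.

    series: a sequence ordered by time.
    ratios: relative sizes of the three splits, e.g. (6, 2, 2) or (7, 1, 2).
    """
    total = sum(ratios)
    n = len(series)
    i = n * ratios[0] // total                # end of the training span
    j = n * (ratios[0] + ratios[1]) // total  # end of the validation span
    return series[:i], series[i:j], series[j:]
```

Chronological (rather than random) splitting matters for these benchmarks: it prevents the model from training on observations that occur after the validation or test period.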

(a) Sales distribution of TMS dataset.
(b) Top 15 categories of TMS dataset.
Figure 8: Data distribution of TMS dataset.

Tmall Merchandise Sales Dataset

The TMS dataset is a vast real-world dataset of product sales collected from the Tmall platform, the B2C e-commerce platform of Alibaba. To ensure its validity, we sample 1.2 million products out of billions by filtering out long-tail products (products with low exposure) based on preset thresholds on the historical click number and sales volume. The training, validation, and test sets are randomly divided 80/20/20 by item id. For each product, we collect its time-series data from Sep 13, 2020 to Apr 15, 2021 (215 days). The period spans over half a year, during which dozens of promotion activities were held. In the experiment, we utilize the past 200 days' information and the future known knowledge to predict the following 15 days' sales volume. The products in the TMS dataset fall into 107 categories and 149,582 brands, and the sales value ranges from 0 to 17 after the transformation function. We present the top 15 categories and the sales distribution in Figure 8.
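Under this setup, each 215-day TMS series yields one (input, target) pair: 200 days of history as model input and the following 15 days of sales as the prediction target. A minimal sketch (the helper name and interface are hypothetical):

```python
def make_sample(series, context_len=200, horizon=15):
    """Split one product's daily series into model input and target.

    series: per-day values for one product, oldest first (215 days in TMS).
    Returns (x, y): the first context_len days and the next horizon days.
    """
    assert len(series) >= context_len + horizon, "series too short"
    x = series[:context_len]                        # historical statistics
    y = series[context_len:context_len + horizon]   # sales to predict
    return x, y
```

Note that in Aliformer the future knowledge (e.g. planned promotion status) for the 15-day horizon is also fed to the model alongside `x`; this sketch only illustrates the windowing of the target series.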

Figure 9: The predictions of Aliformer, Informer, LogTrans, LSTMa and LSTNet on the TMS dataset. The red/blue curves stand for sequences of the prediction/ground truth.

Each sample contains 86-dimensional features, which can be divided into three groups:

  • ID: The primary key of each sample.

  • Sparse Features: A feature set of nine dimensions, including category, brand, activity level, date context, etc.

  • Dense Features: 76 dimensions, including product properties and statistical information of the product, seller, and category.

Details of the 86-dimensional features are shown in Table 3. As we can see, TMS (Tmall Merchandise Sales) is much larger and more complex than the current standard public time-series datasets. In the actual sales forecasting scenario, richer future information is available in TMS, such as product prices and the status of marketing activities. More inspiring studies are expected on TMS.

Experimental Details


We select several recent time series forecasting methods as our baselines: (1) LSTNet Lai et al. (2018), which introduces a CNN with a recurrent-skip structure to extract both long- and short-term temporal patterns; (2) LSTMa Bahdanau, Cho, and Bengio (2015), which introduces additive attention into the encoder-decoder architecture; (3) LogTrans Li et al. (2019), a variation of the Transformer based on causal convolutions and the LogSparse attention; (4) Informer Zhou et al. (2021), the latest state-of-the-art Transformer-based model using the ProbSparse self-attention and a generative-style decoder.

Evaluation Metrics

In the experiments, two metrics are used for performance evaluation: Mean Squared Error (MSE) and Mean Absolute Error (MAE). Both metrics are computed on each prediction window with stride = 1 over the whole set.
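The stride-1 windowed evaluation can be sketched as below; the function and its interface are our assumptions, not the authors' evaluation code.

```python
import numpy as np

def windowed_errors(y_true, y_pred, window, stride=1):
    """MSE and MAE averaged over all prediction windows.

    Slides a window of the given length over aligned truth/prediction
    sequences (stride 1 by default) and averages per-window errors.
    """
    mses, maes = [], []
    for s in range(0, len(y_true) - window + 1, stride):
        t = np.asarray(y_true[s:s + window], dtype=float)
        p = np.asarray(y_pred[s:s + window], dtype=float)
        mses.append(np.mean((t - p) ** 2))
        maes.append(np.mean(np.abs(t - p)))
    return float(np.mean(mses)), float(np.mean(maes))
```

Because windows overlap, an error at one time step contributes to every window covering it, which weights interior time steps slightly more than the endpoints of the sequence.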

Implementation Setting

The five prediction targets in the experiments are listed below:

  • ipv_1d (ipv)

  • ipv_uv_1d (ipv_uv)

  • pay_ord_amt_1d (gmv)

  • pay_ord_cnt_1d (ord)

  • pay_ord_byr_cnt_1d (byr)

We conduct a grid search over the validation set to tune the hyper-parameters within preset ranges. We set the prediction length progressively for ETTh and ECL, ETTm, Kaggle-M5, and TMS. The number of AliAttention layers used in the proposed model is selected by grid search; each AliAttention layer consists of a 12-head attention block. Our proposed method is optimized with the Adam optimizer Kingma and Ba (2015). The training process is stopped after 20 epochs. The comparison methods are employed as recommended, and the batch size is 512. All experiments are repeated five times, and we report the averaged results. All the models are implemented in PyTorch Paszke et al. (2019) and trained/tested on 8 Nvidia V100 32GB GPUs.
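The hyper-parameter grid search described above can be sketched as follows, assuming a `train_eval` callback that trains the model under a given configuration and returns its validation MSE (both the callback interface and the selection criterion are our assumptions):

```python
from itertools import product

def grid_search(train_eval, grid):
    """Exhaustive grid search over hyper-parameter ranges.

    train_eval: callable taking a config dict, returning validation MSE.
    grid: dict mapping each hyper-parameter name to its candidate values.
    Returns the best configuration and its validation MSE.
    """
    best_cfg, best_mse = None, float("inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        mse = train_eval(cfg)  # train on the training set, score on validation
        if mse < best_mse:
            best_cfg, best_mse = cfg, mse
    return best_cfg, best_mse
```

For example, a grid such as `{"layers": [2, 4, 8, 12], "lr": [1e-4, 1e-3]}` would train eight configurations and keep the one with the lowest validation error.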

Table 3: 86-dimensional features of the TMS dataset. item_id is the primary key of each sample. Sparse features are categorized into product-related and platform-related. Dense features contain product properties, product statistics, seller statistics, cate statistics, and seller_cate statistics.

More Visualization for Case Study

Figure 9 presents the sales values predicted by 5 models for two products randomly selected from the TMS dataset. Both products participated in the same marketing campaign during the 7th-9th days of the period. For the first product, the ground-truth sales value (blue curve) increases slightly during the 1st-4th days and decreases on the 5th-6th days due to activity inhibition. When the promotion activity began, it quickly reached its maximum, gradually decreased over the following days, and reached a stable point. The second product's sales value shows a similar pattern: it gradually decreased during the 1st-6th days, then burst in the 7th-9th days with a slow decrement. Its sales value burst again on the 15th day, since the seller gave a relatively large discount on this product.

The comparisons among methods demonstrate the effectiveness of Aliformer not only in fitting trends but also in handling pulses. Here we analyze why the other baselines failed on these two cases. For Informer, the masked attention mechanism in its decoder blocks information after the prediction moment, even though its feature extraction on historical information can capture rapidly changing sequence patterns. LogTrans and LSTMa perform generative-style prediction, which is time-consuming and lacks the guidance of future knowledge. Figure 9 shows that LogTrans is effective in the short term (1st-4th days) but accumulates error as the time interval increases; LSTMa tends to learn long-term trends and predicts smoother curves, and thus cannot adapt to the fluctuations, especially during promotion activities; LSTNet is amenable to mining the periodicity of the sequence but cannot recognize complex patterns. Therefore, its predicted trend is contrary to the facts during the 1st-4th days in case 1 and does not fit well in case 2.