Multi-task Learning for Financial Forecasting

by   Tao Ma, et al.

Financial forecasting is a challenging and attractive problem in machine learning. Many classic statistical solutions, as well as deep-learning-based methods, have been proposed for it, yielding encouraging performance. Stock time series forecasting is the most representative problem in financial forecasting. Because of the strong connections among stocks, information valuable for forecasting is contained not only in an individual stock but also in the stocks related to it. However, most previous works focus on a single stock and thus easily ignore the valuable information in others. To leverage more information, in this paper we propose a jointly forecasting approach that processes multiple time series of related stocks simultaneously within a multi-task learning framework. In contrast to previous works, we use multiple networks to forecast multiple related stocks, exploiting their shared and private information simultaneously through multi-task learning. Moreover, we propose an attention method, based on the idea of the Capital Asset Pricing Model (CAPM), that learns an optimized weighted combination of the shared and private information. Experimental results on various data show improved forecasting performance over baseline methods.




Time series forecasting is the task of building models that predict future values of a series from its past. This forecasting problem exists widely in many research fields, such as finance, climate forecasting, and medical analysis. In most cases, the time series we deal with are not univariate but multivariate, so the problem is also called multivariate time series forecasting. In this paper, we focus on financial forecasting, specifically stock time series forecasting.

Financial time series, especially stocks’, are extremely challenging to predict because of their low signal-noise ratio [Laloux et al.2000] and heavy-tailed distributions [Cont2001]. Meanwhile, the predictability of stock market returns remains open and controversial [Malkiel and Fama1970]. Many classic statistical methods have been applied to this problem, such as MA, AR, VAR and VARMA [Hamilton1994]. There are also many works using machine learning approaches, e.g., Neural Networks [Chakraborty et al.1992] and SVM [Pai and Lin2005], achieving promising results. However, these methods focus on analyzing one single stock. In fact, the information contained in a single stock’s time series is often limited. According to the theory of the Capital Asset Pricing Model (CAPM) [Sharpe1964], the returns of all individual stocks are affected by systemic risk; in other words, they are all affected by the macro market. So there are strong connections among stocks. Because of these connections, much information valuable for forecasting is actually contained in related stocks’ time series, not just an individual stock’s. When stocks are analyzed independently, it is very difficult to capture all of it. Thus, it is better to process multiple related stocks at the same time.

To leverage more information from related stocks, a straightforward solution is Multi-Task Learning (MTL) [Caruana1997], which is already widely used in text and image applications [He et al.2016, Sun et al.2014]. MTL jointly learns multiple related tasks and leverages the correlations over tasks to improve performance, so it often works better than single-task learning. Some recent works apply MTL to time series forecasting, e.g., [Harutyunyan et al.2017] and [Li et al.2018]. However, these approaches have some limitations: 1) they learn only the shared information but ignore the task-private information: most of them use a single encoding model to learn the shared latent features of all tasks, which makes it easy to miss useful task-private information; 2) they simply put all latent features together: some other approaches build multiple models to learn both the shared and task-private latent features, but they simply concatenate these features and feed them to a dense layer, instead of integrating them with more knowledge.

To address the problems of these existing works, in this paper we propose a novel multi-series jointly forecasting approach for forecasting multiple stocks, as well as a new attention method that learns an optimized combination of shared and private information. More specifically, in our MTL-based method, each task represents the forecasting of a single stock. Since the shared information alone is not enough, we build multiple networks that learn both the shared and private latent features from the time series of related stocks using MTL. To combine this information with more valuable knowledge, we build an attention model that learns an optimized weighted combination of the features, based on the idea of the Capital Asset Pricing Model (CAPM) and Attention [Hu, Shen, and Sun2017].

Experimental results on various financial datasets show the proposed method outperforms previous works, including classic methods, single-task methods, and other MTL-based solutions [Abdulnabi et al.2015].

The contributions of this paper are threefold:

  • To the best of our knowledge, the proposed multi-series jointly forecasting approach is the first work applying multi-task learning to time series forecasting for multiple related stocks.

  • We propose a novel attention method to learn the optimized combination of shared and task-private latent features based on the idea of CAPM.

  • We demonstrate in experiments on financial data that the proposed approach outperforms single-task baselines and other MTL-based methods, further improving forecasting performance.

The remainder of the paper is organized as follows: related works are introduced in Section 2. The details of the proposed method are presented in Section 3. Experiments on various datasets are demonstrated in Section 4, including the results and analysis. Finally, we conclude in Section 5.

Related Work

Time Series Forecasting

The study of time series forecasting has a long history in the field of economics. Due to its importance to investing, it is still attractive to researchers from many fields, not only economics but also machine learning and data mining.

Many classic linear stochastic models have been proposed and are widely used, such as AR, ARIMA [Box and Pierce1970] and GARCH [Bollerslev1986]. However, most of these methods pay more attention to the interpretation of the variables than to improving forecasting performance, and they perform poorly on complex time series. To improve performance, Gaussian Processes (GP) are often used [Hwang, Tong, and Choi2016]; they work better especially when the time series are sampled irregularly [Cunningham, Ghahramani, and Rasmussen2012].

Building on these methods, some works bring in Machine Learning (ML), e.g., the Gaussian Copula Process Volatility model [Wilson and Ghahramani2010], which brings GP and ML together.

Deep Learning (DL), a representative branch of ML, has made remarkable achievements in many fields in the past few years [Schmidhuber2015], such as computer vision [Krizhevsky, Sutskever, and Hinton2012], natural language processing (NLP) [Collobert and Weston2008, Józefowicz et al.2016] and speech recognition [Sak, Senior, and Beaufays2014]. Recently, many works have applied DL to time series forecasting [Yang et al.2015, Lv et al.2015]. However, there are still few works using deep learning for financial forecasting. For some recent examples, [Ding et al.2015] applied deep learning to event-driven stock market prediction. [Heaton, Polson, and Witte2016] used autoencoders with one single layer to compress multivariate financial data. [Neil, Pfeiffer, and Liu2016] presented an augmented LSTM architecture able to process asynchronous series. [Binkowski, Marti, and Donnat2018] proposed autoregressive convolutional neural networks for asynchronous financial time series.

These works share a common limitation: they focus only on the time series of one single stock, or even a univariate time series. Even when they can process multiple time series of multiple stocks, they still do not make good use of the connections among stocks to extract all the information.

Deep Multi-task Learning

Multi-task Learning (MTL) processes multiple related tasks at the same time, leveraging the correlations over tasks to improve performance. In recent years it often comes with deep learning, and is then also called Deep Multi-task Learning (DMTL). Generally, if your loss function optimizes multiple targets at the same time, you are actually doing multi-task learning [Ruder2017]. It has been successfully applied in many areas of machine learning, including natural language processing [Collobert and Weston2008] and computer vision [Girshick2015].

There are some recent works using DMTL for time series forecasting. [Dürichen et al.2015] used multi-task Gaussian processes to process physiological time series. [Jung2015] proposed a multi-task learning approach to learn the conditional independence structure of stationary time series. [Liu et al.2016] used multi-task multi-view learning to predict urban water quality. [Harutyunyan et al.2017] used recurrent LSTM neural networks and multi-task learning to deal with clinical time series. And [Li et al.2018] applied multi-task representation learning to travel time estimation. Moreover, some methods have been proposed to learn a shared representation of all the task-private information; e.g., [Misra et al.2016] proposed cross-stitch networks to combine multiple task-private latent features.

These works have some limitations. Firstly, most of them ignore the task-private information, since they build only a single model to learn the shared information of multiple tasks. Secondly, even those that consider the task-private information do not use it efficiently: they simply concatenate the latent features and feed them to the forecasting model.


Methodology

To address the limitations of previous works, which focus on a single stock and use only shared information, we propose a new method based on Deep Multi-Task Learning (DMTL) for financial forecasting. More specifically, to efficiently extract the shared and private information from multiple related stocks, we build multiple networks that learn their latent representations using DMTL. Furthermore, to combine the shared and private information efficiently, we propose an attention method that learns their optimized combination based on the idea of CAPM. We describe the details in the following.

Problem Statement

Firstly, we give the formal definition of the financial time series forecasting problem, an autoregressive forecasting problem whose prediction is:

$$\hat{y}_{t+1} = \mathbb{E}\left[\, y_{t+1} \mid y_t, y_{t-1}, \dots, y_{t-k+1} \right] \approx f(y_t, y_{t-1}, \dots, y_{t-k+1})$$

where $\mathbb{E}[\cdot]$ is the mathematical expectation, $f$ is the approximate function, and $k$ is the length of the past sequence.

The time series could be multivariate, that is, $y_t$ represents the values of multiple series at time $t$:

$$y_t = \left( y_t^1, y_t^2, \dots, y_t^n \right)$$

where $n$ is the number of series. For example, a stock has multiple price series, such as opening prices, closing prices and so on.

Then, for multi-task learning, assuming there are $m$ tasks in total, the problem is defined over the set of series:

$$\left\{ Y^1, Y^2, \dots, Y^m \right\}, \quad Y^i = \left( y^i_1, y^i_2, \dots, y^i_T \right)$$

where $Y^i$ is the time series of task $i$, and $T$ is the length of the time series.

In this paper, different tasks represent the forecasting of different stocks, since different stocks often have different trading behaviours and represent different companies.
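As an illustration of the autoregressive setup above, the following minimal Python sketch (helper names are ours, not from the paper) builds (past-window, next-value) training pairs from a single univariate series:

```python
def make_windows(series, k):
    """Build (input, target) pairs for autoregressive forecasting:
    predict y_{t+1} from the past k values y_t, ..., y_{t-k+1}."""
    return [
        (series[t - k + 1 : t + 1], series[t + 1])
        for t in range(k - 1, len(series) - 1)
    ]

# A toy series of 5 observations with k=3 yields two training pairs.
pairs = make_windows([1, 2, 3, 4, 5], k=3)
```

For multivariate series, each window element would be the vector $y_t$ instead of a scalar, but the pairing logic is the same.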

Multi-series Jointly Forecasting

In order to utilize the connections to extract the valuable information from multiple related stocks and improve the forecasting performance, we propose a jointly forecasting approach based on DMTL to process multiple stocks simultaneously, called Multi-series Jointly Forecasting (MSJF).

According to the theory of CAPM, there are strong connections among stocks. However, these connections are too complicated to quantify and describe clearly in a model. If these connections are utilized, the forecasting performance can be further improved. Therefore, we propose MSJF, whose framework is shown in Figure 1, to leverage the connections among tasks to forecast multiple stocks. Formally, MSJF with $m$ tasks can be defined as:

$$\hat{Y}^i = F^i(S, P^i), \quad S = E_s(Y^1, \dots, Y^m), \quad P^i = E^i_p(Y^i), \quad i = 1, \dots, m$$

  • $Y^i$ is the training targets (labels), $\hat{Y}^i$ is the forecasting values.

  • $F^i$ is the forecasting model of task $i$, using both the shared and task-private information.

  • $S$ is the shared information of all tasks, $P^i$ is the task-private information of task $i$.

  • $E_s$ is the shared encoding model, $E^i_p$ is the private encoding model of task $i$.

Figure 1: The architecture of MSJF. It processes multiple time series of related stocks at the same time. The shared encoding model $E_s$ extracts the shared information $S$ from all stocks, while each task (stock) $i$ has its own private encoding model $E^i_p$, extracting its private information $P^i$. Each forecasting model $F^i$ then uses both the shared and task-private information to predict the future values.

MSJF processes $m$ tasks at the same time, jointly forecasting the time series of related stocks, where each task represents one of these stocks. Thanks to the connections among stocks, each task can forecast with both the shared and its private information through DMTL.

More specifically, a shared encoding model $E_s$ extracts the shared information $S$ from the time series of all stocks using their connections:

$$S = E_s(Y^1, Y^2, \dots, Y^m)$$

Moreover, to make use of the private information, each task (stock) $i$ has its own private encoding model $E^i_p$, extracting its private information:

$$P^i = E^i_p(Y^i)$$

Recurrent Neural Networks with LSTM cells (LSTM-RNNs) are applied in both the shared and task-private encoding models, owing to their excellent ability to extract latent information from sequential data. Then both the shared and task-private encoded outputs are used for each forecasting task:

$$\hat{Y}^i = F^i(S, P^i)$$
Besides the encoding models, MSJF jointly trains all tasks by the joint loss function, defined as:

$$\mathcal{L} = \sum_{i=1}^{m} \mathrm{MSE}\left( Y^i, \hat{Y}^i \right)$$

where $\mathcal{L}$ is the joint loss, $Y^i$ is the ground truth of all samples in task $i$ and $\hat{Y}^i$ is the forecasting values of all samples in task $i$. Mean Square Error (MSE) is used to measure the forecasting performance of each task.
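As a concrete sketch of the joint objective, here is a minimal pure-Python version (the encoders and forecasting models are abstracted away; function names are ours):

```python
def mse(y_true, y_pred):
    """Mean squared error over one task's samples."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def joint_loss(targets, predictions):
    """Joint MTL loss: the sum of per-task MSEs, so all m tasks are
    trained together through one objective."""
    return sum(mse(y, y_hat) for y, y_hat in zip(targets, predictions))

# Two tasks: task 1 is forecast perfectly, task 2 has one error of 1.0,
# so the joint loss is 0.0 + (0 + 1)/2 = 0.5.
loss = joint_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 5.0]])
```

Because every task contributes to the same scalar loss, gradients flow through both the shared encoder and each private encoder in one backward pass.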

Figure 2: The architecture of the SPA model. The encoding models extract the latent features $S$ and $P^i$ from the raw information. Then SPA measures their contributions to the forecasting task and learns their weights $w^i_s$ and $w^i_p$. Finally the model combines them with $w^i_s$ and $w^i_p$ into the optimized representation $C^i$.

Shared-private Attention

To combine the shared and task-private latent features with more valuable knowledge, instead of simply concatenating them, we propose an attention model that learns an optimized combination of them based on the idea of CAPM.

Capital Asset Pricing Model [Sharpe1964]: given an asset (e.g., stock) $i$, the relationship between its excess earnings and the excess earnings of the market $M$ can be expressed as:

$$\mathbb{E}[R_i] = R_f + \beta_i \left( \mathbb{E}[R_M] - R_f \right)$$

  • $\mathbb{E}[R_i]$ is the expected return on the capital asset $i$, $\mathbb{E}[R_M]$ is the expected return of the market $M$.

  • $R_f$ is the risk-free return, such as interest arising from government bonds.

  • $\beta_i$ is the sensitivity of the expected excess return of asset $i$ to the expected excess return of market $M$.

CAPM suggests that the return of a capital asset can be explained by the return of the macro market.

Then, subsequent work [Jensen1968] shows that there are excess returns in the earnings of a capital asset beyond the market benchmark portfolio:

$$R_i = \alpha_i + R_f + \beta_i \left( R_M - R_f \right)$$

where $\alpha_i$ is the excess return of asset $i$ that exceeds the market benchmark portfolio. For stocks, the return of a single stock is influenced to varying degrees by the macro market (often called Beta) and by its own factors (often called Alpha), and these levels of influence differ from stock to stock. If the levels are expressed as weights, the return of an individual stock can be described as a weighted combination of the market influence and its own factors.
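To make the CAPM decomposition concrete, here is a small illustrative sketch (helper names are ours) that estimates a stock's Beta and Jensen's Alpha from return series via the usual covariance/variance regression:

```python
def mean(xs):
    return sum(xs) / len(xs)

def beta_alpha(asset_returns, market_returns, risk_free=0.0):
    """Estimate CAPM beta (sensitivity to market excess returns) and
    Jensen's alpha (excess return beyond the market benchmark)."""
    a = [r - risk_free for r in asset_returns]   # asset excess returns
    m = [r - risk_free for r in market_returns]  # market excess returns
    ma, mm = mean(a), mean(m)
    cov = sum((x - ma) * (y - mm) for x, y in zip(a, m)) / len(a)
    var = sum((y - mm) ** 2 for y in m) / len(m)
    beta = cov / var
    alpha = ma - beta * mm
    return beta, alpha

# A stock that always moves twice as much as the market: beta = 2, alpha = 0.
b, al = beta_alpha([0.02, 0.04, 0.06], [0.01, 0.02, 0.03])
```

The SPA model described next mirrors this decomposition in latent space: the shared features play the role of the market term, the private features that of the alpha term.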

Figure 3: The hierarchical architecture of jointly forecasting. For convenience, take four related stocks from two industries as an example. Based on prior knowledge, stocks 1 and 2 are from industry 1, while stocks 3 and 4 are from industry 2. Then stocks 1 and 2 can share the information of industry 1, and likewise stocks 3 and 4 share that of industry 2. All stocks share the macro market information.

Similarly, in our DMTL model, each task represents a single stock, so it is also influenced by the market (shared information) and its own factors (task-private information), and the levels of influence can differ across tasks. Based on this, we aim to combine these pieces of information according to their levels of influence.

An attention mechanism measures the importance of objects and learns that importance by weighting them. Therefore, we use an attention model to measure the contributions of the shared information $S$ and the task-private information $P^i$ to its own forecasting task $i$. The model then combines these pieces of information with their weights and obtains an optimized combination of them. We call it Shared-private Attention (SPA):

$$C^i = w^i_s \cdot S + w^i_p \cdot P^i$$

  • $w^i_s$ is the weight of the shared latent features for task $i$.

  • $w^i_p$ is the weight of the task-private latent features for its own task $i$.

  • $C^i$ is the optimized combination.

Finally, MSJF uses the combined latent features $C^i$ to jointly forecast the time series of multiple related stocks; we call this Multi-series Jointly Forecasting with Shared-private Attention (SPA-MSJF).
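A minimal sketch of the SPA combination step, assuming softmax-normalized attention scores (the scoring network itself is omitted, and all names are ours):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def spa_combine(shared, private, score_shared, score_private):
    """Shared-private Attention sketch: turn two scalar attention scores
    into weights w_s, w_p (summing to 1), then mix the shared latent
    features S and the task-private features P^i element-wise."""
    w_s, w_p = softmax([score_shared, score_private])
    combined = [w_s * s + w_p * p for s, p in zip(shared, private)]
    return combined, (w_s, w_p)

# Equal scores give equal weights, i.e., the plain average of S and P^i.
c, (w_s, w_p) = spa_combine([1.0, 0.0], [0.0, 1.0], 0.0, 0.0)
```

In the full model the two scores would be produced per task by a small learned network, so each stock can lean more on the market (Beta-like) or on its own factors (Alpha-like).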



Description of the dataset

Dataset | Period | Days | Tasks | Time series per task | Total
Banks | 2010-10 to 2018-08 | 1937 | ICBC, ABC, BOC, CCB | 5 | 20
Securities | 2010-02 to 2018-08 | 2122 | CITIC, CM, HAI, GF, HUA, EB | 5 | 30
Shipping | 2011-11 to 2017-12 | 2233 | HK, LB, BK, SP, SH | 8 | 40

Table 1: Datasets used in the experiments. Days is the length of the time series in each dataset. Tasks lists the forecasting tasks in the dataset. Time series per task is the number of time series in one single task.

Differences from the Previous Works

MSJF is an approach that jointly forecasts the time series of multiple related stocks based on DMTL, while most previous works focus only on a single stock. Moreover, most previous DMTL works simply put all latent features together, whereas we aim to combine them with more useful knowledge. Since we mainly focus on financial data, inspired by the ideas of CAPM and Attention, we propose a new method, SPA, to learn the optimized combination of all latent features.

MSJF for Other Types of Time Series

Although this paper mainly focuses on stock time series, the proposed method can be applied to other financial data in which similar connections exist — for example, several companies with trading relationships, or the connections between the stock market and the bond market — and even to non-financial data. To demonstrate this, we conduct experiments on a non-financial dataset, the shipping data.

Architecture with More Prior Knowledge

The architecture in Figure 1 is the basic MSJF. In this framework, more prior knowledge can be incorporated into the architecture design. For example, assuming that related stocks from different industries are selected for jointly forecasting, the stocks from the same industry first share local information, and then all stocks share the information of the macro market, as shown in Figure 3. Thus, prior knowledge of the hierarchical relationships among tasks can be built into a hierarchical model architecture. We also demonstrate this in our experiments.
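The hierarchical variant can be sketched as a task-to-encoder mapping; the two-industry grouping below is illustrative and mirrors Figure 3 (names are ours):

```python
# Illustrative industry grouping, mirroring Figure 3's two-industry example.
INDUSTRIES = {
    "industry1": ["stock1", "stock2"],
    "industry2": ["stock3", "stock4"],
}

def encoders_for(stock, industries):
    """In the hierarchical architecture, each stock's forecaster reads
    from three encoders: the global shared encoder, its industry's
    local-shared encoder, and its own private encoder."""
    industry = next(
        name for name, members in industries.items() if stock in members
    )
    return ["global_shared", f"{industry}_local_shared", f"{stock}_private"]
```

Any prior grouping (industry, sector, country) can be plugged in the same way without changing the rest of the pipeline.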



Datasets

Stock data of the Big Four banks in China

In this paper, we focus on forecasting stock time series, so we choose the daily stock trading data of the Big Four banks in China. The details of the dataset are presented in Table 1, and the stock (closing) price data are shown in Figure 4(a).

These four stocks come from the Chinese banking industry, and they are the most representative stocks in this industry. Each stock has 5 time series, including opening prices, closing prices, highest prices, lowest prices and trading volumes.

Stock data of six securities in China

Besides the stocks from the banking industry, we also choose the stock data of six securities in China. Similar to the Big Four banks in China, they are representative in the Chinese securities industry. The details are similar to the banks’ dataset, also presented in Table 1, and the stock (closing) prices are shown in Figure 4(b).

(a) Stock prices of the Big Four banks in China.
(b) Stock prices of six securities in China.
Figure 4: Stock prices.

Shipping data

In order to verify the performance of our model on other types of time series data, we choose the shipping data of an American transportation company. There are 350 ports all over the world, and each of them stores 4 types of containers. Each type of container has its own daily inventory and demand, so each port has 8 time series. We select the five ports of this company with the most frequent trades. The details of the shipping dataset are also presented in Table 1.

Big Four banks Six securities
MSJF 0.1047 0.0749 0.0748 0.0644 0.0866 0.1235 0.1289 0.1271 0.1137 0.1258 0.1269
H-MSJF 0.0767 0.0441 0.0549 0.0388 0.0558 0.0943 0.1281 0.0933 0.0979 0.0793 0.0802
Table 2: Performance comparison of MSJF and Hierarchical MSJF (H-MSJF) on the financial datasets (Banks and Securities). MSJF and H-MSJF process 10 forecasting tasks simultaneously, four of which are banks’ stocks while the others are securities’. In H-MSJF, besides the shared encoder, the stocks in the same industry (prior knowledge) have a local-shared encoder, extracting the shared information of that industry.
Method     Banks Securities Shipping
MA 0.3217 0.3459 0.8363
ARMA 0.1578 0.2495 0.6643
ST 0.1267 0.2019 0.5769
FSST 0.1101 0.2324 0.4746
FSMT 0.1418 0.2359 0.4729
PS-MTL 0.1148 0.1929 0.4335
MSJF 0.1024 0.1646 0.4171
SPA-MSJF 0.0808 0.1558 0.4078
Table 3: Performance comparison. The forecasting performance is measured by the average testing MSE over all tasks. Our methods are MSJF and SPA-MSJF; the others are baseline methods.
Big Four banks
Single-task 0.1547 0.1033 0.1101 0.1389
MSJF 0.0932 0.0933 0.0919 0.1312
SPA-MSJF 0.0766 0.0815 0.0744 0.0904
Table 4: Forecasting performance on each task of the banks stock dataset.
Six securities
Single-task 0.2186 0.1974 0.1972 0.2226 0.1794 0.1966
MSJF 0.1738 0.1776 0.1517 0.1555 0.1425 0.1865
SPA-MSJF 0.1683 0.1671 0.1404 0.1692 0.1334 0.1563
Table 5: Forecasting performance on each task of the securities stock dataset.
Shipping data
Single-task 0.4843 0.4698 0.6199 0.7847 0.5257
MSJF 0.3508 0.3178 0.4111 0.5497 0.4559
SPA-MSJF 0.3289 0.3206 0.4201 0.5367 0.4328
Table 6: Forecasting performance on each task of the shipping dataset.

Experimental Settings

Training and Testing

Dealing with time series data, we choose sliding training and testing, described as follows:

  • Training: Select the data of half a year as the training dataset, and train the model for a number of epochs.

  • Testing: After training, select the data of the month immediately following the training dataset as the testing dataset.

  • Sliding: Slide the training and testing windows forward by one month, i.e., merge the testing dataset into the training dataset and drop the earliest month of training data. Then repeat the training and testing processes.
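The sliding scheme above can be sketched as follows, treating the data as a list of month-sized chunks (six months of training, one month of testing, slide by one month; helper names are ours):

```python
def sliding_splits(months, train_len=6, test_len=1):
    """Produce (train, test) pairs of month chunks: train on `train_len`
    consecutive months, test on the next `test_len`, then slide the
    whole window forward by `test_len` and repeat."""
    splits = []
    start = 0
    while start + train_len + test_len <= len(months):
        train = months[start : start + train_len]
        test = months[start + train_len : start + train_len + test_len]
        splits.append((train, test))
        start += test_len
    return splits

# Nine months of data yield three train/test rounds.
rounds = sliding_splits(list(range(9)))
```

Each round retrains the model, so every reported test MSE comes from data strictly after its training window, avoiding look-ahead leakage.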

Multi-task Assignments

We make the multi-task assignments on three datasets as follows:

  • Stock data of the Big Four banks in China: As mentioned before, there are four forecasting tasks processed by our method in this dataset, each of which forecasts the time series of one single stock.

  • Stock data of six securities in China: Similar to the above, but with six forecasting tasks.

  • Shipping data: The forecasting tasks are divided by ports, each of which forecasts the multivariate time series of one single port. Our method processes 5 tasks simultaneously.


Baseline Methods

We compare our proposed method with the following baseline methods:

  • Moving Average Model (MA): This is a very classic statistical method in economics, which is often used for financial time series forecasting. So it serves as a baseline to illustrate the forecasting performance of our method.

  • Auto-Regressive and Moving Average Model (ARMA): Similar to MA, this is another classic method in economics, also serves as a baseline.

  • Single-task Baseline (ST): This serves as a baseline without benefits of multi-task learning. Each single-task model forecasts the multivariate time series of one stock separately, not sharing the information of other related stocks.

  • Fully-shared and Single-task Baseline (FSST): It also serves as baseline without benefits of MTL, using the shared information of all tasks but still single-task.

  • Fully-shared and Multi-task Baseline (FSMT): It serves as a baseline using only the shared information to forecast, similar to the previous works we mentioned, which can prove the benefits of our multi-model architecture.

  • Private-shared MTL Baseline (PS-MTL): As our final baseline, we compare to a variant of [Misra et al.2016]. The original method builds multiple private encoding models with a shared embedding layer learning shared representations of all private latent features, which differs from ours. We adapt their method to this problem as a private-shared MTL baseline.

Results and Analysis

For convenience, we use the following abbreviations of our methods: 1) Multi-series Jointly Forecasting (MSJF); 2) Multi-series Jointly Forecasting with Shared-private Attention (SPA-MSJF).

(a) Banks
(b) Securities
(c) Shipping data
Figure 5: The shared weights ($w^i_s$) and task-private weights ($w^i_p$) learned by SPA on the three datasets. Each testing dataset contains the testing samples of one month, and the values shown are averages over each testing dataset.

Overall Performance Comparison

The overall comparison results are shown in Table 3. From these results, we make the following observations: 1) Both MSJF and SPA-MSJF outperform the baseline methods on all datasets, which indicates the effectiveness of the proposed methods; 2) SPA-MSJF is better than MSJF, demonstrating that the proposed SPA model can indeed further improve the performance of MSJF; 3) The proposed methods also work well on the shipping dataset, showing that they generalize to other kinds of time series data.

Effects of Multi-series Jointly Forecasting

To show the effects of MSJF, we use the experimental results in Tables 3, 4, 5 and 6. Without the benefits of SPA: 1) in Table 3, MSJF outperforms the single-task (ST) and fully-shared & single-task (FSST) baselines, and it outperforms ST on each task in all datasets, as shown in Tables 4, 5 and 6, which suggests the effectiveness of MSJF; 2) MSJF performs better than the fully-shared & multi-task (FSMT) and private-shared MTL (PS-MTL) baselines, which suggests the effectiveness of the multi-model architecture in MSJF.

Analysis on Shared-private Attention

On the basis of MSJF, we propose SPA to learn the optimized combination of shared and task-private latent features. In Table 3, SPA-MSJF outperforms MSJF in average test MSE on all datasets. And in Tables 4, 5 and 6, SPA-MSJF outperforms MSJF on 12 of the 15 tasks. These results demonstrate the effectiveness of SPA.

We also visualize the combination weights learned by SPA in Figure 5. From the visualization: 1) the shared weights are larger than the private weights on almost all test data of the financial datasets, meaning the shared information plays an important role in financial forecasting, in line with the conclusion of CAPM; this indicates that the SPA model can indeed leverage the idea of CAPM to improve financial forecasting; 2) on the shipping data we find a different pattern: the shared weights are roughly equal to the private weights. Nevertheless, from the results in Table 6, SPA-MSJF is still better than MSJF on average, showing that SPA can also work on non-financial data. These results further demonstrate the effectiveness of SPA.

Effects of Hierarchical Architecture

To demonstrate the effects of the hierarchical architecture with prior knowledge, we conduct experiments on the two financial datasets, Banks and Securities. There are 10 forecasting tasks in total for MSJF and Hierarchical MSJF (H-MSJF). According to prior knowledge, four of them are from the banking industry and the rest are from the securities industry. Thus, in H-MSJF, besides the shared encoding model, the stocks in the same industry also have a local-shared encoding model, extracting the information of that industry. According to the results in Table 2, H-MSJF outperforms MSJF on each forecasting task, which suggests the effectiveness of the hierarchical architecture with prior knowledge.

Experimental results on the financial datasets demonstrate that our proposed methods, MSJF and SPA-MSJF, outperform the previous works, including classic methods, single-task methods and other DMTL-based solutions, with SPA-MSJF performing best. We separately analyze the effects of MSJF and SPA, using the results to show that each indeed further improves the forecasting performance. In addition, the experiments on the shipping dataset demonstrate that our methods can work on other kinds of time series data, and we analyze the effects of the hierarchical architecture with prior knowledge.


Conclusion

In this paper, we propose a jointly forecasting approach, MSJF, that processes the time series of multiple related stocks based on DMTL and uses the connections among stocks to improve forecasting performance. Moreover, to combine the shared and task-private information more accurately, we propose an attention method, SPA, that learns an optimized combination of them based on the idea of CAPM. We demonstrate our method on financial datasets and on another type of time series dataset, where it outperforms classic methods and other MTL-based methods. In future work, we would like to further improve SPA’s ability to combine latent features and, for DMTL, to build hierarchical models that extract the shared information from all tasks more efficiently.


  • [Abdulnabi et al.2015] Abdulnabi, A. H.; Wang, G.; Lu, J.; and Jia, K. 2015. Multi-task cnn model for attribute prediction. IEEE Transactions on Multimedia 17(11):1949–1959.
  • [Binkowski, Marti, and Donnat2018] Binkowski, M.; Marti, G.; and Donnat, P. 2018. Autoregressive convolutional neural networks for asynchronous time series. In Dy, J. G., and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of JMLR Workshop and Conference Proceedings, 579–588.
  • [Bollerslev1986] Bollerslev, T. 1986. Generalized autoregressive conditional heteroskedasticity. Journal of econometrics 31(3):307–327.
  • [Box and Pierce1970] Box, G. E., and Pierce, D. A. 1970. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American statistical Association 65(332):1509–1526.
  • [Caruana1997] Caruana, R. 1997. Multitask learning. Machine learning 28(1):41–75.
  • [Chakraborty et al.1992] Chakraborty, K.; Mehrotra, K.; Mohan, C. K.; and Ranka, S. 1992. Forecasting the behavior of multivariate time series using neural networks. Neural networks 5(6):961–970.
  • [Collobert and Weston2008] Collobert, R., and Weston, J. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, 160–167. ACM.
  • [Cont2001] Cont, R. 2001. Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance 1(2):223–236.
  • [Cunningham, Ghahramani, and Rasmussen2012] Cunningham, J.; Ghahramani, Z.; and Rasmussen, C. 2012. Gaussian processes for time-marked time-series data. In Artificial Intelligence and Statistics, 255–263.
  • [Ding et al.2015] Ding, X.; Zhang, Y.; Liu, T.; and Duan, J. 2015. Deep learning for event-driven stock prediction. In Yang, Q., and Wooldridge, M., eds., Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, 2327–2333. AAAI Press.
  • [Dürichen et al.2015] Dürichen, R.; Pimentel, M. A.; Clifton, L.; Schweikard, A.; and Clifton, D. A. 2015. Multitask gaussian processes for multivariate physiological time-series analysis. IEEE Transactions on Biomedical Engineering 62(1):314–322.
  • [Girshick2015] Girshick, R. 2015. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, 1440–1448.
  • [Hamilton1994] Hamilton, J. D. 1994. Time series analysis, volume 2. Princeton university press Princeton, NJ.
  • [Harutyunyan et al.2017] Harutyunyan, H.; Khachatrian, H.; Kale, D. C.; and Galstyan, A. 2017. Multitask learning and benchmarking with clinical time series data. CoRR abs/1703.07771.
  • [He et al.2016] He, T.; Huang, W.; Qiao, Y.; and Yao, J. 2016. Text-attentional convolutional neural network for scene text detection. IEEE transactions on image processing 25(6):2529–2541.
  • [Heaton, Polson, and Witte2016] Heaton, J. B.; Polson, N. G.; and Witte, J. H. 2016. Deep learning in finance. CoRR abs/1602.06561.
  • [Hu, Shen, and Sun2017] Hu, J.; Shen, L.; and Sun, G. 2017. Squeeze-and-excitation networks. CoRR abs/1709.01507.
  • [Hwang, Tong, and Choi2016] Hwang, Y.; Tong, A.; and Choi, J. 2016. Automatic construction of nonparametric relational regression models for multiple time series. In International Conference on Machine Learning, 3030–3039.
  • [Jensen1968] Jensen, M. C. 1968. The performance of mutual funds in the period 1945–1964. The Journal of finance 23(2):389–416.
  • [Józefowicz et al.2016] Józefowicz, R.; Vinyals, O.; Schuster, M.; Shazeer, N.; and Wu, Y. 2016. Exploring the limits of language modeling. CoRR abs/1602.02410.
  • [Jung2015] Jung, A. 2015. Learning the conditional independence structure of stationary time series: A multitask learning approach. IEEE Transactions on Signal Processing 63(21):5677–5690.
  • [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105.
  • [Laloux et al.2000] Laloux, L.; Cizeau, P.; Potters, M.; and Bouchaud, J.-P. 2000. Random matrix theory and financial correlations. International Journal of Theoretical and Applied Finance 3(03):391–397.
  • [Li et al.2018] Li, Y.; Fu, K.; Wang, Z.; Shahabi, C.; Ye, J.; and Liu, Y. 2018. Multi-task representation learning for travel time estimation. In International Conference on Knowledge Discovery and Data Mining,(KDD).
  • [Liu et al.2016] Liu, Y.; Zheng, Y.; Liang, Y.; Liu, S.; and Rosenblum, D. S. 2016. Urban water quality prediction based on multi-task multi-view learning. In Kambhampati, S., ed., Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, 2576–2581. IJCAI/AAAI Press.
  • [Lv et al.2015] Lv, Y.; Duan, Y.; Kang, W.; Li, Z.; Wang, F.-Y.; et al. 2015. Traffic flow prediction with big data: A deep learning approach. IEEE Trans. Intelligent Transportation Systems 16(2):865–873.
  • [Malkiel and Fama1970] Malkiel, B. G., and Fama, E. F. 1970. Efficient capital markets: A review of theory and empirical work. The journal of Finance 25(2):383–417.
  • [Misra et al.2016] Misra, I.; Shrivastava, A.; Gupta, A.; and Hebert, M. 2016. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3994–4003.
  • [Neil, Pfeiffer, and Liu2016] Neil, D.; Pfeiffer, M.; and Liu, S.-C. 2016. Phased lstm: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, 3882–3890.
  • [Pai and Lin2005] Pai, P.-F., and Lin, C.-S. 2005. A hybrid arima and support vector machines model in stock price forecasting. Omega 33(6):497–505.
  • [Ruder2017] Ruder, S. 2017. An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098.
  • [Sak, Senior, and Beaufays2014] Sak, H.; Senior, A.; and Beaufays, F. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association.
  • [Schmidhuber2015] Schmidhuber, J. 2015. Deep learning in neural networks: An overview. Neural networks 61:85–117.
  • [Sharpe1964] Sharpe, W. F. 1964. Capital asset prices: A theory of market equilibrium under conditions of risk. The journal of finance 19(3):425–442.
  • [Sun et al.2014] Sun, Y.; Chen, Y.; Wang, X.; and Tang, X. 2014. Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, 1988–1996.
  • [Wilson and Ghahramani2010] Wilson, A. G., and Ghahramani, Z. 2010. Copula processes. In Advances in Neural Information Processing Systems, 2460–2468.
  • [Yang et al.2015] Yang, J.; Nguyen, M. N.; San, P. P.; Li, X.; and Krishnaswamy, S. 2015. Deep convolutional neural networks on multichannel time series for human activity recognition. In Ijcai, volume 15, 3995–4001.