Introduction
Time series forecasting aims to build models that predict the future values of a series from its past. This forecasting problem exists widely in many research fields, such as finance, climate forecasting, and medical analysis. In most cases, the time series we deal with are not univariate but multivariate, so the problem is also called multivariate time series forecasting. In this paper, we focus on financial forecasting, specifically stock time series forecasting.
Financial time series, especially stocks', are extremely challenging to predict because of their low signal-to-noise ratio [Laloux et al.2000] and heavy-tailed distributions [Cont2001]. Meanwhile, the predictability of stock market returns remains open and controversial [Malkiel and Fama1970]. Many classic statistical methods have been applied to this problem, such as MA, AR, VAR, and VARMA [Hamilton1994]. There are also many works using machine learning approaches, e.g., Neural Networks [Chakraborty et al.1992] and SVM [Pai and Lin2005], achieving promising results. However, these methods focus on analyzing one single stock. In fact, the information contained in a single stock's time series is often limited. According to the theory of the Capital Asset Pricing Model (CAPM) [Sharpe1964], the returns of all individual stocks are affected by the systemic risk; in other words, they are all affected by the macro market. So there are strong connections among stocks. Due to these connections, much information valuable for forecasting is actually contained in other related stocks' time series, not just an individual stock's. When analyzing stocks independently, it is very difficult to capture all of it. Thus, it is better to process multiple related stocks at the same time.
To leverage more information from related stocks, a straightforward solution is Multi-Task Learning (MTL) [Caruana1997], which is already widely used in text and image applications [He et al.2016, Sun et al.2014]. MTL jointly learns multiple related tasks and leverages the correlations over tasks to improve performance, so it often works better than single-task learning. Some recent works apply MTL to time series forecasting, e.g., [Harutyunyan et al.2017] and [Li et al.2018]. However, these approaches have some limitations: 1) they learn only the shared information but ignore the task-private information: most of them use a single encoding model to learn the shared latent features of all tasks, which makes it easy to ignore useful task-private information; 2) they simply put all latent features together: some other approaches build multiple models to learn both the shared and task-private latent features, but they simply concatenate these features and feed them to the dense layer, instead of integrating them with more knowledge.
To address the problems of these existing works, in this paper we propose a novel multi-series jointly forecasting approach for multiple-stock forecasting, as well as a new attention method to learn an optimized combination of shared and private information. More specifically, in our MTL-based method, each task represents the forecasting of a single stock. Since the shared information alone is not enough, we build multiple networks to learn both the shared and private latent features from the time series of multiple related stocks using MTL. To combine this information with more valuable knowledge, we build an attention model to learn an optimized weighted combination of the features, based on the ideas of the Capital Asset Pricing Model (CAPM) and Attention [Hu, Shen, and Sun2017]. Experimental results on various financial datasets show the proposed method outperforms previous works, including classic methods, single-task methods, and other MTL-based solutions [Abdulnabi et al.2015].
The contributions of this paper are threefold:

To the best of our knowledge, the proposed multi-series jointly forecasting approach is the first work applying multi-task learning to time series forecasting for multiple related stocks.

We propose a novel attention method to learn an optimized combination of shared and task-private latent features based on the idea of CAPM.

We demonstrate in experiments on financial data that the proposed approach outperforms single-task baselines and other MTL-based methods, further improving forecasting performance.
The remainder of the paper is organized as follows: related works are introduced in Section 2. The details of the proposed method are presented in Section 3. Experiments on various datasets are demonstrated in Section 4, including the results and analysis. Finally, we conclude in Section 5.
Related Work
Time Series Forecasting
The study of time series forecasting has a long history in the field of economics. Due to its importance to investing, it remains attractive to researchers from many fields, not only economics but also machine learning and data mining.
Many classic linear stochastic models have been proposed and are widely used, such as AR, ARIMA [Box and Pierce1970] and GARCH [Bollerslev1986]. However, most of these methods pay more attention to the interpretation of the variables than to improving forecasting performance, and they perform poorly on complex time series. To improve performance, Gaussian Processes (GP) are often used [Hwang, Tong, and Choi2016], which work better especially when the time series are sampled irregularly [Cunningham, Ghahramani, and Rasmussen2012].
Building on these methods, some works bring in Machine Learning (ML), e.g., the Gaussian Copula Process Volatility model [Wilson and Ghahramani2010], which brings GP and ML together.
Deep Learning (DL), a representative branch of ML, has made remarkable achievements in many fields in the past few years [Schmidhuber2015], such as computer vision [Krizhevsky, Sutskever, and Hinton2012], natural language processing (NLP) [Collobert and Weston2008, Józefowicz et al.2016] and speech recognition [Sak, Senior, and Beaufays2014]. Recently, many works have applied DL to time series forecasting [Yang et al.2015, Lv et al.2015]. However, there are still few works using deep learning for financial forecasting. For some recent examples, [Ding et al.2015] applied deep learning to event-driven stock market prediction. [Heaton, Polson, and Witte2016] used single-layer autoencoders to compress multivariate financial data. [Neil, Pfeiffer, and Liu2016] presented an augmented LSTM architecture able to process asynchronous series. [Binkowski, Marti, and Donnat2018] proposed autoregressive convolutional neural networks for asynchronous financial time series.
These works share a common limitation: they focus only on the time series of one single stock, or even a univariate time series. Even when they can process the time series of multiple stocks, they still do not make good use of the connections among stocks to extract all the information.
Deep Multi-task Learning
Multi-task Learning (MTL) processes multiple related tasks at the same time, leveraging the correlations over tasks to improve performance. In recent years, it often comes with deep learning, and is then also called Deep Multi-task Learning (DMTL). Generally, if a loss function optimizes multiple objectives at the same time, multi-task learning is effectively being performed [Ruder2017]. It has been successfully applied in many applications of machine learning, including natural language processing [Collobert and Weston2008] and computer vision [Girshick2015].
There are some recent works using DMTL for time series forecasting problems. [Dürichen et al.2015] used multi-task Gaussian processes to process physiological time series. [Jung2015] proposed a multi-task learning approach to learn the conditional independence structure of stationary time series. [Liu et al.2016] used multi-task multi-view learning to predict urban water quality. [Harutyunyan et al.2017] used recurrent LSTM neural networks and multi-task learning to deal with clinical time series. And [Li et al.2018] applied multi-task representation learning to travel time estimation. Moreover, some methods are proposed to learn a shared representation of all the task-private information; e.g., [Misra et al.2016] proposed cross-stitch networks to combine multiple task-private latent features.
These works have some limitations. Firstly, most of them ignore the task-private information, since they only build a single model to learn the shared information of multiple tasks. Secondly, although some consider the task-private information, they do not use it efficiently, since they simply put the latent features together and feed them to the forecasting model.
Methods
To address the limitations of previous works, which focus on a single stock and use only the shared information, we propose a new method based on Deep Multi-Task Learning (DMTL) for financial forecasting. More specifically, to efficiently extract the shared and private information from multiple related stocks, we build multiple networks to learn their latent representations using DMTL. Furthermore, to address the problem of not combining the shared and private information efficiently, we propose an attention method to learn their optimized combination based on the idea of CAPM. We describe the details in the following.
Problem Statement
Firstly, we give a formal definition of the financial time series forecasting problem, which is an autoregressive forecasting problem whose prediction is:

$\hat{x}_{t+1} = \mathbb{E}[x_{t+1} \mid x_t, \dots, x_{t-T+1}] \approx f(x_t, \dots, x_{t-T+1})$   (1)

where $\mathbb{E}$ is the mathematical expectation, $f$ is the approximate function, and $T$ is the length of the past sequence.
The time series could be multivariate, that is, $x_t$ represents the values of multiple series at time $t$:

$x_t = (x_t^1, x_t^2, \dots, x_t^n)$   (2)

where $n$ is the number of series. For example, a stock has multiple price series, such as opening prices, closing prices and so on.
Then, for multi-task learning, assuming that there are $N$ tasks in total, the problem is defined over:

$X^i = (x_1^i, x_2^i, \dots, x_{L_i}^i), \quad i = 1, \dots, N$   (3)

where $X^i$ is the time series of task $i$, and $L_i$ is the length of the time series.
In this paper, different tasks represent the forecasting of different stocks, since different stocks often have different trading behaviours and represent different companies.
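The supervised pairs implied by Eq. (1) can be constructed by sliding a window of length $T$ over each multivariate series. The following is a minimal sketch; `make_windows` and its variable names are our own illustration, not part of the paper:

```python
import numpy as np

def make_windows(series, T):
    """Build (past-window, next-value) pairs from a multivariate series.

    series: array of shape (length, n) -- n parallel series (e.g. open,
    close, high, low, volume for one stock); T: length of the past window.
    Returns X of shape (length-T, T, n) and targets y of shape (length-T, n).
    """
    series = np.asarray(series, dtype=float)
    X = np.stack([series[t:t + T] for t in range(len(series) - T)])
    y = series[T:]  # each target is the value right after its window
    return X, y
```

Each task's series would be windowed this way before being fed to the encoders.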
Multi-series Jointly Forecasting
In order to utilize the connections among multiple related stocks to extract valuable information and improve the forecasting performance, we propose a jointly forecasting approach based on DMTL that processes multiple stocks simultaneously, called Multi-series Jointly Forecasting (MSJF).
According to the theory of CAPM, there are strong connections among stocks. However, these connections are difficult to quantify and describe explicitly in a model. If these connections are utilized, the forecasting performance can be further improved. Therefore, we propose MSJF, whose framework can be found in Figure 1, to leverage the connections among tasks to forecast multiple stocks. Formally, MSJF with $N$ tasks can be defined as:
$\hat{y}^i = f^i(s, p^i), \quad s = E_s(X^1, \dots, X^N), \quad p^i = E^i(X^i), \quad i = 1, \dots, N$   (4)

where

$y$ denotes the training targets (labels), and $\hat{y}$ the forecast values.

$f^i$ is the forecasting model of task $i$, using both the shared and the task-private information.

$s$ is the shared information of all tasks, and $p^i$ is the task-private information of task $i$.

$E_s$ is the shared encoding model, and $E^i$ is the private encoding model of task $i$.
MSJF processes $N$ tasks at the same time, jointly forecasting the time series of related stocks, with each task representing one of these stocks. Due to the connections among stocks, each task can forecast with both the shared and its private information through DMTL.
More specifically, a shared encoding model $E_s$ extracts the shared information $s$ from the time series of all stocks using their connections:

$s = E_s(X^1, X^2, \dots, X^N)$   (5)

Moreover, to make use of its private information, each single task (stock) has its private encoding model $E^i$, extracting its own information $p^i$:

$p^i = E^i(X^i)$   (6)

Recurrent Neural Networks with LSTM cells (LSTM-RNNs) are applied in both the shared and the task-private encoding models due to their excellent ability to extract latent information from series data. Then both the shared and the task-private encoded outputs are used for each forecasting task:

$\hat{y}^i = f^i(s, p^i)$   (7)
Besides the encoding models, MSJF jointly trains all tasks with a joint loss function, defined as:

$L = \sum_{i=1}^{N} \mathrm{MSE}(y^i, \hat{y}^i)$   (8)

where $L$ is the joint loss, $y^i$ is the ground truth of all samples in task $i$, and $\hat{y}^i$ is the forecast values of all samples in task $i$. Mean Square Error (MSE) is used to measure the forecasting performance of each task.
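The forward pass and joint loss of MSJF, Eqs. (7) and (8), can be sketched as follows. This is a simplified illustration, not the paper's implementation: the paper's encoders are LSTM-RNNs, while here each encoder is replaced by mean-pooling plus a linear map to keep the example short, and all names (`encode`, `msjf_forecast`, `W_shared`, ...) are our own:

```python
import numpy as np

def encode(x, W):
    """Stand-in encoder: mean-pool the window (T, d), then a linear map.
    (The paper uses LSTM-RNNs here; a linear map keeps the sketch short.)"""
    return x.mean(axis=0) @ W

def msjf_forecast(windows, W_shared, W_priv, W_out):
    """windows[i]: past window of shape (T, n) for task i.
    The shared encoder sees all tasks' series; each private encoder
    sees only its own task's series, as in Eqs. (5)-(7)."""
    all_series = np.concatenate(windows, axis=1)       # (T, N*n)
    s = encode(all_series, W_shared)                   # shared features
    preds = []
    for i, x in enumerate(windows):
        p = encode(x, W_priv[i])                       # task-private features
        preds.append(np.concatenate([s, p]) @ W_out[i])  # Eq. (7)
    return preds

def joint_loss(y_true, y_pred):
    """Eq. (8): sum of per-task MSEs."""
    return sum(np.mean((yt - yp) ** 2) for yt, yp in zip(y_true, y_pred))
```

In training, the joint loss would be minimized over all encoder and output weights at once, so gradients from every task flow through the shared encoder.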
Shared-private Attention
To combine the shared and task-private latent features with more valuable knowledge, instead of simply putting them together, we propose an attention model to learn an optimized combination of them based on the idea of CAPM.
Capital Asset Pricing Model [Sharpe1964]: given an asset (e.g., stock) $a$, the relationship between its excess earnings and the excess earnings of the market $m$ can be expressed as:

$\mathbb{E}[r_a] = r_f + \beta_a (\mathbb{E}[r_m] - r_f)$   (9)

where

$\mathbb{E}[r_a]$ is the expected return on the capital asset $a$, and $\mathbb{E}[r_m]$ is the expected return of the market $m$.

$r_f$ is the risk-free return, such as interest arising from government bonds.

$\beta_a$ is the sensitivity of the expected excess return of asset $a$ to the expected excess return of market $m$.
CAPM suggests that the return of a capital asset can be explained by the return of the macro market.
Subsequent work [Jensen1968] shows that there are excess returns in the earnings of a capital asset that exceed the market benchmark portfolio:

$\mathbb{E}[r_a] = \alpha_a + r_f + \beta_a (\mathbb{E}[r_m] - r_f)$   (10)

where $\alpha_a$ is the excess return of asset $a$ that exceeds the market benchmark portfolio. For stocks, the return of a single stock actually receives varying degrees of influence from the macro market (often called Beta) and from its own factors (often called Alpha), and the levels of these influences vary across stocks. If the levels are expressed by weights, then the return of an individual stock can be described as:

$r = w_\beta \cdot r_\beta + w_\alpha \cdot r_\alpha$   (11)

where $r_\beta$ and $r_\alpha$ denote the market-driven and stock-specific components of the return, and $w_\beta$ and $w_\alpha$ their weights.
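To make Eqs. (9)-(10) concrete, beta and Jensen's alpha can be estimated from return series by ordinary least squares. This is a minimal numerical sketch; the function name and the least-squares estimator are our additions, not part of the paper's method:

```python
import numpy as np

def jensen_alpha_beta(r_asset, r_market, r_free=0.0):
    """Estimate beta and Jensen's alpha via least squares on
    r_a - r_f = alpha + beta * (r_m - r_f) + noise, following Eq. (10)."""
    x = np.asarray(r_market) - r_free   # market excess returns
    y = np.asarray(r_asset) - r_free    # asset excess returns
    beta = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
    alpha = y.mean() - beta * x.mean()
    return alpha, beta
```

On noise-free synthetic returns with known alpha and beta, the estimator recovers both exactly, which is a convenient sanity check.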
Similarly, in our DMTL model, each task represents a single stock, so it is also influenced by the market (shared information) and by its own factors (task-private information), and the levels of these influences can differ across tasks. Based on this, we aim to combine the shared and private information according to their levels of influence.
The attention mechanism measures the importance of different objects and learns this importance by weighting them. Therefore, we use an attention model to measure the contributions of the shared information $s$ and the task-private information $p^i$ to its own forecasting task $i$. The model then combines this information with the learned weights and obtains an optimized combination. We call it Shared-private Attention (SPA):

$c^i = w_s^i \odot s + w_p^i \odot p^i$   (12)

where

$w_s^i$ is the weight of the shared latent features for task $i$.

$w_p^i$ is the weight of the task-private latent features for its own task $i$.

$c^i$ is the optimized combination.

Finally, MSJF uses the combined latent features to jointly forecast the time series of multiple related stocks, which is called Multi-series Jointly Forecasting with Shared-private Attention (SPA-MSJF):

$\hat{y}^i = f^i(c^i)$   (13)
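The combination step of SPA, Eq. (12), can be sketched as follows. This is an illustrative simplification: in the paper the weights are learned by an attention model, whereas here two given scores are simply normalized with a softmax so that the shared and private weights sum to one; `spa_combine` and its arguments are hypothetical names:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

def spa_combine(s, p, score_s, score_p):
    """Weight the shared features s and the task-private features p
    by normalized attention scores and combine them, as in Eq. (12).
    In the full model the scores themselves would be learned per task."""
    w_s, w_p = softmax(np.array([score_s, score_p]))
    return w_s * s + w_p * p
```

With equal scores the two sources contribute equally; as one score grows, its features dominate the combination, mirroring how Beta- and Alpha-like influences can differ across stocks.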
Discussions
Table 1: Description of the datasets

Dataset  Period  Days  Tasks  Time series per task  Total
Banks  2010-10 to 2018-08  1937  ICBC, ABC, BOC, CCB  5  20
Securities  2010-02 to 2018-08  2122  CITIC, CM, HAI, GF, HUA, EB  5  30
Shipping  2011-11 to 2017-12  2233  HK, LB, BK, SP, SH  8  40
Differences from the Previous Works
MSJF is an approach to jointly forecast the time series of multiple related stocks based on DMTL, while most previous works focus only on a single stock. Moreover, most previous works in DMTL simply put all latent features together, while we aim to combine them with more useful knowledge. Since we mainly focus on financial data, inspired by the ideas of CAPM and Attention, we propose a new method, SPA, to learn an optimized combination of all latent features.
MSJF for Other Types of Time Series
Although this paper mainly focuses on stock time series, the proposed method can be applied to other financial data, because similar connections exist there as well: for example, among several companies with trade relationships, or between the stock market and the bond market, and even in non-financial data. To support this, we conduct experiments on a non-financial dataset of shipping data.
Architecture with More Prior Knowledge
The architecture in Figure 1 is the basic MSJF. In this framework, more prior knowledge can actually be incorporated into the architecture design. For example, assuming that multiple related stocks from different industries are selected for joint forecasting, the stocks from the same industry first share the local information, and then all stocks share the information of the macro market; see Figure 3. Therefore, prior knowledge of the hierarchical relationships among tasks can be built into a hierarchical model architecture. We also demonstrate this in our experiments.
Experiments
Dataset
Stock data of the Big Four banks in China
In this paper, we focus on forecasting stock time series, so we choose the daily stock trading data of the Big Four banks in China. The details of the dataset are presented in Table 1, and the (closing) stock prices are shown in Figure 4(a).
These four stocks come from the Chinese banking industry, and they are the most representative stocks in this industry. Each stock has 5 time series, including opening prices, closing prices, highest prices, lowest prices and trading volumes.
Stock data of six securities in China
Besides the stocks from the banking industry, we also choose the stock data of six securities in China. Similar to the Big Four banks in China, they are representative in the Chinese securities industry. The details are similar to the banks’ dataset, also presented in Table 1, and the stock (closing) prices are shown in Figure 4(b).
Shipping data
In order to verify the performance of our model on other types of time series data, we choose the shipping data of an American transportation company. The company operates at 350 ports all over the world, and each port stores 4 types of containers. Each type of container has its own daily inventory and demand, so each port has 8 time series. We select the company's top five ports with the most frequent trades. The details of the shipping dataset are also presented in Table 1.
Table 2: Test MSE of MSJF and H-MSJF on the Big Four banks and six securities

Method  Average  ICBC  ABC  BOC  CCB  CITIC  CM  HAI  GF  HUA  EB
MSJF  0.1047  0.0749  0.0748  0.0644  0.0866  0.1235  0.1289  0.1271  0.1137  0.1258  0.1269
H-MSJF  0.0767  0.0441  0.0549  0.0388  0.0558  0.0943  0.1281  0.0933  0.0979  0.0793  0.0802
Table 3: Overall performance comparison (average test MSE) on the three datasets

Method  Banks  Securities  Shipping
MA  0.3217  0.3459  0.8363
ARMA  0.1578  0.2495  0.6643
ST  0.1267  0.2019  0.5769
FS-ST  0.1101  0.2324  0.4746
FS-MT  0.1418  0.2359  0.4729
PS-MTL  0.1148  0.1929  0.4335
MSJF  0.1024  0.1646  0.4171
SPA-MSJF  0.0808  0.1558  0.4078
Table 4: Per-task test MSE on the Big Four banks dataset

Method  ICBC  ABC  BOC  CCB
Single-task  0.1547  0.1033  0.1101  0.1389
MSJF  0.0932  0.0933  0.0919  0.1312
SPA-MSJF  0.0766  0.0815  0.0744  0.0904
Table 5: Per-task test MSE on the six securities dataset

Method  CITIC  CM  HAI  GF  HUA  EB
Single-task  0.2186  0.1974  0.1972  0.2226  0.1794  0.1966
MSJF  0.1738  0.1776  0.1517  0.1555  0.1425  0.1865
SPA-MSJF  0.1683  0.1671  0.1404  0.1692  0.1334  0.1563
Table 6: Per-port test MSE on the shipping dataset

Method  HK  LB  BK  SP  SH
Single-task  0.4843  0.4698  0.6199  0.7847  0.5257
MSJF  0.3508  0.3178  0.4111  0.5497  0.4559
SPA-MSJF  0.3289  0.3206  0.4201  0.5367  0.4328
Experimental Settings
Training and Testing
Since we are dealing with time series data, we adopt sliding training and testing, described as follows:

Training: Select the data of half a year as the training dataset, and train the model for a number of epochs.

Testing: After training, select the data of the one month right after the training dataset as the testing dataset.

Sliding: Slide the training and testing datasets forward for one month, that is, merge the testing dataset into the training dataset, and drop the first month’s training data. Then repeat the training and testing processes.
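The sliding scheme above can be sketched over trading-day indices as follows. The window lengths of 126 and 21 trading days approximate "half a year" and "one month"; the exact values used in the paper are not stated here and these defaults are our assumption:

```python
def sliding_splits(n_days, train_days=126, test_days=21):
    """Rolling evaluation sketch: train on ~half a year of trading days,
    test on the next ~month, then slide both windows forward by one test
    period (merging the old test month into training and dropping the
    oldest month).  Returns a list of (train_indices, test_indices)."""
    splits = []
    start = 0
    while start + train_days + test_days <= n_days:
        train = range(start, start + train_days)
        test = range(start + train_days, start + train_days + test_days)
        splits.append((train, test))
        start += test_days  # slide forward by one test period
    return splits
```

Each split would retrain the model on its training window and report test MSE on the following month, so every test month is strictly out of sample.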
Multi-task Assignments
We make the multi-task assignments on the three datasets as follows:

Stock data of the Big Four banks in China: As mentioned before, there are four forecasting tasks processed by our method in this dataset, each of which forecasts the time series of one single stock.

Stock data of six securities in China: Similar to the above, with six forecasting tasks.

Shipping data: The forecasting tasks are divided by ports, each of which forecasts the multivariate time series of one single port. Our method processes 5 tasks simultaneously.
Baselines
We compare our proposed method with the following baseline methods:

Moving Average Model (MA): This is a very classic statistical method in economics, often used for financial time series forecasting, so it serves as a baseline to illustrate the forecasting performance of our method.

Auto-Regressive and Moving Average Model (ARMA): Similar to MA, this is another classic method in economics and also serves as a baseline.

Single-task Baseline (ST): This serves as a baseline without the benefits of multi-task learning. Each single-task model forecasts the multivariate time series of one stock separately, without sharing the information of other related stocks.

Fully-shared and Single-task Baseline (FS-ST): This also serves as a baseline without the benefits of MTL; it uses the shared information of all tasks but is still trained per task.

Fully-shared and Multi-task Baseline (FS-MT): This serves as a baseline using only the shared information to forecast, similar to the previous works we mentioned, which helps demonstrate the benefits of our multi-model architecture.

Private-shared MTL Baseline (PS-MTL): As our final baseline, we compare to a variant of the method of [Misra et al.2016]. The original method builds multiple private encoding models with a shared embedding layer learning the shared representations of all private latent features, which differs from ours. We adapt their method to this problem as a private-shared MTL baseline.
Results and Analysis
For convenience, we use the following abbreviations for our methods: 1) Multi-series Jointly Forecasting (MSJF); 2) Multi-series Jointly Forecasting with Shared-private Attention (SPA-MSJF).
Overall Performance Comparison
The overall comparison results are shown in Table 3. From these results, we make the following observations: 1) Both MSJF and SPA-MSJF outperform the baseline methods on all datasets, which indicates the effectiveness of the proposed methods; 2) SPA-MSJF is better than MSJF, demonstrating that the proposed SPA model can indeed further improve the performance of MSJF; 3) The proposed methods also work well on the shipping dataset, which shows they can generalize to other kinds of time series data.
Effects of Multiseries Jointly Forecasting
To show the effects of MSJF, we use the experimental results in Tables 3, 4, 5 and 6. Without the benefits of SPA: 1) in Table 3, MSJF outperforms the single-task (ST) and fully-shared & single-task (FS-ST) baselines, and it also outperforms ST on each task in all datasets, as shown in Tables 4, 5 and 6, which suggests the effectiveness of MSJF; 2) MSJF performs better than the fully-shared & multi-task (FS-MT) and private-shared MTL (PS-MTL) baselines, which suggests the effectiveness of the multi-model architecture in MSJF.
Analysis on Sharedprivate Attention
On the basis of MSJF, we propose SPA to learn an optimized combination of shared and task-private latent features. In Table 3, SPA-MSJF outperforms MSJF on the average test MSE on all datasets. And in Tables 4, 5 and 6, SPA-MSJF outperforms MSJF on 12 of the 15 tasks. These results demonstrate the effectiveness of SPA.
We also provide a visualization of the combination weights learned by SPA, shown in Figure 5. From the visualization: 1) the shared weights are larger than the private weights on almost all test data of the financial datasets. This means the shared information plays an important role in financial forecasting, which agrees with the conclusion of CAPM, and indicates that the SPA model can indeed leverage the idea of CAPM to improve financial forecasting; 2) on the shipping data, we find a different pattern: the shared weights are almost the same as the private weights. However, from the results in Table 6, SPA-MSJF is still better than MSJF on average, which shows SPA also works on non-financial data. These results further demonstrate the effectiveness of SPA.
Effects of Hierarchical Architecture
To demonstrate the effects of a hierarchical architecture with prior knowledge, we conduct experiments on the two financial datasets, Banks and Securities. There are 10 forecasting tasks in total for MSJF and Hierarchical MSJF (H-MSJF). From prior knowledge, we know four of them are from the banking industry and the rest are from the securities industry. Thus, in H-MSJF, besides the shared encoding model, the stocks in the same industry also have a local shared encoding model extracting the information of that industry. According to the results in Table 2, H-MSJF outperforms MSJF on each forecasting task, which suggests the effectiveness of the hierarchical architecture with prior knowledge.
Experimental results on the financial datasets demonstrate that our proposed methods, MSJF and SPA-MSJF, outperform previous works, including classic methods, single-task methods and other DMTL-based solutions, with SPA-MSJF performing best. We separately analyze the effects of MSJF and SPA, using the results to show that each indeed further improves forecasting performance. In addition, the experiments on the shipping dataset demonstrate that our methods can work on other kinds of time series data, and we analyze the effects of the hierarchical architecture with prior knowledge.
Conclusion
In this paper, we propose a jointly forecasting approach, MSJF, to process the time series of multiple related stocks based on DMTL, which uses the connections among stocks to improve forecasting performance. Moreover, in order to combine the shared and task-private information more accurately, we propose an attention method, SPA, to learn an optimized combination of them based on the idea of CAPM. We demonstrate our method on financial datasets and on another type of time series dataset, and it outperforms classic methods and other MTL-based methods. In future work, we would like to further improve SPA's ability to combine latent features. And for DMTL, we would like to build hierarchical models to extract the shared information from all tasks more efficiently.
References
 [Abdulnabi et al.2015] Abdulnabi, A. H.; Wang, G.; Lu, J.; and Jia, K. 2015. Multitask cnn model for attribute prediction. IEEE Transactions on Multimedia 17(11):1949–1959.
 [Binkowski, Marti, and Donnat2018] Binkowski, M.; Marti, G.; and Donnat, P. 2018. Autoregressive convolutional neural networks for asynchronous time series. In Dy, J. G., and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of JMLR Workshop and Conference Proceedings, 579–588. JMLR.org.
 [Bollerslev1986] Bollerslev, T. 1986. Generalized autoregressive conditional heteroskedasticity. Journal of econometrics 31(3):307–327.
 [Box and Pierce1970] Box, G. E., and Pierce, D. A. 1970. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American statistical Association 65(332):1509–1526.
 [Caruana1997] Caruana, R. 1997. Multitask learning. Machine learning 28(1):41–75.
 [Chakraborty et al.1992] Chakraborty, K.; Mehrotra, K.; Mohan, C. K.; and Ranka, S. 1992. Forecasting the behavior of multivariate time series using neural networks. Neural networks 5(6):961–970.
 [Collobert and Weston2008] Collobert, R., and Weston, J. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, 160–167. ACM.
 [Cont2001] Cont, R. 2001. Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance 1(2):223–236.
 [Cunningham, Ghahramani, and Rasmussen2012] Cunningham, J.; Ghahramani, Z.; and Rasmussen, C. 2012. Gaussian processes for time-marked time-series data. In Artificial Intelligence and Statistics, 255–263.
 [Ding et al.2015] Ding, X.; Zhang, Y.; Liu, T.; and Duan, J. 2015. Deep learning for event-driven stock prediction. In Yang, Q., and Wooldridge, M., eds., Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, 2327–2333. AAAI Press.
 [Dürichen et al.2015] Dürichen, R.; Pimentel, M. A.; Clifton, L.; Schweikard, A.; and Clifton, D. A. 2015. Multitask gaussian processes for multivariate physiological timeseries analysis. IEEE Transactions on Biomedical Engineering 62(1):314–322.
 [Girshick2015] Girshick, R. 2015. Fast rcnn. In Proceedings of the IEEE international conference on computer vision, 1440–1448.
 [Hamilton1994] Hamilton, J. D. 1994. Time series analysis, volume 2. Princeton university press Princeton, NJ.
 [Harutyunyan et al.2017] Harutyunyan, H.; Khachatrian, H.; Kale, D. C.; and Galstyan, A. 2017. Multitask learning and benchmarking with clinical time series data. CoRR abs/1703.07771.
 [He et al.2016] He, T.; Huang, W.; Qiao, Y.; and Yao, J. 2016. Text-attentional convolutional neural network for scene text detection. IEEE transactions on image processing 25(6):2529–2541.
 [Heaton, Polson, and Witte2016] Heaton, J. B.; Polson, N. G.; and Witte, J. H. 2016. Deep learning in finance. CoRR abs/1602.06561.
 [Hu, Shen, and Sun2017] Hu, J.; Shen, L.; and Sun, G. 2017. Squeeze-and-excitation networks. CoRR abs/1709.01507.
 [Hwang, Tong, and Choi2016] Hwang, Y.; Tong, A.; and Choi, J. 2016. Automatic construction of nonparametric relational regression models for multiple time series. In International Conference on Machine Learning, 3030–3039.
 [Jensen1968] Jensen, M. C. 1968. The performance of mutual funds in the period 1945–1964. The Journal of finance 23(2):389–416.
 [Józefowicz et al.2016] Józefowicz, R.; Vinyals, O.; Schuster, M.; Shazeer, N.; and Wu, Y. 2016. Exploring the limits of language modeling. CoRR abs/1602.02410.
 [Jung2015] Jung, A. 2015. Learning the conditional independence structure of stationary time series: A multitask learning approach. IEEE Transactions on Signal Processing 63(21):5677–5690.
 [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105.
 [Laloux et al.2000] Laloux, L.; Cizeau, P.; Potters, M.; and Bouchaud, J.P. 2000. Random matrix theory and financial correlations. International Journal of Theoretical and Applied Finance 3(03):391–397.
 [Li et al.2018] Li, Y.; Fu, K.; Wang, Z.; Shahabi, C.; Ye, J.; and Liu, Y. 2018. Multitask representation learning for travel time estimation. In International Conference on Knowledge Discovery and Data Mining,(KDD).
 [Liu et al.2016] Liu, Y.; Zheng, Y.; Liang, Y.; Liu, S.; and Rosenblum, D. S. 2016. Urban water quality prediction based on multi-task multi-view learning. In Kambhampati, S., ed., Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, 2576–2581. IJCAI/AAAI Press.
 [Lv et al.2015] Lv, Y.; Duan, Y.; Kang, W.; Li, Z.; Wang, F.Y.; et al. 2015. Traffic flow prediction with big data: A deep learning approach. IEEE Trans. Intelligent Transportation Systems 16(2):865–873.
 [Malkiel and Fama1970] Malkiel, B. G., and Fama, E. F. 1970. Efficient capital markets: A review of theory and empirical work. The journal of Finance 25(2):383–417.

 [Misra et al.2016] Misra, I.; Shrivastava, A.; Gupta, A.; and Hebert, M. 2016. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3994–4003.
 [Neil, Pfeiffer, and Liu2016] Neil, D.; Pfeiffer, M.; and Liu, S.-C. 2016. Phased lstm: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, 3882–3890.

 [Pai and Lin2005] Pai, P.-F., and Lin, C.-S. 2005. A hybrid arima and support vector machines model in stock price forecasting. Omega 33(6):497–505.
 [Ruder2017] Ruder, S. 2017. An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098.
 [Sak, Senior, and Beaufays2014] Sak, H.; Senior, A.; and Beaufays, F. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association.
 [Schmidhuber2015] Schmidhuber, J. 2015. Deep learning in neural networks: An overview. Neural networks 61:85–117.
 [Sharpe1964] Sharpe, W. F. 1964. Capital asset prices: A theory of market equilibrium under conditions of risk. The journal of finance 19(3):425–442.
 [Sun et al.2014] Sun, Y.; Chen, Y.; Wang, X.; and Tang, X. 2014. Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, 1988–1996.
 [Wilson and Ghahramani2010] Wilson, A. G., and Ghahramani, Z. 2010. Copula processes. In Advances in Neural Information Processing Systems, 2460–2468.
 [Yang et al.2015] Yang, J.; Nguyen, M. N.; San, P. P.; Li, X.; and Krishnaswamy, S. 2015. Deep convolutional neural networks on multi-channel time series for human activity recognition. In Ijcai, volume 15, 3995–4001.