1 Introduction
Momentum as a risk premium in finance has been extensively documented in the academic literature, with evidence of persistent abnormal returns demonstrated across a range of asset classes, prediction horizons and time periods [2, 3, 4]. Based on the philosophy that strong price trends have a tendency to persist, time series momentum strategies are typically designed to increase position sizes with large directional moves and reduce positions at other times. Although the intuition underpinning the strategy is clear, specific implementation details can vary widely between signals – with a plethora of methods available to estimate the magnitude of price trends [5, 6, 4] and map them to actual traded positions [7, 8, 9].
In recent times, deep neural networks have been increasingly used for time series prediction, outperforming traditional benchmarks in applications such as demand forecasting [10], medicine [11] and finance [12]. With the development of modern architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [13], deep learning models have been favoured for their ability to build representations of a given dataset [14] – capturing temporal dynamics and cross-sectional relationships in a purely data-driven manner. The adoption of deep neural networks has also been facilitated by powerful open-source frameworks such as TensorFlow [15] and PyTorch [16] – which use automatic differentiation to compute gradients for backpropagation without having to explicitly derive them in advance. In turn, this flexibility has allowed deep neural networks to go beyond standard classification and regression models. For instance, hybrid methods that combine traditional time series models with neural network components have been observed to outperform pure methods in either category [17] – e.g. the exponential smoothing RNN [18], autoregressive CNNs [19] and Kalman filter variants [20, 21] – while also making outputs easier to interpret by practitioners. Furthermore, these frameworks have also enabled the development of new loss functions for training neural networks, such as adversarial loss functions in generative adversarial networks (GANs)
[22]. While numerous papers have investigated the use of machine learning for financial time series prediction, they typically focus on casting the underlying prediction problem as a standard regression or classification task [23, 24, 25, 12, 26, 19, 27] – with regression models forecasting expected returns, and classification models predicting the direction of future price movements. This approach, however, could lead to suboptimal performance in the context of time series momentum for several reasons. Firstly, sizing positions based on expected returns alone does not take risk characteristics into account – such as the volatility or skew of the predictive returns distribution – which could inadvertently expose signals to large downside moves. This is particularly relevant as raw momentum strategies without adequate risk adjustments, such as volatility scaling [7], are susceptible to large crashes during periods of market panic [28, 29]. Furthermore, even with volatility scaling – which leads to positively skewed returns distributions and long-option-like behaviour [30, 31] – trend following strategies can place more losing trades than winning ones and still be profitable on the whole, as they size up only into large but infrequent directional moves. As such, [32] argue that the fraction of winning trades is a meaningless metric of performance, given that it cannot be evaluated independently from the trading style of the strategy. Similarly, high classification accuracies may not necessarily translate into positive strategy performance, as profitability also depends on the magnitude of returns in each class. This is also echoed in betting strategies such as the Kelly criterion [33], which requires both win/loss probabilities and betting odds for optimal sizing in binomial games. In light of the deficiencies of standard supervised learning techniques, new loss functions and training methods need to be explored for position sizing – accounting for trade-offs between risk and reward.
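The Kelly intuition above can be made concrete with a short sketch. The binomial Kelly fraction $f^* = p - (1-p)/b$ (for win probability $p$ and net odds $b$) shows directly that a classifier's accuracy alone does not determine profitability – the function name and the illustrative numbers below are ours, not part of the strategies discussed:

```python
def kelly_fraction(p, b):
    """Kelly criterion for a binomial bet: p = win probability,
    b = net odds (profit per unit staked on a win)."""
    return p - (1 - p) / b

# High accuracy with poor odds can still warrant no position at all...
assert kelly_fraction(0.52, 0.8) < 0   # 52% accuracy, small wins vs. larger losses
# ...while modest accuracy with favourable odds warrants a positive one.
assert kelly_fraction(0.55, 1.0) > 0
```

The first case mirrors the point made above: a model that is right 52% of the time still loses money overall when average losses exceed average gains.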
In this paper, we introduce a novel class of hybrid models that combines deep-learning-based trading signals with the volatility scaling framework used in time series momentum strategies [8, 1] – which we refer to as Deep Momentum Networks (DMNs). This improves existing methods from several angles. Firstly, by using deep neural networks to directly generate trading signals, we remove the need to manually specify both the trend estimator and position sizing methodology – allowing them to be learnt directly using modern time series prediction architectures. Secondly, by utilising automatic differentiation in existing backpropagation frameworks, we explicitly optimise networks for risk-adjusted performance metrics, i.e. the Sharpe ratio [34], improving the risk profile of the signal on the whole. Lastly, retaining a consistent framework with other momentum strategies also allows us to retain desirable attributes from previous works – specifically volatility scaling, which plays a critical role in the positive performance of time series momentum strategies [9]. This consistency also helps when making comparisons to existing methods, and facilitates the interpretation of different components of the overall signal by practitioners.
2 Related Works
2.1 Classical Momentum Strategies
Momentum strategies are traditionally divided into two categories – namely (multivariate) cross sectional momentum [35, 24] and (univariate) time series momentum [1, 8]. Cross sectional momentum strategies focus on the relative performance of securities against each other, buying relative winners and selling relative losers. By ranking a universe of stocks based on their past returns and trading the top decile against the bottom decile, [35] find that securities that recently outperformed their peers over the past 3 to 12 months continue to outperform on average over the next month. The performance of cross sectional momentum has also been shown to be stable across time [36], and across a variety of markets and asset classes [4].

Time series momentum extends the idea to focus on an asset's own past returns, building portfolios comprising all securities under consideration. This was initially proposed by [1], who describe a concrete strategy which uses volatility scaling and trades positions based on the sign of returns over the past year – demonstrating profitability across 58 different liquid instruments individually over 25 years of data. Since then, numerous trading rules have been proposed – with various trend estimation techniques and methods to map them to traded positions. For instance, [6] documents a wide range of linear and non-linear filters to measure trends and a statistic to test for their significance – although methods to size positions with these estimates are not directly discussed. [8] adopt a similar approach to [1], regressing the log price over the past 12 months against time and using the t-statistic of the regression coefficient to determine the direction of the traded position. While Sharpe ratios were comparable between the two, t-statistic-based trend estimation led to a reduction in portfolio turnover and consequently trading costs. More sophisticated trading rules are proposed in [4] and [37], taking volatility-normalised moving average convergence divergence (MACD) indicators as inputs. Despite the diversity of options, few comparisons have been made between the trading rules themselves, offering little clear evidence or intuitive reasoning to favour one rule over the next. We hence propose the use of deep neural networks to generate these rules directly, avoiding the need for explicit specification. Training them based on risk-adjusted performance metrics, the networks hence learn optimal trading rules directly from the data itself.
2.2 Deep Learning in Finance
Machine learning has long been used for financial time series prediction, with recent deep learning applications studying mid-price prediction using daily data [26], or using limit order book data in a high frequency trading setting [25, 12, 38]. While a variety of CNN and RNN models have been proposed, they typically frame the forecasting task as a classification problem, demonstrating the improved accuracy of their method in predicting the direction of the next price movement. Trading rules are then manually defined in relation to class probabilities – either by using thresholds on classification probabilities to determine when to initiate positions [26], or incorporating these thresholds into the classification problem itself by dividing price movements into buy, hold and sell classes depending on magnitude [12, 38]. In addition to restricting the universe of strategies to those which rely on high accuracy, further gains might be made by learning trading rules directly from the data and removing the need for manual specification – both of which are addressed in our proposed method.
Deep learning regression methods have also been considered in cross-sectional strategies [23, 24], ranking assets on the basis of expected returns over the next time period. Using a variety of linear, tree-based and neural network models, [23] demonstrate the outperformance of non-linear methods, with deep neural networks – specifically 3-layer multilayer perceptrons (MLPs) – having the best out-of-sample predictive performance. Machine learning portfolios were then built by ranking stocks on a monthly basis using model predictions, with the best strategy coming from a 4-layer MLP that trades the top decile against the bottom decile of predictions. In other works, [24] adopt a similar approach using autoencoder and denoising autoencoder architectures, incorporating volatility scaling into their model as well. While the results with basic deep neural networks are promising, they do not consider more modern architectures for time series prediction, such as the LSTM [39] and WaveNet [40] architectures which we evaluate for the DMN. Moreover, to the best of our knowledge, our paper is the first to consider the use of deep learning within the context of time series momentum strategies – opening up possibilities in an alternate class of signals.

Popularised by the success of DeepMind's AlphaGo Zero [41], deep reinforcement learning (RL) has also gained much attention in recent times – prized for its ability to recommend path-dependent actions in dynamic environments. RL is particularly of interest within the context of optimal execution and automated hedging [42, 43] for example, where actions taken can have an impact on future states of the world (e.g. market impact). However, deep RL methods generally require a realistic simulation environment (for Q-learning or policy gradient methods), or a model of the world (for model-based RL) to provide feedback to agents during training – both of which are difficult to obtain in practice.

3 Strategy Definition
Adopting the terminology of [8], the combined returns of a time series momentum (TSMOM) strategy can be expressed as below – characterised by a trading rule or signal $X_t^{(i)} \in [-1, 1]$:

$$r_{t, t+1}^{\text{TSMOM}} = \frac{1}{N_t} \sum_{i=1}^{N_t} X_t^{(i)} \, \frac{\sigma_{\text{tgt}}}{\sigma_t^{(i)}} \, r_{t, t+1}^{(i)} \qquad (1)$$

Here $r_{t, t+1}^{\text{TSMOM}}$ is the realised return of the strategy from day $t$ to $t+1$, $N_t$ is the number of included assets at $t$, and $r_{t, t+1}^{(i)}$ is the one-day return of asset $i$. We set the annualised volatility target $\sigma_{\text{tgt}}$ to be 15% and scale asset returns with an ex-ante volatility estimate $\sigma_t^{(i)}$ – computed using an exponentially weighted moving standard deviation with a 60-day span on $r_{t, t+1}^{(i)}$.

3.1 Standard Trading Rules
In traditional financial time series momentum strategies, the construction of a trading signal is typically divided into two steps: 1) estimating future trends based on past information, and 2) computing the actual positions to hold. We illustrate this in this section using two examples from the academic literature [1, 4], which we also include as benchmarks in our tests.
Moskowitz et al. 2012 [1]
In their original paper on time series momentum, a simple trading rule is adopted as below:
Trend Estimation: $Y_t^{(i)} = r_{t-252, t}^{(i)}$  (2)

Position Sizing: $X_t^{(i)} = \text{sgn}(Y_t^{(i)})$  (3)

This broadly uses the past year's returns as a trend estimate for the next time step – taking a maximum long position when the expected trend is positive (i.e. $\text{sgn}(Y_t^{(i)}) = 1$) and a maximum short position when negative.
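The sign rule combined with the volatility scaling of Equation (1) can be sketched in pandas for a single asset – the synthetic return series, the rolling-sum proxy for the past-year return, and the 15% target below are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0005, 0.01, 600))  # synthetic daily returns

# Ex-ante volatility: 60-day-span exponentially weighted moving std, annualised.
sigma = returns.ewm(span=60).std() * np.sqrt(252)

# Trend estimate: sign of the past year's (252-day) return.
past_year = returns.rolling(252).sum()   # simple proxy for the trailing 1-year return
position = np.sign(past_year)

# Volatility-scaled captured return: yesterday's position and vol estimate
# applied to today's asset return.
sigma_tgt = 0.15
strategy = position.shift(1) * (sigma_tgt / sigma.shift(1)) * returns
```

In the full strategy, the scaled single-asset returns would then be averaged across the $N_t$ assets in the portfolio.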
Baz et al. 2015 [4]
In practice, more sophisticated methods can be used to compute $Y_t^{(i)}$ and $X_t^{(i)}$ – such as the model of [4] described below:

Trend Estimation: $q_t^{(i)} = \text{MACD}(i, t, S, L) \, / \, \text{std}(p_{t-63:t}^{(i)})$  (4)

$\text{MACD}(i, t, S, L) = m(i, S) - m(i, L)$  (5)

$Y_t^{(i)} = q_t^{(i)} \, / \, \text{std}(q_{t-252:t}^{(i)})$  (6)

Here $\text{std}(p_{t-63:t}^{(i)})$ is the 63-day rolling standard deviation of asset $i$ prices $p_{t-63:t}^{(i)}$, and $m(i, S)$ is the exponentially weighted moving average of asset $i$ prices with a timescale $S$ that translates into a half-life of $HL = \log(0.5) / \log(1 - 1/S)$. The moving average convergence divergence (MACD) signal is defined in relation to a short and a long timescale $S$ and $L$ respectively.

The volatility-normalised MACD signal hence measures the strength of the trend, which is then translated into a position size as below:

Position Sizing: $X_t^{(i)} = \phi(Y_t^{(i)})$  (7)

where $\phi(y) = \frac{y \exp(-y^2/4)}{0.89}$. Plotting $\phi(y)$ in Exhibit 1, we can see that positions are increased until $|Y_t^{(i)}| = \sqrt{2} \approx 1.41$, before decreasing back to zero for larger moves. This allows the signal to reduce positions in instances where assets are overbought or oversold – defined to be when $|q_t^{(i)}|$ is observed to be larger than 1.41 times its past year's standard deviation.
Increasing the complexity even further, multiple signals with different timescales can also be averaged to give a final position:

$Y_t^{(i)} = \frac{1}{3} \sum_{k=1}^{3} Y_t^{(i)}(S_k, L_k)$  (8)

where $Y_t^{(i)}(S_k, L_k)$ is the signal of Equation (6) computed with short timescale $S_k$ and long timescale $L_k$.
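A sketch of this class of trading rule in pandas might look as follows – the synthetic price series and the helper name `macd_position` are ours, and we assume the three timescale pairs (8, 24), (16, 48), (32, 96) used in [4]:

```python
import numpy as np
import pandas as pd

def macd_position(prices, short_scales=(8, 16, 32), long_scales=(24, 48, 96)):
    """Volatility-normalised MACD signal, averaged over three timescale pairs,
    mapped through the position response phi(y) = y * exp(-y^2 / 4) / 0.89.
    A timescale S translates to an EWMA half-life HL = log(0.5) / log(1 - 1/S)."""
    ys = []
    for S, L in zip(short_scales, long_scales):
        hl_s = np.log(0.5) / np.log(1 - 1 / S)
        hl_l = np.log(0.5) / np.log(1 - 1 / L)
        macd = prices.ewm(halflife=hl_s).mean() - prices.ewm(halflife=hl_l).mean()
        q = macd / prices.rolling(63).std()   # normalise by 63-day price std
        ys.append(q / q.rolling(252).std())   # normalise by past-year signal std
    y_bar = sum(ys) / len(ys)                 # average the per-timescale signals
    return y_bar * np.exp(-y_bar ** 2 / 4) / 0.89

prices = pd.Series(np.cumsum(np.random.default_rng(1).normal(0.0, 1.0, 600)) + 100)
pos = macd_position(prices)
```

Note that the resulting positions are automatically bounded, since $|\phi(y)|$ peaks at $\sqrt{2}\, e^{-1/2}/0.89 \approx 0.96$.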
3.2 Machine Learning Extensions
As can be seen from Section 3.1, many explicit design decisions are required to define a sophisticated time series momentum strategy. We hence start by considering how machine learning methods can be used to learn these relationships directly from data – alleviating the need for manual specification.
Standard Supervised Learning
In line with numerous previous works (see Section 2.2), we can cast trend estimation as a standard regression or binary classification problem, with outputs:
Trend Estimation: $Y_t^{(i)} = f(u_t^{(i)}; \, \theta)$  (9)

where $Y_t^{(i)}$ is the output of the machine learning model $f(\cdot)$, which takes in a vector of input features $u_t^{(i)}$ and model parameters $\theta$ to generate predictions. Taking volatility-normalised returns as targets, the following mean-squared error and binary cross-entropy losses can be used for training:

$\mathcal{L}_{\text{MSE}}(\theta) = \frac{1}{M} \sum_{\Omega} \left( Y_t^{(i)} - \frac{r_{t, t+1}^{(i)}}{\sigma_t^{(i)}} \right)^2$  (10)

$\mathcal{L}_{\text{binary}}(\theta) = -\frac{1}{M} \sum_{\Omega} \left( I_t^{(i)} \log Y_t^{(i)} + \left(1 - I_t^{(i)}\right) \log\left(1 - Y_t^{(i)}\right) \right)$  (11)

where $\Omega$ is the set of all $M$ possible prediction and target tuples across all assets and time steps. For the binary classification case, $I_t^{(i)} = \mathbb{1}\{r_{t, t+1}^{(i)} > 0\}$ is the indicator function – making $Y_t^{(i)}$ the estimated probability of a positive return.
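The two losses can be written directly in NumPy – a sketch with hypothetical array inputs rather than the actual training pipeline, and function names of our choosing:

```python
import numpy as np

def mse_loss(y_pred, returns, sigma):
    """Mean-squared error against volatility-normalised returns (cf. Eq. 10)."""
    target = returns / sigma
    return np.mean((y_pred - target) ** 2)

def binary_xent_loss(p_pred, returns):
    """Binary cross-entropy on the direction of returns (cf. Eq. 11);
    p_pred is the estimated probability of a positive return."""
    label = (returns > 0).astype(float)
    return -np.mean(label * np.log(p_pred) + (1 - label) * np.log(1 - p_pred))
```

For instance, an uninformative prediction of 0.5 incurs the familiar cross-entropy of $\log 2$ regardless of the realised return direction.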
This still leaves us to specify how trend estimates map to positions, and we do so using a similar form to Equation 3:
Position Sizing:

Regression: $X_t^{(i)} = \text{sgn}(Y_t^{(i)})$  (12)

Classification: $X_t^{(i)} = \text{sgn}(Y_t^{(i)} - 0.5)$  (13)
As such, we take a maximum long position when the expected returns are positive in the regression case, or when the probability of a positive return is greater than 0.5 in the classification case.
Direct Outputs
An alternative approach is to use machine learning models to generate positions directly – simultaneously learning both trend estimation and position sizing in the same function, i.e.:
Direct Outputs: $X_t^{(i)} = f(u_t^{(i)}; \, \theta)$  (14)
Given the lack of direct information on the optimal positions to hold at each step – which is required to produce labels for standard regression and classification models – calibration would hence need to be performed by directly optimising performance metrics. Specifically, we focus on optimising the average return and the Sharpe ratio via the loss functions below:
$\mathcal{L}_{\text{returns}}(\theta) = -\mu_R = -\frac{1}{M} \sum_{\Omega} R(i, t)$  (15)

$\mathcal{L}_{\text{sharpe}}(\theta) = -\frac{\mu_R \times \sqrt{252}}{\sqrt{\sum_{\Omega} R(i, t)^2 / M \, - \, \mu_R^2}}$  (16)

where $R(i, t) = X_{t-1}^{(i)} \, \frac{\sigma_{\text{tgt}}}{\sigma_{t-1}^{(i)}} \, r_{t-1, t}^{(i)}$ is the return captured by the trading rule for asset $i$ at time $t$.
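These performance-based losses can be sketched on a flat array of captured returns – written here in NumPy for clarity; the identical expressions on TensorFlow tensors are differentiable, which is what allows the Sharpe ratio to be optimised directly by backpropagation:

```python
import numpy as np

def avg_return_loss(captured_returns):
    """Negative average captured return (cf. Eq. 15)."""
    return -np.mean(captured_returns)

def sharpe_loss(captured_returns):
    """Negative annualised Sharpe ratio of the captured returns (cf. Eq. 16),
    using the variance identity E[R^2] - E[R]^2."""
    r = np.asarray(captured_returns)
    mu = r.mean()
    var = (r ** 2).mean() - mu ** 2
    return -mu * np.sqrt(252) / np.sqrt(var)
```

Minimising `sharpe_loss` penalises return volatility as well as rewarding mean returns, which is the key difference from the plain average-return objective.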
4 Deep Momentum Networks
In this section, we examine a variety of architectures that can be used in Deep Momentum Networks – all of which can be easily reconfigured to generate the predictions described in Section 3.2. This is achieved by implementing the models using the Keras API in TensorFlow [15]
, where output activation functions can be flexibly interchanged to generate the predictions of different types (e.g. expected returns, binary probabilities, or direct positions). Arbitrary loss functions can also be defined for direct outputs, with gradients for backpropagation being easily computed using the builtin libraries for automatic differentiation.
4.1 Network Architectures
Lasso Regression
In the simplest case, a standard linear model could be used to generate predictions as below:
$Z_t^{(i)} = g\left( w^\top u_t^{(i)} + b \right)$  (17)

where $Z_t^{(i)}$ is $Y_t^{(i)}$ or $X_t^{(i)}$ depending on the prediction task, $w$ is a weight vector for the linear model, and $b$ is a bias term. Here $g(\cdot)$ is an activation function which depends on the specific prediction type – linear for standard regression, sigmoid for binary classification, and the tanh function for direct outputs.
Additional regularisation is also provided during training by augmenting the various loss functions to include an additional regulariser as below:
$\tilde{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \alpha \lVert w \rVert_1$  (18)

where $\mathcal{L}(\theta)$ corresponds to one of the loss functions described in Section 3.2, $\lVert w \rVert_1$ is the $\ell_1$ norm of $w$, and $\alpha$ is a constant term which we treat as an additional hyperparameter. To incorporate recent history into predictions as well, we concatenate inputs over the past $\tau$ days into a single input vector – i.e. $\tilde{u}_t^{(i)} = \left[ u_{t-\tau}^{(i)\top}, \dots, u_t^{(i)\top} \right]^\top$. The lookback window $\tau$ was held fixed for the tests in Section 5.

Multilayer Perceptron (MLP)
Increasing the degree of model complexity slightly, a 2-layer neural network can be used to incorporate non-linear effects:

$h_t^{(i)} = \tanh\left( W_h u_t^{(i)} + b_h \right)$  (19)

$Z_t^{(i)} = g\left( W_z h_t^{(i)} + b_z \right)$  (20)

where $h_t^{(i)}$ is the hidden state of the MLP using an internal tanh activation function, and $W_\cdot$ and $b_\cdot$ are layer weight matrices and biases respectively.
WaveNet
More modern techniques such as convolutional neural networks (CNNs) have been used in the domain of time series prediction – particularly in the form of autoregressive architectures e.g. [19]. These typically take the form of 1D causal convolutions, sliding convolutional filters across time to extract useful representations which are then aggregated in higher layers of the network. To increase the size of the receptive field – or the length of history fed into the CNN – dilated CNNs such as WaveNet [40]
have been proposed, which skip over inputs at intermediate levels with a predetermined dilation rate. This allows the network to effectively increase the amount of historical information used by the CNN without a large increase in computational cost. The dilated convolutional layers with residual connections take the form below:

$\psi\left(u_t^{(i)}\right) = \tanh\left( W u_t^{(i)} \right) \odot \sigma\left( V u_t^{(i)} \right) + W_s u_t^{(i)} + b_s$  (21)

Here $W$ and $V$ are weight matrices associated with the gated activation function, and $W_s$ and $b_s$ are the weights and biases used to transform $u_t^{(i)}$ to match the dimensionality of the layer outputs for the skip connection. The equations for the WaveNet architecture used in our investigations can then be expressed as:
$s_{\text{weekly}}^{(i)}(t) = \psi\left( u_{t-5:t}^{(i)} \right)$  (22)

$s_{\text{monthly}}^{(i)}(t) = \psi\left( s_{\text{weekly}}^{(i)}(t-21:t) \right)$  (23)

$s_{\text{quarterly}}^{(i)}(t) = \psi\left( s_{\text{monthly}}^{(i)}(t-63:t) \right)$  (24)
Here each intermediate layer aggregates representations at weekly, monthly and quarterly frequencies respectively. The intermediate states are then concatenated before passing through a 2-layer MLP to generate outputs, i.e.:

$s_t^{(i)} = \left[ s_{\text{weekly}}^{(i)}(t)^\top, \, s_{\text{monthly}}^{(i)}(t)^\top, \, s_{\text{quarterly}}^{(i)}(t)^\top \right]^\top$  (25)

$h_t^{(i)} = \tanh\left( W_h s_t^{(i)} + b_h \right)$  (26)

$Z_t^{(i)} = g\left( W_z h_t^{(i)} + b_z \right)$  (27)
State sizes for each intermediate layer $s_{\text{weekly}}^{(i)}(t)$, $s_{\text{monthly}}^{(i)}(t)$, $s_{\text{quarterly}}^{(i)}(t)$ and the MLP hidden state $h_t^{(i)}$ are fixed to be the same, allowing us to use a single hyperparameter to define the architecture. To independently evaluate the performance of CNN and RNN architectures, the above also excludes the LSTM block (i.e. the context stack) described in [40], focusing purely on the merits of the dilated CNN model.
Long Short-term Memory (LSTM)
Traditionally used in sequence prediction for natural language processing, recurrent neural networks – specifically long short-term memory (LSTM) architectures [39] – have been increasingly used in time series prediction tasks. The equations for the LSTM in our model are provided below:

$f_t^{(i)} = \sigma\left( W_f u_t^{(i)} + V_f h_{t-1}^{(i)} + b_f \right)$  (28)

$i_t^{(i)} = \sigma\left( W_i u_t^{(i)} + V_i h_{t-1}^{(i)} + b_i \right)$  (29)

$o_t^{(i)} = \sigma\left( W_o u_t^{(i)} + V_o h_{t-1}^{(i)} + b_o \right)$  (30)

$c_t^{(i)} = f_t^{(i)} \odot c_{t-1}^{(i)} + i_t^{(i)} \odot \tanh\left( W_c u_t^{(i)} + V_c h_{t-1}^{(i)} + b_c \right)$  (31)

$h_t^{(i)} = o_t^{(i)} \odot \tanh\left( c_t^{(i)} \right)$  (32)

$Z_t^{(i)} = g\left( W_z h_t^{(i)} + b_z \right)$  (33)

where $\odot$ is the Hadamard (element-wise) product, $\sigma(\cdot)$ is the sigmoid activation function, $W_\cdot$ and $V_\cdot$ are weight matrices for the different layers, $f_t^{(i)}$, $i_t^{(i)}$ and $o_t^{(i)}$ correspond to the forget, input and output gates respectively, $c_t^{(i)}$ is the cell state, and $h_t^{(i)}$ is the hidden state of the LSTM. From these equations, we can see that the LSTM uses the cell state as a compact summary of past information, controlling memory retention with the forget gate and incorporating new information via the input gate. As such, the LSTM is able to learn representations of long-term relationships relevant to the prediction task – sequentially updating its internal memory states with new observations at each step.
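The LSTM recursions can be traced in a small NumPy sketch – the random weights, input dimensions and tanh read-out below are stand-ins for a trained model, not the actual DMN configuration:

```python
import numpy as np

def lstm_step(u, h_prev, c_prev, params):
    """One step of the standard LSTM recursions, written out in NumPy."""
    sig = lambda x: 1 / (1 + np.exp(-x))
    W, V, b = params  # dicts keyed by gate: 'f', 'i', 'o', 'c'
    f = sig(W['f'] @ u + V['f'] @ h_prev + b['f'])   # forget gate
    i = sig(W['i'] @ u + V['i'] @ h_prev + b['i'])   # input gate
    o = sig(W['o'] @ u + V['o'] @ h_prev + b['o'])   # output gate
    c = f * c_prev + i * np.tanh(W['c'] @ u + V['c'] @ h_prev + b['c'])  # cell state
    h = o * np.tanh(c)                               # hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
gates = ['f', 'i', 'o', 'c']
params = ({g: rng.normal(size=(d_h, d_in)) for g in gates},
          {g: rng.normal(size=(d_h, d_h)) for g in gates},
          {g: np.zeros(d_h) for g in gates})
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(10):                  # unroll over a short random input sequence
    h, c = lstm_step(rng.normal(size=d_in), h, c, params)
position = np.tanh(h.sum())          # e.g. a tanh read-out for a direct position
```

Because the hidden state is gated through a tanh, every element of `h` stays in (-1, 1), and the tanh read-out keeps the direct position within the required [-1, 1] range.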
4.2 Training Details
Model calibration was undertaken using minibatch stochastic gradient descent with the Adam optimiser [44], based on the loss functions defined in Section 3.2. Backpropagation was performed for up to a maximum of 100 training epochs using the earlier portion of a given block of training data, with the most recent portion retained as a validation dataset. Validation data is then used to determine convergence – with early stopping triggered when the validation loss has not improved for 25 epochs – and to identify the optimal model across hyperparameter settings. Hyperparameter optimisation was conducted using 50 iterations of random search, with full details provided in Appendix .2. For additional information on deep neural network calibration, please refer to [13].

Dropout regularisation [45] was a key feature to avoid overfitting in the neural network models – with dropout rates included as hyperparameters during training. This was applied to the inputs and hidden state for the MLP, as well as the inputs, Equation (22), and outputs, Equation (26), of the convolutional layers in the WaveNet architecture. For the LSTM, we adopted the same dropout masks as in [46] – applying dropout to the RNN inputs, recurrent states and outputs.
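The convergence logic can be sketched as a plain training loop – `step_fn` and `val_loss_fn` below are hypothetical stand-ins for one epoch of minibatch Adam updates and a validation-loss evaluation, not the actual TensorFlow pipeline:

```python
def train_with_early_stopping(step_fn, val_loss_fn, max_epochs=100, patience=25):
    """Run up to max_epochs, stopping once the validation loss has not
    improved for `patience` consecutive epochs; return the best epoch found."""
    best_loss, best_epoch, stale = float('inf'), -1, 0
    for epoch in range(max_epochs):
        step_fn(epoch)                  # one epoch of minibatch gradient updates
        loss = val_loss_fn(epoch)
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:       # early stopping triggered
                break
    return best_epoch, best_loss

# Toy validation curve with a minimum at epoch 10: training halts 25 epochs later.
best_epoch, best_loss = train_with_early_stopping(lambda e: None,
                                                  lambda e: (e - 10) ** 2)
```

In the hyperparameter search, the same validation loss would additionally be compared across the 50 random-search candidates to select the final model.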
5 Performance Evaluation
5.1 Overview of Dataset
The predictive performance of the different architectures was evaluated via a backtest using 88 ratioadjusted continuous futures contracts downloaded from the Pinnacle Data Corp CLC Database [47]. These contracts spanned across a variety of asset classes – including commodities, fixed income and currency futures – and contained prices from 1990 to 2015. A full breakdown of the dataset can be found in Appendix .1.
5.2 Backtest Description
Throughout our backtest, the models were recalibrated from scratch every 5 years – rerunning the entire hyperparameter optimisation procedure using all data available up to the recalibration point. Model weights were then fixed for signals generated over the next 5 year period, ensuring that tests were performed outofsample.
For the Deep Momentum Networks, we incorporate a series of useful features adopted by standard time series momentum strategies in Section 3.1 to generate predictions at each step:

Normalised Returns – Returns over the past day, 1-month, 3-month, 6-month and 1-year periods are used, normalised by a measure of daily volatility scaled to an appropriate time scale. For instance, normalised annual returns were taken to be $r_{t-252, t}^{(i)} / \left( \sigma_t^{(i)} \sqrt{252} \right)$.

MACD Indicators – We also include the MACD indicators – i.e. trend estimates $Y_t^{(i)}$ – as in Equation (4), using short timescales $S_k \in \{8, 16, 32\}$ and long timescales $L_k \in \{24, 48, 96\}$ as in [4].
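The normalised-return inputs can be sketched in pandas – the synthetic price series and the helper name `momentum_features` are ours, and the volatility estimate follows the 60-day exponentially weighted scheme of Section 3:

```python
import numpy as np
import pandas as pd

def momentum_features(prices, sigma_daily):
    """Returns over the past day, 1, 3, 6 and 12 months, each normalised by
    the daily volatility scaled to the horizon (sqrt of its length in days)."""
    feats = {}
    for name, days in [('1d', 1), ('1m', 21), ('3m', 63), ('6m', 126), ('1y', 252)]:
        ret = prices / prices.shift(days) - 1
        feats[f'ret_{name}'] = ret / (sigma_daily * np.sqrt(days))
    return pd.DataFrame(feats)

rng = np.random.default_rng(2)
prices = pd.Series(np.exp(np.cumsum(rng.normal(0, 0.01, 400))))  # synthetic prices
sigma = prices.pct_change().ewm(span=60).std()                   # daily vol estimate
X = momentum_features(prices, sigma)
```

Each row of `X` would then be one input vector $u_t^{(i)}$ for the networks of Section 4 (with the MACD indicators appended).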
For comparisons against traditional time series momentum strategies, we also incorporate the following reference benchmarks: a Long Only strategy with volatility scaling, the Sgn(Returns) rule of [1], and the MACD signal of [4].
Finally, performance was judged based on the following metrics:

Profitability – Expected returns (E[Return]) and the percentage of positive returns observed across the test period.

Risk – Daily volatility (Vol.), downside deviation and the maximum drawdown (MDD) of the overall portfolio.

Performance Ratios – Risk-adjusted performance was measured by the Sharpe ratio, Sortino ratio and Calmar ratio, as well as the average profit over the average loss (Ave. P / Ave. L).
5.3 Results and Discussion
Aggregating the out-of-sample predictions from 1995 to 2015, we compute performance metrics for both the strategy returns based on Equation (1) (Exhibit 1), as well as those for portfolios with an additional layer of volatility scaling – which brings overall strategy returns to match the volatility target (Exhibit 2). Given the large differences in returns volatility seen in Exhibit 1, this rescaling also helps to facilitate comparisons between the cumulative returns of different strategies – which are plotted for various loss functions in Exhibit 2. We note that strategy returns in this section are computed in the absence of transaction costs, allowing us to focus on the raw predictive ability of the models themselves. The impact of transaction costs is explored further in Section 6, where we undertake a deeper analysis of signal turnover. More detailed results can also be found in Appendix .3, which echo the findings below.
Focusing on the raw signal outputs, the Sharpe-ratio-optimised LSTM outperforms all benchmarks as expected – improving on the next-best neural network model (the Sharpe-optimised MLP) and beating the best reference benchmark (Sgn(Returns)) by more than two times. In conjunction with Sharpe ratio improvements to both the linear and MLP models, this highlights the benefits of using models which capture non-linear relationships, and have access to more time history via an internal memory state. Additional model complexity, however, does not necessarily lead to better predictive performance, as demonstrated by the underperformance of WaveNet compared to both the reference benchmarks and simple linear models. Part of this can be attributed to the difficulties in tuning models with multiple design parameters – for instance, better results could possibly be achieved by using alternative dilation rates, numbers of convolutional layers, and hidden state sizes in Equations (22) to (24) for the WaveNet. In contrast, only a single design parameter is sufficient to specify the hidden state size in both the MLP and LSTM models. Analysing the relative performance within each model class, we can see that models which directly generate positions perform the best – demonstrating the benefits of simultaneously learning both trend estimation and position sizing functions. In addition, with the exception of a slight decrease in the MLP, Sharpe-optimised models outperform returns-optimised ones, with standard regression and classification benchmarks taking third and fourth place respectively.
E[Return]  Vol.  Downside Dev.  MDD  Sharpe  Sortino  Calmar  % +ve Returns  Ave. P / Ave. L

Reference
Long Only  0.039  0.052  0.035  0.167  0.738  1.086  0.230  53.8%  0.970  
Sgn(Returns)  0.054  0.046  0.032  0.083  1.192  1.708  0.653  54.8%  1.011  
MACD  0.030  0.031  0.022  0.081  0.976  1.356  0.371  53.9%  1.015  
Linear  
Sharpe  0.041  0.038  0.028  0.119  1.094  1.462  0.348  54.9%  0.997  
Ave. Returns  0.047  0.045  0.031  0.164  1.048  1.500  0.287  53.9%  1.022  
MSE  0.049  0.047  0.032  0.164  1.038  1.522  0.298  54.3%  1.000  
Binary  0.013  0.044  0.030  0.167  0.295  0.433  0.078  50.6%  1.028  
MLP  
Sharpe  0.044  0.031  0.025  0.154  1.383  1.731  0.283  56.0%  1.024  
Ave. Returns  0.064*  0.043  0.030  0.161  1.492  2.123  0.399  55.6%  1.031  
MSE  0.039  0.046  0.032  0.166  0.844  1.224  0.232  52.7%  1.035  
Binary  0.003  0.042  0.028  0.233  0.080  0.120  0.014  50.8%  0.981  
WaveNet  
Sharpe  0.030  0.035  0.026  0.101  0.854  1.167  0.299  53.5%  1.008  
Ave. Returns  0.032  0.040  0.028  0.113  0.788  1.145  0.281  53.8%  0.980  
MSE  0.022  0.042  0.028  0.134  0.536  0.786  0.166  52.4%  0.994  
Binary  0.000  0.043  0.029  0.313  0.011  0.016  0.001  50.2%  0.995  
LSTM  
Sharpe  0.045  0.016*  0.011*  0.021*  2.804*  3.993*  2.177*  59.6%*  1.102*  
Ave. Returns  0.054  0.046  0.033  0.164  1.165  1.645  0.326  54.8%  1.003  
MSE  0.031  0.046  0.032  0.163  0.669  0.959  0.189  52.8%  1.003  
Binary  0.012  0.039  0.026  0.255  0.300  0.454  0.046  51.0%  1.012 
E[Return]  Vol.  Downside Dev.  MDD  Sharpe  Sortino  Calmar  % +ve Returns  Ave. P / Ave. L
Reference  
Long Only  0.117  0.154  0.102  0.431  0.759  1.141  0.271  53.8%  0.973  
Sgn(Returns)  0.215  0.154  0.102  0.264  1.392  2.108  0.815  54.8%  1.041  
MACD  0.172  0.155  0.106  0.317  1.111  1.622  0.543  53.9%  1.031  
Linear  
Sharpe  0.232  0.155  0.103  0.303  1.496  2.254  0.765  54.9%  1.056  
Ave. Returns  0.189  0.154  0.100  0.372  1.225  1.893  0.507  53.9%  1.047  
MSE  0.186  0.154  0.099*  0.365  1.211  1.889  0.509  54.3%  1.025  
Binary  0.051  0.155  0.103  0.558  0.332  0.496  0.092  50.6%  1.033  
MLP  
Sharpe  0.312  0.154  0.102  0.335  2.017  3.042  0.930  56.0%  1.104  
Ave. Returns  0.266  0.154  0.099*  0.354  1.731  2.674  0.752  55.6%  1.065  
MSE  0.156  0.154  0.099*  0.371  1.017  1.582  0.422  52.7%  1.062  
Binary  0.017  0.154  0.102  0.661  0.108  0.162  0.025  50.8%  0.986  
WaveNet  
Sharpe  0.148  0.155  0.103  0.349  0.956  1.429  0.424  53.5%  1.018  
Ave. Returns  0.136  0.154  0.101  0.356  0.881  1.346  0.381  53.8%  0.993  
MSE  0.084  0.153*  0.101  0.459  0.550  0.837  0.184  52.4%  0.995  
Binary  0.007  0.155  0.103  0.779  0.045  0.068  0.009  50.2%  1.001  
LSTM  
Sharpe  0.451*  0.155  0.105  0.209*  2.907*  4.290*  2.159*  59.6%*  1.113*  
Ave. Returns  0.208  0.154  0.102  0.365  1.349  2.045  0.568  54.8%  1.028  
MSE  0.121  0.154  0.100  0.362  0.791  1.211  0.335  52.8%  1.020  
Binary  0.075  0.155  0.099*  0.682  0.486  0.762  0.110  51.0%  1.043 
From Exhibit 2, while the addition of volatility scaling at the portfolio level improved performance ratios on the whole, it had a larger beneficial effect on machine learning models compared to the reference benchmarks – propelling Sharpe-optimised MLPs to outperform returns-optimised ones, and even leading to Sharpe-optimised linear models beating reference benchmarks. From a risk perspective, we can see that both volatility and downside deviation also become a lot more comparable, with the former hovering close to 15% and the latter around 10%. However, Sharpe-optimised LSTMs still retained the lowest MDD across all models, with superior risk-adjusted performance ratios across the board. Referring to the cumulative returns plots for the rescaled portfolios in Exhibit 2, the benefits of direct outputs with Sharpe ratio optimisation can also be observed – with larger cumulative returns observed for linear, MLP and LSTM models compared to the reference benchmarks. Furthermore, we note the general underperformance of models which use standard regression and classification methods for trend estimation – hinting at the difficulties faced in selecting an appropriate position sizing function, and in optimising models to generate positions without accounting for risk. This is particularly relevant for binary classification methods, which produce relatively flat equity lines and underperform reference benchmarks in general. Some of these poor results can be explained by the implicit decision threshold adopted. From the percentage of positive returns captured in Exhibit 2, most binary classification models have an accuracy of around 50% which, while expected of a classifier with a 0.5 probability threshold, is far below the accuracies seen in other benchmarks. Furthermore, performance is made worse by the fact that the magnitude of the model's gains versus losses (Ave. P / Ave. L) is much smaller than competing methods – with average loss magnitudes even outweighing profits for the MLP classifier. As such, these observations lend support to the direct generation of position sizes with machine learning methods, given the multiple considerations (e.g. decision thresholds and profit/loss magnitudes) that would be required to incorporate standard supervised learning methods into a profitable trading strategy.

Strategy performance could also be aided by diversification across a range of assets, particularly when the correlation between signals is low. Hence, to evaluate the raw quality of the underlying signal, we investigate the performance constituents of the time series momentum portfolios – using box plots for a variety of performance metrics, plotting the minimum, lower quartile, median, upper quartile, and maximum values across individual futures contracts. We present in Exhibit 3 plots of one metric per category in Section 5.2, although similar results can be seen for other performance ratios and are documented in Appendix .3. In general, the Sharpe ratio plots in Exhibit 3(a) echo previous findings, with direct output methods performing better than indirect trend estimation models. However, as seen in Exhibit 3(c), this is mainly attributable to a significant reduction in signal volatility for the Sharpe-optimised methods, despite a comparable range of average returns in Exhibit 3(b). The benefits of retaining the volatility scaling can also be observed, with individual signal volatility capped near the target across all methods – even with a naive position sizer. As such, the combination of volatility scaling, direct outputs and Sharpe ratio optimisation were all key to the performance gains in Deep Momentum Networks.

6 Turnover Analysis
To investigate how transaction costs affect strategy performance, we first analyse the daily position changes of the signal – characterised for asset $i$ by its daily turnover, as defined in [8]:

$$\tau_t^{(i)} = \sigma_{\mathrm{tgt}} \left| \frac{X_t^{(i)}}{\sigma_t^{(i)}} - \frac{X_{t-1}^{(i)}}{\sigma_{t-1}^{(i)}} \right| \tag{34}$$

which is broadly proportional to the volume of asset $i$ traded on day $t$ with reference to the updated portfolio weights.
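To make the computation concrete, here is a minimal NumPy sketch, assuming the turnover takes the volatility-scaled form from [8] – the target volatility times the absolute change in the volatility-scaled position. The function and variable names are illustrative, not from the paper:

```python
import numpy as np

def daily_turnover(X, sigma, sigma_tgt=0.15):
    """Daily turnover per asset: sigma_tgt * |X_t/sigma_t - X_{t-1}/sigma_{t-1}|.

    X     : array of daily positions X_t in [-1, 1]
    sigma : array of ex-ante volatility estimates, same length as X
    """
    scaled = X / sigma                            # volatility-scaled position
    return sigma_tgt * np.abs(np.diff(scaled))    # one entry per day t >= 1

# A held position with unchanged volatility generates zero turnover;
# flipping the position from long to short generates the most.
X = np.array([0.5, 0.5, -0.5])
sigma = np.array([0.1, 0.1, 0.1])
tau = daily_turnover(X, sigma, sigma_tgt=0.15)    # tau == [0.0, 1.5]
```

Note that a signal which holds its scaled position is free to trade under this measure, so trend-following signals with slowly varying outputs naturally incur lower costs.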
Exhibit 4(a) shows the average strategy turnover across all assets from 1995 to 2015, focusing on positions generated by the raw signal outputs. As the box plots are charted on a logarithmic scale, we note that while the machine learning-based models have similar turnover to one another, they all trade significantly more than the reference benchmarks – approximately 10 times more than the Long Only benchmark. This is also reflected in Exhibit 4(b), which compares the average daily returns against the average daily turnover – with ratios from machine learning models lying close to the x-axis.
To concretely quantify the impact of transaction costs on performance, we also compute the ex-cost Sharpe ratios – using the rebalancing costs defined in [8] to adjust our returns for a variety of transaction cost assumptions. For the results in Exhibit 5, the top of each bar chart marks the maximum cost-free Sharpe ratio of the strategy, with each coloured block denoting the Sharpe ratio reduction for the corresponding cost assumption. In line with the turnover analysis, the reference benchmarks demonstrate the most resilience to high transaction costs (up to 5 bps), with the profitability across most machine learning models persisting only up to 4 bps. However, we still obtain higher cost-adjusted Sharpe ratios with the Sharpe-optimised LSTM for up to 2–3 bps, demonstrating its suitability for trading more liquid instruments.
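The cost adjustment itself can be sketched as below – deducting an assumed per-unit-turnover cost from daily returns before computing an annualised Sharpe ratio. This is a simplified stand-in for the rebalancing costs of [8]; the data and names here are synthetic and illustrative:

```python
import numpy as np

def ex_cost_sharpe(returns, turnover, cost_bps):
    """Annualised Sharpe ratio after deducting rebalancing costs.

    returns  : daily strategy returns
    turnover : daily turnover, aligned with returns
    cost_bps : transaction cost assumption in basis points
    """
    net = returns - (cost_bps * 1e-4) * turnover
    return np.sqrt(252) * net.mean() / net.std()

# Sweep over cost assumptions, as in the bar-chart analysis.
rng = np.random.default_rng(0)
r = rng.normal(5e-4, 1e-2, 2520)      # synthetic daily returns (10 years)
tau = np.full_like(r, 0.5)            # hypothetical constant daily turnover
sharpes = [ex_cost_sharpe(r, tau, c) for c in (0, 1, 2, 3, 4, 5)]
```

Each successive cost assumption shaves a fixed amount off the mean return, so the Sharpe ratio declines monotonically in `cost_bps` – the shrinking coloured blocks of the bar charts.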
|              | E[Return] | Vol.   | Downside Dev. | MDD    | Sharpe | Sortino | Calmar | % +ve Returns | Avg. P / Avg. L |
|--------------|-----------|--------|---------------|--------|--------|---------|--------|---------------|-----------------|
| Long Only    | 0.097     | 0.154* | 0.103         | 0.482  | 0.628  | 0.942   | 0.201  | 53.3%         | 0.970           |
| Sgn(Returns) | 0.133     | 0.154* | 0.102*        | 0.373  | 0.861  | 1.296   | 0.356  | 53.3%         | 1.011           |
| MACD         | 0.111     | 0.155  | 0.106         | 0.472  | 0.719  | 1.047   | 0.236  | 52.5%         | 1.020*          |
| LSTM         | -0.833    | 0.157  | 0.114         | 1.000  | -5.313 | -7.310  | -0.833 | 33.9%         | 0.793           |
| LSTM + Reg.  | 0.141*    | 0.154* | 0.102*        | 0.371* | 0.912* | 1.379*  | 0.379* | 53.4%*        | 1.014           |

(* marks the best value in each column.)
6.1 Turnover Regularisation
One simple way to account for transaction costs is to use cost-adjusted returns directly during training, augmenting the strategy returns defined in Equation (1) as below:

$$\tilde{R}_{t+1}^{(i)} = R_{t+1}^{(i)} - c \, \tau_t^{(i)} \tag{35}$$

where $R_{t+1}^{(i)}$ is the strategy return of Equation (1), $\tau_t^{(i)}$ the turnover of Equation (34), and $c$ a constant reflecting transaction cost assumptions. As such, using $\tilde{R}_{t+1}^{(i)}$ in Sharpe ratio loss functions during training corresponds to optimising the ex-cost risk-adjusted returns, and can also be interpreted as a regularisation term for turnover.
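As a sketch of how such a loss could look, the following NumPy function computes the negative annualised Sharpe ratio of cost-adjusted returns; in practice the same expression would be written in TensorFlow or PyTorch so that gradients flow through the positions. All names are illustrative:

```python
import numpy as np

def neg_sharpe_with_costs(R, X, sigma, c=10e-4, sigma_tgt=0.15):
    """Negative annualised Sharpe ratio of cost-adjusted returns.

    R     : captured daily strategy returns for t = 1..T
    X     : positions for t = 0..T (one extra leading entry)
    sigma : ex-ante volatility estimates, same length as X
    c     : transaction cost constant (here an extreme 10 bps)
    """
    turnover = sigma_tgt * np.abs(np.diff(X / sigma))  # one entry per return
    R_adj = R - c * turnover                           # cost-adjusted returns
    return -np.sqrt(252.0) * R_adj.mean() / R_adj.std()
```

Minimising this loss penalises position changes alongside rewarding risk-adjusted returns: for identical captured returns, a signal that flips its position daily is assigned a strictly higher loss than one that holds steady.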
Given that the Sharpe-optimised LSTM is still profitable in the presence of small transaction costs, we seek to quantify the effectiveness of turnover regularisation when costs are prohibitively high – considering the extreme case where c = 10 bps in our investigation. Tests were focused on the Sharpe-optimised LSTM with and without the turnover regulariser (LSTM + Reg. for the former) – including the additional portfolio-level volatility scaling to bring signal volatilities to the same level. Based on the results in the exhibit above, we can see that the turnover regularisation does help improve the LSTM in the presence of large costs, leading to slightly better performance ratios when compared to the reference benchmarks.
7 Conclusions
We introduce Deep Momentum Networks – a hybrid class of deep learning models which retain the volatility scaling framework of time series momentum strategies while using deep neural networks to output position-targeting trading signals. Two approaches to position generation were evaluated here. Firstly, we cast trend estimation as a standard supervised learning problem – using machine learning models to forecast the expected asset returns or the probability of a positive return at the next time step – and apply a simple maximum long/short trading rule based on the direction of the next return. Secondly, trading rules were directly generated as outputs from the model, which we calibrate by maximising the Sharpe ratio or average strategy return. Testing this on a universe of continuous futures contracts, we demonstrate clear improvements in risk-adjusted performance by calibrating models with the Sharpe ratio – where the LSTM model achieved the best results. Incorporating transaction costs, the Sharpe-optimised LSTM outperforms benchmarks up to 2–3 basis points of costs, demonstrating its suitability for trading more liquid assets. To accommodate high-cost settings, we introduce a turnover regulariser to use during training, which was shown to be effective even in extreme scenarios (i.e. c = 10 bps).
Future work includes extensions of the framework presented here to incorporate ways to deal better with non-stationarity in the data, such as using the recently introduced Recurrent Neural Filters [48]. Another direction of future work focuses on the study of time series momentum at the microstructure level.
8 Acknowledgements
We would like to thank Anthony Ledford, James Powrie and Thomas Flury for their interesting comments, as well as the Oxford-Man Institute of Quantitative Finance for financial support.
References
 [1] T. J. Moskowitz, Y. H. Ooi, and L. H. Pedersen, “Time series momentum,” Journal of Financial Economics, vol. 104, no. 2, pp. 228 – 250, 2012, Special Issue on Investor Sentiment.
 [2] B. Hurst, Y. H. Ooi, and L. H. Pedersen, “A century of evidence on trend-following investing,” The Journal of Portfolio Management, vol. 44, no. 1, pp. 15–29, 2017.
 [3] Y. Lempérière, C. Deremble, P. Seager, M. Potters, and J.P. Bouchaud, “Two centuries of trend following,” Journal of Investment Strategies, vol. 3, no. 3, pp. 41–61, 2014.
 [4] J. Baz, N. Granger, C. R. Harvey, N. Le Roux, and S. Rattray, “Dissecting investment strategies in the cross section and time series,” SSRN, 2015. [Online]. Available: https://ssrn.com/abstract=2695101
 [5] A. Levine and L. H. Pedersen, “Which trend is your friend,” Financial Analysts Journal, vol. 72, no. 3, 2016.
 [6] B. Bruder, T.L. Dao, J.C. Richard, and T. Roncalli, “Trend filtering methods for momentum strategies,” SSRN, 2013. [Online]. Available: https://ssrn.com/abstract=2289097
 [7] A. Y. Kim, Y. Tse, and J. K. Wald, “Time series momentum and volatility scaling,” Journal of Financial Markets, vol. 30, pp. 103 – 124, 2016.
 [8] N. Baltas and R. Kosowski, “Demystifying time-series momentum strategies: Volatility estimators, trading rules and pairwise correlations,” SSRN, 2017. [Online]. Available: https://ssrn.com/abstract=2140091
 [9] C. R. Harvey, E. Hoyle, R. Korgaonkar, S. Rattray, M. Sargaison, and O. van Hemert, “The impact of volatility targeting,” SSRN, 2018. [Online]. Available: https://ssrn.com/abstract=3175538
 [10] N. Laptev, J. Yosinski, L. E. Li, and S. Smyl, “Time-series extreme event forecasting with neural networks at Uber,” in Time Series Workshop – International Conference on Machine Learning (ICML), 2017.
 [11] B. Lim and M. van der Schaar, “Disease-Atlas: Navigating disease trajectories using deep learning,” in Proceedings of the 3rd Machine Learning for Healthcare Conference (MLHC), ser. Proceedings of Machine Learning Research, vol. 85, 2018, pp. 137–160.
 [12] Z. Zhang, S. Zohren, and S. Roberts, “DeepLOB: Deep convolutional neural networks for limit order books,” IEEE Transactions on Signal Processing, 2019.
 [13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
 [14] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
 [15] M. Abadi et al., “TensorFlow: Largescale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
 [16] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in Autodiff Workshop – Conference on Neural Information Processing (NIPS), 2017.
 [17] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, “The M4 competition: Results, findings, conclusion and way forward,” International Journal of Forecasting, vol. 34, no. 4, pp. 802 – 808, 2018.
 [18] S. Smyl, J. Ranganathan, and A. Pasqua. (2018) M4 forecasting competition: Introducing a new hybrid ES-RNN model. [Online]. Available: https://eng.uber.com/m4forecastingcompetition/
 [19] M. Binkowski, G. Marti, and P. Donnat, “Autoregressive convolutional neural networks for asynchronous time series,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80, 2018, pp. 580–589.
 [20] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski, “Deep state space models for time series forecasting,” in Advances in Neural Information Processing Systems 31 (NeurIPS), 2018.

 [21] M. Fraccaro, S. Kamronn, U. Paquet, and O. Winther, “A disentangled recognition and nonlinear dynamics model for unsupervised learning,” in Advances in Neural Information Processing Systems 30 (NIPS), 2017.
 [22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27 (NIPS), 2014.
 [23] S. Gu, B. T. Kelly, and D. Xiu, “Empirical asset pricing via machine learning,” Chicago Booth Research Paper No. 1804; 31st Australasian Finance and Banking Conference 2018, 2017. [Online]. Available: https://ssrn.com/abstract=3159577
 [24] S. Kim, “Enhancing the momentum strategy through deep regression,” Quantitative Finance, vol. 0, no. 0, pp. 1–13, 2019.
 [25] J. Sirignano and R. Cont, “Universal features of price formation in financial markets: Perspectives from deep learning,” SSRN, 2018. [Online]. Available: https://ssrn.com/abstract=3141294
 [26] S. Ghoshal and S. Roberts, “Thresholded ConvNet ensembles: Neural networks for technical forecasting,” in Data Science in Fintech Workshop – Conference on Knowledge Discovery and Data Mining (KDD), 2018.
 [27] W. Bao, J. Yue, and Y. Rao, “A deep learning framework for financial time series using stacked autoencoders and longshort term memory,” PLOS ONE, vol. 12, no. 7, pp. 1–24, 2017.

 [28] P. Barroso and P. Santa-Clara, “Momentum has its moments,” Journal of Financial Economics, vol. 116, no. 1, pp. 111 – 120, 2015.
 [29] K. Daniel and T. J. Moskowitz, “Momentum crashes,” Journal of Financial Economics, vol. 122, no. 2, pp. 221 – 247, 2016.
 [30] R. Martins and D. Zou, “Momentum strategies offer a positive point of skew,” Risk Magazine, 2012.
 [31] P. Jusselin, E. Lezmi, H. Malongo, C. Masselin, T. Roncalli, and T.-L. Dao, “Understanding the momentum risk premium: An in-depth journey through trend-following strategies,” SSRN, 2017. [Online]. Available: https://ssrn.com/abstract=3042173
 [32] M. Potters and J.P. Bouchaud, “Trend followers lose more than they gain,” Wilmott Magazine, 2016.
 [33] L. M. Rotando and E. O. Thorp, “The Kelly criterion and the stock market,” The American Mathematical Monthly, vol. 99, no. 10, pp. 922–931, 1992.
 [34] W. F. Sharpe, “The sharpe ratio,” The Journal of Portfolio Management, vol. 21, no. 1, pp. 49–58, 1994.
 [35] N. Jegadeesh and S. Titman, “Returns to buying winners and selling losers: Implications for stock market efficiency,” The Journal of Finance, vol. 48, no. 1, pp. 65–91, 1993.
 [36] ——, “Profitability of momentum strategies: An evaluation of alternative explanations,” The Journal of Finance, vol. 56, no. 2, pp. 699–720, 2001.
 [37] J. Rohrbach, S. Suremann, and J. Osterrieder, “Momentum and trend following trading strategies for currencies revisited – combining academia and industry,” SSRN, 2017. [Online]. Available: https://ssrn.com/abstract=2949379
 [38] Z. Zhang, S. Zohren, and S. Roberts, “BDLOB: Bayesian deep convolutional neural networks for limit order books,” in Bayesian Deep Learning Workshop – Conference on Neural Information Processing (NeurIPS), 2018.
 [39] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [40] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” CoRR, vol. abs/1609.03499, 2016.
 [41] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of Go without human knowledge,” Nature, vol. 550, pp. 354–, 2017.
 [42] P. N. Kolm and G. Ritter, “Dynamic replication and hedging: A reinforcement learning approach,” The Journal of Financial Data Science, vol. 1, no. 1, pp. 159–171, 2019.
 [43] H. Bühler, L. Gonon, J. Teichmann, and B. Wood, “Deep Hedging,” arXiv e-prints, p. arXiv:1802.03042, 2018.
 [44] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
 [45] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
 [46] Y. Gal and Z. Ghahramani, “A theoretically grounded application of dropout in recurrent neural networks,” in Advances in Neural Information Processing Systems 29 (NIPS), 2016.
 [47] “Pinnacle Data Corp. CLC Database,” https://pinnacledata2.com/clc.html.
 [48] B. Lim, S. Zohren, and S. Roberts, “Recurrent Neural Filters: Learning independent Bayesian filtering steps for time series prediction,” arXiv e-prints, p. arXiv:1901.08096, 2019.