A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem

Financial portfolio management is the process of constant redistribution of a fund into different financial products. This paper presents a financial-model-free Reinforcement Learning framework to provide a deep machine learning solution to the portfolio management problem. The framework consists of the Ensemble of Identical Independent Evaluators (EIIE) topology, a Portfolio-Vector Memory (PVM), an Online Stochastic Batch Learning (OSBL) scheme, and a fully exploiting and explicit reward function. This framework is realized in three instances in this work with a Convolutional Neural Network (CNN), a basic Recurrent Neural Network (RNN), and a Long Short-Term Memory (LSTM). They are, along with a number of recently reviewed or published portfolio-selection strategies, examined in three back-test experiments with a trading period of 30 minutes in a cryptocurrency market. Cryptocurrencies are electronic and decentralized alternatives to government-issued money, with Bitcoin as the best-known example. All three instances of the framework monopolize the top three positions in all experiments, outdistancing the other compared trading algorithms. Even with a high commission rate of 0.25% in the back-tests, the framework is able to achieve at least 4-fold returns in 50 days.



Code Repositories


PGPortfolio: Policy Gradient Portfolio, the source code of "A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem"(https://arxiv.org/pdf/1706.10059.pdf).


1 Introduction

Portfolio management is the decision-making process of continuously reallocating an amount of fund into a number of different financial investment products, aiming to maximize the return while restraining the risk (Haugen, 1986; Markowitz, 1968). Traditional portfolio management methods can be classified into four categories: "Follow-the-Winner", "Follow-the-Loser", "Pattern-Matching", and "Meta-Learning" (Li and Hoi, 2014). The first two categories are based on pre-constructed financial models, although they may also be assisted by machine learning techniques for parameter determination (Li et al., 2012; Cover, 1996). The performance of these methods depends on the validity of the models in different markets. "Pattern-Matching" algorithms predict the next market distribution based on a sample of historical data and explicitly optimize the portfolio based on the sampled distribution (Györfi et al., 2006). The last class, "Meta-Learning" methods, combines multiple strategies from the other categories to attain more consistent performance (Vovk and Watkins, 1998; Das and Banerjee, 2011).

There are existing deep machine-learning approaches to financial market trading. However, many of them try to predict price movements or trends (Heaton et al., 2016; Niaki and Hoseinzade, 2013; Freitas et al., 2009). With the price history of all assets as its input, a neural network can output a predicted vector of asset prices for the next period, and the trading agent can then act upon this prediction. This idea is straightforward to implement, because it is a supervised learning problem, or more specifically a regression problem. The performance of these price-prediction-based algorithms, however, depends heavily on prediction accuracy, and future market prices turn out to be difficult to predict. Furthermore, price predictions are not market actions; converting them into actions requires an additional layer of logic. If this layer is hand-coded, then the whole approach is not fully machine learning, and thus is not very extensible or adaptable. For example, it is difficult for a prediction-based network to consider transaction cost as a risk factor.

Previous successful attempts at model-free and fully machine-learning schemes for the algorithmic trading problem, without predicting future prices, treat the problem as a Reinforcement Learning (RL) one. These include Moody and Saffell (2001), Dempster and Leemans (2006), Cumming (2015), and the recent deep RL utilization by Deng et al. (2017). These RL algorithms output discrete trading signals on an asset. Being limited to single-asset trading, they are not applicable to general portfolio management problems, where trading agents manage multiple assets.

Deep RL has lately drawn much attention due to its remarkable achievements in playing video games (Mnih et al., 2015) and board games (Silver et al., 2016). These are RL problems with discrete action spaces, and the methods cannot be directly applied to portfolio selection problems, where actions are continuous. Although market actions can be discretized, discretization is considered a major drawback, because discrete actions come with unknown risks. For instance, one extreme discrete action may be defined as investing all the capital into one asset, without spreading the risk to the rest of the market. In addition, discretization scales badly. Market factors, like the total number of assets, vary from market to market. In order to take full advantage of the adaptability of machine learning over different markets, trading algorithms have to be scalable. A general-purpose continuous deep RL framework, the actor-critic Deterministic Policy Gradient Algorithms, was recently introduced (Silver et al., 2014; Lillicrap et al., 2016). The continuous output in these actor-critic algorithms is achieved by a neural-network approximated action policy function, and a second network is trained as the reward function estimator. Training two neural networks, however, is found to be difficult, and sometimes even unstable.

This paper proposes an RL framework specially designed for the task of portfolio management. The core of the framework is the Ensemble of Identical Independent Evaluators (EIIE) topology. An IIE is a neural network whose job is to inspect the history of an asset and evaluate its potential growth for the immediate future. The evaluation score of each asset is discounted by the size of the intended weight change for that asset in the portfolio and is presented to a softmax layer, whose outcome will be the new portfolio weights for the coming trading period. The portfolio weights define the market action of the RL agent. An asset with an increased target weight will be bought in an additional amount, and one with a decreased weight will be sold. Apart from the market history, portfolio weights from the previous trading period are also input to the EIIE. This is for the RL agent to consider the effect of transaction cost on its wealth. For this purpose, the portfolio weights of each period are recorded in a Portfolio Vector Memory (PVM). The EIIE is trained in an Online Stochastic Batch Learning (OSBL) scheme, which is compatible with both pre-trade training and online training during back-tests or online trading. The reward function of the RL framework is the explicit average of the periodic logarithmic returns. Having an explicit reward function, the EIIE evolves, under training, along the gradient ascending direction of the function. Three different species of IIEs are tested in this work: a Convolutional Neural Network (CNN) (Fukushima, 1980; Krizhevsky et al., 2012; Sermanet et al., 2012), a basic Recurrent Neural Network (RNN) (Werbos, 1988), and a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997).

Being a fully machine-learning approach, the framework is not restricted to any particular market. To examine its validity and profitability, the framework is tested in a cryptocurrency (virtual money, Bitcoin as the most famous example) exchange market, Poloniex.com. A set of coins is preselected by their ranking in trading volume over a time interval just before an experiment. Three back-test experiments of well-separated time spans are performed with a trading period of 30 minutes. The performance of the three EIIEs is compared with some recently published or reviewed portfolio selection strategies (Li et al., 2015a; Li and Hoi, 2014). The EIIEs significantly beat all other strategies in all three experiments.

Cryptographic currencies, or simply cryptocurrencies, are electronic and decentralized alternatives to government-issued money (Nakamoto, 2008; Grinberg, 2012). While the best known example of a cryptocurrency is Bitcoin, there are more than 100 other tradable cryptocurrencies competing with each other and with Bitcoin (Bonneau et al., 2015). The motive behind this competition is that there are a number of design flaws in Bitcoin, and people are trying to invent new coins to overcome these defects, hoping their inventions will eventually replace Bitcoin (Bentov et al., 2014; Duffield and Hagan, 2014). There are, however, more and more cryptocurrencies being created without aiming to beat Bitcoin, but with the purpose of using the blockchain technology behind it to develop decentralized applications (for example, Ethereum is a decentralized platform that runs smart contracts, and Siacoin is the currency for buying and selling storage service on the decentralized cloud Sia). As of June 2017, the total market capital of all cryptocurrencies is 102 billion USD, 41 billion of which is Bitcoin's (crypto-currency market capitalizations, http://coinmarketcap.com/, accessed 2017-06-30). Therefore, regardless of its design faults, Bitcoin is still the dominant cryptocurrency in markets. As a result, many other currencies cannot be bought with fiat currencies, but can only be traded against Bitcoin.

Two properties of cryptocurrencies differentiate them from traditional financial assets, making their market the best test-ground for algorithmic portfolio management experiments. These properties are decentralization and openness, and the former implies the latter. Without a central regulating party, anyone can participate in cryptocurrency trading with low entrance requirements. One direct consequence is an abundance of small-volume currencies. Affecting the prices of these penny-markets requires a smaller amount of investment, compared to traditional markets. This will eventually allow trading machines to learn and take advantage of the impact of their own market actions. Openness also means the markets are more accessible. Most cryptocurrency exchanges have application programming interfaces for obtaining market data and carrying out trading actions, and most exchanges are open 24/7 without restricting the frequency of trading. These non-stop markets are ideal for machines to learn in the real world in shorter time-frames.

The paper is organized as follows. Section 2 defines the portfolio management problem that this project is aiming to solve. Section 3 introduces asset preselection and the reasoning behind it, the input price tensor, and a way to deal with missing data in the market history. The portfolio management problem is re-described in the language of RL in Section 4. Section 5 presents the EIIE meta topology, the PVM, and the OSBL scheme. The results of the three experiments are staged in Section 6.

2 Problem Definition

Portfolio management is the action of continuous reallocation of a capital into a number of financial assets. For an automatic trading robot, these investment decisions and actions are made periodically. This section provides a mathematical setting of the portfolio management problem.

2.1 Trading Period

In this work, trading algorithms are time-driven, where time is divided into periods of equal length $T$. At the beginning of each period, the trading agent reallocates the fund among the assets. $T = 30$ minutes in all experiments of this paper. The price of an asset goes up and down within a period, but four important price points characterize the overall movement of a period, namely the opening, highest, lowest and closing prices (Rogers and Satchell, 1991). For continuous markets, the opening price of a financial instrument in a period is the closing price from the previous period. It is assumed in the back-test experiments that at the beginning of each period assets can be bought or sold at the opening price of that period. The justification of such an assumption is given in Section 2.4.

2.2 Mathematical Formalism

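The portfolio consists of $m$ assets in addition to the cash defined below, indexed $i = 0, 1, \dots, m$ with the cash at index 0. The closing prices of all assets comprise the price vector for Period $t$, $v_t$. In other words, the $i$th element of $v_t$, $v_{i,t}$, is the closing price of the $i$th asset in the $t$th period. Similarly, $v_t^{(\text{hi})}$ and $v_t^{(\text{lo})}$ denote the highest and lowest prices of the period. The first asset in the portfolio is special, in that it is the quoted currency, referred to as the cash for the rest of the article. Since the prices of all assets are quoted in cash, the first elements of $v_t$, $v_t^{(\text{hi})}$ and $v_t^{(\text{lo})}$ are always one, that is $v_{0,t}^{(\text{hi})} = v_{0,t}^{(\text{lo})} = v_{0,t} = 1$ for all $t$. In the experiments of this paper, the cash is Bitcoin.

For continuous markets, elements of $v_t$ are the opening prices for Period $t+1$ as well as the closing prices for Period $t$. The price relative vector of the $t$th trading period, $y_t$, is defined as the element-wise division of $v_t$ by $v_{t-1}$:

$$ y_t := v_t \oslash v_{t-1} = \left(1, \frac{v_{1,t}}{v_{1,t-1}}, \frac{v_{2,t}}{v_{2,t-1}}, \dots, \frac{v_{m,t}}{v_{m,t-1}}\right)^{\top}. \tag{2.1} $$

The elements of $y_t$ are the quotients of closing prices and opening prices for individual assets in the period. The price relative vector can be used to calculate the change in total portfolio value in a period. If $p_{t-1}$ is the portfolio value at the beginning of Period $t$, ignoring transaction cost,

$$ p_t = p_{t-1}\, y_t \cdot w_{t-1}, \tag{2.2} $$

where $w_{t-1}$ is the portfolio weight vector (referred to as the portfolio vector from now on) at the beginning of Period $t$, whose $i$th element, $w_{t-1,i}$, is the proportion of asset $i$ in the portfolio after capital reallocation. The elements of $w_t$ always sum up to one by definition, $\sum_i w_{t,i} = 1$ for all $t$. The rate of return for Period $t$ is then

$$ \rho_t := \frac{p_t}{p_{t-1}} - 1 = y_t \cdot w_{t-1} - 1, \tag{2.3} $$

and the corresponding logarithmic rate of return is

$$ r_t := \ln \frac{p_t}{p_{t-1}} = \ln\left(y_t \cdot w_{t-1}\right). \tag{2.4} $$

In a typical portfolio management problem, the initial portfolio weight vector $w_0$ is chosen to be the first basis vector in the Euclidean space,

$$ w_0 = (1, 0, \dots, 0)^{\top}, \tag{2.5} $$

indicating all the capital is in the trading currency before entering the market. If there is no transaction cost, the final portfolio value will be

$$ p_f = p_0 \exp\left(\sum_{t=1}^{t_f+1} r_t\right) = p_0 \prod_{t=1}^{t_f+1} y_t \cdot w_{t-1}, \tag{2.6} $$

where $p_0$ is the initial investment amount. The job of a portfolio manager is to maximize $p_f$ for a given time frame.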
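As a rough illustration of the cost-free formalism above, the following sketch computes price relatives, the per-period growth factor (the dot product of the price relatives with the weights held over the period), and the final portfolio value in both product and exponential form. The prices, weights, function names, and the three-asset portfolio are hypothetical values chosen for the example, not data from the paper.

```python
import math


def price_relatives(prev_close, close):
    """Element-wise close / prev_close. Index 0 is the cash, whose price is 1."""
    return [c / p for c, p in zip(close, prev_close)]


def growth_factor(y, w_prev):
    """Dot product of the price relatives and the weights held over the period."""
    return sum(yi * wi for yi, wi in zip(y, w_prev))


# Hypothetical closing prices quoted in cash, one row per period.
closes = [
    [1.0, 10.0, 5.0],
    [1.0, 11.0, 5.0],
    [1.0, 12.1, 4.0],
]
# Weights held during each period; the first row is all cash.
weights = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

p0 = 100.0
value = p0
log_returns = []
for t in range(1, len(closes)):
    y = price_relatives(closes[t - 1], closes[t])
    g = growth_factor(y, weights[t - 1])
    log_returns.append(math.log(g))  # logarithmic rate of return of the period
    value *= g                       # portfolio value after the period

# The product form and the exponential-of-log-returns form agree.
value_from_logs = p0 * math.exp(sum(log_returns))
```

In this toy run the first period earns nothing because the fund sits entirely in cash, while the second period grows the fund by the weighted average of the price relatives.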

2.3 Transaction Cost

In a real-world scenario, buying or selling assets in a market is not free. The cost is normally from commission fee. Assuming a constant commission rate, this section will re-calculate the final portfolio value in Equation (2.6), using a recursive formula extending a work by Ormos and Urbán (2013).

The portfolio vector at the beginning of Period $t$ is $w_{t-1}$. Due to price movements in the market, at the end of the same period, the weights evolve into

$$ w'_t = \frac{y_t \odot w_{t-1}}{y_t \cdot w_{t-1}}, \tag{2.7} $$

where $\odot$ is the element-wise multiplication. The mission of the portfolio manager now at the end of Period $t$ is to reallocate the portfolio vector from $w'_t$ to $w_t$ by selling and buying relevant assets. Paying all commission fees, this reallocation action shrinks the portfolio value by a factor $\mu_t$. $\mu_t \in (0, 1]$, and will be called the transaction remainder factor from now on. $\mu_t$ is to be determined below. Denoting $p_{t-1}$ as the portfolio value at the beginning of Period $t$ and $p'_t$ at the end,

$$ p_t = \mu_t\, p'_t. \tag{2.8} $$

Figure 1: Illustration of the effect of the transaction remainder factor $\mu_t$. The market movement during Period $t$, represented by the price-relative vector $y_t$, drives the portfolio value and portfolio weights from $p_{t-1}$ and $w_{t-1}$ to $p'_t$ and $w'_t$. The asset selling and purchasing action at time $t$ redistributes the fund into $w_t$. As a side-effect, these transactions shrink the portfolio to $p_t$ by a factor of $\mu_t$. The rate of return for Period $t$ is calculated using portfolio values at the beginning of the two consecutive periods in Equation (2.9).

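The rate of return (2.3) and logarithmic rate of return (2.4) are now

$$ \rho_t = \frac{p_t}{p_{t-1}} - 1 = \frac{\mu_t\, p'_t}{p_{t-1}} - 1 = \mu_t\, y_t \cdot w_{t-1} - 1, \tag{2.9} $$

$$ r_t = \ln \frac{p_t}{p_{t-1}} = \ln\left(\mu_t\, y_t \cdot w_{t-1}\right), \tag{2.10} $$

and the final portfolio value in Equation (2.6) becomes

$$ p_f = p_0 \exp\left(\sum_{t=1}^{t_f+1} r_t\right) = p_0 \prod_{t=1}^{t_f+1} \mu_t\, y_t \cdot w_{t-1}. \tag{2.11} $$

Unlike in Equations (2.2) and (2.4), where transaction cost is not considered, in Equations (2.10) and (2.11) $p'_t \neq p_t$, and the difference between the two values is where the transaction remainder factor comes into play. Figure 1 demonstrates the relationship among portfolio vectors and values and their dynamic relationship on a time axis.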

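The remaining problem is to determine this transaction remainder factor $\mu_t$. During the portfolio reallocation from $w'_t$ to $w_t$, some or all of asset $i$ needs to be sold if $p'_t w'_{t,i} > p_t w_{t,i}$, or equivalently $w'_{t,i} > \mu_t w_{t,i}$. The total amount of cash obtained by all selling is

$$ (1 - c_s)\, p'_t \sum_{i=1}^{m} \left(w'_{t,i} - \mu_t w_{t,i}\right)^{+}, \tag{2.12} $$

where $0 \le c_s < 1$ is the commission rate for selling, and $(\cdot)^{+}$ is the element-wise rectified linear function, $(x)^{+} = x$ if $x > 0$, and $(x)^{+} = 0$ otherwise. This money, together with the original cash reserve $p'_t w'_{t,0}$ less the new reserve $\mu_t p'_t w_{t,0}$, will be used to buy new assets,

$$ (1 - c_p)\left[ w'_{t,0} + (1 - c_s) \sum_{i=1}^{m} \left(w'_{t,i} - \mu_t w_{t,i}\right)^{+} - \mu_t w_{t,0} \right] = \sum_{i=1}^{m} \left(\mu_t w_{t,i} - w'_{t,i}\right)^{+}, \tag{2.13} $$

where $0 \le c_p < 1$ is the commission rate for purchasing, and $p'_t$ has been canceled out on both sides. Using the identity $(a - b)^{+} - (b - a)^{+} = a - b$ and the fact that $w'_{t,0} + \sum_{i=1}^{m} w'_{t,i} = 1 = w_{t,0} + \sum_{i=1}^{m} w_{t,i}$, Equation (2.13) is simplified to

$$ \mu_t = \frac{1}{1 - c_p w_{t,0}} \left[ 1 - c_p w'_{t,0} - (c_s + c_p - c_s c_p) \sum_{i=1}^{m} \left(w'_{t,i} - \mu_t w_{t,i}\right)^{+} \right]. \tag{2.14} $$

The presence of $\mu_t$ inside a linear rectifier means $\mu_t$ is not solvable analytically, and it can only be solved iteratively.

Theorem 1. Denoting

$$ f(\mu) := \frac{1}{1 - c_p w_{t,0}} \left[ 1 - c_p w'_{t,0} - (c_s + c_p - c_s c_p) \sum_{i=1}^{m} \left(w'_{t,i} - \mu\, w_{t,i}\right)^{+} \right], $$

the sequence $\{\tilde{\mu}_t^{(k)}\}$, defined as

$$ \tilde{\mu}_t^{(0)} = \mu_{\odot}, \qquad \tilde{\mu}_t^{(k)} = f\left(\tilde{\mu}_t^{(k-1)}\right), \quad k = 1, 2, \dots, \tag{2.15} $$

converges to $\mu_t$, the solution to Equation (2.14), for any $\mu_{\odot} \in [0, 1]$.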

While this convergence is not stated in Ormos and Urbán (2013), its proof will be given in Appendix A. This theorem provides a way to approximate the transaction remainder factor $\mu_t$ to an arbitrary accuracy. The speed of the convergence depends on the error of the initial guess $\mu_{\odot}$. The smaller $|\mu_t - \mu_{\odot}|$ is, the quicker Sequence (2.15) converges to $\mu_t$. When $c_p = c_s = c$, there is a practice (Moody et al., 1998) to approximate $\mu_t$ with $1 - c \sum_{i=1}^{m} |w'_{t,i} - w_{t,i}|$. Therefore, this work uses this as the first value for the sequence, that is

$$ \mu_{\odot} = 1 - c \sum_{i=1}^{m} \left| w'_{t,i} - w_{t,i} \right|. \tag{2.16} $$
In the training of the neural networks, $\tilde{\mu}_t^{(k)}$ with a fixed $k$ in (2.15) is used. In the back-test experiments, a tolerance $\delta$ dynamically determines $k$: the first $k$ such that $|\tilde{\mu}_t^{(k)} - \tilde{\mu}_t^{(k-1)}| < \delta$ is used, and $\tilde{\mu}_t^{(k)}$ then approximates $\mu_t$. In general, $\mu_t$ and its approximations are functions of the portfolio vectors of the two most recent periods and the price relative vector,

$$ \mu_t = \mu_t\left(w_{t-1}, w_t, y_t\right). \tag{2.17} $$

Throughout this work, a single constant commission rate for both selling and purchasing for all non-cash assets is used, $c_s = c_p = 0.25\%$, the maximum rate at Poloniex.
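The iterative solution of the transaction remainder factor can be sketched as a plain fixed-point loop. This is a minimal sketch under stated assumptions: the function name, the starting guess of 1, the tolerance, and the iteration cap are choices made for the example rather than values taken from the paper, and the weight vectors are hypothetical. The cash sits at index 0 of both weight vectors.

```python
def transaction_remainder(w_prime, w, c_s=0.0025, c_p=0.0025,
                          mu0=1.0, tol=1e-12, max_iter=100):
    """Fixed-point iteration mu <- f(mu) for the transaction remainder factor.

    w_prime: weights after the price movement, cash at index 0
    w:       target weights after reallocation, cash at index 0
    mu0:     starting guess in [0, 1] (an assumption of this sketch)
    """
    c = c_s + c_p - c_s * c_p

    def f(mu):
        # Fraction of the portfolio that has to be sold, over non-cash assets.
        sold = sum(max(wp - mu * wt, 0.0) for wp, wt in zip(w_prime[1:], w[1:]))
        return (1.0 - c_p * w_prime[0] - c * sold) / (1.0 - c_p * w[0])

    mu = mu0
    for _ in range(max_iter):
        mu_next = f(mu)
        if abs(mu_next - mu) < tol:  # stop once successive iterates agree
            return mu_next
        mu = mu_next
    return mu


# Moving everything from cash into one asset only pays the purchase commission.
mu_buy_all = transaction_remainder([1.0, 0.0], [0.0, 1.0])
# Moving everything from one asset into cash only pays the selling commission.
mu_sell_all = transaction_remainder([0.0, 1.0], [1.0, 0.0])
```

Two limiting cases make a useful sanity check: with no reallocation the factor is 1, and a complete move into or out of cash costs exactly one commission.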

The purpose of the algorithmic agent is to generate a time-sequence of portfolio vectors in order to maximize the accumulative capital in (2.11), taking transaction cost into account.

2.4 Two Hypotheses

In this work, only back-test trading is considered, where the trading agent pretends to be back in time at a point in the market history, not knowing any "future" market information, and does paper trading from then onward. As a requirement for the back-test experiments, the following two assumptions are imposed:

  1. Zero slippage: The liquidity of all market assets is high enough that each trade can be carried out immediately at the last price when an order is placed.

  2. Zero market impact: The capital invested by the software trading agent is so insignificant that it has no influence on the market.

In a real-world trading environment, if the trading volume in a market is high enough, these two assumptions are near to reality.

3 Data Treatments

The trading experiments are done in the exchange Poloniex, where there are about 80 tradable cryptocurrency pairs with about 65 available cryptocurrencies (as of May 23, 2017). However, for the reasons given below, only a subset of coins is considered by the trading robot in one period. Apart from the coin selection scheme, this section also gives a description of the data structure that the neural networks take as their input, a normalization pre-process, and a scheme to deal with missing data.

3.1 Asset Pre-Selection

In the experiments of the paper, the 11 most-volumed non-cash assets are preselected for the portfolio. Together with the cash, Bitcoin, the size of the portfolio, $m + 1$, is 12. This number is chosen by experience and can be adjusted in future experiments. For markets with large volumes, like the foreign exchange market, $m$ can be as big as the total number of available assets.

One reason for selecting top-volumed cryptocurrencies (simply called coins below) is that bigger volume implies better market liquidity of an asset. In turn it means the market condition is closer to Hypothesis 1 set in Section 2.4. Higher volumes also suggest that the investment can have less influence on the market, establishing an environment closer to Hypothesis 2. Considering the relatively high trading frequency (30 minutes) compared to some daily trading algorithms, liquidity and market size are particularly important in the current setting. In addition, the market of cryptocurrency is not stable. Some previously rarely- or popularly-traded coins can have a sudden boost or drop in volume in a short period of time. Therefore, the volume for asset preselection is taken over a longer time-frame, relative to the trading period. In these experiments, volumes of 30 days are used.

However, using top volumes for coin selection in back-test experiments can give rise to a survival bias. The trading volume of an asset is correlated to its popularity, which in turn is governed by its historic performance. Giving future volume rankings to a back-test will inevitably and indirectly pass future price information to the experiment, causing unreliable positive results. For this reason, volume information just before the beginning of the back-tests is taken for preselection to avoid survival bias.

3.2 Price Tensor

Historic price data is fed into a neural network to generate the output of a portfolio vector. This subsection describes the structure of the input tensor, its normalization scheme, and how missing data is dealt with.

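The input to the neural networks at the end of Period $t$ is a tensor, $X_t$, of rank 3 with shape $(f, n, m)$, where $m$ is the number of preselected non-cash assets, $n$ is the number of input periods before $t$, and $f = 3$ is the feature number. Since prices further back in the history have much less correlation to the current moment than recent ones, $n = 50$ (a day and an hour) for the experiments. The criterion for choosing the $m$ assets was given in Section 3.1. Features for asset $i$ on Period $t$ are its closing, highest, and lowest prices in the interval. Using the notations from Section 2.2, these are $v_{i,t}$, $v_{i,t}^{(\text{hi})}$, and $v_{i,t}^{(\text{lo})}$. However, these absolute price values are not directly fed to the networks. Since only the changes in prices will determine the performance of the portfolio management (Equation (2.10)), all prices in the input tensor are normalized by the latest closing prices. Therefore, $X_t$ is the stacking of the three normalized price matrices,

$$ X_t = \left[ V_t^{(\text{lo})},\; V_t^{(\text{hi})},\; V_t \right], \tag{3.1} $$

where $V_t$, $V_t^{(\text{hi})}$, and $V_t^{(\text{lo})}$ are the normalized price matrices,

$$ V_t = \left[ v_{t-n+1} \oslash v_t \,\middle|\, v_{t-n+2} \oslash v_t \,\middle|\, \cdots \,\middle|\, v_{t-1} \oslash v_t \,\middle|\, \mathbf{1} \right], $$

$$ V_t^{(\text{hi})} = \left[ v_{t-n+1}^{(\text{hi})} \oslash v_t \,\middle|\, v_{t-n+2}^{(\text{hi})} \oslash v_t \,\middle|\, \cdots \,\middle|\, v_{t-1}^{(\text{hi})} \oslash v_t \,\middle|\, v_t^{(\text{hi})} \oslash v_t \right], $$

$$ V_t^{(\text{lo})} = \left[ v_{t-n+1}^{(\text{lo})} \oslash v_t \,\middle|\, v_{t-n+2}^{(\text{lo})} \oslash v_t \,\middle|\, \cdots \,\middle|\, v_{t-1}^{(\text{lo})} \oslash v_t \,\middle|\, v_t^{(\text{lo})} \oslash v_t \right], $$

with $\mathbf{1} = (1, 1, \dots, 1)^{\top}$, and $\oslash$ being the element-wise division operator.

At the end of Period $t$, the portfolio manager comes up with a portfolio vector $w_t$ using merely the information from the price tensor $X_t$ and the previous portfolio vector $w_{t-1}$, according to some policy $\pi$. In other words, $w_t = \pi(X_t, w_{t-1})$. At the end of Period $t+1$, the logarithmic rate of return for the period due to decision $w_t$ can be calculated with the additional information from the price change vector $y_{t+1}$, using Equation (2.10), $r_{t+1} = \ln(\mu_{t+1}\, y_{t+1} \cdot w_t)$. In the language of RL, $r_{t+1}$ is the immediate reward to the portfolio management agent for its action $w_t$ under environment condition $X_t$.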
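The normalization of the price tensor can be sketched with nested lists, one row per non-cash asset and one column per period. The function names, the toy prices, and the axis order (feature, asset, period, with the features ordered lowest, highest, closing) are assumptions of this sketch; the only property it relies on from the text is that every price is divided by the same asset's latest closing price.

```python
def normalized_matrix(series, latest_close, n):
    """Keep the last n prices of each asset and divide them by that asset's
    latest closing price. series[i] is the history of asset i, oldest first."""
    return [
        [price / latest_close[i] for price in series[i][-n:]]
        for i in range(len(series))
    ]


def price_tensor(close, high, low, n):
    """Stack the normalized lowest, highest, and closing price matrices."""
    latest = [row[-1] for row in close]
    return [
        normalized_matrix(low, latest, n),
        normalized_matrix(high, latest, n),
        normalized_matrix(close, latest, n),
    ]


# Hypothetical histories for two non-cash assets over three periods.
close = [[2.0, 4.0, 8.0], [10.0, 10.0, 5.0]]
high = [[3.0, 5.0, 9.0], [11.0, 12.0, 6.0]]
low = [[1.0, 3.0, 7.0], [9.0, 8.0, 4.0]]

X = price_tensor(close, high, low, n=2)
```

Because of the normalization, the last column of the closing-price matrix is always a column of ones, so two assets with very different absolute prices look alike to the network whenever their relative movements match.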

3.3 Filling Missing Data

Some of the selected coins lack part of the history. This absence of data is due to the fact that these coins appeared relatively recently. Data points before the existence of a coin are marked as Not-a-Number values (NANs) by the exchange. NANs only appear in the training set, because the coin selection criterion is the volume ranking of the last 30 days before the back-tests, meaning all assets must have existed before that.

As the input of a neural network must be real numbers, these NANs have to be replaced. In a previous work of the authors (Jiang and Liang, 2017), the missing data was filled with fake decreasing price series with a decay rate of 0.01, in order for the neural networks to avoid picking these absent assets in the training process. However, it turned out that the networks remembered these particular assets so deeply that they avoided them even when they were in very promising up-climbing trends in the back-test experiments. For this reason, in this current work, flat fake price-movements (0 decay rates) are used to fill the missing data points. In addition, under the novel EIIE structure, the new networks will not be able to reveal the identity of individual assets, preventing them from making decisions based on the long-past bad records of particular assets.
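One way to realize a flat fake history is to pad the missing prefix of a series with its first observed price, so that every padded period has a price relative of exactly 1. The text only specifies a zero decay rate, so the choice of fill value and the function name here are assumptions of this sketch.

```python
import math


def fill_missing_flat(prices):
    """Replace the leading NaNs of a price series with its first observed price.

    A constant price gives price relatives of exactly 1 over the padded span,
    i.e. a flat fake history with zero decay.
    """
    first_valid = next((p for p in prices if not math.isnan(p)), None)
    if first_valid is None:
        raise ValueError("series contains no observed price")
    filled = []
    started = False
    for p in prices:
        if not started and math.isnan(p):
            filled.append(first_valid)  # pad the period before the coin existed
        else:
            started = True
            filled.append(p)
    return filled


nan = float("nan")
padded = fill_missing_flat([nan, nan, 3.0, 4.5])
```

Only the leading gap is padded here, matching the case described in the text where a coin simply did not exist yet; gaps inside an existing history would need a separate decision.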

4 Reinforcement Learning

With the problem defined in Section 2 in mind, this section presents a reinforcement-learning (RL) solution framework using a deterministic policy gradient algorithm. The explicit reward function is also given under this framework.

4.1 The Environment and the Agent

In the problem of algorithmic portfolio management, the agent is the software portfolio manager performing trading actions in the environment of a financial market. This environment comprises all available assets in the markets and the expectations of all market participants towards them.

It is impossible for the agent to get total information of a state of such a large and complex environment. Nonetheless, all relevant information is believed, in the philosophy of technical traders (Charles et al., 2006; Lo et al., 2000), to be reflected in the prices of the assets, which are publicly available to the agent. Under this point of view, an environmental state can be roughly represented by the prices of all orders throughout the market's history up to the moment of the state. Although the full order history is in the public domain for many financial markets, it is too huge a task for the software agent to practically process this information. As a consequence, sub-sampling schemes for the order-history information are employed to further simplify the state representation of the market environment. These schemes include asset preselection described in Section 3.1, periodic feature extraction, and history cut-off. Periodic feature extraction discretizes the time into periods, and then extracts the highest, lowest, and closing prices in each period. History cut-off simply takes the price features of only a recent number of periods to represent the current state of the environment. The resultant representation is the price tensor $X_t$ described in Section 3.2.

Under Hypothesis 2 in Section 2.4, the trading action of the agent will not influence the future price states of the market. However, the action made at the beginning of Period $t$ will affect the reward of Period $t+1$, and as a result will affect the decision of its action. The agent's buying and selling transactions made at the beginning of Period $t+1$, aiming to redistribute the wealth among the assets, are determined by the difference between portfolio weights $w'_t$ and $w_t$. $w'_t$ is defined in terms of $w_{t-1}$ in Equation (2.7), which also plays a role in the action for the last period. Since $w_{t-1}$ has already been determined in the last period, the action of the agent at time $t$ can be represented solely by the portfolio vector $w_t$,

$$ a_t = w_t. \tag{4.1} $$

Therefore a previous action does have influence on the decision of the current one through the dependency of $r_{t+1}$ and $\mu_{t+1}$ on $w_t$ (2.17). In the current framework, this influence is encapsulated by considering $w_{t-1}$ as a part of the environment and inputting it to the agent's action-making policy, so the state at $t$ is represented as the pair of $X_t$ and $w_{t-1}$,

$$ s_t = \left(X_t, w_{t-1}\right), \tag{4.2} $$

where $w_0$ is predetermined in (2.5). The state $s_t$ consists of two parts, the external state represented by the price tensor, $X_t$, and the internal state represented by the portfolio vector from the last period, $w_{t-1}$. Because under Hypothesis 2 of Section 2.4 the portfolio amount is negligible compared to the total trading volume of the market, $p_t$ is not included in the internal state.

4.2 Full-Exploitation and the Reward Function

It is the job of the agent to maximize the final portfolio value $p_f$ of Equation (2.11) at the end of the $t_f + 1$ period. As the agent does not have control over the choices of the initial investment, $p_0$, and the length of the whole portfolio management process, $t_f$, this job is equivalent to maximizing the average logarithmic cumulated return $R$,

$$ R\left(s_1, a_1, \dots, s_{t_f}, a_{t_f}, s_{t_f+1}\right) := \frac{1}{t_f} \ln \frac{p_f}{p_0} = \frac{1}{t_f} \sum_{t=1}^{t_f+1} \ln\left(\mu_t\, y_t \cdot w_{t-1}\right) = \frac{1}{t_f} \sum_{t=1}^{t_f+1} r_t. \tag{4.3} $$

On the right-hand side of (4.3), $w_{t-1}$ is given by action $a_{t-1}$, $y_t$ is part of the price tensor $X_t$ from state variable $s_t$, and $\mu_t$ is a function of $w_{t-1}$, $w_t$ and $y_t$ as stated in (2.17). In the language of RL, $R$ is the cumulated reward, and $r_t / t_f$ is the immediate reward for an individual episode. Different from a reward function using accumulated portfolio value (Moody et al., 1998), the denominator $t_f$ guarantees the fairness of the reward function between runs of different lengths, enabling it to train the trading policy in mini-batches.
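The role of the length normalization in the reward can be seen in a small sketch. It averages the logarithm of the transaction remainder factor times the growth factor over the periods supplied; dividing by the number of supplied periods is an assumption of this sketch, and the factors, price relatives, and weights are hypothetical.

```python
import math


def average_log_return(mus, ys, ws_prev):
    """Mean over periods of ln(mu * (y . w_prev)), with cash at index 0."""
    rewards = [
        math.log(mu * sum(yi * wi for yi, wi in zip(y, w)))
        for mu, y, w in zip(mus, ys, ws_prev)
    ]
    return sum(rewards) / len(rewards)


# Hypothetical two-period run.
mus = [1.0, 0.99]
ys = [[1.0, 1.1], [1.0, 1.0]]
ws_prev = [[0.0, 1.0], [0.5, 0.5]]

short_run = average_log_return(mus, ys, ws_prev)
# Repeating the same market segment doubles the length but not the reward.
long_run = average_log_return(mus * 2, ys * 2, ws_prev * 2)
```

A reward based on accumulated portfolio value would differ between the two runs, which is why the averaged form is the one that can be compared across mini-batches of different lengths.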

With this reward function, the current framework has two important distinctions from many other RL problems. One is that both the episodic and cumulated rewards are exactly expressed. In other words, the domain knowledge of the environment is well-mastered, and can be fully exploited by the agent. This exact expressiveness is based upon Hypothesis 2 of Section 2.4, the zero-market-impact assumption, that an action has no influence on the external part of future states, the price tensor. This isolation of action and external environment also allows one to use the same segment of market history to evaluate different sequences of actions. This feature of the framework is considered a major advantage, because a complete new trial in a trading game is both time-consuming and expensive.

The second distinction is that all episodic rewards are equally important to the final return. This distinction, together with the zero-market-impact assumption, allows $r_t / t_f$ to be regarded as the action-value function of action $w_t$ with a discount factor of 0, taking no consideration of the future influence of the action. Having a definite action-value function further justifies the full-exploitation approach, since exploration in other RL problems is mainly for trying out different classes of action-value functions.

Without exploration, on the other hand, local optima can be avoided by random initialisation of the policy parameters which will be discussed below.

4.3 Deterministic Policy Gradient

A policy is a mapping from the state space to the action space, . With full exploitation in the current framework, an action is deterministically produced by the policy from a state. The optimal policy is obtained using a gradient ascent algorithm. To achieve this, a policy is specified by a set of parameter , and . The performance metric of for time interval is defined as the corresponding reward function (4.3) of the interval,


After random initialisation, the parameters are continuously updated along the gradient direction with a learning rate ,


To improve training efficiency and avoid machine-precision errors, will be updated upon mini-batches instead of the whole training market-history. If the time-range of a mini-batch is , the updating rule for the batch is


with the denominator in the corresponding $R$ defined in (4.3) replaced by $t_{b_2} - t_{b_1}$. This mini-batch approach of gradient ascent also allows online learning, which is important in online trading where new market history keeps coming to the agent. Details of the online learning and mini-batch training will be discussed in Section 5.3.
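The update rule can be illustrated with a deliberately small sketch. It is not the policy network of this paper: the policy here is a state-independent softmax over a parameter vector `theta`, transaction costs are ignored (the transaction remainder factor is assumed to be 1), and the price-relative matrix `y` is synthetic. Under these assumptions the batch reward is the mean of $\ln(\mathbf{y}_t \cdot \mathbf{w})$ and its gradient has a closed form, so the mini-batch gradient ascent step can be written out directly. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def batch_reward(theta, y_batch):
    """Mean log-return over the batch: (1/n_b) * sum_t ln(y_t . w)."""
    w = softmax(theta)
    return float(np.mean(np.log(y_batch @ w)))

def batch_gradient(theta, y_batch):
    """Analytic gradient of batch_reward with respect to theta."""
    w = softmax(theta)
    growth = y_batch @ w                              # y_t . w, shape (n_b,)
    # d ln(y.w) / d theta_k = w_k * (y_k - y.w) / (y.w)
    per_step = w * (y_batch - growth[:, None]) / growth[:, None]
    return per_step.mean(axis=0)

def train(y, n_steps=500, batch_size=16, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.01, size=y.shape[1])   # random initialisation
    for _ in range(n_steps):
        start = rng.integers(0, len(y) - batch_size + 1)
        batch = y[start:start + batch_size]           # time-ordered window
        theta = theta + lr * batch_gradient(theta, batch)  # ascent step
    return theta

# Synthetic price relatives (assumption): cash, a rising asset, a falling asset.
data_rng = np.random.default_rng(1)
T = 400
y = np.column_stack([
    np.ones(T),
    1.01 * np.exp(data_rng.normal(0.0, 0.001, T)),
    0.99 * np.exp(data_rng.normal(0.0, 0.001, T)),
])

theta0 = np.zeros(3)
theta = train(y)
print("reward before:", batch_reward(theta0, y))
print("reward after: ", batch_reward(theta, y))
print("weights:      ", softmax(theta))
```

In this toy setting the learned weights drift towards the rising asset, which is all the example is meant to show; a real policy would map the price tensor and the previous portfolio vector to the action.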

5 Policy Networks

The policy functions will be constructed using three different deep neural networks. The neural networks in this paper differ from a previous version (Jiang and Liang, 2017) in three important innovations: the mini-machine topology invented to target the portfolio management problem, the portfolio-vector memory, and a stochastic mini-batch online learning scheme.

5.1 Network Topologies

The three incarnations of neural networks to build up the policy functions are a CNN, a basic RNN, and an LSTM. Figure 2 shows the topology of a CNN designed for solving the current portfolio management problem, while Figure 3 portrays the structure of a basic RNN or LSTM network for the same problem. In all cases, the input to the networks is the price tensor $\mathbf{X}_t$ defined in (3.1), and the output is the portfolio vector $\mathbf{w}_t$. In both figures, a hypothetical example of an output portfolio vector is used, while the dimension of the price tensor and thus the number of assets are actual values deployed in the experiments. The last hidden layers are the voting scores for all non-cash assets. The softmax outcomes of these scores and a cash bias become the actual corresponding portfolio weights. In order for the neural network to consider transaction cost, the portfolio vector from the last period, $\mathbf{w}_{t-1}$, is inserted into the networks just before the voting layer. The actual mechanism of storing and retrieving portfolio vectors in a parallel manner is presented in Section 5.2.

A vital common feature in all three networks is that the networks flow independently for the $m$ assets while network parameters are shared among these streams. These streams are like independent but identical networks of smaller scopes, separately observing and assessing individual non-cash assets. They only interconnect at the softmax function, just to make sure their output weights are non-negative and sum to unity. We call these streams mini-machines or more formally Identical Independent Evaluators (IIE), and this topology feature the Ensemble of IIE (EIIE), nicknamed the mini-machine approach, to distinguish it from the integrated approach in an earlier attempt (Jiang and Liang, 2017). EIIE is realized differently in Figures 2 and 3. An IIE in Figure 2 is just a chain of convolutions with kernels of height 1, while in Figure 3 it is either an LSTM or a basic RNN taking the price history of a single asset as input.

EIIE greatly improves the performance of the portfolio management. Remembering the historic performance of individual assets, an integrated network in the previous version is more reluctant to invest in a historically unfavorable asset, even if the asset has a much more promising future. On the other hand, without being designed to reveal the identity of the assigned asset, an IIE is able to judge its potential rise and fall merely based on more recent events.

From a practical point of view, EIIE has three other crucial advantages over an integrated network. The first is scalability in asset number. Having the mini-machines all identical with shared parameters, the training time of an ensemble scales roughly linearly with $m$. The second advantage is data-usage efficiency. For an interval of price history, a mini-machine can be trained $m$ times across different assets. Asset-assessing experience of the IIEs is then shared and accumulated in both time and asset dimensions. The final advantage is plasticity to asset collection. Since an IIE's asset-assessing ability is universal without being restricted to any particular assets, an EIIE can update its choice of assets and/or the size of the portfolio in real time, without having to train the network again from scratch.
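The shared-parameter structure can be sketched in a few lines. The evaluator below is a single dense layer with a tanh activation, a stand-in chosen for brevity rather than the convolutional or recurrent IIEs of Figures 2 and 3; the parameter names, layer sizes, and random inputs are all assumptions for illustration. What the sketch is meant to show is that one parameter set scores every asset separately, and that the streams meet only at the softmax.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def init_params(features, n, hidden=8, seed=0):
    """One parameter set, shared by every evaluator (sizes are assumptions)."""
    rng = np.random.default_rng(seed)
    return {
        "w_in": rng.normal(scale=0.1, size=(hidden, features * n)),
        "w_out": rng.normal(scale=0.1, size=hidden + 1),
    }

def iie_score(asset_history, prev_weight, params):
    """Score a single asset from its own (features, n) price window only."""
    hidden = np.tanh(params["w_in"] @ asset_history.ravel())
    hidden = np.append(hidden, prev_weight)   # previous weight joins before voting
    return float(params["w_out"] @ hidden)

def eiie_policy(price_tensor, prev_weights, params, cash_bias=0.0):
    """price_tensor: (features, m, n); prev_weights: (m + 1,), cash first."""
    m = price_tensor.shape[1]
    scores = [
        iie_score(price_tensor[:, i, :], prev_weights[i + 1], params)
        for i in range(m)
    ]
    # The streams interact only here, through the softmax over cash and assets.
    return softmax(np.concatenate([[cash_bias], scores]))

rng = np.random.default_rng(42)
params = init_params(features=3, n=50)

X = rng.uniform(0.9, 1.1, size=(3, 11, 50))      # 11 non-cash assets
w_prev = np.full(12, 1.0 / 12)
w = eiie_policy(X, w_prev, params)

# Plasticity: the same parameters evaluate a smaller asset collection.
w_small = eiie_policy(X[:, :5, :], np.full(6, 1.0 / 6), params)
print(w.shape, w_small.shape)
```

Because no evaluator sees the identity of its asset, permuting the asset rows of the input simply permutes the output weights, and the number of assets can change without touching the parameters.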

Figure 2: CNN Implementation of the EIIE: This is a realization of the Ensemble of Identical Independent Evaluators (EIIE), a fully convolutional network. The first dimensions of all the local receptive fields in all feature maps are 1, making all rows isolated from each other until the softmax activation. Apart from weight-sharing among receptive fields in a feature map, which is a usual CNN characteristic, parameters are also shared between rows in an EIIE configuration. Each row of the entire network is assigned to a particular asset, and is responsible for submitting a voting score to the softmax on the growing potential of the asset in the coming trading period. The input to the network is a $3 \times m \times n$ price tensor, comprising the highest, closing, and lowest prices of $m$ non-cash assets over the past $n$ periods. The outputs are the new portfolio weights. The previous portfolio weights are inserted as an extra feature map before the scoring layer, for the agent to minimize transaction cost.
Figure 3: RNN (Basic RNN or LSTM) Implementation of the EIIE: This is a recurrent realization of the Ensemble of Identical Independent Evaluators (EIIE). In this version, the price inputs of individual assets are taken by small recurrent subnets. These subnets are identical LSTMs or basic RNNs. The structure of the ensemble network after the recurrent subnets is the same as the second half of the CNN in Figure 2.

5.2 Portfolio-Vector Memory

(a) Mini-Batch Viewpoint
(b) Network Viewpoint
Figure 4: A Read/Write Cycle of the Portfolio-Vector Memory: In both graphs, a small vertical strip on the time axis represents a portion of the memory containing the portfolio weights at the beginning of a period. Red memories are being read by the policy network, while blue ones are being overwritten by the network. The two colored rectangles in (a), each consisting of four strips, are examples of two consecutive mini-batches. While (a) exhibits a complete read and write cycle for a mini-batch, (b) shows a cycle within a network (omitting the CNN or RNN part of the network).

In order for the portfolio management agent to minimize transaction cost by restraining itself from large changes between consecutive portfolio vectors, the output of portfolio weights from the previous trading period is input to the networks. One way to achieve this is to rely on the remembering ability of RNN, but with this approach the price normalization scheme proposed in (3.1) has to be abandoned. This normalization scheme is empirically better performing than others. Another possible solution is Direct Reinforcement (RR) introduced by Moody and Saffell (2001). However, both RR and RNN memory suffer from the gradient vanishing problem. More importantly, RR and RNN require serialization of the training process, unable to utilize parallel training within mini-batches.

In this work, inspired by the idea of experience replay memory (Mnih et al., 2016), a dedicated Portfolio-Vector Memory (PVM) is introduced to store the network outputs. As shown in Figure 4, the PVM is a stack of portfolio vectors in chronological order. Before any network training, the PVM is initialized with uniform weights. In each training step, a policy network loads the portfolio vector of the previous period from the memory location at $t-1$, and overwrites the memory at $t$ with its output. As the parameters of the policy networks converge through many training epochs, the values in the memory also converge.

Sharing a single memory stack allows a network to be trained simultaneously against the data points within a mini-batch, enormously improving training efficiency. In the case of the RNN versions of the networks, inserting the last outputs after the recurrent blocks (Figure 3) avoids passing the gradients back to the deep RNN structures, circumventing the gradient vanishing problem.
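The read/write cycle can be sketched as follows. The class and its method names are hypothetical, and `toy_policy` is only a placeholder for the policy network (it moves each previous vector halfway towards a fixed target). The point of the sketch is the access pattern: every period in a mini-batch reads its $\mathbf{w}_{t-1}$ from memory at once, and the outputs are written back afterwards, so no period has to wait for its predecessor within the batch.

```python
import numpy as np

class PortfolioVectorMemory:
    """A chronological stack of portfolio vectors (cash first)."""

    def __init__(self, n_periods, n_assets):
        # Uniform weights before any training, as described in the text.
        self.memory = np.full((n_periods, n_assets + 1), 1.0 / (n_assets + 1))

    def read(self, periods):
        """Return w_{t-1} for every period t in the batch (t >= 1)."""
        return self.memory[np.asarray(periods) - 1]

    def write(self, periods, new_weights):
        """Overwrite the memory at every period t with the network output."""
        self.memory[np.asarray(periods)] = new_weights

def toy_policy(prev_weights, target):
    # Placeholder for the policy network (assumption, not the paper's model).
    return 0.5 * prev_weights + 0.5 * target

pvm = PortfolioVectorMemory(n_periods=10, n_assets=3)
batch = np.arange(5, 9)                      # periods 5, 6, 7, 8
target = np.array([0.1, 0.2, 0.3, 0.4])

prev = pvm.read(batch)                       # all reads happen together
out = toy_policy(prev, target)               # one parallel forward pass
pvm.write(batch, out)                        # then all writes

# On the next pass over the same batch, periods 6 to 8 read the values
# written for periods 5 to 7 in the previous pass.
prev_second_pass = pvm.read(batch)
```

Within a single pass the values read for later periods of the batch are those left by an earlier pass, which is why the memory is expected to settle only as training converges over many epochs.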

5.3 Online Stochastic Batch Learning

With the introduction of the network output-memory, mini-batch training becomes plausible, although the learning framework requires sequential inputs. However, unlike supervised learning, where data points are unordered and mini-batches are random disjoint subsets of the training sample space, in this training scheme the data points within a batch have to be in their time-order. In addition, since data sets are time series, mini-batches starting with different periods are considered valid and distinctive, even if they have a significantly overlapping interval. For example, if the uniform batch size is $n_b$, data sets covering $[t_b, t_b + n_b)$ and $[t_b + 1, t_b + n_b + 1)$ are two validly different batches.

The ever-ongoing nature of financial markets means new data keeps pouring into the agent, and as a consequence the size of the training sample grows indefinitely. Fortunately, it is believed that the correlation between two market price events decays exponentially with the temporal distance between them (Holt, 2004; Charles et al., 2006). With this belief, here an Online Stochastic Batch Learning (OSBL) scheme is proposed.

At the end of the $t$-th period, the price movement of this period will be added to the training set. After the agent has completed its orders for period $t+1$, the policy network will be trained against $N_b$ randomly chosen mini-batches from this set. A batch starting with period $t_b \le t - n_b$ is picked with a geometrically distributed probability $P_\beta(t_b)$,

$$P_\beta(t_b) = \beta (1 - \beta)^{t - t_b - n_b},$$

where $\beta \in (0, 1)$ is the probability-decaying rate determining the shape of the probability distribution and how important recent market events are, and $n_b$ is the number of periods in a mini-batch.
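A minimal sketch of the batch-selection step is given below. The function names are hypothetical, periods are indexed from 0, and the probabilities are renormalised over the finite history; the last two are assumptions of this sketch, made because the geometric distribution is truncated at the start of the available data.

```python
import numpy as np

def batch_start_probabilities(t, n_b, beta):
    """Probabilities of each admissible batch start t_b = 0, ..., t - n_b."""
    starts = np.arange(0, t - n_b + 1)
    probs = beta * (1.0 - beta) ** (t - starts - n_b)
    return starts, probs / probs.sum()       # renormalise the truncated tail

def sample_batches(t, n_b, beta, n_batches, rng):
    """Draw time-ordered mini-batches, favouring recent market history."""
    starts, probs = batch_start_probabilities(t, n_b, beta)
    chosen = rng.choice(starts, size=n_batches, p=probs)
    return [np.arange(s, s + n_b) for s in chosen]

rng = np.random.default_rng(0)
starts, probs = batch_start_probabilities(t=1000, n_b=50, beta=0.01)
batches = sample_batches(t=1000, n_b=50, beta=0.01, n_batches=4, rng=rng)
for b in batches:
    print("batch covers periods", b[0], "to", b[-1])
```

A larger `beta` concentrates the draws on the most recent batches, while a smaller one spreads them further back in time; batches drawn this way may overlap, which the scheme treats as valid.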

6 Experiments

The tools that have been developed up to this point of the article are examined in three back-test experiments of different time frames with all three policy networks on the cryptocurrency exchange Poloniex. Results are compared with many well-established and recently published portfolio-selection strategies. The main financial metric compared is the portfolio value, together with the maximum drawdown and the Sharpe ratio.

6.1 Test Ranges

| Data Purpose | Data Range | Training Data Set |
|---|---|---|
| CV | 2016-05-07 04:00 to 2016-06-27 08:00 | 2014-07-01 to 2016-05-07 04:00 |
| Back-Test 1 | 2016-09-07 04:00 to 2016-10-28 08:00 | 2014-11-01 to 2016-09-07 04:00 |
| Back-Test 2 | 2016-12-08 04:00 to 2017-01-28 08:00 | 2015-02-01 to 2016-12-08 04:00 |
| Back-Test 3 | 2017-03-07 04:00 to 2017-04-27 08:00 | 2015-05-01 to 2017-03-07 04:00 |

Table 6.1: Price data ranges for hyperparameter-selection (cross-validation, CV) and back-test experiments. Prices are accessed in periods of 30 minutes. Closing prices are used for cross validation and back-tests, while highest, lowest, and closing prices in the periods are used for training. The hours of the starting points for the training sets are not given, since they begin at midnight of the days. All times are in UTC.

Details of the time-ranges for the back-test experiments and their corresponding training sets are presented in Table 6.1. A cross-validation set is used for determination of the hyper-parameters, whose range is also listed. All times in the table are in Coordinated Universal Time (UTC). All training sets start at 0 o'clock. For example, the training set for Back-Test 1 starts at 00:00 on November 1st, 2014. All price data is accessed with Poloniex's official Application Programming Interface (API) (https://poloniex.com/support/api/).

6.2 Performance Measures

Different metrics are used to measure the performance of a particular portfolio selection strategy. The most direct measurement of how successful a portfolio management is over a timespan is the accumulative portfolio value (APV), $p_t$. It is unfair, however, to compare the PVs of two managements starting from different initial values. Therefore, APVs here are measured in the unit of their initial values, or equivalently $p_0 = 1$ and thus

$$p_t = p_t / p_0.$$

In this unit, APV is then closely related to the accumulated return, and in fact it only differs from the latter by 1. Under the same unit, the final APV (fAPV) is the APV at the end of a back-test experiment, $p_f = p_f / p_0 = p_{t_f + 1} / p_0$.

A major disadvantage of APV is that it does not measure the risk factors, since it merely accumulates all the periodic returns without considering fluctuation in these returns. A second metric, the Sharpe ratio (SR) (Sharpe, 1964, 1994), is used to take risk into account. The ratio is a risk-adjusted mean return, defined as the average of the return in excess of the risk-free return divided by its standard deviation,

$$S = \frac{\mathbb{E}_t[\rho_t - \rho_F]}{\sqrt{\operatorname{var}_t(\rho_t - \rho_F)}},$$

where $\rho_t$ are periodic returns defined in (2.9), and $\rho_F$ is the rate of return of a risk-free asset. In these experiments the risk-free asset is Bitcoin. Because the quoted currency is also Bitcoin, the risk-free return is zero, $\rho_F = 0$, here.

Although the SR considers volatility of the portfolio values, it treats upwards and downwards movements equally. In reality upwards volatility contributes to positive returns, but downwards volatility to loss. In order to highlight the downwards deviation, Maximum Drawdown (MDD) (Magdon-Ismail and Atiya, 2004) is also considered. MDD is the biggest loss from a peak to a trough, and mathematically

$$D = \max_{\tau > t} \frac{p_t - p_\tau}{p_t}.$$
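The three measures can be computed from a series of portfolio values as in the sketch below. It assumes the periodic return is $\rho_t = p_t / p_{t-1} - 1$ and uses the population standard deviation; both are choices of this sketch, and the function names and the sample series are made up for illustration.

```python
import numpy as np

def final_apv(portfolio_values):
    """Final accumulated portfolio value in units of the initial value."""
    p = np.asarray(portfolio_values, dtype=float)
    return float(p[-1] / p[0])

def sharpe_ratio(portfolio_values, risk_free=0.0):
    """Mean excess periodic return divided by its standard deviation."""
    p = np.asarray(portfolio_values, dtype=float)
    returns = p[1:] / p[:-1] - 1.0           # periodic returns (assumed form)
    excess = returns - risk_free
    return float(excess.mean() / excess.std())

def max_drawdown(portfolio_values):
    """Largest relative loss from a running peak to a later trough."""
    p = np.asarray(portfolio_values, dtype=float)
    running_peak = np.maximum.accumulate(p)
    return float(np.max((running_peak - p) / running_peak))

values = [1.0, 1.2, 0.9, 1.5, 1.35]          # made-up portfolio values
print("fAPV:", final_apv(values))
print("SR:  ", sharpe_ratio(values))
print("MDD: ", max_drawdown(values))
```

For this series the largest drawdown is the fall from 1.2 to 0.9, a loss of one quarter of the peak value, even though the series ends above where it started.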


6.3 Results

BT1: 2016-09-07 to 2016-10-28; BT2: 2016-12-08 to 2017-01-28; BT3: 2017-03-07 to 2017-04-27

| Algorithm | MDD (BT1) | fAPV (BT1) | SR (BT1) | MDD (BT2) | fAPV (BT2) | SR (BT2) | MDD (BT3) | fAPV (BT3) | SR (BT3) |
|---|---|---|---|---|---|---|---|---|---|
| CNN | 0.224 | 29.695 | 0.087 | 0.216 | 8.026 | 0.059 | 0.406 | 31.747 | 0.076 |
| bRNN | 0.241 | 13.348 | 0.074 | 0.262 | 4.623 | 0.043 | 0.393 | 47.148 | 0.082 |
| LSTM | 0.280 | 6.692 | 0.053 | 0.319 | 4.073 | 0.038 | 0.487 | 21.173 | 0.060 |
| iCNN | 0.221 | 4.542 | 0.053 | 0.265 | 1.573 | 0.022 | 0.204 | 3.958 | 0.044 |
| Best Stock | 0.654 | 1.223 | 0.012 | 0.236 | 1.401 | 0.018 | 0.668 | 4.594 | 0.033 |
| UCRP | 0.265 | 0.867 | -0.014 | 0.185 | 1.101 | 0.010 | 0.162 | 2.412 | 0.049 |
| UBAH | 0.324 | 0.821 | -0.015 | 0.224 | 1.029 | 0.004 | 0.274 | 2.230 | 0.036 |
| Anticor | 0.265 | 0.867 | -0.014 | 0.185 | 1.101 | 0.010 | 0.162 | 2.412 | 0.049 |
| OLMAR | 0.913 | 0.142 | -0.039 | 0.897 | 0.123 | -0.038 | 0.733 | 4.582 | 0.034 |
| PAMR | 0.997 | 0.003 | -0.137 | 0.998 | 0.003 | -0.121 | 0.981 | 0.021 | -0.055 |
| WMAMR | 0.682 | 0.742 | -0.0008 | 0.519 | 0.895 | 0.005 | 0.673 | 6.692 | 0.042 |
| CWMR | 0.999 | 0.001 | -0.148 | 0.999 | 0.002 | -0.127 | 0.987 | 0.013 | -0.061 |
| RMR | 0.900 | 0.127 | -0.043 | 0.929 | 0.090 | -0.045 | 0.698 | 7.008 | 0.041 |
| ONS | 0.233 | 0.923 | -0.006 | 0.295 | 1.188 | 0.012 | 0.170 | 1.609 | 0.027 |
| UP | 0.269 | 0.864 | -0.014 | 0.188 | 1.094 | 0.009 | 0.165 | 2.407 | 0.049 |
| EG | 0.268 | 0.865 | -0.014 | 0.187 | 1.097 | 0.010 | 0.163 | 2.412 | 0.049 |
| $B^K$ | 0.436 | 0.758 | -0.013 | 0.336 | 0.770 | -0.012 | 0.390 | 2.070 | 0.027 |
| CORN | 0.999 | 0.001 | -0.129 | 1.000 | 0.0001 | -0.179 | 0.999 | 0.001 | -0.125 |
| M0 | 0.335 | 0.933 | -0.001 | 0.308 | 1.106 | 0.008 | 0.180 | 2.729 | 0.044 |
Table 6.2: Performances of the three EIIE (Ensemble of Identical Independent Evaluators) neural networks, an integrated network, and some traditional portfolio selection strategies in three different back-test experiments (in UTC, detailed time-ranges listed in Table 6.1) on the cryptocurrency exchange Poloniex. The performance metrics are Maximum Drawdown (MDD), the final Accumulated Portfolio Value (fAPV) in the unit of initial portfolio amount ($p_f / p_0$), and the Sharpe ratio (SR). The bold algorithms are the EIIE networks introduced in this paper, named after the underlying structures of their IIEs. For example, bRNN is the EIIE of Figure 3 using basic RNN evaluators. Three benchmarks (italic), the integrated CNN (iCNN) previously proposed by the authors (Jiang and Liang, 2017), and some recently reviewed strategies (see note a; Li et al., 2015a; Li and Hoi, 2014) are also tested. The algorithms in the table are divided into five categories: the model-free neural networks, the benchmarks, follow-the-loser strategies, follow-the-winner strategies, and pattern-matching or other strategies. The best performance in each column is highlighted with boldface. All three EIIEs significantly outperform all other algorithms in the fAPV and SR columns, showing the profitability and reliability of the EIIE machine-learning solution to the portfolio management problem.

a. The exceptions are RMR of Huang et al. (2013) and WMAMR of Gao and Zhang (2013).

The performances of all three EIIE policy networks proposed in the current paper are compared with those of the integrated CNN (iCNN) (Jiang and Liang, 2017), several well-known or recently published model-based strategies, and three benchmarks.

The three benchmarks are the Best Stock, the asset with the highest fAPV over the back-test interval; the Uniform Buy and Hold (UBAH), a portfolio management approach that simply spreads the total fund equally into the preselected assets and holds them without any buying or selling until the end (Li and Hoi, 2014); and Uniform Constant Rebalanced Portfolios (UCRP) (Kelly, 1956; Cover, 1991).

Most of the strategies to be compared in this work were surveyed by Li and Hoi (2014), including Anticor (Borodin et al., 2004), Online Moving Average Reversion (OLMAR) (Li et al., 2015b), Passive Aggressive Mean Reversion (PAMR) (Li et al., 2012), Confidence Weighted Mean Reversion (CWMR) (Li et al., 2013), Online Newton Step (ONS) (Agarwal et al., 2006), Universal Portfolios (UP) (Cover, 1991), Exponential Gradient (EG) (Helmbold et al., 1998), Nonparametric Kernel Based Log Optimal Strategy ($B^K$) (Györfi et al., 2006), Correlation-driven Nonparametric Learning Strategy (CORN) (Li et al., 2011), and M0 (Borodin et al., 2000). The exceptions are Weighted Moving Average Mean Reversion (WMAMR) (Gao and Zhang, 2013) and Robust Median Reversion (RMR) (Huang et al., 2013).

Table 6.2 shows the performance scores fAPV, SR, and MDD of the EIIE policy networks as well as of the compared strategies for the three back-test intervals listed in Table 6.1. In terms of fAPV or SR, the best performing algorithm in Back-Tests 1 and 2 is the CNN EIIE, whose final wealth is more than twice that of the runner-up in the first experiment. The top three positions in these two measures in all back-tests are occupied by the three EIIE networks, which lose only in the MDD measure. This result demonstrates the powerful profitability and consistency of the current EIIE machine-learning framework.

When only considering fAPV, all three EIIEs outperform the best assets in all three back-tests, while the only model-based algorithm that does so is RMR, and only in Back-Test 3. Because of the high commission rate of 0.25% and the relatively high half-hourly trading frequency, many traditional strategies perform badly. Especially in Back-Test 1, all model-based strategies have negative returns, with fAPV less than 1 or equivalently negative SRs. On the other hand, the EIIEs are able to achieve at least 4-fold returns in 50 days in different market conditions.

Figures 5, 6 and 7 plot the APV against time in the three back-tests respectively for the CNN and bRNN EIIE networks, two selected benchmarks and two model-based strategies. The benchmarks Best Stock and UCRP are two good representatives of the market. In all three experiments, both CNN and bRNN EIIEs beat the market throughout the entirety of the back-tests, while traditional strategies are only able to achieve that in the second half of Back-Test 3 and very briefly elsewhere.

Figure 5: Back-Test 1: 2016-09-07-4:00 to 2016-10-28-8:00 (UTC). Accumulated portfolio values (APV, $p_t / p_0$) over the interval of Back-Test 1 for the CNN and basic RNN EIIEs, the Best Stock, the UCRP, RMR, and the ONS are plotted in log-10 scale here. The two EIIEs are leading throughout the entire time-span, growing consistently with only a few drawdown incidents.
Figure 6: Back-Test 2: 2016-12-08-4:00 to 2017-01-28-8:00 (UTC), log-scale accumulated wealth. This is the worst experiment among the three back-tests for the EIIEs. However, they are able to climb steadily until the end of the test.
Figure 7: Back-Test 3: 2017-03-07-4:00 to 2017-04-27-8:00 (UTC), log-scale accumulated wealth. All algorithms struggle and consolidate at the beginning of this experiment, and both of the EIIEs experience two major dips on March 15 and April 9. This diving contributes to their high Maximum Drawdown in this test (Table 6.2). Nevertheless, this is the best back-test for both EIIEs in terms of final wealth.

7 Conclusion

This article proposed an extensible reinforcement-learning framework solving the general financial portfolio management problem. Designed to cope with multi-channel market inputs and to directly output portfolio weights as the market actions, the framework can be fitted with different deep neural networks, and is linearly scalable with the portfolio size. This scalability and extensibility are the results of the EIIE meta topology, which is able to accommodate many types of weight-sharing neural-net structures in the lower level. To take transaction cost into account when training the policy networks, the framework includes a portfolio-weight memory, the PVM, allowing the portfolio-management agent to learn to refrain from oversized adjustments between consecutive actions, while avoiding the gradient vanishing problem faced by many recurrent networks. The PVM also allows parallel training within batches, beating recurrent approaches to the transaction cost problem in learning efficiency. Moreover, the OSBL scheme governs the online learning process, so that the agent can continuously digest constantly incoming market information while trading. Finally, the agent was trained using a fully exploiting deterministic policy gradient method, aiming to maximize the accumulated wealth as the reinforcement reward function.

The profitability of the framework surpasses all surveyed traditional portfolio-selection methods, as demonstrated in the paper by the outcomes of three back-test experiments over different periods in a cryptocurrency market. In these experiments, the framework was realized using three different underlying networks, a CNN, a basic RNN and an LSTM. All three versions performed better in final accumulated portfolio value than the other trading algorithms in comparison. The EIIE networks also monopolized the top three positions in the risk-adjusted score in all three tests, indicating the consistency of the framework in its performance. Another deep reinforcement learning solution, previously introduced by the authors, was assessed and compared as well under the same settings. It also lost to the EIIE networks, indicating that the EIIE framework is a major improvement over its more primitive cousin.

Among the three EIIE networks, the LSTM had much lower scores than the CNN and the basic RNN. The significant gap in performance between the two RNN species under the same framework might be an indicator of the well-known secret in financial markets, that history repeats itself. Not being designed to forget its input history, a vanilla RNN is more able than an LSTM to exploit repetitive patterns in price movement for higher yields. The gap might also be due to a lack of fine-tuning of the hyper-parameters for the LSTM. In the experiments, the same set of structural hyper-parameters was used for both the basic RNN and the LSTM.

Despite the success of the EIIE framework in the back-tests, there is room for improvement in future work. The main weakness of the current work is the assumptions of zero market impact and zero slippage. In order to consider market impact and slippage, a large amount of well-documented real-world trading examples will be needed as training data. Some protocol will have to be invented for documenting trade actions and market reactions. If that is accomplished, live trading experiments of the auto-trading agent in its current version can be recorded, for its future version to learn the principles behind market impact and slippage from this recorded history. Another shortcoming of the work is that the framework has only been tested in one market. To test its adaptability, the current and later versions will need to be examined in back-tests and live trading in a more traditional financial market. In addition, the current reward function will have to be amended, if not abandoned, for the reinforcement-learning agent to include awareness of longer-term market reactions. This may be achieved by a critic network. However, the backbone of the current framework, including the EIIE meta topology, the PVM, and the OSBL scheme, will continue to play important roles in future versions.

Appendix A Proof of Theorem 2.14

In order to prove Theorem 2.14, it is handy to have the following five lemmas. The function in Theorem 2.14 is monotonically increasing. In other words, if . Recall that from Section 2.3,

The fact that the linear rectifier ( if , otherwise) is monotonically increasing readily implies that is also monotonically increasing.

Using the fact that ,

for . For a commission rate, is impractically high. Therefore, always holds.

The proof is split into two cases. The fact that implies will be used in Case 1.

Case 1:

Since ,

Case 2:

This will be proved by contradiction. By assuming ,

Bringing the two ’s together,


Noting that ,

Using identity , (A.1) becomes

Moving the terms to the right-hand side,


The left-hand side of (A.2) is a non-positive number, and the right-hand side is a non-negative number. The former is greater than the latter, arriving at a contradiction.

Therefore, in both cases.

the sequence , defined as

converges to . This is a special case of the final goal Theorem 2.14 when . This convergence is proved by the Monotone Convergence Theorem (MCT) (Rudin, 1976, Chapter 5). The monotonicity of by Lemma A with Mathematical Induction establishes an upper bound for .

Note that by definition is the transaction remainder factor, and . The monotonicity of sequence itself can be also proved by Mathematical Induction and Lemma A.

If , then is the solution of Equation (2.14), and the proof ends here. Otherwise, the sequence is strictly increasing and bounded above by . In that case, by MCT, , where is the Least Upper Bound of . As a result, Therefore, is the solution to Equation (2.14), and hence