1 Introduction
Most trading decisions nowadays are made by algorithmic trading systems. According to a Deutsche Bank report, the share of automated high-frequency trading in the US equity market reached 50% [DeutBank_HFT].
Decision-making processes based on data analysis are called quantitative trading strategies. Quantitative trading strategies can be divided into two categories: fundamental analysis [Abarbanell1997] and technical analysis [Lo2000, Park2007]. Fundamental analysis is based on real-world business activity, so it relies mostly on financial statements and balance sheets. Technical analysis, on the other hand, is based solely on technical signals such as historical price and volume. Technicians believe that profitable patterns can be discovered by analyzing historical price movements. Traditional quantitative traders attempt to find profitable strategies by constructing algorithms that best represent their beliefs about the market. Although such strategies offer rational clues and theoretical justification for their logic, they reflect only a part of the overall market dynamics. For instance, the momentum strategy [Momentum] assumes that if clear trends exist, prices will maintain their direction of movement. The mean reversion strategy [MeanRevert] assumes that asset prices tend to revert to their average over time. However, it is nontrivial to maintain stable profits under evolving market conditions by leveraging only specific aspects of the financial market.
Inspired by the recent success of deep learning (DL), researchers have put much effort into finding new profitable patterns from a variety of factors. Early approaches using DL in financial applications focused on improving the prediction of stock movements. [DARNN] proposed hierarchical attention combined with a recurrent neural network (RNN) architecture to improve time series prediction. Besides traditional signals such as stock chart information, there have been numerous attempts to find profitable patterns from new factors such as news and sentiment analysis [stocknet], [eventdriven]. More recently, [AdvStock] and [HATS] attempted to create more robust predictions by incorporating adversarial training and corporate relation information, respectively.
Forecasting models such as those mentioned above require explicit supervision in the form of labels. These labels take various forms depending on the task at hand (e.g. up/down/stationary signals for classification). Despite their apparent simplicity, defining and designing these labels is nontrivial.
Reinforcement learning (RL) approaches provide a more seamless framework for decision making [Idiosyncrasies]. The advantage of using RL to make trading decisions is that an agent is trained to maximize its long-term reward without supervision. [FFDR] applied RL with fuzzy learning and recurrent RL. [PracticalRL] demonstrated RL’s effectiveness in asset management.
Although the aforementioned work shows promising results, many challenges remain in applying DL to portfolio management. Most existing DL-based methods propose a model that simply maximizes expected return without considering risk factors. However, the ultimate goal of portfolio management is to maximize expected return subject to a given risk level, as stated in modern portfolio theory [MPT]. In other words, we must consider risk-adjusted return (e.g. the Sharpe ratio) rather than expected return alone. Relatively little work has considered risk-adjusted return metrics.
In this paper, we propose a cooperative Multi-Agent reinforcement learning-based Portfolio management System (MAPS) inspired by portfolio diversification strategies used in large investment companies. We focus on the fact that investment firms diversify not only the assets composing their portfolios, but also the portfolios themselves. Likewise, rather than creating a single optimal strategy, MAPS creates diversified portfolios by distributing assets to each agent.
Each agent in MAPS creates its own portfolio based on the current state of the market. We designed the loss function of MAPS to guide our agents to act as diversely as possible while maximizing their own returns. Agents in MAPS can be seen as a group of independent “investors” cooperating to create a diversified portfolio. With multiple agents, MAPS as a system holds a portfolio of portfolios.
We believe that no single strategy fits every market condition, so it is essential to diversify our strategies to mitigate risk and achieve higher risk-adjusted returns. Each agent works towards optimizing its own portfolio while keeping in mind that the system as a whole would suffer from a lower risk-adjusted return if it were to create a portfolio similar to those of other agents. Our contributions can be summarized as follows:

To the best of our knowledge, this is the first attempt to use cooperative multi-agent reinforcement learning (MARL) in the field of portfolio management. Given raw financial trading data as the state description, our agents maximize risk-adjusted return.

We devise a new loss function with a diversification penalty term to effectively encourage agents to act as diversely as possible while maximizing their own return. Our experimental results show that the diversification penalty effectively guides our agents to act diversely when creating portfolios.

We conduct extensive experiments on 12 years’ worth of US market data covering approximately 3,000 companies. The results show that MAPS effectively improves risk-adjusted returns and the diversification of portfolios. Furthermore, we conduct an ablation study and show that adding more agents to our system results in better Sharpe ratios due to further diversification.
2 Problem Statement
In this section, we first introduce the concept of a Markov decision process (MDP) and define how trading decisions are made in the single-agent case. We then extend the single-agent case to the multi-agent case.
2.1 Single-Agent Reinforcement Learning
Single-agent decision-making problems are usually formulated as MDPs. An MDP is defined as a tuple $\langle S, A, r \rangle$, where $S$ is a finite set of current states, $A$ is a finite set of actions, and $r$ is a reward. The state transition function is omitted for simplicity, since state transitions are not affected by the agent’s actions in our work. Considering the stochastic and dynamic nature of the financial market, we model stock trading as an MDP as follows:

State $s$: a set of features that describes the current state of a stock. In general, different types of information such as historical price movement, trading volume, financial statements, and sentiment scores can be used as the current state. We use the sequence of closing prices of a particular company over a fixed number of past days.

Action $a$: a set of actions. Our agents can take a long, short, or neutral position.

Reward $r$: a reward based on an agent’s action at the current state. In this study, the reward is calculated from the current action and the next-day return of a company.

Policy $\pi$: the trading strategy of an agent. A policy $\pi$ is essentially a probability distribution over actions given a state $s$. The goal of an agent is to find the optimal policy, which yields the maximum cumulative reward.
2.2 Multi-Agent Reinforcement Learning Extension
The extension of an MDP to the multi-agent case is called a stochastic game, which is defined as a tuple $\langle S, \mathbf{a}, \mathbf{r} \rangle$, where $S$ is a finite set of current states and $\mathbf{a} = (a_1, \dots, a_n)$ is the joint action of the $n$ agents. The rewards $\mathbf{r} = (r_1, \dots, r_n)$ also depend on the current state $s$ and the joint action $\mathbf{a}$ of all agents. As in the single-agent case, the state transition function is omitted in the multi-agent case.
In fully cooperative MARL, the goal of the agents is to find the optimal joint policy $\pi(s, \mathbf{a})$ that maximizes the cumulative rewards $\mathbf{r}$ of all agents. However, there are two fundamental issues in MARL: the curse of dimensionality and the nonstationarity problem.
With each additional agent, the joint action space grows exponentially. For example, if one agent can take three actions (i.e. Long, Neutral, and Short), having ten agents leads to a total of $3^{10} = 59{,}049$ joint actions. As a result, it becomes more and more difficult to find the optimal joint policy in MARL as the number of agents increases.
In addition, portfolios are comprised of various companies, and an action is typically taken for each company. Considering the combination of all actions for all companies causes the corresponding action space to become exponentially large.
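The growth described above can be made concrete with a small calculation. The sketch below assumes every agent independently picks one of the same number of actions for each company; the function name `joint_action_count` is illustrative and does not come from the paper.

```python
def joint_action_count(num_actions: int, num_agents: int, num_companies: int = 1) -> int:
    """Number of joint actions when every agent picks one action per company."""
    return num_actions ** (num_agents * num_companies)


# Ten agents, three positions (Long, Neutral, Short), one company.
print(joint_action_count(3, 10))       # 59049
# The same ten agents deciding on two companies at once.
print(joint_action_count(3, 10, 2))    # 3486784401
```

Even two companies push the count into the billions, which is why the joint space cannot be enumerated directly.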
Furthermore, in MARL it is also difficult to find the optimal policy of an agent because all agents learn in conjunction, and consequently the optimal policy of an agent changes as the policies of the other agents change.
Therefore, addressing the proper way to handle the curse of dimensionality problem and designing an appropriate reward structure are central problems in MARL. In the next section, we introduce how we handle these problems and consequently how we can effectively guide the agents to act differently from one another while maximizing their own returns.
3 Methods
3.1 MAPS Architecture
In this section, we describe the overall architecture of MAPS, which is illustrated in Figure 2. In MAPS, all of our agents are trained via Deep Q-learning [DQN]. Each agent consists of an MLP encoder and a Q-network, with structures varying from agent to agent. The input to each agent is a shared state $s$, a vector consisting of the normalized closing price sequence of the past days, so its length equals the number of days in that window. The output of each agent is a vector of length 3, with each element representing the expected long-term reward of the actions Long, Neutral, and Short, respectively, given the current state $s$. Therefore, the MLP encoder maps raw state features provided by the environment (i.e. the closing price sequence of a company) into action values.
To handle the curse of dimensionality and the nonstationarity problem mentioned in the previous section, we use the following methods. First, when calculating the action values of an agent, the other agents’ actions are ignored. Doing so limits the number of possible actions to three (i.e. Long, Neutral, and Short). Second, each agent maintains two MLP network parameter sets: a network parameter set and a target network parameter set. The network parameter set is used when performing the gradient step to minimize the loss; the target network parameter set is simply a copy of it, updated periodically to handle the nonstationarity problem caused by changes in the other agents’ policies during training. We also adopt experience replay [DQN] to reduce correlation between subsequent episodes.
The overall training procedure is as follows. In each iteration, an episode for each agent is sampled using an $\epsilon$-greedy policy [suttonRL] from a training data set of size $N \times T$, where $N$ is the number of companies and $T$ is the number of days, and is stored in that agent’s fixed-size memory buffer. Then, a batch is sampled from the memory buffer to calculate the loss. Finally, a gradient step is performed to minimize the loss with respect to the network parameters. The network parameters are copied to the target network parameters at a fixed interval of iterations.
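Two pieces of this procedure are easy to illustrate in isolation: the $\epsilon$-greedy choice over the three action values, and the periodic copy of the network parameters into the target parameters. The following is a minimal sketch under stated assumptions, not the authors' implementation; parameters are stood in for by plain lists, and the names `epsilon_greedy`, `maybe_sync_target`, and `copy_every` are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: uniform random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda k: q_values[k])


def maybe_sync_target(iteration, copy_every, params, target_params):
    """Copy the online parameters into the target set every `copy_every` iterations."""
    if iteration % copy_every == 0:
        target_params[:] = params  # in-place copy keeps the same list object
    return target_params


# With epsilon = 0 the choice is purely greedy: index 1 holds the largest value.
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))
```

Between two synchronizations the target parameters stay frozen, so the bootstrapped targets do not shift with every gradient step.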
3.2 Shared State Memory Buffer
The first step in our training procedure is to sample an episode from the training data and store it in the memory buffer. Unlike in the single-agent case, the memory buffer is a matrix with one row per agent.
An episode is a tuple defined as $e_{i,j} = \langle s_j, a_{i,j}, r_{i,j}, s'_j \rangle$, where $i$ and $j$ denote the index of an agent and the column index of the memory buffer, respectively. $s_j$ and $a_{i,j}$ refer to the current state of company $c$ at time $t$ and the action chosen by the $\epsilon$-greedy policy of agent $i$ given the current state $s_j$, while $r_{i,j}$ and $s'_j$ refer to the immediate reward received by agent $i$ and the subsequent state of company $c$ at time $t+1$. Note that there is no subscript index for the agents in $s_j$ and $s'_j$. This means that the same input state is stored in the same column of the memory buffer for every agent.
An action and reward are defined as follows.
(1) 
(2) 
Here, the action is derived from the output vector of agent $i$ given the input state, and the reward is derived from the daily return of company $c$ between time $t$ and time $t+1$, expressed in percent. The value 1, 0, or -1 is assigned to the action for the Long, Neutral, or Short position, respectively.
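As a minimal sketch of this mapping, the code below turns a three-element output vector into a position and computes a reward from the next-day return. It assumes the output order Long, Neutral, Short given in Section 3.1 and assumes the reward is the position multiplied by the percentage return; the exact forms of Equations (1) and (2) are not recoverable here, and the function names are hypothetical.

```python
POSITIONS = (1, 0, -1)  # Long, Neutral, Short, in the order of the output vector


def greedy_position(q_values):
    """Map the index of the largest action value to a position in {1, 0, -1}."""
    best = max(range(3), key=lambda k: q_values[k])
    return POSITIONS[best]


def reward(position, price_today, price_next):
    """Assumed reward: position times the next-day return in percent."""
    daily_return_pct = 100.0 * (price_next - price_today) / price_today
    return position * daily_return_pct


# A short position earns a positive reward when the price falls by 2%.
print(reward(greedy_position([0.2, 0.1, 0.7]), 100.0, 98.0))
```

Under this assumption a neutral position always yields zero reward, regardless of the price move.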
The next step is to sample random batches from the memory buffer to calculate the loss. To formulate the procedure, we define a sampled batch as a matrix with one row per agent and one column per batch element, and define an index vector $v$ whose length equals the batch size. At every iteration, each element of $v$ is assigned a random integer sampled from the range of valid column indices of the memory buffer. Then the element at the $i$-th row and $j$-th column of the batch matrix is assigned the episode stored at row $i$ and column $v_j$ of the memory buffer, where $v_j$ indicates the $j$-th element of $v$.
The intuition behind this sampling method is to share the same input state sequence among agents in each batch. Since the index vector is resampled once per iteration rather than separately for each agent, the same column index sequence is sampled from the memory buffer for every agent. Consequently, as shown in Figure 2, every agent is trained using the same input sequence, and we can therefore guide the agents to act differently from each other despite being given identical input sequences.
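The shared sampling step can be sketched as follows. This is an illustrative reading of the procedure, assuming the memory buffer is a list of per-agent rows whose columns hold `(state, action, reward, next_state)` tuples; `sample_shared_batch` is a hypothetical name.

```python
import random


def sample_shared_batch(memory, batch_size, rng=random):
    """Draw one column-index sequence and apply it to every agent's row.

    memory[i][k] is the episode of agent i stored in buffer column k.
    """
    num_columns = len(memory[0])
    columns = [rng.randrange(num_columns) for _ in range(batch_size)]  # drawn once
    batch = [[agent_row[k] for k in columns] for agent_row in memory]
    return batch, columns


# Two agents share the states in each column but may hold different actions.
memory = [
    [("s%d" % k, action, 0.0, "n%d" % k) for k in range(5)]
    for action in (1, -1)
]
batch, columns = sample_shared_batch(memory, batch_size=4, rng=random.Random(0))
print([episode[0] for episode in batch[0]] == [episode[0] for episode in batch[1]])
```

Because the indices are drawn once, column $j$ of the batch refers to the same company and day for every agent, which is what makes a per-column comparison of agent behavior meaningful.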
3.3 Loss Function
As previously mentioned, our goal is to guide the agents in MAPS to act as diversely as possible while maximizing their own rewards. To achieve these two conflicting goals, we design our loss function to have two components, namely a local loss and a global loss. The local loss of each agent is calculated based only on the reward and action value of that agent. We first define a per-episode error term for agent $i$, calculated using the single episode at the $i$-th row and $j$-th column of the batch matrix.
(3) 
where $s$, $a$, $r$, $s'$, and $a'$ indicate the current state, current action, immediate reward, next state, and next action, respectively, and $Q$ refers to the action-value function. These values are obtained from the corresponding episode in the batch matrix. Note that when choosing the next action $a'$ given the next state $s'$, the target network of agent $i$ is used to avoid the moving target problem. We obtain the local loss by summing this per-episode term over the batch as follows:
(4) 
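One plausible form of the local loss is the standard deep Q-learning objective, sketched below. This is an assumption rather than a reconstruction of Equations (3) and (4): the discount factor `gamma` and the squaring of the error are not stated in the surviving text, and `q_online` and `q_target` are hypothetical callables that map a state to three action values.

```python
def td_error(q_online, q_target, s, a_idx, r, s_next, gamma=0.99):
    """Temporal-difference error for one episode.

    The next action is chosen and evaluated with the target network,
    as the text describes; gamma is an assumed discount factor.
    """
    next_values = q_target(s_next)
    a_next = max(range(len(next_values)), key=lambda k: next_values[k])
    return r + gamma * next_values[a_next] - q_online(s)[a_idx]


def local_loss(q_online, q_target, batch_row, gamma=0.99):
    """Sum of squared TD errors over one agent's row of the batch."""
    return sum(
        td_error(q_online, q_target, s, a_idx, r, s_next, gamma) ** 2
        for (s, a_idx, r, s_next) in batch_row
    )


# Toy network that returns the same action values for every state.
toy_q = lambda state: [1.0, 0.0, -1.0]
print(local_loss(toy_q, toy_q, [("s", 2, 0.0, "s2")], gamma=0.5))  # 2.25
```

Here `a_idx` is the index into the output vector (0 for Long, 1 for Neutral, 2 for Short), not the position value itself.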
However, it is not possible for an agent to be aware of the actions of other agents with the local reward alone. Therefore, the global loss provides additional guidance to our agents. We define the positional confidence score of agent $i$ for company $c$, calculated using the single episode at the $i$-th row and $j$-th column of the batch matrix, as follows:
(5) 
where the score is computed from the output vector of agent $i$ given the input state. Since the elements of the output vector correspond to the Long, Neutral, and Short actions of agent $i$, respectively, the score represents the $i$-th agent’s confidence in how much company $c$’s price will rise at the subsequent time step. By concatenating the positional confidence scores calculated over the batch, we get the positional confidence vector of agent $i$:
(5) 
We penalize similar behavior among the agents by minimizing the correlation of positional confidence vectors between agents. Formally, the global loss can be expressed as:
(6) 
Note that when creating the positional confidence vector of another agent $k$ for agent $i$, we use the target network of agent $k$ to mitigate the effect of the nonstationarity problem.
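The correlation penalty can be illustrated with a small sketch. Two assumptions are made here because the original equations did not survive extraction: the positional confidence score is taken to be the Long value minus the Short value, and the global loss of an agent is taken to be the sum of Pearson correlations between its confidence vector and those of the other agents. All function names are hypothetical.

```python
import math


def positional_confidence(q_values):
    """Assumed score: action value of Long minus action value of Short."""
    return q_values[0] - q_values[2]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def global_loss(own_vector, other_vectors):
    """Sum of correlations between one agent's confidence vector and the others'."""
    return sum(pearson(own_vector, other) for other in other_vectors)


# One perfectly aligned peer and one perfectly opposed peer cancel out.
print(round(global_loss([1.0, 2.0, 3.0], [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]), 6))
```

Minimizing such a term pushes an agent away from peers whose confidence scores move together with its own across the shared batch.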
Finally, our total loss is a weighted sum of the local loss and the global loss.
(7) 
where the weight is a hyperparameter whose value lies within a fixed interval. The training procedure is summarized in Algorithm 1. The value of maxiter is 400,000; the two remaining training hyperparameters are set to 128 and 1000, respectively.
3.4 Portfolio of Portfolios
When training is finished, each of our agents is expected to output an action value. We create the final portfolio vector at time $t$ by summing the portfolio vectors of all agents. The portfolio vector of agent $i$ at time $t$ is a vector with one element per company, subject to a normalization constraint on its elements. Thus, its $c$-th element represents the weight assigned to company $c$ at time $t$ by agent $i$. We use the positional confidence score to create the portfolio vector of agent $i$ at time $t$ as follows:
(8) 
Note that the positional confidence score here is indexed by a time superscript rather than a batch position, since testing is carried out over the test set, not on a batch. The final portfolio vector is calculated as follows.
(9) 
where each summand is the $c$-th element of an agent’s portfolio vector. Finally, the final portfolio vector is normalized so that its total weight equals 1.0.
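The aggregation step can be sketched as below. Since Equations (8) and (9) are not recoverable here, the sketch assumes each agent's weights are proportional to its positional confidence scores and that normalization means the absolute weights sum to 1; both are assumptions, and the function names are hypothetical.

```python
def agent_portfolio(confidences):
    """Scale one agent's confidence scores so that absolute weights sum to 1."""
    total = sum(abs(c) for c in confidences)
    return [c / total if total > 0 else 0.0 for c in confidences]


def final_portfolio(agent_confidences):
    """Sum the per-agent portfolios element-wise, then renormalize."""
    portfolios = [agent_portfolio(c) for c in agent_confidences]
    summed = [sum(weights) for weights in zip(*portfolios)]
    total = sum(abs(w) for w in summed)
    return [w / total if total > 0 else 0.0 for w in summed]


# Two agents, three companies: they agree on company 0 and disagree on company 1.
weights = final_portfolio([[1.0, -1.0, 0.0], [1.0, 1.0, 2.0]])
print([round(w, 4) for w in weights])
```

Where agents disagree, their weights partly cancel, so the combined portfolio takes a smaller position in that company than either agent alone.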
4 Experiments
4.1 Experimental Settings
Table 1: Statistics of the dataset.
Set          Period       N      #Data
Training     2000-2004    1534   1876082
Validation   2004-2006    1651   779272
Test         2006-2018    2061   6019248
Dataset
We collected roughly 18 years’ worth of daily closing price data for approximately 3,000 US companies. Specifically, we used the list of companies from the Russell 3000 index.
We divided our dataset into a training set, a validation set, and a test set. Detailed statistics of our dataset are summarized in Table 1. The validation set is used to optimize the hyperparameters.
States & Hyperparameters
Among many possible candidates, we gave our agents raw historical closing prices as state description features. However, it is worth noting that our framework is not restricted to certain types of state features, and other kinds of features such as technical indicators or sentiment scores can also be used. We expect further diversification to occur if various sources of information were to be provided to MAPS, and leave this as an open question for future work.
MAPS@k is our proposed model with k agents in the system. k is an arbitrary hyperparameter and we choose among the values [4, 8, 16] for our experiments to show the effect of using different numbers of agents.
To explain the structure of the MLP encoder, we take MAPS@4 as an example. An MLP of size represents agent #1, and each subsequent agent has an extra layer with double the hidden units. For example, agent #2 would be an MLP of size . MAPS@8 and MAPS@16 are simply structures where this pattern is repeated two and four times, respectively.
Batch normalization [batch] is used after every layer except the final layer, and the Adam optimizer [adam] is used with a learning rate of 0.00001 to train our models. The loss-weighting hyperparameter was chosen empirically based on the validation set.
Evaluation Metric
We measure profitability of methods with Return and Sharpe ratio.

Return We calculated the daily return of our portfolio as follows:
(5.1) where $p_t^c$ denotes the closing price of stock $c$ at time $t$.

Sharpe Ratio The annualized Sharpe ratio is used to measure the performance of an investment relative to its risk. The ratio measures the return earned in excess of the risk-free rate per unit of volatility (risk) as follows:
(5.2) where $r^f_t$ is the daily risk-free rate at time $t$ and 252 is the number of business days in a year.
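Both metrics can be computed in a few lines. The sketch below uses the common textbook definitions: a portfolio's daily return as the weighted sum of simple close-to-close returns, and the annualized Sharpe ratio as the mean daily excess return divided by its sample standard deviation, scaled by the square root of 252. The use of the sample (rather than population) standard deviation is an assumption, and the function names are hypothetical.

```python
import math


def portfolio_daily_return(weights, prices_today, prices_next):
    """Weighted sum of per-stock simple returns between two consecutive closes."""
    return sum(
        w * (p1 - p0) / p0
        for w, p0, p1 in zip(weights, prices_today, prices_next)
    )


def annualized_sharpe(daily_returns, daily_risk_free, periods_per_year=252):
    """Mean excess return over its sample standard deviation, annualized."""
    excess = [r - rf for r, rf in zip(daily_returns, daily_risk_free)]
    mean = sum(excess) / len(excess)
    variance = sum((e - mean) ** 2 for e in excess) / (len(excess) - 1)
    return math.sqrt(periods_per_year) * mean / math.sqrt(variance)


# A long/short pair with equal moves nets out to a zero daily return.
print(abs(portfolio_daily_return([0.5, -0.5], [100.0, 50.0], [110.0, 55.0])) < 1e-12)
```

The function needs at least two daily observations with non-zero variance; a constant return series would divide by zero.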
Baselines
We compare MAPS with the Russell 3000, which is one of the major indices, and the following baselines:

Momentum (MOM) is an investment strategy based on the belief that current market trends will continue. We use the simplest version of the strategy: the price movements of the last 10 days are used as momentum indicators.

Mean-Reversion (MR) is a strategy that assumes there is a stable underlying trend line and that the price of an asset fluctuates randomly around this line. MR assumes that asset prices will eventually revert to the long-term mean. The 30-day moving average is used as the mean reversion indicator.

MLP, CNN Among many existing stock movement forecasting methods, we chose these two models as our forecasting baselines, as they are widely used in stock forecasting [Forecastsurvey]. The MLP model in our experiments consists of five hidden layers with sizes [256, 128, 64, 32, 16]. The CNN model has four convolutional layers with [16, 16, 32, 32] filters and one fully connected layer of size 32. Max-pooling layers are applied after the second and fourth layers. Batch normalization is applied in both models. Both models have one additional prediction layer with a softmax function and are trained with a 3-label (i.e. up, neutral, down) cross-entropy loss.

DARNN refers to the dual-stage attention-based RNN [DARNN]. It is the state of the art, and attention mechanisms are used in each stage to identify relevant input features and select relevant encoder hidden states. As the DARNN model was originally designed to forecast time series signals, we trained it to predict the future prices of the assets with a mean squared error loss. A portfolio is created based on the expected return implied by the predicted asset prices.
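The two rule-based baselines can be illustrated with simple signal functions. The text only states the lookback lengths (10 days for MOM, a 30-day moving average for MR), so the sign-based rules below are one plausible reading, not the exact rules used in the experiments; the function names are hypothetical.

```python
def momentum_signal(closes, lookback=10):
    """+1 if the price rose over the last `lookback` days, -1 if it fell, else 0."""
    change = closes[-1] - closes[-1 - lookback]
    return (change > 0) - (change < 0)


def mean_reversion_signal(closes, window=30):
    """+1 if the price is below its moving average (expect a rise), -1 if above."""
    average = sum(closes[-window:]) / window
    gap = average - closes[-1]
    return (gap > 0) - (gap < 0)


# On a steadily rising series the two strategies take opposite sides.
rising = [float(day) for day in range(1, 41)]
print(momentum_signal(rising), mean_reversion_signal(rising))  # 1 -1
```

The opposite signals on the same series show why each rule captures only one aspect of market dynamics.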
4.2 Results
Table 2: Annualized return and Sharpe ratio of each model in the two test periods.
Models     2006-2012           2012-2018
           Return    Sharpe    Return    Sharpe
MOM        3.938     1.149     3.223     1.198
MR         2.262     0.899     2.220     0.816
MLP        16.377    1.309     1.744     0.368
CNN        17.036    1.093     3.294     0.442
DARNN      11.860    3.283     4.309     2.113
MAPS@4     17.955    4.829     4.846     2.121
MAPS@8     22.744    4.751     6.123     2.175
MAPS@16    23.467    5.547     5.567     2.247
Table 3: Average correlation of daily returns among the agents of each MAPS variant.
Models     2006-2012    2012-2018
MAPS@4     0.3415       0.3456
MAPS@8     0.4622       0.4424
MAPS@16    0.2318       0.2429
Performance analysis
Our experiment results are summarized in Table 2, and Figure 3 illustrates a comparison of cumulative wealth based on the portfolios created by each model. MAPS outperformed all baselines in terms of both annualized return and Sharpe ratio. Some interesting findings are as follows:

The performance of traditional strategies like MOM and MR varies with market conditions and is generally poor. As these strategies use a single rule leveraging only certain aspects of market dynamics, their performance is not robust as the market evolves.

Forecast-based methods show better performance than traditional approaches in terms of annualized return. Naturally, the performance of forecast-based methods relies heavily on the prediction accuracy of the model. The MLP and CNN perform better in general but did not always outperform the traditional methods. Only DARNN performed consistently better in both annualized return and Sharpe ratio for both testing periods.

MAPS outperformed all baselines in our experiments. It is worth noting that MAPS shows a better Sharpe ratio even when the return is similar. This supports the effectiveness of diversification with multiple agents from the perspective of risk-adjusted return. We further observe that MAPS with more agents generally obtains a better Sharpe ratio.

One unexpected result is that the Sharpe ratio does not scale linearly with the number of agents. We can interpret this with the daily return correlation scores of the different MAPS variants, summarized in Table 3. If our agents act diversely, the correlation of their daily returns should be small. In Table 3 we find that the average correlation of MAPS@8 is higher than that of MAPS@4. The agents of MAPS@8 act more similarly to each other than those of MAPS@4, resulting in a lower Sharpe ratio in the 2006-2012 period despite higher returns. Further improvement of our learning scheme may solve this issue, and we leave this for future work.
Effect of global loss
To investigate the effect of the global loss, we compare the learning process of MAPS with and without the global loss. During learning, we calculate the correlation of the positional confidence scores between all agents over the entire validation set. We calculate this value every 10,000 training iterations and average it over all companies and pairs of agents. As the positional confidence value indicates the type of action taken by an agent, a higher correlation means more similar actions among the agents. As we can see in Figure 4, the average correlation values of MAPS without the global loss increase rapidly and converge to much higher values than those of MAPS with the global loss. In contrast, the correlations of MAPS trained with the global loss increase slowly and remain small. The results support the effectiveness of the global loss in making agents act independently.
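The averaging over pairs of agents can be sketched as follows for a single company. This is an illustrative computation, assuming each agent contributes one series of positional confidence scores over the validation days; averaging over companies would repeat it per company. The function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def average_pairwise_correlation(series_per_agent):
    """Mean Pearson correlation over all unordered pairs of agents."""
    pairs = list(combinations(series_per_agent, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Three agents: two move together, the third moves against both.
scores = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
print(round(average_pairwise_correlation(scores), 4))  # -0.3333
```

A value near 1 would indicate that the agents behave almost identically, while values near or below 0 indicate diversified behavior.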
Case Study
To better understand how our agents act differently given identical state features, we illustrate an example portfolio in Figure 5. The black line shows the movement of Amazon’s stock price in 2016. The colored rectangle at the bottom of the figure describes the actions of our agents. In this case, we have eight agents in MAPS, and each row shows the positions taken by one agent, with each color representing a position. Red, grey, and blue refer to long, neutral, and short positions taken at a given time, respectively. What we can observe here is that the agents in our system make different decisions based on their own understanding of the market. For instance, in the spring of 2014 (the period we outlined with the bright white box), the future movement of Amazon’s stock price seems volatile and uncertain after a steep fall and several price corrections. Two out of eight agents decided to take long positions, betting that the future price would rise, and two agents chose a short position with the opposite expectation. This kind of discrepancy in actions is prevalent throughout the trading process, making our portfolio as a whole sufficiently diversified.
5 Conclusion
In this work, we propose MAPS, a cooperative Multi-Agent reinforcement learning-based Portfolio management System. Guided by our proposed loss function, the agents in MAPS act as differently as possible while maximizing their own rewards. Experiments with 12 years of US market data show that MAPS outperforms most of the existing baselines in terms of Sharpe ratio. We also demonstrated the effectiveness of our learning scheme and showed, through detailed analysis, how our agents’ independent actions result in a diversified portfolio.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF2017R1A2A1A17069645, NRF2017M3C4A7065887).