Adversarial recovery of agent rewards from latent spaces of the limit order book

12/09/2019 ∙ by Jacobo Roa-Vicens, et al. ∙ University of Oxford, JPMorgan Chase & Co., Twitter, J.P. Morgan, UCL

Inverse reinforcement learning has proved its ability to explain state-action trajectories of expert agents by recovering their underlying reward functions in increasingly challenging environments. Recent advances in adversarial learning have allowed extending inverse RL to applications with non-stationary environment dynamics unknown to the agents, arbitrary structures of reward functions and improved handling of the ambiguities inherent to the ill-posed nature of inverse RL. This is particularly relevant in real-time applications on stochastic environments involving risk, like volatile financial markets. Moreover, recent work on simulation of complex environments enables learning algorithms to engage with real market data through simulations of its latent space representations, avoiding a costly exploration of the original environment. In this paper, we explore whether adversarial inverse RL algorithms can be adapted and trained within such latent space simulations from real market data, while maintaining their ability to recover agent rewards robust to variations in the underlying dynamics, and transfer them to new regimes of the original environment.






1 Introduction

Reinforcement learning (RL) achieves robust performance in a wide range of fields, with particularly relevant success in model-free applications mnih2013playing ; vanHasselt2015dqn where agents explore an environment with no prior knowledge about its underlying dynamics, and learn a policy that maximizes a certain cumulative reward function. Such a learning process typically requires the agent to access the environment repeatedly for trial-and-error exploration; however, reinforcement learning in risk-critical tasks such as automatic navigation or financial risk control would not allow such an exploration, since decisions have to be made in real time in a non-stationary environment where the risks and costs inherent to a trial-and-error approach can be unaffordable. In addition, such RL performance generally requires that the designer manually specify a reward function that represents the task adequately and can be optimized efficiently ng1999policyinvariance .

In the context of learning from expert demonstrations, inverse reinforcement learning has proved capable of inferring the reward function of expert agents from observations of their state-action trajectories ziebart2008maximum ; levine2011nonlinear with decreasing dependence on pre-defined assumptions about linearity or the general structure of the underlying reward function, generally under a maximum entropy framework ziebart2010modeling .

Recent advances in inverse RL have extended its application to high-dimensional state spaces with unknown dynamics and arbitrary non-linear reward structures, thanks to the use of neural networks to represent the reward function finn2016guided . Adversarial inverse reinforcement learning (AIRL) fu2017learning extends inverse RL further, achieving the recovery of rewards robust to variations in the dynamics of the environment, while learning at the same time a policy to perform the task; AIRL builds on the equivalences found by FinnCAL16 between inverse RL under maximum entropy and the training process of generative adversarial networks (GANs) goodfellow2014gans .

1.1 Contributions

Financial markets are a particularly challenging case for inverse RL: high-dimensional, stochastic environments with non-stationary transition dynamics unknown to the observer, and with variable reactions to actions from agents. This makes AIRL particularly interesting to test on real financial data, aiming at learning from experts robust reward functions that can then be transferred to new regimes of the original environment.

In this paper we explore the adaptation of AIRL to a volatile financial environment based on real tick data from a limit order book (LOB) in the stock market, attempting to recover the rewards from three expert market agents through an observer with no prior knowledge of the underlying dynamics, where such dynamics can also change with time following real market data, and where the environment reacts to the agent’s actions. This is a challenging environment where we would expect AIRL to remain better suited to learn from the expert agent trajectories than methods aimed directly at recovering the policy (generative adversarial imitation learning ermon2016gail , or GAIL), given the expected added value of AIRL regarding robustness of the recovered rewards with respect to varying dynamics of the environment. GAIL provides means analogous to generative adversarial networks that allow extraction of policies directly from data through a model-free approach for complex behaviours in high-dimensional environments. The performance of each method is then evaluated against the proportion of the expert agent’s total cumulative reward that can be obtained by policies recovered through adversarial learning.

1.2 Model environment based on real data

The adversarial learning algorithms used in the experiment will require a model of the environment where the observed agent trajectories took place, in order to evaluate the iterative estimations of rewards and policies most likely to have generated the observations. In practice, we would observe expert trajectories from agents as training data for adversarial learning, and then transfer the learnt policies to new test market data from the real environment.

The latent space model of the equity LOB described in yuanbo2019 shows how the Mixture-Density Recurrent Network (RNN-MDN) architecture presented in World Models ha2018worldmodels to learn the transition dynamics of the environment can also be applied to financial time series, learning a causal representation of LOB market data where RL experts can be trained to learn policies then transferable back to the original environment. Building on this work, we train three expert traders in the latent space market model through advantage actor critic (A2C) mnih2019a3c , double DQN vanHasselt2015dqn , and Policy Gradient williams92policygradient respectively, whose learnt policies remain profitable when tested on subsequent time series out of sample. The collections of expert state-action trajectories generated by each trained agent serve as input for the AIRL and GAIL algorithms in our experiments, following the implementation by fu2017learning . Our conclusions will then examine the proportion of the experts’ cumulative rewards produced by the policies learnt through either AIRL or GAIL from each expert agent.

2 Background

2.1 Related work

A number of previous works have applied inverse RL to financial data, focusing on evaluations of feature vectors for state representations at different scales to explore a market of competing agents hendricks2017 , and assuming linear structures for the reward functions. Other works have focused on assessing the comparative performance of probabilistic inverse RL based on Bayesian neural networks roavicens2019 as an alternative to inverse RL based on Gaussian processes levine2011nonlinear , applied to a simulated financial market under a maximum entropy framework and allowing non-linear structures in the reward functions of the agents.

On the other hand, model-based approaches have attempted to recover specific parameters such as risk aversion implied by data halperin2018 , where the observer assumes a certain structure for the underlying utility of the agent. Research with simulations of real environments through neural networks kaiser2019mbrl allows extending the original action and reward spaces to produce observations in the same spaces. The representation of an environment through generative models has also been previously described by World Models ha2018worldmodels and its adaptation to limit order books yuanbo2019 , where the authors obtain latent representations of the environment enabling agents to learn a policy efficiently, and to transfer it back to the original environment. Other authors have explored applications of Gaussian inverse RL to learning investor sentiment models from historic data yangYA18 , and then extracting tradable signals from the model.

2.2 Adversarial Learning under Maximum Causal Entropy

The adversarial IRL experiments presented in this paper follow the AIRL implementation described in fu2017learning , framed through a standard Markov decision process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \gamma)$ consisting of a state space $\mathcal{S}$; a set $\mathcal{A}$ of eligible actions; a model of transition dynamics $\mathcal{T}$, where each $\mathcal{T}(s' \mid s, a)$ represents the transition probability from state $s$ through action $a$ to state $s'$; the unknown reward function $r(s, a)$ that we aim at recovering; and the discount factor $\gamma$ taking values in $[0, 1]$.

In general terms, forward RL seeks an optimal policy $\pi^*$ that maximizes an expected cumulative reward under the transition dynamics reflected in $\mathcal{T}$ and the policy $\pi$. We denote the set of state-action trajectories derived from such an expert policy as $\mathcal{D}$.
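As a minimal sketch of the cumulative reward that forward RL maximizes, the discounted return of a single trajectory can be computed as below. The function name and the example rewards are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the reward sequence of one trajectory."""
    total = 0.0
    # Accumulate backwards so each step applies a single multiplication by gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards for a three-step trajectory (illustrative values only).
example_rewards = [1.0, 0.0, 2.0]
example_return = discounted_return(example_rewards, gamma=0.5)
```

With a discount factor of 1 the function reduces to the plain sum of rewards along the trajectory.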

Under this framework, inverse RL attempts to recover the reward function that would most likely have generated a given set of expert trajectories $\mathcal{D}$ under the MDP of unknown reward. The adversarial learning methods considered will observe the trajectories in $\mathcal{D}$ to infer a reward $\hat{r}$ (that yields a policy $\hat{\pi}$) to explain $\mathcal{D}$. The performance of each model is then evaluated through the total cumulative reward that $\hat{\pi}$ can recover against the total reward obtained by the expert agent under policy $\pi^*$.

One of the main challenges of inverse RL resides in its ill-posed nature: firstly, a given set of state-action trajectories may be explained by several different optimal policies ng1999policyinvariance ; the maximum entropy principle as proposed in ziebart2008maximum addresses this problem, assuming that the optimal policy that would have generated the trajectories in $\mathcal{D}$ follows $\pi^*(a \mid s) \propto \exp\big(Q^*_{\text{soft}}(s, a)\big)$ as developed in ziebart2008maximum ; haarnoja2017reinforcement . The soft $Q$ function refers to the learning process under the standard RL formulation, where the objective is regularized against a metric of differential entropy.
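To illustrate the maximum entropy assumption, the sketch below turns a set of soft Q-values for one state into action probabilities proportional to their exponentials. The Q-values and the three-action setting are hypothetical placeholders, not quantities from the experiments.

```python
import math


def soft_policy(q_values):
    """Return action probabilities proportional to exp(Q(s, a)) for one state."""
    q_max = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp(q - q_max) for q in q_values]
    normaliser = sum(weights)
    return [w / normaliser for w in weights]


# Hypothetical soft Q-values for three actions in a single state.
probs = soft_policy([1.0, 1.0, 2.0])
```

Actions with equal Q-values receive equal probability, and no action is ever assigned zero probability, which is what lets this model explain expert trajectories that are not perfectly optimal.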

Secondly, the optimal policy determined may also be explained by a number of different reward functions, where an external observer lacks the means to distinguish true rewards from those that are a product of non-stationary environment dynamics. The latter, when transferred to an environment with new dynamics, may fail to produce an optimal outcome. Adversarial inverse RL fu2017learning introduces the concept of disentangled rewards, aiming at learning reward functions robust to variations in the environmental dynamics. This makes AIRL particularly attractive for studying agents in financial markets, given their inherent need for continuous adaptation to changing dynamics.

The connection between inverse RL under maximum causal entropy and GANs as described by FinnCAL16 compares the iterative cycles between generator and discriminator in the GAN with cases of inverse RL that employ neural nets to learn generic reward functions under unknown environment dynamics finn2016guided ; boularias2011a . Moreover, a probabilistic approach to the problem allows capturing the stochastic nature of the relationship between the reward objective and the policy actually executed.

Following the cited previous works, we use as benchmark a GAIL implementation adapted to our market model: while GAIL is able to learn a policy from expert demonstrations, we expect AIRL to outperform in an environment with volatile, non-stationary dynamics, besides recovering a reward function in addition to the policy.

3 Experimental setup and results

The first requirement of our experiments is a model environment based on real financial data, that allows training of RL agents and is also compatible with the AIRL and GAIL learning algorithms. Learning a rich representation of the environment adds the general advantage of allowing RL models that are simpler, smaller and less expensive to train than model-free counterparts for a certain target performance of the learnt policy, as they search in a smaller space. Our experimental setup is based on an OpenAI Gym structure brockman2016openai used to adapt the AIRL and GAIL implementations by fu2017learning to a financial version of a World Model environment as follows:

3.1 World models learnt from observed environments

World Models ha2018worldmodels provides an unsupervised probabilistic generative method to learn world representations in space and time through neural networks, including the transition dynamics as our experiment requires. RL agents operating in the real world would adapt their policies to their expectations of such transition dynamics, i.e. the probabilities of a certain future state given the current state and choice of action contained in $\mathcal{T}$. RL agents can then be trained within the world model, with their learnt policies remaining optimal once transferred back to the original environment (or even outperforming the agents trained in the original environment in some instances).

These models capture three elements of the original environment: firstly, an auto-encoder learns a latent representation of each state vector from its original high-dimensional space, compressing all the observed information of that state into a vector within a latent space of lower dimension (Fig. 1).

Figure 1: An auto-encoder extracts a latent-space representation from the original LOB data nxcore_nanex

Secondly, after learning the latent vector $z_t$ for each state, the model learns the transition dynamics connecting consecutive states $z_t$ and $z_{t+1}$ through certain action choices $a_t$. This model of the transition probabilities $\mathcal{T}$ in the MDP is learnt through an RNN Schmidhuber90makingthe ; Schmidhuber90anon ; Schmidhuber91nips990_393 , able to learn sequential information from time series along various time horizons by learning $P(z_{t+1} \mid a_t, z_t, h_t)$, where $h_t$ represents a hidden state of the RNN.

The choice of RNN architectures is based on their suitability to capture the transition model of sequential data Ha_Eck2017 , thus learning some predictive modelling of future states. In order to better reflect the stochastic nature of the target environment, the deterministic output of the RNN is then fed into a Mixture Density Network (MDN) Bishop94mixturedensity to produce a probabilistic prediction of next latent state. This combined RNN-MDN architecture is detailed further in ha2018worldmodels ; graves2013 .

Finally, an RL module decides which action to take given all the above information, including the transition predicted in the latent space. As noted in ha2018worldmodels , this approach has the additional advantage of providing RL agents with access to the latent space learnt by the model, which includes representations of both the current state and an indication of what to expect in upcoming states through the transitions modelled, sampling from the probability distribution of future states provided by the RNN-MDN.

Figure 2: Model for prediction of latent space transitions. Our RL processes are run in the latent state space.

3.2 LOB version of world models learnt from real market data

The original version of World Models learns from environments based on computer simulations, such as CarRacing-v0 klimov2016 and VizDoom kempkaWRTJ16vizdoom . We base our experiments on the extension of this architecture to financial environments presented in yuanbo2019 , where the auto-encoder is trained to learn a latent space from sequences of snapshots of actual limit order book data, and the RNN-MDN models the transitions between consecutive time frames as follows:

The data used to train the model consists of LOB time series from shares of Tencent Holdings Ltd. (HK-700) traded on the Hong Kong Stock Exchange over sixty trading days between January and March 2018. Data from the following twenty trading days in April is then used as testing reference for the adversarial learning algorithms. Each state contains the sequence of the last 10 data ticks for 3 LOB levels, so that the sequential information necessary to learn the transition dynamics is captured in the data for each state.

LOB data is of high dimension, generally denominated Level II market data, and features several relevant data points for each timestamp. Firstly, it contains various levels of bid (buy) and ask (sell) prices for each timestamp, typically ranked in decreasing order of competitiveness. The mid price is generally derived as the average of the highest bid and the lowest ask in the first LOB level. Secondly, it includes the trading volume associated with each of the prices offered. The combined progression in time of this data structure is often represented as a tensor of three dimensions tran2017hfttensors . Finally, trade stamp series contain the price and size of the last transactions executed out of previous LOB states, used in this model as RL exploration.
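The mid price derivation described above can be sketched as follows. The snapshot layout (lists of price and volume pairs) and the quotes themselves are hypothetical and serve only to illustrate the calculation.

```python
def mid_price(bids, asks):
    """Average of the highest bid and the lowest ask at the first LOB level."""
    best_bid = max(price for price, _volume in bids)
    best_ask = min(price for price, _volume in asks)
    return (best_bid + best_ask) / 2.0


# Hypothetical 3-level snapshot as (price, volume) pairs; not real HK-700 quotes.
bids = [(99.5, 300), (99.0, 500), (98.5, 200)]
asks = [(100.5, 250), (101.0, 400), (101.5, 100)]
snapshot_mid = mid_price(bids, asks)
```

Only the first level of each side enters the mid price; the deeper levels and the volumes remain available as features of the state.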

In order to allow the RNN-MDN to model the transition dynamics, each state consists of 10 consecutive LOB data ticks for each of the above features. The auto-encoder learns a compressed version of the data in lower dimensionality, through a latent space representation of short sequences of original market data (Fig. 1). The model used in yuanbo2019 features a CNN-based auto-encoder, looking back 10 updates in the limit order book that contains 3 levels of orders, and condenses the matrix into a vector of dimension 12.
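As a rough sketch of how states of 10 consecutive ticks could be assembled before encoding, the snippet below slides a window over a tick stream. It assumes overlapping windows with a stride of one tick, which the text does not specify, and uses placeholder features for each of the 3 levels.

```python
def build_states(ticks, window=10):
    """Group consecutive ticks into overlapping states of `window` ticks each."""
    return [ticks[i:i + window] for i in range(len(ticks) - window + 1)]


# Hypothetical stream of 12 ticks; each tick holds 3 levels of placeholder features.
ticks = [[[t, level] for level in range(3)] for t in range(12)]
states = build_states(ticks)
```

Each element of `states` then plays the role of the matrix that the auto-encoder compresses into a latent vector.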

The RNN-MDN used to model the transition dynamics assumes that the next latent state $z_{t+1}$ follows a Gaussian mixture distribution yuanbo2019 ; ha2018worldmodels , where the model learns the parameters of the distribution with each mixture component assumed to be Gaussian, and 128 neurons per layer in the RNN. Once the prediction of the next state is sampled in the latent space, the reward with respect to the present state and the action chosen is produced through a regression model.
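Sampling the next latent state from a Gaussian mixture can be sketched as below. This is an illustration under assumptions: the two-component mixture, its weights, means and standard deviations are hypothetical, and the components are taken to have diagonal covariance. In the model these parameters would be produced by the MDN from the RNN output.

```python
import random


def sample_next_latent(weights, means, stds, rng):
    """Draw one latent vector from a Gaussian mixture with diagonal covariance."""
    # Pick a mixture component according to its weight.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Sample each latent dimension independently from the chosen component.
    return [rng.gauss(mu, sigma) for mu, sigma in zip(means[k], stds[k])]


rng = random.Random(0)
latent_dim = 12  # latent vector size reported for the auto-encoder
# Hypothetical mixture parameters, for illustration only.
weights = [0.7, 0.3]
means = [[0.0] * latent_dim, [1.0] * latent_dim]
stds = [[0.1] * latent_dim, [0.2] * latent_dim]
z_next = sample_next_latent(weights, means, stds, rng)
```

Drawing a sample rather than taking the mean keeps the stochastic character of the environment visible to the RL agent.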

3.3 Training experts in the LOB latent space

Once the full world model is trained and integrated into an OpenAI Gym structure, we use it to train the expert agents whose trajectories will serve as input to the adversarial learning experiments. We follow the selection of RL algorithms in yuanbo2019 , shown to learn policies in the world model that remain profitable when transferred back to the real environment: double DQN vanHasselt2015dqn , advantage actor critic (A2C) mnih2019a3c and policy gradient williams92policygradient .

We now use each of the three trained expert agents to generate a collection of 100 state-action trajectories per agent, each with a length of 1000 state-action pairs. Among the three experts, A2C seems to be the best performing agent in terms of high mean and low variance of cumulative rewards, followed by DQN (Fig. 3).

Figure 3: Distribution of rewards obtained by each expert RL agent.

3.4 Adversarial learning from Reinforcement Learning trading experts

As in general GAN structures, adversarial inverse reinforcement learning is implemented through a generator and a discriminator: two neural networks that contest with each other. The generator learns to produce an output mapping from a latent space to a target data distribution, while the discriminator tries to classify which samples come from the true distribution and which were produced by the generator.

In the case of AIRL, the generator learns to produce state-action pairs $(s, a)$ based on those observed in the trajectories from the expert demonstrations $\mathcal{D}$, while the discriminator $D_\theta$ tries to classify which state-action pairs were produced by the generator against those actually produced by an expert agent, training through binary logistic regression. The reward estimate $\hat{r}$ is then updated from the GAN minimax loss, and an estimate of the policy $\hat{\pi}$ is then updated from $\hat{r}$ iteratively.
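The exact expression used for the reward update is not reproduced in this text. As a hedged sketch, assuming the state-action formulation of AIRL in fu2017learning , the discriminator combines a learnt score $f(s, a)$ with the generator's action probability $\pi(a \mid s)$, and the reward estimate is $\log D - \log(1 - D)$. The function names and numerical values below are hypothetical.

```python
import math


def airl_discriminator(f_value, policy_prob):
    """D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + pi(a | s))."""
    return math.exp(f_value) / (math.exp(f_value) + policy_prob)


def airl_reward(f_value, policy_prob):
    """Reward estimate log D - log(1 - D) for one state-action pair."""
    d = airl_discriminator(f_value, policy_prob)
    return math.log(d) - math.log(1.0 - d)


# Hypothetical values: f is the learnt score for this pair, and policy_prob is
# the generator's probability of taking this action in this state.
f_value, policy_prob = 0.4, 0.25
reward = airl_reward(f_value, policy_prob)
```

Under this form the reward simplifies algebraically to $f(s, a) - \log \pi(a \mid s)$, so pairs that the current policy rarely produces but that score highly under $f$ receive the largest reward.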

Our experiments are initialized with the expert trajectories $\mathcal{D}$ produced by executing each expert policy within the world model learnt from data gathered between January and March. We then initialize a policy $\hat{\pi}$ and the discriminator $D_\theta$. The adversarial observer then iteratively updates $D_\theta$ and the reward $\hat{r}$ that produces a policy $\hat{\pi}$, which generates samples of $(s, a)$ increasingly similar to those in $\mathcal{D}$. The evaluations of cumulative rewards produced by each $\hat{\pi}$ take place with a model environment run with test data from April.

We present in Fig. 4 and Table 1 the results of running AIRL and GAIL for 200 training iterations: AIRL is found to outperform GAIL in all three cases considered (learning from demonstrations of each expert trained), consistent with the initial motivation of the experiment.

Figure 4: Comparison of rewards obtained through AIRL and GAIL by learning from each expert agent. Panels: (a) trained with A2C experts; (b) trained with DQN experts; (c) trained with PG experts.

(a) Adversarial learning from A2C Experts
Agent | Average Return | Std Return
Expert A2C | 5047.3 | 125.9
AIRL | 2433.0 | 59.7
GAIL | 1563.1 | 43.4

(b) Adversarial learning from DQN Experts
Agent | Average Return | Std Return
Expert DQN | 4177.4 | 116.6
AIRL | 2151.1 | 45.6
GAIL | 1574.5 | 70.8

(c) Adversarial learning from PG Experts
Agent | Average Return | Std Return
Expert PG | 1483.3 | 313.2
AIRL | 1267.7 | 33.8
GAIL | 1029.1 | 67.9

Table 1: Summary of cumulative rewards obtained by AIRL, GAIL and each expert agent in the order book.
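Since performance is evaluated as the proportion of the expert's cumulative reward that each method recovers, that proportion can be worked out directly from the average returns in Table 1. The short sketch below does so; the dictionary layout and function name are illustrative choices, while the figures are those reported in the table.

```python
# Average returns copied from Table 1 (expert, AIRL, GAIL) for each expert type.
table_1 = {
    "A2C": {"expert": 5047.3, "AIRL": 2433.0, "GAIL": 1563.1},
    "DQN": {"expert": 4177.4, "AIRL": 2151.1, "GAIL": 1574.5},
    "PG": {"expert": 1483.3, "AIRL": 1267.7, "GAIL": 1029.1},
}


def recovered_fraction(table):
    """Fraction of each expert's average return reached by AIRL and by GAIL."""
    return {
        expert: {method: row[method] / row["expert"] for method in ("AIRL", "GAIL")}
        for expert, row in table.items()
    }


fractions = recovered_fraction(table_1)
for expert, row in fractions.items():
    print(f"{expert}: AIRL {row['AIRL']:.1%}, GAIL {row['GAIL']:.1%}")
```

On these figures AIRL reaches roughly 48%, 51% and 85% of the A2C, DQN and PG expert returns respectively, against roughly 31%, 38% and 69% for GAIL.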

4 Conclusions

We have presented an experimental setup to adapt adversarial inverse reinforcement learning and generative adversarial imitation learning to a volatile financial environment, in order to evaluate their ability to learn from expert trajectories in latent space simulations: the latent model employed allows the training of the adversarial algorithms based on real market data. The results obtained show that both methods can be trained against a latent space model of the market, while the specific advantage of AIRL for learning rewards robust to variable environment dynamics allows it to outperform GAIL in the cumulative rewards recovered from each of the three agents as the number of iterations increases.

A review of the original data series (Fig. 5) confirms the variability of mid price series and of their volatility levels (measured across various horizons) when comparing the data corresponding to the training period against that of the testing period. The rewards learnt by AIRL from the training model are robust to such variable dynamics, hence its outperformance over GAIL when evaluated on the test period.

Figure 5: The original data series used to train the latent space model and the experts (blue) and to test the adversarial learning methods (green) contain significant variability of price and volatility levels. Panels: (a) mid price evolution, sampled hourly; (b) volatility of mid prices (1-day window); (c) volatility of mid prices (2-day window).