1 Introduction
Reinforcement learning (RL) achieves robust performance in a wide range of fields, with particularly relevant success in model-free applications mnih2013playing ; vanHasselt2015dqn where agents explore an environment with no prior knowledge of its underlying dynamics, and learn a policy that maximizes a certain cumulative reward function. Such a learning process typically requires recurrent access of the agent to the environment for trial-and-error exploration; however, reinforcement learning in risk-critical tasks such as automatic navigation or financial risk control does not allow such exploration, since decisions have to be made in real time in a non-stationary environment where the risks and costs inherent to a trial-and-error approach can be unaffordable. In addition, such RL performance generally requires that the designer manually specify a reward function that represents the task adequately and can be optimized efficiently ng1999policyinvariance .
In the context of learning from expert demonstrations, inverse reinforcement learning (IRL) has proved capable of recovering through inference the reward functions of expert agents from observations of their state-action trajectories ziebart2008maximum ; levine2011nonlinear , with decreasing dependence on predefined assumptions about linearity or the general structure of the underlying reward function, generally under a maximum entropy framework ziebart2010modeling .
Recent advances in inverse RL have extended its application to high-dimensional state spaces with unknown dynamics and arbitrary nonlinear reward structures, thanks to the use of neural networks to represent the reward function finn2016guided . Adversarial inverse reinforcement learning (AIRL) fu2017learning extends inverse RL further, achieving the recovery of rewards robust to variations in the dynamics of the environment, while at the same time learning a policy to perform the task; AIRL builds on the equivalences found by FinnCAL16 between inverse RL under maximum entropy and the training process of generative adversarial networks (GANs) goodfellow2014gans .
1.1 Contributions
Financial markets are a particularly challenging case for inverse RL: high-dimensional, stochastic environments with non-stationary transition dynamics unknown to the observer, and with variable reactions to agents' actions. This makes AIRL particularly interesting to test on real financial data, with the aim of learning from experts robust reward functions that can then be transferred to new regimes of the original environment.
In this paper we explore the adaptation of AIRL to a volatile financial environment based on real tick data from a limit order book (LOB) in the stock market, attempting to recover the rewards of three expert market agents through an observer with no prior knowledge of the underlying dynamics, where such dynamics can also change with time following real market data, and where the environment reacts to the agent's actions. This is a challenging environment where we would expect AIRL to remain better suited to learning from the expert agent trajectories than methods aimed directly at recovering the policy (generative adversarial imitation learning ermon2016gail , or GAIL), given the expected added value of AIRL regarding robustness of the recovered rewards with respect to varying dynamics of the environment. GAIL provides means, analogous to generative adversarial networks, to extract policies directly from data through a model-free approach for complex behaviours in high-dimensional environments. The performance of each method is then evaluated against the proportion of the expert agent's total cumulative reward that can be obtained by the policies recovered through adversarial learning.
1.2 Model environment based on real data
The adversarial learning algorithms used in the experiment will require a model of the environment where the observed agent trajectories took place, in order to evaluate the iterative estimations of rewards and policies most likely to have generated the observations. In practice, we would observe expert trajectories from agents as training data for adversarial learning, and then transfer the learnt policies to new test market data from the real environment.
The latent space model of the equity LOB described in yuanbo2019 shows how the Mixture-Density Recurrent Network (RNN-MDN) architecture presented in World Models ha2018worldmodels to learn the transition dynamics of the environment can also be applied to financial time series, learning a causal representation of LOB market data in which RL experts can be trained to learn policies that are then transferable back to the original environment. Building on this work, we train three expert traders in the latent space market model through advantage actor critic (A2C) mnih2019a3c , double DQN vanHasselt2015dqn , and policy gradient williams92policygradient respectively, whose learnt policies remain profitable when tested on subsequent time series out of sample. The collections of expert state-action trajectories generated by each trained agent serve as input for the AIRL and GAIL algorithms in our experiments, following the implementation by fu2017learning . Our conclusions then examine the proportion of the experts' cumulative rewards produced by the policies learnt through either AIRL or GAIL from each expert agent.
2 Background
2.1 Related work
A number of previous works have applied inverse RL to financial data, focusing on evaluations of feature vectors for state representations at different scales to explore a market of competing agents hendricks2017 , while assuming linear structures for the reward functions. Other works have focused on assessing the comparative performance of probabilistic inverse RL based on Bayesian neural networks roavicens2019 as an alternative to inverse RL based on Gaussian processes levine2011nonlinear , applied to a simulated financial market under a maximum entropy framework and allowing nonlinear structures in the reward functions of the agents. On the other hand, model-based approaches have attempted to recover specific parameters such as the risk aversion implied by data halperin2018 , where the observer assumes a certain structure for the underlying utility of the agent. Research with simulations of real environments through neural networks kaiser2019mbrl allows extending the original action and reward spaces to produce observations in the same spaces. The representation of an environment through generative models has also been previously described by World Models ha2018worldmodels and its adaptation to limit order books yuanbo2019 , where the authors obtain latent representations of the environment enabling agents to learn a policy efficiently and to transfer it back to the original environment. Other authors have explored applications of Gaussian inverse RL to learning investor sentiment models from historical data yangYA18 , to then extract tradable signals from the model.
2.2 Adversarial Learning under Maximum Causal Entropy
The adversarial IRL experiments presented in this paper follow the AIRL implementation described in fu2017learning , framed through a standard Markov decision process (MDP) defined by a tuple (S, A, T, r, γ) consisting of a state space S; a set of eligible actions A; a model of transition dynamics T, where each T(s' | s, a) represents the probability of transitioning from state s to state s' through action a; the unknown reward function r that we aim to recover; and the discount factor γ taking values in [0, 1]. In general terms, forward RL seeks an optimal policy π* that maximizes an expected cumulative reward under the transition dynamics reflected in T and the policy π. We denote the set of state-action trajectories derived from such an expert policy as 𝒟.
Under this framework, inverse RL attempts to recover the reward function that would most likely have generated a given set of expert trajectories under the MDP of unknown reward. The adversarial learning methods considered observe the trajectories in 𝒟 to infer a reward r̂ (which yields a policy π̂) that explains 𝒟. The performance of each model is then evaluated through the total cumulative reward that π̂ can recover against the total reward obtained by the expert agent under policy π*.
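The evaluation metric just described, the proportion of the expert's cumulative reward recovered by the learnt policy, can be sketched as a simple ratio of discounted reward sums. The reward sequences and discount value below are hypothetical illustrations, not data from the paper:

```python
def cumulative_reward(rewards, gamma=0.99):
    """Discounted sum of a reward sequence r_0, r_1, ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

def recovery_ratio(learner_rewards, expert_rewards, gamma=0.99):
    """Proportion of the expert's cumulative reward achieved by the recovered policy."""
    return cumulative_reward(learner_rewards, gamma) / cumulative_reward(expert_rewards, gamma)

# hypothetical rollouts: the recovered policy earns half the expert's per-step reward
ratio = recovery_ratio([0.5] * 10, [1.0] * 10)
```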
One of the main challenges of inverse RL resides in its ill-posed nature: firstly, a given set of state-action trajectories may be explained by several different optimal policies ng1999policyinvariance ; the maximum entropy principle as proposed in ziebart2008maximum addresses this problem by assuming that the optimal policy that would have generated the trajectories in 𝒟 follows π(a|s) ∝ exp(Q*_soft(s, a)), as developed in ziebart2008maximum ; haarnoja2017reinforcement . The soft Q function refers to the learning process under the standard RL formulation, where the objective is regularized against a metric of differential entropy.
Secondly, the optimal policy determined may also be explained by a number of different reward functions, where an external observer lacks the means to distinguish the true reward from those that are a product of non-stationary environment dynamics. The latter, when transferred to an environment with new dynamics, may fail to produce an optimal outcome. Adversarial inverse RL fu2017learning introduces the concept of disentangled rewards, aiming at learning reward functions robust to variations in the environment dynamics. This makes AIRL particularly attractive for studying agents in financial markets, given their inherent need for continuous adaptation to changing dynamics.
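The maximum-entropy policy form mentioned above, where the action distribution is a softmax over soft Q-values, can be sketched as follows. The Q-values and temperature are hypothetical illustrations, not values from the paper:

```python
import numpy as np

def maxent_policy(q_soft, alpha=1.0):
    """pi(a|s) proportional to exp(Q_soft(s, a) / alpha): a softmax over soft Q-values."""
    z = q_soft / alpha
    z = z - z.max()           # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 0.5])  # hypothetical soft Q-values for three actions
pi = maxent_policy(q)
```

Actions with higher soft Q-values receive higher probability, but no action's probability is driven to zero, which is what keeps the policy stochastic under the entropy regularizer.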
The connection between inverse RL under maximum causal entropy and GANs as described by FinnCAL16 compares the iterative cycles between generator and discriminator in the GAN with cases of inverse RL that employ neural networks to learn generic reward functions under unknown environment dynamics finn2016guided ; boularias2011a . Moreover, a probabilistic approach to the problem makes it possible to capture the stochastic nature of the relationship between the reward objective and the policy actually executed.
Following the cited previous works, we use as a benchmark a GAIL implementation adapted to our market model: while GAIL is able to learn a policy from expert demonstrations, we expect AIRL to outperform it in an environment with volatile, non-stationary dynamics, besides recovering a reward function in addition to the policy.
3 Experimental setup and results
The first requirement of our experiments is a model environment based on real financial data that allows the training of RL agents and is also compatible with the AIRL and GAIL learning algorithms. Learning a rich representation of the environment adds the general advantage of allowing RL models that are simpler, smaller and less expensive to train than model-free counterparts for a given target performance of the learnt policy, as they search in a smaller space. Our experimental setup is based on an OpenAI Gym structure brockman2016openai used to adapt the AIRL and GAIL implementations by fu2017learning to a financial version of a World Model environment as follows:
3.1 World models learnt from observed environments
World Models ha2018worldmodels provides an unsupervised probabilistic generative method to learn world representations in space and time through neural networks, including the transition dynamics as our experiment requires. RL agents operating in the real world would adapt their policies to their expectations of such transition dynamics, i.e. the probability of a certain future state given the current state and choice of action, as contained in T. RL agents can then be trained within the world model, with their learnt policies remaining optimal once transferred back to the original environment (or even outperforming agents trained in the original environment in some instances).
These models capture three elements of the original environment: firstly, an autoencoder learns a latent representation z of each state vector from its original high-dimensional space, compressing all the observed information of that state into a vector within a latent space of lower dimension (Fig. 1).
Secondly, after learning the latent vector z for each state, the model learns the transition dynamics connecting consecutive states through certain action choices a. This model of the transition probabilities T in the MDP is learnt through an RNN Schmidhuber90makingthe ; Schmidhuber90anon ; Schmidhuber91nips990_393 , able to learn sequential information from time series along various time horizons by learning P(z_{t+1} | z_t, a_t, h_t), where h_t represents a hidden state of the RNN.
The choice of RNN architectures is based on their suitability for capturing the transition model of sequential data Ha_Eck2017 , thus learning some predictive modelling of future states. In order to better reflect the stochastic nature of the target environment, the deterministic output of the RNN is then fed into a Mixture Density Network (MDN) Bishop94mixturedensity to produce a probabilistic prediction of the next latent state. This combined RNN-MDN architecture is detailed further in ha2018worldmodels ; graves2013 .
Finally, an RL module decides which action to take given all the above information, including the transition predicted in the latent space. As noted in ha2018worldmodels , this approach has the additional advantage of providing RL agents with access to the latent space learnt by the model, which includes representations of both the current state and an indication of what to expect in upcoming states through the transitions modelled, sampling from the probability distribution of future states provided by the RNN-MDN.
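The probabilistic next-state prediction described above amounts to sampling from the Gaussian mixture emitted by the MDN head: pick a component according to the mixture weights, then draw from that component's Gaussian. The mixture parameters and latent dimension below are hypothetical placeholders, not the trained model's values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mdn(weights, means, sigmas):
    """Sample a next latent vector from the Gaussian mixture predicted by an MDN head.
    weights: (K,) mixture weights summing to 1
    means:   (K, D) component means in the latent space
    sigmas:  (K, D) component standard deviations
    """
    k = rng.choice(len(weights), p=weights)               # pick a mixture component
    return means[k] + sigmas[k] * rng.standard_normal(means.shape[1])

# hypothetical 2-component mixture over a 12-dimensional latent space
w = np.array([0.7, 0.3])
mu = np.zeros((2, 12))
sd = np.full((2, 12), 0.1)
z_next = sample_mdn(w, mu, sd)
```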
3.2 LOB version of world models learnt from real market data
The original version of World Models learns from environments based on computer simulations, such as CarRacing-v0 klimov2016 and VizDoom kempkaWRTJ16vizdoom . We base our experiments on the extension of this architecture to financial environments presented in yuanbo2019 , where the autoencoder is trained to learn a latent space from sequences of snapshots of actual limit order book data, and the RNN-MDN models the transitions between consecutive time frames as follows:
The data used to train the model consists of LOB time series for shares of Tencent Holdings Ltd. (HK700) traded on the Hong Kong stock exchange over sixty trading days between January and March 2018. Data from the following twenty trading days in April is then used as the testing reference for the adversarial learning algorithms. Each state contains the sequence of the last 10 data ticks for 3 LOB levels, so that the sequential information necessary to learn the transition dynamics is captured in the data for each state.
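As a sketch of the windowing just described, states can be built by stacking runs of consecutive ticks; the tick array and feature count below are illustrative placeholders, not the paper's actual data pipeline:

```python
import numpy as np

def build_states(ticks, window=10):
    """Stack each run of `window` consecutive LOB ticks into one state,
    so every state carries the sequential context the transition model needs."""
    states = [ticks[t - window:t] for t in range(window, len(ticks) + 1)]
    return np.stack(states)                 # shape: (n_states, window, n_features)

# hypothetical series of 100 ticks, each with 12 LOB features
ticks = np.random.rand(100, 12)
states = build_states(ticks, window=10)
```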
LOB data is high-dimensional, generally denominated Level II market data, and features several relevant data points for each timestamp. Firstly, it contains various levels of bid (buy) and ask (sell) prices for each timestamp, typically ranked in decreasing order of competitiveness. The mid price is generally derived as the average of the highest bid and lowest ask at the first LOB level. Secondly, it includes the trading volume associated with each of the prices offered. The combined progression in time of this data structure is often represented as a tensor of three dimensions tran2017hfttensors . Finally, trade stamp series contain the price and size of the last transactions executed out of previous LOB states, used in this model as RL exploration. In order to allow the RNN-MDN to model the transition dynamics, each state consists of 10 consecutive LOB data ticks for each of the above features. The autoencoder learns a compressed version of the data in lower dimensionality, through a latent space representation of short sequences of original market data (Fig. 1). The model used in yuanbo2019 features a CNN-based autoencoder, looking back 10 updates in a limit order book containing 3 levels of orders, and condensing the matrix into a vector of dimension 12.
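The mid-price derivation mentioned above is simply the average of the best quotes at level 1; a minimal sketch, with hypothetical quote values:

```python
def mid_price(best_bid, best_ask):
    """Mid price as the average of the highest bid and lowest ask at LOB level 1."""
    return 0.5 * (best_bid + best_ask)

m = mid_price(431.2, 431.6)   # hypothetical level-1 quotes
```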
The RNN-MDN used to model the transition dynamics assumes that z_{t+1} follows a Gaussian mixture distribution yuanbo2019 ; ha2018worldmodels , where the model learns the parameters of the distribution P(z_{t+1} | z_t, a_t, h_t), with each mixture component assumed to be Gaussian, using 128 neurons per layer in the RNN. Once the prediction of the next state is sampled in the latent space, the reward with respect to the present state and the chosen action is produced through a regression model.
3.3 Training experts in the LOB latent space
Once the full world model is trained and integrated into an OpenAI Gym structure, we train in it the expert agents whose trajectories will serve as input to the adversarial learning experiments. We follow the selection of RL algorithms in yuanbo2019 shown to learn policies in the world model that remain profitable when transferred back to the real environment: double DQN vanHasselt2015dqn , advantage actor critic (A2C) mnih2019a3c and policy gradient williams92policygradient .
We now use each of the three trained expert agents to generate collections of 100 state-action trajectories per agent, with a length of 1000 state-action pairs each. Among the three experts, A2C appears to be the best-performing agent in terms of high mean and low variance of cumulative rewards, followed by DQN (Fig. 3).
3.4 Adversarial learning from Reinforcement Learning trading experts
As in general GAN structures, adversarial inverse reinforcement learning is implemented through a generator and a discriminator: two neural networks that compete with each other. The generator learns to produce an output mapping from a latent space to a target data distribution, while the discriminator tries to classify which samples come from the true distribution and which were produced by the generator.
In the case of AIRL, the generator learns to produce state-action pairs based on those observed in the trajectories of the expert demonstrations 𝒟, while the discriminator D_θ tries to classify which state-action pairs were produced by the generator against those actually produced by an expert agent, training through binary logistic regression. The reward estimate is then updated from the GAN minimax loss as r̂(s, a) = log D_θ(s, a) − log(1 − D_θ(s, a)). An estimate of the policy π̂ is then updated iteratively from r̂. Our experiments are initialized with the expert trajectories produced by executing each expert policy within the world model learnt from data gathered between January and March. We then initialize a policy π̂ and the discriminator D_θ. The adversarial observer then iteratively updates D_θ and the reward r̂ that produces a policy π̂, which generates samples increasingly similar to 𝒟. The evaluations of cumulative rewards produced by each π̂ take place in a model environment run with test data from April.
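The discriminator form and reward recovery used by AIRL fu2017learning can be sketched as follows; the scalar inputs below are hypothetical stand-ins for the learned estimates at a single state-action pair:

```python
import numpy as np

def airl_discriminator(f, log_pi):
    """D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + pi(a|s)),
    where f is the learned reward estimate and pi the current policy."""
    return np.exp(f) / (np.exp(f) + np.exp(log_pi))

def recovered_reward(f, log_pi):
    """r_hat = log D - log(1 - D), which simplifies algebraically to f - log pi."""
    d = airl_discriminator(f, log_pi)
    return np.log(d) - np.log(1.0 - d)

# hypothetical estimates for one state-action pair
f, log_pi = 0.8, -1.2
r_hat = recovered_reward(f, log_pi)
```

The simplification r̂ = f − log π is what lets AIRL read an entropy-regularized reward directly off the discriminator's logits at convergence.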
We present in Fig. 4 and Table 1 the results of running AIRL and GAIL over 200 training iterations: AIRL is found to outperform GAIL in all three cases considered (learning from the demonstrations of each trained expert), consistent with the initial motivation of the experiment.



4 Conclusions
We have presented an experimental setup to adapt adversarial inverse reinforcement learning and generative adversarial imitation learning to a volatile financial environment, in order to evaluate their ability to learn from expert trajectories in latent space simulations: the latent model employed allows the training of the adversarial algorithms based on real market data. The results obtained show that both methods can be trained against a latent space model of the market, while the specific advantage of AIRL for learning rewards robust to variable environment dynamics allows it to outperform GAIL in the cumulative rewards recovered from each of the three agents as the number of iterations increases.
A review of the original data series (Fig. 5) confirms the variability of mid price series and of their volatility levels (measured across various horizons) when comparing the data corresponding to the training period against that of the testing period. The rewards learnt by AIRL from the training model are robust to such variable dynamics, hence its outperformance over GAIL when evaluated on the test period.
References
 (1) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 (2) Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. CoRR, abs/1509.06461, 2015.

 (3) Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, pages 278–287, 1999.
 (4) Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
 (5) Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with gaussian processes. In Advances in Neural Information Processing Systems, pages 19–27, 2011.
 (6) Brian D. Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Pittsburgh, PA, USA, 2010. AAI3438449.
 (7) Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.
 (8) Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
 (9) Chelsea Finn, Paul F. Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energybased models. CoRR, abs/1611.03852, 2016.
 (10) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
 (11) Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4565–4573. Curran Associates, Inc., 2016.
 (12) Haoran Wei, Yuanbo Wang, Lidia Mangu, and Keith Decker. Model-based reinforcement learning for predictions and control for limit order books. arXiv preprint arXiv:1910.03743, 2019.
 (13) David Ha and Jürgen Schmidhuber. World models. CoRR, abs/1803.10122, 2018.
 (14) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928–1937, New York, New York, USA, 20–22 Jun 2016. PMLR.
 (15) Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine Learning, pages 229–256, 1992.
 (16) Dieter Hendricks, Adam Cobb, Richard Everett, Jonathan Downing, and Stephen J. Roberts. Inferring agent objectives at different scales of a complex adaptive system. Papers 1712.01137, arXiv.org, December 2017.
 (17) Jacobo Roa Vicens, Cyrine Chtourou, Angelos Filos, Francisco Rullan, Yarin Gal, and Ricardo Silva. Towards inverse reinforcement learning for limit order book dynamics. In ICML Workshop ’AI in Finance: Applications and Infrastructure for MultiAgent Learning’ at the 36th International Conference on Machine Learning, 2019.
 (18) Igor Halperin and Ilya Feldshteyn. Market self-learning of signals, impact and optimal trading: Invisible hand inference with free energy. arXiv preprint arXiv:1805.06126, 2018.
 (19) Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H. Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Ryan Sepassi, George Tucker, and Henryk Michalewski. Modelbased reinforcement learning for atari. CoRR, abs/1903.00374, 2019.
 (20) Steve Y. Yang, Yangyang Yu, and Saud Almahdi. An investor sentiment rewardbased trading system using gaussian inverse reinforcement learning algorithm. Expert Syst. Appl., 114:388–401, 2018.
 (21) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1352–1361. JMLR.org, 2017.

 (22) Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 182–189, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.
 (23) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016. arXiv preprint arXiv:1606.01540.
 (24) LOB representation from NxCore API. 2019. Nanex Corp.

 (25) Jürgen Schmidhuber. Making the world differentiable: On using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical report, 1990.
 (26) Jürgen Schmidhuber. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In Proc. IEEE/INNS International Joint Conference on Neural Networks, pages 253–258. IEEE Press, 1990.
 (27) Jürgen Schmidhuber. Reinforcement learning in Markovian and non-Markovian environments. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 500–506. Morgan Kaufmann, 1991.
 (28) David Ha and Douglas Eck. A neural representation of sketch drawings. CoRR, abs/1704.03477, 2017.
 (29) Christopher M. Bishop. Mixture density networks. Technical report, 1994.
 (30) Alex Graves. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.
 (31) Oleg Klimov. CarRacing-v0. 2016. https://gym.openai.com/envs/CarRacingv0/.
 (32) Michal Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaskowski. ViZDoom: A doom-based AI research platform for visual reinforcement learning. CoRR, abs/1605.02097, 2016.
 (33) Dat Thanh Tran, Martin Magris, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Tensor representation in high-frequency financial data for price change prediction. CoRR, abs/1709.01268, 2017.