## 1 Introduction

This paper presents a simple ‘structural’ model of price dynamics in a financial market. Though based on concepts not commonly used in Finance (Reinforcement Learning, Information Theory, Physics, etc.; see below), the model we suggest is mathematically rather simple in the end (see Eq. (109)), once one gets through the ‘story’ behind its structure. It is designed as both a practical tool for market practitioners and a theoretical model of a financial market that can be explored further using simulations and/or analytical methods. For definiteness, we focus in this paper on stock markets, though the same approach can be applied to other markets.

In a way, the main idea of the model presented below can be formulated as a dynamic and data-driven extension of the approach to modeling excess returns suggested in the seminal Black-Litterman (BL) model [8]. As will be shown below, the structural asset return model arising in our solution to this problem has some interesting properties, such as the presence of mean reversion in stock prices, which in our framework appears as a result of the joint actions of all traders in the market who dynamically implement Markowitz-type mean-variance portfolio strategies.

In essence, the BL model flips the Markowitz optimal portfolio theory [32] on its head and considers an inverse optimization problem. Namely, it starts with the observation that the market portfolio (as typically represented by the S&P 500 index) is, by definition, the optimal “market-implied” portfolio. Therefore, if we take the given market portfolio to be optimal, we can invert the portfolio optimization problem and ask which optimal asset allocation policy corresponds to this market portfolio. Within the framework of a single-period Markowitz mean-variance optimization [32], this translates into market-implied values of expected returns and covariances of returns.
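The inversion described above can be sketched for the simplest unconstrained single-period case. The snippet below is only an illustration under stated assumptions: the covariance matrix, the market weights, and the risk aversion value are hypothetical numbers, the covariance is taken as given rather than inferred, and the function names are ours.

```python
import numpy as np

def implied_excess_returns(cov, w_mkt, risk_aversion):
    """Reverse optimization: expected excess returns for which w_mkt maximizes
    w @ mu - (risk_aversion / 2) * w @ cov @ w without constraints."""
    return risk_aversion * cov @ w_mkt

def mean_variance_weights(cov, mu, risk_aversion):
    """Forward problem: unconstrained single-period Markowitz weights."""
    return np.linalg.solve(cov, mu) / risk_aversion

# Hypothetical three-asset market (illustrative numbers only).
cov = np.array([[0.040, 0.006, 0.004],
                [0.006, 0.090, 0.010],
                [0.004, 0.010, 0.160]])
w_mkt = np.array([0.5, 0.3, 0.2])   # market-capitalization weights
risk_aversion = 2.5                 # assumed, not estimated

mu_implied = implied_excess_returns(cov, w_mkt, risk_aversion)
w_check = mean_variance_weights(cov, mu_implied, risk_aversion)
print(mu_implied, w_check)
```

Feeding the implied returns back into the forward problem recovers the market weights, which is the sense in which the market portfolio is treated as optimal.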

Accordingly, this framework was suggested by Black and Litterman as a way to assess the value of private “alpha” signals in generating excess returns. The BL model was explicitly re-interpreted as an inverse portfolio optimization problem by Bertsimas et al. [7], who also proposed some extensions such as robust inverse optimization. Note that the inverse optimization in [7] is still performed in a single-period (one time step) setting, the same as in the original BL model [8] and in the Markowitz mean-variance portfolio model [32].

The model suggested in this paper extends this inverse optimization view of the market portfolio to a dynamic, multi-period setting. While this requires some new mathematical tools, the outputs of the model can be used in essentially the same way as the outputs of the BL model: to assess the value of private “alpha” signals, and to design trading strategies according to one’s own assessment of the joint effects of signals and of market impact from trades on expected excess returns.

An important difference between our model and the majority of market models used in both industry and academia is that our model does not assume a competitive market equilibrium. As discussed at length by Duffie [14], this paradigm underlies three cornerstone Nobel prize-winning theories of modern Finance, which are used by many practitioners on both the sell and buy sides. On the other hand, George Soros, a famous guru of financial markets, called this paradigm an “absurd postulate”^{1}.

^{1} “Economics ended up with the theory of rational expectations, which maintains that there is a single optimum view of the future, that which corresponds to it, and eventually all the market participants will converge around that view. This postulate is absurd, but it is needed in order to allow economic theory to model itself on Newtonian Physics.” (G. Soros). We thank Vivek Kapoor for this reference.

Our model can be interpreted as an attempt to reconcile these opposite views. Our suggested answer is that both sides are right in their own ways, and we offer a practical and easily computable framework that unifies them. This allows us to quantify Soros’ critique and propose a simple model that can be used to describe markets in three different states: disequilibrium, quasi-equilibrium, and a perfect thermal equilibrium. The latter scenario may only occur if there is no inflow of information into a market, which is hardly a realistic scenario.

The last case of a perfect thermal equilibrium corresponds to the assumptions of the competitive market equilibrium paradigm. While we believe that for financial markets this limit is in a way ‘non-physical’^{2}, it is the limit described by competitive market equilibrium models such as the Modigliani-Miller theorem on the irrelevance of capital structure for the market value of a corporation, the Capital Asset Pricing Model (CAPM) of William Sharpe (1964), and the Black-Scholes model of option pricing^{3} [14].

^{2} It is non-physical in the sense that it contradicts the very existence of markets where market makers generate liquidity and speculators make profits by digesting new information - neither should exist in competitive market equilibrium models. This is because a perfect equilibrium is only possible for a closed system that does not exchange information with the outside world. Therefore, competitive market equilibrium models do not try to answer the question of why markets exist, but rather simply postulate first-order optimality (equilibrium) conditions, and then explore the consequences [46]. In physics, a perfect thermodynamic equilibrium is achieved in the thermodynamic limit of a closed system, and corresponds to a state of ‘heat death of the Universe’ [29].

^{3} The Black-Scholes model relies on a weaker form of the competitive market equilibrium paradigm known as the no-arbitrage principle [14].

These models consider market dynamics as equilibrium fluctuations around a market state of perfect thermodynamic equilibrium. Therefore, they implicitly assume that there is no inflow or outflow of money and information in the market as a whole, and that the market is in a state of maximum entropy. This may be a reasonable approximation in a steady or slow market, which may explain why these models work reasonably well under ‘normal’ market conditions.

But this assumption of competitive market equilibrium also suggests that these models should behave progressively worse during periods of market instabilities, crises and market crashes - an observation that seems to be widely recognized in the literature.

The general reason for such model failures when they are needed most is that in all these cases, a view of a market as an equilibrium fluctuation around a stationary state where entropy is already maximized becomes inadequate to describe market dynamics, see also below on analogies with self-organizing systems and living organisms.

The above remarks concern potential theoretical implications of our framework. Irrespective of these, our model also attempts to address the needs of market practitioners who want to make a profit rather than do theoretical research into the dynamics of the markets.

To this end, in addition to providing a multi-period extension of the Black-Litterman model^{4}, our model produces “market-implied” values of market impact parameters and of the risk aversion parameter of an agent that dynamically maintains such a market-optimal portfolio, as well as a “market-implied” optimal investment strategy, which can be used for monitoring the market or individual players in the market (see below). Given an explicit formula produced by our model for a market-implied optimal strategy, expressions like ‘a strategy that beats the market’ can now probably be given a more quantitative meaning ex-ante rather than ex-post.

^{4} Note that because the latter is a one-period model, the question of a market in equilibrium vs non-equilibrium cannot even be formulated in this framework.

Finally, one more interesting insight may be provided by the fact that one of the parameters estimated by the model from market data is a parameter that describes the degree of rationality of the market (another name for it is the “inverse temperature” of the market). This suggests that the market-implied value of this parameter can be used as a monitoring tool, and possibly as a predictive signal, to obtain an aggregate view of a market, a specific exchange, or even a specific dealer^{5}.

^{5} The latter case corresponds to a possible application of the model developed in this paper to an individual investor rather than to the market as a whole, see also below.
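As a rough illustration of how an inverse temperature can encode a degree of rationality, consider a Boltzmann-type policy over a handful of discrete actions. This is a generic sketch rather than the model of this paper: the name `beta`, the action values, and the reference (prior) policy are all assumptions made for the example.

```python
import numpy as np

def boltzmann_policy(action_values, prior, beta):
    """Policy proportional to prior(a) * exp(beta * value(a))."""
    logits = np.log(prior) + beta * action_values
    logits = logits - logits.max()          # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

values = np.array([1.0, 0.5, -0.2])   # hypothetical action values
prior = np.array([0.2, 0.3, 0.5])     # hypothetical reference policy

pi_random = boltzmann_policy(values, prior, beta=0.0)      # coincides with the prior
pi_mixed = boltzmann_policy(values, prior, beta=2.0)       # bounded-rational
pi_rational = boltzmann_policy(values, prior, beta=200.0)  # nearly greedy
print(pi_random, pi_mixed, pi_rational)
```

In this toy setting a vanishing inverse temperature reproduces the reference policy regardless of action values, while a very large one concentrates almost all probability on the best action.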

### 1.1 Outlook of our approach

The main intuitive idea behind the model can be introduced as follows. While real market dynamics are highly complex, as they are driven to a large extent by a very large number of individual rational or bounded-rational market participants, it is commonly known that market players exhibit a strong tendency toward herding behavior: when markets are up, everyone is buying, and when markets are down, everyone is selling.

This suggests the concept of a representative investor whose goal is to optimize a given investment portfolio with respect to some objective function. Such a representative investor is otherwise known in the literature as an agent. In this view of the world, the environment, i.e. the market, is clearly external to the agent.

But what if we take an inverse optimization view of this problem, as in Black-Litterman [8]? In this approach, the optimal portfolio is already known: it is the market portfolio itself. But then, who is the agent that dynamically maintains (rebalances) such a market-optimal portfolio?

We can identify such an agent with a ‘collective mode’ of all individual traders involved in the market, who are guided in their decisions by a commonly agreed set of predictors that may include news, other market indicators and/or indexes, variables describing the current state of the limit order book, etc. Therefore, the first difference between our framework and conventional utility-based models is that our agent is a sum of all investors, rather than their average, i.e. a ‘representative’ investor.

Because such an agent aggregates the actions of a partly homogeneous and partly heterogeneous crowd of individual investors, it cannot be a fully rational agent, but rather should be represented as an agent with bounded rationality. Bounded rationality, which will be explained in more detail below, is the second key difference between our framework and a classical agent-based approach.

Furthermore, because all individual trades by all market participants jointly amount to actual market moves that dynamically re-adjust the market-optimal portfolio, such an agent can then be identified with a bounded-rational component of the market itself.

If we adopt this view, the actual dynamics of market prices can be mathematically modeled as a sequential decision-making process of a bounded-rational agent engaged in self-learning via self-play in a partly controllable and partly uncontrollable environment, which is identified with the rest of the market. The first component, identified with an RL agent, can then be thought of as the ‘mind’ of such a self-organizing market, which learns about its environment and itself via self-play in this open environment.

Our agent embodies an ‘Invisible Hand’ of the market, which is goal-oriented in our framework, as will be made clearer below. The Invisible Hand is implemented in our model as a fictitious self-play of a bounded-rational RL agent. The agent’s self-play amounts to mimicking a risk-averse investor seeking a dynamic Markowitz-optimal portfolio, while the actions of this investor are randomized by entropy. As will be shown below in Sect. 4.6, this is mathematically equivalent to portfolio optimization in a two-party game with an adversarial player, such that the original agent and its imaginary adversary form a Nash equilibrium. As a result, the agent simultaneously mimics all traders as a bounded-rational ‘mind’ of a self-organizing market^{6}.

^{6} The equivalence between self-organization in dynamic systems and sequential decision making was emphasized by Yukalov and Sornette in [57]. A similar approach in neuroscience is the unified free-energy model of the brain of Friston [19], see also [39] for recent applications of the free-energy principle to living systems. In short, this approach suggests that “all biological systems instantiate a hierarchical generative model of the world that implicitly minimizes its internal entropy by minimizing free energy” [39].

This produces a dynamic model of market prices, where the total price impact of all traders in the market, who try to construct Markowitz-optimal portfolios, amounts to a dynamically generated mean reversion in market-observed asset returns. The resulting model can be interpreted as a Geometric Mean Reversion model with external signals, where mean reversion arises dynamically rather than being introduced by hand, as is done in descriptive models of market dynamics. It can also be viewed as a non-linear factor model for returns that can be estimated using standard methods of statistics such as Maximum Likelihood.
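To make the notion of a Geometric Mean Reversion model estimated by Maximum Likelihood more concrete, the following sketch simulates a generic textbook process dS = kappa * S * (theta - S) dt + sigma * S dW and recovers its parameters. It is an assumption-laden illustration, not the model of Eq. (109): external signals are omitted, and the names `kappa`, `theta`, `sigma` and all numerical values are hypothetical.

```python
import numpy as np

def simulate_gmr(s0, kappa, theta, sigma, dt, n_steps, rng):
    """Euler scheme for dS = kappa * S * (theta - S) dt + sigma * S dW."""
    noise = rng.standard_normal(n_steps)
    prices = np.empty(n_steps + 1)
    prices[0] = s0
    for t in range(n_steps):
        one_step_return = kappa * (theta - prices[t]) * dt + sigma * np.sqrt(dt) * noise[t]
        prices[t + 1] = prices[t] * (1.0 + one_step_return)
    return prices

def fit_gmr(prices, dt):
    """Gaussian maximum likelihood, which here reduces to OLS of returns on [1, S_t]."""
    returns = prices[1:] / prices[:-1] - 1.0
    design = np.column_stack([np.ones(returns.size), prices[:-1]])
    coef, *_ = np.linalg.lstsq(design, returns, rcond=None)
    kappa_hat = -coef[1] / dt                 # slope equals -kappa * dt
    theta_hat = coef[0] / (kappa_hat * dt)    # intercept equals kappa * theta * dt
    residuals = returns - design @ coef
    sigma_hat = np.sqrt(np.mean(residuals ** 2) / dt)
    return kappa_hat, theta_hat, sigma_hat

rng = np.random.default_rng(0)
dt = 1.0 / 252.0
prices = simulate_gmr(s0=1.2, kappa=5.0, theta=1.0, sigma=0.2, dt=dt, n_steps=20000, rng=rng)
kappa_hat, theta_hat, sigma_hat = fit_gmr(prices, dt)
print(kappa_hat, theta_hat, sigma_hat)
```

Because the one-step return is linear in the current price level, the dynamics of the price itself are quadratic, and the Gaussian likelihood is maximized by an ordinary regression. The mean-reversion speed is typically the least precisely estimated of the three parameters.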

More than that, because our resulting asset return model is a structural model, mean reversion in our model has a ‘story’ behind it. In our approach it is produced by the total bounded-rational action of all (bounded-rational or rational) agents in the market, who dynamically optimize their investment portfolios following mean-variance strategies. This can be compared with the mechanism for mean reversion due to zero-intelligence ‘noise traders’ suggested in 1988 by Poterba and Summers [44]. We have only one agent, but it is bounded-rational, unlike the many noise traders with zero total rationality/intelligence in the model of [44].

### 1.2 Possible insights from the model

Because we formulate the problem in a setting of inverse, rather than direct, portfolio optimization, the objective of the bounded-rational agent can be viewed as the problem of rebalancing its own fictitious “shadow” portfolio so that it is kept as close as possible to the market portfolio in such continuous self-play. Note that, except for the bounded rationality of the agent assumed in this framework, it resembles the classical pole balancing problem of Reinforcement Learning (see e.g. [49]), where now it is the market portfolio that plays the role of the pole, and we invert the problem.

Our model is also quite similar to an index tracking problem, except that we set it up as an inverse optimization problem to infer the market’s view of its own dynamics, instead of solving a forward optimization problem of finding a good tracking portfolio for an index. Note that data for such a model formulation is readily available as level-1 limit order book (LOB) data (level-2 LOB data can be incorporated in the model via a set of external predictors, see below).

This is unlike the (mathematically identical) portfolio optimization problem for an individual trader, which would require the trader’s proprietary execution data for model estimation. If, however, such proprietary data are available, our framework can be used in the same way to construct a probabilistic model of the trader. This could be used, in particular, by regulators for monitoring the activities of exchanges or individual traders.

Note that in a single-period setting, our problem formulation brings us back to the BL model, where instead of multi-period trading strategies, we have just single-period optimal portfolio allocations.

On the other hand, in a multi-period formulation, it extends the setting of the BL model in multiple ways, including a probabilistic model of observed actions that takes into account effects that are absent in a single-period setting, such as dynamic market impact and dynamically changing predictors. As our model is probabilistic, i.e. generative, it can be used for forward simulation of dynamics.

Also note that in a multi-period setting, it is the combination of non-linearity induced by market impact and dynamic exogenous predictors that may produce potentially very rich dynamics, driven by a combination of external signals, non-linear system feedback via the agent’s trades, and uncontrollable noise. As we will show below, our model is tractable in a quasi-equilibrium setting using conventional tools of constrained convex optimization, due to its simple structure with a quadratic non-linearity of dynamics.

On the other hand, external signals have their own dynamics, and might operate at frequencies different from the typical times of market responses to news, events, and other changes in predictors. Therefore, due to its non-linearity, and depending on the relation between the characteristic times of market responses and signal changes, the model can describe both equilibrium and non-equilibrium settings. A combination of non-linear dynamics with particular patterns of external signals, whose changes provide new information to the agent, can lead to potentially very rich dynamics.

We leave an exploration of the generative properties of our model for future research. The focus of the present paper is rather on batch-mode (off-line) learning from past data. Such learning can be done using model-free or model-based Reinforcement Learning (see e.g. [49]) when rewards are observable, or Inverse Reinforcement Learning (IRL) when they are not. As in our case rewards (either of a single investor, or ‘market-implied’ rewards) are not observable, we rely on an IRL-based approach for learning in this setting.

Our model, where rewards are not observed but rather inferred from data, belongs to the class of model-based IRL approaches with a parametrized reward function and dynamics. The objective of modeling in this approach is to infer the reward function and action policy from data by tuning model parameters. As in our case we solve the dynamic inverse portfolio optimization problem for a market-optimal portfolio, our IRL approach infers the market-implied reward function and optimal action policy.

Note that in typical applications of RL for financial decision making, an agent is typically a (representative or particular) trader or a financial institution who is external to the market. In contrast, in our approach, an agent is the bounded-rational component of the market itself, as it is now inseparable from the market, so long as it maintains the market-optimal portfolio.

Therefore, our model is a dynamic model of the market itself, rather than a model of an external representative investor in such market. Our model is inspired by IRL, Information Theory, statistical physics and neuroscience, yet it is based on a simple parametric specification of a one-step reward, and a simple specification of dynamics.

The model is tractable because the non-linearity of its dynamics is ‘only’ quadratic. Furthermore, because we use a simple low-dimensional parametric specification of the ‘actual’ reward of the agent, the data requirements for the model are modest. The model does not need tens, hundreds, or thousands of years of training data, even though both the state and action spaces in our problem are very high-dimensional.

Computationally, the model amounts to a simple and transparent scheme, rather than being a black-box model in the spirit of Deep Reinforcement Learning. This is because a simple parametric specification of the model enables proceeding without sophisticated function approximations that are typically implemented in Deep Reinforcement Learning by deep neural networks. The main computational tool employed by the model is (an iterative version of) the conventional Maximum Likelihood estimation method available via standard off-the-shelf numerical optimization software. This can be conveniently done with TensorFlow using its automatic differentiation functionality.

The paper is organized as follows. In Sect. 2, we review related work, and simultaneously provide further high-level details of our framework. In Sect. 3 we introduce our notation and describe an investment portfolio of stocks. In Sect. 4, we present an RL formulation of the model. Sect. 5 re-formulates the model in an IRL setting, and presents our solution to the problem of finding an optimal policy and reward function for the single-investor case. The IRL problem for the market as a whole is addressed in Sect. 6. The same section introduces an effective market dynamics model that is obtained as a by-product of our IRL solution. Experiments are presented in Sect. 7. Sect. 8 discusses our results and outlines future directions. A brief summary is given in Sect. 9.

## 2 Related work

Our model builds on several threads developed separately by the Quantitative Finance, Reinforcement Learning, Information Theory, Physics and Neuroscience communities. Here we provide a brief overview of related work in these different fields that have a close overlap with the model developed here, as well as explain their relation with our approach.

### 2.1 What kind of equilibrium holds for markets?

Quoting Duffie, “while there are important alternatives, a current basic paradigm for valuation, in both academia and in practice, is that of competitive market equilibrium” [14]. While this was said in 1997, this assessment remains true to this day.

Of course, deficiencies of standard financial models based on the competitive market equilibrium and/or no-arbitrage paradigm have not gone unnoticed, both within the financial community and among researchers in other disciplines, most notably physics and computer science. The latter disciplines contributed a number of interesting and fresh ideas to financial modeling [9], [46]. In particular, agent-based models may provide interesting insight into how financial markets can operate when viewed as evolving complex systems, see e.g. [1].

The main challenge with agent-based models is that while they are capable of explaining some stylized facts of the market, they can hardly be turned, at least at the current stage, into practically useful tools - in part, due to their high computational complexity. While models such as CAPM or the Black-Scholes model may miss some important features of real markets, they also work reasonably well under certain market and trade conditions, and they are fast.

Yet, to better model effects such as market liquidity, Amihud et al. suggested that, instead of assuming competitive market equilibrium, researchers should assume an “equilibrium level of disequilibrium” [3]. In physics, this is normally referred to as a non-equilibrium steady state.

Viewing markets as an ‘equilibrium disequilibrium’ is beneficial if we are willing to consider them as evolving and self-organizing systems that may bear some similarities to living organisms. Boltzmann and Schrödinger emphasized that the activities of living organisms are impossible in thermal equilibrium, and necessarily depend on harnessing a pre-existing disequilibrium. In other words, as a consequence of the Second Law of thermodynamics, living organisms can only exist as processes on the way to a state of maximum entropy describing a thermal equilibrium, but not in this state itself [51], [34].

A demand-based option pricing model that does not rely on no-arbitrage assumptions was proposed by Garleanu et al. [20]. A Reinforcement Learning based option pricing model that similarly does not rely on no-arbitrage, but instead uses a model-free and data-driven Q-learning approach, was proposed by one of the authors in [26]^{7}. Residual inefficiencies of markets resulting from multi-step strategies and market impact were studied by Esipov [16].

^{7} If so desired, the latter model can also be constructed as an arbitrage-free model by using a suitable utility function instead of a quadratic utility [26].

### 2.2 Optimal portfolio execution

A close analog of the setting of our model is the problem of optimal execution in stock trading, one of the classical problems of Quantitative Finance. The problem amounts to designing an optimal strategy (policy) for partitioning a large order to buy or sell a large block of a stock into smaller chunks, and executing these chunks sequentially so that the potential market impact, and hence the total cost of implementing the trade, is minimized. This is a problem solved many thousands of times a day by brokers, as well as by those asset managers and hedge funds that execute such trades themselves instead of passing trade orders to brokers.

The classical way to address such a (forward) optimization problem is to start with building and calibrating models for stock dynamics and price impact. Provided this is done, the next step is to define a cost function that specifies the loss that will be observed upon taking certain actions in certain states. If we focus for now on execution strategies that involve only market orders but not limit orders, then these market orders will be our actions^{8}.

^{8} This is sufficient if we look at aggregate actions of all traders, i.e. the market itself, which is the main setting of our model in this paper. If the model is applied to an individual investor, restricting the model to market orders only may be a reasonable approximation for liquid stocks, while for stocks with limited liquidity optimal strategies may involve combinations of market and limit orders. Extensions of our framework to such a setting of mixed market and limit orders for individual investors will be provided elsewhere.

Assume that the trade order is to sell a given number of shares of a given stock within a given time horizon. Optimal actions are obtained from (forward) optimization of total cumulative costs of execution, as determined by a policy that maps the current time and the current state vector of the system into an action. The state vector includes the current mid-price of the stock, the number of shares currently held, and values of external signals that may include, in particular, predictors derived from properties of the limit order book (LOB).

If the optimal policy is deterministic, then the optimal action is simply the value of this policy at the current time and state. The classical multi-period optimal execution problem was formulated in the dynamic programming (DP) setting by Bertsimas and Lo [6] for a risk-neutral investor, and then extended by Almgren and Chriss [2] to a risk-averse investor.
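For orientation, the deterministic schedules produced by this classical approach can be sketched as follows. The code uses the well-known continuous-time approximation of the risk-averse liquidation trajectory with linear temporary impact, and falls back to equal slices in the risk-neutral limit. The function name, parameter names, and numerical values are illustrative assumptions, and the sketch is not part of the model developed in this paper.

```python
import numpy as np

def execution_schedule(total_shares, horizon, n_slices, risk_aversion, volatility, temp_impact):
    """Holdings path and per-slice sales for liquidating total_shares by the horizon."""
    times = np.linspace(0.0, horizon, n_slices + 1)
    # Urgency parameter: larger values mean faster, front-loaded selling.
    kappa = np.sqrt(risk_aversion * volatility ** 2 / temp_impact)
    if kappa * horizon < 1e-8:
        holdings = total_shares * (1.0 - times / horizon)      # risk-neutral limit
    else:
        holdings = total_shares * np.sinh(kappa * (horizon - times)) / np.sinh(kappa * horizon)
    trades = -np.diff(holdings)                                # shares sold in each slice
    return holdings, trades

# Hypothetical order: sell one million shares over five days in ten slices.
_, even_trades = execution_schedule(1e6, 5.0, 10, risk_aversion=0.0,
                                    volatility=0.95, temp_impact=2.5e-6)
_, urgent_trades = execution_schedule(1e6, 5.0, 10, risk_aversion=1e-6,
                                      volatility=0.95, temp_impact=2.5e-6)
print(even_trades)
print(urgent_trades)
```

With zero risk aversion the order is split evenly, whereas a risk-averse trader sells more in the early slices to reduce exposure to price risk, at the expense of a higher impact cost.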

### 2.3 Inverse portfolio optimization

In this paper, we consider three (related) modifications to the direct optimization problem described above. First, we take the view of dynamic inverse optimization, in the spirit of the Black-Litterman model [8] and its reformulation in [7], and assume that such an optimization problem was already solved by the market itself. Accordingly, we look for market-implied optimal trading policies/strategies rather than the trading/execution strategies of an individual investor. However, our market-wide aggregate trader-agent does the same thing as nearly all traders in the market do, i.e. it dynamically optimizes its own investment portfolio.

### 2.4 Dynamic portfolio management with constrained convex optimization

In our specification of single-step rewards, or negative costs of trading, we follow a large literature on multi-period mean-variance optimization. An accessible review of a version of such mean-variance optimization is given by Boyd et al. [10]. We largely adopt the notation and assumptions of the portfolio model suggested by Boyd et al., while in addition we explicitly introduce predictors and market impact effects not considered in [10]. Quadratic objective functions for multi-period portfolio optimization, discussed at length in [10], are formulated within the conventional DP approach that assumes a known model, including a known risk aversion parameter.

### 2.5 Stochastic policies

The second modification we make to the classical formulation of the optimal execution problem is that we consider stochastic (probabilistic), rather than deterministic, policies. A stochastic policy describes a probability distribution over actions conditional on the current state, so that an action becomes a sample from this distribution rather than a fixed number. Accordingly, an optimal action is a sample from an optimal policy. Deterministic policies can now be viewed as a special case of stochastic policies, where the action distribution is a Dirac delta-function centered at the optimal action for the given state, which corresponds to the deterministic policy setting of the classical DP approach.

What would be the meaning of such probabilistic modeling of execution orders, given that in the end they amount to specific numbers rather than probabilities? Such a choice can be justified for both the direct and inverse problems of optimal execution.

Let’s start with an argument for why stochastic policies can be useful for direct optimization. Given that the parameters defining optimal strategies are estimated from data, the resulting policy is always stochastic de facto, even though this is not explicitly recognized in deterministic policy execution models such as the Bertsimas and Lo [6] and Almgren and Chriss [2] models.

Adopting stochastic policies as a principal modeling tool allows one to explicitly control uncertainty around an optimal action in each state of the world. The latter can be identified with the mode of the policy distribution, while uncertainty around this value would be specified by the properties of this distribution and measured, in the simplest case, by the variance of a predicted optimal action. This argument is rather similar to the argument for stochastic, rather than deterministic, portfolio allocations for a one-period Markowitz-like portfolio optimization problem, which was put forward by Marschinski et al. [33].

In the setting of inverse portfolio optimization adopted in this paper, the usefulness of stochastic policies becomes even more evident. In this case, stochastic policies are needed in order to account for a possible sub-optimality of the policy that generated the data. Sub-optimal actions would be incompatible with an assumption of strict optimality of each action in the data, leading to vanishing probabilities of observed execution paths. Reliance on stochastic rather than deterministic policies allows one to cope with possible sub-optimality of historical data.
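The point about vanishing probabilities can be illustrated numerically. The sketch below assumes, purely for illustration, a Gaussian policy centered at the model-optimal action with a common standard deviation. The trade sizes are hypothetical, and the Gaussian form is our choice for the example rather than a statement about the policy used later in the paper.

```python
import numpy as np

def gaussian_policy_loglik(actions, optimal_actions, policy_std):
    """Sum over the path of log N(a_t | optimal a_t, policy_std^2)."""
    z = (actions - optimal_actions) / policy_std
    terms = -0.5 * z ** 2 - np.log(policy_std) - 0.5 * np.log(2.0 * np.pi)
    return float(np.sum(terms))

optimal = np.array([100.0, 80.0, 60.0, 40.0])    # hypothetical model-optimal trades
observed = np.array([104.0, 77.0, 66.0, 35.0])   # hypothetical, slightly sub-optimal data

for std in (10.0, 1.0, 0.1, 0.01):
    print(std, gaussian_policy_loglik(observed, optimal, std))

# The likelihood is maximized when the policy width matches the typical deviation.
best_std = float(np.sqrt(np.mean((observed - optimal) ** 2)))
print(best_std, gaussian_policy_loglik(observed, optimal, best_std))
```

As the policy width shrinks toward the deterministic limit, the log-likelihood of the slightly sub-optimal path decreases without bound, whereas a finite width assigns the same data a finite likelihood.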

### 2.6 Reinforcement Learning

A deterministic policy optimization problem in a dynamic mean-variance optimization setting similar to Boyd et al. [10] was reformulated in a data-driven Reinforcement Learning (RL) way by Ritter [25]. Ritter considers classical on-line Q-learning for the problem of multi-period portfolio optimization from data, using a quadratic risk-adjusted cost function. This translates the problem into a data-driven forward optimization that can be solved, given enough training data, by the famous Q-Learning of Watkins and Dayan [56].
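For readers less familiar with it, the on-line tabular Q-learning update can be sketched on a toy problem. The three-state environment below is entirely hypothetical and unrelated to portfolio optimization. It only shows the update rule, in which the action value moves toward the observed reward plus the discounted best value of the next state.

```python
import numpy as np

def q_learning(transitions, rewards, n_episodes, alpha, gamma, epsilon, rng):
    """Tabular Q-learning on a deterministic episodic MDP (last state is terminal)."""
    n_states, n_actions = rewards.shape
    q = np.zeros((n_states, n_actions))
    terminal = n_states - 1
    for _ in range(n_episodes):
        s = 0
        while s != terminal:
            # Epsilon-greedy behavior policy.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(q[s]))
            s_next = int(transitions[s, a])
            target = rewards[s, a] + gamma * np.max(q[s_next])
            q[s, a] += alpha * (target - q[s, a])
            s = s_next
    return q

# Toy MDP: state 2 is terminal; in state 0, action 1 gives a small reward now,
# while action 0 leads to state 1, where a larger reward is available.
transitions = np.array([[1, 2], [2, 2], [2, 2]])
rewards = np.array([[0.0, 1.0], [5.0, 0.0], [0.0, 0.0]])

rng = np.random.default_rng(0)
q = q_learning(transitions, rewards, n_episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.3, rng=rng)
print(q)
```

The learned values favor the patient action in the initial state, since the discounted later reward exceeds the immediate one. Note that this scheme requires the one-step rewards to be observed.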

The difference between our approach and Ritter’s is that we consider off-line (batch-mode) learning, and we do not observe one-step costs (or, equivalently, negative rewards). Therefore, our setting is that of IRL, while Ritter [25] considers an on-line RL formulation. Also, unlike [25], which uses a discretized state space, we use a continuous-state formulation. Furthermore, Ritter considers a general optimal portfolio investment problem for a given (representative?) investor, while here we focus on modeling an agent that represents a bounded-rational component of the market as a whole. This transforms our approach into a market model, unlike the case considered by Ritter, which is a trader model.

Quadratic risk-adjusted objective functions were considered in an apparently different problem of optimal option pricing and hedging using a model-free, data-driven approach in the work by one of the authors [26, 27]. The approach used in this work assumes off-line, batch-mode learning, that enables using data-efficient batch RL methods such as Fitted Q Iteration [15, 37].

We use entropy-regularized Reinforcement Learning in the form suggested by Tishby and co-workers under the name of G-learning [18], as a way to do Reinforcement Learning in a noisy environment. While [18] assumed a tabulated discrete-state/discrete-action setting, in our case both the state and action spaces are high-dimensional continuous spaces. For a tutorial-style introduction to Information-constrained Markov Decision Processes, see Larsson et al. [30].

### 2.7 Inverse Reinforcement Learning

A third modification we introduce to the classical portfolio optimization scheme is that we assume that some critical model parameters are unknown. Note that forward optimization using Dynamic Programming always assumes that dynamics and model parameters are known, or estimated using independent models. In particular, market impact parameters or risk aversion parameters are not easy to pin down without using additional models to estimate them before using them in a direct execution optimization method. Moreover, traders do not necessarily even think in terms of any utility function, and accordingly may not even know their own risk-aversion parameter $\lambda$.

Unlike such a DP approach, in our model we treat these parameters as unknown, and estimate them simultaneously with estimating the optimal policy from historical trading data. What we obtain with such a procedure can be interpreted as implied market impact and risk aversion parameters, similar to how implied volatilities are used to price and hedge options in option markets. In particular, even if traders may not think in terms of a quadratic utility function with some pre-determined value of $\lambda$, their observed behavior might be consistent with such a simple utility function, with some data-implied risk aversion rate $\lambda$.

Note that when risk aversion and parameters determining market impact are unknown, it also means that one-step costs (see below) are unknown as well. Our data therefore consist of sequences of states and actions, but it does not reveal costs incurred by following these actions. Such problems of estimating costs (or rewards) from an observed behavior are solved using methods of Inverse Optimal Control (IOC) when dynamics are known, or using Inverse Reinforcement Learning (IRL) when dynamics are unknown.

In this paper, we address this problem using model-based IRL. Our framework relies on a model for specification of one-step costs, market impact, and risk metrics (we will use quadratic risk measures going forward). On the side of IRL literature, our approach is based on Maximum Entropy IRL developed in [58], and extended to continuous-space formulation in [36]. A closely related method is Iterative Quadratic-Gaussian Regulator (IQGR) of Todorov and Li [53].

### 2.8 Neuroscience and biology

Our approach is similar to a Free-Energy Principle (FEP) approach to living systems and the brain function developed by Friston and collaborators in [19, 39]. Under this formalism, ”for an organism to resist dissipation and persist as an adaptive system that is a part of, coupled with, and yet statistically independent from, the larger system in which it is embedded, it must embody a probabilistic model of the statistical interdependencies and regularity of its environment” [39].

Our model applies similar approach, based on ideas from statistical thermodynamics, to the market as a dynamic persistent and adaptive system that embodies a bounded-rational RL agent that imitates a ’mind’ of the market as a goal-directed ’living organism’ in an adversarial environment. We implement the above requirement that the agent should embody a probabilistic model of its environment by formulating this problem as Inverse Reinforcement Learning. The free energy arises in this approach either as a way to regularize (Inverse) Reinforcement Learning in a noisy environment by entropy, as in G-learning [18], or as a way to model bounded-rational decision-making of the agent by imposing constraints on information processing costs [40, 52, 42], or equivalently as a way to account for an adversarial character of the environment, see the next two sections.

### 2.9 Thermodynamics, Bounded Rationality and Information Theory

Another, mathematically equivalent way to introduce entropy and free energy into the problem of sequential decision-making was formulated within an information-theoretic and physics-inspired approach in [40, 52, 42]. In particular, Ortega et al. [40, 42] emphasize that a regularization ’inverse temperature’ parameter $\beta$, which corresponds to a cost of information processing in a system, can also be interpreted as a degree of rationality of an agent that dynamically maximizes its free energy (i.e. an entropy-regularized value function).

This interpretation is provided by noting that the parameter $\beta$ determines the complexity of a search for a better policy starting from a given prior policy [40, 42]. Agents with large $\beta$ can afford a highly complex (costly) search for a better policy, and therefore are more rational than agents that live in a world with a small value of $\beta$. In the latter regime, an agent cannot afford to change from a prior policy, and therefore behaves as an irrational (entropy-dominated) agent. The information-theoretic approach thus provides a quantitative and tractable framework for a bounded-rational agent of Simon [45].

### 2.10 Self-play, adversarial learning, and the free energy optimization

An adversarial interpretation of Information Theoretic Bounded Rationality was suggested in [41] where it was shown that a single-agent free energy optimization is equivalent to a fictitious game between an agent and an imaginary adversary. In our model, we have a similar setting, where an agent representing a bounded-rational component of the market optimizes its free energy. The optimization amounts to a dynamical optimization of agent’s portfolio in a stochastic market environment with information processing costs. The latter are expressed as an entropy regularization of a value function, see below. As will be shown in Sect. 4.6, using the method of [41], such self-play can be equivalently viewed as adversarial learning in a fictitious two-party game with an adversarial opponent.

### 2.11 Bounded Rational Information Theoretic IRL (BRIT-IRL)

Our approach integrates ideas of Maximum Entropy IRL with a Bounded Rational Information-Theoretic interpretation of the process of learning, and applies them to make inferences of an ’Invisible Hand’, in the spirit of the Black-Litterman model. In splitting the market into its bounded-rational self and the rest, the model also has strong similarities with the free-energy approach to the brain and biological systems [19, 39]. In our approach, such view is applied to a financial market as a dynamic self-organizing system, with a focus on inverse rather than direct learning.

As our setting is of inverse learning, instead of assuming some value of the degree of rationality $\beta$, we infer the value of this parameter implied by the market data within our model. This produces a dynamic ”market-implied” index of rationality that can be used as a simple monitoring statistic, or possibly as a predictor of future events in the market. If the model is applied to an individual investor, provided corresponding proprietary trading data are available, it can produce an implied ’amount of rationality’ of that particular trader.

## 3 Investment portfolio

We adopt the notation and assumptions of the portfolio model suggested by Boyd et al. [10]. In this model, dollar values of positions in $N$ assets $i = 1, \ldots, N$ are denoted as a vector $x_t$ with components $(x_t)_i$ for a dollar value of asset $i$ at the beginning of period $t$. In addition to assets $x_t$, an investment portfolio includes a risk-free bank cash account $b_t$ with a risk-free interest rate $r_f$. A short position in any asset $i$ then corresponds to a negative value $(x_t)_i < 0$. The vector of mean of bid and ask prices of assets at the beginning of period $t$ is denoted as $p_t$, with $(p_t)_i > 0$ being the price of asset $i$. Trades $u_t$ are made at the beginning of interval $t$, so that asset values $x_t^+$ immediately after trades are deterministic:

$$x_t^+ = x_t + u_t \tag{1}$$

The total portfolio value is

$$v_t = \mathbf{1}^T x_t + b_t \tag{2}$$

where $\mathbf{1}$ is a vector of ones. The post-trade portfolio is therefore

$$v_t^+ = \mathbf{1}^T x_t^+ + b_t^+ = \mathbf{1}^T (x_t + u_t) + b_t^+ = v_t + \mathbf{1}^T u_t + b_t^+ - b_t \tag{3}$$

We assume that all re-balancing of stock positions is financed from the bank cash account (an additional cash cost related to the trade will be introduced below). This imposes the following ’self-financing’ constraint:

$$\mathbf{1}^T u_t + b_t^+ - b_t = 0 \tag{4}$$

which simply means that the portfolio value remains unchanged upon an instantaneous re-shuffle of the wealth between the stock and cash:

$$v_t^+ = v_t \tag{5}$$

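The post-trade portfolio $v_t^+$ and cash are invested at the beginning of period $t$ until the beginning of the next period. The return of asset $i$ over period $t$ is defined as

$$(r_t)_i = \frac{(p_{t+1})_i - (p_t)_i}{(p_t)_i}, \quad i = 1, \ldots, N \tag{6}$$

Asset positions at the next time period are then given by

$$x_{t+1} = x_t^+ + r_t \circ x_t^+ \tag{7}$$

where $\circ$ stands for an element-wise (Hadamard) product, and $r_t \in \mathbb{R}^N$ is the vector of asset returns from period $t$ to period $t+1$. The next-period portfolio value is then obtained as follows:

$$v_{t+1} = \mathbf{1}^T x_{t+1} + b_{t+1} = (\mathbf{1} + r_t)^T (x_t + u_t) + (1 + r_f) b_t^+ \tag{8}$$

Given a vector of returns $r_t$ in period $t$, the change of the portfolio value in excess of a risk-free growth is

$$\begin{aligned} \Delta v_t &\equiv v_{t+1} - (1 + r_f) v_t = (\mathbf{1} + r_t)^T (x_t + u_t) + (1 + r_f) b_t^+ - (1 + r_f) \mathbf{1}^T x_t - (1 + r_f) b_t \\ &= (r_t - r_f \mathbf{1})^T (x_t + u_t) \end{aligned} \tag{9}$$

where in the second equation we used Eq.(4).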
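The accounting of Eqs.(1)-(9) can be checked numerically. The following Python sketch is an illustration only: the function name `step_portfolio` and all numbers are hypothetical, and it assumes that cash earns the risk-free rate over the period. It compares the excess change in portfolio value computed directly from positions with the closed form $(r_t - r_f \mathbf{1})^T (x_t + u_t)$ that follows from the self-financing constraint.

```python
def step_portfolio(x, b, u, r, r_f):
    """One trading period: trade u at the start, then hold until the next period."""
    x_plus = [xi + ui for xi, ui in zip(x, u)]                 # post-trade positions
    b_plus = b - sum(u)                                        # trades are financed from cash
    x_next = [(1.0 + ri) * xp for ri, xp in zip(r, x_plus)]    # positions grow with returns
    b_next = (1.0 + r_f) * b_plus                              # cash grows at the risk-free rate
    return x_next, b_next

# Hypothetical two-asset example (dollar values)
x = [100.0, 50.0]
b = 20.0
u = [10.0, -5.0]       # buy asset 0, sell asset 1
r = [0.02, -0.01]
r_f = 0.001

v_t = sum(x) + b
v_post = sum(xi + ui for xi, ui in zip(x, u)) + (b - sum(u))   # value right after the trade
x_next, b_next = step_portfolio(x, b, u, r, r_f)
v_next = sum(x_next) + b_next

excess_direct = v_next - (1.0 + r_f) * v_t
excess_formula = sum((ri - r_f) * (xi + ui) for ri, xi, ui in zip(r, x, u))
print(excess_direct, excess_formula)
```

The two numbers agree up to floating-point error, and the post-trade value `v_post` equals `v_t`, as the self-financing constraint requires.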

### 3.1 Terminal condition

A terminal condition for the market portfolio is obtained from the requirement that at a planning horizon $T$, all stock positions should be equal to the actual observed weights of stocks in the market index. This implies that $x_T = x_T^M$, where $x_T^M$ are market cap weights in the S&P 500 index at time $T$. By Eq.(1), this fixes the action $u_T$ at the last time step:

(10) |

Therefore, the action $u_T$ at the last step is deterministic and is not subject to the optimization that should be applied to the $T$ remaining actions $u_{T-1}, \ldots, u_0$.

If the model is applied to an individual investor, the planning horizon $T$ is an investment horizon for that investor, while the terminal condition (10) can be replaced by a similar terminal condition for the investor portfolio.

### 3.2 Asset returns model

We assume the following linear specification of one-period excess asset returns:

$$r_t - r_f \mathbf{1} = W z_t - M^T u_t + \varepsilon_t \tag{11}$$

where $z_t$ is a vector of predictors with factor loading matrix $W$, $M$ is a matrix of permanent market impacts with a linear impact specification, and $\varepsilon_t$ is a vector of residuals with

$$\mathbb{E}[\varepsilon_t] = 0, \quad \mathbb{V}[\varepsilon_t] = \Sigma_r \tag{12}$$

Equation (11) specifies stochastic returns $r_t$, or equivalently the next-step stock prices, as driven by external signals $z_t$, control (action) variables $u_t$, and uncontrollable noise $\varepsilon_t$.

Though they enter ’symmetrically’ in Eq.(11), the two drivers of returns $z_t$ and $u_t$ play entirely different roles. While signals $z_t$ are completely external for the agent, actions $u_t$ are controlled degrees of freedom. In our approach, we will be looking for optimal controls $u_t$ for the market-wise portfolio. When we set up a proper optimization problem, we solve for an optimal action $u_t$. As will be shown in this paper, this optimal control turns out to be a linear function of $x_t$, plus noise. Substituting it back into Eq.(11) produces effective, dynamically generated dynamics that involve only stock prices, see Eq.(109) below in Sect. 6.1. (The reader interested only in the final asset return model resulting from our framework, but not in its derivation, can jump directly to Eq.(109).)

### 3.3 Signal dynamics and state space

For dynamics of signals $z_t$, similar to [21], we will assume a simple multi-variate mean-reverting Ornstein-Uhlenbeck (OU) process for a $K$-component vector $z_t$:

$$z_{t+1} = (I - \Phi) z_t + \varepsilon_t^z \tag{13}$$

where $\varepsilon_t^z$ is the noise term, and $\Phi$ is a diagonal matrix of mean reversion rates.

It is convenient to form an extended state vector $y_t$ of size $N + K$ by concatenating vectors $x_t$ and $z_t$:

$$y_t = \begin{bmatrix} x_t \\ z_t \end{bmatrix} \tag{14}$$

The extended vector $y_t$ describes a full state of the system for the agent that has some control of its $x$-component, but no control of its $z$-component.
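To make the roles of signals, actions and noise concrete, the sketch below advances the extended state by one period. It is a toy illustration under stated assumptions, not the calibrated model: the function `simulate_step` and all parameter values are hypothetical, the impact term is assumed to enter the excess return with a negative sign, and both noise terms are assumed Gaussian and independent across components.

```python
import random

def simulate_step(x, z, u, W, M, phi, r_f, sigma_r, sigma_z, rng):
    """Advance positions x and signals z by one period under assumed linear dynamics."""
    n, k = len(x), len(z)
    x_plus = [x[i] + u[i] for i in range(n)]                   # post-trade positions
    r = []
    for i in range(n):
        signal = sum(W[i][j] * z[j] for j in range(k))         # (W z)_i
        impact = sum(M[j][i] * u[j] for j in range(n))         # (M^T u)_i, assumed sign: minus
        r.append(r_f + signal - impact + rng.gauss(0.0, sigma_r))
    x_next = [(1.0 + r[i]) * x_plus[i] for i in range(n)]
    # Discrete mean-reverting update with a diagonal matrix of reversion rates phi
    z_next = [(1.0 - phi[j]) * z[j] + rng.gauss(0.0, sigma_z) for j in range(k)]
    return x_next, z_next, r

rng = random.Random(42)
x, z = [100.0, 50.0], [0.5, -0.2]
W = [[0.02, 0.0], [0.0, 0.01]]
M = [[1e-4, 0.0], [0.0, 1e-4]]
phi = [0.1, 0.2]
for t in range(3):
    u = [0.0, 0.0]                                             # no trading in this toy run
    x, z, r = simulate_step(x, z, u, W, M, phi, 0.001, 0.01, 0.05, rng)
    y = x + z                                                  # extended state: x followed by z
    print(t, [round(v, 3) for v in y])
```

Only the `x` part of the state responds to the action `u`; the signal part `z` evolves on its own.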

### 3.4 One-period rewards

We first consider an idealized case when there are no costs of taking action $u_t$ at time step $t$. An instantaneous random reward received upon taking such action is obtained by substituting Eq.(11) in Eq.(9):

$$R_t^{(0)}(x_t, u_t) = (W z_t - M^T u_t + \varepsilon_t)^T (x_t + u_t) \tag{15}$$

In addition to this reward that would be obtained in an ideal friction-free world, we have to add (negative) rewards received due to instantaneous market impact and transaction fees. (We assume no short sale positions in our setting, and therefore do not include borrowing costs.) Furthermore, we have to include a negative reward due to risk in a newly created portfolio position at time $t+1$. Similar to [10], we choose a simple quadratic measure of such risk penalty, as the variance of the instantaneous reward (15) conditional on the new state $x_t + u_t$, multiplied by the risk aversion parameter $\lambda$:

$$R_t^{(risk)}(x_t, u_t) = -\lambda \mathbb{V}\left[ R_t^{(0)}(x_t, u_t) \,\middle|\, x_t + u_t \right] = -\lambda (x_t + u_t)^T \Sigma_r (x_t + u_t) \tag{16}$$
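As a small worked example of the quadratic risk penalty, the sketch below evaluates minus the risk aversion times the variance of the one-period reward, given the post-trade position. The function name `risk_penalty` and the numbers are hypothetical; `sigma_r` stands for the covariance matrix of return residuals and `lam` for the risk aversion parameter.

```python
def risk_penalty(x, u, sigma_r, lam):
    """Return -lam * p^T Sigma_r p, where p = x + u is the post-trade position."""
    p = [xi + ui for xi, ui in zip(x, u)]
    n = len(p)
    quad = sum(p[i] * sigma_r[i][j] * p[j] for i in range(n) for j in range(n))
    return -lam * quad

x = [100.0, 50.0]
u = [10.0, -5.0]
sigma_r = [[0.04, 0.01],
           [0.01, 0.09]]       # hypothetical residual covariance
lam = 0.001

penalty = risk_penalty(x, u, sigma_r, lam)
print(penalty)
```

Here the post-trade position is $(110, 45)$, so the quadratic form equals $484 + 99 + 182.25 = 765.25$ and the penalty is $-0.76525$.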

To specify negative rewards (costs) of an instantaneous market impact and transaction costs, it is convenient to represent each action $u_t$ as a difference of two non-negative action variables $u_t^+, u_t^- \geq 0$:

$$u_t = u_t^+ - u_t^- \tag{17}$$

so that $u_t = u_t^+$ if $u_t > 0$ and $u_t = -u_t^-$ if $u_t < 0$. The instantaneous market impact and transaction costs are then given by the following expressions:

(18) |

Here and are, respectively, matrices-valued and vector-valued parameters that in a simplest case can be parametrized in terms of single scalars multiplied by unit vectors or matrices.

Combining Eqs.(15), (16), (18), we obtain our final specification of a risk- and cost-adjusted instantaneous reward function for the problem of optimal portfolio liquidation:

(19) |

The expected one-step reward given action is given by

(20) |

where

(21) |

where stands for averaging over next-periods realizations of market returns.

Note that the one-step expected reward (20) is a quadratic form of its inputs. We can write it more explicitly using vector notation:

(22) |

where

(23) |

### 3.5 Multi-period portfolio optimization

Multi-period portfolio optimization is equivalently formulated either as maximization of risk- and cost-adjusted returns, as in the Markowitz portfolio model, or as minimization of risk- and cost-adjusted trading costs. The latter specification is usually used in problems of optimal portfolio liquidation.

A multi-period risk- and cost-adjusted reward maximization problem reads

maximize | (24) | ||

where | |||

subject to |

Here $\gamma$ is a discount factor. Note that the sum over future periods $t' = t, \ldots, T-1$ does not include the last period $t' = T$, because the last action is fixed by Eq.(10).

An equivalent cost-focused formulation is obtained by flipping the sign of the above problem, and re-phrasing it as minimization of trading costs :

minimize | (25) | ||

where | (26) |

subject to the same constraints as in (24).

### 3.6 Dynamic Inverse Portfolio Optimization

When the model dynamics are known (or independently estimated from data), the dynamic portfolio optimization problem of Eq.(24) can be formulated as a problem of Stochastic Optimal Control (SOC), also known as a Dynamic Programming approach. This approach was pursued in Ref. [10] in a general setting of convex portfolio optimization, see also references there on previous work on this topic. In particular, one well-known example is a dynamic mean-variance model of Garleanu and Pedersen [21] with quadratic transaction costs.

We keep a convex multi-period portfolio formulation while adding to it modeling of market impact and external signals, and focusing on an inverse optimization problem, rather than a forward optimization problem as in [10]. We can refer to this problem as a Dynamic Inverse Portfolio Optimization (DIPO) problem. The word ’dynamic’ here means that a learned optimal policy should be adaptive to predictors $z_t$.

In DIPO learning, we assume that an optimal portfolio strategy has already been found, perhaps not quite optimally, in the past by an expert trader. We assume that we have a record of several runs of such a nearly optimal strategy, each of length $T$, performed by this expert trader. Following the common conventions of the RL/IRL literature, we call these data samples expert demonstrations, or expert trajectories. The problem is then to find the optimal execution policy from these data.

We may differentiate between two possible settings for such data-driven DIPO learning that can be encountered in practice. First, in a setting of Reinforcement Learning we have access to historical data consisting of stock market prices, actions taken (i.e. portfolio trades), and risk-adjusted rewards received upon taking these actions (see below for details). In addition, the data consists of all predictive factors (”alpha-factors”) that might be predictive of rewards. The objective is to learn and improve a policy that was used in the data, so that the new improved policy can be used to generate higher rewards in the future.

The other setting is of Inverse Reinforcement Learning (IRL), where everything is the same as above, except we do not observe rewards anymore. The objective is to learn the reward function that leads to the observed behavior, and learn the policy as well.

This is the setting of this paper, where we use an IRL framework to represent all traders in the market as one market-wise ’expert trader’ who is mathematically modeled as a bounded-rational RL agent. A reward function of this agent is learned from market data, plus whatever signals are used by the model. The learned parameters include the market-implied risk aversion $\lambda$, market impact parameters, weights of predictors $W$, and the market-implied ’rationality index’ $\beta$.

Note that if proprietary trading data from a particular trader or broker are available, the same framework can be applied to learn a reward function of that particular trader. Such setting might be interesting given that the value of a ’true’ risk aversion parameter is often unknown to investors themselves, as they may not base their decisions on a quadratic utility model. When applied to an individual investor, the model developed here may offer a probabilistic model of that particular investor, with parameters estimated on trading data of this investor, combined with the market data.

Regarding the policy optimization problem, as rewards are not observed in the IRL setting, this problem is in general both harder and less well-posed in comparison to the RL setting. In particular, unlike RL off-policy methods such as Q-learning that can learn, given enough data, even from data with purely random actions, IRL methods cannot proceed with data with entirely random actions. For IRL to work, data collected should correspond to some good, though not necessarily optimal policy. Probabilistic IRL methods are capable of learning when demonstrated data does not always correspond to optimal actions.

While our main focus in this paper is on the IRL setting, we will start below with RL approaches to the problem.

## 4 Reinforcement Learning of optimal trading

In this section, we will discuss a data-driven Reinforcement Learning approach to multi-period portfolio optimization of Eq.(24). We first introduce stochastic policies and a Bellman equation with stochastic policies, and then consider entropy-regularized methods for the MDP corresponding to Eq.(24).

### 4.1 Stochastic policy

Note that the multi-period portfolio optimization problem (24) assumes that an optimal policy that determines actions $a_t$ is a deterministic policy that can also be described as a delta-like probability distribution

$$\pi(a_t | y_t) = \delta\left(a_t - a_t^\star(y_t)\right) \tag{27}$$

where the optimal deterministic action $a_t^\star(y_t)$ is obtained by maximization of the objective (24) with respect to controls $a_t$.

But the actual trading data may be sub-optimal, or noisy at times, because of model mis-specifications, market timing lags, human errors etc. The potential presence of such sub-optimal actions in data poses serious challenges if we assume a deterministic policy (27), under which the action chosen is always an optimal action. This is because such events should have zero probability under these model assumptions, and thus would produce vanishing path probabilities if observed in data.
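The vanishing of path probabilities can be seen in a toy calculation. The sketch below is an illustration under stated assumptions: actions are scalar, the policy is Gaussian around a hypothetical optimal action with standard deviation `sigma`, and the deterministic policy is approached as `sigma` shrinks. The function names and numbers are hypothetical.

```python
import math

def gaussian_log_density(a, mean, sigma):
    """Log-density of a scalar action under a Gaussian policy N(mean, sigma^2)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((a - mean) / sigma) ** 2

def path_log_likelihood(actions, optimal_actions, sigma):
    """Sum of per-step log-densities of observed actions along one trajectory."""
    return sum(gaussian_log_density(a, m, sigma) for a, m in zip(actions, optimal_actions))

optimal = [1.0, -0.5, 0.2]
observed = [1.1, -0.4, 0.2]     # two slightly sub-optimal trades

for sigma in (0.1, 0.01, 0.001):
    print(sigma, path_log_likelihood(observed, optimal, sigma))
```

As `sigma` decreases towards the delta-like limit, the log-likelihood of the slightly sub-optimal path diverges to minus infinity, whereas a policy with finite width assigns it a finite value.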

Instead of assuming a deterministic policy (27), stochastic policies described by smoothed distributions $\pi(a_t | y_t)$ are more useful for inverse problems such as the problem of inverse portfolio optimization. In this approach, instead of maximization with respect to a deterministic policy/action $a_t$, we re-formulate the problem as maximization over probability distributions $\pi(a_t | y_t)$:

maximize | (28) | ||

where | |||

w.r.t. | |||

subject to |

Here stands for expectations with respect to path probabilities defined according to the third line in Eqs.(28).

Note that due to inclusion of a quadratic risk penalty in the risk-adjusted return the original problem of risk-adjusted return optimization is re-stated in Eq.(28) as maximizing the expected cumulative reward in the standard MDP setting, thus making the problem amenable to a standard risk-neutral approach of MDP models. Such simple risk adjustment based on one-step variance penalties was suggested in a non-financial context by Gosavi [23], and used in a Reinforcement Learning based approach to option pricing in [26, 27].

Another comment that is due here is that a probabilistic approach to actions in portfolio trading appears, on many counts, a more natural way than a formalism based on deterministic policies. Indeed, even in a simplest one-period setting, because the Markowitz-optimal solution for portfolio weights is a function of estimated stock means and covariances, they are in fact random variables. Yet the probabilistic nature of portfolio optimization is not recognized as such in Markowitz-type single-period or multi-period optimization settings such as (24). A probabilistic portfolio optimization formulation was suggested in a one-period setting by Marshinski et. al. [33].

### 4.2 Reference policy

We assume that we are given a probabilistic reference (or prior) policy $\pi_0(a_t | y_t)$, which should be decided upon prior to attempting the portfolio optimization (28). Such a policy can be chosen based on a parametric model, past historic data, etc. We will use a simple Gaussian reference policy

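$$\pi_0(a_t | y_t) = \mathcal{N}\left(a_t \,\middle|\, \hat{a}(y_t), \Sigma_p\right) \tag{29}$$

where $\hat{a}(y_t)$ can be a deterministic policy chosen to be a linear function of a state vector $y_t$:

$$\hat{a}(y_t) = \hat{A}_0 + \hat{A}_1 y_t \tag{30}$$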

A simple choice of parameters in (29) could be to specify them in terms of only two scalars as follows: and where is the size of vector , and are, respectively, a vector and matrix made of ones. The scalars and would then serve as hyper-parameters in our setting. Similarly, the covariance matrix for the prior policy can be taken to be a simple matrix with constant correlations and constant variances .

As will be shown below, an optimal policy has the same Gaussian form as the prior policy (29), with updated parameters , and . These updates will be computed iteratively starting with their initial values defining the prior (29). Respectively, updates at iteration will be denoted by upper subscripts, e.g. , .

Furthermore, it turns out that a linear dependence on at iteration , driven by the value of arises even if we set in the prior (29). Such choice of a state-independent prior , although not very critical, reduces the number of free parameters in the model by two, as well as simplifies some of the analyses below, and hence will be assumed going forward. It also makes it unnecessary to specify the value of in the prior (29) (equivalently, we can initialize it at zero). The final set of hyper-parameters defining the prior (29) therefore includes only three values of .
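A state-independent Gaussian prior with constant correlations and constant variances can be sketched as follows. This is an illustration only: the function names are hypothetical, `sigma` is taken to be a standard deviation and `rho` a common pairwise correlation (both assumptions about the parametrization), and sampling uses a one-factor construction that requires $0 \leq \rho \leq 1$.

```python
import math
import random

def constant_corr_cov(n, sigma, rho):
    """Covariance with equal standard deviations sigma and equal pairwise correlation rho."""
    return [[sigma * sigma * (1.0 if i == j else rho) for j in range(n)] for i in range(n)]

def sample_prior_action(mean, sigma, rho, rng):
    """Draw one action from N(mean, constant_corr_cov) using one common and n idiosyncratic shocks."""
    common = rng.gauss(0.0, 1.0)
    return [m + sigma * (math.sqrt(rho) * common + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0))
            for m in mean]

rng = random.Random(0)
mean = [0.0, 0.0, 0.0]                    # state-independent prior mean
cov = constant_corr_cov(3, sigma=0.2, rho=0.3)
action = sample_prior_action(mean, sigma=0.2, rho=0.3, rng=rng)
print(cov[0], action)
```

Each sampled component has variance $\sigma^2(\rho + 1 - \rho) = \sigma^2$ and each pair has covariance $\sigma^2 \rho$, so the draws match the matrix returned by `constant_corr_cov`.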

### 4.3 Bellman Optimality Equation

Let

(31) |

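The optimal state value function $V_t^\star(y_t)$ satisfies the Bellman optimality equation (see e.g. [49])

$$V_t^\star(y_t) = \max_{a_t} \left( \hat{R}_t(y_t, a_t) + \gamma \mathbb{E}\left[ V_{t+1}^\star(y_{t+1}) \,\middle|\, y_t, a_t \right] \right) \tag{32}$$

The optimal policy $\pi^\star$ can be obtained from $V^\star$ as follows:

$$\pi_t^\star(a_t | y_t) = \arg\max_{a_t} \left( \hat{R}_t(y_t, a_t) + \gamma \mathbb{E}\left[ V_{t+1}^\star(y_{t+1}) \,\middle|\, y_t, a_t \right] \right) \tag{33}$$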

The goal of Reinforcement Learning (RL) is to solve the Bellman optimality equation based on samples of data. Assuming that an optimal value function is found by means of RL, solving for the optimal policy takes another optimization problem as formulated in Eq.(33).
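For intuition, when the model is known and states and actions are finite, the Bellman optimality equation can be solved exactly by backward induction. The sketch below uses a hypothetical two-state, two-action MDP and assumes a zero terminal value; it is a tabular toy, not the continuous-state portfolio problem of this paper.

```python
def backward_induction(R, P, gamma, T):
    """Finite-horizon Bellman recursion.

    R[s][a]: expected reward, P[s][a][s2]: transition probability.
    Returns V[t][s] and the greedy policy[t][s]; V[T] = 0 is an assumed terminal value.
    """
    n_s, n_a = len(R), len(R[0])
    V = [[0.0] * n_s for _ in range(T + 1)]
    policy = [[0] * n_s for _ in range(T)]
    for t in range(T - 1, -1, -1):
        for s in range(n_s):
            q = [R[s][a] + gamma * sum(P[s][a][s2] * V[t + 1][s2] for s2 in range(n_s))
                 for a in range(n_a)]
            policy[t][s] = max(range(n_a), key=lambda a: q[a])   # optimal action
            V[t][s] = q[policy[t][s]]                            # optimal value
    return V, policy

# Toy MDP: action a moves deterministically to state a
R = [[1.0, 0.0],
     [0.0, 2.0]]
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]

V, policy = backward_induction(R, P, gamma=0.9, T=3)
print(V[0], policy[0])
```

With three periods left, it is optimal in state 0 to give up the immediate reward and move to state 1, which illustrates that finding the policy is a separate maximization on top of the value function.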

### 4.4 Entropy-regularized Bellman optimality equation

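Following [11], we start with reformulating the Bellman optimality equation using a Fenchel-type representation:

$$V_t^\star(y_t) = \max_{\pi(\cdot | y_t) \in \mathcal{P}} \sum_{a_t \in \mathcal{A}_t} \pi(a_t | y_t) \left( \hat{R}_t(y_t, a_t) + \gamma \mathbb{E}\left[ V_{t+1}^\star(y_{t+1}) \,\middle|\, y_t, a_t \right] \right) \tag{34}$$

Here $\mathcal{P} = \{\pi : \pi \geq 0, \mathbf{1}^T \pi = 1\}$ stands for a set of all valid distributions. Eq.(34) is equivalent to the original Bellman optimality equation (32), because for any $x \in \mathbb{R}^n$, we have $\max_{i \in \{1, \ldots, n\}} x_i = \max_{\pi \geq 0, \mathbf{1}^T \pi = 1} \pi^T x$. Note that while we use discrete notations for simplicity of presentation, all formulas below can be equivalently expressed in continuous notations by replacing sums by integrals. For brevity, we will denote the expectation $\mathbb{E}_{y_{t+1} | y_t, a_t}[\cdot]$ as $\mathbb{E}_{t,a}[\cdot]$ in what follows.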

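The one-step information cost of a learned policy $\pi(a_t | y_t)$ relative to a reference policy $\pi_0(a_t | y_t)$ is defined as follows [18]:

$$g^\pi(y_t, a_t) = \log \frac{\pi(a_t | y_t)}{\pi_0(a_t | y_t)} \tag{35}$$

Its expectation with respect to policy $\pi$ is the Kullback-Leibler (KL) divergence of $\pi(\cdot | y_t)$ and $\pi_0(\cdot | y_t)$:

$$\mathbb{E}_\pi\left[ g^\pi(y_t, a_t) \,\middle|\, y_t \right] = KL[\pi \| \pi_0](y_t) \equiv \sum_{a_t} \pi(a_t | y_t) \log \frac{\pi(a_t | y_t)}{\pi_0(a_t | y_t)} \tag{36}$$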
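For a discrete action set, the expected one-step information cost is straightforward to compute. The sketch below is a minimal illustration with hypothetical policies over three actions; `kl_divergence` is an illustrative helper name.

```python
import math

def kl_divergence(pi, pi0):
    """KL[pi || pi0] for two discrete action distributions over the same action set."""
    return sum(p * math.log(p / q) for p, q in zip(pi, pi0) if p > 0.0)

pi0 = [1.0 / 3.0] * 3          # uniform reference policy
pi = [0.7, 0.2, 0.1]           # learned policy concentrated on the first action

cost_same = kl_divergence(pi0, pi0)
cost_tilted = kl_divergence(pi, pi0)
print(cost_same, cost_tilted)
```

The cost is zero when the learned policy coincides with the reference policy and grows as the learned policy moves away from it, which is what makes it usable as a regularizer.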

The total discounted information cost for a trajectory is defined as follows:

$$I^\pi(y_t) = \sum_{t'=t}^{T} \gamma^{t'-t} \mathbb{E}\left[ g^\pi(y_{t'}, a_{t'}) \,\middle|\, y_t \right] \tag{37}$$

The free energy function $F_t^\pi(y_t)$ is defined as the value function (34) augmented by the information cost penalty (37):

$$\begin{aligned} F_t^\pi(y_t) &= V_t^\pi(y_t) - \frac{1}{\beta} I^\pi(y_t) \\ &= \sum_{t'=t}^{T} \gamma^{t'-t} \mathbb{E}\left[ \hat{R}_{t'}(y_{t'}, a_{t'}) - \frac{1}{\beta} g^\pi(y_{t'}, a_{t'}) \,\middle|\, y_t \right] \end{aligned} \tag{38}$$

Note that $\beta$ in Eq.(38) serves as the ”inverse temperature” parameter that controls a trade-off between reward optimization and proximity to the reference policy, see below. The free energy $F_t^\pi(y_t)$ is the entropy-regularized value function, where the amount of regularization can be tuned to better cope with noise in data. (Note that in physics, as well as in the free-energy principle literature [19, 39], free energy is defined with a negative sign relative to Eq.(38). This difference is purely a matter of a sign convention, as maximization of Eq.(38) can be re-stated as minimization of its negative. With our sign convention for the free energy function, we follow the Reinforcement Learning and Information Theory literature [40, 52, 42, 30].) The reference policy $\pi_0$ provides a ”guiding hand” in the stochastic policy optimization process that we describe next.

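A Bellman equation for the free energy function $F_t^\pi(y_t)$ is obtained from (38):

$$F_t^\pi(y_t) = \mathbb{E}_{a | y}\left[ \hat{R}_t(y_t, a_t) - \frac{1}{\beta} g^\pi(y_t, a_t) + \gamma \mathbb{E}_{t,a}\left[ F_{t+1}^\pi(y_{t+1}) \right] \right] \tag{39}$$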

For a finite-horizon setting, Eq.(39) should be supplemented by a terminal condition

(40) |

(see Eq.(10)). Eq.(39) can be viewed as a soft probabilistic relaxation of the Bellman optimality equation for the value function, with the KL information cost penalty (36) as a regularization controlled by the inverse temperature $\beta$. In addition to such a regularized value function (free energy), we will next introduce an entropy-regularized Q-function.

### 4.5 G-function: an entropy-regularized Q-function

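Similarly to the action-value function, we define the state-action free energy function $G^\pi(y, a)$ as [18]

$$\begin{aligned} G_t^\pi(y_t, a_t) &= \hat{R}_t(y_t, a_t) + \gamma \mathbb{E}\left[ F_{t+1}^\pi(y_{t+1}) \,\middle|\, y_t, a_t \right] \\ &= \hat{R}_t(y_t, a_t) + \gamma \mathbb{E}_{t,a}\left[ \sum_{t'=t+1}^{T} \gamma^{t'-t-1} \left( \hat{R}_{t'}(y_{t'}, a_{t'}) - \frac{1}{\beta} g^\pi(y_{t'}, a_{t'}) \right) \right] \\ &= \mathbb{E}_{t,a}\left[ \sum_{t'=t}^{T} \gamma^{t'-t} \left( \hat{R}_{t'}(y_{t'}, a_{t'}) - \frac{1}{\beta} g^\pi(y_{t'}, a_{t'}) \right) \right] \end{aligned} \tag{41}$$

where in the last equation we used the fact that the first action $a_t$ in the G-function is fixed, and hence $g^\pi(y_t, a_t) = 0$ when we condition on $a_t = a$.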

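If we now compare this expression with Eq.(38), we obtain the relation between the G-function and the free energy $F_t^\pi(y_t)$:

$$F_t^\pi(y_t) = \sum_{a_t} \pi(a_t | y_t) \left[ G_t^\pi(y_t, a_t) - \frac{1}{\beta} \log \frac{\pi(a_t | y_t)}{\pi_0(a_t | y_t)} \right] \tag{42}$$

This functional is maximized by the following distribution $\pi(a_t | y_t)$:

$$\pi(a_t | y_t) = \frac{1}{Z_t} \pi_0(a_t | y_t) e^{\beta G_t^\pi(y_t, a_t)}, \qquad Z_t = \sum_{a_t} \pi_0(a_t | y_t) e^{\beta G_t^\pi(y_t, a_t)} \tag{43}$$

The free energy (42) evaluated at the optimal solution (43) becomes

$$F_t^\pi(y_t) = \frac{1}{\beta} \log Z_t = \frac{1}{\beta} \log \sum_{a_t} \pi_0(a_t | y_t) e^{\beta G_t^\pi(y_t, a_t)} \tag{44}$$

Using Eq.(44), the optimal action policy (43) can be written as follows:

$$\pi(a_t | y_t) = \pi_0(a_t | y_t) e^{\beta \left( G_t^\pi(y_t, a_t) - F_t^\pi(y_t) \right)} \tag{45}$$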

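Eqs.(44), (45), along with the first form of Eq.(41), repeated here for convenience:

$$G_t^\pi(y_t, a_t) = \hat{R}_t(y_t, a_t) + \gamma \mathbb{E}_{t,a}\left[ F_{t+1}^\pi(y_{t+1}) \,\middle|\, y_t, a_t \right] \tag{46}$$

constitute a system of equations that should be solved self-consistently by backward recursion for $t = T-1, \ldots, 0$, with terminal conditions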

(47) | |||

The self-consistent scheme of Eqs.(44, 45, 46) [18] can be used in both the RL setting, when rewards are observed, and in the IRL setting when they are not. Before proceeding with these methods, we want to digress on an alternative interpretation of entropy regularization in Eq.(38), that can be useful for clarifying the approach of this paper.
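The self-consistent backward recursion can be illustrated on a small tabular problem with observed rewards. The sketch below is a toy under stated assumptions, not the continuous-state scheme developed in this paper: the two-state MDP, the uniform reference policy and the zero terminal free energy are all hypothetical choices, and `g_learning_backward` is an illustrative name.

```python
import math

def g_learning_backward(R, P, pi0, gamma, beta, T):
    """Backward recursion for the free energy F, the G-function and the policy.

    R[s][a]: expected reward, P[s][a][s2]: transition probability,
    pi0[s][a]: reference policy. Terminal free energy F_T = 0 is an assumption.
    Returns a list over t = 0..T-1 of tuples (F, G, pi).
    """
    n_s, n_a = len(R), len(R[0])
    F_next = [0.0] * n_s
    out = []
    for t in range(T - 1, -1, -1):
        # G-function: reward plus discounted expected next-step free energy
        G = [[R[s][a] + gamma * sum(P[s][a][s2] * F_next[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        F, pi = [], []
        for s in range(n_s):
            g_max = max(G[s])                                   # shift for numerical stability
            Z = sum(pi0[s][a] * math.exp(beta * (G[s][a] - g_max)) for a in range(n_a))
            F_s = g_max + math.log(Z) / beta                    # free energy as a log-partition function
            F.append(F_s)
            pi.append([pi0[s][a] * math.exp(beta * (G[s][a] - F_s)) for a in range(n_a)])
        out.append((F, G, pi))
        F_next = F
    out.reverse()
    return out

R = [[1.0, 0.0], [0.0, 2.0]]
P = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]   # action a moves to state a
pi0 = [[0.5, 0.5], [0.5, 0.5]]

for beta in (1e-6, 1.0, 200.0):
    F0, G0, policy0 = g_learning_backward(R, P, pi0, gamma=0.9, beta=beta, T=3)[0]
    print(beta, [round(f, 4) for f in F0], [round(p, 3) for p in policy0[0]])
```

For very small `beta` the policy stays at the reference policy and the free energy equals the value of following it; for large `beta` the policy concentrates on the greedy action and the free energy approaches the optimal value function.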

### 4.6 Adversarial interpretation of entropy regularization

A useful alternative interpretation of the entropy regularization term in Eq.(38) can be suggested using its representation as a Legendre-Fenchel transform of another function [41]:

(48) |

where is an arbitrary function. Eq.(48) can be verified by direct minimization of the right-hand side with respect to .

Using this representation of the KL term, the free energy maximization problem (42) can be re-stated as a max-min problem

(49) |

The imaginary adversary’s optimal cost obtained from (49) is

(50) |

Similarly to [41], one can check that this produces an indifference solution for the imaginary game between the agent and its adversarial environment where the total sum of the optimal G-function and the optimal adversarial cost (50) is constant: , which means that the game of the original agent and its adversary is in a Nash equilibrium.

Therefore, portfolio optimization in a stochastic environment by a single agent that represents a bounded-rational component of the market as a whole, as is done in our approach using the entropy-regularized free energy, is mathematically equivalent to studying a Nash equilibrium in a two-party game of our agent with an adversarial counter-party with an exponential budget given by the last term in Eq.(49).

### 4.7 G-learning and F-learning

In the RL setting when rewards are observed, the system Eqs.(44, 45, 46) can be reduced to one non-linear equation. Substituting the augmented free energy (44) into Eq.(41), we obtain

$$G_t^\pi(y_t, a_t) = \hat{R}_t(y_t, a_t) + \frac{\gamma}{\beta} \mathbb{E}_{t,a}\left[ \log \sum_{a_{t+1}} \pi_0(a_{t+1} | y_{t+1}) e^{\beta G_{t+1}^\pi(y_{t+1}, a_{t+1})} \right] \tag{51}$$

This equation provides a soft relaxation of the Bellman optimality equation for the action-value Q-function, with the G-function defined in Eq.(41) being an entropy-regularized Q-function [18]. The ”inverse-temperature” parameter $\beta$ in Eq.(51) determines the strength of entropy regularization. In particular, if we take $\beta \to \infty$, we recover the original Bellman optimality equation for the Q-function. Because the last term in (51) approximates the $\max(\cdot)$ function when $\beta$ is large but finite, Eq.(51) is known in the literature as soft Q-learning.
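The two limits of the log-sum-exp term can be checked directly. The sketch below is illustrative: `soft_max_value` is a hypothetical helper, and the G-values and reference probabilities are made-up numbers for a single state with three actions.

```python
import math

def soft_max_value(g, pi0, beta):
    """(1/beta) * log sum_a pi0[a] * exp(beta * g[a]), computed in a numerically stable way."""
    g_max = max(g)
    z = sum(p * math.exp(beta * (gi - g_max)) for p, gi in zip(pi0, g))
    return g_max + math.log(z) / beta

g = [1.0, 2.0, 0.5]            # hypothetical G-values of three actions
pi0 = [0.2, 0.3, 0.5]          # hypothetical reference policy

for beta in (1e-4, 1.0, 10.0, 1000.0):
    print(beta, soft_max_value(g, pi0, beta))
```

For large `beta` the result approaches `max(g) = 2.0`, which is the hard maximum of the Bellman optimality equation, while for small `beta` it approaches the average of `g` under the reference policy, here 1.05.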

For finite values $\beta < \infty$, in a setting of Reinforcement Learning with observed rewards, Eq.(51) can be used to specify G-learning [18]: an off-policy temporal-difference (TD) algorithm that generalizes Q-learning to noisy environments where an entropy-based regularization can be needed. The G-learning algorithm of Ref. [18] was specified in a tabulated setting where both the state and action spaces are finite. In our case, we deal with high-dimensional state and action spaces and, in addition, we do not observe rewards; therefore we are in a setting of Inverse Reinforcement Learning.

Another possible approach is to bypass the G-function (i.e. the entropy-regularized Q-function) altogether, and proceed with the Bellman optimality equation for the free energy F-function (38). In this case, we have a pair of equations for