Mean-variance Markowitz optimization (MVO) (Markowitz) remains one of the most commonly used tools in wealth management. Portfolio objectives in this approach are defined in terms of expected returns and covariances of the assets in the portfolio, which may not be the most natural formulation for retail investors. Indeed, the latter typically seek specific financial goals for their portfolios. For example, a contributor to a retirement plan may demand that the value of their portfolio at the age of their retirement be at least equal to, or preferably larger than, some target value.
Goal-based wealth management offers some valuable perspectives on the optimal structuring of wealth management plans such as retirement plans or target date funds. The motivation for operating in terms of wealth goals can be more intuitive (while still tractable) than the classical formulation in terms of expected excess returns and variances. To see this, let $W_T$ be the final wealth in the portfolio, and let $B$ be a certain target wealth level at the horizon $T$. The goal-based wealth management approach of Browne_1996 and Das_2018
uses the probability $\mathbb{P}\left[W_T \geq B\right]$ of the final wealth being above the target level as an objective for maximization by active portfolio management. This probability is the same as the price of a binary option on the terminal wealth with strike $B$: $\mathbb{P}\left[W_T \geq B\right] = \mathbb{E}\left[\mathbb{1}_{W_T \geq B}\right]$. Instead of a utility of wealth such as, e.g., a power or logarithmic utility, this approach uses the price of this binary option as the objective function. This idea can also be modified by using a call option-like expectation $\mathbb{E}\left[\left(W_T - B\right)_+\right]$ instead of a binary option. Such an expectation quantifies how much the terminal wealth is expected to exceed the target, rather than simply providing the probability of such an event.[1]

[1] The problem of optimal consumption with an investment portfolio is frequently referred to as the Merton consumption problem, after the celebrated work of Robert Merton, who formulated this problem as a continuous-time optimal control problem with log-normal dynamics for asset prices (Merton_1971). As optimization in problems involving cash injections instead of cash withdrawals formally corresponds to a sign change of one-step consumption in the Merton formulation, we can collectively refer to all types of wealth management problems involving injections or withdrawals of funds at intermediate time steps as a generalized Merton consumption problem.
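As a simple illustration of these two objectives, both can be estimated by Monte Carlo under hypothetical log-normal terminal wealth dynamics. All parameter values below are illustrative assumptions and are not taken from any model in this paper.

```python
import numpy as np

# Hypothetical example: terminal wealth W_T is log-normal (illustrative
# parameters only); B is the target wealth level at the horizon T.
rng = np.random.default_rng(0)
W0, mu, sigma, T, B = 100.0, 0.05, 0.15, 10, 150.0

# Simulate terminal wealth under geometric Brownian motion dynamics
Z = rng.standard_normal(100_000)
W_T = W0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)

# Binary-option-like objective: probability of reaching the goal
p_goal = np.mean(W_T >= B)
# Call-option-like objective: expected excess of wealth over the target
excess = np.mean(np.maximum(W_T - B, 0.0))
print(p_goal, excess)
```

The second quantity rewards overshooting the target, while the first only registers whether the goal was reached.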
This treatment of the goal-based utility function can be implemented in a reinforcement learning (RL) framework for discrete-time planning problems. In contrast to the Merton consumption approach, RL requires neither a specific functional form of the utility nor log-normal dynamics for the assets. Thus, in theory, RL can be viewed as a data-driven extension of dynamic programming (SB). In practice, a substantial challenge with the RL framework is the curse of dimensionality: portfolio allocation as a continuous-action-space Markov Decision Process (MDP) requires techniques such as deep Q-learning, or other function approximation methods combined, e.g., with the Least Squares Policy Iteration (LSPI) method (LSPI). The latter has complexity that grows exponentially with the number of stocks in the portfolio, while the former is cumbersome, highly data-intensive, and relies heavily on heuristics for operational efficiency. For more details, see e.g. (DHB).
In this paper, we present G-learning (G-Learning), a probabilistic extension of Q-learning that scales to high-dimensional portfolios while providing a flexible choice of utility functions. To demonstrate the utility of G-learning, we consider a general class of wealth management problems: optimization of a defined contribution retirement plan, where cash is injected (rather than withdrawn) at each time step. In contrast to methods based on a utility of consumption, we adopt a more "RL-native" approach by directly specifying one-step rewards. Such an approach is sufficiently general to capture other possible settings, such as a retirement plan in a decumulation (post-retirement) phase, or target-based wealth management. Previously, G-learning was applied to dynamic portfolio optimization in (IHIF); here we extend this approach to portfolio management involving cashflows at intermediate time steps.
A key step in our formulation is that we define actions as absolute (dollar-valued) changes of asset positions, instead of defining them in fractional terms as in the Merton approach (Merton_1971). This enables a simple transformation of the optimization problem into an unconstrained one, and provides a semi-analytical solution for a particular choice of the reward function. As will be shown below, this approach offers a tractable setting both for the direct reinforcement learning problem of learning the optimal policy that maximizes the total reward, and for its inverse problem, where we observe the actions of a financial agent but not the rewards received by the agent. Inference of the reward function from observations of the states and actions of the agent is the objective of Inverse Reinforcement Learning (IRL). After presenting G-Learner, a G-learning algorithm for the direct RL problem, we introduce GIRL (G-learning IRL), a framework for inference of the rewards of financial agents that are "implied" by their observed behavior. The two practical algorithms, G-Learner and GIRL, can be used either separately or in combination, and we will discuss their potential joint applications for wealth management and robo-advising.
The paper is organized as follows. In Section 2, we introduce G-learning and explain how it generalizes the better-known Q-learning method for reinforcement learning. Section 3 introduces the problem of portfolio optimization for a defined contribution retirement plan. Then in Section 4, we present G-Learner: a G-learning algorithm for portfolio optimization with cash injection and consumption. The GIRL algorithm for performing IRL of financial agents is introduced in Section 5. Section 6 presents the results of our implementation and demonstrates the ability of G-Learner to scale to high-dimensional portfolio optimization problems, and the ability of GIRL to infer the reward function of a G-Learner agent. Section 7 concludes with ideas for future developments in G-learning for wealth management and robo-advising.
In this section, we provide a short but self-contained overview of G-learning as a probabilistic extension of the popular Q-learning method in reinforcement learning. We assume some familiarity with constructs of dynamic programming and reinforcement learning, see e.g. (SB), or (DHB) for a more finance-focused introduction. In particular, we assume that the reader is familiar with the notions of value function, action-value function, and the Bellman optimality equations. Familiarity with Q-learning is desirable but not critical for understanding this section; for the benefit of the informed reader, a short informal summary of the differences is as follows:
Q-learning is an off-policy RL method with a deterministic policy.
G-learning is an off-policy RL method with a stochastic policy. G-learning can be considered an entropy-regularized Q-learning, which may be suitable when working with noisy data. Because G-learning operates with stochastic policies, it amounts to a generative RL model.
2.1 Bellman optimality equation
More formally, let $x_t$ be a state vector for an agent that summarizes the knowledge of the environment that the agent needs in order to perform an action $a_t$ at time step $t$.[2] Let $r(x_t, a_t)$ be a random reward collected by the agent for taking action $a_t$ at time $t$ when the state of the environment is $x_t$. Assume that all actions for future time steps are determined according to a policy $\pi(a_t|x_t)$ which specifies which action to take when the environment is in state $x_t$. We note that the policy can be deterministic as in Q-learning, or stochastic as in G-learning, as we will discuss below.

[2] Here we assume a discrete-time setting where time is measured in terms of an integer-valued number of elementary time steps.
For a given policy $\pi$, the expected value of the cumulative reward with a discount factor $\gamma$, conditioned on the current state $x_t$, defines the value function

$$ V^{\pi}(x_t) = \mathbb{E}_t^{\pi}\left[ \sum_{t'=t}^{T} \gamma^{t'-t}\, r(x_{t'}, a_{t'}) \,\Big|\, x_t \right] \qquad (1) $$
Here $\mathbb{E}_t^{\pi}\left[\cdot\right]$ stands for the expectation over future states and actions, conditioned on the current state $x_t$ and the policy $\pi$.
Let $\pi^{\star}$ be the optimal policy, i.e. the policy that maximizes the total reward. This policy corresponds to the optimal value function, denoted $V^{\star}$. The latter satisfies the Bellman optimality equation (see e.g. (SB))

$$ V^{\star}(x_t) = \max_{a_t} \left\{ \hat{r}(x_t, a_t) + \gamma\, \mathbb{E}_t\left[ V^{\star}(x_{t+1}) \right] \right\} \qquad (2) $$

where $\hat{r}(x_t, a_t)$ denotes the expected one-step reward.
Here $\mathbb{E}_t\left[\cdot\right]$ stands for an expectation conditional on the current state $x_t$ and action $a_t$. The optimal policy can be obtained from $V^{\star}$ as follows:

$$ \pi^{\star}(x_t) = \arg\max_{a_t} \left\{ \hat{r}(x_t, a_t) + \gamma\, \mathbb{E}_t\left[ V^{\star}(x_{t+1}) \right] \right\} \qquad (3) $$
The goal of Reinforcement Learning (RL) is to solve the Bellman optimality equation based on samples of data. Assuming that an optimal value function is found by means of RL, solving for the optimal policy requires solving another optimization problem, as formulated in Eq.(3).
2.2 Entropy-regularized Bellman optimality equation
Let us begin by reformulating the Bellman optimality equation using a Fenchel-type representation:

$$ V^{\star}(x_t) = \max_{\pi(\cdot|x_t) \in \mathcal{P}} \sum_{a_t} \pi(a_t|x_t) \left[ \hat{r}(x_t, a_t) + \gamma\, \mathbb{E}_t\left[ V^{\star}(x_{t+1}) \right] \right] \qquad (4) $$
Here $\mathcal{P}$ denotes the set of all valid distributions. Eq.(4) is equivalent to the original Bellman optimality equation (2), because for any function $F(a_t)$ we have $\max_{\pi(\cdot|x_t) \in \mathcal{P}} \sum_{a_t} \pi(a_t|x_t)\, F(a_t) = \max_{a_t} F(a_t)$. Note that while we use discrete notations for simplicity of presentation, all formulae below can be equivalently expressed in continuous notations by replacing sums with integrals. For brevity, we will denote the expectation $\mathbb{E}\left[\cdot \,|\, x_t, a_t\right]$ as $\mathbb{E}_t\left[\cdot\right]$ in what follows.
The one-step information cost of a learned policy $\pi(a_t|x_t)$ relative to a reference policy $\pi_0(a_t|x_t)$ is defined as follows (G-Learning):

$$ g^{\pi}(x_t, a_t) = \log \frac{\pi(a_t|x_t)}{\pi_0(a_t|x_t)} \qquad (5) $$
Its expectation with respect to the policy $\pi$ is the Kullback-Leibler (KL) divergence of $\pi(\cdot|x_t)$ and $\pi_0(\cdot|x_t)$:

$$ \mathbb{E}_{\pi}\left[ g^{\pi}(x_t, a_t) \right] = \sum_{a_t} \pi(a_t|x_t) \log \frac{\pi(a_t|x_t)}{\pi_0(a_t|x_t)} = KL\left[\pi \,\|\, \pi_0\right](x_t) \qquad (6) $$
The total discounted information cost for a trajectory is defined as follows:

$$ I^{\pi}(x_t) = \sum_{t'=t}^{T} \gamma^{t'-t}\, \mathbb{E}\left[ g^{\pi}(x_{t'}, a_{t'}) \,\Big|\, x_t \right] \qquad (7) $$
The free energy $F^{\pi}(x_t)$ is the entropy-regularized value function:

$$ F^{\pi}(x_t) = V^{\pi}(x_t) - \frac{1}{\beta}\, I^{\pi}(x_t) \qquad (8) $$

where the amount of regularization can be tuned to the level of noise in the data. The regularization parameter $\beta$ in Eq.(8) controls a trade-off between reward optimization and proximity of the optimal policy to the reference policy, and is often referred to as the "inverse temperature" parameter, using the analogy between Eq.(8) and free energy in physics, see e.g. (DHB). The reference policy $\pi_0$ provides a "guiding hand" in the stochastic policy optimization process that we now describe.
A Bellman equation for the free energy function is obtained from Eq.(8):

$$ F^{\pi}(x_t) = \sum_{a_t} \pi(a_t|x_t) \left[ \hat{r}(x_t, a_t) - \frac{1}{\beta}\, g^{\pi}(x_t, a_t) + \gamma\, \mathbb{E}_t\left[ F^{\pi}(x_{t+1}) \right] \right] \qquad (9) $$
For a finite-horizon setting with a terminal reward $r(x_T, a_T)$, Eq.(9) should be supplemented by a terminal condition

$$ F^{\pi}(x_T) = r(x_T, a_T) \qquad (10) $$

where the final action $a_T$ maximizes the terminal reward for the given terminal state $x_T$. Eq.(9) can be viewed as a soft probabilistic relaxation of the Bellman equation for the value function, with the KL information cost penalty (5) serving as a regularization controlled by the inverse temperature $\beta$. In addition to such a regularized value function (free energy), we will next introduce an entropy-regularized Q-function.
2.3 G-function: an entropy-regularized Q-function
Similarly to the action-value function, we define the state-action free energy function $G^{\pi}(x_t, a_t)$ as (G-Learning)

$$ G^{\pi}(x_t, a_t) = \hat{r}(x_t, a_t) + \gamma\, \mathbb{E}\left[ F^{\pi}(x_{t+1}) \,\big|\, x_t, a_t \right] \qquad (11) $$
where in the last equation we used the fact that the first action $a_t$ in the G-function is fixed, and hence the first-step information cost $g^{\pi}(x_t, a_t)$ drops out when we condition on $a_t$.
If we now compare this expression with Eq.(8), we obtain the relation between the G-function and the free energy $F^{\pi}(x_t)$:

$$ F^{\pi}(x_t) = \sum_{a_t} \pi(a_t|x_t) \left[ G^{\pi}(x_t, a_t) - \frac{1}{\beta} \log \frac{\pi(a_t|x_t)}{\pi_0(a_t|x_t)} \right] \qquad (12) $$
This functional is maximized by the following distribution $\pi(a_t|x_t)$:

$$ \pi(a_t|x_t) = \frac{1}{Z_t}\, \pi_0(a_t|x_t)\, e^{\beta G^{\pi}(x_t, a_t)}, \qquad Z_t = \sum_{a_t} \pi_0(a_t|x_t)\, e^{\beta G^{\pi}(x_t, a_t)} \qquad (13) $$

Substituting this distribution back into Eq.(12) expresses the free energy as a log-partition function:

$$ F^{\pi}(x_t) = \frac{1}{\beta} \log Z_t \qquad (14) $$
Using Eq.(14), the optimal action policy can be written as follows:

$$ \pi(a_t|x_t) = \pi_0(a_t|x_t)\, e^{\beta \left( G^{\pi}(x_t, a_t) - F^{\pi}(x_t) \right)} \qquad (15) $$
Eqs.(11), (14) and (15) constitute a system of equations for G-learning (G-Learning) that should be solved self-consistently for $\pi(a_t|x_t)$, $G^{\pi}(x_t, a_t)$ and $F^{\pi}(x_t)$ by backward recursion for $t = T-1, \ldots, 0$, with terminal conditions given by the terminal reward, cf. Eq.(10).
We will next show how G-learning can be implemented in the context of (direct) reinforcement learning.
Substituting Eq.(14) into Eq.(11) produces a self-consistent equation for the G-function:

$$ G^{\pi}(x_t, a_t) = \hat{r}(x_t, a_t) + \frac{\gamma}{\beta}\, \mathbb{E}_t\left[ \log \sum_{a_{t+1}} \pi_0(a_{t+1}|x_{t+1})\, e^{\beta G^{\pi}(x_{t+1}, a_{t+1})} \right] \qquad (18) $$

This equation provides a soft relaxation of the Bellman optimality equation for the action-value Q-function, with the G-function defined in Eq.(11) being an entropy-regularized Q-function (G-Learning). The "inverse-temperature" parameter $\beta$ in Eq.(18) determines the strength of the entropy regularization. In particular, if we take the "zero-temperature" limit $\beta \to \infty$, we recover the original Bellman optimality equation for the Q-function. Because the last term in (18) approximates the $\max_{a_{t+1}}$ operation when $\beta$ is large but finite, for the particular choice of a uniform reference distribution $\pi_0$, Eq.(18) is known in the literature as "soft Q-learning".
For finite values $\beta < \infty$, in a setting of Reinforcement Learning with observed rewards, Eq.(18) can be used to specify G-learning (G-Learning): an off-policy temporal-difference (TD) algorithm that generalizes Q-learning to noisy environments where an entropy-based regularization is appropriate.
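The logsumexp structure of Eq.(18) and its zero-temperature limit can be illustrated numerically for a discrete action set. The G-values below are hypothetical; the formulas follow the policy and free-energy expressions above.

```python
import numpy as np

# Hypothetical G-values for 4 discrete actions in some state (illustrative)
G = np.array([1.0, 2.5, 2.4, 0.5])
pi0 = np.ones(4) / 4          # uniform reference (prior) policy

def policy_and_free_energy(G, pi0, beta):
    """pi(a|x) ~ pi0(a|x) exp(beta G(x,a)); F = (1/beta) log Z."""
    logits = np.log(pi0) + beta * G
    m = logits.max()
    Z = np.exp(logits - m).sum()          # stabilized partition function
    pi = np.exp(logits - m) / Z
    F = (m + np.log(Z)) / beta
    return pi, F

pi_soft, F_soft = policy_and_free_energy(G, pi0, beta=1.0)
pi_hard, F_hard = policy_and_free_energy(G, pi0, beta=100.0)

# In the zero-temperature limit beta -> infinity, the stochastic policy
# collapses onto argmax_a G, and F approaches max_a G (Q-learning limit).
print(pi_soft, F_soft)
print(pi_hard.round(3), F_hard)
```

For moderate beta the policy spreads probability over near-optimal actions, which is the entropy-regularization effect discussed above.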
The G-learning algorithm of G-Learning was specified in a tabular setting where both the state and action spaces are finite. In our case, we model MDPs in high-dimensional continuous state and action spaces. Accordingly, we cannot rely on tabular G-learning, and need to either specify a functional form of the action-value function, or use a non-parametric function approximation such as a neural network to represent its values. An additional challenge is to compute the multidimensional integral (or sum) over all next-step actions in Eq.(18). Unless a tractable parameterization is used for $\pi_0$ and $G^{\pi}$, repeated numerical evaluation of this integral can substantially slow down the learning.
To summarize, G-learning is an off-policy, generative reinforcement learning algorithm with a stochastic policy. In contrast to Q-learning, which produces deterministic policies, G-learning generally produces stochastic policies, while the deterministic Q-learning policies are recovered in the zero-temperature limit $\beta \to \infty$. In the next section, we will build an approach to goal-based wealth management based on G-learning. Later in this paper, we will also consider applications of G-learning to Inverse Reinforcement Learning (IRL).
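As a minimal sketch of the backward recursion described above, the following implements tabular finite-horizon G-learning on a toy two-state, two-action MDP. All rewards, transition probabilities, and parameters are illustrative assumptions, not part of the paper's model.

```python
import numpy as np

# Toy finite-horizon MDP (all numbers illustrative): 2 states, 2 actions
nS, nA, T, gamma, beta = 2, 2, 5, 0.9, 2.0
r = np.array([[1.0, 0.0],                 # r[s, a]: one-step rewards
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # P[s, a, s']: transition probs
              [[0.5, 0.5], [0.1, 0.9]]])
pi0 = np.full((nS, nA), 0.5)              # uniform reference policy

def softmax_F(G):
    """F(s) = (1/beta) log sum_a pi0(a|s) exp(beta G(s,a)), stabilized."""
    m = G.max(axis=1, keepdims=True)
    return m[:, 0] + np.log((pi0 * np.exp(beta * (G - m))).sum(axis=1)) / beta

# Backward recursion; terminal condition: G_T = r (no continuation value)
G = r.copy()
F = softmax_F(G)
for t in range(T - 1, -1, -1):
    G = r + gamma * P @ F          # G_t(s,a) = r(s,a) + gamma * E[F_{t+1}]
    F = softmax_F(G)

# Optimal stochastic policy at t = 0: pi = pi0 * exp(beta * (G - F))
pi = pi0 * np.exp(beta * (G - F[:, None]))
print(F, pi)
```

The policy rows sum to one by construction, and state 1, whose high-reward action tends to keep the agent in state 1, ends up with the larger free energy.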
3 Portfolio optimization for a defined contribution retirement plan
Let us begin by considering a simplified model for retirement planning. We assume a discrete-time process with $T$ steps, so that $T$ is the (integer-valued) time horizon. The investor/planner keeps the wealth in $N$ assets, with $x_t$ being the vector of dollar values of positions in the different assets at time $t$, and $u_t$ being the vector of changes in these positions. We assume that the first asset ($n = 1$) is a risk-free bond, and the other assets are risky, with uncertain returns whose expected values are $\bar{r}_t$. The covariance matrix of returns is denoted $\Sigma_r$.
Optimization of a retirement plan involves optimizing both regular contributions to the plan and asset allocations. Let $c_t$ be a cash installment into the plan at time $t$. The pair $(c_t, u_t)$ can thus be considered the action variables in a dynamic optimization problem corresponding to the retirement plan.
We assume that at each time step $t$, there is a pre-specified target value $\hat{P}_{t+1}$ of the portfolio at time $t+1$. We assume that the target value at step $t+1$ exceeds the next-step value $V_{t+1}$ of the portfolio, and we seek to impose a penalty for under-performance relative to this target. To this end, we can consider the following expected reward for time step $t$:

$$ R_t(x_t, u_t) = - c_t - \lambda\, \mathbb{E}_t\left[ \left( \hat{P}_{t+1} - V_{t+1} \right)_+ \right] - u_t^T \Omega\, u_t \qquad (19) $$
Here the first term is due to an installment of amount $c_t$ at the beginning of time period $t$, the second term is the expected negative reward from the end of the period for under-performance relative to the target, and the third term approximates transaction costs by a convex functional with a parameter matrix $\Omega$, and serves as a regularization.
The one-step reward (19) is inconvenient to work with due to the rectified non-linearity under the expectation. Another problem is that the decision variables $c_t$ and $u_t$ are not independent, but rather satisfy the following constraint:

$$ c_t = \sum_{n=1}^{N} u_{tn} = \mathbf{1}^T u_t \qquad (20) $$

which simply means that at every time step, the total change in all positions should equal the cash installment at this time.
Replacing the rectified penalty in Eq.(19) by a quadratic one, and using Eq.(20) to eliminate the installment $c_t$, we obtain a new reward function:

$$ R_t(x_t, u_t) = - \mathbf{1}^T u_t - \lambda\, \mathbb{E}_t\left[ \left( \hat{P}_{t+1} - V_{t+1} \right)^2 \right] - u_t^T \Omega\, u_t \qquad (21) $$

where $V_{t+1} = \left(1 + r_t\right)^T \left( x_t + u_t \right)$ is the next-period portfolio value. The new reward function (21) is attractive on two counts. First, it explicitly resolves the constraint (20) between the cash injection and portfolio allocation decisions, and thus converts the initial constrained optimization problem into an unconstrained one. We remind the reader that this differs from the Merton model, where allocation variables are defined as fractions of the total wealth, and thus are constrained by construction. The approach based on dollar-measured actions both reduces the dimensionality of the optimization problem and makes it unconstrained. Once the unconstrained optimization problem is solved, the optimal contribution at time $t$ can be obtained from Eq.(20).
The second attractive feature of the reward (21) is that it is quadratic in the actions $u_t$, and is therefore highly tractable. On the other hand, the well-known disadvantage of quadratic rewards (penalties) is that they are symmetric, penalizing both the scenarios $V_{t+1} > \hat{P}_{t+1}$ and $V_{t+1} < \hat{P}_{t+1}$, while in fact we only want to penalize the second class of scenarios. To mitigate this drawback, we can consider target values that are considerably higher than the time-$t$ expectation of the next-period portfolio value. For example, one simple choice is to set the target portfolio as a linear combination of a portfolio-independent benchmark $B_{t+1}$ and the current portfolio growing at a fixed rate $\eta$:

$$ \hat{P}_{t+1} = \rho B_{t+1} + \left(1 - \rho\right)\left(1 + \eta\right) \mathbf{1}^T x_t \qquad (22) $$
where $\rho \in [0, 1]$ is a relative weight of the portfolio-independent and portfolio-dependent terms, and $\eta$ is a parameter that defines the desired growth rate of the current portfolio, whose value is $\mathbf{1}^T x_t$. For sufficiently large values of $B_{t+1}$ and $\eta$, such a target portfolio would be well above the current portfolio at all times, and thus would serve as a reasonable proxy to the asymmetric measure (19). The advantage of such a parameterization of the target portfolio is that both the "desired growth" parameter $\eta$ and the mixture parameter $\rho$ can be learned from the observed behavior of a financial agent in the setting of Inverse Reinforcement Learning (IRL), as we will discuss in Sec. 5. In what follows, we use Eq.(22) as our specification of the target portfolio.
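A one-step evaluation of the target (22) and the quadratic reward (21) can be sketched as follows. Positions, returns, and parameters are illustrative assumptions, and the expectation in (21) is replaced by a single realized return for simplicity.

```python
import numpy as np

N = 4                                   # assets: 1 risk-free bond + 3 risky
x = np.array([50.0, 20.0, 20.0, 10.0])  # dollar positions (illustrative)
u = np.array([5.0, 1.0, -2.0, 1.0])     # dollar position changes (action)
r = np.array([0.02, 0.05, 0.04, 0.06])  # one-period returns (illustrative)
lam, Omega = 0.001, 0.01 * np.eye(N)    # risk-aversion and transaction costs
rho, eta, B_next = 0.3, 0.06, 120.0     # target mix, growth rate, benchmark

# Eq.(22): target as a mix of a fixed benchmark and the grown current portfolio
P_hat = rho * B_next + (1 - rho) * (1 + eta) * x.sum()

# Quadratic one-step reward: cash installment cost c_t = 1^T u_t (Eq.(20)),
# squared deviation from the target, and convex transaction costs
c = u.sum()
V_next = (x + u) @ (1 + r)
reward = -c - lam * (P_hat - V_next) ** 2 - u @ Omega @ u
print(P_hat, V_next, reward)
```

Because the target sits above the expected next-period value, the quadratic penalty mostly punishes shortfalls, as intended.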
We note that a quadratic loss specification relative to a time-dependent target wealth level is a popular choice in the recent literature on wealth management. One example is provided by Lin_Zeng_2019, who develop a dynamic optimization approach with a similar squared loss function for a defined contribution retirement plan. A similar approach, which relies on a direct specification of a reward based on a target portfolio level, is known as "goal-based wealth management" (Browne_1996; Das_2018).
The squared loss reward specification is very convenient, as it allows one to construct optimal policies semi-analytically. Here we will demonstrate how to build a semi-analytical scheme for computing optimal stochastic consumption-investment policies for a retirement plan; the method is sufficiently general to cover either an accumulation or a decumulation phase. For other specifications of rewards, numerical optimization and function approximations (e.g. neural networks) would be required.
The expected reward (21) can be written in a more explicit quadratic form if we decompose asset returns as $r_t = \left( r_f, \tilde{r}_t \right)$, where the first component $r_f$ is the risk-free rate (as the first asset is risk-free), and $\tilde{r}_t = \bar{r}_t + \varepsilon_t$, where $\varepsilon_t$ is an idiosyncratic noise with covariance $\Sigma_r$. Substituting this expression in Eq.(21), we obtain an expected reward that is an explicit quadratic function of $x_t$ and $u_t$.
Assuming that the expected returns $\bar{r}_t$, the covariance matrix $\Sigma_r$ and the benchmark $B_t$ are fixed, the vector of free parameters defining the reward function is thus $\left( \lambda, \Omega, \rho, \eta \right)$.
4 G-learner for retirement plan optimization
To solve the optimization problem, we use a semi-analytical formulation of G-learning with Gaussian time-varying policies (GTVP). In what follows, we will refer to our specific algorithm implementing G-learning with our model specifications as the G-Learner algorithm, to differentiate our model from more general models that could potentially be constructed using G-learning as a general RL method.
We start by specifying a functional form of the value function as a quadratic form of $x_t$:

$$ F_t^{\pi}(x_t) = x_t^T \mathbf{F}_t^{(xx)} x_t + x_t^T \mathbf{F}_t^{(x)} + \mathbf{F}_t^{(0)} \qquad (24) $$

where $\mathbf{F}_t^{(xx)}$, $\mathbf{F}_t^{(x)}$ and $\mathbf{F}_t^{(0)}$ are parameters that can depend on time via their dependence on the target values and the expected returns $\bar{r}_t$. The dynamic equation takes the form:

$$ x_{t+1} = \left( 1 + r_t \right) \circ \left( x_t + u_t \right) \qquad (25) $$

where $\circ$ denotes the element-wise (Hadamard) product.
Note that the only features used here are the expected asset returns $\bar{r}_t$ for the current period. We assume that the expected asset returns are available as an output of a separate statistical model using, e.g., a factor model framework. The present formalism is agnostic to the choice of the expected return model.
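A minimal numerical sketch of one step of the dollar-position dynamics, assuming the multiplicative form $x_{t+1} = (1 + r_t) \circ (x_t + u_t)$ used above (all numbers are illustrative):

```python
import numpy as np

# One step of the dollar-position dynamics x_{t+1} = (1 + r_t) o (x_t + u_t):
# positions are adjusted by u_t, then grow with the realized returns r_t.
x_t = np.array([50.0, 30.0, 20.0])      # illustrative dollar positions
u_t = np.array([2.0, -1.0, 4.0])        # position changes (the action)
r_t = np.array([0.02, 0.07, -0.03])     # realized one-period asset returns

x_next = (1.0 + r_t) * (x_t + u_t)      # element-wise (Hadamard) product
print(x_next)
```

Working in dollar positions keeps the dynamics linear in both the state and the action, which is what makes the quadratic value-function ansatz self-consistent.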
where we have defined the following auxiliary quantities:
Note that the optimal action is a linear function of the state. Another interesting point is that the last term describing convex transaction costs in Eq.(23) produces a regularization of the matrix inversion in Eq.(26).
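The regularizing role of the transaction-cost matrix $\Omega$ can be illustrated on a generic quadratic objective of the type encountered here; the matrices below are illustrative and are not the model's exact coefficients.

```python
import numpy as np

# Maximizing a quadratic objective -0.5 u^T A u + u^T b over the action u
# gives u* = A^{-1} b. Adding a convex transaction cost -u^T Omega u shifts
# A to A + 2*Omega, which improves the conditioning of the matrix inversion.
A = np.array([[1.0, 0.999],            # nearly singular "risk" matrix
              [0.999, 1.0]])           # (illustrative values)
b = np.array([1.0, 1.0])
Omega = 0.05 * np.eye(2)               # convex transaction-cost matrix

u_raw = np.linalg.solve(A, b)
u_reg = np.linalg.solve(A + 2 * Omega, b)

print(np.linalg.cond(A), np.linalg.cond(A + 2 * Omega))
print(u_raw, u_reg)
```

The regularized solve is both numerically more stable and produces smaller trades, which is exactly the economic effect of penalizing turnover.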
As for the last time step we have $F_T^{\pi}(x_T) = \max_{u_T} R_T(x_T, u_T)$, the coefficients can be computed by plugging Eq.(26) back into Eq.(23), and comparing the result with Eq.(24) at $t = T$. This provides terminal conditions for the parameters in Eq.(24):
For an arbitrary time step $t$, we use Eq.(25) to compute the conditional expectation of the next-period F-function in the Bellman equation as follows:
where the bar denotes the time-$t$ conditional expectation of the corresponding quantity, and similarly for the remaining terms. This is a quadratic function of $x_t$ and $u_t$, and has the same structure as the quadratic reward in Eq.(23). Plugging both expressions into the Bellman equation,
we see that the action-value function should also be a quadratic function of $x_t$ and $u_t$:
After the action-value function is computed, what remains is to compute the F-function for the current step:

$$ F_t^{\pi}(x_t) = \frac{1}{\beta} \log \int \pi_0(u_t|x_t)\, e^{\beta G_t^{\pi}(x_t, u_t)}\, du_t \qquad (32) $$
The reference policy is Gaussian:

$$ \pi_0(u_t|x_t) = \mathcal{N}\left( u_t \,\middle|\, \hat{u}_t, \Sigma_p \right) $$

where the mean value $\hat{u}_t$ is a linear function of the state $x_t$:

$$ \hat{u}_t = \bar{u}_t + \bar{v}_t x_t $$
Integration over $u_t$ in Eq.(32) is performed analytically using the well-known $n$-dimensional Gaussian integration formula

$$ \int e^{-\frac{1}{2} x^T A x + x^T B}\, d^n x = \sqrt{\frac{(2\pi)^n}{\left| A \right|}}\; e^{\frac{1}{2} B^T A^{-1} B} $$

where $\left| A \right|$ denotes the determinant of the matrix $A$.
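The Gaussian integration formula can be verified numerically in low dimension by brute-force quadrature; $A$ and $B$ below are arbitrary illustrative choices.

```python
import numpy as np

# Verify: integral of exp(-0.5 x^T A x + x^T B) over R^n equals
# sqrt((2*pi)^n / det(A)) * exp(0.5 B^T A^{-1} B), here for n = 2.
A = np.array([[2.0, 0.5],
              [0.5, 1.5]])             # symmetric positive-definite
B = np.array([0.3, -0.2])

closed_form = np.sqrt((2 * np.pi) ** 2 / np.linalg.det(A)) \
    * np.exp(0.5 * B @ np.linalg.solve(A, B))

# Brute-force 2D quadrature on a grid (the integrand decays fast)
h = 0.02
g = np.arange(-8.0, 8.0, h)
X, Y = np.meshgrid(g, g, indexing="ij")
quad = X * (A[0, 0] * X + A[0, 1] * Y) + Y * (A[1, 0] * X + A[1, 1] * Y)
integrand = np.exp(-0.5 * quad + B[0] * X + B[1] * Y)
numeric = integrand.sum() * h * h
print(closed_form, numeric)
```

The grid sum matches the closed form to high accuracy, since the Gaussian integrand is negligible outside the truncated domain.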
Note that, unlike in the Merton approach (Merton_1971) or in traditional Markowitz portfolio optimization (Markowitz), here we work with unconstrained variables $u_t$ that do not have to sum to one, and therefore an unconstrained multivariate Gaussian integration readily applies. Remarkably, this implies that once the decision variables are chosen appropriately, portfolio optimization for wealth management tasks may, in a sense, be an easier problem than portfolio optimization that does not involve intermediate cashflows and is often formulated using self-financing conditions.
Performing the Gaussian integration and comparing the resulting expression with Eq.(24), we obtain for its coefficients:
where we use the auxiliary parameters
The optimal policy for the given step is given by
Using the quadratic action-value function (30) here produces a new Gaussian policy $\pi(u_t|x_t)$: