1 Introduction
Quantitative asset management methods typically predict the distribution of future asset returns with a parametric model, which is then used as input by the decision models that construct the portfolio of asset holdings. This two-stage “predict-then-optimize” design, though intuitively appealing, is effective only under ideal conditions: when market conditions are stationary and there is sufficient data, the resulting portfolios perform well. However, in practice, predictions are often unreliable, and consequently, the resulting portfolios have poor out-of-sample performance (Chopra and Ziemba, 1993; Merton, 1980; Best and Grauer, 1991; Broadie, 1993). There are two mitigation strategies: add a measure of risk to control variability, and allow for model robustness. However, these approaches involve unknown parameters that are hard to set in practice (Bertsimas et al., 2018).
In decision-making systems, we are concerned with decision errors instead of just prediction errors. Although it is well known that splitting the prediction and optimization tasks is not optimal for mitigating decision errors, it is only recently that the two steps have been combined into a “smart predict-then-optimize” framework where parameters are chosen to optimize the performance of the corresponding decision (Elmachtoub et al., 2020). This framework has been implemented as an end-to-end system where one goes directly from data to decision by combining the prediction and decision layers, and backpropagating the gradient information through both layers during training (Donti et al., 2017). In the context of portfolio construction, such an end-to-end system allows us to learn the parameters of a given parametric prediction model, improving the performance of a given fixed portfolio selection problem. However, these end-to-end systems cannot accommodate robust or distributionally robust decision layers that provide robustness with respect to the prediction model, nor can they accommodate optimization layers with learnable parameters.
We show that the end-to-end approach can be successfully extended to settings where the decision is chosen by solving a distributionally robust optimization problem. Introducing robustness regularizes the decision and improves its out-of-sample performance. This is a natural next step in the evolution of this approach. Although we use portfolio selection as a test case for the robust end-to-end framework, since predictions, decisions and uncertainty are especially important in this problem, it will be clear from our model development that the approach itself can be applied to any robust decision problem.
1.1 Contributions
Our main contribution is to show how to accommodate model robustness within an end-to-end system in a tractable and intuitive fashion. The specific contributions of this paper are as follows (see also Figure 1):

We propose an end-to-end portfolio construction system where the decision layer is a distributionally robust (DR) optimization problem (see the ‘Decision layer’ box in Figure 1). We show how to integrate the DR layer with any prediction layer.

The DR optimization problem requires both the point prediction and the prediction errors as inputs in order to quantify and control the model risk. Therefore, unlike standard end-to-end systems, we provide both the point prediction and a set of past prediction errors as inputs to the decision layer (see the ‘Prediction layer’ box in Figure 1).

The DR layer is formulated as a minimax optimization problem in which the objective function combines the mean loss with a worst-case risk term that captures the impact of both variability and model error. The worst-case risk is taken over a set of probability measures within a given “distance”
of the empirical measure. We show that, by using convex duality, the minimax problem can be recast as a minimization problem. In this form, the gradient of the overall task loss can be backpropagated through the DR decision layer to the prediction layer. Moreover, this means the end-to-end system remains computationally tractable and can be efficiently solved using existing software. This result is of independent interest for embedding any minimax (or maximin) decision layer into an end-to-end system.
We show that the parameters that control the risk appetite (γ) and model robustness (δ) can be learned directly from data in our proposed DR end-to-end system. Learning γ is relatively straightforward. However, showing that δ can be learned in the end-to-end system is nontrivial, and requires the use of convex duality. This step is very important because setting appropriate values for these parameters is, in practice, difficult and computationally expensive (Bertsimas et al., 2018).

We have implemented our proposed DR end-to-end system for portfolio construction in Python. Source code is available at https://github.com/IyengarLab.
1.2 Background and related work
Recently, there has been increasing interest in end-to-end learning systems (Bengio, 1997; LeCun et al., 2005; Thomas et al., 2006; Elmachtoub and Grigas, 2021). The goal of an end-to-end system is to integrate a prediction layer with the downstream decision layer so that one estimates parameter values that minimize the decision error (known as the ‘task loss’), instead of simply minimizing the prediction error. The main technical challenge in such systems is to backpropagate the gradient of the task loss through the decision layer onto the prediction layer (Amos and Kolter, 2017). These systems come in two varieties: model-free and model-based. Model-free methods follow a ‘black-box’ approach, and have found some success in portfolio construction (Uysal et al., 2021; Zhang et al., 2020). However, model-free methods are often data-inefficient during training. On the other hand, model-based methods rely on some predefined structure of their environment before model training can take place. Model-based methods have the advantage of retaining some level of interpretability while also being more data-efficient during training (Amos et al., 2018).
Model-based end-to-end systems have found successful applications in asset management and portfolio construction. In addition to their model-free system, Uysal et al. (2021) also propose a model-based risk budgeting portfolio construction model. Butler and Kwon (2021) propose an end-to-end mean–variance optimization model. Zhang et al. (2021) integrate a convex optimization layer with a deep learning prediction layer for portfolio construction.
As a standalone tool, portfolio optimization has been criticized for its sensitivity to model and parameter errors, which leads to poor out-of-sample performance (Merton, 1980; Best and Grauer, 1991; Chopra and Ziemba, 1993; Broadie, 1993). Robust optimization methods, which explicitly model perturbations in the parameters of an optimization problem and choose decisions assuming the worst-case behavior of the parameters, have been successfully employed to improve the performance of portfolio selection (e.g., see Goldfarb and Iyengar, 2003; Tütüncü and Koenig, 2004; Fabozzi et al., 2007; Costa and Kwon, 2020). The robust optimization approach was subsequently extended to distributionally robust optimization (DRO) (Scarf, 1958; Delage and Ye, 2010; Ben-Tal et al., 2013), where the parameters of an optimization problem are distributed according to a probability measure that belongs to a given ambiguity set. The DRO problem can be interpreted as a game between the decision-maker, who chooses an action to minimize cost, and an adversary (i.e., nature), who chooses a parameter distribution that maximizes the cost (Von Neumann, 1928). The DRO approach has been implemented for portfolio optimization (e.g., see Calafiore, 2007; Delage and Ye, 2010; Costa and Kwon, 2021).
We propose a model-based end-to-end portfolio selection framework where the decision layer is a DR portfolio selection problem. In keeping with existing end-to-end systems, we allow the prediction layer to be any differentiable supervised learning model, ranging from simple linear models to deep neural networks. However, unlike existing end-to-end frameworks, the decision layer is a minimax problem. We show how to use the errors from the prediction layer to construct this minimax problem. It is well known that a differentiable optimization problem can be embedded into the architecture of a neural network, allowing gradient-based information to be communicated between the prediction and decision layers (Amos and Kolter, 2017; Donti et al., 2017; Agrawal et al., 2019; Amos, 2019). Using convex duality, we extend this result to minimax problems, and show how to communicate the gradient information from the DR decision layer back to the prediction layer.
Finally, we note that the task loss function that guides the training of an end-to-end system need not be the same as the objective function in the decision layer. The discrepancy between these two functions may stem from different reasons. The decision layer may be designed as a computationally tractable surrogate for the task loss function. Alternatively, we may choose a task loss function that directs the system’s training towards some desirable out-of-sample reward that cannot be explicitly embedded into the decision layer. For example, Uysal et al. (2021) present a decision layer designed to diversify financial risk; however, their system’s task loss function emphasizes financial return. In such cases, the end-to-end system is fundamentally similar to a reinforcement learning problem.
2 End-to-end portfolio construction
In this section, we present our proposed DR end-to-end portfolio construction system and describe each individual component of the system. To allow for a natural progression, the structure of this section follows the ‘forward pass’ of Figure 1, starting with a discussion of the prediction layer, followed by the DR decision layer, and finally the task loss function. We conclude by presenting the complete DR end-to-end algorithm.
2.1 Prediction layer
We consider portfolio selection in discrete time. Denote the present time as t and let x_t denote the vector of financial factors (i.e., predictive features) observed at time t. Using these factors, we want to predict the random return y_{t+1} on the n assets over the period (t, t+1]. Let X denote the historical time series of financial factors and let Y denote the historical time series of returns on the n assets, with T time steps.
Suppose we have access to the features x_t. A prediction model that maps x_t to the prediction ŷ_{t+1} of the expected return is assumed to be a differentiable function of the parameter θ; otherwise the model may be as simple or as complex as required. An illustrative example is the linear model ŷ_{t+1} = θ x_t, where ŷ_{t+1} is the prediction and θ is the matrix of weights for this specific model. Note that the dimensions of θ will change depending on the structure of the prediction model.
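As a concrete illustration, the linear prediction model above can be sketched as a matrix–vector product. This is a minimal sketch: the toy dimensions and the weight values below are placeholders, not fitted values from the paper.

```python
# Minimal sketch of a linear prediction layer: y_hat = theta @ x_t.
# The weights and dimensions below are illustrative placeholders.

def linear_predict(theta, x):
    """Return predicted asset returns for one factor observation.

    theta: n x m weight matrix (list of rows), x: length-m factor vector.
    """
    return [sum(w * f for w, f in zip(row, x)) for row in theta]

n_assets, n_factors = 3, 2           # small toy sizes for the example
theta = [[0.5, -0.1],
         [0.2,  0.3],
         [0.0,  0.4]]
x_t = [0.01, -0.02]                  # factor returns observed at time t
y_hat = linear_predict(theta, x_t)   # prediction of next-period returns
```

In the end-to-end system, θ is exactly what the backward pass updates; a deep network simply replaces `linear_predict` with a more expressive differentiable map.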
Let ε_t = y_t − ŷ_t denote the prediction error. Note that the prediction error is a combination of stochastic noise (i.e., variance) and model risk. We assume that the set of past prediction errors are IID samples of ε. This sample set of prediction errors will be used to introduce distributional robustness in the decision layer.
Traditionally, prediction models are trained by minimizing a prediction loss function to improve predictive accuracy. However, an endtoend system is concerned with minimizing the task loss rather than the prediction loss. In our case, the task loss corresponds to some measure of outofsample portfolio performance, which we will discuss in Section 2.3. For now, we will focus on how to use the set of prediction errors to introduce distributional robustness into the decision layer.
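The sample set of past prediction errors described above can be assembled by differencing realized returns and stored predictions over a trailing window. A minimal sketch (toy numbers; the window length `J` here is illustrative, not the value used in the experiments):

```python
# Build the trailing window of past prediction errors eps_j = y_j - y_hat_j
# that the decision layer consumes. J is a toy window length here.

def prediction_errors(realized, predicted, J):
    """Return the last J elementwise prediction-error vectors."""
    errs = [[y - yh for y, yh in zip(ys, yhs)]
            for ys, yhs in zip(realized, predicted)]
    return errs[-J:]

realized  = [[0.01, 0.02], [0.00, -0.01], [0.03, 0.01]]   # y_t, by week
predicted = [[0.02, 0.01], [0.01,  0.00], [0.01, 0.02]]   # y_hat_t, by week
errors = prediction_errors(realized, predicted, J=2)      # last two periods
```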
2.2 Decision layer
A feedforward neural network is trained by iterating over ‘forward’ and ‘backward’ passes. During the ‘forward pass’, the network is evaluated using the current set of weights. This is followed by the ‘backward pass’, where the gradient of the loss function is computed and propagated backwards through the layers of the neural network in order to update the weights.
In existing end-to-end systems, during the forward pass, the decision layer is treated as a standard convex minimization problem. On the other hand, the backward pass requires that we differentiate through the ‘argmin’ operator (Donti et al., 2017). In general, the solutions of optimization problems cannot be written as explicit functions of the input parameters, i.e., they generally do not admit a closed-form solution that we can differentiate. However, it is possible to differentiate through a convex optimization problem by implicitly differentiating its optimality conditions, provided that some regularity conditions are satisfied (Amos and Kolter, 2017; Agrawal et al., 2019).
First, we adapt the existing end-to-end system to solve a portfolio selection problem involving risk measures. This extension requires us to work with the set of prediction errors in addition to the prediction ŷ_{t+1} corresponding to the factor vector x_t. Next, we show how to extend the methodology to distributionally robust portfolio selection.
A portfolio is a vector of asset weights z. In order to keep the exposition simple, the set of admissible portfolios is
Z = { z : 1ᵀz = 1, z ≥ 0 }.
The equality constraint in Z is the budget constraint: it ensures that the entirety of our available budget is invested, while the nonnegativity constraint disallows the short selling of financial assets. Our methodology extends to sets defined by limits on industry/sector exposures and leverage constraints.
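A quick feasibility check for the admissible set described above (budget and no-short-selling constraints) can be written in a few lines; the tolerance is an arbitrary choice for this sketch:

```python
def is_admissible(z, tol=1e-9):
    """Check the budget constraint (weights sum to one) and the
    no-short-selling constraint (all weights nonnegative)."""
    return abs(sum(z) - 1.0) <= tol and all(w >= -tol for w in z)

assert is_admissible([0.25, 0.25, 0.5])
assert not is_admissible([0.8, 0.4, -0.2])   # short position not allowed
```

Extending the set with sector-exposure or leverage limits would add further linear inequalities to this check.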
Suppose at time t we have access to the factors x_t but not to the realized asset returns y_{t+1} over the period (t, t+1]. We approximate the expected return by the output ŷ_{t+1} of the prediction layer. Next, we characterize the variability in the portfolio return due to both stochastic noise (i.e., variance) and model risk. Given that we allow the prediction layer to have any general form, we avoid attempting to measure the parametric uncertainty associated with the prediction layer weights. Instead, we take a data-driven approach and estimate the combined effect of variance and model risk directly from a sample set of past prediction errors. We quantify the ‘risk’ associated with the portfolio by a deviation risk measure (Rockafellar et al., 2006), defined below.
Proposition 2.1 (Deviation risk measure).
Let E = {ε_1, …, ε_J} denote the finite set of prediction error outcomes and let z denote a fixed portfolio. Suppose g is a closed convex function with g(0) = 0 and g ≥ 0. Let p denote any probability mass function (PMF) in the probability simplex.
Then, the deviation risk measure associated with the set of outcomes E, the portfolio z and the PMF p is given by
ρ_p(z) = min_c Σ_{j=1}^J p_j · g(ε_jᵀz − c).   (1)
The deviation risk measure has the following properties.

ρ_p(z) is convex in z for any fixed p.

ρ_p(z) ≥ 0 for all z and p.

ρ_p(z) is shift-invariant with respect to constant shifts of the outcomes.

ρ_p(z) is symmetric, i.e., ρ_p(z) = ρ_p(−z).
Proof.
See Appendix A. ∎
Since our ‘deviation risk measure’ pertains to the prediction error rather than financial risk, we use a broader definition of the risk measure as compared to traditional financial risk measures (e.g., see Rockafellar et al., 2006).
The centering parameter c plays an important role. It is crucial for the shift-invariance property (3), and this property in turn implies that the deviation risk associated with the outcomes is the same as that associated with the mean-adjusted outcomes. Thus, the deviation risk measure is really a function of the deviations around the mean. When g(s) = s², the associated deviation risk measure
is the variance, where the optimal c is the mean of the portfolio errors (see Appendix B for details). Later, we introduce the worst-case deviation risk by taking the maximum over the PMF p. In that setting, the centering parameter ensures that the adversary cannot increase risk by putting all the weight on the worst outcome.
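The variance case can be checked numerically: with the square function, the inner minimization over the centering parameter is solved by the mean of the portfolio errors, and the resulting risk is their (population) variance. A small sketch under the uniform PMF, with toy error data:

```python
# Deviation risk measure with g(s) = s^2 under the uniform PMF:
# rho(z) = min_c (1/J) * sum_j (eps_j . z - c)^2, minimized at c = mean,
# which recovers the population variance of the portfolio errors.

def deviation_risk_sq(errors, z):
    u = [sum(e * w for e, w in zip(eps, z)) for eps in errors]  # eps_j . z
    c = sum(u) / len(u)                      # optimal centering parameter
    return sum((v - c) ** 2 for v in u) / len(u)

errors = [[0.02, -0.01], [-0.01, 0.03], [0.00, 0.01], [0.01, -0.02]]
z = [0.6, 0.4]
risk = deviation_risk_sq(errors, z)
```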
2.2.1 Nominal layer
We start with the assumption that every outcome in the set E has equal probability. We refer to this as the nominal decision problem. Let q denote the uniform PMF (i.e., q_j = 1/J for all j). Then, the nominal decision layer computes
z* = argmin_{z ∈ Z} −ŷ_{t+1}ᵀz + γ ρ_q(z),   (2)
where γ > 0 is the risk appetite parameter and ρ_q is the deviation risk measure defined by Proposition 2.1.
A challenge often faced by practitioners is determining an appropriate value for γ. The parameter is typically calibrated by trial-and-error using in-sample performance, potentially biasing the performance quite heavily. Instead, we treat γ as a learnable parameter of the end-to-end system. As shown by Amos and Kolter (2017) and Agrawal et al. (2019), we can find the partial derivative of the task loss function with respect to γ and subsequently use it to update γ through gradient descent like any other parameter in the network. This is advantageous because we are able to learn a task-optimal value of γ and relieve the user from having to compute an appropriate value.
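To make the role of the risk appetite parameter concrete, the nominal objective can be evaluated on candidate portfolios: a larger γ shifts the preferred portfolio toward lower risk. This sketch brute-forces a toy candidate list instead of calling a convex solver, and the data are illustrative:

```python
# Nominal objective -y_hat . z + gamma * rho(z), evaluated on a few
# candidate portfolios. A convex solver would search all of Z; here we
# brute-force a toy candidate list to show how gamma changes the choice.

def portfolio_variance(errors, z):
    u = [sum(e * w for e, w in zip(eps, z)) for eps in errors]
    c = sum(u) / len(u)
    return sum((v - c) ** 2 for v in u) / len(u)

def nominal_objective(y_hat, errors, z, gamma):
    ret = sum(y * w for y, w in zip(y_hat, z))
    return -ret + gamma * portfolio_variance(errors, z)

y_hat = [0.03, 0.01]                 # asset 1 looks more attractive...
errors = [[0.05, 0.0], [-0.05, 0.0]] # ...but its prediction is far noisier
cands = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

best_low = min(cands,
               key=lambda z: nominal_objective(y_hat, errors, z, gamma=0.1))
best_high = min(cands,
                key=lambda z: nominal_objective(y_hat, errors, z, gamma=100.0))
```

With γ = 0.1 the system concentrates in the high-return asset; with γ = 100 it retreats to the low-risk asset, which is exactly the trade-off the end-to-end system tunes by learning γ.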
2.2.2 DR layer
The nominal problem puts equal weight on every sample in the set E, i.e., the PMF p defining the risk measure is the uniform distribution q. We introduce distributional robustness by allowing p to deviate from q to within a certain “distance” (Calafiore, 2007; Ben-Tal et al., 2013; Kannan et al., 2020; Costa and Kwon, 2021). Let D(p, q) denote a statistical distance function based on a divergence φ (e.g., Kullback–Leibler, Hellinger, chi-squared). Then, the ambiguity set for the distribution is given by
P = { p in the probability simplex : D(p, q) ≤ δ }.
The size parameter δ defines the maximum permissible distance between p and the nominal distribution q. The DR decision layer chooses the portfolio assuming the worst-case behavior of p, i.e., it solves the following minimax problem:
z* = argmin_{z ∈ Z} max_{p ∈ P} −ŷ_{t+1}ᵀz + γ ρ_p(z).   (3)
In general, solving minimax problems can be computationally expensive and can also lead to locally optimal solutions rather than global solutions. It is straightforward to check that the objective in (3) is convex in z; however, the concavity of the deviation risk measure as a function of p is not immediately obvious. Nevertheless, we show in Appendix C that we can still use convex duality to reformulate the minimax problem (3) into the following minimization problem,
z* = argmin_{z ∈ Z, c, λ ≥ 0, η} −ŷ_{t+1}ᵀz + γ ( η + λδ + (λ/J) Σ_{j=1}^J φ*( (g(ε_jᵀz − c) − η)/λ ) ),   (4)
where
φ*(s) = sup_{t ≥ 0} { s·t − φ(t) }   (5)
is the convex conjugate of the divergence φ defining the ambiguity set P (for a description of divergence functions and their convex conjugates, please refer to Tables 2 and 4 in Ben-Tal et al. (2013)), and λ and η are auxiliary variables arising from constructing the Lagrangian dual. Note that we have abused our notation of the ‘argmin’ operator in (4), since z* is the only pertinent output.
Tractable reformulations of the DR layer exist for many choices of divergence. In Appendix D, we show that the DR layer can be formulated as a second-order cone problem when the divergence is the Hellinger distance (the DR layer reduces to a second-order cone program if the function g in Proposition 2.1 is quadratic or piecewise linear; otherwise, the complexity of the problem is dictated by the choice of g), and as a linear optimization problem when the divergence is the variational distance (the DR layer reduces to a linear program if the function g in Proposition 2.1 is piecewise linear; otherwise, the complexity of the problem is dictated by the choice of g). Ben-Tal et al. (2013) provide tractable reformulations for other choices of divergence.
Recasting the minimax problem into a convex minimization problem allows us to differentiate through the DR layer during training of the end-to-end system (Amos and Kolter, 2017; Amos, 2019). Another benefit of dualizing the inner maximization problem is that the ambiguity sizing parameter δ becomes an explicit part of the DR layer’s objective function, and can, therefore, be treated as a learnable parameter of the end-to-end system. Determining the size of an ambiguity set a priori is often a subjective exercise, with many users resorting to a probabilistic confidence level. By treating δ as a learnable parameter, we relieve the user from the responsibility of having to assign a value a priori.
2.3 Task loss
In standard supervised learning models, the loss function is a measure of predictive accuracy. For example, a popular prediction loss function is the mean squared error (MSE). For a prediction ŷ_t at time t, the loss is
MSE(ŷ_t, y_t) = (1/n) ‖ŷ_t − y_t‖².   (6)
In an end-to-end system, the predictive loss measures the performance of only the prediction layer. However, using the predictive loss to measure the performance of the entire system fails to consider our main objective: the out-of-sample performance of the decision.
Therefore, in contrast to standard supervised learning models, end-to-end systems measure their performance using a ‘task loss’ function, which is chosen in order to train the system based on the out-of-sample performance of the optimal decision. For example, the task loss in Butler and Kwon (2021) has the same form as the objective function of the decision layer, except the predictions are replaced with the corresponding realizations in order to calculate the out-of-sample performance. We allow for the possibility that the task loss is different from the objective function of the decision layer. This allows us to approximate a task loss function that is hard to optimize with a more tractable surrogate. In such cases, the end-to-end system is fundamentally similar to a reinforcement learning problem.
We define the task loss as the financial performance of our optimal portfolio z* measured over some out-of-sample period of length v with the realized asset returns y_{t+1}, …, y_{t+v}. When v = 1, we set the task loss to the negative realized return −(z*)ᵀy_{t+1}. When v > 1, we can use other measures of financial performance, e.g., the portfolio volatility or the Sharpe ratio (Sharpe, 1994) calculated over the next v time steps. In our numerical experiments, we set the task loss to be a weighted combination of the predictive accuracy and the Sharpe ratio (note that we have defined the Sharpe ratio using the portfolio returns rather than the portfolio excess returns, i.e., the returns in excess of the risk-free rate),
ℓ_SR = − mean({(z*)ᵀy_{t+i}}) / std({(z*)ᵀy_{t+i}}),   (7)
where the mean(·) operator calculates the mean of a set, while std(·) calculates the standard deviation.
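The Sharpe ratio used in the task loss is simply the mean of the realized portfolio returns divided by their standard deviation. A minimal sketch, assuming the population standard deviation (the paper's exact convention may differ) and no risk-free rate, as noted in the text:

```python
import math

def sharpe_ratio(portfolio_returns):
    """Mean over standard deviation of a set of realized portfolio returns
    (population std; no risk-free rate subtracted)."""
    m = sum(portfolio_returns) / len(portfolio_returns)
    var = sum((r - m) ** 2 for r in portfolio_returns) / len(portfolio_returns)
    return m / math.sqrt(var)

rets = [0.02, -0.01, 0.03, 0.01]   # realized z* . y over the window
sr = sharpe_ratio(rets)            # the task loss would use its negative
```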
In general, the task loss is any differentiable function of the optimal portfolio z*. In turn, z* is an implicit function of the parameters θ, γ and δ. Therefore, during the backward training pass, we can differentiate the task loss with respect to γ and δ in the decision layer, and with respect to θ in the prediction layer.
2.4 Training the DR endtoend system
Our proposed DR end-to-end system for portfolio construction is detailed in Algorithm 1. Recall that J denotes the number of prediction error samples used to estimate the portfolio’s prediction error, while v is the length of the out-of-sample performance window. Additionally, let us define T as the total number of observations in the full training data set.
In certain settings, a user may be unable or unwilling to integrate the prediction layer with the rest of the system. Alternatively, they may be using a prediction layer that cannot be trained via gradient descent, e.g., a tree-based predictor. In such cases, we assume the prediction layer is fixed, which in turn means that the prediction errors are constant during training.
Nevertheless, we can still pass a sample set of prediction errors as an input to the DR layer during training. Since the DR layer is a differentiable convex optimization problem, we are still able to learn values of γ and δ that minimize the task loss. In practice, setting the risk appetite parameter γ and model robustness parameter δ is difficult, and requires significant effort. For example, Bertsimas et al. (2018) propose using cross-validation techniques to set the level of robustness. Instead, our end-to-end system learns these parameters directly from data as part of the overall training in a much more efficient manner.
3 Numerical experiment
We present the results of five numerical experiments. Each experiment evaluates different characteristics of our proposed DR endtoend system. The first four experiments are conducted using historical data from the U.S. stock market. The fifth experiment uses synthetic data generated from a controlled stochastic process.
The first experiment provides a holistic measure of financial performance. The second, third and fourth experiments isolate the out-of-sample effect of allowing the end-to-end system to learn the parameters δ, γ and θ, respectively. Finally, the fifth experiment evaluates the effect of robustness when working with complex prediction layers.
The numerical experiments were conducted using code written in Python (version 3.8.5), with PyTorch (version 1.10.0) (Paszke et al., 2019) and Cvxpylayers (version 0.1.4) (Agrawal et al., 2019) used to compute the end-to-end systems. The neural network is trained using the ‘Adam’ optimizer (Kingma and Ba, 2014) and the ‘ECOS’ convex optimization solver (Domahidi et al., 2013).
3.1 Competing investment systems
The numerical experiments involve seven different investment systems. The individual experiments compare these systems against each other. Although many of the systems are designed to learn the parameters θ, γ and δ, some experiments purposely keep these parameters constant in order to isolate the effect of learning the remaining parameters. The seven investment systems are described below.

Equal weight (EW): Simple portfolio where all assets have equal weight – no prediction or optimization is required and no parameters need to be learned. Equal weight portfolios promote diversification and have been empirically shown to have a good out-of-sample Sharpe ratio (DeMiguel et al., 2009).

Predict-then-optimize (PO): Two-stage system with a linear prediction layer. The decision layer is given by the nominal problem defined in (2) with γ held constant. No parameters are learned (i.e., once the parameters are initialized, they are held constant).

Base: End-to-end system that does not incorporate the risk function and chooses portfolios by solving the optimization problem
z* = argmin_{z ∈ Z} −ŷ_{t+1}ᵀz.
The prediction layer is linear and the only learnable parameter is θ. Note that the base system is equivalent to a system where the variability of the outcome is not impacted by the decision – as was the case in Donti et al. (2017).

Nominal: End-to-end system with a linear prediction layer and a decision layer corresponding to the nominal problem (2). The learnable parameters are θ and γ.

DR: Proposed end-to-end system with a linear prediction layer and a decision layer corresponding to the DR problem (4). We choose the Hellinger distance as the divergence to define the ambiguity set P. The learnable parameters are θ, γ and δ.

NN-nominal: End-to-end system with a nonlinear prediction layer. The prediction layer is composed of a neural network with either two or three hidden layers. The decision layer corresponds to the nominal problem (2). The learnable parameters are θ and γ.

NN-DR: End-to-end system with a nonlinear prediction layer. The prediction layer is composed of a neural network with either two or three hidden layers. The decision layer corresponds to the DR problem (4). The learnable parameters are θ, γ and δ.
Additionally, since retaining some degree of predictive accuracy in the prediction layer is often desirable (Donti et al., 2017), we define the task loss as a linear combination of the Sharpe ratio loss in (7) and the MSE loss in (6).
The look-ahead evaluation period of the task loss, from t+1 to t+v, consists of one financial quarter (i.e., 13 weeks), which means v = 13. The weight placed on the MSE loss was chosen to ensure a reasonable trade-off between out-of-sample performance and prediction error. We expect similar performance with other weights.
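The combined loss above can be sketched in a few lines; the weight `mse_weight` below is a hypothetical stand-in for the value used in the experiments, and the inputs are toy numbers:

```python
# Combined task loss: Sharpe-ratio loss plus a weighted MSE term.
# mse_weight is a hypothetical placeholder, not the experiments' value.

def mse(y_hat, y):
    return sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y)

def combined_task_loss(sharpe_loss, y_hat, y, mse_weight):
    """Linear combination of the Sharpe-ratio loss and the MSE loss."""
    return sharpe_loss + mse_weight * mse(y_hat, y)

loss = combined_task_loss(sharpe_loss=-0.8,
                          y_hat=[0.02, 0.00], y=[0.01, 0.01],
                          mse_weight=0.5)
```

During training, gradients of both terms flow back through the decision and prediction layers, so the system balances decision quality against predictive accuracy.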
3.2 Experiments with historical data
The historical data consist of weekly asset and feature returns from 07–Jan–2000 to 01–Oct–2021. The predictive feature data were sourced from Kenneth French’s database (French, 2021) and consist of the weekly returns of eight features. The asset data were sourced from AlphaVantage (www.alphavantage.co) and consist of the weekly returns of 20 U.S. stocks belonging to the S&P 500 index. The selected features and assets are listed in Table 1. To avoid prediction biases, the input–output pairs are lagged by a period of one week, e.g., asset returns for 14–Jan–2000 are predicted using the feature vector observed on 07–Jan–2000.
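The one-week lagging of input–output pairs described above can be sketched as a simple shift: each week's realized returns are paired with the previous week's factor observation (labels below are placeholders for actual factor/return vectors):

```python
# Lag input-output pairs by one period: returns at week t+1 are predicted
# from factors observed at week t, avoiding look-ahead bias.

def lagged_pairs(factors, returns):
    """Zip factors[t] with returns[t+1]."""
    return list(zip(factors[:-1], returns[1:]))

factors = ["x_w1", "x_w2", "x_w3"]       # factor vectors, by week
returns = ["y_w1", "y_w2", "y_w3"]       # realized returns, by week
pairs = lagged_pairs(factors, returns)
```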
Features  
Market  Profitability  Investment  
Size  ST reversal  LT reversal  
Value  Momentum  
Assets  
AAPL  MSFT  AMZN  C  JPM 
BAC  XOM  HAL  MCD  WMT 
COST  CAT  LMT  JNJ  PFE 
DIS  VZ  T  ED  NEM 
For each experiment, the setup of the participating investment systems is outlined in a corresponding table (see Table 2 as an example). These tables indicate the initial values for the parameters θ, γ and δ, and whether these parameters were learned during training or remained constant during the experiment. Some experiments were designed to isolate the effect of learning a specific parameter; in these experiments, other parameters were kept constant even when the investment system could potentially learn them.
The training of the end-to-end learning systems was carried out as follows. We used the portfolio error variance as the deviation risk measure for all investment systems (we show in Appendix B that the variance can be cast as a deviation risk measure). For consistency, all linear prediction layers are initialized to the ordinary least squares (OLS) weights. Additionally, the initial values of the risk appetite parameter γ and robustness parameter δ were sampled uniformly from appropriately defined intervals (see Appendix E for the initialization methodology).
We used two years of weekly observations as a representative sample of prediction errors (i.e., J = 104). The data were separated into training and testing sets. The training set ranges from 07–Jan–2000 to 18–Jan–2013 and serves to train and validate the end-to-end systems.
Cross-validation was used to tune the learning rate and the number of training epochs. Since we are working with time series data, the hyperparameters were selected through a time series split cross-validation process. To do this, we trained and validated the end-to-end systems over four separate folds. For each fold, the original training set was divided into a training subset and a validation subset. We began by using the first 20% of the training set as the training subset, and the subsequent 20% as the validation subset. This was increased to a ratio of 40:20, 60:20 and, finally, 80:20. Moreover, we tested all possible combinations of three possible learning rates and numbers of epochs.
Once all four folds were completed, the average validation loss was calculated and used to select the optimal hyperparameters that yield the lowest validation loss for each end-to-end system. Once the optimal hyperparameters were selected, they were kept constant during the out-of-sample test. The average validation losses of the end-to-end systems from Experiments 1–4 are presented in Table 12 in Appendix F. The table also highlights the optimal hyperparameters selected for each system.
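The expanding-window splits described above can be sketched as index ranges over the training set; the fractions follow the 20:20 through 80:20 ratios stated in the text, with a toy training-set length:

```python
# Expanding-window time-series cross-validation folds: train on the first
# 20%, 40%, 60%, 80% of the training set, validate on the following 20%.

def ts_folds(n):
    """Return four (train_indices, validation_indices) pairs."""
    folds = []
    step = n // 5                        # 20% of the training set
    for k in range(1, 5):
        train_idx = range(0, k * step)
        val_idx = range(k * step, (k + 1) * step)
        folds.append((train_idx, val_idx))
    return folds

folds = ts_folds(100)
# fold 1 trains on observations 0-19 and validates on 20-39, and so on
```

Keeping the validation window strictly after the training window respects the temporal ordering of the data, which a standard shuffled k-fold split would violate.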
The out-of-sample test ranges from 25–Jan–2013 to 01–Oct–2021, with the first prediction taking place on 25–Jan–2013. Immediately before the test starts, the portfolios were retrained using the full training data set. Note that the hyperparameters remained constant with the values selected during the cross-validation stage.
Following best practice for time series models, we retrained the investment systems approximately every two years using all past data available at the time of training. Therefore, the investment systems were trained a total of four times. Before retraining takes place, the prediction layer weights are reset to the OLS weights computed from the corresponding training data set. In addition, the parameters γ and δ are reset to their initial values.
For each experiment, the outofsample results are presented in the form of a wealth evolution plot, as well as a table that summarizes the financial performance of the competing portfolios.
3.2.1 Experiment 1: General evaluation
The first experiment is a complete financial “backtest” to evaluate the performance of the DR endtoend learning system as an asset management tool. To do so, the DR system is compared against four other competing systems presented in Table 2. In this set of experiments, the systems were able to learn all available parameters within their purview.
System   θ Val.  θ Lrn  γ Val.  γ Lrn  δ Val.  δ Lrn
EW       –       –      –       –      –       –
PO       OLS     –      0.046   –      –       –
Base     OLS     ✓      –       –      –       –
Nominal  OLS     ✓      0.046   ✓      –       –
DR       OLS     ✓      0.046   ✓      0.312   ✓
Note: Val., initial value; Lrn, learnable; –, not applicable.
The out-of-sample financial performance of the five competing investment systems is compared as follows: Figure 2 shows the wealth evolution of the five corresponding portfolios, Figure 3 shows the cumulative Sharpe ratio, and Table 3 presents a summary of the results over the complete investment horizon. The experimental results lead to the following observations.


Benefit of using a sample-based model risk measure: Unlike the base system, the nominal and DR systems integrate a sample-based prediction error into the decision layer. The results in Table 3 clearly show the significance of incorporating prediction errors into the decision layer: both the nominal and DR systems have a higher return and lower volatility than the base system.

Impact of end-to-end learning: When compared against the straightforward predict-then-optimize and equal-weight systems, the nominal and DR systems have higher Sharpe ratios on average over the entire investment horizon, highlighting the advantage of end-to-end systems: the ability to learn the prediction and decision parameters. We note that the nominal and DR portfolios have higher volatility due to their pronounced growth in wealth, but nevertheless maintain high cumulative Sharpe ratios, as shown in Figure 3.

Distributional robustness: Comparing the nominal and DR systems, we clearly see the benefit of incorporating robustness into the portfolio selection system. The DR system has a higher Sharpe ratio, which reflects our choice of task loss function.
                EW    PO    Base  Nom.  DR
Return (%)      15.6  13.4  16.1  23.4  20.1
Volatility (%)  14.9  15.3  25.0  18.8  15.5
Sharpe ratio    1.05  0.88  0.64  1.24  1.30
Note: Values are annualized.
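The annualized figures and wealth paths reported throughout this section can be computed as sketched below; the weekly return frequency (52 periods per year) and a zero risk-free rate are assumptions for illustration.

```python
# How the annualized return, volatility, and Sharpe ratio in tables like
# Table 3 can be computed from a series of periodic portfolio returns,
# together with the wealth path and expanding-window (cumulative) Sharpe
# ratio shown in the figures.

import math
import statistics

def annualized_stats(returns, periods_per_year=52):
    mu = statistics.mean(returns) * periods_per_year               # annualized return
    vol = statistics.stdev(returns) * math.sqrt(periods_per_year)  # annualized volatility
    return mu, vol, mu / vol                                       # Sharpe ratio

def wealth_evolution(returns, w0=1.0):
    """Cumulative wealth path, as plotted in the wealth-evolution figures."""
    path = [w0]
    for r in returns:
        path.append(path[-1] * (1.0 + r))
    return path

def cumulative_sharpe(returns, periods_per_year=52):
    """Expanding-window Sharpe ratios, as in the cumulative Sharpe plots."""
    return [annualized_stats(returns[: t + 1], periods_per_year)[2]
            for t in range(1, len(returns))]
```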
3.2.2 Experiment 2: Learning δ
The second experiment focused on understanding the improvement in performance from learning δ; therefore, θ and γ were fixed. We focused on three investment systems: the predict-then-optimize (PO) system (since θ and γ are fixed, this is the same as the nominal system); the DR system with constant θ, γ and δ; and the DR system with constant θ and γ, but with a learnable δ. The difference in performance between the PO system and the DR system with all parameters fixed highlights the benefit of adding some (not optimized) robustness, and the difference in performance between the DR systems with optimized and fixed δ is a measure of the impact of the size of the uncertainty set. Details of the three systems are presented in Table 4. The out-of-sample financial performance of the three investment systems is summarized in Figure 4 and Table 5. The experimental results lead to the following observations.


Impact of robustness without learning: The results for the PO system and the DR system with no parameter learning show that, even when learning is not allowed, robustness has a positive impact on out-of-sample performance.

Isolated learning of δ: The results in Table 5 show that, although adding robustness improves the Sharpe ratio, optimizing δ on its own can be detrimental to the system’s out-of-sample performance.
System         θ Val.  θ Lrn  γ Val.  γ Lrn  δ Val.  δ Lrn
PO             OLS     –      0.046   –      –       –
DR (const. δ)  OLS     ✗      0.046   ✗      0.312   ✗
DR (learn δ)   OLS     ✗      0.046   ✗      0.312   ✓
Note: Val., initial value; Lrn, learnable; –, not applicable.
                PO    DR (const. δ)  DR (learn δ)
Return (%)      13.4  12.9           12.6
Volatility (%)  15.3  12.8           12.8
Sharpe ratio    0.88  1.01           0.99
Note: Values are annualized.
3.2.3 Experiment 3: Learning γ
The third experiment assessed the out-of-sample impact of learning γ. Therefore, here we experimented with only those systems involving γ, i.e. the nominal system with constant θ and learnable γ, and a DR system with constant θ and δ but with a learnable γ, with the PO system added as a baseline control. Details of the three systems are given in Table 6.
System   θ Val.  θ Lrn  γ Val.  γ Lrn  δ Val.  δ Lrn
PO       OLS     –      0.046   –      –       –
Nominal  OLS     ✗      0.046   ✓      –       –
DR       OLS     ✗      0.046   ✓      0.312   ✗
Note: Val., initial value; Lrn, learnable; –, not applicable.
The out-of-sample financial performance of the three investment systems is summarized in Figure 5 and Table 7. The experimental results lead to the following observations.


Impact of robustness: The results of this experiment once again highlight the out-of-sample benefits of incorporating robustness into the system: the Sharpe ratio of the DR system is the highest. Although the PO system has the highest return, we must keep in mind that the task loss function was a weighted combination of the Sharpe ratio and the prediction error, meaning that, by design, a higher Sharpe ratio is desirable.

Isolated learning of γ: Comparing the PO system with the nominal system, we observe that learning γ alone may not be beneficial to the out-of-sample performance of a system. However, when robustness is incorporated into the system, the out-of-sample performance is greatly enhanced.
                PO    Nom.  DR
Return (%)      13.4  12.4  13.6
Volatility (%)  15.3  14.7  13.4
Sharpe ratio    0.88  0.84  1.02
Note: Values are annualized.
3.2.4 Experiment 4: Learning θ
The fourth experiment assesses the out-of-sample impact of learning the prediction layer weights θ. We tested four investment systems: the PO system, the base system, the nominal system with constant γ, and the DR system with constant γ and δ. Details of the four systems are presented in Table 8.
System   θ Val.  θ Lrn  γ Val.  γ Lrn  δ Val.  δ Lrn
PO       OLS     –      0.046   –      –       –
Base     OLS     ✓      –       –      –       –
Nominal  OLS     ✓      0.046   ✗      –       –
DR       OLS     ✓      0.046   ✗      0.312   ✗
Note: Val., initial value; Lrn, learnable; –, not applicable.
The out-of-sample financial performance of the four investment systems is presented in Figure 6 and Table 9. The experimental results lead to the following observations.


Impact of incorporating a deviation risk measure: The difference in out-of-sample performance between the base system and the PO system clearly highlights the benefit of the deviation risk measure: even though the base system learns θ, this is not enough to combat the inherent uncertainty of the portfolio returns.

Learning θ: Learning θ becomes advantageous once a risk measure is added to the decision layer. The PO and nominal systems differ only in that the nominal system is able to learn values of θ that differ from the OLS weights. Thus, as suggested by Donti et al. (2017), learning θ enhances the mapping of the prediction layer from the feature space to the asset space and extracts a higher-quality decision, but only if the impact of prediction error in the decision layer is properly modeled.

Impact of robustness: Comparing the nominal and DR systems, we can see that incorporating robustness may not always be advantageous, in particular when the robustness sizing parameter δ is not optimally calibrated. The results in Table 9 indicate that δ was set too conservatively: even though the volatility of the DR system is lower than that of the nominal system, it incurs a large opportunity cost with regard to the portfolio return.
                PO    Base  Nom.  DR
Return (%)      13.4  16.1  23.0  22.3
Volatility (%)  15.3  25.0  19.1  18.5
Sharpe ratio    0.88  0.64  1.21  1.21
Note: Values are annualized.
3.3 Experiment with synthetic data
We simulate an investment universe with 10 assets and 5 features. The features are assumed to follow an uncorrelated zero-mean Brownian motion. The asset returns are generated from a linear model of the features,

y_t = a + B x_t + ε_t + j_t ξ_t,

where a is the vector of biases, B is the matrix of weights (i.e., loadings), ε_t is a Gaussian noise vector, ξ_t is an exponential noise vector, and j_t is a discrete random variable that takes the values −1, 0 and 1 with probabilities 0.15, 0.7 and 0.15, respectively, and serves to periodically perturb the asset return process bidirectionally with the magnitude of ξ_t in order to simulate ‘jumps’ in the asset returns.

The experimental data set is composed of 1,200 observations, with the first 840 reserved for training and the remaining 360 reserved for testing. However, unlike the previous experiments, the synthetic data allows us to use a single fold for validation: the complete training set is separated into a single training subset and a single validation subset. The validation results are shown in Table 13 in Appendix F, which also highlights the optimal hyperparameters selected for each system.
The fifth experiment explored the advantage of robustness when the prediction layer is more complex, e.g. a neural network with multiple hidden layers. Specifically, this experiment compares nominal and DR systems when the prediction layer is either a linear model or a neural network with two or three hidden layers. The neural networks had fully connected hidden layers with rectified linear unit (ReLU) activation functions.
The prediction layer in this experiment is initialized to random weights using the standard PyTorch (Paszke et al., 2019) initialization mechanism. This applies to all three prediction layer designs tested in this experiment. The initial values of the risk appetite parameter γ and the robustness parameter δ were sampled uniformly from the same intervals as in the previous experiments (see Appendix E for the initialization methodology).
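A pure-Python sketch of such a prediction layer: fully connected layers with ReLU activations and uniform random initialization mirroring PyTorch's default scheme; the layer widths are hypothetical.

```python
# Illustrative forward pass for the neural-network prediction layers used in
# this experiment: fully connected hidden layers with ReLU activations and
# randomly initialized weights.

import random

def init_layer(n_in, n_out, rng):
    # Uniform initialization over ±1/sqrt(n_in), mirroring PyTorch's default
    bound = n_in ** -0.5
    W = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-bound, bound) for _ in range(n_out)]
    return W, b

def forward(layers, x):
    """Apply each layer; ReLU on all but the final (linear) output layer."""
    for idx, (W, b) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU
    return x

rng = random.Random(0)
# Two hidden layers: 5 features -> 32 -> 32 -> 10 predicted asset returns
layers = [init_layer(5, 32, rng), init_layer(32, 32, rng), init_layer(32, 10, rng)]
y_hat = forward(layers, [0.1, -0.2, 0.0, 0.3, 0.05])
```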
The objective of the experiment is to investigate whether robustness enhances the out-of-sample performance of a system with a more complex prediction layer, i.e. a prediction layer that is more difficult to train accurately. To avoid biases pertaining to the design of the prediction layer, our assessment is based on a pairwise (nominal vs. DR) comparison within each of the linear, two-layer and three-layer designs. Details of the six investment systems are presented in Table 10.
System   θ Val.  θ Lrn  γ Val.  γ Lrn  δ Val.  δ Lrn
Linear
  Nom.   Note 1  ✓      0.089   ✓      –       –
  DR     Note 1  ✓      0.089   ✓      0.146   ✓
2-layer
  Nom.   Note 1  ✓      0.081   ✓      –       –
  DR     Note 1  ✓      0.081   ✓      0.358   ✓
3-layer
  Nom.   Note 1  ✓      0.072   ✓      –       –
  DR     Note 1  ✓      0.072   ✓      0.163   ✓
Notes: (1) θ is initialized to the same values for each pair of systems. (2) Val., initial value; Lrn, learnable; –, not applicable.
The out-of-sample financial performance of the six investment systems is summarized in Figure 7 and Table 11. As previously noted, our assessment is based on a pairwise comparison within each of the linear, two-layer and three-layer designs. The experimental results lead to the following observation.


Impact of robustness: As indicated by the Sharpe ratios in Table 11, introducing distributional robustness greatly enhances the out-of-sample portfolio performance in all three cases. Recall that, unlike the previous experiments where the prediction layers were initialized to the naturally intuitive OLS weights, the prediction layers in the current systems are initialized to fully randomized weights. Thus, through this experiment we can appreciate how robustness protects the portfolios from model error, particularly as the complexity of the prediction layer increases.
             Linear       2-layer      3-layer
             Nom.   DR    Nom.   DR    Nom.   DR
Return (%)   3.30   5.10  1.80   4.20  0.70   5.40
Vol. (%)     2.90   2.70  8.30   5.80  7.80   2.60
SR           1.16   1.88  0.21   0.73  0.08   2.11
Notes: (1) Vol., volatility; SR, Sharpe ratio. (2) Values are annualized.
4 Conclusion
This paper introduces a novel DR end-to-end learning system for portfolio construction. Specifically, the system integrates a prediction layer of asset returns with a decision layer that selects portfolios by solving a DR optimization problem that explicitly incorporates model risk.
The decision layer in our end-to-end system consists of a DR portfolio selection problem that takes as input both the point predictions and the prediction errors. The prediction errors are used to define a deviation risk measure, and we penalize the performance of a portfolio with the worst-case risk over a set of probability measures. We show that, even though the deviation risk measure is not a linear function of the probability measure, one can still use convex duality to reformulate the minimax DR optimization problem as an equivalent minimization problem. This reformulation is critical to ensure that the gradients of the task loss can be backpropagated to the prediction layer. Numerical experiments clearly show that incorporating model robustness via a DR layer leads to enhanced financial performance.
The parameters that control risk and robustness, γ and δ, are learnable as part of the end-to-end system, meaning these two parameters are optimized directly from data based on the system’s out-of-sample performance. The learnability of the robustness parameter is a consequence of the convex duality that we use to reformulate the DR layer. Setting these two parameters in practice is a challenge, and our end-to-end system relieves the user from having to set them. Furthermore, our numerical experiments show that these parameters significantly impact performance.
We have implemented our proposed robust end-to-end system in Python and have made the code available on GitHub. We anticipate that the DR approach will be impactful in any application where model risk is an important consideration. Portfolio construction is only the beginning.
References
 Agrawal et al. (2019) Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., and Kolter, J. Z. (2019). Differentiable convex optimization layers. Advances in Neural Information Processing Systems, 32:9562–9574.
 Amos (2019) Amos, B. (2019). Differentiable optimization-based modeling for machine learning. PhD thesis, Carnegie Mellon University.
 Amos and Kolter (2017) Amos, B. and Kolter, J. Z. (2017). OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pages 136–145. PMLR.
 Amos et al. (2018) Amos, B., Rodriguez, I. D. J., Sacks, J., Boots, B., and Kolter, J. Z. (2018). Differentiable MPC for end-to-end planning and control. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 8299–8310.
 Ben-Tal et al. (2013) Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. (2013). Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357.
 Bengio (1997) Bengio, Y. (1997). Using a financial training criterion rather than a prediction criterion. International Journal of Neural Systems, 8(04):433–443.
 Bertsimas et al. (2018) Bertsimas, D., Gupta, V., and Kallus, N. (2018). Data-driven robust optimization. Mathematical Programming, 167(2):235–292.
 Best and Grauer (1991) Best, M. J. and Grauer, R. R. (1991). On the sensitivity of mean-variance-efficient portfolios to changes in asset means: some analytical and computational results. The Review of Financial Studies, 4(2):315–342.
 Boyd and Vandenberghe (2004) Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
 Broadie (1993) Broadie, M. (1993). Computing efficient frontiers using estimated parameters. Annals of Operations Research, 45(1):21–58.
 Butler and Kwon (2021) Butler, A. and Kwon, R. H. (2021). Integrating prediction in meanvariance portfolio optimization.
 Calafiore (2007) Calafiore, G. C. (2007). Ambiguous risk measures and optimal robust portfolios. SIAM Journal on Optimization, 18(3):853–877.
 Chopra and Ziemba (1993) Chopra, V. K. and Ziemba, W. T. (1993). The effect of errors in means, variances, and covariances on optimal portfolio choice. Journal of Portfolio Management, pages 6–11.
 Costa and Kwon (2020) Costa, G. and Kwon, R. H. (2020). A robust framework for risk parity portfolios. Journal of Asset Management, 21:447–466.
 Costa and Kwon (2021) Costa, G. and Kwon, R. H. (2021). Data-driven distributionally robust risk parity portfolio optimization. arXiv preprint arXiv:2110.06464.

 Delage and Ye (2010) Delage, E. and Ye, Y. (2010). Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612.
 DeMiguel et al. (2009) DeMiguel, V., Garlappi, L., and Uppal, R. (2009). Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy? The Review of Financial Studies, 22(5):1915–1953.
 Domahidi et al. (2013) Domahidi, A., Chu, E., and Boyd, S. (2013). ECOS: An SOCP solver for embedded systems. In European Control Conference (ECC), pages 3071–3076.
 Donti et al. (2017) Donti, P. L., Amos, B., and Kolter, J. Z. (2017). Taskbased endtoend model learning in stochastic optimization. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5490–5500.
 Elmachtoub et al. (2020) Elmachtoub, A., Liang, J. C. N., and McNellis, R. (2020). Decision trees for decision-making under the predict-then-optimize framework. In International Conference on Machine Learning, pages 2858–2867. PMLR.
 Elmachtoub and Grigas (2021) Elmachtoub, A. N. and Grigas, P. (2021). Smart “predict, then optimize”. Management Science.
 Fabozzi et al. (2007) Fabozzi, F. J., Kolm, P. N., Pachamanova, D. A., and Focardi, S. M. (2007). Robust portfolio optimization. Journal of Portfolio Management, 33(3):40.
 French (2021) French, K. R. (2021). Data library. [Online; accessed 10Dec2021].
 Goldfarb and Iyengar (2003) Goldfarb, D. and Iyengar, G. (2003). Robust portfolio selection problems. Mathematics of Operations Research, 28(1):1–38.
 Kannan et al. (2020) Kannan, R., Bayraksan, G., and Luedtke, J. R. (2020). Residualsbased distributionally robust optimization with covariate information. arXiv preprint arXiv:2012.01088.
 Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 LeCun et al. (2005) LeCun, Y., Muller, U., Ben, J., Cosatto, E., and Flepp, B. (2005). Off-road obstacle avoidance through end-to-end learning. In Proceedings of the 18th International Conference on Neural Information Processing Systems, pages 739–746.
 Merton (1980) Merton, R. C. (1980). On estimating the expected return on the market: An exploratory investigation. Journal of Financial Economics, 8(4):323–361.
 Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). Pytorch: An imperative style, highperformance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
 Rockafellar et al. (2006) Rockafellar, R. T., Uryasev, S., and Zabarankin, M. (2006). Generalized deviations in risk analysis. Finance and Stochastics, 10(1):51–74.
 Scarf (1958) Scarf, H. (1958). A min-max solution of an inventory problem. Studies in the Mathematical Theory of Inventory and Production, pages 201–209.
 Sharpe (1994) Sharpe, W. F. (1994). The Sharpe ratio. Journal of Portfolio Management, 21(1):49–58.
 Thomas et al. (2006) Thomas, R. W., Friend, D. H., DaSilva, L. A., and MacKenzie, A. B. (2006). Cognitive networks: adaptation and learning to achieve end-to-end performance objectives. IEEE Communications Magazine, 44(12):51–57.
 Tütüncü and Koenig (2004) Tütüncü, R. H. and Koenig, M. (2004). Robust asset allocation. Annals of Operations Research, 132(1–4):157–187.
 Uysal et al. (2021) Uysal, A. S., Li, X., and Mulvey, J. M. (2021). Endtoend risk budgeting portfolio optimization with neural networks.
 Von Neumann (1928) Von Neumann, J. (1928). Zur theorie der gesellschaftsspiele. Mathematische annalen, 100(1):295–320.
 Zhang et al. (2021) Zhang, C., Zhang, Z., Cucuringu, M., and Zohren, S. (2021). A universal end-to-end approach to portfolio optimization via deep learning. arXiv preprint arXiv:2111.09170.

 Zhang et al. (2020) Zhang, Z., Zohren, S., and Roberts, S. (2020). Deep learning for portfolio optimization. The Journal of Financial Data Science, 2(4):8–20.
Appendix A Proof of Proposition 2.1
We begin with some preliminary information. Recall that denotes the finite set of prediction error outcomes and is a PMF.
Here we prove the four properties of the deviation risk measure outlined in Proposition 2.1.

is convex for any fixed .
Proof.
Since is convex, it follows that
(8) is a convex function in for a fixed . Accordingly, is also a convex function of for fixed (Boyd and Vandenberghe, 2004). ∎

for all and .
Proof.
By definition, we have that is a nonnegative function. Moreover, is constrained to the probability simplex (specifically, ). Since is the sum of functions, each of which results from the product of two nonnegative elements, then . ∎

is shift-invariant with respect to , i.e. for any fixed vector .
Proof.
By definition, we have that
where . The result follows from the last expression. ∎

is symmetric with respect to , i.e. .
Proof.
By definition, we have that
(9) where , and (9) follows from the fact that . The result follows from the last expression. ∎
Appendix B Portfolio error variance
Recall that {ε_1, …, ε_T} denotes the finite set of prediction error outcomes. For a fixed distribution p, the expected error and its corresponding covariance matrix are

(10)  ε̄(p) = Σ_{t=1}^{T} p_t ε_t,
(11)  Σ(p) = Σ_{t=1}^{T} p_t (ε_t − ε̄(p)) (ε_t − ε̄(p))^T.

The matrix Σ(p) results from the weighted sum of rank-1 symmetric matrices, meaning it is guaranteed to be positive semidefinite.
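A small numerical sketch of the p-weighted mean error and covariance described above, using two assets so that positive semidefiniteness can be verified from the trace and determinant; `weighted_mean_cov` is a hypothetical helper name.

```python
# Builds the p-weighted expected error and covariance matrix from a finite
# set of error outcomes. The covariance is a weighted sum of rank-1
# symmetric outer products, hence positive semidefinite.

def weighted_mean_cov(errors, p):
    """errors: list of T error vectors (length n); p: PMF over the T outcomes."""
    T, n = len(errors), len(errors[0])
    mu = [sum(p[t] * errors[t][i] for t in range(T)) for i in range(n)]
    cov = [[sum(p[t] * (errors[t][i] - mu[i]) * (errors[t][j] - mu[j])
               for t in range(T))
            for j in range(n)] for i in range(n)]
    return mu, cov
```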
Note that Σ(p) is a nonlinear function of p; consequently, the portfolio variance is also nonlinear in p. Thus, in this form, the worst-case variance does not have the correct convexity properties that allow one to reformulate the problem using duality. We resolve this issue by recasting the portfolio error variance into the form prescribed by Proposition 2.1, which yields the following corollary.
Corollary B.1.
For a fixed and , the portfolio variance is
where is an unrestricted auxiliary variable.
Appendix C Dualizing the DR layer
Recall that in (8) is a convex function in for any fixed . Moreover, is linear in for any fixed . Thus, is convex–linear in and , respectively.
Fix . Then, the minimax theorem (Von Neumann, 1928) implies that
Next, we adapt the results in Ben-Tal et al. (2013) to write the maximization over as a dual minimization problem. This transformation is straightforward and is included only for completeness.
Fix . From the definition of the ambiguity set , it follows that the maximization problem in is
Associate a dual variable with the constraint and a dual variable with the constraint . Then, the Lagrangian dual function is
(12) 
Note that we arrive at (12) by taking the convex conjugate of and by using the identity for . Additionally, recall that our nominal assumption states that .
Since the conjugate of a divergence is a convex function, it follows that the Lagrangian function is jointly convex in (see Ben-Tal et al. (2013) for details).
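To illustrate the max-over-probabilities structure that the duality argument above eliminates, the toy example below solves a worst-case expectation over a total-variation ball in closed form. This is a deliberately simpler ambiguity set than the divergence-based one used here, chosen only to show how the inner maximization over the simplex collapses to a tractable expression.

```python
# Worst-case expected loss over a total-variation ball of radius delta
# around a nominal PMF q:
#   max_p  sum_i p_i * losses_i   s.t.  sum_i |p_i - q_i| <= delta, p in simplex.
# The optimum moves mass (up to delta/2) from the lowest-loss outcomes to
# the highest-loss outcome.

def worst_case_expectation(losses, q, delta):
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    p = list(q)
    budget = delta / 2.0   # total mass that may be relocated
    top = order[-1]        # highest-loss outcome receives the mass
    for i in order[:-1]:   # drain the lowest-loss outcomes first
        move = min(p[i], budget)
        p[i] -= move
        p[top] += move
        budget -= move
        if budget <= 0:
            break
    return sum(pi * li for pi, li in zip(p, losses))
```

With delta = 0 this reduces to the nominal expectation; as delta grows, the worst case interpolates up to the maximum loss, mirroring how the size of the ambiguity set controls conservatism in the DR layer.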
Appendix D Tractable reformulations
Recall that the DR layer in (4) corresponds to the following minimization problem,
The computational tractability of this optimization problem depends on the complexity of the divergence selected to construct the ambiguity set .
D.1 Hellinger distance
The Hellinger distance is defined as