1 Introduction
Deep probabilistic autoregressive models have recently been integrated into the Amazon SageMaker toolkit and successfully applied to various kinds of sequential data such as handwriting
(Graves, 2013), speech and music (Oord et al., 2016a), images (Oord et al., 2016c, b) and time series from a number of different domains (Salinas et al., 2019). At a high level, given a sequence of input values (e.g., pen-tip locations, raw audio signals or stock market prices), the goal is to train a generative model that outputs an accurate sequence of next values, conditioned on all the previous values. The main benefit of such probabilistic models is that they model the joint distribution of output values, rather than predicting only a single best realization (i.e., the most likely value at each step). Predicting a density rather than a single best value has several advantages – it naturally fits the inherently stochastic nature of many processes, it allows assessing the uncertainty and the associated risk, and it has been shown to produce better overall prediction accuracy in forecasting tasks (Salinas et al., 2019).

This Work
We develop an efficient approach for generating adversarial attacks on deep probabilistic autoregressive models. The issue of adversarial robustness and attacks (Szegedy et al., 2013; Goodfellow et al., 2014)
, i.e., generating small input perturbations that lead to mispredictions, is an important problem with a large body of recent work. Yet, to the best of our knowledge, this is the first work that explores adversarial attacks in a new, challenging setting where the neural network output is a sequence of probability distributions. The difficulty in generating adversarial attacks in this setting is that it requires computing the gradient of an expectation that is too complex to be analytically integrated (Schittenkopf et al., 2000; Salinas et al., 2019) and is instead approximated using Monte-Carlo methods.

Differentiating through Monte-Carlo Estimation of the Expectation

We address the key technical challenge of efficiently differentiating through the Monte-Carlo estimation, a necessary part of generating white-box gradient-based adversarial attacks, by using two techniques to approximate the gradient of the expectation. The first approach is the score-function estimator (Glynn, 1990; Kleijnen and Rubinstein, 1996), obtained by interchanging the gradient and the expectation's integral. The second technique differentiates individual samples using a random variate reparametrization and originates from the variational inference literature (Salimans et al., 2013; Kingma and Welling, 2013; Rezende et al., 2014).
Main Contributions We present the first approach for generating adversarial attacks on deep probabilistic autoregressive models by applying two techniques that differentiate through the Monte-Carlo estimation of an expectation. We show that the reparametrization estimator is efficient at generating adversarial attacks and outperforms the score-function estimator by evaluating on two domains that benefit from stochastic sequential reasoning – stock market trading and electricity consumption. We make our code, datasets and scripts to reproduce our experiments available online at https://github.com/ethsri/probabilisticforecastsattacks.
2 Probabilistic Forecasting Models
In this section, we formally describe the probabilistic autoregressive model used in prior works and throughout this paper. Most notably, the model described in Section 2.1 is based on the recent work of Salinas et al. (2019), which is now an inherent part of the Amazon SageMaker toolkit. Further, in Section 2.2 we describe an extension of this model to the Bayesian setting, proposed in our work.
2.1 Sequence Modeling: Preliminaries
Given a sequential process $(x_t)_{t \ge 1}$ and an index $t_0$, we consider the task of modeling the distribution of the predicted (future) values $x_{t_0+1:T}$ given the observed (past) values $x_{1:t_0}$. Using the chain rule, the joint distribution of predicted values conditioned on observed values can be written as a product of conditional distributions:

$p(x_{t_0+1:T} \mid x_{1:t_0}) = \prod_{t=t_0+1}^{T} p(x_t \mid x_{1:t-1})$  (1)
In deep autoregressive models, a neural network with learnable parameters $\Phi$ is used to approximate each conditional distribution $p(x_t \mid x_{1:t-1})$ by a parametric distribution $q_\Phi(x_t \mid x_{1:t-1})$. This yields a joint model:

$q_\Phi(x_{t_0+1:T} \mid x_{1:t_0}) = \prod_{t=t_0+1}^{T} q_\Phi(x_t \mid x_{1:t-1})$  (2)
This decomposed form for the joint distribution is general and independent of the particular neural architecture chosen for $q_\Phi$. In principle, any type of sequential model can be used, including well-known architectures such as LSTMs (Hochreiter and Schmidhuber, 1997), Temporal Convolutional Networks (Bai et al., 2018) or Transformers (Vaswani et al., 2017). Next, we describe an LSTM-based instantiation of probabilistic autoregressive models.
Probabilistic Autoregressive Models
Let $f$ be a function implemented by an LSTM network. Given $x_{1:t-1}$, we compute the hidden state $h_t$ for each timestep, conditioned on the previous hidden state $h_{t-1}$, the previous input value $x_{t-1}$ and the network parameters $\Phi$. The hidden state $h_t$ is then used to generate a set of parameters $\theta_t$ that specify a distribution with density $\ell(\cdot \mid \theta_t)$, giving the following form for the conditional distribution:

$q_\Phi(x_t \mid x_{1:t-1}) = \ell(x_t \mid \theta_t), \qquad h_t = f(h_{t-1}, x_{t-1}, \Phi)$  (3)
The main difference here compared to non-probabilistic models is that the network predicts the parameters of a distribution rather than a single value. Commonly used distributions in prior works are the Gaussian distribution for real-valued data and the negative binomial distribution for count data (Salinas et al., 2019). When using a Gaussian distribution, $\theta_t = (\mu_t, \sigma_t)$ has two components, corresponding to the mean and the standard deviation, respectively. The density is defined as:

$\ell(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$  (4)
Note that the choice of a Gaussian distribution corresponds to the assumption that each value is normally distributed conditioned on past values – a hypothesis which has to be assessed per application domain. In what follows, to make a clear distinction between the network inputs and outputs, we use $x = x_{1:t_0}$ to denote the inputs (i.e., the observed values) and $Y = Y_{t_0+1:T}$ to denote the outputs (i.e., the predicted values).

Inference
Performing inference for probabilistic autoregressive models corresponds to characterizing the joint distribution of the output sequence $Y$. This includes estimating, $k$ steps ahead, both the mean and the standard deviation of the value $Y_{t_0+k}$, via its first and second moments $\mathbb{E}[Y_{t_0+k}]$ and $\mathbb{E}[Y_{t_0+k}^2]$. More generally, given the space $\Omega$ of output sequences and any statistic $\chi \colon \Omega \to \mathbb{R}$, we consider the task of estimating the expectation $\mathbb{E}[\chi(Y)]$. The main challenge here is the complexity of the underlying integral over the distribution of $Y$, which is in general not analytically solvable.

During training, one can either use scheduled sampling (Bengio et al., 2015), where a single sample from the distribution is used, or avoid this issue completely by using teacher forcing (Williams and Zipser, 1989), where the deterministic ground truth value for $x_t$ is fed back into the network at the next timestep. This setting is tractable, but only because the prediction is for a single next step. However, when performing iterated prediction at test time, the value used in the feedback loop is sampled from the predicted distribution of $Y_t$. The next hidden state therefore depends on the randomness introduced by sampling $Y_t$, which yields an arbitrarily complex form for the joint distribution of $Y$. To address this issue, prior works perform Monte-Carlo inference (Schittenkopf et al., 2000; Salinas et al., 2019) to approximate the expectation as:

$\mathbb{E}[\chi(Y)] \approx \frac{1}{n} \sum_{i=1}^{n} \chi(y^{(i)})$  (5)

That is, the Monte-Carlo estimate of the expected value of $\chi(Y)$ is computed using $n$ generated samples $y^{(1)}, \ldots, y^{(n)}$ of the output sequence.
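The iterated sampling and Monte-Carlo estimate above can be sketched in a few lines of numpy. This is a toy stand-in, not the paper's implementation: the fixed coefficients playing the role of the LSTM's predicted $(\mu_t, \sigma_t)$ are illustrative assumptions.

```python
import numpy as np

def sample_paths(x_last, horizon, n_samples, rng):
    """Iteratively sample output sequences: at each step a toy 'network'
    maps the previous value to the parameters (mu, sigma) of a Gaussian,
    and the drawn sample is fed back in at the next timestep."""
    paths = np.empty((n_samples, horizon))
    prev = np.full(n_samples, x_last, dtype=float)
    for t in range(horizon):
        mu = 0.9 * prev                     # illustrative stand-ins for the
        sigma = 0.1 * np.abs(prev) + 0.01   # LSTM's predicted parameters
        prev = rng.normal(mu, sigma)        # sample, then feed back
        paths[:, t] = prev
    return paths

def mc_expectation(chi, paths):
    """Monte-Carlo estimate (1/n) * sum_i chi(y_i) of E[chi(Y)]."""
    return float(np.mean([chi(y) for y in paths]))

rng = np.random.default_rng(0)
paths = sample_paths(x_last=1.0, horizon=10, n_samples=5000, rng=rng)
mean_final = mc_expectation(lambda y: y[-1], paths)  # estimates E[Y_T]
```

Any statistic of the whole path, not just the final value, can be plugged in as `chi`, which is exactly what the later sections exploit.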
2.2 Extension to Bayesian Setting
We extend the probabilistic autoregressive models presented in this section to the Bayesian setting, where the output sequence can be conditioned on arbitrary values (i.e., both from the past and from the future). Formally, we define an observation function $O \colon \Omega \to \{0, 1\}$, which takes as input a sequence of values and outputs a boolean denoting whether the observation holds. Conditioning on an observation $O$ means that we would like the model to predict values conditioned both on the inputs $x$ and on the event that $O(Y)$ holds.
There are several cases for which the Bayesian setting is useful: (i) some of the data is missing (e.g., due to sensor failures), (ii) to allow encoding prior beliefs about the future evolution of the process, or (iii) to evaluate complex domain-specific statistics. The financial domain offers good examples for (iii), e.g., in pricing exotic derivatives such as barrier options (Rich, 1994), whose existence depends upon the underlying asset's price breaching a preset barrier level.
To remove clutter, in the remainder of the paper we will use $O$ to denote the output of the observation function when evaluated on $Y$. In this Bayesian setting, the expectation can be estimated via Monte-Carlo importance sampling as:

$\mathbb{E}[\chi(Y) \mid O] \approx \frac{\sum_{i=1}^{n} \chi(y^{(i)})\, p(O \mid y^{(i)})}{\sum_{i=1}^{n} p(O \mid y^{(i)})}$  (6)

This corresponds to generating samples $y^{(1)}, \ldots, y^{(n)}$ from the prior distribution and reweighting them with Bayes' rule. Note that this formula includes the former, non-Bayesian case by using $p(O \mid y) \equiv 1$.
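As a minimal illustration of this self-normalized importance-sampling estimate, consider a toy two-step output distribution; the particular observation and statistic below are arbitrary choices made so that the exact posterior mean is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Prior samples of a toy two-step output sequence Y = (Y_1, Y_2):
# Y_1 ~ N(0, 1) and Y_2 = Y_1 + N(0, 1) stand in for the network's sampler.
y1 = rng.normal(size=n)
y2 = y1 + rng.normal(size=n)

# Observation O(y): the first predicted value is positive.
# Here p(O | y) is a 0/1 indicator weight.
weights = (y1 > 0).astype(float)

# Self-normalized importance-sampling estimate of E[chi(Y) | O]
# with statistic chi(Y) = Y_2.
posterior_mean = np.sum(y2 * weights) / np.sum(weights)
```

For this toy model the exact value is $\mathbb{E}[Y_2 \mid Y_1 > 0] = \sqrt{2/\pi} \approx 0.8$, which the estimate approaches as $n$ grows; setting the weights identically to 1 recovers the plain Monte-Carlo estimate of Equation 5.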
3 Adversarial Attacks on Probabilistic Forecasting Models
In this section, we present our approach for generating adversarial attacks on deep probabilistic forecasting models. We start by formally defining the problem statement suitable for this setting and then we describe two practical adversarial attacks that address it.
Adversarial Examples
Recall that in the canonical classification setting, adversarial examples are typically found by solving the following optimization problem (Szegedy et al., 2013; Papernot et al., 2016; Carlini and Wagner, 2017):
$\min_{\delta} \; \|\delta\| \quad \text{subject to} \quad f(x + \delta) = t$  (7)
where $f$ is a classifier, $x$ is an input (e.g., an image), $t$ is the desired adversarial output (the target), and $\delta$ is the minimal perturbation (according to a given norm) applied to the input such that the classifier predicts the desired output $t$.

To make the above formulation applicable to probabilistic forecasting models, we perform two standard modifications – (i) we replace the hard equality constraint with an easier-to-optimize soft constraint that captures the distance between real values, and (ii) we replace the single-value output with the expected value of a given statistic of the output joint probability distribution. Applying the first modification leads to the following formulation:
$\min_{\delta} \; \|t - f(x + \delta)\| \quad \text{subject to} \quad \|\delta\| \le \epsilon$  (8)
That is, we use a soft constraint that minimizes the distance between the target and the predicted value, subject to a given tolerance $\epsilon$ on the perturbation norm. Applying the second modification corresponds to replacing $f(x + \delta)$ with the expected value $\mathbb{E}_{x+\delta}[\chi(Y) \mid O]$, where $O$ is an observation over outputs as defined in Section 2.2, and $\chi$ is a statistic of the output sequence. Overall, this leads to the following problem statement.
Problem Statement
Let $f$ be a function (i.e., a probabilistic neural network) that takes as input a sequence of values $x$ and outputs a probability distribution that can be sampled from to obtain a concrete output sequence $y$. Given an observation variable $O$, a statistic $\chi$ of the network output distribution and a target $t$, the goal of the adversarial attack is to find a perturbation $\delta$ that solves the following optimization problem:
$\min_{\delta} \; \|t - \mathbb{E}_{x+\delta}[\chi(Y) \mid O]\| \quad \text{subject to} \quad \|\delta\| \le \epsilon$  (9)
3.1 Practical Attack on Probabilistic Networks
The constrained minimization problem defined in Equations 8 and 9 has repeatedly been identified as very difficult to solve in the adversarial attacks literature (Szegedy et al., 2013; Carlini and Wagner, 2017). As a result, we instead follow the approach of Szegedy et al. (2013) and solve an adjusted optimization problem. Given a real hyperparameter $c > 0$, we aim at minimizing:

$\mathcal{L}(\delta) = c\,\|\delta\| + \|t - \mathbb{E}_{x+\delta}[\chi(Y) \mid O]\|$  (10)

via gradient descent. The attack is run with different values of $c$, and the final value is chosen to ensure that the hard constraint $\|\delta\| \le \epsilon$ is satisfied.
While optimizing the objective function in Equation 9 is standard (Szegedy et al., 2013; Goodfellow et al., 2014; Kurakin et al., 2017b, a), the crucial aspect in ensuring efficient gradient descent is to obtain a good estimate of the objective function's gradient. In particular, this involves computing:

$\nabla_{\delta}\, \mathbb{E}_{x+\delta}[\chi(Y) \mid O]$

The difficulty of computing this gradient comes from the fact that the expectation cannot be analytically computed, but only approximated via Monte-Carlo methods. Informally, this raises the question of how to efficiently differentiate through the Monte-Carlo estimation. We compare two different ways of performing this differentiation, described next.
3.1.1 Score-Function Estimator
The first approach is to express the gradient of the expectation as an expectation over the output distribution by interchanging the gradient and the integral, and to estimate the resulting expectation via Monte-Carlo methods. This technique is known under different names in the literature: the score-function method (Glynn, 1990), REINFORCE (Williams, 1992), or the log-derivative trick. Below, we show how it applies to our setting.
Proposition 3.1 (Score-Function Estimator).
In the general Bayesian setting, the score-function gradient of the expected value of $\chi$ satisfies:

$\nabla_{\delta}\, \mathbb{E}_{x+\delta}[\chi(Y) \mid O] = \frac{\mathbb{E}_{x+\delta}\big[\big(\chi(Y) - \mathbb{E}_{x+\delta}[\chi(Y) \mid O]\big)\, p(O \mid Y)\, \nabla_{\delta} \log q_{x+\delta}(Y)\big]}{\mathbb{E}_{x+\delta}[p(O \mid Y)]}$

and is estimated by replacing each expectation with its Monte-Carlo estimate over samples $y^{(1)}, \ldots, y^{(n)}$ drawn from the prior distribution $q_{x+\delta}$, where $p(O \mid y)$ denotes the probability that $O$ is true knowing that $y$ is generated.
The proof of Proposition 3.1 is given in the supplementary material. Note that in the non-Bayesian setting, where the observation is always true, we have $p(O \mid y) \equiv 1$ and we obtain the simpler form:

$\nabla_{\delta}\, \mathbb{E}_{x+\delta}[\chi(Y)] = \mathbb{E}_{x+\delta}\big[\chi(Y)\, \nabla_{\delta} \log q_{x+\delta}(Y)\big]$
While this estimator allows for generating adversarial perturbations, we observe that it has two drawbacks – high variance and high sampling complexity.
High Variance
Score-function estimators typically lead to slow convergence because they suffer from high variance (Ranganath et al., 2014). This is due to the fact that they operate in a black-box way with respect to the gradients of the network and of the statistic $\chi$.
Complexity of the Bayesian Setting
The score-function estimator requires evaluating the observation probability $p(O \mid y)$ for each sample. This is always possible in the special case when the observation is constantly true; however, in the general setting this might require another step of sampling, which makes the estimator overly complex.
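A minimal numpy sketch of the non-Bayesian score-function estimator on a toy problem: the parameter being differentiated is the mean of a single Gaussian rather than an input perturbation, an assumption made so the estimate can be checked against an analytic gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 200_000

# Samples from q_mu = N(mu, sigma^2); statistic chi(y) = y^2.
y = rng.normal(mu, sigma, size=n)

# Score-function (REINFORCE) estimator:
#   d/dmu E[chi(Y)] ≈ (1/n) sum_i chi(y_i) * d/dmu log q_mu(y_i),
# where d/dmu log q_mu(y) = (y - mu) / sigma^2 for a Gaussian.
score = (y - mu) / sigma ** 2
grad_estimate = float(np.mean(y ** 2 * score))

# Analytic check: E[Y^2] = mu^2 + sigma^2, so the true gradient is 2 * mu.
```

Even on this one-dimensional problem the estimate is noticeably noisy for moderate sample counts, which is the high-variance drawback discussed above.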
3.1.2 Reparametrization Estimator
The second estimator is based on the reparametrization trick. It reparametrizes the output distribution $q_{x+\delta}$ in terms of auxiliary random variables whose distribution does not depend on $\delta$, in order to make individual samples differentiable with respect to $\delta$. The differential $\partial y / \partial \delta$ has a priori no specific meaning when the distribution from which $y$ is sampled depends on $\delta$. However, if $y$ can be reparametrized as $y = g(\delta, \epsilon)$, where $\epsilon$ is a random variable whose distribution is independent from $\delta$, then it makes sense to define the differential of $y$ with respect to $\delta$ as $\partial g / \partial \delta$.

Reparametrization estimators were first proposed as a way of mitigating the variance problems of score-function estimators (Salimans et al., 2013; Kingma and Welling, 2013; Rezende et al., 2014). However, to the best of our knowledge, they have not been used in a Bayesian setting where the estimator to differentiate uses importance sampling.
Proposition 3.2 (Reparametrization Estimator).
Assume there exists a differentiable transformation $g$ such that the random variable $Y \sim q_{x+\delta}$ can be reparametrized as $Y = g(\delta, \epsilon)$, where $\epsilon$ is an auxiliary random variable whose marginal distribution is independent from $\delta$. Then the importance-sampling reparametrization estimator of the expectation's gradient is:

$\nabla_{\delta}\, \mathbb{E}_{x+\delta}[\chi(Y) \mid O] \approx \nabla_{\delta} \left( \frac{\sum_{i=1}^{n} \chi(y^{(i)})\, p(O \mid y^{(i)})}{\sum_{i=1}^{n} p(O \mid y^{(i)})} \right)$

where for $i = 1, \ldots, n$, $\epsilon^{(i)}$ is sampled from the distribution of $\epsilon$, and $y^{(i)} = g(\delta, \epsilon^{(i)})$.
A proof of Proposition 3.2 is given in the supplementary material.
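On the same toy Gaussian problem used to illustrate the score-function estimator, the non-Bayesian reparametrization estimator differentiates through each sample directly; this sketch applies the chain rule by hand in place of automatic differentiation, and the problem setup is again an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 200_000

# Reparametrize Y ~ N(mu, sigma^2) as y = g(mu, eps) = mu + sigma * eps
# with eps ~ N(0, 1) independent of mu: each sample is now a
# differentiable function of mu, so the gradient moves inside the mean:
#   d/dmu E[chi(Y)] ≈ (1/n) sum_i chi'(y_i) * dg/dmu,  with dg/dmu = 1.
eps = rng.normal(size=n)
y = mu + sigma * eps

grad_estimate = float(np.mean(2.0 * y))  # chi(y) = y^2  =>  chi'(y) = 2y

# True gradient of E[Y^2] = mu^2 + sigma^2 w.r.t. mu is again 2 * mu.
```

For the same number of samples, this pathwise estimate is markedly less noisy than the score-function one on this problem, which mirrors the comparison made in the experiments.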
3.2 Reparametrization of Probabilistic Networks
Here, we discuss the question of reparametrizing probabilistic autoregressive models. The stochasticity of such architectures comes from the iterated sampling of $Y_t$. Assuming a Gaussian likelihood, the value $Y_t$ follows a normal distribution $\mathcal{N}(\mu_t, \sigma_t^2)$. Let $\epsilon = (\epsilon_{t_0+1}, \ldots, \epsilon_T)$ be a standard normal random vector, i.e., all of its components are independent and each is a zero-mean unit-variance normally distributed random variable. Iteratively writing:

$Y_t = \mu_t + \sigma_t\, \epsilon_t$

for all $t$ such that $t_0 < t \le T$ yields a valid reparametrization. This simple reasoning applies to the particular implementation described in (Salinas et al., 2019) and adapts readily to any kind of likelihood parameterized by location and scale, such as the Laplace or logistic distribution.
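This iterative location-scale reparametrization can be sketched as follows, with illustrative stand-in coefficients for the network heads producing $(\mu_t, \sigma_t)$: all randomness is drawn up front as standard normal variates, making the rollout a deterministic (and hence differentiable) function of its inputs.

```python
import numpy as np

def reparam_rollout(x_last, horizon, eps):
    """Reparametrized rollout of a toy Gaussian autoregressive model:
    instead of calling a sampler at each step, consume pre-drawn standard
    normal variates eps_t via Y_t = mu_t + sigma_t * eps_t."""
    prev, path = x_last, []
    for t in range(horizon):
        mu = 0.9 * prev                    # illustrative stand-ins for the
        sigma = 0.1 * abs(prev) + 0.01     # network heads producing (mu, sigma)
        prev = mu + sigma * eps[t]         # location-scale reparametrization
        path.append(prev)
    return np.array(path)

rng = np.random.default_rng(0)
sample = reparam_rollout(1.0, 10, rng.normal(size=10))  # one random path
mean_path = reparam_rollout(1.0, 10, np.zeros(10))      # eps = 0: mean path
```

Because the sampling noise enters only through `eps`, an autodiff framework can backpropagate through the whole rollout with respect to the inputs, which is exactly what the reparametrization attack requires.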
The case of mixture density likelihoods is more complex, as they do not belong to this category of "location-scale" distributions, and their inverse cumulative density function does not admit a simple closed form. The problem of how to adapt reparametrization to mixture densities is outside the scope of this paper, and we refer to the relevant literature (Graves, 2016; Figurnov et al., 2018; Jankowiak and Obermeyer, 2018) for more information about this question.
4 Case Study: Stock Market Trading
In this section, we apply probabilistic autoregressive models to the domain of financial decision making and discuss the types of adversarial attacks that arise in it.
Output Sequence Statistics
While a given machine learning model is typically trained to predict future stock prices given their past, various statistics of the output sequence are used in downstream algorithmic trading and option pricing tasks. This is the reason why the approach presented so far already assumed the presence of such statistics. The different statistics used in our evaluation are shown in Table 2 and include predicting the cumulated stock return, pricing derivatives such as European call and put options (Black and Scholes, 1973), as well as an example of a binary statistic that predicts the success probability of limit orders (Handa and Schwartz, 1996). All statistics are defined with respect to the last known price $x_{t_0}$.
4.1 Probabilistic Autoregressive Models Performance
Before we show the effectiveness of our adversarial attacks, we first demonstrate that using probabilistic autoregressive models leads to state-of-the-art results. We use two baselines representing the current state of the art for financial predictions: LSTM networks (Fischer and Krauss, 2018) and Temporal Convolutional Networks (TCN) (Borovykh et al., 2017). We provide a detailed description of all the training hyperparameters, the dataset used (S&P 500) and an extended version of all the experiments in the supplementary material.
Horizon | Portfolio size | Non-probabilistic: TCN (Borovykh et al., 2017) | Non-probabilistic: LSTM (Fischer and Krauss, 2018) | Probabilistic: This Work
1 | 10 | 3.53 (± 0.49) | 4.89 (± 0.39) | 4.37 (± 0.51)
1 | 30 | 1.74 (± 0.30) | 2.41 (± 0.24) | 2.35 (± 0.23)
1 | 100 | 0.70 (± 0.19) | 0.93 (± 0.10) | 0.99 (± 0.12)
5 | 10 | 5.57 (± 1.93) | 8.86 (± 1.03) | 9.02 (± 1.52)
5 | 30 | 3.40 (± 1.36) | 5.34 (± 0.61) | 5.66 (± 0.87)
5 | 100 | 1.64 (± 0.78) | 2.47 (± 0.28) | 2.70 (± 0.48)
10 | 10 | 6.21 (± 3.52) | 9.68 (± 1.58) | 9.55 (± 2.30)
10 | 30 | 4.28 (± 2.69) | 6.39 (± 0.76) | 6.63 (± 1.70)
10 | 100 | 2.09 (± 1.58) | 3.12 (± 0.52) | 3.48 (± 1.03)
Statistic | Horizon | Strike/Limit | Non-probabilistic: TCN (Borovykh et al., 2017) | Non-probabilistic: LSTM (Fischer and Krauss, 2018) | Probabilistic: This Work
Cum. Return | 10 | – | 1.548 (± 0.029) | 1.541 (± 0.019) | 1.002 (± 0.008)
European Call | 10 | 1 | 1.122 (± 0.002) | 1.121 (± 0.002) | 0.982 (± 0.005)
European Put | 10 | 1 | 1.302 (± 0.003) | 1.300 (± 0.002) | 0.974 (± 0.005)
Limit Sell | 10 | 1.05 | 1.516 (± 0.001) | 1.514 (± 0.002) | 0.940 (± 0.006)
Limit Buy | 10 | 0.95 | 1.412 (± 0.000) | 1.410 (± 0.001) | 0.958 (± 0.008)
Long-Short Trading Strategies
Given a prediction horizon $T$, we analyze the characteristics of the following portfolio: at each timestep, buy (long) the $k$ stocks for which the model predicts the highest gain, and sell (short) the $k$ stocks with the highest predicted loss. This task is a generalization of the one presented in (Fischer and Krauss, 2018), where only direct prediction ($T = 1$) is considered.
Formally, we consider the cumulative return statistic of the output sequence, which corresponds to the gain of investing one dollar in the stock at time $t_0$ and selling it at time $T$. In the non-Bayesian setting, we estimate the expectation of this statistic via Monte-Carlo sampling for each stock, and buy (or sell) the stocks for which the estimate is the highest (or the lowest). Note that this setting also applies to the deterministic baselines: it suffices to consider that the output distribution is a Dirac distribution centered at the deterministic prediction.
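The portfolio construction just described can be sketched in a few lines; the price paths and return estimates below are made up purely for illustration.

```python
import numpy as np

def cumulative_return(last_price, sampled_paths):
    """Monte-Carlo estimate of the cumulative return statistic for one
    stock: expected final sampled price divided by the last known price."""
    return float(np.mean(sampled_paths[:, -1])) / last_price

def long_short_portfolio(expected_returns, k):
    """Indices of the k stocks to buy (highest expected cumulative return)
    and the k stocks to short (lowest)."""
    order = np.argsort(expected_returns)
    return order[-k:], order[:k]          # (long, short)

# Hypothetical expected cumulative returns for five stocks
est = np.array([1.02, 0.97, 1.10, 1.00, 0.93])
longs, shorts = long_short_portfolio(est, k=2)
```

Feeding each stock's Monte-Carlo samples through `cumulative_return` and ranking the results is all the strategy needs; a deterministic baseline simply contributes a single "sample" per stock.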
The performance of all models is summarized in Table 1. We can see that the TCN is consistently outperformed by both the probabilistic and the non-probabilistic LSTM models. The probabilistic model is generally outperformed by the LSTM for direct prediction ($T = 1$), but has better performance on iterated prediction ($T > 1$), provided that enough samples are used for the Monte-Carlo estimation. We observe that a large number of samples (at least 1000) is required to match the LSTM performance. We provide extended evaluation results in the supplementary material, including the effect of the number of samples.
Quality of the Probabilistic Forecast
Table 2 shows an evaluation of the forecast quality for each of the statistics described earlier. To compare deterministic and probabilistic forecasts, we use the Ranked Probability Score (RPS) (Weigel et al., 2007) of the prediction as our metric. Because it applies only to predictions with a finite output space, we first discretize the output before applying the RPS. (There exists a continuous version (Gneiting and Raftery, 2007), but it is impractical for our setting because of the memory consumption of computing the score: we favor metrics computable in an online fashion with respect to the sampling process.) Here, a lower score means a better prediction. We provide extended evaluation results in the supplementary material, including multiple different values for the horizon, the strike/limit price and the number of samples for the Monte-Carlo estimation.
4.2 Market Manipulations
The possibility of artificially influencing stock prices to make a profit has always been a major problem of financial markets (Allen and Gale, 1992; Diaz et al., 2011; Öğüt et al., 2009). In our work, we focus on trade-based manipulation, in which a trader attempts to manipulate the price of a stock only by buying and then selling, without taking any other publicly observable action. The core of such an attack is to anticipate the reactions of other agents to a provoked market event, in order to drive the price up or down. To decrease the cost and visibility of the attack, an additional constraint for the manipulating trader is to minimize the amplitude of the perturbation. This creates a natural connection with finding adversarial perturbations over the inputs $x$.
Adversarial Attacks Scenario
To measure the perturbation size, we choose a variant of the Euclidean norm specifically tailored to stock price data, defined in Equation 11, where each component $\delta_t$ is normalized by the corresponding price $x_t$. This normalization captures the fact that stock prices are quoted for an arbitrary unit quantity of the underlying asset, and thus the norm should be invariant with respect to multiplication by a scalar. In addition, we add a box constraint to the perturbed prices such that they remain positive. This constraint is enforced using projected gradient descent.
$\|\delta\|_x = \sqrt{\sum_{t=1}^{t_0} \left(\frac{\delta_t}{x_t}\right)^2}$  (11)
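A sketch of this norm together with the positivity projection used for projected gradient descent; the floor value in the projection is an illustrative choice.

```python
import numpy as np

def price_norm(delta, x):
    """Scale-invariant perturbation norm of Equation 11: each component
    delta_t is normalized by the corresponding price x_t."""
    return float(np.sqrt(np.sum((delta / x) ** 2)))

def project_positive(x, delta, floor=1e-6):
    """Box-constraint step for projected gradient descent: clip the
    perturbed prices so that they stay strictly positive."""
    return np.maximum(x + delta, floor) - x

x = np.array([100.0, 50.0, 10.0])
delta = 0.01 * x                       # a 1% relative bump on every price
```

Under this metric a 1% relative perturbation of all $t_0$ prices has norm $0.01\sqrt{t_0}$ regardless of the absolute price levels, which is exactly the scale invariance the normalization is meant to provide.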
5 Experimental Results
In this section, we evaluate the effectiveness of our approach for generating adversarial attacks on probabilistic autoregressive models. The two key results of our evaluation are:

The reparametrization estimator leads to significantly better adversarial examples (i.e., with a smaller perturbation norm) than the score-function estimator.

The reparametrization estimator successfully generates adversarial attacks for a number of different tasks. For example, using a small perturbation norm $\epsilon$ (by the definition of the norm in Equation 11, a norm of $\epsilon$ corresponds to perturbing one price in the sequence by a relative amount $\epsilon$, or 10 prices by $\epsilon/\sqrt{10}$ each, or 100 prices by $\epsilon/10$ each), the attack is powerful enough to cause financial loss when applied to stock market trading.
Datasets
We evaluate on the following two datasets:
S&P 500 dataset, which contains historical prices of S&P 500 constituents from 1990/01 to 2000/12. We consider study periods of four consecutive years. The first three years serve as training data, while the last year is used for out-of-sample testing. The different periods have non-overlapping test years, resulting in eight different periods (for each we train four different models) with test years going from 1993 to 2000. We generate input/output samples by considering sequences of 251 consecutive daily prices for a fixed constituent. The first 241 prices serve as input $x$, while the last 10 are the ground truth output $y$. We ensure that output sequences from the training and test sets do not overlap, and we reserve part of the training samples as a validation set. We use the same preprocessing as in prior work (described in the supplementary material) and train our own probabilistic autoregressive model (described in Section 2). The order of magnitude of the cumulated test set size is .
UCI electricity dataset (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014), which contains the electricity consumption of 370 households from 2011 to 2014, downsampled to hourly measurements. For this dataset, we reuse an existing implementation and already-trained models provided by Zhang and Jiang (https://github.com/zhykoties/TimeSeries). The model is trained on data from 2011/01 to 2014/08 (included), and we perform the attack on test samples from 2014/09. The input sequence consists of 168 consecutive measurements, and the network predicts the next 24 values (corresponding to the next day). The total number of test samples is 2590.
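The sliding-window sample generation described for the S&P 500 dataset can be sketched as follows; the stride that keeps consecutive output windows disjoint is inferred from the non-overlap requirement and is an assumption of this sketch.

```python
import numpy as np

def make_samples(prices, window=251, n_in=241):
    """Slice one constituent's daily price series into (input, output)
    pairs: the first 241 prices form x, the last 10 the ground truth y.
    Advancing by window - n_in = 10 days keeps output windows disjoint."""
    stride = window - n_in
    pairs = []
    for start in range(0, len(prices) - window + 1, stride):
        w = prices[start:start + window]
        pairs.append((w[:n_in], w[n_in:]))
    return pairs

prices = np.arange(271, dtype=float)   # toy series of 271 trading days
pairs = make_samples(prices)
```

The same function applies unchanged to the electricity data with `window=192` and `n_in=168`, since only the window lengths differ between the two datasets.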
Experimental Setup
We performed all experiments on a machine running Ubuntu 18.04, with a 2.00 GHz Intel Xeon E5-2650 CPU and a single GeForce RTX 2080 Ti GPU. For the S&P 500 dataset, each model's training time is under one hour, and running the attack on one model for all test-set elements of a period takes approximately 24 hours. For the electricity dataset, running the attack on a batch of 256 test sequences takes approximately three hours.
5.1 Attacks on Buy/Sell Classification
We start by considering a classification task on the S&P 500 dataset where each sample is classified as buy, sell or uncertain. For each stock, we predict the cumulated return using the corresponding statistic of the output sequence. Then, let $\tau$ be a threshold used to decide whether to buy or sell the stock. Concretely, if the confidence interval (assuming a Student's t-distribution) of the estimate is entirely above $\tau$, the stock is classified as buy. If it is entirely below $\tau$, it is classified as sell. Finally, if $\tau$ is inside the confidence interval, the classification is uncertain. We set $\tau$ to be the average over all stocks of the ground truth cumulated return, which leads to roughly balanced decisions to buy and sell.
We attack the statistic twice. First, we perturb all samples initially classified as buy or uncertain, in order to make them classify as sell. Similarly, we perturb all samples initially classified as sell or uncertain, in order to make them classify as buy. The target of the attack is set a fixed margin below the threshold $\tau$ for the sell attack and the same margin above it for the buy attack; the margin is fixed in our experiments. This is aimed at making the confidence interval fit entirely in the buy (resp. sell) zone. Indeed, with $n$ samples, the width of the confidence interval scales as $1/\sqrt{n}$.
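The confidence-interval classification rule can be sketched as follows; using the large-sample normal quantile 1.96 in place of the exact Student-t quantile is a simplifying assumption, reasonable for the sample counts involved.

```python
import numpy as np

def classify(samples, tau, z=1.96):
    """Buy/sell/uncertain decision from Monte-Carlo samples of the
    cumulated-return statistic: build a confidence interval around the
    sample mean and compare it with the threshold tau."""
    mean = float(np.mean(samples))
    half = z * float(np.std(samples, ddof=1)) / np.sqrt(len(samples))
    if mean - half > tau:
        return "buy"
    if mean + half < tau:
        return "sell"
    return "uncertain"

rng = np.random.default_rng(0)
up = 1.05 + 0.01 * rng.normal(size=1000)   # samples well above tau = 1.0
decision = classify(up, tau=1.0)
```

An attack succeeds precisely when it moves the whole interval across $\tau$, which is why the attack target sits a margin beyond the threshold rather than exactly on it.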
The results without a Bayesian observation are summarized in Figure 1 a) and show that the reparametrization estimator is significantly better at generating adversarial examples than the score-function estimator. For example, the attack using the reparametrization estimator succeeds in substantially more cases. The reparametrization estimator can also successfully attack the model in a Bayesian setting, with similar results. We include such an experiment, which uses a smaller horizon and a Bayesian observation, in the supplementary material.
5.2 Attacks on Long-Short Trading Strategies
Next, we evaluate the impact of attacking the cumulated return statistic on the financial gain of the long-short trading strategy described in Section 4.1. We suppose that the attacker is allowed to perturb all inputs at test time without changing the corresponding ground truth outputs, with a maximum tolerance $\epsilon$ on the perturbation norm. We consider a fixed horizon and different portfolio sizes $k$. Given a ground truth output $y$, the attack target is obtained by reflecting the rescaled ground truth around the buy/sell threshold $\tau$ defined previously, where the scaling factor maps the ground truth output to the prediction range of the network. Intuitively, this attack target corresponds to reversing the prediction of the network around its average, in order to swap the top-k and flop-k stocks.
We report the returns of the perturbed portfolios in Figure 1 b). We observe that the reparametrization estimator is again significantly better than the score-function estimator. Additionally, in both experiments, adversarial perturbations start to appear for some of the samples at approximately the same perturbation threshold.
5.3 Attacks on Electricity Consumption Prediction
We perturb each input sequence twice, in order to make the consumption forecast abnormally high (resp. low). We designate these as the overestimation (resp. underestimation) attack. The attacked statistic is the predicted consumption over the forecast horizon. Given an input $x$, we first approximate its expected value, and then set the attack target to a significantly larger (resp. smaller) value to cause over- (resp. under-) estimation.
We show the attack success for different perturbation tolerances in Figure 1 c). We observe that the reparametrization estimator (continuous line) yields a stronger underestimation of the prediction than the score-function estimator (dashed line). In Figure 1 d), we give examples of generated adversarial samples for the underestimation attack. We observe a recurrent pattern in the underestimation attack, where the perturbed prediction closely matches the original prediction for the first timesteps, but eventually becomes significantly lower. In the supplementary material, we provide similar figures for the overestimation attack.
6 Related Work
Probabilistic Autoregressive Forecasting Models
Probabilistic autoregressive forecasting models have been used in diverse applications. Schittenkopf et al. (2000) developed the Recurrent Mixture Density Network (RMDN) to predict stock price volatility iteratively, and were the first to propose using Monte-Carlo methods for iterated prediction. The RMDN is based on a vanilla recurrent neural network coupled with a Gaussian mixture likelihood. The recent DeepAR architecture (Salinas et al., 2019) uses several LSTM layers with a Gaussian likelihood, and has been applied to forecasting electricity consumption, car traffic and business sales. Follow-up work (Chen et al., 2019) considers an alternative TCN-based architecture. Our characterization of probabilistic forecasting models encompasses these different architectures, and presents the following novelties: (i) it generalizes inference to any statistic of the output sequence, and (ii) it extends the prior work to a Bayesian inference setting.
Stock-Market Prediction
Various methods have been applied to predicting future stock prices, including random forests, gradient-boosted trees and logistic regression. Two notable deterministic neural models applied to this task are the TCN (Borovykh et al., 2017) and the LSTM (Fischer and Krauss, 2018), which achieved state-of-the-art results. In our work, we trained a probabilistic autoregressive model for the same task and achieved comparable or even better results. Further, there exists a parallel line of work that performs density estimation, but apart from the RMDN (Schittenkopf et al., 2000), most papers restrict themselves to one-step-ahead prediction (Ormoneit and Neuneier, 1996), or to less-expressive but solvable dynamics such as GARCH (Duan, 1995).

Adversarial Attacks
A growing body of recent work on adversarial attacks deals with generating small input perturbations causing mispredictions (see (Wiyatno et al., 2019) for a survey). The objective function defined in Equation 9 is standard in generating adversarial examples (Szegedy et al., 2013; Carlini and Wagner, 2017). However, to the best of our knowledge, this is the first time that adversarial attacks are applied to probabilistic autoregressive models. The most closely related work is the adversarial training of smoothed classifiers (Salman et al., 2019), where the noise applied to the input leads to stochastic behavior.
Robust Algorithms for Financial Decision Making The adoption of machine learning in financial decision making makes it crucial to develop algorithms robust against small environment variations. Recent work here includes the robust assessment of loan applications (Ballet et al., 2019), deepfakes on accounting journal entries (Schreyer et al., 2019), and robust inverse reinforcement learning on market data (Roa-Vicens et al., 2019). Adversarial attacks against stock-market prediction algorithms were studied by Feng et al. (2018). Compared to the latter, our work is the first to operate on a probabilistic network for iterated prediction.

Reparametrization The reparametrization trick has been applied in several fields under different names: perturbation analysis/pathwise derivatives (Glasserman and Ho, 1991) in stochastic optimization, stochastic backpropagation (Rezende et al., 2014), affine independent variational inference (Challis and Barber, 2012), or correlated sampling in evaluating differential privacy (Bichsel et al., 2018). The actual reparametrization of our model resembles that of the deep generative model of Rezende et al. (2014), but there it is used to perform variational inference.

7 Conclusion
In this work, we explored applying adversarial attacks to a recently proposed probabilistic autoregressive forecasting model. Our work is motivated by the fact that: (i) this model has been included in the Amazon SageMaker toolkit and achieved state-of-the-art results on a number of different tasks, and (ii) adversarial attacks and robustness are pressing and important issues that affect it.
Concretely, we implemented and evaluated two techniques, reparametrization and score-function estimators, that are used to differentiate through the Monte Carlo estimation inherent to this model and to instantiate existing gradient-based adversarial attacks. While we show that both of these techniques can be used to generate adversarial attacks, we find that using the reparametrization estimator is crucial for producing adversarial attacks with a small perturbation norm. Further, we extend the prior work to the Bayesian setting, which enables using these models with new types of queries.
References
 Stock-price manipulation. The Review of Financial Studies 5 (3), pp. 503–529. Cited by: §4.2.
 An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §B.2, §2.1.
 Imperceptible adversarial attacks on tabular data. arXiv preprint arXiv:1911.03274. Cited by: §6.
 Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 1171–1179. Cited by: §2.1.
 DP-Finder: finding differential privacy violations by sampling and optimization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 508–524. Cited by: §6.
 Mixture density networks. Cited by: §B.2.
 The pricing of options and corporate liabilities. Journal of political economy 81 (3), pp. 637–654. Cited by: §4.

 Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691. Cited by: §B.2, Table 3, §4.1, Table 1, Table 2, §6.
 Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §3, §3.1, §6.
 Affine independent variational inference. In Advances in Neural Information Processing Systems, pp. 2186–2194. Cited by: §6.
 Probabilistic forecasting with temporal convolutional neural network. arXiv preprint arXiv:1906.04397. Cited by: §6.
 Analysis of stock market manipulations using knowledge discovery techniques applied to intraday trade prices. Expert Systems with Applications 38 (10), pp. 12757–12771. Cited by: §4.2.
 The GARCH option pricing model. Mathematical finance 5 (1), pp. 13–32. Cited by: §6.
 Enhancing stock movement prediction with adversarial training. arXiv preprint arXiv:1810.09936. Cited by: §6.
 Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pp. 441–452. Cited by: §3.2.
 Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research 270 (2), pp. 654–669. Cited by: §B.1, §B.1, §B.2, Table 3, §4.1, §4.1, Table 1, Table 2, §6.
 Gradient estimation via perturbation analysis. Vol. 116, Springer Science & Business Media. Cited by: §6.
 Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM 33 (10), pp. 75–84. Cited by: §1, §3.1.1.
 Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association 102 (477), pp. 359–378. Cited by: footnote 2.
 Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §3.1.
 Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §1.
 Stochastic backpropagation through mixture density distributions. arXiv preprint arXiv:1607.05690. Cited by: §3.2.
 Limit order trading. The Journal of Finance 51 (5), pp. 1835–1861. Cited by: §4.
 Estimates and confidence intervals for importance sampling sensitivity analysis. Mathematical and Computer Modelling 23 (8–9), pp. 79–85. Cited by: §B.4.
 Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §2.1.
 Pathwise derivatives beyond the reparameterization trick. arXiv preprint arXiv:1806.01851. Cited by: §3.2.
 Autoencoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §3.1.2.
 Adam: a method for stochastic optimization. External Links: 1412.6980 Cited by: §B.3.
 Optimization and sensitivity analysis of computer simulation models by the score function method. European Journal of Operational Research 88 (3), pp. 413–427. Cited by: §1.
 Adversarial examples in the physical world. ICLR’17 Workshop. Cited by: §3.1.
 Adversarial machine learning at scale. ICLR’17. Cited by: §3.1.
 Detecting stock-price manipulation in an emerging market: the case of Turkey. Expert Systems with Applications 36 (9), pp. 11944–11949. Cited by: §4.2.
 WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §1.
 Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems 29, pp. 4790–4798. Cited by: §1.
 Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, ICML’16, pp. 1747–1756. Cited by: §1.
 Experiments in predicting the German stock index DAX with density estimating neural networks. In IEEE/IAFE 1996 Conference on Computational Intelligence for Financial Engineering (CIFEr), pp. 66–71. Cited by: §6.
 The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387. Cited by: §3.
 Black box variational inference. In Artificial Intelligence and Statistics, pp. 814–822. Cited by: §3.1.1.
 Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082. Cited by: §1, §3.1.2, §6.
 The mathematical foundations of barrier optionpricing theory. Advances in futures and options research 7. Cited by: §2.2.
 Adversarial recovery of agent rewards from latent spaces of the limit order book. arXiv preprint arXiv:1912.04242. Cited by: §6.

 Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis 8 (4), pp. 837–882. Cited by: §1, §3.1.2.
 DeepAR: probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting. Cited by: §B.1, §1, §1, §2.1, §2.1, §2, §3.2, §6.
 Provably robust deep learning via adversarially trained smoothed classifiers. In Advances in Neural Information Processing Systems, pp. 11289–11300. Cited by: §6.
 Forecasting time-dependent conditional densities: a semi-nonparametric neural network approach. Journal of Forecasting 19 (4), pp. 355–374. Cited by: §1, §2.1, §6, §6.
 Adversarial learning of deepfakes in accounting. arXiv preprint arXiv:1910.03810. Cited by: §6.
 Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1, §3, §3.1, §3.1, §6.

 Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4 (2), pp. 26–31. Cited by: §B.2.
 Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §2.1.
 The discrete Brier and ranked probability skill scores. Monthly Weather Review 135 (1), pp. 118–124. Cited by: Table 3, §4.1, Table 2.
 A learning algorithm for continually running fully recurrent neural networks. Neural computation 1 (2), pp. 270–280. Cited by: §2.1.
 Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3–4), pp. 229–256. Cited by: §3.1.1.
 Adversarial examples in modern machine learning: a review. External Links: 1911.05268 Cited by: §6.
Appendix A Proofs
Score-function Estimator.
In the general Bayesian setting where , the score-function gradient estimator of the expected value of is:
where is sampled from the prior distribution , and denotes the probability that is true knowing that is generated.
Proof.
The expectation is defined as the following integral over the output space:
Using the Leibniz integral rule, we obtain
at every point around which the gradient is locally continuous (in the model described in this paper, this regularity condition holds everywhere). The resulting integral can be transformed as follows into an expectation over the distribution .
This expectation can be approximated via Monte Carlo methods. While it is in general not possible to directly sample from , one can instead generate samples from the prior and attribute an importance weight to each of the resulting samples, yielding:
The choice of as the importance weight for results from an application of Bayes' rule:
∎
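As an illustration, the score-function estimator can be sketched in a few lines of PyTorch. This is a minimal sketch of the non-Bayesian case (no importance weights); the function names and the simple Gaussian test model are our own, not the paper's implementation.

```python
import torch

def score_function_gradient(x, log_likelihood, sample, statistic, L=100):
    """Monte Carlo score-function (REINFORCE) estimator of
    d/dx E_{z ~ p(.|x)}[f(z)] = E[f(z) * d/dx log p(z|x)].

    log_likelihood(x, z): log p(z|x), differentiable in x.
    sample(x): draws one z from p(.|x); its gradient is not needed.
    statistic(z): the scalar statistic f(z); may be non-differentiable.
    All names here are illustrative, not the paper's API.
    """
    grad = torch.zeros_like(x)
    for _ in range(L):
        z = sample(x).detach()             # the sample is treated as a constant
        logp = log_likelihood(x, z).sum()  # scalar log-density, depends on x
        (g,) = torch.autograd.grad(logp, x)
        grad += float(statistic(z)) * g
    return grad / L
```

Note that `statistic` is only evaluated, never differentiated, which is the main appeal of the estimator; the price is its high variance.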
Reparametrization Estimator.
Assume there exists a differentiable transformation such that the random variable can be reparametrized as , where is an independent random variable whose marginal distribution is independent of . Then the importance-sampling reparametrization estimator of the gradient of the expectation is:
where for , is sampled from the distribution , and .
Proof.
Approximating the expectation via Monte Carlo estimation with importance sampling yields:
where is sampled from the prior distribution . With the assumptions of the theorem, we can rewrite:
Since the respective effects of the perturbation and of the randomness are decoupled in this final expression, it is differentiable with respect to , which concludes the proof. ∎
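The reparametrization estimator admits a similarly compact sketch. Again this shows the non-Bayesian case with illustrative names of our own choosing, not the paper's implementation.

```python
import torch

def reparam_gradient(x, transform, statistic, L=100):
    """Monte Carlo reparametrization (pathwise) estimator of
    d/dx E_{z ~ p(.|x)}[f(z)]: write z = g(eps, x) with eps drawn from a
    fixed noise distribution, then differentiate the sample average.

    transform(eps, x): the map g, differentiable in x.
    statistic(z): a differentiable scalar statistic f(z).
    Names are illustrative, not the paper's API.
    """
    eps = torch.randn(L, *x.shape)          # noise drawn independently of x
    z = transform(eps, x)                   # pathwise dependence on x
    vals = torch.stack([statistic(z[i]) for i in range(L)])
    (grad,) = torch.autograd.grad(vals.mean(), x)
    return grad
```

Unlike the score-function estimator, this requires `statistic` and `transform` to be differentiable, but typically yields much lower variance.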
Appendix B Experimental Details
Here we provide details of all our experiments to support reproducibility. Additionally, we will make all our datasets and source code available online.
B.1 Datasets and Preprocessing
S&P500
The S&P500 dataset is obtained via the yfinance API (https://github.com/ranaroussi/yfinance). We focus on datapoints between 1990/01 and 2000/12, identified by Fischer and Krauss (2018) as a period of exceptionally high trading returns compared to the following decades. We also follow Fischer and Krauss for preprocessing the data. A sequence of prices is first preprocessed to obtain a sequence of returns , defined as . Intuitively, is the gain (when positive) or loss obtained by investing one dollar in the stock at time , and then selling at time . Conversely, given a sequence of returns and an initial price , the corresponding sequence of prices can be obtained as:
Both transformations are differentiable, which allows performing the attack in the application space of prices rather than on returns. In addition, returns are normalized to have zero mean and unit variance. Denoting by and the mean and standard deviation of returns in the training set, the normalized sequence is , where . We refer to Fischer and Krauss (2018) for a thorough analysis of the properties of the S&P500 dataset.
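A minimal sketch of the two transformations, assuming the standard definition of simple returns r_t = p_t / p_{t-1} - 1 (the exact formula is elided in the text above, so this definition is our assumption):

```python
import numpy as np

def prices_to_returns(prices):
    """r_t = p_t / p_{t-1} - 1: the assumed gain (or loss) of one dollar
    invested at t-1 and sold at t."""
    prices = np.asarray(prices, dtype=float)
    return prices[1:] / prices[:-1] - 1.0

def returns_to_prices(returns, p0):
    """Inverse map: p_t = p0 * prod_{i <= t} (1 + r_i)."""
    return p0 * np.cumprod(1.0 + np.asarray(returns, dtype=float))
```

Both maps are compositions of differentiable operations, which is what allows back-propagating the attack gradient through them.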
Electricity Dataset We use the same preprocessing steps as described in (Salinas et al., 2019). Input sequences are divided by their average value , and the corresponding prediction sequence is multiplied by . This guarantees that all inputs are approximately in the same range.
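The scaling step can be sketched as follows; `mean_scale` and `unscale` are illustrative helpers of our own, not the actual DeepAR code.

```python
import numpy as np

def mean_scale(context):
    """Divide the input window by its average value nu; the model's
    forecast is later multiplied back by nu."""
    nu = np.mean(context)
    return context / nu, nu

def unscale(prediction, nu):
    """Map a forecast produced on the scaled inputs back to the
    original range."""
    return prediction * nu
```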
B.2 Neural Architectures: S&P500 Dataset
LSTM
For the LSTM baseline used on the S&P500 dataset, we follow Fischer and Krauss (2018) and use a single LSTM layer with 25 hidden units, followed by a linear output layer. However, we use only one input neuron without activation instead of two neurons with softmax activation.
TCN In (Borovykh et al., 2017), several sets of hyperparameters for Temporal Convolutional Networks are used depending on the experiment. We decided to use layers and a dilation of , in order to match as closely as possible the size of the LSTM receptive field. We selected the other parameters via gridsearch, resulting in a kernel size of and channels. We use the TCN implementation provided by the authors of (Bai et al., 2018).
Ours
For our probabilistic autoregressive model, we chose to use a single LSTM layer with 25 hidden units, as in the LSTM baseline, in order to guarantee the fairest possible comparison. We only changed the output layers to parametrize a Gaussian distribution. Following Bishop (1994), we use a linear layer without activation for the mean, and a linear layer with exponential activation for the scale of the distribution. We also performed experiments with a Gaussian mixture likelihood, but it did not improve the performance on our two benchmarks.
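The described output layers can be sketched in PyTorch as follows; `GaussianHead` is an illustrative name, and the single-feature output is an assumption.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Parametrizes a Gaussian likelihood on top of an LSTM hidden state:
    a linear layer without activation for the mean, and a linear layer
    with exponential activation for the (positive) scale."""

    def __init__(self, hidden_size):
        super().__init__()
        self.mu = nn.Linear(hidden_size, 1)
        self.pre_sigma = nn.Linear(hidden_size, 1)

    def forward(self, h):
        # exp guarantees a strictly positive scale parameter
        return self.mu(h), torch.exp(self.pre_sigma(h))
```

The exponential activation is one common way to keep the scale positive; a softplus would serve the same purpose.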
Training
For both deterministic networks, we minimize mean-squared error on the training set. For our model, we use negative log-likelihood as the loss function. In both cases, we use the RMSProp optimizer (Tieleman and Hinton, 2012) advised by Fischer and Krauss, with default parameters and a learning rate of . We use an early-stopping patience of and a large batch size of for training. Experiments with different values did not reveal a significant influence of these parameters.
B.3 Neural Architectures: Electricity Dataset
The architecture used for the Electricity experiments is based on a three-layer LSTM with 40 hidden units each. The number of samples used for Monte Carlo estimation of the output is set to 200. The network is trained for 20 epochs with the Adam optimizer (Kingma and Ba, 2014), with a batch size of 64 and a learning rate of 0.001.
B.4 Attack Hyper-Parameters: S&P500 Dataset
For the S&P500 dataset, we optimize the attack objective function with the RMSProp optimizer, using a learning rate of and iterations. These parameters were selected with a simple grid search, because of the computational cost of running the attack repeatedly. The values used for the coefficient are . We select the value that yields the best adversarial sample under the constraint that the perturbation norm is below the tolerance . The number of samples used to estimate the gradient is chosen to be .
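The selection rule over the grid of coefficients can be sketched as follows (an illustrative helper, not the actual attack code):

```python
def select_adversarial(candidates, tolerance):
    """Among attacks run with different trade-off coefficients c, keep
    the one with the lowest attack loss whose perturbation norm stays
    below the tolerance.

    candidates: list of (c, perturbation_norm, attack_loss) tuples.
    Returns the chosen c, or None if no candidate is feasible.
    """
    feasible = [t for t in candidates if t[1] <= tolerance]
    if not feasible:
        return None
    return min(feasible, key=lambda t: t[2])[0]
```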
Buy/Sell Attack
We use for the target. For the Bayesian setting, we use , in order to approximately balance the different classes. The confidence interval is computed assuming a Student's t-distribution. In the Bayesian case, the formula for the confidence interval with importance sampling is derived in (Hesterberg, 1996).
Attack on Trading Strategies
We use for the target scaling factor.
Influence of c
In Figure 2, we examine the influence of tuning the coefficient of the attack objective function on the average perturbation norm and the distance to the attack target. We observe a trade-off between these two quantities that depends on the coefficient : a higher value of yields better adversarial samples, at the cost of more input perturbation.
Influence of L
We evaluate in Figure 3 the effect of the number of samples used in the reparametrization estimator on the attack loss. In this experiment, the value of is fixed to 1000. We notice a trade-off between convergence speed and final loss that depends on the number of samples used for estimating the gradient. As a result, we choose to use in our attacks.
B.5 Attack Hyper-Parameters: Electricity Dataset
We optimize the attack objective function with the Adam optimizer. We use different optimizers for the two datasets so that the same optimizer is used for training the network and to attack it. We use a learning rate of and iterations. These parameters were also selected via informal search. The values used for the coefficient are 0.1, 0.2, 0.3, 0.5, 0.7, 1, 2, 3, 5, 7, 10, 20, 30, 50, 70, 100, 200 and 300. The number of samples used to estimate the gradient is chosen to be .
Appendix C Experimental Results
C.1 Trading Strategies
In Figure 4, we provide extended results for the long-short trading benchmark (Section 4.1), with different horizons and numbers of samples used for Monte Carlo estimation of the prediction. We observe that the quality of the probabilistic prediction improves with the number of samples up to samples. Further increasing the number of samples does not yield significant performance improvements.
C.2 Evaluation of the Probabilistic Forecast
In Table 3, we give detailed results for the comparison of probabilistic forecast quality with the Ranked Probability Skill (a summary of these results is provided in Table 2, Section 4.1). We observe that the forecasting quality of our model improves with the number of samples, and that the order of magnitude of the number of samples needed to obtain the best possible estimation is . As a comparison, the implementation on the electricity dataset uses 200 samples. We surmise that this discrepancy is due to the low signal-to-noise ratio of financial data, which makes inference more difficult.
C.3 Bayesian Attack
In Figure 5, we plot the results of the classification attack in the Bayesian setting with observation , where (Section 5.1). We only implemented the reparametrization estimator, as the score-function estimator requires the overly complex estimation of for each sample . We observe that the attack success rate is very similar to the non-Bayesian setting, demonstrating that the reparametrization estimator adapts readily to the Bayesian setting. The attack success rate is approximately for .
C.4 Electricity Dataset
In Figure 6, we show results of both overestimation and underestimation attacks on the electricity dataset, with examples of generated adversarial samples (Section 5.3). We observe that, for an equal perturbation tolerance, the overestimation attack yields mispredictions of smaller amplitude. For instance, the reparametrization attack with causes a median overestimation of around , whereas it causes a median underestimation of . We believe that this is due to the particular nature of the dataset rather than an asymmetry in the attack; we do not observe such a discrepancy in the financial experiments.
| Statistic | Horizon | Strike/limit factor | TCN (Borovykh et al., 2017) | LSTM (Fischer and Krauss, 2018) | Ours, 1 sample | Ours, 100 samples | Ours, 10000 samples |
|---|---|---|---|---|---|---|---|
| Cumulated Return | 1 | – | 1.423 (±0.022) | 1.424 (±0.016) | 2.016 (±0.023) | 0.992 (±0.002) | 0.982 (±0.002) |
| Cumulated Return | 5 | – | 1.468 (±0.01) | 1.466 (±0.008) | 1.992 (±0.013) | 1.0 (±0.005) | 0.99 (±0.004) |
| Cumulated Return | 10 | – | 1.548 (±0.029) | 1.541 (±0.019) | 1.995 (±0.011) | 1.012 (±0.008) | 1.002 (±0.008) |
| European Call Option | 10 | 0.9 | 1.019 (±0.004) | 1.017 (±0.003) | 1.961 (±0.22) | 0.999 (±0.009) | 0.989 (±0.009) |
| European Call Option | 10 | 1 | 1.122 (±0.002) | 1.121 (±0.002) | 1.966 (±0.103) | 0.992 (±0.006) | 0.982 (±0.005) |
| European Call Option | 10 | 1.1 | 1.342 (±0.002) | 1.341 (±0.002) | 1.987 (±0.03) | 1.003 (±0.007) | 0.993 (±0.007) |
| European Put Option | 10 | 0.9 | 1.445 (±0.021) | 1.445 (±0.017) | 1.984 (±0.015) | 1.002 (±0.007) | 0.992 (±0.007) |
| European Put Option | 10 | 1 | 1.302 (±0.003) | 1.3 (±0.002) | 1.957 (±0.036) | 0.984 (±0.005) | 0.974 (±0.005) |
| European Put Option | 10 | 1.1 | 1.046 (±0.005) | 1.044 (±0.002) | 1.856 (±0.094) | 0.968 (±0.004) | 0.959 (±0.003) |
| Limit Sell | 10 | 1.01 | 2.822 (±0.501) | 3.137 (±0.307) | 1.917 (±0.021) | 1.013 (±0.008) | 1.004 (±0.008) |
| Limit Sell | 10 | 1.05 | 1.516 (±0.001) | 1.514 (±0.002) | 1.899 (±0.027) | 0.953 (±0.006) | 0.944 (±0.006) |
| Limit Sell | 10 | 1.20 | 1.035 (±0.003) | 1.034 (±0.002) | 1.792 (±0.12) | 0.951 (±0.006) | 0.942 (±0.005) |
| Limit Buy | 10 | 0.8 | 1.02 (±0.002) | 1.019 (±0.002) | 1.948 (±0.223) | 0.982 (±0.015) | 0.972 (±0.013) |
| Limit Buy | 10 | 0.95 | 1.412 (±0.0) | 1.41 (±0.001) | 1.926 (±0.039) | 0.967 (±0.009) | 0.958 (±0.008) |
| Limit Buy | 10 | 0.99 | 3.047 (±0.012) | 3.025 (±0.028) | 1.963 (±0.024) | 1.013 (±0.006) | 1.003 (±0.006) |