Adversarial Attacks on Probabilistic Autoregressive Forecasting Models

by   Raphaël Dang-Nhu, et al.

We develop an effective generation of adversarial attacks on neural models that output a sequence of probability distributions rather than a sequence of single values. This setting includes the recently proposed deep probabilistic autoregressive forecasting models that estimate the probability distribution of a time series given its past and achieve state-of-the-art results in a diverse set of application domains. The key technical challenge we address is effectively differentiating through the Monte-Carlo estimation of statistics of the joint distribution of the output sequence. Additionally, we extend prior work on probabilistic forecasting to the Bayesian setting which allows conditioning on future observations, instead of only on past observations. We demonstrate that our approach can successfully generate attacks with small input perturbations in two challenging tasks where robust decision making is crucial: stock market trading and prediction of electricity consumption.



There are no comments yet.


page 14


Adversarial attacks against Bayesian forecasting dynamic models

The last decade has seen the rise of Adversarial Machine Learning (AML)....

Modeling and Forecasting Art Movements with CGANs

Conditional Generative Adversarial Networks (CGANs) are a recent and pop...

Thought Flow Nets: From Single Predictions to Trains of Model Thought

When humans solve complex problems, they rarely come up with a decision ...

Universal time-series forecasting with mixture predictors

This book is devoted to the problem of sequential probability forecastin...

Probabilistic Forecasting and Simulation of Electricity Markets via Online Dictionary Learning

The problem of probabilistic forecasting and online simulation of real-t...

Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting

In this work, we propose TimeGrad, an autoregressive model for multivari...

Probabilistic Safety for Bayesian Neural Networks

We study probabilistic safety for Bayesian Neural Networks (BNNs) under ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Deep probabilistic autoregressive models have been recently integrated into the Amazon SageMaker toolkit and successfully applied to various kinds of sequential data such as handwriting

(Graves, 2013), speech and music (Oord et al., 2016a), images (Oord et al., 2016c, b) and time series from a number of different domains (Salinas et al., 2019). At a high level, given a sequence of input values (i.e., pen-tip locations, raw audio signals or stock market prices), the goal is to train a generative model that outputs an accurate sequence of next values, conditioned on all the previous values. The main benefit of such probabilistic models is that they model the joint distribution of output values, rather than predicting only a single best realization (i.e., the most likely value at each step). Predicting a density rather than just a single best value has several advantages – it naturally fits the inherently stochastic nature of many processes, allows to assess the uncertainty and the associated risk, and has been shown to produce better overall prediction accuracy when used in forecasting tasks (Salinas et al., 2019).

This Work

We develop an efficient approach for generating adversarial attacks on deep probabilistic autoregressive models. The issue of adversarial robustness and attacks (Szegedy et al., 2013; Goodfellow et al., 2014)

, i.e., generating small input perturbations that lead to mis-predictions, is an important problem with large body of recent work. Yet, to our best knowledge this is the first work that explores adversarial attacks in a new challenging setting where the neural network output is a sequence of probability distributions. The difficulty in generating adversarial attacks in this setting is that it requires computing the gradient of an expectation that it is too complex to be analytically integrated 

(Schittenkopf et al., 2000; Salinas et al., 2019) and is approximated using Monte-Carlo methods.

Differentiating through Monte-Carlo estimation of the Expectation

We address the key technical challenge of efficiently differentiating through the Monte-Carlo estimation, a necessary part of generating white-box gradient based adversarial attacks, by using two techniques to approximate the gradient of the expectation. The first approach is the score-function estimator (Glynn, 1990; Kleijnen and Rubinstein, 1996) obtained by inverting the gradient and the expectation’s integral. The second technique differentiates individual samples using a random variate reparametrizatrion and originates from the variational inference literature (Salimans et al., 2013; Kingma and Welling, 2013; Rezende et al., 2014).

Main Contributions We present the first approach for generating adversarial attacks on deep probabilistic autoregressive models by applying two techniques that differentiate through Monte-Carlo estimation of an expectation. We show that the reparametrization estimator is efficient at generating adversarial attacks and outperforms the score-function estimator by evaluating on two domains that benefit from stochastic sequential reasoning – stock market trading and electricity consumption. We make our code, datasets and scripts to reproduce our experiments available online111

2 Probabilistic Forecasting Models

In this section, we formally describe the probabilistic autoregressive model used in prior works and throughout this paper. Most notably, the model described in Section 2.1 is based on the recent work of  (Salinas et al., 2019), which is now an inherent part of the Amazon SageMaker toolkit. Further, in Section 2.2 we describe an extension of this model to the Bayesian setting proposed in our work.

2.1 Sequence Modeling: Preliminaries

Given a sequential process and an index , we consider the task of modeling the distribution of predicted (future) values given the observed (past) values

. Using the chain rule, the joint distribution of predicted values conditioned on observed values

can be written as a product of conditional distributions:


In deep autoregressive models, a neural network is used to approximate the conditional distribution by a parametric distribution specified by learnable parameters . This yields a joint model:


This decomposed form for the joint distribution is general and independent of the particular neural architecture chosen for . In principle, any type of sequential model can be used including well-known architectures such as LSTMs (Hochreiter and Schmidhuber, 1997), Temporal Convolutional Networks (Bai et al., 2018) or Transformers (Vaswani et al., 2017). Next, we describe a LSTM based instantiation of probabilistic autoregressive models.

Probabilistic Autoregressive Models

Let be a function implemented by a LSTM network. Given , we compute the hidden state for each time-step, conditioned on the previous hidden state , previous input value and the network parameters . Then, the hidden state is used to generate a set of parameters that specify a distribution with density , giving the following form for the conditional distribution:


The main difference here compared to the non-probabilistic models is that the network predicts parameters of a distribution rather than a single value. Commonly used distributions in prior works are Gaussian distribution for real-valued data or the negative binomial distribution for count data 

(Salinas et al., 2019). When using Gaussian distribution, has two components

, that correspond to the mean and the standard deviation, respectively. The density is defined as:


Note, that the choice of a Gaussian distribution corresponds to the assumption that each value is normally distributed conditioned on past values – a hypothesis which has to be assessed per application domain. In what follows, to make a clear distinction between the network inputs and outputs, we use

to denote the inputs (i.e., observed values) and to denote the outputs (i.e., the predicted values).


Performing inference for probabilistic autoregressive models corresponds to characterizing the joint distribution of the output sequence . This includes estimating -steps ahead both, the mean and the standard deviation of the value

, via the first and second moments

and . More generally, given the space of output sequences and any statistic , we consider the task of estimating the expectation . The main challenge here is the complexity of the underlying integral on the distribution , which is in general not analytically solvable.

During training, one can either use scheduled sampling (Bengio et al., 2015), where a single sample from the distribution is used, or avoid this issue completely by using teacher forcing (Williams and Zipser, 1989), where the deterministic ground truth value for is fed back into the network in the next time-step. This setting is solvable but only because the prediction is only for a single next step. However, when performing iterated prediction at test time, the value used in the feedback loop is sampled from the predicted distribution of . Therefore, the next hidden state depends on the randomness introduced in sampling . This yields an arbitrarily complex form for the joint distribution . To address this issue, prior works perform Monte-Carlo inference (Schittenkopf et al., 2000; Salinas et al., 2019) to approximate the expectation as:


That is, the Monte-Carlo estimations of the expected value of is computed using generated samples for the output sequence.

2.2 Extension to Bayesian Setting

We extend the probabilistic autoregressive models presented in this section to the Bayesian setting, where the output sequence can be conditioned on arbitrary values (i.e., both past and from the future). Formally, we define an observation function , which takes as input a sequence of values and outputs a boolean denoting whether the observation holds. As an example, using denotes that we would like the model to predict values conditioned both on the inputs as well as on the observation .

There are several cases for which the Bayesian setting is useful: (i) some of the data is missing (e.g., due to sensor failures), (ii) to allow encoding prior beliefs about the future evolution of the process, or (iii) to evaluate complex domain-specific statistics. The financial domain offer good examples for (iii), in pricing exotic derivatives such as barrier options (Rich, 1994) whose existence depends upon the underlying asset’s price breaching a preset barrier level.

To remove clutter, in the remainder of the paper we will use to denote the output of the observation function when evaluated on . In this Bayesian setting, the expectation can be estimated via Monte-Carlo importance sampling as:


This corresponds to generating samples from the prior distribution , and reweighing with Bayes rules. Note, that this formula includes the former by using .

3 Adversarial Attacks on Probabilistic Forecasting Models

In this section, we present our approach for generating adversarial attacks on deep probabilistic forecasting models. We start by formally defining the problem statement suitable for this setting and then we describe two practical adversarial attacks that address it.

Adversarial Examples

Recall that in the canonical classification setting, adversarial examples are typically found by solving the following optimization problem (Szegedy et al., 2013; Papernot et al., 2016; Carlini and Wagner, 2017):



is a classifier,

is an input (i.e., an image), is the desired adversarial output (the target), and is the minimal perturbation (according to a given norm) applied to the input image such that the classifier predicts the desired output .

To make the above formulation applicable to probabilistic forecasting models we perform two standard modifications – (i) we replace the hard equality constraint with an easier to optimize soft constraint that captures the distance between real values, and (ii) we replace the single value output with the expected value of a given statistic of the output joint probability distribution. Applying the first modification leads to the following formulation:


That is, we use a soft constraint that minimizes the distance between the target and the predicted value, subject to a given tolerance on the perturbation norm. Applying the second modification corresponds to replacing with the expected value , where is an observation over outputs as defined in Section 2.2, and is a statistic of the output sequence. Overall, this leads to the following problem statement.

Problem Statement

Let be a function (i.e., a probabilistic neural network) that takes as input a sequence of values and outputs a probability distribution with density that can be sampled from to obtain a concrete output sequence . Given an observation variable , a statistic of the network output distribution and a target , the goal of the adversarial attack is to find a perturbation that solves the following optimization problem:


3.1 Practical Attack on Probabilistic Networks

The constrained minimization problem defined in Equations 8 and 9 has repeatedly been identified as very difficult to solve in the adversarial attacks literature (Szegedy et al., 2013; Carlini and Wagner, 2017). As a result, we instead follow the approach of Szegedy et. al. 2013 and solve an adjusted optimization problem. Given a real hyper-parameter , we aim at minimizing:


via gradient-descent. The attack is run with different values of , and the final value is chosen to ensure that the hard constraint is satisfied.

While optimizing the objective function in Equation 9 is standard (Szegedy et al., 2013; Goodfellow et al., 2014; Kurakin et al., 2017b, a), the crucial aspect in ensuring efficient gradient descent is to obtain a good estimation of the objective function’s gradient. In particular, this involves computing:

The difficulty of computing this gradient comes from the fact that the expectation can not be analytically computed, but only approximated via Monte-Carlo methods. Informally, it raises the question of how to efficiently differentiate through the Monte-Carlo estimation. We compare two different ways of performing this differentiation, described next.

3.1.1 Score-function Estimator

The first approach is to express the gradient of the expectation as an expectation over the distribution by inverting the gradient and integral, and estimate the resulting expectation via Monte-Carlo methods. This technique is known under different names in the literature: score-function method (Glynn, 1990), REINFORCE (Williams, 1992), or log-derivative trick. Below we show how this applies to our setting.

Score-function Estimator 3.1.

In the general Bayesian setting where , the score-function gradient estimator of the expected value of is:

where is sampled from the prior distribution , and denotes the probability that is true knowing that is generated.

The proof of 3.1 is given in the supplementary material. Note that in the non-Bayesian setting where the observation is always true, we have and we obtain a simpler form:

While this estimator allows for generating adversarial perturbations, we observe that it has two drawbacks – high-variance and high sampling complexity.

High Variance

Score-function estimators typically lead to slow convergence because they suffer from high variance (Ranganath et al., 2014). It is due to the fact that they operate in a black-box way with respect to the gradients of the network and the statistic .

Complexity of the Bayesian Setting

The score-function requires computing the gradient of with respect to . This is always possible in the special case when the observation is constantly true, however in the general setting this might require another step of sampling, which makes the estimator overly complex.

3.1.2 Reparametrization estimator

The second estimator is based on the reparametrization trick. It reparametrizes the output distribution

in terms of auxiliary random variables whose distribution does not depend on

, in order to make individual samples differentiable with respect to . The differential has a priori no specific meaning when the distribution from which is sampled depends on . However, if can be reparametrized as , where is a random variable whose distribution is independent from , then it makes sense to define the differential of with respect to as .

Reparametrization estimators were first proposed as a way of mitigating the variance problems of score-function estimators (Salimans et al., 2013; Kingma and Welling, 2013; Rezende et al., 2014). However, to our best knowledge, they have not been used in a Bayesian setting where the estimator to differentiate uses importance sampling.

Reparametrization Estimator 3.2.

Assume there exists a differentiable transformation such that the random variable can be reparametrized as , where is an independent random variable whose marginal distribution is independent from . Then the importance sampling reparametrization estimator of the expectation’s gradient is:

where for , is sampled from the distribution , and .

A proof of 3.2 is given in the supplementary material.

3.2 Reparametrization of Probabilistic Networks

Here, we discuss the question of reparametrizing probabilistic autoregressive models. The stochasticity of such architectures comes from the iterated sampling of . Assuming a Gaussian likelihood, the value follows a normal distribution . Let

be a standard normal random vector, i.e., all of its components are independent and each is a zero-mean unit-variance normally distributed random variable. Iteratively writing:

for all such that yields a valid reparametrization. This simple reasoning applies to the particular implementation described in (Salinas et al., 2019) and adapts readily to any kind of likelihood parameterized by location and scale, such as Laplace or logistic distribution.

The case of mixture density likelihoods is more complex as they do not enter this category of ”location-scale” distributions, and their inverse cumulative density function does not admit a simple closed form. The problem of how to adapt reparametrization to mixture densities is outside the scope of this paper, and we refer to the relevant literature (Graves, 2016; Figurnov et al., 2018; Jankowiak and Obermeyer, 2018) for more information about this question.

4 Case Study: Stock Market Trading

In this section, we apply the probabilistic autoregressive models and discuss the types of adversarial attacks in the domain of financial decision making.

Output Sequence Statistics

While a given machine learning model is typically trained to predict the future stock prices given its past, various statistics of the output sequence are used in downstream algorithmic trading and option pricing tasks. This is the reason why the approach presented so far already assumed presence of such statistics. The different statistics used in our evaluation are shown in Table 2 and include predicting cumulated stock return, pricing derivatives such as European call and put options (Black and Scholes, 1973), as well as an example of a binary statistic that predicts the success probability of limit orders (Handa and Schwartz, 1996). All statistics are defined with respect to the last known price, denoted as .

4.1 Probabilistic Autoregressive Models Performance

Before we show the effectiveness of generative adversarial attacks, we first demonstrate that using probabilistic autoregressive models leads to state-of-the-art results. We use two baselines as the current state-of-the-art for financial predictions: LSTM networks (Fischer and Krauss, 2018), and Temporal Convolutional Networks (TCN) (Borovykh et al., 2017). We provide detailed description of all the training hyper-parameters, the dataset used (S&P 500) and extended version of all the experiments in the supplementary material.

Params Non-probabilistic Probabilistic
(Borovykh et al., 2017) (Fischer and Krauss, 2018) This Work
1 10 3.53 ( 0.49) 4.89 ( 0.39) 4.37 ( 0.51)
1 30 1.74 ( 0.30) 2.41 ( 0.24) 2.35 ( 0.23)
1 100 0.70 ( 0.19) 0.93 ( 0.1) 0.99 ( 0.12)
5 10 5.57 ( 1.93) 8.86 ( 1.03) 9.02 ( 1.52)
5 30 3.40 ( 1.36) 5.34 ( 0.61) 5.66 ( 0.87)
5 100 1.64 ( 0.78) 2.47 ( 0.28) 2.70 ( 0.48)
10 10 6.21 ( 3.52) 9.68 ( 1.58) 9.55 ( 2.30)
10 30 4.28 ( 2.69) 6.39 ( 0.76) 6.63 ( 1.70)
10 100 2.09 ( 1.58) 3.12 ( 0.52) 3.48 ( 1.03)
Table 1: Financial gain on algorithmic trading tasks for different horizons and portfolio sizes (expressed per mille ). An extended version is included in the supplementary material.
Statistics Params Non-probabilistic Probabilistic
(Borovykh et al., 2017) (Fischer and Krauss, 2018) This Work
Cum. Return 10 - 1.548 ( 0.029) 1.541 ( 0.019) 1.002 ( 0.008)
European Call 10 1 1.122 ( 0.002) 1.121 ( 0.002) 0.982 ( 0.005)
European Put 10 1 1.302 ( 0.003) 1.300 ( 0.002) 0.974 ( 0.005)
Limit Sell 10 1.05 1.516 ( 0.001) 1.514 ( 0.002) 0.940 ( 0.006)
Limit Buy 10 0.95 1.412 ( 0.000) 1.410 ( 0.001) 0.958 ( 0.008)
Table 2: Definition of various output sequence statistics used in our work (left) and performance of various models used to predict them (right). is the prediction horizon and the price of the option. The comparison metric is Ranked Probability Skill (Weigel et al., 2007) of the prediction (lower scores correspond to better predictions). An extended version is provided in the supplementary material.
Long-Short Trading Strategies

Given a prediction horizon , we analyze the characteristics of the following portfolio: at time-step , buy (long) the stocks for which the model predicts the highest gain, and sell (short) the stocks with the highest predicted loss. This task is a generalization of the one presented in (Fischer and Krauss, 2018), where only direct prediction () is considered.

Formally, we consider the cumulative return statistic of the output sequence, which corresponds to the gain of investing one dollar in the stock at time , and then selling at time . In a non-Bayesian setting, we estimate the expectation via Monte-Carlo sampling for each stock, and buy (or sell) the stocks for which the estimate is the highest (or the lowest). Note that this setting also applies to the deterministic baselines, it suffices to consider that is a Dirac distribution centered in the deterministic prediction.

The performance of all models is summarized in Table 1. We can see that the TCN is consistently outperformed by both probabilistic and non-probabilistic LSTM models. For the probabilistic model, we observe that it is generally outperformed by the LSTM for direct prediction (), but it has better performance on iterated prediction (), provided that enough samples are used for Monte-Carlo estimation. We observe that a large number of samples (at least 1000) is required to match the LSTM performance. We provide extended evaluation results in the supplementary material, including the effect of the number of samples.

Quality of the Probabilistic Forecast

Table 2 shows evaluation of the forecast quality for each of the statistics described earlier. To compare deterministic and probabilistic forecast, we use as metric the Ranked Probability Skill (RPS) (Weigel et al., 2007) of the prediction. However, because it applies only to predictions with finite output space, we first discretize the output before we apply RPS222There exists a continuous version (Gneiting and Raftery, 2007), but it is impractical for our setting because of the memory consumption of computing the score: we favor metrics computable in an on-line fashion with respect to the sampling process.. Here, lower score means better prediction. We provide extended evaluation results in the supplementary material, including multiple different values for the horizon , price and the number of samples for Monte-Carlo estimation.

4.2 Market Manipulations

The possibility of artificially influencing stock prices to make profit has always been a major problem of financial markets (Allen and Gale, 1992; Diaz et al., 2011; Öğüt et al., 2009). In our work, we focus on trade-based manipulation, in which a trader attempts to manipulate the price of a stock only by buying and then selling, without taking any other publicly observable action. The core of such an attack is to anticipate the reactions of other agents to a provoked market event, in order to drive the price up or down. In order to decrease the cost and visibility of the attack, an additional constraint for the manipulating trader is to minimize the amplitude of the perturbation. This creates a natural connection with finding adversarial perturbations over the inputs .

Adversarial Attacks Scenario

To measure the perturbation size, we choose a variant of the Euclidean norm specifically tailored to stock price data, defined in Equation 11, where each component is normalized by the corresponding price . This normalization aims at capturing the fact that stock prices are fixed for an arbitrary unit quantity of the underlying asset, and thus should be invariant with respect to multiplication by a scalar. Besides, we add a box constraint to the perturbed prices such that they remain positive. This constraint is enforced using projected gradient descent.


5 Experimental Results

In this section, we evaluate the effectiveness of our approach for generating adversarial attack on probabilistic autoregressive models. The two key results of our evaluation are:

  • The reparametrization estimator leads to significantly better adversarial examples (i.e., with smaller perturbation norm ) than the score-function estimator.

  • The reparametrization estimator successfully generates adversarial attacks for a number of different tasks. For example, using a small perturbation norm 333Here, corresponds to perturbing one price in the sequence by , or 10 prices by , or 100 prices by . the attack is powerful enough to cause financial loss when applied to stock market trading.

Figure 1: a) Success rate of the classification attack for different perturbation norms. b) Impact of the adversarial attack against trading strategies on financial gain, with portfolio size of . For both a) and b), standard deviation (shaded) is computed across test years c) Under-estimation of electricity consumption for reparametrization (continuous) and score-function (dashed) estimators. For example, the graph shows that with , the attack using reparametrization estimator leads to under-estimation of at least (y-axis) for of samples (x-axis). d) Under-estimation adversarial samples for the electricity dataset, with . Red curve is the original sample, blue curve is the generated adversarial sample. The vertical dashed line separates the input sequence from the network’s prediction.

We evaluate on the following two datasets:

S&P 500 dataset, which contains historical prices of S&P 500 constituents from 1990/01 to 2000/12. We consider study periods of four consecutive years. The first three years serve as training data, while the last year is used for out-of-sample testing. The different periods have non-overlapping test years, resulting in eight different periods (for each we train four different models) with test year going from 1993 to 2000. We generate input-output samples by considering sequences of 251 consecutive daily prices for a fixed constituent. The first 241 prices serve as input , while the last 10 are the ground truth output . We ensure that output sequences from the training and test sets do not overlap and reserve of training samples as a validation set. We use the same preprocessing as in prior work (described in supplementary material) and train our own probabilistic autoregressive model (described in Section 2). The order of magnitude of the cumulated test set size is .

UCI electricity dataset444, which contains the electricity consumption of 370 households from 2011 to 2014, down-sampled to hourly frequency for the measurements. For this dataset we reuse an existing implementation and already trained models provided by Zhang and Jiang555 The model is trained on data from 2011/01 to 2014/08 (included) and we perform the attack on test samples from 2014/09. The input sequence consists of 168 consecutive measurements, and the network predicts the next 24 values (corresponding to the next day). The total number of test samples is 2590.

Experimental Setup

We performed all experiments on a machine running Ubuntu 18.04, with 2.00GHz Intel Xeon E5-2650 CPU and using a single GeForce RTX 2080 Ti GPU. For the S&P500 dataset, each model’s training time is under one hour, and running the attack on one model for all test-set elements of a period takes approximately 24 hours. For the electricity dataset, running the attack on a batch of 256 test sequences takes approximately three hours.

5.1 Attacks on Buy/Sell Classification

We start by considering a classification task on the S&P dataset where each sample is classified as buy, sell or uncertain. For each stock, we predict the cumulated return using the statistic . Then, let be a threshold used to decide whether to buy or sell the stock. Concretely, if the confidence interval (assuming Student’s t-distribution) of the estimation is entirely above , the stock is classified as buy. If it is entirely below , it is classified as sell. Finally, if is inside the confidence interval, the classification is uncertain. We set to be the average over all stocks of the ground truth cumulated return, which leads to roughly balanced decisions to buy and sell.

We attack the statistic twice. First, we perturb all samples initially classified as buy or uncertain, in order to make it classify as sell. Similarly, we perturb all samples initially classified as sell or uncertain, in order to make it classify as buy. The target of the attack is set as for the buy attack and for the sell attack. We fix in our experiments. This is aimed at making the confidence interval fit entirely in the buy (resp. sell) zone. Indeed, with samples, the interval width order of magnitude is .

The results without a Bayesian observation are summarized in Figure 1 a) and show that the reparametrization estimator is significantly better at generating adversarial examples that the score-function estimator. For example, using the reparametrization estimator attack succeeds in more cases. The reparametrization estimator can also successfully attack the model when considering a Bayesian setting with similar results. We include such experiment that uses a smaller horizon and an observation in the supplementary material.

5.2 Attacks on Long-Short Trading Strategies

Next, we evaluate the impact of attacking the cumulated return statistic on the financial gain of the long-short trading strategy described in Section 4.1. We suppose that the attacker is allowed to perturb all inputs at test time without changing the corresponding ground truth outputs, with a maximum tolerance on the perturbation norm. We consider a horizon of and different portfolio sizes . Given a ground truth output , the target is set to be , where is the buy/sell threshold defined previously, and is a scaling factor that rescales the ground truth output to the prediction range of the network. Intuitively, this attack target corresponds to reversing the prediction of the network around its average, in order to swap the top and flop k stocks.

We report the return of the perturbed portfolios in Figure 1 b). We observe that the reparametrization estimation is again significantly better compared to the score-function estimator. Additionally, in both experiments the thresholds for appearance of adversarial perturbation to some of the samples is approximately .

5.3 Attacks on Electricity Consumption Prediction

We perturb each input sequence twice: in order to make the consumption forecast abnormally high (resp. low). We designate these as over-estimation (resp. under-estimation) attack. The attacked statistic is , with . Given an input , we first approximate the expected value , and set the target to to cause over or under-estimation.

We show the attack success for different perturbation tolerances in Figure 1 c). We observe that the reparametrization estimator (continuous line) yields stronger under-estimation of the prediction than the score-function estimator (dashed-line) In Figure 1 d), we give examples of generated adversarial samples for the under-estimation attack. We observe a recurrent pattern in the under-estimation attack, where the perturbed prediction matches closely the original prediction for the first time-steps, but eventually becomes significantly inferior. In the supplementary material, we provide similar figures for the over-estimation attack.

6 Related Work

Probabilistic Autoregressive Forecasting Models

Probabilistic autoregressive forecasting models have been used in diverse applications. Schittenkopf et al. (2000

) developed the Recurrent Mixture Density Network (RMDN) to predict stock prices volatility iteratively, and were the first to propose using Monte-Carlo methods for iterative prediction. RMDN is based on a vanilla recurrent neural network coupled with Gaussian mixture likelihood. The recent

architecture (Salinas et al., 2019) uses several LSTM layers with Gaussian likelihood, and has been applied to forecasting electricity consumption, car traffic and business sales. Follow-up work (Chen et al., 2019) considers an alternative TCN-based architecture. Our characterization of probabilistic forecasting models encompasses these different architectures, and presents the following novelties (i) it generalizes inference to any statistic of the output sequence, and (ii)

it extends the prior work to a Bayesian inference setting.

Stock-Market Prediction

Various methods have been applied for predicting future stock prices including random forests, gradient-boosted trees, or logistic regression. Two notable deterministic neural models applied to this task are TCN 

(Borovykh et al., 2017) and LSTM (Fischer and Krauss, 2018), which achieved state-of-the-art results. In our work, we trained a probabilistic autoregressive model for the same task and achieved comparable or even better results. Further, there exists a parallel line of work that performs density estimation, but apart from the RMDN (Schittenkopf et al., 2000), most papers restrict to prediction one-step ahead (Ormoneit and Neuneier, 1996), or to less-expressive and solvable dynamics such as the GARCH (Duan, 1995).

Adversarial Attacks

A growing body of recent work on adversarial attacks deals with generating small input perturbations causing mis-predictions (see (Wiyatno et al., 2019) for a survey). The objective function defined in Equation 9 is standard in generating adversarial examples (Szegedy et al., 2013; Carlini and Wagner, 2017). However, to the best of our knowledge this is the first time that adversarial attacks are applied to probabilistic autoregressive models. The most related work is the adversarial training of smoothed classifiers (Salman et al., 2019), where the noise applied to the input leads to a stochastic behavior.

Robust Algorithms for Financial Decision Making The adoption of machine learning in financial decision making makes it crucial to develop algorithms robust against small environment variations. Recent work here include robust assessment of loan applications (Ballet et al., 2019), deepfakes on accouting journal entries (Schreyer et al., 2019)

, robust inverse reinforcement learning on market data 

(Roa-Vicens et al., 2019). Adversarial attacks against stock-market prediction algorithms was studied by  (Feng et al., 2018). Compared to the latter, our work is the first to operate on a probabilistic network for iterative prediction.

Reparametrization The reparametrization trick has been applied in several fields under different names: perturbation analysis/pathwise derivatives (Glasserman and Ho, 1991)

in stochastic optimization, stochastic backpropagation 

(Rezende et al., 2014), affine independent variational inference (Challis and Barber, 2012) or correlated sampling in evaluating differential privacy (Bichsel et al., 2018). The actual reparametrization of our model resembles that of the deep generative model of Rezende et al. (2014), but there it is used to perform variational inference.

7 Conclusion

In this work, we explored applying adversarial attacks to a recently proposed probabilistic autoregressive forecasting models. Our work is motivated by the fact that: (i) this model has been included in the Amazon SageMaker toolkit and achieved state-of-the-art results on a number of different tasks, and (ii) adversarial attacks and robustness are pressing and important issues that affect it.

Concretely, we implemented and evaluated two techniques, reparametrization and score-function estimators, that are used to differentiate trough Monte-Carlo estimation inherent to this model and instantiate existing gradient based adversarial attacks. While we show that both of these techniques can be used to generate adversarial attacks, we evidence that using the reparametrization estimator is crucial for producing adversarial attacks with a small perturbation norm. Further, we extend the prior work to the Bayesian setting which enables using these models with new types of queries.


  • F. Allen and D. Gale (1992) Stock-price manipulation. The Review of Financial Studies 5 (3), pp. 503–529. Cited by: §4.2.
  • S. Bai, J. Z. Kolter, and V. Koltun (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §B.2, §2.1.
  • V. Ballet, X. Renard, J. Aigrain, T. Laugel, P. Frossard, and M. Detyniecki (2019) Imperceptible adversarial attacks on tabular data. arXiv preprint arXiv:1911.03274. Cited by: §6.
  • S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 1171–1179. Cited by: §2.1.
  • B. Bichsel, T. Gehr, D. Drachsler-Cohen, P. Tsankov, and M. Vechev (2018) Dp-finder: finding differential privacy violations by sampling and optimization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 508–524. Cited by: §6.
  • C. M. Bishop (1994) Mixture density networks. Cited by: §B.2.
  • F. Black and M. Scholes (1973) The pricing of options and corporate liabilities. Journal of political economy 81 (3), pp. 637–654. Cited by: §4.
  • A. Borovykh, S. Bohte, and C. W. Oosterlee (2017)

    Conditional time series forecasting with convolutional neural networks

    arXiv preprint arXiv:1703.04691. Cited by: §B.2, Table 3, §4.1, Table 1, Table 2, §6.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §3, §3.1, §6.
  • E. Challis and D. Barber (2012) Affine independent variational inference. In Advances in Neural Information Processing Systems, pp. 2186–2194. Cited by: §6.
  • Y. Chen, Y. Kang, Y. Chen, and Z. Wang (2019) Probabilistic forecasting with temporal convolutional neural network. arXiv preprint arXiv:1906.04397. Cited by: §6.
  • D. Diaz, B. Theodoulidis, and P. Sampaio (2011) Analysis of stock market manipulations using knowledge discovery techniques applied to intraday trade prices. Expert Systems with Applications 38 (10), pp. 12757–12771. Cited by: §4.2.
  • J. Duan (1995) The GARCH option pricing model. Mathematical finance 5 (1), pp. 13–32. Cited by: §6.
  • F. Feng, H. Chen, X. He, J. Ding, M. Sun, and T. Chua (2018) Enhancing stock movement prediction with adversarial training. arXiv preprint arXiv:1810.09936. Cited by: §6.
  • M. Figurnov, S. Mohamed, and A. Mnih (2018) Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pp. 441–452. Cited by: §3.2.
  • T. Fischer and C. Krauss (2018) Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research 270 (2), pp. 654–669. Cited by: §B.1, §B.1, §B.2, Table 3, §4.1, §4.1, Table 1, Table 2, §6.
  • P. Glasserman and Y. Ho (1991) Gradient estimation via perturbation analysis. Vol. 116, Springer Science & Business Media. Cited by: §6.
  • P. W. Glynn (1990) Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM 33 (10), pp. 75–84. Cited by: §1, §3.1.1.
  • T. Gneiting and A. E. Raftery (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association 102 (477), pp. 359–378. Cited by: footnote 2.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §3.1.
  • A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §1.
  • A. Graves (2016) Stochastic backpropagation through mixture density distributions. arXiv preprint arXiv:1607.05690. Cited by: §3.2.
  • P. Handa and R. A. Schwartz (1996) Limit order trading. The Journal of Finance 51 (5), pp. 1835–1861. Cited by: §4.
  • T. C. Hesterberg (1996) Estimates and confidence intervals for importance sampling sensitivity analysis. Mathematical and computer modelling 23 (8-9), pp. 79–85. Cited by: §B.4.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.1.
  • M. Jankowiak and F. Obermeyer (2018) Pathwise derivatives beyond the reparameterization trick. arXiv preprint arXiv:1806.01851. Cited by: §3.2.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §3.1.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. External Links: 1412.6980 Cited by: §B.3.
  • J. P. Kleijnen and R. Y. Rubinstein (1996) Optimization and sensitivity analysis of computer simulation models by the score function method. European Journal of Operational Research 88 (3), pp. 413–427. Cited by: §1.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2017a) Adversarial examples in the physical world. ICLR’17 Workshop. Cited by: §3.1.
  • A. Kurakin, I. J. Goodfellow, and S. Bengio (2017b) Adversarial machine learning at scale. ICLR’17. Cited by: §3.1.
  • H. Öğüt, M. M. Doğanay, and R. Aktaş (2009) Detecting stock-price manipulation in an emerging market: the case of turkey. Expert Systems with Applications 36 (9), pp. 11944–11949. Cited by: §4.2.
  • A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016a) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §1.
  • A. v. d. Oord, N. Kalchbrenner, L. Espeholt, k. kavukcuoglu, O. Vinyals, and A. Graves (2016b) Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems 29, pp. 4790–4798. Cited by: §1.
  • A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016c) Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1747–1756. Cited by: §1.
  • D. Ormoneit and R. Neuneier (1996) Experiments in predicting the German stock index DAX with density estimating neural networks. In IEEE/IAFE 1996 Conference on Computational Intelligence for Financial Engineering (CIFEr), pp. 66–71. Cited by: §6.
  • N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami (2016) The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387. Cited by: §3.
  • R. Ranganath, S. Gerrish, and D. Blei (2014) Black box variational inference. In Artificial Intelligence and Statistics, pp. 814–822. Cited by: §3.1.1.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082. Cited by: §1, §3.1.2, §6.
  • D. R. Rich (1994) The mathematical foundations of barrier option-pricing theory. Advances in futures and options research 7. Cited by: §2.2.
  • J. Roa-Vicens, Y. Wang, V. Mison, Y. Gal, and R. Silva (2019) Adversarial recovery of agent rewards from latent spaces of the limit order book. arXiv preprint arXiv:1912.04242. Cited by: §6.
  • T. Salimans, D. A. Knowles, et al. (2013)

    Fixed-form variational posterior approximation through stochastic linear regression

    Bayesian Analysis 8 (4), pp. 837–882. Cited by: §1, §3.1.2.
  • D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski (2019) DeepAR: probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting. Cited by: §B.1, §1, §1, §2.1, §2.1, §2, §3.2, §6.
  • H. Salman, J. Li, I. Razenshteyn, P. Zhang, H. Zhang, S. Bubeck, and G. Yang (2019) Provably robust deep learning via adversarially trained smoothed classifiers. In Advances in Neural Information Processing Systems, pp. 11289–11300. Cited by: §6.
  • C. Schittenkopf, G. Dorffner, and E. J. Dockner (2000) Forecasting time-dependent conditional densities: a semi non-parametric neural network approach. Journal of Forecasting 19 (4), pp. 355–374. Cited by: §1, §2.1, §6, §6.
  • M. Schreyer, T. Sattarov, B. Reimer, and D. Borth (2019) Adversarial learning of deepfakes in accounting. arXiv preprint arXiv:1910.03810. Cited by: §6.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1, §3, §3.1, §3.1, §6.
  • T. Tieleman and G. Hinton (2012)

    Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude

    COURSERA: Neural networks for machine learning 4 (2), pp. 26–31. Cited by: §B.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.1.
  • A. P. Weigel, M. A. Liniger, and C. Appenzeller (2007) The discrete Brier and ranked probability skill scores. Monthly Weather Review 135 (1), pp. 118–124. Cited by: Table 3, §4.1, Table 2.
  • R. J. Williams and D. Zipser (1989) A learning algorithm for continually running fully recurrent neural networks. Neural computation 1 (2), pp. 270–280. Cited by: §2.1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §3.1.1.
  • R. R. Wiyatno, A. Xu, O. Dia, and A. de Berker (2019) Adversarial examples in modern machine learning: a review. External Links: 1911.05268 Cited by: §6.

Appendix A Proofs

Score-function Estimator.

In the general Bayesian setting where , the score-function gradient estimator of the expected value of is:

where is sampled from the prior distribution , and denotes the probability that is true knowing that is generated.


The expectation is defined as the following integral over the output space:

Using Leibniz rule, we obtain

at every point around which the gradient is locally continuous (in the model described in this paper, this regularity condition holds everywhere). The resulting integral can be transformed as follows into an expectation over the distribution .

This expectation can be approximated via Monte-Carlo methods. While it is in general not possible to directly sample from , what can be done instead is generating samples for from the prior , and attribute an importance weight to each of the resulting samples, yielding:

The choice of as the importance weight for results from the application of Bayes rule:

Reparametrization Estimator.

Assume there exists a differentiable transformation such that the random variable can be reparametrized as , where is an independent random variable whose marginal distribution is independent from . Then the importance sampling reparametrization estimator of the expectation’s gradient is:

where for , is sampled from the distribution , and .


Approximating the expectation via Monte-Carlo estimation with importance sampling yields:

where is sampled from the prior distribution . With the assumptions of the theorem, we can rewrite:

Since the respective effects of the perturbation and of randomness are decoupled in this final expression, it is differentiable with respect to , which concludes. ∎

Appendix B Experimental Details

Here we provide details of all our experiments to support reproducibility. Additionally, we will make all our datasets and source code available online.

b.1 Datasets and Preprocessing


The S&P500 dataset is obtained via the yfinance API666 We focus on data-points between 1990/01 and 2000/12, identified by Fischer and Krauss (2018) as a period of exceptionally high trading returns compared to the following decades. We also follow Fischer and Krauss for preprocessing the data. A sequence of prices is first preprocessed to obtain a sequence of returns , defined as . Intuitively is the gain (when positive) or loss obtained by investing one dollar in the stock at time , and then selling at time . Inversely, given a sequence of returns and an initial price , the corresponding sequence of prices can be obtained as:

Both transformations are differentiable, which allows to perform the attack in the application space of prices rather than on returns. Besides, returns are normalized to have zero mean and unit variance. Denoting and for the mean and standard deviation of returns in the training set, the normalized sequence is , where . We refer to Fischer and Krauss (2018) for a thorough analysis of the properties of the S&P500 dataset.

Electricity Dataset We use the same preprocessing steps as described in (Salinas et al., 2019). Input sequences are divided by their average value , and the corresponding prediction sequence is multiplied by . This guarantees that all inputs are approximately in the same range.

b.2 Neural Architectures: S&P500 Dataset


The LSTM baseline used on the S&P500 dataset, we follow (Fischer and Krauss, 2018), and use a single LSTM layer with

hidden units, followed by a linear output layer. However, we use only one input neuron without activation instead of two neurons with softmax activation.

TCN In (Borovykh et al., 2017), several sets of hyper-parameters for Temporal Convolutional Networks are used depending on the experiment. We decided to use layers and a dilation of , in order to match as closely as possible the size of the LSTM receptive field. We selected the other parameters via grid-search, resulting in a kernel size of and channels. We use the TCN implementation provided by the authors of (Bai et al., 2018).


For our probabilistic autoregressive model, we chose to use a single LSTM layer with 25 hidden units similar to the LSTM baseline, in order to guarantee the most fair comparison. We only changed the output layers to parametrize a Gaussian distribution. Following (Bishop, 1994), we use a linear layer without activation for the mean, and a linear layer with exponential activation for the scale of the distribution. We also performed experiments with a Gaussian mixture likelihood, but it did improve the performance on our two benchmarks.


For both deterministic networks, we minimize mean-squared error on the training set. For our model, we use negative log-likelihood as a loss function. In both cases, we use the

optimizer (Tieleman and Hinton, 2012) advised by Fischer and Krauss, with default parameters and learning rate of . We use an early-stopping patience of , and a large batch size of for training. Experiments with different values did not reveal a significant influence of these parameters.

b.3 Neural Architectures: Electricty Dataset


architecture used for the Electricity experiments is based on a three-layer LSTM with 40 hidden units each. The number of samples used for Monte-Carlo estimation of the output is set to 200. The network is trained for 20 epochs with the

optimizer (Kingma and Ba, 2014), with batch size of 64 and learning rate of 0.001.

Figure 2: Perturbation norm and distance to target for different values of when evaluated on the S&P500 Dataset.

b.4 Attack Hyper-Parameters: S&P500 Dataset

For the S&P500 dataset, we optimize the attack objective function with the RMSPROP optimizer using a learning rate of and iterations. These parameters were selected with a simple grid search because of the computational cost of running the attack repeatedly. The values used for the coefficient are . We select the value that yields the best adversarial sample under the constraint that the perturbation norm is below the tolerance . The number of samples used to estimate the gradient is chosen to be .

Buy/Sell Attack

We use for the target. For the Bayesian setting, we use , in order to approximately balance the different classes. The confidence interval is computed assuming Student’s t-distribution. In the Bayesian case, the formula for the confidence interval with importance sampling is derived in (Hesterberg, 1996).

Attack on Trading Strategies

We use for the target scaling factor.

Influence of c

In Figure 2, we examine the influence of tuning the attack objective function on average perturbation norm and distance to the attack target. We observe a trade-off between these two quantities that depends on the coefficient : higher value for yields better adversarial samples, at the cost of more input perturbation.

Influence of L

We evaluate the effect of the number of samples used in the reparametrization estimator on the attack loss in Figure 3. In this experiment, the value of is fixed to 1000. We notice a trade-off in terms of convergence speed vs. final loss, that depends on the number of samples used for estimating the gradient. As a result, we choose to use in our attacks.

Figure 3: Attack loss for different values of when evaluated on the S&P500 Dataset. The x-axis is the total number of generated samples rather than the number of perturbation updates, in order to provide a fair comparison in terms of attack computational cost.

b.5 Attack Hyper-Parameters: Electricity Dataset

We optimize the attack objective function with the Adam optimizer. We use different optimizers for the two datasets so that the same optimizer is used for training the network and to attack it. We use a learning rate of and iterations. These parameters were also selected via informal search. The values used for the coefficient are 0.1, 0.2, 0.3, 0.5, 0.7, 1, 2, 3, 5, 7, 10, 20, 30, 50, 70, 100, 200 and 300. The number of samples used to estimate the gradient is chosen to be .

Figure 4: Financial gain on algorithmic trading tasks for different horizons and portfolio sizes (in of the invested capital). The blue bars correspond to different number of samples for Monte-Carlo estimation of the prediction: from left to right .

Appendix C Experimental Results

c.1 Trading Strategies

In Figure 4, we provide extended results for the long-short trading benchmark (Section 4.1), with different horizons  and number of samples used for Monte-Carlo estimation of the prediction. We observe that the quality of the probabilistic prediction improves with the number of samples until samples. Further increasing the number of samples does not yield significant performance improvements.

c.2 Evaluation of the Probabilistic Forecast

In Table 3, we give detailed results for the comparison of probabilistic forecasts quality with Ranked Probability Skill (a summary of these results is provided in Table 2, Section 4.1). We observe that the forecasting quality of our model improves with the number of samples, and that an order of magnitude of the number of samples needed to obtain the best possible estimation is . As a comparison, the

implementation on the electricity dataset uses 200 samples. We surmise that this discrepancy is due to the low signal-to-noise ratio of financial data, that makes inference more difficult.

c.3 Bayesian Attack

In Figure 5, we plot the results of the classification attack in the Bayesian setting with observation , where (Section 5.1). We only implemented the reparametrization estimator, as the score-function estimator requires the overly complex estimation of for each sample . We observe that the attack success rate is very similar to the non-Bayesian setting, demonstrating that the reparametrization estimator adapts readily to the Bayesian setting. The attack success rate is approximately for .

Figure 5: Success rate of the classification attack for different perturbation norms, in a Bayesian setting with observation , and prediction horizon .

c.4 Electricity Dataset

In Figure 6, we show results of both over-estimation and under-estimation attacks on the electricity dataset, with examples of generated adversarial samples (Section 5.3). We observe that for equal perturbation tolerance, the over-estimation attack yields mis-predictions of smaller amplitude. For instance, the reparametrization attack with causes median over-estimation of around , whereas it causes median under-estimation of . We believe that this is due to the particular nature of the dataset rather than asymmetry in the attack. We do not observe such a discrepancy in the financial experiments.

Statistics Non-probabilistic Probabilistic
Name Ours
(Borovykh et al., 2017) (Fischer and Krauss, 2018) 1 sample 100 samples 10000 samples
Cumulated Return 1 - 1.423 ( 0.022) 1.424 ( 0.016) 2.016 ( 0.023) 0.992 ( 0.002) 0.982 ( 0.002)
5 - 1.468 ( 0.01) 1.466 ( 0.008) 1.992 ( 0.013) 1.0 ( 0.005) 0.99 ( 0.004)
10 - 1.548 ( 0.029) 1.541 ( 0.019) 1.995 ( 0.011) 1.012 ( 0.008) 1.002 ( 0.008)
European Call Option 10 0.9 1.019 ( 0.004) 1.017 ( 0.003) 1.961 ( 0.22) 0.999 ( 0.009) 0.989 ( 0.009)
10 1 1.122 ( 0.002) 1.121 ( 0.002) 1.966 ( 0.103) 0.992 ( 0.006) 0.982 ( 0.005)
10 1.1 1.342 ( 0.002) 1.341 ( 0.002) 1.987 ( 0.03) 1.003 ( 0.007) 0.993 ( 0.007)
European Put Option 10 0.9 1.445 ( 0.021) 1.445 ( 0.017) 1.984 ( 0.015) 1.002 ( 0.007) 0.992 ( 0.007)
10 1 1.302 ( 0.003) 1.3 ( 0.002) 1.957 ( 0.036) 0.984 ( 0.005) 0.974 ( 0.005)
10 1.1 1.046 ( 0.005) 1.044 ( 0.002) 1.856 ( 0.094) 0.968 ( 0.004) 0.959 ( 0.003)
Limit Sell 10 1.01 2.822 ( 0.501) 3.137 ( 0.307) 1.917 ( 0.021) 1.013 ( 0.008) 1.004 ( 0.008)
10 1.05 1.516 ( 0.001) 1.514 ( 0.002) 1.899 ( 0.027) 0.953 ( 0.006) 0.944 ( 0.006)
10 1.20 1.035 ( 0.003) 1.034 ( 0.002) 1.792 ( 0.12) 0.951 ( 0.006) 0.942 ( 0.005)
Limit Buy 10 0.8 1.02 ( 0.002) 1.019 ( 0.002) 1.948 ( 0.223) 0.982 ( 0.015) 0.972 ( 0.013)
10 0.95 1.412 ( 0.0) 1.41 ( 0.001) 1.926 ( 0.039) 0.967 ( 0.009) 0.958 ( 0.008)
10 0.99 3.047 ( 0.012) 3.025 ( 0.028) 1.963 ( 0.024) 1.013 ( 0.006) 1.003 ( 0.006)
Table 3: Performance of different models on probabilistic forecast of various statistics. The comparison metric is Ranked Probability Skill (Weigel et al., 2007) of the prediction (lower scores correspond to better predictions). The performance of our architecture is given for different number of samples used in the Monte-Carlo estimation (1, 100, and 10000).
Figure 6: Results for the electricity dataset. a) Under-estimation of electricity consumption. For example, with , the attack using reparametrization estimator leads to under-estimation of at least (y-axis) for of samples (x-axis). b) Under-estimation adversarial samples. c) Over-estimation of electricity consumption. d) Over-estimation adversarial samples. In a) and b), results are given for reparametrization (continuous) and score-function (dashed) estimators. In c) and d), the reparametrization estimator is used, and is fixed to 0.9. Red curve is the original sample, blue curve is the generated adversarial sample. The vertical dashed line separates the input sequence from the network’s prediction.