## 1 Introduction

The auto-regressive (AR) model, which infers and predicts the causal relationship between previous and future samples in sequential data, has been widely studied since the beginning of machine learning research. Recent advances in AR models brought by neural networks have achieved impressive success in handling complex data including text (sutskever2011generating), audio signals (vinyals2012revisiting; tamamori2017speaker; van2016wavenet), and images (oord2016pixel; salimans2017pixelcnn++). It is well known that AR models can learn a tractable data distribution and can be easily extended to both discrete and continuous data. Due to their nature, AR models are especially well suited to sequential data, such as voice generation (van2016wavenet), and they provide stable training while being free from the mode-collapse problem (oord2016pixel). However, these models must infer each element of the data step by step in a serial manner, requiring many times more inference steps than non-sequential estimators (garnelo2018conditional; kim2019attentive; kingma2013auto; goodfellow2014generative). Moreover, it is difficult to exploit modern parallel computation because AR models always require the previous time step by definition. This largely limits the use of AR models in practice despite their advantages.

To resolve this problem, we introduce a new and generic approximation method, *Neural Auto-Regressive model Approximator (NARA)*, which can be easily plugged into any AR model. We show that NARA can reduce the generation complexity of AR models by relaxing their inevitable AR nature, enabling AR models to employ powerful parallelization techniques in sequential data generation, which was previously difficult.

NARA consists of three modules: (1) a prior-sample predictor, (2) a confidence predictor, and (3) the original AR model. To relax the AR nature, given a set of past samples, we first assume that each sample of the future sequence can be generated in an independent and identical manner. Thanks to this i.i.d. assumption, using the first module of NARA, we can sample a series of future priors, and these future priors are post-processed by the original AR model, generating a set of raw predictions. The confidence predictor evaluates the credibility of these raw samples and decides whether the model needs re-sampling or not. The confidence predictor plays an important role because approximation errors can accumulate during the sequential AR generation process if erroneous samples with low confidence are left unchanged. Therefore, in our model, each sample can be drawn either by the AR model or by the proposed approximation method, and the selection between the generated samples is guided by the predicted confidence.

We evaluate NARA with various baseline AR models and data domains, including simple curves, image sequences (Yoo2017VariationalAR), CelebA (liu2015deep), and ImageNet (imagenet_cvpr09). For the sequential data (simple curves and golf swings), we employed Long Short-Term Memory (LSTM) models (hochreiter1997long) as the baseline AR model, while PixelCNN++ (salimans2017pixelcnn++) is used for image generation (CelebA and ImageNet). Our experiments show that NARA can largely reduce the sample inference complexity even with a heavy and complex model on a difficult data domain such as image pixels.

The main contributions of our work can be summarized as follows: (1) we introduce a new and generic approximation method that can accelerate any AR generation procedure. (2) Compared to a full AR generation, the quality of the approximated samples remains reliable thanks to the accompanying confidence prediction model that measures sample credibility. (3) Finally, we show that this is possible because, under a mild condition, the approximated samples from our method eventually converge toward the true future samples. Thus, our method can effectively reduce the generation complexity of the AR model by partially substituting it with a simple i.i.d. model.

## 2 Preliminary: Auto-regressive Models

The auto-regressive generation model is a probabilistic model that assigns a probability $p(\mathbf{x})$ to data consisting of $T$ samples. This method treats the data as a sequence $\mathbf{x}_{1:T} = (x_1, \ldots, x_T)$, and the probability is factorized in an AR manner as follows:

$$p_\theta(\mathbf{x}_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}) \tag{1}$$

From this formulation, the AR model provides a tractable data distribution $p_\theta(\mathbf{x})$. Recently, in training the model parameters $\theta$ on the training samples, computation parallelization has been actively employed for calculating the distance between the real sample and the generated sample from equation (1). Still, generating $T$ future samples requires $T$ sequential steps by definition.
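As a concrete illustration of this serial bottleneck, the following toy sketch samples from a hypothetical Gaussian AR(1) process (a stand-in example, not the models used in this paper); each step depends on the previous one, so the loop cannot be parallelized:

```python
import numpy as np

def ar_sample(T, phi=0.8, sigma=0.1, seed=0):
    """Draw a length-T sequence serially: each x_t is conditioned on the
    previous sample, so generation takes T - 1 dependent steps."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    for t in range(1, T):  # inherently sequential loop
        x[t] = phi * x[t - 1] + sigma * rng.standard_normal()
    return x

seq = ar_sample(100)
```

Training, by contrast, can evaluate all $T$ conditionals in parallel against the observed sequence, which is why only generation suffers from the serial cost.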

## 3 Proposed Method

### 3.1 Overview

Figure 1 shows the concept of the proposed approximator NARA.
NARA consists of a *prior-sample predictor* and a *confidence predictor*.
Given samples $x_{1:t}$, the prior-sample predictor predicts a chunk of $\Delta$ prior values $m_{t+1:t+\Delta}$. Afterward, using the prior samples, we draw the future samples $x_{t+1:t+\Delta}$ in parallel.
We note that this is possible because the priors are i.i.d. variables by our assumption.
Subsequently, for the predicted $x_{t+1:t+\Delta}$, the confidence predictor estimates confidence scores $c_{t+1:t+\Delta}$. Then, using the predicted confidence, our model decides whether the samples of interest should be redrawn by the AR model (*re-sampled*) or simply *accepted*.
The detailed explanation is given in the following sections.
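The accept-or-redraw procedure above can be sketched as follows. This is a minimal illustration with hypothetical stand-in callables (`prior_fn`, `ar_fn`, and `conf_fn` are placeholders, not the paper's networks): a chunk of priors is proposed at once, and each proposal is kept only if its confidence clears the threshold, otherwise the base AR model redraws it serially:

```python
def nara_generate(context, length, prior_fn, ar_fn, conf_fn, delta=4, thresh=0.5):
    """Sketch of NARA's loop: prior_fn proposes `delta` future samples at
    once (i.i.d. given the context), conf_fn scores each proposal, and any
    low-confidence proposal is redrawn by the base AR model ar_fn."""
    x = list(context)
    while len(x) < length:
        chunk = prior_fn(x, delta)   # parallel chunk proposal
        scores = conf_fn(x, chunk)   # per-sample confidence
        for m, c in zip(chunk, scores):
            if len(x) == length:
                break
            x.append(m if c >= thresh else ar_fn(x))  # accept or re-sample
    return x

# Toy stand-ins: the "true" AR rule adds 1 each step; the prior guesses a ramp.
prior = lambda ctx, d: [ctx[-1] + 1 + i for i in range(d)]
ar    = lambda ctx: ctx[-1] + 1
conf  = lambda ctx, chunk: [1.0] * len(chunk)  # fully confident prior

seq = nara_generate([0], 10, prior, ar, conf)
```

When every confidence score clears the threshold, the whole chunk is accepted at once; when none do, the loop degenerates to ordinary serial AR sampling.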

### 3.2 Approximating Sample Distribution of AR Model

Given the samples $x_{1:t}$, an AR model defines the distribution of the future samples $x_{t+1:t+\Delta}$ as follows:

$$p_\theta(x_{t+1:t+\Delta} \mid x_{1:t}) = \prod_{i=1}^{\Delta} p_\theta(x_{t+i} \mid x_{1:t+i-1}) \tag{2}$$

Here, $\theta$ denotes the parameters of the AR model, and the indices are assumed to satisfy $1 \le i \le \Delta$.
To approximate the distribution $p_\theta(x_{t+1:t+\Delta} \mid x_{1:t})$, we introduce a set of *prior* samples $m_{t+1:t+\Delta} \sim p_\psi(m \mid x_{1:t})$, where we assume that they are i.i.d. given the observation $x_{1:t}$.
Here, $\psi$ is the model parameter of the *prior-sample predictor*.

Based on this, we define an approximated distribution $q_{\theta,\psi}$ characterized by the original AR model and the prior-sample predictor as follows:

$$p_\theta(x_{t+1:t+\Delta} \mid x_{1:t}) \overset{(A)}{\approx} \prod_{i=1}^{\Delta} p_\theta(x_{t+i} \mid x_{1:t}, m_{t+1:t+i-1}) \equiv q_{\theta,\psi}(x_{t+1:t+\Delta} \mid x_{1:t}) \tag{3}$$

Here, approximation (A) becomes exact when $m$ approaches $x$. Note that it becomes possible to compute every factor of $q_{\theta,\psi}$ in constant time because we assume the prior variables to be i.i.d., while sampling from $p_\theta(x_{t+1:t+\Delta} \mid x_{1:t})$ requires time linear in $\Delta$.

Then, we optimize the network parameters $\theta$ and $\psi$ by minimizing the negative log-likelihood (NLL) of $q_{\theta,\psi}$ over a set of samples drawn from the baseline AR model. We guide the prior-sample predictor to generate prior samples that are likely to come from the distribution of the original AR model by minimizing over the original AR model and the prior-sample predictor jointly as follows:

$$\min_{\theta,\psi} \; -\log p_\theta(y \mid x_{1:t}) - \log q_{\theta,\psi}(y \mid x_{1:t}) \tag{4}$$

where $y$ denotes the ground-truth sample values in the generated region of the training samples. Note that both $p_\theta$ and its approximation $q_{\theta,\psi}$ approach the true data distribution when (1) the prior-sample predictor generates prior samples **m** close to the true samples **x**, and (2) the NLL of the AR distribution approaches that of the data distribution. Based on our analysis and experiments, we show in the following sections that our model can satisfy these conditions both theoretically and empirically.
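A minimal sketch of the joint objective in equation (4), assuming for illustration that both the AR model and the prior-sample predictor output Gaussian means (the actual likelihoods in the paper depend on the base model, so the names and shapes here are hypothetical):

```python
import numpy as np

def gaussian_nll(y, mu, sigma=0.1):
    """Negative log-likelihood of y under a Gaussian N(mu, sigma^2)."""
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def joint_nll(y_true, ar_mean, prior_mean):
    """Joint objective: both the AR prediction and the i.i.d. prior samples
    are pushed toward the ground-truth futures, so the priors learn to
    mimic draws from the AR model."""
    return float(np.mean(gaussian_nll(y_true, ar_mean)
                         + gaussian_nll(y_true, prior_mean)))

y = np.linspace(0.0, 1.0, 8)           # hypothetical ground-truth futures
loss_match = joint_nll(y, y, y)        # prior matches the targets
loss_drift = joint_nll(y, y, y + 0.5)  # prior drifts from the targets
```

Because both terms share the same target, minimizing the sum simultaneously fits the AR model and pulls the prior samples toward samples the AR model would have produced.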

### 3.3 Confidence Prediction

Using the prior-sample predictor, our model generates future samples based on the previous samples.
However, accumulated approximation errors in the AR generation may lead to unsuccessful sample generation.
To mitigate this problem, we introduce an auxiliary module, referred to as the *confidence predictor*, that determines whether to accept or reject the approximated samples generated as described in the previous subsection.

First, we define the confidence of the generated samples as follows:

$$c_{t+i} = p_\theta(x_{t+i} = m_{t+i} \mid x_{1:t}, m_{t+1:t+i-1}) \tag{5}$$

where $1 \le i \le \Delta$ and $m \sim p_\psi(m \mid x_{1:t})$. The confidence value provides a measure of how likely a sample generated from the prior-sample predictor is to be drawn from the AR distribution. Based on the confidence value $c_{t+i}$, our model decides whether it can accept the sample or not. More specifically, we choose a threshold $\delta \in [0, 1]$ and accept samples whose confidence score is larger than the threshold $\delta$.

When $\delta = 1$, our model always redraws the samples using the AR model no matter how high the confidence is; in that case, our model becomes equivalent to the target AR model. When $\delta = 0$, our model always accepts the approximated samples. In practice, among the $\Delta$ approximated samples, we accept those preceding the first sample whose confidence falls below $\delta$. Subsequently, we re-sample the rejected position from the original AR model and repeat the approximation scheme until the sequence reaches its maximum length.

However, it is impractical to calculate equation (5) directly because it requires the samples from the original AR model: we would first have to run the AR model forward to obtain the next sample and then go back to compute the confidence deciding whether we needed that sample at all, which defeats the purpose of the approximation.

To circumvent this problem, we introduce a network that approximates the binary accept/reject decision variable as follows:

$$\sigma\big(g_\phi(x_{1:t}, m_{t+1:t+\Delta})\big) \approx \mathbb{1}[c_{t+i} > \delta] \tag{6}$$

where $\sigma$ denotes the sigmoid function and $\phi$ the parameters of the confidence predictor $g_\phi$. The network is implemented by an auto-encoder architecture with a sigmoid activation output, which makes equation (6) equivalent to logistic regression.
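Since the confidence predictor reduces to logistic regression on the accept/reject variable, its training can be illustrated with a toy sketch on synthetic data (the features, labels, and single linear layer here are hypothetical stand-ins for the paper's auto-encoder):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic per-sample features and binary accept labels (hypothetical):
# a sample is "acceptable" exactly when its first feature is positive.
rng = np.random.default_rng(0)
feats = rng.standard_normal((256, 4))
labels = (feats[:, 0] > 0).astype(float)

# A single linear layer with sigmoid output, trained by gradient descent
# on the binary cross-entropy loss (i.e., plain logistic regression).
w = np.zeros(4)
for _ in range(200):
    p = sigmoid(feats @ w)
    w -= 0.5 * feats.T @ (p - labels) / len(labels)

acc = float(np.mean((sigmoid(feats @ w) > 0.5) == labels))
```

The sigmoid output plays the role of the approximate decision variable: thresholding it at 0.5 recovers the binary accept/reject choice without ever querying the AR model.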

### 3.4 Training details

To train the proposed model, we randomly select a position $t$ in each sequence of a training batch. Then, we predict $\Delta$ sample values after $x_t$, where $\Delta$ denotes the number of future samples the prediction considers. To calculate equation (4), we minimize the loss over the training samples and the selected locations as

$$\mathcal{L}(\theta, \psi) = \sum_{j} -\log p_\theta\big(y^{(j)} \mid x^{(j)}_{1:t}\big) - \log q_{\theta,\psi}\big(y^{(j)} \mid x^{(j)}_{1:t}\big) \tag{7}$$

Here, $y^{(j)}$ denotes the ground-truth future sequence of the $j$-th training sample from the AR distribution. From the experiments, we found that a single sampled location per sequence is enough to train the model. This training scheme guides the distribution drawn by NARA to fit the original AR distribution as well as to generate future samples, simultaneously. To train the confidence predictor, the binary cross-entropy loss is used with the decision variable in equation (6), while freezing the other parameters.

### 3.5 Theoretical Explanation

Here, we show that the proposed NARA is a regularized version of the original AR model and that, at the extremum, the approximated sample distribution from NARA is equivalent to that of the original AR model. In NARA, our approximate distribution is reformulated as follows:

$$q_{\theta,\psi}(x_{t+1:t+\Delta} \mid x_{1:t}) = p_\theta(x_{t+1:t+\Delta} \mid x_{1:t}) \cdot \frac{q_{\theta,\psi}(x_{t+1:t+\Delta} \mid x_{1:t})}{p_\theta(x_{t+1:t+\Delta} \mid x_{1:t})} \tag{8}$$

where the parameters $\theta$ and $\psi$ characterize the approximated distribution $q_{\theta,\psi}$. Therefore, our proposed cost function can be represented as the negative log-likelihood of the AR model with a regularizer $R$:

$$-\log q_{\theta,\psi}(x_{t+1:t+\Delta} \mid x_{1:t}) = -\log p_\theta(x_{t+1:t+\Delta} \mid x_{1:t}) + R \tag{9}$$

Note that the proposed cost function is equivalent to that of the original AR model when $R = 0$, which is true when $q_{\theta,\psi}$ coincides with $p_\theta$, i.e., when $m$ approaches $x$. Here, $R = \log\big(p_\theta / q_{\theta,\psi}\big)$. By minimizing equation (9), $R$ enforces the direction of the optimization to estimate the probability ratio of $p_\theta$ and $q_{\theta,\psi}$ while it minimizes the gap between them, so that $q_{\theta,\psi}$ approaches $p_\theta$.

## 4 Related Work

Deep AR and regression models: With deep neural networks, AR models have achieved significant improvements in handling various sequential data including text (sutskever2011generating), sound (vinyals2012revisiting; tamamori2017speaker), and images (oord2016pixel; salimans2017pixelcnn++). The idea has been extended to flow-based models, which use auto-regressive sample flows (kingma2018glow; germain2015made; papamakarios2017masked; kingma2016improved) to infer complex distributions and have reported meaningful progress. Also, attempts (Yoo2017VariationalAR; garnelo2018conditional; kim2019attentive) to replace the kernel function of stochastic regression and prediction processes with neural networks have been proposed to deal with semi-supervised data that do not impose an explicit sequential relationship.

Approximated AR methods: Reducing the complexity of deep AR models has been explored by a number of studies, either targeting multiple domains (seo2017neural; stern2018blockwise) or specific targets such as machine translation (wang2018semi; ghazvininejad2019constant; welleck2019non; wang2019non) and image generation (ramachandran2017fast).

Going one step further than these studies, we propose a new, general approximation method for AR models that assumes an i.i.d. condition for the "easy to predict" samples. This differentiates our approach from (seo2017neural) in that we do not sequentially approximate the future samples with a smaller AR model but use a chunk-wise predictor to approximate the samples at once. In addition, our confidence prediction module can be seen as a stochastic version of the verification step in (stern2018blockwise), which helps our model converge toward the original solution. This confidence-guided approximation can easily augment other domain-specific AR approximation methods because our method is not limited to domain-specific selection cues such as quotation (welleck2019non; ghazvininejad2019constant) or nearby convolutional features (ramachandran2017fast).

## 5 Experiments

In this section, we demonstrate the data generation results of the proposed NARA. To check the feasibility, we first apply our method to a time-series data generation problem and then to image generation. The detailed model structures and additional results are provided in the Supplementary material. The implementation of the methods will be available soon.

### 5.1 Experimental Setting

Time-series data generation problem: In this problem, we used an LSTM as the base model. First, we tested our method with a simple one-dimensional sinusoidal function. Second, we tested video sequence data (golf swing) to demonstrate a more complicated case. In this case, we repeated the swing sequences to make periodic image sequences and resized each image to a fixed resolution. Beside the LSTM, we used autoencoder structures to embed the images into a latent space; the projected points of the image sequences are linked by the LSTM, similar to (Yoo2017VariationalAR). For both cases, we used the ADAM optimizer (kingma2014adam) with the default setting and a fixed learning rate.

| Threshold $\delta$ | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1 | 0.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Acceptance (%) | 0.0 | 15.3 | 43.0 | 59.3 | 69.1 | 77.1 | 82.1 | 84.5 | 88.0 | 92.5 | 100 |
| Mean error | 21.1 | 26.8 | 24.9 | 21.7 | 18.4 | 19.9 | 18.3 | 20.6 | 19.4 | 25.3 | 24.3 |

Image generation: For the image generation task, we used PixelCNN++ (salimans2017pixelcnn++) as the base model; see (salimans2017pixelcnn++) for a detailed explanation of the network parameters such as the number of channels and the number of logistic mixtures. In this task, the baseline AR model (PixelCNN++) is much heavier than the models used in the previous tasks. Here, we show that the proposed approximation model can significantly reduce the computational burden of the original AR model. The prior-sample predictor and the confidence estimator were both implemented by U-net structured autoencoders (ronneberger2015u). We optimized the models using ADAM, and every module was trained from scratch. We mainly used CelebA (liu2015faceattributes), resizing the samples to a fixed resolution and randomly splitting the images into training and validation sets.

Training and evaluation: For the first problem, we used a single GPU (NVIDIA Titan XP), and for the second problem, four to eight GPUs (NVIDIA Tesla P40) were used.^1 The training and inference code used in this section is implemented with the PyTorch library. For the quantitative evaluation, we measure the error between the true future samples and the generated ones, and we also employ the Fréchet Inception Distance (FID) (heusel2017fid) as a measure of model performance and of the visual quality of the generated images for the image generation problem.

^1 The overall experiments were conducted on the NSML (sung2017nsml) GPU system.

### 5.2 Analysis

#### 5.2.1 Time-series Data Generation

Figure 1(a) shows the generation results of the one-dimensional time series from our approximation model with different acceptance ratios (red, green, and blue) and the baseline LSTM model (black). From the figure, we can see that both models correctly generate the future samples. Note that, from the prior sample generation result (magenta), the prior samples **m** converged to the true samples **x**, as claimed in Section 3.5.

The graph in Figure 1(b) shows the acceptance ratio and the error over the gauge threshold $\delta$. The error denotes the distance between the ground-truth samples **x** and the generated ones. As expected, our model accepted more samples as the threshold decreased. However, contrary to our initial expectations, the error-threshold graph shows that accepting fewer samples does not always produce more accurate generation results: the generation with an intermediate acceptance ratio achieved the best result. Interestingly, this tendency between the acceptance ratio and the generation quality was repeatedly observed in the other datasets as well.

Figure 3 shows the image sequence generation results from NARA. From the result, we can see that the proposed approximation method remains effective when the input data dimension becomes much larger and the AR model becomes more complicated. In the golf swing dataset, the proposed approximation model also succeeded in capturing the periodic change of the image sequence. Table 3 shows that a proper amount of approximation can obtain better accuracy than none, similar to the other experiments. One notable observation is that the period of the image sequence changed slightly across different ratios of approximated sample acceptance (Figure 3). One possible explanation is that the approximation module suppresses rapid changes of the samples, which affects the interval of a single cycle.

#### 5.2.2 Image Generation

Figure 4 shows that our method can be integrated into PixelCNN++ and generates images with a significant amount of predicted sample acceptance (white region). We observed that the confidence was mostly low (blue) in the eyes, mouth, and boundary regions of the face, and PixelCNN++ is used to generate those regions. This shows that, compared to the other homogeneous regions of the image, the model finds it relatively hard to describe the details, which matches our intuition.

The graphs in Figure 5 present the quantitative analysis of the inference time and the NLL in generating images. In Figure 4(a), the relation between inference time and the skimming ratio is reported. The results show that the inference speed improves significantly as more pixels are accepted. Table 2 further supports this: our approximation method generates images of fair quality while speeding up the generation procedure several times over the base model.


In the image generation example as well, we found that a fair amount of acceptance can improve the perceptual visual quality of the generated images compared to the vanilla PixelCNN++ (Table 2). Our method benefits from increasing the acceptance ratio to some extent in terms of FID, showing a U-shaped trend over the variation, similar to Figure 1(b). Note that a lower FID score indicates a better model. Consistent with the previous results, we conjecture that the proposed approximation scheme learns the mean prior of the images and guides the AR model to avoid generating erroneous images. The confidence maps and the graphs in Figures 4, 5a, and 5c support this conjecture: complex details such as eyes, mouths, and contours are considerably harder to generate than the backgrounds and the rest of the face.

In Figure 4(b) and Figure 4(c), the graphs support the convergence of the proposed method. The graph in Figure 4(b) shows the NLL of the base PixelCNN++ and that of our proposed method in the full-accept case, i.e., when we fully believe the approximation results. Note that the NLL of both cases converged, and the PixelCNN++ achieved a noticeably lower NLL than fully accepting the pixels at every epoch. This is expected from Section 3.2: the baseline AR model approaches the data distribution more closely than our approximation module. This supports the necessity of the re-generation procedure by PixelCNN++, especially when the approximation module finds that a pixel has low confidence.

The graph in Figure 4(c) presents the distance between the generated prior pixels **m** and the corresponding ground-truth pixels **x** in the test data reconstruction. Again, similar to the previous time-series experiments, the model successfully converged to the original values (**m** approaches **x**). Combined with the result in Figure 4(b), this supports the convergence conditions claimed in Section 3.2. Regarding the convergence, we compared the NLL of the converged PixelCNN++ distribution from the proposed scheme with that of PixelCNN++ on the CelebA dataset from the original paper (salimans2017pixelcnn++).

| Threshold $\delta$ | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1 | 0.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SU | 1.0x | 1.2x | 1.7x | 2.2x | 2.8x | 3.5x | 4.1x | 4.5x | 5.3x | 6.8x | 12.9x |
| SU \A | 1.0x | 1.2x | 1.7x | 2.4x | 3.1x | 4.1x | 5.2x | 5.9x | 7.4x | 11.0x | 59.4x |
| AR (%) | 0.0 | 15.3 | 43.0 | 59.3 | 69.1 | 77.1 | 82.1 | 84.5 | 88.0 | 92.5 | 100 |
| FID | 56.8 | 50.1 | 45.1 | 42.5 | 41.6 | 42.3 | 43.4 | 45.9 | 48.7 | 53.9 | 85.0 |

## 6 Conclusion

In this paper, we proposed an efficient neural auto-regressive model approximation method, NARA, which can be used with various auto-regressive (AR) models. By introducing the prior-sampling and confidence prediction modules, we showed that NARA can theoretically and empirically approximate the future samples under a relaxed causal relationship. This approximation simplifies the generation process and enables our model to use powerful parallelization techniques in the sample generation procedure. In the experiments, we showed that NARA can be successfully applied with different AR models to various tasks, from simple and complex time-series data to image pixel generation. These results support that the proposed method introduces a way to use AR models more efficiently.

## References

## Appendix A Implementation Details

Time-series sample generation: For the one-dimensional sample case, the LSTM consists of two hidden layers. The model observes a window of past steps and predicts chunks of future samples. The chunk-wise predictor and the confidence predictor were each defined by a single fully-connected layer.

For the visual sequence case, we used autoencoder structures to embed the images into a latent space. The encoder consists of four "Conv-activation" blocks plus one final Conv filter, where "Conv" denotes a convolutional filter. Similarly, the decoder consists of four "Conv transpose-activation" blocks with one final Conv-transpose filter. The activation function was the Leaky ReLU.

The embedding space was defined to be 10-dimensional, and the model predicts chunks of future samples given the previous samples. The prior-sample predictor and the confidence predictor are each defined by fully-connected layers and predict future samples in conjunction with the decoder. We note that the encoder and decoder were pre-trained following (kingma2013auto).

Image generation: The prior-sample and confidence predictors consist of an identical backbone network except for the last block. The network consists of four "Conv-BN-activation" blocks followed by four "Conv transpose-BN-activation" blocks. The last activation of the prior-sample predictor is the hyperbolic tangent, and that of the confidence predictor is the sigmoid function; the other activations are Leaky ReLU. We used batch normalization (ioffe2015batch) for both networks, and the stride is set to two for all filters. The decision threshold $\delta$ is set to the running mean of the confidence scores. We used this setting for every test and confirmed that it works properly in all cases.

## Appendix B Supplementary Experiments

In addition to the results presented in the paper, we show supplementary generation examples in the figures below. Figure 3 and Table 3 present the image sequence generation results from another golf swing sequence. In this case as well, we can observe the swing cycle period changes and the acceptance-ratio-versus-error tendencies reported in the paper. Our approximation slightly affects the cycle of the time-series data, and the results also show that the approximation can achieve even better prediction results than the "none-acceptance" case.

Figures 8 and 9 show additional facial image generation results across different thresholds. We can see that pixels from the boundary regions were re-sampled more frequently than other relatively simple face parts such as the cheeks or forehead. Also, we tested our model on ImageNet classes, and the results are presented in Figure 7.

| Threshold $\delta$ | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1 | 0.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Acceptance (%) | 0.0 | 18.0 | 35.0 | 44.5 | 56.1 | 61.0 | 70.5 | 85.5 | 92.5 | 99.0 | 100 |
| Mean Error | 2.20 | 2.61 | 2.21 | 1.79 | 2.28 | 2.10 | 1.91 | 1.87 | 1.94 | 2.38 | 2.45 |
