Parallel and Flexible Sampling from Autoregressive Models via Langevin Dynamics

05/17/2021 · by Vivek Jayaram, et al.

This paper introduces an alternative approach to sampling from autoregressive models. Autoregressive models are typically sampled sequentially, according to the transition dynamics defined by the model. Instead, we propose a sampling procedure that initializes a sequence with white noise and follows a Markov chain defined by Langevin dynamics on the global log-likelihood of the sequence. This approach parallelizes the sampling process and generalizes to conditional sampling. Using an autoregressive model as a Bayesian prior, we can steer the output of a generative model using a conditional likelihood or constraints. We apply these techniques to autoregressive models in the visual and audio domains, with competitive results for audio source separation, super-resolution, and inpainting.


1 Introduction

Neural autoregressive models (Larochelle & Murray, 2011) are a popular family of generative models, with wide-ranging applications in a variety of domains including audio (van den Oord et al., 2016a; Dhariwal et al., 2020), images (van den Oord et al., 2016b; Salimans et al., 2017; Parmar et al., 2018; Razavi et al., 2019), and text (Radford et al., 2019; Brown et al., 2020). These models parameterize the conditional distribution over a token in an ordered sequence, given previous tokens in the sequence. The standard approach to sampling from an autoregressive model iteratively generates tokens, according to a conditional distribution over tokens defined by the model, conditioned on the partial sequence of previously generated tokens. We will refer to this approach to sampling as the ancestral sampler.
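
As a point of reference, the ancestral sampler can be sketched in a few lines; the following is a minimal illustration, assuming a hypothetical `model(prefix)` interface that returns the categorical distribution over the next token given the tokens generated so far.

```python
import numpy as np

def ancestral_sample(model, length, seed=0):
    """Sample a sequence token-by-token from an autoregressive model.

    `model(prefix)` is assumed to return a length-K probability vector
    p(x_i | x_{<i}) over the K discrete token values.
    """
    rng = np.random.default_rng(seed)
    tokens = []
    for _ in range(length):              # O(length) strictly serial steps
        probs = model(np.array(tokens))  # conditional distribution over the next token
        tokens.append(rng.choice(len(probs), p=probs))
    return np.array(tokens)
```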

There are two major drawbacks to ancestral sampling that limit the usefulness of autoregressive models in practical settings. First, ancestral sampling has time complexity that scales linearly in the length of the generated sequence. For data such as high-resolution images or audio, ancestral sampling from an autoregressive model (where the tokens are pixels or sound pressure readings, respectively) can be impractically slow. Second, ancestral sampling is frustratingly inflexible. It is easy to sample the second half of a sequence conditioned on the first, but filling in the first half of a sequence conditioned on the second naively requires training a new model that reverses the ordering of tokens in the autoregressive factorization. Conditioning on arbitrary subsets of tokens for tasks such as inpainting or super-resolution seems beyond the reach of autoregressive modeling.

This paper introduces an alternative, parallel and flexible (PnF) sampler for autoregressive models that can be parallelized and steered using conditioning information or constraints. Code and examples of PnF sampling are available at https://grail.cs.washington.edu/projects/pnf-sampling/.
Instead of sampling tokens sequentially, the PnF sampler initializes a complete sequence (with random tokens) and proceeds to increase the log-likelihood of this sequence by following a Markov chain defined by Langevin dynamics (Neal et al., 2011) on a smoothed log-likelihood. The smoothing temperature is cooled over time according to an annealing schedule informed by Song & Ermon (2019, 2020). Convergence time of this annealed Langevin dynamics is empirically independent of the sequence length and, generalizing Jayaram & Thickstun (2020), the PnF sampler can be applied to posterior log-likelihoods to incorporate conditional information into the sampling process.

Figure 1: A visual summary of discretized autoregressive smoothing. Given a noisy history $\tilde{x}_{<i}$ (left column), where $\tilde{\mathbf{x}} = \mathbf{x} + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, we train a model to predict the un-noised distribution over $x_i$ (middle column). This distribution is discrete and non-differentiable in $\tilde{x}_i$; we convolve it with a Gaussian $\mathcal{N}(0, \sigma^2)$ to produce a continuous estimate of $p_\sigma(\tilde{x}_i \mid \tilde{x}_{<i})$ (right column). We can run Langevin dynamics on the continuous distribution, and gradually anneal the smoothing to approximate the target distribution.

The primary technical contribution of this paper is the development of the PnF sampler for discretized autoregressive models (Section 3.1). Our interest in these models is motivated by their success as unconditional models of audio waves (van den Oord et al., 2016a; Mehri et al., 2017; Dhariwal et al., 2020). Defined over a discrete lattice within a continuous space, these models occupy a middle ground between continuous and discrete models. For continuous models such as RNADE (Uria et al., 2013), PnF sampling can be directly applied as in Song & Ermon (2019); Jayaram & Thickstun (2020). We defer the development of the PnF sampler for fully discrete models to future work.

In Section 3.2, we present a stochastic variant of the PnF sampler based on stochastic gradient Langevin dynamics (Welling & Teh, 2011). This is an embarrassingly parallel, asynchronous distributed algorithm for autoregressive sampling. Using a WaveNet model, we show in Section 4.3 that stochastic PnF sampling approximates the quality of ancestral sampling to arbitrary accuracy, with compute time that is inversely proportional to the number of computing devices. This allows PnF sampling to take full advantage of modern, massively parallel computing infrastructure.

We will see in Section 3.3 how the PnF sampler can find solutions to general posterior sampling problems, using an unconditional generative model as a prior. In Section 4 we present applications of the PnF sampler to a variety of Bayesian image and audio inverse problems. We focus on linear inverse problems, using PixelCNN++ (Salimans et al., 2017) and WaveNet (van den Oord et al., 2016a) models as priors. Sections 4.4, 4.5, and 4.6 demonstrate PnF conditional sampling for source separation, super-resolution, and inpainting, respectively. PnF sampling results correlate strongly with the strength of the generative model used as a prior; as better autoregressive models are developed, they can be used with PnF sampling to improve performance on conditional generation tasks. We refer the reader to the project website for demonstrations of audio PnF sampling.

2 Related Work

The PnF autoregressive sampler is based on the annealed Langevin dynamics introduced in Song & Ermon (2019). The extension of annealed Langevin dynamics to conditional sampling problems was discussed in Jayaram & Thickstun (2020) for source separation and image coloring problems, and developed further in Song et al. (2021) for general posterior sampling problems. The present work extends these methods to discretized autoregressive models, for which the smoothing procedures described in previous work are not directly applicable (Frank & Ilse, 2020). A related sampler based on Gibbs sampling rather than Langevin dynamics is presented in Hadjeres et al. (2017). This Gibbs sampler applies to less general conditional sampling problems (partial observations) than PnF sampling, and requires a specialized model trained with a pseudo-likelihood objective.

The slow speed of ancestral sampling is a persistent obstacle to the adoption and deployment of autoregressive models. This has inspired algorithms that seek to parallelize the sampling process. Parallel WaveNet (van den Oord et al., 2018) and ClariNet (Ping et al., 2019) train generative flow models to mimic the behavior of an autoregressive model. Sampling a flow model requires only one pass through a feed-forward network and can be distributed across multiple devices. Wiggers & Hoogeboom (2020) and Song et al. (2020) propose fixed-point algorithms that, like PnF sampling, iteratively refine an initial sample from a simple distribution into a sample from the target distribution. But none of these methods are easily adaptable to source separation (Section 4.4) or more general conditional sampling tasks.

Bayesian inverse problems are explored extensively in theoretical settings, where the prior is given by a simple analytical distribution (Tropp & Wright, 2010; Knapik et al., 2011; Wang et al., 2017). These problems have also been studied using learned priors given by GANs, with a focus on linear inverse problems (Rick Chang et al., 2017; Bora et al., 2017; Raj et al., 2019). These GAN-based approaches are tailored to the latent variable architecture of the model, performing latent space optimizations to find codes that correspond to desired outputs. There is no obvious extension of these latent variable approaches to autoregressive models.

While we focus on autoregressive models, due to their strong empirical performance as unconditional models of audio, PnF sampling could be applied more generally to other likelihood-based models. In the audio space, this includes recent diffusion models (Kong et al., 2021; Chen et al., 2021). Note, however, that audio vocoder models (Prenger et al., 2019; Kim et al., 2018; Ping et al., 2020), which rely on spectrogram conditioning, cannot be adapted as priors for the source separation, super-resolution, and inpainting experiments presented in Section 4. In addition, GAN-based models (Donahue et al., 2019; Kumar et al., 2019), which are not likelihood-based, cannot be sampled using PnF.

3 Parallel and Flexible Sampling

We want to sample from an autoregressive generative model $p(\mathbf{x})$ over an indexed sequence of values $\mathbf{x} = (x_1, \dots, x_N)$, where

$p(\mathbf{x}) = \prod_{i=1}^{N} p(x_i \mid x_{<i}).$   (1)

We are particularly interested in developing a sampler for discretized autoregressive models, where $\mathbf{x} \in \mathbb{R}^N$ and each conditional $p(x_i \mid x_{<i})$ has support on a finite set of scalar values $\mu_1, \dots, \mu_K \in \mathbb{R}$. This set could represent, for example, an 8-bit encoding of an image or audio wave.

We propose to sample from $p(\mathbf{x})$ via Langevin dynamics. Let $\eta > 0$, $\varepsilon_t \sim \mathcal{N}(0, I)$, and define a Markov chain

$\mathbf{x}_{t+1} = \mathbf{x}_t + \eta \nabla_{\mathbf{x}} \log p(\mathbf{x}_t) + \sqrt{2\eta}\, \varepsilon_t.$   (2)

If $p$ were a smooth density then, for sufficiently small $\eta$, the Markov chain mixes and converges in distribution to $p$ as $t \to \infty$. But a discretized probability distribution defined over $\mathbb{R}^N$ is not smooth; the gradient $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ is not even well-defined. In Section 3.1 we propose a smoothing $p_\sigma$ of the discrete model $p$, creating a differentiable density on which the Markov chain (2) can mix.

To support conditional generation, we then turn our attention to sampling from the posterior of a joint distribution $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})$, where $p(\mathbf{x})$ is an autoregressive model over $\mathbf{x}$ and $p(\mathbf{y} \mid \mathbf{x})$ is a conditional likelihood. Langevin dynamics for sampling from the posterior are given by

$\mathbf{x}_{t+1} = \mathbf{x}_t + \eta \nabla_{\mathbf{x}} \left[ \log p(\mathbf{y} \mid \mathbf{x}_t) + \log p(\mathbf{x}_t) \right] + \sqrt{2\eta}\, \varepsilon_t.$   (3)

This is a convenient Markov chain for posterior sampling, because the partition function $p(\mathbf{y})$ vanishes when we take the gradient. However, like $p(\mathbf{x})$, the posterior is not smooth. In Section 3.3 we propose a smoothing of the joint distribution $p(\mathbf{x}, \mathbf{y})$, for which the posterior is differentiable and the Markov chain (3) can mix.

  Input: conditioning information y, noise levels $\sigma_1 > \dots > \sigma_L$, step size $\eta$, iterations per level $T$
  Sample $\mathbf{x} \sim \mathcal{N}(0, \sigma_1^2 I)$
  for $\ell = 1$ to $L$ do
     $\eta_\ell \leftarrow \eta \cdot \sigma_\ell^2 / \sigma_L^2$
     for $t = 1$ to $T$ do
        Sample $\varepsilon_t \sim \mathcal{N}(0, I)$
        $\Delta\mathbf{x} \leftarrow \eta_\ell \nabla_{\mathbf{x}} \left[ \log p_{\sigma_\ell}(\mathbf{x}) + \log p_{\sigma_\ell}(\mathbf{y} \mid \mathbf{x}) \right] + \sqrt{2\eta_\ell}\, \varepsilon_t$
        $\mathbf{x} \leftarrow \mathbf{x} + \Delta\mathbf{x}$
     end for
  end for
Algorithm 1 Parallel and Flexible Sampling

Given a smoothing procedure parameterized by a temperature parameter $\sigma$, we appeal to the simulated annealing heuristic developed in Song & Ermon (2019) to turn down the temperature as the Markov chain (2) or (3) mixes. In contrast to classical Markov chain sampling, for which samples converge in distribution to $p$, annealed Langevin dynamics converges asymptotically to a single point distributed approximately according to $p$. Algorithm 1 describes these annealed Langevin dynamics given a smoothed prior $p_\sigma(\mathbf{x})$ and smoothed likelihood $p_\sigma(\mathbf{y} \mid \mathbf{x})$. The structure of this algorithm is the same as the annealed Langevin dynamics presented in Song & Ermon (2019) (the unconditional case, with no conditioning information $\mathbf{y}$) and Jayaram & Thickstun (2020) (the conditional case); the novel contribution of this paper is the smoothing algorithm for evaluating $p_\sigma(\mathbf{x})$ given $p(\mathbf{x})$ (Section 3.1) and $p_\sigma(\mathbf{y} \mid \mathbf{x})$ given $p(\mathbf{y} \mid \mathbf{x})$ (Section 3.3).

Each step of the Langevin dynamics described in Equations (2) or (3) requires inference of $\nabla_{\mathbf{x}} \log p_\sigma(\mathbf{x})$, an $O(N)$ operation. Unlike sequential sampling, calculating this gradient for a given sequence is embarrassingly parallel; for moderate sequence lengths $N$, the cost of computing it is essentially constant using a modern parallel computing device. When $N$ is very large, e.g. for WaveNet models where just a minute of audio consists of over a million samples, it can be convenient to distribute sampling across multiple computing devices. In Section 3.2 we describe a stochastic variant of PnF sampling based on stochastic gradient Langevin dynamics that is easily distributed across a cluster of devices.
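
As a concrete illustration of Algorithm 1, the sketch below implements the annealed Langevin updates in NumPy. The functions `grad_log_prior_smooth` (the smoothed prior gradient from Section 3.1) and `grad_log_lik_smooth` (the smoothed likelihood gradient from Section 3.3) are assumed interfaces, not part of the original algorithm listing; for unconditional sampling the likelihood term can simply return zeros.

```python
import numpy as np

def pnf_sample(y, grad_log_prior_smooth, grad_log_lik_smooth,
               sigmas, steps_per_level, base_lr, x_init, seed=0):
    """Annealed Langevin dynamics (a sketch of Algorithm 1).

    sigmas: decreasing noise levels sigma_1 > ... > sigma_L.
    grad_log_prior_smooth(x, sigma): gradient of log p_sigma(x).
    grad_log_lik_smooth(x, y, sigma): gradient of log p_sigma(y | x).
    """
    rng = np.random.default_rng(seed)
    x = x_init.copy()
    for sigma in sigmas:
        # Step size scales with sigma^2, following Song & Ermon (2019).
        eta = base_lr * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            eps = rng.standard_normal(x.shape)
            grad = grad_log_prior_smooth(x, sigma) + grad_log_lik_smooth(x, y, sigma)
            x = x + eta * grad + np.sqrt(2 * eta) * eps
    return x
```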

3.1 Autoregressive Smoothing

We consider models that parameterize the conditional distribution $p(x_i \mid x_{<i})$ with a categorical softmax distribution over $K$ discrete values; given functions $f_1, \dots, f_K$, we define

$p(x_i = \mu_k \mid x_{<i}) = \frac{\exp f_k(x_{<i})}{\sum_{j=1}^{K} \exp f_j(x_{<i})}.$   (4)

The functions $f_k$ are typically given by a neural network, with shared weights across the sequential indices $i$. Collectively, these conditional models define the joint distribution $p(\mathbf{x})$ according to Equation (1).

We cannot directly compute gradients of the distribution $p(\mathbf{x})$ defined by discrete conditionals $p(x_i \mid x_{<i})$. Instead, we smooth $p$ by convolving it with a spherical Gaussian. This smoothing relies on the fact that the values $\mu_1, \dots, \mu_K$ represent scalar values on the real line, and therefore the discrete distribution $p$ can be viewed as a linear combination of weighted Dirac spikes on $\mathbb{R}^N$. If $g_\sigma$ denotes the density of a spherical Gaussian $\mathcal{N}(0, \sigma^2 I)$ on $\mathbb{R}^N$ then we define a density $p_\sigma$ on $\mathbb{R}^N$ given by

$p_\sigma(\tilde{\mathbf{x}}) = (p * g_\sigma)(\tilde{\mathbf{x}}) = \int_{\mathbb{R}^N} p(\mathbf{x})\, g_\sigma(\tilde{\mathbf{x}} - \mathbf{x})\, d\mathbf{x}.$   (5)

This distribution has well-defined gradients, and $p_\sigma \to p$ in total variation as $\sigma \to 0$.

When $p$ is a deep autoregressive model, the convolution (5) is difficult to calculate directly. Previous work proposed training a smoothed model $p_\sigma$ by fine-tuning a model on noisy data $\tilde{\mathbf{x}} = \mathbf{x} + \varepsilon$, where $\mathbf{x} \sim p$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ (Jayaram & Thickstun, 2020). This approach cannot be directly applied to discrete autoregressive models as defined by Equation (4). The obstruction is that noisy samples $\tilde{x}_i$ are not supported by the discretization $\mu_1, \dots, \mu_K$. One way to address the problem is to replace the discrete model with a continuous autoregressive model of $\tilde{\mathbf{x}}$, e.g. RNADE (Uria et al., 2013). We avoid this approach because fine-tuning $p$ to $p_\sigma$ becomes complicated when these models have different architectures.

Instead of directly fine-tuning $p$ to a model $p_\sigma$, we combine an analytic calculation with an auxiliary model learned via fine-tuning. Let $p_\sigma(x_i \mid \tilde{x}_{<i})$ denote a (discrete) conditional model trained to predict $x_i$ given noisy covariates $\tilde{x}_{<i}$. If $g_\sigma$ denotes the density of $\mathcal{N}(0, \sigma^2)$ then we can re-write the factored density as

$p_\sigma(\tilde{\mathbf{x}}) = \prod_{i=1}^{N} p_\sigma(\tilde{x}_i \mid \tilde{x}_{<i}) = \prod_{i=1}^{N} \big( p_\sigma(\,\cdot \mid \tilde{x}_{<i}) * g_\sigma \big)(\tilde{x}_i).$   (6)

On the right-hand side, we decompose the smoothed conditional densities into Gaussian convolutions of discrete conditionals evaluated at $\tilde{x}_i$. This suggests the following approach to evaluating $p_\sigma(\tilde{\mathbf{x}})$:

  • Learn a (discrete) model $p_\sigma(x_i \mid \tilde{x}_{<i})$, trained to predict the un-noised value $x_i$ given noisy history $\tilde{x}_{<i}$. This model can be learned efficiently by fine-tuning a pre-trained model $p(x_i \mid x_{<i})$ on noisy covariates. This is visualized in the middle column of Figure 1.

  • Evaluate the Gaussian convolution at $\tilde{x}_i$ to compute $\big( p_\sigma(\,\cdot \mid \tilde{x}_{<i}) * g_\sigma \big)(\tilde{x}_i)$. This convolution can be calculated in closed form given $p_\sigma(x_i \mid \tilde{x}_{<i})$. This is visualized in the right column of Figure 1.

The convolution $\big( p_\sigma(\,\cdot \mid \tilde{x}_{<i}) * g_\sigma \big)(\tilde{x}_i)$ has a simple closed form given by a Gaussian mixture model:

$\big( p_\sigma(\,\cdot \mid \tilde{x}_{<i}) * g_\sigma \big)(\tilde{x}_i) = \sum_{k=1}^{K} p_\sigma(\mu_k \mid \tilde{x}_{<i})\, \mathcal{N}\!\big(\tilde{x}_i;\, \mu_k,\, \sigma^2\big).$   (7)
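
To make Equation (7) concrete, the sketch below evaluates the smoothed conditional log-density and its gradient with respect to $\tilde{x}_i$, given the categorical output of the fine-tuned model; the function and variable names are illustrative, not taken from the released code.

```python
import numpy as np

def smoothed_conditional(x_tilde_i, probs, mu, sigma):
    """Gaussian-mixture smoothing of one discrete conditional (Equation 7).

    x_tilde_i: noisy scalar coordinate.
    probs:     length-K vector p_sigma(mu_k | noisy history) from the model.
    mu:        length-K vector of discrete support values (e.g. 8-bit codes).
    Returns (log density, d/dx_tilde_i of the log density).
    """
    diff = x_tilde_i - mu
    # Density of x_tilde_i under a Gaussian centered at each support point.
    gauss = np.exp(-0.5 * (diff / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    density = probs @ gauss                              # mixture density, Eq. (7)
    d_density = probs @ (gauss * (-diff / sigma ** 2))   # derivative of the mixture
    return np.log(density), d_density / density
```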

3.2 Stochastic Gradient Langevin Dynamics

  Input: conditioning information y, noise levels $\sigma_1 > \dots > \sigma_L$, step size $\eta$, iterations per level $T$, block length $B$
  Sample $\mathbf{x} \sim \mathcal{N}(0, \sigma_1^2 I)$ in shared memory
  for $\ell = 1$ to $L$ do
     $\eta_\ell \leftarrow \eta \cdot \sigma_\ell^2 / \sigma_L^2$
     Fork() across worker devices
     for $t = 1$ to $T$ do
        Sample a block index $b$ uniformly from $\{1, \dots, N/B\}$
        Sample $\varepsilon_t \sim \mathcal{N}(0, I)$
        Read the block $\mathbf{x}_b$ and its surrounding context from shared memory
        $\Delta\mathbf{x}_b \leftarrow \eta_\ell \nabla_{\mathbf{x}_b} \left[ \log p_{\sigma_\ell}(\mathbf{x}) + \log p_{\sigma_\ell}(\mathbf{y} \mid \mathbf{x}) \right]$
              $+ \sqrt{2\eta_\ell}\, \varepsilon_t$
        Write $\mathbf{x}_b \leftarrow \mathbf{x}_b + \Delta\mathbf{x}_b$ to shared memory
     end for
     Synchronize()
  end for
Algorithm 2 Stochastic Parallel and Flexible Sampling

Calculating the Langevin updates described in Equations (2) and (3) requires $O(N)$ operations to compute $\nabla_{\mathbf{x}} \log p_\sigma(\mathbf{x})$, given a sequence $\mathbf{x} \in \mathbb{R}^N$. This calculation decomposes into $N$ calculations of $\log p_\sigma(\tilde{x}_i \mid \tilde{x}_{<i})$, which can each be computed in parallel (this is what allows autoregressive models to be efficiently trained using the maximum likelihood objective). But for very large values of $N$, even modern parallel computing devices cannot fully parallelize all $N$ calculations. In this section, we develop a stochastic variant of PnF sampling (Algorithm 2) that is easily distributed across multiple devices.

Instead of making batch updates on a full sequence $\mathbf{x}$, consider updating a single coordinate $x_i$:

$x_{i,t+1} = x_{i,t} + \eta \frac{\partial}{\partial x_i} \log p_\sigma(\mathbf{x}_t) + \sqrt{2\eta}\, \varepsilon_t.$   (8)

This coordinate-wise derivative is only dependent on the tail of the sequence $\mathbf{x}$:

$\frac{\partial}{\partial x_i} \log p_\sigma(\mathbf{x}) = \sum_{j \geq i} \frac{\partial}{\partial x_i} \log p_\sigma(x_j \mid x_{<j}).$   (9)

This doesn't look promising; calculating an update on a single coordinate requires $O(N)$ inference calculations. But models over very long sequences, including WaveNets, usually make a Markov assumption $p(x_j \mid x_{<j}) = p(x_j \mid x_{j-C}, \dots, x_{j-1})$ for some limited contextual window of length $C$. In this case, the coordinate-wise derivative requires only $O(C)$ calls:

$\frac{\partial}{\partial x_i} \log p_\sigma(\mathbf{x}) = \sum_{j=i}^{i+C} \frac{\partial}{\partial x_i} \log p_\sigma(x_j \mid x_{<j}).$   (10)

Calculating a gradient on a contiguous block of coordinates leads to a more efficient update:

$\nabla_{x_{i:i+B}} \log p_\sigma(\mathbf{x}) = \sum_{j=i}^{i+B+C} \nabla_{x_{i:i+B}} \log p_\sigma(x_j \mid x_{<j}).$   (11)

Calculating Equation (11) requires transmission of a block of length $B + 2C$ to the computing device, and $B + C$ conditional evaluations in order to compute the gradient of a block of length $B$. If we partition a sequence of length $N$ into blocks of length $B$, then we can distribute computation of $\nabla_{\mathbf{x}} \log p_\sigma(\mathbf{x})$ with an overhead factor of $(B + C)/B$. This motivates choosing $B$ as large as possible, under the constraint that $B + C$ calculations can still be parallelized on a single device.

We can calculate $\nabla_{\mathbf{x}} \log p_\sigma(\mathbf{x})$ by aggregating blocks of gradients according to Equation (11), requiring synchronous communication between machines for every update of Equation (2); this is a MapReduce algorithm (Dean & Ghemawat, 2004). We propose a bolder approach in Algorithm 2 based on block-stochastic Langevin dynamics (Welling & Teh, 2011). If the block index $i$ is chosen uniformly at random, then Equation (11) is (up to scaling) an unbiased estimate of $\nabla_{\mathbf{x}} \log p_\sigma(\mathbf{x})$. This motivates block-stochastic updates on patches, which multiple devices can perform asynchronously, a Langevin analog to Hogwild! (Niu et al., 2011).
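
The single-process sketch below illustrates the block-stochastic update at the core of Algorithm 2: pick a block uniformly at random, compute its gradient from Equation (11) using only a window of local context, and update the block in place. The `block_grad` routine is an assumed interface wrapping the model; in the distributed setting each worker would run this loop against shared memory.

```python
import numpy as np

def stochastic_pnf_step(x, y, block_grad, block_len, context_len, eta, rng):
    """One block-stochastic Langevin update (a sketch of Algorithm 2).

    block_grad(window, y, start): gradient of the smoothed log-posterior with
        respect to the block x[start:start+block_len], computed from a window
        of the sequence that includes the surrounding Markov context.
    """
    n_blocks = len(x) // block_len
    b = rng.integers(n_blocks)                       # block chosen uniformly at random
    start = b * block_len
    lo = max(0, start - context_len)                 # left context (Markov window)
    hi = min(len(x), start + block_len + context_len)
    grad = block_grad(x[lo:hi], y, start)            # shape: (block_len,)
    eps = rng.standard_normal(block_len)
    x[start:start + block_len] += eta * grad + np.sqrt(2 * eta) * eps
    return x
```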

3.3 Smoothing a Joint Distribution

We now consider joint distributions $p(\mathbf{x}, \mathbf{y})$ over sources $\mathbf{x}$ and measurements $\mathbf{y}$. For example, $\mathbf{y}$ could be a low-resolution version of the signal, or an observed mixture of two signals. We are particularly interested in measurement models of the form $\mathbf{y} = f(\mathbf{x})$, for some linear function $f$. We can view these measurements $\mathbf{y}$ as degenerate likelihoods of the form

$p(\mathbf{y} \mid \mathbf{x}) = \delta\big(\mathbf{y} - f(\mathbf{x})\big),$   (12)

where $\delta$ denotes the Dirac delta function. This family of linear measurement models describes the source separation, inpainting, and super-resolution tasks featured in Section 4, as well as other linear inverse problems including sparse recovery and image colorization.

Extending our analysis in Section 3.1, we can smooth the joint density by convolving $\mathbf{x}$ with a spherical Gaussian. Let $\tilde{\mathbf{x}} = \mathbf{x} + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. Note that $\tilde{\mathbf{x}}$ is conditionally independent of $\mathbf{y}$ given $\mathbf{x}$, and therefore the joint distribution over $\mathbf{x}$, $\tilde{\mathbf{x}}$, and $\mathbf{y}$ can be factored as

$p(\mathbf{x}, \tilde{\mathbf{x}}, \mathbf{y}) = p(\mathbf{y} \mid \mathbf{x})\, p(\tilde{\mathbf{x}} \mid \mathbf{x})\, p(\mathbf{x}).$   (13)

We will work with the smoothed marginal $p_\sigma(\tilde{\mathbf{x}}, \mathbf{y})$ of the joint distribution. If $g_\sigma$ denotes the density of a spherical Gaussian $\mathcal{N}(0, \sigma^2 I)$ on $\mathbb{R}^N$ then the marginal of Equation (13) can be expressed by

$p_\sigma(\tilde{\mathbf{x}}, \mathbf{y}) = \int_{\mathbb{R}^N} p(\mathbf{y} \mid \mathbf{x})\, g_\sigma(\tilde{\mathbf{x}} - \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}.$   (14)

This density approximates the original distribution in the sense that $p_\sigma(\tilde{\mathbf{x}}, \mathbf{y}) \to p(\mathbf{x}, \mathbf{y})$ in total variation as $\sigma \to 0$.

The smoothed density can be factored as $p_\sigma(\tilde{\mathbf{x}}, \mathbf{y}) = p_\sigma(\mathbf{y} \mid \tilde{\mathbf{x}})\, p_\sigma(\tilde{\mathbf{x}})$. The density $p_\sigma(\tilde{\mathbf{x}})$ is simply the smoothed Gaussian convolution of $p(\mathbf{x})$ given by Equation (5). For general likelihoods $p(\mathbf{y} \mid \mathbf{x})$, because $\mathbf{y}$ is conditionally independent of $\tilde{\mathbf{x}}$ given $\mathbf{x}$, we can write the density $p_\sigma(\mathbf{y} \mid \tilde{\mathbf{x}})$ by marginalizing over $\mathbf{x}$ as

$p_\sigma(\mathbf{y} \mid \tilde{\mathbf{x}}) = \int_{\mathbb{R}^N} p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x} \mid \tilde{\mathbf{x}})\, d\mathbf{x}$   (15)
$\qquad\qquad\;\; = \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x} \mid \tilde{\mathbf{x}})}\big[ p(\mathbf{y} \mid \mathbf{x}) \big].$   (16)

This integral is difficult to calculate directly. One way to evaluate $p_\sigma(\mathbf{y} \mid \tilde{\mathbf{x}})$ is to take the same approach described in Section 3.1 to evaluate $p_\sigma(\tilde{\mathbf{x}})$: fine-tune the model $p(\mathbf{y} \mid \mathbf{x})$ to a model $p_\sigma(\mathbf{y} \mid \tilde{\mathbf{x}})$ that predicts $\mathbf{y}$ given noisy covariates $\tilde{\mathbf{x}}$. A similar procedure (using a noise-conditioned architecture rather than fine-tuning) is described in Section 5 of Song et al. (2021).

For linear measurement models with the form given by Equation (12), the smoothed density $p_\sigma(\mathbf{y} \mid \tilde{\mathbf{x}})$ can be calculated in closed form. Writing the linear function $f$ as a matrix $A$, we have

$p_\sigma(\mathbf{y} \mid \tilde{\mathbf{x}}) = \mathcal{N}\big(\mathbf{y};\, A\tilde{\mathbf{x}},\, \sigma^2 A A^\top\big).$   (17)

This smoothing given by Equation (17) generalizes the smoothing proposed in Jayaram & Thickstun (2020) for source separation. That work proposed separately smoothing the prior and likelihood, resulting in a Gaussian smoothed likelihood over the mixture $\mathbf{y}$ with variance proportional to $\sigma^2$. This is equivalent to Equation (17) in the case of source separation, for which $A A^\top$ is a multiple of the identity.
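
For intuition, the sketch below evaluates the gradient of the smoothed log-likelihood in Equation (17) for a generic measurement matrix $A$, and instantiates it for two-source separation, where $A = [I\;\;I]$; this is an illustrative calculation, not the exact code used in the experiments.

```python
import numpy as np

def smoothed_linear_loglik_grad(x_tilde, y, A, sigma):
    """Gradient of log N(y; A x_tilde, sigma^2 A A^T) with respect to x_tilde."""
    cov = sigma ** 2 * (A @ A.T)               # smoothed measurement covariance
    resid = y - A @ x_tilde
    return A.T @ np.linalg.solve(cov, resid)   # A^T (sigma^2 A A^T)^{-1} (y - A x_tilde)

# Two-source separation: y = x1 + x2, so A = [I I] and A A^T = 2I.
n = 4
A = np.hstack([np.eye(n), np.eye(n)])
x_tilde = np.random.randn(2 * n)               # stacked noisy source estimates
y = np.random.randn(n)                         # observed mixture
grad = smoothed_linear_loglik_grad(x_tilde, y, A, sigma=0.1)
```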

For general likelihoods $p(\mathbf{y} \mid \mathbf{x})$ (e.g. a classifier) the conditioning values $\mathbf{y}$ may depend on the whole sequence $\mathbf{x}$. In this case, stochastic PnF must read the entire sequence $\tilde{\mathbf{x}}$ in order to calculate the posterior gradient

$\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}} \mid \mathbf{y}) = \nabla_{\tilde{\mathbf{x}}} \big[ \log p_\sigma(\mathbf{y} \mid \tilde{\mathbf{x}}) + \log p_\sigma(\tilde{\mathbf{x}}) \big].$   (18)

But for long sequences $\mathbf{x}$ such as audio, the conditioning information $\mathbf{y}$ is often a local function of the sequence $\mathbf{x}$. In this case, $y_j$ depends only on $\mathbf{x}_{\mathcal{N}_j}$, where $\mathcal{N}_j$ is a local neighborhood of indices near $j$, and the likelihood decomposes via conditional independence into

$p(\mathbf{y} \mid \mathbf{x}) = \prod_{j} p(y_j \mid \mathbf{x}_{\mathcal{N}_j}).$   (19)

All experiments presented in Section 4 feature this conditioning pattern. For spectrogram conditioning, $\mathcal{N}_j$ is the set of indices (centered at $j$) required to compute a frame of the short-time Fourier transform. For source separation, super-resolution, and inpainting, $\mathcal{N}_j$ is a single index. This allows us to compute block gradients of the conditional likelihood (Algorithm 2).

4 Experiments

We present qualitative and quantitative results of PnF sampling for WaveNet models of audio (van den Oord et al., 2016a) and a PixelCNN++ model of images (Salimans et al., 2017). In Section 4.2 we show that PnF sampling can produce samples of comparable quality to ancestral sampling. In Section 4.3 we show that stochastic PnF sampling is faster than ancestral sampling when parallelized across a modest number of devices. We go on to demonstrate how PnF sampling can be applied to a variety of image and audio restoration tasks: source separation (Section 4.4), super-resolution (Section 4.5), and inpainting (Section 4.6). We encourage the reader to browse the supplementary material for qualitative examples of PnF audio sampling.

4.1 Datasets

For audio experiments we use the VCTK dataset (Veaux et al., 2016), consisting of 44 hours of speech, as well as the Supra Piano dataset (Shi et al., 2019), consisting of 52 hours of piano recordings. An 80-20 train-test split of VCTK speakers and piano recordings is maintained. Audio sequences are sampled at 22 kHz, with 8-bit μ-law encoding (CCITT, 1988), except for source separation, where 8-bit linear encoding is used. Sequences used for quantitative evaluation are approximately 2.3-second excerpts (roughly 50k samples), chosen randomly from the longer test-set recordings. For image experiments we use the CIFAR-10 dataset (Krizhevsky, 2009). Additional training and hyperparameter details can be found in the appendix.

4.2 Quality of Generated Samples

To evaluate the quality of samples generated by PnF sampling, we follow a procedure similar to Holtzman et al. (2020). We compare log-likelihoods under the model of sequences generated by PnF sampling to sequences generated by ancestral sampling. We consider the sampling procedure to be successful if it generates sequences with comparable log-likelihoods to ancestral generations. The likelihood values of PnF samples are computed by quantizing the output and calculating the likelihood under the ancestral model. In Figure 2 we present quantitative results for PnF sampling using a spectrogram-conditioned WaveNet model for both voice and piano datasets. We evaluate fixed-length audio generations for various numbers of Langevin iterations, and report the median log-likelihood of quantizations of these sequences under the noiseless model. Asymptotically, as the number of Langevin iterations increases, the likelihood of PnF samples approaches the likelihood of ancestral samples. Audio PnF samples for various iteration counts are presented on the project website.

Figure 2: As the number of Langevin iterations increases, the log-likelihood of sequences generated by PnF sampling approaches the log-likelihood of sequences generated by ancestral sampling.

4.3 Speed and Parallelism

Ancestral sampling has serial runtime $O(N)$ in the length $N$ of the generated sequence. Using the stochastic PnF sampler described in Section 3.2, the serial runtime is $O(LT)$, where $T$ is the number of Langevin iterations at each of the $L$ levels of smoothing. We find empirically that we can set $T$ independent of $N$, and therefore the theoretical serial runtime of stochastic PnF is constant as a function of sequence length. In practice, we do not have an infinite supply of parallel devices, so the serial runtime of stochastic PnF scales inversely with the number of devices. This behavior is demonstrated in Figure 3 for WaveNet stochastic PnF sampling using a cluster of Nvidia Titan Xp GPUs. Each GPU can calculate Equation (11) for a block of roughly 50k samples (2.3 seconds of audio). For PixelCNN++, we find that the number of Langevin iterations required is large relative to the sequence length, and therefore the PnF sampler does not improve sampling speed for this model.

Figure 3: PnF sampling can be parallelized across multiple devices, resulting in faster inference time than autoregressive sampling. Beyond a threshold level of computation, PnF sampling time is inversely proportional to the number of devices.

Stochastic PnF sampling depends upon asynchronous writes being sparse, so that memory overwrites, when two workers update overlapping blocks, are rare. This situation is analogous to the sparse update condition required for Hogwild! If blocks have length $B$ and the number of devices is substantially less than $N/B$, then updates are sufficiently sparse. But if the number of devices is larger than $N/B$, memory overwrites become common, and stochastic PnF sampling fails to converge. This imposes a floor on generation time, exhibited in Figure 3. We cannot substantially reduce this floor by decreasing $B$ because of the tradeoff between $B$ and the model's Markov window $C$ described in Section 3.2.

Figure 4: PnF sampling applied to visual source separation (Section 4.4), super-resolution (Section 4.5), and inpainting (Section 4.6) using a PixelCNN++ prior over images trained on CIFAR-10. Ground-truth images in this figure are taken from the CIFAR-10 test set.

4.4 Source Separation

The single-channel source separation problem (Davies & James, 2007) asks us to recover unobserved sources $\mathbf{x}_1$ and $\mathbf{x}_2$ given an observed mixture $\mathbf{y}$ of the sources. Like Jayaram & Thickstun (2020), we view source separation as a linear Bayesian inverse problem: recover $(\mathbf{x}_1, \mathbf{x}_2)$ given $\mathbf{y}$ and a prior $p(\mathbf{x})$ over sources. We consider three variants of this task: (1) audio separation of mixtures of voice (VCTK test set) and piano (Supra Piano test set), (2) visual separation of mixtures of CIFAR-10 test set “animal” images and “machine” images, and (3) class-agnostic visual separation of mixtures of CIFAR-10 test set images.

We compare PnF audio separation to results using the Demucs (Défossez et al., 2019) and Conv-Tasnet (Luo & Mesgarani, 2019) source separation models. Both Demucs and Conv-Tasnet are supervised models, trained specifically for the source separation task, that learn to output source components given an input mixture. An advantage of PnF sampling is that it does not rely on pairs of source signals and mixes like these supervised methods. We train the supervised models on 10K mixtures of VCTK and Supra Piano samples and measure results on 1K test set mixtures using the standard Scale Invariant Signal-to-Distortion Ratio (SI-SDR) metric for audio source separation (Le Roux et al., 2019). Results in Table 1 show that PnF sampling is competitive with these specialized source separation models. Qualitative comparisons are provided in the supplement. We do not compare results on the popular MusDB dataset (Rafii et al., 2017) because this dataset has insufficient single-channel audio to train WaveNet generative models.

For CIFAR-10, we follow the experimental methodology described in Jayaram & Thickstun (2020). Table 2 shows that PixelCNN++ performs comparably to Glow as a prior, but underperforms NCSN. This makes sense, given the relative strength of NCSN as a prior over CIFAR-10 images in comparison to PixelCNN++ and Glow. Given the strong correlation between the quality of a generative model and the quality of separations using that model as a prior, we anticipate that more recent innovations in autoregressive image models based on transformers (Parmar et al., 2018; Child et al., 2019) will lead to stronger separation results once implementations of these models that match the results reported in these papers become public. Select qualitative image separation results are presented in Figure 4.


Test SI-SDR (dB)

Algorithm        All     Piano   Voice
PnF (WaveNet)    17.07   13.92   20.25
Conv-Tasnet      17.48   20.02   15.50
Demucs           14.18   16.67   12.75
Table 1: Quantitative results for audio source separation of mixtures of Supra piano and VCTK voice samples. Results are measured using SI-SDR (higher is better).

Algorithm            Inception Score   FID

Class Split
NES                  5.29 ± 0.08       51.39
BASIS (Glow)         5.74 ± 0.05       40.21
PnF (PixelCNN++)     5.86 ± 0.07       40.66
Average              6.14 ± 0.11       39.49
BASIS (NCSN)         7.83 ± 0.15       29.92

Class Agnostic
BASIS (Glow)         6.10 ± 0.07       37.09
PnF (PixelCNN++)     6.14 ± 0.15       37.89
Average              7.18 ± 0.08       28.02
BASIS (NCSN)         8.29 ± 0.16       22.12
Table 2: Quantitative results for visual source separation on CIFAR-10. Results are measured using Inception Score / FID Score of 25,000 separations (50,000 separated images) of two overlapping CIFAR-10 images. In Class Split, one image comes from the category of animals and the other from the category of machines. NES results (Halperin et al., 2019) and BASIS results are as reported in Jayaram & Thickstun (2020).

4.5 Super-Resolution

The super-resolution problem asks us to recover unobserved data $\mathbf{x}$ given a down-sampled observation $\mathbf{y}$. For 1-dimensional (audio) super-resolution, $\mathbf{x} \in \mathbb{R}^N$ and $\mathbf{y} \in \mathbb{R}^{N/r}$ is a down-sampled version of $\mathbf{x}$ at ratio $r$ (Kuleshov et al., 2017; Eskimez et al., 2019). For 2-dimensional (image) super-resolution, $\mathbf{y}$ is a down-sampled image at ratio $r$ along each spatial dimension (Dahl et al., 2017; Zhang et al., 2018). Like source separation, super-resolution can be viewed as a Bayesian linear inverse problem, and we can recover solutions to this problem via PnF sampling. In the audio domain, the down-sampling operation can be interpreted as a low-pass filter.
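
Continuing the linear-inverse-problem view, the sketch below constructs an explicit down-sampling matrix $A$ for 1-dimensional super-resolution at ratio $r$, here using simple subsampling as one possible choice of measurement operator (the exact operator used in the experiments may differ); the resulting $\mathbf{y} = A\mathbf{x}$ can be plugged into the smoothed likelihood of Equation (17).

```python
import numpy as np

def subsample_matrix(n, r):
    """Down-sampling operator A in R^{(n/r) x n} that keeps every r-th sample."""
    A = np.zeros((n // r, n))
    A[np.arange(n // r), np.arange(0, n, r)] = 1.0
    return A

# y = A x is the low-resolution observation used as conditioning information.
n, r = 16, 4
A = subsample_matrix(n, r)
x = np.random.randn(n)
y = A @ x
```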

We measure audio super-resolution performance using peak signal-to-noise ratio (PSNR) and compare against a deep learning baseline (Kuleshov et al., 2017) as well as a simple cubic B-spline. Quantitative audio results are presented in Table 3, which show that we outperform these baselines on piano data and produce reconstructions of similar quality on voice data. Qualitative audio samples are available in the supplement, where we also show examples of 32x super-resolution, beyond the reported ability of existing methods. Select qualitative visual results are presented in Figure 4.


                 Piano                       Voice
Ratio    Spline    KEE      PnF      Spline    KEE      PnF
4x       23.07     22.25    29.78    15.8      16.04    15.47
8x       13.58     15.79    23.49    10.7      11.15    10.03
16x      7.09      6.76     14.23    6.4       7.11     5.32
Table 3: Quantitative results for audio super-resolution at three different scales on the Supra piano and VCTK voice datasets. Results are measured using PSNR (higher is better). KEE refers to the method described in Kuleshov et al. (2017)

4.6 Inpainting

Inpainting problems involve recovering unobserved data $\mathbf{x}$ given a masked observation $\mathbf{y} = \mathbf{m} \odot \mathbf{x}$, where $\mathbf{m}$ is a binary mask (Adler et al., 2011; Pathak et al., 2016). This family of problems includes completion tasks (finishing a sequence given a prime), pre-completion tasks (generating a prefix to a sequence), and outpainting tasks. Ancestral sampling can only be applied to completion tasks, whereas PnF sampling can be used to fill in any pattern of masked occlusions. Qualitative results for audio inpainting are available in the supplement. Select qualitative results for image inpainting are presented in Figure 4.

5 Conclusion

In this paper we introduced PnF sampling, a parallelizable approach to sampling from autoregressive models that can be flexibly adapted to conditional sampling tasks. The flexibility of PnF sampling decouples the (unconditional) generative modeling problem from the details of specific conditional sampling tasks. Using WaveNet models, we demonstrated a reduction in wall-clock sampling time using PnF sampling in comparison to ancestral sampling, as well as PnF’s ability to solve a variety of practical audio processing problems: source separation, super-resolution, and inpainting. We anticipate that PnF conditional sampling results will improve as developments in generative modeling empower us to incorporate stronger models as priors. More broadly, we are inspired by ongoing research in generative modeling that, coupled with PnF sampling, will continue to drive performance improvements for practical conditional image and audio restoration tasks.

Acknowledgements

We thank Zaid Harchaoui, Sham M. Kakade, Steven Seitz, and Ira Kemelmacher-Shlizerman for valuable discussion and computing resources. This work was supported by a Qualcomm Innovation Fellowship.

References

  • Adler et al. (2011) Adler, A., Emiya, V., Jafari, M. G., Elad, M., Gribonval, R., and Plumbley, M. D. Audio inpainting. IEEE Transactions on Audio, Speech, and Language Processing, 2011.
  • Bora et al. (2017) Bora, A., Jalal, A., Price, E., and Dimakis, A. G. Compressed sensing using generative models. In International Conference on Machine Learning, pp. 537–546, 2017.
  • Brown et al. (2020) Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • CCITT (1988) CCITT. Pulse code modulation (pcm) of voice frequencies. International Telecommunication Union, 1988.
  • Chen et al. (2021) Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. Wavegrad: Estimating gradients for waveform generation. International Conference on Learning Representations, 2021.
  • Child et al. (2019) Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  • Dahl et al. (2017) Dahl, R., Norouzi, M., and Shlens, J. Pixel recursive super resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5439–5448, 2017.
  • Davies & James (2007) Davies, M. E. and James, C. J. Source separation using single channel ica. Signal Processing, 2007.
  • Dean & Ghemawat (2004) Dean, J. and Ghemawat, S. Mapreduce: Simplified data processing on large clusters. Communications of the ACM, 2004.
  • Défossez et al. (2019) Défossez, A., Usunier, N., Bottou, L., and Bach, F. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254, 2019.
  • Dhariwal et al. (2020) Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A., and Sutskever, I. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020.
  • Donahue et al. (2019) Donahue, C., McAuley, J., and Puckette, M. Adversarial audio synthesis. International Conference on Learning Representations, 2019.
  • Eskimez et al. (2019) Eskimez, S. E., Koishida, K., and Duan, Z. Adversarial training for speech super-resolution. IEEE Journal of Selected Topics in Signal Processing, 2019.
  • Frank & Ilse (2020) Frank, M. and Ilse, M. Problems using deep generative models for probabilistic audio source separation. In “I Can’t Believe It’s Not Better!” NeurIPS 2020 workshop, 2020.
  • Hadjeres et al. (2017) Hadjeres, G., Pachet, F., and Nielsen, F. Deepbach: a steerable model for bach chorales generation. In International Conference on Machine Learning, 2017.
  • Halperin et al. (2019) Halperin, T., Ephrat, A., and Hoshen, Y. Neural separation of observed and unobserved distributions. In International Conference on Machine Learning, 2019.
  • Holtzman et al. (2020) Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. International Conference on Learning Representations, 2020.
  • Jayaram & Thickstun (2020) Jayaram, V. and Thickstun, J. Source separation with deep generative priors. International Conference on Machine Learning, 2020.
  • Kim et al. (2018) Kim, S., Lee, S.-G., Song, J., Kim, J., and Yoon, S. Flowavenet: A generative flow for raw audio. International Conference on Machine Learning, 2018.
  • Knapik et al. (2011) Knapik, B. T., Van Der Vaart, A. W., van Zanten, J. H., et al. Bayesian inverse problems with gaussian priors. The Annals of Statistics, 2011.
  • Kong et al. (2021) Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. Diffwave: A versatile diffusion model for audio synthesis. International Conference on Learning Representations, 2021.
  • Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.
  • Kuleshov et al. (2017) Kuleshov, V., Enam, S. Z., and Ermon, S. Audio super-resolution using neural nets. In ICLR (Workshop Track), 2017.
  • Kumar et al. (2019) Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., de Brébisson, A., Bengio, Y., and Courville, A. Melgan: Generative adversarial networks for conditional waveform synthesis. Advances in Neural Information Processing Systems, 2019.
  • Larochelle & Murray (2011) Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2011.
  • Le Roux et al. (2019) Le Roux, J., Wisdom, S., Erdogan, H., and Hershey, J. R. Sdr–half-baked or well done? In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
  • Luo & Mesgarani (2019) Luo, Y. and Mesgarani, N. Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.
  • Mehri et al. (2017) Mehri, S., Kumar, K., Gulrajani, I., Kumar, R., Jain, S., Sotelo, J., Courville, A., and Bengio, Y. Samplernn: An unconditional end-to-end neural audio generation model. International Conference on Learning Representations, 2017.
  • Neal et al. (2011) Neal, R. M. et al. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2011.
  • Niu et al. (2011) Niu, F., Recht, B., Ré, C., and Wright, S. J. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. Advances in Neural Information Processing Systems, 2011.
  • Parmar et al. (2018) Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., and Tran, D. Image transformer. International Conference on Machine Learning, 2018.
  • Pathak et al. (2016) Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • Ping et al. (2019) Ping, W., Peng, K., and Chen, J. Clarinet: Parallel wave generation in end-to-end text-to-speech. In International Conference on Learning Representations, 2019.
  • Ping et al. (2020) Ping, W., Peng, K., Zhao, K., and Song, Z. Waveflow: A compact flow-based model for raw audio. In International Conference on Machine Learning, 2020.
  • Prenger et al. (2019) Prenger, R., Valle, R., and Catanzaro, B. Waveglow: A flow-based generative network for speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2019.
  • Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI blog, 2019.
  • Rafii et al. (2017) Rafii, Z., Liutkus, A., Stöter, F.-R., Mimilakis, S. I., and Bittner, R. The MUSDB18 corpus for music separation, 2017. URL https://doi.org/10.5281/zenodo.1117372.
  • Raj et al. (2019) Raj, A., Li, Y., and Bresler, Y. Gan-based projector for faster recovery with convergence guarantees in linear inverse problems. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
  • Razavi et al. (2019) Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with vq-vae-2. In Advances in Neural Information Processing Systems, 2019.
  • Rick Chang et al. (2017) Rick Chang, J., Li, C.-L., Poczos, B., Vijaya Kumar, B., and Sankaranarayanan, A. C. One network to solve them all–solving linear inverse problems using deep projection models. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • Salimans et al. (2017) Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. International Conference on Learning Representations, 2017.
  • Shi et al. (2019) Shi, Z., Sapp, C., Arul, K., McBride, J., and Smith III, J. O. Supra: Digitizing the stanford university piano roll archive. In ISMIR, 2019.
  • Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, 2019.
  • Song & Ermon (2020) Song, Y. and Ermon, S. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 2020.
  • Song et al. (2020) Song, Y., Meng, C., Liao, R., and Ermon, S. Nonlinear equation solving: A faster alternative to feedforward computation. arXiv preprint arXiv:2002.03629, 2020.
  • Song et al. (2021) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations, 2021.
  • Tropp & Wright (2010) Tropp, J. A. and Wright, S. J. Computational methods for sparse solution of linear inverse problems. Proceedings of the IEEE, 2010.
  • Uria et al. (2013) Uria, B., Murray, I., and Larochelle, H. Rnade: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, 2013.
  • van den Oord et al. (2016a) van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.
  • van den Oord et al. (2016b) van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In International Conference on Machine Learning, 2016b.
  • van den Oord et al. (2018) van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G., Lockhart, E., Cobo, L., Stimberg, F., et al. Parallel wavenet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, 2018.
  • Veaux et al. (2016) Veaux, C., Yamagishi, J., MacDonald, K., et al. Superseded-cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit. 2016.
  • Wang et al. (2017) Wang, Z., Bardsley, J. M., Solonen, A., Cui, T., and Marzouk, Y. M. Bayesian inverse problems with l_1 priors: a randomize-then-optimize approach. SIAM Journal on Scientific Computing, 2017.
  • Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In International Conference on Machine Learning, 2011.
  • Wiggers & Hoogeboom (2020) Wiggers, A. J. and Hoogeboom, E. Predictive sampling with forecasting autoregressive models. International Conference on Machine Learning, 2020.
  • Zhang et al. (2018) Zhang, Y., Tian, Y., Kong, Y., Zhong, B., and Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

Appendix A PnF Sampling Details and Hyper-Parameters

We broadly adopt the geometric annealing schedule and hyper-parameters of annealed Langevin dynamics introduced in Song & Ermon (2019) and elaborated upon in Song & Ermon (2020). For both the PixelCNN++ and WaveNet models, we found that we needed additional intermediate noise levels to generate quality samples. We also found that good sample quality using these models required a smaller learning rate and mixing for more iterations than previous work (Song & Ermon, 2019; Jayaram & Thickstun, 2020). We speculate that the need for more levels of annealing and slower mixing may be attributable to the autoregressive model parameterization, but we are uncertain of this hypothesis. Detailed hyper-parameters for the PixelCNN++ and WaveNet experiments are presented in Appendices B and C, respectively.

Annealed Langevin dynamics requires storage of a copy of the model for each of the $L$ noise levels. This could be avoided by training a noise-conditioned generative model. The advantage of using separate copies of the model is that we can directly use standard generative models, without any adjustment to the network architecture or subsequent hyper-parameter tuning of a modified architecture; this cleanly decouples our approach to conditional sampling from neural architecture design questions. Note that while we store $L$ copies of the model, there is no additional memory overhead: these models are loaded and unloaded serially during optimization as we anneal the noise levels, so only one model is resident in memory at a time. While GPU memory is a scarce resource, disk space is generally abundant.

Previous work has made the empirical observation that gradients of a noisy model $p_\sigma$ are inversely proportional to the variance of the noise: $\|\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}})\| \propto 1/\sigma^2$ (Song & Ermon, 2019). This motivates the choice of learning rate $\eta \propto \sigma^2$. The empirical scale of the gradients is conjectured in Jayaram & Thickstun (2020) to be a consequence of “severe non-smoothness of the noiseless model $p$.” That work goes on to show that, if $p$ were a Dirac spike, then exact inverse proportionality of the gradients would hold in expectation. For the discretized autoregressive models discussed in this work, the noiseless distribution $p$ is genuinely a mixture of Dirac spikes, and so the analysis in Jayaram & Thickstun (2020) applies without caveats to these models and justifies the choice $\eta \propto \sigma^2$ (the precise constant of proportionality remains application dependent, and is discussed in the experimental details sections).
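
The sketch below writes out the geometric noise schedule and the $\sigma^2$-scaled step sizes described above; the endpoint values and counts shown are placeholders for illustration, not the exact settings used in the experiments (those are discussed in Appendices B and C).

```python
import numpy as np

def annealing_schedule(sigma_start, sigma_end, n_levels, base_lr):
    """Geometric noise levels and sigma^2-proportional learning rates."""
    sigmas = np.geomspace(sigma_start, sigma_end, n_levels)
    lrs = base_lr * (sigmas / sigmas[-1]) ** 2   # eta_l proportional to sigma_l^2
    return sigmas, lrs

# Placeholder endpoints, for illustration only.
sigmas, lrs = annealing_schedule(sigma_start=1.0, sigma_end=0.01,
                                 n_levels=20, base_lr=2e-5)
```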

Appendix B PixelCNN++ Experimental Details

Our visual sampling experiments are performed using a PixelCNN++ model trained on CIFAR-10. Specifically, we used a public implementation of PixelCNN++ written by Lucas Caccia, available at:

We used the pre-trained weights for this model shared by Lucas Caccia (at the link above), with a reported test-set log loss of 2.95 bits per dimension. For the smoothed models $p_\sigma$, we fine-tuned the pre-trained model for 10 epochs at each noise level $\sigma$. We adopt the geometric annealing schedule proposed in Song & Ermon (2019) for annealing $\sigma$, using double the number of noise levels used in previous work (Song & Ermon, 2019; Jayaram & Thickstun, 2020). We also found that sample quality improved using a smaller learning rate and mixing for more iterations per noise level than reported in previous work. In wall-clock time, we find that conditional PixelCNN++ sampling tasks require approximately 60 minutes to generate a batch of 16 samples using a 1080Ti GPU.

We can also generate unconditional samples from the PixelCNN++ model using Langevin dynamics. This is not a practical sampling procedure for CIFAR-10 images, due to its slow speed compared to autoregressive sampling (see the discussion in Section 4.3). Unconditional sampling requires many more iterations of Langevin dynamics than the more highly constrained conditional sampling tasks. Despite this, we show in Figure 5 that, with sufficiently many iterations, the quality of Langevin samples does approximate the quality of exact autoregressive samples as measured by sample log-likelihood.

Figure 5: PixelCNN++ PnF sampling: as the number of Langevin iterations increases, the log-likelihood of sequences generated by PnF sampling approaches the log-likelihood of test set sequences.

Appendix C WaveNet Experimental Details

Our audio sampling experiments are performed using a WaveNet model trained on both the VCTK and Supra Piano datasets. We used the public implementation of Wavenet written by Ryuichi Yamamoto available at:

For all audio experiments, where the data is encoded to values between $-1$ and $1$, we use noise levels geometrically spaced between a large initial $\sigma_1$ and a small final $\sigma_L$. The same noise levels are also used for the sampling speed and quality results presented in Figures 2 and 3. For all experiments, the number of Langevin steps per noise level $T$ is held fixed, except for Figure 2, where that parameter is varied to highlight changes in likelihood. The learning rate multiplier is the same for all experiments. The Markov window $C$ is determined by the underlying architecture; for the WaveNet implementation used in this paper, it corresponds to roughly 0.3 seconds of audio at a 22 kHz sample rate. When training the fine-tuned noise models, all training hyperparameters are kept the same as in the original WaveNet implementation. Please refer to the WaveNet paper or the public WaveNet implementation for training details.

As discussed in the WaveNet paper (van den Oord et al., 2016a), 8-bit μ-law encoding results in a higher-fidelity representation of audio than 8-bit linear encoding. For most experiments, the observation constraint is still linear even under a μ-law encoding of $\mathbf{x}$. However, for source separation, the mixture constraint is no longer linear under μ-law encoding. Consequently, we use an 8-bit linear encoding of $\mathbf{x}$ for source separation experiments to avoid a change-of-variables calculation. To facilitate a fair comparison, all ground truths and baselines shown in the demos use the corresponding μ-law or linear 8-bit encoding.
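
For reference, a standard 8-bit μ-law companding pair (the CCITT formula with μ = 255); the logarithmic compression is the nonlinearity that makes the mixture constraint non-linear in the encoded domain.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Map x in [-1, 1] through mu-law companding and quantize to 8-bit codes."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((y + 1) / 2 * mu).astype(np.int32)   # integer codes in [0, 255]

def mu_law_decode(codes, mu=255):
    """Invert the 8-bit mu-law encoding back to a waveform in [-1, 1]."""
    y = 2 * codes.astype(np.float64) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu
```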

In the source separation experiments, all mixtures were created with 1/2 gain on each source component. Due to the natural variation of loudness in the training datasets, we find that our model generalizes to mixtures without exactly 1/2 gain on each source. The real life source separation result on our website shows that we can separate a mixture in the wild when we have no information about the relative loudness of each component.

Appendix D Additional PixelCNN++ Sampling Results

Figure 6: Additional uncurated results of PnF source separation (Section 4.4) for mixtures of CIFAR-10 test-set images using a PixelCNN++ prior trained on CIFAR-10.

Figure 7: Additional uncurated results of PnF super-resolution (Section 4.5) applied to down-sampled CIFAR-10 test-set images using a PixelCNN++ prior trained on CIFAR-10.

Figure 8: Additional uncurated results of PnF inpainting (Section 4.6) applied to masked CIFAR-10 test-set images using a PixelCNN++ prior trained on CIFAR-10.