1 Introduction
Deep autoregressive (AR) models have achieved state-of-the-art density estimation results on image, video, and audio data (Salimans et al., 2017; van den Oord et al., 2016; van den Oord et al., 2016; Child et al., 2019; Radford et al., 2019; Weissenborn et al., 2019). Recent work has applied them to domains traditionally outside of machine learning, such as physics (Sharir et al., 2020), protein modeling (Rao et al., 2019), and database query optimization (Yang et al., 2019b, 2020). These use cases have surfaced the need for complex inference capabilities from deep AR models. For example, the database cardinality estimation task reduces to estimating the density mass occupied by sets of variables under sparse range constraints. In this problem, the database optimizer probes the fraction of records satisfying a query with range constraints over many attributes, and relies on accurate estimates to pick performant query execution strategies.
In this paper, we call for attention to such range density estimation problems in the context of deep AR models. Given rapid advances in model capabilities, fast and accurate range density estimation has broad potential applicability to a number of domains, including databases, text processing, and inpainting (Section 1.1).
Range density estimation involves two related challenges:


Marginalization: the handling of unconstrained variables, and

Range Constraints: variables that are constrained to a specific range or subset of values.
Exact inference, or integration over the query region, takes time exponential in the number of dimensions, a cost too high for all but the tiniest problems. Further, both marginalization and range constraints are difficult to implement on top of AR models, since such models are only trained to provide point density estimates. This motivates the use of approximate inference algorithms such as the one recently proposed by Yang et al. (2019b), which shows that AR models can significantly improve on the state of the art in range estimation accuracy while remaining competitive in latency.
Building on prior work, we distill and evaluate a more general optimization for accelerating range density estimation termed variable skipping. The central idea is to exploit the sparsity of range queries by avoiding sampling through the unconstrained dimensions (i.e., those to be marginalized over) during approximate inference. A training-time data augmentation procedure randomly replaces some dimensions in the input with learnable marginalization tokens, which are trained to represent the absence of those dimensions. During inference, the unconstrained dimensions take on these learned values instead of being sampled from their respective domains.
Variable skipping provides two key advantages. First, by not needing to sample a concrete value for unconstrained variables, the number of forward passes per sample is significantly reduced from the total number of dimensions (e.g., hundreds) to the number of constrained dimensions (e.g., a few). Second, by avoiding sampling through the (potentially large) unconstrained region, it is possible to reduce the variance of the sampling-based estimator. We show that variable skipping realizes both advantages in practice (Figure 1).
Reducing the computation required for estimates can significantly impact the viability of model-based estimators for the aforementioned computer systems applications. For example, in database query optimization, cardinality estimation is typically run in the inner loop of a dynamic program (Selinger et al., 1979), and hence has to be executed many times in potentially unbatchable fashion. Further, this process must be rerun for each new query, as each query may have different variable constraints. In this setting, reducing estimation costs from tens or hundreds of forward passes (i.e., the number of columns in a typical production database) to just a handful (i.e., the number of constraints in a typical range query) is critical for adoption. Models that include rarely queried text columns (e.g., byte-pair-encoded columns, which exacerbate the problem) may benefit further still.
We start by discussing related work, then review the previously proposed approximate inference algorithm (Yang et al., 2019b), termed Progressive Sampling (Section 3.1), which allows any trained autoregressive model to efficiently compute range densities. We then discuss an optimization, variable skipping, which allows dimensions irrelevant to a query to be skipped over at inference time, greatly reducing or eliminating sampling costs (Section 4). We show that, beyond accelerating range density estimation, variable skipping can enable related applications such as pattern matching and text completion. Finally, we study the performance of variable skipping (Section 5).
The contributions of this paper are as follows:

We distill the more general concept of variable skipping, a training and runtime optimization that greatly reduces the variance of range density estimates.

To show its generality, we apply variable skipping to text models, which can then support applications such as pattern matching.

We evaluate the effectiveness of variable skipping across a variety of datasets, architectures, and hyperparameter choices, and compare with related techniques such as multi-order training.

To invite research on this underexplored problem, we open source our code and a set of range density estimation benchmarks on high-dimensional discrete datasets at
https://varskip.github.io.
1.1 Applications of Range Density Estimation
Range density estimation is important for the following applications, among others:
Database Systems: A core primitive in database query optimizers is cardinality estimation (Selinger et al., 1979): given a query with user-defined predicates on a subset of columns, estimate the fraction of records that satisfy the predicates. Applying AR models to cardinality estimation was the topic of Yang et al. (2019b).
Pattern Matching: A regular expression can be interpreted as a dynamically unrolled predicate (i.e., a nondeterministic finite automaton) (Hopcroft and Ullman, 1979) over a series of character variables. Hence, its match probability can be estimated in the same way as a range query. Section 4.1 shows how this can be realized with variable skipping.
Completion and Inpainting: While an AR model can be straightforwardly used to extend a prefix in the variable ordering, completing a missing value from the middle of a sequence of variables requires sampling from the marginal distribution over the missing values. We show that variable skipping allows this to be done efficiently (Section 4.2).
2 Related Work
Density Estimation with Deep Autoregressive Models. Deep AR models have enjoyed vast interest due to their outstanding capability to model high-dimensional data (text, images, tabular data). Efficient architectures such as MADE (Germain et al., 2015) and ResMADE (Durkan and Nash, 2019) have been proposed, and self-attention models (e.g., the Transformer (Vaswani et al., 2017)) have underpinned recent advances in language tasks. Our work optimizes approximate inference (of range density estimates) on top of such AR architectures.
Masked Language Models. Our variable skipping technique learns special MASK tokens (Section 4) by randomly masking inputs, which is similar to masked language models such as BERT (Devlin et al., 2019) and CMLMs (Ghazvininejad et al., 2019). These models differ from AR models in their optimization goals: they typically predict only the masked tokens conditioned on the present tokens, and may assume independence among the masked tokens. We study deep AR models for two reasons: (1) our problem setting is density estimation, and deep AR models have generally shown superior density modeling quality compared with other generative models; (2) the approximate inference procedure we study (Section 3.1) assumes access to autoregressive factors.
Multi-Order Training handles marginalization by training over many orders and invoking a suitable order (or an ensemble over available orders) during inference. This technique has appeared in NADE (Uria et al., 2014), MADE (Germain et al., 2015), and XLNet (Yang et al., 2019a), among others. Variable skipping shares the same goal of efficiently handling marginalization. These prior works have reported increased optimization difficulty as the number of orders to learn increases (some sample a fixed set of orders, while others keep sampling new orders). In the latter case, we posit that the difficulty is due to the added input variation; in contrast, variable skipping only extends the vocabulary of each dimension by a MASK symbol, a comparatively small increase in task difficulty. In Section 5, we compare variable skipping against multi-order training, and show that the two can be combined to further reduce errors.
3 Range Density Estimation on Deep Autoregressive Models
We model a finite set $D$ of $n$-dimensional data points as a discrete distribution using an autoregressive model $p_\theta(\mathbf{x}) = \prod_{i=1}^{n} p_\theta(x_i \mid \mathbf{x}_{<i})$, parameterized by $\theta$. The model is trained on $D$ using the maximum likelihood objective:

$$\max_\theta \; \mathbb{E}_{\mathbf{x} \sim D}\Big[\sum_{i=1}^{n} \log p_\theta(x_i \mid \mathbf{x}_{<i})\Big], \quad (1)$$

where $\mathbf{x}_{<i} = (x_1, \ldots, x_{i-1})$ for each data point $\mathbf{x} = (x_1, \ldots, x_n)$.

Range density. We consider range queries of the form

$$P(X_1 \in R_1, X_2 \in R_2, \ldots, X_n \in R_n), \quad (2)$$

where each region $R_i$ is a subset of the domain $\mathcal{X}_i$ of dimension $i$. This formulation encapsulates unconstrained dimensions, where we simply take $R_i = \mathcal{X}_i$ (the whole domain).
3.1 Background: Progressive Sampling
Exact inference of Equation 2 is computationally efficient only for low dimensions or small domain sizes. Approximate inference is required to scale its computation.
To solve this problem, Yang et al. (2019b) adapt classical forward sampling (Koller and Friedman, 2009) to range likelihoods, yielding an unbiased approximate inference algorithm. The algorithm works by drawing in-range samples and reweighting each intermediate range likelihood. Each in-range sample is drawn from the first dimension to the last (in the AR ordering). As an example, consider estimating $P(X_1 \in R_1, X_2 \in R_2, X_3 \in R_3)$. Progressive sampling draws $x_1 \sim p(X_1 \mid X_1 \in R_1)$ and stores the range likelihood $P(X_1 \in R_1)$, both tractable operations since a forward pass on the trained AR model produces this single-dimensional distribution. It performs another forward pass to obtain $p(X_2 \mid x_1)$, which then produces a sample $x_2 \sim p(X_2 \mid x_1, X_2 \in R_2)$ and the range likelihood $P(X_2 \in R_2 \mid x_1)$. Lastly, it obtains $P(X_3 \in R_3 \mid x_1, x_2)$. It can be shown that the product of all range likelihoods, e.g.,

$$P(X_1 \in R_1) \cdot P(X_2 \in R_2 \mid x_1) \cdot P(X_3 \in R_3 \mid x_1, x_2),$$

is a valid Monte Carlo estimate of the desired range density.
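To make the procedure concrete, below is a minimal sketch of progressive sampling in Python. The `model.conditional(i, prefix)` interface, the representation of `regions` as sets of integer-coded values, and all names are illustrative assumptions for this sketch, not the interface of any released implementation.

```python
# Minimal sketch of progressive sampling for range likelihoods (Section 3.1).
# Assumption: model.conditional(i, prefix) returns p(X_i | prefix) as a 1-D
# numpy array over the (integer-coded) domain of dimension i.
import numpy as np

def progressive_sample_estimate(model, regions, rng):
    """One Monte Carlo estimate of P(X_1 in R_1, ..., X_n in R_n)."""
    prefix, estimate = [], 1.0
    for i, region in enumerate(regions):
        probs = model.conditional(i, prefix)   # forward pass: p(X_i | sampled prefix)
        region = sorted(region)
        in_range = probs[region]               # mass restricted to R_i
        range_mass = in_range.sum()
        if range_mass == 0.0:
            return 0.0                         # no in-range continuation under this prefix
        estimate *= range_mass                 # accumulate the range likelihood factor
        # Sample an in-range value to condition the next dimension on.
        prefix.append(rng.choice(region, p=in_range / range_mass))
    return estimate

def estimate_range_density(model, regions, num_samples=100, seed=0):
    rng = np.random.default_rng(seed)
    return float(np.mean([progressive_sample_estimate(model, regions, rng)
                          for _ in range(num_samples)]))
```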
In the remainder of the paper, we invoke progressive sampling as a black-box estimator, although our variable skipping technique (described next) can work with other estimators for Equation 2.
4 Variable Skipping
Variable skipping works by (1) training special marginalization tokens, $\mathrm{MASK}_i$, for each dimension $i$; and (2) at approximate inference time, rewriting each unconstrained variable $X_i$ into a constrained variable with the singleton range $R_i = \{\mathrm{MASK}_i\}$. The training process can be interpreted as dropout applied to the input, or as data augmentation.
Architecture. We assume the model architecture shown in Figure 3: an input layer, an autoregressive core, and an output layer. For the autoregressive core, we use ResMADE (Durkan and Nash, 2019) for tabular data and an autoregressive Transformer (Vaswani et al., 2017) (encoder only, with correct masking) for text data. At the input layer, we embed each data point using a per-dimension trainable embedding table. (For text, we tie the embeddings across all dimensions since they share the same character vocabulary.) The output layer dots the hidden features with the input embeddings to produce logits.
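As a small illustration of the tied output layer described above, the logits for dimension $i$ can be computed as a dot product between the hidden features and that dimension's input embedding table; the shapes and names below are assumptions for illustration only.

```python
import numpy as np

def output_logits(hidden_i, embedding_table_i):
    """Logits for dimension i by dotting hidden features with input embeddings.

    hidden_i: [batch, embed_dim] hidden features used to predict dimension i.
    embedding_table_i: [vocab_i, embed_dim] input embedding table for dimension i
                       (including the row for the MASK_i token).
    """
    return hidden_i @ embedding_table_i.T  # [batch, vocab_i]
```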
Training-time input masking. First, we add a special token $\mathrm{MASK}_i$ to each dimension $i$'s vocabulary. For each input $\mathbf{x}$ we uniformly draw the number of masked dimensions $m$, then sample the set of positions to mask, $\mathcal{M} \subseteq \{1, \ldots, n\}$ with $|\mathcal{M}| = m$. For each position $i \in \mathcal{M}$, we replace the original representation (the embedding of $x_i$) by the masked representation (the embedding of $\mathrm{MASK}_i$):

$$\tilde{x}_i = \begin{cases} \mathrm{MASK}_i & \text{if } i \in \mathcal{M}, \\ x_i & \text{otherwise.} \end{cases} \quad (3)$$

Importantly, the objective remains the MLE for all autoregressive factors: we train the parameters to predict the original values at each dimension, given a mix of original and masked information at previous dimensions. In other words, we minimize the negative log-likelihood

$$-\sum_{i=1}^{n} \log p_\theta(x_i \mid \tilde{\mathbf{x}}_{<i}) \quad (4)$$

over all dimensions $i$. Conditioning on the mask tokens ensures that those representations are trained. Since we do not alter the output targets and the mask positions are chosen independently of the data, no bias is introduced.
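The masking augmentation itself is simple to implement. Below is a minimal sketch assuming integer-coded inputs of shape [batch, n] and per-dimension vocabulary sizes; purely for illustration, the MASK token of dimension j is given the extra id vocab_sizes[j], and the number of masked dimensions is assumed to be drawn uniformly from 0 to n.

```python
import numpy as np

def mask_inputs(batch, vocab_sizes, rng):
    """Training-time input masking (Section 4).

    Only the inputs fed to the AR model are partially replaced by MASK tokens;
    the training targets remain the ORIGINAL values (Equation 4).
    """
    masked = batch.copy()
    n = masked.shape[1]
    for row in masked:
        num_masked = rng.integers(0, n + 1)            # assumed range: 0..n, uniform
        positions = rng.choice(n, size=num_masked, replace=False)
        for j in positions:
            row[j] = vocab_sizes[j]                    # id reserved here for MASK_j
    return masked
```

During training, the model would be fed `masked` while the loss in Equation 4 is still computed against the original `batch`.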
Inference-time skipping. Given a range query (Equation 2), we find each unconstrained dimension $i$ and replace its domain with the singleton set containing its marginalization token:

$$R_i \leftarrow \{\mathrm{MASK}_i\}. \quad (5)$$

We then invoke progressive sampling, which thus skips sampling for those dimensions.
Example. Suppose we have an AR model trained over the autoregressive ordering $(X_1, X_2, X_3)$, and we want to draw a sample for estimating $P(X_1 \in R_1, X_3 \in R_3)$, with $X_2$ unconstrained.

Without skipping, first we draw $x_1 \sim p(X_1 \mid X_1 \in R_1)$, then $x_2 \sim p(X_2 \mid x_1)$, and finally $x_3 \sim p(X_3 \mid x_1, x_2, X_3 \in R_3)$.

With skipping, we can directly sample $x_1 \sim p(X_1 \mid X_1 \in R_1)$, followed by $x_3 \sim p(X_3 \mid x_1, \mathrm{MASK}_2, X_3 \in R_3)$, skipping $X_2$ entirely (see the sketch after this example).
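The example above can be written directly against the progressive sampling sketch from Section 3.1: an unconstrained dimension is fixed to its MASK token and contributes neither a forward pass nor a range-likelihood factor. As before, the `model.conditional` interface and the use of vocab_sizes[i] as the MASK id are assumptions carried over from the earlier sketches.

```python
def progressive_sample_with_skipping(model, regions, vocab_sizes, rng):
    """Like progressive_sample_estimate, but an unconstrained dimension
    (region is None) is set to its MASK token instead of being sampled."""
    prefix, estimate = [], 1.0
    for i, region in enumerate(regions):
        if region is None:                  # unconstrained: skip sampling entirely
            prefix.append(vocab_sizes[i])   # MASK_i was trained to mean "marginalized out"
            continue
        probs = model.conditional(i, prefix)
        region = sorted(region)
        in_range = probs[region]
        range_mass = in_range.sum()
        if range_mass == 0.0:
            return 0.0
        estimate *= range_mass
        prefix.append(rng.choice(region, p=in_range / range_mass))
    return estimate
```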
4.1 Prefix Skipping for Text Pattern Matching
Any regex can be implemented as a nondeterministic finite automaton (NFA) (Hopcroft and Ullman, 1979), which consumes a stream of characters and determines the acceptable next characters. We can use progressive sampling with any regex, treating its NFA as a dynamically unrolled predicate. For example, for a regex containing an alternation, progressive sampling first samples the characters preceding the alternation; then, depending on which branch the sampled characters are consistent with (this is the "dynamic" part), it samples from the acceptable continuations, and so on. By retaining an NFA per sample, we obtain an estimate of the overall match probability.
However, this naive formulation is inefficient when there are long unconstrained sequences. Consider a regex intended to match any string containing a given token anywhere (i.e., a CONTAINS query). The probability of sampling a random prefix from an AR model that happens to match the token is vanishingly small; it could take millions of samples to get a hit. To avoid this, we can instead skip over sequences of unconstrained characters and compute the probability of the token at specific offsets directly. All that remains is sampling forward through the rest of the variables to avoid double counting duplicate occurrences of the token. Using $M_i$ to denote a match of the token at position $i$, and $M$ the existence of a match at any position, the match probability is approximated as:

$$P(M) = \sum_{i} P\big(M_i \wedge \neg M_j \;\text{for all}\; j > i\big). \quad (6)$$

Due to the need for masking contiguous prefixes, the model is trained with random prefix masking (Figure 4), which allows such contiguous runs of characters to be skipped. We show the effectiveness of this strategy in Section 5.7, which implements simple pattern queries over an AR Transformer model.
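A minimal sketch of this estimator is shown below, reusing the hypothetical `model.conditional` interface from the earlier sketches; the token is assumed to be given as a list of integer character ids and `mask_id` is the assumed id of the character-level MASK token. This is only one way to instantiate Equation 6, not necessarily the exact estimator used in the experiments.

```python
def contains_probability(model, token_ids, seq_len, mask_id, rng, num_samples=16):
    """Sketch of Equation 6: sum over offsets of P(match at offset i, no later match).

    The characters before the offset are skipped with MASK tokens (prefix masking);
    the suffix is sampled to rule out later, double-counted occurrences."""
    total = 0.0
    k = len(token_ids)
    for offset in range(seq_len - k + 1):
        for _ in range(num_samples):
            prefix = [mask_id] * offset               # skip the unconstrained prefix
            match_prob = 1.0
            for ch in token_ids:                      # P(token at this offset | masked prefix)
                probs = model.conditional(len(prefix), prefix)
                match_prob *= probs[ch]
                prefix.append(ch)
            for i in range(len(prefix), seq_len):     # sample the remaining characters
                probs = model.conditional(i, prefix)
                prefix.append(int(rng.choice(len(probs), p=probs)))
            # Count this offset only if the token does not occur again at a later position.
            if not _occurs(prefix[offset + 1:], token_ids):
                total += match_prob / num_samples
    return total

def _occurs(seq, token_ids):
    k = len(token_ids)
    return any(list(seq[i:i + k]) == list(token_ids) for i in range(len(seq) - k + 1))
```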
4.2 Other Mask Patterns
Finally, we note that more structured mask patterns can be used, such as subsequences in text or random patches in images (Dupont and Suresha, 2018). This allows marginalization over complex subsets of dimensions, with potential applicability not only to sampling variables given a prefix of the AR ordering, but also to sampling variables that appear later in the ordering, by marginalizing over the intervening dimensions. We leave investigation of these potential applications to future work.
5 Evaluation
Our evaluation investigates the following questions:

How much does variable skipping improve estimation accuracy compared to baselines, and how is this impacted by the sampling budget?

Can variable skipping be combined with multi-order training to further improve accuracy?

To what extent do hyperparameters such as the model capacity and mask token distribution impact the effectiveness of variable skipping?

Can variable skipping be applied to related domains such as text, or is it limited to tabular data?
Overall, we find that variable skipping robustly improves estimation accuracy across a variety of scenarios. Given a certain target accuracy, skipping reduces the required compute cost by one to two orders of magnitude.
Table 1. Datasets used in our evaluation.

Dataset | Rows | Cols | Domain sizes | Type
DMVFull | 11.6M | 19 | 2–32K | Discrete
Census | 2.5M | 67 | 2–18 | Discrete
KDD | 95K | 100 | 2–896 | Discrete
DryadURLs | 2.4M | 100 | 78 | Text
5.1 Datasets
We use the following public datasets in our evaluation, also summarized in Table 1. When necessary, we drop columns representing continuous data. We consider supporting continuous variables an orthogonal issue, and limit our evaluation to discrete domains:
DMVFull (State of New York, 2019). Dataset consisting of vehicle registration information in New York (i.e., attributes like vehicle class, make, model, and color). We use all columns except for the unique vehicle ID (VIN). This dataset was also used in (Yang et al., 2019b), but there it was restricted to 11 of the smaller columns.
KDD (Dua and Graff, 2017). KDD Cup 1998 Data. We used the first hundred columns, sans noexch, zip, and pop9013, which were especially high-cardinality. This leaves 100 discrete integer domains with 2 to 896 distinct values each.
Census (Dua and Graff, 2017). The US Census Data (1990) Data Set, which consists of a 1% sample made publicly available. We use all available columns, which range from 2 to 18 distinct values each.
DryadURLs (Sen et al., 2016). For text domain experiments, we use this small dataset of 2.4M URLs, each truncated to 100 characters. This dataset was chosen to emulate a plausible STRING column in a relational database.
Table 2. Training hyperparameters.

Hyperparameter | Value
Training Epochs | 20 (200 for KDD)
Batch Size | 2048
Architecture | ResMADE
Residual Blocks | 3
Hidden Layers / Block | 2
Hidden Layer Units | 256
Embedding Size | 32
Optimizer | Adam
Learning Rate | 5e-4
Learning Rate Warmup | 1 epoch
Mask Probability | drawn uniformly at random (Section 5.6)
Transformer Num Blocks | 8
Transformer MLP Dims | 256
Transformer Embed Size | 32
Transformer Num Heads | 4
Transformer Batch Size | 512
5.2 Evaluation Metric
We issue a large set of randomly generated range queries and measure how accurately each estimator answers them. We report the multiplicative error, or Q-error, defined as the factor by which an estimate differs from the actual density (obtained by executing each query on the dataset):

$$\mathrm{Qerror} = \max\left(\frac{\mathrm{estimate}}{\mathrm{actual}}, \frac{\mathrm{actual}}{\mathrm{estimate}}\right) \geq 1.$$

Hence, a perfect estimate for a query has an error of 1.0. We report the median, 99th-percentile, and maximum Q-error across all queries. We note that the median error is typically within a small factor of 1.0 for all estimators. The reason is that most randomly generated queries are "easy" (i.e., they hit few cross-dimension predicate correlations), and only a few are "hard". Because of this, even a naive estimator can achieve good median performance. Hence, our focus is on high-quantile errors for evaluation.
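For reference, the metric is straightforward to compute; the small guard against zero selectivities below is an implementation detail we assume for the sketch.

```python
def q_error(estimate, actual, eps=1e-12):
    """Multiplicative Q-error between an estimated and a true density (selectivity)."""
    estimate, actual = max(estimate, eps), max(actual, eps)  # guard against zero values
    return max(estimate / actual, actual / estimate)

# Example: q_error(0.02, 0.01) == 2.0, and a perfect estimate gives 1.0.
```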
5.3 Experiment Setup
For queries against tabular data, we used the experiment framework from Yang et al. (2019b), randomly drawing between 5 and 12 conjunctive variable constraints per query. (An example query for DMVFull is: record_type == 1 AND city == 17 AND zip > 10000 AND model_year < 1990 AND max_weight > 5000.) It is important to not have too many or too few constraints, which would skew the distribution of true densities towards 0.0 (too many constraints leave little density) or 1.0 (too few constraints leave high density), respectively.
For text queries, we issued pattern glob queries of the form value CONTAINS <str>, where <str> is a character sequence between 3 and 5 characters in length drawn randomly from the full text corpus. This also provides a challenging spread of density from very common (e.g., CONTAINS ".com"), to quite rare (e.g., CONTAINS "XVQ/i").
We compare between the following approaches, all of which use progressive sampling (Section 3.1) as the approximate inference procedure:

Baseline: An autoregressive model queried using vanilla progressive sampling (Yang et al., 2019b).

Skipping: An autoregressive model trained with random input masking and queried with the variable skipping optimization enabled (Section 4).

Multi-Order: An autoregressive model trained over multiple variable orderings (Section 2) and queried by ensembling estimates across its orders.

Multi-Order + Skipping: Combining the multi-order and variable skipping techniques.
The full list of training hyperparameters can be found in Table 2. Unless otherwise specified, we use a ResMADE (Durkan and Nash, 2019) with 3 residual blocks, two 256-unit hidden layers per block, and a 32-unit embedding for each input dimension. We chose hyperparameters known to work well for progressive sampling (Yang et al., 2019b), but did not otherwise tune them for our experiments. In our ablations (Section 5.6), we found that the hyperparameter performance is most sensitive to is the embedding size, which is closely tied to model size.
Dataset | Baseline | Random Input Masking | Multi-Order(5) | Multi-Order(10) | Multi-Order(15)
Census | 52.04 ± .009 | 52.34 ± .02 | 52.69 ± .02 | 52.79 ± .03 | 52.81 ± .03
DMVFull | 43.12 ± .04 | 43.65 ± .06 | 44.15 ± .05 | 44.53 ± .06 | 44.65 ± .04
KDD | 107.5 ± .3 | 116.58 ± .2 | 123 ± .9 | 127.6 ± .4 | 128.4 ± .5

Table 3. Model negative log-likelihoods at convergence in bits/datapoint (evaluated using non-masked data). We also report the standard deviation across multiple random-order seeds.
The autoregressive variable ordering can significantly affect estimator variance. We thus evaluate each technique on 8 randomly chosen variable orderings and train a (fixed-order) model for each ordering. (For ResMADE, this means we sample 8 sets of {input ordering, intermediate connectivity masks}.) For multi-order models, we train 8 distinct sets of 10 randomly chosen orders (we saw diminishing returns past 10 orders), unless specified otherwise. To ensure fairness, when not using skipping, we use a model trained without masking.
For multi-order ResMADE, to condition on the current ordering statistics, each masked linear layer is allocated an additional weight matrix that shares the existing mask and takes an all-ones vector as its input. (This treatment has appeared in MADE (Germain et al., 2015).) Due to the additional weights, we size down the hidden units appropriately to ensure that the multi-order models have about the same parameter count as the other models.

5.4 Variable Skipping Performance
We evaluate the impact of the variable skipping optimization on the DMVFull, Census, and KDD datasets. For each dataset, we generated 1000 random range queries.
In Figure 5 we show the results of variable skipping (orange and red lines) compared against baselines (black and turquoise lines). This data is also shown in tabular form in Table 5. We evaluate with sampling budgets of 100, 1000, and 10000 samples (left, center, and right columns respectively). Note that a sample refers to all the forward passes required to sample relevant variables (e.g., for Census a single sample takes 67 forward passes without skipping). We limit to 10k samples for cost reasons (at 10k samples, each query takes multiple seconds to evaluate even with a GPU). There are several key takeaways:
High-quantile error differentiates estimators: Across all estimators, the median error is very close to 1.0 (not shown since it is indistinguishable in log scale). However, systems applications necessarily seek to minimize the worst-case error, which does vary significantly across samplers.
Skipping significantly improves sampling efficiency: Across all datasets, variable skipping provides between one and three orders of magnitude of max error reduction at low sampling budgets (i.e., 100 samples), compared to the baseline. It also provides up to two orders of magnitude improvement over the multi-order ensemble alone (Table 5).
Concretely, at 100 samples the 99th-quantile error for Census is reduced from 1150 to 2.55, KDD from 36.4 to 4.58, and DMVFull from 1130 to 93. Moving up to 1000 samples, we continue to see a significant improvement at the max error, with Census improved from 798 to 14.7, KDD from 30.7 to 9.95, and DMVFull from 508 to 114.
Compared to multi-order, variable skipping provides better max error reduction for Census and KDD at 100 samples. Interestingly, while multi-order and variable skipping provide comparable improvements for DMVFull at 100 samples, combining multi-order and skipping provides a further improvement in both max and 99th-quantile error. This suggests that variable skipping and multi-order training are orthogonal mechanisms, and can be combined for larger datasets such as DMVFull to reduce both error and inference costs.
Variable skipping can help even at high sampling budgets: On the DMVFull dataset, variable skipping provides more than an order of magnitude reduction in max error (Table 5), even at 10000 samples. We hypothesize this is due to the large domain sizes in DMVFull (up to 32K distinct values), which in the worst case require a much larger number of samples to achieve low estimation error. As evidence for this, a large number of samples is required to achieve good errors even with skipping enabled. This is in contrast to the smaller Census and KDD datasets, where skipping achieves close to single-digit errors with as few as 100 samples. This suggests that for even larger datasets (common in industrial settings), skipping may have an even greater impact.
We also provide a summary of the results of Figure 5 in Figure 1. For a concrete target error at the 99th quantile, skipping significantly reduces the compute requirements relative to both the baseline and multi-order training. This is due both to accuracy improvements, which compound with those provided by multi-order ensembles, and to the reduced compute requirements of skipping.
5.5 Model Likelihoods vs. Training Scheme
Table 4. Text pattern matching errors on DryadURLs (Section 5.7).

Num Samples | Metric | Naive Sampling | Progressive Sampling (Baseline) | Variable Skipping
1000 | Max Error | 6412 ± 1280 | 4628 ± 1785 | 115.2 ± 25.6
1000 | P99 Error | 4054 ± 1386 | 1741 ± 1824 | 89.5 ± 30.6
1000 | Median Error | 1.23 ± .06 | 1.66 ± .15 | 1.39 ± .08
Training with partially masked inputs makes the learning task more difficult: the space of masked inputs grows combinatorially with the number of dimensions, and effectively the same model is learning multiple autoregressive distributions. Table 3 shows that, in terms of negative log-likelihoods achieved, models trained with masking do have a slightly higher NLL than the baseline, as expected. However, the NLLs achieved are lower than those of multi-order models, and the gap only widens as the number of orders increases.
Even though the NLLs of masked models are higher than those of the baseline, Figure 5 shows the benefit: estimation error is significantly improved when variable skipping is enabled. This highlights the imperfect alignment between optimizing for point likelihoods and downstream range query performance, opening up interesting future directions.
5.6 Model Size and Masking Ablations
In Figure 6 we study the relationship between model size and estimation accuracy. For this experiment, we vary the embedding size and the hidden layer size of the model. We see that variable skipping retains a robust advantage across nearly two orders of magnitude of variation in model size.
In Figure 7 we compare several schemes for selecting the mask distribution: fixed masking probabilities versus the random uniform scheme used in the main experiments. We see that drawing the mask probability uniformly at random obtains the lowest errors for KDD and DMVFull, and is close to optimal for Census as well, showing it to be a robust choice.
5.7 Application to Pattern Matching in Text Domain
Finally, we show that variable skipping can be applied to the text domain for estimating the probability of pattern matches. Pattern matching (or, more generally, regex matching) can be thought of as unrolling a dynamic predicate as variables are sampled (Section 4.1). Here we evaluate a simple character-level Transformer model on the DryadURLs dataset. We note that this is not a realistic application, since scanning a dataset of this size is much faster than sampling from a model; however, it demonstrates the applicability of variable skipping across domains. Table 4 shows that prefix skipping enables much lower variance estimates than naive sampling and vanilla progressive sampling.
6 Conclusion
To summarize, we identify the range density estimation task and important applications. We propose variable skipping, which greatly reduces sampling variance and inference latency. We validate the effectiveness of these techniques across a variety of datasets and model configurations.
Table 5. Estimation errors by sampling budget, quantile, and dataset (mean ± spread across random variable orderings).

Samples | Metric | Dataset | Progressive (Baseline) | Multi-Order | Skipping | Multi-Order + Skipping
100 | P50 | Census | 1.19 ± .01 | 1.53 ± .32 | 1.09 ± .01 | 1.21 ± .03
100 | P50 | KDD | 1.24 ± .02 | 1.41 ± .16 | 1.14 ± .02 | 1.25 ± .05
100 | P50 | DMVFull | 1.22 ± .04 | 1.70 ± .22 | 1.08 ± .01 | 1.22 ± .04
100 | P99 | Census | 1150 ± 550 | 38.6 ± 52 | 2.55 ± .29 | 3.36 ± .32
100 | P99 | KDD | 36.4 ± 11 | 9.82 ± 3.4 | 4.58 ± .83 | 5.07 ± .85
100 | P99 | DMVFull | 1130 ± 1300 | 80 ± 81 | 93 ± 69 | 6.5 ± 1.7
100 | Max | Census | 24500 ± 18000 | 3490 ± 6700 | 15.1 ± 7.3 | 12.3 ± 5.3
100 | Max | KDD | 345 ± 280 | 99.9 ± 140 | 9.97 ± 4.1 | 9.25 ± 1.9
100 | Max | DMVFull | 21200 ± 22000 | 1640 ± 2600 | 1560 ± 1700 | 22.7 ± 13
1000 | P50 | Census | 1.06 ± .01 | 1.44 ± .31 | 1.09 ± .01 | 1.20 ± .02
1000 | P50 | KDD | 1.08 ± .01 | 1.34 ± .17 | 1.13 ± .02 | 1.25 ± .05
1000 | P50 | DMVFull | 1.08 ± .01 | 1.53 ± .24 | 1.05 ± .01 | 1.18 ± .04
1000 | P99 | Census | 2.58 ± .39 | 3.84 ± 1.1 | 2.50 ± .28 | 3.2 ± .28
1000 | P99 | KDD | 5.78 ± 1.7 | 4.69 ± .67 | 4.41 ± .74 | 5.05 ± .89
1000 | P99 | DMVFull | 105 ± 44 | 11.3 ± 4.8 | 4.24 ± 2.1 | 4.92 ± .81
1000 | Max | Census | 798 ± 700 | 212 ± 330 | 14.7 ± 7.0 | 12.8 ± 5.1
1000 | Max | KDD | 30.7 ± 30 | 9.42 ± 3.8 | 9.95 ± 4.1 | 9.37 ± 1.9
1000 | Max | DMVFull | 508 ± 200 | 39.4 ± 43 | 114 ± 100 | 14 ± 6.7
10000 | P50 | Census | 1.03 ± .01 | 6.31 ± 1.6 | 1.09 ± .01 | 1.20 ± .24
10000 | P50 | KDD | 1.05 ± .01 | 1.33 ± .18 | 1.13 ± .02 | 1.25 ± .05
10000 | P50 | DMVFull | 1.05 ± .01 | 1.48 ± .24 | 1.05 ± .01 | 1.18 ± .04
10000 | P99 | Census | 1.49 ± .08 | 2.84 ± .70 | 2.48 ± .29 | 3.20 ± .29
10000 | P99 | KDD | 2.99 ± .19 | 4.43 ± .93 | 4.36 ± .75 | 5.08 ± .85
10000 | P99 | DMVFull | 5.95 ± 3.4 | 7.23 ± 2.4 | 2.89 ± .34 | 4.96 ± .85
10000 | Max | Census | 8.86 ± 14 | 6.31 ± 1.6 | 14.7 ± 7.0 | 12.6 ± 5.0
10000 | Max | KDD | 7.0 ± 1.5 | 7.57 ± 1.9 | 9.97 ± 4.1 | 9.37 ± 1.9
10000 | Max | DMVFull | 181 ± 78 | 13.5 ± 6.2 | 20.2 ± 40 | 10.5 ± 2.6
References

Child et al. (2019). Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Devlin et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.

Dua and Graff (2017). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.

Dupont and Suresha (2018). Probabilistic semantic inpainting with pixel constrained CNNs. arXiv preprint arXiv:1810.03728.

Durkan and Nash (2019). Autoregressive energy machines. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, Long Beach, California, USA, pp. 1735–1744.

Germain et al. (2015). MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pp. 881–889.

Ghazvininejad et al. (2019). Mask-Predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6114–6123.

Hopcroft and Ullman (1979). Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, Mass.

Koller and Friedman (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Radford et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog 1(8).

Rao et al. (2019). Evaluating protein transfer learning with TAPE. In Advances in Neural Information Processing Systems, pp. 9689–9701.

Salimans et al. (2017). PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, Conference Track Proceedings.

Selinger et al. (1979). Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, pp. 23–34.

Sen et al. (2016). Data from: Women are seen more than heard in online newspapers. Dryad dataset.

Sharir et al. (2020). Deep autoregressive models for the efficient variational simulation of many-body quantum systems. Phys. Rev. Lett. 124, 020503.

State of New York (2019). Vehicle, snowmobile, and boat registrations. catalog.data.gov/dataset/vehiclesnowmobileandboatregistrations. Online; accessed March 1st, 2019.

Uria et al. (2014). A deep and tractable density estimator. In International Conference on Machine Learning, pp. 467–475.

van den Oord et al. (2016). WaveNet: A generative model for raw audio. arXiv.

van den Oord et al. (2016). Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798.

Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.

Weissenborn et al. (2019). Scaling autoregressive video models. arXiv preprint arXiv:1906.02634.

Yang et al. (2019a). XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pp. 5754–5764.

Yang et al. (2020). NeuroCard: One cardinality estimator for all tables. arXiv preprint arXiv:2006.08109.

Yang et al. (2019b). Deep unsupervised cardinality estimation. Proceedings of the VLDB Endowment, Vol. 13, pp. 279–292.