Tractable Generative Convolutional Arithmetic Circuits

10/13/2016 ∙ by Or Sharir, et al. ∙ Hebrew University of Jerusalem

Casting neural networks in generative frameworks is a highly sought-after endeavor these days. Existing methods, such as Generative Adversarial Networks, capture some of the generative capabilities, but not all. To truly leverage the power of generative models, tractable marginalization is needed, a feature outside the realm of current methods. We present a generative model based on convolutional arithmetic circuits, a variant of convolutional networks that computes high-dimensional functions through tensor decompositions. Our method admits tractable marginalization, combining the expressive power of convolutional networks with all the abilities that may be offered by a generative framework. We focus on the application of classification under missing data, where unknown portions of classified instances are absent at test time. Our model, which theoretically achieves optimal classification, provides state of the art performance when classifying images with missing pixels, as well as promising results when treating speech with occluded samples.




1 Introduction

There have been many attempts in recent years to marry generative models with neural networks, including successful methods such as Generative Adversarial Networks (Goodfellow et al., 2014), Variational Auto-Encoders (Kingma and Welling, 2014), NADE (Uria et al., 2016), and PixelRNN (van den Oord et al., 2016). Though each of the above methods has demonstrated its usefulness on some tasks, it is yet unclear whether their advantage strictly lies in their generative nature or in some other attribute. More broadly, we ask if combining generative models with neural networks could lead to methods that have a clear advantage over purely discriminative models.

On the most fundamental level, if X stands for an instance and Y for its class, generative models learn the joint distribution P(X, Y), from which we can also infer P(Y|X), while discriminative models learn only P(Y|X). It might not be immediately apparent whether this sole difference leads to any advantage. In Ng and Jordan (2002), this question was studied with respect to sample complexity, proving that in some cases it can be significantly smaller in favor of the generative classifier. We wish to highlight a more clear-cut case, by examining the problem of classification under missing data – where the values of some of the entries of X are unknown at prediction time. Under these settings, discriminative classifiers typically rely on some form of data imputation, i.e. filling in missing values by some auxiliary method prior to prediction. Generative classifiers, on the other hand, are naturally suited to handle missing values through marginalization – effectively assessing every possible completion of the missing values. Moreover, under mild assumptions, this method is optimal regardless of the process by which values become missing (see sec. 3).

It is evident that such application of generative models assumes we can efficiently and exactly compute P(X = x), a process known as tractable inference. Moreover, it assumes we may efficiently marginalize over any subset of the variables in X, a procedure we refer to as tractable marginalization. Not all generative models have both of these properties, and specifically not the ones mentioned at the beginning of this section. Known models that do possess these properties, e.g. Latent Tree Models (Mourad et al., 2013), have other limitations. A detailed discussion can be found in sec. 4, but in broad terms, all known generative models possess one of the following shortcomings: (i) they are insufficiently expressive to model high-dimensional data (images, audio, etc.); (ii) they require explicitly designing all the dependencies of the data; or (iii) they do not have tractable marginalization. Models based on neural networks typically solve (i) and (ii) but are incapable of (iii), while more classical methods, e.g. mixture models, solve (iii) but suffer from (i) and (ii).

There is a long history of specifying tractable generative models through arithmetic circuits and sum-product networks (Darwiche, 2003; Poon and Domingos, 2011) – computational graphs comprised solely of product and weighted sum nodes. To address the shortcomings above, we take a similar approach, but go one step further and leverage tensor analysis to distill it to a specific family of models we call Tensorial Mixture Models. A Tensorial Mixture Model assumes a convolutional network structure, but as opposed to previous methods tying generative models with neural networks, lends itself to theoretical analyses that allow a thorough understanding of the relation between its structure and its expressive properties. We thus obtain a generative model that is tractable on one hand, and on the other hand, allows effective representation of rich distributions in an easily controlled manner.

2 Tensorial Mixture Models

One of the simplest types of tractable generative models are mixture models, where the probability distribution is defined as a convex combination of M mixing components (e.g. Normal distributions): P(x) = Σ_{d=1}^{M} P(d) P(x|d; θ_d). Mixture models are very easy to learn, and many of them are able to approximate any probability distribution given a sufficient number of components, rendering them suitable for a variety of tasks. The disadvantage of classic mixture models is that they do not scale well to high-dimensional data (the “curse of dimensionality”). To address this challenge, we extend mixture models, leveraging the fact that many high-dimensional domains (e.g. images) are typically comprised of small, simple local structures. We represent a high-dimensional instance as X = (x_1, …, x_N) – an N-length sequence of s-dimensional vectors x_1, …, x_N ∈ R^s (called local structures). X is typically thought of as an image, where each local structure x_i corresponds to a local patch from that image, and no two patches overlap. We assume that the distribution of individual local structures can be efficiently modeled by some mixture model of few components, which for natural image patches was shown to be the case (Zoran and Weiss, 2011). Formally, for all i there exists d_i ∈ [M] such that x_i ∼ P(x|d_i; θ_{d_i}), where d_i is a hidden variable specifying the matching component for the i-th local structure. The probability density of sampling X is thus described by:

P(X) = Σ_{d_1,…,d_N=1}^{M} P(d_1, …, d_N) Π_{i=1}^{N} P(x_i | d_i; θ_{d_i})    (1)

where P(d_1, …, d_N) represents the prior probability of assigning components d_1, …, d_N to their respective local structures x_1, …, x_N. As with classical mixture models, any probability density function P(X) can be approximated arbitrarily well by eq. 1 as M → ∞ (see app. A).

At first glance, eq. 1 seems impractical, having an exponential number of terms. In the literature, this equation is known as the “Network Polynomial” (Darwiche, 2003), and the traditional method to overcome its intractability is to express P(d_1, …, d_N) through an arithmetic circuit, or sum-product network, following certain constraints (decomposability and completeness). We augment this method by viewing P(d_1, …, d_N) from an algebraic perspective, treating it as a tensor of order N and dimension M in each mode, i.e. as a multi-dimensional array A_{d_1…d_N}, specified by indices d_1, …, d_N, each ranging in [M], where [M] = {1, …, M}. We refer to A as the prior tensor. Under this perspective, eq. 1 can be thought of as a mixture model with tensorial mixing weights, thus we call the arising models Tensorial Mixture Models, or TMMs for short.
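The exponential blow-up of eq. 1 can be made concrete with a brute-force implementation. The sketch below (hypothetical shapes, illustrative names) evaluates the network polynomial directly, summing over all M^N component assignments:

```python
import itertools
import numpy as np

def naive_tmm_density(prior_tensor, comp_likelihoods):
    """Brute-force evaluation of eq. 1.

    prior_tensor: shape (M,) * N, entries P(d_1, ..., d_N), summing to 1.
    comp_likelihoods: shape (N, M), entry [i, d] = P(x_i | d; theta_d)."""
    N, M = comp_likelihoods.shape
    total = 0.0
    # One term per assignment (d_1, ..., d_N) -- M^N terms in total.
    for assignment in itertools.product(range(M), repeat=N):
        prior = prior_tensor[assignment]
        lik = np.prod([comp_likelihoods[i, d] for i, d in enumerate(assignment)])
        total += prior * lik
    return total
```

Even for modest patch grids (e.g. N = 64, M = 32) the sum is astronomically large, which is why the factorizations of sec. 2.1 are needed.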

2.1 Tensor Factorization, Tractability, and Convolutional Arithmetic Circuits

Not only is it intractable to compute eq. 1, but it is also impossible to even store the prior tensor. We argue that addressing the latter is intrinsically tied to addressing the former. For example, if we impose a sparsity constraint on the prior tensor, then we only need to compute the few non-zero terms of eq. 1. TMMs with sparsity constraints can represent common generative models, e.g. GMMs (see app. B). However, they do not take full advantage of the prior tensor. Instead, we consider constraining TMMs with prior tensors that adhere to non-negative low-rank factorizations.

We begin by examining the simplest case, where the prior tensor takes a rank-1 form, i.e. there exist vectors v^(1), …, v^(N) such that A_{d_1…d_N} = Π_{i=1}^{N} v^(i)_{d_i}, or in tensor product notation, A = v^(1) ⊗ ⋯ ⊗ v^(N). If we interpret each v^(i) as a probability over d_i (A represents a probability, so w.l.o.g. we can assume all entries of v^(i) are non-negative and sum to one), so that P(d_1, …, d_N) = Π_i P(d_i), then it is revealed that imposing a rank-1 constraint is actually equivalent to assuming the hidden variables d_1, …, d_N are statistically independent. Applying it to eq. 1 results in the tractable form P(X) = Π_{i=1}^{N} Σ_{d_i=1}^{M} P(d_i) P(x_i | d_i; θ_{d_i}), or in other words, a product of mixture models. Despite the familiar setting, this strict independence assumption severely limits expressivity.
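The rank-1 equivalence can be checked numerically. The sketch below (toy sizes, random stand-in likelihoods) builds a rank-1 prior tensor, evaluates eq. 1 by brute force, and compares against the product-of-mixtures form:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 4
vs = [rng.random(M) for _ in range(N)]
vs = [v / v.sum() for v in vs]   # each v^(i) is a distribution over [M]
lik = rng.random((N, M))         # lik[i, d] stands in for P(x_i | d; theta_d)

# Build the full rank-1 prior tensor A = v^(1) ⊗ ... ⊗ v^(N).
prior = vs[0]
for v in vs[1:]:
    prior = np.multiply.outer(prior, v)  # shape grows to (M,) * N

# Exponential-sum form of eq. 1.
brute = sum(
    prior[d] * np.prod([lik[i, d[i]] for i in range(N)])
    for d in itertools.product(range(M), repeat=N)
)

# Tractable form: a product of N plain mixture models.
factored = np.prod([vs[i] @ lik[i] for i in range(N)])
```

The two quantities agree to machine precision, confirming that the rank-1 constraint collapses the M^N-term sum into N independent mixtures.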

Figure 1: A generative variant of Convolutional Arithmetic Circuits.

In a broader setting, we look at general factorization schemes that, given sufficient resources, could represent any tensor, namely the CANDECOMP/PARAFAC (CP) and the Hierarchical Tucker (HT) factorizations. The CP factorization is simply a sum of rank-1 tensors, extending the previous case, and the HT factorization can be seen as a recursive application of CP (see def. in app. C). Since both factorization schemes are based solely on product and weighted sum operations, they can be realized through arithmetic circuits. As shown by Cohen et al. (2016a), this gives rise to a specific class of convolutional networks named Convolutional Arithmetic Circuits (ConvACs), which consist of 1×1 convolutions, non-overlapping product pooling layers, and linear activations. More specifically, the CP factorization corresponds to shallow ConvACs, HT corresponds to deep ConvACs, and the number of channels in each layer corresponds to the respective concept of “rank” in each factorization scheme. In general, when a tensor factorization is applied to eq. 1, inference is equivalent to first computing the likelihoods of all mixing components, {P(x_i | d; θ_d)}_{i,d}, in what we call the representation layer, followed by a ConvAC. A complete network is illustrated in fig. 1.
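The shallow (CP) case can be sketched in a few lines. Under a CP prior, A = Σ_z w_z · a^{z,1} ⊗ ⋯ ⊗ a^{z,N}, eq. 1 reduces to per-position inner products (the 1×1 "convolution"), a product pooling over positions, and a weighted sum. Names and shapes below are illustrative, not the paper's implementation:

```python
import numpy as np

def cp_tmm_density(w, a, lik):
    """Inference under a CP-factorized prior tensor (shallow ConvAC).

    w: (Z,) mixture weights on the simplex.
    a: (Z, N, M), each row a^{z,i} a distribution over [M].
    lik: (N, M) representation layer, lik[i, d] = P(x_i | d; theta_d)."""
    conv = np.einsum('znm,nm->zn', a, lik)  # 1x1 conv: inner products per position
    pooled = conv.prod(axis=1)              # product pooling over the N positions
    return float(w @ pooled)                # weighted sum -> scalar likelihood
```

Cost is O(Z·N·M) instead of M^N; the HT case applies the same conv-then-pool pattern recursively, yielding a deep network.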

Figure 2: Graphical model description of HT-TMM

When restricting the prior tensor of eq. 1 to a factorization, we must ensure it represents actual probabilities, i.e. it is non-negative and its entries sum to one. This can be addressed through a restriction to non-negative factorizations, which translates to limiting the parameters of each convolutional kernel to the simplex. There is a vast literature on the relations between non-negative factorizations and generative models (Hofmann, 1999; Mourad et al., 2013). As opposed to most of these works, we apply factorizations merely to derive our model and analyze its expressivity – not for learning its parameters (see sec. 2.3).

From a generative perspective, the restriction of convolutional kernels to the simplex results in a latent tree graphical model, as illustrated in fig. 2. Each hidden layer in the ConvAC network – a pair of convolution and pooling operations – corresponds to a transition between two levels in the tree. More specifically, each level is comprised of multiple latent variables, one for each spatial position in the input to a hidden layer in the network. Each latent variable in the input to the l-th layer takes values in [r_{l−1}] – the number of channels in the layer that precedes it. Pooling operations in the network correspond to the parent-child relationships in the tree: a set of latent variables are siblings with a shared parent in the tree if they are positioned in the same pooling window in the network. The weights of convolution operations correspond to the transition matrix between a parent and each of its children, i.e. if H_p is the parent latent variable, taking values in [r_l], and H_c is one of its child variables, taking values in [r_{l−1}], then P(H_c = j | H_p = i) = w^(i)_j, where w^(i) is the convolutional kernel for the i-th output channel. With the above graphical representation in place, we can easily draw samples from our model.
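Sampling from the latent tree is plain ancestral sampling: choose the root state, walk down level by level choosing each child's state from its parent's transition row, and finally sample each patch from the mixture component selected at its leaf. The sketch below uses a toy regular tree (fixed fanout, shared kernel per level) purely for illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_latent_tree(root_prior, kernels, depth, fanout):
    """Ancestral sampling of leaf component indices.

    root_prior: (r,) distribution over the root state.
    kernels[l]: (r_parent, r_child) transition matrix, rows summing to 1.
    Returns one component index per leaf (i.e. per patch)."""
    states = [rng.choice(len(root_prior), p=root_prior)]
    for l in range(depth):
        T = kernels[l]
        # Every node spawns `fanout` children; child state ~ T[parent_state].
        states = [rng.choice(T.shape[1], p=T[s])
                  for s in states for _ in range(fanout)]
    return states
```

Given the leaf indices, each patch x_i is then drawn from its selected mixing component P(x | d_i; θ_{d_i}).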

To conclude this subsection, by leveraging an algebraic perspective of the network polynomial (eq. 1), we show that tractability is related to the tensor properties of the priors, and in particular, that low rank factorizations are equivalent to inference via ConvACs. The application of arithmetic circuits to achieve tractability is by itself not a novelty. However, the particular convolutional arithmetic circuits we propose lead to a comprehensive understanding of representational abilities, and as a result, to a straightforward architectural design of TMMs.

2.2 Controlling the Expressivity and Inductive Bias of TMMs

As discussed in sec. 1, it is not enough for a generative model to be tractable – it must also be sufficiently expressive, and moreover, we must also be able to understand how its structure affects its expressivity. In this section we explain how our algebraic perspective enables us to achieve this.

To begin with, since we derived our model by factorizing the prior tensor, it immediately follows that given a sufficient number of channels in the ConvAC, i.e. given sufficient ranks in the tensor factorization, any distribution can be approximated arbitrarily well (assuming M is allowed to grow). In short, this amounts to saying that TMMs are universal. Though many other generative models are known to be universal, it is typically unclear how one may assess what a given structure of finite size can and cannot express. In contrast, the expressivity of ConvACs has been thoroughly studied in a series of works (Cohen et al., 2016a; Cohen and Shashua, 2017; Cohen et al., 2017; Levine et al., 2017), each of which examined a different attribute of their structure. In Cohen et al. (2016a) it was proven that ConvACs exhibit the Depth Efficiency property, i.e. deep networks are exponentially more expressive than shallow ones. In Cohen and Shashua (2017) it was shown that deep networks can efficiently model some input correlations but not all, and that by designing appropriate pooling schemes, different preferences may be encoded, i.e. the inductive bias may be controlled. In Cohen et al. (2017) this result was extended to more complex connectivity patterns, involving mixtures of pooling schemes. Finally, in Levine et al. (2017), an exact relation between the number of channels and the correlations supported by a network was found, enabling tight control over expressivity and inductive bias. All of these results are brought forth by the relations of ConvACs to tensor factorizations. They allow TMMs to be analyzed and designed in much more principled ways than alternative high-dimensional generative models. (As a demonstration that ConvAC analyses are not affected by the non-negativity and normalization restrictions of our generative variant, we prove in app. D that the Depth Efficiency property still holds.)

2.3 Classification and Learning

TMMs realized through ConvACs share many traits with ConvNets, making them especially suitable to serve as classifiers. We begin by introducing a class variable Y, and model the conditional likelihood P(X|Y = y) for each y ∈ [K]. Though it is possible to have a separate generative model for each class, it is much more efficient to leverage the relation to ConvNets and use a shared ConvAC instead, which is equivalent to a joint factorization of the prior tensors of all classes. This results in a single network, where instead of a single scalar output representing P(X), multiple outputs are driven by the network, representing P(X|Y = y) for each class y. Predicting the class of a given instance is carried out through Maximum A-Posteriori, i.e. by returning the most likely class. In the common setting of uniform class priors, i.e. P(Y = y) ≡ 1/K, this corresponds to classification by maximal network output, as customary with ConvNets. We note that in practice, a naïve implementation of ConvACs is not numerically stable (high-degree polynomials, as computed by ACs, are susceptible to numerical underflow and overflow), and this is treated by performing all computations in log-space, which transforms ConvACs into SimNets – a recently introduced deep learning architecture (Cohen and Shashua, 2014; Cohen et al., 2016b).

Suppose now that we are given a training set S = {(x_i, y_i)} of instances and labels, and would like to fit the parameters Θ of our model according to the Maximum Likelihood principle, or equivalently, by minimizing the Negative Log-Likelihood (NLL) loss function L(Θ) = Σ_{(x,y)∈S} −log P(x, y; Θ). The latter can be factorized into two separate loss terms:

L(Θ) = Σ_{(x,y)∈S} −log P(y|x; Θ) + Σ_{(x,y)∈S} −log P(x; Θ)

where the first term, which we refer to as the discriminative loss, is commonly known as the cross-entropy loss, and the second term, which corresponds to maximizing the prior likelihood P(x), has no analogy in standard discriminative classification. It is this term that captures the generative nature of the model, and we accordingly refer to it as the generative loss. Now, let N_y(x; Θ) stand for the y-th output of the SimNet (ConvAC in log-space) realizing our model with parameters Θ, i.e. N_y(x; Θ) = log P(x|y; Θ). In the case of uniform class priors (P(Y = y) ≡ 1/K), the empirical estimation of L(Θ) may be written as:

L(Θ) = Σ_{(x,y)∈S} −log (e^{N_y(x;Θ)} / Σ_{y′} e^{N_{y′}(x;Θ)}) + Σ_{(x,y)∈S} −log ((1/K) Σ_{y′} e^{N_{y′}(x;Θ)})    (2)

This objective includes the standard softmax loss as its first term, and an additional generative loss as its second. Rather than employing dedicated Maximum Likelihood methods for training (e.g. Expectation Maximization), we leverage once more the resemblance between our networks and ConvNets, and optimize the above objective using Stochastic Gradient Descent (SGD).
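The split in eq. 2 is straightforward to compute stably in log-space. The sketch below (hypothetical tensor names; `outputs[b, y]` stands in for the log-space network outputs N_y(x_b; Θ)) evaluates both terms with log-sum-exp, assuming uniform class priors:

```python
import numpy as np

def logsumexp(a, axis=-1):
    """Numerically stable log(sum(exp(a))) along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def nll(outputs, labels, num_classes):
    """Empirical NLL of eq. 2: softmax cross-entropy + generative loss.

    outputs: (B, K) log-space outputs N_y(x_b); labels: (B,) class indices."""
    lse = logsumexp(outputs, axis=-1)                         # log sum_y e^{N_y}
    disc = -(outputs[np.arange(len(labels)), labels] - lse)   # -log P(y|x)
    gen = -(lse - np.log(num_classes))                        # -log P(x), uniform priors
    return float((disc + gen).mean())
```

Both terms are differentiable in the network outputs, so the objective plugs directly into SGD as described above.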

3 Classification under Missing Data through Marginalization

A major advantage of generative models over discriminative ones lies in their ability to cope with missing data, specifically in the context of classification. By and large, discriminative methods either attempt to complete missing parts of the data before classification (a process known as data imputation), or learn directly to classify data with missing values (Little and Rubin, 2002). The first of these approaches relies on the quality of data completion, a much more difficult task than the original one of classification under missing data. Even if the completion was optimal, the resulting classifier is known to be sub-optimal (see app. E). The second approach does not rely on data completion, but nonetheless assumes that the distribution of missing values at train and test times are similar, a condition which often does not hold in practice. Indeed, Globerson and Roweis (2006) coined the term “nightmare at test time” to refer to the common situation where a classifier must cope with missing data whose distribution is different from that encountered in training.

As opposed to discriminative methods, generative models are endowed with a natural mechanism for classification under missing data. Namely, a generative model can simply marginalize over missing values, effectively classifying under all possible completions, weighing each completion according to its probability. This, however, requires tractable inference and marginalization. We have already shown in sec. 2 that TMMs support the former, and will show in app. F that marginalization can be just as efficient. Beforehand, we lay out the formulation of classification under missing data.

Let X be a random vector in R^s representing an object, and let Y be a random variable in [K] representing its label. Denote by D the joint distribution of (X, Y), and by x, y specific realizations thereof. Assume that after sampling a specific instance (x, y), a random binary vector M is drawn conditioned on X = x. More concretely, we sample a binary mask m (a realization of M) according to a distribution P(M = m | X = x). The coordinate x_i is considered missing if m_i is equal to zero, and observed otherwise. Formally, we consider the vector x ⊙ m, whose i-th coordinate is defined to hold x_i if m_i = 1, and the wildcard ∗ if m_i = 0. The classification task is then to predict y given access solely to x ⊙ m.

Following the works of Rubin (1976) and Little and Rubin (2002), we consider three cases for the missingness distribution P(M = m | X = x): missing completely at random (MCAR), where M is independent of X, i.e. P(M = m | X = x) is a function of m but not of x; missing at random (MAR), where M is independent of the missing values in X, i.e. P(M = m | X = x) is a function of both m and x, but is not affected by changes in x_i if m_i = 0; and missing not at random (MNAR), covering the rest of the distributions, for which M depends on missing values in X, i.e. P(M = m | X = x) is a function of both m and x which is at least sometimes sensitive to changes in x_i when m_i = 0.
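The three regimes can be illustrated with toy mask samplers (shapes and thresholds hypothetical, for intuition only):

```python
import numpy as np

rng = np.random.default_rng(0)

def mcar_mask(shape, p):
    """MCAR: each entry dropped i.i.d. with probability p, independent of x."""
    return (rng.random(shape) >= p).astype(int)

def mar_mask(x, threshold):
    """MAR-style example: missingness depends only on values that stay observed.
    Here the first coordinate is always kept, and its value decides whether
    the second half of x is hidden."""
    m = np.ones_like(x, dtype=int)
    if x[0] > threshold:
        m[len(x) // 2:] = 0
    return m

def mnar_mask(x, threshold):
    """MNAR: an entry's own value decides whether it goes missing."""
    return (x <= threshold).astype(int)
```

In the MNAR sampler the mask is sensitive to the very values it hides, which is exactly what breaks the MAR guarantee of corollary 1.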

Let P(X, Y, M) be the joint distribution of the object X, label Y, and missingness mask M: P(X = x, Y = y, M = m) = D(X = x, Y = y) · P(M = m | X = x).

For given x and m, denote by o(x, m) the event in which the random vector X coincides with x on the coordinates i for which m_i = 1. For example, if m is an all-zero vector, o(x, m) covers the entire probability space, and if m is an all-one vector, o(x, m) corresponds to the event X = x. With these notations in hand, we are now ready to characterize the optimal predictor in the presence of missing data. The proofs are common knowledge, but provided in app. E for completeness.

Claim 1.

For any data distribution D and missingness distribution P(M|X), the optimal classification rule in terms of 0-1 loss is given by predicting the class y that maximizes P(Y = y | o(x, m), M = m), for an instance x ⊙ m.

When the missingness distribution is MAR (or MCAR), the optimal classifier admits a simpler form, referred to as the marginalized Bayes predictor:

Corollary 1.

Under the conditions of claim 1, if the missingness distribution is MAR (or MCAR), the optimal classification rule may be written as:

h*(x ⊙ m) = argmax_y P(Y = y | o(x, m))    (3)

Corollary 1 indicates that in the MAR setting, which is frequently encountered in practice, optimal classification does not require prior knowledge regarding the missingness distribution. As long as one is able to realize the marginalized Bayes predictor (eq. 3), or equivalently, to compute the likelihoods of observed values conditioned on labels, P(o(x, m) | Y = y), classification under missing data is guaranteed to be optimal, regardless of the corruption process taking place. This is in stark contrast to discriminative methods, which require access to the missingness distribution during training, and thus are not able to cope with unknown conditions at test time.

Most of this section dealt with the task of prediction given an input with missing data, where we assumed we had access to a “clean” training set, and only faced missingness during prediction. However, many times we wish to tackle the reverse task, where the training set itself is riddled with missing data. Tractability leads to an advantage here as well: under the MAR assumption, learning from missing data with the marginalized likelihood objective results in an unbiased classifier (Little and Rubin, 2002).

In the case of TMMs, marginalizing over missing values is just as efficient as plain inference – it requires only a single pass through the corresponding network. The exact mechanism is carried out in a similar fashion as in sum-product networks, and is covered in app. F. Accordingly, the marginalized Bayes predictor (eq. 3) is realized efficiently, and classification under missing data (in the MAR setting) is optimal (under the generative assumption), regardless of the missingness distribution.
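The single-pass mechanism has a simple form: summing a missing patch's values integrates each of its mixture components to 1, so its row in the representation layer is simply replaced by ones before running the network. A sketch for the shallow (CP) case, with illustrative names and shapes:

```python
import numpy as np

def marginalized_density(w, a, lik, observed):
    """Marginalize a CP-factorized TMM over missing patches in one pass.

    w: (Z,) weights on the simplex; a: (Z, N, M), rows on the simplex.
    lik: (N, M) representation layer; observed: (N,) boolean patch mask."""
    # A missing patch contributes sum_d P(x_i|d) integrated out = a row of 1s.
    lik = np.where(observed[:, None], lik, 1.0)
    conv = np.einsum('znm,nm->zn', a, lik)  # 1x1 conv over the representation
    return float(w @ conv.prod(axis=1))     # product pooling + weighted sum
```

Since each kernel row a^{z,i} sums to one, a missing position contributes a factor of exactly 1, yielding the marginal P(o(x, m)) at no extra cost over plain inference.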

4 Related Works

There are many generative models realized through neural networks, and convolutional networks in particular, e.g. Generative Adversarial Networks (Goodfellow et al., 2014), Variational Auto-Encoders (Kingma and Welling, 2014), and NADE (Uria et al., 2016). However, most do not possess tractable inference, and of the few that do, none possesses tractable marginalization over arbitrary sets of variables. Due to limits of space, we defer the discussion of the above to app. G, and in the remainder of this section focus instead on the most relevant works.

As mentioned in sec. 2, we build on the approach of specifying generative models through Arithmetic Circuits (ACs) (Darwiche, 2003), and specifically, our model is a strict subclass of the well-known Sum-Product Networks (SPNs) (Poon and Domingos, 2011), under the decomposability and completeness restrictions. Where our work differs is in our algebraic approach to eq. 1, which gives rise to a specific structure of ACs, called ConvACs, and a deep theory regarding their expressivity and inductive bias (see sec. 2.2). In contrast to the structure we propose, the current literature on general SPNs does not prescribe any specific structures, and its theory is limited to either very specific instances (Delalleau and Bengio, 2011), or very broad classes, e.g. fixed-depth circuits (Martens and Medabalimi, 2014). In the early works on SPNs, specialized networks of complex structure were designed for each task based mainly on heuristics, often bearing little resemblance to common neural networks. Contemporary works have since moved on to focus mainly on learning the structure of SPNs directly from data (Peharz et al., 2013; Gens and Domingos, 2013; Adel et al., 2015; Rooshenas and Lowd, 2014), leading to improved results in many domains. Despite that, only a few published studies have applied this method to natural domains (images, audio, etc.), on which only limited performance, compared to other common methods, was reported, specifically on the MNIST dataset (Adel et al., 2015). The above suggests that choosing the right architecture of general SPNs, at least in some domains, remains an unsolved problem. In addition, both the previously studied manually designed SPNs, as well as ones with a learned structure, lead to models which, according to recent works on GPU-optimized algorithms (Ben-Nun et al., 2015), cannot be efficiently implemented due to their irregular memory access patterns. This is in stark contrast to our model, which leverages the same patterns as modern ConvNets, and thus enjoys similar run-time performance. An additional difference in our work is that we manage to successfully train our model using standard SGD. Even though this approach was already considered by Poon and Domingos (2011), they deemed it lacking and advocated for specialized optimization algorithms instead.

Outside the realm of generative networks, tractable graphical models, e.g. Latent Tree Models (LTMs) (Mourad et al., 2013), are the most common method for tractable inference. Similar to SPNs, it is not straightforward to find the proper structure of graphical models for a particular problem, and most of the same arguments apply here as well. Nevertheless, it is noteworthy that recent progress in structure and parameter learning of LTMs (Huang et al., 2015; Anandkumar et al., 2014) was also brought forth by connections to tensor factorizations, similar to our approach. Unlike the aforementioned algorithms, we utilize tensor factorizations solely for deriving our model and analyzing its expressivity, while leaving learning to SGD – the most successful method for training neural networks. Leveraging their perspective to analyze the optimization properties of our model is viewed as a promising avenue for future research.

5 Experiments

We demonstrate the properties of TMMs through both qualitative and quantitative experiments. In sec. 5.1 we present state of the art results on image classification under missing data, with robustness to various missingness distributions. In sec. 5.2 we show that our results are not limited to images, by applying TMMs to speech recognition. Finally, in app. H we show visualizations of samples drawn from TMMs, shedding light on their generative nature. Our implementation, based on Caffe (Jia et al., 2014) and MAPS (Ben-Nun et al., 2015) (a toolbox for efficient GPU code generation), as well as all other code for reproducing our experiments, is publicly available. Extended details regarding the experiments are provided in app. I.

5.1 Image Classification under Missing Data

n        0     25    50    75    100   125   150
LP       97.9  97.5  96.4  94.1  89.2  80.9  70.2
HT-TMM   98.5  98.2  97.8  96.5  93.9  87.1  76.3
Table 1: Prediction accuracy for the two-class tasks of MNIST digit pairs, under feature deletion noise with n deleted features.
train \ test   0.25  0.50  0.75  0.90  0.95  0.99
0.25           98.9  97.8  78.9  32.4  17.6  11.0
0.50           99.1  98.6  94.6  68.1  37.9  12.9
0.75           98.9  98.7  97.2  83.9  56.4  16.7
0.90           97.6  97.5  96.7  89.0  71.0  21.3
0.95           95.7  95.6  94.8  88.3  74.0  30.5
0.99           87.3  86.7  85.0  78.2  66.2  31.3
i.i.d. (rand)  98.7  98.4  97.0  87.6  70.6  29.6
rects (rand)   98.2  95.7  83.2  54.7  35.8  17.5
(a) MNIST with i.i.d. corruption
(b) MNIST with missing rectangles.
Figure 3: We examine ConvNets trained on one missingness distribution while tested on others. “(rand)” denotes training on distributions with randomized parameters. (a) i.i.d. corruption: trained with one corruption probability (rows) and tested on another (columns). (b) missing rectangles: training on randomized distributions (rand) compared to training on the same (fixed) missing rectangles distribution.
(a) MNIST with i.i.d. corruption.
(b) MNIST with missing rectangles.
(c) NORB with i.i.d. corruption.
(d) NORB with missing rectangles.
Figure 4: Blind classification under missing data. Panels (a) and (c) test i.i.d. corruption with a given probability for each pixel; panels (b) and (d) test missing rectangles corruption with a given number of missing rectangles of fixed width and height. (*) Based on the published results (Goodfellow et al., 2013). (†) Data imputation algorithms.

In this section we experiment on two datasets: MNIST (LeCun et al., 1998) for digit classification, and small NORB (LeCun et al., 2004) for 3D object recognition. In our results, we refer to models using shallow networks as CP-TMM, and to those using deep networks as HT-TMM, in accordance with the respective tensor factorizations (see sec. 2). The theory discussed in sec. 2.2 guided our exact choice of architectures. Namely, we used the fact (Levine et al., 2017) that the capacity to model either short- or long-range correlations in the input, is related to the number of channels in the beginning or end of a network, respectively. In MNIST, discriminating between digits has more to do with long-range correlations than the basic strokes digits are made of, hence we chose to start with few channels and end with many – layer widths were set to 64-128-256-512. In contrast, the classes of NORB differ in much finer details, requiring more channels in the first layers, hence layer widths were set to 256-256-256-512. In both cases, Gaussian mixing components were used.

We begin by comparing our generative approach to missing data against classical methods, namely, methods based on Globerson and Roweis (2006). They regard missing data as “feature deletion” noise, replace missing entries by zeros, and devise a learning algorithm over linear predictors that takes the number of missing features, n, into account. The method was later improved by Dekel and Shamir (2008). We compare TMMs to the latter, with n non-zero pixels randomly chosen and changed to zero, in the two-class prediction task derived from each pair of MNIST digits. Due to limits of their implementation, only 300 images per digit are used for training. Despite this, and the fact that the evaluated scenario is of the MNAR type (on which optimality is not guaranteed – see sec. 3), we achieve significantly better results (see table 1), and unlike their method, which requires several classifiers and knowledge of n, we use a single TMM with no prior knowledge.

Heading on to multi-class prediction under missing data, we focus on the challenging “blind” setting, where the missingness distribution at test time is completely unknown during training. We simulate two kinds of MAR missingness distributions: (i) an i.i.d. mask with a fixed probability of dropping each pixel, and (ii) a mask composed of the union of (possibly overlapping) rectangles of a fixed width and height, each positioned uniformly at random in the image. We first demonstrate that purely discriminative classifiers cannot generalize to all missingness distributions, by training the standard LeNet ConvNet (LeCun et al., 1998) on one set of distributions and then testing it on others (see fig. 3). Next, we present our main results. We compare our model against three different approaches. First, as a baseline, we use K-Nearest Neighbors (KNN) to vote on the most likely class, augmented with a metric that disregards missing coordinates. KNN actually scores better than most methods, but its missingness-aware distance metric precludes the common memory and runtime optimizations, making it impractical for real-world settings. Second, we test various data-imputation methods, ranging from simply filling missing pixels with zeros or their mean to modern generative models suited to inpainting; data imputation is followed by a ConvNet prediction on the completed image. In general, we find that this approach works well only when few pixels are missing. Finally, we test generative classifiers other than our model, namely MP-DBM and Sum-Product Networks (SPNs). MP-DBM is notable for being limited to approximate inference, and its results show the importance of using exact inference instead. For the SPN, we augmented the model from Poon and Domingos (2011) with a class variable, and trained it to maximize the joint probability using the code of Zhao et al. (2016). The inferior performance of the SPN suggests that the structure of TMMs, which are in fact a special case of SPNs, is advantageous. Due to limitations of available public code and time, not all methods were tested on all datasets and distributions. See fig. 4 for the complete results.
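The two MAR missingness distributions can be sketched as mask generators, where a mask entry is True if the pixel is observed. These helpers are illustrative (names ours), not the paper's code:

```python
import numpy as np

def iid_mask(shape, p, rng=None):
    """Missingness (i): each pixel is dropped independently with probability p."""
    rng = np.random.default_rng(rng)
    return rng.random(shape) >= p  # True = observed

def rects_mask(shape, n_rects, size, rng=None):
    """Missingness (ii): drop the union of n_rects (possibly overlapping)
    size-by-size rectangles, each positioned uniformly at random."""
    rng = np.random.default_rng(rng)
    mask = np.ones(shape, dtype=bool)
    h, w = shape
    for _ in range(n_rects):
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        mask[top:top + size, left:left + size] = False
    return mask

m1 = iid_mask((28, 28), p=0.5, rng=0)
m2 = rects_mask((28, 28), n_rects=3, size=7, rng=0)
print(m1.mean(), (~m2).sum())
```

Because the rectangles may overlap, the number of missing pixels in (ii) varies between one and n_rects rectangle areas.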

To conclude, TMMs significantly outperform all other methods tested on image classification with missing data. Although TMMs are a special case of SPNs, their particular structure appears to be more effective than those previously proposed in the literature. We attribute this superiority to the fact that their architectural design is backed by comprehensive theoretical studies (see sec. 2.2).

5.2 Speech Recognition under Missing Data

To demonstrate the versatility of TMMs, we also conducted limited experiments on the TIMIT speech recognition dataset, following the same protocols as in sec. 5.1. We trained a TMM and a standard ConvNet on 256ms windows of raw data at a 16kHz sample rate to predict the phoneme at the center of a window. Both the TMM and the ConvNet reached the same accuracy on the clean dataset, but when half of the audio samples are missing i.i.d., the accuracy of the ConvNet with mean imputation drops considerably, while the TMM remains close to its clean-data accuracy. Utilizing common audio inpainting methods (Adler et al., 2012) only partially improves the accuracy of the ConvNet, still well below that of the TMM.
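For concreteness, a 256ms window at a 16kHz sample rate contains 4096 samples. Below is a minimal sketch (helper names ours) of the window extraction and of the mean-imputation baseline used for the ConvNet:

```python
import numpy as np

def center_windows(signal, win_len=4096, hop=4096):
    """Split a raw waveform into 256 ms windows (4096 samples at 16 kHz)."""
    starts = range(0, len(signal) - win_len + 1, hop)
    return np.stack([signal[s:s + win_len] for s in starts])

def drop_and_mean_impute(window, p=0.5, rng=None):
    """Drop samples i.i.d. with probability p, then fill them with the
    mean of the observed samples (the ConvNet baseline above)."""
    rng = np.random.default_rng(rng)
    observed = rng.random(window.shape) >= p
    imputed = np.where(observed, window, window[observed].mean())
    return imputed, observed

sig = np.sin(np.linspace(0, 100, 16000))  # one second of dummy audio
wins = center_windows(sig)
imp, obs = drop_and_mean_impute(wins[0], rng=0)
print(wins.shape, imp.shape)
```

The TMM, in contrast, needs no imputation step: the missing samples are marginalized out exactly.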

6 Summary

This paper focuses on generative models which admit tractable inference and marginalization, capabilities that lie outside the realm of contemporary neural network-based generative methods. We build on prior works on tractable models based on arithmetic circuits and sum-product networks, and leverage concepts from tensor analysis to derive a sub-class of models we call Tensorial Mixture Models (TMMs). In contrast to existing methods, our algebraic approach leads to a comprehensive understanding of the relation between model structure and representational properties. In practice, utilizing this understanding for the design of TMMs has led to state of the art performance in classification under missing data. We are currently investigating several avenues for future research, including semi-supervised learning and more intricate ConvAC architectures, such as the ones suggested by Cohen et al. (2017).


This work is supported by Intel grant ICRI-CI #9-2012-6133, by ISF Center grant 1790/12 and by the European Research Council (TheoryDL project). Nadav Cohen is supported by a Google Fellowship in Machine Learning.



  • Adel et al. [2015] Tameem Adel, David Balduzzi, and Ali Ghodsi. Learning the Structure of Sum-Product Networks via an SVD-based Algorithm. UAI, 2015.
  • Adler et al. [2012] A Adler, V Emiya, M G Jafari, and M Elad. Audio inpainting. IEEE Trans. on Audio, Speech and Language Processing, 20:922–932, March 2012.
  • Anandkumar et al. [2014] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.
  • Ben-Nun et al. [2015] Tal Ben-Nun, Ely Levy, Amnon Barak, and Eri Rubin. Memory Access Patterns: The Missing Piece of the Multi-GPU Puzzle. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 19:1–19:12. ACM, 2015.
  • Bengio et al. [2014] Yoshua Bengio, Éric Thibodeau-Laufer, Guillaume Alain, and Jason Yosinski. Deep Generative Stochastic Networks Trainable by Backprop. In International Conference on Machine Learning, 2014.
  • Caron and Traynor [2005] Richard Caron and Tim Traynor. The Zero Set of a Polynomial. WSMR Report 05-02, 2005.
  • Cohen and Shashua [2014] Nadav Cohen and Amnon Shashua. SimNets: A Generalization of Convolutional Networks. In Advances in Neural Information Processing Systems NIPS, Deep Learning Workshop, 2014.
  • Cohen and Shashua [2017] Nadav Cohen and Amnon Shashua. Inductive Bias of Deep Convolutional Networks through Pooling Geometry. In International Conference on Learning Representations ICLR, April 2017.
  • Cohen et al. [2016a] Nadav Cohen, Or Sharir, and Amnon Shashua. On the Expressive Power of Deep Learning: A Tensor Analysis. In Conference on Learning Theory COLT, May 2016a.
  • Cohen et al. [2016b] Nadav Cohen, Or Sharir, and Amnon Shashua. Deep SimNets. In Computer Vision and Pattern Recognition CVPR, May 2016b.
  • Cohen et al. [2017] Nadav Cohen, Ronen Tamari, and Amnon Shashua. Boosting Dilated Convolutional Networks with Mixed Tensor Decompositions, 2017.
  • Darwiche [2003] Adnan Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM (JACM), 50(3):280–305, May 2003.
  • Dekel and Shamir [2008] Ofer Dekel and Ohad Shamir. Learning to classify with missing and corrupted features. In International Conference on Machine Learning. ACM, 2008.
  • Delalleau and Bengio [2011] Olivier Delalleau and Yoshua Bengio. Shallow vs. Deep Sum-Product Networks. Advances in Neural Information Processing Systems, pages 666–674, 2011.
  • Dinh et al. [2014] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear Independent Components Estimation, October 2014.
  • Gens and Domingos [2013] R Gens and P M Domingos. Learning the Structure of Sum-Product Networks. International Conference on Machine Learning, 2013.
  • Globerson and Roweis [2006] Amir Globerson and Sam Roweis. Nightmare at test time: robust learning by feature deletion. In International Conference on Machine Learning. ACM, 2006.
  • Goodfellow et al. [2013] Ian Goodfellow, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Multi-Prediction Deep Boltzmann Machines. Advances in Neural Information Processing Systems, 2013.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. Advances in Neural Information Processing Systems, 2014.
  • Hackbusch and Kühn [2009] W Hackbusch and S Kühn. A New Scheme for the Tensor Representation. Journal of Fourier Analysis and Applications, 15(5):706–722, 2009.
  • Hofmann [1999] Thomas Hofmann. Probabilistic latent semantic analysis. Morgan Kaufmann Publishers Inc., July 1999.
  • Huang et al. [2015] Furong Huang, Niranjan U N, Ioakeim Perros, Robert Chen, Jimeng Sun, and Anima Anandkumar. Scalable Latent Tree Model and its Application to Health Analytics. In NIPS Machine Learning for Healthcare Workshop, 2015.
  • Jia et al. [2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross B Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. CoRR, 2014.
  • Kingma and Welling [2014] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations, 2014.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • LeCun et al. [2004] Yann LeCun, Fu Jie Huang, and Léon Bottou. Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. Computer Vision and Pattern Recognition, 2004.
  • Levine et al. [2017] Yoav Levine, David Yakira, Nadav Cohen, and Amnon Shashua. Deep Learning and Quantum Entanglement: Fundamental Connections with Implications to Network Design, April 2017.
  • Little and Rubin [2002] Roderick J A Little and Donald B Rubin. Statistical analysis with missing data (2nd edition). John Wiley & Sons, Inc., September 2002.
  • Martens and Medabalimi [2014] James Martens and Venkatesh Medabalimi. On the Expressive Efficiency of Sum Product Networks. CoRR, 2014.
  • Mourad et al. [2013] Raphaël Mourad, Christine Sinoquet, Nevin Lianwen Zhang, Tengfei Liu, and Philippe Leray. A Survey on Latent Tree Models and Applications. Journal of Artificial Intelligence Research, pages 157–203, 2013.
  • Ng and Jordan [2002] Andrew Y Ng and Michael I Jordan. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, 2002.
  • Pedregosa et al. [2011] F Pedregosa, G Varoquaux, A Gramfort, V Michel, B Thirion, O Grisel, M Blondel, P Prettenhofer, R Weiss, V Dubourg, J Vanderplas, A Passos, D Cournapeau, M Brucher, M Perrot, and E Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Peharz et al. [2013] Robert Peharz, Bernhard C Geiger, and Franz Pernkopf. Greedy Part-Wise Learning of Sum-Product Networks. In Machine Learning and Knowledge Discovery in Databases, pages 612–627. Springer Berlin Heidelberg, Berlin, Heidelberg, September 2013.
  • Poon and Domingos [2011] Hoifung Poon and Pedro Domingos. Sum-Product Networks: A New Deep Architecture. In Uncertainty in Artificial Intelligence, 2011.
  • Rooshenas and Lowd [2014] Amirmohammad Rooshenas and Daniel Lowd. Learning Sum-Product Networks with Direct and Indirect Variable Interactions. ICML, 2014.
  • Rubin [1976] Donald B Rubin. Inference and missing data. Biometrika, 63(3):581–592, December 1976.
  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. International Conference on Machine Learning, 2015.
  • Uria et al. [2016] Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural Autoregressive Distribution Estimation. Journal of Machine Learning Research, 17(205):1–37, 2016.
  • van den Oord et al. [2016] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel Recurrent Neural Networks. In International Conference on Machine Learning, 2016.
  • Zeiler and Fergus [2014] Matthew D Zeiler and Rob Fergus. Visualizing and Understanding Convolutional Networks. In European Conference on Computer Vision. Springer International Publishing, 2014.
  • Zhao et al. [2016] Han Zhao, Pascal Poupart, and Geoff Gordon. A Unified Approach for Learning the Parameters of Sum-Product Networks. In Advances in Neural Information Processing Systems, 2016.
  • Zoran and Weiss [2011] Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. ICCV, pages 479–486, 2011.

Appendix A The Universality of Tensorial Mixture Models

In this section we prove the universality property of Generative ConvACs, as discussed in sec. 2. We begin by borrowing a notion from functional analysis and define a new property called PDF total set, which is similar in concept to a total set. We then prove that this property is invariant under the Cartesian product of functions, which entails the universality of these models as a corollary.

Definition 1.

Let be a set of PDFs over . is PDF total iff for any PDF over and for all there exists , and s.t. . In other words, a set is a PDF total set if its convex span is a dense set under norm.

Claim 2.

Let be a set of PDFs over and let be a set of PDFs over the product space . If is a PDF total set then is PDF total set.


If is the set of Gaussian PDFs over with diagonal covariance matrices, which is known to be a PDF total set, then is the set of Gaussian PDFs over with diagonal covariance matrices and the claim is trivially true.

Otherwise, let be a PDF over and let . From the above, there exists , and a set of diagonal Gaussians s.t.


Additionally, since is a PDF total set then there exists , and s.t. for all it holds that , from which it is trivially proven using a telescopic sum and the triangle inequality that:


From eq. 4, eq. 5 and the triangle inequality, it holds that:

where which holds . Taking , and completes the proof. ∎

Corollary 2.

Let be a PDF total set of PDFs over , then the family of Generative ConvACs with mixture components from can approximate any over arbitrarily well, given arbitrarily many components.

Appendix B TMMs with Sparsity Constraints Can Represent Gaussian Mixture Models

As discussed in sec. 2, TMMs become tractable when a sparsity constraint is imposed on the priors tensor, i.e., when most of the entries of the tensor are replaced with zeros. In this section, we demonstrate that in such a case, TMMs can represent Gaussian Mixture Models with diagonal covariance matrices, probably the most common type of mixture model.

With the same notations as sec. 2, assume the number of mixing components of the TMM is for some , let be these components, and finally, assume the prior tensor has the following structure:

then eq. 1 reduces to:

which is equivalent to a diagonal GMM with mixing weights (where is the -dimensional simplex) and Gaussian mixture components with means and covariances .
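The reduction above says the sparse TMM computes exactly a diagonal GMM density: a mixture over K components, each a product of univariate Gaussians, one per coordinate. A minimal numerical sketch of that density (function name ours):

```python
import numpy as np

def diag_gmm_logpdf(x, weights, means, stds):
    """Log-density of a diagonal GMM, written as the sparse TMM above:
    a mixture over K components, each a product of per-coordinate Gaussians.

    x: (d,); weights: (K,) on the simplex; means, stds: (K, d).
    """
    # log N(x_i; mu_ki, sigma_ki) for every component k and coordinate i
    log_comp = (-0.5 * ((x - means) / stds) ** 2
                - np.log(stds) - 0.5 * np.log(2 * np.pi))   # shape (K, d)
    per_comp = np.log(weights) + log_comp.sum(axis=1)       # product over coordinates
    m = per_comp.max()                                      # stable logsumexp over components
    return m + np.log(np.exp(per_comp - m).sum())

x = np.array([0.5, -1.0])
val = diag_gmm_logpdf(x,
                      weights=np.array([0.3, 0.7]),
                      means=np.array([[0.0, 0.0], [1.0, -1.0]]),
                      stds=np.ones((2, 2)))
print(val)
```

The inner sum over coordinates is the product node of the circuit; the outer logsumexp is the single active mixture node left by the sparsity pattern.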

Appendix C Background on Tensor Factorizations

In this section we establish the minimal background in the field of tensor analysis required for following our work. A tensor is best thought of as a multi-dimensional array , where . The number of indexing entries in the array, which are also called modes, is referred to as the order of the tensor. The number of values an index of a particular mode can take is referred to as the dimension of the mode. The tensor mentioned above is thus of order with dimension in its -th mode. For our purposes we typically assume that , and simply denote it as .

The fundamental operator in tensor analysis is the tensor product. The tensor product operator, denoted by , is a generalization of the outer product of vectors (order-1 tensors) to any pair of tensors. Specifically, let and be tensors of order and respectively; then the tensor product results in a tensor of order , defined by: .
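In NumPy, this kind of tensor product is an outer product over all index pairs, which `np.tensordot` with `axes=0` computes. For example, for an order-2 tensor and an order-1 tensor:

```python
import numpy as np

# (A ⊗ b)[i, j, k] = A[i, j] * b[k]: the tensor product of an order-2 and an
# order-1 tensor is an order-3 tensor. np.tensordot with axes=0 computes it.
A = np.arange(6.0).reshape(2, 3)   # order 2, dims 2 x 3
b = np.array([1.0, 10.0])          # order 1, dim 2
T = np.tensordot(A, b, axes=0)     # order 2 + 1 = 3, dims 2 x 3 x 2

print(T.shape)      # (2, 3, 2)
print(T[1, 2, 1])   # A[1, 2] * b[1] = 5.0 * 10.0 = 50.0
```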

The main concept from tensor analysis we use in our work is that of tensor decompositions. The most straightforward and common tensor decomposition format is the rank-1 decomposition, also known as a CANDECOMP/PARAFAC decomposition, or in short, a CP decomposition. The CP decomposition is a natural extension of low-rank matrix decomposition to general tensors, both built upon the concept of a linear combination of rank-1 elements. Similarly to matrices, tensors of the form , where are non-zero vectors, are regarded as -ordered rank-1 tensors, thus the rank- CP decomposition of a tensor is naturally defined by:


where and are the parameters of the decomposition. As mentioned above, for order-2 tensors it is equivalent to low-rank matrix factorization. It is simple to show that any tensor can be represented by a CP decomposition for some rank, where the minimal such rank is known as its tensor rank.
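A rank-Z CP decomposition can be materialized directly from its definition, summing Z weighted rank-1 terms. A small sketch (function name ours):

```python
import numpy as np

def cp_reconstruct(weights, factors):
    """Rebuild a tensor from a rank-Z CP decomposition:
    T = sum_z weights[z] * a_1[z] ⊗ a_2[z] ⊗ ... ⊗ a_N[z],
    where factors[i] has shape (Z, M_i)."""
    T = np.zeros([f.shape[1] for f in factors])
    for z in range(len(weights)):
        term = weights[z]
        for f in factors:
            term = np.tensordot(term, f[z], axes=0)  # grow the rank-1 term
        T += term
    return T

rng = np.random.default_rng(0)
factors = [rng.standard_normal((2, 3)) for _ in range(3)]  # rank 2, order 3
T = cp_reconstruct(np.array([1.0, -0.5]), factors)
print(T.shape)  # (3, 3, 3)
```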

Another decomposition we will use in this paper is of a hierarchical nature, known as the Hierarchical Tucker decomposition (Hackbusch and Kühn, 2009), which we will refer to as the HT decomposition. While the CP decomposition combines vectors into higher-order tensors in a single step, the HT decomposition does so more gradually, combining vectors into matrices, these matrices into 4th-order tensors, and so on recursively, in a hierarchical fashion. Specifically, the following describes the recursive formula of the HT decomposition for a tensor whose order is a power of two. (More precisely, we use a special case of the canonical HT decomposition as presented in Hackbusch and Kühn (2009); in the terminology of the latter, the matrices are diagonal, using the notations from eq. 7. The requirement that the order be a power of two is solely for simplifying the definition: more generally, instead of a complete binary tree describing the order of operations, the canonical decomposition can use any balanced binary tree.)


where the parameters of the decomposition are the vectors and the top level vector , and the scalars are referred to as the ranks of the decomposition. Similar to the CP decomposition, any tensor can be represented by an HT decomposition. Moreover, any given CP decomposition can be converted to an HT decomposition by only a polynomial increase in the number of parameters.
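The recursive structure can be sketched as follows: start from R candidate vectors per leaf, pair neighbouring nodes with a tensor product per “channel”, mix channels with a per-node weight matrix, and finish with the top-level vector. This is an illustrative reconstruction under simplified assumptions (a single rank R at every level; helper names ours), not the paper's implementation:

```python
import numpy as np

def ht_reconstruct(leaves, weights, top):
    """Rebuild a tensor from an HT-style decomposition (single rank R).

    leaves:  list of N = 2^L arrays of shape (R, M) -- R vectors per leaf.
    weights: list of levels; level l is a list of (R, R) channel-mixing
             matrices, one per internal node at that level.
    top:     array of shape (R,) mixing the channels of the root.
    """
    nodes = list(leaves)
    level = 0
    while len(nodes) > 1:
        merged = []
        for j in range(len(nodes) // 2):
            left, right = nodes[2 * j], nodes[2 * j + 1]
            # channel-wise tensor product of the two children
            prod = np.stack([np.tensordot(left[a], right[a], axes=0)
                             for a in range(left.shape[0])])
            if len(nodes) == 2:          # root: mix channels with the top vector
                merged.append(np.tensordot(top, prod, axes=1))
            else:                        # internal node: mix with its (R, R) matrix
                merged.append(np.tensordot(weights[level][j], prod, axes=1))
        nodes = merged
        level += 1
    return nodes[0]

rng = np.random.default_rng(0)
R, M = 2, 3
T = ht_reconstruct([rng.random((R, M)) for _ in range(4)],
                   [[rng.random((R, R)) for _ in range(2)]],
                   top=rng.random(R))
print(T.shape)  # (3, 3, 3, 3)
```

Each `while` iteration halves the number of nodes and doubles the order of the intermediate tensors, mirroring the recursion in the text.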

Finally, since we are dealing with generative models, the tensors we study are non-negative and sum to one, i.e. the vectorization of (rearranging its entries to the shape of a vector), denoted by , is constrained to lie in the multi-dimensional simplex, denoted by:

$\Delta^{d-1} = \left\{ v \in \mathbb{R}^{d} \;\middle|\; \forall i: v_i \geq 0,\ \textstyle\sum_{i=1}^{d} v_i = 1 \right\}$


Appendix D Proof for the Depth Efficiency of Convolutional Arithmetic Circuits with Simplex Constraints

In this section we prove that the depth efficiency property of ConvACs, established in Cohen et al. (2016a), applies also to the generative variant of ConvACs introduced in sec. 2. Our analysis relies on basic knowledge of tensor analysis and its relation to ConvACs, specifically, on the fact that the concept of “ranks” in each factorization scheme corresponds to the number of channels in the respective networks. For completeness, a short introduction to tensor analysis is given in app. C.

We prove the following theorem, the generative analog of theorem 1 from Cohen et al. (2016a):

Theorem 1.

Let be a tensor of order and dimension in each mode, generated by the recursive formulas in eq. 7, under the simplex constraints introduced in sec. 2. Define , and consider the space of all possible configurations for the parameters of the decomposition – . In this space, the generated tensor will have CP-rank of at least almost everywhere (w.r.t. the product measure of simplex spaces). Put differently, the configurations for which the CP-rank of is less than form a set of measure zero. The exact same result holds if we constrain the composition to be “shared”, i.e. set and consider the space of configurations.

The only differences between ConvACs and their generative counterparts are the simplex constraints applied to the parameters of the models, which necessitate a careful treatment of the measure-theoretic arguments of the original proof. More specifically, while the -dimensional simplex is a subset of the -dimensional space , it has zero measure with respect to the Lebesgue measure over . The standard method to define a measure over is via the Lebesgue measure over of its projection to that space, i.e. let be the Lebesgue measure over , be a projection, and be a subset of the simplex; then the latter's measure is defined as . Notice that has a positive measure, that is invertible over the set , and that its inverse is given by . In our case, the parameter space is the Cartesian product of several simplex spaces of different dimensions; for each of them the measure is defined as above, and the measure over their Cartesian product is uniquely defined by the product measure. Though standard, the choice of projection above could be seen as a limitation; however, the set of zero-measure sets in is identical for any reasonable choice of projection (e.g. all polynomial mappings). More specifically, for any projection that is invertible over , whose inverse is differentiable, and whose inverse has a bounded Jacobian over , a subset is of measure zero w.r.t. that projection iff it is of measure zero w.r.t. (as defined above). This implies that if we sample the weights of the generative decomposition (eq. 7 with simplex constraints) from a continuous distribution, a property that holds with probability 1 under the standard parameterization (projection ) will hold with probability 1 under any reasonable parameterization.
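The standard projection discussed above simply drops the last coordinate of a simplex point, and its inverse restores that coordinate via the sum-to-one constraint. A two-line sketch (names ours):

```python
import numpy as np

def project(v):
    """p: map a simplex point in R^d to R^(d-1) by dropping the last coordinate."""
    return v[:-1]

def lift(u):
    """p^{-1}: recover the dropped coordinate from the sum-to-one constraint."""
    return np.append(u, 1.0 - u.sum())

v = np.array([0.2, 0.3, 0.5])
print(lift(project(v)))  # [0.2 0.3 0.5]
```

Measure statements about the simplex are then made about the image of `project`, which has positive Lebesgue measure in the lower-dimensional space.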

We now state and prove a lemma that will be needed for our proof of theorem 1.

Lemma 1.

Let , and a polynomial mapping (i.e. for every then is a polynomial function). If there exists a point s.t. , then the set has zero measure.


Recall that iff there exists a non-zero minor of , which is polynomial in the entries of , and so polynomial in as well. Let be the number of minors of , denote the minors by , and define the polynomial function . It thus holds that iff for all it holds that , i.e. iff .

Now, is a polynomial in the entries of , and so it either vanishes on a set of zero measure, or it is the zero polynomial (see Caron and Traynor (2005) for proof). Since we assumed that there exists s.t. , the latter option is not possible. ∎
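As a toy illustration of the lemma, take a polynomial matrix mapping whose determinant is a non-zero polynomial; rank deficiency then occurs only on the determinant's zero set, which continuously sampled points avoid almost surely. The example is ours, not from the paper:

```python
import numpy as np

def A(x):
    """Polynomial mapping from R^2 to 2x2 matrices; det A(x) = x0^2 - x1^2.

    The rank of A(x) drops below 2 only where this determinant polynomial
    vanishes, i.e. on the measure-zero set {|x0| = |x1|}.
    """
    return np.array([[x[0], x[1]],
                     [x[1], x[0]]])

rng = np.random.default_rng(0)
ranks = [np.linalg.matrix_rank(A(rng.standard_normal(2))) for _ in range(1000)]
print(min(ranks))  # 2: random points avoid the zero set almost surely
```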

Following the work of Cohen et al. (2016a), our main proof relies on the following notation and facts:

  • We denote by the matricization of an -order tensor (for simplicity, is assumed to be even), where rows and columns correspond to odd and even modes, respectively. Specifically, if , the matrix has rows and columns, rearranging the entries of the tensor such that is stored at row index and column index . Additionally, matricization is a linear operator, i.e. for all scalars and tensors of the same order and dimensions in every mode, it holds that .

  • The relation between the Kronecker product (denoted by ) and the tensor product (denoted by ) is given by .

  • For any two matrices and , it holds that .

  • Let be the CP-rank of , then it holds that (see (Cohen et al., 2016a) for proof).
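The last two facts are easy to check numerically: the rank of a Kronecker product is the product of the ranks. A quick sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 4))  # rank 2 (almost surely)
B = rng.standard_normal((3, 1)) @ rng.standard_normal((1, 3))  # rank 1
K = np.kron(A, B)  # the Kronecker product realizes the tensor-product matricization

print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(B), np.linalg.matrix_rank(K))
# rank(K) = rank(A) * rank(B)
```

The singular values of the Kronecker product are all pairwise products of the factors' singular values, which is why the rank multiplies.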

Proof of theorem 1.

Stemming from the above stated facts, to show that the CP-rank of is at least , it is sufficient to examine its matricization and prove that .

Notice from the construction of , according to the recursive formula of the HT decomposition, that its entries are polynomial in the parameters of the decomposition, its dimensions are each , and that . In accordance with the discussion on the measure of simplex spaces, for each vector parameter we instead examine its projection , and notice that is a polynomial mapping w.r.t. . (As mentioned earlier, is invertible only over , where its inverse is given by ; however, to simplify the proof and notation, we use as defined here over the entire range , even where it does not serve as the inverse of .) Thus, is a polynomial mapping w.r.t. the projected parameters , and using lemma 1 it is sufficient to show that there exists a set of parameters for which .

Denoting for convenience and , we will construct by induction over a set of parameters, , for which the ranks of the matrices are at least , while enforcing the simplex constraints on the parameters. Moreover, we will construct these parameters such that , thus proving both the “unshared” and “shared” cases.

For the case we have:

and let and for all and , and for all and , and so

which means , while preserving the simplex constraints, proving our inductive hypothesis for .

Assume now that for all and . For some specific choice of and we have:

Denote for . By our inductive assumption, and by the general property , we have that the ranks of all matrices are at least