# Deep Directed Generative Autoencoders

For discrete data, the likelihood P(x) can be rewritten exactly and parametrized as P(X = x) = P(X = x | H = f(x)) P(H = f(x)), provided P(X | H) has enough capacity to put no probability mass on any x' for which f(x') ≠ f(x), where f(·) is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with f(·) as the encoder and P(X | H) as the (probabilistic) decoder. The log of the second factor can be seen as a regularizer on the encoded activations h = f(x), e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood log P(x). The objective is to learn an encoder f(·) such that f(X) has a much simpler distribution than X itself, as estimated by P(H). This "flattens the manifold", or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, because f(·) is discrete, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such an architecture may be difficult, much better results can be obtained by pre-training and stacking several levels, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
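The exactness of the factorization can be checked on a small example. The sketch below is illustrative only and is not taken from the paper: the toy distribution `P_x`, the encoder `f` (which keeps the first bit), and the helper names `decoder` and `factorized` are all hypothetical. The decoder is taken to be the true conditional, which by construction puts zero mass on any x' with f(x') ≠ h.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical toy distribution P(X = x) over 2-bit strings (an assumption).
P_x = {
    (0, 0): Fraction(1, 2),
    (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8),
    (1, 1): Fraction(1, 8),
}


def f(x):
    """Hypothetical deterministic discrete encoder: keep only the first bit."""
    return x[0]


# P(H = h) is the total mass of all x that the encoder maps to h.
P_h = defaultdict(Fraction)
for x, p in P_x.items():
    P_h[f(x)] += p


def decoder(x, h):
    """P(X = x | H = h), with no mass on any x whose code differs from h."""
    if f(x) != h:
        return Fraction(0)
    return P_x[x] / P_h[h]


def factorized(x):
    """P(X = x | H = f(x)) * P(H = f(x))."""
    h = f(x)
    return decoder(x, h) * P_h[h]


# Largest difference between the factorized form and the true P(x).
max_gap = max(abs(factorized(x) - p) for x, p in P_x.items())
print(max_gap)
```

Exact rational arithmetic is used so that the comparison is not blurred by floating-point rounding. Taking logs of the two factors gives the reconstruction term log P(x | h) and the regularizer term log P(h) described above.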
