Towards Bayesian Deep Learning: A Survey

04/06/2016 ∙ by Hao Wang, et al. ∙ The Hong Kong University of Science and Technology

While perception tasks such as visual object recognition and text understanding play an important role in human intelligence, the subsequent tasks that involve inference, reasoning and planning require an even higher level of intelligence. The past few years have seen major advances in many perception tasks using deep learning models. For higher-level inference, however, probabilistic graphical models with their Bayesian nature are still more powerful and flexible. To achieve integrated intelligence that involves both perception and inference, it is naturally desirable to tightly integrate deep learning and Bayesian models within a principled probabilistic framework, which we call Bayesian deep learning. In this unified framework, the perception of text or images using deep learning can boost the performance of higher-level inference and in return, the feedback from the inference process is able to enhance the perception of text or images. This survey provides a general introduction to Bayesian deep learning and reviews its recent applications on recommender systems, topic models, and control. In this survey, we also discuss the relationship and differences between Bayesian deep learning and other related topics like Bayesian treatment of neural networks.




1 Introduction

Deep learning has achieved significant success in many perception tasks including seeing (visual object recognition), reading (text understanding), and hearing (speech recognition). These are undoubtedly fundamental tasks for a functioning comprehensive artificial intelligence (AI) system. However, in order to build a real AI system, simply being able to see, read, and hear is far from enough. It should, above all, possess the ability of thinking.

Take medical diagnosis as an example. Besides seeing visible symptoms (or medical images from CT) and hearing descriptions from patients, a doctor has to look for relations among all the symptoms and preferably infer the etiology of them. Only after that can the doctor provide medical advice for the patients. In this example, although the abilities of seeing and hearing allow the doctor to acquire information from the patients, it is the thinking part that defines a doctor. Specifically, the ability of thinking here could involve causal inference, logic deduction, and dealing with uncertainty, which is apparently beyond the capability of conventional deep learning methods. Fortunately, another type of models, probabilistic graphical models (PGM), excels at causal inference and dealing with uncertainty. The problem is that PGM is not as good as deep learning models at perception tasks. To address the problem, it is, therefore, a natural choice to tightly integrate deep learning and PGM within a principled probabilistic framework, which we call Bayesian deep learning (BDL) in this paper.

With the tight and principled integration in Bayesian deep learning, the perception task and inference task are regarded as a whole and can benefit from each other. In the example above, being able to see the medical image could help with the doctor’s diagnosis and inference. On the other hand, diagnosis and inference can in return help with understanding the medical image. Suppose the doctor is not sure what a dark spot in a medical image is; if she is able to infer the etiology of the symptoms and disease, this can help her better decide whether the dark spot is a tumor or not.

As another example, to achieve high accuracy in recommender systems [60, 45], we need to fully understand the content of items (e.g., documents and movies), analyze the profile and preference of users, and evaluate the similarity among users. Deep learning is good at the first subtask while PGM excels at the other two. Besides the fact that better understanding of item content would help with the analysis of user profiles, the estimated similarity among users could provide valuable information for understanding item content in return. In order to fully utilize this bidirectional effect to boost recommendation accuracy, we might wish to unify deep learning and PGM in one single principled probabilistic framework.


Besides recommender systems, the need for Bayesian deep learning may also arise when we are dealing with control of non-linear dynamical systems with raw images as input. Consider controlling a complex dynamical system according to the live video stream received from a camera. This problem can be transformed into iteratively performing two tasks, perception from raw images and control based on dynamic models. The perception task can be taken care of using multiple layers of simple nonlinear transformation (deep learning) while the control task usually needs more sophisticated models like hidden Markov models and Kalman filters [21, 38]. The feedback loop is then completed by the fact that actions chosen by the control model can affect the received video stream in return. To enable an effective iterative process between the perception task and the control task, we need two-way information exchange between them. The perception component would be the basis on which the control component estimates its states and the control component with a dynamic model built in would be able to predict the future trajectory (images). In such cases, Bayesian deep learning is a suitable choice [62].

Apart from the major advantage that BDL provides a principled way of unifying deep learning and PGM, another benefit comes from the implicit regularization built into BDL. By imposing a prior on hidden units, on the parameters defining a neural network, or on the model parameters specifying the causal inference, BDL can to some degree avoid overfitting, especially when we do not have sufficient data. Usually, a BDL model consists of two components: a perception component that is a Bayesian formulation of a certain type of neural network, and a task-specific component that describes the relationship among different hidden or observed variables using PGM. Regularization is crucial for both. Neural networks usually have large numbers of free parameters that need to be regularized properly. Regularization techniques like weight decay and dropout [51] have been shown to be effective in improving the performance of neural networks, and both have Bayesian interpretations [13]. In terms of the task-specific component, expert knowledge or prior information, as a kind of regularization, can be incorporated into the model through the prior we impose, to guide the model when data are scarce.

Yet another advantage of using BDL for complex tasks (tasks that need both perception and inference) is that it provides a principled Bayesian approach of handling parameter uncertainty. When BDL is applied to complex tasks, there are three kinds of parameter uncertainty that need to be taken into account:

  1. Uncertainty on the neural network parameters.

  2. Uncertainty on the task-specific parameters.

  3. Uncertainty of exchanging information between the perception component and the task-specific component.

By representing the unknown parameters using distributions instead of point estimates, BDL offers a promising framework to handle these three kinds of uncertainty in a unified way. It is worth noting that the third uncertainty could only be handled under a unified framework like BDL. If we train the perception component and the task-specific component separately, it is equivalent to assuming no uncertainty when exchanging information between the two components.

Of course, there are challenges when applying BDL to real-world tasks. (1) First, it is nontrivial to design an efficient Bayesian formulation of neural networks with reasonable time complexity. This line of work was pioneered by [37, 24, 40], but it has not been widely adopted due to its lack of scalability. Fortunately, some recent advances in this direction [19, 32, 22, 7, 1] seem to shed light on the practical adoption of Bayesian neural networks. (Here we refer to the Bayesian treatment of neural networks as Bayesian neural networks. The other term, Bayesian deep learning, is retained to refer to complex Bayesian models with both a perception component and a task-specific component.) (2) The second challenge is to ensure efficient and effective information exchange between the perception component and the task-specific component. Ideally both the first-order and second-order information (e.g., the mean and the variance) should be able to flow back and forth between the two components. A natural way is to represent the perception component as a PGM and seamlessly connect it to the task-specific PGM, as done in [59, 60, 15].

In this survey, we aim to give a comprehensive overview of BDL models for recommender systems, topic models (and representation learning), and control. The rest of the survey is organized as follows: In Section 2, we provide a review of some basic deep learning models. Section 3 covers the main concepts and techniques for PGM. These two sections serve as the background for BDL, and the next section, Section 4, surveys the BDL models applied to areas like recommender systems and control. Section 5 discusses some future research issues and concludes the paper.

2 Deep Learning

Deep learning normally refers to neural networks with more than two layers. To better understand deep learning, here we start with the simplest type of neural networks, multilayer perceptrons (MLP), as an example to show how conventional deep learning works. After that, we will review several other types of deep learning models based on MLP.

2.1 Multilayer Perceptron

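Essentially a multilayer perceptron is a sequence of parametric nonlinear transformations. Suppose we want to train a multilayer perceptron to perform a regression task which maps a vector of $M$ dimensions to a vector of $D$ dimensions. We denote the input as a matrix $\mathbf{X}_0$ ($0$ means it is the $0$-th layer of the perceptron). The $j$-th row of $\mathbf{X}_0$, denoted as $\mathbf{X}_{0,j*}$, is an $M$-dimensional vector representing one data point. The target (the output we want to fit) is denoted as $\mathbf{Y}$. Similarly $\mathbf{Y}_{j*}$ denotes a $D$-dimensional row vector. The problem of learning an $L$-layer multilayer perceptron can be formulated as the following optimization problem:

$$\min_{\{\mathbf{W}_l\},\{\mathbf{b}_l\}} \|\mathbf{X}_L - \mathbf{Y}\|_F^2 + \lambda \sum_l \|\mathbf{W}_l\|_F^2$$

subject to

$$\mathbf{X}_l = \sigma(\mathbf{X}_{l-1}\mathbf{W}_l + \mathbf{b}_l), \quad l = 1, \dots, L-1,$$

$$\mathbf{X}_L = \mathbf{X}_{L-1}\mathbf{W}_L + \mathbf{b}_L,$$

where $\sigma(\cdot)$ is an element-wise sigmoid function for a matrix and $\sigma(x) = \frac{1}{1 + \exp(-x)}$. The purpose of imposing $\sigma(\cdot)$ is to allow nonlinear transformation. Normally other transformations like $\tanh(x)$ and $\max(0, x)$ can be used as alternatives to the sigmoid function.

Here $\mathbf{X}_l$ ($l = 1, 2, \dots, L-1$) are the hidden units. As we can see, $\mathbf{X}_L$ can be easily computed once $\mathbf{X}_0$, $\mathbf{W}_l$, and $\mathbf{b}_l$ are given. Since $\mathbf{X}_0$ is given by the data, we only need to learn $\mathbf{W}_l$ and $\mathbf{b}_l$ here. Usually this is done using backpropagation and stochastic gradient descent (SGD). The key is to compute the gradients of the objective function with respect to $\mathbf{W}_l$ and $\mathbf{b}_l$. If we denote the value of the objective function as $E$, we can compute the gradients using the chain rule as:

$$\frac{\partial E}{\partial \mathbf{X}_L} = 2(\mathbf{X}_L - \mathbf{Y}) \quad (1)$$

$$\frac{\partial E}{\partial \mathbf{X}_l} = \left(\frac{\partial E}{\partial \mathbf{X}_{l+1}} \circ \mathbf{X}_{l+1} \circ (1 - \mathbf{X}_{l+1})\right)\mathbf{W}_{l+1}^T \quad (2)$$

$$\frac{\partial E}{\partial \mathbf{W}_l} = \mathbf{X}_{l-1}^T\left(\frac{\partial E}{\partial \mathbf{X}_l} \circ \mathbf{X}_l \circ (1 - \mathbf{X}_l)\right) \quad (3)$$

$$\frac{\partial E}{\partial \mathbf{b}_l} = \mathrm{mean}\left(\frac{\partial E}{\partial \mathbf{X}_l} \circ \mathbf{X}_l \circ (1 - \mathbf{X}_l),\, 1\right) \quad (4)$$

where $l$ indexes the sigmoid layers (for the linear output layer $\mathbf{X}_L$, the sigmoid-derivative factor $\mathbf{X}_L \circ (1 - \mathbf{X}_L)$ is dropped) and the regularization terms are omitted. $\circ$ denotes the element-wise product and $\mathrm{mean}(\cdot, 1)$ is the matlab operation on matrices. In practice, we only use a small part of the data (a mini-batch of data points) to compute the gradients for each update. This is called stochastic gradient descent.

As we can see, in conventional deep learning models, only $\mathbf{W}_l$ and $\mathbf{b}_l$ are free parameters, which we will update in each iteration of the optimization. $\mathbf{X}_l$ is not a free parameter since it can be computed exactly if $\mathbf{W}_l$ and $\mathbf{b}_l$ are given.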
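The forward pass and chain-rule gradients described in this section can be sketched in NumPy. This is a minimal illustration, not the survey's implementation: the function names (`forward`, `backward`) are hypothetical, the output layer is assumed to be linear, the regularization term is omitted, and the bias gradient is summed over data points (the exact gradient of the summed squared error; averaging would only rescale it).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X0, weights, biases):
    """Return the list of layer outputs [X_0, ..., X_L]; the last layer is linear."""
    acts = [X0]
    for l, (W, b) in enumerate(zip(weights, biases)):
        pre = acts[-1] @ W + b
        is_last = (l == len(weights) - 1)
        acts.append(pre if is_last else sigmoid(pre))
    return acts

def backward(acts, Y, weights):
    """Gradients of E = ||X_L - Y||_F^2 with respect to every weight matrix and bias."""
    grads_W, grads_b = [], []
    # Gradient with respect to the (linear) output layer.
    delta = 2.0 * (acts[-1] - Y)
    for l in range(len(weights) - 1, -1, -1):
        # weights[l] maps acts[l] to acts[l + 1].
        grads_W.insert(0, acts[l].T @ delta)
        grads_b.insert(0, delta.sum(axis=0))
        if l > 0:
            dX = delta @ weights[l].T               # gradient w.r.t. hidden layer acts[l]
            delta = dX * acts[l] * (1.0 - acts[l])  # back through the sigmoid
    return grads_W, grads_b
```

A single SGD update would then subtract a small multiple of each gradient from the corresponding parameter, using only a mini-batch of rows from the input and target matrices.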

Fig. 1: A 2-layer SDAE with $L = 4$.

2.2 Autoencoders

An autoencoder (AE) is a feedforward neural network that encodes the input into a more compact representation and reconstructs the input from the learned representation. In its simplest form, an autoencoder is no more than a multilayer perceptron with a bottleneck layer (a layer with a small number of hidden units) in the middle. The idea of autoencoders has been around for decades [33, 8, 25, 18] and abundant variants of autoencoders have been proposed to enhance representation learning including sparse AE [43], contractive AE [46], and denoising AE [54]. For more details, please refer to a recent book on deep learning [18]. Here we introduce a kind of multilayer denoising AE, known as stacked denoising autoencoders (SDAE), both as an example of AE variants and as background for its applications on BDL-based recommender systems in Section 4.2.


SDAE [54] is a feedforward neural network for learning representations (encoding) of the input data by learning to predict the clean input itself in the output, as shown in Figure 1. The hidden layer in the middle, i.e., $\mathbf{X}_2$ in the figure, can be constrained to be a bottleneck to learn compact representations. The difference between traditional AE and SDAE is that the input layer $\mathbf{X}_0$ is a corrupted version of the clean input data $\mathbf{X}_c$. Essentially an SDAE solves the following optimization problem:

$$\min_{\{\mathbf{W}_l\},\{\mathbf{b}_l\}} \|\mathbf{X}_c - \mathbf{X}_L\|_F^2 + \lambda \sum_l \|\mathbf{W}_l\|_F^2$$

subject to

$$\mathbf{X}_l = \sigma(\mathbf{X}_{l-1}\mathbf{W}_l + \mathbf{b}_l), \quad l = 1, \dots, L-1,$$

$$\mathbf{X}_L = \mathbf{X}_{L-1}\mathbf{W}_L + \mathbf{b}_L,$$

where $\lambda$ is a regularization parameter and $\|\cdot\|_F$ denotes the Frobenius norm. Here SDAE can be regarded as a multilayer perceptron for regression tasks described in the previous section. The input $\mathbf{X}_0$ of the MLP is the corrupted version of the data and the target $\mathbf{Y}$ is the clean version of the data $\mathbf{X}_c$. For example, $\mathbf{X}_c$ can be the raw data matrix, and we can randomly set a fraction of the entries in $\mathbf{X}_c$ to $0$ and get $\mathbf{X}_0$. In a nutshell, SDAE learns a neural network that takes the noisy data as input and recovers the clean data in the last layer. This is what ‘denoising’ in the name means. Normally, the output of the middle layer, i.e., $\mathbf{X}_2$ in Figure 1, would be used to compactly represent the data.
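The corruption step and the SDAE objective can be sketched as follows. This is an illustrative sketch under assumptions: the helper names `corrupt` and `sdae_objective`, the masking-noise scheme, and the `noise_rate` parameter are hypothetical choices for the example, and the network output is passed in directly rather than computed here.

```python
import numpy as np

def corrupt(X_clean, noise_rate, rng):
    """Masking noise: independently set each entry to zero with probability noise_rate."""
    keep = rng.random(X_clean.shape) >= noise_rate
    return X_clean * keep

def sdae_objective(X_clean, X_out, weights, lam):
    """Squared reconstruction error against the CLEAN data plus an L2 penalty on the weights."""
    reconstruction = np.sum((X_clean - X_out) ** 2)
    penalty = lam * sum(np.sum(W ** 2) for W in weights)
    return reconstruction + penalty
```

The point of the design is that the network sees only the corrupted matrix as input while the loss is measured against the clean matrix, so it cannot succeed by copying its input.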

Fig. 2: A convolutional layer with input feature maps and output feature maps.

2.3 Convolutional Neural Networks

Convolutional neural networks (CNN) can be viewed as another variant of MLP. Different from AE, which was initially designed to perform dimensionality reduction, CNN is biologically inspired. According to [29], two types of cells have been identified in the cat’s visual cortex. One is simple cells that respond maximally to specific patterns within their receptive field, and the other is complex cells with larger receptive fields that are considered locally invariant to positions of patterns. Inspired by these findings, the two key concepts in CNN were then developed: convolution and max-pooling.

Convolution: In CNN, a feature map is the result of the convolution of the input and a linear filter, followed by some element-wise nonlinear transformation. The input here can be the raw image or the feature map from the previous layer. Specifically, with input $\mathbf{X}$, weights $\mathbf{W}^k$, and bias $b_k$, the $k$-th feature map $\mathbf{H}^k$ can be obtained as follows:

$$\mathbf{H}^k_{ij} = \tanh\left((\mathbf{W}^k * \mathbf{X})_{ij} + b_k\right).$$

Note that in the equation above we assume one single input feature map and multiple output feature maps. In practice, CNN often has multiple input feature maps as well due to its deep structure. A convolutional layer with input feature maps and output feature maps is shown in Figure 2.

Max-Pooling: Usually, a convolutional layer in CNN is followed by a max-pooling layer, which can be seen as a type of nonlinear downsampling. The operation of max-pooling is simple. For example, if we have a feature map of size $m \times n$, the result of max-pooling with a $p \times q$ region would be a downsampled feature map of size $(m/p) \times (n/q)$ (assuming $p$ divides $m$ and $q$ divides $n$). Each entry of the downsampled feature map is the maximum value of the corresponding $p \times q$ region in the original feature map. Max-pooling layers can not only reduce computational cost by ignoring the non-maximal entries but also provide local translation invariance.

Putting it all together: Usually to form a complete and working CNN, the input would alternate between convolutional layers and max-pooling layers before going into an MLP for tasks like classification or regression. One famous example is the LeNet-5 [34], which alternates between convolutional layers and max-pooling layers before going into a fully connected MLP for target tasks.
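The two building blocks above, a convolutional feature map followed by max-pooling, can be sketched in NumPy. This is a minimal single-channel illustration under assumptions: the function names are hypothetical, the filter is applied as a sliding-window correlation over 'valid' positions only (as most deep learning libraries do, without flipping the filter), and pooling regions do not overlap.

```python
import numpy as np

def conv_feature_map(X, W, b):
    """One feature map: tanh of the filter response at every valid position, plus a bias."""
    kh, kw = W.shape
    rows, cols = X.shape
    out = np.empty((rows - kh + 1, cols - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * W) + b
    return np.tanh(out)

def max_pool(F, ph, pw):
    """Non-overlapping max-pooling; assumes the shape of F is divisible by (ph, pw)."""
    rows, cols = F.shape
    blocks = F.reshape(rows // ph, ph, cols // pw, pw)
    return blocks.max(axis=(1, 3))
```

For instance, pooling a feature map with 6 rows and 9 columns over 3-by-3 regions yields a map with 2 rows and 3 columns, each entry being the largest value in its region.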

Fig. 3: On the left is a conventional feedforward neural network with one hidden layer, where $\mathbf{x}$ is the input, $\mathbf{z}$ is the hidden layer, $\mathbf{o}$ is the output, and $\mathbf{W}$ and $\mathbf{V}$ are the corresponding weights (biases are omitted here). On the right is a recurrent neural network with input $\{\mathbf{x}_t\}_{t=1}^T$, hidden states $\{\mathbf{h}_t\}_{t=1}^T$, and output $\{\mathbf{o}_t\}_{t=1}^T$.

Fig. 4: An unrolled RNN which is equivalent to the one in Figure 3(right). Here each node (e.g., $\mathbf{x}_1$, $\mathbf{h}_1$, or $\mathbf{o}_1$) is associated with one particular time instance.

2.4 Recurrent Neural Network

When we read an article, we would normally take in one word at a time and try to understand the current word based on previous words. This is a recurrent process that needs short-term memory. Unfortunately conventional feedforward neural networks like the one shown in Figure 3(left) fail to do so. For example, imagine we want to constantly predict the next word as we read an article. Since the feedforward network only computes the output $\mathbf{o}$ as $\mathbf{o} = \mathbf{V}q(\mathbf{W}\mathbf{x})$, where the function $q(\cdot)$ denotes element-wise nonlinear transformation, it is unclear how the network could naturally model the sequence of words to predict the next word.

2.4.1 Vanilla Recurrent Neural Network

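To solve the problem, we need a recurrent neural network [18] instead of a feedforward one. As shown in Figure 3(right), the computation of the current hidden states $\mathbf{h}_t$ depends on the current input $\mathbf{x}_t$ (e.g., the $t$-th word) and the previous hidden states $\mathbf{h}_{t-1}$. This is why there is a loop in the RNN. It is this loop that enables short-term memory in RNNs. The $\mathbf{h}_t$ in the RNN represents what the network knows so far at the $t$-th time step. To see the computation more clearly, we can unroll the loop and represent the RNN as in Figure 4. If we use hyperbolic tangent nonlinearity ($\tanh$), the computation of output $\mathbf{o}_t$ will be as follows:

$$\mathbf{a}_t = \mathbf{W}\mathbf{h}_{t-1} + \mathbf{Y}\mathbf{x}_t + \mathbf{b}$$

$$\mathbf{h}_t = \tanh(\mathbf{a}_t)$$

$$\mathbf{o}_t = \mathbf{V}\mathbf{h}_t + \mathbf{c},$$

where $\mathbf{Y}$, $\mathbf{W}$, and $\mathbf{V}$ denote the weight matrices for input-to-hidden, hidden-to-hidden, and hidden-to-output connections, respectively, and $\mathbf{b}$ and $\mathbf{c}$ are the corresponding biases. If the task is to classify the input data at each time step, we can compute the classification probability as $\mathbf{p}_t = \mathrm{softmax}(\mathbf{o}_t)$, where

$$\mathrm{softmax}(\mathbf{q}) = \frac{\exp(\mathbf{q})}{\sum_i \exp(\mathbf{q}_i)}.$$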
Fig. 5: The encoder-decoder architecture involving two LSTMs. The encoder LSTM (in the left rectangle) encodes the sequence ‘ABC’ into a representation and the decoder LSTM (in the right rectangle) recovers the sequence from the representation. ‘$’ marks the end of a sentence.
Fig. 6: The probabilistic graphical model for LDA, where $J$ is the number of documents and $D$ is the number of words in a document.

Similar to feedforward networks, to train an RNN, a generalized back-propagation algorithm called back-propagation through time (BPTT) [18] can be used. Essentially the gradients are computed through the unrolled network as shown in Figure 4 with shared weights and biases for all time steps.
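The unrolled forward computation of a vanilla RNN with per-step classification can be sketched as below. It is a minimal sketch under assumptions: the function name `rnn_forward`, the zero initial hidden state, and the argument names for the input-to-hidden, hidden-to-hidden, and hidden-to-output weights are choices made for this example. The same weights and biases are reused at every time step, which is what BPTT exploits.

```python
import numpy as np

def softmax(q):
    e = np.exp(q - np.max(q))  # subtract the max for numerical stability
    return e / e.sum()

def rnn_forward(xs, W_in, W_rec, W_out, b, c):
    """Run a tanh RNN over a sequence; return per-step class probabilities and the final state."""
    h = np.zeros(W_rec.shape[0])  # assumed initial hidden state
    probs = []
    for x in xs:
        a = W_rec @ h + W_in @ x + b   # combine previous state and current input
        h = np.tanh(a)                 # new hidden state
        o = W_out @ h + c              # output scores
        probs.append(softmax(o))
    return probs, h
```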

2.4.2 Gated Recurrent Neural Network

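The problem with the vanilla RNN introduced above is that the gradients propagated over many time steps are prone to vanish or explode, which makes the optimization notoriously difficult. In addition, the signal passing through the RNN decays exponentially, making it impossible to model long-term dependencies in long sequences. Imagine we want to predict the last word in the paragraph ‘I have many books … I like reading ___’. In order to get the answer, we need ‘long-term memory’ to retrieve information (the word ‘books’) at the start of the text. To address this problem, the long short-term memory model (LSTM) is designed as a type of gated RNN to model and accumulate information over a relatively long duration. The intuition behind LSTM is that when processing a sequence consisting of several subsequences, it is sometimes useful for the neural network to summarize or forget the old states before moving on to process the next subsequence [18]. Using $t = 1, \dots, T_j$ to index the words in the sequence, the formulation of LSTM is as follows (we drop the item index $j$ for notational simplicity):

$$\mathbf{x}_t = \mathbf{W}_w \mathbf{e}_t$$

$$\mathbf{s}_t = \mathbf{h}^f_{t-1} \odot \mathbf{s}_{t-1} + \mathbf{h}^i_{t-1} \odot \sigma(\mathbf{Y}\mathbf{x}_{t-1} + \mathbf{W}\mathbf{h}_{t-1} + \mathbf{b}), \quad (5)$$

where $\mathbf{x}_t$ is the word embedding of the $t$-th word, $\mathbf{W}_w$ is a $K_W$-by-$S$ word embedding matrix, $\mathbf{e}_t$ is the $1$-of-$S$ representation, $\odot$ stands for the element-wise product operation between two vectors, $\sigma(\cdot)$ denotes the sigmoid function, $\mathbf{s}_t$ is the cell state of the $t$-th word, and $\mathbf{b}$, $\mathbf{Y}$, and $\mathbf{W}$ denote the biases, input weights, and recurrent weights respectively. The forget gate units $\mathbf{h}^f_t$ and the input gate units $\mathbf{h}^i_t$ in Equation (5) can be computed using their corresponding weights and biases $\mathbf{Y}^f$, $\mathbf{W}^f$, $\mathbf{Y}^i$, $\mathbf{W}^i$, $\mathbf{b}^f$, and $\mathbf{b}^i$:

$$\mathbf{h}^f_t = \sigma(\mathbf{Y}^f\mathbf{x}_t + \mathbf{W}^f\mathbf{h}_t + \mathbf{b}^f)$$

$$\mathbf{h}^i_t = \sigma(\mathbf{Y}^i\mathbf{x}_t + \mathbf{W}^i\mathbf{h}_t + \mathbf{b}^i).$$

The output $\mathbf{h}_t$ depends on the output gate $\mathbf{h}^o_t$ which has its own weights and biases $\mathbf{Y}^o$, $\mathbf{W}^o$, and $\mathbf{b}^o$:

$$\mathbf{h}_t = \tanh(\mathbf{s}_t) \odot \mathbf{h}^o_{t-1}$$

$$\mathbf{h}^o_t = \sigma(\mathbf{Y}^o\mathbf{x}_t + \mathbf{W}^o\mathbf{h}_t + \mathbf{b}^o).$$

Note that in the LSTM, information of the processed sequence is contained in the cell states $\mathbf{s}_t$ and the output states $\mathbf{h}_t$, both of which are column vectors of length $K_W$.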
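A single LSTM step can be sketched as follows. Note the assumptions: this is the common textbook variant (gates computed from the current input and the previous output state, with a tanh candidate update), which may differ in indexing details from the formulation given in the text; the function name `lstm_step` and the `params` dictionary layout are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, s_prev, params):
    """One LSTM step. params maps a gate name to (input weights, recurrent weights, bias)."""
    def affine(name):
        W_in, W_rec, b = params[name]
        return W_in @ x + W_rec @ h_prev + b

    forget_gate = sigmoid(affine("forget"))   # how much of the old cell state to keep
    input_gate = sigmoid(affine("input"))     # how much of the new candidate to write
    output_gate = sigmoid(affine("output"))   # how much of the cell state to expose
    candidate = np.tanh(affine("cell"))

    s = forget_gate * s_prev + input_gate * candidate  # element-wise cell-state update
    h = output_gate * np.tanh(s)                       # output state
    return h, s
```

To build an encoder-decoder pair, the final output and cell states returned by the encoder's last step would be passed as the initial states of the decoder.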

Similar to [53, 12], we can use the output state and cell state at the last time step ($\mathbf{h}_{T_j}$ and $\mathbf{s}_{T_j}$) of the first LSTM as the initial output state and cell state of the second LSTM. This way the two LSTMs can be concatenated to form an encoder-decoder architecture, as shown in Figure 5.

Note that there is a vast literature on deep learning and neural networks. The introduction in this section intends to serve only as the background of Bayesian deep learning. Readers are referred to [18] for a comprehensive survey and more details.

3 Probabilistic Graphical Models

Probabilistic Graphical Models (PGM) use diagrammatic representations to describe random variables and relationships among them. Similar to a graph that contains nodes (vertices) and links (edges), PGM has nodes to represent random variables and links to express probabilistic relationships among them.

3.1 Models

There are essentially two types of PGM, directed PGM (also known as Bayesian networks) and undirected PGM (also known as Markov random fields). In this survey we mainly focus on directed PGM (for convenience, PGM stands for directed PGM in this survey unless specified otherwise). For details on undirected PGM, readers are referred to [3].

A classic example of PGM would be latent Dirichlet allocation (LDA), which is used as a topic model to analyze the generation of words and topics in documents. Usually PGM comes with a graphical representation of the model and a generative process to depict the story of how the random variables are generated step by step. Figure 6 shows the graphical model for LDA and the corresponding generative process is as follows:

  • For each document $j$ ($j = 1, 2, \dots, J$),

    1. Draw topic proportions $\theta_j \sim \mathrm{Dirichlet}(\alpha)$.

    2. For each word $w_{jn}$ of item (paper) $\mathbf{w}_j$,

      1. Draw topic assignment $z_{jn} \sim \mathrm{Mult}(\theta_j)$.

      2. Draw word $w_{jn} \sim \mathrm{Mult}(\beta_{z_{jn}})$.

The generative process above gives the story of how the random variables are generated. In the graphical model in Figure 6, the shaded node denotes observed variables while the others are latent variables ($\theta$ and $\mathbf{z}$) or parameters ($\alpha$ and $\beta$). As we can see, once the model is defined, learning algorithms can be applied to automatically learn the latent variables and parameters.
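The generative story of LDA can be simulated directly, which makes the roles of the Dirichlet prior and the per-topic word distributions concrete. This is an illustrative sketch under assumptions: the function name `generate_corpus` is hypothetical, every document is given the same length for simplicity, and words are represented by integer vocabulary indices.

```python
import numpy as np

def generate_corpus(num_docs, doc_len, alpha, beta, rng):
    """Sample documents from the LDA generative process.

    alpha: Dirichlet parameter, one entry per topic.
    beta:  matrix whose k-th row is the word distribution of topic k.
    """
    num_topics, vocab_size = beta.shape
    docs = []
    for _ in range(num_docs):
        theta = rng.dirichlet(alpha)                          # topic proportions of this document
        topics = rng.choice(num_topics, size=doc_len, p=theta)  # one topic assignment per word
        words = [int(rng.choice(vocab_size, p=beta[k])) for k in topics]
        docs.append(words)
    return docs

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])
beta = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.2, 0.7]])
corpus = generate_corpus(num_docs=3, doc_len=8, alpha=alpha, beta=beta, rng=rng)
```

Learning reverses this process: given only the sampled words, it recovers plausible topic proportions, topic assignments, and word distributions.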

Due to its Bayesian nature, PGM like LDA is easy to extend to incorporate other information or to perform other tasks. For example, after LDA, different variants of topic models based on it have been proposed. [5, 56] are proposed to incorporate temporal information and [4] extends LDA by assuming correlations among topics. [26] extends LDA from the batch mode to the online setting, making it possible to process large datasets. On recommender systems, [55] extends LDA to incorporate rating information and make recommendations. This model is then further extended to incorporate social information [44, 57, 58].

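Applications        | Models                     | Variance of $\Omega_h$ | MAP | VI | Gibbs Sampling | SG Thermostats
Recommender Systems | CDL                        | Hyper-Variance         |     |    |                |
Recommender Systems | Bayesian CDL               | Hyper-Variance         |     |    |                |
Recommender Systems | Marginalized CDL           | Learnable Variance     |     |    |                |
Recommender Systems | Symmetric CDL              | Learnable Variance     |     |    |                |
Recommender Systems | Collaborative Deep Ranking | Hyper-Variance         |     |    |                |
Topic Models        | Relational SDAE            | Hyper-Variance         |     |    |                |
Topic Models        | DPFA-SBN                   | Zero-Variance          |     |    |                |
Topic Models        | DPFA-RBM                   | Zero-Variance          |     |    |                |
Control             | Embed to Control           | Learnable Variance     |     |    |                |

TABLE I: Summary of BDL Models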

3.2 Inference and Learning

Strictly speaking, the process of finding the parameters (e.g., $\alpha$ and $\beta$ in Figure 6) is called learning and the process of finding the latent variables (e.g., $\theta$ and $\mathbf{z}$ in Figure 6) given the parameters is called inference. However, given only the observed variables (e.g., $\mathbf{w}$ in Figure 6), learning and inference are often intertwined. Usually the learning and inference of LDA would alternate between the updates of latent variables (which correspond to inference) and the updates of the parameters (which correspond to learning). Once the learning and inference of LDA is completed, we would have the parameters $\alpha$ and $\beta$. If a new document comes, we can now fix the learned $\alpha$ and $\beta$ and then perform inference alone to find the topic proportions $\theta_j$ of the new document. (For convenience, we use ‘learning’ to represent both ‘learning and inference’ in the following text.)

Like in LDA, various learning and inference algorithms are available for each PGM. Among them, the most cost-effective one is probably maximum a posteriori (MAP), which amounts to maximizing the posterior probability of the latent variable. Using MAP, the learning process is equivalent to minimizing (or maximizing) an objective function with regularization. One famous example is the probabilistic matrix factorization (PMF) [48]. The learning of the graphical model in PMF is equivalent to factorization of a large matrix into two low-rank matrices with L2 regularization.
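The equivalence between MAP learning and regularized optimization can be made concrete for a PMF-style model. This is a minimal sketch under assumptions: Gaussian likelihood and zero-mean Gaussian priors are taken to reduce to squared error plus an L2 penalty with a single weight `lam`, only observed ratings (marked by a 0/1 matrix) contribute, and the function names are hypothetical.

```python
import numpy as np

def pmf_objective(R, observed, U, V, lam):
    """Negative log-posterior up to constants: squared error on observed entries + L2 penalty."""
    err = observed * (R - U @ V.T)
    return 0.5 * np.sum(err ** 2) + 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))

def pmf_gradients(R, observed, U, V, lam):
    """Gradients of pmf_objective with respect to the two low-rank factors."""
    err = observed * (R - U @ V.T)
    grad_U = -err @ V + lam * U
    grad_V = -err.T @ U + lam * V
    return grad_U, grad_V
```

Repeatedly stepping both factors against these gradients yields point estimates of the latent user and item vectors, which is exactly what MAP provides and nothing more: no posterior variance is obtained.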

MAP, as efficient as it is, gives us only point estimates of latent variables (and parameters). In order to take the uncertainty into account and harness the full power of Bayesian models, one would have to resort to Bayesian treatments like variational inference and Markov chain Monte Carlo (MCMC). For example, the original LDA uses variational inference to approximate the true posterior with factorized variational distributions [6]. Learning of the latent variables and parameters then boils down to minimizing the KL-divergence between the variational distributions and the true posterior distributions. Besides variational inference, another choice for a Bayesian treatment is to use MCMC. For example, MCMC algorithms like [42] have been proposed to learn the posterior distributions of LDA.

4 Bayesian Deep Learning

With the background on deep learning and PGM, we are now ready to introduce the general framework and some concrete examples of BDL. Specifically, in this section we will list some recent BDL models with applications on recommender systems, topic models, and control. A summary of these models is shown in Table I.

4.1 General Framework

As mentioned in Section 1, BDL is a principled probabilistic framework with two seamlessly integrated components: a perception component and a task-specific component.

PGM for BDL: Figure 7 shows the PGM of a simple BDL model as an example. The part inside the red rectangle on the left represents the perception component and the part inside the blue rectangle on the right is the task-specific component. Typically, the perception component would be a probabilistic formulation of a deep learning model with multiple nonlinear processing layers represented as a chain structure in the PGM. While the nodes and edges in the perception component are relatively simple, those in the task-specific component often describe more complex distributions and relationships among variables (like in LDA).

Fig. 7: The PGM for an example BDL. The red rectangle on the left indicates the perception component, and the blue rectangle on the right indicates the task-specific component. The set of hinge variables is denoted $\Omega_h$.

Three Sets of Variables: There are three sets of variables in a BDL model: perception variables, hinge variables, and task variables. In this paper, we use $\Omega_p$ to denote the set of perception variables, which are the variables in the perception component (see Figure 7). Usually $\Omega_p$ would include the weights and neurons in the probabilistic formulation of a deep learning model. $\Omega_h$ is used to denote the set of hinge variables. These variables directly interact with the perception component from the task-specific component. Table I shows the set of hinge variables for each listed BDL model. The set of task variables, i.e., variables in the task-specific component without direct relation to the perception component, is denoted as $\Omega_t$.

The I.I.D. Requirement: Note that hinge variables are always in the task-specific component. Normally, the connections between hinge variables and the perception component should be i.i.d. for convenience of parallel computation in the perception component. For example, each row of a hinge-variable matrix is related to only one corresponding row in each of the perception-variable matrices it connects to. Although it is not mandatory in BDL models, meeting this requirement would significantly increase the efficiency of parallel computation in model training.

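Joint Distribution Decomposition: If the edges between the two components point towards $\Omega_h$, the joint distribution of all variables can be written as:

$$p(\Omega_p, \Omega_h, \Omega_t) = p(\Omega_p)\, p(\Omega_h \mid \Omega_p)\, p(\Omega_t \mid \Omega_h). \quad (6)$$

If the edges between the two components originate from $\Omega_h$, the joint distribution of all variables can be written as:

$$p(\Omega_p, \Omega_h, \Omega_t) = p(\Omega_t)\, p(\Omega_h \mid \Omega_t)\, p(\Omega_p \mid \Omega_h). \quad (7)$$

It is also possible for BDL to have some edges between the two components pointing towards $\Omega_h$ and some originating from $\Omega_h$, in which case the decomposition of the joint distribution would be more complex.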

Variance Related to $\Omega_h$: As mentioned in Section 1, one of the motivations for BDL is to model the uncertainty of exchanging information between the perception component and the task-specific component, which boils down to modeling the uncertainty related to $\Omega_h$. For example, this kind of uncertainty is reflected in the variance of the conditional density $p(\Omega_h \mid \Omega_p)$ in Equation (6). (For models with the joint likelihood decomposed as in Equation (7), the uncertainty is reflected in the variance of $p(\Omega_p \mid \Omega_h)$.) According to the degree of flexibility, there are three types of variance for $\Omega_h$ (for simplicity we assume the joint likelihood of BDL is Equation (6), $\Omega_p = \{p\}$, $\Omega_h = \{h\}$, and $p(\Omega_h \mid \Omega_p) = \mathcal{N}(h \mid p, s)$ in our example):

  • Zero-Variance: Zero-Variance (ZV) assumes no uncertainty during the information exchange between the two components. In the example, zero-variance means directly setting $s$ to $0$.

  • Hyper-Variance: Hyper-Variance (HV) assumes that uncertainty during the information exchange is defined through hyperparameters. In the example, HV means that $s$ is a hyperparameter that is manually tuned.

  • Learnable Variance: Learnable Variance (LV) uses learnable parameters to represent uncertainty during the information exchange. In the example, $s$ is the learnable parameter.

As shown above, in terms of model flexibility, LV > HV > ZV. Normally, if the models are properly regularized, an LV model would outperform an HV model, which is superior to a ZV model. In Table I, we show the types of variance for $\Omega_h$ in different BDL models. Note that although each model in the table has a specific type, one can always adjust the models to devise their counterparts of other types. For example, while CDL in the table is an HV model, we can easily adjust $p(\Omega_h \mid \Omega_p)$ in CDL to devise its ZV and LV counterparts. The authors of [60] compare the performance of an HV CDL and a ZV CDL and find that the former performs significantly better, meaning that carefully modeling the uncertainty between the two components is essential for performance.
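The three variance types can be contrasted on a toy Gaussian link between a perception output and a hinge variable. This is an illustrative sketch under assumptions: the link is taken to be a scalar-variance Gaussian, the function names are hypothetical, and the learnable variance is fitted by maximum likelihood in closed form rather than by any method from the surveyed models.

```python
import numpy as np

def gaussian_nll(h, p, s):
    """Negative log-likelihood of h under a Gaussian with mean p and variance s > 0."""
    return 0.5 * np.sum((h - p) ** 2 / s + np.log(2.0 * np.pi * s))

def fit_learnable_variance(h, p):
    """LV: the variance that minimises gaussian_nll, i.e. the mean squared gap between h and p."""
    return float(np.mean((h - p) ** 2))

p = np.array([1.5, 2.0, 2.5])   # hypothetical perception output
h = np.array([1.0, 2.0, 3.0])   # hypothetical hinge variable

hv_s = 1.0                            # HV: variance fixed by hand as a hyperparameter
lv_s = fit_learnable_variance(h, p)   # LV: variance estimated from data
h_zv = p.copy()                       # ZV: no variance, the hinge variable equals the perception output
```

The zero-variance case is the limit in which the Gaussian collapses onto its mean, which is what happens implicitly when the two components are trained separately.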

Learning Algorithms: Due to the nature of BDL, practical learning algorithms need to meet these criteria:

  1. They should be online algorithms in order to scale well for large datasets.

  2. They should be efficient enough to scale linearly with the number of free parameters in the perception component.

Criterion (1) implies that conventional variational inference or MCMC methods are not applicable. Usually an online version of them is needed [27]. Most SGD-based methods do not work either unless only MAP inference (as opposed to Bayesian treatments) is performed. Criterion (2) is needed because there are typically a large number of free parameters in the perception component. This means methods based on Laplace approximation [37] are not realistic since they involve the computation of a Hessian matrix that scales quadratically with the number of free parameters.
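The quadratic cost behind Criterion (2) is easy to quantify with a back-of-the-envelope calculation. The helper below is a hypothetical illustration that assumes a dense Hessian stored in single precision (4 bytes per entry).

```python
def hessian_storage_bytes(num_params, bytes_per_entry=4):
    """Memory needed to store a dense num_params-by-num_params Hessian."""
    return num_params * num_params * bytes_per_entry

for n in (10**3, 10**6):
    gigabytes = hessian_storage_bytes(n) / 1e9
    print(f"{n} parameters: {gigabytes} GB")
```

Under these assumptions a perception component with one million free parameters would already need about 4 TB just to hold the Hessian, whereas a method that scales linearly would need only a few megabytes.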

4.2 Bayesian Deep Learning for Recommender Systems

Despite the successful applications of deep learning in natural language processing and computer vision, very few attempts have been made to develop deep learning models for CF. One early work uses restricted Boltzmann machines instead of the conventional matrix factorization formulation to perform CF, and [17] extends this work by incorporating user-user and item-item correlations. Although these methods involve both deep learning and CF, they actually belong to CF-based methods because, unlike CTR [55], they do not incorporate content information, which is crucial for accurate recommendation. [47] uses low-rank matrix factorization in the last weight layer of a deep network to significantly reduce the number of model parameters and speed up training, but it is for classification instead of recommendation tasks. On music recommendation, [41, 61] directly use conventional CNN or deep belief networks (DBN) to assist representation learning for content information, but the deep learning components of their models are deterministic without modeling the noise and hence they are less robust. These models achieve their performance boost mainly through loosely coupled methods, without exploiting the interaction between content information and ratings. Besides, the CNN is linked directly to the rating matrix, which means the models will perform poorly due to serious overfitting when the ratings are sparse.

4.2.1 Collaborative Deep Learning

To address the challenges above, a hierarchical Bayesian model called collaborative deep learning (CDL) as a novel tightly coupled method for RS is introduced in [60]. Based on a Bayesian formulation of SDAE, CDL tightly couples deep representation learning for the content information and collaborative filtering for the rating (feedback) matrix, allowing two-way interaction between the two. Experiments show that CDL significantly outperforms the state of the art.

In the following text, we will start with the introduction of the notation used during our presentation of CDL. After that we will review the design and learning of CDL.

Notation and Problem Formulation: Similar to the work in [55], the recommendation task considered in CDL takes implicit feedback [28] as the training and test data. The entire collection of $J$ items (articles or movies) is represented by a $J$-by-$B$ matrix $\mathbf{X}_c$, where row $j$ is the bag-of-words vector $\mathbf{X}_{c,j*}$ for item $j$ based on a vocabulary of size $B$. With $I$ users, we define an $I$-by-$J$ binary rating matrix $\mathbf{R} = [R_{ij}]_{I \times J}$. For example, in the dataset citeulike-a [55, 57, 60], $R_{ij} = 1$ if user $i$ has article $j$ in his or her personal library and $R_{ij} = 0$ otherwise. Given part of the ratings in $\mathbf{R}$ and the content information $\mathbf{X}_c$, the problem is to predict the other ratings in $\mathbf{R}$. Note that although CDL in its current form focuses on movie recommendation (where plots of movies are considered as content information) and article recommendation like [55] in this section, it is general enough to handle other recommendation tasks (e.g., tag recommendation).

The matrix $\mathbf{X}_c$ plays the role of clean input to the SDAE while the noise-corrupted matrix, also a $J$-by-$B$ matrix, is denoted by $\mathbf{X}_0$. The output of layer $l$ of the SDAE is denoted by $\mathbf{X}_l$, which is a $J$-by-$K_l$ matrix. Similar to $\mathbf{X}_c$, row $j$ of $\mathbf{X}_l$ is denoted by $\mathbf{X}_{l,j*}$. $\mathbf{W}_l$ and $\mathbf{b}_l$ are the weight matrix and bias vector, respectively, of layer $l$, $\mathbf{W}_{l,*n}$ denotes column $n$ of $\mathbf{W}_l$, and $L$ is the number of layers. For convenience, we use $\mathbf{W}^+$ to denote the collection of all layers of weight matrices and biases. Note that an $L/2$-layer SDAE corresponds to an $L$-layer network.

Fig. 8: On the left is the graphical model of CDL. The part inside the dashed rectangle represents an SDAE. An example SDAE with $L = 2$ is shown. On the right is the graphical model of the degenerated CDL. The part inside the dashed rectangle represents the encoder of an SDAE. An example SDAE with $L = 2$ is shown on its right. Note that although $L$ is still $2$, the decoder of the SDAE vanishes. To prevent clutter, we omit all variables $\mathbf{x}_l$ except $\mathbf{x}_0$ and $\mathbf{x}_{L/2}$ in the graphical models.

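Generalized Bayesian SDAE: Following the introduction of SDAE in Section 2.2, if we assume that both the clean input $\mathbf{X}_c$ and the corrupted input $\mathbf{X}_0$ are observed, similar to [3, 37, 2, 9], we can define the following generative process of generalized Bayesian SDAE:

  1. For each layer $l$ of the SDAE network,

    1. For each column $n$ of the weight matrix $\mathbf{W}_l$, draw $\mathbf{W}_{l,*n} \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I}_{K_l})$.

    2. Draw the bias vector $\mathbf{b}_l \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I}_{K_l})$.

    3. For each row $j$ of $\mathbf{X}_l$, draw

       $$\mathbf{X}_{l,j*} \sim \mathcal{N}(\sigma(\mathbf{X}_{l-1,j*}\mathbf{W}_l + \mathbf{b}_l), \lambda_s^{-1}\mathbf{I}_{K_l}). \qquad (8)$$

  2. For each item $j$, draw a clean input $\mathbf{X}_{c,j*} \sim \mathcal{N}(\mathbf{X}_{L,j*}, \lambda_n^{-1}\mathbf{I}_B)$. (Footnote 5: Note that while generation of the clean input $\mathbf{X}_c$ from $\mathbf{X}_L$ is part of the generative process of the Bayesian SDAE, generation of the noise-corrupted input $\mathbf{X}_0$ from $\mathbf{X}_c$ is an artificial noise injection process to help the SDAE learn a more robust feature representation.)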
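The generative process above can be sketched as ancestral sampling in NumPy. This is only an illustrative sketch under stated assumptions: zero-mean Gaussian priors with precision `lam_w` on weights and biases, Gaussian noise with precision `lam_s` around the sigmoid activation of each layer, and Gaussian noise with precision `lam_n` around the last layer for the clean input. The function and argument names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sample_bayesian_sdae(X0, layer_sizes, lam_w, lam_s, lam_n, rng):
    """Ancestral sampling of weights, layer outputs, and the clean input.

    X0 is the (items, vocabulary) corrupted input. layer_sizes lists the
    width of every layer; the last width must equal X0.shape[1].
    """
    X_prev = X0
    weights, biases, layers = [], [], []
    for K_l in layer_sizes:
        K_prev = X_prev.shape[1]
        # i.i.d. Gaussian entries are equivalent to drawing each column
        # from a zero-mean Gaussian with covariance (1 / lam_w) * I.
        W = rng.normal(0.0, lam_w ** -0.5, size=(K_prev, K_l))
        b = rng.normal(0.0, lam_w ** -0.5, size=K_l)
        mean = sigmoid(X_prev @ W + b)
        # Each row of the layer output is noisy around its activation.
        X_l = mean + rng.normal(0.0, lam_s ** -0.5, size=mean.shape)
        weights.append(W)
        biases.append(b)
        layers.append(X_l)
        X_prev = X_l
    # The clean input is drawn around the output of the last layer.
    Xc = X_prev + rng.normal(0.0, lam_n ** -0.5, size=X_prev.shape)
    return weights, biases, layers, Xc
```

With a very large `lam_s`, the sampled layer outputs become numerically indistinguishable from the deterministic sigmoid activations, which mirrors the degenerate case discussed in the text.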

Note that if $\lambda_s$ goes to infinity, the Gaussian distribution in Equation (8) will become a Dirac delta distribution [52] centered at $\sigma(\mathbf{X}_{l-1,j*}\mathbf{W}_l + \mathbf{b}_l)$, where $\sigma(\cdot)$ is the sigmoid function. The model will degenerate to be a Bayesian formulation of SDAE. That is why we call it generalized SDAE.

Note that the first $L/2$ layers of the network act as an encoder and the last $L/2$ layers act as a decoder. Maximization of the posterior probability is equivalent to minimization of the reconstruction error with weight decay taken into consideration.

Collaborative Deep Learning: Using the Bayesian SDAE as a component, the generative process of CDL is defined as follows:

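  1. For each layer $l$ of the SDAE network,

    1. For each column $n$ of the weight matrix $\mathbf{W}_l$, draw $\mathbf{W}_{l,*n} \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I}_{K_l})$.

    2. Draw the bias vector $\mathbf{b}_l \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I}_{K_l})$.

    3. For each row $j$ of $\mathbf{X}_l$, draw $\mathbf{X}_{l,j*} \sim \mathcal{N}(\sigma(\mathbf{X}_{l-1,j*}\mathbf{W}_l + \mathbf{b}_l), \lambda_s^{-1}\mathbf{I}_{K_l})$.

  2. For each item $j$,

    1. Draw a clean input $\mathbf{X}_{c,j*} \sim \mathcal{N}(\mathbf{X}_{L,j*}, \lambda_n^{-1}\mathbf{I}_B)$.

    2. Draw the latent item offset vector $\boldsymbol{\epsilon}_j \sim \mathcal{N}(\mathbf{0}, \lambda_v^{-1}\mathbf{I}_K)$ and then set the latent item vector: $\mathbf{v}_j = \boldsymbol{\epsilon}_j + \mathbf{X}_{\frac{L}{2},j*}^T$.

  3. Draw a latent user vector for each user $i$: $\mathbf{u}_i \sim \mathcal{N}(\mathbf{0}, \lambda_u^{-1}\mathbf{I}_K)$.

  4. Draw a rating $R_{ij}$ for each user-item pair $(i, j)$: $R_{ij} \sim \mathcal{N}(\mathbf{u}_i^T\mathbf{v}_j, C_{ij}^{-1})$.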

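Here $\lambda_w$, $\lambda_n$, $\lambda_u$, $\lambda_s$, and $\lambda_v$ are hyperparameters and $C_{ij}$ is a confidence parameter similar to that for CTR [55] ($C_{ij} = a$ if $R_{ij} = 1$ and $C_{ij} = b$ otherwise). Note that the middle layer $\mathbf{X}_{L/2}$ serves as a bridge between the ratings and content information. This middle layer, along with the latent offset $\boldsymbol{\epsilon}_j$, is the key that enables CDL to simultaneously learn an effective feature representation and capture the similarity and (implicit) relationship between items (and users). Similar to the generalized SDAE, for computational efficiency, we can also take $\lambda_s$ to infinity.

The graphical model of CDL when $\lambda_s$ approaches positive infinity is shown in Figure 8, where, for notational simplicity, we use $\mathbf{x}_0$, $\mathbf{x}_{L/2}$, and $\mathbf{x}_L$ in place of $\mathbf{X}_{0,j*}^T$, $\mathbf{X}_{\frac{L}{2},j*}^T$, and $\mathbf{X}_{L,j*}^T$, respectively.

Note that according to the definition in Section 4.1, here the perception variables $\boldsymbol{\Omega}_p = \{\{\mathbf{W}_l\}, \{\mathbf{b}_l\}, \{\mathbf{X}_l\}, \mathbf{X}_c\}$, the hinge variables $\boldsymbol{\Omega}_h = \{\mathbf{V}\}$, and the task variables $\boldsymbol{\Omega}_t = \{\mathbf{U}, \mathbf{R}\}$.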
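The rating side of the model can be sketched as follows. The sketch assumes that the confidence takes one value for observed entries and another for the rest, that each latent item vector is the middle-layer encoding plus a zero-mean Gaussian offset with precision `lam_v`, and that latent user vectors have a zero-mean Gaussian prior with precision `lam_u`. The default values of `a` and `b` and all function names are hypothetical.

```python
import numpy as np


def confidence_matrix(R, a=1.0, b=0.01):
    """Confidence a for observed entries (R == 1) and b otherwise."""
    return np.where(R == 1, a, b)


def sample_latent_vectors(X_mid, num_users, lam_u, lam_v, rng):
    """Draw item offsets, item vectors, and user vectors.

    X_mid is the (items, K) middle-layer encoding of the items.
    """
    num_items, K = X_mid.shape
    offsets = rng.normal(0.0, lam_v ** -0.5, size=(num_items, K))
    V = X_mid + offsets  # latent item vector = encoding + offset
    U = rng.normal(0.0, lam_u ** -0.5, size=(num_users, K))
    return U, V, offsets
```

The offset is what lets an item vector deviate from its content-based encoding when the ratings provide evidence that the content alone does not explain.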

Learning: Based on the CDL model above, all parameters could be treated as random variables so that fully Bayesian methods such as Markov chain Monte Carlo (MCMC) or variational approximation methods [30] may be applied. However, such treatment typically incurs high computational cost. Consequently, CDL uses an EM-style algorithm for obtaining the MAP estimates, as in [55].

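Like in CTR [55], maximizing the posterior probability is equivalent to maximizing the joint log-likelihood of $\mathbf{U}$, $\mathbf{V}$, $\{\mathbf{X}_l\}$, $\mathbf{X}_c$, $\{\mathbf{W}_l\}$, $\{\mathbf{b}_l\}$, and $\mathbf{R}$ given $\lambda_u$, $\lambda_v$, $\lambda_w$, $\lambda_s$, and $\lambda_n$:

$$\mathcal{L} = -\frac{\lambda_u}{2}\sum_i \|\mathbf{u}_i\|_2^2 - \frac{\lambda_w}{2}\sum_l \left(\|\mathbf{W}_l\|_F^2 + \|\mathbf{b}_l\|_2^2\right) - \frac{\lambda_v}{2}\sum_j \|\mathbf{v}_j - \mathbf{X}_{\frac{L}{2},j*}^T\|_2^2 - \frac{\lambda_n}{2}\sum_j \|\mathbf{X}_{L,j*} - \mathbf{X}_{c,j*}\|_2^2 - \frac{\lambda_s}{2}\sum_l \sum_j \|\sigma(\mathbf{X}_{l-1,j*}\mathbf{W}_l + \mathbf{b}_l) - \mathbf{X}_{l,j*}\|_2^2 - \sum_{i,j} \frac{C_{ij}}{2}(R_{ij} - \mathbf{u}_i^T\mathbf{v}_j)^2.$$

If $\lambda_s$ goes to infinity, the likelihood becomes:

$$\mathcal{L} = -\frac{\lambda_u}{2}\sum_i \|\mathbf{u}_i\|_2^2 - \frac{\lambda_w}{2}\sum_l \left(\|\mathbf{W}_l\|_F^2 + \|\mathbf{b}_l\|_2^2\right) - \frac{\lambda_v}{2}\sum_j \|\mathbf{v}_j - f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T\|_2^2 - \frac{\lambda_n}{2}\sum_j \|f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+) - \mathbf{X}_{c,j*}\|_2^2 - \sum_{i,j} \frac{C_{ij}}{2}(R_{ij} - \mathbf{u}_i^T\mathbf{v}_j)^2,$$

where the encoder function $f_e(\cdot, \mathbf{W}^+)$ takes the corrupted content vector $\mathbf{X}_{0,j*}$ of item $j$ as input and computes the encoding of the item, and the function $f_r(\cdot, \mathbf{W}^+)$ also takes $\mathbf{X}_{0,j*}$ as input, computes the encoding and then reconstructs the content vector of item $j$. For example, if the number of layers $L = 6$, $f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)$ is the output of the third layer while $f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+)$ is the output of the sixth layer.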
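As a rough illustration, the degenerate objective can be evaluated in NumPy as below. The sketch assumes the objective is the negative sum of five penalties: a user-vector regularizer, weight decay, the squared distance between item vectors and encodings, the squared reconstruction error, and the confidence-weighted squared rating error. The encoder output is taken to be the middle layer and the reconstruction the last layer. All function and argument names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X0, weights, biases):
    """Outputs of every layer of a sigmoid network applied to X0."""
    outputs, h = [], X0
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)
        outputs.append(h)
    return outputs


def cdl_objective(U, V, R, C, X0, Xc, weights, biases,
                  lam_u, lam_v, lam_w, lam_n):
    """U: (users, K), V: (items, K), R and C: (users, items)."""
    outputs = forward(X0, weights, biases)
    num_layers = len(weights)
    encoding = outputs[num_layers // 2 - 1]  # middle layer (encoder output)
    reconstruction = outputs[-1]             # last layer (decoder output)
    user_reg = -0.5 * lam_u * np.sum(U ** 2)
    weight_decay = -0.5 * lam_w * sum(
        np.sum(W ** 2) + np.sum(b ** 2) for W, b in zip(weights, biases))
    item_term = -0.5 * lam_v * np.sum((V - encoding) ** 2)
    recon_term = -0.5 * lam_n * np.sum((reconstruction - Xc) ** 2)
    rating_term = -0.5 * np.sum(C * (R - U @ V.T) ** 2)
    return user_reg + weight_decay + item_term + recon_term + rating_term
```

Increasing `lam_n` relative to `lam_v` puts more weight on reconstruction than on matching the item vectors, which is the trade-off between the two networks discussed in the text.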

Fig. 9: NN representation for degenerated CDL.

From the perspective of optimization, the third term in the objective function above is equivalent to a multi-layer perceptron using the latent item vectors as the target while the fourth term is equivalent to an SDAE minimizing the reconstruction error. From the view of neural networks (NN), when $\lambda_s$ approaches positive infinity, training of the probabilistic graphical model of CDL in Figure 8 (left) would degenerate to simultaneously training two neural networks overlaid together with a common input layer (the corrupted input) but different output layers, as shown in Figure 9. Note that the second network is much more complex than typical neural networks due to the involvement of the rating matrix.

When the ratio $\lambda_n/\lambda_v$ approaches positive infinity, it will degenerate to a two-step model in which the latent representation learned using SDAE is put directly into the CTR. Another extreme happens when $\lambda_n/\lambda_v$ goes to zero where the decoder of the SDAE essentially vanishes. On the right of Figure 8 is the graphical model of the degenerated CDL when $\lambda_n/\lambda_v$ goes to zero. As demonstrated in the experiments, the predictive performance will suffer greatly for both extreme cases [60].

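For $\mathbf{u}_i$ and $\mathbf{v}_j$, block coordinate descent similar to [55, 28] is used. Given the current $\mathbf{W}^+$, we compute the gradients of $\mathcal{L}$ with respect to $\mathbf{u}_i$ and $\mathbf{v}_j$ and then set them to zero, leading to the following update rules:

$$\mathbf{u}_i \leftarrow (\mathbf{V}\mathbf{C}_i\mathbf{V}^T + \lambda_u\mathbf{I}_K)^{-1}\mathbf{V}\mathbf{C}_i\mathbf{R}_i,$$

$$\mathbf{v}_j \leftarrow (\mathbf{U}\mathbf{C}_j\mathbf{U}^T + \lambda_v\mathbf{I}_K)^{-1}(\mathbf{U}\mathbf{C}_j\mathbf{R}_j + \lambda_v f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T),$$

where $\mathbf{U} = (\mathbf{u}_i)_{i=1}^I$, $\mathbf{V} = (\mathbf{v}_j)_{j=1}^J$, $\mathbf{C}_i = \mathrm{diag}(C_{i1}, \ldots, C_{iJ})$ is a diagonal matrix, $\mathbf{R}_i = (R_{i1}, \ldots, R_{iJ})^T$ is a column vector containing all the ratings of user $i$, and $C_{ij}$ reflects the confidence controlled by $a$ and $b$ as discussed in [28]. $\mathbf{C}_j$ and $\mathbf{R}_j$ are defined similarly for item $j$.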
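A minimal NumPy sketch of one block coordinate descent step is given below. It assumes the per-user and per-item subproblems are confidence-weighted ridge regressions, with the item subproblem regularized toward the encoder output rather than toward zero. Latent vectors are stored as columns, the diagonal confidence matrix is passed as a vector, and the function names are hypothetical.

```python
import numpy as np


def update_user(V, C_i, R_i, lam_u):
    """V: (K, items) item vectors as columns; C_i, R_i: length-items vectors."""
    K = V.shape[0]
    # V * C_i scales column j of V by C_i[j], i.e. V @ diag(C_i).
    A = (V * C_i) @ V.T + lam_u * np.eye(K)
    return np.linalg.solve(A, V @ (C_i * R_i))


def update_item(U, C_j, R_j, lam_v, encoding_j):
    """U: (K, users) user vectors as columns; encoding_j: length-K encoding."""
    K = U.shape[0]
    A = (U * C_j) @ U.T + lam_v * np.eye(K)
    return np.linalg.solve(A, U @ (C_j * R_j) + lam_v * encoding_j)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse. Each update zeroes the gradient of the corresponding subproblem, which can be checked numerically for small random inputs.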

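Given $\mathbf{U}$ and $\mathbf{V}$, we can learn the weights $\mathbf{W}_l$ and biases $\mathbf{b}_l$ for each layer using the back-propagation learning algorithm. The gradients of the likelihood with respect to $\mathbf{W}_l$ and $\mathbf{b}_l$ are as follows:

$$\nabla_{\mathbf{W}_l}\mathcal{L} = -\lambda_w\mathbf{W}_l - \lambda_v\sum_j \nabla_{\mathbf{W}_l} f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T \left(f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T - \mathbf{v}_j\right) - \lambda_n\sum_j \nabla_{\mathbf{W}_l} f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+) \left(f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+) - \mathbf{X}_{c,j*}\right),$$

$$\nabla_{\mathbf{b}_l}\mathcal{L} = -\lambda_w\mathbf{b}_l - \lambda_v\sum_j \nabla_{\mathbf{b}_l} f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T \left(f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T - \mathbf{v}_j\right) - \lambda_n\sum_j \nabla_{\mathbf{b}_l} f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+) \left(f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+) - \mathbf{X}_{c,j*}\right).$$

By alternating the update of $\mathbf{U}$, $\mathbf{V}$, $\mathbf{W}_l$, and $\mathbf{b}_l$, we can find a local optimum for $\mathcal{L}$. Several commonly used techniques such as using a momentum term may be applied to alleviate the local optimum problem.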

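Prediction: Let $D$ be the observed test data. Similar to [55], CDL uses the point estimates of $\mathbf{u}_i$, $\mathbf{W}^+$ and $\boldsymbol{\epsilon}_j$ to calculate the predicted rating:

$$E[R_{ij} \mid D] \approx E[\mathbf{u}_i \mid D]^T \left(E[f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T \mid D] + E[\boldsymbol{\epsilon}_j \mid D]\right),$$

where $E[\cdot]$ denotes the expectation operation. In other words, we approximate the predicted rating as:

$$R_{ij}^* \approx (\mathbf{u}_i^*)^T \left(f_e(\mathbf{X}_{0,j*}, \mathbf{W}^{+*})^T + \boldsymbol{\epsilon}_j^*\right) = (\mathbf{u}_i^*)^T\mathbf{v}_j^*.$$

Note that for any new item $j$ with no rating in the training data, its offset $\boldsymbol{\epsilon}_j^*$ will be $\mathbf{0}$.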
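The prediction step can be sketched in a few lines. The sketch assumes point estimates of the user vectors, the item encodings, and the item offsets are already available as arrays, and it treats a missing offset array as the all-zero offsets of new items. The function names are hypothetical.

```python
import numpy as np


def predict_ratings(U, encodings, offsets=None):
    """U: (users, K); encodings and offsets: (items, K).

    Passing offsets=None models new items, whose offsets are zero.
    """
    if offsets is None:
        offsets = np.zeros_like(encodings)
    V = encodings + offsets  # point estimate of the latent item vectors
    return U @ V.T


def top_items(scores_row, m):
    """Indices of the m highest-scoring items for one user."""
    return np.argsort(-scores_row)[:m]
```

For a new item the prediction relies on content alone, since only the encoder output contributes to its latent vector.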

In the following text, we provide several extensions of CDL from different perspectives.

4.2.2 Bayesian Collaborative Deep Learning

Besides the MAP estimates, a sampling-based algorithm for the Bayesian treatment of CDL is also proposed in [60]. This algorithm turns out to be a Bayesian and generalized version of the well-known back-propagation (BP) learning algorithm. We list the key conditional densities as follows:

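For $\mathbf{W}^+$: We denote the concatenation of $\mathbf{W}_{l,*n}$ and $b_l^{(n)}$ as $\mathbf{W}_{l,*n}^+$. Similarly, the concatenation of $\mathbf{X}_{l,j*}$ and $1$ is denoted as $\mathbf{X}_{l,j*}^+$. The subscripts of $\mathbf{I}$ are ignored. Then

$$p(\mathbf{W}_{l,*n}^+ \mid \mathbf{X}_{l-1}, \mathbf{X}_{l,*n}, \lambda_s) \propto \mathcal{N}(\mathbf{W}_{l,*n}^+ \mid \mathbf{0}, \lambda_w^{-1}\mathbf{I})\, \mathcal{N}(\mathbf{X}_{l,*n} \mid \sigma(\mathbf{X}_{l-1}^+\mathbf{W}_{l,*n}^+), \lambda_s^{-1}\mathbf{I}).$$

For $\mathbf{X}_{l,j*}$ ($l \neq L/2$): Similarly, we denote the concatenation of $\mathbf{W}_l$ and $\mathbf{b}_l$ as $\mathbf{W}_l^+$ and have

$$p(\mathbf{X}_{l,j*} \mid \mathbf{W}_l^+, \mathbf{W}_{l+1}^+, \mathbf{X}_{l-1,j*}, \mathbf{X}_{l+1,j*}, \lambda_s) \propto \mathcal{N}(\mathbf{X}_{l,j*} \mid \sigma(\mathbf{X}_{l-1,j*}^+\mathbf{W}_l^+), \lambda_s^{-1}\mathbf{I})\, \mathcal{N}(\mathbf{X}_{l+1,j*} \mid \sigma(\mathbf{X}_{l,j*}^+\mathbf{W}_{l+1}^+), \lambda_s^{-1}\mathbf{I}).$$

Note that for the last layer ($l = L$) the second Gaussian would be $\mathcal{N}(\mathbf{X}_{c,j*} \mid \mathbf{X}_{l,j*}, \lambda_n^{-1}\mathbf{I})$ instead.

For $\mathbf{X}_{l,j*}$ ($l = L/2$): Similarly, we have

$$p(\mathbf{X}_{l,j*} \mid \mathbf{W}_l^+, \mathbf{W}_{l+1}^+, \mathbf{X}_{l-1,j*}, \mathbf{X}_{l+1,j*}, \lambda_s, \lambda_v, \mathbf{v}_j) \propto \mathcal{N}(\mathbf{X}_{l,j*} \mid \sigma(\mathbf{X}_{l-1,j*}^+\mathbf{W}_l^+), \lambda_s^{-1}\mathbf{I})\, \mathcal{N}(\mathbf{X}_{l+1,j*} \mid \sigma(\mathbf{X}_{l,j*}^+\mathbf{W}_{l+1}^+), \lambda_s^{-1}\mathbf{I})\, \mathcal{N}(\mathbf{v}_j \mid \mathbf{X}_{l,j*}^T, \lambda_v^{-1}\mathbf{I}).$$

For $\mathbf{v}_j$: The posterior

$$p(\mathbf{v}_j \mid \mathbf{X}_{\frac{L}{2},j*}, \mathbf{R}_{*j}, \mathbf{C}_{*j}, \lambda_v, \mathbf{U}) \propto \mathcal{N}(\mathbf{v}_j \mid \mathbf{X}_{\frac{L}{2},j*}^T, \lambda_v^{-1}\mathbf{I}) \prod_i \mathcal{N}(R_{ij} \mid \mathbf{u}_i^T\mathbf{v}_j, C_{ij}^{-1}).$$

For $\mathbf{u}_i$: The posterior

$$p(\mathbf{u}_i \mid \mathbf{R}_{i*}, \mathbf{V}, \lambda_u, \mathbf{C}_{i*}) \propto \mathcal{N}(\mathbf{u}_i \mid \mathbf{0}, \lambda_u^{-1}\mathbf{I}) \prod_j \mathcal{N}(R_{ij} \mid \mathbf{u}_i^T\mathbf{v}_j, C_{ij}^{-1}).$$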
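For the latent user vector, the conditional can be sampled in closed form. The sketch below assumes a zero-mean Gaussian prior with precision `lam_u` and Gaussian rating likelihoods with per-entry precision given by the confidence values; under these assumptions the conditional is itself Gaussian, with a precision matrix equal to the prior precision plus the confidence-weighted sum of outer products of the item vectors. The function names are hypothetical, and this is not the sampler of [60] for the network weights.

```python
import numpy as np


def user_conditional(V, C_i, R_i, lam_u):
    """Mean and covariance of the Gaussian conditional of one user vector.

    V: (K, items) item vectors as columns; C_i, R_i: length-items vectors.
    """
    K = V.shape[0]
    precision = lam_u * np.eye(K) + (V * C_i) @ V.T
    cov = np.linalg.inv(precision)
    cov = 0.5 * (cov + cov.T)  # remove tiny asymmetries from rounding
    mean = cov @ (V @ (C_i * R_i))
    return mean, cov


def sample_user_vector(V, C_i, R_i, lam_u, rng):
    """One Gibbs draw of the user vector from its Gaussian conditional."""
    mean, cov = user_conditional(V, C_i, R_i, lam_u)
    return rng.multivariate_normal(mean, cov)
```

The conditional mean coincides with the MAP update for the user vector, so the sampler can be read as the MAP step plus noise whose covariance reflects the remaining uncertainty.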

Interestingly, if $\lambda_s$ goes to infinity and adaptive rejection Metropolis sampling (which involves using the gradients of the objective function to approximate the proposal distribution) is used, the sampling for $\mathbf{W}^+$ turns out to be a Bayesian generalized version of BP. Specifically, as Figure 10 shows, after getting the gradient of the loss function at one point (the red dashed line on the left), the next sample would be drawn in the region under that line, which is equivalent to a probabilistic version of BP. If a sample is above the curve of the loss function, a new tangent line (the black dashed line on the right) would be added to better approximate the distribution corresponding to the loss function. After that, samples would be drawn from the region under both lines. During the sampling, besides searching for local optima using the gradients (MAP), the algorithm also takes the variance into consideration. That is why it is called Bayesian generalized back-propagation.

Fig. 10: Sampling as generalized BP.

4.2.3 Marginalized Collaborative Deep Learning

In SDAE, the corrupted input goes through encoding and decoding to recover the clean input. Usually, different epochs of training use different corrupted versions as input. Hence, SDAE generally needs to go through enough epochs of training to see sufficient corrupted versions of the input. Marginalized SDAE (mSDAE) [10] seeks to avoid this by marginalizing out the corrupted input and obtaining closed-form solutions directly. In this sense, mSDAE is more computationally efficient than SDAE.
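To illustrate what marginalizing out the corruption can look like, the sketch below computes the closed-form weights of a linear one-layer denoiser. It rests on assumptions that are not stated in this survey: the corruption zeroes each feature independently with probability `p`, the loss is the squared reconstruction error, and a small ridge term is added for numerical stability. Under these assumptions the expected second moments of the corrupted data are available analytically, so no corrupted copies need to be sampled. The function name is hypothetical, and this is not the mCDL objective of [35].

```python
import numpy as np


def marginalized_dae_weights(X, p, ridge=1e-5):
    """Closed-form linear denoiser under feature-dropping noise.

    X: (features, samples) clean data; p: probability of zeroing a feature.
    Returns W such that W @ x_corrupted approximates x in expectation.
    """
    d = X.shape[0]
    q = np.full(d, 1.0 - p)            # probability that a feature survives
    S = X @ X.T                        # scatter matrix of the clean data
    # Expected scatter of the corrupted data: off-diagonal entries need two
    # features to survive, diagonal entries need only one.
    Q = S * np.outer(q, q)
    np.fill_diagonal(Q, q * np.diag(S))
    # Expected cross term between clean and corrupted data.
    P = S * q                          # P[i, j] = S[i, j] * q[j]
    return P @ np.linalg.inv(Q + ridge * np.eye(d))
```

With `p = 0` there is no corruption and the solution is close to the identity map, which is a quick sanity check of the formulas.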

As mentioned in [35], using mSDAE instead of the Bayesian SDAE could lead to more efficient learning algorithms. For example, in [35], the objective when using a one-layer mSDAE can be written as follows:


where is the collection of different corrupted versions of (a -by- matrix) and is the -time repeated version of (also a -by- matrix). is the transformation matrix for item latent factors.

The solution for would be:

where and . A solver for the expectation in the equation above is provided in [10]. Note that this is a linear and one-layer case which can be generalized to the nonlinear and multi-layer case using the same techniques as in [10, 9].

As we can see, in marginalized CDL, the perception variables , the hinge variables , and the task variables .

4.2.4 Collaborative Deep Ranking

CDL assumes a collaborative filtering setting to model the ratings directly. However, the output of recommender systems is often a ranked list, which means it would be more natural to use ranking rather than ratings as the objective. With this motivation, collaborative deep ranking (CDR) is proposed [64] to jointly perform representation learning and collaborative ranking. The corresponding generative process is as follows:

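  1. For each layer $l$ of the SDAE network,

    1. For each column $n$ of the weight matrix $\mathbf{W}_l$, draw $\mathbf{W}_{l,*n} \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I}_{K_l})$.

    2. Draw the bias vector $\mathbf{b}_l \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I}_{K_l})$.

    3. For each row $j$ of $\mathbf{X}_l$, draw $\mathbf{X}_{l,j*} \sim \mathcal{N}(\sigma(\mathbf{X}_{l-1,j*}\mathbf{W}_l + \mathbf{b}_l), \lambda_s^{-1}\mathbf{I}_{K_l})$.

  2. For each item $j$,

    1. Draw a clean input $\mathbf{X}_{c,j*} \sim \mathcal{N}(\mathbf{X}_{L,j*}, \lambda_n^{-1}\mathbf{I}_B)$.

    2. Draw a latent item offset vector $\boldsymbol{\epsilon}_j \sim \mathcal{N}(\mathbf{0}, \lambda_v^{-1}\mathbf{I}_K)$ and then set the latent item vector to be: $\mathbf{v}_j = \boldsymbol{\epsilon}_j + \mathbf{X}_{\frac{L}{2},j*}^T$.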

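  3. For each user $i$,

    1. Draw a latent user vector: $\mathbf{u}_i \sim \mathcal{N}(\mathbf{0}, \lambda_u^{-1}\mathbf{I}_K)$.

    2. For each pair-wise preference $(j, k) \in \mathcal{P}_i$, where $\mathcal{P}_i = \{(j, k) : R_{ij} - R_{ik} > 0\}$, draw the preference: $\Delta_{ijk} \sim \mathcal{N}(\mathbf{u}_i^T\mathbf{v}_j - \mathbf{u}_i^T\mathbf{v}_k, C_{ijk}^{-1})$.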