Epoch-evolving Gaussian Process Guided Learning

06/25/2020 ∙ by Jiabao Cui, et al. ∙ Zhejiang University

In this paper, we propose a novel learning scheme called epoch-evolving Gaussian Process Guided Learning (GPGL), which aims at characterizing the correlation information between the batch-level distribution and the global data distribution. Such correlation information is encoded as context labels and is renewed every epoch. With the guidance of both the context label and the ground truth label, the GPGL scheme provides more efficient optimization by updating the model parameters with a triangle consistency loss. Furthermore, the GPGL scheme can be generalized and naturally applied to current deep models, remarkably outperforming existing batch-based state-of-the-art models on mainstream datasets (CIFAR-10, CIFAR-100, and Tiny-ImageNet).




1 Introduction

Recent years have witnessed great developments in deep learning across a wide range of applications. Due to computational resource limits, model training has to rely on the mini-batch stochastic gradient descent (SGD Bottou (1998)) algorithm for iterative model learning over a sequence of epochs, each of which corresponds to a collection of randomly generated small sample batches. In the learning process, SGD asynchronously updates the model parameters with respect to time-varying sample batches, and thus captures only the local batch-level distribution information, resulting in the “zig-zag” effect in the optimization process Qian (1999). Therefore, it usually requires a large number of epoch iterations for sufficient model learning, which essentially takes a bottom-up learning pipeline from local batches to the global data distribution. Such a pipeline cannot effectively balance the correlation between batch-level distributions and the global data distribution for the sequentially-added sample batches within different epochs.

Motivated by the above observation, we seek a possible solution to enhance the bottom-up learning pipeline with a top-down strategy, which aims at approximately encoding the global data distribution information as a class distribution Laurikkala (2001) by a nonparametric learning model Roy et al. (2018). As a result, we have a hybrid learning pipeline that effectively combines the benefits of precise batch-level learning and global distribution-aware nonparametric modeling. More specifically, we propose an epoch-evolving Gaussian Process Guided Learning (GPGL) scheme that dynamically estimates the class distribution information for any sample with the evolution of epochs in a nonparametric learning manner. In each epoch, the proposed Gaussian process approach builds a class distribution regression model in the corresponding epoch-related feature space to estimate the class distribution, as a context label, for any sample, relative to a set of fixed class-aware anchor samples. In essence, this context label estimation corresponds to a contextual label propagation process, where the class distribution information from the class-aware anchor samples is dynamically propagated to the given samples through Gaussian process regression Rasmussen and Williams (2005). Subsequently, with the guidance of the propagated context label, the deep model can learn the class distribution information in the conventional learning pipeline. Hence, we have a triangle consistency loss function consisting of three learning components: 1) deep model prediction with the ground truth label; 2) deep model prediction with the context label; and 3) context label with the ground truth label. The triangle consistency loss function is jointly optimized for each epoch. After one epoch, the epoch-related feature space is accordingly updated from the latest deep model. Based on the updated feature space, the joint triangle consistency loss is optimized once again in the next epoch. The above learning process is repeated until convergence.

In principle, the epoch-evolving GPGL scheme takes into account the context dependency relationships between given samples and class-aware anchor samples, which effectively explores the global data distribution in a nonparametric modeling fashion. Such a distribution structure usually carries a rich body of contextual information that is capable of alleviating the “zig-zag” problem and meanwhile speeding up the convergence process. The joint triangle consistency loss seeks a good balance among deep learning features, deep model predictions, Gaussian process context labels, and the ground truth for each sample. In summary, the main contributions of this work are as follows:

  • We propose an epoch-evolving GPGL scheme that takes a nonparametric modeling strategy for estimating the context-aware class distribution information to guide the model learning process. Based on the Gaussian process approach, we set up a hybrid bottom-up and top-down learning pipeline for effective model learning with better convergence performance.

  • We present a joint triangle consistency loss function that is capable of achieving a good balance between batch-level learning and global distribution-aware nonparametric modeling. Experimental results on the benchmark datasets show that this work achieves state-of-the-art results.

Related work

Gaussian Processes (GPs) Rasmussen and Williams (2005) are a class of powerful and flexible Bayesian nonparametric probabilistic models. With the rapid development of deep learning, GPs have been generalized with multi-layer neural networks as Deep Gaussian Processes (DGPs) Damianou and Lawrence (2013); Bui et al. (2016). In DGPs, the mapping between layers is parameterized as GPs so that the uncertainty of predictions can be accurately estimated. On the other hand, variational inference, which approximates the posterior distribution over the latent variables, has been introduced to augment deep Gaussian processes, forming Variational Gaussian Processes (VGPs) Tran et al. (2015); Dai et al. (2015); Casale et al. (2018). However, whereas both DGPs and VGPs focus on the feature space, our work mainly applies GPs to seek a more suitable label space.

Optimization of deep learning models mainly focuses on training algorithms and hyper-parameter tuning He et al. (2016a); Huang et al. (2017); He et al. (2016b); Zagoruyko and Komodakis (2016). Roughly speaking, two classes of methods are commonly distinguished. The first class, e.g., SGD with Momentum (SGD-M Sutskever et al. (2013)) and Nesterov momentum Nesterov (1983), utilizes momentum to correct the current update direction. The other class designs adaptive learning-rate adjustment schemes, such as AdaGrad Duchi et al. (2011), RMSProp Hinton et al. (2012), and Adam Kingma and Ba (2014). However, we argue that these conventional training approaches focus on the batch-level data distribution in the optimization process without attending to the global data distribution. Motivated by this, we propose the epoch-evolving GPGL scheme to alleviate this concern by introducing contextual information of the dataset.

2 Method

2.1 Overview

In this section, we give an overview of the epoch-evolving Gaussian Process Guided Learning (GPGL) scheme. Given a dataset D = {(x_i, y_i)}_{i=1}^N, x_i is the input data and y_i is the corresponding label. Let z_i = h(x_i) be the feature extracted by the feature extractor h, and ŷ_i = f(x_i) be the label predicted by the deep model f. Since the conventional bottom-up deep learning pipeline usually applies a batch-based optimization paradigm, we denote ŷ_i, y_i, z_i as the prediction, ground truth, and feature of a single sample in a batch B. The routine optimization framework is interpreted as:

W_{t+1} = W_t − η ∇_W (1/|B|) Σ_{i∈B} L_c(ŷ_i, y_i),   (1)

where a batch B of samples is randomly picked and optimized with learning rate η, and L_c is the classification loss function.

The function f in the conventional DNN setting is uniquely determined by its weights W. Usually, we apply a stochastic gradient optimization algorithm based on a mini-batch in each iteration. A specific batch of samples, however, contains only a relatively limited amount of data compared with the whole dataset. Hence, the update of W is easily governed by the current batch, resulting in the “zig-zag” effect. In other words, we argue that the conventional batch-based optimization paradigm can easily be dominated by the current batch without respect to the global data distribution information.
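To make the batch-level update in Equation (1) concrete, here is a minimal numpy sketch of one mini-batch SGD step for a toy softmax classifier. The data, sizes, learning rate, and batch size are all illustrative assumptions for the example, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples, d features, C classes (all values illustrative).
N, d, C = 256, 8, 3
X = rng.normal(size=(N, d))
y = rng.integers(0, C, size=N)

W = np.zeros((d, C))          # model weights
eta, batch_size = 0.1, 32     # learning rate, |B|

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# One mini-batch SGD step: W <- W - eta * grad of mean cross-entropy over B.
idx = rng.choice(N, size=batch_size, replace=False)
p = softmax(X[idx] @ W)                    # batch predictions
p[np.arange(batch_size), y[idx]] -= 1.0    # dL/dlogits for cross-entropy
grad = X[idx].T @ p / batch_size
W = W - eta * grad
```

Because the gradient is estimated from one small random batch, each step reflects only that batch's distribution, which is exactly the source of the "zig-zag" behavior described above.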

Motivated by the above challenge, we explore class distribution predictions which encode the global data distribution. Inspired by conventional nonparametric generative models, we propose the GPGL scheme to make class distribution predictions, which we name context labels. As shown in Figure 1, our GPGL scheme consists of Gaussian process model construction and Gaussian process model guided learning. Based on the observed data, our Gaussian process model builds a joint probability distribution which embeds the global distribution of the data in a nonparametric way. The distribution of batch data is thus corrected under the guidance of the GP model, which largely mitigates the aforementioned concern. A question then arises: how do we encode the global data distribution with our GP model?

We first extract a subset A = {(x_a, y_a)}_{a=1}^m, called the anchor set, as a representative of the whole dataset, due to limited computational resources. The features of the anchor set are denoted Z_A = h(X_A). A Gaussian process (GP) is then built upon both the anchor set and an incoming sample x with feature z:

(Y_A, ŷ_c) ~ N(0, [[K(Z_A, Z_A), k(Z_A, z)], [k(z, Z_A), k(z, z)]]),   (2)

where ŷ_c denotes the context label of a sample, which embeds the annotations Y_A of the whole dataset, and k is the covariance function. The context label will further be used in our novel triangle loss. In the following sections, we describe the context label construction and the triangle loss in detail.

Figure 1: Overview of the Gaussian Process Guided Learning (GPGL) scheme. The GP model leverages the features and labels of the anchor set to approximately construct the global data distribution. The architecture of the feature extractor is the same as that of the deep model in the batch-level learning process; however, the parameters of the feature extractor in the GP model are updated at the epoch level. After construction, the GP model produces a context label for each sample and guides the deep model learning through the triangle loss function.

2.2 Context label construction

The one-hot vector is commonly used as supervision in conventional work for its simplicity. However, since such labels are encoded in a discrete form, every two different labels are orthogonal to each other, neglecting the relationships among label pairs. We point out that the labels themselves form a space, and an ideal label space should be both smooth and consistent with the feature space: if two samples are similar in feature space, the same similarity should hold in label space.

From the perspective of statistical learning, we seek an ideal label space that fulfills the above requirements based on the one-hot label distribution. The Gaussian process, which assumes that any collection of its random variables follows a joint multivariate Gaussian distribution, is thus applied to obtain such a label space. Given a label prior in one-hot format, our method offers two advantages: on the one hand, we infer a better form of label, which we term the context label; on the other hand, the context label can better guide representation learning.

The context label is estimated by our Gaussian process model with the help of the anchor set described in Section 2.1. The anchor set is an abstraction of the overall dataset which contains the contextual information between different classes. How to construct a context label that captures the correlation between a sample and the anchor set samples is a key challenge. Bayesian prediction is a powerful nonparametric method for making posterior predictions based on a prior distribution. Rethinking the conventional batch-by-batch learning pipeline, any finite collection of a sample's context label ŷ_c and anchor set labels Y_A follows a joint multivariate Gaussian distribution. Under this general assumption, we make a Bayesian prediction for any incoming sample x (with feature z) in the batch data, based on the features Z_A and one-hot labels Y_A of the anchor set A.

Following the above notation, p(ŷ_c, Y_A) is a multivariate normal distribution whose covariance matrix has the same form as in Equation (2). As mentioned above, the similarities of samples in feature space should be consistent with those in a smooth label space. How do we measure the distance between any two samples in feature space? Our Gaussian process model selects the RBF kernel to capture the similarity between different features:

k(z_i, z_j) = exp( − d(z_i, z_j)² / (2ℓ²) ),   (3)

where ℓ is a length-scale parameter Jylänki et al. (2011) and d is the Euclidean distance function.
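The RBF kernel above can be computed for batches of features as follows (a minimal numpy sketch; the feature values and length scale are illustrative):

```python
import numpy as np

def rbf_kernel(Z1, Z2, length_scale=1.0):
    """RBF kernel k(z_i, z_j) = exp(-||z_i - z_j||^2 / (2 * l^2)).

    Z1: (n, d) and Z2: (m, d) feature matrices; returns an (n, m) Gram matrix.
    """
    # Squared Euclidean distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (Z1**2).sum(1)[:, None] + (Z2**2).sum(1)[None, :] - 2.0 * Z1 @ Z2.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * length_scale**2))

Z = np.array([[0.0, 0.0], [3.0, 4.0]])   # two features at Euclidean distance 5
K = rbf_kernel(Z, Z, length_scale=5.0)
# Diagonal entries are 1 (zero distance); off-diagonal entries decay with distance.
```

A larger length scale ℓ makes distant features look more similar, which matches the paper's later observation that performance is stable for relatively large length-scale values.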

With this distance measure, our Gaussian process model builds the aforementioned context label for any sample. The context label encodes the correlation between a sample and the anchor set samples. In our Gaussian process model, the label information of the anchor samples is propagated by a Gaussian process regression (GPR) Rasmussen and Williams (2005) approach, ensuring the smoothness property of the label space through interpolation. The conditional distribution of the context label can therefore be computed analytically:

ŷ_c = k(z, Z_A) [K(Z_A, Z_A) + σ² I]⁻¹ Y_A,   (4)

Σ_c = k(z, z) − k(z, Z_A) [K(Z_A, Z_A) + σ² I]⁻¹ k(Z_A, z),   (5)

where I is the identity matrix. The mean of the distribution represents our context label, and the covariance controls to what extent we trust the context label.

In our Gaussian process model, the inference cost per sample is dominated by inverting the m × m covariance matrix, where m is the size of the anchor set, which is infeasible in practice for a large anchor set. We provide a class-aware anchor sampling mechanism that picks the near neighbors for each class by measuring the similarity between the mean features of the classes in feature space. With this mechanism, the anchor set approximately represents the global structure of the data distribution. The correlation between a sample and the anchor set samples is visualized in Figure 2(b).
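A minimal sketch of the context-label computation in Equation (4), i.e., the GPR posterior mean over one-hot anchor labels (numpy; the anchor features, labels, length scale, and noise term are illustrative assumptions):

```python
import numpy as np

def rbf(Z1, Z2, l=1.0):
    sq = (Z1**2).sum(1)[:, None] + (Z2**2).sum(1)[None, :] - 2.0 * Z1 @ Z2.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * l**2))

def context_label(z, Z_anchor, Y_anchor, l=1.0, noise=1e-3):
    """GPR posterior mean over one-hot anchor labels: the context label.

    z: (d,) query feature; Z_anchor: (m, d) anchor features;
    Y_anchor: (m, C) one-hot anchor labels.
    """
    K = rbf(Z_anchor, Z_anchor, l) + noise * np.eye(len(Z_anchor))
    k_star = rbf(z[None, :], Z_anchor, l)              # (1, m)
    return (k_star @ np.linalg.solve(K, Y_anchor))[0]  # (C,) context label

# Two anchor classes; a query feature near class 0 gets far more class-0 mass.
Z_a = np.array([[0.0, 0.0], [5.0, 5.0]])
Y_a = np.eye(2)
c = context_label(np.array([0.1, 0.0]), Z_a, Y_a)
```

Note that the resulting vector is a smooth, similarity-weighted combination of the anchor labels rather than a hard one-hot vector, which is exactly what makes it useful as a soft supervision signal.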

               (a) Triangle loss backward                                       (b) Class-aware anchor sample

Figure 2: (a) As shown by the red dashed lines, the losses between the model prediction and the ground truth and between the model prediction and the context label are used for updating all the parameters of the deep neural network, while the loss between the context label and the ground truth updates only the parameters of the feature extractor. (b) The context label of an image is estimated from its correlation with each class in the anchor set. The width of the red solid lines indicates the probability value of the context label for each class; wider lines show a stronger correlation.

2.3 Triangle consistency loss

The loss terms

With our Gaussian process (GP) model employed, there are two prediction terms for a sample x: the deep model prediction ŷ and the context label ŷ_c. With the ground truth label y, the conventional cross-entropy loss function between the model prediction and the ground truth for a sample is defined as:

L_1 = − Σ_k y_k log ŷ_k,   (6)

where ŷ_k is the model prediction for class k, and y_k is the ground truth label for class k. In addition to the conventional cross-entropy loss, we employ a KL divergence loss, defined in Equation (7), between the context label and the deep model prediction. This loss term aims to reduce the difference between the prediction of the deep model and the prediction of our GP model, and serves as a regularization term to prevent the deep model from overfitting:

L_2 = KL(ŷ_c ‖ ŷ) = Σ_k ŷ_{c,k} log( ŷ_{c,k} / ŷ_k ),   (7)

where ŷ_{c,k} is the value of the context label for class k. Our GP model produces the context label based on the feature space, and when the model is not yet well trained, the feature space is easily affected by the input data. We therefore use an extra loss term, the cross-entropy loss defined in Equation (8) between the context label and the ground truth, to constrain the transformation from data into feature space. The backpropagation of this loss flows only through the feature extractor; by minimizing it, the model updates the feature extractor's parameters and improves the feature representation:

L_3 = − Σ_k y_k log ŷ_{c,k}.   (8)

Our Gaussian process model is a nonparametric model built on the feature space. As indicated in Equation (4) and Equation (5), we build it by Gaussian process regression. In backpropagation, the loss function in Equation (8) affects only the convolutional layers used as the feature extractor, while the loss functions in Equations (6) and (7) act on all the model parameters, including convolutional layers and fully connected layers, as shown in Figure 2(a).
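The three loss terms of Equations (6), (7), and (8) can be sketched directly in numpy (the per-sample distributions below are illustrative, and the unweighted sum is shown only for clarity; the paper balances the terms with per-epoch normalization weights):

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    # L = -sum_k target_k * log(pred_k)
    return -np.sum(target * np.log(pred + eps))

def kl_div(p, q, eps=1e-12):
    # KL(p || q) = sum_k p_k * log(p_k / q_k)
    return np.sum(p * np.log((p + eps) / (q + eps)))

# Illustrative distributions for one sample over 3 classes.
y     = np.array([1.0, 0.0, 0.0])   # one-hot ground truth
y_hat = np.array([0.7, 0.2, 0.1])   # deep model prediction
y_ctx = np.array([0.6, 0.3, 0.1])   # GP context label

L1 = cross_entropy(y, y_hat)   # prediction vs. ground truth, Eq. (6)
L2 = kl_div(y_ctx, y_hat)      # context label vs. prediction, Eq. (7)
L3 = cross_entropy(y, y_ctx)   # context label vs. ground truth, Eq. (8)
triangle_loss = L1 + L2 + L3   # unweighted combination, for illustration only
```

In the actual scheme, L1 and L2 backpropagate through the whole network, while L3 is routed only through the feature extractor, as Figure 2(a) depicts.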

Triangle consistency

With the above three loss terms, a triangle loss function arises among the deep model prediction, the context label, and the ground truth label. The ground truth label is simply a one-hot vector for the input data, which cannot reflect the correlation information with other classes or samples. The context label is generated by our Gaussian process model, which embeds the global data distribution. As our Gaussian process model is based on the feature space, which changes dynamically through the learning process, the context label should be constrained by the ground truth to seek better feature expressions. That is why we use a triangle consistency loss function among the three. By introducing the triangle loss function, our GPGL scheme updates the parameters of the deep model with Equation (9), in comparison to the conventional optimization framework with Equation (1):

W_{t+1} = W_t − η ∇_W (1/|B|) Σ_{i∈B} ( λ_1 L_1 + λ_2 L_2 + λ_3 L_3 ),   (9)

where λ_1, λ_2, and λ_3 are normalization terms that balance the three loss terms, ε is the error rate of the deep model (initialized with the random-guess probability), and |·| denotes the absolute value operator.

Rethinking the optimization framework over epoch sequences: in each epoch, all samples in the dataset are used once to update the model parameters, which changes the mapping from data into the feature space. Our Gaussian process model is built on this feature space, and the features of the anchor set are updated after each epoch; hence our GP model evolves and is further utilized in the next epoch (namely, the epoch-evolving GP model). The optimization method of our proposed epoch-evolving GPGL scheme is shown in Algorithm 1.

Input: Training set D with C classes; training epochs E; training batch size b
1 Initialize parameters W, the error rate ε, and the loss-balancing terms (referred to in Section 2.3);
2 Uniformly sample an anchor set A (referred to in Section 2.2);
3 Build our Gaussian process (GP) model based on the feature space of the anchor set A;
4 for epoch = 1, …, E do
5      for each batch B do
6           Load and normalize the samples;
7           Compute the deep model prediction (referred to in Section 2.1);
8           Compute the context label with our GP model (Equation (4));
9           Compute the loss functions (Equations (6), (7) and (8));
10           Update the parameters W (Equation (9));
11           end for
12          Evaluate the model and update ε and the loss-balancing terms;
13      Update our GP model based on the new feature space;
14      end for
     Output: Trained neural network model parameters W
Algorithm 1 Epoch-evolving Gaussian Process Guided Learning (GPGL)

3 Experiments

3.1 Datasets

CIFAR-10

It is a labeled subset of the 80 million tiny images dataset for object recognition. This dataset contains 60000 32×32 RGB images in 10 classes, with 5000 images per class for training and 1000 images per class for testing.

CIFAR-100

It is similar to CIFAR-10 in that it also contains 60000 images, but it covers 100 fine-grained classes. Each class has 500 training and 100 test images.

Tiny-ImageNet

It has 200 different classes, with 500 training images, 50 validation images, and 50 test images per class. Compared with CIFAR, Tiny-ImageNet is more difficult because it has more classes and the target objects often cover only a tiny area of the image.

3.2 Implementation details

Data preprocessing

On CIFAR-10 and CIFAR-100, our data preprocessing is the same as in He et al. (2016a): 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image. On Tiny-ImageNet, we use the original images or their horizontal flips, rescale them to 256×256, and randomly sample 224×224 crops from the 256×256 images. In testing, we only rescale images to 256×256 and take 224×224 center crops.

Training details

We use the same hyper-parameters as when we replicate the other methods' results with the SGD-Momentum (SGD-M) strategy to train our model with the GPGL scheme. In particular, we set the momentum to 0.9 and the weight decay to 0.0001, following a standard learning rate schedule that drops from 0.1 to 0.01 at 60% (CIFAR-10) / 50% (CIFAR-100) / 33.3% (Tiny-ImageNet) of training, and to 0.001 at 80% (CIFAR-10) / 75% (CIFAR-100) / 66.7% (Tiny-ImageNet) of training. We use 250 / 200 / 120 epochs on CIFAR-10 / CIFAR-100 / Tiny-ImageNet, respectively. We train our models from random initialization on CIFAR-10 and CIFAR-100; on Tiny-ImageNet, however, we load an ImageNet pre-trained model and train the fully connected layer for 10 epochs (learning rate = 0.1). We train our models on a server with 4 NVIDIA GTX 1080Ti GPUs.

3.3 Ablation experiments

Setting of key hyper-parameters

There are several key hyper-parameters in our GPGL scheme: the length-scale parameter in the kernel function (mentioned in Equation (3)), the number of classes included in the anchor set, and the number of samples in each anchor class. We observe the effectiveness of context labels with respect to the number of samples in each class, and find that the performance of the deep model is stable when the length-scale parameter and the number of samples in each anchor class are relatively large. Therefore, we choose a length scale of 200 / 70 / 70 in the CIFAR-10 / CIFAR-100 / Tiny-ImageNet experiments for simplicity. For the number of anchor samples, we use 128 / 7000 / 14000 on CIFAR-10 / CIFAR-100 / Tiny-ImageNet in practice. We calculate the mean of context labels over the whole CIFAR-100 test set and find that the “Top-5” classes carry about 70% of the information of the whole context label; the “Top-5” classes are the 5 most related classes, i.e., the 5 largest values in the mean of context labels. Comparative experiments with the “Top-5” / “Top-10” / “Top-20” / “Top-100” strategies show no remarkable difference between them, so we follow the “Top-5” strategy in the following experiments for simplicity.
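The paper does not spell out the exact mechanism of the "Top-k" strategy; one plausible sketch, assuming the context label is truncated to its k largest entries and renormalized to a distribution (the function name and the example values are hypothetical):

```python
import numpy as np

def topk_context_label(context, k=5):
    """Keep only the k largest entries of a context label and renormalize.

    One plausible reading of the "Top-k" strategy: the remaining classes
    carry little probability mass, so truncation barely changes the label.
    """
    truncated = np.zeros_like(context)
    idx = np.argsort(context)[-k:]          # indices of the k largest values
    truncated[idx] = context[idx]
    return truncated / truncated.sum()      # renormalize to a distribution

# An illustrative 8-class context label; its top 5 entries hold most of the mass.
ctx = np.array([0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01])
top5 = topk_context_label(ctx, k=5)
```

This matches the reported observation: when the top few classes already carry most of the mass (about 70% on CIFAR-100 in the paper), the truncated label is nearly indistinguishable from the full one, so the choice of k has little effect.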

Validation of loss combinations

With the two extra loss terms as constraints, we explore the contribution of each loss function in the deep learning process. We train ResNet20 on CIFAR-10 with 4 strategies: 1) the ground-truth cross-entropy loss only; 2) and 3) that loss combined with each one of the two extra loss terms; and 4) the full triangle loss. Figure 3(a) shows that the triangle loss outperforms both 2-term combinations, and that all three augmented strategies outperform using the ground-truth loss alone. In other words, each of the two extra loss terms improves performance, and they can be used together.

     (a) Triangle loss on CIFAR-10          (b) Train error on CIFAR-100          (c) Val error on CIFAR-100

Figure 3: (a) Different combinations of the three loss terms affect the validation error on CIFAR-10, which shows the effectiveness of our triangle loss function. (b) The training error of ResNet20 on CIFAR-100 over epochs. (c) The validation error of ResNet20 on CIFAR-100 over epochs.

3.4 Performance comparison

In this part, we compare our GPGL scheme with a common SGD-Momentum optimizer (replicated by ourselves) in terms of convergence speed on CIFAR-10, CIFAR-100 and Tiny-ImageNet. We report the final error, together with the number of epochs at which each method reaches the best SGD-Momentum performance at the last error plateau. The experiments are repeated 5 times and the results are shown in Table 1.

From Table 1, we have the following observations. First, our GPGL scheme achieves lower test errors and converges faster than SGD-Momentum for all the tested CNN architectures on all three datasets. Our GPGL scheme yields an average improvement of about 0.3% on CIFAR-10 and up to 1.47% on CIFAR-100 compared with SGD-Momentum, and uses about 50 to 90 fewer epochs to achieve the same accuracy as SGD-Momentum. This demonstrates the importance of the context labels, which provide guidance to DNNs.

We also compare the performance of our approach with some state-of-the-art optimization strategies on CIFAR-10. The results are shown in Table 2 (note that the models in BFLNN Lu et al. (2018) use the pre-activation strategy He et al. (2016b)), which shows that our GPGL scheme outperforms the other methods.

                      SGD-M                       GPGL
                      error (%)          epochs   error (%)          epochs
CIFAR-10
ResNet20              7.94 (8.18±0.14)   231      7.67 (7.83±0.13)   158
ResNet32              6.94 (7.09±0.11)   233      6.55 (6.88±0.18)   171
ResNet44              6.55 (6.76±0.14)   242      6.23 (6.42±0.15)   163
ResNet56              6.20 (6.40±0.16)   231      5.90 (6.13±0.20)   183
ResNet110             5.82 (5.97±0.18)   235      5.48 (5.68±0.20)   173
PreActResNet20        7.80 (7.94±0.13)   228      7.53 (7.64±0.08)   169
ResNeXt29_8_64        4.18 (4.43±0.19)   236      4.11 (4.23±0.10)   178
CIFAR-100
ResNet20              32.98 (33.16±0.14) 164      31.63 (31.89±0.24) 101
ResNet110             27.81 (28.21±0.26) 190      26.34 (27.07±0.39) 101
ResNeXt29_8_64        21.07 (21.32±0.16) 186      20.58 (20.84±0.19) 132
Tiny-ImageNet
ResNet18              33.20 (33.58±0.24) 115      32.36 (32.61±0.18) 49
Table 1: Comparison of SGD-M and GPGL (best error, with mean±std over 5 runs in parentheses)
Methods SGD-M He et al. (2016a) BFLNN Lu et al. (2018) Others Ours
ResNet20 8.75 8.33 7.89 Han et al. (2017) 7.67
ResNet32 7.51 7.18 - 6.55
ResNet44 7.17 6.66 - 6.23
ResNet56 6.97 6.31 6.86 Elad et al. (2018) 5.90
ResNet110 6.43 6.16 6.84 Hao-Yun et al. (2019) 5.48
Table 2: Comparison with state-of-the-art (error rate (%))

4 Conclusion

In this paper, we have presented an epoch-evolving Gaussian Process Guided Learning (GPGL) scheme to estimate the context-aware class distribution information and guide the conventional bottom-up learning process efficiently. We have demonstrated that our triangle consistency loss function is effective for a good balance between precise batch-level distribution learning and global distribution-aware nonparametric modeling. The experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets have validated that our GPGL scheme improves the performance of CNN models remarkably and reduces the number of epochs significantly with better convergence performance.


  • L. Bottou (1998) Online learning and stochastic approximations. On-line learning in neural networks 17 (9), pp. 142. Cited by: §1.
  • T. Bui, D. Hernández-Lobato, J. Hernandez-Lobato, Y. Li, and R. Turner (2016) Deep gaussian processes for regression using approximate expectation propagation. In Proceedings of the International Conference on Machine Learning, pp. 1472–1481. Cited by: §1.
  • F. P. Casale, A. Dalca, L. Saglietti, J. Listgarten, and N. Fusi (2018) Gaussian process prior variational autoencoders. In Advances in Neural Information Processing Systems, pp. 10369–10380. Cited by: §1.
  • Z. Dai, A. Damianou, J. González, and N. Lawrence (2015) Variational auto-encoded deep gaussian processes. arXiv preprint arXiv:1511.06455. Cited by: §1.
  • A. Damianou and N. Lawrence (2013) Deep gaussian processes. In 16th International Conference on Artificial Intelligence and Statistics, pp. 207–215. Cited by: §1.
  • J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §1.
  • H. Elad, H. Itay, and S. Daniel (2018) Fix your classifier: the marginal value of training the last weight layer. In Proceedings of the International Conference on Learning Representations, Cited by: Table 2.
  • S. Han, J. Pool, S. Narang, H. Mao, E. Gong, S. Tang, E. Elsen, P. Vajda, M. Paluri, J. Tran, et al. (2017) DSD: dense-sparse-dense training for deep neural networks. In Proceedings of the International Conference on Learning Representations, Cited by: Table 2.
  • C. Hao-Yun, W. Pei-Hsin, L. Chun-Hao, C. Shih-Chieh, P. Jia-Yu, C. Yu-Ting, W. Wei, and J. Da-Cheng (2019) Complement objective training. In Proceedings of the International Conference on Learning Representations, Cited by: Table 2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016a) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1, §3.2, Table 2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016b) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Cited by: §1, §3.4.
  • G. Hinton, N. Srivastava, and K. Swersky (2012) Lecture 6a overview of mini-batch gradient descent (2012). Coursera Lecture slides. Cited by: §1.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708. Cited by: §1.
  • P. Jylänki, J. Vanhatalo, and A. Vehtari (2011) Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research 12 (Nov), pp. 3227–3257. Cited by: §2.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, Cited by: §1.
  • J. Laurikkala (2001) Improving identification of difficult small classes by balancing class distribution. In Conference on Artificial Intelligence in Medicine in Europe, pp. 63–66. Cited by: §1.
  • Y. Lu, A. Zhong, Q. Li, and B. Dong (2018) Beyond finite layer neural networks: bridging deep architectures and numerical differential equations. In Proceedings of the International Conference on Machine Learning, Vol. 80, pp. 3282–3291. Cited by: §3.4, Table 2.
  • Y. E. Nesterov (1983) A method for solving the convex programming problem with convergence rate o (1/k^ 2). In Dokl. akad. nauk Sssr, Vol. 269, pp. 543–547. Cited by: §1.
  • N. Qian (1999) On the momentum term in gradient descent learning algorithms. Neural networks 12 (1), pp. 145–151. Cited by: §1.
  • C. E. Rasmussen and C. K. I. Williams (2005) Gaussian processes for machine learning. The MIT Press. External Links: ISBN 026218253X Cited by: §1, §1, §2.2.
  • J. Roy, K. J. Lum, B. Zeldow, J. D. Dworkin, V. L. Re III, and M. J. Daniels (2018) Bayesian nonparametric generative models for causal inference with missing at random covariates. Biometrics 74 (4), pp. 1193–1202. Cited by: §1.
  • I. Sutskever, J. Martens, G. Dahl, and G. Hinton (2013) On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning, pp. 1139–1147. Cited by: §1.
  • D. Tran, R. Ranganath, and D. M. Blei (2015) The variational gaussian process. arXiv preprint arXiv:1511.06499. Cited by: §1.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference, pp. 87.1–87.12. Cited by: §1.