Nonparametric Weight Initialization of Neural Networks via Integral Representation

December 23, 2013 · Sho Sonoda et al.

A new initialization method for hidden parameters in a neural network is proposed. Derived from the integral representation of the neural network, a nonparametric probability distribution of hidden parameters is introduced. In this proposal, hidden parameters are initialized by samples drawn from this distribution, and output parameters are fitted by ordinary linear regression. Numerical experiments show that backpropagation with the proposed initialization converges faster than with uniformly random initialization. It is also shown that, in some cases, the proposed method achieves sufficient accuracy by itself, without backpropagation.


1 Introduction

In backpropagation learning of a neural network, the initial weight parameters are crucial to the final estimates. Since the hidden parameters sit inside nonlinear activation functions, learning all parameters simultaneously by backpropagation amounts to a non-convex optimization problem. When the machine starts from an initial point far from the goal, the learning curve easily gets stuck in local minima or lost on plateaus, and the machine fails to provide good performance.

Recently, deep learning schemes have drawn tremendous attention for their overwhelming performance on real-world problems [1, 2]. Deep learning schemes consist of two stages: pre-training and fine-tuning. The pre-training stage plays an important role in the convergence of the subsequent fine-tuning stage. In pre-training, the weight parameters are constructed layer by layer, by stacking unsupervised learning machines such as restricted Boltzmann machines [3] or denoising autoencoders [4]. Despite the brilliant progress in application fields, theoretical interpretation of these schemes is still an open question [5].

In this paper we introduce a new initialization/pre-training scheme that can avoid the non-convex optimization problem. The key concept is a probability distribution of weight parameters derived from Murata's integral representation of neural networks [6]. The distribution gives an intuitive idea of what the parameters represent and contains information about where efficient parameters exist. By sampling from this distribution, we can initialize weight parameters more efficiently than by sampling from a uniform distribution. In fact, for relatively simple or low-dimensional problems, our method by itself attains a highly accurate solution without backpropagation.

De Freitas et al. [7] also introduced a series of stochastic learning methods for neural networks based on Sequential Monte Carlo (SMC). In their methods the learning process is iterative, and initial parameters are drawn from uninformative distributions such as normal distributions. In contrast, we draw the parameters from a data-dependent distribution. Furthermore, in SMC the number of hidden units must be fixed before learning, whereas in our method it is determined naturally.

2 Background and related works

One of the most naive initialization heuristics is to draw samples uniformly from a fixed interval. Nguyen and Widrow [8] gave two fundamental points of view. First, since a typical activation function such as the sigmoid or hyperbolic tangent is approximately linear around its inflection point, one should initialize the hidden parameters in such a way that the inputs to each hidden unit fall in this linear region. Second, since each hidden unit determines a slice of the Fourier-transformed input space, that is, each individual hidden unit responds selectively only to inputs whose spatial frequency lies in a particular band, one should initialize the hidden parameters so that the corresponding frequency bands cover the possible input frequencies.

LeCun et al. [9] also emphasized the need to preset parameters in the linear region, because parameters outside the linear region have small gradients and stray into more difficult nonlinear regions. They focused on the curvature around the inputs and proposed to draw initial weights with standard deviation m^{-1/2}, where m is the fan-in, i.e., the dimensionality of the input vectors. Shimodaira [10] proposed to initialize parameters so that the corresponding activation regions cover the whole range of possible inputs. Linear algebraic techniques have also been employed: for example, Shepanski [11] used the pseudo-inverse to determine the parameters of linearly approximated neural networks, and Yam and Chow [12] used the QR decomposition.

Integral-transform viewpoints originated from a more theoretical background than linear-region viewpoints: the theoretical evaluation of the approximation power of neural networks. In the earliest stage, purely functional-analytic methods were employed. In 1957 Kolmogorov [13] showed that any multivariate continuous function can be exactly represented by sums of compositions of different continuous functions of only one variable. Inspired by Kolmogorov's theorem, Hecht-Nielsen [14] and Kůrková [15] applied the idea to neural networks, which are sums of compositions of the same sigmoid function. Sprecher [16] gave a more constructive version of the proof and later implemented the improved proof as a learning algorithm for neural networks [17].

In 1989 the universal approximation property of single-hidden-layer neural networks was investigated and the integral-transform aspects emerged. Carroll and Dickinson [18] introduced the Radon transform and Funahashi [19] used Fourier analysis and the Paley-Wiener theory, whereas Cybenko [20] employed the Hahn-Banach and Riesz representation theorems. In the following years, upper bounds on the approximation error were investigated [21, 22, 6]. Barron [22] refined Jones' result [21] using the weighted Fourier transform, and Kůrková [15] later developed a more general theory of such integral transforms. Inspired by Barron's result, Murata [6] introduced a family of integral transforms defined by ridge functions, which can be regarded as a hybrid of the Radon and wavelet transforms. Candès [23] inherited Murata's transforms and developed ridgelets, which was the beginning of the series of multiscale "-lets" analyses [24].

These multiscale viewpoints also inherit the selective-activation properties of neural networks. Denoeux and Lengellé [25] proposed to collect prototype vectors as initial hidden parameters. Each prototype is drawn from a corresponding cluster, where the clusters are formed in a stereographically projected input space. In this manner each prototype comes to respond selectively to the input vectors that belong to its cluster.

This study is based on the integral-transform viewpoint and proposes a new way toward practical implementation. Although integral transforms have been well studied as theoretical integral representations of neural networks, practical implementations for training have rarely been attempted. Integral representations nevertheless have a major advantage over linear-region viewpoints: they give global directions for how each hidden unit should behave, whereas the latter give only local directions.

3 Nonparametric weight initialization via integral transform

3.1 Sampling based two-stage learning

Let g be a neural network with a single hidden layer, expressed as

g(x) = Σ_{j=1}^{J} c_j η(a_j·x − b_j),    (1)

where the map η is called the activation function, a_j and b_j are called hidden parameters, and c_j are output parameters. With an ordinary sigmoid function σ, the activation function is taken to be a sigmoid pair, a difference of two shifted sigmoids of the form

(2)

where the constant A normalizes the maximum value of η to one.

We consider an oracle distribution p(a, b) of hidden parameters. If such a distribution exists, we can first sample and fix the hidden parameters according to p, and then fit the remaining output parameters by ordinary linear regression. We call this two-stage framework Sampling Regression (SR) learning.

Candidates for p could be parametric distributions such as normal distributions or uniform distributions. In the following sections we derive a data-dependent p from an integral representation of neural networks.
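To make the two-stage framework concrete, the following minimal sketch (in Python with NumPy, although the paper's own experiments were run in R) shows the regression stage of SR learning once hidden parameters have been sampled and fixed. The shift of the sigmoid pair (d = 1) and the use of a generic least-squares solver are illustrative assumptions; the exact constants belong to Eq. 2, which is not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_pair(z, d=1.0):
    """Difference of two shifted sigmoids, rescaled to peak at 1.
    The shift d = 1 is an illustrative choice, not the paper's exact constant."""
    a_const = 1.0 / (sigmoid(d) - sigmoid(-d))   # value at z = 0 before rescaling
    return a_const * (sigmoid(z + d) - sigmoid(z - d))

def fit_output_weights(X, y, A, b):
    """Regression stage of SR learning: hidden parameters (A, b) are fixed,
    output weights c are obtained by ordinary least squares.
    X: (N, m) inputs, y: (N,) targets, A: (J, m) hidden weights, b: (J,) biases."""
    Phi = sigmoid_pair(X @ A.T - b)              # (N, J) design matrix of hidden activations
    c, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return c

def predict(X, A, b, c):
    return sigmoid_pair(X @ A.T - b) @ c
```

Any sampler for the hidden parameters, including the data-dependent one derived below, can be plugged in front of fit_output_weights.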

3.2 Integral representations of neural networks

Consider approximating a map f with a neural network. Murata [6] defined an integral transform T(a, b) of f with respect to a decomposing kernel ψ as

(3)

where C is a normalizing constant. Murata also showed that, given the decomposing kernel ψ, there exists an associated composing kernel such that for any f the inversion formula

(4)

holds (Th.1 in [6]), where the overline denotes the complex conjugate. The convergence factor can be omitted when T is absolutely integrable, which is attained when f is compactly supported and Hölder continuous of sufficiently high order (Th.3 in [6]), or compactly supported, bounded, and smooth (Cor.2 in [6]).

In particular, one can take the composing kernel to be the sigmoid pair given in Eq. 2, with the associated decomposing kernel ψ of the form

(5)

where ρ is a nonnegative smooth function whose support is contained in the interval [−1, 1]. Such a ρ does exist and is known as a mollifier [26]; the standard mollifier ρ(z) = exp(−1/(1 − z²)) for |z| < 1 (and ρ(z) = 0 otherwise) is a well-known example.

Hereafter we assume that η is a sigmoid pair and that ψ is the corresponding derivative of the standard mollifier. We also assume that our target f is a bounded, compactly supported smooth function. Then the integral transform T of f is absolutely integrable and the inversion formula reduces to its direct form.

Let p(a, b) be a probability density over the hidden parameters (a, b) that is proportional to |T(a, b)|, and let c(a, b) satisfy c(a, b) p(a, b) = T(a, b) for all (a, b). With this notation, the inversion formula is rewritten in expectation form with respect to p, that is,

f(x) = E_{(a,b)∼p}[ c(a, b) η(a·x − b) ].    (6)

This expression implies that the finite sum

g_J(x) = (1/J) Σ_{j=1}^{J} c(a_j, b_j) η(a_j·x − b_j),  with (a_j, b_j) drawn i.i.d. from p,    (7)

converges to f in mean square as J → ∞ (Th.2 in [6]). Here g_J is a neural network with J hidden units, and therefore the inversion formula can be regarded as an integral representation of neural networks.

3.3 Practical calculation of the integral transform

Now we attempt to make use of the integral transform |T| as an oracle distribution of hidden parameters. Although the distribution is given in explicit form, as we saw in the preceding section, further refinements are required for practical calculation.

Given a set of N input–output pairs {(x_i, y_i)}_{i=1}^N, the transform T is empirically approximated as

(8)

with some constant that is hard to calculate exactly; the empirical oracle density p̂(a, b) is taken proportional to the absolute value of this approximation. In fact, sampling algorithms such as the acceptance-rejection method [27] and Markov chain Monte Carlo methods [27] work with any unnormalized distribution, because they only evaluate ratios between probability values. Note that the approximation converges to the exact T in probability, by the law of large numbers, only when the input vectors are i.i.d. samples from a uniform distribution.

As a decomposing kernel we use the k-th order derivative of the standard mollifier, where the order k is determined by the parity of the input dimension m. The k-th derivative of the mollifier takes the form

(9)

where P_k denotes a polynomial that is calculated by the following recurrence formula:

(10)
(11)

Higher-order derivatives of a mollifier oscillate more rapidly in the neighbourhoods of both edges of its support.
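For concreteness, the k-th derivative of the standard mollifier ρ(z) = exp(−1/(1 − z²)) for |z| < 1 (and 0 otherwise) can be evaluated through a polynomial recurrence: writing ρ^{(k)}(z) = P_k(z) (1 − z²)^{−2k} ρ(z), direct differentiation gives P_0 = 1 and P_{k+1}(z) = (1 − z²)² P_k′(z) + (4kz(1 − z²) − 2z) P_k(z). The sketch below implements this recurrence; it is one concrete instance of the structure described by Eqs. 9–11, and the paper's own normalization may differ.

```python
import numpy as np
from numpy.polynomial import Polynomial

def mollifier_derivative_polys(k_max):
    """Polynomials P_k with rho^(k)(z) = P_k(z) * (1 - z^2)**(-2k) * rho(z),
    obtained by repeatedly differentiating rho(z) = exp(-1/(1 - z^2))."""
    P = [Polynomial([1.0])]                        # P_0 = 1
    one_minus_z2 = Polynomial([1.0, 0.0, -1.0])    # 1 - z^2
    z = Polynomial([0.0, 1.0])
    for k in range(k_max):
        P_next = one_minus_z2**2 * P[k].deriv() + (4 * k * z * one_minus_z2 - 2 * z) * P[k]
        P.append(P_next)
    return P

def mollifier_derivative(z, k, P=None):
    """Evaluate the k-th derivative of the standard mollifier on |z| < 1 (zero outside)."""
    if P is None:
        P = mollifier_derivative_polys(k)
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.zeros_like(z)
    inside = np.abs(z) < 1.0
    u = 1.0 - z[inside]**2
    # exp(-1/u) * u**(-2k) computed in log space to avoid overflow near the edges
    out[inside] = P[k](z[inside]) * np.exp(-1.0 / u - 2 * k * np.log(u))
    return out
```

Plotting mollifier_derivative(np.linspace(-0.999, 0.999, 2000), 10) exhibits the rapid oscillation near the edges of the support mentioned above.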

Given a data set {(x_i, y_i)}_{i=1}^N, our Sampling Regression method is summarized below:

  1. Preliminary stage: Calculate the decomposing kernel ψ according to Eq. 9, Eq. 10 and Eq. 11, where the differential order depends on the parity of the input dimension m. Then the unnormalized oracle density p̂ is calculated by Eq. 8 with the normalizing constant set arbitrarily; as noted above, this is sufficient for the sampling algorithms used here.

  2. Sampling stage: Draw J samples from the probability distribution p̂ by the acceptance-rejection method, where J denotes the number of hidden (sigmoid pair) units. Then we obtain the hidden parameters (a_j, b_j), j = 1, …, J.

  3. Regression stage: Let z_{ij} := η(a_j·x_i − b_j) for all i = 1, …, N and j = 1, …, J. Solve the system of linear equations Σ_j c_j z_{ij} = y_i with respect to the c_j. Then we obtain the output parameters c_j (a code sketch of these stages follows).
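A sketch of how these stages fit together is given below. The unnormalized density is written as |Σ_i y_i ψ(a·x_i − b)|, which is one natural reading of the empirical approximation in Eq. 8 (multi-output targets would be handled per output component, a detail glossed over here), and a generic rejection sampler stands in for the sampling stage.

```python
import numpy as np

def oracle_density_unnormalized(a, b, X, y, psi):
    """|T_hat(a, b)| up to a constant: one reading of Eq. 8.
    a: (m,), b: scalar, X: (N, m), y: (N,), psi: decomposing kernel (vectorized)."""
    return np.abs(np.sum(y * psi(X @ a - b)))

def rejection_sample(density, proposal_sampler, proposal_density, M, n_samples, rng):
    """Generic acceptance-rejection sampling, assuming density <= M * proposal_density."""
    samples = []
    while len(samples) < n_samples:
        theta = proposal_sampler(rng)            # candidate hidden parameters (a, b)
        u = rng.uniform(0.0, 1.0)
        if u * M * proposal_density(theta) <= density(theta):
            samples.append(theta)                # accept; otherwise silently reject
    return samples
```

With density = lambda ab: oracle_density_unnormalized(ab[0], ab[1], X, y, psi), where psi can be the mollifier derivative sketched above, and a uniform proposal over a box covering the support region discussed in Sec. 3.4, the returned pairs (a_j, b_j) can be passed to fit_output_weights from the sketch in Sec. 3.1.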

3.4 For more efficient sampling

Generally p̂ is ill-shaped and sampling from the distribution is difficult. For example, in Fig. 1 Left, samples drawn from p̂ for a one-dimensional target are plotted, whereas in Fig. 1 Right the same distribution is shown in a transformed coordinate system (explained below). The support of the distribution is reshaped into a rectangle, which implies that sampling in the transformed coordinates is easier than sampling in the original (a, b)-coordinates.

Figure 1: Sample parameters drawn from p̂ for a one-dimensional example. Red lines indicate the theoretical boundary of the support of p̂. Left: p̂ has a non-convex support, in which case sampling is inefficient. Right: the same sample points plotted in the transformed coordinate space; coordinate lines are deformed to lattice lines and the support becomes rectangular.

This ill-shapedness is formalized in the following proposition.

Proposition 3.1.

Suppose the objective function f has compact support. Then the support of its transform T is contained in the region {(a, b) : |b| ≤ 1 + max_{x ∈ supp f} |a·x|}.

Proof.

Recall that the support of ψ is included in the interval [−1, 1]; therefore, for any a and x, |a·x − b| > 1 implies ψ(a·x − b) = 0. Conversely, ψ(a·x − b) ≠ 0 implies |b − a·x| ≤ 1, and hence |b| ≤ 1 + |a·x|. By the compact support assumption on f, taking the maximum with respect to x ∈ supp f leads to |b| ≤ 1 + max_{x ∈ supp f} |a·x|. Tracking back the inferences, for any a and b,

|b| > 1 + max_{x ∈ supp f} |a·x|  ⟹  ψ(a·x − b) = 0 for all x ∈ supp f.    (12)

Since the integrand of T is always zero for x outside the support of f, the integration domain of T can be restricted to supp f. Therefore, by Eq. 12,

|b| > 1 + max_{x ∈ supp f} |a·x|  ⟹  T(a, b) = 0,    (13)

which leads to the conclusion: supp T ⊆ {(a, b) : |b| ≤ 1 + max_{x ∈ supp f} |a·x|}. ∎
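The bound can also be checked numerically on a one-dimensional toy example: whenever |b| > 1 + max_i |a x_i|, the kernel ψ(a x_i − b) vanishes at every sample point, so the empirical version of T is exactly zero there. The check below assumes only that the kernel is supported on [−1, 1], as is the case for derivatives of the standard mollifier; the particular kernel and target used here are arbitrary illustrations.

```python
import numpy as np

# Toy check of Proposition 3.1 with a kernel supported on [-1, 1].
psi = lambda z: np.where(np.abs(z) < 1.0,
                         z * np.exp(-1.0 / np.clip(1.0 - z**2, 1e-12, None)),
                         0.0)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)          # compactly supported inputs
y = np.sin(3.0 * X)                           # arbitrary bounded target values
R = np.max(np.abs(X))

for _ in range(1000):
    a = rng.uniform(-20.0, 20.0)
    b_out = (1.0 + R * abs(a) + 0.01) * rng.choice([-1.0, 1.0])   # just outside the bound
    val = abs(np.sum(y * psi(a * X - b_out)))
    assert val == 0.0                          # the integrand vanishes outside the region
print("empirical transform vanishes whenever |b| > 1 + R|a|")
```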

In relatively high-dimensional input cases, sampling in a coordinate-transformed space

(14)

is more efficient than sampling in the original (a, b)-space, because the support of p̂ becomes rectangular in the transformed space (see Fig. 1), so a proposal distribution there produces fewer proposals that fall outside the support.

When the coordinate-transform technique is not enough, it is worth sampling from each component distribution separately. Namely, the empirically approximated p̂ is bounded above by a mixture distribution:

(15)

where each p_i is a component distribution associated with one training example and π_i is its mixing probability.

In addition, an upper bound of each component p_i is given in the form

(16)

for suitable constants.

4 Experimental results

We conducted three sets of experiments comparing three types of learning methods:

  • Whole parameters are initialized by samples from a uniform distribution and trained by BackPropagation (BP).

  • Hidden parameters are initialized by Sampling from p̂, and the remaining output parameters are initialized by samples from a uniform distribution; then all parameters are trained by BackPropagation (SBP).

  • Hidden parameters are determined by Sampling from p̂, and the remaining output parameters are fitted by linear Regression (SR).

In order to compare the abilities of the three methods, we conducted experiments on three different problems: one-dimensional complicated-curve regression, multidimensional Boolean function approximation, and real-world data classification.

4.1 One-dimensional complicated curve regression - Topologist’s sine curve

Figure 2: Training results of the three methods for fitting the topologist's sine curve. Left: SR (solid black line) by itself achieved the highest accuracy without iterative learning, whereas SBP (dashed red line) converged to a lower RMSE than BP (dotted green line). Right: the original curve (upper left) has high frequencies around the origin. SR (upper right) followed this dynamic variation of frequency better than the other two methods. SBP (lower left) roughly approximated the curve, with noise. BP (lower right) fitted only the moderate part of the curve.

First we performed one-dimensional curve regression. The objective function is a two-sided topologist's sine curve (TSC), defined on a bounded interval, whose indefiniteness at zero is removed by defining the value at zero to be zero. The TSC is a complicated curve whose spatial frequency becomes arbitrarily high as x tends to zero. For training, points were sampled from the domain in an equidistant manner. The number of hidden units was fixed to the same value in each model; note that a relatively redundant number of parameters is needed for our sampling initialization scheme to obtain good parameters. The output function was set to linear and batch learning was performed by the BFGS quasi-Newton method. Uniformly random initialization parameters for BP and SBP were drawn from a fixed interval. Sampling from p̂ was performed by the acceptance-rejection method.
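For reference, the target and an equidistant training grid can be generated as below; the domain [−1, 1] and the grid size are illustrative guesses, since the exact values are not recoverable here.

```python
import numpy as np

def topologists_sine_curve(x):
    """Two-sided topologist's sine curve sin(1/x), with the value at zero set to zero."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nz = x != 0.0
    out[nz] = np.sin(1.0 / x[nz])
    return out

# Equidistant training grid on an assumed domain [-1, 1].
x_train = np.linspace(-1.0, 1.0, 1001)
y_train = topologists_sine_curve(x_train)
```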

In Fig. 2 Left, the root mean squared error (RMSE) of the three methods during training is shown. The solid black line corresponds to SR, which by itself achieved the highest accuracy without iterative learning. The dashed red line corresponds to SBP, which converged to a lower RMSE than BP, depicted by the dotted green line. In Fig. 2 Right, the fitting results of the three methods are shown. As noted, the original curve (upper left) is numerically unstable around the origin and is therefore difficult to fit. SR (upper right) approximated the original curve well except around the origin, while the other two methods, SBP (lower left) and BP (lower right), could fit the original curve only partly. In this experiment we examined the flexibility of our method by fitting a complicated curve, and the result supports that the oracle distribution gave advantageous directions.

4.2 Multidimensional Boolean functions approximation - Combined AND, OR and XOR

Figure 3: Cross-entropy curves (thin lines) and classification error rates (thick lines). SR (solid black line) achieved a perfectly correct answer from the beginning. SBP (dashed red line) also attained the perfect solution, faster than BP. BP (dotted green line) required more iterations of learning to give the correct answer.

Second, we performed a binary problem with two-dimensional input and three-dimensional output. The output vectors are composed of three logical functions of the inputs: AND, OR and XOR. The total number of data points is therefore just four. The number of hidden units was fixed in advance. The output function was set to sigmoid and the loss function to cross-entropy. Uniformly random initialization parameters for BP and SBP were drawn from a fixed interval. Sampling from p̂ was performed by the acceptance-rejection method.
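For concreteness, the complete training set for this task can be written out directly; the ordering of the four input patterns below is arbitrary.

```python
import numpy as np

# Two-dimensional binary inputs and three target functions: AND, OR, XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.stack([
    np.logical_and(X[:, 0], X[:, 1]),
    np.logical_or(X[:, 0], X[:, 1]),
    np.logical_xor(X[:, 0], X[:, 1]),
], axis=1).astype(float)    # shape (4, 3): columns are AND, OR, XOR
```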

In Fig. 3, the cross-entropy curves and classification error rates are depicted by thin and thick lines, respectively. The solid black line corresponds to SR, which achieved a perfectly correct answer from the beginning. The dashed red line corresponds to SBP, which also attained the perfect solution, faster than BP. The dotted green line corresponds to BP, which required more iterations of learning to give the correct answer. In this experiment we validated that the proposed method works well on multiclass classification problems. The quick convergence of SBP indicates that p̂ contains information about the training examples that is advantageous compared with the uniform distribution.

4.3 MNIST

Figure 4: Test classification error rates for the MNIST dataset. SR (solid black line) marked the lowest error at the beginning; the subsequent re-ascent of its error suggests that SR may have overfitted. SBP (red dashed line) reduced the error the fastest. BP (green dotted line) declined the slowest.

Finally we examined a real classification problem using the MNIST data set [28]. The data set consists of 60,000 training examples and 10,000 test examples; each input vector is a 256-level gray-scale image of 28 × 28 pixels depicting a handwritten digit, and the corresponding label is one of the 10 digits. We implemented these labels as binary vectors whose components are chosen randomly with equal probability for one and zero. We used a randomly sampled subset of the training examples for training and the whole test set for testing. The number of hidden units was fixed to the same size as used in the previous study of LeCun et al. [29]. Note that a sigmoid pair corresponds to two sigmoid units; we therefore used sigmoid pairs for SR and SBP and plain sigmoid units for BP. The output function was set to sigmoid and the loss function to cross-entropy. Following LeCun et al. [9], input vectors were normalized, and randomly initialized parameters for BP and SBP were drawn from a uniform distribution with mean zero and the standard deviation recommended there.
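The random binary label coding can be sketched as follows; the code length used here (10) and the nearest-code-word decoding rule are illustrative assumptions, since the actual choices are not recoverable from this text.

```python
import numpy as np

def random_binary_codes(n_classes, code_length, rng):
    """Assign each class a random binary code word; components are 0/1 with equal probability."""
    return rng.integers(0, 2, size=(n_classes, code_length)).astype(float)

rng = np.random.default_rng(0)
codes = random_binary_codes(10, 10, rng)     # one code word per digit class (length assumed)
# A label y in 0..9 is encoded as codes[y]; network outputs are decoded by the nearest
# code word (an assumed decoding rule, for illustration only).
def decode(outputs, codes):
    return np.argmin(np.sum((outputs[:, None, :] - codes[None, :, :])**2, axis=2), axis=1)
```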

Direct sampling from p̂ is numerically difficult because the differential order of its decomposing kernel grows with the input dimension, here reaching several hundred. We therefore abandoned rigorous sampling and sampled from a mixture annealed distribution instead. As described in Eq. 15, we regarded p̂ as a mixture of component distributions p_i. Using the logarithmic bound given by Eq. 16, we numerically approximated each component from above,

(17)

and drew samples from the resulting, easier component distribution. Details of the sampling technique are explained in A.2. The sampling procedure scales linearly with the dimensionality of the input space and with the number of required hidden units; in particular, its cost is essentially independent of the number of training examples.

The subsequent linear regression was conducted via singular value decomposition (SVD), which generally costs O(nm²) operations, assuming n ≥ m, for decomposing an n × m matrix. In our case n corresponds to the number of training examples and m to the number of hidden units. Finally, backpropagation learning was performed by stochastic gradient descent (SGD) with adaptive learning rates and a diagonal approximation of the Hessian [30]. The experiments were run in R [31] on a Xeon X5660 (2.8 GHz) machine.

In Fig. 4 the classification error rates on the test examples are depicted. The solid black line corresponds to SR, which marked the lowest error rate of the three at the beginning but finished at a higher error after the SGD iterations; its error was not monotonically decreasing in the early stage of training, and it appears that the SR initialization overfitted to some extent. The red dashed line corresponds to SBP, which showed the steepest error reduction in the first iterations of SGD training. The green dotted line corresponds to BP, which declined the slowest in the early stage of training.

In Tab. 1 the training time from initialization through SGD training is listed. The sampling step in SR ran faster than the subsequent regression and SGD steps, and the sampling time of SR and SBP was comparable to the (uniformly random) initialization time of BP. As expected, the regression step in SR, which scales linearly with the amount of data, cost much more time than the sampling step did. The SGD step was also costly; however, each step took merely a matter of seconds, and this would shorten further if the initial parameters had better accuracy.

Method Sampling [s] Regression [s] BP by SGD ( itrs.) [s]
SR
SBP -
BP -
Table 1: Training Times for MNIST

In this experiment, we confirmed that the proposed method still works for real-world data with the aid of an annealed sampling technique. Although SR showed signs of overfitting, the fast convergence of SBP supports that the oracle distribution gave meaningful parameters and that the annealed sampling technique could draw meaningful samples. Hence the overfitting of SR possibly comes from the regression step, which suggests the need for further refinement of the regression technique. In addition, our further experiments indicated that when the number of hidden units was increased, the initial test error rate became smaller than the error rates previously reported by LeCun et al. [29].

5 Conclusion and future directions

In this paper, we introduced a two-stage weight initialization method for backpropagation: sampling hidden parameters from an oracle distribution and fitting output parameters by ordinary linear regression. Based on the integral representation of neural networks, we constructed the oracle distributions from the given data in a nonparametric way. Since the shapes of these distributions are not simple in high-dimensional input cases, we also discussed numerical techniques such as the coordinate transform and the mixture approximation of the oracle distributions. We performed three numerical experiments: complicated-curve regression, Boolean function approximation, and handwritten digit classification. These experiments show that our initialization method works well with backpropagation. In particular, for the low-dimensional problems, well-sampled parameters by themselves achieve good accuracy without any parameter updates by backpropagation. For the handwritten digit classification problem, the proposed method works better than random initialization.

Sampling learning methods inevitably come with redundant hidden units, since drawing good samples usually requires a large number of trials. Therefore, model-shrinking algorithms such as pruning, sparse regression, dimension reduction and feature selection are naturally compatible with the proposed method.

Although plenty of integral transforms have been used for theoretical analyses of neural networks, numerical implementations, in particular sampling approaches, have rarely been attempted. Even when the theoretical construction lacks immediate practical applicability, as with the high-order derivatives in our case, each integral representation captures different aspects of neural networks. Monte Carlo discretization of other integral representations is therefore an important direction for future work.

In the deep learning context, it is said that the deep structure remedies the difficulty of a problem through multilayered superpositions of simple information transformations. We conjecture that the complexity of a high-dimensional oracle distribution can be decomposed into relatively simple distributions at each layer of a deep structure. Extending our method to multilayered structures is therefore another important direction for future work.

Acknowledgments

The authors are grateful to Hideitsu Hino for his incisive comments on the paper. They also thank Mitsuhiro Seki for constructive discussions.

References

  • [1] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In J. Langford and J. Pineau, editors, ICML2012, pages 81–88, 2012.
  • [2] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech & Language Processing, 20(1), 2012.
  • [3] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
  • [4] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Bernhard Scholkopf, John Platt, and Thomas Hoffman, editors, NIPS ’06, pages 153–160, 2007.
  • [5] Y. Bengio., A. Courville., and P. Vincent. Representation learning: A review and new perspectives. PAMI, 35(8):1798–1828, 2013.
  • [6] N. Murata. An integral representation of functions using three-layered networks and their approximation bounds. Neural Networks, 9(6):947–956, 1996.
  • [7] J. F. G. de Freitas, M. Niranjan, A. H. Gee, and A. Doucet. Sequential monte carlo methods to train neural network models. Neural Computation, 12(4):955–993, 2000.
  • [8] D. Nguyen and B. Widrow. Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights. In IJCNN 1990, volume 3, pages 21–26, 1990.
  • [9] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In G. Montavon, G. B. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade (2nd ed.), pages 9–48. Springer, 2012.
  • [10] H. Shimodaira. A weight value initialization method for improving learning performance of the backpropagation algorithm in neural networks. In ICTAI1994, pages 672–675, 1994.
  • [11] J. F. Shepanski. Fast learning in artificial neural systems: multilayer perceptron training using optimal estimation. In ICNN1988, volume 1, pages 465–472, 1988.
  • [12] J. Y. F. Yam and T. W. S. Chow. A weight initialization method for improving training speed in feedforward neural network. Neurocomputing, 30(1-4):219–232, 2000.
  • [13] A. N. Kolmogorov. On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR, 114:369–373, 1957.
  • [14] R. Hecht-Nielsen. Kolmogorov’s mapping neural network existence theorem. In ICNN1987, volume III, pages 11–13, 1987.
  • [15] V. Kůrková. Kolmogorov’s theorem is relevant. Neural Computation, 3(4):617–622, 1991.
  • [16] D. A. Sprecher. On the structure of continuous functions of several variables. Transactions of the AMS, 115:340–355, 1965.
  • [17] D. A. Sprecher. A numerical implementation of Kolmogorov's superpositions. Neural Networks, 9(5):765–772, 1996.
  • [18] S. M. Carroll and B. W. Dickinson. Construction of neural nets using the Radon transform. In IJCNN 1989, volume 1, pages 607–611, 1989.
  • [19] K. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.
  • [20] G. Cybenko. Approximation by superpositions of a sigmoidal function. MCSS, 2(4):303–314, 1989.
  • [21] L. K. Jones. A simple lemma on greedy approximation in hilbert space and convergence rates for projection pursuit regression and neural network training. The Annals of Statistics, 20(1):608–613, 1992.
  • [22] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
  • [23] E. J. Candès. Ridgelets: Theory and Applications. PhD thesis, Stanford University, 1998.
  • [24] J. Fadili and J. L. Starck. Curvelets and ridgelets. In R. A. Meyers, editor, Encyclopedia of Complexity and Systems Science, volume 3, pages 1718–1738. Springer New York, 2009.
  • [25] T. Denoeux and R. Lengelle. Initializing back propagation networks with prototypes. Neural Networks, 6(3):351–363, 1993.
  • [26] K. O. Friedrichs. The identity of weak and strong extensions of differential operators. Transactions of the AMS, 55(1):132–151, 1944.
  • [27] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  • [28] Y. LeCun, C. Cortes, and C. J. C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
  • [29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [30] L. Bottou. Online algorithms and stochastic approximations. In D. Saad, editor, Online Learning and Neural Networks. Cambridge University Press, 1998.
  • [31] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013.

Supplementary materials

Appendix A Sampling recipes

Sampling hidden parameters (a, b) from the oracle distribution demands a little ingenuity. In our experiments we implemented two sampling procedures: a rigorous but naive and computationally inefficient one, and an approximate, ad hoc but quick and well-performing one. Although both work quickly and accurately on low-dimensional input problems, only the latter works on high-dimensional problems such as MNIST.

A.1 Sampling from the rigorous oracle distribution

Given a decomposing kernel ψ, we employed the acceptance-rejection (AR) method directly for rigorous sampling from p̂. As a proposal distribution q we employed a uniform distribution. We assume here that the support of the proposal distribution has been adjusted to cover the mass of p̂ as tightly as possible, and that a bounding constant M with p̂ ≤ M q has been estimated. The sampling procedure is then conducted according to Alg. 1.

  repeat
     draw a proposal point (a, b) ∼ q.
     draw a uniformly random value u from the interval [0, 1].
     if u · M · q(a, b) ≤ p̂(a, b) then
         return (a, b)   {accept}
     else
         do nothing   {reject}
     end if
  until acceptance occurs.
Algorithm 1 Rigorous sampling according to the ordinary acceptance-rejection method.

Note that in a high-dimensional case, the estimation accuracy of the bound M and the tightness of the proposal q materially affect the sampling efficiency and accuracy. In fact, the expected number of trials needed to obtain one sample by AR grows exponentially as the dimensionality increases. Since the support of the oracle distribution is not rectangular, sampling in the coordinate-transformed space of Sec. 3.4 remedies the difficulty. In addition, the high-order differentiation in the decomposing kernel causes numerical instability.

A.2 Sampling from the mixture annealed distribution

In order to overcome the difficulty of high-dimensional sampling, we approximately regarded p̂ as a mixture distribution (as described in Eq. 15) and conducted two-step sampling: first choose one component distribution p_i according to the mixing probability π_i, then draw a sample from the chosen component distribution p_i.

Sampling from a component p_i holds another difficulty due to the high-order differentiation in its kernel. According to the upper-bound evaluation (Eq. 16), a high-order derivative of the mollifier has almost all of its mass near both edges of its domain interval and almost none in the middle (see Fig. 5 Left). Hence we approximated, or annealed, this profile by a beta distribution, which can model such extreme skewness (see Fig. 5 Right). We then conducted the sampling in further steps: first sample a position z inside the kernel support according to the annealing beta distribution, then sample a and b under the restriction a·x_i − b = z.

Figure 5: 10th-order derivative of the mollifier. Left: the derivative has almost all of its mass, oscillating rapidly, at both ends of its support and almost none in the middle. Right: the right half of its profile is approximated by a beta distribution (red line).

Obviously the mixture approximation gives rise to a poor restriction and a virtual indefiniteness of a: the rigorous computation establishes relations between (a, b) and all of the input vectors, whereas the mixture approximation establishes just one relation, between (a, b) and one particular x_i. We therefore introduced two additional assumptions. First, a is taken parallel to x_i. Since a always appears in the form a·x − b, only the component of a parallel to x_i can have any effect on that particular example, hence we eliminate the extra freedom in the orthogonal component. Second, the scale of a is matched to the distances between input vectors. Since the norm of a controls the spatial frequency of a hidden unit, it determines how broad a part of the input space the hidden unit covers, that is, to which input vectors the unit responds selectively. Therefore, in order to avoid the isolated case in which a unit responds to only one input vector, we tie this scale to the distance between two randomly selected input examples x_k and x_l, denoted dist(x_k, x_l). Once a is fixed under these assumptions, b is determined by the restriction a·x_i − b = z.

Given shape parameters of the beta distribution, one cycle of our second sampling method is summarized as Alg. 2. This method consists of no expensive steps: it scales linearly with the dimensionality of the input space and with the number of required sample parameters, and it does not depend on the size of the training data. A code sketch of one possible reading of this recipe is given after Alg. 2.

  choose an index i of an input example according to the mixing probability π_i.
  draw a position z from the annealing beta distribution.
  choose two input examples x_k and x_l at random and set the scale of a from dist(x_k, x_l).
  set a parallel to x_i with that scale.
  set b from the restriction a·x_i − b = z.
Algorithm 2 Quick sampling from the mixture annealed distribution (for high-dimensional use).
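A minimal Python sketch of one reading of this recipe follows. Several details are explicit assumptions rather than the paper's choices: the mixing probabilities π_i are taken proportional to |y_i|, the annealing beta distribution is placed on |z| with a random sign, and the norm of a is tied to the reciprocal of the distance between two randomly chosen inputs (whether the paper uses the distance itself or its reciprocal is not recoverable here).

```python
import numpy as np

def sample_mixture_annealed(X, y, n_units, alpha, beta, rng):
    """Quick sampling of hidden parameters (a_j, b_j) in the spirit of Alg. 2.

    X: (N, m) inputs, y: (N,) scalar targets (multi-output targets would need a norm);
    alpha, beta: shape parameters of the annealing beta distribution."""
    N, m = X.shape
    pi = np.abs(y) / np.sum(np.abs(y))               # assumed mixing probabilities
    A, B = np.zeros((n_units, m)), np.zeros(n_units)
    eps = 1e-12
    for j in range(n_units):
        i = rng.choice(N, p=pi)                       # pick one component / input example
        # annealed position inside the kernel support [-1, 1]; alpha, beta should be
        # chosen so that most mass lies near |z| = 1, mimicking Fig. 5
        z = rng.choice([-1.0, 1.0]) * rng.beta(alpha, beta)
        k, l = rng.choice(N, size=2, replace=False)
        # assumed convention: ||a|| tied to the reciprocal of a typical inter-point distance
        a_norm = 1.0 / (np.linalg.norm(X[k] - X[l]) + eps)
        a = a_norm * X[i] / (np.linalg.norm(X[i]) + eps)   # a parallel to x_i
        b = a @ X[i] - z                                    # enforce a . x_i - b = z
        A[j], B[j] = a, b
    return A, B
```

Apart from the one-off computation of the mixing probabilities, the loop touches only a few input vectors per hidden unit, which is consistent with the claim above that the cost hardly depends on the number of training examples.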