1 Introduction
With the increasing complexity of deep architectures (e.g. [1, 2]), finding the right architecture and associated hyperparameters, known as model search, has kept humans "in-the-loop." Traditionally, the deep learning community has resorted to techniques such as grid search and random search [3]. In recent years, model-based search, in particular Bayesian Optimization (BO), has become the preferred technique in many deep learning applications [4].
It is common to apply BO to the search over various architectural hyperparameters (e.g. number of layers, number of hidden units per layer), but applying BO to search the space of complete network architectures is much more challenging. Modeling the space of network architectures as a discrete topological space is equivalent to making random choices between architectures during the BO procedure, as each architecture variant in the search space would be equally similar to every other. To exploit architectural similarities, we need a quantifiable distance metric between architectures.
To this end, we introduce a flexible family of graph-induced kernels which can effectively quantify the similarity or dissimilarity between different network architectures. We are the first to explore hierarchical structure learning using BO with a focus on multimodal fusion DNN architectures. We demonstrate empirically on the Cornell Human Activity Recognition Dataset [5] and the Montalbano Gesture Recognition Dataset [6] that the optimized fusion structure found using our approach is on par with an engineered fusion structure.
2 Graph-Induced Kernels for Bayesian Optimization
Let $\mathcal{X}$ be a discrete space. Suppose we want to find $\arg\max_{x \in \mathcal{X}} f(x)$, where $f$ is some non-negative real-valued objective function that is expensive to evaluate, such as the classification accuracy on a validation set of a trained net $x$. We will find an optimal $x$ using Gaussian Process-based Bayesian Optimization. More formally, at each point during the optimization procedure, we have a collection of known pairs $(x_i, y_i)$ for $i = 1, \ldots, n$, where $y_i = f(x_i)$. We want to use Gaussian Process regression to model this data, and then use that model to choose the next $x_{n+1}$ to evaluate. To fit a Gaussian Process, we need to define a kernel function on the discrete domain $\mathcal{X}$.
Radial Kernels: Let $k$ be a kernel on $\mathcal{X}$. We say that $k$ is radial when there exists a metric $d$ on $\mathcal{X}$ and some real shape function $r$ such that $k(x, x') = r(d(x, x'))$. The kernel $k$ could also be described as radial with respect to the metric $d$. For example, the Gaussian and exponential kernels on $\mathbb{R}^n$ are both radial with respect to the standard Euclidean metric.
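As a concrete instance of the definition above, both Euclidean examples factor through the distance. The scale parameters $\lambda > 0$ and $\ell > 0$ are notation chosen here for illustration and are not taken from the original text:

```latex
\begin{align*}
d(x, x') &= \lVert x - x' \rVert_2, \\
k_{\mathrm{Gauss}}(x, x') &= \exp\!\bigl(-\lambda\, d(x, x')^2\bigr), & r_{\mathrm{Gauss}}(t) &= \exp(-\lambda t^2), \\
k_{\mathrm{exp}}(x, x') &= \exp\!\bigl(-d(x, x') / \ell\bigr), & r_{\mathrm{exp}}(t) &= \exp(-t / \ell).
\end{align*}
```

In each case the kernel depends on its two arguments only through their distance, which is what makes it radial.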
Graph-Induced Kernels: Let $G = (V, E)$ be an undirected graph, let $d_G$ be its geodesic distance metric (the geodesic distance between two vertices in a graph is the number of edges in a shortest path connecting them), and let $r$ be some real shape function. We then define the kernel $k_{G,r}$, induced by graph $G$ and shape $r$, to be $k_{G,r}(x, x') = r(d_G(x, x'))$. For example, choosing the Gaussian shape function $r(t) = \exp(-\lambda t^2)$ gives $k_{G,r}(x, x') = \exp(-\lambda\, d_G(x, x')^2)$, where $x, x' \in V$ and $\lambda > 0$ is a parameter of the kernel. If the graph edges are assigned costs, those costs can be treated as the parameters of the kernel instead of $\lambda$.
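The construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is given as an adjacency dictionary with per-edge costs (unit costs recover the plain edge-count geodesic distance), and the function names and the parameter `lam` are hypothetical.

```python
import heapq
import math


def geodesic_distances(adj, source):
    """Shortest-path distances from `source` (Dijkstra).

    `adj` maps each vertex to a list of (neighbor, edge_cost) pairs.
    With all costs equal to 1 this is the number of edges on a shortest path.
    """
    dist = {source: 0.0}
    heap = [(0.0, 0, source)]  # (distance, tie-breaker, vertex)
    counter = 1
    while heap:
        d, _, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, cost in adj[u]:
            nd = d + cost
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, counter, v))
                counter += 1
    return dist


def gaussian_shape(t, lam=1.0):
    """Gaussian shape function; `lam` is an illustrative scale parameter."""
    return math.exp(-lam * t * t)


def graph_induced_kernel(adj, shape=gaussian_shape):
    """Return k(x, y) = shape(geodesic distance between x and y)."""
    all_dist = {u: geodesic_distances(adj, u) for u in adj}

    def k(x, y):
        # Vertices in different components are infinitely far apart.
        return shape(all_dist[x].get(y, math.inf))

    return k


# Toy undirected path graph a - b - c with unit edge costs.
path = {
    "a": [("b", 1.0)],
    "b": [("a", 1.0), ("c", 1.0)],
    "c": [("b", 1.0)],
}
k = graph_induced_kernel(path)
k_ab = k("a", "b")
k_ac = k("a", "c")
```

Replacing the unit costs with other positive values changes the distances, and hence the kernel, without touching the shape function.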
To apply BO to $f$ where $\mathcal{X}$ is discrete, we design a graph $G$ that respects the topology of the domain $\mathcal{X}$ and choose a shape function $r$, inducing a kernel on $\mathcal{X}$ that we can use to fit a Gaussian Process to the collection of known pairs $(x_i, y_i)$ for $i = 1, \ldots, n$, where $y_i = f(x_i)$. This approach is desirable because it reduces the task of defining a kernel on $\mathcal{X}$ to the much simpler task of designing the graph $G$. This enables the user to flexibly design kernels that are customized to the diverse domain topologies encountered across a variety of applications.
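The fitting and selection step might look like the following NumPy sketch. Several choices here are assumptions made for illustration rather than details from the text: the toy domain is a five-vertex path graph, the observations are centered before fitting a zero-mean GP, and the next point is chosen with an upper-confidence-bound rule with a hypothetical weight `beta` (the text does not name an acquisition function).

```python
import numpy as np


def gp_posterior(K_train, K_cross, prior_var, y, noise_var=1.0):
    """GP regression posterior mean and variance at candidate points.

    K_train:   (n, n) kernel matrix between evaluated points
    K_cross:   (m, n) kernel matrix between candidates and evaluated points
    prior_var: (m,)   prior variance k(x, x) of each candidate
    """
    y_mean = y.mean()  # center the targets (an assumption of this sketch)
    A = K_train + noise_var * np.eye(len(y))
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - y_mean))
    mean = y_mean + K_cross @ alpha
    v = np.linalg.solve(L, K_cross.T)
    var = prior_var - np.sum(v * v, axis=0)
    return mean, np.maximum(var, 0.0)


# Toy domain: vertices 0..4 on a path graph, so the geodesic distance is |i - j|.
idx = np.arange(5)
D = np.abs(idx[:, None] - idx[None, :]).astype(float)
lam = 0.5  # illustrative kernel parameter
K = np.exp(-lam * D ** 2)  # Gaussian shape applied to graph distances

evaluated = np.array([0, 4])     # vertices already evaluated
y = np.array([0.6, 0.8])         # their (made-up) objective values
candidates = np.array([1, 2, 3])

mean, var = gp_posterior(
    K[np.ix_(evaluated, evaluated)],
    K[np.ix_(candidates, evaluated)],
    np.diag(K)[candidates],
    y,
)

beta = 1.0  # hypothetical exploration weight
ucb = mean + beta * np.sqrt(var)
next_vertex = int(candidates[np.argmax(ucb)])
```

The only domain-specific input is the kernel matrix `K`; everything else is standard GP regression, which is why designing the graph is the main modeling decision.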
For example, consider the problem of choosing the best deep multimodal fusion architecture for classification. In this case, each element of the domain $\mathcal{X}$ might be a tree data structure describing a neural network architecture, with the graph $G$ describing neighbor relationships between similar architectures. For a particular architecture $x$, each possible modification of the architecture yields a neighboring architecture $x'$, where $(x, x')$ is an edge in $G$. To accommodate different modifications to the network structure, each modification type can have a corresponding edge weight parameter, such that the graph-induced kernel can be parameterized to respect the different types of modifications.

Network Architecture: The deep neural network used in this work was adapted from the tree-structured architecture reported in [7]. The tree-structured network architecture has a multi-stage training procedure. In the first stage, separate networks are trained to learn modality-specific representations. The second stage of training consists of learning a multimodal shared representation by fusing the modality-specific representation layers. We used the same structure and hyperparameters as reported in that paper for each of the modality-specific representation learning layers, which are typically pre-trained until convergence.
We generalize the fusion strategy to consider $n$-ary fusions between any number of modality-specific or merged-modality network pathways. The search space is constructed by adding fully-connected (FC) layers after any input node or fusion node. Figs. 0(a) and 0(b) depict two possible multimodal fusion architectures with different fusion depths and orders of fusion.
Graph Design: To apply BO to the problem of finding the best multimodal fusion architecture (a net, for brevity), we design a graph $G$ where the nodes are nets and then use the kernel induced by $G$ and a shape function $r$. To design $G$, we first formalize the domain of nets, then we define the edges of $G$. We encode a net as a pair $(T, L)$, where $T$ is a nested set describing the order in which modalities are fused and $L$ is a map from subtrees (nested sets) to the number of subsequent FC layers.
We define two nets $x$ and $x'$ to be neighbors in $G$ if and only if exactly one of the following statements holds:

(i) $x'$ can be constructed from $x$ by either adding or removing a single fusion (while keeping the same set of modalities);

(ii) $x'$ can be constructed from $x$ by changing the position of one of the modalities in the fusion hierarchy, shifting its merging point to either an earlier or a later fusion;

(iii) $x'$ can be constructed from $x$ by incrementing or decrementing its total number of FC layers.
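The encoding and the last of the three neighbor rules can be sketched as follows. This is an illustration under assumptions, not the authors' code: nested `frozenset`s stand in for the nested-set fusion order, the rule is read as changing the FC-layer count after exactly one subtree by one, and `max_layers` is a hypothetical cap that keeps the search space finite. The two fusion-structure rules are omitted.

```python
def fuse(*parts):
    """A fusion node over modality names and/or earlier fusions."""
    return frozenset(parts)


def subtrees(tree):
    """Yield the tree itself and every nested subtree (leaves are modality names)."""
    yield tree
    if isinstance(tree, frozenset):
        for child in tree:
            yield from subtrees(child)


def make_net(tree, fc_layers=None):
    """Encode a net as (fusion order, map from subtree to FC-layer count).

    The map is stored as a frozenset of items so the net is hashable
    and can serve as a graph vertex.
    """
    fc_layers = fc_layers or {}
    counts = {s: fc_layers.get(s, 0) for s in subtrees(tree)}
    return tree, frozenset(counts.items())


def fc_neighbors(net, max_layers=2):
    """Nets reachable by adding or removing one FC layer after one subtree."""
    tree, counts = net
    counts = dict(counts)
    for subtree, n in counts.items():
        for m in (n - 1, n + 1):
            if 0 <= m <= max_layers:
                changed = dict(counts)
                changed[subtree] = m
                yield tree, frozenset(changed.items())


# Example: fuse RGB with depth first, then fuse the result with audio.
order = fuse(fuse("rgb", "depth"), "audio")
net = make_net(order)
neighbors = list(fc_neighbors(net))
```

Because each move can be undone by the opposite move, the resulting neighbor relation is symmetric, as an undirected graph requires.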
Pairing the Gaussian shape function with this completed definition of $G$ induces the kernel we use during BO to find the optimal architecture, which can be parameterized through the weights assigned to the edges of each modification type. In our experiments, we simply set
3 Results and Discussion
We validated the efficacy of our approach on two datasets:
Cornell Human Activity (CAD-60) Dataset [5]: This dataset consists of 5 descriptor-based modalities derived from RGB-D video, and the objective is to classify over 12 human activity classes. For each net that was evaluated, we computed the average test accuracy across 4 cross-validation subsets of the dataset, yielding a generalized measure of accuracy for a given net.
Montalbano Gesture Recognition Dataset [6]: This is a much larger dataset than CAD-60. It consists of 4 modalities: RGB video, depth video, mocap, and audio. The objective is to classify and localize over 20 communicative gesture categories.
We integrated our graph-induced kernel with a GP-based BO framework [8] and compared it with random search for fusion structure optimization. The multimodal network architecture was implemented in Lasagne [9]. We assumed sample noise with a variance of 1.0 for our normalized inputs. Random search [3] serves as the baseline, since it has been shown to be in line with human performance for the same number of trials. Fig. 1(a) shows the performance of the two methods averaged over 100 runs. Our method can find an architecture with the same classification error in roughly 2× fewer iterations than random search. For example, to find an architecture that achieves 19.7% validation error, our approach needed only around 8 iterations, while random search required 18 iterations. Fig. 2(a) shows the average absolute test accuracy difference as a function of the corresponding graph kernel distance. The strictly positive trend of this plot suggests that the metric incorporated into our graph kernel captures enough information about the search space to correctly evaluate the real distance between network structures.

Fig. 1(b) shows the number of iterations needed to find a network structure that produces good test performance on the Montalbano dataset. Our proposed technique achieved a speedup compared to random search. Fig. 2(b) shows a positive trend similar to that seen for CAD-60. Despite the tight variance between the performances of different architectures on this dataset, our graph-induced kernel provided a sufficiently high signal-to-noise ratio to be usable for structure optimization.

4 Conclusion and Future Work
In this work, we have proposed a novel graph-induced kernel approach in which easily designed graphs can define a kernel specialized for any discrete domain. To demonstrate its utility, we have cast deep multimodal fusion architecture search as a discrete hyperparameter optimization problem. We demonstrate that our method can optimize the network architecture, leading to accuracies that are on par with, or slightly exceed, those of manually designed architectures [7], while evaluating 2 to 5 times fewer architectures than random search on 2 challenging human activity recognition problems.
Acknowledgments
This research is partially funded by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL). We would also like to thank NSERC and the Canada Foundation for Innovation for infrastructure funding, as well as NVIDIA Corp. for contributing an NVIDIA Titan X GPU.
References

[1] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[3] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, February 2012.
[4] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE, 104(1):148–175, 2016.
[5] Jaeyong Sung, Colin Ponce, Bart Selman, and Ashutosh Saxena. Human activity detection from RGBD images. In AAAI Workshop on Pattern, Activity and Intent Recognition (PAIR), 2011.
[6] S. Escalera, X. Baro, J. Gonzalez, M.A. Bautista, M. Madadi, M. Reyes, V. Ponce, H.J. Escalante, J. Shotton, and I. Guyon. ChaLearn looking at people challenge 2014: Dataset and results. In ECCVW, 2014.
[7] N. Neverova, C. Wolf, G. Taylor, and F. Nebout. ModDrop: Adaptive multi-modal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2015.
[8] GPy. GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, since 2012.
[9] Lasagne. Lasagne: Lightweight library to build and train neural networks in Theano. https://github.com/Lasagne/Lasagne, since 2015.
[10] James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2546–2554. Curran Associates, Inc., 2011.
[11] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pages 507–523. Springer, 2011.