
MFAS: Multimodal Fusion Architecture Search

by   Juan-Manuel Pérez-Rúa, et al.

We tackle the problem of finding good architectures for multimodal classification problems. We propose a novel and generic search space that spans a large number of possible fusion architectures. In order to find an optimal architecture for a given dataset in the proposed search space, we leverage an efficient sequential model-based exploration approach that is tailored for the problem. We demonstrate the value of posing multimodal fusion as a neural architecture search problem by extensive experimentation on a toy dataset and two other real multimodal datasets. We discover fusion architectures that exhibit state-of-the-art performance for problems with different domain and dataset size, including the NTU RGB+D dataset, the largest multi-modal action recognition dataset available.



1 Introduction

Deep neural networks have proven to be effective models for solving a large variety of problems in several domains, including image [krizhevsky2012imagenet] and video [baccouche2011sequential] classification, speech recognition [hinton2012deep], and machine translation [wu2016google], to name a few. In a multimodal setting, it is very common to transfer models trained on the individual modalities and merge them at a single point. This point can be at the deepest layers, known in the literature as late fusion, which is relatively successful on a number of multimodal tasks [snoek2005early]. However, fusing modalities at their respective deepest features is not necessarily the best way to solve a given multimodal problem. We argue in this paper that considering features extracted from all the hidden layers of independent modalities could potentially increase performance with respect to only using a single combination of late (or early) features. Thus, this work tackles the problem of finding good ways to combine multimodal features to better exploit the information embedded at different layers in deep learning models for classification.

Our hypothesis is in line with a common interpretation of deep neural models, namely that features learned in a convolutional neural network carry varying levels of semantic meaning. In vision, for example, lower layers are known to serve as edge detectors with different orientations and extent, while further layers capture more complex information such as semantic concepts, like faces, trees, animals, etc. Evidently, it is difficult to determine by hand the best way of mixing features with varying levels of semantic meaning when solving multimodal classification problems. For example, learning to classify furry animals might require analysis of lower level visual features that can be used to build up the concept of fur, whereas classes like chirping birds or growling might require analysis of more complex audiovisual attributes. Indeed, features from different layers at different modalities can give different insights from the input data. A similar idea is exploited by unimodal ResNets [he2016deep], where features from different depths are utilized by later layers through skip connections.

In this line of thought, a few recent works analyzed other possible combinations from input modalities [shahroudy2017deep, vielzeuf2018centralnet]. However, those methods fall short, as the model designer needs to choose empirically which intermediate features to consider. Evaluating all of the possibilities by hand would be extremely intensive or simply intractable. Indeed, the more modalities and the deeper they are, the more complicated it is to choose a mixture. This is all the more true when enabling nested combinations of multimodal features. It is in fact a large combinatorial problem.

In order to handle this issue, the aforementioned combinatorial problem has to be tackled by an efficient search method. Luckily, the underlying structure of this problem makes it specially amenable to sequential search algorithms. We propose in this paper to rely on a sequential model-based optimization (SMBO) [hutter2011sequential] scheme, which has previously been applied to the related problem of neural architecture search or AutoML [liu2018progressive, epnas2018]. In a few words, we tackle the problem of multimodal classification by directly posing the problem as a combinatorial search. To the best of our knowledge, this is a completely new approach to the multimodal fusion problem, which, as shown by thorough experimentation, improves the state-of-the-art on several multimodal classification datasets.

This paper brings four main contributions: i) empirical evidence of the importance of searching for optimal multimodal feature fusion on a synthetic toy database; ii) the definition of a search space adapted to multimodal fusion problems, which is a superset of modern fusion approaches; iii) an adaptation of an automatic search approach for accurate fusion of deep modalities on the defined search space; iv) three automatically-found state-of-the-art fusion architectures for different known and well studied multimodal problems encompassing five types of modalities.

The rest of this paper is organized as follows. In Section 2, we describe the work that is related to ours, including multimodal fusion for classification and neural architecture search. Next, in Section 3 we explain our search space and methodology. In Section 4, we present an experimental validation of our approach. Finally, in Section 5, we give final comments and conclusions.

2 Related work

Current design strategies of neural architectures for general classification (multimodal or not) and other learning problems consider the importance of the information encoded at various layers along a deep neural network. Indeed, advances in image classification like residual nets [he2016deep] and densely connected nets [huang2017densely] are related to this idea. Similarly, for the problem of pose estimation, stacked hourglass networks [newell2016stacked] connect encoder and decoder parts of an autoencoder by short-circuit convolutions, allowing the final classifiers to ponder features from bottom layers. However, it is commonly accepted that manually-designed architectures are not necessarily optimally solving the task [zoph2017neural]. In fact, looking at the type of neural networks that are automatically designed by search algorithms, it seems that convoluted architectures with many cross-layer connections and different convolutional operations are preferred [brock2017smash, zoph2017neural].

Interestingly, Escorcia et al. argued that the visual attributes learned by a neural network are distributed across the entire neural network [escorcia2015relationship]. Similarly, it is commonly understood that neural networks encode features in a hierarchical manner, starting from low-level to higher-level features as one goes deeper along them. These ideas motivate our take on the problem of multimodal classification, that is, trying to establish an optimal way to connect and fuse multimodal features. To the best of our knowledge, this work is the first one to directly tackle multimodal fusion for classification as an architecture search problem.

In the following, we give an overview of the multimodal fusion problem for classification as a whole. We then continue with a short discussion on relevant methods for architecture search, since it appears at the core of our method.

Multimodal fusion.

To categorize the different recent approaches of deep multimodal fusion, we can define two main paths of research: architectures and constraints.

The first path focuses on building the best possible fusion architectures, e.g., by finding at which depths the unimodal layers should be fused. Early works distinguished early and late fusion methods [atrey2010multimodal], respectively fusing low-level features and prediction-level features. As reported by [snoek2005early], late fusion performs slightly better in many cases, but for others, it is largely outperformed by early fusion. Late fusion is often defined by the combination of the final scores of each unimodal branch. This combination can be a simple [simonyan2014two] or weighted [natarajan2012multimodal] score average, a bilinear product [ben2017mutan], or a more robust one such as rank minimization [ye2012robust]. Thus, methods such as multiple kernel learning [bach2004multiple] and super-kernel learning [wu2004optimal] may be seen as examples of late fusion. Closer to early fusion, Zhou et al. [zhou2008feature] propose to use a Multiple Discriminant Analysis on concatenated features, while Neverova et al. [neverova2014multi] apply a heuristic consisting of fusing similar modalities earlier than the others. Recently, to take advantage of both low-level and high-level features, Yang et al. [yang2016multilayer] leverage boosting for fusion across all layers. To avoid overfitting due to the large number of parameters in multilayer approaches, multimodal regularization methods [amer2018deep, gu2017learning, jiang2018exploiting] are also investigated. Another family of architectural approaches for multimodal fusion can be grouped under the idea of attention mechanisms, which decide how to weight different modalities using contextual information. The mixture of experts by [jacobs1991adaptive] can be viewed as a first work in this direction. The authors proposed a gated model that picks an expert network for a given input. As an extension, Arevalo et al. [arevalo2017gated] proposed Gated Multimodal Units, which allow this fusion strategy to be applied anywhere in the model and not only at prediction level. In the same spirit, multimodal attention can also be integrated into temporal approaches [hori2017attention, long2018multimodal].

The second category of multimodal fusion methods proposes to define constraints in order to control the relationship between unimodal features and/or the structure of the weights. Ngiam et al. [ngiam2011multimodal] proposed a bimodal autoencoder, forcing the hidden shared representation to be able to reconstruct both modalities, even in the absence of one of them. Andrew et al. [andrew2013deep] adapted Canonical Correlation Analysis to deep neural networks, maximizing correlation between representations. Shahroudy et al. [shahroudy2017deep] use cascading factorization layers to find shared representations between modalities and isolate modality-specific information. To ensure similarity between unimodal features, Engilberge et al. [engilberge2018finding] minimize their cosine distance. Structural constraints can also be applied on the very weights of the neural networks. In addition to modality dropping, Neverova et al. [neverova2016moddrop] propose to zero-mask the cross-modal blocks of the weight matrix in early stages of training. Extending the idea of modality dropping, Li et al. [li2016modout] propose to learn a stochastic mask. Another structure constraint as done through tensor factorization was proposed by 


Neural architecture search.

The last couple of years have seen an increased interest on AutoML methods [brock2017smash, liu2018progressive, epnas2018, pham2018efficient, zoph2017neural]. Most of these methods rely somehow on a neural module at the core of their respective search approaches. This is now known in the literature as neural architecture search (NAS). Neural-based or not, AutoML methods were traditionally reserved for expensive hardware configurations with hundreds of available GPUs [liu2018progressive, zoph2017neural].

Very recently, progressive exploration approaches and weight-sharing schemes have allowed to tremendously reduce the necessary computing power to effectively perform architecture search on sizeable datasets. Another advantage of progressive search methods [liu2018progressive, epnas2018] is that they leverage the intrinsic structure of the search space, by sequentially increasing the complexity of sampled architectures. In this paper, we start from a sequential method with weight sharing [epnas2018] and adapt it to the problem of multimodal classification. In particular, we design a search space that is prone to sequential search and which is a superset of previously introduced fusion schemes, e.g., [vielzeuf2018centralnet]. This is an important aspect of our contribution. As demonstrated by [zoph2018learning], constraining the search space is a key element for affordable architecture search. It turns out that directly tackling multimodal datasets by automatic architecture search without designing a constrained, but meaningful, search space would not be tractable. We demonstrate the value of our approach and the importance of optimizing neural architectures for multimodal classification tasks by tackling three challenging datasets.


Figure 1: General structure of a bi-modal fusion network. Top: A neural network with several hidden layers (grey boxes) with input $x$, and output $\hat{z}_x$. Bottom: A second network with input $y$, and output $\hat{z}_y$. In this work we focus on finding efficient fusion schemes (yellow box and dotted lines).

3 Methodology

In this work, as in many others addressing multimodal fusion, we start from the assumption of having an off-the-shelf multi-layer feature extractor for each one of the involved modalities. In practice, this means that we start from a multi-layer neural network for each modality, which we assume to be already pre-trained. However, the reader should consider that our fusion approach is in fact not limited to neural networks as primary feature extractors.

Without loss of conceptual generality, we assume from now on that we deal with two modalities. The multimodal dataset is composed of pairs of input and output data $(x, y; z)$, where $x$ accounts for the first modality, $y$ for the second one, and $z$ for the supervision labels. Now, we assume that there exist two functions $f(x)$ and $g(y)$ which take $x$ and $y$ as inputs, and output $\hat{z}_x$ and $\hat{z}_y$, which are estimates of the ground-truth labels $z$.

Furthermore, functions $f$ and $g$ are composed of $M$ and $N$ layers, respectively, with subfunctions denoted by $f_l$ and $g_l$. With a slight abuse of notation, we write $x_l$ for the output of layer $l$, i.e., $x_l = (f_l \circ f_{l-1} \circ \cdots \circ f_1)(x)$, and $y_l = (g_l \circ g_{l-1} \circ \cdots \circ g_1)(y)$. See Fig. 1 for a visual representation. Examples of subfunctions when dealing with standard neural networks are operations like convolution, pooling, multiplication by a matrix, non-linearity, etc. The outputs of these subfunctions are the features we want to fuse across modalities. The problem is then to choose which features to fuse and how to mix them.



Figure 2: Two realizations of our search space on a small bimodal network. Left: network defined by . Right: network defined by .

3.1 Multimodal fusion search space

In our approach, data fusion is introduced through a third neural network (see Fig. 2 for some illustrations). Each fusion layer $l$ combines three inputs: the output of the previous fusion layer and one output from each modality. This is done according to the following equation:

$$h_l = \sigma_{\gamma^p_l}\left( W_l \begin{bmatrix} x_{\gamma^m_l} \\ y_{\gamma^n_l} \\ h_{l-1} \end{bmatrix} \right), \qquad (1)$$

where $\gamma_l = (\gamma^m_l, \gamma^n_l, \gamma^p_l)$ is a triplet of variable indices establishing, respectively, which feature from the first modality, which feature from the second modality, and which non-linearity is applied. Also, $\gamma^m_l \in \{1, \ldots, M\}$, $\gamma^n_l \in \{1, \ldots, N\}$, and $\gamma^p_l \in \{1, \ldots, P\}$, where $P$ is the number of candidate non-linearities. For the first fusion layer ($l = 1$), the fusion operation is defined as:

$$h_1 = \sigma_{\gamma^p_1}\left( W_1 \begin{bmatrix} x_{\gamma^m_1} \\ y_{\gamma^n_1} \end{bmatrix} \right).$$

The number of possible fusion layers, a search parameter, is denoted by $L$, so that $l \in \{1, \ldots, L\}$. The fusion layer weight matrix $W_l$ is trainable. Note that we establish feature concatenation as the fixed strategy to process and fuse features. In fact, this could be replaced by a weighted sum of input features. However, during our experiments, we noticed that fusion networks with a weighted sum of features were almost never chosen, and almost always reduced final classification performance with respect to concatenation. Thus, we decided to simply fix the fusion operation to concatenation.
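The fusion step described above (concatenate one feature per modality with the previous fusion output, multiply by a trainable matrix, apply the selected non-linearity) can be sketched in plain Python. This is only an illustrative sketch: the function names, the list-of-lists weight representation, and the choice of ReLU and sigmoid as the two candidate non-linearities are assumptions made for the example, not details taken from the paper.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

# Hypothetical candidate non-linearities (P = 2); the index p selects one.
NONLINEARITIES = [relu, sigmoid]

def fusion_layer(x_feats, y_feats, h_prev, triplet, weights):
    """One fusion step: concatenate, multiply by W, apply the chosen non-linearity.

    x_feats, y_feats: lists of 1-D feature vectors, one per unimodal layer.
    h_prev: output of the previous fusion layer, or None for the first layer.
    triplet: (m, n, p) with 1-based indices, as in the text.
    weights: matrix given as a list of rows.
    """
    m, n, p = triplet
    concat = list(x_feats[m - 1]) + list(y_feats[n - 1])
    if h_prev is not None:
        concat += list(h_prev)
    for row in weights:
        if len(row) != len(concat):
            raise ValueError("weight row length must match concatenated input")
    pre = [sum(w * c for w, c in zip(row, concat)) for row in weights]
    return NONLINEARITIES[p - 1](pre)

# Toy usage: two layers per modality, first fusion layer uses x_1, y_2 and ReLU.
x_feats = [[1.0, -1.0], [0.5, 0.5]]
y_feats = [[2.0], [-3.0]]
W1 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
h1 = fusion_layer(x_feats, y_feats, None, (1, 2, 1), W1)
```

Later fusion layers simply pass the previous output as `h_prev`, which lengthens the concatenated vector and therefore the rows of the corresponding weight matrix.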

An illustrative example for $M = 4$ and $N = 4$ unimodal layers and $P = 2$ non-linearities is shown in Fig. 2. We can observe a couple of realizations of the search space for modalities of four hidden layers and two possible non-linearities. On the right, a fusion scheme with a single fusion at the third layer of the first and second modalities. On the left, two composed fusions. A composed fusion scheme is then defined by a vector of triplets: $\boldsymbol{\gamma} = [\gamma_1, \ldots, \gamma_L]$. We denote the set of all possible vectors of triplets with $l$ layers as $\Gamma_l$.

Observe that this design enables our space to contain a large number of possible fusion architectures, including the networks defined in, for example, CentralNet [vielzeuf2018centralnet]. The size of the search space is exponential in the number of fusion layers $L$, and is expressed by $(M \times N \times P)^L$. If we were to tackle a multimodal problem where the number of layers of the feature extractor is only a portion of the depth that modern neural networks exhibit, say $M = N = 16$, and only considered two possible non-linearities ($P = 2$), a fusion scheme with $L = 5$ would result in a search space of dimension $\sim 3.52 \times 10^{13}$.
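As a quick sanity check on how fast this space grows, the count can be computed directly and compared with a brute-force enumeration on a tiny instance. The sketch below assumes, consistently with the triplet description, that each fusion layer independently picks one of $M \times N \times P$ triplets; the helper names are hypothetical.

```python
from itertools import product

def search_space_size(M, N, P, L):
    """Number of fusion schemes with exactly L fusion layers."""
    return (M * N * P) ** L

def enumerate_architectures(M, N, P, L):
    """Brute-force list of all schemes; only feasible for tiny M, N, P, L."""
    triplets = list(product(range(1, M + 1), range(1, N + 1), range(1, P + 1)))
    return list(product(triplets, repeat=L))

tiny = enumerate_architectures(2, 2, 2, 2)
mm_imdb_size = search_space_size(8, 2, 3, 3)
```

With eight image features, two text features, three non-linearities and three fusion layers, the formula gives 110,592 configurations, which matches the figure reported for the Mm-Imdb experiments.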

Exhaustively exploring all these possibilities is intractable. In particular, consider that evaluation of a single sample in this space corresponds to training and evaluating a multimodal architecture, which can take from several hours to a few days, depending on the problem at hand. This is the reason why we focus on an exploration method that has shown to be sampling-efficient for the related problem of neural architecture search. This is, sequential model-based optimization (SMBO), as used by [liu2018progressive, epnas2018]. In their works, the authors showed that progressively exploring a search space by dividing it into “complexity levels”, ends up providing architectures that perform as well as the ones discovered by a more direct exploration approach, as in [zoph2017neural, zoph2018learning], while sampling fewer architectures. SMBO is well fit to find optimal architectures in the search space designed by [zoph2018learning]. This is because the space is naturally divided by complexity levels that can be interpreted as progression steps (blocks in the “micro space” [liu2018progressive, epnas2018, zoph2018learning]). SMBO sequentially unfolds the complexity of the sampled architectures starting from the simplest one. Luckily, our search space shares a similar structure. We can interpret the number of fusion layers as a hook for progression.

It is worth noting that the constrained search space that we propose exhibits certain desirable properties. Assuming that the unimodal feature extractor networks are available greatly reduces search burden as they do not need to be trained during search, and the complexity of the problem is limited to a manageable magnitude.

3.2 Search algorithm

In SMBO, a model predicting the accuracy of sampled architectures lies at the core of the method. This model, or surrogate function, is trained during progressive exploration of the search space, and it is used to reduce the number of neural networks that have to be trained and evaluated by predicting the performance of unseen architectures. In our case, having a variable-length description of the multimodal architectures $\boldsymbol{\gamma}$, as described in the previous subsection, naturally results in using a recurrent model as surrogate. Let us denote this recurrent function by $\pi$. The parameters of $\pi$ are updated at each iteration by stochastic gradient descent (SGD) training on a subset $S$ of sampled architectures with real-valued accuracies $A$.

Our procedure, named multimodal fusion architecture search (MFAS), and based on [liu2018progressive], is laid out in Alg. 1. From lines 11 to 16, the progressive algorithm starts at the smallest fusion network complexity level, i.e., $L = 1$. Then, the next complexity levels unroll one after the other by sampling $K$ architectures with a probability that is a function of the surrogate model predictions in lines 20 and 21. The fusion architecture search is effectively guided by how new architectures are sampled. Observe that we implement search iterations ($E_s$) and temperature-based sampling ($t$) as in EPNAS [epnas2018]. This is done so the surrogate function does not guide the search with biased assumptions made from partial observations of the search space at early iterations. By using temperature-based sampling, the surrogate function is only trusted as the exploration advances (by reducing the temperature in line 27). This is complemented by training sampled architectures with very few epochs as in ENAS [pham2018efficient], and implementing weight-sharing among sampled architectures to counterweight the main bottleneck of neural architecture search: training sampled architectures to completion. This aspect is of particular importance for multimodal networks, which tend to have a large memory footprint and computing times.
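To make the role of the temperature concrete, the following sketch turns surrogate predictions into sampling probabilities and draws a few candidates without replacement. The softmax-with-temperature form is an assumption chosen for illustration (the text only states that the probability is a function of the surrogate predictions and of a temperature), and both function names are hypothetical.

```python
import math
import random

def sampling_probs(predicted_acc, temperature):
    """Softmax over predicted accuracies; a high temperature flattens it."""
    scaled = [a / temperature for a in predicted_acc]
    peak = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_k(candidates, probs, k, rng):
    """Draw up to k distinct candidates, each draw proportional to its probability."""
    pool = list(zip(candidates, probs))
    chosen = []
    for _ in range(min(k, len(pool))):
        total = sum(p for _, p in pool)
        r = rng.random() * total
        acc = 0.0
        for idx, (_, p) in enumerate(pool):
            acc += p
            if r < acc:
                break
        chosen.append(pool.pop(idx)[0])
    return chosen

preds = [0.9, 0.8, 0.5]
p_hot = sampling_probs(preds, 10.0)   # early search: close to uniform
p_cold = sampling_probs(preds, 0.1)   # late search: concentrates on the best prediction
picks = sample_k(["a", "b", "c"], p_cold, 2, random.Random(0))
```

At a high temperature the three candidates are sampled almost uniformly, so early surrogate errors matter little; as the temperature drops, sampling concentrates on the architectures the surrogate rates highest.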

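1: procedure MFAS(f, g, L, E_s, E_t, K, D, T)
2:     L: max number of fusion layers
3:     E_s: number of search iterations
4:     E_t: number of training epochs
5:     K: number of sampled fusion architectures
6:     D: training and validation sets
7:     T = (T_max, T_min): sampling temperature range
8:     t ← T_max // Set temperature
9:     S ← ∅, A ← ∅ // Initialize corresponding sets of archs. and accuracies
10:    for e = 1, …, E_s do
11:        S_1 ← Γ_1 // Set of fusion architectures with L = 1
12:        Net_1 ← build-fusion-nets(f, g, S_1) // Build fusion nets
13:        train(Net_1, D, E_t) // Train fusion nets
14:        A_1 ← accuracy(Net_1, D) // Get real accuracies for them
15:        S ← S ∪ S_1, A ← A ∪ A_1 // Keep track of sampled archs.
16:        π ← update(π, S_1, A_1) // Train surrogate
17:        for l = 2, …, L do
18:            S'_l ← add-one-layer(S_{l-1}) // Unfold 1 more fusion layer
19:            Â'_l ← π(S'_l) // Predict with surrogate
20:            p_l ← sampling-probs(Â'_l, t) // Compute sampling probs.
21:            S_l ← sample(S'_l, p_l, K) // Sample K fusion archs.
22:            Net_l ← build-fusion-nets(f, g, S_l) // Build fusion nets
23:            train(Net_l, D, E_t) // Train
24:            A_l ← accuracy(Net_l, D) // Calculate accuracies
25:            S ← S ∪ S_l, A ← A ∪ A_l // Keep track of sampled archs.
26:            π ← update(π, S_l, A_l) // Update surrogate
27:            t ← reduce(t, T) // Reduce temperature
28:        end for
29:    end for
       return top-K(S, A) // Return best K from all sampled archs.
30: end procedure
Algorithm 1 Multimodal fusion architecture search (MFAS)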
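The control flow of Alg. 1 can be illustrated with a small runnable sketch. Everything expensive is replaced by stand-ins that are assumptions of this example: `evaluate` stands in for building, training and validating a fusion network, the surrogate is a simple prefix-average instead of a recurrent model, and the temperature follows a geometric decay. It is meant to show the progressive structure only, not to reproduce the authors' implementation.

```python
import math
import random
from itertools import product

def mfas_search(M, N, P, L, search_iters, K, evaluate,
                t_max=10.0, t_min=0.2, seed=0):
    """Toy progressive search; `evaluate(arch)` stands in for build + train + validate."""
    rng = random.Random(seed)
    triplets = list(product(range(1, M + 1), range(1, N + 1), range(1, P + 1)))
    sampled = {}          # architecture (tuple of triplets) -> measured accuracy
    t = t_max
    steps = max(1, search_iters * (L - 1))
    decay = (t_min / t_max) ** (1.0 / steps)   # assumed geometric schedule

    def surrogate(arch):
        # Placeholder surrogate (NOT a recurrent model): mean accuracy of the
        # sampled architectures sharing the longest prefix with `arch`.
        for cut in range(len(arch), 0, -1):
            scores = [acc for s, acc in sampled.items() if s[:cut] == arch[:cut]]
            if scores:
                return sum(scores) / len(scores)
        return 0.0

    def record(archs):
        for arch in archs:
            if arch not in sampled:
                sampled[arch] = evaluate(arch)

    for _ in range(search_iters):
        level = [(g,) for g in triplets]       # all one-layer fusion schemes
        record(level)
        for _ in range(2, L + 1):
            # Unfold one more fusion layer, then score candidates with the surrogate.
            candidates = [a + (g,) for a in level for g in triplets]
            preds = [surrogate(a) for a in candidates]
            peak = max(preds)
            weights = [max(math.exp((p - peak) / t), 1e-12) for p in preds]
            # Weighted sampling without replacement (Efraimidis-Spirakis keys).
            keys = [math.log(1.0 - rng.random()) / w for w in weights]
            order = sorted(range(len(candidates)), key=lambda i: keys[i], reverse=True)
            level = [candidates[i] for i in order[:K]]
            record(level)
            t = max(t_min, t * decay)          # trust the surrogate more over time

    ranked = sorted(sampled.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:K]

def toy_accuracy(arch):
    # Hypothetical stand-in for validation accuracy: rewards fusing deep layers.
    return sum(m + n for m, n, _ in arch) / (6.0 * len(arch))

best = mfas_search(M=3, N=3, P=2, L=3, search_iters=2, K=4, evaluate=toy_accuracy)
```

In a real setting, `evaluate` would dominate the cost, which is why the number of sampled architectures per level ($K$) and the number of training epochs are kept small.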

Another aspect where our search algorithm differs from the original algorithm [liu2018progressive] and from [epnas2018] is that we assume the existence of pre-trained modal functions $f$ and $g$. These functions are used to build a multimodal network from a description of the fusion scheme with $l$ layers (line 12 and line 22). At the end of the iterative progressive search, MFAS returns the best $K$ from the set of all sampled architectures $S$.

Final architecture.

From Alg. 1, we obtain a set of fusion architectures. One could think of using the surrogate function after its last update to predict the very best fusion scheme from those. However, in this paper we train the best five of the final architectures to completion, and simply pick the absolute best one from the obtained validation accuracies. During this last training step we also evaluate the performance of the chosen architectures with a larger size of the fusion weight matrices $W_l$. The reduced size is used during search to improve sampling speed and to reduce memory costs.

Loss function.

During the search, the weights of the feature extractors $f$ and $g$ are frozen. Because of this, only the fusion softmax output $\hat{z}_{x,y}$ is used for the loss function. Found architectures are initially trained for a few epochs with frozen $f$ and $g$ functions. A second training step with more epochs involves a multitask loss on $\hat{z}_x$, $\hat{z}_y$, $\hat{z}_{x,y}$, and unfrozen $f$ and $g$ functions. A categorical cross-entropy loss is used in all the reported experiments unless otherwise noted.
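The two training regimes can be contrasted with a minimal sketch of the losses involved. The equal weighting of the three cross-entropy terms is an assumption of this example, since the text does not state how the multitask terms are combined, and the function names are hypothetical.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Categorical cross-entropy for a single example with an integer label."""
    return -math.log(softmax(logits)[label])

def search_loss(fusion_logits, label):
    # During search the unimodal networks are frozen: only the fusion head is scored.
    return cross_entropy(fusion_logits, label)

def multitask_loss(x_logits, y_logits, fusion_logits, label):
    # Final training: both unimodal heads and the fusion head contribute
    # (equal weights assumed here).
    return (cross_entropy(x_logits, label)
            + cross_entropy(y_logits, label)
            + cross_entropy(fusion_logits, label))

loss_search = search_loss([0.0, 0.0], 1)
loss_final = multitask_loss([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], 1)
```

With uninformative (all-zero) logits over two classes, each term equals $\ln 2$, so the multitask loss is simply three times the search loss.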

Handling arbitrary tensor dimensions.

A practical issue during the creation of a multimodal neural network from $f$ and $g$ is that subfunctions might deliver tensors with arbitrary dimensions, hindering fusion of arbitrary modalities and layer positions. To deal with this in a generic way, we perform global pooling along the channel dimension of 2D and 3D convolutions, while leaving linear layer outputs as they are.
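One way to read this step is that every convolutional feature map is reduced to a vector with one entry per channel, so that features from any layer can be concatenated. The sketch below assumes average pooling and a channel-first nested-list layout; the text only says "global pooling", so both choices, as well as the function names, are assumptions of the example.

```python
def _flatten(t):
    """Flatten an arbitrarily nested list of numbers into a flat list of floats."""
    if isinstance(t, (int, float)):
        return [float(t)]
    out = []
    for item in t:
        out.extend(_flatten(item))
    return out

def to_vector(feature):
    """Return a 1-D vector: pooled per channel for conv maps, unchanged for linear outputs."""
    if all(isinstance(v, (int, float)) for v in feature):
        return [float(v) for v in feature]      # linear layer output: already a vector
    pooled = []
    for channel in feature:                     # channel-first layout assumed
        vals = _flatten(channel)
        pooled.append(sum(vals) / len(vals))    # global average over all positions
    return pooled

conv2d_map = [[[1, 2], [3, 4]], [[0, 0], [0, 8]]]   # 2 channels, 2x2 spatial
vec_conv = to_vector(conv2d_map)
vec_linear = to_vector([1, 2, 3])
```

Because the recursion flattens any nesting depth, the same helper handles 3D convolution outputs (channels × time × height × width) without changes.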

As a side note, observe in Eq. 1 that our default layer type for fusion is fully connected. We experimented with several forms of 1D convolutions without noticing any improvements.

Weight sharing of fusion layers.

In our implementation of Alg. 1, multimodal neural networks are not trained in parallel. Instead, sampled fusion networks are trained sequentially for a small number of epochs ( in all of our experiments). For two sample indices and , where , we keep track of the weight matrix for layer , so is initialized from if . Please note that weights are only shared among matrices in the same layer .
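A per-layer weight cache is one simple way to realize this kind of sharing between sequentially trained networks. In the sketch below, a later network reuses the stored matrix of the same fusion layer only when the shapes match and otherwise starts from a fresh initialization; the shape-match condition, the uniform initialization range, and the class name are assumptions of this example rather than details from the paper.

```python
import random

class SharedFusionWeights:
    """Keeps the most recent weight matrix for each fusion layer index."""

    def __init__(self, seed=0):
        self._store = {}
        self._rng = random.Random(seed)

    def get(self, layer, rows, cols):
        """Return cached weights for `layer` if shapes match, else a fresh matrix."""
        cached = self._store.get(layer)
        if cached is not None and len(cached) == rows and len(cached[0]) == cols:
            return [row[:] for row in cached]
        return [[self._rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]

    def put(self, layer, weights):
        """Store a copy of the trained weights so later samples can start from them."""
        self._store[layer] = [row[:] for row in weights]

cache = SharedFusionWeights()
trained = [[0.5, -0.5, 0.1], [0.2, 0.3, 0.4]]
cache.put(1, trained)
reused = cache.get(1, 2, 3)    # same layer, same shape: initialized from earlier sample
fresh = cache.get(1, 2, 4)     # same layer, different shape: new initialization
other = cache.get(2, 2, 3)     # different layer: nothing is shared
```

Keying the cache by layer index reflects the statement that weights are only shared among matrices in the same layer.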

4 Experiments

Method Modalities Acc
Top-5 found architectures by random search
image + spect. 0.9174
image + spect. 0.9190
image + spect. 0.9196
image + spect. 0.9224
image + spect. 0.9222
Mean (Std) 0.9203 (0.0021)
Top-5 found architectures by MFAS
image + spect. 0.9258
image + spect. 0.9260
image + spect. 0.9270
image + spect. 0.9266
image + spect. 0.9268
Mean (Std) 0.9264 (0.0004)
Table 1: Evaluation of our search method on the Av-Mnist dataset. The fusion architectures described by arrays of numbers are instances of our search space with . Validation accuracy is reported.

In this section we present an extensive experimental validation of our claims. We first start by presenting experiments on a synthetic toy dataset, namely the Av-Mnist  dataset [vielzeuf2018centralnet]. We then continue our experimental work by directly tackling two other multimodal datasets. These are i) the visual-textual multilabel movie genre classification dataset by [arevalo2017gated] (Mm-Imdb) and ii) the multimodal action recognition dataset by [shahroudy2016ntu] (Ntu rgb+d).

For each dataset, we provide a short description of the task as well as the experimental set-up, and then discuss the results.

Av-Mnist dataset.

This is a simple audio-visual dataset artificially assembled from independent visual and audio datasets. The first modality corresponds to MNIST images, with 75% of their energy removed by PCA. The audio modality is made of audio samples on which we have computed spectrograms. The audio samples are 25,102 pronounced digits of the Tidigits database augmented by adding randomly chosen noise samples from the ESC-50 dataset [piczak2015esc]. Contaminated audio samples are randomly paired, accordingly with labels, with MNIST digits in order to reach 55,000 pairs for training and 10,000 pairs for testing. For validation we take 5000 samples from the training set. The digit energy removal and audio contamination are intentionally done to increase the difficulty of the task (otherwise unimodal networks would achieve almost perfect results and data fusion would not be necessary).



Figure 3: Structure of the found fusion architectures. First: Av-Mnist. Second: Mm-Imdb.
Method Modalities Acc (%)
Unimodal baselines for fusion
LeNet-3 [lecun1990handwritten] image 74.52
LeNet-5 [lecun1990handwritten] spectrogram 66.06
Explicit fusion
Two-stream [simonyan2014two] image + spect. 87.78
CentralNet [vielzeuf2018centralnet] image + spect. 87.86
Ours Top 1 image + spect. 88.38
Table 2: Evaluation of multiple fusion architectures on the Av-Mnist dataset. Test accuracy is reported.

Here, function $g$ is a modified LeNet network [lecun1990handwritten] with five convolutional layers and a global pooling softmax, processing spoken digits. Similarly, $f$ is a modified LeNet with three convolutional layers. We limit the subfunctions of $f$ and $g$ to convolutional layers with activation, so we hook global pooling to each one of them: three for the written digit modality ($M = 3$), and five for the spectrogram one ($N = 5$). For this experiment we let $P = 2$ by allowing the activation function of each fusion layer to be one of two non-linearities.

In Table 1, we show results for two exploration approaches: a purely random one (upper part), and MFAS (bottom). Both exploration approaches are allowed to sample 180 architectures. We show validation accuracy for the top five randomly sampled architectures on the proposed search space (top of Table 1). The large standard deviation is a testament to the usefulness of multimodal fusion architecture search. From these results we can infer that some feature combinations provide better insights from data than some other mixtures. In the lower part of Table 1 we can see that, in contrast to random search, the top five architectures found with our search method present scores with less variability. Furthermore, the best performing architecture on the validation set (validation accuracy 0.9270) is found by our method.

Test accuracy for baselines and competing fusion architectures are reported in Table 2. We report test score of our best found architecture according to Table 1. It can be observed that all multimodal fusion networks largely improve over the unimodal networks, but our automatically found fusion architecture is the one with the best overall score. This was found after three iterations of progressive search and . The success on this toy (but not trivial) dataset is a first milestone in the validation of our contributions.

Method Modalities F1-W F1-M
Unimodal baselines for fusion
Maxout MLP [goodfellow2013maxout] text 0.5754 0.4598
VGG Transfer image 0.4921 0.3350
Explicit fusion
Two-stream [simonyan2014two] image + text 0.6081 0.5049
GMU [arevalo2017gated] image + text 0.6170 0.5410
CentralNet [vielzeuf2018centralnet] image + text 0.6223 0.5344
Ours Top 1 image + text 0.6250 0.5568
Table 3: Evaluation of multiple methods on the Mm-Imdb dataset [arevalo2017gated]. Weighted F1 (F1-W) and Macro F1 (F1-M) are reported for each method.

Mm-Imdb dataset.

This multimodal dataset comprises 25,959 movie titles and metadata from the Internet Movie Database [arevalo2017gated]. Movie data is formed by their plots, posters (RGB images), genres, and many more metadata fields including director, writer, picture format, etc. The task in this dataset is to predict movie genres from posters and movie descriptions. Since very often a movie is assigned to more than one genre, the classification is multi-label. The loss function used for training is binary cross-entropy with weights to balance the dataset.

The original split of the dataset is used in our experiments: 15,552 movies are used for training, 7,799 for testing, and 2,608 for validation. The genres to predict include drama, comedy, documentary, sport, western, film-noir, etc., for a total of 23 non-mutually exclusive classes.
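Since results on this multi-label task are reported as Weighted F1 and Macro F1, a short sketch of how the two averages differ may help. It follows the usual definitions (per-label F1, averaged either uniformly or weighted by the number of positive examples of each label); whether the authors used exactly this computation is an assumption, and the function name is hypothetical.

```python
def f1_scores(y_true, y_pred):
    """Return (weighted_f1, macro_f1) for binary multi-label indicator lists."""
    n_labels = len(y_true[0])
    per_label, support = [], []
    for k in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        denom = 2 * tp + fp + fn
        per_label.append(2 * tp / denom if denom else 0.0)
        support.append(tp + fn)                 # number of true positives for label k
    macro = sum(per_label) / n_labels
    total = sum(support)
    weighted = (sum(f * s for f, s in zip(per_label, support)) / total) if total else 0.0
    return weighted, macro

# Toy example: a frequent genre predicted perfectly, a rare genre always missed.
y_true = [[1, 0], [1, 0], [1, 1], [0, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 1]]
weighted_f1, macro_f1 = f1_scores(y_true, y_pred)
```

In the toy example the rare label pulls the macro score down to 0.5 while the weighted score stays at 0.75, which is why macro F1 is the more demanding measure on datasets with unbalanced genres.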

Performance of unimodal networks is given at the top of Table 3. Using these unimodal networks as a basis, we implemented Two-stream fusion [simonyan2014two], CentralNet [vielzeuf2018centralnet], GMU [arevalo2017gated], and our best found architecture. One can note that our method gives the best results among the four fusion strategies, once again validating our choices on search space design and fusion scheme. (Observe that the original CentralNet paper considers the last feature layer, as pre-computed by the original authors [arevalo2017gated]. Since intermediate layers are not provided, we did not start with the exact same unimodal baselines, and we re-implemented all methods in order to allow a fair comparison.)

The search space for the Mm-Imdb dataset is formed by eight convolutional layers of a VGG-19 image network, and two text Maxout-MLP features. The number of possible fusion configurations available from these features (we set $M = 8$, $N = 2$, and $L = 3$) and the three possible non-linearities ($P = 3$) is $(8 \times 2 \times 3)^3 = 110{,}592$. Our best configuration can be seen in Fig. 3.

Ntu rgb+d dataset.

This dataset was first introduced by Shahroudy et al. [shahroudy2016ntu] in 2016. With 56,880 samples, it is, to the best of our knowledge, the largest color and depth multimodal dataset. Capturing 40 subjects from 80 viewpoints performing 60 classes of activities, Ntu rgb+d is a very challenging dataset, with the particularity that it provides dynamic skeleton-based pose data on top of RGB video sequences. The target activities include drinking, eating, falling down, and even subject interactions like hugging, shaking hands, punching, etc.

| Method | Modalities | Acc (%) |
|---|---|---|
| Single modality | | |
| LSTM [shahroudy2016ntu] | pose | 60.69 |
| part-LSTM [shahroudy2016ntu] | pose | 62.93 |
| Spatio-temp. attention [song2017end] | pose | 73.40 |
| Multiple modalities | | |
| Shahroudy et al. [shahroudy2017deep] | video + pose | 74.86 |
| Bilinear Learning [hu2018deep] | video + pose | 83.30 |
| Bilinear Learning [hu2018deep] | video + pose + depth | 85.40 |
| 2D/3D Multitask [luvizon20182d] | video + pose | 85.50 |
| Unimodal baselines for fusion | | |
| Inflated ResNet-50 [baradel2018glimpse] | video | 83.91 |
| Co-occurrence [li2018co] | pose | 85.24 |
| Explicit fusion | | |
| Two-stream [simonyan2014two] | video + pose | 88.60 |
| GMU [arevalo2017gated] | video + pose | 85.80 |
| CentralNet [vielzeuf2018centralnet] | video + pose | 89.36 |
| Ours Top 1 | video + pose | 90.04 |
Table 4: Evaluation of multiple methods on the Ntu rgb+d dataset [shahroudy2016ntu]. The reported numbers are the average accuracy over the different action subjects (cross-subject measure).
| Network | # of fusion parameters | Validation accuracy |
|---|---|---|
| 0 | 2,229,248 | 0.9327 |
| 1 | 2,196,480 | 0.9289 |
| 2 | 1,737,728 | 0.9301 |
| 3 | 2,163,712 | 0.9346 |
Table 5: Top 4 found architectures on Ntu rgb+d according to validation accuracy during search.


Figure 4: Structure of found fusion architectures. Ntu rgb+d.

In our work, we focus on the cross-subject evaluation, splitting the 40 subjects into training, validation, and testing groups. The subject IDs used for training during search are 1, 4, 8, 13, 15, 17, and 19. For validation we use subjects 2, 5, 9, and 14. During final training of the found architectures we use the same split originally proposed by [shahroudy2016ntu]. We report results on the testing set to objectively compare our found architectures with state-of-the-art manually designed fusion strategies.
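A cross-subject split of this kind can be sketched as a filter on subject IDs. The ID sets below are the ones listed above for the search phase; the sample record layout (a dictionary with a `subject` key) and the function name are assumptions made for illustration.

```python
# Subject IDs used during search, as listed in the text.
SEARCH_TRAIN_SUBJECTS = {1, 4, 8, 13, 15, 17, 19}
SEARCH_VAL_SUBJECTS = {2, 5, 9, 14}

def split_by_subject(samples):
    """Partition samples so that no subject appears in both groups."""
    train = [s for s in samples if s["subject"] in SEARCH_TRAIN_SUBJECTS]
    val = [s for s in samples if s["subject"] in SEARCH_VAL_SUBJECTS]
    return train, val

# Hypothetical records: one clip per subject, for the 40 subjects.
samples = [{"subject": sid, "clip": f"clip_{sid:02d}"} for sid in range(1, 41)]
train, val = split_by_subject(samples)
```

Splitting by subject rather than by clip keeps every person's recordings on one side of the split, which is what makes the cross-subject measure a test of generalization to unseen people.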

Results on the Ntu rgb+d dataset are summarized in Table 4. We report accuracy in percentages for several methods. The first group contains models that process a single modality, with results as reported by the authors themselves. The second group contains state-of-the-art methods that process and fuse several modalities (video, pose, and/or depth). Next, we provide scores, computed by us, for methods processing single modalities. For video, we tested the Inflated ResNet-50 used by [baradel2018glimpse]; for pose, we leverage the deep co-occurrence model by [li2018co]. The numbers in this group are our starting point and baselines. Finally, the last group of methods perform explicit fusion of modalities and are our main competitors.

Observe that our scores are the highest in Table 4. The score we report is the average accuracy over four runs; at 90.04%, it improves over all baselines and competing methods, the closest being CentralNet at 89.36%. This is achieved by performing fusion search on the convolutional and fully connected features of the Inflated ResNet-50 and deep Co-occurrence baselines. We start from four possible features for each modality and three non-linearities, which together determine the size of the search space for the Ntu rgb+d dataset. The best found configuration is shown in Fig. 4. In Table 5 we report validation accuracy during search for the final top four architectures. Observe that the best architecture is not necessarily the largest one.

Figure 5: Left: Error progression during search. Each plot point represents the validation error of a sampled fusion architecture at a given step of our search algorithm on the Av-Mnist dataset. Mean error and standard deviation per step are represented with stars and the shaded band, respectively. Right: search temperature schedule.

Multimodal fusion search behaviour.

In Fig. 5 (left) we display the behaviour of our search procedure by plotting the validation errors of sampled architectures. Overall, the errors of sampled architectures become more stable as the search progresses. This stabilization originates from two sources: first, the shared fusion weights are more refined at the final steps of the search, and second, the search is driven with more confidence by the predictions of the surrogate function. Indeed, at the last few steps, the mean error is significantly lower than at the initial ones.

Another interesting effect of our search method and fusion scheme is that, even at the initial search steps, it is possible to sample architectures that display relatively small validation errors. Since the fusion weights of sampled architectures are trained for only a few epochs, this effect does not necessarily reflect how good or bad the sampled architecture is. Indeed, it is possible to sample a simple fusion scheme on very deep unimodal features (which have been pretrained offline) and outperform other sampled architectures that might actually perform better when their weights are revisited at later search steps. In this sense, our temperature-driven sampling of architectures offers a way to escape the spurious local minima that originate from this phenomenon. In short, to avoid getting trapped by initial biased evidence, it is important to trust the surrogate function only after exploration has advanced. We use an inverse exponential schedule for the sampling temperature, as shown on the right of Fig. 5, since we observed a better outcome than with a linear temperature schedule.
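The interplay between temperature and surrogate predictions can be illustrated with a small sketch. The exact schedule and sampling rule are not given in this passage, so the exponential decay form, its constants, and the softmax-with-temperature sampling below are all assumptions, and every function name is hypothetical.

```python
import math
import random

def inverse_exponential_temperature(step, t_start=10.0, t_end=0.2, decay=0.5):
    """Temperature decaying exponentially from t_start toward t_end (assumed form)."""
    return t_end + (t_start - t_end) * math.exp(-decay * step)

def sampling_probabilities(predicted_scores, temperature):
    """Softmax over surrogate scores divided by the temperature."""
    scaled = [s / temperature for s in predicted_scores]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def sample_architectures(candidates, predicted_scores, temperature, k, rng):
    """Draw k candidates (with replacement) according to the tempered probabilities."""
    probs = sampling_probabilities(predicted_scores, temperature)
    return rng.choices(candidates, weights=probs, k=k)

# Hypothetical surrogate predictions for three candidate architectures.
candidates = ["arch_a", "arch_b", "arch_c"]
scores = [0.90, 0.85, 0.60]

hot = sampling_probabilities(scores, inverse_exponential_temperature(0))
cold = sampling_probabilities(scores, inverse_exponential_temperature(20))
picked = sample_architectures(candidates, scores, inverse_exponential_temperature(20), 2, random.Random(0))
```

At a high temperature the probabilities are close to uniform, so early and possibly biased surrogate predictions have little influence; as the temperature drops, sampling concentrates on the candidates the surrogate ranks highest.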

Search timings.

In Table 6 we provide the hardware settings and timings for the search on all the reported datasets. Multi-GPU training through data parallelism was necessary on the Ntu rgb+d dataset. Search times on Ntu rgb+d are much larger than on the Mm-Imdb dataset due to model complexity and the larger search space.

| Dataset | GPUs (P100) | Search steps | Search time (hours) | Avg. step time (hours) |
|---|---|---|---|---|
| Av-Mnist | 1 | | 3.42 | 0.285 |
| Mm-Imdb | 1 | | 9.24 | 0.616 |
| Ntu rgb+d | 4 | | 150.91 | 12.57 |
Table 6: Search timings and hardware configurations.
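As a sanity check on Table 6, the number of search steps implied by the timings can be derived by dividing the total search time by the average step time. This assumes that the average step time is simply the total time divided by the number of steps; the resulting counts are derived values, not figures reported in the table.

```python
# (total search time in hours, average step time in hours), from Table 6.
timings = {
    "Av-Mnist": (3.42, 0.285),
    "Mm-Imdb": (9.24, 0.616),
    "Ntu rgb+d": (150.91, 12.57),
}

# Implied step count per dataset, rounded to the nearest integer.
implied_steps = {
    name: round(total / per_step) for name, (total, per_step) in timings.items()
}

for name, steps in implied_steps.items():
    print(f"{name}: about {steps} search steps")
```

The ratios come out very close to whole numbers, which suggests that the two timing columns are mutually consistent.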

5 Conclusion

This work tackles the problem of finding accurate fusion architectures for multimodal classification. We propose a novel multimodal search space and exploration algorithm to solve the task efficiently and effectively. The proposed search space is constrained in such a way that it allows complex architectures to emerge while keeping the complexity of the problem at reasonable levels. We experimentally demonstrated the validity of our method on three datasets, discovering several fusion schemes that provide state-of-the-art results on those datasets. Future research directions include improving the search space so that the composition of fusion layers is even more flexible.