HyperLearn: A Distributed Approach for Representation Learning in Datasets With Many Modalities

09/19/2019 ∙ by Devanshu Arya, et al. ∙ University of Amsterdam

Multimodal datasets contain an enormous amount of relational information, which grows exponentially with the introduction of new modalities. Learning representations in such a scenario is inherently complex due to the presence of multiple heterogeneous information channels. These channels can encode both (a) inter-relations between the items of different modalities and (b) intra-relations between the items of the same modality. Encoding multimedia items into a continuous low-dimensional semantic space such that both types of relations are captured and preserved is extremely challenging, especially if the goal is a unified end-to-end learning framework. The two key challenges that need to be addressed are: 1) the framework must be able to merge complex intra- and inter-relations without losing any valuable information and 2) the learning model should be invariant to the addition of new and potentially very different modalities. In this paper, we propose a flexible framework which can scale to data streams from many modalities. To that end, we introduce a hypergraph-based model for data representation and deploy Graph Convolutional Networks to fuse relational information within and across modalities. Our approach provides an efficient solution for distributing otherwise extremely computationally expensive or even infeasible training processes across multiple GPUs, without any sacrifice in accuracy. Moreover, adding new modalities to our model requires only an additional GPU unit while keeping the computational time unchanged, which brings representation learning to truly multimodal datasets. We demonstrate the feasibility of our approach in experiments on multimedia datasets featuring second-, third- and fourth-order relations.




1. Introduction

The field of multimedia has been slowly but steadily growing beyond simply combining diverse modalities, such as text, audio and video, to modeling their complex relations and interactions. In the majority of existing techniques these relations are perceived, and therefore modelled, as only pair-wise connections between two items, which is a major drawback. Going beyond pair-wise connections to encode higher-order relations can not only uncover complex inter-dependencies between items but also help in removing ambiguous relations. For instance, in the task of social image-tag refinement, conventional approaches focus on exploiting the pairwise tag-image relations without considering the user information, which has been proven extremely useful in resolving tag ambiguities and closing the semantic gap between visual representation and semantic meaning (Cui et al., 2014; Tang et al., 2017, 2019; Li and Tang, 2016). It is hence an interesting, but far more challenging problem in multimedia to exploit and learn higher-order relations in order to (a) learn a better representation for each item, (b) improve pairwise retrieval tasks and (c) discover far more complex relations which can be ternary (3rd-order), quaternary (4th-order), quinary (5th-order) or even beyond. As examples, Figure 1 shows the importance of modeling higher-order relations in social networks and in artistic analysis, respectively. In the upper example from Figure 1, textual annotations and information about user demographics are utilized for disambiguation between landmarks with very similar visual appearance. Similarly, the second example illustrates quaternary relations formed by the artworks, media, artists and the time-frame in which they were active. Capturing such complex relations is of utmost importance in a number of tasks performed by domain experts, such as author attribution, influence and appreciation analysis.

Figure 1. Example showing the importance of capturing ternary relations (images-tags-users) in a social network dataset and quaternary relations (artworks-media-artists-timeframe) in an artistic dataset. HyperLearn exploits such relations to learn complex representations for each modality. At the same time, HyperLearn provides a distributed learning approach, which makes it scalable to datasets with many modalities.

Learning representations in multimodal datasets is an extremely complex task due to the enormous amount of relational information available. At the same time, most of these relations have an innate property of 'homophily', i.e. the tendency of similarity to breed connections. Exploiting this property can immensely simplify the understanding of these relations. The similarities can be derived both from intra-relations between items of the same modality and from inter-relations between items across different modalities. Unifying the two types of relations in a complementary manner has the potential to bolster the performance of practically any multimedia task. Thus, in this work we propose an efficient learning framework that can merge the information carried by both intra- and inter-relations in datasets with many modalities. We conjecture that such an approach can pave the way for a generic methodology for learning representations by exploiting higher-order relations. At the same time, we introduce an approach that makes our framework scale to multiple modalities.

We focus on learning a low-dimensional representation for each multimodal item using an unsupervised framework. Unsupervised methods utilize relational information both within and across modalities to learn common representations for a given multimodal dataset. Co-occurrence information simply means that two items from different modalities are semantically similar if they co-exist in a multimedia collection. For example, the textual description of a video often describes the events shown in the visual channel. Many multimedia tasks revolve around this compact latent representation of each multimodal entity (Rasiwasia et al., 2010; Ngiam et al., 2011). The major challenge lies in bridging the learning gap between the two types of relations in a way that they can be semantically complementary in describing similar concepts. Learning such representations is usually extremely expensive, in both computational time and required storage, as even a relatively small multimedia collection normally contains a multitude of complex relations.

Handling a large amount of relations requires a framework with a flexible approach to training across multiple pipelines. Most existing algorithms fail to parallelize their framework into separate pipelines (Li et al., 2018; Yang et al., 2015), resulting in large time and memory consumption. Thus, in the proposed framework we parallelize the training process for different modalities into separate pipelines, each requiring just an additional GPU core. By doing so, we facilitate joint multimodal representation learning on highly heterogeneous multimedia collections containing an arbitrarily large number of modalities, effectively hitting an elusive target sought after since the early days of multimedia research. The points below highlight the contributions of this paper:

  • We address the challenging problem of multimodal representation learning by proposing HyperLearn, an unsupervised framework capable of jointly modeling relations between the items of the same modality, as well as across different modalities.

  • Based on the concept of geometric deep learning on hypergraphs, our HyperLearn framework is effective in extracting higher-order relations in multimodal datasets.

  • In order to reduce prohibitively high computational costs associated with multimodal representation learning, in this work we propose a distributed learning approach, which can be parallelized across multiple GPUs without harming the accuracy. Moreover, introducing a new modality into HyperLearn framework requires only an additional GPU, which makes it scalable to datasets with many modalities.

  • Extensive experimentation shows that our approach is task-independent, with a potential for deployment in a variety of applications and multimedia collections.

2. Related Work

The core challenge in multimodal learning revolves around learning representations that can process and relate information from multiple heterogeneous modalities. Most existing multimodal representation learning methods can be split into two broad categories – multimodal network embeddings and tensor factorization-based latent representation learning. In this section we reflect on representative approaches from these two categories. Since in this work we extend the notion of graph convolutional networks to multimodal datasets, we also touch upon some of the existing techniques that aim to deploy deep learning on graphs.

2.1. Multimodal Network Embedding

A common strategy for representation learning is to project different modalities together into a joint feature space. Traditional methods (McAuley and Leskovec, 2012; Roweis and Saul, 2000; Tenenbaum et al., 2000) focus on generating node embeddings by constructing an affinity graph on the nodes and then finding the leading eigenvectors to represent each node. With the advent of deep learning, neural networks have become a popular way to construct combined representations. They owe their popularity to the ability to jointly learn high-quality representations in an end-to-end manner. For example, Srivastava and Salakhutdinov proposed an approach for learning higher-level representations of multiple modalities, such as images and texts, using Deep Boltzmann Machines (DBM) (Srivastava and Salakhutdinov, 2012). Since then, a large number of multimodal representation learning methods based on deep learning have been proposed. Some of these methods attempt to learn a multimodal network embedding by combining the content and link information (Huang et al., 2018; Li et al., 2018; Tang et al., 2015a; Zhang et al., 2017; Yang et al., 2015; Li et al., 2017; Chang et al., 2015). Another set of methods focuses on modeling the correlation between multiple modalities to learn a shared representation of multimedia items. An example of such coordinated representation is Deep Canonical Correlation Analysis (DCCA), which aims to find a non-linear mapping that maximizes the correlation between the mapped vectors from the two modalities (Yan and Mikolajczyk, 2015). Ambiguities often occur when using network embedding methods to learn multimodal relations, due to sub-optimal usage of the available information. This is mostly because these methods assume relations between items to be pairwise, which often leads to loss of information (Bu et al., 2010; Li et al., 2013; Arya and Worring, 2018).

2.2. Tensor Factorization Based Latent Representation Learning

Decoupling a multidimensional tensor into its factor matrices has been proven successful in unraveling latent representations of its components in an unsupervised manner (Lacroix et al., 2018; Narita et al., 2012; Kolda and Bader, 2009). Most existing approaches aim to embed both entities and relations into a low-dimensional space for tasks such as link prediction (Trouillon et al., 2016), reasoning in knowledge bases (Socher et al., 2013) or multi-label classification (Maruhashi et al., 2018). Recent methods on social image understanding incorporate user information as the third modality for tag-based image retrieval and image-tag refinement problems (Tang et al., 2017, 2019, 2015b). Even though most of these approaches are suitable for large datasets, one of the main disadvantages of factorization-based models is the lack of flexibility when scaling to highly multidimensional datasets. Additionally, most tensor decomposition methods are based on optimization with a least-squares criterion, which severely lacks robustness to outliers (Kim et al., 2013).

In this work, we first overcome the issues of network embedding methods by using a hypergraph-based learning method. Secondly, we introduce a distributed approach to tensor decomposition that scales representation learning to many modalities. Finally, we combine the advantages of the rich information in the network structure with the unsupervised nature of tensor decomposition in one single end-to-end framework.

2.3. Geometric Deep Learning on graphs

Geometric deep learning (Bronstein et al., 2017) brings algorithms that can learn from non-Euclidean data, such as graphs and 3D objects, by proposing an ordering of mathematical operators that is different from common convolutional networks. The aim of geometric deep learning is to process signals defined on the vertices of an undirected graph G = (V, E, A), where V is the set of vertices, E is the set of edges, and A is the adjacency matrix. Following (Shuman et al., 2013; Defferrard et al., 2016), the spectral-domain convolution of signals x and y defined on the vertices of a graph is formulated as

    x ⋆ y = Φ ((Φ^T x) ∘ (Φ^T y)),     (1)

where Φ^T x corresponds to the Graph Fourier Transform of x and ∘ represents the element-wise product; the eigenfunctions Φ of the graph Laplacian play the role of Fourier modes, and the corresponding eigenvalues of the graph Laplacian are identified as the frequencies of the graph. Recent applications of graph convolutional networks range from computer graphics (Boscaini et al., 2016) to chemistry (Duvenaud et al., 2015).

The spectral graph convolutional neural networks (GCNs), originally proposed in (Bruna et al., 2014) and extended in (Defferrard et al., 2016), were proven effective in classification of handwritten digits and news texts. A simplification of the GCN formulation was proposed in (Kipf and Welling, 2017) for semi-supervised classification of nodes in a graph. In the computer vision community, GCNs have been extended to describe shapes in different human poses (Masci et al., 2015), perform action detection in videos (Wang and Gupta, 2018) and analyse images and 3D shapes (Monti et al., 2017). However, in the multimedia field there have been considerably fewer examples of using deep learning on graphs for modeling highly multimodal datasets, with (Rudinac et al., 2017; Arya and Worring, 2018) as notable exceptions.

In this paper, we propose an approach that introduces the application of graph convolutional networks to multimodal datasets. We deploy the Multi-Graph Convolutional Network (MGCNN), originally proposed by (Monti et al., 2017) for the matrix completion task, using row and column graphs as auxiliary information. It aims at extracting spatial features from a matrix by using information from both the row and column graphs. For a matrix X ∈ R^{m×n}, the MGCNN is given by

    X̃_l = Σ_{j,j'=0}^{p} θ_{jj',l} T_j(L_r) X T_{j'}(L_c),     (2)

where Θ_l = (θ_{jj',l}) is a (p+1)×(p+1) matrix representing the coefficients of the filters, T_j(·) denotes the Chebyshev polynomial of degree j, and L_r, L_c are the row and column graph Laplacians, respectively. Using Equation 2 as the convolutional layer of the MGCNN, it produces q output channels (l = 1, …, q) for a matrix X with a single input channel. In this way, one can extract q-dimensional features for each item of the matrix X by combining information from the row and column graphs, which can correspond to, e.g., individual modalities.
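For illustration, a minimal dense NumPy sketch of the multi-graph convolution in Equation 2 might look as follows; the filter order and coefficient values are assumptions for the example, and the helper names are ours:

```python
import numpy as np

def cheb_polys(L, p):
    """Chebyshev polynomial matrices T_0(L), ..., T_p(L) via the recurrence
    T_j(L) = 2 L T_{j-1}(L) - T_{j-2}(L)."""
    T = [np.eye(L.shape[0]), L.copy()]
    for _ in range(2, p + 1):
        T.append(2 * L @ T[-1] - T[-2])
    return T[: p + 1]

def mgcnn_layer(X, L_row, L_col, theta):
    """Single-output-channel multi-graph convolution:
    X_tilde = sum_{j,j'} theta[j, j'] T_j(L_row) @ X @ T_{j'}(L_col)."""
    p = theta.shape[0] - 1
    Tr = cheb_polys(L_row, p)     # filters along the row graph
    Tc = cheb_polys(L_col, p)     # filters along the column graph
    out = np.zeros_like(X, dtype=float)
    for j in range(p + 1):
        for jp in range(p + 1):
            out += theta[j, jp] * (Tr[j] @ X @ Tc[jp])
    return out
```

In practice one such layer is evaluated q times with different coefficient matrices to obtain q output channels; efficient implementations also rescale the Laplacians and use sparse matrix products.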

3. The Proposed Framework

Figure 2. (a) Pair-wise relationship among the items of the same modality in K-modal data. (b) Complex higher-order heterogeneous relationships between entities of different modalities using a Hypergraph representation.
Figure 3. Proposed HyperLearn framework deployed on K modalities with a distributed learning approach

In this section, we propose a novel distributed learning framework that can simultaneously exploit both intra- and inter-relations in multimodal datasets. We depict the inter-relations on a hypergraph and conjecture that this way of representing higher-order relations reduces the loss of information contained within the multimodal network structure (Zhou et al., 2007; Li et al., 2013; Arya and Worring, 2018). Mathematically, a hypergraph is depicted by its adjacency tensor (Banerjee et al., 2017). A simple tensor factorization on this adjacency tensor can disentangle modalities into their compact representations. However, this kind of representation lacks information from the intra-relations of items belonging to the same modality. We therefore incorporate intra-relations among entities as auxiliary information to facilitate the flow of within-modal relationship information.

3.1. Notations

We use boldface underlined letters, such as T, to denote tensors and upper-case letters, such as A, to denote matrices. Let ⊙ represent the Khatri-Rao product (Khatri and Rao, 1968), defined column-wise as

    A ⊙ B = [a_1 ⊗ b_1, a_2 ⊗ b_2, …, a_K ⊗ b_K],

where A ∈ R^{I×K} and B ∈ R^{J×K} are arbitrary matrices with columns a_k and b_k, and ⊗ is the Kronecker product. The resulting matrix A ⊙ B is an expanded matrix of dimension IJ × K on the columns of A and B.
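For concreteness, the column-wise Khatri-Rao product can be sketched in a few lines of NumPy (the helper name is ours; SciPy also ships an equivalent routine):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product: column k of the result is the
    Kronecker product of column k of A (I x K) and column k of B (J x K),
    yielding an (I*J) x K matrix."""
    assert A.shape[1] == B.shape[1], "A and B need the same number of columns"
    return np.einsum('ik,jk->ijk', A, B).reshape(-1, A.shape[1])
```

For example, for A of shape (2, 2) and B of shape (3, 2), khatri_rao(A, B) has shape (6, 2), each column being the Kronecker product of the corresponding columns of A and B.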

3.2. Representing Cross-Modal Inter-Relations using Hypergraphs

Hypergraphs have been proven extremely efficient in depicting higher-order and heterogeneous relations. A hypergraph is a highly efficient way to represent complex relationships between a multitude of diverse entities, as it minimizes the loss of available information (Wolf et al., 2016; Arya and Worring, 2018; Bu et al., 2010). Given multimodal data, we construct a unified hypergraph by building hyperedges (e ∈ E) around each of the individual multimodal items, which are represented on a set of nodes (v ∈ V). These hyperedges correspond to the cross-modal relations between items of different modalities, as illustrated in Figure 2.

A more formalised mathematical interpretation of this unified hypergraph is given by its adjacency tensor T, where the number of components (modes) of the tensor is equal to the number of modalities in the hypergraph. Further, each hyperedge corresponds to an entry in the tensor whose value is the weight of the hyperedge. For simplicity, in this work we focus on unweighted hypergraphs.

Thus, multimodal data with K modalities is depicted on a tensor T ∈ R^{n_1×n_2×…×n_K}, where each component k (k = 1, …, K) of this tensor represents one of the heterogeneous modalities. A single element of T is addressed by providing its precise position through a series of indices, i.e. T_{i_1 i_2 … i_K}.

Further, a hyperedge around a set of nodes is represented by a binary value, such that T_{i_1 i_2 … i_K} = 1 if the relation is known, i.e. if there exists a mutual relation between the modalities for that instance. For example, in the social network use case, with a corresponding image-tag-user tensor T, the images are represented on rows, users on columns and tags on tubes. If the i-th image uploaded by the u-th user is annotated with the t-th tag, then T_{iut} = 1 and 0 otherwise.
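Under these conventions, building such a binary adjacency tensor from observed (image, user, tag) triples is straightforward; the toy ids and sizes below are assumptions for illustration:

```python
import numpy as np

# Hypothetical toy data: (image, user, tag) index triples, each an observed
# hyperedge of the unified hypergraph.
triples = [(0, 0, 2), (0, 1, 2), (1, 1, 0), (2, 0, 1)]
n_images, n_users, n_tags = 3, 2, 3

# Unweighted hypergraph: one binary tensor entry per hyperedge.
T = np.zeros((n_images, n_users, n_tags), dtype=np.int8)
for i, u, t in triples:
    T[i, u, t] = 1
```

For real collections this tensor is extremely sparse, so in practice one would keep only the index triples rather than materializing the dense array.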

3.3. Representing Intra-Relations Between the Items of the Same Modality

Relationships between items of the same modality depend on the nature and properties of that modality. For instance, relationships between users in a social network are defined based on their common interests. To make our framework flexible, each modality k is represented on a separate graph G_k whose connections can be defined independently. For example, relations among images can be established based on their visual features, for tags they can be calculated based on co-occurrence, and for users they can very well be based on mutual likes/dislikes. We denote the adjacency matrix of G_k by A_k, where each of its entries A_k(i, j) = 1 if there exists a relation between the i-th and j-th element, and 0 otherwise. The corresponding normalized graph Laplacians (L_k) are given by

    L_k = I − D_k^{−1/2} A_k D_k^{−1/2},

where D_k, the diagonal matrix of node degrees, is known as the degree matrix.
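This computation can be sketched in a few lines of NumPy, assuming a symmetric binary adjacency matrix with no isolated nodes:

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A
    (assumes no isolated nodes, so all degrees are positive)."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))   # diagonal of D^{-1/2}
    return np.eye(A.shape[0]) - (d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :])
```

The resulting matrix is symmetric with eigenvalues in [0, 2], which is what makes the Chebyshev filtering used later well behaved.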

3.4. Combined Inter-Intra Relational Feature Extraction

The tensor T can be factorized using the Candecomp/Parafac (CP) decomposition (Harshman and others, 1970), which decomposes a tensor into a sum of outer products of vectors. The CP-decomposition of T is defined as

    T ≈ Σ_{r=1}^{R} a_r^{(1)} ∘ a_r^{(2)} ∘ … ∘ a_r^{(K)} = I ×_1 A^{(1)} ×_2 A^{(2)} … ×_K A^{(K)},

where ∘ is the outer product and ×_k represents mode-k multiplication (tensor-matrix product). The matrices A^{(k)} = [a_1^{(k)}, …, a_R^{(k)}] are called factor matrices of rank R and I is a K-th order identity tensor. The matrices A^{(k)} are essentially the latent lower-dimensional representations of each of the components of the tensor and, therefore, of each of the modalities.
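As a sketch, the CP reconstruction from factor matrices can be written compactly with einsum; the rank and toy sizes below are assumptions:

```python
import numpy as np

def cp_reconstruct(factors):
    """Rebuild a K-way tensor from CP factor matrices A^(k) of shape (n_k, R):
    T[i1, ..., iK] = sum_r A1[i1, r] * A2[i2, r] * ... * AK[iK, r]."""
    letters = 'abcdefghij'[: len(factors)]
    # e.g. for K = 3 the spec is 'ar,br,cr->abc'
    spec = ','.join(f'{l}r' for l in letters) + '->' + letters
    return np.einsum(spec, *factors)
```

For two modalities this reduces to the familiar low-rank matrix product A B^T, and for K modalities it yields a tensor of shape (n_1, …, n_K).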

Subsequently, we introduce an approach that learns robust representations by incorporating intra-relational information. We extract spatial features that merge the information from each of the graphs G_k with the latent representation matrices A^{(k)}, using Multi-Graph Convolutional Network (MGCNN) layers of the form

    Ã^{(k)}_l = Σ_{j=0}^{p} θ_{j,l}^{(k)} T_j(L_k) A^{(k)},   l = 1, …, q,

where the output Ã^{(k)} has q output channels. Similar to (Monti et al., 2017), we use an RNN to implement the feature diffusion process, which iteratively predicts accurate changes for the matrix A^{(k)}. Due to its ability to keep long-term internal states, this architecture is highly efficient in learning complex non-linear diffusion processes.

3.5. Loss Function Incorporating Cross-Modality Inter-Relations and Within-Modality Intra Relations

In the standard CP decomposition of a tensor, the factor matrices are approximated by finding a solution to the following equation

    min_{A^{(1)},…,A^{(K)}} ‖ T − I ×_1 A^{(1)} ×_2 A^{(2)} … ×_K A^{(K)} ‖²,     (9)

which essentially tries to find low-dimensional factor matrices A^{(k)} such that their combination is as close as possible to the original tensor T. Further, to add relational information among items within each modality, we extend the "within-mode" regularization term introduced in (Li and Yeung, 2009) for matrices and in (Narita et al., 2012) for third-order tensors to generic K-th order tensors. The basic idea is to add a regularization term to Equation 9 that forces two similar objects in each modality to have similar factors, so that they operate similarly. Thus, the combined loss function is given by:

    L = ‖ T − I ×_1 A^{(1)} … ×_K A^{(K)} ‖² + λ Σ_{k=1}^{K} tr( A^{(k)T} L_k A^{(k)} ),     (10)

where tr(·) returns the trace of a matrix. In Equation 10, the first term consolidates the relative similarities between items across modalities, while the second term ensures closeness between items of the same modality. Minimizing Equation 10 is a non-convex optimization problem in the set of variables A^{(1)}, …, A^{(K)}. Apart from being NP-hard, it is also computationally expensive to perform even simple operations, like an element-wise product, on a K-th order tensor. To obtain a more robust solution, we introduce an alternating method for tensor decomposition, similar to (Kolda and Bader, 2009; Kim et al., 2013). The key insight of such a method is to iteratively solve for one of the components of the tensor while keeping the rest fixed. We exploit this kind of alternating optimization to parallelize our framework across multiple GPUs, by placing each modality on one of them. This creates an independent pipeline for each of the modalities, as shown in Figure 3, which summarizes our distributed learning framework for multimodal datasets.
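A minimal NumPy sketch of the combined loss in Equation 10 for K modalities; λ, the helper name and the toy shapes in the test are assumptions:

```python
import numpy as np

def hyperlearn_loss(T, factors, laplacians, lam=0.1):
    """Squared CP reconstruction error (inter-relations) plus the within-mode
    graph regularizer lam * sum_k tr(A_k^T L_k A_k) (intra-relations)."""
    letters = 'abcdefghij'[: len(factors)]
    spec = ','.join(f'{l}r' for l in letters) + '->' + letters
    T_hat = np.einsum(spec, *factors)                   # CP reconstruction
    fit = np.sum((T - T_hat) ** 2)                      # cross-modal fit term
    reg = sum(np.trace(A.T @ L @ A)                     # within-modal term
              for A, L in zip(factors, laplacians))
    return fit + lam * reg
```

When the factor matrices reproduce the tensor exactly and the graph regularizer vanishes, the loss is zero, which is a convenient sanity check for an implementation.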

3.6. Distributed Training Approach for Learning Latent Representations

The separable feature extraction process for each modality makes our methodology unique and scalable to multiple modalities. The separate pipelines are combined through the joint loss function. Consider solving Equation 10 by keeping all components except A^{(k)} constant. Since only one component of the tensor is then a variable, unfolding the original tensor T into a matrix along the k-th component results in a matrix T_{(k)} of dimensions n_k × m (where m = Π_{j≠k} n_j). The loss function in Equation 10 can thus be rewritten for each of the components A^{(k)} as

    L_k = ‖ T_{(k)} − A^{(k)} ( A^{(K)} ⊙ … ⊙ A^{(k+1)} ⊙ A^{(k−1)} ⊙ … ⊙ A^{(1)} )^T ‖² + λ tr( A^{(k)T} L_k A^{(k)} ),

where ⊙ represents the Khatri-Rao product.
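For illustration, the mode-k unfolding and the resulting per-mode loss can be sketched as follows (NumPy; helper names are ours). Note that with NumPy's C-order unfolding the remaining factors keep their original order, whereas the reversed order written above corresponds to the Fortran-order unfolding convention:

```python
import numpy as np

def unfold(T, mode):
    """Mode-k unfolding: move axis `mode` to the front and flatten the rest
    (C order), giving an n_k x (prod_{j != k} n_j) matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao_list(mats):
    """Column-wise Khatri-Rao product of a list of matrices."""
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, M).reshape(-1, out.shape[1])
    return out

def mode_k_loss(T, factors, laplacians, mode, lam=0.1):
    """||T_(k) - A_k Z^T||^2 + lam * tr(A_k^T L_k A_k), where Z is the
    Khatri-Rao product of the remaining factor matrices."""
    Tk = unfold(T, mode)
    Z = khatri_rao_list([f for i, f in enumerate(factors) if i != mode])
    Ak = factors[mode]
    fit = np.sum((Tk - Ak @ Z.T) ** 2)
    return fit + lam * np.trace(Ak.T @ laplacians[mode] @ Ak)
```

Each worker can then minimize its own mode_k_loss over A^{(k)} while the other factors are held fixed, which is what makes the per-modality pipelines independent.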

4. Experiments

Table 1. Table showing the total number of intra and inter-relations between items on MovieLens, MIR Flickr and OmniArt datasets.

We start our experimental evaluation by showing the performance of our approach on a standard 2-dimensional matrix completion task and then extend it to the 3- and 4-dimensional cases. For the 2D, 3D and 4D cases, we use the MovieLens (Miller et al., 2003), MIR Flickr (Huiskes and Lew, 2008) and OmniArt (Strezoski and Worring, 2017) datasets, respectively. We conjecture that our framework can be generalized to datasets with even more modalities. Table 1 summarizes the number of inter- and intra-relations for the three above-mentioned cases. As seen from the table, even relatively small datasets feature a multitude of relations, which makes learning them even more challenging.

4.1. Task 1: Matrix Completion on Graphs

(a) Time (in ms) taken for each training iteration
(b) Convergence rate of RMSE Loss over time
Figure 6. Illustration of the convergence rate of HyperLearn against sRGCNN. Our method clearly requires a much lower training time per iteration and also converges much faster than sRGCNN.
Figure 7. Detailed performance comparison in terms of Average Precision over 18 concepts on the MIRFlickr dataset

We show the computational advantage of our approach against a matrix completion method that uses side information as a baseline. For this, we use the standard MovieLens 100K dataset (Miller et al., 2003), which consists of 100,000 ratings on a scale of 1 to 5 given by 943 users to 1,682 movies. We follow the experimental setup of Monti et al. (Monti et al., 2017) and construct the respective user and movie intra-relation graphs as unweighted 10-nearest-neighbor graphs.

We compare the performance of HyperLearn with separable Recurrent Graph Convolutional Networks (sRGCNN) as proposed in (Monti et al., 2017). As can be seen from Figure 6, our approach attains comparable performance to the state-of-the-art alternative, while being much faster. The feature extraction approach alternating between the user and movie graphs considerably reduces the time complexity (although not linearly), as can be seen from Figure 6(a), which in turn increases the rate of convergence of the algorithm, as depicted in Figure 6(b). However, due to the continuous alternating loss calculations, the back-propagated gradients sometimes tend to get biased towards one of the modalities, resulting in some higher peaks for HyperLearn in Figure 6(b).

4.2. Task 2: Social Image Understanding

In this experiment, we test the performance of our model on a third-order multimodal relational dataset. We apply our method to uncover latent image representations by jointly exploring user-provided tagging information, visual features of images and user demographics. We conduct experiments on the social image dataset MIR Flickr (Huiskes and Lew, 2008), which consists of 25,000 Flickr images posted by 6,386 users with over 50,000 user-provided tags in total. Some tags are obviously noisy and should be removed; following (Li and Tang, 2016; Tang et al., 2017), tags appearing at least 50 times are kept and the remaining ones are removed. To include user information, we crawl the groups joined by each user through the Flickr API. Some images have broken links or have been deleted by their users; removing such images leaves us with 15,662 images, 6,618 users and 315 tags. The dataset also provides manually created ground-truth image annotations at the semantic concept level. For this filtered dataset, there are 18 unique concepts for the images, such as animals, bird and sky, which we adopt to evaluate the performance. We create an intra-relation graph for images by taking the 10 nearest neighbors based on their widely used standard SIFT features. For users, we create edges between those who joined the same groups, and for tags a graph is created based on their co-occurrence.
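As a sketch, an unweighted k-nearest-neighbor intra-relation graph over item features might be built as follows; the feature matrix, distance metric and k are assumptions for the example:

```python
import numpy as np

def knn_graph(features, k=10):
    """Symmetric, unweighted k-nearest-neighbor adjacency matrix built from
    row-wise feature vectors using Euclidean distance."""
    sq = np.sum(features ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dist, np.inf)              # exclude self-matches
    n = features.shape[0]
    A = np.zeros((n, n), dtype=np.int8)
    for i in range(n):
        A[i, np.argsort(dist[i])[:k]] = 1       # connect to k closest items
    return np.maximum(A, A.T)                   # symmetrize
```

The same helper applies to any modality once a similarity source is fixed: visual features for images, co-occurrence vectors for tags, or group-membership indicators for users.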

To empirically evaluate the effectiveness of our proposed method, we present the performance of the latent representation of images in classifying them into 18 concepts. We compare our model with the following methods:

  • OT: The user-provided tags from Flickr as baseline.

  • TD: The conventional CANDECOMP/PARAFAC (CP) tensor decomposition (Harshman and others, 1970)

  • WDMF: Weakly-supervised Deep Matrix Factorization for Social Image Understanding (Li et al., 2016)

  • MRTF: Multi-correlation Regularized Tensor Factorization approach (Sang et al., 2011)

Model        Training Time (in hours)
WDMF         4.2 ± 0.4
MRTF         2.7 ± 0.3
HyperLearn   1.8 ± 0.3
Table 2. Comparison of the training times (in hours) on the MIR Flickr dataset

These methods, to the best of our knowledge, cannot provide the flexibility of performing distributed training for each modality using multiple GPUs. We report Average Precision (AP) scores for comparing our HyperLearn approach against all of these methods. Average Precision is the standard measure used for multi-label classification benchmarks; it corresponds to the average of the precision at each position where a relevant image appears. Figure 7 shows the comparative performance for all 18 concepts. We also compare HyperLearn with MRTF and WDMF in terms of training time and report the results in Table 2. As can be seen from this table, HyperLearn trains faster than MRTF and WDMF, while its performance is on par with or even better than theirs for most of the concepts in the multi-label classification task shown in Figure 7.

Through this experiment we show that (a) the performance of our approach is on par with existing methods in understanding social image relationships, and (b) by introducing a distributed approach we can cut the training time of the model significantly.

4.3. Task 3: OmniArt

In the last experiment, we show the performance of our model in learning relations that go beyond third-order connections. For this we require a highly multimodal dataset containing complex relations that are hard to interpret. One such dataset is OmniArt, a large-scale artistic benchmark collection consisting of digitized artworks from museums across the world (Strezoski and Worring, 2017, 2018). OmniArt comprises millions of artworks coupled with extensive metadata such as genre, school, material information, creation period and dominant color. This makes the dataset extremely multi-relational and, at the same time, very challenging for learning tasks.

For the purpose of comparison with related work, we first perform the artist attribution task, in which we attempt to determine the creator of an artwork based on his/her inter-relations with artworks, media (e.g., oil, watercolor, canvas) and creation period (timeframe), along with their intra-relations. To this end, we select the artworks corresponding to the most common artists in the collection. Considering each of these data streams – artworks, artists, media and their timeframe in centuries – as a separate modality, we create the inter-relation hypergraph between them. Subsequently, intra-relation graphs are created for each of the 4 modalities in the following way:

  • Artworks: Based on color palette similarity

  • Artists: Based on the schools the artists belong to

  • Media: Based on co-occurrence in all artworks

  • Timeframe: Based on the style and genre prevalent in that century

We take a sub-sample of the OmniArt dataset consisting of 10,000 artworks from 2,776 artists, covering a time period ranging all the way from the 8th to the 20th century, along with 63 prominent media types. On this sampled dataset, the accuracy we achieve on the artist attribution task is on par with the benchmark accuracy of (Strezoski and Worring, 2017). In addition, we conjecture that the hypergraph representation has an important advantage – the ability to learn even higher-order relations, i.e. 5th, 6th order and beyond, something that we intend to prove in future work.

In the particular case of OmniArt, such higher-order relations would include information about, e.g., artist, school, timeframe, medium, dominant colour use, semantics and (implicit) social network. For example, Figure 10 shows the well-known "Olive Trees with Yellow Sky and Sun" painted by Vincent van Gogh in 1889 and Claude Monet's masterpiece "Marine View with a Sunset" from 1875. As nicely portrayed by these two examples, while the two artists exhibit many stylistic similarities, sharing motifs and a time period, their materialization is very different. Influenced by Monet, Van Gogh changed both his colour palette and the coarseness of his brushstrokes, so technically his work became closer to French Impressionism. Detecting "tipping points" in an artist's opus would require multimedia representations capable of capturing information about, e.g., colour, texture and semantic concepts depicted in the paintings, but also information about school, social network, relevant locations, timeframe and historical context. We believe that our proposed framework is a significant and brave step towards ultimately deploying multimedia analysis for solving such complex tasks.

(a) Vincent van Gogh – Olive Trees with Yellow Sky and Sun, 1889
(b) Claude Monet – Marine View with a Sunset, 1875
Figure 10. Van Gogh (a) and Monet (b) have many stylistic similarities, but their materialization is different. Capturing their similarities, differences and influences requires the ability to model higher-order relations.

5. Conclusion and Future Work

In this paper we propose HyperLearn, a hypergraph-based framework for learning complex higher-order relationships in multimedia datasets. The proposed distributed training approach makes the framework scalable to many modalities. We demonstrate the benefits of our approach with regard to both performance and computational time through extensive experimentation on the MovieLens and MIRFlickr datasets, with two and three modalities respectively. To show the flexibility of HyperLearn in encoding a larger number of modalities, we perform experiments on fourth-order relations from the OmniArt dataset. In conclusion, on the examples of very different datasets, domains and use cases, we demonstrate that HyperLearn can be extremely useful for learning representations that capture complex higher-order relations within and across multiple modalities. For future work, we plan to test the approach on an even higher number of heterogeneous modalities, and to extend it to much larger datasets by solving sub-tensors derived from slicing the hypergraph into multiple smaller hypergraphs.
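The mode-wise slicing envisioned for future work can be illustrated on a small example. This sketch, with a hypothetical helper of our own, splits one mode of a relation tensor into blocks that could then be factorized independently:

```python
import numpy as np

def slice_subtensors(T, axis, block):
    """Split a relation tensor into smaller sub-tensors along one mode,
    e.g., to distribute the factorization of a large hypergraph across
    workers. Returns a list of contiguous blocks of width <= block."""
    n = T.shape[axis]
    return [np.take(T, range(start, min(start + block, n)), axis=axis)
            for start in range(0, n, block)]
```

Each block corresponds to a smaller hypergraph over a subset of the items in the sliced modality, while the other modalities are kept whole.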


  • D. Arya and M. Worring (2018) Exploiting relational information in social networks using geometric deep learning on hypergraphs. In Proceedings of the 2018 ACM International Conference on Multimedia Retrieval, pp. 117–125. Cited by: §2.1, §2.3, §3.2, §3.
  • A. Banerjee, A. Char, and B. Mondal (2017) Spectra of general hypergraphs. Linear Algebra and its Applications 518, pp. 14–30. Cited by: §3.
  • D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein (2016) Learning shape correspondence with anisotropic convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 3189–3197. Cited by: §2.3.
  • M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §2.3.
  • J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun (2014) Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR2014), CBLS, April 2014, Cited by: §2.3.
  • J. Bu, S. Tan, C. Chen, C. Wang, H. Wu, L. Zhang, and X. He (2010) Music recommendation by unified hypergraph: combining social media information and music content. In Proceedings of the 18th ACM international conference on Multimedia, pp. 391–400. Cited by: §2.1, §3.2.
  • S. Chang, W. Han, J. Tang, G. Qi, C. C. Aggarwal, and T. S. Huang (2015) Heterogeneous network embedding via deep architectures. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 119–128. Cited by: §2.1.
  • P. Cui, S. Liu, W. Zhu, H. Luan, T. Chua, and S. Yang (2014) Social-sensed image search. ACM Transactions on Information Systems (TOIS) 32 (2), pp. 8. Cited by: §1.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852. Cited by: §2.3.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §2.3.
  • R. A. Harshman et al. (1970) Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multimodal factor analysis. Cited by: §3.4, 2nd item.
  • F. Huang, X. Zhang, C. Li, Z. Li, Y. He, and Z. Zhao (2018) Multimodal network embedding via attention based multi-view variational autoencoder. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 108–116. Cited by: §2.1.
  • M. J. Huiskes and M. S. Lew (2008) The MIR Flickr retrieval evaluation. In Proceedings of the 1st ACM international conference on Multimedia information retrieval, pp. 39–43. Cited by: §4.2, §4.
  • C. Khatri and C. R. Rao (1968) Solutions to some functional equations and their applications to characterization of probability distributions. Sankhyā: The Indian Journal of Statistics, Series A, pp. 167–180. Cited by: §3.1.
  • H. Kim, E. Ollila, V. Koivunen, and C. Croux (2013) Robust and sparse estimation of tensor decompositions. In 2013 IEEE Global Conference on Signal and Information Processing, pp. 965–968. Cited by: §2.2, §3.5.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. Proceedings of the International Conference on Learning Representations. Cited by: §2.3.
  • T. G. Kolda and B. W. Bader (2009) Tensor decompositions and applications. SIAM review 51 (3), pp. 455–500. Cited by: §2.2, §3.5.
  • T. Lacroix, N. Usunier, and G. Obozinski (2018) Canonical tensor decomposition for knowledge base completion. In International Conference on Machine Learning, pp. 2869–2878. Cited by: §2.2.
  • D. Li, Z. Xu, S. Li, and X. Sun (2013) Link prediction in social networks based on hypergraph. In Proceedings of the 22nd International Conference on World Wide Web, pp. 41–42. Cited by: §2.1, §3.
  • H. Li, H. Wang, Z. Yang, and M. Odagaki (2017) Variation autoencoder based network representation learning for classification. In Proceedings of ACL 2017, Student Research Workshop, pp. 56–61. Cited by: §2.1.
  • W. Li and D. Yeung (2009) Relation regularized matrix factorization. In Twenty-First International Joint Conference on Artificial Intelligence, Cited by: §3.5.
  • X. Li, T. Uricchio, L. Ballan, M. Bertini, C. G. Snoek, and A. D. Bimbo (2016) Socializing the semantic gap: a comparative survey on image tag assignment, refinement, and retrieval. ACM Computing Surveys (CSUR) 49 (1), pp. 14. Cited by: 3rd item.
  • Z. Li, J. Tang, and T. Mei (2018) Deep collaborative embedding for social image understanding. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §2.1.
  • Z. Li and J. Tang (2016) Weakly supervised deep matrix factorization for social image understanding. IEEE Transactions on Image Processing 26 (1), pp. 276–288. Cited by: §1, §4.2.
  • K. Maruhashi, M. Todoriki, T. Ohwa, K. Goto, Y. Hasegawa, H. Inakoshi, and H. Anai (2018) Learning multi-way relations via tensor decomposition with neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.2.
  • J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst (2015) Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE international conference on computer vision workshops, pp. 37–45. Cited by: §2.3.
  • J. McAuley and J. Leskovec (2012) Image labeling on a network: using social-network metadata for image classification. In European conference on computer vision, pp. 828–841. Cited by: §2.1.
  • B. N. Miller, I. Albert, S. K. Lam, J. A. Konstan, and J. Riedl (2003) MovieLens unplugged: experiences with an occasionally connected recommender system. In Proceedings of the 8th international conference on Intelligent user interfaces, pp. 263–266. Cited by: §4.1, §4.
  • F. Monti, M. Bronstein, and X. Bresson (2017) Geometric matrix completion with recurrent multi-graph neural networks. In Advances in Neural Information Processing Systems, pp. 3697–3707. Cited by: §2.3, §2.3, §3.4, §4.1, §4.1.
  • A. Narita, K. Hayashi, R. Tomioka, and H. Kashima (2012) Tensor factorization using auxiliary information. Data Mining and Knowledge Discovery 25 (2), pp. 298–324. Cited by: §2.2, §3.5.
  • J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng (2011) Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689–696. Cited by: §1.
  • N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos (2010) A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM international conference on Multimedia, pp. 251–260. Cited by: §1.
  • S. T. Roweis and L. K. Saul (2000) Nonlinear dimensionality reduction by locally linear embedding. science 290 (5500), pp. 2323–2326. Cited by: §2.1.
  • S. Rudinac, I. Gornishka, and M. Worring (2017) Multimodal classification of violent online political extremism content with graph convolutional networks. In Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pp. 245–252. Cited by: §2.3.
  • J. Sang, J. Liu, and C. Xu (2011) Exploiting user information for image tag refinement. In Proceedings of the 19th ACM international conference on Multimedia, pp. 1129–1132. Cited by: 4th item.
  • D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst (2013) The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30 (3), pp. 83–98. Cited by: §2.3.
  • R. Socher, D. Chen, C. D. Manning, and A. Ng (2013) Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pp. 926–934. Cited by: §2.2.
  • N. Srivastava and R. R. Salakhutdinov (2012) Multimodal learning with deep boltzmann machines. In Advances in neural information processing systems, pp. 2222–2230. Cited by: §2.1.
  • G. Strezoski and M. Worring (2017) Omniart: multi-task deep learning for artistic data analysis. arXiv preprint arXiv:1708.00684. Cited by: §4.3, §4.3, §4.
  • G. Strezoski and M. Worring (2018) OmniArt: a large-scale artistic benchmark. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (4), pp. 88. Cited by: §4.3.
  • J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei (2015a) Line: large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, pp. 1067–1077. Cited by: §2.1.
  • J. Tang, Z. Li, M. Wang, and R. Zhao (2015b) Neighborhood discriminant hashing for large-scale image retrieval. IEEE Transactions on Image Processing 24 (9), pp. 2827–2840. Cited by: §2.2.
  • J. Tang, X. Shu, Z. Li, Y. Jiang, and Q. Tian (2019) Social anchor-unit graph regularized tensor completion for large-scale image retagging. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §2.2.
  • J. Tang, X. Shu, G. Qi, Z. Li, M. Wang, S. Yan, and R. Jain (2017) Tri-clustered tensor completion for social-aware image tag refinement. IEEE transactions on pattern analysis and machine intelligence 39 (8), pp. 1662–1674. Cited by: §1, §2.2, §4.2.
  • J. B. Tenenbaum, V. De Silva, and J. C. Langford (2000) A global geometric framework for nonlinear dimensionality reduction. science 290 (5500), pp. 2319–2323. Cited by: §2.1.
  • T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard (2016) Complex embeddings for simple link prediction. In International Conference on Machine Learning, pp. 2071–2080. Cited by: §2.2.
  • X. Wang and A. Gupta (2018) Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 399–417. Cited by: §2.3.
  • M. M. Wolf, A. M. Klinvex, and D. M. Dunlavy (2016) Advantages to modeling relational data using hypergraphs versus graphs. In 2016 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7. Cited by: §3.2.
  • F. Yan and K. Mikolajczyk (2015) Deep correlation for matching images and text. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3441–3450. Cited by: §2.1.
  • C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Chang (2015) Network representation learning with rich text information. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §1, §2.1.
  • D. Zhang, J. Yin, X. Zhu, and C. Zhang (2017) User profile preserving social network embedding.. In IJCAI, pp. 3378–3384. Cited by: §2.1.
  • D. Zhou, J. Huang, and B. Schölkopf (2007) Learning with hypergraphs: clustering, classification, and embedding. In Advances in neural information processing systems, pp. 1601–1608. Cited by: §3.