Introduction
One of the fundamental problems in learning from sequential video data is to find the semantic structures underlying sequences for better representation learning. As most semantic flows cannot be modeled with simple temporal inductive biases, i.e., Markov dependencies, it is crucial to find the complex temporal semantic structures in long sequences in order to understand the video data. We believe there are two ingredients to solving this problem: segmenting the whole long sequence into multiple semantic units, and finding their compositional structures. In this work, we propose a new graph-based method which can discover composite semantic flows in video inputs and utilize them for representation learning of videos. The compositional semantic structures are defined as multi-level graph forms, which make it possible to find long-length dependencies and their hierarchical relationships effectively.
In terms of modeling sequential semantic flows, the related work can be summarized in the following three categories: neural network, clustering, and self-attention based methods.
In terms of neural network architectures, many problems involving sequential inputs are addressed with Recurrent Neural Networks (RNNs), as these networks naturally take sequential inputs frame by frame. However, because RNN-based methods take frames in (incremental) order, their parameters are trained to capture patterns in the transitions between successive frames. This makes it hard to find long-term dependencies across all frames. To account for long-term dependencies, Long Short-Term Memory (LSTM) [19] and Gated Recurrent Units (GRU) [9] introduced switches to the RNN structure, and Hierarchical RNN [8] stacked multiple layers to find hierarchical structures. Even though those methods ignore noisy (unnecessary) frames and maintain the semantic flow through the whole sequence, it is still hard for RNN variants to retain multiple semantic flows and to learn their hierarchical and compositional relationships. Interestingly, Convolutional Neural Network (CNN) based methods, such as ByteNet [21], ConvS2S [13] and WaveNet [30], apply 1D convolution operators to temporal sequences to model dependencies between adjacent frames and their compositions. However, these operators hardly capture variable-length dependencies, which play a significant role as semantic units.
Recent research has revisited the traditional idea of clustering successive frames into representative clusters. Deep Bag of Frame (DBoF) [2] randomly selects frames from whole sequences as the representatives. NetVLAD [3] divides all features into clusters and calculates residuals, which are difference vectors between the feature vectors and their corresponding cluster centers, as the new representations for each feature vector. Even though the main idea of this type of research is quite simple, it helps to understand the semantics of long sequences by focusing on a small number of representative frames. However, it is hard for these methods to consider complex temporal relationships.
Along with the enormous interest in attention mechanisms, a number of significant studies [40, 11] have aimed to understand long (sequential) inputs relying purely on self-attention mechanisms. With real-world applications in natural language understanding, such as question and answering (QA) and paragraph detection, those methods focus on meaningful words (or phrases, sentences) in long reading passages and ignore words irrelevant to the passage by stacking multiple attention layers. These methods consist of a large number of layers with a huge number of parameters and require a huge amount of training data.
In this paper, we propose a method to learn representations of video by discovering the compositional structure in multi-level graph forms. A single video input is represented as a graph, where nodes represent the frames of the video and edges represent the relationships between all node pairs. From the input representations, the Cut-Based Graph Learning Networks (CBGLNs) find temporal structures in the graphs with two key operations: temporally constrained normalized graph-cut and message-passing on the graph. A set of semantic units is found by parameterized kernel and cutting operations, then the representations of the inputs are updated by message-passing operations.
We thoroughly evaluate our method on a large-scale video theme classification task, the YouTube-8M dataset [2]. As a qualitative analysis of the proposed model, we visualize the compositional semantic dependencies of sequential input frames, which are constructed automatically. Quantitatively, the proposed method shows a significant improvement in classification performance over the baseline models. Furthermore, as an extension of our work, we apply the CBGLNs to another video understanding task, Video Question and Answering on the TVQA dataset [26]. With this experiment, we show how the CBGLNs can fit together with other essential components such as attention mechanisms, and demonstrate the effectiveness of the structural representations of our model.
The remainder of the paper is organized as follows. As preliminaries, basic concepts of Graph Neural Networks (GNNs) and graph-cut mechanisms are introduced. Next, the problem statement of this paper is described to make further discussion clear. After that, the proposed Cut-Based Graph Learning Networks (CBGLNs) are described in detail, and the experimental results on the real datasets, YouTube-8M and TVQA, are presented.
Preliminaries
In this section, basic concepts related to graphs are summarized. First, mathematical definitions and notations of graphs are clarified. Second, the normalized graph-cut method is described. Lastly, variants of Graph Neural Networks (GNNs) are introduced.
Graph notations
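A graph $G$ is denoted as a pair $(V, E)$, with $V = \{v_1, \dots, v_N\}$ the set of nodes (vertices) and $E \subseteq V \times V$ the set of edges. Each node $v_i$ is associated with a feature vector $x_i \in \mathbb{R}^{m}$. To make notation more compact, the feature matrix of graph $G$ is denoted as $X = [x_1, \dots, x_N]^{\top} \in \mathbb{R}^{N \times m}$. Also, a graph has an $N$-by-$N$ weighted adjacency matrix $A$, where $A_{ij}$ represents the weight of the edge between $v_i$ and $v_j$.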
Normalized graph-cut algorithm
A graph $G = (V, E)$ can be partitioned into two disjoint sets $V_1$, $V_2$ (with $V_1 \cup V_2 = V$ and $V_1 \cap V_2 = \emptyset$) by removing the edges connecting the two parts. The partitioning cost is defined as the total weight of the edges that have been removed. In addition to the cut cost, the normalized graph-cut method [36] considers the total edge weight connecting a partition with the entire graph (the degree of the partition) to avoid trivial solutions which produce extremely imbalanced clusters. The objective of the normalized graph-cut can be formally described as follows.

$$\mathrm{Ncut}(V_1, V_2) = \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_1, V)} + \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_2, V)} \tag{1}$$

with

$$\mathrm{cut}(V_1, V_2) = \sum_{v_i \in V_1,\, v_j \in V_2} A_{ij} \tag{2}$$

$$\mathrm{assoc}(V_1, V) = \sum_{v_i \in V_1,\, v_j \in V} A_{ij} \tag{3}$$

where $A_{ij}$ is the edge weight value between nodes $v_i$ and $v_j$. The problem is formulated as a discrete optimization problem and is usually relaxed to a continuous one, which can be solved as an eigenvalue problem in polynomial time. By applying the cut method recursively, an input graph is divided into fine-grained subgraphs.
Graph Neural Networks (GNNs)
Since graph neural networks were first proposed by [15], interest in combining deep learning and structured approaches has steadily increased. This has led to various graph-based neural networks being proposed over the years.
Based on spectral graph theory [7], spectral approaches, which convert the graph to the spectral domain and apply a convolution kernel on the graph, were proposed [6, 17, 25]. [14] suggested message passing neural networks (MPNNs), which encompass a number of previous neural models for graphs under a differentiable message passing interpretation.
In detail, the MPNNs have two phases, a message passing phase and an update phase. In the message passing phase, the message $m_v^{l}$ for vertex $v$ in the $l$-th layer is defined in terms of the message function $M^{l}$:

$$m_v^{l} = \sum_{w \in N(v)} M^{l}\left(h_v^{l-1}, h_w^{l-1}, e_{vw}\right) \tag{4}$$

where, in the sum, $N(v)$ denotes the neighbors of $v$ in a graph $G$, $h_v^{l-1}$ is the representation of vertex $v$ from the previous layer, and $e_{vw}$ is the feature of the edge between $v$ and $w$. With the message $m_v^{l}$, the representations of vertex $v$ in the $l$-th layer are obtained via the update function $U^{l}$.

$$h_v^{l} = U^{l}\left(h_v^{l-1}, m_v^{l}\right) \tag{5}$$

The message functions and the update functions are differentiable, so the MPNNs can be trained in an end-to-end fashion. As extensions, several attempts have been made to improve message passing by putting a gating or attention mechanism into the message function, which can have computational benefits [29, 12, 20, 41, 35, 39].
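To make the two phases concrete, the following is a minimal NumPy sketch of one message passing layer. It is an illustration only: the linear message function, the tanh update, and the names `mpnn_layer`, `W_msg` and `W_upd` are assumptions made for this example rather than the formulation of [14], and edge features are reduced to a scalar edge weight.

```python
import numpy as np

def mpnn_layer(H, A, W_msg, W_upd):
    """One message passing step on a weighted graph.

    H: (N, d) node representations from the previous layer.
    A: (N, N) weighted adjacency matrix; A[v, w] > 0 means w is a neighbor of v.
    W_msg, W_upd: hypothetical weights with shapes (d, d) and (2 * d, d).
    """
    # Message phase: each node sums linearly transformed neighbor states,
    # scaled by the edge weight (an assumed, simple message function).
    messages = A @ (H @ W_msg)
    # Update phase: combine the previous state with the aggregated message.
    return np.tanh(np.concatenate([H, messages], axis=1) @ W_upd)

rng = np.random.default_rng(0)
N, d = 4, 3
H = rng.normal(size=(N, d))
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # a 4-node chain graph
W_msg = rng.normal(size=(d, d))
W_upd = rng.normal(size=(2 * d, d))
H_next = mpnn_layer(H, A, W_msg, W_upd)
print(H_next.shape)  # (4, 3)
```

Because the adjacency matrix gates the sum, a node is only influenced by its direct neighbors in one layer; stacking layers widens the receptive field.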
We further note other previous research on learning the structure of a graph. Neural Relational Inference (NRI) [24] used a Variational Autoencoder (VAE) [23] to infer the connectivity between nodes with latent variables. Other approaches based on generative models have also been well studied [5, 10, 37]. However, those approaches depend on the availability of structural information in the training data or have complex training procedures. To our knowledge, this is the first work to propose graph-cut based neural networks that discover the inherent structure of videos without supervision of the structural information.
Problem Statement
The problem tackled in this work can be stated with the notation of the previous section as follows.
We consider videos as inputs, and a video is represented as a graph $G$. The graph has nodes corresponding to the individual frames of the video, each with a feature vector, and the dependencies between two nodes are represented by the weight values of the corresponding edges.
Suppose that the video data has $N$ successive frames and each frame has an $m$-dimensional feature vector $x_i$. Each frame corresponds to a node $v_i$ of graph $G$, and the dependency between two frames $v_i$, $v_j$ is represented by a weighted edge between them. From $G$, the dependency structure among video frames is defined as the weighted adjacency matrix $A$, where $A_{ij}$ is the weight of the edge between $v_i$ and $v_j$. With the aforementioned notations and definitions, we can now formally define the problem of video representation learning as follows:
Given the video frame representations $X$, we seek to discover a weighted adjacency matrix $A$ which represents the dependencies among frames.
(6) 
With and , final representations for video are acquired by .
(7) 
The obtained video representations can be used for various tasks of video understanding. In this paper, the video theme classification and the video question and answering tasks are mainly considered.
Cut-Based Graph Learning Networks
The Cut-Based Graph Learning Networks (CBGLNs) consist of two sub-modules: a structure learning module with the graph-cuts and a representation learning module with message-passing operations. The key idea of the method is to find inherent semantic structures using the graph-cuts and to learn feature vectors of the video with the message-passing algorithm on the semantic structures. Stacking these modules leads to the subsequent discovery of compositional structures in the form of a multi-level graph. Figure 1(a) illustrates the whole structure of the CBGLNs. In the next sections, the operations of each of these modules are described in detail.
Structure Learning Module
In the structure learning module, the dependencies between frames are estimated via parameterized kernels and the temporally constrained graph-cut algorithm.
As the first step, the initial temporal dependencies over all frames are constructed via a parameterized kernel:
(8)
where the learnable transformation inside the kernel is a single-layer feed-forward network without non-linear activation.
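As a rough illustration of this first step, the sketch below builds a dense, non-negative dependency matrix from frame features. The exact kernel form is not reproduced here: the ReLU of an inner product between linearly projected frames, and the names `kernel_adjacency`, `W` and `b`, are assumptions chosen for the example.

```python
import numpy as np

def kernel_adjacency(X, W, b):
    """Initial dependencies between all frame pairs.

    X: (N, m) frame features. W: (m, k) and b: (k,) are the parameters of a
    single linear layer without non-linear activation, as in the text.
    The inner-product similarity and the ReLU clipping are assumptions.
    """
    F = X @ W + b                     # linear projection of every frame
    return np.maximum(F @ F.T, 0.0)   # non-negative pairwise scores

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))           # 6 frames with 8-dimensional features
A_init = kernel_adjacency(X, rng.normal(size=(8, 4)), np.zeros(4))
print(A_init.shape)  # (6, 6)
```

The result is a fully connected weighted graph over frames, which is why a subsequent sparsifying step such as the graph-cut is useful.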
Then, as the second step, the meaningful dependency structure among all pairwise relationships is refined by applying the normalized graph-cut algorithm to the initial dependencies. The objective of the normalized graph-cut for CBGLNs is:
(9) 
To reduce the complexity of Equation (9) and to keep the inherent characteristics of the video data, an additional constraint is added to the graph-cut algorithm. As video data is composed of time-continuous sub-sequences, no two partitioned subgraphs overlap in physical time. This characteristic is implemented by applying the temporal constraint [32, 33] as follows.
(10) 
Thus, a cut can only be made along the temporal axis, and the complexity of the graph partitioning is reduced to linear time while keeping the characteristics of the video data. Also, as the gradients can flow through the surviving edges, the module is end-to-end trainable.
The graph-cut can be applied recursively to the dependency graph, so that multiple partitioned subgraphs are obtained. The number of cut operations is determined by the length of the video, and we also add the constraint that a sub-cluster is not partitioned further if it is no longer than a pre-specified length. Figure 1(b) depicts the detailed operations of the structure learning module.
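The following sketch illustrates the temporally constrained cut under stated assumptions. Because a cut may only fall between two consecutive frames, a sequence of N frames has only N - 1 candidate partitions, and each can be scored with the standard normalized-cut cost. The function names, the `max_depth` stopping rule and the `min_len` default are hypothetical, and for readability the sums are recomputed for every candidate instead of being updated incrementally.

```python
import numpy as np

def ncut_value(A, t):
    """Normalized-cut cost of splitting nodes [0, t) from nodes [t, N)."""
    cut = A[:t, t:].sum()
    assoc_left = A[:t, :].sum()
    assoc_right = A[t:, :].sum()
    if assoc_left == 0 or assoc_right == 0:
        return np.inf
    return cut / assoc_left + cut / assoc_right

def best_temporal_cut(A):
    """Scan the N - 1 temporal cut positions and return the cheapest one."""
    costs = [ncut_value(A, t) for t in range(1, A.shape[0])]
    t = int(np.argmin(costs)) + 1
    return t, costs[t - 1]

def recursive_temporal_cut(A, min_len=2, max_depth=3, offset=0):
    """Recursively split a sequence into contiguous (start, end) segments."""
    n = A.shape[0]
    if max_depth == 0 or n < 2 or n <= min_len:
        return [(offset, offset + n)]
    t, cost = best_temporal_cut(A)
    if not np.isfinite(cost):
        return [(offset, offset + n)]
    left = recursive_temporal_cut(A[:t, :t], min_len, max_depth - 1, offset)
    right = recursive_temporal_cut(A[t:, t:], min_len, max_depth - 1, offset + t)
    return left + right

def mask_adjacency(A, segments):
    """Keep only edges inside each segment (the surviving edges)."""
    M = np.zeros_like(A)
    for s, e in segments:
        M[s:e, s:e] = A[s:e, s:e]
    return M

# Two strongly connected groups of frames with weak links between them.
A = np.full((7, 7), 0.01)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
segments = recursive_temporal_cut(A, min_len=2, max_depth=1)
print(segments)  # [(0, 3), (3, 7)]
A_cut = mask_adjacency(A, segments)
```

In a trainable setting the mask would be applied to the kernel output, so gradients reach the kernel parameters only through the edges that survive the cut.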
Representation Learning Module
After estimating the weighted adjacency matrix, the representation learning module updates the representations of each frame via a differentiable message-passing framework [14]. For the message function, we simply use the weighted sum of adjacent nodes' representations after a linear transformation, similar to [25]:
(11)
where the two matrices involved are the degree matrix of the graph and the adjacency matrix after the cut operations.
For the update function, we integrate the message with the node representations by using low-rank bilinear pooling [22] followed by a position-wise fully connected network.
(12)
where the final transformation is a single-layer position-wise fully connected network. We also employ a residual connection [16] around each layer, followed by layer normalization [4].
Once the representations of all frames are updated, a pooling operation is applied to each partitioned subgraph. We then obtain higher-level representations, one for each partitioned subgraph (Figure 1(c)). If additional information such as a query is available (e.g. a question feature vector in the video QA setting), we can pool each subgraph with attentive pooling similar to [34].
In the same way, the higher-level representations are fed into a new structure learning module, and we obtain the video-level representation.
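The steps of this module can be sketched as follows. This is a simplified illustration under several assumptions: row normalization by the degree stands in for the exact normalization, the Hadamard product of two tanh projections stands in for low-rank bilinear pooling, mean pooling is used for each subgraph, and all names (`representation_layer`, `pool_segments`, the keys of `params`) are hypothetical.

```python
import numpy as np

def layer_norm(H, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale)."""
    mean = H.mean(axis=-1, keepdims=True)
    std = H.std(axis=-1, keepdims=True)
    return (H - mean) / (std + eps)

def representation_layer(X, A_cut, params):
    """Message passing, bilinear-style update, residual and layer norm."""
    deg = A_cut.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                            # guard isolated nodes
    M = (A_cut / deg) @ (X @ params["W_m"])        # weighted sum of neighbors
    joint = np.tanh(X @ params["U"]) * np.tanh(M @ params["V"])
    out = joint @ params["P"]                      # position-wise linear layer
    return layer_norm(X + out)                     # residual, then layer norm

def pool_segments(H, segments):
    """Average the frame representations inside each subgraph."""
    return np.stack([H[s:e].mean(axis=0) for s, e in segments])

rng = np.random.default_rng(0)
N, d, r = 7, 4, 8
X = rng.normal(size=(N, d))
segments = [(0, 3), (3, 7)]
A_cut = np.zeros((N, N))
for s, e in segments:                              # block-diagonal adjacency
    A_cut[s:e, s:e] = np.abs(rng.normal(size=(e - s, e - s)))
params = {"W_m": rng.normal(size=(d, d)), "U": rng.normal(size=(d, r)),
          "V": rng.normal(size=(d, r)), "P": rng.normal(size=(r, d))}
H = representation_layer(X, A_cut, params)
Z = pool_segments(H, segments)
print(H.shape, Z.shape)  # (7, 4) (2, 4)
```

The pooled matrix `Z` plays the role of the node features for the next level, where the same two modules can be applied again.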
Experiments
In this section, the experimental results on two different video datasets, YouTube-8M (video theme classification) and TVQA (video question and answering), are provided.
Video Theme Classification Task on YouTube-8M
Data specification
YouTube-8M [2] is a benchmark dataset for video understanding, where the main task is to determine the key topical themes of a video. The dataset consists of 6.1M video clips collected from YouTube. Each video is labeled with one or multiple tags referring to the main topic of the video. Each video is encoded at 1 frame per second up to the first 300 seconds. The volume of video data is too large to be treated in its raw form. As such, the input is preprocessed with pretrained models by the authors of the dataset.
Global Average Precision (GAP) is used as the evaluation metric for the multi-label classification task, as in the YouTube-8M competition. For each video, 20 labels are predicted with confidence scores; the GAP score then computes the average precision over the precision-recall curve obtained by pooling all of the predictions across all the videos.
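As an illustration of the metric, the sketch below computes GAP from dense score and label matrices. It pools the top-k predictions of every video into a single list, sorts them by confidence, and accumulates precision at each correct prediction. The function name is hypothetical, and dividing by the total number of ground-truth labels is an assumption of this sketch.

```python
import numpy as np

def global_average_precision(scores, labels, top_k=20):
    """GAP over all videos.

    scores: (num_videos, num_classes) confidence scores.
    labels: (num_videos, num_classes) binary ground-truth matrix.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    k = min(top_k, scores.shape[1])
    # Keep the k most confident predictions of each video.
    top = np.argsort(-scores, axis=1)[:, :k]
    conf = np.take_along_axis(scores, top, axis=1).ravel()
    hit = np.take_along_axis(labels, top, axis=1).ravel()
    # Pool predictions from all videos and rank them by confidence.
    order = np.argsort(-conf, kind="stable")
    hit = hit[order]
    total_pos = labels.sum()
    if total_pos == 0:
        return 0.0
    precision = np.cumsum(hit) / np.arange(1, hit.size + 1)
    # Sum of precision at each hit, weighted by the recall increment.
    return float((precision * hit).sum() / total_pos)

scores = [[0.9, 0.1, 0.8],
          [0.2, 0.7, 0.3]]
labels = [[1, 0, 0],
          [0, 1, 1]]
gap = global_average_precision(scores, labels, top_k=2)
print(round(gap, 4))
```

In this toy case the pooled ranking has hits at ranks 1, 3 and 4 out of three ground-truth labels, so the score is (1 + 2/3 + 3/4) / 3.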
Model setup
The frame-level visual and audio features are extracted by an Inception-v3 network [38] trained on ImageNet and a VGG-inspired architecture [18] trained for audio classification. These features form the input feature matrix of a video sequence, which is then fed into a sequence model to extract the final video representation. Including our model, all baseline sequence models are composed of two layers, and average pooling is used for the final representations. With the representation, a simple logistic regression is used as the final classifier.
Quantitative results
Table 1: GAP scores of frame-level models on YouTube-8M.

| Frame-level model | GAP |
|---|---|
| Average pooling | 0.7824 |
| DeepBoF (4096 clusters) | 0.8079 |
| NetVLAD (256 clusters) | 0.8396 |
| 1D CNN (2 layers, kernel size 3) | 0.8254 |
| 1D CNN (2 layers, kernel size 5) | 0.8245 |
| 1D CNN (2 layers, kernel size 7) | 0.8247 |
| LSTM (2 layers) | 0.8446 |
| GRU (2 layers) | 0.8160 |
| Bi-LSTM (2 layers) | 0.8410 |
| Bi-GRU (2 layers) | 0.8079 |
| Self-Attention (4 heads, 2 layers) | 0.8553 |
| NeXtVLAD [27] | 0.8499 |
| DCGN [28] | 0.8450 |
| CBGLNs (2 layers) | 0.8597 |
First, we evaluate the classification performance of the proposed model against four types of representative sequential models described in the Introduction and two previously reported state-of-the-art models [27, 28].
The GAP scores are summarized in Table 1. The proposed model outperforms all the compared models, with a GAP of 0.8597 against 0.8553 for the second best. The second best performing model is the self-attention based approach, followed by the RNN, CNN and clustering based approaches. In the “Qualitative results” section, the automatically constructed compositional structures are discussed for a better understanding of the model.
For ablation studies, we selected three critical characteristics of CBGLNs for in-depth analysis: layer normalization, residual connection, and graph-cut after learning representations. The GAP scores on the validation set for each ablation experiment are shown in Table 2. As can be seen from (a) to (c) in Table 2, the residual connections followed by layer normalization are crucial for the representation learning module. Also, to see the effect of sparsifying the adjacency matrix via the graph-cut, an experiment with the reversed order of representation learning and graph-cut was conducted ((d) in Table 2). In this setting, the representations of each node are updated with the adjacency matrix obtained using only the kernel (Equation 8), and the graph-cut algorithm is used just for subgraph pooling. Thus, the model has to learn representations with dense and noisy connections, which degrades its performance. From this result, we can argue that the temporally constrained graph-cut effectively reduces the noisy connections in the graph.
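Table 2: Ablation results (GAP on the validation set).

| Ablation model | GAP |
|---|---|
| (a) without layer normalization | 0.8486 |
| (b) without residual connection | 0.8447 |
| (c) without residual connection and layer normalization | 0.8370 |
| (d) graph-cut after learning representations | 0.8576 |
| CBGLNs | 0.8597 |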
Qualitative results: Learning compositional dependency structure
In this section, we demonstrate the compositional learning capability of CBGLNs by analyzing the constructed multi-level graphs. To make further discussion clear, four terms are used to describe the compositional semantic flows, one for each level: semantic units, scenes, sequences and a video. In Figure 2, a real example using the video titled “Rice Pudding” (https://youtu.be/cD3enxnSJY) is described to show the results.
In Figure 2(a), the learned adjacency matrices in each layer are visualized as grayscale images: the two leftmost images are from the 1st layer and the two rightmost images are from the 2nd layer. To denote multi-level semantic flows, four color-coded rectangles (blue, orange, red and green) are marked, and those colors are consistent with Figure 2(b).
Along the diagonal elements of the adjacency matrix in the 1st layer (Figure 2(a)-1), a set of semantic units is detected, corresponding to the bright blocks (blue). Interestingly, we found that each semantic unit contains highly correlated frames. For example, #1 and #2 are shots introducing the YouTube cooking channel and how to make rice pudding, respectively. #4 and #5 are shots showing a recipe for rice pudding and explaining the various kinds of rice pudding. #6 and #7 are shots of putting ingredients into boiling water in the pot and bringing milk to a boil along with other ingredients. At the end of the video clip, #11 is a shot of decorating the cooked rice pudding and #12 is an outro shot that invites the viewers to subscribe.
These semantic units compose variable-length scenes of the video, and each scene corresponds to a subgraph obtained via graph-cut (Figure 2(a)-2). For example, #13 is a scene introducing this cooking channel and rice pudding. Also, #15 is a scene of making rice pudding with detailed step-by-step instructions, and #16 is an outro scene wrapping up with the cooked rice pudding. The 1st layer of the model updates the representations of frame-level nodes with these dependency structures, then aggregates the frame-level nodes to form scene-level nodes (Layer 1 in Figure 2(b)).
In Figure 2(a)-3 and (a)-4, the sequence-level semantic dependencies (red) are shown. #17 denotes a sequence of making rice pudding from beginning to end, which contains much of the information for identifying the topical theme of this video. Finally, the representations of scenes are updated and aggregated to get the representation of the whole video (Layer 2 in Figure 2(b)).
Video Question & Answering Task on TVQA
Data specification
TVQA [26] is a video question and answering dataset in the TV show domain. It consists of a total of 152.5k question-answer pairs on six TV shows: The Big Bang Theory, How I Met Your Mother, Friends, Grey’s Anatomy, House and Castle. Also, it contains 21.8k short clips of 60-90 seconds segmented from the original TV shows for question answering. The provided inputs are 3 fps image frames, subtitles and multiple-choice questions with 5 candidate answers each, of which only one is correct.
The questions in the dataset are localized to a specific sub-part of the video clips by restricting questions to a composition of two parts, e.g., “Where was Sheldon sitting / before he spilled the milk?”. Models should answer questions using both the visual information and the associated subtitles from the video.
Model setup
The input visual features were extracted from the pooled 2048-D features of the last block of ResNet-101 [16] trained on ImageNet, and the text-based features for subtitles, questions and answers were extracted by GloVe [31]. The visual features and subtitle features are manually aligned by timestamp, and the answer features are also aligned with the visual and subtitle features by an attention mechanism to construct the input. The input is then fed into the CBGLNs to extract the final representations. Different from the YouTube-8M case, we use a question feature vector as the query of attentive pooling, so that the representations of the sequence are pooled via a weighted sum with the attention values.
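A minimal sketch of query-based attentive pooling is given below. The dot-product scoring function and the name `attentive_pool` are assumptions made for illustration; the attentive pooling of [34] may use a different scoring form.

```python
import numpy as np

def attentive_pool(H, q):
    """Pool node representations with attention weights given a query.

    H: (n, d) representations of the nodes in one subgraph.
    q: (d,) query vector, e.g. a question feature (assumed to share d).
    Returns the pooled (d,) vector and the (n,) attention weights.
    """
    logits = H @ q
    logits = logits - logits.max()     # shift for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()  # softmax over the nodes
    return weights @ H, weights

H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
q = np.array([3.0, 1.0])
pooled, weights = attentive_pool(H, q)
print(pooled.shape, int(weights.argmax()))  # (2,) 2
```

Compared with average pooling, the weights let frames that match the question dominate the pooled representation.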
Qualitative results: Attention hierarchy with learned compositional structure
In this section, we show how the attention mechanism can be fit into the CBGLNs to learn representations more effectively. Basically, the attention mechanism places more weight on important parts when aggregating values, given a query. By virtue of the compositionality, the attention mechanism can be naturally applied to the CBGLNs in a hierarchical fashion. Figure 3 presents the learned attention hierarchy in a real video clip of “Friends”. In this example, the question is “What did Chandler say he was going to get, when he got up?” and the answer is “Chandler said he was going to get cigarettes.”
In Figure 3(a), we can see that the scenes with high attention values in layer 2 (Scene 5 and Scene 6, marked by a red rectangle) are aligned well with the localized video portion relevant to the given question. Scene 4, where “Chandler is searching his pocket to find cigarettes while sitting down”, is also in the localized section. Therefore, we can say our model finds the portion of the video relevant to the given question. In layer 1 (Figure 3(a)), the model gives sharper attention at the frame level within each scene (marked by orange rectangles), such as the moment where Chandler gets up or where Chandler yells “I gotta smoke”.
Because the attention operation for each frame is conducted hierarchically, we can calculate cumulative attention values by multiplying the values in layer 1 and layer 2. Figure 3(b) shows the cumulative attention values for each frame. The most important frame from the viewpoint of the model is the “getting up moment” frame, because the model has to answer the question by identifying the meaning of “when he got up” in the question.
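The multiplication described above can be sketched as follows. The numbers are made up for illustration, and the function name `cumulative_attention` and the segment representation as `(start, end)` pairs are assumptions of this example.

```python
import numpy as np

def cumulative_attention(frame_attn, scene_attn, segments):
    """Multiply each frame's layer-1 weight by its scene's layer-2 weight.

    frame_attn: (N,) attention over frames, normalized within each scene.
    scene_attn: (S,) attention over scenes.
    segments: list of S (start, end) frame ranges, one per scene.
    """
    frame_attn = np.asarray(frame_attn, dtype=float)
    cumulative = np.zeros_like(frame_attn)
    for k, (s, e) in enumerate(segments):
        cumulative[s:e] = frame_attn[s:e] * scene_attn[k]
    return cumulative

segments = [(0, 2), (2, 5)]
frame_attn = [0.5, 0.5, 0.2, 0.7, 0.1]   # sums to 1 inside each scene
scene_attn = [0.25, 0.75]                # sums to 1 over scenes
cum = cumulative_attention(frame_attn, scene_attn, segments)
print(int(cum.argmax()))  # 3
```

When both levels are normalized, the cumulative values again sum to one over all frames, so they can be read as a single frame-level distribution.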
In Figure 3(c), the learned adjacency matrices in each layer are visualized. The same color-coded rectangles as in (a) are used for cut scenes (orange rectangles in layer 1) and scenes with high attention values (a red rectangle in layer 2). We also mark a red rectangle in layer 1, which corresponds to the scenes with high attention values in layer 2. Even though Scene 5 and Scene 6 at the frame level (layer 1) are considerably short compared to the whole video sequence, the CBGLNs can find important moments and aggregate them in an effective way.
Conclusion
In this paper, we proposed Cut-Based Graph Learning Networks (CBGLNs), which learn not only the representations of video sequences but also the composite dependency structures within the sequences. To explore the characteristics of CBGLNs, various experiments were conducted on real large-scale video datasets, YouTube-8M and TVQA. The results show that the proposed model efficiently learns the representations of sequential video data by discovering its inherent dependency structure.
Acknowledgements
The authors would like to thank Woo Suk Choi and Chris Hickey for helpful comments and editing. This work was partly supported by the Korea government (2015000310SW.StarLab, 2017001772VTT, 2018000622RMI, 2019001367BabyMind, P0006720GENKO).
References
[1] (2017) 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
[2] (2016) YouTube-8M: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.
[3] (2016) NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307.
[4] (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
[5] (2018) NetGAN: generating graphs via random walks. In International Conference on Machine Learning, pp. 609–618.
[6] (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
[7] (1997) Spectral graph theory. American Mathematical Soc.
[8] (2018) Hierarchical multiscale recurrent neural networks. In International Conference on Learning Representations.
[9] (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
[10] (2018) MolGAN: an implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973.
[11] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
[12] (2017) One-shot imitation learning. In Advances in Neural Information Processing Systems, pp. 1087–1098.
[13] (2017) Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1243–1252.
[14] (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1263–1272.
[15] (2005) A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 2, pp. 729–734.
[16] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[17] (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
[18] (2017) CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 131–135.
[19] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[20] (2017) VAIN: attentional multi-agent predictive modeling. In Advances in Neural Information Processing Systems, pp. 2701–2711.
[21] (2016) Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
[22] (2017) Hadamard product for low-rank bilinear pooling. In [1].
[23] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[24] (2018) Neural relational inference for interacting systems. In International Conference on Machine Learning, pp. 2693–2702.
[25] (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
[26] (2018) TVQA: localized, compositional video question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1369–1379.
[27] (2018) NeXtVLAD: an efficient neural network to aggregate frame-level features for large-scale video classification. In Proceedings of the European Conference on Computer Vision (ECCV).
[28] (2018) Hierarchical video frame sequence representation with deep convolutional graph network. In Proceedings of the European Conference on Computer Vision (ECCV).
[29] (2017) Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124.
[30] (2016) WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
[31] (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
[32] (2005) Detection and representation of scenes in videos. IEEE Transactions on Multimedia 7 (6), pp. 1097–1105.
[33] (2008) Graph-based multilevel temporal video segmentation. Multimedia Systems 14 (5), pp. 277–290.
[34] (2016) Attentive pooling networks. arXiv preprint arXiv:1602.03609.
[35] (2018) Few-shot learning with graph neural networks. In International Conference on Learning Representations.
[36] (2000) Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8), pp. 888–905.
[37] (2018) GraphVAE: towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pp. 412–422.
[38] (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
[39] (2018) Relational neural expectation maximization: unsupervised discovery of objects and their interactions. In International Conference on Learning Representations.
[40] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
[41] (2018) Graph attention networks. In International Conference on Learning Representations.