1 Introduction
An image contains multiple layers of abstraction, from low-level features to intermediate-level objects to high-level complex semantic relations. To gain a complete visual understanding, it is essential to investigate different abstraction layers jointly. An example of such a multi-abstraction problem is image-to-graph generation, such as road-network extraction [18], blood vessel-graph extraction [42], and scene-graph generation [58]. In all of these tasks, one needs to explore not only the objects, or nodes, but also their mutual dependencies, or relations, as edges.
In spatio-structural tasks, such as road network extraction (Fig. 1a), nodes represent road junctions or significant turns, while edges correspond to structural connections, i.e., the road itself. The resulting spatio-structural graph construction is crucial for navigation tasks, especially with regard to autonomous vehicles. Similarly, in 3D blood vessel-graph extraction (Fig. 1b), nodes represent branching points or substantial curves, and edges correspond to structural connections, i.e., arteries, veins, and capillaries. Biological studies relying on a vascular graph representation, such as detecting collaterals [54] or assessing structural robustness [22], emphasize the importance of efficient graph extraction. In the case of spatio-semantic graph generation, e.g., scene graph generation from natural images (Fig. 1c), the objects denote the nodes and the semantic relations denote the edges [23]. This graphical representation of natural images is compact, interpretable, and facilitates various downstream tasks like visual question answering [19, 25]. Notably, different image-to-graph tasks have been addressed separately in previous literature (see Sec. 2), and to the best of our knowledge, no unified approach has been reported so far.

Traditionally, image-to-graph generation has been studied as a complex multi-stage pipeline, which consists of an object detector [44] followed by a separate relation predictor [24, 33]. Similarly, for spatio-structural graph generation, the usual first stage is segmentation, followed by a morphological operation on the binary data. While a two-stage object-relation graph generation approach is modular, it is usually trained sequentially, which increases model complexity and inference time and prevents the simultaneous exploration of shared object-relation representations. Additionally, mistakes in the first stage may propagate into the later stages. It should also be noted that two-stage approaches depend on multiple hand-designed components, such as spatial features [61] or multi-modal input [8].
We argue that a single-stage image-to-graph model with joint object and relation exploration is more efficient, faster, and more easily extendable to multiple downstream tasks than a traditional multi-stage approach. Crucially, it reduces the number of components and simplifies the training and inference pipeline (Fig. 2). Furthermore, intuitively, a simultaneous exploration of objects and relations could utilize the surrounding context and their co-occurrence. For example, Fig. 1c, depicting a kid on a board, introduces a spatial and semantic inclination that this could be an outdoor scene, where the presence of a tree, or of a helmet the kid might wear, is highly likely. An analogous notion holds for a spatio-structural vessel graph: detection of a bifurcation point and an artery would indicate the presence of another artery nearby. The mutual co-occurrence captured in a joint object-relation representation overcomes individual object boundaries and leads to a more informed big picture.
Recently, there has been a surge of one-stage models in object detection thanks to the DETR approach described in [7]. These one-stage models are popular due to their simplicity and the elimination of reliance on hand-crafted designs or features. DETR exploits an encoder-decoder transformer architecture and learns object queries, or [obj]-tokens, for object representation.
To this end, we propose Relationformer, a unified one-stage framework for end-to-end image-to-graph generation. We leverage the set-based object detection of DETR and introduce a novel learnable token, the [rln]-token, in tandem with the [obj]-tokens. The [rln]-token captures the inter-dependency and co-occurrence of low-level objects and high-level spatio-semantic relations. Relationformer directly predicts objects from the learned [obj]-tokens and classifies their pairwise relations from combinations of [obj]-[rln]-[obj]-tokens. In addition to capturing pairwise object interactions, the [rln]-token, in conjunction with relation information, allows all relevant [obj]-tokens to be aware of the global semantic structure. These enriched [obj]-tokens, in combination with the relation token, in turn contribute to the relation prediction. The mutually shared representation of joint tokens serves as an excellent basis for image-to-graph generation. Moreover, our approach significantly simplifies the underlying complex image-to-graph pipeline by only using image features extracted by its backbone.
We evaluate Relationformer across numerous publicly available datasets, namely Toulouse, 20 US Cities, DeepVesselNet, and Visual Genome, comprising 2D and 3D, directed and undirected image-to-graph generation tasks. We achieve a new state-of-the-art for one-stage methods on Visual Genome, which is better than or comparable to two-stage approaches. We achieve state-of-the-art results on road-network extraction on the Toulouse and 20 US Cities datasets. To the best of our knowledge, this is the first image-to-graph approach working directly in 3D, which we use to extract graphs formed by blood vessels.
2 Related Work
Transformer in Vision:
In recent times, transformer-based architectures have emerged as the de-facto gold standard for various multi-domain and multi-modal tasks such as image classification [13], object detection [7], and out-of-distribution detection [26]. DETR [7] proposed an end-to-end transformer-based object detection approach with learnable object queries ([obj]-tokens) and direct set-based prediction. DETR eliminates the burdensome object detection pipelines (e.g., anchor boxes, NMS) of traditional approaches [44] and directly predicts objects. Building on DETR, a series of object detection approaches improved DETR's slow convergence [64], adapted a pure sequence-to-sequence approach [15], and improved detector efficiency [52]. In parallel, the development of the vision transformer [13] for image classification offered a powerful alternative. Several refined ideas [55, 35] have advanced this breakthrough, and transformers in general have emerged as a cutting-edge research topic with a focus on novel design principles and innovative applications. Fig. 2 shows a pictorial overview of transformer-based image classifiers, object detectors, and relation predictors, including our proposed method, which we refer to as Relationformer.

This illustrates a general architectural evolution of transformers in computer vision and how Relationformer advances the concept of a task-specific learnable token one step further. The proposed Relationformer is also shown in comparison to the conventional two-stage relation predictor. The amalgamation of the two separate stages not only simplifies the architectural pipeline but also mutually reinforces both tasks.
Spatio-structural Graph Generation:
In a spatio-structural graph, the most important physical objects are edges, i.e., roads in a road network or arteries and veins in a vessel graph. Conventionally, spatio-structural graph extraction has only been discussed in 2D, with little-to-no attention to its 3D counterpart. For 2D road network extraction, the predominant approach is segmentation [38, 4] followed by morphological thinning to extract the spatial graph. Only a few approaches combine graph-level information processing, iterative node generation [3], sequential generative modelling [9], and graph-tensor encoding [18]. Belli et al. [5], for the first time, adopted attention mechanisms in an auto-regressive model to generate graphs conditioned on binary segmentations. Importantly, to this date, none of these 2D approaches has been shown to scale to 3D. For 3D vessel-network extraction, segmentation of whole-brain microscopy images [54, 40] has been combined with rule-based graph extraction algorithms [51]. Recently, a large-scale study [42] used the Voreen [39] software to extract whole-brain vascular graphs from binary segmentations, which required complicated heuristics and huge computational resources. Despite recent works on 3D scene graphs [1] and temporal scene graphs [21], to this day, there exists no learning-based solution for 3D spatio-structural graph extraction.
Considering the two spatio-structural image-to-graph examples of vessel graphs and road networks, one can understand the spatial relation detection task as a link prediction task. In link prediction, graph neural networks such as GraphSAGE [16] and SEAL [62] are trained to predict missing links among nodes using node features. These approaches predict links on a given set of nodes; therefore, link prediction can only optimize for correct graph topology. In comparison, we are interested in joint node-edge prediction, emphasizing correct topology and correct spatial location simultaneously, which makes the task even more challenging.
Spatio-semantic Graph Generation:
Scene graph generation (SGG) [36, 58] from 2D natural images has long been studied to explore objects and their inter-dependencies in an interpretable way. Context refinement across objects [58, 61], extra feature modalities [36, 50], or prior knowledge [48] have been used to model inter-dependencies of objects for relation prediction. RTN [24, 27] was one of the first transformer approaches to explore context modeling and interactions between objects and edges for SGG. Li et al. [30] use a DETR-like architecture to separately predict entity and predicate proposals, followed by a graph assembly module. Later, several works [12, 37] explored transformers to improve relation prediction. On the downside, such two-stage approaches increase model size, lead to high inference times, and rely on extra features such as GloVe word embeddings [43] or knowledge graph embeddings [49], limiting their applicability. Recently, Liu et al. [34] proposed a fully convolutional one-stage SGG method, which combines a feature pyramid network [32] and a relation affinity field [63, 41] to model a joint object-relation graph. However, their convolution-based architecture limits the context exploration across objects and relations. Contemporary to our work, [10] used transformers for the task of SGG; however, their complex pipeline with separate subject and object representations further increases computational complexity. Crucially, there has been a significant performance gap between one-stage and two-stage approaches. This paper bridges this gap with simultaneous contextual exploration across objects and relations.
3 Methodology
In this section, we formally define the generalized image-to-graph generation problem. Each of the presented relation prediction tasks in Figure 1 is a special instance of this generalized image-to-graph problem. Consider an image space $\mathcal{I}$, where $I \in \mathbb{R}^{N_1 \times \dots \times N_d \times \#ch}$ for a $d$-dimensional image and #ch denotes the number of channels. Now, an image-to-graph generator predicts a graph $G = \{V, E\}$ for a given image $I$, where $G$ represents a graph with vertices (or objects) $V$ and edges (or relations) $E$. Specifically, each vertex $v_i$ has a node or object location specified by a bounding box $v_i^{loc}$ and a node or object label $v_i^{cls}$. Similarly, each edge $e_{ij}$ has an edge or relation label $e_{ij}^{cls}$, where we have $C_{obj}$ object classes and $C_{rln}$ types of relation classes. Note that $G$ can be both directed and undirected. The algorithmic complexity of predicting the graph $G$ depends on its size, which is of order $\mathcal{O}(N^2)$ for $|V| = N$. It should be noted that object detection is a special case of the generalized image-to-graph generation problem, where $E = \emptyset$. In the following, we briefly revisit a set-based object detector before expanding on our rationale and proposed architecture.
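To make this notation concrete, the following minimal sketch (our own illustration, not the authors' code; class and field names are hypothetical) shows one way to represent a predicted graph with located, labeled nodes and labeled relations.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    box: Tuple[float, ...]   # normalized box with 2*d coordinates for a d-dimensional image
    label: int               # object class in {1, ..., C_obj}

@dataclass
class Relation:
    src: int                 # index of the source node
    dst: int                 # index of the target node
    label: int               # relation class in {1, ..., C_rln}

@dataclass
class Graph:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Relation] = field(default_factory=list)   # directed or undirected

# A tiny, hypothetical scene graph: "kid - on - board"
g = Graph(
    nodes=[Node(box=(0.2, 0.3, 0.5, 0.9), label=1),    # kid
           Node(box=(0.1, 0.8, 0.6, 1.0), label=2)],   # board
    edges=[Relation(src=0, dst=1, label=3)],            # relation class "on"
)
```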
3.1 Preliminaries of Set-based Object Detector
Carion et al. [7] proposed DETR, which shows the potential of set-based object detection, building upon an encoder-decoder transformer architecture [56]. Given an input image $I$, a convolutional backbone [17] is employed to extract high-level, down-scaled features. Next, the spatial dimensions of the extracted features are reshaped into a sequence. Afterwards, these sequential features are coupled with a sinusoidal positional encoding [6] to mark a unique position identifier. A stack of encoder layers, each consisting of multi-head self-attention and a feed-forward network, processes the sequential features. The decoder takes $N_{obj}$ learnable object queries ([obj]-tokens) as its input sequence and combines them with the output of the encoder via cross-attention, where $N_{obj}$ is larger than the maximum number of objects.
DETR utilizes a direct Hungarian set-based assignment for one-to-one matching between the ground truth and the predictions from the [obj]-tokens. The bipartite matching assigns a unique predicted object from the $N_{obj}$ predictions to each ground truth object. Only matched predictions are considered valid; the rest of the predictions are labeled as 'background'. Subsequently, the box regression loss is computed solely for valid predictions. For the classification loss, all predictions, including 'background' objects, are considered.
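As a concrete illustration of this set-based assignment, the sketch below (ours, not DETR's implementation) performs the bipartite matching with scipy's Hungarian solver; the cost terms and the weights `w_cls`/`w_box` are simplified placeholders.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, pred_logits, gt_boxes, gt_labels, w_cls=1.0, w_box=5.0):
    """Assign each ground-truth object to a unique prediction (DETR-style set matching).

    pred_boxes: (N, 4), pred_logits: (N, C), gt_boxes: (M, 4), gt_labels: (M,), with N >= M.
    Returns index arrays (pred_idx, gt_idx) of length M; unmatched predictions count as 'background'."""
    logits = pred_logits - pred_logits.max(-1, keepdims=True)        # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    cost_cls = -probs[:, gt_labels]                                  # (N, M): negative class probability
    cost_box = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)  # (N, M): L1 box distance
    pred_idx, gt_idx = linear_sum_assignment(w_cls * cost_cls + w_box * cost_box)
    return pred_idx, gt_idx
```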
In our work, we adopt a modified attention mechanism, namely deformable attention from deformable-DETR (def-DETR) [64] for its faster convergence and computational efficiency. In DETR, complete global attention allows each token to attend to all other tokens and hence capture the entire context in one image. However, information about the presence of an object is highly localized to a spatial position. Following the concept of deformable convolutions [11], deformable attention enables the queries to attend to a small set of spatial features determined from learned offsets of the reference points. This improves convergence and reduces the computational complexity of the attention operation.
Let us consider an image feature map $x$, the $q$-th [obj]-token with associated query features $z_q$, and its reference point $p_q$. First, for the $m$-th attention head, we compute the sampling offsets $\Delta p_{mqk}$ based on the query features $z_q$. Subsequently, the sampled image features go through a single linear layer $W'_m$, followed by a multiplication with the attention weights $A_{mqk}$, which are also obtained from the query features $z_q$. Finally, another single linear layer $W_m$ merges all the heads. Formally, the deformable attention operation (DefAttn) for $M$ heads and $K$ sampling points is defined as:
$$\mathrm{DefAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m\, x(p_q + \Delta p_{mqk}) \Big] \qquad (1)$$
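For illustration, here is a minimal single-scale PyTorch sketch of Eq. (1). This is our simplification, not the reference implementation of [64], which uses multi-scale features and custom CUDA kernels; sampling offsets are treated here as normalized coordinates for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention(nn.Module):
    """Simplified single-scale deformable attention following Eq. (1)."""
    def __init__(self, dim=256, heads=8, points=4):
        super().__init__()
        self.h, self.k, self.dv = heads, points, dim // heads
        self.offsets = nn.Linear(dim, heads * points * 2)   # sampling offsets from query features
        self.weights = nn.Linear(dim, heads * points)       # attention weights A_mqk from query features
        self.value_proj = nn.Linear(dim, dim)                # W'_m for all heads at once
        self.out_proj = nn.Linear(dim, dim)                  # W_m merging all heads

    def forward(self, z_q, p_q, x):
        # z_q: (B, Q, dim) query features; p_q: (B, Q, 2) reference points in [0, 1]; x: (B, dim, H, W)
        B, Q, _ = z_q.shape
        H, W = x.shape[-2:]
        offsets = self.offsets(z_q).view(B, Q, self.h, self.k, 2)           # normalized offsets (simplified)
        attn = self.weights(z_q).view(B, Q, self.h, self.k).softmax(-1)     # weights over K sampling points
        v = self.value_proj(x.flatten(2).transpose(1, 2)).transpose(1, 2)   # (B, dim, H*W)
        v = v.reshape(B * self.h, self.dv, H, W)                            # split into heads
        loc = (p_q[:, :, None, None, :] + offsets).clamp(0, 1) * 2 - 1      # grid_sample expects [-1, 1]
        loc = loc.permute(0, 2, 1, 3, 4).reshape(B * self.h, Q, self.k, 2)
        sampled = F.grid_sample(v, loc, mode='bilinear', align_corners=False)   # (B*h, dv, Q, k)
        sampled = sampled.reshape(B, self.h, self.dv, Q, self.k)
        out = (sampled * attn.permute(0, 2, 1, 3)[:, :, None]).sum(-1)          # weighted sum over k
        return self.out_proj(out.permute(0, 3, 1, 2).reshape(B, Q, self.h * self.dv))
```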

3.2 Object-Relation Prediction as Set-prediction and Interaction
Joint object-relation graph generation requires searching a pairwise combinatorial space over the maximum number of expected nodes. Hence, a naive joint learning of object-relations requires $\mathcal{O}(N^2)$ tokens for $N$ objects. This is computationally intractable because the cost of self-attention is quadratic in the number of tokens. We overcome this combinatorially challenging formulation with a carefully engineered inductive bias. The inductive bias, in this case, is to exploit learned pair-wise interactions among [obj]-tokens and to combine refined pair-wise [obj]-tokens with an additional token, which we refer to as the [rln]-token. One can think of the [rln]-token as a query for pair-wise object interaction.
The [rln]-token captures the additional context of pair-wise interactions among all valid predicted classes. In this process, related objects are incentivized to have a strong correlation in a shared embedding space, and unrelated objects are penalized to be dissimilar. The [rln]-token attends to all [obj]-tokens along with the contextualized image features, which enriches its local pairwise and global image reasoning. Finally, we classify a pair-wise relation by combining the pair-wise [obj]-tokens with the [rln]-token. Thus, instead of $\mathcal{O}(N^2)$ tokens, we only need $N+1$ tokens in total: $N$ [obj]-tokens and one [rln]-token. This novel formulation allows relation detection at a marginally increased cost compared to one-stage object detection.
Here, one could present a two-fold argument: 1) There is no need for an extra token, as one could directly classify joint pairwise [obj]-tokens. 2) Instead of one single [rln]-token, one could use as many as the number of possible object pairs. To answer the first question, we argue that relations can be viewed as a higher-order topological entity compared to objects. Thus, to capture inter-dependencies among the relations, the model requires additional expressive capacity, which can be shared among the objects. The [rln]-token reduces the burden on the [obj]-tokens by specializing exclusively on the task of relation prediction. Moreover, [obj]-tokens can also attend to the [rln]-token and exploit its global semantic reasoning. This hypothesis is confirmed in our ablation studies. For the second question, we argue that individual tokens for all possible object pairs would lead to a drastic increase in decoder complexity, which may result in computational intractability.
3.3 Relationformer
The Relationformer architecture is intuitive and without any bells and whistles, see Fig. 3. We have four main components: a backbone, a transformer, an object detection head and a relation prediction head. In the following, we describe each of the components and the set-based loss formulations specific to joint object-relation graph generation in detail.
Backbone:
Given the input image $I$, a convolutional backbone [17] extracts a feature map $h$ with down-scaled spatial dimensions and #emb channels, where #emb denotes the embedding dimension of the backbone. Further, this channel dimension is reduced to $D$, the embedding dimension of the transformer, and the feature map is flattened along its spatial dimensions. The new sequential features, coupled with the sinusoidal positional encoding [6], produce the desired input sequence, which is processed by the encoder.
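A minimal sketch of this feature preparation is given below (ours; it uses a torchvision ResNet-50 and a learned positional embedding as a simple stand-in for the sinusoidal encoding of [6], and all sizes are example values).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class BackboneWithProjection(nn.Module):
    """Extract CNN features, project to the transformer width, flatten, and add positions."""
    def __init__(self, d_model=256, max_hw=64 * 64):
        super().__init__()
        cnn = resnet50(weights=None)                              # no pre-trained weights in this sketch
        self.body = nn.Sequential(*list(cnn.children())[:-2])     # drop avgpool + fc -> (B, 2048, H/32, W/32)
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)       # reduce #emb to the transformer width D
        self.pos = nn.Embedding(max_hw, d_model)                  # learned positions (stand-in for sinusoidal)

    def forward(self, img):                                       # img: (B, 3, H, W)
        feat = self.proj(self.body(img))                          # (B, d_model, h, w)
        B, D, h, w = feat.shape
        seq = feat.flatten(2).transpose(1, 2)                     # (B, h*w, d_model) sequential features
        pos = self.pos.weight[: h * w].unsqueeze(0)               # (1, h*w, d_model)
        return seq + pos

# Usage sketch: a 128x128 RGB image yields 4x4 = 16 tokens of width 256
tokens = BackboneWithProjection()(torch.randn(1, 3, 128, 128))    # -> (1, 16, 256)
```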
Transformer:
We use a transformer encoder-decoder architecture with deformable attention [64], which considerably speeds up the training convergence of DETR by exploiting spatial sparsity of the image features.
Encoder:
Our encoder remains unchanged from [64], and uses multi-scale deformable self-attention. We use a different number of layers based on each task’s requirement, which is specified in detail in the supplementary material.
Decoder:
We use $N_{obj} + 1$ tokens for the joint object-relation task as inputs to the decoder, where $N_{obj}$ is the number of [obj]-tokens, which are preceded by a single [rln]-token. Contextualized image features from the encoder serve as the second input of our decoder. In order to keep the computation tractable and to leverage spatial sparsity, we use deformable cross-attention between the joint tokens and the image features from the encoder. The self-attention in the decoder remains unchanged. The [obj]-tokens and the [rln]-token go through a series of multi-hop information exchanges with the other tokens and the image features, which gradually builds hierarchical object and relational semantics. Here, [obj]-tokens learn to attend to specific spatial positions, whereas the [rln]-token learns how objects interact in the context of their semantic or global reasoning.
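The joint token layout can be sketched as follows (our illustration; standard nn.TransformerDecoder cross-attention is used as a readable stand-in for the deformable cross-attention of the actual model, and all sizes are example values).

```python
import torch
import torch.nn as nn

class JointTokenDecoder(nn.Module):
    """Decode a single [rln]-token followed by N_obj [obj]-tokens against encoder features."""
    def __init__(self, d_model=256, n_obj=80, n_layers=4, n_heads=8):
        super().__init__()
        self.rln_token = nn.Parameter(torch.randn(1, d_model))       # learnable [rln]-token
        self.obj_tokens = nn.Parameter(torch.randn(n_obj, d_model))  # learnable [obj]-tokens
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, memory):                                    # memory: (B, H*W, d_model) encoder output
        B = memory.size(0)
        queries = torch.cat([self.rln_token, self.obj_tokens], 0) # [rln]-token first, then the [obj]-tokens
        out = self.decoder(queries.unsqueeze(0).repeat(B, 1, 1), memory)
        return out[:, 0], out[:, 1:]                              # refined [rln]-token, refined [obj]-tokens

# Usage sketch: 256 encoder tokens of width 256
rln, obj = JointTokenDecoder()(torch.randn(2, 256, 256))          # -> (2, 256), (2, 80, 256)
```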
Object Detection Head:
The object detection head has two components. The first is a stack of fully connected layers, i.e., a multi-layer perceptron (MLP), which regresses the location of objects, and the second is a single-layer classification module. For each refined [obj]-token $t_i^{obj}$, the object detection head predicts an object class $\hat{v}_i^{cls}$ and an object location $\hat{v}_i^{loc} \in [0,1]^{2d}$ in parallel, where $d$ represents the image dimension. We use normalized bounding box coordinates for scale-invariant prediction. Note that for the spatio-structural graphs, we create virtual objects around each node's center by assuming a uniform bounding box with a normalized width of 0.2.
Relation Prediction Head:
In parallel to the object detection head, the input of the relation head, given by a pair-wise [obj]-token combination and the shared [rln]-token, is processed as $\hat{e}_{ij}^{cls} = \mathrm{rln}\big([t_i^{obj}; t^{rln}; t_j^{obj}]\big)$. Here, $t^{rln}$ represents the refined [rln]-token and $\mathrm{rln}$ a three-layer fully-connected network headed by layer normalization [2]. In the case of directional relation prediction (e.g., scene graphs), the ordering of the object-token pair determines the direction $i \to j$. Otherwise (e.g., road networks, vessel graphs), the network is trained to learn object-token order invariance as well.
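A minimal sketch of both heads is given below (ours; the layer widths, the exact placement of the layer normalization, and the head names are assumptions). Relation logits are produced for every ordered pair $(i, j)$ by concatenating $[t_i^{obj}; t^{rln}; t_j^{obj}]$, so directionality follows from the pair ordering.

```python
import torch
import torch.nn as nn

class GraphHeads(nn.Module):
    def __init__(self, d_model=256, n_obj_cls=151, n_rln_cls=51, box_dim=4):
        super().__init__()
        self.cls_head = nn.Linear(d_model, n_obj_cls + 1)                        # +1 for 'background'
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, box_dim), nn.Sigmoid()) # normalized box coordinates
        self.rln_head = nn.Sequential(nn.LayerNorm(3 * d_model),                 # 3-layer relation classifier
                                      nn.Linear(3 * d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, n_rln_cls))

    def forward(self, obj_tok, rln_tok):
        # obj_tok: (B, N, D) refined [obj]-tokens; rln_tok: (B, D) refined [rln]-token
        B, N, D = obj_tok.shape
        cls_logits, boxes = self.cls_head(obj_tok), self.box_head(obj_tok)
        oi = obj_tok.unsqueeze(2).expand(B, N, N, D)        # subject token i, broadcast over j
        oj = obj_tok.unsqueeze(1).expand(B, N, N, D)        # object token j, broadcast over i
        r = rln_tok[:, None, None, :].expand(B, N, N, D)    # shared [rln]-token for every pair
        rln_logits = self.rln_head(torch.cat([oi, r, oj], dim=-1))   # (B, N, N, n_rln_cls)
        return cls_logits, boxes, rln_logits
```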
3.4 Loss Function
For object detection, we utilize a combination of loss functions. We use two standard box prediction losses, namely the $\ell_1$ regression loss ($\mathcal{L}_{reg}$) and the generalized intersection-over-union loss ($\mathcal{L}_{gIoU}$) between the predicted and ground truth box coordinates. Besides, we use the cross-entropy classification loss ($\mathcal{L}_{cls}$) between the predicted class and the ground truth class.
Stochastic Relation Loss:
In parallel to object detection, pair-wise relations are classified with a cross-entropy loss. In particular, we only use predicted objects that are assigned to ground truth objects by the Hungarian matcher. When two objects have a relation, we refer to their relation as a 'valid' relation; otherwise, the relation is categorized as 'background'. Since 'valid' relations are highly sparse in the set of all possible permutations of objects, computing the loss for every possible pair is burdensome and would be dominated by the 'background' class, which may hurt performance. To alleviate this, we randomly sample three 'background' relations for every 'valid' relation. From the sampled 'valid' and 'background' relations, we obtain a subset $E_s$ whose size is much smaller than the number of possible pairs. To this end, $\mathcal{L}_{rln}$ represents the classification loss for the predicted relations in $E_s$. The total loss for simultaneous object-relation graph generation is defined as:
$$\mathcal{L}_{total} = \lambda_{reg}\mathcal{L}_{reg} + \lambda_{gIoU}\mathcal{L}_{gIoU} + \lambda_{cls}\mathcal{L}_{cls} + \lambda_{rln}\mathcal{L}_{rln} \qquad (2)$$
where $\lambda_{reg}$, $\lambda_{gIoU}$, $\lambda_{cls}$, and $\lambda_{rln}$ are the loss-specific weights.
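The stochastic sampling of relation pairs described above can be sketched as follows (ours; it assumes dense relation logits as produced by the relation head sketched in Sec. 3.3 and a Hungarian-matcher output mapping ground-truth objects to predictions).

```python
import torch
import torch.nn.functional as F

def stochastic_relation_loss(rln_logits, gt_edges, matched_idx, neg_per_pos=3):
    """rln_logits: (N, N, C) relation logits for matched predictions, class 0 = 'background'.
    gt_edges: iterable of (i, j, label) over ground-truth object indices.
    matched_idx: dict mapping ground-truth object index -> matched prediction index."""
    N, device = rln_logits.size(0), rln_logits.device
    target = torch.zeros(N, N, dtype=torch.long, device=device)       # default: 'background'
    for i, j, lbl in gt_edges:
        target[matched_idx[i], matched_idx[j]] = lbl                   # mark 'valid' relations
    valid = (target > 0).nonzero(as_tuple=False)
    if len(valid) == 0:                                                # no relations in this sample
        return rln_logits.sum() * 0.0
    background = (target == 0).nonzero(as_tuple=False)
    n_neg = min(neg_per_pos * len(valid), len(background))             # three 'background' per 'valid'
    sampled_bg = background[torch.randperm(len(background), device=device)[:n_neg]]
    pairs = torch.cat([valid, sampled_bg], 0)
    return F.cross_entropy(rln_logits[pairs[:, 0], pairs[:, 1]], target[pairs[:, 0], pairs[:, 1]])
```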
4 Experiments
4.1 Datasets
We conducted experiments on four public datasets for the tasks of road network generation (20 US cities [18], Toulouse [5]), 3D synthetic vessel graph generation [53], and scene-graph generation (Visual Genome [28]). The road and vessel graph generation datasets are spatio-structural with a binary node and edge classification task, while the scene-graph generation dataset is spatio-semantic and has 151 node classes and 51 edge classes, including a 'background' class.
Dataset | Edge Type | 2D/3D | Image Type | Image Size | Train | Val | Test |
Toulouse [5] | Undirected | 2D | Binary | 64x64 | 80k | 12k | 19k |
20 US cities[18] | Undirected | 2D | RGB | 128x128 | 124k | 13k | 25k |
Synthetic vessel [53] | Undirected | 3D | Grayscale | 64x64x64 | 54k | 6k | 20k |
Visual Genome [28] | Directed | 2D | RGB | 800x800 | 57k | 5k | 26k |
4.2 Evaluation Metrics
Given the diversity of tasks at hand, we resort to widely-used task-specific metrics. Following is a brief description, while details can be found in the supplementary material. For Spatio-Structural Graphs, we use four different metrics to capture spatial similarity alongside the topological similarity of the predicted graphs. 1) Street Mover Distance (SMD) [5] computes a Wasserstein distance between predicted and ground truth edges; 2) TOPO Score [18] includes precision, recall, and F-1 score for topological mismatch; 3) Node Detection yields mean average precision (mAP) and mean average recall (mAR) for the nodes; and 4) Edge Detection yields mAP and mAR for the edges. For Spatio-Semantic Graphs, the scene graph detection (SGDet) metric is the most challenging and most appropriate for joint object-relation detection tasks, because it does not need a priori knowledge about object location or class label. Hence, we compute recall, mean-recall, and no-graph-constraint (ng) recall at K = 20, 50, and 100 on the SGDet task following Zellers et al. [61]. Further, we evaluate the quality of object detection using average precision, AP@50 (IoU=0.5) [31].
4.3 Results
Spatio-structural Graph Generation:
Dataset | Model | SMD | TOPO Prec. | TOPO Rec. | TOPO F1 | Node mAP | Node mAR | Edge mAP | Edge mAR |
Toulouse (2D) | RNN [5] | 0.04857 | 65.41 | 57.52 | 61.21 | 0.50 | 5.01 | 0.21 | 2.56 |
Toulouse (2D) | GraphRNN [5] | 0.02450 | 71.69 | 73.21 | 72.44 | 1.34 | 4.15 | 0.34 | 1.01 |
Toulouse (2D) | GGT [5] | 0.01649 | 86.95 | 79.88 | 83.26 | 2.94 | 13.31 | 1.62 | 9.75 |
Toulouse (2D) | Relationformer | 0.00012 | 99.76 | 98.99 | 99.37 | 94.59 | 96.76 | 83.30 | 89.87 |
20 US Cities (2D) | RoadTracer [3] | N.A. | 78.00 | 57.44 | 66.16 | N.A. | N.A. | N.A. | N.A. |
20 US Cities (2D) | Seg-DRM [38] | N.A. | 76.54 | 71.25 | 73.80 | N.A. | N.A. | N.A. | N.A. |
20 US Cities (2D) | Seg-Orientation [4] | N.A. | 75.83 | 68.90 | 72.20 | N.A. | N.A. | N.A. | N.A. |
20 US Cities (2D) | Sat2Graph [18] | N.A. | 80.70 | 72.28 | 76.26 | N.A. | N.A. | N.A. | N.A. |
20 US Cities (2D) | Relationformer | 0.04939 | 85.28 | 77.75 | 81.34 | 29.25 | 42.84 | 33.19 | 13.45 |
Synthetic Vessel (3D) | U-net [46]+heuristics | 0.01982 | N/A | N/A | N/A | 18.94 | 29.81 | 17.88 | 27.63 |
Synthetic Vessel (3D) | Relationformer | 0.01107 | N/A | N/A | N/A | 78.51 | 84.34 | 78.10 | 82.15 |
*N.A. indicates scores are not readily available. N/A indicates that the metric is not applicable.
In spatio-structural graph generation, both correct graph topology and correct spatial location are equally important. Note that the objects here are represented as points in 2D/3D space. For practical reasons, we assume a hypothetical box of normalized width 0.2 around these points and treat these boxes as objects.
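The conversion from node points to such hypothetical boxes can be sketched as below (ours; the 0.2 normalized width follows the training details in the supplementary material).

```python
import numpy as np

def points_to_boxes(centers, width=0.2):
    """Turn node center points into axis-aligned boxes of fixed normalized width.

    centers: (N, d) array of node locations in [0, 1]^d, with d = 2 or 3.
    Returns (N, 2*d) boxes as [min_corner, max_corner], clipped to the image extent."""
    half = width / 2.0
    lo = np.clip(centers - half, 0.0, 1.0)
    hi = np.clip(centers + half, 0.0, 1.0)
    return np.concatenate([lo, hi], axis=1)

boxes = points_to_boxes(np.array([[0.5, 0.5], [0.1, 0.9]]))   # -> shape (2, 4)
```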
The Toulouse dataset poses the least difficulty as we can predict a graph from a binary segmentation image. We notice that existing methods perform poorly. Our method improves the SMD score by three orders of magnitude. All other metrics, such as TOPO-Score (prec., rec., and F-1), indicate near-optimal topological accuracy of our method. At the same time, our performance in node and edge mAP and mAR is vastly superior to all competing methods. For the more complex 20 U.S. cities dataset, we observe a similar trend. Note that due to the lack of existing scores from competing methods (SMD, mAP, and mAR), we only compare the TOPO scores, which we outperform by a significant margin. However, when compared to the results on the Toulouse dataset, Relationformer yields lower node detection scores on the 20 U.S. cities dataset, which can be attributed to the increased dataset complexity. Furthermore, the edge detection score also deteriorates. This is due to the increased proximity of edges, i.e., parallel roads.
For 3D data, such as vessels, no learning-based comparisons exist. Hence, we compare to the current best practice [51], which relies on segmentation, skeletonization, and heuristic pruning of the dense skeleton extracted from the binary segmentation [14]. The purpose of pruning is to eliminate redundant neighboring nodes, which is error-prone due to the voxelization of the connectivity and leads to poor performance. Table 2 clearly shows that our method outperforms this approach. Importantly, we find that our method translates effortlessly from 2D to 3D without major modifications. Moreover, our 3D model is trained end-to-end from scratch without a pre-trained backbone. To summarize, we propose the first reliable learning-based 3D spatio-structural graph generation method, while also outperforming existing 2D approaches by a considerable margin.
Scene Graph Generation:
We extensively compare our method to numerous existing methods, which can be grouped into three categories: one-stage methods, two-stage methods utilizing only image features, and two-stage methods utilizing extra features. Importantly, Relationformer is a one-stage method without the need for extra features. We find that Relationformer outperforms all one-stage methods in Recall and ng-Recall despite using a simpler backbone. In terms of mean-Recall, a metric addressing dataset bias, we outperform [34] and our contemporary [10] at @50 and perform close to [10] at @20.
Stage | Method | Extra Feat. | R@20 | R@50 | R@100 | mR@20 | mR@50 | mR@100 | ng-R@20 | ng-R@50 | ng-R@100 | AP@50 | #param (M) | FPS* |
Two-stage | MOTIFS [61] | ✓ | 21.4 | 27.2 | 30.5 | 4.2 | 5.7 | 6.6 | - | 30.5 | 35.8 | 20.0 | 240.7 | 6.6 |
Two-stage | KERN [8] | ✓ | 22.3 | 27.1 | - | - | 6.4 | - | - | 30.9 | 35.8 | 20.0 | 405.2 | 4.6 |
Two-stage | GPS-Net [33] | ✓ | 22.3 | 28.9 | 33.2 | 6.9 | 8.7 | 9.8 | - | - | - | - | - | - |
Two-stage | BGT-Net [12] | ✓ | 23.1 | 28.6 | 32.2 | - | - | 9.6 | - | - | - | - | - | - |
Two-stage | RTN [24] | ✓ | 22.5 | 29.0 | 33.1 | - | - | - | - | - | - | - | - | - |
Two-stage | BGNN [29] | ✓ | 23.3 | 31.0 | - | 7.5 | 10.7 | - | - | - | - | 29.0 | 341.9 | 2.3 |
Two-stage | GB-Net [60] | ✓ | - | 26.3 | 29.9 | - | 7.1 | 8.5 | - | 29.3 | 35.0 | - | - | - |
Two-stage | IMP+ [57] | ✗ | 14.6 | 20.7 | 24.5 | 2.9 | 3.8 | 4.8 | - | 22.0 | 27.4 | 20.0 | 203.8 | 10.0 |
Two-stage | G-RCNN [59] | ✗ | - | 11.4 | 13.7 | - | - | - | - | 28.5 | 35.9 | 23.0 | - | - |
One-stage | FCSGG [34] | ✗ | 16.1 | 21.3 | - | 2.7 | 3.6 | - | 16.7 | 23.5 | 29.2 | 28.5 | 87.1 | |
One-stage | RelTR [10] | ✗ | 20.2 | 25.2 | - | 5.8 | 8.5 | - | - | - | - | 26.4 | 63.7 | 16.6 |
One-stage | Relationformer | ✗ | 22.2 | 28.4 | 31.3 | 4.6 | 9.3 | 10.7 | 22.9 | 31.2 | 36.8 | 26.3 | 92.9 | |
#param values are taken from [10]. *Frames per second (FPS) computed on an Nvidia GTX 1080 GPU.
In terms of object detection performance, we achieve an AP@50 of 26.3, which is close to the best performing one- and two-stage methods, even though we use a simpler backbone. Note that the object detection performance varies substantially across backbones and object detectors. For example, BGNN [29] uses X-101FPN, FCSGG [34] uses HRNetW48-5S-FPN, whereas Relationformer and its contemporary RelTR [10] use a simple ResNet50 [17] backbone.
Comparing our Relationformer to two-stage models, we outperform all models that use no extra features in all metrics. Moreover, we perform on par with the remaining two-stage models, which use powerful backbones [29], bi-label graph resampling [29], custom loss functions [33], and extra features such as word [24] or knowledge graph embeddings [8]. Therefore, we can claim that we achieve competitive performance without custom loss functions or extra features while using significantly fewer parameters. We also achieve much faster processing times, measured in frames per second (FPS) (see Table 3). For example, BGNN [29], which was the top performer in a number of metrics, requires three times more parameters and is an order of magnitude slower than our method.
Fig. 4 shows qualitative examples for all datasets used in our experiments. Qualitative and quantitative results from both spatio-structural and spatio-semantic graph generation demonstrate the efficiency of our approach and the importance of simultaneously leveraging [obj]-tokens and the [rln]-token. Relationformer achieves benchmark performance across a diverse set of image-to-graph generation tasks, suggesting its wide applicability and scalability.

4.4 Ablation Studies
In our ablation study, we focus on two aspects. First, how the [rln]-token and relation-head guide the graph generation; second, the effect of the sample size in training transformers from scratch. We select the complex 3D synthetic vessel and Visual Genome datasets for the ablation. Further ablation experiments can be found in the supplementary material.
In Table 5, we evaluate the importance of the [rln]-token and different relation-head types. First, we train def-DETR for object detection only, as proposed in [7, 64]. Second, we evaluate Relationformer w/ and w/o the [rln]-token using a linear relation classification layer (models w/o the [rln]-token use only concatenated pair-wise [obj]-tokens for relation classification). Third, we replace the linear relation head with an MLP and repeat the same w/ and w/o the [rln]-token.
Model | [rln]-token | Rel. Head | AP@50 | SGDet R@20 | SGDet R@50 | SGDet R@100 |
def-DETR | N/A | N/A | 26.4 | N/A | N/A | N/A |
Relationformer | ✗ | Linear | 24.1 | 16.6 | 22.0 | 25.2 |
Relationformer | ✓ | Linear | 25.3 | 20.1 | 25.4 | 28.3 |
Relationformer | ✗ | MLP | 26.0 | 19.2 | 26.4 | 29.4 |
Relationformer | ✓ | MLP | 26.3 | 22.2 | 28.4 | 31.3 |
Model | [rln]-token | Train Data | SMD | Node mAP | Node mAR | Edge mAP | Edge mAR |
def-DETR | N/A | 100% | N/A | 77.5 | 83.5 | N/A | N/A |
Relationformer | ✗ | 100% | 0.0129 | 75.5 | 81.6 | 76.3 | 80.4 |
Relationformer | ✓ | 25% | 0.0138 | 17.0 | 32.1 | 11.5 | 28.3 |
Relationformer | ✓ | 50% | 0.0124 | 39.2 | 53.5 | 33.3 | 48.9 |
Relationformer | ✓ | 100% | 0.0110 | 78.5 | 84.3 | 78.1 | 82.1 |
We observe that a linear relation classifier w/o the [rln]-token is insufficient to model the mutual relationships among objects and diminishes the object detection performance as well. In contrast, the [rln]-token significantly improves performance even with a linear relation classifier. Using an MLP instead of a linear classifier is the better strategy, and the Relationformer w/ [rln]-token again shows a clear benefit. Unlike the linear layer, we hypothesize that the MLP provides a separate and adequate embedding space to model the complex semantic relationships between the [obj]-tokens and our [rln]-token.
From the ablation on 3D vessels (Table 5), we draw the same conclusion: the [rln]-token significantly improves over Relationformer w/o the [rln]-token. Further, the high correlation between performance and training-data size indicates additional room for improvement by increasing the sample size when training from scratch.
4.5 Limitations and Outlook
5 Conclusion
Extraction of structural and semantic relational graphs from images is key for image understanding. We propose Relationformer, a unified single-stage model for direct image-to-graph generation. Our method is intuitive and easy to interpret because it is devoid of any hand-designed components. We show consistent performance improvements across multiple image-to-graph tasks using Relationformer compared to previous methods, all while being substantially faster and using fewer parameters, which reduces energy consumption. Relationformer opens up new possibilities for the efficient integration of image-to-graph models into downstream applications in an end-to-end fashion. We believe that our method has the potential to shed light on many previously unexplored domains and can lead to new discoveries, especially in 3D.
Acknowledgement
Suprosanna Shit is supported by TRABIT (EU Grant: 765148). Bjoern Menze gratefully acknowledges the support of the Helmut Horten Foundation.
References
- [1] (2019) 3d scene graph: a structure for unified semantics, 3d space, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5664–5673. Cited by: §2.
- [2] (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.3.
- [3] (2018) RoadTracer: automatic extraction of road networks from aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4720–4728. Cited by: §2, Table 2.
- [4] (2019) Improved road connectivity by joint learning of orientation and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10385–10393. Cited by: §2, Table 2.
- [5] (2019) Image-conditioned graph generation for road network extraction. arXiv preprint arXiv:1910.14388. Cited by: §B.1, §B.2, Appendix C, §2, §4.1, §4.2, Table 1, Table 2.
- [6] (2019) Attention augmented convolutional networks. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3286–3295. Cited by: §3.1, §3.3.
- [7] (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: Appendix F, §1, §2, §3.1, §4.4.
- [8] (2019) Knowledge-embedded routing network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6163–6171. Cited by: §1, §4.3, Table 3.
- [9] (2019) Neural turtle graphics for modeling city road layouts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4522–4530. Cited by: §2.
- [10] (2022) RelTR: relation transformer for scene graph generation. arXiv preprint arXiv:2201.11460. Cited by: §2, §4.3, §4.3, Table 3.
- [11] (2017) Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773. Cited by: §3.1.
- [12] (2021) BGT-Net: bidirectional GRU transformer network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2150–2159. Cited by: §2, Table 3.
- [13] (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §2.
- [14] (2021) Scalable robust graph and feature extraction for arbitrary vessel networks in large volumetric datasets. arXiv preprint arXiv:2102.03444. Cited by: §4.3.
- [15] (2021) You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection. arXiv preprint arXiv:2106.00666. Cited by: §2.
- [16] (2017) Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1025–1035. Cited by: §2.
- [17] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1, §3.3, §4.3.
- [18] (2020) Sat2Graph: road graph extraction through graph-tensor encoding. In European Conference on Computer Vision, pp. 51–67. Cited by: §B.2, Appendix C, §1, §2, §4.1, §4.2, Table 1, Table 2.
- [19] (2020) Scene graph reasoning for visual question answering. arXiv preprint arXiv:2007.01072. Cited by: §1.
- [20] (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: Appendix D.
- [21] (2020) Action genome: actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10236–10247. Cited by: §2.
- [22] (2021) Brain microvasculature has a common topology with local differences in geometry that match metabolic load. Neuron 109 (7), pp. 1168–1187. Cited by: §1.
- [23] (2015) Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3668–3678. Cited by: §1.
- [24] (2020) Relation transformer network. arXiv preprint arXiv:2004.06193. Cited by: §B.4, §1, §2, §4.3, Table 3.
- [25] (2021) Graphhopper: multi-hop scene graph reasoning for visual question answering. In International Semantic Web Conference, pp. 111–127. Cited by: §1.
- [26] (2021) Oodformer: out-of-distribution detection transformer. arXiv preprint arXiv:2107.08976. Cited by: §2.
- [27] (2021) Scenes and surroundings: scene graph generation using relation transformer. arXiv preprint arXiv:2107.05448. Cited by: §2.
- [28] (2016) Visual genome: connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332. Cited by: §B.4, §4.1, Table 1.
- [29] (2021) Bipartite graph network with adaptive message passing for unbiased scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11109–11119. Cited by: §4.3, §4.3, Table 3.
- [30] (2021) SGTR: end-to-end scene graph generation with transformer. arXiv preprint arXiv:2112.12970. Cited by: §2.
- [31] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §4.2.
- [32] (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125. Cited by: §2.
- [33] (2020) Gps-net: graph property sensing network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3746–3753. Cited by: §B.4, Appendix E, §1, §4.3, Table 3.
- [34] (2021) Fully convolutional scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11546–11556. Cited by: §B.4, §2, §4.3, §4.3, Table 3.
- [35] (2021) Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030. Cited by: §2, §4.5.
- [36] (2016) Visual relationship detection with language priors. In European Conference on Computer Vision, Cited by: §B.4, §2.
- [37] (2021) Context-aware scene graph generation with seq2seq transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15931–15941. Cited by: §2.
- [38] (2017) Deeproadmapper: extracting road topology from aerial images. In Proceedings of the IEEE international conference on computer vision, pp. 3438–3446. Cited by: §2, Table 2.
- [39] (2009) Voreen: a rapid-prototyping environment for ray-casting-based volume visualizations. IEEE Computer Graphics and Applications 29 (6), pp. 6–13. Cited by: §2.
- [40] (2021) Micrometer-resolution reconstruction and analysis of whole mouse brain vasculature by synchrotron-based phase-contrast tomographic microscopy. bioRxiv. Cited by: §2.
- [41] (2017) Pixels to graphs by associative embedding. Advances in neural information processing systems 30. Cited by: §2.
- [42] (2021) Whole brain vessel graphs: a dataset and benchmark for graph learning and neuroscience. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), Cited by: §1, §2.
- [43] (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §2.
- [44] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28. Cited by: §1, §2.
- [45] (2020) Deep graph matching via blackbox differentiation of combinatorial solvers. In European Conference on Computer Vision, pp. 407–424. Cited by: §4.5.
- [46] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: Table 2.
- [47] (2012) Tissue metabolism driven arterial tree generation. Med Image Anal. 16 (7), pp. 1397–1414. Cited by: §B.3.
- [48] (2021) Classification by attention: scene graph classification with prior knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 5025–5033. Cited by: §2.
- [49] (2021) Improving scene graph classification by exploiting knowledge from texts. arXiv preprint arXiv:2102.04760. Cited by: §2.
- [50] (2021) Improving visual relation detection using depth maps. In 2020 25th International Conference on Pattern Recognition (ICPR), pp. 3597–3604. Cited by: §2.
- [51] (2021) ClDice-a novel topology-preserving loss function for tubular structure segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16560–16569. Cited by: §2, §4.3.
- [52] (2021) ViDT: An Efficient and Effective Fully Transformer-based Object Detector. arXiv preprint arXiv:2110.03921. Cited by: §2.
- [53] (2020) Deepvesselnet: vessel segmentation, centerline prediction, and bifurcation detection in 3-d angiographic volumes. Frontiers in Neuroscience 14, pp. 1285. Cited by: §B.3, §4.1, Table 1.
- [54] (2020) Machine learning analysis of whole mouse brain vasculature. Nature methods 17 (4), pp. 442–449. Cited by: §1, §2.
- [55] (2021) Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. Cited by: §2.
- [56] (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: Appendix A, §3.1.
- [57] (2017) Scene graph generation by iterative message passing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5410–5419. Cited by: Table 3.
- [58] (2017) Scene graph generation by iterative message passing. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
- [59] (2018) Graph r-cnn for scene graph generation. In Proceedings of the European conference on computer vision (ECCV), pp. 670–685. Cited by: Table 3.
- [60] (2020) Bridging knowledge graphs to generate scene graphs. In European Conference on Computer Vision, pp. 606–623. Cited by: Table 3.
- [61] (2018) Neural motifs: scene graph parsing with global context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5831–5840. Cited by: §B.4, Appendix E, §1, §2, §4.2, Table 3.
- [62] (2018) Link prediction based on graph neural networks. External Links: 1802.09691 Cited by: §2.
- [63] (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §2.
- [64] (2020) Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159. Cited by: Appendix A, §2, §3.1, §3.3, §3.3, §4.4.
Appendix A Transformer and Deformable-DETR
The core of a transformer [56] is the attention mechanism. Let us consider an image feature map $x$, the query $q$ with associated features $z_q$, and the key $k$ with associated image features $x_k$. One can define the multi-head attention for $M$ heads and $K$ key elements as
$$\mathrm{MultiHeadAttn}(z_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m x_k \Big],$$
where $W_m$ and $W'_m$ are learnable weights. The attention weights are normalized as $\sum_{k=1}^{K} A_{mqk} = 1$, with $A_{mqk} \propto \exp\!\big(z_q^{\top} U_m^{\top} V_m x_k / \sqrt{d_v}\big)$, where $U_m$ and $V_m$ are also learnable weights and $\sqrt{d_v}$ is the temperature parameter. To differentiate the position of each element uniquely, $z_q$ and $x_k$ are given a distinct positional embedding.
In our work, we use the multi-scale deformable attention [64] for $L$ feature levels as
$$\mathrm{MSDefAttn}\big(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}\big) = \sum_{m=1}^{M} W_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m\, x^l\big(\phi_l(\hat{p}_q) + \Delta p_{mlqk}\big) \Big],$$
where $\phi_l$ rescales the normalized reference point coordinates $\hat{p}_q$ appropriately to the image feature space of the corresponding level $l$.
Appendix B Dataset
Here we describe the individual datasets used in our experimentation in detail. We also elaborate on generating the train-test sets for our experiments. For 20 U.S. Cities and the 3D synthetic vessels, we extract overlapping patches from large images. This provides a large enough sample size to train our Relationformer from scratch. Since a DETR-like architecture is not translation invariant because of the learned [obj]-tokens in the decoder, extracting overlapping patches drastically increases the effective sample size from a limited number of available images.
B.1 Toulouse Road Network
The Toulouse Road Network dataset [5] is based on publicly available satellite images from Open Streetmap and consists of semantic segmentation images with their corresponding graph representations. For our experiments we use the same split as in the original dataset paper with 80,357 samples in the training set, 11,679 samples in the validation set, and 18,998 samples in the test set [5].
B.2 20 U.S. Cities Dataset
For the 20 U.S. Cities dataset [18], there are 180 images with a resolution of 2048x2048. We select 144 for training, 9 for validation, and 27 for testing. From those images, we extract overlapping patches of size 128x128 to construct the final train-validation-test split. We crop the RGB image and the corresponding graph, followed by a node simplification. Following Belli et al. [5], we prune the dense nodes by computing the angle between the two road segments at each node of degree 2 and only keep a node if the angle is less than 160 degrees. This eliminates redundant nodes and simplifies the graph prediction task. Fig. 5 illustrates the pruning process.
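The pruning rule can be sketched as follows (ours, on a networkx graph with node coordinates stored in a 'pos' attribute; the 160-degree threshold follows the description above).

```python
import numpy as np
import networkx as nx

def prune_degree2_nodes(g, max_angle_deg=160.0):
    """Remove degree-2 nodes whose two incident segments are nearly collinear."""
    changed = True
    while changed:
        changed = False
        for n in list(g.nodes):
            if g.degree[n] != 2:
                continue
            a, b = list(g.neighbors(n))
            pa = np.asarray(g.nodes[a]["pos"]) - np.asarray(g.nodes[n]["pos"])
            pb = np.asarray(g.nodes[b]["pos"]) - np.asarray(g.nodes[n]["pos"])
            cos = np.dot(pa, pb) / (np.linalg.norm(pa) * np.linalg.norm(pb) + 1e-8)
            angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
            if angle >= max_angle_deg:        # nearly straight road: node is redundant
                g.remove_node(n)
                g.add_edge(a, b)              # reconnect the two neighbours
                changed = True
    return g
```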
B.3 3D Synthetic Vessels
Our synthetic vessel dataset is based on the publicly available synthetic images generated by Tetteh et al. [53]. In this dataset, the ground truth graphs were generated by [47], and from these, the corresponding voxel-level semantic segmentation data was created. Grey-valued data was obtained by adding different noise levels to the segmentation maps. Specifically, we train on greyscale "images" and their corresponding vessel graph representations, where each node represents a bifurcation point and the edges represent the connecting vessels. The whole dataset contains 136 3D volumes of size 325x304x600. First, we choose 40 volumes to create the train and validation sets and then pick 10 volumes for the test set. From these, we extract overlapping patches of size 64x64x64 to construct the final train-validation-test split. Similar to the 20 U.S. cities dataset, we prune nodes of degree 2 based on the angle between their two incident edges.

B.4 Visual Genome
Visual Genome is one of the largest scene graph datasets, consisting of 108,077 natural images [28]. However, the original dataset suffers from multiple annotation errors and improper bounding boxes. Lu et al. [36] proposed a refined version of Visual Genome with the 150 most frequently occurring object classes and 50 relation categories. It also provides its own train/val/test split, which is the most widely used data split [61, 24, 33, 34] for SGG. For a fair comparison, we only train on the Visual Genome dataset and do not use any pre-training.
Appendix C Metrics Details
Metrics for Spatio-Structural Graph:
We use three different kinds of metrics to capture the spatial similarity alongside the topological similarity of the predicted graphs. The graph-level metrics include: 1) Street Mover Distance (SMD): SMD [5] computes the Wasserstein distance between a fixed number of points sampled uniformly from the predicted and ground truth edges (see Fig. 6); and 2) TOPO Score: the TOPO score [18] computes precision, recall, and F-1 score for topological mismatches in terms of false-positive and false-negative topological loops. Alongside, we use 3) Node Detection: we report the mean average precision (mAP) and mean average recall (mAR) for node box prediction, averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05. Similarly, we use 4) Edge Detection: we compute mAP and mAR for the edges in the same way. The edge boxes are constructed from the center points of two connected nodes (see Fig. 6). For vertical and horizontal edges, we assume a hypothetical width of 0.15 to avoid boxes with near-zero width.
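The following sketch (ours) approximates the SMD: it samples a fixed number of points uniformly along the predicted and ground-truth edges and computes an optimal-matching transport cost between the two point clouds. The exact metric in [5] uses a Sinkhorn-based Wasserstein distance, so this is only a simplified stand-in.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

def sample_edge_points(nodes, edges, n_points=100):
    """Sample n_points uniformly along the line segments defined by (nodes, edges)."""
    segs = np.array([(nodes[i], nodes[j]) for i, j in edges])       # (E, 2, d)
    lengths = np.linalg.norm(segs[:, 1] - segs[:, 0], axis=1)
    choice = np.random.choice(len(segs), size=n_points, p=lengths / lengths.sum())
    t = np.random.rand(n_points, 1)                                  # random position along each segment
    return segs[choice, 0] + t * (segs[choice, 1] - segs[choice, 0])

def street_mover_distance(pred, gt, n_points=100):
    """pred/gt: (nodes, edges) tuples; returns the mean optimal-matching point distance."""
    p, q = sample_edge_points(*pred, n_points), sample_edge_points(*gt, n_points)
    cost = cdist(p, q)                                               # pairwise Euclidean distances
    row, col = linear_sum_assignment(cost)                           # optimal one-to-one transport plan
    return cost[row, col].mean()
```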
Metrics for Spatio-Semantic Graph:
We evaluate Relationformer on the most challenging Scene Graph Detection (SGDet) metric and its variants. Unlike other scene graph metrics such as Predicate Classification (PredCls) or Scene Graph Classification (SGCls), SGDet does not use a priori information on class labels or object spatial positions and does not rely on complex RoI-align based spatial features. SGDet jointly measures the predicted boxes (with overlaps), the class labels of objects, and the relation labels. The variants of SGDet include 1) Recall: recall at different K (20, 50, and 100) of the predicted relations, which reflects the overall relation prediction performance; 2) mean-Recall: the mean of the per-relation-class recall, which reflects performance under the relational imbalance, i.e., the long-tailed distribution of relation classes; and 3) ng-Recall: recall w/o graph constraints on the prediction, which takes the top-k predictions instead of just the top-1. Additionally, we use 4) AP@50: average precision at a 50% IoU threshold, which reflects the average object detection performance.
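As an illustration, the sketch below (ours) computes recall@K over predicted (subject, predicate, object) triplets; the full SGDet protocol additionally requires the subject and object boxes to overlap the ground truth with sufficient IoU, which is omitted here.

```python
def recall_at_k(pred_triplets, scores, gt_triplets, k=50):
    """Fraction of ground-truth triplets recovered among the top-k scoring predictions.

    pred_triplets: list of (subject, predicate, object) tuples with confidence scores;
    gt_triplets: set of ground-truth triplets. Box IoU matching is omitted for brevity."""
    order = sorted(range(len(pred_triplets)), key=lambda i: -scores[i])[:k]
    top_k = {pred_triplets[i] for i in order}
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / max(len(gt_triplets), 1)

# Toy usage: one of two ground-truth triplets is recovered in the top-2 -> recall@2 = 0.5
preds = [("kid", "on", "board"), ("kid", "wears", "helmet"), ("tree", "near", "kid")]
gts = {("kid", "on", "board"), ("kid", "has", "hand")}
print(recall_at_k(preds, scores=[0.9, 0.2, 0.8], gt_triplets=gts, k=2))
```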

Appendix D Model Details
Dataset | Backbone | Enc. Layers | Dec. Layers | # [obj]-tokens | Emb. Dim | MLP Dim |
Toulouse | ResNet-50 | 4 | 4 | 20 | 256 | 512 |
20 US cities | ResNet-101 | 4 | 4 | 80 | 512 | 1024 |
Synth Vessel | SE-Net | 4 | 4 | 80 | 256 | 1024 |
Visual Genome | ResNet-50 | 6 | 6 | 200 | 512 | 2048 |
Table 6 describes the backbone and important parameters of the Relationformer. We experiment with different ResNet backbones to show the flexibility of our Relationformer. In order to reduce energy consumption, we use the lighter ResNet50 for most 2D datasets. For the 3D experiment, we use a Squeeze-and-Excite Net [20]. We increase the number of encoder and decoder layers and the number of [obj]-tokens with dataset complexity. We find that four transformer layers and 20 [obj]-tokens suffice for Toulouse, while four transformer layers and 80 [obj]-tokens are required for the 20 U.S. cities and synthetic vessel datasets. We need six transformer layers and 200 [obj]-tokens for Visual Genome. The ablation on the number of transformer layers and the number of [obj]-tokens is shown in the next section.
Appendix E Training Details
Dataset | Batch Size | Learning Rate | Epochs | Cost Coeff. (cls / reg / gIoU) | Loss Coeff. |
Toulouse | 64 | | 50 | 2 / 5 / 0 | 5 / 2 / 2 / 1 |
20 US cities | 32 | | 100 | 3 / 5 / 0 | 5 / 2 / 3 / 4 |
3D Vessel Net | 48 | | 100 | 2 / 5 / 0 | 2 / 3 / 3 / 4 |
Visual Genome | 16 | | 25 | 3 / 2 / 3 | 2 / 2 / 4 / 6 |
Table 7 summarizes the principal parameters we use during training. We use the AdamW optimizer with a step learning rate schedule. For scene graph generation, we use the prior statistical distribution, or frequency bias [61], of relations for each subject-object pair. To minimize the imbalance of relation labels present in Visual Genome, we use a log-softmax distribution [33] to soften the frequency bias. Finally, we add this distribution to the predicted relation distribution from the relation head. For the spatio-structural datasets, we set the cost coefficient for the gIoU term in the bipartite matcher to zero because we assume boxes of normalized width 0.2 for each node; hence, the regression cost is sufficient to account for the spatial distances.
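The softened frequency bias can be sketched as follows (ours; the exact normalization and the shape of the count statistics are assumptions based on the description above and on [61, 33]).

```python
import torch
import torch.nn.functional as F

class FrequencyBias(torch.nn.Module):
    """Softened statistical prior over relation labels for each (subject, object) class pair."""
    def __init__(self, counts):
        # counts: (C_obj, C_obj, C_rln) co-occurrence counts gathered from the training set
        super().__init__()
        log_prior = F.log_softmax(counts.float().log1p(), dim=-1)   # log1p + log-softmax softens the bias
        self.register_buffer("log_prior", log_prior)

    def forward(self, rln_logits, subj_labels, obj_labels):
        # rln_logits: (P, C_rln) from the relation head; subj/obj_labels: (P,) predicted object classes
        return rln_logits + self.log_prior[subj_labels, obj_labels]
```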
Appendix F More Ablation Studies on [obj]-tokens and Transformer
We conduct two more ablation studies on Visual Genome, analyzing the influence of the number of [obj]-tokens and the optimal number of transformer layers for joint graph generation. Furthermore, Figure 7 gives additional insight into how the [rln]-token is beneficial for the joint object-relation graph.
#[obj]-tokens | AP@50 | R@20 | R@50 | R@100 |
75 | 25.1 | 20.6 | 26.1 | 29.5 |
100 | 25.8 | 21.1 | 27.4 | 30.6 |
200 (ours) | 26.3 | 22.2 | 28.4 | 31.3 |
300 | 26.3 | 21.9 | 27.9 | 31.0 |
# layers | AP@50 | R@20 | R@50 | R@100 |
4 | 24.6 | 20.5 | 26.5 | 28.8 |
5 | 25.2 | 21.0 | 27.2 | 29.9 |
6 (ours) | 26.3 | 22.2 | 28.4 | 31.3 |
As shown in Table 9, increasing the number of [obj]-tokens does increase object and relation detection performance. However, the performance becomes relatively stable with a further increase in object queries. DETR-like architectures rely on an optimal number of [obj]-tokens to balance positive and negative samples, which also helps object detection, as observed in [7]. Thus, in joint object and relation prediction, a gain might come from an optimal number of [obj]-tokens, as relation prediction is closely correlated with object detection performance. This demonstrates that joint object and relation detection can coexist without hurting object detection performance. Instead, it can exploit [obj]-tokens enriched with global relational reasoning for efficient relation extraction.
During the ablation on transformer layers, we observe that decreasing the number of layers shows an initial gain in object and relation detection. However, these models plateau early and reach inferior performance, as depicted in Table 9. One intuitive reason is that, with fewer parameters and insufficient contextualization, Relationformer quickly learns the initial biases present in both object and relation detection but fails to learn the complex global scenario. We use the same number of layers for both the encoder and the decoder.

Appendix G Qualitative Results

