Triangular Contrastive Learning on Molecular Graphs

by MinGyu Choi, et al.

Recent contrastive learning methods have been shown to be effective in various tasks, learning generalizable representations that are invariant to data augmentation and thereby achieving state-of-the-art performance. Given the multifaceted nature of the large unlabeled data used in self-supervised learning, while the majority of real-world downstream tasks use a single format of data, building a multimodal framework that can train a single modality to learn diverse perspectives from other modalities is an important challenge. In this paper, we propose TriCL (Triangular Contrastive Learning), a universal framework for trimodal contrastive learning. TriCL takes advantage of Triangular Area Loss, a novel intermodal contrastive loss that learns the angular geometry of the embedding space by simultaneously contrasting the areas of positive and negative triplets. Systematic observation of the embedding space in terms of alignment and uniformity shows that Triangular Area Loss can address the line-collapsing problem by discriminating modalities by angle. Our experimental results also demonstrate that TriCL outperforms existing methods on the downstream task of molecular property prediction, which implies that the advantages of the embedding space indeed benefit performance on downstream tasks.


1 Introduction

Data scarcity has been a severe problem in representation learning, due to the time-consuming and costly nature of annotating large-scale data (Tan et al., 2018). In the field of self-supervised learning (SSL), contrastive learning (CL), which learns the general landscape of an embedding space from unlabeled data by pulling similar (positive) pairs together and pushing dissimilar (negative) pairs apart (Jaiswal et al., 2020; Jing and Tian, 2020), has shown promising strength in learning diverse characteristics from multiple viewpoints without labels when compared with traditional supervised learning methods. Multimodal CL is especially powerful for data whose characteristics are naturally hard to express comprehensively with an individual representation (Radford et al., 2021; Yuan et al., 2021; Zolfaghari et al., 2021).

As most real-world downstream tasks use data from a single modality in fine-tuning, generating an informative embedding space that can be fully utilized by a single encoder (referred to as the main encoder) of the multimodal CL framework becomes extremely important. Multimodal CL therefore requires views from other modalities (referred to as auxiliary modalities) to be distilled and mapped into the embedding space. Meanwhile, many recent works have started introducing more modalities into CL (Mai et al., 2022), whereas existing objectives were mainly proposed for unimodal or bimodal networks (van den Oord et al., 2018; Sohn, 2016). Thus, a scalable framework and an appropriate contrastive objective for more than two modalities are urgently needed.

To address the two challenges described above, we introduce TriCL, a trimodal contrastive learning framework with a novel Triangular Area Loss. TriCL focuses on fully utilizing trimodal information to build an effective embedding space for downstream tasks. Our contributions can be summarized as follows:

  1. Observation on the geometry of the trimodal embedding space. Extending the analysis by Wang and Isola (2020), we characterize the embedding space produced by multimodal CL in terms of intermodal alignment and uniformity. We demonstrate that the intramodal contrastive loss spreads out the embedding space while the intermodal loss compresses it.

  2. Proposal of Triangular Area Loss, a geometry-aware contrastive loss for trimodal CL. We analyze the reasons behind space collapse in intermodal CL by readdressing the importance of intermodal uniformity. Triangular Area Loss is proposed and formulated under TriCL, a universal framework for trimodal CL. Triangular Area Loss captures geometry by minimizing and maximizing the areas of triangles, instead of their pairwise distances. We also demonstrate that the embedding space produced by TriCL with our loss displays more useful properties than the space optimized with the pairwise loss objective.

  3. State-of-the-art performance on molecular property prediction tasks. We demonstrate the utility of TriCL by achieving the best AUC-ROC on most molecular property prediction tasks compared with the latest methods. Ablations on objective functions indicate that geometric advantages in the embedding space result in improved downstream task performance.

2 Related Works

Contrastive learning (CL) has recently shown competitive power in learning transferable knowledge from large-scale unlabeled data for downstream tasks (Islam et al., 2021). Building effective representations that are invariant across different views of data is one of the most important missions in CL, and it demands the minimization of irrelevant nuisances during the pretraining process (Tian et al., 2020). Chen et al. (2021) emphasized a clever choice of the contrastive loss function and augmentation strategy.

Vassileios Balntas and Mikolajczyk (2016) design the Triplet Margin Loss by setting an anchor as the criterion for pulling positive samples and pushing negative ones. Based on information-theoretic arguments, van den Oord et al. (2018) develop a loss function named InfoNCE to optimize a lower bound of the mutual information between the encoded representations. NT-Xent (Normalized Temperature-scaled Cross Entropy), suggested by Sohn (2016), compares multiple negatives effectively to identify positives. Later, Chen et al. (2020) leverage the normalization and temperature of the NT-Xent loss, which performs best in their visual representation learning framework.
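The anchor-based idea behind the Triplet Margin Loss can be sketched in a few lines. The following is a minimal illustration, not the formulation of any cited work: it assumes Euclidean distance and a margin of 1.0, and the function name `triplet_margin_loss` and the toy vectors are hypothetical.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative distances.

    Assumption: Euclidean distance and margin=1.0 are illustrative choices.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    # The loss is zero once the negative is farther than the positive by `margin`.
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])
near_negative = np.array([[0.0, 1.5]])   # violates the margin
far_negative = np.array([[0.0, 5.0]])    # satisfies the margin

loss_near = triplet_margin_loss(anchor, positive, near_negative)
loss_far = triplet_margin_loss(anchor, positive, far_negative)
```

A negative that sits close to the anchor still contributes a loss, while one that is already far enough away contributes nothing.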

To measure the strength of a contrastive objective in downstream tasks, Wang and Isola (2020) characterized two desirable properties as follows:

  • Alignment: Positive pairs are mapped closely in the embedding space.

  • Uniformity: Embeddings are uniformly distributed, preserving as much information as possible.

Chen et al. (2021) similarly give a theoretical argument for the significance of alignment and uniformity.

The multimodal characteristics of data have inspired research on multimodal CL in the computer vision domain. To jointly train the image encoder and the text encoder, Radford et al. (2021) calculate the cosine similarity between the embeddings of image and text for all pairs across a batch. Yuan et al. (2021) take both intramodal and intermodal similarities into account, enforcing consistency within a modality and simultaneously introducing influential samples from another modality to preserve the semantic similarity. Focusing on learning a cross-modal embedding, Zolfaghari et al. (2021) introduce the CrossCLR loss to further ensure intramodal proximity for improving cross-modal retrieval performance.
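The batch-wise cosine similarity used for joint image-text training can be illustrated with a small sketch. It only shows the similarity matrix, not a full training objective; the random embeddings, their sizes, and the names `image_emb` and `text_emb` are assumptions made for this example.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))                     # hypothetical image embeddings
text_emb = image_emb + 0.05 * rng.normal(size=(4, 8))   # paired text embeddings, slightly perturbed

sim = cosine_similarity_matrix(image_emb, text_emb)     # shape (batch, batch)
best_match = sim.argmax(axis=1)                         # most similar text for each image
```

Entries on the diagonal correspond to matched (positive) pairs, and all off-diagonal entries serve as negatives within the batch.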

When it comes to graph contrastive learning, You et al. (2020) propose several graph augmentations, including node dropping and masking, and note the importance of augmentation selection for different tasks. Suresh et al. (2021) demonstrate the importance of avoiding the capture of redundant information when identifying graphs in contrastive learning. Meanwhile, molecules, as commonly used graph benchmark data, have also become a research highlight of graph representation learning. To thoroughly represent molecular information, Wang et al. (2022) build a framework for learning molecular graph representations by contrasting positives against negatives. Liu et al. (2021) include a 3D representation of molecules as an additional modality, considering stochasticity to capture the conformer distribution of a 2D graph.

3 Observations and Explanations on Trimodal Embedding Space

Figure 1: Illustration of the embedding space after trimodal contrastive learning. Specific loss function and geometry of each space: (a) NT-Xent as intramodal loss: ‘hypersphere’ (b) NT-Xent as intermodal loss: ‘line’ (c) Triangular Area Loss as intermodal loss: ‘line’ (d) NT-Xent as intra- and intermodal loss: ‘cone’ (e) Triangular Area Loss as intermodal loss, NT-Xent as intramodal loss: ‘cone’. Angles within each space and angles between spaces are not to scale. Refer to Table 1 for quantified metrics.

Inspired by the two components of the NT-Xent loss, which quantitatively pull positive pairs together and push negative pairs apart, we start by inspecting how alignment and uniformity, as parts of the contrastive loss, shape the joint embedding space. We first implement a simple trimodal framework comprising a transformer, a GNN, and a 3D CNN to encode the text, graph, and structure of molecules, respectively.

3.1 Alignment and Uniformity in Multimodal Contrastive Learning

Analogous to alignment and uniformity in the unimodal NT-Xent loss, we expand the scope of discussion to multimodal CL by introducing the concepts of intermodal alignment and intermodal uniformity (Equation 1). Intermodal alignment regulates the extent to which an encoder learns diverse perspectives of a sample from other encoders and generates representations that reflect multiple views. Conversely, intermodal uniformity enhances an encoder's ability to discriminate distinct features that are unobservable in one modality by contrasting them with negatives from other modalities.


In fact, intramodal uniformity and intermodal alignment are not independent. When samples have distinct representations in the main encoder but similar ones in the auxiliary modalities, intermodal alignment pulls these representations closer at the expense of intramodal uniformity. Given that multimodal CL aims to train the main encoder to reflect similarities from auxiliary representations, the balance between intramodal uniformity and intermodal alignment is the key objective for successful multimodal CL.

| Setting | Loss | Intramodal (main encoder) Align | Intramodal Uniform | Intramodal Combined | Intermodal Align | Intermodal Uniform | Intermodal Combined |
|---|---|---|---|---|---|---|---|
| Intra | NT-Xent | 0.663 | 0.001 | 0.662 | -0.003 | 0.000 | -0.003 |
| Inter | NT-Xent | 0.996 | 0.998 | -0.002 | 1.000 | 0.999 | 0.001 |
| Inter | Ours | 0.997 | 1.000 | -0.003 | 0.002 | 0.002 | 0.000 |
| Joint | NT-Xent | 0.660 | 0.036 | 0.624 | 0.101 | 0.091 | 0.010 |
| Joint | Ours | 0.694 | 0.004 | 0.690 | 0.138 | 0.079 | 0.049 |

Table 1: Metrics regarding the embedding space after trimodal contrastive learning. The alignment metric is the average cosine similarity between all positive pairs (higher is better). The uniformity metric is the average cosine similarity between randomly chosen pairs (close to 0 is better). The combined metric refers to the alignment metric minus the uniformity metric (higher is better). For triplets, all metrics are computed as the average pairwise metric. NT-Xent loss uses temperature = 0.1. For implementation details, see Appendix D.
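The cosine-similarity metrics described in the caption can be sketched as follows. This is an illustrative reading rather than the evaluation code used for Table 1: it assumes that row i of two arrays forms a positive pair, approximates "randomly chosen pairs" by all non-matching pairs, and takes the combined metric as alignment minus uniformity. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def alignment(view_a, view_b):
    """Mean cosine similarity between positive pairs (row i of each array)."""
    return float(np.sum(normalize(view_a) * normalize(view_b), axis=1).mean())

def uniformity(view_a, view_b):
    """Mean cosine similarity between non-matching pairs (i != j)."""
    sim = normalize(view_a) @ normalize(view_b).T
    off_diagonal = ~np.eye(len(sim), dtype=bool)
    return float(sim[off_diagonal].mean())

def combined(view_a, view_b):
    """Assumed definition: alignment minus uniformity."""
    return alignment(view_a, view_b) - uniformity(view_a, view_b)

# Synthetic embeddings: two noisy views of 32 well-spread samples in 64 dimensions.
rng = np.random.default_rng(0)
base = rng.normal(size=(32, 64))
view_a = base + 0.1 * rng.normal(size=base.shape)
view_b = base + 0.1 * rng.normal(size=base.shape)

align_score = alignment(view_a, view_b)
uniform_score = uniformity(view_a, view_b)
combined_score = combined(view_a, view_b)
```

For well-spread embeddings such as these, alignment is close to 1 and uniformity is close to 0, so the combined score is high.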

3.2 Transformation of an embedding space affected by Intra and Intermodal NT-Xent loss

We design two experiments to explore how the embedding space transforms when optimized with a contrastive loss. Specifically, encoders are pre-trained twice, each time under one of two losses: (1) an intramodal loss between augmented data within each modality, and (2) an intermodal loss between pairs of the three modalities. At this stage, only the NT-Xent loss is applied in both experiments.

We observed the transformations of the embedding spaces and measured the contribution of each loss component through two cosine similarity metrics (Table 1). Intramodal cosine similarities between positive pairs and between negative pairs are calculated to assess the effects of intramodal alignment and intramodal uniformity, respectively. Similarly, the sums of intermodal pairwise cosine similarities of positive triplets and of negative triplets are calculated to reflect the contributions of intermodal alignment and intermodal uniformity, respectively. Based on the results, we draw a conceptual picture of the joint embedding space under each experimental setting in Figure 1(a, b).

Intramodal contrastive loss distributes the embedding space to a hypersphere.

Applying the intramodal NT-Xent loss as the exclusive loss within each modality is identical to running CL on three independent modalities. We observe that under this setting, the intramodal alignment metric keeps increasing during training until it converges to 0.663. An intramodal uniformity metric close to zero indicates that the encoder can distinguish individual samples within the scope of that encoder itself. The low intermodal alignment metric implies that positive representations are randomly spread over the space, which is expected as no information is exchanged between modalities.

Intermodal contrastive loss compresses the embedding space into a ‘line’.

To ensure that one encoder can borrow diverse viewpoints from other modalities, we apply NT-Xent losses on intermodal pairs over the three encoders. The intramodal alignment metric of the three encoders rapidly approaches 0.996, indicating the collapse of the individual embedding spaces. Interestingly, as the intermodal alignment of 1.000 shows, the embedding spaces of the three modalities mostly overlap each other, resulting in a single line.

The near-zero intramodal combined metric also implies that the encoder is nearly unable to distinguish different representations, rendering the joint embedding space ineffective. This result is counter-intuitive, as the intermodal uniformity loss would be expected to push individual embedding clusters apart. We conjecture that this line-shaped joint space is a local minimum that is easy to fall into when using the NT-Xent loss for intermodal contrast.
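A toy example makes it concrete why a ‘line’ is uninformative under cosine-similarity metrics. In this sketch (synthetic vectors, hypothetical names), every embedding is a positive multiple of one direction, so matching and non-matching pairs are equally similar and the alignment-minus-uniformity gap vanishes.

```python
import numpy as np

direction = np.array([0.6, 0.8, 0.0])
# Two "views" of four samples; every embedding lies on the same ray.
view_a = np.array([0.5, 1.0, 2.0, 3.5])[:, None] * direction
view_b = np.array([0.7, 1.2, 1.9, 3.0])[:, None] * direction

unit_a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
unit_b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
sim = unit_a @ unit_b.T                       # all entries are (numerically) 1

off_diagonal = ~np.eye(len(sim), dtype=bool)
align_like = float(np.diag(sim).mean())       # similarity of matching pairs
uniform_like = float(sim[off_diagonal].mean())  # similarity of non-matching pairs
combined_like = align_like - uniform_like     # no gap: samples are indistinguishable
```

This mirrors the pattern in the "Inter NT-Xent" row of Table 1, where alignment and uniformity are both close to 1.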

4 TriCL : Triangular Contrastive Learning

In this section, we reconsider intermodal uniformity as a regularization objective. Based on this intuition, we introduce Triangular Area Loss, which aims to learn the geometry of the embedding space and thereby prevent the collapse we observed above. Triangular Area Loss is formulated under TriCL (Triangular Contrastive Learning), a universal framework for trimodal CL that applies our objective. The advantages of Triangular Area Loss and TriCL are discussed in Section 4.3, using the same view of the embedding space as in Section 3.2.

4.1 Readdressing intermodal uniformity: Geometry-aware contrastive loss

As discussed in Section 3.1 and in prior work (Wang and Isola, 2020; Chen et al., 2021), minimizing the NT-Xent loss can be interpreted as optimizing an embedding space to satisfy two desirable properties: alignment and uniformity. In multimodal contrastive learning, a network should additionally consider intermodal alignment and uniformity.

Optimizing intermodal alignment to pull intermodal positive pairs closer is a straightforward way to guide the main encoder to reflect perspectives from auxiliary modalities. However, optimizing intermodal uniformity seems counter-intuitive, as a harmonized embedding space, in which embedding vectors from the same sample converge to a single point, is commonly considered ideal.

Here, we emphasize the role of intermodal uniformity as a regularization factor that prevents encoders from falling into local minima. As we observed in Section 3.2, a naive application of the NT-Xent loss across modalities results in a "line" space. Although this shape of embedding space looks harmonized and aligned, it is clearly undesirable because the embedding cannot distinguish samples and instead collapses indiscriminately into a single "line".

We attribute the collapse to the naive application of the intermodal NT-Xent loss to trimodal CL, which pushes and pulls representations without considering the geometry of the embedding space.

The Triangular Area Loss we devised explicitly considers the geometry among embedding vectors by calculating the area of the triangle formed by three representations. Specifically, as the area of a triangle is calculated from two sides and the angle between them, the encoder becomes aware of the angle between the two sides, which reflects geometry and was unavailable when applying the intermodal NT-Xent loss. This discourages collapse of the embedding space into a single ‘line’ by explicitly enlarging the angles within negative triplets, which is equivalent to optimizing intermodal uniformity.
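The dependence of the area on the angle can be made explicit. With side vectors u and v and angle θ between them, the area is ½·|u|·|v|·sin θ, so the squared area equals ¼·(|u|²|v|² − (u·v)²). The sketch below implements this standard identity for three points in any dimension; the function name is hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def squared_triangle_area(p, q, r):
    """Squared area of the triangle with vertices p, q, r (any dimension).

    Uses area = 0.5 * |u| * |v| * sin(theta), hence
    area^2 = 0.25 * (|u|^2 |v|^2 - (u . v)^2), which avoids a square root.
    """
    u, v = q - p, r - p
    return 0.25 * (np.dot(u, u) * np.dot(v, v) - np.dot(u, v) ** 2)

origin = np.zeros(3)
right_angle = squared_triangle_area(origin, np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
collinear = squared_triangle_area(origin, np.array([1.0, 0.0, 0.0]), np.array([2.0, 0.0, 0.0]))
```

Three collinear points give zero area regardless of their pairwise distances, which is why an area-based objective can penalize a ‘line’ configuration that pairwise objectives do not see.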

Figure 2: The TriCL framework. (a) Each sample is represented in three distinct formats; each is augmented twice and then encoded, generating six representations per sample. (b) Representations in different modalities are contrasted using Triangular Area Loss. (c) Representations in the same modality are contrasted using pairwise NT-Xent loss. (d) TriCL builds the embedding space by carefully balancing intramodal and intermodal contrastive losses.

4.2 Architecture and Learning Objectives

We continue our discussion of Triangular Area Loss within TriCL, a universal trimodal contrastive learning framework suitable for any type of pre-training, with any network architecture and any data format that can be decomposed into three modalities.

TriCL is designed to pre-train a main encoder together with two auxiliary modalities, aux1 and aux2. By contrasting positive triplets and negative triplets at once, TriCL aims to train the main encoder to learn similarities and differences between inputs that are only recognizable to the auxiliary modalities, so that the main encoder can better capture general properties suited to downstream tasks.


Three encoder networks (main, aux1, and aux2) are each given their respective inputs, which are multiple views of the same sample. Each sample is then augmented twice, following probabilistic augmentation strategies. The output of each encoder, whether main, aux1, or aux2, is a vector of equal length. Implementation details of the augmentation are described in Appendix D.

Learning Objective

As discussed in Section 3.1, a learning objective for multimodal contrastive learning should be carefully designed to maintain a balance between alignment and uniformity, both within and across modalities. To this end, the learning objective of TriCL is formulated as a weighted sum of the intramodal contrastive loss and the intermodal contrastive loss.

For the intermodal loss, TriCL adopts Triangular Area Loss as its objective. To formulate Triangular Area Loss, we first define a positive triplet as a triplet whose inputs are trimodal augmentations of the same sample, and all remaining triplets as negative triplets. Note that in a sample batch, every sample is augmented twice, so each batch yields both positive and negative triplets.

We then define the triangular contrastive metric as:


In Equation 2, the expectation is taken over all triplets of augmented data, and the area term refers to the area of the triangle whose vertices are the three representations in a triplet. The square is taken to reduce numerical instability in the computation of the triangular area.

To compute it on a batch, the triangular contrastive metric can be implemented as:


Here, a normalization factor accounts for the numbers of positive and negative triplets.


Intuitively, the positive-triplet term is an objective that minimizes the area of the triangles drawn from positive triplets, thus pulling positive triplets closer in the joint embedding space. However, collapsing the entire embedding space into a single point also minimizes the average area of positive triplets, similar to the collapse in Section 3.2. Following Section 4.1, which emphasizes the regularization role of intermodal uniformity, we aim to simultaneously shrink positive triangles and expand negative ones.
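The idea of shrinking positive triangles while expanding negative ones can be sketched on a batch as follows. This is an assumption-laden illustration, not the paper's equation: it takes one embedding per sample and modality (ignoring the second augmentation), treats a triplet as positive only when all three indices match, and uses the plain difference of mean squared areas in place of the paper's normalization factor. All names are hypothetical.

```python
import numpy as np

def squared_areas(main, aux1, aux2):
    """Squared area of every triangle (main[i], aux1[j], aux2[k]); shape (N, N, N)."""
    u = aux1[None, :, None, :] - main[:, None, None, :]   # (N, N, 1, D)
    v = aux2[None, None, :, :] - main[:, None, None, :]   # (N, 1, N, D)
    uu = np.sum(u * u, axis=-1)                            # (N, N, 1)
    vv = np.sum(v * v, axis=-1)                            # (N, 1, N)
    uv = np.sum(u * v, axis=-1)                            # (N, N, N)
    return 0.25 * (uu * vv - uv ** 2)

def triangular_contrastive_metric(main, aux1, aux2):
    """Assumed form: mean positive squared area minus mean negative squared area."""
    areas = squared_areas(main, aux1, aux2)
    n = len(main)
    idx = np.arange(n)
    positive = np.zeros((n, n, n), dtype=bool)
    positive[idx, idx, idx] = True                         # same sample in all modalities
    return float(areas[positive].mean() - areas[~positive].mean())

rng = np.random.default_rng(0)
centers = rng.normal(size=(5, 16))
# Three modalities that agree on each sample up to small noise.
views = [centers + 0.01 * rng.normal(size=centers.shape) for _ in range(3)]
metric_separated = triangular_contrastive_metric(*views)

# Fully collapsed space: every embedding is the same point.
same_point = np.ones((5, 16))
metric_collapsed = triangular_contrastive_metric(same_point, same_point, same_point)
```

Under this assumed form, a collapsed space scores exactly zero, whereas a space with tight positive triplets and spread-out negatives scores lower, so minimizing the metric does not reward collapse to a point.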

The intramodal loss is calculated in an encoder-wise manner, as intramodal alignment and uniformity are independent of the relationships between modalities. Specifically, TriCL adopts NT-Xent as the intramodal loss for each encoder, which is a combination of intramodal alignment and uniformity as shown in Equation 5 (Wang and Isola, 2020; Chen et al., 2021). Cosine similarity is used as the similarity metric in TriCL.
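For reference, a common formulation of NT-Xent with cosine similarity is sketched below. It follows the widely used two-view setup (row i of each array is a positive pair, all other views in the batch are negatives) and a temperature of 0.1, matching the value reported with Table 1. It is not claimed to be identical to the paper's Equation 5, and the function and variable names are hypothetical.

```python
import numpy as np

def nt_xent(view_a, view_b, temperature=0.1):
    """NT-Xent over two augmented views; row i of each array is a positive pair."""
    z = np.concatenate([view_a, view_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)       # cosine similarity via unit vectors
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                         # a view is never its own candidate
    n = len(view_a)
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Numerically stable log-sum-exp over all candidates of each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_numerator = sim[np.arange(2 * n), positives]
    return float((log_denominator - log_numerator).mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 32))
view_a = base + 0.1 * rng.normal(size=base.shape)
view_b = base + 0.1 * rng.normal(size=base.shape)

loss_aligned = nt_xent(view_a, view_b)                       # positives are true matches
loss_shuffled = nt_xent(view_a, np.roll(view_b, 1, axis=0))  # positives are mismatched
```

The loss is small when each view is most similar to its own partner and grows when the designated positives are mismatched.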


We unified the intramodal alignment loss and the intramodal uniformity loss for each individual modality, resulting in the objective function in Equation 6, which comprises three intramodal contrastive losses and one intermodal contrastive loss. Two hyperparameters control the weight of the main modality and the weight of the intermodal contrastive loss, respectively.
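One way such a weighted objective could be assembled is sketched below. The exact placement of the two weights in Equation 6 is not recoverable from the text, so this form is an assumption: `main_weight` scales the main encoder's intramodal loss and `inter_weight` scales the intermodal loss. Both parameter names and the function name are hypothetical.

```python
def tricl_objective(intra_main, intra_aux1, intra_aux2, inter,
                    main_weight=1.0, inter_weight=1.0):
    """Weighted sum of three intramodal losses and one intermodal loss.

    Assumption: the weights apply to the main intramodal term and to the
    intermodal term; the auxiliary intramodal terms are left unweighted.
    """
    return main_weight * intra_main + intra_aux1 + intra_aux2 + inter_weight * inter

# Illustrative values only.
total = tricl_objective(1.0, 2.0, 3.0, 4.0, main_weight=2.0, inter_weight=0.5)
```

Raising `inter_weight` strengthens the pull from the auxiliary modalities, while raising `main_weight` favors the spreading effect of the main encoder's intramodal loss.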


4.3 More on Embedding Space

We finish this section by revisiting the embedding space. In Section 3.2, we reported a collapse of the embedding space into a ‘line’ when applying the NT-Xent loss as the intermodal contrastive loss. To address this problem, we introduced Triangular Area Loss, which is expected to keep the encoders from falling into local minima by inspecting the geometry of the embedding space (Section 4.1).

The remaining question is: does Triangular Area Loss solve the collapsing problem? To answer it, we show through the experiments below that (1) Triangular Area Loss mitigates the collapse by dispersing embedding vectors from different modalities, and (2) joint application of Triangular Area Loss with NT-Xent generates informative embedding spaces.

Triangular Area Loss discriminates modalities by angle.

Maintaining the experimental conditions from Section 3.2, we apply Triangular Area Loss in place of the intermodal NT-Xent loss. The intramodal alignment and uniformity metrics converge to 0.997 and 1.000 respectively, which indicates that the collapse is reproduced for each encoder and that each space falls into a ‘line’. The resulting embedding space is also inappropriate for multimodal CL, because the intermodal combined metric of 0.000 implies that the encoder cannot distinguish different samples. However, the intermodal uniformity of 0.002 offers a key to solving the collapsing problem: it reveals that the embedding vectors form several ‘lines’ at diverse angles, rather than a single ‘line’ as in Section 3.2.

Joint application of NT-Xent forms ‘cones’

Setting Triangular Area Loss aside, the NT-Xent loss was first applied as both the intramodal and the intermodal loss. After carefully balancing the spreading effect of the intramodal NT-Xent loss against the collapsing effect of the intermodal NT-Xent loss, we observed a ‘cone’-shaped embedding space with an intramodal alignment of 0.660. Although the resulting embedding space has several advantages over the ‘line’ space, such as low intramodal uniformity indicating the expressive power of the encoder, this joint embedding space is not informative because the intermodal combined metric is nearly zero.

Angular diversification of ‘cones’ using Triangular Area Loss

Hypothesizing that the low information gain from auxiliary modalities stems from the collapsing effect described in Section 3.2, we widen the angles between ‘cones’ using Triangular Area Loss as the intermodal loss. Consistent with the previous observation, the angles between ‘cones’ become wider, as indicated by the decrease in intermodal uniformity from 0.091 to 0.079. Notably, both the intramodal and the intermodal combined metrics reach their highest values, which implies that the main encoder can better capture both the features available to the main encoder itself and those from auxiliary modalities.

5 Experiments

To assess the performance of the TriCL framework and Triangular Area Loss, we implemented TriCL with widely used architectures for each representation: a transformer, a GNN, and a CNN. We also adopted and implemented three types of augmentation strategies appropriate for strings, graphs, and conformers. Under these conditions, TriCL achieved state-of-the-art performance on molecular property classification tasks. Experimental details and explanations follow.

5.1 Task Definition


We pre-trained TriCL on a molecular conformation dataset and then fine-tuned it on molecular property prediction downstream tasks. As in Liu et al. (2021), 50k qualified molecules randomly chosen from the GEOM dataset (Axelrod and Gomez-Bombarelli, 2022) were used for pre-training. The pre-trained model is fine-tuned and assessed on 8 binary molecular property classification tasks from MoleculeNet (Wu et al., 2018). Note that since the MoleculeNet dataset only contains graph-level representations, 3D conformer information is unavailable in downstream tasks. Further details on the datasets used are described in Appendix C.


We compared our results with models from well-acknowledged, peer-reviewed works on graph SSL: EdgePred (Hamilton et al., 2017), InfoGraph (Hamilton et al., 2017), GPT-GNN (Hu et al., 2020), ContextPred (Hu et al., 2019), GraphLoG (Xu et al., 2021), G-Motif (Rong et al., 2020), GraphCL (You et al., 2020), and JOAO (You et al., 2021). We also compared our model with the 3D structure-aware graph SSL model GraphMVP (Liu et al., 2021).

5.2 TriCL Implementation


Referring to previous graph-based self-supervised learning models (Liu et al., 2021; Hu et al., 2019; You et al., 2020, 2021), we adopted five layers of Graph Isomorphism Network (GIN) (Xu et al., 2019) as the main encoder. For the two auxiliary encoders, we used 6 transformer layers and 4 3D CNN layers, which are well suited to learning representations from 1D SELFIES strings and 3D structures, respectively. Specifically, the transformer layer is directly adopted from PyTorch, and the 3D CNN architecture follows Townshend et al. (2020). The resulting three embedding vectors were fed into three consecutive multilayer perceptron layers, producing three joint embedding vectors.

Initial Representation

Starting from a 1D SMILES string randomly chosen from GEOM, we first build the 1D SELFIES string following Krenn et al. (2020). We used SELFIES rather than SMILES because the SELFIES string representation remains valid under any type of permutation or mutation and is therefore much more appropriate for reasonable augmentation. 2D graph representations were obtained using RDKit (Landrum et al., 2013) in combination with PyTorch Geometric. Atomic coordinates in 3D conformer structures were calculated using RDKit, then voxelized following Townshend et al. (2020).


We adopted and implemented node drop (ND), node/edge masking (NM), and subgraph masking (SM) augmentations for each representation type and corresponding architecture. Specifically, we referred to You et al. (2020) and Wang et al. (2022) for graph augmentations, while the string and structure augmentations are newly devised. As described in Section 4.2, one query representation is augmented twice under the same policy, resulting in two augmented representations for each modality. The final model used ND for the GNN; NM for the transformer and the CNN; and SM for all three architectures.
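As an illustration of the node drop (ND) augmentation on a graph, the sketch below removes a random subset of nodes together with their incident edges and calls the same policy twice to obtain two views. The edge-list representation, the 20% default ratio, and the function name are assumptions for this example rather than details of the TriCL implementation.

```python
import random

def drop_nodes(num_nodes, edges, drop_ratio=0.2, seed=None):
    """Return the kept nodes and edges after removing a random subset of nodes.

    Assumption: the graph is given as a node count and an edge list of
    (u, v) index pairs; drop_ratio is an illustrative default.
    """
    rng = random.Random(seed)
    num_drop = int(num_nodes * drop_ratio)
    dropped = set(rng.sample(range(num_nodes), num_drop))
    kept_nodes = [n for n in range(num_nodes) if n not in dropped]
    # An edge survives only if both of its endpoints survive.
    kept_edges = [(u, v) for u, v in edges if u not in dropped and v not in dropped]
    return kept_nodes, kept_edges

# A 5-node ring as a stand-in for a small molecular graph.
ring_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]

# The same policy applied twice yields two augmented views of one sample.
view_1 = drop_nodes(5, ring_edges, drop_ratio=0.4, seed=1)
view_2 = drop_nodes(5, ring_edges, drop_ratio=0.4, seed=2)
```

Masking-based augmentations follow the same pattern but replace node or edge features with a mask token instead of deleting them.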


Triangular Area Loss in combination with the NT-Xent loss is applied during pre-training. Model parameters were optimized using the Adam optimizer, and hyperparameters were tuned through grid search.

Further details on model implementation are described in Appendix D.

| Pre-training | BBBP | Tox21 | ToxCast | SIDER | ClinTox | MUV | HIV | BACE | AVG |
|---|---|---|---|---|---|---|---|---|---|
| - | 65.4(2.4) | 74.9(0.8) | 61.6(1.2) | 58.0(2.4) | 58.8(5.5) | 71.0(2.5) | 75.3(0.5) | 72.6(4.9) | 67.21 |
| EdgePred | 64.5(3.1) | 74.5(0.4) | 60.8(0.5) | 56.7(0.1) | 55.8(6.2) | 73.3(1.6) | 75.1(0.8) | 64.6(4.7) | 65.64 |
| AttrMask | 70.2(0.5) | 74.2(0.8) | 62.5(0.4) | 60.4(0.6) | 68.6(9.6) | 73.9(1.3) | 74.3(1.3) | 77.2(1.4) | 70.16 |
| GPT-GNN | 64.5(1.1) | 74.2(0.8) | 62.5(0.4) | 60.4(0.6) | 68.6(9.6) | 73.9(1.3) | 74.3(1.3) | 77.2(1.4) | 68.27 |
| InfoGraph | 69.2(0.8) | 73.0(0.7) | 62.0(0.3) | 59.2(0.2) | 75.1(5.0) | 74.0(1.5) | 74.5(1.8) | 73.9(2.5) | 70.10 |
| ContextPred | 71.2(0.9) | 73.3(0.5) | 62.8(0.3) | 59.3(1.4) | 73.7(4.0) | 72.5(2.2) | 75.8(1.1) | 78.6(1.4) | 70.89 |
| GraphLoG | 67.8(1.7) | 73.0(0.3) | 62.2(0.4) | 57.4(2.3) | 62.0(1.8) | 73.1(1.7) | 73.4(0.6) | 78.8(0.7) | 68.47 |
| G-Motif | 66.4(3.4) | 73.2(0.8) | 62.6(0.5) | 60.6(1.1) | 77.8(2.0) | 73.3(2.0) | 73.8(1.4) | 73.4(4.0) | 70.14 |
| GraphCL | 67.5(3.3) | 75.0(0.3) | 62.8(0.2) | 60.1(1.3) | 78.9(4.2) | 77.1(1.0) | 75.0(0.4) | 68.7(7.8) | 70.64 |
| JOAO | 66.0(0.6) | 74.4(0.7) | 62.7(0.6) | 60.7(1.0) | 66.3(3.9) | 77.0(2.2) | 76.6(0.5) | 72.9(2.0) | 69.57 |
| GraphMVP-G | 70.8(0.5) | 75.9(0.5) | 63.1(0.2) | 60.2(1.1) | 79.1(2.8) | 77.7(0.6) | 76.0(0.1) | 79.3(1.5) | 72.76 |
| GraphMVP-C | 72.4(1.6) | 74.4(0.2) | 63.1(0.4) | 63.9(1.2) | 77.5(4.2) | 75.0(1.0) | 77.0(1.2) | 81.2(0.9) | 73.07 |
| TriCL (ours) | 72.4(0.4) | 75.5(0.3) | 63.9(0.4) | 62.0(1.0) | 85.4(1.9) | 77.0(0.8) | 78.9(0.5) | 82.5(1.2) | 74.71 |

Table 2: Results on the molecular property prediction classification tasks. We report the average test AUC-ROC on 8 downstream tasks, with the standard deviation in parentheses. The top-1 AUC-ROC score for each task is underlined and bolded. Datasets were scaffold split. Baseline performances were adopted from Liu et al. (2021). Fine-tuning was repeated under 3 independent seeds. We report the test AUC-ROC at the epoch at which the validation AUC-ROC was the highest.

5.3 Results on the molecular property classification tasks

Results are summarized in Table 2. TriCL achieved an average AUC-ROC of 74.71 on 8 molecular property classification tasks, with the best performance on 5 tasks and the second-best performance on 2 tasks.
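These summary figures can be checked directly against Table 2. The sketch below recomputes the TriCL average from the rounded per-task values and counts the tasks on which TriCL matches or exceeds the best baseline; the best-baseline values are read off the table, and ties (BBBP) are counted in TriCL's favor.

```python
# Per-task test AUC-ROC of TriCL, copied from Table 2.
tricl = {"BBBP": 72.4, "Tox21": 75.5, "ToxCast": 63.9, "SIDER": 62.0,
         "ClinTox": 85.4, "MUV": 77.0, "HIV": 78.9, "BACE": 82.5}

# Best baseline score per task, read off the non-TriCL rows of Table 2.
best_baseline = {"BBBP": 72.4, "Tox21": 75.9, "ToxCast": 63.1, "SIDER": 63.9,
                 "ClinTox": 79.1, "MUV": 77.7, "HIV": 77.0, "BACE": 81.2}

average = sum(tricl.values()) / len(tricl)
tasks_best = [task for task in tricl if tricl[task] >= best_baseline[task]]
```

The recomputed average is 74.7 from the rounded entries, consistent with the reported 74.71, and `tasks_best` contains the five tasks BBBP, ToxCast, ClinTox, HIV, and BACE.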

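(a) Objective function

| Intra loss | Inter loss | Performance |
|---|---|---|
| NT-Xent | - | 72.29 |
| - | NT-Xent | 71.88 |
| - | Triplet Margin | 71.10 |
| - | Triangle Area | 71.20 |
| NT-Xent | NT-Xent | 73.31 |
| NT-Xent | Triplet Margin | 73.51 |
| NT-Xent | Triangle Area | 74.71 |

(b) Augmentation strategy

| Augmentation Method | Performance |
|---|---|
| - - | 71.40 |
| - - | 72.38 |
| - - | 71.43 |
| - | 71.63 |
| - | 74.71 |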

5.4 Ablation Study

We then assessed the effects of the core components of TriCL by systematically ablating each component while keeping the other settings fixed. Experiments were performed with three seeds, and the average performance over the 8 fine-tuning tasks is reported. Detailed results are provided in Appendix E.

Effects of the objective function

To demonstrate the effectiveness of Triangular Area Loss, we conducted an experiment in which we pre-trained the same model with different loss functions and assessed its performance on the MoleculeNet dataset. The results in Table (a) empirically show that: (1) Pre-training with only an intermodal loss performs worse than using only an intramodal loss, as expected from Section 3.2. As we discussed, the collapse of the embedding space impedes the main encoder from learning from auxiliary modalities. (2) Joint application of intramodal and intermodal losses better captures multifaceted features of the sample. As we expected in Section 4.3, careful application of both losses clearly enhances performance compared with the intramodal loss alone. (3) The alignment and uniformity that Triangular Area Loss induces in the embedding space are beneficial for downstream tasks. As discussed previously, careful design of the intermodal loss can encourage the encoder to capture the innate geometry of the embedding space, resulting in the highest performance.

Effects of the augmentation strategy

We also tested the dependency on augmentations by assessing performance after applying various combinations of augmentation strategies during pre-training. The results in Table (b) indicate that performance depends heavily on the augmentation strategy. This dependency is expected, since TriCL has no metric of similarity between samples other than augmentation and only regards representations from the same sample as similar pairs. Therefore, the only way to learn innate similarities between samples is for two samples to generate the same augmented representations. This gives the important insight that, while TriCL could conceptually be applied to other tasks such as video learning, the quality of the augmentation strategy would be crucial to downstream performance. We finally note that augmentation strategies should be carefully curated, since, as Table (b) shows, applying all available augmentations might actually harm performance. Tian et al. (2020) also discuss this phenomenon and explain it in terms of the amount of task-relevant information preserved during augmentation.

6 Conclusion

In this paper, we started by inspecting how alignment and uniformity in intramodal and intermodal contrastive losses shape the joint embedding space. To mitigate the line-collapsing problem, we proposed Triangular Area Loss, a novel intermodal contrastive loss that learns the angular geometry of the embedding space by contrasting the areas of positive and negative triplets. Under TriCL, a universal trimodal contrastive learning framework, we formulated Triangular Area Loss and discussed the advantages of the resulting embedding space for multimodal representation learning. Our experimental results demonstrate that TriCL outperforms existing methods even when only unimodal information is available for downstream tasks.

Generalization and expansion of TriCL

To the best of our knowledge, learning molecular representations is one of the most natural and general tasks for multimodal contrastive learning, since (1) rather than being arbitrarily chosen data formats, the strings, graphs, and conformer structures we use represent 1D, 2D, and 3D information of molecules, respectively, (2) the chosen main (GIN) and auxiliary (Transformer and CNN) encoders are among the most representative architectures for their respective data formats, and (3) our experiment considers the case where only unimodal information is available for downstream tasks, by choosing GIN as the main encoder. Adapting TriCL to tasks where multiple encoders share the same architecture (such as machine translation) or where representations are shared among samples (such as image tagging), together with analyzing the geometric properties of the embedding spaces learned for such tasks, is interesting future work.

Theoretical considerations

We characterized certain desirable properties of the embedding space in terms of alignment and uniformity, and designed TriCL to satisfy these objectives. Admittedly, our arguments rely primarily on empirical results and observations. A mathematically rigorous treatment of joint embedding spaces learned from three or more modalities remains difficult and important work, for which we believe TriCL is a good starting point.


  • Axelrod and Gomez-Bombarelli [2022] Simon Axelrod and Rafael Gomez-Bombarelli. GEOM: Energy-annotated molecular conformations for property prediction and molecular generation. Scientific Data, 9(1):1–14, 2022.
  • Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607. PMLR, 13–18 Jul 2020.
  • Chen et al. [2021] Ting Chen, Calvin Luo, and Lala Li. Intriguing Properties of Contrastive Losses. In Advances in Neural Information Processing Systems, volume 34. Curran Associates, Inc., 2021.
  • Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • Hu et al. [2019] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for Pre-training Graph Neural Networks. arXiv preprint arXiv:1905.12265, 2019.
  • Hu et al. [2020] Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. GPT-GNN: Generative Pre-Training of Graph Neural Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1857–1867, 2020.
  • Islam et al. [2021] Ashraful Islam, Chun-Fu (Richard) Chen, Rameswar Panda, Leonid Karlinsky, Richard Radke, and Rogerio Feris. A Broad Study on the Transferability of Visual Representations With Contrastive Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8845–8855, October 2021.
  • Jaiswal et al. [2020] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A Survey on Contrastive Self-supervised Learning. Technologies, 9(1):2, 2020.
  • Jing and Tian [2020] Longlong Jing and Yingli Tian. Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey. IEEE transactions on pattern analysis and machine intelligence, 43(11):4037–4058, 2020.
  • Krenn et al. [2020] Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology, 2020.
  • Landrum et al. [2013] Greg Landrum et al. RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling, 2013.
  • Liu et al. [2021] Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, and Jian Tang. Pre-training Molecular Graph Representation with 3D Geometry. arXiv preprint arXiv:2110.07728, 2021.
  • Mai et al. [2022] Sijie Mai, Ying Zeng, Shuangjia Zheng, and Haifeng Hu. Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis. IEEE Transactions on Affective Computing, 2022.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021.
  • Rong et al. [2020] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying WEI, Wenbing Huang, and Junzhou Huang. Self-Supervised Graph Transformer on Large-Scale Molecular Data. In Advances in Neural Information Processing Systems, volume 33. Curran Associates, Inc., 2020.
  • Sohn [2016] Kihyuk Sohn. Improved Deep Metric Learning with Multi-class N-pair Loss Objective. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • Suresh et al. [2021] Susheel Suresh, Pan Li, Cong Hao, and Jennifer Neville. Adversarial Graph Augmentation to Improve Graph Contrastive Learning. In Advances in Neural Information Processing Systems, volume 34. Curran Associates, Inc., 2021.
  • Tan et al. [2018] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A Survey on Deep Transfer Learning. In International Conference on Artificial Neural Networks, pages 270–279. Springer, 2018.
  • Tian et al. [2020] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What Makes for Good Views for Contrastive Learning? In Advances in Neural Information Processing Systems, volume 33. Curran Associates, Inc., 2020.
  • Townshend et al. [2020] Raphael J. L. Townshend, Martin Vögele, Patricia Suriana, Alexander Derry, Alexander Powers, Yianni Laloudakis, Sidhika Balachandar, Brandon M. Anderson, Stephan Eismann, Risi Kondor, Russ B. Altman, and Ron O. Dror. ATOM3D - Tasks on Molecules in Three Dimensions. arXiv preprint arXiv:2012.04035, 2020.
  • van den Oord et al. [2018] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018.
  • Balntas et al. [2016] Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. In Proceedings of the British Machine Vision Conference (BMVC). BMVA Press, 2016.
  • Wang and Isola [2020] Tongzhou Wang and Phillip Isola. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. In International Conference on Machine Learning. PMLR, 2020.
  • Wang et al. [2022] Yuyang Wang, Jianren Wang, Zhonglin Cao, and Amir Barati Farimani. Molecular contrastive learning of representations via graph neural networks. Nature Machine Intelligence, 4(3):279–287, 2022.
  • Wu et al. [2018] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
  • Xu et al. [2019] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? In International Conference on Learning Representations, 2019.
  • Xu et al. [2021] Minghao Xu, Hang Wang, Bingbing Ni, Hongyu Guo, and Jian Tang. Self-supervised Graph-level Representation Learning with Local and Global Structure. In International Conference on Machine Learning. PMLR, 2021.
  • You et al. [2020] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph Contrastive Learning with Augmentations. In Advances in Neural Information Processing Systems, volume 33. Curran Associates, Inc., 2020.
  • You et al. [2021] Yuning You, Tianlong Chen, Yang Shen, and Zhangyang Wang. Graph contrastive learning automated. In International Conference on Machine Learning. PMLR, 2021.
  • Yuan et al. [2021] Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, Yilin Wang, Michael Maire, Ajinkya Kale, and Baldo Faieta. Multimodal Contrastive Training for Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6995–7004, 2021.
  • Zolfaghari et al. [2021] Mohammadreza Zolfaghari, Yi Zhu, Peter Gehler, and Thomas Brox. CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1450–1459, 2021.

Appendix A Additional Results

This section describes results from additional experiments that were not covered in the main manuscript: (1) scaled-up pre-training on a larger dataset, (2) a test that replaces SELFIES with SMILES, and (3) a performance comparison with a bimodal system. We hope this section provides detailed support for TriCL and Triangular Area Loss while also delivering useful insights for future research.

A.1 Pre-training on Large Dataset

To assess the effect of the size of the unlabeled pre-training dataset, we pre-trained TriCL with larger numbers of molecules from GEOM. As shown in Table A1, a larger dataset may or may not be beneficial, depending on the task. For Tox21, ToxCast, and MUV, AUC-ROC increases gradually with dataset size, while for BBBP, ClinTox, HIV, and BACE, AUC-ROC is higher when pre-training with the smaller dataset. We conclude that the unlabeled dataset should also be curated task-specifically to achieve optimal performance, because the model can be ‘distracted’ by irrelevant samples.

# Compounds BBBP Tox21 ToxCast SIDER ClinTox MUV HIV BACE AVG
50k 72.4(0.4) 75.5(0.3) 63.9(0.4) 62.0(1.0) 85.4(1.9) 77.0(0.8) 78.9(0.5) 82.5(1.2) 74.71
100k 71.7(0.5) 75.7(0.5) 64.1(0.1) 62.1(0.6) 81.2(2.2) 77.6(0.8) 78.6(0.2) 82.3(0.8) 74.16
200k 72.1(0.8) 76.3(0.3) 64.7(0.3) 61.6(0.3) 85.2(1.0) 78.3(1.4) 78.3(0.8) 82.1(1.1) 74.83
Table A1: Downstream performance of TriCL pre-trained on datasets of different sizes. Pre-training and fine-tuning were identical across rows except for the pre-training dataset size.

A.2 Pre-training with SMILES String Representation

Instead of using SMILES, the most common molecular text representation, TriCL adopts SELFIES strings as the 1D representation of the molecule. We believe that SELFIES is a better representation for extracting chemically significant properties because: (1) A SELFIES string still represents a valid molecule after any type of mutation is applied to the original string. Thus, using SELFIES lets the model learn from chemically valid augmented representations, which supports learning more meaningful relationships between augmentations than SMILES does. (2) In a SELFIES string, all tokens carry semantic meaning, while in SMILES some tokens play only syntactic roles. Since each token represents a specific chemical unit in SELFIES, node/edge masking and subgraph masking generate meaningful augmented representations.

- - 66.0(3.0) 75.5(0.4) 64.2(0.4) 61.1(1.6) 64.8(5.4) 76.7(0.2) 77.2(0.2) 77.8(1.1) 70.51 71.40 +0.89
- - 69.5(0.8) 74.1(0.5) 62.7(1.1) 61.6(0.5) 76.8(0.8) 74.3(2.4) 75.4(2.0) 81.0(1.7) 71.94 72.38 +0.44
- - 68.7(0.4) 74.6(0.1) 63.1(0.8) 60.0(1.0) 76.5(2.7) 73.5(1.2) 75.2(0.8) 78.4(3.2) 71.25 71.43 +0.18
- 67.6(4.0) 75.0(1.1) 62.2(0.7) 59.7(1.0) 70.6(2.2) 77.0(0.6) 76.6(0.3) 77.5(2.3) 70.77 71.63 +0.86
- 71.8(1.0) 74.5(0.7) 63.5(0.2) 60.7(0.9) 79.1(2.4) 76.0(2.0) 76.6(1.2) 81.6(2.0) 72.98 74.71 +1.73
70.8(0.8) 73.2(0.4) 61.4(0.2) 60.5(2.0) 65.4(4.7) 74.1(1.7) 75.2(1.6) 71.5(1.7) 69.01 72.13 +3.12
Table A2: Performance of TriCLs applying SMILES/SELFIES(OURS) as 1D auxiliary representation.

A.3 Comparison to Bimodal CL

We validate the necessity of each auxiliary modality by implementing a bimodal CL framework with the NT-Xent intermodal loss and measuring model performance on the same downstream tasks. As shown in Table A3, GNNs trained with two auxiliary modalities perform better than GNNs trained with one auxiliary modality. This result might suggest that TriCL's performance stems merely from the additional modality. However, the fact that TriCL performs significantly better than the pairwise trimodal contrastive learning method implies that TriCL benefits from Triangular Area Loss, which better distills the diverse perspectives of the auxiliary modalities into the main encoder by contrasting triplets and considering their geometry simultaneously.

Encoder Loss BBBP Tox21 ToxCast SIDER ClinTox MUV HIV BACE AVG
1D+2D NT-Xent 71.1(0.7) 75.0(0.5) 64.0(0.8) 61.0(0.6) 79.9(2.2) 76.3(0.6) 76.8(0.8) 80.1(1.8) 73.02
3D+2D NT-Xent 72.2(1.8) 74.8(1.0) 64.3(0.4) 58.7(1.4) 78.2(3.4) 78.1(1.2) 77.1(1.1) 78.9(3.2) 72.80
1D+2D+3D NT-Xent 71.1(0.3) 75.0(0.6) 63.6(0.5) 60.6(1.2) 81.6(4.4) 76.7(1.0) 77.4(0.6) 80.4(1.5) 73.31
1D+2D+3D Triangular 72.4(0.4) 75.5(0.3) 63.9(0.4) 62.0(1.0) 85.4(1.9) 77.0(0.8) 78.9(0.5) 82.5(1.2) 74.71
Table A3: Comparison to Bimodal CL. All other settings remained the same.

Appendix B Case Study

We support our experiments on the embedding space and performance of TriCL by assessing TriCL on a widely used, domain-acknowledged independent validation dataset.

B.1 DUD-E dataset and GPCR target proteins

DUD-E (Database of Useful Decoys: Enhanced) DUDE provides challenging negative samples (‘decoys’) for the protein-molecule docking task. It contains 22,886 active compounds and their affinities against 102 protein targets.

  • For each positive (‘active’) protein-binding compound, 50 decoy molecules that have similar physico-chemical properties but dissimilar 2D graph topologies are also included.

  • Therefore, we expect a pre-trained GNN to map active compounds and their corresponding decoys close together in the embedding space while discriminating irrelevant molecules, so that the GNN can concentrate on capturing subtle differences between graphs during the fine-tuning phase.

  • This can be measured with the same metrics as in Section 3.2, alignment and uniformity; alignment between molecules targeting the same protein and uniformity between irrelevant molecules should be high for better performance.

Among the subsets of DUD-E protein targets, we focused on GPCR (G protein-coupled receptor)-binding active compounds and their corresponding decoys for three reasons:

  1. GPCRs are involved in multiple signaling pathways that are crucial to the life of the cell, which makes them important protein targets gpcr4.

  2. GPCRs are highly dynamic and undergo large conformational changes during activation, which necessitates careful 3D design of drug structures gpcr1, gpcr2.

  3. Small molecules without careful structural design can bind multiple substructures of GPCRs, which can cause severe side effects gpcr3.

The DUD-E GPCR subset comprises five specific GPCR targets: AA2AR (Adenosine A2A receptor), ADRB1 (Adrenoceptor beta 1), ADRB2 (Adrenoceptor beta 2), CXCR4 (C-X-C chemokine receptor type 4), and DRD3 (Dopamine receptor D3). We first examined the embedding space formed by compounds of all GPCRs, then analyzed the embedding spaces of molecules targeting each specific target.

B.2 Embedding Space Properties

Using the pre-trained TriCL, we assessed the alignment of the embedding space generated from the active compounds of GPCRs. Uniformity was measured as the cosine similarity with respect to irrelevant compounds that bind to other proteins. The combined metric was calculated as alignment minus uniformity, as defined in Table 1. For the instance-wise analysis, 500 active compounds (for AA2AR, ADRB1, ADRB2, and DRD3) and 200 active compounds (for CXCR4) were randomly selected from each target together with their corresponding decoys, and average cosine similarities are reported. We compared the embedding space metrics with those of the base GNN encoder trained using only NT-Xent loss as an intramodal contrastive loss.

GPCR active compounds Target Instances (Alignment)
Align Uniform Combined AA2AR ADRB1 ADRB2 CXCR4 DRD3
GNN (Unimodal CL) 0.574 0.546 0.028 0.317 0.324 0.324 0.233 0.388
TriCL 0.602 0.316 0.286 0.299 0.368 0.384 0.381 0.458
Table A4: Case study on GPCR-binding compounds. The alignment metric is the average cosine similarity between all active compounds targeting GPCRs or the same GPCR (higher is better). The uniformity metric is the average cosine similarity between GPCR-targeting compounds and others (close to 0 is better). The combined metric is alignment minus uniformity (higher is better).
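The metrics in Table A4 can be illustrated with a minimal sketch. It assumes embeddings are plain lists of floats and reads the combined metric as alignment minus uniformity, which is consistent with the values reported in Table A4. The function names and toy vectors are hypothetical and are not taken from the TriCL code.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment(group):
    """Average cosine similarity over all distinct pairs within one group."""
    pairs = list(combinations(group, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def uniformity(group, others):
    """Average cosine similarity between group members and unrelated embeddings."""
    sims = [cosine(u, v) for u in group for v in others]
    return sum(sims) / len(sims)

def combined(group, others):
    """Alignment minus uniformity (assumed reading of the combined metric)."""
    return alignment(group) - uniformity(group, others)

# Toy embeddings (hypothetical): actives cluster near the x-axis, others near the y-axis.
actives = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
others = [[0.0, 1.0], [-0.1, 1.0]]
print(alignment(actives), uniformity(actives, others), combined(actives, others))
```

A well-separated embedding space gives a high alignment within the group and a uniformity value close to 0, so the combined value approaches the alignment itself.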

As shown in Table A4, TriCL maps GPCR-binding active compounds close together in the embedding space (high alignment) while discriminating them from other compounds (low uniformity). This implies that TriCL indeed captures additional information from the auxiliary modalities, and that this information helps discriminate molecular targets in specific real-world problems. Furthermore, TriCL also performed better at aligning compounds and decoys that target the same GPCR instance.

B.3 Active compounds and Decoys Examples

Here, we provide several examples of active compounds and their corresponding decoys. Decoy molecules are structurally similar to the corresponding active compound, but their 2D graph representations differ greatly. TriCL can recognize structural similarities between active compounds and decoys because it contrasts them simultaneously with the auxiliary modalities.

Figure A1: Selected active and decoy compounds in the DUD-E GPCR subset. The labels below each structure are the protonation codes provided in the DUD-E dataset.

Appendix C Dataset and Baseline Models Overview

C.1 Pre-training Dataset Overview


Geometric Ensemble Of Molecules (GEOM) is a dataset of high-quality conformers for 317,928 mid-sized organic molecules with experimental data geom_appendix. Conformers in GEOM are generated with the CREST program CREST, which generates reliable and accurate structures through extensive sampling based on the semi-empirical extended tight-binding method (GFN2-xTB) GFN2-xTB. CC BY 4.0. 50k molecules and their corresponding conformers were selected and used for pre-training TriCL. Results using the full data (200k) can be found in Table A1.

C.2 Fine-tuning Dataset Overview

For downstream fine-tuning tasks, we used MoleculeNet moleculenet_appendix. All data were split by scaffold. Most datasets have no license specification; we mark them as MIT License based on deepchem deepchem.

BBBP is a binary classification task that aims to predict whether a drug penetrates the blood-brain barrier (BBB), a membrane separating circulating blood from brain extracellular fluid. Since drugs that penetrate the BBB might directly affect the central nervous system, BBBP is a crucial challenge in drug development BBBP_sub. This dataset curated by BBBP contains 1,975 drugs. MIT License.

Tox21 is a multitask classification dataset that was curated from the "Toxicology in the 21st Century" initiative and has been used in the 2014 Tox21 Data Challenge Tox21. Tox21 comprises qualitative toxicity measurements of 7,831 compounds on 12 different targets including nuclear receptors and stress response pathways. Nearly 6,000 drugs are annotated for each target. MIT License.

ToxCast is a data collection similar to Tox21, but a much more extensive toxicity dataset that includes qualitative results of 617 in vitro high-throughput screening assays on 8,575 compounds ToxCast. Depending on the specific task, hundreds to thousands of drugs are labeled. MIT License.

ClinTox is a task that discriminates FDA-approved drugs from drugs that failed at the clinical trial stage for toxicity reasons ClinTox. ClinTox is composed of two binary classification tasks dealing with: (1) clinical trial toxicity and (2) FDA approval status. A total of 1,478 compounds are curated in the ClinTox dataset. MIT License.

SIDER (Side Effect Resource) is about marketed drugs and adverse drug reactions (ADR) SIDER. The raw SIDER dataset is annotated hierarchically; we used the common version which grouped the side effects of 1,427 drugs into 27 system organ classes. CC BY 4.0.

MUV (Maximum Unbiased Validation) is a virtual screening benchmark dataset curated by using the refined nearest neighbor analysis on PubChem bioactivity data MUV. MUV dataset contains 93,087 compounds for 17 subtasks from pairs of primary high-throughput screening assays and confirmatory dose-response experiments. MIT License.


HIV was annotated by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen HIV_appendix. The raw dataset classified drugs into three categories: confirmed inactive (CI), confirmed active (CA), and confirmed moderately active (CM). This data was re-curated into a binary classification between inactive (CI) and active (CA and CM). A total of 41,127 compounds are contained. CC BY 4.0.


BACE deals with qualitative binding results for human beta-secretase 1 (BACE-1) BACE. A raw BACE dataset containing the quantitative IC50 of 1,513 compounds reported in the scientific literature was binarized to build a classification task. MIT License.

C.3 Baseline Models Overview

Graph-only SSL Baseline

EdgePred EdgePred_appendix masked node/edge attributes, then pre-trained a GNN to obtain the corresponding node/edge embeddings. InfoGraph InfoGraph_appendix adopted a student-teacher model by maximizing the mutual information between the graph-level representation of InfoGraph and substructure representations of existing supervised methods. GPT-GNN GPT-GNN_appendix introduced self-supervised attribute and edge generation on attributed graphs so that the model learns to capture the inherent dependencies between node attributes and graph structure. ContextPred ContextPred_appendix contrasts node features with subgraph structures by pre-training a GNN to map nodes in similar structural contexts to nearby embeddings. GraphLoG GraphLoG_appendix learned global semantic structure by introducing hierarchical prototypes and maximizing the data likelihood with respect to GNN parameters via an online expectation-maximization (EM) algorithm.

GraphCL GraphCL_appendix introduced four types of augmentation - node drop, edge perturbation, attribute masking, and subgraph sampling - and performed task-wise analysis. JOAO JOAO_appendix adopted a plug-and-play framework for optimizing a joint augmentation method over four augmentation strategies.

Structure-aware SSL Baseline

GraphMVP graphmvp_appendix incorporated 3D structural view into GNN by maximizing mutual information between 3D graph representation and 2D graph representation using both contrastive and generative SSL methods.

Appendix D TriCL Implementation Details

In this section, we explain the implementation details of TriCL.

D.1 TriCL framework overview

As briefly described in Section 4.2, TriCL is devised to pre-train a main encoder while contrasting it with two auxiliary modalities, aux1 and aux2. Before specifying the architectural details of the submodules in TriCL, we first highlight its key properties:

  1. Modularity. The main advantage of TriCL is its modularity: three networks can be designed independently and plugged into the TriCL framework. Triangular Area Loss then forces the main encoder to learn similarities and differences between inputs while borrowing diverse viewpoints from the auxiliary modalities.

  2. Universality. TriCL is suitable for any type of trimodal pre-training task, provided that appropriate encoders are devised to extract useful properties from inputs in diverse formats. For any data format that comprises three distinct components, TriCL provides a universal method for pre-training one main encoder that represents the input object.

  3. Unity. TriCL learns the relationships among three modalities simultaneously rather than learning pairwise relationships between two modalities. Triangular Area Loss is fundamentally designed to consider the geometry formed by three representations at once, using the area of the triangle. This property encourages the model to learn a more balanced embedding space shaped by all three networks.

In the following sections, details for implementing TriCL showing three key properties are described.
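To make the geometric quantity behind the Unity property concrete, the sketch below computes the area of the triangle whose vertices are three embedding vectors, using the standard Gram-determinant identity. It illustrates only the area itself; how positive and negative triplet areas are contrasted in Triangular Area Loss is defined in Section 4.2 and is not reproduced here. The function name is hypothetical.

```python
import math

def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c in R^d.

    Uses area = 0.5 * sqrt(|u|^2 |v|^2 - (u . v)^2) with u = b - a, v = c - a.
    """
    u = [bi - ai for ai, bi in zip(a, b)]
    v = [ci - ai for ai, ci in zip(a, c)]
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    uv = sum(x * y for x, y in zip(u, v))
    # Clamp tiny negative values caused by floating-point rounding.
    return 0.5 * math.sqrt(max(uu * vv - uv * uv, 0.0))

# Three embeddings of the same sample that coincide or lie on one line span zero area.
print(triangle_area([0.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
print(triangle_area([0.0, 0.0], [1.0, 1.0], [2.0, 2.0]))
```

Because the area depends on all three vertices at once, it cannot be reduced to a single pairwise distance, which is the sense in which the three modalities are considered simultaneously.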

D.2 TriCL Algorithm

The formal definition of TriCL is provided in Algorithm 1. For detailed explanations, including the definition of the NT-Xent loss, please refer to Section 4.2.


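  input: batch size, encoder models (main, aux1, aux2), molecule representations (main, aux1, aux2)
  for each sampled minibatch do
     for each molecule in the minibatch do
        for each modality do
           generate augmented inputs from the molecule representation of that modality
        end for
     end for
     for each molecule in the minibatch do
        encode the augmented inputs and accumulate the intramodal and intermodal loss terms (Section 4.2)
     end for
     update parameters of each model to minimize the total loss
  end for
  return the main encoder
Algorithm 1 TriCL main learning algorithm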
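As a companion to Algorithm 1, the following is a minimal sketch of the standard NT-Xent loss (as in Chen et al. [2020]) for two batches of views, written in plain Python for readability. The temperature value and the function names are assumptions made for illustration; the settings actually used in TriCL are given in Section 4.2.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nt_xent(views_a, views_b, temperature=0.1):
    """Mean NT-Xent loss over 2N anchors.

    views_a[k] and views_b[k] are two views of sample k (a positive pair);
    every other view in the batch acts as a negative.
    The temperature default is an assumed value, not taken from the paper.
    """
    z = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of anchor i
        positive = cosine(z[i], z[j]) / temperature
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        total += math.log(sum(math.exp(l) for l in logits)) - positive
    return total / (2 * n)

# Toy batch (hypothetical): matching views give a lower loss than swapped views.
a = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.05], [0.05, 1.0]]
swapped = [[0.05, 1.0], [1.0, 0.05]]
print(nt_xent(a, matched), nt_xent(a, swapped))
```

In an intramodal setting both batches of views come from the same encoder, while in an intermodal setting they come from encoders of two different modalities.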

D.3 Resources

Each pre-training run of TriCL takes about 2 hours on average on an NVIDIA GeForce RTX 3090. The pre-trained GNN is fine-tuned on an NVIDIA GeForce RTX 2080 Ti; fine-tuning on the 8 downstream tasks took about 1 hour.

D.4 Encoding Details

Main Encoder The GNN architecture was directly adopted from graphmvp_appendix and is identical to that of the other baseline models. Five GIN layers were stacked, with a hidden dimension of 300 and a dropout ratio of 0.5. The node embedding vectors at the last layer were mean-pooled.

Auxiliary Encoder 1 Transformer modules were adopted from PyTorch. Six Transformer encoder layers were connected sequentially. For self-attention, the hidden dimension was 64 and 8 heads were used for multi-head attention. For the feed-forward layer, the hidden dimension was 64. A dropout ratio of 0.1 was applied to both layers. Other details not described here are specified in Section D.5.

Auxiliary Encoder 2 The CNN architecture was directly adopted from atom3d_appendix, which uses the PyTorch Conv3d layer. Starting from 12 input channels, four 3D convolution layers were applied with kernel size 3, stride 1, and padding 0. The 3D representation tensor was flattened and reduced via one fully connected layer. The dropout ratio was 0.1 for the convolution layers and 0.25 for the fully connected layer.
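The effect of the stated convolution settings on the spatial size of the voxel grid can be worked out with the usual output-size formula. The sketch below assumes an input grid of 15 voxels per edge (a hypothetical value, chosen as 2 × 7.5 Å at 1.0 Å resolution) and ignores any pooling layers the adopted architecture may contain.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Output edge length of one convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def stack_out(size, layers=4, **conv_kwargs):
    """Edge length before and after each of `layers` identical convolutions."""
    sizes = [size]
    for _ in range(layers):
        sizes.append(conv_out(sizes[-1], **conv_kwargs))
    return sizes

GRID_EDGE = 15  # assumption: 2 * 7.5 / 1.0 voxels per edge
print(stack_out(GRID_EDGE))  # [15, 13, 11, 9, 7]
```

With kernel size 3, stride 1, and no padding, each layer removes two voxels per edge, so four layers shrink the assumed 15-voxel edge to 7 before flattening.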

D.5 Representation Details

Main Representation For the graph representation, we followed ContextPred_appendix. Node features comprise the atomic number and one of four chirality types. Edge features are composed of bond type and bond direction. This representation is identical to that of graphmvp_appendix and other baseline methods. We provide details of the node and edge features in Table A5.

Feature type Feature name Range
Node feature Atomic number [1, 119]
Chirality unspecified, tetrahedral CW, tetrahedral CCW, other
Edge feature Bond type single, double, triple, aromatic
Bond direction none, end-upright, end-downright
Table A5: Node and edge features used in TriCL.

Auxiliary Representation 1 For the string representation, we followed SELFIES_appendix to convert SMILES strings to SELFIES strings. The generated SELFIES string was then vectorized following a pre-defined tokenization rule. Atomic symbols were converted according to an atomic vocabulary of 99 common non-metal atomic tokens. These common atomic tokens were extracted from the 10M PubChem dataset pubchem. The remaining metal tokens, which are not among the 99 tokens, were represented as a single metal token [Me]. This abstraction originates from the chemical insights that most drugs do not involve metal elements and that most metal elements show similar chemical properties when compared with non-metal elements. Other tokens representing branches and rings were defined exhaustively. We provide details of the SELFIES tokenization rule in Table A6.

Components Representation # Tokens Token Number
Padding / No Operation [NOP] 1 0
Atom-masking [MASK_AT] 1 1
Branch/Ring-masking [MASK_BO] 1 2
Metals [Me] 1 3
Special Classification [CLS] 1 4
Common Atoms [#B] - [S] 99 5-103
Branches [Branch1] - [#Branch3] 9 104-112
Rings [Ring1] - [-/Ring3] 36 113-138
Table A6: SELFIES tokenization rule in TriCL. This includes augmentation tokens.
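The tokenization rule in Table A6 can be sketched as follows. The special-token IDs (0 to 4) follow Table A6; the atom, branch, and ring tokens listed here are a small hypothetical subset whose IDs do not match the real vocabulary. Prepending [CLS] and mapping every out-of-vocabulary token to [Me] are simplifying assumptions for illustration.

```python
import re

# Special tokens and their IDs as listed in Table A6.
SPECIAL = {"[NOP]": 0, "[MASK_AT]": 1, "[MASK_BO]": 2, "[Me]": 3, "[CLS]": 4}

# Hypothetical toy vocabulary: the real one has 99 atom tokens plus branch and ring tokens.
TOY_VOCAB = dict(SPECIAL)
for token in ["[C]", "[=C]", "[N]", "[O]", "[=O]", "[Branch1]", "[Ring1]"]:
    TOY_VOCAB[token] = len(TOY_VOCAB)

def tokenize(selfies, max_len):
    """Split a SELFIES string into bracketed tokens and map them to integer IDs."""
    tokens = ["[CLS]"] + re.findall(r"\[[^\]]*\]", selfies)  # [CLS] prefix is an assumption
    # Simplification: any token outside the toy vocabulary is treated as the metal token.
    ids = [TOY_VOCAB.get(t, TOY_VOCAB["[Me]"]) for t in tokens][:max_len]
    return ids + [TOY_VOCAB["[NOP]"]] * (max_len - len(ids))  # pad with [NOP]

print(tokenize("[C][C][=O][Fe]", 8))  # [4, 5, 5, 9, 3, 0, 0, 0]
```

Since every SELFIES token is a bracketed unit, a single regular expression is enough to split the string, and masking augmentations can operate directly on the resulting ID sequence.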

Auxiliary Representation 2 For the structure representation, we used RDKit to generate conformers of molecules. Ten predefined common non-metal elements were mapped to corresponding feature integers, and the remaining metal elements were abstracted to a single metal integer. After a random rotation, the structure was voxelized on a grid with a radius of 7.5 Å and a resolution of 1.0 Å. We provide details of the element feature representation in Table A7.

Integer 0 1 2 3 4 5 6 7 8 9 10 11
Element Mask H C N O F Cl Br P S B Metal
Table A7: Structure featurization in TriCL. This includes a masking integer.
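A minimal sketch of the featurization in Table A7 and of the voxel lookup is given below. The integer codes follow Table A7; the voxel indexing convention (a cube centered on a chosen point, with floor-based binning) is an assumption, the random rotation is omitted, and treating every unlisted element as the metal class is a simplification.

```python
import math

# Integer codes from Table A7 (0 is the masking integer, 11 the metal class).
MASK, METAL = 0, 11
ELEMENT_TO_INT = {"H": 1, "C": 2, "N": 3, "O": 4, "F": 5,
                  "Cl": 6, "Br": 7, "P": 8, "S": 9, "B": 10}

def featurize(symbol):
    """Map an element symbol to its integer code; unlisted elements fall into METAL."""
    return ELEMENT_TO_INT.get(symbol, METAL)

def voxel_index(coord, center, radius=7.5, resolution=1.0):
    """Return the (i, j, k) voxel of a 3D coordinate, or None if it lies outside the cube."""
    edge = int(2 * radius / resolution)  # 15 voxels per edge under these settings
    index = []
    for x, c in zip(coord, center):
        i = math.floor((x - c + radius) / resolution)
        if not 0 <= i < edge:
            return None
        index.append(i)
    return tuple(index)

print(featurize("Cl"), featurize("Fe"))
print(voxel_index((0.0, 0.0, 0.0), (0.0, 0.0, 0.0)))
```

The twelve integer codes correspond to the 12 input channels of the 3D CNN described in Section D.4.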

D.6 Augmentation Details

We describe the augmentation strategy for each representation in detail. Specifically, we adopted or implemented node drop (ND), node masking (NM), and subgraph masking (SM). Note that each input molecule is augmented twice in each representation format, generating 6 augmented representations. For mixed augmentation settings, the individual augmentation strategies were applied sequentially. The predefined augmentation ratios for ND, NM, and SM were 0.2, 0.2, and 0.05, respectively.

Main Graph Augmentation For NM and SM, we followed MolCLR_appendix. In NM augmentation, a predefined ratio of randomly chosen node and edge features was masked to the zero vector. In SM augmentation, a predefined ratio of node and edge features was also masked to the zero vector, but the nodes and edges were selected as components adjacent to a randomly selected anchor node. For node drop, we followed GraphCL_appendix. A predefined ratio of randomly chosen nodes was deleted from the graph, together with the edges connected to the deleted nodes.
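The node drop and node masking operations described above can be sketched on a plain edge-list graph. This is a simplified illustration, not the MolCLR or GraphCL implementation: function names are hypothetical, node features are lists of integers, and edge-feature masking is omitted for brevity.

```python
import random

def node_drop(num_nodes, edges, ratio=0.2, rng=random):
    """Delete a random fraction of nodes and every edge that touches a deleted node."""
    dropped = set(rng.sample(range(num_nodes), int(num_nodes * ratio)))
    kept = [v for v in range(num_nodes) if v not in dropped]
    relabel = {old: new for new, old in enumerate(kept)}  # compact node indices
    new_edges = [(relabel[u], relabel[v]) for u, v in edges
                 if u in relabel and v in relabel]
    return kept, new_edges

def node_mask(features, ratio=0.2, rng=random):
    """Replace the feature vectors of a random fraction of nodes with zero vectors."""
    masked = set(rng.sample(range(len(features)), int(len(features) * ratio)))
    return [[0] * len(f) if i in masked else list(f) for i, f in enumerate(features)]

# Toy chain graph with 10 nodes (hypothetical example).
rng = random.Random(0)
chain_edges = [(i, i + 1) for i in range(9)]
kept, dropped_edges = node_drop(10, chain_edges, ratio=0.2, rng=rng)
masked_features = node_mask([[6, 1] for _ in range(10)], ratio=0.2, rng=rng)
print(len(kept), len(dropped_edges), masked_features.count([0, 0]))
```

Node drop changes the graph topology, whereas node masking keeps the topology and hides only the features, which is why the two produce different kinds of views.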

Auxiliary String Augmentation For NM, a predefined ratio of randomly chosen atom tokens and bond tokens was masked to [MASK_AT] and [MASK_BO], respectively. For SM, a predefined ratio of tokens was masked similarly, but the tokens were selected from those adjacent to a randomly selected anchor atom token. To maintain the syntax of the SELFIES string, the whole subsequence of a branch or ring component was masked at once. When ND is applied to the graph, NM is applied to the string instead.

Auxiliary Structure Augmentation For NM, a predefined ratio of randomly chosen atoms was masked to the ‘M’ element. For SM, a predefined ratio of atoms was masked similarly, but the atoms were selected from those nearest to a randomly selected heteroatom. When ND is applied to the graph, NM is applied to the structure instead.

D.7 Optimization Details


Pre-training Parameters in the three encoders and the final readout layer were optimized with the Adam optimizer, using a weight decay of 1e-5. The learning rate was scheduled with a cosine annealing scheduler, with an initial learning rate of 0.0005 and 10 warm-up epochs. TriCL is pre-trained for 100 epochs. During 5 warm-up epochs, the intermodal loss was not backpropagated.
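The schedule described above can be sketched as a function of the epoch index. The base learning rate, number of warm-up epochs, and total epochs follow the text; the linear shape of the warm-up and the function names are assumptions, since the text does not specify them.

```python
import math

def learning_rate(epoch, base_lr=5e-4, warmup_epochs=10, total_epochs=100):
    """Learning rate at a given epoch: warm-up followed by cosine annealing."""
    if epoch < warmup_epochs:
        # Assumption: linear warm-up up to the base learning rate.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def intermodal_enabled(epoch, inter_warmup_epochs=5):
    """Whether the intermodal loss is backpropagated at this epoch."""
    return epoch >= inter_warmup_epochs

for epoch in (0, 4, 5, 9, 10, 55, 99):
    print(epoch, learning_rate(epoch), intermodal_enabled(epoch))
```

Delaying the intermodal loss lets each encoder first organize its own embedding space through the intramodal loss before the modalities are contrasted against each other.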

Fine-tuning For fine-tuning, we followed graphmvp_appendix and other baseline models for fair comparison. Parameters were optimized via Adam optimizer, under learning rate 0.001.

Appendix E Detailed Results

Here, we provide the full results of the experiments described in Section 5. Specifically, we first report the measured AUC-ROC for all three seeds on 8 MoleculeNet tasks. We then give the task-wise statistics corresponding to Table 3(a) and Table 3(b).

e.1 Full results of Table 2

The detailed results measured for TriCL are given below. The pre-trained TriCL was tested three times under three fixed independent seeds. We then reported the average and standard deviation in Table 2.

Model  Seed  BBBP   Tox21  ToxCast  SIDER  ClinTox  MUV    HIV    BACE   AVG
TriCL  0     72.81  75.39  64.23    60.99  85.80    78.04  78.57  83.10  74.64
TriCL  1     72.07  75.82  64.03    61.97  87.37    78.06  79.36  81.20  74.99
TriCL  42    72.08  75.33  63.52    63.13  83.71    77.55  78.92  83.32  74.50
Table A8: Detailed Results of Table 2.
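As a worked example of the aggregation, the snippet below averages two columns of Table A8 over the three seeds. The sample (n-1) standard deviation is an assumption, since the text does not say which estimator was used; the helper name is illustrative.

```python
from statistics import mean, stdev

# Per-seed values copied from Table A8 (seeds 0, 1, 42).
tox21 = [75.39, 75.82, 75.33]
avg = [74.64, 74.99, 74.50]

def summarize(values):
    """Return (mean, sample standard deviation) over the seeds."""
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return mean(values), stdev(values)

tox21_mean, tox21_std = summarize(tox21)
avg_mean, avg_std = summarize(avg)
print(f"Tox21: {tox21_mean:.1f}({tox21_std:.1f})")
print(f"AVG:   {avg_mean:.2f}")
```

The mean of the AVG column, (74.64 + 74.99 + 74.50) / 3, rounds to 74.71, which agrees with the AVG reported for the NT-Xent plus Triangular Area row of Table A9.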

e.2 Full results of Table 3(a)

As in the experiment above, experiments were repeated three times under three independent seeds (0, 1, and 42); we report the average AUC-ROC and its standard deviation in Table A9. We marked top 1 values with bold and underline.

Intra loss Inter loss BBBP Tox21 ToxCast SIDER ClinTox MUV HIV BACE AVG
NT-Xent - 70.6(1.9) 75.1(0.2) 63.9(0.4) 60.1(1.4) 80.4(3.0) 74.2(1.8) 77.1(1.2) 76.8(1.7) 72.29
- NT-Xent 69.7(1.8) 75.2(0.5) 63.4(0.5) 60.7(1.5) 71.3(4.3) 78.2(1.6) 78.3(4.7) 78.3(4.7) 71.88
- Triplet Margin 70.3(1.5) 74.9(0.8) 62.3(1.4) 59.3(1.0) 70.6(1.0) 73.9(2.2) 77.8(0.3) 79.6(1.8) 71.10
- Triangular Area 69.1(0.8) 75.2(0.5) 62.8(0.8) 61.9(1.2) 68.4(3.9) 76.3(0.1) 77.5(1.4) 78.5(6.8) 71.20
NT-Xent NT-Xent 71.1(0.3) 75.0(0.6) 63.6(0.5) 60.6(1.2) 81.6(4.4) 76.7(1.0) 77.4(0.6) 80.4(1.5) 73.31
NT-Xent Triplet Margin 70.9(0.6) 74.2(0.5) 64.1(0.4) 61.1(0.9) 81.4(3.1) 75.7(2.2) 77.7(0.2) 82.9(1.1) 73.51
NT-Xent Triangular Area 72.4(0.4) 75.5(0.3) 63.9(0.4) 62.0(1.0) 85.4(1.9) 77.0(0.8) 78.9(0.5) 82.5(1.2) 74.71
Table A9: Detailed Results of Table 3(a).

e.3 Full results of Table 3(b)

Experiments were repeated three times under three independent seeds (0, 1, and 42); we report the average AUC-ROC and its standard deviation in Table A10. We marked top 1 values with bold and underline.

- - 68.8(0.1) 75.9(0.1) 62.8(0.1) 61.5(1.1) 69.9(4.5) 76.8(0.8) 77.1(0.1) 78.5(1.5) 71.40
- - 70.5(0.8) 75.1(1.1) 63.4(0.7) 59.8(0.6) 77.9(5.9) 76.1(0.7) 77.3(0.8) 78.8(0.8) 72.38
- - 69.6(0.2) 75.3(0.5) 63.0(0.7) 61.1(0.9) 75.2(6.9) 75.4(0.8) 77.6(0.8) 74.3(6.9) 71.43
- 70.1(3.3) 75.0(0.2) 63.2(0.5) 62.1(1.0) 69.7(8.0) 75.4(1.1) 77.9(0.8) 76.9(1.6) 71.63
- 72.4(0.4) 75.5(0.3) 63.9(0.4) 62.0(1.0) 85.4(1.9) 77.0(0.8) 78.9(0.5) 82.5(1.2) 74.71
70.5(2.0) 74.3(0.8) 63.2(0.6) 60.0(0.8) 79.6(0.6) 75.4(1.1) 75.8(1.1) 78.2(1.8) 72.13
Table A10: Detailed Results of Table 3(b).

Appendix F Limitations, Future Directions and Broader Impacts

Lastly, we describe the limitations of TriCL and introduce future research topics that can directly utilize or be inspired by TriCL. We then close by discussing possible societal impacts.

f.1 Limitations and Future Directions

  1. Theoretical Considerations As discussed in Section 6, our arguments rely primarily on empirical results and observations. A mathematically rigorous discussion of joint embedding spaces and their optimization would be an important future challenge.

  2. Application on Other Tasks Also discussed in Section 6. Although TriCL is a universal framework that can be utilized in any type of trimodal system, this paper assessed the performance of TriCL only on molecular graph representation learning tasks. We believe TriCL can be applied to other computer science tasks such as multilingual representation learning in the natural language processing field pan2021contrastive, wei2020learning or video representation learning in the computer vision field sermanet2018time, xu2021videoclip. Moreover, we believe TriCL could also be utilized in applied fields such as biological representation learning through the DNA-RNA-Protein (genome-transcriptome-proteome) central dogma. The wide application of TriCL in various fields would be intriguing future work.

  3. Higher-modal Contrastive Learning TriCL performs well in trimodal contrastive learning tasks by simultaneously contrasting triplet representations using Triangular Area Loss. Just as we devised a new loss function suited to trimodal tasks, which differs from pairwise contrastive losses, new forms of loss would be required for effective higher-modal contrastive learning. For example, contrasting the volumes of tetrahedrons would be a good starting point for a tetramodal system.

  4. Generative Tasks In molecular representation learning, molecular generation is an important research topic. Although this paper discussed embedding spaces comprehensively, molecular generation and optimization were outside its scope. Given the discriminated nature of each modality, generating molecules with desired structural or sequential properties from the GNN embedding space would be an interesting research direction.
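To make the tetrahedron suggestion in item 3 concrete, the sketch below computes the area of a triangle and the volume of a tetrahedron whose vertices are embedding vectors of any dimension, using the Gram determinant of the edge vectors. It only illustrates the geometric quantities; it is not the Triangular Area Loss itself, and the function names are illustrative.

```python
import math

def _sub(p, q):
    return [pi - qi for pi, qi in zip(p, q)]

def _dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def triangle_area(a, b, c):
    """Area of triangle abc in R^d: 0.5 * sqrt(det of the 2x2 Gram matrix)."""
    u, v = _sub(b, a), _sub(c, a)
    gram_det = _dot(u, u) * _dot(v, v) - _dot(u, v) ** 2
    return 0.5 * math.sqrt(max(gram_det, 0.0))  # clamp tiny negative round-off

def tetrahedron_volume(a, b, c, d):
    """Volume of tetrahedron abcd in R^d: sqrt(det of the 3x3 Gram matrix) / 6."""
    e = [_sub(b, a), _sub(c, a), _sub(d, a)]
    g = [[_dot(ei, ej) for ej in e] for ei in e]
    det = (g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
           - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
           + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0]))
    return math.sqrt(max(det, 0.0)) / 6.0

# Four toy "embeddings" in R^4: the origin and three unit axes.
o = [0.0, 0.0, 0.0, 0.0]
x = [1.0, 0.0, 0.0, 0.0]
y = [0.0, 1.0, 0.0, 0.0]
z = [0.0, 0.0, 1.0, 0.0]
area = triangle_area(o, x, y)
volume = tetrahedron_volume(o, x, y, z)
```

Both quantities vanish when the vertices become linearly dependent (for example, collinear points give zero area), which is the degenerate case a volume-based contrastive term would need to distinguish from well-separated modalities.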

f.2 Broader Impacts on Society

Potential Positive Impacts

  • Understanding Cognitive Process Multimodal contrastive learning resembles the human cognitive process. Just as TriCL ‘simultaneously’ contrasts three data representations, humans collect diverse sensory information ‘as a whole’ and form a virtual cognitive space by finding relations among them. We expect that studying TriCL could improve our understanding of human cognition.

  • Drug discovery This study is not restricted to graph representation learning; we also applied pre-trained TriCL to a specific biological task in Section B. As we tested the power of the pre-trained embedding space on GPCR-binding molecules, we expect that a well-trained molecular embedding space could help develop new drugs for novel target proteins that require sophisticated structural design. Specifically, TriCL would be effective in developing drugs for rare diseases because it could rapidly narrow down drug candidates using only graph representations, without a huge investment.

Potential Negative Impacts

  • Chemical Hazards As a side effect of an improved understanding of molecules, TriCL could be used to develop harmful chemicals. Indeed, even an uncontrolled dose of a drug can act as a hazard. To prevent abuse, societal monitoring of chemical weapon development and a consistent sense of responsibility regarding the ethics of scientific knowledge and technology are required.

  • Environmental Impact Even though we reduced the number of parameters and controlled the model size as best we could, training and deploying TriCL could still slightly increase carbon emissions compared with a lighter unimodal framework. Developing a more efficient version of TriCL is among the future work we are considering.
