
Multimodal SuperCon: Classifier for Drivers of Deforestation in Indonesia

Deforestation is one of the contributing factors to climate change. Climate change has a serious impact on human life, and it occurs due to the emission of greenhouse gases, such as carbon dioxide, into the atmosphere. It is important to know the causes of deforestation for mitigation efforts, but there is a lack of data-driven research studies to predict these deforestation drivers. In this work, we propose a contrastive learning architecture, called Multimodal SuperCon, for classifying drivers of deforestation in Indonesia using satellite images obtained from Landsat 8. Multimodal SuperCon is an architecture which combines contrastive learning and multimodal fusion to handle the available deforestation dataset. Our proposed model outperforms previous work on driver classification, giving a 7% improvement in accuracy over a state-of-the-art rotation equivariant model for the same task.



I Introduction

Deforestation has become a major problem for the world at large, especially in tropical countries such as Indonesia. Since 2001, the amount of forest area in Indonesia that has been overrun by oil palm plantations has doubled, reaching 16.24 Mha in 2019 (64% industrial; 36% smallholder). This amount is more than the official estimate of 14.72 Mha [11]. At the same time, forest area has declined by 11% (9.79 Mha), of which 32% (3.09 Mha) was ultimately converted into oil palm plantations and 29% (2.85 Mha) was cleared and converted within the same year [11].

Vegetation and soils in forest ecosystems are crucial in regulating various ecosystem processes [1]. Deforestation, whether natural or human-made, can cause droughts due to a lack of transpiration in the soil. If this disaster continues and becomes large scale, it will have an adverse impact on the climate. To illustrate, the expansion of oil palm plantations in Kalimantan alone is projected to contribute 18–22% of Indonesia's 2020 carbon emissions [4]. Deforestation also has a destructive impact on the ecosystem and biodiversity, and it is becoming a force of global importance [10]. As one of the countries with the most primary forests in the world, Indonesia faces a looming threat of major deforestation. Therefore, analyzing and understanding the drivers of deforestation can help to determine areas in need of reforestation and can help prevent deforestation from becoming more widespread.

Since the advent of high-resolution satellite imaging, disaster analysis has typically used satellite imagery, which offers a rich source of information. This can be helpful in determining regions that are currently affected by a disaster, such as deforestation. Prior work has used satellite imagery and deep learning, especially convolutional neural networks (CNNs), to solve problems in remote sensing due to their strong performance. In addition, some work has also used multimodal inputs and segmentation to achieve precise predictions of drivers of deforestation in particular. ForestNet [16] implements scene data augmentation and multimodal fusion [2, 24] using CNN models. On top of that, another study [20] introduces the use of rotation-equivariant CNNs on segmentation learning to generate stable segmentation maps with U-Net. However, none of these studies explains in depth the optimal number of auxiliary inputs for the multimodal fusion or addresses the class imbalance naturally present in the dataset.

In this work, we propose a deep learning model, called Multimodal SuperCon, which uses contrastive loss and focal loss for classifying drivers of deforestation in Indonesia. The architecture is chosen due to the class imbalance present in the dataset used in [16]. Using satellite images from Landsat 8, our model consists of two stages of training, namely representation training as the first stage and classifier fine-tuning as the second stage. Our model is a modification of an established architecture known as SuperCon [5], and it allows multimodal fusion in the classifier fine-tuning stage.

II Related Work

II-A Machine Learning for Deforestation Analysis

To our knowledge, there have been two proposed deep learning models for handling the deforestation driver classification task in Indonesian forest regions. The first was the ForestNet architecture introduced by Irvin et al. [16] in 2020, and the second was a rotation equivariant CNN architecture applied by Mitton and Murray-Smith [20] in 2021.

ForestNet [16] is a semantic segmentation architecture that was created to (1) address the fact that there are often multiple land uses within a single forest loss image, and (2) implicitly utilize the information of specific loss regions. ForestNet uses high-resolution (15 m) images to predict the drivers of multiple loss regions of varying sizes. The model also incorporates a type of data augmentation named scene data augmentation (SDA), per-pixel classification, and multimodal fusion in its architecture.

Many research studies in deep learning use convolutional neural networks (CNNs) due to their powerful capabilities to process sensory data such as images and videos. A CNN model consists of multiple convolutional layers which are translation equivariant. In other words, if an input image is translated in any direction, the resulting feature map is shifted accordingly. However, standard convolutional layers are not rotation equivariant, which can be crucial in order to ensure that the model performs well in certain applications. To make the convolutional layers rotation equivariant, architectures involving group equivariant CNNs can be employed [27, 8, 7]. Rotation equivariant CNNs built using this method have been successfully applied in [20] to the same deforestation dataset used by Irvin et al. [16].
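The difference between translation and rotation equivariance can be checked numerically. The following is a minimal sketch in plain Python, not the implementation used in [20]. It assumes a single-channel image, wrap-around (circular) boundaries so that the shift identity holds exactly, and a hypothetical horizontal-difference kernel; all function names are illustrative.

```python
def circular_correlate(image, kernel):
    """Cross-correlate a square image with a square kernel, wrapping at the borders."""
    n = len(image)
    k = len(kernel)
    return [
        [
            sum(
                kernel[u][v] * image[(i + u) % n][(j + v) % n]
                for u in range(k)
                for v in range(k)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


def shift_rows(image, s):
    """Translate the image down by s rows (circularly)."""
    n = len(image)
    return [list(image[(i - s) % n]) for i in range(n)]


def rotate90(image):
    """Rotate the image by 90 degrees clockwise."""
    n = len(image)
    return [[image[n - 1 - j][i] for j in range(n)] for i in range(n)]


# Toy 4 x 4 image and an asymmetric (horizontal difference) kernel.
image = [[4 * i + j for j in range(4)] for i in range(4)]
kernel = [[1, -1], [0, 0]]

# Translation equivariance: filtering a shifted image equals shifting the filtered image.
translation_ok = circular_correlate(shift_rows(image, 1), kernel) == shift_rows(
    circular_correlate(image, kernel), 1
)

# Rotation equivariance fails for this kernel: the two results differ.
rotation_ok = circular_correlate(rotate90(image), kernel) == rotate90(
    circular_correlate(image, kernel)
)

print(translation_ok, rotation_ok)
```

Group equivariant layers address the second case by, roughly speaking, sharing weights across rotated copies of each filter, so that a rotated input produces a correspondingly transformed feature map.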

II-B Contrastive Learning

Contrastive learning is a representation learning approach that aims to shape the feature space in such a way that similar (or positive) samples are clustered together, while dissimilar (or negative) samples are pushed apart. In the past, contrastive learning was executed in a self-supervised manner, with every anchor having exactly one positive sample [6, 14, 15, 26]. This positive sample can be obtained by, for example, applying data augmentation to the anchor.

Khosla et al. [17] introduced the concept of supervised contrastive learning, which allows an anchor to have multiple positive samples. The positive samples of an anchor consist of samples that belong to the same class as the anchor. In effect, supervised contrastive learning aims to minimize the distance between intra-class samples and maximize the distance between inter-class samples in the feature space. In practice, supervised contrastive learning is done using the contrastive loss, whose formula is shown in Eq. (1).

II-C Class Imbalance in Datasets

Class imbalance occurs when the majority of data in a dataset belongs to a certain class. This imbalance may make it difficult for the model to learn from the minority classes. Several methods have been proposed in the past to deal with class imbalance. One popular solution is to either undersample the majority class [9, 12] or oversample the minority class [21, 23] so that the dataset becomes balanced.
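As an illustration of the resampling idea only (it is not the strategy adopted in this work), the sketch below performs random oversampling in plain Python: every minority class is padded with randomly repeated samples until it matches the largest class. The function name, the seed, and the toy labels are assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Repeat minority-class samples at random until all classes are equally frequent."""
    rng = random.Random(seed)
    target = max(Counter(labels).values())

    # Group the samples by class label, preserving first-seen class order.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    new_samples, new_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        new_samples.extend(items + extra)
        new_labels.extend([label] * target)
    return new_samples, new_labels


# Toy imbalanced dataset: sample ids 0..9 with three classes.
samples = list(range(10))
labels = ["plantation"] * 6 + ["grassland"] * 3 + ["other"] * 1

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Oversampling keeps every original sample but repeats minority samples, which can encourage overfitting to them; this is one motivation for the loss-based alternatives discussed next.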

Another way to ensure that the model learns from the minority classes is by modifying the loss function. Despite being the most popular classification loss, the cross-entropy loss is often not the most suitable choice for handling imbalanced datasets. Focal loss [18, 28], shown in Eq. (3), is a widely used modification of cross-entropy loss that is able to focus on hard-to-classify samples. Consequently, focal loss is a more suitable option when dealing with class imbalance. Another loss function that has been successfully utilized is the contrastive loss. A training strategy involving contrastive learning has been shown to be more effective than, or at least on par with, cross-entropy and focal loss [19, 5].

III Methodology

The model that we apply to the deforestation dataset is built upon the SuperCon architecture [5]. SuperCon is a two-stage contrastive learning architecture consisting of a representation training stage and a classifier fine-tuning stage. The resulting model is used for handling classification tasks. During the representation training stage, the ResNet encoder backbone learns from data using the contrastive loss. Afterwards, the classifier fine-tuning stage is used to train a classification layer which generates an estimated probability for each class label. SuperCon works well on datasets under settings involving class imbalance.


We propose Multimodal SuperCon, a modification of SuperCon with the addition of multimodal fusion during classifier fine-tuning. This way, the model can learn from a variety of data in order to obtain a more accurate result. In our application to the deforestation dataset, Multimodal SuperCon allows the use of auxiliary predictors, such as slope, aspect, and elevation, in addition to the satellite images. The architecture of Multimodal SuperCon is illustrated in Fig. 1. The following subsections explain the two stages of Multimodal SuperCon in more detail.

Fig. 1: The Architecture of Multimodal SuperCon

III-A Representation Training

Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a mini-batch of augmented training samples from the dataset with batch size $N$. The features of each training sample are extracted using the encoder backbone $f_\theta$. In practice, the encoder uses a pre-trained model, such as the ResNet or UNet architecture. The encoder generates a representation $r_i = f_\theta(x_i)$, which is then projected using the projection head $g$.

The projection head is a two-layer network with layer sizes 2048 and 128, respectively. It generates a lower-dimensional representation $z_i = g(r_i)$ of each training sample $x_i$. Each $z_i$ is $\ell_2$-normalized ($\|z_i\|_2 = 1$) so that $z_i$ lies on the unit hypersphere. Afterwards, contrastive loss is applied to the set $\{z_i\}_{i=1}^{N}$:

$$L_{\mathrm{con}} = \sum_{i=1}^{N} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)} \tag{1}$$

The notation $P(i)$ denotes the set of positive samples of the anchor $z_i$ (that is, samples in the mini-batch other than $z_i$ having the same class label as $z_i$). On the other hand, $A(i)$ denotes the set of samples in the mini-batch other than $z_i$.

Contrastive loss serves to contrast the training samples among themselves in order to shape the feature space according to class labels. The loss contains the temperature parameter $\tau$. The parameters $\theta$ of the encoder are then updated according to the gradient of the contrastive loss:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_{\theta} L_{\mathrm{con}} \tag{2}$$

where $t$ is the iteration index and $\eta$ is the learning rate.
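To make the supervised contrastive loss concrete, the following plain-Python sketch evaluates it on a toy mini-batch, following the form given by Khosla et al. [17]. It is an illustration under stated assumptions rather than the training code of this work: the function names, the default temperature `tau=0.1`, the summation (rather than averaging) over anchors, and the choice to skip anchors without positives are all assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length so that it lies on the unit hypersphere."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def supervised_contrastive_loss(projections, labels, tau=0.1):
    """Sum over anchors of the mean negative log-ratio over their positives.

    tau=0.1 is an assumed default, not a value taken from the paper.
    """
    z = [l2_normalize(v) for v in projections]
    n = len(z)
    total = 0.0
    for i in range(n):
        others = [a for a in range(n) if a != i]                    # all samples except the anchor
        positives = [p for p in others if labels[p] == labels[i]]   # same class as the anchor
        if not positives:
            continue  # assumption: anchors without positives contribute nothing
        log_denominator = math.log(
            sum(math.exp(dot(z[i], z[a]) / tau) for a in others)
        )
        log_ratios = [dot(z[i], z[p]) / tau - log_denominator for p in positives]
        total += -sum(log_ratios) / len(positives)
    return total


# Toy 2-D projections forming two tight clusters.
projections = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

aligned = supervised_contrastive_loss(projections, [0, 0, 1, 1])  # labels match the clusters
mixed = supervised_contrastive_loss(projections, [0, 1, 0, 1])    # labels cut across the clusters
print(aligned < mixed)
```

The loss is small when samples of the same class already sit close together on the hypersphere and large when class labels cut across the clusters, which is the pressure that shapes the feature space during representation training.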

III-B Classifier Fine-Tuning

In the fine-tuning stage, the encoder backbone $f_\theta$ is frozen so that training is focused on the classifier $h_\phi$. Each training sample $x_i$ is fed through the trained encoder to generate $r_i = f_\theta(x_i)$. Afterwards, $r_i$ is concatenated with the extracted features of the auxiliary variables to generate $\tilde{r}_i$. This concatenation is the process in which multimodal fusion occurs.
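The fusion step itself amounts to a concatenation of feature vectors. The sketch below illustrates this in plain Python. How the auxiliary variables are turned into features is not specified here, so the mean-value extractor `summarize_raster`, the function `fuse`, and all toy values are hypothetical stand-ins.

```python
def summarize_raster(raster):
    """Hypothetical feature extractor: the mean value of one auxiliary raster."""
    values = [value for row in raster for value in row]
    return sum(values) / len(values)


def fuse(image_representation, auxiliary_rasters):
    """Concatenate the encoder output with one feature per auxiliary raster."""
    auxiliary_features = [summarize_raster(r) for r in auxiliary_rasters]
    return list(image_representation) + auxiliary_features


# Toy values for illustration only.
image_representation = [0.2, -0.1, 0.7]        # stands in for the frozen encoder output
slope = [[1.0, 3.0], [5.0, 7.0]]               # hypothetical 2 x 2 auxiliary rasters
altitude = [[100.0, 120.0], [140.0, 160.0]]

fused = fuse(image_representation, [slope, altitude])
print(len(fused))  # image features plus one feature per auxiliary variable
```

Because the encoder is frozen, adding or removing an auxiliary variable only changes the input width of the classifier, which makes it inexpensive to compare different numbers of auxiliary inputs.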

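The MLP classifier $h_\phi$ takes the fused representation $\tilde{r}_i$ as input to generate an estimated probability for every class. Unlike regular SuperCon, $h_\phi$ consists of multiple layers instead of just one in order to handle the multimodal fusion. Let $p_i$ be the estimated probability of the ground-truth class label $y_i$ of $x_i$. The classification loss of the model is calculated using the focal loss:

$$L_{\mathrm{FL}} = -\sum_{i=1}^{N} \alpha_{y_i} (1 - p_i)^{\gamma} \log p_i \tag{3}$$

With $\alpha_{y_i} = 1$ and $\gamma = 0$, the loss reduces to $-\sum_{i=1}^{N} \log p_i$, which is simply the standard cross-entropy loss. Therefore, focal loss is a generalization of cross-entropy loss.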
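The relationship between focal loss and cross-entropy can be verified with a few lines of plain Python. This is a sketch following the focal loss of Lin et al. [18], with per-class balancing weights and a sum over samples assumed; the function names and toy probabilities are illustrative.

```python
import math


def focal_loss(probabilities, targets, alpha, gamma):
    """Sum of -alpha[y] * (1 - p_y)^gamma * log(p_y) over the samples."""
    total = 0.0
    for probs, y in zip(probabilities, targets):
        p = probs[y]  # estimated probability of the ground-truth class
        total += -alpha[y] * (1.0 - p) ** gamma * math.log(p)
    return total


def cross_entropy(probabilities, targets):
    """Standard cross-entropy, summed over the samples."""
    return -sum(math.log(probs[y]) for probs, y in zip(probabilities, targets))


# Two toy predictions over four classes: one easy sample and one hard sample.
probabilities = [
    [0.90, 0.05, 0.03, 0.02],  # confident and correct (easy)
    [0.30, 0.40, 0.20, 0.10],  # ground-truth class gets only 0.30 (hard)
]
targets = [0, 0]
uniform_alpha = [1.0, 1.0, 1.0, 1.0]

ce = cross_entropy(probabilities, targets)
fl_gamma0 = focal_loss(probabilities, targets, uniform_alpha, gamma=0.0)
fl_gamma2 = focal_loss(probabilities, targets, uniform_alpha, gamma=2.0)
print(abs(ce - fl_gamma0) < 1e-12, fl_gamma2 < ce)
```

With a focusing parameter of 2, the easy sample is scaled by $(1 - 0.9)^2 = 0.01$ while the hard sample is scaled by $(1 - 0.3)^2 = 0.49$, so hard-to-classify samples dominate the loss.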

Focal loss contains two parameters: the balancing parameter $\alpha_{y_i}$ of the ground-truth label and the focusing parameter $\gamma$. Since the encoder is frozen, only the parameters $\phi$ of the classifier are updated according to the gradient of the focal loss:

$$\phi^{(t+1)} = \phi^{(t)} - \eta \, \nabla_{\phi} L_{\mathrm{FL}} \tag{4}$$

where $t$ is the iteration index and $\eta$ is the learning rate.

IV Experiments

IV-A Experimental Datasets

We use the public Landsat 8 image dataset on deforestation in Indonesia to train and test our model. This dataset is the same as the dataset used in [16, 20]. The driver annotations and coordinates for forest loss events were curated by Austin et al. [3]. The dataset consists of 1616 samples for training, 473 samples for validation, and 668 samples for testing. The forest loss images were obtained from various regions no more than five years after deforestation occurred in the area. Each image is classified into one of four classes: grassland/shrubland, plantation, smallholder agriculture, or other. The classes represent the drivers of deforestation of the forest loss region shown in a satellite image. As shown in Fig. 2, the data distribution for the four classes is imbalanced, with most of the data belonging to the plantation class.

Fig. 2: The Distribution of The ForestNet Dataset by Class

Fig. 3: The Distribution of The ForestNet Dataset Based on Year of Occurrence of The Forest Loss Event

The original dataset was collected from forest loss events at 30 m resolution from 2001 to 2016 on various Indonesian islands. Fig. 3 shows the distribution of data based on when the forest loss event occurred. Due to the limited availability of Landsat 8 imagery, which is used for constructing composite images as input data, we only use data obtained in the last five years. Each forest loss region is represented as a polygon indicating the forest loss event within a year. The composite images, on the other hand, are represented as RGB images of 332 × 332 pixels that capture the forest loss. We also use auxiliary predictors in addition to the images so that we can apply multimodal fusion to the classification task. The auxiliary predictors used in our experiment are slope, altitude, aspect, and gain. Fig. 4 provides an example of the data used in this research.

Fig. 4: An Illustration of a Landsat 8 Imaging Datum Along With Contour Plots of Its Auxiliary Variables

IV-B Experimental Settings

We use the PyTorch framework to build our Multimodal SuperCon model. We implement several data augmentation techniques, including horizontal flip, rotation, and elastic transform, in order to increase the robustness of the model. The model is trained on an NVIDIA DGX A100 with 64 GB memory for roughly 45 minutes. In addition, we use the Adam optimizer with a fixed learning rate during training. We set aside 15 epochs for the representation training of the encoder backbone and 10 epochs for the classifier fine-tuning. The batch size for both stages is set to 16.

We use EfficientNet-B2 [25] as the encoder backbone for the Multimodal SuperCon architecture. For comparison purposes, we also utilize the ResNet18 [13] and UNet [22] pre-trained models as substitute encoders. The temperature $\tau$ of the contrastive loss and the parameters $\alpha$ and $\gamma$ of the focal loss are set to fixed values. The architecture is trained three times: using only one auxiliary variable (slope), two auxiliary variables (slope and altitude), and all four auxiliary variables (slope, altitude, aspect, and gain), respectively. We compare the performance of our model to the CNN models used in [20] as baselines.

IV-C Results and Discussion

From Tables I and II, the experimental results show that the proposed model (Multimodal SuperCon using EfficientNet-B2 as the encoder and an MLP as the classifier) gave the best performance. Compared to the other models (ResNet18 as the encoder with MLP; vanilla U-Net as the encoder with MLP), EfficientNet-B2 excels during both the representation training stage and the classifier fine-tuning stage. The proposed model gave an accuracy of 66% when using one auxiliary variable (slope) and when using two auxiliary variables (slope and altitude). An accuracy of 70% was obtained when using all four auxiliary variables during classifier training. The proposed model also gave a lower loss on representation learning, namely 0.49 using one auxiliary variable, 0.52 using two auxiliary variables, and 0.48 using four auxiliary variables. These differences indicate that the use of an appropriate pre-trained model is crucial to obtain better performance. EfficientNet-B2 outperforms the other encoders not only in terms of accuracy and loss, but also in terms of computation time.

Regarding the optimal number of auxiliary variables, the best model gave an accuracy of 70% and a loss of 0.48 when using all four auxiliary variables. This indicates that adding several relevant inputs gives the model more information and hence benefits training. From Table III, our proposed model also outperforms the baseline model for the same task, namely the rotation equivariant CNN architecture, which yields an accuracy of only 63% [20].

Evaluation (Loss)
Model              | 1 Aux | 2 Aux | 4 Aux
ResNet18 + MLP     |       |       |
EfficientNet + MLP | 0.49  | 0.52  | 0.48
UNet + MLP         |       |       |
TABLE I: The Performance of Multimodal SuperCon on Representation Learning

Evaluation (Accuracy)
Model              | 1 Aux | 2 Aux | 4 Aux
ResNet18 + MLP     |       |       |
EfficientNet + MLP | 0.66  | 0.66  | 0.70
UNet + MLP         |       |       |
TABLE II: The Performance of Multimodal SuperCon on Classifier Learning

Model                       | Test accuracy
UNet - CNN [20]             |
UNet - C8 Equivariant [20]  |
Multimodal SuperCon (1 Aux) |
Multimodal SuperCon (2 Aux) |
Multimodal SuperCon (4 Aux) | 0.70
TABLE III: A Comparison Between the Performance of Multimodal SuperCon and Baseline CNN Models

V Conclusion and Future Work

In this paper, we introduced Multimodal SuperCon, a two-stage training architecture involving multimodal fusion to handle the class imbalance present within the deforestation dataset used in [16, 20]. We applied Multimodal SuperCon to the public Landsat 8 imaging dataset consisting of data on forest loss regions in Indonesia. The EfficientNet-B2 + MLP model gave superior performance compared to the other encoder architectures, and the resulting test accuracy of 70% was higher than the 63% obtained by the rotation equivariant CNN model in [20]. The results also indicate that multimodal fusion benefits from additional auxiliary predictors: using all four raised the accuracy from 66% to 70%.

In the future, Multimodal SuperCon can be applied not only to deforestation datasets but to any domain requiring some form of multimodal fusion. The model can also be modified for other uses in both deep learning and remote sensing. EfficientNet is a strong candidate for the encoder backbone, since in our experiments it performed better than the other pre-trained models, ResNet18 and UNet.

VI Acknowledgment

We thank the Tokopedia-UI AI Center of Excellence, Faculty of Computer Science, University of Indonesia, for providing access to the NVIDIA DGX-A100 used to run the experiments.


  • [1] A. Arneth, F. Denton, F. Agus, A. Elbehri, K. H. Erb, B. O. Elasha, M. Rahimi, M. Rounsevell, A. Spence, R. Valentini, et al. (2019) Framing and context. In Climate Change and Land: an IPCC special report on climate change, desertification, land degradation, sustainable land management, food security, and greenhouse gas fluxes in terrestrial ecosystems, pp. 1–98.
  • [2] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli (2010) Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16 (6), pp. 345–379.
  • [3] K. G. Austin, A. Schwantes, Y. Gu, and P. S. Kasibhatla (2019) What causes deforestation in Indonesia?. Environmental Research Letters 14 (2), pp. 024007.
  • [4] K. M. Carlson, L. M. Curran, G. P. Asner, A. M. Pittman, S. N. Trigg, and J. Marion Adeney (2013) Carbon emissions from forest conversion by Kalimantan oil palm plantations. Nature Climate Change 3 (3), pp. 283–287.
  • [5] K. Chen, D. Zhuang, and J. M. Chang (2022) SuperCon: supervised contrastive learning for imbalanced skin lesion classification. arXiv.
  • [6] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607.
  • [7] T. S. Cohen and M. Welling (2016) Steerable CNNs. arXiv preprint arXiv:1612.08498.
  • [8] T. Cohen and M. Welling (2016) Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999.
  • [9] C. Drummond, R. C. Holte, et al. (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II, Vol. 11, pp. 1–8.
  • [10] J. Foley, R. Defries, G. Asner, C. Barford, G. Bonan, S. Carpenter, F. S. Chapin III, M. Coe, G. Daily, H. Gibbs, J. Helkowski, T. Holloway, E. Howard, C. Kucharik, C. Monfreda, J. Patz, I. Prentice, N. Ramankutty, and P. Snyder (2005) Global consequences of land use. Science 309, pp. 570–574.
  • [11] D. L. A. Gaveau, B. Locatelli, M. A. Salim, Husnayaen, T. Manurung, A. Descals, A. Angelsen, E. Meijaard, and D. Sheil (2022) Slowing deforestation in Indonesia follows declining oil palm expansion and lower oil prices. PLOS ONE, e0266178.
  • [12] H. He and E. A. Garcia (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21 (9), pp. 1263–1284.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
  • [14] O. Henaff (2020) Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182–4192.
  • [15] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
  • [16] J. Irvin, H. Sheng, N. Ramachandran, S. Johnson-Yu, S. Zhou, K. Story, R. Rustowicz, C. Elsworth, K. Austin, and A. Y. Ng (2020) ForestNet: classifying drivers of deforestation in Indonesia using deep learning on satellite imagery. CoRR abs/2011.05479.
  • [17] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. CoRR abs/2004.11362.
  • [18] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. CoRR abs/1708.02002.
  • [19] Y. Marrakchi, O. Makansi, and T. Brox (2021) Fighting class imbalance with contrastive learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 466–476.
  • [20] J. Mitton and R. Murray-Smith (2021) Rotation equivariant deforestation segmentation and driver classification. CoRR abs/2110.13097.
  • [21] J. Peng, X. Bu, M. Sun, Z. Zhang, T. Tan, and J. Yan (2020) Large-scale object detection in the wild from imbalanced multi-labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9709–9718.
  • [22] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
  • [23] L. Shen, Z. Lin, and Q. Huang (2016) Relay backpropagation for effective learning of deep convolutional neural networks. In European conference on computer vision, pp. 467–482.
  • [24] H. Sheng, X. Chen, J. Su, R. Rajagopal, and A. Ng (2020) Effective data fusion with generalized vegetation index: evidence from land cover segmentation in agriculture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 60–61.
  • [25] M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114.
  • [26] Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive multiview coding. In European conference on computer vision, pp. 776–794.
  • [27] M. Weiler and G. Cesa (2019) General E(2)-equivariant steerable CNNs. CoRR abs/1911.08251.
  • [28] M. Yeung, E. Sala, C. Schönlieb, and L. Rundo (2022) Unified focal loss: generalising dice and cross entropy-based losses to handle class imbalanced medical image segmentation. Computerized Medical Imaging and Graphics 95, pp. 102026.