Deforestation has become a major problem for the world at large, especially in tropical countries such as Indonesia. Since 2001, the area of oil palm plantations in Indonesia has doubled, reaching 16.24 Mha in 2019 (64% industrial; 36% smallholder), exceeding the official estimate of 14.72 Mha. Over the same period, forest area declined by 11% (9.79 Mha); 32% (3.09 Mha) of this loss was ultimately converted into oil palm plantations, and 29% (2.85 Mha) was both cleared and converted within the same year.
Vegetation and soils in forest ecosystems are crucial in regulating various ecosystem processes. Deforestation, whether natural or human-made, can cause droughts due to the loss of transpiration. If this continues on a large scale, it will have an adverse impact on the climate. To illustrate, the expansion of oil palm plantations in Kalimantan alone is projected to contribute 18–22% of Indonesia's 2020 carbon emissions. Deforestation also has a destructive impact on ecosystems and biodiversity, and it is becoming a force of global importance. As one of the countries with the most primary forest in the world, Indonesia faces a looming threat of major deforestation. Therefore, analyzing and understanding the drivers of deforestation can help determine areas in need of reforestation and prevent deforestation from becoming more widespread.
Since the advent of high-resolution satellite imaging, disaster analysis has typically used satellite imagery, which offers a rich source of information. This can be helpful in determining regions that are currently affected by a disaster such as deforestation. Prior work has used satellite imagery and deep learning, especially CNNs, to solve problems in remote sensing due to their strong performance. In addition, some studies have used multimodal inputs and segmentation to achieve precise predictions of drivers of deforestation in particular. ForestNet implements scene data augmentation and multimodal fusion [2, 24] using CNN models. On top of that, another study introduces rotation-equivariant CNNs for segmentation learning to generate stable segmentation maps with U-Net. However, none of these studies explains in depth the optimal number of auxiliary inputs for multimodal fusion or the class imbalance naturally present in the dataset.
In this work, we propose a deep learning model, called Multimodal SuperCon, which uses contrastive loss and focal loss to classify drivers of deforestation in Indonesia. The architecture is chosen due to the class imbalance present in the dataset. Using satellite images from Landsat 8, our model consists of two stages of training, namely representation training as the first stage and classifier fine-tuning as the second stage. Our model is a modification of an established architecture known as SuperCon, and it allows multimodal fusion in the classifier fine-tuning stage.
II Related Work
II-A Machine Learning for Deforestation Analysis
To our knowledge, two deep learning models have been proposed for handling the deforestation driver classification task in Indonesian forest regions. The first was the ForestNet architecture introduced by Irvin et al. in 2020, and the second was a rotation equivariant CNN architecture applied by Mitton and Murray-Smith in 2021.
ForestNet is a semantic segmentation architecture created to (1) address the fact that there are often multiple land uses within a single forest loss image, and (2) implicitly utilize the information of specific loss regions. ForestNet makes predictions from high-resolution (15 m) images to identify the drivers of multiple loss regions with varying sizes. The model also incorporates a type of data augmentation named scene data augmentation (SDA), per-pixel classification, and multimodal fusion in its architecture.
Many research studies in deep learning use convolutional neural networks (CNNs) due to their powerful capabilities in processing sensory data such as images and videos. A CNN model consists of multiple convolutional layers, which are translation equivariant. In other words, if an input image is translated in any direction, the resulting feature map is shifted accordingly. However, standard convolutional layers are not rotation equivariant, which can be crucial to ensure that the model performs well in certain applications. To make the convolutional layers rotation equivariant, architectures involving group equivariant CNNs can be employed [27, 8, 7]. Rotation equivariant CNNs built using this method have been successfully applied to the same deforestation dataset used by Irvin et al.
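The translation/rotation distinction above can be checked numerically. The sketch below uses circular 2-D convolution via the FFT as a stand-in for a convolutional layer (with periodic padding), which makes translation equivariance exact; the random kernel is a hypothetical placeholder, not a trained filter.

```python
import numpy as np

rng = np.random.default_rng(42)
img = rng.standard_normal((8, 8))
kernel = rng.standard_normal((8, 8))

def circ_conv(x, k):
    # Circular 2-D convolution via the FFT; a stand-in for a conv layer
    # with periodic padding.
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k)))

# Translation equivariance: convolving a shifted image equals shifting the output.
shifted_in = circ_conv(np.roll(img, (2, 3), axis=(0, 1)), kernel)
shifted_out = np.roll(circ_conv(img, kernel), (2, 3), axis=(0, 1))
assert np.allclose(shifted_in, shifted_out)

# No rotation equivariance: rotating the input does not rotate the output
# for a generic (non-symmetric) kernel.
rot_in = circ_conv(np.rot90(img), kernel)
rot_out = np.rot90(circ_conv(img, kernel))
assert not np.allclose(rot_in, rot_out)
```

Group equivariant CNNs restore the rotation case by constraining (or sharing) the filters across rotated copies, which is what the architectures in [27, 8, 7] formalize.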
II-B Contrastive Learning
Contrastive learning is a representation learning approach that aims to shape the feature space in such a way that similar (or positive) samples are clustered together, while dissimilar (or negative) samples are pushed apart. In the past, contrastive learning was executed in a self-supervised manner, with every anchor having exactly one positive sample [6, 14, 15, 26]. This positive sample can be obtained by, for example, applying data augmentation to the anchor.
Khosla et al. introduced the concept of supervised contrastive learning, which allows an anchor to have multiple positive samples. The positive samples of an anchor consist of samples that belong to the same class as the anchor. In effect, supervised contrastive learning aims to minimize the distance between intra-class samples and maximize the distance between inter-class samples in the feature space. In practice, supervised contrastive learning is done using the contrastive loss, whose formula is shown in Eq. (1).
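A minimal NumPy sketch of the supervised contrastive loss can make this concrete. The toy embeddings below are hypothetical; in the actual model the loss operates on the projected, normalized features produced by the encoder and projection head.

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss over a batch of embeddings z (one row per sample)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # project onto unit hypersphere
    sim = z @ z.T / tau                               # temperature-scaled similarities
    n = len(labels)
    total = 0.0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # anchors without positives contribute nothing
        others = [a for a in range(n) if a != i]
        log_denom = np.log(np.sum(np.exp(sim[i, others])))
        total += -sum(sim[i, p] - log_denom for p in positives) / len(positives)
    return total / n

labels = np.array([0, 0, 1, 1])
# Embeddings clustered by class score a much lower loss than scattered ones:
tight = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
mixed = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
assert supcon_loss(tight, labels) < supcon_loss(mixed, labels)
```

The inequality at the end illustrates the intended geometry: pulling same-class samples together and pushing different-class samples apart drives the loss down.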
II-C Class Imbalance in Datasets
Class imbalance occurs when the majority of data in a dataset belongs to a certain class. This imbalance may cause difficulty in training the model on data from the minority classes. Several methods have been proposed to deal with class imbalance. One popular solution is to either undersample the majority class [9, 12] or oversample the minority class [21, 23] so that the dataset becomes balanced.
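The resampling idea is simple enough to sketch directly; the label array below is a hypothetical 90/10 split, not the deforestation dataset itself.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0] * 90 + [1] * 10)  # imbalanced: 90 majority vs 10 minority

# Oversample the minority class with replacement until both classes match.
minority_idx = np.where(labels == 1)[0]
extra = rng.choice(minority_idx, size=80, replace=True)
balanced_idx = np.concatenate([np.arange(len(labels)), extra])

counts = np.bincount(labels[balanced_idx])
assert counts[0] == counts[1] == 90  # dataset is now balanced
```

Undersampling works symmetrically by discarding majority-class indices instead; both trade off data volume against balance.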
Another way to ensure that the model learns from the minority classes is to modify the loss function. Despite being the most popular classification loss, the cross-entropy loss is often not the most suitable choice for handling imbalanced datasets. Focal loss [18, 28], shown in Eq. (3), is a widely used modification of the cross-entropy loss that is able to focus on hard-to-classify samples. Consequently, focal loss is a more suitable option when dealing with class imbalance. Another loss function that has been successfully utilized is the contrastive loss. A training strategy involving contrastive learning has been shown to be more effective than, or at least on par with, cross-entropy and focal loss [19, 5].
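The "focus on hard samples" behavior of focal loss follows directly from its formula and can be verified numerically; the probabilities 0.9 and 0.1 below are illustrative values, not results from the paper.

```python
import numpy as np

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """Focal loss for the estimated probability p_t of the ground-truth class."""
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy sample (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.1),
# so the hard/easy loss ratio grows compared to plain cross-entropy:
assert focal_loss(0.1) / focal_loss(0.9) > (-np.log(0.1)) / (-np.log(0.9))

# With gamma = 0 and alpha = 1, focal loss reduces to the cross-entropy loss:
assert np.isclose(focal_loss(0.3, alpha=1.0, gamma=0.0), -np.log(0.3))
```

The second assertion is the reduction to cross-entropy discussed in Section III-B; the first shows why minority-class (typically harder) samples dominate the gradient under focal loss.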
The model that we apply to the deforestation dataset is built upon the SuperCon architecture. SuperCon is a two-stage contrastive learning architecture consisting of a representation training stage and a classifier fine-tuning stage. The resulting model is used for handling classification tasks. During the representation training stage, the ResNet encoder backbone learns from the data using the contrastive loss. Afterwards, the classifier fine-tuning stage trains a classification layer which generates an estimated probability for each class label. SuperCon works well on datasets under settings involving class imbalance.
We propose Multimodal SuperCon, a modification of SuperCon with the addition of multimodal fusion during classifier fine-tuning. This way, the model can learn from a variety of data in order to obtain a more accurate result. In our application to the deforestation dataset, Multimodal SuperCon allows the use of auxiliary predictors, such as slope, aspect, and elevation, in addition to the satellite images. The architecture of Multimodal SuperCon is illustrated in Fig. 1. The following subsections explain the two stages of Multimodal SuperCon in more detail.
III-A Representation Training
Let $\{x_i\}_{i=1}^{N}$ be a mini-batch of augmented training samples from the dataset with batch size $N$. The features of each training sample $x_i$ are extracted using the encoder backbone $f(\cdot)$. In practice, the encoder uses a pre-trained model, such as the ResNet or UNet architecture. The encoder generates a representation $r_i = f(x_i)$, which is then projected using the projection head $g(\cdot)$.
The projection head $g(\cdot)$ is a two-layer network of size 2048 and 128, respectively. It generates a lower-dimensional representation $z_i = g(r_i)$ of each training sample $x_i$. Each $z_i$ is $\ell_2$-normalized ($\lVert z_i \rVert = 1$) so that it lies on the unit hypersphere. Afterwards, the contrastive loss is applied to the set $\{z_i\}_{i=1}^{N}$:

$$\mathcal{L}_{\mathrm{con}} = \sum_{i=1}^{N} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)} \tag{1}$$
The notation $P(i)$ denotes the set of positive samples of the anchor $z_i$ (that is, samples in the mini-batch other than $z_i$ having the same class label as $z_i$). On the other hand, $A(i)$ denotes the set of all samples in the mini-batch other than $z_i$.
Contrastive loss serves to contrast the training samples among themselves in order to shape the feature space according to class labels. The loss contains a temperature parameter $\tau > 0$. The parameters $\theta$ of the encoder are then updated according to the gradient of the contrastive loss:

$$\theta_{k+1} = \theta_k - \eta \nabla_{\theta} \mathcal{L}_{\mathrm{con}} \tag{2}$$

where $k$ is the iteration index and $\eta$ is the learning rate.
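The forward path of this stage (encoder features, two-layer projection head, $\ell_2$-normalization) can be sketched with random weights standing in for the trained networks; the weight matrices below are hypothetical placeholders, with only the layer sizes (2048 and 128) taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for encoder outputs r_i = f(x_i): a batch of 16 feature vectors.
r = rng.standard_normal((16, 2048))

# Two-layer projection head g: 2048 -> 2048 -> 128 (random stand-in weights).
W1 = rng.standard_normal((2048, 2048)) * 0.02
W2 = rng.standard_normal((2048, 128)) * 0.02
h = np.maximum(r @ W1, 0.0)          # ReLU hidden layer
z = h @ W2                            # 128-d projections
z = z / np.linalg.norm(z, axis=1, keepdims=True)  # l2-normalize

# Every z_i now lies on the unit hypersphere, as required by Eq. (1).
assert z.shape == (16, 128)
assert np.allclose(np.linalg.norm(z, axis=1), 1.0)
```

The contrastive loss of Eq. (1) is then evaluated on these normalized rows, and the encoder and head weights are updated by the rule in Eq. (2).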
III-B Classifier Fine-Tuning
In the fine-tuning stage, the encoder backbone $f$ is frozen so that training is focused on the classifier $h(\cdot)$. Each training sample $x_i$ is fed through the trained encoder to generate $r_i = f(x_i)$. Afterwards, $r_i$ is concatenated with the extracted features $a_i$ of the auxiliary variables to generate $v_i = [r_i; a_i]$. This concatenation is the process in which multimodal fusion occurs.
The MLP classifier $h$ takes $v_i$ as input to generate an estimated probability for every class. Unlike regular SuperCon, $h$ consists of multiple layers instead of just one in order to handle the multimodal fusion. Let $p_t$ be the estimated probability of the ground-truth class label of $x_i$. The classification loss of the model is calculated using the focal loss:

$$\mathcal{L}_{\mathrm{focal}} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{3}$$
When $\gamma = 0$ and $\alpha_t = 1$, the loss reduces to $-\log(p_t)$, which is simply the standard cross-entropy loss. Therefore, focal loss is a generalization of cross-entropy loss.
Focal loss contains two parameters: the balancing parameter $\alpha_t$ of the ground-truth label and the focusing parameter $\gamma \geq 0$. Since the encoder is frozen, only the parameters $\phi$ of the classifier are updated according to the gradient of the focal loss:

$$\phi_{k+1} = \phi_k - \eta \nabla_{\phi} \mathcal{L}_{\mathrm{focal}} \tag{4}$$

where $k$ is the iteration index and $\eta$ is the learning rate.
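The fusion step itself is a plain concatenation followed by an MLP; the sketch below uses random stand-in weights and four scalar auxiliary features per sample (mirroring slope, altitude, aspect, and gain), with the 128-d representation size taken from the projection head.

```python
import numpy as np

rng = np.random.default_rng(0)

z = rng.standard_normal((16, 128))   # stand-in for frozen-encoder representations
aux = rng.standard_normal((16, 4))   # stand-in for slope, altitude, aspect, gain

# Multimodal fusion by concatenation: v_i = [r_i; a_i].
fused = np.concatenate([z, aux], axis=1)

# Hypothetical two-layer MLP classifier over the fused features (4 classes).
W1 = rng.standard_normal((132, 64)) * 0.1
W2 = rng.standard_normal((64, 4)) * 0.1
logits = np.maximum(fused @ W1, 0.0) @ W2
logits -= logits.max(axis=1, keepdims=True)            # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

assert fused.shape == (16, 132)
assert np.allclose(probs.sum(axis=1), 1.0)  # valid per-class probabilities
```

The focal loss of Eq. (3) is then computed from the ground-truth entry of each probability row, and only the classifier weights are updated per Eq. (4).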
IV-A Experimental Datasets
We use the public Landsat 8 image dataset on deforestation in Indonesia to train and test our model. This is the same dataset used in [16, 20]. The driver annotations and coordinates for forest loss events were curated by Austin et al. The dataset consists of 1,616 samples for training, 473 for validation, and 668 for testing. The forest loss images were obtained from various regions no more than five years after deforestation occurred in the area. Each image is classified into one of four classes: grassland/shrubland, plantation, smallholder agriculture, or other. These classes represent the drivers of deforestation of the forest loss region shown in a satellite image. As shown in Fig. 2, the class distribution is imbalanced, with most of the data belonging to the plantation class.
The original dataset was collected from forest loss events at 30 m resolution from 2001 to 2016 across various Indonesian islands. Fig. 3 shows the distribution of data based on when each forest loss event occurred. Due to the limited availability of Landsat 8 imagery, which is used for constructing the composite images given as input, we only use data obtained in the last five years. Each forest loss region is represented as a polygon indicating the forest loss event within a year. The composite images are RGB images of size 332 × 332 pixels capturing the forest loss. We also use auxiliary predictors in addition to the images so that multimodal fusion can be employed for the classification task. The auxiliary predictors used in our experiments are slope, altitude, aspect, and gain. Fig. 4 provides an example of the data used in this research.
IV-B Experimental Settings
We use the PyTorch framework to build our Multimodal SuperCon model (source code: https://github.com/bellasih/multimodal_supercon). We implement several data augmentation techniques, including horizontal flip, rotation, and elastic transform, in order to increase the robustness of the model. The model is trained on an NVIDIA DGX A100 with 64 GB memory for roughly 45 minutes. In addition, we use the Adam optimizer with a fixed learning rate during training. We set aside 15 epochs for the representation training of the encoder backbone and 10 epochs for the classifier fine-tuning. The batch size for both stages is set to 16.
We use EfficientNet-B2 as the encoder backbone for the Multimodal SuperCon architecture. For comparison purposes, we also utilize the ResNet18 and UNet pre-trained models as substitute encoders. We set the temperature parameter for the contrastive loss, and the balancing and focusing parameters for the focal loss. The architecture is trained a total of three times: using only one auxiliary variable (slope), two auxiliary variables (slope and altitude), and all four auxiliary variables (slope, altitude, aspect, and gain), respectively. We compare the performance of our model to the baseline CNN models used in prior work.
IV-C Results and Discussion
From Tables I and II, the experimental results show that the proposed model (Multimodal SuperCon using EfficientNet-B2 as the encoder and an MLP as the classifier) gives the best performance. Compared to the other models (ResNet18 with MLP; vanilla UNet with MLP), EfficientNet-B2 excels both during the representation training stage and the classifier fine-tuning stage. The proposed model achieves an accuracy of 66% when using one auxiliary variable (slope) or two auxiliary variables (slope and altitude), and an accuracy of 70% when using all four auxiliary variables during classifier training. The proposed model also gives a lower loss on representation learning, namely 0.49 using one auxiliary variable, 0.52 using two, and 0.48 using four. These differences indicate that the choice of an appropriate pre-trained model is crucial for good performance. EfficientNet outperforms the alternatives not only in accuracy and loss, but also in computation time.
Regarding the optimal number of auxiliary variables, the best model achieves an accuracy of 70% and a loss of 0.48 when using all four auxiliary variables. This indicates that adding several relevant inputs provides more information to the model and hence benefits training. From Table III, our proposed model also outperforms the baseline for the same task, namely the rotation equivariant CNN architecture, which yields an accuracy of only 63%.
TABLE I: Contrastive loss after representation training.

| Model | 1 Aux | 2 Aux | 4 Aux |
|---|---|---|---|
| ResNet18 + MLP | | | |
| EfficientNet + MLP | 0.49 | 0.52 | 0.48 |
| UNet + MLP | | | |
TABLE II: Classification accuracy after classifier fine-tuning.

| Model | 1 Aux | 2 Aux | 4 Aux |
|---|---|---|---|
| ResNet18 + MLP | | | |
| EfficientNet + MLP | 0.66 | 0.66 | 0.70 |
| UNet + MLP | | | |
V Conclusion and Future Work
In this paper, we introduced Multimodal SuperCon, a two-stage training architecture involving multimodal fusion to handle the class imbalance present in the deforestation dataset used in [16, 20]. We applied Multimodal SuperCon to the public Landsat 8 imaging dataset consisting of forest loss regions in Indonesia. The EfficientNet-B2 + MLP model gave superior performance compared to other encoder architectures, and the resulting test accuracy was significantly higher than the accuracy obtained by the baseline CNN models. It was also shown that multimodal fusion is essential for good results: by using all four auxiliary predictors, the model performed best with an accuracy of 70%.
In the future, Multimodal SuperCon can be applied not only to deforestation datasets, but to any domain requiring some form of multimodal fusion. The model can also be modified for other uses in both deep learning and remote sensing. EfficientNet should be a strong candidate for the encoder backbone since it performed significantly better than the other pre-trained models, including ResNet18 and UNet.
We thank the Tokopedia-UI AI Center of Excellence, Faculty of Computer Science, University of Indonesia, for its support and for providing access to the NVIDIA DGX-A100 used to run the experiments.
- (2019) Framing and context. In Climate Change and Land: An IPCC Special Report on Climate Change, Desertification, Land Degradation, Sustainable Land Management, Food Security, and Greenhouse Gas Fluxes in Terrestrial Ecosystems, pp. 1–98.
- (2010) Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16 (6), pp. 345–379.
- (2019) What causes deforestation in Indonesia? Environmental Research Letters 14 (2), pp. 024007.
- (2013) Carbon emissions from forest conversion by Kalimantan oil palm plantations. Nature Climate Change 3 (3), pp. 283–287.
- (2022) SuperCon: supervised contrastive learning for imbalanced skin lesion classification. arXiv.
- (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607.
- (2016) Steerable CNNs. arXiv preprint arXiv:1612.08498.
- (2016) Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990–2999.
- (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, Vol. 11, pp. 1–8.
- (2005) Global consequences of land use. Science 309, pp. 570–574.
- (2022) Slowing deforestation in Indonesia follows declining oil palm expansion and lower oil prices. PLOS ONE, e0266178.
- (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21 (9), pp. 1263–1284.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2020) Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182–4192.
- (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
- (2020) ForestNet: classifying drivers of deforestation in Indonesia using deep learning on satellite imagery. CoRR abs/2011.05479.
- (2020) Supervised contrastive learning. CoRR abs/2004.11362.
- (2017) Focal loss for dense object detection. CoRR abs/1708.02002.
- (2021) Fighting class imbalance with contrastive learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 466–476.
- (2021) Rotation equivariant deforestation segmentation and driver classification. CoRR abs/2110.13097.
- (2020) Large-scale object detection in the wild from imbalanced multi-labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9709–9718.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- (2016) . In European Conference on Computer Vision, pp. 467–482.
- (2020) Effective data fusion with generalized vegetation index: evidence from land cover segmentation in agriculture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 60–61.
- (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114.
- (2020) Contrastive multiview coding. In European Conference on Computer Vision, pp. 776–794.
- (2019) General E(2)-equivariant steerable CNNs. CoRR abs/1911.08251.
- (2022) Unified focal loss: generalising Dice and cross entropy-based losses to handle class imbalanced medical image segmentation. Computerized Medical Imaging and Graphics 95, pp. 102026.