Remote sensing applications have gained immense traction in recent years owing to the advent of high-quality acquisition systems, sophisticated processing algorithms and increasingly accurate detection and classification methods enabled by deep learning. Owing to the rich features it provides, semantic segmentation, which assigns a class label to each pixel in the frame, underlies some of the most popular remote sensing tasks. Semantic segmentation algorithms are used for a plethora of applications such as anomaly detection, event detection, land use cover change, etc. yao2019unmanned. Unmanned Aerial Vehicles (UAVs) have enabled the capture of ultra-high resolution data yao2019unmanned owing to characteristics such as low cost, flexibility and low flying altitude, leading to increasing interest in the field.
In the context of UAV-centric deep learning applications, there has been some work on autonomous navigation fraga2019review, object tracking aguilar2017pedestrian, change detection avola2018uav and semantic segmentation Yang_2020. For segmentation, some papers use standalone networks such as FCN, U-Net Girisha_2019, SegNet Yang_2020; lobo2020applying, DeepLabV3+ Girisha_2019; lobo2020applying, Conditional Random Fields Girisha_2019, Markov Random Fields li2017semantic and Adversarial Networks li2019road, while others have experimented with hybrid networks such as FCN-AlexNet Yang_2020, FCN-ConvLSTM Wang_2019 and DenseCNN with RNN rahnemoonfar2018flooded, or with ensemble networks such as an ensemble of ConvNets nogueira2017semantic. Some papers have used additional information such as Digital Surface Models zhang2019urban, or applied post-processing to segmentation results with overlay techniques and probabilistic graphical models dlreview.
In recent times, deep learning implementations have been employed for real-time recognition and tracking on drones neurala with cloud-based processing. With resource-hungry Deep Convolutional Neural Networks (DCNNs) breaking accuracy ceilings, their viability depends significantly on reducing costs in terms of network communication and computational energy. Hence UAV systems warrant the use of new-age edge AI (artificial intelligence) hardware accelerators such as edge-GPUs (Graphics Processing Units) Cass_2020, edge-TPUs (Tensor Processing Units) edgetpu and specialized ASICs (Application-Specific Integrated Circuits) Puglia_2016; parmar_hsi_ijcnn, which help achieve faster inference, lightweight deployment and low-power localized processing. Studies proposing embedded deployment in the remote sensing domain have been limited to object detection barba2020deep; kyrkou2018dronet; kyrkou2019deep, or semantic segmentation for satellite deployment inria.
Despite standalone work in each of these domains, real-time edge AI deployment of segmentation algorithms on UAV platforms remains an untapped aspect of realizing the potential of UAV applications. Such applications warrant solutions that jointly optimize accuracy, latency and memory, which calls for extensive design exploration; this is the primary motivation behind this study.
To this end, we present a first-of-its-kind intensive algorithmic-hardware exploration for performing segmentation on a multi-class UAV image dataset in a resource-efficient manner for future-ready edge-AI deployments. Key contributions of the paper are listed below:
Detailed performance benchmarking of standard segmentation models on a new multi-class UAV segmentation dataset (DroneDeploy dronedeploy).
First demonstration of EfficientNet-based semantic segmentation in the context of UAV images.
First hardware-software co-optimization study for semantic segmentation in the context of UAV images.
2 Materials and Methods
2.1 UAV Datasets
While there exist many UAV video datasets Girisha_2019; avola2018uav; uavid_isprs, UAV static image datasets are typically more application-oriented Yang_2020, and hence suited to object detection applications stanforddronedataset. They also generally have fewer annotated classes inria. For this study, we use the DroneDeploy dataset dronedeploy, comprising 55 RGB images along with single-channel elevation maps and label maps. The label maps are annotated with 7 classes - namely Building, Clutter, Vegetation, Water, Ground, Car and ‘Ignore’ - the last class referring to missing pixels/boundaries. The ground resolution is 10 cm/pixel. We use only the raw RGB TIFFs, in order to demonstrate generalized capability without the need for additional channels such as elevation (as in the case of this dataset) or hyper-spectral bands (as in the case of other UAV datasets), given the relatively high cost of lightweight multispectral cameras yao2019unmanned. A description of the class-wise distribution along with the color map is provided in Table 1. The maximum image size in the dataset is 637 MB. Fig. 1 shows an image chip and the corresponding label map from the dataset.
|Class||Color Code||Percentage of Pixels|
For training, we divided the images in the dataset into non-overlapping chips of size 300×300 for UNet, FPN and LinkNet, and 384×384 for PSPNet, ensuring that each chip contains at least one class of interest. The resulting training dataset consists of 5887 and 3893 chips of sizes 300×300 and 384×384 respectively. Basic augmentations such as horizontal and vertical flips as well as rotations in the range of 180° were applied during training, both to artificially increase the size of the dataset and to improve generalization. The ratio of training to validation data is 85%:15%.
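The chipping step can be illustrated with a minimal sketch. It assumes that incomplete chips at the image border are discarded and that "at least one class of interest" means at least one pixel that is not labelled Ignore; the function names and the numeric `IGNORE` label are hypothetical and not taken from the study.

```python
IGNORE = 6  # hypothetical integer id for the 'Ignore' class


def chip_origins(height, width, chip):
    """Top-left corners of non-overlapping chip x chip windows.

    Border remainders smaller than a full chip are dropped (an assumption).
    """
    return [(r, c)
            for r in range(0, height - chip + 1, chip)
            for c in range(0, width - chip + 1, chip)]


def extract_chips(label_map, chip, ignore_label=IGNORE):
    """Keep chips that contain at least one pixel of a class of interest."""
    height, width = len(label_map), len(label_map[0])
    kept = []
    for r, c in chip_origins(height, width, chip):
        window = [row[c:c + chip] for row in label_map[r:r + chip]]
        if any(v != ignore_label for row in window for v in row):
            kept.append((r, c))
    return kept


# Toy 4x4 label map split into 2x2 chips; the top-left chip is all 'Ignore'.
labels = [
    [6, 6, 0, 0],
    [6, 6, 0, 1],
    [2, 2, 6, 6],
    [2, 3, 6, 4],
]
kept = extract_chips(labels, 2)
print(kept)
```

In this toy example only the chip that is entirely Ignore is dropped; the same logic would apply to 300×300 or 384×384 windows on the full label maps.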
2.2 Semantic Segmentation Architectures
In this work, we experiment with a variety of models, backbones, hyper-parameters and training variations before arriving at an accurate model, and further try to optimize the architecture for embedded implementation. Segmentation models and encoder-backbones investigated for this study are listed in Table 2. Structurally, all semantic segmentation architectures are similar and comprise an encoder-decoder network, where the encoder network performs feature extraction and the decoder performs localization of spatial features inria. However, it is the method of combining the two sets of information - spatial and feature - which demarcates them. Four models were taken into consideration for this study: (i) UNet unet, (ii) LinkNet linknet, (iii) Pyramid Scene Parsing Network (PSPNet) pspnet, (iv) Feature Pyramid Networks (FPN) fpn.
UNet is one of the most fundamental semantic segmentation networks. It was originally intended for biomedical images; however, it now finds increasing relevance in nearly all areas of interest, including remote sensing inria. In UNet, the encoder is used for multi-level feature extraction and the decoder combines learnt features and resolution through stacking (concatenation), taking both localization and feature representation into account unet. LinkNet tweaks the UNet structure by adding the upsampled feature representation to the resolution information instead of concatenating them. The other two networks build pyramid structures. PSPNet creates a pyramid by pooling the lowest downsampled map at several scales, resulting in a collection of spatial resolutions used to enrich the features. FPN, on the other hand, creates two pyramids and combines them to generate feature-rich segmentation maps at each level.
Three encoder-backbones have been considered for this study: (i) EfficientNetB3 tan2019efficientnet, (ii) InceptionResnetV2 szegedy2017inception, (iii) MobileNetV2 sandler2018mobilenetv2. The prime motivation for the choice of backbones was to ensure an exhaustive exploration. To analyze the trade-off between memory and accuracy over the complete design space, we selected three backbones as representative workloads. InceptionResnetV2, a hybrid of two sophisticated networks, exhibits high accuracy but is computationally heavy, whereas MobileNetV2, despite having lower accuracy, enables lightweight edge AI computing bianco2018benchmark. EfficientNet architectures use compound scaling of width, depth and resolution in order to optimize accuracy and FLOPS. We use EfficientNetB3 from this family, which lies approximately in the middle of the spectrum, as will be shown in Section 3.3.
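As a rough illustration of compound scaling, the sketch below scales depth, width and resolution by a single coefficient φ using the base constants reported in the EfficientNet paper tan2019efficientnet (α = 1.2, β = 1.1, γ = 1.15, chosen so that α·β²·γ² ≈ 2). The function name is hypothetical, and no mapping between a value of φ and a named variant such as EfficientNetB3 is implied, since released variants use rounded coefficients.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases


def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi.

    FLOPS grow roughly with depth * width^2 * resolution^2, i.e. about 2^phi.
    """
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPS x{f:.2f}")
```

Each unit increase in φ roughly doubles the compute budget, which is why variants in the middle of the family offer a compromise between accuracy and cost.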
|Models||UNet, FPN, LinkNet, PSPNet|
|Encoder Backbone||MobileNetV2, EfficientNetB3, InceptionResnetV2|
Fig. 3 depicts the process of network exploration followed in this study. At the level of network architecture, a total of 12 model-backbone combinations were investigated, as shown in Table 2. For training parameters, 4 choices viz. optimizer, learning rate policy, weight initialization and decoder-only/complete training are explored. Further, detailed memory and latency benchmarking is performed for hardware-specific optimization. Experiments performed in the study are based on the Segmentation Models library Yakubovskiy:2019, built on the Keras framework chollet2015keras. For hardware-specific optimization we utilize the NVIDIA TensorRT library vanholder2016efficient. A base configuration was selected, and in each experiment only the parameter of interest was changed. The base configuration is defined in Table 3. Two evaluation metrics used for the optimization are defined below.
IoU Score or the Jaccard Index.
F1 Score or the Dice Coefficient (β = 1)
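For reference, the two metrics can be written in terms of per-class true positives (TP), false positives (FP) and false negatives (FN): IoU = TP / (TP + FP + FN) and, for β = 1, F1 = 2·TP / (2·TP + FP + FN). The sketch below is a plain-Python illustration of these standard definitions on integer label maps; it omits any smoothing term or averaging scheme that a library implementation may apply, and the function name is hypothetical.

```python
def iou_and_f1(y_true, y_pred, cls):
    """Jaccard index and Dice coefficient (beta = 1) for one class.

    y_true and y_pred are 2-D lists of integer class labels of equal shape.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == cls and t == cls:
                tp += 1
            elif p == cls:
                fp += 1
            elif t == cls:
                fn += 1
    union = tp + fp + fn
    if union == 0:
        # Class absent from both maps; returning 0.0 is a convention chosen here.
        return 0.0, 0.0
    return tp / union, 2 * tp / (2 * tp + fp + fn)


y_true = [[0, 0, 1],
          [1, 1, 2]]
y_pred = [[0, 1, 1],
          [1, 0, 2]]
iou, f1 = iou_and_f1(y_true, y_pred, cls=1)
print(iou, f1)
```

For a single class on a single image the two scores are tied by F1 = 2·IoU / (1 + IoU); scores averaged over classes or batches need not satisfy this relation.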
|Learning Rate Policy||Static Rate (1e-4)|
|Weight Initialization||ImageNet deng2009imagenet|
3.1 Network architecture exploration
Fig. 4 and Fig. 5 compare the performance of different segmentation models with the EfficientNetB3 backbone. The best performing model is found to be FPN. When UNet is selected as the model, the best performing backbone can be taken to be either EfficientNetB3 or InceptionResnetV2 owing to their close performance. Since FPN emerged as a clear choice from the first set of experiments, we updated the base model to FPN and performed a second set of experiments to compare the encoder-backbones. From Fig. 6 we observe that EfficientNetB3 shows the best performance and hence is the choice of backbone. This is further validated by the results shown in Table 4. FPN with either EfficientNetB3 or InceptionResnetV2 as backbone has the highest IoU score; however, EfficientNetB3 yields a marginally better validation IoU score, hence we select it as the backbone. The implications of this choice for hardware are discussed in detail in Section 3.3 and Section 3.4.
|Model||Backbone||Train IoU||Val IoU|
3.2 Training parameter exploration
Based on the results of the network architecture exploration, the base configuration is now updated to Model: FPN and Backbone: EfficientNetB3. In this set of experiments, we focus on finding an optimal configuration for the following training parameters, and on understanding their impact: optimizer, learning rate policy, weight initialization and decoder-only/complete training. A description of each investigated parameter is provided below.
3.2.1 Optimizer
The Adam optimizer kingma2014adam is one of the most commonly used optimizers for semantic segmentation networks, even finding applications in the context of hyperspectral images owing to its high classification accuracy paoletti2019deep. As part of this experiment, we compare the performance of the Adam and AdamW loshchilov2017decoupled optimizers, wherein AdamW decouples weight decay from the gradient updates. Fig. 7 shows better performance for AdamW than for Adam, making it the optimal choice of optimizer.
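The difference between the two optimizers can be seen in a single-weight sketch. With Adam plus L2 regularization the decay term is folded into the gradient and is therefore rescaled by the adaptive denominator, whereas AdamW applies the decay directly to the weight. The code below is an illustrative scalar implementation with hypothetical names and hyper-parameter values; it is not the optimizer code used in the study.

```python
import math


def new_state():
    """Fresh optimizer state for one scalar weight."""
    return {"t": 0, "m": 0.0, "v": 0.0}


def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One update of Adam with L2 (decoupled=False) or AdamW (decoupled=True)."""
    if not decoupled:
        # Adam + L2: the decay enters the gradient and gets adaptively rescaled.
        grad = grad + weight_decay * w
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    step = lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: the decay acts on the weight itself, outside the adaptive term.
        step += lr * weight_decay * w
    return w - step


w0 = 2.0
# Zero loss gradient isolates the effect of weight decay.
w_adam = adam_step(w0, 0.0, new_state(), weight_decay=0.01, decoupled=False)
w_adamw = adam_step(w0, 0.0, new_state(), weight_decay=0.01, decoupled=True)
print(w_adam, w_adamw)
```

With a zero loss gradient, the L2 variant moves the weight by roughly the full learning rate regardless of the decay strength, while AdamW shrinks it by the factor (1 - lr·weight_decay).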
3.2.2 Learning Rate Policy
Learning rate (LR) is an important parameter for preventing overfitting and for avoiding getting stuck in local minima. Hence, as part of this experiment, we compare performance between two learning rate policies: (i) a static LR throughout the training epochs, (ii) an LR scheduler as defined in Algorithm 1.
Fig. 8 shows that an LR scheduler policy with an adaptive learning rate performs better than a static LR policy. Thus the base configuration is updated with a scheduler-based learning rate policy.
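Algorithm 1 is not reproduced in this text, so the following is only a generic illustration of what an adaptive scheduler can look like, not the authors' algorithm. It sketches a reduce-on-plateau rule that halves the learning rate when the validation loss has not improved for a fixed number of epochs. The starting rate of 1e-4 matches the static rate in the base configuration; the factor, patience, minimum rate and loss values are assumptions.

```python
def plateau_schedule(val_losses, lr0=1e-4, factor=0.5, patience=2, min_lr=1e-6):
    """Return the learning rate used in each epoch under a plateau rule."""
    lr = lr0
    best = float("inf")
    stalled = 0
    used = []
    for loss in val_losses:
        used.append(lr)          # rate in effect during this epoch
        if loss < best:
            best = loss
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                lr = max(lr * factor, min_lr)  # decay after a plateau
                stalled = 0
    return used


# Hypothetical validation losses for eight epochs.
losses = [1.0, 0.8, 0.81, 0.82, 0.7, 0.71, 0.72, 0.73]
lrs = plateau_schedule(losses)
print(lrs)
```

A static policy corresponds to returning `lr0` for every epoch; the plateau rule differs only once the validation loss stops improving.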
3.2.3 Weight Initialization
While initializing a network with ImageNet pre-trained weights is a widely accepted method to accelerate training in deep learning, we also have to acknowledge the difference in target classes specific to the UAV dataset. Hence we studied the effect of pre-training on ImageNet deng2009imagenet in comparison with random weight initialization. Fig. 9 shows that a network pre-trained on ImageNet outperforms random initialization by a large margin. Hence, we proceed with this weight initialization scheme for the base configuration.
3.2.4 Decoder-Only / Complete Training
As discussed in Section 2.2 the semantic segmentation networks have two components: (i) Encoder (ii) Decoder. In the interest of saving the cost of training operations, a commonly used strategy for training these networks is to freeze the encoder weights and only train the decoder michieli2019incremental. Another alternative is to train both components (Complete Training). Fig. 10 shows how complete training outperforms decoder training, possibly owing to the requisite learning for fine-tuned feature extraction from the encoder.
3.3 Memory Profiling
So far, we have analyzed methods to update the base configuration in an effort to optimize accuracy. However, real-world edge deployment on any UAV platform requires a low memory footprint, low latency and low power. To achieve a lightweight implementation in terms of memory footprint, there are two strategies: (i) network architecture selection to reduce model weights, (ii) quantization of model weights and the computation graph. Both strategies are discussed further.
3.3.1 Model Weights
In this study, we compare the memory consumption of different models’ weights as shown in Table 5 in order to select the architecture with the best trade-off between model weight size and validation accuracy.
|Model||Backbone||Val IOU||Memory (MB)|
3.3.2 Quantization based Memory Optimization
In order to reduce memory footprint and inference latency, we analyze the impact of quantizing the floating-point-32 (FP32) precision network to a more efficient data type, i.e. floating-point-16 (FP16). FP16 quantization is performed using NVIDIA TensorRT (TRT) vanholder2016efficient. Fig. 11 demonstrates the prediction performance for the baseline as well as the TRT optimized models. We observe that the TRT FP16 model performance closely matches the baseline FP32 model, making it a valid candidate for further latency analysis. Fig. 12 compares the prediction performance of FP16 architectures with the FPN model using all 3 encoder-backbones. As expected from the earlier analysis (shown in Table 4), MobileNetV2 shows noisy predictions, whereas EfficientNetB3 and InceptionResnetV2 are on par with each other.
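To give a feel for why FP16 can track FP32 closely while halving weight storage, the sketch below rounds a few values to IEEE 754 half precision using only the Python standard library. It illustrates the number format only and does not model TensorRT's graph optimizations; the weight values and parameter count are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def storage_bytes(num_weights, bytes_per_weight):
    """Raw storage needed for the weights alone, ignoring file overhead."""
    return num_weights * bytes_per_weight


weights = [0.1234, -0.5, 0.001, 3.14159]  # hypothetical weight values
rounded = [to_fp16(w) for w in weights]
max_rel_err = max(abs(r - w) / abs(w) for w, r in zip(weights, rounded))

num_params = 10_000_000  # hypothetical parameter count
fp32_mb = storage_bytes(num_params, 4) / 1e6
fp16_mb = storage_bytes(num_params, 2) / 1e6
print(f"max relative rounding error: {max_rel_err:.2e}")
print(f"FP32: {fp32_mb:.0f} MB, FP16: {fp16_mb:.0f} MB")
```

For values in the normal FP16 range the relative rounding error is bounded by 2^-11, which is small compared with typical weight magnitudes; very large or very small values would need separate handling.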
3.4 Latency Profiling
Table 6 shows the latency analysis for architectures with FPN as model on an input image of size 522 MB, spanning an area of 1.36 km². The analysis is performed on an NVIDIA RTX 8000. The updated base model is also run on an additional configuration, the NVIDIA RTX 2080 Ti, since it is more representative of an inference-oriented GPU. The Keras model uses the FP32 data type, whereas the Tensorflow-TensorRT (TF-TRT) implementations use the FP16 data type. The FP16 implementation shows a speedup compared to FP32, increasing its viability for deployment. An important point to note is that while the results presented in Table 6 report latencies for processing a complete ortho-rectified image from the dataset with >1000 slices of 320×320, actual latency on the target platform would be far lower, depending on sensor resolution. Hence, we present inference latency for an image size corresponding to a typical drone sensor (960×480) dji in Table 7. Of the edge devices referred to in the study, the best performance is obtained for the NVIDIA Jetson Nano with the Tensorflow framework. For inference time, the throughput metrics of interest are as follows:
File Size per unit time = 125 kB/second
Area spanned per unit time = 320 m²/second
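These throughput figures can be reproduced from numbers stated elsewhere in the text: a 960×480 frame, a ground resolution of 10 cm/pixel, and the 14.40 s Jetson Nano latency in Table 7. The sketch below is a back-of-the-envelope check; the value of 4 bytes per pixel is an assumption made here because it reproduces the quoted file-size rate, and the function name is hypothetical.

```python
def throughput(width_px, height_px, gsd_m, latency_s, bytes_per_px):
    """Return (ground area in m^2 per second, bytes per second)."""
    area_m2 = (width_px * gsd_m) * (height_px * gsd_m)
    size_bytes = width_px * height_px * bytes_per_px
    return area_m2 / latency_s, size_bytes / latency_s


area_rate, byte_rate = throughput(
    width_px=960, height_px=480,
    gsd_m=0.10,        # 10 cm/pixel ground resolution
    latency_s=14.40,   # Jetson Nano latency from Table 7
    bytes_per_px=4,    # assumption, not stated in the text
)
print(f"{area_rate:.0f} m^2/s, {byte_rate / 1024:.0f} KiB/s")
```

Under these assumptions a 96 m × 48 m footprint processed in 14.40 s gives about 320 m² per second.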
|Backbone||Keras Model (s)||TF-TensorRT (s)|
*This experiment is performed on NVIDIA RTX 2080 Ti; all other experiments are performed on NVIDIA RTX 8000.
|Platform||Framework||Inference Latency (s)|
|Nvidia RTX 2080 Ti||Tensorflow||0.07|
|Xeon CPU||Tensorflow||0.79¹|
|Nvidia Jetson Nano||Tensorflow||14.40|
|Raspberry Pi 3B+||Tensorflow||75.20|
¹ Xeon CPU latency is included for overall comparison.
Estimated based on scaling factors derived from tryolabs.
Section 3.1 shows that FPN models, with high-level semantic features at every level, outperform UNet models by a considerable margin in terms of validation IoU scores (see Table 4). The FPN model with EfficientNetB3 as encoder-backbone has the highest validation IoU score of all architectures, making it the preferred choice. The choice of EfficientNetB3 over InceptionResnetV2 stems from the marginally higher IoU score, along with memory footprint considerations as shown in Table 5. With the optimal network architecture established, training parameter optimization was explored. In Section 3.2.1 we see that AdamW shows superior performance over the Adam optimizer because of its decoupled weight decay (shown in Fig. 7). Section 3.2.2 shows that the scheduler-based learning rate policy outperforms the static LR policy by a significant margin due to its adaptive nature. Section 3.2.3 validates the standard method of initialization with ImageNet weights by comparing its performance against random initialization. This is a significant observation, contrary to the view that remote sensing applications differ considerably from vision applications, necessitating a departure from standard initializations like ImageNet deng2009imagenet. Section 3.2.4 shows that complete training allows better feature adaptation than encoder freezing, and hence complete training is followed. Following are some important observations based on trends shown in Section 3.3.1:
All models with MobileNetV2 as backbone are lightweight; however, their validation IoU is considerably lower than with other backbones.
All InceptionResnetV2 models are very memory intensive; however, they perform better than or on par with other backbones for the same model.
Of all models, PSPNet has the lowest memory consumption; however, its performance is also considerably lower than that of all other models.
If models with best validation IoU are compared in terms of memory, InceptionResnetV2 is 4x heavier than EfficientNetB3 making it a suboptimal choice for lightweight on-the-edge deployment.
The above trends further explain the motivation behind the choice of investigated architectures, since they provide a more complete picture of the tradeoff between accuracy and memory footprint. MobileNetV2 is highly resource-efficient for deployment, whereas InceptionResnetV2 performs consistently and accurately; EfficientNetB3 strikes a balance between accuracy and memory consumption.
Section 3.3.2 shows that FP16 optimized models perform on par with FP32 models in terms of accuracy. This observation, together with the memory footprint savings, motivates investigating the latency performance of the TF-TRT models.
Section 3.4 shows lower inference times for TF-TRT optimized models (improvements in the range of 1.05× to 1.4×), except in the case of EfficientNetB3 run on the NVIDIA RTX 2080 Ti, possibly owing to a memory bottleneck. The NVIDIA Jetson Nano emerges as the preferable choice based on the edge device estimates in Table 7.
4.1 Future Scope
While the study contributes several significant observations, we aim to address the following aspects in future and ongoing work:
Investigate optimization strategies in order to address UAV dataset imbalance.
TRT edge-AI hardware latency estimates are approximate to a margin of 10%. Accurate latency and power measurements based on actual deployed characterization will yield more precise values.
Since no gold-standard dataset currently exists as a reference for training classifiers on UAV-based images, further investigation into the impact of weight initialization, and comparison with ImageNet-based initialization, may be advisable.
Power profiling of the networks may give more insight since latency for operations is not representative of energy costs. This would further boost the overall impact of optimization in terms of EDP (Energy Delay Product).
In this paper, we present a detailed benchmarking study of semantic segmentation models in the context of UAV applications on the DroneDeploy dataset. We also present the first demonstration of semantic segmentation based on EfficientNet architectures for remote sensing applications. Based on extensive exploration, the best configuration is found to be: Model: FPN, Backbone: EfficientNetB3, Pretraining: ImageNet weights, Optimizer: AdamW with a learning rate scheduler and complete training, which achieves an IoU score of 0.65 and an F1-score of 0.71 on the validation dataset. We also profile memory usage and latency for each model and optimize them for inference using TensorRT with FP16 precision. Based on this, we achieve memory savings of 4.1× and a latency improvement of 10% compared to Model: FPN and Backbone: InceptionResnetV2.
Conceptualization: V.P., N.B. and M.S.; Methodology & Investigation: V.P. and N.B.; SW & Validation: V.P., N.B. and S.N.; Writing, Editing and Review: all authors; Supervision: M.S.
This research received no external funding.
Acknowledgements. The authors would like to acknowledge support provided by CYRAN AI Solutions for this study.
The authors declare no conflict of interest.
The following abbreviations are used in this manuscript:
|FPN||Feature Pyramid Network|
|PSPNet||Pyramid Scene Parsing Network|