Exploration of Optimized Semantic Segmentation Architectures for edge-Deployment on Drones

In this paper, we analyze the impact of network parameters on semantic segmentation architectures in the context of UAV data processing, using the DroneDeploy Segmentation benchmark. Based on the comparative analysis, we identify the optimal network architecture to be FPN-EfficientNetB3 with an encoder backbone pretrained on the ImageNet dataset. The network achieves an IoU score of 0.65 and an F1-score of 0.71 on the validation dataset. We also compare the architectures in terms of memory footprint and inference latency, and further explore the impact of TensorRT-based optimizations. We achieve memory savings of 4.1x and a latency improvement of 10% compared to FPN with the InceptionResnetV2 backbone.



1 Introduction

Remote sensing applications have gained immense traction in recent years owing to the advent of high-quality acquisition systems, sophisticated processing algorithms, and increasingly accurate detection and classification methods enabled by deep learning. Owing to its rich features, semantic segmentation, which assigns a class-wise label to each pixel in the frame, underpins some of the most popular remote sensing tasks. Semantic segmentation algorithms are used for a plethora of applications such as anomaly detection, event detection, and land use/cover change yao2019unmanned. Unmanned Aerial Vehicles (UAVs) have enabled the capture of ultra-high-resolution data yao2019unmanned owing to characteristics such as low cost, flexibility, and low flying altitude, leading to increasing interest in the field.

In the context of UAV-centric deep learning applications, there has been some work on autonomous navigation fraga2019review, object tracking aguilar2017pedestrian, change detection avola2018uav, and semantic segmentation Yang_2020. For segmentation, some papers use standalone networks such as FCN, U-Net Girisha_2019, SegNet Yang_2020; lobo2020applying, DeepLabV3+ Girisha_2019; lobo2020applying, Conditional Random Fields Girisha_2019, Markov Random Fields li2017semantic, and Adversarial Networks li2019road, while others have experimented with hybrid networks such as FCN-AlexNet Yang_2020, FCN-ConvLSTM Wang_2019, and DenseCNN with RNN rahnemoonfar2018flooded, or with ensemble networks such as an ensemble of ConvNets nogueira2017semantic. Some papers have used additional information such as Digital Surface Models zhang2019urban or applied post-processing to segmentation results with overlay techniques and probabilistic graphical models dlreview.

In recent times, deep learning implementations have been employed for real-time recognition and tracking on drones neurala with cloud-based processing. With resource-hungry Deep Convolutional Neural Networks (DCNNs) breaking accuracy ceilings, a significant aspect of their viability relies on reducing costs in terms of network communication and computational energy. Hence, UAV systems warrant the use of new-age edge AI (artificial intelligence) hardware accelerators such as edge-GPUs (Graphics Processing Units), edge-TPUs (Tensor Processing Units) edgetpu, and specialized ASICs (Application-Specific Integrated Circuits) Puglia_2016; parmar_hsi_ijcnn, which help achieve faster inference, lightweight deployment, and low-power localized processing. Studies proposing embedded deployment in the remote sensing domain have been limited to object detection barba2020deep; kyrkou2018dronet, scene classification kyrkou2019deep, or semantic segmentation for satellite deployment inria.

Despite standalone work in each of these domains, real-time edge AI deployment of segmentation algorithms on UAV platforms remains an untapped aspect of realizing the potential of UAV applications. Such applications warrant solutions that jointly optimize accuracy, latency, and memory, and thus need extensive design exploration, which is the primary motivation behind this study.

To this end, we present a first-of-its-kind intensive algorithmic-hardware exploration of resource-efficient segmentation on a multi-class UAV image dataset, aimed at future-ready edge-AI deployments. The key contributions of the paper are listed below:

  1. Detailed performance benchmarking of standard segmentation models on a new multi-class UAV segmentation dataset (DroneDeploy dronedeploy).

  2. First demonstration of EfficientNet based Semantic Segmentation in context of UAV images.

  3. First hardware-software co-optimization study for semantic segmentation in context of UAV images.

2 Materials and Methods

2.1 UAV Datasets

While many UAV video datasets exist Girisha_2019; avola2018uav; uavid_isprs, UAV static-image datasets are typically more application oriented Yang_2020 and hence suited to object detection applications stanforddronedataset. They also generally have fewer annotated classes inria. For this study, we used the DroneDeploy dataset dronedeploy, comprising 55 RGB images along with single-channel elevation maps and label maps. The label maps are annotated with 7 classes, namely Building, Clutter, Vegetation, Water, Ground, Car, and 'Ignore', the last class referring to missing pixels/boundaries. The ground resolution is 10 cm/pixel. We used only the raw RGB TIFFs in order to demonstrate generalized capability without the need for additional channels such as elevation (as in this dataset) or hyper-spectral bands (as in other UAV datasets), given the relatively high cost of lightweight multispectral cameras yao2019unmanned. The class-wise distribution, along with the color map, is provided in Table 1. The maximum image size in the dataset is 637 MB. Fig. 1 shows an image chip and the corresponding label map from the dataset.

Class        Color Code   Percentage of Pixels
Building     Red          5.6%
Clutter      Purple       2%
Vegetation   Green        10.43%
Water        Orange       1.2%
Ground       White        37.7%
Car          Blue         0.38%
Ignore       Magenta      42.7%
Table 1: Distribution of class samples in the dataset

For training, we divided the images in the dataset into non-overlapping chips of size 300×300 for UNet, FPN, and LinkNet, and 384×384 for PSPNet, ensuring that each chip contains at least one class of interest. The resulting training datasets consist of 5887 and 3893 chips of sizes 300×300 and 384×384, respectively. Basic augmentations such as horizontal and vertical flips, as well as rotations in the range of 180°, were applied during training in order to artificially increase the size of the dataset and provide better generalization. The ratio of training to validation data is 85%:15%.
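The chipping and splitting steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, partial tiles at the image borders are assumed to be dropped, "class of interest" is assumed to mean any label other than 'Ignore', and the 85%:15% split is assumed to be applied at chip level with a fixed random seed.

```python
import random

def chip_boxes(height, width, chip):
    """Return (top, left, bottom, right) boxes of non-overlapping chip x chip tiles.

    Partial tiles at the right and bottom edges are dropped (an assumption).
    """
    return [(r, c, r + chip, c + chip)
            for r in range(0, height - chip + 1, chip)
            for c in range(0, width - chip + 1, chip)]

def has_class_of_interest(label_chip, ignore_value):
    """True if at least one pixel carries a label other than the 'Ignore' value."""
    return any(v != ignore_value for row in label_chip for v in row)

def train_val_split(items, val_fraction=0.15, seed=0):
    """Shuffle deterministically and split into (train, val) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = round(len(items) * val_fraction)
    return items[n_val:], items[:n_val]

# Toy example: a 1000 x 650 image tiled into 300 x 300 chips.
boxes = chip_boxes(height=1000, width=650, chip=300)
train, val = train_val_split(boxes)
print(len(boxes), len(train), len(val))
```

In practice the label chip would be read from the dataset's label map and filtered with `has_class_of_interest` before the split, so that chips containing only 'Ignore' pixels never reach training.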

Figure 1: Example of image chips used for training from the DroneDeploy dataset dronedeploy. (a) RGB image (b) Ground truth label map.

2.2 Semantic Segmentation Architectures

In this work, we experiment with a variety of models, backbones, hyper-parameters, and training variations before arriving at an accurate model, and then optimize the architecture for embedded implementation. The segmentation models and encoder backbones investigated in this study are listed in Table 2. Structurally, all semantic segmentation architectures are similar and comprise an encoder-decoder network, where the encoder performs feature extraction and the decoder performs localization of spatial features inria. However, it is the method of combining the two sets of information, spatial and feature, that distinguishes them. Four models were considered for this study: (i) UNet unet, (ii) LinkNet linknet, (iii) Pyramid Scene Parsing Network (PSPNet) pspnet, and (iv) Feature Pyramid Network (FPN) fpn.

UNet is one of the most fundamental semantic segmentation networks. It was originally intended for biomedical images; however, it finds increasing relevance in nearly all areas of interest today, including remote sensing inria. In UNet, the encoder performs multi-level feature extraction and the decoder combines learnt features and resolution through stacking (concatenation), taking both localization and feature representation into account unet. LinkNet tweaks the UNet structure by adding the upsampled feature representation to the resolution information instead of concatenating them. The other two networks form a pyramid structure. PSPNet does so by pooling the lowest downsampled map at multiple scales, resulting in a collection of spatial resolutions used to enrich the features. FPN, on the other hand, creates two pyramids and combines them to generate feature-rich segmentation maps at each level.

Three encoder backbones have been considered for this study: (i) EfficientNetB3 tan2019efficientnet, (ii) InceptionResnetV2 szegedy2017inception, and (iii) MobileNetV2 sandler2018mobilenetv2. The backbones were chosen as representative workloads that span the design space, so that the trade-off between memory and accuracy can be analyzed across it. InceptionResnetV2, a hybrid of two sophisticated networks, exhibits high accuracy but is computationally heavy, whereas MobileNetV2, despite having lower accuracy, enables lightweight edge AI computing bianco2018benchmark. EfficientNet architectures use compound scaling of width, depth, and resolution in order to optimize accuracy and FLOPS. We use the EfficientNetB3 architecture from this family, which lies approximately in the middle of the spectrum, as will be shown in Section 3.3.

Figure 2: Segmentation models investigated in the study: (a) UNet (b) LinkNet (c) PSPNet (d) FPN Yakubovskiy:2019. (e) Performance benchmark of encoder-backbones investigated in this study (MobileNetv2, EfficientNetB3, InceptionResNetv2) on ImageNet dataset tan2019efficientnet.

Models             UNet, FPN, LinkNet, PSPNet
Encoder Backbone   MobileNetv2, EfficientNetB3, InceptionResNetv2
Table 2: Network architectures used in this study

2.3 Experiments

Fig. 3 depicts the process of network exploration followed in this study. At the level of network architecture, a total of 12 model-backbone combinations were investigated, as shown in Table 2. For training, 4 parameters are explored: optimizer, learning rate policy, weight initialization, and decoder-only/complete training. Further, detailed memory and latency benchmarking is performed for hardware-specific optimization. The experiments are based on the Segmentation Models library Yakubovskiy:2019, which is built on the Keras framework chollet2015keras. For hardware-specific optimization we use the NVIDIA TensorRT library vanholder2016efficient. A base configuration was selected, and in each experiment only the parameter of interest was changed. The base configuration is defined in Table 3. The two evaluation metrics used for the optimization are defined below.

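  1. IoU Score or the Jaccard Index: $\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $A$ is the set of ground-truth pixels and $B$ the set of predicted pixels.

  2. F1 Score or the Dice Coefficient ($\beta = 1$): $F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}$

    where $\mathrm{precision} = \frac{TP}{TP + FP}$, $\mathrm{recall} = \frac{TP}{TP + FN}$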
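Both metrics can be computed from per-class true-positive (TP), false-positive (FP), and false-negative (FN) pixel counts. The sketch below uses the standard definitions of the Jaccard index and the F-score; the function names are illustrative, and returning 1.0 for a class absent from both maps is an assumed convention rather than something the study specifies.

```python
def confusion_counts(y_true, y_pred, cls):
    """Count TP, FP, FN for one class over flat sequences of pixel labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    return tp, fp, fn

def iou_score(tp, fp, fn):
    """Jaccard index: |A n B| / |A u B| = TP / (TP + FP + FN)."""
    union = tp + fp + fn
    return tp / union if union else 1.0  # assumed convention for empty classes

def f_score(tp, fp, fn, beta=1.0):
    """F-beta score; beta = 1 gives the F1 score (Dice coefficient)."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 1.0

# Toy example with six pixels and a single class of interest (label 1).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
tp, fp, fn = confusion_counts(y_true, y_pred, cls=1)
print(iou_score(tp, fp, fn), f_score(tp, fp, fn))
```

For a single class the two metrics are monotonically related (F1 = 2·IoU / (1 + IoU)), so they rank predictions identically; they can diverge once scores are averaged over classes or images.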

Figure 3: Flowchart depicting steps for network parameter exploration.

Parameter                   Configuration
Model                       UNet
Backbone                    EfficientNetB3
Optimizer                   AdamW
Learning Rate Policy        Static Rate (1e-4)
Weight Initialization       ImageNet deng2009imagenet
Decoder/Complete Training   Complete
Data Type                   FP32
Table 3: Base network architecture used for the study.

3 Results

3.1 Network architecture exploration

Fig. 4 and Fig. 5 compare the performance of different segmentation models with the EfficientNetB3 backbone. The best-performing model is FPN. When UNet is selected as the model, the best-performing backbone can be taken to be either EfficientNetB3 or InceptionResnetV2 owing to their close performance. Since FPN emerged as a clear choice from the first set of experiments, we updated the base model to FPN and performed a second set of experiments to compare the encoder backbones. From Fig. 6 we observe that EfficientNetB3 shows the best performance and hence is the backbone of choice. This is corroborated by the results in Table 4: FPN with either EfficientNetB3 or InceptionResnetV2 as the backbone attains the highest IoU score, but EfficientNetB3 results in a marginally better validation IoU score (the two are equal at the two-decimal precision reported in Table 4), hence we select it as the backbone. The implications of this choice for hardware are discussed in detail in Section 3.3 and Section 3.4.

Figure 4: Epoch-wise IoU score variation from network architecture experiments for: (i) Model selection with EfficientNetB3 backbone: (a) Training (b) Validation; (ii) Backbone selection with UNet model: (c) Training (d) Validation.
Figure 5: Epoch-wise loss variation from network architecture experiments for: (i) Model selection with EfficientNetB3 backbone: (a) Training loss (b) Validation loss; (ii) Backbone selection with UNet model: (c) Training loss (d) Validation loss.
Figure 6: Epoch-wise performance analysis of FPN model with backbone variations. IoU score evolution: (a) Training (b) Validation. Loss evolution: (c) Training (d) Validation.

Model     Backbone            Train IoU   Val IoU
UNet      EfficientNetB3      0.77        0.64
UNet      InceptionResnetV2   0.80        0.63
UNet      MobileNetV2         0.74        0.59
FPN       EfficientNetB3      0.81        0.65
FPN       InceptionResnetV2   0.81        0.65
FPN       MobileNetV2         0.76        0.61
LinkNet   EfficientNetB3      0.76        0.61
LinkNet   InceptionResnetV2   0.78        0.63
LinkNet   MobileNetV2         0.71        0.54
PSPNet    EfficientNetB3      0.71        0.56
PSPNet    InceptionResnetV2   0.74        0.58
PSPNet    MobileNetV2         0.68        0.52
Table 4: Network architecture exploration results in terms of IoU Scores

3.2 Training parameter exploration

Based on the results of the network architecture exploration, the base configuration is now updated to Model: FPN and Backbone: EfficientNetB3. In this set of experiments, we focus on finding an optimal configuration for the following training parameters and understanding their impact: optimizer, learning rate policy, weight initialization, and decoder-only/complete training. A description of each investigated parameter is provided below.

3.2.1 Optimizer

The Adam optimizer kingma2014adam is one of the most commonly used optimizers for semantic segmentation networks, even finding applications in the context of hyperspectral images owing to its high classification accuracy paoletti2019deep. In this experiment, we compare the performance of the Adam and AdamW loshchilov2017decoupled optimizers, where AdamW decouples weight decay from the gradient updates. Fig. 7 shows better performance for AdamW than for Adam, making it the optimizer of choice.

Figure 7: Comparison of performance between Adam and AdamW optimizers: (a) Training IoU score (b) Validation IoU score (c) Training loss (d) Validation loss.

3.2.2 Learning Rate Policy

The learning rate (LR) is an important parameter for preventing overfitting and for avoiding getting stuck in local minima. Hence, in this experiment we compare two learning rate policies: (i) a static LR throughout the training epochs, and (ii) an LR scheduler as defined in Algorithm 1.

Data: Current Epoch, Max Epoch, Base Learning Rate, Decay factor
Result: New Learning Rate
if … then
    …
end if
Algorithm 1 Learning rate scheduler algorithm
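For illustration only, a scheduler taking the same four inputs as Algorithm 1 (current epoch, max epoch, base learning rate, decay factor) might look like the sketch below. The specific rule used here, holding the base rate for the first half of training and then multiplying by the decay factor once per epoch, is an assumption chosen for the example and should not be read as the condition or update used in this study.

```python
def scheduled_lr(epoch, max_epoch, base_lr, decay):
    """Hypothetical LR rule (not the paper's Algorithm 1).

    Hold base_lr for the first half of training, then multiply by `decay`
    once for every subsequent epoch.
    """
    half = max_epoch // 2
    if epoch < half:
        return base_lr
    return base_lr * decay ** (epoch - half + 1)

# Toy schedule: 10 epochs, base rate 1e-4 (as in Table 3), assumed decay 0.5.
for epoch in range(10):
    print(epoch, scheduled_lr(epoch, max_epoch=10, base_lr=1e-4, decay=0.5))
```

A function of this shape can be passed to a per-epoch callback such as Keras' `LearningRateScheduler`, which calls the schedule with the epoch index at the start of each epoch.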

Fig. 8 shows that an LR scheduler policy with an adaptive learning rate performs better than a static LR policy. Thus, the base configuration is updated with a scheduler-based learning rate policy.

Figure 8: Comparison of performance for static and scheduler LR based training: (a) Training IoU score (b) Validation IoU score (c) Training loss (d) Validation loss.

3.2.3 Weight Initialization

While initializing the network with ImageNet-pretrained weights is a widely accepted method to accelerate training in deep learning, we also have to acknowledge the difference in target classes specific to the UAV dataset. Hence, we studied the effect of pre-training on ImageNet deng2009imagenet in comparison with random weight initialization. Fig. 9 shows that a network pre-trained on ImageNet outperforms random initialization by a large margin. Hence, we proceed with this weight initialization scheme for the base configuration.

Figure 9: Comparison of performance for weight-initialization strategies: (a) Training IoU score (b) Validation IoU score (c) Training loss (d) Validation loss.

3.2.4 Decoder-Only / Complete Training

As discussed in Section 2.2 the semantic segmentation networks have two components: (i) Encoder (ii) Decoder. In the interest of saving the cost of training operations, a commonly used strategy for training these networks is to freeze the encoder weights and only train the decoder michieli2019incremental. Another alternative is to train both components (Complete Training). Fig. 10 shows how complete training outperforms decoder training, possibly owing to the requisite learning for fine-tuned feature extraction from the encoder.

Figure 10: Comparison of performance between Complete and Decoder-Only training: (a) Training IoU score (b) Validation IoU score (c) Training loss (d) Validation loss.

3.3 Memory Profiling

So far, we have analyzed methods to update the base configuration in an effort to optimize accuracy. However, real-world edge deployment on any UAV platform requires the network to have a low memory footprint, low latency, and low power consumption. There are two strategies for achieving a lightweight implementation in terms of memory footprint: (i) network architecture selection to reduce model weights, and (ii) quantization of model weights and the computation graph. Both strategies are discussed below.

3.3.1 Model Weights

In this study, we compare the memory consumption of different models’ weights as shown in Table 5 in order to select the architecture with the best trade-off between model weight size and validation accuracy.

Model     Backbone            Val IoU   Memory (MB)
UNet      EfficientNetB3      0.64      68.17
UNet      InceptionResnetV2   0.63      236.75
UNet      MobileNetV2         0.59      30.7
FPN       EfficientNetB3      0.65      53.09
FPN       InceptionResnetV2   0.65      219.45
FPN       MobileNetV2         0.61      19.90
LinkNet   EfficientNetB3      0.61      52.49
LinkNet   InceptionResnetV2   0.63      220.75
LinkNet   MobileNetV2         0.54      15.81
PSPNet    EfficientNetB3      0.56      7.68
PSPNet    InceptionResnetV2   0.58      13.68
PSPNet    MobileNetV2         0.52      6.29
Table 5: Trade-off between Accuracy and Memory Consumption of various Architectures used in the study
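One way to read the trade-off in Table 5 is to keep only the architectures that are not dominated, i.e. for which no other architecture is at least as accurate and at least as small. The sketch below copies the values from Table 5; the Pareto-front framing and the function name are illustrative additions, not part of the original analysis. It also reproduces the memory ratio between the two architectures with the best validation IoU.

```python
# (model, backbone): (validation IoU, weight size in MB), copied from Table 5
table5 = {
    ("UNet", "EfficientNetB3"): (0.64, 68.17),
    ("UNet", "InceptionResnetV2"): (0.63, 236.75),
    ("UNet", "MobileNetV2"): (0.59, 30.7),
    ("FPN", "EfficientNetB3"): (0.65, 53.09),
    ("FPN", "InceptionResnetV2"): (0.65, 219.45),
    ("FPN", "MobileNetV2"): (0.61, 19.90),
    ("LinkNet", "EfficientNetB3"): (0.61, 52.49),
    ("LinkNet", "InceptionResnetV2"): (0.63, 220.75),
    ("LinkNet", "MobileNetV2"): (0.54, 15.81),
    ("PSPNet", "EfficientNetB3"): (0.56, 7.68),
    ("PSPNet", "InceptionResnetV2"): (0.58, 13.68),
    ("PSPNet", "MobileNetV2"): (0.52, 6.29),
}

def pareto_front(table):
    """Keep entries that no other entry beats on both IoU and memory."""
    front = []
    for key, (iou, mem) in table.items():
        dominated = any(
            o_iou >= iou and o_mem <= mem and (o_iou > iou or o_mem < mem)
            for o_key, (o_iou, o_mem) in table.items() if o_key != key
        )
        if not dominated:
            front.append(key)
    return front

front = pareto_front(table5)
ratio = table5[("FPN", "InceptionResnetV2")][1] / table5[("FPN", "EfficientNetB3")][1]
print(front)
print(f"FPN-InceptionResnetV2 / FPN-EfficientNetB3 memory ratio: {ratio:.2f}x")
```

With these numbers, FPN-EfficientNetB3 is the only architecture at the top validation IoU that remains on the front, and the ratio comes out at roughly 4.1x.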

3.3.2 Quantization based Memory Optimization

In order to reduce memory footprint and inference latency, we analyze the impact of converting the floating-point-32 (FP32) precision network to a more efficient data type, floating-point-16 (FP16). FP16 quantization is performed using NVIDIA TensorRT (TRT) vanholder2016efficient. Fig. 11 shows the prediction performance of the baseline and the TRT-optimized models. We observe that the TRT FP16 model performance closely matches that of the baseline FP32 model, making it a valid candidate for further latency analysis. Fig. 12 compares the prediction performance of FP16 architectures with the FPN model using all 3 encoder backbones. As expected from the earlier analysis (Table 4), MobileNetV2 shows noisy predictions, whereas EfficientNetB3 and InceptionResnetV2 are on par with each other.
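The numeric effect of storing a value in FP16 instead of FP32 can be illustrated with the Python standard library alone. The sketch below is not the TensorRT conversion pipeline (which also performs graph-level optimizations); it only shows that half precision halves the storage per value while introducing a small relative rounding error. The weight values are made up for the example.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Made-up example weights (assumption, not taken from any trained model).
weights = [0.1, -0.02, 1.5, 3.14159, 1e-3]
fp16 = [to_fp16(w) for w in weights]

# Largest relative rounding error introduced by the FP16 representation.
max_rel_err = max(abs(a - b) / abs(a) for a, b in zip(weights, fp16))

# Storage needed for the same values in each format.
bytes_fp32 = len(struct.pack(f"<{len(weights)}f", *weights))
bytes_fp16 = len(struct.pack(f"<{len(weights)}e", *weights))

print(fp16)
print(f"max relative error: {max_rel_err:.2e}")
print(f"FP32: {bytes_fp32} bytes, FP16: {bytes_fp16} bytes")
```

For values in the normal FP16 range the relative error stays below about 2^-11, which is consistent with FP16 predictions that closely track the FP32 baseline.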

Figure 11: Prediction performance of FPN-EfficientNetB3 model: (a) Original RGB image, (b) Ground truth labels, (c) Prediction with FP32 precision (Keras) (d) Prediction with FP16 precision (TF-TRT).
Figure 12: Impact of encoder-backbone choice on prediction performance with FPN TF-TRT FP16 model: (a) Original RGB image, (b) Ground truth labels, (c) EfficientNetB3, (d) InceptionResNetv2, (e) MobileNetv2.

3.4 Latency Profiling

Table 6 shows the latency analysis for architectures with FPN as the model on an input image of size 522 MB, spanning an area of 1.36 km². The analysis is performed on an NVIDIA RTX 8000. The updated base model is also run on an additional configuration, an NVIDIA RTX 2080 Ti, since it is more representative of an inference-oriented GPU. The Keras model uses the FP32 data type, whereas the Tensorflow-TensorRT (TF-TRT) implementations use the FP16 data type. The FP16 implementation shows a speedup over FP32, increasing its viability for deployment. An important point to note is that while Table 6 reports latencies for processing a complete ortho-rectified image from the dataset with >1000 slices of 320×320, the actual latency on the target platform would be far lower, depending on sensor resolution. Hence, we present the inference latency for an image size corresponding to a typical drone sensor (960×480) dji in Table 7. Among the edge devices considered in the study, the best performance is obtained with the NVIDIA Jetson Nano using the Tensorflow framework. At inference time, the throughput metrics of interest are as follows:

  1. File Size per unit time = 125 kB/second

  2. Area spanned per unit time = 320 m²/second
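The area throughput can be cross-checked from figures given elsewhere in the paper. The sketch below assumes the 10 cm/pixel ground resolution of Section 2.1 applies to a 960×480 frame and uses the Jetson Nano Tensorflow latency from Table 7; the per-frame file size implied by the 125 kB/second figure is derived here and is not stated in the text.

```python
GSD_M = 0.10               # ground resolution: 10 cm/pixel (Section 2.1)
WIDTH, HEIGHT = 960, 480   # typical drone sensor resolution (Table 7)
LATENCY_S = 14.40          # Jetson Nano, Tensorflow (Table 7)
FILE_RATE_KB_S = 125       # reported file-size throughput

# Ground area covered by one frame, in square metres.
area_m2 = (WIDTH * GSD_M) * (HEIGHT * GSD_M)

# Area processed per second of inference.
area_rate = area_m2 / LATENCY_S

# File size per frame implied by the reported kB/second figure (derived value).
implied_file_kb = FILE_RATE_KB_S * LATENCY_S

print(f"area per frame: {area_m2:.0f} m^2")
print(f"area throughput: {area_rate:.0f} m^2/s")
print(f"implied file size per frame: {implied_file_kb:.0f} kB")
```

Under these assumptions one frame covers 4608 m², which at 14.40 s per frame gives 320 m² per second.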

Backbone            Keras Model (s)   TF-TensorRT (s)
EfficientNetB3      36.1*             33.7*
EfficientNetB3      27.8              26.3
InceptionResnetV2   31.5              23.3
MobileNetV2         23.4              17.0
  • *This experiment is performed on NVIDIA RTX 2080 Ti. All other experiments are performed on NVIDIA RTX 8000.

Table 6: Impact of Encoder Backbone on Inference Latency for complete image for FPN architecture

Platform             Framework    Inference Latency (s)
Nvidia RTX 2080 Ti   Tensorflow   0.07
Nvidia RTX 2080 Ti   TF-TRT       0.10
Xeon CPU             Tensorflow   0.79 [1]
Nvidia Jetson Nano   Tensorflow   14.40
Nvidia Jetson Nano   TF-TRT       19.14 [2]
Raspberry Pi 3B+     Tensorflow   75.20
  • [1] Xeon CPU latency is added from the perspective of overall comparison.
  • [2] Estimated based on scaling factors derived from tryolabs.

Table 7: Inference latency for typical Drone Image Sensor Resolution (960×480) dji using FPN model and EfficientNetB3 Backbone.

4 Discussion

Section 3.1 shows that FPN models, with high-level semantic features at every level, outperform UNet models in terms of validation IoU score for every backbone (by 0.01 to 0.02; see Table 4). The FPN model with EfficientNetB3 as the encoder backbone has the highest validation IoU score of all architectures, making it the preferred choice. The choice of EfficientNetB3 over InceptionResnetV2 stems from the marginally higher IoU score, along with the memory footprint considerations shown in Table 5. With the choice of network architecture established, training parameter optimization was explored. In Section 3.2.1 we see that AdamW shows superior performance over the Adam optimizer because of its decoupled weight decay (Fig. 7). Section 3.2.2 shows that the scheduler-based learning rate policy outperforms the static LR policy by a significant margin due to its adaptive nature. Section 3.2.3 validates the standard method of initialization with ImageNet weights by comparing its performance against random initialization. This is a significant observation, contrary to the view that remote sensing applications differ considerably from vision applications and therefore necessitate a departure from standard initializations like ImageNet deng2009imagenet. Section 3.2.4 shows that complete training allows better feature adaptation than encoder freezing, and hence complete training is followed. The following are some important observations based on the trends shown in Section 3.3.1:

  1. All models with MobileNetV2 as the backbone are lightweight; however, their validation IoU is considerably lower than that of the other backbones.

  2. All InceptionResnetV2 models are very memory intensive; however, they perform better than or on par with the other backbones for the same model.

  3. Of all models, PSPNet has the lowest memory consumption; however, its performance is also considerably lower than that of all other models.

  4. If the models with the best validation IoU are compared in terms of memory, InceptionResnetV2 is 4x heavier than EfficientNetB3, making it a suboptimal choice for lightweight on-the-edge deployment.

The above trends further explain the motivation behind choosing the investigated architectures, since they provide a more complete picture of the trade-off between accuracy and memory footprint. MobileNetV2 is highly resource efficient for deployment and InceptionResnetV2 performs consistently and accurately, while EfficientNetB3 strikes a balance between accuracy and memory consumption.

Section 3.3.2 shows the performance of FP16-optimized models to be on par with FP32 models in terms of accuracy. This observation, together with the memory footprint savings, motivates investigating the latency performance of the TF-TRT models.

Section 3.4 shows lower inference times for TF-TRT optimized models (improvements in the range of 1.05× to 1.4×), except in the case of EfficientNetB3 on the NVIDIA RTX 2080 Ti, possibly owing to a memory bottleneck. The NVIDIA Jetson Nano emerges as the preferable choice based on the edge device estimates in Table 7.
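The speedup factors can be recomputed directly from the latencies in Table 6 and Table 7. The sketch below simply divides the Keras/Tensorflow latency by the TF-TRT latency for each row; the dictionary keys are labels chosen for this example.

```python
# Latencies in seconds: (Keras FP32, TF-TRT FP16), copied from Table 6
table6 = {
    "EfficientNetB3 (RTX 2080 Ti)": (36.1, 33.7),
    "EfficientNetB3 (RTX 8000)": (27.8, 26.3),
    "InceptionResnetV2 (RTX 8000)": (31.5, 23.3),
    "MobileNetV2 (RTX 8000)": (23.4, 17.0),
}

# Latencies in seconds: (Tensorflow, TF-TRT), copied from Table 7 (960x480 frame)
table7 = {
    "RTX 2080 Ti": (0.07, 0.10),
    "Jetson Nano": (14.40, 19.14),
}

def speedups(table):
    """Speedup of TF-TRT over the baseline; values below 1 mean a slowdown."""
    return {name: base / trt for name, (base, trt) in table.items()}

full_image = speedups(table6)
single_frame = speedups(table7)

for name, s in {**full_image, **single_frame}.items():
    print(f"{name}: {s:.2f}x")
```

On the full-image benchmark all four configurations show a speedup between roughly 1.06x and 1.38x, whereas for the single 960×480 frame in Table 7 the TF-TRT rows are slower than their Tensorflow counterparts.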

4.1 Future Scope

While the study contributes several significant observations, we aim to address the following aspects in future and ongoing work:

  1. Investigate optimization strategies in order to address UAV dataset imbalance.

  2. TRT edge-AI hardware latency estimates are approximate to a margin of 10%. Accurate latency and power measurements based on actual deployed characterization will yield more precise values.

  3. Since no gold-standard reference dataset currently exists for training classifiers on UAV images, further investigation into the impact of weight initialization, and comparison with ImageNet-based initialization, may be advisable.

  4. Power profiling of the networks may give more insight, since the latency of operations is not representative of energy costs. This would further boost the overall impact of optimization in terms of EDP (Energy Delay Product).

5 Conclusions

In this paper, we present a detailed benchmarking study of semantic segmentation models in the context of UAV applications on the DroneDeploy dataset. We also present the first demonstration of semantic segmentation based on EfficientNet architectures for remote sensing applications. Based on extensive exploration, the best configuration is found to be Model: FPN, Backbone: EfficientNetB3, Pretraining: ImageNet weights, Optimizer: AdamW with a learning rate scheduler, and complete training, which achieves an IoU score of 0.65 and an F1-score of 0.71 on the validation dataset. We also profile memory usage and latency for each model and optimize the models for inference using TensorRT with FP16 precision. Based on this, we achieve memory savings of 4.1x and a latency improvement of 10% compared to Model: FPN and Backbone: InceptionResnetV2.

Conceptualization: V.P., N.B. and M.S.; Methodology & Investigation: V.P. & N.B.; SW & Validation: V.P., N.B., S.N.; Writing, Editing and Review: all authors; Supervision: M.S.

This research received no external funding.

The authors would like to acknowledge support provided by CYRAN AI Solutions for this study.

The authors declare no conflict of interest.

The following abbreviations are used in this manuscript:
FPN      Feature Pyramid Network
PSPNet   Pyramid Scene Parsing Network
LR       Learning Rate
TRT      TensorRT