The analysis of data collected from airborne sensors, such as aerial images and videos, is becoming increasingly vital in many applications such as scene understanding, studying ecological variations, tracking of vehicles, animals and humans, surveying urban development, and surveillance. Besides, aerial image analysis has been used for assessing the damage immediately after a natural disaster. Typically, aerial images are captured by different imaging modalities, such as Synthetic Aperture Radar (SAR) and hyper-spectral imaging, which are present on board a satellite. Recently, Unmanned Aerial Vehicles (UAVs) have also been widely used for various applications such as disaster management, urban planning, wildlife tracking and agricultural planning. Owing to their rapid deployment and customized flight paths, UAV images and videos can provide additional finer details and complement satellite-based image analysis approaches for critical applications such as disaster response. Besides, UAV images could be utilized along with satellite images for better urban planning or geographical information updating. Typically, UAV image/video analysis is limited to object detection and recognition tasks such as building detection and road segmentation. However, to the best of our knowledge, there are limited works on semantic segmentation of UAV images or videos.
Segmentation is a crucial task for scene understanding and has been used for various applications. Semantic segmentation is the process of assigning predetermined class labels to all the pixels in an image. Semantic segmentation of an image is a widely studied topic in computer vision. However, extending semantic segmentation to video is a non-trivial task. One of the challenges in video semantic segmentation is to find a way to incorporate temporal information. Figure 1 illustrates the importance of temporal information in the context of video acquired by a UAV. The poor segmentation of the greenery class observed in the keyframe can be improved by embedding temporal information from the past frames.
In a typical video semantic segmentation approach, a sequential model is added on top of the frame-wise semantic segmentation module, thus creating an overhead. Besides, feature/label propagation, which re-utilizes features/labels from previous frames, has also been used to capture the temporal information. However, these methods depend on establishing pixel correspondence between two frames. Recently, a video prediction based approach has been used to generate new training images and has achieved state-of-the-art performance for video semantic segmentation. However, this approach uses an additional video prediction model to learn the motion information.
This work focuses on semantic segmentation of videos acquired using UAVs. The proposed method demonstrates that a simple modification in the encoder branch of a CNN is able to capture the temporal information from the video, thus eliminating the need for an extra sequential model for computing correspondence for feature/label propagation.
A new encoder-decoder based CNN architecture (UVid-Net) proposed in this work has two parallel branches of CNN layers for feature extraction. This new encoding path captures the temporal dynamics of the video by extracting features from multiple frames. These features are further processed by the decoder for the estimation of class labels. The proposed algorithm utilizes a new decoding path that retains the features of the encoder layers for the decoder. The contributions of the paper can be summarized as follows:
A new encoding path is presented consisting of two parallel branches for extracting temporal and spatial features for video semantic segmentation.
A modified up-sampling path is proposed which uses a feature retainer module to capture fine-grained features for accurate classification of class boundary pixels.
An extended version of a UAV video semantic segmentation dataset is presented. This dataset is an extension of the ManipalUAVid dataset and contains additional videos captured at new locations. Fine pixel-level annotations are provided for four background classes, namely greenery, roads, constructions and water bodies, following the annotation policy of the original dataset. The dataset is available for download at https://github.com/uverma/ManipalUAVid
This work shows that UVid-Net trained on a larger urban street scene dataset for semantic segmentation can be fine-tuned for segmentation of UAV aerial videos.
II Related works
Video semantic segmentation is generally addressed using traditional energy-based algorithms such as CRF, or deep learning-based algorithms such as CNN, RNN and LSTM. One of the challenges in video semantic segmentation is to embed temporal information. Learning the dynamics of the video aids in improving the performance of video semantic segmentation by ensuring temporal consistency. Despite this, several previous works simply extended the traditional image semantic segmentation approach to video semantic segmentation. These approaches segment all the frames independently of each other and thus fail to capture the dynamics of the video. Recent advances in video semantic segmentation that utilize spatio-temporal information can be roughly categorized into two groups: deep learning based methods and CRF based methods.
There exist several CNN based semantic segmentation approaches. Popular CNN based algorithms use an encoder-decoder architecture for learning the various patterns of the data and localizing the class labels. These algorithms depend on a large, densely annotated dataset. However, obtaining a large, finely annotated dataset is expensive, time-consuming and challenging. A few authors used GANs to learn the dynamics of the video and perform video scene parsing. A GAN can be trained to parse future frames as well as to label images. Besides, temporal dynamics are also learnt using a sequential model such as an LSTM. Moreover, LSTMs are also used to select keyframes for video scene parsing. A few authors explored the attention mechanism with CNNs to perform video semantic segmentation. Optical flow is another popular choice for establishing temporal correspondence between two consecutive frames. A few studies proposed to predict labels and images jointly to efficiently train deep learning models with less training data. However, the dependence of deep learning algorithms on large annotated datasets limits their development for other contexts such as UAV imagery.
Many researchers have explored Conditional Random Fields (CRF) for incorporating spatio-temporal information in video semantic segmentation. A CRF is a graphical model that captures long-range spatial relationships between pixels. Hence it is widely used in the literature for context-aware scene parsing. A CRF can be extended to incorporate temporal information, but this depends on the reliable estimation of temporal links. In general, optical flow is widely used to establish the temporal links and propagate features and labels. However, estimating accurate optical flow is an overhead for real-time video semantic segmentation. Several authors explored higher-order potential energies for video semantic segmentation by defining potential energy on temporal links. Class labels in a CRF are inferred using an inference algorithm which is computationally intensive and impractical for video processing.
The existing state-of-the-art method for video semantic segmentation predicts frames and their labels from historical data. However, this approach depends on a reliable estimation of temporal correspondence between two consecutive frames. Temporal links are generally established using dense optical flow-based methods. Optical flow estimation is an overhead, and the accuracy of semantic segmentation depends on the accuracy of the optical flow estimate. Besides, errors in optical flow estimation can lead to misaligned predicted labels in future frames, thus affecting the accuracy of the segmentation.
In this work, a new encoder module is proposed which can capture the temporal dynamics of the video. The proposed work eliminates the need for computing optical flow, thus reducing the overhead.
In this work, a two-branch encoder is proposed for incorporating temporal smoothness in video semantic segmentation. Multi-branch CNNs are popularly used in video processing due to their ability to capture the relationship between frames in a sequence. Several authors used multi-branch CNNs to perform video classification, action recognition and video captioning. A few authors utilized multi-branch CNN architectures to provide an attention mechanism, and others explored multi-branch CNNs to extract features from frames. To the best of our knowledge, multi-branch CNNs have not been explored for capturing temporal dynamics in semantic segmentation of UAV videos.
This section describes the encoder (Section III-A) and decoder (Section III-B) modules of the proposed approach. Figures 2 and 3 show the proposed architecture with the U-Net and ResNet-50 feature extractors respectively. In a typical video, the changes between two consecutive frames are minimal, and hence processing every frame is redundant and time-consuming for video semantic segmentation. However, selecting keyframes at a constant interval may result in the loss of useful information required for temporal consistency. This would be detrimental to video semantic segmentation methods which depend on temporal features. Hence, in the present study, the keyframes are identified using a shot boundary detection approach (on average, a shot consists of 15-20 frames). Using shot boundary detection to identify the keyframes dynamically ensures that frames containing useful information are not ignored.
Let $f_i^{s}$ denote the $i$-th frame of the $s$-th shot in a video. The inputs to the two branches of UVid-Net (Figures 2 and 3) are two frames from two consecutive shots: $f_{n/2+1}^{s-1}$ (upper branch) and $f_{n/2}^{s}$ (lower branch), where $n$ represents the total number of frames in a shot. These two frames correspond to the frame immediately after the middle frame of the previous shot and the middle frame of the current shot, respectively. Together, they produce the semantic segmentation for the middle frame of the current shot, $f_{n/2}^{s}$. For the first shot, since there is no prior shot, the first frame of the video ($f_1^{1}$) and the middle frame of the first shot ($f_{n/2}^{1}$) are used as input to the network.
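The selection of the two input frames can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `select_input_pairs` is hypothetical, shots are assumed to be given as lists of frame identifiers, and the middle frame of an n-frame shot is assumed to sit at 0-based position n // 2, since the exact rounding convention is not stated in the text.

```python
def select_input_pairs(shots):
    """Return (upper_input, lower_input) frame ids for each shot.

    shots: list of non-empty lists of frame ids, in temporal order.
    Assumption: the middle frame of a shot with n frames is shot[n // 2].
    """
    pairs = []
    for k, shot in enumerate(shots):
        keyframe = shot[len(shot) // 2]  # middle frame of the current shot
        if k == 0:
            # No previous shot: use the very first frame of the video.
            upper = shot[0]
        else:
            prev = shots[k - 1]
            mid_prev = len(prev) // 2
            # Frame right after the previous shot's middle frame; fall back to
            # the last frame if the previous shot is too short.
            upper = prev[min(mid_prev + 1, len(prev) - 1)]
        pairs.append((upper, keyframe))
    return pairs


# Three shots of 5 frames each (frame ids 0..14).
shots = [list(range(0, 5)), list(range(5, 10)), list(range(10, 15))]
pairs = select_input_pairs(shots)
```

Each pair holds the upper-branch frame first and the keyframe (lower-branch frame) second; the segmentation produced by the network is for the second element.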
In the rest of this document, the middle frame of a shot is considered as the keyframe, as per the policy followed for UAV video semantic segmentation .
In this work, the performance of two different architectures (U-Net and ResNet-50 encoders) is studied for feature extraction. The U-Net encoder consists of convolutional and max-pooling layers for feature extraction. The ResNet-50 feature extractor consists of residual blocks, which help in alleviating the vanishing gradient problem. These two feature extractors are different, and comparing their performance within a multi-branch CNN provides insight into the robustness of the model. In the following text, UVid-Net (U-Net encoder) and UVid-Net (ResNet-50 encoder) refer to the proposed architecture with the U-Net encoder and the ResNet-50 encoder module respectively. It may be noted that the decoder module is identical for both architectures.
III-A1 U-Net Encoder
The upper branch of the encoder (Figure 2) contains four blocks. Each block consists of two consecutive convolution layers [48]. The resulting activation is then passed through a further convolution layer, which is additionally introduced to reduce the dimensions of the feature maps. Lastly, a strided max-pooling layer is applied to extract the most prominent features for the subsequent layers. As in the traditional U-Net, the number of feature maps doubles after each max pooling operation, starting with 64 feature maps for the first block.
The lower branch of the encoder also consists of four blocks. Each block in the lower branch has two successive sets of a convolution layer followed by batch normalization and a ReLU activation function. This is followed by a max-pooling layer which extracts the most prominent features. Similar to the upper branch, the number of feature maps doubles after each max pooling operation.
The features extracted by the upper and lower branches of the encoder are fed to two separate bottleneck layers, each consisting of a convolution with 1024 feature maps. Finally, the activations of both branches are concatenated and fed to the decoder.
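As a small worked example of the channel widths described above, the sketch below lists the number of feature maps per encoder block and the width of the tensor handed to the decoder. The helper name `encoder_channels` is hypothetical, and the channel-wise concatenation of the two bottleneck outputs (giving 2 x 1024 feature maps) is an assumption, since the text does not state the concatenation axis.

```python
def encoder_channels(first_block=64, num_blocks=4):
    """Feature-map counts per encoder block: doubles after each max pooling."""
    return [first_block * 2 ** i for i in range(num_blocks)]


BOTTLENECK_CHANNELS = 1024  # per branch, as stated in the text

upper = encoder_channels()
lower = encoder_channels()

# Assumption: the two bottleneck activations are concatenated channel-wise.
decoder_input_channels = 2 * BOTTLENECK_CHANNELS
```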
III-A2 ResNet-50 encoder
Besides the U-Net based encoder described above, the ResNet-50 architecture (Figure 3) could also be used as a branch in the encoder. ResNet-50 is a CNN architecture proposed for image classification. It introduced the idea of skipping a few layers to learn an identity mapping. ResNet-50 has also been widely used as a feature extractor for transfer learning applications.
In the present study, the upper and lower branches consist of identical ResNet-50 architectures to extract features (Figure 3). This architecture consists of an initial convolution operation with kernel size (7x7) followed by a batch normalization layer and a ReLU activation function. Subsequently, a max-pool operation with kernel size (3x3) is applied. Following the max-pool operation, the architecture consists of four stages. The first stage consists of three residual blocks, each containing three layers with 64, 64 and 128 filters. The second stage consists of 4 residual blocks with three layers each, using 128, 128 and 256 filters. The third stage consists of 6 residual blocks with three layers each, using 256, 256 and 512 filters. The fourth stage consists of 3 residual blocks with three layers each, using 512, 512 and 1024 filters. The first residual block of stages 2, 3 and 4 uses a strided operation to reduce the input width and height by a factor of 2. The first and last layers in every residual block have a (1x1) kernel and the second layer has a (3x3) kernel. All residual blocks contain an identity connection, which helps alleviate the vanishing gradient problem.
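The stage layout listed above can be checked with a short calculation. The sketch below encodes the stages exactly as given in the text (the name `STAGES` is illustrative) and counts the convolution layers. The remark that the fiftieth layer of the original ResNet-50 is its fully connected classification layer is general background on that network rather than something stated here. The downsampling factor covers only the three strided stages, because the strides of the initial convolution and max-pool are not given.

```python
# (number of residual blocks, filters of the three layers) per stage,
# copied from the description in the text.
STAGES = [
    (3, (64, 64, 128)),
    (4, (128, 128, 256)),
    (6, (256, 256, 512)),
    (3, (512, 512, 1024)),
]

# Every residual block contributes three convolution layers.
conv_layers_in_stages = sum(blocks * len(filters) for blocks, filters in STAGES)

# Add the initial 7x7 convolution that precedes the four stages.
total_conv_layers = 1 + conv_layers_in_stages

# Stages 2, 3 and 4 each halve the width and height once.
stage_downsampling = 2 ** 3
```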
The activations of the upper and lower ResNet-50 branches are concatenated and then used by the decoder to perform semantic segmentation.
In an encoder-decoder based architecture, the consecutive max pooling operations in the encoder reduce the feature map size and result in a loss of spatial resolution. To compensate for this loss of information, skip connections from encoding layers to decoding layers are popularly used. Networks like U-Net use a concatenation operation, where the feature maps from the last layer of each block in the encoder are stacked with the feature maps of the corresponding decoding layers. Here, we argue that element-wise multiplication of the feature maps from the last layer of each encoder block with those of the corresponding decoding layers results in a better representation of the feature maps. The module which performs this element-wise multiplication is called the feature retainer, since it retains the features of the corresponding encoding path. In addition to improving the segmentation, the proposed feature retainer module reduces the number of learnable parameters as compared to the concatenation operation: multiplication leaves the number of feature maps unchanged, whereas concatenation increases the number of input feature maps of the subsequent convolution layer. The experimental results (Section IV-C) show that the element-wise multiplication of the encoder feature map with the corresponding decoder feature map produces a smoother segmentation map.
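To illustrate why the merge operation affects the parameter count, the following sketch compares the size of the first convolution after the skip connection under both schemes. It is a back-of-the-envelope calculation rather than the actual UVid-Net configuration: the 3x3 kernel, the 512 channels and the helper names are assumptions chosen for illustration.

```python
def conv_params(in_channels, out_channels, kernel=3, bias=True):
    """Learnable parameters of a 2-D convolution layer."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def first_conv_after_merge(channels, merge, kernel=3):
    """Parameters of the first convolution that follows the skip merge.

    Encoder and decoder maps are assumed to have `channels` channels each.
    """
    if merge == "concat":
        in_channels = 2 * channels  # stacked encoder + decoder maps
    elif merge == "multiply":
        in_channels = channels      # element-wise product keeps the shape
    else:
        raise ValueError(f"unknown merge mode: {merge}")
    return conv_params(in_channels, channels, kernel)


concat_params = first_conv_after_merge(512, "concat")
multiply_params = first_conv_after_merge(512, "multiply")
saved = concat_params - multiply_params
```

Under these assumptions the weights of that convolution are halved by the multiplication-based merge, and the same saving recurs at every decoder block that receives a skip connection.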
As discussed earlier, the decoder module is identical for both UVid-Net (U-Net encoder) and UVid-Net (ResNet-50 encoder) (Figures 2 and 3). The decoder path of the proposed architecture contains four blocks. Each of these blocks consists of an upsampling layer with stride 2, followed by a convolution layer. The output of this is passed through a feature retainer module, which multiplies the corresponding feature maps of the encoder (lower branch) and the decoder. Note that the last layer of each stage/block of the lower branch encoder is merged with the corresponding decoder layers. This is followed by convolution layers and a ReLU activation layer. At the last layer, a softmax layer is applied to obtain the probability of each pixel belonging to each class.
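The two operations that characterise the decoder, the element-wise product of the feature retainer and the final per-pixel softmax, can be sketched on toy data as follows. This is a minimal pure-Python illustration with hypothetical function names and made-up values, not the network's implementation; real feature maps are multi-channel tensors.

```python
import math


def feature_retainer(encoder_map, decoder_map):
    """Element-wise product of two equally shaped 2-D feature maps."""
    return [
        [e * d for e, d in zip(enc_row, dec_row)]
        for enc_row, dec_row in zip(encoder_map, decoder_map)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


encoder_map = [[1.0, 0.0], [0.5, 2.0]]
decoder_map = [[3.0, 4.0], [2.0, 0.5]]
retained = feature_retainer(encoder_map, decoder_map)

CLASSES = ["greenery", "road", "construction", "water bodies"]
pixel_scores = [2.0, 0.5, 0.1, -1.0]  # hypothetical scores for one pixel
probs = softmax(pixel_scores)
label = CLASSES[probs.index(max(probs))]
```

The predicted label of a pixel is the class with the highest softmax probability.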
IV Results and discussion
In the present study, an extended version of the ManipalUAVid dataset is used to evaluate the performance of UVid-Net for UAV video semantic segmentation. The proposed architecture is trained using the categorical cross-entropy loss with the Adam optimizer. In this section, it is shown experimentally that the proposed encoder module is able to incorporate temporal smoothness for video semantic segmentation (Section IV-B). Further, the effectiveness of the feature retainer in the decoder module is demonstrated in Section IV-C. Finally, the performance of the proposed architecture is compared with the state-of-the-art methods for video semantic segmentation (Section IV-D).
IV-A Dataset: ManipalUAVid
This paper presents an extended version of the ManipalUAVid dataset for semantic segmentation of UAV videos. This extended dataset consists of new videos captured at additional locations, with annotations provided for keyframes. The pixel-level annotations are provided for four background classes, viz. greenery, construction, road and water bodies. The videos are captured at frames per second and at a resolution of pixels. The keyframes are identified by following the shot boundary detection approach and, on average, a shot consists of 15-20 frames. More details can be found in the original ManipalUAVid publication. The original ManipalUAVid contains 33 videos, with annotations provided for 667 keyframes, and was released together with the performance of semantic segmentation algorithms that analyse each keyframe individually. In that earlier version, only the last two keyframes of each video are in the test split, which might not be sufficient to observe temporal smoothness or any error accumulated over time in the video. Therefore, in this work, ManipalUAVid is extended by incorporating four new videos (44 keyframes in total) which are entirely in the test split. Besides, the training-test split is slightly modified so that a greater number of frames (4-5 frames) per video is included in the test split of the updated dataset. This aids in evaluating video semantic segmentation models for temporal consistency.
Following the same protocol, the performance of UVid-Net is evaluated by comparing the keyframes segmented using UVid-Net with the ground truth. In ManipalUAVid, the middle frame of each shot is considered the keyframe. As discussed earlier, two frames are provided as the input to UVid-Net for semantic segmentation of the keyframe. The dataset is divided into train, validation and test splits. The following metrics are computed to evaluate the performance of the proposed method: mean Intersection over Union (mIoU), Precision, Recall and F1-score. It may be noted that the values of the evaluation metrics obtained in this study differ from those reported for the earlier version of the dataset, due to the additional videos added to the dataset.
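For reference, the four metrics can be computed from a pixel-level confusion matrix as sketched below. The function names are illustrative, and the sketch reports per-class values; how precision, recall and F1-score are averaged over classes in the reported results is not stated in the text, so no averaging scheme is assumed for them here.

```python
def per_class_metrics(confusion):
    """confusion[t][p] = number of pixels of true class t predicted as class p."""
    n = len(confusion)
    results = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[t][c] for t in range(n)) - tp  # predicted c, but wrong
        fn = sum(confusion[c]) - tp                       # true c, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        results.append(
            {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
        )
    return results


def mean_iou(confusion):
    """Unweighted mean of the per-class IoU values."""
    metrics = per_class_metrics(confusion)
    return sum(m["iou"] for m in metrics) / len(metrics)


# Toy 2-class example: rows are true classes, columns are predictions.
confusion = [[8, 2],
             [1, 9]]
metrics = per_class_metrics(confusion)
miou = mean_iou(confusion)
```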
|Method||Precision||Recall||F1-score||mIoU||Parameters||FLOPs|
|DeepLabV3+ (MobileNet-V2 backbone)||0.85||0.85||0.85||0.65||2,142,276||4,218,531|
|Video propagation and label relaxation||0.89||0.88||0.88||0.72||137,100,096||91,055,000,000|
|FCN-8 + ConvLSTM||0.86||0.86||0.85||0.61||134,629,100||269,821,618|
|U-Net + ConvLSTM||0.90||0.90||0.90||0.76||21,695,976||62,842,913|
|DeepLabV3+ + ConvLSTM||0.87||0.86||0.85||0.62||2,244,520||5,011,257|
|UVid-Net (U-Net encoder)|
|UVid-Net (ResNet-50 encoder)||0.90||0.89||0.89||0.72||44,740,420||133,871,366|
|Method||Greenery||Road||Construction||Water bodies||mIoU|
|Video propagation and label relaxation||0.82||0.80||0.67||0.61||0.72|
|FCN-8 + ConvLSTM||0.84||0.75||0.49||0.39||0.61|
|U-Net + ConvLSTM||0.87||0.82||0.56||0.82||0.76|
|DeepLabV3+ + ConvLSTM||0.81||0.76||0.57||0.28||0.62|
|UVid-Net (U-Net encoder)||0.87||0.86||0.60||0.86||0.79|
|UVid-Net (ResNet-50 encoder)||0.88||0.82||0.50||0.69||0.72|
|Method||Precision||Recall||F1-score||mIoU||Parameters||FLOPs|
|U-Net encoder (Concatenation)||0.91||0.90||0.90||0.78||26,878,472||161,093,886|
|ResNet-50 encoder (Concatenation)|
|U-Net encoder (Multiplication)|
|ResNet-50 encoder (Multiplication)|
IV-B Evaluation of encoder
The proposed encoder consists of two branches which simultaneously extract features from two consecutive keyframes of a video. Two encoder variants of UVid-Net (U-Net encoder and ResNet-50 encoder) are considered in this work. To evaluate the performance of the proposed architecture, we compare it with the traditional U-Net architecture (with a single encoder branch). Figure 4 shows the comparison of the segmentation results obtained using the single-branch U-Net and the two-branch UVid-Net. Since the single-branch U-Net is an image semantic segmentation algorithm, it fails to capture the temporal information and hence produces temporally inconsistent labels. In contrast, the proposed architecture is able to capture the temporal dynamics between the two keyframes and produces more accurate results. For instance, the U-Net with a single-branch encoder incorrectly classifies a few pixels belonging to the road/greenery class as construction (shown in yellow circles). However, the proposed method based on the two-branch encoder correctly classifies these pixels as road/greenery, thus producing a temporally smoother segmentation result, as shown in Figure 4 (d) and Figure 4 (e).
Table I and Table II compare the performance of the single-branch encoder U-Net with the proposed UVid-Net in terms of mIoU, Precision, Recall and F1-score. It is observed that the per-class IoU of UVid-Net (U-Net encoder) is higher than that of the single-branch U-Net for all four classes. Moreover, from Table I it is observed that the proposed method has higher recall and precision scores than the single-branch U-Net, which indicates that it has produced fewer false positives and false negatives. The above results demonstrate the effectiveness of the two-branch encoder module in acquiring temporal information, resulting in a more accurate segmentation as compared to the classical single-branch encoder U-Net. It may be noted that a single-branch U-Net with a ResNet-50 encoder suffered from high variance (over-fitting), even in the presence of regularization, with a training and validation accuracy of 0.98 and 0.66 respectively.
In addition to the qualitative and quantitative evaluation of the encoder, the softmax outputs of U-Net and UVid-Net (U-Net encoder) are also analysed in Figure 6. It can be observed that UVid-Net assigns a higher probability score to pixels for their actual class than U-Net does. The high probability score reduces uncertainty and produces a more accurate segmentation. For example, a high probability score for the greenery class is obtained for pixels belonging to trees using UVid-Net (Figure 6). In addition, U-Net, which lacks temporal information, has produced a higher construction class probability for pixels belonging to greenery at the boundaries (refer to the $6 \times 6$ representative regions in Figure 6). In contrast, UVid-Net, which utilizes features propagated from the previous frame, has produced a very low construction class probability for greenery pixels at the class boundaries.
IV-C Evaluation of decoder
The decoder of the proposed UVid-Net architecture consists of skip connections from the lower branch of the encoder to the corresponding decoder layers. An element-wise multiplication operation is utilized to combine the activations of the encoder and decoder layers. The experimental evaluation of the proposed feature retainer module against the concatenation approach shows a marginal increase in mIoU for UVid-Net with the U-Net encoder (Tables III and IV). It can be observed that the per-class IoU for road and water bodies is higher for the multiplication operation than for concatenation. Further, the other two classes (greenery and construction) perform competitively in terms of per-class IoU. Moreover, the qualitative evaluation shows that a more accurate segmentation is obtained using the proposed approach than with concatenation. Figure 5 shows a few images where finer segmentation boundaries are obtained using UVid-Net (multiplication) with the U-Net encoder as compared to UVid-Net (concatenation). It may be observed in Figure 5 (first two rows) that pixels from the road class have been misclassified as the construction class using UVid-Net (concatenation), while a precise greenery-road boundary is obtained using UVid-Net (multiplication). The improvement obtained using the proposed feature retainer module is more prominent for UVid-Net (ResNet encoder). An mIoU of 0.72 is obtained with UVid-Net (ResNet encoder) with the multiplication operation as compared to 0.53 with concatenation.
Moreover, the feature retainer module reduces the number of FLOPs along with the number of parameters. It is observed that UVid-Net (multiplication) requires 142,291,716 FLOPs while UVid-Net (concatenation) requires 161,093,892 FLOPs for the U-Net encoder (approximately 11% fewer FLOPs). These results show that an accurate segmentation is obtained using UVid-Net (multiplication) with less computational overhead. Besides, the element-wise multiplication operation in UVid-Net also reduces the number of learnable parameters in the network as compared to concatenation. This result is significant since the proposed architecture produces a higher mIoU (0.79) with a reduced number of parameters. Indeed, the reduced complexity and number of parameters of UVid-Net as compared to the traditional concatenation operation make it well suited to UAV-based IoT applications.
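The relative FLOP saving quoted above follows directly from the two counts given in the text, as the short check below shows (the variable names are illustrative).

```python
flops_multiplication = 142_291_716  # UVid-Net (multiplication), U-Net encoder
flops_concatenation = 161_093_892   # UVid-Net (concatenation), U-Net encoder

# Fraction of the concatenation FLOPs saved by the multiplication variant.
relative_saving = (flops_concatenation - flops_multiplication) / flops_concatenation
saving_percent = 100 * relative_saving
```

The saving comes out between 11% and 12%, consistent with the figure of about 11% stated in the text.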
IV-D Comparison with state-of-the-art
The proposed approach is compared with the existing state-of-the-art image semantic segmentation methods, viz. U-Net, FCN8 and DeepLabV3+. However, these methods do not incorporate temporal information and segment each keyframe independently. Therefore, the proposed method is also compared with the approach that is state-of-the-art on the Cityscapes dataset and includes temporal information. This method uses a video prediction model to propagate labels to the immediate neighbouring frames in order to create more image-label pairs. Besides, the performance of UVid-Net is compared with a previously proposed UAV video semantic segmentation approach. This approach uses a Convolutional Long Short-Term Memory (Conv-LSTM) module to capture the temporal dynamics of the video. It may be noted that this method independently segments each frame using FCN8, and the resulting frames are then passed through the Conv-LSTM module as a post-processing step. In addition to combining FCN8 + Conv-LSTM, we also compare the performance obtained by segmenting individual frames with U-Net/DeepLabV3+ and then post-processing them with the Conv-LSTM module, resulting in two additional methods, viz. U-Net + ConvLSTM and DeepLabV3+ + ConvLSTM.
The proposed architecture is quantitatively compared with the existing approaches mentioned above. Table I compares performance metrics such as precision, recall, F1-score and mean Intersection over Union (mIoU), while Table II compares the per-class IoU and mIoU of the existing methods with the proposed method. As discussed earlier, the image semantic segmentation approaches (U-Net, FCN8, DeepLabV3+) segment each keyframe independently and fail to capture temporal cues. It can be observed that an mIoU of 0.79 is obtained by the proposed approach as compared to an mIoU of 0.75, 0.64 and 0.65 for U-Net, FCN8 and DeepLabV3+ respectively. The proposed approach thus outperforms the existing image segmentation approaches. Besides, it can be observed from Figure 8 that UVid-Net produces a more accurate segmentation map with smoother segmentation boundaries as compared with the other approaches. The proposed UVid-Net incorporates temporal information by merging the features extracted from two different frames of a video and thereby outperforms the existing image semantic segmentation algorithms.
In addition to the image segmentation algorithms, the proposed approach is also compared with video semantic segmentation algorithms, viz. Video Propagation and Label Relaxation, U-Net + ConvLSTM, FCN8 + ConvLSTM and DeepLabV3+ + ConvLSTM. It can be seen (Table I) that UVid-Net (U-Net encoder) achieves an mIoU of 0.79 and an F1-score of 0.91, outperforming the other video segmentation approaches. Besides, UVid-Net (ResNet-50 encoder) performs competitively and achieves an F1-score of 0.89 and an mIoU of 0.72. To study the performance of the proposed method for each class, the per-class IoU is computed as shown in Table II. It can be observed that UVid-Net (U-Net encoder) achieves a higher IoU than the other approaches for the road and water bodies classes. It may be noted that the dataset is unbalanced, with very few pixels corresponding to the water bodies and construction classes. Therefore, the higher IoU (0.86) for the water bodies class as compared to other approaches demonstrates the robustness of the proposed method. Besides, UVid-Net (U-Net encoder) performs competitively for the construction class with an IoU of 0.60, even in the presence of limited data. Figure 8 compares the segmentation results obtained using the proposed approach and the existing methods. It can be observed that a more accurate segmentation is obtained using the proposed method as compared to the existing methods. For instance, the proposed method is able to accurately identify construction, greenery and water bodies, especially in the fifth and eighth rows of Figure 8.
U-Net + ConvLSTM performs competitively on the ManipalUAVid dataset with an mIoU of 0.76. However, U-Net + ConvLSTM fails to capture the temporal dynamics, as shown in Figure 7. In comparison, UVid-Net (U-Net encoder) produces a more accurate segmentation, especially for the water bodies class.
In addition to the improvement in performance, UVid-Net (U-Net encoder) has a lower number of parameters than FCN-8 and FCN-8 + ConvLSTM, as shown in Table I. Further, UVid-Net (U-Net encoder) has a number of parameters comparable with the other models, with the exception of DeepLabV3+, which uses a MobileNet-V2 backbone. The lower number of parameters of UVid-Net reduces the dependency on the availability of large amounts of training data.
IV-E Evaluation of transfer learning
The availability of a manually annotated training dataset of sufficient size is a challenge in supervised deep learning based approaches. A widely used approach in this scenario is to train the CNN on a large dataset and then transfer the learned weights to the task at hand.
In this work, the transfer learning approach has been studied on UVid-Net (U-Net encoder) for semantic segmentation of UAV aerial videos. UVid-Net (U-Net encoder) is initially trained on the Cityscapes dataset to predict eight categorical classes (flat, human, vehicle, construction, object, nature, sky and void) using the Adam optimizer. This dataset is selected because its classes are similar to those of ManipalUAVid. Moreover, it consists of 3000 training images, which is more than in the ManipalUAVid dataset, and helps in learning more generalized features. Subsequently, the last layer of the model is re-trained (with the other layers frozen) on the ManipalUAVid dataset to predict four classes (greenery, road, construction and water bodies). The performance metrics of UVid-Net (U-Net encoder) with transfer learning are shown in Tables I and II. It is observed that UVid-Net has performed competitively on the greenery, road and construction classes. However, a low per-class IoU is observed on the water bodies class. This result was expected since the Cityscapes dataset does not contain any images with water and has no definition for a water bodies class. Figure 8 shows the segmentation result of transfer learning on UVid-Net (U-Net encoder). It can be observed that the transfer learning approach offers competitive results as compared to existing approaches on the greenery, road and construction classes. Despite the limitation on unknown classes, a pre-trained UVid-Net (U-Net encoder) could be the preferred choice, especially when limited training data is available for UAV aerial video segmentation.
This paper presents a new encoder-decoder based CNN architecture for semantic segmentation of UAV aerial videos. The proposed architecture uses a new encoder consisting of two parallel encoding branches that take two consecutive keyframes of the video as input. By integrating the features extracted from the two encoding branches, the network can learn temporal information without an extra sequential module. In addition, it uses a feature retainer module in the decoder path, which produces smoother segmentation boundaries. The proposed architecture achieves an mIoU of on the ManipalUAVid dataset, which outperforms the other state-of-the-art algorithms. This work also demonstrates that the proposed network, UVid-Net, trained on a larger semantic segmentation dataset of urban street scenes (Cityscape) can be used for UAV aerial video segmentation. With this transfer learning approach, competitive results are obtained on the ManipalUAVid dataset by re-training only the last layer of a UVid-Net trained on the Cityscape dataset. These results are significant because they reduce the dependency on manually annotated training data, whose creation is time-consuming and laborious. The improved efficiency of UVid-Net obtained by incorporating temporal information, along with the reduced dependency on training data, will provide better segmentation of aerial videos. The lightweight architecture of UVid-Net reduces the computational complexity and the number of trainable parameters, which makes it well suited for UAV-based IoT applications. The improved segmentation can be used for monitoring environmental changes, urban planning, disaster management and other aerial surveillance tasks. In future work, the developed system will be studied for real-time performance and deployed on UAVs for real-time scene analysis.
-  (2018) Recycle-GAN: unsupervised video retargeting. In Proceedings of the European conference on computer vision (ECCV), pp. 119–135. Cited by: §II.
- CNN in MRF: video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5977–5986. Cited by: §II.
-  (2019) On the operational use of UAVs for video-derived bathymetry. Coastal Engineering 152, pp. 103527. Cited by: §I.
-  (2011) Context-based urban terrain reconstruction from UAV-videos for geoinformation applications. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 38 (1/C22). Cited by: §I.
-  (2019) Guided anisotropic diffusion and iterative learning for weakly supervised change detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §II.
-  (2011) Temporally consistent multi-class video-object segmentation with the video graph-shifts algorithm. In 2011 IEEE Workshop on Applications of Computer Vision (WACV), pp. 614–621. Cited by: §II.
-  (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §II.
-  (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: §IV-D, TABLE I, TABLE II.
-  (2012) Exploiting nonlocal spatiotemporal structure for video segmentation. In 2012 IEEE conference on computer vision and pattern recognition, pp. 741–748. Cited by: §II.
- Automated detection of koalas using low-level aerial surveillance and machine learning. Scientific reports 9 (1), pp. 1–9. Cited by: §I.
- The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §I.
-  (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §IV-E.
-  (2011) An improved object tracking method in UAV videos. Procedia Engineering 15, pp. 634–638. Cited by: §I.
- STFCN: spatio-temporal fully convolutional neural network for semantic segmentation of street scenes. In Asian Conference on Computer Vision, pp. 493–509. Cited by: §I.
-  (2019) Performance analysis of semantic segmentation algorithms for finely annotated new UAV aerial video dataset (ManipalUAVid). IEEE Access 7, pp. 136239–136253. Cited by: 3rd item, §II, §III, §III, Fig. 8, §IV-A, §IV-A, §IV.
- Semantic segmentation of UAV aerial videos using convolutional neural networks. In , pp. 21–27. Cited by: §I, §II.
-  (2019) Post disaster mapping with semantic change detection in satellite imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Cited by: §I.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §III-A2.
-  (2019) CCNet: criss-cross attention for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 603–612. Cited by: §II.
-  (2019) Accel: a corrective fusion network for efficient semantic segmentation on video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8866–8875. Cited by: §II.
-  (2017) Video scene parsing with predictive feature learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5580–5588. Cited by: §II.
-  (2014) The shape-time random field for semantic video labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 272–279. Cited by: §II.
-  (2017) Multiple moving object detection from UAV videos using trajectories of matched regional adjacency graphs. IEEE Transactions on Geoscience and Remote Sensing 55 (9), pp. 5198–5213. Cited by: §I.
-  (2019) When a few clicks make all the difference: improving weakly-supervised wildlife detection in UAV images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Cited by: §I.
-  (2016) Feature space optimization for semantic video segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3168–3175. Cited by: §II, §II.
-  (2019) Attention-guided network for semantic video segmentation. IEEE Access 7, pp. 140680–140689. Cited by: §II.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §II, §III-B, §IV-D, TABLE I, TABLE II.
-  (2019) Unmanned aerial vehicles for disaster management. In Geological Disaster Monitoring Based on Sensor Networks, pp. 83–107. Cited by: §I.
-  (2017) Budget-aware deep semantic video segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1029–1038. Cited by: §II.
-  (2013) Efficient temporal consistency for streaming video scene analysis. In 2013 IEEE International Conference on Robotics and Automation, pp. 133–139. Cited by: §I.
-  (2019) Surveillance of panicle positions by unmanned aerial vehicle to reveal morphological features of rice. PloS one 14 (10). Cited by: §I.
-  (2019) Automatic segmentation of river and land in SAR images: a deep learning approach. In 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), pp. 15–20. Cited by: §I.
-  (2020) Efficient video semantic segmentation with labels propagation and refinement. In The IEEE Winter Conference on Applications of Computer Vision, pp. 2873–2882. Cited by: §II.
-  (2016) Multi-region two-stream R-CNN for action detection. In European conference on computer vision, pp. 744–759. Cited by: §II.
-  (2016) Deep video code for efficient face video retrieval. In Asian Conference on Computer Vision, pp. 296–312. Cited by: §II.
-  (2019) AeroRIT: a new scene for hyperspectral image analysis. arXiv preprint arXiv:1912.08178. Cited by: §I.
-  (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §I, §II, §III-B, §IV-D, TABLE I, TABLE II.
-  (2006) TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In European conference on computer vision, pp. 1–15. Cited by: §II.
- Semi supervised semantic segmentation using generative adversarial network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5688–5696. Cited by: §II.
-  (2019) A survey of variational and cnn-based optical flow techniques. Signal Processing: Image Communication 72, pp. 9–24. Cited by: §I.
-  (2015-12-01) Segmentation and size estimation of tomatoes from sequences of paired images. EURASIP Journal on Image and Video Processing 2015 (1), pp. 1–23 (English). Cited by: §I.
-  (2012) A framework for moving target detection, recognition and tracking in UAV videos. In Affective Computing and Intelligent Interaction, pp. 69–76. Cited by: §I.
-  (2018) Appearance-and-relation networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1430–1439. Cited by: §II.
-  (2019) Deep learning for semantic segmentation of UAV videos. In IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, pp. 2459–2462. Cited by: §I, §II, §IV-D, §IV-D, TABLE I, TABLE II.
-  (2008) A dynamic conditional random field model for joint labeling of object and scene classes. In European Conference on Computer Vision, pp. 733–747. Cited by: §II.
-  (2019) PolSAR image semantic segmentation based on deep transfer learning—realizing smooth classification with small training sets. IEEE Geoscience and Remote Sensing Letters 16 (6), pp. 977–981. Cited by: §I.
-  (2014) How transferable are features in deep neural networks?. In Advances in neural information processing systems, pp. 3320–3328. Cited by: §IV-E.
-  (2018) BiSeNet: bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 325–341. Cited by: §II, §III-A1.
- Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4584–4593. Cited by: §II.
-  (2011) Unsupervised polarimetric SAR image segmentation and classification using region growing with edge penalty. IEEE Transactions on Geoscience and Remote Sensing 50 (4), pp. 1302–1317. Cited by: §I.
-  (2016) Improving semantic video segmentation by dynamic scene integration. Proceedings of the NCCV. Cited by: §II.
-  (2018) D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In CVPR Workshops, pp. 182–186. Cited by: §I.
-  (2019) Improving semantic segmentation via video propagation and label relaxation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8856–8865. Cited by: §I, §II, §II, Fig. 8, §IV-D, §IV-D, TABLE I, TABLE II.