
Deep Learning Computer Vision Algorithms for Real-time UAVs On-board Camera Image Processing

11/02/2022
by Alessandro Palmas, et al.

This paper describes how advanced deep learning based computer vision algorithms are applied to enable real-time on-board sensor processing for small UAVs. Four use cases are considered: target detection, classification and localization; road segmentation for autonomous navigation in GNSS-denied zones; human body segmentation; and human action recognition. All algorithms have been developed using state-of-the-art image processing methods based on deep neural networks. Acquisition campaigns have been carried out to collect custom datasets reflecting typical operational scenarios, where the peculiar point of view of a multi-rotor UAV is replicated. Algorithm architectures and trained model performances are reported, showing high levels of both accuracy and inference speed. Output examples and on-field videos are presented, demonstrating model operation when deployed on a GPU-powered commercial embedded device (NVIDIA Jetson Xavier) mounted on board a custom quad-rotor, paving the way to high-level autonomy.



1 Introduction

Recent trends in the domain of unmanned aerial vehicles (UAVs) see the extensive exploration of drone swarm solutions, especially leveraging the increasing availability of small, low-cost systems with both rotary- and fixed-wing architectures. These advancements allow the design and execution of new types of missions, which can exploit a different configuration of tactical assets and the increased stream of data made available by the high number of sensors that can be deployed.

In this new setting, additional bottlenecks arise. In fact, every sensor on board the flying assets requires, in the majority of cases, a dedicated operator to continuously monitor it. This poses an important limitation on the scalability of the system, which becomes bound by the available qualified manpower.

In the last decade, another field has seen relevant growth in terms of technological capabilities, opening major opportunities. The computer vision domain underwent a revolution thanks to the breakthroughs achieved in the deep learning (DL) field. The performance of image processing algorithms surpassed every established state-of-the-art reference, unlocking applications that were not possible before.

This paper describes how advanced deep learning based computer vision algorithms are applied to enable real-time on-board sensor processing for small UAVs. Four use cases are considered: target detection, classification and localization; road segmentation for autonomous navigation in GNSS-denied zones; human body segmentation; and human action recognition.

The first step has been a thorough literature review of deep learning methods for image processing, in order to identify the best performing solution for each of the four domains. The models have been selected focusing on both accuracy and speed, given the need to provide real-time inference.

All algorithms have been developed using state-of-the-art image processing methods based on deep neural networks. Acquisition campaigns have been carried out to collect custom datasets reflecting typical operational scenarios, where the peculiar point of view of a multi-rotor UAV is replicated. Acquired images have been manually annotated to be used for training and evaluation.

The selected models can be applied to RGB images, as demonstrated, as well as to gray-scale, infra-red, multi-spectral and hyper-spectral data. In addition, the system can easily be customized to be trained on custom targets.

The models have been trained applying all relevant deep learning best practices. The obtained performance figures are reported, demonstrating a high level of both accuracy and inference speed. Finally, output images and video references are shared, showing model operation when deployed on a GPU-powered commercial embedded device (NVIDIA Jetson Xavier) mounted on board a custom quad-rotor, paving the way to high-level autonomy.

The paper is organized as follows: Section 2 provides a detailed literature review presenting state-of-the-art models and the ones that have been selected, Section 3 discusses model training and deployment details, Section 4 presents the achieved results, and Section 5 reports a discussion and conclusions.

2 Background and Literature Review

In the last ten years, computer vision research has progressed at an unprecedented pace, mainly thanks to the rise of deep learning technology and, in particular, convolutional neural networks (CNNs). Their performance in the context of image processing has surpassed every previous state-of-the-art record, making them the undisputed best technical solution for these tasks (Wu et al., 2020).

2.1 Object Detection, Classification and Localization

When focusing on DL algorithms for object detection, classification and localization, two main families of methods can be identified (Sultana et al., 2020) (Jiao et al., 2019): two-stage approaches, such as the R-CNN series (He et al., 2017) (Girshick et al., 2014) (Li et al., 2017) (Ren et al., 2015), and one-stage approaches, such as SSD (Liu et al., 2016) and YOLO (Redmon et al., 2016). These two categories can be defined, respectively, as region proposal based and regression/classification based (Zhao et al., 2019) (Liu et al., 2021). The former can be considered the traditional approach, in which regions of interest (ROIs) are generated in a first pass and then processed and classified in a second step. In the latter, instead, localization and classification are performed in a single pass. One-stage methods are in general faster, while two-stage ones are generally more accurate (Soviany and Ionescu, 2019). In what follows, the most important examples of the two groups are briefly described.

2.1.1 Two-stage Approaches

The main idea at the core of the two-stage family is the generation of a large number of ROIs in the image, so that every object in the scene is contained in one or more of them. In the second stage, each of these ROIs is classified. The most important algorithms in this category are the following. R-CNN (Girshick et al., 2014) (2014) was one of the first to beat classical methods such as HOG descriptor-based ones, but its computational cost is still too high (Sharma and Mir, 2020). SPP-net (He et al., 2014) (2014) couples the CNN with a Spatial Pyramid Pooling (SPP) layer for feature extraction, extending applicability to images of different sizes and achieving a 20x speedup, although it cannot be trained end-to-end because the SPP layer does not support back-propagation. Fast R-CNN (Girshick, 2015) (2015) substitutes the SPP layer with a ROI pooling layer, which supports back-propagation, thus making end-to-end training possible. Faster R-CNN (Ren et al., 2015) (2015) extends Fast R-CNN with a Region Proposal Network (RPN) that generates ROIs of different sizes and aspect ratios, improving efficiency and obtaining a 10x speedup with respect to Fast R-CNN (Jiao et al., 2019) (Simonyan and Zisserman, 2015).

2.1.2 One-stage Approaches

These methods skip the ROI generation step typical of two-stage approaches, considering all subregions of the image as candidate objects. This significantly reduces processing time and makes these methods better suited for real-time applications. The most important algorithms in this category are the following. You Only Look Once (YOLO) (Redmon et al., 2016) (2015) was the first to work in real time while maintaining good accuracy, and it also features a simplified version (Tiny-YOLO) for inference speed-up. It generates potential ROIs by dividing the image into an NxN grid. Its performance degrades on objects that are small with respect to the image size and on those with significant occlusion. YOLO underwent a series of improvements, leading to YOLO9000 (Redmon and Farhadi, 2017) (2016), YOLOv3 (Redmon and Farhadi, 2018) (2018) and YOLOv4 (Bochkovskiy et al., 2020) (2020). The modifications concerned the neural network architecture, its pre-training, a new automated anchor system, the integration of batch normalization and multi-scale training. They allowed YOLO to become a top-performing model in terms of both accuracy and inference speed. Single Shot Detector (SSD) (Liu et al., 2016) (2016) directly predicts the object class and performs bounding box regression at different levels, achieving competitive results, in particular for the SSD512 model, using VGG-16 as the convolutional backbone network. RetinaNet (Lin et al., 2017b) (2017) features a loss function aimed at reducing the "foreground-background class imbalance" (Chen et al., 2020), allowing it to obtain accuracy and speed higher than those of the two-stage methods described above.

2.2 Semantic Segmentation

These models have the goal of generating a pixel-map for a given image, where a class label is assigned to each pixel in the input. The most important families of this type of models are presented in this section.

Fully Convolutional Networks (FCNs) (Long et al., 2015) have been the first methods used for deep learning-based image segmentation, using only convolutional layers. A specific variant, FCN-8s, was among the first to obtain state-of-the-art results in segmentation tasks, although it is not suited for real-time applications due to high inference times and is not efficient in leveraging context information. ParseNet (Liu et al., 2015) extends FCNs by concatenating latent features from the CNN with a vector of global information, obtaining a significant improvement in model performance.

To address the problem of information loss, deep learning-based models have been coupled with probabilistic ones, such as Conditional Random Fields (CRFs) and Markov Random Fields (MRFs). DeepLab-CRF (DeepLabv1) (Chen et al., 2014) couples a CNN with a CRF model that processes the output of the former, providing a more accurate segmentation map. The Deep Parsing Network (DPN) (Liu et al., 2015), instead, leverages MRFs in a similar fashion, obtaining a relevant improvement in terms of both accuracy and inference time.

The encoder-decoder family uses convolutional layers to generate a compressed input representation in the latent space, and deconvolutional layers to reconstruct the image. One of the main problems of these models is setting the right level of abstraction, limiting information loss and its impact on achievable accuracy. DeconvNet (Noh et al., 2015) (2015) and SegNet (Badrinarayanan et al., 2017) (2017) both use VGG-16, without the last FC layers, as encoder, but the latter adds connections between encoding and decoding pooling/up-sampling layers at different depths, resulting in better performance in both accuracy and inference speed. Efficient Symmetric Network (ESNet) (Wang et al., 2019) is an encoder-decoder network optimized for real-time applications while maintaining very good accuracy. U-Net (Ronneberger et al., 2015) features additional direct connections between encoding and decoding layers and, although conceived in the context of biomedical applications, it has also been applied in other domains, such as road segmentation (Zhang et al., 2018).

Another family of segmentation models is based on multi-scale pyramids and parallel paths. Feature Pyramid Network (FPN) (Lin et al., 2017a) (2016) creates connections between high- and low-resolution layers at a minimal computational cost. Pyramid Scene Parsing Network (PSPNet) (Zhao et al., 2017) takes the global context into account more accurately, applying pooling operations at different levels of the layer hierarchy. High-Resolution Network (HRNet) (Sun et al., 2019) and Object-Contextual Representation (OCR) (Yuan et al., 2020) (2019) extract information at different layers, with connections favoring exchange between different hierarchy levels. Deep Dual-Resolution Network (DDRNet) (Hong et al., 2021) (2021), based on HRNet, obtained a relevant improvement in terms of inference speed. The network has two branches: one extracts a feature map at high resolution, while the other extracts context information using down-sampling. FasterSeg (Chen et al., 2020), inspired by Auto-DeepLab (Liu et al., 2019), makes use of parallel branches too, showing a good compromise between accuracy and inference speed.

A different family of models for semantic segmentation is based on dilated convolution, a modified version of the standard convolution that enlarges the receptive field without increasing the number of model parameters, and thus computational cost, while obtaining similar results, fostering real-time applications. DeepLabv2 (Chen et al., 2018a) uses the dilated convolution, introduces a module called atrous spatial pyramid pooling (ASPP) to capture both the global context and low-level information, and couples the CNN with a complementary model as seen in the previous DeepLab model. Updated versions have been proposed after it, DeepLabv3 (Chen et al., 2018b) and DeepLabv3+ (Liu et al., 2019), where the latter builds upon the former, using it as encoder in the encoder-decoder fashion. Efficient Network (ENet) (Paszke et al., 2016) is specifically designed for real-time applications and leverages the lower computational requirements of dilated convolution in the encoder-decoder context, with the specific feature of asymmetric encoder-decoder branches: a deeper encoder and a shallower decoder. Gated Shape CNN (GSCNN) (Takikawa et al., 2019) has a "regular stream", where a CNN generates feature maps, and a "shape stream", where a gated convolutional layer (GCL) uses the CNN output to deal more accurately with object contours. The outputs are then fused in an ASPP module where contextual information is preserved at different scales.
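As a brief illustration of the operation (not taken from the paper), the snippet below shows that a dilated convolution enlarges the receptive field of a 3x3 kernel without adding parameters or changing the output resolution:

```python
# Minimal illustration of dilated (atrous) convolution in PyTorch: with dilation=2,
# a 3x3 kernel covers a 5x5 receptive field while keeping the same parameter count.
import torch
import torch.nn as nn

standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)             # 3x3 receptive field
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # 5x5 receptive field

x = torch.randn(1, 64, 128, 128)
assert standard(x).shape == dilated(x).shape                        # same output resolution
n_params = sum(p.numel() for p in standard.parameters())
assert n_params == sum(p.numel() for p in dilated.parameters())     # same parameter count
```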

2.3 Human Action Recognition

The aim of these algorithms is to classify a sequence of images into a given set of categories. Typical examples of labels are "walking", "running", "standing idle" or "throwing something". This section presents the most important models available in the literature based on visual input data. They are typically divided into two categories: methods based on two or more 2D fluxes and methods based on recurrent neural networks (RNNs).

2.3.1 Methods Based on Two or More 2D Fluxes

These methods make use of at least two 2D CNNs working in parallel. They act as independent feature extractors operating on the input image sequence, and their classification outputs are then combined to produce the final result. Two-Stream CNN+SVM (Simonyan and Zisserman, 2014) (2014) and Multi-Resolution CNN (Karpathy et al., 2014) (2014) both feature two branches operating, respectively, on a single frame and the optical flow obtained from a sequence of multiple frames, and on low- and high-resolution frames. The first fuses the classification output of each branch, while the second directly concatenates the convolutional layer outputs. Trajectory-pooled Deep-convolutional Descriptors (TDD) (Wang et al., 2015) (2015) uses two parallel fluxes, based on deep learning and on hand-crafted operations respectively, then uses the "Fisher Vector" representation (Sánchez et al., 2013) as input to a classifying SVM. Temporal Segment Network (TSN) (Wang et al., 2016) (2016) has been designed to address complex actions covering a large time span: it divides the input frame sequence into three parts and forwards them to three networks, fusing their outputs for classification. Temporal Linear Encoding (TLE) (Diba et al., 2017) also aims at recognizing long-duration actions and features a specific layer for this purpose. ActionVLAD (Girdhar et al., 2017) (2017) processes RGB and flow streams in parallel and then applies a particular pooling layer embedding a clustering operation, where classification is carried out through proximity measures. Two-stream ConvNet (Feichtenhofer et al., 2017) (2017) introduced spatial-temporal information fusion in the convolutional layers instead of limiting it to the outputs, resulting in improved performance and a lower number of parameters. CNN + Deep AutoEncoder + Support Vector Machine (CNN + DAE + SVM) (Ullah et al., 2019) (2019) processes videos in quasi-real time while maintaining the same accuracy as the previous ones; the outputs of the two parallel feature extractors are processed by FC layers, an auto-encoder, and finally an SVM generating the classification. MSM-ResNet (Zong et al., 2021) (2021) uses three parallel CNN-based processing pipelines, operating on the single frame, the optical flow and the motion saliency stream, at the cost of increased computation; all three are used to generate the final prediction.

2.3.2 Methods Based on Recurrent Neural Networks

These models try to overcome the limits encountered when dealing with actions of long duration, which extend beyond the number of frames the model processes at the same time. They leverage Recurrent Neural Networks (RNNs), modules whose neurons are linked in loops. This generates a "memory" effect, allowing them to handle tasks that require taking decisions based on information received far back in time, which makes them suitable for dynamic applications like action recognition. This section presents these models, which have a Long Short-Term Memory (LSTM) block (a specific RNN module) that processes features extracted by two or more parallel CNNs. The additional block results in additional computational cost with respect to the previous family. Long-term Recurrent Convolutional Networks (LRCNs) (Donahue et al., 2017) (2015) use multiple CNN-based feature extractors followed by LSTM blocks and an averaging module that performs the final classification. Lattice Long Short-Term Memory (L2STM) (Sun et al., 2017) (2017) features two CNN-based feature extractors and a Lattice-LSTM module, which extends the classic LSTM, allowing for cross-flux information exchange. ShuttleNet (Shi et al., 2017) (2017) does not use LSTM blocks as recurrent units, keeping the computational cost on par with the other models. Attention Mechanism-Long Short Time Memory (AM-LSTM) (Ge et al., 2019) (2019) can be decomposed into four main modules: a CNN-based feature extractor, a module based on the attention mechanism that automatically selects the most important extracted features, an RNN module based on convolutional LSTM (ConvLSTM) that generates per-frame predictions, and a final prediction block providing the classification output. Densely-connected Bi-directional Long Short Time Memory (DB-LSTM) (He et al., 2021) (2021) creates batches by sampling from the input frame sequence and the optical flux, and sends them to the sample representation learned (SRL) block, composed of two parallel processing pipelines, one for spatial and one for temporal information, leveraging CNNs for feature extraction. The output is provided to bidirectional LSTMs that, thanks to this peculiar feature, can capture information over longer time frames than conventional LSTMs. The last element is a fusion layer that generates the classification.

2.4 Model Selection

Following the literature review, a model for each of the four use cases has been identified: YOLOv3 (Redmon and Farhadi, 2018) for object detection, classification and localization, DDRNet (Hong et al., 2021) for the two segmentation scenarios (road segmentation and human body segmentation), and Two-stream ConvNet (Feichtenhofer et al., 2017) for human action recognition.
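As an illustration of the kind of inference pipeline enabled by this selection, the following minimal sketch runs a YOLOv3-style detector through OpenCV's dnn module; the configuration/weight paths, input resolution and thresholds are hypothetical placeholders and do not correspond to the authors' implementation:

```python
# Hypothetical sketch: YOLOv3-style detection with OpenCV's dnn module.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")  # placeholder paths
out_names = net.getUnconnectedOutLayersNames()

def detect(frame, conf_thr=0.5, nms_thr=0.4):
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes, scores, labels = [], [], []
    for output in net.forward(out_names):
        for det in output:                 # det = [cx, cy, bw, bh, objectness, class scores...]
            class_scores = det[5:]
            cls = int(np.argmax(class_scores))
            conf = float(class_scores[cls])
            if conf < conf_thr:
                continue
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            scores.append(conf)
            labels.append(cls)
    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thr, nms_thr)       # non-maximum suppression
    return [(boxes[i], labels[i], scores[i]) for i in np.array(keep).flatten()]
```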

All the algorithms have been customized to easily deal with different types of inputs, supporting RGB and gray-scale images in the visible spectrum, gray-scale images in the infra-red spectrum, and images acquired by multi-spectral cameras or (typically satellite-based) hyper-spectral sensors. These modifications have also taken into account the specific deployment hardware used for the on-board application, and the need for quasi-real-time inference.
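A minimal sketch of one way such input flexibility can be obtained (our assumption, not the authors' code) is to replace the backbone's first convolution with one accepting an arbitrary number of channels, initialized from the pretrained RGB kernels:

```python
# Sketch (assumption): adapting a CNN backbone's first convolution so the same
# architecture accepts gray-scale, multi-spectral or hyper-spectral inputs.
import torch
import torch.nn as nn

def adapt_first_conv(conv: nn.Conv2d, in_channels: int) -> nn.Conv2d:
    """Replace a 3-channel first convolution with an `in_channels` version,
    re-using the pretrained RGB kernels by averaging them across channels."""
    new_conv = nn.Conv2d(
        in_channels, conv.out_channels,
        kernel_size=conv.kernel_size, stride=conv.stride,
        padding=conv.padding, bias=conv.bias is not None,
    )
    with torch.no_grad():
        mean_kernel = conv.weight.mean(dim=1, keepdim=True)          # average over RGB
        new_conv.weight.copy_(mean_kernel.repeat(1, in_channels, 1, 1))
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Example: a 5-band multi-spectral input
first_conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
first_conv = adapt_first_conv(first_conv, in_channels=5)
x = torch.randn(1, 5, 512, 512)
print(first_conv(x).shape)   # torch.Size([1, 64, 256, 256])
```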

3 Model Training, Deployment and Performance

With the aim of carrying out on-field demonstrations, we focused on specific applications in the context of the selected use cases. Object detection, classification and localization is demonstrated on a single-class (person) application, using RGB images acquired from a drone flying 10 meters above the ground with the camera oriented downwards (-90°). This scenario has been chosen as it is representative of situations where, for example, there is the need to detect a target and autonomously track it, in the context of search & rescue missions or restricted area monitoring.

The model for semantic segmentation has been applied in two different contexts. The first aims at distinguishing between road and off-road pixels in RGB images acquired from a drone flying 10 meters above the ground with the camera oriented downwards (-90°). This functionality can be used to implement autonomous navigation, particularly useful in GNSS-denied zones, for example to build a system able to autonomously follow a road for patrolling operations.

The second application of semantic segmentation aims at segmenting the human silhouette into 19 different body parts (e.g., head, hair, arms, legs, torso, feet) in RGB images acquired from a drone flying 5 meters above the ground with the camera oriented at -40°.

The model for human action recognition has been trained to classify six different actions in RGB frame sequences acquired from a drone flying 10 meters above the ground with the camera oriented downwards (-90°). The actions performed were: standing idle, walking, running, crouching, aiming and throwing. This feature can be used, for example, in crowded-area monitoring applications to automatically identify threatening behavior.

3.1 Dataset

The four use cases considered in this study, and the goal of demonstrating their application in real-world tests after deploying them on board a multi-rotor drone, posed a significant challenge. In fact, while a relevant number of labeled image datasets are openly available for all the different applications considered here, only a few of them feature the very peculiar airborne point of view needed in our case. For this reason, we carried out a number of on-field campaigns with the aim of building our own custom dataset. Figures 1, 2 and 3 show some samples used for object detection, segmentation and action recognition respectively.


Figure 1: Samples of the dataset used for the object detection, classification and localization scenario

Figure 2: Samples of the dataset used for road segmentation (left) and human body segmentation (right) scenarios

Figure 3: Samples of the dataset used for human action recognition scenario (throwing action on the left, aiming action on the right).

After the collection campaigns, we manually labeled the data using the tools presented in Figure 4.


Figure 4: Annotation tool used for object detection and action recognition (left) and semantic segmentation (right)

As an additional research direction, we also leveraged Unreal Engine to create synthetic datasets with characteristics similar to those used for the four considered use cases, and to virtually test the algorithms and their integration with the drone navigation dynamics. Figure 5 shows the application of the object detection algorithm in a virtual scenario. Proper use of these engines allows bypassing the labeling step, leveraging the knowledge of the virtual environment composition underlying the rendered scene and completely automating dataset generation. In addition, it provides great flexibility, as the user can vary:

  • Environmental conditions: weather (sun, rain, fog, etc.), time-of-day, environment (urban, desert, woods, rural, etc.)

  • Target types: category of assets for object detection, actions executed, assets appearance (clothes, wearables, equipment)

  • Test conditions to investigate algorithms generalization and robustness

just to list the most important ones.


Figure 5: Application of the object detection algorithm in a virtual scenario generated with Unreal Engine

3.2 Training Techniques

After having collected the dataset for all the considered use cases, the different training strategies described below have been applied.

Training-Validation-Test Split.

For each of the four algorithms, the dataset has been split in training, validation and test sets, making sure label balance was properly preserved. To optimize data usage, a 5-fold partitioning has been adopted, averaging scores across folds to obtain model performance estimations.
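As an illustration of this split strategy (with hypothetical file names and labels, since the actual dataset layout is not published), a label-preserving 5-fold partitioning can be set up with scikit-learn as follows:

```python
# Illustrative sketch: stratified 5-fold split with per-fold scores averaged
# to estimate model performance. Paths and labels below are placeholders.
import numpy as np
from sklearn.model_selection import StratifiedKFold

image_paths = np.array([f"img_{i:04d}.jpg" for i in range(1000)])   # placeholder file names
labels = np.random.randint(0, 6, size=1000)                          # placeholder class labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)     # preserves label balance
fold_scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(image_paths, labels)):
    # train_model / evaluate_model stand in for the actual training and evaluation loops:
    # model = train_model(image_paths[train_idx], labels[train_idx])
    # fold_scores.append(evaluate_model(model, image_paths[val_idx], labels[val_idx]))
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val samples")

# print("estimated performance:", np.mean(fold_scores))
```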

Data Augmentation. With the aim of enhancing the generalization power of the algorithms, different data augmentation techniques have been applied to the collected images. Figure 6 presents them: starting from the top left and proceeding row-wise, one finds the original image, a brightness shift, a rotation, a shear affine transformation, random cropping and noise superimposition.

These transformations are applied at training time, randomizing the parameters they depend on.


Figure 6: Data augmentation operations: a) original image b) brightness shift c) rotation d) shear affine transformation e) cropping f) noise superimposition
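A hypothetical augmentation pipeline mirroring the operations of Figure 6 is sketched below; the exact parameter ranges used by the authors are not reported, so the values are placeholders:

```python
# Sketch: training-time augmentations with randomized parameters (torchvision).
import torch
from torchvision import transforms

def add_gaussian_noise(img: torch.Tensor, std: float = 0.02) -> torch.Tensor:
    """Superimpose zero-mean Gaussian noise on a [0, 1] image tensor."""
    return (img + std * torch.randn_like(img)).clamp(0.0, 1.0)

train_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.4),                    # b) brightness shift
    transforms.RandomRotation(degrees=15),                     # c) rotation
    transforms.RandomAffine(degrees=0, shear=10),              # d) shear affine transformation
    transforms.RandomResizedCrop(size=416, scale=(0.7, 1.0)),  # e) random cropping
    transforms.ToTensor(),
    transforms.Lambda(add_gaussian_noise),                     # f) noise superimposition
])
# Parameters are re-sampled every time an image is loaded, i.e. at training time.
# For detection and segmentation, the same geometric transforms must also be
# applied to the bounding boxes / pixel maps.
```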

Pre-training, Layer Freezing and Fine Tuning. All selected models have been pre-trained using openly available datasets. As a next step, a first round of training made use of the synthetically generated data, to drive learning towards the real-world scenarios.

On the resulting models, the first layers have been frozen (Figure 7 shows this intuitively on the DDRNet model, where frozen layers are grayed out) and the deep network has been fine-tuned on the dataset collected in the acquisition campaigns.

Finally, early stopping based on the validation performance measure has been adopted to avoid over-fitting.


Figure 7: Layer freezing for transfer learning and fine-tuning
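A minimal PyTorch sketch of the freezing, fine-tuning and early-stopping procedure just described, written under our own assumptions (helper names such as evaluate_fn and the hyper-parameter values are placeholders, not the authors' code):

```python
# Sketch: freeze early layers of a pretrained backbone, fine-tune the rest,
# and stop early when the validation metric no longer improves.
import torch

def freeze_early_layers(model: torch.nn.Module, n_frozen: int) -> None:
    """Disable gradients for the first `n_frozen` top-level children of the model."""
    for i, child in enumerate(model.children()):
        if i < n_frozen:
            for p in child.parameters():
                p.requires_grad = False

def fine_tune(model, train_loader, val_loader, loss_fn, evaluate_fn,
              epochs=100, patience=10, lr=1e-4):
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    best_score, epochs_without_improvement = -float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss_fn(model(images), targets).backward()
            optimizer.step()
        score = evaluate_fn(model, val_loader)       # e.g. mAP or IoU on the validation fold
        if score > best_score:
            best_score, epochs_without_improvement = score, 0
            torch.save(model.state_dict(), "best.pt")
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:   # early stopping
                break
    return best_score
```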

3.3 Deployment

All the algorithms have been implemented using state-of-the-art deep learning frameworks (TensorFlow / PyTorch). To speed up hyper-parameter tuning and experimentation, training has been carried out on a desktop computer powered by an NVIDIA RTX 3090 GPU. Deployment constraints have guided the development from the start, thus models were designed to be easily portable to embedded, GPU-powered devices. The trained algorithms have been deployed on the NVIDIA Jetson Xavier board, which is particularly well suited to be carried on board, thanks to its low weight and power consumption, as well as its very flexible hardware and software interfaces.

To maximize performance, reduced-precision arithmetic has been adopted, leveraging the NVIDIA TensorRT tool.
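One common way to obtain such reduced-precision engines (an illustrative assumption on the toolchain, not necessarily the authors' exact workflow) is to export the trained model to ONNX and build an FP16 engine with TensorRT's trtexec utility:

```python
# Sketch: ONNX export of a trained model, as a first step towards an FP16 TensorRT engine.
import torch
import torch.nn as nn

# Placeholder network standing in for the trained model (the real detector/segmenter
# would be loaded from its checkpoint instead).
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 1))
model.eval()

dummy = torch.randn(1, 3, 416, 416)          # placeholder input resolution
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["images"], output_names=["outputs"],
                  opset_version=11)

# On the Jetson, an FP16 engine can then be built and benchmarked with:
#   trtexec --onnx=model.onnx --saveEngine=model_fp16.engine --fp16
```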

Figure 8 shows the custom design of a mechanical interface to equip the NVIDIA hardware on board a multi-rotor drone as an additional payload. Figure 9 shows the final prototype, assembled and ready to be flown for tests on a quad-x configuration multi-rotor.


Figure 8: Custom design of the NVIDIA Jetson's support for deployment on board a multi-rotor drone

Figure 9: NVIDIA Jetson deployed on board a multi-rotor drone

4 Results

Table 1 summarizes the performance of the implemented algorithms. For each algorithm it reports two metrics, one for inference speed and one for accuracy. The former is measured in frames per second for every algorithm, while the latter depends on the specific application and is the most meaningful metric typically used for it.

Algorithm | Performance Measure | Result
Automatic target detection, classification and localization | Frames Per Second (FPS) | 15-20 FPS
Automatic target detection, classification and localization | Mean Average Precision (mAP) | 70.5 %
Road Segmentation | Frames Per Second (FPS) | 10 FPS
Road Segmentation | Intersection over Union (IoU) | 93.86 %
Human Body Segmentation | Frames Per Second (FPS) | 25 FPS
Human Body Segmentation | Intersection over Union (IoU) | 47.84 %
Human Action Recognition | Frames Per Second (FPS) | 10 FPS
Human Action Recognition | Mean Average Precision (mAP) | 50.8 %

Table 1: Algorithm performance
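The FPS figures above refer to end-to-end throughput on the embedded device. A minimal timing loop of the kind typically used for such measurements (a sketch under our assumptions, not the authors' benchmark code) is shown below:

```python
# Hypothetical timing loop for measuring inference throughput (FPS) on the GPU.
import time
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 416, 416), n_warmup=10, n_runs=100, device="cuda"):
    model = model.eval().to(device)
    x = torch.randn(*input_shape, device=device)
    for _ in range(n_warmup):                 # warm-up iterations are excluded from timing
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    torch.cuda.synchronize()                  # wait for all GPU work before stopping the clock
    return n_runs / (time.perf_counter() - start)
```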

Figures 10, 11, 12 and 13 show four examples of the models’ outputs for, respectively, object detection and localization, road segmentation, human body segmentation and human action recognition.

Video demonstrations for the four use cases are listed below:


Figure 10: Example of the object detection, classification and localization model output

Figure 11: Example of road segmentation model output

Figure 12: Example of human body segmentation model output

Figure 13: Example of human action recognition model output

5 Discussion and Conclusions

This work presented how deep learning based computer vision algorithms have been developed, trained and deployed on an embedded NVIDIA Jetson Xavier carried on board a small multi-rotor drone.

All selected models have been chosen to ensure broad applicability, favoring the easiest possible integration with both input sources and third-party downstream consumers of the outputs. The adopted technology and the implementation choices have been driven by the goal of creating a software library able to handle different types of input data and to provide outputs that contain abstract, high-level information extracted from the frames.

While applied here to RGB images coming from a visible camera, each model can work on the following types of inputs, and can easily be extended to similar ones:

  • RGB (3-channel) / gray-scale (single-channel) images from a visible camera

  • Gray-scale (single-channel) images from infra-red cameras

  • Multi-spectral and hyper-spectral images from ground, air or space vehicles

Each model provides its own specific output (a minimal illustrative sketch of these output structures is given after the list):

  • Automatic target detection, classification and localization:

    • One array of detections containing the whole list of target objects identified in every frame

    • One bounding box for each detection, defining target position inside the frame

    • One classification label for each detection, representing the class to which the identified and localized target belongs

    • One classification confidence for each detection, measuring the probability associated with the label assigned to the target identified in the frame

  • Context semantic segmentation (for both road segmentation and human body segmentation):

    • One pixel map, where each frame pixel is associated with a classification label

  • Human action recognition:

    • One classification label for each frame sequence, representing the class the target action belongs to

    • One classification confidence for each action detection, measuring the probability associated with the label assigned to the target action identified in the frame
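The following dataclasses give a minimal, illustrative encoding of these outputs (naming and types are our own, not a published API):

```python
# Illustrative data structures for the per-frame outputs listed above.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Detection:
    box_xywh: tuple          # bounding box (x, y, width, height) in pixels
    label: str               # classification label, e.g. "person"
    confidence: float        # probability associated with the label

@dataclass
class FrameOutputs:
    detections: List[Detection]      # target detection, classification and localization
    segmentation_map: np.ndarray     # H x W array of per-pixel class labels
    action_label: str                # action class for the current frame sequence
    action_confidence: float         # probability associated with the action label
```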

The presented results demonstrate the successful application of state-of-the-art, deep learning based computer vision algorithms for quasi-real-time sensor processing on board multi-rotor drones. The obtained performance proves that this approach is a very promising way to pursue scalability in UAV applications, and these functionalities can be directly applied to enable fully autonomous systems.

References

  • V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, pp. 2481–2495. Cited by: §2.2.
  • A. Bochkovskiy, C. Wang, and H. M. Liao (2020) YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv e-prints, pp. arXiv:2004.10934. External Links: 2004.10934 Cited by: §2.1.2.
  • J. Chen, Q. Wu, D. Liu, and T. Xu (2020) Foreground-background imbalance problem in deep object detectors: a review. In 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Vol. , Los Alamitos, CA, USA, pp. 285–290. External Links: ISSN , Document, Link Cited by: §2.1.2.
  • L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2014) Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv e-prints, pp. arXiv:1412.7062. External Links: 1412.7062 Cited by: §2.2.
  • L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018a) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. External Links: Document Cited by: §2.2.
  • L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Cham, pp. 833–851. External Links: ISBN 978-3-030-01234-2 Cited by: §2.2.
  • W. Chen, X. Gong, X. Liu, Q. Zhang, Y. Li, and Z. Wang (2020) FasterSeg: searching for faster real-time semantic segmentation. In International Conference on Learning Representations, External Links: Link Cited by: §2.2.
  • A. Diba, V. Sharma, and L. V. Gool (2017) Deep temporal linear encoding networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1541–1550. Cited by: §2.3.1.
  • J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell (2017) Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 677–691. External Links: Document Cited by: §2.3.2.
  • C. Feichtenhofer, A. Pinz, and R. P. Wildes (2017) Spatiotemporal multiplier networks for video action recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 7445–7454. External Links: Document Cited by: §2.3.1, §2.4.
  • H. Ge, Z. Yan, W. Yu, and L. Sun (2019) An attention mechanism based convolutional lstm network for video action recognition. Multimedia Tools Appl. 78 (14), pp. 20533–20556. External Links: ISSN 1380-7501, Link, Document Cited by: §2.3.2.
  • R. Girdhar, D. Ramanan, A. K. Gupta, J. Sivic, and B. C. Russell (2017) ActionVLAD: learning spatio-temporal aggregation for action classification. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3165–3174. Cited by: §2.3.1.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 580–587. External Links: Document Cited by: §2.1.1, §2.1.
  • R. Girshick (2015) Fast r-cnn. In 2015 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 1440–1448. External Links: Document Cited by: §2.1.1.
  • J. He, X. Wu, Z. Cheng, Z. Yuan, and Y. Jiang (2021) DB-lstm: densely-connected bi-directional lstm for human action recognition. Neurocomputing 444, pp. 319–331. Cited by: §2.3.2.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 2980–2988. External Links: Document Cited by: §2.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In Computer Vision – ECCV 2014, Cham, pp. 346–361. External Links: ISBN 978-3-319-10578-9 Cited by: §2.1.1.
  • Y. Hong, H. Pan, W. Sun, and Y. Jia (2021) Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. ArXiv abs/2101.06085. Cited by: §2.2, §2.4.
  • L. Jiao, F. Zhang, F. Liu, S. Yang, L. Li, Z. Feng, and R. Qu (2019) A survey of deep learning-based object detection. IEEE Access 7 (), pp. 128837–128868. External Links: Document Cited by: §2.1.1, §2.1.
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 1725–1732. External Links: Document Cited by: §2.3.1.
  • Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun (2017) Light-Head R-CNN: In Defense of Two-Stage Object Detector. arXiv e-prints, pp. arXiv:1711.07264. External Links: 1711.07264 Cited by: §2.1.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017a) Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 936–944. External Links: Document Cited by: §2.2.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017b) Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 2999–3007. External Links: Document Cited by: §2.1.2.
  • C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 82–92. External Links: Document Cited by: §2.2, §2.2.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cham, pp. 21–37. External Links: ISBN 978-3-319-46448-0 Cited by: §2.1.2, §2.1.
  • W. Liu, A. Rabinovich, and A. C. Berg (2015) ParseNet: Looking Wider to See Better. arXiv e-prints, pp. arXiv:1506.04579. External Links: 1506.04579 Cited by: §2.2.
  • Y. Liu, P. Sun, N. Wergeles, and Y. Shang (2021) A survey and performance evaluation of deep learning methods for small object detection. Expert Systems with Applications 172, pp. 114602. External Links: ISSN 0957-4174, Document, Link Cited by: §2.1.
  • Z. Liu, X. Li, P. Luo, C. Loy, and X. Tang (2015) Semantic image segmentation via deep parsing network. In 2015 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 1377–1385. External Links: Document Cited by: §2.2.
  • J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 3431–3440. External Links: Document Cited by: §2.2.
  • H. Noh, S. Hong, and B. Han (2015) Learning deconvolution network for semantic segmentation. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1520–1528. Cited by: §2.2.
  • A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv e-prints, pp. arXiv:1606.02147. External Links: 1606.02147 Cited by: §2.2.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 779–788. External Links: Document Cited by: §2.1.2, §2.1.
  • J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 6517–6525. External Links: Document Cited by: §2.1.2.
  • J. Redmon and A. Farhadi (2018) YOLOv3: An Incremental Improvement. arXiv e-prints, pp. arXiv:1804.02767. External Links: 1804.02767 Cited by: §2.1.2, §2.4.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. External Links: Link Cited by: §2.1.1, §2.1.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Cham, pp. 234–241. External Links: ISBN 978-3-319-24574-4 Cited by: §2.2.
  • J. Sánchez, T. Mensink, and J. Verbeek (2013) Image classification with the fisher vector: theory and practice. International Journal of Computer Vision 105, pp. . External Links: Document Cited by: §2.3.1.
  • V. Sharma and R. N. Mir (2020) A comprehensive and systematic look up into deep learning based object detection techniques: a review. Computer Science Review 38, pp. 100301. External Links: ISSN 1574-0137, Document, Link Cited by: §2.1.1.
  • Y. Shi, Y. Tian, Y. Wang, W. Zeng, and T. Huang (2017) Learning long-term dependencies for action recognition with a biologically-inspired deep network. In 2017 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 716–725. External Links: Document Cited by: §2.3.2.
  • K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NIPS, Cited by: §2.3.1.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §2.1.1.
  • P. Soviany and R. T. Ionescu (2019) Frustratingly easy trade-off optimization between single-stage and two-stage deep object detectors. In Computer Vision – ECCV 2018 Workshops, L. Leal-Taixé and S. Roth (Eds.), Cham, pp. 366–378. External Links: ISBN 978-3-030-11018-5 Cited by: §2.1.
  • F. Sultana, A. Sufian, and P. Dutta (2020) A review of object detection models based on convolutional neural network. In Intelligent Computing: Image Processing Based Applications, pp. 1–16. External Links: ISBN 978-981-15-4288-6, Document, Link Cited by: §2.1.
  • K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang (2019) High-resolution representations for labeling pixels and regions. ArXiv abs/1904.04514. Cited by: §2.2.
  • L. Sun, K. Jia, K. Chen, D. Yeung, B. E. Shi, and S. Savarese (2017) Lattice long short-term memory for human action recognition. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2166–2175. Cited by: §2.3.2.
  • T. Takikawa, D. Acuna, V. Jampani, and S. Fidler (2019) Gated-scnn: gated shape cnns for semantic segmentation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Vol. , pp. 5228–5237. External Links: Document Cited by: §2.2.
  • A. Ullah, K. Muhammad, I. U. Haq, and S. W. Baik (2019) Action recognition using optimized deep autoencoder and cnn for surveillance data streams of non-stationary environments. Future Generation Computer Systems 96, pp. 386–397. External Links: ISSN 0167-739X, Document, Link Cited by: §2.3.1.
  • L. Wang, Y. Qiao, and X. Tang (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 4305–4314. External Links: Document Cited by: §2.3.1.
  • L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cham, pp. 20–36. External Links: ISBN 978-3-319-46484-8 Cited by: §2.3.1.
  • Y. Wang, Q. Zhou, J. Xiong, X. Wu, and X. Jin (2019) ESNet: an efficient symmetric network for real-time semantic segmentation. In Pattern Recognition and Computer Vision, Z. Lin, L. Wang, J. Yang, G. Shi, T. Tan, N. Zheng, X. Chen, and Y. Zhang (Eds.), Cham, pp. 41–52. External Links: ISBN 978-3-030-31723-2 Cited by: §2.2.
  • X. Wu, D. Sahoo, and S. C.H. Hoi (2020) Recent advances in deep learning for object detection. Neurocomputing 396, pp. 39–64. External Links: ISSN 0925-2312, Document, Link Cited by: §2.
  • Y. Yuan, X. Chen, and J. Wang (2020) Object-contextual representations for semantic segmentation. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 173–190. External Links: ISBN 978-3-030-58539-6 Cited by: §2.2.
  • Z. Zhang, Q. Liu, and Y. Wang (2018) Road extraction by deep residual u-net. IEEE Geoscience and Remote Sensing Letters 15 (5), pp. 749–753. External Links: Document Cited by: §2.2.
  • H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • Z. Zhao, P. Zheng, S. Xu, and X. Wu (2019) Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems 30 (11), pp. 3212–3232. External Links: Document Cited by: §2.1.
  • M. Zong, R. Wang, X. Chen, Z. Chen, and Y. Gong (2021) Motion saliency based multi-stream multiplier resnets for action recognition. Image and Vision Computing 107, pp. 104108. External Links: ISSN 0262-8856, Document, Link Cited by: §2.3.1.