A Review on Deep Learning in UAV Remote Sensing

01/22/2021 ∙ by Lucas Prado Osco, et al. ∙ University of Waterloo, Embrapa, Universidade Federal de Mato Grosso do Sul

Deep Neural Networks (DNNs) learn representations from data with impressive capability and have brought important breakthroughs in processing images, time-series, natural language, audio, video, and more. In the remote sensing field, surveys and literature reviews specifically involving applications of DNN algorithms have been conducted in an attempt to summarize the amount of information produced in its subfields. Recently, applications based on Unmanned Aerial Vehicles (UAVs) have dominated aerial sensing research. However, a literature review combining both the "deep learning" and "UAV remote sensing" thematics has not yet been conducted. The motivation for our work was to present a comprehensive review of the fundamentals of Deep Learning (DL) applied to UAV-based imagery. We focused mainly on describing classification and regression techniques used in recent applications with UAV-acquired data. For that, a total of 232 papers published in international scientific journal databases were examined. We gathered the published material and evaluated their characteristics regarding application, sensor, and technique used. We discuss how DL presents promising results and has potential for processing tasks associated with UAV-based image data. Lastly, we project future perspectives, commenting on prominent DL paths to be explored in the UAV remote sensing field. Our review offers a friendly approach to introduce, comment on, and summarize the state of the art in UAV-based image applications with DNN algorithms in diverse subfields of remote sensing, grouping them in the environmental, urban, and agricultural contexts.


1 Introduction

For investigations using remote sensing image data, multiple processing tasks depend on computer vision algorithms. In the past decade, statistical and Machine Learning (ML) algorithms were mainly used in classification/regression tasks. The increase in remote sensing systems has allowed a wide collection of data from virtually any target on the Earth's surface. Aerial imaging has become a common approach to acquiring data with the advent of Unmanned Aerial Vehicles (UAVs), also known as Remotely Piloted Aircraft (RPA) or, popularly, drones (multi-rotor, fixed-wing, hybrid, etc.). These devices have grown in market availability thanks to their relatively low cost and high operational capability to capture images quickly and easily. The high spatial resolution of UAV-based imagery and the capacity for repeated visits have enabled the creation of large and detailed datasets to be dealt with.

Surface mapping with UAV platforms presents some advantages over orbital and other aerial acquisition methods. Less atmospheric interference, the possibility to fly at lower altitudes, and, mainly, the low operational cost have made this acquisition system popular in both commercial and scientific explorations. However, the visual inspection of multiple objects can still be a time-consuming, biased, and inaccurate operation. Currently, the real challenge in remote sensing approaches is to obtain automatic, rapid, and accurate information from this type of data. In recent years, the advent of Deep Learning (DL) techniques has offered robust and intelligent methods to improve the mapping of the Earth's surface.

DL is an Artificial Neural Network (ANN) method with multiple hidden layers and deeper combinations, which is responsible for optimizing and returning better learning patterns than a common ANN. There is an impressive amount of review material in scientific journals explaining DL-based techniques, their historical evolution, and general usage, as well as detailing networks and functions. However, these are not the main concerns of this paper, and we only briefly present the information necessary to orient the reader before diving into the applications. For those interested in an in-depth approach, we recommend both LeCun's paper (Lecun et al., 2015) and Goodfellow's book (Goodfellow et al., 2016).

As computer processing power and labeled examples (i.e., samples) became more available in recent years, the performance of Deep Neural Networks (DNNs) increased in image-processing applications. DNNs have been successfully applied in data-driven methods. However, much remains to be covered to truly understand their potential, as well as their limitations. In this regard, several surveys on the application of DL in remote sensing have been developed in both general and specific contexts to better explain its importance.

The contexts in which remote sensing literature surveys are presented vary. Zhang et al. (Zhang et al., 2016) organized a review that explains how DL methods were being applied, at the time, to image classification tasks. Later on, Cheng et al. (Cheng and Han, 2016) investigated object detection in optical images, but focused more on traditional ANN and ML. A more complete and systematic review was presented by Ball et al. (Ball et al., 2017) in a survey describing DL theories, tools, and challenges in dealing with remote sensing data. That work should serve as an introductory approach to the theme for first-time readers. Cheng et al. (Cheng et al., 2017) produced a review on image classification with examples from their own experiments. Also focusing on classification, Zhu et al. (Zhu et al., 2017) summarized most of the information available at the time on the DL methods used for this task.

Still in the literature review theme, a survey by Li et al. (Li et al., 2018) helped to understand some DL applications regarding the overall performance of DNNs on publicly available datasets for the image classification task. Yao et al. (Yao et al., 2018) stated in their survey that DL will become the dominant method of image classification in the remote sensing community. Although DL does provide promising results, many observations and examinations are still required. Interestingly, at this time, multiple remote sensing applications using hyperspectral data were underway, and the topic gained attention as a literature review subject. Petersson et al. (Petersson et al., 2017) performed probably one of the first surveys on hyperspectral data. A comparative review by Audebert et al. (Audebert et al., 2019) examined various families of network architectures while providing a publicly available toolbox to perform such methods. In this regard, another paper by Paoletti et al. (Paoletti et al., 2019) organized the source code of DNNs to be easily reproduced. Similar to (Cheng et al., 2017), Li et al. (Li et al., 2019) conducted a literature review while presenting an experimental analysis of DNN methods.

More recently, literature reviews have focused on more specific approaches within this theme. Some included DL methods for the enhancement of remote sensing observations, such as super-resolution, denoising, restoration, pan-sharpening, and image fusion techniques, as demonstrated by Tsagkatakis et al. (Tsagkatakis et al., 2019). Also, a recent meta-analysis by Ma et al. (Ma et al., 2019) covered the usage of DL algorithms in seven subfields of remote sensing: image fusion, image registration, scene classification, object detection, land use and land cover classification, semantic segmentation, and object-based image analysis (OBIA). Although various remote sensing applications using DL can be verified from these recent reviews, it should be noted that their authors did not focus specifically on DL algorithms applied to UAV image sets, a topic that, at the time of writing, has been gaining the attention of remote sensing investigations.

Another interesting take on DL-based methods was related to image segmentation, in a survey by Hossain et al. (Hossain and Chen, 2019) whose theme was expanded by Yuan et al. (Yuan et al., 2021) to include state-of-the-art algorithms. A summarized analysis by Zheng et al. (Zheng et al., 2020) focused on object detection approaches in remote sensing images, indicating some of the challenges related to detection with few labeled samples, multi-scale issues, network structure problems, and cross-domain detection difficulties. In more "niche" types of research, environmental applications and land surface change detection were investigated in literature review papers by Yuan et al. (Yuan et al., 2020) and Khelifi et al. (Khelifi and Mignotte, 2020), respectively.

The aforementioned studies were evaluated with a text processing method that returned a word-cloud in which the word size denotes the frequency of the word within these papers (Fig. 1). An interesting observation regarding this word-cloud is that the term "UAV" is underrepresented or not represented at all. This gap is a problem, since UAV image data is produced daily in large amounts, and no scientific investigation appears to offer a comprehensive literature review to assist new research on this matter. In the UAV context, there are some review papers published in important scientific journals of the remote sensing community. Recently, a review-survey (Bithas et al., 2019) focused on the implications of ML methods applied to UAV image processing, but no investigation was conducted on DL algorithms for this particular issue. This is an important theme, especially since UAV platforms are more easily available to the public and DL-based methods are being tested to provide accurate mapping in highly detailed imagery.

Figure 1: Word-cloud of different literature-review papers related to the "remote sensing" and "deep learning" themes.

As mentioned, UAVs offer flexibility in data collection, as flights are programmed on users' demand; they are low-cost compared to other platforms that offer images of similar spatial resolution; they produce a high level of detail in their data collection; they present dynamic data characteristics, since it is possible to embed RGB, multispectral, hyperspectral, thermal, and LiDAR sensors; and they are capable of gathering data from difficult-to-access places. Aside from that, sensors embedded in UAVs generate data at different altitudes and points of view. These characteristics, alongside others, produce a higher dynamic range of images than common sensing systems. The same object may be viewed from different angles, affecting not only its spatial and spectral information but also its form, texture, pattern, geometry, and illumination. This becomes a challenge for multi-domain detection. As such, studies indicate that DL is the most prominent solution for dealing with these disadvantages. These studies, most of which are presented in this review paper, were conducted within a series of data criteria and evaluated DL architectures in classifying, detecting, and segmenting various objects from UAV scenes.

To the best of our knowledge, there is a literature gap concerning review articles that combine both the "deep learning" and "UAV remote sensing" thematics. This survey is important to summarize the direction of DL applications in the remote sensing community, particularly those related to UAV imagery. The purpose of this study is to provide a brief review of DL methods and their applications to solve classification, object detection, and semantic segmentation problems in the remote sensing field. Herein, we discuss the fundamentals of DL architectures, including recent proposals. There is no intention of summarizing all of the existing literature, but rather to present an examination of DL models while offering the information necessary to understand the current state of the art. Our review highlights traits of UAV-based image data, their applications, sensor types, and techniques used in recent approaches in the remote sensing field. Additionally, we discuss how DL models present promising results and project future perspectives on prominent paths to be explored. In short, this paper brings the following contributions:

  1. A presentation of the fundamental ideas behind DL models, including classification, object detection, and semantic segmentation approaches, as well as the application of these concepts to UAV-image-based mapping tasks;

  2. The examination of published material in scientific sources regarding sensor types and applications, categorized into environmental, urban, and agricultural mapping contexts;

  3. The organization of publicly available datasets from previous research conducted with UAV-acquired data, labeled for both object detection and segmentation tasks;

  4. A description of the challenges and future perspectives of DL-based methods applied to UAV-based image data.

2 Deep Neural Networks Overview

DNNs are based on neural networks composed of neurons (or units) with certain activations and parameters that transform input data (e.g., a UAV remote-sensing image) into outputs (e.g., land use and land cover maps) while progressively learning higher-level features (Ma et al., 2019; Schmidhuber, 2015). This progressive feature learning occurs, among other places, on layers between the input and the output, which are referred to as hidden layers (Ma et al., 2019). DNNs are considered a DL method in their most traditional form (i.e., with 2 or more hidden layers). Their concept, based on an Artificial Intelligence (AI) modeled after biological neurons' connections, has existed since the 1950s. But only later, with advances in computer hardware and the availability of a high number of labeled examples, did interest in them resurge in major scientific fields. In the remote sensing community, DL algorithms have been gaining attention since the mid-2010s, specifically because they achieved significant success in digital image processing tasks (Ma et al., 2019; Khan et al., 2020).

A DNN works similarly to an ANN in the sense that, as a supervised algorithm, it uses a given number of input features to be trained, and these feature observations are combined through multiple operations, with a final layer used to return the desired prediction. Still, this explanation does little to highlight the differences between traditional ANNs and DNNs. LeCun et al. (Lecun et al., 2015), in what is among the most cited articles in the DL literature, define DNNs as follows: "Deep-learning methods are representation-learning methods with multiple levels of representation". Representation learning is a key concept in DL. It allows DL algorithms to be fed with raw data, usually unstructured data such as images, texts, and videos, and to automatically discover representations.

The most common DNNs (Fig. 2) are generally composed of dense layers in which activation functions are implemented. Activation functions compute the weighted sum of inputs and biases, which is used to decide whether a neuron is activated or not (Nwankpa et al., 2018). These functions constitute decision functions that help in learning intrinsic patterns (Khan et al., 2020); i.e., they are one of the main aspects of how each neuron learns from its interaction with the other neurons. Commonly applied activation functions include linear, sigmoid, tanh, max-out, the Rectified Linear Unit (ReLU), and variants of ReLU, including leaky ReLU, the Exponential Linear Unit (ELU), and the Parametric Rectified Linear Unit (PReLU) (Khan et al., 2020). A piecewise linear function, ReLU outputs 0 for all negative input values. This function is, at the time of writing, the most popular in current DNN models. There are reasons for that: it is less computationally expensive than the alternatives, deals well with the vanishing gradient problem (Nwankpa et al., 2018), leads to sparser representations of the data, and, as described in recent literature (Naitzat et al., 2020), has the ability to change data topology. Another potential activation function recently explored is Mish, a self-regularized non-monotonic activation function, which is returning interesting outcomes (Khan et al., 2020) as more investigations are conducted.
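To make these definitions concrete, the snippet below implements a few of the activation functions named above with NumPy. This is an illustrative sketch only; the slope of the leaky ReLU (`alpha`) and the ELU scale are common default choices, not values prescribed by the papers cited here.

```python
import numpy as np

def relu(x):
    # ReLU: outputs 0 for all negative inputs, identity for positive ones.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: keeps a small slope (alpha) for negative inputs.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # ELU: smooth exponential saturation for negative inputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def sigmoid(x):
    # Sigmoid: squashes inputs into the (0, 1) interval.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x), elu(x), sigmoid(x), sep="\n")
```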

Figure 2: A DNN architecture. This is a simple example of how a DNN may be built. Here, the initial layer (Xinput) is composed of the collected data samples. Features of this data are extracted by successive hidden layers, whose weights are adjusted through back-propagation, with each hidden layer feeding the next, which learns these features' characteristics. At the end, a final layer with an activation function suited to the given problem (classification or regression, for example) returns a prediction outcome (Ylabel).

Aside from the activation function, another important aspect of how a DNN works relates to its layers, such as dropout, batch-normalization, convolution, deconvolution, max-pooling, encode-decode, memory cells, and others. For now, we will focus on dropout and batch-normalization layers, as the remaining ones will be mentioned later. Dropout layers are important to introduce regularization within the network, since they randomly choose to "drop" connections and units with a given probability. This not only helps to reduce overfitting by removing co-adapted connections, but also improves generalization and contributes to optimized and faster learning rates (Khan et al., 2020; Hinton et al., 2012). The batch-normalization layer acts as a regulating factor and smoothens the flow of the loss gradient, which also improves generalization. This layer is regularly used to solve issues with covariance shift within feature maps (Khan et al., 2020). The organization in which these and the other layers are composed, as well as their parameters, is one of the main aspects of the architecture.
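As an illustration, the sketch below builds a small dense DNN of the kind shown in Fig. 2, with dropout and batch-normalization layers interleaved. It is a minimal example using the Keras API; the layer widths, dropout rate, and the assumed input dimension (32 features) are arbitrary choices for demonstration, not recommendations from the surveyed papers.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A small fully-connected network: Dense -> BatchNorm -> Dropout blocks,
# ending in a sigmoid unit for a binary classification problem.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),           # 32 input features (assumed)
    layers.Dense(64, activation="relu"),
    layers.BatchNormalization(),           # smoothens the loss gradient flow
    layers.Dropout(0.5),                   # randomly drops units while training
    layers.Dense(64, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid")  # prediction outcome (Ylabel)
])
model.summary()
```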

When compiling a model to be trained, some basic information is also needed. One such item is the optimizer that will drive the parameter updates. Some of the most used methods are Adam, the momentum algorithm, Stochastic Gradient Descent (SGD), and Root Mean Squared Propagation (RMSprop). There are several optimizers, and the correct choice, according to the model and its objective, can help optimize accuracy. SGD is the simplest method: the parameters are shifted toward the optimum of the cost function, computed one example (or mini-batch) per step. Momentum tries to solve the issue of getting stuck at a local minimum by adding a temporal concept to the updates. RMSprop, a gradient-based optimization technique, implements an exponentially decaying average of the gradients, combining momentum with another algorithm known as the Adaptive Gradient Algorithm (AdaGrad). Adam is currently the most used option, and its popularity is due to its ability to use both momentum and adaptive learning rates. A more detailed discussion of this topic is presented in both (Ruder, 2017) and (Khan et al., 2020). Optimizers are an important aspect of the DL network and, combined with the correct loss function, can influence its accuracy.
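The update rules behind these optimizers can be written in a few lines. The sketch below implements single parameter-update steps for SGD with momentum and for Adam in NumPy; the hyperparameter values (learning rate, beta terms, epsilon) are the commonly cited defaults and are shown only to make the mechanics explicit.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # Momentum: accumulate an exponentially weighted velocity of past gradients.
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: momentum (m) plus a per-parameter adaptive learning rate (v).
    m = b1 * m + (1 - b1) * grad        # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - b1 ** t)           # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([0.5, -0.3])
grad = np.array([0.1, -0.2])            # gradient of the loss w.r.t. w
w, vel = sgd_momentum_step(w, grad, np.zeros_like(w))
w, m, v = adam_step(w, grad, np.zeros_like(w), np.zeros_like(w), t=1)
```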

In the optimization context, the function defined to evaluate the model is known as the loss function (also called the objective or cost function). This function represents the ability of the model to fit the training data as a single scalar value. With this reduction, the learning problem becomes one of adjusting the model's parameters to minimize the loss function. This allows possible solutions to be ranked and then compared between the neuron interactions (Goodfellow et al., 2016). Loss functions are grounded in mathematical probability, and the choice is related to the nature of the problem itself; i.e., whether the network is dealing with a classification or a regression problem. For classification, one may use probabilistic losses such as cross-entropy (binary, categorical, and sparse-categorical), Poisson, and Kullback-Leibler (KL) divergence, among others. For regression-related problems, losses based on the Mean Squared Error (MSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Squared Logarithmic Error (MSLE), etc. are commonly implemented. A detailed account of loss functions can be found in (Goodfellow et al., 2016).
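The two most common cases, a classification loss and a regression loss, are easy to state explicitly. The snippet below computes binary cross-entropy and MSE from scratch with NumPy as a minimal illustration; the clipping constant guards against log(0) and is an implementation detail, not part of the formal definitions.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Probabilistic loss for binary classification.
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_true, y_pred):
    # Standard regression loss: average of squared residuals.
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.6])  # predicted class probabilities
print(binary_cross_entropy(y_true, y_prob))  # ~0.299
print(mean_squared_error(np.array([2.0, 3.5]), np.array([2.5, 3.0])))  # 0.25
```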

For evaluating a DNN's performance, different metrics have been adopted (Minaee et al., 2020a), and specialists often rely on the same division mentioned above. For classification, although accuracy (or recall, or sensitivity) is a commonly used parameter, metrics like precision, F-measure (or F-score), the area under the Receiver Operating Characteristic (ROC) curve, and the Intersection over Union (IoU) are also preferred to judge the performance of a network. Another used metric is the Kappa coefficient, but it should be avoided, as explained in recent publications in the remote sensing area (Foody, 2020). For regression-related problems, metrics like MSE, MAE, Mean Relative Error (MRE), Root Mean Squared Error (RMSE), and the correlation coefficient (r) are used. These metrics are important to establish a relationship between predictions and labeled examples (or ground truth in some cases) and are necessary when comparing one model against another (Minaee et al., 2020a). Although regression is not as common as classification in the analysis of remote sensing data, we discuss UAV-based applications of both (classification and regression problems) in the subsequent sections.
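For reference, the sketch below computes several of these metrics with NumPy, assuming binary predictions for the classification case; it is a didactic illustration rather than a replacement for established implementations such as those in scikit-learn.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    # Counts of true positives, false positives, and false negatives.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)  # Intersection over Union for the positive class
    return precision, recall, f1, iou

def regression_metrics(y_true, y_pred):
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mae = np.mean(np.abs(y_true - y_pred))
    r = np.corrcoef(y_true, y_pred)[0, 1]  # Pearson correlation coefficient
    return rmse, mae, r

print(classification_metrics(np.array([1, 1, 0, 0, 1]), np.array([1, 0, 0, 1, 1])))
print(regression_metrics(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))
```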

Multiple types of architectures have been proposed in recent years to improve and optimize DNNs by implementing different kinds of layers, optimizers, loss functions, depth levels, etc. However, it is well known that one of the major reasons behind DNNs' popularity today is the high amount of available data to learn from. A rule of thumb among data scientists indicates that at least 5,000 labeled examples per category were recommended (Goodfellow et al., 2016). But, as of today, many DNN proposals have focused on improving these networks' capacity to predict features with fewer examples than that. Some specifically oriented applications may benefit from this, as it reduces the amount of labor required for sample collection by human inspection. Even so, it should be noted that, while this pursuit is being conducted, multiple efforts are being made by the computer vision community, and novel research includes methods for data augmentation, self-supervised, and unsupervised learning strategies, among others. A detailed discussion of this matter is presented in (Khan et al., 2020), but we briefly discuss some of these by the end of our review.

2.1 Convolutional and Recurrent Neural Networks

A DNN may be formed by different architectures, and the complexity of the model is related to how each layer and additional computational method is implemented. Different DL architectures are proposed regularly, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Deep Belief Networks (DBNs) (Ball et al., 2017), and, more recently, Generative Adversarial Networks (GANs) (Goodfellow et al., 2016). However, the most common DNNs in the supervised category are usually classified as CNNs and RNNs (Khan et al., 2020).

For image processing and object recognition tasks, the majority of current research is focused on CNN architectures. CNNs have long been known in computer vision, but they did not always receive as much attention as they do today. Although studies envisaged that CNN architectures would offer high potential to classify images, it was only in 2012, when Krizhevsky et al. (Krizhevsky et al., 2012) demonstrated a method that won an image classification competition by a large margin, that others became interested in CNNs for image processing. The network, which came to be known as AlexNet, was built with 8 layers, of which the 5 initial layers were convolutional, some followed by max-pooling layers, finished by 3 fully-connected layers, all using the ReLU activation function (Khan et al., 2020). The success of this method, now considered a simple DL network, was associated with its depth.

CNNs (Fig. 3) are a type of architecture composed mainly of three distinct hierarchical structures: convolution layers, pooling layers, and fully connected layers (Ma et al., 2019). They have a large number of parameters and hyperparameters, like weights, biases, the number of layers and neurons, filter size, stride, activation function, learning rate, etc. (Khan et al., 2020). At each layer, the input image is convolved with a set of kernels (i.e., filters) with added biases, generating feature maps (Ma et al., 2019). The convolution operation considers the neighborhood of input pixels, so different levels of correlation can be explored according to the filter sizes (Khan et al., 2020). CNNs were originally designed to process data in the form of multiple arrays, and this trait is particularly well-suited to multiband remote-sensing images, since pixels are arranged regularly. As a result, this architecture is considered one of the most popular DNN models today (Ma et al., 2019), and its success has been demonstrated in several UAV-based image applications.
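A minimal CNN of this kind can be sketched in a few lines. The example below stacks convolution and max-pooling layers followed by a fully connected classifier, in the spirit of Fig. 3; the patch size (64x64 pixels, 3 bands) and the number of classes are illustrative assumptions only.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 4  # hypothetical number of land-cover classes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),        # 64x64 RGB patch (assumed)
    layers.Conv2D(32, 3, activation="relu"),  # kernels produce feature maps
    layers.MaxPooling2D(),                    # spatial down-sampling
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),     # fully connected layer
    layers.Dense(NUM_CLASSES, activation="softmax")
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```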

Figure 3: A CNN-type architecture with convolution and deconvolution layers. This example architecture is formed by convolutional layers, where a dropout layer is added between each convolutional layer and a max-pooling layer is adopted each time the convolution window size is decreased. At the end, a deconvolutional layer is used with the same size as the last convolutional layer, and information from the previous steps is used to reconstruct the image at its original size. The final layer is a softmax that returns the model's predictions.

As a different kind of DL network structure, RNNs refer to another supervised learning model. Although RNNs have been used for a while in other computer vision tasks, only later were they proposed for use with remote sensing data. The RNN model was originally developed to deal with discrete sequence analysis (Ma et al., 2019). The main idea behind RNNs regards their capability of improving their learning from repetitive observations of a given phenomenon or object, often associated with a time-series collection. A type of RNN currently implemented in multiple tasks is the Long Short-Term Memory (LSTM). LSTMs are an interesting choice for time-series-related predictions, as they solve the vanishing gradient problem of the original RNNs. For that, they use additional additive components, allowing the gradients to flow through the network more efficiently (Hochreiter and Schmidhuber, 1997). An LSTM unit is normally composed of a cell, as well as input, output, and forget gates. As the cell "remembers" values over arbitrary time intervals, these three gates regulate the flow of information into and out of the cell.
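The gating mechanism can be written compactly. The sketch below performs one LSTM cell step in NumPy, following the standard formulation (forget, input, and output gates plus a candidate cell state); the weight shapes are arbitrary, and a real implementation would come from a DL framework rather than being hand-rolled like this.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b hold the stacked parameters of the four gates:
    # forget (f), input (i), candidate (g), output (o).
    z = W @ x + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gate activations in (0, 1)
    g = np.tanh(g)                                # candidate cell state
    c = f * c_prev + i * g                        # "remembered" cell state
    h = o * np.tanh(c)                            # hidden state / output
    return h, c

n_in, n_hid = 8, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
```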

In the remote sensing field, RNN models have been applied to time-series analysis tasks, aiming to produce, for example, land cover mapping (Ienco et al., 2017; Ho Tong Minh et al., 2018). In a pixel-based time-series analysis aiming to discriminate classes of winter vegetation coverage using SAR Sentinel-1 data (Ho Tong Minh et al., 2018), it was verified that RNN models outperformed classical ML approaches. A recent approach for accurate vegetation mapping (Feng et al., 2020) combined a multi-scale CNN, to extract spatial features from UAV-RGB imagery, with an attention-based RNN that establishes the sequential dependency between multi-temporal features. The aggregated spatial-temporal features are then used to predict the vegetation category. Such examples with remote sensing data demonstrate the potential of RNNs. One prominent type of architecture is the CNN-LSTM (Fig. 4). This network uses convolutional layers to extract important features from the given input image and feeds them to the LSTM. Although few studies have implemented this type of network, it serves specific purposes, and its usage can be valuable, for example, in multitemporal applications.

Figure 4: An example of a neural network based on the CNN-LSTM type of architecture. The input image is processed with convolutional layers, and a max-pooling layer is used to introduce the information to the LSTM. Each memory cell is updated with weights from the previous cell. After this process, one may use a flatten layer to transform the data into an arrangement readable by a dense (fully-connected) layer, returning, for instance, a classification prediction.
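A CNN-LSTM of this kind can be assembled by applying the same convolutional feature extractor to every image in a temporal sequence and feeding the resulting feature vectors to an LSTM. The Keras sketch below does this with the TimeDistributed wrapper; the sequence length, patch size, and class count are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN, NUM_CLASSES = 6, 3  # e.g., 6 acquisition dates (assumed)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN, 64, 64, 3)),  # a sequence of image patches
    # The same CNN is applied to each time step independently.
    layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Flatten()),
    layers.LSTM(64),                             # temporal dependency modeling
    layers.Dense(NUM_CLASSES, activation="softmax")
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```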

As aforementioned, other types of neural networks, aside from CNNs and RNNs, are currently being proposed to also deal with image data. GANs are amongst the most innovative unsupervised DL models. GANs are composed of two networks, a generative and a discriminative one, that contest with each other. The generative network is responsible for learning features of a particular data distribution of interest, like images, while the discriminative network distinguishes between real data (reference or ground truth) and the data produced by the generative part of the GAN (fake data) (Goodfellow et al., 2014; Ma et al., 2019). Recent approaches in the image processing context, like the classification of remote sensing images (Lin et al., 2017a) and solutions to image-to-image translation problems (Isola et al., 2018), adopted GANs as DL models and obtained successful results.
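The adversarial setup can be summarized in a short skeleton. The Keras sketch below wires a dense generator and discriminator into the classic two-player arrangement; the latent dimension, layer sizes, and flattened 28x28 image shape are toy assumptions, and a practical remote sensing GAN would use convolutional networks and a full training loop.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM, IMG_DIM = 100, 28 * 28  # noise size and flattened image size (assumed)

# Generator: maps random noise to a synthetic (fake) image.
generator = tf.keras.Sequential([
    tf.keras.Input(shape=(LATENT_DIM,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(IMG_DIM, activation="tanh"),
])

# Discriminator: classifies images as real (1) or fake (0).
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(IMG_DIM,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model: train the generator to fool the (frozen) discriminator.
discriminator.trainable = False
gan = tf.keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")
```

During training, the discriminator would be updated on batches of real and generated images, while the combined model updates only the generator, using "real" labels for fake images to push the generator toward fooling the discriminator.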

In short, several DNNs are constantly being developed, in both scientific venues and image-competition platforms, to surpass existing methods. However, as each year passes, some of these neural networks are often mentioned, remembered, or even improved by novel approaches. A summary of well-known DL methods built in recent years is presented in Fig. 5. A detailed take on this, which we recommend to anyone interested, is found in Khan et al. (Khan et al., 2020). Alongside the creation and development of these and others, researchers have observed that greater depth, channel exploration, and, as recently proposed, attention-based feature extraction are regarded as some of the most prominent directions for DL.

Figure 5: A DL timeline indicating some popular architectures implemented in image classification (yellowish color), object detection (greenish color), and segmentation (bluish color). These networks often intertwine, and many adaptations have been proposed for them. Although it may appear that most DL methods were developed during the 2015-2017 period, it is important to note that novel deep networks often use the already developed methods as backbones, or accompany other types of architectures, mainly as the feature extraction part of a much more complex structure.

Initially, most of the proposed supervised DNNs, like CNN and RNN, or CNN-LSTM models, were created to deal with specific issues. Often, these approaches can be grouped into classification tasks, like scene-wise classification, object detection, and semantic and instance segmentation (pixel-wise), and regression tasks. Here, we aim to summarize them, as shown in the next subsections. What follows is a short description of how these approaches are being used in image-related tasks and how they are capable of overcoming some of the challenges faced by previous methods.

2.2 Classification and Regression Approaches

When considering remote sensing data processed with DL-based algorithms, the following tasks can be highlighted: scene-wise classification, semantic and instance segmentation, and object detection. Scene-wise classification involves assigning a class label to each image (or patch), while the object detection task aims to draw bounding boxes around objects in an image (or patch) and label each of them with a class. Object detection can be considered a more challenging task, since it requires locating the objects in the image and then performing their classification. Another manner of detecting objects in an image, instead of drawing bounding boxes, is to delineate regions or structures around the boundary of objects, i.e., distinguish the class of the object at the pixel level. This task is known as semantic segmentation. However, in semantic segmentation it is not possible to distinguish multiple objects of the same category, as each pixel receives one class label (Wu et al., 2020a). To overcome this drawback, a task that combines semantic segmentation and object detection, named instance segmentation, was proposed to detect multiple objects as pixel-level masks and label each mask with a class (Sharma and Mir, 2020).

To produce a deep-regression approach, the model needs to be adapted so that the last fully-connected layer of the architecture deals with a regression problem instead of a common classification one. With this adaptation, continuous values are estimated, differently from classification tasks. In comparison to classification, regression tasks using DL are less common; however, recent publications have shown their potential in remote sensing applications. One approach (Lathuilière et al., 2020) performed a comprehensive analysis of deep regression methods and pointed out that well-known fine-tuned networks, like VGG-16 (Simonyan and Zisserman, 2015) and ResNet-50 (He et al., 2016), can provide interesting results. These methods, however, are normally developed for specific applications, which is a drawback for general-purpose solutions. Another important point is that deep regression does not always succeed, depending on the application. One strategy is to discretize the output space and treat the problem as a classification. For UAV remote sensing applications, the strategy of using well-known networks is generally adopted: not only VGG-16 and ResNet-50, as investigated by (Lathuilière et al., 2020), but also other networks, including AlexNet (Krizhevsky et al., 2012) and VGG-11, have been used. An important issue that could be investigated in future research, depending on the application, is the optimizer. Algorithms with adaptive learning rates, such as AdaGrad, RMSProp, AdaDelta (an extension of AdaGrad), and Adam, are among the commonly used.
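In practice, converting a pretrained classification backbone into a deep-regression model amounts to replacing the final classification layer with a single linear output unit and training with a regression loss. The Keras sketch below does this with a ResNet-50 backbone; the input size and frozen-backbone choice are illustrative assumptions, not a prescription from the works cited above.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Pretrained ResNet-50 backbone without its classification head.
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
backbone.trainable = False  # optionally fine-tune instead of freezing

# Regression head: one linear unit estimating a continuous value
# (e.g., a biophysical variable measured in the field).
model = tf.keras.Sequential([
    backbone,
    layers.Dense(1, activation="linear"),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
```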

2.2.1 Scene-Wise Classification, Object Detection, and Segmentation

Scene-wise classification, or scene recognition, refers to methods that associate a label/theme with an entire image (or patch), such as agricultural scenes, beach scenes, urban scenes, and others (Zou et al., 2015; Ma et al., 2019). Basic DNN methods were developed for this task, and they are among the most common networks for traditional image recognition tasks. In remote sensing applications, scene-wise classification is not usually applied; instead, most applications benefit more from object detection and pixel-wise semantic segmentation approaches. Scene-wise classification needs only the annotation of the class label of the image, while other tasks, like object detection, require drawing a bounding box for every object in an image, which makes labeled datasets more costly to build. For instance or semantic segmentation, the specialist (i.e., the person who performs the annotation or object labeling) needs to draw a mask covering each pixel of the object, which requires more attention and precision in the annotation task, reducing, even more, the availability of datasets. Fig. 6 shows examples of both annotation approaches (object detection and instance segmentation).

Figure 6: Labeled examples. The first row shows a bounding-box object detection label example used to identify individual tree species in an urban environment. The second row shows an instance segmentation label example used to detect rooftops in the same environment.

Object detection methods can be divided into two mainstream categories: one-stage detectors (or regression-based methods) and two-stage detectors (or region-proposal-based methods) (Zhao et al., 2019; Liu et al., 2019; Wu et al., 2020a). The usual two-stage object detection pipeline is to generate region proposals (candidate rectangular bounding boxes) on the feature map, then classify each one into an object class and refine the proposals with a bounding box regression. A widely used strategy in the literature to generate proposals was introduced with the Faster R-CNN algorithm and its Region Proposal Network (RPN) (Zhao et al., 2019). Other state-of-the-art representatives of such algorithms are Cascade-RCNN (Cai and Vasconcelos, 2018), TridentNet (Li et al., 2019), Grid-RCNN (Lu et al., 2019), Dynamic-RCNN (Zhang et al., 2020a), and DetectoRS (Qiao et al., 2020). One-stage detectors, in turn, directly classify and locate objects without a region proposal step. This reduced pipeline achieves a higher detection speed but tends to reduce accuracy. They are known as region-free detectors, since they typically use cell-grid strategies to divide the image and predict the class label of each cell. Besides that, some detectors fit both the one-stage and two-stage categories.

Object detection methods can be described in terms of three components: a) the backbone, which is responsible for extracting semantic features from images; b) the neck, an intermediate component between the backbone and the head used to enrich the features obtained by the backbone; and c) the head, which performs the detection and classification of the bounding boxes.

The backbone is a CNN that receives an image as input and outputs a feature map that describes the image with semantic features. In the DL literature, the state of the art comprises the following backbones: VGG (Simonyan and Zisserman, 2015), ResNet (He et al., 2016), ResNeXt (Xie et al., 2017), HRNet (Wang et al., 2020), RegNet (Radosavovic et al., 2020), Res2Net (Gao et al., 2021), and ResNeSt (Zhang et al., 2020b). The neck component combines, at several scales, low-resolution but semantically strong features, capable of detecting large objects, with high-resolution but semantically weak features, capable of detecting small objects. This is done with the lateral and top-down connections of the convolutional layers of the Feature Pyramid Network (FPN) (Lin et al., 2017b) and its variants, like PAFPN (Liu et al., 2018) and NAS-FPN (Ghiasi et al., 2019). Although the FPN was originally designed for two-stage methods, it has also been applied to single-stage detectors by removing the RPN and adding a classification subnet and a bounding-box regression subnet. The head component is responsible for the detection of the objects, with a softmax classification layer that produces probabilities for all classes and a regression layer that predicts the relative offset of the bounding box positions with respect to the ground truth.

Despite the differences between object detectors (one- or two-stage), their universal problem consists of dealing with a large gap between positive samples (foreground) and negative samples (background) during training, i.e., the class imbalance problem, which can deteriorate accuracy (Chen et al., 2020). In these detectors, the candidate bounding boxes fall into two main classes: positive samples, which are bounding boxes that match the ground truth according to some metric, and negative samples, which do not. A non-maximum suppression filter can then be used to refine these dense candidates by removing overlaps with the most promising ones. The Libra-RCNN (Pang et al., 2019), ATSS (Zhang et al., 2019a), Guided Anchoring (Wang et al., 2019), FSAF (Zhu et al., 2019a), PAA (Kim and Lee, 2020), GFL (Li et al., 2020a), PISA (Cao et al., 2020), and VFNet (Zhang et al., 2020c) detectors explore different sampling strategies and new loss metrics to improve the quality of selected positive samples and reduce the weight of the large pool of negative samples.
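The matching metric usually referred to here is the Intersection over Union (IoU), and the overlap-removal step is non-maximum suppression (NMS). The NumPy sketch below implements both in their plain greedy form; the boxes are given as (x1, y1, x2, y2) corners, and the 0.5 threshold is a conventional choice rather than a fixed standard.

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2). Intersection area over union area.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Greedy non-maximum suppression: keep the highest-scoring box,
    # drop any remaining box overlapping it above the threshold.
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # -> [0, 2]: the second box is suppressed
```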

Another theme explored in the DL literature is the strategy used to encode the bounding boxes, which influences the accuracy of one-stage detectors, as they do not use region proposal networks (Zhang et al., 2020c). In (Zhang et al., 2020c), the authors represent the bounding boxes as a set of representative key-points and find the farthest top, bottom, left, and right points. CenterNet (Duan et al., 2019) detects the object center point instead of using bounding boxes, while CornerNet (Law and Deng, 2020) estimates the top-left and bottom-right corners of the objects. SABL (Wang et al., 2020) uses a chunk-based strategy to discretize the image horizontally and vertically and estimate the offset of each side (bottom, top, left, and right). The VFNet (Zhang et al., 2020c) method proposes a loss function and a star-shaped bounding box (described by nine sampling points) to improve the localization of objects.

Regarding semantic segmentation and instance segmentation approaches, both are generally defined as pixel-level classification problems (Minaee et al., 2020b). The main difference is that the former identifies the pixels belonging to each class but cannot distinguish different objects of the same class in the image, while instance segmentation identifies objects separately but does not label uncountable regions. For example, in an aerial urban image, it may be problematic to identify the locations of cars, trucks, and motorcycles together with the asphalt pavement, which constitutes the background region in which the other objects are located. To unify these two approaches, a method named panoptic segmentation was recently proposed in (Kirillov et al., 2019). With panoptic segmentation, the pixels contained in uncountable regions (e.g., background) receive a specific value indicating so.

Considering the success of the RPN for object detection, variants of Faster R-CNN were adapted for instance segmentation, such as Mask R-CNN (He et al., 2017), which adds, in parallel to the bounding box regression branch, a new branch to predict the masks of the objects (mask generation). Cascade Mask R-CNN (Cai and Vasconcelos, 2019) and HTC (Chen et al., 2019) extend Mask R-CNN to refine the object localization and mask estimation in a cascade manner. PointRend (Kirillov et al., 2020) is a point-based method that reformulates the mask generation branch as a rendering problem to iteratively select points around the contour of the object. Regarding semantic segmentation, methods like U-Net (Ronneberger et al., 2015), SegNet (Badrinarayanan et al., 2017), DeepLabV3+ (Chen et al., 2018), and the Deep Dual-domain Convolutional Neural Network (DDCN) (Nogueira et al., 2019) have also been regularly used and adapted in recent remote sensing investigations (Nogueira et al., 2020). Another important remote sensing approach currently being investigated is the segmentation of objects considering sparse annotations (Hua et al., 2021). Still, as of today, CGNet (Wu et al., 2020b) and DLNet (Yin et al., 2020) are considered the state-of-the-art methods for semantic segmentation.
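Encoder-decoder networks of the U-Net family follow a simple pattern: convolutional blocks with pooling on the way down, upsampling with skip connections on the way up, and a pixel-wise softmax at the end. The Keras sketch below is a deliberately small illustration of that pattern, not the original U-Net; the depth, layer widths, input size, and class count are all arbitrary assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 5  # hypothetical number of land-cover classes

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

inputs = tf.keras.Input(shape=(256, 256, 3))

# Encoder: convolutions followed by max-pooling, keeping skip connections.
c1 = conv_block(inputs, 32)
p1 = layers.MaxPooling2D()(c1)
c2 = conv_block(p1, 64)
p2 = layers.MaxPooling2D()(c2)

# Bottleneck.
b = conv_block(p2, 128)

# Decoder: upsampling concatenated with the encoder's skip connections.
u2 = layers.UpSampling2D()(b)
c3 = conv_block(layers.concatenate([u2, c2]), 64)
u1 = layers.UpSampling2D()(c3)
c4 = conv_block(layers.concatenate([u1, c1]), 32)

# Pixel-wise softmax produces one class probability vector per pixel.
outputs = layers.Conv2D(NUM_CLASSES, 1, activation="softmax")(c4)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```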

3 Deep Learning in UAV Imagery

To identify works related to DL in UAV remote sensing applications, we performed a search in the Web of Science (WOS) and Google Scholar databases. WOS is one of the most respected scientific databases and hosts a high number of scientific journals and publications. We conducted a search using the following string in the WOS: ("TS = ((deep learning OR CNN OR convolutional neural network) AND (UAV OR unmanned aerial vehicle OR drone OR RPAS) AND (remote sensing OR photogrammetry)) AND LANGUAGE: (English) AND Types of Document: (Article OR Book OR Book Chapter OR Book Review OR Letter OR Proceedings Paper OR Review); Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, ESCI. Stipulated-time=every-years."). We considered DL but added CNN, as it is one of the main DL-based architectures used in remote sensing applications (Ma et al., 2019).

We filtered the results to consider only papers that implemented approaches with UAV-based systems. A total of 190 papers were found in the WOS database, of which 136 were articles, 46 proceedings, and 10 reviews. An additional search was conducted in the Google Scholar database to identify works not detected in the WOS, adopting the same combination of keywords. We performed a detailed evaluation of the results and selected only those that, although from respected journals, were not encountered in the WOS search. This resulted in an additional 34 articles, 16 proceedings, and 8 reviews. The entire dataset was composed of 232 articles plus proceedings and 18 reviews from scientific journals indexed in those bases. These papers were then organized and reviewed. Fig. 7 illustrates the main steps of this mapping. The encountered publications were registered only in the last five years (from 2016 to 2021), which indicates how recent the integration of UAV-based approaches with DL methods is in scientific journals.

Figure 7: The schematic procedure adopted to organize the reviewed material according to the respective categories proposed in this review.

The review articles gathered from those bases were separated and mostly used in the word-cloud text analysis of Fig. 1, while the remaining papers (articles and proceedings) were organized according to their category. A total of 283,785 words were analyzed for the word-cloud; we removed words occurring in less than 5% of the texts, to cut lesser-used words unrelated to the theme, and those occurring in more than 95%, to remove plain and simple words frequently used in the English language. The published articles and proceedings were divided in terms of DL-based networks (classification: scene-wise classification, segmentation, and object detection; and regression), sensor type used (RGB, multispectral, hyperspectral, and LiDAR), and application (environmental, urban, and agricultural contexts). We also provide, in a subsequent section, datasets from previously conducted research for further investigation by novel studies. These datasets were organized and their characteristics summarized accordingly.
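A frequency-filtered word-cloud of this kind can be reproduced with a few lines of Python. The sketch below uses scikit-learn's CountVectorizer with document-frequency cut-offs standing in for the 5%/95% filtering described above, and the third-party wordcloud package for rendering; the exact tooling and thresholds used by the authors are not stated, so this is only an approximation of the procedure.

```python
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

# Each entry stands for the extracted text of one review paper (placeholder data).
papers = ["deep learning remote sensing ...", "convolutional networks imagery ..."]

# Keep words appearing in at least 5% and at most 95% of the documents.
vectorizer = CountVectorizer(min_df=0.05, max_df=0.95, stop_words="english")
counts = vectorizer.fit_transform(papers).sum(axis=0).A1
frequencies = dict(zip(vectorizer.get_feature_names_out(), counts))

# Word size in the rendered cloud is proportional to frequency.
cloud = WordCloud(width=800, height=400).generate_from_frequencies(frequencies)
cloud.to_file("word_cloud.png")
```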

Most of our corpus was composed of publications from peer-reviewed remote sensing journals (Fig. 8). Even though the review articles found in the WOS and Google Scholar databases do mention, to some extent, UAV-based applications, none of them was dedicated to the topic. Towards the end of our paper, we examine state-of-the-art approaches, like real-time processing, data dimensionality reduction, domain adaptation, attention-based mechanisms, few-shot learning, open-set, semi-supervised, and unsupervised learning, among others. This information provides an overview of the future opportunities and perspectives of DL methods applied to UAV-based images, where we discuss the implications and challenges of novel approaches.

Figure 8: The distribution of the evaluated scientific material according to data gathered from the Web of Science (WOS) and Google Scholar databases. The y-axis on the left represents the number (n) of published papers, illustrated by solid-colored boxes. The y-axis on the right represents the number of citations these publications have received since publication, according to peer-reviewed scientific journals, illustrated by dashed lines of the same color as their corresponding solid-colored boxes.

The 232 papers (articles plus proceedings) were investigated from a quantitative perspective, where we evaluated the number of occurrences per journal, the number of citations, the year of publication, and the location of the conducted applications according to country. We also prepared and organized a sample according to the corresponding categories, as previously explained, identifying characteristics like the architecture used, evaluation metric, approach, task conducted, type of sensor, and mapping context objective. After evaluating them, we adopted a qualitative approach by reviewing and presenting some of the applications conducted within the papers (UAV + DL) found in the scientific databases, summarizing the most prominent ones. This narrative over the applications was separated according to the respective mapping-context categories (environmental, urban, and agricultural). Later on, when presenting future perspectives and current trends in DL, we mention some of these papers alongside other investigations proposed in computer vision scientific journals that could potentially be used for remote sensing and UAV-based applications.

3.1 Sensors and Applications Worldwide

In the UAV-based imagery context, several applications have benefited from DL approaches. As the usability of these networks increases throughout different remote sensing areas, researchers are also experimenting with their capability to substitute laborious human tasks, as well as to improve traditional measurements performed by shallow learning or conventional statistical methods. Recently, several articles and proceedings have been published in renowned scientific journals. Our survey, whose specifics were previously described, was able to detect some important characteristics. From the data collected, we verified that most UAV-based applications with DL are conducted in countries like China and the USA (Fig. 9). This is somewhat expected, since these countries, alongside their educational and scientific investments, have traditionally been focusing on both computer vision and remote sensing advances for a long time.

Figure 9: Published material according to its respective country of origin. The names of the top publishing countries per continent are also highlighted on the map.

The top 9 countries (highlighted in the Fig. 9 map) are responsible for almost 90% of the scientific publication production regarding this theme. This spatially distributed global information is also important to pinpoint some of the characteristics of these UAV-based applications. In European countries like Germany, the UK, the Netherlands, and Spain, our data indicated that most of the applied methods were used for mapping in the environmental context. In South American countries like Brazil, precision agriculture practices are the preferred approaches. In Asian countries like China and India, both urban and agricultural contexts are the most focused areas. In North America, publications from the USA focused on the agricultural, urban, and environmental contexts alike. Although loose, this analysis may shed some light on how each of these regions is treating its problems and implementing practices related to these themes.

In general terms, the articles collected from the scientific databases demonstrated a pattern related to their architecture (CNN or RNN), evaluation (classification or regression), approach (object detection, segmentation, or scene-wise classification), type of sensor (RGB, multispectral, hyperspectral, or LiDAR), and mapping context (environmental, urban, or agricultural). These patterns can be viewed in a simple diagram (Fig. 10). The following observations can be extracted from this graphic:

  1. Networks in UAV-based applications still rely mostly on CNNs;

  2. Even though object detection is the most common type of approach, there have been many segmentation approaches in recent years;

  3. Most of the sensors used are RGB, followed by multispectral, hyperspectral, and LiDAR; and

  4. There is a substantial number of papers published within the environmental context, with forest-related applications being the most common in this category, while both the urban and agricultural categories were almost evenly distributed among the adopted approaches.

Figure 10: Diagram describing the proceedings and articles according to the defined categories, using the WOS and Google Scholar datasets.

The majority of papers published using UAV-based applications implemented a type of CNN (91.2%). Most of these articles used established architectures (Fig. 5), and a small portion proposed their own models and compared them against state-of-the-art networks. Indeed, this comparison appears to be a crucial concern in recent publications, since it is necessary to ascertain the performance of a proposed method in relation to well-known DL-based models. Still, the popularity of CNN architectures for remote sensing images is not new, mainly for reasons already stated in the previous sections. Besides that, even though present in a small number of articles, RNNs (8.8%), mostly composed of CNN-LSTM architectures, are an emerging trend in this area and appear to be the focus of novel proposals. As UAV systems are capable of operating mostly on the users' own demand (i.e., they can acquire images on multiple dates in a more personalized manner), the same object can be viewed through a type of time-progression approach. This is beneficial for many applications that include the monitoring of stationary objects, like rivers, vegetation, or terrain slopes, for example.

Although classification (97.7%) is by far the most common type of task implemented in these papers, regression (2.3%) is an important form of estimation and may be useful in future applications. The usage of regression in remote sensing applications is worthwhile simply because it enables the estimation of continuous data. Applications that could benefit from regression analysis are present in the environmental, urban, and agricultural contexts, among many others, and it is useful for returning predictions of measured variables. Classification, on the other hand, is more of a common ground for remote sensing approaches and is implemented in every major task (object detection, pixel-wise semantic segmentation, and scene-wise classification).

The aforementioned DL-based architectures were mostly applied to object detection (53.9%) and image segmentation (40.7%) problems, while scene-wise classification (5.4%) was the least common. This preference for object detection may be specifically related to UAV-based data, since the high amount of detail of an object provided by the spatial resolution of the images is both an advantage and a challenge. It is an advantage because it increases the number of objects detectable on the surface (thus, more labeled examples), and it is a challenge because it complicates both the recognition and the segmentation of these objects (higher detail implies more features to be extracted and analyzed). Scene-wise classification, on the other hand, is not as common in remote sensing applications, and image segmentation is often preferred, since assigning a class to each pixel of the image is more beneficial for this type of analysis than merely identifying a scene.

There is also an interesting distribution pattern related to the application context. The data indicated that most of the applications were conducted in the environmental context (46.6%). This context includes approaches that aimed, in a sense, to deal with detection and classification tasks on land use and change, environmental hazards and disasters, erosion estimates, wildlife detection, forest-tree inventory, monitoring of difficult-to-access regions, and others. The urban and agricultural categories (27.2% and 26.4%, respectively) were associated with car and traffic detection and building, street, and rooftop extraction, as well as plant counting, plantation-row detection, weed infestation identification, and others. Interestingly, all of the LiDAR data applications were related to environmental mapping, while RGB images were mostly used in the urban context, followed by the agricultural one. Multispectral and hyperspectral data, however, were less implemented in the urban context in comparison with the other categories. As these categories benefit differently from DL-based methods, a more detailed examination is needed to understand their problems, challenges, and achievements. In the following subsections, we explain these issues and advances while citing suitable examples from within our search database.

Lastly, another important observation regarding the categorization used here is that there is a visible division in the type of sensor used. Most of the published papers in this area evaluated the performance of DL-based networks on RGB imagery (52.4%), followed by multispectral (24.3%), hyperspectral (17.8%), and LiDAR (5.5%) data. The preference for RGB sensors in UAV-based systems may be associated with their low cost and high market availability. The published articles may reflect this, since RGB is a viable option for practical reasons when considering the replicability of the methods. It should be noted that labeled examples in public databases are mostly RGB, which favors improvements and investigations with this type of data. Also, data obtained from multispectral, hyperspectral, and LiDAR sensors are used in more specific applications, which contributes to this division.

Most of the object detection applications relied on RGB data, while segmentation problems were addressed with RGB, multispectral, hyperspectral, and LiDAR data alike. A possible explanation is that object detection often relies on the spatial, texture, pattern, and shape characteristics of the object in the image, whereas segmentation approaches are more diverse and benefit from the amount of spectral and terrain information provided by these sensors. In object detection, DL-based methods may have boosted the usage of RGB images, since simpler and more traditional methods need additional spectral information to perform the task. Also, apart from spectral information, LiDAR, for example, offers important object features for the networks to learn and to refine edges around objects, specifically where their patterns are similar. Regardless, many of these choices are related to the available equipment and the nature of the application itself, so it is difficult to pinpoint a specific reason.

3.2 Environmental Mapping

Environmental approaches with DNN-based methods hold the most diverse applications with remote sensing data, including UAV imagery. These applications adopt different sensors simply because of their divergent nature. To map natural habitats and their characteristics, studies often relied on methods and procedures specifically related to their goals, and no “universal” approach could be proposed or discovered. However, although DL-based methods have not reached this type of “universal” approach, they are dispelling some of the skepticism by being successfully implemented in the most unique scenarios. Although UAV-based practices still pose challenges to both classification and regression tasks, DNN methods are proving to be generally capable of performing them. Regardless, there is still much to be explored.

Several environmental practices could potentially benefit from deep networks like CNNs and RNNs. Examples include monitoring and counting wildlife (Barbedo et al., 2020; Hou et al., 2020; Sundaram and Loganathan, 2020), detecting and classifying vegetation from grasslands and heavily-forested areas (Horning et al., 2020; Hamdi et al., 2019), recognizing fire and smoke signals (Alexandra Larsen et al., 2020; Zhang et al., 2019b), analyzing land use, land cover, and terrain changes, which are often incorporated into environmental planning and decision-making models (Kussul et al., 2017; Zhang et al., 2020d), and predicting and measuring environmental hazards (Dao et al., 2020; Bui et al., 2020), among others. What follows is a brief description of recent material published in remote sensing scientific journals that aimed to solve some of these problems by integrating data from UAV-embedded sensors with DL-based methods.

One of the most common environmental remote sensing applications concerns land use, land cover, and other types of terrain analysis. A recent study (Giang et al., 2020) applied semantic segmentation networks to map land use over a mining extraction area. Another (Al-Najjar et al., 2019) combined information from a Digital Surface Model (DSM) with UAV-based RGB images and applied a type of feature fusion as input for a CNN model. To map coastal regions, an approach (Buscombe and Ritchie, 2018) with RGB data registered at multiple scales used a CNN in combination with a graphical method named conditional random field (CRF). Another study (Park and Song, 2020), combining 2D and 3D convolutional layers on hyperspectral images, was developed to determine the discrepancy between the actual land cover and the assigned land category of cadastral map parcels.

With a semantic segmentation approach, road extraction by a CNN was demonstrated in another investigation (Li et al., 2019). Another study (Gevaert et al., 2020) investigated the performance of an FCN to monitor household upgrading in unplanned settlements. Terrain analysis is a diversified topic at any cartographic scale, but for UAV-based images, in which most acquisitions carry a high level of detail, DL-based methods are producing important discoveries, demonstrating their feasibility for this task. Still, although these studies prove this feasibility, especially in comparison with other methods, novel research should focus on evaluating the performance of deep networks regarding their domain adaptation and generalization capability, for instance by using data at different spatial resolutions, multitemporal imagery, etc.

The detection, evaluation, and prediction of flooded areas represent another type of investigation with datasets provided by UAV-embedded sensors. A study (Gebrehiwot et al., 2019) demonstrated the importance of CNNs for the segmentation of flooded regions, where the network was able to separate water from other targets like buildings, vegetation, and roads. One potential application that could be conducted with UAV-based data, but still needs to be further explored, is mapping and predicting regions of possible flooding with a multitemporal analysis, for example. This, as well as many other possibilities related to flooding, water bodies, and river courses (Carbonneau et al., 2020), could be investigated with DL-based approaches.

For river analysis, an investigation (Zhang et al., 2020e) used a CNN architecture for image segmentation, fusing both positional and channel-wise attentive features to assist in river ice monitoring. Another study (Jakovljevic et al., 2019) compared LiDAR data with point clouds generated by UAV mapping and demonstrated an interesting application of DL-based methods for point cloud classification and rapid Digital Elevation Model (DEM) generation for flood risk mapping. One application of CNNs to UAV data involved measuring hailstones in open areas (Soderholm et al., 2020); there, image segmentation of RGB images returned the maximum and intermediate dimensions of the hailstones. Lastly, on this topic, a comparison (Ichim and Popescu, 2020) of CNNs and GANs to segment both river and vegetation areas demonstrated that “fusing” these networks into a global classifier had the advantage of increasing segmentation efficiency.

UAV-based forest mapping and monitoring is also an emerging approach that has been gaining the attention of the scientific community and, to some extent, governmental bodies. Forest areas often pose difficulties for precise monitoring and investigation, since they can be hard to access and may be dangerous to some degree. In this respect, images taken from UAV-embedded sensors can be used to identify single tree species in forested environments and compose an inventory. Among the papers gathered, multiple types of sensors (RGB, multispectral, hyperspectral, and LiDAR) were used for this approach. One application investigated the performance of a 3D-CNN method to classify tree species in a boreal forest, focusing on pine, spruce, and birch trees, with a combination of RGB and hyperspectral data (Nezami et al., 2020).

Single-tree detection and species classification by CNNs were also investigated in (Ferreira et al., 2020), in which three types of palm trees in the Amazon forest, considered important for its population and native communities, were mapped. Another example (Hu et al., 2020) is the implementation of a Deep Convolutional Generative Adversarial Network (DCGAN) to discriminate between healthy and diseased pine trees in a heavily-dense forested park area. Another recent investigation (Miyoshi et al., 2020) proposed a novel DL method to identify single-tree species in highly-dense areas with UAV-based hyperspectral imagery. These and other scientific studies demonstrate how well DL-based methods can deal with such environments.

Although the majority of approaches encountered in the databases for this category relate to tree-species mapping, UAV-acquired data was also used for other applications in these natural environments. A recent study (Zhang et al., 2020f) proposed a method based on semantic segmentation and scene-wise classification of plants in UAV-based imagery. The method is based on a CNN that classifies individual plants by increasing the image scale while integrating features learned at smaller scales, an important contribution to multi-scale information fusion. Also related to vegetation identification, multiple CNN architectures were investigated in (Hamylton et al., 2020) to distinguish plants from non-plant targets in UAV-based RGB images of an island, achieving interesting performances.

Another application aside from vegetation mapping involves wildlife identification. Animal monitoring in open spaces and grasslands has also received attention, as DL-based object detection and semantic segmentation methods are providing interesting outcomes. A paper by (Kellenberger et al., 2018) covers this topic and discusses, with practical examples, how CNNs may be used in conjunction with UAV-based images to recognize mammals in the African Savannah. This study describes the challenges of this task and proposes a series of suggestions to overcome them, focusing mostly on imbalances in the labeled dataset. Wildlife identification was not only performed in terrestrial environments but also in marine spaces, where a recent publication (Gray et al., 2019) implemented a CNN-based semantic segmentation method to identify cetacean species, mainly blue, humpback, and minke whales, in the ocean. These studies not only demonstrate that such methods can be highly accurate at different tasks but also highlight the potential of DL approaches with UAVs in the current literature.

3.3 Urban Mapping

For urban environments, many DL-based proposals with UAV data have been presented in the literature in recent years. The high spatial resolution easily provided by UAV-embedded sensors is one of the main reasons behind their usage in these areas. Object detection and instance segmentation methods are necessary to individualize, recognize, and map highly-detailed targets in those images. Thus, many applications rely on CNNs and, in a few cases, RNNs (CNN-LSTM). Some of the most common examples encountered in this category during our survey are the identification of pedestrians, car and traffic monitoring, segmentation of individual tree species in urban forests, detection of cracks in concrete surfaces and pavements, building extraction, etc. Most of these applications were conducted with RGB sensors and, in a few cases, with spectral ones.

The usage of RGB sensors is, as aforementioned, a preferred option for small-budget experiments, but it is also related to an important characteristic of CNNs: features like pixel size, form, and texture of an object are essential to its recognition. In this regard, novel experiments could compare the performance of DL-based methods on RGB imagery against other types of sensors. As low-budget systems are easy to deploy in larger quantities, many urban monitoring activities could benefit from such investigations. In urban areas, real-time UAV monitoring is relevant, and it is one of the current objectives when implementing such applications.

The most common practices with UAV-based imagery and DL-based methods in urban environments involve the detection of vehicles and traffic. Car identification is an important task for urban monitoring and may be useful for real-time analysis of traffic flow. It is not an easy task, since vehicles can be occluded by objects like buildings and trees, for example. A recent approach using RGB video footage obtained with a UAV, as presented in (Zhang et al., 2019c), used an object detection CNN for this task. The authors also extended traffic monitoring to motorcycles, where a frame-by-frame analysis enabled the neural network to determine whether the object in the image was a person (pedestrian) or a person riding a motorcycle, based on differences in pattern and frame-to-frame movement. Regarding pedestrian traffic, an approach with thermal cameras presented by (de Oliveira and Wehrmeister, 2018) demonstrated that CNNs are appropriate to detect persons under different camera rotations, angles, sizes, translations, and scales, corroborating the robustness of their learning and generalization capabilities.

Another important application in those areas is the detection and localization of single-tree species, as well as the segmentation of their canopies. Identifying individual species of vegetation in urban locations is an important requisite for urban-environmental planning, since it assists in inventorying species and provides information for decision-making models. A recent study (dos Santos et al., 2019) applied object detection methods to detect and locate tree species threatened by extinction. In a similar direction, a study (Torres et al., 2020) evaluated semantic segmentation neural networks, also to map endangered tree species in urban environments. While one approach aimed to recognize the object to compose an inventory, the other was able to identify it and return important metrics, such as its canopy area. Indeed, some proposals implemented in forest studies could also be adopted in urban areas, which leaves an open field for future research intending to evaluate DL-based models in this environment. Urban areas pose different challenges for tree monitoring, so these applications need to consider their characteristics.

DL-based methods have also been used to recognize and extract infrastructure information. An interesting approach demonstrated by (Boonpook et al., 2021), based on semantic segmentation methods, was able to extract buildings in heavily urbanized areas with unique architectural styles and complex structures. Interestingly, combining RGB with a DSM improved building identification, indicating that the segmentation model was able to incorporate information related to the objects' height. This type of combination of spatial-spectral data and height may be useful in other identification and recognition approaches. Also regarding infrastructure, another possible application in urban areas is the identification and localization of utility poles (Gomes et al., 2020). This application, although a rather specific example, is important for maintaining and monitoring the condition of poles regularly. These types of monitoring in urban environments benefit from DL-based approaches, as they tend to replace multiple human inspection tasks. Another application involves detecting cracks in concrete pavements and surfaces (Bhowmick et al., 2020). Because some regions of civil structures are hard to access, UAV-based data with object detection networks may be useful for this task, yielding a viable real-life application.

Another topic yielding important discoveries relates to land cover pixel segmentation in urban areas, as demonstrated by (Benjdira et al., 2019a). In this investigation, an unsupervised domain adaptation method based on GANs was implemented, working with different data from UAV-based systems and improving the image segmentation of buildings, low vegetation, trees, cars, and impervious surfaces. As aforementioned, GANs and DCGANs are quickly gaining the attention of computer vision communities due to their wide range of applications and the way they function, being trained to differentiate between real and fake data (Goodfellow et al., 2014). Still, their usage in UAV-based imagery remains underexplored, and the accuracy of future investigations, not only on land change and land cover but also on other types of applications, may be improved with them. Nonetheless, apart from differences in angles, rotations, scales, and other characteristics of UAV-based imagery, the diversity of urban scenarios is a problem that should be considered by unsupervised approaches. Therefore, in the current state, DL-based networks may still rely on some form of supervision to guide image processing, specifically regarding domain shift factors.

3.4 Agricultural Mapping

Precision agriculture applications have benefited greatly from the integration of UAV-based imagery and DL methods in recent scientific investigations. The majority of issues addressed by these approaches involve object detection and feature extraction for counting plants and detecting plantation lines, recognizing plantation gaps, segmenting plant species and invasive species such as weeds, detecting phenology and phenotypes, and many others. These applications offer numerous possibilities for this type of mapping, especially since most of these tasks are still conducted manually by human visual inspection. As a result, they can help precision farming practices by returning rapid, unbiased, and accurate predictions, informing decision-making for the management of agricultural systems.

Regardless, although automatic methods do provide important information in this context, they face difficult challenges. Some of these include similarity between the crop of interest and invasive plants, hard-to-detect plants in high-density environments (i.e. small spacing between plants and lines), plantation lines that do not follow a straight path, and edge segmentation when mapping canopies under conflicting shadow and illumination, among many others. Still, novel investigations aim to give these networks a greater generalization capability for dealing with such problems. In this sense, approaches that implement methods under more than one condition or plantation have been the main focus of recent publications. Thus, varied investigation scenarios are currently being proposed, with different types of plantations, sensors, flight altitudes, angles, spatial and spectral differences, dates, phenological stages, etc.

An interesting approach with the potential to be expanded to different orchards was used in (Apolo-Apolo et al., 2020). There, a low-altitude flight with side-view angles was adopted to map yield by counting fruits with a CNN-based method. Counting fruits is not entirely new in DL-based approaches; some papers have demonstrated the effectiveness of bounding-box and point-feature methods for this task (Biffi et al., 2021; Tian et al., 2019a; Kang and Chen, 2020), despite differences in occlusion, lighting, fruit size, and image corruption.

Today’s deep networks demonstrate high potential for yield prediction, as some applications are adopting CNN architectures mainly because of their benefits in image processing. One example is predicting pasture forage with only RGB images (Castro et al., 2020). Another interesting example in crop-yield estimation is presented by (Nevavuori et al., 2020), where a CNN-LSTM was used to predict yield with a spatial-multitemporal approach. There, the authors implemented this structure because RNNs are more appropriate for learning from temporal data, while a 3D-CNN was used to process and classify the images. Although used less frequently than CNNs in the literature, LSTM architectures are attracting attention in precision agriculture, as they appear to be an appropriate choice for the temporal monitoring of these areas.

Nonetheless, one of the most common and beneficial applications of DL-based networks in precision agriculture is counting and detecting plants and plantation lines. Counting plants is essential to produce estimates of production rates and, by geolocating them, to determine whether a problem occurred during the seedling process by identifying plantation gaps. In this regard, identifying plantation lines together with these gaps is also a desired application. Both object detection and image segmentation methods have been implemented in the literature, but most approaches using semantic segmentation algorithms rely on additional procedures, such as a blob detection method (Kitano et al., 2019), for example. These additional steps may not always be desirable, and to prove the generalization capability of a model, multiple tests under different conditions should be performed.

For plantation-line detection, segmentation is currently being implemented and often used to assist in extracting more than one kind of information. In (Feng et al., 2020), a DL-based method was first used to perform row detection in the image, and later to evaluate plant-stand count and canopy size. This type of approach appears to minimize problems related to multi-directional plantation-line paths. Another study (Osco et al., 2021) applied semantic segmentation to UAV-based multispectral data to extract canopy areas and was able to demonstrate which spectral regions were most appropriate for the task.

A recent application with UAV-based data was also proposed in (Osco et al., 2020a), where a CNN model is presented to simultaneously count and detect plants and plantation lines. This model is based on confidence map extraction and is an upgraded version of previous research on citrus-tree counting (Osco et al., 2020b). The CNN implements convolutional layers, a Pyramid Pooling Module (PPM) (Zhao et al., 2017), and a Multi-Stage Module (MSM) with two information branches that, concatenated at the end of the MSM, share the knowledge learned from one branch with the other. This design ensured that the network learned to detect plants located on a plantation line and understood that a plantation line is formed by a linear conjunction of plants. This type of method has also proven successful in dealing with highly-dense plantations. Another study (Ampatzidis and Partel, 2019) that aimed to count citrus trees with a bounding-box-based method returned similar accuracies; however, it was conducted in a sparse plantation, which did not impose the same challenges faced in (Osco et al., 2020b, a). Regardless, to deal with highly-dense scenes, feature extraction from confidence maps appears to be an appropriate approach.
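
To make the confidence-map idea concrete, the sketch below shows how plant locations can be recovered from a predicted map as local maxima. This is a generic post-processing step, not the exact procedure of the cited works; the `threshold` and `window` values and the `peaks_from_confidence_map` helper are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def peaks_from_confidence_map(conf_map, threshold=0.35, window=9):
    """Return (row, col) coordinates of local maxima above `threshold`.

    `conf_map` is assumed to be a 2D array in [0, 1], e.g. the output of a
    confidence-map network after a sigmoid; the parameter values here are
    illustrative, not those used in the cited papers.
    """
    # A pixel is a peak if it equals the maximum of its own neighborhood
    local_max = maximum_filter(conf_map, size=window) == conf_map
    peaks = local_max & (conf_map > threshold)
    return np.argwhere(peaks)

# Usage with a hypothetical model output:
conf = np.random.rand(512, 512)       # stand-in for model(image)
plants = peaks_from_confidence_map(conf)
print(len(plants), "candidate plant locations")
```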

But agricultural applications do not always involve plant counting or plantation-line detection. Similar to the wild-animal identification included in other published studies (Kellenberger et al., 2018; Gray et al., 2019), there is also an interest in cattle detection, which is still an onerous task for human inspection. In UAV-based imagery, some approaches included DL-based bounding-box methods (Barbedo et al., 2019), which were also successfully implemented. DNNs are still underexplored for this task, but published investigations (Rivas et al., 2018) argue that one of the main reasons for using DL methods is the occurrence of changes in terrain (throughout the seasons of the year) and the non-uniform distribution of the animals over the area. On this matter, one interesting approach would involve real-time object detection during the flight, since it is difficult to track animal movement, even in open areas such as pastures, while a UAV system is acquiring data. Another agricultural example refers to the monitoring of offshore aquaculture farms using UAV underwater color imagery and DL models to classify them (Bell et al., 2020). These examples reveal the widespread variety of agricultural problems that can be addressed by integrating DL models and UAV remote sensing data.

Lastly, a field yet to be fully explored in the literature is the identification and recognition of pest and disease indicators in plants using DL-based methods. Most recent approaches aimed to identify invasive species, commonly named “weeds”, in plantation fields. In a demonstration with unsupervised data labeling, (Dian Bah et al., 2018) evaluated the performance of a CNN-based method to predict weeds in the plantation lines of different crops. This pre-processing step to automatically generate labeled data, implemented outside the CNN model structure, is an interesting approach. However, others prefer a “one-step” network to deal with this situation, and different fronts are emerging in the literature. Unsupervised domain adaptation, in which the network extracts learned features from new, unseen data, is one of the most sought-after types of model.

A recent publication (Li et al., 2020b) proposed such an approach for in-field recognition and counting of cotton-boll status. Regardless, with UAV-based data, this remains an open issue. As for disease detection, a study (Kerkech et al., 2020) investigated the use of image segmentation for vine crops with multispectral images and was able to separate visible symptoms (RGB), infrared symptoms (i.e. considering only the infrared band), and symptoms in the intersection between visible and infrared spectral data. Another interesting example regarding pest identification with UAV-based imagery was demonstrated in (Tetila et al., 2020), where superpixel image samples of multiple pest species were considered and activation filters were used, alongside different DL-based architectures, to recognize undesirable visual patterns.

4 Publicly Available UAV-Based Datasets

As mentioned, one of the most important characteristics of DL-based methods is that their learning capability tends to increase with the number of labeled examples used to train the network. In most early approaches with remote sensing data, CNNs were initialized with pre-trained weights from publicly available image repositories on the internet, even though most of these repositories do not contain data acquired with remote sensing platforms. Still, some aerial repositories with labeled examples have been presented in recent years, such as the DOTA (Xia et al., 2018), UAVDT (Du et al., 2018), VisDrone (B et al., 2019), WHU-RS19 (Sheng et al., 2012), RSSCN7 (Zou et al., 2015), RSC11 (Zhao et al., 2016), and Brazilian Coffee Scene (Penatti et al., 2015) datasets. These and others are gaining notoriety in UAV-based applications and could potentially be used to pre-train or benchmark DL methods. These datasets not only serve as an additional option to initialize a network but may also help novel proposals to be compared against the evaluated methods.

Since labeled examples of UAV-acquired data are still scarce, specifically for multispectral and hyperspectral sensors, we aimed to provide UAV-based datasets in both urban and rural scenarios for future research to implement and compare the performance of novel DL-based methods. Table 1 summarizes information related to these datasets and indicates recent publications in which approaches were implemented, along with the results achieved on them. They are available on the following webpage, which is to be constantly updated with novel labeled datasets from here on: Geomatics and Computer Vision/Datasets

Reference | Task | Target | Sensor | GSD (cm) | Best Method | Result
(dos Santos et al., 2019) | Detection | Trees | RGB | 0.82 | RetinaNet | AP = 92.64%
(Torres et al., 2020) | Segmentation | Trees | RGB | 0.82 | FC-DenseNet | F1 = 96.0%
(Osco et al., 2021) | Segmentation | Citrus | Multispectral | 12.59 | DDCN | F1 = 94.4%
(Osco et al., 2020a) | Detection | Citrus | RGB | 2.28 | (Osco et al., 2020a) | F1 = 96.5%
(Osco et al., 2020a) | Detection | Corn | RGB | 1.55 | (Osco et al., 2020a) | F1 = 87.6%
(Osco et al., 2020b) | Detection | Citrus | Multispectral | 12.59 | (Osco et al., 2020b) | F1 = 95.0%
Table 1: UAV-based datasets that are publicly available from previous research.

5 Perspectives in Deep Learning with UAV Data

There is no denying that DL-based methods are a powerful and important tool to deal with the massive amounts of data produced daily by remote sensing systems. This section offers a short commentary on near-term perspectives for some of the most emerging topics in the DL and remote sensing communities that could be implemented with UAV-based imagery. These topics, although presented individually here, have the potential to be combined, as already done in some studies, contributing to the development of novel approaches.

5.1 Real-Time Processing

Most of the environmental, urban, and agricultural applications presented in this study can benefit from real-time responses. Although the combination of UAVs and DL speeds up the processing pipeline, these algorithms are highly compute-intensive; usually, they require post-processing in data centers or on dedicated Graphics Processing Unit (GPU) machines. Although DL is considered a fast method to extract information from data after training, it still bottlenecks real-time applications, mainly because of the number of layers intrinsic to DL architectures. For this reason, research groups, especially in the IoT industry and academia, are racing to develop real-time DL methods. The effort usually goes in two directions: developing faster algorithms and developing dedicated GPU processors.

DL models typically use 32-bit floating points to represent the weights of the neural network. A simple strategy known as quantization reduces the amount of memory required by DL models by representing the weights with 16, 8, or even 1 bit instead of 32-bit floating points. This idea dates back to the 1990s (Fiesler et al., 1990; Balzer et al., 1991) and was recently revived due to the size of DL models. For instance, XNOR-Net (Rastegari et al., 2016), a popular binarized weight strategy, results in 58 times faster convolution operations and 32 times smaller memory usage. The compact representation comes with a possible degradation in predictive performance: a 32-bit full-precision ResNet-18 (He et al., 2016) achieves 89.2% top-5 accuracy on the ImageNet dataset (ImageNet, 2018), while the same ResNet-18 ported to XNOR-Net achieves 73.2% top-5 accuracy. Quantization goes beyond weights to all network components, and the literature reports quantized methods for activation functions and gradient optimization. The survey conducted in (Guo, 2018) gives an important overview of quantization methods. Knowledge distillation (Hinton et al., 2015) is another strategy for obtaining a smaller model, in which a larger “teacher” network guides the learning process of a smaller “student” network.
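
As a concrete illustration, the snippet below applies post-training dynamic quantization with PyTorch, storing the weights of a toy model's linear layers as 8-bit integers. The model itself is a placeholder, not one from the reviewed papers, and this is only a minimal sketch of the weight-quantization idea.

```python
import torch
import torch.nn as nn

# A toy classifier standing in for a trained DL model (illustrative only).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256),
                      nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Post-training dynamic quantization: weights of the listed module types
# are stored as 8-bit integers instead of 32-bit floats.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 3, 32, 32)
print(quantized(x).shape)  # same interface, smaller weight footprint
```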

Another strategy for fast DL models is to design layers with fewer parameters that still retain predictive performance. MobileNets (Howard et al., 2017) and their variants are a good example of this idea: the first version of MobileNet is based on a depthwise convolution (Chollet, 2017) followed by a pointwise convolution (Szegedy et al., 2015). MobileNet (569 million mult-adds and 3.3 million parameters) achieved 83.3% top-1 accuracy on Stanford Dogs, while Inception V3 (5000 million mult-adds and 23.3 million parameters) achieved 84.0% top-1 accuracy on the same dataset. The MobileNet V3 (Howard et al., 2019) architecture was developed using Network Architecture Search (NAS) (Elsken et al., 2019), together with the h-swish activation function and the NetAdapt algorithm (Yang et al., 2018); according to that paper, MobileNetV3-Large is 3.2% more accurate (on ImageNet (ImageNet, 2018)) and 20.0% faster (lower latency) than MobileNetV2. In specific tasks, such as object detection, it is possible to develop architectural enhancements for this approach, such as the Context Enhanced Module (CEM) and the Spatial Attention Module (SAM) (Qin et al., 2019). Both the mAP and the Frames per Second (FPS) are proportional to the size of the backbone: ThunderNet delivers 24.1 FPS on an ARM Snapdragon 845 at 19.2 mAP (0.5:0.95) on the COCO benchmark (Lin et al., 2014) using the SNet49 backbone, while swapping the backbone for a bigger model, SNet535, increases the mAP to 28.1 but reduces the FPS to 5.8.
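
The core building block behind MobileNet's efficiency can be sketched in a few lines. The module below, a minimal sketch, pairs a depthwise 3x3 convolution with a pointwise 1x1 convolution; the channel sizes are chosen only for illustration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv followed by a pointwise 1x1 conv (MobileNet-style).

    Compared to a standard 3x3 convolution, this factorization uses far
    fewer multiply-adds and parameters for similar expressive power.
    """
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # groups=in_ch -> one 3x3 filter per input channel (depthwise)
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, padding=1,
                                   groups=in_ch, bias=False)
        # 1x1 conv mixes information across channels (pointwise)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

block = DepthwiseSeparableConv(32, 64)
print(block(torch.randn(1, 32, 128, 128)).shape)  # -> (1, 64, 128, 128)
```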

When considering even smaller computational power, it is possible to find DL running on microcontroller units (MCUs), where memory and computational power are three to four orders of magnitude smaller than in mobile phones. MCUNet (Lin et al., 2020) combines TinyNAS and TinyEngine to build a model that requires 320 kB of memory and 1 MB of storage. MCUNet achieves 70.7% top-1 accuracy on ImageNet (ImageNet, 2018), similar to the accuracy of ResNet-18 (He et al., 2016) and MobileNetV2 (Sandler et al., 2018). On the hardware side, the industry has already developed embedded AI platforms that run DL algorithms; NVIDIA's Jetson is among the most popular choices, as a survey (Mittal, 2019) of studies using the Jetson platform and its applications demonstrates. A broader survey on this theme, covering GPU, ASIC, FPGA, and MCU AI platforms, can be found in (Imran et al., 2020). Regardless, research in the context of UAV remote sensing is quite limited, and there is a gap that can be filled by future works. Several applications could benefit from this technology, including, for example, agricultural spraying UAVs that recognize different types of weeds in real-time and spray them simultaneously. Other approaches may include real-time monitoring of trees in both urban and forest environments, as well as the detection of other types of objects that benefit from a rapid response.

5.2 Dimensionality Reduction

Due to recent advances in capture devices, hyperspectral images can be acquired even by UAVs. These images consist of tens to hundreds of spectral bands that can assist in the classification of objects in a given application. However, two main issues arise from the high dimensionality: i) the bands can be highly correlated, and ii) the computational cost of DL models increases excessively. High dimensionality can invoke a problem known as the Hughes phenomenon, also known as the curse of dimensionality, in which the accuracy of a classification is reduced due to the introduction of noise and other complications encountered in hyperspectral or high-dimensional data (Hennessy et al., 2020). Hyperspectral data may therefore hinder the accuracy of DL-based approaches, making this an important issue to be considered in remote sensing practices. The classic approach to address high dimensionality is to apply a Principal Component Analysis (PCA) (Licciardi et al., 2012).

Despite several proposals, PCA is generally not applied in conjunction with DL, but as a pre-processing step. Although this is one of the best-known approaches to reduce dimensionality when dealing with hyperspectral data, different approaches have also been presented in the literature. A novel DL approach, implemented with UAV-based imagery, was demonstrated by Miyoshi et al. (2020). There, the authors proposed a one-step approach, conducted within the network's architecture, that considers the combination of hyperspectral bands most related to the labeled example provided in the input layer at the initial stage of the network. Another investigation (Vaddi and Manoharan, 2020) combines a band selection approach, spatial filtering, and a CNN to simultaneously extract spectral and spatial features. Still, the future perspective for solving this issue appears to be a combination of spectral band selection and DL methods in an end-to-end approach, so that both selection and DL can exchange information and improve results. This can also contribute to understanding how DL operates with these images, which was partially accomplished by Miyoshi et al. (2020).
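
A minimal sketch of the classic pre-processing step is shown below, assuming a synthetic hyperspectral cube: PCA projects the spectral bands of each pixel onto a handful of components before the reduced cube is fed to a DL model. The cube dimensions and variance threshold are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical UAV hyperspectral cube: height x width x bands
cube = np.random.rand(256, 256, 120).astype(np.float32)
h, w, bands = cube.shape

# Treat each pixel as a sample with `bands` features and keep the
# components explaining 99% of the spectral variance.
pca = PCA(n_components=0.99)
reduced = pca.fit_transform(cube.reshape(-1, bands))

# Back to image shape: a far thinner cube to feed a CNN.
cube_pca = reduced.reshape(h, w, -1)
print(bands, "->", cube_pca.shape[-1], "bands")
```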

5.3 Domain Adaptation and Transfer Learning

The training steps of DL models are generally carried out with images captured in a specific geographical region, in a short time period, or with a single capture device (also known as domains). When the model is used in practice, it is common for spectral shifts to occur between the training and test images due to differences in acquisition, geographic region, atmospheric conditions, among others (Tuia et al., 2016). Domain adaptation is a technique for adapting models trained in a source domain to a different, but still related, target domain; it is therefore also viewed as a particular form of transfer learning (Tuia et al., 2016). Transfer learning (Zhuang et al., 2020; Tan et al., 2018), on the other hand, also includes applications in which the characteristics of the target domain's space may differ from those of the source domain.

A promising research line for domain adaptation and transfer learning is to consider GANs (Goodfellow et al., 2014; Elshamli et al., 2017). For example, (Benjdira et al., 2019b) proposed the use of GANs to convert an image from the source domain to the target domain, causing the source images to mimic the characteristics of the target-domain images. Recent approaches seek to align the distributions of the source and target domains, although they do not consider direct alignment at the level of the problem classes. Approaches that are attentive to class-level shifts may be more accurate, such as the category-sensitive domain adaptation proposed by (Fang et al., 2019). These approaches reduce the domain shift related to the quality and characteristics of the training images and can be useful in practice for UAV remote sensing.
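
The sketch below illustrates the adversarial principle underlying these approaches in its simplest feature-level form: a discriminator learns to tell source features from target features, while the feature extractor is trained to fool it. It is a didactic skeleton under our own assumptions, not the pipeline of any cited work, and all shapes and hyperparameters are arbitrary.

```python
import torch
import torch.nn as nn

# Illustrative feature extractor and domain discriminator.
features = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
domain_disc = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()
opt_f = torch.optim.Adam(features.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(domain_disc.parameters(), lr=1e-4)

src, tgt = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)

# 1) Train the discriminator to tell source (1) from target (0) features.
d_loss = bce(domain_disc(features(src).detach()), torch.ones(8, 1)) + \
         bce(domain_disc(features(tgt).detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# 2) Train the extractor so target features look like source ones,
#    aligning the two domains (the supervised task loss on labeled
#    source data would be added here in a full pipeline).
f_loss = bce(domain_disc(features(tgt)), torch.ones(8, 1))
opt_f.zero_grad(); f_loss.backward(); opt_f.step()
```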

5.4 Attention Based Mechanisms

Attention mechanisms aim to highlight the most valuable features or image regions by assigning them different weights in a specific task. The topic has recently been applied in remote sensing, providing significant improvements. As pointed out by (Xu et al., 2018), high-resolution images in remote sensing provide a large amount of information, while intra-class variation tends to increase. These variations and the large amount of information make the extraction of relevant features more difficult, since traditional CNNs process all regions with the same weight (relevance). Attention mechanisms, such as the one proposed by (Xu et al., 2018), are useful tools to focus feature extraction on the discriminative regions of the problem, be it in image segmentation (Ding et al., 2021; Su et al., 2019; Zhou et al., 2020), scene-wise classification (Zhu et al., 2019b; Li et al., 2020c), or object detection (Li et al., 2019, 2020c), among others.
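
A minimal example of such a mechanism is a squeeze-and-excitation-style channel attention block, sketched below; it learns one weight per feature channel so that discriminative channels are amplified and uninformative ones suppressed. This is a generic formulation, not the specific module of (Xu et al., 2018).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style attention: reweights feature channels
    so the network can emphasize the most discriminative ones."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                           # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                      # squeeze: global avg pool
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)  # excite: channel weights
        return x * w                                # rescale feature maps

attn = ChannelAttention(64)
print(attn(torch.randn(2, 64, 32, 32)).shape)  # -> (2, 64, 32, 32)
```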

Besides, (Su et al., 2019) argue that remote sensing images are generally divided into patches for training CNNs. Thus, objects can be split across two or more sub-images, causing discriminative and structural information to be lost. Attention mechanisms can be used to aggregate learning by focusing on the relevant regions that describe the objects of interest, as presented in (Su et al., 2019) through a global attention upsample module that provides global context and combines low- and high-level information. Recent advances in computer vision were achieved with attention mechanisms for classification (e.g., the Vision Transformer (Dosovitskiy et al., 2020) and Data-efficient Image Transformers (Touvron et al., 2020)) and object detection (e.g., DETR (Carion et al., 2020)) that have not yet been fully evaluated in remote sensing applications. Some directions also point to the use of attention mechanisms directly on a sequence of image patches (Dosovitskiy et al., 2020; Touvron et al., 2020). These new proposals can improve the results already achieved on remote sensing data, just as they have advanced results on traditional computer vision datasets (e.g., ImageNet (ImageNet, 2018)).

5.5 Few-Shot Learning

Although recent material has demonstrated the feasibility of DL-based methods for multiple tasks, they are still considered limited in terms of generalization, particularly when dealing with the same objects in different geographical areas or when new object classes are considered. Traditional solutions require retraining the model with a robust labeled dataset for the new area or object. Few-shot learning aims to cope with situations in which few labeled examples are available. A recent study (Li et al., 2020), in the context of scene classification, pointed out that few-shot methods in remote sensing are based on transfer learning and meta-learning. Meta-learning can be more flexible than transfer learning and, when applied to the training set to extract meta-knowledge, contributes significantly to few-shot learning on the test set. An interesting strategy to cope with large intra-class variation and inter-class similarity is to implement an attention mechanism in the feature learning step, as previously described. The datasets used in the study by (Li et al., 2020) were not UAV-based; however, the strategy can be explored with UAV imagery.

In the context of UAV remote sensing, there are few studies on few-shot learning. Recently, an investigation (Karami et al., 2020) aimed at the detection of maize plants using the object detection method CenterNet. The authors adopted a transfer learning strategy using models pre-trained on other geographical areas and dates. Fewer images from the new area (150 in total, compared to 600 in the previous training) were used for fine-tuning the model. Based on the literature survey, there is a research gap to be further explored in object detection using few-shot learning in UAV remote sensing. The main idea is to require fewer labeled examples for training, which may help in remote applications where data availability is scarce or the target presents few occurrences.
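
A minimal sketch of this transfer learning strategy is given below: an ImageNet-pretrained backbone stands in for a model trained on another area, its feature extractor is frozen, and only a new classification head is fine-tuned on a small labeled batch. The class names, sizes, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Backbone pre-trained on a large source domain (here ImageNet weights
# stand in for a model trained on another geographical area/date).
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the feature extractor: with only ~150 labeled UAV images,
# we fine-tune the classification head alone.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)  # e.g. maize vs. background

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative step on a tiny labeled batch from the new area.
x, y = torch.randn(4, 3, 224, 224), torch.randint(0, 2, (4,))
loss = criterion(model(x), y)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```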

5.6 Semi-Supervised Learning and Unsupervised Learning

With the increasing availability of remote sensing images, the labeling task required for the supervised training of DL models is expensive and time-consuming. Thus, the performance of DL models is impacted by the lack of large amounts of labeled training images. Efforts have been made to consider unlabeled images in training through unsupervised (unlabeled images only) and semi-supervised (labeled and unlabeled images) learning. In remote sensing, most semi-supervised or unsupervised approaches are based on transfer learning, which usually requires a supervised pre-trained model (Liu and Qin, 2020). In this regard, a recent study (Kang et al., 2020) proposed a promising approach for unlabeled remote sensing images that defines spatial augmentation criteria for relating nearby sub-images. Regardless, this is still an under-developed practice with UAV-based data and should be investigated in novel approaches.

Future perspectives point to the use of contrastive losses (Bachman et al., 2019; Tian et al., 2019b; Hjelm et al., 2019; He et al., 2020) and clustering-based approaches (Caron et al., 2018, 2021). Recent publications have shown interesting results with contrastive losses that have not yet been fully evaluated in remote sensing. For example, (He et al., 2020) proposed an approach based on a contrastive loss that surpassed the performance of its supervised pre-trained counterpart. Clustering-based methods, in turn, group images with similar characteristics (Caron et al., 2018); on this matter, a study (Caron et al., 2018) presented an approach that groups the data while reinforcing the consistency between the cluster assignments produced for a pair of images (the same image under two augmentations). An efficient and effective way to use large numbers of unlabeled images could considerably improve performance, mainly regarding the generalizability of the models.
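
The contrastive idea can be condensed into a few lines. The sketch below implements an InfoNCE-style loss in which two augmented views of the same image are pulled together and all other samples in the batch act as negatives; the embeddings are random placeholders for the output of a hypothetical encoder.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE-style) loss for a batch of embedding pairs.

    z1[i] and z2[i] are two augmented views of the same unlabeled image;
    every other sample in the batch acts as a negative.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature    # scaled cosine similarities
    targets = torch.arange(z1.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Usage with a hypothetical encoder producing 128-d embeddings:
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(info_nce(z1, z2))
```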

5.7 Multitask Learning

Multitask learning aims to perform multiple tasks simultaneously. Several advantages are mentioned in (Crawshaw, 2020), including faster learning and the mitigation of overfitting. Recently, in the context of UAV remote sensing, some important studies have already been developed. One study (Wang et al., 2021) proposed a method to conduct three tasks (semantic segmentation, height estimation, and boundary detection), which also considered boundary attention modules. Another study (Osco et al., 2020a) simultaneously detected plants and plantation lines in UAV-based imagery. The proposed network benefited from considering both tasks in the same structure, since the plants must, essentially, belong to a plantation line; in short, the detection task improved when line detection was considered at the same time. This approach can be further explored in several UAV-based remote sensing applications.
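
A toy version of this idea is sketched below: a shared encoder feeds two per-pixel heads, one for plant confidence and one for plantation lines, and the joint objective is a weighted sum of the two task losses. This is only a didactic skeleton, not the architecture of the cited studies, and all layer sizes and loss weights are arbitrary.

```python
import torch
import torch.nn as nn

class PlantAndLineNet(nn.Module):
    """Toy multitask sketch: a shared encoder feeds two heads so both
    tasks share (and regularize) the same learned features."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.plant_head = nn.Conv2d(32, 1, 1)  # per-pixel plant confidence
        self.line_head = nn.Conv2d(32, 1, 1)   # per-pixel line mask

    def forward(self, x):
        f = self.encoder(x)
        return self.plant_head(f), self.line_head(f)

net = PlantAndLineNet()
plant_logits, line_logits = net(torch.randn(2, 3, 128, 128))

# Joint objective: a (possibly weighted) sum of the per-task losses.
bce = nn.BCEWithLogitsLoss()
plant_gt = torch.rand(2, 1, 128, 128).round()  # placeholder labels
line_gt = torch.rand(2, 1, 128, 128).round()
loss = bce(plant_logits, plant_gt) + 0.5 * bce(line_logits, line_gt)
```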

5.8 Open-Set

The main idea of open-set recognition is to deal with unknown or unseen classes during inference on the testing set (Bendale and Boult, 2016). As the authors mention, recognition in real-world scenarios is “open-set”, unlike the nature of neural networks, which operate in a “closed-set” manner: the testing set is classified considering only the classes used during training, so unknown or unseen classes are not rejected during the test. There are few studies on open-set recognition in the context of remote sensing. Regarding the semantic segmentation of aerial imagery, a study by (da Silva et al., 2020) presented an approach for the open-set context: an adaptation of a closed-set semantic segmentation method, adding a probability threshold after the softmax. A post-processing step based on morphological filters was then applied to the pixels classified as unknown, to verify whether they are interior pixels or belong to borders. Another interesting approach is to combine open-set and domain adaptation methods, as proposed by (Adayel et al., 2020) in the remote sensing context.
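
In its simplest form, this open-set adaptation can be expressed as a rejection rule on the softmax output, as sketched below; the threshold value is illustrative and would be tuned per application.

```python
import torch
import torch.nn.functional as F

UNKNOWN = -1  # label for samples/pixels rejected as unseen classes

def open_set_predict(logits, threshold=0.7):
    """Closed-set softmax prediction with a rejection threshold: any sample
    whose top-class probability falls below `threshold` is marked unknown.
    The threshold value is illustrative."""
    probs = F.softmax(logits, dim=1)
    conf, pred = probs.max(dim=1)
    pred[conf < threshold] = UNKNOWN
    return pred

logits = torch.randn(5, 4)  # 5 samples, 4 known classes
print(open_set_predict(logits))
```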

5.9 Photogrammetric Processing

Although not as developed as other practices, DL-based methods can be adopted to process and optimize the UAV photogrammetric pipeline. This process aims to generate a dense point cloud and an orthomosaic, and it is based on Structure-from-Motion (SfM) and Multi-View Stereo (MVS) techniques. In SfM, the interior and exterior orientation parameters are estimated and a sparse point cloud is generated, relying on a matching technique between the images. A recent survey on image matching (Ma et al., 2021) concluded that this topic is still an open problem and also pointed out the potential of DL for this task. The authors mentioned that DL techniques are mainly applied to feature detection and description, and that further investigations on feature matching can be explored; they also pointed out that a promising direction is the customization of modern feature matching techniques to serve SfM. Regarding DL for UAV image matching, there is a lack of works, indicating potential for future exploration.
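
For reference, the snippet below shows the classical detect-describe-match step that SfM pipelines depend on, using OpenCV's ORB features and Lowe's ratio test; in a DL variant, learned detectors and descriptors would replace ORB. The file names are hypothetical.

```python
import cv2

# Two overlapping UAV frames (hypothetical file names).
img1 = cv2.imread("frame_001.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_002.jpg", cv2.IMREAD_GRAYSCALE)

# Classical keypoint detection and binary description.
orb = cv2.ORB_create(nfeatures=4000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching with Lowe's ratio test to filter
# ambiguous correspondences before pose estimation.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(len(good), "tentative correspondences")
```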

In the UAV photogrammetric process, DL can also be used to filter the DSM, which is essential to generate high-quality orthoimages. Previous work (Gevaert et al., 2018) showed the potential of using DL to filter the DSM and generate the Digital Terrain Model (DTM). Further investigations are required on this theme, mainly considering UAV data. Besides, another task that can benefit from DL is color balancing between images when generating an orthomosaic from thousands of images covering extensive areas.

To summarize, the topics addressed in this section are among the hot topics in the computer vision community, and combining them with remote sensing data can contribute to the development of novel approaches in the context of UAV mapping. It is important to emphasize that these topics are not only being investigated by computer vision research but are also being rapidly implemented in multiple domains beyond remote sensing. As other domains are investigated, novel ways of improving and adapting these networks can be achieved. Future studies in the remote sensing communities, specifically with UAV-based systems, may benefit from these improvements and incorporate them into their applications.

6 Conclusions

DL is still considered, at the time of writing, a “black-box” type of solution for most problems, although novel research is advancing considerably in minimizing this notion. Regardless, in the remote sensing domain, it has already provided important discoveries in most of its implementations. Our literature revision focused on the application of these methods to UAV-based image processing. We structured our study to offer a comprehensive approach to the subject while presenting an overview of state-of-the-art techniques and perspectives regarding their usage. As such, we hope that this literature revision may serve as an inclusive survey summarizing UAV applications based on DNNs. In the evaluated context, this review concludes that:

  1. In the context of UAV remote sensing, most of the published materials are based on object detection methods and RGB sensors; however, some applications, such as those in precision agriculture and forest-related contexts, benefit from multi/hyperspectral data;

  2. There is a need for additional labeled, publicly available datasets obtained with UAVs to train and benchmark the networks. In this context, we contributed by providing a repository with some of our UAV datasets for both agricultural and environmental applications;

  3. Even though CNNs are the most adopted architecture, other methods based on CNN-LSTMs and GANs are gaining attention in UAV remote sensing and image applications, and future UAV remote sensing works may benefit from their inclusion;

  4. DL, when assisted by GPU processing, can provide fast inference solutions. However, there is still a need for further investigation regarding real-time processing using embedded systems on UAVs; and, lastly,

  5. Some promising thematics, such as open-set recognition, attention-based mechanisms, few-shot learning, and multitask learning, can be combined to provide novel approaches in the context of UAV remote sensing; these thematics can also contribute significantly to the generalization capacity of DNNs.

Funding

This research was funded by CNPq (p: 433783/2018-4, 310517/2020-6, 314902/2018-0, 304052/2019-1 and 303559/2019-5), FUNDECT (p: 59/300.066/2015) and CAPES PrInt (p: 88881.311850/2018-01). The authors acknowledge the support of the UFMS (Federal University of Mato Grosso do Sul) and CAPES (Finance Code 001).

Acknowledgments

The authors would like to acknowledge Nvidia Corporation for the donation of the Titan X graphics card.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations were used in this manuscript:

AdaGrad Adaptive Gradient Algorithm
AI Artificial Intelligence
ANN Artificial Neural Network
CEM Context Enhanced Module
CNN Convolutional Neural Network
DCGAN Deep Convolutional Generative Adversarial Network
DDCN Deep Dual-Domain Convolutional Neural Network
DL Deep Learning
DNN Deep Neural Network
DEM Digital Elevation Model
DSM Digital Surface Model
FPS Frames per Second
GAN Generative Adversarial Network
GPU Graphics Processing Unit
KL Kullback-Leibler
LSTM Long Short-Term Memory
IoU Intersection over Union
ML Machine Learning
MAE Mean Absolute Error
MAPE Mean Absolute Percentage Error
MRE Mean Relative Error
MSE Mean Squared Error
MSLE Mean Squared Logarithmic Error
MSM Multi-Stage Module
MVS Multi-View Stereo
NAS Network Architecture Search
PCA Principal Component Analysis
PPM Pyramid Pooling Module
r Correlation Coefficient
RMSE Root Mean Squared Error
RNN Recurrent Neural Network
ROC Receiver Operating Characteristics
RPA Remotely Piloted Aircraft
SAM Spatial Attention Module
SGD Stochastic Gradient Descent
SfM Structure from Motion
UAV Unmanned Aerial Vehicle
WOS Web of Science

References

  • Lecun et al. [2015] Yann Lecun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015. ISSN 14764687. doi: 10.1038/nature14539.
  • Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
  • Zhang et al. [2016] Liangpei Zhang, Lefei Zhang, and Bo Du. Deep learning for remote sensing data: A technical tutorial on the state of the art. IEEE Geoscience and Remote Sensing Magazine, 4(2):22–40, 2016. ISSN 21686831. doi: 10.1109/MGRS.2016.2540798.
  • Cheng and Han [2016] Gong Cheng and Junwei Han. A survey on object detection in optical remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing, 117:11–28, 2016. ISSN 09242716. doi: 10.1016/j.isprsjprs.2016.03.014. URL http://dx.doi.org/10.1016/j.isprsjprs.2016.03.014.
  • Ball et al. [2017] John E. Ball, Derek T. Anderson, and Chee Seng Chan. A comprehensive survey of deep learning in remote sensing: Theories, tools and challenges for the community. arXiv, 11(4), 2017. ISSN 1931-3195. doi: 10.1117/1.jrs.11.042609.
  • Cheng et al. [2017] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. arXiv, 2017. ISSN 23318422.
  • Zhu et al. [2017] Xiao Xiang Zhu, Devis Tuia, Lichao Mou, Gui Song Xia, Liangpei Zhang, Feng Xu, and Friedrich Fraundorfer. Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geoscience and Remote Sensing Magazine, 5(4):8–36, 2017. ISSN 21686831. doi: 10.1109/MGRS.2017.2762307.
  • Li et al. [2018] Ying Li, Haokui Zhang, Xizhe Xue, Yenan Jiang, and Qiang Shen. Deep learning for remote sensing image classification: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(6):1–17, 2018. ISSN 19424795. doi: 10.1002/widm.1264.
  • Yao et al. [2018] Chuchu Yao, Xianxian Luo, Yudan Zhao, Wei Zeng, and Xiaoyu Chen. A review on image classification of remote sensing using deep learning. 2017 3rd IEEE International Conference on Computer and Communications, ICCC 2017, 2018-January:1947–1955, 2018. doi: 10.1109/CompComm.2017.8322878.
  • Petersson et al. [2017] Henrik Petersson, David Gustafsson, and David Bergström. Hyperspectral image analysis using deep learning - A review. 2016 6th International Conference on Image Processing Theory, Tools and Applications, IPTA 2016, 2017. doi: 10.1109/IPTA.2016.7820963.
  • Audebert et al. [2019] Nicolas Audebert, Bertrand Le Saux, and Sebastien Lefevre. Deep learning for classification of hyperspectral data: A comparative review. IEEE Geoscience and Remote Sensing Magazine, 7(2):159–173, 2019. ISSN 21686831. doi: 10.1109/MGRS.2019.2912563.
  • Paoletti et al. [2019] M. E. Paoletti, J. M. Haut, J. Plaza, and A. Plaza. Deep learning classifiers for hyperspectral imaging: A review. ISPRS Journal of Photogrammetry and Remote Sensing, 158(September):279–317, 2019. ISSN 09242716. doi: 10.1016/j.isprsjprs.2019.09.006. URL https://doi.org/10.1016/j.isprsjprs.2019.09.006.
  • Li et al. [2019] Shutao Li, Weiwei Song, Leyuan Fang, Yushi Chen, Pedram Ghamisi, and Jon Atli Benediktsson. Deep learning for hyperspectral image classification: An overview. IEEE Transactions on Geoscience and Remote Sensing, 57(9):6690–6709, 2019. ISSN 15580644. doi: 10.1109/TGRS.2019.2907932.
  • Tsagkatakis et al. [2019] Grigorios Tsagkatakis, Anastasia Aidini, Konstantina Fotiadou, Michalis Giannopoulos, Anastasia Pentari, and Panagiotis Tsakalides. Survey of deep-learning approaches for remote sensing observation enhancement. Sensors (Switzerland), 19(18):1–39, 2019. ISSN 14248220. doi: 10.3390/s19183929.
  • Ma et al. [2019] Lei Ma, Yu Liu, Xueliang Zhang, Yuanxin Ye, Gaofei Yin, and Brian Alan Johnson. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS Journal of Photogrammetry and Remote Sensing, 152:166–177, 2019. ISSN 0924-2716. doi: 10.1016/j.isprsjprs.2019.04.015. URL http://www.sciencedirect.com/science/article/pii/S0924271619301108.
  • Hossain and Chen [2019] Mohammad D. Hossain and Dongmei Chen. Segmentation for Object-Based Image Analysis (OBIA): A review of algorithms and challenges from remote sensing perspective. ISPRS Journal of Photogrammetry and Remote Sensing, 150:115–134, 2019. ISSN 09242716. doi: 10.1016/j.isprsjprs.2019.02.009. URL https://doi.org/10.1016/j.isprsjprs.2019.02.009.
  • Yuan et al. [2021] Xiaohui Yuan, Jianfang Shi, and Lichuan Gu. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Systems with Applications, 169:114417, 2021. ISSN 09574174. doi: 10.1016/j.eswa.2020.114417. URL https://doi.org/10.1016/j.eswa.2020.114417.
  • Zheng et al. [2020] Zhe Zheng, Lin Lei, Hao Sun, and Gangyao Kuang. A Review of Remote Sensing Image Object Detection Algorithms Based on Deep Learning. In 2020 IEEE 5th International Conference on Image, Vision and Computing (ICIVC), pages 34–43, 2020. doi: 10.1109/ICIVC50857.2020.9177453.
  • Yuan et al. [2020] Qiangqiang Yuan, Huanfeng Shen, Tongwen Li, Zhiwei Li, Shuwen Li, Yun Jiang, Hongzhang Xu, Weiwei Tan, Qianqian Yang, Jiwen Wang, Jianhao Gao, and Liangpei Zhang. Deep learning in environmental remote sensing: Achievements and challenges. Remote Sensing of Environment, 241:111716, 2020. ISSN 00344257. doi: 10.1016/j.rse.2020.111716. URL https://doi.org/10.1016/j.rse.2020.111716.
  • Khelifi and Mignotte [2020] Lazhar Khelifi and Max Mignotte. Deep Learning for Change Detection in Remote Sensing Images: Comprehensive Review and Meta-Analysis. IEEE Access, 8:126385–126400, 2020. ISSN 21693536. doi: 10.1109/ACCESS.2020.3008036.
  • Bithas et al. [2019] Petros S. Bithas, Emmanouel T. Michailidis, Nikolaos Nomikos, Demosthenes Vouyioukas, and Athanasios G. Kanatas. A survey on machine-learning techniques for UAV-based communications. Sensors (Switzerland), 19(23):1–39, 2019. ISSN 14248220. doi: 10.3390/s19235170.
  • Schmidhuber [2015] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015. ISSN 0893-6080. doi: 10.1016/j.neunet.2014.09.003. URL http://www.sciencedirect.com/science/article/pii/S0893608014002135.
  • Khan et al. [2020] Asifullah Khan, Anabia Sohail, Umme Zahoora, and Aqsa Saeed Qureshi. A survey of the recent architectures of deep convolutional neural networks. Artificial Intelligence Review, 53:5455–5516, 2020. doi: 10.1007/s10462-020-09825-6. URL https://doi.org/10.1007/s10462-020-09825-6.
  • Nwankpa et al. [2018] Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall. Activation functions: Comparison of trends in practice and research for deep learning. arXiv preprint arXiv:1811.03378, 2018.
  • Naitzat et al. [2020] Gregory Naitzat, Andrey Zhitnikov, and Lek Heng Lim. Topology of deep neural networks. Journal of Machine Learning Research, 21:1–40, 2020. ISSN 15337928.
  • Hinton et al. [2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012. URL http://arxiv.org/abs/1207.0580.
  • Ruder [2017] Sebastian Ruder. An overview of gradient descent optimization algorithms, 2017.
  • Minaee et al. [2020a] Shervin Minaee, Yuri Boykov, Fatih Porikli, Antonio Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. Image segmentation using deep learning: A survey, 2020a.
  • Foody [2020] Giles M. Foody. Explaining the unsuitability of the kappa coefficient in the assessment and comparison of the accuracy of thematic maps obtained by image classification. Remote Sensing of Environment, 239:111630, 2020. ISSN 00344257. doi: 10.1016/j.rse.2019.111630. URL https://doi.org/10.1016/j.rse.2019.111630.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pages 1097–1105, Red Hook, NY, USA, 2012. Curran Associates Inc.
  • Hochreiter and Schmidhuber [1997] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735.
  • Ienco et al. [2017] D. Ienco, R. Gaetano, C. Dupaquier, and P. Maurel. Land cover classification via multitemporal spatial data by deep recurrent neural networks. IEEE Geoscience and Remote Sensing Letters, 14(10):1685–1689, 2017. doi: 10.1109/LGRS.2017.2728698.
  • Ho Tong Minh et al. [2018] D. Ho Tong Minh, D. Ienco, R. Gaetano, N. Lalande, E. Ndikumana, F. Osman, and P. Maurel. Deep recurrent neural networks for winter vegetation quality mapping via multitemporal sar sentinel-1. IEEE Geoscience and Remote Sensing Letters, 15(3):464–468, 2018. doi: 10.1109/LGRS.2018.2794581.
  • Feng et al. [2020] Quanlong Feng, Jianyu Yang, Yiming Liu, Cong Ou, Dehai Zhu, Bowen Niu, Jiantao Liu, and Baoguo Li. Multi-temporal unmanned aerial vehicle remote sensing for vegetable mapping using an attention-based recurrent convolutional neural network. Remote Sensing, 12(10), 2020. ISSN 20724292. doi: 10.3390/rs12101668.
  • Goodfellow et al. [2014] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.
  • Lin et al. [2017a] D. Lin, K. Fu, Y. Wang, G. Xu, and X. Sun. Marta gans: Unsupervised representation learning for remote sensing image classification. IEEE Geoscience and Remote Sensing Letters, 14(11):2092–2096, 2017a. doi: 10.1109/LGRS.2017.2752750.
  • Isola et al. [2018] P. Isola, Jun-Yan Zhu, T. Zhou, and A.A Efros. Image-to-image translation with conditional adversarial networks, 2018.
  • Wu et al. [2020a] Xiongwei Wu, Doyen Sahoo, and Steven C.H. Hoi. Recent advances in deep learning for object detection. Neurocomputing, 396:39–64, 2020a. ISSN 0925-2312. doi: 10.1016/j.neucom.2020.01.085.
  • Sharma and Mir [2020] Vipul Sharma and Roohie Naaz Mir. A comprehensive and systematic look up into deep learning based object detection techniques: A review. Computer Science Review, 38:100301, 2020. ISSN 1574-0137. doi: https://doi.org/10.1016/j.cosrev.2020.100301.
  • Lathuilière et al. [2020] S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud. A comprehensive analysis of deep regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(9):2065–2081, 2020. doi: 10.1109/TPAMI.2019.2910523.
  • Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, page 14, 2015.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. ISSN 10636919. doi: 10.1109/CVPR.2016.90.
  • Zou et al. [2015] Q. Zou, L. Ni, T. Zhang, and Q. Wang. Deep learning based feature selection for remote sensing scene classification. IEEE Geoscience and Remote Sensing Letters, 12(11):2321–2325, 2015. doi: 10.1109/LGRS.2015.2475299.
  • Zhao et al. [2019] Zhong-Qiu Zhao, Peng Zheng, Shou-Tao Xu, and Xindong Wu. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11):3212–3232, 2019. ISSN 2162-237X. doi: 10.1109/TNNLS.2018.2876865.
  • Liu et al. [2019] L. Liu, W. Ouyang, X. Wang, W. P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen. Deep learning for generic object detection: A survey. International Journal of Computer Vision, pages 261–318, 2019.
  • Cai and Vasconcelos [2018] Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018. doi: 10.1109/CVPR.2018.00644.
  • Li et al. [2019] Y. Li, Y. Chen, N. Wang, and Z. Zhang. Scale-aware trident networks for object detection. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6053–6062, 2019. doi: 10.1109/ICCV.2019.00615.
  • Lu et al. [2019] Xin Lu, Buyu Li, Yuxin Yue, Quanquan Li, and Junjie Yan. Grid R-CNN plus: Faster and better. CoRR, abs/1906.05688, 2019. URL http://arxiv.org/abs/1906.05688.
  • Zhang et al. [2020a] Hongkai Zhang, Hong Chang, Bingpeng Ma, Naiyan Wang, and Xilin Chen. Dynamic R-CNN: Towards high quality object detection via dynamic training. arXiv preprint arXiv:2004.06002, 2020a.
  • Qiao et al. [2020] Siyuan Qiao, Liang-Chieh Chen, and Alan Yuille. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. arXiv preprint arXiv:2006.02334, 2020.
  • Xie et al. [2017] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995, 2017. doi: 10.1109/CVPR.2017.634.
  • Wang et al. [2020] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2020. doi: 10.1109/TPAMI.2020.2983686.
  • Radosavovic et al. [2020] I. Radosavovic, R. Kosaraju, R. Girshick, K. He, and P. Dollár. Designing network design spaces. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10425–10433, Los Alamitos, CA, USA, 2020. URL https://doi.ieeecomputersociety.org/10.1109/CVPR42600.2020.01044.
  • Gao et al. [2021] S. H. Gao, M. M. Cheng, K. Zhao, X. Y. Zhang, M. H. Yang, and P. Torr. Res2net: A new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(2):652–662, 2021. doi: 10.1109/TPAMI.2019.2938758.
  • Zhang et al. [2020b] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander Smola. Resnest: Split-attention networks, 2020b.
  • Lin et al. [2017b] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017b. doi: 10.1109/CVPR.2017.106.
  • Liu et al. [2018] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), page 11, 2018.
  • Ghiasi et al. [2019] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7036–7045, 2019.
  • Chen et al. [2020] J. Chen, Q. Wu, D. Liu, and T. Xu. Foreground-background imbalance problem in deep object detectors: A review. In 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pages 285–290, 2020. doi: 10.1109/MIPR49039.2020.00066.
  • Pang et al. [2019] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra R-CNN: Towards balanced learning for object detection. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 821–830, 2019. ISSN 10636919. doi: 10.1109/CVPR.2019.00091.
  • Zhang et al. [2019a] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z. Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. arXiv preprint arXiv:1912.02424, 2019a.
  • Wang et al. [2019] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. Region proposal by guided anchoring. In IEEE Conference on Computer Vision and Pattern Recognition, page 12, 2019.
  • Zhu et al. [2019a] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 840–849, 2019a. ISSN 10636919. doi: 10.1109/CVPR.2019.00093.
  • Kim and Lee [2020] Kang Kim and Hee Seok Lee. Probabilistic anchor assignment with iou prediction for object detection. In European Conference on Computer Vision (ECCV), page 22, 2020.
  • Li et al. [2020a] Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. arXiv preprint arXiv:2006.04388, 2020a.
  • Cao et al. [2020] Yuhang Cao, Kai Chen, Chen Change Loy, and Dahua Lin. Prime sample attention in object detection. In IEEE Conference on Computer Vision and Pattern Recognition, page 9, 2020.
  • Zhang et al. [2020c] Haoyang Zhang, Ying Wang, Feras Dayoub, and Niko Sünderhauf. Varifocalnet: An iou-aware dense object detector. arXiv preprint arXiv:2008.13367, 2020c.
  • Duan et al. [2019] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. Proceedings of the IEEE International Conference on Computer Vision, pages 6568–6577, 2019. ISSN 15505499. doi: 10.1109/ICCV.2019.00667.
  • Law and Deng [2020] Hei Law and Jia Deng. CornerNet: Detecting Objects as Paired Keypoints. International Journal of Computer Vision, 128(3):642–656, 2020. ISSN 15731405. doi: 10.1007/s11263-019-01204-1.
  • Wang et al. [2020] Jiaqi Wang, Wenwei Zhang, Yuhang Cao, Kai Chen, Jiangmiao Pang, Tao Gong, Jianping Shi, Chen Change Loy, and Dahua Lin. Side-aware boundary localization for more precise object detection. In European Conference on Computer Vision (ECCV), page 21, 2020.
  • Minaee et al. [2020b] Shervin Minaee, Yuri Boykov, Fatih Porikli, Antonio Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. Image segmentation using deep learning: A survey, 2020b.
  • Kirillov et al. [2019] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár. Panoptic segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9396–9405, 2019. doi: 10.1109/CVPR.2019.00963.
  • He et al. [2017] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017. doi: 10.1109/ICCV.2017.322.
  • Cai and Vasconcelos [2019] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: high quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • Chen et al. [2019] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. Hybrid task cascade for instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, page 10, 2019.
  • Kirillov et al. [2020] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. Pointrend: Image segmentation as rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 10, June 2020.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. Lecture Notes in Computer Science, 9351:234–241, 2015. ISSN 16113349. doi: 10.1007/978-3-319-24574-4_28.
  • Badrinarayanan et al. [2017] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017. ISSN 01628828. doi: 10.1109/TPAMI.2016.2644615.
  • Chen et al. [2018] Liang Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018. ISSN 01628828. doi: 10.1109/TPAMI.2017.2699184.
  • Nogueira et al. [2019] Keiller Nogueira, Mauro Dalla Mura, Jocelyn Chanussot, William Robson Schwartz, and Jefersson Alex Dos Santos. Dynamic multicontext segmentation of remote sensing images based on convolutional networks. IEEE Transactions on Geoscience and Remote Sensing, 57(10):7503–7520, 2019. ISSN 15580644. doi: 10.1109/TGRS.2019.2913861.
  • Nogueira et al. [2020] Keiller Nogueira, Gabriel L.S. Machado, Pedro H.T. Gama, Caio C.V. da Silva, Remis Balaniuk, and Jefersson A. dos Santos. Facing erosion identification in railway lines using pixel-wise deep-based approaches. Remote Sensing, 12(4):1–21, 2020. ISSN 20724292. doi: 10.3390/rs12040739.
  • Hua et al. [2021] Yuansheng Hua, Diego Marcos, Lichao Mou, Xiao Xiang Zhu, and Devis Tuia. Semantic segmentation of remote sensing images with sparse annotations. IEEE Geoscience and Remote Sensing Letters, 2021.
  • Wu et al. [2020b] Tianyi Wu, Sheng Tang, Rui Zhang, Juan Cao, and Yongdong Zhang. Cgnet: A light-weight context guided network for semantic segmentation. IEEE Transactions on Image Processing, 30:1169–1179, 2020b.
  • Yin et al. [2020] Minghao Yin, Zhuliang Yao, Yue Cao, Xiu Li, Zheng Zhang, Stephen Lin, and Han Hu. Disentangled non-local neural networks, 2020.
  • Barbedo et al. [2020] Jayme Garcia Arnal Barbedo, Luciano Vieira Koenigkan, Patrícia Menezes Santos, and Andrea Roberto Bueno Ribeiro. Counting cattle in uav images—dealing with clustered animals and animal/background contrast changes. Sensors, 20(7), 2020. ISSN 1424-8220. doi: 10.3390/s20072126. URL https://www.mdpi.com/1424-8220/20/7/2126.
  • Hou et al. [2020] Jin Hou, Yuxin He, Hongbo Yang, Thomas Connor, Jie Gao, Yujun Wang, Yichao Zeng, Jindong Zhang, Jinyan Huang, Bochuan Zheng, and Shiqiang Zhou. Identification of animal individuals using deep learning: A case study of giant panda. Biological Conservation, 242:108414, 2020. ISSN 0006-3207. doi: https://doi.org/10.1016/j.biocon.2020.108414. URL http://www.sciencedirect.com/science/article/pii/S000632071931609X.
  • Sundaram and Loganathan [2020] Divya Meena Sundaram and Agilandeeswari Loganathan. FSSCaps-DetCountNet: fuzzy soft sets and CapsNet-based detection and counting network for monitoring animals from aerial images. Journal of Applied Remote Sensing, 14(2):1 – 30, 2020. doi: 10.1117/1.JRS.14.026521. URL https://doi.org/10.1117/1.JRS.14.026521.
  • Horning et al. [2020] Ned Horning, Erica Fleishman, Peter J. Ersts, Frank A. Fogarty, and Martha Wohlfeil Zillig. Mapping of land cover with open-source software and ultra-high-resolution imagery acquired with unmanned aerial vehicles. Remote Sensing in Ecology and Conservation, 6(4):487–497, 2020. ISSN 20563485. doi: 10.1002/rse2.144.
  • Hamdi et al. [2019] Zayd Mahmoud Hamdi, Melanie Brandmeier, and Christoph Straub. Forest damage assessment using deep learning on high resolution remote sensing data. Remote Sensing, 11(17):1–14, 2019. ISSN 20724292. doi: 10.3390/rs11171976.
  • Alexandra Larsen et al. [2020] A. Alexandra Larsen, I. Hanigan, B. J. Reich, Y. Qin, M Cope, G. Morgan, and A. G Rappold. A deep learning approach to identify smoke plumes in satellite imagery in near-real time for health risk communication. Journal of Exposure Science & Environmental Epidemiology, 31:170–176, 2020.
  • Zhang et al. [2019b] Guoli Zhang, Ming Wang, and Kai Liu. Forest Fire Susceptibility Modeling Using a Convolutional Neural Network for Yunnan Province of China. International Journal of Disaster Risk Science, 10(3):386–403, 2019b. ISSN 21926395. doi: 10.1007/s13753-019-00233-1. URL https://doi.org/10.1007/s13753-019-00233-1.
  • Kussul et al. [2017] N. Kussul, M. Lavreniuk, S. Skakun, and A. Shelestov. Deep learning classification of land cover and crop types using remote sensing data. IEEE Geoscience and Remote Sensing Letters, 14(5):778–782, 2017. doi: 10.1109/LGRS.2017.2681128.
  • Zhang et al. [2020d] Xin Zhang, Liangxiu Han, Lianghao Han, and Liang Zhu. How well do deep learning-based methods for land cover classification and object detection perform on high resolution remote sensing imagery? Remote Sensing, 12(3), 2020d. ISSN 2072-4292. doi: 10.3390/rs12030417. URL https://www.mdpi.com/2072-4292/12/3/417.
  • Dao et al. [2020] Dong Van Dao, Abolfazl Jaafari, Mahmoud Bayat, Davood Mafi-Gholami, Chongchong Qi, Hossein Moayedi, Tran Van Phong, Hai-Bang Ly, Tien-Thinh Le, Phan Trong Trinh, Chinh Luu, Nguyen Kim Quoc, Bui Nhi Thanh, and Binh Thai Pham. A spatially explicit deep learning neural network model for the prediction of landslide susceptibility. CATENA, 188:104451, 2020. ISSN 0341-8162. doi: https://doi.org/10.1016/j.catena.2019.104451. URL http://www.sciencedirect.com/science/article/pii/S0341816219305934.
  • Bui et al. [2020] Dieu Tien Bui, Paraskevas Tsangaratos, Viet-Tien Nguyen, Ngo Van Liem, and Phan Trong Trinh. Comparing the prediction performance of a deep learning neural network model with conventional machine learning models in landslide susceptibility assessment. CATENA, 188:104426, 2020. ISSN 0341-8162. doi: https://doi.org/10.1016/j.catena.2019.104426. URL http://www.sciencedirect.com/science/article/pii/S0341816219305685.
  • Giang et al. [2020] T. L. Giang, K. B. Dang, Q. Toan Le, V. G. Nguyen, S. S. Tong, and V. M. Pham. U-net convolutional networks for mining land cover classification based on high-resolution uav imagery. IEEE Access, 8:186257–186273, 2020. doi: 10.1109/ACCESS.2020.3030112.
  • Al-Najjar et al. [2019] Husam A. H. Al-Najjar, Bahareh Kalantar, Biswajeet Pradhan, Vahideh Saeidi, Alfian Abdul Halin, Naonori Ueda, and Shattri Mansor. Land cover classification from fused dsm and uav images using convolutional neural networks. Remote Sensing, 11(12), 2019. ISSN 2072-4292. doi: 10.3390/rs11121461. URL https://www.mdpi.com/2072-4292/11/12/1461.
  • Buscombe and Ritchie [2018] Daniel Buscombe and Andrew C. Ritchie. Landscape classification with deep neural networks. Geosciences, 8(7), 2018. ISSN 2076-3263. doi: 10.3390/geosciences8070244. URL https://www.mdpi.com/2076-3263/8/7/244.
  • Park and Song [2020] Seula Park and Ahram Song. Discrepancy analysis for detecting candidate parcels requiring update of land category in cadastral map using hyperspectral uav images: A case study in jeonju, south korea. Remote Sensing, 12(3), 2020. ISSN 2072-4292. doi: 10.3390/rs12030354. URL https://www.mdpi.com/2072-4292/12/3/354.
  • Li et al. [2019] Yuxia Li, Bo Peng, Lei He, Kunlong Fan, Zhenxu Li, and Ling Tong. Road extraction from unmanned aerial vehicle remote sensing images based on improved neural networks. Sensors (Switzerland), 19(19), 2019. ISSN 14248220. doi: 10.3390/s19194115.
  • Gevaert et al. [2020] Caroline M. Gevaert, Claudio Persello, Richard Sliuzas, and George Vosselman. Monitoring household upgrading in unplanned settlements with unmanned aerial vehicles. International Journal of Applied Earth Observation and Geoinformation, 90:102117, 2020. ISSN 03032434. doi: 10.1016/j.jag.2020.102117. URL https://doi.org/10.1016/j.jag.2020.102117.
  • Gebrehiwot et al. [2019] Asmamaw Gebrehiwot, Leila Hashemi-Beni, Gary Thompson, Parisa Kordjamshidi, and Thomas E. Langan. Deep convolutional neural network for flood extent mapping using unmanned aerial vehicles data. Sensors, 19(7), 2019. ISSN 1424-8220. doi: 10.3390/s19071486. URL https://www.mdpi.com/1424-8220/19/7/1486.
  • Carbonneau et al. [2020] Patrice E. Carbonneau, Stephen J. Dugdale, Toby P. Breckon, James T. Dietrich, Mark A. Fonstad, Hitoshi Miyamoto, and Amy S. Woodget. Adopting deep learning methods for airborne RGB fluvial scene classification. Remote Sensing of Environment, 251:112107, 2020. ISSN 0034-4257. doi: 10.1016/j.rse.2020.112107.
  • Zhang et al. [2020e] Xiuwei Zhang, Jiaojiao Jin, Zeze Lan, Chunjiang Li, Minhao Fan, Yafei Wang, Xin Yu, and Yanning Zhang. ICENET: A semantic segmentation deep network for river ice by fusing positional and channel-wise attentive features. Remote Sensing, 12(2):1–22, 2020e. ISSN 20724292. doi: 10.3390/rs12020221.
  • Jakovljevic et al. [2019] Gordana Jakovljevic, Miro Govedarica, Flor Alvarez-Taboada, and Vladimir Pajic. Accuracy assessment of deep learning based classification of lidar and uav points clouds for dtm creation and flood risk mapping. Geosciences, 9(7), 2019. ISSN 2076-3263. doi: 10.3390/geosciences9070323. URL https://www.mdpi.com/2076-3263/9/7/323.
  • Soderholm et al. [2020] J. S. Soderholm, M. R. Kumjian, N. McCarthy, P. Maldonado, and M. Wang. Quantifying hail size distributions from the sky – application of drone aerial photogrammetry. Atmospheric Measurement Techniques, 13(2):747–754, 2020. doi: 10.5194/amt-13-747-2020. URL https://amt.copernicus.org/articles/13/747/2020/.
  • Ichim and Popescu [2020] Loretta Ichim and Dan Popescu. Segmentation of vegetation and flood from aerial images based on decision fusion of neural networks. Remote Sensing, 12(15), 2020. ISSN 2072-4292. doi: 10.3390/rs12152490. URL https://www.mdpi.com/2072-4292/12/15/2490.
  • Nezami et al. [2020] S. Nezami, E. Khoramshahi, O. Nevalainen, I. Pölönen, and E. Honkavaara. Tree species classification of drone hyperspectral and rgb imagery with deep learning convolutional neural networks. Remote Sensing, 12(7), 2020. doi: 10.3390/rs12071070.
  • Ferreira et al. [2020] Matheus Pinheiro Ferreira, Danilo Roberti Alves de Almeida, Daniel de Almeida Papa, Juliano Baldez Silva Minervino, Hudson Franklin Pessoa Veras, Arthur Formighieri, Caio Alexandre Nascimento Santos, Marcio Aurélio Dantas Ferreira, Evandro Orfanó Figueiredo, and Evandro José Linhares Ferreira. Individual tree detection and species classification of Amazonian palms using UAV images and deep learning. Forest Ecology and Management, 475:118397, 2020. ISSN 03781127. doi: 10.1016/j.foreco.2020.118397. URL https://doi.org/10.1016/j.foreco.2020.118397.
  • Hu et al. [2020] Gensheng Hu, Cunjun Yin, Mingzhu Wan, Yan Zhang, and Yi Fang. Recognition of diseased Pinus trees in UAV images using deep learning and AdaBoost classifier. Biosystems Engineering, 194:138–151, 2020. ISSN 15375110. doi: 10.1016/j.biosystemseng.2020.03.021. URL https://doi.org/10.1016/j.biosystemseng.2020.03.021.
  • Miyoshi et al. [2020] Gabriela Takahashi Miyoshi, Mauro dos Santos Arruda, Lucas Prado Osco, José Marcato Junior, Diogo Nunes Gonçalves, Nilton Nobuhiro Imai, Antonio Maria Garcia Tommaselli, Eija Honkavaara, and Wesley Nunes Gonçalves. A novel deep learning method to identify single tree species in uav-based hyperspectral images. Remote Sensing, 12(8), 2020. ISSN 2072-4292. doi: 10.3390/rs12081294. URL https://www.mdpi.com/2072-4292/12/8/1294.
  • Zhang et al. [2020f] Ce Zhang, Peter M. Atkinson, Charles George, Zhaofei Wen, Mauricio Diazgranados, and France Gerard. Identifying and mapping individual plants in a highly diverse high-elevation ecosystem using UAV imagery and deep learning. ISPRS Journal of Photogrammetry and Remote Sensing, 169:280–291, 2020f. ISSN 09242716. doi: 10.1016/j.isprsjprs.2020.09.025. URL https://doi.org/10.1016/j.isprsjprs.2020.09.025.
  • Hamylton et al. [2020] S.M. Hamylton, R.H. Morris, R.C. Carvalho, N. Roder, P. Barlow, K. Mills, and L. Wang. Evaluating techniques for mapping island vegetation from unmanned aerial vehicle (UAV) images: Pixel classification, visual interpretation and machine learning approaches. International Journal of Applied Earth Observation and Geoinformation, 89:102085, 2020. ISSN 03032434. doi: 10.1016/j.jag.2020.102085. URL https://doi.org/10.1016/j.jag.2020.102085.
  • Kellenberger et al. [2018] Benjamin Kellenberger, Diego Marcos, and Devis Tuia. Detecting mammals in uav images: Best practices to address a substantially imbalanced dataset with deep learning. Remote Sensing of Environment, 216:139–153, 2018. ISSN 0034-4257. doi: 10.1016/j.rse.2018.06.028. URL http://www.sciencedirect.com/science/article/pii/S0034425718303067.
  • Gray et al. [2019] Patrick C. Gray, Kevin C. Bierlich, Sydney A. Mantell, Ari S. Friedlaender, Jeremy A. Goldbogen, and David W. Johnston. Drones and convolutional neural networks facilitate automated and accurate cetacean species identification and photogrammetry. Methods in Ecology and Evolution, 10(9):1490–1500, 2019. ISSN 2041210X. doi: 10.1111/2041-210X.13246.
  • Zhang et al. [2019c] Huaizhong Zhang, Mark Liptrott, Nik Bessis, and Jianquan Cheng. Real-time traffic analysis using deep learning techniques and UAV based video. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–5, 2019c. doi: 10.1109/AVSS.2019.8909879.
  • de Oliveira and Wehrmeister [2018] Diulhio Candido de Oliveira and Marco Aurelio Wehrmeister. Using deep learning and low-cost rgb and thermal cameras to detect pedestrians in aerial images captured by multirotor uav. Sensors (Switzerland), 18(7), 2018. ISSN 14248220. doi: 10.3390/s18072244.
  • dos Santos et al. [2019] Anderson Aparecido dos Santos, José Marcato Junior, Márcio Santos Araújo, David Robledo Di Martini, Everton Castelão Tetila, Henrique Lopes Siqueira, Camila Aoki, Anette Eltner, Edson Takashi Matsubara, Hemerson Pistori, Raul Queiroz Feitosa, Veraldo Liesenberg, and Wesley Nunes Gonçalves. Assessment of CNN-based methods for individual tree detection on images captured by RGB cameras attached to UAVS. Sensors (Switzerland), 19(16):1–11, 2019. ISSN 14248220. doi: 10.3390/s19163595.
  • Torres et al. [2020] Daliana Lobo Torres, Raul Queiroz Feitosa, Patrick Nigri Happ, Laura Elena Cué La Rosa, José Marcato Junior, José Martins, Patrik Olã Bressan, Wesley Nunes Gonçalves, and Veraldo Liesenberg. Applying fully convolutional architectures for semantic segmentation of a single tree species in urban environment on high resolution UAV optical imagery. Sensors (Switzerland), 20(2):1–20, 2020. ISSN 14248220. doi: 10.3390/s20020563.
  • Boonpook et al. [2021] Wuttichai Boonpook, Yumin Tan, and Bo Xu. Deep learning-based multi-feature semantic segmentation in building extraction from images of UAV photogrammetry. International Journal of Remote Sensing, 42(1):1–19, 2021. ISSN 13665901. doi: 10.1080/01431161.2020.1788742.
  • Gomes et al. [2020] Matheus Gomes, Jonathan Silva, Diogo Gonçalves, Pedro Zamboni, Jader Perez, Edson Batista, Ana Ramos, Lucas Osco, Edson Matsubara, Jonathan Li, José Marcato Junior, and Wesley Gonçalves. Mapping utility poles in aerial orthoimages using atss deep learning method. Sensors (Switzerland), 20(21):1–14, 2020. ISSN 14248220. doi: 10.3390/s20216070.
  • Bhowmick et al. [2020] Sutanu Bhowmick, Satish Nagarajaiah, and Ashok Veeraraghavan. Vision and deep learning-based algorithms to detect and quantify cracks on concrete surfaces from UAV videos. Sensors (Switzerland), 20(21):1–19, 2020. ISSN 14248220. doi: 10.3390/s20216299.
  • Benjdira et al. [2019a] Bilel Benjdira, Yakoub Bazi, Anis Koubaa, and Kais Ouni. Unsupervised domain adaptation using generative adversarial networks for semantic segmentation of aerial images. Remote Sensing, 11(11), 2019a. ISSN 20724292. doi: 10.3390/rs11111369.
  • Apolo-Apolo et al. [2020] O. E. Apolo-Apolo, J. Martínez-Guanter, G. Egea, P. Raja, and M. Pérez-Ruiz. Deep learning techniques for estimation of the yield and size of citrus fruits using a UAV. European Journal of Agronomy, 115:126030, 2020. ISSN 11610301. doi: 10.1016/j.eja.2020.126030.
  • Biffi et al. [2021] Leonardo Josoé Biffi, Edson Mitishita, Veraldo Liesenberg, Anderson Aparecido Dos Santos, Diogo Nunes Gonçalves, Nayara Vasconcelos Estrabis, Jonathan de Andrade Silva, Lucas Prado Osco, Ana Paula Marques Ramos, Jorge Antonio Silva Centeno, Marcos Benedito Schimalski, Leo Rufato, Sílvio Luís Rafaeli Neto, José Marcato Junior, and Wesley Nunes Gonçalves. ATSS deep learning-based approach to detect apple fruits. Remote Sensing, 13(1):1–23, 2021. ISSN 20724292. doi: 10.3390/rs13010054.
  • Tian et al. [2019a] Yunong Tian, Guodong Yang, Zhe Wang, Hao Wang, En Li, and Zize Liang. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Computers and Electronics in Agriculture, 157:417–426, 2019a. ISSN 01681699. doi: 10.1016/j.compag.2019.01.012.
  • Kang and Chen [2020] Hanwen Kang and Chao Chen. Fast implementation of real-time fruit detection in apple orchards using deep learning. Computers and Electronics in Agriculture, 168:105108, 2020. ISSN 01681699. doi: 10.1016/j.compag.2019.105108. URL https://doi.org/10.1016/j.compag.2019.105108.
  • Castro et al. [2020] Wellington Castro, José Marcato Junior, Caio Polidoro, Lucas Prado Osco, Wesley Gonçalves, Lucas Rodrigues, Mateus Santos, Liana Jank, Sanzio Barrios, Cacilda Valle, Rosangela Simeão, Camilo Carromeu, Eloise Silveira, Lúcio André de Castro Jorge, and Edson Matsubara. Deep learning applied to phenotyping of biomass in forages with uav-based rgb imagery. Sensors (Switzerland), 20(17):1–18, 2020. ISSN 14248220. doi: 10.3390/s20174802.
  • Nevavuori et al. [2020] Petteri Nevavuori, Nathaniel Narra, Petri Linna, and Tarmo Lipping. Crop yield prediction using multitemporal UAV data and spatio-temporal deep learning models. Remote Sensing, 12(23):1–18, 2020. ISSN 20724292. doi: 10.3390/rs12234000.
  • Kitano et al. [2019] Bruno T. Kitano, Caio C. T. Mendes, Andre R. Geus, Henrique C. Oliveira, and Jefferson R. Souza. Corn Plant Counting Using Deep Learning and UAV Images. IEEE Geoscience and Remote Sensing Letters, pages 1–5, 2019. ISSN 1545-598X. doi: 10.1109/lgrs.2019.2930549.
  • Osco et al. [2021] Lucas Prado Osco, Keiller Nogueira, Ana Paula Marques Ramos, Mayara Maezano Faita Pinheiro, Danielle Elis Garcia Furuya, Wesley Nunes Gonçalves, Lucio André de Castro Jorge, José Marcato Junior, and Jefersson Alex dos Santos. Semantic segmentation of citrus-orchard using deep neural networks and multispectral uav-based imagery. Precision Agriculture, 2021. ISSN 1573-1618. doi: 10.1007/s11119-020-09777-5.
  • Osco et al. [2020a] Lucas Prado Osco, Mauro dos Santos de Arruda, Diogo Nunes Gonçalves, Alexandre Dias, Juliana Batistoti, Mauricio de Souza, Felipe David Georges Gomes, Ana Paula Marques Ramos, Lúcio André de Castro Jorge, Veraldo Liesenberg, Jonathan Li, Lingfei Ma, José Marcato Junior, and Wesley Nunes Gonçalves. A cnn approach to simultaneously count plants and detect plantation-rows from uav imagery, 2020a.
  • Osco et al. [2020b] Lucas Prado Osco, Mauro dos Santos de Arruda, José Marcato Junior, Neemias Buceli da Silva, Ana Paula Marques Ramos, Érika Akemi Saito Moryia, Nilton Nobuhiro Imai, Danillo Roberto Pereira, José Eduardo Creste, Edson Takashi Matsubara, Jonathan Li, and Wesley Nunes Gonçalves. A convolutional neural network approach for counting and geolocating citrus-trees in UAV multispectral imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 160:97–106, 2020b. ISSN 09242716. doi: 10.1016/j.isprsjprs.2019.12.010. URL https://doi.org/10.1016/j.isprsjprs.2019.12.010.
  • Zhao et al. [2017] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network, 2017.
  • Ampatzidis and Partel [2019] Yiannis Ampatzidis and Victor Partel. UAV-based high throughput phenotyping in citrus utilizing multispectral imaging and artificial intelligence. Remote Sensing, 11(4), 2019. ISSN 20724292. doi: 10.3390/rs11040410.
  • Barbedo et al. [2019] Jayme Garcia Arnal Barbedo, Luciano Vieira Koenigkan, Thiago Teixeira Santos, and Patrícia Menezes Santos. A study on the detection of cattle in UAV images using deep learning. Sensors (Switzerland), 19(24):1–14, 2019. ISSN 14248220. doi: 10.3390/s19245436.
  • Rivas et al. [2018] Alberto Rivas, Pablo Chamoso, Alfonso González-Briones, and Juan Manuel Corchado. Detection of cattle using drones and convolutional neural networks. Sensors (Switzerland), 18(7):1–15, 2018. ISSN 14248220. doi: 10.3390/s18072048.
  • Bell et al. [2020] T. W. Bell, N. J. Nidzieko, D. A. Siegel, R. J. Miller, K. C. Cavanaugh, N. B. Nelson, and M. … Griffith. The utility of satellites and autonomous remote sensing platforms for monitoring offshore aquaculture farms: A case study for canopy forming kelps. Frontiers in Marine Science, 2020.
  • Dian Bah et al. [2018] M. Dian Bah, Adel Hafiane, and Raphael Canals. Deep learning with unsupervised data labeling for weed detection in line crops in UAV images. Remote Sensing, 10(11):1–22, 2018. ISSN 20724292. doi: 10.3390/rs10111690.
  • Li et al. [2020b] Yanan Li, Zhiguo Cao, Hao Lu, and Wenxia Xu. Unsupervised domain adaptation for in-field cotton boll status identification. Computers and Electronics in Agriculture, 178:105745, 2020b. ISSN 0168-1699. doi: https://doi.org/10.1016/j.compag.2020.105745. URL http://www.sciencedirect.com/science/article/pii/S0168169920306517.
  • Kerkech et al. [2020] Mohamed Kerkech, Adel Hafiane, and Raphael Canals. Vine disease detection in UAV multispectral images using optimized image registration and deep learning segmentation approach. Computers and Electronics in Agriculture, 174:105446, 2020. ISSN 01681699. doi: 10.1016/j.compag.2020.105446.
  • Tetila et al. [2020] Everton Castelão Tetila, Bruno Brandoli Machado, Gabriel Kirsten Menezes, Adair Da Silva Oliveira, Marco Alvarez, Willian Paraguassu Amorim, Nícolas Alessandro De Souza Belete, Gercina Gonçalves Da Silva, and Hemerson Pistori. Automatic Recognition of Soybean Leaf Diseases Using UAV Images and Deep Convolutional Neural Networks. IEEE Geoscience and Remote Sensing Letters, 17(5):903–907, 2020. ISSN 15580571. doi: 10.1109/LGRS.2019.2932385.
  • Xia et al. [2018] Gui Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3974–3983, 2018. ISSN 10636919. doi: 10.1109/CVPR.2018.00418.
  • Du et al. [2018] Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. Lecture Notes in Computer Science, 11214:375–391, 2018. ISSN 16113349. doi: 10.1007/978-3-030-01249-6_23.
  • Zhu et al. [2019] Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Haibin Ling, Qinghua Hu, Qinqin Nie, Hao Cheng, Chenfeng Liu, Xiaoyu Liu, Wenya Ma, Haotian Wu, Lianjie Wang, Arne Schumann, Chase Brown, and Robert Laganière. VisDrone-DET2018: The Vision Meets Drone Object Detection in Image Challenge Results. Springer, Cham, 2019. ISBN 9783030110215. doi: 10.1007/978-3-030-11021-5.
  • Sheng et al. [2012] Guofeng Sheng, Wen Yang, Tao Xu, and Hong Sun. High-resolution satellite scene classification using a sparse coding based multiple feature combination. International Journal of Remote Sensing, 33(8):2395–2412, 2012. ISSN 13665901. doi: 10.1080/01431161.2011.608740.
  • Zhao et al. [2016] Lijun Zhao, Ping Tang, and Lianzhi Huo. Feature significance-based multibag-of-visual-words model for remote sensing image scene classification. Journal of Applied Remote Sensing, 10(3):1 – 21, 2016. doi: 10.1117/1.JRS.10.035004. URL https://doi.org/10.1117/1.JRS.10.035004.
  • Penatti et al. [2015] Otavio A.B. Penatti, Keiller Nogueira, and Jefersson A. Dos Santos. Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 44–51, 2015. ISSN 21607516. doi: 10.1109/CVPRW.2015.7301382.
  • Fiesler et al. [1990] Emile Fiesler, Amar Choudry, and H John Caulfield. Weight discretization paradigm for optical neural networks. In Optical interconnections and networks, volume 1281, pages 164–173. International Society for Optics and Photonics, 1990.
  • Balzer et al. [1991] Wolfgang Balzer, Masanobu Takahashi, Jun Ohta, and Kazuo Kyuma. Weight quantization in boltzmann machines. Neural Networks, 4(3):405–409, 1991.
  • Rastegari et al. [2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.
  • ImageNet [2018] ImageNet. Imagenet object localization challenge, 2018. URL https://www.kaggle.com/c/imagenet-object-localization-challenge.
  • Guo [2018] Yunhui Guo. A survey on methods and theories of quantized neural networks. arXiv preprint arXiv:1808.04752, 2018.
  • Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Howard et al. [2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Chollet [2017] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
  • Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • Howard et al. [2019] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE International Conference on Computer Vision, pages 1314–1324, 2019.
  • Elsken et al. [2019] Thomas Elsken, Jan Hendrik Metzen, Frank Hutter, et al. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019.
  • Yang et al. [2018] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. Netadapt: Platform-aware neural network adaptation for mobile applications. In Proceedings of the European Conference on Computer Vision (ECCV), pages 285–300, 2018.
  • Qin et al. [2019] Zheng Qin, Zeming Li, Zhaoning Zhang, Yiping Bao, Gang Yu, Yuxing Peng, and Jian Sun. Thundernet: Towards real-time generic object detection on mobile devices. In Proceedings of the IEEE International Conference on Computer Vision, pages 6718–6727, 2019.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2014. URL http://arxiv.org/abs/1405.0312.
  • Lin et al. [2020] Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, and Song Han. Mcunet: Tiny deep learning on iot devices. arXiv preprint arXiv:2007.10319, 2020.
  • Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018. ISSN 10636919. doi: 10.1109/CVPR.2018.00474.
  • Mittal [2019] Sparsh Mittal. A survey on optimized implementation of deep learning models on the nvidia jetson platform. Journal of Systems Architecture, 97:428–442, 2019.
  • Imran et al. [2020] Hamza Ali Imran, Usama Mujahid, Saad Wazir, Usama Latif, and Kiran Mehmood. Embedded development boards for edge-ai: A comprehensive report. arXiv preprint arXiv:2009.00803, 2020.
  • Hennessy et al. [2020] Andrew Hennessy, Kenneth Clarke, and Megan Lewis. Hyperspectral Classification of Plants: A Review of Waveband Selection Generalisability. Remote Sensing, 12(1):113, 2020. ISSN 2072-4292. doi: 10.3390/rs12010113.
  • Licciardi et al. [2012] G. Licciardi, P. R. Marpu, J. Chanussot, and J. A. Benediktsson. Linear versus nonlinear pca for the classification of hyperspectral data based on the extended morphological profiles. IEEE Geoscience and Remote Sensing Letters, 9(3):447–451, 2012. doi: 10.1109/LGRS.2011.2172185.
  • Vaddi and Manoharan [2020] Radhesyam Vaddi and Prabukumar Manoharan. Cnn based hyperspectral image classification using unsupervised band selection and structure-preserving spatial features. Infrared Physics & Technology, 110:103457, 2020. ISSN 1350-4495. doi: https://doi.org/10.1016/j.infrared.2020.103457. URL http://www.sciencedirect.com/science/article/pii/S1350449520305053.
  • Tuia et al. [2016] D. Tuia, C. Persello, and L. Bruzzone. Domain adaptation for the classification of remote sensing data: An overview of recent advances. IEEE Geoscience and Remote Sensing Magazine, 4(2):41–57, 2016. doi: 10.1109/MGRS.2016.2548504.
  • Zhuang et al. [2020] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020.
  • Tan et al. [2018] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International conference on artificial neural networks, pages 270–279. Springer, 2018.
  • Elshamli et al. [2017] A. Elshamli, G. W. Taylor, A. Berg, and S. Areibi. Domain adaptation using representation learning for the classification of remote sensing images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(9):4198–4209, 2017. doi: 10.1109/JSTARS.2017.2711360.
  • Benjdira et al. [2019b] Bilel Benjdira, Yakoub Bazi, Anis Koubaa, and Kais Ouni. Unsupervised domain adaptation using generative adversarial networks for semantic segmentation of aerial images. Remote Sensing, 11(11), 2019b. ISSN 2072-4292. doi: 10.3390/rs11111369. URL https://www.mdpi.com/2072-4292/11/11/1369.
  • Fang et al. [2019] Bo Fang, Rong Kou, Li Pan, and Pengfei Chen. Category-sensitive domain adaptation for land cover mapping in aerial scenes. Remote Sensing, 11(22), 2019. ISSN 2072-4292. doi: 10.3390/rs11222631. URL https://www.mdpi.com/2072-4292/11/22/2631.
  • Xu et al. [2018] Rudong Xu, Yiting Tao, Zhongyuan Lu, and Yanfei Zhong. Attention-mechanism-containing neural networks for high-resolution remote sensing image classification. Remote Sensing, 10(10), 2018. ISSN 2072-4292. doi: 10.3390/rs10101602. URL https://www.mdpi.com/2072-4292/10/10/1602.
  • Ding et al. [2021] L. Ding, H. Tang, and L. Bruzzone. Lanet: Local attention embedding to improve the semantic segmentation of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 59(1):426–435, 2021. doi: 10.1109/TGRS.2020.2994150.
  • Su et al. [2019] Y. Su, Y. Wu, M. Wang, F. Wang, and J. Cheng. Semantic segmentation of high resolution remote sensing image based on batch-attention mechanism. In IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium, pages 3856–3859, 2019. doi: 10.1109/IGARSS.2019.8898198.
  • Zhou et al. [2020] Dengji Zhou, Guizhou Wang, Guojin He, Tengfei Long, Ranyu Yin, Zhaoming Zhang, Sibao Chen, and Bin Luo. Robust building extraction for high spatial resolution remote sensing images with self-attention network. Sensors, 20(24), 2020. ISSN 1424-8220. doi: 10.3390/s20247241. URL https://www.mdpi.com/1424-8220/20/24/7241.
  • Zhu et al. [2019b] Ruixi Zhu, Li Yan, Nan Mo, and Yi Liu. Attention-based deep feature fusion for the scene classification of high-resolution remote sensing images. Remote Sensing, 11(17), 2019b. ISSN 2072-4292. doi: 10.3390/rs11171996. URL https://www.mdpi.com/2072-4292/11/17/1996.
  • Li et al. [2020c] Yangyang Li, Qin Huang, Xuan Pei, Licheng Jiao, and Ronghua Shang. Radet: Refine feature pyramid network and multi-layer attention network for arbitrary-oriented object detection of remote sensing images. Remote Sensing, 12(3), 2020c. ISSN 2072-4292. doi: 10.3390/rs12030389. URL https://www.mdpi.com/2072-4292/12/3/389.
  • Li et al. [2019] C. Li, C. Xu, Z. Cui, D. Wang, T. Zhang, and J. Yang. Feature-attentioned object detection in remote sensing imagery. In 2019 IEEE International Conference on Image Processing (ICIP), pages 3886–3890, 2019. doi: 10.1109/ICIP.2019.8803521.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2020.
  • Touvron et al. [2020] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention, 2020.
  • Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 213–229, Cham, 2020. Springer International Publishing.
  • Li et al. [2020] L. Li, J. Han, X. Yao, G. Cheng, and L. Guo. Dla-matchnet for few-shot remote sensing image scene classification. IEEE Transactions on Geoscience and Remote Sensing, pages 1–10, 2020. doi: 10.1109/TGRS.2020.3033336.
  • Karami et al. [2020] A. Karami, M. Crawford, and E. J. Delp. Automatic plant counting and location based on a few-shot learning technique. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13:5872–5886, 2020. doi: 10.1109/JSTARS.2020.3025790.
  • Liu and Qin [2020] W. Liu and R. Qin. A multikernel domain adaptation method for unsupervised transfer learning on cross-source and cross-region remote sensing data classification. IEEE Transactions on Geoscience and Remote Sensing, 58(6):4279–4289, 2020. doi: 10.1109/TGRS.2019.2962039.
  • Kang et al. [2020] J. Kang, R. Fernandez-Beltran, P. Duan, S. Liu, and A. J. Plaza. Deep unsupervised embedding for remotely sensed images based on spatially augmented momentum contrast. IEEE Transactions on Geoscience and Remote Sensing, pages 1–13, 2020. doi: 10.1109/TGRS.2020.3007029.
  • Bachman et al. [2019] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 15535–15545. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ddf354219aac374f1d40b7e760ee5bb7-Paper.pdf.
  • Tian et al. [2019b] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. CoRR, abs/1906.05849, 2019b. URL http://arxiv.org/abs/1906.05849.
  • Hjelm et al. [2019] Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR 2019, page 24. ICLR, April 2019.
  • He et al. [2020] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9726–9735, 2020. doi: 10.1109/CVPR42600.2020.00975.
  • Caron et al. [2018] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, pages 139–156, Cham, 2018. Springer International Publishing.
  • Caron et al. [2021] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments, 2021.
  • Crawshaw [2020] Michael Crawshaw. Multi-task learning with deep neural networks: A survey, 2020.
  • Wang et al. [2021] Y. Wang, W. Ding, R. Zhang, and H. Li. Boundary-aware multitask learning for remote sensing imagery. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:951–963, 2021. doi: 10.1109/JSTARS.2020.3043442.
  • Bendale and Boult [2016] Abhijit Bendale and Terrance E. Boult. Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), page 14, June 2016.
  • da Silva et al. [2020] C. C. V. da Silva, K. Nogueira, H. N. Oliveira, and J. A. d. Santos. Towards open-set semantic segmentation of aerial images. In 2020 IEEE Latin American GRSS ISPRS Remote Sensing Conference (LAGIRS), pages 16–21, 2020. doi: 10.1109/LAGIRS48042.2020.9165597.
  • Adayel et al. [2020] Reham Adayel, Yakoub Bazi, Haikel Alhichri, and Naif Alajlan. Deep open-set domain adaptation for cross-scene classification based on adversarial learning and pareto ranking. Remote Sensing, 12(11):1716, May 2020. ISSN 2072-4292. doi: 10.3390/rs12111716. URL http://dx.doi.org/10.3390/rs12111716.
  • Ma et al. [2021] Jiayi Ma, Xingyu Jiang, Aoxiang Fan, Junjun Jiang, and Junchi Yan. Image matching from handcrafted to deep features: A survey. International Journal of Computer Vision, 129(1):23–79, Jan 2021. ISSN 1573-1405. doi: 10.1007/s11263-020-01359-2. URL https://doi.org/10.1007/s11263-020-01359-2.
  • Gevaert et al. [2018] C.M. Gevaert, C. Persello, F. Nex, and G. Vosselman. A deep learning approach to dtm extraction from imagery using rule-based training labels. ISPRS Journal of Photogrammetry and Remote Sensing, 142:106–123, 2018. ISSN 0924-2716. doi: 10.1016/j.isprsjprs.2018.06.001. URL http://www.sciencedirect.com/science/article/pii/S0924271618301643.