I Introduction
Air pollution has been shown to have significant negative effects on human health and sustainable development [1]. Air pollution is caused by gaseous pollutants that are harmful to humans and ecosystems. To quantify the degree of air pollution, government agencies have defined the air quality index (AQI). AQI is calculated based on the concentrations of a number of air pollutants, such as PM2.5 and PM10 particles. A higher AQI indicates that air pollution is more severe and that people are more likely to experience harmful health effects [2]. Thus, AQI monitoring is a critical issue. The more accurately the AQI distribution in a region can be obtained, the more effective the methods we can find to deal with air pollution.
Existing AQI monitoring approaches can be classified into two categories. The first category includes the sensor-based monitoring approaches, wherein government agencies have set up monitoring stations on dedicated sites in a city [3]. However, these fixed stations only provide coarse-grained 2D monitoring, with several kilometers between two monitoring stations. An existing study has shown that the AQI distribution has intrinsic variation within meters [4]. Large-scale Internet-of-Things (IoT) applications have been developed to monitor fine-grained air quality using densely deployed sensors [6, 7]. Although static sensors may achieve high monitoring precision, they suffer from high cost as well as a lack of mobility. Mobile devices or vehicles, such as phones, cars, and balloons, are utilized to carry sensors for AQI monitoring [9, 11, 12, 10]. However, the sensor-based approach may induce high energy consumption for mobile devices or vehicles to acquire a certain amount of data.
The second category of approaches includes the vision-based monitoring. Image-based AQI monitoring stations are set up by researchers at dedicated locations [16], and these static stations can only take photos and infer the AQI at limited sites over the whole region. Crowdsourced photos contributed by mobile phones can depict the AQI distribution [15] at more locations. However, the performance of the crowdsourcing approach is usually restricted by the low-quality photos contributed by many non-savvy users.
Systems  Scale  Dimension  Method  Resolution  Mobility  Costs  (header lost)  Accuracy  
Official stations [3]  km  D  Sensor  Low  Static  High  No  Low  
AirCloud [6]  km  D  Sensor  Medium  Static  Low  No  Medium  
Mosaic [7]  km  D  Sensor  Medium  Mobile  Low  No  Medium  
Mobile nodes [8]  km  D  Sensor  Medium  Mobile  Medium  Yes  Medium  
Balloons [9]  km  D  Sensor  High  Mobile  Medium  No  Low  
BlueAer [10]  km  D  Sensor  Medium  Static+Mobile  High  No  High  
ARMS [14]  m  D  Sensor  High  Mobile  High  Yes  High  
AQNet [12]  km  D  Sensor  High  Static+Mobile  Low  Yes  High  
Cell phones [15]  km  D  Vision  Low  Mobile  Low  Yes  Medium  
IBAQMS [16]  km  D  Vision  Low  Static  Medium  No  Medium  
ImgSensingNet  km  D  Sensor+Vision  High  Static+Mobile  Low  Yes  High 
Previous works have treated the two categories of methods in AQI monitoring separately; however, sensor-based and vision-based methods can be combined to improve the performance of the mobile sensing system while reducing the power consumption. For example, the combination of computer vision and inertial sensing has proved successful in the task of localization and navigation by phones [18, 17]. In this work, we seek a way of leveraging both photo-taking and data sensing to monitor and infer the AQI value.
In this paper, we present ImgSensingNet, a UAV vision guided aerial-ground air quality sensing system, to monitor and forecast AQI distributions in spatial-temporal perspectives. Unlike existing systems, we implement: (1) mobile vision-based sensing over an unmanned-aerial-vehicle (UAV), which realizes three-dimensional (3D) AQI monitoring by UAV photo-taking instead of using particle sensors, and infers the region-level AQI scale (an interval of possible AQI values) by applying a deep convolutional neural network (CNN) to the hazy photos taken; (2) ground sensing over a wireless sensor network (WSN) for small-scale accurate spatial-temporal AQI inference, using an entropy-based inference model; (3) an energy-efficient wake-up mechanism that, based on the result of vision-based AQI inference, powers on the sensors in a region only when small-scale monitoring is needed in that region, which greatly reduces energy consumption while maintaining high inference accuracy. We have implemented and evaluated ImgSensingNet on two university campuses (i.e., Peking University and Xidian University) since Feb. 2018. We have collected 17,630 photos and 2.6 million data samples. Compared to state-of-the-art methods, evaluation results confirm that ImgSensingNet can reduce energy consumption by 50.5% while achieving an inference accuracy of 95.2%.
The main contributions are summarized as follows.

We implement ImgSensingNet, a UAV vision guided aerial-ground AQI sensing system, and we deploy and evaluate it in a real-world testbed;

The proposed vision-based sensing method can learn the direct correlation between raw haze images and the corresponding AQI scale distribution;

The proposed entropy-based inference model for the ground WSN can achieve high accuracy in both real-time AQI distribution estimation and future AQI prediction;

The wake-up mechanism connects the aerial vision technique with the on-ground WSN, which can greatly reduce the energy consumption of the on-ground sensor network while ensuring high inference and prediction accuracy.
The rest of this paper is organized as follows. Related works are introduced in Section II. In Section III, we present the system overview of ImgSensingNet. Section IV introduces the UAV vision-based aerial sensing. In Section V, we propose the AQI inference model for the ground WSN. Section VI introduces the energy-efficient wake-up mechanism. In Section VII, we detail the system implementation. Experimental results and conclusions are provided in Section VIII and Section IX, respectively.
II Taxonomy
II-A AQI Monitoring Methods
In Table I, we show state-of-the-art works on air quality monitoring systems. Existing AQI monitoring methods can be summarized into two categories.
Sensor-based: Stationary stations [2] are set up on dedicated sites in a city, but only provide a limited number of measurement samples. For example, there are only 28 monitoring stations in Beijing. The distance between two nearby stations is typically several tens of thousands of meters, and the AQI is monitored every 2 hours [3]. AirCloud [6] uses densely distributed sensors in a static way, while Mosaic [7] and [8, 9] adopt mobile carriers such as buses or balloons to carry low-cost sensors. However, they all fail to consider the heterogeneous 3D AQI distribution. In [10, 12, 11, 13], drones with sensors together with ground sensors are used for AQI profiling. However, they are either restricted to a small-scale region or may induce high costs, and they do not design energy-efficient schemes for integrating aerial sensing with ground sensing.
Vision-based: Instead of various particle sensors, image-based approaches are also used for AQI estimation. In [16], image-based air quality monitoring stations are set up at dedicated sites over a city. Again, these methods can only profile AQI at a limited number of locations. In [15], camera-enabled mobile devices are used for generating crowdsourced photos for AQI monitoring. However, incentivizing users to volunteer high-quality photos is the pain point of such a crowdsourced system. Without precise correlations between haze images and AQI values, these methods cannot generalize well and may yield low accuracy.
ImgSensingNet overcomes the above shortcomings by using vision guided aerial sensing to extend the sensing scope, while also combining it with the ground WSN for accurate AQI distribution inference. An energy-efficient wake-up mechanism is designed to switch the on-ground WSN on or off by examining the aerial sensing results, which greatly lowers the system’s energy consumption.
II-B AQI Inference at Unmeasured Locations
In real-world sensing applications, it is not feasible to acquire AQI data samples at all locations within a region. Hence, AQI modeling and inference are used to estimate AQI distributions at unmeasured locations. Again, the inference models can be summarized into two categories.
Inference by sensor data: Zheng et al. [5] propose to infer air quality based on data from official air quality stations and other features such as meteorological data. In [6, 7], crowdsourcing and a Gaussian process model are used for 2D AQI inference. The work in [10] extends the inference to 3D space by using a random walk model. A fine-grained AQI distribution model is proposed in [14] for real-time AQI estimation over a 3D space. Long short-term memory (LSTM) networks are used in [19] to utilize historical data for more accurate inference. For temporal inference, neural networks (NN) are used in [12] to analyze spatial-temporal correlations and to forecast the future distribution.
Inference by image data: Image-based inference has been used to estimate AQI from haze images by designing appropriate inference models. Classical image processing methods as well as machine learning techniques are used in [16, 15] to model the correlation between haze images and the degree of air pollution.
In this work, we investigate two novel inference models: (1) image-based AQI scale inference in different monitoring regions by computer vision, and (2) fine-grained spatial-temporal AQI value inference at locations inside each region by the ground WSN.
III System Overview
The ImgSensingNet system includes on-ground programmable monitoring devices and a UAV. The aerial UAV sensing and the ground WSN sensing form a hybrid sensing network, as illustrated in Fig. 1.
The central idea of ImgSensingNet is to trigger aerial sensing and ground sensing sequentially during one measurement, which provides coarse-to-fine grained AQI value inference. This operation not only achieves high accuracy, but also scales down the monitoring overhead, which guarantees a long battery duration without an external power supply.
III-A Aerial Sensing
Fig. 1 shows the overall framework of ImgSensingNet. The aerial sensing utilizes the UAV camera to capture a series of haze images in different monitoring regions. The raw image data is streamed back to the central server, where a well-trained deep learning model performs real-time image data analysis and outputs the inferred AQI scale for each region.
III-B Ground Sensing
The ground WSN adopts a spatial-temporal inference model for AQI estimation at unmeasured locations and for future air quality prediction. Each time aerial sensing is finished, every ground device follows a designed wake-up mechanism to decide whether to wake up for data sensing, based on both the previous inference result and the aerial sensing result. In this way, the real-time fine-grained AQI distribution is obtained and the future distribution can also be forecast.
IV Aerial Sensing: Learning AQI Scale from Images Captured by UAV
ImgSensingNet performs vision-based sensing using a UAV, because: (1) the UAV has intrinsic advantages in flexible 3D space sensing over different heights and angles, which avoids possible obstacles and also guarantees certain scene depths; (2) with a built-in camera, the UAV does not need to carry extra sensors, which enables longer monitoring time; and (3) instead of hovering at different locations to collect data by sensors, the UAV can keep flying and recording video through the monitoring regions, which greatly extends the sensing scope.
Recent works in the computer vision field have studied in depth how to remove haze from images [24, 25, 23, 21]. However, there has been no work on mapping the haze in an image to a real AQI value. To learn directly from raw haze images to quantified AQI values, two main problems should be answered: (1) how to extract the haze components from the original images to eliminate the influence of image content, and (2) how to quantify the AQI based on the haze components.
This section details the method to solve these problems. Specifically, we investigate content-nonspecific haze-relevant features for raw haze images. With the haze features extracted, a novel 3D CNN model is designed to better process the feature maps and output the inferred AQI scale for each single image.
IV-A Overview of Haze Image Processing
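In image processing, a haze image can be mathematically described using the haze image formation model [21] as
$$\mathbf{I}(\mathbf{x}) = \mathbf{J}(\mathbf{x})\,t(\mathbf{x}) + \mathbf{A}\,\bigl(1 - t(\mathbf{x})\bigr), \qquad (1)$$
where $\mathbf{I}$ is the observed hazy image, $\mathbf{J}$ is the haze-free image, $t$ denotes the medium transmission, $\mathbf{A}$ is the global atmospheric light, and $\mathbf{x}$ represents pixel coordinates. Haze-removal methods have spent large effort estimating $\mathbf{A}$ and $t$ for haze-free image recovery [24, 25, 23, 21].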
Instead, in this work we propose a new objective to estimate the degree of haze in a single image.
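As a small illustration of the formation model in (1), the sketch below synthesizes a hazy pixel from a haze-free one, assuming the standard form I = J·t + A·(1 − t). The function name `apply_haze` and the pixel and airlight values are hypothetical and chosen only for demonstration; they are not taken from the system.

```python
def apply_haze(j, t, a):
    """Per-channel haze formation model: I = J * t + A * (1 - t).

    j: haze-free RGB pixel, t: medium transmission in [0, 1],
    a: global atmospheric light (RGB). All values are assumed to lie in [0, 1].
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("transmission t must lie in [0, 1]")
    return tuple(jc * t + ac * (1.0 - t) for jc, ac in zip(j, a))


clear_pixel = (0.2, 0.4, 0.1)  # hypothetical haze-free pixel
airlight = (0.9, 0.9, 0.9)     # hypothetical atmospheric light

# Lower transmission (denser haze) pulls the pixel toward the airlight.
for t in (1.0, 0.5, 0.0):
    print(t, apply_haze(clear_pixel, t, airlight))
```

With t = 1 the scene is unchanged, and with t = 0 only the atmospheric light remains, which is why the degree of haze is tied to how far the observed colors drift toward the airlight.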
IV-B Haze-relevant Feature Extraction
The first step is to extract a list of haze-relevant statistical features. Since we want a general approach that works for all image inputs regardless of their contents, we should select features that correlate well with the haze density in images but do not correlate well with the image contents.
In the following, we investigate six content-nonspecific haze-relevant features, and an example is illustrated in Fig. 2.
IV-B1 Refined Dark Channel
Dark channel [21] is an informative feature for haze detection, defined as the minimum of all pixel colors in a local patch:
$$D(\mathbf{x}; \mathbf{I}) = \min_{\mathbf{y} \in \Omega(\mathbf{x})} \Bigl( \min_{c \in \{r, g, b\}} I^{c}(\mathbf{y}) \Bigr), \qquad (2)$$
where $\Omega(\mathbf{x})$ is a local patch centered at $\mathbf{x}$, and $I^{c}$ is one color channel of $\mathbf{I}$. It is found that most local patches in outdoor haze-free images contain some pixels whose intensity is very low in at least one color channel [21]. Therefore, the dark channel is a rough approximation of the thickness of the haze.
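To obtain a better estimation of haze density, we propose the refined dark channel by applying the guided filter [22] on the estimated medium transmission $\tilde{t}(\mathbf{x})$, to capture the sharp edge discontinuities and outline the haze profile. Note that by applying the min operation on (1), the dark channel of the haze-free image $\mathbf{J}$ tends to be zero, and we have $\tilde{t}(\mathbf{x}) = 1 - \min_{\mathbf{y} \in \Omega(\mathbf{x})} \bigl( \min_{c} I^{c}(\mathbf{y}) / A^{c} \bigr)$, where $I^{c}$ and $A^{c}$ are the color channels of the hazy image and of the atmospheric light. Hence, the refined dark channel can be expressed as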
(3) 
Fig. 2(b) shows the refined dark channel feature. As we can see, the feature has a high correlation to the amount of haze in the image.
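To make the plain (unrefined) dark channel of (2) concrete, the following is a minimal pure-Python sketch: a per-pixel minimum over the color channels followed by a minimum over a square patch. The guided-filter refinement is deliberately omitted, and the function name `dark_channel`, the `radius` parameter, and the tiny test images are illustrative assumptions.

```python
def dark_channel(image, radius=1):
    """Plain dark channel: min over color channels, then min over a local patch.

    image: list of rows, each pixel an (r, g, b) tuple with values in [0, 1].
    radius: half-width of the square patch (hypothetical parameter name);
            patches are clipped at the image borders.
    """
    h, w = len(image), len(image[0])
    # Step 1: per-pixel minimum over the color channels.
    min_rgb = [[min(px) for px in row] for row in image]
    # Step 2: minimum over the square patch centered at each pixel.
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            out[y][x] = min(min_rgb[j][i] for j in ys for i in xs)
    return out


# A bright patch with one dark pixel: the dark channel stays low (little haze).
clear = [[(0.8, 0.7, 0.9)] * 3 for _ in range(3)]
clear[1][1] = (0.1, 0.5, 0.5)
# A uniformly bright, washed-out patch: the dark channel is high (heavy haze).
hazy = [[(0.8, 0.8, 0.8)] * 3 for _ in range(3)]

print(dark_channel(clear)[0][0], dark_channel(hazy)[0][0])
```

A production implementation would normally use an erosion (minimum filter) from an image library instead of the explicit loops, which are kept here only for readability.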
IV-B2 Max Local Contrast
Since haze can scatter the light reaching cameras, the contrast of the haze image can be highly reduced. Therefore, the contrast is one of the most perceptible features for detecting haze in the scene. The local contrast is defined as the variance of pixel intensities in a local region compared with the center pixel. Inspired by [23], we further use the local maximum of local contrast values in a local patch to form the max local contrast feature as
(4) 
where denotes the size of the local region , and is a constant that equals the number of channels. Fig. 2(c) shows the contrast feature, in which the correlation between haze and the contrast feature is visually obvious.
IV-B3 Max Local Saturation
It is observed that the image saturation varies sharply with the change of haze in the scene [16]. Therefore, similar to image contrast, we define the max local saturation feature that represents the maximum saturation value of pixels within a local patch, written as
(5) 
The max local saturation feature for the “forest” image is shown in Fig. 2(d), which is also correlated with the haze.
IV-B4 Min Local Color Attenuation
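In [24], the scene depth is found to be positively correlated with the difference between the image brightness and the image saturation by numerous experiments on haze images. This statistic is regarded as the color attenuation prior, expressed as
$$d(\mathbf{x}) = \theta_0 + \theta_1\, v(\mathbf{x}) + \theta_2\, s(\mathbf{x}) + \varepsilon(\mathbf{x}), \qquad (6)$$
where $v$ and $s$ denote the brightness and the saturation, respectively. Let $\varepsilon(\mathbf{x}) \sim \mathcal{N}(0, \sigma^2)$; then $\theta_0$, $\theta_1$, $\theta_2$, and $\sigma$ can be estimated through maximum likelihood. To process the raw depth map for better representation of the haze influence, we define the min local color attenuation feature by considering the minimum pixel-wise depth within a local patch $\Omega(\mathbf{x})$:
$$d_{\min}(\mathbf{x}) = \min_{\mathbf{y} \in \Omega(\mathbf{x})} d(\mathbf{y}). \qquad (7)$$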
Fig. 2(e) shows the min local color attenuation feature, where an obvious correlation with haze density can be observed.
IV-B5 Hue Disparity
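In [25], the hue disparity between the original image and its semi-inverse image is utilized to remove haze. The semi-inverse image is defined as the max value between the original image and its inverse, expressed as
$$I^{c}_{si}(\mathbf{x}) = \max\bigl[\, I^{c}(\mathbf{x}),\; 1 - I^{c}(\mathbf{x}) \,\bigr], \quad c \in \{r, g, b\}. \qquad (8)$$
The hue disparity is also reduced by haze, and thus can serve as another haze-relevant feature, written as
$$H(\mathbf{x}) = \bigl|\, I^{h}_{si}(\mathbf{x}) - I^{h}(\mathbf{x}) \,\bigr|, \qquad (9)$$
where $I^{h}$ denotes the hue channel of the image. Fig. 2(f) shows the hue disparity feature for the haze image.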
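The hue disparity described above can be sketched per pixel with Python's standard `colorsys` module. This sketch assumes RGB values in [0, 1], the semi-inverse taken channel-wise as max(c, 1 − c), and a plain absolute hue difference on the [0, 1) hue scale without wrap-around handling; the function name `hue_disparity` and the sample pixels are hypothetical.

```python
import colorsys


def hue_disparity(rgb):
    """Absolute hue difference between a pixel and its semi-inverse.

    rgb: (r, g, b) tuple with values in [0, 1].
    The semi-inverse keeps, per channel, the larger of c and 1 - c.
    Hue is taken from colorsys.rgb_to_hsv and lies in [0, 1).
    """
    semi_inverse = tuple(max(c, 1.0 - c) for c in rgb)
    hue = colorsys.rgb_to_hsv(*rgb)[0]
    hue_si = colorsys.rgb_to_hsv(*semi_inverse)[0]
    return abs(hue_si - hue)


# A bright, washed-out (hazy-looking) pixel: every channel is above 0.5,
# so the semi-inverse equals the pixel and the disparity vanishes.
print(hue_disparity((0.8, 0.7, 0.9)))
# A saturated, darker pixel: the semi-inverse changes the hue noticeably.
print(hue_disparity((0.1, 0.6, 0.2)))
```

The first case shows why the feature shrinks under haze: haze brightens all channels, so the semi-inverse operation leaves more pixels unchanged.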
IV-B6 Chroma
In the CIELab color space, the chroma is one of the most representative image features to describe the color degradation by the haze in the atmosphere. Let $[L(\mathbf{x}), a(\mathbf{x}), b(\mathbf{x})]^{T}$ denote the haze image in the CIELab space; the feature is defined as
$$C(\mathbf{x}) = \sqrt{a^{2}(\mathbf{x}) + b^{2}(\mathbf{x})}. \qquad (10)$$
As shown in Fig. 2(g), chroma is an excellent haze-relevant feature since it strongly correlates with the haze density but is not affected by the image contents.
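For illustration, chroma can be computed per pixel as the Euclidean norm of the a and b components in CIELab. The sketch below assumes sRGB input in [0, 1] and a D65 white point, and uses the standard sRGB-to-XYZ-to-Lab conversion; the function names and sample colors are illustrative and are not part of the system description.

```python
import math


def srgb_to_lab(r, g, b):
    """Convert sRGB values in [0, 1] to CIELab (assumed D65 white point)."""
    def linearize(c):
        # Undo the sRGB transfer curve.
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    # Linear sRGB to CIE XYZ.
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        delta = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def chroma(r, g, b):
    """Chroma = sqrt(a^2 + b^2) in CIELab."""
    _, a_star, b_star = srgb_to_lab(r, g, b)
    return math.hypot(a_star, b_star)


# A vivid color has high chroma; a washed-out version of it has much less.
print(chroma(0.9, 0.2, 0.2), chroma(0.9, 0.8, 0.8), chroma(0.5, 0.5, 0.5))
```

Gray pixels map to nearly zero chroma, which matches the intuition that haze, by washing colors toward gray, lowers this feature.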
IV-C 3D CNN-based Learning for AQI Scale Inference
With the above haze-relevant features extracted, we design a 3D CNN model to perform direct learning for precise AQI scale estimation of input haze images. A CNN is a type of deep learning model in which trainable filters and local neighborhood pooling operations are applied alternately on the raw input images, resulting in a hierarchy of increasingly complex features. CNNs have been widely used for image processing and vision applications, and have been shown to achieve superior performance compared to classical methods.
In this work, to better fit the extracted features for high accuracy, we introduce a 3D CNN model by adding a “prior feature map” dimension. The advantage behind 3D convolution is the utilization of haze prior information, which is encoded in the six feature maps.
Preprocessing: For each input haze image, we first resize it spatially to pixels. Feature maps are then extracted from the resized image and rescaled into in grayscale. We normalize each dimension, except the prior feature map dimension, of all training haze images to zero mean, which helps our model converge faster.
Model Architecture: Fig. 3 presents the architecture of the 3D CNN model. The first layer is called the “hardwired” layer; it extracts feature maps from the original haze image and consists of six feature frames stacked together to form a sized tensor. The rationale for using this hardwired layer is to encode our prior knowledge on different haze-relevant features. This scheme regularizes the model training by constraining it to the prior haze feature space, which leads to better performance compared to random initialization.
We then apply 3D convolutions with a kernel size of and 32 kernels, to extract complex features in different feature map domains separately. In the subsequent pooling layer, max pooling is applied. The next convolution layer uses kernel size, followed by another max pooling. 3D convolution with kernel size is then applied and it contains 13,456 trainable parameters. Finally, the vector is densely connected to the output layer that consists of pre-divided AQI scale classes. This architecture has been verified to give the best performance compared to other 3D CNN architectures.
Training and AQI Scale Inference: As the output is AQI scale (i.e., ), the inference is modeled as a classification problem, where the AQI scale classes are pre-divided based on the number of different AQI values in the training data. Given a new image input, the model finds images in the training set with the most similar haze degrees, and uses the corresponding AQI ground truth values to generate an AQI scale. With more data of different AQI values collected, the number of classes will increase, resulting in more fine-grained scale labels.
V Ground Sensing: AQI Inference by Ground Sensor Monitoring
Given the 3D target monitoring space, we utilize the ground WSN for accurate AQI inference that enables both real-time inference spatially and future distribution forecasting temporally. This section illustrates how to do accurate inference based on (1) sparse historical ground WSN data, and (2) the prior AQI scale knowledge from aerial sensing.
The target 3D space is first divided into disjoint cubes, which form the basic unit in our inference. Each cube contains its own geographical coordinates in 3D space, and each cube is associated with an AQI value. Note that AQI values in a limited number of cubes are observed/sensed from the WSN, while the AQI values in other unobserved cubes need to be estimated using the proposed model. Here we define a set of cubes over a series of time stamps with equal intervals (e.g., one hour). Most cubes do not have observed/sensed data (e.g., in both Peking University and Xidian University), whose AQI values can be estimated using a probability function, . The objective is to infer of any unobserved location at any given time stamp (including both the current and future time stamps).
Why a semi-supervised learning model: Since the data observed using the sensor network can be extremely sparse, prevailing deep learning methods for time series processing (e.g., RNN and LSTM) are not feasible in our task. Hence, a semi-supervised learning method is designed to achieve the goal. We first establish a multi-layer spatial-temporal graph to model the correlation between cubes. The weights of edges are represented by the correlations of features between cubes, based on the fact that cubes whose features are similar tend to share similar AQI values. The model iteratively learns and adjusts the edge weights to achieve the inference.
V-A Feature Selection
Based on the study of key features in fine-grained scenarios [11, 12, 13, 14], we select nine highly correlated features: 3D coordinates, current time stamp, weather condition, wind speed, wind direction, humidity, and temperature. These features can be obtained either from our monitoring devices or by crawling data from online websites.
V-B Multi-Layer Spatial-Temporal Inference Model
The AQI values at different locations are correlated with each other in a spatial-temporal manner. For example, the AQI value at one location is highly similar to that at its neighboring location; the AQI value at a location depends on its values in the past few hours.
Based on this observation, we propose a multi-layer graph model to characterize the correlations between cubes. Each cube is represented by a node in the graph, as shown in Fig. 4. These nodes are connected in both spatial and temporal dimensions to form a multi-layer weighted graph . Each layer represents one spatial graph at a specific time stamp . We refer to the nodes with observed data from the sensors as labeled nodes, and to the nodes without observed data as unlabeled nodes. Each labeled node has the ground truth AQI value, while the AQI value of each unlabeled node is estimated through a probability distribution .
We construct the edges in the graph by the following steps: (1) Connecting to labeled nodes, where each unlabeled node is connected with all labeled nodes at the same time stamp ; (2) Connecting to spatial neighbors, where each unlabeled node is also connected with neighboring nodes within a given spatial radius ; and (3) Connecting to temporal neighbors, where each unlabeled node is connected to nodes in the same location but at neighboring time stamps. Fig. 4 shows an example of edge construction.
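The three edge-construction rules can be sketched as follows. This is a minimal illustration under stated assumptions: nodes are given as a flat list, "neighboring time stamps" is read as a time difference of exactly one step, spatial neighbors are taken within the same time stamp, and the names `Node`, `build_edges`, and `radius` are hypothetical.

```python
import math
from collections import namedtuple

# One node per cube and time stamp; `labeled` marks cubes with sensor data.
Node = namedtuple("Node", "x y z t labeled")


def build_edges(nodes, radius):
    """Return the set of undirected edges (i, j), i < j, touching unlabeled nodes."""
    edges = set()
    for i, u in enumerate(nodes):
        if u.labeled:
            continue  # edges are constructed from the unlabeled nodes' side
        for j, v in enumerate(nodes):
            if i == j:
                continue
            same_time = u.t == v.t
            dist = math.dist((u.x, u.y, u.z), (v.x, v.y, v.z))
            to_labeled = same_time and v.labeled            # rule (1)
            to_spatial = same_time and dist <= radius       # rule (2)
            to_temporal = dist == 0 and abs(u.t - v.t) == 1  # rule (3)
            if to_labeled or to_spatial or to_temporal:
                edges.add((min(i, j), max(i, j)))
    return edges


nodes = [
    Node(0, 0, 0, 0, True),   # 0: labeled, t = 0
    Node(1, 0, 0, 0, False),  # 1: unlabeled, t = 0
    Node(5, 0, 0, 0, False),  # 2: unlabeled, t = 0, far away
    Node(1, 0, 0, 1, False),  # 3: same cube as node 1, t = 1
    Node(9, 9, 0, 1, True),   # 4: labeled, t = 1
]
print(sorted(build_edges(nodes, radius=1.5)))
```

The quadratic scan is kept for clarity; a spatial index would be the natural choice for a large number of cubes.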
For every edge , it has a corresponding weight. The weight of edge denotes how much the features between and are correlated. The correlation is defined by:
Definition 1.
Correlation Function. Given a set of features , the correlation function of each feature between node and is defined as a linear function
(11) 
In (11), and are parameters that can be estimated using the maximum likelihood estimation. Based on the correlation modeling between feature difference and AQI similarity, we define the weight matrix , where the weight on edge is expressed as
(12) 
where is the weight of feature , and needs to be further learned to determine the AQI distribution of unlabeled nodes.
V-C AQI Inference on Unlabeled Nodes
The objective for the model’s convergence is to minimize the model’s uncertainty when inferring unlabeled nodes. We show that the distribution at an unlabeled node is the weighted average of the distributions at its neighboring nodes [27]. Then, the objective becomes to minimize the entropy of the whole model, i.e., , to achieve accurate estimation. This idea comes from the fact that an unlabeled node should possess an AQI value similar to those of the adjacent labeled nodes connected to it. Therefore, based on the edge weight function in (12), we define the loss function of the correlation graph to enable the propagation between highly correlated nodes with higher edge weights:
(13) 
where and are the AQI distributions at nodes and , and denotes the similarity of AQI distributions between and , described by the symmetric Kullback-Leibler (KL) divergence [26]. Thus, the objective function is given by:
(14) 
By minimizing , the nodes with higher edge weights would possess more similar AQI values, while the nodes with lower edge weights would be more independent. The objective function can thus enable the AQI propagation between highly correlated nodes, improving inference accuracy.
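As a rough sketch of the quantities involved, the code below computes a symmetric KL divergence between two discrete AQI distributions and a weighted sum of it over graph edges. Two assumptions are made here and are not confirmed by the text: the symmetric form is taken as KL(p‖q) + KL(q‖p), and the graph loss is taken as the edge-weighted sum of these divergences. The names `kl`, `symmetric_kl`, and `graph_loss` are illustrative.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Assumed symmetric form: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


def graph_loss(edge_weights, dists):
    """Assumed loss form: sum over edges of weight * symmetric KL divergence.

    edge_weights: dict mapping (u, v) to a non-negative weight.
    dists: dict mapping node id to its AQI distribution (a list summing to 1).
    """
    return sum(w * symmetric_kl(dists[u], dists[v]) for (u, v), w in edge_weights.items())


dists = {0: [0.7, 0.2, 0.1], 1: [0.6, 0.3, 0.1], 2: [0.1, 0.2, 0.7]}
# A heavy edge between dissimilar nodes (0, 2) is penalized much more
# than the same weight placed on the similar pair (0, 1).
print(graph_loss({(0, 1): 1.0}, dists), graph_loss({(0, 2): 1.0}, dists))
```

Under this reading, minimizing the loss pushes strongly connected nodes toward similar distributions while leaving weakly connected nodes largely free.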
Proposition 1.
The solution of for (14) is the average of the distributions at its neighboring nodes.
Proof: According to [27], the minimum function in (14) is harmonic. Therefore, we have on unlabeled nodes , while on labeled nodes . Here is the , which is defined by . diag is the diagonal matrix with denotes the degree of ; is the weight matrix defined in (12). The harmonic property provides the form of solution as:
(15) 
where is the maximum possible AQI value. To normalize the solution, we redefine it as
(16) 
Hence, the distribution of unlabeled nodes is the average of distributions at its neighboring nodes.
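The averaging property in Proposition 1 suggests a simple iterative scheme, sketched below: labeled nodes keep their distributions fixed, and each unlabeled node is repeatedly replaced by the weighted average of its neighbors until the values settle. This is a generic harmonic-function (label propagation) iteration in the spirit of [27], not necessarily the solver used in the system; the function name `propagate` and its arguments are hypothetical.

```python
def propagate(weights, labeled, n_nodes, n_states, iters=200):
    """Iterate the weighted-average rule on unlabeled nodes.

    weights: dict mapping undirected edges (i, j) to positive weights.
    labeled: dict mapping labeled node ids to fixed distributions.
    Returns a dict mapping every node id to a distribution over n_states.
    """
    neighbors = {i: [] for i in range(n_nodes)}
    for (i, j), w in weights.items():
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))

    uniform = [1.0 / n_states] * n_states
    dist = {i: list(labeled.get(i, uniform)) for i in range(n_nodes)}

    for _ in range(iters):
        for i in range(n_nodes):
            if i in labeled or not neighbors[i]:
                continue  # labeled nodes stay clamped; isolated nodes stay uniform
            total = sum(w for _, w in neighbors[i])
            dist[i] = [
                sum(w * dist[j][k] for j, w in neighbors[i]) / total
                for k in range(n_states)
            ]
    return dist


# Chain 0 - 1 - 2 with node 1 unlabeled and a stronger tie to node 0.
result = propagate({(0, 1): 3.0, (1, 2): 1.0}, {0: [1.0, 0.0], 2: [0.0, 1.0]}, 3, 2)
print(result[1])
```

Because every update is a convex combination of probability vectors, each unlabeled node remains a valid probability mass function throughout, which is consistent with Proposition 2.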
Proposition 2.
in (16) is a probability mass function (PMF) on .
Proof: To be a PMF on , we test the satisfaction of on the following three properties:

The domain of is the set of all possible states of .

, .

.
Considering the expression form in (16), the conclusion is obvious, that is a PMF on .
The solution again shows the influence of the highly correlated nodes that are connected by highweight edges.
V-D Entropy-based Learning with AQI Scale Prior
AQI Scale Prior: A key characteristic of our model is the conditioning of prior AQI scale knowledge on unlabeled nodes at the current time stamp, (see Fig. 4). This conditioning allows the AQI scale learned from vision-based sensing to guide ground WSN sensing, providing faster convergence and more accurate inference. Specifically, the target space is divided into disjoint regions for aerial sensing. Each contains a number of cubes to be inferred. For each , the aerial sensing provides a conditioning for :
(17) 
By applying to in (16), we finally induce as the inferred distribution. The conditioning brings faster convergence during training, and also enables more accurate inference. Sec. VI will detail the region division method, which leads to the design of the low-cost wake-up mechanism.
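One plausible way to picture this conditioning is sketched below: the inferred distribution is restricted to the AQI interval reported by aerial sensing and then renormalized. This is only an assumed reading of the conditioning step, since its exact form is given by (17); the function name `condition_on_scale`, the interval bounds `lo` and `hi`, and the uniform fallback are all hypothetical.

```python
def condition_on_scale(pmf, lo, hi):
    """Restrict a PMF over AQI values 0..len(pmf)-1 to [lo, hi] and renormalize.

    Assumed behavior: mass outside the aerial AQI scale is discarded. If no mass
    lies inside the interval, fall back to a uniform distribution over it.
    """
    masked = [p if lo <= aqi <= hi else 0.0 for aqi, p in enumerate(pmf)]
    total = sum(masked)
    if total == 0.0:
        width = hi - lo + 1
        return [1.0 / width if lo <= aqi <= hi else 0.0 for aqi in range(len(pmf))]
    return [m / total for m in masked]


# Toy PMF over four AQI values; the aerial scale says the value lies in [1, 2].
print(condition_on_scale([0.1, 0.2, 0.3, 0.4], lo=1, hi=2))
```

The result is still a valid PMF, so it can be used wherever the unconditioned distribution was used.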
So far, the expression of is determined; the next step is to investigate the weight functions given by (12). is learned from both labeled and unlabeled data, which forms a semi-supervised mechanism.
Learning Criterion: Since the labeled nodes are sparse, maximizing the likelihood of labeled nodes’ data to learn is infeasible. Instead, we use the model’s entropy as the criterion, since high entropy indicates unpredictable values, resulting in poor inference capability and low accuracy. Thus, the objective is to minimize the entropy of unlabeled nodes:
(18) 
where is the number of unlabeled nodes. By unfolding the objective function, we have
(19) 
For simplicity, we denote as , the gradient can be derived as
(20) 
For every unlabeled , we investigate based on (16) and (12). By applying the chain rule of differentiation, the final gradient can be derived as
(21) 
Thus, by iteratively learning and updating using (21), the edge weights can be learned, and the final AQI distribution is generated when the iteration converges.
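The learning criterion can be illustrated by computing the average Shannon entropy of the unlabeled nodes' distributions, which is the quantity the weight updates try to drive down. The sketch below only evaluates the criterion and does not implement the gradient in (21); the natural logarithm and the function names `entropy` and `mean_entropy` are assumptions.

```python
import math


def entropy(pmf):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in pmf if p > 0)


def mean_entropy(pmfs):
    """Average entropy over the distributions of all unlabeled nodes."""
    return sum(entropy(p) for p in pmfs) / len(pmfs)


confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]]   # sharply peaked inference
uncertain = [[0.34, 0.33, 0.33], [0.33, 0.34, 0.33]]  # nearly uniform inference

# A lower value means the model commits to specific AQI values.
print(mean_entropy(confident), mean_entropy(uncertain))
```

A set of edge weights that yields the first kind of distribution is preferred under this criterion over one that yields the second.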
Real-time Inference: As illustrated in Fig. 4, the real-time inference is based on (1) historical ground WSN data over last time stamps, and (2) the conditioning of prior AQI scale knowledge . When the model converges, we obtain the determined AQI distribution over , which is called soft labeling. To provide an exact, or hard labeling, value of inference, since is a PMF on as proved in Proposition 2, we quantize it using the expectation of :
(22) 
Note that we can obtain on each unlabeled node over time stamps. However, only data at current time is needed for realtime inference. Inspired by this idea, we store the whole inferred distribution map each time when realtime inference is completed, and further use it as historical data in the future sensing. By doing so, more labeled nodes are known to get better inference results, which can accelerate the convergence speed and improve the accuracy.
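The step from soft labeling to hard labeling is the expectation of the inferred PMF. A minimal sketch follows, assuming the PMF is indexed by integer AQI values starting at 0; the function name `hard_label` and the toy distribution are illustrative.

```python
def hard_label(pmf):
    """Expectation of a PMF indexed by AQI value 0..len(pmf)-1."""
    return sum(aqi * p for aqi, p in enumerate(pmf))


soft = [0.0, 0.25, 0.5, 0.25]  # toy soft label over AQI values 0..3
print(hard_label(soft))        # expected AQI used as the hard label
```

The expectation uses the whole distribution, so two nodes with the same most likely AQI value can still receive different hard labels when their uncertainty differs.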
Future Forecasting: Our model is also capable of future inference. In Fig. 4, the edge can be extended to following time stamps and more. With the entropy-based learning procedure, it can maintain sufficient accuracy for near-future distribution forecasting even without the prior from aerial sensing.
VI Energy-efficient Wake-up Mechanism
Since our ground inference model is able to operate with very sparse labeled data, we only need to wake up a small number of ground sensors in selected regions to sense data at each . This scheme can greatly save the battery of devices and extend our system’s working duration, while also ensuring high inference accuracy.
Recent methods [6, 7, 12] that utilize a ground WSN for inference wake up all of their sensors simultaneously for data collection. This can lead to a short working duration even with low-cost sensors. For example, devices in [7] can only last for less than 5 days before recharging, causing high consumption of battery power and human labor. Yet there has scarcely been any work on asynchronous wake-up for AQI monitoring. In fact, due to the spatial-temporal correlations of the AQI distribution, waking up a specific number of sensors is enough to realize high inference accuracy while greatly reducing the power consumption. Thus, an energy-efficient wake-up mechanism is designed for ImgSensingNet to connect the aerial sensing and ground sensing, and to guide the system in selecting specific devices to wake up at the current time stamp for energy saving.
VI-A Voronoi Diagram-based Region Division
Since the total monitoring space can be very large, we first divide it into disjoint regions for aerial sensing. Note that even if devices are deployed in 3D (e.g., different floors of buildings), we only consider 2D coordinates for region division. The height and the camera angle of the UAV are fixed in advance, in order to make sure the region is covered in images. Cubes inside each are provided an AQI scale conditioning using vision-based inference. Since the distribution of ground devices is heterogeneous and uneven, we implement the division (as shown in Fig. 5) by the following steps:
Initialization: Fig. 5(a)(b)(c) present an example of the initialization process. Given a target space with ground devices deployed, points of interest (POIs), e.g., a hospital or an office building, are selected dynamically at different time stamps.
Clustering: With POIs selected, we cluster each device to its nearby POI in the spatial dimension based on the spatial correlation of the AQI distribution, where k-means clustering is used. We obtain classes after the clustering, each containing devices (), as shown in Fig. 5(d).
Multi-site Weighted Voronoi Diagram: A Voronoi diagram is a partitioning of a plane into regions based on the distance to sites in a specific subset [28]. The original Voronoi diagram only considers one site in a region, and uses the Euclidean distance for division. As we have multiple devices in one region, we propose a multi-site weighted Voronoi diagram that enables division with (1) multiple sites inside one region, and (2) different weights assigned to each region for calculating the division boundary.
As shown in Fig. 5(e), we first calculate the center in using the mean 2D coordinates of the devices inside it. The coordinates of the center are used for division on behalf of . Since the number of devices varies over different regions, they should possess different weights when calculating the division boundary. Hence, we define the weighted distance as:
(23) 
where is the Euclidean distance between location and region center , and is the number of devices inside region . Thus, the weighted Voronoi division can be written as
(24) 
Proposition 3.
The complexity of the region division algorithm is .
Proof: Denote the total number of 2D grids as , where and are constants for a specific monitoring area. In the first stage, all devices need to be clustered to a nearby POI, which computes for times. In the second stage, we calculate the center of each class, which will compute times in the worst case. In the last stage, the assignments for each grid will take times for computing. Note that we always have , while there always exists a constant such that ( in the worst case). Hence, the complexity for the last stage is . By combining three stages, we derive the final complexity as .
The best achieved complexity for the classical Voronoi diagram is [28], which only applies to one-site division. In contrast, our algorithm generates a multisite division and also reduces the computation overhead. Algorithm 1 shows the procedure of the region division algorithm.
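For illustration, the three stages of the region division can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reproduction of Algorithm 1: the clustering stage is reduced to a single nearest-POI assignment rather than an iterative k-means, the weighted distance is assumed to be the Euclidean distance divided by the number of devices in the region (so that regions with more devices claim more area), and the names `nearest`, `divide_regions`, `grid_w` and `grid_h` are hypothetical.

```python
import math


def nearest(point, candidates):
    """Index of the candidate closest to `point` (Euclidean distance)."""
    return min(range(len(candidates)),
               key=lambda i: math.dist(point, candidates[i]))


def divide_regions(devices, pois, grid_w, grid_h):
    """Assign every 2D grid cell to a region.

    devices, pois: lists of (x, y) tuples in grid units.
    Returns (labels, centers), where labels[y][x] is a region index.
    """
    # Stage 1: attach each device to its nearest POI
    # (simplification of the k-means clustering step).
    clusters = {}
    for dev in devices:
        clusters.setdefault(nearest(dev, pois), []).append(dev)

    # Stage 2: region center = mean 2D coordinates of its devices;
    # region weight = number of devices in the region.
    centers, weights = [], []
    for members in clusters.values():
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        centers.append((cx, cy))
        weights.append(len(members))

    # Stage 3: each grid cell goes to the region with the smallest
    # weighted distance. ASSUMPTION: distance / device count.
    def weighted(cell, r):
        return math.dist(cell, centers[r]) / weights[r]

    labels = [[min(range(len(centers)), key=lambda r: weighted((x, y), r))
               for x in range(grid_w)]
              for y in range(grid_h)]
    return labels, centers


# Hypothetical example: three devices near one POI, one device near another.
devices = [(1, 1), (2, 1), (1, 2), (8, 8)]
pois = [(0, 0), (9, 9)]
labels, centers = divide_regions(devices, pois, grid_w=10, grid_h=10)
```

The last stage visits every grid cell once and compares it against every region center, which is where the dependence on the grid size in the complexity analysis comes from.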
VI-B When to Wake Up
At each time stamp, we first perform vision-based aerial sensing over regions to obtain the AQI scale inference for each region. Before triggering ground devices, we use the semi-supervised learning model to assign hard labels to all nodes at the current time stamp, based on the stored AQI inference maps from past time stamps. Therefore, for each node , there are two estimations: (1) AQI scale , and (2) pre-inferred value using historical data. Based on the two priors, we propose an indicator to analyse the inference reliability and to decide which devices to wake up at the current time stamp.
Joint Estimation Error: We first define two metrics of correlations between the two priors:
(25)  
(26) 
where we call as Degree of Bias (DoB), as Degree of Variance (DoV), as shown in Fig. 6. Intuitively, when is low, the variance of the AQI scale prior is small, which means a more reliable inference; as for , a low induces small deviation between the two priors, which in turn guarantees the inference reliability. Hence, DoB and DoV can both reflect the degree of estimation errors. By merging the two metrics, we define the Joint Estimation Error (JE) as:
(27) 
where and denote the maximum value of DoB and DoV for all nodes with devices. As a result, JE is normalized into , and each node has a corresponding JE. In general, JE reflects the degree of average inference error for labeled nodes before waking up for ground sensing. For cube, a greater indicates higher uncertainty for inference at , which signifies that should be measured currently if exceeds a threshold. Hence, given a specific JE as threshold, sensors/nodes with should wake up for data collection at the current time stamp. These nodes are then labeled with measured data at layer , which best reduces the model’s entropy and is sufficient for real-time and future inference. In this way, by measuring only a small number of cubes, ImgSensingNet can greatly reduce the measurement overhead while maintaining high inference accuracy.
In general, the JE threshold is adjusted manually for different scenarios, which forms a trade-off. When the threshold is low, more cubes are measured, which improves inference accuracy but can cause high battery consumption. Conversely, when the threshold is high, fewer cubes are measured, which may reduce accuracy but greatly reduces consumption. In summary, the trade-off between accuracy and consumption should be studied to achieve better performance.
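To make the thresholding step concrete, the following Python sketch computes a JE score per node and selects the nodes to measure. The combination rule is an assumption made here purely for illustration (each metric is divided by its maximum over all nodes and the two normalized values are averaged, which keeps the score in [0, 1]); the function names and the example DoB/DoV values are likewise hypothetical.

```python
def joint_estimation_error(dob, dov):
    """Combine per-node DoB and DoV into one score in [0, 1].

    dob, dov: dicts mapping node id -> non-negative metric value.
    ASSUMPTION: equal-weight average of the max-normalized metrics;
    the exact merging rule may differ.
    """
    max_dob = max(dob.values()) or 1.0  # guard against an all-zero metric
    max_dov = max(dov.values()) or 1.0
    return {n: 0.5 * (dob[n] / max_dob + dov[n] / max_dov) for n in dob}


def select_wakeup(je, threshold):
    """Nodes whose JE exceeds the threshold are candidates to wake up."""
    return sorted(n for n, value in je.items() if value > threshold)


# Hypothetical values for three nodes.
dob = {"a": 2.0, "b": 1.0, "c": 0.0}
dov = {"a": 4.0, "b": 1.0, "c": 2.0}
je = joint_estimation_error(dob, dov)

low = select_wakeup(je, threshold=0.3)   # lower threshold, more nodes measured
high = select_wakeup(je, threshold=0.5)  # higher threshold, fewer nodes measured
```

In this toy example the lower threshold selects two nodes and the higher threshold selects one, which is the accuracy-versus-consumption trade-off in miniature.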
Wakeup Mechanism Design: JE guides the system in waking up selected devices at each time stamp. First, if the two priors and are both less than a pre-given value , then the current air quality is good enough that measurement is unnecessary, and switching off the device in such a case helps save the battery.
Second, since we construct the graph model by connecting nodes within a spatial radius , it is possible that two nodes selected to wake up are adjacent and connected in the model (Fig. 7(a) provides an example). In this case, waking up connected nodes would be redundant as their measurements are similar. Denote as the set of selected wakeup nodes using JE; our objective is to find a subset such that (1) nodes in are not adjacent, and (2) every node not in is adjacent to at least one member of . This problem is well known as the minimum independent dominating set problem [29], which is NP-hard. Since is sparse in our case and the computing overhead is small, we simply apply a greedy method to find . Note that the algorithm is applied in each region independently. The whole process of finding a final wakeup set based on is shown in Fig. 7.
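A greedy heuristic of this kind can be sketched in Python as follows. This is one plausible reading rather than the exact procedure: at each step the remaining node with the most remaining neighbours is kept, and it is then discarded together with those neighbours. The result is a maximal independent set, which is always an independent dominating set, although not necessarily a minimum one. The function names, the use of `<=` for the radius test, and the example coordinates are assumptions.

```python
import math


def build_graph(positions, radius):
    """Connect every pair of nodes whose distance is at most `radius`."""
    adj = {n: set() for n in positions}
    nodes = list(positions)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if math.dist(positions[u], positions[v]) <= radius:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def greedy_independent_dominating_set(adj):
    """Pick the highest-degree remaining node, drop it and its neighbours,
    and repeat until no node is left."""
    remaining = set(adj)
    chosen = []
    while remaining:
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda n: len(adj[n] & remaining))
        chosen.append(best)
        remaining -= adj[best] | {best}
    return chosen


# Hypothetical candidate nodes (coordinates in metres) selected by JE.
positions = {"a": (0, 0), "b": (10, 0), "c": (20, 0), "d": (100, 0)}
wakeup = greedy_independent_dominating_set(build_graph(positions, radius=15))
```

Here nodes a, b and c form a chain, so only b needs to wake up on their behalf, while the isolated node d must wake up on its own.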
Lemma 1.
The number of final wakeup devices decreases monotonically when increases.
Proof: When increases to , the number of edges can increase, which induces larger local connectivity graphs. For the extreme case, we choose as the same as , which at least forms an independent dominating set. Since can increase, nodes in can be directly connected, in which case it cannot be a minimum set. Hence, we have with , that is, decreases monotonically when gets larger.
Proposition 4.
The complexity of the wakeup mechanism algorithm decreases monotonically when increases, which is .
Proof: For the inner loop, we compute times to generate for each region . As we need to traverse to find the node with the largest degree each time, and the total number of times is represented by , the complexity of finding is . As for the outer loop, times are needed for each region. Hence, the total complexity is . It is easy to see that both the upper bound and the lower bound can be written as multiplied by a constant , which gives the final complexity .
Based on Lemma 1, the complexity also decreases monotonically when gets larger.
Algorithm 2 shows the procedure of the wakeup mechanism. The target node set is determined based on JE and the two conditions studied above, and contains the nodes with the highest current uncertainty. Thus, by monitoring only , ImgSensingNet achieves high inference accuracy while greatly reducing the measurement overhead.
Proposition 5.
The overall complexity for wakeup mechanism each time is .
The overall wakeup mechanism consists of dynamic region division and wakeup device selection at each time stamp. In practice, since and are small due to the device sparsity in deployment, by choosing a proper , the computation of the whole mechanism can be completed in real time, as shown in the evaluation.
VII Implementation
In this section, we detail the implementation of the ImgSensingNet system. Specifically, we first introduce the components of ImgSensingNet, including both the ground devices and the aerial device. Based on these devices, we then describe the collection of both aerial image data and ground sensing data.
VII-A System Components
Ground Devices: Fig. 8(a) shows the components of the ground AQI monitoring device. Each device contains a low-cost A3IG sensor, a two-layer circuit board, an ATmega128A working as the microcontroller unit (MCU), a SIM7000C as the wireless communication module, a 13600 mAh rechargeable battery and a fixed shell structure. Considering the intrinsic lack of precision of small-scale laser-based sensors, these devices are carefully calibrated through a whole month of adjustment by comparing their results with a high-precision calibrating instrument, the TSI8530. Finally, these devices can provide monitor error for common pollutants in AQI calculation, such as and , and send the real-time data back to the central server for further data analysis. To achieve high energy efficiency, the devices are programmed to sleep most of the time and to wake up for data collection at adjustable time intervals controlled by the designed wakeup mechanism discussed in Sec. VI. Thus, an online trade-off between data quantity and battery endurance can be implemented.
Aerial Device: For the UAV, we select the DJI Phantom 3 Quadcopter as the sensing device, as shown in Fig. 8(c). The GPS sensor on the UAV provides the real-time 3D position. In existing systems [14], the UAV can keep flying for at most 10-20 minutes due to both the load consumption (carrying sensors can significantly reduce the UAV’s battery life) and the loitering consumption (to acquire sensing data, the UAV needs to stay still at every measuring location), which restricts the monitoring scope within one measurement [30]. However, as the UAV contains a built-in HD camera, the extra load and the hovering time can be eliminated when we focus on vision-based sensing. Hence, the sensing scope as well as the flight duration can be greatly increased.
VII-B Experiment Setup and Data Collection
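Table II: Average estimation errors (RMSE) of real-time inference and near-future forecasting in 2D and 3D scenarios. A "-" marks a setting for which no value is reported.
Methods | 2D: Real-time | 2D: After 1 hour | 2D: After 3 hours | 2D: After 10 hours | 3D: Real-time | 3D: After 1 hour | 3D: After 3 hours | 3D: After 10 hours
ImgSensingNet | 3.540 | 6.178 | 9.330 | 20.269 | 5.529 | 9.928 | 13.341 | 31.409
ARMS [14] | 5.412 | - | - | - | 7.384 | - | - | -
LSTM Nets [19] | 4.217 | 7.804 | 10.672 | 19.873 | - | - | - | -
AQNet [12] | 4.493 | 7.562 | 13.695 | 25.192 | 6.481 | 14.735 | 19.634 | 39.790
ST NN [20] | 7.039 | 9.667 | 11.882 | 26.055 | 9.147 | 12.954 | 18.065 | 44.256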
The ImgSensingNet system prototype includes 200 ground devices and a UAV, and it has been deployed on two university campuses (i.e., Peking University and Xidian University) since Feb. 2018. Over more than half a year of measurement, 2.6 million ground data samples and 17,630 haze images have been collected, covering cases from good to hazardous air quality, which are used for evaluation in this paper.
Aerial Image Data: The vision-based sensing works online, continuously and in real time, by sampling images from the UAV video streams at equal time intervals. To get ground truth data for training the CNN model, we build the dataset by carrying a calibrated sensor to label each image with the ground truth AQI value. We collected 17,630 labeled images in different places so that the data generalize well. Fig. 9 shows an overview of the image dataset.
Ground Sensing Data: The testing areas are on the campuses of Peking University (2 km × 2 km) and Xidian University (2 km × 1.5 km). The ground devices are deployed in 3D space with a maximum height of 50 m. We divide the areas into 20 m × 20 m × 10 m cubes, where a small number of cubes are deployed with our devices. We manually set the minimum time interval to 30 minutes.
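To give a sense of scale, the number of cubes implied by these dimensions can be worked out with a few lines of Python. The sketch takes the stated campus dimensions at face value and assumes the cubes tile the volume exactly; the function name `cube_count` is hypothetical.

```python
def cube_count(width_m, depth_m, height_m, cube=(20, 20, 10)):
    """Number of cube cells when the volume is cut into
    cube[0] x cube[1] x cube[2] metre cells (integer tiling assumed)."""
    return (width_m // cube[0]) * (depth_m // cube[1]) * (height_m // cube[2])


peking = cube_count(2000, 2000, 50)  # 100 * 100 * 5 cubes
xidian = cube_count(2000, 1500, 50)  # 100 * 75 * 5 cubes

# Share of cubes that can hold a device if all 200 devices sit in distinct cubes.
occupied_share = 200 / (peking + xidian)
```

Under these assumptions the two campuses contain 50,000 and 37,500 cubes respectively, so the 200 ground devices cover well under 1% of the cubes, which is why inference for unmeasured cubes matters.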
VIII Evaluation
In this section, we present the performance analysis of ImgSensingNet in various aspects.
VIII-A Vision-based Aerial Sensing
We evaluate the proposed AQI scale inference model in two aspects: accuracy and robustness in predictions. We randomly divide the image dataset into a training set and a testing set with a 7:3 ratio. We compare the proposed inference model with models from two categories: (1) three deep learning methods: a 2D CNN with our extracted features, a 2D CNN without features, and a 50-layer deep neural network (DNN); the 2D CNN architecture is the same as that of our 3D CNN, but with only 2D kernels. (2) five classical methods: support vector machine (SVM), k-nearest neighbors (kNN), decision tree (DT), multivariable linear regression (MLR) and random forest (RF).
Accuracy of Inference: As shown in Fig. 10, our method outperforms all other models in general. The proposed model achieves 96% accuracy for image-based AQI scale inference. Moreover, the 2D CNN model with features outperforms the one without features, which confirms the effectiveness of haze-relevant feature extraction.
Robustness of Inference: In Fig. 10, we test how much the inferred values deviate from the real values, using the root mean square error (RMSE). The results show that the proposed model outperforms the other models by maintaining a very low deviation, i.e., a classification deviation of 0.088 on average. This again demonstrates the advantage of using the 3D model and feature extraction.
VIII-B Inference Accuracy
We evaluate the inference accuracy of ImgSensingNet in both real-time estimation and near-future forecasting. Since there are no measured data for most cubes, we divide the labeled samples into a training set and a testing set, perform cross-validation by randomly choosing the training data, and repeat 1000 times to avoid stochastic errors.
We compare against the inference models in state-of-the-art AQI monitoring systems, namely ARMS [14], LSTM Nets [19], AQNet [12] and spatio-temporal NN [20]. These models are all evaluated using the same data each time.
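The repeated random-split protocol can be sketched in Python as below. This is a generic illustration, not the evaluation code itself: the split ratio `train_fraction`, the `fit`/`predict` callables and the mean-value baseline in the usage example are all assumptions introduced here, since the ratio used for the labeled samples is not stated.

```python
import math
import random


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )


def repeated_holdout(samples, fit, predict, train_fraction=0.7,
                     runs=1000, seed=0):
    """Average test RMSE over `runs` random train/test splits.

    samples: list of (features, value) pairs.
    fit(train) -> model; predict(model, features) -> value.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(runs):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        model = fit(train)
        preds = [predict(model, x) for x, _ in test]
        errors.append(rmse(preds, [y for _, y in test]))
    return sum(errors) / len(errors)


# Hypothetical baseline: always predict the mean of the training values.
def fit_mean(train):
    return sum(y for _, y in train) / len(train)


def predict_mean(model, features):
    return model
```

Averaging over many random splits reduces the dependence of the reported error on any single choice of training cubes, which is the stated purpose of repeating the procedure 1000 times.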
In Table II, we report the average estimation errors of real-time inference and near-future forecasting (i.e., after 1, 3, and 10 hours, respectively), in both 2D and 3D scenarios. ImgSensingNet achieves the lowest RMSE in every setting except 2D forecasting after 10 hours, where LSTM Nets is marginally lower (19.873 versus 20.269). Moreover, even when accurate, the competitors may lack the ability of either future prediction (e.g., ARMS) or 3D inference (e.g., LSTM Nets).
VIII-C Energy Efficiency
The energy efficiency is analysed in two aspects: (1) the consumption of aerial UAV sensing, and (2) the consumption of ground WSN sensing. We choose AQNet [12], which has similar components (a UAV and a ground WSN), for comparison in the two aspects, respectively.
Consumption of Aerial Sensing: We compare the normalized system consumption in monitoring tasks with different coverage spaces. As shown in Fig. 11, the UAV in ImgSensingNet suffers from neither the load consumption nor the loitering consumption, and hence greatly saves battery. Compared to the AQNet system with different loitering times for data sensing, ImgSensingNet consumes about 50% less energy across different coverage spaces. Thus, the energy efficiency of the proposed system is demonstrated.
Consumption of Ground Sensing: We further study the normalized consumption of ground sensing using the same method. We compare one day’s consumption of all ground devices within different coverage spaces, using the same detection time and uploading time for each method. Fig. 11 presents the experimental results. When , our ground sensing reaches its maximum consumption, which is still slightly lower than that of the AQNet system. As , the normalized consumption of the WSN drops significantly to only 53%, which again validates the energy efficiency of ImgSensingNet.
VIII-D Wakeup Mechanism
We analyse the impact of on the wakeup mechanism in two aspects: (1) the average number of devices that wake up each time, and (2) the average computing time for device selection. We set the number of devices to 30 and 100, and set . For each instance, we perform 1000 independent runs to get the average values.
Average Number of Wakeup Devices: As shown in Fig. 12(a)(c), we plot the average number of wakeup devices for different values of , with held fixed. The number of selected devices decreases monotonically when increases. Specifically, when we choose m, the average number of wakeup devices drops to less than 50% of the total devices (e.g., 38.5 on average when there are 100 devices in total). Thus, by choosing a proper , the number of wakeup devices scales down greatly, which is energy efficient.
Average Runtime of Wakeup Mechanism: Further, we study the runtime for obtaining the set of wakeup devices each time. As shown in Fig. 12(b)(d), the runtime also decreases with a greater . Specifically, the average running time is about 0.01 s when there are 30 devices in total. When there are more devices, the computation time increases, but the computation still completes in real time (about 1 s in Fig. 12(d)).
VIII-E Impacts of Different Joint Estimation Errors
In this section, we investigate the impacts of different JE values on ImgSensingNet in three aspects: (1) estimation accuracy, (2) energy consumption, and (3) working duration.
Estimation Accuracy: As shown in Fig. 13, the estimation accuracy gradually decreases when JE increases. From the figure, we can see that ImgSensingNet achieves high accuracy in both real-time inference and future forecasting. Moreover, the system achieves higher inference accuracy when more devices are deployed.
Normalized Energy Consumption: In Fig. 14 we report the relationship between energy consumption and different JE values. By comparing Fig. 14(a) and Fig. 14(b), the energy consumption scales down as JE increases, while a more stable procedure is obtained when there are more devices.
System Working Durations: In Fig. 15 we study the impacts of different JE on system working durations, over a fixed area inside Peking University. ImgSensingNet can guarantee a battery duration of more than one month when , which greatly outperforms state-of-the-art systems. As JE decreases, the monitoring overhead increases, but the inference accuracy also improves. Hence, there exists a trade-off between consumption and accuracy caused by different JE, which needs to be further studied.
VIII-F Impact of Degree of Air Pollution
In Fig. 16, we study the impact of the degree of air pollution on ImgSensingNet. We first manually divide our dataset into three degrees as slightly, moderately and highly polluted (i.e., , and ), and evaluate the performance of our model separately.
Estimation Accuracy: In Fig. 16(a) we compare the inference accuracy when the AQI value varies. As a result, our system performs the best when . Moreover, the performance tends to be better when the AQI value is higher, as most devices are scheduled to sleep when air quality is good.
Normalized Energy Consumption: We further study the normalized energy consumption in different AQI degrees with various values of JE. From Fig. 16(b), we can see that our system maintains the lowest consumption when the AQI value is low, which again validates the energy efficiency of the wakeup mechanism. By comparing Fig. 16(a) and (b), the trade-off can also be illustrated.
VIII-G Trade-off between Accuracy and Consumption
In Fig. 17, an inherent trade-off between system consumption and inference accuracy is illustrated, versus JE. As JE becomes higher, the average inference error grows rapidly while the consumption drops. Given the average error, for example, when RMSE is , the corresponding , which indicates the power consumption can be reduced to as little as . Hence, by choosing a proper JE value, the measuring cost can greatly scale down.
IX Conclusion
This paper presents the design, technologies and implementation of ImgSensingNet, a UAV vision guided aerial-ground AQI sensing system, to monitor and forecast the air quality in a fine-grained manner. We first utilize vision-based aerial UAV sensing for AQI scale inference, based on the proposed haze-relevant features and 3D CNN model. Ground WSN sensing is then used for accurate AQI inference in spatial-temporal perspectives using an entropy-based model. Further, an energy-efficient wakeup mechanism is designed to greatly reduce the energy consumption while achieving high inference accuracy. ImgSensingNet has been deployed on two university campuses for daily monitoring and forecasting. Experimental results show that ImgSensingNet outperforms state-of-the-art methods, achieving higher inference accuracy while reducing the energy consumption.
References
 [1] W. H. O., “7 million premature deaths annually linked to air pollution,” Air Quality & Climate Change, vol. 22, no. 1, pp. 53-59, Mar. 2014.
 [2] B. Zou et al., “Air pollution exposure assessment methods utilized in epidemiological studies,” J. Environ. Monit., vol. 11, pp. 475-490, 2009.
 [3] Beijing MEMC. [Online]. Available: http://www.bjmemc.com.cn/. 2018.
 [4] T. Quang et al., “Vertical particle concentration profiles around urban office buildings,” Atmos. Chem. Phys., vol. 12, pp. 5017-5030, 2012.
 [5] Y. Zheng, F. Liu, and H.-P. Hsieh, “U-Air: When urban air quality inference meets big data,” in Proc. ACM KDD’13, Chicago, IL, Aug. 2013.
 [6] Y. Cheng et al., “Aircloud: a cloud-based air-quality monitoring system for everyone,” in Proc. ACM SenSys’14, New York, NY, Nov. 2014.
 [7] Y. Gao, W. Dong, K. Guo et al., “Mosaic: a low-cost mobile sensing system for urban air quality monitoring,” in Proc. IEEE INFOCOM’16, San Francisco, CA, Jul. 2016.
 [8] D. Hasenfratz, O. Saukh, C. Walser et al., “Deriving high-resolution urban air pollution maps using mobile sensor nodes,” Pervasive and Mobile Computing, vol. 16, no. 2, pp. 268-285, Jan. 2015.
 [9] J. Li et al., “Tethered balloon-based black carbon profiles within the lower troposphere of Shanghai in the East China smog,” Atmos. Environ., vol. 123, pp. 327-338, Sept. 2015.
 [10] Y. Hu, G. Dai, J. Fan, Y. Wu and H. Zhang, “BlueAer: a fine-grained urban PM2.5 3D monitoring system using mobile sensing,” in Proc. IEEE INFOCOM’16, San Francisco, CA, Jul. 2016.
 [11] Y. Yang et al., “Arms: a fine-grained 3D AQI real-time monitoring system by UAV,” in Proc. IEEE Globecom’17, Singapore, Dec. 2017.
 [12] Y. Yang, Z. Bai, Z. Hu, Z. Zheng, K. Bian, and L. Song, “AQNet: fine-grained 3D spatio-temporal air quality monitoring by aerial-ground WSN,” in Proc. IEEE INFOCOM’18, Honolulu, HI, Apr. 2018.
 [13] Z. Hu et al., “UAV Aided Aerial-Ground IoT for Air Quality Sensing in Smart City: Architecture, Technologies and Implementation,” IEEE Network Magazine, accepted, available on https://arxiv.org/abs/1809.03746.
 [14] Y. Yang, Z. Zheng, K. Bian, L. Song, and Z. Han, “Real-time profiling of fine-grained air quality index distribution using UAV sensing,” IEEE Internet of Things Journal, vol. 5, no. 1, pp. 186-198, Feb. 2018.
 [15] Z. Pan, H. Yu, C. Miao, and C. Leung, “Crowdsensing air quality with camera-enabled mobile devices,” in Proc. Thirty-First AAAI Conf. on Artificial Intell., San Francisco, CA, Feb. 2017.
 [16] S. Li, T. Xi, Y. Tian, and W. Wang, “Inferring fine-grained PM2.5 with Bayesian based kernel method for crowdsourcing system,” in Proc. IEEE Globecom’17, Singapore, Dec. 2017.
 [17] R. Gao et al., “Sextant: towards ubiquitous indoor localization service by photo-taking of the environment,” IEEE Trans. Mobile Comput., vol. 15, no. 2, pp. 460-474, Feb. 2016.
 [18] H. Kim and K. G. Shin, “In-band spectrum sensing in cognitive radio networks: energy detection or feature detection?” in Proc. ACM MobiCom’08, 2008.
 [19] V. O. K. Li, J. Lam, Y. Chen, and J. Gu, “Deep learning model to estimate air pollution using MBP to fill in missing proxy urban data,” in Proc. IEEE Globecom’17, Singapore, Dec. 2017.
 [20] Y. Yang, Z. Zheng, K. Bian, L. Song, and Z. Han, “Sensor deployment recommendation for 3D fine-grained air quality monitoring using semi-supervised learning,” in Proc. IEEE ICC’18, Kansas City, MO, May 2018.
 [21] K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” in Proc. IEEE CVPR’09, Miami, FL, Jun. 2009.
 [22] K. He, J. Sun, and X. Tang, “Guided image filtering,” in Proc. ECCV’10, Crete, Greece, Sept. 2010.
 [23] R. Tan, “Visibility in bad weather from a single image,” in Proc. IEEE CVPR’08, Anchorage, AK, Jun. 2008.
 [24] Q. Zhu, J. Mai, and L. Shao, “A fast single image haze removal algorithm using color attenuation prior,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3522-3533, Jun. 2015.
 [25] C. Ancuti et al., “A fast semi-inverse approach to detect and remove the haze from a single image,” in Proc. ACCV, Pondicherr, India, Aug. 2011.
 [26] I. Goodfellow, Y. Bengio, and A. Courville, “Applied Math and Machine Learning,” in Deep Learning. Cambridge, MA: MIT Press, 2016.
 [27] X. Zhu et al., “Semi-supervised learning using Gaussian fields and harmonic functions,” in Proc. ICML’03, Washington, DC, Aug. 2003.
 [28] F. Aurenhammer, “Voronoi diagrams: a survey of a fundamental geometric data structure,” ACM Comput. Survey, vol. 23, pp. 345-405, 1991.
 [29] N. Bourgeois et al., “Fast algorithms for min independent dominating set,” Discrete Applied Mathematics, vol. 161, pp. 558-572, Mar. 2013.
 [30] P. Zhao et al., “Optimal trajectory planning of drones for 3D mobile sensing,” in Proc. IEEE Globecom’18, Abu Dhabi, UAE, Dec. 2018.