## 1 Introduction

Artificial Intelligence (AI) systems dedicated to the analysis of, and interaction with, the physical world can have a significant impact on human life. These systems can process massive amounts of data and make or suggest decisions that help solve many real-world problems in which humans are at the epicenter. Crucial examples are city mobility, pollution monitoring, and critical infrastructure management, where decision-makers require, for instance, measurements of flows of bicycles, cars, or people. Like no other sensing mechanism, networks of city cameras can observe scenes at this scale and simultaneously provide visual data to AI systems, which extract the relevant information from this deluge of data.

Different smart cameras across the city are subject to different visual conditions (luminance, position, context). As a result, performance varies from camera to camera, which makes it harder to scale up the learning task effectively. In this paper, we address this issue and propose a methodology that performs unsupervised domain adaptation among different cameras to reliably compute the number of vehicles in a city. We focus on vehicle counting, but the approach is applicable to counting any other type of object.

### 1.1 Counting as a supervised learning task

The counting problem is the estimation of the number of object instances in still images or video frames [lempitsky2010learning]. Current systems address the counting problem as a supervised learning process. They fall into two main classes of methods: a) detection-based approaches ([amato2019counting, ciampi2018counting, amato2018wireless]) that try to identify and localize single instances of objects in the image, and b) density-based techniques that rely on regression to estimate a density map from the image, where the final count is given by summing all pixel values [lempitsky2010learning]. Figure 1 illustrates the mapping of such regression. Concerning vehicle counting in urban spaces, where images are of very low resolution and most objects are partially occluded, density-based methods have a clear advantage over detection methods [zhang2016single, guerrero2015extremely, li2018csrnet, boominathan2016crowdnet]. Hinging on Convolutional Neural Networks (CNNs) to learn the regressor, this class of approaches has been shown to be very effective, especially in single-camera scenarios. However, since they require pixel-level ground truth for supervised learning, they may not generalize well to unseen images, especially when there is a large domain gap between the training (source) and the test (target) sets, such as different camera perspectives, weather, or illumination. This gap severely hampers the application of counting methods to very large scale scenarios, since annotating images for all the possible cases is infeasible.

### 1.2 Unsupervised domain adaptation

This paper proposes to generalize the counting process through a new domain adaptation algorithm for density map estimation and counting. Specifically, we assume that an annotated training set is available for a *source domain*, and we want to adapt the system to perform well in an unseen and unlabelled *target domain*. For instance, the source domain may consist of images taken from one set of cameras, and the target domain of images taken from a different set of cameras, with different luminance, perspectives, and contexts. This class of algorithms is commonly referred to as Unsupervised Domain Adaptation.

We conduct preliminary experiments using the WebCamT dataset introduced in [zhang2017understanding]. In particular, we consider a test set containing images from cameras with different perspectives from the training ones, showing that our unsupervised domain adaptation technique can mitigate the perspective domain gap.

Traditional approaches to Unsupervised Domain Adaptation have been developed to address the problem of image classification, and they try to align features across the two domains ([ganin2014unsupervised, tzeng2017adversarial]). However, as pointed out in [zhang2017curriculum], they do not perform well in other tasks, such as semantic segmentation.

## 2 Proposed Method

We propose an end-to-end CNN-based unsupervised domain adaptation algorithm for traffic density estimation and counting. Inspired by [tsai2018learning], our method is based on adversarial learning in the output space (density maps), which contains rich information such as scene layout and context. In our approach, we rely on the adversarial learning scheme to make the predicted density distributions of the source and target domains consistent.

The proposed framework, shown in Fig. 2, consists of two modules: 1) a CNN that predicts traffic density maps and estimates the number of vehicles occurring in the scene, and 2) a discriminator that distinguishes whether the density map (received from the density map estimator) was generated by processing an image of the source domain or of the target domain. In the training phase, the density map predictor learns to map images to densities, based on annotated data from the source domain. At the same time, it learns to fool the discriminator by exploiting an adversarial loss, computed using the predicted density maps of unlabeled images from the target domain. Consequently, the output space is forced to have similar distributions for both the source and target domains. In the inference phase, the discriminator is discarded, and only the density map predictor is used for the target images. A description of each module and of their training is provided in the following subsections.

### 2.1 Density Estimation Network

We formulate the counting task as a density map estimation problem [lempitsky2010learning]. The density (weight) of each pixel in the map depends on its proximity to a vehicle centroid and on the size of the vehicle in the image, as shown in Fig. 1, so that each vehicle contributes a total value of 1 to the map. The map therefore provides statistical information about the location of the vehicles and allows the count to be estimated by summing all density values.

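This task is performed by a CNN-based model, whose goal is to automatically determine the vehicle density map associated with a given input image. Formally, the density map estimator, $\Psi : \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{H \times W}$, transforms a $C$-channel input image, $I \in \mathbb{R}^{C \times H \times W}$, into a density map, $D = \Psi(I) \in \mathbb{R}^{H \times W}$.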

### 2.2 Discriminator Network

The discriminator network, denoted by $\Theta$, also consists of a CNN model. It takes as input the density map, $D$, estimated by the density estimation network $\Psi$. Its output is a lower-resolution probability map in which each pixel represents the probability that the corresponding area of the input density map comes from the source or the target domain. The goal of the discriminator is to learn to distinguish between density maps belonging to the source and target domains. This, in turn, forces the density estimator to provide density maps with similar distributions in both domains, i.e., the density maps, $D^T$, of the target domain have to look realistic, even though network $\Psi$ was not trained with an annotated training set from that domain.

### 2.3 Domain Adaptation Learning

The proposed framework is trained by alternately optimizing the density estimation network, $\Psi$, and the discriminator network, $\Theta$. Regarding the former, the training process relies on two components: 1) density estimation using pairs of images and ground truth density maps, which we assume are only available in the source domain; and 2) adversarial training, which aims to make the discriminator fail to distinguish between the source and target domains. As for the latter, images from both domains are used to train the discriminator to correctly classify each pixel of the probability map as either source or target.

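To implement the above training procedure, we introduce two loss functions: one is employed in the first step of the algorithm to train the density estimation network $\Psi$, and the other is used in the second step to train the discriminator $\Theta$. These loss functions are detailed next.

Network Training. We formulate the loss function for $\Psi$ as the sum of two main components:

$$\mathcal{L}(I^S, I^T) = \mathcal{L}_{density}(I^S) + \lambda_{adv}\, \mathcal{L}_{adv}(I^T) \tag{1}$$

where $I^S$ and $I^T$ are images from the source and target domains, $\mathcal{L}_{density}$ is a composite loss computed using ground truth annotations available in the source domain, $\mathcal{L}_{adv}$ is the adversarial loss that is responsible for bringing the distributions of the target and source domains close to each other, and $\lambda_{adv}$ weights the adversarial term. In particular, we define the density loss as:

$$\mathcal{L}_{density}(I^S) = \mathcal{L}_{2}(D^S, D^S_{GT}) + \mathcal{L}_{count}(D^S, D^S_{GT}) \tag{2}$$

where $D^S = \Psi(I^S)$ is the predicted density map, $D^S_{GT}$ is its ground truth, $\mathcal{L}_{2}$ is the mean square error between the predicted and ground truth density maps, i.e., $\mathcal{L}_{2} = \frac{1}{HW} \sum_{h,w} \big(D^S(h,w) - D^S_{GT}(h,w)\big)^2$, while $\mathcal{L}_{count}$ is the Euclidean loss between the predicted and ground truth counts.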
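The composite density loss described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names `count` and `density_loss` are hypothetical, the two terms are summed without weights (an assumption), and the count term is taken to be the squared difference between the two counts (one common reading of "Euclidean loss").

```python
def count(density_map):
    """The estimated count is the sum of all pixel values of a density map."""
    return sum(sum(row) for row in density_map)


def density_loss(pred, gt):
    """Pixel-wise mean squared error plus squared count error.

    Assumption: the two terms are added without weighting.
    """
    h, w = len(gt), len(gt[0])
    mse = sum(
        (pred[i][j] - gt[i][j]) ** 2 for i in range(h) for j in range(w)
    ) / (h * w)
    count_error = (count(pred) - count(gt)) ** 2
    return mse + count_error


# Toy 2x2 maps: one vehicle whose mass is spread over two pixels.
gt = [[0.0, 0.5], [0.5, 0.0]]
pred = [[0.0, 0.5], [0.0, 0.0]]  # the prediction misses half of the mass
loss = density_loss(pred, gt)
```

In this toy case the pixel-wise term is small because only one pixel is wrong, while the count term captures that half a vehicle is missing, which is why the two terms complement each other.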

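To compute the adversarial loss $\mathcal{L}_{adv}$, we first forward the images $I^T$ belonging to the target domain through the density estimator $\Psi$ and generate the predicted density maps $D^T = \Psi(I^T)$. Then, we compute

$$\mathcal{L}_{adv}(I^T) = -\sum_{h,w} \log\big(\Theta(D^T)^{(h,w)}\big) \tag{3}$$

where $\Theta(D^T)^{(h,w)}$ is the output of the discriminator $\Theta$ at location $(h,w)$. This loss forces the distribution of $D^T$ to be closer to that of the source predictions $D^S$ by training $\Psi$ to fool the discriminator, maximizing the probability that the target predicted density map is considered a source prediction.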

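Discriminator Training. Given the estimated density map $D = \Psi(I)$, we forward $D$ to a fully-convolutional discriminator $\Theta$, trained with a binary cross-entropy loss for the two classes (i.e., source and target domains). We formulate the loss as:

$$\mathcal{L}_{\Theta}(D) = -\sum_{h,w} \Big[ (1 - y) \log\big(1 - \Theta(D)^{(h,w)}\big) + y \log\big(\Theta(D)^{(h,w)}\big) \Big] \tag{4}$$

where $y = 0$ if the sample is taken from the target domain, and $y = 1$ if the sample is taken from the source domain.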
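The two adversarial objectives can be illustrated with a small sketch. It assumes, as a labeling convention, that the discriminator outputs the probability of "source" at each location of its probability map, with label 1 for source and 0 for target. The function names and the `EPS` constant are hypothetical, and plain Python lists stand in for tensors.

```python
import math

EPS = 1e-12  # keeps log() finite when a probability is exactly 0 or 1


def discriminator_loss(prob_map, y):
    """Binary cross-entropy summed over all locations of the probability map.

    prob_map: per-location probability that the input map is from the source.
    y: 1 for a source-domain sample, 0 for a target-domain sample (assumed).
    """
    return -sum(
        y * math.log(p + EPS) + (1 - y) * math.log(1.0 - p + EPS)
        for row in prob_map
        for p in row
    )


def adversarial_loss(target_prob_map):
    """Loss for the density estimator: small when the discriminator
    scores target-domain predictions as if they were source predictions."""
    return -sum(math.log(p + EPS) for row in target_prob_map for p in row)


# Discriminator outputs for a target-domain density map in two situations.
spotted = [[0.1, 0.1]]  # discriminator correctly says "target"
fooled = [[0.9, 0.9]]   # discriminator believes the map is from the source
```

The two losses pull in opposite directions on target samples: `fooled` gives a low `adversarial_loss` but a high `discriminator_loss(..., y=0)`, and `spotted` does the reverse. Alternating the two updates is what drives the predicted density maps of both domains toward the same distribution.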

### 2.4 Implementation Details

Density Map Estimation and Counting Network. We build our density map estimation network based on U-Net [ronneberger2015u]. U-Net is a popular end-to-end encoder-decoder network for semantic segmentation, first used for biomedical image segmentation. The encoder part consists of convolution blocks followed by max-pooling blocks that downscale the feature representations at multiple levels. The decoder part of the network upsamples the features through upsampling layers followed by regular convolution operations. Furthermore, the upsampled features are concatenated with the same-scale features from the encoder, which contain more detailed spatial information and prevent the network from losing spatial awareness due to downsampling.

Discriminator. For the discriminator, we use a Fully Convolutional Network similar to [tsai2018learning, radford2015unsupervised], composed of 5 convolution layers that share the same kernel size and use a stride of 2. The numbers of output channels are {64, 128, 256, 512, 1}, respectively. Each convolution layer is followed by a leaky ReLU with a negative slope of 0.2.
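To see why the discriminator output is a lower-resolution map, the spatial size can be traced through the five stride-2 convolutions with the standard convolution output-size formula. The kernel size of 4, the padding of 1, and the 256 x 384 input are assumptions chosen for illustration; they are not taken from the text above.

```python
def conv_out(size, kernel, stride, padding):
    """Standard output-size formula for a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def discriminator_output_shape(height, width, kernel=4, stride=2,
                               padding=1, layers=5):
    """Spatial shape of the probability map after `layers` convolutions.

    kernel=4 and padding=1 are hypothetical values, not from the paper.
    """
    for _ in range(layers):
        height = conv_out(height, kernel, stride, padding)
        width = conv_out(width, kernel, stride, padding)
    return height, width


# Hypothetical 256 x 384 density map as input.
shape = discriminator_output_shape(256, 384)  # (8, 12)
```

Under these assumptions every layer halves each dimension, so each output pixel judges a patch of the input density map rather than the map as a whole.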

## 3 Experimental Setup

We conduct preliminary experiments using the WebCamT dataset introduced in [zhang2017understanding]. This dataset is a collection of traffic scenes recorded using city cameras, and it is particularly challenging to analyze due to its low resolution, high occlusion, and large perspective. We consider a total of about 42,000 images belonging to 10 different cameras, which consequently have different perspectives. We employ the existing bounding box annotations of the dataset to generate ground truth density maps, one for each image. In particular, we place one Gaussian kernel for each vehicle present in the scene, with mean $\mu$ equal to the center of the bounding box surrounding the vehicle and standard deviation $\sigma$ proportional to the length of that bounding box.
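The ground truth generation just described can be sketched as follows. This is an illustrative version under stated assumptions: `sigma_scale` is a hypothetical proportionality constant, the "length" of a box is taken to be its longer side, and each kernel is normalized inside the image so that every vehicle contributes exactly 1 to the map.

```python
import math


def density_map(boxes, height, width, sigma_scale=0.25):
    """Build a density map from (x_min, y_min, x_max, y_max) boxes.

    One isotropic Gaussian per vehicle, centered on the box center, with a
    standard deviation proportional to the box length (assumed: longer side).
    """
    dmap = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        sigma = sigma_scale * max(x1 - x0, y1 - y0)
        kernel = [
            [
                math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
                for x in range(width)
            ]
            for y in range(height)
        ]
        total = sum(map(sum, kernel))  # normalize so the vehicle sums to 1
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / total
    return dmap


# Two hypothetical vehicles in a 32 x 40 image.
boxes = [(4, 4, 12, 10), (20, 14, 30, 22)]
dmap = density_map(boxes, height=32, width=40)
estimated_count = sum(map(sum, dmap))
```

Summing the map recovers the number of annotated vehicles up to floating-point error, which is the property the counting network relies on.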

First, we show the domain gap that we want to address. We generate a first pair of training and validation subsets by picking images at random from the whole dataset. Then, we create a second pair of training and validation subsets, this time picking images belonging to seven different cameras for the first and images belonging to the three remaining ones for the second (per-camera splits of the whole dataset). We expose the domain gap by training our model without the discriminator on each training subset and then comparing the results obtained on the corresponding validation splits.
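The difference between the two kinds of split can be made concrete with a toy example. The record layout `(camera_id, filename)`, the function names, the validation fraction, and the choice of which three cameras are held out are all hypothetical.

```python
import random


def random_split(records, val_fraction, seed=0):
    """Shuffle all images together, so every camera appears in both subsets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def per_camera_split(records, val_cameras):
    """Hold out whole cameras, so validation perspectives are never seen."""
    train = [r for r in records if r[0] not in val_cameras]
    val = [r for r in records if r[0] in val_cameras]
    return train, val


# Toy dataset: 10 cameras with 5 images each.
records = [(cam, f"cam{cam}_img{i}.jpg") for cam in range(10) for i in range(5)]
train_rand, val_rand = random_split(records, val_fraction=0.3)
train_cam, val_cam = per_camera_split(records, val_cameras={7, 8, 9})
```

With the per-camera split, the sets of cameras in training and validation are disjoint, which is what turns the validation subset into an unseen target domain.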

Having quantified this domain gap, we try to mitigate it by conducting experiments on the per-camera splits using our solution, i.e., the density estimation network together with the discriminator that acts on the output space. In particular, during training we also use the images belonging to the validation subset, without their labels, to generate an adversarial loss aimed at bringing the source domain (i.e., the training subset) and the target domain (i.e., the validation subset) close to each other.

We base the evaluation of the models on three metrics: (i) Mean Absolute Error (MAE), which averages the absolute count error over the images; (ii) Mean Squared Error (MSE), which penalizes large errors more heavily than small ones; (iii) Average Relative Error (ARE), which averages the absolute count error divided by the true count.
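The three metrics can be written down directly from their definitions. The sketch below assumes per-image predicted and true counts and, for ARE, that every true count is positive; the function names and the sample counts are hypothetical.

```python
def mae(pred, true):
    """Mean Absolute Error over per-image counts."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mse(pred, true):
    """Mean Squared Error over per-image counts."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def are(pred, true):
    """Average Relative Error; assumes every true count is greater than 0."""
    return sum(abs(p - t) / t for p, t in zip(pred, true)) / len(true)


# Hypothetical counts for three images.
pred_counts = [10, 18, 33]
true_counts = [12, 20, 30]
scores = (mae(pred_counts, true_counts),
          mse(pred_counts, true_counts),
          are(pred_counts, true_counts))
```

Here the absolute errors are 2, 2, and 3 vehicles. ARE weighs the same absolute error more heavily on images with fewer vehicles, so it complements MAE when scene density varies across cameras.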

## 4 Results and Discussion

Figure 3: Performance during training. (a) Comparison between the random and the per-camera validation splits, showing the domain gap; (b) comparison between the proposed approach with and without the discriminator. Each row corresponds to a specific evaluation metric.

Figure 3 (a) shows the results for the two validation sets (the random one and the per-camera one), using the density estimation network without the discriminator, trained on the random and the per-camera training subsets, respectively. Each plot corresponds to one of the three metrics. As we can see, the domain gap is significant: even though all the images of the subsets belong to the same dataset and are collected in the same city under similar conditions, small changes in perspective cause a remarkable loss in performance. In other words, the network is not able to generalize well to perspectives that it has not seen during training.

When the density estimation network is combined with the adversarial component, the performance of the network improves considerably. These results are shown in Figure 3 (b), where the improvement obtained using our model (red line) over the baseline model, without the discriminator, is clearly visible in all three metrics. The discriminator mitigates the domain gap, and the network is able to generalize better to images having different perspectives from the ones employed during training. The results refer to the specific value of the adversarial loss weight that showed the most promising results.

Since all the metrics that we considered take into account only the counting errors, we also plot some examples of the density maps predicted by our model both with and without the discriminator. Figure 4 shows the ground truth and the predicted density maps for two random samples of the validation subset. As we can see, the density maps predicted using the model with the discriminator are less noisy than the ones obtained using the baseline model without the discriminator.

## 5 Conclusions

In this article, we tackle the problem of determining the density and the number of objects present in large sets of images. Building on a CNN-based density estimator, the proposed methodology is capable of generalizing to new sources of data for which no annotated training data is available. We achieve this generalization by adversarial learning, whereby a discriminator attached to the output induces similar density distributions in the target and source domains. Experiments show a significant improvement relative to the performance of the model without domain adaptation. Given the conventional structure of the estimator, the improvement obtained by just monitoring the output indicates a great capacity to generalize training and suggests applying similar principles to the inner layers of the network. In our view, the surprising outcome of this work opens new perspectives for dealing with the scalability of learning methods for large physical systems with scarce supervisory resources.

This work was partially supported by LARSyS - FCT Plurianual funding 2020-2023 and by H2020 project AI4EU under GA 825619.
