Urban littering, defined as waste disposed of improperly in cities, has recently become a major concern for modern cities. Major European cities treat urban cleanliness as a top priority, as it directly affects the concerns and satisfaction of their citizens and the attractiveness of their economy and tourism. At a recent Clean Europe Network summit (http://www.cleaneuropenetwork.eu/de/measuring-litter/aus/), the lack of data was pointed out as one of the major difficulties in properly addressing this environmental issue.
The key to properly managing urban cleanliness is to implement a continuous improvement management system, for which the measurement of urban litter is mandatory. Anti-littering organizations such as AVPU (http://www.avpu.fr/pdf%20AVPU/formation%20grille%20IOP-2014.pdf) and cities worldwide assess urban cleanliness by means of human audits. Zurich, ranked third out of 83 European cities for the satisfaction of its citizens regarding cleanliness (http://ec.europa.eu/regional_policy/sources/docgener/studies/pdf/urban/survey2015_en.pdf), conducts 14,000 audits a year to assess and manage its cleanliness. To provide such a measurement as an index of cleanliness, a key step is to recognize the different types of waste in urban places, and to quantify and classify them by type.
In this study, we propose and develop a computer vision application based on deep CNN algorithms to automatically localize and classify urban waste such as bottles, leaves, etc. in RGB images. The measurement is carried out by an image acquisition system consisting of a high-resolution camera mounted on top of a vehicle and facing the ground, so that the area in front of the vehicle is covered by the camera view. The system must be able to detect the smallest defined waste item (a cigarette butt) seen from a camera placed at a height of two to three meters. The output of this application is a geo-localized density of the different categories of urban waste. An overview of the system is shown in Fig. 1.
The rest of the paper is organized as follows. Section 2 reviews related work. In Section 3, the deep neural network used for detection, including its implementation details, is explained. In Section 4, we present the data collection setup and use it to obtain a waste dataset. In Section 5, we test our application on a real case scenario and present the results. Finally, Section 6 summarizes our work and discusses future work.
2 Related Work
Different methodologies have been developed worldwide to obtain an index of cleanliness for a city. These approaches are mostly focused on human interpretations of cleanliness. However, an automated approach has not yet been developed.
The closest work to this study is SpotGarbage (Mittal et al.), a trash-related project designed to coarsely segment a pile of garbage in an image. The authors also provide an Android application that allows citizens to track and report garbage in their neighborhoods. The Bing Image Search API (https://www.microsoft.com/cognitive-services/en-us/bing-image-search-api) was used to create their dataset, whose images are labeled as containing garbage or not. The authors use a pre-trained AlexNet (Krizhevsky et al.) model and obtain a sensitivity of 83.96% with a specificity of 90.06%. Their approach focuses on segmenting a pile of garbage in an image and provides no details about the types of waste in that segment.
There exist approaches that classify garbage into recycling categories. Sudha et al. propose an automated recognition system using a deep learning algorithm that classifies objects as biodegradable or non-biodegradable. They propose a model and implement it in Caffe (Jia et al.); however, no experimental results are presented. Briñez et al. propose a system to classify waste in high schools. They design a box with a camera inside it; to be classified, objects must be placed inside the box. Their image processing module computes the correlation between the image of the object in the box and 50 different images, then chooses the best match as the category. The developed system classifies three kinds of waste (PET bottles, soda cans and carton boxes) with a classification performance of over 70%.
An automatic waste sorting approach is presented by Sakr et al., who compare a CNN and an SVM applied to an image of the waste. For their CNN architecture, they use the AlexNet model. Their SVM uses a bag of features obtained by passing a window over the whole image. Each algorithm produces a different classifier that separates waste into three main categories: plastic, paper, and metal. They achieved a classification accuracy of 94.8% with the SVM, while the CNN reached an accuracy of 83%. As mentioned in their paper, the main reason the CNN did not perform better is the insufficient number of images in their training set. Their approach focuses on classifying a specific object, not on localizing it from a distance.
In this section, we describe each step of our approach. We explain how we localize and classify waste in given input images, then we discuss the implementation details.
3.1 Waste Localization and Classification
The proposed system must handle two main tasks. The first task is to localize all objects in the image. The second task is to classify each detected object into its correct litter category. In this section, both tasks are addressed using a single framework and a shared feature learning base.
The fact that CNNs are trained end-to-end, from raw pixels to final classes, makes them more advantageous for many tasks than manually designing a suitable feature extractor. Our approach is similar to the OverFeat model (Sermanet et al.), which proposes a multi-scale deep learning approach that can be used for classification, localization and detection. We replace its classification architecture with GoogLeNet (Szegedy et al.). For localization, as proposed in OverFeat, starting from the classification-trained network, the classifier layers are replaced by a regression network that is trained to predict object bounding boxes at each spatial location and scale. The regression predictions are then combined with the classification results at each location to obtain detection results. Object bounding box predictions are generated by running the classifier and regressor networks for all locations and scales. Because the two networks share the same feature extraction layers, only the final regression layers need to be recomputed after computing the classification network. The final output layer of the regression network has 4 units, which correspond to the coordinates of the bounding box of the detected object.
We use the OverFeat-GoogLeNet model presented by Stewart and Andriluka. The original version of OverFeat relies on an image representation based on AlexNet (Krizhevsky et al.). Stewart and Andriluka directly substituted the GoogLeNet architecture into the OverFeat model and denoted the new model OverFeat-GoogLeNet. They show that OverFeat-GoogLeNet performs significantly better than OverFeat-AlexNet. GoogLeNet is initially trained on 1.2 million images for 1000-class object recognition. OverFeat-GoogLeNet uses expressive image features from GoogLeNet, which in our implementation are fine-tuned as part of our system. The size of the input layer is fixed to pixels. The model is constructed to encode the input image into a grid where each cell contains 1024-dimensional top-level GoogLeNet features and has a receptive field of size . Cells are trained to produce the set of all bounding boxes intersecting the central region. The convolutional layers are followed by two fully connected layers containing 3092 and 4096 neurons, respectively. Finally, the output layer contains 25 neurons corresponding to the different categories of waste.
An open source implementation of OverFeat in TensorFlow was used as a starting point and modified to perform multi-class classification. The image of a cigarette butt must contain at least the same number of pixels as the smallest possible bounding box for the network. To fulfill this criterion, and given the height of the camera, the resolution is fixed to pixels. During training, these images occupy a considerable amount of memory when their batches are loaded. Because of this, and because of the goal of a cheaper system capable of processing images and detecting waste onboard an embedded system, we decided to pass a pixels sliding window with an overlapping factor over the input image and keep the network input size the same as the window size. The final result is produced by converting the detection coordinates back to the frame of the initial full image. Detections within the same category are merged when they overlap by more than 60%. The model is fine-tuned in TensorFlow (Abadi et al.) using Nvidia K40 GPUs for 350,000 iterations with a batch size of 16. Validation is performed every 2,000 iterations.
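The sliding-window step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation. The window size and overlapping factor are left as parameters because their values are not given here. The "overlapping factor" used for merging is assumed to be intersection-over-union (IoU), and merged boxes are assumed to be replaced by their enclosing box. The function names `tile_origins`, `to_full_image` and `merge_detections` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def tile_origins(full: int, window: int, overlap: float) -> List[int]:
    """Window offsets along one axis; the last window is flush with the edge."""
    if full <= window:
        return [0]
    stride = max(1, int(round(window * (1.0 - overlap))))
    origins = list(range(0, full - window + 1, stride))
    if origins[-1] != full - window:
        origins.append(full - window)
    return origins


def to_full_image(box: Box, x0: int, y0: int) -> Box:
    """Shift a box from window coordinates to full-image coordinates."""
    x1, y1, x2, y2 = box
    return (x1 + x0, y1 + y0, x2 + x0, y2 + y0)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes (assumed overlap measure)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def merge_detections(dets: List[Tuple[str, Box]],
                     threshold: float = 0.6) -> List[Tuple[str, Box]]:
    """Greedily merge same-category boxes whose IoU exceeds the threshold."""
    merged: List[Tuple[str, Box]] = []
    for cat, box in dets:
        for i, (mcat, mbox) in enumerate(merged):
            if mcat == cat and iou(box, mbox) > threshold:
                # Assumption: a merged pair is replaced by its enclosing box.
                merged[i] = (cat, (min(box[0], mbox[0]), min(box[1], mbox[1]),
                                   max(box[2], mbox[2]), max(box[3], mbox[3])))
                break
        else:
            merged.append((cat, box))
    return merged
```

In use, each window would be passed to the detector, its boxes shifted with `to_full_image` using that window's origin, and the pooled detections passed once through `merge_detections`.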
4 The Dataset
Convolutional Neural Networks have many advantages over methods that require a hand-designed feature extractor. However, one of their drawbacks is the need for a large number of labeled training samples.
There is currently no waste image dataset available that differentiates between types of litter. Our initial idea was to gather a diverse set of images to train our system, for example through image search using the category names as keywords or from ImageNet (Deng et al.). However, we decided not to use such images for training, as their conditions (camera view, illumination, etc.) were too different from what our system captures. To collect our own dataset, we built our own acquisition system, mounted it on a vehicle and drove for several hours in the Geneva area, Switzerland, obtaining 18,676 images. To avoid overlap between training images, we reduced the frame rate from 2 to 0.4 frames per second; among the remaining images, we annotated 469 full images, which corresponds to 4338, pixel resolution images. Because of the time and season of our acquisition, most of the waste found in the images consisted of leaves and cigarette butts.
Another important step is to define what counts as waste and needs to be considered for a cleanliness measure, and how the categories should be defined in order to cover most litter. Different organizations use different waste classifications. The OFEV approach (http://www.bafu.admin.ch/publikationen/publikation/01604/index.html?lang=fr), for example, does not take into account chewing gum or excrement, which nevertheless play an important role in the perception of cleanliness and urban pollution. To give an example, in Rome, 5.54 million pieces of gum are discarded every year, each taking about 5 years to degrade. In this work, after discussion with different cities, we decided to classify waste into 25 general categories. Some important ones are: 1. Beverage and meal packages, 2. Cigarettes and derivatives, 3. Leaves, 4. Newspapers and papers, 5. Vegetable waste, etc.
We equipped an automatic street sweeper with a camera and an embedded system to acquire and store our dataset. As shown in Fig. 2, the camera was installed on a metallic arm on top of the vehicle, extending out from it and giving a flat view of the ground. The camera has a rolling shutter with a 1/2.3 inch CMOS sensor and 4K resolution; however, after tuning the input image size of the network, the camera was configured to an output of pixel resolution images. The camera was set to capture two frames per second, and the average speed of the vehicle was twelve kilometers per hour.
As in other object recognition problems, our model requires a considerable number of labeled training samples. We developed an annotation tool to label a sequence of images by drawing a bounding box around each waste item and assigning it an integer class number. A screenshot of this tool is shown in Fig. 3. This approach is based on the hypothesis that each object is well separated, countable and has its own particular shape, which is not the case for all categories. For example, during autumn the ground is covered by leaves, and an individual leaf in a pile does not look the same as a leaf lying alone. A significant improvement in classification accuracy was observed once two different classes were introduced for leaves: one for single leaves and another for piles of leaves. For the cleanliness measurement, however, both classes are considered one category. This approach helped the network generalize better for each type separately. An example of these two classes is shown in Fig. 4.
5 Results and Discussion
The proposed application was validated on a test set of 62 non-overlapping full-size images collected with our setup, equal to 558 images fed to the network. Rectangular ground-truth bounding boxes were defined on each image. In total, they contain 69 cigarettes, 958 leaves and 394 bounding boxes on piles of leaves. Although other types of waste were annotated and used during training, they were not considered for the evaluation, because their number was too small to provide reliable training or testing. For example, the training set contains in total only 8 bottles, 5 cans and 6 goblets.
We are able to process images at 2 frames () per second. In other words, with a camera mounted at a height of 3 meters, we can detect a cigarette butt at a vehicle speed of up to 12 kilometers per hour (this frame rate gives 15% of overlap between two consecutive images). Both training and testing were done on an Nvidia K40 GPU.
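As a rough consistency check of these figures, the spacing between consecutive frames and the ground length that each image would need to cover can be worked out as below. This is only a sketch. It assumes a constant vehicle speed and that the 15% overlap is measured along the direction of travel. The symbols $v$ (speed), $f$ (frame rate), $d$ (distance travelled between frames), $o$ (overlap fraction) and $\ell$ (ground length covered by one image) are introduced here for illustration.

```latex
\begin{align*}
d    &= \frac{v}{f} = \frac{12\ \text{km/h}}{2\ \text{s}^{-1}}
      = \frac{3.33\ \text{m/s}}{2\ \text{s}^{-1}} \approx 1.67\ \text{m}, \\
\ell &= \frac{d}{1 - o} = \frac{1.67\ \text{m}}{1 - 0.15} \approx 1.96\ \text{m}.
\end{align*}
```

Under these assumptions, each image would cover roughly two meters of ground along the direction of travel.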
5.1 Precision-recall analysis
To evaluate the performance of the proposed application quantitatively, a precision-recall analysis was performed (Everingham et al.). The precision ($P$) and recall ($R$) rates of the system are defined as $P = \frac{TP}{TP + FP}$ and $R = \frac{TP}{N_{GT}}$, where $TP$, $FP$ and $N_{GT}$ are the total numbers of correct detections, false positives and ground-truth objects, respectively.
In order to calculate these parameters, each detection first needs to be labeled as either a correct detection or a false positive by reference to the ground truth. For the cigarette butts category, a detection is marked as correct when the overlap between its bounding box and the corresponding ground truth is at least 50%. The precision and recall for cigarette butts are illustrated in Fig. 5(a). Different precision and recall values are obtained by varying a threshold on the final detection score for this category. We reached a precision of 63.2% with a recall of 61.02% for the cigarette butts class.
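The box-level evaluation described above can be sketched as follows. It is a minimal illustration under stated assumptions, not the authors' evaluation code. The "overlap" is assumed to be intersection-over-union. Detections are matched greedily in decreasing score order, each ground-truth box may be matched only once, and precision is reported as 1.0 when no detection passes the threshold. The function name `precision_recall` and its parameters are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes (assumed overlap measure)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def precision_recall(dets: List[Tuple[float, Box]], gts: List[Box],
                     score_thr: float, min_overlap: float = 0.5):
    """Precision and recall for one category at one score threshold.

    dets: (score, box) pairs; gts: ground-truth boxes.
    """
    kept = sorted((d for d in dets if d[0] >= score_thr),
                  key=lambda d: d[0], reverse=True)
    used = set()  # indices of ground-truth boxes already matched
    tp = fp = 0
    for _, box in kept:
        best_j, best = -1, 0.0
        for j, gt in enumerate(gts):
            if j in used:
                continue
            o = iou(box, gt)
            if o > best:
                best, best_j = o, j
        if best_j >= 0 and best >= min_overlap:
            used.add(best_j)
            tp += 1
        else:
            fp += 1
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention when empty
    recall = tp / len(gts) if gts else 1.0
    return precision, recall
```

Calling `precision_recall` over a range of `score_thr` values yields the points of a precision-recall curve for that category.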
This way of defining correct detections and false positives can be problematic when evaluating some categories, such as leaves. Imagine a scene covered by leaves. As explained previously, our ground truth consists of overlapping bounding boxes of different sizes, placed at somewhat arbitrary positions over the leaves. In this case, the algorithm would correctly return detections over the leaf regions, but not exactly at the positions defined in the ground truth. To avoid this issue, for this category only, a binary detection image and a binary ground-truth image were produced for each image. Pixels set to 0 indicate background and pixels set to 1 indicate ground truth or detection. Comparing these two binary images pixel by pixel gives the numbers of correct detections and false positives. Fig. 5(b) shows the precision-recall curve for the leaves category. We obtained a precision of 77.35% with a recall of 60% for the leaves class.
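The pixel-wise comparison for the leaves category can be sketched as follows. This is a minimal pure-Python illustration, assuming integer box coordinates with exclusive right and bottom edges. The helper names `rasterize` and `pixel_precision_recall` are hypothetical and do not come from the original system.

```python
from typing import List, Tuple

IntBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2), x2 and y2 exclusive


def rasterize(boxes: List[IntBox], width: int, height: int) -> List[List[int]]:
    """Binary mask: 1 where any box covers the pixel, 0 for background."""
    mask = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                mask[y][x] = 1
    return mask


def pixel_precision_recall(det_mask, gt_mask):
    """Compare two binary masks pixel by pixel."""
    tp = fp = gt_total = 0
    for det_row, gt_row in zip(det_mask, gt_mask):
        for d, g in zip(det_row, gt_row):
            tp += d & g          # detected and in the ground truth
            fp += d & (1 - g)    # detected but background
            gt_total += g
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention when empty
    recall = tp / gt_total if gt_total else 1.0
    return precision, recall
```

Because overlapping boxes collapse into a single region in the mask, the exact placement of individual boxes within a leaf-covered area no longer affects the score.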
Although the quantitative results may not seem very high, it should be taken into account that the system is designed to localize very small objects, such as cigarette butts, in relatively large images covering five meters of a street. Considering this challenge, these results are promising for waste localization and classification, even when the waste is seen from a distance.
5.2 Qualitative assessment
Some localization and classification results obtained on representative sample images are shown in Fig. 6. The proposed approach performs well for small objects, such as a cigarette butt seen from a height of three meters, on a clear background as well as on backgrounds crowded with other types of waste. The method is also able to detect multiple and overlapping waste items. It should be noted that some leaves and cigarettes were missed in some images, which could be due to our limited training set. Examples of a false positive detection and a missed detection are shown in Fig. 7.
6 Summary and Future Work
In this paper, a novel application for measuring the cleanliness of a place using a deep learning framework was proposed. The application localizes and classifies waste in RGB images taken by a camera facing the ground from a height of three meters. Since no waste dataset was available, we used our proposed acquisition setup to obtain images. We also developed an annotation tool to label objects in our dataset for 25 different types of waste. Experimental results on a real case scenario (a test set obtained with our proposed acquisition setup) show promising performance on varied backgrounds.
As future work, our dataset could be expanded with more images, especially for categories other than cigarette butts and leaves, in order to detect all existing classes of waste and to increase the accuracy of the current system.
The authors would like to thank Mr. Niels Michel, Manager of Dialog & Service at City of Zurich for sharing his in-depth experience on cleanliness measurement thus significantly contributing to this project.
The final publication is available at link.springer.com via http://dx.doi.org/10.1007/978-3-319-68345-4_18.
- P. Sermanet, D. Eigen, X. Zhang, et al. OverFeat: Integrated Recognition, Localization and Detection Using Convolutional Networks. ICLR, 2014.
- G. Mittal, K. B. Yagnik, M. Garg, and N. Krishnan. SpotGarbage: Smartphone App to Detect Garbage Using Deep Learning. UbiComp, 2016.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012.
- S. Sudha, M. Vidhyalakshmi, K. Pavithra, et al. An Automatic Classification Method for Environment. TIAR, 2016.
- Briñez L. Juan Carlos, Rengifo Alejandro, and Escobar Manuel. Automatic Waste Classification using Computer Vision as an Application in Colombian High Schools. LACNEM, 2015.
- G. Sakr, M. Mokbel, and A. Darwich. Comparing Deep Learning and Support Vector Machines for Autonomous Waste Sorting. IMCIT, 2016.
- R. Stewart and M. Andriluka. End-to-end People Detection in Crowded Scenes. CVPR, 2015.
- C. Szegedy, W. Liu, Y. Jia, et al. Going Deeper with Convolutions. CoRR, abs/1409.4842, 2014.
- M. Abadi, A. Agarwal, P. Barham, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Software available from tensorflow.org.
- J. Deng, W. Dong, R. Socher, et al. ImageNet: A Large-Scale Hierarchical Image Database. CVPR, 2009.
- Y. Jia, E. Shelhamer, et al. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv:1408.5093, 2014.
- M. Everingham, L. Van Gool, C. K. I. Williams, et al. The Pascal Visual Object Classes (VOC) Challenge. IJCV, 2010.