Graffiti is already part of the landscape of most megacities. It can be categorized as either artistic drawing or tagging. Whereas graffiti drawing is an artistic expression and, as such, requires talent and practice, graffiti tagging is most often an unauthorized act through which people convey messages or display their names (see Figure 1). Whether graffiti is art has been extensively discussed [1, 2]. In 2017 a Brooklyn based company was fined 6.7 million dollars for whitewashing murals containing famous graffiti (https://www.nytimes.com/2018/02/12/nyregion/5pointz-graffiti-judgment.html). In this study we focus on tagging and refer to it simply as graffiti.
To our knowledge, there is no graffiti map of São Paulo, and creating one by manual inspection would demand great effort. In this work we propose the creation of a graffiti map based on the segmentation of graffiti regions in street view images.
Our contribution is an automated way to quantify the level of graffiti in a location. Street view images are systematically acquired, computer vision algorithms identify and quantify the amount of graffiti in each picture, and finally a new metric, the graffiti level of a region, is devised and computed. We perform a case study in a highly urbanized city: São Paulo, Brazil.
II Related Work
Several works aim to combat acts of graffiti vandalism. Web-based frameworks [4, 5, 6, 7] rely on community participation to identify recently degraded locations, while other works [8, 9, 10] try to detect the act of drawing itself. Other works retrieve similar graffiti from a reference database by using connected components and keypoint matching, in an attempt to associate graffiti with gangs. Alternatively, authorship can be identified for a given target image by computing a metric based on the symbols it contains, annotating it manually, and performing keypoint matching between the image and known gang graffiti.
As a sign of the relevance of the topic, the European Union has a dedicated project to analyze the main actors involved in graffiti, including writers, citizens, law enforcement and public administration. The project also includes consultation with stakeholders and the establishment of a web-based platform for discussing and sharing ideas about the topic from different perspectives.
Semantic segmentation is a high-level computer vision task that aims to partition an image into regions of known classes. It is more complex than image classification and object detection because it requires pixel-wise classification of the image. Research on semantic segmentation is very active and recent works achieve impressive results [14, 15, 16, 17, 18]. A related task is instance segmentation, where the objective is also to identify individual instances: in contrast to ordinary segmentation, the method must be able to identify the boundary between two adjacent instances. Some previous works [19, 20] performed this task by preceding the object detection stage with a segmentation stage. Mask-RCNN, in turn, performs classification and segment proposal in parallel; it relies on the Faster-RCNN architecture with an additional branch for instance segmentation.
Google Maps provides public access to images captured by cars driven down the streets. Images are obtained from different geographical locations and in different views and formats. Many works [23, 24] have already used this type of imagery for urban analyses. Street view images have been used to compare architectural elements across different cities, to study the feasibility of auditing neighbourhood environments remotely instead of in person, and to assess urban greenery.
III Materials and Methods
In order to confidently estimate the level of graffiti in a geographical region we propose a metric, the graffiti level, obtained by identifying graffiti in street view imagery and computing the area it occupies.
The region of interest is first defined and the images are acquired. Due to the limited coverage of the available pictures and to computational constraints, a sample of the full region is considered. There are a number of ways of performing sampling, but they can be classified into random sampling methods and systematic sampling methods. The former remove selection bias by choosing the sample points at random, although they do not guarantee good coverage. The latter, in contrast, ensure coverage at the cost of introducing bias.
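The two families of sampling can be contrasted with a minimal sketch. It assumes, for illustration only, that the region of interest is an axis-aligned bounding box expressed in metres of some projected coordinate system; the function names `random_sample` and `systematic_sample` are hypothetical and are not taken from the paper.

```python
import math
import random


def random_sample(bbox, n, seed=0):
    """Draw n points uniformly at random inside bbox.

    bbox = (min_x, min_y, max_x, max_y), assumed to be in metres.
    No selection bias, but nothing prevents clusters or empty areas.
    """
    min_x, min_y, max_x, max_y = bbox
    rng = random.Random(seed)
    return [(rng.uniform(min_x, max_x), rng.uniform(min_y, max_y))
            for _ in range(n)]


def systematic_sample(bbox, spacing):
    """Place points on a regular grid anchored at the lower-left corner.

    Coverage is even by construction, but the choice of origin and
    spacing fully determines which places are visited.
    """
    min_x, min_y, max_x, max_y = bbox
    nx = int(math.floor((max_x - min_x) / spacing)) + 1
    ny = int(math.floor((max_y - min_y) / spacing)) + 1
    return [(min_x + i * spacing, min_y + j * spacing)
            for j in range(ny) for i in range(nx)]
```

In practice the bounding box would still have to be clipped to the city boundary and mapped to geographic coordinates before querying an image service; both steps are left out of this sketch.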
Once the geographical sample is defined, ideally a full view should be considered for each geographical location. A single panorama view can be used, but the distortions present in panoramic photos must then be taken into account. Alternatively, complementary views for each location may be considered (see Figure 2).
III-B Graffiti recognition
Given the objective of quantifying the level of graffiti in a given location, a simple and direct approach would be to classify each image as containing graffiti or not. Such a characterization, however, gives only coarse, binary information about each picture, and a more granular value is desirable. We therefore define the graffiti level of a geographical location in terms of the total picture area containing graffiti. This measure can be affected by the projection of the scene onto the image and by the distance from the camera to the surface containing graffiti. We assume that different regions, given a minimum extent, have comparable distributions of projections and of distances to the walls; under this assumption, the measure may be used to compare different geographical locations. Since we represent each location by a set of views, we define its graffiti level as the sum of the areas of the regions containing graffiti in each view (see (1)). We can then aggregate the graffiti level over a geographical region by computing the average of the graffiti levels of the locations in our sample (see (2)).
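The two definitions can be written compactly as follows. The notation is an assumption made for illustration and is not necessarily that of equations (1) and (2): $V_p$ denotes the set of views of location $p$, $A(v)$ the total area of the regions detected as graffiti in view $v$, and $p_1, \dots, p_n$ the $n$ sampled locations that fall inside region $R$.

```latex
\begin{align*}
  G(p) &= \sum_{v \in V_p} A(v), \\
  \bar{G}(R) &= \frac{1}{n} \sum_{i=1}^{n} G(p_i), \qquad p_i \in R .
\end{align*}
```

Under this reading, $G(p)$ grows with both the number and the apparent size of the tags visible from $p$, which is why the comparability assumption on projections and wall distances is needed.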
We opted for the Mask-RCNN method for our segmentation task given the high performance it reports on important benchmarks [28, 29]. During training, the method minimizes a multi-task loss comprising a classification loss, a bounding-box loss and a mask loss; the first two are adopted from prior work on object detection, and the mask loss is defined as the average binary cross-entropy loss.
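As an illustration of the mask term, the sketch below computes an average per-pixel binary cross-entropy for a single predicted mask and sums the three loss terms without weights, as in the Mask R-CNN paper. It is a simplified, framework-free sketch: masks are assumed to be plain nested lists, the helper names are hypothetical, and the per-class mask selection done by the actual method is omitted.

```python
import math


def mask_bce_loss(pred_probs, target_mask, eps=1e-7):
    """Average binary cross-entropy over all pixels of one mask.

    pred_probs  : rows of predicted graffiti probabilities in [0, 1]
    target_mask : rows of ground-truth labels (0 or 1), same shape
    eps         : clipping constant that keeps log() finite
    """
    total, count = 0.0, 0
    for prob_row, target_row in zip(pred_probs, target_mask):
        for p, t in zip(prob_row, target_row):
            p = min(max(p, eps), 1.0 - eps)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


def multi_task_loss(cls_loss, box_loss, mask_loss):
    """Unweighted sum of the classification, box and mask terms."""
    return cls_loss + box_loss + mask_loss
```

A prediction of 0.5 everywhere yields a mask loss of ln 2 regardless of the target, which is a convenient sanity check when inspecting training curves.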
Since there is no dataset publicly available, we created a dataset with manually annotated images which were used to train our model.
We initially collected a pilot sample of 10,000 street view images of São Paulo City, from which a subset was manually chosen and the regions containing graffiti tags were manually identified. A total of 632 images were manually annotated and used to train a model with a 101-layer ResNet backbone pre-trained on the COCO dataset. We used a learning rate of 0.001 and a momentum of 0.9 and trained for 80 epochs. We used the model obtained at epoch 30, given its lowest validation error (see Figure 3). The final model was evaluated in terms of average precision, and Figure 4 presents a sample of the detections evaluated. Processing time was measured per image on a GeForce GTX 1050.
Figure 8 (a) shows the heterogeneous coverage of the street view service across the city. The two bottommost districts had little coverage at the time of our acquisition, owing to the predominantly rural and unpopulated nature of these regions, and were therefore not considered in this study.
We used four views for each geographical location, separated by a fixed angular spacing. Notice in Figure 2 how the scene elements of the second and third images overlap, which indicates full coverage at each geographical location. The majority of the images considered are from 2017, as can be seen in Table I.
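A small sketch of how camera headings for such views could be generated is given below. It assumes that the views are evenly spaced over a full turn, which is an assumption of this example and is not stated explicitly in the text; the function names and the `start` offset are likewise hypothetical.

```python
def view_headings(n_views=4, start=0.0):
    """Return n_views compass headings in degrees, evenly spaced over 360."""
    step = 360.0 / n_views
    return [(start + i * step) % 360.0 for i in range(n_views)]


def covers_full_turn(n_views, fov_degrees):
    """True when evenly spaced views with this horizontal field of view
    leave no angular gap between neighbouring views."""
    return fov_degrees >= 360.0 / n_views
```

When the field of view exceeds the angular step, consecutive images share scene elements, which is the kind of overlap that indicates full coverage around a location.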
We created a grid over the spatial extent of the city with 134,624 points and a vertical and horizontal spacing of 102 m. After eliminating images from third-party providers and non-mapped regions (see Figure 8 (a)), we obtained a geographical coverage of 68,752 geographical points and 275,339 images overall.
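With the sampled points and their images in hand, the graffiti level can be computed per location and then averaged per region. The sketch below is one possible reading of the metric, not the authors' implementation. It assumes that every image has the same resolution (so pixel counts are comparable), that each detection is a binary mask given as nested lists, and that a lookup from location to district is available; all names are hypothetical.

```python
from collections import defaultdict


def mask_area(mask):
    """Number of pixels flagged as graffiti in one binary mask."""
    return sum(sum(row) for row in mask)


def location_level(view_masks):
    """Graffiti level of one location: graffiti area summed over its views.

    view_masks holds one entry per view; each entry is the list of
    instance masks detected in that view (possibly empty). Overlapping
    instance masks in the same view would be counted twice here.
    """
    return sum(mask_area(m) for masks in view_masks for m in masks)


def region_levels(location_levels, location_region):
    """Average the per-location levels of the sampled points in each region."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for loc, level in location_levels.items():
        region = location_region[loc]
        totals[region] += level
        counts[region] += 1
    return {region: totals[region] / counts[region] for region in totals}
```

Locations without any detection still enter the average with a level of zero, so a region is not ranked higher merely because it has fewer sampled points.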
Figure 5 shows that, except for a small region in the interior of the map, the regions with the highest levels of graffiti are in the periphery of the city, while the regions with the lowest levels are in the business center. The bottommost parts were not considered given the coverage of the service utilized.
The Human Development Index (HDI) is a development measure of a region that considers life span, income and education. Figure 8 (b) is an HDI heat-map by district using the data released by the city hall in 2007. Notice that the regions with the lowest levels of graffiti in Figure 5 correspond to the regions with the highest HDIs in Figure 8 (b).
This work extends previous work in an attempt to automatically map regions containing graffiti tags in the city. We systematically collect street view imagery, identify the graffiti tags in each image, and propose a metric for the graffiti level of a geographical region. We performed a case study in São Paulo and show that the results are in accordance with what is expected given the HDI indicators.
The proposed approach has limitations. One of them is the need for sampling, due to computing constraints: small regions with highly concentrated tagging may not contribute properly to the metric of the region. Ongoing steps include the use and combination of higher-performing vision algorithms [35, 36] and the use of semi-supervised approaches to enlarge the annotated dataset. Future steps include denser sampling, a joint analysis with other geographical regions, and the use of new datasets that include the same view at different times.
The authors thank FAPESP (grants #2014/24918-0 and #2015/22308-2), CNPq, CAPES and NAP eScience - PRP - USP.
-  C. McAuliffe, “Graffiti or street art? negotiating the moral geographies of the creative city,” Journal of urban affairs, vol. 34, no. 2, pp. 189–206, 2012.
-  A. Young, Street art, public city: Law, crime and the urban imagination. Routledge, 2013.
-  Google, “Google Maps,” https://www.google.com/maps, 2005, [Last accessed April-2018].
-  Automated Regional Justice Information System (ARJIS), “Graffiti tracker,” http://graffititracker.net/, 2006, [Last accessed April-2018].
-  594 Graffiti, LLC, “Tracking and Automated Graffiti Reporting System (TAGRS),” http://www.594graffiti.com, 2009, [Last accessed April-2018].
-  B. Archer, “Graffiti Tracking system,” http://www.graffititrackingsystem.com/, 2005, [Last accessed April-2018].
-  V. Ltd., “VandalTrack,” https://www.vandaltrak.com.au/, 2008, [Last accessed April-2018].
-  D. Angiati, G. Gera, S. Piva, and C. S. Regazzoni, “A novel method for graffiti detection using change detection algorithm,” in Advanced Video and Signal Based Surveillance, 2005. AVSS 2005. IEEE Conference on. IEEE, 2005, pp. 242–246.
-  L. Di Stefano, F. Tombari, A. Lanza, S. Mattoccia, and S. Monti, “Graffiti detection using two views,” in The Eighth International Workshop on Visual Surveillance-VS2008, 2008.
-  F. Tombari, L. Di Stefano, S. Mattoccia, and A. Zanetti, “Graffiti detection using a time-of-flight camera,” in International Conference on Advanced Concepts for Intelligent Vision Systems. Springer, 2008, pp. 645–654.
-  C. Yang, P. C. Wong, W. Ribarsky, and J. Fan, “Efficient graffiti image retrieval,” in Proceedings of the 2nd ACM International Conference on Multimedia Retrieval. ACM, 2012, p. 36.
-  W. Tong, J.-E. Lee, R. Jin, and A. K. Jain, “Gang and moniker identification by graffiti matching,” in Proceedings of the 3rd international ACM workshop on Multimedia in forensics and intelligence. ACM, 2011, pp. 1–6.
-  S. Gmbh, “GRAFFOLUTION Awareness and Prevention Solutions against Graffiti Vandalism in Public Areas and Transport - Final report summary,” 2016.
-  P. Arbeláez, B. Hariharan, C. Gu, S. Gupta, L. Bourdev, and J. Malik, “Semantic segmentation using regions and parts,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3378–3385.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv preprint arXiv:1511.00561, 2015.
-  B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, 2017.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, 2015, pp. 3431–3440.
-  P. O. Pinheiro, R. Collobert, and P. Dollár, “Learning to segment object candidates,” in Advances in Neural Information Processing Systems, 2015, pp. 1990–1998.
-  J. Dai, K. He, Y. Li, S. Ren, and J. Sun, “Instance-sensitive fully convolutional networks,” in European Conference on Computer Vision. Springer, 2016, pp. 534–549.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 2980–2988.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
-  A. G. Rundle, M. D. Bader, C. A. Richards, K. M. Neckerman, and J. O. Teitler, “Using google street view to audit neighborhood environments,” American Journal of Preventive Medicine, vol. 40, no. 1, pp. 94–100, 2011.
-  A. Torii, M. Havlena, and T. Pajdla, “From google street view to 3d city models,” in Computer vision workshops (ICCV Workshops), 2009 IEEE 12th international conference on. IEEE, 2009, pp. 2188–2195.
-  C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros, “What makes paris look like paris?” ACM Transactions on Graphics, vol. 31, no. 4, 2012.
-  X. Li, C. Zhang, W. Li, R. Ricard, Q. Meng, and W. Zhang, “Assessing street-level urban greenery using google street view and a modified green view index,” Urban Forestry & Urban Greening, vol. 14, no. 3, pp. 675–685, 2015.
-  S. V. Stehman, “Basic probability sampling designs for thematic map accuracy assessment,” International Journal of Remote Sensing, vol. 20, no. 12, pp. 2423–2441, 1999.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
-  Human Development Report Office (HDRO), “Human Development Report: Concept and Measurement of Human Development.” United Nations Development Programme, Tech. Rep. ISBN 0-19-506480-1, 1990.
-  S. P. prefecture, “Atlas do trabalho de desenvolvimento da cidade de são paulo 2012.” http://atlasmunicipal.prefeitura.sp.gov.br/, 2012, [Last accessed Nov-2017].
-  E. K. Tokuda, R. M. César Júnior, and C. Silva, “Identificação automática de pichação a partir de imagens urbanas,” in Conference on Graphics, Patterns and Images - SIBGRAPI. SBC, 2018.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” IEEE transactions on pattern analysis and machine intelligence, 2018.
-  E. K. Tokuda, H. Pedrini, and A. Rocha, “Computer generated images vs. digital photographs: A synergetic feature and classifier combination approach,” Journal of Visual Communication and Image Representation, vol. 24, no. 8, pp. 1276–1292, 2013.
-  E. K. Tokuda, G. B. A. Ferreira, C. Silva, and R. M. Cesar-Jr, “A novel semi-supervised detection approach with weak annotation,” in Image Analysis and Interpretation, 2018. SSIAI 2018. IEEE Southwest Symposium on. IEEE, 2018.
-  E. K. Tokuda, Y. Lockerman, G. B. A. Ferreira, E. Sorrelgreen, D. Boyle, R. M. Cesar-Jr., and C. T. Silva, “A new approach for pedestrian density estimation using moving sensors and computer vision,” arXiv preprint arXiv, 2018.