Parkour Spot ID: Feature Matching in Satellite and Street view images using Deep Learning

01/02/2022
by João Morais, et al.
Arizona State University

How can one find places that are not indexed by Google Maps? We propose an intuitive method and framework to locate places based on their distinctive spatial features. The method uses satellite and street view images in machine vision approaches to classify locations. If we can classify a single location, we only need to repeat the classification for non-overlapping locations across the area of interest. We assess the proposed system by searching for Parkour spots on the campus of Arizona State University. The results are very satisfactory: the system found more than 25 new Parkour spots, with a rate of true positives above 60%.



I Introduction

I-A Motivation

Nowadays, the prevalent method of finding information is to “Google it”. Should we want to find locations, Google Maps is our go-to place. However, not all locations are indexed in Google Maps. Indeed, why would Google index all the street lamps in New York, all the park benches in Paris, or all the bridges in Amsterdam? Nonetheless, it is conceivable that a photographer is looking for a lamp that offers the desired color composition and background, or that a film crew is looking for the ideal bridge on which to perform a stunt. In this paper, we provide a method for finding places that are not indexed by Google Maps.

When locations have distinctive spatial features, a straightforward approach to classifying them is to use images [21]. Such images may come from a multitude of sources, as long as they can be associated with a GPS location [8]. Google APIs allow precisely that: through the Google Maps Static API [6] and the Google Street View API [7], we can gather satellite and street view images, respectively, to aid classification.
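As a minimal sketch of how such imagery can be gathered, the following Python snippet queries the two APIs named above. The API key placeholder, the zoom level, and the image size are our assumptions for illustration, not values prescribed at this point in the paper.

import requests

GOOGLE_API_KEY = "YOUR_KEY_HERE"  # hypothetical placeholder

def fetch_satellite(lat, lon, zoom=21, size="640x640"):
    # Download one satellite tile from the Google Maps Static API.
    url = "https://maps.googleapis.com/maps/api/staticmap"
    params = {
        "center": f"{lat},{lon}",
        "zoom": zoom,
        "size": size,
        "maptype": "satellite",
        "key": GOOGLE_API_KEY,
    }
    return requests.get(url, params=params).content  # raw PNG bytes

def fetch_street_view(lat, lon, heading, size="640x640", fov=90):
    # Download one image from the Google Street View Static API.
    url = "https://maps.googleapis.com/maps/api/streetview"
    params = {
        "location": f"{lat},{lon}",
        "heading": heading,  # 0/90/180/270 give four orthogonal views
        "fov": fov,
        "size": size,
        "key": GOOGLE_API_KEY,
    }
    return requests.get(url, params=params).content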

We consider the task of finding outdoor places to practice Parkour and FreeRunning (detailed in Section II). Parkour involves jumping, climbing, running, or any other form of movement, typically in urban environments [16]. As a community that is only now creating its second generation of practitioners, the Parkour community is young and fast-growing [22]. As it settles down and organizes, tools and commonly shared resources emerge, and the information most commonly shared between communities is training locations [16]. One of the aims of this work is to help members of the Parkour community systematically find and share training spots within their region of interest.

I-B Prior Work

Geolocation-aided machine vision has been studied before. Several studies [21, 8] were able to place confidence regions on the surface of the Earth based on the pixels of a single image. The authors of [1] further leverage a hierarchical database to improve geolocation. However, the objective of all these approaches is to use an image to find a location. We aim for the opposite: to find locations based not on an image but on a generic set of spatial features, and have the system return possible locations within a region of interest.

I-C Contribution

This work adds to the extensive literature on machine vision applications. It presents no novelty in the methods it employs, but rather in how known methods are utilized to help a growing community. More specifically, in this manuscript we present:

  • A scalable method for feature matching in Google Maps;

  • A real-world-tested and systematic approach towards finding Parkour spots in a region of interest.

Furthermore, we test the system on the Arizona State University (ASU) campus, one of the largest in the United States, and we verify that the method presented in this work can methodically populate a database of Parkour spots. The method and the resulting database are identified by the logo in Figure 1.

Fig. 1: The Parkour database logo. On GitHub [5].

This work is organized as follows. First, in Section II we formulate the problem of geolocation based on feature matching. Section III reviews the current literature and presents the proposed solution, detailing the machine learning models employed. In Section IV we present the results, both of the individual parts of the system and of the system as a whole. Lastly, Section V summarizes the conclusions, and in Section VI we leave some remarks on how the system can be improved and possible ways of building upon this project.

II Problem Formulation

II-A The geo-classification problem

The problem we aim to solve is a classification task. Given a location in geographical coordinates (latitude and longitude), $\ell = (\phi, \lambda)$, output a probability $p(\ell) \in [0,1]$ that translates the likelihood of $\ell$ being a Parkour spot. For the sake of simplicity, we assume the only accessible information about $\ell$ are satellite images and street view images. Each image belongs to $[0,1]^{W \times H \times 3}$: it contains the three values referring to the components of red, green, and blue (RGB), each between 0 and 1, for every pixel. If we denote by $\mathcal{S}_{\text{sat}}$ and $\mathcal{S}_{\text{st}}$ the sets of all satellite and street view images of a certain location, then we may further define the functions that extract knowledge from each set, respectively $f_{\text{sat}}$ and $f_{\text{st}}$. Thus, we may write:

$$p(\ell) = g\big(f_{\text{sat}}(\mathcal{S}_{\text{sat}}),\, f_{\text{st}}(\mathcal{S}_{\text{st}})\big) \tag{1}$$

where $g$ is the function that weighs and combines the outputs of $f_{\text{sat}}$ and $f_{\text{st}}$, resulting in the probability $p(\ell)$.
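To make Eq. (1) concrete, the following minimal Python sketch shows one way the combination step could look. The function names and the simple weighted average used for $g$ are our illustrative assumptions; the paper only states that $g$ weighs and combines the two outputs.

def classify_location(sat_images, street_images, f_sat, f_st, w_sat=0.5):
    # f_sat / f_st extract knowledge from the satellite and street view sets.
    p_sat = f_sat(sat_images)
    p_st = f_st(street_images)
    # g(., .): a weighted average is an assumed, illustrative choice.
    return w_sat * p_sat + (1.0 - w_sat) * p_st  # probability p(l)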

II-B Features of Parkour Spots

Specific to our application, we must define what features constitute a Parkour spot, because the capability of identifying such features needs to be encoded in $f_{\text{sat}}$ and $f_{\text{st}}$. Parkour, or l’art du déplacement, as the first practitioners call it [14], is quite loosely defined. While this contributes to its glamour, it complicates objective definitions. Thus, although the notion is highly subjective, we attempt to define the quality of a Parkour spot. Definition: the suitability of a location for Parkour is proportional to how easily the practitioner can come up with ideas for Parkour moves and sequences in that location. Therefore, we may conclude that the more architectural features there are to jump, climb, roll, crawl, or otherwise interact with, the higher the likelihood that the location is suitable for Parkour.

III Proposed Solution

Solving the single-coordinate classification problem presented in Section II enables us to check individual coordinates for Parkour spots. Therefore, to find multiple Parkour spots, one simply needs to run the same algorithm systematically over other coordinates. In this section, we present our proposed solution to the classification of single coordinates. Our solution consists of two computer vision tasks, one for top-view and one for street view images, and both tasks rely on object detection methods.
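The outer sweep over coordinates might look like the following Python sketch. The classify_coordinate callable, the 0.5 decision threshold, and the grid spacing are placeholders; the paper specifies the spacing only later (roughly 40 meters between non-overlapping satellite tiles).

import numpy as np

def scan_region(lat_min, lat_max, lon_min, lon_max, step_deg, classify_coordinate):
    # step_deg ~ 0.00036 degrees of latitude corresponds to roughly 40 m.
    spots = []
    for lat in np.arange(lat_min, lat_max, step_deg):
        for lon in np.arange(lon_min, lon_max, step_deg):
            if classify_coordinate(lat, lon) > 0.5:  # assumed threshold
                spots.append((lat, lon))
    return spots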

III-A Object Detection

Typical object detection tasks have two phases: i) object localization and ii) image classification [17]. While object localization involves using a bounding box to locate the exact position of the object in the image, image classification is the process of correctly classifying the object within the bounding box [17]. Figure 2 illustrates the difference. Instance and semantic segmentation go a level deeper. In semantic segmentation, each pixel is assigned a class label, hence it is a pixel-level classification [19]. Instance segmentation is similar, except that multiple objects of the same class are treated separately as individual entities [19].

For the top-view model, we opt for an image classification method. It is hard for humans to delineate and annotate useful Parkour features in satellite images, so we leave this complexity for the network to learn. For the street view model, we opt for instance segmentation because it provides a more detailed classification, whereas object detection would only provide bounding boxes. Bounding boxes become less practical and less robust when the shape of the object varies considerably and is random in nature [3]. We choose instance over semantic segmentation because the number of objects matters for the quality of a Parkour spot.

Fig. 2: Differentiation of object detection tasks in computer vision. [19]

III-B Satellite Imaging Model

The model that processes satellite images performs binary classification. In a satellite image, Parkour locations may contain visible features such as stairs, railings, walls, and other elevations that resemble obstacle courses or may be suitable for Parkour. By providing only images labeled 0 or 1, our goal is to have the model identify these patterns through convolutions.

The coordinates of known Parkour locations were crowdsourced from Parkour communities worldwide. We gathered over 1300 coordinates from cities such as Paris (France), London (United Kingdom), Lisbon (Portugal), and Phoenix (Arizona, United States). For the top view, the coordinates were queried from the Google Maps Static API [6] with a fixed magnification of 21, resulting in high-definition images of 640 by 640 pixels. For the negative examples required to train the top-view model, we uniformly sampled cities, gathering 400 random coordinates from each of 6 random locations, resulting in 2400 negative samples. Figure 3 shows some positive and negative examples.

Fig. 3: Positive and negative satellite image examples used for training.

Since our classification problem requires all the detail satellite images can provide, it is essential to maintain a relatively high resolution. However, larger images imply larger memory requirements during training. We therefore downscaled the images from 640 by 640 pixels to 512 by 512. Then, we divided each satellite image into four quadrants to reduce the chance of false positives by limiting the information in each input; this approach is represented in Figure 4. After filtering, the positive training set had 3117 samples, and the negative set had 13,231 samples. For training, 3117 random negative samples were selected to maintain class balance.

Fig. 4: Splitting a positive sample into quadrants.
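A minimal sketch of this preprocessing step is shown below: downscale a 640 by 640 tile to 512 by 512 and split it into four 256 by 256 quadrants. The use of PIL and NumPy is our choice of tooling; the paper does not name the libraries used.

import numpy as np
from PIL import Image

def to_quadrants(png_path):
    # Downscale to 512x512 and return four 256x256 RGB quadrants in [0, 1].
    img = Image.open(png_path).convert("RGB").resize((512, 512))
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return [arr[r:r + 256, c:c + 256] for r in (0, 256) for c in (0, 256)]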

During training, multiple models, such as VGG16 [18], ResNet50 [10], and InceptionV3 [20], were tested. We experimented with:

  • Training partial sections of the model, including (but not limited to) exclusively the convolutional layers;

  • Forcing a data imbalance towards the positive samples so that the model learns positive features better.

Ultimately, we designed our own model based on the above-mentioned architectures. The hyperparameters used to train the CNN are listed in Table I; a training sketch follows the table.

Parameter               Value
Input size              (256, 256, 3)
Epochs                  100
Batch size              32
Initial learning rate   0.001
Optimizer               Adam [11]

TABLE I: Satellite/top-view model neural network hyper-parameters
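As a minimal Keras sketch of a top-view binary classifier using the Table I hyper-parameters, the following assumes a ResNet50 backbone with a small classification head; the paper's final architecture is a custom blend of the architectures it experimented with, so this is illustrative rather than the authors' exact model.

import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

# Pretrained backbone; the choice of ResNet50 here is an assumption.
base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", input_shape=(256, 256, 3))
base.trainable = True  # the authors experimented with training partial sections

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # P(quadrant contains Parkour features)
])
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])

# x_train: (N, 256, 256, 3) quadrants, y_train: 0/1 labels (assumed variable names)
# model.fit(x_train, y_train, batch_size=32, epochs=100, validation_split=0.1)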

III-C Street View Model

One standard method for instance segmentation is the Mask Region-based CNN (Mask R-CNN) [9]; we use the implementation in [4]. We applied transfer learning to a model pre-trained on the COCO dataset [12], retraining only the convolutional layers. The COCO dataset consists of 80 distinct categories, but we optimized the model to differentiate only three (short walls, railings, and stairs) from the background (also considered a class).

The data collected from the community included street view images as well as photos taken by practitioners. Out of the 1300 coordinates verified to contain Parkour spots, each image was manually evaluated, and the dataset was narrowed down to 249 images for training and 51 images for validation. The images were filtered based on:

  • How clear and understandable the images were to the naked eye;

  • Keeping only daytime images, since nighttime images were very few and could act as noise that hampers the prediction capability of the model (Google street view images are all captured in the daytime);

  • Discarding images that were blurry or too complicated to annotate.

After data filtering, the VGG Image Annotator (VIA) [2] was used to manually annotate each image. An example of an annotated image is presented in Figure 5.

Fig. 5: Example of annotations performed in street view image.
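A hedged sketch of loading such VIA polygon annotations is shown below. The region attribute name "label" holding the class (short wall / railing / stairs) is an assumption about how the project was annotated; adjust it to the attribute actually used in the VIA project file.

import json

def load_via_annotations(via_json_path):
    project = json.load(open(via_json_path))
    samples = []
    for entry in project.values():           # one entry per annotated image
        regions = entry["regions"]
        if isinstance(regions, dict):        # older VIA exports use a dict
            regions = list(regions.values())
        polygons, classes = [], []
        for region in regions:
            shape = region["shape_attributes"]        # polygon vertices
            polygons.append((shape["all_points_x"], shape["all_points_y"]))
            classes.append(region["region_attributes"].get("label", "unknown"))
        samples.append({"filename": entry["filename"],
                        "polygons": polygons, "classes": classes})
    return samples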

The Mask R-CNN works with any input up to 1024 by 1024 pixels, but we used inputs of size 640 by 640, because that is the maximum size supported by the Google Street View API.

Contrary to typical model training, we did not aim solely to minimize the loss during training, since mathematically defining a loss for finding a Parkour spot is hard. Instead, we manually tuned parameters and assessed the output images of the test set. When the loss was sufficiently low and the output matched our intuition for Parkour spots, neither over- nor under-identifying objects, we stopped training. Model hyperparameters and implementation-specific parameters are listed in Table II; for the meaning of the implementation-specific parameters, refer to [9, 4]. A configuration sketch follows the table.

Parameter                  Value
Input size                 (640, 640, 3)
Epochs                     100
Optimizer                  Adam
NUM_CLASSES                4
STEPS_PER_EPOCH            15
VALIDATION_STEPS           1
BATCH_SIZE                 1
DETECTION_MIN_CONFIDENCE   0.75

TABLE II: Mask R-CNN specific parameters and training hyper-parameters
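The following is a sketch of transfer learning with the Matterport Mask R-CNN implementation [4] using the Table II settings. The weights path, dataset objects, and the choice of which layer groups to retrain are placeholders; the paper states only that the convolutional layers were retrained.

from mrcnn.config import Config
from mrcnn import model as modellib

class ParkourConfig(Config):
    NAME = "parkour"
    NUM_CLASSES = 1 + 3            # background + {short wall, railing, stairs}
    IMAGES_PER_GPU = 1             # batch size 1
    STEPS_PER_EPOCH = 15
    VALIDATION_STEPS = 1
    DETECTION_MIN_CONFIDENCE = 0.75
    IMAGE_MIN_DIM = 640            # street view images are 640x640
    IMAGE_MAX_DIM = 640

config = ParkourConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")
# Load COCO weights, excluding the heads that depend on the class count.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# dataset_train / dataset_val are mrcnn.utils.Dataset subclasses (placeholders);
# which layer groups to retrain ("heads", "all", ...) is a training choice.
# model.train(dataset_train, dataset_val,
#             learning_rate=config.LEARNING_RATE, epochs=100, layers="heads")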

IV Results

In this section, we first analyze the performance of each system component, i.e. satellite and street view models. Subsequently, both models are integrated as described in Section III. The performance is assessed using real, unlabeled data.

IV-A Satellite Model

The satellite (top-view) model was trained on thousands of positive and negative labeled examples. As a result of such training, the binary classification model yielded a classification accuracy of 80% on our test set. The confusion matrix in Figure 6 reflects the performance of our model.

Fig. 6: Confusion matrix of satellite images model.

Looking now at unlabeled data, Figure 7 shows the performance of the model on a grid of 196 images spanning 100 meters from the central coordinate. The model detects most of the small elevations, walls, railings, stairs, and similar features that are suitable for Parkour. However, the model also classifies pointed roofs, solar panel arrays, and HVAC (heating, ventilation, and air conditioning) units as positives. A solution is to include similar samples in the negative training set.

Fig. 7: Results from testing the top-view model.

IV-B Street View Model

To assess the street view model in realistic conditions, we used many unlabeled examples. Overall, the model works consistently well; Figure 8 shows an example. Although the model might at times mistake railings for walls, it still identifies those elements as useful for Parkour, which is what matters most for our application.

Fig. 8: Example of street view model output in unlabeled data.

IV-C ASU Campus Results

To test the end-to-end framework, we used the proposed system to identify spots on the ASU campus. We used a center coordinate and an area of interest defined as the square inscribed in a circle with a radius of 650 meters. We then uniformly sampled the region to obtain non-overlapping satellite images (roughly 40 meters apart) and acquired four 90-degree street view images at each sampled coordinate.

The quality of a Parkour spot is determined by counting the number of class hits across the four street view directions. If there are more than $\tau$ Parkour-usable objects (i.e., short walls, stairs, or rails), we mark the coordinate as containing a Parkour spot. The number of positives can be controlled with the threshold $\tau$; a sketch of this decision rule is given after Table III. Resorting solely to street view provided the highest reliability and interpretability. Table III shows some statistics from this study.

Center coordinate           (33.4184, -111.9328)
Radius of interest          650 meters
Number of coordinates       1155
Number of API requests      5775
Total cost of API requests  $34.65
Number of positives         46
Number of true positives    28

TABLE III: Statistics from large-scale testing at the ASU campus
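The decision rule described above can be sketched as follows. The detect_objects callable stands in for the Mask R-CNN detection step, and the default value of the threshold $\tau$ is an assumption; the paper does not report the value used.

def is_parkour_spot(lat, lon, detect_objects, tau=3):
    # Count Parkour-usable objects across the four 90-degree views and
    # compare against the threshold tau.
    usable = {"short wall", "railing", "stairs"}
    hits = 0
    for heading in (0, 90, 180, 270):
        detections = detect_objects(lat, lon, heading)  # list of class names
        hits += sum(1 for cls in detections if cls in usable)
    return hits > tau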

Almost 50% of the positive results are false positives. Figure 9 shows four cases where the system was fooled. First, unbeknownst to us, the Google Street View API sometimes returns indoor images; since tables, benches, counters, and walls are identified as useful for Parkour, indoor locations are wrongly ranked high. Outdoor locations with furniture are also positively classified, as are pools, owing to their fair share of sun loungers and railings. Lastly, several street view requests returned a view considerably above street level, leading the system to identify spots suited to 5-story-high giants rather than humans. We estimate that filtering problematic inputs from the Google API can reduce the percentage of false positives to below 20%; one possible filtering approach is sketched after Figure 9.

Fig. 9: Examples of false positives in the ASU campus test. Labels indicate the number of identified [small walls, rails, stairs].
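One possible mitigation for the indoor-imagery false positives, consistent with the filtering suggested above though not described in the paper, is to use the Street View Static API's source=outdoor parameter and its metadata endpoint to skip indoor or missing panoramas before requesting images. A hedged sketch:

import requests

def street_view_params(lat, lon, heading, key):
    # source=outdoor restricts searches to outdoor imagery collections.
    return {"location": f"{lat},{lon}", "heading": heading, "size": "640x640",
            "source": "outdoor", "key": key}

def panorama_exists(lat, lon, key):
    # The metadata endpoint reports whether outdoor imagery is available
    # before a billable image request is made.
    meta_url = "https://maps.googleapis.com/maps/api/streetview/metadata"
    meta = requests.get(meta_url, params={"location": f"{lat},{lon}",
                                          "source": "outdoor",
                                          "key": key}).json()
    return meta.get("status") == "OK"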

V Conclusions

In this work, we presented a systematic method for finding Parkour spots, the first of its kind in the Parkour community. We defined the general feature matching problem. Using binary classification of satellite images and instance segmentation of street view images, and connecting both approaches to maximize the information derived for each coordinate, we determined the likelihood of a location being a Parkour spot. We then executed our methodology on our campus and personally verified the quality of the results, achieving a precision of over 60%. Finally, we analyzed the most prominent false positives and identified fixes to further improve the system's performance.

VI Future Work

We presented a scalable framework whose performance improves as its parts are enhanced. One future direction is to evolve the proposed system by improving the accuracy, robustness, or inference speed of the satellite image (top-view) model or the street view model. In terms of inference speed, the bottleneck is the street view model; high-speed object detection approaches, such as the Single Shot Detector (SSD) [13] and YOLO [15], can improve classification speed while possibly improving performance. The integration of both models can also be improved to reduce the required number of API requests, i.e., the cost and time of operating in unknown terrain. Furthermore, another interesting approach is to exploit feature extraction explicitly, e.g., by engineering a solution with edge detection. Finally, it would be interesting to study faster ways of encoding feature knowledge, since performing all data annotations takes several days.

References

  • [1] S. Cao and N. Snavely (2013) Graph-based discriminative learning for location recognition. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 700–707. External Links: Document Cited by: §I-B.
  • [2] A. Dutta and A. Zisserman (2019) The VGG image annotator (VIA). CoRR abs/1904.10699. External Links: 1904.10699 Cited by: §III-C.
  • [3] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik (2013) Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR abs/1311.2524. External Links: 1311.2524 Cited by: §III-A.
  • [4] GitHub Mask R-CNN for Object Detection and Segmentation. Note: https://github.com/matterport/Mask_RCNN[Online; Dec 2021] Cited by: §III-C, §III-C.
  • [5] GitHub Parkour Spot ID. Note: https://github.com/jmoraispk/ParkourSpotID[Online; Dec 2021] Cited by: Fig. 1.
  • [6] Google Google Maps Static API. Note: https://developers.google.com/maps/documentation/maps-static/overview[Online; Dec 2021] Cited by: §I-A, §III-B.
  • [7] Google Google Maps Street View Static API. Note: https://developers.google.com/maps/documentation/streetview/overview[Online; Dec 2021] Cited by: §I-A.
  • [8] J. Hays and A. A. Efros (2008) IM2GPS: estimating geographic information from a single image. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. External Links: Document Cited by: §I-A, §I-B.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. CoRR abs/1703.06870. External Links: 1703.06870 Cited by: §III-C, §III-C.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Link Cited by: §III-B.
  • [11] D. P. Kingma and J. Ba (2017) Adam: a method for stochastic optimization. External Links: 1412.6980 Cited by: TABLE I.
  • [12] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. CoRR abs/1405.0312. Cited by: §III-C.
  • [13] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2015) SSD: single shot multibox detector. CoRR abs/1512.02325. External Links: Link, 1512.02325 Cited by: §VI.
  • [14] O. Mould (2009) Parkour, the city, the event. Environment and Planning D: Society and Space 27 (4), pp. 738–750. External Links: Document Cited by: §II-B.
  • [15] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi (2015) You only look once: unified, real-time object detection. CoRR abs/1506.02640. External Links: 1506.02640 Cited by: §VI.
  • [16] S. J. Saville (2008) Playing with fear: parkour and the mobility of emotion. Social & Cultural Geography 9 (8), pp. 891–914. External Links: Document Cited by: §I-A.
  • [17] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. Lecun (2013-12) OverFeat: integrated recognition, localization and detection using convolutional networks. International Conference on Learning Representations (ICLR) (Banff), pp. . Cited by: §III-A.
  • [18] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §III-B.
  • [19] Stanford CS231n: Convolutional Neural Networks for Visual Recognition, 2017. Note: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf [Online; Dec 2021] Cited by: Fig. 2, §III-A.
  • [20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2015) Rethinking the inception architecture for computer vision. CoRR abs/1512.00567. Cited by: §III-B.
  • [21] T. Weyand, I. Kostrikov, and J. Philbin (2016) PlaNet - photo geolocation with convolutional neural networks. CoRR abs/1602.05314. External Links: Link, 1602.05314 Cited by: §I-A, §I-B.
  • [22] Wikipedia Parkour. Note: https://en.wikipedia.org/wiki/Parkour[Online; Dec 2021] Cited by: §I-A.