Playing Soccer without Colors in the SPL: A Convolutional Neural Network Approach

by   Francisco Leiva, et al.
Universidad de Chile

The goal of this paper is to propose a vision system for humanoid robotic soccer that does not use any color information. The main features of this system are: (i) real-time operation in the NAO robot, and (ii) the ability to robustly detect the ball, the robots, their orientations, the lines and key field features. Our ball detector, robot detector, and robot orientation detector obtain the highest reported detection rates. The proposed vision system is tested on an SPL field with several NAO robots under realistic and highly demanding conditions. The obtained results are: a robot detection rate of 94.90%, a ball detection rate of 97.10%, and a completely perceived orientation rate of 99.88% when the observed robot is static and 95.52% when the robot is moving.


1 Introduction

The perception of the environment is one of the key abilities for playing soccer; without an adequate vision system it is not possible to determine the positions of the field features or to self-localize. It is also impossible to determine the position of the ball and the other players, which is necessary in order to play properly. Given that the soccer environment is highly dynamic and has a predefined physical setup, most current vision systems use color information.

In the case of the SPL and the former Four-Legged League, the first generation of vision systems analyzed colored objects which were then segmented. Year by year, the restriction of having colored objects in the field was relaxed: (i) the number of colored beacons was first reduced and then beacons were no longer used, (ii) the goals were first colored and solid, then non-solid, and finally white, (iii) the ball used to be orange and, since 2016, has been black and white. However, most teams still use color information to detect field features (e.g., lines and their intersections), other players and the ball. Very recently, Convolutional Neural Networks (CNNs) have been used for detecting the robots and/or the ball (e.g., [3, 4, 5, 6]), but even in these cases, the CNN-based detectors require object proposals which are usually obtained using color information. Therefore, to the best of our knowledge, color-free vision systems have not been used in robotic soccer, at least not in the SPL. Some of the main reasons are: (i) the challenge of achieving real-time operation when using limited computational resources, (ii) the problem of training deep detectors without having very large databases, and (iii) the challenge of having fast and color-free object proposals.

The goal of this paper is to propose a color-free vision system for the SPL. The main features of this system are: (i) real-time operation in the NAO robot, and (ii) the ability to detect the ball, the robots, their orientations, the lines and key field features very robustly. In fact, our ball, robot, and robot orientation detectors obtain the highest reported detection rates.

2 Playing Soccer without Color Information

In this section we present the proposed vision system. Section 2.1 broadly explains the general characteristics and functioning of the vision framework, while Sections 2.2, 2.3, 2.4, 2.5 and 2.6 detail the operation of each of its main modules.

2.1 The General Framework

The main feature of our framework is that it manages to detect the ball, other players, their orientations, and key features of the field without using any color information: all the processing is performed on grayscale images. This is done by following a cascade methodology (inspired by [16]) that combines classical approaches widely used in pattern recognition and modern CNN-based classifiers.

The proposed vision framework is illustrated in Fig. 1. While the detection of lines and field features is done by using a set of rules and heuristics, both the detection of the ball and the other robots is done by means of object proposals and their subsequent classification using CNNs. This cascade approach reuses the information previously extracted from the image to benefit the subsequent processing modules.

Figure 1: Block diagram of the proposed vision system.

2.2 High Contrast Regions Detection

Since the robots and the ball used in the SPL possess high contrast, an effective approach to know where to search for them is to find high contrast regions in the images. To do this, the grayscale input images are scanned using windows of 16×16 pixels. Regions outside the field boundaries and within the body of the observer robot are discarded. The remaining windows are used to construct histograms of pixels, which are used to estimate thresholds for image binarization using Otsu’s method [7]. Windows with thresholds over a predefined value are considered important, since they may be close to or within another robot or the ball. Since the chosen threshold for the selection of windows could be restrictive and leave out image regions belonging to objects of interest, a morphological dilation operation is applied on the previously selected windows, which means that all the 16×16 pixel blocks adjacent to selected windows are also considered as high contrast regions.
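The window selection described above can be sketched as follows. This is a minimal illustration rather than the code running on the robot: the function names, the `min_threshold` value of 50, and the use of an 8-neighbourhood for the dilation are assumptions, and the discarding of regions outside the field or inside the observer's body is omitted.

```python
import numpy as np

WINDOW = 16  # window side in pixels, as stated in the text


def otsu_threshold(pixels: np.ndarray) -> int:
    """Return the Otsu threshold (0-255) for a block of 8-bit pixels."""
    hist = np.bincount(pixels.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)
    w0 = np.cumsum(hist)              # pixel count at or below each level
    w1 = hist.sum() - w0              # pixel count above each level
    sum0 = np.cumsum(hist * levels)
    valid = (w0 > 0) & (w1 > 0)       # both classes must be non-empty
    mu0 = np.where(valid, sum0 / np.maximum(w0, 1), 0.0)
    mu1 = np.where(valid, (sum0[-1] - sum0) / np.maximum(w1, 1), 0.0)
    between = np.where(valid, w0 * w1 * (mu0 - mu1) ** 2, 0.0)
    return int(np.argmax(between))    # uniform windows yield 0


def high_contrast_windows(gray: np.ndarray, min_threshold: int = 50) -> np.ndarray:
    """Mark windows whose Otsu threshold exceeds min_threshold (hypothetical value)."""
    rows, cols = gray.shape[0] // WINDOW, gray.shape[1] // WINDOW
    selected = np.zeros((rows, cols), dtype=bool)
    for r in range(rows):
        for c in range(cols):
            win = gray[r * WINDOW:(r + 1) * WINDOW, c * WINDOW:(c + 1) * WINDOW]
            selected[r, c] = otsu_threshold(win) > min_threshold
    return selected


def dilate(mask: np.ndarray) -> np.ndarray:
    """Binary dilation: keep a window if it or any of its 8 neighbours is selected."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dr in range(3):
        for dc in range(3):
            out |= padded[dr:dr + mask.shape[0], dc:dc + mask.shape[1]]
    return out
```

Calling `dilate(high_contrast_windows(gray))` on a grayscale frame returns a boolean grid with one entry per 16×16 window.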

2.3 Robot Detection

In [3] we presented a robot detector based on CNNs, capable of operating in real time. The system was based on the classification of color-based robot proposals (generated by B-Human’s robot perceptor [8]). This was modeled as a binary classification problem where proposals could be labeled as robots or non-robots. The system processed hypotheses in 1 ms with an average accuracy of 97%. Although this system achieved a very high performance, it possessed some major drawbacks. First, while the CNN classifier was very robust to noise and variations of the illumination, the same did not apply to the color-based robot proposal generator. Adverse environmental conditions could lead the algorithm to produce an excessive amount of object hypotheses, or none at all. The second drawback derived from the CNN inference time of 1 ms. While such a network is deployable on a NAO robot, it is much slower than alternative algorithms based on heuristics or shallow classifiers, and can be prohibitively slow when too many robot proposals are generated. In this paper we address both problems by changing the robot proposals generation approach, and by further reducing the inference times while maintaining the detection accuracy.

The proposal generation of this new framework does not use any color information: it uses vertical scan lines over all the image x-coordinates where high contrast regions were detected (see Section 2.2). The scan lines search for luminance changes in order to find the robots’ feet positions, and by performing geometric sanity checks, the proposal generator provides a set of bounding boxes which may contain the robots’ bodies. Most checks are similar to the rules used in the B-Human player detector [8], but applied to a grayscale image. This approach is more robust to changes in lighting since it relies on contrast information rather than heuristic color segmentation.

The obtained grayscale image regions are then fed to a CNN, which we call RobotNet, that classifies the proposals as robots or non-robots. This CNN is based on the architecture described in Section 3.1. Using grayscale image regions allows the system to perform in real time for a large number of robot proposals, since the reduction of input channels greatly reduces the CNN’s inference time.

2.4 Robot Orientation Determination

Inspired by [10], we propose an improved Vision-Based Orientation Detection for the SPL, which makes use of CNNs in order to achieve much better prediction accuracy than the original system. The general architecture of the module is presented in Fig. 2.

Figure 2: Robot orientation module pipeline.

This system uses the bounding boxes of the Detected Robots as inputs. Over these regions, the set of points that compose the robots’ lower silhouette [10] is calculated by the Lines Generator module, which extracts a region corresponding to the robot’s feet and analyses its Contrast-Normalized Sobel (CNS) image [11] by using vertical scan lines. A horizontal median filter is applied over each scan line pixel and its response is compared to a threshold. Pixels with a filter response below the threshold are considered part of the lower silhouette. Then, by iterating over each scan line, the subset of points that make up a closed convex region can be obtained by using Andrew’s convex hull algorithm [12]. For each consecutive pair of points of the convex set we calculate a line model in field coordinates. Each line model is then validated against the set of points of the lower silhouette, by using a voting methodology akin to the RANSAC algorithm [17]. The line with the highest number of votes is selected as the first line. Once this first line has been chosen, a second line may be generated by iterating over the remaining pairs of convex points. This line must comply with a series of conditions, such as a minimum and maximum length and approximate orthogonality to the first line, in order to be accepted as valid.
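The hull-then-vote step can be sketched as follows. The hull routine is the standard monotone chain formulation of Andrew's algorithm; the voting function, its `tol` distance and the use of plain 2D points instead of field coordinates are assumptions made for illustration, and the selection of the second, orthogonal line is omitted.

```python
import math


def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def point_line_distance(p, a, b):
    """Distance from p to the infinite line through a and b."""
    return abs(cross(a, b, p)) / math.hypot(b[0] - a[0], b[1] - a[1])


def best_line(hull, silhouette, tol=0.5):
    """Pick the hull edge supported by most silhouette points (RANSAC-like voting)."""
    best, best_votes = None, -1
    for i in range(len(hull)):
        a, b = hull[i], hull[(i + 1) % len(hull)]
        votes = sum(point_line_distance(p, a, b) <= tol for p in silhouette)
        if votes > best_votes:
            best, best_votes = (a, b), votes
    return best, best_votes
```

For an L-shaped silhouette such as a long foot edge plus a short perpendicular edge, the long edge collects the most votes and is returned as the first line.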

To estimate the orientation of the observed robot, the lines are classified to determine the robot’s direction. To do this, a region that includes the robot’s feet and legs is constructed around each line by the Line Regions Proposals Generator module. The regions are then classified by the Deep Classification module, which is based on CNNs whose structure is shown in Fig. 4. For each of the line’s regions, a CNN that measures its quality, OriBoostNet, is first applied. Regions with too much motion blur or that were incorrectly estimated are discarded to decrease the number of wrong orientation estimations. If a region is accepted, it is then fed to a second CNN, OriNet, that in turn classifies it as a side, front or back region. Afterwards, we perform a Consistency Check by requiring that no more than one region of each class exists. This further reduces the number of incorrect orientation estimations. Finally, the Orientation Determination is performed by combining the rotation given by the inverse tangent from two points belonging to the analyzed line, with the direction of the line determined by its class. The resulting orientation is added to a buffer that stores the last 11 measurements and a circular median filter is applied over it. In order to avoid invalid results, we consider the direction as valid only for a small period of time if no new samples are added to the buffer.
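The temporal filtering step can be sketched as follows. The text does not specify how the circular median is computed, so this sketch assumes a common definition: the buffered sample that minimizes the sum of arc distances to all other samples. The class and function names are hypothetical, and the validity timeout mentioned above is not modelled.

```python
import math
from collections import deque


def angular_distance(a: float, b: float) -> float:
    """Shortest arc between two angles, in radians."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def circular_median(angles):
    """Sample minimizing the total arc distance to all samples (assumed definition)."""
    return min(angles, key=lambda c: sum(angular_distance(c, a) for a in angles))


class OrientationFilter:
    """Keeps the most recent measurements and returns their circular median."""

    def __init__(self, size: int = 11):  # 11 measurements, as stated in the text
        self.buffer = deque(maxlen=size)

    def update(self, angle: float) -> float:
        self.buffer.append(angle)
        return circular_median(self.buffer)
```

Unlike an ordinary median, this handles the wrap-around at 0°/360°: for the samples 350°, 355°, 0°, 5° and an outlier at 180°, sorting the raw values would pick 180°, while the circular median stays within the cluster around 0°.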

2.5 Ball Detection

In the proposed vision framework, the ball detector follows the paradigm of proposal generation and subsequent classification. Fig. 3 shows the general architecture of this module.

Figure 3: Ball detection module pipeline.

Our ball proposal generator is inspired by the hypothesis provider developed by the HTWK team [13]. The main differences between both approaches are: (i) we only use grayscale images, (ii) we use a different method to estimate high contrast regions (see Section 2.2), and (iii) we use the robots’ detections in order to improve the generation of proposals.

The proposal generator uses the high contrast regions and the robots’ detections to provide the ball hypotheses. To accomplish this task, the generator performs a pixel-wise scan over all image windows that were detected by the high contrast detector and over image regions corresponding to the detected robots’ feet. During this stage, a Ball Radius Estimation is calculated for every analyzed position in image coordinates.

The next stage consists of Difference of Gaussians (DoG) Filtering. During this process, the DoG filters’ local responses are calculated for each scan coordinate. The support regions of the filters depend on the estimated ball radii, so we are effectively searching for blobs by means of the same approach used by the SIFT algorithm [15]. Additional DoG responses are calculated in front of the other robots’ feet, given that the ball may be in these regions. Finally, only the highest responses are used to construct a set of proposals, whose size depends on the estimated radius of the ball.
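A radius-dependent DoG response at a single scan coordinate can be sketched as follows. The relation between the ball radius and the Gaussian widths (`radius / sqrt(2)` and a ratio of 1.6), the three-sigma support and the function names are assumptions for illustration; the text does not give the values used on the robot.

```python
import numpy as np


def gaussian_kernel(sigma: float, half: int) -> np.ndarray:
    """Normalized 2D Gaussian sampled on a (2*half+1) x (2*half+1) grid."""
    ax = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def dog_response(gray: np.ndarray, x: int, y: int, radius: float, ratio: float = 1.6) -> float:
    """Local DoG response at (x, y) with support scaled by the expected ball radius."""
    sigma_small = radius / np.sqrt(2.0)      # assumed radius-to-sigma relation
    sigma_large = ratio * sigma_small
    half = int(np.ceil(3 * sigma_large))
    if x - half < 0 or y - half < 0 or x + half >= gray.shape[1] or y + half >= gray.shape[0]:
        return 0.0                           # support falls outside the image
    patch = gray[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)
    kernel = gaussian_kernel(sigma_small, half) - gaussian_kernel(sigma_large, half)
    return float(np.sum(patch * kernel))


def top_proposals(gray: np.ndarray, candidates, keep: int = 5):
    """Keep the candidates (x, y, radius) with the strongest absolute response."""
    ranked = sorted(candidates, key=lambda c: abs(dog_response(gray, *c)), reverse=True)
    return ranked[:keep]
```

Because both kernels are normalized, a flat region gives a response close to zero, while a blob whose size matches the expected radius gives a strong response at its center.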

To perform the ball detection, the proposals are fed to a cascade of two CNNs which classifies them as ball or non-ball. The first CNN, BoostBallNet, performs Deep Boosting to both limit the number of proposals to a maximum of five, and sort them based on their confidence. The second CNN, BallNet, performs Deep Classification, meaning that it processes the filtered hypotheses to detect the ball. Both networks are extremely fast and accurate, having execution times of 0.043 ms and 0.343 ms, and accuracy rates of 0.965 and 0.984, respectively.
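The two-stage cascade can be sketched as follows. The scoring callables stand in for BoostBallNet and BallNet, and the two confidence thresholds are hypothetical; only the limit of five proposals and the sorting by confidence come from the text.

```python
from typing import Callable, Optional, Sequence, TypeVar

P = TypeVar("P")  # a ball proposal, e.g. an image patch


def detect_ball(
    proposals: Sequence[P],
    boost_score: Callable[[P], float],   # stand-in for BoostBallNet
    ball_score: Callable[[P], float],    # stand-in for BallNet
    max_kept: int = 5,                   # limit stated in the text
    boost_min: float = 0.5,              # hypothetical threshold
    ball_min: float = 0.5,               # hypothetical threshold
) -> Optional[P]:
    """Return the first proposal accepted by both stages, or None."""
    scored = [(boost_score(p), p) for p in proposals]
    kept = sorted(
        (sp for sp in scored if sp[0] >= boost_min),
        key=lambda sp: sp[0],
        reverse=True,
    )[:max_kept]
    for _, proposal in kept:             # most confident proposals first
        if ball_score(proposal) >= ball_min:
            return proposal
    return None
```

The cheap first stage bounds the work of the second: however many proposals the generator emits, the slower classifier runs at most `max_kept` times per frame.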

2.6 Field Lines & Special Features Detection

The field lines and features detection follows the same algorithm released by B-Human [8]. The main difference with respect to the original approach is that in the proposed framework no color information is used. To do this, a set of vertical and horizontal scan lines is used, which record transitions from high-to-low and low-to-high luminance. This allows the detection of a set of points which are then fed to B-Human’s algorithm in order to associate them with lines and other features such as the middle circle, corners and intersections.
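Recording luminance transitions along one scan line can be sketched as follows. The `min_step` and `max_width` values and both function names are assumptions; the association of the resulting points with lines and features is done by the B-Human algorithm and is not reproduced here.

```python
def luminance_transitions(scan_line, min_step=40):
    """Indices where luminance jumps up (low-to-high) or down (high-to-low)."""
    rising, falling = [], []
    for i in range(1, len(scan_line)):
        step = int(scan_line[i]) - int(scan_line[i - 1])
        if step >= min_step:
            rising.append(i)
        elif step <= -min_step:
            falling.append(i)
    return rising, falling


def bright_segments(scan_line, min_step=40, max_width=8):
    """Pair each rising edge with the next falling edge; keep narrow bright runs."""
    rising, falling = luminance_transitions(scan_line, min_step)
    segments = []
    for start in rising:
        end = next((f for f in falling if f > start), None)
        if end is not None and end - start <= max_width:
            segments.append((start, end))   # bright pixels occupy [start, end)
    return segments
```

A narrow bright run between a rising and a falling edge is a candidate field-line point, while a wide bright region (for example, a robot body) is rejected by the width check.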

3 Design and Training of the CNN-based Detectors

In this section we focus on the design and training methodologies used to obtain highly performant CNN-based classifiers for our vision framework. Section 3.1 presents the network architectures of our classifiers and Section 3.2 describes the active learning-based algorithm that was developed to train them.

3.1 Base CNN

The proposed vision system makes use of several classifiers based on CNNs. While these CNNs are used for different purposes, their architectures remain similar across all the developed modules and are based on the work presented in [3], with slight variations to achieve higher speeds while maintaining accuracy. The main component of these architectures is the extended Fire module, which was developed in [3], inspired by the original Fire module proposed in [9]. This module concatenates the outputs produced by filters of different sizes in order to achieve increased accuracy while being computationally inexpensive. Small filters are used to extract local information across channels, while bigger filters obtain global information which is more spatially spread out. The information obtained at different scales is then combined into a single tensor and fed to the next layer. This allows the network to extract and work with both local and detailed features as well as broad, global features. Following this approach allows the training of performant models, but concatenating the information of several filters could be prohibitively expensive in terms of computational cost. To account for this, a 1×1 filter is placed at the beginning of each Fire module to compress the size of the representation that corresponds to the input of the subsequent larger and more expensive filters. In contrast with the previous miniSqueezeNet version, all newly developed CNNs have grayscale image inputs. Since most of the computational cost of the network corresponds to the first convolutional layers, this translates into sharply reduced inference times, and an accuracy loss of about 0.01. Another change is the use of leaky ReLU [18] instead of ReLU as activation functions. Previously, we used ReLU in most layers; however, this sometimes resulted in the “dying ReLU” problem while training (no gradients flow backward through the neurons). The use of leaky ReLU solves this, while incurring no accuracy loss. All CNNs were developed using the Darknet library [14]. A diagram of the new CNN structure is presented in Fig. 4.

Figure 4: Modified MiniSqueezeNet network structure.
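The saving obtained by placing a 1×1 filter before the larger filters can be illustrated by counting multiply-accumulate operations (MACs). The layer sizes below are hypothetical and are not the ones used in our networks; biases are ignored and the output is assumed to keep the input's spatial size.

```python
def conv_macs(height: int, width: int, c_in: int, c_out: int, kernel: int) -> int:
    """MACs of a convolution whose output keeps the input's spatial size."""
    return height * width * c_in * c_out * kernel * kernel


def direct_macs(height, width, c_in, c_out):
    """A single 3x3 convolution applied directly to the input."""
    return conv_macs(height, width, c_in, c_out, 3)


def fire_macs(height, width, c_in, squeeze, expand_1x1, expand_3x3):
    """1x1 squeeze followed by parallel 1x1 and 3x3 filters on the squeezed tensor."""
    return (
        conv_macs(height, width, c_in, squeeze, 1)
        + conv_macs(height, width, squeeze, expand_1x1, 1)
        + conv_macs(height, width, squeeze, expand_3x3, 3)
    )


# Hypothetical example: 12x12 feature map, 16 input channels, 16 output channels.
direct = direct_macs(12, 12, 16, 16)
fire = fire_macs(12, 12, 16, squeeze=4, expand_1x1=8, expand_3x3=8)
print(direct, fire, direct / fire)
```

With these illustrative sizes, the squeezed variant produces the same number of output channels with one sixth of the operations, because the expensive 3×3 filters see 4 input channels instead of 16.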

3.2 Active Learning Training Methodology

In order to train the classifiers, we implemented an active learning-based algorithm that automatically selects and pseudo-annotates unlabeled data.

We start by initializing the parameters of the CNNs by training them using publicly available datasets (e.g., SPQR datasets [4]). However, if we directly use the obtained CNN weights in our vision framework, the classifiers behave poorly because there is a distribution mismatch between the samples present in the public datasets, and the ones that our proposal generators output.

To address this problem, the classifiers must be trained using the same kind of samples that would actually reach the networks during games. To accomplish this, the vision system is deployed on the NAO robot and data is collected using the proposal generators. Each proposal is then stored in the robot’s memory with a label annotated by the CNN. To get uncorrelated data, we set a constraint for the object hypotheses to be saved: for the robot proposals, data is acquired periodically in accordance with a predefined time step; for the ball proposals, samples can only be saved if no other proposals with the same position and estimated radius were previously collected. The next stage consists of actively checking the data saved by the observer robot, and manually annotating the samples that were incorrectly labeled. We then aggregate this data to the original dataset and re-train the models.

The above process is repeated until the CNNs reach a high performance. By doing this, we are progressively aggregating correctly labeled samples to provide enough training data for robust feature learning, but also aggregating hard examples which the models fail to correctly infer, to actively shape the decision boundary of the classifiers.

After we obtain proficient models by following the described methodology, we further enhance them by switching to a bootstrap procedure. To do this, we add confidence-based constraints to collect new training data in environments where the object we want to detect is absent. For instance, if we are getting false positives from the ball detector, we would set the NAO robot to collect data from proposals with high confidence in environments where no balls are present. The samples collected would then be used to re-train the ball classifiers.

This active learning-bootstrap procedure results in a dramatic improvement in the performance of the classifiers after only a few iterations, and also allows the fine tuning of the CNN parameters by means of data aggregation when an abrupt domain change occurs. Since the inputs to our models have relatively low dimensionality, the space used in the NAO memory during the data collection process is very small; for instance, 1,000 robot proposal samples occupy about 3 MB. This procedure, combined with the automatic selection and labeling of the new samples, makes the training process extremely time-efficient.
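The collect, correct, aggregate and re-train loop can be sketched as follows. All names are hypothetical, the `review` callable stands in for the manual correction step, and a one-dimensional threshold classifier replaces the CNNs so that the example stays self-contained.

```python
def run_active_learning(dataset, batches, train, predict, review, target_error_rate=0.0):
    """Pseudo-label each batch, correct it, aggregate it and re-train."""
    data = list(dataset)
    model = train(data)
    errors_per_round = []
    for batch in batches:                      # one batch per robot deployment
        pseudo = [(x, predict(model, x)) for x in batch]
        corrected = [(x, review(x, y)) for x, y in pseudo]
        errors = sum(1 for (_, y), (_, fixed) in zip(pseudo, corrected) if y != fixed)
        errors_per_round.append(errors)
        data.extend(corrected)                 # data aggregation
        model = train(data)
        if errors / max(len(batch), 1) <= target_error_rate:
            break                              # pseudo-labels were good enough
    return model, errors_per_round


# Toy stand-ins: the "model" is a threshold and the true boundary is 0.7.
def train_threshold(data):
    largest_negative = max(x for x, label in data if not label)
    smallest_positive = min(x for x, label in data if label)
    return (largest_negative + smallest_positive) / 2.0


def predict_threshold(threshold, x):
    return x > threshold


def oracle(x, _predicted):
    return x > 0.7


model, errors = run_active_learning(
    [(0.0, False), (1.0, True)],
    [[0.1, 0.6, 0.9], [0.65, 0.72, 0.8], [0.3, 0.71, 0.95]],
    train_threshold, predict_threshold, oracle,
)
print(model, errors)
```

In this toy run the mislabeled samples are exactly those near the decision boundary, so each correction moves the threshold closer to the true one, which mirrors the role of hard examples described above.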

4 Results

4.1 CNN Classification

Table 1 shows the model complexity (number of CNN parameters), average inference time (on the NAO robot), and accuracy for each developed CNN.

Results show that the classifiers achieve very high performance while maintaining low inference times, which shows that they are suitable for real-time applications such as playing soccer. This also validates the effectiveness of the proposed methodology for the design and training of the classifiers. Finally, it shows that the use of color information is not necessary to detect robots or balls when using expressive classifiers such as CNNs. In fact, the CNN used in the robot detector achieves an accuracy rate similar to that of the model proposed in [3], while being approximately 2.75 times faster.

| Model | RobotNet | BoostBallNet | BallNet | OriBoostNet | OriNet |
|---|---|---|---|---|---|
| Input size | 24×24×1 | 12×12×1 | 26×26×1 | 12×12×1 | 24×24×1 |
| N° of parameters | 884 | 125 | 444 | 246 | 657 |
| Inference time (ms) | 0.382 | 0.043 | 0.343 | 0.059 | 0.329 |
| Accuracy | 0.969 | 0.965 | 0.984 | 0.962 | 0.984 |

Table 1: Performance of the developed CNNs.

4.2 Robots, Ball and Field Features Detection Systems

For the robot and ball detectors, results are divided into proposal generation and module performance. We replicated typical and challenging game conditions in order to acquire about 600 frames processed by an observer robot. Several lighting conditions were imposed while collecting these frames in order to test the robustness and reliability of our modules. The analysis of these frames allowed the extraction of empirical results on the performance of the proposal generators and each detector, which are shown in Table 2.

Results show that the robots and ball proposals generators achieve high recall rates, while producing an average number of hypotheses per frame that can be processed in real time by the subsequent classifiers. Given the recall rate of the ball proposals module and the percentage of true positives of the boosting stage, the overall detection module has a very high detection rate. In fact, our ball detector outperforms B-Human’s implementation, which achieves an overall accuracy rate of 0.697 when testing it under the same conditions.

Finally, the field lines and features detector was tested by comparing the difference between the real and the estimated robot pose. The estimation was obtained by using the field lines and features detected by our module. By using this approach we calculated a mean squared error of 40.07 mm, which indicates that our detector is very accurate and reliable.

| Module | Robot Detector | Ball Detector |
|---|---|---|
| Avg. Proposals per Frame | 3.05 | 10.3 |
| Proposals Recall | 0.972 | 0.993 |
| Overall Accuracy | 0.949 | 0.971 |

Table 2: Performance of the robots and ball detection systems.

4.3 Robot Orientation Determination

In Fig. 5 we present a comparison between the B-Human algorithm proposed in [10], our base orientation determination system, and its output after applying circular median filtering. For this experiment, the observer and the observed robot are static and placed at a distance of 120 cm from each other. For each measurement the observed robot was rotated around its axis. As in [10], we define a false positive as any estimation that deviates from the ground truth by more than a tolerance angle. The orientation is classified as semi perceived when the rotation can be determined but the facing direction is unknown. The class not perceived corresponds to any frame where the orientation could not be calculated, while an orientation estimation is perceived if it does not deviate from the ground-truth orientation by more than the tolerance angle.

Figure 5: Results obtained for the first experiment. Graph shows a performance comparison between raw (UCh) and filtered (UChF) estimations for our orientation detector and a B-Human system replication (BH).

In Fig. 6 we show the results obtained when testing our system in a dynamic environment, where the observed robot is moving at a speed of 12.0 cm/s, while the observer remains static. The observed robot is rotated around its axis for each measurement. We define the same classes for the orientation estimations as in the static experiment, but with a different tolerance angle.

Figure 6: Dynamic experiment results. Graph shows a performance comparison between raw (UCh) and filtered (UChF) estimations for our orientation detector.

As shown in Fig. 5 and Fig. 6, the proposed method outperforms the original B-Human implementation. The orientation estimation is completely perceived 99.88% of the time in static conditions, and 95.52% of the time in the dynamic experiment. It is clear that the algorithm proposed is better at determining the facing direction of the observed robots. This results in an increased number of completely perceived orientations while sharply decreasing the number of semi perceived orientations. Also, noise filtering techniques such as the median filter and RANSAC algorithm, combined with the utilization of a CNN contribute to lowering the number of false positive estimations. Finally, the integration of the circular median filter further reduces the number of false positives.

4.4 Profiling

Table 3 shows the maximum and average execution times for the different modules of the proposed vision framework when deployed on the NAO v5 platform. The results obtained show that the proposed color-free vision system is deployable on platforms with limited processing capacity (such as the NAO robot). In addition, they prove the importance of the dimensionality reduction of CNN-based classifier inputs, and how this design decision impacts the performance of the system from a time-efficiency point of view.

| Module | Max. (ms) | Avg. (ms) |
|---|---|---|
| High Contrast Regions Detector | 2.755 | 1.478 |
| Field Lines & Features Detector | 2.909 | 1.300 |
| Robot Proposals Generator | 2.692 | 1.083 |
| Robot Detector | 2.417 | 0.939 |
| Robot Orientation Detector | 4.537 | 1.366 |
| Ball Proposals Generator | 2.506 | 1.132 |
| Ball Detector | 6.959 | 2.452 |

Table 3: Vision framework profiling.
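The per-module times in Table 3 can be added up to compare the whole pipeline against a frame budget. The sketch below assumes that the modules run sequentially once per frame and that frames arrive at 30 frames per second; both are assumptions made for illustration. Summing the per-module maxima gives a pessimistic bound, since the worst cases need not occur in the same frame.

```python
# (max ms, avg ms) per module, copied from Table 3.
TIMES_MS = {
    "High Contrast Regions Detector": (2.755, 1.478),
    "Field Lines & Features Detector": (2.909, 1.300),
    "Robot Proposals Generator": (2.692, 1.083),
    "Robot Detector": (2.417, 0.939),
    "Robot Orientation Detector": (4.537, 1.366),
    "Ball Proposals Generator": (2.506, 1.132),
    "Ball Detector": (6.959, 2.452),
}

FRAME_RATE_HZ = 30.0                       # assumed camera frame rate
frame_budget_ms = 1000.0 / FRAME_RATE_HZ   # about 33.3 ms per frame

total_max_ms = sum(max_ms for max_ms, _ in TIMES_MS.values())
total_avg_ms = sum(avg_ms for _, avg_ms in TIMES_MS.values())

print(f"average total: {total_avg_ms:.3f} ms")
print(f"worst-case total: {total_max_ms:.3f} ms")
print(f"frame budget: {frame_budget_ms:.1f} ms")
```

Under these assumptions the average total is about 9.75 ms and the summed worst case about 24.78 ms, both below the assumed budget of roughly 33.3 ms per frame.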

5 Conclusions

This paper presents a new vision framework that does not use any color information. This is a novel approach for vision systems designed for the SPL, achieving very high performance while being computationally inexpensive.

The proposed vision system introduces four new modules: a redesigned robot detector, a visual robot orientation estimator, a brand new ball detector, and finally, a color-free field lines and features detector. All modules developed for this paper are able to run simultaneously in real-time when deployed on the NAO robot, and greatly outperform our previous perception system.

Furthermore, we demonstrate that CNN-based classifiers are a useful tool to solve most of the perception requirements of the SPL, and that they generally translate into better overall performance of the corresponding module when coupled with good region proposal algorithms and proper design and training techniques.


This work was partially funded by FONDECYT Project 1161500.


  • [1] Veloso, M., Lenser, S., Vail, D., Roth, M., Stroupe, A., Chernova, S.: CMPack-02: CMU’s Legged Robot Soccer Team. Carnegie Mellon University Report, 2002
  • [2] Zagal, J., Ruiz-del-Solar, J., Guerrero, P. and Palma R. (2004). Evolving Visual Object Recognition for Legged Robots. Lecture Notes in Computer Science 3020 (RoboCup 2003), Springer, 181-191
  • [3] Cruz, N., Lobos-Tsunekawa, K., Ruiz-del-Solar, J.: Using Convolutional Neural Networks in Robots with Limited Computational Resources: Detecting NAO Robots while Playing Soccer. RoboCup Int. Symposium 2017
  • [4] Albani, D., Youssef, A., Suriani, V., Nardi, D., Bloisi, D.D.: A Deep Learning Approach for Object Recognition with NAO Soccer Robots. RoboCup Int. Symposium 2016
  • [5] Speck, D., Barros, P., Weber, C., Wermter, S.: Ball Localization for Robocup Soccer using Convolutional Neural Networks. RoboCup Int. Symposium 2016
  • [6] Menashe, J., Kelle, J., Genter, K. Hanna, J., Liebman, E., Narvekar, S., Zhang, R., Stone, P.: Fast and Precise Black and White Ball Detection for RoboCup Soccer. RoboCup Int. Symposium 2017
  • [7] Otsu, N.: A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.
  • [8] Röfer, T., Laue, T., Bülter, Y., Krause, D., Kuball, J., Mühlenbrock, A., Poppinga, B., Prinzler, M., Post, L., Roehrig, E., Schröder, R., Thielke, F.: B-Human Team Report and Code Release 2017, 2017. Only available online:
  • [9] Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 1MB model size. CoRR (2016),
  • [10] Muhlenbrock, A., Laue, T.: Vision-based Orientation Detection of Humanoid Soccer Robots. RoboCup Int. Symposium 2017
  • [11] Röfer, T., Müller, J., Frese, U.: Grab a mug – object detection and grasp motion planning with the Nao robot. IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS 2012), Osaka, Japan, 2012.
  • [12] Andrew, A.M.: Another efficient algorithm for convex hulls in two dimensions. Information Processing Letters 9(5), 216–219. 1979.
  • [13] Nao-Team HTWK: Team Research Report. 2018. Available online:
  • [14] Redmon, J.: Darknet: Open source neural networks in C. (2013-2016)
  • [15] Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), 91–110, 2004.
  • [16] Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on. Vol. 1. IEEE, 2001.
  • [17] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  • [18] Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. Proc. ICML, vol. 30, no. 1, 2013.