RoboCup, introduced by Kitano et al., serves as a central problem in the understanding and development of Artificial Intelligence. The challenge aims at developing a team of autonomous robots capable of playing soccer in a dynamic environment. It requires the development of collective intelligence and the ability to interact with the surroundings for effective control and decision making. Over the years, several humanoid robots [7, 20, 8] have participated in the challenge.
One of the main hurdles identified within the tournament is perceiving the soccer ball. Efficient detection of the soccer ball relies on how well the vision system tracks the ball. For instance, consider cases where the ball disappears or is occluded from the robot's point of view for a few frames. In such situations the current frame alone is not useful; however, the history of frames can help in making a proper move. In this work, we propose an approach which effectively utilizes the history of ball movement and improves the task of ball detection. We first utilize the encoder-decoder architecture of the SweatyNet model and train it to detect the ball in single images. Later we use it as a part of our proposed layers and learn from temporal sequences of images, thereby developing a more robust detection system. In our approach we make use of three spatio-temporal models: TCN, ConvLSTM, and ConvGRU.
2 Related Work
Numerous works address soccer ball detection. Before RoboCup 2015 the ball was orange, and many teams relied on color information. Since RoboCup 2015 the ball is no longer color coded, which forced teams to use more sophisticated learning-based approaches such as a HOG cascade classifier. In recent years, convolutional approaches, with their innate ability to capture equivariance and hierarchical features in images, have emerged as a favorite choice for the task. One approach uses a CNN to localize the soccer ball by predicting its x and y coordinates. A more recent work uses proposal generators to estimate candidate ball regions and then a CNN to classify those regions. Another study compares various CNN architectures, namely LeNet, SqueezeNet, and GoogLeNet, for ball detection by humanoid robots. Inspired by earlier work, a fully convolutional network (FCN) has also been proposed that is robust and has a low inference time, which is an essential requirement for the soccer challenge. As the name suggests, an FCN is composed entirely of convolution layers, which allows it to learn a path from pixels in the first layers to pixels in the deeper layers and to produce an output in the spatial domain. This makes the FCN architecture a natural choice for pixel-wise problems like object localization or image segmentation. Other authors use geometric properties of the scene to create a graph-structured FCN. A further work modifies the U-Net architecture by removing the skip connections from encoder to decoder and using depthwise separable convolutions. This improves inference time and makes the model well suited to real-time systems.
Existing work uses only the current frame to detect the soccer ball. We hypothesize that the history of frames (a coherent sequence of previous frames) can help the model make a better prediction, especially in cases where the ball disappears or is missed for a few frames. To support our hypothesis, we extend our experiments to temporal sequences of images. A crucial element of processing continuous temporal sequences is to encode consensual information in the spatial and temporal domains simultaneously. Several methods extract spatio-temporal video features, such as the widely used Dense Trajectories, where densely sampled points are tracked based on information from the optical flow field and describe local information along the temporal and spatial axes. The Two-Stream Inflated 3D ConvNet (I3D) expands convolution filters into 3D, letting the network learn seamless video features in both domains. For predicting object movement in video, Farazi et al. proposed a model based on a frequency domain representation. One of the recent methods for modeling temporal data is the temporal convolutional network (TCN). The critical advantage of the TCN is the representation gained by applying a hierarchy of dilated causal convolution layers on the temporal domain, which successfully captures long-range dependencies. It also provides a faster inference time than recurrent networks, which makes it suitable for real-time applications.
Additionally, there are successful end-to-end recurrent networks which can leverage correlations within sequential data [26, 10, 5]. ConvLSTM and ConvGRU are recurrent architectures which incorporate convolutions to determine the future state of a cell from its local neighbors instead of the entire input.
In this work, we propose a CNN architecture which utilizes sequences of ball movements in order to improve the task of soccer ball detection in challenging scenarios.
3 Detection Models
3.1 Single Image Detection
In this paper, the task of soccer ball detection is formulated as a binary pixel-wise classification problem: for a given image, the feed-forward model predicts the heatmap corresponding to the soccer ball. In this part we utilize three feed-forward models, namely SweatyNet-1, SweatyNet-2, and SweatyNet-3, as proposed in prior work.
All three networks are based on an encoder-decoder design. SweatyNet-1 consists of five blocks in the encoder part and two blocks in the decoder part. In the encoder part, the first block includes one layer, and the number of filters is doubled after every block. In the decoder part, both blocks contain three layers. Each layer comprises a convolutional operator followed by batch normalization and ReLU as the non-linearity. In addition, bilinear upsampling is used twice: after the last block of the encoder and after the first block of the decoder. Skip connections are added between layers of the encoder and the decoder to provide high-resolution details of the input to the decoder. Similar approaches have been successfully used in SegNet, V-Net, and U-Net.
All convolutional filters across the layers have the same fixed size. The encoder part includes four max-pooling layers, one after each of the first four blocks. The number of filters in the first layer is eight, and it is doubled after every max-pooling layer. In the decoder, the number of filters is reduced by a factor of two before every upsampling layer.
The other two variants of SweatyNet are designed to reduce the number of parameters and speed up inference. In SweatyNet-2, the number of parameters is reduced by removing the first layer in each of the last five blocks of the encoder. In SweatyNet-3, the number of channels is decreased by changing the convolution size in the first layer of each of the last five encoder blocks and of both decoder blocks.
3.2 Detection in a Sequence
Temporal extensions capture spatio-temporal interdependence in the sequence and make it possible to predict the movement of the ball, correctly capturing its size, direction, and speed. In our experiments, we utilize temporal series of images to further improve the task of soccer ball detection.
Our approach, illustrated in Fig. 1, proposes a temporal layer and a learnable weight which make use of the history of sequences of fixed length to predict the probability map of the soccer ball. We use a feed-forward layer, TCN, and compare it with the recurrent layers ConvLSTM and ConvGRU. The three approaches differ in the type of connections formed in the network.
We train our model to learn heatmaps of a ball based on the sequence of frames representing the history of its movement. More precisely, given the heatmaps of the most recent frames up to the timestamp of the current frame, the output of the network is the sequence of heatmaps for the timestamps that follow. The number of input frames is the history length, and the number of output frames is the length of the predicted sequence.
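The pairing of history frames with predicted frames can be illustrated with a small windowing helper. This is a minimal sketch, not the authors' data loader: the function name `make_windows`, the array layout `(T, H, W)`, and the parameter names `history` and `horizon` are assumptions introduced here for illustration.

```python
import numpy as np

def make_windows(heatmaps, history, horizon):
    """Split a (T, H, W) heatmap sequence into (input, target) pairs.

    Each input holds `history` consecutive frames; the matching target
    holds the `horizon` frames that immediately follow (hypothetical names).
    """
    n_windows = len(heatmaps) - history - horizon + 1
    if n_windows <= 0:
        raise ValueError("sequence is shorter than history + horizon")
    inputs, targets = [], []
    for start in range(n_windows):
        split = start + history
        inputs.append(heatmaps[start:split])
        targets.append(heatmaps[split:split + horizon])
    return np.stack(inputs), np.stack(targets)

# Example with a random stand-in sequence of 60 frames.
sequence = np.random.rand(60, 8, 8)
x, y = make_windows(sequence, history=20, horizon=1)
print(x.shape, y.shape)
```

With a history of 20 frames and a prediction length of 1, a sequence of 60 frames yields 40 overlapping training pairs.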
The ConvLSTM and ConvGRU layers are stacks of several convolutional LSTM and GRU cells, respectively, which allows for capturing spatial as well as temporal correlations. Each ConvLSTM cell acts based on the input, forget, and output gates, while the core information is stored in the memory cell controlled by these gates. Each ConvGRU cell adaptively captures time dependencies over various time ranges based on content and reset gates. The convolutional structure avoids the use of redundant, non-local spatial data and results in lower computational cost. Fig. 3 depicts the structure of a convolutional LSTM cell, where the input is a set of flattened 1D arrays of image features obtained with the convolutional layers. Likewise, a convolutional GRU cell differs from a standard GRU cell only in how the input is passed to it.
Unlike the two recurrent models, where gated units control hidden states, TCN hidden states are intrinsically temporal. This is attributed to the dilated causal convolutions used in TCN, which generate temporally structured states without explicitly modeling the connection between them. Thus, TCN captures long-term temporal dependencies in a simple feed-forward network architecture. This feature further provides the advantage of a faster inference time. Fig. 3 shows dilated causal convolutions for dilations 1, 2, 4, and 8. For our work, we replicated the original TCN-ED structure with repeated blocks of dilated convolution layers and normalized ReLU as activation functions.
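To make the notion of a dilated causal convolution concrete, the following sketch applies a 1D version along the time axis and stacks it with dilations 1, 2, 4, and 8. It is an illustration under stated assumptions rather than the TCN-ED implementation: the kernel size of 2, the all-ones kernel, and the helper names are chosen here for clarity only.

```python
import numpy as np

def dilated_causal_conv1d(x, kernel, dilation):
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Samples before t = 0 are treated as zero, so y[t] never depends on
    inputs later than t (causality).
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for t in range(len(x)):
        for k, weight in enumerate(kernel):
            src = t - k * dilation
            if src >= 0:
                y[t] += weight * x[src]
    return y

def receptive_field(kernel_size, dilations):
    """Number of past time steps (including the current one) seen by the stack."""
    return 1 + (kernel_size - 1) * sum(dilations)

# Push a unit impulse through four stacked layers (assumed kernel size 2).
signal = np.zeros(16)
signal[0] = 1.0
for d in (1, 2, 4, 8):
    signal = dilated_causal_conv1d(signal, [1.0, 1.0], d)

print(signal)                               # non-zero over the receptive field
print(receptive_field(2, (1, 2, 4, 8)))     # 16
```

Doubling the dilation at every layer makes the receptive field grow exponentially with depth, which is how a purely feed-forward stack can cover a long history.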
For sequential data, it is challenging to train a network from scratch because of the limited size of the dataset and the difficulties in collecting the real data. Besides, the training process requires more memory to store a batch of sequences, resulting in a choice of smaller batch size. To address this problem, we use transfer learning and finetune the weights of our model on the sequences of synthetic data. We use SweatyNet-1 as the feature extractor and finetune it with the temporal layers.
For the input to the temporal layers (TCN, ConvLSTM, and ConvGRU), we also take advantage of high-resolution spatial information by concatenating the outputs of two earlier blocks of SweatyNet-1. To speed up the training process and propagate spatial information, we apply a convolution on the combined features. Moreover, we take an element-wise product of the output of this convolution with a learnable weight and add it to the output of SweatyNet. This combination serves as the input to the temporal layers. The weight serves as a gate which learns to control how much high-resolution information is transferred from the early layers of SweatyNet, and it helps the network detect the soccer ball with subpixel-level accuracy.
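The gating of high-resolution features can be sketched in a few lines. This is a simplified illustration, not the authors' layer: it assumes a pointwise (1x1) convolution, a scalar gate, and early features that already share the spatial resolution of the SweatyNet output. The names `conv1x1` and `gated_fusion` are hypothetical.

```python
import numpy as np

def conv1x1(features, weights):
    """Pointwise convolution (assumed 1x1): mix channels at every pixel.

    features: (C_in, H, W); weights: (C_out, C_in).
    """
    return np.einsum("oc,chw->ohw", weights, features)

def gated_fusion(sweaty_out, early_a, early_b, weights, gate):
    """Add gated early-layer features to the SweatyNet output.

    `gate` stands in for the learnable weight; here it is a plain scalar.
    """
    combined = np.concatenate([early_a, early_b], axis=0)  # stack channels
    projected = conv1x1(combined, weights)                 # reduce to output channels
    return sweaty_out + gate * projected

rng = np.random.default_rng(0)
sweaty_out = rng.random((1, 4, 4))    # single-channel heatmap
early_a = rng.random((8, 4, 4))       # features from one early block
early_b = rng.random((16, 4, 4))      # features from another early block
weights = rng.random((1, 24))         # 24 input channels -> 1 output channel
fused = gated_fusion(sweaty_out, early_a, early_b, weights, gate=0.5)
print(fused.shape)
```

A gate of zero passes the SweatyNet output through unchanged, so the weight directly controls how much early-layer detail reaches the temporal layers.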
In this section, we describe the details of the training process for our two sets of experiments. In the first experiment, we consider the problem of localizing the object in an image. In the second experiment, we evaluate our temporal approach. The evaluation of our experiments is discussed in Section 4.3.
Detection in an Image: For our work, we created a dataset of 4562 images, of which 4152 images contain a soccer ball. We refer to it as SoccerData. The images are extracted from a video recorded from the robot's point of view and are manually annotated using the imagetagger library (https://imagetagger.bit-bots.de/). The images are from three different fields with different light sources. Note that since the data is recorded on a walking robot, many of the images are blurry.
Each image is annotated with a bounding box given by its coordinates. As the teaching signal we generate a bivariate normal probability distribution centered on the ball. In contrast to previous work, where the authors consider a ball of fixed radius, we take the variable radius of the ball into account by computing the radius from the size of the bounding box.
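A target heatmap of this kind can be generated as follows. This is a sketch under explicit assumptions, since the exact expressions are not given here: the box is taken as `(x_min, y_min, x_max, y_max)`, the radius as half of the shorter box side, the standard deviation as equal to that radius, and the peak is left unnormalized at 1. The function name `ball_heatmap` is hypothetical.

```python
import numpy as np

def ball_heatmap(box, height, width):
    """Isotropic Gaussian heatmap centered on a bounding box.

    box = (x_min, y_min, x_max, y_max) in pixel coordinates (assumed layout).
    The spread is tied to the box size, so larger balls give wider blobs.
    """
    x_min, y_min, x_max, y_max = box
    cx = 0.5 * (x_min + x_max)
    cy = 0.5 * (y_min + y_max)
    # Assumption: radius is half of the shorter side; avoid division by zero.
    radius = max(0.5 * min(x_max - x_min, y_max - y_min), 1e-6)
    ys, xs = np.mgrid[0:height, 0:width]
    sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-sq_dist / (2.0 * radius ** 2))

target = ball_heatmap((10, 20, 20, 30), height=48, width=64)
print(target.shape, target.max())
```

Because the spread follows the bounding box, a distant ball produces a tight peak and a nearby ball a broad one, which is the effect the variable radius is meant to capture.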
We apply the three variants of the SweatyNet model described in Section 3 to SoccerData. For a fair evaluation of the models, we randomly split our data into training and testing sets. In the training phase, the mean squared error (MSE) between the predicted and the target probability map is minimized. We use Adam as the optimizer. We trained all of our models for a maximum of 100 epochs on an Nvidia GeForce GTX TITAN GK110. The learning rate and batch size used in our experiments follow the original SweatyNet work. In addition, we experiment with different dropout probabilities.
Detection in a Sequence: We train the temporal part in two ways: (i) we pre-train the temporal model on artificially generated sequences and finetune it on top of the pre-trained SweatyNet-1 on the real sequences; (ii) we finetune the joint model on the real sequences, where pre-trained weights are used only for the SweatyNet-1 model.
Algorithm 2 details the procedure for synthetic data generation. To get the heatmaps of a particular sequence, at each time step we generate a multivariate normal probability distribution centered at the ball position, with a variance equal to the ball radius.
To finetune the model on real sequential data, we extracted a set of real soccer ball frames from bags recorded during RoboCup 2018 Adult-Size games. Since video frames do not always contain a ball in the field of view, we preprocess the videos to make sure that we do not use a sequence of frames without any ball present. With these restrictions, we obtained 20 sequences of consecutive ball frames with an average length of 60 frames. For all of our experiments, we fixed the history size to 20 and the prediction length to 1.
For training on real data, we use separate learning rates for the detection task and for the temporal part after pretraining. The temporal network uses its own learning rate when trained on the artificial sequences. We train for 20 epochs on the synthetic data and for 30 epochs on the real data.
TCN: The encoder and decoder of the TCN are two convolutional networks with two layers of 64 and 96 channels, respectively. We set up all parameters following the original TCN work, except that we use Adam as the optimizer with an MSE loss.
ConvLSTM and ConvGRU: We use four layers of ConvLSTM / ConvGRU cells with respective sizes of 32, 64, 32, and 1, and a fixed kernel size of five across all layers.
Multiple Balls in a Sequence: To verify that our model can generalize, we test it on a more complex scenario with two balls present. Note that the network was only trained on a dataset containing a single ball. The qualitative results can be found in Fig. 4 and Fig. 5. These figures show that the model is powerful enough to handle cases not covered by the training data. The temporal part leverages the previous frames and the residual information and can detect a ball which is absent in the SweatyNet output (Fig. 4 (a) vs. (d)).
The output of the network is a probability map. We use the contour detection technique explained in Algorithm 1 to find the center coordinates of a ball. The output of the network is of lower resolution and has less spatial information than the input image. To account for this effect, we calculate sub-pixel level coordinates and return the center of mass of the contour as the center of the detected soccer ball.
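The center-of-mass step can be illustrated with a simplified stand-in for the contour-based procedure. This sketch is not Algorithm 1: it assumes a single blob, replaces contour extraction with a plain threshold, and maps output pixels to input pixels with a single hypothetical `scale` factor. The threshold value of 0.5 is likewise an assumption.

```python
import numpy as np

def ball_center(prob_map, threshold=0.5, scale=1.0):
    """Return the sub-pixel (x, y) center of the thresholded blob, or None.

    The center is the probability-weighted mean position of all pixels at
    or above `threshold`, multiplied by `scale` to reach input coordinates.
    """
    mask = prob_map >= threshold
    if not mask.any():
        return None                      # no detection in this frame
    weights = np.where(mask, prob_map, 0.0)
    ys, xs = np.mgrid[0:prob_map.shape[0], 0:prob_map.shape[1]]
    total = weights.sum()
    cx = (weights * xs).sum() / total
    cy = (weights * ys).sum() / total
    return cx * scale, cy * scale

prob_map = np.zeros((6, 6))
prob_map[2, 2] = 1.0
prob_map[2, 3] = 1.0
print(ball_center(prob_map, scale=4.0))  # center lies between the two pixels
```

Because the weighted mean is not restricted to integer positions, the estimate can fall between output pixels, which recovers some of the resolution lost in the downsampled map.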
To analyze the performance of the different networks we use several metrics: false discovery rate (FDR), precision (PR), recall (RC), F1-score (F1), and accuracy (Acc), as defined in Eq. 1, where TP is true positives, FP is false positives, FN is false negatives, and TN is true negatives.
An instance is classified as a TP if the predicted center and the actual center of the soccer ball are within a fixed distance of each other.
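For reference, the listed metrics can be computed from the four counts using their standard definitions. The sketch below assumes those standard definitions; the function name `detection_metrics` and the choice to return 0.0 for an empty denominator are conventions introduced here.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def detection_metrics(tp, fp, fn, tn):
    """Standard detection metrics from true/false positive/negative counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "FDR": _ratio(fp, fp + tp),                # share of detections that are wrong
        "PR": precision,
        "RC": recall,
        "F1": _ratio(2 * precision * recall, precision + recall),
        "Acc": _ratio(tp + tn, tp + fp + fn + tn),
    }

# Illustrative counts only, not results from the experiments.
print(detection_metrics(tp=90, fp=10, fn=5, tn=20))
```

Note that FDR and precision always sum to one under these definitions, so the two columns carry the same information up to rounding.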
The results of our experiments are summarized in Table 1. The performance of all three models is comparable. To improve generalization and prevent overfitting, we further experiment with different dropout probability values. We train all our models on a PC with an Intel Core i7-4790K CPU with 32 GB of memory and an Nvidia GeForce GTX TITAN graphics card with 6 GB of memory. For real-time detection, one major requirement is a fast inference time. We report the inference time of the models on the NimbRo-OP2X robot in Table 2(a). The NimbRo-OP2X robot is equipped with an Intel Core i7-8700T CPU with 8 GB of memory and an Nvidia GeForce GTX 1050 Ti graphics card with 4 GB of memory. None of the three models uses the full capacity of the GPU during inference, so the bigger models can perform their extra computations in parallel; as a result, all three SweatyNet networks have comparable real-time inference times. Fig. 8 demonstrates the effectiveness of the model for the task of soccer ball detection. For this study, we only consider SweatyNet-1.
The results of the sequential part are summarized in Table 2(b). The sequential network successfully captures temporal dependencies and gives an improvement over SweatyNet. Using artificial data to pre-train the temporal network is beneficial given the shortage of real training data, and it boosts performance. Fig. 2 illustrates artificially generated ball sequences with the temporal prediction. We observed that when the temporal model is pre-trained on the artificial data, the learnable weight for the residual information takes a value of 0.57 on average, whereas without pre-training the value is 0.49. The performance of TCN is comparable to that of ConvLSTM and ConvGRU, but TCN considerably outperforms both in terms of inference time, which is a critical requirement for a real-time decision-making process. Table 2(a) presents a comparison of the temporal models on inference time.
To support our proposal of using sequential data, in Fig. 7 we present an example image where SweatyNet alone is uncertain of the prediction, whereas the network gives a strong detection when further processed with the temporal model.
Table 2(a): Inference time per method, in ms, measured on the robot.

Table 2(b): Performance metrics on the test set.

| Method | FDR | PR | RC | F1 | Acc |
|---|---|---|---|---|---|
| SweatyNet-1 (0.5) | 0.024 | 0.975 | 0.972 | 0.973 | 0.955 |
| Net+LSTM(real) | 0.025 | 0.976 | 0.979 | 0.977 | 0.962 |
| Net+LSTM(ft) | 0.026 | 0.975 | 0.987 | 0.981 | 0.967 |
| Net+GRU(real) | 0.024 | 0.975 | 0.980 | 0.978 | 0.963 |
| Net+GRU(ft) | 0.026 | 0.972 | 0.987 | 0.980 | 0.966 |
| Net+TCN(real) | 0.024 | 0.976 | 0.982 | 0.979 | 0.964 |
| Net+TCN(ft) | 0.026 | 0.974 | 0.985 | 0.980 | 0.966 |
In this paper, we address the problem of soccer ball detection using sequences of data. We propose a model which utilizes the history of ball movements for efficient detection and tracking. Our approach makes use of temporal models which effectively leverage the spatio-temporal correlation of sequences of data and keep track of the trajectory of the ball. We present three temporal models: TCN, ConvLSTM, and ConvGRU. The feed-forward nature of TCN allows a faster inference time and makes it a strong choice for real-time application in RoboCup soccer. Furthermore, we show that with transfer learning, sequential models can leverage knowledge learned from synthetic counterparts. Based on our results, we conclude that our proposed deep convolutional networks are effective in terms of both performance and inference time and are a suitable choice for soccer ball detection. Note that the presented models can also be used for detecting other soccer objects like goalposts and robots.
[1] (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495.
[2] (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
[3] (2016) Delving deeper into convolutional networks for learning video representations. In ICLR.
[4] (2017) Quo vadis, action recognition? a new model and the kinetics dataset. pp. 6299–6308.
[5] Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
[6] (2015) A monocular vision system for playing soccer in low color information environments. In Proceedings of 10th Workshop on Humanoid Soccer Robots, IEEE-RAS Int. Conference on Humanoid Robots, Seoul, Korea.
[7] (2013) AUTMan kid-size team description 2013. Technical report, Amirkabir University of Technology.
[8] (2018) NimbRo-op2x: adult-sized open-source 3d printed humanoid robot. In 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids), pp. 1–9.
[9] Frequency domain transformer networks for video prediction. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium.
[10] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[11] (2018) Visual mesh: real-time object detection using constant sample density. arXiv preprint arXiv:1807.08405.
[12] (2017) Humanoid robot detection using deep learning: a speed-accuracy tradeoff. In Robot World Cup, pp. 338–349.
[13] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[14] (1997) Robocup: the robot world cup initiative. In Proceedings of the First International Conference on Autonomous Agents, pp. 340–347.
[15] (2017) Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165.
[16] (2018) Playing soccer without colors in the spl: a convolutional neural network approach. arXiv preprint arXiv:1811.12493.
[17] (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 3D Vision (3DV), 2016 Fourth International Conference on, pp. 565–571.
[18] (2017) Automatic differentiation in pytorch.
[19] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
[20] (2017) Detection and localization of features on a soccer field with feedforward fully convolutional neural networks (fcnn) for the adultsize humanoid robot sweaty. In Proceedings of the 12th Workshop on Humanoid Soccer Robots, IEEE-RAS International Conference on Humanoid Robots, Birmingham.
[21] (2007) A ball is not just orange: using color and luminance to classify regions of interest. In Proc. of Second Workshop on Humanoid Soccer Robots, Pittsburgh.
[22] (2016) Ball localization for robocup soccer using convolutional neural networks. In Robot World Cup, pp. 19–30.
[23] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
[24] (2018) Deep learning for semantic segmentation on minimal hardware. arXiv preprint arXiv:1807.05597.
[25] (2013) Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision 103 (1), pp. 60–79.
[26] (2015) Convolutional lstm network: a machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pp. 802–810.