Code release for WACV 2020, "Optimizing Through Learned Errors for Accurate Sports Field Registration"
We propose an optimization-based framework to register sports field templates onto broadcast videos. For accurate registration we go beyond the prevalent feed-forward paradigm. Instead, we propose to train a deep network that regresses the registration error, and then register images by finding the registration parameters that minimize the regressed error. We demonstrate the effectiveness of our method by applying it to real-world sports broadcast videos, outperforming the state of the art. We further apply our method on a synthetic toy example and demonstrate that our method brings significant gains even when the problem is simplified and unlimited training data is available.
Image registration with deep learning has received much attention recently, due to the success of deep learning in many other areas of computer vision [17, 16, 40]. Registration of a sports field template onto a camera view is no exception [6, 19, 42]: here too, deep learning has shown promising results compared to traditional baselines. Despite these recent advances, there is room for further improvement, especially for augmented reality and sports analytics.
For mixed and augmented reality, even the slightest inaccuracies in estimates can break immersion. For sports analytics, good alignment is crucial for detecting important events, such as offsides in soccer.
Existing methods have also acknowledged this limitation, and have sought to improve accuracy. For example, some rely on a hierarchical strategy [38, 24]. In these methods, the refinement network is used on top of a rough pose estimator, where both are feed-forward networks. However, as we will demonstrate through our experiments, there is an alternative way to enhance performance.
In order to achieve more accurate registration, we take a different route from the commonly used feed-forward paradigm. Inspired by classic optimization-based approaches for image registration [37, 30, 28, 35], we propose optimizing to reduce the estimated registration error. In contrast to traditional methods, we rely on a deep network for estimating the error.
Specifically, as illustrated in Fig. 1, we propose a two-stage deep learning pipeline, similar to existing methods [38, 24], but with a twist on the refinement network. The first-stage network, which we refer to as the initial registration network, provides a rough estimate of the registration, parameterized by a homography transform. For the second-stage network, instead of a feed-forward refinement network, we train a deep neural network that regresses the error of our estimates, which we refer to as the registration error network. We then use the initial registration network to provide an initial estimate, and optimize the initial estimate using the gradients provided by differentiating through the registration error network. This allows much more accurate estimates compared to single-stage feed-forward inference.
In addition, we propose not to train the two networks together. While end-to-end and joint training is often preferred [39, 36], we find that it is beneficial to train the two networks separately – decoupled training. We attribute this to two observations: the two networks – initial registration network and registration error network – aim to regress different things – pose and error; it is useful for the registration error network to be trained independently of the initial registration network so that it does not overfit to the mistakes the initial registration network makes while training.
We demonstrate empirically that our framework consistently outperforms feed-forward pipelines. We apply our method to sports field registration with broadcast videos. We show that not only does our method outperform feed-forward networks in a typical registration setup, it also outperforms the state of the art even when training data is scarce, a trait that is desirable with deep networks. We further show that our method is not limited to sports field registration. We create a simple synthetic toy dataset for estimating the equation of a line in an image, and show that even in this simple case, when unlimited training data is available, our method brings a significant advantage.
To the best of our knowledge, our method is the first that learns to regress registration errors for optimization-based image registration. The idea of training a deep network to regress the error and perform optimization-based inference has recently been investigated for image segmentation, multi-label classification, and object detection [13, 23]. While the general idea exists, it is non-trivial to formulate a working framework for each task. Our work is the first successful attempt for sports field registration, thanks to our two-stage pipeline and the way we train the two networks.
To summarize, our contributions are:
we propose a novel two-stage image registration framework where we iteratively optimize our estimate by differentiating through a learned error surface;
we propose a decoupled training strategy to train the initial registration network and the registration error network;
our method achieves state-of-the-art performance, even when the training dataset size is as small as 209 images;
we demonstrate the potential of our method through a generic toy example.
Early attempts at registering sports fields to broadcast videos [12, 11, 37] typically rely on a set of pre-calibrated reference images. These calibrated references are used to estimate a relative pose to the image of interest. To retrieve the relative pose, these methods either assume that images correspond to consecutive frames in a video [12, 11], or use local features, such as SIFT and MSER, to find correspondences. These methods, however, require that the set of calibrated images contains images with similar appearance to the current image of interest, as traditional local features are weak against long-term temporal changes. While learned alternatives exist [50, 36, 10], their performance in the context of pose estimation and registration remains questionable.
To overcome these limitations, more recent methods [19, 42, 6] focus on converting broadcast videos into images that only contain information about sports fields, for example known marker lines, and then perform registration. Homayounfar et al. perform semantic segmentation on broadcast images with a deep network, then optimize for the pose using branch and bound with a Markov Random Field (MRF) formulated with geometric priors. While robust to various scenic changes, their accuracy is still limited. Sharma et al. simplify the formulation by focusing on the edges and lines of sports fields, rather than the complex semantic segmentation setup. They use a database of edge images generated with known homographies to extract the pose, which is then temporally smoothed. Chen and Little further employ an image translation network in a hierarchical setup where the play-field is first segmented out, followed by sports field line extraction. They also employ a database to extract the pose, which is further optimized through Lucas-Kanade optimization on a distance-transformed version of the edge images. The bottleneck of these two methods is the necessity of a database, which hinders their scalability.
Traditional methods for homography estimation include sparse feature-based approaches and dense direct approaches. Whether sparse or dense, traditional approaches are mainly limited by either the quality of the local features, or by the robustness of the objective function used for optimization.
Deep learning based approaches have also been proposed for homography estimation. One line of work trains a network that directly regresses the homography between two images through self supervision. Interestingly, the output of the regression network is discretized, allowing the method to be formulated as classification. Nguyen et al. train a deep network in an unsupervised setup to learn to estimate the relative homography. The main focus of these methods, however, is on improving the inference speed, without significant improvements in accuracy when compared to traditional baselines.
Pose estimators are also highly related to image registration. Deep networks have also been proposed to directly regress the 6 DoF pose of cameras [51, 43, 25]. Despite being efficient to compute, these methods depend heavily on their parameterization of the pose: naive parameterizations can lead to bad performance, and are known to have limited accuracy. To overcome this limitation, recent works focus on regressing the 2D projection of 3D control points [46, 38, 24]. Compared with directly predicting the pose, pose estimation based on control points shows improved performance due to the robust estimation of parameters. Our initial registration network follows the same idea as these methods to obtain our initial estimate.
Incorporating optimization into deep pipelines is a current topic of interest. BA-Net learns to perform Levenberg-Marquardt optimization within the network to solve dense bundle adjustment. LS-Net learns to predict the directions to improve a given camera pose estimate. Han et al. also learn to estimate the Jacobian matrix from an image pair to update the 6 DoF camera pose. In contrast to these methods, which propose learning a function to update a camera pose estimate, we propose to learn an error function that predicts how well two images are aligned. Using the error function we can obtain the update direction via differentiation. The work most similar to ours is deep value networks, which train a network to estimate the intersection over union (IoU) between the input and ground-truth masks for image segmentation. While sharing a similar idea, it is non-trivial to extend and adapt their method to image registration. For example, their method is limited to a static initial estimate, which requires a longer optimization trajectory than ours. This may become a problem when applied to sports field registration, where the broadcast view changes drastically even when there is a small camera rotation. As we show later in Table 2, just having an error network is not enough.
For clarity in presentation, we first assume that our models are pre-trained and detail our overall framework at inference time. We then provide details on the training setup and the architectural choices.
Our pipeline is depicted in Fig. 1. We assume a known planar sports field template and undistorted images, so that we can represent the image-template alignment with a homography matrix. The framework can be broken down into two stages: the first stage provides an initial estimate of the homography matrix, the second iteratively optimizes this estimate. The first stage follows a typical feed-forward paradigm [9, 46], and we utilize a deep neural network. However, any method can be used here instead, such as a database search [42, 6].
The distinctiveness of our model comes from the second stage of the pipeline. Using the first stage estimate, we warp the sports field template to the current view. We concatenate this warped image with the current observed image, and evaluate the registration error through a second neural network. We then backpropagate the estimated error to the homography parameters to obtain the gradient, which gives the direction in which the parameters should be updated to minimize the registration error. Then, using this gradient, we update the homography parameters. This process is performed iteratively until convergence or until a maximum number of iterations is met. This inference through optimization allows our method to be significantly more accurate than a typical feed-forward setup, provided that our error model gives reasonable error predictions.
We follow the recent trend of using projected coordinates for pose parameterization [9, 5]. In the case of homographies, this can be done with 4 points. We parameterize the homography defining the relationship between the input image and the target template through the coordinates of four control points on the current input image when warped onto the sports field template. Specifically, considering a normalized image coordinate system where the width and height of the image are set to one and the centre of the image is at the origin, we use the four corners of the lower three-fifths of the image as our reference control points, which we collect in a vector $\mathbf{h}_{ref}$.
We use the lower parts of the image as sports field broadcast videos are typically in a setup where the camera is looking down on the field, as shown in Fig. 2.
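Let $(u_k, v_k)$ denote the $k$-th control point of the current image $\mathbf{I}$ projected onto the sports field template $\mathbf{m}$. We then write the homography as
$$\mathbf{h} = [u_1, v_1, u_2, v_2, u_3, v_3, u_4, v_4]^\top.$$
We obtain the actual transformation matrix $\mathbf{H}$ from $\mathbf{h}$ and the reference control points $\mathbf{h}_{ref}$ through direct linear transformation.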
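The direct linear transformation step can be illustrated with a minimal sketch. It assumes the common normalization in which the bottom-right entry of the matrix is fixed to one, so four point pairs give an 8-by-8 linear system. The function names (`solve_linear`, `homography_from_points`, `apply_homography`) and the example coordinates are hypothetical and are not taken from the released code.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(val) for val in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate control point configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def homography_from_points(src, dst):
    """Return a 3x3 matrix H (H[2][2] = 1) mapping each src point to its dst point."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1), rearranged to be linear in h
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        # v = (h3 x + h4 y + h5) / (h6 x + h7 y + 1)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(H, point):
    """Project a 2D point through H, including the perspective division."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


# Illustrative values only: a unit square mapped to an arbitrary quadrilateral.
reference_points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
projected_points = [(0.1, 0.2), (0.9, 0.1), (1.2, 1.1), (-0.1, 0.8)]
H = homography_from_points(reference_points, projected_points)
```

With exactly four correspondences in general position the system has a unique solution, which is why eight regressed numbers are enough to define the homography.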
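Given an initial registration network $f_\phi(\cdot)$, we obtain a rough homography estimate $\hat{\mathbf{h}}^{(0)}$ for image $\mathbf{I}$ as
$$\hat{\mathbf{h}}^{(0)} = f_\phi(\mathbf{I}),$$
where the superscript in parentheses denotes the optimization iteration.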
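With the current homography estimate $\hat{\mathbf{h}}^{(i)}$ at optimization iteration $i$, we warp the play-field template $\mathbf{m}$ to obtain an image of the template in the current view, using a bilinear sampler to preserve differentiability. We concatenate the result of this warping operation, $\mathcal{W}(\mathbf{m}, \hat{\mathbf{h}}^{(i)})$, and the image $\mathbf{I}$, and pass it as input to the model $g_\psi(\cdot)$ to obtain a prediction of the registration error $\hat{\epsilon}^{(i)}$ as
$$\hat{\epsilon}^{(i)} = g_\psi\left(\left[\mathbf{I} \,;\, \mathcal{W}(\mathbf{m}, \hat{\mathbf{h}}^{(i)})\right]\right),$$
where $[\cdot\,;\cdot]$ denotes concatenation along the channel direction of two images. We then retrieve the gradient of $\hat{\epsilon}^{(i)}$ with respect to $\hat{\mathbf{h}}^{(i)}$ and apply this gradient to retrieve an updated estimate. In practice, we rely on Adam for a stable optimization.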
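To make the role of the bilinear sampler concrete, the following is a minimal sketch of bilinear sampling on a single-channel image stored as a list of rows. The function name `bilinear_sample` and the zero padding outside the image are assumptions made for illustration; the actual pipeline would use the sampler provided by a deep learning framework.

```python
import math


def bilinear_sample(image, x, y):
    """Sample a 2D grid (list of rows) at a continuous pixel location (x, y).

    Locations outside the grid contribute 0.0 (zero padding, an assumption).
    """
    height, width = len(image), len(image[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    value = 0.0
    # Blend the four neighbouring pixels with weights that are linear in x and y.
    for dy, wy in ((0, 1.0 - fy), (1, fy)):
        for dx, wx in ((0, 1.0 - fx), (1, fx)):
            xi, yi = x0 + dx, y0 + dy
            if 0 <= xi < width and 0 <= yi < height:
                value += wx * wy * image[yi][xi]
    return value


tiny_image = [[0.0, 1.0],
              [2.0, 3.0]]
centre_value = bilinear_sample(tiny_image, 0.5, 0.5)  # average of the four pixels
```

Because the output is a piecewise-linear function of the sampling location, small changes in the homography parameters produce small, well-defined changes in the warped image, which is what allows gradients to flow back to the parameters.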
Note here that our registration error network is not trained to give updates. It simply regresses the correctness of the current estimate. We show empirically in Section 4.3 that this is much more effective than, for example, learning to provide a perfect homography, or learning to correct erroneous estimates.
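The optimization-based inference described above can be sketched on a toy problem. In this sketch a simple quadratic function, `toy_error`, stands in for the trained registration error network, and central finite differences stand in for backpropagation; both are assumptions made so that the example runs without a deep learning framework. The step size `lr=1e-2` is also an illustrative value. The update itself is the standard Adam rule, and the loop keeps the estimate with the lowest predicted error.

```python
import math


def numerical_gradient(error_fn, params, eps=1e-5):
    """Central finite differences; a stand-in for backpropagation."""
    grad = []
    for i in range(len(params)):
        plus, minus = list(params), list(params)
        plus[i] += eps
        minus[i] -= eps
        grad.append((error_fn(plus) - error_fn(minus)) / (2 * eps))
    return grad


def refine(error_fn, init, steps=400, lr=1e-2, beta1=0.9, beta2=0.999, tiny=1e-8):
    """Minimize error_fn from init with Adam and return the best estimate seen."""
    params = list(init)
    m = [0.0] * len(params)  # first moment estimate
    v = [0.0] * len(params)  # second moment estimate
    best_params, best_error = list(params), error_fn(params)
    for t in range(1, steps + 1):
        grad = numerical_gradient(error_fn, params)
        for i, g in enumerate(grad):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + tiny)
        err = error_fn(params)
        if err < best_error:
            best_params, best_error = list(params), err
    return best_params, best_error


def toy_error(params, target=(0.3, -0.2)):
    """Hypothetical error surface with its minimum at `target`."""
    return sum((p - t) ** 2 for p, t in zip(params, target))


best_params, best_error = refine(toy_error, [0.0, 0.0])
```

The design point this illustrates is that the error model only has to score an estimate; the update direction is obtained by differentiating that score, rather than being predicted directly.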
To avoid overfitting, we propose to purposely decouple the training of two networks. We show in Section 4.3 that this is necessary in order to obtain the best performance.
To train the initial registration network, we directly regress the four control points of our template warped into a given view using the ground truth homography. With the ground truth homography $\mathbf{h}_{gt}$, we train our deep network $f_\phi$ to minimize
$$\mathcal{L}_{init} = \left\| \mathbf{h}_{gt} - f_\phi(\mathbf{I}) \right\|_2^2.$$
Note that while we use a deep network to obtain the initial homography estimate, any other method can also be used in conjunction, such as nearest neighbor search.
To train the registration error network, we create random perturbations on the ground truth homography. We then warp the target template to the view using the perturbed ground truth homography, and concatenate it with the input image to be used as input data for training. The network model is trained to predict a registration error metric, the IoU. We detail our design choice of error metric in Section 4.3.
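As an illustration of the target metric, the IoU between two binary masks can be computed as below. This is a minimal sketch on nested lists; the function name `iou` and the convention of returning 1.0 when both masks are empty are assumptions, and the exact regions over which the paper evaluates the IoU are not specified here.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two binary masks of identical shape."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Two empty masks are treated as a perfect match (assumed convention).
    return intersection / union if union else 1.0


warped_template_mask = [[1, 1, 0],
                        [1, 1, 0]]
ground_truth_mask = [[0, 1, 1],
                     [0, 1, 1]]
overlap = iou(warped_template_mask, ground_truth_mask)  # 2 shared cells out of 6
```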
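In more detail, with the ground truth homography $\mathbf{h}_{gt}$, we create a perturbed homography $\mathbf{h}_{pert}$ by applying uniform noise hierarchically: one for global translation, and one for local translation of each control point. Specifically, we add a global random translation $\alpha_c$, where $\alpha_c \sim \mathcal{U}(-\delta_c, \delta_c)$, to all control points, and add a local random translation $\alpha_s$, where $\alpha_s \sim \mathcal{U}(-\delta_s, \delta_s)$, individually to each control point. We then warp the target template according to the perturbed homography to create our input data for training. Thus, the input to the registration error network $g_\psi$ for training is $[\mathbf{I} \,;\, \mathcal{W}(\mathbf{m}, \mathbf{h}_{pert})]$. Then, to train the network, we minimize
$$\mathcal{L}_{error} = \left\| Err\left(\mathbf{I}, \mathcal{W}(\mathbf{m}, \mathbf{h}_{pert})\right) - g_\psi\left(\left[\mathbf{I} \,;\, \mathcal{W}(\mathbf{m}, \mathbf{h}_{pert})\right]\right) \right\|_2^2,$$
where $Err(\cdot, \cdot)$ is the error metric, for example the IoU value.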
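The hierarchical perturbation can be sketched as follows. The sketch assumes symmetric uniform noise applied independently to each coordinate; the names `perturb_control_points`, `delta_c` and `delta_s`, as well as the numeric values in the usage lines, are hypothetical and do not reproduce the settings used in the experiments.

```python
import random


def perturb_control_points(points, delta_c, delta_s, rng=random):
    """Add one shared global shift and one independent local shift per point."""
    # Global translation: drawn once and applied to all control points.
    global_dx = rng.uniform(-delta_c, delta_c)
    global_dy = rng.uniform(-delta_c, delta_c)
    perturbed = []
    for u, v in points:
        # Local translation: drawn separately for each control point.
        local_dx = rng.uniform(-delta_s, delta_s)
        local_dy = rng.uniform(-delta_s, delta_s)
        perturbed.append((u + global_dx + local_dx, v + global_dy + local_dy))
    return perturbed


# Illustrative ground-truth control points and noise ranges.
ground_truth_points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
noisy_points = perturb_control_points(ground_truth_points, 0.05, 0.01,
                                      rng=random.Random(0))
```

The global component moves the whole template, while the small local components change its shape, so the training samples cover both kinds of misalignment.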
We apply the proposed method to sports field registration. We first discuss the datasets, baselines, the metrics used for our evaluation, as well as implementation details. We then present qualitative and quantitative results of our method, compared to the state of the art. We then provide experimental insights to our method.
To validate our method, we rely on two datasets. The World Cup dataset is a dataset made of broadcast videos of soccer games. It has 209 images for training and validation, and 186 images for testing. This dataset is extremely small, making it challenging to apply deep methods. The state of the art for this dataset relies on learning to transfer the input image to look similar to the sports field template, then searching a database of known homographies and warped templates to retrieve the estimate. For our method, we use 39 images from the train-valid split as validation dataset, and respect the original test split for testing. The Hockey dataset is composed of broadcast videos of NHL ice hockey games. This is a larger dataset than the World Cup dataset, having 1.67M images in total. Of this large dataset, we use two sequences of 800 consecutive images as validation and testing sets. By using consecutive frames, we ensure that images from one game do not fall into different splits. See Fig. 3 for example images.
We compare our method against three existing works for sports field registration [19, 42, 6]. As there is no publicly available implementation of the two methods [19, 42], we take the results reported in the respective papers for the World Cup dataset. For the remaining method [6], we use the authors' public implementation. For the Hockey dataset, we use the previously reported results as a reference. (No information is provided by the authors on how the train, validation, and test splits are created, thus the results are not directly comparable.)
In addition, we compare our method against feed-forward baselines: a single stage feed-forward network (SFF) and a two-stage feed-forward refinement network (FFR). We further explore whether the registration error network can be used alone, by retrieving the initial estimate from a database of known poses (the training set) and using the example which gives the lowest error estimate. We will refer to the initial estimate obtained through nearest neighbor search as NN, and the fully optimized estimate as NNo. To do a nearest neighbor search we evaluate the registration error for the query image with all the training homographies using the trained registration error network, and return the homography with the lowest estimated error. Although this method is not scalable because the computational requirement grows linearly with the size of the database, it provides insight into the capability of the trained registration error network.
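The nearest neighbor baseline amounts to a linear scan over the database. A minimal sketch is given below; `nearest_neighbor_init` is a hypothetical name, and `error_fn` is a placeholder for the trained registration error network, here replaced by a toy function so the example runs on its own.

```python
def nearest_neighbor_init(error_fn, image, database):
    """Return the database homography with the lowest estimated error.

    error_fn(image, homography) -> float stands in for the trained
    registration error network.
    """
    best_homography, best_error = None, float("inf")
    for homography in database:  # one network evaluation per database entry
        err = error_fn(image, homography)
        if err < best_error:
            best_homography, best_error = homography, err
    return best_homography, best_error


# Toy stand-in: the "image" is ignored and the error is a distance to 0.6.
toy_database = [[0.0], [0.5], [1.0]]
nn_estimate, nn_error = nearest_neighbor_init(
    lambda image, h: abs(h[0] - 0.6), None, toy_database)
```

The single loop makes the linear cost explicit: every query needs as many error evaluations as there are database entries.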
Following a recent trend [39, 16], we base our network on the ResNet-18 architecture. Instead of the classification head, we simply replace the last fully connected layer to estimate 8 numbers which represent the homography, that is, the coordinates of the four control points. We use the pretrained weights for the network trained on ImageNet, and fine-tune.
For the registration error network, we also rely on the ResNet-18 architecture, but with spectral normalization on all convolutional layers, and take as input a 6-channel image, that is, the concatenation of the input image and the warped target template. Spectral normalization smooths the error predictions by constraining the Lipschitz constant of the model, which limits the magnitude of its gradients. As the output of the registration error network cannot be negative, we use the sigmoid function as the final activation function for the IoU-based error metrics, and the squaring function for the reprojection error metric. For the registration error network, as the input is very different from that of a typical image-based network, we train from scratch.
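The idea behind spectral normalization can be illustrated on a small weight matrix: estimate the largest singular value by power iteration and divide the weights by it, so that the linear map has a Lipschitz constant of one. The sketch below is a plain-Python illustration with hypothetical names (`spectral_norm`, `spectral_normalize`); it runs many iterations for accuracy and is not how a deep learning framework applies the technique to convolutional layers during training.

```python
import math


def spectral_norm(W, iterations=50):
    """Estimate the largest singular value of W (list of rows) by power iteration."""
    rows, cols = len(W), len(W[0])
    v = [1.0 / math.sqrt(cols)] * cols
    sigma = 0.0
    for _ in range(iterations):
        u = [sum(W[r][c] * v[c] for c in range(cols)) for r in range(rows)]
        sigma = math.sqrt(sum(x * x for x in u))  # ||W v|| for unit-length v
        if sigma == 0.0:
            return 0.0
        u = [x / sigma for x in u]
        v = [sum(W[r][c] * u[r] for r in range(rows)) for c in range(cols)]
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
    return sigma


def spectral_normalize(W, iterations=50):
    """Rescale W so that its largest singular value is one."""
    sigma = spectral_norm(W, iterations)
    return [[w / sigma for w in row] for row in W]


weights = [[3.0, 0.0],
           [0.0, 1.0]]
normalized_weights = spectral_normalize(weights)
```

Bounding each layer in this way bounds how fast the predicted error can change with its input, which is consistent with the smoother error predictions described above.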
We train our networks with the Adam optimizer, with default parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$. We train until convergence, and use the validation dataset to perform early stopping. We set the noise parameters for the global and local perturbations of Section 3.2 empirically, by observing the validation dataset results. For inference, we again use Adam, but with a different learning rate. We run our optimization for 400 iterations, and return the estimate that gave the lowest estimated error predicted by the trained registration error network.
As shown in Table 1, having an additional feed-forward refinement network (FFR) only provides a minor improvement over the initial estimate (SFF). This phenomenon is more obvious in the World Cup dataset results, where training data is scarce. By contrast, our method is able to provide a significant reduction in the registration error.
We also compare results when different target errors are used for the training of the registration error network in Table 2. We compare regressing to the IoU-based metrics and to the average reprojection error of all pixels inside the current view (Reproj.). Interestingly, regressing to a given IoU-based metric does not guarantee the best performance in terms of that same metric. In all cases, a single IoU-based target gives the best performance.
It is a common trend to train multiple components together. However, our framework does not allow joint training, as the two networks are aiming for entirely different goals. Nonetheless, we simultaneously trained the two networks, thus allowing the registration error network to see all the mistakes that the initial registration network makes during training (Coupled). Coupled training, however, performs worse than decoupled training, as shown in Table 2. In case of the Hockey dataset, coupled training performs even worse than feed-forward refinement. This is because while the initial registration network is converging, it is making predictions with smaller and smaller mistakes, thus the registration error network is learning a narrow convergence basin due to the small perturbations it sees. The estimates that fall out of the convergence basin can not be optimized using the learned error. Therefore, it is necessary to have a decoupled training setup to stop this from happening.
The two variants, NN and NNo, provide insights into the capability of the registration error networks. Due to the limited size of the database, i.e. training data, NN provides initial estimates with lower accuracy than the single stage feed-forward network SFF. However, with optimization (NNo), the registration results are even comparable to the results from our full pipeline. This observation shows that the registration error network can provide a wide convergence basin, and can optimize for inaccurate initial estimates.
Note that we only test these methods on the World Cup dataset, as applying the method on Hockey dataset requires too much computation due to the larger database to search.
To validate that the trained registration error network can provide a convergence basin, we visualize the average estimated error surface for translation over all test samples. To do so we create a regular grid of translation offsets with resolution 50 by 50. For each point on the grid we warp the template with the ground truth homography combined with the translation from the origin to the point location. We then pass the observed image concatenated with the warped sports field to the trained registration error network, and infer the registration error at that point on the grid. We calculate the error surface for all the test samples, and visualize the average.
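The averaging procedure can be sketched as follows. The callable `error_fn(sample, tx, ty)` is a placeholder for warping the template with the translated ground truth homography and querying the error network; the names and the `extent` of the grid are hypothetical, since the translation range is not given here.

```python
def average_error_surface(error_fn, samples, extent, resolution):
    """Average error_fn over samples on a regular grid of translations.

    The grid spans [-extent, extent] in both directions with `resolution`
    points per axis (resolution must be at least 2).
    """
    step = 2.0 * extent / (resolution - 1)
    offsets = [-extent + i * step for i in range(resolution)]
    surface = [[0.0] * resolution for _ in range(resolution)]
    for sample in samples:
        for iy, ty in enumerate(offsets):
            for ix, tx in enumerate(offsets):
                surface[iy][ix] += error_fn(sample, tx, ty) / len(samples)
    return offsets, surface


# Toy stand-in for the error network: error grows with the translation.
grid_offsets, mean_surface = average_error_surface(
    lambda sample, tx, ty: tx * tx + ty * ty,
    samples=[0, 1, 2], extent=0.5, resolution=5)
```

A surface whose minimum sits at the zero-translation cell indicates that gradient-based optimization started nearby will be pulled toward the ground truth.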
As shown in Fig. 5, the estimated error surface resembles the ground truth one. The error is lower towards the origin, where the translation perturbation is smaller, and higher towards the border, where the perturbation is larger. Most importantly, the minimum of the estimated error is very close to the origin, which is the ground truth. This allows our optimization-based inference to work properly.
Beyond sports field registration, our method could be applied to other tasks that involve parameter regression. Here, we show briefly that even a task as simple and generic as fitting a line equation in an image can benefit from our method. We hope to shed some light on the potential of our method.
Unlike DSAC, we are not learning to reject outliers via their pixel coordinates, but rather are directly regressing the line equations given an image of a line.
We follow the same setup as our image registration task, but instead regress two parameters that define the equation of a line in image coordinates: the angle and the intercept.
To create the synthetic images, we first generate random lines by selecting a random pivot point in a 64×64 image, then uniformly sample the angle of the line. We draw this line with a random color. We then add a randomly colored ellipse with random parameters as a distraction, and finally apply additive Gaussian noise. We use VGG-11 as the backbone for the initial registration network.
We use the intercept error as the target error metric, that is, the maximum error between the ground-truth and estimated intercepts evaluated at two fixed positions. To generate erroneous estimates for training, we add uniform noise to the ground-truth angle and intercept. To render the estimate into an image in a differentiable way, we warp the template image, which is simply an image of a line, using the hypothesized line parameters as in the case of image registration. We concatenate the input image with the warped template and pass the result to the error network to estimate the error, in this case the intercept error. We also use VGG-11 as the backbone for the error network.
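A possible form of the intercept error is sketched below. It assumes the slope-intercept form y = tan(angle) * x + intercept and compares the two lines at the positions `x_left` and `x_right`; the line equation, the two positions and the function name `intercept_error` are all assumptions made for illustration, since those details are not given here.

```python
import math


def intercept_error(angle_gt, intercept_gt, angle_est, intercept_est,
                    x_left=0.0, x_right=1.0):
    """Largest vertical gap between two lines at two fixed x positions.

    Assumes lines of the form y = tan(angle) * x + intercept.
    """
    gaps = []
    for x in (x_left, x_right):
        y_gt = math.tan(angle_gt) * x + intercept_gt
        y_est = math.tan(angle_est) * x + intercept_est
        gaps.append(abs(y_gt - y_est))
    return max(gaps)


# Same angle, shifted intercept: the gap is the same at both positions.
shift_only = intercept_error(0.3, 0.0, 0.3, 0.1)
```

Taking the maximum over two positions penalizes errors in both the angle and the intercept with a single scalar, which keeps the regression target one-dimensional as in the registration task.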
We train both networks until convergence and optimize for 400 iterations at inference time.
As shown in Fig. 6, our method estimates the line parameters more accurately than a feed-forward deep network. Quantitative results are shown in Table 3. As shown, even in this simple generic task, our method outperforms its feed-forward counterpart. As this task can be viewed as a simplified version of other computer vision tasks, it shows that our method may be applicable outside the scope of the current paper. We further highlight that this experimental setup is with unlimited labeled data. Even in such a case, our method brings significant improvement in performance.
We have proposed a two-stage pipeline for registering sports field templates to broadcast videos accurately. In contrast to existing methods that do single stage feed-forward inference, we opted for an optimization-based inference inspired by established classic approaches. The proposed method makes use of two networks, one that provides an initial estimate for the registration homography, and one that estimates the error given the observed image and the current hypothesized homography. By optimizing through the registration error network, accurate results were obtained.
We have shown through experiments that the proposed method can be trained with very sparse data, as little as 209 images, and achieve state-of-the-art performance. We have further revealed how different design choices in our pipeline affect the final performance. Finally, we have shown that our framework can be translated into other tasks and improve upon feed-forward strategies.
As future work, since the inference is optimization-based, we can naturally embed temporal consistency by reusing the optimization state for consecutive images to register sports field for a video. We show preliminary results of doing so in our supplementary video.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6684–6692, 2017.
International Joint Conference on Artificial Intelligence, 1981.
3d pose regression using convolutional neural networks. In The IEEE International Conference on Computer Vision (ICCV) Workshops, 2017.