A Context-Aware Loss Function for Action Spotting in Soccer Videos
Action spotting is an important element of general activity understanding. It consists of detecting human-induced events annotated with single timestamps. In this paper, we propose a novel loss function for action spotting. Our loss aims at dealing specifically with the temporal context naturally present around an action. Rather than focusing on the single annotated frame of the action to spot, we consider different temporal segments surrounding it and shape our loss function accordingly. We test our loss on SoccerNet, a large dataset of soccer videos, showing an improvement of 12.8% over the baseline. We also show the generalization capability of our loss function on ActivityNet for activity proposals and detection, by spotting the beginning and the end of each activity. Furthermore, we provide an extended ablation study and identify challenging cases for action spotting in soccer videos. Finally, we qualitatively illustrate how our loss induces a precise temporal understanding of actions, and how such semantic knowledge can be leveraged to design a highlights generator.
Aside from automotive, consumer, and robotics applications, sports is considered one of the most valuable applications of computer vision, capping $91 billion of annual market revenue, with $28.7 billion originating from the European soccer market. Recent advances helped provide automated tools to understand and analyze broadcast games. For instance, current computer vision methods are able to localize the field and its lines [17, 24], detect players [12, 62], their motion [18, 40], their pose [7, 66], and their team, and track the ball position [50, 56] and the camera motion. Understanding such information can be useful in enhancing the visual experience of sports viewers and in gathering statistics about the players. However, these analyses only focus on spatial frame-wise information, providing per-player statistics rather than higher-level game understanding.
For broadcast producers, it is of paramount importance to have a deeper understanding of the game actions. For instance, live broadcast production follows specific patterns when particular actions occur; sports live reporters comment on the game actions; and highlights producers generate short summaries by ranking the most representative actions within the game. In order to automate these production tasks, computer vision methods should understand the salient actions of a game and respond accordingly. While spatial information is widely studied and quite mature (as evidenced by current player and ball detectors), localizing actions in time remains a challenging task for current video understanding algorithms.
In this paper, we target the action spotting challenge, with a primary application on soccer videos. The task of action spotting has been defined as the temporal localization of human-induced events annotated with a single timestamp. Inherent difficulties arise from such annotations: their sparsity, the absence of start and end times of the actions, and their temporal discontinuities, i.e. the unsettling fact that adjacent frames may be annotated differently albeit being possibly highly similar. To overcome these issues, we propose a novel loss that leverages the temporal context information naturally present around the actions, as depicted in Figure 1. To highlight its generality and versatility, we showcase how our loss can be used for the task of activity localization in ActivityNet, by spotting the beginning and end of each activity. Using the BMN network and simply substituting its loss with our enhanced context-aware spotting loss function, we show an improvement of 0.15% in activity proposals, leading to a direct 0.38% improvement in activity detection on ActivityNet. On the large-scale soccer-centric action spotting dataset, SoccerNet, our network substantially increases the Average-mAP spotting metric from 49.7% to 62.5%. We will release our code shortly.
Contributions. We summarize our contributions as follows. (i) We present a new loss function for temporal action segmentation, further used for the task of action spotting, which is parameterized by the time-shifts of the frames from the ground-truth actions. (ii) We improve the performance of the state-of-the-art method on ActivityNet by including our new contextual loss to detect activity boundaries, and improve the action spotting baseline of SoccerNet by 12.8% in Average-mAP. (iii) We provide detailed insights into our action spotting performance, as well as a qualitative application for automatic highlights generation.
Broadcast soccer video understanding. Computer vision tools are widely used in sports broadcast videos to provide soccer analytics [42, 57]. Current challenges lie in understanding high-level game information to identify salient game actions [13, 60], perform automatic game summarization [49, 51, 61], or report commentaries of live actions. Early work uses camera shots to segment broadcasts, or analyzes production patterns to identify salient moments of the game. Further developments have used low-level semantic information in Bayesian frameworks [25, 55] to automatically detect salient game actions.
SoccerNet provides an in-depth analysis of deep frame feature extraction and aggregation for action spotting in soccer game broadcasts. Multi-stream networks merge additional optical flow [10, 59] or excitement [6, 51] information to improve game highlights identification. Furthermore, attention models are fed with per-frame semantic information, such as pixel information or player localization, to extract targeted frame features. In our work, we leverage the temporal context information around actions to handle the intrinsic temporal patterns representing these actions.
Deep video understanding models are trained with large-scale datasets. While early works leveraged small custom video sets, a few large-scale datasets are available and worth mentioning, in particular Sports-1M  for generic sports video classification, MLB-Youtube  for baseball activity recognition, and GolfDB  for golf swing sequencing. These datasets all tackle specific tasks in sports. In our work, we use SoccerNet  to assess the performance of our context-aware loss for action spotting in soccer videos.
Activity understanding. Recent video challenges  have brought attention to activity localization, to find temporal boundaries of activities. Following object localization practices, current work has proposed a two-stage approach with proposal generation  and classification . SSN  models each action instance with a structured temporal pyramid, TURN TAP  predicts action proposals and regresses the temporal boundaries, while GTAN  dynamically optimizes the temporal scale of each action proposal with Gaussian kernels. BSN , MGG  and BMN  have been used to temporally search for activity boundaries, showing state-of-the-art performances on both ActivityNet 1.3  and Thumos’14 .
Alternatively, ActionSearch  tackles the spotting task iteratively, learning to predict which frame to visit next in order to spot a given activity. However, this method requires sequences of temporal annotations by human annotators to train the models. Such annotation sequences are not readily available for datasets outside ActivityNet. Also, Alwassel et al.  define an action spot as positive as soon as it lands within the boundary of an activity, which is less constraining than the action spotting defined in SoccerNet .
Recently, Sigurdsson et al. question the sharpness of activity boundaries and show that human agreement on temporal boundaries reaches an average tIoU of 72.5% on Charades and 58.7% on MultiTHUMOS. Alwassel et al. confirm such disparity on ActivityNet, but also show that it does not constitute a major roadblock to progress in the field. Different from activity localization, SoccerNet proposes an alternative action spotting task for soccer action understanding, leveraging a well-defined set of soccer rules that define a single temporal anchor per action. In our work, we improve the SoccerNet action spotting baseline by introducing a novel context-aware loss that temporally slices the vicinity of the action spots. Also, we integrate our loss for generic activity localization and detection in a boundary-based method [34, 36].
We address the action spotting task by developing a context-aware loss for a temporal segmentation module, and a YOLO-like loss for an action spotting module that outputs the spotting predictions of the network. We first present the re-encoding of the annotations needed for the segmentation and spotting tasks, then we explain how the losses of these modules are computed based on the re-encodings.
Problem definition. We denote the number of classes of the action spotting problem. Each action is identified by a single action frame and is encoded either as a one-hot class vector for action frames or as a vector of zeros for background frames. We also denote the number of frames in a video.
To train our network, the initial annotations are re-encoded in two different ways: with a time-shift encoding used for the temporal segmentation loss, and with a YOLO-like encoding used for the action spotting loss.
Time-shift encoding (TSE) for temporal segmentation. We slice the temporal context around each action into segments related to their distance from the action, as shown in Figure 2. The segments regroup frames that are either far before, just before, just after, far after an action, or in transition zones between these segments.
We use the segments in our temporal segmentation module so that its segmentation scores reflect the following ideas. (1) Far before an action spot of some class, we cannot foresee its occurrence. Hence, the score for that class should indicate that no action is occurring. (2) Just before an action, its occurrence is uncertain. Therefore, we do not influence the score towards any particular direction. (3) Just after an action has happened, plenty of visual cues allow for the detection of the occurrence of the action. The score for its class should reflect the presence of the action. (4) Far after an action, the score for its class should indicate that it is not occurring anymore. The segments around the actions of class are delimited by four temporal context slicing parameters as shown in Figure 2.
The context slicing is used to perform a time-shift encoding (TSE) of each frame of a video with a vector of length , containing class-wise information on the relative location of with respect to its closest past or future actions. The TSE of for class , noted , is the time-shift (i.e. difference in frame indices) of from either its closest past or future ground-truth action of class , depending on which has the dominant influence on . We set as the time-shift from the past action if either (i) is just after the past action; or (ii) is in the transition zone after the past action, but is far before the future action; or (iii) is in the transition zones after the past and before the future actions while being closer to the past action. In all other cases, is the time-shift from the future action.
If a frame is located both far after the past action and far before the future action, selecting either of the two time-shifts has the same effect in our loss. Furthermore, for the frames located either before the first or after the last annotated action of a class, only one time-shift can be computed, and it is the one used. Finally, if no action of a class is present in the video, then we encode all the frames as if they were located far before their closest future action, which induces the same behavior in our loss.
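The time-shift encoding above can be sketched in a few lines. This is a simplified variant (function name and the "closest action" rule are ours): each frame of a class is labeled with its signed frame-index offset from the ground-truth action that dominates it, which we approximate here as the closest action, rather than reproducing the paper's full transition-zone rules.

```python
import numpy as np

def time_shift_encoding(action_frames, num_frames):
    """Simplified TSE for one class: each frame gets its signed
    frame-index offset from the closest ground-truth action
    (negative before the action, positive after it)."""
    if len(action_frames) == 0:
        # No action of this class: behave as if every frame were far
        # before a future action; we encode that as -inf here.
        return np.full(num_frames, -np.inf)
    frames = np.arange(num_frames)[:, None]        # shape (F, 1)
    actions = np.asarray(action_frames)[None, :]   # shape (1, A)
    shifts = frames - actions                      # shape (F, A)
    closest = np.abs(shifts).argmin(axis=1)        # dominant action per frame
    return shifts[np.arange(num_frames), closest]

# e.g. two actions at frames 3 and 8 in a 10-frame clip
tse = time_shift_encoding([3, 8], 10)
```

Frames before frame 3 get negative shifts from the first action, frames between the two actions switch from positive (after the first) to negative (before the second), mirroring the slicing of Figure 2.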
YOLO-like encoding for action spotting. Inspired by YOLO , each ground-truth action of the video engenders an action vector composed of values. The first value is a binary indicator of the presence () of the action. The second value is the location of the frame annotated as the action, computed as the index of that frame divided by . The remaining values represent the one-hot encoding of the action. We encode a whole video containing actions in a matrix Y of dimension , with each line representing an action vector of the video.
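The YOLO-like encoding can be sketched as follows, assuming (as described above) one row per ground-truth action with a presence indicator, a normalized location, and a one-hot class vector; the function name is ours.

```python
import numpy as np

def encode_actions(actions, num_frames, num_classes):
    """actions: list of (frame_index, class_index) pairs.
    Returns an (A, 2 + num_classes) matrix Y whose rows are
    [presence, normalized location, one-hot class encoding]."""
    Y = np.zeros((len(actions), 2 + num_classes))
    for i, (frame, cls) in enumerate(actions):
        Y[i, 0] = 1.0                  # binary presence indicator
        Y[i, 1] = frame / num_frames   # frame index normalized to [0, 1]
        Y[i, 2 + cls] = 1.0            # one-hot class encoding
    return Y

# a goal (class 0) at frame 30 and a substitution (class 2) at frame 90
Y = encode_actions([(30, 0), (90, 2)], num_frames=120, num_classes=3)
```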
Temporal segmentation loss. The TSE parameterizes the temporal segmentation loss described below. For clarity, we consider the segmentation score output by the segmentation module for a frame to belong to a given class, together with the TSE of that frame for that class. We detail the loss generated in this setting. First, in accordance with Figure 2, we compute it as follows:
Then, following the practice in [14, 48] to help the network focus on improving its worst segmentation scores, we zero out the loss for scores that are satisfying enough. In the case of Equation (4) when , we say that a score is satisfactory when it exceeds some maximum margin . In the cases of Equations (1) and (6), we say that a score is satisfactory when it is lower than some minimum margin . The range of values for that leads to zeroing out the loss varies with and the slicing parameters in most cases. This is achieved by revising as in Equations (7) and (8). Figure 1 shows a representation of .
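Since the equations themselves are not reproduced here, the margin-based zeroing can be illustrated with a hinge-style sketch (the function name and the margin values are our own, not the paper's exact equations): frames that should score high are penalized only below a maximum margin, and frames that should score low only above a minimum margin.

```python
def hinged_segmentation_loss(score, target_high, margin_min=0.1, margin_max=0.9):
    """Illustrative per-frame loss with margin-based zeroing.
    target_high=True  -> frame just after an action: penalize only
                         while the score is below the max margin.
    target_high=False -> frame far from any action: penalize only
                         while the score is above the min margin."""
    if target_high:
        return max(0.0, margin_max - score)  # zeroed once score >= margin_max
    return max(0.0, score - margin_min)      # zeroed once score <= margin_min
```

A score of 0.95 on a "just after" frame is satisfactory and contributes nothing, while a score of 0.5 on the same frame is still pushed upward; symmetrically for "far from action" frames.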
Finally, the segmentation loss for a given video of frames is given in Equation (9).
Action spotting loss. A fixed number of action spotting predictions is generated by our network for each video. Those predictions are encoded in a matrix of the same form as Y.
We leverage an iterative one-to-one matching algorithm to pair each of the ground-truth actions with a prediction. First, we match each ground-truth location with its closest predicted location, and vice-versa (i.e. we match the predicted locations with their closest ground-truth locations). Next, we form the pairs of locations that reciprocally match, we remove them from the process, and we iterate until all ground truths are coupled with a prediction. Consequently, we build a reorganized version of the encoded actions, such that each ground truth and its assigned prediction reciprocally match.
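The iterative reciprocal matching can be sketched directly from the description above (the function name is ours; locations are scalars for simplicity, and we assume at least as many predictions as ground truths):

```python
def reciprocal_match(gt_locs, pred_locs):
    """Iteratively pair ground-truth locations with predictions:
    at each round, couple the pairs that are mutually closest,
    remove them, and repeat until every ground truth is matched."""
    gt_pool = dict(enumerate(gt_locs))
    pred_pool = dict(enumerate(pred_locs))
    pairs = {}
    while gt_pool:
        matched = []
        for g, gl in gt_pool.items():
            # closest remaining prediction to this ground truth
            p = min(pred_pool, key=lambda j: abs(pred_pool[j] - gl))
            # reciprocal check: is g also the closest ground truth to p?
            g_back = min(gt_pool, key=lambda i: abs(gt_pool[i] - pred_pool[p]))
            if g_back == g:
                matched.append((g, p))
        for g, p in matched:
            pairs[g] = p
            gt_pool.pop(g)
            pred_pool.pop(p)
    return pairs
```

The loop always terminates because, at each round, the globally closest remaining (ground truth, prediction) pair is necessarily reciprocal.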
We define the action spotting loss in Equation (10). It corresponds to a weighted sum of the squared errors between the matched predictions and a regularization on the confidence score of the unmatched predictions.
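A hedged sketch of such a loss is given below; the weights `alpha` and `beta` and the function name are our own placeholders, not the paper's notation. Matched predictions are compared to their ground-truth action vectors with a squared error, and the confidence scores (column 0) of the unmatched predictions are regularized toward zero.

```python
import numpy as np

def spotting_loss(Y_gt, Y_pred, matched_idx, alpha=1.0, beta=0.1):
    """Illustrative spotting loss: squared error between each
    ground-truth action vector and its matched prediction, plus a
    penalty pushing unmatched confidence scores toward zero."""
    matched = np.asarray(matched_idx)
    err = ((Y_pred[matched] - Y_gt) ** 2).sum()
    unmatched = np.setdiff1d(np.arange(len(Y_pred)), matched)
    reg = (Y_pred[unmatched, 0] ** 2).sum()  # column 0 = confidence score
    return alpha * err + beta * reg
```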
Complete loss. The final loss is presented in Equation (11) as a weighted sum of and .
Network for action spotting. The architecture of the network is illustrated in Figure 3 and further detailed in the supplementary material. We leverage the frame feature representations of the videos (e.g. ResNet) provided with the dataset, embodied as the output of the frame feature extractor of Figure 3. The temporal CNN of Figure 3 is composed of a spatial two-layer MLP, followed by four multi-scale 3D convolutions (i.e. across time, features, and classes). The temporal CNN outputs a set of features for each frame, organized in one feature vector per class. These features are input into a segmentation module, in which we use batch normalization and sigmoid activations. The closeness of the vectors obtained in this way to a pre-defined vector gives the segmentation scores output by the segmentation module. The features obtained previously are concatenated with the segmentation scores and fed to the action spotting module, as shown in Figure 3. It is composed of three successive temporal max-poolings and 3D convolutions, and outputs prediction vectors in which the first two elements are sigmoid-activated and the remaining class scores are softmax-activated. The activated vectors are stacked to produce the prediction matrix for the action spotting task.
Data. Three classes of action are annotated in SoccerNet by Giancola et al.: goals, cards, and substitutions. They identify each action by one annotated frame: the moment the ball crosses the line for goal, the moment the referee shows a player a card for card, and the moment a new player enters the field for substitution. We train our network on the frame features already provided with the dataset. Giancola et al. first subsampled the raw videos at 2 fps, then extracted the features with a backbone network and reduced them by PCA to 512 features per frame of the subsampled videos. Three sets of features are provided, each extracted with a particular backbone network: I3D, C3D, and ResNet.
Action spotting metric. We measure performances with the action spotting metric introduced in SoccerNet. An action spot is defined as positive if its temporal offset from its closest ground truth is less than a given tolerance. The average precision (AP) is estimated based on precision-recall curves, then averaged between classes (mAP). An Average-mAP is proposed as the AUC of the mAP over tolerances ranging from 5 to 60 seconds.
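The final aggregation step can be made concrete with a small sketch (function name ours): the Average-mAP is the area under the mAP-vs-tolerance curve, normalized by the tolerance range, which we approximate here with the trapezoidal rule.

```python
def average_map(tolerances, map_values):
    """Average-mAP sketch: normalized area under the
    mAP-vs-tolerance curve (trapezoidal rule), tolerances in
    seconds, mAP values in [0, 1]."""
    t = [float(x) for x in tolerances]
    m = [float(x) for x in map_values]
    area = sum((m[i] + m[i + 1]) / 2 * (t[i + 1] - t[i])
               for i in range(len(t) - 1))
    return area / (t[-1] - t[0])
```

For instance, a constant mAP of 0.5 over tolerances from 5 to 60 seconds yields an Average-mAP of 0.5, as expected for a normalized AUC.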
Experimental setup. We train our network on batches of chunks, where a chunk is a set of contiguous frame feature vectors. We choose the chunk size to maintain a high training speed while retaining sufficient contextual information; it corresponds to a clip of a few minutes of raw video. A batch contains chunks extracted from a single raw video. We extract a chunk around each ground-truth action, such that the action is randomly located within the chunk. Then, to balance the batch, we randomly extract additional chunks composed of background frames only. An epoch ends when the network has been trained on one batch per training video. At each epoch, new batches are re-computed for each video for data augmentation purposes. Each raw video is time-shift encoded before training, and each new training chunk is encoded with the YOLO-like encoding.
The number of action spotting predictions generated by the network per chunk is fixed, chosen so that, as we observed, no chunk of raw video contains more actions than the network can predict. We train the network with an initial learning rate that decreases linearly during training. We use Adam as the optimizer with default parameters.
For the segmentation loss, we set the margins in Equations (7) and (8) following prior practice. For the action spotting loss in Equation (10), the weight of the location components of the predictions is optimized (see below). Similarly, we optimize the balance between the loss of the action vectors and the regularization of the remaining predictions. For the final loss in Equation (11), we optimize the balance between the two losses.
Hyperparameter optimization. For each set of features (I3D, C3D, ResNet), we perform a joint Bayesian optimization  on the number of frame features extracted per class, on the temporal receptive field of the network (i.e. temporal kernel dimension of the 3D convolutions), and on the parameters . Next, we perform a grid search optimization on the slicing parameters .
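The grid-search step over the slicing parameters can be sketched generically (function and parameter names are illustrative): every combination on the grid is scored, e.g. by validation Average-mAP, and the best is kept.

```python
import itertools

def grid_search(param_grid, score_fn):
    """Generic grid-search sketch: evaluate every combination of the
    parameters and keep the best-scoring one.
    param_grid: dict mapping parameter name -> list of candidate values.
    score_fn:   callable mapping a parameter dict -> validation score."""
    best_params, best_score = None, float("-inf")
    for combo in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# toy usage: a score peaking at K1=2, K2=3
best, score = grid_search(
    {"K1": [1, 2, 3], "K2": [2, 3, 4]},
    lambda p: -((p["K1"] - 2) ** 2) - ((p["K2"] - 3) ** 2),
)
```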
For ResNet, we obtain the optimized parameters below. The slicing parameters are class-specific, with distinct values for goals, cards, and substitutions. Given the framerate of 2 fps, those values can be translated to seconds by scaling them down by a factor of 2. The optimized value corresponds to a temporal receptive field of several seconds on both sides of the central frame in the temporal dimension of the 3D convolutions.
| Method | I3D | C3D | ResNet |
| SoccerNet baseline 5s | - | - | 34.5 |
| SoccerNet baseline 60s | - | - | 40.6 |
| SoccerNet baseline 20s | - | - | 49.7 |
Main results. The performances obtained with the optimized parameters are reported in Table 1. As shown, we establish a new state-of-the-art performance on the action spotting task of SoccerNet, outperforming the previous benchmark by a comfortable margin, for all the frame features. ResNet gives the best performance, as also observed in . A sensitivity analysis of the parameters reveals robust performances around the optimal values, indicating that no heavy fine-tuning is required for the context slicing. Also, performances largely decrease as the slicing is strongly reduced, which emphasizes its usefulness.
Ablation study. Since the ResNet features provide the best performance, we use them with their optimized parameters for the following ablation studies. (i) We remove the segmentation module, which is equivalent to zeroing its weight in Equation (11); this also removes the context slicing and the margins. (ii) We remove the action context slicing, such that the ground truth for the segmentation module is the raw binary annotations, i.e. all the frames must be classified as background except the action frames; this is equivalent to setting the slicing parameters to zero. (iii) We remove the margins that help the network focus on improving its worst segmentation scores, by zeroing them in Equations (7) and (8). (iv) We remove the iterative one-to-one matching between the ground truth Y and the predictions before the action spotting loss, which is equivalent to using the unmatched predictions in Equation (10). The results of the ablation studies are shown in Table 2.
From an Average-mAP perspective, the auxiliary task of temporal segmentation improves the performance on the action spotting task, which is a common observation in multi-task learning. When the segmentation is performed, our temporal context slicing gives a significant boost compared to using the raw binary annotations. This observation is in accordance with the sensitivity analysis. It also appears that it is preferable not to use the segmentation at all rather than to use it with the raw binary annotations, which further underlines the usefulness of the context slicing. A boost in performance is also observed when we use the margins to help the network focus on improving its worst segmentation scores. Finally, Table 2 shows that it is extremely beneficial to match the predictions of the network with the ground truth before the action spotting loss. This makes sense, since there is no point in evaluating the network on its ability to order its predictions, which is a hard and unnecessary constraint. The large impact of the matching is also justified by its direct implication in the action spotting task assessed through the Average-mAP.
Results through game time.
In soccer, it makes sense to analyze the performance of our model through game time, since the actions are not uniformly distributed throughout the game. For example, a substitution is more likely to occur during the second half of a game. We consider non-overlapping bins of a few minutes of game time and compute the Average-mAP for each bin. Figure 4 shows the evolution of this metric through game time.
It appears that actions occurring during the first five minutes of a half are substantially more difficult to spot than the others. This may be partially explained by the occurrence of some of these actions at the very beginning of a half, for which the temporal receptive field of the network requires the chunk to be temporally padded. Hence, some information may be missing to allow the network to spot those actions. Besides, when substitutions occur during the break, they are annotated on the first frame of the second half, which makes them practically impossible to spot. In the test set, this happens for a fraction of the matches. None of these substitutions are spotted by our model, which thus degrades the performances during the first minutes of play in the second halves of the matches. However, they represent only a small fraction of all the substitutions, and removing them from the evaluation only marginally boosts our Average-mAP.
Results as function of action vicinity. We investigate whether actions are harder to spot when they are close to each other. We bin the ground-truth actions based on the distance that separates them from the previous (or next, depending on which is the closest) ground-truth action, regardless of their classes. Then, we compute the Average-mAP for each bin. The results are represented in Figure 5.
We observe that the actions are more difficult to spot when they are close to each other. This could be due to the reduced number of visual cues, such as replays, when an action occurs rapidly after another and thus must be broadcast. Some confusion may also arise because the replays of the first action can still be shown after the second action, e.g. a sanctioned foul followed by a converted penalty. This analysis also shows that the action spotting problem is challenging even when the actions are further apart, as the performances in Figure 5 eventually plateau.
Per-class results. We perform a per-class analysis in a similar spirit to the Average-mAP metric. For a given class, we fix a tolerance around each annotated action to determine positive predictions, and we aggregate these results in a confusion matrix. An action is considered spotted when its confidence score exceeds some threshold optimized for the F1 score on the validation set. From the confusion matrix, we compute the precision, recall, and F1 score for that class and that tolerance. Varying the tolerance from 5 to 60 seconds provides the evolution of the three metrics as a function of the tolerance. Figure 6 shows these curves for goals, for our model and for the predictions of the baseline. The results for cards and substitutions are provided in the supplementary material.
Figure 6 shows that most goals can be efficiently spotted by our model within a few seconds around the ground truth, with a high precision for that tolerance. The previous baseline plateaus at a larger tolerance and still has a lower performance. Goals in particular benefit from many visual cues that facilitate their spotting, e.g. multiple replays, particular camera views, or celebrations from the players and the public.
In this section, we evaluate our context-aware loss on a more generic task than action spotting in soccer videos. We tackle the activity proposal and activity detection tasks of the challenging ActivityNet dataset, for which we use the ResNet features provided with the dataset.
Setup. We use the current state-of-the-art network, namely BMN, with its publicly provided code. BMN is equipped with a temporal evaluation module (TEM), which plays a similar role to our temporal segmentation module. We replace the loss associated with the TEM by our novel temporal segmentation loss. The slicing parameters are set identically for all the classes and are optimized with respect to the AUC performance on the validation set by grid search under a constraint on the parameters.
Results. The average performances over several runs of our experiment and of the BMN base code are reported in Table 3. Our novel temporal segmentation loss improves the performance obtained with BMN by 0.15% and 0.12% on the activity proposal task (AR@100 and AUC) and by 0.38% on the activity detection task (Average-mAP). These increases compare with some recent increments, while being obtained just by replacing the TEM loss with our context-aware segmentation loss; the network thus has the same architecture and number of parameters. We conjecture that our loss, through its particular context slicing, helps train the network by modelling the uncertainty surrounding the annotations. Indeed, it has been shown in [3, 52] that a large variability exists among human annotators as to which frames to annotate as the beginning and the end of the activities of the dataset. Let us note that in BMN, the TEM loss is already somewhat adapted around the action frames in order to mitigate the penalization attributed to their neighboring frames. Our work goes one step further, by directly designing a temporal context-aware segmentation loss.
| Method | AR@100 | AUC | Average-mAP |
| Lin et al. | 73.01 | 64.40 | 29.17 |
| Gao et al. | 73.17 | 65.72 | - |
| Lin et al. | 74.16 | 66.17 | 30.03 |
| Lin et al. (BMN) | 75.01 | 67.10 | 33.85 |
| BMN code | 75.11 | 67.16 | 30.67 |
| Ours | 75.26 | 67.28 | 31.05 |
Some action spotting and temporal segmentation results are shown in Figure 7. It appears that some sequences of play have a high segmentation score for some classes but do not lead, quite rightly, to an action spotting. It turns out that these sequences are often related to unannotated actions of supplementary classes that resemble those considered so far, such as unconverted goal opportunities and unsanctioned fouls. Video clips of the two actions identified in Figure 7 are provided in the supplementary material.
To quantify the spotting results of goal opportunities, we can only compute the precision metric, since these actions are not annotated. We manually inspect each video sequence of the test set where the segmentation score for goals exceeds some threshold but where no ground-truth goal is present. We decide whether the sequence is a goal opportunity or not by asking two frequent observers of soccer games whether they would include it in the highlights of the match. The sequence is a true positive when they both agree to include it, and a false positive otherwise. The precision is then computed for that threshold. By gradually decreasing the threshold, we obtain the precision curve shown in Figure 8. It appears that a large majority of the sequences with a high segmentation score are considered goal opportunities. Also, the two observers disagreed on what they considered to be an interesting sequence for only a few of the sequences, all of which have a low segmentation score.
As a direct by-product, we can derive a simple automatic highlights generator without explicit supervision. We extract a video clip starting a few seconds before each spotting of a goal or a card and ending a few seconds after. We proceed likewise for the sequences with a high segmentation score for goals. Substitutions are not considered here, since they almost never appear in highlights. The clips are assembled chronologically to produce the highlights video, as provided in the supplementary material. The evaluation of the overall quality of this video is subjective, but we found its content to be adequate, even if the montage could be improved. Indeed, only sequences where a goal, a goal opportunity, or a foul occurs are selected. This reinforces the usefulness of the segmentation task, as it provides a direct overview of the proceedings of the match, including proposals for unannotated actions that are usually interesting for highlights.
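The clip assembly step can be sketched as follows (function name and the 15-second padding are our illustrative choices): a window is cut around each spotted event, and overlapping windows are merged before chronological assembly.

```python
def highlight_clips(spots, before=15.0, after=15.0):
    """Build (start, end) clip boundaries, in seconds, around spotted
    events, merging overlapping clips so the highlights play
    chronologically without duplicated footage."""
    clips = sorted((max(0.0, t - before), t + after) for t in spots)
    merged = []
    for start, end in clips:
        if merged and start <= merged[-1][1]:
            # overlaps the previous clip: extend it instead of appending
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# two nearby events merge into one clip; a distant one stays separate
clips = highlight_clips([20, 40, 200])
```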
We tackle the challenging action spotting task of SoccerNet with a novel context-aware loss for segmentation and a YOLO-like loss for the spotting. The former treats the frames according to their time-shift from their closest ground-truth actions. The latter leverages an iterative matching algorithm that alleviates the need for the network to order its predictions. To show generalization capabilities, we also test our context-aware loss on ActivityNet.
We improve upon the performance of the state-of-the-art method on ActivityNet by 0.15% in AR@100, 0.12% in AUC, and 0.38% in Average-mAP, only by including our context-aware loss without changing the architecture of the network. We achieve a new state-of-the-art performance on SoccerNet, surpassing the previous baseline by far (from 49.7% to 62.5% in Average-mAP) and spotting most actions within a few seconds of their ground truths. Both the context-aware loss and the matching algorithm are shown to be key components in this achievement. Finally, we leverage the segmentation results to identify unannotated actions, such as goal opportunities, and derive a highlights generator without specific supervision.
This work is supported by the DeepSport project of the Walloon region, Belgium, and by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research. A. Cioppa is funded by the FRIA, Belgium.
Multi-Person 3D Pose Estimation and Tracking in Sports. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
Semantic analysis of soccer video using dynamic Bayesian network. IEEE Transactions on Multimedia, 8(4):749–760, 2006.
International Conference on Tools with Artificial Intelligence (ICTAI), November 2016.
Large-scale Video Classification with Convolutional Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
Taskonomy: Disentangling Task Transfer Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
Let us recall the following notations from the paper:
is the number of classes in the spotting task.
is the number of frames in the chunk considered.
is the number of ground-truth actions in the chunk considered.
is the number of predictions output by the network for the spotting task.
is the number of features computed for each class, for each frame, before the segmentation module (see Figure 9).
is the temporal receptive field of the network (used in the temporal convolutions).
regroups the spotting predictions of the network, and has dimension . The first column represents the confidence scores for the spots, the second contains the predicted locations, and the others are per-class classification scores.
Y encodes the ground-truth action vectors of the chunk considered, and has dimension .
() denotes the context slicing parameters of class .
1. Frame feature extractor and temporal CNN. SoccerNet  provides three frame feature extractors with different backbone architectures (I3D, C3D, and ResNet). Each of them extracts features that are further reduced with a Principal Component Analysis (PCA). We use the PCA-reduced features provided with the dataset as input of our temporal CNN.
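As an aside, this kind of PCA reduction can be sketched with a plain SVD. The dataset ships the reduced features, so the code below is purely illustrative (toy dimensions for speed):

```python
import numpy as np

def pca_reduce(features, out_dim):
    """Reduce per-frame feature vectors with PCA (via SVD).

    `features` has shape (n_frames, in_dim); the output has shape
    (n_frames, out_dim).  SoccerNet ships pre-computed PCA features;
    this sketch only illustrates how such a reduction is obtained.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # rows of vt are the principal directions, sorted by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dim].T

X = np.random.randn(300, 128)      # toy stand-in for per-frame features
print(pca_reduce(X, 32).shape)     # (300, 32)
```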
The aim of the temporal CNN is to provide features for each frame, while mixing temporal information across the frames. It transforms an input of shape into an output of shape .
First, each frame is fed to a -layer MLP that reduces the dimensionality of its feature vector. We design its architecture as: FC() - ReLU - FC() - ReLU. We thus obtain a set of features, which we note .
Then, is input to a spatio-temporal pyramid, i.e. it is input in parallel to each of the following layers of the pyramid:
Conv() - ReLU
Conv() - ReLU
Conv() - ReLU
Conv() - ReLU
producing features for each frame, which are concatenated with to obtain a set of features.
Finally, we feed these features to a Conv() layer, which produces a set of features, noted .
2. Segmentation module. This module produces a segmentation score per class for each frame. It transforms into an output of dimension , through the following steps:
Reshape to have dimension .
Use a frame-wise Batch Normalization.
Activate with a sigmoid so that each frame has, for each class, a feature vector .
For each frame and each class, compute the distance between and the center of the unit hypercube , i.e. the vector whose components are all equal to 1/2.
The segmentation score is obtained from this distance and decreases as the feature vector moves away from the center of the cube. This way, high scores for a class (i.e. feature vectors close to the center of the cube) can be interpreted as indicating that the frame is likely to belong to that class.
The segmentation scores output by the segmentation module thus have dimension and are assessed through the segmentation loss .
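The distance-to-center scoring can be sketched as follows. The exact score mapping of the paper is not reproduced here; the normalization below is one plausible choice that maps scores into [0, 1], with 1 at the center of the cube:

```python
import numpy as np

def segmentation_scores(feats):
    """Turn sigmoid-activated per-class feature vectors into scores.

    `feats` has shape (n_frames, n_classes, n_feat), each entry in
    (0, 1).  The distance to the hypercube center (all components 1/2)
    is at most sqrt(n_feat)/2, so the normalization below maps the
    score into [0, 1].  The exact mapping used in the paper is not
    reproduced here; this is one plausible choice.
    """
    n_feat = feats.shape[-1]
    center = np.full(n_feat, 0.5)
    dist = np.linalg.norm(feats - center, axis=-1)
    return 1.0 - 2.0 * dist / np.sqrt(n_feat)
```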
3. Spotting module. The spotting module takes as input and , and outputs the spotting predictions of the network. It is composed of the following layers:
ReLU on , then concatenate with . This results in features.
Temporal max-pooling with a stride.
Conv() - ReLU
Temporal max-pooling with a stride.
Conv() - ReLU
Temporal max-pooling with a stride.
Flatten the resulting features, which yields .
Feed to a FC() layer, then reshape to and use sigmoid activation. This produces the confidence scores and the predicted locations for the action spots.
Feed to a FC() layer, then reshape to and use softmax activation on each row. This produces the per-class predictions for the action spots.
Concatenate the confidence scores, predicted locations, and per-class predictions to produce the spotting predictions of shape .
Eventually, is assessed through the action spotting loss .
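The final assembly of the spotting output can be sketched with numpy. Dimensions are illustrative; only the activation choices (sigmoid for confidence/location, row-wise softmax for classes) and the concatenation layout follow the description above:

```python
import numpy as np

def assemble_predictions(raw_cl, raw_cls):
    """Assemble the spotting output from the two head activations.

    `raw_cl` (n_pred, 2) holds raw confidence/location logits, passed
    through a sigmoid; `raw_cls` (n_pred, n_classes) holds class
    logits, passed through a row-wise softmax.  The concatenation
    mirrors the (confidence, location, per-class scores) layout
    described above; dimensions are illustrative.
    """
    conf_loc = 1.0 / (1.0 + np.exp(-raw_cl))            # sigmoid
    e = np.exp(raw_cls - raw_cls.max(axis=1, keepdims=True))
    classes = e / e.sum(axis=1, keepdims=True)          # stable softmax
    return np.concatenate([conf_loc, classes], axis=1)
```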
The time-shift encoding (TSE) described in the paper is further detailed below. We note the TSE of frame related to class .
We denote (resp. ) the difference between the frame index of and the frame index of its closest past (resp. future) ground-truth action of class . They constitute the time-shifts of from its closest past and future ground-truth actions of class , expressed in number of frames (i.e. if frames and are actions of class , then frame has and ). We set for a frame corresponding to a ground-truth action of class , thus ensuring the relations . The TSE is defined as the time-shift among related to the action that has the dominant influence on . The rules used to determine which time-shift is selected are the following:
if : keep , because is located just after the past action, which still strongly influences .
if : is in the transition zone after the past action, whose influence weakens, thus the decision depends on how far away the future action is:
if : keep , because is located far before the future action, which does not yet influence .
if : The future action may be close enough to influence :
if : keep , because is closer to the just after region of the past action than it is to the just before region of the future action, with respect to the size of the transition zones.
else: keep , because the future action influences more than the past action.
if : keep , because is located far after the past action, which does not influence anymore.
For completeness, let us recall the following details mentioned in the main paper. If is both located far after the past action and far before the future action, selecting either of the two time-shifts has the same effect in our loss. Furthermore, for the frames located either before the first or after the last annotated action of class , only one time-shift can be computed and is thus set as . Finally, if no action of class is present in the video, then we set for all the frames. This induces the same behavior in our loss as if they were all located far before their closest future action.
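The selection rules above can be sketched in plain Python. The three thresholds are hypothetical stand-ins for the per-class context-slicing parameters, and the relative-proximity test in the transition zone is one plausible reading of the rule; the paper's exact conditions may differ:

```python
import math

def tse(d_past, d_future, just_after, trans_end, near_future):
    """Pick the dominant time-shift for one frame and one class.

    `d_past` / `d_future` are the (non-negative) distances in frames to
    the closest past / future ground-truth action (math.inf when no
    such action exists).  The three thresholds are illustrative
    stand-ins for the per-class context-slicing parameters:
    `just_after` closes the "just after" zone, `trans_end` closes the
    transition zone, and `near_future` bounds the zone where a future
    action already exerts influence.  Returns the selected signed
    time-shift (negative towards the future).
    """
    if d_past < just_after:                  # still under the past action
        return d_past
    if d_past < trans_end:                   # transition zone
        if d_future > near_future:           # future action still far away
            return d_past
        # future action close enough: compare relative proximities,
        # normalized by the size of each zone
        if d_past / trans_end < d_future / near_future:
            return d_past
        return -d_future
    return -d_future                         # past action too far away
```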
The TSE is used to shape our novel context-aware loss function for the temporal segmentation module. The cases described above ensure the temporal continuity of the loss, regardless of the proximity between two actions of the same class, except at frames annotated as ground-truth actions. This temporal continuity can be visualized in Figure 11, which shows a representation of (analogous to Figure 1) when two actions are close to each other. It is further illustrated in the video clip 3dloss.mp4 provided with this document, where we gradually vary the location of the second action. For each location of the second action, the TSE of all the frames is re-computed, and so is the loss.
Per-class results. As for the class goal in Figure 6 of the main paper, Figures 12 and 13 display the number of TP, FP, FN and the precision, recall and metrics for the classes card and substitution as a function of the tolerance allowed for the localization of the spots.
Figure 12 shows that most cards can be efficiently spotted by our model within a few seconds of the ground truth, with a good precision at that tolerance. The previous baseline plateaus only at larger tolerances and still reaches a lower performance.
Figure 13 shows that most substitutions can be efficiently spotted by our model within a few seconds of the ground truth, again with a good precision. The previous baseline reaches a similar performance at that tolerance, and catches up only at larger tolerances around the ground truth.
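The tolerance-based evaluation behind these curves can be sketched as follows. A prediction counts as a true positive if it falls within the tolerance of a still-unmatched ground truth; the matching order and tie-breaking of the official protocol may differ:

```python
def spotting_pr(preds, gts, tolerance):
    """Precision/recall of spotted timestamps at a given tolerance.

    A prediction is a true positive if it falls within `tolerance`
    seconds of a still-unmatched ground-truth timestamp.  Illustrative
    of the evaluation in the figures; matching order and tie-breaking
    may differ from the official protocol.
    """
    remaining = list(gts)
    tp = 0
    for p in sorted(preds):
        best = None
        for g in remaining:
            if abs(p - g) <= tolerance and (best is None or abs(p - g) < abs(p - best)):
                best = g
        if best is not None:
            tp += 1
            remaining.remove(best)
    fp = len(preds) - tp
    fn = len(gts) - tp
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall
```

Sweeping `tolerance` over a range of values and averaging the resulting mAP yields the Average-mAP reported throughout.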
Except for the precision on substitutions at large tolerances, our model outperforms the previous baseline of SoccerNet . As mentioned in the paper, many visual cues facilitate the spotting of goals, e.g. multiple replays, particular camera views, or celebrations from the players and the public. Cards and substitutions are more difficult to spot, since the moment the referee shows a player a card and the moment a new player enters the field to replace another are rarely replayed (e.g. for cards, the foul is replayed, not the sanction). Also, the number of visual cues that allow their identification is reduced, as these actions generally do not lead to celebrations from the players or the public. Besides, cards and substitutions may not be broadcast in full screen, as they are sometimes merely shown from the main camera and are thus barely visible. Finally, substitutions occurring during half-time are practically impossible to spot, as noted in the main paper.
Segmentation loss analysis. We provide a supplementary analysis on the parameter, which balances the segmentation loss and the action spotting loss in Equation 11 of the main paper. We fix different values of and train a network for each value. We show the segmentation scores on one game for the goal class in Figure 14. We also display the Average-mAP for the whole test set for the different values of .
It appears that extreme values of substantially influence both the action spotting performance and the segmentation curves, hence the automatic highlights generation. Small values (i.e. ) produce a segmentation that is useless for spotting the interesting unannotated goal opportunities. This is because the loss does not provide a sufficiently strong feedback for the segmentation task, as it does not penalize the segmentation scores enough. These values of also lead to a decrease in the Average-mAP for the action spotting task, as already observed in the ablation study presented in the main paper. Moreover, very large values () penalize the unannotated goal opportunities too much, as the network is then forced to output very small segmentation scores for them. Such actions are thus more difficult to retrieve for the production of highlights. These values of also lead to a large decrease in the Average-mAP for the action spotting task, as the feedback of the segmentation loss overshadows the feedback of the spotting loss. Finally, it seems that for , the spotting performance is high while the segmentation scores remain informative on goal opportunities. These values lead to the spotting of several goal opportunities, shown in Figure 14, which might be included in the highlights automatically generated for this match by the method described in the main paper.
Figure 15 shows additional action spotting and segmentation results. We can identify actions that are unannotated but display high segmentation scores such as goal opportunities and unsanctioned fouls. A goal opportunity around the minute can be identified through the segmentation results. Besides, a false positive spot (green star) for a card is predicted by our network around the minute, further supported by a high segmentation score. A manual inspection reveals that a severe unsanctioned foul occurs at this moment. The automatic highlights generator presented in the main paper would include it in the summary of the match. Even though this foul does not lead to a card for the offender, the content of this sequence corresponds to an interesting action that would be tolerable in a highlights video.
of the main paper. We can see that the LED panel used by the referee to announce substitutions is visible on the frame. This may indicate that the network learns, quite rightly, to associate this panel with substitutions. As a matter of fact, at this moment, even the commentator announces that a substitution is probably imminent.