Supervision Levels Scale (SLS)

08/22/2020 ∙ by Dima Damen, et al. ∙ University of Bristol

We propose a three-dimensional, discrete and incremental scale to encode a method's level of supervision, i.e. the data and labels used when training a model to achieve a given performance. We capture three aspects of supervision that are known to give methods an advantage while incurring additional costs: pre-training, training labels and training data. The proposed three-dimensional scale can be included in result tables or leaderboards to handily compare methods not only by their performance, but also by the level of data supervision utilised by each method. The Supervision Levels Scale (SLS) is first presented generally, for any task/dataset/challenge. It is then applied to the EPIC-KITCHENS-100 dataset, to be used for the various leaderboards and challenges associated with this dataset.




1 Motivation

Performance gains for supervised learning methods have been achieved through a variety of means. Of these, additional loss/regularisation functions, architectural changes, and data augmentation techniques have all demonstrated high performance gains. Acknowledging that models are typically data-hungry, a primary way to improve performance is to increase the amount of labelled training data. However, this is restricted to problems where additional data is readily available and labelling costs are not an issue. Alternatively, using a large-scale dataset for pre-training, before fine-tuning on the target data, has been shown to provide a significant advantage. Particularly of note, there has been a recent surge of methods using self-supervised pre-training, which offers a competitive alternative to increasing the amount of labelled pre-training data. Additionally, during fine-tuning, some methods use additional training labels, such as segmentation or object graphs, which have also provided performance boosts.

All of the above makes method comparisons far from trivial. One method can outperform another purely because of the amount and type of data used for pre-training (i.e. initialisation), or because additional labels have been utilised during training. We propose a discrete and incremental scale that can be incorporated into leaderboards to handily capture the levels of supervision of methods compared on the same test set, thus allowing direct comparison between methods that use the same level of supervision. Additionally, this scale enables assessing the impact of the various dimensions of supervision on performance.

Figure 1: The three-dimensional SLS incrementally captures levels of supervision: pre-training, training labels and amount of training data.

It is important to note that the proposed scale does not include any knowledge of the method’s implementation details (e.g. augmentation, optimisation choice, loss functions, …). Instead, it focuses on the data-relevant knowledge available to the researchers before they make any method-relevant choices. This makes the scale suitable for anonymous submissions to leaderboards, not disclosing methods’ novelties.

We thus restrict our scale to include three supervision dimensions (Fig 1). These are:

  1. Pre-training: The amount of data used in pre-training gives an advantage, particularly when the type of data used in pre-training is relevant to the task.

  2. Training labels: The nature of the labels associated with the training data is another dimension that distinguishes between methods' supervision. For example, a method that takes advantage of not only image-level labels but also localisation or segmentation labels has an advantage in image captioning.

  3. Training data: The amount of training data offers a clear advantage. Methods that utilise less labelled data while achieving comparable performance deserve to be highlighted.

In the next sections, we provide a general framework for this discrete three-dimensional scale, which we refer to as the Supervision Levels Scale (SLS). We believe SLS achieves the following objectives:

  • SLS can be used to readily compare methods in results tables, so as to highlight any data-relevant advantages when comparing state-of-the-art performances.

  • SLS can be used in anonymous leaderboards in order to encourage and highlight methods that utilise fewer labels to demonstrate their competitive performance.

  • SLS enables one leaderboard to compare a variety of self-supervised, weakly-supervised, few-shot and other distinct levels of supervision, as well as their combination. This would encourage the community to converge to a single evaluation leaderboard for the task, while acknowledging these methods differ in data usage.

  • SLS can be used to analyse the impact of data-relevant advantages (pre-training, training data and training labels) for one task on the same test set.

We additionally demonstrate how SLS can be used in the leaderboards for the EPIC-KITCHENS dataset, showing how it can compare the supervision levels of previous years’ competition winners. This is achieved through self-declaration, i.e. each submission is requested to declare the supervision level across each dimension.

2 Related Efforts

The proposed Supervision Levels Scale (SLS) can be applied to provide additional supervision tags as methods are compared on the same test set. Previous efforts to define discrete scales or categories have focused on identifying easy or hard examples within one dataset, in order to provide insights into methods' failure cases (e.g. [9, 10, 1]). To the best of our knowledge, there is no previous attempt to provide a discrete scale of the levels of supervision on the same dataset. We hope that this first attempt can be further utilised and elaborated for a variety of computer vision tasks.

3 Supervision Levels Scale

The proposed SLS has three dimensions: Pre-training (PT), Training labels (TL) and (amount of) Training data (TD), explained below. These dimensions are orthogonal and a method is likely to consider a different level in each dimension. The discrete levels are incremental, with the lowest level implying the least amount of supervision on the corresponding scale.

For each dimension, we define five supervision levels (1-5), in addition to 0 where no supervision on that dimension is expected. A more fine-grained or wider range could be adopted by future tasks, but in this proposal we believe this level of discretisation to be sufficient.
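As an illustration, the 0-5 levels per dimension can be encoded and validated in a few lines. This is a minimal sketch under our own naming assumptions: the `SLS` class and its field names are not part of the proposal itself.

```python
from dataclasses import dataclass

# Valid range for each SLS dimension: 0 (least supervision) up to 5.
LEVELS = range(0, 6)


@dataclass(frozen=True)
class SLS:
    """Hypothetical encoding of the three SLS dimensions (names are our own)."""
    pt: int  # Pre-Training level
    tl: int  # Training Labels level
    td: int  # Training Data level

    def __post_init__(self):
        for name, level in [("PT", self.pt), ("TL", self.tl), ("TD", self.td)]:
            if level not in LEVELS:
                raise ValueError(f"SLS-{name} must be in 0..5, got {level}")

    def tag(self) -> str:
        # Render in the order PT, TL, TD, e.g. "SLS-0-3-3".
        return f"SLS-{self.pt}-{self.tl}-{self.td}"


print(SLS(pt=0, tl=3, td=3).tag())  # SLS-0-3-3
```

A finer-grained scale, as the text notes future tasks may adopt, would only require widening `LEVELS`.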

SLS-PT: Pre-Training

Pre-training is the data utilised to provide model initialisation, and is independent of the data used in training the model itself. When the model is made of several parts or stages (e.g. feature extraction followed by temporal modelling), the PT supervision level is calculated for each stage, and the maximum PT value is considered the supervision level of the overall method. For example, if a method uses features that were pre-trained and then fine-tuned, while the classifier is trained from scratch, then the features' pre-training defines the level of PT supervision of the full method.
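The max-over-stages rule described above can be sketched in one line; the stage names here are hypothetical, chosen to match the feature-extractor/classifier example.

```python
# PT level of a multi-stage method: the maximum over its stages.
# Here the features had relevant pre-training (PT=2) while the
# classifier is trained from scratch (PT=0), so the method is PT=2.
stage_pt = {"feature_extractor": 2, "classifier": 0}
method_pt = max(stage_pt.values())
print(method_pt)  # 2
```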

PT Description
0 No pre-training was used. Models are randomly initialised, including features, i.e. features are trained from randomly initialised models.
1 Standard pre-training on data of limited relevance to the task. For example, in a task on medical data, ImageNet [7] is used for pre-training.
2 Relevant pre-training for the task. The data for pre-training is chosen to best fit the problem, and is believed to have a significant impact on learning low-level as well as medium-level features.
3 Self-supervision on large-scale unlabelled public data. Importantly, as public data has been used, the pre-training can be replicated.
4 Self-supervision using task-specific data. That is, a model has used the training and/or test data on which it will be trained/evaluated, with/without other large-scale public data, thus offering stronger pre-training supervision.
5 In addition to any or all of the above, pre-training on private data. This level of pre-training supervision is restricted to approaches that pre-train on data not accessible to other researchers and thus cannot be replicated. Even when these models are made available, other researchers are unable to replicate or improve this pre-training, which should thus be considered an advantage.

SLS-TL: Training Labels

The second dimension of the SLS focuses on the training labels utilised by the method. This includes labels already available with the dataset, or additional labels acquired by the authors for the model specifically.

TL Description
0 No labelled supervision was used. Training data was used in an unsupervised or self-supervised manner without employing any labels.
1 Weak supervision - L1. Weak labels are provided for multiple instances. One-to-one mapping between labels and instances is not available.
2 Weak supervision - L2. Weak labels (noisy or incomplete) are associated with each instance.
3 Strong supervision - an instance-level label is given.
4 Strong supervision - in addition to instance-level labels, additional labels are provided (e.g. segmentation in images or spatio-temporal bounding boxes in video).
5 Strong supervision with additional labels, not available with the dataset by default.

SLS-TD: Training Data

Even with the same pre-training and labels supervision, methods can vary by the amount of training data used. The last dimension of the SLS captures these differences, unifying few-shot and many-shot training paradigms when evaluated on the same test set. Note that smart approaches to utilising less data (e.g. avoiding noisy labels) can result in improved performance. However, in this case the full data has been taken advantage of, and thus such methods are considered to have access to the same amount of training data.

TD Description
0 Zero-shot learning - the training set has no label class/category overlap with the test set.
1 Few-shot learning - in line with [18], (up to) 5-shot training data is used (per label class/category).
2 Efficient learning - a randomly selected fraction (commonly 25%) of the data was used. The remainder of the training data was not used and the choice of samples is not optimised.
3 Train set - the training set is used in full.
4 Train+Val sets - after optimising any hyperparameters on the validation set, the combination of training and validation sets is used in training the model. Note that we here assume an official Train/Val split.
5 Train+Val++ sets - additional data is used during training (note that this is different from pre-training). This data could be used with or without labels, but is utilised during training the model itself.
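To illustrate how the middle TD levels restrict the data a method may see, the sketch below subsamples a labelled training set according to the declared level. It is only an illustration under our own assumptions: the function name is hypothetical, and levels 0, 4 and 5 (which change the label space or add data rather than subsample) are left out.

```python
import random
from collections import defaultdict


def subsample_for_td(samples, labels, td_level, seed=0):
    """Illustrative subsampling for SLS-TD levels 1-3 (names are our own).

    td_level 1: few-shot, up to 5 samples per label class
    td_level 2: efficient learning, a random 25% fraction
    td_level 3: the full training set
    """
    rng = random.Random(seed)
    indexed = list(zip(samples, labels))
    if td_level == 1:
        per_class = defaultdict(list)
        for sample, label in indexed:
            per_class[label].append((sample, label))
        picked = []
        for group in per_class.values():
            rng.shuffle(group)
            picked.extend(group[:5])  # up to 5 shots per class
        return picked
    if td_level == 2:
        rng.shuffle(indexed)  # random, non-optimised selection
        return indexed[: max(1, len(indexed) // 4)]
    if td_level == 3:
        return indexed
    raise ValueError("only TD levels 1-3 are illustrated here")


data = list(range(100))
classes = [i % 4 for i in data]
print(len(subsample_for_td(data, classes, 1)))  # 20 (5 shots x 4 classes)
print(len(subsample_for_td(data, classes, 2)))  # 25
```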

Putting it together

The three dimensions are then put together for each method, in the order: SLS-PT-TL-TD.
To give an example, a method trained from scratch (i.e. with no pre-training), using instance-level strong supervision, and trained on the full training set would be referred to as: SLS-0-3-3. Conversely, a method that uses private data for pre-training, weak supervision from incomplete labels and few-shot learning would be referred to as: SLS-5-2-1. The two methods would be evaluated on the same test set, and thus the performance (given chosen evaluation metrics) is directly comparable, though with significantly different levels of supervision.
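The tag format above is regular enough to parse and emit mechanically, which a leaderboard backend might want to do when validating self-declared submissions. A minimal sketch (the function names are our own):

```python
import re

# A tag such as "SLS-5-2-1": one digit in 0..5 per dimension, in the
# order PT, TL, TD.
TAG_RE = re.compile(r"^SLS-([0-5])-([0-5])-([0-5])$")


def parse_sls(tag):
    """Return (PT, TL, TD) as ints, or raise on a malformed tag."""
    m = TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"malformed SLS tag: {tag!r}")
    return tuple(int(g) for g in m.groups())


def format_sls(pt, tl, td):
    return f"SLS-{pt}-{tl}-{td}"


pt, tl, td = parse_sls("SLS-5-2-1")
print(pt, tl, td)           # 5 2 1
print(format_sls(0, 3, 3))  # SLS-0-3-3
```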

Next, we give an example of SLS on our EPIC-KITCHENS dataset [4] and provide comparative analysis of the public 2019 and 2020 leaderboards of the Action Recognition challenge.

Figure 2: SLS for the EPIC-KITCHENS Action Recognition challenge

Rank  Team                 Entries  Date      SLS (PT-TL-TD)
1     UTS-Baidu [21]       14       05/28/20  5-4-4
2     NUS-CVML [16]        18       05/29/20  2-4-4
-     UTS-Baidu [20]       16       05/30/19  2-4-4
3     SAIC-Cambridge [15]  34       05/27/20  1-3-4
3     FBK-HuPBA [17]       50       05/29/20  5-3-4
4     GT-WISC-MPI [12]     12       01/30/20  2-3-4
5     G-Blend [19]         14       05/28/20  5-3-4
6     TBN [11]             2        05/30/19  2-3-4
-     FAIR [8]             9        10/30/19  5-3-4

Table 1: Top teams in the 2019 and 2020 Action Recognition Challenges on EPIC-KITCHENS-55. The SLS columns demonstrate the various levels of supervision employed by these methods.
Figure 3: Recently released EPIC-KITCHENS-100 Action Recognition leaderboard, with self-reported SLS for the various baselines. Note that submission 6 only uses weak supervision, hence its SLS-TL of 2.


4 SLS for EPIC-KITCHENS

EPIC-KITCHENS is the largest egocentric dataset, captured in a non-scripted manner and recorded in participants' kitchens. First released in 2018 [3], and later extended [4] to 20M frames, 90K action segments and 100 hours of densely annotated actions, the dataset offers an ideal testbed for assessing video understanding in untrimmed videos. Several benchmarks are associated with the dataset; currently five leaderboards are available for comparative evaluation (see the challenges on the dataset's website).

We showcase how SLS can apply to the EPIC-KITCHENS Action Recognition (AR) challenge (Sec 4.1 in [4]), then demonstrate how it can be used to compare the levels of supervision of methods submitted to this leaderboard.

We specialise the SLS for this task as follows:


  • SLS-PT for AR in EPIC-KITCHENS:

    • SLS-1-X-X: standard pre-training would, in this case, be any image-based pre-training.

    • SLS-2-X-X: relevant pre-training would be video pre-training (e.g. on large-scale video datasets such as Kinetics [2] or HowTo100M [13]).

  • SLS-TL for AR in EPIC-KITCHENS:

    • SLS-X-1-X: no temporal information about action segments or instances per video is utilised, i.e. only video-level labels of actions are known, without their rough or exact locations.

    • SLS-X-2-X: weak temporal labels are utilised, known as single timestamps in EPIC-KITCHENS [14]. These are rough single time points within or close to the action of interest.

    • SLS-X-3-X: full temporal labels (i.e. start-end times) have been used, without any spatial labels.

    • SLS-X-4-X: methods have taken advantage of spatio-temporal labels, i.e. start-end times plus bounding boxes, hand detections and/or masks.

  • SLS-TD for AR in EPIC-KITCHENS: no adjustments are needed to adapt this dimension. Train and Val splits correspond to the splits in Table 1 of [4].

Fig. 2 visualises SLS for AR in EPIC-KITCHENS, across the three dimensions. We next demonstrate how this can be applied to the leaderboard, using examples from the previous years’ challenge leaderboards. Note that these were evaluated on a previous version (subset) of the dataset known as EPIC-KITCHENS-55.

In Table 1 we list the winners of the AR challenge for the years 2019 and 2020, ranked by their performance. More details of these methods are available in the technical reports [6, 5]. We denote the SLS of each method based on the technical reports submitted. The table shows that methods vary in pre-training, with some utilising weights from the private dataset IG-Kinetics [8]. Methods that employed spatio-temporal labels (TL=4), as opposed to solely temporal labels, consistently outperform others, highlighting the nature of this dataset: actions take place in the messy environment of participants' kitchens, and identifying where the action takes place is consistently helpful. Finally, no work has attempted few-shot or efficient learning on the dataset to date. SLS-TD in this table is thus consistent across all methods.
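To make the like-for-like comparison rule concrete, leaderboard entries can be grouped by their declared SLS tag so that only entries sharing a tag are ranked directly against each other. The sketch below abbreviates Table 1 (team names and tags only; accuracies omitted):

```python
from collections import defaultdict

# (team, declared SLS tag) pairs, loosely following Table 1.
entries = [
    ("UTS-Baidu [21]", "SLS-5-4-4"),
    ("NUS-CVML [16]", "SLS-2-4-4"),
    ("SAIC-Cambridge [15]", "SLS-1-3-4"),
    ("FBK-HuPBA [17]", "SLS-5-3-4"),
    ("TBN [11]", "SLS-2-3-4"),
    ("FAIR [8]", "SLS-5-3-4"),
]

# Group entries by tag: within one group, performance differences cannot
# be explained by supervision level.
by_sls = defaultdict(list)
for team, tag in entries:
    by_sls[tag].append(team)

for tag in sorted(by_sls):
    print(tag, "->", ", ".join(by_sls[tag]))
```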

We have also integrated SLS into the new leaderboard for Action Recognition in EPIC-KITCHENS-100, open for the 2021 round of the challenges. Fig 3 shows the recently opened leaderboard with the baselines from [4]. Importantly, this leaderboard shows both full temporal supervision (TL=3) and weak temporal supervision (TL=2) in the same leaderboard. This enables direct comparison of the drop in performance between methods that utilise single timestamps [14] and those that utilise full temporal supervision. In the given baseline, start-end times improve Top-1 Action Accuracy by 11-16%.

5 Further discussion and limitations of SLS

There are a few known limitations of the proposed SLS, due to decisions made with the aim of reducing complexity. First, the discrete nature of SLS tells only part of the picture, and the details needed to replicate any method are only available by reading about the methods themselves. We believe this compromise to be acceptable. Second, when a method is formed of multiple stages (e.g. feature extraction followed by temporal modelling), SLS encodes only one scale per dimension: the maximum level of supervision across all stages. As some models use a large number of stages and sources for pre-training, we also consider this simplification to be acceptable. Third, SLS relies on self-reporting of each method's supervision scale, and it is left to authors to justify their declared supervision levels. Misunderstanding any dimension can result in incorrect reporting. We hope that this technical report can assist in alleviating any ambiguity, and we will be reviewing its content as we receive inquiries or concerns.

The remaining (fourth and fifth) limitations require future work to be addressed. Fourth, SLS assumes all training data uses the same training labels; semi-supervised approaches cannot be correctly represented by SLS. Fifth, multi-modal data is not captured. For example, in Table 1 the method TBN [11] utilises audio information in addition to video, giving it an advantage. Similarly, when performing tasks like retrieval or captioning, the two modalities (vision + language) would each have a different SLS. We leave the consideration of multiple modalities and their relationship to SLS for future work and focus only on the visual modality.

6 Conclusion

This report introduces a three-dimensional, incremental and discrete scale to encode the level of supervision of methods compared on the same test set. It aims to unify and directly compare methods that attempt self-supervised pre-training, few-shot learning and weak supervision for the same task and on one test set. We believe SLS can provide new insights into how methods learn from more/less labelled data. We will be analysing the SLS of the various methods submitted to the five leaderboards of EPIC-KITCHENS-100 in the next round of challenge submissions.


We would like to thank Hazel Doughty and Will Price for early discussions on this idea.


  • [1] D. Bolya, S. Foley, J. Hays, and J. Hoffman (2020) TIDE: a general toolbox for identifying object detection errors. In ECCV, Cited by: §2.
  • [2] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proc. CVPR, Cited by: 2nd item.
  • [3] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray (2018) Scaling egocentric vision: the epic-kitchens dataset. In ECCV, Cited by: §4.
  • [4] D. Damen, H. Doughty, G. M. Farinella, A. Furnari, J. Ma, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray (2020) Rescaling egocentric vision. CoRR abs/2006.13256. Cited by: Supervision Levels Scale (SLS), §3, 3rd item, §4, §4, §4.
  • [5] D. Damen, E. Kazakos, W. Price, J. Ma, H. Doughty, A. Furnari, and G. M. Farinella (2020) EPIC-kitchens-55 - 2020 challenges report. Technical report Cited by: §4.
  • [6] D. Damen, W. Price, E. Kazakos, A. Furnari, and G. M. Farinella (2019) EPIC-kitchens - 2019 challenges report. Technical report Cited by: §4.
  • [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, Cited by: §3.
  • [8] D. Ghadiyaram, D. Tran, and D. Mahajan (2019) Large-scale weakly-supervised pre-training for video action recognition. In CVPR, Cited by: Table 1, §4.
  • [9] D. Hoiem, Y. Chodpathumwan, and Q. Dai (2012) Diagnosing error in object detectors. In ECCV, Cited by: §2.
  • [10] J. Hosang, R. Benenson, P. Dollár, and B. Schiele (2016) What makes for effective detection proposals?. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (4), pp. 814–830. Cited by: §2.
  • [11] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen (2019) EPIC-fusion: audio-visual temporal binding for egocentric action recognition. In ICCV, Cited by: Table 1, §5.
  • [12] M. Liu, S. Tang, Y. Li, and J. Rehg (2020) Forecasting human object interaction: joint prediction of motor attention and actions in first person video. In ECCV, Cited by: Table 1.
  • [13] A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV, Cited by: 2nd item.
  • [14] D. Moltisanti, S. Fidler, and D. Damen (2019) Action Recognition from Single Timestamp Supervision in Untrimmed Videos. In CVPR, Cited by: 2nd item, §4.
  • [15] J. Perez-Rua, B. Martinez, X. Zhu, A. Toisoul, V. Escorcia, and T. Xiang (2020) Knowing what, where and when to look: efficient video action modeling with attention. Technical report Cited by: Table 1.
  • [16] F. Sener, D. Singhania, and A. Yao (2020) Temporal aggregate representations for long-range video understanding. In ECCV, Cited by: Table 1.
  • [17] S. Sudhakaran, S. Escalera, and O. Lanz (2020) FBK-hupba submission to the epic-kitchens action recognition 2020 challenge. Technical report Cited by: Table 1.
  • [18] O. Vinyals, C. Blundell, T. Lillicrap, k. kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638. Cited by: §3.
  • [19] W. Wang, D. Tran, and M. Feiszli (2020) What makes training multi-modal classification networks hard?. Technical report Cited by: Table 1.
  • [20] W. Xiaohan, Y. Wu, L. Zhu, and Y. Yang (2019) Baidu-uts submission to the epic-kitchens action recognition challenge 2019. Technical report Cited by: Table 1.
  • [21] W. Xiaohan, Y. Wu, L. Zhu, and Y. Yang (2020) Symbiotic attention with privileged information for egocentric action recognition. In AAAI, Cited by: Table 1.