How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?

University of Amsterdam

We investigate how sensitive video self-supervised learning is to the currently used benchmark convention and whether methods generalize beyond the canonical evaluation setting.

Overview

We investigate four factors of sensitivity in the downstream setup (see the evaluation sketch after this list):

  • Domain: First, we analyse whether features learned by self-supervised models transfer to datasets that vary in domain with respect to the pre-training dataset.
  • Samples: Second, we evaluate the sensitivity of self-supervised methods to the number of downstream samples available for finetuning.
  • Actions: Third, we investigate whether self-supervised methods can learn fine-grained features required for recognizing semantically similar actions.
  • Task: Finally, we study the sensitivity of video self-supervised methods to the downstream task and question whether self-supervised features can be used beyond action recognition.
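Throughout, we report results under the two standard downstream protocols: full finetuning of the pre-trained backbone and linear classification on frozen features. Below is a minimal PyTorch sketch of both protocols for orientation; the backbone, feature dimension, dataloader, and hyperparameters are placeholders, not the exact setup used in the paper.

# Minimal sketch of the two downstream evaluation protocols (illustrative only).
# The backbone, feature dimension, and dataloader are placeholders.
import torch
import torch.nn as nn

def build_downstream_model(backbone: nn.Module, feat_dim: int, num_classes: int,
                           linear_eval: bool) -> nn.Module:
    """Attach a classification head; freeze the backbone for linear evaluation."""
    if linear_eval:
        for p in backbone.parameters():
            p.requires_grad = False   # frozen features, only the head is trained
    head = nn.Linear(feat_dim, num_classes)
    return nn.Sequential(backbone, head)

def finetune(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3):
    """Standard supervised training loop on the downstream dataset."""
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    model.train()   # note: batch-norm handling for linear eval is omitted for brevity
    for _ in range(epochs):
        for clips, labels in loader:   # clips: (B, C, T, H, W) video tensors
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()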

Models evaluated: We evaluate a suite of 9 recent video self-supervised learning methods.

Video Datasets: We use datasets varying along different factors as shown in the radar-plot below.

[Radar plot comparing the downstream datasets along these factors]

Highlights

We summarize the key observations from our experiments below.

I. Downstream Domain

(See Table 1.) Performance on UCF-101 finetuning and Kinetics-400 linear evaluation is not indicative of how well a self-supervised video model generalizes to other downstream domains: the ranking of methods changes substantially across datasets and between full finetuning and linear classification.
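One way to make "the ranking of methods changes" concrete is to compute a rank correlation between per-method accuracies in two evaluation settings. The sketch below uses scipy's Kendall tau; the method names and accuracy tables are placeholders to fill in with your own results.

# Quantifying how much the ranking of methods agrees between two settings
# (e.g. UCF-101 finetuning vs. another downstream dataset).
# Accuracy values are placeholders; fill in your own numbers per method.
from scipy.stats import kendalltau

def ranking_agreement(acc_a: dict, acc_b: dict) -> float:
    """Kendall tau between the method rankings induced by two accuracy tables."""
    methods = sorted(set(acc_a) & set(acc_b))
    tau, _ = kendalltau([acc_a[m] for m in methods],
                        [acc_b[m] for m in methods])
    return tau

# Example usage (values intentionally omitted):
# ucf_finetune  = {"method_1": ..., "method_2": ...}
# other_dataset = {"method_1": ..., "method_2": ...}
# print(ranking_agreement(ucf_finetune, other_dataset))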

II. Downstream Samples

We observe from Fig. 3 that video self-supervised models are highly sensitive to the number of samples available for finetuning: both the gap and the ranking between methods change considerably across sample sizes on each dataset.
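These experiments finetune on progressively smaller subsets of the downstream training set. A minimal sketch of drawing such a fixed-size, seeded subset with PyTorch is shown below; the subset sizes and dataset object are illustrative, not the exact splits used in the paper.

# Drawing a reproducible labeled subset of a downstream training set for
# low-data finetuning experiments (sizes and dataset are placeholders).
import random
from torch.utils.data import Dataset, Subset

def subsample(dataset: Dataset, num_samples: int, seed: int = 0) -> Subset:
    """Return a seeded random subset with `num_samples` training videos."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(dataset)), num_samples)
    return Subset(dataset, indices)

# e.g. finetune on subsets of increasing size to trace sensitivity curves:
# for n in (1000, 2000, 4000):   # illustrative sizes
#     loader = DataLoader(subsample(train_set, n), batch_size=8, shuffle=True)
#     finetune(model, loader)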

III. Downstream Actions

Most self-supervised methods in Table 2 are sensitive to the actions present in the downstream dataset and do not generalize well to more semantically similar actions. This further emphasizes the need for proper evaluation of self-supervised methods beyond current coarse-grained action classification.

IV. Downstream Tasks

The results in Table 3 reveal that action classification performance on UCF-101 is mildly indicative of the transferability of self-supervised features to other tasks on UCF-101. However, when methods pre-trained on Kinetics-400 face a domain change in addition to the task change, UCF-101 results are no longer a good proxy and the gap between supervised and self-supervised pre-training is large.

SEVERE Benchmark

Based on our findings, we propose the SEVERE benchmark (SEnsitivity of VidEo REpresentations) so that future works can evaluate the generalization of new video self-supervised methods more thoroughly. The benchmark is the subset of our experiments that is both indicative of each sensitivity factor and realistic to run.

[Overview of the experiments included in the SEVERE benchmark]

Please check out our code if you'd like to evaluate your self-supervised model on the SEVERE benchmark.

BibTeX

@inproceedings{thoker2022severe,
  author    = {Thoker, Fida Mohammad and Doughty, Hazel and Bagad, Piyush and Snoek, Cees},
  title     = {How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?},
  booktitle = {ECCV},
  year      = {2022},
}