⏰ Test of Time: Instilling Video-Language Models with a Sense of Time

1University of Amsterdam 2IIIT, Hyderabad


We propose TACT (Temporal Adaptation by Consistent Time-ordering), a method that instills a sense of time in video-language models by enforcing consistency between the order of events in video and in language, without having to pretrain from scratch.

🔭 Quick Preview

  • Do existing video-language models understand time? Through a controlled experiment on synthetic data, we show that existing video-language models struggle to associate the order of events in video with the order described in language.
  • Can we adapt a video-language model to instill this sense of time? Building on VideoCLIP, we propose TACT (Temporal Adaptation by Consistent Time-ordering), a method for temporal adaptation that enforces consistency of time order between video and language, without having to pretrain from scratch.
  • What does such an adaptation enable us to do? We demonstrate improved zero-shot generalization of our temporally adapted models on tasks that require a higher degree of time awareness.

📹 Do Video-Language Models Sense Time?


Synthetic data. We create synthetic video-language examples that show a pair of events (a red circle appears and a yellow circle appears) in a certain order. The correct caption is consistent with this order; the incorrect caption is not. The task is to assign a higher probability to the correct caption than to the incorrect one. We find that six existing video-language models struggle to understand even such simple temporal relations.

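The probe described above can be sketched as a small scoring routine. This is a minimal illustration, not the paper's evaluation code: the embeddings would come from a real video-language model such as VideoCLIP, and the function name `time_order_accuracy` is our own.

```python
import numpy as np

def time_order_accuracy(video_embs, correct_embs, reversed_embs):
    """Fraction of examples where the caption with the correct event order
    scores higher than its time-reversed counterpart.

    Each argument is an (N, D) array of embeddings; in practice these would
    be produced by a video-language model's video and text encoders.
    """
    def cos(a, b):
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return (a * b).sum(axis=1)

    correct_wins = cos(video_embs, correct_embs) > cos(video_embs, reversed_embs)
    return float(np.mean(correct_wins))
```

A model with no sense of time scores near chance (0.5) on this probe; a temporally aware model should score well above it.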

⏰ Can we Instill this Sense of Time?

Overview of TACT. Along with the usual contrastive loss, where negatives come from other samples in the batch, we use time-order reversal, both within the same sample and across samples, to generate additional negatives for both video and text. We also extend the contrastive loss to the time-order-reversed video and text, enforcing reverse consistency.
We train TACT on four real video-language datasets and show that it instills a sense of time in the models.

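The objective above can be sketched as follows. This is a simplified numpy sketch under our own assumptions (an InfoNCE-style loss, a fixed temperature, and the helper names `info_nce` and `tact_loss`), not the paper's implementation: each video must match its own caption against in-batch captions and time-order-reversed captions, and the reversed video must match the reversed caption (reverse consistency).

```python
import numpy as np

def info_nce(sim, pos_idx, temperature=0.07):
    """Cross-entropy over each similarity row; pos_idx[i] is the positive column."""
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(pos_idx)), pos_idx].mean()

def tact_loss(v, t, v_rev, t_rev, temperature=0.07):
    """Sketch of a TACT-style objective.

    v, t: (N, D) embeddings of videos and their captions.
    v_rev, t_rev: embeddings of the time-order-reversed videos and captions.
    Reversed captions act as extra negatives (same-sample and cross-sample),
    and a symmetric term enforces reverse consistency.
    """
    norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    v, t, v_rev, t_rev = map(norm, (v, t, v_rev, t_rev))
    pos = np.arange(len(v))
    # Forward: each video vs. [all captions | all reversed captions].
    sim_fwd = v @ np.concatenate([t, t_rev]).T
    # Reverse consistency: each reversed video vs. [reversed | original] captions.
    sim_rev = v_rev @ np.concatenate([t_rev, t]).T
    return info_nce(sim_fwd, pos, temperature) + info_nce(sim_rev, pos, temperature)
```

Aligned video-text pairs yield a lower loss than misordered ones, which is exactly the gradient signal that pushes the model to respect time order.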

🤔 What does this enable on downstream tasks?

AGQA: We consider the VQA task on a subset of AGQA that contains explicit before/after relations.
SSv2: Following Lei et al., we use template (action) retrieval on SSv2, given its need for temporal reasoning.
All downstream results: Results on all downstream tasks are shown below. As we move towards tasks that require a higher level of temporal reasoning, TACT increasingly outperforms the baseline. Note that we do not fine-tune on these tasks; this is zero-shot evaluation.


Generalization to other temporal prompts: Although we train video-language models with temporal relations such as before/after, it is natural to ask whether the model still correctly associates time order under a different prompt, such as "First, ..., then, ...". We find that TACT generalizes to the new prompt and correctly associates time order.

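Such prompt variants can be generated with simple templating. This is an illustrative sketch: the function name `order_prompts` and the exact template wordings are our own, not the paper's prompt set.

```python
def order_prompts(event_a, event_b, template="before"):
    """Build a (correct, time-reversed) caption pair for two events that
    occur in the order A then B, under a chosen temporal template."""
    templates = {
        "before": "{a} before {b}",
        "after": "{b} after {a}",
        "first-then": "First, {a}, then, {b}",
    }
    fmt = templates[template]
    correct = fmt.format(a=event_a, b=event_b)
    time_reversed = fmt.format(a=event_b, b=event_a)
    return correct, time_reversed
```

Evaluating the probe on the "first-then" pairs, after training only on before/after, tests exactly the generalization claimed above.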

Please see the paper for more results, details, and discussion.

Acknowledgements

BibTeX

@inproceedings{bagad2023testoftime,
  title={{T}est of {T}ime: {I}nstilling {V}ideo-{L}anguage {M}odels with a {S}ense of {T}ime},
  author={Bagad, Piyush and Tapaswi, Makarand and Snoek, Cees G. M.},
  booktitle={CVPR},
  year={2023}
}

Related Work

Please also consider looking at the following related papers: