Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening

NeurIPS 2025

Piyush Bagad, Andrew Zisserman

University of Oxford


📢 Update: Check out our follow-up work, TARA: Simple and Efficient Time Aware Retrieval Adaptation of MLLMs for Video Understanding, which uses Multimodal LLMs (MLLMs) to encode videos for time-aware video-text retrieval.

Chiral actions: Our objective is to learn a video embedding that encodes the direction of time, so that temporally opposite (chiral) actions, such as opening vs. closing a door, are linearly separable.

🔐 The Key Nugget

Key observation: t-SNE projections of per-frame DINOv2 features show that the features lie on a time-sensitive trajectory. Can we use these trajectories to learn a time-aware video representation?

🧠 The Perceptual Straightening Hypothesis

Hénaff et al. (2019) [1] hypothesized that humans convert non-linear spatial representations of naturally occurring videos into linear temporal trajectories, enabling prediction by linear extrapolation. Loosely inspired by this idea, we transform DINOv2 feature trajectories into a time-aware video embedding using a linearised auto-encoder model.

[1] Perceptual straightening of natural videos. Olivier J. Hénaff, Robbe L. T. Goris and Eero P. Simoncelli. Nature Neuroscience (2019).
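One common way to quantify how "straight" a feature trajectory is uses its curvature: the angle between successive displacement vectors, averaged over time. The sketch below is an illustration only and is not taken from the paper's code; the function name `mean_curvature` and the toy trajectories are assumptions. A perfectly straight trajectory has curvature close to zero, while points sampled along a circle have a constant non-zero curvature.

```python
import math


def mean_curvature(trajectory):
    """Average angle (in radians) between successive displacement vectors.

    `trajectory` is a list of per-frame feature vectors (lists of floats).
    """
    # Displacements v_t = x_{t+1} - x_t between consecutive frames.
    steps = [
        [b - a for a, b in zip(x0, x1)]
        for x0, x1 in zip(trajectory, trajectory[1:])
    ]
    angles = []
    for u, v in zip(steps, steps[1:]):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.hypot(*u) * math.hypot(*v)
        if norm == 0.0:
            continue  # skip repeated frames, whose direction is undefined
        # Clamp to [-1, 1] to guard against floating-point drift.
        cos = max(-1.0, min(1.0, dot / norm))
        angles.append(math.acos(cos))
    return sum(angles) / len(angles) if angles else 0.0


# Toy trajectories (hypothetical data, 2-D for readability).
straight = [[float(t), 2.0 * t] for t in range(6)]
arc = [[math.cos(0.5 * t), math.sin(0.5 * t)] for t in range(6)]

print("straight:", mean_curvature(straight))
print("arc:", mean_curvature(arc))
```

With real per-frame features, the same function would be applied to the sequence of frame embeddings of a single video.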

🏗️ The Model: LiFT


LiFT is trained in an unsupervised manner by reconstructing the input feature sequence.
What does LiFT learn? We observe that it essentially learns a smooth approximation of the feature trajectory. Furthermore, it is able to learn different embeddings for videos of opening vs closing door actions. See the qualitative results below.
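To build intuition for why a smooth, straightened approximation of a feature trajectory can separate opposite actions, here is a toy stand-in. It is not the LiFT architecture: instead of a learned auto-encoder, it fits a straight line to the per-frame features by least squares and uses the fitted start point and velocity as the "embedding". The function name `fit_line` and the data are hypothetical. Reversing the frame order flips the sign of the velocity, so forward and reversed clips land at different points.

```python
def fit_line(trajectory):
    """Least-squares fit x_t ~ start + velocity * tau, with tau in [0, 1].

    Returns the concatenation [start, velocity] as a toy video embedding.
    Assumes the trajectory has at least two frames.
    """
    n = len(trajectory)
    taus = [t / (n - 1) for t in range(n)]
    tau_mean = sum(taus) / n
    tau_var = sum((tau - tau_mean) ** 2 for tau in taus)

    start, velocity = [], []
    for d in range(len(trajectory[0])):
        xs = [frame[d] for frame in trajectory]
        x_mean = sum(xs) / n
        cov = sum((tau - tau_mean) * (x - x_mean) for tau, x in zip(taus, xs))
        slope = cov / tau_var
        velocity.append(slope)
        start.append(x_mean - slope * tau_mean)
    return start + velocity


# Hypothetical 2-D per-frame features for an "opening" clip.
opening = [[0.0, 1.0], [0.4, 1.1], [0.5, 0.9], [1.0, 1.0]]
closing = opening[::-1]  # the same clip played backwards

emb_open = fit_line(opening)
emb_close = fit_line(closing)
print("opening velocity:", emb_open[2:])
print("closing velocity:", emb_close[2:])
```

The velocity components of the two embeddings are negatives of each other, which is exactly the kind of difference a linear classifier can pick up.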

🎞️ The Chirality in Action Benchmark

We repurpose three existing datasets (SSv2, EPIC, Charades) to mine chiral actions and build a new benchmark to probe video embedding models for chirality. We search for temporally opposite verbs using ChatGPT and then group together similar nouns to construct chiral groups.
Evaluation protocol: For each chiral group, we compute video embeddings for the + and - samples and train a linear probe to separate them. The overall accuracy is the average over all chiral groups.
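The evaluation protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: the probe is a logistic-regression classifier trained with plain gradient descent, the train/test split and hyperparameters are arbitrary, and the embeddings are synthetic stand-ins generated by the hypothetical helper `make_toy_group`.

```python
import math
import random


def train_linear_probe(xs, ys, epochs=200, lr=0.5):
    """Logistic-regression probe trained with full-batch gradient descent."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # avoid overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(dim):
                grad_w[i] += (p - y) * x[i]
            grad_b += p - y
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b


def accuracy(probe, xs, ys):
    """Fraction of samples whose predicted label matches the true label."""
    w, b = probe
    hits = 0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((score > 0) == (y == 1))
    return hits / len(xs)


def chirality_score(groups):
    """Mean linear-probe test accuracy over all chiral groups."""
    accs = []
    for train_x, train_y, test_x, test_y in groups:
        probe = train_linear_probe(train_x, train_y)
        accs.append(accuracy(probe, test_x, test_y))
    return sum(accs) / len(accs)


def make_toy_group(rng, n=40, dim=4):
    """Synthetic stand-in: + and - embeddings differ along the first axis."""
    xs, ys = [], []
    for i in range(n):
        y = i % 2  # 1 for the + action, 0 for the - action
        x = [rng.gauss(0.0, 0.5) for _ in range(dim)]
        x[0] += 2.0 if y == 1 else -2.0
        xs.append(x)
        ys.append(y)
    half = n // 2
    return xs[:half], ys[:half], xs[half:], ys[half:]


rng = random.Random(0)
groups = [make_toy_group(rng) for _ in range(3)]
score = chirality_score(groups)
print("mean accuracy over chiral groups:", score)
```

A separate probe is trained per group, so the score reflects how well the embedding separates each pair of opposite actions rather than how well it recognises the actions overall.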

🗒️ Highlight Results

We evaluate LiFT on the CiA benchmark as well as on standard action recognition benchmarks. First, we show that LiFT embeddings are time-sensitive (chiral-sensitive) and compact, even outperforming much bigger video models like VideoJEPA and VideoMAE. Second, we show that LiFT encodes temporal information that is likely complementary to existing video models such as VideoJEPA. This is established by the performance gains we observe when concatenating LiFT embeddings with VideoJEPA embeddings on standard action recognition benchmarks.
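Concatenating two embeddings is straightforward; a minimal sketch follows. The L2 normalisation applied before concatenation is an assumption made here so that neither embedding dominates by scale, and is not a detail confirmed by this page. The function names and the tiny vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse(lift_emb, other_emb):
    """Concatenate two per-video embeddings after normalising each one."""
    return l2_normalize(lift_emb) + l2_normalize(other_emb)


# Hypothetical embeddings of one video from two different models.
lift_emb = [3.0, 4.0]
other_emb = [0.0, 2.0, 0.0]

fused = fuse(lift_emb, other_emb)
print(len(fused), fused)
```

The fused vector would then be fed to the same downstream classifier used for either embedding alone.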


🙏 Acknowledgements

We thank Ashish Thandavan for support with infrastructure, and Sindhu Hegde and Makarand Tapaswi for useful discussions.
This research is funded by the EPSRC Programme Grant VisualAI EP/T028572/1 and a Royal Society Research Professorship RSRP\R\241003.

📜 Citation

If you find this work useful, please consider citing:


@article{bagad2025chirality,
  title={Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening},
  author={Bagad, Piyush and Zisserman, Andrew},
  journal={arXiv preprint arXiv:2509.08502},
  year={2025}
}

@InProceedings{Bagad25,
  author       = "Piyush Bagad and Andrew Zisserman",
  title        = "Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening",
  booktitle    = "NeurIPS",
  year         = "2025",
}

📙 Related Work

Please also consider checking out the following papers:

Seeing the Arrow of Time in Large Multimodal Models. NeurIPS (2025).
Retro-Actions: Learning ‘Close’ by Time-Reversing ‘Open’ Videos. ICCVW (2019).
Perceptual straightening of natural videos. Nature Neuroscience (2019).