Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening

NeurIPS 2025
University of Oxford  
Chiral actions: Our objective is to learn a video embedding that encodes the direction of time, such that temporally opposite (chiral) actions become linearly separable.

๐Ÿ” The Key Nugget

Key observation: t-SNE projections of per-frame features from DINOv2 show that they lie on a time-sensitive trajectory. Can we use these to learn a time-aware video representation?
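As a concrete illustration of how such a projection can be computed (not the paper's exact setup), the sketch below extracts per-frame features with the dinov2_vits14 model from torch.hub and projects them with scikit-learn's t-SNE; the frames tensor is a placeholder.

    # Illustrative probe: per-frame DINOv2 features projected with t-SNE.
    import torch
    from sklearn.manifold import TSNE

    frames = torch.rand(32, 3, 224, 224)  # placeholder: T ImageNet-normalised frames

    model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
    with torch.no_grad():
        feats = model(frames)             # (32, 384) per-frame CLS features

    xy = TSNE(n_components=2, perplexity=5).fit_transform(feats.numpy())
    # Plotting `xy` in frame order shows whether the frames trace a time-ordered trajectory.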

🧠 The Perceptual Straightening Hypothesis

Hénaff et al. (2019) [1] hypothesized that humans transform the non-linear spatial representations of naturally occurring videos into straighter temporal trajectories, enabling their prediction by linear extrapolation. We are loosely inspired by this idea and transform DINO feature trajectories into a time-aware video embedding using a linearised auto-encoder model.
[1] Perceptual straightening of natural videos. Olivier J. Hénaff, Robbe L. T. Goris and Eero P. Simoncelli. Nature Neuroscience, 2019.
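For intuition, "straightness" can be quantified by the curvature of a feature trajectory, i.e. the average angle between consecutive displacement vectors (the measure used by Hénaff et al.); a perfectly straight trajectory has curvature zero. A minimal sketch of this measure:

    import numpy as np

    def mean_curvature(feats: np.ndarray) -> float:
        """Average angle (radians) between consecutive displacements of a (T, D) trajectory."""
        v = np.diff(feats, axis=0)                        # frame-to-frame displacements
        v = v / np.linalg.norm(v, axis=1, keepdims=True)  # unit directions
        cos = np.clip(np.sum(v[:-1] * v[1:], axis=1), -1.0, 1.0)
        return float(np.mean(np.arccos(cos)))             # 0 for a perfectly straight path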

๐Ÿ—๏ธ The Model: LiFT

Please play the video animation below to understand the LiFT model design.
  • LiFT is trained in an unsupervised manner by reconstructing the input feature sequence (a minimal training sketch follows after this list).
  • What does LiFT learn? We observe that it essentially learns a smooth approximation of the feature trajectory. Furthermore, it learns distinct embeddings for videos of opening vs. closing a door. See the qualitative results below.
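The sketch below illustrates the reconstruction objective only; the architecture, layer sizes, and the name LiFTSketch are illustrative assumptions rather than the paper's exact model, which should be taken from the paper and code release.

    # Hypothetical sketch: encode a frozen DINOv2 feature sequence into a compact
    # video embedding and reconstruct the sequence from it (unsupervised MSE loss).
    import torch
    import torch.nn as nn

    class LiFTSketch(nn.Module):                      # illustrative name, not the released model
        def __init__(self, feat_dim=384, embed_dim=128, seq_len=32):
            super().__init__()
            self.seq_len, self.feat_dim = seq_len, feat_dim
            self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(seq_len * feat_dim, embed_dim))
            self.decoder = nn.Linear(embed_dim, seq_len * feat_dim)

        def forward(self, x):                          # x: (B, T, D) per-frame features
            z = self.encoder(x)                        # (B, embed_dim) compact video embedding
            recon = self.decoder(z).view(-1, self.seq_len, self.feat_dim)
            return z, recon

    model = LiFTSketch()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(8, 32, 384)                        # placeholder batch of feature sequences
    opt.zero_grad()
    z, recon = model(x)
    loss = nn.functional.mse_loss(recon, x)            # reconstruct the input feature sequence
    loss.backward()
    opt.step()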

๐ŸŽž๏ธ The Chirality in Action Benchmark

We repurpose three existing datasets (SSv2, EPIC, Charades) to mine chiral actions and build a new benchmark to probe video embedding models for chirality. We search for temporally opposite verbs using ChatGPT and then group together similar nouns to construct chiral groups.
Evaluation protocol: for each chiral group, we compute video embeddings for the + and − samples and train a linear probe to separate them. The overall accuracy is averaged across all chiral groups.
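A sketch of this protocol with a scikit-learn probe; the `groups` dictionary (mapping each chiral group to its train/test embeddings and +/− labels) is an assumed data structure, not part of the released benchmark code.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def chiral_probe_accuracy(groups):
        """groups: {name: (X_train, y_train, X_test, y_test)} with binary +/- labels."""
        accs = []
        for X_tr, y_tr, X_te, y_te in groups.values():
            probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            accs.append(probe.score(X_te, y_te))   # per-group linear-probe accuracy
        return float(np.mean(accs))                # averaged across all chiral groups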

๐Ÿ—’๏ธ Highlight Results

We evaluate LiFT on the CiA benchmark as well as on standard action recognition benchmarks. First, we show that LiFT embeddings are time-sensitive (chiral-sensitive) and compact, even outperforming much larger video models such as VideoJEPA and VideoMAE. Second, we show that LiFT encodes temporal information that is likely complementary to that of existing video models such as VideoJEPA. This is established by the performance gains we observe when concatenating LiFT embeddings with VideoJEPA embeddings on standard action recognition benchmarks.
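The complementarity test amounts to concatenating per-video embeddings from the two models before running the same probe; a minimal sketch, where the embedding files are hypothetical placeholders:

    import numpy as np

    lift_emb  = np.load("lift_embeddings.npy")     # (N, d_lift), hypothetical file
    vjepa_emb = np.load("vjepa_embeddings.npy")    # (N, d_vjepa), hypothetical file

    combined = np.concatenate([lift_emb, vjepa_emb], axis=1)   # (N, d_lift + d_vjepa)
    # `combined` feeds the same linear-probe protocol used above.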

๐Ÿ™ Acknowledgements

📜 Citation

If you find this work useful, please consider citing:

      @article{bagad2025chirality,
        title={Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening},
        author={Bagad, Piyush and Zisserman, Andrew},
        journal={arXiv preprint arXiv:2509.08502},
        year={2025}
      }

      @InProceedings{Bagad25,
        author    = "Piyush Bagad and Andrew Zisserman",
        title     = "Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening",
        booktitle = "NeurIPS",
        year      = "2025",
      }

📙 Related Work