CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models

1Fudan University  2Shanghai Artificial Intelligence Laboratory  3Shanghai Jiao Tong University
*Work done during an internship at Shanghai AI Laboratory. Corresponding author

TL;DR

We propose a novel video generation framework that enables cinematic transitions between shots using masked diffusion. Our method provides precise control over shot timestamps and maintains stylistic coherence across shots. It is trained on a custom-curated multi-shot dataset and also transfers effectively in a training-free setting.

Teaser Video

Overview

⏩ CineTrans-DiT

🎬 CineTrans-UNet

📹 Customize (Training-free)

Abstract

Despite significant advances in video synthesis, research into multi-shot video generation remains in its infancy. Even with scaled-up models and massive datasets, shot transition capabilities remain rudimentary and unstable, largely confining generated videos to single-shot sequences. In this work, we introduce CineTrans, a novel framework for generating coherent multi-shot videos with cinematic, film-style transitions. To facilitate learning of film editing style, we construct a multi-shot video-text dataset, Cine250K, with detailed shot annotations. Furthermore, our analysis of existing video diffusion models uncovers a strong correspondence between diffusion model attention maps and shot boundaries, which we leverage to design a mask-based control mechanism that enables transitions at arbitrary positions and transfers effectively in a training-free setting. After fine-tuning on our multi-shot video dataset with the mask mechanism, CineTrans produces cinematic multi-shot sequences that adhere to the film editing style, avoiding unstable transitions and naive concatenations. Finally, we propose specialized evaluation metrics for transition control, temporal consistency, and overall quality, and demonstrate through extensive experiments that CineTrans significantly outperforms existing baselines across all criteria.
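To illustrate the general idea behind mask-based transition control, a minimal sketch follows: frames are grouped into shots by user-specified boundary timestamps, and a block-diagonal temporal attention mask restricts attention to frames within the same shot. The function name and interface here are hypothetical and do not reflect the paper's actual implementation, which operates inside the diffusion model's attention layers.

```python
import numpy as np

def shot_attention_mask(num_frames, boundaries):
    """Build a block-diagonal temporal attention mask.

    Frames in the same shot may attend to each other; attention across
    shot boundaries is masked out. `boundaries` lists the first frame
    index of each new shot (hypothetical interface for illustration).
    """
    # Assign each frame a shot id: the id increments at every boundary.
    shot_id = np.zeros(num_frames, dtype=int)
    for b in boundaries:
        shot_id[b:] += 1
    # mask[i, j] is True when frames i and j belong to the same shot.
    return shot_id[:, None] == shot_id[None, :]

# A 6-frame clip cut into two shots at frame 3:
mask = shot_attention_mask(6, [3])
```

In a diffusion transformer this boolean mask would be broadcast over the attention logits (masked positions set to negative infinity before the softmax), so that each shot is denoised as a coherent unit while the cut position is controlled exactly.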

Metrics and Results

Figure: Metrics
Figure: Results

More Results

BibTeX

@misc{wu2025cinetranslearninggeneratevideos,
      title={CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models}, 
      author={Xiaoxue Wu and Bingjie Gao and Yu Qiao and Yaohui Wang and Xinyuan Chen},
      year={2025},
      eprint={2508.11484},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.11484}, 
}