CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models

1Fudan University  2Shanghai Artificial Intelligence Laboratory  3Shanghai Jiao Tong University
*Work done during an internship at Shanghai AI Laboratory. Corresponding author

TL;DR

We propose a novel video generation framework that enables cinematic transitions between shots using masked diffusion. Our method provides precise control over shot timestamps and maintains stylistic coherence across shots. It is trained on a custom-curated multi-shot dataset and also transfers effectively in a training-free setting.

Teaser Video

Overview

⏩ CineTrans-DiT

🎬 CineTrans-UNet

📹 Customize (Training-free)

Abstract

Despite significant advances in video synthesis, research into multi-shot video generation remains in its infancy. Even with scaled-up models and massive datasets, shot transition capabilities remain rudimentary and unstable, largely confining generated videos to single-shot sequences. In this work, we introduce CineTrans, a novel framework for generating coherent multi-shot videos with cinematic, film-style transitions. To facilitate learning of film editing style, we construct a multi-shot video-text dataset, Cine250K, with detailed shot annotations. Furthermore, our analysis of existing video diffusion models uncovers a strong correspondence between diffusion model attention maps and shot boundaries, which we leverage to design a mask-based control mechanism that enables transitions at arbitrary positions and transfers effectively in a training-free setting. After fine-tuning on our multi-shot video dataset with the mask mechanism, CineTrans produces cinematic multi-shot sequences that adhere to the film editing style, avoiding unstable transitions and naive concatenations. Finally, we propose specialized evaluation metrics for transition control, temporal consistency, and overall quality, and demonstrate through extensive experiments that CineTrans significantly outperforms existing baselines across all criteria.
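To illustrate the general idea behind mask-based transition control, a minimal sketch follows: frames are grouped into shots by user-specified boundary timestamps, and a block-diagonal temporal attention mask restricts attention to frames within the same shot. The function name and interface here are hypothetical and do not reflect the paper's actual implementation, which operates inside the diffusion model's attention layers.

```python
import numpy as np

def shot_attention_mask(num_frames, boundaries):
    """Build a block-diagonal temporal attention mask.

    Frames in the same shot may attend to each other; attention across
    shot boundaries is masked out. `boundaries` lists the first frame
    index of each new shot (hypothetical interface for illustration).
    """
    # Assign each frame a shot id: the id increments at every boundary.
    shot_id = np.zeros(num_frames, dtype=int)
    for b in boundaries:
        shot_id[b:] += 1
    # mask[i, j] is True when frames i and j belong to the same shot.
    return shot_id[:, None] == shot_id[None, :]

# A 6-frame clip cut into two shots at frame 3:
mask = shot_attention_mask(6, [3])
```

In a diffusion transformer this boolean mask would be broadcast over the attention logits (masked positions set to negative infinity before the softmax), so that each shot is denoised as a coherent unit while the cut position is controlled exactly.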

Metrics and Results

Figure: Metrics
Figure: Results

More Results

BibTeX

@misc{wu2025cinetranslearninggeneratevideos,
      title={CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models}, 
      author={Xiaoxue Wu and Bingjie Gao and Yu Qiao and Yaohui Wang and Xinyuan Chen},
      year={2025},
      eprint={2508.11484},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.11484}, 
}