ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions

¹Fudan University  ²Shanghai Artificial Intelligence Laboratory
*Work done during an internship at Shanghai AI Laboratory. Corresponding author.

TL;DR

We introduce ShotDirector, a controllable multi-shot video generation framework that models diverse cinematographic transition types by combining parameter-level camera control with editing-pattern-aware prompting. Through 6-DoF camera conditioning and a shot-aware mask mechanism, it enables intentional, film-like transitions beyond simple shot changes.
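As a rough illustration of what a parameter-level camera condition could look like, the sketch below builds a relative 6-DoF pose between two shots and appends pinhole intrinsics. It is not taken from the ShotDirector code: the function names, the axis convention (x right, y up, z forward), the yaw-only rotation, and the 16-value flattened layout are all assumptions made for this example.

```python
import math

def yaw_matrix(left_deg):
    """3x3 rotation about the vertical axis; positive left_deg turns the view left.

    Assumed convention: x points right, y up, z forward.
    """
    t = -math.radians(left_deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def relative_pose(left_deg, right, up, forward):
    """3x4 [R | t] of the second shot's camera, expressed in the first shot's frame.

    Only yaw is modeled here; pitch and roll would complete the 6 degrees of freedom.
    """
    R = yaw_matrix(left_deg)
    t = [right, up, forward]
    return [R[i] + [t[i]] for i in range(3)]

def camera_condition(pose_3x4, fx, fy, cx, cy):
    """Hypothetical flat condition vector: 12 extrinsic values + 4 intrinsic values."""
    flat = [v for row in pose_3x4 for v in row]
    return flat + [fx, fy, cx, cy]

# "camera moves right-forward, rotates left 90 degrees", with made-up unit distances
pose = relative_pose(left_deg=90.0, right=1.0, up=0.0, forward=1.0)
cond = camera_condition(pose, fx=1.0, fy=1.0, cx=0.5, cy=0.5)

# Third column of R is where the old forward axis ends up after the turn.
forward_axis = [pose[i][2] for i in range(3)]
```

With these conventions, a 90° left turn maps the forward axis onto the negative x axis, which is a quick sanity check on the sign of the rotation.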

⏩ Cut-in

[Subject] [Woman 1] a young artist with wavy brown hair in a paint-stained apron.
[General] The scene portrays a quiet morning in her studio as she paints in focused solitude.
[Transition] Cut-in. Transition from the wide studio view into a close-up of her brush and canvas, emphasizing artistic detail.
[Shot 1] [Woman 1] sits in a sunlit art studio surrounded by easels and canvases; wide shot, soft ambient light, artistic clutter, pastel tones, calm creative atmosphere.
[Shot 2] [Woman 1]'s hand moves a fine brush over a canvas, revealing delicate floral strokes; close-up, raking light, rich pigment texture, painterly aesthetic.

[Subject] [Man 1] a hooded man in a bulky gray parka, face pressed to a brick seam.
[General] [Man 1] lies on concrete under a stairwell between green-painted and red brick walls, intently peering through a narrow gap or doorway.
[Transition] Cut-in. Cut from a wider establishing shot of [Man 1] lying under the stairs to a close-up on his head and coat pressed into the brick seam, intensifying detail and focus.
[Shot 1] [Man 1] sprawled on the floor beside a stairwell and green brick wall, facing a red-brick corner. Wide/medium-wide framing, low-angle, low-key lighting, muted greens and reds.
[Shot 2] [Man 1]'s hooded head and shoulder pressed into the red brick seam in tight close-up. Close framing, shallow depth, intimate perspective, directional light emphasizing texture and fabric.

[Subject] [Woman 1] a woman with long wavy hair wearing a patterned knit hat, bright red scarf, black coat and jeans.
[General] [Woman 1] walks down a dim, tree-lined sidewalk at dusk, hands in pockets, with blurred car headlights and traffic lights behind her.
[Transition] Cut-in. The edit tightens from a wider, full-body sidewalk view to a closer medium-close portrait of [Woman 1], increasing intimacy and isolating her against blurred lights.
[Shot 1] [Woman 1] is shown full-body walking toward camera on a sidewalk, coat and scarf visible. Wide/medium-long shot, eye-level, moderate depth, cool dusk lighting, soft bokeh.
[Shot 2] [Woman 1] is framed in a tighter head-and-shoulders shot, neutral expression visible. Medium close-up, eye-level, shallow depth of field, low-key dusk light, stronger background bokeh.

[Subject] [Woman 1] a young traveler with shoulder-length black hair wearing a long beige coat.
[General] The scene captures a moment of anticipation before departure on a misty morning platform.
[Transition] Cut-in. Transition from the wide platform view into a close-up of her hand gripping the bag, focusing on quiet emotional tension within the stillness.
[Shot 1] [Woman 1] stands on a quiet train platform as the morning train arrives; wide shot, cool dawn light, gentle fog, balanced composition, muted tones.
[Shot 2] [Woman 1]'s fingers tighten around the strap of her leather bag; close-up, shallow depth of field, soft sidelight glinting off the metal buckle, subtle tension.

[Subject] [Woman 1] an older woman with a red bob haircut wearing a floral dress and a pendant necklace.
[General] [Woman 1] sits in a wood-paneled study speaking to camera; first frame shows a seated, gesturing view, second frame tightens to a close-up of her face.
[Transition] Cut-in. Cut from a medium-wide, gesturing shot of [Woman 1] in the study to a closer head-and-shoulders view, increasing intimacy and revealing facial expression details.
[Shot 1] [Woman 1] seated in a leather chair gesturing with hands beside a round table and lamp; medium-wide shot, eye-level, warm practical lighting, shallow focus, rich wood tones.
[Shot 2] [Woman 1] close-up, mid-head-and-shoulders as she speaks, expressive facial detail; close-up framing, tight focus on face, soft warm lighting, blurred background, eye-level.

[Subject] [Woman 1] an elderly woman with grey hair, gaunt features, wearing a patterned hospital gown, covered by a pink blanket.
[General] [Woman 1] lies propped in a clinical hospital room, mouth open and vulnerable, monitored equipment and bed rails visible; cold, desaturated light emphasizes frailty.
[Transition] Cut-in. Cut from a medium-wide view of [Woman 1] in the bed to a close-up of her face, tightening focus to heighten vulnerability and emotional detail.
[Shot 1] [Woman 1] reclines in a hospital bed seen from a medium-wide/high-angle view, showing rails, monitor, and blanket. Cool desaturated color, even clinical lighting, moderate depth.
[Shot 2] [Woman 1] fills the frame in a tight close-up, mouth open and details of wrinkles visible. Close-up, shallow focus, low-key cool light, intimate, slightly tilted angle.

🔁🎬 Shot/Reverse Shot

[Subject] [Woman 1] a blonde woman in a red-orange vintage coat with a pearl necklace, seated at a desk; [Man 1] a dark-haired man in a checked suit and tie, neatly groomed.
[General] [Woman 1] sits at a wood-paneled office desk, glancing sideways across ornate brass desk objects; [Man 1] sits opposite, listening attentively in a muted, warm period interior.
[Transition] Shot/Reverse Shot. Cut from the medium, desk-framed shot of [Woman 1] to a tighter reverse close-up on [Man 1], switching perspective to capture his reaction.
[Shot 1] [Woman 1] seated at the desk, turned to the right as if speaking; medium shot, eye-level framing, moderate depth of field, warm low-key lighting, muted vintage colors.
[Shot 2] [Man 1] shown in a tight medium close-up, looking offscreen toward [Woman 1]; sharp facial focus, soft warm illumination, eye-level angle, subtle contrast and color grading.

[Subject] [Man 1] a male security officer in a blue uniform with dark hair, a badge and radio. [Woman 1] a blonde woman in a purple blouse with curled hair.
[General] [Man 1] and [Woman 1] converse on a tree-lined street near cafe seating; he offers a mild, knowing smile while she responds with a concerned, intense expression.
[Transition] Shot/Reverse Shot. The cut shifts perspective from a medium frame highlighting [Man 1]'s expression to a reverse medium close-up of [Woman 1], showing her reaction and preserving conversational continuity.
[Shot 1] [Man 1] faces [Woman 1], smiling slightly, badge and radio visible. Medium shot, eye-level framing, shallow focus on his face, natural daylight, warm saturated colors, soft background.
[Shot 2] [Woman 1] reacts with tense, focused expression, [Man 1]'s shoulder partially visible. Medium close-up, eye-level, sharp focus on her face, soft daylight, neutral tones, intimate framing.

[Subject] [Man 1] a middle-aged man with graying hair wearing a dark fur-collared coat and light blue shirt. [Woman 1] a young blonde woman wearing silver tinsel.
[General] [Man 1] stands at a store counter while [Woman 1] in festive garb hands him a small box; holiday shelves and decorations fill the background.
[Transition] Shot/Reverse Shot. Cuts from a medium-close of [Man 1] to a reverse medium-close on [Woman 1] offering the item, maintaining spatial continuity and conversational pacing.
[Shot 1] [Man 1] is framed chest-up at the counter, looking down with Christmas displays behind him. Medium close, eye-level, soft fluorescent light, muted colors, shallow focus.
[Shot 2] [Woman 1] leans forward presenting a small box to [Man 1], silver tinsel framing her face. Medium-close, slight high angle, soft fill light, warm highlights, focused on hands and face.

[Subject] [Man 1] a man with dark wavy hair wearing a casual button shirt. [Woman 1] a woman with long straight hair and a braided crown, smiling.
[General] [Man 1] and [Woman 1] converse at an evening fair, exchanging warm smiles as colorful bokeh lights and blurred crowds form a soft, festive background.
[Transition] Shot/Reverse Shot. The edit alternates perspectives between speakers: a medium-close of [Man 1] cuts to the reverse medium-close of [Woman 1], maintaining conversational rhythm and spatial continuity.
[Shot 1] [Man 1] faces the camera slightly right, mid-close framing; shallow focus isolates him, warm dusk lighting, soft color palette, eye-level angle, visible bokeh background.
[Shot 2] [Woman 1] smiles in profile toward [Man 1], mid-close framing; shallow depth-of-field keeps her sharp, warm ambient light, saturated bokeh highlights, eye-level, intimate feel.

[Subject] [Woman 1] a woman wearing a red knit hat and matching red scarf. [Man 1] a middle-aged man with glasses wearing a blue sweater.
[General] [Woman 1] and [Man 1] sit outdoors in a quiet conversation; she looks down pensively while he watches her closely, suggesting an intimate emotional exchange.
[Transition] Shot/Reverse Shot. The cut alternates perspective from a close, introspective shot of [Woman 1] to a responsive medium close-up of [Man 1], emphasizing dialogue and emotional reaction.
[Shot 1] [Woman 1] in a close-up, head bowed slightly, wearing a red hat and scarf. Close-up, shallow focus, soft natural light, muted background, slight high-angle feel.
[Shot 2] [Man 1] in a medium close-up, looking surprised or attentive, wearing glasses and blue sweater. Medium close-up, eye-level, sharp focus on face, natural lighting, subdued colors.

[Subject] [Woman 1] a short-haired athlete wearing a navy "Athletes Row" shirt. [Woman 2] a longer-haired woman in a teal tank top.
[General] [Woman 1] and [Woman 2] stand close in a blue-tinted locker room, exchanging tense dialogue with intense eye contact and focused expressions.
[Transition] Shot/Reverse Shot. The edit switches perspective from [Woman 1] to [Woman 2], alternating viewpoints in a dialogue exchange while preserving spatial continuity and emotional tension.
[Shot 1] [Woman 1] faces the camera talking to the blurred back of [Woman 2]. Medium close-up, shallow focus on subject, cool lighting, eye-level, tight framing.
[Shot 2] [Woman 2] looks intently at [Woman 1], her expression serious; over-the-shoulder of [Woman 1] visible. Medium close-up, sharp focus, rim backlighting, cool tones.

⏪ Cut-out

[Subject] [Man 1] a bald man with a short beard, wearing an orange prison jumpsuit and handcuffs, with a large tattoo on his left forearm.
[General] [Man 1] sits at an interrogation table in a cinderblock-walled room; first frame tightly shows his face, then a wider frame reveals his handcuffed hands and tattoo.
[Transition] Cut-out. The edit moves from a tight facial close-up to a wider medium shot revealing context (handcuffs, tattoo, and the interrogation setting), broadening visual information.
[Shot 1] [Man 1] close-up on his face, looking down and slightly left. Tight framing, shallow focus, cool fluorescent lighting, muted color palette, slight high-angle composition.
[Shot 2] [Man 1] medium shot showing torso, tattooed forearm, and handcuffed hands on the table. Wider framing, deeper focus, flat lighting, neutral color, eye-level angle.

[Subject] [Officer 1] a uniformed policeman in a dark blue uniform and cap, crouching with a tense expression. [Dog 1] a tan-and-white English bulldog wearing a studded collar.
[General] [Officer 1] holds [Dog 1] on a leash inside a marble-floored, columned interior, peering down a corridor, appearing alert and ready.
[Transition] Cut-out. The edit moves from a tighter, nearer view of [Officer 1] restraining [Dog 1] to a wider shot that reveals the columned hall and spatial context.
[Shot 1] [Officer 1] crouches and grips [Dog 1]'s collar in a closer view; close-medium shot, low angle, sharp focus on faces, even interior lighting, muted colors.
[Shot 2] [Officer 1] and [Dog 1] are framed farther down the hall; wide shot, low camera angle, deep focus, reflective marble floor, cool muted lighting.

[Subject] [Boy 1] a brown-haired boy in a plaid shirt and red hoodie. [Girl 1] a brown-haired girl in a purple cardigan holding a blanket. [Dog 1] a large brown-and-white Saint Bernard.
[General] [Boy 1] looks down with concern while [Girl 1] stands nearby smiling and holding a blanket; [Dog 1] sniffs the ground in a churchyard setting, daytime.
[Transition] Cut-out. Cut from a tight close-up of [Boy 1] to a medium-wide shot revealing [Girl 1], [Dog 1], and the surrounding churchyard, expanding context and showing interaction.
[Shot 1] [Boy 1] close-up, looking downward with a worried expression. Tight framing, shallow focus on his face, soft natural daylight, warm muted color palette, eye-level camera.
[Shot 2] [Boy 1] and [Girl 1] stand beside [Dog 1] in a medium-wide shot showing the churchyard. Balanced composition, deeper focus, even daylight, natural warm tones, neutral camera angle.

[Subject] [Man 1] a middle-aged man with salt-and-pepper hair and a trimmed goatee wearing a burgundy/red dress shirt.
[General] [Man 1] speaks during an interview in a warmly lit living room, positioned against blurred wall decor and a couch, conveying a relaxed, conversational tone.
[Transition] Cut-out. Cut from a close, intimate head-and-shoulders shot of [Man 1] to a wider medium shot revealing environment, expanding context while keeping conversational continuity.
[Shot 1] [Man 1] tight close-up on face and shoulders as he talks. Framing: head-and-shoulders. Shallow focus, soft warm lighting, eye-level angle, intimate, muted color.
[Shot 2] [Man 1] medium-wide seated shot showing more of the living room and wall decor. Wider framing, slightly deeper focus, soft warm light, eye-level, contextual composition.

[Subject] [Man 1] a bearded man wearing a black hockey jersey and a white backwards cap.
[General] [Man 1] performs stand-up on stage holding a microphone; one frame is a close frontal portrait, the other shows his back with "SMITH 37" and the audience.
[Transition] Cut-out. Cut from a medium close frontal portrait of [Man 1] to a wider rear full-body shot, increasing distance and changing angle to reveal his jersey and audience.
[Shot 1] [Man 1] faces the camera holding a microphone in a head-and-shoulders shot. Medium close-up, eye-level, sharp focus on face, spotlighting, dark background, warm tones.
[Shot 2] [Man 1] is seen from behind in a full-body wide shot showing "SMITH 37" on his jersey, stage stool, and audience. Wide framing, deeper focus, spotlighting, high contrast, slightly low angle.

[Subject] [Man 1] a man with curly red hair in a suit. [Piñata 1] a tall blue piñata-like creature with orange accents.
[General] [Man 1] energetically performs beside [Piñata 1] on a conference table in a festively decorated boardroom while businessmen watch, creating a chaotic, comedic scene.
[Transition] Cut-out. Cuts from a tight, energetic close-up of [Man 1] and [Piñata 1] to a wider establishing shot that reveals the full boardroom, audience, and props.
[Shot 1] [Man 1] mid-shout beside [Piñata 1], tightly framed on faces and upper torsos. Close-up, shallow focus, overhead fluorescent light, saturated colors, eye-level, expressive movement.
[Shot 2] [Man 1] sitting on the conference table with [Piñata 1] in foreground and businessmen around. Wide shot, deep focus, flat fluorescent lighting, low tabletop camera angle, vivid set.

🎥🔄 Multi-Angle

[Subject] [Man 1] an older man in a fedora, pinstripe suit and polka-dot bow tie, with pronounced forehead wrinkles and a cheek mole.
[General] [Man 1] leans forward inside a dim, cluttered interior, intently watching a small vintage television showing black-and-white dancers, conveying focused curiosity and nostalgia.
[Transition] Multi-Angle. Cuts from an intimate close-up of [Man 1]'s face to a wider over-the-shoulder view revealing him watching the vintage television, expanding context and spatial orientation.
[Shot 1] [Man 1] close-up of his wrinkled face and fedora brim, eyes partially closed. Tight framing, shallow focus, soft interior lighting, warm color saturation, slight high-angle.
[Shot 2] [Man 1] seen from behind leaning toward a small tabletop black-and-white TV showing dancers. Medium over-the-shoulder, deeper focus, low-key lighting, muted palette, eye-level angle.

[Subject] [Woman 1] a brunette woman wearing sunglasses on her head and a light blue lace top. [Man 1] a middle-aged man in a blue shirt. [Man 2] a man in a white shirt with blue stripes holding a glass.
[General] [Woman 1] greets guests warmly, reaching in for a hug with a broad smile while [Man 2] watches with a drink and [Man 1] receives the embrace under a tented outdoor event.
[Transition] Multi-Angle. Camera moves to the opposite side of the interaction, cutting from a frontal view of [Woman 1] reaching out to an over-the-shoulder perspective focused on [Man 1], maintaining continuity.
[Shot 1] [Woman 1] faces the camera smiling and reaching forward; [Man 2] stands slightly behind holding a glass. Medium close-up, eye-level, shallow focus, natural daylight, vibrant colors, slight motion blur.
[Shot 2] [Woman 1] is seen from behind as she embraces [Man 1], whose face responds; background guests softly blurred. Over-the-shoulder mid shot, eye-level, soft daylight, intimate framing, warm tones.

[Subject] [Rider 1] a motorcycle rider with a visible hand gripping chrome handlebars. [Car 1] a light-colored vintage station wagon.
[General] [Rider 1] pursues [Car 1] along a sunlit desert highway; handlebars foreground, station wagon midground, dust trailing, conveying high-speed chase across wide-open heat-hazed terrain.
[Transition] Multi-Angle. Cuts from an intimate, handlebar-level perspective alongside [Car 1] to a wider rear-trailing shot, shifting angle to expose road, spacing, and chase context.
[Shot 1] [Rider 1]'s hand and motorcycle handlebars dominate frame with [Car 1] alongside. Close low-angle, tight framing, slight motion blur, harsh midday light, saturated blues.
[Shot 2] [Car 1] centered trailing down road, dust plume visible. Wide medium-long shot, eye-level, deeper focus, balanced daylight, cooler desaturated tones, reveals open highway.

[Subject] [Woman 1] a woman with short platinum-blonde hair wearing a pale sheer nightgown.
[General] [Woman 1] appears anxious in a dim interior hallway, moving toward and inspecting a closed apartment door, creating a tense, suspenseful atmosphere.
[Transition] Multi-Angle. Cut from a closer frontal shot of [Woman 1] to a wider rear-facing shot revealing the door, expanding spatial context and heightening suspense.
[Shot 1] [Woman 1] is shown in a close medium frontal shot, tense expression with strong shadowing; tight framing, shallow focus, low-key overhead lighting, muted tones, eye-level.
[Shot 2] [Woman 1] is seen from behind facing a closed door; wider medium shot with more negative space, cool low-key lighting, deeper focus, off-center composition, slightly high angle.

[Subject] [Person 1] a person in an orange protective suit with a clear face visor. [Man 1], [Woman 1] two homeowners watching on a porch.
[General] [Person 1] smiles behind the visor while standing outside a suburban house; [Man 1] and [Woman 1] watch from the doorway, amused and curious.
[Transition] Multi-Angle. Cut from a close-up of [Person 1]'s smiling face to a wider shot revealing the full protective suit, backpack apparatus, and homeowners' reactions on the porch.
[Shot 1] [Person 1]'s face fills the frame behind the clear visor, smiling; close-up, tight framing, shallow focus, natural daylight, warm color, slight low-angle reflection on visor.
[Shot 2] [Person 1] is seen from behind with a backpack unit approaching the house as [Man 1] and [Woman 1] stand together; medium-long shot, eye-level, deeper focus, daylight.

[Subject] [Man 1] a bald middle-aged man in a black suit. [Crew 1] a cluster of photographers, camera operators and a boom mic.
[General] [Man 1] marches through a tree-lined park while [Crew 1] swarms him, cameras and boom mics thrust forward, a tense confrontation in cold, overcast light.
[Transition] Multi-Angle.
[Shot 1] [Man 1] strides left-center through a crowd of reporters; trees fill the background. Wide/medium-long shot, eye-level, muted desaturated colors, soft overcast lighting, moderate depth.
[Shot 2] [Man 1] in a tighter medium-close, shouting as [Crew 1] and cameras press in from the right. Medium-close, eye-level, sharp focus on expression, handheld immediacy, cool tones.
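The prompts in the galleries above share a fixed layout: `[Subject]`, `[General]`, `[Transition]` (a transition type followed by a description), and one line per shot. A minimal sketch of how such a prompt might be split into its fields follows. The parser, its function name, and the returned dictionary keys are illustrative assumptions, not part of the released framework.

```python
import re

# One tagged field per line: [Subject], [General], [Transition], [Shot N]
FIELD = re.compile(r"^\[(Subject|General|Transition|Shot \d+)\]\s*(.*)$")

def parse_prompt(text):
    """Split a hierarchical prompt into global fields and an ordered list of shots."""
    fields = {"shots": []}
    for line in text.strip().splitlines():
        match = FIELD.match(line.strip())
        if not match:
            continue  # ignore anything that is not a tagged field
        key, body = match.group(1), match.group(2)
        if key.startswith("Shot"):
            fields["shots"].append(body)
        elif key == "Transition":
            # The transition type is the text before the first period.
            label, _, description = body.partition(".")
            fields["transition_type"] = label.strip()
            fields["transition"] = description.strip()
        else:
            fields[key.lower()] = body
    return fields

example = """
[Subject] [Woman 1] a young artist with wavy brown hair in a paint-stained apron.
[General] The scene portrays a quiet morning in her studio as she paints in focused solitude.
[Transition] Cut-in. Transition from the wide studio view into a close-up of her brush and canvas.
[Shot 1] [Woman 1] sits in a sunlit art studio surrounded by easels and canvases; wide shot.
[Shot 2] [Woman 1]'s hand moves a fine brush over a canvas; close-up.
"""

parsed = parse_prompt(example)
```

Keeping the shot lines in a list preserves their order, which matters because each shot description is meant to apply to a specific segment of the video.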

Abstract

Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling. However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting. Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions. To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.
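One way to picture a shot-aware mask is as a cross-attention mask in which the video tokens of each shot can see the shared (global) prompt tokens plus the text tokens of their own shot, but not the text of other shots. The sketch below builds such a mask in plain Python. The token layout, argument names, and masking rule are assumptions chosen for illustration; the abstract does not specify the mechanism at this level of detail.

```python
def shot_aware_mask(frames_per_shot, tokens_per_frame, n_global, shot_text_lens):
    """Return mask[q][k]: True if video token q may attend to text token k.

    Assumed text layout: [global tokens | shot 1 tokens | shot 2 tokens | ...].
    Assumed video layout: all tokens of shot 1, then all tokens of shot 2, ...
    """
    if len(frames_per_shot) != len(shot_text_lens):
        raise ValueError("need one text length per shot")
    n_text = n_global + sum(shot_text_lens)
    mask = []
    offset = n_global  # start of the current shot's text span
    for n_frames, n_shot_text in zip(frames_per_shot, shot_text_lens):
        row = [k < n_global or offset <= k < offset + n_shot_text
               for k in range(n_text)]
        # Every video token in this shot shares the same visibility pattern.
        mask.extend(list(row) for _ in range(n_frames * tokens_per_frame))
        offset += n_shot_text
    return mask

# Two shots: 2 frames and 1 frame, 2 video tokens per frame,
# 3 global text tokens, 2 text tokens per shot prompt.
mask = shot_aware_mask(frames_per_shot=[2, 1], tokens_per_frame=2,
                       n_global=3, shot_text_lens=[2, 2])
```

In a real model this boolean matrix would typically be converted to an additive mask or passed to the attention operator of the chosen framework; that step is omitted here.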

Figure

Qualitative Comparisons

Script 1 (Cut-in)

Cut from a medium-wide group shot of several men to a closer, intimate portrait of one of them, highlighting his expression and attendants. Camera moves right-forward, rotates left 90°.

Ours

Phantom

CineTrans

StoryDiffusion

Mask2DiT

HunyuanVideo

Wan2.2

SynCamMaster

ReCamMaster

Script 2 (Shot/Reverse Shot)

Alternating perspective between a woman and a man as they converse; the cut switches from her medium close-up to his medium close-up. Camera moves left-forward, rotates right 120°.

Ours

Phantom

CineTrans

StoryDiffusion

Mask2DiT

HunyuanVideo

Wan2.2

SynCamMaster

ReCamMaster

Script 3 (Cut-out)

The sequence pulls back from a medium close-up of a man inspecting a lighter and smoking to a wider shot that reveals his full figure and the surrounding stoop. Camera moves backward, no rotation.

Ours

Phantom

CineTrans

StoryDiffusion

Mask2DiT

HunyuanVideo

Wan2.2

SynCamMaster

ReCamMaster

Script 4 (Multi-Angle)

Cut between two camera angles of a man: a side-biased medium close-up emphasizing hands and object, then a frontal medium shot revealing silhouette and surroundings. Camera moves left-backward, rotates right 45°.

Ours

Phantom

CineTrans

StoryDiffusion

Mask2DiT

HunyuanVideo

Wan2.2

SynCamMaster

ReCamMaster

BibTeX

@misc{wu2025shotdirectordirectoriallycontrollablemultishot,
      title={ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions}, 
      author={Xiaoxue Wu and Xinyuan Chen and Yaohui Wang and Yu Qiao},
      year={2025},
      eprint={2512.10286},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.10286}, 
}