We introduce ShotDirector, a controllable multi-shot video generation framework that models diverse cinematographic transition types by combining parameter-level camera control with editing-pattern-aware prompting. Through 6-DoF camera conditioning and a shot-aware mask mechanism, it enables intentional, film-like transitions beyond simple shot changes.
[Subject] [Woman 1] a young artist with wavy brown hair in a paint-stained apron.
[General] The scene portrays a quiet morning in her studio as she paints in focused solitude.
[Transition] Cut-in. Transition from the wide studio view into a close-up of her brush and canvas, emphasizing artistic detail.
[Shot 1] [Woman 1] sits in a sunlit art studio surrounded by easels and canvases; wide shot, soft ambient light, artistic clutter, pastel tones, calm creative atmosphere.
[Shot 2] [Woman 1]'s hand moves a fine brush over a canvas, revealing delicate floral strokes; close-up, raking light, rich pigment texture, painterly aesthetic.
[Subject] [Man 1] a hooded man in a bulky gray parka, face pressed to a brick seam.
[General] [Man 1] lies on concrete under a stairwell between green-painted and red brick walls, intently peering through a narrow gap or doorway.
[Transition] Cut-in. Cut from a wider establishing shot of [Man 1] lying under the stairs to a close-up on his head and coat pressed into the brick seam, intensifying detail and focus.
[Shot 1] [Man 1] sprawled on the floor beside a stairwell and green brick wall, facing a red-brick corner. Wide/medium-wide framing, low-angle, low-key lighting, muted greens and reds.
[Shot 2] [Man 1]'s hooded head and shoulder pressed into the red brick seam in tight close-up. Close framing, shallow depth, intimate perspective, directional light emphasizing texture and fabric.
[Subject] [Woman 1] a woman with long wavy hair wearing a patterned knit hat, bright red scarf, black coat and jeans.
[General] [Woman 1] walks down a dim, tree-lined sidewalk at dusk, hands in pockets, with blurred car headlights and traffic lights behind her.
[Transition] Cut-in. The edit tightens from a wider, full-body sidewalk view to a closer medium-close portrait of [Woman 1], increasing intimacy and isolating her against blurred lights.
[Shot 1] [Woman 1] is shown full-body walking toward camera on a sidewalk, coat and scarf visible. Wide/medium-long shot, eye-level, moderate depth, cool dusk lighting, soft bokeh.
[Shot 2] [Woman 1] is framed in a tighter head-and-shoulders shot, neutral expression visible. Medium close-up, eye-level, shallow depth of field, low-key dusk light, stronger background bokeh.
[Subject] [Woman 1] a young traveler with shoulder-length black hair wearing a long beige coat.
[General] The scene captures a moment of anticipation before departure on a misty morning platform.
[Transition] Cut-in. Transition from the wide platform view into a close-up of her hand gripping the bag, focusing on quiet emotional tension within the stillness.
[Shot 1] [Woman 1] stands on a quiet train platform as the morning train arrives; wide shot, cool dawn light, gentle fog, balanced composition, muted tones.
[Shot 2] [Woman 1]'s fingers tighten around the strap of her leather bag; close-up, shallow depth of field, soft sidelight glinting off the metal buckle, subtle tension.
[Subject] [Woman 1] an older woman with a red bob haircut wearing a floral dress and a pendant necklace.
[General] [Woman 1] sits in a wood-paneled study speaking to camera; first frame shows a seated, gesturing view, second frame tightens to a close-up of her face.
[Transition] Cut-in. Cut from a medium-wide, gesturing shot of [Woman 1] in the study to a closer head-and-shoulders view, increasing intimacy and revealing facial expression details.
[Shot 1] [Woman 1] seated in a leather chair gesturing with hands beside a round table and lamp; medium-wide shot, eye-level, warm practical lighting, shallow focus, rich wood tones.
[Shot 2] [Woman 1] close-up, mid-head-and-shoulders as she speaks, expressive facial detail; close-up framing, tight focus on face, soft warm lighting, blurred background, eye-level.
[Subject] [Woman 1] an elderly woman with grey hair, gaunt features, wearing a patterned hospital gown, covered by a pink blanket.
[General] [Woman 1] lies propped in a clinical hospital room, mouth open and vulnerable, monitored equipment and bed rails visible; cold, desaturated light emphasizes frailty.
[Transition] Cut-in. Cut from a medium-wide view of [Woman 1] in the bed to a close-up of her face, tightening focus to heighten vulnerability and emotional detail.
[Shot 1] [Woman 1] reclines in a hospital bed seen from a medium-wide/high-angle view, showing rails, monitor, and blanket. Cool desaturated color, even clinical lighting, moderate depth.
[Shot 2] [Woman 1] fills the frame in a tight close-up, mouth open and details of wrinkles visible. Close-up, shallow focus, low-key cool light, intimate, slightly tilted angle.
[Subject] [Woman 1] a blonde woman in a red-orange vintage coat with a pearl necklace, seated at a desk; [Man 1] a dark-haired man in a checked suit and tie, neatly groomed.
[General] [Woman 1] sits at a wood-paneled office desk, glancing sideways across ornate brass desk objects; [Man 1] sits opposite, listening attentively in a muted, warm period interior.
[Transition] Shot/Reverse Shot. Cut from the medium, desk-framed shot of [Woman 1] to a tighter reverse close-up on [Man 1], switching perspective to capture his reaction.
[Shot 1] [Woman 1] seated at the desk, turned to the right as if speaking; medium shot, eye-level framing, moderate depth of field, warm low-key lighting, muted vintage colors.
[Shot 2] [Man 1] shown in a tight medium close-up, looking offscreen toward [Woman 1]; sharp facial focus, soft warm illumination, eye-level angle, subtle contrast and color grading.
[Subject] [Man 1] a male security officer in a blue uniform with dark hair, a badge and radio. [Woman 1] a blonde woman in a purple blouse with curled hair.
[General] [Man 1] and [Woman 1] converse on a tree-lined street near cafe seating; he offers a mild, knowing smile while she responds with a concerned, intense expression.
[Transition] Shot/Reverse Shot. The cut shifts perspective from a medium frame highlighting [Man 1]'s expression to a reverse medium close-up of [Woman 1], showing her reaction and preserving conversational continuity.
[Shot 1] [Man 1] faces [Woman 1], smiling slightly, badge and radio visible. Medium shot, eye-level framing, shallow focus on his face, natural daylight, warm saturated colors, soft background.
[Shot 2] [Woman 1] reacts with tense, focused expression, [Man 1]'s shoulder partially visible. Medium close-up, eye-level, sharp focus on her face, soft daylight, neutral tones, intimate framing.
[Subject] [Man 1] a middle-aged man with graying hair wearing a dark fur-collared coat and light blue shirt. [Woman 1] a young blonde woman wearing silver tinsel.
[General] [Man 1] stands at a store counter while [Woman 1] in festive garb hands him a small box; holiday shelves and decorations fill the background.
[Transition] Shot/Reverse Shot. Cuts from a medium-close of [Man 1] to a reverse medium-close on [Woman 1] offering the item, maintaining spatial continuity and conversational pacing.
[Shot 1] [Man 1] is framed chest-up at the counter, looking down with Christmas displays behind him. Medium close, eye-level, soft fluorescent light, muted colors, shallow focus.
[Shot 2] [Woman 1] leans forward presenting a small box to [Man 1], silver tinsel framing her face. Medium-close, slight high angle, soft fill light, warm highlights, focused on hands and face.
[Subject] [Man 1] a man with dark wavy hair wearing a casual button shirt. [Woman 1] a woman with long straight hair and a braided crown, smiling.
[General] [Man 1] and [Woman 1] converse at an evening fair, exchanging warm smiles as colorful bokeh lights and blurred crowds form a soft, festive background.
[Transition] Shot/Reverse Shot. The edit alternates perspectives between speakers: a medium-close of [Man 1] cuts to the reverse medium-close of [Woman 1], maintaining conversational rhythm and spatial continuity.
[Shot 1] [Man 1] faces the camera slightly right, mid-close framing; shallow focus isolates him, warm dusk lighting, soft color palette, eye-level angle, visible bokeh background.
[Shot 2] [Woman 1] smiles in profile toward [Man 1], mid-close framing; shallow depth-of-field keeps her sharp, warm ambient light, saturated bokeh highlights, eye-level, intimate feel.
[Subject] [Woman 1] a woman wearing a red knit hat and matching red scarf. [Man 1] a middle-aged man with glasses wearing a blue sweater.
[General] [Woman 1] and [Man 1] sit outdoors in a quiet conversation; she looks down pensively while he watches her closely, suggesting an intimate emotional exchange.
[Transition] Shot/Reverse Shot. The cut alternates perspective from a close, introspective shot of [Woman 1] to a responsive medium close-up of [Man 1], emphasizing dialogue and emotional reaction.
[Shot 1] [Woman 1] in a close-up, head bowed slightly, wearing a red hat and scarf. Close-up, shallow focus, soft natural light, muted background, slight high-angle feel.
[Shot 2] [Man 1] in a medium close-up, looking surprised or attentive, wearing glasses and blue sweater. Medium close-up, eye-level, sharp focus on face, natural lighting, subdued colors.
[Subject] [Woman 1] a short-haired athlete wearing a navy "Athletes Row" shirt. [Woman 2] a longer-haired woman in a teal tank top.
[General] [Woman 1] and [Woman 2] stand close in a blue-tinted locker room, exchanging tense dialogue with intense eye contact and focused expressions.
[Transition] Shot/Reverse Shot. The edit switches perspective from [Woman 1] to [Woman 2], alternating viewpoints in a dialogue exchange while preserving spatial continuity and emotional tension.
[Shot 1] [Woman 1] faces the camera talking to the blurred back of [Woman 2]. Medium close-up, shallow focus on subject, cool lighting, eye-level, tight framing.
[Shot 2] [Woman 2] looks intently at [Woman 1], her expression serious; over-the-shoulder of [Woman 1] visible. Medium close-up, sharp focus, rim backlighting, cool tones.
[Subject] [Man 1] a bald man with a short beard, wearing an orange prison jumpsuit and handcuffs, with a large tattoo on his left forearm.
[General] [Man 1] sits at an interrogation table in a cinderblock-walled room; first frame tightly shows his face, then a wider frame reveals his handcuffed hands and tattoo.
[Transition] Cut-out. The edit moves from a tight facial close-up to a wider medium shot revealing context (handcuffs, tattoo, and the interrogation setting), broadening visual information.
[Shot 1] [Man 1] close-up on his face, looking down and slightly left. Tight framing, shallow focus, cool fluorescent lighting, muted color palette, slight high-angle composition.
[Shot 2] [Man 1] medium shot showing torso, tattooed forearm, and handcuffed hands on the table. Wider framing, deeper focus, flat lighting, neutral color, eye-level angle.
[Subject] [Officer 1] a uniformed policeman in a dark blue uniform and cap, crouching with a tense expression. [Dog 1] a tan-and-white English bulldog wearing a studded collar.
[General] [Officer 1] holds [Dog 1] on a leash inside a marble-floored, columned interior, peering down a corridor, appearing alert and ready.
[Transition] Cut-out. The edit moves from a tighter, nearer view of [Officer 1] restraining [Dog 1] to a wider shot that reveals the columned hall and spatial context.
[Shot 1] [Officer 1] crouches and grips [Dog 1]'s collar in a closer view; close-medium shot, low angle, sharp focus on faces, even interior lighting, muted colors.
[Shot 2] [Officer 1] and [Dog 1] are framed farther down the hall; wide shot, low camera angle, deep focus, reflective marble floor, cool muted lighting.
[Subject] [Boy 1] a brown-haired boy in a plaid shirt and red hoodie. [Girl 1] a brown-haired girl in a purple cardigan holding a blanket. [Dog 1] a large brown-and-white Saint Bernard.
[General] [Boy 1] looks down with concern while [Girl 1] stands nearby smiling and holding a blanket; [Dog 1] sniffs the ground in a churchyard setting, daytime.
[Transition] Cut-out. Cut from a tight close-up of [Boy 1] to a medium-wide shot revealing [Girl 1], [Dog 1], and the surrounding churchyard, expanding context and showing interaction.
[Shot 1] [Boy 1] close-up, looking downward with a worried expression. Tight framing, shallow focus on his face, soft natural daylight, warm muted color palette, eye-level camera.
[Shot 2] [Boy 1] and [Girl 1] stand beside [Dog 1] in a medium-wide shot showing the churchyard. Balanced composition, deeper focus, even daylight, natural warm tones, neutral camera angle.
[Subject] [Man 1] a middle-aged man with salt-and-pepper hair and a trimmed goatee wearing a burgundy/red dress shirt.
[General] [Man 1] speaks during an interview in a warmly lit living room, positioned against blurred wall decor and a couch, conveying a relaxed, conversational tone.
[Transition] Cut-out. Cut from a close, intimate head-and-shoulders shot of [Man 1] to a wider medium shot revealing environment, expanding context while keeping conversational continuity.
[Shot 1] [Man 1] tight close-up on face and shoulders as he talks. Framing: head-and-shoulders. Shallow focus, soft warm lighting, eye-level angle, intimate, muted color.
[Shot 2] [Man 1] medium-wide seated shot showing more of the living room and wall decor. Wider framing, slightly deeper focus, soft warm light, eye-level, contextual composition.
[Subject] [Man 1] a bearded man wearing a black hockey jersey and a white backwards cap.
[General] [Man 1] performs stand-up on stage holding a microphone; one frame is a close frontal portrait, the other shows his back with "SMITH 37" and the audience.
[Transition] Cut-out. Cut from a medium close frontal portrait of [Man 1] to a wider rear full-body shot, increasing distance and changing angle to reveal his jersey and audience.
[Shot 1] [Man 1] faces the camera holding a microphone in a head-and-shoulders shot. Medium close-up, eye-level, sharp focus on face, spotlighting, dark background, warm tones.
[Shot 2] [Man 1] is seen from behind in a full-body wide shot showing "SMITH 37" on his jersey, stage stool, and audience. Wide framing, deeper focus, spotlighting, high contrast, slightly low angle.
[Subject] [Man 1] a man with curly red hair in a suit. [Piñata 1] a tall blue piñata-like creature with orange accents.
[General] [Man 1] energetically performs beside [Piñata 1] on a conference table in a festively decorated boardroom while businessmen watch, creating a chaotic, comedic scene.
[Transition] Cut-out. Cuts from a tight, energetic close-up of [Man 1] and [Piñata 1] to a wider establishing shot that reveals the full boardroom, audience, and props.
[Shot 1] [Man 1] mid-shout beside [Piñata 1], tightly framed on faces and upper torsos. Close-up, shallow focus, overhead fluorescent light, saturated colors, eye-level, expressive movement.
[Shot 2] [Man 1] sitting on the conference table with [Piñata 1] in foreground and businessmen around. Wide shot, deep focus, flat fluorescent lighting, low tabletop camera angle, vivid set.
[Subject] [Man 1] an older man in a fedora, pinstripe suit and polka-dot bow tie, with pronounced forehead wrinkles and a cheek mole.
[General] [Man 1] leans forward inside a dim, cluttered interior, intently watching a small vintage television showing black-and-white dancers, conveying focused curiosity and nostalgia.
[Transition] Multi-Angle. Cuts from an intimate close-up of [Man 1]'s face to a wider over-the-shoulder view revealing him watching the vintage television, expanding context and spatial orientation.
[Shot 1] [Man 1] close-up of his wrinkled face and fedora brim, eyes partially closed. Tight framing, shallow focus, soft interior lighting, warm color saturation, slight high-angle.
[Shot 2] [Man 1] seen from behind leaning toward a small tabletop black-and-white TV showing dancers. Medium over-the-shoulder, deeper focus, low-key lighting, muted palette, eye-level angle.
[Subject] [Woman 1] a brunette woman wearing sunglasses on her head and a light blue lace top. [Man 1] a middle-aged man in a blue shirt. [Man 2] a man in a white shirt with blue stripes holding a glass.
[General] [Woman 1] greets guests warmly, reaching in for a hug with a broad smile while [Man 2] watches with a drink and [Man 1] receives the embrace under a tented outdoor event.
[Transition] Multi-Angle. Camera moves to the opposite side of the interaction, cutting from a frontal view of [Woman 1] reaching out to an over-the-shoulder perspective focused on [Man 1], maintaining continuity.
[Shot 1] [Woman 1] faces the camera smiling and reaching forward; [Man 2] stands slightly behind holding a glass. Medium close-up, eye-level, shallow focus, natural daylight, vibrant colors, slight motion blur.
[Shot 2] [Woman 1] is seen from behind as she embraces [Man 1], whose face responds; background guests softly blurred. Over-the-shoulder mid shot, eye-level, soft daylight, intimate framing, warm tones.
[Subject] [Rider 1] a motorcycle rider with a visible hand gripping chrome handlebars. [Car 1] a light-colored vintage station wagon.
[General] [Rider 1] pursues [Car 1] along a sunlit desert highway; handlebars foreground, station wagon midground, dust trailing, conveying high-speed chase across wide-open heat-hazed terrain.
[Transition] Multi-Angle. Cuts from an intimate, handlebar-level perspective alongside [Car 1] to a wider rear-trailing shot, shifting angle to expose road, spacing, and chase context.
[Shot 1] [Rider 1]'s hand and motorcycle handlebars dominate frame with [Car 1] alongside. Close low-angle, tight framing, slight motion blur, harsh midday light, saturated blues.
[Shot 2] [Car 1] centered trailing down road, dust plume visible. Wide medium-long shot, eye-level, deeper focus, balanced daylight, cooler desaturated tones, reveals open highway.
[Subject] [Woman 1] a woman with short platinum-blonde hair wearing a pale sheer nightgown.
[General] [Woman 1] appears anxious in a dim interior hallway, moving toward and inspecting a closed apartment door, creating a tense, suspenseful atmosphere.
[Transition] Multi-Angle. Cut from a closer frontal shot of [Woman 1] to a wider rear-facing shot revealing the door, expanding spatial context and heightening suspense.
[Shot 1] [Woman 1] is shown in a close medium frontal shot, tense expression with strong shadowing; tight framing, shallow focus, low-key overhead lighting, muted tones, eye-level.
[Shot 2] [Woman 1] is seen from behind facing a closed door; wider medium shot with more negative space, cool low-key lighting, deeper focus, off-center composition, slightly high angle.
[Subject] [Person 1] a person in an orange protective suit with a clear face visor. [Man 1], [Woman 1] two homeowners watching on a porch.
[General] [Person 1] smiles behind the visor while standing outside a suburban house; [Man 1] and [Woman 1] watch from the doorway, amused and curious.
[Transition] Multi-Angle. Cut from a close-up of [Person 1]'s smiling face to a wider shot revealing the full protective suit, backpack apparatus, and homeowners' reactions on the porch.
[Shot 1] [Person 1]'s face fills the frame behind the clear visor, smiling; close-up, tight framing, shallow focus, natural daylight, warm color, slight low-angle reflection on visor.
[Shot 2] [Person 1] is seen from behind with a backpack unit approaching the house as [Man 1] and [Woman 1] stand together; medium-long shot, eye-level, deeper focus, daylight.
[Subject] [Man 1] a bald middle-aged man in a black suit. [Crew 1] a cluster of photographers, camera operators and a boom mic.
[General] [Man 1] marches through a tree-lined park while [Crew 1] swarms him, cameras and boom mics thrust forward, a tense confrontation in cold, overcast light.
[Transition] Multi-Angle.
[Shot 1] [Man 1] strides left-center through a crowd of reporters; trees fill the background. Wide/medium-long shot, eye-level, muted desaturated colors, soft overcast lighting, moderate depth.
[Shot 2] [Man 1] in a tighter medium-close, shouting as [Crew 1] and cameras press in from the right. Medium-close, eye-level, sharp focus on expression, handheld immediacy, cool tones.
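The example prompts above share one hierarchical layout: a [Subject] line, a [General] line, a [Transition] line that opens with the editing-pattern label, and one [Shot N] line per shot. The following is a minimal sketch of how such a prompt could be split into its levels. The names `MultiShotPrompt` and `parse_prompt` are hypothetical and used for illustration only; this is not the project's released code.

```python
import re
from dataclasses import dataclass, field

# Matches a leading section tag such as "[Subject]" or "[Shot 2]".
TAG = re.compile(r"^\[(Subject|General|Transition|Shot (\d+))\]\s*(.*)$")


@dataclass
class MultiShotPrompt:
    """Hypothetical container for one hierarchical prompt."""
    subject: str = ""
    general: str = ""
    transition_type: str = ""   # e.g. "Cut-in", "Shot/Reverse Shot"
    transition_text: str = ""   # free-text description of the edit
    shots: dict = field(default_factory=dict)  # shot index -> description


def parse_prompt(text: str) -> MultiShotPrompt:
    prompt = MultiShotPrompt()
    for line in text.strip().splitlines():
        m = TAG.match(line.strip())
        if not m:
            continue  # skip lines without a recognised leading tag
        tag, shot_no, body = m.group(1), m.group(2), m.group(3)
        if tag == "Subject":
            prompt.subject = body
        elif tag == "General":
            prompt.general = body
        elif tag == "Transition":
            # The label is the text before the first period.
            label, _, rest = body.partition(".")
            prompt.transition_type = label.strip()
            prompt.transition_text = rest.strip()
        else:
            prompt.shots[int(shot_no)] = body
    return prompt


example = """[Subject] [Woman 1] a young artist with wavy brown hair in a paint-stained apron.
[General] The scene portrays a quiet morning in her studio as she paints in focused solitude.
[Transition] Cut-in. Transition from the wide studio view into a close-up of her brush and canvas, emphasizing artistic detail.
[Shot 1] [Woman 1] sits in a sunlit art studio surrounded by easels and canvases; wide shot.
[Shot 2] [Woman 1]'s hand moves a fine brush over a canvas; close-up."""

parsed = parse_prompt(example)
print(parsed.transition_type)  # Cut-in
print(sorted(parsed.shots))    # [1, 2]
```

Keeping the transition label separate from its description makes it easy to group prompts by editing pattern (Cut-in, Cut-out, Shot/Reverse Shot, Multi-Angle).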
Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling. However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting. Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions. To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.
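The abstract does not spell out how the shot-aware mask is laid out, so the following is only a sketch of one plausible reading: each visual token may attend to the global prompt tokens (subject, general description, transition) and to the prompt tokens of its own shot, but not to those of other shots. The function name `shot_aware_mask`, its arguments, and the token layout are all assumptions made for illustration.

```python
def shot_aware_mask(tokens_per_shot, global_len, shot_prompt_lens):
    """Return mask[q][k]: may visual token q attend to text token k?

    Assumed layout (not taken from the paper): text tokens are ordered as
    [global tokens | shot 0 prompt | shot 1 prompt | ...], and visual
    tokens are ordered shot by shot.
    """
    if len(tokens_per_shot) != len(shot_prompt_lens):
        raise ValueError("need one prompt length per shot")

    # Label every text token: -1 for global tokens, else its shot index.
    text_owner = [-1] * global_len
    for shot, n_text in enumerate(shot_prompt_lens):
        text_owner += [shot] * n_text

    mask = []
    for shot, n_visual in enumerate(tokens_per_shot):
        # Visible text for this shot: global tokens plus its own prompt.
        row = [owner in (-1, shot) for owner in text_owner]
        mask.extend(list(row) for _ in range(n_visual))
    return mask


# Two shots with 2 and 3 visual tokens, 2 global text tokens,
# and per-shot prompts of 1 and 2 tokens.
mask = shot_aware_mask([2, 3], global_len=2, shot_prompt_lens=[1, 2])
for row in mask:
    print("".join("1" if v else "0" for v in row))
```

In this toy layout the first two rows print `11100` and the last three print `11011`: every shot sees the shared context, while shot-specific descriptions stay local to their shot.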
Cut from a medium-wide group shot of several men to a closer, intimate portrait of one of them, highlighting his expression and attendants. Camera moves right-forward, rotates left 90°.
Methods compared: Ours, Phantom, CineTrans, StoryDiffusion, Mask2DiT, HunyuanVideo, Wan2.2, SynCamMaster, ReCamMaster
Alternating perspective between a woman and a man as they converse; the cut switches from her mid-close to his medium close-up. Camera moves left-forward, rotates right 120°.
The sequence pulls back from a medium close-up of a man inspecting a lighter and smoking to a wider shot that reveals his full figure and the surrounding stoop. Camera moves backward, no rotation.
Cut between two camera angles of a man: a side-biased medium-close emphasizing hands and object, then a frontal medium shot revealing silhouette and surroundings. Camera moves left-backward, rotates right 45°.
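The camera instructions in the comparison prompts (for example "moves right-forward, rotates left 90°") can be read as a relative pose between the two shots. The sketch below turns such an instruction into a 4x4 camera-to-world matrix expressed in the first shot's camera frame. The axis convention (x right, y up, z forward), the unit travel distance, the restriction to yaw (a full 6-DoF pose would also carry pitch and roll), and the name `relative_pose` are all assumptions; the page does not state which convention the model uses.

```python
import math

# Assumed convention: x points right, y up, z forward (shot 1 camera frame).
DIRECTIONS = {
    "left": (-1.0, 0.0, 0.0),
    "right": (1.0, 0.0, 0.0),
    "forward": (0.0, 0.0, 1.0),
    "backward": (0.0, 0.0, -1.0),
}


def relative_pose(move, yaw_left_deg, distance=1.0):
    """4x4 camera-to-world pose of shot 2 in shot 1's camera frame.

    move:          "" or a hyphenated direction such as "right-forward"
    yaw_left_deg:  rotation about the vertical axis, positive = turn left
    distance:      hypothetical travel distance in scene units
    """
    # Sum unit directions for compound moves, then rescale to `distance`.
    parts = [DIRECTIONS[w] for w in move.split("-")] if move else []
    t = [sum(p[i] for p in parts) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in t))
    if norm > 0:
        t = [distance * c / norm for c in t]

    theta = math.radians(yaw_left_deg)
    c, s = math.cos(theta), math.sin(theta)
    # Columns of the rotation are the new right, up and forward axes.
    return [
        [c, 0.0, -s, t[0]],
        [0.0, 1.0, 0.0, t[1]],
        [s, 0.0, c, t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


# "camera moves right-forward, rotates left 90°"
pose = relative_pose("right-forward", 90)
# "camera moves left-forward, rotates right 120°" -> negative yaw
pose_b = relative_pose("left-forward", -120)
# "camera moves backward, no rotation"
pose_c = relative_pose("backward", 0)
```

With a 90° left turn, the third column of the rotation (the new forward axis) points along negative x, which matches a camera that has swung to look toward its original left.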
@misc{wu2025shotdirectordirectoriallycontrollablemultishot,
      title={ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions},
      author={Xiaoxue Wu and Xinyuan Chen and Yaohui Wang and Yu Qiao},
      year={2025},
      eprint={2512.10286},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.10286},
}