A SANA-Streaming prompt is an editing instruction, not a scene description. It names one change to the source video; the model makes that change and carries everything else through untouched. The model holds your instruction constant across the whole stream and has no way to resolve ambiguity, so a vague prompt produces inconsistent or wrong edits. The craft is to replace vague intent with specific, bounded detail. This guide covers the anatomy of a strong instruction, a recipe for each kind of edit, prompting a live webcam feed, and the mistakes to avoid.

Instructions, not descriptions

Prompts for Helios and LongLive-2.0 build a world from words, so dense scene description pays off. SANA-Streaming works in reverse: the source video supplies the world. Describe only the change, open with its verb (“Replace the…”, “Remove the…”, “Apply a…”), and spend your detail budget on the new content and on what must stay fixed.

Weak vs strong prompts

The same edit, twice. The only difference is specificity.

Garment swap
Weak: Make her shirt into a nice silk blouse.
Strong: Replace the white button-up shirt with a dark navy silk blouse featuring a draped ruffled collar and iridescent pearl buttons. Ensure the silk catches and reflects the warm golden lamp light with a subtle sheen throughout the sequence. Preserve the subject’s pose, body and head motion, her identity, the background, and the scene’s depth of field and lighting.

Removal
Weak: Remove the earrings.
Strong: Remove the thick textured gold hoop earrings from the woman’s ears. Reconstruct the exposed earlobes to match the surrounding skin tone and texture, blending lighting and shadows naturally and leaving no metallic trace or reflection. Leave all other video content unchanged with temporally consistent inpainting.

The weak versions leave the model to decide which shirt, what kind of silk, what the light does to it, and what happens to the skin the earrings covered. The strong versions decide all of it.

Anatomy of an edit prompt

A strong instruction runs two to five sentences and assembles these parts in order:
  • Name the target with disambiguating attributes. “The white button-up shirt,” not “the shirt.” If two similar objects are in frame, the attributes are what aim the edit.
  • Specify the new content with two to four concrete details. Draw from material (silk, leather, plaster), named color (dark navy, cream, faded ochre), silhouette (draped ruffled collar, round wire frames), and texture (distressed, iridescent, woven). Two sharp details do more than six vague ones.
  • Add one clause of material physics. Say how the new content behaves under the scene’s existing light: “catches the warm golden lamp light with a subtle sheen.” This single clause does more for realism than any adjective stack.
  • Enumerate what stays fixed. Never write “keep everything else the same”; name the axes: pose, motion, identity, background, camera movement, lighting, depth of field. The model preserves what you name.
  • For whole-frame edits, assert temporal consistency. Style transfers, scene transforms, and background swaps should end with “seamless temporal consistency across all frames, no jarring frames.”
These parts are not style preferences; they map to the four axes the model was scored on during training: instruction alignment, consistency outside the edit, physical plausibility, and video quality. Two rules sit on top of the recipe:
  • One edit per prompt. “Swap the jacket and the background” is two instructions; send them as two prompts. Independent edits run separately over the same source.
  • Only claim motion you can see. Write “keep the subject perfectly still” only if your footage is still; a wrong motion claim freezes a moving subject or animates a still one. When in doubt, preserve motion by reference: “preserve the subject’s existing motion.”
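
The anatomy above can be sketched as a small prompt builder. This is a minimal illustration and not part of any SANA-Streaming API: the `EditPrompt` class, its field names, the `join_items` helper, and the joining word "featuring" are all assumptions made for the example. It only assembles a string and enforces the two-to-four-detail and named-preservation rules.

```python
from dataclasses import dataclass
from typing import Sequence


def join_items(items: Sequence[str]) -> str:
    """Join phrases as an English list with a serial comma."""
    items = list(items)
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + f", and {items[-1]}"


@dataclass
class EditPrompt:
    """Hypothetical container for the parts of one edit instruction."""

    change: str                # opening verb, disambiguated target, new content
    details: Sequence[str]     # two to four concrete details
    preserve: Sequence[str]    # named axes that must stay fixed
    physics: str = ""          # optional material-physics sentence
    whole_frame: bool = False  # style, scene transform, or background edit

    def build(self) -> str:
        if not 2 <= len(self.details) <= 4:
            raise ValueError("give two to four concrete details")
        if not self.preserve:
            raise ValueError("name the axes that must stay fixed")
        sentences = [f"{self.change} featuring {join_items(self.details)}."]
        if self.physics:
            sentences.append(self.physics.rstrip(".") + ".")
        sentences.append(f"Preserve {join_items(self.preserve)}.")
        if self.whole_frame:
            sentences.append(
                "Maintain seamless temporal consistency across all frames, "
                "no jarring frames."
            )
        return " ".join(sentences)


prompt = EditPrompt(
    change="Replace the white button-up shirt with a dark navy silk blouse",
    details=["a draped ruffled collar", "iridescent pearl buttons"],
    physics="Ensure the silk catches the warm golden lamp light with a subtle sheen",
    preserve=["the subject's pose", "her existing motion", "her identity",
              "the background", "the lighting"],
).build()
print(prompt)
```

Because the builder takes a single `change`, a second edit needs a second `EditPrompt`, which keeps each prompt to one edit.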

Recipes

Seven kinds of edit cover the model’s range. Find your edit, follow its pattern:
| Edit | You want to… |
| --- | --- |
| Remove | Delete an object, person, watermark, or blemish |
| Replace | Swap a garment, object, or the subject itself |
| Add | Overlay a new element that tracks the scene |
| Background | Change the setting behind a preserved subject |
| Style | Repaint the whole frame in an art style |
| Scene transform | Rebuild every object in the scene as a new medium |
| Physical AI | Change the weather in driving footage, swap hands for robot arms |

Remove

Name the target, reconstruct what its removal reveals, and assert no trace. The reconstruction clause is the part beginners skip, and it’s what separates a clean removal from a smudge.
Remove the white “GagaOOLala” watermark logo in the top-left corner. Seamlessly blend the region with the surrounding sky, foliage, and building edges, with temporally consistent inpainting.

Replace

State the old item and the new one, describe the new content with at most four details, add one material-physics clause, then list what to preserve. A garment or object swap preserves the subject’s identity; replacing the subject (aging them, turning them into a creature) preserves their exact pose and motion instead.
Transform the middle-aged man into an elderly gentleman with silver hair and natural wrinkles, in the same position and pose. Preserve his exact body and head motion, the background, and the original camera movement and lighting across all frames.

Add

Describe the element, where it sits, how it moves, and that it stays tracked to the face, sky, or surface as the camera moves. Untracked additions float.
Overlay an animated colorful kite in the upper-left sky. The kite flutters and sways with its tail moving in the wind, staying tracked to the sky as the camera moves, with lighting and shadows adjusting dynamically. All other parts of the video remain unchanged.

Background

Enumerate the new scene’s elements and mood, hold the subject and foreground unchanged, and match the new environment’s light to the existing light on the subject, since a subject lit for a studio looks pasted onto a beach. Background motion (rolling waves, drifting crowds) is a deliberate extra; ask for it or it stays static.
Replace the background with a rain-streaked windowpane at dusk, with out-of-focus teal and amber city lights, condensation, and raindrops trickling down the glass. Keep a shallow depth of field, do not alter the subject’s lighting or appearance, and maintain seamless consistency across all frames.

Style

Name the style, describe its concrete visual characteristics, and preserve motion, actions, camera, and composition. Style repaints; it moves nothing.
Apply a Fauvist painting style with electric blues, greens, and oranges, thick brushstrokes, bold outlines, and flat saturated color blocks. Preserve all original motion, actions, camera movement, and composition, with seamless temporal consistency and no jarring frames.

Scene transform

Style’s bigger sibling: list the object inventory and rebuild each item as the new medium, then add the medium’s artifacts (a gilded border, plaster cracks, paper grain).
Re-render the office as a warm antique wall fresco: convert the man, desk, laptop, notebook, shelves, plants, and lamp into hand-painted ochre and faded-blue forms with visible brush texture, a gilded border, plaster grain, and fine cracks. Preserve the original composition, gestures, object layout, and temporal motion.

Physical AI

Domain randomization for driving and robotics footage. The restyle changes appearance only, so preservation is maximal and names the domain’s invariants: lanes, road geometry, signs, and trajectory for driving; objects, tools, contacts, and timing for manipulation.
Transform this driving-camera feed into a scene with light snowfall at dawn, adding soft falling snow and a cool pale light. Keep all vehicles, road geometry, lane markings, signs, and motion unchanged, and maintain temporal consistency throughout.
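
For domain randomization you typically want many appearance variants over the same footage with an identical preservation clause. The sketch below generates such a set of prompts. Only the "light snowfall at dawn" wording comes from the example above; the other weather and lighting phrases, and the names `WEATHER`, `LIGHT`, and `driving_variants`, are illustrative assumptions.

```python
from itertools import product

# Appearance changes only; every entry except the first is an illustrative guess.
WEATHER = {
    "light snowfall": "adding soft falling snow",
    "steady rain": "adding rain streaks and wet reflective asphalt",
    "dense fog": "adding low haze that thickens with distance",
}
LIGHT = {
    "dawn": "a cool pale light",
    "midday": "a flat overcast light",
    "dusk": "a warm low-angle light",
}

# The driving invariants are named once and reused verbatim in every variant.
PRESERVE = (
    "Keep all vehicles, road geometry, lane markings, signs, and motion "
    "unchanged, and maintain temporal consistency throughout."
)


def driving_variants():
    """Return one prompt per weather and time-of-day combination."""
    prompts = []
    for (weather, effect), (time, light) in product(WEATHER.items(), LIGHT.items()):
        prompts.append(
            f"Transform this driving-camera feed into a scene with {weather} "
            f"at {time}, {effect} and {light}. {PRESERVE}"
        )
    return prompts


variants = driving_variants()
print(len(variants))
print(variants[0])
```

Each generated string is a separate prompt to run over the same source clip.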

Common pitfalls

  • A vague target. “The shirt” with two people in frame makes the model choose. Name the white button-up shirt.
  • “Keep everything else the same.” The model preserves what you name. Enumerate the axes: pose, motion, background, camera, lighting.
  • Adjective pileup. Past about four concrete details, extra adjectives degrade the edit instead of sharpening it. Keep to one material-physics clause and two to five sentences in total.
  • Claiming motion you don’t have. “Perfectly still” on a moving subject freezes them; an animated description of a static scene invents motion. Match motion claims to the footage or preserve it by reference.
  • Two edits in one prompt. A combined instruction muddies both changes. Run them as separate prompts over the same source.
  • Describing a scene from scratch. “A cyberpunk city at night” is a Helios prompt. Here, phrase it as a change: “Apply a neon cyberpunk look to the scene…”
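
Several of these pitfalls can be caught mechanically before a prompt is sent. The checker below is a rough sketch built on string heuristics; `lint_prompt`, the keyword lists, and the warning texts are assumptions for illustration, and it cannot judge vagueness or whether a motion claim matches the footage.

```python
import re
from typing import List

# Verbs that open the example prompts in this guide; not an exhaustive list.
EDIT_VERBS = ("replace", "remove", "apply", "overlay", "transform", "re-render")
# Words that suggest a preservation clause is present (heuristic).
PRESERVE_WORDS = ("preserve", "keep", "unchanged", "maintain", "do not alter")


def lint_prompt(prompt: str) -> List[str]:
    """Return heuristic warnings for one edit prompt (empty list if none)."""
    text = prompt.strip()
    if not text:
        return ["prompt is empty"]
    lowered = text.lower()
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    warnings = []
    if not 2 <= len(sentences) <= 5:
        warnings.append(f"{len(sentences)} sentence(s); aim for two to five")
    if "keep everything else the same" in lowered:
        warnings.append("blanket preservation; name the axes instead")
    elif not any(word in lowered for word in PRESERVE_WORDS):
        warnings.append("no preservation clause; name what stays fixed")
    if lowered.split()[0] in ("a", "an", "the"):
        warnings.append("opens like a scene description; open with the edit verb")
    # Crude two-edit check: more than one sentence opening with an edit verb.
    edits = sum(s.lower().startswith(EDIT_VERBS) for s in sentences)
    if edits > 1:
        warnings.append("more than one edit; send separate prompts")
    return warnings


for warning in lint_prompt("Remove the earrings."):
    print(warning)
```

An empty list means only that none of these surface checks fired, not that the prompt is strong.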

See also