LingBot prompts steer POV, environment, subject behavior, and events. The model gets three simultaneous inputs (a seed image, control signals (your set_movement / set_look_* commands), and a text prompt), and the text’s job is to describe what the world is and feels like. This guide
covers the rules that produce good output and the mistakes that wreck it.
One concept this guide assumes up front: prompts in LingBot are dynamic. You set the prompt with
set_prompt and can call set_prompt again at any point in a live session to update it, typically
in response to action state (driving vs idle), scene events (weather changes, dramatic moments), or
other user input. When the rules below talk about “swapping fragments” or “rewriting parts of the
prompt,” it means calling set_prompt with revised text.
Layered composition covers how to structure prompts so these
updates stay clean.
The most important rule: keep conditions aligned
LingBot is different from a normal text-to-video model. It’s a first image + [real-time text + action] model. Three signals condition the generation at the same time:
- A seed image (the first frame)
- A text prompt (continuously updated)
- Action signals (your set_movement / set_look_* commands)
These signals have to agree with each other in three ways:
- Text vs first image: what the prompt describes has to match what the seed image shows.
- Text vs action: what the prompt says about motion has to match what action is doing right now.
- Text vs text: different parts of the prompt can’t contradict each other.
Text vs first image
The seed image fixes what the scene actually contains: a specific subject, its materials, the palette. If your prompt describes something noticeably different, the model tries to satisfy both signals at once and the world drifts mid-generation. If the seed image shows a metal boat, don’t write:
A weathered wooden rowboat drifting through reeds.
With this prompt, the boat’s hull will gradually morph from metal toward wood as the video plays
out. Texture, color, and surface detail will all degrade if not directly mentioned. Describe what’s
actually in the seed image:
A battered metal boat drifting through reeds.
If you want a different subject, regenerate the seed instead of fighting it from the prompt.
Text vs action
Text can describe motion, but whatever it describes has to agree with what action is doing in that same moment. The failure mode isn’t “the text mentioned movement”; it’s text saying one thing while action says another. The output is motion you can’t stop, controls that feel slow, or a subject that won’t go where you steer it.
Example 1: motion baked into the base prompt
Watch out for camera motion verbs:
The camera pans across a snowy mountainside.
The model pans on its own; your set_look_* input feels slow or gets ignored.
And for subject motion verbs:
A horse galloping across the plains toward the mountains.
The horse keeps galloping forward even when you send set_movement: idle. You can’t bring it to a
stop. Both fail the same way: motion described in the base stays on regardless of action state.
Camera motion verbs (pan, tilt, dolly, track, push in, pull back, orbit, zoom, fly through) fight
live look input the same way subject motion verbs fight live movement input.
Keep motion out of the base. Describe the world statically:
A bay horse on open plains, mountains on the far horizon.
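A base prompt can be screened for the camera motion verbs listed above before it is sent. The following is a minimal sketch, not part of any LingBot SDK: `find_camera_motion` and the `CAMERA_MOTION` list are hypothetical names, and the loose word matching will produce some false positives (for example, "track" used as a noun) and miss irregular forms.

```python
import re

# Camera motion verbs named in this guide.
CAMERA_MOTION = [
    "pan", "tilt", "dolly", "track", "push in", "pull back",
    "orbit", "zoom", "fly through",
]


def find_camera_motion(prompt: str) -> list[str]:
    """Return the camera motion verbs that appear in the prompt."""
    hits = []
    for verb in CAMERA_MOTION:
        head, _, tail = verb.partition(" ")
        # Allow simple inflections of the first word (pans, panned, panning),
        # including an optionally doubled final consonant.
        pattern = rf"\b{head}{head[-1]}?(?:s|es|ed|ing)?"
        if tail:
            pattern += rf"\s+{tail}"
        pattern += r"\b"
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            hits.append(verb)
    return hits


flagged = find_camera_motion("The camera pans across a snowy mountainside.")
clean = find_camera_motion("A bay horse on open plains, mountains on the far horizon.")
```

Here `flagged` contains `"pan"` and `clean` is empty. A check like this only catches camera verbs; subject motion verbs such as "galloping" still need a manual read.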
For camera framing, use position-only language. Match the subject’s position to what action is doing
right now:
- When the user is moving forward and you want the subject to stay in third-person rear view, anchor that explicitly: “rear view, subject centered in frame.” set_look_* then turns the subject’s heading instead of orbiting the camera.
- When there’s no movement input and you want the subject still while the camera is free to look around it, describe the subject as static and centered with a fixed pointing direction: “the subject stays motionless at the center of the frame, oriented forward.” The look input can orbit the camera around it without dragging the subject along.
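Example 2: subject tied to a named entity
Watch out for phrasing that binds the subject to a path, vehicle, or landmark:
A dog walking down a country lane.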
The dog stays glued to the lane. Pressing left or right to steer off-road doesn’t work since the
prompt keeps pulling it back. Constraining the subject’s relationship to a named entity (a path, a
vehicle, a specific landmark) bakes that relationship into every frame; action input can’t break it.
This is another flavor of text vs action conflict.
Describe the place, not the relationship:
A border collie in tall summer grass, a country lane running through the field.
Text vs text
The prompt can also contradict itself when an event or state change introduces a new fragment mid-session without removing the old phrasing it’s meant to replace. The model receives both descriptions in the same string and splits the difference. For example, with a base that says “dark storm clouds gathered overhead,” don’t simply append a weather-clearing event on top:
The sun breaks through and the sky clears.
The assembled prompt now describes both a stormy sky and a clear one, so the output ends up muddy and indecisive.
When a state change replaces something in the base, rewrite the conflicting fragment, don’t append
to it. Remove or revise the old phrasing so the assembled prompt describes one consistent world.
Layered composition makes this easier; see also Watch for layer conflicts.
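The rewrite-instead-of-append rule can be sketched as a small string helper. This is an illustration under stated assumptions: `rewrite_fragment` is a hypothetical helper, the harbor prompt is made up for the example, and the resulting string would still need to be sent with set_prompt by your own client code.

```python
def rewrite_fragment(prompt: str, old: str, new: str) -> str:
    """Swap one fragment for another so only one description remains."""
    if old not in prompt:
        # Fail loudly rather than appending next to stale phrasing.
        raise ValueError(f"fragment not found: {old!r}")
    return prompt.replace(old, new)


base = "A stone harbor at dusk, dark storm clouds gathered overhead."
cleared = rewrite_fragment(
    base,
    "dark storm clouds gathered overhead",
    "a clear sky washed in late sunlight",
)
```

Raising when the old fragment is missing is a deliberate choice here: it surfaces the case where the base was already changed and a blind append would reintroduce a contradiction.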
Token budget
LingBot’s text encoder is umt5-xxl, which has a hard cap of 512 tokens. Everything beyond that is silently truncated. In practice, quality starts to degrade noticeably past ~500 tokens, before you even reach the hard limit, so treat 500 as your effective ceiling.
What matters more is the soft version of this rule: if you want something to actually show up in the output, the words describing it have to take up enough room within that budget. A two-word mention tucked into an otherwise dense prompt usually gets ignored; the rest of the scene drowns it out. There’s no exact ratio, and the right balance depends on the scene. As a starting point, give the thing you want to see at least as much room as your other sentence-anchors. Try it, watch the output, and tune from there.
Writing a base prompt
A base prompt describes what the world is, in a way that holds true regardless of what action is doing in any given frame. A good base prompt covers four things, in roughly this order:
- POV + subject declaration. Open with one sentence that names the point of view and the subject. The model anchors POV from the first sentence; don’t bury it.
- Object layers: near, mid, far. Spell out what’s around the subject at each distance. Near is what’s directly around at ground level. Mid is the focal element of the scene. Far is the backdrop or horizon. Missing layers produce flat, muddy backgrounds.
- Camera framing. Where the camera sits and what the framing looks like in static, position-only terms, with no motion verbs. See the alignment rule for why motion verbs belong in dynamic fragments, not in base.
- Atmosphere. One closing phrase that names the palette, energy, or rendering style. One phrase, not three.
Worked example: Citadel Approach
A canyon-driving scene with a battered Defender 4x4 approaching a desert citadel.
Scene design
The base anchors all four parts:
- POV + subject: “This is a third-person-view video of a battered grey-green vintage Defender 4x4 deep in a coral-lit desert canyon.”
- Near plane: “Prickly pear cacti tipped with magenta blooms, scattered red poppies, and weather-pitted boulders dot the open desert floor”
- Mid plane: “smooth ochre dunes sweep up toward towering sandstone mesas that wall the valley on the left.”
- Far plane: “Ahead, a cliff-built sandstone citadel of white-washed houses, crenellated battlements, and slender minarets stands against a hazy peach-orange sunset sky.”
- Atmosphere: “Warm painterly desert storybook atmosphere.”
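The parts listed above can be kept as separate fields and joined into the base string. This is only a sketch of one way to organize them: the `BaseParts` class and its field names are hypothetical, and the join rule (near and mid sharing one sentence, separated by a semicolon) simply mirrors how the Citadel base is punctuated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseParts:
    """The parts of a base prompt, in the order this guide recommends."""
    pov_subject: str
    near: str
    mid: str
    far: str
    atmosphere: str

    def assemble(self) -> str:
        # Near and mid share one sentence in the Citadel base, joined by "; ".
        scene = f"{self.near}; {self.mid}"
        return " ".join([self.pov_subject, scene, self.far, self.atmosphere])


citadel = BaseParts(
    pov_subject=("This is a third-person-view video of a battered grey-green "
                 "vintage Defender 4x4 deep in a coral-lit desert canyon."),
    near=("Prickly pear cacti tipped with magenta blooms, scattered red poppies, "
          "and weather-pitted boulders dot the open desert floor"),
    mid=("smooth ochre dunes sweep up toward towering sandstone mesas that wall "
         "the valley on the left."),
    far=("Ahead, a cliff-built sandstone citadel of white-washed houses, "
         "crenellated battlements, and slender minarets stands against a hazy "
         "peach-orange sunset sky."),
    atmosphere="Warm painterly desert storybook atmosphere.",
)
base_prompt = citadel.assemble()
```

Keeping the parts separate makes it easier to see at a glance whether one plane is missing or much shorter than the others.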
Dynamic state
When the user is driving, moving forward, or looking around, the assembled prompt looks like this:
This is a third-person-view video of a battered grey-green vintage Defender 4x4 deep in a coral-lit desert canyon. Prickly pear cacti tipped with magenta blooms, scattered red poppies, and weather-pitted boulders dot the open desert floor; smooth ochre dunes sweep up toward towering sandstone mesas that wall the valley on the left. Ahead, a cliff-built sandstone citadel of white-washed houses, crenellated battlements, and slender minarets stands against a hazy peach-orange sunset sky. Warm painterly desert storybook atmosphere. Strict centred third-person rear view: the Defender is locked at the exact centre of the frame at all times. Horizontally centred, vertically centred, and the camera sits on a fixed offset directly behind the vehicle’s rear axle (zero lateral offset, zero orbit angle). The camera tracks the Defender from directly behind as it travels forward and never rotates, orbits, or pans around it under any input. Arrow-key look-input turns the Defender’s heading instead, so the rear-view framing, camera-behind-centre, is preserved frame-by-frame. The Defender rolls forward across the open desert sand, plumes of pale golden dust kicking up from its tires and trailing behind the rear hatch, the suspension flexing softly as it crests the dunes, faint heat shimmer rising from the tailpipe.
Static state
When the user releases controls, the assembled prompt swaps the trailing camera and subject description for a still variant:
This is a third-person-view video of a battered grey-green vintage Defender 4x4 deep in a coral-lit desert canyon. Prickly pear cacti tipped with magenta blooms, scattered red poppies, and weather-pitted boulders dot the open desert floor; smooth ochre dunes sweep up toward towering sandstone mesas that wall the valley on the left. Ahead, a cliff-built sandstone citadel of white-washed houses, crenellated battlements, and slender minarets stands against a hazy peach-orange sunset sky. Warm painterly desert storybook atmosphere. The Defender stays perfectly centered in the camera’s view with a fixed orientation, and its apparent size and distance from the camera remain constant at all times. It never moves or shifts in place. The Defender is completely still and motionless, parked on the open desert sand, mud-streaked rear hatch and weathered tail-lights static, no dust or movement around its tires.
The base sentences are identical in both states; only the trailing description of camera and subject behavior swaps to match what action is doing right now.
Third-person vs first-person
LingBot supports both third-person and first-person scenes. The base-prompt structure and the Citadel example above both assume a third-person framing: a named subject (the Defender) with environment described around it. First-person scenes simplify the prompt in a few ways:
- set_look_* directly controls the camera. In first-person, the camera is the viewer’s eyes, so you don’t need to describe the camera in terms of its relationship to a subject. You can skip camera-framing language altogether, beyond declaring POV up front.
- If there’s no controlled subject in frame (no horse, no vehicle, no avatar body), you may also drop subject descriptions from the base; it becomes a purer environment description. The distinction between “subject moving” and “subject still” tends to collapse too, since there’s no separate subject body to track.
In layered-composition terms, a first-person scene can usually drop the camera layer and (often) the static/dynamic split inside the movement layer.
Runtime caveats
A few practical limits affect how a prompt will play out in a live session:
- Don’t sit on one condition for too long. Staying with the same text prompt and an idle movement state across many generated chunks may cause details to drift and visual artifacts to appear. Active input keeps the condition fresh. Pressing forward (W) or any other movement key counts. The failure mode is sitting with both the prompt and movement idle. Periodically introducing a set_prompt change, or any user input, resets this.
- Video length. The model retains exactly 300 generated chunks before resetting back to the seed image. This is a hard cap; plan scene transitions around it.
- Model limitations. A few known weak spots in the current model:
  - Movement input reliability. Forward (W) is noticeably more stable than the other movement keys. Lateral strafing, moving sideways without turning, is the worst case: it displaces the camera off the centred subject and degrades output. Route left/right motion through forward + turning (look-input that turns the subject’s heading) rather than dedicated strafe input.
  - Orbit can fail even with correct framing. Some scenes won’t orbit cleanly under look-input even when the prompt explicitly fixes the subject as still, centred, and forward-oriented. This appears to be a training-data limitation rather than a prompt issue. Subjects with clear orbit coverage in training data (horse-riding, cars) work better than ones without (boats, dragons); if your scene needs free 360° look around the subject, prototype with the seed before committing.
Layered composition (advanced)
The example above shows two assembled prompts that share the same opening (same POV, world, atmosphere) and only differ in their trailing description of camera and subject behavior. That’s not a coincidence. Behind the scenes, the prompt is split into orthogonal layers that you concatenate client-side before each set_prompt call. The encoder never sees the structure; it just receives the assembled string.
A typical setup splits the prompt into:
- base: POV, world, object layers, atmosphere. Static across the session.
- camera: how the camera is locked, oriented, and tracking. Has static and dynamic variants that swap with set_look_* state.
- movement: whether the subject is moving and how. Has static and dynamic variants that swap with set_movement state.
- events: optional overlays for things that happen to the world (see below).
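The layer split above can be sketched as a small client-side assembler. Everything here is an assumption for illustration: the `Layers` class, the `assemble_prompt` function, and the `looking` / `moving` flags are hypothetical names for state your app would track itself, and the horse fragments are stand-ins for real layer text. The function only builds the string; sending it with set_prompt is left to your client.

```python
from dataclasses import dataclass, field


@dataclass
class Layers:
    base: str
    camera: dict[str, str]    # keys: "static", "dynamic"
    movement: dict[str, str]  # keys: "static", "dynamic"
    events: list[str] = field(default_factory=list)


def assemble_prompt(layers: Layers, looking: bool, moving: bool) -> str:
    """Concatenate the layers into the single string the encoder receives."""
    parts = [
        layers.base,
        layers.camera["dynamic" if looking else "static"],
        layers.movement["dynamic" if moving else "static"],
        *layers.events,
    ]
    # Skip empty fragments so the result has no doubled spaces.
    return " ".join(p.strip() for p in parts if p.strip())


layers = Layers(
    base="A bay horse on open plains, mountains on the far horizon.",
    camera={
        "static": "The horse stays motionless at the center of the frame, oriented forward.",
        "dynamic": "Rear view, horse centered in frame.",
    },
    movement={
        "static": "The horse stands completely still.",
        "dynamic": "The horse trots forward across the grass.",
    },
)
idle_prompt = assemble_prompt(layers, looking=False, moving=False)
driving_prompt = assemble_prompt(layers, looking=True, moving=True)
```

Both results start with the same base sentence and differ only in the trailing camera and movement fragments, which is the same pattern as the two Citadel prompts.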
Events
Beyond camera and movement, you can layer events on top of the assembled scene: overlays for things that happen to the world rather than describing the world itself. Citadel Approach defines two:
- Camel caravan: “Across the open desert ahead, a long caravan of laden bactrian camels lines up in twos toward the citadel. Broad woolen flanks tasseled with crimson and indigo blankets, brass bells at their necks, robed riders in earth-toned wraps seated in carved wooden saddles. The camels’ long shadows lie across the ochre dunes and prickly pear blooms.”
- Solar eclipse: “The sun overhead is reduced to a perfect black disc ringed by a ghostly white corona, casting strange dim slate-blue light across the sandstone mesas, draining the ochre dunes to muted plum and turning the prickly pear blooms and red poppies into dark silhouettes flecked with deep-violet highlights. The citadel ahead stands bone-pale and chilled against a bruise-purple sky, its windows lit with steady warm lamps in the eclipse hush.”
An event can touch one layer or many:
- One layer. An addition that doesn’t disturb the rest of the world: fireworks in the sky, lightning, fire breath, a shield raising, the camel caravan above. Only the events layer changes.
- Many layers. A change that re-skins the world: night, winter, steampunk, pixel art, or the eclipse above (which re-colors lighting, atmosphere, and surface descriptions). Atmosphere and object layers change together with the event.
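One way to represent the one-layer versus many-layer distinction is to treat each event as a set of layer overrides. This is a sketch under assumptions: the layer names `world`, `atmosphere`, and `events` are a hypothetical finer split of the prompt, `apply_event` and `assemble` are illustrative helpers, and the short strings stand in for the full Citadel text.

```python
def apply_event(layers: dict[str, str], overrides: dict[str, str]) -> dict[str, str]:
    """Return a copy of the layers with the event's overrides applied."""
    unknown = sorted(set(overrides) - set(layers))
    if unknown:
        raise KeyError(f"event touches unknown layers: {unknown}")
    return {**layers, **overrides}


def assemble(layers: dict[str, str]) -> str:
    """Join non-empty layers in their declared order."""
    return " ".join(text for text in layers.values() if text)


scene = {
    "world": "Ochre dunes under sandstone mesas.",
    "atmosphere": "Warm painterly desert atmosphere.",
    "events": "",
}
# One layer: only the events layer changes.
caravan = {"events": "A camel caravan crosses the dunes ahead."}
# Many layers: the event re-skins world and atmosphere as well.
eclipse = {
    "world": "Plum-grey dunes under dim sandstone mesas.",
    "atmosphere": "Cold slate-blue eclipse light.",
    "events": "The sun is a black disc ringed by a white corona.",
}

with_caravan = apply_event(scene, caravan)
with_eclipse = apply_event(scene, eclipse)
```

Because `apply_event` returns a copy, reverting an event is just a matter of assembling the original `scene` again.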
How events fire
Firing an event just means calling set_prompt again with the relevant layer rewritten; there’s no special API for it. How you wire the trigger in your app is a frontend choice, not something that changes the prompt itself. Two common patterns:
- Pinned (toggle). The user picks a state and it stays on. Your app calls set_prompt once and leaves it. Fits events that should persist: weather, time of day, season, art style.
- Momentary (press-and-hold). The event applies only while a key or button is held. On release, your app calls set_prompt again to revert the layer. Fits short, dramatic events: a dragon breathing fire, a laser firing, a fireworks burst.
Both patterns come down to a set_prompt call. Neither is “better”. Pick whichever matches the event. “Now it’s snowing” feels wrong as a press-and-hold; “dragon breathes fire” feels wrong as a permanent toggle.
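The two trigger patterns can be sketched with a small controller that re-sends the assembled prompt whenever an event turns on or off. This is illustrative only: `EventController` is a hypothetical class, and `send_prompt` is an assumed callback standing in for however your client issues set_prompt. The sketch appends event text, so it suits additive, one-layer events; events that re-skin the world need the affected layers rewritten instead.

```python
from typing import Callable


class EventController:
    """Tracks pinned and held events and re-sends the assembled prompt."""

    def __init__(self, base_prompt: str, send_prompt: Callable[[str], None]):
        self._base = base_prompt
        self._send = send_prompt  # assumed wrapper around set_prompt
        self._pinned: dict[str, str] = {}
        self._held: dict[str, str] = {}

    def _push(self) -> None:
        parts = [self._base, *self._pinned.values(), *self._held.values()]
        self._send(" ".join(parts))

    def toggle(self, name: str, text: str) -> None:
        """Pinned pattern: first call turns the event on, the next turns it off."""
        if name in self._pinned:
            del self._pinned[name]
        else:
            self._pinned[name] = text
        self._push()

    def press(self, name: str, text: str) -> None:
        """Momentary pattern: apply the event while the key is held."""
        if name not in self._held:
            self._held[name] = text
            self._push()

    def release(self, name: str) -> None:
        """Momentary pattern: revert the layer when the key is released."""
        if self._held.pop(name, None) is not None:
            self._push()


sent: list[str] = []
controller = EventController("A quiet valley.", sent.append)
controller.toggle("snow", "Snow falls steadily.")
controller.press("fire", "A dragon breathes fire overhead.")
controller.release("fire")
```

After these three calls, `sent` holds three prompts, and the last one matches the first: the snow stays pinned while the fire event is gone.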
Watch for layer conflicts
Layered composition is more powerful than swapping whole prompts, but it adds a new risk: the layers inside a single prompt can contradict each other, the same way text and action can. If the base layer says “under a cloudy, overcast sky” and you append an events layer that says “the sun breaks through and the clouds clear”, you’ve handed the model two contradictory weather descriptions in one string. The output splits the difference and looks bad. This is the same failure mode as fighting the controls, just inside the prompt itself.
Before assembling, check that no two layers describe the same surface, lighting, or weather
differently. Either rewrite the conflicting layer to be additive (“a sudden break in the clouds”
rather than “the sky clears”), or swap the entire prompt as a fresh set_prompt call.
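A rough pre-assembly check for contradictory layers can be sketched with hand-maintained lists of mutually exclusive terms. This is a crude heuristic, not a real semantic check: the `EXCLUSIVE` table, its topics, and `find_conflicts` are all hypothetical, and keyword matching will miss paraphrases and can flag harmless wording.

```python
import re

# Hypothetical, hand-maintained pairs of mutually exclusive descriptions.
EXCLUSIVE = {
    "sky": (("cloudy", "overcast", "storm clouds"), ("clear", "clears", "cloudless")),
    "time": (("night", "moonlit"), ("noon", "midday")),
}


def _mentions(text: str, terms: tuple[str, ...]) -> bool:
    return any(
        re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE)
        for term in terms
    )


def find_conflicts(layers: dict[str, str]) -> list[str]:
    """Return topics for which the assembled layers contain both sides."""
    combined = " ".join(layers.values())
    return [
        topic
        for topic, (side_a, side_b) in EXCLUSIVE.items()
        if _mentions(combined, side_a) and _mentions(combined, side_b)
    ]


conflicting = find_conflicts({
    "base": "A harbor under a cloudy, overcast sky.",
    "events": "The sun breaks through and the clouds clear.",
})
additive = find_conflicts({
    "base": "A harbor under a cloudy, overcast sky.",
    "events": "A sudden break in the clouds.",
})
```

Here `conflicting` reports the `sky` topic and `additive` is empty, matching the additive rewrite suggested above. Watching the output remains the real test.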
Troubleshooting
When a prompt doesn’t produce what you wanted, walk through this checklist before assuming the model can’t do it.
1. Is anything in conflict? Walk the three alignment flavors from the alignment rule:
- Text vs first image. Does the prompt describe a subject or materials that differ from what the seed actually shows?
- Text vs action. Does the prompt describe motion, camera moves, or relationships to named entities that don’t match what action is doing right now?
- Text vs itself. If you’re using layered composition, do two layers describe the same surface, lighting, or weather in incompatible ways? See Watch for layer conflicts.
- If something in the video is underperforming, give its description more room: a longer phrase, more concrete physical detail, an extra clause. A two-word mention is easy for the rest of the prompt to drown out.
- If something unwanted is showing up, don’t just hope it goes away. Describe its opposite explicitly, or call out that it shouldn’t appear (“clear cloudless sky,” “no people in frame”). The model responds to what you write, not to what you omit.
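To see how much room each part of a prompt gets, you can compare fragment lengths before assembling. The sketch below uses whitespace-separated word counts as a crude stand-in for tokens; that is an assumption, since the actual encoder tokenizer will count differently, and `room_report` is a hypothetical helper rather than anything LingBot provides.

```python
def room_report(fragments: dict[str, str]) -> dict[str, float]:
    """Return each fragment's share of the prompt, measured in words."""
    counts = {name: len(text.split()) for name, text in fragments.items()}
    total = sum(counts.values())
    return {name: (n / total if total else 0.0) for name, n in counts.items()}


report = room_report({
    "world": "A misty pine valley with a stone bridge over a river.",
    "dragon": "red dragon",
})
smallest = min(report, key=report.get)
```

In this example the dragon gets 2 of 13 words, so it is the fragment most likely to be drowned out; giving it a longer phrase with concrete physical detail is the fix the checklist suggests.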
See also
- LingBot overview: model name, commands, events, lifecycle
- LingBot tutorial: end-to-end example project
- Concepts → Commands and messages: the generic sendCommand / message contract