LingBot prompt guide

LingBot prompts steer POV, environment, subject behavior, and events. The model gets three simultaneous inputs: a seed image, control signals (your set_movement / set_look_* commands), and a text prompt. The text’s job is to describe what the world is and feels like. This guide covers the rules that produce good output and the mistakes that wreck it.

One concept this guide assumes up front: prompts in LingBot are dynamic. You set the prompt with set_prompt and can call set_prompt again at any point in a live session to update it, typically in response to action state (driving vs idle), scene events (weather changes, dramatic moments), or other user input. When the rules below talk about “swapping fragments” or “rewriting parts of the prompt,” they mean calling set_prompt with revised text. Layered composition covers how to structure prompts so these updates stay clean.
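A minimal sketch of the dynamic-prompt idea, assuming your client exposes some callable that wraps set_prompt. The `PromptUpdater` class, the `send_prompt` callable, and the two short prompts are hypothetical names and text made up for illustration; they are not part of the LingBot API.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical prompt text per action state; real prompts would be full scene descriptions.
PROMPTS: Dict[str, str] = {
    "driving": "A grey-green 4x4 on open desert sand. The 4x4 rolls forward, dust trailing behind it.",
    "idle": "A grey-green 4x4 on open desert sand. The 4x4 is parked and motionless.",
}


class PromptUpdater:
    """Sends a revised prompt only when the action state actually changes."""

    def __init__(self, send_prompt: Callable[[str], None]) -> None:
        # send_prompt stands in for whatever wraps set_prompt in your client.
        self._send_prompt = send_prompt
        self._last_state: Optional[str] = None

    def on_action_state(self, state: str) -> bool:
        """Return True if a prompt was sent for this state."""
        if state == self._last_state:
            return False
        self._send_prompt(PROMPTS[state])
        self._last_state = state
        return True


sent: List[str] = []
updater = PromptUpdater(sent.append)
updater.on_action_state("driving")
updater.on_action_state("driving")  # unchanged state, nothing is re-sent
updater.on_action_state("idle")
print(len(sent))  # 2
```

Here a plain list stands in for the live session, so the example runs without a connection; in an app you would pass the function that actually issues set_prompt.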

The most important rule: keep conditions aligned

LingBot is different from a normal text-to-video model. It’s a first image + [real-time text + action] model. Three signals condition the generation at the same time:

A seed image (the first frame)
A text prompt (continuously updated)
Action signals (your set_movement / set_look_* commands)

All three are active simultaneously, and they all need to agree. When they don’t, no signal wins cleanly and the output gets confused. There are three kinds of alignment to maintain:

Text vs first image: what the prompt describes has to match what the seed image shows.
Text vs action: what the prompt says about motion has to match what action is doing right now.
Text vs text: different parts of the prompt can’t contradict each other.

Text vs first image

The seed image fixes what the scene actually contains: a specific subject, its materials, the palette. If your prompt describes something noticeably different, the model tries to satisfy both signals at once and the world drifts mid-generation. If the seed image shows a metal boat, don’t write:

A weathered wooden rowboat drifting through reeds.

With this prompt, the boat’s hull tends to morph gradually from metal toward wood as the video plays out, and texture, color, and surface detail can drift along with it. Describe what’s actually in the seed image:

A battered metal boat drifting through reeds.

If you want a different subject, regenerate the seed instead of fighting it from the prompt.

Text vs action

Text can describe motion, but whatever it describes has to agree with what action is doing in that same moment. The failure mode isn’t “the text mentioned movement”; it’s text saying one thing while action says another. The output is motion you can’t stop, controls that feel slow, or a subject that won’t go where you steer it.

Example 1: motion baked into the base prompt

Watch out for camera motion verbs:

The camera pans across a snowy mountainside.

The model pans on its own, and your set_look_* input feels slow or gets ignored. Subject motion verbs cause the same problem:

A horse galloping across the plains toward the mountains.

The horse keeps galloping forward even when you send set_movement: idle. You can’t bring it to a stop.

Both fail the same way: motion described in the base stays on regardless of action state. Camera motion verbs (pan, tilt, dolly, track, push in, pull back, orbit, zoom, fly through) fight live look input the same way subject motion verbs fight live movement input. Keep motion out of the base. Describe the world statically:

A bay horse on open plains, mountains on the far horizon.
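A small lint pass can flag camera motion verbs before a base prompt ships. The sketch below is a rough heuristic, not a LingBot feature: `find_camera_motion` is a hypothetical helper, it matches only simple inflections (pans, panning, pushes in), and it will also flag nouns such as “a dirt track,” so treat its hits as prompts for a manual check.

```python
import re
from typing import List

# Camera motion verbs named in this guide; extend the list for your own scenes.
CAMERA_MOTION_TERMS = [
    "pan", "tilt", "dolly", "track", "push in", "pull back",
    "orbit", "zoom", "fly through",
]


def find_camera_motion(base_prompt: str) -> List[str]:
    """Return the camera motion terms that appear in a base prompt."""
    found = []
    for term in CAMERA_MOTION_TERMS:
        first, _, rest = term.partition(" ")
        # Allow simple inflections on the first word: pans, panning, pushes.
        pattern = r"\b" + re.escape(first) + r"(?:s|es|ed|ing|ning|ned)?"
        if rest:
            pattern += r"\s+" + re.escape(rest)
        pattern += r"\b"
        if re.search(pattern, base_prompt, flags=re.IGNORECASE):
            found.append(term)
    return found


print(find_camera_motion("The camera pans across a snowy mountainside."))  # ['pan']
print(find_camera_motion("A bay horse on open plains, mountains on the far horizon."))  # []
```

Subject motion verbs (galloping, walking, rolling) vary too much by scene for a fixed list; the same function works if you pass it a scene-specific vocabulary.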

For camera framing, use position-only language. Match the subject’s position to what action is doing right now:

When the user is moving forward and you want the subject to stay in third-person rear view, anchor that explicitly: “rear view, subject centered in frame.” set_look_* then turns the subject’s heading instead of orbiting the camera.
When there’s no movement input and you want the subject still while the camera is free to look around it, describe the subject as static and centered with a fixed pointing direction: “the subject stays motionless at the center of the frame, oriented forward.” The look input can orbit the camera around it without dragging the subject along.

Describe motion only in fragments you swap with action state (see Layered composition).

Example 2: tying the subject to a named entity

A dog walking down a country lane.

The dog stays glued to the lane. Pressing left or right to steer off-road doesn’t work since the prompt keeps pulling it back. Constraining the subject’s relationship to a named entity (a path, a vehicle, a specific landmark) bakes that relationship into every frame; action input can’t break it. This is another flavor of text vs action conflict. Describe the place, not the relationship:

A border collie in tall summer grass, a country lane running through the field.

Text vs text

The prompt can also contradict itself when an event or state change introduces a new fragment mid-session without removing the old phrasing it’s meant to replace. The model receives both descriptions in the same string and splits the difference. For example, with a base that says “dark storm clouds gathered overhead,” don’t simply append a weather-clearing event on top:

The sun breaks through and the sky clears.

The assembled prompt now contains both a storm-cloud sky and a clearing one, and the output ends up muddy and indecisive. When a state change replaces something in the base, rewrite the conflicting fragment instead of appending to it. Remove or revise the old phrasing so the assembled prompt describes one consistent world. Layered composition makes this easier; see also Watch for layer conflicts.
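The rewrite-instead-of-append rule can be sketched as a plain string substitution. `swap_fragment` is a hypothetical helper and the harbor sentence is invented for illustration; only the storm-cloud phrase comes from the example above.

```python
def swap_fragment(prompt: str, old: str, new: str) -> str:
    """Replace a conflicting fragment rather than appending a second description."""
    if old not in prompt:
        # Fail loudly: silently appending would recreate the contradiction.
        raise ValueError(f"fragment not found in prompt: {old!r}")
    return prompt.replace(old, new, 1)


base = "A stone harbor at dusk, dark storm clouds gathered overhead."
cleared = swap_fragment(
    base,
    "dark storm clouds gathered overhead",
    "the sun breaking through a clearing sky",
)
print(cleared)  # A stone harbor at dusk, the sun breaking through a clearing sky.
```

Raising on a missing fragment is a deliberate choice in this sketch: if the old phrasing has already been edited elsewhere, you want to find out before the model receives two competing skies.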

Token budget

LingBot’s text encoder is umt5-xxl, which has a hard cap of 512 tokens. Everything beyond that is silently truncated. Quality starts to degrade noticeably past ~500 tokens, before you even reach the hard limit, so treat 500 as your effective ceiling.

What matters more in practice is the soft version of this rule: if you want something to actually show up in the output, the words describing it have to take up enough room within that budget. A two-word mention tucked into an otherwise dense prompt usually gets ignored; the rest of the scene drowns it out. There’s no exact ratio, and the right balance depends on the scene. As a starting point, give the thing you want to see at least as much room as your other sentence-anchors. Try it, watch the output, and tune from there.

Want to see exactly how your prompt tokenizes? Paste it into the token playground to see total token count, and highlight individual sections to see how each piece weighs in.
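If you want a rough budget check inside your own tooling, the sketch below reports each section’s share of the prompt. It is an approximation under stated assumptions: `rough_count` simply counts whitespace-separated words, which usually undercounts real subword tokens, and `budget_report` is a hypothetical helper. Pass a counting function backed by the actual encoder tokenizer for real numbers.

```python
from typing import Callable, Dict

TOKEN_LIMIT = 512  # hard cap stated in this guide


def rough_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def budget_report(
    sections: Dict[str, str],
    count: Callable[[str], int] = rough_count,
) -> Dict[str, float]:
    """Return each section's share of the total count; fail if over the cap."""
    counts = {name: count(text) for name, text in sections.items()}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("prompt is empty")
    if total > TOKEN_LIMIT:
        raise ValueError(f"{total} tokens exceeds the {TOKEN_LIMIT}-token cap")
    return {name: n / total for name, n in counts.items()}


shares = budget_report({
    "world": "A bay horse on open plains, mountains on the far horizon.",
    "event": "Light rain.",
})
print(round(shares["event"], 2))  # 0.15
```

A two-word event holding about 15% of even this tiny prompt is the kind of imbalance the soft rule warns about. Counting sections separately can differ slightly from counting the joined string with a real tokenizer, so treat the shares as guidance rather than exact figures.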

Writing a base prompt

A base prompt describes what the world is, in a way that holds true regardless of what action is doing in any given frame. A good base prompt covers four things, in roughly this order:

FOV + subject declaration. Open with one sentence that names the point of view and the subject. The model anchors POV from the first sentence; don’t bury it.
Object layers: near, mid, far. Spell out what’s around the subject at each distance. Near is what’s directly around at ground level. Mid is the focal element of the scene. Far is the backdrop or horizon. Missing layers produce flat, muddy backgrounds.
Camera framing. Where the camera sits and what the framing looks like in static, position-only terms, with no motion verbs. See the alignment rule for why motion verbs belong in dynamic fragments, not in base.
Atmosphere. One closing phrase that names the palette, energy, or rendering style. One phrase, not three.

Length. 2–4 sentences. If you have anchor gaps, expand existing sentences with more clauses rather than adding new ones.

Worked example: Citadel Approach

A canyon-driving scene with a battered Defender 4x4 approaching a desert citadel.

Scene design

The base anchors all four parts:

FOV + subject: “This is a third-person-view video of a battered grey-green vintage Defender 4x4 deep in a coral-lit desert canyon.”
Near plane: “Prickly pear cacti tipped with magenta blooms, scattered red poppies, and weather-pitted boulders dot the open desert floor”
Mid plane: “smooth ochre dunes sweep up toward towering sandstone mesas that wall the valley on the left.”
Far plane: “Ahead, a cliff-built sandstone citadel of white-washed houses, crenellated battlements, and slender minarets stands against a hazy peach-orange sunset sky.”
Atmosphere: “Warm painterly desert storybook atmosphere.”

Four sentences. No motion verbs in the camera framing. The world description holds whether the Defender is moving or parked.

Dynamic state

When the user is driving, moving forward, or looking around, the assembled prompt looks like this:

This is a third-person-view video of a battered grey-green vintage Defender 4x4 deep in a coral-lit desert canyon. Prickly pear cacti tipped with magenta blooms, scattered red poppies, and weather-pitted boulders dot the open desert floor; smooth ochre dunes sweep up toward towering sandstone mesas that wall the valley on the left. Ahead, a cliff-built sandstone citadel of white-washed houses, crenellated battlements, and slender minarets stands against a hazy peach-orange sunset sky. Warm painterly desert storybook atmosphere. Strict centred third-person rear view: the Defender is locked at the exact centre of the frame at all times. Horizontally centred, vertically centred, and the camera sits on a fixed offset directly behind the vehicle’s rear axle (zero lateral offset, zero orbit angle). The camera tracks the Defender from directly behind as it travels forward and never rotates, orbits, or pans around it under any input. Arrow-key look-input turns the Defender’s heading instead, so the rear-view framing, camera-behind-centre, is preserved frame-by-frame. The Defender rolls forward across the open desert sand, plumes of pale golden dust kicking up from its tires and trailing behind the rear hatch, the suspension flexing softly as it crests the dunes, faint heat shimmer rising from the tailpipe.

Static state

When the user releases controls, the assembled prompt swaps the trailing camera and subject description for a still variant:

This is a third-person-view video of a battered grey-green vintage Defender 4x4 deep in a coral-lit desert canyon. Prickly pear cacti tipped with magenta blooms, scattered red poppies, and weather-pitted boulders dot the open desert floor; smooth ochre dunes sweep up toward towering sandstone mesas that wall the valley on the left. Ahead, a cliff-built sandstone citadel of white-washed houses, crenellated battlements, and slender minarets stands against a hazy peach-orange sunset sky. Warm painterly desert storybook atmosphere. The Defender stays perfectly centered in the camera’s view with a fixed orientation, and its apparent size and distance from the camera remain constant at all times. It never moves or shifts in place. The Defender is completely still and motionless, parked on the open desert sand, mud-streaked rear hatch and weathered tail-lights static, no dust or movement around its tires.

The base sentences are identical in both states; only the trailing description of camera and subject behavior swaps to match what action is doing right now.

Third-person vs first-person

LingBot supports both third-person and first-person scenes. The base-prompt structure and the Citadel example above both assume a third-person framing: a named subject (the Defender) with environment described around it. First-person scenes simplify the prompt in a few ways:

set_look_* directly controls the camera. In first-person, the camera is the viewer’s eyes, so you don’t need to describe the camera in terms of its relationship to a subject. You can skip camera-framing language altogether, beyond declaring POV up front.
If there’s no controlled subject in frame (no horse, no vehicle, no avatar body), you may also drop subject descriptions from the base, which then becomes a purer environment description. The distinction between “subject moving” and “subject still” tends to collapse too, since there’s no separate subject body to track.

To anchor POV from the start, open the prompt with something like “a first-person video of…” so the model doesn’t default to a third-person framing. If you use layered composition to swap fragments with action state, these simplifications correspond to dropping the camera layer and (often) the static/dynamic split inside the movement layer.

Runtime caveats

A few practical limits affect how a prompt will play out in a live session:

Don’t sit on one condition for too long. Staying with the same text prompt and an idle movement state across many generated chunks may cause details to drift and visual artifacts to appear. Active input keeps the condition fresh. Pressing forward (W) or any other movement key counts. The failure mode is sitting with both the prompt and movement idle. Periodically introducing a set_prompt change, or any user input, resets this.
Video length. The model retains exactly 300 generated chunks before resetting back to the seed image. This is a hard cap; plan scene transitions around it.
Model limitations. A few known weak spots in the current model:
- Movement input reliability. Forward (W) is noticeably more stable than the other movement keys. Lateral strafing, moving sideways without turning, is the worst case: it displaces the camera off the centred subject and degrades output. Route left/right motion through forward + turning (look-input that turns the subject’s heading) rather than dedicated strafe input.
- Orbit can fail even with correct framing. Some scenes won’t orbit cleanly under look-input even when the prompt explicitly fixes the subject as still, centred, and forward-oriented. This appears to be a training-data limitation rather than a prompt issue. Subjects with clear orbit coverage in training data (horse-riding, cars) work better than ones without (boats, dragons); if your scene needs free 360° look around the subject, prototype with the seed before committing.
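The “don’t sit on one condition” caveat can be handled with a small idle timer in the client. This is a sketch only: `IdleWatchdog` is a hypothetical class, and the 20-second threshold in the usage lines is an arbitrary placeholder, since this guide does not give a number. Timestamps are passed in explicitly so the logic is easy to test.

```python
class IdleWatchdog:
    """Tracks how long the session has gone without input or a prompt change."""

    def __init__(self, max_idle_seconds: float) -> None:
        self.max_idle_seconds = max_idle_seconds
        self._last_activity = 0.0

    def note_activity(self, now: float) -> None:
        """Call on any movement or look input, and on every set_prompt change."""
        self._last_activity = now

    def refresh_due(self, now: float) -> bool:
        """True once the idle period reaches the threshold."""
        return now - self._last_activity >= self.max_idle_seconds


watchdog = IdleWatchdog(max_idle_seconds=20.0)  # placeholder threshold
watchdog.note_activity(now=5.0)
print(watchdog.refresh_due(now=10.0))  # False
print(watchdog.refresh_due(now=25.0))  # True
```

When `refresh_due` returns True, the app could send a small set_prompt revision or nudge the user for input; which of those suits your scene is a product decision rather than something the sketch decides.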

Layered composition (advanced)

The example above shows two assembled prompts that share the same opening (same FOV, world, atmosphere) and only differ in their trailing description of camera and subject behavior. That’s not a coincidence. Behind the scenes, the prompt is split into orthogonal layers that you concatenate client-side before each set_prompt call. The encoder never sees the structure; it just receives the assembled string. A typical setup splits the prompt into:

base: FOV, world, object layers, atmosphere. Static across action-state changes.
camera: how the camera is locked, oriented, and tracking. Has static and dynamic variants that swap with set_look_* state.
movement: whether the subject is moving and how. Has static and dynamic variants that swap with set_movement state.
events: optional overlays for things that happen to the world (see below).

When action state changes, you rewrite only the layer(s) it touches and re-send the assembled prompt. Action-state changes never rewrite the base; only events that re-skin the world do (see Events).
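The layer split can be sketched as a small data structure that is joined into one string before each set_prompt call. `PromptLayers` and its field names are hypothetical, and the layer text is shortened for illustration; the real Citadel fragments are much longer.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PromptLayers:
    base: str                  # world description, untouched by action-state changes
    camera: Dict[str, str]     # variants keyed by look state
    movement: Dict[str, str]   # variants keyed by movement state
    events: List[str] = field(default_factory=list)  # optional overlays

    def assemble(self, look_state: str, movement_state: str) -> str:
        """Concatenate the active variant of each layer into one prompt string."""
        parts = [
            self.base,
            self.camera[look_state],
            self.movement[movement_state],
            *self.events,
        ]
        return " ".join(p.strip() for p in parts if p.strip())


layers = PromptLayers(
    base="A grey-green 4x4 in a desert canyon. Warm storybook atmosphere.",
    camera={
        "static": "The 4x4 stays centred with a fixed orientation.",
        "dynamic": "Rear view, the 4x4 centred in frame.",
    },
    movement={
        "static": "The 4x4 is parked and motionless.",
        "dynamic": "The 4x4 rolls forward across the sand.",
    },
)
moving = layers.assemble("dynamic", "dynamic")
parked = layers.assemble("static", "static")
print(moving.startswith(layers.base) and parked.startswith(layers.base))  # True
```

Both assembled strings open with the same base and differ only in the trailing camera and movement text, which mirrors the two Citadel prompts shown earlier.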

Events

Beyond camera and movement, you can layer events on top of the assembled scene: overlays for things that happen to the world rather than descriptions of the world itself. Citadel Approach defines two:

Camel caravan: “Across the open desert ahead, a long caravan of laden bactrian camels lines up in twos toward the citadel. Broad woolen flanks tasseled with crimson and indigo blankets, brass bells at their necks, robed riders in earth-toned wraps seated in carved wooden saddles. The camels’ long shadows lie across the ochre dunes and prickly pear blooms.”
Solar eclipse: “The sun overhead is reduced to a perfect black disc ringed by a ghostly white corona, casting strange dim slate-blue light across the sandstone mesas, draining the ochre dunes to muted plum and turning the prickly pear blooms and red poppies into dark silhouettes flecked with deep-violet highlights. The citadel ahead stands bone-pale and chilled against a bruise-purple sky, its windows lit with steady warm lamps in the eclipse hush.”

A given event change touches either one layer or several:

One layer. An addition that doesn’t disturb the rest of the world: fireworks in the sky, lightning, fire breath, a shield raising, the camel caravan above. Only the events layer changes.
Many layers. A change that re-skins the world: night, winter, steampunk, pixel art, or the eclipse above (which re-colors lighting, atmosphere, and surface descriptions). Atmosphere and object layers change together with the event.

When many layers change at once, rewrite them all consistently. A “now at night” change that updates the sky to dark but still describes “noon sun on the chrome” contradicts itself, and the output splits the difference badly. Re-skin every layer the change would touch.

Event weight. Events follow the same room-budget rule as anything else. Give them a full sentence-anchor with concrete physical detail, not a passing mention (see Token budget). Look at the camel caravan and eclipse phrasing above: those aren’t one-liners.
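For a change that touches many layers, one approach is to store the re-skin as a set of replacement layers and apply them together, so no layer is left describing the old world. The layer names, the shortened text, and the `apply_skin` helper below are all hypothetical; the eclipse wording is a condensed paraphrase of the event above.

```python
from typing import Dict

# Hypothetical, shortened layer text; real layers carry full sentence-anchors.
DAY_LAYERS: Dict[str, str] = {
    "far": "A sandstone citadel stands against a hazy peach-orange sunset sky.",
    "atmosphere": "Warm painterly desert storybook atmosphere.",
    "movement": "The Defender is parked on the open desert sand.",
}

# Every layer the eclipse visibly changes gets rewritten text.
ECLIPSE_SKIN: Dict[str, str] = {
    "far": "A bone-pale sandstone citadel stands against a bruise-purple sky.",
    "atmosphere": "Dim slate-blue eclipse light drains the dunes to muted plum.",
}


def apply_skin(layers: Dict[str, str], skin: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the layers with every skinned layer replaced at once."""
    unknown = set(skin) - set(layers)
    if unknown:
        raise KeyError(f"skin names layers that do not exist: {sorted(unknown)}")
    return {**layers, **skin}


eclipsed = apply_skin(DAY_LAYERS, ECLIPSE_SKIN)
print("sunset" in " ".join(eclipsed.values()))  # False
```

Layers the skin does not mention, such as movement here, pass through unchanged, and the original dictionary is left intact so reverting is just a matter of assembling from it again.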

How events fire

Firing an event just means calling set_prompt again with the relevant layer rewritten; there’s no special API for it. How you wire the trigger in your app is a frontend choice, not something that changes the prompt itself. Two common patterns:

Pinned (toggle). The user picks a state and it stays on. Your app calls set_prompt once and leaves it. Fits events that should persist: weather, time of day, season, art style.
Momentary (press-and-hold). The event applies only while a key or button is held. On release, your app calls set_prompt again to revert the layer. Fits short, dramatic events: a dragon breathing fire, a laser firing, a fireworks burst.

These are just two ways to trigger the same kind of set_prompt call. Neither is “better”. Pick whichever matches the event. “Now it’s snowing” feels wrong as a press-and-hold; “dragon breathes fire” feels wrong as a permanent toggle.
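Both trigger patterns reduce to the same re-send, which a short sketch makes concrete. `EventLayer`, its method names, and the dragon scene text are hypothetical; `send_prompt` stands in for whatever wraps set_prompt in your client.

```python
from typing import Callable, List, Optional


class EventLayer:
    """Holds the current event text and re-sends the prompt whenever it changes."""

    def __init__(self, scene: str, send_prompt: Callable[[str], None]) -> None:
        self._scene = scene
        self._send_prompt = send_prompt
        self._event: Optional[str] = None

    def _apply(self, event: Optional[str]) -> None:
        self._event = event
        text = self._scene if event is None else f"{self._scene} {event}"
        self._send_prompt(text)

    def toggle(self, event: str) -> None:
        """Pinned: flip the event on or off; it stays that way until toggled again."""
        self._apply(None if self._event == event else event)

    def press(self, event: str) -> None:
        """Momentary: apply the event while the key is held."""
        self._apply(event)

    def release(self) -> None:
        """Momentary: revert the layer when the key is released."""
        self._apply(None)


sent: List[str] = []
layer = EventLayer("A dragon perched on a cliff.", sent.append)
layer.press("The dragon breathes a long jet of orange fire.")
layer.release()
print(sent[1])  # A dragon perched on a cliff.
```

The only difference between the two patterns is who calls the revert: the user’s key release for momentary events, or a second deliberate toggle for pinned ones.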

Watch for layer conflicts

Layered composition is more powerful than swapping whole prompts, but it adds a new risk: the layers inside a single prompt can contradict each other, the same way text and action can. If the base layer says “under a cloudy, overcast sky” and you append an events layer that says “the sun breaks through and the clouds clear”, you’ve handed the model two contradictory weather descriptions in one string. The output splits the difference and looks bad. This is the same failure mode as fighting the controls, just inside the prompt itself.

Before assembling, check that no two layers describe the same surface, lighting, or weather differently. Either rewrite the conflicting layer to be additive (“a sudden break in the clouds” rather than “the sky clears”), or swap the entire prompt as a fresh set_prompt call.
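A pre-assembly check for layer conflicts can be approximated with opposing word groups. This is a crude keyword heuristic, not a semantic check: the `OPPOSITES` groups and the `find_conflicts` helper are hypothetical, they compare only across layers (not within one), and they will miss any contradiction phrased with words outside the lists.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical opposing word groups; tune these to your own scenes.
OPPOSITES = [
    ({"overcast", "cloudy", "storm"}, {"clear", "clears", "cloudless", "sunny"}),
    ({"night", "moonlit"}, {"noon", "midday"}),
]


def find_conflicts(layers: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (layer_a, layer_b) pairs whose text uses opposing terms."""
    words = {
        name: set(re.findall(r"[a-z]+", text.lower()))
        for name, text in layers.items()
    }
    conflicts = []
    names = list(layers)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            for left, right in OPPOSITES:
                a_left, a_right = words[a] & left, words[a] & right
                b_left, b_right = words[b] & left, words[b] & right
                if (a_left and b_right) or (a_right and b_left):
                    conflicts.append((a, b))
                    break  # one report per layer pair is enough
    return conflicts


print(find_conflicts({
    "base": "A harbor under a cloudy, overcast sky.",
    "events": "The sun breaks through and the clouds clear.",
}))  # [('base', 'events')]
```

The additive phrasing suggested above (“a sudden break in the clouds”) passes this check because it does not assert a clear sky, which is the property the rewrite is meant to have.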

Troubleshooting

When a prompt doesn’t produce what you wanted, walk through this checklist before assuming the model can’t do it.

1. Is anything in conflict? Walk the three alignment flavors from the alignment rule:

Text vs first image. Does the prompt describe a subject or materials that differ from what the seed actually shows?
Text vs action. Does the prompt describe motion, camera moves, or relationships to named entities that don’t match what action is doing right now?
Text vs itself. If you’re using layered composition, do two layers describe the same surface, lighting, or weather in incompatible ways? See Watch for layer conflicts.

2. Are there detail gaps? Walk through the base-prompt checklist: FOV + subject declared up front, near / mid / far object layers all filled in, camera framing in position-only terms, one atmosphere phrase. Skipped layers are the most common cause of muddy output.

3. Does the element you want have enough room in the prompt? If a specific element isn’t showing up, expand its description. See Token budget.

4. Tune by emphasis

If something in the video is underperforming, give its description more room: a longer phrase, more concrete physical detail, an extra clause. A two-word mention is easy for the rest of the prompt to drown out.
If something unwanted is showing up, don’t just hope it goes away. Describe its opposite explicitly, or call out that it shouldn’t appear (“clear cloudless sky,” “no people in frame”). The model responds to what you write, not to what you omit.

If none of these explain the failure, you may be hitting a current model limitation. See Runtime caveats.