LingBot tutorial

A guided tour of the open-source LingBot Interactive reference app, which demonstrates every important pattern in the LingBot SDK. By the end you’ll know how to start a scene from an image, drive it with WASD, snap clips, and surface model errors.

Installation and setup

Get the example running locally before reading further. Every section below points back at code in the example repo. You will need:

Node.js 18+.
pnpm (the example pins lockfiles to pnpm; npm or yarn will work but you’ll regenerate the lockfile).
A Reactor API key (starts with rk_).
Familiarity with the Next.js App Router.

Clone the example

The example lives alongside our other reference apps in reactor-team/js-sdk under examples/.

git clone https://github.com/reactor-team/js-sdk
cd js-sdk/examples/lingbot

Add your API key

Your rk_… key must never reach the browser; the example reads it server-side and mints a short-lived JWT for the client. We’ll cover the broker pattern below; for now, drop the key into .env:

cp .env.example .env
# then edit .env and set REACTOR_API_KEY to your API key

See a “Setup Required” screen? Your REACTOR_API_KEY isn’t loaded. The check lives in app/page.tsx → app/SetupRequired.tsx.

Install dependencies and start the dev server

pnpm install
pnpm dev

Open http://localhost:3000, click Connect, pick a curated scene or upload your own image, and drive it with WASD.

How LingBot works

Building with LingBot is different from calling a typical generative API. There’s no image-in / video-out request. You open a long-lived connection, send a seed image plus a paragraph-length prompt, and the model begins producing a continuous stream of chunks that you steer in real time with WASD. The image anchors the scene and is locked at start; the prompt and movement axes drive everything that happens after.

Opening the connection isn’t instant. Reactor provisions a GPU for your session, so the client moves through the same four states as every other Reactor model (disconnected → connecting → waiting → ready) before media starts flowing. See Sessions for the full breakdown.

Three properties of the LingBot API are worth internalizing before you read on, since the rest of this tutorial assumes them:

Commands are asynchronous; events are the source of truth. Calling setImage doesn’t mean the next chunk uses it; the model confirms by emitting image_accepted when the upload has been decoded and is ready to use.
Errors arrive out-of-band. A broken precondition like start before setImage surfaces later as a command_error event, not as a thrown exception.
Movement axes are persistent state, not pulses. set_movement and the two look axes hold their last value forever; every keydown needs a matching keyup that sends idle.
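
The third property is the easiest to get wrong, so here is a minimal, self-contained sketch of it. It does not call the SDK: axisPerChunk and the AxisEvent shape are hypothetical names that only model an axis holding its last value across chunk boundaries.

```typescript
type Movement = "idle" | "forward" | "back" | "strafe_left" | "strafe_right";

// A command sent before a given chunk boundary (hypothetical shape, for illustration).
interface AxisEvent {
  beforeChunk: number;
  value: Movement;
}

// Returns the movement value the model would see at each chunk boundary,
// assuming the axis simply holds whatever was last sent.
function axisPerChunk(events: AxisEvent[], chunkCount: number): Movement[] {
  const sorted = [...events].sort((a, b) => a.beforeChunk - b.beforeChunk);
  const result: Movement[] = [];
  let current: Movement = "idle";
  let next = 0;
  for (let chunk = 0; chunk < chunkCount; chunk++) {
    // Apply every command that arrived before this boundary, in order.
    while (next < sorted.length && sorted[next].beforeChunk <= chunk) {
      current = sorted[next].value;
      next++;
    }
    result.push(current);
  }
  return result;
}

// keydown only: the axis stays on "forward" for every later chunk.
const stuck = axisPerChunk([{ beforeChunk: 1, value: "forward" }], 4);

// keydown followed by keyup: movement ends once "idle" is sent.
const released = axisPerChunk(
  [
    { beforeChunk: 1, value: "forward" },
    { beforeChunk: 3, value: "idle" },
  ],
  4,
);
```

In the first case the result is idle, forward, forward, forward; in the second it is idle, forward, forward, idle. The only difference is the explicit idle.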

Authentication

LingBot uses the same broker pattern as every browser-side Reactor app: your rk_… key stays on the server and the client receives a short-lived JWT minted from it. The server-side route at app/api/reactor/token/route.ts and the mount-time fetch in LingbotApp.tsx are character-for-character the Helios setup with the provider swapped to <LingbotProvider>. See Authentication for the full concept page, including the Express equivalent and the Python path that skips the broker entirely.

Starting a scene from an image

The canonical LingBot launch flow lives in ScenePicker.tsx. Picking a curated scene fires a five-step sequence: uploadFile → setImage → await image_accepted → setPrompt → start. The wait in the middle is the part that matters. setImage carries an upload the runtime has to decode and VAE-encode, but start carries nothing and sails past on the same data channel. Skip the wait and the first chunk is generated from the prompt alone, with the image landing one chunk later. The scene visibly “corrects itself” at the first chunk boundary.

Figure: the safe LingBot image-start sequence. Upload the image, call setImage, wait for image_accepted, then call setPrompt and start before frames stream.

LingBot doesn’t have Helios’s atomic setConditioning; the explicit wait is the answer. The example uses useLingbotImageAccepted with a one-shot ref resolver to gate setPrompt + start on the right event:

app/components/ScenePicker.tsx

const { uploadFile, setImage, setPrompt, start } = useLingbot();

// Park the resolver BEFORE calling setImage. If we registered it
// after, the model's ack could land first and we'd miss it.
const imageReadyRef = useRef<(() => void) | null>(null);

useLingbotImageAccepted(() => {
  if (imageReadyRef.current) {
    imageReadyRef.current();
    imageReadyRef.current = null;
  }
});

async function startScene(scene: Scene) {
  const blob = await fetch(scene.imageUrl).then((r) => r.blob());
  const ref = await uploadFile(blob, { name: `${scene.id}.jpg` });

  const imageReady = new Promise<void>((resolve) => {
    imageReadyRef.current = resolve;
  });

  await setImage({ image: ref });
  await imageReady; // ← the load-bearing line
  await setPrompt({ prompt: scene.prompt });
  await start();
}

The Promise wrapper around imageReadyRef is the standard pattern for waiting on an event-bus event from inside an async function. The hook callback is the resolver; the ref makes it one-shot. Without the ref reset, a second image_accepted from a later run would call a stale resolver whose Promise has already settled.

The curated scenes live in app/lib/scenes.ts. Each entry pairs a hand-tuned starting prompt with a reference image in public/. The prompts follow the rules in the Prompt Guide: FOV and subject declared up front, near / mid / far object layers filled in, position-only camera framing, one atmosphere phrase.

Helios SDK 0.9.0+ has setConditioning, an atomic command that bundles setImage + setPrompt so the race in the first paragraph can’t happen. LingBot has no analogue today; the image_accepted wait is the recommended pattern.
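
One gap in the image_accepted wait as written in startScene is that it has no deadline: if the event never arrives, the launch hangs. The sketch below factors the one-shot resolver into a small helper and adds a timeout. createOneShotGate and withTimeout are hypothetical names, not part of the LingBot SDK or the example repo.

```typescript
// One-shot gate: arm() returns a Promise, fire() resolves it at most once.
function createOneShotGate() {
  let pending: (() => void) | null = null;
  return {
    // Call BEFORE sending the command whose ack you are waiting for.
    arm(): Promise<void> {
      return new Promise<void>((resolve) => {
        pending = resolve;
      });
    },
    // Call from the event callback. Returns false if nobody was waiting.
    fire(): boolean {
      if (!pending) return false;
      const resolve = pending;
      pending = null; // reset so a later event cannot reuse a stale resolver
      resolve();
      return true;
    },
  };
}

// Rejects if the wrapped promise has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

In a component, gate.fire() would be called from the useLingbotImageAccepted callback, and startScene would arm the gate before setImage and then await withTimeout(armed, 10_000, "image_accepted"). The 10-second budget is an arbitrary placeholder; pick one that suits your image sizes and network.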

Custom uploads

CustomStart.tsx handles the second launch path: the user uploads their own image and types their own prompt. The trick here is to upload as soon as the file is picked, so image_accepted lands while the user is still typing. When they click Start, the example fires setPrompt + start directly; no await needed, because the human typing delay is orders of magnitude longer than the image decode. Contrast with Starting a scene from an image, where a one-click launch has to bridge that gap explicitly. The two halves of the flow are wired to different events:

app/components/CustomStart.tsx

const { uploadFile, setImage, setPrompt, start } = useLingbot();
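const [text, setText] = useState("");

// Same one-shot resolver pattern as ScenePicker.tsx, declared here so the
// excerpt is self-contained: the ref holds the resolver and the
// image_accepted hook fires it once.
const imageReadyRef = useRef<(() => void) | null>(null);

useLingbotImageAccepted(() => {
  imageReadyRef.current?.();
  imageReadyRef.current = null;
});

// File-pick handler: upload immediately, await image_accepted.
async function uploadCustomImage(file: File) {
  // Park the resolver before setImage so the ack can't arrive first.
  const imageReady = new Promise<void>((resolve) => {
    imageReadyRef.current = resolve;
  });
  const ref = await uploadFile(file);
  await setImage({ image: ref });
  await imageReady;
}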

// Start button: image is already accepted, so just prompt + start.
async function startCustom() {
  if (!hasImage || !text.trim()) return;
  await setPrompt({ prompt: text.trim() });
  await start();
}

The Start button derives its disabled state from the snapshot:

const hasPrompt = snapshot?.has_prompt === true || text.trim().length > 0;
const hasImage = snapshot?.has_image === true;

Reading has_image off the snapshot rather than tracking a local boolean keeps the UI honest across edge cases (a reset() from elsewhere in the app, a disconnect mid-upload). The state payload is the canonical source for what the model thinks is set.
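
The two derived values above can be folded into one pure function, which makes the rule easy to unit-test outside React. This is a sketch: canStart and SetupSnapshot are hypothetical names, the interface lists only the snapshot fields this tutorial reads, and treating an already started session as not startable is an assumption.

```typescript
// Only the snapshot fields this tutorial reads; the real state message has more.
interface SetupSnapshot {
  has_image?: boolean;
  has_prompt?: boolean;
  started?: boolean;
}

// True when the Start button should be enabled.
function canStart(snapshot: SetupSnapshot | null, draftPrompt: string): boolean {
  // No snapshot yet (not connected) or already running: nothing to start.
  if (!snapshot || snapshot.started === true) return false;
  const hasImage = snapshot.has_image === true;
  // A prompt counts if the model already has one or the user has typed one.
  const hasPrompt = snapshot.has_prompt === true || draftPrompt.trim().length > 0;
  return hasImage && hasPrompt;
}
```

The button would then render with disabled={!canStart(snapshot, text)}.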

Going live

Once snapshot.started === true, the setup panels (ScenePicker and CustomStart) hide and the live UI takes over: a status badge, a now-playing panel with transport controls, and the video pane. StatusBadge.tsx is the user’s window into the four-state connection machine. Every state, including the multi-second waiting step where Reactor is provisioning a GPU, gets a visible label and color:

app/components/StatusBadge.tsx

import { useLingbot } from "@reactor-models/lingbot";

const TONE = {
  disconnected: { dot: "bg-zinc-500", label: "Disconnected" },
  connecting: { dot: "bg-amber-400 animate-pulse", label: "Connecting…" },
  waiting: { dot: "bg-amber-400 animate-pulse", label: "Waiting for GPU…" },
  ready: { dot: "bg-active", label: "Connected" },
};

export function StatusBadge() {
  const { status, lastError, connect, disconnect } = useLingbot();
  const idle = status === "disconnected";

  return (
    <div>
      <span className={TONE[status].dot} />
      <span>{TONE[status].label}</span>
      {idle ? (
        <button onClick={() => connect()}>Connect</button>
      ) : (
        <button onClick={() => disconnect()}>Disconnect</button>
      )}
      {lastError && <p className="text-red-400">{lastError.message}</p>}
    </div>
  );
}

useLingbot() exposes status, connect, disconnect, and lastError. The Connect / Disconnect toggle keys purely on status === "disconnected"; every other state renders Disconnect.

NowPlaying.tsx is the canonical example of how the rest of the app reads model state: subscribe once with useLingbotState, hold the latest snapshot in useState, and read fields off it. No event aggregation, no derived booleans, no useReducer over chunk_complete events.

app/components/NowPlaying.tsx

const { status, pause, resume, reset } = useLingbot();
const [snapshot, setSnapshot] = useState<LingbotStateMessage | null>(null);

useLingbotState((msg) => setSnapshot(msg));

// The SDK doesn't emit a final `state` message on disconnect, so we
// clear ourselves. Otherwise the next session inherits the old one.
useEffect(() => {
  if (status !== "ready") setSnapshot(null);
}, [status]);

// Phase switch: while not started (or after reset), render null and
// let the setup panels take over.
if (status !== "ready" || !snapshot?.started) return null;

return (
  <>
    <p>{snapshot.current_prompt}</p>
    <span>chunk {snapshot.current_chunk}</span>
    <span className="font-mono">{snapshot.current_action || "still"}</span>
    {snapshot.running ? (
      <button onClick={() => pause()}>Pause</button>
    ) : (
      <button onClick={() => resume()}>Resume</button>
    )}
    <button onClick={() => reset()}>Reset</button>
  </>
);

current_action is a LingBot-specific snapshot field: a +-joined composite like "w+left" that reflects how the model is currently moving and looking. It updates per chunk, so it lags the user’s key presses by one chunk; that’s fine for a status readout, but as Driving the scene with WASD covers, it’s the wrong source for button highlights.

The video pane itself is one component:

app/components/Video.tsx

import { LingbotMainVideoView } from "@reactor-models/lingbot";

export function Video() {
  return (
    <div className="rounded-lg border bg-black">
      <LingbotMainVideoView className="h-full w-full" videoObjectFit="contain" />
    </div>
  );
}

<LingbotMainVideoView /> is a typed wrapper around <ReactorView track="main_video"> that handles <video> element setup, srcObject binding, and browser autoplay policy quirks. Style the outer container; never reach for the underlying element.

One LingBot-specific behavior to internalize: when a run completes, the server automatically kicks off another with the same image and prompt. snapshot.started doesn’t flip back to false until the user calls reset(). If you want a session to stop generating after a run, listen for generation_complete and call reset() from your handler.
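
A minimal sketch of that stop-after-a-run handler follows. stopAfterRuns is a hypothetical helper; how you subscribe it to generation_complete is not shown in this tutorial, so the wiring is left to the caller.

```typescript
// Returns a handler to invoke once per generation_complete event.
// After `maxRuns` completed runs it calls `reset` and starts counting again.
function stopAfterRuns(maxRuns: number, reset: () => void): () => void {
  let completed = 0;
  return () => {
    completed += 1;
    if (completed >= maxRuns) {
      completed = 0; // ready for the next session
      reset();
    }
  };
}
```

With maxRuns set to 1, the session stops after the first run instead of looping; pass the reset function returned by useLingbot().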

Driving the scene with WASD

This is LingBot’s signature feature, and MovementControls.tsx is the largest component in the example. The model exposes three persistent-state axes: set_movement (forward/back/strafe_left/strafe_right/idle), set_look_horizontal, and set_look_vertical, plus set_rotation_speed_deg as a slider. The crucial invariant: axes hold their last value forever until you send a new one. Every keydown must be paired with a keyup that sends idle, or the camera will keep moving after the user lets go of the key. This is not a pulse API; the model walks forward at every chunk boundary until you explicitly tell it to stop. The component owns three pieces of local state for highlighting:

app/components/MovementControls.tsx

type Movement = "idle" | "forward" | "back" | "strafe_left" | "strafe_right";
type LookH = "idle" | "left" | "right";
type LookV = "idle" | "up" | "down";

const [pressedMovement, setPressedMovement] = useState<Movement>("idle");
const [pressedLookH, setPressedLookH] = useState<LookH>("idle");
const [pressedLookV, setPressedLookV] = useState<LookV>("idle");

Every change goes through a helper that updates both local state and the model:

const sendMovement = useCallback(
  (m: Movement) => {
    if (!ready) return;
    setPressedMovement(m); // ← drives the UI immediately
    setMovement({ movement: m }); // ← model picks up at next chunk
  },
  [ready, setMovement],
);

Why local state, not the snapshot? snapshot.movement reflects what the model is currently generating with, not what was just pressed. It lags every press by a chunk. If the buttons read from the snapshot, a quick W tap would never light up; by the time the highlight wanted to appear, the user has already released the key and the snapshot is back to idle. Local state is instant and matches what the user just did.

The keyboard handler attaches a single keydown / keyup pair to window so the pad responds without the user having to click into anything:

const MOVEMENT_KEYS: Record<string, Movement> = {
  w: "forward",
  s: "back",
  a: "strafe_left",
  d: "strafe_right",
};
const LOOK_H_KEYS: Record<string, LookH> = { arrowleft: "left", arrowright: "right" };
const LOOK_V_KEYS: Record<string, LookV> = { arrowup: "up", arrowdown: "down" };

useEffect(() => {
  if (!ready) return;

  const onKeyDown = (e: KeyboardEvent) => {
    // Don't hijack keys when the user is typing into an input.
    const target = e.target as HTMLElement | null;
    if (
      target &&
      (target.tagName === "INPUT" || target.tagName === "TEXTAREA" || target.isContentEditable)
    ) {
      return;
    }
    const k = e.key.toLowerCase();
    if (MOVEMENT_KEYS[k]) {
      e.preventDefault();
      sendMovement(MOVEMENT_KEYS[k]);
    } else if (LOOK_H_KEYS[k]) {
      e.preventDefault();
      sendLookH(LOOK_H_KEYS[k]);
    } else if (LOOK_V_KEYS[k]) {
      e.preventDefault();
      sendLookV(LOOK_V_KEYS[k]);
    }
  };

  const onKeyUp = (e: KeyboardEvent) => {
    const k = e.key.toLowerCase();
    if (MOVEMENT_KEYS[k]) sendMovement("idle");
    else if (LOOK_H_KEYS[k]) sendLookH("idle");
    else if (LOOK_V_KEYS[k]) sendLookV("idle");
  };

  window.addEventListener("keydown", onKeyDown);
  window.addEventListener("keyup", onKeyUp);
  return () => {
    window.removeEventListener("keydown", onKeyDown);
    window.removeEventListener("keyup", onKeyUp);
  };
}, [ready, sendMovement, sendLookH, sendLookV]);

Three patterns worth carrying into your own code:

preventDefault on arrow keys. Without it, the arrow keys scroll the page while the user is looking around. Browsers don’t scroll on WASD by default, but the preventDefault is harmless there and keeps the handler symmetric.
Ignore events in inputs and textareas. Otherwise typing “wad” into the custom-prompt textarea drives the camera around. The check against tagName and isContentEditable covers both bases.
Don’t filter repeat events. Holding a key fires repeated keydowns; the handler re-sends the same axis value. That’s a no-op at the model (same value, same axis), and trying to filter duplicates adds complexity for zero benefit.
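
One case the handler above does not cover is overlapping movement keys: releasing any movement key sends idle, so holding W, tapping D, and letting go of D typically stops the camera even though W is still down. If that matters for your app, one option is to track the held keys and fall back to the most recent one on release. This is a sketch of that idea, not what the example does; createHeldKeys is a hypothetical helper.

```typescript
type Movement = "idle" | "forward" | "back" | "strafe_left" | "strafe_right";

const MOVEMENT_KEYS: Record<string, Movement> = {
  w: "forward",
  s: "back",
  a: "strafe_left",
  d: "strafe_right",
};

function isMovementKey(key: string): boolean {
  return Object.prototype.hasOwnProperty.call(MOVEMENT_KEYS, key);
}

// Tracks which movement keys are currently held, most recent last.
function createHeldKeys() {
  const held: string[] = [];
  const current = (): Movement =>
    held.length > 0 ? MOVEMENT_KEYS[held[held.length - 1]] : "idle";
  return {
    // Call on keydown; returns the movement value to send.
    press(key: string): Movement {
      // Key-repeat events re-press a key that is already tracked; ignore those.
      if (isMovementKey(key) && held.indexOf(key) === -1) held.push(key);
      return current();
    },
    // Call on keyup; falls back to the most recent key still held.
    release(key: string): Movement {
      const index = held.indexOf(key);
      if (index !== -1) held.splice(index, 1);
      return current();
    },
  };
}
```

In the keyboard effect, onKeyDown would call sendMovement(keys.press(k)) and onKeyUp would call sendMovement(keys.release(k)), with keys created once per mount. The same shape works for the two look axes.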

The rotation-speed slider is the exception to everything above. It’s a persistent scalar, not a keyed axis, so there’s no “release”; the user sets a value and that value stays. The slider reads straight from the snapshot:

<input
  type="range"
  min={0}
  max={30}
  step={0.5}
  value={snapshot.rotation_speed_deg}
  onChange={(e) => setRotationSpeedDeg({ rotation_speed_deg: Number(e.target.value) })}
/>

Setting rotation_speed_deg to 0 disables look-axis rotation entirely, even with look_h or look_v non-idle. This is the lever to expose if you want to detune look responsiveness for a particular scene.
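
If the rotation speed can come from anywhere other than the range input (a preset, a URL parameter, a gamepad trigger), it may be worth normalizing it first. The sketch below clamps and snaps to the slider's range of 0 to 30 in steps of 0.5. Those bounds are the example UI's, not necessarily the model's accepted limits, and normalizeRotationSpeed is a hypothetical helper.

```typescript
const ROTATION_MIN = 0;
const ROTATION_MAX = 30;
const ROTATION_STEP = 0.5;

// Clamps to the slider range and snaps to the nearest step.
// Non-finite input (NaN, Infinity) falls back to the minimum.
function normalizeRotationSpeed(raw: number): number {
  if (!isFinite(raw)) return ROTATION_MIN;
  const clamped = Math.min(ROTATION_MAX, Math.max(ROTATION_MIN, raw));
  return Math.round(clamped / ROTATION_STEP) * ROTATION_STEP;
}
```

The result can be passed straight to setRotationSpeedDeg({ rotation_speed_deg: value }).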

Snapping a clip

The SDK ships recording primitives so you don’t have to wire up MediaRecorder yourself. The example’s SnapClip.tsx captures the last 10 seconds of the live stream and opens a modal with the SDK’s built-in preview player and a download button.

app/components/SnapClip.tsx

import {
  ClipDownloadButton,
  ClipPlayer,
  RecordingError,
  useReactor,
  type Clip,
} from "@reactor-team/js-sdk";

const { status, reactor } = useReactor((s) => ({
  status: s.status,
  reactor: s.internal.reactor,
}));
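const [clip, setClip] = useState<Clip | null>(null);
const [snapError, setSnapError] = useState<RecordingError | null>(null);

// durationSeconds, getJwt, filename, and Modal are defined elsewhere in
// the component and are elided from this excerpt.
async function snap() {
  setSnapError(null);
  try {
    setClip(await reactor.requestClip(durationSeconds));
  } catch (e) {
    // Recording failures carry a typed code and reason; anything else
    // is unexpected and is rethrown.
    if (e instanceof RecordingError) setSnapError(e);
    else throw e;
  }
}

return (
  <>
    <button onClick={snap}>Snap last {durationSeconds}s</button>
    {snapError && (
      <p>
        {snapError.code}: {snapError.reason}
      </p>
    )}
    {clip && (
      <Modal onClose={() => setClip(null)}>
        <ClipPlayer clip={clip} getJwt={getJwt} />
        <ClipDownloadButton clip={clip} getJwt={getJwt} filename={filename} />
      </Modal>
    )}
  </>
);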

Note that the imports come from @reactor-team/js-sdk, not @reactor-models/lingbot. Recording is a base-SDK feature: it works identically for every Reactor model, and the typed model packages don’t re-export the recording surface. Direct base-SDK imports are therefore idiomatic in this one place, and you can drop the file into any other Reactor example unchanged. The Helios tutorial uses the same file.

reactor.requestClip(durationSeconds) is the whole capture API. It returns a Clip value that you hand to <ClipPlayer> to preview and <ClipDownloadButton> to save. The getJwt prop is a resolver those components call when they need an auth token to fetch the clip. The example reuses the same cached /api/reactor/token route from Authentication, so repeat captures don’t trigger new token mints.

Errors come back as a RecordingError with a typed code and reason, distinct from the command_error events covered next.

Clip preview in Chromium and Firefox requires hls.js, already in the example’s package.json. See Recordings for the full feature page, including continuous recording, programmatic capture, and retention policies.
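
For readers building their own getJwt resolver, here is one way a client-side cache could look. It is a sketch under stated assumptions, not the example's implementation: the Token shape (a jwt string plus an absolute expiry) and the createJwtCache helper are hypothetical, and the real /api/reactor/token response may differ.

```typescript
// Hypothetical token shape; adapt to whatever your token route returns.
interface Token {
  jwt: string;
  expiresAtMs: number; // absolute expiry, epoch milliseconds
}

// Wraps a token fetcher so repeated calls reuse a still-valid JWT and
// concurrent calls share a single in-flight request.
function createJwtCache(
  fetchToken: () => Promise<Token>,
  now: () => number = Date.now,
  refreshMarginMs = 30_000,
): () => Promise<string> {
  let cached: Token | null = null;
  let inFlight: Promise<Token> | null = null;

  return async function getJwt(): Promise<string> {
    // Reuse the cached token unless it expires within the refresh margin.
    if (cached && cached.expiresAtMs - now() > refreshMarginMs) {
      return cached.jwt;
    }
    if (!inFlight) {
      inFlight = fetchToken().then(
        (token) => {
          cached = token;
          inFlight = null;
          return token;
        },
        (err) => {
          inFlight = null; // allow a retry on the next call
          throw err;
        },
      );
    }
    return (await inFlight).jwt;
  };
}
```

The fetcher you pass in would call your own token route and map its response to the Token shape; the resulting function can be handed to the getJwt prop of both clip components.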

Surfacing command_error

Every LingBot command can fail a precondition check (most commonly start before both setImage and setPrompt have landed). The example never lets these fail silently.

app/components/CommandError.tsx

const [error, setError] = useState<{ command: string; reason: string } | null>(null);

useLingbotCommandError((msg) => {
  setError({ command: msg.command, reason: msg.reason });
});

// Clear on the next state snapshot. Any state change implies the user
// has moved on from whatever triggered the error.
useLingbotState(() => {
  setError(null);
});

if (!error) return null;

return (
  <div>
    <span>{error.command} failed</span>
    <p>{error.reason}</p>
  </div>
);

useLingbotCommandError is the typed wrapper for the command_error message: it fires when LingBot rejects a command, carrying the failing command name and a human-readable reason. The component sits in the sidebar, renders nothing until an error arrives, and clears itself on the next state snapshot so stale banners can’t pile up.

A few LingBot-specific failure modes are worth knowing about:

start before conditions are set. The model rejects start unless both a prompt AND a reference image have been registered. The setup-phase UI (ScenePicker, CustomStart) prevents this in practice by disabling the Start button on !snapshot.has_prompt || !snapshot.has_image, but a programmatic start from elsewhere surfaces as command_error.
setImage during generation is a silent no-op. Unlike start, sending setImage mid-run does not emit command_error; the seed image is locked once a session starts and the new image is just dropped. If you want a “swap reference image” affordance, it has to be a Setup-phase control gated on !snapshot.started, or you have to call reset() first.
setPrompt during generation is fine. It’s not an error path at all. The new prompt takes effect at the next chunk boundary. Useful baseline when distinguishing “rejected” from “applied later.”
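
The three cases above can be summarized as a small lookup, which is handy for tests or for deciding which controls to show. This is a sketch that encodes only the behaviors described in this section; predictOutcome and its types are hypothetical, and cases this section doesn't describe (such as start on an already running session) are not modeled.

```typescript
type Command = "start" | "setImage" | "setPrompt";

// What the caller should expect after sending the command.
type Outcome = "ok" | "command_error" | "dropped" | "next_chunk";

interface Snapshot {
  has_image: boolean;
  has_prompt: boolean;
  started: boolean;
}

function predictOutcome(command: Command, s: Snapshot): Outcome {
  switch (command) {
    case "start":
      // Rejected unless both a prompt and a reference image are registered.
      return s.has_image && s.has_prompt ? "ok" : "command_error";
    case "setImage":
      // The seed image is locked once a session starts; no error is emitted.
      return s.started ? "dropped" : "ok";
    case "setPrompt":
      // Accepted mid-run; takes effect at the next chunk boundary.
      return s.started ? "next_chunk" : "ok";
  }
}
```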

command_error is one of several typed messages LingBot emits. See the Messages table for the full list, including generation_started, generation_complete, and the per-chunk chunk_complete event.

What’s intentionally left out

The demo covers the launch + steer + capture loop. Several LingBot features are deliberately out of scope:

Mid-stream prompt swap: useLingbot().setPrompt({ prompt }) during the live phase. The reference image stays locked; the new prompt picks up at the next chunk boundary.
Reproducible runs: setSeed before start.
Movement-aware prompt schedule: react to useLingbotChunkComplete and fire setPrompt when a target chunk fires. LingBot has no native chunk schedule like Helios’s schedule_prompt.
Gamepad input: same shape as the keyboard handler; press = direction, release = idle.
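
As an illustration of the prompt-schedule item, a client-side schedule could look like the sketch below. createPromptSchedule and ScheduledPrompt are hypothetical names, and the assumption that the chunk-complete callback can supply a chunk index is not confirmed by this tutorial; check the Messages table for the actual payload.

```typescript
interface ScheduledPrompt {
  atChunk: number; // fire once this chunk (or a later one) has completed
  prompt: string;
}

// Returns a handler to call with the index of each completed chunk.
// Due prompts are consumed in order; if several are due at once, only the
// latest is sent, since earlier ones would be overwritten anyway.
function createPromptSchedule(
  entries: ScheduledPrompt[],
  setPrompt: (args: { prompt: string }) => void,
): (completedChunk: number) => void {
  const pending = [...entries].sort((a, b) => a.atChunk - b.atChunk);
  return (completedChunk) => {
    let latest: ScheduledPrompt | undefined;
    while (pending.length > 0 && pending[0].atChunk <= completedChunk) {
      latest = pending.shift();
    }
    if (latest) setPrompt({ prompt: latest.prompt });
  };
}
```

The returned handler would be called from a useLingbotChunkComplete callback, with setPrompt taken from useLingbot(). Because setPrompt applies at the next chunk boundary, schedule each entry one chunk ahead of where the change should become visible.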

For the full design rationale and the patterns to follow when adding any of the above, read skill/SKILL.md in the example repo.