bg-worker: offloading heavy I/O
A 60-second video request
The child says, “Make me a short video of Ultraman fighting a monster.”
The agent thinks for 2 seconds and decides to call fal-cli text-to-video. Fal’s average generation time is 30–60 seconds, with long ones reaching 120s.
If we wait synchronously:
```
[T+0s]  User message enters agent
[T+2s]  agent thinking done, decides to call fal-cli
[T+2s]  fal-cli subprocess starts
[T+62s] fal returns video URL
[T+62s] agent gets URL, starts responding "OK, here's what I made..."
[T+64s] User finally sees the first character
```

A 60-second typing indicator. In K12 chat this is a disaster — the kid’s attention is gone. They switched to YouTube, came back, and forgot what they asked.
Worse: the Claude Agent SDK’s default turn timeout is 5 minutes. A single 60s tool call isn’t fatal, but if the agent decides to chain three generations in one turn, you blow past the limit.
This is a problem every agent harness eventually hits: the main turn must stay responsive, but the things the LLM wants to do regularly outlast that responsiveness window.
The SDK won’t save you
We use the Claude Agent SDK as our runtime. The SDK’s tool invocation is cooperative:
```
PreToolUse(tool, input) → tool.run() → PostToolUse(tool, result)
                              ↑ blocks the whole turn until return
```

tool.run() is awaited synchronously. The SDK has no fire-and-forget protocol — no “dispatch this and end the turn” semantics.
So this has to live in a layer above the SDK.
A three-stage async pattern
Our pattern is submit_job + bg-worker + claude -p:
```
┌────────────────────────────────────────────────────────────┐
│ Main agent turn                                            │
│   ↓ tool call                                              │
│ submit_job(kind="generate-video", params={...})            │
│   ↓ returns batch_id immediately (< 50ms)                  │
│ end turn, agent says "going to make it, a few minutes"     │
└────────────────────────────────────────────────────────────┘
      ↓ (writes one bg_jobs row + one system message into SQLite)
┌────────────────────────────────────────────────────────────┐
│ bg-worker (Go process, independent daemon)                 │
│ Polls bg_jobs WHERE status='pending' every 1s              │
│ Picks job → claim (status=running) → fork subprocess       │
└────────────────────────────────────────────────────────────┘
      ↓
┌────────────────────────────────────────────────────────────┐
│ claude -p (ephemeral agent process running a skill)        │
│ Reads spec prompt from SKILL.md                            │
│ Executes skill (fal-cli / class CLI / fal API ...)         │
│ stdout: {"status":"ok","summary":"...","video_url":...}    │
└────────────────────────────────────────────────────────────┘
      ↓ (worker parses stdout, updates bg_jobs, writes system msg)
┌────────────────────────────────────────────────────────────┐
│ Main agent's next turn                                     │
│ Polling SQLite sees [clawbox-internal:job-done] msg        │
│ Decides whether to say "all done" to the child             │
└────────────────────────────────────────────────────────────┘
```

The submit_job tool, implemented in the main agent:
```ts
tool(
  "submit_job",
  [
    "Submit a long-running task (video generation, PPT render, etc) to bg-worker.",
    "Returns batch_id immediately, doesn't block this turn.",
    "",
    "**When to use**: any tool call you expect to take > 10s. Synchronous waits",
    "will hit the 300s turn cap or make the user feel 'stuck'.",
  ].join("\n"),
  {
    kind: z.string().describe("Job kind, matches skills/<name>/job-spec.yaml"),
    params: z.record(z.string(), z.any()).describe("Job parameters"),
  },
  async ({ kind, params }) => {
    const batchId = `${ymdHMS()}-${randomBytes(4).toString("hex")}`;
    insertBgJob(kind, batchId, JSON.stringify(params ?? {}));
    insertSystemMessage(
      ctx.recipientId,
      `[clawbox-internal:job-queued]\nkind=${kind}\nbatch_id=${batchId}\n\nQueued.`,
      `${kind}-watcher`,
    );
    return {
      content: [{ type: "text", text: `submitted kind=${kind} batch_id=${batchId}` }],
    };
  },
),
```

The batch_id format is YYYYMMDDHHMMSS-<4 hex> — human-readable, sortable, and unique across processes.
The bg_jobs table (in the same SQLite the agent uses):
```sql
CREATE TABLE IF NOT EXISTS bg_jobs (
  kind        TEXT NOT NULL,
  batch_id    TEXT NOT NULL,
  params      TEXT DEFAULT '',
  status      TEXT NOT NULL,   -- 'pending' | 'running' | 'ok' | 'failed'
  created_at  INTEGER NOT NULL,
  started_at  INTEGER,
  finished_at INTEGER,
  log_path    TEXT DEFAULT '',
  summary     TEXT DEFAULT '',
  error       TEXT DEFAULT '',
  attempts    INTEGER NOT NULL DEFAULT 0,
  PRIMARY KEY (kind, batch_id)
);
```

The worker is a Go binary — a single-file main loop, ~80 lines of core logic. At startup it loads ~/.claude/skills/*/job-spec.yaml to learn which kind maps to which skill.
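The worker's claim step can be sketched as a state machine over that table. The in-memory store below stands in for SQLite (the real worker would claim via a conditional `UPDATE ... WHERE status='pending'` so concurrent workers never grab the same row); type and method names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// Job mirrors a bg_jobs row, trimmed to what the loop needs.
type Job struct {
	Kind, BatchID, Status string
	Attempts              int
}

// store stands in for the SQLite bg_jobs table.
type store struct {
	mu   sync.Mutex
	jobs []*Job
}

// claimNext atomically flips the first pending job to running,
// so a job is never picked up twice.
func (s *store) claimNext() *Job {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, j := range s.jobs {
		if j.Status == "pending" {
			j.Status = "running"
			j.Attempts++
			return j
		}
	}
	return nil
}

// finish records the subprocess outcome as 'ok' or 'failed'.
func (s *store) finish(j *Job, ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if ok {
		j.Status = "ok"
	} else {
		j.Status = "failed"
	}
}

func main() {
	s := &store{jobs: []*Job{{Kind: "generate-video", BatchID: "20240501093012-a1b2c3d4", Status: "pending"}}}
	if j := s.claimNext(); j != nil {
		// here the real worker forks `claude -p` and parses its stdout
		s.finish(j, true)
		fmt.Println(j.Kind, j.Status) // generate-video ok
	}
}
```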
Why claude -p subprocess, not in-process
We initially considered importing the Claude Agent SDK directly inside the worker to run skills in-process. We rejected it.
Reasons:
1. Context isolation. The main agent has accumulated 6000 tokens of turn history. The skill shouldn’t see any of it. claude -p is a new process, new conversation history. Clean.
2. Failure isolation. Skill crashes (OOM, network, API errors) don’t affect the main agent. Subprocess exits non-zero, worker marks the job failed, main agent sees [clawbox-internal:job-failed] system message next turn, decides how to explain to the user.
3. Independent versioning. Each skill is a self-contained directory with a SKILL.md. Changing one skill doesn’t require redeploying agent-engine.
4. Resource bounds. Subprocesses get ulimits, CPU pinning, scheduling priority. The main agent is user-facing and must stay low-latency; skills are background and can afford slow.
The cost: each skill invocation is a fresh Claude API call (context starts from zero). Our measurements:
| Metric | Value |
|---|---|
| Avg input tokens per skill call | ~3,000–8,000 (SKILL.md + params) |
| Cost per skill call (MiniMax) | TODO: measure — estimate $0.001–$0.01 |
| Skill process cold start | ~500ms (node + SDK import) |
500ms cold-start in an async context is not a problem. Saving it via in-process at the cost of the four isolation properties above isn’t worth it.
Three skills currently in use
- fill-plan-item: fill a learning-plan placeholder card. Read the student’s atlas → decide difficulty → call the class CLI to create a course/practice → return a content_ref. Typical: 60–120s.
- analyze-upload: analyze a user-uploaded image/PDF/video. Call vision/OCR → output a summary + learning hooks. Typical: 5–30s.
- generate-video: call fal-cli with fal-ai/wan/v2.2-5b/text-to-video to generate a short clip. Typical: 30–90s.
Each skill is a directory under ~/.claude/skills/<name>/:
```
~/.claude/skills/generate-video/
├── SKILL.md       # spec prompt ("You are the generate-video skill...")
├── job-spec.yaml  # kind, timeout, retries, schema
└── (no code — behavior is entirely SKILL.md prompt + built-in tools)
```

Skills aren’t written in Python or Go — they’re agent behaviors described in prompt. When claude -p starts, it becomes an ephemeral agent that follows SKILL.md’s spec.
This is deliberate: skills are tunable the same way the main agent is. We avoid introducing a second programming model.
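For concreteness, a job-spec.yaml might look like the fragment below. The text above only names the fields kind, timeout, retries, and schema; everything else (field shapes, values, the params listed under schema) is an illustrative guess, not the actual file:

```yaml
# ~/.claude/skills/generate-video/job-spec.yaml  (illustrative sketch)
kind: generate-video
timeout: 300        # seconds before the worker kills the subprocess (guessed value)
retries: 1          # re-queue once on failure (guessed value)
schema:             # params the agent passes via submit_job (guessed shape)
  prompt: string
  duration_s: integer
```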
A UX mismatch
Async, async, async — but the user doesn’t know “async.” From the user’s perspective:
- They say “make me a video”
- agent says “OK, a few minutes”
- 30s, 60s, 90s pass
- agent suddenly pops up “done, here’s [video link]”
In between, the user has no idea what’s happening. They might switch to something else, get anxious waiting, or repeatedly ask “is it done yet?”
Our current responses:
- The agent’s reply on the submit_job turn must give a time estimate (“about a minute”).
- When the worker finishes, it injects a system message; the agent continues the topic naturally next turn, rather than as an explicit “notification.”
- When the user repeatedly asks “is it done?”, the agent can query bg_jobs (via another tool) and report progress.
What’s still unresolved:
- Progress bars. Video generation isn’t 0/1 — it’s 0%, 30%, 80%, done. We don’t pipe fal-cli’s progress signals up to the agent layer. TODO: write intermediate state to bg_jobs.summary.
- Cancellation. “Never mind, don’t bother.” The agent doesn’t cancel running jobs; the worker finishes and only then learns it was wasted effort.
- Multi-task choreography. The agent can submit three jobs in one turn with no described ordering; the worker runs them in parallel.
Directions we’re still thinking about
SDK-native fire-and-forget tools. If the SDK let us mark a tool as async: true — PreToolUse fires and returns a promise immediately, PostToolUse fires when the promise resolves — half of bg-worker disappears. We’ve sent feedback to Anthropic on this.
Worker observability hung off the main agent trace. Currently, a skill running in the worker is an independent Langfuse trace (generate-video-<batch_id>), unconnected to the main agent’s turn trace. To see the full chain you join manually. TODO: pass TRACEPARENT when worker spawns claude -p to nest skill trace under turn trace.
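That TODO could be as small as one env var at spawn time. A sketch, with the caveat that we haven’t verified the claude -p / Langfuse stack actually reads TRACEPARENT — the variable name follows the W3C Trace Context convention, and the nesting behavior is an assumption:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// spawnSkill builds the `claude -p` command with a W3C traceparent
// (version-traceid-spanid-flags) in its environment, so the skill's
// trace could nest under the main agent's turn trace.
func spawnSkill(prompt, traceparent string) *exec.Cmd {
	cmd := exec.Command("claude", "-p", prompt)
	cmd.Env = append(os.Environ(), "TRACEPARENT="+traceparent)
	return cmd // caller runs it; nothing is executed yet
}

func main() {
	cmd := spawnSkill("run generate-video", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(cmd.Env[len(cmd.Env)-1])
}
```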
A warm skill runner pool. claude -p cold-starts in ~500ms. A few resident worker processes could hot-start in ~50ms. But resident processes accumulate state — they’d need a reset between jobs. Under evaluation.
A look back
The submit_job pattern is the boundary discipline of I-2 extended to the agent behavior layer: long-running operations don’t belong in the main turn — push them out, keep the main turn light.
Every time we see a >10s synchronous tool call in main-agent code, the first reaction should be “can this be submit_job?” Most of the time, yes.
Related:
- Why the SDK is cooperative: I-1 Why Claude Agent SDK is our agent runtime
- How the agent knows the skill finished — alarm + system message: II-1 The overhear companion
- How worker traces tie into the observability layer: III-2 Langfuse trace as behavior replay