
Why Claude Agent SDK is our agent runtime

If you lay out all the engineering tasks that go into “let an LLM run a tool loop,” you get a boring but long list:

  • Send messages to the model, handle streaming responses
  • Parse tool calls, invoke matching tools, push results back into conversation history
  • Handle tool-call hooks (pre / post), errors, timeouts, cancellation
  • Maintain turn state, compress the context window, manage multi-turn dialog
  • Integrate with MCP servers
  • Surface model / tool / usage data to the observability layer
  • Abstract over different model providers

That’s what “agent runtime” does. Every team building an agent product has to solve this set of problems. The real question is whether you should solve them yourself.
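The loop at the core of that list is small enough to sketch before arguing you shouldn't own it. Below is an illustrative version with the model behind an injected stub; every name here (`runToolLoop`, `ToolCall`, `callModel`) is ours for illustration, not any SDK's API. The hard part is everything else on the list.

```typescript
// A minimal tool loop, with the model behind an injected function so the
// plumbing is visible. All names are illustrative, not SDK API.
type ToolCall = { name: string; input: unknown };
type ModelReply = { text: string; toolCalls: ToolCall[] };
type Tool = (input: unknown) => Promise<string>;

async function runToolLoop(
  callModel: (history: string[]) => Promise<ModelReply>,
  tools: Record<string, Tool>,
  userMessage: string,
  maxTurns = 10,
): Promise<string> {
  const history = [`user: ${userMessage}`];
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await callModel(history);
    history.push(`assistant: ${reply.text}`);
    if (reply.toolCalls.length === 0) return reply.text; // plain answer: done
    for (const call of reply.toolCalls) {
      const tool = tools[call.name];
      const result = tool
        ? await tool(call.input)
        : `error: unknown tool ${call.name}`;
      history.push(`tool(${call.name}): ${result}`); // feed result back in
    }
  }
  throw new Error("max turns exceeded");
}
```

Everything the list adds (streaming, hooks, timeouts, context compression, MCP, observability, provider abstraction) wraps around this dozen-line core.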

The first runtime we shipped was called OpenClaw. A hand-written gateway — Node inside the container, HTTP for messages, calling Anthropic’s API, handling the tool-call loop, streaming, and retries ourselves.

It worked. But there was a maintenance tax every week:

  • Anthropic changed the streaming protocol (partial messages → typed events). One week of catch-up.
  • Tool-call input schema migrated from JSON Schema to a Pydantic-like shape. Another week.
  • Adding vision input, thinking blocks, prompt cache markers — each one a chase.

The deeper problem: we were maintaining something that duplicated Anthropic’s internal SDK. Anthropic clearly had a full-fidelity client library internally; it was only a matter of time before they published it.

That day’s commit 086a7e91:

refactor(infra): remove OpenClaw, use Claude Agent SDK directly

Net delete: 359 lines of OpenClaw gateway code. Replaced with @anthropic-ai/claude-agent-sdk, imported directly into the agent-engine process.

The argument was simple: by then the Claude Agent SDK had stabilized. Streaming, tool hooks (PreToolUse / PostToolUse), MCP integration, and usage stats were all officially maintained. When any of those protocols change, we upgrade the SDK. It’s no longer our bug.

After the switch, agent-engine is no longer “the runtime.” It’s “a thin wrapper on top of the runtime.” Around 430 lines of TypeScript total; the entry point is a polling loop:

```typescript
// packages/agent-engine/src/index.ts (simplified)
async function processMessages() {
  const messages = getNewMessages(1); // FIFO from SQLite
  if (messages.length === 0) return;
  const msg = messages[0];
  const composed = await composeTurn({
    userMessage: msg.content,
    userImId: recipientId || undefined,
    runtime: { mode, triggerReason, lastUserMessageAgeMs, lastUserSnippet },
  });
  const agentResult = await Promise.race([
    runAgent(chatId, promptText, {
      cwd: CWD,
      mode,
      speakable,
      channelContext: { chatId, recipientId, groupId, sessionType, turnId, traceId },
      systemPromptAppend: composed.staticSystemPrompt,
      onToolEvent: (event) => {
        broadcastToolEvent(event);
      },
    }),
    new Promise((_, reject) =>
      setTimeout(
        () => reject(new Error(`Agent timeout after ${AGENT_TIMEOUT_MS / 1000}s`)),
        AGENT_TIMEOUT_MS,
      ),
    ),
  ]);
}
```

runAgent calls into the SDK:

```typescript
const q = query({
  prompt: promptText,
  options: {
    cwd,
    resume: sessionId,
    systemPrompt: {
      type: "preset",
      preset: "claude_code",
      excludeDynamicSections: true, // see below
      append: systemPromptAppend,
    },
    permissionMode: "bypassPermissions",
    tools: [...BUILTIN_TOOLS],
    model: process.env.AGENT_MODEL || undefined,
    mcpServers,
    includePartialMessages: true,
    hooks: { /* PreToolUse, PostToolUse */ },
  },
});
```

What agent-engine adds on top of the SDK falls into four buckets:

1. Input side: polling + context assembly

  • A 2-second SQLite polling loop, FIFO message pickup.
  • composeTurn() assembles the substrate (static prefix) + runtime context (dynamic) + user message.
  • excludeDynamicSections: true is critical: the SDK’s preset: "claude_code" automatically injects current working dir / auto-memory / git status — these change every turn and break the cache prefix before our substrate append. Setting it to true keeps the preset static so prefix caching can hit.
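The static/dynamic split that composeTurn() enforces can be sketched in a few lines (the names and contents here are illustrative, not the real implementation): the cacheable substrate is byte-identical across turns, and anything volatile goes after it, so the prefix cache keeps hitting.

```typescript
// Illustrative sketch of the composeTurn() split. Everything cacheable lives
// in a static prefix that never changes between turns; per-turn state is
// appended after it so it can't invalidate the cached prefix.
const STATIC_SUBSTRATE = [
  "You are the tutoring agent.", // stands in for the real substrate
  "House rules: ...",
].join("\n");

function composeTurnSketch(
  runtime: { mode: string; triggerReason: string },
  userMessage: string,
) {
  const dynamic = `mode=${runtime.mode} trigger=${runtime.triggerReason}`;
  return {
    staticSystemPrompt: STATIC_SUBSTRATE, // stable bytes → prefix cache hits
    promptText: `${dynamic}\n\n${userMessage}`, // volatile → after the prefix
  };
}
```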

2. Tool side: MCP server

  • Six tools are exposed to the SDK through a hand-rolled MCP server: write_to_board, openApp, showViz, localBash, submit_job, compact_session.
  • The SDK ships its own Read/Write/Edit/Bash/Glob/Grep — we use those directly.
  • PreToolUse emits a tool_use event before each call; PostToolUse emits the result.

3. Output side: streaming + lip-sync + learning panel

  • text_delta events are sliced by sentence → fed to a TTS pipeline that produces voice segments.
  • write_to_board tool calls render to the learner’s right-side blackboard (the learning panel), synced with the voice stream.
  • Each turn writes an agent_turn_events row in Postgres for admin replay.
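The sentence slicing in front of the TTS pipeline amounts to buffering deltas and emitting only on sentence boundaries. A deliberately naive sketch (the class and its boundary rule are ours; a real slicer also has to handle abbreviations, decimals, ellipses):

```typescript
// Buffers streaming text_delta chunks and emits complete sentences for TTS.
// Boundary detection is naive on purpose: one or more of . ! ? followed by
// whitespace or end of buffer.
class SentenceSlicer {
  private buf = "";

  push(delta: string): string[] {
    this.buf += delta;
    const out: string[] = [];
    const re = /[^.!?]*[.!?]+(?:\s+|$)/g;
    let consumed = 0;
    let m: RegExpExecArray | null;
    while ((m = re.exec(this.buf)) !== null) {
      out.push(m[0].trim()); // a complete sentence → hand to TTS
      consumed = re.lastIndex;
    }
    this.buf = this.buf.slice(consumed); // keep the unfinished tail
    return out;
  }

  flush(): string | null {
    const rest = this.buf.trim();
    this.buf = "";
    return rest.length ? rest : null; // end of stream: emit whatever is left
  }
}
```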

4. Observability: hand-rolled Langfuse instrumentation

  • We don’t use @arizeai/openinference-instrumentation-claude-agent-sdk — OTel v1/v2 incompatibility (detailed in III-2).
  • Three observation types: agent-turn (span) / claude-agent-llm (generation) / tool/<name> (tool).
  • PreToolUse → startToolObservation, PostToolUse → end().
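That hook-to-observation mapping is a small lifecycle tracker. A sketch with the Langfuse client stubbed behind an interface (the `Tracer` and `Observation` names are ours, not Langfuse's API): PreToolUse opens an observation keyed by the tool-use id, PostToolUse closes it with the result.

```typescript
// Sketch of the hand-rolled hook → observation lifecycle. The tracing client
// is stubbed behind an interface; only the bookkeeping is shown.
interface Observation {
  end(output: unknown): void;
}
interface Tracer {
  startToolObservation(name: string): Observation;
}

function makeToolHooks(tracer: Tracer) {
  const open = new Map<string, Observation>(); // tool-use id → open observation
  return {
    onPreToolUse(id: string, toolName: string) {
      open.set(id, tracer.startToolObservation(`tool/${toolName}`));
    },
    onPostToolUse(id: string, result: unknown) {
      open.get(id)?.end(result); // close with the tool result
      open.delete(id);
    },
  };
}
```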

That’s all of it. No LangChain, no LangGraph, no agent framework. The whole runtime fits in one person’s head.

The good:

  • Streaming protocol, tool schema, provider abstraction — all maintained by Anthropic. They ship, we upgrade. Zero engineering on our side.
  • In-process import. No HTTP / IPC / gRPC layer.
  • A new engineer can read the agent-engine source in a day.

The bad:

  • We’re bound to the Anthropic API shape. Our primary model is MiniMax-M2.7 (more in III-2), served through MiniMax’s https://api.minimaxi.com/anthropic compatibility shim. That shim is a single point of risk — if MiniMax changes the protocol, or Anthropic ships a breaking SDK update, we break.
  • The SDK is closed source. We can’t fork it. Bugs are either worked around or waited on.
  • Performance ceiling moves with the SDK. They haven’t pushed streaming below 50ms yet, so neither have we.

Still unresolved:

  • Truly long tool calls. SDK tool execution is cooperative (PreToolUse → run → PostToolUse). A >1-minute call blocks the turn. We bolted bg-worker onto the side, but the SDK protocol itself should support fire-and-forget tools.
  • Multi-agent. The SDK only knows about a single agent. When we want to spawn a sub-agent inside a turn to run a skill (see I-3), we shell out to a claude -p CLI process instead of spawning inside the SDK. Two interfaces, two abstractions.
  • Local-model fallback. ANTHROPIC_BASE_URL is a single choice — MiniMax or Anthropic. No runtime fallback. Resilience is entirely a deploy-time decision.
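The bg-worker workaround for the first item looks like this in miniature (illustrative, not the actual bg-worker code): the tool handler returns a job id immediately, so the cooperative PreToolUse → run → PostToolUse cycle stays fast, and the real work finishes out-of-band.

```typescript
// Miniature of the fire-and-forget pattern for long tool calls. submit_job's
// handler returns right away with a job id; the work continues outside the
// turn and is polled later. Error handling elided.
type Job = { id: string; status: "running" | "done"; result?: string };
const jobs = new Map<string, Job>();
let nextId = 0;

function submitJob(work: () => Promise<string>): string {
  const id = `job-${++nextId}`;
  const job: Job = { id, status: "running" };
  jobs.set(id, job);
  work().then((result) => {
    // deliberately not awaited: the tool call has already returned
    job.status = "done";
    job.result = result;
  });
  return id; // the tool result the model sees is just the id
}

function pollJob(id: string): Job | undefined {
  return jobs.get(id);
}
```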

Every time the SDK ships an update, we ask the same question: can we delete more code from agent-engine?

If the answer is no, the SDK isn’t growing. If the answer is yes, the SDK has absorbed something we previously had to backstop — and that’s what we want.

The ideal end state: agent-engine collapses to 50 lines of glue. The whole “agent runtime” concept disappears behind the SDK. The thinner the harness, the better.

We’re far from that day. But every line of OpenClaw-era residue we delete is a step in that direction.

