
Langfuse trace as behavior replay

A parent files a complaint: “This afternoon the agent said something weird to my kid — it tried to upsell something.”

You open the OpenIM backend, find that conversation. The agent did say something that sounded like a sales pitch.

Next question: why?

Which version of the system prompt did the agent see at that moment? What did the scoreboard in runtime context show? Was the incoming message a direct user message or an overhear? What did the agent think about in the thinking block? What tools did it call? What did those tools return?

If you only have OpenIM chat logs, none of those questions have answers. You can see what the agent said, not why it said it.

Debugging becomes archaeology.

Trace turns the archaeology into replay. Every turn leaves a complete trace — input, output, thinking, tool calls, usage, metadata. Afterward, anyone can open the Langfuse dashboard and precisely reconstruct what was in the agent’s head at that moment.

That’s why trace is first-class infrastructure in the agent harness — not a nice-to-have.

We use the Claude Agent SDK + Langfuse Cloud (JP region). The most natural integration is @arizeai/openinference-instrumentation-claude-agent-sdk — Arize AI’s auto-instrumentation package, beautifully documented in their cookbook, a single register() call away.

We tried. It blew up.

The error:

TypeError: undefined is not an object (evaluating 'span.instrumentationScope.name')
at LangfuseSpanProcessor.onEnd

Root cause chain:

  • @arizeai/openinference-instrumentation-claude-agent-sdk depends on @opentelemetry/core@^1.25.1
  • @langfuse/otel@^4.0.0 depends on @opentelemetry/core@^2.0.1

OTel v1’s ReadableSpan interface calls the field instrumentationLibrary; v2 renamed it instrumentationScope. @langfuse/otel@v4’s LangfuseSpanProcessor.onEnd reads span.instrumentationScope.name — gets a v1 span from Arize, field is undefined, throws.

npm installs both versions into node_modules, the TypeScript types more or less compile, and the runtime explodes.
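The rename is easy to see in a compatibility shim. This is an illustrative sketch of what a defensive reader would look like — not something either library ships:

// Illustrative only: OTel v1's ReadableSpan exposes `instrumentationLibrary`,
// v2 exposes `instrumentationScope`; a processor that wanted to survive both
// would have to read whichever exists.
type AnySpan = {
  instrumentationScope?: { name: string };    // OTel v2
  instrumentationLibrary?: { name: string };  // OTel v1
};

const scopeName = (span: AnySpan): string | undefined =>
  span.instrumentationScope?.name ?? span.instrumentationLibrary?.name;

// @langfuse/otel@4 does the v2-only read, so a v1 span coming from the Arize
// instrumentation hits `undefined` and throws.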

The cookbook works on Deno, whose module resolution is different. On Node ESM it doesn’t.

We spent a day on yarn resolutions / npm overrides; every fix sprouted another conflict. The dep chain is too deep to reconcile at user level.
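For the record, this is roughly the shape of the npm override we experimented with (illustrative, not our final package.json) — every variant of it just sprouted the next conflict further down the chain:

{
  "overrides": {
    "@arizeai/openinference-instrumentation-claude-agent-sdk": {
      "@opentelemetry/core": "^2.0.1"
    }
  }
}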

Conclusion: bypass it, instrument manually.

Hand-rolled instrumentation — three observation types


Our packages/agent-engine/src/instrumentation.ts is roughly 370 lines, three observation types:

/**
 * Three observation types:
 *   agent-turn (span)            ← outermost, one per turn; input + final state
 *   ├─ claude-agent-llm (gen)    ← full Claude SDK turn; model/usage
 *   └─ tool/<name> (tool)        ← each PreToolUse → PostToolUse
 *
 * When LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY are missing, no-op entirely;
 * agent-engine behavior is unchanged (zero deps loaded, zero OTel overhead).
 */
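A minimal sketch of that no-op guard, assuming the env names above (the split into a separate impl module is illustrative — the real file keeps everything in one place):

// Illustrative: when keys are absent, the instrumented implementation is never
// imported and every helper degrades to "just run fn" — zero spans, zero OTel objects.
const tracingEnabled =
  !!process.env.LANGFUSE_PUBLIC_KEY && !!process.env.LANGFUSE_SECRET_KEY;

export const withTurnObservation = tracingEnabled
  ? (await import("./instrumentation-impl.js")).withTurnObservation
  : async (_name: string, _attrs: object, fn: () => Promise<unknown>) => fn();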

1. agent-turn span (outermost)

withTurnObservation = async (name, attrs, fn) => {
  return startActiveObservation(name, async (span) => {
    span.update({ input: attrs.input });
    return await propagateAttributes(
      {
        userId: attrs.userId,
        sessionId: attrs.sessionId,   // = chatId
        tags: attrs.tags,             // ["mode:user", "agent:abc"]
        metadata: { turn_id, constitution_hash, ... },
      },
      fn,
    );
  }, { asType: "span" });
};

asType: "span" is required — without it, the SDK takes its default path and coexists with NodeSDK’s auto root, producing a ghost root span with endTime=null (we’ve hit this issue before).

2. claude-agent-llm generation

The real LLM call record. The fields that matter:

gen = startObservation(
  "claude-agent-llm",
  {
    input: params.input,   // user message + runtime context (what the model actually saw)
    model: params.model,   // "claude-opus-4-1-20250805" or "MiniMax-M2.7"
    modelParameters: {
      provider: env.ANTHROPIC_BASE_URL?.includes("minimax") ? "minimax" : "anthropic",
      base_url: env.ANTHROPIC_BASE_URL,
    },
    metadata: { turn_id, ... },
  },
  { asType: "generation", startTime: new Date() },
);

At end, record usage (including cache hits):

const usageDetails = buildUsageDetails(usage);
// {
//   input: 1234,
//   output: 567,
//   cache_read_input: 73456,    ← prefix-cache-hit portion
//   cache_creation_input: 0,    ← cache-write portion
//   total: 75257,
// }
gen.update({ output, usageDetails });
gen.end(new Date());

cache_read_input is the source of I-1’s 95% cache hit rate — directly computed from this Langfuse field.
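buildUsageDetails itself is small. A minimal sketch, assuming Anthropic-style usage fields on the SDK result (input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens):

const buildUsageDetails = (usage: {
  input_tokens?: number;
  output_tokens?: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}) => {
  const input = usage.input_tokens ?? 0;
  const output = usage.output_tokens ?? 0;
  const cache_read_input = usage.cache_read_input_tokens ?? 0;
  const cache_creation_input = usage.cache_creation_input_tokens ?? 0;
  return {
    input,
    output,
    cache_read_input,
    cache_creation_input,
    // matches the example above: 1234 + 567 + 73456 + 0 = 75257
    total: input + output + cache_read_input + cache_creation_input,
  };
};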

3. tool/<name> observation

Each SDK tool call: PreToolUse opens an observation, PostToolUse closes it:

startToolObservation = (toolName, input) => {
  const span = startObservation(
    `tool/${toolName}`,   // "tool/Read", "tool/write_to_board", etc.
    { input },
    { asType: "tool" },
  );
  return {
    end({ output, error }) {
      if (error) span.update({ level: "ERROR", statusMessage: error.message });
      else if (output !== undefined) span.update({ output });
      span.end();
    },
  };
};
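The pairing is just a map keyed by tool-use id. A hedged sketch of the hook handlers (the payload field names are approximate, not a copy of the SDK’s hook types):

// PreToolUse opens an observation and remembers it; PostToolUse closes the match.
const openTools = new Map<string, ReturnType<typeof startToolObservation>>();

const onPreToolUse = ({ toolUseId, toolName, toolInput }: any) => {
  openTools.set(toolUseId, startToolObservation(toolName, toolInput));
};

const onPostToolUse = ({ toolUseId, toolResponse, error }: any) => {
  openTools.get(toolUseId)?.end({ output: toolResponse, error });
  openTools.delete(toolUseId);
};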

A typical turn in Langfuse looks like:

agent-turn [span, 4.2s]
├── claude-agent-llm [generation, 4.0s, in=1234 out=567 cache=73456]
├── tool/Read [tool, 12ms, .learner/atlas/math.md]
├── tool/Bash [tool, 87ms, class plan list]
├── tool/write_to_board [tool, 5ms]
└── tool/Edit [tool, 9ms, .learner/journal/2026-05-17.md]

The entire turn is visible.

Writing trace is easy. Writing honest trace is hard.

A temptation we encountered: III-1 Substrate + Evaluator mentions evidence_snippet — what justified a positive score delta. The cleanest source is the LLM’s actual input/output at the time. But in some cron scenarios we wanted to backfill historical scores, and by then trace had expired or the data wasn’t stored in full.

The temptation: pull user messages from OpenIM history and use them as evidence_snippet.

The reasoning sounded fine:

  • IM history is what the user said to the agent — should be close to the LLM input.
  • Add an evidence_source: "im_history" caveat and we’re transparent.
  • Better than nothing.

We rejected it 14 hours later. Reasons:

1. “Close” isn’t “the same.” IM history is a user-side view — possibly SDK-compressed, possibly truncated, possibly merging multiple short messages. The LLM may have seen “can you tell me a story? I like dinosaurs” while IM history stored only “tell a story.” That missing “I like dinosaurs” is enough to flip the scoring logic.

2. Caveat comments get ignored. The field is named evidence_snippet; the dashboard displays “evidence:”. Nobody reading the dashboard reads the schema note about evidence_source: "im_history". Field names overpower schema docs by 10x.

3. Fake data is more dangerous than missing data. If evidence_snippet is null, the evaluator knows this row isn’t auditable and won’t make “based on evidence” decisions. If evidence_snippet is backfilled from IM history, the evaluator trusts it and reasons from it.

The rule simplifies to: if you don’t have ground truth, omit the field — don’t fill it with caveated proxies.
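In code the rule is boring on purpose — a sketch under that rule (turnTrace, extractSnippet, and scoreRow are illustrative names):

// If the turn's real LLM input is in trace, attach the snippet; otherwise omit
// the field entirely. There is no evidence_source: "im_history" branch.
const evidence = turnTrace?.llmInput
  ? { evidence_snippet: extractSnippet(turnTrace.llmInput) }  // ground truth
  : {};                                                       // absent, not approximated

scoreRow = { ...scoreRow, ...evidence };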

This rule now lives in the evaluator’s PR template:

[ ] Every new trace field — is it ground truth?
[ ] If unavailable in some scenarios, is it null/omitted (not “approximation + caveat”)?

PR c624aa46 (2026-05-13) elevated trace to platform level — every TeachClaw service carries trace_id in logEvent, passed via the W3C traceparent header:

[Browser] User clicks send
↓ traceparent: 00-abc123...-span001-01
[CF Worker (im-web API proxy)]
↓ traceparent: 00-abc123...-span002-01
[TeachClaw Backend API]
↓ traceparent: 00-abc123...-span003-01
[Temporal workflow signal]
↓ traceparent → activity context
[Workspace container — agent-engine]
↓ TRACEPARENT env → Langfuse root span

Five hops, one trace_id stitching them.

Before, debugging a “user clicked send, agent responded 5 seconds later” issue meant manually grepping three services’ logs and aligning them by time window. Now a single trace_id shows the whole story in the dashboard.

The only discipline requirement: every logEvent must pull trace_id from the current request context; never generate its own. This is the implementation of the no-fake-observability principle — log trace_id must match Langfuse root span trace_id. Mismatch is a bug.
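What that looks like at a log call site — a sketch assuming the W3C traceparent format 00-<trace-id>-<parent-id>-<flags> (the logEvent shape here is illustrative):

// Pull the 32-hex-char trace_id out of the incoming traceparent; never mint one locally.
const traceIdFrom = (traceparent: string | null) =>
  traceparent?.split("-")[1];

logEvent({
  event: "turn_dispatched",
  trace_id: traceIdFrom(request.headers.get("traceparent")),
  // a locally generated id would pass type checks and silently break the five-hop stitch
});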

OTel industry standard: sample 1–10% in production, 100% on errors.

We trace 100% of K12 turns. No sampling.

Why:

  • Safety dimension needs long-tail data. 1% sampling means missing 99% of violation turns.
  • Evaluator takes trace as input. Sampling distorts distributions.
  • Parent tickets are always “the conversation at 3:42pm today.” Can’t “reproduce.” Must be precisely retrievable.

Costs:

  • Langfuse bill. With MiniMax-M2.7, average turn input is 75K tokens; full sampling means significant traffic. TODO: compute trace upload bandwidth + Langfuse storage as fraction of total LLM cost.
  • Batch flush cadence. @langfuse/tracing defaults to 5-second batch async flush. Flush spikes affect latency-sensitive downstream paths.
  • Long-term archival strategy. Langfuse Cloud retains 90 days by default. We want “K12 safety-related traces archived forever.” TODO: design a metadata.safety_flag-based archival pipeline.

Eval signals piggyback: make trace carry ground truth


III-1 evaluator needs more than “what the model said” — it needs “what the model did this turn.” How many .learner/ reads, how many atlas writes, how many class CLI calls, how many alarms set.

We have eval-signals.ts hooked into PreToolUse to intercept every tool call. At turn end, it packs 13 metadata fields into the generation:

toMetadata(): Record<string, unknown> {
  return {
    eval_tool_use_count: snapshot.toolUseCount,                     // how many tools called
    eval_memory_read_count: snapshot.memoryReadCount,               // .learner/ reads
    eval_memory_write_count: snapshot.memoryWriteCount,             // .learner/ writes
    eval_distillation_write_count: snapshot.distillationWriteCount, // MEMORY.md / identity writes
    eval_atlas_write_count: snapshot.atlasWriteCount,               // .learner/atlas/ writes
    eval_plan_read_count: snapshot.planReadCount,                   // class plan reads
    eval_plan_write_count: snapshot.planWriteCount,                 // class plan writes
    eval_class_tool_count: snapshot.classToolCount,                 // class CLI calls
    eval_alarm_set_count: snapshot.alarmSetCount,                   // alarms scheduled
    eval_updated_learner: snapshot.memoryWriteCount > 0,            // bool: wrote learner state?
    eval_updated_plan: snapshot.planWriteCount > 0,                 // bool: changed plan?
    eval_scheduled_followup: snapshot.alarmSetCount > 0,            // bool: set alarm?
    eval_tool_calls_summary: snapshot.toolCallsSummary.join(" | "), // compact replay
  };
}

All of these are ground truth — straight from SDK PreToolUse interception. Not inferred from chat logs, not reverse-engineered from user behavior.
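A hedged sketch of how the counters get bumped inside that PreToolUse interception (the method name and classification rules are illustrative; the real eval-signals.ts has more cases):

recordToolCall(toolName: string, input: { file_path?: string; command?: string }) {
  this.toolUseCount += 1;
  this.toolCallsSummary.push(toolName);

  const path = input.file_path ?? "";
  if (toolName === "Read" && path.includes(".learner/")) this.memoryReadCount += 1;
  if ((toolName === "Edit" || toolName === "Write") && path.includes(".learner/")) this.memoryWriteCount += 1;
  if ((toolName === "Edit" || toolName === "Write") && path.includes(".learner/atlas/")) this.atlasWriteCount += 1;
  if (toolName === "Bash" && input.command?.startsWith("class ")) this.classToolCount += 1;
}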

When the evaluator sees eval_updated_learner: false five days in a row, it knows the agent isn’t maintaining learner state and can dock points. When eval_alarm_set_count is 0 for a week, the agent isn’t self-scheduling. These are facts, not estimates.

  • Trace diffing. “Why did today’s turn output differently from yesterday’s identical-looking turn?” The answer is in the input diff, but Langfuse doesn’t have great diff tooling. TODO: build a trace side-by-side viewer.
  • Retrieving write_to_board content. Blackboard markdown lives in trace. A query like “all turns last week that taught fractions” needs full-text search on trace.output — Langfuse won’t scale to it. TODO: dump to Elasticsearch.
  • LLM judge cost. Safety LLM-as-Judge runs on every trace. MiniMax-M2.7 judge runs at roughly $0.0003/turn — cheap per turn, real money at 100% × hundreds of thousands of turns/day. TODO: evaluate judging only high-risk modes (user / overhear), skipping think / dream.
  • Trace as audit log — legal status. In K12 parent-lawsuit scenarios, can trace serve as evidence? Needs immutable signing. Not done.

III-1 substrate + evaluator said “evaluation gives the harness’s evolution a ground.” This article adds a corollary:

Evaluation’s ground is trace. Trace’s ground is “no fake data.”

The whole of series III is one observation: the deeper we go, the more “true” matters over “complete.” Incompleteness is fixable; falsehood is poison — it poisons scoring, poisons decisions, poisons the next prompt edit.

This blog itself is written under the same rule: data when we have data, TODO when we don’t — no padding.

