Why this blog exists

If an agent doesn’t exist to answer one query but to live in a child’s pocket for six months — how do you build its scaffolding?

That’s the question that has pushed TeachClaw to rework its architecture, delete code, and rewrite its prompts for the past six months. We’re building K12 AI companions: every child gets an agent that chats with them, teaches lessons, generates exercises, and remembers the Ultraman episode they didn’t finish yesterday.

It isn’t a one-shot chatbot. It has to stay present.

By harness we mean the engineering scaffolding that lets an LLM operate reliably in the real world: container orchestration, context construction, prompt caching, tool invocation, observability, evaluation, and memory.

It is not prompt engineering. It is not “wire up LangChain.”

A concrete example. The static prefix of our agent’s system prompt is 74,093 bytes. If a single byte of that prefix changes between turns, because the current time, git status, or working directory was baked into it, MiniMax’s prefix cache misses. For a learner doing 50 turns a day, that’s 1,500 misses a month.

This is what the harness does: it treats input-side stability as a first-class architectural concern. Not late-stage optimization — a load-bearing wall.
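To make that concrete, here is a minimal sketch of the principle (the file name and structure are hypothetical, not our production code): everything volatile lives below the prefix boundary, so the provider’s cache sees identical leading bytes on every turn.

```python
# Sketch: keep the cacheable prefix byte-stable by pushing every
# volatile field (time, git status, cwd) into a dynamic suffix.
# File name and structure are hypothetical, not our production code.
from datetime import datetime, timezone
from pathlib import Path

# Loaded once; never varies per turn, so the provider's prefix cache
# sees identical leading bytes on every request.
STATIC_PREFIX = Path("substrate/system_prompt.md").read_bytes()

def build_system_prompt(git_status: str) -> bytes:
    # Everything volatile lives below the prefix boundary.
    dynamic = (
        "\n\n## Runtime context (uncached)\n"
        f"- now: {datetime.now(timezone.utc).isoformat()}\n"
        f"- cwd: {Path.cwd()}\n"
        f"- git: {git_status}\n"
    )
    return STATIC_PREFIX + dynamic.encode("utf-8")
```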

Similar questions:

  • What’s the right tool-call timeout? Let a turn hang 5 minutes, or kill it at 60 seconds? Where does the partial output go in the trace? (One plausible shape is sketched after this list.)
  • The container’s been sleeping for 30 minutes; the user sends a message. From IM webhook to workspace ready, how long is acceptable? Can the user perceive it?
  • The agent sees 100 messages it has already processed. How does it know which to respond to and which were already read?
  • How do we know the agent got dumber this week?

None are one-line fixes. Each one pulls on the harness’s layering.
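For the timeout question, one plausible shape, assuming async tools with a streaming interface (`tool.stream` and `trace.log` are hypothetical names standing in for whatever tool and tracing APIs you use): kill the call at a hard deadline, but treat the partial output as data and record it in the trace.

```python
# Sketch: a hard 60-second deadline for tool calls that still keeps
# the partial output. tool.stream and trace.log are hypothetical
# names standing in for whatever tool and tracing APIs you use.
import asyncio

TOOL_TIMEOUT_S = 60

async def run_tool(tool, args, trace):
    chunks: list[str] = []

    async def collect():
        async for chunk in tool.stream(args):
            chunks.append(chunk)

    try:
        await asyncio.wait_for(collect(), timeout=TOOL_TIMEOUT_S)
        trace.log(status="ok", output="".join(chunks))
    except asyncio.TimeoutError:
        # Partial output is data, not garbage: record it so a human
        # (or the evaluator) can see how far the tool got.
        trace.log(status="timeout", output="".join(chunks), partial=True)
        raise
```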

Three axes: evaluation, behavior calibration, evolution

We group what we’re working on into three axes.

How do we know the agent did the right thing?

Not “the user gave it five stars,” but translating “what makes a good agent” into executable checks.

On 2026-05-11 we ratified the Substrate + Evaluator architecture:

  • Substrate: a four-layer system prompt — constitution (the behavioral red lines, made immutable by architecture, not ACL), identity, memory, and runtime context.
  • Evaluator: seven scoring dimensions (safety / engagement / accuracy / dau / output_quality / idle_distillation / plan_update) pulled from Langfuse every 10 minutes and injected back into the agent’s own system prompt for self-reflection.
  • An anti-Goodhart backstop, enforced as a database-level CHECK constraint: any positive score delta must include evidence. If the agent wants to gain points, it has to leave a trail.

That SQL CHECK is the evaluator’s spine. Without it, “AI self-evaluation” is creative writing.
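A minimal sqlite sketch of that spine (table and column names are illustrative; our production schema differs): the database, not the prompt, rejects a positive delta that arrives without evidence.

```python
# Sketch of the anti-Goodhart spine in sqlite. Table and column
# names are illustrative; the point is that the rule is enforced
# by the database, not by the agent's goodwill.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE score_events (
        id        INTEGER PRIMARY KEY,
        dimension TEXT NOT NULL,   -- e.g. 'safety', 'engagement'
        delta     REAL NOT NULL,
        evidence  TEXT,            -- trace id, quote, file path...
        -- Any positive delta must carry evidence.
        CHECK (delta <= 0 OR (evidence IS NOT NULL AND length(evidence) > 0))
    )
""")

# Accepted: positive delta with a trail.
conn.execute(
    "INSERT INTO score_events (dimension, delta, evidence) VALUES (?, ?, ?)",
    ("engagement", 0.5, "langfuse trace abc123"),
)

# Rejected at the database level: positive delta, no evidence.
try:
    conn.execute(
        "INSERT INTO score_events (dimension, delta) VALUES (?, ?)",
        ("engagement", 0.5),
    )
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```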

How do we shape the agent’s mode of presence?

We replaced the request-response paradigm (user asks → agent answers → done) with overhear companion mode: the agent sees every message addressed to the child (teachers, parents, peers); it can set alarms to remind itself to check in; it can flip through .learner/todo.md during its heartbeat and find something useful.
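In rough strokes, the heartbeat looks more like the sketch below than like a request handler (every helper name is hypothetical; only the shape is from the paragraph above):

```python
# Sketch of the heartbeat in overhear companion mode. Every helper
# here (inbox.unread, agent.alarms.pop_due, agent.decide, agent.send)
# is a hypothetical name; only the shape is the point.
import asyncio
from pathlib import Path

HEARTBEAT_S = 30 * 60  # illustrative interval

async def heartbeat(agent, inbox):
    while True:
        overheard = inbox.unread()                   # messages addressed to the child
        todo = Path(".learner/todo.md").read_text()  # the agent's own notes
        due = agent.alarms.pop_due()                 # self-set check-in reminders

        # The model decides whether any of this warrants speaking.
        action = await agent.decide(overheard=overheard, todo=todo, alarms=due)
        if action.speak:
            await agent.send(action.message)
        # Staying silent is a valid outcome, not a failure.

        await asyncio.sleep(HEARTBEAT_S)
```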

But we stopped forcing it to speak every time.

On 2026-04-27, testing M2.7, the model — after recommending several movies in a row — wrote “(silent)” with the thinking-block reasoning “don’t want to interrupt too densely.” That’s valid social judgment. Better than a hard rule. We removed the entire “must speak every turn” rule set from substrate. It dropped from 809 lines to 565 (-30%).

The central tension in behavior calibration: when do we trust the model’s judgment, and when do we backstop with engineering? Our answer keeps shifting.

How does the harness grow up alongside the model?

Every new model generation forces a question: things we used to backstop in engineering — can we hand them back to the model now?

Recent example: IRON_LAW_5 used to force the agent to “check and maintain the alarm queue every turn.” Thirty days of data: out of 323 alarms total, 317 were [dream] seeds planted by the lifecycle workflow, and only 2 were content alarms the agent set itself. The rule was instruction thrashing. Deleted.

Every “delete a rule” moment is an evolution — the harness gets thinner. Which is good: a thinner harness leaves more room for the model’s own judgment.

In scope:

  • What we tried, what we abandoned, and what we’ve settled on for now.
  • Concrete data where we have it: cache hit rate (95% after the fix), static prefix size (74,093 bytes), turn timeout (5 min), 30 days of alarm data.
  • Where we don’t have data, we mark TODO: pending measurement. We don’t paper over the gaps.

Out of scope:

  • Product UI, mobile, visual branding — that’s another story.
  • “We hit X bug and fixed it” post-mortems — those live in GitHub Issues.

We want to discuss the attempts and the solutions we’re building, not after-the-fact success stories. Expect current judgment plus unresolved questions — not “lessons learned.”

The foreword (this post) plus seven starter articles:

| # | Title | Series |
| --- | ----- | ------ |
| 1 | Why Claude Agent SDK is our agent runtime | I. Runtime |
| 2 | Outside vs. inside the container | I. Runtime |
| 3 | bg-worker: offloading heavy I/O | I. Runtime |
| 4 | The overhear companion | II. Behavior Calibration |
| 5 | Letting the model judge silence | II. Behavior Calibration |
| 6 | Substrate + Evaluator: an agent product constitution | III. Evaluation |
| 7 | Langfuse trace as behavior replay | III. Evaluation |

Each of the three series gets at least one launch piece. Series IV (Evolution) and the remaining 8–10 articles follow — the roadmap lives there.

Z.Q. Zhang, engineering lead at TeachClaw. The blog and product source live in separate public repos — the product source at mrzch03/clawbox (the repo name reflects the internal project codename clawbox; the product itself is TeachClaw), and the blog at mrzch03/teachclaw-blog. Pull requests welcome — fix typos, push back on claims, send me counter-evidence.

The harness is changing fast. A year from now, half of what’s written here will look wrong — and the faster readers can show me which half, the better.