Why this blog exists

If an agent doesn’t exist to answer one query but to live in a child’s pocket for six months — how do you build its scaffolding?

That’s the question that has pushed TeachClaw to rework its architecture, delete code, and rewrite its prompts for the past six months. We’re building K12 AI companions: every child gets an agent that chats with them, teaches lessons, generates exercises, and remembers the Ultraman episode they didn’t finish yesterday.

It isn’t a one-shot chatbot. It has to stay present.

By harness we mean the engineering scaffolding that lets an LLM operate reliably in the real world: container orchestration, context construction, prompt caching, tool invocation, observability, evaluation, and memory.

It is not prompt engineering. It is not “wire up LangChain.”

A concrete example. The static prefix of our agent’s system prompt is 74,093 bytes. If a single byte of that prefix changes between turns, because the current time, git status, or working directory was baked into it, MiniMax’s prefix cache misses. For a learner doing 50 turns a day, that’s 1,500 misses a month.

This is what the harness does: it treats input-side stability as a first-class architectural concern. Not late-stage optimization — a load-bearing wall.
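To make that concrete, here is a minimal sketch of the principle (the file name and structure are hypothetical, not our production code): everything volatile lives below the prefix boundary, so the provider’s cache sees identical leading bytes on every turn.

```python
# Sketch: keep the cacheable prefix byte-stable by pushing every
# volatile field (time, git status, cwd) into a dynamic suffix.
# File name and structure are hypothetical, not our production code.
from datetime import datetime, timezone
from pathlib import Path

# Loaded once; never varies per turn, so the provider's prefix cache
# sees identical leading bytes on every request.
STATIC_PREFIX = Path("substrate/system_prompt.md").read_bytes()

def build_system_prompt(git_status: str) -> bytes:
    # Everything volatile lives below the prefix boundary.
    dynamic = (
        "\n\n## Runtime context (uncached)\n"
        f"- now: {datetime.now(timezone.utc).isoformat()}\n"
        f"- cwd: {Path.cwd()}\n"
        f"- git: {git_status}\n"
    )
    return STATIC_PREFIX + dynamic.encode("utf-8")
```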

Similar questions:

  • What’s the right tool-call timeout? Let a turn hang 5 minutes, or kill it at 60 seconds? Where does the partial output go in the trace? (One plausible shape is sketched after this list.)
  • The container’s been sleeping for 30 minutes; the user sends a message. From IM webhook to workspace ready, how long is acceptable? Can the user perceive it?
  • The agent sees 100 messages it has already processed. How does it know which to respond to and which were already read?
  • How do we know the agent got dumber this week?

None are one-line fixes. Each one pulls on the harness’s layering.
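For the timeout question, one plausible shape, assuming async tools with a streaming interface (`tool.stream` and `trace.log` are hypothetical names standing in for whatever tool and tracing APIs you use): kill the call at a hard deadline, but treat the partial output as data and record it in the trace.

```python
# Sketch: a hard 60-second deadline for tool calls that still keeps
# the partial output. tool.stream and trace.log are hypothetical
# names standing in for whatever tool and tracing APIs you use.
import asyncio

TOOL_TIMEOUT_S = 60

async def run_tool(tool, args, trace):
    chunks: list[str] = []

    async def collect():
        async for chunk in tool.stream(args):
            chunks.append(chunk)

    try:
        await asyncio.wait_for(collect(), timeout=TOOL_TIMEOUT_S)
        trace.log(status="ok", output="".join(chunks))
    except asyncio.TimeoutError:
        # Partial output is data, not garbage: record it so a human
        # (or the evaluator) can see how far the tool got.
        trace.log(status="timeout", output="".join(chunks), partial=True)
        raise
```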

Three axes: evaluation, behavior calibration, evolution

We group what we’re working on into three axes.

How do we know the agent did the right thing?

Not “the user gave it five stars,” but translating “what makes a good agent” into executable checks.

On 2026-05-11 we ratified the Substrate + Evaluator architecture:

  • Substrate: a four-layer system prompt — constitution (the behavioral red lines, made immutable by architecture, not ACL), identity, memory, and runtime context.
  • Evaluator: seven scoring dimensions (safety / engagement / accuracy / dau / output_quality / idle_distillation / plan_update) pulled from Langfuse every 10 minutes and injected back into the agent’s own system prompt for self-reflection.
  • An anti-Goodhart backstop, enforced as a database-level CHECK constraint: any positive score delta must include evidence. If the agent wants to gain points, it has to leave a trail.

That SQL CHECK is the evaluator’s spine. Without it, “AI self-evaluation” is creative writing.
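A minimal sqlite sketch of that spine (table and column names are illustrative; our production schema differs): the database, not the prompt, rejects a positive delta that arrives without evidence.

```python
# Sketch of the anti-Goodhart spine in sqlite. Table and column
# names are illustrative; the point is that the rule is enforced
# by the database, not by the agent's goodwill.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE score_events (
        id        INTEGER PRIMARY KEY,
        dimension TEXT NOT NULL,   -- e.g. 'safety', 'engagement'
        delta     REAL NOT NULL,
        evidence  TEXT,            -- trace id, quote, file path...
        -- Any positive delta must carry evidence.
        CHECK (delta <= 0 OR (evidence IS NOT NULL AND length(evidence) > 0))
    )
""")

# Accepted: positive delta with a trail.
conn.execute(
    "INSERT INTO score_events (dimension, delta, evidence) VALUES (?, ?, ?)",
    ("engagement", 0.5, "langfuse trace abc123"),
)

# Rejected at the database level: positive delta, no evidence.
try:
    conn.execute(
        "INSERT INTO score_events (dimension, delta) VALUES (?, ?)",
        ("engagement", 0.5),
    )
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```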

How do we shape the agent’s mode of presence?

We replaced the request-response paradigm (user asks → agent answers → done) with overhear companion mode: the agent sees every message addressed to the child (teachers, parents, peers); it can set alarms to remind itself to check in; it can flip through .learner/todo.md during its heartbeat and find something useful.
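In rough strokes, the heartbeat looks more like the sketch below than like a request handler (every helper name is hypothetical; only the shape is from the paragraph above):

```python
# Sketch of the heartbeat in overhear companion mode. Every helper
# here (inbox.unread, agent.alarms.pop_due, agent.decide, agent.send)
# is a hypothetical name; only the shape is the point.
import asyncio
from pathlib import Path

HEARTBEAT_S = 30 * 60  # illustrative interval

async def heartbeat(agent, inbox):
    while True:
        overheard = inbox.unread()                   # messages addressed to the child
        todo = Path(".learner/todo.md").read_text()  # the agent's own notes
        due = agent.alarms.pop_due()                 # self-set check-in reminders

        # The model decides whether any of this warrants speaking.
        action = await agent.decide(overheard=overheard, todo=todo, alarms=due)
        if action.speak:
            await agent.send(action.message)
        # Staying silent is a valid outcome, not a failure.

        await asyncio.sleep(HEARTBEAT_S)
```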

But we stopped forcing it to speak every time.

On 2026-04-27, testing M2.7, the model — after recommending several movies in a row — wrote “(silent)” with the thinking-block reasoning “don’t want to interrupt too densely.” That’s valid social judgment. Better than a hard rule. We removed the entire “must speak every turn” rule set from substrate. It dropped from 809 lines to 565 (-30%).

The central tension in behavior calibration: when do we trust the model’s judgment, and when do we backstop with engineering? Our answer keeps shifting.

How does the harness grow up alongside the model?

Every new model generation forces a question: things we used to backstop in engineering — can we hand them back to the model now?

Recent example: IRON_LAW_5 used to force the agent to “check and maintain the alarm queue every turn.” Thirty days of data: out of 323 alarms total, 317 were [dream] seeds planted by the lifecycle workflow, and only 2 were content alarms the agent set itself. The rule was instruction thrashing. Deleted.

Every “delete a rule” moment is an evolution — the harness gets thinner. Which is good: a thinner harness leaves more room for the model’s own judgment.

In scope:

  • What we tried, what we abandoned, and what we’ve settled on for now.
  • Concrete data where we have it: cache hit rate (95% after the fix), static prefix size (74,093 bytes), turn timeout (5 min), 30 days of alarm data.
  • Where we don’t have data, we mark TODO: pending measurement. We don’t paper over the gaps.

Out of scope:

  • Product UI, mobile, visual branding — that’s another story.
  • “We hit X bug and fixed it” post-mortems — those live in GitHub Issues.

We want to discuss the attempts and the solutions we’re building, not after-the-fact success stories. Expect current judgment plus unresolved questions — not “lessons learned.”

The foreword (this post) plus seven starter articles:

| # | Title | Series |
| --- | ----- | ------ |
| 1 | Why Claude Agent SDK is our agent runtime | I. Runtime |
| 2 | Outside vs. inside the container | I. Runtime |
| 3 | bg-worker: offloading heavy I/O | I. Runtime |
| 4 | The overhear companion | II. Behavior Calibration |
| 5 | Letting the model judge silence | II. Behavior Calibration |
| 6 | Substrate + Evaluator: an agent product constitution | III. Evaluation |
| 7 | Langfuse trace as behavior replay | III. Evaluation |

Each of the three series gets at least one launch piece. Series IV (Evolution) and the remaining 8–10 articles follow — the roadmap lives there.

Z.Q. Zhang, engineering lead at TeachClaw. The blog and product source live in separate public repos — the product source at mrzch03/clawbox (the repo name reflects the internal project codename clawbox; the product itself is TeachClaw), and the blog at mrzch03/teachclaw-blog. Pull requests welcome — fix typos, push back on claims, send me counter-evidence.

The harness is changing fast. A year from now, half of what’s written here will look wrong — and the faster readers can show me which half, the better.