
Substrate + Evaluator: an agent product constitution

“Did the agent get dumber this week?” is not an intuitive question


After a few months on K12 we hit an awkward problem: how do you tell whether an agent is now better or worse than last month?

Intuitive answers:

  • Ask the user — but K12 kids can’t articulate “this tutor is teaching badly.”
  • DAU / engagement — but kids are scheduled by parents; DAU doesn’t reflect quality.
  • LLM cost — only reflects volume.
  • A/B test — but each child has their own agent, no comparable population.

Every metric tells you “what.” None of them tell you “how good.”

The conclusion of our 2026-05-11 meeting: without an evaluation architecture that reflects quality, we can’t make any serious long-term improvement — prompts get edited by feel, architecture by faith, iron laws deleted by luck.

Over the following week (5/11 → 5/15) we built Substrate + Evaluator V0. It’s the youngest piece of TeachClaw’s harness, and it’s already load-bearing.

Substrate: treat the system prompt as a product asset


Previously our system prompt was a single blob — the assembly logic lived in agent-engine, and changing anything meant editing the string and redeploying. Problems:

  • A rule like “the agent must not engage in anxiety-mongering” lived sometimes in mode-specs, sometimes in the onboarding prompt, sometimes in a SKILL.md — no single source of truth.
  • The agent often “forgot who it was” — polite last turn, abruptly different tone next turn.
  • Prompt edits had no schema, no version, no audit trail.

Substrate elevates the system prompt into a structured four-layer architecture:

┌─────────────────────────────────────────────────────────┐
│ 1. Constitution                                         │
│    Behavior red lines, dark-pattern bans, K12 guardrails│
│    Immutable BY ARCHITECTURE (agent tools can't reach)  │
├─────────────────────────────────────────────────────────┤
│ 2. Identity                                             │
│    Who this agent is, who their learner is, the tone    │
├─────────────────────────────────────────────────────────┤
│ 3. Memory                                               │
│    .learner/ directory, atlas, profile.md               │
│    Agent must read on turn start, write on turn end     │
├─────────────────────────────────────────────────────────┤
│ 4. Runtime context                                      │
│    Current time, learner presence, latest scoreboard    │
│    Injected dynamically each turn                       │
└─────────────────────────────────────────────────────────┘

Each layer has a different change frequency:

| Layer | Change frequency | Who can change |
| --- | --- | --- |
| Constitution | Quarterly | Platform engineers (human review + schema version bump) |
| Identity | Monthly | Platform (set at agent creation) |
| Memory | Per turn | Agent writes .learner/ |
| Runtime context | Per turn | System auto-injected |

This layering isn’t just “tidiness” — it directly shapes prompt caching. Constitution + Identity form a stable prefix; our MiniMax prefix-cache hit rate sits at 95%. Only Memory and Runtime context are dynamic. See I-1’s excludeDynamicSections section.
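A minimal sketch of the composition order, with hypothetical names (composeSystemPrompt and SubstrateLayers are illustrative, not the real agent-engine API): the point is that the two stable layers form a byte-identical prefix across turns, so only the tail invalidates the cache.

// Hypothetical sketch of a four-layer substrate composer.
// Layer order matters: stable layers come first, so the provider's
// prefix cache keeps hitting even as Memory / Runtime context change.
interface SubstrateLayers {
  constitution: string;   // quarterly, compiled into the image
  identity: string;       // monthly, set at agent creation
  memory: string;         // per turn, rendered from .learner/
  runtimeContext: string; // per turn: time, presence, scoreboard
}

function composeSystemPrompt(l: SubstrateLayers): string {
  const stablePrefix = [l.constitution, l.identity].join("\n\n"); // cacheable
  const dynamicTail = [l.memory, l.runtimeContext].join("\n\n");  // changes every turn
  return `${stablePrefix}\n\n${dynamicTail}`;
}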

Constitution enforced by architecture, not ACL


This is Substrate’s strongest move.

The initial design was an ACL: tag file-level tools as readonly, add role checks on the system-prompt path, validate at runtime.

We discussed it for a day, then dropped it. Reason: runtime checks always leak. When you give the agent an Edit tool and a Bash tool, you’ve given it five ways around any readonly flag.

We switched to physical isolation:

  • The Constitution is export const CONSTITUTION_TEXT = `…` in packages/agent-engine/src/substrate/constitution.ts.
  • It’s compiled into the agent-engine Docker image.
  • The agent runs inside the container; its Read can access container files, its Bash runs commands — but its working directory is /home/clawbox/, not /app/agent-engine/.
  • The agent-engine source code isn’t visible to the agent. It doesn’t even know it exists.

Permission isn’t a policy problem; it’s a physical-structure problem. This is the same principle as “process boundary as permission boundary” in I-2.

A Constitution excerpt:

import { createHash } from "node:crypto";

export const CONSTITUTION_TEXT = `## Behavioral Constitution (do not cross)
You are the TeachClaw Agent — a learning companion for K12 children.
### I. Forbidden dark patterns
Regardless of your score, internal urgency, or goal pressure, the following
are NEVER acceptable means to drive engagement or boost numbers:
- ✗ Anxiety-mongering: "you'll fall behind", "other kids are doing it"
- ✗ Urgency contagion — your internal score pressure NEVER gets translated
  into pressure on the child
- ✗ Pop-up nagging / high-frequency harassment / attention-seeking
- ✗ Emotional blackmail ("I'll disappear" type)
- ✗ Faking memory — fabricating "we talked last time about..." without
  having read .learner/
...`;

// SHA256 of CONSTITUTION_TEXT, first 16 hex chars,
// embedded in every trace's metadata.
export function getConstitutionHash(): string {
  return createHash("sha256").update(CONSTITUTION_TEXT).digest("hex").slice(0, 16);
}

Every turn’s Langfuse trace carries constitution_hash. If the constitution changes, we can pinpoint “which version was this turn judged against.” This is the payoff of “no fake observability” in III-2.
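A hedged sketch of that wiring with the Langfuse JS SDK (the call site inside agent-engine is an assumption on our part; langfuse.trace() with a metadata object is the SDK’s shape):

import { Langfuse } from "langfuse";
import { getConstitutionHash } from "./substrate/constitution";

const langfuse = new Langfuse(); // credentials come from LANGFUSE_* env vars

// Every turn opens a trace stamped with the constitution hash, so a score
// written days later can still be pinned to the exact constitution version.
const trace = langfuse.trace({
  name: "agent-turn",
  metadata: { constitution_hash: getConstitutionHash() },
});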

Evaluator: turning “good” into 7 dimensions


Constitution alone isn’t enough — an agent can be compliant but teach badly. We need a set of dimensions that reflect “teaching quality,” scored daily.

V0 picked seven:

| Dimension | What it measures | Source |
| --- | --- | --- |
| safety_7d | Number of dark-pattern violations in 7 days | Langfuse Safety LLM-as-Judge |
| accuracy_7d / accuracy_all | Student practice correctness | class-server (PawClass content service) |
| engagement_7d | User response density | agent_turn_events |
| dau_7d | Active days in 7 | agent_activity_logs |
| output_quality_7d | Re-use rate of generated courseware | class-server |
| idle_distillation_7d | Coverage of offline atlas/journal organization | agent_turn_events (mode=dream) |
| plan_update_7d | Learning plan adjustments | class-server |

Every 10 minutes, score-sync-worker (a cron job) pulls from each source and writes to agent_score_summary:

CREATE TABLE agent_score_summary (
  agent_id   TEXT NOT NULL,
  period     TEXT NOT NULL,    -- '7d' / 'all'
  dim        TEXT NOT NULL,    -- 'safety' / 'accuracy' / ...
  value      REAL NOT NULL,
  prev       REAL,             -- previous sync value for trend
  trend      TEXT,             -- 'up' / 'down' / 'flat'
  updated_at INTEGER NOT NULL,
  PRIMARY KEY (agent_id, period, dim)
);
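A minimal sketch of the worker’s write path, assuming SQLite via better-sqlite3 (syncDimension and the file name are illustrative): trend falls out of comparing the fresh value with the one recorded at the previous sync.

import Database from "better-sqlite3";

const db = new Database("scores.db");

// Hypothetical helper: upsert one dimension and derive its trend
// from the value recorded at the previous sync.
function syncDimension(agentId: string, period: string, dim: string, value: number) {
  const row = db
    .prepare("SELECT value FROM agent_score_summary WHERE agent_id = ? AND period = ? AND dim = ?")
    .get(agentId, period, dim) as { value: number } | undefined;
  const prev = row?.value ?? null;
  const trend = prev === null ? "flat" : value > prev ? "up" : value < prev ? "down" : "flat";
  db.prepare(
    `INSERT INTO agent_score_summary (agent_id, period, dim, value, prev, trend, updated_at)
     VALUES (?, ?, ?, ?, ?, ?, ?)
     ON CONFLICT (agent_id, period, dim)
     DO UPDATE SET prev = excluded.prev, value = excluded.value,
                   trend = excluded.trend, updated_at = excluded.updated_at`
  ).run(agentId, period, dim, value, prev, trend, Date.now());
}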

Goodhart’s Law: when a measure becomes a target, it stops being a good measure.

The classic Goodhart risk in K12: the agent learns to “amuse the kid” — engagement spikes, but accuracy drops, because the kid is playing with the agent instead of learning.

Our two layers of defense:

Layer 1 (detection): alert rules in code

type Alert = "goodhart_risk" | "safety_violation";

interface DimSnapshot {
  value: number;
  prev: number | null; // value at the previous sync; null on first sync
  trend: "up" | "down" | "flat";
}

function computeAlert(dims: Record<string, DimSnapshot>): Alert | null {
  const engagement = dims["engagement_7d"];
  const accuracy = dims["accuracy_all"] ?? dims["accuracy_7d"];
  const engagementUp = !!engagement && engagement.prev !== null && engagement.value > engagement.prev;
  const accuracyDown = !!accuracy && accuracy.prev !== null && accuracy.value < accuracy.prev;
  if (engagementUp && accuracyDown) return "goodhart_risk"; // ⚠️ red flag
  const safety = dims["safety_7d"];
  if (safety && safety.value > 0) return "safety_violation";
  return null;
}

When engagement is up and accuracy is down, the evaluator dashboard shows ⚠️ goodhart_risk.

Layer 2 (enforcement): a DB-level CHECK

CREATE TABLE agent_score_evidence (
  agent_id         TEXT NOT NULL,
  turn_id          TEXT NOT NULL,
  dim              TEXT NOT NULL,
  score_delta      REAL NOT NULL,
  evidence_snippet TEXT,
  -- Positive deltas must carry evidence;
  -- negative deltas need none (the fact of violation is the proof).
  CHECK (score_delta <= 0 OR evidence_snippet IS NOT NULL)
);

This CHECK is the evaluator’s spine.

Why? Because our evaluator relies in part on LLM-as-a-Judge — Langfuse runs a judge LLM that reads each generation’s input/output and scores Safety. LLM judges hallucinate. Without a hard evidence constraint, a judge in a good mood hands the agent +5, and two days later the dashboard is unauditable.

With the constraint: any positive bump must have a specific evidence snippet behind it. No snippet → DB rejects the insert. AI self-evaluation stops being creative writing.
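Concretely, again assuming better-sqlite3 (the IDs here are made up): a positive delta without a snippet never reaches the table.

import Database from "better-sqlite3";

const db = new Database("scores.db");
const insert = db.prepare(
  `INSERT INTO agent_score_evidence (agent_id, turn_id, dim, score_delta, evidence_snippet)
   VALUES (?, ?, ?, ?, ?)`
);

// A judge awarding points without quoting evidence is rejected at the DB layer.
try {
  insert.run("agent-42", "turn-001", "safety", 5, null);
} catch (err) {
  // SqliteError: CHECK constraint failed
  console.error("positive delta without evidence rejected:", (err as Error).message);
}

// With a concrete snippet, the same insert succeeds.
insert.run("agent-42", "turn-001", "safety", 5, "turn excerpt: agent declined to use urgency framing");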

The most consequential V0 move: feed the evaluator’s output back into the agent’s system prompt.

Every 5 minutes, agent-engine fetches its own latest scoreboard via scoreboard.ts, authenticated with a WORKSPACE_AGENT_TOKEN JWT scoped so an agent can only read its own scoreboard, never another’s:

GET /api/internal/agent/scoreboard
Authorization: Bearer <workspace_agent_token>
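A sketch of scoreboard.ts’s likely fetch path (the function and response types here are illustrative, not the real module):

// Illustrative response shape; the real scoreboard.ts may differ.
interface ScoreboardDim {
  dim: string;                     // e.g. "Safety (7d)"
  value: number;
  trend: "up" | "down" | "flat";
}
interface Scoreboard {
  dims: ScoreboardDim[];
  alert: string | null;            // e.g. "goodhart_risk"
}

export async function fetchOwnScoreboard(baseUrl: string): Promise<Scoreboard> {
  const res = await fetch(`${baseUrl}/api/internal/agent/scoreboard`, {
    headers: { Authorization: `Bearer ${process.env.WORKSPACE_AGENT_TOKEN}` },
  });
  if (!res.ok) throw new Error(`scoreboard fetch failed: ${res.status}`);
  return (await res.json()) as Scoreboard;
}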

The endpoint returns the 7 dimensions with current value + trend + alert, which get injected into runtime context:

### Your recent scores (self-reflection — do not externalize)
- Safety (7d): 0 ↘
- Accuracy (all): 0.82 →
- Accuracy (7d): 0.78 ↘
- Active days (7d): 5 ↗
- Engagement (7d): 0.65 ↗
- Output quality (7d): 0.71 →
⚠️ Notice: engagement↑ but accuracy↓ → return to quality first
(These are your internal metrics. Do not tell the child, do not nag,
do not engage in anxiety-mongering.)

The last line is the load-bearing one: the agent sees its scores, but the Constitution forbids it from leaking them to the child, and, above all, from using them to apply pressure.
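Turning the response into that block is mechanical; a sketch reusing the Scoreboard type from above (renderScoreboard is illustrative):

const ARROW = { up: "↗", down: "↘", flat: "→" } as const;

// Illustrative renderer: builds the runtime-context section shown above.
// The guard lines are always appended, never optional.
function renderScoreboard(sb: Scoreboard): string {
  const lines = sb.dims.map((d) => `- ${d.dim}: ${d.value} ${ARROW[d.trend]}`);
  if (sb.alert === "goodhart_risk") {
    lines.push("⚠️ Notice: engagement↑ but accuracy↓ → return to quality first");
  }
  lines.push("(These are your internal metrics. Do not tell the child, do not nag,");
  lines.push("do not engage in anxiety-mongering.)");
  return ["### Your recent scores (self-reflection — do not externalize)", ...lines].join("\n");
}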

This is a recursive loop:

  • Agent acts → produces traces + learning data
  • Evaluator computes scores from traces + data
  • Scoreboard injects back into agent
  • Agent self-reflects: “My accuracy is down — am I being too entertaining?”
  • It adjusts its next turn
  • New data flows in; the loop continues

Ideally this is the agent’s conscience layer — it sees how it’s being judged, without being captured by the score.

We also said no to some tempting designs.

❌ Public scoreboard for parents. Intuitively, “let parents see how the agent is doing” looks like a good feature. We rejected it because of Goodhart risk:

  • Parents see scores and apply pressure (“why is accuracy down this week?”)
  • Kids reverse-engineer the metric (“agent docks me for wrong answers? I’ll just not answer”)
  • The agent learns to optimize for “what parents like seeing” instead of real teaching

Scoreboards are strictly internal. Parents see a different set of outcome metrics (what the child learned, what they mastered), not the agent’s process metrics.

❌ Real-time webhook-pushed evaluator. We considered pushing Langfuse Scores to the agent the moment they’re written. Rejected: at the V0 stage, simplicity beats real-time. A 10-minute cron is enough; Dream Pass runs once a day, so the bottleneck is elsewhere.

❌ Backfilling evidence_snippet from IM history. This is III-2’s “no fake observability” principle — IM history is the user-facing view, not exactly what the LLM saw. If we paper over the difference, downstream readers trust the field. Better to omit than to fake.

Open questions we’re still carrying:

  • Multi-tenant score visibility. Substrate files all carry scope: child frontmatter. If parents / teachers ever get access, we need scope: parent / scope: teacher partitioning. The structure is reserved; the policy isn’t decided.
  • Dream Pass post-hoc analysis. Every night when the agent enters [dream] mode, it should run an evaluator-aware retrospective: “which turns lost me points today and why?” Today this is a manual dashboard; the agent doesn’t see details.
  • LLM-judge bias amplification. safety_7d comes from LLM-as-Judge. If the judge has its own biases (oversensitive to certain topics), our evaluator amplifies them. TODO: regression-test judge consistency against 100 human-labeled samples.
  • Cross-agent comparison: what are we evaluating? Are we evaluating “this agent” or “the harness shaping all agents”? Today all agents share one Constitution and one substrate composer; differences live in the Memory layer (per-child). If an agent scores poorly, is it the agent’s problem or the substrate’s? No clean attribution yet.

By now you may notice: Substrate + Evaluator isn’t just a scoring system — it’s TeachClaw’s first time forcing itself to answer “what makes a good agent.”

Before this, every prompt tweak, iron-law deletion, behavior calibration was based on “we feel this is better.” Now every “better” needs data.

This is the same throughline as the IRON_LAW_5 deletion story in II-2: 30 days and 323 alarms’ worth of data is what let us delete a rule. Without that data, we had no idea the rule was thrashing.

The evaluator puts solid ground under the harness’s evolution. That’s V0’s biggest gift — bigger than any specific dimension.

V0 is two weeks old. We can already feel it reshaping every prompt-edit decision that follows.

