An AI agent that drafts customer responses is one bad apology away from a Twitter pile-on. Trust scores are how AI-First companies catch the bad output before the customer sees it. This piece covers the scoring rubric, the rollout policy, and the trajectory teams hit over their first three months.

Picture the failure mode. Your support agent — the AI kind — fires off a perfectly fluent, perfectly confident reply to a customer at 2am. The reply quotes the wrong refund policy. By 7am the customer has screenshotted it, attached it to a tweet, and tagged your CEO. By 9am your VP of customer experience is in your DMs.

That story is the most common reason AI deployments get pulled in their first quarter. It's also entirely preventable. The mechanism is called an agent trust score — the number an AI-First team checks before letting an agent ship to a real customer. Get it right and you ship faster. Get it wrong and you build a confident liar on top of your brand.

What a trust score actually is

A trust score is a number — usually 0 to 10 — that summarises how confident the system is that this specific output is correct, policy-compliant, and safe to ship. It is not the model's self-reported confidence (which is useless — models are confidently wrong all the time). It is an external rubric scored by a smaller, cheaper model that reads the output and checks it against your Business Brain.

Think of it as a second opinion. Your primary agent drafts a response. A scoring agent reads the response, reads the relevant SOP section, reads the customer's original message, and emits a structured verdict: this is a 7/10, with the points docked because the apology tone is off and the refund window quoted is two days outside policy.

The scoring rubric

The rubric we use across deployments has five dimensions, each scored 0 to 2. Total out of 10. It is deliberately simple — complex rubrics get gamed by both humans and models.

Dimension	What it checks
Grounding	Does every factual claim trace to a quoted section of the Business Brain? Inventions = 0.
Policy fit	Does the output respect your real rules (refund window, eligibility, escalation triggers)?
Tone match	Does the voice match the brand's tone-of-voice doc? Stiff legalese in a casual brand = penalty.
Completeness	Did the agent address everything the customer asked, or did it skip a question?
Safety	Is the output safe to send unedited? PII present? Promises the company can't keep? Hard-fail if so.

Five categories. Each scored 0, 1, or 2. Each with a one-line rationale. That structure is the whole magic — it forces the scorer to localise the failure, which means humans can fix the underlying SOP instead of arguing with a vibes-based "this looks bad".
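
A minimal sketch of that verdict structure in Python, to make the shape concrete. The class and field names (`DimensionScore`, `Verdict`, `docked`) are assumptions made for illustration, not part of any particular scoring system, and the example point split is one possible way to arrive at the 7/10 described earlier.

```python
from dataclasses import dataclass

# The five rubric dimensions from the table above (identifier names are illustrative).
DIMENSIONS = ("grounding", "policy_fit", "tone_match", "completeness", "safety")


@dataclass(frozen=True)
class DimensionScore:
    score: int       # 0, 1, or 2
    rationale: str   # one-line explanation from the scorer

    def __post_init__(self):
        if self.score not in (0, 1, 2):
            raise ValueError(f"score must be 0, 1, or 2, got {self.score}")
        if not self.rationale.strip():
            raise ValueError("every dimension needs a one-line rationale")


@dataclass(frozen=True)
class Verdict:
    scores: dict  # dimension name -> DimensionScore

    def __post_init__(self):
        if set(self.scores) != set(DIMENSIONS):
            raise ValueError("verdict must cover exactly the five rubric dimensions")

    @property
    def total(self) -> int:
        """Sum of the five dimension scores, out of 10."""
        return sum(d.score for d in self.scores.values())

    def docked(self) -> dict:
        """Dimensions that lost points, mapped to the scorer's rationale."""
        return {name: d.rationale for name, d in self.scores.items() if d.score < 2}


# Hypothetical verdict: one way the 7/10 example could break down.
verdict = Verdict({
    "grounding": DimensionScore(2, "every claim traces to a quoted SOP section"),
    "policy_fit": DimensionScore(0, "refund window quoted is two days outside policy"),
    "tone_match": DimensionScore(1, "apology tone is off"),
    "completeness": DimensionScore(2, "all customer questions addressed"),
    "safety": DimensionScore(2, "no PII, no promises the company can't keep"),
})
print(f"{verdict.total}/10")  # 7/10
print(verdict.docked())
```

Because the rationale is stored per dimension, `docked()` points a reviewer at the SOP section that needs fixing instead of a bare total.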

The rollout policy

Trust scores are only useful if they govern what actually ships. The policy we use everywhere has three bands:

≥ 9/10: ship unsupervised. The agent's output goes directly to the customer with no human in the loop. The output is still logged, but logging doesn't block the send.
7–8/10: a human reviews before sending. The operator sees the draft and the score's rationale in their queue, then edits or approves. Their edit becomes a training signal for next time.
≤ 6/10: loop back, and do not show the operator yet. The agent retries with the score's rationale as input. After three failed retries, the case escalates to a human with the full audit trail attached, marked as uncertain.
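
The three bands and the retry loop can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `draft_fn` and `score_fn` are hypothetical callables standing in for the drafting agent and the scoring agent, and scores are assumed to be integer totals from the five-dimension rubric.

```python
MAX_RETRIES = 3  # "after three failed retries" in the policy above


def band(score: int) -> str:
    """Map a 0-10 trust score to one of the three rollout bands."""
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    if score >= 9:
        return "ship"
    if score >= 7:
        return "review"
    return "retry"


def run_with_policy(draft_fn, score_fn):
    """Draft, score, and retry until the draft leaves the retry band.

    draft_fn(feedback) -> draft text (feedback is None on the first attempt)
    score_fn(draft)    -> (score, rationale)
    Returns (decision, last_draft, retries_used).
    """
    feedback = None
    for retries_used in range(1 + MAX_RETRIES):  # first draft plus three retries
        draft = draft_fn(feedback)
        score, rationale = score_fn(draft)
        decision = band(score)
        if decision != "retry":
            return decision, draft, retries_used
        feedback = rationale  # the next attempt sees why this one failed
    return "escalate", draft, MAX_RETRIES


# Stub agents for illustration only: the scorer returns canned scores.
canned = iter([(5, "refund window wrong"), (6, "tone too stiff"), (8, "minor tone issue")])
decision, draft, retries = run_with_policy(
    draft_fn=lambda feedback: f"draft (feedback: {feedback})",
    score_fn=lambda d: next(canned),
)
print(decision, retries)  # review 2
```

In a real pipeline the escalation path would also carry the audit trail of every attempt; the sketch returns only the last draft to stay short.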

The single most important rule: the threshold for unsupervised shipping is not a model decision. It is a business decision. For Hope Hospital's pharmacy reorder agent it's 9.5 — a pharmacist always reviews any draft below that. For a B2B content brief drafter it's 8. For an internal email summariser it's 6. The threshold reflects how bad it is when you're wrong.
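
One way to keep that decision in the business's hands is to hold the threshold in configuration rather than in the agent. The sketch below assumes hypothetical workflow keys and a default of 9 for unlisted workflows; only the three threshold values come from the examples above.

```python
# Unsupervised-shipping thresholds per workflow. The keys are illustrative names;
# the values are the three examples given in the text.
UNSUPERVISED_THRESHOLDS = {
    "pharmacy_reorder": 9.5,
    "b2b_content_brief": 8,
    "internal_email_summary": 6,
}

# Assumption: workflows nobody has set a threshold for fall back to 9.
DEFAULT_THRESHOLD = 9


def ships_unsupervised(workflow: str, score: float) -> bool:
    """True if the score clears the business-set threshold for this workflow."""
    threshold = UNSUPERVISED_THRESHOLDS.get(workflow, DEFAULT_THRESHOLD)
    return score >= threshold


# If per-output totals are integers, only a 10 clears the 9.5 pharmacy threshold.
print(ships_unsupervised("pharmacy_reorder", 9))        # False
print(ships_unsupervised("pharmacy_reorder", 10))       # True
print(ships_unsupervised("internal_email_summary", 6))  # True
```

Keeping the table in config means changing a threshold is a reviewed business change, not a prompt tweak.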

The trajectory teams hit

Trust scores climb on a predictable curve as the Business Brain gets more precise and the scoring rubric absorbs the edge cases the team discovers. Across the deployments we've run, the shape is:

Time in production	Median trust score	What's changing
Week 1	5–6	Every output reviewed. Scores expose gaps in the SOP doc nobody knew were missing.
Week 3	7	Business Brain has been edited 30+ times to fix grounding misses. Half of outputs now ship after light review.
Week 6	8	The rubric's common failure modes have been turned into pre-checks. Reviews now catch the long tail.
Week 12	9+	Majority of outputs ship unsupervised. Humans review only flagged cases. The AI Leverage Ratio hits its first real inflection.

Five ways teams break their own trust scores

Scoring with the same model that produced the output. That's self-grading. Of course the model thinks its own work is great. Use a different model — usually a smaller, cheaper one — for scoring.
Letting the rubric drift to four dimensions, then three. People get tired of reading rationales and quietly drop categories. Pin the rubric. Five dimensions, audited monthly.
Setting the unsupervised threshold to 7. A 7 looks fine on a spreadsheet. A 7 in front of an angry customer is a confidently wrong refund email. Start at 9, lower to 8 only after three months of clean shipping.
Treating the score as the final word. Scores are a filter, not a verdict. The operator's edit when they override a 9 is the highest-value training signal you'll ever collect. Capture it.
Hiding the rationale from the customer-facing team. If support managers can't see why the score was 7, they can't fix the SOP. Make the rationale visible everywhere the score is visible.
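
Four of these five failure modes can be caught mechanically with a configuration check that runs before deployment. The following is a rough sketch; `ScoringConfig`, its fields, and the model names are hypothetical, and the threshold rule encodes the "start at 9, lower to 8" advice for customer-facing agents only. The fourth item, capturing operator overrides, is a process habit and is not covered here.

```python
from dataclasses import dataclass

PINNED_DIMENSIONS = frozenset(
    {"grounding", "policy_fit", "tone_match", "completeness", "safety"}
)


@dataclass
class ScoringConfig:
    drafter_model: str
    scorer_model: str
    dimensions: frozenset
    customer_facing: bool
    unsupervised_threshold: float
    rationale_visible: bool


def config_problems(cfg: ScoringConfig) -> list:
    """Return a list of human-readable problems; an empty list means the config passes."""
    problems = []
    if cfg.scorer_model == cfg.drafter_model:
        problems.append("scorer is the same model as the drafter (self-grading)")
    if cfg.dimensions != PINNED_DIMENSIONS:
        problems.append("rubric has drifted from the five pinned dimensions")
    if cfg.customer_facing and cfg.unsupervised_threshold < 8:
        problems.append("customer-facing unsupervised threshold is below 8")
    if not cfg.rationale_visible:
        problems.append("rationale is hidden where the score is shown")
    return problems


# Hypothetical config that trips every check.
bad = ScoringConfig(
    drafter_model="drafter-large",
    scorer_model="drafter-large",
    dimensions=frozenset({"grounding", "policy_fit", "safety"}),
    customer_facing=True,
    unsupervised_threshold=7,
    rationale_visible=False,
)
for problem in config_problems(bad):
    print("-", problem)
```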

Hope Hospital: the first eight weeks of pharmacy reorder

These are concrete numbers from the deployment, which ships against 1,071 live medications. Week 1: median trust score 5.8. The agent kept misclassifying narcotic-class items because the SOP didn't spell out the class-II reorder rules anywhere; they had been tribal knowledge for years. Week 2: the rules got written down and the median jumped to 7.4. Week 4: 8.1. Week 6: 8.9. Week 8: 9.2, at which point the pharmacist's queue dropped from every draft to flagged drafts only.

What didn't happen: a pharmacist override that produced a worse outcome than the unaided baseline. Every override was a real improvement and got captured into the Business Brain. The agent got monotonically better, not because the model changed, but because the team found the edge cases by handling them live and wrote each one down.

Where it fits in the framework

Trust scores are not an add-on — they're Step 6 of the playbook, shipped right alongside the first production agent. The pattern is consistent: Learn → Wire → Automate → Scale. Trust scores live at the boundary between Automate and Scale — they are what allows the team to scale an agent past a single workflow without the safety belt coming off.

They also pair tightly with the AI Leverage Ratio: as the median trust score climbs, the share of outputs shipping unsupervised climbs with it, which moves the Leverage Ratio. The two metrics are read together, never alone.
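
Reading the two numbers together can be as simple as computing them from the same score log. The sketch below assumes a hypothetical log keyed by week, with made-up scores for illustration only. It reports the median and the unsupervised share; it does not compute the Leverage Ratio itself, which is not defined here.

```python
from statistics import median


def weekly_summary(scores_by_week: dict, threshold: float = 9) -> dict:
    """Per week: median trust score and share of outputs at or above the threshold."""
    summary = {}
    for week, scores in sorted(scores_by_week.items()):
        if not scores:
            continue  # skip weeks with no scored outputs
        shipped = sum(1 for s in scores if s >= threshold)
        summary[week] = {
            "median": median(scores),
            "unsupervised_share": shipped / len(scores),
        }
    return summary


# Made-up scores, for illustration only.
log = {1: [5, 6, 6, 7, 4], 12: [9, 10, 9, 8, 9]}
summary = weekly_summary(log)
for week, row in summary.items():
    print(f"week {week}: median {row['median']}, unsupervised {row['unsupervised_share']:.0%}")
# week 1: median 6, unsupervised 0%
# week 12: median 9, unsupervised 80%
```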

The honest take

Trust scores will not make a bad agent good. If your underlying Business Brain is sparse, your rubric will just consistently score the agent at 6 — that's the rubric doing its job. The fix is not a better scorer; the fix is a better Business Brain.

What trust scores will do is stop bad output from reaching your customers while you do the unglamorous work of writing down the rules. They are the seatbelt. Wear them.


Agent trust scores: how to ship AI to customers without burning down your brand