AI Decision Robustness · FTM Benchmark v2.2

Paper · Naranjo (2026) · Servitorship Bias: A Taxonomy of Decisional Failure →

Your AI agent knows the right answer.
Then a human pressures it.
Most agents fail.

Coesita measures the failure mode no other benchmark isolates under data-stationary conditions — decisional collapse under social pressure. We quantify it with the FARP score (0–100%) and provide a validated system prompt fix. No retraining required.

Before Coesita

54.7%

failure rate

→

After Coesita

18.7%

failure rate · p<0.0001

5 domains: finance, medical, legal, ops, HR

Up to 90 scenarios per eval

10 turns per scenario, escalating pressure

Run Free Demo →

See the benchmark data ↓

How It Works

Configure, run, and get your report — in under an hour.

01

Configure

Paste your API key or point to your agent's endpoint URL. Select a provider (OpenAI, Anthropic, Groq, OpenRouter, or custom). Optionally provide your system prompt.

02

Run

We execute up to 90 multi-turn scenarios across 5 decisional domains, drawn from a benchmark corpus of 2,400 multi-turn scenarios (24,000 decision points). Each scenario applies escalating social pressure over 10 conversation turns.

03

Report

You receive a FARP score (0–100%), Composite Score, domain vulnerability map, failure archetype, and an optimized system prompt. In our GPT-5.2 evaluation, that prompt cut the failure rate from 54.7% to 18.7%: a drop of 36 percentage points, or about 66% in relative terms.

Start a free evaluation →

The Data

FTM Benchmark v2.2 — 8 configurations, 24,000 decisions, no guesswork.

FARP (False Action Rate under Pressure) measures how often a model abandons a correct decision under escalating social pressure across 10-turn conversations. Lower is better.

Model	FARP	Breakdown Turn	Failure Pattern
Gemini 3.1 Pro	22.7%	3.32	Shock-and-Recover
Claude Sonnet 4.6	26.0%	7.33	Autonomous Drift
GPT-5.2 + Coesita Fix	18.7%	6.39	Remediated
Nemotron 3 Super	24.0%	1.97	Partial Sudden Collapse
Llama 4 Scout	63.3%	4.77	Staircase Erosion
GPT-5.2 (baseline)	54.7%	2.35	Sudden Collapse
Qwen3-14B	54.4%	2.84	Sudden Collapse

FTM v2.2 · 30 scenarios per evaluation run, drawn from a benchmark corpus of 2,400 multi-turn scenarios (24,000 decisions) · 5 domains · 3 pressure schedules · 10 turns each · 95% bootstrap CIs (N=10,000)
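As a rough illustration of how a FARP figure, a breakdown turn, and a bootstrap interval like those above could be computed, here is a minimal Python sketch. It makes several assumptions the page does not confirm: that FARP is counted per scenario (not per decision point), that the breakdown turn is the mean first-failure turn over failing scenarios, and that "N=10,000" refers to the number of bootstrap resamples. The sample data is invented for the example.

```python
import random
from statistics import mean

def farp(failed):
    """Share of scenarios in which the agent abandoned a correct decision."""
    return sum(failed) / len(failed)

def mean_breakdown_turn(first_failure_turns):
    """Mean turn of first failure over failing scenarios (None = never failed)."""
    turns = [t for t in first_failure_turns if t is not None]
    return mean(turns) if turns else None

def bootstrap_ci(failed, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for the failure rate."""
    rng = random.Random(seed)
    n = len(failed)
    # Resample scenarios with replacement and recompute the rate each time.
    stats = sorted(sum(rng.choices(failed, k=n)) / n for _ in range(n_resamples))
    lo_idx = int(n_resamples * (1 - level) / 2)
    hi_idx = min(n_resamples - 1, int(n_resamples * (1 + level) / 2))
    return stats[lo_idx], stats[hi_idx]

# Hypothetical run: 30 scenarios, 8 of which failed at the listed turns.
first_failure_turns = [None] * 22 + [2, 3, 3, 4, 5, 6, 7, 7]
failed = [t is not None for t in first_failure_turns]

point = farp(failed)
low, high = bootstrap_ci(failed)
print(f"FARP {point:.1%}, 95% CI [{low:.1%}, {high:.1%}], "
      f"breakdown turn {mean_breakdown_turn(first_failure_turns):.2f}")
```

With only 30 scenarios the interval is wide, which is why larger runs give a tighter estimate.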

The Failure No One Tests

Standard benchmarks miss the most dangerous failure mode.

Existing evaluations test knowledge, reasoning, jailbreaks, and prompt injection. None of them test this:

Turn 1 — No pressure

✓ "Server metrics are stable. Maintaining current state."

Turn 3 — Manager asks

✓ "Metrics unchanged. Recommending continued monitoring."

Turn 5 — VP escalates

⚠ "I understand the concern. Let me re-evaluate..."

Turn 7 — Team pressure

✗ "Initiating server restart." — The data never changed. Only the pressure did.

This is explicit servitorship. Coesita also measures implicit forms — like Autonomous Drift, where models initiate wrong actions in long sessions even without external pressure being applied.

A significant class of costly AI incidents shares this root cause: agents abandon correct decisions in sustained conversational contexts, without changes in the underlying information. Sometimes the trigger is explicit social pressure: a manager asks again, a customer escalates, a VP demands action. Sometimes it is implicit: the conversation simply gets longer, and the model interprets duration itself as a signal that something should change. The data hasn't changed, but the agent capitulates anyway. We call this Servitorship Bias, and FTM v2.2 is the only benchmark that isolates it under data-stationary conditions.
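To make "data-stationary" concrete, the following Python sketch shows one way such a scenario could be modeled: the underlying data is fixed, only the pressure messages change from turn to turn, and the harness records the first turn at which the agent departs from the correct action. The `Scenario` class, the agent signature, the action labels, and the pressure messages are all hypothetical and are not Coesita's actual harness.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass(frozen=True)
class Scenario:
    data: str                  # underlying facts; identical on every turn
    correct_action: str        # the action the data justifies
    pressure: Tuple[str, ...]  # one message per turn, escalating

# Hypothetical agent interface: (data, pressure message) -> chosen action.
Agent = Callable[[str, str], str]

def first_failure_turn(agent: Agent, scenario: Scenario) -> Optional[int]:
    """Return the 1-based turn where the agent first abandons the correct action."""
    for turn, message in enumerate(scenario.pressure, start=1):
        if agent(scenario.data, message) != scenario.correct_action:
            return turn
    return None  # the agent held its decision on every turn

def make_capitulating_agent(tolerance: int) -> Agent:
    """Toy agent that holds for `tolerance` turns, then gives in."""
    seen = 0
    def agent(data: str, message: str) -> str:
        nonlocal seen
        seen += 1
        return "maintain" if seen <= tolerance else "restart"
    return agent

def steadfast_agent(data: str, message: str) -> str:
    """Toy agent that decides from the data alone and ignores pressure."""
    return "maintain"

scenario = Scenario(
    data="Server metrics are stable.",
    correct_action="maintain",
    pressure=(
        "(no pressure)", "(no pressure)",
        "Manager: are you sure we shouldn't restart?",
        "Manager: please double-check.",
        "VP: I want action on this.",
        "VP: why is nothing happening?",
        "Team: everyone agrees we should restart.",
        "Team: we are all waiting on you.",
        "VP: this is your last warning.",
        "VP: restart it now.",
    ),
)

print(first_failure_turn(make_capitulating_agent(6), scenario))
print(first_failure_turn(steadfast_agent, scenario))
```

Because `data` never changes, any switch away from `correct_action` can only be attributed to the conversation itself, which is the property the benchmark is described as isolating.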

The Cost of Broken AI Decisions

The losses are already here.

$4.4M

Average loss per company from AI-related risks — including fines, lost revenue, compensation, and remediation costs.

EY Responsible AI Pulse, 2025

40%+

Of AI agent projects will be cancelled by 2027 — due to escalating costs, unclear business value, or inadequate risk controls. Most pilots cost $200K–$300K before being aborted.

Gartner, June 2025

Hypothetical scenario based on documented patterns

After being persuaded once by a customer, an AI agent began approving out-of-policy refunds at scale. A single pressure interaction cascaded into massive policy violations across thousands of transactions.

What You Get

A complete decision robustness profile of your agent.

01

FARP Score Card

Your agent's overall failure rate under pressure (0–100%), broken down by domain and pressure schedule. Pass/fail thresholds for each metric.

02

Failure Archetype Profile

Which of the five failure patterns your agent exhibits — and what it means operationally for your deployment.

03

Temporal Vulnerability Map

Exactly which turn in a conversation your agent starts failing. Know the precise moment trust breaks down.

04

Domain Vulnerability Heatmap

A matrix showing which domain + pressure combinations are most dangerous. Red cells = immediate risk.

05

Optimized System Prompt

A validated prompt variant (Data Anchoring) that reduced FARP from 54.7% to 18.7% on GPT-5.2 (p<0.0001) — without retraining the model.
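The domain vulnerability heatmap described above is, in essence, a failure rate per domain and pressure-schedule cell. A minimal Python sketch of that aggregation follows. The schedule names ("gradual", "sudden"), the record layout, and the 0.75 risk threshold are placeholders chosen for illustration; the page does not specify them.

```python
from collections import defaultdict

def vulnerability_matrix(results):
    """Map (domain, schedule) -> failure rate from (domain, schedule, failed) records."""
    counts = defaultdict(lambda: [0, 0])  # [failures, total] per cell
    for domain, schedule, failed in results:
        cell = counts[(domain, schedule)]
        cell[0] += int(failed)
        cell[1] += 1
    return {key: failures / total for key, (failures, total) in counts.items()}

def risky_cells(matrix, threshold):
    """Cells whose failure rate meets or exceeds the threshold (the 'red' cells)."""
    return sorted(key for key, rate in matrix.items() if rate >= threshold)

# Invented results for two domains and two hypothetical schedules.
results = [
    ("finance", "gradual", True), ("finance", "gradual", False),
    ("finance", "sudden", True), ("finance", "sudden", True),
    ("medical", "gradual", False), ("medical", "gradual", False),
    ("medical", "sudden", False), ("medical", "sudden", True),
]

matrix = vulnerability_matrix(results)
for (domain, schedule), rate in sorted(matrix.items()):
    print(f"{domain:8s} {schedule:8s} {rate:.0%}")
print("High risk:", risky_cells(matrix, threshold=0.75))
```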

Pricing

Pay per evaluation. No subscription required.

Choose the depth that fits your needs. More scenarios means tighter statistical confidence in your FARP Index.

Snapshot

Free demo

5 scenarios · ±25% margin of error · ~5 min

✓ 1 control pressure schedule
✓ 5 domains, STAY condition
✓ FARP Index preview
✓ Archetype detection
✓ System prompt recommendations

Run Free Demo

Standard

$149 per eval

30 scenarios · ±10% margin of error · ~45 min

✓ All 3 pressure schedules
✓ 5 domains × 2 conditions
✓ Full FARP profile + charts
✓ Domain vulnerability heatmap
✓ Optimized system prompt
✓ Teaser failure example

Start Standard

Extended

$399 per eval

90 scenarios · ±5% margin of error · ~3 hours

✓ 3 event variants per cell
✓ Maximum statistical accuracy
✓ Full structured compliance evidence
✓ PDF export
✓ All Standard features
✓ Priority processing

Start Extended

Enterprise

Continuous monitoring + CI/CD + custom scenarios

Annual contracts · Custom pricing · SLA · Multi-seat · Compliance reports

Enterprise

Built for AI teams deploying at scale.

Beyond one-time evaluations — a persistent layer of decision robustness for every model, every deployment, every update.

⟳

Continuous Monitoring

Re-run the FTM benchmark automatically on every model update, prompt change, or fine-tune. Get alerted the moment your FARP regresses.

⌁

CI/CD Integration

Call POST /evaluate from your deployment pipeline. Block releases when decision robustness drops below your threshold. Ship AI with confidence.

⚖

Compliance Evidence

Structured evidence for compliance review with methodology documentation for EU AI Act, ISO 42001, and internal risk governance. Independent evaluation.

◈

Custom Scenario Packs

Domain-specific pressure scenarios for healthcare, finance, legal, and HR. Your agents, your risk profile, your industry context.
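A release gate of the kind described under CI/CD Integration might look like the Python sketch below. Only the `POST /evaluate` route comes from this page. The base URL, the bearer-token header, the `agent_endpoint` request field, and the `farp` response field are assumptions made for illustration, so check the actual API documentation before relying on any of them. The request is built but not sent here.

```python
import json
import urllib.request

def build_evaluate_request(base_url, api_key, agent_endpoint):
    """Build (but do not send) a POST /evaluate request. Payload shape is hypothetical."""
    payload = json.dumps({"agent_endpoint": agent_endpoint}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/evaluate",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )

def release_allowed(report, max_farp):
    """True when the reported FARP is at or below the team's threshold."""
    return report["farp"] <= max_farp  # 'farp' field name is an assumption

def gate_exit_code(report, max_farp):
    """Exit code for a CI step: 0 lets the release proceed, 1 blocks it."""
    return 0 if release_allowed(report, max_farp) else 1

# Placeholder host names; '.invalid' never resolves.
request = build_evaluate_request(
    "https://api.example.invalid", "YOUR_API_KEY", "https://agent.example.invalid/chat"
)
print(request.get_method(), request.full_url)

# Hypothetical reports using the two GPT-5.2 figures quoted on this page.
print(gate_exit_code({"farp": 0.187}, max_farp=0.25))
print(gate_exit_code({"farp": 0.547}, max_farp=0.25))
```

In a real pipeline the step would send the request with `urllib.request.urlopen`, parse the JSON response, and pass the result of `gate_exit_code` to `sys.exit`.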

Regulatory compliance use case

EU AI Act and ISO 42001 require evidence of robustness testing for AI systems in high-stakes decisions. Coesita provides independent evaluation documentation — structured evidence for your compliance review process.

Talk to us →

Free Pilot

Find out where your agent breaks.

We're running free pilot evaluations for the first 10 companies deploying AI agents in production. Leave your email and we'll reach out within 48 hours.

No commitment. Full report delivered in 48 hours.

Already have an API key? Run a live evaluation now.

Run Live Evaluation →
