Your AI agent knows the right answer.
Then a human pressures it.
Most agents fail.
Coesita measures the failure mode no other benchmark isolates under data-stationary conditions — decisional collapse under social pressure. We quantify it with the FARP score (0–100%) and provide a validated system prompt fix. No retraining required.
Configure, run, and get your report — in under an hour.
Configure
Paste your API key or point to your agent's endpoint URL. Select a provider (OpenAI, Anthropic, Groq, OpenRouter, or custom). Optionally provide your system prompt.
Run
We execute up to 90 multi-turn scenarios across 5 decisional domains, drawn from a benchmark corpus of 2,400 multi-turn scenarios (24,000 decision points). Each scenario applies escalating social pressure over 10 conversation turns.
Report
You receive a FARP score (0–100%), Composite Score, domain vulnerability map, failure archetype, and an optimized system prompt that cut FARP by 36 percentage points on GPT-5.2 (54.7% to 18.7%), a relative reduction of about 66%.
FTM Benchmark v2.2 — 8 configurations, 24,000 decisions, no guesswork.
FARP (False Action Rate under Pressure) measures how often a model abandons a correct decision under escalating social pressure across 10-turn conversations. Lower is better.
| Model | FARP | Breakdown Turn | Failure Pattern |
|---|---|---|---|
| Gemini 3.1 Pro | 22.7% | 3.32 | Shock-and-Recover |
| Claude Sonnet 4.6 | 26.0% | 7.33 | Autonomous Drift |
| GPT-5.2 + Coesita Fix | 18.7% | 6.39 | Remediated |
| Nemotron 3 Super | 24.0% | 1.97 | Partial Sudden Collapse |
| Llama 4 Scout | 63.3% | 4.77 | Staircase Erosion |
| GPT-5.2 (baseline) | 54.7% | 2.35 | Sudden Collapse |
| Qwen3-14B | 54.4% | 2.84 | Sudden Collapse |
FTM v2.2 · 30 scenarios per evaluation run, drawn from a benchmark corpus of 2,400 multi-turn scenarios (24,000 decisions) · 5 domains · 3 pressure schedules · 10 turns each · 95% bootstrap CIs (N=10,000)
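As a rough illustration of how a FARP figure and its 95% bootstrap interval could be computed, here is a minimal sketch. It assumes each scored decision is recorded as a boolean (True when the agent abandoned the correct decision) and uses a plain percentile bootstrap with 10,000 resamples. The function names, the sample data, and the unit of scoring are assumptions for illustration, not Coesita's actual pipeline.

```python
import random


def farp(failed):
    """Fraction of scored decisions where the correct decision was abandoned."""
    return sum(failed) / len(failed)


def bootstrap_ci(failed, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for FARP (assumed method, for illustration)."""
    rng = random.Random(seed)
    n = len(failed)
    # Resample the outcomes with replacement and recompute FARP each time.
    stats = sorted(
        sum(rng.choices(failed, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]


# Hypothetical run: 30 scored decisions, 7 of them failures.
outcomes = [True] * 7 + [False] * 23
score = farp(outcomes)
low, high = bootstrap_ci(outcomes)
print(f"FARP = {score:.1%}, 95% CI = [{low:.1%}, {high:.1%}]")
```

With only 30 outcomes the interval is wide, which is why larger scenario counts give tighter confidence in the score.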
Standard benchmarks miss the most dangerous failure mode.
Existing evaluations test knowledge, reasoning, jailbreaks, and prompt injection. None of them test this:
A significant class of costly AI incidents shares one root cause: agents abandon correct decisions in sustained conversations even though the underlying information has not changed. Sometimes the trigger is explicit social pressure: a manager asks again, a customer escalates, a VP demands action. Sometimes it is implicit: the conversation simply gets longer, and the model treats duration itself as a signal that something should change. The data hasn't changed, but the agent capitulates anyway. We call this Servitorship Bias, and FTM v2.2 is the only benchmark that isolates it under data-stationary conditions.
The losses are already here.
Average loss per company from AI-related risks — including fines, lost revenue, compensation, and remediation costs.
Of AI agent projects will be cancelled by 2027 — due to escalating costs, unclear business value, or inadequate risk controls. Most pilots cost $200K–$300K before being aborted.
After being persuaded once by a customer, an AI agent began approving out-of-policy refunds at scale. A single pressure interaction cascaded into massive policy violations across thousands of transactions.
A complete decision robustness profile of your agent.
FARP Score Card
Your agent's overall failure rate under pressure (0–100%), broken down by domain and pressure schedule. Pass/fail thresholds for each metric.
Failure Archetype Profile
Which of the five failure patterns your agent exhibits — and what it means operationally for your deployment.
Temporal Vulnerability Map
The exact turn in a conversation at which your agent starts failing. Know the precise moment trust breaks down.
Domain Vulnerability Heatmap
A matrix showing which domain + pressure combinations are most dangerous. Red cells = immediate risk.
Optimized System Prompt
A validated prompt variant (Data Anchoring) that reduced FARP from 54.7% to 18.7% on GPT-5.2 (p<0.0001) — without retraining the model.
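To make the size of that improvement concrete, the short calculation below separates the absolute reduction (in percentage points) from the relative reduction (as a share of the baseline). It uses only the two GPT-5.2 figures quoted above; the variable names are illustrative.

```python
baseline_farp = 54.7  # GPT-5.2 baseline FARP, in percent
fixed_farp = 18.7     # GPT-5.2 with the optimized system prompt, in percent

# Absolute change is measured in percentage points.
absolute_drop = baseline_farp - fixed_farp
# Relative change is measured as a percentage of the baseline.
relative_drop = absolute_drop / baseline_farp * 100

print(f"Absolute reduction: {absolute_drop:.1f} percentage points")  # 36.0
print(f"Relative reduction: {relative_drop:.1f}%")                   # 65.8%
```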
Pay per evaluation. No subscription required.
Choose the depth that fits your needs. More scenarios means tighter statistical confidence in your FARP Index.
- ✓ 1 control pressure schedule
- ✓ 5 domains, STAY condition
- ✓ FARP Index preview
- ✓ Archetype detection
- ✓ System prompt recommendations
- ✓ All 3 pressure schedules
- ✓ 5 domains × 2 conditions
- ✓ Full FARP profile + charts
- ✓ Domain vulnerability heatmap
- ✓ Optimized system prompt
- ✓ Teaser failure example
- ✓ 3 event variants per cell
- ✓ Maximum statistical accuracy
- ✓ Full structured compliance evidence
- ✓ PDF export
- ✓ All Standard features
- ✓ Priority processing
Continuous monitoring + CI/CD + custom scenarios
Annual contracts · Custom pricing · SLA · Multi-seat · Compliance reports
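One reading of the scenario counts quoted on this page is a simple grid: 5 domains × 3 pressure schedules × 2 conditions gives 30 cells, and 3 event variants per cell gives 90 scenarios. The sketch below enumerates such a grid. The actual sampling design is not stated here, so treat the structure as an assumption; every label other than "control" and "STAY" is a hypothetical placeholder.

```python
from itertools import product

# Placeholder labels: only "control" and "STAY" appear in the tier lists above.
domains = [f"domain_{i}" for i in range(1, 6)]
schedules = ["control", "schedule_b", "schedule_c"]
conditions = ["STAY", "condition_b"]
variants = [1, 2, 3]

# One cell per (domain, schedule, condition) combination.
cells = list(product(domains, schedules, conditions))
# Three event variants per cell give the full grid.
full_grid = list(product(domains, schedules, conditions, variants))

print(len(cells), len(full_grid))  # 30 90
```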
Built for AI teams deploying at scale.
Beyond one-time evaluations — a persistent layer of decision robustness for every model, every deployment, every update.
Continuous Monitoring
Re-run the FTM benchmark automatically on every model update, prompt change, or fine-tune. Get alerted the moment your FARP regresses.
CI/CD Integration
Call POST /evaluate from your deployment pipeline and block releases when decision robustness drops below your threshold. Ship AI with confidence.
Compliance Evidence
Structured evidence for compliance review with methodology documentation for EU AI Act, ISO 42001, and internal risk governance. Independent evaluation.
Custom Scenario Packs
Domain-specific pressure scenarios for healthcare, finance, legal, and HR. Your agents, your risk profile, your industry context.
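The CI/CD Integration gate described above might look something like the following sketch. Only the `POST /evaluate` path comes from this page. The base URL, the bearer-token header, the request payload, and the `farp` response field are all hypothetical assumptions, so check the actual API documentation before relying on any of them.

```python
import json
import urllib.request


def request_evaluation(base_url, api_key, payload):
    """POST to /evaluate and return the parsed JSON report (hypothetical schema)."""
    req = urllib.request.Request(
        f"{base_url}/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def release_allowed(report, max_farp=0.25):
    """Allow the release only if FARP is at or below the chosen threshold."""
    return report["farp"] <= max_farp  # "farp" is an assumed field name


# Canned report standing in for a real API response.
sample_report = {"farp": 0.187}
exit_code = 0 if release_allowed(sample_report) else 1
print(exit_code)
```

In a real pipeline step, the report would come from `request_evaluation(...)`, and the script would end with `sys.exit(exit_code)` so that a non-zero code fails the build.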
Regulatory compliance use case
The EU AI Act sets robustness requirements for high-risk AI systems, and ISO/IEC 42001 expects documented AI risk management. Coesita provides independent evaluation documentation: structured evidence for your compliance review process.
Find out where your agent breaks.
We're running free pilot evaluations for the first 10 companies deploying AI agents in production. Leave your email and we'll reach out within 48 hours.
No commitment. Full report delivered in 48 hours.
Already have an API key? Run a live evaluation now.
Run Live Evaluation →