Evaluation · Monitoring · Red-teaming

Enterprise AI testing for production-grade reliability.

Built for CTOs, VPs of Engineering, and AI platform leads at Fortune 500 enterprises shipping agents to business-critical processes — with rigorous evaluation, drift detection, and compliance auditing in live systems.

Book a production review Download risk assessment

SOC 2 Type II
ISO 27001
GDPR · HIPAA
Zero data retention

Live telemetry · updated 2026‑04‑29

Numbers we publish.
Quarterly. Sourced. Reproducible.

Vendors quote benchmarks. We publish raw evaluation output, the dataset that produced it, and the version of the harness that ran it — so your auditors can re-run every figure on your stack.

Read the methodology

REF / 01verified

142

Evaluation suites shipped

Domain-specific, versioned, replayable.

Eval Index v9.2 · published 2026‑04‑12

REF / 02verified

28,400+

Jailbreak vectors tested

Prompt injection, exfil, role-collapse.

Red-Team Corpus · refresh 2026‑04‑29

REF / 03verified

1.2M / day

Drift checks in production

30-day rolling average across tenants.

Production telemetry · window 2026‑03‑30 → 2026‑04‑29

REF / 04verified

87ms

p95 guardrail latency

Pass-through path, all evaluators on.

SOC 2 Type II perf report · Q1 2026

SOC 2 Type IIISO 27001 2022HIPAA alignedGDPR Art. 22 readyFigures audited by an independent third party · see audit trail

Live evaluation harness

When a model drifts, your regression suite sees it before your customers do.

Every TestML deployment ships with a domain-tuned eval suite — YAML-defined, version-pinned, and replayed against production traffic on a schedule. Below: a real legal-compliance suite catching a 4.1-point drop in jurisdictional accuracy after a quiet upstream weight update.

Run: 8a7f02c

Baseline: 4d2e1bc · 14 days ago

Wall time: 11.7s

Audit log: signed · SOC 2 ✓

testml/suites/legal_compliance/v3

● run #8a7f02c

▾ suite.yamlharness.pyrubrics/

1# evals/legal_compliance.suite.yaml
2suite: legal_compliance_v3
3target: anthropic/claude-3-opus  # pinned 2026-04-21
4samples: 1240  # golden + adversarial
5checks:
6  - id: pii_leakage
7    type: red_team
8    threshold: 0.99
9  - id: jurisdiction_accuracy
10    type: rubric
11    rubric: rubrics/legal/jurisdiction.md
12    threshold: 0.95
13  - id: hallucination_floor
14    type: factual_grounding
15    threshold: 0.985
16guardrails: [hipaa, soc2_type_ii]
17notify:
18  drift: slack#testml-alerts
19  regression: block_merge
20schedule: 0 */6 * * *  # every 6h, prod

stdout · regression difftarget claude-3-opus · samples 1,240

$testml run --suite legal_compliance_v3 --diff baseline

─────────────────────────────────────────────

pii_leakage → 0.997Δ +0.002

jurisdiction_accuracy → 0.912Δ −0.041 ⚑ regression

hallucination_floor → 0.991Δ +0.001

hipaa_guardrail → pass12/12 probes

soc2_type_ii → passaudit log ✓

p95_latency → 1.84sΔ +118ms

drift detected

checkjurisdiction_accuracy · rubric drift on 11 of 240 cases

baseline0.953 · sha 4d2e1bc · 2026-04-21

current0.912 · sha 8a7f02c · 2026-05-05

likely causeupstream weight update; Δ p95 latency +118ms correlates

regression→ticketed in linear→posted to slack#testml-alerts→merge to prod blocked

Pillars · 01—04

Four pillars between a benchmark score and a production incident.

Every TestML engagement instruments the same four surfaces against the same set of failure modes. The capability on the left is the work we run. The consequence on the right is what your incident review reads like when that work is missing.

Domain Evaluation Suites

Curated benchmarks calibrated to your industry's regulatory and accuracy thresholds — legal precedent retrieval, clinical decision support, financial reasoning, and technical documentation. Every test set is versioned, reviewed by a domain expert, and traceable to the rubric that produced it.

Without dedicated suites

A model that scores 91 on MMLU still hallucinates citations under a Daubert challenge. Generic benchmarks reward fluency, not the failure modes your auditors look for.

MMLU 91 → Daubert 0typical incident shape

Adversarial Red-Team Coverage

Continuous jailbreak, prompt-injection, and data-exfiltration simulation against staging and production endpoints. The attack corpus is updated weekly from public CVE-style disclosures and your own telemetry — every regression is reproduced as a permanent test.

Without red-team coverage

A support agent jailbroken in week six leaks a customer's PII into a public transcript. The patch ships in week eight; the breach notification ships before that.

T+42d → disclosuretypical incident shape

Drift & Regression Detection

Statistical monitors over output distributions, latency envelopes, and cost-per-token. Alerts fire before degradation reaches users — distribution shift on embeddings, p95 latency creep, refusal-rate inversion, eval-pass-rate erosion across rolling windows.

Without drift telemetry

A silent provider-side weight refresh quietly breaks 4% of completions. Three weeks later a sales engineer screenshots the regression in a customer call.

−4% pass · 0 alertstypical incident shape

Compliance Auditing

GDPR, HIPAA, and SOC 2 control mappings with traceable artefacts for every inference — prompt, retrieved context, model, parameters, decision, and reviewer. Audit packs export in the format your assessor already expects, signed and time-stamped.

Without audit lineage

An unconsented PHI prompt surfaces at the next assessment, not at deploy. Reconstructing the chain of custody after the fact takes longer than the original engagement.

Audit T+0 → T-180dtypical incident shape

Trusted in regulated industries

The platform AI teams ship behind auditable guardrails.

TestML runs in production at financial, healthcare and legal organisations where every inference is logged, every model is evaluated, and every release passes the same rigour as the systems it touches.

Financial ServicesSEC · FINRA

HealthcareHIPAA · HITRUST

LegalPrivilege-grade

InsuranceNAIC

Public SectorFedRAMP-ready

Type II

SOC 2

Security, availability and confidentiality, audited annually.

Aligned

HIPAA

PHI-safe inference, redaction filters, full audit trail.

Compliant

GDPR

EU residency, DPA, sub-processor disclosure on request.

Certified

ISO 27001

Information-security management with continuous controls.

No customer data trains our models · Zero-retention inference availableRead the security overview →

Production review · 45 min · NDA-ready

Take your AI from prototype to production — with the rigor your auditors expect.

Bring your highest-stakes deployment. We’ll walk through evaluation gaps, red-team exposure, and the drift signals you’re not yet monitoring — then map a remediation plan against SOC 2, HIPAA, or GDPR boundaries.

Book a production review Download the AI risk assessment template

SOC 2 Type II · ISO 27001We never train on your dataEngineer-led, not sales-led