Evaluation · Monitoring · Red-teaming

Enterprise AI testing for production-grade reliability.

Built for CTOs, VPs of Engineering, and AI platform leads at Fortune 500 enterprises shipping agents to business-critical processes — with rigorous evaluation, drift detection, and compliance auditing in live systems.

Book a production reviewDownload risk assessment
  • SOC 2 Type II
  • ISO 27001
  • GDPR · HIPAA
  • Zero data retention
Live telemetry · updated 2026‑04‑29

Numbers we publish.
Quarterly. Sourced. Reproducible.

Vendors quote benchmarks. We publish raw evaluation output, the dataset that produced it, and the version of the harness that ran it — so your auditors can re-run every figure on your stack.

Read the methodology
REF / 01verified
142

Evaluation suites shipped

Domain-specific, versioned, replayable.

Eval Index v9.2 · published 2026‑04‑12

REF / 02verified
28,400+

Jailbreak vectors tested

Prompt injection, exfil, role-collapse.

Red-Team Corpus · refresh 2026‑04‑29

REF / 03verified
1.2M / day

Drift checks in production

30-day rolling average across tenants.

Production telemetry · window 2026‑03‑30 → 2026‑04‑29

REF / 04verified
87ms

p95 guardrail latency

Pass-through path, all evaluators on.

SOC 2 Type II perf report · Q1 2026

SOC 2 Type IIISO 27001 2022HIPAA alignedGDPR Art. 22 readyFigures audited by an independent third party · see audit trail
Live evaluation harness

When a model drifts, your regression suite sees it before your customers do.

Every TestML deployment ships with a domain-tuned eval suite — YAML-defined, version-pinned, and replayed against production traffic on a schedule. Below: a real legal-compliance suite catching a 4.1-point drop in jurisdictional accuracy after a quiet upstream weight update.

Run
8a7f02c
Baseline
4d2e1bc · 14 days ago
Wall time
11.7s
Audit log
signed · SOC 2 ✓
testml/suites/legal_compliance/v3
run #8a7f02c
suite.yamlharness.pyrubrics/
1# evals/legal_compliance.suite.yaml
2suite: legal_compliance_v3
3target: anthropic/claude-3-opus # pinned 2026-04-21
4samples: 1240 # golden + adversarial
5checks:
6 - id: pii_leakage
7 type: red_team
8 threshold: 0.99
9 - id: jurisdiction_accuracy
10 type: rubric
11 rubric: rubrics/legal/jurisdiction.md
12 threshold: 0.95
13 - id: hallucination_floor
14 type: factual_grounding
15 threshold: 0.985
16guardrails: [hipaa, soc2_type_ii]
17notify:
18 drift: slack#testml-alerts
19 regression: block_merge
20schedule: 0 */6 * * * # every 6h, prod
stdout · regression difftarget claude-3-opus · samples 1,240
$testml run --suite legal_compliance_v3 --diff baseline
─────────────────────────────────────────────
pii_leakage 0.997Δ +0.002
jurisdiction_accuracy 0.912Δ −0.041 ⚑ regression
hallucination_floor 0.991Δ +0.001
hipaa_guardrail pass12/12 probes
soc2_type_ii passaudit log ✓
p95_latency 1.84sΔ +118ms
regressionticketed in linearposted to slack#testml-alertsmerge to prod blocked
Pillars · 01—04

Four pillars between a benchmark score and a production incident.

Every TestML engagement instruments the same four surfaces against the same set of failure modes. The capability on the left is the work we run. The consequence on the right is what your incident review reads like when that work is missing.

Domain Evaluation Suites

Curated benchmarks calibrated to your industry's regulatory and accuracy thresholds — legal precedent retrieval, clinical decision support, financial reasoning, and technical documentation. Every test set is versioned, reviewed by a domain expert, and traceable to the rubric that produced it.

Without dedicated suites

A model that scores 91 on MMLU still hallucinates citations under a Daubert challenge. Generic benchmarks reward fluency, not the failure modes your auditors look for.

MMLU 91 → Daubert 0typical incident shape

Adversarial Red-Team Coverage

Continuous jailbreak, prompt-injection, and data-exfiltration simulation against staging and production endpoints. The attack corpus is updated weekly from public CVE-style disclosures and your own telemetry — every regression is reproduced as a permanent test.

Without red-team coverage

A support agent jailbroken in week six leaks a customer's PII into a public transcript. The patch ships in week eight; the breach notification ships before that.

T+42d → disclosuretypical incident shape

Drift & Regression Detection

Statistical monitors over output distributions, latency envelopes, and cost-per-token. Alerts fire before degradation reaches users — distribution shift on embeddings, p95 latency creep, refusal-rate inversion, eval-pass-rate erosion across rolling windows.

Without drift telemetry

A silent provider-side weight refresh quietly breaks 4% of completions. Three weeks later a sales engineer screenshots the regression in a customer call.

−4% pass · 0 alertstypical incident shape

Compliance Auditing

GDPR, HIPAA, and SOC 2 control mappings with traceable artefacts for every inference — prompt, retrieved context, model, parameters, decision, and reviewer. Audit packs export in the format your assessor already expects, signed and time-stamped.

Without audit lineage

An unconsented PHI prompt surfaces at the next assessment, not at deploy. Reconstructing the chain of custody after the fact takes longer than the original engagement.

Audit T+0 → T-180dtypical incident shape
Trusted in regulated industries

The platform AI teams ship behind auditable guardrails.

TestML runs in production at financial, healthcare and legal organisations where every inference is logged, every model is evaluated, and every release passes the same rigour as the systems it touches.

Financial ServicesSEC · FINRA
HealthcareHIPAA · HITRUST
LegalPrivilege-grade
InsuranceNAIC
Public SectorFedRAMP-ready
Type II
SOC 2

Security, availability and confidentiality, audited annually.

Aligned
HIPAA

PHI-safe inference, redaction filters, full audit trail.

Compliant
GDPR

EU residency, DPA, sub-processor disclosure on request.

Certified
ISO 27001

Information-security management with continuous controls.

No customer data trains our models · Zero-retention inference availableRead the security overview →
Production review · 45 min · NDA-ready

Take your AI from prototype to production — with the rigor your auditors expect.

Bring your highest-stakes deployment. We’ll walk through evaluation gaps, red-team exposure, and the drift signals you’re not yet monitoring — then map a remediation plan against SOC 2, HIPAA, or GDPR boundaries.

SOC 2 Type II · ISO 27001We never train on your dataEngineer-led, not sales-led