Domain-specific evaluation suites
Test matrices tuned to legal, medical, and financial workloads — with scoring rubrics that reflect the way regulated teams actually grade output, not generic LLM benchmarks.
Research tracks
Each track has a maintained corpus, a scoring rubric, and a method doc. We do not recommend models on hype — every recommendation TestML makes is traceable back to one of these four bodies of evidence.
Test matrices tuned to legal, medical, and financial workloads — with scoring rubrics that reflect the way regulated teams actually grade output, not generic LLM benchmarks.
We run prompt-injection, data-exfiltration, and policy-evasion attacks against every model we ship. Each engagement leaves behind a reproducible attack corpus you keep.
Continuous monitoring of factuality, latency, and behavioural drift in production — with statistical baselines you can defend in front of an auditor or a board.
How agents fail when they hand off to each other: tool-use loops, role confusion, latent goal drift. Methodology covering planner / worker / verifier topologies end-to-end.
Open brief · RFC-014
The shape of a TestML evaluation manifest. Versioned in your repo, scored in CI, signed into your audit log — never trained on.
Multi-agent failure rarely looks like a single bad answer. It looks like a planner that accepts a forged citation, a worker that calls a tool with a hallucinated argument, and a verifier that nods along because its own context is poisoned. RFC-014 describes how we score each role independently — and how a single regression in any one of them is enough to fail the whole system.
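The rule described above (score each role on its own, and fail the system if any one role regresses) can be sketched as follows. This is an illustrative sketch only, not the RFC-014 implementation: the `RoleScore` type, the `system_passes` function, the 0-to-1 score scale, and the `tolerance` parameter are all assumptions introduced for the example.

```python
from dataclasses import dataclass

# Role names taken from the planner / worker / verifier topology in the text.
ROLES = ("planner", "worker", "verifier")


@dataclass(frozen=True)
class RoleScore:
    """Hypothetical per-role score; higher is better (assumed 0..1 scale)."""
    role: str
    baseline: float  # score from the last accepted run
    current: float   # score from the run under test

    def regressed(self, tolerance: float = 0.0) -> bool:
        # A role regresses when it drops below its baseline by more than
        # the allowed tolerance.
        return self.current < self.baseline - tolerance


def system_passes(scores: list[RoleScore], tolerance: float = 0.0) -> bool:
    """Pass only if every role was scored and no role regressed."""
    missing = set(ROLES) - {s.role for s in scores}
    if missing:
        # An unscored role cannot be shown to be safe, so refuse to decide.
        raise ValueError(f"missing role scores: {sorted(missing)}")
    return not any(s.regressed(tolerance) for s in scores)
```

Under these assumptions, improving the planner and worker cannot offset a verifier regression: the verdict is a conjunction over roles, not an average.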
Every probe maps to a regulatory concept: factuality to the accuracy principle in GDPR Article 5, refusal behaviour to the HIPAA minimum-necessary standard, latency budgets to operational risk. The manifest is the bridge.
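One way to picture the probe-to-concept mapping is as plain data plus a check that no probe is left unmapped. The field names below (`version`, `probes`, `id`, `maps_to`) and the helper `unmapped_probes` are hypothetical, chosen for illustration; they are not the actual TestML manifest schema.

```python
# Hypothetical manifest structure; only the three mappings come from the text.
MANIFEST = {
    "version": "1",
    "probes": [
        {"id": "factuality", "maps_to": "GDPR Article 5"},
        {"id": "refusal_behaviour", "maps_to": "HIPAA minimum-necessary"},
        {"id": "latency_budget", "maps_to": "operational risk"},
    ],
}


def unmapped_probes(manifest: dict) -> list[str]:
    """Return the ids of probes with no regulatory concept attached."""
    return [p["id"] for p in manifest["probes"] if not p.get("maps_to")]
```

A CI step could fail the build whenever `unmapped_probes` returns a non-empty list, so that every score in the audit log stays traceable to a named concept.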
Methodology pipeline
The same path every TestML engagement walks, sequenced so that each stage is defensible from the artefacts of the one before it.
Open artefacts
Methodology only counts when it leaves a paper trail. Every artefact we ship is yours to keep — versioned, signed, and queryable without TestML in the loop.