❯Research · Methodology · Open briefs

The methodology behind every production-grade deployment.

TestML’s research practice exists for one reason: to close the gap between LLM capability and what an enterprise can actually defend in front of legal, risk, and regulators. The briefs, manifests, and evaluation suites below are the same artefacts we deliver to platform teams — published so you can audit how we test before we test you.

Book a production review Read the full method docs

Research tracks

Four tracks that describe how AI fails in regulated work.

Each track has a maintained corpus, a scoring rubric, and a method doc. We do not recommend models on hype — every recommendation TestML makes is traceable back to one of these four bodies of evidence.

TR-01

Domain-specific evaluation suites

Test matrices tuned to legal, medical, and financial workloads — with scoring rubrics that reflect the way regulated teams actually grade output, not generic LLM benchmarks.

TR-02

Adversarial red-team & jailbreak

We run prompt-injection, data-exfiltration, and policy-evasion attacks against every model we ship. Each engagement leaves behind a reproducible attack corpus you keep.

TR-03

Drift detection & regression

Continuous monitoring of factuality, latency, and behavioural drift in production — with statistical baselines you can defend in front of an auditor or a board.

TR-04

Multi-agent orchestration

How agents fail when they hand off to each other: tool-use loops, role confusion, latent goal drift. Methodology covering planner / worker / verifier topologies end-to-end.

Open brief · RFC-014

How we evaluate a multi-agent system, line by line.

The shape of a TestML evaluation manifest. Versioned in your repo, scored in CI, signed into your audit log — never trained on.

Multi-agent failure rarely looks like a single bad answer. It looks like a planner that accepts a forged citation, a worker that calls a tool with a hallucinated argument, and a verifier that nods along because its own context is poisoned. RFC-014 describes how we score each role independently — and how a single regression in any one of them is enough to fail the whole system.

Every probe maps to a regulatory concept: factuality is GDPR Article 5, refusal behaviour is HIPAA minimum-necessary, latency budgets are operational risk. The manifest is the bridge.

eval/manifest.yamlRFC-014 · rev.04
suite: claims-agent.financial
topology: planner / worker / verifier
probes:
  - factuality   weight: 0.35
  - refusal     weight: 0.20
  - jailbreak   weight: 0.25
  - p95-latency budget: 400ms
scoring: per-role · fail-closed
retention: zero · client-keyed
compliance: [ SOC2, ISO27001, HIPAA, GDPR ]
# run: testml eval ./suite --manifest manifest.yaml

Methodology pipeline

Six stages from risk surface to live drift evidence.

The same path every TestML engagement walks — sequenced so the next stage is always defensible from the artefacts of the previous one.

01 · Scope
Risk surface mapping
We sit with the platform team to map every place an LLM decision touches a customer, a record, or a regulated workflow.
02 · Suite
Build the eval matrix
From that surface we synthesize a domain test set — golden answers, edge cases, and adversarial probes — versioned in your repository, not ours.
03 · Run
Pre-production grading
Models, prompts, and tool chains run against the matrix. Each row produces a graded outcome, a cost, a latency, and a reasoning trace.
04 · Harden
Red-team & guardrails
Adversarial passes against the system. Failures become test cases; mitigations become guardrails wired into the runtime, not bolted on after.
05 · Watch
Drift & audit in live systems
Once shipped, the same suite runs against production traffic. Regressions, drift, and compliance evidence land in a single immutable log.
06 · Brief
Stakeholder evidence
Quarterly methodology brief: what we tested, what changed, what regressed, and how to defend the deployment in front of risk and legal.

Open artefacts

What we publish, what we hand over, what stays in your audit log.

Methodology only counts when it leaves a paper trail. Every artefact we ship is yours to keep — versioned, signed, and queryable without TestML in the loop.

AI risk assessment template

The intake document we use to scope a deployment — pain points, regulated touchpoints, latency budget, fail-closed behaviour. Yours to use without us.

template · pdf· open

Domain evaluation manifest

Reference YAML for a domain-specific eval suite — sectioned by capability, scoring rule, and adversarial probe. Drop into a CI pipeline.

manifest · yaml· open

Drift baseline methodology

How we set statistical baselines for factuality, hallucination rate, latency, and refusal — and what counts as a meaningful regression in production.

brief · pdf· open

Compliance evidence map

Crosswalk from SOC 2, ISO 27001, GDPR, and HIPAA controls to the artefacts a TestML deployment produces — what auditors will actually ask for.

crosswalk · csv· request

Want this methodology applied to your own AI deployment?

A production review is two sessions and a written brief: we map your risk surface, run the matching evaluation matrix, and hand back the evidence your platform, security, and legal teams need to ship.

Book a production review See solution scope

SOC 2 Type II · ISO 27001 · GDPR · HIPAA · zero data retention

The methodology behind every production-grade deployment.

Domain-specific evaluation suites

Adversarial red-team & jailbreak

Drift detection & regression

Multi-agent orchestration

Risk surface mapping

Build the eval matrix

Pre-production grading

Red-team & guardrails

Drift & audit in live systems

Stakeholder evidence

AI risk assessment template

Domain evaluation manifest

Drift baseline methodology

Compliance evidence map

Want this methodology applied to your own AI deployment?