Research · Methodology · Open briefs

The methodology behind every production-grade deployment.

TestML’s research practice exists for one reason: to close the gap between LLM capability and what an enterprise can actually defend in front of legal, risk, and regulators. The briefs, manifests, and evaluation suites below are the same artefacts we deliver to platform teams — published so you can audit how we test before we test you.

Book a production reviewRead the full method docs

Research tracks

Four tracks that describe how AI fails in regulated work.

Each track has a maintained corpus, a scoring rubric, and a method doc. We do not recommend models on hype — every recommendation TestML makes is traceable back to one of these four bodies of evidence.

TR-01

Domain-specific evaluation suites

Test matrices tuned to legal, medical, and financial workloads — with scoring rubrics that reflect the way regulated teams actually grade output, not generic LLM benchmarks.

TR-02

Adversarial red-team & jailbreak

We run prompt-injection, data-exfiltration, and policy-evasion attacks against every model we ship. Each engagement leaves behind a reproducible attack corpus you keep.

TR-03

Drift detection & regression

Continuous monitoring of factuality, latency, and behavioural drift in production — with statistical baselines you can defend in front of an auditor or a board.

TR-04

Multi-agent orchestration

How agents fail when they hand off to each other: tool-use loops, role confusion, latent goal drift. Methodology covering planner / worker / verifier topologies end-to-end.

Open brief · RFC-014

How we evaluate a multi-agent system, line by line.

The shape of a TestML evaluation manifest. Versioned in your repo, scored in CI, signed into your audit log — never trained on.

Multi-agent failure rarely looks like a single bad answer. It looks like a planner that accepts a forged citation, a worker that calls a tool with a hallucinated argument, and a verifier that nods along because its own context is poisoned. RFC-014 describes how we score each role independently — and how a single regression in any one of them is enough to fail the whole system.

Every probe maps to a regulatory concept: factuality is GDPR Article 5, refusal behaviour is HIPAA minimum-necessary, latency budgets are operational risk. The manifest is the bridge.

eval/manifest.yamlRFC-014 · rev.04
suite: claims-agent.financial
topology: planner / worker / verifier
probes:
  - factuality weight: 0.35
  - refusal weight: 0.20
  - jailbreak weight: 0.25
  - p95-latency budget: 400ms
scoring: per-role · fail-closed
retention: zero · client-keyed
compliance: [ SOC2, ISO27001, HIPAA, GDPR ]
# run: testml eval ./suite --manifest manifest.yaml

Methodology pipeline

Six stages from risk surface to live drift evidence.

The same path every TestML engagement walks — sequenced so the next stage is always defensible from the artefacts of the previous one.

  1. 01 · Scope

    Risk surface mapping

    We sit with the platform team to map every place an LLM decision touches a customer, a record, or a regulated workflow.

  2. 02 · Suite

    Build the eval matrix

    From that surface we synthesize a domain test set — golden answers, edge cases, and adversarial probes — versioned in your repository, not ours.

  3. 03 · Run

    Pre-production grading

    Models, prompts, and tool chains run against the matrix. Each row produces a graded outcome, a cost, a latency, and a reasoning trace.

  4. 04 · Harden

    Red-team & guardrails

    Adversarial passes against the system. Failures become test cases; mitigations become guardrails wired into the runtime, not bolted on after.

  5. 05 · Watch

    Drift & audit in live systems

    Once shipped, the same suite runs against production traffic. Regressions, drift, and compliance evidence land in a single immutable log.

  6. 06 · Brief

    Stakeholder evidence

    Quarterly methodology brief: what we tested, what changed, what regressed, and how to defend the deployment in front of risk and legal.

Open artefacts

What we publish, what we hand over, what stays in your audit log.

Methodology only counts when it leaves a paper trail. Every artefact we ship is yours to keep — versioned, signed, and queryable without TestML in the loop.

AI risk assessment template

The intake document we use to scope a deployment — pain points, regulated touchpoints, latency budget, fail-closed behaviour. Yours to use without us.

template · pdf· open

Domain evaluation manifest

Reference YAML for a domain-specific eval suite — sectioned by capability, scoring rule, and adversarial probe. Drop into a CI pipeline.

manifest · yaml· open

Drift baseline methodology

How we set statistical baselines for factuality, hallucination rate, latency, and refusal — and what counts as a meaningful regression in production.

brief · pdf· open

Compliance evidence map

Crosswalk from SOC 2, ISO 27001, GDPR, and HIPAA controls to the artefacts a TestML deployment produces — what auditors will actually ask for.

crosswalk · csv· request

Want this methodology applied to your own AI deployment?

A production review is two sessions and a written brief: we map your risk surface, run the matching evaluation matrix, and hand back the evidence your platform, security, and legal teams need to ship.

Book a production reviewSee solution scope

SOC 2 Type II · ISO 27001 · GDPR · HIPAA · zero data retention