Documentation v2.4.1

Build, evaluate, and operate production AI agents — with the same rigour your auditors expect.

The TestML reference covers the whole lifecycle: domain-tuned evaluation suites, adversarial red-team runs, drift telemetry in live systems, and audit-grade compliance packs for GDPR, HIPAA, ISO 27001, and SOC 2. Every endpoint is versioned, every artefact is signed, every claim is reproducible.

Book a production review · Read the methodology

Quick start

From zero to a signed evaluation run in under an hour.

Three commands stand up a reproducible test bench against your existing model endpoints. No data leaves your VPC unless you explicitly opt in — every run is keyed to your tenant, signed, and exportable.

01

Install the SDK

Pull the TestML client into your evaluation harness. The package ships with typed clients for OpenAI, Anthropic, AWS Bedrock, and self-hosted vLLM endpoints.

pip install testml
# or
npm install @testml/sdk

02

Author an evaluation suite

Define rubrics in YAML or in code. Each suite is versioned, tied to a domain (legal, medical, financial, technical), and reviewed against your acceptance thresholds.

testml suites init --domain financial
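
As a rough illustration of what rubrics "in code" with acceptance thresholds might look like, the sketch below uses plain Python dataclasses. The `Rubric` and `Suite` classes, their field names, and the threshold values are hypothetical and are not the TestML SDK's types; the files generated by `testml suites init` may look quite different.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rubric:
    """One scored criterion with a minimum acceptable score (hypothetical)."""
    name: str
    min_score: float  # acceptance threshold in [0, 1]


@dataclass
class Suite:
    """A versioned, domain-tagged collection of rubrics (hypothetical)."""
    name: str
    version: int
    domain: str
    rubrics: list[Rubric] = field(default_factory=list)

    def failures(self, scores: dict[str, float]) -> list[str]:
        """Return the rubric names whose score is missing or below threshold."""
        return [
            r.name
            for r in self.rubrics
            if scores.get(r.name, 0.0) < r.min_score
        ]


suite = Suite(
    name="financial-claims",
    version=3,
    domain="financial",
    rubrics=[Rubric("factual", 0.95), Rubric("refusal", 0.90)],
)

# Illustrative scores only: "refusal" falls below its 0.90 threshold.
failed = suite.failures({"factual": 0.97, "refusal": 0.85})
print(failed)  # ['refusal']
```

Keeping thresholds next to the rubric definition means a change to an acceptance bar shows up as a reviewable diff in the suite itself.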

03

Wire monitors to production

Stream live inferences through the drift sentinel. Statistical monitors detect distribution shift, latency creep, and refusal-rate inversion before users see a regression.

testml monitors attach --env prod
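
This page does not say which statistics the drift sentinel computes. As one generic example of a distribution-shift monitor, the sketch below calculates the Population Stability Index (PSI) between a baseline sample and a live sample of some numeric signal, such as response length or a score. The `psi` function, the bin count, the smoothing constant, and the 0.25 alert level (a commonly cited rule of thumb) are assumptions for illustration, not TestML behaviour.

```python
import math


def psi(baseline: list[float], live: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of a numeric signal."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant baseline

    def proportions(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            # Clamp values outside the baseline range into the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Additive smoothing so empty bins do not produce log(0).
        return [(c + 0.5) / (len(sample) + 0.5 * bins) for c in counts]

    p, q = proportions(baseline), proportions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline = [i / 100 for i in range(100)]        # spread evenly over [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]   # concentrated in the upper half

print(psi(baseline, baseline))        # 0.0 for identical samples
print(psi(baseline, shifted) > 0.25)  # True: well above the rule-of-thumb level
```

A production monitor would typically compute this over sliding windows and alert only after the statistic stays elevated, to avoid paging on a single noisy batch.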

API reference

The evaluation primitive: a single, traceable call.

Every TestML run starts with one request. The platform records the prompt corpus, retrieved context, model parameters, decisions, reviewer signatures, and a cryptographic digest. The bundle exports as the audit pack your assessor already expects.

POST /v1/evaluations
import os

from testml import TestML

# Authenticated against your org tenant.
# Zero data retention; signed audit lineage.
client = TestML(api_key=os.environ["TESTML_API_KEY"])

run = client.evaluations.create(
    suite="financial-claims-v3",
    endpoint="https://agents.acme.io/claims",
    rubrics=["factual", "jailbreak", "refusal"],
    budget={"latency_p95_ms": 600, "cost_usd": 0.25},
    compliance=["SOC2", "GDPR"],
)

print(run.summary)        # pass / regress / fail
print(run.audit_pack_url) # signed, time-stamped
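
The "cryptographic digest" and the signing of the audit pack are not specified on this page. As a generic illustration of how a run record can be made tamper-evident, the sketch below hashes a canonical JSON encoding with SHA-256 and signs it with HMAC, using only the Python standard library. The record fields, the `sign_record` and `verify_record` helpers, and the shared-secret HMAC scheme are assumptions; TestML's actual scheme may differ (for example, it may use asymmetric signatures).

```python
import hashlib
import hmac
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so equal records hash equally."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_record(record: dict, key: bytes) -> dict:
    """Return a SHA-256 digest and an HMAC-SHA256 tag for the record (hypothetical helper)."""
    payload = canonical_bytes(record)
    return {
        "sha256": hashlib.sha256(payload).hexdigest(),
        "hmac_sha256": hmac.new(key, payload, hashlib.sha256).hexdigest(),
    }


def verify_record(record: dict, key: bytes, signature: dict) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = sign_record(record, key)
    return hmac.compare_digest(expected["hmac_sha256"], signature["hmac_sha256"])


key = b"demo-key-not-for-production"
record = {
    "suite": "financial-claims-v3",
    "rubrics": ["factual", "jailbreak", "refusal"],
    "summary": "pass",
}
sig = sign_record(record, key)

print(verify_record(record, key, sig))                         # True
print(verify_record({**record, "summary": "fail"}, key, sig))  # False: record was altered
```

Canonical serialization matters here: without sorted keys and fixed separators, two logically identical records could produce different digests.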

Endpoints

Six surfaces cover the lifecycle.

A small, opinionated API. Evaluation, red-teaming, monitoring, audit, and orchestration map to a handful of resources — each one versioned, idempotent, and signed at the boundary.

POST /v1/evaluations

Run a versioned evaluation suite against a model endpoint. Returns per-rubric scores, traces, and an audit-grade artefact bundle that survives external review.

GET /v1/evaluations/{id}

Retrieve a complete evaluation run, including prompt corpus, retrieved context, model parameters, decisions, reviewer signatures, and SOC 2 control mappings.

POST /v1/redteam/runs

Launch an adversarial sweep against staging or production. Covers jailbreak families, prompt injection, data exfiltration, and refusal-bypass corpora updated weekly.

POST /v1/monitors

Attach drift, latency, cost, and refusal-rate monitors to a deployment. Alerts route to PagerDuty, Slack, OpenTelemetry, or a signed webhook of your choosing.

GET /v1/audit/{run}

Export a reviewer-ready audit pack: GDPR, HIPAA, ISO 27001, and SOC 2 control evidence with per-inference lineage, time-stamped and cryptographically signed.

POST /v1/agents/orchestrate

Coordinate multi-agent workflows with budget, latency, and refusal guards. Cascades cheaper models behind quality gates and routes failures to a human reviewer.
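
The cascade behind the orchestration endpoint is described only at a high level. A minimal sketch of the general pattern, assuming each model is a plain callable that returns an answer and a quality score, could look like the following. The `cascade` function, the stub models, and the 0.8 gate are hypothetical and do not reflect the request or response shape of `/v1/agents/orchestrate`.

```python
from typing import Callable, Optional

# A model is any callable returning (answer, quality score in [0, 1]).
Model = Callable[[str], tuple[str, float]]


def cascade(prompt: str, models: list[Model], min_quality: float) -> Optional[str]:
    """Try models in order (cheapest first); return the first answer that passes the gate.

    Returns None when every model falls below the gate, which the caller
    can treat as "escalate to a human reviewer".
    """
    for model in models:
        answer, quality = model(prompt)
        if quality >= min_quality:
            return answer
    return None


def small_model(prompt: str) -> tuple[str, float]:
    return ("draft answer", 0.62)     # stub: cheap, but below the gate


def large_model(prompt: str) -> tuple[str, float]:
    return ("reviewed answer", 0.91)  # stub: costlier, passes the gate


result = cascade("Is claim 1042 covered?", [small_model, large_model], min_quality=0.8)
print(result)  # reviewed answer

escalated = cascade("Is claim 1042 covered?", [small_model], min_quality=0.8)
print(escalated is None)  # True: no model passed, so route to a human
```

Ordering the list from cheapest to most expensive keeps cost low on easy prompts while the gate protects quality on hard ones.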

Clients

Use the language your platform team already runs.

Python · 3.10+
TypeScript · 5.x
Go · 1.22+
Java · 17+
Ruby · 3.2+
REST · OpenAPI 3.1
OpenTelemetry export
Helm · Terraform

Production-grade reliability is a workflow, not a checkbox.

We help platform teams stand up evaluation, red-teaming, drift monitoring, and audit lineage in three to five months — instead of the year a from-scratch buildout takes. Talk to an engineer about a 30-minute production review.

Book a production review · Download the risk template · Explore the evaluation framework