Documentation v2.4.1
The TestML reference covers the whole lifecycle: domain-tuned evaluation suites, adversarial red-team runs, drift telemetry in live systems, and audit-grade compliance packs for GDPR, HIPAA, ISO 27001, and SOC 2. Every endpoint is versioned, every artefact is signed, every claim is reproducible.
Quick start
Three commands stand up a reproducible test bench against your existing model endpoints. No data leaves your VPC unless you explicitly opt in — every run is keyed to your tenant, signed, and exportable.
Pull the TestML client into your evaluation harness. The package ships with typed clients for OpenAI, Anthropic, AWS Bedrock, and self-hosted vLLM endpoints.
pip install testml # or npm install @testml/sdk
Define rubrics in YAML or in code. Each suite is versioned, tied to a domain (legal, medical, financial, technical), and reviewed against your acceptance thresholds.
testml suites init --domain financial
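For teams that prefer the in-code route, the sketch below shows what a suite definition might look like. The Suite and Rubric names, the import path, and the threshold fields are illustrative assumptions rather than the documented TestML schema; only the suite name, domain, and rubric names come from this page.

from testml import Suite, Rubric  # assumed import; the SDK's actual suite types may differ

# Hypothetical in-code equivalent of `testml suites init --domain financial`,
# with rubric thresholds standing in for your acceptance criteria.
claims_suite = Suite(
    name="financial-claims-v3",               # versioned suite identifier
    domain="financial",                       # legal | medical | financial | technical
    rubrics=[
        Rubric("factual", threshold=0.95),    # minimum pass rate before the suite regresses
        Rubric("jailbreak", threshold=1.00),  # zero tolerance for successful jailbreaks
        Rubric("refusal", threshold=0.98),
    ],
)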
Stream live inferences through the drift sentinel. Statistical monitors detect distribution shift, latency creep, and refusal-rate inversion before users see a regression.
testml monitors attach --env prod
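If you drive everything from Python rather than the CLI, attaching monitors might look like the sketch below. Only the CLI form above is documented here; the client.monitors resource, its attach signature, and the alert-routing fields are assumptions for illustration.

import os
from testml import TestML

# Hypothetical SDK mirror of `testml monitors attach --env prod`.
client = TestML(api_key=os.environ["TESTML_API_KEY"])
monitor = client.monitors.attach(
    env="prod",
    checks=["drift", "latency", "cost", "refusal_rate"],  # the signals named above
    alert={"channel": "slack", "target": "#ml-oncall"},   # PagerDuty, OTel, or a signed webhook route the same way
)
print(monitor.status)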
API reference
Every TestML run starts with one request. The platform records the prompt corpus, retrieved context, model parameters, decisions, reviewer signatures, and a cryptographic digest. The bundle exports as the audit pack your assessor already expects.
import os

from testml import TestML

# Authenticated against your org tenant.
# Zero data retention; signed audit lineage.
client = TestML(api_key=os.environ["TESTML_API_KEY"])

run = client.evaluations.create(
    suite="financial-claims-v3",
    endpoint="https://agents.acme.io/claims",
    rubrics=["factual", "jailbreak", "refusal"],
    budget={"latency_p95_ms": 600, "cost_usd": 0.25},
    compliance=["SOC2", "GDPR"],
)

print(run.summary)          # pass / regress / fail
print(run.audit_pack_url)   # signed, time-stamped
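A common follow-up is to gate CI on the run object returned above. The sketch below continues that snippet; the summary values come from its comments, and everything else is an assumption rather than a documented pattern.

import sys

# Hypothetical CI gate: fail the build on a regression and point reviewers at the audit pack.
if run.summary in ("regress", "fail"):
    print(f"Evaluation did not pass: {run.audit_pack_url}")
    sys.exit(1)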
Endpoints
A small, opinionated API. Evaluation, red-teaming, monitoring, audit, and orchestration map to a handful of resources — each one versioned, idempotent, and signed at the boundary.
/v1/evaluations: Run a versioned evaluation suite against a model endpoint. Returns per-rubric scores, traces, and an audit-grade artefact bundle that survives external review.
/v1/evaluations/{id}: Retrieve a complete evaluation run, including prompt corpus, retrieved context, model parameters, decisions, reviewer signatures, and SOC 2 control mappings.
/v1/redteam/runs: Launch an adversarial sweep against staging or production. Covers jailbreak families, prompt injection, data exfiltration, and refusal-bypass corpora updated weekly (see the request sketch after this list).
/v1/monitors: Attach drift, latency, cost, and refusal-rate monitors to a deployment. Alerts route to PagerDuty, Slack, OpenTelemetry, or a signed webhook of your choosing.
/v1/audit/{run}: Export a reviewer-ready audit pack: GDPR, HIPAA, ISO 27001, and SOC 2 control evidence with per-inference lineage, time-stamped and cryptographically signed.
/v1/agents/orchestrate: Coordinate multi-agent workflows with budget, latency, and refusal guards. Cascades cheaper models behind quality gates and routes failures to a human reviewer.
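To make the resource shapes concrete, here is a hedged sketch of a raw request to /v1/redteam/runs. The path and the attack families come from the list above; the base URL, payload fields, and response shape are assumptions, not a published contract.

import os
import requests

# Placeholder host; substitute your tenant's API base URL.
BASE_URL = "https://api.testml.example"

resp = requests.post(
    f"{BASE_URL}/v1/redteam/runs",
    headers={"Authorization": f"Bearer {os.environ['TESTML_API_KEY']}"},
    json={
        "endpoint": "https://agents.acme.io/claims",  # same target as the quick start
        "env": "staging",
        "families": ["jailbreak", "prompt_injection", "data_exfiltration", "refusal_bypass"],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # assumed to include a run identifier for later retrieval via /v1/audit/{run}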
Clients
We help platform teams stand up evaluation, red-teaming, drift monitoring, and audit lineage in three to five months — instead of the year a from-scratch buildout takes. Talk to an engineer about a 30-minute production review.