Brief us on a deployment
For Fortune 500 and scaling enterprises moving an AI agent into a business-critical workflow. We scope evaluation suites, red-team coverage, and drift gates around your stack.
contact@testml.org
// Channels
Each desk routes to a named engineer with a published SLA. We do not staff a generic sales inbox; a message sent to the wrong queue gets forwarded the same hour, not absorbed.
For Fortune 500 and scaling enterprises moving an AI agent into a business-critical workflow. We scope evaluation suites, red-team coverage, and drift gates around your stack.
contact@testml.org
Targeted prompt-injection, exfiltration, and role-collapse engagements. Bring a model name, a guardrail spec, and a threat surface. We scope a corpus and report against it.
redteam@testml.org
Auditor briefings, evidence packages, and BAA paperwork for regulated workflows. We map evaluator outputs to your control framework so review cycles stop blocking go-live.
compliance@testml.org
Reporters, industry analysts, and conference programmers. We ship technical primers and methodology notes; we do not field hype-cycle commentary or generic AI predictions.
press@testml.org
// What happens next
We work production-first. The pilot output is a working evaluator, not a slide deck — and the engagement ends when your engineers are running it without us.
A short intake describing the model, the workflow, the regulated surface, and what go-live decision the evaluation has to support. Three paragraphs is enough.
Same-day acknowledgement
An engineer drafts a domain-specific evaluation plan: accuracy, latency, jailbreak resistance, hallucination rate, and the regression gates we will run on every deploy.
Within 5 business days
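For a concrete picture of what a regression gate does, here is a minimal Python sketch. The metric names and thresholds are illustrative placeholders, not the gates we would scope for your workflow:

```python
from dataclasses import dataclass

# Illustrative gate table only; real thresholds are scoped per workflow.
THRESHOLDS = {
    "accuracy": ("min", 0.92),          # task accuracy on the eval corpus
    "p95_latency_ms": ("max", 1200),    # end-to-end latency budget
    "jailbreak_rate": ("max", 0.01),    # successful red-team prompts / total
    "hallucination_rate": ("max", 0.03),
}

@dataclass
class GateResult:
    metric: str
    value: float
    limit: float
    passed: bool

def run_gates(metrics: dict[str, float]) -> list[GateResult]:
    """Check a deploy candidate's eval metrics against fixed gates."""
    results = []
    for name, (direction, limit) in THRESHOLDS.items():
        value = metrics[name]
        passed = value >= limit if direction == "min" else value <= limit
        results.append(GateResult(name, value, limit, passed))
    return results

candidate = {"accuracy": 0.94, "p95_latency_ms": 980,
             "jailbreak_rate": 0.004, "hallucination_rate": 0.021}
results = run_gates(candidate)
for r in results:
    print(f"{r.metric}: {r.value} vs {r.limit} -> {'PASS' if r.passed else 'FAIL'}")
if not all(r.passed for r in results):
    raise SystemExit("regression gate failed: blocking this deploy")
```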
Replayable runs against a versioned prompt-and-policy snapshot. You see the dashboards we use internally — not a marketing PDF — with raw traces attached to every claim.
Pilot window: 2–3 weeks
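To make "versioned prompt-and-policy snapshot" concrete, one common pattern (a sketch under assumed names, not our exact tooling) is to content-address the configuration under test so any run can be replayed bit for bit:

```python
import hashlib
import json

def snapshot_id(prompt: str, policy: dict) -> str:
    """Content-address a prompt-and-policy pair so a run is replayable."""
    blob = json.dumps({"prompt": prompt, "policy": policy}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:12]

# Placeholder config; in an engagement this is your production setup.
prompt = "You are a claims-triage assistant. Escalate anything ambiguous."
policy = {"max_tool_calls": 3, "pii_redaction": True}

run_record = {
    "snapshot": snapshot_id(prompt, policy),  # pins the exact config under test
    "inputs": "eval-corpus-v1",               # hypothetical corpus label
    "traces": [],                             # raw traces attach here, per claim
}
print(run_record["snapshot"])  # same config -> same id -> replayable run
```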
Methodology, evaluators, and drift monitors transfer to your team. We co-own the first regression cycle, then step back. No staffing dependency, no vendor lock-in.
Methodology transferred
// Send-us-this
Free-form prose is fine. We just need these signals so the on-call engineer can scope a corpus before the first call; a structured sketch of the same fields follows the list.
Vendor and version, or open-weights checkpoint hash.
Claude Sonnet 4.5, GPT-4o-2024-08-06, Llama 3.1-70B-Instruct, etc.
What the agent actually decides or produces.
e.g. claims triage, contract clause extraction, clinician-facing summary.
Regulatory surface and data sensitivity class.
Legal, medical, financial, defense, EU consumer, etc.
What a wrong answer costs in production.
Reversibility, blast radius, compliance penalty, auditor scrutiny.
Decision date the evaluation has to support.
Pilot, phased rollout, board-level go-no-go review.
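If structured fields suit you better than prose, a hypothetical brief covering the same five signals might look like this; every value is a placeholder:

```python
# Hypothetical intake brief as structured fields; free-form prose works too.
intake = {
    "model": "Llama 3.1-70B-Instruct, checkpoint hash sha256:<fill in>",
    "workflow": "contract clause extraction for procurement review",
    "regulated_surface": "EU consumer data; confidential counterparty terms",
    "failure_cost": "missed liability clause, irreversible once signed",
    "decision_date": "phased-rollout go/no-go at end of next quarter",
}
```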
// Before you write
Do you retain or train on our data?
No. Customer artifacts live in tenant-isolated storage with zero data retention by default. We do not train, fine-tune, or share corpora across engagements. Retention windows and BAA paperwork are negotiated per workflow.
Can you work with an agent that is already in production?
Yes — production-first is the point. We instrument drift monitors and regression gates against the workflow you already have running, then back-fill an evaluation suite around the failure modes the telemetry surfaces.
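As one example of what such a drift monitor can compute, here is a population stability index over the agent's decision mix; the buckets, counts, and threshold are illustrative:

```python
import math
from collections import Counter

def population_stability_index(expected: Counter, observed: Counter) -> float:
    """PSI over the agent's decision buckets, a common categorical drift
    signal. Numbers and buckets here are illustrative, not a production
    monitor."""
    buckets = set(expected) | set(observed)
    e_total, o_total = sum(expected.values()), sum(observed.values())
    psi = 0.0
    for b in buckets:
        e = max(expected[b] / e_total, 1e-6)  # clamp so the log stays defined
        o = max(observed[b] / o_total, 1e-6)
        psi += (o - e) * math.log(o / e)
    return psi

baseline = Counter({"approve": 700, "escalate": 250, "deny": 50})
last_24h = Counter({"approve": 420, "escalate": 480, "deny": 100})
psi = population_stability_index(baseline, last_24h)
print(f"PSI = {psi:.3f}")
if psi > 0.2:  # conventional alert threshold; tuned per workflow in practice
    print("drift alert: decision mix has shifted against the baseline")
```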
Do you only evaluate standalone models?
No. Our methodology was built around multi-agent orchestration: routers, tool-callers, retrieval layers, and policy fences. We evaluate the system, not just the underlying model card.
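A toy illustration of what evaluating the system means in practice: assertions run over a whole agent trace rather than a lone completion. The trace schema, route names, and tool allowlist below are hypothetical:

```python
def evaluate_trace(trace: dict) -> dict[str, bool]:
    """System-level assertions over one agent trace. The schema, route names,
    and tool allowlist are hypothetical, not a real framework's API."""
    return {
        "router_on_registered_route":
            trace["route"] in {"contracts", "claims", "escalation"},
        "tools_within_allowlist":
            all(c["name"] in {"clause_search", "doc_fetch"}
                for c in trace["tool_calls"]),
        "policy_fence_applied":
            trace["policy"]["pii_redaction_applied"],
        "answer_grounded_in_retrieval":
            any(d in trace["answer_citations"]
                for d in trace["retrieved_doc_ids"]),
    }

trace = {  # one captured trace; field values are placeholders
    "route": "contracts",
    "tool_calls": [{"name": "clause_search", "args": {"doc_id": "C-1042"}}],
    "policy": {"pii_redaction_applied": True},
    "retrieved_doc_ids": ["C-1042"],
    "answer_citations": ["C-1042"],
}
print(evaluate_trace(trace))  # any False is a system-level failure mode
```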
Do you embed engineers with our team long-term?
We are not a staffing agency. Engagements transfer methodology, evaluators, and tooling to your engineers. The success criterion is your team running the regression cycle without us by the second quarter.
Can we evaluate on our own before engaging?
Download the standalone template from the Research index and run it yourself. If the output surfaces gaps you want a second opinion on, send us the filled-in brief and we will scope from there.