Contact · Production review intake

Brief our team on the AI workflow you need to take to production.

Most useful conversations start with one paragraph: the model, the workflow, the regulated surface, and the decision the evaluation has to support. We acknowledge every brief on the same business day and scope a domain-specific suite within five.

Send a brief · Risk assessment template

// Channels

Pick the queue that matches your brief.

Each desk routes to a named engineer with a published SLA. We do not staff a generic sales inbox; a brief sent to the wrong queue is forwarded within the hour, not absorbed.

01 · Production review · 1 business day

Brief us on a deployment

For Fortune 500 and scaling enterprises moving an AI agent into a business-critical workflow. We scope evaluation suites, red-team coverage, and drift gates around your stack.

contact@testml.org
02 · Red-team & jailbreak · 3 business days

Adversarial intake

Targeted prompt-injection, exfiltration, and role-collapse engagements. Bring a model name, a guardrail spec, and a threat surface. We scope a corpus and report against it.

redteam@testml.org
03 · Compliance & audit · 2 business days

GDPR · HIPAA · SOC 2

Auditor briefings, evidence packages, and BAA paperwork for regulated workflows. We map evaluator outputs to your control framework so review cycles stop blocking go-live.

compliance@testml.org
04 · Press & analyst · 5 business days

Briefings and embargo

Reporters, industry analysts, and conference programmers. We ship technical primers and methodology notes; we do not field hype-cycle commentary or generic AI predictions.

press@testml.org

// What happens next

Four steps from brief to a regression loop your team owns.

We work production-first. The pilot output is a working evaluator, not a slide deck — and the engagement ends when your engineers are running it without us.

  1. Step · 01

    You send a brief

    A short intake describing the model, the workflow, the regulated surface, and what go-live decision the evaluation has to support. Three paragraphs is enough.

    Same-day acknowledgement

  2. Step · 02

    We scope a suite

    An engineer drafts a domain-specific evaluation plan: accuracy, latency, jailbreak resistance, hallucination rate, and the regression gates we will run on every deploy (a sketch of one such gate follows this list).

    Within 5 business days

  3. Step · 03

    We run the corpus

    Replayable runs against a versioned prompt-and-policy snapshot. You see the dashboards we use internally — not a marketing PDF — with raw traces attached to every claim.

    Pilot window: 2–3 weeks

  4. Step · 04

    We hand off the loop

    Methodology, evaluators, and drift monitors transfer to your team. We co-own the first regression cycle, then step back. No staffing dependency, no vendor lock-in.

    Methodology transferred
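
Step 02 mentions regression gates that run on every deploy. As a minimal sketch of what one such gate can look like, written in Python with placeholder metric names and thresholds (an illustration of the idea, not our shipped tooling):

  from dataclasses import dataclass

  @dataclass
  class SuiteResult:
      accuracy: float              # fraction of graded tasks passed
      p95_latency_ms: float        # 95th-percentile end-to-end latency
      jailbreak_resistance: float  # fraction of adversarial prompts safely refused
      hallucination_rate: float    # fraction of answers with unsupported claims

  # One gate per metric: a floor the score must stay above, or a ceiling it
  # must stay below. The thresholds here are placeholders, not recommendations.
  GATES = {
      "accuracy":             ("floor",   0.92),
      "jailbreak_resistance": ("floor",   0.99),
      "p95_latency_ms":       ("ceiling", 1500.0),
      "hallucination_rate":   ("ceiling", 0.02),
  }

  def gate_deploy(result: SuiteResult) -> list[str]:
      """Return the gates this run fails; an empty list means the deploy may proceed."""
      failures = []
      for metric, (kind, limit) in GATES.items():
          value = getattr(result, metric)
          ok = value >= limit if kind == "floor" else value <= limit
          if not ok:
              failures.append(f"{metric}={value:g} breaks {kind} {limit:g}")
      return failures

A run that returns an empty failure list clears the gate; anything else blocks the deploy and points at the metric that regressed.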

// Send us this

Five fields make a brief useful.

Free-form prose is fine. We just need these signals so the on-call engineer can scope a corpus before the first call; an illustrative filled-in brief follows the list.

  • model

    Vendor and version, or open-weights checkpoint hash.

    Claude Sonnet 4.5, GPT-4o-2024-08-06, Llama 3.1-70B-Instruct, etc.

  • workflow

    What the agent actually decides or produces.

    e.g. claims triage, contract clause extraction, clinician-facing summary.

  • domain

    Regulatory surface and data sensitivity class.

    Legal, medical, financial, defense, EU consumer, etc.

  • stakes

    What a wrong answer costs in production.

    Reversibility, blast radius, compliance penalty, auditor scrutiny.

  • go_live

    Decision date the evaluation has to support.

    Pilot, phased rollout, board-level go-no-go review.
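
If it helps to see the five fields in one place, here is a single filled-in brief, written as a Python dict purely for compactness. Every value is invented for the example; free-form prose in an email works just as well.

  # Illustrative intake brief covering the five fields above.
  # Every value here is invented for the example.
  brief = {
      "model":    "GPT-4o-2024-08-06",
      "workflow": "contract clause extraction for outside-counsel review",
      "domain":   "legal; EU consumer data; confidentiality class: restricted",
      "stakes":   "a missed indemnity clause is irreversible post-signature; "
                  "auditor scrutiny on every extraction",
      "go_live":  "board-level go/no-go review, next quarter",
  }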

// Before you write

Five answers we send most often.

  • Do you train on our prompts, traces, or evaluation outputs?

    No. Customer artifacts live in tenant-isolated storage with zero data retention by default. We do not train, fine-tune, or share corpora across engagements. Retention windows and BAA paperwork are negotiated per workflow.

  • We are mid-deployment. Can we plug TestML into a live system?

    Yes — production-first is the point. We instrument drift monitors and regression gates against the workflow you already have running, then back-fill an evaluation suite around the failure modes the telemetry surfaces. A minimal drift-monitor sketch follows these answers.

  • Do you only test single-model deployments?

    No. Our methodology was built around multi-agent orchestration: routers, tool-callers, retrieval layers, and policy fences. We evaluate the system, not just the underlying model card.

  • Will you replace our internal ML team?

    We are not a staffing agency. Engagements transfer methodology, evaluators, and tooling to your engineers. The success criterion is your team running the regression cycle without us by the second quarter.

  • What if we just need an AI risk assessment?

    Download the standalone template from the Research index and run it yourself. If the output surfaces gaps you want a second opinion on, send us the filled-in brief and we will scope from there.
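
For the mid-deployment question above: a drift monitor can start as small as a rolling comparison of live evaluator scores against the pilot baseline. A hypothetical sketch, with names and thresholds chosen for illustration:

  from collections import deque

  class DriftMonitor:
      """Flag drift when a rolling mean of live scores falls below the pilot baseline."""

      def __init__(self, baseline_mean: float, tolerance: float, window: int = 500):
          self.baseline = baseline_mean       # mean evaluator score from the pilot
          self.tolerance = tolerance          # allowed absolute degradation
          self.scores = deque(maxlen=window)  # rolling window of live scores

      def observe(self, score: float) -> bool:
          """Record one live evaluator score; return True when drift is flagged."""
          self.scores.append(score)
          if len(self.scores) < self.scores.maxlen:
              return False  # not enough live traffic yet to call drift
          live_mean = sum(self.scores) / len(self.scores)
          return live_mean < self.baseline - self.tolerance

  # Usage: seed with the pilot baseline, then feed every scored live trace.
  monitor = DriftMonitor(baseline_mean=0.94, tolerance=0.03)

Real monitors segment by workflow and watch more than the mean, but the shape is the same: a baseline, a window, and an alert when live quality degrades past tolerance.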

// Open the conversation

Three paragraphs is enough to start a real evaluation.

We acknowledge every production-review brief on the same business day. If you would rather kick the tyres first, the AI risk assessment template runs on your stack, by you, with no further contact.

Email a brief · Read methodology