Company · Methodology

We test AI the way regulated industries need it tested.

TestML is the evaluation, red-teaming, and monitoring layer between LLM capability and the production requirements your auditors, customers, and on-call engineers will actually hold you to. We work with CTOs, VPs of Engineering, and AI platform leads inside Fortune 500 and scaling enterprises shipping agents to business-critical processes.

Book a production reviewRead our methodology

Operating principles

Four convictions that decide what we ship and what we refuse.

Each conviction maps to a category of failure we have watched derail enterprise AI rollouts — and to the guardrail that replaces it.

01

Domain-specific evaluation, not vendor-marketing benchmarks

We build proprietary evaluation suites tuned to legal, medical, and financial risk surfaces. Generic leaderboards do not survive contact with a regulated workflow, so we measure the dimensions auditors will actually ask about.

  • Eval suites
  • Domain risk
  • Auditable

02

Red-teaming is a default, not an add-on

Jailbreak attempts, prompt injection, and data-leakage probes are baked into every engagement before a model reaches staging.

  • Red team
  • Injection
  • Leakage

03

Production-first, with drift detection in live systems

Continuous regression and drift monitoring run against your actual traffic, so model and prompt changes never silently degrade user-facing behaviour.

  • Drift
  • Regression
  • Observability

04

Compliance-by-design across GDPR, HIPAA, and SOC 2

Security guardrails, retention controls, and audit logging are configured before evaluation begins, not retrofitted at procurement review.

  • GDPR
  • HIPAA
  • SOC 2

The engagement

From workflow discovery to live monitoring in four phases.

Engagements are founder-led and stack-agnostic. We adapt to your model mix — Claude, GPT, open-source, custom — and to your existing CI, observability, and incident-response surfaces. Methodology and tooling transfer with us; you do not end up renting a dependency.

  1. 01 · Discovery

    Map the workflow

    Workshop the agent's intended use, the regulated obligations around it, and the failure modes a stakeholder would refuse to accept in production.

    OutputRisk register
  2. 02 · Evaluation

    Build the test harness

    Compose a domain-specific eval suite covering accuracy, latency, cost, factuality, and policy adherence — graded against your own gold-standard data.

    OutputEval suite v1
  3. 03 · Hardening

    Red-team and remediate

    Attack the system: jailbreaks, prompt injection, exfiltration, tool misuse. Every finding routes back into the harness so the regression never returns.

    OutputThreat report
  4. 04 · Production

    Monitor, audit, iterate

    Live drift detection, decision logging, and continuous regression testing keep the deployment trustworthy as models, prompts, and traffic evolve.

    OutputLive runbook

Why production-first

The cost of an AI failure in a regulated workflow is not a refund — it is the engagement.

Most AI tooling optimises for the demo. We optimise for the on-call rotation. When an agent answers a regulated question wrong in production, the consequences land on legal, security, and the engineer who shipped it — not on the vendor that benchmarked the model.

Our methodology starts from that asymmetry. Evaluation suites are written against your own gold-standard data. Red-teaming runs before staging, not after a customer escalation. Drift detection and regression tests live in the same pipeline as the model itself, so a prompt or weights change cannot quietly regress a metric an auditor depends on.

We share our findings, our tooling, and our threat models. The goal is for your team to leave an engagement able to operate the harness without us — not to keep you on a retainer.

Compliance posture

Security and regulatory guardrails configured before evaluation begins.

Every engagement runs on infrastructure aligned to the certifications below, with documented retention, encryption, and access controls. Compliance is a precondition for our methodology, not an afterthought at procurement review.

SOC 2 Type II

Continuous controls audit covering security, availability, and confidentiality of every evaluation pipeline.

ISO 27001

Information security management system aligned to ISO/IEC 27001, with documented risk treatment for every workload.

GDPR-compliant

Lawful basis, data minimisation, retention, and DSR workflows defined before any production data is processed.

HIPAA-compatible

Architecture for protected health information: BAAs, encryption-at-rest, audit logs, and de-identification on the eval surface.

Get a production review of the agent you cannot afford to ship wrong.

Founder-led engagement, scoped in a 30-minute conversation. We map the failure modes, the regulatory surface, and the evaluation gap before you commit to a pilot.

Book a production reviewRisk assessment template