Services · Engagement model

Production engagements for regulated AI workloads.

We work with platform teams shipping LLM agents to legal, medical, and financial workflows. Three engagement shapes, one methodology — calibrated evaluation suites, adversarial coverage, drift telemetry, and audit lineage your assessor already recognises.

Book a production reviewDownload the AI risk assessment

Engagement tiers

Three engagement shapes, sized to where you are in the rollout.

Pick a pilot to de-risk a single workflow, a production engagement to gate the whole portfolio, or a compliance audit when something is already shipping.

ENG-01

Pilot Evaluation

Single-domain readiness probe before a production rollout. Calibrate one workflow against your accuracy, latency, and cost targets.

Duration3–4 weeksScope1 workflow · 1 domain
  • Domain-specific evaluation suite, calibrated to your rubric
  • Baseline accuracy, latency, and cost-per-token report
  • Red-team probe across jailbreak and prompt-injection vectors
  • Findings deck with go/no-go recommendation
Most engagedENG-02

Production Engagement

End-to-end methodology transfer for teams shipping AI agents to business-critical processes. Evaluation, monitoring, and audit lineage in one engagement.

Duration10–14 weeksScopeMulti-agent · multi-domain
  • Versioned evaluation suites across every shipped workflow
  • Continuous drift, regression, and refusal-rate monitors
  • Adversarial corpus seeded from your telemetry, refreshed weekly
  • GDPR / HIPAA / SOC 2 control mappings with audit pack export
  • Dedicated platform engineer; weekly office hours
ENG-03

Compliance Audit

Standalone assessment for systems already in production. Evidence pack matched to the format your assessor expects, signed and time-stamped.

Duration6–8 weeksScopeLive workflows
  • Inference lineage reconstruction across the audit window
  • Control-mapping matrix for the assessor's framework
  • Gap analysis with remediation backlog and owner assignments
  • Re-test attestation against the closed gaps

Methodology

Five phases. Every artefact traceable to the rubric that produced it.

The phases below are the through-line of every engagement. Pilot tiers compress phases 04–05; audits enter at phase 03.

PHASE 01

Discovery

Week 1

Workflow inventory, domain rubric capture, threat model, and the accuracy / latency / cost envelope your stakeholders will sign against.

PHASE 02

Suite Calibration

Weeks 2–3

Domain-expert review of the test corpus, prompt families, and pass/fail thresholds. Every prompt traceable to the rubric that produced it.

PHASE 03

Adversarial Pass

Weeks 4–5

Jailbreak, prompt-injection, and exfiltration probes against staging. Regressions captured as permanent tests in your suite.

PHASE 04

Production Hand-off

Weeks 6–9

Drift monitors, regression gates, and audit lineage wired into your CI and observability stack. Runbooks for the on-call team.

PHASE 05

Continuous Review

Ongoing

Quarterly suite refresh, incident-driven corpus updates, and assessor-ready audit packs exported on demand.

Deliverables

What you own at hand-off.

Engagements end with the methodology in your repository, not in our heads. Every artefact below ships under your version control, gated by your CI.

  • EVAL

    Versioned evaluation suite

    A repository of test prompts, expected behaviours, and scoring rubrics — checked into your monorepo, owned by your team after hand-off.

  • MON

    Drift & regression monitors

    Statistical alerts on output distributions, latency envelopes, and cost-per-token, integrated with your existing on-call and incident tooling.

  • REDT

    Adversarial corpus

    Curated jailbreak, injection, and exfiltration prompts refreshed weekly from public disclosures and your own incident telemetry.

  • AUD

    Audit pack

    Signed, time-stamped lineage for every inference: prompt, retrieved context, model, parameters, decision, and reviewer — exportable for GDPR, HIPAA, and SOC 2 reviews.

  • RUN

    On-call runbooks

    Incident playbooks for refusal-rate inversion, p95 latency creep, distribution shift on embeddings, and provider-side weight refreshes.

Operating commitments

The numbers we sign engagements against.

3–5 mo

Time to production

From engagement kick-off to a workflow gated by drift monitors.

0

Customer data retention

Zero retention by default. Telemetry stays inside your perimeter.

weekly

Adversarial refresh

New jailbreak and injection patterns merged into your suite each week.

T-180d

Audit reachback

Inference lineage queryable up to 180 days before the assessment.

Procurement questions

What buyers ask before signing the SOW.

Which models and providers do you support?

We are vendor-neutral. Engagements have shipped against Claude, GPT, Gemini, and self-hosted open-source models — with cascading routing across providers when cost or latency targets demand it. We do not recommend models based on marketing material; the evaluation rubric decides.

Do you train on customer prompts or completions?

No. Customer data is never used to train models, and is never shared between engagements. Telemetry stays inside your perimeter; we operate with zero retention by default.

How does your work map to GDPR, HIPAA, and SOC 2?

Every engagement includes a control-mapping matrix to the assessor's framework, plus inference lineage signed and time-stamped at write. Audit packs export in the format your reviewer already expects.

Can you operate inside our VPC?

Yes. Production engagements run inside the customer's cloud account; only the evaluation suite source and tooling cross the boundary. Air-gapped variants are available for regulated workloads.

What does a successful pilot look like?

A pilot is successful when stakeholders can sign on the accuracy, latency, and cost envelope; when the adversarial probe surfaces no production-blocking failure modes; and when the team owns the suite well enough to extend it without us.

How is this different from an open-source eval framework?

Open-source frameworks give you scaffolding. We deliver the calibrated rubrics, the adversarial corpus, the drift monitors, and the audit lineage that production systems are graded on. The methodology is the deliverable; the tooling is how it ships.

Bring us a workflow you cannot ship until it can be audited.

A 30-minute production review covers your accuracy envelope, threat model, and audit reachback. You leave with a written gap-analysis whether or not we ever sign an SOW.

Book a production reviewBrowse solutions