Skip to content

What it takes to run AI in production

The cross-cutting concerns that decide whether an AI system holds up in production, and how we address each one.

Retrieval & grounding

Most production AI answers questions about your data, not the open web. Retrieval decides what the model sees, and grounding keeps the answer tied to it. We treat both as things to measure, not assume.

Where it breaks

  • Retrieval looks fine in a demo, then misses the document that mattered once real questions arrive.
  • Chunking and ranking are set once and never revisited, so quality drifts as the corpus grows.
  • The model fills gaps with confident inventions, because nothing tells it to admit it does not know.

What we put in place

Measured retrieval

Tests for what the system actually surfaces, so relevance is a number you can track.

Tuned indexing

Chunking, embeddings, and ranking tuned against your real questions, not defaults.

Grounded answers

Responses kept inside the retrieved context, with citations back to the source.

Honest gaps

When the support is not there, the system says so instead of inventing an answer.

What good looks like

  • Every answer traces back to a source a user can open and check.
  • Retrieval relevance is measured and trending up against a fixed question set.
  • Unsupported claims fall as the corpus grows, instead of creeping up.

Agent orchestration

An agent plans, calls tools, and acts over many steps rather than answering once. The hard part is the control around it: knowing what to do next, when to stop, and what to do when a step fails.

Where it breaks

  • The agent loops, retries, or wanders, burning time and budget with nothing to show for it.
  • A tool call fails and the agent carries on as if it succeeded, compounding the error.
  • It takes an action it should never have been allowed to, because nothing checked first.

What we put in place

Step control

Explicit limits on steps, budget, and time, so a run cannot quietly spiral.

Failure recovery

Defined behavior when a tool errors: retry, fall back, or stop and ask.

Reversible actions

High-stakes steps are gated, logged, and undoable where it matters.

Visible reasoning

Each decision and tool call is captured, so a run can be replayed and understood.

What good looks like

  • Runs finish within their step and cost budgets, predictably.
  • Failed tool calls are handled, not ignored or hidden.
  • Consequential actions happen only after the checks you defined.

Guardrails & safety

The moment a system reads untrusted text, from a user, a document, or the web, it can be steered. Guardrails decide what it accepts, what it will say, and what it is allowed to do.

Where it breaks

  • A crafted prompt walks the model past its instructions and out of bounds.
  • Untrusted content in a document quietly becomes an instruction the model obeys.
  • The system emits something unsafe or off-limits, because nothing checked the output.

What we put in place

Input handling

Untrusted input is treated as data, not commands, wherever it enters.

Output filtering

Responses are checked against the limits you set before they reach a user.

Injection defenses

Known prompt-injection and jailbreak patterns are tested and blocked.

Adversarial testing

We attack the system the way an opponent would, before someone else does.

What good looks like

  • Known injection and jailbreak attempts are caught, not just hoped against.
  • Unsafe or out-of-scope output is blocked before it ships, not after.
  • The defenses hold under adversarial testing, not only in the demo.

Observability

When an AI system behaves oddly, you need to see why. That means following one request through retrieval, prompting, generation, and any tools, with the inputs and steps visible.

Where it breaks

  • Dashboards show that something happened, but never why it happened.
  • A bad answer cannot be reproduced, because the inputs behind it were never captured.
  • An issue turns into an argument about what probably went wrong.

What we put in place

End-to-end traces

One request followed across retrieval, prompt, model, and tool calls.

Captured inputs

The inputs and intermediate steps behind each output, kept for later.

Readable signals

Metrics and logs your team can act on, not just collect.

Fast diagnosis

Enough context that a production issue is investigated, not guessed at.

What good looks like

  • Any output can be traced back to the exact inputs that produced it.
  • Time to diagnose a production issue is measured in minutes, not days.
  • Incidents are settled with evidence, not opinion.

Model lifecycle

Models are not fixed dependencies. Providers update them, performance drifts, and a better or cheaper option appears every few months. The lifecycle is choosing, versioning, and changing them on purpose.

Where it breaks

  • A provider updates the model and behavior shifts overnight, with no warning.
  • Nobody can say which model and prompt version produced last week's results.
  • A cheaper or stronger model ships, but switching feels too risky to attempt.

What we put in place

Pinned versions

Models and prompts pinned, so behavior only changes when you decide.

Benchmarked choices

Candidates compared on your own use cases before anything switches.

Tested upgrades

Provider and version changes run through checks before they reach production.

Clear trade-offs

Cost, latency, and quality laid out so a switch is a decision, not a gamble.

What good looks like

  • Production behavior changes only on a deliberate version bump.
  • Model changes are backed by evidence on your own cases.
  • Moving to a better or cheaper model is routine, not a project.

Evaluation

Evaluation is how you know whether the system is any good, and whether a change made it better or worse. Because outputs are non-deterministic, this takes more than a unit test.

Where it breaks

  • Changes ship on a hunch, and quality moves with no one noticing.
  • There is no shared definition of "good," so every review is an argument.
  • A regression only surfaces when a user complains.

What we put in place

Representative datasets

Test sets built from your real cases, not toy examples.

Meaningful scoring

Scores that reflect what matters for the task, with model-graded checks where useful.

Checks in the pipeline

Every change measured automatically, before it reaches users.

Quality over time

Trends tracked release to release, so drift is visible early.

What good looks like

  • "Good" is defined and measured, not argued.
  • Quality is tracked over time and moving the right way.
  • Regressions are caught before release, not reported after.

Validation

Evaluation tells you the system was good on a test set; validation watches the real thing. We use AI to check production outputs as they are generated, in real time.

Where it breaks

  • A launch-day score says the system is fine, but no one is watching it now.
  • A failure the test set never imagined shows up in production, unnoticed.
  • Bad outputs reach users because nothing checked them on the way out.

What we put in place

Runtime validators

AI-native checks that judge each production output as it is generated.

Live signals

Relevance, grounding, and safety scored on real traffic, not just offline.

Hold and flag

A response can be held, flagged, or escalated the moment it looks wrong.

Beyond the test set

Catches the failures a fixed evaluation never anticipated.

What good looks like

  • Production outputs are checked as they happen, not just at launch.
  • Failures surface in real time, not in next month's review.
  • Bad responses are caught on the way out, before a user sees them.

Governance

Governance decides what an AI system is permitted to do and who answers for it: which use cases are approved, how much autonomy an agent has, and what needs a human in the loop.

Where it breaks

  • A model ships into a use case nobody formally approved.
  • An agent has more autonomy than anyone intended, and no one signed off.
  • When something goes wrong, there is no record of who decided what.

What we put in place

Approved use cases

A clear line between what the system may do and what it may not.

Autonomy limits

How far an agent can act alone, and where a human must approve.

Enforced in code

Controls built into the system, not written in a document nobody reads.

Decision records

Who approved what, captured so it can be reviewed later.

What good looks like

  • Every live use case has been approved by someone accountable.
  • Autonomy and approvals match what was actually signed off.
  • Decisions are on record and can be reviewed after the fact.

Compliance

Compliance is meeting the obligations that apply to your AI: privacy, data residency, retention, and emerging rules like the EU AI Act that ask for documentation and evidence.

Where it breaks

  • An obligation is discovered late, after the system is already live.
  • A review arrives and the evidence has to be reconstructed from memory.
  • Data is handled or retained in a way a regulator would not accept.

What we put in place

Mapped obligations

The rules that actually apply to your domain, turned into concrete requirements.

Controls, not promises

Each obligation backed by a control built into the system.

Evidence on demand

The records a review needs, produced rather than reconstructed.

Designed in early

Compliance built into how the system works, not retrofitted under pressure.

What good looks like

  • Each obligation maps to a control you can point to.
  • The evidence a review needs already exists.
  • The system can show it stayed within approved bounds.

Want to go deep on one of these?