Effective Practices for Mocking LLM Responses During the Software Development Lifecycle
Learn best practices for mocking LLM responses during development to improve testing efficiency and reduce API costs while maintaining code quality.

Testing software that calls a Large Language Model is genuinely different from testing software that calls a database or a REST API. Databases return the same row for the same query. LLMs return something different every time — and that something can be brilliant, mediocre, or confidently wrong depending on factors no test harness controls. Add to that API costs that accumulate quickly across a CI run, round-trip latencies of 5–45 seconds per call, and aggressive rate limits from providers, and you have a category of dependency that will break any naive test suite.
Mocking is the pragmatic answer for most of this. This article covers when to reach for it, which technique fits which scenario, and what to pair it with so your test coverage stays honest.
Why LLMs Are Uniquely Hard to Test Against Directly
The core problem is non-determinism. A standard unit test passes or fails based on a deterministic assertion: function returns X, database row has value Y. That contract doesn't exist with LLMs. The same prompt produces different completions on different runs, different days, and after model updates — any of which can happen silently mid-sprint.
Cost compounds the problem. Running a suite of 200 integration tests where each test makes one GPT-4 call can cost $10–$40 per run (roughly $0.05–$0.20 per call, depending on prompt and completion length). Teams that try to run the real API in CI discover this fast, usually via a surprise billing spike. A single API call to an advanced model can cost close to $1 with long prompts, according to WireMock's analysis of MockGPT adoption.
Latency follows close behind. A test that waits 15–30 seconds for a real completion is a test developers stop running locally. Slow tests get skipped; skipped tests find no bugs.
When Mocking Is the Right Tool
Mocking is the right choice in four scenarios:
Developing library or integration code that sits around LLM calls. If you're building a prompt template engine, a response parser, a retry handler, or a streaming adapter, the LLM output itself is irrelevant to what you're testing. Mock it out completely.
Unblocking parallel development. If a teammate owns the LLM integration layer and you own downstream processing, a mock gives you a stable fixture to work against while the real integration is still being built.
CI/CD pipelines. Every pull request triggering live API calls is a cost and reliability antipattern. Flaky network responses and rate-limit errors should not fail your CI runs. Mocks eliminate both.
Regression testing for prompt parsing and extraction. When you want to verify that your parsing logic correctly handles edge cases — empty responses, malformed JSON, truncated outputs — you control exactly what the "LLM" returns and test your handling of it.
Mocking is _not_ the right choice for evaluating whether the LLM itself produces good outputs. That's a different problem requiring a different tool (covered in the golden datasets section below).
Technique 1: Mock at the Library Layer
The simplest approach is to replace the LLM client with a fake object inside your test. LangChain ships a FakeListLLM that returns responses from a predefined list in sequence:
```python
# In current LangChain releases FakeListLLM lives in langchain_community;
# older versions exposed it as langchain.llms.fake.FakeListLLM.
from langchain_community.llms.fake import FakeListLLM

# Responses are returned in order, one per invocation.
responses = ["The capital of France is Paris.", "I don't know."]
llm = FakeListLLM(responses=responses)

result = llm.invoke("What is the capital of France?")
assert result == "The capital of France is Paris."
```

For TypeScript/JavaScript with Jest or Vitest, you mock the OpenAI or LangChain client module directly:
```typescript
import { vi } from 'vitest';
import OpenAI from 'openai';

vi.mock('openai', () => ({
  default: vi.fn().mockImplementation(() => ({
    chat: {
      completions: {
        create: vi.fn().mockResolvedValue({
          choices: [{ message: { content: 'Mocked response' } }],
        }),
      },
    },
  })),
}));
```

One important nuance: avoid mocking internal methods of the library itself. LangChain's internals change frequently between minor versions. Mock at the boundary — the API call — not deep inside the framework.
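The same boundary principle applies in Python. A minimal sketch with the standard library's unittest.mock, assuming a hypothetical thin wrapper myapp.llm.complete() around the provider call and a run_pipeline() function as the code under test:

```python
from unittest.mock import patch

# Hypothetical module layout: myapp.llm.complete() wraps the provider API call,
# and myapp.pipeline.run_pipeline() is the downstream code actually under test.
from myapp.pipeline import run_pipeline

def test_pipeline_uses_completion():
    with patch("myapp.llm.complete", return_value="Mocked response") as mock_complete:
        result = run_pipeline("some prompt")
        mock_complete.assert_called_once()
        assert "Mocked response" in result
```

Patching your own wrapper instead of the vendor SDK keeps the test stable across client-library upgrades.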
Technique 2: Record and Replay with VCR
Library-layer mocks require you to handwrite responses. For integration tests where you want to verify your code works with _real_ LLM output structures, VCR-style record and replay (vcrpy in Python, named after the Ruby VCR library that pioneered the pattern) is a better fit.
VCR records actual HTTP interactions to YAML "cassette" files during an initial run, then replays them deterministically on subsequent runs. The test talks to a real API exactly once; after that, it's completely offline.
```python
import pytest

# Requires a VCR pytest plugin such as pytest-recording, which supplies
# the vcr marker and the --record-mode command-line option.
from myapp.summarizer import summarize  # hypothetical import; your code under test

@pytest.mark.vcr()
def test_summarize_document():
    result = summarize("Long document text here...")
    assert len(result) < 200
    assert "key points" in result.lower()
```

Running with `--record-mode=new_episodes` hits the real API and saves the cassette. Future runs skip the network entirely. The cassette file goes into version control, so the team shares reproducible fixtures automatically.
One critical security note: VCR cassettes contain the full request and response, including authorization headers. Override the default config to redact credentials before committing:
```python
# In conftest.py: pytest-recording picks up this fixture automatically.
@pytest.fixture(scope="module")
def vcr_config():
    return {"filter_headers": ["authorization", "x-api-key"]}
```

Technique 3: Network-Level Mocks
When you want to test without touching test code at all — for example, in end-to-end local development with a frontend team — a network-level mock server is the right tool. WireMock's MockGPT exposes a drop-in replacement for the OpenAI API at https://mockgpt.wiremockapi.cloud/v1. Point your OPENAI_BASE_URL environment variable there and your entire application stack runs without any API credentials or costs.
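With the official OpenAI Python client this is a one-line change. A quick sketch; the api_key value is a placeholder, substitute whatever key MockGPT provides:

```python
from openai import OpenAI

# Point the standard client at MockGPT instead of api.openai.com.
client = OpenAI(
    base_url="https://mockgpt.wiremockapi.cloud/v1",
    api_key="sk-mock-key",  # placeholder; no real credentials are billed
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)  # canned mock output, zero cost
```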
This approach is particularly useful for:
- Frontend developers who need a running backend with realistic AI responses
- QA engineers writing Playwright or Cypress tests against a full stack
- Chaos engineering: configuring the mock to return errors, delays, or malformed responses to test your application's error handling
MockLLM and similar local mock servers serve the same purpose for self-hosted scenarios. A local Docker container exposing an OpenAI-compatible API gives your team a fully air-gapped development environment with zero ongoing cost.
Technique 4: Parameterized Edge Cases with Faker
For testing how your application handles the _variety_ of real LLM outputs — verbose responses, terse responses, JSON with missing fields, responses in unexpected languages — parameterized mocks with a fake data library give you breadth without recording hundreds of cassettes:

```python
import pytest
from faker import Faker

from myapp.parsing import parse_llm_response  # hypothetical import; your parser under test

fake = Faker()

@pytest.mark.parametrize("response", [
    "",                               # empty response
    fake.paragraph(nb_sentences=1),   # very short
    fake.paragraph(nb_sentences=20),  # very long
    '{"key": "value"}',               # valid JSON
    '{"key":',                        # truncated JSON
])
def test_response_parser_handles_edge_cases(response):
    result = parse_llm_response(response)
    assert result is not None  # parser should degrade gracefully, never raise
```

This catches parsing failures and unhandled exceptions without a single real API call.
Wiring Mocks into CI/CD
The practical rule: no live LLM calls in CI unless the test is specifically an evaluation run. Structure your test suite in layers:
- Unit tests (always mocked): test prompt formatting, response parsing, retry logic, error handling
- Integration tests (VCR cassettes): test the full request-response cycle against recorded real output
- Evaluation runs (live, scheduled separately): assess output quality using frameworks like DeepEval or Promptfoo
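One lightweight way to enforce this separation is pytest markers, so each CI job selects only its layer. A minimal sketch, assuming the unit and evaluation markers are registered in your pytest configuration:

```python
import pytest

@pytest.mark.unit        # always mocked; run on every push with: pytest -m unit
def test_prompt_formatting():
    ...

@pytest.mark.vcr()       # replayed from a cassette; run on every pull request
def test_request_response_cycle():
    ...

@pytest.mark.evaluation  # live API; excluded in CI with: pytest -m "not evaluation"
def test_answer_quality():
    ...
```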
DeepEval integrates with pytest and can run evaluation metrics in the same CI infrastructure you already use. Promptfoo offers declarative YAML-based test configs and native CI/CD integration for running regression suites against prompt changes. Both allow you to gate deployments on quality thresholds — fail the pipeline if answer relevance drops below 0.8.
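As a rough illustration of the gating idea with DeepEval (API details vary by version; generate_answer is a stand-in for your real application entry point):

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

from myapp.llm import generate_answer  # hypothetical import

def test_answer_relevance_gate():
    question = "What is the capital of France?"
    test_case = LLMTestCase(
        input=question,
        actual_output=generate_answer(question),
    )
    # Fails the test, and therefore the pipeline, if relevance scores below 0.8.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.8)])
```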
Keep evaluation runs on a schedule (nightly or per release) rather than per pull request. They're expensive and slow by design; they're not a substitute for fast mock-based unit tests.
Beyond Mocking: Golden Datasets
Mocks verify that your _code_ behaves correctly given a controlled input. They can't tell you whether your _prompt_ produces good outputs. That's where golden datasets come in.
A golden dataset is a curated set of input/expected-output pairs representing the behaviors your application must get right. Each entry contains a question, the ideal answer characteristics, and sometimes explicit negative examples. Teams typically build these from real user interactions (with privacy considerations) or from domain expert curation.
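The schema is whatever your evaluation harness expects; one illustrative entry might look like this (field names are not a standard, just a reasonable shape):

```python
golden_entry = {
    "input": "How do I cancel my subscription?",
    "must_include": ["account settings", "billing"],  # ideal answer characteristics
    "must_not_include": ["refund guarantee"],         # explicit negative example
}
```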
Evaluation against a golden dataset runs outside the normal test suite: feed each input through your real LLM, score the output with automated metrics or an LLM judge, and track scores over time. Score drops between versions are your signal that a model update or prompt change degraded behavior.
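Stripped to its core, the loop is short. A sketch, where call_llm and score_output stand in for your model call and your metric or LLM judge:

```python
def evaluate(golden_dataset, call_llm, score_output):
    # Run every golden input through the real model and collect scores.
    scores = [score_output(call_llm(entry["input"]), entry) for entry in golden_dataset]
    # Persist the aggregate per run; a drop between versions flags a regression.
    return sum(scores) / len(scores)
```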
The two roles complement each other cleanly: mocks keep your test suite fast and your CI costs predictable; golden dataset evaluations tell you whether the actual model behavior is moving in the right direction.
Practical Starting Point
If you're just getting started, apply mocks in this order of priority:
- Swap the LLM client in unit tests first — eliminates cost and flakiness from the entire unit test layer
- Add VCR cassettes to integration tests — gives you real response structures without live network calls
- Set up a mock server for local full-stack development — unblocks frontend and QA work from the LLM integration timeline
- Build a small golden dataset for your most critical flows — gives you a quality regression signal as the application evolves
Mocking is not a compromise on test quality. It's the correct boundary between testing your application logic and evaluating your AI system. Conflating the two leads to either expensive, flaky tests or blind spots in coverage. Keep them separate and you get both fast, reliable CI and meaningful quality signals — which is the actual goal.