Testing large language models (LLMs) presents unique challenges that diverge from traditional software testing paradigms. One significant issue is the unpredictability and variability of LLM responses: unlike deterministic systems, LLMs can produce different outputs for the same input, making it difficult to establish fixed expectations for test results. This variability complicates both Test-Driven Development (TDD) and Behavior-Driven Development (BDD), as tests may not consistently yield the same outcomes. Moreover, LLMs require substantial computational resources. Running tests against these models can be prohibitively expensive when fast responses are needed, or painfully slow when more cost-effective, open-source models are run locally.
The trade-offs in LLM testing are multifaceted. To achieve rapid response times, we might opt for cloud-based solutions with robust infrastructure, but this incurs high costs, especially for continuous integration and delivery pipelines. Conversely, using open-source models locally can significantly reduce costs but at the expense of speed and scalability, which may hinder development velocity. Another trade-off is the balance between test coverage and practicality. Comprehensive testing, including unit tests, integration tests, and end-to-end tests, is essential for reliable production deployment. However, the complexity and resource intensity of LLMs often necessitate compromises in test scope and frequency. Implementing effective mocking strategies can mitigate some of these issues by simulating LLM responses, allowing for more controlled and repeatable tests, but this approach requires careful design to ensure it accurately reflects real-world behavior.
Mocking LLM responses is often ineffective due to the inherent non-deterministic nature of these models. Unlike traditional software systems that produce consistent outputs for given inputs, LLMs can generate a wide range of responses, even when queried with the same prompt. This variability poses significant challenges for mocking methods, which typically rely on predictable and repeatable outputs.
Image from https://www.shakudo.io/blog/evaluating-llm-performance
Instead of relying on mocks, there are more robust and effective methods for testing, benchmarking, and evaluating LLMs, such as golden-dataset benchmarks, cross-model evaluations, and probabilistic assertions that check properties of an answer rather than its exact wording.
Human-in-the-loop evaluation and continuous monitoring are further techniques that should be used in production. These alternative methods provide a more accurate and holistic assessment of LLM capabilities, ensuring that models perform reliably in diverse and unpredictable real-world scenarios.
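Non-deterministic outputs break exact string assertions, which is what makes probabilistic or semantic checks attractive. As a small illustration (the helper below is hypothetical, not from any library), a test can assert on properties of an answer, such as required keywords and length bounds, rather than its exact wording:

```python
def assert_response_properties(text, required_keywords=(), min_words=1, max_words=200):
    """Check semantic properties of an LLM answer instead of exact equality."""
    words = text.split()
    assert min_words <= len(words) <= max_words, f"unexpected length: {len(words)} words"
    lowered = text.lower()
    for keyword in required_keywords:
        assert keyword.lower() in lowered, f"missing expected keyword: {keyword!r}"

# Two differently-worded answers to the same prompt both pass the same check:
assert_response_properties("The capital of France is Paris.", required_keywords=["Paris"])
assert_response_properties("Paris is France's capital city.", required_keywords=["Paris"])
```

The same idea scales up to embedding-similarity checks or LLM-as-judge scoring; the point is that the assertion tolerates rewording while still catching wrong answers.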
Mocking LLM responses can be highly beneficial in specific scenarios, particularly when the goal is to enhance development efficiency and reduce costs. Here are some key situations where mocking LLM responses is advantageous:
When building tools such as observability and tracing libraries for LLMs, testing various data responses from multiple providers can be impractical and expensive. Mocking responses allows developers to simulate a wide range of scenarios without incurring the cost of actual LLM invocations. This approach ensures that the library can handle diverse outputs and edge cases, facilitating thorough testing and high development velocity.
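As a sketch of what this can look like (the payload shapes follow the OpenAI chat-completion format; the truncated and empty cases are assumptions about what a provider might return), a tracing library can be exercised against hand-written edge-case payloads without any live calls:

```python
# Edge-case payloads in the OpenAI chat-completion shape, for exercising
# an observability/tracing library without real API invocations.
EDGE_CASE_RESPONSES = [
    {   # normal completion
        "choices": [{"message": {"role": "assistant", "content": "All good."},
                     "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 5, "completion_tokens": 3, "total_tokens": 8},
    },
    {   # output cut off by the token limit
        "choices": [{"message": {"role": "assistant", "content": "Truncated answ"},
                     "finish_reason": "length"}],
        "usage": {"prompt_tokens": 5, "completion_tokens": 256, "total_tokens": 261},
    },
    {   # empty content: the library must not crash on this
        "choices": [{"message": {"role": "assistant", "content": ""},
                     "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 5, "completion_tokens": 0, "total_tokens": 5},
    },
]

def total_tokens_traced(payloads):
    """Toy 'tracing' aggregation: sum token usage across recorded responses."""
    return sum(p["usage"]["total_tokens"] for p in payloads)

print(total_tokens_traced(EDGE_CASE_RESPONSES))  # 274
```

Feeding a fixture list like this through the library under test covers diverse provider behavior deterministically and at zero API cost.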
In collaborative development environments, dependencies between team members can cause delays. For instance, if a UI developer is waiting for an API to be completed, providing a working API with mocked LLM responses can keep the project moving forward. This allows the team to develop and test UI components in parallel, ensuring timely completion of tasks.
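A stub backend for the UI team only needs to return a body with the right shape. The helper below is a sketch that builds an OpenAI-style chat.completion payload with dummy values, which any placeholder endpoint could serve:

```python
import time
import uuid

def make_mock_completion(content, model="gpt-3.5-turbo"):
    """Build an OpenAI-style chat.completion payload with dummy metadata."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:24]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": content},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }

body = make_mock_completion("Hi there! How can I assist you today?")
print(body["choices"][0]["message"]["content"])  # the canned reply
```

Because the payload matches the real response schema, the UI code written against the stub should need no changes when the live API lands.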
Continuous integration and continuous delivery (CI/CD) pipelines often involve frequent testing of new changes. Running tests on live LLMs for each change can be prohibitively expensive, especially for large teams. Using mocked responses for regression tests helps streamline the CI/CD process, saving significant costs while ensuring that unchanged code does not repeatedly incur unnecessary expenses.
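One common pattern, sketched below with unittest (the RUN_LIVE_LLM_TESTS flag name is an invented convention), is to gate live-LLM tests behind an opt-in environment variable so routine CI runs default to cheap mocked tests:

```python
import os
import unittest

# Opt-in flag: live-LLM tests run only when explicitly requested,
# so routine CI runs stay cheap and deterministic.
RUN_LIVE = os.getenv("RUN_LIVE_LLM_TESTS") == "1"

class TestChatFeature(unittest.TestCase):
    @unittest.skipUnless(RUN_LIVE, "set RUN_LIVE_LLM_TESTS=1 to hit the real API")
    def test_live_completion(self):
        # Placeholder for a test that would call the real LLM provider
        self.assertTrue(True)

    def test_with_mocked_response(self):
        # Always runs in CI: uses canned data instead of the API
        mocked = "This is a mocked response."
        self.assertIn("mocked", mocked)
```

A nightly or pre-release pipeline can then export the flag to run the expensive live checks, while every pull request only pays for the mocked suite.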
LLMs can introduce inconsistency in CI/CD workflows, resulting in flaky tests. By isolating LLM evaluations and using mock responses for user flow and integration testing, teams can reduce the impact of these inconsistencies. Incorporating human-in-the-loop testing and review for final benchmarking ensures that the critical aspects of LLM performance are accurately assessed without compromising the stability of the CI/CD pipeline.
By strategically using mock LLM responses, development teams can maintain high productivity, control costs, and ensure robust testing practices, ultimately leading to more reliable and efficient development cycles.
Image from https://www.testevolve.com/blog/the-testing-pyramid-an-essential-strategy-for-agile-testing
That’s enough theory, show me some code!
When mocking, it’s crucial not to reach into the internal implementation of libraries or frameworks, as internals are prone to changes that can break your pipeline. Instead, mock at stable boundaries: either the public API surface you consume or, closer to the metal, the network requests themselves. Here are two coding examples illustrating these principles:
Mocking the response directly from the OpenAI library can help you test your application logic without making actual API calls. Below is an example using Python’s unittest.mock module to mock an OpenAI chat completion response.
import unittest
from unittest.mock import patch, AsyncMock
from openai import AsyncOpenAI
import asyncio


class OpenAIClient:
    def __init__(self, api_key="dummy_key"):
        self.client = AsyncOpenAI(api_key=api_key)

    async def get_ai_response(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


class TestOpenAIMock(unittest.TestCase):
    def setUp(self):
        self.client = OpenAIClient()

    @patch('openai.resources.chat.completions.AsyncCompletions.create')
    def test_chat_completion(self, mock_create):
        # Set up the mock response
        mock_response = AsyncMock()
        mock_response.choices = [
            AsyncMock(message=AsyncMock(content="This is a mocked response."))
        ]
        mock_create.return_value = mock_response

        # Test your application code
        result = asyncio.run(self.client.get_ai_response("Hello, AI!"))
        self.assertEqual(result, "This is a mocked response.")

        # Verify that the mock was called with the expected arguments
        mock_create.assert_called_once_with(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hello, AI!"}]
        )
Mocking the network request itself is another effective way to test your application. One caveat: the OpenAI Python SDK makes its HTTP calls through httpx, so tools that only intercept requests, such as the responses library, will not catch them. This example therefore uses the respx library to mock the HTTP call to the OpenAI API.
import unittest
import json
import asyncio
import respx
from httpx import Response
from openai import AsyncOpenAI


class OpenAIClient:
    def __init__(self, api_key="dummy_key"):
        self.client = AsyncOpenAI(api_key=api_key)

    async def get_ai_response(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


class TestOpenAIMock(unittest.TestCase):
    def setUp(self):
        self.client = OpenAIClient()

    @respx.mock
    def test_openai_api_call(self):
        # Mock the API endpoint
        api_url = "https://api.openai.com/v1/chat/completions"

        # Prepare the mock response body
        mock_response = {
            "id": "chatcmpl-123",
            "object": "chat.completion",
            "created": 1677652288,
            "model": "gpt-3.5-turbo-0613",
            "choices": [{
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": "This is a mocked API response."
                },
                "finish_reason": "stop"
            }],
            "usage": {
                "prompt_tokens": 9,
                "completion_tokens": 12,
                "total_tokens": 21
            }
        }

        # Register the mock route
        route = respx.post(api_url).mock(
            return_value=Response(200, json=mock_response)
        )

        # Run the async function in a synchronous context
        async def run_test():
            result = await self.client.get_ai_response("Hello, AI!")
            self.assertEqual(result, "This is a mocked API response.")
        asyncio.run(run_test())

        # Verify that the mock route was called exactly once
        self.assertEqual(route.call_count, 1)

        # Verify the request payload
        request_payload = json.loads(route.calls.last.request.content)
        self.assertEqual(request_payload['messages'][0]['content'], "Hello, AI!")
        self.assertEqual(request_payload['model'], "gpt-3.5-turbo")
By mocking at these levels, you can effectively test your application while avoiding the pitfalls associated with changes in underlying libraries or frameworks. This approach ensures your tests remain stable and reliable over time.
Generating a comprehensive set of mock responses for your LLM is essential to ensure thorough testing and validation of your application. These mock responses should cover a wide range of scenarios, including both typical interactions and edge cases, so that your application can handle various situations effectively. You can use a fixed set of responses, record real API calls, or leverage libraries like faker to introduce randomization. Here are three examples demonstrating these methods:
Using a fixed set of responses allows for consistent and repeatable testing. Simply use a dictionary or list to store the responses, and select the corresponding item based on the incoming request. Here’s an example:
import unittest
from unittest.mock import patch, AsyncMock
from openai import AsyncOpenAI


class MockOpenAI:
    def __init__(self):
        self.responses = {
            "Hello": "Hi there! How can I assist you today?",
            "What's the weather?": "I'm sorry, I don't have real-time weather information. You might want to check a weather app or website for the most up-to-date forecast.",
            "Tell me a joke": "Why don't scientists trust atoms? Because they make up everything!",
            "default": "I'm not sure how to respond to that. Can you please rephrase or ask something else?"
        }

    async def chat_completion(self, model, messages):
        last_message = messages[-1]['content'] if messages else ""
        response = self.responses.get(last_message, self.responses['default'])
        mock_response = AsyncMock()
        mock_response.choices = [AsyncMock(message=AsyncMock(content=response))]
        return mock_response


class OpenAIClient:
    def __init__(self, api_key="dummy_key"):
        self.client = AsyncOpenAI(api_key=api_key)

    async def get_response(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


# IsolatedAsyncioTestCase (Python 3.8+) runs async test methods natively,
# so no extra event-loop wrapper class is needed.
class TestOpenAIMock(unittest.IsolatedAsyncioTestCase):
    def setUp(self):
        self.mock_openai = MockOpenAI()
        self.client = OpenAIClient()

    @patch('openai.resources.chat.completions.AsyncCompletions.create')
    async def test_known_responses(self, mock_create):
        mock_create.side_effect = self.mock_openai.chat_completion
        responses = [
            await self.client.get_response("Hello"),
            await self.client.get_response("What's the weather?"),
            await self.client.get_response("Tell me a joke")
        ]
        self.assertEqual(responses[0], "Hi there! How can I assist you today?")
        self.assertIn("I don't have real-time weather information", responses[1])
        self.assertEqual(responses[2], "Why don't scientists trust atoms? Because they make up everything!")
The faker library can be used to create dynamic and diverse data for testing. It’s not especially useful for unit tests, but if you are building a mock server to facilitate development and e2e testing, it comes in handy for adding randomization to your API responses. Here’s an example:
import unittest
from unittest.mock import patch, AsyncMock
from openai import AsyncOpenAI
from faker import Faker
import random


class MockOpenAI:
    def __init__(self):
        self.faker = Faker()

    async def chat_completion(self, model, messages):
        prompt = messages[-1]['content'] if messages else ""
        response = self.generate_response(prompt)
        mock_response = AsyncMock()
        mock_response.choices = [AsyncMock(message=AsyncMock(content=response))]
        return mock_response

    def generate_response(self, prompt):
        if "name" in prompt.lower():
            return f"The name you're asking about is {self.faker.name()}."
        elif "address" in prompt.lower():
            return f"The address you're looking for is {self.faker.address()}."
        elif "company" in prompt.lower():
            return f"The company you're inquiring about is {self.faker.company()}."
        elif "date" in prompt.lower():
            return f"The date you're asking about is {self.faker.date()}."
        else:
            return self.faker.paragraph(nb_sentences=random.randint(1, 3))


class OpenAIClient:
    def __init__(self, api_key="dummy_key"):
        self.client = AsyncOpenAI(api_key=api_key)

    async def get_response(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


# IsolatedAsyncioTestCase runs the async test methods directly,
# replacing the manual asyncio.run() wrapper class.
class TestOpenAIFakerMock(unittest.IsolatedAsyncioTestCase):
    def setUp(self):
        self.mock_openai = MockOpenAI()
        self.client = OpenAIClient()

    @patch('openai.resources.chat.completions.AsyncCompletions.create')
    async def test_name_response(self, mock_create):
        mock_create.side_effect = self.mock_openai.chat_completion
        response = await self.client.get_response("What's the person's name?")
        self.assertIn("The name you're asking about is", response)

    @patch('openai.resources.chat.completions.AsyncCompletions.create')
    async def test_address_response(self, mock_create):
        mock_create.side_effect = self.mock_openai.chat_completion
        response = await self.client.get_response("Can you give me an address?")
        self.assertIn("The address you're looking for is", response)

    @patch('openai.resources.chat.completions.AsyncCompletions.create')
    async def test_company_response(self, mock_create):
        mock_create.side_effect = self.mock_openai.chat_completion
        response = await self.client.get_response("Tell me about a company.")
        self.assertIn("The company you're inquiring about is", response)

    @patch('openai.resources.chat.completions.AsyncCompletions.create')
    async def test_date_response(self, mock_create):
        mock_create.side_effect = self.mock_openai.chat_completion
        response = await self.client.get_response("What's the date?")
        self.assertIn("The date you're asking about is", response)
If you don’t know the exact API response but still want to mock it to add stability to your tests, VCR.py can record real API calls and replay them during tests, ensuring consistent and repeatable results. Here’s an example:
import vcr
from openai import AsyncOpenAI
import asyncio
import unittest
import os

# Configure VCR (named my_vcr to avoid shadowing the vcr module itself)
my_vcr = vcr.VCR(
    cassette_library_dir='fixtures/vcr_cassettes',
    record_mode='once',
    match_on=['uri', 'method'],
    filter_headers=['authorization'],  # keep the API key out of recorded cassettes
)


class OpenAIClient:
    def __init__(self):
        self.client = AsyncOpenAI(api_key=os.getenv('OPENAI_API_KEY'))

    async def get_chat_completion(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


class TestOpenAIWithVCR(unittest.TestCase):
    def setUp(self):
        self.client = OpenAIClient()

    @my_vcr.use_cassette('openai_chat_completion_async.yaml')
    def test_chat_completion(self):
        prompt = "What is the capital of France?"
        response = asyncio.run(self.client.get_chat_completion(prompt))
        self.assertIsInstance(response, str)
        self.assertIn("Paris", response)

    @my_vcr.use_cassette('openai_chat_completion_python_async.yaml')
    def test_chat_completion_python(self):
        prompt = "Write a Python function to calculate the factorial of a number."
        response = asyncio.run(self.client.get_chat_completion(prompt))
        self.assertIsInstance(response, str)
        self.assertIn("def factorial", response)
By using fixed responses, dynamic data generation with faker, or recording real API interactions with VCR.py, you can create robust mock data that ensures comprehensive testing and validation of your application.
Testing frameworks like Pytest offer built-in support for mocking, making it straightforward to patch function calls and utilize mock data. This simplifies the process of testing individual components. In other cases, you might prefer to use framework-agnostic libraries, such as wrapt, to create mock servers that support end-to-end (E2E) testing. Here are two examples demonstrating how to patch the OpenAI chat completion call, one using Pytest and the other using the wrapt library.
Pytest’s monkeypatch fixture allows you to easily mock function calls for testing purposes. Below is an example of patching the OpenAI chat completion call with Pytest (the async test relies on the pytest-asyncio plugin):
import pytest
from openai import AsyncOpenAI
from unittest.mock import AsyncMock


# The client we want to test
class OpenAIClient:
    def __init__(self, api_key="dummy_key"):
        self.client = AsyncOpenAI(api_key=api_key)

    async def get_chat_completion(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


# Fixture to create our client
@pytest.fixture
def openai_client():
    return OpenAIClient()


# Mock response for our tests
@pytest.fixture
def mock_openai_response():
    async def mock_create(**kwargs):
        mock_response = AsyncMock()
        mock_response.choices = [
            AsyncMock(message=AsyncMock(content="This is a mocked response."))
        ]
        return mock_response
    return mock_create


# Test using monkeypatch (the asyncio marker requires the pytest-asyncio plugin)
@pytest.mark.asyncio
async def test_get_chat_completion(openai_client, mock_openai_response, monkeypatch):
    # Monkeypatch the create method on this client instance
    monkeypatch.setattr(openai_client.client.chat.completions, "create", mock_openai_response)

    # Call our method
    result = await openai_client.get_chat_completion("Test prompt")

    # Assert the result
    assert result == "This is a mocked response."
The wrapt library is a lightweight and flexible tool for function wrapping and mocking, suitable for use outside of specific testing frameworks. Below is an example of using wrapt to mock the OpenAI chat completion call:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import AsyncOpenAI
from types import SimpleNamespace
import wrapt

app = FastAPI()


# OpenAI client
class OpenAIClient:
    def __init__(self, api_key="dummy_key"):
        self.client = AsyncOpenAI(api_key=api_key)

    async def get_chat_completion(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


# Mock responses
mock_responses = {
    "Hello": "Hi there! How can I assist you today?",
    "What's the weather?": "I'm sorry, I don't have real-time weather information. You might want to check a weather app or website for the most up-to-date forecast.",
    "Tell me a joke": "Why don't scientists trust atoms? Because they make up everything!",
    "default": "I'm not sure how to respond to that. Can you please rephrase or ask something else?"
}


# Mock function for OpenAI chat completion; wrapt passes (wrapped, instance, args, kwargs)
async def mock_chat_completion(wrapped, instance, args, kwargs):
    messages = kwargs.get('messages', [])
    if messages and isinstance(messages[0], dict):
        prompt = messages[0].get('content', '')
        response = mock_responses.get(prompt, mock_responses['default'])
        # Build a minimal object exposing the .choices[0].message.content shape
        return SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content=response))]
        )
    return await wrapped(*args, **kwargs)


# Apply the mock wrapper to the SDK's async create method
wrapt.wrap_function_wrapper('openai.resources.chat.completions', 'AsyncCompletions.create', mock_chat_completion)


# Pydantic model for request body
class ChatRequest(BaseModel):
    prompt: str


# FastAPI endpoint
@app.post("/chat")
async def chat_completion(request: ChatRequest):
    try:
        client = OpenAIClient()
        response = await client.get_chat_completion(request.prompt)
        return {"response": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
By leveraging these approaches, you can effectively mock OpenAI chat completion calls within both Pytest and a framework-agnostic context, ensuring robust and flexible testing of your applications.
Incorporating mocking into a test hierarchy ensures comprehensive and efficient testing across different levels. Here’s how mocking can be applied at various stages:
- Unit Tests: unittest, pytest with mocking.
- Integration Tests: pytest, HTTP-level mocking libraries such as responses or respx, wrapt.
- End-to-End Tests: wrapt, vcrpy.

The previous examples demonstrated simple tests and mock usages, but real-world LLM applications often involve more complex workflows, such as concurrent API requests or autonomous agentic APIs, where mock data can arrive out of order and cause test failures. To address this, you can employ more sophisticated data structures to store mock data, or use pattern matching to ensure the correct data is returned for each test. However, when complexity escalates, it’s essential to revisit the principles outlined in Why Shouldn’t We Mock LLM Responses. Over-engineering mocks can mangle your test hierarchy, so it’s often better to explore alternative approaches that keep testing robust and maintainable.
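As one way to keep mock data organized in such workflows, a small pattern-matching router (a sketch; the patterns and canned replies are illustrative) can pick the right reply even when prompts vary in wording:

```python
import re

class MockRouter:
    """Route prompts to canned replies by regex, with a default fallback."""
    def __init__(self):
        self.rules = []  # list of (compiled pattern, reply) pairs

    def add(self, pattern, reply):
        self.rules.append((re.compile(pattern, re.IGNORECASE), reply))

    def respond(self, prompt):
        # First matching rule wins; fall back to a default reply
        for pattern, reply in self.rules:
            if pattern.search(prompt):
                return reply
        return "I'm not sure how to respond to that."

router = MockRouter()
router.add(r"\bweather\b", "I don't have real-time weather information.")
router.add(r"\bjoke\b", "Why don't scientists trust atoms? Because they make up everything!")

print(router.respond("How's the weather today?"))  # weather reply
print(router.respond("tell me a JOKE"))            # joke reply
```

Because matching is by pattern rather than exact prompt text, concurrent or agentic flows that rephrase their requests still receive the intended mock data.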
Effective practices for mocking LLM responses are essential for ensuring the robustness and reliability of LLM applications in production. While traditional testing methods may fall short due to the inherent complexity and unpredictability of LLMs, incorporating comprehensive unit tests, integration tests, and end-to-end tests with sophisticated mocking strategies can mitigate these challenges. Moreover, leveraging ML and LLM testing methods such as golden data benchmarks, cross-model evaluations, and probabilistic assertions provides a more accurate assessment of LLM capabilities.
By strategically integrating mocking and alternative testing approaches, development teams can maintain high productivity, control costs, and ensure robust application performance across diverse scenarios. This not only enhances the reliability of LLM applications but also facilitates efficient development cycles, ultimately leading to more successful and stable production deployments.