Testing large language models (LLMs) presents unique challenges that diverge from traditional software testing paradigms. One significant issue is the unpredictability and variability of LLM responses: unlike deterministic systems, LLMs can produce different outputs for the same input, making it difficult to establish fixed expectations for test results. This variability complicates both Test-Driven Development (TDD) and Behavior-Driven Development (BDD), as tests may not consistently yield the same outcomes. Moreover, LLMs require substantial computational resources. Running tests against these models can be prohibitively expensive when fast responses are needed, or painfully slow when more cost-effective, open-source models are run locally.
The trade-offs in LLM testing are multifaceted. To achieve rapid response times, we might opt for cloud-based solutions with robust infrastructure, but this incurs high costs, especially for continuous integration and delivery pipelines. Conversely, using open-source models locally can significantly reduce costs but at the expense of speed and scalability, which may hinder development velocity. Another trade-off is the balance between test coverage and practicality. Comprehensive testing, including unit tests, integration tests, and end-to-end tests, is essential for reliable production deployment. However, the complexity and resource intensity of LLMs often necessitate compromises in test scope and frequency. Implementing effective mocking strategies can mitigate some of these issues by simulating LLM responses, allowing for more controlled and repeatable tests, but this approach requires careful design to ensure it accurately reflects real-world behavior.
Mocking LLM responses is often ineffective due to the inherent non-deterministic nature of these models. Unlike traditional software systems that produce consistent outputs for given inputs, LLMs can generate a wide range of responses, even when queried with the same prompt. This variability poses significant challenges for mocking methods, which typically rely on predictable and repeatable outputs.
Image from https://www.shakudo.io/blog/evaluating-llm-performance
Instead of relying on mocks, there are more robust and effective methods for testing, benchmarking, and evaluating LLMs, such as golden-dataset benchmarks, cross-model evaluations, and probabilistic assertions that check properties of an answer rather than its exact wording.
Human-in-the-loop evaluation and continuous monitoring are further techniques that should be used in production. These alternative methods provide a more accurate and holistic assessment of LLM capabilities, ensuring that models perform reliably in diverse and unpredictable real-world scenarios.
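Non-deterministic outputs break exact string assertions, which is what makes probabilistic or semantic checks attractive. As a small illustration (the helper below is hypothetical, not from any library), a test can assert on properties of an answer, such as required keywords and length bounds, rather than its exact wording:

```python
def assert_response_properties(text, required_keywords=(), min_words=1, max_words=200):
    """Check semantic properties of an LLM answer instead of exact equality."""
    words = text.split()
    assert min_words <= len(words) <= max_words, f"unexpected length: {len(words)} words"
    lowered = text.lower()
    for keyword in required_keywords:
        assert keyword.lower() in lowered, f"missing expected keyword: {keyword!r}"

# Two differently-worded answers to the same prompt both pass the same check:
assert_response_properties("The capital of France is Paris.", required_keywords=["Paris"])
assert_response_properties("Paris is France's capital city.", required_keywords=["Paris"])
```

The same idea scales up to embedding-similarity checks or LLM-as-judge scoring; the point is that the assertion tolerates rewording while still catching wrong answers.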
Mocking LLM responses can be highly beneficial in specific scenarios, particularly when the goal is to enhance development efficiency and reduce costs. Here are some key situations where mocking LLM responses is advantageous:
When building tools such as observability and tracing libraries for LLMs, testing various data responses from multiple providers can be impractical and expensive. Mocking responses allows developers to simulate a wide range of scenarios without incurring the cost of actual LLM invocations. This approach ensures that the library can handle diverse outputs and edge cases, facilitating thorough testing and high development velocity.
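As a sketch of what this can look like (the payload shapes follow the OpenAI chat-completion format; the truncated and empty cases are assumptions about what a provider might return), a tracing library can be exercised against hand-written edge-case payloads without any live calls:

```python
# Edge-case payloads in the OpenAI chat-completion shape, for exercising
# an observability/tracing library without real API invocations.
EDGE_CASE_RESPONSES = [
    {   # normal completion
        "choices": [{"message": {"role": "assistant", "content": "All good."},
                     "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 5, "completion_tokens": 3, "total_tokens": 8},
    },
    {   # output cut off by the token limit
        "choices": [{"message": {"role": "assistant", "content": "Truncated answ"},
                     "finish_reason": "length"}],
        "usage": {"prompt_tokens": 5, "completion_tokens": 256, "total_tokens": 261},
    },
    {   # empty content: the library must not crash on this
        "choices": [{"message": {"role": "assistant", "content": ""},
                     "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 5, "completion_tokens": 0, "total_tokens": 5},
    },
]

def total_tokens_traced(payloads):
    """Toy 'tracing' aggregation: sum token usage across recorded responses."""
    return sum(p["usage"]["total_tokens"] for p in payloads)

print(total_tokens_traced(EDGE_CASE_RESPONSES))  # 274
```

Feeding a fixture list like this through the library under test covers diverse provider behavior deterministically and at zero API cost.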
In collaborative development environments, dependencies between team members can cause delays. For instance, if a UI developer is waiting for an API to be completed, providing a working API with mocked LLM responses can keep the project moving forward. This allows the team to develop and test UI components in parallel, ensuring timely completion of tasks.
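A stub backend for the UI team only needs to return a body with the right shape. The helper below is a sketch that builds an OpenAI-style chat.completion payload with dummy values, which any placeholder endpoint could serve:

```python
import time
import uuid

def make_mock_completion(content, model="gpt-3.5-turbo"):
    """Build an OpenAI-style chat.completion payload with dummy metadata."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:24]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": content},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }

body = make_mock_completion("Hi there! How can I assist you today?")
print(body["choices"][0]["message"]["content"])  # the canned reply
```

Because the payload matches the real response schema, the UI code written against the stub should need no changes when the live API lands.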
Continuous integration and continuous delivery (CI/CD) pipelines often involve frequent testing of new changes. Running tests on live LLMs for each change can be prohibitively expensive, especially for large teams. Using mocked responses for regression tests helps streamline the CI/CD process, saving significant costs while ensuring that unchanged code does not repeatedly incur unnecessary expenses.
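One common pattern, sketched below with unittest (the RUN_LIVE_LLM_TESTS flag name is an invented convention), is to gate live-LLM tests behind an opt-in environment variable so routine CI runs default to cheap mocked tests:

```python
import os
import unittest

# Opt-in flag: live-LLM tests run only when explicitly requested,
# so routine CI runs stay cheap and deterministic.
RUN_LIVE = os.getenv("RUN_LIVE_LLM_TESTS") == "1"

class TestChatFeature(unittest.TestCase):
    @unittest.skipUnless(RUN_LIVE, "set RUN_LIVE_LLM_TESTS=1 to hit the real API")
    def test_live_completion(self):
        # Placeholder for a test that would call the real LLM provider
        self.assertTrue(True)

    def test_with_mocked_response(self):
        # Always runs in CI: uses canned data instead of the API
        mocked = "This is a mocked response."
        self.assertIn("mocked", mocked)
```

A nightly or pre-release pipeline can then export the flag to run the expensive live checks, while every pull request only pays for the mocked suite.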
LLMs can introduce inconsistency in CI/CD workflows, resulting in flaky tests. By isolating LLM evaluations and using mock responses for user flow and integration testing, teams can reduce the impact of these inconsistencies. Incorporating human-in-the-loop testing and review for final benchmarking ensures that the critical aspects of LLM performance are accurately assessed without compromising the stability of the CI/CD pipeline.
By strategically using mock LLM responses, development teams can maintain high productivity, control costs, and ensure robust testing practices, ultimately leading to more reliable and efficient development cycles.
Image from https://www.testevolve.com/blog/the-testing-pyramid-an-essential-strategy-for-agile-testing
That’s enough theory, show me some code!
When mocking, it’s crucial not to reach into the internal implementation of libraries or frameworks, as internals are prone to changes that can break your pipeline. Instead, mock at stable boundaries: either the public API surface you consume or, closer to the metal, the network requests themselves. Here are two coding examples illustrating these principles:
Mocking the response directly from the OpenAI library can help you test your application logic without making actual API calls. Below is an example using Python’s unittest.mock module to mock an OpenAI chat completion response.
import unittest
from unittest.mock import patch, AsyncMock
from openai import AsyncOpenAI
import asyncio


class OpenAIClient:
    def __init__(self, api_key="dummy_key"):
        self.client = AsyncOpenAI(api_key=api_key)

    async def get_ai_response(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


class TestOpenAIMock(unittest.TestCase):
    def setUp(self):
        self.client = OpenAIClient()

    @patch('openai.resources.chat.completions.AsyncCompletions.create')
    def test_chat_completion(self, mock_create):
        # Set up the mock response
        mock_response = AsyncMock()
        mock_response.choices = [
            AsyncMock(message=AsyncMock(content="This is a mocked response."))
        ]
        mock_create.return_value = mock_response

        # Test your application code
        result = asyncio.run(self.client.get_ai_response("Hello, AI!"))
        self.assertEqual(result, "This is a mocked response.")

        # Verify that the mock was called with the expected arguments
        mock_create.assert_called_once_with(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hello, AI!"}]
        )
Mocking the network request itself is another effective way to test your application. One caveat: the OpenAI Python SDK makes its HTTP calls through httpx, so tools that only intercept requests, such as the responses library, will not catch them. This example therefore uses the respx library to mock the HTTP call to the OpenAI API.
import unittest
import json
import asyncio
import respx
from httpx import Response
from openai import AsyncOpenAI


class OpenAIClient:
    def __init__(self, api_key="dummy_key"):
        self.client = AsyncOpenAI(api_key=api_key)

    async def get_ai_response(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


class TestOpenAIMock(unittest.TestCase):
    def setUp(self):
        self.client = OpenAIClient()

    @respx.mock
    def test_openai_api_call(self):
        # Mock the API endpoint
        api_url = "https://api.openai.com/v1/chat/completions"

        # Prepare the mock response body
        mock_response = {
            "id": "chatcmpl-123",
            "object": "chat.completion",
            "created": 1677652288,
            "model": "gpt-3.5-turbo-0613",
            "choices": [{
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": "This is a mocked API response."
                },
                "finish_reason": "stop"
            }],
            "usage": {
                "prompt_tokens": 9,
                "completion_tokens": 12,
                "total_tokens": 21
            }
        }

        # Register the mock route
        route = respx.post(api_url).mock(
            return_value=Response(200, json=mock_response)
        )

        # Run the async function in a synchronous context
        async def run_test():
            result = await self.client.get_ai_response("Hello, AI!")
            self.assertEqual(result, "This is a mocked API response.")
        asyncio.run(run_test())

        # Verify that the mock route was called exactly once
        self.assertEqual(route.call_count, 1)

        # Verify the request payload
        request_payload = json.loads(route.calls.last.request.content)
        self.assertEqual(request_payload['messages'][0]['content'], "Hello, AI!")
        self.assertEqual(request_payload['model'], "gpt-3.5-turbo")
By mocking at these levels, you can effectively test your application while avoiding the pitfalls associated with changes in underlying libraries or frameworks. This approach ensures your tests remain stable and reliable over time.
Generating a comprehensive set of mock responses for your LLM is essential to ensure thorough testing and validation of your application. These mock responses should cover a wide range of scenarios, including both typical interactions and edge cases, so that your application can handle various situations effectively. You can use a fixed set of responses, record real API calls, or leverage libraries like faker to introduce randomization. Here are three examples demonstrating these methods:
Using a fixed set of responses allows for consistent and repeatable testing. Simply use a dictionary or list to store the responses, and select the corresponding item based on the incoming request. Here’s an example:
import unittest
from unittest.mock import patch, AsyncMock
from openai import AsyncOpenAI


class MockOpenAI:
    def __init__(self):
        self.responses = {
            "Hello": "Hi there! How can I assist you today?",
            "What's the weather?": "I'm sorry, I don't have real-time weather information. You might want to check a weather app or website for the most up-to-date forecast.",
            "Tell me a joke": "Why don't scientists trust atoms? Because they make up everything!",
            "default": "I'm not sure how to respond to that. Can you please rephrase or ask something else?"
        }

    async def chat_completion(self, model, messages):
        last_message = messages[-1]['content'] if messages else ""
        response = self.responses.get(last_message, self.responses['default'])
        mock_response = AsyncMock()
        mock_response.choices = [AsyncMock(message=AsyncMock(content=response))]
        return mock_response


class OpenAIClient:
    def __init__(self, api_key="dummy_key"):
        self.client = AsyncOpenAI(api_key=api_key)

    async def get_response(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


# IsolatedAsyncioTestCase (Python 3.8+) runs async test methods natively,
# so no extra event-loop wrapper class is needed.
class TestOpenAIMock(unittest.IsolatedAsyncioTestCase):
    def setUp(self):
        self.mock_openai = MockOpenAI()
        self.client = OpenAIClient()

    @patch('openai.resources.chat.completions.AsyncCompletions.create')
    async def test_known_responses(self, mock_create):
        mock_create.side_effect = self.mock_openai.chat_completion
        responses = [
            await self.client.get_response("Hello"),
            await self.client.get_response("What's the weather?"),
            await self.client.get_response("Tell me a joke")
        ]
        self.assertEqual(responses[0], "Hi there! How can I assist you today?")
        self.assertIn("I don't have real-time weather information", responses[1])
        self.assertEqual(responses[2], "Why don't scientists trust atoms? Because they make up everything!")
The faker library can be used to create dynamic and diverse data for testing. It’s not especially useful for unit tests, but if you are building a mock server to facilitate development and e2e testing, it comes in handy for adding randomization to your API responses. Here’s an example:
import unittest
from unittest.mock import patch, AsyncMock
from openai import AsyncOpenAI
from faker import Faker
import random


class MockOpenAI:
    def __init__(self):
        self.faker = Faker()

    async def chat_completion(self, model, messages):
        prompt = messages[-1]['content'] if messages else ""
        response = self.generate_response(prompt)
        mock_response = AsyncMock()
        mock_response.choices = [AsyncMock(message=AsyncMock(content=response))]
        return mock_response

    def generate_response(self, prompt):
        if "name" in prompt.lower():
            return f"The name you're asking about is {self.faker.name()}."
        elif "address" in prompt.lower():
            return f"The address you're looking for is {self.faker.address()}."
        elif "company" in prompt.lower():
            return f"The company you're inquiring about is {self.faker.company()}."
        elif "date" in prompt.lower():
            return f"The date you're asking about is {self.faker.date()}."
        else:
            return self.faker.paragraph(nb_sentences=random.randint(1, 3))


class OpenAIClient:
    def __init__(self, api_key="dummy_key"):
        self.client = AsyncOpenAI(api_key=api_key)

    async def get_response(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


# IsolatedAsyncioTestCase runs the async test methods directly,
# replacing the manual asyncio.run() wrapper class.
class TestOpenAIFakerMock(unittest.IsolatedAsyncioTestCase):
    def setUp(self):
        self.mock_openai = MockOpenAI()
        self.client = OpenAIClient()

    @patch('openai.resources.chat.completions.AsyncCompletions.create')
    async def test_name_response(self, mock_create):
        mock_create.side_effect = self.mock_openai.chat_completion
        response = await self.client.get_response("What's the person's name?")
        self.assertIn("The name you're asking about is", response)

    @patch('openai.resources.chat.completions.AsyncCompletions.create')
    async def test_address_response(self, mock_create):
        mock_create.side_effect = self.mock_openai.chat_completion
        response = await self.client.get_response("Can you give me an address?")
        self.assertIn("The address you're looking for is", response)

    @patch('openai.resources.chat.completions.AsyncCompletions.create')
    async def test_company_response(self, mock_create):
        mock_create.side_effect = self.mock_openai.chat_completion
        response = await self.client.get_response("Tell me about a company.")
        self.assertIn("The company you're inquiring about is", response)

    @patch('openai.resources.chat.completions.AsyncCompletions.create')
    async def test_date_response(self, mock_create):
        mock_create.side_effect = self.mock_openai.chat_completion
        response = await self.client.get_response("What's the date?")
        self.assertIn("The date you're asking about is", response)
If you don’t know the exact API response but still want to mock it to add stability to your tests, VCR.py can record real API calls and replay them during tests, ensuring consistent and repeatable results. Here’s an example:
import vcr
from openai import AsyncOpenAI
import asyncio
import unittest
import os

# Configure VCR (named my_vcr to avoid shadowing the vcr module itself)
my_vcr = vcr.VCR(
    cassette_library_dir='fixtures/vcr_cassettes',
    record_mode='once',
    match_on=['uri', 'method'],
    filter_headers=['authorization'],  # keep the API key out of recorded cassettes
)


class OpenAIClient:
    def __init__(self):
        self.client = AsyncOpenAI(api_key=os.getenv('OPENAI_API_KEY'))

    async def get_chat_completion(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


class TestOpenAIWithVCR(unittest.TestCase):
    def setUp(self):
        self.client = OpenAIClient()

    @my_vcr.use_cassette('openai_chat_completion_async.yaml')
    def test_chat_completion(self):
        prompt = "What is the capital of France?"
        response = asyncio.run(self.client.get_chat_completion(prompt))
        self.assertIsInstance(response, str)
        self.assertIn("Paris", response)

    @my_vcr.use_cassette('openai_chat_completion_python_async.yaml')
    def test_chat_completion_python(self):
        prompt = "Write a Python function to calculate the factorial of a number."
        response = asyncio.run(self.client.get_chat_completion(prompt))
        self.assertIsInstance(response, str)
        self.assertIn("def factorial", response)
By using fixed responses, dynamic data generation with faker, or recording real API interactions with VCR.py, you can create robust mock data that ensures comprehensive testing and validation of your application.
Testing frameworks like Pytest offer built-in support for mocking, making it straightforward to patch function calls and utilize mock data. This simplifies the process of testing individual components. In other cases, you might prefer to use framework-agnostic libraries, such as wrapt, to create mock servers that support end-to-end (E2E) testing. Here are two examples demonstrating how to patch the OpenAI chat completion call, one using Pytest and the other using the wrapt library.
Pytest’s monkeypatch fixture allows you to easily mock function calls for testing purposes. Below is an example of patching the OpenAI chat completion call with Pytest (the async test relies on the pytest-asyncio plugin):
import pytest
from openai import AsyncOpenAI
from unittest.mock import AsyncMock


# The client we want to test
class OpenAIClient:
    def __init__(self, api_key="dummy_key"):
        self.client = AsyncOpenAI(api_key=api_key)

    async def get_chat_completion(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


# Fixture to create our client
@pytest.fixture
def openai_client():
    return OpenAIClient()


# Mock response for our tests
@pytest.fixture
def mock_openai_response():
    async def mock_create(**kwargs):
        mock_response = AsyncMock()
        mock_response.choices = [
            AsyncMock(message=AsyncMock(content="This is a mocked response."))
        ]
        return mock_response
    return mock_create


# Test using monkeypatch (the asyncio marker requires the pytest-asyncio plugin)
@pytest.mark.asyncio
async def test_get_chat_completion(openai_client, mock_openai_response, monkeypatch):
    # Monkeypatch the create method on this client instance
    monkeypatch.setattr(openai_client.client.chat.completions, "create", mock_openai_response)

    # Call our method
    result = await openai_client.get_chat_completion("Test prompt")

    # Assert the result
    assert result == "This is a mocked response."
The wrapt library is a lightweight and flexible tool for function wrapping and mocking, suitable for use outside of specific testing frameworks. Below is an example of using wrapt to mock the OpenAI chat completion call:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import AsyncOpenAI
from types import SimpleNamespace
import wrapt

app = FastAPI()


# OpenAI client
class OpenAIClient:
    def __init__(self, api_key="dummy_key"):
        self.client = AsyncOpenAI(api_key=api_key)

    async def get_chat_completion(self, prompt):
        response = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


# Mock responses
mock_responses = {
    "Hello": "Hi there! How can I assist you today?",
    "What's the weather?": "I'm sorry, I don't have real-time weather information. You might want to check a weather app or website for the most up-to-date forecast.",
    "Tell me a joke": "Why don't scientists trust atoms? Because they make up everything!",
    "default": "I'm not sure how to respond to that. Can you please rephrase or ask something else?"
}


# Mock function for OpenAI chat completion; wrapt passes (wrapped, instance, args, kwargs)
async def mock_chat_completion(wrapped, instance, args, kwargs):
    messages = kwargs.get('messages', [])
    if messages and isinstance(messages[0], dict):
        prompt = messages[0].get('content', '')
        response = mock_responses.get(prompt, mock_responses['default'])
        # Build a minimal object exposing the .choices[0].message.content shape
        return SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content=response))]
        )
    return await wrapped(*args, **kwargs)


# Apply the mock wrapper to the SDK's async create method
wrapt.wrap_function_wrapper('openai.resources.chat.completions', 'AsyncCompletions.create', mock_chat_completion)


# Pydantic model for request body
class ChatRequest(BaseModel):
    prompt: str


# FastAPI endpoint
@app.post("/chat")
async def chat_completion(request: ChatRequest):
    try:
        client = OpenAIClient()
        response = await client.get_chat_completion(request.prompt)
        return {"response": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
By leveraging these approaches, you can effectively mock OpenAI chat completion calls within both Pytest and a framework-agnostic context, ensuring robust and flexible testing of your applications.
Incorporating mocking into a test hierarchy ensures comprehensive and efficient testing across different levels. Here’s how mocking can be applied at various stages:
- Unit Tests: unittest, pytest with mocking.
- Integration Tests: pytest, HTTP-level mocking libraries such as responses or respx, wrapt.
- End-to-End Tests: wrapt, vcrpy.

The previous examples demonstrated simple tests and mock usages, but real-world LLM applications often involve more complex workflows, such as concurrent API requests or autonomous agentic APIs, where mock data can arrive out of order and cause test failures. To address this, you can employ more sophisticated data structures to store mock data, or use pattern matching to ensure the correct data is returned for each test. However, when complexity escalates, it’s essential to revisit the principles outlined in Why Shouldn’t We Mock LLM Responses. Over-engineering mocks can mangle your test hierarchy, so it’s often better to explore alternative approaches that keep testing robust and maintainable.
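As one way to keep mock data organized in such workflows, a small pattern-matching router (a sketch; the patterns and canned replies are illustrative) can pick the right reply even when prompts vary in wording:

```python
import re

class MockRouter:
    """Route prompts to canned replies by regex, with a default fallback."""
    def __init__(self):
        self.rules = []  # list of (compiled pattern, reply) pairs

    def add(self, pattern, reply):
        self.rules.append((re.compile(pattern, re.IGNORECASE), reply))

    def respond(self, prompt):
        # First matching rule wins; fall back to a default reply
        for pattern, reply in self.rules:
            if pattern.search(prompt):
                return reply
        return "I'm not sure how to respond to that."

router = MockRouter()
router.add(r"\bweather\b", "I don't have real-time weather information.")
router.add(r"\bjoke\b", "Why don't scientists trust atoms? Because they make up everything!")

print(router.respond("How's the weather today?"))  # weather reply
print(router.respond("tell me a JOKE"))            # joke reply
```

Because matching is by pattern rather than exact prompt text, concurrent or agentic flows that rephrase their requests still receive the intended mock data.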
Effective practices for mocking LLM responses are essential for ensuring the robustness and reliability of LLM applications in production. While traditional testing methods may fall short due to the inherent complexity and unpredictability of LLMs, incorporating comprehensive unit tests, integration tests, and end-to-end tests with sophisticated mocking strategies can mitigate these challenges. Moreover, leveraging ML and LLM testing methods such as golden data benchmarks, cross-model evaluations, and probabilistic assertions provides a more accurate assessment of LLM capabilities.
By strategically integrating mocking and alternative testing approaches, development teams can maintain high productivity, control costs, and ensure robust application performance across diverse scenarios. This not only enhances the reliability of LLM applications but also facilitates efficient development cycles, ultimately leading to more successful and stable production deployments.