Navigating the Unknown: Testing Code in an AI-Generated World

The Challenge of Testing Code When You Don't Know Its Contents

Traditional software development relies on the assumption that developers have full visibility into their codebase. But as artificial intelligence and large language models (LLMs) become more integrated into the development process, that assumption no longer holds. When code is generated by AI, or when the behavior of a system is influenced by a model whose internal logic is opaque, testing becomes a fundamentally different problem. Ryan, host of SmartBear's podcast, explored this topic with Fitz Nowlan, the company's VP of AI and Architecture. They discussed how the industry is moving away from old assumptions about software testing and what new practices are emerging to handle non-deterministic, AI-driven systems.

Navigating the Unknown: Testing Code in an AI-Generated World — Source: stackoverflow.blog

Understanding MCP Servers and LLM-Driven Agents

One of the key areas where this challenge surfaces is in the testing of MCP (Model Context Protocol) servers. MCP is a protocol that allows LLMs to interact with external tools and data sources. When an LLM-driven agent uses an MCP server, the agent's decisions are not purely deterministic; they depend on the model's training data, the prompt context, and even randomness in the sampling process. This breaks the traditional testing paradigm, where you can write a test with a known input and expect a known output. Now, the same input might produce different valid outputs each time, making it impossible to rely on simple assertion-based testing.

For example, consider an MCP server that retrieves customer data for a support chatbot. The LLM agent might ask for the customer's name, then decide to fetch order history or recent tickets based on the conversation flow. The exact sequence of tool calls is not predictable, yet the overall outcome must still be correct. Developers need new testing strategies that focus on behavioral constraints rather than exact outputs.

Non-Determinism Breaks Traditional Testing

The rise of LLM-driven agents introduces a level of non-determinism that undermines the classic unit-testing approach. In traditional testing, you define a set of inputs, run the code, and compare the result to an expected value. But when an agent uses an LLM, the “code” is no longer a fixed algorithm. The LLM might choose different paths based on subtle variations in the prompt or even the temperature setting. This makes it nearly impossible to write repeatable tests that pass consistently.

Fitz Nowlan pointed out that this shift forces engineers to think about testing at a higher level. Instead of testing “does function X return value Y,” you test “does the system satisfy these invariants?” For example, you might check that a customer support agent never reveals another user's personal data, regardless of how it decides to respond. This is more aligned with property-based testing or contract testing than traditional unit tests. It also means relying on monitoring and observability in production, because you can't cover every possible path in a pre-deployment suite.

The Rising Value of Data Locality and Data Construction

When source code becomes easy to generate—whether by Copilot, ChatGPT, or other AI tools—the real bottleneck shifts from writing code to building the right data. Data locality and data construction are emerging as critical skills. Data locality refers to the practice of keeping data close to where it is processed, which becomes important when LLMs need access to context-specific information without constant network calls. Data construction is about creating realistic, high-quality test datasets that cover the diverse scenarios an AI system might encounter.

Fitz highlighted that when you can generate code instantly, the value is in what you ask the code to do and how you validate that it did it correctly. That validation increasingly relies on well-constructed data. For instance, testing a system that uses an MCP server to answer customer queries requires a dataset of queries with known correct answers—but also edge cases, ambiguous questions, and malicious inputs. Building and maintaining such datasets becomes as important as writing the tests themselves.

Practical Strategies for Testing in an AI-Driven World

To adapt, teams can adopt several strategies:

Shift from exact outputs to behavioral constraints. Use assertions that check for safety properties (e.g., no PII leaked), response format (e.g., JSON schema), or semantic correctness (using another model to evaluate the answer).
Leverage data locality. Cache frequently used data or embed small context databases directly into the AI agent's memory to reduce non-determinism from external service calls.
Invest in data construction. Build synthetic datasets that simulate realistic user interactions, including adversarial examples. Version-control these datasets alongside your code.
Emphasize integration and end-to-end testing. Since unit tests can't capture emergent behavior, focus on full-system testing in staging environments that mirror production.
Use observability as a testing tool. Log every decision made by the AI agent so you can trace failures back to the context or model choice. This enables post-hoc testing and debugging.

Conclusion: A New Mindset for Testing

As code generation becomes trivial, the real challenge shifts to understanding what the code should do and validating that it does it safely and reliably. The conversation between Ryan and Fitz underscores a critical mindset shift: testing is no longer about staring at source code. It's about constructing meaningful test scenarios, ensuring data quality, and embracing non-determinism as a feature rather than a bug. By adopting strategies that focus on invariants, data locality, and robust data construction, teams can keep pace with the rapid evolution of AI-assisted development. The future of testing is not about knowing every line of code—it's about knowing the context in which that code operates.