Within a matter of years, Large Language Models (LLMs) have become an integral aspect of the current technology landscape.
While chatbots and AI art creation tools capture all the headlines, generative AI offers promising business-oriented applications too. From automated, autonomous processes to code generation, LLMs are delivering important efficiencies and savings to organisations across industries.
The Importance of QA Testing in LLM Projects
QA testing has always been an essential aspect of any software development project. However, as the technology shifts and advances, so does the approach. Testing generative AI varies slightly to traditional methods of QA testing, because of the number of variables involved.
Bias and risk
An LLM can only deliver results based on the data it has been trained on. If the training data is unbalanced in some way, the inferences drawn by the LLM are likely to be biased.
QA testing throughout the LLM deployment process will help to identify factual errors, biases and unexpected behaviour before the model is launched. Fixing these issues early reduces risk and helps to build trust in the LLM.
Intended functionality
Your project has clearly defined goals, but an LLM may not always generate output that aligns with them.
QA testing will reveal these shortcomings earlier, allowing you to make adjustments as required before your application is deployed.
Model robustness
If users can break a system unintentionally, they will. Early QA testing helps to quantify the effect of unexpected input on your LLM. It can also provide insights into malicious activity and attempts to manipulate the LLM by bad actors.
QA testing should be treated as a safety net to identify and address potential issues before they cause problems in the real world.
Unique Challenges in Testing LLMs
Testing LLMs is not always straightforward. There are several challenges that must be addressed when testing LLM applications.
1. Non-deterministic outputs
Perhaps one of the biggest challenges is the non-deterministic nature of the LLMs themselves. You can’t ‘guarantee’ an LLM will generate a specific result because the same prompt can deliver different, yet valid, responses. This non-determinism also increases the difficulty of ensuring consistent behaviour across prompts.
2. Diverse outputs
LLM output may be coherent and informative – or a jumbled mess. Testing must ensure that the application generates meaningful responses across a range of prompts.
3. Hallucinations and fabrications
LLMs formulate their own standard of ‘truth’, making it hard for generative AI applications to deliver a “correct” answer. Generated output will need human guidance to review and correct as necessary.
4. Adversarial attacks
Altering LLM input, particularly during the training phase, can ‘poison’ the model, affecting its output. As a result, deliberate poisoning represents a new attack surface for cybercriminals.
Model testing will need to consider, and address, the potential for adversarial attacks.
5. Bias
Biased data always results in biased outcomes. Your team will need to ensure that their training data is properly balanced to ensure accurate output. You will also need to ensure that output complies with anti-discrimination legislation – even if the results themselves are fully accurate.
6. Outliers and rarities
Based around statistical probability, LLMs struggle with rare or novel events. Most cannot properly explore unique scenarios. Your testing will need to include edge cases and rare events to assess how the model behaves.
7. Context window length
LLMs may lose context over long sequences, particularly if the context window is restricted. Testing will need to assess whether the model can maintain context over extended interactions, adjusting the window length as required.
8. Resource constraints
Depending on the model, testing LLMs may consume vast computational resources. Efficient testing strategies are essential to balance accuracy and resource usage. Choice of LLM will also be important, Llama 3 can be run locally for instance, drastically reducing the amount of resources required.
Types of LLMs: Open Source vs. Proprietary
Existing LLMs offer an effective way to accelerate generative AI development – but they have their own caveats.
In terms of headline price, open source models are a cheaper option – access is usually free. However, the resource costs associated with hosting and fine-tuning the model can quickly add up. There may also be issues with quality control, given that research and development is performed by the community.
Proprietary models can become more costly based on usage. However, paid licenses typically come with vendor support, documentation and software updates. They may also be supplied with APIs to support simplified integration and a host of customisation options.
The biggest difference between Open Source and proprietary LLMs is their relative transparency. An Open Source LLM allows you to inspect the architecture, weights, and training data. Ethical and bias issues are easier to identify and address.
Proprietary LLMs, like a lot of ML models, operate like a black box. Bias is harder to address as you don't have access to the original source code to understand how the model is built or operates. Considerations around fine-tuning and prompt engineering can help to prevent unwanted responses, as well as tools like Azure Content Filtering. But for some applications, this trade-off may be acceptable because you will not incur heavy costs for training the model.
Clearly, based on these factors, the choice of model has significant implications for your testing process.
Strategic Considerations for Effective Testing
Setting up an LLM test environment is a careful balance of cost and resources. You must be able to address each of the issues outlined above – and traditional application testing techniques are unlikely to suffice.
Instead, some teams are moving towards a Test Driven Development (TDD) strategy. TDD offers a structured way to validate the code produced by LLMs. Rather than offloading model evaluation to a dedicated tester, developers perform much of the model testing.
With TDD, developers define the expected application behaviour through their tests. They then ensure the LLM-generated content meets the specified requirements and does not hallucinate – all during the actual development process. TDD allows the developer to evaluate and engineer the model in a very tight feedback loop.
That’s not to say 'normal' application testing is redundant. You will still need to perform validation testing, exploratory testing and non-functional testing on the infrastructure supporting your generative AI applications. It is more likely that testers will become involved to deliver standard application testing, focusing on security, use cases and user experience.
Importantly, testers will be key to finding edge cases, attempting to ‘break’ the model with unexpected inputs. For example, leveraging a red teaming approach to stress-test generative AI for a broad range of potential harms - from safety and security, to social bias.
Developing an LLM-ready testing strategy will also mean anticipating and mitigating the potential costs associated with API tokens and data usage. Your team will need to develop a careful balance of demand and resources to deliver an acceptable user experience whilst simultaneously containing costs.