By Michael England (Lead Software Engineer, Applied AI) and
James Thurgood (Lead ML Engineer, Applied AI)
A common problem across large enterprises is providing employees with access to the right data when they need it, in an easy-to-consume format. Important information is often stored across thousands of PDF files, Microsoft Word documents and intranet pages, making it difficult for employees to find.
Last year, we announced how we took steps to help solve this problem, by bringing the power of GenAI to our colleague chatbot, Ask Archie. In this post, we expand upon the techniques we used, highlight our learnings after almost a year running in production, and discuss how we are looking to scale similar GenAI systems across the organisation.
Generative AI chatbots have taken the world by storm since the launch of ChatGPT by OpenAI in November 2022. Powered by LLMs, generative AI chatbots can answer complex user questions, create new content, and even reason through problems.
However, LLMs such as the Llama and GPT families of models are limited to the knowledge they acquired during the training phase. These models work well at answering generic questions, such as “what is the role of AI in providing information?”, but if they are prompted to answer a question outside their training data, they often refuse to answer or, worse, hallucinate and answer incorrectly. This is where Retrieval Augmented Generation (RAG) comes into the picture.
RAG is an architecture, popularised by Meta in 2020, that combines information retrieval and a generative AI model. The idea is simple — when a user enters some text e.g. “I am travelling to London on Monday and returning Friday, how much can I expense for dinner each night?”, the application first fetches relevant text from source systems and then includes this within the context of the LLM text prompt. In this example, relevant text could be passages from an internal travel and expenses policy. The LLM will then utilise both the knowledge learned from its training data (encapsulated by its model weights) and the knowledge provided in the retrieved text to generate an output.
Here is an example prompt that could be used in this scenario:
You are an assistant that helps company employees answer questions on internal policy documents.
Given the following extracted parts of the documents and a question, create a final answer.
Be brief in your answers.
Answer ONLY with the facts listed in the list of sources below.
If there is not enough information below to answer the question, say you don't know.
QUESTION:
[USER QUESTION]
CONTEXT:
[RETRIEVED DOCUMENT TEXT]
ANSWER:
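Assembling a prompt like this programmatically is straightforward. Here is a minimal Python sketch; the function and variable names are illustrative rather than the actual Ask Archie implementation:

```python
PROMPT_TEMPLATE = """You are an assistant that helps company employees answer questions on internal policy documents.
Given the following extracted parts of the documents and a question, create a final answer.
Be brief in your answers.
Answer ONLY with the facts listed in the list of sources below.
If there is not enough information below to answer the question, say you don't know.

QUESTION:
{question}

CONTEXT:
{context}

ANSWER:"""


def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine the user question and retrieved document chunks into one LLM prompt."""
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(question=question, context=context)
```

The retrieved chunks are simply concatenated into the `CONTEXT` section, so the model can ground its answer in the supplied passages.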
The RAG architecture is ideal for unlocking information across large corpora of documents. Accurate information retrieval can pull out relevant passages and generate legible answers to user queries within a few seconds, saving many employee hours. The user experience is also improved: rather than trawling through multiple pages, you are presented with a summary and the ability to ask follow-up questions for better understanding.
At NatWest Group, we had the challenge of integrating RAG within an existing Natural Language Understanding (NLU) based chatbot called Ask Archie. The most straightforward approach was to build out the RAG component as a microservice that the existing Ask Archie system could communicate with via a REST API. Here is a simplified diagram of how the various components fit together:

The initial call to the RAG microservice is made when the Ask Archie chatbot system decides the user input requires a generated response. Once an HTTP request is made to the RAG microservice REST API, the system immediately performs guardrail checks on the user input. This ensures that the input is safe from a content moderation and security (jailbreak and prompt injection) perspective, as defined in a guardrail policy.
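As a toy illustration of what an input guardrail check might look like (a production guardrail policy would use dedicated moderation and prompt-injection detection models, not a hard-coded pattern list):

```python
import re

# Illustrative patterns only; real guardrails are model-based and policy-driven.
BLOCKED_PATTERNS = [
    r"ignore (all|your) previous instructions",  # naive prompt-injection signal
    r"reveal your system prompt",
]


def passes_input_guardrail(user_input: str) -> bool:
    """Return False if the input matches a known-unsafe pattern."""
    lowered = user_input.lower()
    return not any(re.search(pattern, lowered) for pattern in BLOCKED_PATTERNS)
```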
Assuming the input is deemed safe, the information retrieval stage is initiated. Initially, a hybrid search is executed against the document database to retrieve the parts of the source documents most relevant to the user query. This comprises two components:

- A keyword search, which matches document chunks containing the literal terms in the query.
- A semantic search, which uses vector embeddings to match chunks that are similar in meaning to the query.
The two queries executed by the hybrid search provide the system with two sets of results containing document chunks that are potentially relevant to the user’s search query. These queries are very fast to execute, but they are not the most accurate. We can think of them as a broad search; we then need a mechanism to combine the results and rank them in order of relevance to the question.
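Before re-ranking, the two result sets need to be merged into a single candidate list. One common technique for this (the post does not specify which merge step Ask Archie uses) is reciprocal rank fusion, where each chunk is credited based on its rank in each list it appears in:

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists: each chunk scores 1/(k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical chunk IDs from the two hybrid-search queries.
keyword_hits = ["travel-policy#1", "expenses-policy#3", "hr-handbook#7"]
semantic_hits = ["travel-policy#1", "travel-policy#4", "expenses-policy#3"]
merged = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Chunks that rank highly in both lists rise to the top of the merged candidate list.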
The next step is to perform result re-ranking, where a cross-encoder model is used to decide how relevant each retrieved chunk is to the user’s input query. The cross-encoder model is much more accurate than the approaches used by the initial queries, with the trade-off that it is slower. However, given it only needs to operate on the small subset of results returned by the hybrid search queries, it can be used as part of this architecture.
Once the scores have been calculated, the result list is ordered from most to least relevant, and a minimum-score threshold and a final result limit are applied. The final result list should contain the most relevant document chunks the LLM can use to answer the user question.
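The re-ranking and filtering logic can be sketched as follows. Here `cross_encoder_score` is a placeholder for a real cross-encoder model that scores (query, chunk) pairs; the parameter names are illustrative:

```python
def select_chunks(
    query: str,
    chunks: list[str],
    cross_encoder_score,  # callable: (query, chunk) -> relevance score
    min_score: float = 0.5,
    result_limit: int = 3,
) -> list[str]:
    """Score each chunk against the query, then keep the top results above the threshold."""
    scored = [(cross_encoder_score(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored if score >= min_score][:result_limit]
```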
The generation step is more straightforward — the retrieved chunks are combined into a single LLM prompt along with the system instructions and user question. The chat LLM then generates an answer, which is passed via the guardrail service to check for any content moderation breaches before being returned in the response.
Given the flexible nature of LLMs, it is important to keep track of how we utilise them and how they perform, both from a functional and non-functional point of view.
There are many possible data points that can be tracked. For example:
Providing access to this data via dashboards enables the development team to quickly identify where the LLM is not working as expected and replay failures or poorly performing questions using the stored data.
For example, by reviewing the logs, we have been able to identify gaps in our underlying documentation, with users asking questions and the retrieval results coming back empty or with a low relevancy score. This has allowed us to focus on improving specific areas in ongoing content updates.
These logs have also allowed us to keep a close eye on latency. RAG systems are complex, with many moving parts that can all become bottlenecks, and every millisecond counts when you have a user on the other end of the keyboard awaiting a response. Keyword search, semantic search, re-ranking, LLM queries and LLM moderation are complex processes which all play their part in the overall time taken to respond to a user. Through close observation of these metrics we have been able to identify specific bottlenecks and focus resource on improving that component’s efficiency.
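A simple way to capture per-component latency is to time each pipeline stage and emit the results to your logging stack. A minimal sketch (the component names and in-memory dictionary are illustrative; a real system would ship these to its observability platform):

```python
import time
from contextlib import contextmanager

component_timings: dict[str, float] = {}


@contextmanager
def timed(component: str):
    """Record how long a pipeline component takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        component_timings[component] = (time.perf_counter() - start) * 1000


with timed("hybrid_search"):
    time.sleep(0.01)  # stand-in for the real search call
with timed("re_ranking"):
    time.sleep(0.005)  # stand-in for the real cross-encoder call
```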
While non-production environments are valuable for the initial testing and validation of a solution, nothing compares to the insights you get from running real-world workloads. Our solution has now been in production for nine months, during which we have been able to evaluate the performance of the system against millions of real-world queries, identify areas of the system that could be improved and develop new learnings as we have iterated on key system components. In this section, we will dive into some key learnings we have identified since we deployed the solution in production.
As new LLMs are released, third-party model providers aim to deprecate older versions of models to free up their hardware. The impact of this is that systems have to be upgraded to a new model version approximately every 6–9 months. Model upgrades inherently come with a number of risks, and therefore to mitigate these, automated test suites and frictionless human evaluation processes are a must for production services utilising LLMs.
To shine a light on why changing models can be difficult with a real example, the microservice originally used OpenAI GPT-3.5 Turbo v1106 as the core chat LLM. After a few months this version was scheduled for retirement, and the system had to be upgraded to use a new patch version of this model, v0125. Our evaluation processes picked up that with no prompt changes, the new model refused to answer the user question based on the context retrieved in 16% of cases across a number of topics, significantly more than the previous model version. To fix this, the prompt had to be re-engineered to follow a few-shot approach, resulting in a 7% reduction of answer refusals compared to the original model.
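The kind of automated check that surfaces a regression like this is simple to express: run a fixed question set through each model version and compare refusal rates. A sketch, assuming the refusal phrases and answer collection are illustrative:

```python
# Phrases that indicate the model declined to answer; illustrative, not the
# actual evaluation criteria used in production.
REFUSAL_MARKERS = ["i don't know", "not enough information"]


def refusal_rate(answers: list[str]) -> float:
    """Fraction of answers that are refusals rather than grounded responses."""
    refusals = sum(
        any(marker in answer.lower() for marker in REFUSAL_MARKERS)
        for answer in answers
    )
    return refusals / len(answers)
```

Comparing this metric between the outgoing and incoming model versions, over the same question set, gives an early warning before a forced upgrade reaches users.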
As the above example highlights, changing models can be a very time-consuming process and adds additional workload and risk to teams managing LLM solutions. Self-hosting models is an alternative option which can remove the dependency on third-party model retirement dates, with the trade-off of additional cost and complexity.
It isn’t just LLM providers that are innovating. The wider GenAI community is continuously innovating and shipping new products. Whether it is a new data store, agent framework, pattern or managed service, there are continuous innovations that you need to keep up with to ensure your product continues to improve.
When building out GenAI services at NatWest Group, it has been an important focus of ours to ensure that we can swap out individual system components where required. This can be the underlying data stores, embedding models, chat models and so on.
A key component which has enabled teams to move at pace has been the introduction of the NatWest AI platform. This provides a gateway that applications can use to connect to a catalogue of models from different providers, so teams can easily switch to the best model for their use case without large architectural changes. The same platform, an evolution of the one described here, also allowed the team to deploy re-ranking models to AWS with ease. This gave us the ability to switch to other leading re-ranking models as the open source community innovated.
We identified early on that we needed to remove any barriers between data scientists/ML engineers who were building and iterating on ML-related components and software engineers who were building out the production APIs and services.
During the initial RAG microservice build-out, the DS/ML team members replicated the functionality of the main service so that they could run experiments locally and iterate on data, prompts and models. This was partly due to the DS/ML teams working on a different stack from the software engineers. This quickly became unwieldy as the setups diverged. It became apparent that we needed a way for the DS/ML team members to iterate on individual parts of the system whilst the rest of the functionality stayed in line with the main production service.
We ended up enabling a number of experimental attributes in the request body of our service, so users could override individual parts of the RAG functionality. These were only enabled in non-production, but supported tuning parameters such as:

- the retrieval result limit and minimum retrieval score
- the re-ranker type
- the model provider and model name
- custom system and user prompts
Evaluation pipelines were then built which took advantage of these attributes. This unlocked the ability for the team to experiment with small changes to the RAG microservice extremely quickly.
To bring this to life, here is an example of a request body sent to the RAG microservice including experimental parameters:
{
  "sessionId": "",
  "interactionId": "",
  "query": {
    "content": "What is NatWest's policy for partner leave, how many days am I entitled to?",
    "documentCollection": ...,
    "messageHistory": ...,
    "retrievalResultLimit": ...,
    "minRetrievalResultScore": ...,
    "reRankerType": ...
  },
  "additionalContext": {
    "location": "United Kingdom"
  },
  "modelProvider": ...,
  "modelName": ...,
  "customPrompts": {
    "questionAnswer": {
      "systemPrompt": ...,
      "userPrompt": ...
    },
    "rephrase": {
      "systemPrompt": ...,
      "userPrompt": ...
    }
  }
}

It is important to identify non-functional requirements early on, especially those around model latency and throughput. Due to the limited global availability of GPUs, your workloads may end up requiring Provisioned Throughput Units (PTUs) to guarantee capacity when you need it. PTUs are expensive, and a system using a PTU will often spend significantly more on inference than it would on an on-demand plan. If PTUs are necessary, the cost-benefit ratio might not justify proceeding with the project.
It is worth analysing whether your applications really need an immediate response from an LLM, or if the model can be called asynchronously instead. This allows you to work around provider rate limits and perform retries where necessary, allowing you to utilise models on-demand. Use cases which send a large number of LLM requests are also likely to benefit from batch APIs which have been made available by many providers at a much reduced cost.
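A retry loop with exponential backoff is the usual building block for riding out rate limits on on-demand plans. A minimal sketch, where `RateLimitError` and the `call_llm` callable stand in for whatever your provider's client library raises and exposes:

```python
import time


class RateLimitError(Exception):
    """Raised when the model provider rejects a request due to rate limiting."""


def call_with_retries(call_llm, prompt: str, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry a rate-limited LLM call with exponential backoff between attempts."""
    for attempt in range(max_attempts):
        try:
            return call_llm(prompt)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** attempt)
```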
A key challenge is always how to shorten the route to live for our projects. At NatWest Group, our engineers lean heavily on GitLab to reduce time to value, making use of automated pipelines to provide quick feedback and deploy changes into hosting environments.
Given our experience of using GitLab across our projects, we decided to use it as an end-to-end content management system for managing the underlying documents that the RAG microservice uses to answer colleague questions.

GitLab felt like a natural fit for this use case given it provides features such as:

- version control with feature branches
- automated verification via CI pipelines
- peer review through Merge Requests
- automated deployment on merge
We found through experimentation that using well-structured Markdown files for our documents helped us achieve the level of accuracy we required to provide this solution to our NatWest colleagues. Markdown files can be displayed and edited inside GitLab very easily, and this has allowed the teams that manage content to follow a software development-style workflow. Content modifications are made in feature branches, and changes are automatically verified by a GitLab CI process. Following verification, the changes are reviewed by a colleague in a Merge Request, before finally being deployed directly into a data store once approved and merged.
This approach means a content change is automatically deployed to a test environment within a few minutes, and shortly after into production, allowing teams to deliver new content at pace.
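Well-structured Markdown also makes chunking predictable, since headings provide natural split points. A simple heading-based chunking sketch (the actual chunking strategy used for Ask Archie is not described here):

```python
def chunk_markdown_by_heading(markdown: str, level: int = 2) -> list[str]:
    """Split a Markdown document into chunks at headings of the given level."""
    marker = "#" * level + " "
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if line.startswith(marker) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```

Each chunk keeps its heading, which gives both the embedding model and the reader useful context about where the passage came from.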
Leveraging LLMs within a RAG architecture has yielded tremendous benefits, enabling the Ask Archie chatbot to provide natural and targeted responses to colleague questions around a wide variety of topics. Whilst this project has been a success, there is still a lot we are planning to explore.
Firstly, we are going to look at how we can scale RAG across the organisation. Ask Archie is just one example of a system where RAG can be applied, and we are taking steps to make it easy for teams to spin up RAG infrastructure with the aim of reducing cost and complexity of these solutions going forward, especially now the industry has caught up!
Another interesting area for exploration is fine-tuning language models for improved performance. Internally we have a lot of experience with fine-tuning deep learning models, and we plan to utilise this to fine-tune LLMs. We have found that there is potential to improve information retrieval and answer performance by using domain-tuned models. While many off-the-shelf pretrained embedding models do perform well, fine-tuning against a relevant dataset has the potential to significantly improve the performance of semantic search. Our application is also capturing valuable user question and AI answer data which we will look to use for fine-tuning in future.
We will also be looking to utilise more self-hosted models going forward. The pace of open source model development is incredible, and utilising these models would enable us to decouple ourselves from third-party model retirement schedules. We already have experience hosting open source models in our Amazon SageMaker-based ML platform, and will likely be expanding the number of models we self-host over the coming year. These models will be exposed via our NatWest AI platform alongside third-party models for consumption across the organisation.
Lastly, LLM evaluation is a topic that we are spending a lot of time on and warrants its own blog post. As we move between different embedding and chat models, whether they are completely new models or upgrades of the same models, we need to run automated evaluation test suites to provide a level of certainty that overall model performance does not decline. We have deployed LLMOps tooling internally which will help us iterate on our LLM-based systems with confidence, similar to how we utilise MLOps practices for our standard machine learning models. We expect these tools to mature rapidly over the coming year.
This is a fast-moving and fascinating area, and we are only just getting started! We aim to publish further blog posts around our journey into the world of generative AI later this year.
If you found this blog post interesting and would like to work on similar problems, we encourage you to take a look at our available job openings!
The views and opinions expressed in this article are those of the author and do not necessarily represent the views of the NatWest Group.
From RAG to Riches — How NatWest Group is using GenAI to unlock knowledge across the organisation was originally published in NatWest Group AI & Engineering on Medium.