How to prepare your content for RAG

To gain more control over the responses from your AI assistant, you can add a RAG model to your process or workflow. For example, you may have specific documents within your organisation, or syllabus content that a teacher wants her class to engage with.

Retrieval-Augmented Generation (RAG) is a technique that improves the accuracy and factual grounding of generative models by retrieving external information and supplying it to the model alongside the prompt.

This approach offers several control advantages, chiefly over the accuracy and context of the responses.


[Figure: user input (prompt) -> Foundation Model (LLM) + RAG Model -> response, with improved accuracy and context]
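The flow above can be sketched in a few lines of Python. This is a minimal illustration, not a production retriever: the word-overlap scoring, the function names (tokenize, retrieve, build_prompt) and the sample passages are all assumptions made for the example. Real systems typically use embeddings and a vector store instead.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, passages, top_k=2):
    """Rank passages by how many words they share with the query (toy scoring)."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(p)), p) for p in passages]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Keep only passages that share at least one word with the query.
    return [p for score, p in ranked[:top_k] if score > 0]

def build_prompt(query, retrieved):
    """Place the retrieved passages in front of the user's question."""
    context = "\n".join(f"- {p}" for p in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical source passages, for illustration only.
passages = [
    "Annual leave requests must be submitted two weeks in advance.",
    "The syllabus covers photosynthesis in week three.",
    "Expense claims are paid at the end of each month.",
]
query = "When does the syllabus cover photosynthesis?"
prompt = build_prompt(query, retrieve(query, passages, top_k=1))
print(prompt)
```

The prompt that reaches the foundation model now carries the relevant passage, which is where the extra control over accuracy and context comes from.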


How to prepare your content and documents for RAG

The quality of the output (response) depends largely on the quality of the data. It is therefore worth preparing your target documents carefully to get the best results when implementing RAG.

A note: applying the following steps to all your documents may not be possible, especially when dealing with large volumes of content. Implementing a RAG model that draws on your current PDFs and/or rich-text documents will still deliver good results you can trust and control. The following steps are therefore recommended as best practice to apply from the start of document creation. Organisations should begin to assume that any new document may, at some point, be required as source material for AI (machine learning) tools to process.


Headings and structure

It's generally better to have clear headings and subheadings for your RAG document, even if the RAG system can potentially interpret bold text as a new segment. Here's why:

RAG and Bold Text:

While some RAG systems might interpret bold text as a segment divider, it's not a universal feature. Even if it is, it's not always reliable.

Here's a breakdown of the two approaches:

In conclusion, using headings and subheadings is the better approach for creating a clear, well-structured, and accurate RAG document. Check for system compatibility issues and, if needed, convert these easy-to-read documents into PDFs.
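To illustrate why explicit headings help, here is a minimal sketch of how a RAG ingestion step might split a document into segments at its headings. It assumes headings are marked Markdown-style with leading '#' characters. That convention, the function name split_by_headings and the sample text are assumptions for the example, and actual RAG tools differ in how they detect structure.

```python
import re

# Assumption: headings are written as "# Title", "## Subtitle", and so on.
HEADING = re.compile(r"^(#{1,6})\s+(.+)$")

def split_by_headings(text):
    """Split text into segments, one per heading, keeping the heading with its body."""
    raw = []
    heading, body = "", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous segment.
            raw.append((heading, body))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    raw.append((heading, body))

    segments = []
    for title, lines in raw:
        body_text = "\n".join(lines).strip()
        if title or body_text:  # drop empty leading segments
            segments.append({"heading": title, "body": body_text})
    return segments

sample = """# Leave policy
Staff receive 25 days of leave.

## Requesting leave
Submit requests two weeks ahead.
"""
segments = split_by_headings(sample)
for segment in segments:
    print(segment["heading"], "->", segment["body"])
```

Each segment keeps its heading alongside its text, so a retrieved passage arrives with a label describing what it is about. Bold text offers no equally unambiguous marker to split on.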


Add a Glossary

Including a glossary of key terms and definitions in your RAG document can significantly improve results.

Here's why:

Isolation for Focused Responses

When using RAG for a single document, the goal is to isolate the LLM from external information sources like the internet. This ensures responses are based solely on the content within that specific document. 

Here's the reasoning:

However, there are some nuances to consider:

In conclusion, including a glossary and isolating the LLM from the internet are both beneficial practices for optimising RAG performance with a single document. They ensure the LLM draws on the document's own knowledge base for focused, accurate, and relevant responses.
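Both practices can be combined when the prompt is assembled. The sketch below is one possible way to do it, under stated assumptions: the glossary is a simple term-to-definition dictionary, the instruction wording is illustrative, and the names find_glossary_terms and build_grounded_prompt are hypothetical. Whether a model actually stays within the supplied text also depends on the model and platform settings, not on the prompt alone.

```python
import re

def find_glossary_terms(question, glossary):
    """Return glossary entries whose term appears in the question as a whole word or phrase."""
    matches = {}
    for term, definition in glossary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, question, flags=re.IGNORECASE):
            matches[term] = definition
    return matches

def build_grounded_prompt(question, document_text, glossary):
    """Build a prompt that supplies only the document and the relevant glossary entries."""
    terms = find_glossary_terms(question, glossary)
    glossary_block = "\n".join(f"{t}: {d}" for t, d in terms.items()) or "(no matching terms)"
    return (
        "Answer using only the document and glossary below. "
        "If the answer is not there, say so.\n\n"
        f"Glossary:\n{glossary_block}\n\n"
        f"Document:\n{document_text}\n\n"
        f"Question: {question}"
    )

# Hypothetical glossary and document text, for illustration only.
glossary = {
    "RAG": "Retrieval-Augmented Generation",
    "LLM": "Large Language Model",
}
document_text = "RAG retrieves passages from your documents before the model answers."
question = "What does RAG add to a model?"
prompt = build_grounded_prompt(question, document_text, glossary)
print(prompt)
```

Only the glossary entries that match the question are included, which keeps the prompt short while still defining the terms the LLM needs.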


Connected documents: should the glossary be a connected document?

Yes. It's best to connect multiple documents within a RAG system when they cover a specific topic. Here's how to approach it.

Connecting Documents:

Directly referencing all related documents within each document will work, but it isn't the most efficient approach. There are better ways to establish connections for the LLM:

Glossary as a Separate Document:

Yes, creating a separate glossary document can be a good strategy for several reasons:

Additional Techniques:

Here are some other techniques to consider for connecting documents in a multi-document RAG system:

Choosing the Right Approach:

The best approach depends on the size and complexity of your document collection. Here's a general guideline:

Remember, the goal is to provide clear and concise information about the relationships between documents without overwhelming the LLM with unnecessary redundancy.
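One way to record relationships without repeating them inside every document is a small manifest kept alongside the collection. The sketch below assumes such a manifest exists. Its structure, the document ids and the function expand_with_related are hypothetical, and many RAG platforms offer their own metadata or tagging features that serve the same purpose.

```python
# Hypothetical manifest: each document lists the ids of documents it relates to.
# The glossary is kept as a separate document that the others point to.
manifest = {
    "leave-policy": {"title": "Leave policy", "related": ["glossary", "hr-handbook"]},
    "hr-handbook": {"title": "HR handbook", "related": ["glossary"]},
    "glossary": {"title": "Glossary", "related": []},
}

def expand_with_related(doc_ids, manifest):
    """Return the given documents plus their directly related ones, without duplicates."""
    selected = []
    for doc_id in doc_ids:
        for candidate in [doc_id, *manifest[doc_id]["related"]]:
            # Skip ids missing from the manifest and ids already selected.
            if candidate in manifest and candidate not in selected:
                selected.append(candidate)
    return selected

# If retrieval picks the leave policy, the glossary and handbook come along too.
print(expand_with_related(["leave-policy"], manifest))
```

The links live in one place, so the documents themselves stay free of redundant cross-references, and the lookup follows only direct links to avoid pulling in the whole collection.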