How to Prepare your content for RAG
To have more control over the responses from your AI assistant, you need to implement a RAG Model as part of your process or workflow. For example, you may have specific documents within your organisation, or syllabus content a teacher wishes her class of students to engage with.
Retrieval-Augmented Generation: a technique that enhances the accuracy and factual grounding of generative models by incorporating external information retrieval.
This approach offers several advantages in terms of control:
Improved accuracy and factuality: By using external knowledge sources (i.e. not relying only on the data the traditional foundation model was trained on), RAG models can provide more accurate and reliable information than traditional generative models that rely solely on their internal statistical patterns.
External knowledge sources: RAG models consult reliable external sources like scientific papers or historical archives. This grounding in factual data reduces the risk of making things up or relying on internal biases.
Traditional generative models: These models rely on the statistical patterns they learned during training. This can lead to factual errors if the training data itself contained inaccuracies.
Context-awareness: The retrieved information helps the model understand the context of the prompt, leading to more relevant and coherent responses.
Flexibility: RAG models can be adapted to different domains by using different information retrieval sources.
Reduced bias: RAG has the potential to reduce bias through factual grounding and diversity: grounding responses in factual data from reliable sources, and giving the model access to a wider and more diverse range of information. This doesn't, however, completely eliminate bias, since the response is still partially influenced by the training data of the foundation model, and your selection of the additional data for the RAG Model will carry either conscious (intended) or unconscious (unintended) bias.
user input (prompt) → Foundation Model (LLM) + RAG Model → response (accuracy | context)
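To make this flow concrete, here is a minimal sketch in Python of the retrieve-then-generate loop. The in-memory document store, the keyword-overlap scoring and the call_llm placeholder are all illustrative assumptions, not a specific product's API.

```python
# Minimal sketch of the prompt -> retrieval -> augmented prompt -> response flow.
# The document store, scoring and call_llm() are simplified placeholders.

DOCUMENTS = {
    "policy.txt": "Staff may work remotely up to three days per week.",
    "syllabus.txt": "Module 2 covers photosynthesis and cellular respiration.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Score each document by naive keyword overlap and return the top k passages."""
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(text.lower().split())), text)
        for text in DOCUMENTS.values()
    ]
    scored.sort(reverse=True)
    return [text for score, text in scored[:k] if score > 0]

def build_augmented_prompt(query: str) -> str:
    """Combine the user prompt with retrieved passages so the LLM answers from them."""
    context = "\n".join(retrieve(query))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

def call_llm(prompt: str) -> str:
    """Placeholder for a call to whichever foundation model you use."""
    return f"[LLM response grounded in: {prompt[:60]}...]"

print(call_llm(build_augmented_prompt("How many days can staff work remotely?")))
```

In a real system the keyword scoring would typically be replaced by embedding similarity against a vector index, but the shape of the flow stays the same.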
How to prepare your content and documents for RAG
The quality of the output (response) is largely dependent on the quality of the data. It is therefore recommended to prepare your targeted documentation to get the best results when implementing RAG.
Headings or bold text? [PDFs or .docs?]
Should I include a Glossary of terms?
By applying RAG, are responses no longer drawn from the internet?
What about multiple connected documents?
A note: applying the following steps to optimise all your documents may not be possible, especially when dealing with large volumes of content. Implementing a RAG Model that sources your current PDFs and/or rich-text formatted .docs will still deliver good results you can trust and control. The following steps are therefore recommended as best practice to apply during document creation from the start. Organisations need to start assuming that any new document they generate may at some point form part of a collection of source material for AI (machine learning) tools to process.
Headings and structure
It's generally better to have clear headings and subheadings for your RAG document, even if the RAG system can potentially interpret bold text as a new segment. Here's why:
Improved Readability: Headings and subheadings make the document easier to understand for human readers. They break down the content into logical sections, improving navigation and information retrieval.
Accuracy: Relying solely on bold formatting for segment identification might be inaccurate. Bold text could be used for emphasis within a section, not necessarily to indicate a new one. Headings provide a clearer and more reliable signal.
Consistency: A well-defined heading structure ensures consistency throughout the document. This makes the RAG source content easier to maintain and update.
RAG and Bold Text:
While some RAG systems might interpret bold text as a segment divider, it's not a universal feature. Even if it is, it's not always reliable.
Here's a breakdown of the two approaches:
Headings and Subheadings: This is the recommended approach. It provides a clear and consistent structure for both human readers and the RAG system (see the sketch after this list).
Bold Text: This might work in some specific RAG systems, but it's unreliable and can lead to confusion.
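As an illustration of heading-based segmentation, the sketch below splits text on Markdown-style headings into titled chunks. The heading pattern is an assumption; you would adapt it to however headings appear in your own exports.

```python
import re

# Split text into chunks at Markdown-style headings ("#", "##", ...).
# The heading pattern is an assumption; adapt it to how your export marks headings.
HEADING = re.compile(r"^(#{1,6})\s+(.*)", re.MULTILINE)

def chunk_by_headings(text: str) -> list[dict]:
    """Return one chunk per heading, keeping the heading as the chunk title."""
    chunks = []
    matches = list(HEADING.finditer(text))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({"title": m.group(2).strip(), "body": text[start:end].strip()})
    return chunks

sample = "# Remote work\nStaff may work remotely.\n\n## Equipment\nLaptops are provided."
for chunk in chunk_by_headings(sample):
    print(chunk["title"], "->", chunk["body"])
```

Relying on bold runs instead would mean guessing which bold text marks a new section and which is just emphasis, which is exactly the ambiguity headings avoid.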
Rich-text formatted docs: Documents such as .docx better retain the segmentation and structure of the content and make text extraction easier. However, .docx files may not be compatible with every system. PDF is a universally accepted format and will be fine, but be aware that extracting text from complex PDFs with intricate layouts, or from scanned documents, can be challenging. This can lead to inaccuracies in the text the LLM sees, impacting RAG performance.
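For extraction itself, a rough sketch is below, assuming the python-docx and pypdf packages and hypothetical files named report.docx and report.pdf. It shows why .docx is convenient: heading styles survive extraction, whereas a PDF comes back as flat page text.

```python
# Sketch: extracting text while keeping structure, assuming the python-docx
# and pypdf packages and hypothetical local files report.docx / report.pdf.
from docx import Document    # pip install python-docx
from pypdf import PdfReader  # pip install pypdf

# .docx keeps paragraph styles, so headings can be detected directly.
doc = Document("report.docx")
for para in doc.paragraphs:
    if para.style.name.startswith("Heading"):
        print("HEADING:", para.text)
    elif para.text.strip():
        print("BODY:", para.text)

# PDF extraction returns flat text per page; headings, columns and tables
# may come out scrambled, which is why complex PDFs need checking by hand.
reader = PdfReader("report.pdf")
for page in reader.pages:
    print(page.extract_text())
```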
Images: Text-only RAG pipelines currently can't 'read' images, although this is under active development; multimodal models that can read and interpret images are already available.
In conclusion, using headings and subheadings is the better approach for creating a clear, well-structured, and accurate RAG document. Check for system compatibility issues and, if needed, convert these easy-to-read docs into PDFs.
Add a Glossary
Including a glossary of key terms and definitions in your RAG document can significantly improve results.
Here's why:
Improved LLM Understanding: The LLM (Large Language Model) used in RAG relies on the provided document for information. A glossary explicitly defines key terms, ensuring the LLM interprets them correctly when generating responses or retrieving relevant passages.
Reduced Ambiguity: Technical terms or jargon used within the document can be ambiguous. A glossary clarifies their meaning, leading to more focused and accurate responses by the RAG system.
Enhanced Search Accuracy: When users query the RAG system for specific terms, the glossary definitions can help identify relevant sections even if the exact term isn't used in the document itself.
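One simple way to put a glossary to work is to look up any defined terms that appear in the user's question and place those definitions alongside the retrieved context. The glossary entries and prompt wording below are illustrative assumptions, not a fixed recipe.

```python
# Sketch: pairing a small glossary with the user's question so the model sees
# explicit definitions for any key terms the question mentions.
GLOSSARY = {
    "RAG": "Retrieval-Augmented Generation: grounding model output in retrieved documents.",
    "chunk": "A segment of a source document indexed for retrieval.",
}

def glossary_context(question: str) -> str:
    """Return definitions for glossary terms mentioned in the question."""
    hits = [f"{term}: {definition}"
            for term, definition in GLOSSARY.items()
            if term.lower() in question.lower()]
    return "\n".join(hits)

question = "How should I size each chunk for RAG?"
prompt = f"Definitions:\n{glossary_context(question)}\n\nQuestion: {question}"
print(prompt)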
Isolation for Focused Responses
When using RAG for a single document, the goal is to isolate the LLM from external information sources like the internet. This ensures responses are based solely on the content within that specific document.
Here's the reasoning:
Focused Retrieval: RAG aims to retrieve relevant passages from the document itself. External information can introduce irrelevant data and potentially skew the results.
Consistency and Accuracy: Isolating the LLM promotes consistent and accurate responses based on the document's defined knowledge base.
Domain-Specific Knowledge: If the document contains specialized information, keeping the LLM focused helps it leverage that domain-specific knowledge for better responses.
However, there are some nuances to consider:
Pre-trained LLMs: While the LLM shouldn't directly access the internet during RAG processing, its pre-training on a massive dataset still influences its understanding.
Limited Scope: Isolating the LLM might limit the system's ability to handle broader inquiries that might be relevant but not explicitly covered in the document.
In conclusion, including a glossary and isolating the LLM from the internet are both beneficial practices for optimizing RAG performance with a single document. They ensure the LLM leverages the document's specific knowledge base for focused, accurate, and relevant responses.
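How isolation is enforced depends on your platform, but at prompt level it often comes down to an instruction like the sketch below. The exact wording and the build_isolated_prompt helper are assumptions, not a standard API.

```python
# Sketch: an instruction template that asks the model to stay inside the
# retrieved context and admit when the answer is not covered.
SYSTEM_INSTRUCTION = (
    "Answer using only the context provided below. "
    "If the answer is not in the context, reply that the document does not cover it."
)

def build_isolated_prompt(context: str, question: str) -> str:
    """Assemble a prompt that discourages the model from reaching beyond the document."""
    return f"{SYSTEM_INSTRUCTION}\n\nContext:\n{context}\n\nQuestion: {question}"

print(build_isolated_prompt(
    "Staff may work remotely up to three days per week.",
    "Can staff work remotely on Fridays?",
))
```

Many RAG frameworks expose an equivalent setting; the point is that the instruction and the retrieved context travel to the model together.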
Connected Documents. Should the Glossary be a connected Document?
Yes - it's best to connect multiple documents within a RAG system for a specific topic. Here's how to approach it.
Connecting Documents:
Directly referencing all related documents within each document will work, but it isn't the most efficient approach. There are better ways to establish connections for the LLM:
Document Metadata: Include metadata tags within each document that categorize them by topic, subtopic, or any other relevant classification system. This allows the RAG system to understand the relationships between documents based on their shared tags (see the sketch after this list).
Centralized Knowledge Base: Consider creating a separate, concise knowledge base that summarizes key concepts, connections, and relationships between the documents. This knowledge base can then be referenced by the RAG system to understand the broader context.
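A minimal sketch of the metadata approach is shown below. The field names (source, topic, subtopic) are illustrative assumptions; most vector stores let you attach an arbitrary metadata dictionary to each chunk and filter on it at retrieval time.

```python
# Sketch: attaching topic metadata to each chunk before indexing, so related
# documents can be linked by shared tags at retrieval time. The field names
# are illustrative, not a specific vector database's schema.
chunks = [
    {
        "text": "Module 2 covers photosynthesis and cellular respiration.",
        "metadata": {"source": "syllabus.docx", "topic": "biology", "subtopic": "energy"},
    },
    {
        "text": "Photosynthesis: the process plants use to convert light into chemical energy.",
        "metadata": {"source": "glossary.docx", "topic": "biology", "subtopic": "definitions"},
    },
]

def related_chunks(topic: str) -> list[str]:
    """Return the text of every chunk that shares the requested topic tag."""
    return [c["text"] for c in chunks if c["metadata"]["topic"] == topic]

print(related_chunks("biology"))
```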
Glossary as a Separate Document:
Yes, creating a separate glossary document can be a good strategy for several reasons:
Reduced Redundancy: A single glossary avoids repetitive definitions across multiple documents.
Centralized Reference: This allows the LLM to easily access all definitions during retrieval and generation tasks.
Scalability: As the document collection grows, maintaining a separate glossary is easier than updating individual documents.
Additional Techniques:
Here are some other techniques to consider for connecting documents in a multi-document RAG system:
Hyperlinking: If the documents are in a format that supports it (like HTML), you can create hyperlinks between related sections in different documents.
Entity Linking: This involves identifying and linking named entities (people, places, organizations) across documents, which helps the LLM understand how entities connect different documents (a rough sketch follows this list).
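The entity-linking sketch below assumes the spaCy package and its small English model (en_core_web_sm); the two sample documents are hypothetical. Chunks that share a named entity become candidates for explicit links in the index.

```python
# Sketch: simple entity linking across documents, assuming the spaCy package
# and its small English model (pip install spacy; python -m spacy download en_core_web_sm).
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

documents = {
    "doc1.txt": "Marie Curie carried out her research in Paris.",
    "doc2.txt": "The Curie Institute in Paris continues that work today.",
}

# Map each named entity to the documents that mention it.
entity_index: dict[str, set[str]] = defaultdict(set)
for name, text in documents.items():
    for ent in nlp(text).ents:
        entity_index[ent.text].add(name)

# Documents sharing an entity are candidates for being linked in the RAG index.
for entity, docs in entity_index.items():
    if len(docs) > 1:
        print(f"{entity} connects: {sorted(docs)}")
```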
Choosing the Right Approach:
The best approach depends on the size and complexity of your document collection. Here's a general guideline:
Smaller document sets: Metadata tags and a separate glossary might be sufficient.
Larger document sets: Consider a combination of techniques, including metadata, a centralized knowledge base, and potentially entity linking.
Remember, the goal is to provide clear and concise information about the relationships between documents without overwhelming the LLM with unnecessary redundancy.