Four-page web summary of: Context-Aware Retrieval Augmented Generation Using Similarity Validation to Handle Context Inconsistencies in Large Language Models
Authors: Enrico Collini, Felix Indra Kurniadi, Paolo Nesi and Gianni Pantaleo | DISIT Lab, University of Florence | IEEE Access, 2025 | DOI: 10.1109/ACCESS.2025.3614553
|
Key message: CA-RAG improves question answering over specialized documents by letting the LLM consider every chunk, then validating generated answers against their context with similarity scores. This shifts the critical decision from pre-generation retrieval to post-generation evidence checking, reducing hallucinations in settings where accuracy is more important than speed. |
Large Language Models are now widely used to answer questions, draft explanations and support decisions in sectors such as healthcare, education and finance. Their usefulness is limited by two familiar problems: they may produce convincing but incorrect statements, and they often struggle when a user asks about information that is outside the model's original training data or belongs to a narrow technical domain.
Retrieval Augmented Generation, or RAG, was introduced to address this gap. Instead of relying only on what a model already knows, RAG retrieves relevant passages from external documents and gives those passages to the model as evidence. This is especially attractive for organisations that need to work with frequently changing manuals, internal documentation, legal texts, scientific reports or product knowledge without constantly fine-tuning a large model.
The paper argues that traditional RAG introduces a new bottleneck. The quality of the final answer depends heavily on the retriever: if the retriever selects the wrong chunks, or misses information spread across several sections, the LLM may generate an incomplete or misleading response. This failure can occur even when the information exists in the document collection, because similarity search may favour passages that look lexically or semantically close to the question but do not actually contain the evidence needed to answer it.
The authors propose Context-Aware Retrieval Augmented Generation (CA-RAG) to handle this weakness. CA-RAG removes the initial top-k retrieval filter. After document chunking, it feeds the question and each chunk to the LLM, obtains candidate answers, and then uses post-processing to rank or reject the answers according to how well they match the source context. In simple terms, CA-RAG asks the LLM to interpret all candidate evidence first and checks the generated answers afterward.

Why Traditional RAG Can Fail
Traditional RAG follows a retrieve-and-read workflow. First, documents are pre-processed, divided into chunks and transformed into vector embeddings. When a user asks a question, the question is also embedded and compared against stored document embeddings. The system retrieves the top-ranked chunks, combines them with the question and prompt, and sends the package to the LLM for answer generation.
This design is efficient, but the paper highlights three weaknesses. First, retrieval quality depends on the embedding model and the similarity measure. Second, the best evidence may be distributed across sections that are not individually ranked as highly similar to the query. Third, vector similarity can select chunks that appear close to the question while missing deeper contextual relevance. The authors describe this as a risk of semantic drift: relevant information is filtered out before the LLM has a chance to reason over it.
|
Stage |
Traditional RAG |
CA-RAG |
|---|---|---|
|
Indexing |
Documents are chunked and embedded for retrieval. |
Documents are pre-processed and chunked; semantic chunking can preserve meaning. |
|
Selection |
Top-k chunks are selected before generation. |
No initial top-k filter: each chunk can be considered by the LLM. |
|
Generation |
The LLM answers from the retrieved context. |
The LLM produces candidate answers from question-prompt-chunk combinations. |
|
Validation |
Often limited to retrieval score or reranking before generation. |
Candidate answers are checked against context using cosine similarity or dot product. |
The proposed semantic chunking method starts by splitting text into sentences, embedding each sentence and normalizing the embedding. Consecutive sentence embeddings are compared. When their similarity falls below a threshold, a new chunk is started; otherwise, the sentence is added to the current chunk. The goal is to create chunks that remain semantically coherent rather than relying only on fixed text length.
After chunking, CA-RAG works in three stages. In the indexing stage, documents are cleaned and split into chunks. In the generation stage, each chunk is combined with the prompt and the user question, and the Zephyr-7B model generates candidate answers. In the post-processing stage, the system evaluates how closely each generated answer matches the chunk that supported it.
The role of similarity validation
The paper tests two post-processing ideas. One selects the answer with the highest similarity to the question. The other removes answers whose similarity to their source context falls below a threshold, then selects the best remaining answer. The authors focus on the second technique because it directly tests whether the answer is grounded in the provided context. When the similarity score is too low, the answer can be marked as not relevant or replaced with a conservative response such as 'no context provided'.
This mechanism is especially important for out-of-context questions. Instead of allowing the LLM to answer from memory, CA-RAG can reject unsupported answers and avoid presenting unsupported general knowledge as if it came from the source document. This makes the approach suitable for web services that expose domain-specific knowledge bases to non-expert users.
How the Method Was Evaluated
The authors designed two evaluation scenarios. Scenario 1 used established question-answering benchmarks: TriviaQA, Natural Questions, AmbigQA and SQuAD, together forming about ten thousand question-answer pairs. These datasets test different capabilities: short explicit answers, cross-sentence reasoning, ambiguous questions and more complex comprehension. BERTScore was used to compare generated answers with ground truth answers through contextual embeddings rather than exact string matching alone.
Scenario 2 was created to approximate real-world technical document use. It used scientific papers on five topics: bike-sharing predictions, landslide prediction, reputation assessment, a scenario editor, and thermal-camera-based people counting. For each topic, the authors defined specific questions, general questions and out-of-context questions. Domain experts manually evaluated whether generated answers were correct, incorrect or appropriately rejected when no supporting context was available.
|
Finding |
Best/Notable result |
What it means |
Web takeaway |
|---|---|---|---|
|
Longer-context benchmarks |
CA-RAG achieved the strongest F1 on TriviaQA under fixed-size chunking with Contriever and dot product scoring (0.9089). |
The approach is effective when relevant evidence may require broader context. |
Strong for long or complex documents. |
|
Ambiguous questions |
CA-RAG with fixed-size chunking reached a top AmbigQA F1 of 0.8652. |
Similarity validation helps keep answers aligned with evidence under ambiguity. |
Useful for open-ended Q&A. |
|
Short-context tasks |
RAG with reranking performed best on NaturalQA and SQuAD in several settings. |
CA-RAG is not universally faster or better; retrievers still work well when answers are compact and explicit. |
Choose the method based on document type. |
|
Domain papers |
Scenario 2 reported CA-RAG F1 scores up to 0.9872 with semantic chunking and dot-product post-processing; cosine-based configurations also reached about 0.94. |
The post-processing filter reduced hallucinated answers and rejected unsupported questions. |
Most relevant for technical web knowledge bases. |
The paper compares CA-RAG with traditional RAG using MiniLM embeddings, RAG using Contriever embeddings, and versions of both that add a reranker. It also compares the approach in Scenario 2 with a long-context question-answering baseline using Llama 3.3-70B. The benchmark results show a nuanced picture: retrieval and reranking remain competitive on short and well-structured contexts, but CA-RAG becomes more compelling when the documents are longer, specialized and less structured.
The strongest evidence for CA-RAG comes from Scenario 2. In that setting, traditional RAG variants often failed because they could not retrieve the correct chunk from complete scientific papers. Models without post-processing were more likely to hallucinate answers to out-of-context questions. CA-RAG's similarity validation increased true negatives: it correctly identified when a chunk did not contain enough evidence to answer the question. At the same time, it reduced false positives, where an unsupported or incorrect answer might otherwise be presented as valid.
The evaluation also shows that post-processing is robust across embedding choices. MiniLM and Contriever were both tested in the validation phase, and the results stayed broadly consistent. This suggests that the core benefit is not tied to a single embedding model, but to the strategy of validating generated answers against source context.
A practical interpretation
For publishers, portals and knowledge services, the important point is that CA-RAG can help build web interfaces that answer questions over specialized documents while being more cautious about unsupported content. The method is not just trying to retrieve a likely passage; it is checking whether the answer produced from that passage is sufficiently grounded in it.
Implications, Limits and Recommended Web Copy
CA-RAG is most attractive in use cases where evidence quality matters more than response speed. Examples include technical support over product manuals, legal and compliance documents, medical or scientific documentation, smart-city knowledge bases, research repositories and internal enterprise documents. In these settings, a wrong answer can be more damaging than a slower answer, and a conservative 'no context provided' response is preferable to a plausible hallucination.
The approach does have a cost. Because CA-RAG sends more chunks through the generation stage, it is slower than traditional RAG. In Scenario 2, the authors report average per-question times of about 262.6 seconds with semantic chunking and about 2289.5 seconds with fixed-size chunking. The semantic-chunking configuration is therefore the more practical variant, especially because it also delivered the best overall performance in the real-document experiment. The authors argue that this processing time remains favourable compared with the time a human domain expert would need to read long technical documents and produce a similarly grounded answer.
|
Recommended web headline: CA-RAG: A Context-Aware Method for More Reliable Question Answering Over Technical Documents Suggested meta description: A new Context-Aware Retrieval Augmented Generation method improves domain-specific LLM question answering by validating generated answers against document context using similarity measures. Suggested teaser: Instead of trusting the first retrieved passages, CA-RAG lets the LLM consider document chunks and then filters unsupported answers, helping reduce hallucinations in specialized knowledge tasks. |
The main limitation is that similarity between a generated answer and a context chunk is not always a perfect proxy for factual support. It may be less reliable for abstract reasoning, multi-hop answers, or chunks that mix relevant and irrelevant information. The authors also note that CA-RAG is most practical for moderate-sized corpora, because providing all chunks to the LLM is computationally demanding at large scale.
Future work will explore integrating a chain-of-thought-style query algorithm to improve retrieval and generation. This direction is intended to further increase precision and reduce hallucinations, while potentially making the system more efficient in how it reasons over evidence.
Conclusion for web readers
The paper contributes a clear idea to the design of trustworthy LLM systems: do not let a single early retrieval decision determine the final answer. CA-RAG delays the evidence-selection decision until after the LLM has produced candidate answers and then uses similarity validation to identify which answers are actually supported by the document context. The result is a method that is slower than standard RAG but more reliable for technical, specialized and high-stakes knowledge tasks.
In the experiments, this trade-off paid off most clearly on longer and more realistic documents. CA-RAG with post-processing outperformed traditional RAG baselines on the custom scientific-document scenario and achieved strong results on long-context and ambiguous benchmark tasks. For web applications built around domain documents, the message is simple: retrieval is not enough; generated answers should also be validated against the context they claim to represent.
Source: Collini, E.; Kurniadi, F. I.; Nesi, P.; Pantaleo, G. Context-Aware Retrieval Augmented Generation Using Similarity Validation to Handle Context Inconsistencies in Large Language Models. IEEE Access, 2025. DOI: 10.1109/ACCESS.2025.3614553.
See also neuroSymbolic ai at scale Course of paolo Nesi: https://www.disit.org/node/7194