PaperMadeEasy | RAGAS: Automated Evaluation of Retrieval Augmented Generation
The idea of using Large Language Models (LLMs) as knowledge bases has two basic limitations:
- LLMs cannot answer questions about events that happened after they were trained.
- LLMs struggle to memorise knowledge that is rarely mentioned in the training corpus.
The standard solution to these problems is Retrieval Augmented Generation (RAG).
RAG systems are composed of a retrieval module and an LLM-based generation module. They provide the LLM with knowledge from a reference textual database, which lets it act as a natural language layer between a user and that database and reduces the risk of hallucinations.
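To make the two modules concrete, here is a minimal sketch of a retrieve-then-generate loop. It is an illustration only, not the setup used in the paper: the word-overlap retriever, the prompt wording, the function names and the `llm` callable (any function that maps a prompt string to an answer string) are all assumptions made for this example.

```python
def retrieve(question, corpus, k=2):
    """Toy retriever (assumption): rank passages by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(q_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages):
    """Put the retrieved passages in front of the question (hypothetical prompt wording)."""
    context = "\n".join(passages)
    return (
        "Answer the question using only the context.\n"
        f"context: {context}\n"
        f"question: {question}\n"
        "answer:"
    )


def rag_answer(question, corpus, llm, k=2):
    """Retrieve context c(q), then let the LLM generate the answer a(q)."""
    passages = retrieve(question, corpus, k)
    answer = llm(build_prompt(question, passages))
    return passages, answer


corpus = [
    "Python is a programming language.",
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is the highest mountain on Earth.",
]
```

Any real LLM client can be wrapped in a function and passed as `llm`; a stub such as `lambda prompt: "1889"` is enough to try the flow end to end.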
While the approach works, there are some challenges with RAG:
- Significant tuning effort, since the overall performance is affected by the retrieval model, the corpus considered, etc.
- Evaluation of RAG systems is also challenging because there are several dimensions to consider:
* The ability of the retrieval system to identify relevant and focused context passages.
* The ability of the LLM to exploit such passages in a faithful way.
* The quality of the generation itself.
The authors propose RAGAS (Retrieval Augmented Generation Assessment), a method for evaluating RAG systems along the dimensions above without having to rely on ground-truth human annotations.
EVALUATION STRATEGIES
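For RAG evaluation, they consider the following setting: given a question q, the system retrieves a context c(q) and uses it to generate an answer a(q). Three quality aspects are then measured: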
Faithfulness
It refers to the idea that the retrieved context can act as justification for the generated answer. The answer is said to be faithful to the context if the claims made in the answer can be inferred from the context.
To estimate faithfulness, the following steps are performed:
- Use an LLM to extract a set of statements, S(a(q)), from the answer, with the aim of decomposing longer sentences into shorter, more focused assertions. The prompt used is:
Given a question and answer, create one or more statements from each sentence in the given answer.
question: [question]
answer: [answer]
- For each statement si in S, the LLM determines whether si can be inferred from c(q) using a verification function v(si, c(q)). Verification uses the following prompt:
Consider the given context and following statements, then determine whether they are supported by the information present in the context.
Provide a brief explanation for each statement before arriving at the verdict (Yes/No).
Provide a final verdict for each statement in order at the end in the given format.
Do not deviate from the specified format.
statement: [statement 1]
…
statement: [statement n]
- The faithfulness score F is then calculated as:
F = |V| / |S|
where |V| is the number of statements that the LLM judged to be supported by the context, and |S| is the total number of statements.
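Once the verdicts are available, the score is a simple ratio of supported statements to all statements. The sketch below assumes the LLM output has already been parsed into one "Yes"/"No" string per statement; the function name and that input shape are choices made for this example, not something the paper prescribes.

```python
def faithfulness_score(verdicts):
    """Return |V| / |S| given one "Yes"/"No" verdict per extracted statement.

    Assumption: the verdicts were already parsed out of the LLM's response.
    """
    if not verdicts:
        raise ValueError("at least one statement is required")
    # |V|: statements the LLM judged to be supported by the context.
    supported = sum(1 for verdict in verdicts if verdict.strip().lower() == "yes")
    # |S|: all statements extracted from the answer.
    return supported / len(verdicts)


print(faithfulness_score(["Yes", "No", "Yes", "Yes"]))  # 3 of 4 supported -> 0.75
```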
Answer Relevance
The idea here is to check whether the generated answer addresses the input question appropriately, penalising cases where the answer is incomplete or contains redundant information.
To estimate answer relevance, the following steps are performed:
- For the generated answer a(q), the LLM is prompted to generate n potential questions qi that the answer could be responding to. The prompt is:
Generate a question for the given answer.
answer: [answer]
- Obtain embeddings for the original question and all the generated questions (available via the OpenAI API).
- For each generated question qi, compute its cosine similarity sim(q, qi) with the original question q.
- The answer relevance score AR is the average of these similarities:
AR = (1/n) * (sim(q, q1) + ... + sim(q, qn))
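As a rough sketch, the score can be computed by averaging the cosine similarity between the original question's embedding and each generated question's embedding. The code assumes the embeddings have already been fetched as plain lists of floats; the function names and the tiny two-dimensional vectors in the example are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_relevance(question_embedding, generated_embeddings):
    """Average similarity between q and the questions generated from the answer."""
    if not generated_embeddings:
        raise ValueError("at least one generated question is required")
    similarities = [
        cosine_similarity(question_embedding, emb) for emb in generated_embeddings
    ]
    return sum(similarities) / len(similarities)


# Toy 2-d "embeddings": one generated question matches q, the other is orthogonal.
print(answer_relevance([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```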
Context Relevance
The context c(q) is considered relevant to the extent that it exclusively contains information that is needed to answer the question. The metric aims to penalise the inclusion of redundant information.
To estimate context relevance, the following steps are performed:
- Use an LLM to extract the subset of sentences from the context c(q) that are crucial to answering the question q.
The prompt used is given below:
Please extract relevant sentences from the provided context that can potentially help answer the following question.
If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase “Insufficient Information”.
While extracting candidate sentences you’re not allowed to make any changes to sentences from given context.
- The context relevance score CR is given as:
CR = (number of extracted sentences) / (total number of sentences in c(q))
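A minimal sketch of this score, taking it to be the number of extracted sentences divided by the total number of sentences in c(q), as defined in the RAGAS paper. The regex-based sentence splitter and the choice to score an "Insufficient Information" reply as 0 are assumptions made for this example.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def context_relevance(extracted_sentences, context):
    """Extracted sentences divided by the total number of sentences in the context."""
    total = len(split_sentences(context))
    if total == 0:
        raise ValueError("context must contain at least one sentence")
    # Assumption: an "Insufficient Information" reply counts as zero extracted sentences.
    if extracted_sentences == ["Insufficient Information"]:
        return 0.0
    return len(extracted_sentences) / total


context = (
    "The Eiffel Tower is in Paris. It was completed in 1889. "
    "Paris is the capital of France. Many tourists visit every year."
)
print(context_relevance(["It was completed in 1889."], context))  # 1 of 4 -> 0.25
```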
CONCLUSION
- The authors created a dataset, WikiEval, to evaluate RAGAS. To construct the dataset:
* They first selected 50 Wikipedia pages covering events that have happened since the start of 2022, prioritising pages with recent edits.
* For each of the 50 pages, they used ChatGPT to suggest a question that can be answered based on the introductory section of the page.
- The authors used human annotators to judge the LLM-generated answers along the three quality dimensions: Faithfulness, Answer Relevance and Context Relevance.
For the GPT Score baseline, the authors asked GPT to assign a score between 0 and 10 for each of the three quality dimensions, with the definition of the metric explained in the prompt.
For the GPT Ranking baseline, the authors asked GPT to select the preferred answer/context. Here too, the definition of the metric was explained in the prompt.
For more details on the experiments, please check the full paper here: https://arxiv.org/abs/2309.15217
Thanks for reading the blog! Keep Learning!