How does RAG work better for various enterprise-level Gen AI use cases? What needs to be in place to make an LLM work more efficiently and to check and validate its responses, covering bias, hallucination, and more?
This is my post (after a slight gap), which captures and discusses some of the burning issues that many AI architects are trying to solve. In this post, I've considered a newly formed AI start-up from India that developed an open-source framework that can easily evaluate the challenges one faces with LLMs and integrate with your existing models for a better understanding of their limitations. You will get plenty of insights from it.
But, before we dig deep, why not see the demo first –
Isn’t it exciting? Let’s deep dive into the flow of events.
Architecture:
Let’s explore the broad-level architecture/flow –
Let us understand the steps of the above architecture. First, our Python application exposes an API that interacts with OpenAI and UpTrain AI to fetch all the LLM KPIs, based on the input from the React app named "Evaluation."
Once the response is received from UpTrain AI, the Python application organizes the results in a more readable manner, without changing the core details coming out of their APIs, and then shares them back with the React interface.
Let's examine the React app's sample inputs to better understand what gets passed to the Python-based API, which acts as a wrapper that calls multiple UpTrain APIs, parses and reorganizes the returned data with the help of OpenAI, and accumulates everything into one response.
Highlighted in RED are some of the critical inputs you need to provide to get most of the KPIs. Here are the sample text inputs for your reference –
Q. Enter input question.
A. What are the four largest moons of Jupiter?
Q. Enter the context document.
A. Jupiter, the largest planet in our solar system, boasts a fascinating array of moons. Among these, the four largest are collectively known as the Galilean moons, named after the renowned astronomer Galileo Galilei, who first observed them in 1610. These four moons, Io, Europa, Ganymede, and Callisto, hold significant scientific interest due to their unique characteristics and diverse geological features.
Q. Enter LLM response.
A. The four largest moons of Jupiter, known as the Galilean moons, are Io, Europa, Ganymede, and Marshmello.
Q. Enter the persona response.
A. strict and methodical teacher
Q. Enter the guideline.
A. Response shouldn’t contain any specific numbers
Q. Enter the ground truth.
A. The Jupiter is the largest & gaseous planet in the solar system.
Q. Choose the evaluation method.
A. llm
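Assembled as a request body, the sample inputs above map onto the JSON fields the Flask route reads later in this post. A minimal sketch of what the React app posts to the API (the context is shortened here for brevity):

```python
import json

# Request body the React app sends to the /evaluate endpoint, using the
# field names the Flask route extracts (question, context, personaResponse,
# guideline, groundTruth, evaluationMethod).
payload = {
    "question": "What are the four largest moons of Jupiter?",
    "context": "Jupiter, the largest planet in our solar system, "
               "boasts a fascinating array of moons.",
    "personaResponse": "strict and methodical teacher",
    "guideline": "Response shouldn't contain any specific numbers",
    "groundTruth": "The Jupiter is the largest & gaseous planet in the solar system.",
    "evaluationMethod": "llm",
}

# Serialized form, as it travels over the wire
body = json.dumps(payload)
```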
Once you fill in, the app should look like the below screenshot –
Package Installation:
Let us understand the sample packages that are required for this task.
pip install Flask==3.0.3
pip install Flask-Cors==4.0.0
pip install numpy==1.26.4
pip install openai==1.17.0
pip install pandas==2.2.2
pip install uptrain==0.6.13
Code:
1. app.py (This script exposes a Flask-based API that receives the inputs from the React app, fetches the LLM response from OpenAI, evaluates that response through UpTrain, and then returns all the consolidated KPIs as JSON back to the React app.)
Note that we're not going to discuss the entire script here; only the relevant parts. However, you can get the complete scripts from the GitHub repository.
def askFeluda(context, question):
    try:
        # Combine the context and the question into a single prompt.
        prompt_text = f"{context}\n\n Question: {question}\n Answer:"

        # Retrieve conversation history from the session or database
        conversation_history = []

        # Add the new message to the conversation history
        conversation_history.append(prompt_text)

        # Call the OpenAI API with the updated conversation
        response = client.with_options(max_retries=0).chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt_text,
                }
            ],
            model=cf.conf['MODEL_NAME'],
            max_tokens=150,     # Adjust based on how long you expect the response to be
            temperature=0.3,    # Lower values make responses more focused and deterministic
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )

        # Extract the content from the first choice's message
        chat_response = response.choices[0].message.content

        # Return the generated response text
        return chat_response.strip()
    except Exception as e:
        return f"An error occurred: {str(e)}"
This function either answers the supplied question against its context or summarizes the UpTrain results, turning the JSON into more easily readable plain text. For our test, we've used "gpt-3.5-turbo".
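The prompt that askFeluda builds is simply the context and the question stitched into one template. Pulled out as a standalone helper (my own refactoring, not part of the original script), the formatting can be checked without ever calling OpenAI:

```python
def buildPrompt(context: str, question: str) -> str:
    # Same template askFeluda uses: context first, then the question,
    # ending with an "Answer:" cue for the model to complete.
    return f"{context}\n\n Question: {question}\n Answer:"

# Example: the hallucination demo from earlier in this post
prompt = buildPrompt(
    "Jupiter boasts a fascinating array of moons.",
    "What are the four largest moons of Jupiter?",
)
```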
def evalContextRelevance(question, context, resFeluda, personaResponse):
    try:
        data = [{
            'question': question,
            'context': context,
            'response': resFeluda
        }]

        results = eval_llm.evaluate(
            data=data,
            checks=[
                Evals.CONTEXT_RELEVANCE,
                Evals.FACTUAL_ACCURACY,
                Evals.RESPONSE_COMPLETENESS,
                Evals.RESPONSE_RELEVANCE,
                CritiqueTone(llm_persona=personaResponse),
                Evals.CRITIQUE_LANGUAGE,
                Evals.VALID_RESPONSE,
                Evals.RESPONSE_CONCISENESS
            ]
        )

        return results
    except Exception as e:
        return str(e)
The above method invokes the UpTrain evaluations to get all the stats that help you assess your LLM response. In this post, we've captured the following KPIs –
- Context Relevance Explanation
- Factual Accuracy Explanation
- Guideline Adherence Explanation
- Response Completeness Explanation
- Response Fluency Explanation
- Response Relevance Explanation
- Response Tonality Explanation
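For each check, UpTrain returns a dictionary per evaluated row containing a `score_*` and an `explanation_*` key named after that check; the parsing function below relies on that shape. A mocked result row (the scores and explanation texts here are invented purely for illustration):

```python
# Mocked shape of a single UpTrain result row; the values are
# invented for illustration, only the key naming pattern matters.
mock_result = [{
    "question": "What are the four largest moons of Jupiter?",
    "score_context_relevance": 1.0,
    "explanation_context_relevance": "The context directly lists the four moons.",
    "score_factual_accuracy": 0.5,
    "explanation_factual_accuracy": "Three of the four named moons are correct.",
}]

# Every check contributes a score/explanation pair keyed by its name
score_keys = [k for k in mock_result[0] if k.startswith("score_")]
```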
# Function to extract and print all the keys and their values
def extractPrintedData(data):
    # Map the raw UpTrain keys to the KPI names returned to the React app
    score_map = {
        'score_factual_accuracy': 'Factual_Accuracy_Score',
        'score_context_relevance': 'Context_Relevance_Score',
        'score_response_completeness': 'Response_Completeness_Score',
        'score_response_relevance': 'Response_Relevance_Score',
        'score_fluency': 'Response_Fluency_Score',
        'score_critique_tone': 'Response_Tonality_Score',
        'score_response_conciseness': 'Guideline_Adherence_Score',
        'score_valid_response': 'Response_Match_Score'
        # Add other evaluations similarly
    }
    explanation_map = {
        'explanation_factual_accuracy': 'Factual_Accuracy_Explanation',
        'explanation_context_relevance': 'Context_Relevance_Explanation',
        'explanation_response_completeness': 'Response_Completeness_Explanation',
        'explanation_response_relevance': 'Response_Relevance_Explanation',
        'explanation_fluency': 'Response_Fluency_Explanation',
        'explanation_critique_tone': 'Response_Tonality_Explanation',
        'explanation_response_conciseness': 'Guideline_Adherence_Explanation'
    }

    results = {}

    for entry in data:
        print("Parsed Data:")
        for key, value in entry.items():
            if key in score_map:
                results[score_map[key]] = value
            elif key in explanation_map:
                cleaned_value = preprocessParseData(value)
                print(f"{key}: {cleaned_value}\n")
                results[explanation_map[key]] = cleaned_value

    print('$' * 200)

    return results
The above method parses the initial data from UpTrain before sending it to OpenAI for a better summary, without changing any of the text UpTrain returned.
@app.route('/evaluate', methods=['POST'])
def evaluate():
    data = request.json

    if not data:
        return jsonify({'error': 'No data provided'}), 400

    # Extracting input data for processing
    question = data.get('question', '')
    context = data.get('context', '')
    personaResponse = data.get('personaResponse', '')
    guideline = data.get('guideline', '')
    groundTruth = data.get('groundTruth', '')
    evaluationMethod = data.get('evaluationMethod', '')

    print('question:')
    print(question)

    llmResponse = askFeluda(context, question)

    print('=' * 200)
    print('Response from Feluda::')
    print(llmResponse)
    print('=' * 200)

    # Getting the UpTrain evaluation of the LLM response
    cLLM = evalContextRelevance(question, context, llmResponse, personaResponse)

    print('&' * 200)
    print('cLLM:')
    print(cLLM)
    print(type(cLLM))
    print('&' * 200)

    results = extractPrintedData(cLLM)

    print('JSON::')
    print(results)

    return jsonify(results)
The above function is the main entry point: it receives all the input parameters from the React app, invokes the functions one by one to get the LLM response and its performance KPIs, and finally summarizes them before sending everything back to the React app.
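The endpoint's contract (a 400 on an empty body, a JSON object otherwise) can be exercised with Flask's built-in test client, no running server or API keys needed. A minimal sketch with the OpenAI and UpTrain calls stubbed out:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/evaluate', methods=['POST'])
def evaluate():
    # Same guard as the real route: reject empty bodies with a 400
    data = request.get_json(silent=True)
    if not data:
        return jsonify({'error': 'No data provided'}), 400
    # In the real app, askFeluda() and evalContextRelevance() run here;
    # stubbed out so the contract can be tested without API keys.
    return jsonify({'question': data.get('question', '')})

# Flask's test client exercises the route in-process
test_client = app.test_client()
resp = test_client.post(
    '/evaluate',
    json={'question': 'What are the four largest moons of Jupiter?'}
)
```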
For any other scripts, please refer to the above-mentioned GitHub link.
Run:
Let us see some of the screenshots of the test run –
So, we’ve done it.
I’ll bring some more exciting topics in the coming days from the Python verse.
Till then, Happy Avenging! 🙂
Note: All the data and scenarios posted here are representational, available over the internet, and intended for educational purposes only. There is always room for improvement in this kind of model and its associated solution; I've shown only the basic ways to achieve it.