Enhancing Prompt Engineering Accuracy in Software Requirements Processing with Multi-Agent Systems and Multiple LLMs
When processing software requirements or testing documents using large language models (LLMs), ensuring the accuracy and reliability of the results is a critical challenge. One effective approach is using a multi-agent system that leverages multiple LLMs to cross-validate outputs, compare results, and improve overall accuracy. This approach reduces reliance on a single model and makes the processing more robust, especially for tasks like test case generation, requirement validation, and data summarization.
This blog explores how to use multi-agent systems with different LLMs to process software requirements, compare results, and measure accuracy.
What is a Multi-Agent System?
A multi-agent system involves multiple independent AI agents (in this case, LLMs) working together to:
- Perform the same task independently.
- Cross-validate or refine each other’s outputs.
- Provide diverse perspectives to improve the quality and reliability of the final result.
In our case, the agents will be different LLMs (e.g., GPT-4, Claude, Cohere, or Google Bard).
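Before diving into the use case, here is a minimal sketch of the pattern: each agent is just a callable that receives the same prompt and returns text, and an orchestrator fans the prompt out and collects the outputs. The placeholder lambdas below are purely illustrative; the real query functions are defined in Step 3.
# A minimal sketch of the multi-agent pattern: every agent is a callable
# that takes the same prompt and returns its output as text.
def run_agents(agents, prompt):
    """Send the same prompt to every agent and collect the responses."""
    return {name: ask(prompt) for name, ask in agents.items()}

# Placeholder agents for illustration -- swap in the real query functions from Step 3.
agents = {
    "gpt": lambda p: "GPT output for: " + p,
    "claude": lambda p: "Claude output for: " + p,
    "cohere": lambda p: "Cohere output for: " + p,
}

outputs = run_agents(agents, "Extract the high-priority functional requirements.")
print(outputs)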
Use Case Overview
We’ll process a software requirements CSV file to:
- Extract high-priority functional requirements.
- Generate test cases for these requirements.
- Use multiple LLMs to process the data and compare their outputs.
Finally, we’ll measure accuracy by comparing the results and identifying discrepancies.
Step-by-Step Guide
Step 1: Setup and Prepare the Data
We’ll begin by preparing the software requirements data in a CSV format.
Example CSV File (requirements.csv):
ID,Requirement,Type,Priority,Status
1,Users must be able to register and log in using their email and password,Functional,High,Approved
2,Search functionality must return relevant results within 2 seconds,Functional,Medium,Pending
3,The platform must handle 500 concurrent users,Non-Functional,High,Approved
4,Payment processing must support credit cards and PayPal securely,Functional,High,Approved
5,Daily backups of all data must be performed automatically,Non-Functional,Medium,Pending
Load the CSV Data in Python
import pandas as pd
# Load the CSV file
csv_file = 'requirements.csv'
df = pd.read_csv(csv_file)
# Convert the data to JSON format
json_data = df.to_json(orient='records')
print("JSON Data for AI Query:")
print(json_data)
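Because the requirements are already structured, we can also compute a deterministic baseline of the rows we expect the models to extract. This step is optional, but it gives us a ready-made ground truth for the accuracy check in Step 5 (the column names below match the example CSV above).
# Deterministic baseline: the rows the LLMs should extract
high_priority_functional = df[(df["Type"] == "Functional") & (df["Priority"] == "High")]
print("Expected high-priority functional requirements:")
print(high_priority_functional[["ID", "Requirement"]])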
Step 2: Define the Prompt
We’ll use the following prompt to query multiple LLMs. The task involves extracting high-priority functional requirements and generating test cases.
Prompt Template:
Here is the software requirements data in JSON format:
{json_data}
Task:
1. Extract all functional requirements whose Priority is "High".
2. Generate 2 test cases for each extracted requirement.
3. Return the output in a structured JSON format.
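To make the outputs easy to compare programmatically in Step 4, it helps to pin down the exact JSON shape in the prompt. The structure below is one possible convention; the field names are illustrative, not something the prompt above already enforces.
{
  "requirements": [
    {
      "id": 1,
      "requirement": "Users must be able to register and log in using their email and password",
      "test_cases": [
        "Verify that a user can register with a valid email and password",
        "Verify that login fails with an incorrect password"
      ]
    }
  ]
}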
Step 3: Query Multiple LLMs
We’ll use Python to send the same prompt to three different LLMs (GPT-4, Claude, and Cohere) and collect their outputs. Here’s how you can set up the multi-agent system.
Install Required Libraries
pip install openai requests
Python Code for Multi-Agent Queries
import openai
import requests
import json
# Set API Keys for different LLMs
openai.api_key = 'your-openai-api-key'
anthropic_api_key = 'your-anthropic-api-key'
cohere_api_key = 'your-cohere-api-key'
# Define the prompt
prompt = f"""
Here is the software requirements data in JSON format:
{json_data}
Task:
1. Extract all functional requirements whose Priority is "High".
2. Generate 2 test cases for each extracted requirement.
3. Return the output in a structured JSON format.
"""
# Query GPT-4 (OpenAI) -- uses the pre-1.0 openai SDK style configured above
def query_gpt(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        temperature=0
    )
    return response.choices[0].message["content"].strip()

# Query Claude (Anthropic) -- legacy text completions REST endpoint
def query_claude(prompt):
    url = "https://api.anthropic.com/v1/complete"
    headers = {
        "x-api-key": anthropic_api_key,
        "anthropic-version": "2023-06-01",
        "Content-Type": "application/json"
    }
    data = {
        # The legacy completions endpoint expects Human/Assistant turns in the prompt
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "model": "claude-v1",
        "max_tokens_to_sample": 500,
        "temperature": 0
    }
    response = requests.post(url, headers=headers, json=data)
    return response.json()['completion'].strip()

# Query Cohere -- Generate REST endpoint
def query_cohere(prompt):
    url = "https://api.cohere.ai/v1/generate"
    headers = {"Authorization": f"Bearer {cohere_api_key}"}
    data = {
        "model": "command-xlarge-nightly",
        "prompt": prompt,
        "max_tokens": 500,
        "temperature": 0
    }
    response = requests.post(url, headers=headers, json=data)
    return response.json()['generations'][0]['text'].strip()
# Execute queries
gpt_output = query_gpt(prompt)
claude_output = query_claude(prompt)
cohere_output = query_cohere(prompt)
# Display outputs
print("GPT Output:")
print(gpt_output)
print("\nClaude Output:")
print(claude_output)
print("\nCohere Output:")
print(cohere_output)
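The three calls above run one after another. If latency matters, a thread pool from the standard library is a simple way to run the HTTP requests in parallel; this is a sketch that reuses the query functions defined above.
from concurrent.futures import ThreadPoolExecutor

def query_all(prompt):
    """Run all three model queries concurrently and return a dict of outputs."""
    queries = {"GPT": query_gpt, "Claude": query_claude, "Cohere": query_cohere}
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in queries.items()}
        return {name: future.result() for name, future in futures.items()}

# outputs = query_all(prompt)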
Step 4: Compare and Validate Outputs
To measure the accuracy of the outputs, we’ll compare the results from all LLMs and identify discrepancies. This can be done by:
- Manual Comparison: Reviewing outputs side-by-side.
- Automated Comparison: Using Python to check for differences.
Automated Comparison Code
# Parse outputs into JSON (assumes each model returned bare JSON;
# see the defensive parser after this comparison code if that assumption breaks)
gpt_data = json.loads(gpt_output)
claude_data = json.loads(claude_output)
cohere_data = json.loads(cohere_output)

# Compare outputs key by key and collect any differences
def compare_outputs(data1, data2):
    discrepancies = []
    for key in data1.keys():
        if data1[key] != data2.get(key):
            discrepancies.append({
                "Key": key,
                "Data1": data1[key],
                "Data2": data2.get(key)
            })
    return discrepancies
# Compare GPT vs Claude
discrepancies_gpt_claude = compare_outputs(gpt_data, claude_data)
# Compare GPT vs Cohere
discrepancies_gpt_cohere = compare_outputs(gpt_data, cohere_data)
# Display discrepancies
print("Discrepancies between GPT and Claude:")
print(discrepancies_gpt_claude)
print("\nDiscrepancies between GPT and Cohere:")
print(discrepancies_gpt_cohere)
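json.loads above assumes each model returned bare JSON. In practice, models often wrap JSON in markdown code fences or add surrounding prose, so a small defensive parser is worth having. This is a best-effort sketch; the fence-stripping regex reflects typical model behaviour, not a guarantee.
import json
import re

def parse_llm_json(raw_text):
    """Best-effort extraction of a JSON object from an LLM response."""
    # Strip markdown code fences such as ```json ... ``` if present
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw_text, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw_text
    # Fall back to the first {...} span if there is leading or trailing prose
    if not candidate.lstrip().startswith("{"):
        brace = re.search(r"\{.*\}", candidate, re.DOTALL)
        if brace:
            candidate = brace.group(0)
    return json.loads(candidate)

# Usage: gpt_data = parse_llm_json(gpt_output)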
Step 5: Measure Accuracy
To evaluate the accuracy of each model:
- Define a ground truth (a set of manually verified correct outputs).
- Compare each model’s output against the ground truth using metrics like:
- Precision: Correct results out of all results generated by the model.
- Recall: Correct results out of all possible correct results.
- F1-Score: Harmonic mean of precision and recall.
Accuracy Calculation Code
def calculate_accuracy(predicted, ground_truth):
    # Convert results to sets for comparison
    predicted_set = set(predicted)
    ground_truth_set = set(ground_truth)

    # Count overlaps
    tp = len(predicted_set & ground_truth_set)  # True positives
    fp = len(predicted_set - ground_truth_set)  # False positives
    fn = len(ground_truth_set - predicted_set)  # False negatives

    # Compute the metrics directly (no external libraries needed)
    precision = tp / (tp + fp) if tp + fp > 0 else 0
    recall = tp / (tp + fn) if tp + fn > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0

    return {"Precision": precision, "Recall": recall, "F1-Score": f1}
# Example ground truth and predicted outputs
ground_truth = [
    "Users must be able to register with valid credentials.",
    "Payment can be processed securely via credit card."
]
predicted_gpt = [
    "Users must be able to register with valid credentials.",
    "Payment can be processed securely via credit card."
]
predicted_claude = [
    "Users must be able to register with valid credentials.",
    "Payments can be processed using PayPal securely."
]
# Calculate accuracy for GPT
accuracy_gpt = calculate_accuracy(predicted_gpt, ground_truth)
print("GPT Accuracy:", accuracy_gpt)
# Calculate accuracy for Claude
accuracy_claude = calculate_accuracy(predicted_claude, ground_truth)
print("Claude Accuracy:", accuracy_claude)
Benefits of Multi-Agent Systems
- Increased Reliability: Cross-validating outputs from multiple models increases confidence in the results (a simple consensus sketch follows this list).
- Error Detection: Discrepancies highlight potential errors or ambiguities in the data or model behavior.
- Diverse Perspectives: Different LLMs may interpret prompts differently, providing complementary insights.
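One simple way to turn cross-validation into a concrete result is a consensus rule: keep only the requirements that a majority of the models agree on. The sketch below assumes each parsed output exposes a "requirements" list with a "requirement" field, matching the example JSON shape suggested in Step 2; adjust it to whatever structure your prompt enforces.
from collections import Counter

def consensus_requirements(parsed_outputs, min_votes=2):
    """Keep requirement texts that at least `min_votes` models agree on."""
    votes = Counter()
    for output in parsed_outputs:
        for item in output.get("requirements", []):
            # Normalize lightly so trivial formatting differences still count as agreement
            votes[item["requirement"].strip().lower()] += 1
    return [text for text, count in votes.items() if count >= min_votes]

# agreed = consensus_requirements([gpt_data, claude_data, cohere_data])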
Conclusion
Using a multi-agent system with multiple LLMs (e.g., GPT, Claude, Cohere), you can improve the accuracy and reliability of software requirement processing tasks. By comparing outputs, identifying discrepancies, and calculating metrics like precision and recall, you can ensure robust results for tasks like test case generation, requirement extraction, and validation.
This approach is ideal for critical projects where accuracy is paramount, and errors can have significant consequences. With Python and Generative AI, you can build a scalable and reliable pipeline for processing technical documents. 🚀