Enhancing Prompt Engineering Accuracy in Software Requirements Processing with Multi-Agent Systems and Multiple LLMs
When processing software requirements or testing documents using large language models (LLMs), ensuring the accuracy and reliability of the results is a critical challenge. One effective approach is using a multi-agent system that leverages multiple LLMs to cross-validate outputs, compare results, and improve overall accuracy. This approach reduces reliance on a single model and makes the processing more robust, especially for tasks like test case generation, requirement validation, and data summarization.
This blog explores how to use multi-agent systems with different LLMs to process software requirements, compare results, and measure accuracy.
What is a Multi-Agent System?
A multi-agent system involves multiple independent AI agents (in this case, LLMs) working together to:
- Perform the same task independently.
- Cross-validate or refine each other’s outputs.
- Provide diverse perspectives to improve the quality and reliability of the final result.
In our case, the agents will be different LLMs (e.g., GPT-4, Claude, Cohere, or Google Bard).
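Before diving into the use case, here is a minimal sketch of the pattern: each agent is just a callable that receives the same prompt and returns text, and an orchestrator fans the prompt out and collects the outputs. The placeholder lambdas below are purely illustrative; the real query functions are defined in Step 3.
# A minimal sketch of the multi-agent pattern: every agent is a callable
# that takes the same prompt and returns its output as text.
def run_agents(agents, prompt):
    """Send the same prompt to every agent and collect the responses."""
    return {name: ask(prompt) for name, ask in agents.items()}

# Placeholder agents for illustration -- swap in the real query functions from Step 3.
agents = {
    "gpt": lambda p: "GPT output for: " + p,
    "claude": lambda p: "Claude output for: " + p,
    "cohere": lambda p: "Cohere output for: " + p,
}

outputs = run_agents(agents, "Extract the high-priority functional requirements.")
print(outputs)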
Use Case Overview
We’ll process a software requirements CSV file to:
- Extract high-priority functional requirements.
- Generate test cases for these requirements.
- Use multiple LLMs to process the data and compare their outputs.
Finally, we’ll measure accuracy by comparing the results and identifying discrepancies.
Step-by-Step Guide
Step 1: Setup and Prepare the Data
We’ll begin by preparing the software requirements data in a CSV format.
Example CSV File (requirements.csv):
ID,Requirement,Type,Priority,Status
1,Users must be able to register and log in using their email and password,Functional,High,Approved
2,Search functionality must return relevant results within 2 seconds,Functional,Medium,Pending
3,The platform must handle 500 concurrent users,Non-Functional,High,Approved
4,Payment processing must support credit cards and PayPal securely,Functional,High,Approved
5,Daily backups of all data must be performed automatically,Non-Functional,Medium,Pending
Load the CSV Data in Python
import pandas as pd
# Load the CSV file
csv_file = 'requirements.csv'
df = pd.read_csv(csv_file)
# Convert the data to JSON format
json_data = df.to_json(orient='records')
print("JSON Data for AI Query:")
print(json_data)
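Because the requirements are already structured, we can also compute a deterministic baseline of the rows we expect the models to extract. This step is optional, but it gives us a ready-made ground truth for the accuracy check in Step 5 (the column names below match the example CSV above).
# Deterministic baseline: the rows the LLMs should extract
high_priority_functional = df[(df["Type"] == "Functional") & (df["Priority"] == "High")]
print("Expected high-priority functional requirements:")
print(high_priority_functional[["ID", "Requirement"]])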
Step 2: Define the Prompt
We’ll use the following prompt to query multiple LLMs. The task involves extracting high-priority functional requirements and generating test cases.
Prompt Template:
Here is the software requirements data in JSON format:
{json_data}
Task:
1. Extract all functional requirements whose Priority is "High".
2. Generate 2 test cases for each extracted requirement.
3. Return the output in a structured JSON format.
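To make the outputs easy to compare programmatically in Step 4, it helps to pin down the exact JSON shape in the prompt. The structure below is one possible convention; the field names are illustrative, not something the prompt above already enforces.
{
  "requirements": [
    {
      "id": 1,
      "requirement": "Users must be able to register and log in using their email and password",
      "test_cases": [
        "Verify that a user can register with a valid email and password",
        "Verify that login fails with an incorrect password"
      ]
    }
  ]
}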
Step 3: Query Multiple LLMs
We’ll use Python to send the same prompt to three different LLMs (GPT-4, Claude, and Cohere) and collect their outputs. Here’s how you can set up the multi-agent system.
Install Required Libraries
pip install openai requests
Python Code for Multi-Agent Queries
import openai
import requests
import json
# Set API Keys for different LLMs
openai.api_key = 'your-openai-api-key'
anthropic_api_key = 'your-anthropic-api-key'
cohere_api_key = 'your-cohere-api-key'
# Define the prompt
prompt = f"""
Here is the software requirements data in JSON format:
{json_data}
Task:
1. Extract all functional requirements whose Priority is "High".
2. Generate 2 test cases for each extracted requirement.
3. Return the output in a structured JSON format.
"""
# Query GPT-4 (OpenAI) -- uses the pre-1.0 openai SDK style configured above
def query_gpt(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        temperature=0
    )
    return response.choices[0].message["content"].strip()

# Query Claude (Anthropic) -- legacy text completions REST endpoint
def query_claude(prompt):
    url = "https://api.anthropic.com/v1/complete"
    headers = {
        "x-api-key": anthropic_api_key,
        "anthropic-version": "2023-06-01",
        "Content-Type": "application/json"
    }
    data = {
        # The legacy completions endpoint expects Human/Assistant turns in the prompt
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "model": "claude-v1",
        "max_tokens_to_sample": 500,
        "temperature": 0
    }
    response = requests.post(url, headers=headers, json=data)
    return response.json()['completion'].strip()

# Query Cohere -- Generate REST endpoint
def query_cohere(prompt):
    url = "https://api.cohere.ai/v1/generate"
    headers = {"Authorization": f"Bearer {cohere_api_key}"}
    data = {
        "model": "command-xlarge-nightly",
        "prompt": prompt,
        "max_tokens": 500,
        "temperature": 0
    }
    response = requests.post(url, headers=headers, json=data)
    return response.json()['generations'][0]['text'].strip()
# Execute queries
gpt_output = query_gpt(prompt)
claude_output = query_claude(prompt)
cohere_output = query_cohere(prompt)
# Display outputs
print("GPT Output:")
print(gpt_output)
print("\nClaude Output:")
print(claude_output)
print("\nCohere Output:")
print(cohere_output)
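The three calls above run one after another. If latency matters, a thread pool from the standard library is a simple way to run the HTTP requests in parallel; this is a sketch that reuses the query functions defined above.
from concurrent.futures import ThreadPoolExecutor

def query_all(prompt):
    """Run all three model queries concurrently and return a dict of outputs."""
    queries = {"GPT": query_gpt, "Claude": query_claude, "Cohere": query_cohere}
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in queries.items()}
        return {name: future.result() for name, future in futures.items()}

# outputs = query_all(prompt)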
Step 4: Compare and Validate Outputs
To measure the accuracy of the outputs, we’ll compare the results from all LLMs and identify discrepancies. This can be done by:
- Manual Comparison: Reviewing outputs side-by-side.
- Automated Comparison: Using Python to check for differences.
Automated Comparison Code
# Parse outputs into JSON (assumes each model returned bare JSON;
# see the defensive parser after this comparison code if that assumption breaks)
gpt_data = json.loads(gpt_output)
claude_data = json.loads(claude_output)
cohere_data = json.loads(cohere_output)

# Compare outputs key by key and collect any differences
def compare_outputs(data1, data2):
    discrepancies = []
    for key in data1.keys():
        if data1[key] != data2.get(key):
            discrepancies.append({
                "Key": key,
                "Data1": data1[key],
                "Data2": data2.get(key)
            })
    return discrepancies
# Compare GPT vs Claude
discrepancies_gpt_claude = compare_outputs(gpt_data, claude_data)
# Compare GPT vs Cohere
discrepancies_gpt_cohere = compare_outputs(gpt_data, cohere_data)
# Display discrepancies
print("Discrepancies between GPT and Claude:")
print(discrepancies_gpt_claude)
print("\nDiscrepancies between GPT and Cohere:")
print(discrepancies_gpt_cohere)
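json.loads above assumes each model returned bare JSON. In practice, models often wrap JSON in markdown code fences or add surrounding prose, so a small defensive parser is worth having. This is a best-effort sketch; the fence-stripping regex reflects typical model behaviour, not a guarantee.
import json
import re

def parse_llm_json(raw_text):
    """Best-effort extraction of a JSON object from an LLM response."""
    # Strip markdown code fences such as ```json ... ``` if present
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw_text, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw_text
    # Fall back to the first {...} span if there is leading or trailing prose
    if not candidate.lstrip().startswith("{"):
        brace = re.search(r"\{.*\}", candidate, re.DOTALL)
        if brace:
            candidate = brace.group(0)
    return json.loads(candidate)

# Usage: gpt_data = parse_llm_json(gpt_output)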
Step 5: Measure Accuracy
To evaluate the accuracy of each model:
- Define a ground truth (a set of manually verified correct outputs).
- Compare each model’s output against the ground truth using metrics like:
- Precision: Correct results out of all results generated by the model.
- Recall: Correct results out of all possible correct results.
- F1-Score: Harmonic mean of precision and recall.
Accuracy Calculation Code
def calculate_accuracy(predicted, ground_truth):
    # Convert results to sets for comparison
    predicted_set = set(predicted)
    ground_truth_set = set(ground_truth)

    # Count overlaps
    tp = len(predicted_set & ground_truth_set)  # True positives
    fp = len(predicted_set - ground_truth_set)  # False positives
    fn = len(ground_truth_set - predicted_set)  # False negatives

    # Compute the metrics directly (no external libraries needed)
    precision = tp / (tp + fp) if tp + fp > 0 else 0
    recall = tp / (tp + fn) if tp + fn > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0

    return {"Precision": precision, "Recall": recall, "F1-Score": f1}
# Example ground truth and predicted outputs
ground_truth = [
    "Users must be able to register with valid credentials.",
    "Payment can be processed securely via credit card."
]
predicted_gpt = [
    "Users must be able to register with valid credentials.",
    "Payment can be processed securely via credit card."
]
predicted_claude = [
    "Users must be able to register with valid credentials.",
    "Payments can be processed using PayPal securely."
]
# Calculate accuracy for GPT
accuracy_gpt = calculate_accuracy(predicted_gpt, ground_truth)
print("GPT Accuracy:", accuracy_gpt)
# Calculate accuracy for Claude
accuracy_claude = calculate_accuracy(predicted_claude, ground_truth)
print("Claude Accuracy:", accuracy_claude)
Benefits of Multi-Agent Systems
- Increased Reliability: Cross-validating outputs from multiple models increases confidence in the results (a simple consensus sketch follows this list).
- Error Detection: Discrepancies highlight potential errors or ambiguities in the data or model behavior.
- Diverse Perspectives: Different LLMs may interpret prompts differently, providing complementary insights.
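One simple way to turn cross-validation into a concrete result is a consensus rule: keep only the requirements that a majority of the models agree on. The sketch below assumes each parsed output exposes a "requirements" list with a "requirement" field, matching the example JSON shape suggested in Step 2; adjust it to whatever structure your prompt enforces.
from collections import Counter

def consensus_requirements(parsed_outputs, min_votes=2):
    """Keep requirement texts that at least `min_votes` models agree on."""
    votes = Counter()
    for output in parsed_outputs:
        for item in output.get("requirements", []):
            # Normalize lightly so trivial formatting differences still count as agreement
            votes[item["requirement"].strip().lower()] += 1
    return [text for text, count in votes.items() if count >= min_votes]

# agreed = consensus_requirements([gpt_data, claude_data, cohere_data])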
Conclusion
Using a multi-agent system with multiple LLMs (e.g., GPT, Claude, Cohere), you can improve the accuracy and reliability of software requirement processing tasks. By comparing outputs, identifying discrepancies, and calculating metrics like precision and recall, you can ensure robust results for tasks like test case generation, requirement extraction, and validation.
This approach is ideal for critical projects where accuracy is paramount, and errors can have significant consequences. With Python and Generative AI, you can build a scalable and reliable pipeline for processing technical documents. 🚀