
Enhancing Prompt Engineering Accuracy in Software Requirements Processing with Multi-Agent Systems and Multiple LLMs

When processing software requirements or testing documents with large language models (LLMs), ensuring the accuracy and reliability of the results is a critical challenge. One effective approach is a multi-agent system that leverages multiple LLMs to cross-validate outputs, compare results, and improve overall accuracy. This reduces reliance on a single model and improves robustness, especially for tasks like test case generation, requirement validation, and data summarization.

This blog explores how to use multi-agent systems with different LLMs to process software requirements, compare results, and measure accuracy.


What is a Multi-Agent System?

A multi-agent system involves multiple independent AI agents (in this case, LLMs) working together to:

  1. Perform the same task independently.
  2. Cross-validate or refine each other’s outputs.
  3. Provide diverse perspectives to improve the quality and reliability of the final result.

In our case, the agents will be different LLMs (e.g., GPT-4, Claude, Cohere, or Google Bard).
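
Before wiring up real APIs, it helps to see the shape of such a system. Here is a minimal sketch (Agent and run_agents are illustrative names, not from any framework): an agent is just a label plus a function that sends a prompt to one model.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str                      # human-readable label, e.g. "GPT"
    query: Callable[[str], str]    # sends a prompt to one LLM, returns its text

def run_agents(agents, prompt):
    # Each agent performs the same task independently;
    # the collected outputs can then be cross-validated.
    return {agent.name: agent.query(prompt) for agent in agents}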


Use Case Overview

We’ll process a software requirements CSV file to:

  1. Extract high-priority functional requirements.
  2. Generate test cases for these requirements.
  3. Use multiple LLMs to process the data and compare their outputs.

Finally, we’ll measure accuracy by comparing the results and identifying discrepancies.


Step-by-Step Guide

Step 1: Setup and Prepare the Data

We’ll begin by preparing the software requirements data in a CSV format.

Example CSV File (requirements.csv):

ID,Requirement,Type,Priority,Status
1,Users must be able to register and log in using their email and password,Functional,High,Approved
2,Search functionality must return relevant results within 2 seconds,Functional,Medium,Pending
3,The platform must handle 500 concurrent users,Non-Functional,High,Approved
4,Payment processing must support credit cards and PayPal securely,Functional,High,Approved
5,Daily backups of all data must be performed automatically,Non-Functional,Medium,Pending

Load the CSV Data in Python

import pandas as pd

# Load the CSV file
csv_file = 'requirements.csv'
df = pd.read_csv(csv_file)

# Convert the data to JSON format
json_data = df.to_json(orient='records')
print("JSON Data for AI Query:")
print(json_data)
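
If the requirements file is large, it can help to filter rows in pandas before querying, so each prompt carries only the relevant records and fewer tokens. A small optional step (column names match the example CSV above); note that pre-filtering turns the extraction part of the prompt into a consistency check rather than the main task:

# Optional: pre-filter to high-priority functional requirements
# to keep the prompt small and token usage low
filtered_df = df[(df['Priority'] == 'High') & (df['Type'] == 'Functional')]
json_data = filtered_df.to_json(orient='records')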

Step 2: Define the Prompt

We’ll use the following prompt to query multiple LLMs. The task involves extracting high-priority functional requirements and generating test cases.

Prompt Template:

Here is the software requirements data in JSON format:

{json_data}

Task:
1. Extract all functional requirements whose Priority is "High".
2. Generate 2 test cases for each extracted requirement.
3. Return the output in a structured JSON format.
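
"A structured JSON format" leaves each model free to invent its own layout, which makes the later comparison step harder. It helps to append a concrete example of the desired shape to the prompt. One possible schema (illustrative only; adjust field names to your needs):

{
  "requirements": [
    {
      "id": 1,
      "requirement": "Users must be able to register and log in using their email and password",
      "test_cases": [
        "Verify that a user can register with a valid email and password.",
        "Verify that login fails with an incorrect password."
      ]
    }
  ]
}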

Step 3: Query Multiple LLMs

We’ll use Python and each provider’s official SDK to send the same prompt to multiple LLMs (e.g., GPT-4, Claude, Cohere). Here’s how you can set up the multi-agent system; a concurrent variant is sketched after the sequential version.

Install Required Libraries

pip install openai anthropic cohere

Python Code for Multi-Agent Queries

import json

from openai import OpenAI
import anthropic
import cohere

# Set API keys for the different LLM providers
# (in real projects, prefer environment variables over hard-coded keys)
openai_client = OpenAI(api_key="your-openai-api-key")
anthropic_client = anthropic.Anthropic(api_key="your-anthropic-api-key")
cohere_client = cohere.Client("your-cohere-api-key")

# Define the prompt (json_data comes from Step 1)
prompt = f"""
Here is the software requirements data in JSON format:

{json_data}

Task:
1. Extract all functional requirements whose Priority is "High".
2. Generate 2 test cases for each extracted requirement.
3. Return the output in a structured JSON format.
"""

# Query GPT (OpenAI); substitute any chat model your account can access
def query_gpt(prompt):
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        temperature=0
    )
    return response.choices[0].message.content.strip()

# Query Claude (Anthropic); model names change over time, so check the current list
def query_claude(prompt):
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=500,
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()

# Query Cohere; again, substitute a currently available model
def query_cohere(prompt):
    response = cohere_client.chat(
        model="command-r",
        message=prompt,
        temperature=0
    )
    return response.text.strip()

# Execute queries
gpt_output = query_gpt(prompt)
claude_output = query_claude(prompt)
cohere_output = query_cohere(prompt)

# Display outputs
print("GPT Output:")
print(gpt_output)

print("\nClaude Output:")
print(claude_output)

print("\nCohere Output:")
print(cohere_output)
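
The three calls above run one after another. Because each request is independent and network-bound, they can also run concurrently; a minimal sketch using the standard library’s thread pool:

from concurrent.futures import ThreadPoolExecutor

# Run the three queries in parallel threads; each call is network-bound,
# so threads are sufficient here
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = {
        "GPT": executor.submit(query_gpt, prompt),
        "Claude": executor.submit(query_claude, prompt),
        "Cohere": executor.submit(query_cohere, prompt),
    }
    outputs = {name: future.result() for name, future in futures.items()}

for name, output in outputs.items():
    print(f"\n{name} Output:")
    print(output)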

Step 4: Compare and Validate Outputs

To gauge the consistency of the outputs, we’ll compare the results from all LLMs and identify discrepancies. This can be done by:

  1. Manual Comparison: Reviewing outputs side-by-side.
  2. Automated Comparison: Using Python to check for differences.

Automated Comparison Code

# Parse outputs into JSON (this assumes each model returned raw JSON;
# see the fence-stripping helper after this block for messier outputs)
gpt_data = json.loads(gpt_output)
claude_data = json.loads(claude_output)
cohere_data = json.loads(cohere_output)

# Compare outputs key by key, covering keys that appear in either result
def compare_outputs(data1, data2):
    discrepancies = []
    for key in sorted(set(data1) | set(data2)):
        if data1.get(key) != data2.get(key):
            discrepancies.append({
                "Key": key,
                "Data1": data1.get(key),
                "Data2": data2.get(key)
            })
    return discrepancies

# Compare GPT vs Claude
discrepancies_gpt_claude = compare_outputs(gpt_data, claude_data)

# Compare GPT vs Cohere
discrepancies_gpt_cohere = compare_outputs(gpt_data, cohere_data)

# Display discrepancies
print("Discrepancies between GPT and Claude:")
print(discrepancies_gpt_claude)

print("\nDiscrepancies between GPT and Cohere:")
print(discrepancies_gpt_cohere)
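
In practice, models often wrap their JSON in Markdown code fences or add a sentence of commentary, which makes a bare json.loads fail. A small helper that strips common fences before parsing (a heuristic sketch, not a full parser):

import re

def parse_model_json(raw_output):
    # Strip optional Markdown code fences such as ```json ... ```
    text = raw_output.strip()
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

# Example usage:
gpt_data = parse_model_json(gpt_output)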

Step 5: Measure Accuracy

To evaluate the accuracy of each model:

  1. Define a ground truth (a set of manually verified correct outputs).
  2. Compare each model’s output against the ground truth using metrics like:
    • Precision: Correct results out of all results generated by the model.
    • Recall: Correct results out of all possible correct results.
    • F1-Score: Harmonic mean of precision and recall.

Accuracy Calculation Code

# Precision, recall, and F1 are computed directly from set overlaps (no sklearn needed)

def calculate_accuracy(predicted, ground_truth):
    # Convert results to sets for comparison
    predicted_set = set(predicted)
    ground_truth_set = set(ground_truth)
    
    # Calculate metrics
    tp = len(predicted_set & ground_truth_set)  # True positives
    fp = len(predicted_set - ground_truth_set)  # False positives
    fn = len(ground_truth_set - predicted_set)  # False negatives
    
    precision = tp / (tp + fp) if tp + fp > 0 else 0
    recall = tp / (tp + fn) if tp + fn > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0
    
    return {"Precision": precision, "Recall": recall, "F1-Score": f1}

# Example ground truth and predicted outputs
ground_truth = [
    "Users must be able to register with valid credentials.",
    "Payment can be processed securely via credit card."
]

predicted_gpt = [
    "Users must be able to register with valid credentials.",
    "Payment can be processed securely via credit card."
]

predicted_claude = [
    "Users must be able to register with valid credentials.",
    "Payments can be processed using PayPal securely."
]

# Calculate accuracy for GPT
accuracy_gpt = calculate_accuracy(predicted_gpt, ground_truth)
print("GPT Accuracy:", accuracy_gpt)

# Calculate accuracy for Claude
accuracy_claude = calculate_accuracy(predicted_claude, ground_truth)
print("Claude Accuracy:", accuracy_claude)

Benefits of Multi-Agent Systems

  1. Increased Reliability: Cross-validating outputs from multiple models improves accuracy (a simple consensus sketch follows this list).
  2. Error Detection: Discrepancies highlight potential errors or ambiguities in the data or model behavior.
  3. Diverse Perspectives: Different LLMs may interpret prompts differently, providing complementary insights.
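
One concrete way to cross-validate is a majority vote: keep only the items that at least two of the three models agree on. A minimal sketch (gpt_items, claude_items, and cohere_items are hypothetical lists of normalized requirement strings, one per model):

from collections import Counter

def majority_vote(model_outputs, min_votes=2):
    # model_outputs: one list of extracted requirement strings per model
    votes = Counter()
    for items in model_outputs:
        votes.update(set(items))  # set() gives each model one vote per item
    return [item for item, count in votes.items() if count >= min_votes]

# Hypothetical usage, assuming outputs were normalized to plain strings:
consensus = majority_vote([gpt_items, claude_items, cohere_items])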

Conclusion

Using a multi-agent system with multiple LLMs (e.g., GPT, Claude, Cohere), you can improve the accuracy and reliability of software requirement processing tasks. By comparing outputs, identifying discrepancies, and calculating metrics like precision and recall, you can ensure robust results for tasks like test case generation, requirement extraction, and validation.

This approach is ideal for critical projects where accuracy is paramount, and errors can have significant consequences. With Python and Generative AI, you can build a scalable and reliable pipeline for processing technical documents. 🚀
