Building an AI-Powered SQL Assistant: A Beginner's Guide to the Code

In this blog, we'll break down the Python code for a Streamlit-based, AI-powered SQL assistant. The assistant lets users interact with a database in plain English, generates optimized SQL queries, and retrieves the results, all powered by OpenAI's GPT-4-turbo model. By the end of this blog, you'll understand how the code works and how you can use or modify it for your own projects. The full source code appears at the end of the post.


Introduction

The code builds a user-friendly web app that:

  1. Connects to a PostgreSQL database.
  2. Uses OpenAI's GPT-4-turbo model to generate SQL queries.
  3. Executes the queries on the database.
  4. Displays the results, SQL query, and explanations.

The app is beginner-friendly and uses Streamlit for the interface, SQLAlchemy for database interaction, and LangChain for managing the AI functionality.


Breaking Down the Code

1. Setup and Dependencies

Before diving into the code, here are the libraries it relies on:

  • Streamlit: For creating the web-based user interface.
  • SQLAlchemy: For managing database connections.
  • LangChain: For managing interactions with the OpenAI language model.
  • dotenv: For loading environment variables securely.
  • Pandas: For handling query results as dataframes.
  • sqlparse and sqlglot: For validating and formatting SQL queries.

Install these dependencies, along with the psycopg2 driver for PostgreSQL and the LangChain OpenAI/community packages used by the imports:

pip install streamlit sqlalchemy langchain langchain-openai langchain-community pandas python-dotenv sqlparse sqlglot psycopg2-binary

2. Class Overview: SQLAgentApp

The code is organized into a class called SQLAgentApp. This class encapsulates the entire application logic, making the code modular and reusable.
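
The constructor simply chains the four setup helpers described in the next sections. A condensed view of the structure (taken from the full listing at the end of the post):

class SQLAgentApp:
    def __init__(self):
        self._init_environment()  # load and validate the OpenAI API key
        self._init_database()     # connect to PostgreSQL via SQLAlchemy
        self._init_llm()          # create the GPT-4-turbo chat model
        self._init_agent()        # assemble the LangChain SQL agent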


3. Environment Setup

The _init_environment method ensures that the OpenAI API key is loaded from environment variables using the dotenv library.

def _init_environment(self):
    self.openai_key = os.getenv("OPENAI_API_KEY")
    if not self.openai_key:
        st.error("OPENAI_API_KEY not found in environment variables")
        st.stop()
  • If the API key is missing, the app stops and shows an error message.
  • This keeps the sensitive API key out of the source code; it is read from a .env file like the example below.
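
The app reads OPENAI_API_KEY here and the PG_* settings used in the next section, so one .env file covers everything. A minimal example (every value is a placeholder to replace with your own):

OPENAI_API_KEY=your-openai-api-key
PG_USER=postgres
PG_PASSWORD=postgres_password
PG_HOST=localhost
PG_PORT=5432
PG_DATABASE=postgres_db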

4. Database Initialization

The _init_database method connects to a PostgreSQL database using SQLAlchemy. It retrieves database credentials from environment variables or uses default values.

def _init_database(self):
    db_config = {
        "user": os.getenv("PG_USER", "postgres"),
        "password": os.getenv("PG_PASSWORD", "postgres_password"),
        "host": os.getenv("PG_HOST", "localhost"),
        "port": os.getenv("PG_PORT", "5432"),
        "database": os.getenv("PG_DATABASE", "postgres_db")
    }

    self.DATABASE_URL = (
        f"postgresql+psycopg2://{db_config['user']}:{db_config['password']}"
        f"@{db_config['host']}:{db_config['port']}/{db_config['database']}"
    )
  • The code uses create_engine to establish a connection to the database.
  • If the connection fails, an error message is displayed using Streamlit, and the app stops.
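
If the connection fails, it helps to test the URL outside the app first. A minimal standalone sketch, assuming the same placeholder credentials as above:

from sqlalchemy import create_engine, text

# Hypothetical quick check; the URL mirrors the format built in _init_database
engine = create_engine("postgresql+psycopg2://postgres:postgres_password@localhost:5432/postgres_db")
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # prints 1 if the connection works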

5. Initializing the LLM

The _init_llm method creates an instance of OpenAI's GPT-4-turbo model using LangChain.

def _init_llm(self):
    self.llm = ChatOpenAI(
        model_name="gpt-4-turbo-preview",
        temperature=0,
        streaming=True,
        api_key=self.openai_key
    )
  • temperature=0 makes the model's responses as deterministic as possible.
  • Streaming is enabled for real-time responses in Streamlit.
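
Before wiring the model into an agent, you can sanity-check it on its own. A minimal sketch (assumes OPENAI_API_KEY is already set in the environment):

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4-turbo-preview", temperature=0)
reply = llm.invoke("Reply with the single word OK.")  # one-off call, no tools or agent involved
print(reply.content)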

6. Prompt Engineering

The _get_prompt_prefix method constructs a detailed prompt for the AI model. It includes:

  • The database schema (tables and columns).
  • Clear rules for generating SQL queries.
  • Formatting guidelines for the response.
def _get_prompt_prefix(self) -> str:
    schema = self._get_schema_info()
    current_year = datetime.now().year
    return f"""
    You are a world-class SQL expert with 15+ years of experience. ...
    Database Schema:
    {schema}
    """
  • The schema is retrieved dynamically with SQLAlchemy's inspect() function (see the sketch after this list).
  • The prompt instructs the model to generate PostgreSQL-compliant SQL and to follow a fixed response format.
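
The schema string comes from the _get_schema_info helper (shown in full in the listing below); the core idea is to walk the tables and columns with SQLAlchemy's Inspector:

from sqlalchemy import inspect

inspector = inspect(engine)  # engine is the SQLAlchemy engine from _init_database
schema = []
for table in inspector.get_table_names():
    columns = [f"{col['name']} ({col['type']})" for col in inspector.get_columns(table)]
    schema.append(f"Table {table}:\n  " + "\n  ".join(columns))
print("\n\n".join(schema))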

7. Agent Initialization

The _init_agent method builds a SQLDatabaseToolkit and a ConversationBufferMemory, then creates a LangChain SQL agent from them.

self.toolkit = SQLDatabaseToolkit(db=self.db, llm=self.llm)
self.memory = ConversationBufferMemory()

self.agent = create_sql_agent(
    llm=self.llm,
    toolkit=self.toolkit,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
    memory=self.memory,
    prefix=self._get_prompt_prefix(),
    format_instructions=self._get_format_instructions(),
    handle_parsing_errors=True,
    max_iterations=5,
    early_stopping_method="generate"
)
  • LangChain SQL Agent: Combines the AI model with SQL tools to generate and execute queries.
  • ConversationBufferMemory: Stores the chat history for better context.
  • Format Instructions: Ensures the AI provides clear, structured responses.

8. User Interface

The run_ui method builds the entire web interface using Streamlit.

Sidebar

The sidebar displays:

  1. Database Schema: A list of tables and columns retrieved dynamically.
  2. Chat History: Previous user and AI messages.
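
The schema browser uses one expander per table; this excerpt from run_ui (in the full listing) shows the pattern:

with st.sidebar:
    st.header("📖 Database Schema")
    for table in self.inspector.get_table_names():
        with st.expander(f"📑 {table}"):
            cols = [
                f"{col['name']} ({col['type']})"
                for col in self.inspector.get_columns(table)
            ]
            st.code("\n".join(cols))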

Main Chat Interface

The main interface handles:

  1. User Input: Users can ask questions about the database.
  2. AI Responses: The AI responds with SQL queries, results, and explanations.
if prompt := st.chat_input("Ask a data question..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    
    # AI generates response
    response = self.agent.run(prompt)
  • User input is sent to the AI agent.
  • The response includes the SQL query, results, and analysis.

9. SQL Execution and Processing

The _process_response method extracts and executes the SQL query generated by the AI.

if "```sql" in response:
    sql_part = response.split("```sql")[1].split("```")[0].strip()
    result["sql"] = sqlparse.format(sql_part, reindent=True)
    result["data"] = pd.read_sql_query(result["sql"], self.engine)
  • SQL Parsing: Extracts the SQL code from the response.
  • Execution: Runs the SQL query on the database using Pandas.
  • Result Formatting: Processes the query result into a readable format.
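
To see the extraction and formatting step in isolation, here is a small self-contained sketch that mimics _process_response on a made-up AI reply (the query and table name are hypothetical):

import sqlparse

# Hypothetical AI response containing a fenced SQL block
response = "Here is the query:\n```sql\nselect id, name from products where is_active = true limit 10\n```\nDone."

if "```sql" in response:
    sql_part = response.split("```sql")[1].split("```")[0].strip()
    print(sqlparse.format(sql_part, reindent=True, keyword_case="upper"))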

10. Error Handling

The code includes robust error handling:

  • If the database connection fails, the app stops with an error message.
  • SQL execution errors are caught and displayed to the user.
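
The recurring pattern in the full listing is a try/except around each risky step, surfacing the problem through Streamlit instead of crashing the app; the SQL execution branch, for example, looks roughly like this:

try:
    result["data"] = pd.read_sql_query(result["sql"], self.engine)
except Exception as e:
    st.error(f"SQL Execution Error: {str(e)}")          # shown to the user in the UI
    result["error"] = f"SQL Execution Error: {str(e)}"  # kept on the result for later handling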

11. Running the App

Finally, the app is launched using:

if __name__ == "__main__":
    app = SQLAgentApp()
    app.run_ui()
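
Because this is a Streamlit app, it is started with the streamlit CLI rather than plain python (the filename is whatever you saved the script as):

streamlit run sql_agent_app.py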

How the App Works

Step-by-Step Workflow

  1. The user inputs a question, such as "Show all active products."
  2. The AI generates a PostgreSQL query based on the database schema.
  3. The query is executed, and the results are displayed in a table.
  4. The AI provides an explanation of the query and optimization tips.

Key Features

  • Dynamic Schema Retrieval: Automatically fetches database tables and columns.
  • AI-Generated SQL: Translates natural-language questions into PostgreSQL queries, guided by the schema and the prompt rules.
  • Real-Time Execution: Runs queries instantly and displays results.
  • Detailed Analysis: Explains the results and suggests optimizations.

Conclusion

This code demonstrates how to build an AI-powered SQL assistant that integrates database querying, AI, and a user-friendly interface. By following this guide, you can adapt the application to your own database and enhance it with additional features.

Full Source Code

import os
import re
import sqlparse
from dotenv import load_dotenv
from typing import List, Dict, Any
from sqlalchemy import create_engine, inspect
from sqlglot import transpile # FIXED IMPORT
import pandas as pd
import streamlit as st
from datetime import datetime

# LangChain imports
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.agents import create_sql_agent, AgentType
from langchain.callbacks import StreamlitCallbackHandler
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import SQLDatabaseToolkit

# Load environment variables
load_dotenv()
# os.environ['OPENAI_API_KEY'] = "your-openai-key"  # Optionally hardcode the key here instead of using a .env file (not recommended)
class SQLAgentApp:
    def __init__(self):
        self._init_environment()
        self._init_database()
        self._init_llm()
        self._init_agent()

    def _init_environment(self):
        self.openai_key = os.getenv("OPENAI_API_KEY")
        if not self.openai_key:
            st.error("OPENAI_API_KEY not found in environment variables")
            st.stop()

    def _init_database(self):
        # Initialize database connection
        db_config = {
            "user": os.getenv("PG_USER", "postgres"),
            "password": os.getenv("PG_PASSWORD", "postgres_password"),
            "host": os.getenv("PG_HOST", "localhost"),
            "port": os.getenv("PG_PORT", "5432"),
            "database": os.getenv("PG_DATABASE", "postgres_db")
        }

        self.DATABASE_URL = (
            f"postgresql+psycopg2://{db_config['user']}:{db_config['password']}"
            f"@{db_config['host']}:{db_config['port']}/{db_config['database']}"
        )

        try:
            self.engine = create_engine(self.DATABASE_URL)
            self.db = SQLDatabase(self.engine)
            self.inspector = inspect(self.engine)
        except Exception as e:
            st.error(f"Database connection failed: {str(e)}")
            st.stop()

    def _init_llm(self):
        self.llm = ChatOpenAI(
            model_name="gpt-4-turbo-preview",
            temperature=0,
            streaming=True,
            api_key=self.openai_key
        )

    def _get_prompt_prefix(self) -> str:
        schema = self._get_schema_info()
        current_year = datetime.now().year
        return f"""
You are a world-class SQL expert with 15+ years of experience. Follow these rules:
1. Generate PostgreSQL-compliant SQL
2. Validate queries with SQLGlot before execution
3. Use CTEs for complex logic
4. Optimize for query performance
5. Add comments for complex operations
6. Prevent SQL injection attacks
7. Handle NULLs with COALESCE
8. Include LIMIT unless explicitly disabled
9. For date queries:
   - Use CURRENT_DATE for today's date
   - Use CURRENT_TIMESTAMP for current time
   - For relative dates (e.g., last month), use date arithmetic
   - Format dates consistently (YYYY-MM-DD)
   - Current year is {current_year}

Database Schema:
{schema}

Response Format:
1. Valid SQL Query
2. Execution Result Summary
3. Step-by-Step Explanation
4. Optimization Tips
5. Alternative Approaches
"""

    def _init_agent(self):
        self.toolkit = SQLDatabaseToolkit(db=self.db, llm=self.llm)
        self.memory = ConversationBufferMemory()

        self.agent = create_sql_agent(
            llm=self.llm,
            toolkit=self.toolkit,
            agent_type=AgentType.OPENAI_FUNCTIONS,
            verbose=True,
            memory=self.memory,
            prefix=self._get_prompt_prefix(),
            format_instructions=self._get_format_instructions(),
            handle_parsing_errors=True,
            max_iterations=5,
            early_stopping_method="generate"
        )

    def _get_format_instructions(self) -> str:
        return """
Always respond with markdown containing:
- **SQL Query** (syntax-validated)
- **Result** (summarized output)
- **Explanation** (detailed reasoning)
- **Optimizations** (index/partition suggestions)
- **Alternatives** (different methodologies)
"""

    def _get_schema_info(self) -> str:
        schema = []
        for table in self.inspector.get_table_names():
            columns = [f"{col['name']} ({col['type']})"
                       for col in self.inspector.get_columns(table)]
            schema.append(f"Table {table}:\n  " + "\n  ".join(columns))
        return "\n\n".join(schema)

    def run_ui(self):
        st.set_page_config(page_title="AI SQL Agent", layout="wide")
        st.title("🛠️ AI-Powered SQL Assistant")

        # Initialize chat history if not exists
        if "messages" not in st.session_state:
            st.session_state.messages = []

        # Sidebar with schema info and chat history
        with st.sidebar:
            st.header("📖 Database Schema")
            for table in self.inspector.get_table_names():
                with st.expander(f"📑 {table}"):
                    cols = [
                        f"{col['name']} ({col['type']})"
                        for col in self.inspector.get_columns(table)
                    ]
                    st.code("\n".join(cols))

            st.divider()
            st.header("📝 Chat History")
            # Display chat history in sidebar
            for msg in st.session_state.messages:
                role = "💬 User" if msg["role"] == "user" else "🤖 Assistant"
                with st.expander(f"{role}: {msg['content'][:50]}...", expanded=False):
                    st.markdown(msg["content"])

        # Main chat interface
        # Display all previous messages in the main area
        for msg in st.session_state.messages:
            with st.chat_message(msg["role"]):
                st.markdown(msg["content"])

        # Chat input
        if prompt := st.chat_input("Ask a data question (e.g., 'Show all active products'):"):
            # Add user message to chat history
            st.session_state.messages.append({"role": "user", "content": prompt})

            # Display user message immediately
            with st.chat_message("user"):
                st.markdown(prompt)

            # Create a placeholder for the assistant's response
            with st.chat_message("assistant"):
                try:
                    st_callback = StreamlitCallbackHandler(st.container())
                    response = self.agent.run(prompt, callbacks=[st_callback])
                    processed = self._process_response(response)

                    # Display SQL
                    if processed["sql"]:
                        with st.expander("🔍 Generated SQL"):
                            st.code(processed["sql"], language="sql")

                    # Display Data
                    if processed["data"] is not None:
                        st.subheader("📊 Results Preview")
                        st.dataframe(processed["data"].head(10), use_container_width=True)

                    # Display Analysis
                    if processed["explanation"]:
                        st.subheader("📝 Explanation")
                        st.markdown(processed["explanation"])

                    # Add assistant's response to chat history with the summary
                    st.session_state.messages.append({
                        "role": "assistant",
                        "content": processed["summary"]
                    })

                    # Force a rerun to update the display while maintaining history
                    st.rerun()

                except Exception as e:
                    error_message = f"Error: {str(e)}"
                    st.error(error_message)
                    st.session_state.messages.append({
                        "role": "assistant",
                        "content": error_message
                    })
                    # Force a rerun to update the display while maintaining history
                    st.rerun()

    def _process_response(self, response: str) -> dict:
        result = {
            "sql": "",
            "data": None,
            "explanation": "",
            "optimizations": "",
            "alternatives": "",
            "summary": ""  # Add summary field
        }

        try:
            # Handle both string and tuple responses
            if isinstance(response, tuple):
                response = response[0]

            # Extract SQL Query
            if "```sql" in response:
                sql_part = response.split("```sql")[1].split("```")[0].strip()
                result["sql"] = sqlparse.format(sql_part, reindent=True)

                # Execute SQL directly
                try:
                    result["data"] = pd.read_sql_query(result["sql"], self.engine)

                    # Format the data into a clean text explanation
                    if result["data"] is not None and not result["data"].empty:
                        explanation_parts = []
                        explanation_parts.append(f"Found {len(result['data'])} records:")

                        # Format each row in a clean text way
                        for _, row in result["data"].iterrows():
                            record_info = []
                            for col in result["data"].columns:
                                if col != 'id':  # Skip ID in the display
                                    value = row[col]
                                    if pd.isna(value):
                                        value = 'None'
                                    record_info.append(f"{value}")
                            explanation_parts.append(f"• {' - '.join(record_info)}")

                        result["explanation"] = "\n".join(explanation_parts)
                except Exception as e:
                    st.error(f"SQL Execution Error: {str(e)}")
                    result["error"] = f"SQL Execution Error: {str(e)}"

            # Extract other response parts
            sections = ["Result", "Explanation", "Optimizations", "Alternatives", "Execution Result Summary"]
            current_section = None

            for line in response.split("\n"):
                if line.startswith("**"):
                    current_section = line.strip("*: ").lower()
                elif current_section and current_section in result:
                    result[current_section] += line + "\n"
                elif "Execution Result Summary" in line:
                    current_section = "summary"
                elif current_section == "summary":
                    result["summary"] += line + "\n"

            # If no summary was found, use the explanation as summary
            if not result["summary"].strip():
                result["summary"] = result["explanation"]

            return result

        except Exception as e:
            return {"error": f"Response processing error: {str(e)}"}

if __name__ == "__main__":
    app = SQLAgentApp()  # Create an instance of the class
    app.run_ui()         # Call the method to start the UI

