Sunday, June 23, 2024

Beautiful Soup Example Code 📋

Example 1: Extracting All Paragraphs from a Web Page 📄

```
import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect every <p> element and print its text content
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.get_text())
```

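Example 2: Extracting Table Data 📊

```
import requests
from bs4 import BeautifulSoup

url = 'http://example.com/tablepage'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# find() returns None when the page has no <table>, so guard before iterating
table = soup.find('table')
rows = table.find_all('tr') if table else []
for row in rows:
    # Header rows use <th>, so they yield no <td> cells here
    cells = row.find_all('td')
    for cell in cells:
        print(cell.get_text())
```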

Example 3: Extracting Data from a Specific Class 🎯

```
import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# class_ (with a trailing underscore) avoids clashing with Python's class keyword
items = soup.find_all(class_='classname')
for item in items:
    print(item.get_text())
```

How to Use Beautiful Soup 🍲

Here are the basic steps to use Beautiful Soup for web scraping:

1. **Install Beautiful Soup** 💻📦:

   ```
   # Notebook syntax; drop the leading "!" when running in a terminal
   !pip install beautifulsoup4
   !pip install lxml
   !pip install requests
   ```

2. **Import the Necessary Libraries** 📚:

   ```
   from bs4 import BeautifulSoup
   import requests
   ```

3. **Fetch the Web Page** 🌐⬇️:

   ```
   url = 'http://example.com'
   response = requests.get(url)
   response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)
   html_content = response.content
   ```

4. **Parse the HTML Content** 🗂️🔍:

   ```
   soup = BeautifulSoup(html_content, 'lxml')  # or 'html.parser'
   ```

5. **Extract Data** 📄➡️🔢:

   - Extract specific elements like titles, links, tables, etc.

   Example - Extracting all the links 🔗:

   ```
   for link in soup.find_all('a'):
       print(link.get('href'))
   ```

   Example - Extracting text from a specific tag 🏷️:

   ```
   title = soup.find('title').get_text()
   print(title)
   ```

Use Cases of Beautiful Soup 🍲

1. **Web Scraping** 🕸️🔍:

   - Extracting information from web pages for data analysis.

   - Collecting data for research purposes.

   - Aggregating data from multiple sources.

2. **Data Extraction** 📄➡️📊:

   - Parsing HTML and XML documents to retrieve specific data elements.

   - Extracting table data, lists, paragraphs, etc.

3. **Automating Data Collection** 🤖📬:

   - Automating the process of collecting data from websites.

   - Periodically scraping websites for new data.

4. **Processing HTML/XML Data** 🧹📜:

   - Cleaning and organizing data from web sources.

   - Navigating through HTML/XML documents to find and process needed elements.
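These use cases can be tried without any network access. The sketch below parses a small, made-up HTML snippet and turns its table into a list of dictionaries. The markup, the `parse_table` helper, and the column names are illustrative assumptions, not taken from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration
HTML = """
<html><body>
  <table>
    <tr><th>Name</th><th>Price</th></tr>
    <tr><td> Espresso </td><td>2.50</td></tr>
    <tr><td>Latte</td><td>3.75</td></tr>
  </table>
  <a href="/menu">Menu</a>
  <a>No link here</a>
</body></html>
"""


def parse_table(soup):
    """Return the first table as a list of {header: cell_text} dicts."""
    table = soup.find("table")
    if table is None:
        return []
    rows = table.find_all("tr")
    if not rows:
        return []
    # The first row is assumed to hold the <th> header cells
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            records.append(dict(zip(headers, cells)))
    return records


soup = BeautifulSoup(HTML, "html.parser")
records = parse_table(soup)
# Keep only anchors that actually carry an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]

print(records)
print(links)
```

Passing `href=True` to `find_all` skips anchors without a link, which avoids the `None` values that `link.get('href')` can return.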

Saturday, June 22, 2024

Steps to Improve Sentiment Analysis with Fine-Tuning 📈🧠

Choose a Pre-Trained Language Model:

Select a pre-trained model like BERT, RoBERTa, or GPT. These models have been trained on large corpora and can understand language nuances.

📚🔍: Choose a Pre-Trained Model - Use a powerful model like BERT, RoBERTa, or GPT.

Prepare the Dataset:

Collect a labeled dataset with text samples and corresponding sentiment labels (positive, negative, neutral).

Clean and preprocess the data (e.g., remove noise, tokenize text).

📊🧹: Prepare the Dataset - Gather and clean labeled sentiment data.
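As a rough sketch of the cleaning part of this step, the snippet below strips HTML tags, URLs, and extra whitespace from a few labeled samples. The `clean_text` helper and the sample records are hypothetical, and the choice of what counts as noise is an assumption that depends on the dataset. Tokenization is left to the model's own tokenizer.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Strip HTML tags, URLs, and redundant whitespace from one sample."""
    text = html.unescape(text)    # turn entities such as &amp; back into characters
    text = TAG_RE.sub(" ", text)  # drop tags such as <br />
    text = URL_RE.sub(" ", text)  # drop links
    return SPACE_RE.sub(" ", text).strip()


# Hypothetical labeled samples: (text, label)
raw = [
    ("Great movie!<br /><br />Loved it &amp; would watch again.", "positive"),
    ("Terrible.   See http://example.com/review for why.", "negative"),
    ("   ", "neutral"),
]

# Clean each sample and drop the ones that end up empty
cleaned = [(clean_text(text), label) for text, label in raw]
cleaned = [(text, label) for text, label in cleaned if text]
print(cleaned)
```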

Set Up the Environment:

Install necessary libraries (e.g., Transformers by Hugging Face, PyTorch/TensorFlow).

Set up a GPU environment if possible to speed up training.

🖥️⚙️: Set Up the Environment - Install libraries and set up hardware.

Load the Pre-Trained Model and Tokenizer:

Use a tokenizer compatible with the chosen model to preprocess the text.

Load the pre-trained model and modify it for the sentiment analysis task (e.g., add a classification head).

🧠🔧: Load the Model and Tokenizer - Prepare the model and tokenizer for training.

Fine-Tune the Model:

Define a training loop or use a training API to fine-tune the model on the sentiment dataset.

Monitor training to avoid overfitting and adjust hyperparameters as needed.

🎯📈: Fine-Tune the Model - Train the model on sentiment data.

Evaluate and Test the Model:

Evaluate the model on a validation set to ensure it generalizes well.

Test the model on a separate test set to gauge its real-world performance.

📊🔍: Evaluate the Model - Check the model’s performance on validation and test sets.
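To make the evaluation metrics concrete, here is a small sketch that computes accuracy, precision, recall, and F1 for binary labels by hand. The `binary_metrics` helper and the sample labels are made up for illustration; in practice a library such as scikit-learn would normally do this.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these sample labels there are three true positives, one false positive, and one false negative, so all four metrics come out to 0.75.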

Deploy the Model:

Save the fine-tuned model.

Deploy it in a production environment where it can analyze sentiment in new text inputs.

🚀💾: Deploy the Model - Save and deploy the fine-tuned model.

Implementation Example 🧑‍💻

Here’s a Python implementation using Hugging Face’s Transformers library and PyTorch:


```
# Install necessary libraries (notebook syntax; drop the leading "!" in a terminal)
!pip install transformers datasets scikit-learn
!pip install torch

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import torch
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Load the dataset 📊
dataset = load_dataset('imdb')

# Preprocess the data 🧹
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Load the pre-trained model 🧠
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Define metrics 📏
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(p.label_ids, preds, average='binary')
    acc = accuracy_score(p.label_ids, preds)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

# Set training arguments ⚙️
# Note: newer transformers releases rename evaluation_strategy to eval_strategy
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize Trainer 🧑‍🏫
# IMDB has no labeled validation split, so the test split doubles as the evaluation set
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,
)

# Fine-tune the model 🎯
trainer.train()

# Evaluate the model 📊
print(trainer.evaluate())

# Save the model 💾
model.save_pretrained('fine-tuned-bert-imdb')
tokenizer.save_pretrained('fine-tuned-bert-imdb')
```


Explanation 📜

Dataset 📊: The IMDB dataset, which contains movie reviews labeled as positive or negative, is loaded using Hugging Face’s datasets library.

Tokenization 🧹: Text data is tokenized using BertTokenizer to convert text into a format suitable for BERT.

Model Loading 🧠: A pre-trained BERT model (bert-base-uncased) is loaded and modified for binary classification.

Training Arguments ⚙️: Hyperparameters for training are defined, including the learning rate, batch size, and number of epochs.

Trainer 🧑‍🏫: The Trainer class from Hugging Face simplifies the training loop and handles evaluation.

Training and Evaluation 📈📊: The model is fine-tuned on the training dataset and evaluated on the test dataset.

Model Saving 💾: The fine-tuned model and tokenizer are saved for later use.

Conclusion 🎉

Fine-tuning a pre-trained language model on a sentiment analysis dataset can significantly improve its performance for that specific task. By following these steps and using a powerful library like Hugging Face’s Transformers, you can efficiently implement and deploy a high-quality sentiment analysis model.

Retrieval-Augmented Generation (RAG) vs Fine-tuning of Large Language Models (LLMs)

Let's break down the differences between Retrieval-Augmented Generation (RAG) and fine-tuning of Large Language Models (LLMs):

Retrieval-Augmented Generation (RAG) 📚🔍➡️🧠📝

Concept:

📚🔍: Integration of Retrieval - RAG searches (🔍) through an external knowledge base (📚) to find relevant information.

➡️: Dynamic Knowledge - It brings this information into the generation process.
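A toy sketch of this retrieve-then-generate flow follows. Everything in it is an assumption made for illustration: the three-document knowledge base, the bag-of-words cosine similarity (real systems typically use learned embeddings and a vector database), and the `build_prompt` helper. The generation call itself is omitted; the sketch stops at the prompt an LLM would receive.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base
DOCS = [
    "Refunds are available within 30 days of purchase.",
    "Our support team is available Monday through Friday.",
    "Shipping to Europe takes five to seven business days.",
]


def embed(text):
    """Very rough stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, docs, top_k=1):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, docs):
    """Place the retrieved text in front of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


question = "How long does shipping to Europe take?"
prompt = build_prompt(question, DOCS)
print(prompt)
```

Updating the system's knowledge here means editing `DOCS`; nothing about the model changes, which is the contrast with fine-tuning.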

Advantages:

🆕📆: Up-to-date Information - Answers reflect new data as soon as the knowledge base is updated.

📦🧠: Smaller Model Size - Knowledge is stored outside the model.

🌐🔀: Versatility - Can handle many different topics by accessing various knowledge sources.

Disadvantages:

🔗📚: Dependency on Knowledge Base - Quality depends on the knowledge source.

⚙️🔧: Complexity - Requires a robust retrieval system.

Fine-Tuning Large Language Models (LLMs) 🧠📈➡️📝

Concept:

🧠📈: Model Specialization - The model is further trained (📈) on specific data to specialize in certain tasks.

➡️: Static Knowledge - Knowledge is embedded directly in the model's parameters.

Advantages:

🏆📊: Task-Specific Performance - Excels at specific tasks.

✅🔄: Simplicity in Usage - Easy to use once trained.

Disadvantages:

🗓️📚: Outdated Information - Can become outdated without regular retraining.

📈🧠: Larger Model Size - All knowledge must fit in the model's parameters, which can call for a bigger model.

📊📚: Data Requirements - Needs a lot of high-quality, task-specific data.

Key Differences 🔍 vs. 🧠

Source of Knowledge:

🔍📚: RAG - Uses external sources.

🧠📈: Fine-Tuning - Stores knowledge internally.

Flexibility and Updateability:

🔍🆕: RAG - Easily updated with new information.

🧠🗓️: Fine-Tuning - Needs retraining to update.

Implementation Complexity:

⚙️🔍: RAG - More complex to set up.

✅🧠: Fine-Tuning - Simpler to use post-training.

Response Generation:

🧠📚📝: RAG - Combines internal knowledge with external information.

🧠📝: Fine-Tuning - Uses only internal knowledge.

Use Cases 🎯

📚🔍: RAG - Ideal for real-time, dynamic information needs (e.g., customer support).

🧠📈: Fine-Tuning - Best for specialized, stable tasks (e.g., sentiment analysis).

Saturday, May 25, 2024

Vector partitioning in Pinecone using multiple indexes

This post covers vector partitioning in Pinecone using multiple indexes, along with an example use case. 🌟

Multi-Tenancy and Efficient Querying with Namespaces

What Is Multi-Tenancy?

Multi-tenancy is a software architecture pattern where a single system serves multiple customers (tenants) simultaneously.

Each tenant’s data is isolated to ensure privacy and security.

Pinecone’s abstractions (indexes, namespaces, and metadata) make building multi-tenant systems straightforward.

Namespaces for Data Isolation:

Pinecone allows you to partition vectors into namespaces within an index.

Each namespace contains related vectors for a specific tenant.

Queries and other operations are limited to one namespace at a time.

Data isolation enhances query performance by separating data segments.

Namespaces scale independently, ensuring efficient operations even for different workloads.

Example Use Case: SmartWiki’s AI-Assisted Wiki:

Scenario:

SmartWiki serves millions of companies and individuals.

Each customer (tenant) has varying data scale, user count, and SLAs.

SmartWiki prioritizes great UX and low query latency.

Implementation:

Create an index for each workload pattern (e.g., RAG analysis, semantic search).

Within each index, use namespaces for individual tenants.

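Example Python code for creating the indexes and populating per-tenant namespaces:

```
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# One index per workload pattern; the cloud and region below are placeholders
spec = ServerlessSpec(cloud="aws", region="us-east-1")
pc.create_index(name="rag-index", dimension=128, metric="cosine", spec=spec)
pc.create_index(name="semantic-index", dimension=256, metric="euclidean", spec=spec)

# Namespaces are created implicitly on the first upsert, so write each
# tenant's records into its own namespace (placeholder vectors shown here)
rag_index = pc.Index("rag-index")
rag_index.upsert(vectors=[{"id": "doc-1", "values": [0.1] * 128}], namespace="acme")
rag_index.upsert(vectors=[{"id": "doc-1", "values": [0.1] * 128}], namespace="widgets-r-us")

semantic_index = pc.Index("semantic-index")
semantic_index.upsert(vectors=[{"id": "doc-1", "values": [0.1] * 256}], namespace="acme")
semantic_index.upsert(vectors=[{"id": "doc-1", "values": [0.1] * 256}], namespace="widgets-r-us")
```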


Benefits:

Query Performance: Each query interacts with a specific namespace, leading to faster response times.

Cost Efficiency: Tenants share an index instead of each needing a dedicated one, which reduces costs.

Clean Offboarding: Deleting a namespace removes a tenant cleanly.
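To make the namespace semantics concrete without needing a Pinecone account, here is a toy in-memory stand-in. The `ToyIndex` class is hypothetical and is not Pinecone's implementation or API. It only mimics three behaviors described in this post: a namespace appears on first upsert, a query is scoped to one namespace, and deleting a namespace offboards a tenant.

```python
import math


class ToyIndex:
    """Hypothetical in-memory index that partitions vectors by namespace."""

    def __init__(self):
        self._namespaces = {}  # namespace -> {vector_id: values}

    def upsert(self, vectors, namespace=""):
        # The namespace is created implicitly on first write
        bucket = self._namespaces.setdefault(namespace, {})
        for vec_id, values in vectors:
            bucket[vec_id] = values

    def query(self, vector, top_k=3, namespace=""):
        # Only vectors in the requested namespace are considered
        bucket = self._namespaces.get(namespace, {})
        scored = [(vec_id, self._cosine(vector, values)) for vec_id, values in bucket.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

    def delete_namespace(self, namespace):
        self._namespaces.pop(namespace, None)

    def list_namespaces(self):
        return sorted(self._namespaces)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0


index = ToyIndex()
index.upsert([("doc-1", [1.0, 0.0]), ("doc-2", [0.0, 1.0])], namespace="acme")
index.upsert([("doc-1", [0.5, 0.5])], namespace="widgets-r-us")

# The query sees only acme's vectors, even though both tenants use the id "doc-1"
acme_hits = index.query([1.0, 0.1], top_k=1, namespace="acme")
print(acme_hits[0][0])

# Offboarding a tenant removes its namespace and nothing else
index.delete_namespace("acme")
print(index.list_namespaces())
```

Because each tenant's vectors live in a separate bucket, a query never has to scan or filter out other tenants' data, which is the intuition behind the performance and offboarding benefits listed here.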

Friday, May 24, 2024

Namespaces in Pinecone’s vector database

Let’s explore the concept of namespaces in Pinecone’s vector database! 🌟🔍

Namespaces in Pinecone: Organizing Vectors with Style 📁

What Are Namespaces?

Namespaces allow you to partition the vectors in an index.

Each namespace acts like a separate container for related vectors.

Queries and other operations are then limited to one specific namespace.

Think of it as organizing your vector data into different labeled folders.

Why Use Namespaces?

Optimized Search:

By dividing your vectors into namespaces, you can focus searches on specific subsets.

For example, you might want one namespace for articles by content and another for articles by title.

Contextual Filtering:

Metadata or context-specific vectors can reside in different namespaces.

This helps you filter and retrieve relevant information efficiently.

Example Use Case:

Coffee Shop Locator Bot ☕🤖:

Imagine you’re building a chatbot that finds nearby coffee shops.

You have two namespaces:

Namespace 1 (“ns1”): Contains vectors for coffee shop locations based on ratings and ambiance.

Namespace 2 (“ns2”): Contains vectors for coffee shop locations based on cuisine type (e.g., Italian, French).

When a user queries for “cozy coffee shops,” you search in “ns1.”

When they ask for “Italian cafes,” you search in “ns2.”

Creating Namespaces:

Namespaces are created implicitly when you upsert records into them.

For example, if you insert vectors with a namespace of “test-1,” Pinecone creates that namespace for you.

Querying a Namespace:

To target a specific namespace during a query, pass the namespace parameter.

If you don’t specify a namespace, Pinecone uses the default (empty string) namespace.

Example query:

```
# Search in "ns1" for cozy coffee shops
index.query(namespace="ns1", vector=[0.3, 0.3, 0.3, 0.3], top_k=3, include_values=True)
```

Operations Across All Namespaces:

Most vector operations apply to a single namespace.

However, there’s one exception: describe_index_stats, which reports per-namespace statistics for every namespace in the index. 🌈✨

Remember, namespaces help you keep your vectors organized and your searches efficient. Happy vector partitioning! 

AI's Impact on the IT Industry 2026