Understanding Vectors, Embeddings, Vector Databases, and Vector Indexes Using ChromaDB & LlamaIndex
- Sairam Penjarla
- Oct 16, 2024
- 11 min read
In Natural Language Processing (NLP), working with vectors and tokens is crucial for representing and processing textual data. This blog covers the basics of vectors, tokens, embeddings, and how to manage documents efficiently using LlamaIndex and ChromaDB.
What is a Vector?
A vector is a numerical representation of data. In NLP, vectors are often used to capture the meaning of words, phrases, or sentences. For instance, the sentence "I like apple" can be represented as a vector:
"I like apple"
# Vector Representation
[-1.0, -0.0002, 4]
Here, each number represents a specific aspect of the sentence’s meaning or structure.
Understanding Tokens
A token is a basic unit of text, such as a word or part of a word, that a model processes. In the simplified vector [-1.0, -0.0002, 4] above, each number stands in for one token from the sentence "I like apple":
-1.0 # token "I"
-0.0002 # token "like"
4 # token "apple"
Tokens are converted into vectors, which machine learning models use for further processing.
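To see tokenization in practice, here is a minimal sketch using the Hugging Face transformers tokenizer that ships with the all-MiniLM-L6-v2 model used later in this post (the exact tokens and IDs depend on the tokenizer, so treat the output as illustrative):
from transformers import AutoTokenizer
# Load the tokenizer that pairs with the MiniLM sentence-transformer model
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
sentence = "I like apple"
tokens = tokenizer.tokenize(sentence)               # e.g. ['i', 'like', 'apple']
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(token_ids)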
Generating Sentence Embeddings with Pretrained Models
We can use pretrained models to convert sentences into vector representations, also known as embeddings. The following code uses the SentenceTransformers library to generate embeddings for sentences.
from sentence_transformers import SentenceTransformer
# Load the pre-trained model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Define sentences
lines = [
    "Biryani is a flavorful rice dish made with fragrant basmati rice and spices.",
    "Samosas are crispy pastries filled with spiced potatoes and peas.",
    "Butter chicken is a creamy curry made with marinated chicken and a rich tomato sauce."
]
# Generate embeddings for the sentences
embeddings = model.encode(lines)
print(embeddings)
Output (truncated for readability):
[[-0.01040684 0.01423134 0.02288207 ... 0.0235445 -0.0365877 0.01918398]
[ 0.02909455 -0.0835634 -0.02009496 ... 0.06541307 -0.07751977 0.04729676]
[-0.04539156 -0.04705414 -0.0403274 ... 0.0944678 0.07234132 0.01783809]]
These embeddings are high-dimensional vectors (in this case, of size 384) that capture the semantic meaning of each sentence.
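You can verify the dimensionality directly; model.encode returns a NumPy array whose shape is (number of sentences, embedding size):
print(embeddings.shape)
# (3, 384): three sentences, each mapped to a 384-dimensional vector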
Computing Similarity Between Sentence Embeddings
To compare the meaning of different sentences, we can compute the similarity between their embeddings. Here's an example:
user_question = ["Biryani is a flavorful rice dish made"]
# Generate embedding for the new sentence
question_embeddings = model.encode(user_question)
# Compute similarity with the previous embeddings
similarities = model.similarity(embeddings, question_embeddings)
print(similarities)
Output:
tensor([[0.9277], [0.3249], [0.2918]])
The value 0.9277 indicates a high similarity between the user’s query and the first sentence.
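By default this similarity is cosine similarity between the embedding vectors. As a sanity check, the same numbers can be computed with the cos_sim helper from sentence-transformers (a small sketch reusing the embeddings from above):
from sentence_transformers import util
# Cosine similarity between each stored sentence and the user question
cosine_scores = util.cos_sim(embeddings, question_embeddings)
print(cosine_scores)  # matches the values returned by model.similarity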
Storing and Managing Documents with ChromaDB
To manage text data and their embeddings efficiently, we can use ChromaDB, a vector database designed for storing, updating, and querying documents.
import chromadb
# Initialize ChromaDB client
chroma_client = chromadb.Client()
# Create or retrieve a collection
collection = chroma_client.get_or_create_collection(name="indian_food_1")
# Insert documents into the collection
collection.upsert(
    documents=[
        "This is a document about pineapple.",
        "This is a document about oranges."
    ],
    ids=["id11", "id21"]
)
ChromaDB organizes documents in collections, allowing you to store and update them along with associated metadata and embeddings.
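Once documents are upserted, the collection can be inspected directly, for example (a quick sketch using methods the ChromaDB collection exposes):
print(collection.count())            # number of documents stored in the collection
print(collection.get(ids=["id11"]))  # retrieve a stored document (and its metadata) by id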
Using Sentence Embeddings with ChromaDB
ChromaDB also supports embedding functions, allowing you to generate embeddings from textual data:
from chromadb.utils import embedding_functions
# Load pre-trained sentence transformer model for embeddings
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
Inserting Documents with Metadata and Embeddings into ChromaDB
We can insert documents into a ChromaDB collection along with their embeddings and metadata:
documents = [
    "Biryani is a popular Indian dish made with fragrant basmati rice, spices, and marinated meat.",
    "Paneer Tikka is a vegetarian dish consisting of marinated paneer (cottage cheese) grilled to perfection."
]
metadatas = [
    {"source": "recipe_book"},
    {"source": "food_blog"}
]
ids = ["dish1", "dish2"]
# Insert documents, embeddings, and metadata
for doc, meta, id in zip(documents, metadatas, ids):
    collection.upsert(
        documents=[doc],
        embeddings=[sentence_transformer_ef([doc])[0]],  # the embedding function expects a list of texts
        metadatas=[meta],
        ids=[id],
    )
This allows for the efficient management and retrieval of documents for future use.
Querying Documents in ChromaDB
To retrieve relevant documents based on a query, ChromaDB provides a querying mechanism:
results = collection.query(
    query_texts=["What are the must-try dishes in Indian cuisine?"],
    n_results=2
)
print(results)
Output:
{'ids': [['dish1', 'dish2']],
'documents': [['Biryani is a popular Indian dish made with fragrant basmati rice, spices, and marinated meat.',
'Paneer Tikka is a vegetarian dish consisting of marinated paneer (cottage cheese) grilled to perfection.']],
'metadatas': [[{'source': 'recipe_book'}, {'source': 'food_blog'}]],
'distances': [[1.8978, 1.9171]]}
This returns the top 2 most relevant documents based on the query, allowing for a powerful semantic search capability.
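Queries can also be combined with metadata filters via the where argument, which is handy once a collection mixes documents from different sources. A small sketch restricting the search to documents tagged with the recipe_book source defined above:
filtered_results = collection.query(
    query_texts=["What are the must-try dishes in Indian cuisine?"],
    n_results=1,
    where={"source": "recipe_book"},  # only consider documents with this metadata
)
print(filtered_results["documents"])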
Splitting Long Text, Creating Embeddings, and Managing Documents in LangChain
In this section, we explore techniques for splitting long text into manageable chunks, creating custom embeddings for sentence encoding, and managing documents with unique identifiers in LangChain. These concepts are essential for building robust NLP applications, enabling efficient text processing, search, and analysis.
1. Splitting Long Text into Manageable Chunks
When dealing with large bodies of text, splitting it into smaller chunks is crucial for efficient processing. LangChain's RecursiveCharacterTextSplitter helps in breaking down lengthy text into smaller, manageable chunks while preserving context.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Initialize the text splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=50)
# Define a long piece of text
long_text = (
    "Indian cuisine is known for its diverse flavors and rich cultural heritage. "
    "Dishes like Biryani, a fragrant rice dish cooked with spices and marinated meat, are popular across the country. "
    "Paneer Tikka, made from marinated cottage cheese grilled to perfection, is a favorite among vegetarians. "
    "Indian food also includes an array of delicious curries, such as Butter Chicken, which is known for its creamy tomato sauce. "
    "Street food like Pani Puri and Samosas are beloved snacks that offer a burst of flavors. "
    "The culinary traditions vary from region to region, making Indian cuisine a delightful experience for food lovers."
)
# Split the text into chunks
chunks = splitter.split_text(long_text)
# Print the first two chunks
print(chunks[0])
# 'Indian cuisine is known for its diverse flavors and rich cultural heritage. Dishes like Biryani, a'
print(chunks[1])
# 'rich cultural heritage. Dishes like Biryani, a fragrant rice dish cooked with spices and marinated'
Here, we split the long_text into chunks of 100 characters with an overlap of 50 characters. The overlap ensures that the context is preserved between adjacent chunks, which is especially important in NLP tasks.
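To confirm how the splitter behaved, you can inspect the chunks it produced (a quick sketch reusing the chunks list from above):
print(len(chunks))               # total number of chunks produced
print([len(c) for c in chunks])  # each chunk is at most 100 characters long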
2. Creating a Custom Embeddings Class for Sentence Encoding
Embedding text into vectors allows us to represent the semantic meaning of sentences. In LangChain, you can define a custom embeddings class that uses pre-trained models from SentenceTransformers to generate embeddings for documents and queries.
Custom Embeddings Class
from sentence_transformers import SentenceTransformer
from langchain.embeddings.base import Embeddings
class CustomEmbeddings(Embeddings):
    def __init__(self, model_name: str):
        # Load the sentence transformer model
        self.model = SentenceTransformer(model_name)
    # Method to embed multiple documents
    def embed_documents(self, documents):
        return [self.model.encode(d).tolist() for d in documents]
    # Method to embed a single query
    def embed_query(self, query: str):
        return self.model.encode([query])[0].tolist()
In this custom class, we leverage the SentenceTransformer model to generate embeddings for text documents and queries. The embed_documents method handles a list of documents, while embed_query works for single query strings.
Example Usage
# Instantiate the custom embeddings class with a pre-trained model
embedding_model = CustomEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Define some sample documents
documents = [
"Biryani is a popular Indian dish made with basmati rice and spices.",
"Paneer Tikka is a grilled cottage cheese dish."
]
# Generate embeddings for the documents
document_embeddings = embedding_model.embed_documents(documents)
print(document_embeddings)
# Generate an embedding for a query
query_embedding = embedding_model.embed_query("What is Biryani?")
print(query_embedding)
Here, we use the all-MiniLM-L6-v2 model to embed both documents and a query, allowing us to compare and search for semantic similarity.
3. Creating Document Instances with Unique Identifiers
To manage and retrieve documents efficiently, we can use the Document class in LangChain, which allows us to assign metadata and unique identifiers to each document. This is useful in applications like search engines or document retrieval systems.
Example of Creating Documents
from uuid import uuid4
from langchain_core.documents import Document
# Define some sample documents with metadata and unique IDs
documents = [
    Document(
        page_content="I had delicious Biryani for lunch, and it was bursting with flavors!",
        metadata={"source": "tweet"},
        id=1,
    ),
    Document(
        page_content="The weather is perfect for enjoying a plate of spicy Samosas.",
        metadata={"source": "news"},
        id=2,
    ),
    # ... more documents
]
# Generate unique identifiers (UUIDs) for each document
uuids = [str(uuid4()) for _ in range(len(documents))]
# Print the first document and its UUID
print(documents[0])
print(uuids[0])
Explanation
page_content: Contains the actual content of the document (e.g., tweets, articles).
metadata: Provides additional context about the document, such as its source.
id: A unique identifier for each document (which could be an integer or a UUID).
By organizing text data this way, we can efficiently manage large collections of documents, perform searches, and store metadata for more context-aware processing.
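The same pattern applies to the chunks produced by the text splitter earlier; as a hypothetical sketch, each chunk can be wrapped in a Document whose metadata records where it came from:
chunk_documents = [
    Document(
        page_content=chunk,
        metadata={"source": "indian_cuisine_overview", "chunk_index": i},  # hypothetical metadata fields
    )
    for i, chunk in enumerate(chunks)
]
print(chunk_documents[0])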
Setting Up a Vector Store with ChromaDB and Custom Embeddings
In this code snippet, we set up a vector store using ChromaDB and a custom embedding model to efficiently store, manage, and query document embeddings. Let’s break down the process:
Importing Required Libraries: First, we import chromadb for managing the Chroma database and Chroma from langchain.vectorstores to create and interact with the vector store.
Creating a Custom Embedding Model: An instance of CustomEmbeddings is created using the "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" model, which generates multilingual sentence embeddings, capturing semantic relationships across various languages.
Initializing a Persistent ChromaDB Client: The PersistentClient is initialized to allow persistent storage of document embeddings.
Creating or Accessing a Collection: The get_or_create_collection method ensures a collection named "indian_food" is either created or accessed. This collection will store document embeddings.
Setting Up the Vector Store: The Chroma vector store is then set up with the persistent client, collection, and the custom embedding model. The vector store is now ready to store embeddings and documents for semantic search.
import chromadb
from langchain.vectorstores import Chroma
# Custom embedding model
embedding_model = CustomEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
# Persistent ChromaDB client
persistent_client = chromadb.PersistentClient()
# Create or retrieve a collection
collection = persistent_client.get_or_create_collection("indian_food")
# Vector store setup
vector_store = Chroma(
    client=persistent_client,
    collection_name="indian_food",
    embedding_function=embedding_model,
)
# Add documents to the vector store
vector_store.add_documents(documents=documents, ids=uuids)
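If you want to control where ChromaDB persists its data on disk, PersistentClient also accepts an explicit path (the directory name below is just an example):
persistent_client = chromadb.PersistentClient(path="./chroma_db")  # example storage directory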
Performing a Similarity Search in ChromaDB
This snippet demonstrates how to perform a semantic similarity search in ChromaDB to find documents most relevant to a user query. Here's how it works:
Querying for Similar Documents: We use the similarity_search method on the vector_store with a user query, searching for the top k=2 similar documents.
Applying a Filter: The search is filtered to return only documents where the source is "tweet", restricting the results to a specific document category.
Displaying Results: The search results are iterated over, displaying each document’s content and its metadata.
# Performing a similarity search
results = vector_store.similarity_search(
    "What are some popular dishes in Indian cuisine?",
    k=2,
    filter={"source": "tweet"},
)
# Display the results
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
Output:
* LangGraph is a great tool for sharing Indian food recipes with others! [{'source': 'tweet'}]
* Just had an amazing meal at an Indian restaurant. The flavors were out of this world! [{'source': 'tweet'}]
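If you also need the raw score for each hit (by default a distance, so lower means more similar), the LangChain Chroma wrapper provides similarity_search_with_score, which returns (document, score) pairs; a brief sketch using the same query:
results_with_scores = vector_store.similarity_search_with_score(
    "What are some popular dishes in Indian cuisine?",
    k=2,
)
for doc, score in results_with_scores:
    print(f"score={score:.4f} :: {doc.page_content}")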
Complete Example with ChromaDB Setup and Query
This example consolidates the setup and usage of ChromaDB, demonstrating how to embed documents, store them in ChromaDB, and query them to retrieve relevant information based on user input.
import os
import openai
import chromadb
# Step 1: Initialize ChromaDB client
client = chromadb.Client()
# Step 2: Create a collection in ChromaDB
collection = client.get_or_create_collection("documents_collection")
# Step 3: Custom embedding model
embedding_model = CustomEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
# Function to embed documents using the embedding model
def embed_document(text):
    return embedding_model.embed_query(text)
# Step 4: Add documents to ChromaDB with embeddings
def add_documents_to_collection(documents):
    for doc_id, text in enumerate(documents):
        embedding = embed_document(text)  # Generate embedding
        collection.add(
            documents=[text],
            embeddings=[embedding],
            ids=[str(doc_id)]
        )
# Step 5: Query ChromaDB to find relevant documents
def query_chromadb(query, top_n=3):
    query_embedding = embed_document(query)  # Embed user query
    results = collection.query(query_embeddings=[query_embedding], n_results=top_n)
    return results["documents"]
# Example documents
documents = [
"Indian cuisine is known for its diverse flavors and rich cultural heritage.",
"Dishes like Biryani, a fragrant rice dish cooked with spices and marinated meat, are popular across the country.",
"Paneer Tikka, made from marinated cottage cheese grilled to perfection, is a favorite among vegetarians.",
"Indian food also includes an array of delicious curries, such as Butter Chicken, which is known for its creamy tomato sauce.",
"Street food like Pani Puri and Samosas are beloved snacks that offer a burst of flavors.",
"The culinary traditions vary from region to region, making Indian cuisine a delightful experience for food lovers."
]
# Step 1: Add documents to ChromaDB collection
add_documents_to_collection(documents)
# Step 2: Query the database and retrieve relevant documents
user_query = "What indian cuisine is known for its rich cultural heritage?"
relevant_docs = query_chromadb(user_query)
print(relevant_docs)
Output:
[['The culinary traditions vary from region to region, making Indian cuisine a delightful experience for food lovers.',
'Indian food also includes an array of delicious curries, such as Butter Chicken, which is known for its creamy tomato sauce.',
'Street food like Pani Puri and Samosas are beloved snacks that offer a burst of flavors.']]
Generating Answers Using OpenAI
To generate answers based on relevant documents, this code concatenates the relevant documents and formulates a prompt for OpenAI’s text-davinci model to generate a response:
def generate_answer_from_docs(query, documents):
    context = "\n".join(documents[0])  # query results are nested (one list per query), so join the first list
    prompt = f"Based on the following documents, answer the question: {query}\n\nDocuments:\n{context}\nAnswer:"
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        max_tokens=300
    )
    return response.choices[0].text.strip()
# Generate an answer
answer = generate_answer_from_docs(user_query, relevant_docs)
print(answer)
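The snippet above uses the legacy Completions endpoint; if you are on version 1.x of the openai package, the same idea can be expressed with the Chat Completions API. A minimal sketch under that assumption, using gpt-3.5-turbo:
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
def generate_answer_with_chat(query, documents):
    context = "\n".join(documents)  # expects a flat list of document strings
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer the question using only the provided documents."},
            {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {query}"},
        ],
        max_tokens=300,
    )
    return response.choices[0].message.content.strip()
print(generate_answer_with_chat(user_query, relevant_docs[0]))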
LlamaIndex vs. ChromaDB: Which One Should You Use?
This comparison of LlamaIndex and ChromaDB draws a clear distinction between their functionalities and focus, making it easier to decide which tool to use based on your project requirements:
1. Purpose and Focus:
LlamaIndex:
Primary Focus: LlamaIndex specializes in connecting LLMs with external data sources (documents, databases, etc.), enabling LLMs to use structured and unstructured data to answer complex queries.
Use Case: Ideal for building context-aware LLM-powered applications, especially when dealing with long or unstructured documents.
ChromaDB:
Primary Focus: ChromaDB is a high-performance vector database designed to store and manage embeddings efficiently for semantic search, recommendation systems, and machine learning tasks.
Use Case: Best suited for applications requiring fast embedding retrieval and similarity search, such as in recommendation systems or clustering.
2. Integration with LLMs:
LlamaIndex:
Tight LLM Integration: Built for seamless integration with LLMs, making it easier to provide LLMs with the context needed for answering queries.
Strengths: Helps structure data in a way that allows LLMs to query and access relevant information easily, enhancing the response quality.
ChromaDB:
Embedding Store: Focuses on storing and retrieving embeddings efficiently, without built-in deep integration with LLMs.
Strengths: You can still use ChromaDB to store embeddings from LLM outputs and perform tasks like semantic search, but it requires more manual integration.
3. Data Storage and Querying:
LlamaIndex:
Indexed Data Structure: Creates hierarchical and advanced indices (e.g., tree or graph-based) that allow efficient retrieval of relevant text chunks to provide LLMs with context for answering queries.
Querying: Pulls relevant data chunks based on LLM queries, offering efficient handling of long or complex datasets.
ChromaDB:
Vector Storage: Stores and manages high-dimensional embeddings and performs fast similarity searches using algorithms like cosine similarity or Euclidean distance.
Querying: Primarily focuses on vector similarity searches, optimized for performance over textual or document-based querying.
4. Handling of Large Data:
LlamaIndex:
Document Chunking: Specifically designed to handle large documents by breaking them into smaller, manageable chunks, helping LLMs bypass token limitations during query processing.
ChromaDB:
Scaling with Embeddings: Optimized for scaling with large numbers of high-dimensional embeddings, but lacks native document chunking or summarization features that are beneficial when working with large text corpora.
5. Customization and Flexibility:
LlamaIndex:
LLM-Centric Customization: Offers various indexing structures (list, tree, graph) that can be customized based on the application’s needs, allowing for more flexibility in how data is accessed by LLMs.
ChromaDB:
Flexible Vector Database: Provides robust options for embedding storage, indexing, and retrieval, but doesn’t offer LLM-specific customizations or tools like chunking.
6. Ease of Use:
LlamaIndex:
Developer-Friendly for LLMs: Built specifically for developers looking to integrate LLMs with external data, making it easy to add external knowledge to LLM responses.
ChromaDB:
Specialized for Embeddings: While more specialized and efficient for managing embeddings, it requires more manual effort to integrate with LLM-based applications.
Building and Querying a Vector Index with LlamaIndex
The following end-to-end example shows LlamaIndex in action: it loads documents from a local directory, builds a simple vector index over them, and queries that index with an LLM.
import os
from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex, LLMPredictor, ServiceContext
from langchain.chat_models import ChatOpenAI
# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = 'your-openai-api-key'
# Step 1: Load and prepare your data
# Load documents from a directory
def load_documents(directory_path):
    return SimpleDirectoryReader(directory_path).load_data()
# Step 2: Create the index using LlamaIndex
def create_index(documents):
    # Define the LLM Predictor using OpenAI's GPT model
    llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo"))
    # Create a service context (includes the LLM and settings)
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
    # Build the index from the documents
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
    return index
# Step 3: Query the index using an LLM
def query_index(index, query):
    response = index.query(query)
    return response
# Main function to run everything
if __name__ == "__main__":
    # Load documents from a local directory
    documents = load_documents("path/to/your/documents")
    # Create the index
    index = create_index(documents)
    # Query the index with a user question
    user_query = "What are the main challenges in AI research?"
    response = query_index(index, user_query)
    # Print the response from the LLM
    print(response)
Building and Querying a Tree-Based Index with LlamaIndex
LlamaIndex also supports other index structures. The example below follows the same steps but builds a tree-based index (GPTTreeIndex) over the documents instead of a flat vector index.
import os
from llama_index import GPTTreeIndex, SimpleDirectoryReader, LLMPredictor, ServiceContext
from langchain.chat_models import ChatOpenAI
# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = 'your-openai-api-key'
# Step 1: Load documents
def load_documents(directory_path):
    return SimpleDirectoryReader(directory_path).load_data()
# Step 2: Create a tree-based index
def create_tree_index(documents):
    # Define the LLM Predictor (using OpenAI GPT)
    llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo"))
    # Create a service context
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
    # Build the tree-based index from the documents
    tree_index = GPTTreeIndex.from_documents(documents, service_context=service_context)
    return tree_index
# Step 3: Query the tree-based index
def query_tree_index(tree_index, query):
    response = tree_index.query(query)
    return response
# Main function
if __name__ == "__main__":
    # Load documents from your directory
    documents = load_documents("path/to/your/documents")
    # Create the tree-based index
    tree_index = create_tree_index(documents)
    # Query the index
    user_query = "What are the main challenges in AI research?"
    response = query_tree_index(tree_index, user_query)
    # Output the response
    print(response)