Retrieval Augmented Generation: Insights from Building AI-Powered Apps

I've been working on AI-powered applications for a while now, and I want to share some thoughts on building them, particularly around retrieval augmented generation (RAG). This technology has been central to my recent projects, and I've learned a lot about its potential and challenges. Let's dive into the details.

My Journey with AI Apps

To provide some context, I'll briefly outline my experience with AI app development:

  1. GPTGram (later Solvemigo): My first AI app, essentially bringing ChatGPT to Telegram. It incorporated Whisper for speech-to-text, DALL-E for image generation, and the ChatGPT API for LLM functionality.
  2. PenPersona: A tool designed to convert unstructured thoughts into polished content while preserving the user's unique writing style.
  3. MemoryPlugin.com: A plugin for the popular ChatGPT client TypingMind that adds long-term memory to any LLM that supports tool use, like OpenAI GPT, Anthropic Claude, Google Gemini, Meta Llama, etc., making conversations more contextual and personalised. It implements a basic form of RAG.
  4. AskLibrary: My current project, allowing users to upload books and documents, then learn by asking questions about them. Imagine talking to your favourite books. This is where RAG really comes into play.

Understanding Retrieval Augmented Generation

Retrieval augmented generation is a technique that enhances AI-generated answers by retrieving and incorporating additional relevant information. It's designed to address two key limitations of standard large language models (LLMs):

  1. Lack of access to real-time information
  2. Inability to access private or specific content

At its core, RAG involves feeding relevant information to an LLM before it generates a response to a query. This allows the model to provide more accurate, up-to-date, and context-specific answers.
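
To make that concrete, here's a minimal sketch of the idea using the OpenAI Node SDK. The retrieved passage and the model name are placeholders; in a real system the context would come from the retrieval pipeline described below.

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// In a real RAG system this context comes from a retrieval step;
// here it is hard-coded purely to illustrate the core idea.
const retrievedContext = "AskLibrary lets users upload books and ask questions about them.";
const question = "What does AskLibrary do?";

const completion = await openai.chat.completions.create({
  model: "gpt-4-turbo", // placeholder; use whichever chat model fits your budget
  messages: [
    { role: "system", content: "Answer using only the provided context." },
    { role: "user", content: `Context:\n${retrievedContext}\n\nQuestion: ${question}` },
  ],
});

console.log(completion.choices[0].message.content);
```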

The Naive Approach and Its Limitations

The simplest implementation of RAG would be to feed an entire document to the LLM before asking a question. However, this approach quickly runs into two significant problems:

  1. Technical Constraints: Most LLMs have a limited context window. For instance, GPT-4 Turbo can handle up to 128,000 tokens. While this might seem substantial, consider that an average book contains between 75,000 and 125,000 tokens. You'd be pushing the limit with just one book, let alone a library of content.
  2. Cost Implications: LLMs typically charge based on the number of input tokens. If you're inputting a 100,000-token book for each query, you're paying for those 100,000 tokens every time you ask a question. At GPT-4's rate of $0.03 per 1K input tokens, that's $3 per query for the input alone, not counting output tokens; even at GPT-4 Turbo's cheaper $0.01 per 1K, you're still paying $1 per query. This quickly becomes prohibitively expensive for most applications, especially those handling many queries.
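
To put numbers on the cost problem, here's the back-of-the-envelope arithmetic from above as a tiny sketch. The per-1K rates are the published GPT-4 and GPT-4 Turbo input prices; adjust for whichever model you actually use.

```typescript
// Rough input cost of the naive "stuff the whole book into the prompt" approach.
const bookTokens = 100_000;

function inputCostPerQuery(tokens: number, ratePer1K: number): number {
  return (tokens / 1_000) * ratePer1K;
}

console.log(inputCostPerQuery(bookTokens, 0.03)); // 3 -> $3 per query at GPT-4 input pricing
console.log(inputCostPerQuery(bookTokens, 0.01)); // 1 -> $1 per query at GPT-4 Turbo input pricing
```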

Embeddings and Vector Search: The Solution

To overcome these limitations, we turn to embeddings and vector search. Here's how it works:

Embeddings: These are numerical representations of text that capture semantic meaning. When you feed text into an embedding model, it returns a series of numbers (usually 1024 or 1536 dimensions). These numbers represent your text in a high-dimensional vector space.

In this vector space, semantically similar concepts are positioned close to each other. For example, "wheel" and "round" might be neighbours in this space, even though they don't share any letters. Vector search, then, is simply the task of finding the stored vectors that sit closest to the vector of a given query.
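
As a quick illustration, the sketch below embeds a few words with OpenAI's embeddings API and compares them with cosine similarity. The model name is just one common choice; the point is that related words like "wheel" and "round" should score noticeably higher than an unrelated pair.

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Cosine similarity: dot product divided by the product of the vector magnitudes.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const magB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dot / (magA * magB);
}

const { data } = await openai.embeddings.create({
  model: "text-embedding-3-small", // returns 1536-dimensional vectors
  input: ["wheel", "round", "democracy"],
});

const [wheel, round, democracy] = data.map((d) => d.embedding);

console.log("wheel vs round:    ", cosineSimilarity(wheel, round));
console.log("wheel vs democracy:", cosineSimilarity(wheel, democracy));
// Expect the first score to be higher: "wheel" and "round" sit closer in the space.
```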

The RAG Pipeline: A Detailed Walkthrough

Let's break down the steps involved in a typical RAG pipeline (code sketches for the key steps follow the list):

  1. Chunking:

    • Break your content (books, documents, etc.) into smaller, overlapping pieces.
    • The optimal chunk size and overlap depend on your specific use case and data.
  2. Embedding Generation:

    • Convert each chunk into embeddings using an embedding model.
    • This can be done using local models or commercial APIs like those from OpenAI.
  3. Vector Database Storage:

    • Store these embeddings in a vector database.
    • Options include Pinecone (serverless, pay-as-you-go), pgvector (a Postgres extension), or Supabase (managed Postgres with pgvector built in).
  4. Query Processing:

    • When a user asks a question, convert it to embeddings using the same model.
  5. Vector Search:

    • Use a similarity metric such as cosine similarity to find the vectors in your database that are closest to the query vector.
  6. Text Retrieval:

    • Fetch the actual text associated with the similar vectors from your database.
  7. Query Transformation (Optional):

    • Use an LLM to create multiple queries from the user's input.
    • This helps capture different aspects or interpretations of the user's question.
  8. Query Fanout:

    • If using query transformation, perform vector search for each generated query.
  9. Re-ranking:

    • Score how relevant each retrieved chunk is to the original query.
    • This step helps filter out less relevant information.
  10. LLM Processing:

    • Feed the top-ranked chunks and the original query into your chosen LLM.
    • The LLM uses this context to generate a response.
  11. Response Generation:

    • Return the LLM-generated answer to the user.
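
To make the pipeline concrete, here's a compact sketch of steps 1-6 and 10-11 in TypeScript using the OpenAI SDK. The chunk size, overlap, model names, and top-k value are illustrative, and a plain array stands in for the vector database; a real application would use Pinecone, pgvector, or similar.

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// 1. Chunking: split text into overlapping pieces (sizes here are illustrative).
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const magB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dot / (magA * magB);
}

async function embed(texts: string[]): Promise<number[][]> {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return data.map((d) => d.embedding);
}

// 2-3. Embedding generation and storage (an array stands in for a vector database).
type StoredChunk = { text: string; embedding: number[] };

async function indexDocument(text: string): Promise<StoredChunk[]> {
  const chunks = chunkText(text);
  const embeddings = await embed(chunks);
  return chunks.map((chunk, i) => ({ text: chunk, embedding: embeddings[i] }));
}

// 4-6. Query processing, vector search, and text retrieval.
async function retrieve(store: StoredChunk[], question: string, topK = 5): Promise<string[]> {
  const [queryEmbedding] = await embed([question]);
  return store
    .map((item) => ({ ...item, score: cosineSimilarity(queryEmbedding, item.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((item) => item.text);
}

// 10-11. Feed the top-ranked chunks plus the question to the LLM and return its answer.
async function answer(store: StoredChunk[], question: string): Promise<string> {
  const context = (await retrieve(store, question)).join("\n---\n");
  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    messages: [
      { role: "system", content: "Answer the question using only the provided context." },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
  return completion.choices[0].message.content ?? "";
}

// Usage:
// const store = await indexDocument(bookText);
// console.log(await answer(store, "What does the author say about habit formation?"));
```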
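
Steps 7-9 are optional, but they often improve retrieval quality noticeably. The sketch below generates a few query variants with the LLM and re-ranks candidate chunks by scoring them against the original question; this LLM-based scoring is just one simple approach, and many systems use a dedicated re-ranking model (such as a cross-encoder) instead.

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// 7. Query transformation: ask the LLM for alternative phrasings of the user's question.
async function transformQuery(question: string, count = 3): Promise<string[]> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    messages: [
      {
        role: "user",
        content:
          `Rewrite the following question in ${count} different ways, one per line, ` +
          `covering different interpretations:\n\n${question}`,
      },
    ],
  });
  const text = completion.choices[0].message.content ?? "";
  return [question, ...text.split("\n").map((line) => line.trim()).filter(Boolean)];
}

// 9. Re-ranking: score each candidate chunk against the original question (0-10)
// and keep the highest-scoring ones.
async function rerank(question: string, chunks: string[], keep = 5): Promise<string[]> {
  const scored = await Promise.all(
    chunks.map(async (chunk) => {
      const completion = await openai.chat.completions.create({
        model: "gpt-4-turbo",
        messages: [
          {
            role: "user",
            content:
              `On a scale of 0-10, how relevant is this passage to the question? ` +
              `Reply with a single number.\n\nQuestion: ${question}\n\nPassage: ${chunk}`,
          },
        ],
      });
      const score = parseFloat(completion.choices[0].message.content ?? "0");
      return { chunk, score: Number.isNaN(score) ? 0 : score };
    }),
  );
  return scored.sort((a, b) => b.score - a.score).slice(0, keep).map((s) => s.chunk);
}

// 8. Query fanout would run the vector search from the previous sketch once per
// variant returned by transformQuery, then pass the merged results to rerank.
```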

Challenges and Optimizations

Implementing an effective RAG system involves navigating several challenges and optimizations:

  1. Chunk Size and Overlap:

    • Smaller chunks provide more granular information but may lack context.
    • Larger chunks preserve context but may introduce irrelevant information.
    • Finding the right balance is crucial and often requires experimentation.
  2. Embedding Model Selection:

    • Different models produce embeddings of varying quality and dimensionality.
    • Higher dimensional embeddings can capture more nuanced relationships but increase computational costs.
  3. Vector Database Choice:

    • Consider factors like query speed, storage costs, and ease of integration.
  4. Re-ranking Algorithms:

    • Experiment with different scoring methods to improve relevance.
  5. Prompt Engineering:

    • Craft effective prompts for the LLM to generate accurate and coherent responses (see the sketch after this list).
  6. Query Transformation Strategies:

    • Develop methods to generate diverse yet relevant query variations.
  7. Performance Optimization:

    • Balance between retrieval accuracy and system latency.
  8. Cost Management:

    • Optimize API calls and token usage to keep expenses under control.
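
On point 5, a surprising amount of RAG quality comes down to how the retrieved context is framed for the model. Here's one illustrative grounding prompt I'd treat as a starting point rather than a canonical template:

```typescript
// An illustrative system prompt for the final LLM call in a RAG pipeline.
// Goal: keep the model grounded in the retrieved chunks and honest about gaps.
const systemPrompt = `
You are a research assistant answering questions about the user's documents.
Rules:
- Answer using ONLY the information in the numbered context passages below.
- If the context does not contain the answer, say so plainly instead of guessing.
- Cite the passage numbers you relied on, e.g. [2], after each claim.
- Keep the answer concise and in the same language as the question.
`.trim();

// The retrieved chunks are numbered so the model can cite them.
function buildUserPrompt(chunks: string[], question: string): string {
  const context = chunks.map((chunk, i) => `[${i + 1}] ${chunk}`).join("\n\n");
  return `Context passages:\n${context}\n\nQuestion: ${question}`;
}
```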

Conclusion

Retrieval Augmented Generation represents a significant advancement in AI application development. It allows us to create more intelligent, context-aware systems that can leverage specific knowledge bases effectively. The technology opens up exciting possibilities for personalized learning, advanced research tools, and more accurate information retrieval systems.

As we continue to refine RAG techniques, we're likely to see even more powerful and efficient implementations. The field is ripe for innovation, and I'm excited to see how it evolves.

Implementing RAG is a complex but rewarding challenge. It requires a deep understanding of various technologies and a willingness to experiment and optimize. But the potential benefits – more accurate, context-aware AI responses – make it a worthwhile endeavor for many applications.

As I continue working on AskLibrary and exploring the possibilities of RAG, I'm constantly amazed by the potential of this technology. It's not just about building smarter AI; it's about creating tools that can truly augment human knowledge and learning in meaningful ways.