Introduction
In the rapidly evolving field of artificial intelligence, retrieval-augmented generation (RAG) has become a cornerstone for building high-performance AI applications. By combining language models with external knowledge retrieval, RAG systems generate responses that are both accurate and contextually rich. However, as powerful as RAG is, it comes with challenges: high token consumption, rising operational costs, and slower response times can limit the effectiveness of these systems. Addressing these problems is essential to unlocking the true potential of RAG and delivering top-tier performance. In this blog, we'll explore how caching can help solve them.
Understanding Retrieval-Augmented Generation
Simply put, RAG is an AI technique that combines retrieving relevant information from external sources with a language model to generate accurate, context-aware responses.
Imagine you're building an AI to answer questions about your hobbies or interests:
- First, you'd gather data about yourself and store it in a vector database, which encodes the information as vectors for efficient retrieval
- Next, you'd select a powerful language model, such as GPT-4 or Llama 3, to process the data and generate responses
- Finally, you'd need tools like LangChain, LlamaIndex, or LLMWare to seamlessly orchestrate these components
This is a simple RAG model architecture:
User Query → Vector DB Search → LLM Processing → Response
We receive the user's message, perform a similarity search in our vector database to retrieve the most relevant documents, and pass these documents to the LLM to generate the final response. However, improving a RAG model often involves additional steps to ensure response quality, which increases token consumption and latency. One effective solution to this challenge is semantic caching.
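As a rough sketch (not a complete implementation), that flow might look like the following, where vector_db and llm are hypothetical clients standing in for whatever vector store and language model you use:

# Minimal RAG flow (illustrative; vector_db and llm are hypothetical clients)
def answer(user_query, vector_db, llm, top_k=3):
    # 1. Retrieve the most relevant documents for the query
    documents = vector_db.search(user_query, top_k=top_k)

    # 2. Build a prompt that grounds the model in the retrieved context
    context = "\n\n".join(doc.text for doc in documents)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {user_query}"

    # 3. Let the LLM generate the final, context-aware response
    return llm.generate(prompt)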
The Power of Caching
Caching is a computational strategy designed to improve system efficiency by temporarily storing frequently accessed or computationally expensive data in a fast-access storage medium, such as RAM or specialized cache memory. By keeping this data close to the processor, caching minimizes the need for repeated retrievals from slower, more resource-intensive sources, such as databases or external storage.
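Stripped of any particular technology, the idea is simply a fast lookup consulted before doing the expensive work, as in this minimal sketch:

cache = {}  # fast in-memory store

def expensive_lookup(key, compute_fn):
    # Return the cached value on a hit; otherwise compute once and remember it
    if key in cache:
        return cache[key]
    value = compute_fn(key)  # slow path: database call, API request, etc.
    cache[key] = value
    return value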
Integrating Caching in RAG Models
Integrating a caching system into RAG models can significantly enhance efficiency by reducing costs and response latency. Here's how it works:
- First Query Processing:
  - User sends a message
  - System performs similarity search in vector database
  - Retrieves relevant documents
  - Generates response using LLM
  - Stores query and response in cache
- Subsequent Queries (see the sketch after this list):
  - System checks cache for similar questions
  - If match found, returns cached response instantly
  - Bypasses further computation
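Put together, a cache-first request handler might look like this rough sketch. It reuses the helper names from the implementation shown later in this post (get_embedding, find_similar_question, set_cached_response), while answer_with_rag is a hypothetical stand-in for the full retrieval-plus-generation step:

def handle_query(user_query):
    embedding = get_embedding(user_query)

    # Check the cache for a semantically similar question first
    cached_question, cached_response = find_similar_question(user_query, embedding)
    if cached_response is not None:
        return cached_response  # cache hit: skip retrieval and generation

    # Cache miss: run the normal RAG pipeline, then remember the result
    response = answer_with_rag(user_query)  # hypothetical full RAG step
    set_cached_response(user_query, response, embedding)
    return response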
However, cache lookups aren't based on exact string matching (if userPrompt == storedPrompt: return storedResponse), since different users may phrase the same question in different ways. Instead, a similarity-based check is performed:
Cosine Similarity Approach
- The user's prompt is converted into an embedding (vector representation of its semantic meaning)
- The system compares this embedding with those of stored queries in the cache
- Uses cosine similarity to measure vector proximity in high-dimensional space
- If similarity score exceeds threshold, returns stored response
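As a toy example using the same scipy helper as the implementation below (the vectors and the 0.8 threshold are purely illustrative):

import numpy as np
from scipy.spatial.distance import cosine

query_vec = np.array([0.12, 0.95, 0.31])   # embedding of the incoming prompt (toy values)
cached_vec = np.array([0.10, 0.90, 0.35])  # embedding of a cached prompt (toy values)

# scipy's cosine() returns the distance, so similarity = 1 - distance
similarity = 1 - cosine(query_vec, cached_vec)

if similarity > 0.8:
    print("Cache hit: return the stored response")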
Alternative: Lightweight LLM Approach
A lightweight LLM can be employed to compare incoming queries with cached entries, ensuring accurate identification of semantically similar prompts.
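A rough sketch of what such a check could look like, where small_llm is a hypothetical client for whatever compact model you choose to deploy:

def is_same_question(new_prompt, cached_prompt, small_llm):
    # Ask a small, cheap model for a yes/no judgment on semantic equivalence
    judgment = small_llm.generate(
        "Do these two questions ask for the same information? Answer YES or NO.\n"
        f"Q1: {new_prompt}\nQ2: {cached_prompt}"
    )
    return judgment.strip().upper().startswith("YES")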
Implementation
In this implementation, we leverage Redis for in-memory storage and use Python libraries to calculate cosine similarity between embeddings.
import time
import json

import numpy as np
import redis
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

# Initialize the embedding model
embedding_model = SentenceTransformer('distiluse-base-multilingual-cased-v1')

# Cache settings
CACHE_EXPIRATION = 3600  # 1 hour in seconds
SIMILARITY_THRESHOLD = 0.8  # Minimum cosine similarity for a cache hit

# Initialize Redis connection
r = None
try:
    r = redis.Redis(host='localhost', port=6379, db=0)
    r.ping()  # force a connection check; redis.Redis() itself connects lazily
except redis.ConnectionError as e:
    print(f"Failed to connect to Redis: {str(e)}")
    r = None

def set_cached_response(question, response, embedding):
    """Store a question, its response, and its embedding in Redis."""
    if r is None:
        return
    try:
        cache_data = {
            'response': response,
            'timestamp': time.time(),
            'embedding': embedding.tolist()
        }
        r.set(question, json.dumps(cache_data))
    except redis.RedisError as e:
        print(f"Failed to set cache: {str(e)}")

def find_similar_question(question, embedding):
    """Return the cached (question, response) that matches the query, if any."""
    if r is None:
        return None, None
    try:
        for cached_question in r.keys():
            cached_data = json.loads(r.get(cached_question))
            cached_embedding = np.array(cached_data['embedding'])
            # Skip entries older than the expiration window
            if time.time() - cached_data['timestamp'] < CACHE_EXPIRATION:
                similarity = 1 - cosine(embedding, cached_embedding)
                if similarity > SIMILARITY_THRESHOLD:
                    return cached_question.decode(), cached_data['response']
    except redis.RedisError as e:
        print(f"Failed to search cache: {str(e)}")
    return None, None

def get_embedding(text):
    """Encode text into a dense embedding vector."""
    return embedding_model.encode([text])[0]
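To exercise these helpers end to end (this assumes a local Redis instance is running; whether the paraphrase actually clears the 0.8 threshold depends on the embedding model, so treat the hit/miss outcome as illustrative):

# Quick demonstration of the cache round trip
q1 = "What is retrieval-augmented generation?"
emb1 = get_embedding(q1)
set_cached_response(q1, "RAG combines document retrieval with an LLM.", emb1)

q2 = "Can you explain retrieval augmented generation?"  # differently phrased
emb2 = get_embedding(q2)
matched_question, cached_response = find_similar_question(q2, emb2)

if cached_response is not None:
    print(f"Cache hit via '{matched_question}': {cached_response}")
else:
    print("Cache miss: fall through to the full RAG pipeline")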
Implementation Breakdown
- Sentence Embedding Model:
  - Uses SentenceTransformer (distiluse-base-multilingual-cased-v1)
  - Generates high-dimensional vectors representing text semantics
- Redis Cache:
  - In-memory database for efficient storage/retrieval
  - Stores processed questions and responses
- Key Settings:
  - CACHE_EXPIRATION: 1 hour validity
  - SIMILARITY_THRESHOLD: 0.8 (80% similarity)
Core Functions
- set_cached_response:
  - Stores question, response, and embedding in Redis
  - Converts data to JSON format
  - Includes timestamp for expiration tracking
- find_similar_question:
  - Searches Redis for semantically similar questions
  - Skips expired entries
  - Uses cosine similarity for matching
  - Returns cached response if similarity > threshold
Conclusion
Integrating caching into Retrieval-Augmented Generation (RAG) models is a practical and impactful strategy for enhancing their performance. By storing frequently accessed responses and leveraging semantic similarity checks, caching effectively:
- Reduces token consumption
- Cuts operational costs
- Minimizes response latency
This optimization not only makes RAG systems more efficient but also ensures a smoother and faster user experience. In a field where every millisecond and token counts, adopting strategies like caching is not just an improvement — it's a necessity for building next-generation AI models.