January 6, 2024

#121 Effective Retrieval: The Crucible of Meaningful Responses in Generative AI

Consider LLMs (Large Language Models) as actors who have attended film school. There, they've learned and been trained in all the essential qualities a skilled actor should possess. Their training has ingrained in them an array of traits, shaping their 'weights' in performance and 'biases' in character portrayal.

When these LLM actors partake in a film, a profound understanding of the script and characters is critical. Picture this film as a metaphor for life, where the script is dynamically generated. The audience, wielding significant influence, can query anything they desire, actively shaping the script that the actors must bring to life.

Their performances are inherently unpredictable. While they may naturally lean towards some degree of improvisation, the quality of information they receive can either lead to chaotic misinterpretations or elevate their performances with bursts of creativity.

In the realm of generative AI, much like in our envisioned film studio, there exists a crucial necessity for a system adept at storing and retrieving context-specific data. This system is identified as Retrieval Augmented Generation (RAG). Envision the process of creating embeddings in RAG as equipping an actor with vital, actionable information, gleaned from similar scenes in past movies and other relevant contexts.

When gathering scene details or dialogue, LLM actors need information that is coherent and contextually appropriate, not a random collection of words. Supplying that kind of material is exactly what a RAG system is for.

Inside an LLM, embeddings are created at the token level. For retrieval, this granularity is limiting: each token is typically smaller than a standard word and offers only a narrow slice of context. Best practices in RAG therefore advocate embedding and storing information in larger, more contextually complete units, such as sentences, paragraphs, or fixed-size chunks of text. When LLM actors retrieve information stored this way, they receive a fuller, more nuanced picture, akin to working with a detailed and cohesive script rather than fragmented lines. The result is more relevant retrieval and richer, more meaningful interactions.
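
To make the idea of larger units concrete, here is a minimal sketch of one common approach: splitting a document into overlapping word windows before embedding. The function name `chunk_words` and the default sizes are illustrative assumptions, not values from the course or from any particular library.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping chunks of roughly `chunk_size` words.

    Hypothetical helper for illustration; real pipelines often split on
    sentences or tokens instead of whitespace-separated words.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    script = " ".join(f"w{i}" for i in range(12))
    for chunk in chunk_words(script, chunk_size=5, overlap=2):
        print(chunk)
```

The overlap lets a sentence that straddles a boundary appear whole in at least one chunk. Each chunk, rather than each token, would then be embedded and stored.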

Movie directors employ various optimization techniques and tricks to elicit the best performances from actors. Let's delve into these strategies below and explore their effectiveness.

The Art of Perfecting Scripts with Query Expansion in RAG

Query Expansion (QE) in Retrieval Augmented Generation (RAG) is a director's favorite trick. QE is like a director on set, refining and enhancing the script in real time based on the scene's needs. It takes the initial user query, akin to a basic script outline, and enriches it with additional, meaningful terms. This deepens the context and widens the scope of retrieval, equipping LLM actors with information that is not just accurate but layered with nuance and broader themes.

QE's significance in RAG extends beyond merely responding to straightforward queries. It's about interpreting and expanding upon the underlying themes and emotions that drive those queries. This involves incorporating synonyms, related phrases, or broader concepts linked to the initial query. Such an approach transforms a simple query into a multi-dimensional search parameter, leading to data retrieval that is more relevant and comprehensive. This is akin to a director guiding actors to deliver performances that capture the essence of the entire scene, rather than just their lines.
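
As a rough sketch of the synonym-based flavor of query expansion described above, the snippet below appends related terms to a query before it is sent to the retriever. The `RELATED_TERMS` table and the `expand_query` function are hypothetical; in practice the extra terms would more likely come from an LLM or a thesaurus than from a hand-written dictionary.

```python
# Hypothetical lookup table; a real system might ask an LLM for these terms.
RELATED_TERMS: dict[str, list[str]] = {
    "car": ["automobile", "vehicle"],
    "price": ["cost"],
}


def expand_query(query: str, related: dict[str, list[str]] = RELATED_TERMS) -> str:
    """Return the query with related terms appended, without duplicates."""
    tokens = query.lower().split()
    seen = set(tokens)
    extra = []
    for token in tokens:
        for term in related.get(token, []):
            if term not in seen:
                seen.add(term)
                extra.append(term)
    # Keep the user's original wording first, then add the new terms.
    return " ".join(query.split() + extra)


if __name__ == "__main__":
    print(expand_query("car price"))
```

The expanded string is what gets embedded and searched, so documents that say "automobile cost" can still match a query about "car price".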

Cross-Encoder Re-Ranking

Cross-Encoder Re-Ranking in Retrieval Augmented Generation (RAG) mirrors the precision of a film director fine-tuning a scene so that every detail serves the narrative's emotional impact. The technique helps make responses from Large Language Models (LLMs) not only relevant but closely tailored to the subtleties of the user's query. Like a director reviewing the first take, Cross-Encoder Re-Ranking evaluates the initial results retrieved for the LLM, its 'script', which often capture the essence but need refinement. A cross-encoder reads the query and each retrieved passage together, scores how well they match, and reorders the passages by that score, so the final output is not a mere response but a nuanced, context-rich performance.

Going beyond surface-level query-response matching, Cross-Encoder Re-Ranking weighs the deeper meanings and subtleties of a query, akin to a director understanding the subtext of a script. This makes it more likely that the LLM's responses fit the user's underlying needs in content, tone, relevance, and depth. The result is an AI response that goes beyond mere accuracy, offering a thoughtfully curated and context-aware engagement, similar to a well-executed scene that captures the essence of a movie's narrative.
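
The re-ranking step itself is simple once a pair scorer is available. The sketch below shows only that step: `overlap_score` is a toy word-overlap function standing in for a real cross-encoder model, and both it and `rerank` are hypothetical names introduced here for illustration.

```python
from typing import Callable


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: share of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float] = overlap_score,
    top_k: int = 3,
) -> list[str]:
    """Score every (query, passage) pair and return the top_k passages."""
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return [passage for _, passage in scored[:top_k]]


if __name__ == "__main__":
    first_take = [
        "the weather is nice",
        "how to bake sourdough bread at home",
        "bread prices rose",
    ]
    print(rerank("bake sourdough bread", first_take, top_k=2))
```

In a real pipeline, `score_pair` would call a trained cross-encoder on each query-passage pair; the surrounding logic would stay the same.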

Embedding Adapters: Tailoring the Script to the Actor in RAG

In Retrieval Augmented Generation (RAG), embedding adapters function similarly to script tailors in a film, ensuring that the information tailored for Large Language Model (LLM) actors is not just relevant, but also intricately customized to their unique processing abilities and nuances. Just as a script tailor in a movie adapts dialogue and actions to an actor's individual talents for a natural and impactful performance, embedding adapters adjust the AI script, or embeddings, to align with each specific AI model's strengths. This process is crucial, as each LLM actor, much like a distinctively styled actor, has unique 'weights' and 'biases' developed through training and experience.

The role of embedding adapters extends to modifying the embeddings to improve compatibility with the AI model's architecture and style. This can involve adjusting the dimensions and formats of embeddings, or adding contextual information, so that the LLM actors process the data more effectively. The resulting performances are not only accurate but also nuanced and engaging. In essence, embedding adapters in RAG serve as a bridge, transforming generic data into a screenplay that highlights each LLM's potential. The aim is a model-specific response that reflects the user's intent and the inquiry's context, akin to a meticulously crafted film scene that resonates with depth and precision.
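
One concrete form an embedding adapter can take, offered here as an assumption rather than as the only design, is a small matrix applied to the query embedding before similarity search. In practice the matrix weights would be learned from relevance feedback; in this sketch they are hand-picked, and `apply_adapter` and `cosine` are hypothetical helper names.

```python
import math


def apply_adapter(adapter: list[list[float]], embedding: list[float]) -> list[float]:
    """Multiply the adapter matrix by the embedding vector."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in adapter]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    query = [1.0, 1.0]
    relevant_doc = [1.0, 0.0]
    # Hand-picked weights (an assumption): damp the second dimension,
    # which we pretend feedback has shown to be unhelpful.
    adapter = [[1.0, 0.0], [0.0, 0.1]]

    adapted_query = apply_adapter(adapter, query)
    print("before:", cosine(query, relevant_doc))
    print("after: ", cosine(adapted_query, relevant_doc))
```

Because only the query vector is transformed, the stored document embeddings do not need to be recomputed when the adapter changes.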


RAG remains the predominant force driving value in generative AI, and this trend is expected to continue for the foreseeable future. The quality of a generative AI application depends heavily on the quality of the data it is given at query time, which makes it well worth optimizing RAG with every available tip and trick.

P.S. This newsletter draws inspiration from the short course "Advanced Retrieval for AI with Chroma", part of a series of courses facilitated by Dr. Andrew Ng.