Primer: Agentic RAG (Part 1)
Swapping single LLM calls for LLM-backed “agents” leads to materially improved outcomes on retrieval-augmented generation (“RAG”) tasks for legal document analysis. While additional costs are incurred, the approach is already economic for a wide range of B2B use cases, and we expect token costs to continue trending downward. The substitution itself is relatively uncomplicated from an engineering standpoint, and it eases the prompt engineering burden on domain experts - especially in smaller teams. While there is an impact on inference latency (which needs to be managed in the user experience), the improved outcomes are more than worth it for high-value, high-risk use cases such as legal analysis.
This post - Part 1 - aims to help readers situate agentic RAG in their product stack by considering technological constraints, trade-offs and the likely direction of developments going forward. Part 2, to follow, will compare and contrast two illustrative implementations of agentic behaviour in the context of legal document analysis using the `smolagents` and `DSpy` libraries.
Context and Technological Constraints
LLM “agent” frameworks built around large parameter-count foundation model inference over API received a lot of attention during 2023-2024. As with LLMs more generally, users were happy to trade the customisability and privacy of open-source models and in-house inference stacks for the vastly more capable larger models and the freedom from having to run their own inference on scarce hardware. Consumer GPU supply was still constrained by pandemic-era knock-on effects, and most new production capacity was (and is) going to data centre cards. GPT-3.5 over API was often radically cheaper than running any of Meta’s early Llama models in-house, for instance, as well as being more performant. That cost equation remains true today, although smaller open-source models in the hands of knowledgeable practitioners are now genuinely useful for a variety of use cases.
Other trade-offs were also accepted - the lack of context-free grammar support, control over logit bias, and the entire post-training and RLHF process, for instance - because the performance gap was so compelling. Function calling support, in particular, was crucial to bridging the gap. While reliability was initially an issue, those concerns have now largely been put to rest even for lower parameter-count LLMs.
The effect of these conditions on efforts to spin up useful LLM-backed agents has only recently become clear, and may provide an interesting signal as to where the current big model and inference providers - such as OpenAI - see value being created and captured.
Why “Agency” Matters
Attempts to define what constitutes an “agent” under the new regime are famously fraught.1 The blue-sky prospects are intriguing, but when it comes to using LLMs in legal document analysis, arguably the most compelling application at present is improving RAG outcomes.
We’re working on a separate post on the merits of RAG systems versus alternative approaches and on preparing your legal documents for use with such systems. For the purposes of this post, suffice to say that the inherent ability of agents to “reflect” and course-correct during the analytical process is an excellent mitigant to the inherent limitations of even current state-of-the-art approaches to RAG.
Agentic RAG is robust across:
- a wide variety of applications;
- compared to single LLM calls, a wider range of approaches to document preparation; and
- compared to single LLM calls, a wider range of variation in input prompt quality,
potentially allowing teams to amortise engineering and implementation costs across a range of product features.
The principal downsides are:
- increased latency; and
- increased inference costs,
although we expect technological advances will continue to improve both of these.
Given how easily an agentic approach can be substituted into existing RAG systems, we expect agentic RAG to largely replace current approaches based on carefully crafted system prompts and single LLM calls in quality-sensitive domains.
Comparison with Single-Call Approaches
All of which papers over perhaps the biggest technical question: how does it work?
A good starting point is a comparison with canonical “single LLM call” approaches to analytical queries in the legal domain. A user wants to know something about a document that requires more than mere search: they need some kind of synthesis, and analysis, of the contents of the document or documents in order to satisfy an information need.
A domain specialist - a lawyer or legal engineer, perhaps - will have prepared the model in advance by carefully crafting a system prompt defining a common (as in: used for a variety of applications) specification to help guide the model’s spend of reasoning tokens: “Do this, don’t do that; focus on this, ignore that; contracts provided to you will be governed by English law”. More sophisticated approaches may craft this “framing” prompt dynamically, from information extracted from the document in advance of the user sitting down to interact with the system: “The parties are A and B, incorporated in the United Kingdom; the subject matter is a lease of commercial property located in Manchester, England; disputes shall be resolved through arbitration”. This lightens the load on the end user, who can simply sit down and say: “Summarise the payment terms with a particular focus on the consequences of a failure to pay on time”.
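To make the dynamic approach concrete, here is a minimal sketch of framing-prompt construction. The metadata fields and the template are hypothetical placeholders; assume some upstream extraction pass has already pulled these values out of the document.

```python
# A minimal sketch of dynamic "framing" prompt construction. The metadata
# fields and extraction step are hypothetical; assume an upstream pass has
# already pulled these values out of the document.

FRAMING_TEMPLATE = """You are assisting with the analysis of a legal document.
The parties are {parties}. The document is a {doc_type} governed by
{governing_law}. Disputes are resolved through {dispute_resolution}.
Base your answers only on the document provided."""

def build_framing_prompt(metadata: dict) -> str:
    """Fill the framing template from pre-extracted document metadata."""
    return FRAMING_TEMPLATE.format(
        parties=" and ".join(metadata["parties"]),
        doc_type=metadata["doc_type"],
        governing_law=metadata["governing_law"],
        dispute_resolution=metadata["dispute_resolution"],
    )

# Mirrors the example in the text: the user can now simply ask their question.
system_prompt = build_framing_prompt({
    "parties": ["A", "B"],
    "doc_type": "lease of commercial property located in Manchester, England",
    "governing_law": "English law",
    "dispute_resolution": "arbitration",
})
user_prompt = ("Summarise the payment terms with a particular focus on the "
               "consequences of a failure to pay on time.")
```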
The process may not literally be single-call: many systems are likely to perform some degree of error correction and/or post-processing on the result sent back by the LLM. The more cycles and tokens you spend on error correction and post-processing, the more a single-call approach starts to resemble a “loop” in an “untrained” agentic system - particularly if control over the determination of when a response is ready to be seen by the user is largely handed over to the model. So in reality, when it comes to untrained agent systems, there’s a spectrum, with agentic systems handing near-complete control over decision-making to a “managing” agent. “Trained” agent systems are altogether different - something we’ll explore towards the end of the article.
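A rough sketch of what that spectrum looks like in code: a single-call pipeline with a model-driven review loop bolted on. `call_llm` here is a hypothetical stand-in for whatever chat-completion client you use; the loop structure, not the client, is the point.

```python
# Sketch of the single-call-to-agent spectrum: the more review cycles like
# this you add, and the more the model itself decides when the answer is
# ready, the closer the pipeline gets to an untrained agent loop.

def call_llm(system: str, user: str) -> str:
    raise NotImplementedError  # hypothetical stub: your inference client here

def answer_with_review(system_prompt: str, question: str,
                       max_cycles: int = 3) -> str:
    draft = call_llm(system_prompt, question)
    for _ in range(max_cycles):
        # Hand the "is this ready for the user?" decision to the model.
        verdict = call_llm(
            "You are a strict reviewer of legal analysis. Reply APPROVED if "
            "the draft fully answers the question; otherwise list the "
            "defects to fix.",
            f"Question: {question}\n\nDraft: {draft}",
        )
        if verdict.strip().startswith("APPROVED"):
            break
        draft = call_llm(
            system_prompt,
            f"{question}\n\nRevise the draft to address the reviewer's "
            f"notes.\nDraft: {draft}\nNotes: {verdict}",
        )
    return draft
```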
Careful crafting of the system prompt remains an important step. Teams can and should be creating context- and/or document-specific framing prompts dynamically from information pre-extracted from the document(s). But empowering the agent to more aggressively decompose the problem and make active judgements about quality - as we will see below - results in materially improved outcomes and consistency.
We’ll consider a concrete implementation of an untrained agent below, but first, it’s worth looking at developments in the LLM agent space in the first few months of 2025, which have thrown into sharp relief an emerging two-track system for LLM agents and a concerted attempt by model and inference providers - OpenAI in particular - to try to re-capture some of the value lost to the preponderance of high-quality open-source models.
Agency: Untrained or Trained?
Limited foundation model customisability has led most popular open-source frameworks to model their agentic approaches as directed cyclic or acyclic graphs - an approach common in robotic process automation (RPA) and workflow systems. This is effectively the “untrained agent” end of the spectrum discussed above.
Different frameworks abstract different aspects of the process of building the graph, so to speak, but all of them rely on establishing and exploiting a manager/managed relationship between multiple unrelated LLM call contexts. Though these approaches have proven fruitful, they create agency through artifice: the models themselves have no particular internal specialisation in performing agentic behaviours, so we create the desired effect with duelling tokens. Encapsulating relevant context analysis behind the “managed” LLM’s context window, and the output approval process behind that of the “manager” LLM, is part of the point, but it is not without trade-offs. The extent of those trade-offs depends in part on the concrete implementation and use case, and can be hard to measure empirically. But until this year, the point was moot: there was no other way for end-users of the most powerful LLMs provided over inference API to approach the task.
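In skeleton form, the manager/managed pattern might look something like the following - again with a hypothetical `call_llm` stub as in the previous sketch, and with the protocol strings (`SUBTASK:`, `FINAL:`) chosen purely for illustration.

```python
# Sketch of the manager/managed pattern: two unrelated call contexts, with
# the manager decomposing the task and judging when the result is ready.

def call_llm(system: str, user: str) -> str:
    raise NotImplementedError  # hypothetical stub: your inference client here

MANAGER_SYSTEM = ("You coordinate a legal-analysis assistant. Given a task "
                  "and results so far, either emit 'SUBTASK: <instruction>' "
                  "for the worker, or 'FINAL: <answer>' when done.")
WORKER_SYSTEM = ("You perform one narrowly scoped legal-analysis subtask "
                 "over the document excerpt you are given.")

def run_manager(task: str, document: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        decision = call_llm(MANAGER_SYSTEM, transcript)
        if decision.startswith("FINAL:"):
            return decision[len("FINAL:"):].strip()
        # Each worker call is a fresh context: only the subtask and the
        # document cross the boundary, not the manager's deliberations.
        result = call_llm(WORKER_SYSTEM, f"{decision}\n\nDocument:\n{document}")
        transcript += f"\n{decision}\nResult: {result}"
    return call_llm(MANAGER_SYSTEM, transcript + "\nGive FINAL now.")
```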
On 2 February 2025, OpenAI introduced the “Deep Research” mode in their B2C-facing ChatGPT apps. The rollout to Pro and Plus customers only finished recently, on 25 February. With the caveat that we cannot know for certain how OpenAI is doing what they do - since Deep Research was developed, like all of OpenAI’s products, behind closed doors - informed consensus is that the approach likely involves post-training the LLM with a reward function to select for, and enhance, desirable agentic behaviours such as reflection and error correction as part of the token generation process. Unlike untrained agents, which produce one “cycle” of thought and then are asked by a separate LLM to assess and reconsider their output, the process in trained agents occurs internally: from the user perspective, a question is asked and a high-quality response returned - albeit with higher latency than in single-call scenarios. Decisions on how to improve a piece of analysis, and when it is good enough to return to the user, are all handled by the same LLM, in the same context window.
It is early days for Deep Research. Competitors such as Google and Perplexity have released product features with similar names and behaviour that appears superficially similar to OpenAI’s offering. While it is too early for empirical confirmation, subjectively, the output produced by a trained agent - or, at least, OpenAI’s Deep Research - is head and shoulders above anything currently possible with untrained agents. OpenAI has not indicated whether a “Deep Research”-like feature will be made available through its API offering. All current signs seem to indicate that features like this represent OpenAI’s attempt to move up the value stack and claw back some of the value lost to competitors and the recent spate of incredibly capable open-source models. As such, we would not be surprised if Deep Research continues to be absent from the API for the foreseeable future - at least until another provider commoditises a similarly compelling feature. We think it unlikely that features like this will form the basis of a sustainable moat for OpenAI. The story of the past two years has been that well-capitalised pathfinders like OpenAI are quickly swamped by fast-followers such as Meta and DeepSeek. But, at least for now, those of us implementing agentic behaviours into our products - including for RAG - must make do with existing untrained approaches.
You can find interesting in-depth discussion of the “trained vs untrained” dichotomy, and what it may mean for value creation and capture going forward, here and here.
Untrained Agent Implementations
In our next post, we propose to walk through two contrasting implementations of agentic RAG in some detail, in the hope that readers may find the concrete examples useful in implementing their own agentic RAG pipelines.
There are far too many open-source frameworks enabling agentic behaviour (for some definition of “agentic”) to cover in detail in one post. In any event, the similarity of approaches - a product of them all being confined to “untrained” approaches - would render such an exercise of dubious value.
Instead, we intend to take a high-level look at just two: Hugging Face’s `smolagents`, and Stanford NLP’s `DSpy`.
`smolagents`2 is notable for three reasons. First, Hugging Face is a giant in the open-source machine learning space, thanks in large part to its widely used `transformers`3 library and its community model and dataset sharing platform, Hugging Face Hub. Second, `smolagents` itself is committed to being a “low-abstraction” library - unlike, say, `LangChain`4. This makes it a good fit for prototyping, easy to integrate and - most importantly for our purposes - the code is easy to read and easy to reason about. Third, `smolagents` takes a somewhat novel approach to tool use that readers may not have seen before5: instead of emitting JSON tool calls, the agent writes its actions as Python code. In addition, Hugging Face recently published a blog post covering their own toy untrained re-implementation of OpenAI’s Deep Research feature in the `smolagents` library.6
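By way of illustration, a minimal `smolagents` retrieval agent might look like the sketch below. The retrieval tool body is a placeholder for your own retriever, and the wiring follows the library’s documented pattern at the time of writing; check the current docs before relying on it.

```python
from smolagents import CodeAgent, HfApiModel, tool

@tool
def search_contract(query: str) -> str:
    """Retrieve the passages of the contract most relevant to a query.

    Args:
        query: A natural-language description of the clauses sought.
    """
    # Placeholder: plug in your own retriever (vector store, BM25, etc.).
    return "Clause 4.2: Rent is payable monthly in advance..."

# The agent writes its intermediate actions as Python code calling the tools
# above, rather than emitting JSON tool-call objects.
agent = CodeAgent(tools=[search_contract], model=HfApiModel())
agent.run("Summarise the payment terms, with a particular focus on the "
          "consequences of a failure to pay on time.")
```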
Stanford NLP’s `DSpy`7 is interesting for different reasons, with the chosen approach contrasting nicely with `smolagents`. `DSpy` was launched in late 2023 with an arXiv paper8 that made waves. `DSpy`’s main value proposition has always been its focus on automated pipeline and prompt optimisation. That approach is proving particularly impactful when setting up untrained agents - including for RAG: since the links between “manager” and “managed” LLMs in an agentic arrangement are, essentially, prompts, the brittle nature of prompts produced using traditional prompt engineering techniques has weighed heavily on performance.
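To give a flavour, a minimal `DSpy` program for the answer step of a RAG pipeline might look like the following - a sketch based on the library’s documented signature/module pattern, with hypothetical field names and an illustrative model string.

```python
import dspy

# Configure any supported backend; the model name here is illustrative.
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

class ContractQA(dspy.Signature):
    """Answer a question about a contract using only the supplied context."""
    context: str = dspy.InputField(desc="retrieved contract passages")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="analysis grounded in the context")

# ChainOfThought compiles the signature into a prompt with a reasoning step.
qa = dspy.ChainOfThought(ContractQA)
prediction = qa(
    context="Clause 4.2: Rent is payable monthly in advance...",
    question="What are the consequences of a failure to pay rent on time?",
)
print(prediction.answer)
```

Because the signature declares intent rather than hard-coding prompt text, one of `DSpy`’s optimisers (such as `MIPROv2`) can then rewrite and tune the underlying prompts against a small labelled set - a direct response to the brittleness problem described above.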
We look forward to seeing you again for the follow-up.