LLMs and RAG are great. What’s Next?

Apr 18, 2024

The story of AI-powered search thus far has been primarily about two shifts. The first is the replacement of sparse bag-of-words representations of queries and content with dense embedding-based representations. The second is the use of large language models (LLMs) in retrieval-augmented generation (RAG) to abstract or distill information from multiple retrieved results and to let searchers interact more conversationally.

We are still in the early days of AI-powered search. Lots of search applications still use traditional inverted indexes and learning-to-rank (LTR) models that are more modern but hardly bleeding-edge. Many search applications still rely on hand-tuned models and BM25! Still, there is no question that AI is slowly but surely taking over much of the search stack.
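For readers who have not looked under the hood of BM25, here is a minimal sketch of the scoring formula in plain Python. It assumes whitespace tokenization, the common defaults k1 = 1.2 and b = 0.75, and a smoothed IDF; the function name and the three toy documents are mine, chosen only for illustration, and a production engine would add an inverted index, proper analysis, and much more.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # a term absent from the document contributes nothing
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents, invented for illustration.
docs = [
    "used car for sale",
    "automobile dealership near me",
    "cat food for sale",
]
scores = bm25_scores("car sale", docs)
```

Note that the second document scores zero for the query "car sale" even though it is about automobiles. That vocabulary mismatch is exactly what dense embedding-based representations aim to address.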

Where will AI-powered search go from here? Will we continue to see meaningful incremental improvements as LLMs get bigger and RAG architectures get better, or will we soon hit a point of diminishing returns and stagnation? If we are converging on a local maximum, what kind of breakthrough would help us move beyond it? This post explores these questions and speculates on the possible future of AI-powered search.

LLMs vs. RAG

Why do we need RAG when we have LLMs? Are LLMs not sufficient for all of our information-seeking needs? If not, do we just need to build bigger and better LLMs? Or is there an inherent need for RAG?

Here are a few examples that I like to use to answer these questions.

“Is a house bigger than a cat?”

This is a question we should expect any LLM worthy of the name to be able to answer correctly. Indeed, here is a good — if somewhat verbose — answer from ChatGPT:

Yes, generally speaking, a house is much larger than a cat. A house typically refers to a building where people live, which can range in size from small cottages to large mansions. On the other hand, a cat is a small domesticated mammal, typically much smaller than even the smallest houses. So, in terms of size, a house is definitely bigger than a cat.

“When did Daniel Kahneman die?”

Sadly, Daniel Kahneman died on March 27, 2024. ChatGPT, however, was trained before this date and thus cannot answer the question correctly:

As of my last update in January 2022, Daniel Kahneman, the Nobel Prize-winning psychologist and author, was alive. However, to confirm whether there have been any updates or changes since then, I recommend checking the latest news sources or online databases for the most current information on his status.

“What is the market cap of Tesla?”

As of my last update in January 2022, Tesla’s market capitalization fluctuated depending on the stock market’s performance and Tesla’s stock price. However, I don’t have real-time data, so I can’t provide the current market capitalization of Tesla. You can find this information by checking financial news websites or stock market platforms. Tesla’s market capitalization is calculated by multiplying its current stock price by the total number of outstanding shares.

These last two questions may be difficult for an LLM, but they are trivial for human searchers. We would know to look at Daniel Kahneman’s Wikipedia entry or to look up Tesla on any finance site. Better yet, we would expect a general-purpose search engine like Google or Bing to return relevant, current results, which makes these queries perfect candidates for RAG.

All three of these queries involve public information. So perhaps we just need LLMs to be bigger and more current. Pursuing the tail of public content, however, offers diminishing returns at the expense of enlarging the model. Moreover, the current methods for training LLMs do not lend themselves to real-time or near-real-time updates. So, even if LLMs could answer questions like these in theory, they are unlikely to do so in practice.

A more significant barrier is that we do not expect LLMs — or at least publicly available LLMs — to be aware of non-public information, such as the proprietary content within a company’s firewall or our private personal files. We could build bespoke models for these purposes, but doing so would be expensive and inefficient.

RAG, however, can work quite well for these kinds of queries by using retrieval to obtain relevant documents (or chunks) from any content source, whether public or private, archival or current, and then feeding the results to a general-purpose generative AI system to distill a response.
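The retrieve-then-generate flow described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a reference implementation: the retriever ranks chunks by simple word overlap rather than embeddings, the three-chunk corpus is invented, and `generate` is a placeholder for whatever LLM call an application would actually make (here an echo function stands in so the example runs without any external service).

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: keep up to k chunks that share the most words with the query."""
    query_terms = tokenize(query)
    scored = [(len(query_terms & tokenize(chunk)), chunk) for chunk in corpus]
    matches = [pair for pair in scored if pair[0] > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in matches[:k]]


def build_prompt(query: str, chunks: List[str]) -> str:
    """Place the retrieved chunks in front of the question."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


def answer(query: str, corpus: List[str], generate: Callable[[str], str], k: int = 2) -> str:
    """Retrieve relevant chunks, then hand the assembled prompt to the generator."""
    return generate(build_prompt(query, retrieve(query, corpus, k)))


# Invented corpus; it could just as well hold private or freshly updated content.
corpus = [
    "Daniel Kahneman died on March 27, 2024.",
    "A house is much larger than a cat.",
    "Market capitalization is share price times shares outstanding.",
]


def echo_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: returns the prompt it was given."""
    return prompt


response = answer("When did Daniel Kahneman die?", corpus, echo_model)
```

The point of the sketch is the division of labor: the corpus can change at any time without retraining anything, and the generator only needs to read what retrieval hands it.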

Moreover, RAG smooths out small differences between LLMs, at least when it comes to knowledge. It is certainly nice for an LLM to already possess a piece of knowledge, but almost as nice for the LLM to know how to find that knowledge through retrieval. Hence, building bigger LLMs delivers very little incremental value when the new knowledge added to them is easy to retrieve using RAG.

The Limits of RAG

So, is RAG all we need? If we can assume that all of the knowledge we ever need to access is compactly represented within documents that can be retrieved through a readily available search engine, then the answer is yes. But that assumption is a highly restrictive one.

Consider someone who wants to make a dinner reservation at a restaurant near a particular location and has some constraints (e.g., the restaurant must serve vegetarian food) and preferences (e.g., the restaurant should have good reviews). A person would know to pursue such a task by searching on a platform like Yelp or OpenTable, possibly in combination with a mapping tool like Google Maps to establish proximity. That person might even search the web for restaurant menus to confirm that they include vegetarian options. To complicate matters, not all restaurants accept reservations, and those that do may each have their own way of searching for available time slots.

This is a relatively straightforward, if tedious, task for a person, but it is not so straightforward for RAG, particularly for an application not solely intended to support restaurant reservations. After all, it is not enough to retrieve relevant results with information about nearby restaurants. Finding out whether a restaurant accepts reservations and has availability at dinner time is probably beyond what an LLM can do by distilling retrieved results, especially if searching for available time slots requires a separate request, such as filling out a form on the site. Also, an LLM on its own cannot compute the driving time between locations the way Google Maps can.
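To see why this task calls for more than distilling documents, consider a rough sketch of what it looks like once each capability is available as structured data and code. Everything here is hypothetical: the restaurant records, the field names, and the 5 km radius are invented, and straight-line (haversine) distance stands in for the driving time that a real mapping service would supply. Constraints act as hard filters, while preferences only affect the ordering.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Restaurant:
    name: str
    location: Tuple[float, float]  # (latitude, longitude) in degrees
    vegetarian: bool
    rating: float
    open_slots: List[str] = field(default_factory=list)  # empty if no reservations


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance in kilometers (a crude proxy for driving time)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def candidates(restaurants: List[Restaurant], origin: Tuple[float, float],
               slot: str, max_km: float = 5.0) -> List[Restaurant]:
    """Apply the constraints as filters, then order by the preference (rating)."""
    eligible = [
        r for r in restaurants
        if r.vegetarian
        and slot in r.open_slots
        and haversine_km(origin, r.location) <= max_km
    ]
    return sorted(eligible, key=lambda r: r.rating, reverse=True)


# Invented data for illustration only.
restaurants = [
    Restaurant("Green Fork", (40.01, -74.0), True, 4.2, ["19:00"]),
    Restaurant("Steak Hall", (40.0, -74.0), False, 4.8, ["19:00"]),
    Restaurant("Far Garden", (41.0, -74.0), True, 4.9, ["19:00"]),
    Restaurant("Leaf & Root", (40.0, -74.01), True, 4.6, ["19:00", "20:00"]),
    Restaurant("Full House", (40.0, -74.005), True, 4.7, []),
]
names = [r.name for r in candidates(restaurants, (40.0, -74.0), "19:00")]
```

The logic itself is trivial. The hard part is that none of these fields (availability, menu contents, distance) arrives as a tidy record in retrieved text; each one requires a separate lookup or computation.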

Function Calling and Tools

To understand the limitations of RAG, it is helpful to reflect on how RAG overcomes the limitations of standalone LLMs. RAG picks up where the LLM’s knowledge leaves off, by resorting to retrieval as a tool to obtain additional or more current knowledge.

However, retrieval from a document collection is only one possible tool. What if LLMs could use a collection of tools, just as we humans do? Tool use is a defining trait of our species. If we want AI systems that match or exceed human-level intelligence, we will need them to be adept tool users.

Fortunately, AI is already moving in this direction. OpenAI provides a function-calling capability that allows you to describe functions and have the model intelligently choose when and how to invoke them. The open-source LangChain framework includes a tools module that abstracts tool use as an interface and supports a variety of LLMs.
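The general pattern behind function calling can be sketched without any particular vendor or framework. The code below is not OpenAI's or LangChain's API; the registry, decorator, and JSON shape are assumptions of mine that mimic the idea: the application describes its functions, the model replies with the name of a function and its arguments, and the application executes the call. The model's reply is hard-coded here so the example runs on its own, and the numbers are made up.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to their descriptions and code.
TOOLS: Dict[str, Dict[str, Any]] = {}


def tool(description: str, parameters: Dict[str, str]) -> Callable:
    """Decorator that registers a function along with a description for the model."""
    def register(fn: Callable) -> Callable:
        TOOLS[fn.__name__] = {
            "description": description,
            "parameters": parameters,
            "fn": fn,
        }
        return fn
    return register


@tool(
    "Compute market capitalization from share price and shares outstanding.",
    {"share_price": "number", "shares_outstanding": "number"},
)
def market_cap(share_price: float, shares_outstanding: float) -> float:
    return share_price * shares_outstanding


def tool_descriptions() -> str:
    """Serialize what the model would be shown: names, descriptions, parameters."""
    return json.dumps(
        [
            {"name": name, "description": spec["description"], "parameters": spec["parameters"]}
            for name, spec in TOOLS.items()
        ],
        indent=2,
    )


def dispatch(model_output: str) -> Any:
    """Parse the model's proposed call and run the matching registered function."""
    call = json.loads(model_output)
    spec = TOOLS.get(call["name"])
    if spec is None:
        raise ValueError(f"Unknown tool: {call['name']}")
    return spec["fn"](**call["arguments"])


# Simulated model reply with illustrative numbers (not real market data).
simulated_call = (
    '{"name": "market_cap", '
    '"arguments": {"share_price": 10.0, "shares_outstanding": 1000}}'
)
result = dispatch(simulated_call)
```

In a real application, `tool_descriptions()` would be sent to the model along with the user's question, the arguments would come from a live data source rather than a constant, and the result of `dispatch` would be fed back to the model to compose the final answer.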

These are early days. OpenAI released its function-calling capability only in June 2023, while LangChain added structured tool support in May 2023. Most AI application developers today are still focused on LLMs and RAG!

Imagining the Future of AI-Powered Search

As physicist Niels Bohr and baseball player Yogi Berra are alleged to have said, “It is difficult to make predictions, especially about the future.” Nonetheless, we can explore the possible future of AI-powered search.

LLMs will continue to improve. OpenAI, Google, Anthropic, and others will compete with their closed-source models, while Mistral AI, Meta (through its Llama family of models), and others compete on the open-source front.

But the improvements seem likely to slow down. LLM developers have become so concerned about running out of training data that they are turning to ideas like using synthetic data (at the risk of AI inbreeding!), scraping YouTube videos, or even buying book publishers.

While there may be more knowledge to be gleaned from these sources, it seems that the pursuit of even the slightest competitive advantage is driving investment toward rapidly diminishing returns. We should see the practical differentiation among LLMs decrease, leading to commoditization. That would be good news for application developers and for cloud computing platforms like Amazon Web Services, especially since Amazon does not seem to be a contender for the LLM crown.

Meanwhile, there is a lot of room for improvement in the way that LLMs interface with function calling and tools. To achieve an AI breakthrough — let alone anything that passes for artificial general intelligence (AGI) — I believe that we will have to dramatically improve the ability of LLM-based applications to leverage external knowledge and capabilities. We will have to evolve from thinking of AI in terms of LLMs and RAG to envisioning truly capable software agents.

Prediction is Hard

When will all of this happen? I wish I knew. I cannot even say for sure that the developments I predict will be the major ones in the near term. I hardly imagined in late 2022 that we were on the cusp of a generative AI revolution.

Still, as a large language model not trained by OpenAI, I can at least try to generate predictions. I do believe that we will see LLMs focus less on size and more on tool use in the next few years. I certainly hope so.