The RAG Dilemma: Scale vs. Intelligence
Why you can't have both massive document sets and deep reasoning in current systems
Retrieval-Augmented Generation (RAG) has exploded over the past year or two as a way to supplement LLMs with knowledge of a particular document set, or of the world in general. I've personally worked with most flavors of RAG quite extensively, and there are fundamental limitations in the two core approaches (long-context and embeddings) on which almost all flavors of RAG are built. I'm planning to write a longer, more comprehensive piece on this, but I wanted to put out a shorter, condensed version first to get some feedback and see whether there are perspectives I might be missing.
Long-context models (e.g. Gemini), designed to process extensive amounts of text within a single context window, face a critical bottleneck: training-data scarcity. As context lengths increase, the availability of high-quality training data diminishes rapidly. This matters because of the neural scaling laws, which have been remarkably robust for LLMs so far. Here is a great video explaining those:
One important implication is that if you run out of human-generated training data, the reasoning capabilities of your model are bottlenecked no matter how many other resources or tricks you throw at the problem. This paper provides some nice empirical support for the idea: across all of the "long-context" models tested, reasoning capabilities decrease dramatically as context length increases.

Embeddings-based RAG has much better scalability but suffers from some pretty serious issues on high-level reasoning tasks. The authors of this paper not only catalogue many of those issues but also offer a nice statement of the core reason near the beginning:
1) Reasoning Failures: LLMs struggle to accurately interpret user queries and leverage contextual information, resulting in a misalignment between retrieved knowledge and query intent.
2) Structural Limitations: These failures primarily arise from insufficient attention to the structure of knowledge sources, such as knowledge graphs, and the use of inappropriate evaluation metrics.
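To make the first failure mode concrete, here is a minimal sketch of the embedding-retrieval loop. The corpus, query, and bag-of-words "embedder" are all hypothetical stand-ins (a real system would use a neural encoder and a vector index), but the mechanics are the same: ranking is driven by surface similarity, not by the query's intent.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'. A real system would use a neural
    encoder, but retrieval still reduces to vector similarity."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / ((norm(a) * norm(b)) or 1.0)

# Hypothetical corpus: each chunk is embedded independently,
# with no knowledge of the others.
chunks = [
    "Section 2 defines the damages cap for breach of contract.",
    "Section 7 voids the damages cap when fraud is proven.",
    "The appendix lists filing deadlines for each jurisdiction.",
]
index = [(c, embed(c)) for c in chunks]

query = "Is the damages cap always enforceable?"
ranked = sorted(index, key=lambda kv: cosine(embed(query), kv[1]),
                reverse=True)
top = ranked[0][0]
# The top hit is whichever chunk shares the most vocabulary with the
# query; nothing in the pipeline "knows" that Sections 2 and 7 must be
# read together to answer correctly.
print(top)
```

Answering the query well requires combining both damages-cap chunks, but the pipeline only ever returns the nearest neighbors and leaves the synthesis to whatever the LLM can do with a handful of disconnected snippets.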
This structural limitation is particularly problematic for documents that require deep understanding and contextual interpretation, such as textbooks or legal documents. Often there is not only an important internal structure to each document, but also an important meta-structure across documents (think of scientific papers that cite specific portions of other scientific papers). There are tricks, like using knowledge graphs, that try to get around some of these issues, but they can only do so much when the fundamental method shreds any structure the documents might have had before any of the secondary steps even begin.
The scalability limitations of long-context and the reasoning limitations of embeddings lead to an important trade-off for anyone building a RAG system. Long-context models excel at creativity and complex reasoning but are limited to small document sets due to training-data constraints. Conversely, embeddings-based approaches can handle vast corpora but function more like enhanced search engines with minimal reasoning ability. For many tasks this trade-off is fine, because the task already fits cleanly on one side or the other. Many other tasks, however, are simply not achievable with SoTA RAG methods, because they require both large document sets and advanced reasoning over those documents. Solving this trade-off and creating a solution that can handle complex reasoning AND large numbers of documents is going to be the next big step in RAG. It is a problem I have personally made some progress on, and I would love to see more folks take a crack at it.