Yufeng Huang

Full Context Reasoning (FCR): Moving Beyond RAG for Multi-Document Reasoning

TL;DR

  • RAG is great for search; it often fails for cross-document reasoning.
  • Long context helps, but naive “stuff everything in” approaches can degrade performance (“context rot”).
  • FCR is a reasoning runtime that constructs a usable full-context environment, then verifies grounding inline as reasoning progresses.
  • Scope: closed-corpus, document-grounded multi-document reasoning (e.g., contracts, diligence packets, compliance evidence).
  • Benchmark evaluation (25 hard 4-hop questions, ~200K-token corpus per question): simple RAG 16%, hybrid RAG 24%, iterative RAG 48%, LLM (all docs) 68%, FCR 88% — same corpus, same open-source model across all five approaches.

1) Long context is real — and RAG was designed for a world without it

In our first post, we introduced the Superlinear model: a subquadratic attention architecture that achieves 109 tokens/sec at 1M tokens of context and 76 tokens/sec at 10M tokens, on a single B200 GPU. The core technical result is that you can hold a very large document corpus in a single model pass without the quadratic cost of dense attention.

That raises the obvious next question: now that you can hold 10M tokens in context, what do you actually do with it?

The naive answer is to just stuff everything in and let the model figure it out. That helps — but it doesn't solve the problem. Naive full-context approaches have real limitations — position effects, noise accumulation, sensitivity to ordering — and more importantly, they give you no verification that the model used the evidence correctly. That's part of what FCR addresses.

The deeper issue is architectural. RAG was built as a workaround for limited context windows: since models couldn’t hold much, you retrieved the most relevant fragments and worked from those. For lookup tasks, that works well. But for tasks that require reasoning across multiple documents — where the answer depends on connecting facts that live in different places — retrieval is a structural bottleneck, not just a performance tradeoff.

The target workload we care about:

“Given this folder of documents, produce a coherent, evidence-backed answer that depends on relationships across the whole set.”

Examples include cross-contract obligation synthesis, diligence packet reconciliation, or compliance evidence packs. For these tasks, retrieval doesn’t just underperform — it fails in ways that are hard to detect without careful evaluation.


2) Why RAG breaks for reasoning

RAG retrieves a small subset of documents and lets the model answer based on those fragments. For “find the passage” tasks, this works well. For reasoning, it has structural failure modes that better retrieval can’t fully fix:

FCR vs RAG schematic
Figure 1: FCR (left) operates over the full corpus with all chunks active, while RAG (right) retrieves only a small subset, leaving most evidence unavailable to the model.

  • Fragmented evidence: the answer depends on multiple documents that are unlikely to be retrieved together. When all four hops of a reasoning chain need to land in the retrieved set, the odds compound against you.
  • Lost dependencies: cross-document relationships (precedence, exceptions, timelines, definitions) disappear when documents are split and ranked independently.
  • Evidence gaps leading to abstention: when retrieval misses key passages, models correctly recognise they can't answer — but this looks like failure. In our evaluation, simple RAG abstained on nearly three-quarters of answerable questions.
  • Hallucination under sparse evidence: with too little grounding material, models sometimes fill gaps from training memory rather than admitting uncertainty — producing confident but unverifiable answers.
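The "odds compound against you" point in the first bullet can be made concrete with a little arithmetic. This is an illustrative model only — it assumes each required passage is retrieved independently with the same probability, which real retrievers don't guarantee:

```python
# Illustrative model: if each of the four required passages independently
# lands in the retrieved set with probability p, the chance that ALL of
# them do is p**4. Independence is an assumption for intuition only.

def all_hops_retrieved(p_single: float, hops: int = 4) -> float:
    """Probability that every required passage survives retrieval."""
    return p_single ** hops

for p in (0.5, 0.7, 0.9):
    print(f"per-passage recall {p:.0%} -> all 4 hops retrieved: "
          f"{all_hops_retrieved(p):.1%}")
```

Even 90% per-passage recall leaves roughly a third of 4-hop questions missing at least one required passage; at 50% recall, fewer than 1 in 15 questions get complete evidence.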

Consider a multi-document contract packet. The governing order of precedence might live in one exhibit; a key definition might be overridden by an amendment; a security addendum might carve out obligations that look absolute in the MSA. RAG can retrieve any one of those clauses and still miss the dependency that changes what it means.

This is a structural mismatch, not a retrieval quality problem. The model's "world" is only the fragments it received — and for multi-hop reasoning over a large corpus, those fragments are rarely enough. Even iterative RAG, which issues follow-up queries based on what it found, can't fully escape this: asking better questions only helps if the evidence you need can be expressed as a query. Cross-document dependencies often can't.

The answer to this isn't better retrieval. It's a different architecture.


3) Introducing Full Context Reasoning (FCR)

Full Context Reasoning (FCR) is a reasoning runtime designed for tasks where the correct answer depends on relationships across many documents, not isolated passages.

FCR is not “stuff everything into the context window.” It’s about turning a corpus into a usable reasoning environment. The core difference from a naive full-context approach is that context preparation and grounding verification are built into the process — not left to the model to sort out on its own.

FCR operates in three stages:

Stage 1 — Context construction: FCR structures the corpus into a coherent reasoning context, preserving cross-document relationships and stripping noise. Raw documents aren’t prompt-shaped; this stage makes them so.

Stage 2 — Structured reasoning: The model reasons step-by-step across the full context, tracking intermediate claims explicitly rather than producing a single unguided answer.

Stage 3 — Grounding verification: Each intermediate claim is verified against the source documents as reasoning progresses. Unsupported claims are caught before they can anchor subsequent steps — not discovered only at the final output.

Raw Corpus
   ↓
Context Construction
   ↓
Structured Reasoning (with inline verification)
   ↓
Answer + Evidence
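The three stages above can be sketched as a pipeline. This is a hypothetical skeleton — the class and function names are illustrative, and the substring check stands in for a real grounding/entailment model:

```python
# Hypothetical sketch of the three FCR stages. Names and data shapes are
# illustrative, not the actual implementation.
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

@dataclass
class Claim:
    text: str
    source_doc: Optional[str] = None   # set when grounding succeeds
    grounded: bool = False

def construct_context(raw_docs: Dict[str, str]) -> Dict[str, str]:
    """Stage 1: normalize documents while preserving document identities."""
    return {name: " ".join(text.split()) for name, text in raw_docs.items()}

def ground(claim: Claim, context: Dict[str, str]) -> Claim:
    """Stage 3, run inline: accept a claim only if a source passage supports it.
    Substring matching stands in for a real entailment check."""
    for name, text in context.items():
        if claim.text in text:
            claim.source_doc, claim.grounded = name, True
            break
    return claim

def reason(steps: Iterable[str], context: Dict[str, str]) -> List[Claim]:
    """Stage 2: verify each proposed step before it can anchor the next one."""
    accepted: List[Claim] = []
    for step in steps:                 # steps would come from the model in practice
        claim = ground(Claim(step), context)
        if claim.grounded:
            accepted.append(claim)     # unsupported claims never enter the chain
    return accepted

# Toy run: the third proposed step is unsupported and is dropped inline.
docs = construct_context({
    "msa.txt": "The   MSA caps liability at fees paid.",
    "amendment.txt": "Amendment 2 raises the cap to 2x fees.",
})
chain = reason(["caps liability at fees paid",
                "raises the cap to 2x fees",
                "the cap is unlimited"], docs)
print([(c.text, c.source_doc) for c in chain])
```

The key structural property is in `reason`: an ungrounded claim is rejected before the next step runs, so it can never silently anchor downstream reasoning.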

Scope:

  • In-scope: closed-corpus, document-grounded questions where the correct answer is determined by the provided documents.
  • Out-of-scope (for this post): open-web legal research or tasks where “correctness” depends on selecting external authorities.

4) Experimental setup (what we compared)

4.1) LLM choices

All evaluations use the same open-source generation model across all five approaches. This is deliberate: it isolates architecture (how context is selected, how the reasoning is structured, and whether claims are verified) as the variable, not raw model capability.

This setup lets us focus on a more useful question than "which model is best": does the FCR architecture consistently outperform RAG when both use the same underlying model? If the answer is yes, the result is about system design. We evaluate one model configuration here; whether the gap holds across model tiers is a subject of ongoing work.

Separately, we designed FCR to be deployable across the spectrum — from self-hosted open weights to API-served models — because many real-world RAG deployments run on private infrastructure.

Both RAG and FCR use multiple model components. At minimum, RAG uses a retrieval model (for similarity matching) and a generation model (for producing the final answer). FCR uses a comparable setup — the same generation model for the final answer — plus an additional model for the full-context reasoning pass. That additional model is the only asymmetry in the evaluation setup, and it reflects a genuine architectural requirement, not a configuration advantage.

The key implication: the same base model handles final-stage generation across all five approaches. When FCR outperforms RAG, it's because of what happens before generation — not because a more capable generator is used.

4.2) Benchmarking dataset

We evaluated on the MuSiQue dataset — a benchmark designed specifically for multi-hop Q&A over multiple documents. Each question requires chaining facts across several separate sources, and the corpus includes distractor documents intended to mislead the solver. Crucially, MuSiQue provides complete ground-truth reasoning chains: which documents are required, which intermediate facts need to be extracted, and how they connect to the final answer. This makes it possible to evaluate not just whether a system got the right answer, but whether it got there through legitimate reasoning.

We chose MuSiQue over HotPotQA because HotPotQA is widely used in training data and many models have likely seen it; its questions tend to be easier as a result. Another benchmark worth noting is the AA-LCR benchmark, which tests similar reasoning tasks but at shorter context lengths (~100K tokens, versus ~200K tokens here). However, AA-LCR does not provide ground-truth reasoning chains, which makes it impossible to distinguish a correct answer from a hallucinated one — a distinction that turns out to matter a great deal, as Section 5 shows.

In this evaluation, we held all five approaches to the same standard: a response is only counted correct if the approach produced the right final answer and a reasoning path grounded in the actual source documents. Getting the answer right through fabricated evidence counts as wrong.
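The grading standard can be sketched as a function. This is a simplified sketch of the rule described above, not the actual harness — the field names (`answer`, `steps`, `doc`, `quote`) and the substring grounding check are illustrative:

```python
# Sketch of the grading rule: "correct" requires BOTH the right final answer
# and reasoning grounded in the source documents. Field names and the
# substring check are illustrative placeholders.

def grade(response: dict, gold_answer: str, corpus: dict) -> str:
    answer = response["answer"].strip()
    if "cannot determine" in answer.lower():
        return "gave_up"
    answer_right = answer.lower() == gold_answer.strip().lower()
    grounded = all(
        step["quote"] in corpus.get(step["doc"], "")
        for step in response["steps"]
    )
    if answer_right and grounded:
        return "correct"
    if answer_right:
        return "hidden_hallucination"   # right answer, fabricated evidence
    return "confident_wrong"
```

The four return values correspond to the four outcome categories reported in Section 5: a right answer built on evidence that isn't actually in the corpus is scored as a failure, not a win.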

We also performed substantial data cleaning, for two reasons: (1) the original MuSiQue dataset uses pre-selected paragraphs from Wikipedia, not full articles — we replaced these with the actual source articles so the retrieval step is fully realistic; and (2) MuSiQue was published in 2021, and a meaningful fraction of the factual claims in the original problems are no longer accurate.

Our cleaning process:

  1. Sampled 500 questions from the MuSiQue dev set (chosen over the test set because it includes supporting-fact annotations, and is less likely to be contaminated than the training set).
  2. Filtered out questions where the supporting Wikipedia article no longer exists.
  3. Further filtered questions where the specific supporting sentence or paragraph is no longer present or accurate in the current article.
  4. Selected only 4-hop questions — the hardest tier — since 2-hop and 3-hop questions are considerably easier and less discriminating.
  5. Manually reviewed all remaining questions to confirm they are well-formed, unambiguous, and fully supported by the current source articles.

This process reduced 500 starting candidates to 25 questions — a 20:1 funnel. That ratio reflects the bar we set, not data scarcity: a question only survives if it is factually current, unambiguous, fully 4-hop, and entirely supported by the source articles as they exist today. The 25 that remain are genuinely hard, cleanly defined, and fair to every approach in the comparison.
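The 20:1 funnel above can be sketched as an ordered filter pipeline. The predicate fields are placeholders for the checks described in steps 2–5, not a real data schema:

```python
# The cleaning funnel as composable filters. Predicate fields are
# illustrative stand-ins for the manual and automated checks in steps 2-5.

def funnel(questions, filters):
    """Apply each keep-predicate in order; a question must pass all of them."""
    for keep in filters:
        questions = [q for q in questions if keep(q)]
    return questions

filters = [
    lambda q: q["article_exists"],      # step 2: source article still exists
    lambda q: q["support_current"],     # step 3: supporting text still accurate
    lambda q: q["hops"] == 4,           # step 4: hardest tier only
    lambda q: q["review_ok"],           # step 5: manual review passed
]

sample = [
    {"article_exists": True, "support_current": True,  "hops": 4, "review_ok": True},
    {"article_exists": True, "support_current": False, "hops": 4, "review_ok": True},
    {"article_exists": True, "support_current": True,  "hops": 2, "review_ok": True},
]
survivors = funnel(sample, filters)
print(len(survivors))   # 1
```

Ordering matters for cost, not correctness: the cheap existence checks run first, so the expensive manual review (step 5) only sees questions that already passed everything else.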


5) Results

5.1) Headline numbers

As described in Section 4.2, a response only counts as correct if it produced the right final answer and a reasoning path grounded in the source documents — hallucinated correct answers count as wrong. Every approach below sees the same ~200K-token corpus per question; the only variable is how that corpus is presented to the model.

|  | Simple RAG | Hybrid RAG | Iterative RAG | LLM (all docs) | Full Context Reasoning (FCR) |
| --- | --- | --- | --- | --- | --- |
| How context is selected | Top keyword matches | Keyword + semantic matches | Multiple retrieval rounds | All documents at once | All documents, structured + verified |
| Accuracy (correct answer + grounded reasoning) | 16% | 24% | 48% | 68% | 88% |
| Failure: hidden hallucination (right answer, ungrounded reasoning) | 12% | 4% | 0% | 0% | 0% |
| Failure: confident wrong (wrong answer, didn't abstain) | 0% | 20% | 20% | 28% | 12% |
| Failure: gave up (abstained when answer existed in documents) | 72% | 52% | 32% | 4% | 0% |

Each response falls into exactly one of the four categories, so for every approach the three failure rows sum with Accuracy to 100%.
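As a quick sanity check, the categories can be tabulated and verified to be exhaustive (the numbers are copied from the table above):

```python
# Per-approach outcome rates (percent). Every response falls into exactly
# one category, so each approach's rates must sum to 100.
results = {
    "Simple RAG":     {"accuracy": 16, "hidden_halluc": 12, "confident_wrong": 0,  "gave_up": 72},
    "Hybrid RAG":     {"accuracy": 24, "hidden_halluc": 4,  "confident_wrong": 20, "gave_up": 52},
    "Iterative RAG":  {"accuracy": 48, "hidden_halluc": 0,  "confident_wrong": 20, "gave_up": 32},
    "LLM (all docs)": {"accuracy": 68, "hidden_halluc": 0,  "confident_wrong": 28, "gave_up": 4},
    "FCR":            {"accuracy": 88, "hidden_halluc": 0,  "confident_wrong": 12, "gave_up": 0},
}

for name, rates in results.items():
    assert sum(rates.values()) == 100, name
```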

Four patterns stand out beyond the accuracy line:

Hidden hallucination drops to zero with FCR — and it's not zero for RAG. With simple BM25 retrieval, 12% of questions receive the right final answer through fabricated or misattributed reasoning. The model guessed correctly from its training data, not from the documents you provided. This is the most insidious failure mode: it looks like a win, but the system can't be trusted because the next similar question may be answered by a different guess. FCR's grounding requirement eliminates this entirely.

RAG gives up on questions it should be able to answer. Nearly three-quarters of simple RAG's responses — and 32% of iterative RAG's — are cases where the model simply said "I cannot determine the answer from the provided documents." These aren't genuinely unknowable questions; they're cases where retrieval failed to surface the evidence, so the model had nothing to work from. FCR, which operates over the full corpus, never encounters this failure mode.

More context means more confident errors — until you add verification. This is the counterintuitive result the table reveals. Simple RAG commits to zero wrong answers — not because it reasons well, but because it has so little context it defaults to refusing. Once the model has enough context to feel like it has something to work with, confident-wrong errors appear: 20% for Hybrid RAG, still 20% for Iterative RAG, and rising to 28% for LLM (all docs). More evidence does not make the model more cautious — it makes it more willing to commit. LLM (all docs) is the most confidently wrong system of all. Only FCR's verification step reverses this trend, cutting confident-wrong to 12%.

In high-stakes contexts, confident errors are more damaging than abstentions. An abstention signals that the system doesn't know; it can be re-queried or escalated. A confident wrong answer looks like a result — it gets acted on. The table shows a direct tradeoff: systems that abstain less tend to be wrong more often. FCR is the only system that escapes this tradeoff, achieving both a 0% abstention rate and the lowest confident-wrong rate of any approach.

The progression tells a clear story, and it's worth walking through each step to understand why the numbers move the way they do.

5.2) Why RAG approaches fall short: a retrieval problem, not a model problem

The jump from Simple RAG (16%) → Hybrid RAG (24%) → Iterative RAG (48%) is entirely driven by one thing: getting more of the right evidence in front of the model.

A 4-hop reasoning question requires connecting facts across four separate documents. To answer it correctly, the retrieval step needs to surface all four relevant passages — and the connections between them.

Simple keyword-based retrieval (BM25) compresses a ~200K-token corpus down to roughly 1,700 tokens before the model ever sees it. That's a 99% reduction. For a question that depends on four specific passages scattered across dozens of documents, the odds of retaining all of them are very low. In our evaluation, simple BM25 retrieved all the necessary evidence for fewer than 1 in 10 questions.

Better retrieval methods help. Hybrid retrieval — combining keyword search with semantic similarity — roughly doubles how often the model receives complete evidence. Iterative retrieval, where the model actively requests follow-up searches based on what it found in the previous round, does better still. But even the best RAG approach we tested still failed to retrieve all required evidence for more than a quarter of questions.

The data also reveals three failure patterns that often go unnoticed in simpler evaluations:

  • Abstention on answerable questions: nearly three-quarters of simple RAG's responses — 72% — were abstentions: the model said "I cannot determine the answer" when the answer was present all along. This isn't the model being cautious; it's the model having no evidence to work from. Even the best RAG approach we tested abstained 32% of the time.
  • Hallucination from training memory: with poor retrieval, models sometimes fill in the gaps from their training memory, producing answers that sound confident but aren't grounded in the documents provided. Simple RAG did this 12% of the time — but crucially, this behavior disappears as retrieval improves: by the iterative RAG stage, it has dropped to zero. The lesson: the safest way to prevent hallucination is to make sure the model doesn't need to guess.
  • False safety: most counterintuitively, simple RAG committed to zero confident wrong answers. Its extreme compression left the model with so little context that it consistently recognised it couldn't answer. The model isn't reasoning correctly, it's just refusing — and this foreshadows the finding in Section 5.3 that more context, without verification, actually makes confident errors worse.

This is ultimately a structural limitation of the RAG approach for multi-hop reasoning. Retrieval is optimized to find relevant passages, not to preserve the relationships between them. The better your retrieval, the closer you get to full context — which raises the natural question: why not just use full context to begin with?

5.3) Full context helps a lot — but it's not enough on its own

Going from the best RAG approach (48%) to LLM (all docs) (68%) is the largest single jump in the table. Eliminating the retrieval step entirely removes its biggest failure mode: the model now has access to every document, every passage, every cross-reference.

But access to information and correct use of information are different things.

Even with all the evidence available, a model doing a single unguided pass over 200K tokens of text makes systematic errors. We saw four patterns repeatedly:

Latching onto the first plausible answer. Multi-hop questions often have ambiguous intermediate steps. For example: "find the country near Australia that first participated in the Olympics in 1952." There are multiple countries near Australia; the correct one is New Zealand, but Papua New Guinea is also nearby and more commonly discussed in certain contexts. A model doing a single pass tends to commit to the first entity it identifies as a candidate and carry it forward — without going back to check whether that commitment leads to a grounded final answer. The correct evidence was in the context the whole time.

Getting the chain right but the final fact wrong. In one recurring failure, the model correctly traced all four intermediate steps of a reasoning chain — but then selected the wrong conclusion at the end. For instance: a question asked about a specific historical treaty defined by the territory it ceded. The model identified the correct geography, the correct river, the correct region — and then named a different treaty that also involved the same river. The final answer was wrong precisely because the model matched on a shared entity rather than on the specific clause that defined the answer. No retrieval failure, no hallucination — just imprecise resolution at the final step.

Drift under long context. With hundreds of thousands of tokens in the prompt, reasoning chains that span four hops can lose coherence mid-way. The model starts correctly, but by the third or fourth step, interference from loosely related passages causes it to conflate entities or take a wrong turn — while staying internally consistent enough that the error isn't obvious. The answer looks well-reasoned; the reasoning is quietly wrong.

Giving up when the answer is there. In a small number of cases, the model concluded "the answer cannot be determined from the documents" — even though the answer was present and findable. The good news: this rate drops dramatically compared to RAG (from 32–72% down to 4%). The bad news: it still happens, and it happens on questions the model should be able to answer given full access to all evidence. This is the "lost in the middle" effect in a concrete form.

The net result is that LLM (all docs) is the most confidently wrong system in the entire comparison — 28% of responses, higher than Hybrid RAG (20%) or Iterative RAG (20%), and much higher than Simple RAG (0%). Switching from retrieval to full context trades one class of failure (abstentions, hallucinations from missing evidence) for another (confident errors from unverified reasoning). The trade is not neutral: confident-wrong gets worse as context improves, not better.

The underlying mechanism is consistent across all four failure patterns above: the model has more evidence, is more willing to commit, and has no mechanism to check whether its commitment is warranted. With Simple RAG, the model knew it didn't know. With full context, the model always thinks it knows.

What these failures share is an absence of structure and verification. The model makes a single pass, produces an answer, and has no mechanism to check itself. It can't go back and verify that the intermediate entity it chose actually leads to a valid conclusion. It can't confirm that the final answer it selected satisfies the specific predicate in the question — not just the general topic. It can't flag when its reasoning chain has drifted.

Throwing more context at the problem doesn't solve these issues. You need a different architecture.

5.4) Why FCR closes the gap

FCR reaches 88% — a 20-point improvement over LLM (all docs) — on exactly the same corpus and the same underlying model.

The difference isn't a better model. It's a better process.

|  | LLM (all docs) | FCR |
| --- | --- | --- |
| Context preparation | Documents concatenated as-is | Structured and normalized, with document identities preserved |
| Reasoning approach | One unguided generation pass | Explicit step-by-step reasoning with intermediate claims tracked |
| Verification | None | Intermediate claims are grounded against source documents as reasoning progresses, not just at the final output |
| Error containment | Errors propagate silently to the final answer | Unsupported claims are caught early and don't contaminate downstream steps |

The core insight is that verification is woven into the reasoning process itself, not appended as a final check. This is structurally different from asking a model to "be careful" or "cite your sources" — those are instructions. FCR's grounding is architectural: it shapes how reasoning is produced, not just how the final output is evaluated.

This is why FCR specifically addresses the failure modes that trip up LLM (all docs):

  • The "first plausible answer" problem is contained because each intermediate entity must be verifiably grounded before it's carried into the next step. An entity that looks plausible but isn't supported by the documents can't silently anchor the rest of the chain.
  • The "right chain, wrong conclusion" problem is addressed because FCR's verification checks whether the source passage actually supports the specific claim being made — not just whether the document is topically relevant.
  • The "drift under long context" problem is mitigated because reasoning is decomposed into discrete, checkable steps rather than produced as a single uninterrupted stream. Errors are surfaced where they occur, not discovered only at the end.

Looking at the failure breakdown: FCR brings the gave-up rate to zero (LLM (all docs) was still abstaining 4% of the time), and cuts the confident-wrong rate from 28% down to 12%. That remaining 12% represents the genuinely hard cases — questions where the ambiguity is irreducible, not questions the system mishandled.

FCR doesn't just get more answers right. It gets them right in a way that produces auditable evidence chains — where every step can be traced back to a specific passage in the source documents. And when it's wrong, it's wrong on the hard problems, not the solvable ones. In high-stakes workflows, that distinction matters as much as the accuracy number itself.

5.5) The cases that remain hard

A small number of questions — roughly three out of twenty-five — resisted every approach, including FCR. These aren't failures of retrieval or verification; they're genuinely ambiguous questions where the "correct" answer depends on interpretive choices the dataset's ground truth has made but the documents don't unambiguously support.

Examples of why these are hard:

  • A question about a geographic region "to the north" of another — where the answer depends on which definition of the original region you use. Different, equally defensible geographic framings lead to different answers.
  • A question about which treaty applies to a specific river — where two major historical treaties both involve the same river, and the fine distinction between them requires precise predicate matching that no current system handles consistently.
  • A question anchored to a specific year's event schedule — where the model must figure out which year's answer is being asked for based on document context alone.

These are edge cases that highlight real challenges in multi-hop question design and temporal reasoning. They don't represent a ceiling for the FCR architecture — they represent the frontier of what makes multi-document reasoning hard in the first place, and areas we're actively working on.


6) Tradeoffs and limitations

FCR is not a universal replacement for RAG.

Real tradeoffs exist today:

  • Higher compute cost: holistic reasoning and verification are more expensive than a simple retrieve-and-answer loop. That said, inference costs have fallen dramatically year-over-year, and FCR's structured approach creates natural opportunities for caching and selective processing that reduce the gap over time.
  • Higher latency: especially as corpora grow. This is an active engineering focus — batching, parallelism, and smarter context construction all help, and we expect latency to improve significantly as the architecture matures.
  • Context capacity: even large windows have practical limits today. But this is one of the fastest-moving areas in the field — available context lengths have grown orders of magnitude in just a few years. We're also building toward 10M+ token support (see Closing), so we view this as a shrinking constraint, not a ceiling.
  • Configuration sensitivity: different corpora and workflows can require different strategies — this is a real cost that RAG-based systems share too.

RAG remains excellent for its original purpose — navigation, search, lightweight QA, and latency-sensitive workflows. FCR does not eliminate the need for human review in high-stakes settings either; the goal is to reduce the verification burden and failure rate, not to claim “zero-review automation.”


7) When to use FCR vs RAG

Use RAG for exploratory or approximate tasks — finding a relevant document, navigating a knowledge base, answering a FAQ — where a near-miss is still useful and results don't need to be exact. RAG is fast and well-suited to this.

Use FCR when accuracy matters: when the answer will be acted on, cited, or carried into a downstream decision, and a confident but wrong answer is worse than no answer at all. The failure modes in Section 5 follow directly from this distinction: RAG's retrieval-first architecture cannot verify what it assembles, so errors propagate silently. FCR's inline grounding step changes that.

| Use case | Recommended |
| --- | --- |
| Document search, lookup, navigation | RAG |
| FAQ over small knowledge bases | RAG |
| “Find the clause that says X” | RAG |
| Cross-contract reasoning / obligation synthesis | FCR |
| Compliance analysis / evidence-backed responses | FCR |
| Multi-report synthesis and reconciliation | FCR |

The clearest signal: if a wrong answer would be acted on before anyone catches it, the task is FCR-shaped.
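The guidance in this section can be condensed into a rough routing heuristic. This is illustrative, not a product API — the trait names mirror the criteria above:

```python
# Rough routing heuristic for RAG vs FCR. Traits mirror the criteria in
# this section; this is an illustration, not a production policy.

def choose_architecture(cross_document: bool,
                        acted_on_without_review: bool,
                        near_miss_acceptable: bool) -> str:
    """Return 'FCR' for reasoning-shaped tasks, 'RAG' for lookup-shaped ones."""
    if cross_document or acted_on_without_review:
        return "FCR"     # answers depend on relationships, or errors propagate
    if near_miss_acceptable:
        return "RAG"     # fast, and a near-miss is still useful
    return "FCR"         # when in doubt, prefer the verifiable path
```

For example, document search (single-document, reviewed, near-miss-tolerant) routes to RAG, while cross-contract obligation synthesis routes to FCR on the first condition alone.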


Closing: from retrieval to reasoning

RAG helped a generation of systems search over knowledge. We believe the next generation will need to reason over it — reliably, with evidence, and across entire corpora when retrieval becomes the bottleneck.

That’s what Full Context Reasoning is for.

We view million-token contexts as an early milestone, not the end state. Ongoing work is pushing toward 10M+ token corpora, with active research on stronger verification, contradiction detection, and domain-specific reasoning schemas. The core challenge remains the same regardless of scale: usable context is engineered, not merely provided.

If you want to follow along: we plan to offer controlled API access so teams can test FCR on their own document corpora using their own API key (and pay for inference). Join the waitlist for early access.

Waitlist: TBD