April 7, 2026
Evaluating agentic RAG for financial analysis: a FinanceBench study
In 2023, researchers at Patronus AI published FinanceBench [1], a benchmark of 150 financial analysis questions drawn from real SEC filings. The questions require locating specific figures across 10-K and 10-Q filings, computing ratios, comparing across periods, and reasoning about what the numbers mean. The kind of thing a junior analyst does every day.
Their headline finding: retrieval-augmented generation with standard vector search got 19% of questions right. Putting entire SEC documents into a GPT-4 context window got 78%. The implication was that RAG, as typically implemented, was not up to the task of financial document analysis.
That gap matters in practice. Full-context retrieval is expensive, does not scale to large collections, and is bounded by context window size. A 100-page 10-K does not always fit; a 10-Q that references prior-year figures from the associated annual filing definitely does not. If traditional RAG cannot close the gap, financial document analysis needs a fundamentally different approach.
We ran Dewey's /research endpoint on all 150 FinanceBench questions to see where agentic retrieval lands.
Setup
We ingested the full FinanceBench document set into a Dewey collection and ran all 150 questions at depth=exhaustive. At this setting, the model can make up to 50 search calls before producing an answer.
We tested two primary configurations:
- Config A: GPT-5.4 as the reasoning model
- Config B: Claude Opus 4.6 as the reasoning model
We also ran ablation variants with document enrichment features disabled, and a third variant with an upgraded enrichment model. More on those below.
Each answer was scored in two stages: a numeric parser that applies a 2.5% relative tolerance (matching standard financial rounding conventions), followed by a GPT-4o-mini LLM judge for everything that did not parse numerically. Because LLM judges are non-deterministic, we re-ran the judge 10 times per configuration; we report mean accuracy with 95% confidence intervals and test differences between configurations with Welch's t-test. Judge variance was small (standard deviation 0.3%–0.6%), so the numbers are stable.
The collection is public. All benchmark code is at github.com/meetdewey/financebench-eval [2].
Results

| System | Accuracy |
|---|---|
| GPT-4-Turbo, vector RAG (FinanceBench paper, 2023) | 19.0% |
| Dewey + GPT-5.4 | 62.9% (±0.3%) |
| FinSage, agentic RAG (arXiv 2504.14493, 2025) | 70.0% |
| GPT-4-Turbo, full context (FinanceBench paper, 2023) | 78.0% |
| Dewey + Claude Opus 4.6 | 83.7% (±0.4%) |
Dewey with Claude Opus 4.6 surpasses the full-context baseline by 5.7 percentage points. GPT-5.4 reaches 62.9%, which is well above traditional vector RAG but below the full-context line.
The 21-point gap between the two models is the result that needs explaining.
Why Claude Opus 4.6 and GPT-5.4 diverge
The difference is not reasoning capability; it is retrieval behavior. At the same depth setting, GPT-5.4 averaged 9.4 tool calls per question. Claude Opus 4.6 averaged 21.2.

GPT-5.4's distribution is narrow and peaks between 6 and 10 calls. Opus spreads across the full range, with 40 of 150 questions requiring 31 or more calls. Neither model was instructed to search a specific number of times; this reflects how each model approaches exhaustive research as an agentic task.
The practical consequence shows up most clearly in numerical reasoning questions, which require locating specific figures that may be scattered across different sections of a filing:

| Question type | GPT-5.4 | Claude Opus 4.6 | n |
|---|---|---|---|
| Numerical reasoning | 55.8% | 97.7% | 43 |
| Information extraction | 64.5% | 87.1% | 31 |
| Logical/multi-step | 80.0% | 80.0% | ~15 |
A question like "What was the change in free cash flow from FY2021 to FY2022?" may require finding figures in two different sections of the same 10-K, possibly in a footnote table or in the management discussion rather than the primary financial statements. A model that searches 9 times runs a real risk of missing one of those figures. A model that searches 21 times is substantially more likely to find both.
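A rough way to see why call count matters: if each search call independently surfaces a given needed figure with some fixed probability, the chance of assembling all the figures grows quickly with the number of calls. This is a toy model, not a measurement, and the 15% per-call hit rate below is an assumed illustrative value:

```python
def p_all_found(n_figures: int, n_calls: int, hit_rate: float) -> float:
    """Probability that every needed figure is found, assuming each call
    independently surfaces each figure with probability `hit_rate`
    (an illustrative assumption, not measured from the benchmark)."""
    p_miss_one = (1 - hit_rate) ** n_calls
    return (1 - p_miss_one) ** n_figures

# Two scattered figures, assumed 15% per-call hit rate, 9 vs. 21 calls:
print(round(p_all_found(2, 9, 0.15), 2), round(p_all_found(2, 21, 0.15), 2))
```

Under these assumed numbers, 9 calls find both figures roughly 59% of the time and 21 calls over 93% of the time, which is directionally consistent with the gap on numerical reasoning questions.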
This difference in search depth appears to be intrinsic to how the two models approach open-ended agentic tasks rather than something that can be tuned away with prompting.
The impact of document enrichment features
When a document is ingested into Dewey, an optional enrichment pipeline runs over the parsed content and generates:
- Section summaries: an LLM-generated paragraph describing each section's content and key figures
- Table captions: structured descriptions of table contents and column semantics
- Image captions: descriptions of charts and graphs
These are embedded alongside the raw text and returned as additional retrieval context. The model generating them was gpt-4o-mini by default.
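Conceptually, each retrievable unit carries the raw text plus its enrichment, so a search can match either the source wording or the generated description. The record below is an illustrative shape, not Dewey's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EnrichedChunk:
    """Illustrative record shape -- field names are assumptions, not
    Dewey's internal schema. Both `raw_text` and the enrichment fields
    are embedded, so either can match a search query."""
    doc_id: str
    section: str
    raw_text: str                    # verbatim filing text
    section_summary: str             # LLM-generated paragraph (gpt-4o-mini by default)
    table_caption: Optional[str]     # description of table contents and column semantics
    image_caption: Optional[str]     # description of any chart or graph
```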
To measure their contribution, we re-ran both configurations with enrichment disabled. We also ran a third variant with gpt-5.4 handling enrichment instead of gpt-4o-mini.

| Configuration | Accuracy | vs. no enrichment |
|---|---|---|
| GPT-5.4, no enrichment | 64.5% (±0.2%) | baseline |
| GPT-5.4, gpt-4o-mini enrichment | 62.9% (±0.3%) | -1.6 pp (p < 0.001) |
| GPT-5.4, gpt-5.4 enrichment | 62.7% (±0.0%) | -1.8 pp (p < 0.001) |
| Claude Opus 4.6, no enrichment | 79.9% (±0.4%) | baseline |
| Claude Opus 4.6, gpt-4o-mini enrichment | 83.7% (±0.4%) | +3.8 pp (p < 0.001) |
| Claude Opus 4.6, gpt-5.4 enrichment | 83.3% (±0.0%) | +3.5 pp (p < 0.001) |
Enrichment adds a statistically significant 3.8 points for Opus. For GPT-5.4, the same features produce a small but statistically significant decrease.
The GPT-5.4 result is counterintuitive but consistent across all runs and both enrichment models. Our interpretation: section summaries and table captions are most useful for navigation. They let a model scanning a document's structure decide which sections warrant a full read, rather than having to retrieve and parse raw text to find that out. A model doing 21-plus searches per question benefits from that kind of navigational scaffolding. A model issuing 9 targeted keyword searches does not, and the extra summary text may occasionally pull retrieval toward paraphrased content rather than source figures.
The other finding here: upgrading from gpt-4o-mini to gpt-5.4 for enrichment made no statistically significant difference (p = 0.10 for both models). The task of describing what a section contains and flagging its key figures is well within gpt-4o-mini's capability. There is no measurable return on using a frontier model for this step.
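The run-level comparisons behind these p-values use a standard Welch test over the 10 judge runs per configuration. A minimal sketch of the statistic and its degrees of freedom (the accuracy vectors you pass in would be per-run accuracies; the values in the test below are placeholders, not our run data):

```python
from math import sqrt
from statistics import mean, stdev

def welch_t(a: list, b: list) -> float:
    """Welch's t statistic for two samples with unequal variances."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    return (mean(a) - mean(b)) / sqrt(va + vb)

def welch_df(a: list, b: list) -> float:
    """Welch-Satterthwaite degrees of freedom for the test above."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    return (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
```

In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic and returns the p-value directly.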
Qualitative examples
Segment performance attribution
Question: If we exclude the impact of M&A, which segment has dragged down 3M's overall growth in 2022?
Gold answer: The consumer segment shrank by 0.9% organically.
Opus made 26 tool calls and constructed a full table of organic sales growth across all four 3M business segments before isolating the M&A effect and comparing them. It correctly identified Consumer as the only segment with negative organic growth and distinguished it from Transportation and Electronics, which had a larger headline decline driven primarily by divestitures rather than underlying demand.
GPT-5.4 (17 tool calls) identified Transportation and Electronics as the answer. It retrieved data on that segment's headline decline but did not reach the consumer segment disclosures before concluding. The answer is wrong for the same reason a first-pass analyst scan might be wrong: it stopped at the most visible number rather than reading all four segments.
Capital intensity analysis
Question: Is 3M a capital-intensive business based on FY2022 data?
Gold answer: No. CAPEX/Revenue: 5.1%, Fixed assets/Total Assets: 20%, Return on Assets: 12.4%
Opus pulled three separate ratios from three different financial statements, reproduced the exact percentages in the gold answer, and reached the correct conclusion. The question is representative of the information extraction category: the answer is not in one place; it requires assembling figures from the income statement, cash flow statement, and balance sheet.
Where both models struggle: 10-Q filings
| Document type | GPT-5.4 | Claude Opus 4.6 | n |
|---|---|---|---|
| 10-K | 64.3% | 87.5% | 112 |
| 10-Q | 33.3% | 53.3% | 15 |
| 8-K | 100% | 100% | 9 |
| Earnings release | 57.1% | 64.3% | 14 |
10-Q accuracy is poor for both models. Quarterly reports often reference prior-period figures from the associated annual filing, and those figures may not be reproduced in the quarterly document. When that happens, a model that searches the collection confidently may retrieve figures from the wrong period. The FinanceBench collection contains only the documents explicitly cited in the benchmark, so cross-period retrieval is sometimes not possible. Extending the collection with paired annual filings would likely narrow this gap substantially.
Limitations
A few things to be precise about:
Single ingestion. All results reflect one ingest with fixed chunking and embedding parameters. Retrieval quality is sensitive to these choices and we have not ablated them.
LLM judge vs. human evaluation. The original paper used human annotators. We used GPT-4o-mini. Our judge can err on borderline cases, particularly when a predicted answer arrives at the correct numeric conclusion via a different path than the gold answer. The 10-run stability analysis gives us confidence that judge variance does not explain configuration-level differences, but the absolute accuracy numbers may differ somewhat from what human evaluation would produce.
Model versions. GPT-5.4 and Claude Opus 4.6 are current as of April 2026. Comparisons against the original paper's GPT-4-Turbo baseline reflect improvements in both the retrieval system and the underlying models.
Reproducing these results
The document collection is publicly accessible — you can run your own queries against the same 150 filings here.
To reproduce the full benchmark, you will need a Dewey account with OpenAI and Anthropic API keys configured under the API Keys panel on your dashboard (BYOK is required for exhaustive depth). Then:
```shell
git clone https://github.com/meetdewey/financebench-eval
cd financebench-eval
cp .env.example .env    # add your DEWEY_API_KEY and OPENAI_API_KEY (for the judge)
npm install
npm run ingest          # ~30 min: uploads SEC filings to Dewey
npm run run             # ~6 hrs at concurrency=2
npm run score           # ~5 min
npm run report
npm run ci -- --runs 10
```
All code is at github.com/meetdewey/financebench-eval [2].
What this adds up to
The original FinanceBench paper showed a 59-point gap between vector RAG and full-context retrieval on financial analysis questions. Agentic retrieval closes that gap. The key is not doing one smarter search; it is doing enough searches that the relevant figures are actually found.
The enrichment findings point in the same direction. Section summaries and table captions help, but only when the model is searching broadly enough to use them as navigation signals. For a model doing 9 targeted searches, the extra metadata does not help. For a model doing 21, it does.
The practical takeaway for teams building financial RAG systems: the retrieval strategy is the most important variable. Enrichment quality matters less than whether the model searches thoroughly enough to find what it needs.