Benchmarking AI Agents on Financial Retrieval

We tested leading AI agent frameworks on 500 financial questions. The best reached 90% accuracy. For production finance work, that’s not enough.


The promise and the gap

AI agents that can autonomously search, reason, and retrieve data represent a significant step forward for financial research. Instead of manually navigating databases and documents, an analyst could ask a question and receive a verified answer in seconds. We wanted to measure how close we are to that reality, focusing on single-number lookups: questions like “How many trucks did Volvo deliver in Asia in Q4?” To find out, we built an evaluation framework and tested three frontier agent systems: OpenAI’s Agents SDK with GPT-5.2, Anthropic’s Agent SDK with Claude Opus 4.5, and Google’s ADK with Gemini 3 Pro. Each agent faced 500 questions in two configurations: web search with reasoning enabled, and the same setup plus Daloopa’s structured database via MCP. This study extends our earlier work from consumer chatbots to production agent frameworks and adds root-cause analysis for every failure.
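To make the setup concrete, here is a minimal sketch of the evaluation loop in Python. The run_agent wrapper is a hypothetical stand-in for each vendor’s SDK, and the scoring callback is passed in separately; the real harness also records full transcripts for the per-failure root-cause analysis.

```python
# Minimal sketch of the evaluation loop. `run_agent` is a hypothetical
# wrapper around the vendor SDKs (OpenAI Agents SDK, Anthropic Agent SDK,
# Google ADK); the actual SDK calls are assumed, not shown.
from dataclasses import dataclass

@dataclass
class Question:
    text: str            # e.g. "How many trucks did Volvo deliver in Asia in Q4?"
    ground_truth: float  # verified single number from company filings

def run_agent(model: str, tools: list[str], question: str) -> str:
    """Hypothetical dispatcher: run one question through an agent built on
    `model` with the given tool set (web search, optionally the Daloopa MCP
    server) and return its free-text answer."""
    raise NotImplementedError

def evaluate(model: str, tools: list[str], questions: list[Question], is_correct) -> float:
    """Accuracy of one (model, tool configuration) pair. `is_correct`
    compares a free-text answer against the numeric ground truth."""
    hits = sum(is_correct(run_agent(model, tools, q.text), q.ground_truth) for q in questions)
    return hits / len(questions)
```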

Main finding: The 90% ceiling

The best configurations reached approximately 90% accuracy. All three frontier models converged to similar performance when given the same tools.
Figure: FinRetrieval accuracy across agent configurations
With web search and reasoning alone, accuracy varies widely (20-71%). Adding structured database access via MCP narrows the gap dramatically, with all three frontier models converging to 89-91% accuracy.
Configuration                         Accuracy
Claude Opus 4.5 (+Daloopa MCP)        90.8%
Gemini 3 Pro (+Daloopa MCP)           90.6%
GPT-5.2 (+Daloopa MCP)                89.2%
With MCP database access, Claude Opus 4.5 achieves the highest accuracy at 90.8%, with Gemini and GPT close behind. The differences between the top configurations are within the margin of error.
90% sounds high. But consider what it means in practice:
  • For every 10 data points retrieved, expect 1 to be wrong
  • A 50-company financial screen would contain roughly 5 incorrect values
  • These aren’t random typos—they’re systematic misinterpretations that a human reviewer might not catch
For production finance work, 90% accuracy is neither fully dependable nor safe to delegate. The remaining 10% has to be understood before it can be fixed.
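The arithmetic behind those bullets is simple; a sketch, assuming independent errors at a flat 90% per-lookup accuracy:

```python
# Expected error counts at a flat 90% per-lookup accuracy,
# assuming errors are independent across lookups.
accuracy = 0.90
screen_size = 50  # values in a 50-company screen

expected_errors = screen_size * (1 - accuracy)  # ~5 wrong values
p_clean_screen = accuracy ** screen_size        # ~0.5% chance no value is wrong

print(f"Expected wrong values: {expected_errors:.1f}")
print(f"Probability the screen is error-free: {p_clean_screen:.2%}")
```

Under that independence assumption, the chance that a 50-value screen comes back completely clean is about half a percent.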

US versus non-US: A clue in the data

Before diving into the failures, we noticed a pattern. All three models performed better on US companies than on non-US companies:
Model                                 US Accuracy    Non-US Accuracy    Gap
Claude Opus 4.5 (+Daloopa MCP)        93.4%          87.8%              +5.6pp
Gemini 3 Pro (+Daloopa MCP)           92.6%          88.2%              +4.4pp
GPT-5.2 (+Daloopa MCP)                90.4%          87.8%              +2.6pp
With MCP, the gap ranges from 3 to 6 percentage points. Without structured data, the gap widens to 8-20 percentage points—MCP doesn’t just improve accuracy, it reduces geographic bias.
The gap isn’t about geography or language; it’s about fiscal calendars. Most US companies use December fiscal year-ends, while non-US companies more often close their books in other months: March in Japan, September in India. This turned out to be the key. For December year-ends, the standard calendar alignment, non-US accuracy matched US accuracy (both roughly 96%). The gap came from non-standard fiscal calendars: accuracy fell to 65% for companies with March year-ends and 79% for September year-ends. The pattern follows naming conventions, not countries.
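This breakdown is easy to reproduce from per-question results. A sketch, assuming each result is tagged with the company’s fiscal year-end month (the field names here are ours, not the framework’s):

```python
# Bucket per-question results by fiscal year-end month and compute
# accuracy per bucket; this is the cut that exposed the fiscal-calendar pattern.
from collections import defaultdict

def accuracy_by_fiscal_year_end(results):
    """`results`: iterable of dicts like {"fye_month": 3, "correct": True}."""
    buckets = defaultdict(lambda: [0, 0])  # month -> [correct, total]
    for r in results:
        buckets[r["fye_month"]][0] += int(r["correct"])
        buckets[r["fye_month"]][1] += 1
    return {month: correct / total
            for month, (correct, total) in sorted(buckets.items())}
```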

Why the 10% fails

We analyzed every failure from the best-performing configuration (Claude Opus 4.5 +Daloopa MCP, 46 incorrect answers). The errors fall into distinct categories.
Figure: Error categories for Claude Opus 4.5 (+Daloopa MCP) failures
Period confusion dominates at 63% of errors, accounting for more failures than all other categories combined.

Period confusion: 63% of errors

Nearly two-thirds of all failures stem from a single pattern: fiscal versus calendar period confusion. When asked for data from “fiscal year ended March 2023,” agents queried 2023FY. But Daloopa’s database uses the starting-year convention, so the correct query was 2022FY. The agent retrieved data from the wrong year entirely. This isn’t a reasoning failure. The agent understood the question correctly. It failed because the tool’s period naming convention wasn’t documented clearly enough for the model to infer the correct mapping. The pattern is systematic and fixable. Better tool documentation and client-side prompt engineering could address nearly two-thirds of all current errors.
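A minimal sketch of the mapping the agents missed, assuming twelve-month fiscal years and a starting-year label. This illustrates the convention; it is not Daloopa’s actual API.

```python
def fiscal_label_for_year_ending(end_year: int, end_month: int) -> str:
    """Starting-year fiscal label for a 12-month fiscal year that ends in
    the given calendar year and month (illustrative, not Daloopa's API)."""
    # A December year-end starts and ends in the same calendar year;
    # any other year-end means the fiscal year started the prior calendar year.
    start_year = end_year if end_month == 12 else end_year - 1
    return f"{start_year}FY"

assert fiscal_label_for_year_ending(2023, 3) == "2022FY"   # FY ended March 2023
assert fiscal_label_for_year_ending(2023, 12) == "2023FY"  # calendar-aligned FY
```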

Wrong series selection: 20% of errors

Financial databases contain many similarly-named series. Agents sometimes selected the wrong one:
  • “Gross production” instead of “net production”
  • A sub-component instead of the total
  • A broader category instead of a specific line item
These errors call for better series disambiguation in the data source or more explicit guidance in the question. A further 7% of failures traced to data quality issues such as mislabeled series and translation errors; structured data providers aren’t perfect either.
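One plausible client-side mitigation is a disambiguation step that insists on an exact series-name match and otherwise prefers a consolidated total. The sketch below assumes a simple name/is_total schema of our own; it is not the data source’s actual metadata model.

```python
def pick_series(requested_name: str, candidates: list[dict]) -> dict | None:
    """Choose among similarly named series; return None when still ambiguous.
    Assumes each candidate looks like {"name": ..., "is_total": bool}."""
    requested = requested_name.strip().lower()
    # 1. Exact label match ("net production" must not resolve to "gross production").
    for series in candidates:
        if series["name"].strip().lower() == requested:
            return series
    # 2. Prefer a single consolidated total over sub-components.
    totals = [s for s in candidates if s.get("is_total")]
    if len(totals) == 1:
        return totals[0]
    # 3. Still ambiguous: surface the options to the agent instead of guessing.
    return None
```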

The best AI with databases was worst at web search

The most surprising finding wasn’t which model performed best. It was this paradox:
Figure: Claude Opus 4.5 accuracy, web alone vs. web + Daloopa
Claude Opus 4.5 performs dramatically differently depending on the tools it is given: 91% accuracy with the Daloopa MCP, 20% with only web search. The same model. A 71 percentage point gap. How?

Not all web search is the same

Different AI platforms have different web browsing capabilities. Claude’s web search tool returns search result snippets: it can see previews but can’t read full pages or PDFs. Google’s and OpenAI’s tools can browse documents directly and extract data from tables. It’s the difference between having library catalog access and actually reading the books.

Finding answers isn’t the same as trusting them

Even when Claude found relevant information, it often couldn’t commit. In 55% of its WebOnly failures, Claude located a plausible answer but declined to provide it, continuing to search until it timed out or explicitly gave up. In one case, Claude searched for Alcoa’s maintenance cost forecast. It found “$10 million” in a search result, which was the correct answer. But instead of stopping, it ran five more searches looking for “confirmation” and eventually concluded: “I could not locate the specific dollar amount.” The answer was there; the snippet-only tool never gave the model enough context to trust it.

The insight

AI reliability depends on both the model and the tools it’s given. With MCP, all three frontier models converged to 89-91% accuracy. Without it, tool quality dominated—and the “best” model fell to last place. Structured data levels the playing field. But as the 90% ceiling shows, it’s necessary, not sufficient.

The path to 99%

Based on our error analysis, improving from 90% to 99% requires work across multiple dimensions:
Improvement                                         Estimated Impact
Better tool documentation (period conventions)      Addresses 63% of errors
Improved series disambiguation                      Addresses 20% of errors
Client-side prompt engineering                      Compound benefits
Data quality improvements                           Contributes to ~7% of errors
The largest gains come from fixing period convention documentation (63% of errors) and improving series disambiguation (20%). None of these fixes require better models; they require better infrastructure around the models. The evaluation framework makes that work measurable: each improvement can be tested and its impact quantified. Without systematic evaluation, optimization is guesswork.
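As a concrete illustration of the first row, “better tool documentation” can be as simple as stating the period convention in the tool description the agent reads before it queries. The tool name and parameter schema below are hypothetical, not Daloopa’s actual MCP definition.

```python
# Hypothetical MCP-style tool definition that makes the period convention
# explicit in the description the model sees (illustrative schema only).
FUNDAMENTALS_TOOL = {
    "name": "get_fundamental",
    "description": (
        "Retrieve a single reported value for a company. "
        "PERIOD CONVENTION: fiscal years are labeled by their STARTING "
        "calendar year. A fiscal year ending March 2023 is '2022FY', "
        "not '2023FY'. Quarters follow the company's fiscal calendar."
    ),
    "parameters": {
        "ticker": {"type": "string"},
        "series": {"type": "string", "description": "Exact series name, e.g. 'Net production'"},
        "period": {"type": "string", "description": "e.g. '2022FY' or '2024Q2' (fiscal)"},
    },
}
```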

Study design and limitations

We generated 500 single-number financial retrieval questions across six categories: income statement, balance sheet, cash flow, guidance, operational KPIs, and segment data. Each question has a verified ground-truth answer from company filings. Each agent was tested in two configurations, both with reasoning enabled:
  • WebOnly: Web search only, no structured data access
  • +Daloopa MCP: Web search plus Daloopa’s financial database
Responses were scored automatically against ground truth, with manual review for edge cases. We performed root-cause analysis on every incorrect answer to identify systematic patterns. This design intentionally focused on single-number retrieval where answers were known to exist in the database. Multi-step analysis requiring chained reasoning, synthesis tasks combining multiple data points, and questions where data may not exist remain untested.
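For reference, a sketch of the kind of automatic scorer this implies; the 1% relative tolerance and the unit handling are illustrative assumptions, not the exact production thresholds.

```python
# Illustrative automatic scorer: pull the first number out of a free-text
# answer, apply any spelled-out magnitude, and compare to ground truth
# within a relative tolerance (edge cases still go to manual review).
import re

_MULTIPLIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

def parse_number(answer: str) -> float | None:
    match = re.search(r"-?\d[\d,]*\.?\d*", answer)
    if not match:
        return None
    value = float(match.group().replace(",", ""))
    for word, factor in _MULTIPLIERS.items():
        if re.search(rf"\b{word}\b", answer, flags=re.IGNORECASE):
            return value * factor
    return value

def is_correct(answer: str, ground_truth: float, rel_tol: float = 0.01) -> bool:
    value = parse_number(answer)
    return value is not None and abs(value - ground_truth) <= rel_tol * abs(ground_truth)
```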

Conclusion

Frontier AI agents with structured data access reach approximately 90% accuracy on financial data retrieval. That’s notable progress, but not yet reliable enough for unsupervised production use. The errors aren’t random. Nearly two-thirds stem from a single fixable pattern: fiscal period naming conventions. Another fifth come from ambiguous series selection. These are infrastructure problems, not fundamental limitations of the models. The path to 99% is visible:
  1. Better tool documentation that makes conventions explicit
  2. Improved data disambiguation at the source
  3. Client-side prompt engineering for edge cases
  4. Continuous evaluation to measure each improvement
The bottleneck isn’t model capability. It’s the infrastructure around models. Better tool design and data quality matter as much as the next generation of frontier models.