We tested leading AI agent frameworks on 500 financial questions. The best reached 90% accuracy. For production finance work, that’s not enough.
The promise and the gap
AI agents that can autonomously search, reason, and retrieve data represent a significant step forward for financial research. Instead of manually navigating databases and documents, an analyst could ask a question and receive a verified answer in seconds. We wanted to measure how close we are to that reality, focusing on single-number lookup: questions like “How many trucks did Volvo deliver in Asia in Q4?” To find out, we built an evaluation framework and tested three frontier agent systems: OpenAI’s Agents SDK with GPT-5.2, Anthropic’s Agent SDK with Claude Opus 4.5, and Google’s ADK with Gemini 3 Pro. Each agent faced 500 questions in two configurations: web search with reasoning enabled, and the same setup with Daloopa’s structured database added via MCP. This study extends our earlier work from consumer chatbots to production agent frameworks, with root-cause analysis for every failure.
Main finding: The 90% ceiling
The best configurations reached approximately 90% accuracy. All three frontier models converged to similar performance when given the same tools.
FinRetrieval accuracy across agent configurations
| Configuration | Accuracy |
|---|---|
| Claude Opus 4.5 (+Daloopa MCP) | 90.8% |
| Gemini 3 Pro (+Daloopa MCP) | 90.6% |
| GPT-5.2 (+Daloopa MCP) | 89.2% |
In practice, that error rate means:
- For every 10 data points retrieved, expect 1 to be wrong
- A 50-company financial screen would contain roughly 5 incorrect values
- These aren’t random typos—they’re systematic misinterpretations that a human reviewer might not catch
US versus non-US: A clue in the data
Before diving into the failures, we noticed a pattern. All three models performed better on US companies than non-US companies:
| Model | US Accuracy | Non-US Accuracy | Gap |
|---|---|---|---|
| Claude Opus 4.5 (+Daloopa MCP) | 93.4% | 87.8% | +5.6pp |
| Gemini 3 Pro (+Daloopa MCP) | 92.6% | 88.2% | +4.4pp |
| GPT-5.2 (+Daloopa MCP) | 90.4% | 87.8% | +2.6pp |
Why the 10% fails
We analyzed every failure from the best-performing configuration (Claude Opus 4.5 + Daloopa MCP, 46 incorrect answers). The errors fall into distinct categories.
Error categories for Claude Opus 4.5 (+Daloopa MCP) failures
Period confusion: 63% of errors
Nearly two-thirds of all failures stem from a single pattern: fiscal versus calendar period confusion. When asked for data from “fiscal year ended March 2023,” agents queried 2023FY. But Daloopa’s database uses the starting-year convention, so the correct query was 2022FY. The agent retrieved data from the wrong year entirely.
This isn’t a reasoning failure. The agent understood the question correctly. It failed because the tool’s period naming convention wasn’t documented clearly enough for the model to infer the correct mapping.
The pattern is systematic and fixable. Better tool documentation and client-side prompt engineering could address nearly two-thirds of all current errors.
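As an illustration of that client-side fix, here is a minimal sketch of the period mapping, assuming the database labels each fiscal year by the calendar year in which it starts. The function name and the handling of December-ending fiscal years are our assumptions, not Daloopa’s documented API.

```python
from datetime import date

def starting_year_label(fiscal_year_end: date) -> str:
    """Map a fiscal-year end date to a starting-year period label.

    Assumption: the database labels a fiscal year by the calendar year in
    which it begins. A fiscal year ending March 2023 began in April 2022,
    so its label is '2022FY'; a calendar-aligned fiscal year ending
    December 2023 remains '2023FY'.
    """
    start_year = fiscal_year_end.year if fiscal_year_end.month == 12 else fiscal_year_end.year - 1
    return f"{start_year}FY"

# "Fiscal year ended March 2023" -> query 2022FY, not 2023FY.
assert starting_year_label(date(2023, 3, 31)) == "2022FY"
assert starting_year_label(date(2023, 12, 31)) == "2023FY"
```

Translating the user’s wording into the tool’s convention before the query is issued removes the ambiguity the model would otherwise have to guess at.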
Wrong series selection: 20% of errors
Financial databases contain many similarly named series, and agents sometimes selected the wrong one (a disambiguation sketch follows the list):
- “Gross production” instead of “net production”
- A sub-component instead of the total
- A broader category instead of a specific line item
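One client-side guardrail is to rank candidate series against the qualifiers in the question before the model picks one, so that, for example, a “gross” series never outranks the “net” series the user asked for. The heuristic below is a hypothetical sketch, not part of any shipped tool; a production version would also use series metadata to distinguish totals from sub-components.

```python
def rank_series(question: str, candidates: list[str]) -> list[str]:
    """Order candidate series names by how well they match the question's qualifiers.

    Hypothetical heuristic: reward candidates that contain the qualifiers the
    question asks for (net, gross, total, ...) and penalize candidates that
    carry a conflicting qualifier the question never mentioned.
    """
    qualifiers = {"net", "gross", "total", "adjusted", "organic"}
    asked = {w for w in question.lower().split() if w in qualifiers}

    def score(name: str) -> tuple[int, int]:
        words = set(name.lower().split())
        matched = len(asked & words)                     # requested qualifiers present
        conflicting = len((words & qualifiers) - asked)  # unrequested qualifiers present
        return (matched, -conflicting)

    return sorted(candidates, key=score, reverse=True)

candidates = ["Gross production", "Net production", "Production - Asia segment"]
print(rank_series("net production in Q4", candidates))
# ['Net production', 'Production - Asia segment', 'Gross production']
```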
The best AI with databases was worst at web search
The most surprising finding wasn’t which model performed best. It was this paradox: Claude Opus 4.5, the strongest configuration once Daloopa’s MCP was attached, was the weakest when left to web search alone.
Claude Opus 4.5 accuracy: web alone vs. web + Daloopa
Not all web search is the same
Different AI platforms have different web browsing capabilities. Claude’s web search tool returns search result snippets: it can see previews but can’t read full pages or PDFs. Google’s and OpenAI’s tools can browse documents directly and extract data from tables. It’s the difference between having library catalog access and actually reading the books.
Finding answers isn’t the same as trusting them
Even when Claude found relevant information, it often couldn’t commit. In 55% of its WebSearch failures, Claude located a plausible answer but declined to provide it, continuing to search until timing out or explicitly giving up. In one case, Claude searched for Alcoa’s maintenance cost forecast. It found “$10 million” in a search result, which was the correct answer. But instead of stopping, it ran five more searches looking for “confirmation,” and eventually concluded: “I could not locate the specific dollar amount.” The answer was there. The tool couldn’t verify it confidently.
The insight
AI reliability depends on both the model and the tools it’s given. With MCP, all three frontier models converged to 89-91% accuracy. Without it, tool quality dominated, and the “best” model fell to last place. Structured data levels the playing field. But as the 90% ceiling shows, it’s necessary, not sufficient.
The path to 99%
Based on our error analysis, improving from 90% to 99% requires work across multiple dimensions (a sketch of the documentation fix follows the table):
| Improvement | Estimated Impact |
|---|---|
| Better tool documentation (period conventions) | Addresses 63% of errors |
| Improved series disambiguation | Addresses 20% of errors |
| Client-side prompt engineering | Compound benefits |
| Data quality improvements | Addresses ~7% of errors |
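To make the documentation row concrete, here is a hypothetical tool definition in the style of an MCP tool schema, with the period convention and series-selection guidance spelled out. The tool name, field names, and description text are illustrative assumptions, not Daloopa’s actual interface.

```python
# Hypothetical tool definition; field names follow a generic MCP/JSON-Schema style
# and are NOT Daloopa's actual schema.
GET_DATAPOINT_TOOL = {
    "name": "get_datapoint",
    "description": (
        "Retrieve a single financial data point for a company, series, and period. "
        "PERIOD CONVENTION: fiscal years are labeled by the calendar year in which "
        "they START, so a fiscal year ending March 2023 is '2022FY', not '2023FY'. "
        "SERIES SELECTION: choose the series whose name matches the question's "
        "qualifiers exactly ('net' vs. 'gross', total vs. sub-component)."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "company": {"type": "string"},
            "series": {"type": "string"},
            "period": {"type": "string", "pattern": r"^\d{4}(FY|Q[1-4])$"},
        },
        "required": ["company", "series", "period"],
    },
}
```

The same convention note can be repeated in a client-side system prompt, which is where the “compound benefits” of prompt engineering come from: the model sees the rule both where the tool is defined and where the task is framed.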
Study design and limitations
We generated 500 single-number financial retrieval questions across six categories: income statement, balance sheet, cash flow, guidance, operational KPIs, and segment data. Each question has a verified ground-truth answer from company filings (a grading sketch follows the list). Each agent was tested in two configurations, both with reasoning enabled:
- WebOnly: Web search only, no structured data access
- +Daloopa MCP: Web search plus Daloopa’s financial database
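For reference, grading single-number retrieval can be largely mechanical once ground truth is pinned down. The sketch below shows one plausible harness, assuming the first number in the agent’s reply is compared to the verified value with a small relative tolerance; it is not the exact grader used in this study, and a production version would also normalize units (e.g., millions versus raw values).

```python
import math
import re

def extract_number(answer: str) -> float | None:
    """Pull the first numeric value out of a free-text agent answer."""
    match = re.search(r"-?\d[\d,]*\.?\d*", answer.replace("$", ""))
    return float(match.group().replace(",", "")) if match else None

def is_correct(answer: str, ground_truth: float, rel_tol: float = 0.005) -> bool:
    """Mark an answer correct if its number is within 0.5% of the verified value."""
    value = extract_number(answer)
    return value is not None and math.isclose(value, ground_truth, rel_tol=rel_tol)

def accuracy(results: list[tuple[str, float]]) -> float:
    """Fraction of (agent answer, ground truth) pairs graded correct."""
    return sum(is_correct(answer, truth) for answer, truth in results) / len(results)

print(accuracy([("About $10 million in maintenance costs", 10.0),
                ("Revenue was 1,234.5 MSEK", 1200.0)]))  # 0.5: second answer misses
```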
Conclusion
Frontier AI agents with structured data access reach approximately 90% accuracy on financial data retrieval. That’s notable progress, but not yet reliable enough for unsupervised production use. The errors aren’t random. Nearly two-thirds stem from a single fixable pattern: fiscal period naming conventions. Another fifth come from ambiguous series selection. These are infrastructure problems, not fundamental limitations of the models. The path to 99% is visible:
- Better tool documentation that makes conventions explicit
- Improved data disambiguation at the source
- Client-side prompt engineering for edge cases
- Continuous evaluation to measure each improvement