Background and Methodology
Financial numbers are unforgiving. A misplaced decimal point or the wrong reporting period can cascade into a material misstatement, distorting an analysis, a financial model, or even an investment decision. Today’s large language models are remarkable at synthesizing language, but when the task shifts to extracting precise financial figures from official filings, the margin for error narrows dramatically, and this is where current systems often fall short.
To test how current LLM solutions perform at financial retrieval, our research team evaluated:
- Five “ungrounded” chatbots: ChatGPT (GPT‑5 Thinking), Claude (Opus 4.1), Gemini (Gemini 2.5 Pro), Grok (Grok 4), and Perplexity Finance.
- One “grounded” chatbot: Claude + Daloopa MCP (Opus 4.1 + MCP).
We challenged each chatbot to answer 500 single‑number prompts directly in each web app. The grounded chatbot has access to a dedicated financial-retrieval tool (Daloopa MCP), whereas the ungrounded chatbots have all their built-in web-search tools enabled, but no specific financial-retrieval tool. Here is everything we learned, and how we did it:
Accuracy
Figure 1. Exact numerical‑match accuracy by model (n=500). Bars show percentage of answers that exactly match the ground truth with normalized punctuation/commas but no rounding tolerance.
Key Takeaways
- Most general‑purpose chatbots miss surprisingly often on single‑number financial questions, even with web, EDGAR, and finance‑specific search enabled.
- Grounded retrieval wins: Claude + Daloopa MCP delivered 91.6% exact‑match accuracy, 27 percentage points ahead of ChatGPT. With a ±1% tolerance it reached 94.7% (an 18.5‑point lead).
- Among chatbots without MCP connections, ChatGPT performed best for numeric retrieval (64.2% accuracy, 81.8% within ±5%). It can locate and read SEC filings, and will sometimes open company PDFs for non‑SEC issuers, but it still cites or relies on unvetted secondary sources (blogs/news) and often fails to ground answers to the exact passage.
- Ambiguous interpretation is the main cause of failure: Across models, ~50% of errors were Ambiguous Question Handling, followed by Rounding/Formatting (~25%) and Non‑Answer/Refusal (~15%). Period‑alignment issues (Fiscal vs. Calendar + Period Shift) were meaningful but smaller (~11% combined). On the 170 fiscal/calendar‑offset companies, Claude + Daloopa MCP still leads (88.2%) vs. ~42.4% for the next best general model.
- Misinterpretation & refusal patterns: Gemini showed the highest rate of question misinterpretations. Claude produced the most refusals/non‑answers; conditional accuracy on attempted answers was ~53%.
- LLMs tend to calculate what’s already been reported: several non-grounded systems “compute” metrics instead of returning the reported figure, increasing error risk.
Error breakdown
We bucketed wrong answers into mutually exclusive categories:
- Ambiguous Question Handling: model chooses an interpretation not consistent with our labeling instructions.
- Rounding / Formatting: excessive rounding, decimal misplacement, misplaced thousands separators, or sign errors (+/−).
- Non‑Answer / Refusal: returns boilerplate text or declines despite available ground truth.
- Fiscal vs. Calendar Period Conversion: mixing fiscal and calendar year/quarter or using the wrong reporting basis.
- Period Shift / Trailing Twelve Months: picking the prior/next period (e.g., Q3 vs. Q2) or TTM instead of the asked period.
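To make the period categories concrete, here is a minimal sketch of the taxonomy and of how a fiscal-year label can map onto a different calendar span. The March year-end and the labeling-by-ending-year convention are assumptions chosen for illustration, not a description of any specific issuer in the benchmark, and the actual tagging procedure is not this code.

```python
import calendar
from datetime import date
from enum import Enum


class ErrorCategory(Enum):
    """Mutually exclusive buckets used to tag wrong answers."""
    AMBIGUOUS = "Ambiguous Question Handling"
    ROUNDING_FORMATTING = "Rounding / Formatting"
    NON_ANSWER = "Non-Answer / Refusal"
    FISCAL_VS_CALENDAR = "Fiscal vs. Calendar Period Conversion"
    PERIOD_SHIFT = "Period Shift / Trailing Twelve Months"


def fiscal_year_span(fy_label: int, fy_end_month: int = 3) -> tuple[date, date]:
    """Calendar dates covered by a fiscal year, under an assumed convention.

    Assumption for illustration only: the fiscal year is labeled by the
    calendar year in which it ends, and it closes on the last day of
    `fy_end_month` (3 = March). Issuer conventions differ, which is exactly
    why the FISCAL_VS_CALENDAR bucket exists.
    """
    end_day = calendar.monthrange(fy_label, fy_end_month)[1]
    start = date(fy_label - 1, fy_end_month % 12 + 1, 1)
    return start, date(fy_label, fy_end_month, end_day)


# Under this convention, "fiscal year 2022" covers April 2021 through March
# 2022, a large offset from calendar 2022; quoting a calendar-2022 figure for
# such an issuer would be tagged ErrorCategory.FISCAL_VS_CALENDAR.
print(fiscal_year_span(2022))  # (datetime.date(2021, 4, 1), datetime.date(2022, 3, 31))
```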
The following figure breaks down which mistakes each model tends to make. Three patterns stand out:
- Grounding with an MCP reduces both ambiguity and period mistakes. With MCP, errors are few and mostly Rounding/Formatting (18) or Ambiguous (12); period issues are rare (6 fiscal, 5 shifts), and refusals are near‑zero (1).
- General models skew in different ways.
- ChatGPT: largest bucket is Rounding/Formatting (76), then Ambiguous (62), with moderate Period Shift (24).
- Grok 4: Ambiguous (85) and Rounding (80) are neck‑and‑neck; Period Shift (39) is higher than peers.
- Perplexity Finance: dominated by Ambiguous (180), plus Rounding (84) and notable Refusals (35).
- Claude (no MCP): unusually high Refusals (159); Rounding (83) and Ambiguous (73) follow.
- Gemini 2.5 Pro: errors are overwhelmingly Ambiguous (351), with Rounding (47) next; period issues are smaller (6 fiscal, 21 shifts).
Net‑net: Ambiguous interpretation is the dominant failure mode overall, but for some models (e.g., ChatGPT) Rounding/Formatting is the single largest bucket—suggesting that stricter output formatting and unit normalization could meaningfully lift accuracy even without changing retrieval.
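As a rough illustration of what unit and scale normalization could look like as a post-processing step, here is a minimal sketch. The regex, the scale table, the helper name `normalize_to_millions`, and the default-to-millions assumption are ours, not part of any chatbot or of the benchmark’s scoring.

```python
import re

# Hypothetical post-processing step: convert a free-text numeric answer into
# the scale the question asked for (here, millions), so that "1.4587 billion"
# and "1,458.7 million" normalize to the same number.
SCALE = {"thousand": 1e3, "k": 1e3, "million": 1e6, "m": 1e6,
         "billion": 1e9, "bn": 1e9, "b": 1e9}

def normalize_to_millions(text: str) -> float | None:
    """Return the first number found in `text`, expressed in millions."""
    match = re.search(r"(-?[\d,]+(?:\.\d+)?)\s*(thousand|million|billion|bn|k|m|b)?\b",
                      text, flags=re.IGNORECASE)
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    # If no scale word is present, assume the answer is already in millions.
    scale = SCALE[(match.group(2) or "million").lower()]
    return value * scale / 1e6

print(normalize_to_millions("BRL 1.4587 billion"))   # ≈ 1458.7
print(normalize_to_millions("1,458.7 million BRL"))  # 1458.7
```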
Error distribution by chatbot (sorted by total errors)
Figure 2. Error distribution by chatbot (n=500 per model; grouped bars). Counts per category: Fiscal vs. Calendar, Period Shift, Rounding/Formatting, Ambiguous Interpretation, Non‑Answer/Refusal.
Examining errors by category and LLM
Rounding and interpretation errors.
For the following question, ChatGPT and Claude with Daloopa MCP answered correctly; Grok 4 rounded its answer to 1,459 (off by 0.3 million); Gemini and Perplexity returned entirely different numbers (ambiguous question handling).
Question:
What was YDUQS’ full year 2022 Adjusted EBITDA in BRL millions?
Expected answer:
YDUQS’ Adjusted EBITDA for the full year 2022 was [1,458.7 million](https://marketplace.daloopa.com/src/69127815) BRL.
Fiscal/calendar period confusion
In the following question:
- Claude with Daloopa MCP and Perplexity answered correctly.
- ChatGPT and Grok used the fiscal 2023 value instead of fiscal 2022.
- Gemini misinterpreted the question.
- Claude (no MCP) was unable to find the figure.
Question: What was Hazama Ando Corp’s overseas building construction backlog at the end of fiscal year 2022 in million Japanese yen?
Expected answer:
Hazama Ando Corp’s overseas building construction backlog at the end of fiscal year 2022 was 10,969 million Japanese yen.
Results With Different Accuracy Thresholds
Exact match accuracy
On strict exact match (no tolerance; see Figure 1):
- Claude + Daloopa MCP leads decisively at 91.6%.
- The best general‑purpose baseline is ChatGPT at 64.2%, followed by Grok 4 at 57.2%. Perplexity Finance (34.2%) and Claude (Opus 4.1) (31.4%) trail, with Gemini 2.5 Pro last at 11.8%.
- The gap between MCP and the general chatbots, roughly 27 to 80 points, suggests most misses are not tiny rounding slips but source/normalization issues that grounded retrieval mitigates.
Accuracy at different tolerance thresholds
Allowing a little wiggle room helped, but unevenly across systems (a minimal sketch of the tolerance check appears after this list).
- At ±1%, Claude Opus 4.1 + MCP rose to 94.7% while ChatGPT jumped to 76.2% and Grok to 70.3%. Claude (no MCP) made the largest early move, from 31.4% to 67.5%, a signal that many of its misses were rounding/formatting or near‑miss unit issues.
- By ±10%, accuracies converged somewhat: MCP ~96.6%, ChatGPT 85.2%, Grok 80.1%, Claude (no MCP) ~79.2%, Perplexity 66.5%, and Gemini 44.5%.
- At ±20%, ceilings appeared: MCP 97.2%, ChatGPT 89.5%, Claude (no MCP) 86.3%, Grok 84.7%, Perplexity 73.5%, Gemini 58%.
Conclusion: Tolerance narrows the gaps but does not erase them; MCP’s lead persists at every threshold.
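The tolerance bands above are simply relative error against the ground truth. A minimal sketch of that check (function names are ours, not the benchmark’s harness) is below; note that the 1,459 vs. 1,458.7 answer from the YDUQS example earlier fails exact match but passes even a ±1% band.

```python
def within_tolerance(predicted: float, truth: float, tol: float) -> bool:
    """True if the prediction is within ±tol (e.g., 0.01 for ±1%) of the ground truth."""
    if truth == 0:
        return predicted == 0
    return abs(predicted - truth) / abs(truth) <= tol

def accuracy_at_tolerance(pairs: list[tuple[float, float]], tol: float) -> float:
    """Share of (predicted, truth) pairs falling inside the tolerance band."""
    return sum(within_tolerance(p, t, tol) for p, t in pairs) / len(pairs)

# Grok's 1,459 vs. the reported 1,458.7: relative error ≈ 0.02%, so it fails
# exact match but passes the ±1% band.
print(within_tolerance(1459, 1458.7, 0.01))  # True
```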

Figure 3. Accuracy vs. tolerance (±% of ground truth). Lines show cumulative share of answers within the specified error band.
Accuracy in companies with fiscal/calendar period offsets
Period alignment was a key stress test. For the 170 companies in the set whose fiscal year does not match the calendar year, Claude + Daloopa MCP achieved 88.2% exact match. The next best results were ChatGPT at 42.4% and Grok at 40.6%; Perplexity (22.4%), Claude (no MCP) (17.6%), and Gemini (8.2%) lagged far behind. Grounding and explicit period normalization appeared to be the deciding factors here.
Accuracy on companies with fiscal/calendar period offsets
Figure 4. Exact‑match accuracy on companies with fiscal/calendar offsets (n=170). Bars show the subset where period alignment is required to answer correctly.
Rating precision, conciseness and insight
We also rated answers on three presentation dimensions (1–5 scale). Three patterns stand out:
- Precision mirrors the main metric. Claude + MCP leads (4.8), with ChatGPT 4.2, Grok 3.9, Perplexity 3.0, Claude (no MCP) 2.8, Gemini 1.8.
- Conciseness isn’t the same as correctness. ChatGPT is the most concise (4.8), followed by Grok 4.5 and MCP 3.8. Claude (no MCP) is wordier (2.3).
- Insight can trade off with brevity, sometimes unhelpfully. Claude (no MCP) scores highest on our “insight” rubric (3.5), with Gemini at 2.9, but both come with lower precision. MCP balances context and accuracy (2.9 insight).
Average rating by category and chatbot
Figure 5. Average rating by category and chatbot (1–5 scale).
Verbosity
We also looked at how many tokens each system used per answer. Shorter isn’t always better, but long replies often bury the number.
What we see
- ChatGPT is the most concise at ~29 tokens on average.
- Grok (40) and Perplexity (43) are also fairly terse.
- Claude + Daloopa MCP averages ~119 tokens and Gemini ~125—more narrative around the number.
- Claude (no MCP) is the longest by far at ~226 tokens (≈ 7–8× ChatGPT).
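For readers who want to reproduce this kind of measurement, per-answer token counts can be approximated with an off-the-shelf tokenizer. The sketch below uses tiktoken’s cl100k_base encoding purely as an assumption for illustration; it is not necessarily the tokenizer behind the averages above.

```python
import tiktoken  # pip install tiktoken

def average_tokens(answers: list[str], encoding_name: str = "cl100k_base") -> float:
    """Average token count per answer under the chosen encoding.

    Token counts are tokenizer-dependent; the averages reported above may
    have been computed with a different tokenizer.
    """
    enc = tiktoken.get_encoding(encoding_name)
    return sum(len(enc.encode(answer)) for answer in answers) / len(answers)

# Example: average_tokens(["1,458.7 million BRL (FY2022 Adjusted EBITDA)."])
```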
Why it matters
- There’s no positive correlation between verbosity and precision. Claude (no MCP) is the longest but lands near the bottom on exact match; MCP is longer than ChatGPT yet most accurate.
- For machine‑readable workflows, structured output formats with optional in‑line citations strike the best balance.
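A hypothetical example of such a structured answer, using the YDUQS figure and source link from earlier (the field names are illustrative, not a prescribed schema):

```python
import json

# Hypothetical structured answer format; the field names are illustrative,
# not a prescribed schema. The value and citation come from the YDUQS
# example earlier in this post.
answer = {
    "company": "YDUQS",
    "metric": "Adjusted EBITDA",
    "period": "FY2022",
    "value": 1458.7,
    "unit": "BRL millions",
    "citation": "https://marketplace.daloopa.com/src/69127815",
}

print(json.dumps(answer, indent=2))  # machine-readable and easy to validate downstream
```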
Figure 6. Chatbot verbosity: average token count per answer.
Conclusion
Across 500 single-number questions drawn from more than 4,000 public companies, exact financial retrieval remains challenging for general chatbots. The grounded setup, Claude + Daloopa MCP, delivered 91.6% exact-match accuracy, outperforming the next best general chatbot by 27 points; at ±1% tolerance it reached 94.7%, an 18.5-point lead (see the accuracy bar chart and tolerance curves). The advantage persisted in the hardest cases, companies with fiscal/calendar offsets, where MCP achieved 88.2% vs. 42.4% for the next best general model. The dominant failure modes were ambiguous question interpretation, rounding slips, and adjacent-period shifts; we also observed models calculating metrics instead of retrieving the reported figure, plus a meaningful share of non-answers/refusals in some systems. Verbosity varied widely (roughly 30 tokens per answer for ChatGPT vs. more than 200 for Claude), and more words did not translate into higher accuracy.
For research-grade workflows, the takeaway is clear: document-grounded retrieval, period/currency-aware normalization, and concise answers with verifiable citations are prerequisites, not nice-to-haves. To enable full auditability and replication, we’re releasing the complete dataset and publishing links to every chatbot conversation so readers can trace each result end-to-end. We invite the community to reproduce and extend these findings.
Questions about this benchmark or want to reproduce it? Reach out and we’ll help.
Methodology
Here’s what we measured and how we measured it:
- Task – Retrieve a single, explicitly reported number from a company’s official document.
- Ground truth – The figure exactly as written, unit/currency and period included, anchored to a page/section/table in the source.
- One turn, same instructions – We asked each chatbot in its web app to return the number (with unit/currency). No APIs, agents or wrappers.
- Measuring accuracy – Exact match under tight normalization rules: percentages must match when rounded to two decimals; monetary values must match in the asked unit/currency; thousands separators and whitespace are ignored, but scale and sign are not (an illustrative sketch follows this list).
- Transparency, not a rolled‑up score – We published the full dataset and links to the original conversations for every answer so anyone can inspect the traces.
- Error analysis – After scoring, we tagged misses into a small set of causes (ambiguous interpretation, rounding/formatting, period alignment, and refusals), as detailed in the Error breakdown section above.
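To make these rules concrete, here is a minimal exact-match sketch. It is our own reconstruction for illustration; the released evaluation harness remains the authoritative implementation.

```python
import re

def _to_number(text: str) -> float:
    """Parse a numeric string, ignoring thousands separators and whitespace."""
    return float(re.sub(r"[,\s]", "", text))

def exact_match(predicted: str, truth: str, is_percentage: bool = False) -> bool:
    """Exact-match check in the spirit of the rules above.

    Thousands separators and whitespace are ignored; percentages are compared
    after rounding both sides to two decimals; scale and sign are NOT
    normalized, so an answer in billions against a ground truth in millions
    (or a flipped sign) counts as a miss. Unit/currency checking is assumed
    to happen upstream, since each prompt asks for a specific unit.
    """
    try:
        p, t = _to_number(predicted), _to_number(truth)
    except ValueError:
        return False  # non-numeric or boilerplate output counts as a miss
    if is_percentage:
        return round(p, 2) == round(t, 2)
    return p == t

print(exact_match("1,458.7", "1458.7"))                   # True
print(exact_match("1459", "1,458.7"))                     # False (rounded answer)
print(exact_match("7.249", "7.25", is_percentage=True))   # True after 2-decimal rounding
```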
The full dataset, with a link to the original conversation for every answer, is available in our public Hugging Face dataset, keeping the setup easy to inspect.
Distribution of question categories
We asked mostly core operating questions. Revenue leads with 100 prompts (≈20%), followed by Operating Metrics/KPIs (92) and Profitability/Margins (61). Production and cost questions (57) and Segment Performance (45) keep pressure on table lookups beyond headline numbers. The long tail of Guidance (34), Balance Sheet/Capital (26), Cash Flow/FFO (18), and Backlog/Orders (5) ensures coverage of capital intensity and liquidity topics.
Distribution of question categories (bucketed)
Figure 7. Distribution of question categories (n=500). The mix emphasizes top‑line and operating health, while still testing margins, costs, segments, and capital structure.
Distribution of answer currencies
Answers spanned 30+ denominations/units. USD dominates (232, ~46%), with a sizable JPY share (87, ~17%), reflecting issuer mix. We also include non‑currency figures like Percent (33) and generic Units (17) plus a long tail across EUR (22), INR (11), GBP (8), CNY (6), and many others—each ≤5%. This diversity stresses currency and scale normalization, a common source of rounding/formatting mistakes.
Distribution of answer currency (normalized)
Figure 8. Distribution of answer currencies/units (n=500). Heavy USD and JPY presence, plus a broad long tail and non‑currency units, make correct denomination handling essential.
Reproducibility & Limitations
- Reproducibility: the 500‑question set, ground‑truth snippets, prompts, and evaluation harness are open‑sourced for audit.
- Scope: this benchmark tests single‑number extraction from official documents. It does not evaluate complex multi‑step analysis, forecasting, or cross‑document aggregation.
- Sampling: companies were randomly sampled from a list of more than 4,000 names; sector, geography, and filing‑format imbalances may exist.
- Latency & cost: not measured here; results reflect quality only.