Background and Methodology

Financial numbers are unforgiving. A misplaced decimal point or the wrong reporting period can cascade into a material misstatement, distorting an analysis, a financial model, or even an investment decision. Today’s large language models are remarkable at synthesizing language, but when the task shifts to extracting precise financial figures from official filings, the margin for error narrows dramatically, and this is where current systems often fall short.

To test how current LLM solutions perform at financial retrieval, our research team evaluated:

  • Five “ungrounded” chatbots: ChatGPT (GPT‑5 Thinking), Claude (Opus 4.1), Gemini (Gemini 2.5 Pro), Grok (Grok 4), and Perplexity Finance.
  • One “grounded” chatbot: Claude + Daloopa MCP (Opus 4.1 + MCP)

We challenged each chatbot to answer 500 single‑number prompts directly in each web app. The grounded chatbot has access to a dedicated financial-retrieval tool (Daloopa MCP), whereas the ungrounded chatbots have all their built-in web-search tools enabled, but no specific financial-retrieval tool. Here is everything we learned, and how we did it:

Accuracy

Figure 1. Exact numerical‑match accuracy by model (n=500). Bars show percentage of answers that exactly match the ground truth with normalized punctuation/commas but no rounding tolerance.

Key Takeaways

  • Most general‑purpose chatbots miss surprisingly often on single‑number financial questions, even with web, EDGAR, and finance‑specific search enabled.
  • Grounded retrieval wins: Claude + Daloopa MCP delivered 91.6% exact‑match accuracy, 27.4 points above ChatGPT. With a ±1% tolerance it reached 94.7%, an 18.5‑point lead.
  • Among chatbots without MCP connections, ChatGPT performed best for numeric retrieval (64.2% accuracy, 81.8% within ±5%). It can locate and read SEC filings, and will sometimes open company PDFs for non‑SEC issuers, but it still cites or relies on unvetted secondary sources (blogs/news) and often fails to ground answers to the exact passage.
  • Ambiguous interpretation is the main cause of failure: Across models, ~50% of errors were Ambiguous Question Handling, followed by Rounding/Formatting (~25%) and Non‑Answer/Refusal (~15%). Period‑alignment issues (Fiscal vs. Calendar + Period Shift) were meaningful but smaller (~11% combined). On the 170 fiscal/calendar‑offset companies, Claude + Daloopa MCP still leads (88.2%) vs. ~42.4% for the next best general model.
  • Misinterpretation & refusal patterns: Gemini showed the highest rate of question misinterpretations. Claude produced the most refusals/non‑answers; conditional accuracy on attempted answers was ~53%.
  • LLMs tend to calculate what’s already been reported: several non-grounded systems “compute” metrics instead of returning the reported figure, increasing error risk.

Error breakdown

We bucketed wrong answers into five mutually exclusive categories (a minimal tagging sketch follows the list):

  1. Ambiguous Question Handling: the model chooses an interpretation inconsistent with our labeling instructions.
  2. Rounding / Formatting: significant rounding, decimal misplacement, misplaced thousands separators, or sign errors (+/−).
  3. Non‑Answer / Refusal: returns boilerplate text or declines despite available ground truth.
  4. Fiscal vs. Calendar Period Conversion: mixing fiscal and calendar year/quarter or using the wrong reporting basis.
  5. Period Shift / Trailing Twelve Months: picking the prior/next period (e.g., Q3 vs. Q2) or TTM instead of the asked period.
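
To keep tagging consistent, each wrong answer receives exactly one label. A minimal sketch of how this bucketing could be encoded, purely illustrative and not our internal tooling (the names mirror the categories above):

```python
from collections import Counter
from enum import Enum

class ErrorCategory(Enum):
    """Mutually exclusive failure buckets; each wrong answer gets exactly one."""
    AMBIGUOUS = "Ambiguous Question Handling"
    ROUNDING_FORMATTING = "Rounding / Formatting"
    NON_ANSWER = "Non-Answer / Refusal"
    FISCAL_VS_CALENDAR = "Fiscal vs. Calendar Period Conversion"
    PERIOD_SHIFT = "Period Shift / Trailing Twelve Months"

def error_breakdown(tagged_misses: list[ErrorCategory]) -> dict[str, float]:
    """Return each category's share of one chatbot's total misses."""
    counts = Counter(tagged_misses)
    total = sum(counts.values()) or 1  # avoid division by zero for a perfect run
    return {cat.value: counts[cat] / total for cat in ErrorCategory}
```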

The following figure breaks down which mistakes each model tends to make. Three patterns stand out:

  • Grounding with an MCP reduces both ambiguity and period mistakes. With MCP, errors are few and mostly Rounding/Formatting (18) or Ambiguous (12); period issues are rare (6 fiscal, 5 shifts), and refusals are near‑zero (1).
  • General‑purpose models skew in different ways.
    • ChatGPT: largest bucket is Rounding/Formatting (76), then Ambiguous (62), with moderate Period Shift (24).
    • Grok 4: Ambiguous (85) and Rounding (80) are neck‑and‑neck; Period Shift (39) is higher than peers.
    • Perplexity Finance: dominated by Ambiguous (180), plus Rounding (84) and notable Refusals (35).
    • Claude (no MCP): unusually high Refusals (159); Rounding (83) and Ambiguous (73) follow.
    • Gemini 2.5 Pro: errors are overwhelmingly Ambiguous (351), with Rounding (47) next; period issues are smaller (6 fiscal, 21 shifts).

Net‑net: Ambiguous interpretation is the dominant failure mode overall, but for some models (e.g., ChatGPT) Rounding/Formatting is the single largest bucket—suggesting that stricter output formatting and unit normalization could meaningfully lift accuracy even without changing retrieval.

Error distribution by chatbot (sorted by total errors)

Figure 2. Error distribution by chatbot (n=500 per model; grouped bars). Counts per category: Fiscal vs. Calendar, Period Shift, Rounding/Formatting, Ambiguous Interpretation, Non‑Answer/Refusal.

Examining errors by category and LLM

Rounding and interpretation errors.

For the question below, ChatGPT and Claude with Daloopa MCP are correct; Grok 4 rounded its answer to 1,459 (off by 0.3 million); Gemini and Perplexity returned entirely different numbers (ambiguous question handling); and Claude without MCP could not locate the adjusted figure.

Question:

What was YDUQS’ full year 2022 Adjusted EBITDA in BRL millions?

Expected answer:

YDUQS’ Adjusted EBITDA for the full year 2022 was [1,458.7 million](https://marketplace.daloopa.com/src/69127815) BRL.

ChatGPT:
R$ 1,458.7 million. (yduqs.com.br)

Claude:

I’ll search for YDUQS’ 2022 Adjusted EBITDA information for you.Let me search for more specific information about YDUQS’ 2022 financial results.Let me fetch the YDUQS investor relations page directly to find their 2022 financial reports.Let me search more specifically for YDUQS’ 2022 financial results.Let me search for YDUQS’ full year 2022 EBITDA specifically.Let me fetch YDUQS’ presentation page to find their 2022 annual report.Let me search for YDUQS’ complete 2022 financial results.Let me search once more for YDUQS’ complete 2022 annual results with their adjusted EBITDA figure.Let me search for YDUQS’ 2022 annual report or investor presentation with the complete financial results.Let me search for the exact YDUQS 2022 full year EBITDA figure with a more specific query.Based on my search of YDUQS’ financial reports, I found information about their 2022 EBITDA performance:

For the full year 2022, YDUQS reported EBITDA of R$ 1.33 billion, which was 8.2% higher than the previous year.

Additionally, I found the following quarterly Adjusted EBITDA figures for 2022:

– Q1 2022: R$ 400.4 million

– Q4 2022: R$ 313.5 million

While I found these specific quarterly figures and the annual EBITDA of R$ 1.33 billion, I was unable to locate the specific full year 2022 **Adjusted** EBITDA figure in the available financial reports and announcements. The R$ 1.33 billion figure appears to be the reported EBITDA rather than the adjusted figure.

If you need the precise adjusted EBITDA figure for 2022, I’d recommend checking YDUQS’ official investor relations website or their complete 2022 annual report, which would contain the detailed reconciliation between reported and adjusted EBITDA.

Claude with Daloopa MCP:

I’ll help you find YDUQS’ full year 2022 Adjusted EBITDA. Let me search for the company in the Daloopa database first.Now let me search for the available EBITDA series for YDUQS, focusing on the 2022 full year period.Now let me retrieve the specific Adjusted EBITDA data for full year 2022.YDUQS’ full year 2022 Adjusted EBITDA was [R$1,458.7 million](https://daloopa.com/src/69127815).

Grok:

**1,459**

YDUQS (Yduqs Participações S.A.) reported an Adjusted EBITDA of R$1,459 million for the full year 2022, up 9% from R$1,337 million in 2021. This figure comes from the company’s official 4Q22 and full-year earnings presentation, which highlights operational growth in premium, on-campus, and digital segments, along with an improved EBITDA margin. It aligns with quarterly breakdowns: approximately R$1,145 million for the first nine months and R$314 million for the fourth quarter. Independent sources like MarketScreener confirm the 2022 value at R$1,459 million.

Gemini:

For the full fiscal year of 2022, Brazilian education company YDUQS reported an Adjusted EBITDA of **BRL 1,543.9 million**.

This figure, which translates to approximately 1.54 billion Brazilian Reais, was a key indicator of the company’s operational profitability for that period. The result was detailed in YDUQS’s annual financial statements for the year ending December 31, 2022. Adjusted EBITDA (Earnings Before Interest, Taxes, Depreciation, and Amortization) is a non-GAAP metric used by the company to provide a clearer view of its operational performance by excluding certain non-cash and non-recurring expenses.

Perplexity Finance:

YDUQS’ full year 2022 Adjusted EBITDA was BRL 1,107 million. yduqs

Fiscal/calendar period confusion

In the following question:

  • Claude with Daloopa MCP and Perplexity answered correctly.
  • ChatGPT and Grok used the fiscal 2023 value instead of fiscal 2022.
  • Gemini misinterpreted the question.
  • Claude was not able to find it.

Question: What was Hazama Ando Corp’s overseas building construction backlog at the end of fiscal year 2022 in million Japanese yen?

Expected answer:

Hazama Ando Corp’s overseas building construction backlog at the end of fiscal year 2022 was 10,969 million Japanese yen.

ChatGPT:
¥14,569 million.

Claude:
I’ll search for information about Hazama Ando Corp’s overseas building construction backlog at the end of fiscal year 2022.Let me search for more specific financial information from Hazama Ando’s fiscal year 2022 report.Let me try to access the actual annual report or fact book that would contain the specific backlog figures.Let me search more specifically for Hazama Ando’s fiscal year 2022 financial data and backlog figures.Let me search for Hazama Ando’s specific fiscal year 2022 financial results including backlog information.Let me search for more specific financial data from their investor relations materials.Let me try to access their presentation materials directly.Let me search for the actual PDF document or financial presentation with the backlog figures.Let me search for more specific information about this figure.I apologize, but I was unable to find the specific overseas building construction backlog figure for Hazama Ando Corporation at the end of fiscal year 2022 (March 31, 2023) in the available search results.

While I found that Hazama Ando publishes a Corporate Report as a tool for communicating with stakeholders, and Corporate Report 2022 is an integrated report consisting of financial information, the actual backlog figures were not accessible through my searches.

To obtain this specific financial metric, I would recommend:

1. Checking Hazama Ando’s official investor relations page directly for their annual reports or fact books

2. Looking at their financial results presentations from May 2023 (when fiscal year 2022 results would have been announced)

3. Checking financial databases like Bloomberg or specialized construction industry reports that might have this detailed information

The company does have overseas operations, as it has overseas offices in Asia, especially in the South Asian countries like Nepal, as well as in the United States, Mexico, Central and South America, but the specific backlog amount was not available in the search results.

Claude + Daloopa MCP:
I’ll help you find Hazama Ando Corp’s overseas building construction backlog at the end of fiscal year 2022. Let me search for this information in the financial database.Now let me search for the overseas building construction backlog series for Hazama Ando Corp.Great! I found the series for overseas building construction backlog. Let me retrieve the specific value for fiscal year 2022.Based on the financial data from Daloopa, Hazama Ando Corp’s overseas building construction backlog at the end of fiscal year 2022 was **[¥10,969 million](https://daloopa.com/src/130889067)**.

 

Data sourced from Daloopa

Grok:

**14,569**

Hazama Ando Corporation’s consolidated financial results for the fiscal year ended March 31, 2023 (FY2022) report the year-end order backlog for overseas building construction as ¥14,569 million. This figure comes from the “Year-end backlog for the construction business” section, which breaks down the backlog by segment, including overseas building construction. The total construction backlog was ¥518,683 million, with building construction at ¥198,538 million (encompassing both domestic and overseas). Overseas activities were a small portion, reflecting limited international exposure in building projects.

Gemini:

For the fiscal year ended March 31, 2022, Hazama Ando Corporation’s overseas building construction backlog was **144,385 million Japanese yen**.

This figure represents the total value of uncompleted overseas building construction projects at the end of the fiscal period. The data was disclosed in the company’s financial results and supplemental materials, such as its investor fact book, for the fiscal year 2022.

Perplexity:

Hazama Ando Corp’s overseas building construction backlog at the end of fiscal year 2022 was approximately 10,969 million Japanese yen.ad-hzm

Results With Different Accuracy Thresholds

Exact match accuracy

On strict exact match (no tolerance; see Figure 1):

  • Claude + Daloopa MCP leads decisively at 91.6%.
  • The best general‑purpose baseline is ChatGPT at 64.2%, followed by Grok 4 at 57.2%. Perplexity Finance (34.2%) and Claude (Opus 4.1) (31.4%) trail, with Gemini 2.5 Pro last at 11.8%.
  • The roughly 27‑to‑80‑point gap between MCP and the ungrounded chatbots suggests most misses are not tiny rounding slips but source/normalization issues that grounded retrieval mitigates.

Accuracy at different tolerance thresholds

Allowing a little wiggle room helped—but unevenly across systems.

  • At ±1%, Claude Opus 4.1 + MCP rose to 94.7% while ChatGPT jumped to 76.2% and Grok to 70.3%. Claude (no MCP) made the largest early move, from 31.4% to 67.5%, a signal that many of its misses were rounding/formatting or near‑miss unit issues.
  • By ±10%, accuracies converged somewhat: MCP ~96.6%, ChatGPT 85.2%, Grok 80.1%, Claude (no MCP) ~79.2%, Perplexity 66.5%, and Gemini 44.5%.
  • At ±20%, ceilings appeared: MCP 97.2%, ChatGPT 89.5%, Claude (no MCP) 86.3%, Grok 84.7%, Perplexity 73.5%, Gemini 58%.

Conclusion: Tolerance narrows gaps but does not erase them – MCP’s lead persists at every threshold.
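
The tolerance sweep above is straightforward to reproduce from per-question results; here is a minimal sketch, assuming each record carries the model’s parsed numeric answer and the ground-truth value (field names are ours, not the released dataset’s schema):

```python
def within_tolerance(answer: float, truth: float, tol: float) -> bool:
    """True if the answer is within ±tol of the ground truth (tol=0.01 means ±1%)."""
    if truth == 0:
        return answer == 0
    return abs(answer - truth) / abs(truth) <= tol

def accuracy_at(records: list[dict], tol: float) -> float:
    """Share of answers inside the relative-error band.

    Each record is assumed to look like {"answer": 1458.7, "truth": 1458.7};
    refusals or unparseable replies are stored as answer=None and count as misses.
    """
    hits = sum(
        1
        for r in records
        if r["answer"] is not None and within_tolerance(r["answer"], r["truth"], tol)
    )
    return hits / len(records)

# Sweep the thresholds shown in Figure 3:
# for tol in (0.0, 0.01, 0.05, 0.10, 0.20):
#     print(f"±{tol:.0%}: {accuracy_at(records, tol):.1%}")
```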


Figure 3. Accuracy vs. tolerance (±% of ground truth). Lines show cumulative share of answers within the specified error band.

Accuracy in companies with fiscal/calendar period offsets

Period alignment was a key stress test. Of the 170 companies in the prompt set whose fiscal year does not match the calendar year, Claude + Daloopa MCP achieved 88.2% exact match. The next best results were ChatGPT at 42.4% and Grok at 40.6%; Perplexity (22.4%), Claude (no MCP) (17.6%), and Gemini (8.2%) lagged far behind. Grounding and explicit period normalization appeared to be the deciding factors here.
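
Period normalization is the kind of step a grounded pipeline handles explicitly. A minimal sketch of mapping a fiscal-year label to its calendar end date, assuming you know the fiscal-year-end month and the issuer’s labeling convention (Japanese filers such as Hazama Ando label the year by its start, so fiscal year 2022 ends March 31, 2023; many other issuers label by the ending year):

```python
from datetime import date, timedelta

def fiscal_year_end(fy_label: int, fye_month: int, labeled_by_start: bool) -> date:
    """Calendar date on which fiscal year `fy_label` closes.

    fye_month        -- month the fiscal year ends in (3 for a March year-end)
    labeled_by_start -- True if the issuer names the fiscal year after its starting
                        calendar year (common in Japan), False if after the ending year
    """
    end_year = fy_label + 1 if labeled_by_start else fy_label
    if fye_month == 12:
        return date(end_year, 12, 31)
    # Last day of the closing month: first day of the next month minus one day.
    return date(end_year, fye_month + 1, 1) - timedelta(days=1)

# fiscal_year_end(2022, 3, labeled_by_start=True) -> 2023-03-31 (Hazama Ando's FY2022)
```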

Accuracy on companies with fiscal/calendar period offsets

Figure 4. Exact‑match accuracy on companies with fiscal/calendar offsets (n=170). Bars show the subset where period alignment is required to answer correctly.

Rating precision, conciseness and insight

We also rated answers on three presentation dimensions (1–5 scale). Three patterns stand out:

  • Precision mirrors the main metric. Claude + MCP leads (4.8), with ChatGPT 4.2, Grok 3.9, Perplexity 3.0, Claude (no MCP) 2.8, Gemini 1.8.
  • Conciseness isn’t the same as correctness. ChatGPT is the most concise (4.8), followed by Grok (4.5) and Claude + MCP (3.8). Claude (no MCP) is the wordiest (2.3).
  • Insight trades off with brevity, sometimes unhelpfully. Claude (no MCP) scores highest on our “insight” rubric (3.5), followed by Gemini (2.9), but both come with lower precision. Claude + MCP balances context and accuracy (2.9 insight).

Average rating by category and chatbot

Figure 5. Average rating (1–5) for precision, conciseness, and insight by chatbot.

Verbosity

We also looked at how many tokens each system used per answer. Shorter isn’t always better, but long replies often bury the number.

What we see

  • ChatGPT is the most concise at ~29 tokens on average.
  • Grok (40) and Perplexity (43) are also fairly terse.
  • Claude + Daloopa MCP averages ~119 tokens and Gemini ~125—more narrative around the number.
  • Claude (no MCP) is the longest by far at ~226 tokens (≈ 7–8× ChatGPT).
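
Token counts like these can be approximated with an open tokenizer; a minimal sketch assuming OpenAI’s tiktoken package and its cl100k_base encoding (the benchmark does not standardize on a specific tokenizer, so treat the counts as approximate):

```python
import tiktoken

# Different vendors tokenize differently; cl100k_base is just a common reference point.
_ENC = tiktoken.get_encoding("cl100k_base")

def token_count(answer_text: str) -> int:
    """Approximate token length of a single chatbot answer."""
    return len(_ENC.encode(answer_text))

def average_tokens(answers: list[str]) -> float:
    """Average answer length in tokens for one chatbot across the benchmark."""
    return sum(token_count(a) for a in answers) / len(answers)
```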

Why it matters

  • There’s no positive correlation between verbosity and precision. Claude (no MCP) is the longest but lands near the bottom on exact match; MCP is longer than ChatGPT yet most accurate.
  • For machine‑readable workflows, structured output formats with optional in‑line citations strike the best balance (one possible shape is sketched below).
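
One possible shape for such a structured answer, shown here as a sketch rather than a format any of the tested chatbots emit by default: the value, its unit and scale, the reporting period, and a citation pointing at the exact source.

```python
from dataclasses import dataclass

@dataclass
class StructuredAnswer:
    """Illustrative machine-readable format for a single-number retrieval answer."""
    value: float               # the number itself, e.g. 1458.7
    unit: str                  # unit/currency and scale, e.g. "BRL millions"
    period: str                # reporting period, e.g. "FY2022"
    source_url: str            # link to the filing or data point backing the number
    quote: str | None = None   # optional in-line citation: the exact passage or table cell

# Example populated from the YDUQS question above:
example = StructuredAnswer(
    value=1458.7,
    unit="BRL millions",
    period="FY2022",
    source_url="https://marketplace.daloopa.com/src/69127815",
)
```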

Chatbot Verbosity: Average Token Count per Answer

Figure 6. Average token count per answer by chatbot.

Conclusion

Across 500 single-number questions drawn from 4,000+ public companies, exact financial retrieval remains challenging for general chatbots. The grounded setup, Claude + Daloopa MCP, delivered 91.6% exact-match accuracy, outperforming the next best general chatbot by roughly 27 points; at ±1% tolerance it reached 94.7%, an 18.5-point lead (see the accuracy bar chart and tolerance curves). The advantage persisted in the hardest cases, companies with fiscal/calendar offsets, where MCP achieved 88.2% vs. ~42.4% for the next best general model. The dominant failure modes were ambiguous question interpretation, rounding slips, and adjacent-period shifts; we also observed models calculating metrics instead of retrieving the reported figure, plus a meaningful share of non-answers/refusals in some systems. Verbosity varied widely (e.g., ~30 tokens for ChatGPT vs. >200 for Claude), and more words did not translate into higher accuracy.

For research-grade workflows, the takeaway is clear: document-grounded retrieval, period/currency-aware normalization, and concise answers with verifiable citations are prerequisites, not nice-to-haves. To enable full auditability and replication, we’re releasing the complete dataset and publishing links to every chatbot conversation so readers can trace each result end-to-end. We invite the community to reproduce and extend these findings.


Questions about this benchmark or want to reproduce it? Reach out and we’ll help.

Methodology

Here’s what we measured and how we measured it:

  • Task – Retrieve a single, explicitly reported number from a company’s official document.
  • Ground truth – The figure exactly as written, unit/currency and period included, anchored to a page/section/table in the source.
  • One turn, same instructions – We asked each chatbot in its web app to return the number (with unit/currency). No APIs, agents or wrappers.
  • Measuring accuracy – 
    • Correctness: exact match under tight normalization rules (a code sketch follows this list). Percentages must match when rounded to two decimals; monetary values must match in the asked unit/currency; we ignore thousands separators and whitespace, but not scale or sign.
  • Transparency, not a rolled up score – We published the full dataset and links to the original conversations for every answer so anyone can inspect the traces.
  • Error analysis – After scoring, we tagged misses into a small set of causes—ambiguous interpretation, rounding/formatting, period alignment, and refusals—as detailed in the Errors section.
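
A minimal sketch of the correctness check under these normalization rules, assuming the numeric string has already been isolated from the surrounding prose (helper names are ours, not the actual evaluation harness):

```python
def normalize(raw: str) -> float:
    """Drop thousands separators and whitespace; keep sign and scale as written."""
    cleaned = raw.replace(",", "").replace(" ", "").replace("\u2212", "-")  # \u2212 = Unicode minus
    return float(cleaned)

def exact_match(answer: str, truth: str, is_percentage: bool) -> bool:
    """Exact match: percentages compared after rounding to two decimals;
    monetary values must agree in the asked unit/currency (scale and sign matter)."""
    a, t = normalize(answer), normalize(truth)
    if is_percentage:
        return round(a, 2) == round(t, 2)
    return a == t
```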

The full dataset, including a link to the original conversation for every answer, is available in our public HuggingFace dataset, so the setup is easy to inspect.

Distribution of question categories

We asked mostly core operating questions. Revenue leads with 100 prompts (≈20%), followed by Operating Metrics/KPIs (92) and Profitability/Margins (61). Production and cost questions (57) and Segment Performance (45) keep pressure on table lookups beyond headline numbers. The long tail of Guidance (34), Balance Sheet/Capital (26), Cash Flow/FFO (18), and Backlog/Orders (5) ensures coverage of capital intensity and liquidity topics.

Distribution of question categories (bucketed)

Figure 7. Distribution of question categories (n=500). The mix emphasizes top‑line and operating health, while still testing margins, costs, segments, and capital structure.

Distribution of answer currencies

Answers spanned 30+ denominations/units. USD dominates (232, ~46%), with a sizable JPY share (87, ~17%), reflecting issuer mix. We also include non‑currency figures like Percent (33) and generic Units (17) plus a long tail across EUR (22), INR (11), GBP (8), CNY (6), and many others—each ≤5%. This diversity stresses currency and scale normalization, a common source of rounding/formatting mistakes.

Distribution of answer currency (normalized)

Figure 8. Distribution of answer currencies/units (n=500). Heavy USD and JPY presence, plus a broad long tail and non‑currency units, make correct denomination handling essential.

Reproducibility & Limitations

  • Reproducibility: the question set (500), ground‑truth snippets, prompts, and evaluation harness are open sourced for audit.
  • Scope: this benchmark tests single‑number extraction from official documents. It does not evaluate complex multi‑step analysis, forecasting, or cross‑document aggregation.
  • Sampling: companies were randomly sampled from a list of 4,000+ names; sector, geography, and filing‑format imbalances may exist.
  • Latency & cost: not measured here; results reflect quality only.