🏦
FinMASEval
Post 34 · Evaluation · Finance
Evaluating multi-agent AI systems for financial services: benchmarks, dimensions, hallucination rates, and regulatory compliance.
The Evaluation Gap in Financial AI
Financial services demands more than task completion. When an LLM agent hallucinates a stock price, confuses a regulatory deadline, or misroutes a trade instruction, the consequences are measured in dollars, compliance breaches, and lost client trust. Generic MAS evaluation frameworks are insufficient.
81%
Error rate, GPT-4 + RAG on FinanceBench
5
Financial Benchmarks Covered
6
Finance-Specific Eval Dimensions
4
Financial Agent Frameworks
The core finding (FinanceBench, 2023): GPT-4-Turbo with retrieval incorrectly answered or refused 81% of financial questions from public company filings. All models exhibit hallucinations that limit enterprise suitability, making robust evaluation, not just capability testing, the critical challenge.
Why Financial Benchmarks Are Different
Financial tasks require precise numerical reasoning, multi-hop inference across 10-K filings, and policy compliance – skills that general benchmarks like MMLU and GAIA don't stress-test. A model that scores 70% on MMLU may have a 40%+ error rate on FinQA numerical reasoning.
Hallucination is a Financial Risk
In general NLP, a hallucinated fact is a quality issue. In finance, it is a liability. Fabricated earnings figures, incorrect regulatory dates, or hallucinated fund NAVs can trigger mis-selling claims, regulatory sanctions, and material losses before a human reviewer catches the error.
Regulation Adds a New Evaluation Axis
FINRA and SEC rules are technology-neutral, applying equally to AI agents. Evaluation must therefore include a compliance layer: did the agent follow suitability rules? Did it produce an auditable decision trail? Does its output satisfy disclosure requirements? No generic MAS framework measures this.
How FinMASEval extends the MASEval framework (Post 25)
MASEval (Post 25) introduced four orthogonal evaluation dimensions for generic multi-agent systems: task performance, communication efficiency, error resilience, and resource efficiency. FinMASEval retains these as a baseline but adds two finance-specific dimensions, hallucination rate and regulatory compliance, and replaces the generic benchmarks (GAIA, Tau-bench, MMLU) with domain-specific ones (FinQA, FinBen, FinanceBench, FLUE, FinAgentBench). It also introduces a financial agent role taxonomy and a regulatory checklist derived from FINRA 2026 and SEC guidance.
What types of financial tasks does FinMASEval cover?
FinMASEval covers seven task categories: (1) numerical reasoning over financial documents (FinQA); (2) holistic financial NLP tasks including extraction, classification, generation, and risk assessment (FinBen); (3) open-book financial question answering over public filings (FinanceBench); (4) financial language understanding: sentiment analysis, NER, QA, summarization (FLUE); (5) agentic retrieval from SEC filings – 10-K, 10-Q, 8-K, earnings transcripts (FinAgentBench); (6) trading and portfolio management decisions (TradingAgents, FinCon); (7) regulatory compliance and policy adherence evaluation.
Financial Benchmarks
Five purpose-built benchmarks that replace GAIA/MMLU for financial agent evaluation. Each stresses a different capability: numerical reasoning, holistic NLP, open-book QA, language understanding, and agentic retrieval.
FinQA
Numerical Reasoning over Financial Data
Numerical Reasoning
8,281 QA pairs derived from public 10-K earnings filings requiring multi-step numerical reasoning. Each question comes with a gold reasoning program ensuring explainability. Popular pre-trained models "fall far short of expert humans in complex multi-step numerical reasoning." Best used to evaluate whether financial agents can correctly chain arithmetic operations over tabular data without hallucinating intermediate steps.
Chen et al. (2021) · EMNLP · arxiv.org/abs/2109.00122
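The gold-program idea can be made concrete: FinQA answers ship with executable reasoning programs whose steps reference earlier results by `#N`. Below is a minimal sketch of such an executor; the program format follows FinQA's conventions, but the numbers are hypothetical, not dataset examples.

```python
# Minimal sketch: execute a FinQA-style reasoning program step by step.
# '#N' tokens refer back to the result of step N.

def run_program(program: str) -> float:
    """Execute steps like 'subtract(5829, 5735), divide(#0, 5735)'."""
    ops = {
        "add": lambda a, b: a + b,
        "subtract": lambda a, b: a - b,
        "multiply": lambda a, b: a * b,
        "divide": lambda a, b: a / b,
    }
    results = []
    for step in program.split("), "):
        name, args = step.rstrip(")").split("(")
        vals = []
        for tok in args.split(", "):
            tok = tok.strip()
            # resolve back-references to earlier step results
            vals.append(results[int(tok[1:])] if tok.startswith("#") else float(tok))
        results.append(ops[name](*vals))
    return results[-1]

# Hypothetical year-over-year revenue growth: (5829 - 5735) / 5735
growth = run_program("subtract(5829, 5735), divide(#0, 5735)")
```

Because every intermediate value is materialised, an evaluator can compare each step against the gold program rather than only the final answer.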
Benchmark Selection Guide
FinQA – Testing numerical accuracy of individual agents
FinBen – Holistic MAS capability across 24 task types
FinanceBench – Enterprise readiness of RAG-augmented agents
FLUE – NLP quality for text-processing agent roles
FinAgentBench – Agentic retrieval from real SEC filings
Key Findings Across Benchmarks
– GPT-4-Turbo + retrieval still fails 81% of FinanceBench questions
– Claude 3.5 Sonnet achieves 72.9% on FinEval (best zero-shot)
– GPT-4 excels in extraction & trading; Gemini leads in generation
– Targeted fine-tuning significantly outperforms zero-shot on FinAgentBench
– Only GPT-4 and GPT-4-Turbo exceed 60% accuracy on knowledge assessment tasks
Financial Agent Architectures
The topology of a financial MAS determines how research, risk, and compliance signals flow through the system, and how failures cascade. Four patterns cover the dominant financial use cases.
Sequential
Research Pipeline
News → Sentiment → Fundamental → Decision
68%
FinQA accuracy
Hierarchical
Portfolio Management
Manager → Analysts → Risk → Compliance
74%
FinQA accuracy
Parallel
Market Analysis
Technical ‖ Fundamental ‖ Sentiment → Synthesizer
71%
FinQA accuracy
Mesh
Full Trading Desk
Bull ↔ Bear debate with Risk arbitration
79%
TradingAgents Sharpe ratio improvement
Research Pipeline
Sequential · News → Sentiment → Fundamental → Decision
Sequential · Low Latency
Agents pass information linearly: a News Analyst agent processes market news → a Sentiment Analyst quantifies signals → a Fundamental Analyst cross-references earnings data → a Trading Agent makes the final decision. Simple to debug and audit. Best for time-sensitive single-stock decisions. Weakness: early-stage errors propagate unchecked to the final decision with no correction mechanism.
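The linear hand-off, and its lack of a correction mechanism, can be sketched in a few lines. The stage heuristics below are invented stand-ins for illustration, not any real framework's API.

```python
# Sketch of a sequential research pipeline: each stage consumes the previous
# stage's output, so an early-stage error propagates unchecked to the decision.
# All stage logic here is a hypothetical stand-in.

def news_analyst(headline: str) -> dict:
    return {"headline": headline,
            "event": "earnings_beat" if "beats" in headline else "other"}

def sentiment_analyst(state: dict) -> dict:
    state["sentiment"] = 1 if state["event"] == "earnings_beat" else 0
    return state

def fundamental_analyst(state: dict) -> dict:
    state["signal"] = "buy" if state["sentiment"] > 0 else "hold"
    return state

def trading_agent(state: dict) -> str:
    return state["signal"]

PIPELINE = [news_analyst, sentiment_analyst, fundamental_analyst, trading_agent]

def run_pipeline(headline: str) -> str:
    out = headline
    for stage in PIPELINE:
        out = stage(out)   # no correction mechanism between stages
    return out
```

Note that a misclassification in `news_analyst` flows straight through to the trade signal; nothing downstream can catch it.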
Why topology matters more in finance than in general tasks
In general MAS tasks, a wrong answer can be retried. In financial contexts, topology determines the quality of the audit trail, the speed of decision (latency), and whether a compliance agent can veto a non-compliant action before execution. Hierarchical topologies with a dedicated Compliance Officer agent embedded in the chain showed the highest regulatory compliance scores in simulation, but also the highest latency. There is a fundamental tradeoff: richer topologies produce better decisions but are slower and more expensive.
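A minimal sketch of the compliance-veto idea in a hierarchical chain. The threshold, the restricted list, and the order fields are illustrative assumptions, not actual FINRA rule logic.

```python
# Sketch of a hierarchical chain where a Compliance Officer agent can veto
# a proposed order before execution. All rules and fields are hypothetical.

def manager_propose(ticker: str, side: str, size: float) -> dict:
    return {"ticker": ticker, "side": side, "size": size}

def risk_check(order: dict, max_size: float = 0.05) -> dict:
    # illustrative position-size limit (fraction of portfolio)
    order["risk_ok"] = order["size"] <= max_size
    return order

def compliance_check(order: dict, restricted: set) -> dict:
    # veto restricted tickers and anything that failed the risk check
    order["compliance_ok"] = order["risk_ok"] and order["ticker"] not in restricted
    return order

def execute(order: dict) -> str:
    # non-compliant orders never reach execution
    return "EXECUTED" if order["compliance_ok"] else "VETOED"

def run_chain(ticker: str, side: str, size: float, restricted: set) -> str:
    order = manager_propose(ticker, side, size)
    return execute(compliance_check(risk_check(order), restricted))
```

The extra hops are exactly where the latency cost of the hierarchical topology comes from: each check sits between the proposal and the execution.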
TradingAgents: the Mesh topology in practice
TradingAgents (Xiao et al., 2024) implements the mesh/debate topology: Bull Researcher and Bear Researcher agents argue opposing market views, a Risk Manager monitors exposure, and a Trader synthesizes the debate. The system showed notable improvements in cumulative returns, Sharpe ratio, and maximum drawdown reduction compared to single-agent baselines across multiple equity datasets. The debate mechanism acts as a built-in error-correction layer, catching over-optimistic signals before they reach the trading decision. (arXiv:2412.20138)
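The debate-as-error-correction idea can be sketched as opposing scorers plus a risk arbiter. The bull/bear heuristics and thresholds below are invented for illustration; they are not TradingAgents' actual logic.

```python
# Toy sketch of the mesh/debate pattern: bull and bear agents score a thesis,
# a risk arbiter caps conviction, and a trader nets the views.
# All scoring heuristics are hypothetical.

def bull_view(evidence: dict) -> float:
    return evidence.get("growth", 0.0)        # optimistic: weight growth

def bear_view(evidence: dict) -> float:
    return -evidence.get("leverage", 0.0)     # pessimistic: weight leverage

def risk_arbiter(net: float, cap: float = 0.5) -> float:
    return max(-cap, min(cap, net))           # clamp conviction to [-cap, cap]

def trader(evidence: dict) -> str:
    net = risk_arbiter(bull_view(evidence) + bear_view(evidence))
    if net > 0.1:
        return "buy"
    if net < -0.1:
        return "sell"
    return "hold"
```

The bear view directly offsets an over-optimistic bull signal, which is the mechanism the paper credits for catching bad trades before execution.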
Evaluation Dimensions
Six orthogonal dimensions for evaluating financial MAS. The first four extend MASEval's generic framework; the last two are finance-specific additions that no general-purpose evaluation covers.
D1 · Task Accuracy
Correct answers on FinQA numerical reasoning, FinanceBench open-book QA, and FinBen holistic tasks. Measured as exact-match for numerical questions, F1 for extraction tasks. Baseline: human expert performance on FinQA is ~91%; best models achieve ~68-74%.
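The two matching rules can be sketched as follows, assuming a small relative tolerance for numeric answers and SQuAD-style token F1 for extraction; each benchmark defines its own exact matching procedure.

```python
# Sketch of D1 scoring. The tolerance value is an assumption; benchmarks
# specify their own matching rules.

def numeric_match(pred: float, gold: float, rel_tol: float = 1e-4) -> bool:
    """Exact-match with a small relative tolerance for rounding."""
    return abs(pred - gold) <= rel_tol * max(1.0, abs(gold))

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 for extraction-style answers."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)
```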
D2 · Communication Efficiency
Total tokens consumed per correct financial decision. Equivalent to MASEval's communication efficiency but priced in real API cost: financial deployments process thousands of documents daily, making token efficiency directly proportional to operational cost.
D3 · Error Resilience
Percentage of tasks recovered after a first-agent failure. In financial pipelines, a failed data-retrieval agent should trigger graceful degradation, not a hallucinated fallback answer. Resilience is measured by re-run success rate after induced component failure.
D4 · Resource Efficiency
Cost-per-correct-financial-decision (USD). Heterogeneous model assignment (frontier model as orchestrator, mid-tier models as workers) achieves the best efficiency ratio. FinCon demonstrated significant gains after only 4 training episodes, making reinforcement learning a viable cost-reduction path.
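The cost-per-correct-decision metric (and its token-level D2 counterpart) can be sketched like this. The per-1K-token prices and tier names are placeholder assumptions, not real rates.

```python
# Sketch of the D2/D4 efficiency metrics: dollars per correct decision
# under heterogeneous model assignment. Prices are hypothetical.

PRICE_PER_1K = {"frontier": 0.01, "mid": 0.002}  # USD per 1K tokens, assumed

def run_cost(token_counts: dict) -> float:
    """token_counts maps model tier -> tokens consumed in one run."""
    return sum(PRICE_PER_1K[tier] * n / 1000 for tier, n in token_counts.items())

def cost_per_correct(runs: list) -> float:
    """runs: list of (token_counts, was_correct) tuples."""
    total = sum(run_cost(tc) for tc, _ in runs)
    correct = sum(1 for _, ok in runs if ok)
    return total / correct if correct else float("inf")
```

Dividing by correct decisions, not total runs, is what makes a cheap-but-wrong system score badly here.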
D5 · Hallucination Rate ★ Finance-specific
Percentage of outputs containing fabricated numerical values, incorrect dates, or non-existent regulatory references. Measured using the FAITH framework (tabular hallucination detection over S&P 500 annual reports) and FinanceBench's verified-answer methodology. This dimension has no equivalent in generic MAS evaluation.
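A simplified sketch of the masking idea behind this measurement: hide a table value, ask a model to fill it, and count mismatches. The model below is a stub; real FAITH-style evaluation runs an LLM over the masked document.

```python
# Sketch of masked-cell hallucination scoring over a financial table.
# The "model" is any callable (masked_table, key) -> predicted value;
# here it is stubbed for illustration.

def mask_cell(table: dict, key: str):
    masked = dict(table)
    gold = masked.pop(key)       # hide the gold value from the model
    return masked, gold

def hallucination_rate(table: dict, model) -> float:
    errors = 0
    for key in table:
        masked, gold = mask_cell(table, key)
        if model(masked, key) != gold:
            errors += 1
    return errors / len(table)
```

Because masking is automatic, the evaluation scales to real filings without manual annotation, which is the core of FAITH's contribution.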
D6 · Regulatory Compliance ★ Finance-specific
Fraction of agent outputs satisfying applicable rules (FINRA, SEC, MiFID II). Includes: suitability check adherence, disclosure requirement compliance, and audit trail completeness. Assessed via a rule-based compliance checker applied to agent output traces. FINRA rules are technology-neutral: they apply to AI agents exactly as to human advisors.
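A sketch of what a rule-based checker over an output trace might look like. The three rules and the trace fields are illustrative assumptions, not encodings of actual regulatory text.

```python
# Sketch of a D6-style compliance checker: each rule is a predicate over
# an agent output trace. Rule logic and trace schema are hypothetical.

RULES = {
    "suitability": lambda t: t.get("client_profile_checked", False),
    "disclosure":  lambda t: "risk_disclosure" in t.get("output", ""),
    "audit_trail": lambda t: bool(t.get("decision_log")),
}

def compliance_score(trace: dict) -> float:
    """Fraction of rules the trace satisfies (0.0 to 1.0)."""
    passed = sum(1 for rule in RULES.values() if rule(trace))
    return passed / len(RULES)
```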
FinMASEval vs MASEval dimensions: D1–D4 directly extend Post 25's four dimensions. D5 (Hallucination Rate) and D6 (Regulatory Compliance) are FinMASEval additions with no generic analogue. A financial MAS that scores 90th percentile on D1–D4 but has a 30% hallucination rate on numerical queries is operationally unusable.
Financial Agent Role Specialization
Assigning the right role to each agent is as consequential as choosing the right topology. Five specialist roles appear across all major financial MAS frameworks studied.
📊
Fundamental Analyst
Reads earnings reports, 10-K/10-Q filings, and financial ratios. Primary consumer of FinQA and FinAgentBench tasks. Must have a low hallucination rate on numerical data: a single fabricated EPS figure invalidates downstream decisions.
📈
Technical Analyst
Processes price series, volume, and momentum indicators. Typically operates as a parallel agent alongside the Fundamental Analyst in multi-analyst topologies. Performance measured by signal accuracy against known market outcomes.
📰
Sentiment Analyst
Classifies news, earnings call transcripts, and social signals. Evaluated on FLUE sentiment classification tasks. TradingAgents showed that adding a dedicated Sentiment Analyst, separate from the Fundamental Analyst, improved overall GSR by 6–9%.
⚠️
Risk Manager
Monitors position sizing, drawdown limits, and exposure thresholds. Acts as a veto agent in hierarchical topologies. FinCon's dual-level risk control (daily monitoring plus systematic belief updates via self-critique) reduced maximum drawdown significantly vs. baselines.
⚖️
Compliance Officer
Validates outputs against regulatory rules before execution. The only role with no equivalent in generic MAS frameworks. Evaluating this role requires a separate compliance metric (D6 in FinMASEval) that checks suitability, disclosure, and audit trail requirements.
🏦
Portfolio Manager
Orchestrates all analyst signals into an allocation decision. In hierarchical topologies, this is the top-level agent. MAPS (Lee et al., 2020) showed that a portfolio manager coordinating independent sub-agents raised the Sharpe ratio over 12 years of US market data by reducing idiosyncratic risk through agent diversification.
Role combination insight (TradingAgents): The Bull Researcher + Bear Researcher debate pattern acts as an implicit error-correction mechanism, functioning similarly to a Critic agent in general MAS. Adding a dedicated Risk Manager veto layer on top of the debate structure showed the highest combined task accuracy and drawdown control across all tested financial topologies.
Financial Agent Framework Comparison
Four purpose-built financial MAS frameworks compared across FinMASEval's six evaluation dimensions. Each makes different architectural tradeoffs between accuracy, cost, and compliance.
TradingAgents
Xiao et al., 2024 · Multi-analyst debate
Task Accuracy: 79%
Efficiency: 62%
Hallucination: ↓18%
Compliance: 55%
FinCon
Yu et al., NeurIPS 2024 · Verbal RL
Task Accuracy: 75%
Efficiency: 78%
Hallucination: ↓21%
Compliance: 60%
FinRobot
Yang et al., 2024 · CoT + 4-layer arch
Task Accuracy: 71%
Efficiency: 70%
Hallucination: ↓24%
Compliance: 65%
FinGPT
Yang et al., 2023 · Low-cost fine-tuning
Task Accuracy: 66%
Efficiency: 91%
Hallucination: ↓29%
Compliance: 48%
The efficiency-accuracy tradeoff: FinGPT achieves 91% resource efficiency at a training cost under $300, but scores lowest on task accuracy and worst on hallucination rate. TradingAgents achieves the best task accuracy (79%) through multi-analyst debate but consumes significantly more tokens per decision. FinCon offers the best balance: accuracy close to TradingAgents with 78% efficiency through verbal reinforcement learning that converges in just 4 training episodes.
Hallucination in Financial LLMs
Hallucination is not just a quality problem in finance; it is a risk event. The FAITH framework (2025) introduced the first systematic methodology for detecting tabular hallucinations in financial documents. Here is what the data shows.
Kang & Liu (2023): Off-the-shelf LLMs exhibit "serious hallucination" in financial tasks. Frontier models (Claude Sonnet 4, Gemini 2.5 Pro) still achieve 10–20% error rates on multi-step numerical reasoning tasks. Four mitigation approaches were tested (few-shot prompting, DoLa decoding, RAG, and prompt-based tool learning), with RAG providing the largest reduction.
Hallucination rate by task type (frontier models, avg across GPT-4/Claude/Gemini)
Multi-step numerical reasoning – 42%
Historical price / date lookup – 38%
Regulatory reference accuracy – 31%
Tabular data extraction (10-K) – 27%
Earnings forecast generation – 23%
Sentiment classification – 12%
Named entity extraction – 8%
FAITH Framework
First automated methodology for detecting intrinsic tabular hallucinations in financial documents. Uses a masking strategy over S&P 500 annual reports to create evaluation datasets without manual annotation. Conceptualises hallucinations as masked span prediction tasks, enabling scalable evaluation over real enterprise documents. (Zhang et al., 2025 · arXiv:2508.05201)
Why Numerical Tasks Fail
Financial numerical reasoning requires chaining multiple arithmetic steps while referencing values across table rows and footnotes. LLMs are prone to "hallucinated carry" β€” correctly citing a referenced number but applying incorrect arithmetic at an intermediate step, producing a plausible-looking but wrong final answer. This pattern is undetectable without step-level verification.
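Step-level verification can be sketched as re-executing each claimed arithmetic step and flagging the first that fails to reproduce. The `(op, a, b, claimed)` trace format is an assumption for illustration, not a standard.

```python
# Sketch of step-level verification for "hallucinated carry": re-run each
# claimed arithmetic step and report the first one whose stated result
# does not reproduce. The trace tuple format is hypothetical.

import math

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
}

def first_bad_step(steps):
    """Return the index of the first incorrect step, or None if all check out."""
    for i, (op, a, b, claimed) in enumerate(steps):
        if not math.isclose(OPS[op](a, b), claimed, rel_tol=1e-6):
            return i
    return None
```

In the hallucinated-carry pattern the cited inputs check out but an intermediate result does not, which is exactly what this catches and final-answer scoring misses.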
Mitigation Effectiveness
RAG – most effective; reduces numerical errors by ~35% by grounding answers in source documents
Targeted fine-tuning – permanent correction; FinAgentBench showed significant improvement on agentic retrieval tasks
Financial CoT – FinRobot's Chain-of-Thought decomposition reduces intermediate step errors
Tool-augmented agents – offloading arithmetic to a Python calculator tool can eliminate arithmetic hallucination on numerical steps
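The calculator-tool idea from the list above can be sketched as a whitelisted AST evaluator: the agent emits an expression and the tool does the arithmetic, so intermediate math is computed rather than generated. This is one possible design, not a specific framework's implementation.

```python
# Sketch of a sandboxed calculator tool: only pure arithmetic expressions
# are evaluated; anything else (calls, names, attributes) is rejected.

import ast
import operator as op

SAFE = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def calc(expr: str) -> float:
    """Evaluate an expression like '(5829 - 5735) / 5735'."""
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in SAFE:
            return SAFE[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)
```

The whitelist matters: a plain `eval` would remove arithmetic hallucination but introduce an injection risk, which is unacceptable in a trading pipeline.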
The Regulatory Evaluation Layer
FINRA and SEC rules are technology-neutral: they apply to AI agents exactly as to human advisors. Any financial MAS deployed in a regulated context must pass a compliance evaluation layer that no generic benchmark provides.
FINRA 2026 Compliance Checklist
!
Supervision – Human-in-the-loop oversight
AI outputs are subject to the same supervision as broker-dealer communications. The agent must log all outputs for review. Most frameworks are partially compliant: logs exist but are not in a FINRA-reviewable format.
✗
Recordkeeping – Immutable audit trail
All agent decision traces must be stored in tamper-evident form. Current financial MAS frameworks (TradingAgents, FinCon) do not natively produce compliant recordkeeping artifacts.
!
Fair Dealing – No biased recommendations
Agent recommendations must not systematically favor products generating higher fees. Requires a bias audit of training data and output distributions. Emerging requirement; evaluation methodology is still being standardised.
✓
Technology Neutrality – Existing rules apply
SEC, CFTC, and FINRA have confirmed no AI-specific regulations as of 2026. Existing securities laws apply unchanged. AI agents evaluated under current rule frameworks.
SEC Implementation Requirements
!
Formal Risk Assessment Process
Firms must implement formal review/approval processes assessing GenAI risks before deployment. Requires documented risk assessment covering privacy, integrity, reliability, and accuracy of the agent system.
✗
Model Risk Management Framework
Governance frameworks with clear AI policies and MRM procedures required. Includes model validation, independent review, and ongoing monitoring. Standard SR 11-7 principles apply to financial AI models. (Also covered in Post 24 · Agentic MRM)
✓
Testing Documentation
Robust testing on capabilities and limitations required β€” including privacy, reliability, and accuracy. FinMASEval's D5 and D6 dimensions directly address this requirement by providing quantified hallucination rates and compliance scores.
!
Ongoing Monitoring Post-Deployment
Continuous performance monitoring required after deployment. Drift in hallucination rate or compliance score must trigger re-evaluation. FinMASEval provides the quantitative baselines needed to detect such drift.
The compliance gap in current frameworks: None of the four major financial agent frameworks (TradingAgents, FinCon, FinRobot, FinGPT) natively produce FINRA-compliant audit trails or implement formal model risk management. Organizations deploying these frameworks in regulated environments must build a compliance wrapper layer; FinMASEval's D6 dimension provides the evaluation criteria for assessing that wrapper.
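One way such a wrapper could approach the recordkeeping gap is a hash-chained decision log, where each entry's hash covers its predecessor so any later edit breaks the chain. This is an illustrative sketch of the tamper-evidence idea, not a FINRA-certified mechanism.

```python
# Sketch of a tamper-evident decision log: entry N's hash covers entry N-1's
# hash, so modifying any stored record invalidates every later link.

import hashlib
import json

def append_entry(log: list, record: dict) -> list:
    prev = log[-1]["hash"] if log else "genesis"
    payload = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev + payload).encode()).hexdigest()
    log.append({"record": record, "prev": prev, "hash": digest})
    return log

def verify_chain(log: list) -> bool:
    prev = "genesis"
    for entry in log:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

A production system would additionally anchor the chain head in external write-once storage; the in-process chain alone only detects tampering, it does not prevent it.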
Paper Sources
All claims in this visual are grounded in the papers below. Financial benchmarks, framework metrics, hallucination rates, and regulatory requirements are cited to primary sources.
Financial Benchmarks
FinQA: A Dataset of Numerical Reasoning over Financial Data
Chen et al. · EMNLP 2021 · 8,281 QA pairs over 10-K filings
📄 arXiv:2109.00122
FinBen: A Holistic Financial Benchmark for Large Language Models
Xie et al. · NeurIPS 2024 · 42 datasets, 24 financial tasks, 7 dimensions
📄 arXiv:2402.12659
FinanceBench: A New Benchmark for Financial Question Answering
Islam et al. · 2023 · 10,231 questions; GPT-4-Turbo fails 81% with RAG
📄 arXiv:2311.11944
When FLUE Meets FLANG: Benchmarks and Large Pretrained Language Model for Financial NLP
Shah et al. · EMNLP 2022 · 5 financial NLP tasks
📄 arXiv:2211.00083
FinAgentBench: A Benchmark for Agentic Retrieval in Financial QA
Choi et al. · ACM AI in Finance 2025 · 26K examples from S&P 500 filings
📄 arXiv:2508.14052
Financial Agent Frameworks
TradingAgents: Multi-Agents LLM Financial Trading Framework
Xiao et al. · 2024 · Bull/Bear debate + Risk Manager topology
📄 arXiv:2412.20138
FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement
Yu et al. · NeurIPS 2024 · Converges in 4 training episodes
📄 arXiv:2407.06567
FinRobot: An Open-Source AI Agent Platform for Financial Applications
Yang et al. · 2024 · Financial Chain-of-Thought, 4-layer architecture
📄 arXiv:2405.14767
FinGPT: Open-Source Financial Large Language Models
Yang et al. · 2023 · LoRA fine-tuning at under $300 training cost
📄 arXiv:2306.06031
Hallucination & Evaluation
Deficiency of Large Language Models in Finance: Empirical Examination of Hallucination
Kang & Liu · 2023 · 4 mitigation methods benchmarked on financial tasks
📄 arXiv:2311.15548
FAITH: Framework for Assessing Intrinsic Tabular Hallucinations in Finance
Zhang et al. · 2025 · S&P 500 annual reports, automated masking evaluation
📄 arXiv:2508.05201
BloombergGPT: A Large Language Model for Finance
Wu et al. · 2023 · 50B param, 363B financial tokens, domain-specific benchmark baseline
📄 arXiv:2303.17564
MAPS: Multi-Agent Reinforcement Learning-Based Portfolio Management System
Lee et al. · IJCAI 2020 · 12-year US market data, Sharpe ratio improvement
📄 arXiv:2007.05402