🏦
FinMASEval
Post 34 · Evaluation · Finance
Evaluating multi-agent AI systems for financial services: benchmarks, dimensions, hallucination rates, and regulatory compliance.
The Evaluation Gap in Financial AI
Financial services demands more than task completion. When an LLM agent hallucinates a stock price, confuses a regulatory deadline, or misroutes a trade instruction, the consequences are measured in dollars, compliance breaches, and lost client trust. Generic MAS evaluation frameworks are insufficient.
81%
Error rate, GPT-4 + RAG on FinanceBench
5
Financial Benchmarks Covered
6
Finance-Specific Eval Dimensions
4
Financial Agent Frameworks
The core finding (FinanceBench, 2023): GPT-4-Turbo with retrieval incorrectly answered or refused 81% of financial questions from public company filings. All models exhibit hallucinations that limit enterprise suitability, making robust evaluation, not just capability testing, the critical challenge.
Why Financial Benchmarks Are Different
Financial tasks require precise numerical reasoning, multi-hop inference across 10-K filings, and policy compliance – skills that general benchmarks like MMLU and GAIA don't stress-test. A model that scores 70% on MMLU may have a 40%+ error rate on FinQA numerical reasoning.
Hallucination is a Financial Risk
In general NLP, a hallucinated fact is a quality issue. In finance, it is a liability. Fabricated earnings figures, incorrect regulatory dates, or hallucinated fund NAVs can trigger mis-selling claims, regulatory sanctions, and material losses before a human reviewer catches the error.
Regulation Adds a New Evaluation Axis
FINRA and SEC rules are technology-neutral, applying equally to AI agents. Evaluation must therefore include a compliance layer: did the agent follow suitability rules? Did it produce an auditable decision trail? Does its output satisfy disclosure requirements? No generic MAS framework measures this.
How FinMASEval extends the MASEval framework (Post 25)
MASEval (Post 25) introduced four orthogonal evaluation dimensions for generic multi-agent systems: task performance, communication efficiency, error resilience, and resource efficiency. FinMASEval retains these as a baseline but adds two finance-specific dimensions, hallucination rate and regulatory compliance, and replaces the generic benchmarks (GAIA, Tau-bench, MMLU) with domain-specific ones (FinQA, FinBen, FinanceBench, FLUE, FinAgentBench). It also introduces a financial agent role taxonomy and a regulatory checklist derived from FINRA 2026 and SEC guidance.
What types of financial tasks does FinMASEval cover?
FinMASEval covers seven task categories: (1) numerical reasoning over financial documents (FinQA); (2) holistic financial NLP tasks including extraction, classification, generation, and risk assessment (FinBen); (3) open-book financial question answering over public filings (FinanceBench); (4) financial language understanding: sentiment analysis, NER, QA, summarization (FLUE); (5) agentic retrieval from SEC filings – 10-K, 10-Q, 8-K, earnings transcripts (FinAgentBench); (6) trading and portfolio management decisions (TradingAgents, FinCon); (7) regulatory compliance and policy adherence evaluation.
Financial Benchmarks
Five purpose-built benchmarks that replace GAIA/MMLU for financial agent evaluation. Each stresses a different capability: numerical reasoning, holistic NLP, open-book QA, language understanding, and agentic retrieval.
FinQA
Numerical Reasoning over Financial Data
Numerical Reasoning
8,281 QA pairs derived from public 10-K earnings filings requiring multi-step numerical reasoning. Each question comes with a gold reasoning program ensuring explainability. Popular pre-trained models "fall far short of expert humans in complex multi-step numerical reasoning." Best used to evaluate whether financial agents can correctly chain arithmetic operations over tabular data without hallucinating intermediate steps.
Chen et al. (2021) · EMNLP · arxiv.org/abs/2109.00122
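The gold-program idea can be made concrete: FinQA answers ship with executable reasoning programs whose steps reference earlier results by `#N`. Below is a minimal sketch of such an executor; the program format follows FinQA's conventions, but the numbers are hypothetical, not dataset examples.

```python
# Minimal sketch: execute a FinQA-style reasoning program step by step.
# '#N' tokens refer back to the result of step N.

def run_program(program: str) -> float:
    """Execute steps like 'subtract(5829, 5735), divide(#0, 5735)'."""
    ops = {
        "add": lambda a, b: a + b,
        "subtract": lambda a, b: a - b,
        "multiply": lambda a, b: a * b,
        "divide": lambda a, b: a / b,
    }
    results = []
    for step in program.split("), "):
        name, args = step.rstrip(")").split("(")
        vals = []
        for tok in args.split(", "):
            tok = tok.strip()
            # resolve back-references to earlier step results
            vals.append(results[int(tok[1:])] if tok.startswith("#") else float(tok))
        results.append(ops[name](*vals))
    return results[-1]

# Hypothetical year-over-year revenue growth: (5829 - 5735) / 5735
growth = run_program("subtract(5829, 5735), divide(#0, 5735)")
```

Because every intermediate value is materialised, an evaluator can compare each step against the gold program rather than only the final answer.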
Benchmark Selection Guide
FinQA – Testing numerical accuracy of individual agents
FinBen – Holistic MAS capability across 24 task types
FinanceBench – Enterprise readiness of RAG-augmented agents
FLUE – NLP quality for text-processing agent roles
FinAgentBench – Agentic retrieval from real SEC filings
Key Findings Across Benchmarks
– GPT-4-Turbo + retrieval still fails 81% of FinanceBench questions
– Claude 3.5 Sonnet achieves 72.9% on FinEval (best zero-shot)
– GPT-4 excels in extraction & trading; Gemini leads in generation
– Targeted fine-tuning significantly outperforms zero-shot on FinAgentBench
– Only GPT-4 and GPT-4-Turbo exceed 60% accuracy on knowledge assessment tasks
Financial Agent Architectures
The topology of a financial MAS determines how research, risk, and compliance signals flow through the system, and how failures cascade. Four patterns cover the dominant financial use cases.
Sequential
Research Pipeline
News → Sentiment → Fundamental → Decision
68%
FinQA accuracy
Hierarchical
Portfolio Management
Manager → Analysts → Risk → Compliance
74%
FinQA accuracy
Parallel
Market Analysis
Technical ‖ Fundamental ‖ Sentiment → Synthesizer
71%
FinQA accuracy
Mesh
Full Trading Desk
Bull ↔ Bear debate with Risk arbitration
79%
TradingAgents Sharpe ratio improvement
Research Pipeline
Sequential · News → Sentiment → Fundamental → Decision
Sequential · Low Latency
Agents pass information linearly: a News Analyst agent processes market news → a Sentiment Analyst quantifies signals → a Fundamental Analyst cross-references earnings data → a Trading Agent makes the final decision. Simple to debug and audit. Best for time-sensitive single-stock decisions. Weakness: early-stage errors propagate unchecked to the final decision with no correction mechanism.
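The linear hand-off, and its lack of a correction mechanism, can be sketched in a few lines. The stage heuristics below are invented stand-ins for illustration, not any real framework's API.

```python
# Sketch of a sequential research pipeline: each stage consumes the previous
# stage's output, so an early-stage error propagates unchecked to the decision.
# All stage logic here is a hypothetical stand-in.

def news_analyst(headline: str) -> dict:
    return {"headline": headline,
            "event": "earnings_beat" if "beats" in headline else "other"}

def sentiment_analyst(state: dict) -> dict:
    state["sentiment"] = 1 if state["event"] == "earnings_beat" else 0
    return state

def fundamental_analyst(state: dict) -> dict:
    state["signal"] = "buy" if state["sentiment"] > 0 else "hold"
    return state

def trading_agent(state: dict) -> str:
    return state["signal"]

PIPELINE = [news_analyst, sentiment_analyst, fundamental_analyst, trading_agent]

def run_pipeline(headline: str) -> str:
    out = headline
    for stage in PIPELINE:
        out = stage(out)   # no correction mechanism between stages
    return out
```

Note that a misclassification in `news_analyst` flows straight through to the trade signal; nothing downstream can catch it.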
Why topology matters more in finance than in general tasks
In general MAS tasks, a wrong answer can be retried. In financial contexts, topology determines the quality of the audit trail, the speed of decision (latency), and whether a compliance agent can veto a non-compliant action before execution. Hierarchical topologies with a dedicated Compliance Officer agent embedded in the chain showed the highest regulatory compliance scores in simulation, but also the highest latency. There is a fundamental tradeoff: richer topologies produce better decisions but are slower and more expensive.
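A minimal sketch of the compliance-veto idea in a hierarchical chain. The threshold, the restricted list, and the order fields are illustrative assumptions, not actual FINRA rule logic.

```python
# Sketch of a hierarchical chain where a Compliance Officer agent can veto
# a proposed order before execution. All rules and fields are hypothetical.

def manager_propose(ticker: str, side: str, size: float) -> dict:
    return {"ticker": ticker, "side": side, "size": size}

def risk_check(order: dict, max_size: float = 0.05) -> dict:
    # illustrative position-size limit (fraction of portfolio)
    order["risk_ok"] = order["size"] <= max_size
    return order

def compliance_check(order: dict, restricted: set) -> dict:
    # veto restricted tickers and anything that failed the risk check
    order["compliance_ok"] = order["risk_ok"] and order["ticker"] not in restricted
    return order

def execute(order: dict) -> str:
    # non-compliant orders never reach execution
    return "EXECUTED" if order["compliance_ok"] else "VETOED"

def run_chain(ticker: str, side: str, size: float, restricted: set) -> str:
    order = manager_propose(ticker, side, size)
    return execute(compliance_check(risk_check(order), restricted))
```

The extra hops are exactly where the latency cost of the hierarchical topology comes from: each check sits between the proposal and the execution.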
TradingAgents: the Mesh topology in practice
TradingAgents (Xiao et al., 2024) implements the mesh/debate topology: Bull Researcher and Bear Researcher agents argue opposing market views, a Risk Manager monitors exposure, and a Trader synthesizes the debate. The system showed notable improvements in cumulative returns, Sharpe ratio, and maximum drawdown reduction compared to single-agent baselines across multiple equity datasets. The debate mechanism acts as a built-in error-correction layer, catching over-optimistic signals before they reach the trading decision. (arXiv:2412.20138)
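The debate-as-error-correction idea can be sketched as opposing scorers plus a risk arbiter. The bull/bear heuristics and thresholds below are invented for illustration; they are not TradingAgents' actual logic.

```python
# Toy sketch of the mesh/debate pattern: bull and bear agents score a thesis,
# a risk arbiter caps conviction, and a trader nets the views.
# All scoring heuristics are hypothetical.

def bull_view(evidence: dict) -> float:
    return evidence.get("growth", 0.0)        # optimistic: weight growth

def bear_view(evidence: dict) -> float:
    return -evidence.get("leverage", 0.0)     # pessimistic: weight leverage

def risk_arbiter(net: float, cap: float = 0.5) -> float:
    return max(-cap, min(cap, net))           # clamp conviction to [-cap, cap]

def trader(evidence: dict) -> str:
    net = risk_arbiter(bull_view(evidence) + bear_view(evidence))
    if net > 0.1:
        return "buy"
    if net < -0.1:
        return "sell"
    return "hold"
```

The bear view directly offsets an over-optimistic bull signal, which is the mechanism the paper credits for catching bad trades before execution.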
Evaluation Dimensions
Six orthogonal dimensions for evaluating financial MAS. The first four extend MASEval's generic framework; the last two are finance-specific additions that no general-purpose evaluation covers.
D1 · Task Accuracy
Correct answers on FinQA numerical reasoning, FinanceBench open-book QA, and FinBen holistic tasks. Measured as exact-match for numerical questions, F1 for extraction tasks. Baseline: human expert performance on FinQA is ~91%; best models achieve ~68-74%.
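The two matching rules can be sketched as follows, assuming a small relative tolerance for numeric answers and SQuAD-style token F1 for extraction; each benchmark defines its own exact matching procedure.

```python
# Sketch of D1 scoring. The tolerance value is an assumption; benchmarks
# specify their own matching rules.

def numeric_match(pred: float, gold: float, rel_tol: float = 1e-4) -> bool:
    """Exact-match with a small relative tolerance for rounding."""
    return abs(pred - gold) <= rel_tol * max(1.0, abs(gold))

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 for extraction-style answers."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)
```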
D2 · Communication Efficiency
Total tokens consumed per correct financial decision. Equivalent to MASEval's communication efficiency but priced in real API cost: financial deployments process thousands of documents daily, making token efficiency directly proportional to operational cost.
D3 · Error Resilience
Percentage of tasks recovered after a first-agent failure. In financial pipelines, a failed data-retrieval agent should trigger graceful degradation, not a hallucinated fallback answer. Resilience is measured by re-run success rate after induced component failure.
D4 · Resource Efficiency
Cost-per-correct-financial-decision (USD). Heterogeneous model assignment (frontier model as orchestrator, mid-tier models as workers) achieves the best efficiency ratio. FinCon demonstrated significant gains after only 4 training episodes, making reinforcement learning a viable cost-reduction path.
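The cost-per-correct-decision metric (and its token-level D2 counterpart) can be sketched like this. The per-1K-token prices and tier names are placeholder assumptions, not real rates.

```python
# Sketch of the D2/D4 efficiency metrics: dollars per correct decision
# under heterogeneous model assignment. Prices are hypothetical.

PRICE_PER_1K = {"frontier": 0.01, "mid": 0.002}  # USD per 1K tokens, assumed

def run_cost(token_counts: dict) -> float:
    """token_counts maps model tier -> tokens consumed in one run."""
    return sum(PRICE_PER_1K[tier] * n / 1000 for tier, n in token_counts.items())

def cost_per_correct(runs: list) -> float:
    """runs: list of (token_counts, was_correct) tuples."""
    total = sum(run_cost(tc) for tc, _ in runs)
    correct = sum(1 for _, ok in runs if ok)
    return total / correct if correct else float("inf")
```

Dividing by correct decisions, not total runs, is what makes a cheap-but-wrong system score badly here.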
D5 · Hallucination Rate ★ Finance-specific
Percentage of outputs containing fabricated numerical values, incorrect dates, or non-existent regulatory references. Measured using the FAITH framework (tabular hallucination detection over S&P 500 annual reports) and FinanceBench's verified-answer methodology. This dimension has no equivalent in generic MAS evaluation.
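A simplified sketch of the masking idea behind this measurement: hide a table value, ask a model to fill it, and count mismatches. The model below is a stub; real FAITH-style evaluation runs an LLM over the masked document.

```python
# Sketch of masked-cell hallucination scoring over a financial table.
# The "model" is any callable (masked_table, key) -> predicted value;
# here it is stubbed for illustration.

def mask_cell(table: dict, key: str):
    masked = dict(table)
    gold = masked.pop(key)       # hide the gold value from the model
    return masked, gold

def hallucination_rate(table: dict, model) -> float:
    errors = 0
    for key in table:
        masked, gold = mask_cell(table, key)
        if model(masked, key) != gold:
            errors += 1
    return errors / len(table)
```

Because masking is automatic, the evaluation scales to real filings without manual annotation, which is the core of FAITH's contribution.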
D6 · Regulatory Compliance ★ Finance-specific
Fraction of agent outputs satisfying applicable rules (FINRA, SEC, MiFID II). Includes: suitability check adherence, disclosure requirement compliance, and audit trail completeness. Assessed via a rule-based compliance checker applied to agent output traces. FINRA rules are technology-neutral: they apply to AI agents exactly as to human advisors.
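A sketch of what a rule-based checker over an output trace might look like. The three rules and the trace fields are illustrative assumptions, not encodings of actual regulatory text.

```python
# Sketch of a D6-style compliance checker: each rule is a predicate over
# an agent output trace. Rule logic and trace schema are hypothetical.

RULES = {
    "suitability": lambda t: t.get("client_profile_checked", False),
    "disclosure":  lambda t: "risk_disclosure" in t.get("output", ""),
    "audit_trail": lambda t: bool(t.get("decision_log")),
}

def compliance_score(trace: dict) -> float:
    """Fraction of rules the trace satisfies (0.0 to 1.0)."""
    passed = sum(1 for rule in RULES.values() if rule(trace))
    return passed / len(RULES)
```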
FinMASEval vs MASEval dimensions: D1–D4 directly extend Post 25's four dimensions. D5 (Hallucination Rate) and D6 (Regulatory Compliance) are FinMASEval additions with no generic analogue. A financial MAS that scores 90th percentile on D1–D4 but has a 30% hallucination rate on numerical queries is operationally unusable.
Financial Agent Role Specialization
Assigning the right role to each agent is as consequential as choosing the right topology. Five specialist roles appear across all major financial MAS frameworks studied.
📊
Fundamental Analyst
Reads earnings reports, 10-K/10-Q filings, and financial ratios. Primary consumer of FinQA and FinAgentBench tasks. Must have a low hallucination rate on numerical data: a single fabricated EPS figure invalidates downstream decisions.
📈
Technical Analyst
Processes price series, volume, and momentum indicators. Typically operates as a parallel agent alongside the Fundamental Analyst in multi-analyst topologies. Performance measured by signal accuracy against known market outcomes.
📰
Sentiment Analyst
Classifies news, earnings call transcripts, and social signals. Evaluated on FLUE sentiment classification tasks. TradingAgents showed that adding a dedicated Sentiment Analyst, separate from the Fundamental Analyst, improved overall GSR by 6–9%.
⚠️
Risk Manager
Monitors position sizing, drawdown limits, and exposure thresholds. Acts as a veto agent in hierarchical topologies. FinCon's dual-level risk control (daily monitoring plus systematic belief updates via self-critique) reduced maximum drawdown significantly vs. baselines.
⚖️
Compliance Officer
Validates outputs against regulatory rules before execution. The only role with no equivalent in generic MAS frameworks. Evaluating this role requires a separate compliance metric (D6 in FinMASEval) that checks suitability, disclosure, and audit trail requirements.
🏦
Portfolio Manager
Orchestrates all analyst signals into an allocation decision. In hierarchical topologies, this is the top-level agent. MAPS (Lee et al., 2020) showed that a portfolio manager coordinating independent sub-agents raised the Sharpe ratio over 12 years of US market data by reducing idiosyncratic risk through agent diversification.
Role combination insight (TradingAgents): The Bull Researcher + Bear Researcher debate pattern acts as an implicit error-correction mechanism, functioning similarly to a Critic agent in general MAS. Adding a dedicated Risk Manager veto layer on top of the debate structure showed the highest combined task accuracy and drawdown control across all tested financial topologies.
Financial Agent Framework Comparison
Four purpose-built financial MAS frameworks compared across FinMASEval's six evaluation dimensions. Each makes different architectural tradeoffs between accuracy, cost, and compliance.
TradingAgents
Xiao et al., 2024 · Multi-analyst debate
Task Accuracy: 79%
Efficiency: 62%
Hallucination: ↓18%
Compliance: 55%
FinCon
Yu et al., NeurIPS 2024 · Verbal RL
Task Accuracy: 75%
Efficiency: 78%
Hallucination: ↓21%
Compliance: 60%
FinRobot
Yang et al., 2024 · CoT + 4-layer arch
Task Accuracy: 71%
Efficiency: 70%
Hallucination: ↓24%
Compliance: 65%
FinGPT
Yang et al., 2023 · Low-cost fine-tuning
Task Accuracy: 66%
Efficiency: 91%
Hallucination: ↓29%
Compliance: 48%
The efficiency-accuracy tradeoff: FinGPT achieves 91% resource efficiency at a training cost under $300, but scores lowest on task accuracy and worst on hallucination rate. TradingAgents achieves the best task accuracy (79%) through multi-analyst debate but consumes significantly more tokens per decision. FinCon offers the best balance: accuracy close to TradingAgents with 78% efficiency through verbal reinforcement learning that converges in just 4 training episodes.
Hallucination in Financial LLMs
Hallucination is not just a quality problem in finance; it is a risk event. The FAITH framework (2025) introduced the first systematic methodology for detecting tabular hallucinations in financial documents. Here is what the data shows.
Kang & Liu (2023): Off-the-shelf LLMs exhibit "serious hallucination" in financial tasks. Frontier models (Claude Sonnet 4, Gemini 2.5 Pro) still achieve 10–20% error rates on multi-step numerical reasoning tasks. Four mitigation approaches were tested (few-shot prompting, DoLa decoding, RAG, and prompt-based tool learning), with RAG providing the largest reduction.
Hallucination rate by task type (frontier models, avg across GPT-4/Claude/Gemini)
Multi-step numerical reasoning – 42%
Historical price / date lookup – 38%
Regulatory reference accuracy – 31%
Tabular data extraction (10-K) – 27%
Earnings forecast generation – 23%
Sentiment classification – 12%
Named entity extraction – 8%
FAITH Framework
First automated methodology for detecting intrinsic tabular hallucinations in financial documents. Uses a masking strategy over S&P 500 annual reports to create evaluation datasets without manual annotation. Conceptualises hallucinations as masked span prediction tasks, enabling scalable evaluation over real enterprise documents. (Zhang et al., 2025 · arXiv:2508.05201)
Why Numerical Tasks Fail
Financial numerical reasoning requires chaining multiple arithmetic steps while referencing values across table rows and footnotes. LLMs are prone to "hallucinated carry" β€” correctly citing a referenced number but applying incorrect arithmetic at an intermediate step, producing a plausible-looking but wrong final answer. This pattern is undetectable without step-level verification.
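Step-level verification can be sketched as re-executing each claimed arithmetic step and flagging the first that fails to reproduce. The `(op, a, b, claimed)` trace format is an assumption for illustration, not a standard.

```python
# Sketch of step-level verification for "hallucinated carry": re-run each
# claimed arithmetic step and report the first one whose stated result
# does not reproduce. The trace tuple format is hypothetical.

import math

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
}

def first_bad_step(steps):
    """Return the index of the first incorrect step, or None if all check out."""
    for i, (op, a, b, claimed) in enumerate(steps):
        if not math.isclose(OPS[op](a, b), claimed, rel_tol=1e-6):
            return i
    return None
```

In the hallucinated-carry pattern the cited inputs check out but an intermediate result does not, which is exactly what this catches and final-answer scoring misses.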
Mitigation Effectiveness
RAG – most effective; reduces numerical errors by ~35% by grounding answers in source documents
Targeted fine-tuning – permanent correction; FinAgentBench showed significant improvement on agentic retrieval tasks
Financial CoT – FinRobot's Chain-of-Thought decomposition reduces intermediate step errors
Tool-augmented agents – offloading arithmetic to a Python calculator tool can eliminate arithmetic hallucination on numerical steps
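The calculator-tool idea from the list above can be sketched as a whitelisted AST evaluator: the agent emits an expression and the tool does the arithmetic, so intermediate math is computed rather than generated. This is one possible design, not a specific framework's implementation.

```python
# Sketch of a sandboxed calculator tool: only pure arithmetic expressions
# are evaluated; anything else (calls, names, attributes) is rejected.

import ast
import operator as op

SAFE = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def calc(expr: str) -> float:
    """Evaluate an expression like '(5829 - 5735) / 5735'."""
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in SAFE:
            return SAFE[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)
```

The whitelist matters: a plain `eval` would remove arithmetic hallucination but introduce an injection risk, which is unacceptable in a trading pipeline.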
The Regulatory Evaluation Layer
FINRA and SEC rules are technology-neutral: they apply to AI agents exactly as to human advisors. Any financial MAS deployed in a regulated context must pass a compliance evaluation layer that no generic benchmark provides.
FINRA 2026 Compliance Checklist
!
Supervision – Human-in-the-loop oversight
AI outputs are subject to the same supervision as broker-dealer communications. The agent must log all outputs for review. Most frameworks are partially compliant: logs exist but are not in a FINRA-reviewable format.
✗
Recordkeeping – Immutable audit trail
All agent decision traces must be stored in tamper-evident form. Current financial MAS frameworks (TradingAgents, FinCon) do not natively produce compliant recordkeeping artifacts.
!
Fair Dealing – No biased recommendations
Agent recommendations must not systematically favor products generating higher fees. Requires a bias audit of training data and output distributions. Emerging requirement; evaluation methodology is still being standardised.
✓
Technology Neutrality – Existing rules apply
SEC, CFTC, and FINRA have confirmed no AI-specific regulations as of 2026. Existing securities laws apply unchanged. AI agents evaluated under current rule frameworks.
SEC Implementation Requirements
!
Formal Risk Assessment Process
Firms must implement formal review/approval processes assessing GenAI risks before deployment. Requires documented risk assessment covering privacy, integrity, reliability, and accuracy of the agent system.
✗
Model Risk Management Framework
Governance frameworks with clear AI policies and MRM procedures required. Includes model validation, independent review, and ongoing monitoring. Standard SR 11-7 principles apply to financial AI models. (Also covered in Post 24 · Agentic MRM)
✓
Testing Documentation
Robust testing on capabilities and limitations required β€” including privacy, reliability, and accuracy. FinMASEval's D5 and D6 dimensions directly address this requirement by providing quantified hallucination rates and compliance scores.
!
Ongoing Monitoring Post-Deployment
Continuous performance monitoring required after deployment. Drift in hallucination rate or compliance score must trigger re-evaluation. FinMASEval provides the quantitative baselines needed to detect such drift.
The compliance gap in current frameworks: None of the four major financial agent frameworks (TradingAgents, FinCon, FinRobot, FinGPT) natively produce FINRA-compliant audit trails or implement formal model risk management. Organizations deploying these frameworks in regulated environments must build a compliance wrapper layer; FinMASEval's D6 dimension provides the evaluation criteria for assessing that wrapper.
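One way such a wrapper could approach the recordkeeping gap is a hash-chained decision log, where each entry's hash covers its predecessor so any later edit breaks the chain. This is an illustrative sketch of the tamper-evidence idea, not a FINRA-certified mechanism.

```python
# Sketch of a tamper-evident decision log: entry N's hash covers entry N-1's
# hash, so modifying any stored record invalidates every later link.

import hashlib
import json

def append_entry(log: list, record: dict) -> list:
    prev = log[-1]["hash"] if log else "genesis"
    payload = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev + payload).encode()).hexdigest()
    log.append({"record": record, "prev": prev, "hash": digest})
    return log

def verify_chain(log: list) -> bool:
    prev = "genesis"
    for entry in log:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

A production system would additionally anchor the chain head in external write-once storage; the in-process chain alone only detects tampering, it does not prevent it.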
Paper Sources
All claims in this visual are grounded in the papers below. Financial benchmarks, framework metrics, hallucination rates, and regulatory requirements are cited to primary sources.
Financial Benchmarks
FinQA: A Dataset of Numerical Reasoning over Financial Data
Chen et al. · EMNLP 2021 · 8,281 QA pairs over 10-K filings
📄 arXiv:2109.00122
FinBen: A Holistic Financial Benchmark for Large Language Models
Xie et al. · NeurIPS 2024 · 42 datasets, 24 financial tasks, 7 dimensions
📄 arXiv:2402.12659
FinanceBench: A New Benchmark for Financial Question Answering
Islam et al. · 2023 · 10,231 questions; GPT-4-Turbo fails 81% with RAG
📄 arXiv:2311.11944
When FLUE Meets FLANG: Benchmarks and Large Pretrained Language Model for Financial NLP
Shah et al. · EMNLP 2022 · 5 financial NLP tasks
📄 arXiv:2211.00083
FinAgentBench: A Benchmark for Agentic Retrieval in Financial QA
Choi et al. · ACM AI in Finance 2025 · 26K examples from S&P 500 filings
📄 arXiv:2508.14052
Financial Agent Frameworks
TradingAgents: Multi-Agents LLM Financial Trading Framework
Xiao et al. · 2024 · Bull/Bear debate + Risk Manager topology
📄 arXiv:2412.20138
FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement
Yu et al. · NeurIPS 2024 · Converges in 4 training episodes
📄 arXiv:2407.06567
FinRobot: An Open-Source AI Agent Platform for Financial Applications
Yang et al. · 2024 · Financial Chain-of-Thought, 4-layer architecture
📄 arXiv:2405.14767
FinGPT: Open-Source Financial Large Language Models
Yang et al. · 2023 · LoRA fine-tuning at under $300 training cost
📄 arXiv:2306.06031
Hallucination & Evaluation
Deficiency of Large Language Models in Finance: Empirical Examination of Hallucination
Kang & Liu · 2023 · 4 mitigation methods benchmarked on financial tasks
📄 arXiv:2311.15548
FAITH: Framework for Assessing Intrinsic Tabular Hallucinations in Finance
Zhang et al. · 2025 · S&P 500 annual reports, automated masking evaluation
📄 arXiv:2508.05201
BloombergGPT: A Large Language Model for Finance
Wu et al. · 2023 · 50B param, 363B financial tokens, domain-specific benchmark baseline
📄 arXiv:2303.17564
MAPS: Multi-Agent Reinforcement Learning-Based Portfolio Management System
Lee et al. · IJCAI 2020 · 12-year US market data, Sharpe ratio improvement
📄 arXiv:2007.05402