Post 43 · Evaluation
FinCriticalED
The first benchmark that measures what really matters in financial OCR: not character accuracy, but whether decision-critical facts survive extraction intact.
Paper: "FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation"
Yueru He, Xueqing Peng et al. · arXiv 2511.14998 · 2025–2026
The Core Problem
Standard OCR benchmarks measure character-level accuracy. But in finance, a single digit error — $1.2B → $12B — is catastrophic even if 99.9% of characters are correct.
The Solution
FinCriticalED shifts evaluation to fact-level accuracy — did the model correctly preserve the 9,481 expert-annotated decision-critical facts across 859 real financial documents?
The Finding
Even the best frontier models fail on numerical and monetary unit extraction in visually complex financial documents. High OCR accuracy ≠ financial fact fidelity.
859 real financial document pages
9,481 expert-annotated critical facts
13 models benchmarked
5 fact categories evaluated
The Problem
The OCR Accuracy Trap
Why "99% accurate" OCR is dangerously misleading for financial documents.
Surface Accuracy vs Fact Accuracy
Original document text
Total revenue: $1,234,567,890
Fiscal year ended: December 31, 2024
Currency: USD (thousands)
Reporting entity: Apex Financial Corp.
↓ OCR extraction
OCR output — 98.7% character accuracy
Total revenue: $1,234,567,890 $1,234,567,800 ⚠
Fiscal year ended: December 31, 2024 ✓
Currency: USD (thousands) USD (millions) ⚠
Reporting entity: Apex Financial Corp. ✓
2 fact errors, 98.7% char accuracy. The revenue is off by $90 and the unit multiplier is wrong — meaning the reported figure is actually off by a factor of 1,000×.
Why Existing Benchmarks Miss This
CER / WER — Character & Word Error Rate
Measures how many characters or words differ between OCR output and ground truth. A 90-character figure with 89 correct characters scores just 1.1% CER (98.9% character accuracy). But the number is wrong — and the financial impact is enormous.
BLEU / ROUGE — Sequence Overlap
n-gram overlap metrics designed for translation/summarisation. They reward matching word sequences — but a monetary unit change ("thousands" → "millions") scores well because 1 of 2 words matches.
F1 / Exact Match
Better than CER/WER but still treat all tokens equally. The word "the" and the number "$1,234,567,890" have identical weight — even though one is decision-critical and the other is not.
FinCriticalED's insight: Financial facts are not created equal. Numerical values and monetary units need their own evaluation dimension with zero tolerance for errors.
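The gap between the two metric families can be made concrete. A minimal sketch (the helper names and strings below are illustrative, not the benchmark's scoring code) computing surface similarity versus fact-level accuracy on the example above:

```python
import difflib

# Ground truth vs OCR output from the example above (illustrative strings).
truth = "Total revenue: $1,234,567,890  Currency: USD (thousands)"
pred  = "Total revenue: $1,234,567,800  Currency: USD (millions)"

# Surface metric: sequence similarity looks excellent.
surface = difflib.SequenceMatcher(None, truth, pred).ratio()

# Fact-level metric: both decision-critical facts are wrong.
facts_truth = {"revenue": 1_234_567_890, "unit": "thousands"}
facts_pred  = {"revenue": 1_234_567_800, "unit": "millions"}
fact_acc = sum(facts_truth[k] == facts_pred[k] for k in facts_truth) / len(facts_truth)

print(f"surface similarity: {surface:.2f}")  # high, well above 0.8
print(f"fact accuracy:      {fact_acc:.2f}")  # 0.00: every critical fact is wrong
```

The surface score stays high because almost every character matches; the fact score collapses to zero because the two facts that matter do not.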
Stakes
The Real Cost of OCR Errors in Finance
Why a 1% error rate that's acceptable in other domains is catastrophic in financial documents.
Regulatory Filings
A wrong figure in an SEC filing can trigger regulatory sanctions, restatements, and shareholder lawsuits. The SEC has brought enforcement actions over numerical errors in Form 10-K filings.
Algorithmic Trading
AI trading systems that ingest OCR'd earnings reports need exact numbers. A misread EPS of $0.45 instead of $4.50 can trigger erroneous buy/sell orders worth millions before the error is caught.
Due Diligence
M&A analysts rely on extracted financial figures for valuation models. An OCR error in a balance sheet can corrupt an entire DCF model, leading to deals mispriced by tens of millions.
Domain vs Error Tolerance — Why Finance is Different
| Domain | Typical Error Tolerance | Consequence of 1% Error | FinCriticalED Relevance |
|---|---|---|---|
| Book digitisation | 1–3% CER acceptable | Minor readability issue | Low |
| Medical records | <0.5% for drug dosages | Patient safety risk | High |
| Legal contracts | <0.1% for key clauses | Contract invalidity, disputes | Very High |
| Financial documents | Zero tolerance on critical facts | Regulatory action, billions in mispricing | Critical |
Dataset
Building FinCriticalED
859 real-world pages from SEC EDGAR and regulatory archives, annotated with 9,481 expert-labeled critical facts.
Construction Pipeline
1
Source Collection
Documents sourced from U.S. SEC EDGAR, corporate disclosure portals, and regulatory archives. Stratified sampling across industries and fiscal years 2023–2025.
2
Document Selection
859 pages selected across 5 document types, prioritising pages with high visual complexity — multi-column tables, mixed formatting, dense numerical content.
3
Expert Annotation
Domain experts annotate each page, labelling 9,481 critical facts across 5 categories. Each annotation includes the fact value, category, context, and criticality level.
4
Quality Review
Multi-round review process. Disputed annotations resolved by senior financial domain experts. Inter-annotator agreement verified before inclusion.
5
Evaluation Protocol Design
Deterministic-Rule-Guided LLM-as-Judge protocol designed to assess fact preservation — accounting for lexical variation, numerical equivalence, and contextual correctness.
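The annotation record produced by steps 1–4 can be sketched as a simple schema. The field and class names here are assumptions for illustration; the paper's exact schema may differ.

```python
from dataclasses import dataclass
from enum import Enum

class FactCategory(Enum):
    """The 5 critical fact types used in FinCriticalED."""
    NUMERIC = "numeric"
    TEMPORAL = "temporal"
    MONETARY_UNIT = "monetary_unit"
    REPORTING_ENTITY = "reporting_entity"
    FINANCIAL_CONCEPT = "financial_concept"

@dataclass(frozen=True)
class AnnotatedFact:
    page_id: str            # which of the 859 pages the fact appears on
    category: FactCategory  # one of the 5 critical fact types
    value: str              # ground-truth value as it appears on the page
    context: str            # surrounding text needed to interpret the value
    criticality: str        # annotator-assigned criticality level

# A hypothetical record for one fact from the sample document below.
fact = AnnotatedFact(
    page_id="apex-2024-p12",
    category=FactCategory.NUMERIC,
    value="4,821,350",
    context="Net revenues, fiscal year 2024, in thousands USD",
    criticality="high",
)
```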
Dataset at a Glance
Facts per Category
Data Sources
🏛
SEC EDGAR
Primary source
📋
Corp. Disclosures
Secondary source
⚖
Regulatory Archives
Tertiary source
📅
2023–2025
Fiscal year range
Fact Categories
5 Critical Fact Types
The five categories are Numeric, Temporal, Monetary Unit, Reporting Entity, and Financial Concept; each comes with a definition, examples, and a rationale for why it is decision-critical.
Document Types
5 Financial Document Categories
The benchmark spans the full range of document types where financial OCR is deployed in practice.
📊 Financial Statements
Balance sheets, income statements, cash flow statements. Dense numerical tables with multiple nesting levels and footnotes referencing other pages.
📑 Supplemental Reports
Quarterly earnings supplements, investor presentations, management discussion. Mix of narrative text and embedded numerical tables with complex visual layouts.
📜 Tax Forms
IRS tax schedules, state tax filings. Highly structured forms with precisely positioned fields where OCR spatial alignment errors cause critical misattribution.
💹 Securities Transactions
Trade confirmations, settlement records, prospectuses. Time-critical documents where temporal and numerical fact accuracy affects transaction validity.
⚖ Financial Legal Documents
Loan agreements, covenants, bond indentures. Legal contracts where numerical covenants (debt ratios, financial tests) carry contractual force if extracted incorrectly.
Annotated Document Sample
■ Numeric
■ Temporal
■ Monetary Unit
■ Reporting Entity
■ Financial Concept
APEX FINANCIAL CORPORATION
CONSOLIDATED STATEMENTS OF OPERATIONS
(In thousands, except per share data)
| | Year Ended December 31, 2024 | Year Ended December 31, 2023 |
|---|---|---|
| Net revenues | $4,821,350 | $4,203,117 |
| Operating expenses | 3,214,892 | 2,876,441 |
| Income before taxes | 1,606,458 | 1,326,676 |
| Earnings per share (diluted) | $3.47 | $2.91 |
Colour-coded annotations show the 5 fact types. Every highlighted value is a data point in FinCriticalED's evaluation set.
Methodology
LLM-as-Judge Evaluation Protocol
Why rule-based matching fails for financial facts — and how a Deterministic-Rule-Guided LLM judge solves it.
Why Pure Rule-Matching Fails
Lexical Equivalence Problem
"$1,234,567" and "1234567" are the same number but different strings. Rule-based matchers flag this as wrong. An LLM judge understands numeric equivalence.
Context Dependency
Whether "3.47" is correct depends on whether the surrounding context specifies it's EPS (earnings per share) vs. a tax rate or ratio. Pure string matching can't assess contextual correctness.
Unit Normalisation
"4,821 thousand" and "4,821,000" and "$4.821M" are all the same monetary fact. A rule-matcher sees three different strings; the LLM judge understands they're identical.
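A minimal normaliser along these lines can be sketched in Python. The function name, regex, and unit table are illustrative assumptions, not the benchmark's actual code:

```python
import math
import re

# Unit multipliers for common monetary suffixes.
_MULTIPLIERS = {
    "k": 1_000, "thousand": 1_000, "thousands": 1_000,
    "m": 1_000_000, "million": 1_000_000, "millions": 1_000_000,
    "b": 1_000_000_000, "billion": 1_000_000_000, "billions": 1_000_000_000,
}

def normalise_money(text: str) -> float:
    """Parse a monetary string into a plain numeric value."""
    s = text.strip().lower().lstrip("$").strip()
    m = re.match(r"([\d,\.]+)\s*([a-z]*)", s)
    if not m:
        raise ValueError(f"unparseable monetary string: {text!r}")
    number = float(m.group(1).replace(",", ""))
    return number * _MULTIPLIERS.get(m.group(2), 1)

# All three surface forms normalise to the same value.
assert normalise_money("4,821 thousand") == 4_821_000.0
assert math.isclose(normalise_money("$4.821M"), 4_821_000)
assert normalise_money("4,821,000") == 4_821_000.0
```

A rule-matcher comparing the raw strings sees three mismatches; comparing normalised values, all three are the same fact.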
The Evaluation Pipeline
1
OCR System Runs on Document
The system under evaluation receives the document image and produces extracted text output.
2
Structured Fact Extraction
A deterministic extraction step parses the OCR output into structured fact candidates aligned to the annotation schema.
3
LLM Judge Assessment
For each annotated fact, the LLM judge is given: (a) the ground truth fact, (b) the extracted candidate, (c) surrounding context. It returns: CORRECT / INCORRECT / UNCERTAIN.
4
Deterministic Rule Override
Numeric facts trigger deterministic validation rules (e.g., parsed value equality within tolerance) — the LLM cannot override a verified numerical mismatch.
5
Fact-Level Score Aggregation
Scores aggregated by fact category, document type, and model — producing a multi-dimensional performance profile.
Key design: Deterministic rules handle numerical precision (where LLMs can be unreliable). LLM reasoning handles semantic equivalence (where rules fail). Best of both.
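The override logic can be sketched as follows. This is one reading of the protocol: `llm_verdict` stands in for a real LLM judge call, and the exact tie-breaking rules are assumptions.

```python
from typing import Optional

def judge_fact(category: str,
               truth_value: Optional[float],
               extracted_value: Optional[float],
               llm_verdict: str,
               tol: float = 0.0) -> str:
    """Combine a deterministic numeric rule with an LLM judge verdict."""
    if category == "numeric" and truth_value is not None and extracted_value is not None:
        numeric_match = abs(truth_value - extracted_value) <= tol
        if not numeric_match:
            # The LLM cannot override a verified numerical mismatch.
            return "INCORRECT"
        if llm_verdict == "INCORRECT":
            # Numbers match but the LLM disputes context: flag for review.
            return "UNCERTAIN"
    return llm_verdict

# A verified numeric mismatch wins even if the LLM said CORRECT,
# e.g. a thousands-vs-raw-value confusion (1,000x error):
assert judge_fact("numeric", 4_821_350, 4_821_350_000, "CORRECT") == "INCORRECT"
# Non-numeric facts defer to the LLM's semantic judgment:
assert judge_fact("reporting_entity", None, None, "CORRECT") == "CORRECT"
```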
LLM Judge Prompt Structure
# Evaluation prompt sent to LLM judge for each fact
TASK: Assess whether the extracted fact correctly preserves the ground truth fact.
GROUND TRUTH FACT:
Category: Numeric
Value: 4,821,350
Context: Net revenues, fiscal year 2024, in thousands USD
EXTRACTED CANDIDATE:
Raw text: "Net revenues $ 4,821,350"
Parsed value: 4821350
DETERMINISTIC CHECK: abs(4821350 - 4821350) == 0 → NUMERIC MATCH CONFIRMED
JUDGE VERDICT: CORRECT
CONFIDENCE: HIGH
Financial Error Impact Calculator
See how a small OCR error rate translates into real financial misstatement risk.
Example inputs: annual revenue of $10B, an OCR numeric error rate of 1.0%, and 1,000 documents processed per year.
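The calculator's underlying formula isn't reproduced on the page, so the sketch below assumes a simple linear model using the benchmark's average of roughly 11 critical facts per page (9,481 facts / 859 pages). The revenue input sets the dollar stakes of each error but isn't modelled here.

```python
def expected_fact_errors(error_rate_pct: float,
                         docs_per_year: int,
                         facts_per_doc: float = 11.0) -> float:
    """Expected number of erroneous critical numeric facts per year.

    facts_per_doc defaults to ~11, the benchmark average (9,481 / 859).
    """
    return docs_per_year * facts_per_doc * error_rate_pct / 100.0

# With a seemingly benign 1.0% numeric error rate across 1,000 documents:
errors = expected_fact_errors(error_rate_pct=1.0, docs_per_year=1000)
print(f"~{errors:.0f} erroneous critical facts per year")  # ~110
```

Even at a "good" 1% error rate, an institution processing 1,000 documents a year would expect on the order of a hundred wrong critical facts, any one of which could be a misplaced unit multiplier.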
Results
Model Performance on FinCriticalED
13 systems benchmarked across 3 model families, each with a distinct failure profile by fact category.
Traditional OCR Pipelines
Lowest overall fact accuracy. Excel at clean, structured documents but degrade sharply on visually complex layouts — multi-column tables, rotated text, handwritten annotations.
Specialised VLMs
Mid-range performance. Domain adaptation helps for standard form types but these models weren't trained on the diversity of real-world financial document formats.
Frontier MLLMs (GPT-4V, Claude)
Highest fact accuracy overall — but still with notable error rates on numerical and monetary unit facts in visually complex pages. No model reaches 90%+ on all categories.
Analysis
Vulnerability Heatmap
Which document type and fact category combinations produce the most errors? The heatmap cross-tabulates error rates by document type and fact category.
Takeaways
What FinCriticalED Teaches Us
6 lessons from the first financial fact-level OCR benchmark.
1. Surface metrics are dangerous proxies
A model can score 98%+ on CER/WER while failing to correctly extract the majority of decision-critical financial facts. Never report only surface metrics for high-stakes domains.
2. Numerical values are the Achilles heel
Every model class — from traditional OCR to frontier MLLMs — shows its highest error rate on precise numerical values. Visual complexity (multi-digit numbers in dense tables) is the primary failure driver.
3. Monetary units are critically underestimated
A wrong monetary unit ("thousands" vs "millions") can cause a 1,000× error in extracted value — yet existing benchmarks treat it as equivalent to a minor lexical substitution.
4. LLM judges need deterministic guardrails
Pure LLM judgment is unreliable for numerical equality assessment. The hybrid approach — deterministic rules for numbers, LLM reasoning for semantics — is the right architecture for financial evaluation.
5. Visual complexity is the decisive variable
Performance drops sharply as document visual complexity increases. The same model that handles a simple balance sheet well will struggle with a multi-column supplemental report with nested footnotes.
6. The benchmark gap is widening
As models improve on existing benchmarks, they converge toward human performance on general OCR — but FinCriticalED exposes a new frontier where improvement is still needed: financial fact fidelity under visual stress.
Related Work & Context
DocVQA / DocLayNet
General document understanding benchmarks. DocVQA tests question answering over document images — but uses general-domain documents without financial-specific annotation. FinCriticalED adds domain specificity.
LayoutLM / UDOP
Models designed for document understanding using layout-aware pre-training. FinCriticalED provides a targeted benchmark to assess how well these spatial representations help preserve financial facts.
FinMASEval (Post 34)
Our earlier post on evaluating multi-agent AI for finance. FinCriticalED addresses the upstream problem — can AI extract facts from documents accurately enough to be trusted by evaluation frameworks like FinMASEval?
The big picture: FinCriticalED establishes the measurement layer that financial AI systems need. Without it, claims of "production-ready" financial document AI cannot be validated. It's the foundation for responsible deployment of AI in finance.
Previous Post
Post 42 — Deep GraphRAG