Post 43 · Evaluation
FinCriticalED
The first benchmark that measures what really matters in financial OCR: not character accuracy, but whether decision-critical facts survive extraction intact.
Paper: "FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation"
Yueru He, Xueqing Peng et al. · arXiv 2511.14998 · 2025–2026
The Core Problem
Standard OCR benchmarks measure character-level accuracy. But in finance, a single digit error — $1.2B → $12B — is catastrophic even if 99.9% of characters are correct.
The Solution
FinCriticalED shifts evaluation to fact-level accuracy — did the model correctly preserve the 9,481 expert-annotated decision-critical facts across 859 real financial documents?
The Finding
Even the best frontier models fail on numerical and monetary unit extraction in visually complex financial documents. High OCR accuracy ≠ financial fact fidelity.
859 real financial document pages
9,481 expert-annotated critical facts
13 models benchmarked
5 fact categories evaluated
The Problem
The OCR Accuracy Trap
Why "99% accurate" OCR is dangerously misleading for financial documents.
Surface Accuracy vs Fact Accuracy
Original document text
Total revenue: $1,234,567,890
Fiscal year ended: December 31, 2024
Currency: USD (thousands)
Reporting entity: Apex Financial Corp.
↓ OCR extraction
OCR output — 98.7% character accuracy
Total revenue: $1,234,567,890 $1,234,567,800 ⚠
Fiscal year ended: December 31, 2024 ✓
Currency: USD (thousands) USD (millions) ⚠
Reporting entity: Apex Financial Corp. ✓
2 fact errors, 98.7% char accuracy. The revenue is off by $90 and the unit multiplier is wrong — meaning the reported figure is actually off by a factor of 1,000×.
Why Existing Benchmarks Miss This
CER / WER — Character & Word Error Rate
Measures how many characters or words differ between OCR output and ground truth. A 90-character figure with 89 correct characters scores just 1.1% CER (98.9% character accuracy). But the number is wrong — and the financial impact is enormous.
BLEU / ROUGE — Sequence Overlap
n-gram overlap metrics designed for translation/summarisation. They reward matching word sequences — but a monetary unit change ("thousands" → "millions") scores well because 1 of 2 words matches.
F1 / Exact Match
Better than CER/WER but still treat all tokens equally. The word "the" and the number "$1,234,567,890" have identical weight — even though one is decision-critical and the other is not.
FinCriticalED's insight: Financial facts are not created equal. Numerical values and monetary units need their own evaluation dimension with zero tolerance for errors.
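The gap between the two metric families can be made concrete. A minimal sketch (the helper names and strings below are illustrative, not the benchmark's scoring code) computing surface similarity versus fact-level accuracy on the example above:

```python
import difflib

# Ground truth vs OCR output from the example above (illustrative strings).
truth = "Total revenue: $1,234,567,890  Currency: USD (thousands)"
pred  = "Total revenue: $1,234,567,800  Currency: USD (millions)"

# Surface metric: sequence similarity looks excellent.
surface = difflib.SequenceMatcher(None, truth, pred).ratio()

# Fact-level metric: both decision-critical facts are wrong.
facts_truth = {"revenue": 1_234_567_890, "unit": "thousands"}
facts_pred  = {"revenue": 1_234_567_800, "unit": "millions"}
fact_acc = sum(facts_truth[k] == facts_pred[k] for k in facts_truth) / len(facts_truth)

print(f"surface similarity: {surface:.2f}")  # high, well above 0.8
print(f"fact accuracy:      {fact_acc:.2f}")  # 0.00: every critical fact is wrong
```

The surface score stays high because almost every character matches; the fact score collapses to zero because the two facts that matter do not.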
Stakes
The Real Cost of OCR Errors in Finance
Why a 1% error rate that's acceptable in other domains is catastrophic in financial documents.
Regulatory Filings
A wrong figure in an SEC filing can trigger regulatory sanctions, restatements, and shareholder lawsuits. The SEC has brought enforcement actions over numerical errors in Form 10-K filings.
Algorithmic Trading
AI trading systems that ingest OCR'd earnings reports need exact numbers. A misread EPS of $0.45 instead of $4.50 can trigger erroneous buy/sell orders worth millions before the error is caught.
Due Diligence
M&A analysts rely on extracted financial figures for valuation models. An OCR error in a balance sheet can corrupt an entire DCF model, leading to deals mispriced by tens of millions.
Domain vs Error Tolerance — Why Finance is Different
| Domain | Typical Error Tolerance | Consequence of 1% Error | FinCriticalED Relevance |
|---|---|---|---|
| Book digitisation | 1–3% CER acceptable | Minor readability issue | Low |
| Medical records | <0.5% for drug dosages | Patient safety risk | High |
| Legal contracts | <0.1% for key clauses | Contract invalidity, disputes | Very High |
| Financial documents | Zero tolerance on critical facts | Regulatory action, billions in mispricing | Critical |
Dataset
Building FinCriticalED
859 real-world pages from SEC EDGAR and regulatory archives, annotated with 9,481 expert-labeled critical facts.
Construction Pipeline
1
Source Collection
Documents sourced from U.S. SEC EDGAR, corporate disclosure portals, and regulatory archives. Stratified sampling across industries and fiscal years 2023–2025.
2
Document Selection
859 pages selected across 5 document types, prioritising pages with high visual complexity — multi-column tables, mixed formatting, dense numerical content.
3
Expert Annotation
Domain experts annotate each page, labelling 9,481 critical facts across 5 categories. Each annotation includes the fact value, category, context, and criticality level.
4
Quality Review
Multi-round review process. Disputed annotations resolved by senior financial domain experts. Inter-annotator agreement verified before inclusion.
5
Evaluation Protocol Design
Deterministic-Rule-Guided LLM-as-Judge protocol designed to assess fact preservation — accounting for lexical variation, numerical equivalence, and contextual correctness.
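The annotation record produced by steps 1–4 can be sketched as a simple schema. The field and class names here are assumptions for illustration; the paper's exact schema may differ.

```python
from dataclasses import dataclass
from enum import Enum

class FactCategory(Enum):
    """The 5 critical fact types used in FinCriticalED."""
    NUMERIC = "numeric"
    TEMPORAL = "temporal"
    MONETARY_UNIT = "monetary_unit"
    REPORTING_ENTITY = "reporting_entity"
    FINANCIAL_CONCEPT = "financial_concept"

@dataclass(frozen=True)
class AnnotatedFact:
    page_id: str            # which of the 859 pages the fact appears on
    category: FactCategory  # one of the 5 critical fact types
    value: str              # ground-truth value as it appears on the page
    context: str            # surrounding text needed to interpret the value
    criticality: str        # annotator-assigned criticality level

# A hypothetical record for one fact from the sample document below.
fact = AnnotatedFact(
    page_id="apex-2024-p12",
    category=FactCategory.NUMERIC,
    value="4,821,350",
    context="Net revenues, fiscal year 2024, in thousands USD",
    criticality="high",
)
```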
Dataset at a Glance
Facts per Category
Data Sources
🏛
SEC EDGAR
Primary source
📋
Corp. Disclosures
Secondary source
⚖
Regulatory Archives
Tertiary source
📅
2023–2025
Fiscal year range
Fact Categories
5 Critical Fact Types
The five categories are Numeric, Temporal, Monetary Unit, Reporting Entity, and Financial Concept; each comes with a definition, examples, and a rationale for why it is decision-critical.
Document Types
5 Financial Document Categories
The benchmark spans the full range of document types where financial OCR is deployed in practice.
📊 Financial Statements
Balance sheets, income statements, cash flow statements. Dense numerical tables with multiple nesting levels and footnotes referencing other pages.
📑 Supplemental Reports
Quarterly earnings supplements, investor presentations, management discussion. Mix of narrative text and embedded numerical tables with complex visual layouts.
📜 Tax Forms
IRS tax schedules, state tax filings. Highly structured forms with precisely positioned fields where OCR spatial alignment errors cause critical misattribution.
💹 Securities Transactions
Trade confirmations, settlement records, prospectuses. Time-critical documents where temporal and numerical fact accuracy affects transaction validity.
⚖ Financial Legal Documents
Loan agreements, covenants, bond indentures. Legal contracts where numerical covenants (debt ratios, financial tests) carry contractual force if extracted incorrectly.
Annotated Document Sample
■ Numeric
■ Temporal
■ Monetary Unit
■ Reporting Entity
■ Financial Concept
APEX FINANCIAL CORPORATION
CONSOLIDATED STATEMENTS OF OPERATIONS
(In thousands, except per share data)
| | Year Ended December 31, 2024 | Year Ended December 31, 2023 |
|---|---|---|
| Net revenues | $4,821,350 | $4,203,117 |
| Operating expenses | 3,214,892 | 2,876,441 |
| Income before taxes | 1,606,458 | 1,326,676 |
| Earnings per share (diluted) | $3.47 | $2.91 |
Colour-coded annotations show the 5 fact types. Every highlighted value is a data point in FinCriticalED's evaluation set.
Methodology
LLM-as-Judge Evaluation Protocol
Why rule-based matching fails for financial facts — and how a Deterministic-Rule-Guided LLM judge solves it.
Why Pure Rule-Matching Fails
Lexical Equivalence Problem
"$1,234,567" and "1234567" are the same number but different strings. Rule-based matchers flag this as wrong. An LLM judge understands numeric equivalence.
Context Dependency
Whether "3.47" is correct depends on whether the surrounding context specifies it's EPS (earnings per share) vs. a tax rate or ratio. Pure string matching can't assess contextual correctness.
Unit Normalisation
"4,821 thousand" and "4,821,000" and "$4.821M" are all the same monetary fact. A rule-matcher sees three different strings; the LLM judge understands they're identical.
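A minimal normaliser along these lines can be sketched in Python. The function name, regex, and unit table are illustrative assumptions, not the benchmark's actual code:

```python
import math
import re

# Unit multipliers for common monetary suffixes.
_MULTIPLIERS = {
    "k": 1_000, "thousand": 1_000, "thousands": 1_000,
    "m": 1_000_000, "million": 1_000_000, "millions": 1_000_000,
    "b": 1_000_000_000, "billion": 1_000_000_000, "billions": 1_000_000_000,
}

def normalise_money(text: str) -> float:
    """Parse a monetary string into a plain numeric value."""
    s = text.strip().lower().lstrip("$").strip()
    m = re.match(r"([\d,\.]+)\s*([a-z]*)", s)
    if not m:
        raise ValueError(f"unparseable monetary string: {text!r}")
    number = float(m.group(1).replace(",", ""))
    return number * _MULTIPLIERS.get(m.group(2), 1)

# All three surface forms normalise to the same value.
assert normalise_money("4,821 thousand") == 4_821_000.0
assert math.isclose(normalise_money("$4.821M"), 4_821_000)
assert normalise_money("4,821,000") == 4_821_000.0
```

A rule-matcher comparing the raw strings sees three mismatches; comparing normalised values, all three are the same fact.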
The Evaluation Pipeline
1
OCR System Runs on Document
The system under evaluation receives the document image and produces extracted text output.
2
Structured Fact Extraction
A deterministic extraction step parses the OCR output into structured fact candidates aligned to the annotation schema.
3
LLM Judge Assessment
For each annotated fact, the LLM judge is given: (a) the ground truth fact, (b) the extracted candidate, (c) surrounding context. It returns: CORRECT / INCORRECT / UNCERTAIN.
4
Deterministic Rule Override
Numeric facts trigger deterministic validation rules (e.g., parsed value equality within tolerance) — the LLM cannot override a verified numerical mismatch.
5
Fact-Level Score Aggregation
Scores aggregated by fact category, document type, and model — producing a multi-dimensional performance profile.
Key design: Deterministic rules handle numerical precision (where LLMs can be unreliable). LLM reasoning handles semantic equivalence (where rules fail). Best of both.
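The override logic can be sketched as follows. This is one reading of the protocol: `llm_verdict` stands in for a real LLM judge call, and the exact tie-breaking rules are assumptions.

```python
from typing import Optional

def judge_fact(category: str,
               truth_value: Optional[float],
               extracted_value: Optional[float],
               llm_verdict: str,
               tol: float = 0.0) -> str:
    """Combine a deterministic numeric rule with an LLM judge verdict."""
    if category == "numeric" and truth_value is not None and extracted_value is not None:
        numeric_match = abs(truth_value - extracted_value) <= tol
        if not numeric_match:
            # The LLM cannot override a verified numerical mismatch.
            return "INCORRECT"
        if llm_verdict == "INCORRECT":
            # Numbers match but the LLM disputes context: flag for review.
            return "UNCERTAIN"
    return llm_verdict

# A verified numeric mismatch wins even if the LLM said CORRECT,
# e.g. a thousands-vs-raw-value confusion (1,000x error):
assert judge_fact("numeric", 4_821_350, 4_821_350_000, "CORRECT") == "INCORRECT"
# Non-numeric facts defer to the LLM's semantic judgment:
assert judge_fact("reporting_entity", None, None, "CORRECT") == "CORRECT"
```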
LLM Judge Prompt Structure
# Evaluation prompt sent to LLM judge for each fact
TASK: Assess whether the extracted fact correctly preserves the ground truth fact.
GROUND TRUTH FACT:
Category: Numeric
Value: 4,821,350
Context: Net revenues, fiscal year 2024, in thousands USD
EXTRACTED CANDIDATE:
Raw text: "Net revenues $ 4,821,350"
Parsed value: 4821350
DETERMINISTIC CHECK: abs(4821350 - 4821350) == 0 → NUMERIC MATCH CONFIRMED
JUDGE VERDICT: CORRECT
CONFIDENCE: HIGH
Financial Error Impact Calculator
See how a small OCR error rate translates into real financial misstatement risk.
Example inputs: annual revenue of $10B, an OCR numeric error rate of 1.0%, and 1,000 documents processed per year.
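The calculator's underlying formula isn't reproduced on the page, so the sketch below assumes a simple linear model using the benchmark's average of roughly 11 critical facts per page (9,481 facts / 859 pages). The revenue input sets the dollar stakes of each error but isn't modelled here.

```python
def expected_fact_errors(error_rate_pct: float,
                         docs_per_year: int,
                         facts_per_doc: float = 11.0) -> float:
    """Expected number of erroneous critical numeric facts per year.

    facts_per_doc defaults to ~11, the benchmark average (9,481 / 859).
    """
    return docs_per_year * facts_per_doc * error_rate_pct / 100.0

# With a seemingly benign 1.0% numeric error rate across 1,000 documents:
errors = expected_fact_errors(error_rate_pct=1.0, docs_per_year=1000)
print(f"~{errors:.0f} erroneous critical facts per year")  # ~110
```

Even at a "good" 1% error rate, an institution processing 1,000 documents a year would expect on the order of a hundred wrong critical facts, any one of which could be a misplaced unit multiplier.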
Results
Model Performance on FinCriticalED
13 systems benchmarked across 3 model families, each with a distinct failure profile by fact category.
Traditional OCR Pipelines
Lowest overall fact accuracy. Excel at clean, structured documents but degrade sharply on visually complex layouts — multi-column tables, rotated text, handwritten annotations.
Specialised VLMs
Mid-range performance. Domain adaptation helps for standard form types but these models weren't trained on the diversity of real-world financial document formats.
Frontier MLLMs (GPT-4V, Claude)
Highest fact accuracy overall — but still with notable error rates on numerical and monetary unit facts in visually complex pages. No model reaches 90%+ on all categories.
Analysis
Vulnerability Heatmap
Which document type and fact category combinations produce the most errors? The heatmap cross-tabulates error rates by document type and fact category.
Takeaways
What FinCriticalED Teaches Us
6 lessons from the first financial fact-level OCR benchmark.
1. Surface metrics are dangerous proxies
A model can score 98%+ on CER/WER while failing to correctly extract the majority of decision-critical financial facts. Never report only surface metrics for high-stakes domains.
2. Numerical values are the Achilles heel
Every model class — from traditional OCR to frontier MLLMs — shows its highest error rate on precise numerical values. Visual complexity (multi-digit numbers in dense tables) is the primary failure driver.
3. Monetary units are critically underestimated
A wrong monetary unit ("thousands" vs "millions") can cause a 1,000× error in extracted value — yet existing benchmarks treat it as equivalent to a minor lexical substitution.
4. LLM judges need deterministic guardrails
Pure LLM judgment is unreliable for numerical equality assessment. The hybrid approach — deterministic rules for numbers, LLM reasoning for semantics — is the right architecture for financial evaluation.
5. Visual complexity is the decisive variable
Performance drops sharply as document visual complexity increases. The same model that handles a simple balance sheet well will struggle with a multi-column supplemental report with nested footnotes.
6. The benchmark gap is widening
As models improve on existing benchmarks, they converge toward human performance on general OCR — but FinCriticalED exposes a new frontier where improvement is still needed: financial fact fidelity under visual stress.
Related Work & Context
DocVQA / DocLayNet
General document understanding benchmarks. DocVQA tests question answering over document images — but uses general-domain documents without financial-specific annotation. FinCriticalED adds domain specificity.
LayoutLM / UDOP
Models designed for document understanding using layout-aware pre-training. FinCriticalED provides a targeted benchmark to assess how well these spatial representations help preserve financial facts.
FinMASEval (Post 34)
Our earlier post on evaluating multi-agent AI for finance. FinCriticalED addresses the upstream problem — can AI extract facts from documents accurately enough to be trusted by evaluation frameworks like FinMASEval?
The big picture: FinCriticalED establishes the measurement layer that financial AI systems need. Without it, claims of "production-ready" financial document AI cannot be validated. It's the foundation for responsible deployment of AI in finance.
Previous Post
Post 42 — Deep GraphRAG