A comprehensive visual guide to the 63 evaluation metrics used across 58 GraphRAG papers — from classic retrieval metrics to LLM-as-judge frameworks and task-specific scores.
63 unique metrics · 58 papers analyzed · 5 categories · 12 benchmark datasets
Metrics by Category
Why Metrics Matter
GraphRAG systems are evaluated across multiple dimensions: retrieval quality (did we get the right context?), generation quality (is the answer correct?), faithfulness (is it grounded in that context?), and holistic quality (does an LLM judge prefer it?). No single metric captures all dimensions.
Most Common Metric Combinations
EM + F1 (~25 papers)
ROUGE + BLEU (~9 papers)
Comprehensiveness + Diversity + Empowerment (~9 papers)
Hit@K + MRR (~6 papers)
Category 1
Retrieval Metrics
These metrics evaluate the quality of the retrieval step — did the graph retriever fetch the right documents, triples, or subgraphs before passing context to the LLM?
Interactive · Recall@K / Hit@K / MRR Simulator
Toggle document relevance and adjust K to see how Recall@K, Hit@K, and MRR change live. Click any document to mark it as relevant/irrelevant.
K (top-K): 5
Live stats: Recall@K · Hit@K · MRR
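The three quantities in the simulator follow directly from a ranked list of retrieved items and the set of relevant ones. A minimal sketch (function names are my own):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of ALL relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if at least one relevant document appears in the top-k, else 0.0."""
    return float(any(doc in relevant_ids for doc in ranked_ids[:k]))

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0.0 if none retrieved)."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

ranking = ["d3", "d7", "d1", "d9", "d4"]   # retriever output, best first
relevant = {"d1", "d4", "d8"}              # ground-truth relevant set
recall_at_k(ranking, relevant, 5)  # 2 of 3 relevant docs in top-5 -> 2/3
hit_at_k(ranking, relevant, 5)     # at least one hit -> 1.0
mrr(ranking, relevant)             # first relevant doc at rank 3 -> 1/3
```

Note that in full benchmarks MRR is averaged over all queries; the widget shows a single query, so the single-query reciprocal rank is what is displayed.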
10 Retrieval Metrics
Category 2
Generation Metrics
These metrics evaluate the final generated answer — measuring lexical overlap, semantic correctness, or functional accuracy against a reference answer.
Interactive · F1 Score Calculator
Type a predicted answer and the gold answer. Watch token-level precision, recall, and F1 update in real time. Green tokens are shared; red are prediction-only.
Inputs: Gold Answer · Predicted Answer (each broken into tokens)
Live stats: Precision · Recall · F1 Score
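The calculator's token-level F1 is the standard SQuAD-style bag-of-tokens comparison. A simplified sketch, assuming whitespace tokenization and lowercasing (official evaluation scripts additionally strip punctuation and articles):

```python
from collections import Counter

def token_f1(prediction: str, gold: str):
    """Token-level precision, recall, and F1 between prediction and gold.

    Uses a multiset (Counter) intersection so repeated tokens are only
    credited as many times as they occur in both strings.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    precision = num_same / len(pred_tokens)   # shared tokens / predicted tokens
    recall = num_same / len(gold_tokens)      # shared tokens / gold tokens
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

token_f1("the knowledge graph of Microsoft", "Microsoft knowledge graph")
# 3 shared tokens -> precision 3/5, recall 3/3, F1 0.75
```

This is why a verbose but correct answer scores high recall and low precision: every extra predicted token dilutes precision without affecting recall.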
Interactive · ROUGE-N Visualizer
Highlighted words appear in both the hypothesis and the reference. ROUGE-N is recall-oriented: it measures what fraction of reference n-grams also appear in the hypothesis.
Inputs: Reference (Gold) · Hypothesis (Generated)
Live stats: ROUGE-1 · Precision · Recall
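The visualizer's numbers can be reproduced with a few lines of n-gram counting. A sketch under simple assumptions (whitespace tokenization, no stemming or stopword handling, which real ROUGE implementations may add):

```python
from collections import Counter

def rouge_n(hypothesis: str, reference: str, n: int = 1):
    """ROUGE-N recall and the matching precision over word n-grams."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp, ref = ngrams(hypothesis), ngrams(reference)
    overlap = sum((hyp & ref).values())          # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)     # ROUGE's headline number
    precision = overlap / max(sum(hyp.values()), 1)
    return recall, precision

rouge_n("the cat on the mat", "the cat sat on the mat", n=1)
# 5 of 6 reference unigrams matched -> recall 5/6, precision 5/5
```

Setting `n=2` gives ROUGE-2 over bigrams; the clipped Counter intersection prevents a hypothesis from earning extra credit by repeating a reference word.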
12 Generation Metrics
Category 3
Faithfulness Metrics
Faithfulness metrics verify whether the generated answer is grounded in the retrieved context — catching hallucinations and unsupported claims.
Interactive · Faithfulness Score — Claim Checker
Each sentence in the generated answer is decomposed into atomic claims. Click to toggle whether each claim is supported by the retrieved context.
Retrieved Context
"Microsoft's GraphRAG system constructs a knowledge graph from the source corpus, then generates summaries for detected communities. It uses a map-reduce approach: local answers are generated per community, then combined into a global response. The system was evaluated on Podcast and News datasets."
Generated Answer Claims
Live stats: Faithfulness Score · Supported claims · Total claims
8 Faithfulness Metrics
Category 4
LLM-as-Judge Metrics
When no ground-truth answer exists (e.g. global summarization questions), an LLM evaluates two answers head-to-head on qualitative dimensions. Win rate captures which method produces better responses.
Interactive · Pairwise LLM Judge Simulator
Select a quality dimension and see how a simulated LLM judge allocates win rates between two answers. Run multiple rounds to see how wins accumulate.
Answer A · GraphRAG
VS
Answer B · Naive RAG
Live stats: Trials · Win Rate for GraphRAG (A)
Position bias mitigation: Each comparison is run twice with answers in reversed order. The final win rate is the average, preventing the LLM from systematically favouring the first position.
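The order-swapping scheme can be sketched as follows. The `judge` stub below is an assumption standing in for a real LLM call; it is deliberately biased toward whichever answer is shown first, to illustrate that averaging over both orders cancels pure position bias (a position-only judge ends up at a 50% win rate):

```python
import random

def judge(first: str, second: str) -> str:
    """Stand-in for an LLM judge; returns 'first' or 'second'.

    This stub ignores content and favours position one 60% of the time,
    mimicking the position bias the averaging scheme is meant to remove.
    """
    return "first" if random.random() < 0.6 else "second"

def debiased_win_rate(answer_a: str, answer_b: str, trials: int = 1000) -> float:
    """Win rate of answer A, averaging each trial over both presentation orders."""
    wins_a = 0.0
    for _ in range(trials):
        # Pass 1: A is shown first; A wins if the judge picks position one.
        wins_a += 0.5 * (judge(answer_a, answer_b) == "first")
        # Pass 2: order reversed; A wins if the judge picks position two.
        wins_a += 0.5 * (judge(answer_b, answer_a) == "second")
    return wins_a / trials
```

With a content-aware judge, the averaged rate reflects genuine quality differences; with this position-only stub, single-order evaluation would report ~60% for A while the debiased rate stays near 50%.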
10 LLM-Judge Metrics
Category 5
Task-Specific Metrics
GraphRAG is applied across diverse domains — code generation, medical QA, dialogue, KG construction, and efficiency analysis — each requiring specialized metrics.
Usage Analysis
Most Used Metrics Across Papers
How often does each metric appear across the 58 analyzed papers? F1 and EM dominate for QA tasks; LLM-judge metrics are rising for open-ended generation tasks.
Key takeaway: F1 Score is the most universal metric (25+ papers), but the shift to LLM-judge metrics (Comprehensiveness, Diversity, Empowerment) reflects a growing recognition that lexical overlap metrics cannot capture the quality of global, sensemaking answers that GraphRAG is designed for.