A comprehensive visual guide to the 63 evaluation metrics used across 58 GraphRAG papers — from classic retrieval metrics to LLM-as-judge frameworks and task-specific scores.
63 unique metrics · 58 papers analyzed · 5 categories · 12 benchmark datasets
Metrics by Category
Why Metrics Matter
GraphRAG systems are evaluated across multiple dimensions: retrieval quality (did we get the right context?), generation quality (is the answer correct?), faithfulness (is it grounded in that context?), and holistic quality (does an LLM judge prefer it?). No single metric captures all dimensions.
Most Common Metric Combinations
EM + F1 (~25 papers)
ROUGE + BLEU (~9 papers)
Comprehensiveness + Diversity + Empowerment (~9 papers)
Hit@K + MRR (~6 papers)
Category 1
Retrieval Metrics
These metrics evaluate the quality of the retrieval step — did the graph retriever fetch the right documents, triples, or subgraphs before passing context to the LLM?
Interactive · Recall@K / Hit@K / MRR Simulator
Toggle document relevance and adjust K to see how Recall@K, Hit@K, and MRR change live. Click any document to mark it as relevant/irrelevant.
K (top-K): 5
Live stats: Recall@K · Hit@K · MRR
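The three quantities in the simulator follow directly from a ranked list of retrieved items and the set of relevant ones. A minimal sketch (function names are my own):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of ALL relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if at least one relevant document appears in the top-k, else 0.0."""
    return float(any(doc in relevant_ids for doc in ranked_ids[:k]))

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0.0 if none retrieved)."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

ranking = ["d3", "d7", "d1", "d9", "d4"]   # retriever output, best first
relevant = {"d1", "d4", "d8"}              # ground-truth relevant set
recall_at_k(ranking, relevant, 5)  # 2 of 3 relevant docs in top-5 -> 2/3
hit_at_k(ranking, relevant, 5)     # at least one hit -> 1.0
mrr(ranking, relevant)             # first relevant doc at rank 3 -> 1/3
```

Note that in full benchmarks MRR is averaged over all queries; the widget shows a single query, so the single-query reciprocal rank is what is displayed.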
10 Retrieval Metrics
Category 2
Generation Metrics
These metrics evaluate the final generated answer — measuring lexical overlap, semantic correctness, or functional accuracy against a reference answer.
Interactive · F1 Score Calculator
Type a predicted answer and the gold answer. Watch token-level precision, recall, and F1 update in real time. Green tokens are shared; red are prediction-only.
Inputs: Gold Answer · Predicted Answer (each broken into tokens)
Live stats: Precision · Recall · F1 Score
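The calculator's token-level F1 is the standard SQuAD-style bag-of-tokens comparison. A simplified sketch, assuming whitespace tokenization and lowercasing (official evaluation scripts additionally strip punctuation and articles):

```python
from collections import Counter

def token_f1(prediction: str, gold: str):
    """Token-level precision, recall, and F1 between prediction and gold.

    Uses a multiset (Counter) intersection so repeated tokens are only
    credited as many times as they occur in both strings.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    precision = num_same / len(pred_tokens)   # shared tokens / predicted tokens
    recall = num_same / len(gold_tokens)      # shared tokens / gold tokens
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

token_f1("the knowledge graph of Microsoft", "Microsoft knowledge graph")
# 3 shared tokens -> precision 3/5, recall 3/3, F1 0.75
```

This is why a verbose but correct answer scores high recall and low precision: every extra predicted token dilutes precision without affecting recall.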
Interactive · ROUGE-N Visualizer
Highlighted words appear in both the hypothesis and the reference. ROUGE-N is recall-oriented: it measures what fraction of reference n-grams also appear in the hypothesis.
Inputs: Reference (Gold) · Hypothesis (Generated)
Live stats: ROUGE-1 · Precision · Recall
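The visualizer's numbers can be reproduced with a few lines of n-gram counting. A sketch under simple assumptions (whitespace tokenization, no stemming or stopword handling, which real ROUGE implementations may add):

```python
from collections import Counter

def rouge_n(hypothesis: str, reference: str, n: int = 1):
    """ROUGE-N recall and the matching precision over word n-grams."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp, ref = ngrams(hypothesis), ngrams(reference)
    overlap = sum((hyp & ref).values())          # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)     # ROUGE's headline number
    precision = overlap / max(sum(hyp.values()), 1)
    return recall, precision

rouge_n("the cat on the mat", "the cat sat on the mat", n=1)
# 5 of 6 reference unigrams matched -> recall 5/6, precision 5/5
```

Setting `n=2` gives ROUGE-2 over bigrams; the clipped Counter intersection prevents a hypothesis from earning extra credit by repeating a reference word.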
12 Generation Metrics
Category 3
Faithfulness Metrics
Faithfulness metrics verify whether the generated answer is grounded in the retrieved context — catching hallucinations and unsupported claims.
Interactive · Faithfulness Score — Claim Checker
Each sentence in the generated answer is decomposed into atomic claims. Click to toggle whether each claim is supported by the retrieved context.
Retrieved Context
"Microsoft's GraphRAG system constructs a knowledge graph from the source corpus, then generates summaries for detected communities. It uses a map-reduce approach: local answers are generated per community, then combined into a global response. The system was evaluated on Podcast and News datasets."
Generated Answer Claims
Live stats: Faithfulness Score · Supported claims · Total claims
8 Faithfulness Metrics
Category 4
LLM-as-Judge Metrics
When no ground-truth answer exists (e.g. global summarization questions), an LLM evaluates two answers head-to-head on qualitative dimensions. Win rate captures which method produces better responses.
Interactive · Pairwise LLM Judge Simulator
Select a quality dimension and see how a simulated LLM judge allocates win rates between two answers. Run multiple rounds to see how wins accumulate.
Answer A · GraphRAG
VS
Answer B · Naive RAG
Live stats: Trials · Win Rate for GraphRAG (A)
Position bias mitigation: Each comparison is run twice with answers in reversed order. The final win rate is the average, preventing the LLM from systematically favouring the first position.
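The order-swapping scheme can be sketched as follows. The `judge` stub below is an assumption standing in for a real LLM call; it is deliberately biased toward whichever answer is shown first, to illustrate that averaging over both orders cancels pure position bias (a position-only judge ends up at a 50% win rate):

```python
import random

def judge(first: str, second: str) -> str:
    """Stand-in for an LLM judge; returns 'first' or 'second'.

    This stub ignores content and favours position one 60% of the time,
    mimicking the position bias the averaging scheme is meant to remove.
    """
    return "first" if random.random() < 0.6 else "second"

def debiased_win_rate(answer_a: str, answer_b: str, trials: int = 1000) -> float:
    """Win rate of answer A, averaging each trial over both presentation orders."""
    wins_a = 0.0
    for _ in range(trials):
        # Pass 1: A is shown first; A wins if the judge picks position one.
        wins_a += 0.5 * (judge(answer_a, answer_b) == "first")
        # Pass 2: order reversed; A wins if the judge picks position two.
        wins_a += 0.5 * (judge(answer_b, answer_a) == "second")
    return wins_a / trials
```

With a content-aware judge, the averaged rate reflects genuine quality differences; with this position-only stub, single-order evaluation would report ~60% for A while the debiased rate stays near 50%.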
10 LLM-Judge Metrics
Category 5
Task-Specific Metrics
GraphRAG is applied across diverse domains — code generation, medical QA, dialogue, KG construction, and efficiency analysis — each requiring specialized metrics.
Usage Analysis
Most Used Metrics Across Papers
How often does each metric appear across the 58 analyzed papers? F1 and EM dominate for QA tasks; LLM-judge metrics are rising for open-ended generation tasks.
Key takeaway: F1 Score is the most universal metric (25+ papers), but the shift to LLM-judge metrics (Comprehensiveness, Diversity, Empowerment) reflects a growing recognition that lexical overlap metrics cannot capture the quality of global, sensemaking answers that GraphRAG is designed for.