Post 01 · Reinforcement Learning
Agent Lightning — RL Frameworks Explorer
Walk through the full RL algorithm landscape interactively — REINFORCE, PPO, GRPO,
RLHF, DPO and more. Includes the Agent Lightning architecture and credit assignment deep dive.
Reinforcement Learning
PPO
GRPO
RLHF
DPO
Post 02 · Prompt Optimization
GEPA — Reflective Prompt Evolution Explorer
Explore how GEPA uses evolutionary algorithms and self-reflection to automatically
discover optimal prompts — outperforming manual prompt engineering on complex reasoning tasks.
Prompt Engineering
Evolutionary Algorithms
GEPA
Self-Reflection
Post 03 · LLM Inference
Defeating Nondeterminism in LLM Inference
Why LLMs give different answers every time — and how to tame it. Covers temperature,
top-p sampling, seeds, batching effects, floating-point non-associativity, and determinism
strategies for production systems.
Nondeterminism
Temperature
Sampling
Inference
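One determinism lever the card mentions, seeding, can be sketched in a few lines. This toy categorical sampler (function and variable names are hypothetical, not from any serving stack) shows why a fixed seed makes token draws repeatable run-to-run, while leaving logit-level nondeterminism untouched:

```python
import random

def sample_token(probs, rng):
    """Draw one token id from a categorical distribution using an explicit RNG."""
    r = rng.random()
    cum = 0.0
    for tok, p in enumerate(probs):
        cum += p
        if r < cum:
            return tok
    return len(probs) - 1  # guard against float round-off

probs = [0.5, 0.3, 0.2]
rng = random.Random(7)
a = [sample_token(probs, rng) for _ in range(5)]
rng = random.Random(7)          # re-seed: identical stream
b = [sample_token(probs, rng) for _ in range(5)]
# Same seed, same draws — but only if the probs themselves are identical;
# batching effects and float non-associativity can still perturb upstream logits.
```

Note that seeding only pins the sampling step: if the model produces slightly different logits (the post's batching and floating-point points), the drawn tokens can still diverge.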
Post 04 · Vector Quantization
TurboQuant — Near-Optimal Vector Quantization
Interactive deep dive into vector quantization for LLM compression — codebooks, residual
quantization, product quantization, and TurboQuant's near-optimal approach to minimising
reconstruction error at extreme compression ratios.
Quantization
Compression
Vector Quantization
LLM Efficiency
Post 05 · Text Generation
How AI Generates Text — 6 Sampling Techniques
Visualise greedy decoding, temperature sampling, top-k, top-p (nucleus), min-p,
and beam search side-by-side. See how each strategy trades off diversity, coherence,
and predictability in practice.
Sampling
Temperature
Top-p
Beam Search
Text Generation
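The top-p (nucleus) strategy above can be sketched directly: keep the smallest high-probability set of tokens whose mass reaches p, then renormalise. A minimal stdlib-only sketch, with hypothetical names:

```python
import math

def top_p_filter(logits, p=0.9):
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalise over that set."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]       # stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:                                 # accumulate from the top
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}      # renormalised nucleus

dist = top_p_filter([2.0, 1.0, 0.5, -1.0], p=0.8)  # only the top 2 tokens survive
```

Lowering p shrinks the nucleus toward greedy decoding; raising it admits the long tail, which is exactly the diversity/coherence trade-off the post visualises.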
Post 06 · RLHF Algorithms
GRPO — Group Relative Policy Optimization
How DeepSeek-R1 trains reasoning with GRPO — eliminating the critic network, computing
group-relative advantages, and applying KL-divergence constraints to keep the model
close to the reference policy during RLHF.
GRPO
RLHF
DeepSeek-R1
Policy Optimization
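The critic-free core of GRPO described above is just a z-score within a group of sampled completions. A minimal sketch (names hypothetical):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: z-score each sampled completion's reward
    against its own group, replacing the learned critic with group statistics."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, a group of 4 sampled answers scored by a reward model:
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
# Above-average answers get positive advantage, below-average negative.
```

These advantages then weight the policy-gradient update, with the KL penalty against the reference policy applied separately as the blurb notes.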
Post 07 · Knowledge Management
LLM Wiki — Build Your Personal Knowledge Base
An interactive glossary and wiki for LLM concepts — searchable, linked, and
organised by topic. From attention mechanisms to RLHF, covers the full vocabulary
of modern language model research and engineering.
Glossary
Knowledge Base
LLM Concepts
Reference
Post 08 · Sentence Embeddings
Sentence-BERT — From 65 Hours to 5 Seconds
How Sentence-BERT revolutionised semantic similarity — siamese networks, pooling strategies,
contrastive and triplet loss, and why SBERT reduced pairwise sentence comparison from
65 hours to 5 seconds at scale.
SBERT
Embeddings
Semantic Similarity
Siamese Networks
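One of the pooling strategies the post covers, mean pooling over non-padding tokens, can be sketched without any framework. A toy pure-Python version, assuming token embeddings arrive as plain lists:

```python
def mean_pool(token_embeddings, attention_mask):
    """Mean pooling: average the per-token vectors where the mask is 1,
    ignoring padding — turning variable-length output into one sentence vector."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, m in zip(token_embeddings, attention_mask):
        if m:
            count += 1
            for j in range(dim):
                summed[j] += vec[j]
    return [s / count for s in summed]

# Two real tokens plus one padding position:
sent = mean_pool([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]], [1, 1, 0])
```

The resulting fixed-size vectors are what make SBERT's speedup possible: similarity becomes a cheap vector comparison instead of a full cross-encoder pass per pair.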
Post 09 · Tokenization
Tokenizers — The Hidden Step That Shapes Every LLM
BPE, WordPiece, SentencePiece — how each tokenizer works, why tokenization
decisions affect model reasoning, and the surprising ways token boundaries
shape what LLMs can and cannot do well.
Tokenization
BPE
WordPiece
SentencePiece
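A single BPE training step, as described in the post, counts adjacent symbol pairs and merges the most frequent one. A toy sketch over a tiny word-frequency corpus (corpus and names are illustrative):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a tokenised corpus; one BPE
    training step merges the most frequent pair into a new symbol."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
pair = most_frequent_pair(corpus)   # ("l","o") and ("o","w") tie at 7 occurrences
corpus = merge_pair(corpus, pair)
```

Real tokenizers repeat this loop thousands of times; each merge decision is one of the "hidden" choices that later shapes what the model sees.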
Post 10 · Learning Paradigms
In-Context Learning — How LLMs Learn Without Learning
The surprising mechanics of in-context learning — how few-shot examples shift
model behaviour without any gradient update, why example order matters, and
the theoretical debates around what ICL actually does inside the model.
In-Context Learning
Few-Shot
Prompting
Meta-Learning
Post 11 · Local AI Agents
OpenClaw — Local-First AI Agent Framework
How OpenClaw builds local-first AI agents using Model Context Protocol — channel
architecture, skill routing, token budget management, gateway security, and how
it connects 20+ MCP servers to local and cloud LLMs.
MCP
Local AI
Agent Framework
Skill Routing
OpenClaw
Post 12 · Neurosymbolic AI
Neurosymbolic AI — Why, What, and How
Explore how neural networks and symbolic reasoning combine — Kautz's 6 architecture types,
lowering & lifting techniques, knowledge graphs, DeepProbLog, AlphaGo, and real-world
applications in healthcare and autonomous driving.
Neurosymbolic
Knowledge Graphs
Symbolic AI
Explainability
Kautz Taxonomy
Post 13 · Agent Evaluation
Evaluating LLM Agents — A Visual Guide
Why agent evaluation is fundamentally different, the evaluation gap (93% pre-deployment only),
4 capability dimensions, 6 benchmarks (AgentBench, SWE-bench, GAIA, BFCL, ToolEmu),
EDDOps framework, 3-layer reference architecture, and 6 evaluation drivers.
Agent Evaluation
EDDOps
LLM-as-Judge
Benchmarks
Safety
Post 14 · LLM Serving
vLLM & PagedAttention — Efficient LLM Serving
Why LLM serving wastes 60–80% of GPU memory — and how PagedAttention fixes it.
Covers KV cache, the OS virtual memory analogy, block tables, copy-on-write,
continuous batching, and the 24× throughput gains over HuggingFace Transformers.
vLLM
PagedAttention
KV Cache
Continuous Batching
LLM Serving
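The block-table idea above, logical KV positions mapped to physical blocks allocated on demand, can be sketched in a toy class. This is an illustrative sketch of the concept, not vLLM's actual data structures:

```python
class BlockTable:
    """Toy sketch of PagedAttention's logical-to-physical block mapping.
    KV cache is allocated in fixed-size blocks as tokens arrive, instead of
    reserving one contiguous max-length slab per sequence up front."""

    def __init__(self, block_size, free_blocks):
        self.block_size = block_size
        self.free = list(free_blocks)   # pool of physical block ids
        self.table = []                 # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.block_size == 0:   # current block is full
            self.table.append(self.free.pop(0))      # allocate a new physical block
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        """Where token token_idx's KV vectors live: (physical block id, offset)."""
        return self.table[token_idx // self.block_size], token_idx % self.block_size

seq = BlockTable(block_size=4, free_blocks=[7, 2, 9])
for _ in range(6):
    seq.append_token()
# 6 tokens occupy only 2 blocks; nothing is reserved for unwritten positions.
```

Because sequences share a pool of physical blocks, the internal and external fragmentation that wastes most of a contiguous allocation largely disappears, which is where the memory savings in the blurb come from.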
Post 15 · AI Optimization
DSPy — Programming, Not Prompting
Stop hand-crafting prompts. DSPy compiles declarative AI programs into optimized pipelines.
Covers Signatures, Modules (CoT, PoT, ReAct), the compilation loop, optimizers (BootstrapFewShot,
MIPROv2, GEPA), and the results: +65% over standard few-shot on Llama2-13b.
DSPy
Signatures
Optimizers
RAG
LLM Programming
Post 16 · Metric Learning
Contrastive Loss — Learning to Compare
The 2005 CVPR paper that launched modern metric learning. Siamese networks, shared weights,
the contrastive loss function with its dead zone, and how it scaled from face verification
to CLIP and SimCLR. 12 interactive canvases with live loss explorer and embedding space animation.
Metric Learning
Siamese Networks
Face Verification
CVPR 2005
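The loss itself, including the dead zone the card mentions, fits in a few lines. A sketch of the contrastive loss on a single pair, given a precomputed embedding distance:

```python
def contrastive_loss(dist, same_pair, margin=1.0):
    """Contrastive loss on one pair (Hadsell, Chopra & LeCun style).
    Similar pairs are pulled together quadratically; dissimilar pairs are
    pushed apart only while inside the margin — beyond it the hinge term
    is zero, the 'dead zone' where gradients vanish."""
    if same_pair:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2

# A dissimilar pair already farther apart than the margin contributes nothing:
assert contrastive_loss(1.5, same_pair=False, margin=1.0) == 0.0
```

The dead zone is the key design choice: once negatives clear the margin, the model stops spending capacity separating them further.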
Post 19 · Model Safety & Alignment
Claude Mythos — The First AI Too Capable to Release
Anthropic's 244-page system card for Claude Mythos Preview: the first model withheld from public release. Project Glasswing, the alignment paradox, chain-of-thought transparency gaps, and a 40-page section on model welfare.
AI Safety
Alignment
System Card
Interpretability
Model Welfare
Post 18 · Structured Data & GNNs
Relational Foundation Models — AI That Speaks SQL
KumoRFM turns any relational database into a temporal graph and applies in-context learning at inference time. From 878 lines of feature engineering to 1 PQL query. No retraining. 1 second.
KumoRFM
Relational AI
Graph Neural Networks
In-Context Learning
PQL
Post 17 · AI Infrastructure
Model Context Protocol — USB-C for AI
The open standard that replaces N×M custom integrations with a single protocol. Learn how MCP connects AI hosts, clients, and servers — and why every major AI tool adopted it.
MCP
AI Agents
Tool Use
Anthropic
AI Infrastructure
Post 20 · AI Agents & Infrastructure
Claude Managed Agents — Stateless Brains, Containerized Hands
Anthropic's new agent runtime: isolated execution containers, append-only event logs, and session lifecycle management. How Brain+Hands+Session decoupling cuts p50 TTFT by 60% and p95 by 90%. Multi-agent orchestration, security model, and full API walkthrough.
Managed Agents
AI Infrastructure
Anthropic
Agent Runtime
Multi-Agent
Post 21 · Search & RAG
Making Search Agents Faster & Smarter
How Contextual AI optimized RAG pipelines on two axes: tuning the search tool stack (embeddings, reranker, hybrid retrieval) and training the planner with GRPO + on-policy distillation + Conditional Log Penalty.
RAG
Search Agents
Planner Training
CER-C
GRPO
Post 22 · AI Governance
NIST AI Risk Management Framework
How 240+ organizations built a shared language for AI risk. GOVERN, MAP, MEASURE, MANAGE: the four functions that turn risk awareness into organizational action, with 7 trustworthiness characteristics every AI system must satisfy.
AI Governance
Risk Management
NIST
Trustworthy AI
AI Policy
Post 23 · Evaluation
GraphRAG Evaluation Metrics
A comprehensive interactive guide to 63 evaluation metrics used across 58 GraphRAG papers — retrieval, generation, faithfulness, LLM-judge, and task-specific metrics, each with live demos and paper citations.
GraphRAG
Evaluation
RAG
Metrics
LLM-Judge
Post 24 · Safety & Governance
Runtime Governance for Agentic AI in Finance
Prompt filters aren't enough for multi-step AI agents. This paper proposes capability decomposition, trajectory-level governance, a 4-tier risk framework, and a 7-step MRM programme aligned to SR 11-7 for financial services.
Post 25 · Evaluation
Multi-Agent System Evaluation — MASEval Framework
Framework choice matters as much as model choice in multi-agent systems.
Explore 4 topology patterns (Sequential, Hierarchical, Parallel, Mesh),
compare 6 frameworks across GAIA, Tau-bench, and MMLU, and simulate
error cascades through the interactive MASEval evaluation lifecycle.
Multi-Agent Systems
MASEval
GAIA
AutoGen
LangGraph
Evaluation
Post 26 · Evaluation
Evaluation Metrics for LLM Multi-Agent Systems
35+ evaluation metrics extracted from 130+ research papers. Explore task performance, communication quality, coordination, safety & trust, planning, and domain-specific metrics with interactive visualizations.
Multi-Agent
Evaluation
IDS
OER
CheckPoint
ECS
Post 27 · Representations
World Models — Learning to Dream
Ha & Schmidhuber's V-M-C architecture: compress observations with a VAE, predict futures with an MDN-RNN, and train controllers entirely inside hallucinated dreams. From a 906 score on CarRacing to DreamerV3 conquering Minecraft.
World Models
VAE
MDN-RNN
Dreamer
Dream Training
RL
Post 28 · Safety & Governance
EU AI Act — The Risk Pyramid
The world's first comprehensive AI regulation visualised: 4-tier risk pyramid, 8 banned practices, high-risk compliance obligations, GPAI model tiers, penalty calculator, and interactive implementation timeline.
EU AI Act
Risk Tiers
GPAI
Compliance
Regulation
Governance
Post 29 · Agents & Systems
LLM-as-a-Verifier — Fine-Grained Trajectory Scoring
How logprob distributions over letter tokens (A–T) turn an LLM into a continuous trajectory scorer. Explore criteria decomposition, round-robin tournament selection, and benchmark results on Terminal-Bench 2 (86.4%) and SWE-bench Verified (77.8%).
Verification
Logprobs
SWE-bench
Trajectories
Gemini
Best-of-N
Post 34 · Evaluation
FinMASEval — Evaluating Multi-Agent AI for Financial Services
Generic MAS evaluation breaks down in finance. FinMASEval extends the MASEval framework with two finance-specific dimensions — hallucination rate and regulatory compliance — benchmarked against FinQA, FinBen, FinanceBench, FLUE, and FinAgentBench. GPT-4-Turbo with RAG still fails 81% of financial questions.
Multi-Agent Systems
Financial AI
Hallucination
FinBen
TradingAgents
FINRA
Evaluation
Post 35 · Evaluation
Agent Harness Engineering — The Infrastructure Beneath LLM Agents
The harness — not the model — is the primary determinant of agent reliability. Meng et al. (2026) introduce the formal H=(E,T,C,S,L,V) framework, analyze 22 systems, and prove that harness-only redesign can produce 10× benchmark gains. Explore the three engineering eras, the completeness matrix, 9 open challenges, and 12 research directions.
Agent Harness
LLM Agents
LangGraph
MCP
Execution Loop
Context Manager
Evaluation
Post 36 · Safety & Governance
The Data Provenance Crisis — Auditing 1,858 AI Training Datasets
70%+ of AI training datasets have no documented license. 66% of HuggingFace licenses differ from author intent. Longpre et al. audit 1,858 datasets across 44 collections, reveal a data availability divide between commercial and non-commercial datasets, and introduce Data Provenance Cards — machine-readable attribution metadata at scale.
Data Provenance
Dataset Licensing
AI Governance
Copyright
Attribution
HuggingFace
Training Data
Post 37 · Training & Alignment
Knowledge Distillation — Teaching Small Models to Think Like Giants
A 40%-smaller DistilBERT retains 97% of BERT's performance. Hinton et al.'s soft-target framework reveals "dark knowledge" hidden in near-zero probabilities — class-similarity structure invisible to hard labels. From temperature scaling to TinyBERT's layer-wise feature matching, explore how large models compress into fast, deployable students.
Knowledge Distillation
Teacher-Student
Soft Targets
Temperature Scaling
DistilBERT
TinyBERT
Model Compression
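Temperature scaling, the mechanism that exposes the "dark knowledge" in the blurb, is a one-line change to softmax. A minimal sketch on toy teacher logits:

```python
import math

def softmax_T(logits, T):
    """Softmax with temperature T. Higher T flattens the distribution,
    surfacing the near-zero runner-up probabilities (class-similarity
    structure) that hard labels throw away."""
    m = max(l / T for l in logits)
    exps = [math.exp(l / T - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

teacher_logits = [8.0, 2.0, 1.0]
hard = softmax_T(teacher_logits, T=1.0)   # winner takes nearly all the mass
soft = softmax_T(teacher_logits, T=4.0)   # runner-up classes become visible
```

The student is trained to match the T-softened teacher distribution (plus the usual hard-label loss), so it inherits relationships between classes, not just the argmax.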
Post 38 · Inference & Serving
Speculative Decoding — Fast Inference via Draft-Then-Verify
A small drafter model proposes K tokens; the large target model verifies all of them in one batched pass. Lossless — output distribution is mathematically identical to standard decoding. 2–3× speedup on T5-XXL, 2.5× on Chinchilla 70B, 10× with memory offloading. No retraining required.
Speculative Decoding
Draft-Then-Verify
Rejection Sampling
LLM Inference
Medusa
EAGLE
Latency
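The accept/reject step that makes draft-then-verify lossless can be sketched over a tiny vocabulary. This toy version (distributions hand-picked, bonus-token sampling after full acceptance omitted) shows the core rule: accept drafted token t with probability min(1, p[t]/q[t]), else resample from the residual:

```python
import random

def speculative_accept(q, p, draft_tokens, rng):
    """Toy rejection step of speculative decoding.
    q: drafter's distribution per position; p: target model's distribution.
    Accepting each drafted token t with prob min(1, p[t]/q[t]) and resampling
    rejections from max(0, p - q) makes the output distribution match p exactly."""
    accepted = []
    for qi, pi, t in zip(q, p, draft_tokens):
        if rng.random() < min(1.0, pi[t] / qi[t]):
            accepted.append(t)
        else:
            residual = [max(0.0, pv - qv) for pv, qv in zip(pi, qi)]
            z = sum(residual)
            r, cum = rng.random() * z, 0.0
            for tok, w in enumerate(residual):
                cum += w
                if r < cum:
                    accepted.append(tok)
                    break
            break  # later drafted tokens are discarded after a rejection
    return accepted

rng = random.Random(0)
q = [[0.5, 0.5], [0.9, 0.1]]   # drafter over a 2-token vocabulary, 2 positions
p = [[0.6, 0.4], [0.5, 0.5]]   # target model
out = speculative_accept(q, p, draft_tokens=[0, 0], rng=rng)
```

The speedup comes from the target model scoring all K drafted positions in one batched forward pass, while the guarantee comes entirely from this rejection rule.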
Post 39 · Production Engineering
72 Techniques to Optimize LLMs in Production
Every lever you can pull — from INT8 quantization and FlashAttention to continuous batching, speculative decoding variants, KV cache offloading, semantic caching, model routing, and function calling. Searchable, filterable reference across 9 categories with in-depth cards for all 72 techniques.
Quantization
Speculative Decoding
KV Cache
Continuous Batching
Model Routing
FlashAttention
Production
Post 40 · APIs & Systems
REST API — Principles, Patterns & Best Practices
From Fielding's 2000 dissertation to production APIs: 6 architectural constraints, HTTP method semantics (safe vs idempotent), 30 status codes, URL design patterns, auth strategies (API Key / JWT / OAuth 2.0), and a full comparison with GraphQL, gRPC, and SOAP.
REST
HTTP
API Design
JWT
OAuth 2.0
GraphQL
gRPC
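The safe-vs-idempotent distinction the card highlights (per RFC 7231: safe means no server state change; idempotent means repeating the request has the same effect as sending it once) can be captured in a small lookup table:

```python
# (safe, idempotent) per HTTP method, following RFC 7231 semantics.
HTTP_METHODS = {
    "GET":     (True,  True),
    "HEAD":    (True,  True),
    "OPTIONS": (True,  True),
    "PUT":     (False, True),    # replaces the resource: repeatable, not safe
    "DELETE":  (False, True),    # deleting twice ends in the same state
    "POST":    (False, False),   # retrying may create duplicate resources
    "PATCH":   (False, False),   # partial updates are not idempotent in general
}

safe = [m for m, (s, _) in HTTP_METHODS.items() if s]
idempotent = [m for m, (_, i) in HTTP_METHODS.items() if i]
```

This table is what makes retry logic principled: idempotent requests can be retried blindly on timeout, while POST needs idempotency keys or deduplication.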
Post 43 · Evaluation
FinCriticalED — Financial Fact-Level OCR Benchmark
The first benchmark that measures financial OCR at the fact level — not character accuracy. 859 SEC documents, 9,481 expert-annotated critical facts across 5 categories (numeric, temporal, monetary unit, entity, concept), 13 models benchmarked with a Deterministic-Rule-Guided LLM-as-Judge.
OCR
Benchmarking
Finance
LLM-as-Judge
Document AI
SEC EDGAR
Post 42 · Retrieval & Knowledge
Deep GraphRAG — Hierarchical Retrieval & Adaptive Integration
A hierarchical, RL-guided knowledge graph RAG system from Ant Group: multi-level graph structure (concept → entity → fact), policy-guided traversal, adaptive evidence integration, and TRPO+DPO policy optimisation — solving multi-hop reasoning that flat RAG cannot handle.
RAG
Knowledge Graph
Multi-hop
Reinforcement Learning
Graph Traversal
HotpotQA
Post 41 · Safety & Governance
AI Agent Traps — Adversarial Attacks on LLM Agents
A systematic taxonomy of 6 attack categories and 17+ subcategories targeting LLM agents: Content Injection, Semantic Manipulation, Cognitive State, Behavioural Control, Systemic Traps, and Human-in-the-Loop Traps — with empirical evidence and mitigation strategies.
Prompt Injection
Agent Security
RAG Poisoning
Jailbreaks
Red Teaming
LLM Safety
Post 33 · Safety & Governance
CaMeL — Defeating Prompt Injections by Design
The first defense with provable security guarantees against prompt injection attacks on LLM agents. CaMeL separates control flow from data flow using a Dual-LLM architecture, capability tags, and security policies — achieving 0 successful attacks on AgentDojo while preserving 77% utility.
Prompt Injection
CaMeL
LLM Agents
Capabilities
Information Flow
Safety
Post 32 · Safety & Governance
LLM Watermarking — Copyright Attribution & Detection
How do you prove an LLM generated a text? Explore 12+ watermarking schemes from Kirchenbauer's green/red token partition to Google's SynthID-Text (20M+ Gemini responses), with interactive radar comparisons, attack robustness heatmaps, and detection hypothesis testing.
Watermarking
SynthID
Token-Level
Attribution
Copyright
Safety
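The green/red partition scheme the card opens with can be sketched in stdlib Python. This is an illustrative toy (hash choice and parameters are assumptions, not Kirchenbauer et al.'s exact construction): the previous token seeds an RNG that marks a gamma fraction of the vocabulary "green"; generation softly boosts green tokens, and detection counts how many emitted tokens landed in their green lists:

```python
import hashlib
import random

def green_list(prev_token, vocab_size, gamma=0.5):
    """Derive a pseudorandom 'green' subset of the vocabulary from the
    previous token. Same prev_token -> same partition, so a detector with
    the key can recompute it without access to the model."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

g = green_list(prev_token=17, vocab_size=10, gamma=0.5)
# Deterministic: recomputing with the same context yields the same set.
```

Detection then becomes a hypothesis test, the last topic in the blurb: human text hits green lists at rate roughly gamma, while watermarked text hits them significantly more often.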
Post 31 · Safety & Governance
Copyright Detection & Mitigation in LLMs
How do you tell whether copyrighted text ended up in a model — and what can you do about it? Explore 286 papers across 4 detection paradigms (MIA, memorization, watermarking, dataset inference) and 5 mitigation strategies, with an interactive arms-race timeline from 2020–2025.
Copyright
Memorization
MIA
Unlearning
Differential Privacy
Safety
Post 30 · Representations
Semantic Collapse — Embedding Space & Entropic Drift
How embedding models silently erase modal, epistemic, indexical, and agency operators. Explore 4 collapse types, neighbourhood entropy diagnostics, the triplet framework (CR, Fidelity AUC), and the Modal Proofing Kernel that preserves semantic boundaries.
Embeddings
Semantic Collapse
Modal Logic
Entropy
MPK
Representations