Post 01 · Reinforcement Learning
Agent Lightning — RL Frameworks Explorer
Walk through the full RL algorithm landscape interactively — REINFORCE, PPO, GRPO,
RLHF, DPO and more. Includes the Agent Lightning architecture and credit assignment deep dive.
Reinforcement Learning
PPO
GRPO
RLHF
DPO
Post 02 · Prompt Optimization
GEPA — Reflective Prompt Evolution Explorer
Explore how GEPA uses evolutionary algorithms and self-reflection to automatically
discover optimal prompts — outperforming manual prompt engineering on complex reasoning tasks.
Prompt Engineering
Evolutionary Algorithms
GEPA
Self-Reflection
Post 03 · LLM Inference
Defeating Nondeterminism in LLM Inference
Why LLMs give different answers every time — and how to tame it. Covers temperature,
top-p sampling, seeds, batching effects, floating-point non-associativity, and determinism
strategies for production systems.
Nondeterminism
Temperature
Sampling
Inference
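One determinism lever the card mentions, seeding, can be sketched in a few lines. This toy categorical sampler (function and variable names are hypothetical, not from any serving stack) shows why a fixed seed makes token draws repeatable run-to-run, while leaving logit-level nondeterminism untouched:

```python
import random

def sample_token(probs, rng):
    """Draw one token id from a categorical distribution using an explicit RNG."""
    r = rng.random()
    cum = 0.0
    for tok, p in enumerate(probs):
        cum += p
        if r < cum:
            return tok
    return len(probs) - 1  # guard against float round-off

probs = [0.5, 0.3, 0.2]
rng = random.Random(7)
a = [sample_token(probs, rng) for _ in range(5)]
rng = random.Random(7)          # re-seed: identical stream
b = [sample_token(probs, rng) for _ in range(5)]
# Same seed, same draws — but only if the probs themselves are identical;
# batching effects and float non-associativity can still perturb upstream logits.
```

Note that seeding only pins the sampling step: if the model produces slightly different logits (the post's batching and floating-point points), the drawn tokens can still diverge.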
Post 04 · Vector Quantization
TurboQuant — Near-Optimal Vector Quantization
Interactive deep dive into vector quantization for LLM compression — codebooks, residual
quantization, product quantization, and TurboQuant's near-optimal approach to minimising
reconstruction error at extreme compression ratios.
Quantization
Compression
Vector Quantization
LLM Efficiency
Post 05 · Text Generation
How AI Generates Text — 6 Sampling Techniques
Visualise greedy decoding, temperature sampling, top-k, top-p (nucleus), min-p,
and beam search side-by-side. See how each strategy trades off diversity, coherence,
and predictability in practice.
Sampling
Temperature
Top-p
Beam Search
Text Generation
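The top-p (nucleus) strategy above can be sketched directly: keep the smallest high-probability set of tokens whose mass reaches p, then renormalise. A minimal stdlib-only sketch, with hypothetical names:

```python
import math

def top_p_filter(logits, p=0.9):
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalise over that set."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]       # stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:                                 # accumulate from the top
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}      # renormalised nucleus

dist = top_p_filter([2.0, 1.0, 0.5, -1.0], p=0.8)  # only the top 2 tokens survive
```

Lowering p shrinks the nucleus toward greedy decoding; raising it admits the long tail, which is exactly the diversity/coherence trade-off the post visualises.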
Post 06 · RLHF Algorithms
GRPO — Group Relative Policy Optimization
How DeepSeek-R1 trains reasoning with GRPO — eliminating the critic network, computing
group-relative advantages, and applying KL-divergence constraints to keep the model
close to the reference policy during RLHF.
GRPO
RLHF
DeepSeek-R1
Policy Optimization
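The critic-free core of GRPO described above is just a z-score within a group of sampled completions. A minimal sketch (names hypothetical):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: z-score each sampled completion's reward
    against its own group, replacing the learned critic with group statistics."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, a group of 4 sampled answers scored by a reward model:
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
# Above-average answers get positive advantage, below-average negative.
```

These advantages then weight the policy-gradient update, with the KL penalty against the reference policy applied separately as the blurb notes.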
Post 07 · Knowledge Management
LLM Wiki — Build Your Personal Knowledge Base
An interactive glossary and wiki for LLM concepts — searchable, linked, and
organised by topic. From attention mechanisms to RLHF, covers the full vocabulary
of modern language model research and engineering.
Glossary
Knowledge Base
LLM Concepts
Reference
Post 08 · Sentence Embeddings
Sentence-BERT — From 65 Hours to 5 Seconds
How Sentence-BERT revolutionised semantic similarity — siamese networks, pooling strategies,
contrastive and triplet loss, and why SBERT reduced pairwise sentence comparison from
65 hours to 5 seconds at scale.
SBERT
Embeddings
Semantic Similarity
Siamese Networks
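One of the pooling strategies the post covers, mean pooling over non-padding tokens, can be sketched without any framework. A toy pure-Python version, assuming token embeddings arrive as plain lists:

```python
def mean_pool(token_embeddings, attention_mask):
    """Mean pooling: average the per-token vectors where the mask is 1,
    ignoring padding — turning variable-length output into one sentence vector."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, m in zip(token_embeddings, attention_mask):
        if m:
            count += 1
            for j in range(dim):
                summed[j] += vec[j]
    return [s / count for s in summed]

# Two real tokens plus one padding position:
sent = mean_pool([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]], [1, 1, 0])
```

The resulting fixed-size vectors are what make SBERT's speedup possible: similarity becomes a cheap vector comparison instead of a full cross-encoder pass per pair.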
Post 09 · Tokenization
Tokenizers — The Hidden Step That Shapes Every LLM
BPE, WordPiece, SentencePiece — how each tokenizer works, why tokenization
decisions affect model reasoning, and the surprising ways token boundaries
shape what LLMs can and cannot do well.
Tokenization
BPE
WordPiece
SentencePiece
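A single BPE training step, as described in the post, counts adjacent symbol pairs and merges the most frequent one. A toy sketch over a tiny word-frequency corpus (corpus and names are illustrative):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a tokenised corpus; one BPE
    training step merges the most frequent pair into a new symbol."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
pair = most_frequent_pair(corpus)   # ("l","o") and ("o","w") tie at 7 occurrences
corpus = merge_pair(corpus, pair)
```

Real tokenizers repeat this loop thousands of times; each merge decision is one of the "hidden" choices that later shapes what the model sees.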
Post 10 · Learning Paradigms
In-Context Learning — How LLMs Learn Without Learning
The surprising mechanics of in-context learning — how few-shot examples shift
model behaviour without any gradient update, why example order matters, and
the theoretical debates around what ICL actually does inside the model.
In-Context Learning
Few-Shot
Prompting
Meta-Learning
Post 11 · Local AI Agents
OpenClaw — Local-First AI Agent Framework
How OpenClaw builds local-first AI agents using Model Context Protocol — channel
architecture, skill routing, token budget management, gateway security, and how
it connects 20+ MCP servers to local and cloud LLMs.
MCP
Local AI
Agent Framework
Skill Routing
OpenClaw
Post 12 · Neurosymbolic AI
Neurosymbolic AI — Why, What, and How
Explore how neural networks and symbolic reasoning combine — Kautz's 6 architecture types,
lowering & lifting techniques, knowledge graphs, DeepProbLog, AlphaGo, and real-world
applications in healthcare and autonomous driving.
Neurosymbolic
Knowledge Graphs
Symbolic AI
Explainability
Kautz Taxonomy
Post 13 · Agent Evaluation
Evaluating LLM Agents — A Visual Guide
Why agent evaluation is fundamentally different, the evaluation gap (93% pre-deployment only),
4 capability dimensions, 6 benchmarks (AgentBench, SWE-bench, GAIA, BFCL, ToolEmu),
EDDOps framework, 3-layer reference architecture, and 6 evaluation drivers.
Agent Evaluation
EDDOps
LLM-as-Judge
Benchmarks
Safety
Post 14 · LLM Serving
vLLM & PagedAttention — Efficient LLM Serving
Why LLM serving wastes 60–80% of GPU memory — and how PagedAttention fixes it.
Covers KV cache, the OS virtual memory analogy, block tables, copy-on-write,
continuous batching, and the 24× throughput gains over HuggingFace Transformers.
vLLM
PagedAttention
KV Cache
Continuous Batching
LLM Serving
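The block-table idea above, logical KV positions mapped to physical blocks allocated on demand, can be sketched in a toy class. This is an illustrative sketch of the concept, not vLLM's actual data structures:

```python
class BlockTable:
    """Toy sketch of PagedAttention's logical-to-physical block mapping.
    KV cache is allocated in fixed-size blocks as tokens arrive, instead of
    reserving one contiguous max-length slab per sequence up front."""

    def __init__(self, block_size, free_blocks):
        self.block_size = block_size
        self.free = list(free_blocks)   # pool of physical block ids
        self.table = []                 # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.block_size == 0:   # current block is full
            self.table.append(self.free.pop(0))      # allocate a new physical block
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        """Where token token_idx's KV vectors live: (physical block id, offset)."""
        return self.table[token_idx // self.block_size], token_idx % self.block_size

seq = BlockTable(block_size=4, free_blocks=[7, 2, 9])
for _ in range(6):
    seq.append_token()
# 6 tokens occupy only 2 blocks; nothing is reserved for unwritten positions.
```

Because sequences share a pool of physical blocks, the internal and external fragmentation that wastes most of a contiguous allocation largely disappears, which is where the memory savings in the blurb come from.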
Post 15 · AI Optimization
DSPy — Programming, Not Prompting
Stop hand-crafting prompts. DSPy compiles declarative AI programs into optimized pipelines.
Covers Signatures, Modules (CoT, PoT, ReAct), the compilation loop, optimizers (BootstrapFewShot,
MIPROv2, GEPA), and the results: +65% over standard few-shot on Llama2-13b.
DSPy
Signatures
Optimizers
RAG
LLM Programming
Post 16 · Metric Learning
Contrastive Loss — Learning to Compare
The 2005 CVPR paper that launched modern metric learning. Siamese networks, shared weights,
the contrastive loss function with its dead zone, and how it scaled from face verification
to CLIP and SimCLR. 12 interactive canvases with live loss explorer and embedding space animation.
Metric Learning
Siamese Networks
Face Verification
CVPR 2005
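The loss itself, including the dead zone the card mentions, fits in a few lines. A sketch of the contrastive loss on a single pair, given a precomputed embedding distance:

```python
def contrastive_loss(dist, same_pair, margin=1.0):
    """Contrastive loss on one pair (Hadsell, Chopra & LeCun style).
    Similar pairs are pulled together quadratically; dissimilar pairs are
    pushed apart only while inside the margin — beyond it the hinge term
    is zero, the 'dead zone' where gradients vanish."""
    if same_pair:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2

# A dissimilar pair already farther apart than the margin contributes nothing:
assert contrastive_loss(1.5, same_pair=False, margin=1.0) == 0.0
```

The dead zone is the key design choice: once negatives clear the margin, the model stops spending capacity separating them further.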
Post 19 · Model Safety & Alignment
Claude Mythos — The First AI Too Capable to Release
Anthropic's 244-page system card for Claude Mythos Preview: the first model withheld from public release. Project Glasswing, the alignment paradox, chain-of-thought transparency gaps, and a 40-page section on model welfare.
AI Safety
Alignment
System Card
Interpretability
Model Welfare
Post 18 · Structured Data & GNNs
Relational Foundation Models — AI That Speaks SQL
KumoRFM turns any relational database into a temporal graph and applies in-context learning at inference time. From 878 lines of feature engineering to 1 PQL query. No retraining. 1 second.
KumoRFM
Relational AI
Graph Neural Networks
In-Context Learning
PQL
Post 17 · AI Infrastructure
Model Context Protocol — USB-C for AI
The open standard that replaces N×M custom integrations with a single protocol. Learn how MCP connects AI hosts, clients, and servers — and why every major AI tool adopted it.
MCP
AI Agents
Tool Use
Anthropic
AI Infrastructure
Post 20 · AI Agents & Infrastructure
Claude Managed Agents — Stateless Brains, Containerized Hands
Anthropic's new agent runtime: isolated execution containers, append-only event logs, and session lifecycle management. How Brain+Hands+Session decoupling cuts p50 TTFT by 60% and p95 by 90%. Multi-agent orchestration, security model, and full API walkthrough.
Managed Agents
AI Infrastructure
Anthropic
Agent Runtime
Multi-Agent
Post 21 · Search & RAG
Making Search Agents Faster & Smarter
How Contextual AI optimized RAG pipelines on two axes: tuning the search tool stack (embeddings, reranker, hybrid retrieval) and training the planner with GRPO + on-policy distillation + Conditional Log Penalty.
RAG
Search Agents
Planner Training
CER-C
GRPO
Post 22 · AI Governance
NIST AI Risk Management Framework
How 240+ organizations built a shared language for AI risk. GOVERN, MAP, MEASURE, MANAGE: the four functions that turn risk awareness into organizational action, with 7 trustworthiness characteristics every AI system must satisfy.
AI Governance
Risk Management
NIST
Trustworthy AI
AI Policy
Post 23 · Evaluation
GraphRAG Evaluation Metrics
A comprehensive interactive guide to 63 evaluation metrics used across 58 GraphRAG papers — retrieval, generation, faithfulness, LLM-judge, and task-specific metrics, each with live demos and paper citations.
GraphRAG
Evaluation
RAG
Metrics
LLM-Judge
Post 24 · Safety & Governance
Runtime Governance for Agentic AI in Finance
Prompt filters aren't enough for multi-step AI agents. This paper proposes capability decomposition, trajectory-level governance, a 4-tier risk framework, and a 7-step MRM programme aligned to SR 11-7 for financial services.
Post 25 · Evaluation
Multi-Agent System Evaluation — MASEval Framework
Framework choice matters as much as model choice in multi-agent systems.
Explore 4 topology patterns (Sequential, Hierarchical, Parallel, Mesh),
compare 6 frameworks across GAIA, Tau-bench, and MMLU, and simulate
error cascades through the interactive MASEval evaluation lifecycle.
Multi-Agent Systems
MASEval
GAIA
AutoGen
LangGraph
Evaluation
Post 26 · Evaluation
Evaluation Metrics for LLM Multi-Agent Systems
35+ evaluation metrics extracted from 130+ research papers. Explore task performance, communication quality, coordination, safety & trust, planning, and domain-specific metrics with interactive visualizations.
Multi-Agent
Evaluation
IDS
OER
CheckPoint
ECS
Post 27 · Representations
World Models — Learning to Dream
Ha & Schmidhuber's V-M-C architecture: compress observations with a VAE, predict futures with an MDN-RNN, and train controllers entirely inside hallucinated dreams. From a 906 score on CarRacing to DreamerV3 conquering Minecraft.
World Models
VAE
MDN-RNN
Dreamer
Dream Training
RL
Post 28 · Safety & Governance
EU AI Act — The Risk Pyramid
The world's first comprehensive AI regulation visualised: 4-tier risk pyramid, 8 banned practices, high-risk compliance obligations, GPAI model tiers, penalty calculator, and interactive implementation timeline.
EU AI Act
Risk Tiers
GPAI
Compliance
Regulation
Governance
Post 29 · Agents & Systems
LLM-as-a-Verifier — Fine-Grained Trajectory Scoring
How logprob distributions over letter tokens (A–T) turn an LLM into a continuous trajectory scorer. Explore criteria decomposition, round-robin tournament selection, and benchmark results on Terminal-Bench 2 (86.4%) and SWE-bench Verified (77.8%).
Verification
Logprobs
SWE-bench
Trajectories
Gemini
Best-of-N
Post 34 · Evaluation
FinMASEval — Evaluating Multi-Agent AI for Financial Services
Generic MAS evaluation breaks down in finance. FinMASEval extends the MASEval framework with two finance-specific dimensions — hallucination rate and regulatory compliance — benchmarked against FinQA, FinBen, FinanceBench, FLUE, and FinAgentBench. GPT-4-Turbo with RAG still fails 81% of financial questions.
Multi-Agent Systems
Financial AI
Hallucination
FinBen
TradingAgents
FINRA
Evaluation
Post 35 · Evaluation
Agent Harness Engineering — The Infrastructure Beneath LLM Agents
The harness — not the model — is the primary determinant of agent reliability. Meng et al. (2026) introduce the formal H=(E,T,C,S,L,V) framework, analyze 22 systems, and prove that harness-only redesign can produce 10× benchmark gains. Explore the three engineering eras, the completeness matrix, 9 open challenges, and 12 research directions.
Agent Harness
LLM Agents
LangGraph
MCP
Execution Loop
Context Manager
Evaluation
Post 36 · Safety & Governance
The Data Provenance Crisis — Auditing 1,858 AI Training Datasets
70%+ of AI training datasets have no documented license. 66% of HuggingFace licenses differ from author intent. Longpre et al. audit 1,858 datasets across 44 collections, reveal a data availability divide between commercial and non-commercial datasets, and introduce Data Provenance Cards — machine-readable attribution metadata at scale.
Data Provenance
Dataset Licensing
AI Governance
Copyright
Attribution
HuggingFace
Training Data
Post 37 · Training & Alignment
Knowledge Distillation — Teaching Small Models to Think Like Giants
A 40%-smaller DistilBERT retains 97% of BERT's performance. Hinton et al.'s soft-target framework reveals "dark knowledge" hidden in near-zero probabilities — class-similarity structure invisible to hard labels. From temperature scaling to TinyBERT's layer-wise feature matching, explore how large models compress into fast, deployable students.
Knowledge Distillation
Teacher-Student
Soft Targets
Temperature Scaling
DistilBERT
TinyBERT
Model Compression
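Temperature scaling, the mechanism that exposes the "dark knowledge" in the blurb, is a one-line change to softmax. A minimal sketch on toy teacher logits:

```python
import math

def softmax_T(logits, T):
    """Softmax with temperature T. Higher T flattens the distribution,
    surfacing the near-zero runner-up probabilities (class-similarity
    structure) that hard labels throw away."""
    m = max(l / T for l in logits)
    exps = [math.exp(l / T - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

teacher_logits = [8.0, 2.0, 1.0]
hard = softmax_T(teacher_logits, T=1.0)   # winner takes nearly all the mass
soft = softmax_T(teacher_logits, T=4.0)   # runner-up classes become visible
```

The student is trained to match the T-softened teacher distribution (plus the usual hard-label loss), so it inherits relationships between classes, not just the argmax.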
Post 38 · Inference & Serving
Speculative Decoding — Fast Inference via Draft-Then-Verify
A small drafter model proposes K tokens; the large target model verifies all of them in one batched pass. Lossless — output distribution is mathematically identical to standard decoding. 2–3× speedup on T5-XXL, 2.5× on Chinchilla 70B, 10× with memory offloading. No retraining required.
Speculative Decoding
Draft-Then-Verify
Rejection Sampling
LLM Inference
Medusa
EAGLE
Latency
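The accept/reject step that makes draft-then-verify lossless can be sketched over a tiny vocabulary. This toy version (distributions hand-picked, bonus-token sampling after full acceptance omitted) shows the core rule: accept drafted token t with probability min(1, p[t]/q[t]), else resample from the residual:

```python
import random

def speculative_accept(q, p, draft_tokens, rng):
    """Toy rejection step of speculative decoding.
    q: drafter's distribution per position; p: target model's distribution.
    Accepting each drafted token t with prob min(1, p[t]/q[t]) and resampling
    rejections from max(0, p - q) makes the output distribution match p exactly."""
    accepted = []
    for qi, pi, t in zip(q, p, draft_tokens):
        if rng.random() < min(1.0, pi[t] / qi[t]):
            accepted.append(t)
        else:
            residual = [max(0.0, pv - qv) for pv, qv in zip(pi, qi)]
            z = sum(residual)
            r, cum = rng.random() * z, 0.0
            for tok, w in enumerate(residual):
                cum += w
                if r < cum:
                    accepted.append(tok)
                    break
            break  # later drafted tokens are discarded after a rejection
    return accepted

rng = random.Random(0)
q = [[0.5, 0.5], [0.9, 0.1]]   # drafter over a 2-token vocabulary, 2 positions
p = [[0.6, 0.4], [0.5, 0.5]]   # target model
out = speculative_accept(q, p, draft_tokens=[0, 0], rng=rng)
```

The speedup comes from the target model scoring all K drafted positions in one batched forward pass, while the guarantee comes entirely from this rejection rule.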
Post 39 · Production Engineering
72 Techniques to Optimize LLMs in Production
Every lever you can pull — from INT8 quantization and FlashAttention to continuous batching, speculative decoding variants, KV cache offloading, semantic caching, model routing, and function calling. Searchable, filterable reference across 9 categories with in-depth cards for all 72 techniques.
Quantization
Speculative Decoding
KV Cache
Continuous Batching
Model Routing
FlashAttention
Production
Post 40 · APIs & Systems
REST API — Principles, Patterns & Best Practices
From Fielding's 2000 dissertation to production APIs: 6 architectural constraints, HTTP method semantics (safe vs idempotent), 30 status codes, URL design patterns, auth strategies (API Key / JWT / OAuth 2.0), and a full comparison with GraphQL, gRPC, and SOAP.
REST
HTTP
API Design
JWT
OAuth 2.0
GraphQL
gRPC
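The safe-vs-idempotent distinction the card highlights (per RFC 7231: safe means no server state change; idempotent means repeating the request has the same effect as sending it once) can be captured in a small lookup table:

```python
# (safe, idempotent) per HTTP method, following RFC 7231 semantics.
HTTP_METHODS = {
    "GET":     (True,  True),
    "HEAD":    (True,  True),
    "OPTIONS": (True,  True),
    "PUT":     (False, True),    # replaces the resource: repeatable, not safe
    "DELETE":  (False, True),    # deleting twice ends in the same state
    "POST":    (False, False),   # retrying may create duplicate resources
    "PATCH":   (False, False),   # partial updates are not idempotent in general
}

safe = [m for m, (s, _) in HTTP_METHODS.items() if s]
idempotent = [m for m, (_, i) in HTTP_METHODS.items() if i]
```

This table is what makes retry logic principled: idempotent requests can be retried blindly on timeout, while POST needs idempotency keys or deduplication.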
Post 43 · Evaluation
FinCriticalED — Financial Fact-Level OCR Benchmark
The first benchmark that measures financial OCR at the fact level — not character accuracy. 859 SEC documents, 9,481 expert-annotated critical facts across 5 categories (numeric, temporal, monetary unit, entity, concept), 13 models benchmarked with a Deterministic-Rule-Guided LLM-as-Judge.
OCR
Benchmarking
Finance
LLM-as-Judge
Document AI
SEC EDGAR
Post 42 · Retrieval & Knowledge
Deep GraphRAG — Hierarchical Retrieval & Adaptive Integration
A hierarchical, RL-guided knowledge graph RAG system from Ant Group: multi-level graph structure (concept → entity → fact), policy-guided traversal, adaptive evidence integration, and TRPO+DPO policy optimisation — solving multi-hop reasoning that flat RAG cannot handle.
RAG
Knowledge Graph
Multi-hop
Reinforcement Learning
Graph Traversal
HotpotQA
Post 41 · Safety & Governance
AI Agent Traps — Adversarial Attacks on LLM Agents
A systematic taxonomy of 6 attack categories and 17+ subcategories targeting LLM agents: Content Injection, Semantic Manipulation, Cognitive State, Behavioural Control, Systemic Traps, and Human-in-the-Loop Traps — with empirical evidence and mitigation strategies.
Prompt Injection
Agent Security
RAG Poisoning
Jailbreaks
Red Teaming
LLM Safety
Post 33 · Safety & Governance
CaMeL — Defeating Prompt Injections by Design
The first defense with provable security guarantees against prompt injection attacks on LLM agents. CaMeL separates control flow from data flow using a Dual-LLM architecture, capability tags, and security policies — achieving 0 successful attacks on AgentDojo while preserving 77% utility.
Prompt Injection
CaMeL
LLM Agents
Capabilities
Information Flow
Safety
Post 32 · Safety & Governance
LLM Watermarking — Copyright Attribution & Detection
How do you prove an LLM generated a text? Explore 12+ watermarking schemes from Kirchenbauer's green/red token partition to Google's SynthID-Text (20M+ Gemini responses), with interactive radar comparisons, attack robustness heatmaps, and detection hypothesis testing.
Watermarking
SynthID
Token-Level
Attribution
Copyright
Safety
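The green/red partition scheme the card opens with can be sketched in stdlib Python. This is an illustrative toy (hash choice and parameters are assumptions, not Kirchenbauer et al.'s exact construction): the previous token seeds an RNG that marks a gamma fraction of the vocabulary "green"; generation softly boosts green tokens, and detection counts how many emitted tokens landed in their green lists:

```python
import hashlib
import random

def green_list(prev_token, vocab_size, gamma=0.5):
    """Derive a pseudorandom 'green' subset of the vocabulary from the
    previous token. Same prev_token -> same partition, so a detector with
    the key can recompute it without access to the model."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

g = green_list(prev_token=17, vocab_size=10, gamma=0.5)
# Deterministic: recomputing with the same context yields the same set.
```

Detection then becomes a hypothesis test, the last topic in the blurb: human text hits green lists at rate roughly gamma, while watermarked text hits them significantly more often.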
Post 31 · Safety & Governance
Copyright Detection & Mitigation in LLMs
How do you tell whether copyrighted text ended up in a model — and what can you do about it? Explore 286 papers across 4 detection paradigms (MIA, memorization, watermarking, dataset inference) and 5 mitigation strategies, with an interactive arms-race timeline from 2020–2025.
Copyright
Memorization
MIA
Unlearning
Differential Privacy
Safety
Post 30 · Representations
Semantic Collapse — Embedding Space & Entropic Drift
How embedding models silently erase modal, epistemic, indexical, and agency operators. Explore 4 collapse types, neighbourhood entropy diagnostics, the triplet framework (CR, Fidelity AUC), and the Modal Proofing Kernel that preserves semantic boundaries.
Embeddings
Semantic Collapse
Modal Logic
Entropy
MPK
Representations