Visual Summary
Relational Foundation Models — AI That Speaks SQL

The Feature Engineering Wall

Most of the world's data lives in relational databases — spread across dozens of interconnected tables. Teaching a machine learning model to use all of it requires months of manual work that hasn't changed in decades.

60–80%: ML time on feature engineering
878: lines of code to do it manually
12.3h: expert time per new task
1: model per task (no transfer)


Manual join complexity grows O(n²). Graph approach stays flat O(n).


When future events leak into training features, accuracy collapses silently in production.
The core problem: No ML method can learn directly from multiple interconnected tables. Data must be manually joined and aggregated into a single flat training table — a process called feature engineering. It is slow, error-prone, and produces suboptimal models.
Manual Joins
To predict customer churn, a data scientist must join users, orders, products, events, and support tickets — writing hundreds of lines of SQL before any ML begins.
Temporal Leakage
Aggregating future data into past features is a silent killer. One wrong timestamp in a join corrupts the entire dataset, and the model learns from the future.
No Transfer
Every new prediction task — churn, fraud, LTV — requires a fresh round of feature engineering from scratch. Nothing learned from one task transfers to another.
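The flattening these cards describe can be sketched in plain Python: child rows are joined to their parent and collapsed into a handful of fixed aggregates (the schema and values below are illustrative, not from any real system):

```python
from collections import defaultdict

# Toy parent/child tables: users and their orders (illustrative data).
users = [{"user_id": 1}, {"user_id": 2}]
orders = [
    {"user_id": 1, "amount": 30.0},
    {"user_id": 1, "amount": 12.5},
    {"user_id": 2, "amount": 99.0},
]

def flatten(users, orders):
    """Manual feature engineering: collapse child rows into fixed aggregates.
    Everything beyond COUNT/SUM (ordering, item identity, timing) is lost."""
    by_user = defaultdict(list)
    for o in orders:
        by_user[o["user_id"]].append(o["amount"])
    return [
        {
            "user_id": u["user_id"],
            "order_count": len(by_user[u["user_id"]]),
            "order_total": sum(by_user[u["user_id"]]),
        }
        for u in users
    ]

flat = flatten(users, orders)
```

Note what survives the flattening: only a count and a sum per user. The ordering, identity, and timing of the individual orders are gone, which is exactly the representation loss described above.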
Why can’t we just use SQL joins?
SQL joins create a flat table where many rows from child tables become duplicated columns in the parent. This causes data leakage (future events folded into past features), representation loss (only simple aggregates like COUNT, SUM survive), and combinatorial explosion as table count grows.
Why not just use an LLM on the raw data?
LLMs are trained on text, not structured relational data. Feeding raw tables as text loses the relational structure entirely. A 50M-row database can’t fit in any context window, and LLMs can’t compute statistics or reason about temporal ordering of events.
What does “temporal” mean in this context?
Every row in a relational database has a timestamp (created_at, updated_at, event_time). When building features for a prediction at time T, you must only use data from before T. Violating this — even by including data from T+1 — means you’re predicting the past from the future, causing artificially high accuracy that collapses in production.
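As a minimal sketch, the "only before T" rule amounts to a time filter applied inside feature computation (field names here are hypothetical):

```python
def features_at(events, t):
    """Aggregate only events strictly before prediction time t.
    Including anything at or after t is temporal leakage."""
    past = [e for e in events if e["event_time"] < t]
    return {
        "n_events": len(past),
        "last_event": max((e["event_time"] for e in past), default=None),
    }

events = [{"event_time": 5}, {"event_time": 9}, {"event_time": 12}]
safe = features_at(events, t=10)   # sees only the t=5 and t=9 events
leaky_count = len(events)          # a naive count folds in the t=12 event
```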

Databases as Temporal Graphs

The key insight from the Relational Deep Learning paper (arXiv:2312.04615, Fey & Leskovec): every relational database is already a graph — we just haven’t been treating it as one.

The transformation: Each row becomes a node. Each primary–foreign key relationship becomes an edge. Timestamps on rows make the graph temporal. Different table types make it heterogeneous.
Nodes = Table Rows
Every row in every table becomes a node. A users table with 1M rows creates 1M nodes. Products, orders, events — each table type gets a different node type in the heterogeneous graph.
Edges = Foreign Keys
A foreign key relationship (orders.user_id → users.id) becomes a directed edge from each order node to its user node. These edges carry the relational structure that SQL flattening destroys.
Temporal = Safe
Because every node carries its timestamp, GNNs can be constrained to only aggregate along temporally valid edges — automatically preventing data leakage without manual date filters.
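A minimal sketch of the row-to-node, FK-to-edge transformation, assuming a toy two-table schema:

```python
# Toy database: two tables linked by a foreign key (illustrative schema).
tables = {
    "users":  [{"id": 1, "created_at": 0}, {"id": 2, "created_at": 1}],
    "orders": [{"id": 10, "user_id": 1, "created_at": 3},
               {"id": 11, "user_id": 1, "created_at": 7}],
}
fks = [("orders", "user_id", "users")]  # orders.user_id -> users.id

nodes, edges = {}, []
for table, rows in tables.items():
    for row in rows:
        # Node id = (table, primary key); node type = source table;
        # the row timestamp makes the graph temporal.
        nodes[(table, row["id"])] = {"type": table, "time": row["created_at"]}
for child, col, parent in fks:
    for row in tables[child]:
        # One directed edge per FK reference, typed by the relationship.
        edges.append(((child, row["id"]), (parent, row[col]), f"{child}.{col}"))
```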
1 hop: Direct neighbors only — equivalent to a single SQL JOIN. Gets orders, events, and support tickets directly connected to the user.
// Relational Database → Temporal Heterogeneous Graph
G = (V, E, T_v, T_e, ϕ_v, ϕ_e)
V = ⋃ rows(table_i)      // nodes = all rows across all tables
E ⊆ V × V via FK links   // edges = primary-foreign key relationships
ϕ_v: V → node_type       // node type = which table it came from
ϕ_e: E → edge_type       // edge type = which FK relationship
T_v: V → timestamp       // temporal ordering for leak prevention
What is a heterogeneous graph?
A heterogeneous graph has multiple types of nodes and edges. Unlike a social network where every node is a "person," a relational graph has user nodes, product nodes, order nodes, event nodes — each with different features and different roles. GNNs must handle this by having separate learnable weights for each node and edge type.
How does message passing work on a relational graph?
Each node aggregates feature vectors from its neighbors, weighted by edge type. A user node might aggregate from all their orders, events, and support tickets simultaneously. After multiple rounds of message passing, each node’s embedding captures information from its multi-hop neighborhood — the equivalent of a complex SQL join, but learned automatically.
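One round of message passing can be sketched as mean aggregation over incoming neighbors; a real heterogeneous GNN replaces the plain average with learned, edge-type-specific transforms:

```python
from collections import defaultdict

def message_pass(feats, edges):
    """One round: each target node averages its source neighbors' features,
    then combines the aggregate with its own feature."""
    incoming = defaultdict(list)
    for src, dst in edges:
        incoming[dst].append(feats[src])
    out = {}
    for node, f in feats.items():
        msgs = incoming.get(node, [])
        agg = sum(msgs) / len(msgs) if msgs else 0.0
        out[node] = (f + agg) / 2  # simple self + neighborhood combine
    return out

feats = {"u1": 1.0, "o1": 3.0, "o2": 5.0}
edges = [("o1", "u1"), ("o2", "u1")]  # order nodes point at their user
h = message_pass(feats, edges)
```

Stacking several such rounds lets each node's embedding absorb its multi-hop neighborhood, which is the learned analogue of a deep SQL join.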
Which paper introduced Relational Deep Learning?
arXiv:2312.04615, “Relational Deep Learning: Graph Representation Learning on Relational Databases” by Matthias Fey and Jure Leskovec (December 2023). Fey is the creator of PyTorch Geometric (PyG). The paper also introduced the RelBench benchmark suite and is the foundation on which KumoRFM is built.

In-Context Learning at Inference Time

Like GPT-4 using few-shot examples, KumoRFM samples historical labeled subgraphs from the database and uses them as context — making predictions without any gradient updates or task-specific training.

The key insight: Historical data in a database IS the in-context examples. KumoRFM treats past user behavior as few-shot demonstrations and uses them to predict future behavior — without retraining.


Accuracy rises then plateaus as more in-context examples are added: the same scaling shape as LLM few-shot learning.

PREDICT user.will_churn_30d
FOR users WHERE subscription_tier = 'pro'
USING HISTORY 90 DAYS
AGGREGATE
  orders:          COUNT(*), SUM(amount)
  events:          LIST_DISTINCT(event_type)
  support_tickets: COUNT(*), FIRST(priority)
Traditional Approach
Write SQL to join 5 tables — 200 lines. Compute feature aggregations manually. Split into train/test. Train a gradient boosted tree. Tune hyperparameters. Deploy. Time: hours to days.
KumoRFM Approach
Write one PQL query. KumoRFM samples in-context examples, runs the graph transformer, returns predictions. Zero training code. Time: ~1 second.
How are in-context examples selected?
The In-Context Label Generator samples historical entities that are structurally similar to the query entity (same table, similar subgraph topology) and temporally valid (their labels were observed before the query time). The sampling strategy balances class distribution to avoid showing only majority-class examples.
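A minimal sketch of such a sampler, assuming each candidate carries a label and the time the label was observed; the per-class cap used for balancing is an illustrative simplification, not Kumo's actual strategy:

```python
def sample_context(candidates, query_time, k_per_class=2):
    """Keep only temporally valid candidates (label observed before the
    query time), then cap each class to balance the context set."""
    valid = [c for c in candidates if c["label_time"] < query_time]
    picked, counts = [], {}
    # Prefer the most recent valid examples.
    for c in sorted(valid, key=lambda c: c["label_time"], reverse=True):
        if counts.get(c["label"], 0) < k_per_class:
            picked.append(c)
            counts[c["label"]] = counts.get(c["label"], 0) + 1
    return picked

candidates = [
    {"id": 1, "label": 1, "label_time": 3},
    {"id": 2, "label": 0, "label_time": 4},
    {"id": 3, "label": 0, "label_time": 5},
    {"id": 4, "label": 0, "label_time": 6},
    {"id": 5, "label": 1, "label_time": 9},  # after query time: excluded
]
ctx = sample_context(candidates, query_time=8)
```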
What task types does PQL support?
PQL supports four task types: (1) Binary classification — will event X happen? (2) Multi-class classification — which category? (3) Regression — how much? (4) Link prediction — which items will user X interact with? (recommendation systems). All four use the same in-context learning mechanism.
How does this differ from in-context learning in LLMs?
LLM in-context learning uses text examples in the prompt. KumoRFM’s in-context learning uses graph-structured examples — labeled subgraphs extracted from the database. The model attends to these examples using the same Graph Transformer that processes the query entity, so it can reason about structural similarity rather than just surface-level text similarity.

KumoRFM: The 5-Module Architecture

Now that you understand in-context learning, every module clicks. KumoRFM (Fey, Kocijan, Lopez, Leskovec) is the first foundation model for relational databases — each of its 5 modules serves the in-context learning pipeline.

❶ In-Context Label Generator
Dynamically samples historical labeled subgraphs from the database at inference time. These examples — like few-shot examples for LLMs — condition the model’s predictions without any gradient updates.
❷ Table-Width Invariant Encoder
Encodes each cell (numerical, categorical, timestamp, text) into a dense vector independently of the total number of columns. This allows the model to handle any database schema without architectural changes.
❸ Relational Graph Transformer
A Graph Transformer that performs attention across the temporal heterogeneous graph. Uses positional encodings for node type, time delta, structural proximity, and local subgraph patterns to capture relational context.
❹ Explainability Module
Provides gradient-based and analytical explanations at both global (which features matter most?) and individual prediction levels (why was this user flagged?). Critical for production deployments in regulated industries.
❺ Fine-tuning Pipeline
While KumoRFM works zero-shot via in-context learning, it can be fine-tuned on specific databases or query types for production deployments — similar to fine-tuning an LLM on domain-specific text. Fine-tuning adds 10–30% accuracy improvement over the already-competitive zero-shot baseline.
What is Predictive Query Language (PQL)?
PQL is a SQL-like syntax designed for prediction tasks rather than data retrieval. A PQL query specifies: (1) the target variable to predict, (2) which entities to make predictions for, (3) optional filters and aggregation functions (FIRST, COUNT, SUM, LIST_DISTINCT). KumoRFM accepts a PQL query and a database connection, and returns predictions — no training code needed.
How does KumoRFM handle different column types?
The Table-Width Invariant Column Encoder handles: numerical values (standardized, then embedded), categorical values (tokenized, then embedded), timestamps (decomposed into cyclical features + time-delta encodings), and text (encoded via a small language model). Each column’s embedding is processed independently, so the total number of columns doesn’t affect the architecture.
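Width invariance can be sketched as a per-cell dispatch on column type. The encodings below (z-score, hash bucket, cyclical hour) are simplified stand-ins for the learned embeddings the encoder actually uses:

```python
import math

def encode_cell(value, col_type, stats=None):
    """Encode one cell into a small vector, independently of any other
    column. Toy encodings, not the model's real learned embeddings."""
    if col_type == "numerical":
        mean, std = stats  # standardize before embedding
        return [(value - mean) / std, 1.0]
    if col_type == "categorical":
        # Toy hash bucket (note: Python str hashing is salted per process).
        return [float(hash(value) % 97) / 97.0, 2.0]
    if col_type == "timestamp":
        hour = value % 24  # cyclical encoding of hour-of-day
        return [math.sin(2 * math.pi * hour / 24),
                math.cos(2 * math.pi * hour / 24)]
    raise ValueError(col_type)

row = {"amount": 50.0, "tier": "pro", "ts": 18}
embedded = [
    encode_cell(row["amount"], "numerical", stats=(40.0, 10.0)),
    encode_cell(row["tier"], "categorical"),
    encode_cell(row["ts"], "timestamp"),
]
```

Because each cell is encoded on its own, adding or removing columns changes only the number of cell vectors, never the architecture.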
What makes the Graph Transformer “relational”?
Standard Graph Transformers use global attention (every node attends to every other node). The Relational Graph Transformer restricts attention to the relational graph topology — a node can only attend to its neighbors defined by foreign key relationships. This makes computation tractable on million-node graphs while preserving the structural inductive bias of the database schema.
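The restriction can be sketched with a neighbor mask: attention logits to non-neighbors are set to negative infinity before the softmax, so each node distributes attention only over its FK-defined neighborhood (single head, no learned projections):

```python
import math

def masked_attention(scores, neighbors):
    """Row-wise softmax over each node's allowed neighbors only.
    scores[i][j] = raw attention logit from node i to node j."""
    n = len(scores)
    out = []
    for i in range(n):
        logits = [scores[i][j] if j in neighbors[i] else float("-inf")
                  for j in range(n)]
        m = max(l for l in logits if l != float("-inf"))
        exps = [math.exp(l - m) if l != float("-inf") else 0.0
                for l in logits]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

scores = [[0.0, 1.0, 2.0] for _ in range(3)]
neighbors = {0: {1, 2}, 1: {0}, 2: {0}}  # toy FK-defined adjacency
attn = masked_attention(scores, neighbors)
```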

Synthetic Pre-training & Scaling Laws

Training a relational foundation model requires diverse databases — but real databases are private. RDB-PFN (arXiv:2603.03805) and PLUREL (arXiv:2602.04029) solve this by generating millions of synthetic relational databases from scratch.

2M+: synthetic pre-training tasks
19: real-world benchmarks evaluated
Power-law: scaling with data volume
0: private databases required
RDB-PFN (arXiv:2603.03805)
The first relational foundation model trained purely on synthetic data. Uses a Relational Prior Generator based on Structural Causal Models (SCMs) to generate diverse databases. Builds on Prior-Data Fitted Networks (PFNs) — models that learn from distributions of datasets rather than fixed datasets.
PLUREL (arXiv:2602.04029)
Discovers that RFM pre-training loss follows power-law scaling with both the number of synthetic databases and total pre-training tokens. More synthetic data reliably improves real-world performance — unlocking the same scaling recipe that drove LLM progress.
// PLUREL Synthetic Database Generation Pipeline
Step 1: Schema    G_schema = directed graph of tables + column types
Step 2: Connect   G_bipart = bipartite graph of PK→FK cardinalities
Step 3: Features  P(X_col | parent_cols) via conditional causal mechanisms

// Scaling Law (empirical)
Loss(D, T) ∝ D^α × T^β   where α, β < 0
// D = number of synthetic databases
// T = total pre-training tokens
// Doubling D or T reliably reduces downstream task loss
Why this matters: LLMs scaled because the internet provided nearly unlimited text. Relational databases are private. Synthetic generation gives RFMs the same unlimited pre-training data advantage — the power-law scaling curve means more synthetic data always helps.
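Because α and β are negative, the power-law form implies a fixed multiplicative loss reduction every time D or T doubles. A quick numerical check, with exponents chosen arbitrarily for illustration:

```python
def loss(D, T, C=1.0, alpha=-0.2, beta=-0.3):
    """Empirical power-law form: Loss(D, T) = C * D**alpha * T**beta.
    alpha and beta here are illustrative, not fitted values."""
    return C * D ** alpha * T ** beta

l1 = loss(D=1_000, T=1_000_000)
l2 = loss(D=2_000, T=1_000_000)  # double the synthetic databases
ratio = l2 / l1                  # equals 2**alpha, independent of T
```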
What is a Prior-Data Fitted Network (PFN)?
A PFN is a model trained on distributions of datasets rather than a single fixed dataset. During training, PFNs sample a new synthetic dataset for each forward pass and learn to make predictions in-context from the examples in that dataset. This trains the model to be a general-purpose in-context learner rather than a specialist on any particular distribution.
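The PFN training loop can be sketched as follows; the linear-function prior and the nearest-neighbor "model" are toy stand-ins (a real PFN is a transformer attending over the context examples):

```python
import random

def sample_synthetic_dataset(rng, n=16):
    """Toy prior: each 'dataset' is y = w*x + noise with a fresh w."""
    w = rng.uniform(-2, 2)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, w * x + rng.gauss(0, 0.1)) for x in xs]

def in_context_predict(context, x_query):
    """Stand-in for the PFN forward pass: 1-nearest-neighbor on context."""
    return min(context, key=lambda p: abs(p[0] - x_query))[1]

rng = random.Random(0)
losses = []
for step in range(50):  # each step draws a brand-new synthetic dataset
    data = sample_synthetic_dataset(rng)
    context, queries = data[:12], data[12:]
    losses.append(sum((in_context_predict(context, x) - y) ** 2
                      for x, y in queries) / len(queries))
avg_loss = sum(losses) / len(losses)
```

The key structural point is the loop: the model never sees the same dataset twice, so what it learns is the act of predicting from context, not any one distribution.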
What is a Structural Causal Model (SCM)?
An SCM defines variables and causal relationships between them using directed acyclic graphs. PLUREL uses SCMs to generate realistic feature distributions within synthetic tables — ensuring that synthetic column values have realistic correlations, nonlinear relationships, and noise levels rather than being purely random.
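A two-variable SCM for synthetic column generation can be sketched as a causal mechanism plus noise; the quadratic mechanism and noise scale are arbitrary illustrative choices:

```python
import random

def generate_column_pair(rng, n=100):
    """SCM with DAG x1 -> x2: x1 ~ Uniform(0, 1), x2 := f(x1) + noise.
    The mechanism induces realistic (nonlinear, noisy) correlation."""
    rows = []
    for _ in range(n):
        x1 = rng.uniform(0, 1)
        x2 = x1 ** 2 + rng.gauss(0, 0.05)  # nonlinear mechanism + noise
        rows.append((x1, x2))
    return rows

rows = generate_column_pair(random.Random(42))
xs, ys = zip(*rows)
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
```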
Are synthetic databases realistic enough?
RDB-PFN demonstrates competitive performance with supervised deep learning baselines on 19 real-world benchmarks — using only synthetic pre-training. PLUREL shows further improvements by scaling synthetic data volume. The key insight is that diversity of generated schemas matters more than individual realism: the model learns generalizable inductive biases by seeing millions of different schema structures.

RelBench: Measuring What Matters

RelBench (Fey & Leskovec) is the first public benchmark specifically designed for evaluating machine learning on relational databases. It covers 30 tasks across 7 domains at varying scales of complexity.

30: prediction tasks
11: real-world databases
7: domains covered
3: task types


Dataset      | Domain           | Tables | Rows  | Tasks
rel-stack    | Q&A / Social     | 8      | ~16M  | User engagement, post votes, badge prediction
rel-amazon   | E-commerce       | 7      | ~48M  | Product rating, recommendation, churn
rel-trial    | Medical          | 5      | ~3M   | Trial success, patient outcomes
rel-f1       | Sports           | 9      | ~1.2M | Race results, driver performance
rel-hm       | Fashion / Retail | 4      | ~31M  | Purchase prediction, recommendation
rel-event    | Events           | 5      | ~1.5M | Ticket sales, attendance prediction
rel-avito    | Classifieds      | 6      | ~12M  | Ad click-through, price prediction
rel-arxiv    | Academic         | 5      | ~8M   | Citation prediction, author linkage
rel-mimic    | Healthcare       | 15     | ~22M  | ICU mortality, readmission, diagnosis
rel-ratebeer | Reviews          | 4      | ~2.9M | Beer rating, reviewer churn
rel-salt     | Supply Chain     | 6      | ~5M   | Demand forecast, inventory prediction
How does RelBench prevent data leakage?
RelBench defines a strict temporal split: each task specifies a cutoff time T. Training data uses only rows with timestamps before T; test data uses labels from after T. This mirrors real production ML systems where models are trained on historical data and evaluated on truly unseen future data.
What are the three task types in RelBench?
(1) Entity classification/regression: predict a property of an entity row (e.g., "will this user churn?"). (2) Temporal link prediction: predict which two entities will be linked in the future (e.g., "which product will this user buy next?"). (3) Multi-hop prediction: predict labels that require reasoning across multiple table hops.

Results: 1 Second vs. 12 Hours

KumoRFM doesn’t just match traditional approaches: it reaches comparable accuracy orders of magnitude faster, with a fraction of the code.

1s: KumoRFM inference time
vs 30m: traditional RDL baseline
vs 12.3h: human data scientist
2–8%: zero-shot accuracy gain
Zero-shot Wins
On entity classification tasks in RelBench, KumoRFM outperforms feature-engineered baselines by 2–8% accuracy without any task-specific training — just in-context examples sampled from historical data.
Fine-tuned Jumps
When fine-tuned on specific databases or query types, KumoRFM improves 10–30% over the already-competitive zero-shot baseline, matching or exceeding expert-built supervised deep learning systems.
Recommendation
On temporal link prediction (recommendation) tasks, KumoRFM demonstrates competitive MAP@k performance against specialized recommender systems that took weeks of engineering to build.
The full stack: Relational Graph Transformers (April 2025) show an additional 10% win over GNN baselines and 40% over LightGBM on RelBench — with 95% less data preparation effort and 20× faster time-to-value vs. traditional approaches.


What this means for enterprise
SAP partnership (Nov 2025) + Snowflake Intelligence integration bring KumoRFM to production enterprise databases. Any org with a relational database can now run production-grade ML predictions with one PQL query — no ML team required.
MCP Integration (Sept 2025)
KumoRFM added Model Context Protocol support — exposing its prediction capabilities as tools that AI agents can call. An LLM agent can now issue PQL queries and receive predictions as part of a larger reasoning workflow.
What is the Relational Graph Transformer?
The Relational Graph Transformer (April 2025, Lopez, Fey, Leskovec) extends standard Graph Transformers with three innovations: (1) topology-aware attention that respects database schema, (2) multi-modal node attribute encoding for heterogeneous column types, and (3) composable positional encodings (hop, tree, time encodings). It outperforms GNN baselines by ~10% and LightGBM by ~40% on RelBench.
Why does this matter beyond benchmarks?
The benchmark gains translate directly to business value: churn prediction, fraud detection, recommendation, demand forecasting — every company with a relational database runs these tasks. Reducing the time from months to seconds while matching or exceeding expert accuracy represents a fundamental shift in how production ML is built and deployed.
Where is KumoRFM available?
KumoRFM is available through the Kumo AI platform (kumo.ai) with native integrations for Snowflake, SAP, and other enterprise data systems. The underlying research — RelBench, the RDL paper, and related methods — is open-source and available on PyG (PyTorch Geometric).