Data Provenance
Visual Summary — Post 36

The Data Provenance Crisis

Auditing 1,858 AI training datasets reveals a systemic breakdown in licensing transparency: more than 70% of datasets have no documented license, and on platforms such as HuggingFace up to 66% of the licenses that are reported misstate the actual terms.

1,858
Datasets audited
44
Collections analyzed
70%+
License omission rate
66%
HuggingFace mismatch rate
17
Authors (legal + ML)
What is Data Provenance?
Data provenance tracks the complete lineage of a dataset — its original source, who created it, under what license, with what conditions, and how it has been transformed or repackaged. For AI training data, this determines what you are legally and ethically permitted to do with a trained model.
Why It Matters Now
Every foundation model and fine-tuned LLM was trained on datasets with licensing conditions. If those conditions prohibit commercial use — and you didn't check — your deployed product may be in violation. With 70%+ of datasets having undocumented licenses, most AI teams are flying blind.
The Scale of the Problem
The Data Provenance Initiative (Longpre et al., 2023) is the first large-scale audit of AI fine-tuning datasets. It brings together legal and ML experts to trace 1,858 datasets across 44 major collections — examining licenses, sources, creators, and downstream use restrictions.
What Was Built
The Data Provenance Explorer — an interactive interface at dataprovenance.org — lets practitioners filter and examine provenance metadata for any of the 1,858 audited datasets before using them in training.
Who Is at Risk
Any team fine-tuning an LLM on open-source instruction datasets. The most popular datasets on HuggingFace see hundreds to 10M+ monthly downloads — and most of those users never check the license conditions.
The Fix
Data Provenance Cards — machine-readable metadata records that travel with datasets — capturing license, source, creator, attribution requirements, and share-alike conditions at scale.

The License Crisis

Across every major dataset hosting platform, the majority of datasets either have no documented license or have a license that misrepresents the actual terms.

Key finding: The Data Provenance Initiative reduced unspecified license rates from 70%+ to 30% through manual auditing — meaning well over half of the "unlicensed" datasets actually had discoverable licenses that platforms simply failed to document.
License Omission by Platform
GitHub
72%
Papers with Code
70%
HuggingFace
69%
After DPI Audit
30%
% of datasets with unspecified/missing license
License Mismatch Rates
HuggingFace
66%
GitHub (too permissive)
29%
HuggingFace (too permissive)
27%
Papers with Code
16%
% of datasets where platform license differs from author's intent
Most Common Licenses Found
15.7%
CC-BY-SA 4.0
Attribution + Share-Alike required
12.3%
OpenAI ToU
Prohibits competing model training
11.6%
CC-BY 4.0
Attribution only — commercially permissive
9.6%
Custom Licenses
Bespoke terms — requires legal review
Attribution requirements: 85% of datasets require attribution — meaning you must credit the dataset creator. Most AI training pipelines do not do this.
Share-alike clauses: 30% of datasets require share-alike — meaning any model trained on them must be released under the same or compatible license. This directly conflicts with closed-source model releases.

What 1,858 Datasets Look Like

The audit covers fine-tuning and alignment datasets — instruction-following, QA, classification, and generation tasks across 44 major collections.

Primary Data Sources
Encyclopedias (Wikipedia)
21.5%
Social Media (Reddit/Twitter)
15.9%
General Web
11.2%
News
11.1%
Entertainment
8.5%
Code (GitHub)
5.7%
Exams / Academic
5.6%
Creator Distribution
Academic institutions
68.7%
Industry labs
21.4%
Research groups
17.1%
Corporations
15.8%
Startups
4.0%
Top academic contributors: UW (8.9%), Stanford (6.8%), NYU (5.4%). Top industry: AI2 (12.3%), FAIR (8.4%), MSR (4.1%).
Dataset Format Types
Zero-shot
Direct instruction, no examples
Few-shot
Instruction + k examples
Chain-of-Thought
Reasoning trace included
Multi-turn
Conversational dialogue format

Four License Categories

The audit classifies every dataset into one of four use-permission categories. Only 46% are licensed for commercial use at all — and even those may carry attribution or share-alike conditions.

46.1%
Commercial
856 datasets. Can be used for any purpose including building commercial products. Examples: CC-BY 4.0, MIT, Apache 2.0, CC0.
30.7%
Unspecified
570 datasets. No license documented. Legally, all rights reserved by default — most restrictive interpretation is that commercial use is prohibited.
19.0%
Non-Commercial
352 datasets. Research and personal use only. Examples: CC-BY-NC 4.0, OpenAI ToU. Building a product on these may constitute infringement.
4.3%
Academic-Only
80 datasets. Restricted to academic research institutions. Cannot be used by industry teams, regardless of whether use is commercial.
Attribution Requirements
Require attribution 85%
Attribution means crediting the dataset creator in your model documentation, paper, and release notes. Most model cards do not do this.
Share-Alike Obligations
Include share-alike clause 30%
Share-alike requires releasing trained models under the same license. This is fundamentally incompatible with proprietary model releases — yet these datasets are widely used.
What does CC-BY-SA 4.0 actually mean for AI training?
CC-BY-SA 4.0 (Creative Commons Attribution-ShareAlike) requires: (1) you credit the original creator, and (2) anything you create using this work must also be released under CC-BY-SA or a compatible license. For AI training, this means a model trained on CC-BY-SA data could be argued to require open-sourcing under the same terms. This is actively litigated — no definitive court ruling exists yet.
Why is the OpenAI Terms of Use (12.3%) so significant?
The OpenAI ToU prohibits using outputs to "develop models that compete with OpenAI." This affects all datasets derived from ChatGPT/GPT-4 outputs — including widely-used instruction-following datasets like Alpaca, Dolly, and others built from GPT-generated data. Any model fine-tuned on these datasets carries this restriction transitively, whether or not the fine-tuner was aware.
What happens with Unspecified licenses?
Under most copyright law, if no license is specified, all rights are reserved by default. This means the creator retains full copyright and no one has permission to use, copy, or distribute the work — including for AI training. Practitioners often treat "no license" as "free to use," which is legally incorrect and potentially infringes the creator's copyright.

The Data Availability Divide

Non-commercial and academic-only datasets are systematically richer — more diverse, longer, and more multi-source — than commercially licensed datasets. This creates a capability gap for open commercial AI.

The structural problem: The most capable and diverse datasets are locked behind non-commercial restrictions. Teams building commercial products are forced to use a subset of the data landscape that is shallower, less diverse, and more synthetic-data-poor.
Task Diversity: NC datasets have 2× more
Non-commercial datasets average 3.4±0.2 task types per dataset vs. 1.7±0.1 for commercial. Creative tasks (brainstorming, creative writing, explanation) are disproportionately concentrated in non-commercial datasets. Commercial datasets skew heavily toward QA and classification.
Source Diversity: NC datasets have 2.6× more
Non-commercial datasets average 4.2±1.3 distinct data sources vs. 1.6±0.1 for commercial. This reflects that academic dataset creators pull from wider pools (web, books, exams, social media) while commercial datasets are more focused and single-source.
Text Length: NC datasets are 15× longer
Non-commercial datasets have a mean target text length of 1,580.7±965.6 tokens vs. 102.7±14.6 for commercial. Long-form generation (essays, reports, summaries) is a non-commercial-only capability in the current dataset landscape.
Synthetic Data: NC datasets have 3.5× more
45.5%±3.4 of non-commercial dataset content is model-generated vs. 12.8%±2.1 for commercial. This may appear counterintuitive — synthetic data is easier to generate commercially — but most synthetic datasets use GPT-4/ChatGPT outputs, which carry the OpenAI ToU non-commercial restriction.
Code datasets are the exception
78% of code datasets are commercially viable — significantly higher than the overall average. GitHub's filtering by code license is easier (MIT, Apache, GPL are well-understood), and most code is written by individuals rather than organizations that impose non-commercial restrictions.
Implication for foundation models: The capability frontier — reasoning, creativity, long-form generation — is primarily trained on non-commercial datasets. Commercial fine-tuning is working with a structurally inferior subset.

Language & Geographic Coverage

English and Western European languages dominate the audited datasets. Low-resource languages are not only underrepresented — they are disproportionately locked behind non-commercial restrictions.

Language Family Coverage
Germanic (English)
Dominant
Romance languages
Moderate
Sino-Tibetan (Chinese)
Sparse
Japonic
Sparse
Indo-Aryan
Very sparse
Turkic languages
Critical gap
Niger-Congo (African)
Critical gap
Indigenous Americas
Near absent
The Double Exclusion Problem
Low-resource languages face two barriers:

1. Underrepresentation — Fewer datasets exist in Turkic, Japonic, Sino-Tibetan, and African language families.

2. Commercial restriction — More than 35% of available low-resource language datasets are non-commercial or academic-only — a higher restriction rate than for English datasets.

This means the communities with the least AI capability also face the most access barriers to building it.
The Code Exception (Again)
Code datasets show the most balanced multilingual coverage because programming languages themselves are language-agnostic, and GitHub hosts code from developers worldwide. Python, JavaScript, Java, and C++ datasets are commercially available and globally contributed.
The Bender Rule: The paper follows the "Bender Rule" of always documenting which languages a dataset covers — making language coverage a first-class attribute in every Data Provenance Card. Most existing dataset documentation fails this basic standard.

The 2023 Licensing Shift

Something changed in 2023. The proportion of new datasets with restrictive licenses jumped dramatically — likely in response to growing awareness of AI training data concerns and high-profile legal disputes.

Pre-2023 Baseline
50–80%
of datasets released before 2023 had no license at all. The ecosystem treated "open" as the default — no one expected their NLP dataset to become AI training fodder at scale.
2023 Unlicensed Rate
Only 12%
of new datasets released in 2023 had no license — a dramatic improvement. Dataset creators became aware that licensing their work mattered for AI use.
2023 NC/Academic Rate
61%
of new datasets in 2023 were released as non-commercial or academic-only — vs. ~20% in prior years. The reaction to commercialization of AI drove a sharp restriction turn.
2018–2021
The Wild West Era
NLP datasets released primarily for academic benchmarking. Licensing treated as an afterthought — 50–80% have no license. No one anticipated billion-parameter models scraping everything in sight.
2022
ChatGPT Changes Everything
GPT-3.5, ChatGPT, and instruction-following models go mainstream. Dataset creators begin to realize their work is being used in commercial products. First wave of CC-BY-NC restrictions on new releases.
2023 Q1–Q2
Legal Disputes and Awareness Spike
High-profile lawsuits (Authors Guild v. OpenAI, Getty Images v. Stability AI) bring copyright and licensing to mainstream attention. 61% of new 2023 datasets adopt NC/academic restrictions.
2023 Q3
Data Provenance Initiative Published
Longpre et al. publish the first large-scale audit of 1,858 datasets. DPExplorer launches at dataprovenance.org. The field finally has a reference for what the actual license situation looks like.
2024–Present
New Licensing Frameworks Emerge
RAIL licenses (Responsible AI Licenses), AI2 ImpACT licenses, and BigScience OpenRAIL offer middle-ground options — commercially usable but with behavioral restrictions on downstream use.

Data Provenance Cards

Machine-readable metadata records that travel with datasets — capturing everything needed to make an informed licensing decision, designed to scale to hundreds of datasets simultaneously.

Design principle: A Data Provenance Card is like a bibliography entry for training data — standardized, machine-readable, filterable, and composable when combining multiple datasets. Unlike Model Cards or Datasheets, it is specifically designed for scale.
Provenance Cards vs. Existing Documentation
Property | Datasheets for Datasets | Model Cards | Provenance Cards
Format | Free-form prose | Structured prose | Machine-readable schema
Scales to 100s of datasets? | No — manual per dataset | No — per model | Yes — concatenable
Filterable / searchable? | No | No | Yes — by license, language, task
License tracking | Sometimes mentioned | Rarely detailed | Explicit — permitted use, attribution, share-alike
Repackaging support | Breaks on remix | N/A | Preserved through repackaging
Auto-generated? | No — manual | No — manual | Yes — via DPExplorer
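The machine-readable, concatenable design can be illustrated with a minimal sketch. Field names here are assumptions for illustration — the actual DPExplorer schema may differ:

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceCard:
    """Minimal machine-readable provenance record (illustrative fields)."""
    dataset_name: str
    source: str                 # original source, e.g. "Wikipedia"
    creator: str                # e.g. "Academic: Stanford"
    license_id: str             # e.g. "CC-BY-SA-4.0"
    use_category: str           # "commercial" | "non-commercial" | "academic" | "unspecified"
    attribution_required: bool
    share_alike: bool
    # Bender Rule: language coverage is a first-class attribute.
    languages: list = field(default_factory=list)

    def to_row(self) -> dict:
        """Flat rows concatenate across 100s of datasets and stay filterable."""
        return vars(self)

# A GPT-derived instruction dataset inherits the OpenAI ToU restriction:
card = ProvenanceCard("alpaca-style-demo", "GPT-4 outputs", "Academic: demo",
                      "OpenAI-ToU", "non-commercial", True, False, ["en"])
```

Because each card flattens to one row, merging collections is just concatenating rows — which is what makes filtering by license, language, or task tractable at scale.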
Recommendations by Stakeholder
Practitioners
Use DPExplorer to filter datasets by your risk tolerance before including them in training pipelines. Never assume "no license = free to use."
Dataset Creators
Apply explicit licenses at release. Consider RAIL or AI2 ImpACT licenses for middle-ground options. Always document language coverage and data sources.
Policymakers
Clarify fair use and TDM exceptions for AI training. Consider dataset-specific legal frameworks rather than applying code copyright law to training data.
Researchers
Focus on under-resourced languages and non-commercial licensing gaps. Releasing commercially-usable datasets for low-resource languages is a high-impact contribution.
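The practitioner recommendation above — filter by license before anything enters a training pipeline — can be sketched as a simple metadata filter. The intent tiers and field names are assumptions for illustration, not the DPExplorer API:

```python
def filter_by_risk(cards, intent="commercial"):
    """Keep only datasets whose use category permits the stated intent.

    Note: 'unspecified' is treated as all-rights-reserved (most
    restrictive) and is excluded under every intent — never as free.
    """
    allowed = {
        "commercial": {"commercial"},
        "research":   {"commercial", "non-commercial"},
        "academic":   {"commercial", "non-commercial", "academic"},
    }[intent]
    return [c for c in cards if c["use_category"] in allowed]

cards = [
    {"name": "a", "use_category": "commercial"},
    {"name": "b", "use_category": "unspecified"},    # excluded everywhere
    {"name": "c", "use_category": "non-commercial"},
]
usable = filter_by_risk(cards, intent="commercial")  # only "a" survives
```

The key design choice is in the comment: "no license" maps to "not usable", inverting the common (and legally incorrect) default of treating unlicensed data as free.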

Provenance in Agentic Data QC

Data quality and data provenance solve adjacent problems. An agentic QC pipeline that catches quality issues but ignores license restrictions is only half the solution. Here's how provenance becomes a native feature — not an afterthought.

Core insight: QC asks "is this data correct?" Provenance asks "is this data usable?" A QC agent needs both answers before a dataset can be trusted in a training pipeline.
Agentic Data QC Pipeline with Provenance Gates
4 Integration Points
1
Upstream License Audit
Before QC runs — check if the dataset is legally usable for the intended purpose
2
Lineage Tracking During QC
Every filter, transformation, and augmentation is logged with the original source license
3
Attribution Propagation
When datasets are merged, the most restrictive license propagates to the output
4
Compliance Gate at Pipeline Exit
Final check before the cleaned dataset is released or used for training
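Integration point 3 — the most restrictive license propagating on merge — can be sketched as follows. The restrictiveness ordering and field names are assumptions for illustration:

```python
# Ordering over the audit's four categories, least to most restrictive;
# 'unspecified' is treated as all-rights-reserved, hence most restrictive.
RESTRICTIVENESS = ["commercial", "non-commercial", "academic", "unspecified"]

def merged_category(categories):
    """When datasets are merged, the most restrictive category wins."""
    return max(categories, key=RESTRICTIVENESS.index)

def merged_obligations(cards):
    """Attribution and share-alike are unions: one obligated dataset
    obligates the whole merged corpus."""
    return {
        "category": merged_category(c["use_category"] for c in cards),
        "attribution_required": any(c["attribution_required"] for c in cards),
        "share_alike": any(c["share_alike"] for c in cards),
    }

corpus = merged_obligations([
    {"use_category": "commercial", "attribution_required": False, "share_alike": False},
    {"use_category": "non-commercial", "attribution_required": True, "share_alike": True},
])
# one CC-BY-NC-SA-style dataset makes the whole corpus non-commercial,
# attribution-required, and share-alike
```

This is why a single share-alike or GPT-derived dataset in a large merged corpus is so consequential: restrictions compose by maximum, not by average.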
QC Signals vs. Provenance Signals — What Each Catches
Issue Type | QC Catches It? | Provenance Catches It? | Risk if Missed
Duplicate examples | ✓ Yes — dedup metrics | — Not applicable | Training overfitting
Label noise / wrong annotations | ✓ Yes — annotation QC | — Not applicable | Model misbehavior
Non-commercial license on dataset | ✗ No — invisible to QC | ✓ Yes — upstream audit | Legal infringement
Platform license mismatch (66% rate) | ✗ No | ✓ Yes — cross-ref source | False compliance confidence
GPT-generated data inheriting OpenAI ToU | ✗ No | ✓ Yes — synthetic provenance | Violates OpenAI ToU at scale
Share-alike conflict in merged corpus | ✗ No | ✓ Yes — propagation check | Closed-source release at risk
Missing attribution credits | ✗ No | ✓ Yes — exit gate | License violation (85% require it)
Low-quality text / toxic content | ✓ Yes — quality filters | — Not applicable | Safety / model quality
Language / geographic coverage gaps | ~ Partial — distribution stats | ✓ Yes — language metadata | Bias / underrepresentation
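Integration point 4 — the compliance gate at pipeline exit — can be sketched as a final check before release or training. The rule set and field names are illustrative assumptions, not the paper's specification:

```python
def compliance_gate(corpus_meta, intent, release_open_source):
    """Final gate before the cleaned corpus is released or trained on.
    Returns a list of flags; an empty list means the gate passes."""
    flags = []
    if intent == "commercial" and corpus_meta["category"] != "commercial":
        flags.append(f"commercial intent conflicts with "
                     f"'{corpus_meta['category']}' corpus category")
    if corpus_meta["share_alike"] and not release_open_source:
        flags.append("share-alike corpus conflicts with closed-source release")
    if corpus_meta["attribution_required"] and not corpus_meta.get("credits_file"):
        flags.append("attribution required but no credits file attached")
    return flags

# A closed-source commercial release on a non-commercial, share-alike,
# attribution-required corpus trips all three checks:
meta = {"category": "non-commercial", "share_alike": True,
        "attribution_required": True}
flags = compliance_gate(meta, intent="commercial", release_open_source=False)
```

Unlike the quality filters earlier in the pipeline, this gate never looks at the data itself — only at the propagated provenance metadata, which is why it catches exactly the rows QC misses in the table above.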

Paper Sources

This visual summary is based on the following papers and resources.

Primary Reference
Longpre et al. (17 authors, CMU / MIT / Hugging Face / Allen AI) — "The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI" — 2023. arXiv: 2310.16787. Audits 1,858 text datasets across 44 collections. Introduces Data Provenance Cards and releases the DPExplorer tool.
↗ arXiv 2310.16787 ↗ dataprovenance.org ↗ GitHub Repository
Related Works
Gebru et al. — "Datasheets for Datasets" — 2021. Communications of the ACM. The original dataset documentation framework that Provenance Cards extend for scale. arXiv: 1803.09010.
Mitchell et al. — "Model Cards for Model Reporting" — 2019. FAccT. Establishes transparency standards for model releases — the model-side complement to dataset provenance.
Bender et al. — "On the Dangers of Stochastic Parrots" — 2021. FAccT. Introduces the "Bender Rule" for language documentation and raises foundational concerns about undocumented training data.