Auditing 1,858 AI training datasets reveals a systemic breakdown in licensing transparency: 70%+ of datasets have no documented license, and more than half of platform-reported licenses misstate the actual terms.
1,858
Datasets audited
44
Collections analyzed
70%+
License omission rate
66%
HuggingFace mismatch rate
17
Authors (legal + ML)
What is Data Provenance?
Data provenance tracks the complete lineage of a dataset — its original source, who created it, under what license, with what conditions, and how it has been transformed or repackaged. For AI training data, this determines what you are legally and ethically permitted to do with a trained model.
Why It Matters Now
Every foundation model and fine-tuned LLM was trained on datasets with licensing conditions. If those conditions prohibit commercial use — and you didn't check — your deployed product may be in violation. With 70%+ of datasets having undocumented licenses, most AI teams are flying blind.
The Scale of the Problem
The Data Provenance Initiative (Longpre et al., 2023) is the first large-scale audit of AI fine-tuning datasets. It brings together legal and ML experts to trace 1,858 datasets across 44 major collections — examining licenses, sources, creators, and downstream use restrictions.
What Was Built
The Data Provenance Explorer — an interactive interface at dataprovenance.org — lets practitioners filter and examine provenance metadata for any of the 1,858 audited datasets before using them in training.
Who Is at Risk
Any team fine-tuning an LLM on open-source instruction datasets. The most popular datasets on HuggingFace see hundreds to more than 10 million monthly downloads — and most users never check the license conditions.
The Fix
Data Provenance Cards — machine-readable metadata records that travel with datasets — capturing license, source, creator, attribution requirements, and share-alike conditions at scale.
The Core Problem
The License Crisis
Across every major dataset hosting platform, the majority of datasets either have no documented license or have a license that misrepresents the actual terms.
Key finding: The Data Provenance Initiative reduced unspecified license rates from 70%+ to 30% through manual auditing — meaning more than half of "unlicensed" datasets actually had discoverable licenses that platforms simply failed to document.
License Omission by Platform
GitHub
72%
Papers with Code
70%
HuggingFace
69%
After DPI Audit
30%
% of datasets with unspecified/missing license
License Mismatch Rates
HuggingFace
66%
GitHub (too permissive)
29%
HuggingFace (too permissive)
27%
Papers with Code
16%
% of datasets where platform license differs from author's intent
Most Common Licenses Found
15.7%
CC-BY-SA 4.0
Attribution + Share-Alike required
12.3%
OpenAI ToU
Prohibits competing model training
11.6%
CC-BY 4.0
Attribution only — commercially permissive
9.6%
Custom Licenses
Bespoke terms — requires legal review
Attribution requirements: 85% of datasets require attribution — meaning you must credit the dataset creator. Most AI training pipelines do not do this.
Share-alike clauses: 30% of datasets require share-alike — meaning any model trained on them must be released under the same or compatible license. This directly conflicts with closed-source model releases.
Dataset Landscape
What 1,858 Datasets Look Like
The audit covers fine-tuning and alignment datasets — instruction-following, QA, classification, and generation tasks across 44 major collections.
Primary Data Sources
Encyclopedias (Wikipedia)
21.5%
Social Media (Reddit/Twitter)
15.9%
General Web
11.2%
News
11.1%
Entertainment
8.5%
Code (GitHub)
5.7%
Exams / Academic
5.6%
Creator Distribution
Academic institutions
68.7%
Industry labs
21.4%
Research groups
17.1%
Corporations
15.8%
Startups
4.0%
Top academic contributors: UW (8.9%), Stanford (6.8%), NYU (5.4%). Top industry: AI2 (12.3%), FAIR (8.4%), MSR (4.1%).
Dataset Format Types
Zero-shot
Direct instruction, no examples
Few-shot
Instruction + k examples
Chain-of-Thought
Reasoning trace included
Multi-turn
Conversational dialogue format
License Taxonomy
Four License Categories
The audit classifies every dataset into one of four use-permission categories. Only 46% can be used commercially without restrictions.
46.1%
Commercial
856 datasets. Can be used for any purpose including building commercial products. Examples: CC-BY 4.0, MIT, Apache 2.0, CC0.
30.7%
Unspecified
570 datasets. No license documented. Legally, all rights are reserved by default — the safest interpretation is that any reuse, commercial or otherwise, requires the creator's permission.
19.0%
Non-Commercial
352 datasets. Research and personal use only. Examples: CC-BY-NC 4.0, OpenAI ToU. Building a product on these may constitute infringement.
4.3%
Academic-Only
80 datasets. Restricted to academic research institutions. Cannot be used by industry teams, regardless of whether use is commercial.
Attribution Requirements
Require attribution — 85%
Attribution means crediting the dataset creator in your model documentation, paper, and release notes. Most model cards do not do this.
Share-Alike Obligations
Include share-alike clause — 30%
Share-alike requires releasing trained models under the same license. This is fundamentally incompatible with proprietary model releases — yet these datasets are widely used.
What does CC-BY-SA 4.0 actually mean for AI training?
CC-BY-SA 4.0 (Creative Commons Attribution-ShareAlike) requires: (1) you credit the original creator, and (2) anything you create using this work must also be released under CC-BY-SA or a compatible license. For AI training, this means a model trained on CC-BY-SA data could be argued to require open-sourcing under the same terms. This is actively litigated — no definitive court ruling exists yet.
Why is the OpenAI Terms of Use (12.3%) so significant?
The OpenAI ToU prohibits using outputs to "develop models that compete with OpenAI." This affects all datasets derived from ChatGPT/GPT-4 outputs — including widely used instruction-following datasets like Alpaca that were built from GPT-generated data. Any model fine-tuned on these datasets carries this restriction transitively, whether or not the fine-tuner was aware.
What happens with Unspecified licenses?
Under most copyright law, if no license is specified, all rights are reserved by default. This means the creator retains full copyright and no one has permission to use, copy, or distribute the work — including for AI training. Practitioners often treat "no license" as "free to use," which is legally incorrect and potentially infringes the creator's copyright.
Structural Inequality
The Data Availability Divide
Non-commercial and academic-only datasets are systematically richer — more diverse, longer, and more multi-source — than commercially licensed datasets. This creates a capability gap for open commercial AI.
The structural problem: The most capable and diverse datasets are locked behind non-commercial restrictions. Teams building commercial products are forced to use a subset of the data landscape that is shallower, less diverse, and more synthetic-data-poor.
■ Task Diversity: NC datasets have 2× more
Non-commercial datasets average 3.4±0.2 task types per dataset vs. 1.7±0.1 for commercial. Creative tasks (brainstorming, creative writing, explanation) are disproportionately concentrated in non-commercial datasets. Commercial datasets skew heavily toward QA and classification.
■ Source Diversity: NC datasets have 2.6× more
Non-commercial datasets average 4.2±1.3 distinct data sources vs. 1.6±0.1 for commercial. This reflects that academic dataset creators pull from wider pools (web, books, exams, social media) while commercial datasets are more focused and single-source.
■ Text Length: NC datasets are 15× longer
Non-commercial datasets have a mean target text length of 1,580.7±965.6 tokens vs. 102.7±14.6 for commercial. Long-form generation (essays, reports, summaries) is a non-commercial-only capability in the current dataset landscape.
■ Synthetic Data: NC datasets have 3.5× more
45.5%±3.4 of non-commercial dataset content is model-generated vs. 12.8%±2.1 for commercial. This may appear counterintuitive — synthetic data is easier to generate commercially — but most synthetic datasets use GPT-4/ChatGPT outputs, which carry the OpenAI ToU non-commercial restriction.
■ Code datasets are the exception
78% of code datasets are commercially viable — significantly higher than the overall average. GitHub's filtering by code license is easier (MIT, Apache, GPL are well-understood), and most code is written by individuals rather than organizations that impose non-commercial restrictions.
Implication for foundation models: The capability frontier — reasoning, creativity, long-form generation — is primarily trained on non-commercial datasets. Commercial fine-tuning is working with a structurally inferior subset.
Geographic Coverage
Language & Geographic Coverage
English and Western European languages dominate the audited datasets. Low-resource languages are not only underrepresented — they are disproportionately locked behind non-commercial restrictions.
Language Family Coverage
Germanic (English)
Dominant
Romance languages
Moderate
Sino-Tibetan (Chinese)
Sparse
Japonic
Sparse
Indo-Aryan
Very sparse
Turkic languages
Critical gap
Niger-Congo (African)
Critical gap
Indigenous Americas
Near absent
The Double Exclusion Problem
Low-resource languages face two barriers:
1. Underrepresentation — Fewer datasets exist in Turkic, Japonic, Sino-Tibetan, and African language families.
2. Commercial restriction — More than 35% of available low-resource language datasets are non-commercial or academic-only — a higher restriction rate than for English datasets.
This means the communities with the least AI capability also face the most access barriers to building it.
The Code Exception (Again)
Code datasets show the most balanced multilingual coverage because programming languages themselves are language-agnostic, and GitHub hosts code from developers worldwide. Python, JavaScript, Java, and C++ datasets are commercially available and globally contributed.
The Bender Rule: The paper follows the "Bender Rule" of always documenting which languages a dataset covers — making language coverage a first-class attribute in every Data Provenance Card. Most existing dataset documentation fails this basic standard.
Temporal Analysis
The 2023 Licensing Shift
Something changed in 2023. The proportion of new datasets with restrictive licenses jumped dramatically — likely in response to growing awareness of AI training data concerns and high-profile legal disputes.
Pre-2023 Baseline
50–80%
of datasets released before 2023 had no license at all. The ecosystem treated "open" as the default — no one expected their NLP dataset to become AI training fodder at scale.
2023 Unlicensed Rate
Only 12%
of new datasets released in 2023 had no license — a dramatic improvement. Dataset creators became aware that licensing their work mattered for AI use.
2023 NC/Academic Rate
61%
of new datasets in 2023 were released as non-commercial or academic-only — vs. ~20% in prior years. The reaction to commercialization of AI drove a sharp restriction turn.
2018–2021
The Wild West Era
NLP datasets released primarily for academic benchmarking. Licensing treated as an afterthought — 50–80% have no license. No one anticipated billion-parameter models scraping everything in sight.
2022
ChatGPT Changes Everything
GPT-3.5, ChatGPT, and instruction-following models go mainstream. Dataset creators begin to realize their work is being used in commercial products. First wave of CC-BY-NC restrictions on new releases.
2023 Q1–Q2
Legal Disputes and Awareness Spike
High-profile lawsuits (Getty Images v. Stability AI; later in the year, Authors Guild v. OpenAI) bring copyright and licensing to mainstream attention. 61% of new 2023 datasets adopt NC/academic restrictions.
2023 Q3
Data Provenance Initiative Published
Longpre et al. publish the first large-scale audit of 1,858 datasets. DPExplorer launches at dataprovenance.org. The field finally has a reference for what the actual license situation looks like.
2024–Present
New Licensing Frameworks Emerge
RAIL licenses (Responsible AI Licenses), AI2 ImpACT licenses, and BigScience OpenRAIL offer middle-ground options — commercially usable but with behavioral restrictions on downstream use.
Legal Context
The Legal Landscape
Data provenance exists at the intersection of copyright law, contract law, and AI policy — with different rules in every jurisdiction and fundamental questions still unresolved.
Disclaimer: The following is educational context from the paper's legal analysis. It is not legal advice. Consult qualified legal counsel before making licensing decisions for your specific jurisdiction.
Jurisdictional Differences
United States
Fair use doctrine may protect some AI training. Four-factor test considers: purpose, nature of work, amount used, market effect. No definitive ruling yet on whether scraping for training is fair use.
European Union
Text and Data Mining (TDM) exceptions under DSM Directive allow research mining. Commercial TDM is allowed unless rights holders explicitly opt out. Sui generis database rights add an additional layer.
Rest of World
Highly fragmented. Japan has broad TDM exceptions. China has evolving AI regulations. Most jurisdictions lack AI-specific guidance — default copyright law applies.
The 4 Unresolved Questions
1. Is scraping infringement?
Courts are split. hiQ v. LinkedIn held that scraping publicly available data does not violate the CFAA, but AI training raises different questions. No definitive ruling.
2. Are trained models derivative works?
If yes, share-alike clauses would require open-sourcing trained models. If no, training is non-infringing. Completely unresolved.
3. Does OpenAI ToU bind downstream users?
When a dataset is built from ChatGPT outputs, does the ToU follow the data to third parties who never agreed to it?
4. Do software licenses apply to data?
MIT and Apache were designed for code. "Derivative works" in ML is unclear. Share-alike obligations on data conflict when mixing datasets with incompatible licenses.
Case Study: SQuAD — The Wikipedia Question
Data Source
539 Wikipedia articles under CC-BY-SA. Wikipedia requires share-alike. Does this propagate to the QA annotations?
Annotations
100,000+ crowd-sourced QA pairs. Crowd workers hold copyright in their annotations — but derived from copyrighted Wikipedia text.
The Question
Can SQuAD be used for commercial fine-tuning? It's one of the most widely used QA datasets — and the answer is genuinely unclear.
The Solution
Data Provenance Cards
Machine-readable metadata records that travel with datasets — capturing everything needed to make an informed licensing decision, designed to scale to hundreds of datasets simultaneously.
Design principle: A Data Provenance Card is like a bibliography entry for training data — standardized, machine-readable, filterable, and composable when combining multiple datasets. Unlike Model Cards or Datasheets, it is specifically designed for scale.
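As a concrete illustration, a machine-readable Provenance Card might look like the following minimal sketch. The field names and example values here are assumptions for illustration — they are not the paper's official card schema.

```python
from dataclasses import dataclass, field, asdict
import json

# Illustrative sketch only — field names are assumptions,
# not the official Data Provenance Card schema.
@dataclass
class ProvenanceCard:
    dataset_id: str                                  # e.g. a HuggingFace identifier
    license: str                                     # SPDX-style license string
    license_category: str                            # commercial / non-commercial / academic-only / unspecified
    sources: list = field(default_factory=list)      # upstream data origins
    creators: list = field(default_factory=list)     # institutions or individuals
    languages: list = field(default_factory=list)    # Bender Rule: always document languages
    attribution_required: bool = True
    share_alike: bool = False

    def to_json(self) -> str:
        # Machine-readable output that can travel with the dataset
        return json.dumps(asdict(self), indent=2)

card = ProvenanceCard(
    dataset_id="databricks/dolly-15k",   # hypothetical identifier
    license="CC-BY-SA-3.0",
    license_category="commercial",
    sources=["human-written"],
    creators=["Databricks"],
    languages=["en"],
    share_alike=True,
)
print(card.to_json())
```

Because the card is plain JSON, it is filterable and composable: combining datasets means combining (and intersecting) their cards rather than re-auditing from scratch.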
Practitioners
Use DPExplorer to filter datasets by your risk tolerance before including them in training pipelines. Never assume "no license = free to use."
Dataset Creators
Apply explicit licenses at release. Consider RAIL or AI2 ImpACT licenses for middle-ground options. Always document language coverage and data sources.
Policymakers
Clarify fair use and TDM exceptions for AI training. Consider dataset-specific legal frameworks rather than applying code copyright law to training data.
Researchers
Focus on under-resourced languages and non-commercial licensing gaps. Releasing commercially-usable datasets for low-resource languages is a high-impact contribution.
Applied Context
Provenance in Agentic Data QC
Data quality and data provenance solve adjacent problems. An agentic QC pipeline that catches quality issues but ignores license restrictions is only half the solution. Here's how provenance becomes a native feature — not an afterthought.
Core insight: QC asks "is this data correct?" Provenance asks "is this data usable?" A QC agent needs both answers before a dataset can be trusted in a training pipeline.
Agentic Data QC Pipeline with Provenance Gates
4 Integration Points
1
Upstream License Audit
Before QC runs — check if the dataset is legally usable for the intended purpose
What the agent does
1. Receives dataset identifier (HuggingFace ID, GitHub URL, or local path)
2. Looks up the Data Provenance Collection metadata
3. Compares license category against the configured use-intent (commercial / research / academic)
4. Flags non-compliant datasets before any QC computation begins
Why this matters
Running QC on a non-commercial dataset wastes compute and creates false confidence. Teams have shipped products trained on datasets they were never permitted to use commercially — because QC passed but provenance was never checked.
Real example: Alpaca (52K examples) — passes every QC check. Fails commercial use: inherits OpenAI ToU non-commercial restriction.
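The upstream audit step can be sketched as a simple gate: map each license category to the use intents it permits, and refuse to run QC on anything non-compliant. The `ALLOWED` mapping below is an assumption for illustration, not the paper's policy.

```python
# Which license categories satisfy which declared use intents.
# This mapping is an illustrative assumption — adapt to legal guidance.
ALLOWED = {
    "commercial": {"commercial"},
    "research":   {"commercial", "non-commercial"},
    "academic":   {"commercial", "non-commercial", "academic-only"},
}

def audit_dataset(license_category: str, use_intent: str) -> dict:
    """Flag a dataset before any QC computation begins."""
    # Unspecified licenses fall through to all-rights-reserved: fail every intent.
    ok = license_category in ALLOWED.get(use_intent, set())
    return {
        "compliant": ok,
        "reason": None if ok else (
            f"license category '{license_category}' does not permit "
            f"'{use_intent}' use"
        ),
    }

# Alpaca inherits the OpenAI ToU non-commercial restriction:
print(audit_dataset("non-commercial", "commercial"))
# Apache-2.0 data is fine for commercial fine-tuning:
print(audit_dataset("commercial", "commercial"))
```

Note that an unspecified license fails every intent here — the gate encodes "no license ≠ free to use" by default.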
2
Lineage Tracking During QC
Every filter, transformation, and augmentation is logged with the original source license
What the agent logs
For each QC operation (dedup, filter, augment, translate, reformat):
— Input dataset + its license
— Operation type and parameters
— Output row count and format
— Any new data injected (e.g., synthetic augmentation) and its source license
— Timestamp and agent version
3
License Propagation on Merge
When datasets are merged, the most restrictive license propagates to the output
The propagation rule
When QC merges N datasets into a single training corpus, the output corpus inherits the intersection of what is permitted across all constituent licenses. The agent automatically computes this and warns if constituents are incompatible.
Example merge
Commercial — OpenAssistant (Apache 2.0)
Commercial — Dolly-15k (CC-BY-SA)
Non-Commercial — Alpaca (OpenAI ToU)
⚠ Non-Commercial — Merged corpus (restricted)
Share-alike conflict detection
The agent detects when a mix of share-alike (e.g. CC-BY-SA) and non-share-alike datasets are combined — a conflict that cannot be legally resolved without separating the corpora. It suggests remediation: split the corpus or replace the offending constituent.
Dolly-15k (CC-BY-SA) + OpenAssistant (Apache 2.0) = share-alike conflict. Agent recommends: use only one, or release merged corpus under CC-BY-SA.
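The propagation rule and the share-alike conflict check can be sketched in a few lines. The restrictiveness ordering and the conflict heuristic below are illustrative assumptions, not the paper's algorithm.

```python
# Ordered least → most restrictive; an assumption for illustration.
RESTRICTIVENESS = ["commercial", "non-commercial", "academic-only", "unspecified"]

def merge_licenses(constituents: list[dict]) -> dict:
    """Merged corpus inherits the most restrictive constituent category;
    mixing share-alike and non-share-alike constituents raises a conflict."""
    categories = [c["category"] for c in constituents]
    merged = max(categories, key=RESTRICTIVENESS.index)
    share_alike = [c["name"] for c in constituents if c.get("share_alike")]
    # Conflict when some, but not all, constituents carry share-alike terms
    conflict = bool(share_alike) and len(share_alike) < len(constituents)
    return {"category": merged,
            "share_alike_conflict": conflict,
            "share_alike_sources": share_alike}

corpus = merge_licenses([
    {"name": "OpenAssistant", "category": "commercial"},                       # Apache 2.0
    {"name": "Dolly-15k",     "category": "commercial", "share_alike": True},  # CC-BY-SA
    {"name": "Alpaca",        "category": "non-commercial"},                   # OpenAI ToU
])
print(corpus["category"])              # → non-commercial (most restrictive wins)
print(corpus["share_alike_conflict"])  # → True (Dolly's share-alike vs. the rest)
```

This reproduces the example merge above: one non-commercial constituent restricts the whole corpus, and Dolly-15k's share-alike terms flag a conflict with its non-share-alike siblings.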
4
Compliance Gate at Pipeline Exit
Final check before the cleaned dataset is released or used for training
What the gate checks
— Final corpus license vs. declared use intent (commercial / research)
— Attribution requirements: are all constituent datasets credited?
— Share-alike: if triggered, is the release plan compatible?
— Geographic restrictions: any datasets with jurisdiction-specific terms?
— Synthetic data: does any augmented content carry upstream API terms?
Output: Auto-generated Provenance Card
At pipeline exit, the agent generates a Data Provenance Card for the cleaned corpus — covering every constituent dataset, all transformations applied, the resulting combined license, and the attribution credits required. This card travels with the dataset to downstream consumers.
Output card is machine-readable JSON — can be consumed by model training pipelines, model cards generators, and regulatory audit tools.
QC Signals vs. Provenance Signals — What Each Catches
| Issue Type | QC Catches It? | Provenance Catches It? | Risk if Missed |
|---|---|---|---|
| Duplicate examples | ✓ Yes — dedup metrics | — Not applicable | Training overfitting |
| Label noise / wrong annotations | ✓ Yes — annotation QC | — Not applicable | Model misbehavior |
| Non-commercial license on dataset | ✗ No — invisible to QC | ✓ Yes — upstream audit | Legal infringement |
| Platform license mismatch (66% rate) | ✗ No | ✓ Yes — cross-ref source | False compliance confidence |
| GPT-generated data inheriting OpenAI ToU | ✗ No | ✓ Yes — synthetic provenance | Violates OpenAI ToU at scale |
| Share-alike conflict in merged corpus | ✗ No | ✓ Yes — propagation check | Closed-source release at risk |
| Missing attribution credits | ✗ No | ✓ Yes — exit gate | License violation (85% require it) |
| Low-quality text / toxic content | ✓ Yes — quality filters | — Not applicable | Safety / model quality |
| Language / geographic coverage gaps | ~ Partial — distribution stats | ✓ Yes — language metadata | Bias / underrepresentation |
Compliance Simulator — Build Your Training Corpus
Select datasets and a use intent to see what the compliance gate would flag.
Citations
Paper Sources
This visual summary is based on the following papers and resources.
Primary Reference
Longpre et al. (17 authors, CMU / MIT / Hugging Face / Allen AI) — "The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI" — 2023. arXiv: 2310.16787. Audits 1,858 text datasets across 44 collections. Introduces Data Provenance Cards and releases the DPExplorer tool.
Gebru et al. — "Datasheets for Datasets" — 2021. Communications of the ACM. The original dataset documentation framework that Provenance Cards extend for scale. arXiv: 1803.09010.
Mitchell et al. — "Model Cards for Model Reporting" — 2019. FAccT. Establishes transparency standards for model releases — the model-side complement to dataset provenance.
Bender et al. — "On the Dangers of Stochastic Parrots" — 2021. FAccT. Introduces the "Bender Rule" for language documentation and raises foundational concerns about undocumented training data.