Data Provenance
Visual Summary — Post 36

The Data Provenance Crisis

Auditing 1,858 AI training datasets reveals a systemic breakdown in licensing transparency: more than 70% of datasets have no documented license, and on platforms such as HuggingFace up to 66% of the licenses that are reported misstate the actual terms.

1,858
Datasets audited
44
Collections analyzed
70%+
License omission rate
66%
HuggingFace mismatch rate
17
Authors (legal + ML)
What is Data Provenance?
Data provenance tracks the complete lineage of a dataset — its original source, who created it, under what license, with what conditions, and how it has been transformed or repackaged. For AI training data, this determines what you are legally and ethically permitted to do with a trained model.
Why It Matters Now
Every foundation model and fine-tuned LLM was trained on datasets with licensing conditions. If those conditions prohibit commercial use — and you didn't check — your deployed product may be in violation. With 70%+ of datasets having undocumented licenses, most AI teams are flying blind.
The Scale of the Problem
The Data Provenance Initiative (Longpre et al., 2023) is the first large-scale audit of AI fine-tuning datasets. It brings together legal and ML experts to trace 1,858 datasets across 44 major collections — examining licenses, sources, creators, and downstream use restrictions.
What Was Built
The Data Provenance Explorer — an interactive interface at dataprovenance.org — lets practitioners filter and examine provenance metadata for any of the 1,858 audited datasets before using them in training.
Who Is at Risk
Any team fine-tuning an LLM on open-source instruction datasets. The most popular datasets on HuggingFace see hundreds to 10M+ monthly downloads — and most of those users never check the license conditions.
The Fix
Data Provenance Cards — machine-readable metadata records that travel with datasets — capturing license, source, creator, attribution requirements, and share-alike conditions at scale.

The License Crisis

Across every major dataset hosting platform, the majority of datasets either have no documented license or have a license that misrepresents the actual terms.

Key finding: The Data Provenance Initiative reduced unspecified license rates from 70%+ to 30% through manual auditing — meaning well over half of the "unlicensed" datasets actually had discoverable licenses that platforms simply failed to document.
License Omission by Platform
GitHub
72%
Papers with Code
70%
HuggingFace
69%
After DPI Audit
30%
% of datasets with unspecified/missing license
License Mismatch Rates
HuggingFace
66%
GitHub (too permissive)
29%
HuggingFace (too permissive)
27%
Papers with Code
16%
% of datasets where platform license differs from author's intent
Most Common Licenses Found
15.7%
CC-BY-SA 4.0
Attribution + Share-Alike required
12.3%
OpenAI ToU
Prohibits competing model training
11.6%
CC-BY 4.0
Attribution only — commercially permissive
9.6%
Custom Licenses
Bespoke terms — requires legal review
Attribution requirements: 85% of datasets require attribution — meaning you must credit the dataset creator. Most AI training pipelines do not do this.
Share-alike clauses: 30% of datasets require share-alike — meaning any model trained on them must be released under the same or compatible license. This directly conflicts with closed-source model releases.

What 1,858 Datasets Look Like

The audit covers fine-tuning and alignment datasets — instruction-following, QA, classification, and generation tasks across 44 major collections.

Primary Data Sources
Encyclopedias (Wikipedia)
21.5%
Social Media (Reddit/Twitter)
15.9%
General Web
11.2%
News
11.1%
Entertainment
8.5%
Code (GitHub)
5.7%
Exams / Academic
5.6%
Creator Distribution
Academic institutions
68.7%
Industry labs
21.4%
Research groups
17.1%
Corporations
15.8%
Startups
4.0%
Top academic contributors: UW (8.9%), Stanford (6.8%), NYU (5.4%). Top industry: AI2 (12.3%), FAIR (8.4%), MSR (4.1%).
Dataset Format Types
Zero-shot
Direct instruction, no examples
Few-shot
Instruction + k examples
Chain-of-Thought
Reasoning trace included
Multi-turn
Conversational dialogue format

Four License Categories

The audit classifies every dataset into one of four use-permission categories. Only 46% are licensed for commercial use at all — and even those may carry attribution or share-alike conditions.

46.1%
Commercial
856 datasets. Can be used for any purpose including building commercial products. Examples: CC-BY 4.0, MIT, Apache 2.0, CC0.
30.7%
Unspecified
570 datasets. No license documented. Legally, all rights reserved by default — most restrictive interpretation is that commercial use is prohibited.
19.0%
Non-Commercial
352 datasets. Research and personal use only. Examples: CC-BY-NC 4.0, OpenAI ToU. Building a product on these may constitute infringement.
4.3%
Academic-Only
80 datasets. Restricted to academic research institutions. Cannot be used by industry teams, regardless of whether use is commercial.
Attribution Requirements
Require attribution 85%
Attribution means crediting the dataset creator in your model documentation, paper, and release notes. Most model cards do not do this.
Share-Alike Obligations
Include share-alike clause 30%
Share-alike requires releasing trained models under the same license. This is fundamentally incompatible with proprietary model releases — yet these datasets are widely used.
What does CC-BY-SA 4.0 actually mean for AI training?
CC-BY-SA 4.0 (Creative Commons Attribution-ShareAlike) requires: (1) you credit the original creator, and (2) anything you create using this work must also be released under CC-BY-SA or a compatible license. For AI training, this means a model trained on CC-BY-SA data could be argued to require open-sourcing under the same terms. This is actively litigated — no definitive court ruling exists yet.
Why is the OpenAI Terms of Use (12.3%) so significant?
The OpenAI ToU prohibits using outputs to "develop models that compete with OpenAI." This affects all datasets derived from ChatGPT/GPT-4 outputs — including widely-used instruction-following datasets like Alpaca, Dolly, and others built from GPT-generated data. Any model fine-tuned on these datasets carries this restriction transitively, whether or not the fine-tuner was aware.
What happens with Unspecified licenses?
Under most copyright law, if no license is specified, all rights are reserved by default. This means the creator retains full copyright and no one has permission to use, copy, or distribute the work — including for AI training. Practitioners often treat "no license" as "free to use," which is legally incorrect and potentially infringes the creator's copyright.

The Data Availability Divide

Non-commercial and academic-only datasets are systematically richer — more diverse, longer, and more multi-source — than commercially licensed datasets. This creates a capability gap for open commercial AI.

The structural problem: The most capable and diverse datasets are locked behind non-commercial restrictions. Teams building commercial products are forced to use a subset of the data landscape that is shallower, less diverse, and more synthetic-data-poor.
Task Diversity: NC datasets have 2× more
Non-commercial datasets average 3.4±0.2 task types per dataset vs. 1.7±0.1 for commercial. Creative tasks (brainstorming, creative writing, explanation) are disproportionately concentrated in non-commercial datasets. Commercial datasets skew heavily toward QA and classification.
Source Diversity: NC datasets have 2.6× more
Non-commercial datasets average 4.2±1.3 distinct data sources vs. 1.6±0.1 for commercial. This reflects that academic dataset creators pull from wider pools (web, books, exams, social media) while commercial datasets are more focused and single-source.
Text Length: NC datasets are 15× longer
Non-commercial datasets have a mean target text length of 1,580.7±965.6 tokens vs. 102.7±14.6 for commercial. Long-form generation (essays, reports, summaries) is a non-commercial-only capability in the current dataset landscape.
Synthetic Data: NC datasets have 3.5× more
45.5%±3.4 of non-commercial dataset content is model-generated vs. 12.8%±2.1 for commercial. This may appear counterintuitive — synthetic data is easier to generate commercially — but most synthetic datasets use GPT-4/ChatGPT outputs, which carry the OpenAI ToU non-commercial restriction.
Code datasets are the exception
78% of code datasets are commercially viable — significantly higher than the overall average. GitHub's filtering by code license is easier (MIT, Apache, GPL are well-understood), and most code is written by individuals rather than organizations that impose non-commercial restrictions.
Implication for foundation models: The capability frontier — reasoning, creativity, long-form generation — is primarily trained on non-commercial datasets. Commercial fine-tuning is working with a structurally inferior subset.

Language & Geographic Coverage

English and Western European languages dominate the audited datasets. Low-resource languages are not only underrepresented — they are disproportionately locked behind non-commercial restrictions.

Language Family Coverage
Germanic (English)
Dominant
Romance languages
Moderate
Sino-Tibetan (Chinese)
Sparse
Japonic
Sparse
Indo-Aryan
Very sparse
Turkic languages
Critical gap
Niger-Congo (African)
Critical gap
Indigenous Americas
Near absent
The Double Exclusion Problem
Low-resource languages face two barriers:

1. Underrepresentation — Fewer datasets exist in Turkic, Japonic, Sino-Tibetan, and African language families.

2. Commercial restriction — More than 35% of available low-resource language datasets are non-commercial or academic-only — a higher restriction rate than for English datasets.

This means the communities with the least AI capability also face the most access barriers to building it.
The Code Exception (Again)
Code datasets show the most balanced multilingual coverage because programming languages themselves are language-agnostic, and GitHub hosts code from developers worldwide. Python, JavaScript, Java, and C++ datasets are commercially available and globally contributed.
The Bender Rule: The paper follows the "Bender Rule" of always documenting which languages a dataset covers — making language coverage a first-class attribute in every Data Provenance Card. Most existing dataset documentation fails this basic standard.

The 2023 Licensing Shift

Something changed in 2023. The proportion of new datasets with restrictive licenses jumped dramatically — likely in response to growing awareness of AI training data concerns and high-profile legal disputes.

Pre-2023 Baseline
50–80%
of datasets released before 2023 had no license at all. The ecosystem treated "open" as the default — no one expected their NLP dataset to become AI training fodder at scale.
2023 Unlicensed Rate
Only 12%
of new datasets released in 2023 had no license — a dramatic improvement. Dataset creators became aware that licensing their work mattered for AI use.
2023 NC/Academic Rate
61%
of new datasets in 2023 were released as non-commercial or academic-only — vs. ~20% in prior years. The reaction to commercialization of AI drove a sharp restriction turn.
2018–2021
The Wild West Era
NLP datasets released primarily for academic benchmarking. Licensing treated as an afterthought — 50–80% have no license. No one anticipated billion-parameter models scraping everything in sight.
2022
ChatGPT Changes Everything
GPT-3.5, ChatGPT, and instruction-following models go mainstream. Dataset creators begin to realize their work is being used in commercial products. First wave of CC-BY-NC restrictions on new releases.
2023 Q1–Q2
Legal Disputes and Awareness Spike
High-profile lawsuits (Authors Guild v. OpenAI, Getty Images v. Stability AI) bring copyright and licensing to mainstream attention. 61% of new 2023 datasets adopt NC/academic restrictions.
2023 Q3
Data Provenance Initiative Published
Longpre et al. publish the first large-scale audit of 1,858 datasets. DPExplorer launches at dataprovenance.org. The field finally has a reference for what the actual license situation looks like.
2024–Present
New Licensing Frameworks Emerge
RAIL licenses (Responsible AI Licenses), AI2 ImpACT licenses, and BigScience OpenRAIL offer middle-ground options — commercially usable but with behavioral restrictions on downstream use.

Data Provenance Cards

Machine-readable metadata records that travel with datasets — capturing everything needed to make an informed licensing decision, designed to scale to hundreds of datasets simultaneously.

Design principle: A Data Provenance Card is like a bibliography entry for training data — standardized, machine-readable, filterable, and composable when combining multiple datasets. Unlike Model Cards or Datasheets, it is specifically designed for scale.
Provenance Cards vs. Existing Documentation
Property | Datasheets for Datasets | Model Cards | Provenance Cards
Format | Free-form prose | Structured prose | Machine-readable schema
Scales to 100s of datasets? | No — manual per dataset | No — per model | Yes — concatenable
Filterable / searchable? | No | No | Yes — by license, language, task
License tracking | Sometimes mentioned | Rarely detailed | Explicit — permitted use, attribution, share-alike
Repackaging support | Breaks on remix | N/A | Preserved through repackaging
Auto-generated? | No — manual | No — manual | Yes — via DPExplorer
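The machine-readable, concatenable design can be illustrated with a minimal sketch. Field names here are assumptions for illustration — the actual DPExplorer schema may differ:

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceCard:
    """Minimal machine-readable provenance record (illustrative fields)."""
    dataset_name: str
    source: str                 # original source, e.g. "Wikipedia"
    creator: str                # e.g. "Academic: Stanford"
    license_id: str             # e.g. "CC-BY-SA-4.0"
    use_category: str           # "commercial" | "non-commercial" | "academic" | "unspecified"
    attribution_required: bool
    share_alike: bool
    # Bender Rule: language coverage is a first-class attribute.
    languages: list = field(default_factory=list)

    def to_row(self) -> dict:
        """Flat rows concatenate across 100s of datasets and stay filterable."""
        return vars(self)

# A GPT-derived instruction dataset inherits the OpenAI ToU restriction:
card = ProvenanceCard("alpaca-style-demo", "GPT-4 outputs", "Academic: demo",
                      "OpenAI-ToU", "non-commercial", True, False, ["en"])
```

Because each card flattens to one row, merging collections is just concatenating rows — which is what makes filtering by license, language, or task tractable at scale.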
Recommendations by Stakeholder
Practitioners
Use DPExplorer to filter datasets by your risk tolerance before including them in training pipelines. Never assume "no license = free to use."
Dataset Creators
Apply explicit licenses at release. Consider RAIL or AI2 ImpACT licenses for middle-ground options. Always document language coverage and data sources.
Policymakers
Clarify fair use and TDM exceptions for AI training. Consider dataset-specific legal frameworks rather than applying code copyright law to training data.
Researchers
Focus on under-resourced languages and non-commercial licensing gaps. Releasing commercially-usable datasets for low-resource languages is a high-impact contribution.
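The practitioner recommendation above — filter by license before anything enters a training pipeline — can be sketched as a simple metadata filter. The intent tiers and field names are assumptions for illustration, not the DPExplorer API:

```python
def filter_by_risk(cards, intent="commercial"):
    """Keep only datasets whose use category permits the stated intent.

    Note: 'unspecified' is treated as all-rights-reserved (most
    restrictive) and is excluded under every intent — never as free.
    """
    allowed = {
        "commercial": {"commercial"},
        "research":   {"commercial", "non-commercial"},
        "academic":   {"commercial", "non-commercial", "academic"},
    }[intent]
    return [c for c in cards if c["use_category"] in allowed]

cards = [
    {"name": "a", "use_category": "commercial"},
    {"name": "b", "use_category": "unspecified"},    # excluded everywhere
    {"name": "c", "use_category": "non-commercial"},
]
usable = filter_by_risk(cards, intent="commercial")  # only "a" survives
```

The key design choice is in the comment: "no license" maps to "not usable", inverting the common (and legally incorrect) default of treating unlicensed data as free.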

Provenance in Agentic Data QC

Data quality and data provenance solve adjacent problems. An agentic QC pipeline that catches quality issues but ignores license restrictions is only half the solution. Here's how provenance becomes a native feature — not an afterthought.

Core insight: QC asks "is this data correct?" Provenance asks "is this data usable?" A QC agent needs both answers before a dataset can be trusted in a training pipeline.
Agentic Data QC Pipeline with Provenance Gates
4 Integration Points
1
Upstream License Audit
Before QC runs — check if the dataset is legally usable for the intended purpose
2
Lineage Tracking During QC
Every filter, transformation, and augmentation is logged with the original source license
3
Attribution Propagation
When datasets are merged, the most restrictive license propagates to the output
4
Compliance Gate at Pipeline Exit
Final check before the cleaned dataset is released or used for training
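Integration point 3 — the most restrictive license propagating on merge — can be sketched as follows. The restrictiveness ordering and field names are assumptions for illustration:

```python
# Ordering over the audit's four categories, least to most restrictive;
# 'unspecified' is treated as all-rights-reserved, hence most restrictive.
RESTRICTIVENESS = ["commercial", "non-commercial", "academic", "unspecified"]

def merged_category(categories):
    """When datasets are merged, the most restrictive category wins."""
    return max(categories, key=RESTRICTIVENESS.index)

def merged_obligations(cards):
    """Attribution and share-alike are unions: one obligated dataset
    obligates the whole merged corpus."""
    return {
        "category": merged_category(c["use_category"] for c in cards),
        "attribution_required": any(c["attribution_required"] for c in cards),
        "share_alike": any(c["share_alike"] for c in cards),
    }

corpus = merged_obligations([
    {"use_category": "commercial", "attribution_required": False, "share_alike": False},
    {"use_category": "non-commercial", "attribution_required": True, "share_alike": True},
])
# one CC-BY-NC-SA-style dataset makes the whole corpus non-commercial,
# attribution-required, and share-alike
```

This is why a single share-alike or GPT-derived dataset in a large merged corpus is so consequential: restrictions compose by maximum, not by average.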
QC Signals vs. Provenance Signals — What Each Catches
Issue Type | QC Catches It? | Provenance Catches It? | Risk if Missed
Duplicate examples | ✓ Yes — dedup metrics | — Not applicable | Training overfitting
Label noise / wrong annotations | ✓ Yes — annotation QC | — Not applicable | Model misbehavior
Non-commercial license on dataset | ✗ No — invisible to QC | ✓ Yes — upstream audit | Legal infringement
Platform license mismatch (66% rate) | ✗ No | ✓ Yes — cross-ref source | False compliance confidence
GPT-generated data inheriting OpenAI ToU | ✗ No | ✓ Yes — synthetic provenance | Violates OpenAI ToU at scale
Share-alike conflict in merged corpus | ✗ No | ✓ Yes — propagation check | Closed-source release at risk
Missing attribution credits | ✗ No | ✓ Yes — exit gate | License violation (85% require it)
Low-quality text / toxic content | ✓ Yes — quality filters | — Not applicable | Safety / model quality
Language / geographic coverage gaps | ~ Partial — distribution stats | ✓ Yes — language metadata | Bias / underrepresentation
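Integration point 4 — the compliance gate at pipeline exit — can be sketched as a final check before release or training. The rule set and field names are illustrative assumptions, not the paper's specification:

```python
def compliance_gate(corpus_meta, intent, release_open_source):
    """Final gate before the cleaned corpus is released or trained on.
    Returns a list of flags; an empty list means the gate passes."""
    flags = []
    if intent == "commercial" and corpus_meta["category"] != "commercial":
        flags.append(f"commercial intent conflicts with "
                     f"'{corpus_meta['category']}' corpus category")
    if corpus_meta["share_alike"] and not release_open_source:
        flags.append("share-alike corpus conflicts with closed-source release")
    if corpus_meta["attribution_required"] and not corpus_meta.get("credits_file"):
        flags.append("attribution required but no credits file attached")
    return flags

# A closed-source commercial release on a non-commercial, share-alike,
# attribution-required corpus trips all three checks:
meta = {"category": "non-commercial", "share_alike": True,
        "attribution_required": True}
flags = compliance_gate(meta, intent="commercial", release_open_source=False)
```

Unlike the quality filters earlier in the pipeline, this gate never looks at the data itself — only at the propagated provenance metadata, which is why it catches exactly the rows QC misses in the table above.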

Paper Sources

This visual summary is based on the following papers and resources.

Primary Reference
Longpre et al. (17 authors, CMU / MIT / Hugging Face / Allen AI) — "The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI" — 2023. arXiv: 2310.16787. Audits 1,858 text datasets across 44 collections. Introduces Data Provenance Cards and releases the DPExplorer tool.
↗ arXiv 2310.16787 ↗ dataprovenance.org ↗ GitHub Repository
Related Works
Gebru et al. — "Datasheets for Datasets" — 2021. Communications of the ACM. The original dataset documentation framework that Provenance Cards extend for scale. arXiv: 1803.09010.
Mitchell et al. — "Model Cards for Model Reporting" — 2019. FAccT. Establishes transparency standards for model releases — the model-side complement to dataset provenance.
Bender et al. — "On the Dangers of Stochastic Parrots" — 2021. FAccT. Introduces the "Bender Rule" for language documentation and raises foundational concerns about undocumented training data.