AskAjay.ai
Emerging Technology · 20 min read · May 2, 2024

Open Source AI: The Ultimate Risk-Reward Guide

A strategic guide to open-source AI licence families, compatibility risks, and global compliance across four jurisdictions. Covers sector-specific obligations and a three-phase governance maturity model.

Open source AI grew from 15,000 to 650,000+ models. But "open source" doesn't mean unrestricted. The licence traps and compliance layers most teams discover too late.

Ajay Pundhir, AI Strategist & Speaker


Key Takeaways

  • Open source in AI doesn’t mean unrestricted — licence traps are the norm
  • 34% of GitHub repos lack licence files; default to MIT or Apache 2.0
  • Custom AI licences like Llama are not OSI-approved open source
  • Build for the strictest jurisdiction first, then work backwards
  • Governance maturity separates ad-hoc adoption from defensible compliance

Last quarter, I sat across from a CTO who had just discovered that thirteen of his team's AI models were deployed under licenses that prohibited their commercial use case. Not because anyone was negligent — because nobody had checked. The models were on Hugging Face, they were "open source," and the engineering team assumed that meant they were free to use.

That assumption cost his company four months of rework and an uncomfortable conversation with their legal team.

This is happening everywhere. Open source AI models have grown from 15,000 to over 650,000 on Hugging Face alone. GitHub reports a 98% increase in generative AI projects in 2024. Python overtook JavaScript as GitHub's most popular language for the first time — driven almost entirely by AI development. The accessibility is extraordinary.

But the compliance landscape underneath that accessibility? It's a minefield. "Open source" in AI does not mean what most developers assume. Licenses range from genuinely permissive to strongly viral. The EU AI Act, updated U.S. export controls, and China's Generative AI Services rules are evolving faster than most legal teams can track. And sector-specific obligations in healthcare, financial services, and government stack additional layers on top.

This guide is the framework I use with my advisory clients to navigate that landscape. Not a legal opinion — a strategic map for leaders who need to make decisions now, with clarity about what they're taking on.

  • 650K+ open source AI models on Hugging Face
  • 98% increase in GenAI projects on GitHub (2024)
  • 100M+ developers on GitHub
  • $100B+ AI venture funding in 2024

Open Source vs. Open Weight AI: The Distinction That Changes Everything

Before any compliance conversation, there's one foundational distinction you need to internalise. The industry uses these terms interchangeably. That's dangerous.

Genuine open source means the full stack is available: source code, training code, model architecture, and ideally datasets. The Open Source Initiative (OSI) maintains strict criteria — complete access to source code, build systems, and documentation. Projects like BLOOM meet this standard. You can reproduce the entire model from scratch.

Open weight is different. Only the trained model parameters are released — often without training code, datasets, or reproducibility information. Meta's Llama models and Mistral's offerings are open weight. They're useful, widely adopted, and frequently called "open source." They're not. Under the OSI definition, they don't qualify.

Why does this matter? Because regulators don't care about marketing labels. They care about documentation, transparency, and provenance.

Analysis from Epoch AI shows open and closed models diverging rapidly. Open weight models may fail to meet emerging regulatory requirements — particularly under the EU AI Act, which demands comprehensive documentation for certain AI systems. When your compliance team asks "Is it open source?" and your engineer says "Yes," you may both be wrong about what that means.

Here's the uncomfortable number: 34% of GitHub repositories lack license files entirely. A third of all repositories. That's not an edge case — it's the norm. Before your team uses any AI model, the first question is not "Is it open source?" It's: "What licence governs this specific artefact, and what does that licence require of us?"
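That first question can be made mechanical. As a minimal sketch: Hugging Face model cards conventionally declare a `license:` key in the YAML front matter at the top of README.md, and a missing declaration should fail closed. The helper name `licence_from_model_card` is illustrative, not a real library call:

```python
import re

def licence_from_model_card(card_text: str) -> str:
    """Extract the declared licence from a model card's YAML front matter.

    Returns "UNKNOWN" when no declaration is found, which, per the 34%
    figure above, must be treated as "all rights reserved", never as
    permissive by default.
    """
    match = re.search(r"^license:\s*(\S+)", card_text, re.MULTILINE)
    return match.group(1).strip("\"'") if match else "UNKNOWN"

card = """---
license: apache-2.0
language: en
---
# Example model card
"""
print(licence_from_model_card(card))            # apache-2.0
print(licence_from_model_card("# no metadata")) # UNKNOWN
```

Note the fail-closed default: an artefact with no discoverable licence is a blocker, not a green light.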

License Families: What You Actually Need to Know

I've found that most technical leaders understand individual licenses but don't have a framework for the families. That's where the compliance mistakes happen — not in the fine print of a single license, but in the interactions between licenses across your stack.

Open source licenses fall into three families. Understanding the family is more important than memorising the details.


The Permissive Family: MIT · Apache 2.0 · BSD

Permissive licenses give you maximum flexibility with minimal obligations. For most commercial AI projects, this is where you want to be.

  • MIT License is the gold standard: commercial use, modification, and distribution — requiring only copyright notice preservation. React, Node.js, and Angular all use MIT. No explicit patent grant creates some uncertainty, but the broad compatibility makes it ideal for libraries intended for wide adoption.
  • Apache License 2.0 fixes MIT's patent gap through explicit patent grants with defensive termination clauses. Kubernetes, TensorFlow, and Android components use Apache 2.0. It requires more attribution (NOTICE file preservation, change documentation), but the patent protection makes it the stronger enterprise choice. Compatible with GPL v3, but not v2 — a detail that trips up more teams than you'd expect.
  • BSD 2-Clause / 3-Clause: the 2-clause mirrors MIT; the 3-clause adds a non-endorsement restriction. Both are popular for academic and research projects.

Bottom line: if you're building commercial products, these are your default. Start here, and only venture into other families with explicit legal review.

License Compatibility: Where Most Teams Get Caught

Licence compatibility trips up more teams than anything else I deal with. And the worst part — you can't catch it with a scan. When you combine components with different licences, the resulting work must satisfy all applicable licences simultaneously.

Some combinations are mathematically impossible. GPL v2 and Apache 2.0, for example — the patent termination clause in Apache 2.0 creates an "additional restriction" that GPL v2 doesn't permit. If your AI pipeline combines both, you have a problem that no amount of attribution headers can solve.
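A first-pass gate for this can live in CI. The sketch below hard-codes a tiny pairwise incompatibility set (real compatibility is directional and context-dependent, so treat this as a tripwire, not legal advice); the GPL v2 / Apache 2.0 conflict from the paragraph above is the canonical entry:

```python
from itertools import combinations

# Known-incompatible pairs among common licences (deliberately tiny).
# GPL-2.0 vs Apache-2.0: Apache's patent-termination clause is an
# "additional restriction" GPL v2 does not permit.
INCOMPATIBLE = {
    frozenset({"GPL-2.0-only", "Apache-2.0"}),
    frozenset({"GPL-2.0-only", "GPL-3.0-only"}),
}

def find_conflicts(licences: list[str]) -> list[tuple[str, str]]:
    """Return every incompatible pair found in a combined work's licence set."""
    return [
        (a, b)
        for a, b in combinations(sorted(set(licences)), 2)
        if frozenset({a, b}) in INCOMPATIBLE
    ]

stack = ["MIT", "Apache-2.0", "GPL-2.0-only"]
print(find_conflicts(stack))  # [('Apache-2.0', 'GPL-2.0-only')]
```

The point is the failure mode: conflicts surface at build time, when fixing them costs hours, not after deployment.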

License Compatibility at a Glance

  • MIT (Permissive): Commercial ✓ · No explicit patent grant · No copyleft · Compatible with GPL v2, v3, Apache 2.0 · Maximum compatibility across the ecosystem
  • Apache 2.0 (Permissive): Commercial ✓ · Explicit patent grant with defensive termination · No copyleft · Incompatible with GPL v2 · Compatible with GPL v3 · Best enterprise permissive choice
  • BSD 2/3-Clause (Permissive): Commercial ✓ · No patent grant · No copyleft · Compatible with both GPL versions and Apache 2.0 · 3-clause adds a non-endorsement restriction
  • GPL v2 (Copyleft): Commercial ✓ (with source disclosure) · Implicit patent grant · Strong copyleft · Incompatible with Apache 2.0 · Powers the Linux kernel — treat as viral for all combined works
  • GPL v3 (Copyleft): Commercial ✓ (with source disclosure) · Explicit patent grant · Strong copyleft · Compatible with Apache 2.0 · Incompatible with GPL v2-only code
  • AGPL v3 (Strong Copyleft): Commercial use requires source disclosure for network use · Explicit patent grant · Network copyleft · Closes the SaaS loophole · Many enterprises maintain AGPL-prohibited policies
  • Custom AI licences, e.g. Llama (Non-OSI): Commercial use often restricted above revenue/user thresholds · Use-case restrictions (no weapons, surveillance) · Downstream obligations apply · Not "open source" under the OSI definition

For the full compatibility details, Choose a License and the GNU License Compatibility Guide are the authoritative references. But here's the practical rule I give every client: default to MIT or Apache 2.0 licensed components. Treat any GPL or custom AI license as requiring explicit legal sign-off before adoption. No exceptions.

Automated scanning tools — FOSSA, Black Duck, FOSSology — catch declared licences. But training data provenance, model weight licensing, and transitive dependencies in AI pipelines create compliance risks that tools alone cannot surface. I insist on human review at the architecture stage. Catching a licence conflict before deployment costs hours. Catching it after? Months. I've seen both.
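For the declared-licence baseline, you don't even need a commercial tool to get started: Python's standard `importlib.metadata` exposes every installed distribution's licence metadata. A minimal sketch — note it reads only what packages declare, which is precisely its limit (it says nothing about model weights, training data, or vendored files, hence the human review):

```python
from importlib import metadata

def declared_licences() -> dict[str, str]:
    """Map each installed distribution to its declared licence string.

    Reads the `License` metadata field, falling back to trove
    classifiers (e.g. "License :: OSI Approved :: MIT License").
    Anything undeclared is flagged, not silently skipped.
    """
    report = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        licence = meta.get("License") or ""
        if not licence or licence.lower() == "unknown":
            classifiers = [
                c for c in (meta.get_all("Classifier") or [])
                if c.startswith("License ::")
            ]
            licence = classifiers[0].split("::")[-1].strip() if classifiers else "UNDECLARED"
        report[meta.get("Name", "unknown")] = licence
    return report

for name, lic in sorted(declared_licences().items()):
    print(f"{name:30} {lic}")
```

Running this against a typical AI environment is usually the fastest way to demonstrate the problem to sceptical stakeholders: the UNDECLARED rows do the arguing for you.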

Global AI Compliance: Four Jurisdictions, Four Sets of Rules

Here's what makes 2025 different from every previous year: AI regulation has moved from theoretical to enforceable. If you're deploying globally, you face binding obligations across multiple jurisdictions simultaneously. And none of them align.

I wrote a deeper analysis of the EU's approach in my GDPR and AI compliance guide — the intersection of AI Act and GDPR creates particularly acute challenges. Here's the broader picture across all four major jurisdictions.

Global AI Compliance Landscape

Key regulatory frameworks by jurisdiction

European Union: EU AI Act · GDPR · CRA

The EU AI Act, fully effective August 2026, is the world's most comprehensive AI regulatory framework. Open source gets limited exemptions — and they don't cover prohibited AI systems, high-risk applications, or consumer-facing systems.

  • GPAI models: general-purpose AI models exceeding 10²⁵ FLOPs get no open source exemptions. Full transparency, risk assessment, and governance requirements apply regardless of licence.
  • The open source reality: open source status alone doesn't protect you. An AGPL-licensed model used in a high-risk application faces identical EU AI Act obligations to a proprietary one. "But it's open source" is not a defence.
  • GDPR intersection: training on personal data requires a legal basis. Web scraping faces enhanced scrutiny. EDPB Opinion 28/2024 provides the authoritative AI–GDPR guidance.
  • Penalties: up to €35 million or 7% of global annual turnover. The Cyber Resilience Act adds cybersecurity obligations for open source software with commercial monetisation.

The real challenge is not picking a jurisdiction — it's managing four at once. An AI system with EU users, U.S. federal customers, a China market product, and Indian user data faces four distinct regulatory frameworks for the same open source model. My rule: build for whichever jurisdiction is strictest, then work backwards to see what the others let you skip.

Sector Compliance: The Layer Most Teams Discover Too Late

Most teams treat licence compliance as the finish line. It's not even close. Sector-specific obligations sit on top, and in healthcare, financial services, and government, they carry the steepest penalties. In my advisory work across all three, this layer is consistently underestimated.

I've covered healthcare compliance in depth in my HIPAA and AI strategic guide. Here's how sector requirements compound the baseline compliance challenge for open source AI across all three verticals.

Sector-Specific AI Compliance

How industry requirements compound the baseline

Healthcare: FDA · HIPAA · IEC 62304

  • FDA: the January 2025 draft guidance on AI-enabled device software functions affects any AI in medical decision-making. 85.9% of AI/ML devices go through the 510(k) pathway. Predetermined Change Control Plans (PCCPs) let you update open source model versions without full re-approval — critical for teams that need to patch vulnerabilities.
  • HIPAA: any AI vendor processing PHI automatically becomes a business associate subject to direct OCR enforcement. For open source models, training data provenance is the acute issue: was the model trained on data that included PHI? Can you produce the documentation a BAA requires?
  • IEC 62304: Software of Unknown Provenance (SOUP) provisions specifically address open source. Class C medical software demands full dependency traceability — not just a licence scan, but lifecycle documentation.

What I tell healthcare clients: open source components need complete dependency documentation, reproducible builds, and a defined CVE response process within regulatory timeframes. If you can't produce this documentation, you can't use the component. Full stop.

The Risk-Reward Matrix: Where Should You Position?

The strategic question is not "should we use open source AI?" Almost certainly, yes. The question is: which licenses, in which applications, under which governance architecture? That's a risk-reward decision, not a technology decision.

[Risk-reward matrix: licence families — MIT / Apache 2.0, BSD 2/3-Clause, LGPL, GPL v2/v3, AGPL v3, Custom AI (Llama), and unlicensed code — plotted by ecosystem adoption breadth (x-axis) against compliance complexity (y-axis), from "Low Risk / Widely Used" to "High Risk / Low Adoption".]

The chart tells a clear story. Nearly 70% of enterprise AI deployments use permissive licenses. The 19% using custom AI licenses — mostly Llama — represents the highest-growth risk category: widely adopted, poorly understood, and with commercial restrictions that most teams haven't read.

Open Source AI Governance: From Ad-Hoc to Governed

Every organisation I've worked with falls into one of three governance phases. (For the broader governance architecture, my governance playbook covers the full five-layer stack.) Most are stuck in Phase 1 — adopting models informally, no central inventory, hoping compliance issues don't surface. They always surface. The question is whether you find them or someone else does.

Open Source AI Governance Maturity

Where does your organisation sit?

Phase 1: Ad-Hoc (Risk: High)

Teams adopt models informally. No central inventory. Licence compliance handled by individual developers (or not at all). No SBOM. No vulnerability monitoring. No policy on prohibited licences.

Phase 2: Policy-Driven (Risk: Medium)

OSPO or equivalent function established. Licence policy defines permitted, restricted, and prohibited categories. Automated scanning integrated into CI/CD. SBOM generation standardised. Legal review process for non-standard licences. Vulnerability management SLAs defined.

Phase 3: Governed (Risk: Managed)

Continuous compliance monitoring with automated alerting. Training data provenance documented for all models. Regulatory change monitoring with impact assessments. Cross-functional governance (legal, security, engineering) with defined escalation. External audit readiness. Employee contribution policy for open source projects.

Essential Open Source Compliance Infrastructure

Phase 3 requires purpose-built infrastructure, not spreadsheets. Five components:

  • Software Bill of Materials (SBOM): Mandatory for federal contractors under Executive Order 14028 and rapidly becoming an enterprise procurement requirement. Automate SBOM generation in every CI/CD pipeline. Tools: FOSSA, Black Duck, CycloneDX (open source), SPDX (open source standard).
  • Licence scanning: Static analysis of declared licences is the baseline. Dynamic analysis of transitive dependencies is the requirement. Automated scanning plus human review for complex scenarios — particularly custom AI licences and dual-licensed components.
  • Vulnerability monitoring: Open source CVE management with defined SLAs by severity. GitHub detected 39 million secret leaks in 2024. Secrets scanning alongside vulnerability detection is non-negotiable. Tools: Dependabot, Snyk, FOSSA, OSV Scanner.
  • Training data provenance: The SBOM concept extended to training data. Document sources, governing licences, personal data inclusion, and fine-tuning lineage (which triggers upstream derivative work obligations).
  • OpenChain certification: ISO/IEC 5230 provides an internationally recognised standard for open source compliance. Certification signals to enterprise customers and regulators that your programme meets a consistent, auditable standard.
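To make the SBOM bullet concrete, here is a hand-rolled sketch of the CycloneDX JSON shape, extended to cover a model alongside library dependencies. A sketch only: production pipelines should use the official CycloneDX generators named above, and the `LicenseRef-` identifier for the Llama licence is an illustrative convention, not a registered SPDX id:

```python
import json
import uuid
from datetime import datetime, timezone

def minimal_sbom(components: list[dict]) -> str:
    """Emit a minimal CycloneDX-style JSON SBOM.

    Each component dict supplies name, version, and licence; a "model"
    flag marks ML models (a component type CycloneDX 1.5 introduced).
    """
    doc = {
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",
        "serialNumber": f"urn:uuid:{uuid.uuid4()}",
        "metadata": {"timestamp": datetime.now(timezone.utc).isoformat()},
        "components": [
            {
                "type": "machine-learning-model" if c.get("model") else "library",
                "name": c["name"],
                "version": c["version"],
                "licenses": [{"license": {"id": c["licence"]}}],
            }
            for c in components
        ],
    }
    return json.dumps(doc, indent=2)

print(minimal_sbom([
    {"name": "transformers", "version": "4.40.0", "licence": "Apache-2.0"},
    {"name": "llama-3-8b", "version": "1.0", "licence": "LicenseRef-Llama3", "model": True},
]))
```

The design point is the model entry: treating weights as first-class SBOM components is what extends the concept toward the training data provenance bullet above.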
Download: Open Source AI Compliance Checklist

Get the complete compliance checklist: licence inventory template, three-tier policy matrix, SBOM requirements, regulatory monitoring calendar, and sector-specific addenda for healthcare, financial services, and government.

What to Do Monday Morning

I'll keep this concrete. Here's what I tell the three audiences who sit across from me most often.

If You Lead Technology

Start with an inventory. Today. Audit every AI model currently in use — production, development, pilot — and document the licence, the use case, and the deployment context. This takes days, not months. The findings will surprise you.

Then establish a three-tier licence policy: permitted (MIT, Apache 2.0, BSD with standard attribution), restricted (requires legal review — GPL, LGPL, custom AI licences), and prohibited (AGPL in commercial products, any licence with unacceptable use-case restrictions). Make it enforceable through automated scanning, not just documentation.
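"Enforceable through automated scanning" can be as simple as a gate function in CI that fails closed. A minimal sketch of the three-tier policy just described (the licence identifiers are illustrative SPDX-style strings; tune the sets to your own policy):

```python
# Three-tier policy sets -- illustrative, adjust to your own legal guidance.
PERMITTED = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}
PROHIBITED = {"AGPL-3.0-only", "AGPL-3.0-or-later"}
# Everything else (GPL, LGPL, custom AI licences, unknowns) is restricted.

def policy_decision(licence: str) -> str:
    """Classify a licence under the three-tier policy.

    Anything unrecognised -- including a missing licence -- fails closed
    to legal review rather than defaulting to 'permitted'.
    """
    if licence in PERMITTED:
        return "permitted"
    if licence in PROHIBITED:
        return "prohibited"
    return "restricted: legal review required"

print(policy_decision("MIT"))            # permitted
print(policy_decision("AGPL-3.0-only"))  # prohibited
print(policy_decision("UNKNOWN"))        # restricted: legal review required
```

The fail-closed default is the whole policy in one line: a licence nobody recognises routes to legal, not to production.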

If You Lead Legal and Compliance

Custom AI licences require active tracking. Meta's Llama licence, Mistral's terms, the growing set of "responsible AI" licences — each has distinct commercial provisions, user threshold triggers, and downstream obligations. These change more frequently than traditional open source licences, and changes can affect existing deployments.

Build regulatory monitoring into standing practice. The EU AI Act's August 2026 high-risk deadline, ongoing EAR expansion, and China's CAC registration requirements all have lead times measured in months. Waiting creates crisis. Monitoring creates optionality.

If You Lead the Business

Yes, open source AI carries compliance risk. You can manage that. What you can't manage is watching your competitors ship twice as fast because they aren't afraid of it. The fastest-moving organisations in 2025 are overwhelmingly using open source models as their foundation.

OSPOs, licence policies, SBOM pipelines, regulatory monitoring — yes, it's overhead. But it's the kind of overhead that keeps you off the front page of Hacker News for the wrong reasons. I watched a Series C company spend $2M on a rushed compliance retrofit. Their competitor had spent $200K building it in from the start. That's the difference.

The Leverage Equation

$100+ billion in annual AI investment. 420+ million GitHub repositories. Regulatory frameworks active across 75% of the world's economies by 2030. Open source AI is not a free lunch. It never was.

But it is a leveraged opportunity — and the leverage compounds. In two years, enterprise procurement teams will require governance documentation as a condition of vendor selection. The teams building that documentation now won't scramble. Everyone else will.

Take inventory. Build policy. Automate enforcement. Monitor regulation. Govern training data.

That's the path. Not to slowing down AI adoption — to making it sustainable.

If you're evaluating your organisation's broader AI readiness — beyond licence compliance — the 5-Pillar AI Readiness Assessment provides the diagnostic framework I use with advisory clients. For governance specifically, the Minimum Viable Governance framework is the fastest path to a defensible baseline.

Ajay Pundhir
Senior AI strategist helping leaders make AI real across four continents. Forbes Technology Council member, IEEE Senior Member.