
Open Source AI: Open Weights and Real Model Independence

Llama, Mistral, and Qwen are called open models. This article explains what separates Open Weights from Open Source and when self-hosting beats API usage in production.

Llama, Mistral, Qwen, DeepSeek: today’s discussion of open source AI usually points to these models. They can be downloaded, run locally, and embedded in commercial products without licensing fees. Yet the label “open source” is technically wrong for most of them. The more accurate term is open weights, and the difference decides how independent a company actually becomes.

Open Weights Is Not Open Source

For software, the line is clear: release the source code under an OSI-approved license and you have open source. AI models are different because the finished product has three components that can be released separately.

The first component is the weights, the trained parameters of a neural network. The second is the training code that turns data and architecture into a model. The third is the training data itself. Open Weights means only the first piece is public. Meta’s Llama 3, Mistral Large, and most Chinese models like Qwen and DeepSeek fall into this category.

In October 2024 the Open Source Initiative (OSI) published the Open Source AI Definition 1.0, the first formal standard for what counts as a truly open model. It requires all three components: weights, code, and training data documented well enough to reproduce the training run. Few well-known models meet this bar today. OLMo from the Allen Institute, Pythia from EleutherAI, and LLM360 do. Llama, Mistral, and Qwen do not.

For many use cases, Open Weights is enough. Pure inference does not need training code or data. The gap matters when reproducibility, audit obligations, or legal clarity over training data become business questions. That is where the distinction between Open Weights and Open Source becomes commercially relevant.
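The three-component distinction can be written down as a small data structure. This is a deliberately simplified sketch: `ModelRelease` and `classify` are illustrative names, and the real OSI definition contains more conditions than three boolean flags. The example flags follow the categorisation given in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRelease:
    """Which of the three components a publisher has released (hypothetical type)."""
    name: str
    weights: bool
    training_code: bool
    data_documented: bool  # training data described well enough to reproduce the run


def classify(release: ModelRelease) -> str:
    """Simplified label; not a substitute for reading the OSI definition."""
    if release.weights and release.training_code and release.data_documented:
        return "open source"
    if release.weights:
        return "open weights"
    return "closed"


# Flags as described in the article text, not independently audited here.
llama = ModelRelease("Llama 3", weights=True, training_code=False, data_documented=False)
olmo = ModelRelease("OLMo", weights=True, training_code=True, data_documented=True)

print(classify(llama))  # open weights
print(classify(olmo))   # open source
```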

License Traps in Allegedly Open Models

Open Weights does not mean license-free. The Llama Community License allows commercial use, but companies with more than 700 million monthly active users must request a separate license from Meta, and its acceptable-use policy restricts certain deployments. The Mistral Research License covers only research and personal use. Commercial deployment requires a separate agreement. Qwen licensing varies by release and model size: many checkpoints use plain Apache 2.0, while others ship under Qwen's own license with extra terms.

Anyone running Open Weights in production has to read the license. That is non-negotiable. It matters especially when the model becomes part of a commercial product, or when output reaches end customers whose own user numbers could cross the license thresholds. A practical safeguard: track models in the same SBOM as software dependencies, with version numbers and license text. The article on open source licenses for businesses goes deeper on this.
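What such an inventory entry could look like is sketched here. The field names, the `ModelEntry` type, and the example values are assumptions for illustration and do not follow any particular SBOM standard. The hash pins the exact license text that was reviewed, so a silently changed license becomes visible; the text itself would be stored alongside the record.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelEntry:
    """One model in the dependency inventory (hypothetical schema)."""
    name: str
    version: str
    license_name: str
    license_sha256: str  # fingerprint of the license text that was reviewed
    source: str


def make_entry(name: str, version: str, license_name: str,
               license_text: str, source: str) -> ModelEntry:
    digest = hashlib.sha256(license_text.encode("utf-8")).hexdigest()
    return ModelEntry(name, version, license_name, digest, source)


# Placeholder values only; replace with the real model and license text.
entry = make_entry(
    name="example-model",
    version="1.0",
    license_name="Example Community License",
    license_text="(full license text goes here)",
    source="https://models.example.com/example-model",
)
print(json.dumps(asdict(entry), indent=2))
```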

A note for international readers: in the German Mittelstand, GDPR and data sovereignty are not optional checkboxes but board-level concerns. Self-hosting is often less about cost and more about regulatory clarity for industries like insurance, healthcare, or industrial manufacturing.

When Self-Hosting Pays Off

The honest answer is: it depends. The drivers that push toward self-hosting fall into four buckets.

Data privacy comes first in regulated industries. Patient records, manufacturing telemetry, or legal case files do not belong in third-party APIs, no matter how strong the no-train clauses are. A locally hosted Llama 3 70B or Mixtral 8x22B solves this structurally. Data sovereignty stays in the data center or with a European cloud provider.

Cost control is the second driver. At high token volumes the math flips toward self-hosting. A rough rule of thumb: above five million inference tokens per day, a dedicated GPU server (for example two H100 GPUs running Llama 3 70B) becomes cheaper than the equivalent API calls at hyperscalers. The case strengthens if the model runs at constant load instead of bursty spikes.
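The break-even point behind such a rule of thumb is simple arithmetic: a fixed monthly server cost divided by the per-token API price. The sketch below makes that explicit. Both input values are placeholders picked only to keep the arithmetic easy to follow; they are not real quotes, and the calculation ignores power, staffing, and idle capacity.

```python
def breakeven_tokens_per_day(server_cost_per_month: float,
                             api_price_per_million_tokens: float,
                             days_per_month: int = 30) -> float:
    """Daily token volume at which a fixed-cost server matches API spend."""
    if api_price_per_million_tokens <= 0:
        raise ValueError("API price must be positive")
    server_cost_per_day = server_cost_per_month / days_per_month
    return server_cost_per_day / api_price_per_million_tokens * 1_000_000


# Placeholder inputs, NOT real prices: plug in your own offers.
tokens = breakeven_tokens_per_day(server_cost_per_month=3000.0,
                                  api_price_per_million_tokens=20.0)
print(f"Break-even at about {tokens / 1_000_000:.1f} million tokens per day")
```

Below the computed volume the API is cheaper, above it the server is. A cheaper API price pushes the break-even volume up, which is why the threshold should be recalculated whenever prices change.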

Domain fine-tuning is the third driver. Training a model on insurance claims, legal text, or a company-specific code style requires access to the weights. API-only models can be steered with RAG or system prompts but not truly fine-tuned. Real fine-tuning needs Open Weights and infrastructure. The tooling has matured: Hugging Face TRL, Unsloth, or Axolotl cover most needs.

Latency and availability complete the list. On-premise inference on a dedicated GPU answers reliably in 200 to 500 milliseconds, regardless of whether OpenAI is having an outage or Anthropic is changing model policy.

A Practical Self-Hosting Stack

For Mac workstations or small edge setups, Ollama is the pragmatic entry point. Llama 3 8B runs on a Mac Mini M4 with 32 GB RAM at usable speed, around 25 tokens per second. For tests, internal tools, and developer machines that is enough.

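On Linux, installation and the first run take three commands. The install script targets Linux only, so on macOS install the Ollama app from ollama.com first and then run the last two commands: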

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain RAG in three sentences."
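Once the model is pulled, Ollama also serves a local HTTP API, so internal tools can call it without the CLI. A minimal sketch using only the Python standard library, assuming Ollama's default port 11434 and the model tag from the commands above. `build_request` and `generate` are illustrative helper names, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes the default local port


def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the request and return the generated text; needs a running Ollama server."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]


# Inspect the request without contacting the server.
req = build_request("Explain RAG in three sentences.")
print(req.get_method(), req.full_url)
```

With the server running, `generate("Explain RAG in three sentences.")` returns the answer as a string.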

For production workloads, vLLM or Text Generation Inference (TGI) take over. Both servers scale across multi-GPU setups, support continuous batching, and expose OpenAI-compatible APIs. Existing applications can switch with minimal code changes.

An existing OpenAI client can be redirected to vLLM in two lines:

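from openai import OpenAI

# Point the standard OpenAI client at the internal vLLM server.
client = OpenAI(
    base_url="https://llm.internal.example.com/v1",
    api_key="sk-internal",  # placeholder; only checked if the server enforces a key
)

# The model name must match the model the server was started with.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "What is Open Weights?"}],
    temperature=0.2,  # low temperature for more consistent answers
)
print(response.choices[0].message.content)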

A two-tier hardware approach works well. The first tier is a developer tier on consumer GPUs (one RTX 4090, or two in a tower). The second is a production tier on data center GPUs (H100, H200, or AMD MI300X). Companies that prefer not to operate hardware can rent GPU instances at European providers like IONOS or Hetzner, both with clear GDPR conditions.

The Strategic Argument Behind Open Weights

The real value of Open Weights is not the monthly cost saving. It is optionality. A RAG system built on an open model can swap models any time without leaving data or logic stranded inside a foreign API. That flexibility matters in regulated industries or for platforms with a long lifespan.

At the same time the asymmetry is shrinking. Open weights models now match frontier API models on many tasks, particularly code generation and structured extraction. Proprietary models still lead on complex reasoning, very long contexts, and multimodality. The gap is closing but it exists.

A pragmatic architecture combines both. The open model handles sensitive data, high-volume queries, and standard tasks locally. The proprietary API stays available where peak performance matters. This hybrid pattern captures most of what open models deliver in AI today. It does not always give the best model, but it gives the best negotiating position.
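The routing rule of such a hybrid setup fits in a few lines. The sketch below is one possible policy, not a prescribed one: the `Task` fields and backend names are hypothetical, and in practice the flags would come from a data classification step rather than being set by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    prompt: str
    sensitive: bool = False       # contains regulated or confidential data
    needs_frontier: bool = False  # complex reasoning, very long context, multimodal


LOCAL = "local-open-model"   # hypothetical backend names
REMOTE = "proprietary-api"


def route(task: Task) -> str:
    """Sensitive data never leaves the local tier, whatever else is requested."""
    if task.sensitive:
        return LOCAL
    if task.needs_frontier:
        return REMOTE
    return LOCAL  # standard and high-volume work stays local by default


print(route(Task("Summarise this patient record", sensitive=True, needs_frontier=True)))
print(route(Task("Plan a multi-step migration", needs_frontier=True)))
print(route(Task("Extract invoice fields")))
```

Checking the sensitivity flag first is the important design choice: a request that is both sensitive and demanding stays local and accepts the weaker model rather than leaking data.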

Conclusion

Open Weights is not Open Source, but for most production AI deployments in companies it is the more honest promise. Anyone with strict data sovereignty needs, high inference volumes, or fine-tuning requirements will arrive at self-hosting eventually. The tooling is mature, the hardware is available, and the economics work in many scenarios.

For setting up an AI strategy in a company, the article on AI strategy for the Mittelstand provides the frame. For connecting open models with internal knowledge bases, the piece on RAG systems covers the technical groundwork. Implementation support is available through our AI consulting services, from architecture to production deployment.

Frequently Asked Questions

What is the difference between Open Source and Open Weights for AI models?

Open Weights means only the trained model weights are publicly available. Open Source under the OSI definition additionally requires training code and training data documented well enough to reproduce the run. Llama and Mistral are Open Weights. OLMo and Pythia meet the stricter Open Source criteria.

What hardware do I need for a local Llama 3 70B?

Two H100 GPUs with 80 GB VRAM each give solid production performance. On consumer hardware, a 4-bit quantized Llama 3 70B runs on two RTX 4090s or a Mac Studio M2 Ultra with 192 GB unified memory. Speed lands around 10 to 20 tokens per second in those setups, fast enough for many internal tools.
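The hardware sizes follow from a back-of-the-envelope estimate: parameter count times bits per weight. The sketch below covers the weights only. KV cache, activations, and runtime overhead come on top, and real quantized formats use slightly more than their nominal bit width, so treat the results as lower bounds.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
```

At 16-bit the weights need about 140 GB, which fits into the 160 GB of two H100s. At 4-bit they need about 35 GB, which fits into the 48 GB of two RTX 4090s with some room left for context.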

When does self-hosting beat API usage in cost?

At roughly five million inference tokens per day with around-the-clock utilization, a dedicated GPU server becomes cheaper than equivalent API calls. Sporadic usage patterns favor APIs. Data privacy requirements or fine-tuning needs can justify self-hosting at much smaller volumes, regardless of the pure cost calculation.

Can I deploy Llama or Mistral commercially?

Llama 3 permits commercial use, but companies with over 700 million monthly active users need a separate license from Meta. Mistral models vary by version. The Mistral Research License covers research and personal use only. Before any production deployment, the specific license should be reviewed and model versions tracked in an SBOM alongside software dependencies, with version and license text recorded.

#open-source #llm #open-weights #llama #mistral

Sergej Bardin

CEO – AI Strategy & IT Consulting

Helping mid-sized companies adopt AI and shape their cloud strategy. Focus on practical decisions over hype.

AI Strategy, MCP, RAG, Multi-Cloud, IT Consulting, Mid-Market