Office space meme:
“If y’all could stop calling an LLM ‘open source’ just because they published the weights… that would be great.”
k
I mean that’s all a model is so… Once again someone who doesn’t understand anything about training or models is posting borderline misinformation about AI.
Shocker
Yet another so-called AI evangelist accusing others of not understanding computer science if they don’t want to worship their machine god.
Do you think your comments here are implying an understanding of the tech?
It’s not like you need specific knowledge of Transformer models and whatnot to counterargue LLM bandwagon simps. A basic knowledge of Machine Learning is fine.
what a weird hill to die on
If the Source is Open to copying, and I won’t get sued for doing it, well, then…
The source OP is referring to is the training data that they used to compute those weights. Meaning, petabytes of text. Without that we don’t know which content they used for training the model.
The running/training engines might be open source, the pretrained model isn’t and claiming otherwise is wrong.
Nothing wrong with it being this way; most commercial models obviously operate the same way. Just don’t claim that the model itself is open source, because a big part of open source is that people can reproduce your training to verify that there’s no foul play in the input data. We literally can’t. That’s it.
Uuuuh… why?
Do you only accept open source code if you can see every key press every developer made?
Open source means you can recreate the binaries yourself. Neither Facebook nor the devs of DeepSeek published which training data they used, nor their training algorithm.
Eh, it seems like it fits to me. We casually refer to all manner of data as “open source” even if we lack the ability to specifically recreate it. It might be technically more accurate to say “open data” but we usually don’t, so I can’t be too mad at these folks for also not.
There are huge depths of USGS data shared as open data that I absolutely cannot ever replicate.
If we’re specifically saying that open source means you can recreate the binaries, then data is fundamentally not able to be open source, since it distinctly lacks any form of executable content.
If we’re specifically saying that open source means you can recreate the binaries, then data is fundamentally not able to be open source
lol, are you claiming data isn’t reproducible? XD
They published the source code needed to run the model. It’s open source in the way that anyone can download the model, run it locally, and further build on it.
Training from scratch costs millions.
A software analogy:
Someone designs a compiler and makes it open source. They make an open runtime for it. They ‘obtain’ some source code with an unclear license, compile it with the compiler, and release the compiled bytecode, which can run with the runtime on a free OS. Do you call the program open source? It is definitely more open than something that requires a proprietary, internal-use-only compiler and a closed runtime, where sometimes you can’t even access the binary because it runs on their servers. It depends on perspective.
ps: the compiler takes ages and costs mils in hardware.
edit: typo
They published the source code needed to run the model.
Yeah, but not to train it
anyone can download the model, run it locally, and further build on it.
Yeah, it’s about as open source as binary blobs.
Training from scratch costs millions.
So what? You can still glean something if you know the dataset on which the model has been trained.
If software is hard to compile, can you keep the source code closed and still call software “open source”?
Open source isn’t really applicable to LLM models IMO.
There is open weights (the model), and available training data, and other nuances.
They actually went a step further and provided a very thorough breakdown of the training process, which does mean others could similarly train models from scratch with their own training data. HuggingFace seems to be doing just that as well. https://huggingface.co/blog/open-r1
Edit: see the comment below by BakedCatboy for a more in-depth explanation and a correction of a misconception I had
The runner is open source, the model is not
The service uses both, so calling their service open source gives a false impression to the 99.99% of users who don’t know better.
No, but I do call a CC licensed png file open source even if the author didn’t share the original layered Photoshop file.
Model weights are data, not code.
You’d be wrong. Open source has a commonly accepted definition and a CC licensed PNG does not fall under it. It’s copyleft, yes, but not open source.
I do agree that model weights are data and can be given a license, including CC0. There might be some argument about how one can assign a license to weights derived from copyrighted works, but I won’t get into that right now. I wouldn’t call even the most liberally licensed model weights open-source though.
Open Source (generally and for AI) has an established definition.
This is exactly it, open source is not just the availability of the machine instructions, it’s also the ability to recreate the machine instructions. Anything less is incomplete.
It strikes me as a variation on the “free as in beer versus free as in speech” line that gets thrown around a lot. These weights allow you to use the model for free and you are free to modify the existing weights but being unable to re-create the original means it falls short of being truly open source. It is free as in beer, but that’s it.
It really comes down to this part of the “Open Source” definition:
The source code [released] must be the preferred form in which a programmer would modify the program
A compiled binary is not the format in which a programmer would prefer to modify the program - it’s much preferred to have the text file which you can edit in a text editor. Just because it’s possible to reverse engineer the binary and make changes by patching bytes doesn’t make it count. Any programmer would much rather have the source file instead.
Similarly, the released weights of an AI model are not easy to modify, and are not the “preferred format” that the internal programmers use to make changes to the AI model. They are typically making changes to the code that does the training and to the training dataset. So for the purpose of calling an AI “open source”, the training code and data used to produce the weights are considered the “preferred format”, and are what need to be released for it to really be open source. Internal engineers also typically use training checkpoints, so that they can roll back the model and redo some of the later training steps without redoing all training from the beginning. These checkpoints are also considered part of the preferred format if they’re used.
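To make the training code / data / checkpoint relationship concrete, here is a toy sketch in plain Python. Everything in it (the `train` function, `DATA`, the learning rate) is made up for illustration and has nothing to do with how DeepSeek or Meta actually train. It only shows that the weights are an output of training code plus data plus hyperparameters, and that a checkpoint lets you redo just the later steps.

```python
# Toy illustration only: a two-parameter "model" y = w*x + b.
# All names and numbers here are hypothetical.

def train(data, lr, steps, w=0.0, b=0.0, start_step=0):
    """Fit y = w*x + b with plain gradient descent on mean squared error.

    Passing w, b and start_step resumes from a saved checkpoint.
    Returns the final weights as a (w, b) tuple.
    """
    n = len(data)
    for _ in range(start_step, steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# The "training data": points on the line y = 2x + 1.
DATA = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

# With data + code + hyperparameters, anyone can regenerate the weights.
weights_full = train(DATA, lr=0.05, steps=2000)

# A checkpoint saved at step 1000 lets you redo only the second half.
checkpoint = train(DATA, lr=0.05, steps=1000)
weights_resumed = train(DATA, lr=0.05, steps=2000,
                        w=checkpoint[0], b=checkpoint[1], start_step=1000)

print(weights_full)
print(weights_resumed == weights_full)
```

If you were only handed `weights_full`, you could still run the model, but you could not tell from the two numbers which points were in `DATA` or which learning rate was used. That is roughly the gap being described here.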
OpenR1, which is attempting to recreate R1, notes: No training code was released by DeepSeek, so it is unknown which hyperparameters work best and how they differ across different model families and scales.
I would call “open weights” models actually just “self hostable” models instead of open source.
Thank you for the explanation. I didn’t know about the ‘preferred format’ definition or how AI models are changed at all.
It’s a lie. The preferred format is the (pre-)trained weights. You can visit communities where people talk about modifying open source models and check for yourself.
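For what it’s worth, the workflow those communities use is fine-tuning: start from the released weights and keep training on your own data. A toy sketch of the idea in plain Python follows. The `released` weights, the `fine_tune` function and the data are all hypothetical stand-ins, not any real model or library API.

```python
# Toy illustration only: a two-parameter "model" y = w*x + b.
# `released` stands in for published weights; the original
# training data is assumed to be unavailable.
released = {"w": 2.0, "b": 1.0}

def fine_tune(weights, new_data, lr=0.05, steps=2000):
    """Continue gradient descent from existing weights using only new data.

    Returns a new dict of weights; the input dict is left untouched.
    """
    w, b = weights["w"], weights["b"]
    n = len(new_data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in new_data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in new_data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return {"w": w, "b": b}

# New task the original model was not trained on: y = 2x + 3.
new_data = [(0.0, 3.0), (1.0, 5.0), (2.0, 7.0)]
tuned = fine_tune(released, new_data)

print(tuned)
```

This works without ever seeing the original dataset, which is the point being made here. It still doesn’t tell you what went into `released` in the first place, which is the other side’s point.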