About as open source as a binary blob without the training data (slrpnk.net)

posted 3 days ago

Prunebutt@slrpnk.net

memes@lemmy.world

185 comments

Office space meme:

“If y’all could stop calling an LLM “open source” just because they published the weights… that would be great.”

stinky@redlemmy.com

3 points

3 days ago

Oisteink@feddit.nl

21 points

3 days ago

Source - it’s about open source, not access to the database

Prunebutt@slrpnk.net (OP)

16 points

3 days ago

So, where’s the source, then?

Oisteink@feddit.nl

4 points

3 days ago

It’s not open, so it doesn’t matter.

Prunebutt@slrpnk.net (OP)

4 points

2 days ago

It’s constantly referred to as “open source”.

KillingTimeItself@lemmy.dbzer0.com

30 points

3 days ago

I mean, if it’s not directly factually inaccurate, then it is open source. It’s just that the specific block of data they used and operate on isn’t published or released, which is pretty common even among open source projects.

AI just happens to be in a fairly unique spot where that thing is actually, like, pretty important. Though nothing stops other groups from creating an openly accessible one through something like distributed computing, which seems to be having a fancy new-kid-on-the-block moment for AI right now.

FooBarrington@lemmy.world

10 points

2 days ago

But it is factually inaccurate. We don’t call binaries open-source, we don’t even call visible-source open-source. An AI model is an artifact just like a binary is.

An “open-source” project that doesn’t publish everything needed to rebuild isn’t open-source.

Treczoks@lemmy.world

2 points

2 days ago

That “specific block of data” is more than 99% of such a project. Hardly insignificant.

Fushuan [he/him]@lemm.ee

14 points

2 days ago

The running engine and the training engine are open source. The service that uses the model trained with the open source engine and runs it with the open source runner is not, because a biiiig big part of what makes AI work is the trained model, and a big part of the source of a trained model is training data.

When they say open source, 99.99% of the people will understand that everything is verifiable, and it just is not. This is misleading.

As others have stated, a big part of open source development is providing everything so that other users can get the exact same results. This has always been the case in open source ML development: people provide links to their training data for reproducibility. It has been the case with most of the papers on natural language processing (the broader field that LLMs belong to) that I have read in the past. Both code and training data are provided.

An example from the computer vision world, darknet and YOLO: https://github.com/AlexeyAB/darknet

This is the repo with the code to train and run the darknet models, and they also provide pretrained models, called YOLO. They also provide links to the original dataset the YOLO models were trained on. THIS is open source.

Miaou@jlai.lu

2 points

2 days ago

Is it common? Many fields have standard, open datasets. That’s not the case here, and this data is the most important part of training an LLM.
