Apparently, stealing other people’s work to create product for money is now “fair use” as according to OpenAI because they are “innovating” (stealing). Yeah. Move fast and break things, huh?

“Because copyright today covers virtually every sort of human expression—including blogposts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials,” wrote OpenAI in the House of Lords submission.

OpenAI claimed that the authors in that lawsuit “misconceive[d] the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence.”

You are viewing a single thread.
View all comments
74 points

Try to train a human comedian to make jokes without ever allowing him to hear another comedian’s jokes, never watching a movie, never reading a book or magazine, never watching a TV show. I expect the jokes would be pretty weak.

permalink
report
reply
119 points

A comedian isn’t forming a sentence based on what the most probable word is going to appear after the previous one. This is such a bullshit argument that reduces human competency to “monkey see thing to draw thing” and completely overlooks the craft and intent behind creative works. Do you know why ChatGPT uses certain words over others? Probability. It decided as a result of its training that one word would appear after the previous in certain contexts. It absolutely doesn’t take into account things like “maybe this word would be better here because the sound and syllables maintains the flow of the sentence”.

Baffling takes from people who don’t know what they’re talking about.

permalink
report
parent
reply
65 points
*

I wish I could upvote this more than once.

What people always seem to miss is that a human doesn’t need billions of examples to be able to produce something that’s kind of “eh, close enough”. Artists don’t look at billions of paintings. They look at a few, but do so deeply, absorbing not just the most likely distribution of brushstrokes, but why the painting looks the way it does. For a basis of comparison, I did an art and design course last year and looked at about 300 artworks in total (course requirement was 50-100). The research component on my design-related degree course is one page a week per module (so basically one example from the field the module is about, plus some analysis). The real bulk of the work humans do isn’t looking at billions of examples: it’s looking at a few, and then practicing the skill and developing a process that allows them to convey the thing they’re trying to express.

If the AI models were really doing exactly the same thing humans do, the models could be trained without any copyright infringement at all, because all of the public domain and creative commons content, plus maybe licencing a little more, would be more than enough.

permalink
report
parent
reply
25 points

Exactly! You can glean so much from a single work, not just about the work itself but who created it and what ideas were they trying to express and what does that tell us about the world they live in and how they see that world.

This doesn’t even touch the fact that I’m learning to draw not by looking at other drawings but what exactly I’m trying to draw. I know at a base level, a drawing is a series of shapes made by hand whether it’s through a digital medium or traditional pen/pencil and paper. But the skill isn’t being able replicate other drawings, it’s being able to convert something I can see into a drawing. If I’m drawing someone sitting in a wheelchair, then I’ll get the pose of them sitting in the wheelchair but I can add details I want to emphasise or remove details I don’t want. There’s so much that goes into creative work and I’m tired of arguing with people who have no idea what it takes to produce creative works.

permalink
report
parent
reply
9 points
*

Children learn by watching others. We are trained from millions of examples starting from before birth.

permalink
report
parent
reply
5 points

When people say that the “model is learning from its training data”, it means just that, not that it is human, and not that it learns exactly humans. It doesn’t make sense to judge boats on how well they simulate human swimming patterns, just how well they perform their task.

Every human has the benefit of as a baby training on things around them and being trained by those around them, building a foundation for all later skills. Generative models rely on many text and image pairs to describe things to them because they lack the ability to poke, prod, rotate, and disassemble for themselves.

For example, when a model takes in a thousand images of circles, it doesn’t “learn” a thousand circles. It learns what circle GENERALLY is like, the concept of it. That representation, along with random noise, is how you create images with them. The same happens for every concept the model trains on. Everything from “cat” to more complex things like color relationships and reflections or lighting. Machines are not human, but they can learn despite that.

permalink
report
parent
reply
4 points

What you count as “one” example is arbitrary. In terms of pixels, you’re looking at millions right now.

The ability to train faster using fewer examples in real time, similar to what an intelligent human brain can do, is definitely a goal of AI research. But right now, we may be seeing from AI what a below average human brain could accomplish with hundreds of lifetimes to study.

If the AI models were really doing exactly the same thing humans do, the models could be trained without any copyright infringement at all, because all of the public domain and creative commons content, plus maybe licencing a little more, would be more than enough.

I mean, no, if you only ever look at public domain stuff you literally wouldn’t know the state of the art, which is historically happening for profit. Even the most untrained artist “doing their own thing” watches Disney/Pixar movies and listens to copyrighted music.

permalink
report
parent
reply
2 points

When you look at one painting, is that the equivalent of one instance of the painting in the training data? There is an infinite amount of information in the painting, and each time you look you process more of that information.

I’d say any given painting you look at in a museum, you process at least a hundred mental images of aspects of it. A painting on your wall could be seen ten thousand times easily.

permalink
report
parent
reply
21 points

That’s what humans do, though. Maybe not probability directly, but we all know that some words should be put in a certain order. We still operate within standard norms that apply to aparte group of people. LLM’s just go about it in a different way, but they achieve the same general result. If I’m drawing a human, that means there’s a ‘hand’ here, and a ‘head’ there. ‘Head’ is a weird combination of pixels that mostly look like this, ‘hand’ looks kinda like that. All depends on how the model is structured, but tell me that’s not very similar to a simplified version of how humans operate.

permalink
report
parent
reply
19 points

Yeah but the difference is we still choose our words. We can still alter sentences on the fly. I can think of a sentence and understand verbs go after the subject but I still have the cognition to alter the sentence to have the effect I want. The thing lacking in LLMs is intent and I’m yet to see anyone tell me why a generative model decides to have more than 6 fingers. As humans we know hands generally have five fingers and there’s a group of people who don’t so unless we wanted to draw a person with a different number of fingers, we could. A generative art model can’t help itself from drawing multiple fingers because all it understands is that “finger + finger = hand” but it has no concept on when to stop.

permalink
report
parent
reply
3 points

As an artist you draw with an understanding of the human body, though. An understanding current models don’t have because they aren’t actually intelligent.

Maybe when a human is an absolute beginner in drawing they will think about the different lines and replicate even how other people draw stuff that then looks like a hand.

But eventually they will realise (hopefully, otherwise they may get frustrated and stop drawing) that you need to understand the hand to draw one. It’s mass, it’s concept or the idea of what a hand is.

This may sound very abstract and strange but creative expression is more complex than replicating what we have seen a million times. It’s a complex function unique to the human brain, an organ we don’t even scientifically understand yet.

permalink
report
parent
reply
8 points

That’s not the point though. The point is that the human comedian and the AI both benefit from consuming creative works covered by copyright.

permalink
report
parent
reply
14 points

Yeah except a machine is owned by a company and doesn’t consume the same way. It breaks down copyrighted works into data points so it can find the best way of putting those data points together again. If you understand anything at all about how these models work, they do not consume media the same way we do. It is not an entity with a thought process or consciousness (despite the misleading marketing of “AI” would have you believe), it’s an optimisation algorithm.

permalink
report
parent
reply
11 points
*

And human comedians regularly get called out when they outright steal others material and present it as their own.

The word for this is plagiarism.

And in OpenAIs framework, when used in a relevant commercial context, they are functionally operating and profiting off of the worlds most comprehensive plagiarism software.

permalink
report
parent
reply
5 points

A comedian isn’t forming a sentence based on what the most probable word is going to appear after the previous one.

Neither is an LLM. What you’re describing is a primitive Markov chain.

You may not like it, but brains really are just glorified pattern recognition and generation machines. So yes, “monkey see thing to draw thing”, except a really complicated version of that.

Think of it this way: if your brain wasn’t a reorganization and regurgitation of the things you have observed before, it would just generate random noise. There’s no such thing as “truly original” art or it would be random noise. Every single word either of us is typing is the direct result of everything you and I have observed before this moment.

Baffling takes from people who don’t know what they’re talking about.

Ironic, to say the least.

The point you should be making, is that a corporation will make this above argument up to, but not including the point where they have to treat AIs ethically. So that’s the way to beat them. If they’re going to argue that they have created something that learns and creates content like a human brain, then they should need to treat it like a human, ensure it is well compensated, ensure it isn’t being overworked or enslaved, ensure it is being treated “humanely”. If they don’t want to do that, if they want it to just be a well built machine, then they need to license all the proprietary data they used to build it. Make them pick a lane.

permalink
report
parent
reply
4 points

Neither is an LLM. What you’re describing is a primitive Markov chain.

My description might’ve been indicative of a Markov chain but the actual framework uses matrices because you need to be able to store and compute a huge amount of information at once which is what matrices are good for. Used in animation if you didn’t know.

What it actually uses is irrelevant, how it uses those things is the same as a regression model, the difference is scale. A regression model looks at how related variables are in giving an outcome and computing weights to give you the best outcome. This was the machine learning boom a couple of years ago and TensorFlow became really popular.

LLMs are an evolution of the same idea. I’m not saying it’s not impressive because it’s very cool what they were able to do. What I take issue with is the branding, the marketing and the plagiarism. I happen to be in the intersection of working in the same field, an avid fan of classic Sci-Fi and a writer.

It’s easy to look at what people have created throughout history and think “this looks like that” and on a point by point basis you’d be correct but the creation of that thing is shaped by the lens of the person creating it. Someone might make a George Carlin joke that we’ve heard recently but we’ll read about it in newspapers from 200 years ago. Did George Carlin steal the idea? No. Was he aware of that information? I don’t know. But Carlin regularly calls upon his own experiences so it’s likely that he’s referencing a event from his past that is similar to that of 200 years ago. He might’ve subconsciously absorbed the information.

The point is that the way these models have been trained is unethical. They used material they had no license to use and they’ve admitted that it couldn’t work as well as it does without stealing other people’s work. I don’t think they’re taking the position that it’s intelligent because from the beginning that was a marketing ploy. They’re taking the position that they should be allowed to use the data they stole because there was no other way.

permalink
report
parent
reply

You do know that comedians are copying each others material all the time though? Either making the same joke, or slightly adapting it.

So in the context of copyright vs. model training i fail to see how the exact process of the model is relevant? At the end copyrighted material goes in and material based on that copyrighted material goes out.

permalink
report
parent
reply
2 points

you know how the neurons in our brain work, right?

because if not, well, it’s pretty similar… unless you say there’s a soul (in which case we can’t really have a conversation based on fact alone), we’re just big ol’ probability machines with tuned weights based on past experiences too

permalink
report
parent
reply
5 points

You are spitting out basic points and attempting to draw similarities because our brains are capable of something similar. The difference between what you’ve said and what LLMs do is that we have experiences that we are able to glean a variety of information from. An LLM sees text and all it’s designed to do is say “x is more likely to appear before y than z”. If you fed it nonsense, it would regurgitate nonsense. If you feed it text from racist sites, it will regurgitate that same language because that’s all it has seen.

You’ll read this and think “that’s what humans do too, right?” Wrong. A human can be fed these things and still reject them. Someone else in this thread has made some good points regarding this but I’ll state them here as well. An LLM will tell you information but it has no cognition on what it’s telling you. It has no idea that it’s right or wrong, it’s job is to convince you that it’s right because that’s the success state. If you tell it it’s wrong, that’s a failure state. The more you speak with it, the more fail states it accumulates and the more likely it is to cutoff communication because it’s not reaching a success, it’s not giving you what you want. The longer the conversation goes on, the more crazy LLMs get as well because it’s too much to process at once, holding those contexts in its memory while trying to predict the next one. Our brains do this easily and so much more. To claim an LLM is intelligent is incredibly misguided, it is merely the imitation of intelligence.

permalink
report
parent
reply
2 points

“Soul” is the word we use for something we don’t scientifically understand yet. Unless you did discover how human brains work, in that case I congratulate you on your Nobel prize.

You can abstract a complex concept so much it becomes wrong. And abstracting how the brain works to “it’s a probability machine” definitely is a wrong description. Especially when you want to use it as an argument of similarity to other probability machines.

permalink
report
parent
reply
1 point

Text prediction seems to be sufficient to explain all verbal communication to me. Until someone comes up with a use case that humans can do that LLMs cannot, and I mean a specific use case not general high level concepts, I’m going to assume human verbal cognition works the same was as an LLM.

We are absolutely basing our responses on what words are likely to follow which other ones. It’s literally how a baby learns language from those around them.

permalink
report
parent
reply
9 points

If you ask an LLM to help you with a legal brief, it’ll come up with a bunch of stuff for you, and some of it might even be right. But it’ll very likely do things like make up a case that doesn’t exist, or misrepresent a real case, and as has happened multiple times now, if you submit that work to a judge without a real lawyer checking it first, you’re going to have a bad time.

There’s a reason LLMs make stuff up like that, and it’s because they have been very, very narrowly trained when compared to a human. The training process is almost entirely getting good at predicting what words follow what other words, but humans get that and so much more. Babies aren’t just associating the sounds they hear, they’re also associating the things they see, the things they feel, and the signals their body is sending them. Babies are highly motivated to learn and predict the behavior of the humans around them, and as they get older and more advanced, they get rewarded for creating accurate models of the mental state of others, mastering abstract concepts, and doing things like make art or sing songs. Their brains are many times bigger than even the biggest LLM, their initial state has been primed for success by millions of years of evolution, and the training set is every moment of human life.

LLMs aren’t nearly at that level. That’s not to say what they do isn’t impressive, because it really is. They can also synthesize unrelated concepts together in a stunningly human way, even things that they’ve never been trained on specifically. They’ve picked up a lot of surprising nuance just from the text they’ve been fed, and it’s convincing enough to think that something magical is going on. But ultimately, they’ve been optimized to predict words, and that’s what they’re good at, and although they’ve clearly developed some impressive skills to accomplish that task, it’s not even close to human level. They spit out a bunch of nonsense when what they should be saying is “I have no idea how to write a legal document, you need a lawyer for that”, but that would require them to have a sense of their own capabilities, a sense of what they know and why they know it and where it all came from, knowledge of the consequences of their actions and a desire to avoid causing harm, and they don’t have that. And how could they? Their training didn’t include any of that, it was mostly about words.

One of the reasons LLMs seem so impressive is that human words are a reflection of the rich inner life of the person you’re talking to. You say something to a person, and your ideas are broken down and manipulated in an abstract manner in their head, then turned back into words forming a response which they say back to you. LLMs are piggybacking off of that a bit, by getting good at mimicking language they are able to hide that their heads are relatively empty. Spitting out a statistically likely answer to the question “as an AI, do you want to take over the world?” is very different from considering the ideas, forming an opinion about them, and responding with that opinion. LLMs aren’t just doing statistics, but you don’t have to go too far down that spectrum before the answers start seeming thoughtful.

permalink
report
parent
reply
1 point
*

Am I a moron? How do you have more upvotes than the parent comment, is it because you’re being more aggressive with your statement? I feel like you didn’t quite refute what the parent comment said. You’re just explaining how Chat GPT works, but you’re not really saying how it shouldn’t use our established media (copyrighted material) as a reference.

permalink
report
parent
reply
1 point
*

I don’t control the upvotes so I don’t know why that’s directed at me.

The refutation was based on around a misunderstanding of how LLMs generate their outputs and how the training data assists the LLM in doing what it does. The article itself tells you ChatGPT was trained off of copyrighted material they were not licensed for. The person I responded to suggested that comedians do this with their work but that’s equating the process an LLM uses when producing an output to a comedian writing jokes.

Edit: Apologies if I do come across aggressive. Since the plagiarism machine has been in full swing, the whole discourse around it has gotten on my nerves. I’m a creative person, I’ve written poems and short stories, I’m writing a novel and I also do programming and a whole host of hobbies so when LLMs are used to put people like me out of a job using my own work, why wouldn’t that make me angry? What makes it worse is that I’m having to explain concepts to people regarding LLMs that they continue to defend. I can’t stand it so yes, I will come off aggressive.

permalink
report
parent
reply
20 points
*

There’s this linguistic problem where one word is used for two different things, it becomes difficult to tell them apart. “Training” or “learning” is a very poor choice of word to describe the calibration of a neural network. The actor and action are both fundamentally different from the accepted meaning. To start with, human learning is active whereas machining learning is strictly passive: it’s something done by someone with the machine as a tool. Teachers know very well that’s not how it happens with humans.

When I compare training a neural network with how I trained to play clarinet, I fail to see any parallel. The two are about as close as a horse and a seahorse.

permalink
report
parent
reply
1 point

Not sure what you mean by passive. It takes a hell of a lot of electricity to train one of these LLMs so something is happening actively.

I often interact with ChatGPT 4 as if it were a child. I guide it through different kinds of mental problems, having it take notes and evaluate its own output, because I know our conversations become part of its training data.

It feels very much like teaching a kid to me.

permalink
report
parent
reply
8 points
*

I mean passive in terms of will. Computers want and do nothing. They’re machines that function according to commands.

The way you feel like teaching a child when you feed input in natural language to a LLM until you’re satisfied with the output is known as the ELIZA effect. To quote Wikipedia:

In computer science, the ELIZA effect is the tendency to project human traits — such as experience, semantic comprehension or empathy — into computer programs that have a textual interface. The effect is a category mistake that arises when the program’s symbolic computations are described through terms such as “think”, “know” or “understand.”

permalink
report
parent
reply
12 points

A comedian walks on stage and says, “Why is there a mic here?”

permalink
report
parent
reply
10 points

AIs are not humans. Humans cannot read millions of texts in seconds and cannot split out millions of output at the same time.

permalink
report
parent
reply
1 point

Try to train a human comedian to make jokes without ever allowing him to hear another comedian’s jokes, never watching a movie, never reading a book or magazine, never watching a TV show. I expect the jokes would be pretty weak.

permalink
report
parent
reply
1 point
*
Deleted by creator
permalink
report
parent
reply

Technology

!technology@beehaw.org

Create post

A nice place to discuss rumors, happenings, innovations, and challenges in the technology sphere. We also welcome discussions on the intersections of technology and society. If it’s technological news or discussion of technology, it probably belongs here.

Remember the overriding ethos on Beehaw: Be(e) Nice. Each user you encounter here is a person, and should be treated with kindness (even if they’re wrong, or use a Linux distro you don’t like). Personal attacks will not be tolerated.

Subcommunities on Beehaw:


This community’s icon was made by Aaron Schneider, under the CC-BY-NC-SA 4.0 license.

Community stats

  • 2.9K

    Monthly active users

  • 2.9K

    Posts

  • 54K

    Comments