Could Reddit's data be "poisoned" to prevent its use in training AI?

posted 7 months ago

In case you didn’t know, you can’t train an AI on content generated by another AI because it causes distortion that reduces the quality of the output. It is also very difficult to filter out AI text from human text in a database. This phenomenon is known as AI collapse.

So if you were to start using AI to generate comments and posts on Reddit, their database would be less useful for training AI and therefore the company wouldn’t be able to sell it for that purpose.

MushuChupacabra@lemmy.world

58 points

7 months ago

You raise an interesting bundt pan.

actionjbone@sh.itjust.works

28 points

7 months ago

I agree, it would take a significant amount of noodles to constipate the effort required to succumb.

Wogi@lemmy.world

18 points

7 months ago

I don’t think is. As for as to is much more less.

nodsocket@lemmy.world (OP)

21 points

7 months ago

You wouldn’t need to make nonsense output. In fact, using output that is hard to distinguish from natural posts would be better as it would prevent poisoned posts from being spotted and removed.

pohart@programming.dev

31 points

7 months ago

But you might be able to snake the noodle and creature nonsensical comments that others could still breed

guyrocket@kbin.social

26 points

7 months ago

Y’all.

We want to poison REDDIT’S data, not Lemmy’s. Go over there!

Kbin_space_program@kbin.social

7 points

7 months ago

Replace all of your posts with “ceo of reddit, aka Spez, moderated the jailbait subreddit” ?

Maybe have an AI generate variations on that phrase.

TakiMinase@slrpnk.net

5 points

7 months ago

Agreee , it’s valid mkaye

🇰 🌀 🇱 🇦 🇳 🇦 🇰 ℹ️@yiffit.net

44 points

7 months ago

> So if you were to start using AI to generate comments and posts on Reddit, their database would be less useful for training AI and therefore the company wouldn’t be able to sell it for that purpose.

It feels like Reddit was already using bots to make posts after they killed 3rd party apps. It’s been pointed out a lot here how so many comment chains on the site these days make no sense unless they are AI/bots.

Dagnet@lemmy.world

13 points

7 months ago

Fr. Couple of months ago I went to check and all I saw were posts with a ton of upvotes and no comments or posts with a ton of upvotes and a thousand comments, not a single comment with anything of substance.

Khrux@ttrpg.network

4 points

7 months ago

Even before then, you’d always find comments in any larger section that were irrelevant praise posted by bots to generate a “realistic” Reddit account to sell later to marketing companies.

Hell I believe I once used a tool to value my Reddit account at like $200 and it literally told me how kind my responses were. Also to generate comment karma, responding to a post early is much more valuable than a good response.

jaybone@lemmy.world

2 points

7 months ago

Oh I forgot about people selling accounts. Not only will they be training on a bunch of bot posts, they’ll be training on ad spam as well.

I never bothered deleting my account or deleting my posts. But I might consider selling my account.

🇰 🌀 🇱 🇦 🇳 🇦 🇰 ℹ️@yiffit.net

2 points

7 months ago

I don’t suppose you remember the tool? I’m curious about mine. lol

Khrux@ttrpg.network

1 point

7 months ago

I can’t remember the specific site and it may not be up anymore. I either found it by googling “Reddit account value” or words to that effect, or stumbled across the link in Reddit.

I do remember it worked a bit like redditmetis.com, as it knew the age of the account and its karma, but it also looked at the use of kind vs. obscene language. I was also a mod of a subreddit that just made everyone mods for the heck of it.

I think I already type like generative AI too, which may be worth something nowadays. Honestly, setting up a bot that uses a large language model to pump out vaguely relevant top-level comments soon after posts go up will probably net you more karma in a month than a decade of using the site sincerely, although for this reason I presume old accounts are particularly valued now.

piecat@lemmy.world

2 points

7 months ago

It’s not just the content, it’s the ecosystem

If you’re training ai, you need a way to evaluate outputs. What better way than through karma score?

FaceDeer@kbin.social

35 points

7 months ago

> In case you didn’t know, you can’t train an AI on content generated by another AI because it causes distortion that reduces the quality of the output.

This is incorrect in the general case. You can run into problems if you do it incorrectly or in a naive manner. But this is stuff that the professionals have figured out months or years ago already. A lot of the better AIs these days are trained on “synthetic data”, which is data that’s been generated by other AIs.

I’ve seen a lot of people fall for wishful thinking on this subject. They don’t like AI for whatever reason, they hear some news article that says something that sounds like “AI won’t work because of problem X”, and so they grab hold of that. “Model collapse” is one of those things, it’s not really a problem that serious researchers consider insurmountable.

If you don’t want Reddit to use your posts to train AI then don’t post on Reddit. If you already did post on Reddit, it’s too late, you already gave them your content. Bear this in mind next time you join a social media site, I guess.

Windex007@lemmy.world

8 points

7 months ago

Biased models are still absolutely a massive concern to serious researchers.

“AI collapse” isn’t the only mechanism to throw a monkey wrench into someone’s AI ambitions.

Intentionally introducing and reinforcing biases in an automated fashion adds an additional burden to those developing a model. I haven’t actually looked into the economic asymmetry of those attacks, though.

JeeBaiChow@lemmy.world

7 points

7 months ago

Absolutely this. AI isn’t some bastion of truth. I envision a future where AIs are trained by different stakeholders, e.g. Dem vs. Repub, US vs. Russia vs. China, etc., all fighting for eyeballs. It’s just gonna get harder to tell what’s real from fake because of the insane amount of content these bots are gonna churn out. It’s already a huge problem with human-monitored sources.

Natanael@slrpnk.net

2 points

7 months ago

Training on synthetic data is not a quality improvement, it’s just an edge case reducer for a small set of edge cases by decreasing “overfitting”, and it is only even able to achieve that if you’re very very careful with what you add and how. If you’re ONLY training on AI generated data repeatedly then it does start to degrade and lose coherence after a few generations of training.
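
To make the "degrades after a few generations" point concrete, here is a toy sketch. It is not how real language models are trained: it assumes the "model" is just a Gaussian fitted to a handful of samples, and that each new generation is fitted only to samples drawn from the previous one. The function name `simulate_collapse` and all of its parameters are made up for illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation after each generation,
    starting with the original "human" distribution (std = 1.0).
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    mean, std = 0.0, 1.0       # generation 0: the "real" data distribution
    history = [std]
    for _ in range(generations):
        # "Generate content" with the current model...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...then "train" the next model on that output alone.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 50, 100, 150, 200):
        print(f"generation {generation:3d}: std = {history[generation]:.6f}")
```

In this toy setup the fitted spread tends to shrink toward zero, because every refit slightly underestimates the variance and the sampling noise compounds from one generation to the next. The loop never sees fresh real data, which is exactly the ONLY-AI-data case described above.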

FaceDeer@kbin.social

2 points

7 months ago

Which is why nobody trains on ONLY AI generated data.

Really, experts have thought of this stuff already. Because they’re experts. Synthetic data means that the amount of “real” data required is much less, so giant repositories like Reddit aren’t so important.

Natanael@slrpnk.net

1 point

7 months ago

No, “much less” training data isn’t possible with synthetic data. That’s not what it’s there for. The experts would tell you as much if you asked them.

febra@lemmy.world

31 points

7 months ago

With the amount of bot-generated content on Reddit already, that data can’t be of much value.

e_mc2@feddit.nl

15 points

7 months ago

Redact. The free version replaces your posts and comments with gibberish.

Deceptichum@kbin.social

21 points

7 months ago

Reddit keeps your content, even if you delete or edit it. We saw this during the last protest where they reverted people’s comments back to their previous state.

The only exception is if you GDPR it.

insomniac_lemon@kbin.social

7 points

7 months ago

> We saw this during the last protest where they reverted people’s comments back to their previous state.

I remember that being a misunderstanding:

- As subs came back online, comments previously not visible came back too. In other words, comments on unavailable subs could not be deleted
- Rate limit on delete script

TruthAintEasy@kbin.social

3 points

7 months ago

Yea, but what about when they permaban someone? Like, if I wanted them to not use my data, could I just go raging at the mods non-stop until a site-wide ban happens?

Alto@kbin.social

8 points

7 months ago

They still have the data

Otter@lemmy.ca

2 points

7 months ago

But then the normal old content will still be there

Lucidlethargy@sh.itjust.works

1 point

7 months ago

It probably can’t hurt. Especially if you are litigious.

!reddit@lemmy.world

News and Discussions about Reddit

Welcome to !reddit. This is a community for all news and discussions about Reddit.

The rules for posting and commenting, besides the rules defined here for lemmy.world, are as follows:

Rules

Rule 1- No brigading.

**You may not encourage brigading any communities or subreddits in any way.**

YSKs are about self-improvement on how to do things.

Rule 2- No illegal or NSFW or gore content.

**No illegal or NSFW or gore content.**

Rule 3- Do not seek mental, medical and professional help here.

Do not seek mental, medical and professional help here. Breaking this rule will not get you or your post removed, but it will put you at risk, and possibly in danger.

Rule 4- No self promotion or upvote-farming of any kind.

That’s it.

Rule 5- No baiting or sealioning or promoting an agenda.

Posts and comments which, instead of being of an innocuous nature, are specifically intended (based on reports and in the opinion of our crack moderation team) to bait users into ideological wars on charged political topics will be removed and the authors warned - or banned - depending on severity.

Rule 6- Regarding META posts.

Provided it is about the community itself, you may post non-Reddit posts using the [META] tag on your post title.

Rule 7- You can't harass or disturb other members.

If you vocally harass or discriminate against any individual member, you will be removed.

Likewise, if you are a member, sympathiser or a resemblant of a movement that is known to largely hate, mock, discriminate against, and/or want to take lives of a group of people, and you were provably vocal about your hate, then you will be banned on sight.

Rule 8- All comments should try to stay relevant to their parent content.

Rule 9- Reposts from other platforms are not allowed.

Let everyone have their own content.

Rule 10- Majority of bots aren't allowed to participate here.

Community stats

3.7K monthly active users
715 posts
29K comments
