2 points

Fail2ban should add all those scraper IPs, and we need to just flat-out block them. Or send them to those mazes. Or redirect them to themselves lol
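For anyone wondering what "add all those scraper IPs" looks like in practice, here is a minimal sketch of the counting idea behind tools like Fail2ban. It is not Fail2ban's own configuration or API: the function name `ips_to_ban`, the `max_requests` threshold, and the assumption that each log line starts with the client IP are all made up for illustration.

```python
from collections import Counter


def ips_to_ban(log_lines, max_requests=100):
    """Return the set of client IPs that appear more than max_requests times.

    Assumption: every log line begins with the client IP followed by a space,
    as in the common and combined access log formats.
    """
    hits = Counter(
        line.split(" ", 1)[0]      # first whitespace-delimited field = client IP
        for line in log_lines
        if line.strip()            # skip empty lines
    )
    return {ip for ip, count in hits.items() if count > max_requests}
```

A real setup would also limit the count to a time window and feed the result into a firewall rule, which is the part Fail2ban handles for you.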

114 points

Really great piece. Many popular Lemmy instances have struggled under recent scraping waves, and that is hardly the first time it’s happened. I have some firsthand experience with the second part of this article, which talks about AI-generated bug reports/vulnerabilities for open source projects.

I help maintain a Python library and, a couple of weeks back, got a bug report from a user describing a type-checking issue, along with a bit of additional information. It didn’t strictly follow the bug report template we use, but it was well organized enough, so I spent some time digging into it and found no way to reproduce it at all. Thankfully, the lead maintainer was able to spot the report for what it was and just closed it, saving me from further efforts to diagnose the issue (after an hour or two had already been burned).

34 points

AI scrapers are a massive issue for Lemmy instances. I’m gonna try some things in this article because there are enough of them identifying themselves with user agents that I didn’t even think of the ones lying about it.
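A rough sketch of the user-agent approach, for the bots that do identify themselves. The token list below is illustrative only and certainly incomplete, and the function name `is_self_identified_bot` is made up; check your own access logs before relying on any of it.

```python
# Illustrative tokens only; crawlers change and add user agents over time.
SELF_IDENTIFIED_BOTS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")


def is_self_identified_bot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a known crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in SELF_IDENTIFIED_BOTS)
```

This does nothing against scrapers that send a browser-like user agent, which is exactly the "ones lying about it" case.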

I guess a bonus (?) is that with 1000 Lemmy instances, the bots get the Lemmy content 1000 times, so our input has 1000 times the weighting of Reddit.

29 points

Any idea what the point of these is then? Sounds like it’s reporting a fake bug.

80 points

The theory that the lead maintainer had (he is an actual software developer, I just dabble) is that it might be a type of reinforcement learning:

  • Get your LLM to create what it thinks are valid bug reports/issues
  • Monitor the outcome of those issues (closed immediately, discussion, eventual pull request)
  • Use those outcomes to assign how “good” or “bad” that generated issue was
  • Use that scoring as a way to feed back into the model to influence it to create more “good” issues

If this is what’s happening, then it’s essentially offloading your LLM’s reinforcement learning scoring to open source maintainers.
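To make the scoring step concrete, here is a toy sketch of how issue outcomes could be turned into a reward signal. Everything in it is hypothetical: the outcome labels, the reward values, and the function names are invented for illustration, and nothing here is known to match what any company actually does.

```python
# Hypothetical reward values; the theory only says outcomes get rated "good" or "bad".
OUTCOME_REWARD = {
    "closed_immediately": -1.0,   # maintainers rejected the issue outright
    "discussion": 0.5,            # maintainers engaged with it
    "pull_request": 1.0,          # the issue eventually led to a fix
}


def score_issues(outcomes):
    """Map each generated issue's observed outcome to a scalar reward.

    Outcomes that are not in the table get a neutral reward of 0.0.
    """
    return [OUTCOME_REWARD.get(outcome, 0.0) for outcome in outcomes]


def mean_reward(outcomes):
    """Average reward over a batch of issues, or 0.0 for an empty batch."""
    rewards = score_issues(outcomes)
    return sum(rewards) / len(rewards) if rewards else 0.0
```

The point of the sketch is where the labels come from: every entry in `outcomes` is a maintainer's unpaid time.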

41 points

That’s wild. I don’t have much hope for LLMs if this is how they are doing things, and I would not be surprised, given how well they don’t work. Too much quantity over quality in training.

-2 points

Honestly, I would be alright with this if the AI companies paid GitHub so that the server infrastructure could be upgraded. Having AI that can figure out bugs and error reports could be really useful for our society. For example, your computer rebooting for no apparent reason? The AI can check the diagnostic reports, combine them with online reports, and narrow down the possibilities.

In the long run, this could also help maintainers. If they can have AI for testing programs, the maintainers won’t have to hope for volunteers or rely on paid QA for detecting issues.

What GitHub & AI companies should do is set up an opt-in program for maintainers. If they allow the AI to officially make reports, GitHub should offer a reward of some kind to their users. Allocate to each maintainer a number of credits so that they can discuss the report with the AI in real time, plus $10 for each hour spent on resolving the issue.

Sadly, I have the feeling that malignant capitalism would demand that maintainers sacrifice their time for nothing but irritation.

6 points

Testing out a theory with ChatGPT: there might be a way, albeit clunky, to detect AI. I asked ChatGPT a simple math question, told it to disregard the rest of the message, then asked it if it was AI. It answered the math question and told me it was AI. Now, a bot probably won’t admit to being AI, but it might be foolish enough to consider instructions that you explicitly told it not to follow.

Or you might simply be able to waste its resources by asking it to do something computationally difficult that most people would just reject outright.

Of course all of this could just result in making AI even harder to detect once it learns these tricks. 😬

3 points

These aren’t actual LLMs scraping the web; they’re your usual scraping bots used on an industrial scale, disregarding conventions about what they should or shouldn’t scrape.

63 points

Yep, it hit many Lemmy servers as well, including mine. I had to block multiple Alibaba subnets to get things back to normal. But I’m expecting the next spam wave.
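A minimal sketch of what a subnet check looks like, using Python's standard `ipaddress` module. The ranges below are placeholders from the reserved documentation blocks, not real provider subnets, and `BLOCKED_SUBNETS` and `is_blocked` are names made up for this example.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), NOT real provider subnets.
BLOCKED_SUBNETS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked subnet."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_SUBNETS)
```

In practice this kind of block usually lives in the firewall or reverse proxy rather than in application code, so the requests never reach the instance at all.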

63 points

I wish these companies would realise that acting like this is a very fast way to get scraping outlawed altogether, which is a shame because it can be genuinely useful (archival, automation, etc).

54 points

How can you outlaw something a company on another continent is doing? Especially when they are becoming better at disguising themselves as normal traffic? What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.

14 points

What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.

You’re right. Which is exactly why companies should be exhibiting better behaviour and self-regulating before they make the internet infinitely worse off for everyone.

29 points

self regulation is a joke. a few bad apples always spoil the bunch.

what needs to happen is regulation, period. force all companies to abide by laws that just make sense, and all these problems go away.

see: GDPR

5 points

according to history, this sadly never works

3 points

Exactly, we’ve already seen this in the past. GDPR is a good example. Whilst I’m glad this regulation exists, it wouldn’t be necessary if megacorps had behaved.

-4 points

What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.

Yes, because like it or not, that’s the only possible solution. If all traffic was required to be signed and the signatures were tied to an entity, then you could refuse unsigned traffic, and if signed traffic was causing problems you’d know who it was and have recourse.

I don’t like this solution but it’s the only way forward that I can see.

7 points

How do you have more recourse countering a random third world IP vs a random third world person when both are outside your jurisdiction?

5 points

Is it? Someone mentioned proof of work being effective for Tor.
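For context, here is a minimal sketch of a generic hashcash-style proof of work: the client has to find a nonce whose hash has a given number of leading zero bits, while the server checks it with a single hash. This is not Tor's actual mechanism, and the names `solve`, `verify`, and `difficulty_bits` are assumptions for illustration.

```python
import hashlib
from itertools import count


def _hash_ok(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Check that SHA-256(challenge:nonce) has difficulty_bits leading zero bits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    # Shifting away the low bits leaves only the top difficulty_bits bits.
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force the smallest nonce that satisfies the target."""
    for nonce in count():
        if _hash_ok(challenge, nonce, difficulty_bits):
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    return _hash_ok(challenge, nonce, difficulty_bits)
```

The asymmetry is the whole idea: solving costs about 2^difficulty_bits hashes on average, while verifying costs one, so a single visitor barely notices but a scraper making millions of requests pays for each of them.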

53 points

The Linux Mint forums have been knocked offline multiple times over the last few months, to the point where the admins had to block all Chinese and Brazilian IPs for a while.

7 points

This is the first I’ve heard about Brazil in this type of cyber attack. Is it re-routed traffic going there or are there a large number of Brazilian bot farms now?

10 points

I don’t know why/how, just know that the admins saw the servers were being overwhelmed by traffic from Brazilian IPs and blocked it for a while.

