237 points

Put something in robots.txt that isn’t supposed to be hit and is hard to hit by non-robots. Log and ban all IPs that hit it.

Imperfect, but can’t think of a better solution.
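A minimal robots.txt sketch of the idea (the trap path is hypothetical):

```
User-agent: *
Disallow: /here-be-dragons/
```

Nothing on the site links to /here-be-dragons/, so only a client that read robots.txt and ignored it (or one brute-forcing paths) will ever request it.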

126 points

Good old honeytrap. I’m not sure, but I think that it’s doable.

Have a honeytrap page somewhere in your website. Make sure that legit users won’t access it. Disallow crawling the honeytrap page through robots.txt.

Then if some crawler still accesses it, you could record+ban it as you said… or you could be even nastier and let it do so. Fill the honeytrap page with poison - nonsensical text that would look like something that humans would write.

58 points

I think I used to do something similar with email spam traps. Not sure if it’s still around, but basically you could help build NaCL lists by posting an email address on your website somewhere that was visible in the source code but not visible to normal users, like in a div pushed way off the left side of the screen.

Anyway, spammers that do regular expression searches for email addresses would email it and get their IPs added to naughty lists.

I’d love to see something similar with robots.
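The hidden-address trick might look roughly like this (hypothetical address; the off-screen positioning keeps it invisible to humans while scrapers grepping the source still find it):

```html
<!-- Visible to source-scraping spambots, invisible to normal visitors -->
<div style="position: absolute; left: -9999px;" aria-hidden="true">
  Contact: spamtrap@example.com
</div>
```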

32 points

Yup, it’s the same approach as email spam traps. I hadn’t considered the naughty-list part, but… holy fuck, a shareable bot IP list would be an amazing addition; it would increase the damage to those web-crawling businesses.

11 points

Even better. Build a WordPress plugin to do this.

9 points

I’m the idiot human that digs through robots.txt and the site map to see things that aren’t normally accessible by an end user.

5 points
Deleted by creator
6 points

For banning: I’m not sure, but I don’t think so. It seems to me that prefetching behaviour is dictated by the page linking to another, so to avoid any issue all the site owner needs to do is not mark links to the honeytrap for prefetching.

For poisoning: I’m fairly certain that it doesn’t. At most you’d prefetch a page full of rubbish.

21 points

“Help, my website no longer shows up in Google!”

16 points

Yeah, this is a pretty classic honeypot method. Basically make something available but inaccessible to the normal user. Then you know anyone who accesses it is not a normal user.

I’ve even seen this done with Steam achievements before; There was a hidden game achievement which was only available via hacking. So anyone who used hacks immediately outed themselves with a rare achievement that was visible on their profile.

13 points

That’s a bit annoying as it means you can’t 100% the game as there will always be one achievement you can’t get.

3 points

perhaps not every game is meant to be 100% completed

4 points

There are tools that just flag you as having gotten an achievement on Steam, you don’t even have to have the game open to do it. I’d hardly call that ‘hacking’.

6 points

Better yet, point the crawler to a massive text file of almost but not quite grammatically correct garbage to poison the model. Something it will recognize as language and internalize, but that will severely degrade the quality of its output.
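As a toy sketch of that kind of poison text, here is a generator of grammatical-looking nonsense. The word lists are made up for illustration; a real poisoning attempt would need far more variety to avoid being trivially filtered.

```python
import random

# Hypothetical word pools; a real generator would use a much larger vocabulary
NOUNS = ["teapot", "algorithm", "meadow", "ledger", "umbrella"]
VERBS = ["devours", "alphabetizes", "marinates", "refutes", "serenades"]
ADJS = ["purple", "recursive", "damp", "inevitable", "crunchy"]

def poison_sentence(rng: random.Random) -> str:
    """One grammatically plausible but meaningless sentence."""
    return (f"The {rng.choice(ADJS)} {rng.choice(NOUNS)} "
            f"{rng.choice(VERBS)} the {rng.choice(ADJS)} {rng.choice(NOUNS)}.")

def poison_page(n_sentences: int = 200, seed: int = 0) -> str:
    """A page's worth of nonsense; seeded so the output is reproducible."""
    rng = random.Random(seed)
    return " ".join(poison_sentence(rng) for _ in range(n_sentences))
```

Serve the output of `poison_page()` at the honeytrap URL and every crawl yields another helping of word salad.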

3 points

Maybe one of the lorem ipsum generators could help.

4 points

a bad-bot .htaccess trap.
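For Apache 2.4, such a trap might look roughly like this (hypothetical trap path, disallowed in robots.txt and linked nowhere visible; anything requesting it gets a 403):

```apache
# Any client requesting the trap path is treated as a bad bot
SetEnvIf Request_URI "^/trap/" bad_bot
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
```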

-36 points

robots.txt is purely textual; you can’t run JavaScript or log anything from it. Plus, anyone who doesn’t intend to follow robots.txt won’t query it in the first place.

55 points

If it doesn’t get queried that’s the fault of the webscraper. You don’t need JS built into the robots.txt file either. Just add some line like:

Disallow: /here-there-be-dragons.html

Any client that hits that page (and maybe doesn’t pass a captcha check) gets banned. Or even better, they get a long stream of nonsense.
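The ban half could be a cron job that scans the access log for the trap URL, something like this sketch (assumes Common Log Format; the actual firewall command is left as a comment):

```python
import re

TRAP_PATH = "/here-there-be-dragons.html"  # the path baited in robots.txt
# Common Log Format: client IP is the first field, the request line is quoted
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|POST|HEAD) (\S+)')

def trap_hits(log_lines):
    """Yield the client IP of every request that touched the trap URL."""
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m and m.group(2) == TRAP_PATH:
            yield m.group(1)

# Feed the result to your firewall, e.g. `iptables -A INPUT -s <ip> -j DROP`
```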

24 points

server {
    server_name herebedragons.example.com;
    root /dev/random;
}

9 points

I actually love the data-poisoning approach. I think that sort of strategy is going to be an unfortunately necessary part of the future of the web.

16 points

Your second point is a good one, but you absolutely can log the IP that requested robots.txt. That’s just a standard part of any HTTP server ever; no JavaScript needed.

11 points

You’d probably have to go out of your way to avoid logging this. I’ve always seen such logs enabled by default when setting up web servers.

12 points

People not intending to follow it is the real reason not to bother, but it’s trivial to track who downloaded the file and then hit something it asked them not to.

Like, 10 minutes work to do right. You don’t need js to do it at all.
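A sketch of that tracking (hypothetical disallowed path, rough Common-Log-Format parsing): collect who fetched robots.txt, collect who hit a disallowed path, and intersect.

```python
import re

ROBOTS_PATH = "/robots.txt"
DISALLOWED = {"/secret-trap/"}  # hypothetical paths your robots.txt disallows
LOG_LINE = re.compile(r'^(\S+) .*?"(?:GET|HEAD) (\S+)')

def violators(log_lines):
    """IPs that fetched robots.txt and then hit a disallowed path anyway."""
    fetched, hit = set(), set()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, path = m.groups()
        if path == ROBOTS_PATH:
            fetched.add(ip)
        elif path in DISALLOWED:
            hit.add(ip)
    return fetched & hit
```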

141 points

As unscrupulous AI companies crawl for more and more data, the basic social contract of the web is falling apart.

Honestly, it seems like the social contract is being ignored in all aspects of society these days; that’s why things seem so much worse now.

25 points

It’s abuse, plain and simple.

15 points

Governments could do something about it, if they weren’t overwhelmed by bullshit from bullshit generators and led by people driven by their personal wealth.

-2 points

these days

When, at any point in history, have people acknowledged that there was no social change or disruption and everyone was happy?

126 points

Well, the Trump era has shown that ignoring social contracts and straight-up crime are met only with profit and slavish devotion from a huge community of dipshits. So. Y’know.

6 points

Only if you’re already rich or in the right social circles though. Everyone else gets fined/jail time of course.

0 points

Meh maybe. I know plenty of people who get away with all kinds of crap without money or connections.

98 points

The open and free web is long dead.

Just thinking of robots.txt as a working defense against companies that literally broker people’s entire digital lives for hundreds of billions of dollars is so … quaint.

27 points

It’s up there with Do-Not-Track.

Completely pointless, because it’s not enforced.

3 points

Do-Not-Track, AKA, “I’ve made my browser fingerprint more unique for you, please sell my data”

1 point

I bet at least one site I’ve visited in my lifetime has enforced it

6 points

you’re jaded. me, too. but you’re jaded.

4 points

i prefer “antiqued” but yes

91 points

I would be shocked if any big corpo actually gave a shit about it, AI or no AI.

if exists("/robots.txt"):
    no it fucking doesn't
50 points

Robots.txt is in theory meant to be there so that web crawlers don’t waste their time traversing a website in an inefficient way. It’s there to help, not hinder them. There is a social contract being broken here and in the long term it will have a negative impact on the web.
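That helpful intent is visible in the directives themselves; a cooperative robots.txt looks something like this (example values; Crawl-delay is non-standard but widely honored):

```
User-agent: *
Disallow: /search
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
```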

1 point
Deleted by creator
2 points

Yeah, I always found it surprising that everyone just agreed to follow a text file on a website telling them how to act. It’s one of the most poorly thought-out yet significant parts of browsing, and it’s been with us pretty much since the beginning.
