Open Source


All about open source! Feel free to ask questions and share news and interesting stuff!

Rules

Posts must be relevant to the open source ideology
No NSFW content
No hate speech, bigotry, etc

Community icon from opensource.org, but we are not affiliated with them.

founded 5 years ago

MODERATORS

Cloak@lemmy.ml

kevincox@lemmy.ml

CrypticCoffee@lemmy.ml

Lettuceeatlettuce@lemmy.ml

593

The Open-Source Software Saving the Internet From AI Bot Scrapers (www.404media.co)

submitted 3 days ago by fattyfoods to c/opensource@lemmy.ml


[–] medem@lemmy.wtf 23 points 2 days ago (7 children)

What advantage does this software provide over simply banning bots via robots.txt?

[–] thingsiplay@beehaw.org 14 points 2 days ago

The difference is:

robots.txt is a promise without a door
Anubis is a physical closed door that opens up after some time
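
To make the "promise without a door" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` user agent, and the URL are made up for illustration. It shows that the check only happens if the crawler chooses to run it; nothing on the server side enforces the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every crawler to stay away from everything.
DENY_ALL = """\
User-agent: *
Disallow: /
"""


def polite_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a well-behaved crawler would treat the URL as allowed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler asks first and backs off when the answer is False.
allowed = polite_may_fetch(DENY_ALL, "ExampleBot", "https://example.org/repo/commit/abc")
print(allowed)

# A crawler that ignores robots.txt simply never calls this function.
# The file itself cannot block the request; it is only a published request.
```

A challenge page, by contrast, sits in front of the request itself, so skipping it is not an option for the client.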

[–] Mwa@thelemmy.club 8 points 2 days ago

The problem is that AI doesn't follow robots.txt, so Cloudflare and Anubis developed a solution.


[–] Kazumara@discuss.tchncs.de 12 points 2 days ago (1 children)

Just recently there was a guy on the NANOG list ranting that Anubis is the wrong approach and that people should just cache properly; then their servers would handle thousands of users and the bots wouldn't matter. Anyone who puts git online has no one to blame but themselves, e-commerce should just be made cacheable, etc. Seemed a bit idealistic, a bit detached from the current reality.

Ah found it, here

[–] deadcade@lemmy.deadca.de 14 points 2 days ago (1 children)

Someone making an argument like that clearly does not understand the situation. Just 4 years ago, a robots.txt was enough to keep most bots away, and hosting personal git on the web required very few resources. With AI companies actively profiting off stealing everything, a robots.txt doesn't mean anything. Now, even a relatively small git web host takes an insane amount of resources. I'd know - I host a Forgejo instance. Caching doesn't matter, because diffs between two random commits are likely unique. Rate limiting doesn't matter, they will use different IP (ranges) and user agents. It would also heavily impact actual users "because the site is busy".
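
A rough back-of-the-envelope sketch of the caching point, assuming (hypothetically) that the web UI can render a diff for any pair of commits in a repository. The function name and the repository sizes are made up for illustration.

```python
from math import comb


def distinct_diff_pages(num_commits: int) -> int:
    """Count unordered commit pairs, i.e. the possible commit-to-commit diffs."""
    return comb(num_commits, 2)


for n in (100, 1_000, 10_000):
    print(f"{n} commits -> {distinct_diff_pages(n)} possible diffs")
# 100 commits -> 4950 possible diffs
# 1000 commits -> 499500 possible diffs
# 10000 commits -> 49995000 possible diffs
```

Under that assumption the number of distinct pages grows quadratically with history length, so a crawler that walks them rarely asks for the same page twice and a cache gets few hits.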

A proof-of-work solution like Anubis is the best we have currently. The least possible impact to end users, while keeping most (if not all) AI scrapers off the site.
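
For readers who haven't seen proof-of-work before, here is a generic hashcash-style sketch. It is not taken from Anubis's code; the challenge string, the `challenge:nonce` format, and the difficulty value are all assumptions made for illustration. The idea it shows: the client must search for a nonce, while the server checks the answer with a single hash.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one SHA-256 call, cheap no matter how hard the puzzle was."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force nonces until one passes (about 2**difficulty tries)."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce


# Hypothetical values: a per-visitor challenge and a low difficulty for the demo.
nonce = solve("demo-challenge", 12)
print(nonce, verify("demo-challenge", nonce, 12))
```

One visitor pays this cost once and barely notices; a scraper making millions of requests from fresh identities has to pay it every time, which is where the asymmetry comes from.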

[–] interdimensionalmeme@lemmy.ml 2 points 2 days ago (1 children)

This would not be a problem if one bot scraped once and the result was then mirrored to everyone on Big Tech's dime (Cloudflare, Tailscale), but since they are all competing now, they think their edge is going to be their own, better scraper setup, and they won't share.

Maybe there should just be a web-to-torrent bridge, so the data is pushed out once by the server and the swarm does the heavy lifting as a cache.

[–] deadcade@lemmy.deadca.de 2 points 2 days ago (1 children)

No, it'd still be a problem; every diff between commits is expensive to render to web, even if "only one company" is scraping it, "only one time". Many of these applications are designed for humans, not scrapers.


[–] inbeesee@lemmy.world 2 points 2 days ago

Fantastic article! Makes me less afraid to host a website with this potential solution

[–] RedSnt@feddit.dk 10 points 2 days ago

Brodie interviewed the creator of Anubis a little while back; it's pretty good.

[–] not_amm@lemmy.ml 7 points 2 days ago

I had seen that prompt, but never searched about it. I found it a little annoying, mostly because I didn't know what it was for, but now I won't mind. I hope more solutions are developed :D

[–] DrunkAnRoot@sh.itjust.works 2 points 2 days ago

It won't protect more than one subdomain, I think.
