Too bad you can't post a usage notice saying that anything scraped to train an AI will be billed at $some-huge-money, then pepper the site with bogus facts, occasionally ask the various AIs about those facts, and use the answers to prove scraping and invoice the AI company.
What are you hosting, and who are your users? Do you receive any legitimate traffic from AWS or other cloud providers' IP addresses? There will always be edge cases, like people hosting VPN exit nodes on a VPS, but if that's only a tiny portion of your legitimate traffic, I would consider blocking all incoming traffic from cloud providers and then whitelisting anything that makes sense, like search engine crawlers, if necessary.
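For AWS specifically, the published ip-ranges.json feed makes gathering the ranges mechanical. A rough sketch of that approach (IPv4 only; the nftables table and set names are made up for this example, and you should review the output and carve out exceptions before applying anything):

```python
# Rough sketch, not a drop-in firewall: fetch AWS's published IP ranges and print
# nftables commands that would drop inbound traffic from them. Table/set names
# are placeholders; review the output before piping it anywhere, and whitelist
# anything legitimate first.
import json
import urllib.request

AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

with urllib.request.urlopen(AWS_RANGES_URL) as resp:
    data = json.load(resp)

# IPv4 only here; "ipv6_prefixes" would need a second set of type ipv6_addr.
cidrs = sorted({p["ip_prefix"] for p in data["prefixes"]})

print("nft add table inet cloudblock")
# auto-merge because AWS publishes overlapping ranges (e.g. AMAZON supersets of EC2)
print("nft add set inet cloudblock aws4 '{ type ipv4_addr; flags interval; auto-merge; }'")
for cidr in cidrs:
    print(f"nft add element inet cloudblock aws4 '{{ {cidr} }}'")
print("nft add chain inet cloudblock input '{ type filter hook input priority 0; policy accept; }'")
print("nft add rule inet cloudblock input ip saddr @aws4 drop")
```

Most of the other big providers publish similar machine-readable range feeds, so the same pattern applies; the hard part is deciding which crawlers to whitelist afterwards.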
Build tar pits.
They want to reduce bandwidth usage, not increase it!
Bots will blacklist your IP if you make it hostile to them.
That will save you bandwidth.
Cool, lots of information provided!
I'm struggling to find it, but there's an “AI tarpit” or something like that which causes scrapers to get stuck. I'm sure I saw it posted on Lemmy recently; hopefully someone can link it.
I did find this GitHub link as the first search result, and it looks interesting. Thanks for introducing me to the term "tar pit".
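The core trick behind these tarpits is just serving an endless maze of links, very slowly. A toy sketch in standard-library Python (port and delays are arbitrary, and the real projects are far more sophisticated about generating garbage content and throttling):

```python
# Toy illustration of the tarpit idea, not the project being searched for: every
# page is a set of links to more machine-generated pages, served very slowly, so
# a crawler that ignores robots.txt wastes its time for almost no bandwidth on
# our side. Assumes Python 3.7+ and a free port 8080.
import random
import string
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def random_path() -> str:
    return "/" + "".join(random.choices(string.ascii_lowercase, k=12))

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>\n")
        for _ in range(10):
            time.sleep(2)  # drip-feed: ties the bot up while costing us ~nothing
            self.wfile.write(f'<a href="{random_path()}">next</a>\n'.encode())
        self.wfile.write(b"</body></html>\n")

    def log_message(self, *args):  # keep the console quiet
        pass

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), TarpitHandler).serve_forever()
```

Only point something like this at paths that are already disallowed in robots.txt, so well-behaved crawlers never wander into it.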
Try CrowdSec.
You can set it up with blocklists that are updated frequently, have it watch your Caddy proxy logs, and it can then easily block AI/bot-like traffic.
I have it blocking over 100k IPs at the moment.
Not gonna lie, the $3900/mo at the top of the /pricing page is pretty wild.
Searched "crowdsec docker" and they have docs and all that. Thank you very much, I've heard of crowdsec before, but never paid much attention, absolutely will check this out!
Might be worth patching fail2ban to recognize the scrapers and block them in iptables.
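The matching side of that is basically "known AI user agent appears in the access log". A standalone sketch of that logic (log path, log format, and agent names are assumptions; in a real setup you would express the match as a fail2ban failregex and let its ban action drive iptables/nftables):

```python
# Sketch of the matching logic a fail2ban filter would encode: scan a
# combined-format access log for known AI crawler user agents and print the
# offending IPs. Log path and agent list are assumptions for this example.
import re

LOG_PATH = "/var/log/nginx/access.log"                      # adjust to your setup
AI_AGENTS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")  # example names only

ip_re = re.compile(r"^(\S+)")          # first field: client IP
ua_re = re.compile(r'"([^"]*)"\s*$')   # last quoted field: user agent

offenders = set()
with open(LOG_PATH) as log:
    for line in log:
        ua_match = ua_re.search(line)
        ip_match = ip_re.match(line)
        if not ua_match or not ip_match:
            continue
        if any(agent in ua_match.group(1) for agent in AI_AGENTS):
            offenders.add(ip_match.group(1))

for ip in sorted(offenders):
    print(ip)  # feed these to your firewall, or use them to test a failregex
```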
It seems like any somewhat easy-to-implement solution gets circumvented by them quickly. Some of the bots do respect robots.txt, though, if you explicitly add their self-reported user agent (but they change it from time to time). This repo has a regularly updated list: https://github.com/ai-robots-txt/ai.robots.txt/
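The robots.txt side of that is tiny. A minimal sketch that generates one from a hand-picked set of self-reported agents (the names below are just examples; the linked repo maintains the full, regularly updated list and, last I checked, also ships a ready-made robots.txt):

```python
# Minimal sketch: write a robots.txt that disallows a few self-reported AI
# crawler user agents. Consecutive User-agent lines share the Disallow rule.
# The agent names here are examples only; see the ai.robots.txt repo for the
# maintained list.
AI_AGENTS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]

rules = [f"User-agent: {agent}" for agent in AI_AGENTS]
rules.append("Disallow: /")

with open("robots.txt", "w") as f:
    f.write("\n".join(rules) + "\n")
```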
In my experience, git forges are especially hit hard, and the only real solution I found is to put a login wall in front, which kinda sucks especially for open-source projects you want to self-host.
Oh, and recently the mlmym (old-Reddit-style) frontend for Lemmy seems to have started attracting AI scraping as well. We had to turn it off on our instance because of that.
> In my experience, git forges are especially hit hard
Is that why my Forgejo instance has been hit like crazy a couple of times before...
Why can't we have nice things? Thank you!
EDIT: Hopefully Photon doesn't get in their sights as well. Though after using the official Lemmy web UI for a while, I do really like it a lot.
Yeah, Forgejo and Gitea. I think insufficient caching on the side of these git forges is partly what makes it especially bad, but in the end that is victim blaming 🫠
Mlmym seems to be the target because it is mostly JavaScript-free and therefore easier to scrape, I think. But the other Lemmy frontends aren't well protected either. Lemmy-ui doesn't even let you easily add a custom robots.txt; you have to override it manually in the reverse proxy.
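For anyone stuck on that last point, a hypothetical Caddyfile sketch of the override (hostname, file path, and upstream are placeholders, and a full Lemmy deployment proxies more than just lemmy-ui):

```
# Hypothetical sketch: answer /robots.txt from a file on disk and proxy the rest
# to lemmy-ui. Hostname, paths, and upstream port are placeholders.
lemmy.example.org {
    handle /robots.txt {
        root * /srv/lemmy-overrides
        file_server
    }
    handle {
        reverse_proxy lemmy-ui:1234
    }
}
```

Nginx and other reverse proxies can do the same with a dedicated location block for /robots.txt.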