this post was submitted on 17 Aug 2025

694 points (99.7% liked)

Technology


Share interesting Technology news and links.

Rules:

No paywalled sites at all.
News articles must be recent: no older than 2 weeks (14 days).
No videos.
Post only direct links.

To encourage more original sources and keep this space as commercial-free as I can, the following websites are blacklisted:

Al Jazeera;
NBC;
CNBC;
Substack;
Tom's Hardware;
ZDNet;
TechSpot;
Ars Technica;
Vox Media outlets, with the exception of Axios;
Engadget;
TechCrunch;
Gizmodo;
Futurism;
PCWorld;
ComputerWorld;
Mashable;
Hackaday;
WCCFTECH;
Neowin.

More sites will be added to the blacklist as needed.

Encouraged:

Archive links in the body of the post.
Linking to the direct source, instead of linking to an article talking about the source.


Codeberg: an army of AI crawlers is slowing us down severely; AI crawlers have learned how to solve the Anubis challenges. (i.imgur.com)

submitted 2 days ago* (last edited 1 day ago) by Pro@programming.dev to c/Technology@programming.dev

111 comments

Comments

Lemmy;
Hackernews.

Source.


[–] Gullible@sh.itjust.works 105 points 2 days ago (2 children)

I really feel like scrapers should have been outlawed or actioned at some point.

[–] floofloof@lemmy.ca 84 points 2 days ago (1 children)

But they bring profits to tech billionaires. No action will be taken.

[–] BodilessGaze@sh.itjust.works 12 points 2 days ago (2 children)

No, the reason no action will be taken is that Huawei is a Chinese company. I work for a major US company that's dealing with the same problem, and the problematic scrapers are usually from China. US companies like OpenAI rarely cause serious problems because they know we can sue them if they do. There's nothing we can do legally about Chinese scrapers.

[–] mormund@feddit.org 4 points 1 day ago (1 children)

I thought Anthropic was also very abusive with their scraping?

[–] BodilessGaze@sh.itjust.works 1 points 23 hours ago

Maybe to others, but not to us. Or if they are, they're very good at masking their traffic.

[–] Flax_vert@feddit.uk 6 points 1 day ago (1 children)

Can you not just block China?

[–] BodilessGaze@sh.itjust.works 12 points 1 day ago* (last edited 1 day ago)

We do, somewhat. We haven't gone as far as a blanket ban of Chinese CIDR ranges because there are a lot of risks and bureaucracy associated with a move like that. But it probably makes sense for a small organization like Codeberg, since they have a higher risk tolerance and can move faster.
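For anyone wondering what a CIDR-range block looks like in practice, here is a minimal sketch using Python's standard `ipaddress` module. The ranges in `BLOCKED_RANGES` and the `is_blocked` helper are hypothetical placeholders (reserved documentation ranges), not a real country or provider list, and a production setup would more likely do this at the firewall or load balancer.

```python
import ipaddress

# Hypothetical blocklist for illustration only. These are reserved
# documentation ranges (RFC 5737), not real allocations of any country
# or cloud provider.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked CIDR range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_RANGES)
```

Calling `is_blocked("192.0.2.55")` matches the first range, while an address outside both ranges passes through. The hard part is not the check itself but maintaining an accurate list, which is where the risk mentioned above comes in.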

[–] programmer_belch@lemmy.dbzer0.com 41 points 2 days ago (8 children)

I use a tool that downloads a website to check for new chapters of series every day, then creates an RSS feed with the contents. Would this be considered a harmful scraper?

The problem with AI scrapers and bots is their scale: thousands of requests to webpages that the server cannot handle, which slows the site down for everyone.

[–] S7rauss@discuss.tchncs.de 31 points 2 days ago (2 children)

Does your tool respect the site’s robots.txt?

[–] who@feddit.org 18 points 2 days ago* (last edited 2 days ago) (2 children)

Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for things like GP describes. HTTP 429 would be a better fit.
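On the client side, honoring HTTP 429 mostly comes down to reading the optional `Retry-After` header, which may hold either a number of seconds or an HTTP date. A minimal sketch, assuming a hypothetical helper named `retry_after_seconds` and an arbitrary 60-second fallback when the header is missing or unparseable:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=60.0):
    """Convert a Retry-After header value into a number of seconds to wait.

    header_value is the raw header string, or None if the server sent none.
    default is an assumed fallback, not something the server specified.
    """
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "120".
        return float(value)
    try:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # Never return a negative wait if the date is already in the past.
    return max(0.0, (when - now).total_seconds())
```

A scraper would call this when it receives a 429 response and sleep for the returned duration before retrying, rather than hammering the server again immediately.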

[–] Redjard@lemmy.dbzer0.com 9 points 1 day ago (1 children)

Crawl-delay is just that, a simple directive to add to robots.txt to set the maximum crawl frequency. It used to be widely followed by all but the worst crawlers ...

[–] who@feddit.org 2 points 1 day ago* (last edited 1 day ago)

Crawl-delay

It's a nonstandard extension without consistent semantics or wide support, but I suppose it's good to know about anyway. Thanks for mentioning it.
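For a scraper that does want to honor it, Python's standard `urllib.robotparser` can read the directive. A minimal sketch: the robots.txt content, the `my-feed-bot` user agent, and the example.org URLs are all made up for illustration, and since the directive is nonstandard, how a given site intends it to be interpreted is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines; a real crawler would
# typically use set_url() and read() to fetch it from the site instead.
parser.parse(ROBOTS_TXT.splitlines())

# "my-feed-bot" is a hypothetical user agent name.
delay = parser.crawl_delay("my-feed-bot")
allowed = parser.can_fetch("my-feed-bot", "https://example.org/chapters/")
```

Here `delay` holds the value from the `Crawl-delay` line, or `None` if the site does not set one, so the crawler still needs its own default.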


[–] ulterno@programming.dev 8 points 2 days ago (3 children)

If the site slows down at times (whether or not that coincides with your scraping), you might want to not scrape at all.

Probably not a good idea to download the whole site, but then that depends upon the site.

If it is a static site, setting up your scraper to skip CSS/JS and images/videos should make a difference.
For a dynamically generated site, there's not much I can say.
Then again, if you cut your downloads down to only what you actually use, that might be good enough.
Since sites are made for human consumption, consider keeping your link traversal rate close to a human's.
The best option would be to ask the website dev whether they have an API available.
- Even better, ask them to provide an RSS feed.
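The point about keeping the traversal rate close to a human's can be sketched as a small throttle that enforces a minimum gap between requests. The `Throttle` class and its parameters are hypothetical, and the right interval depends entirely on the site; the injectable `clock` and `sleep` arguments are only there to make it testable.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until min_interval has passed since the last call.

        Returns the number of seconds actually slept.
        """
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
                now = self._clock()
        self._last = now
        return slept
```

A scraper would create one `Throttle` per site, for example `Throttle(5.0)`, and call `wait()` before every request.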


[–] gressen@lemmy.zip 82 points 2 days ago (3 children)

Write TOS that state that crawlers automatically accept a service fee and then send invoices to every crawler owner.

[–] BodilessGaze@sh.itjust.works 42 points 2 days ago (2 children)

Huawei is Chinese. There's literally zero chance a European company like Codeberg is going to successfully collect from a company in China over a TOS violation.

[–] wischi@programming.dev 15 points 1 day ago

It's not even a company. It's a non-profit "eingetragener Verein" (registered association). They have very limited resources, especially money, because they live purely on membership fees and donations.

[–] Lumisal@lemmy.world 6 points 1 day ago (1 children)

True, but it can help limit the European AI scrapers too

[–] BodilessGaze@sh.itjust.works 8 points 1 day ago* (last edited 1 day ago) (1 children)

I really doubt it. Lawsuits are expensive, and proving responsibility is difficult, since plausible deniability is easy. All scrapers need to do is use shared IPs (e.g. cloud providers), preferably owned by a company in a different legal jurisdiction. That could be the case here: a European company could be using Huawei Cloud to mask the source of their traffic.

[–] veniasilente@lemmy.dbzer0.com 5 points 1 day ago (1 children)

All scrapers need to do is use shared IPs (e.g. cloud providers),

Simple: just charge the cloud provider.

Once that gets strong enough they'll start placing terms against scraping in their TOS.

[–] wischi@programming.dev 5 points 1 day ago* (last edited 1 day ago)

And then they just throw it in the bin because there was never a contract between you and them. What do you do then? Sue Microsoft, Amazon and Google?

I'm sure Codeberg, a German non-profit Verein, has time and money to do that 🤣.

[–] wischi@programming.dev 38 points 2 days ago (1 children)

They typically don't include a billing address in the User Agent when crawling 🤣

[–] gressen@lemmy.zip 9 points 2 days ago (1 children)

That's a technicality. The billing address can be discovered for a nominal fee as well.

[–] wischi@programming.dev 7 points 1 day ago* (last edited 1 day ago)

I'm sure it can't, especially for foreign IP addresses, VPNs, and a ton of other situations. Even if you connect to the internet directly through your ISP, many countries in Europe (I don't know about the US) have laws that would require you to have very good reasons and a court order to get the info you need from the ISP, and that is for a single(!) case.

If it were possible to simply get the address of every digital visitor, we wouldn't have to develop all this anti-scraping tech and could just sue them.


[–] chicken@lemmy.dbzer0.com 34 points 1 day ago (1 children)

Seems like such a massive waste of bandwidth since it's the same work being repeated by many different actors to piece together the same dataset bit by bit.

[–] chuckleslord@lemmy.world 44 points 1 day ago

Ah Capitalism! Truly the king of efficiency /s

[–] cadekat@pawb.social 40 points 2 days ago

Huh, why does Anubis use SHA256? It's been optimized to all hell and back.

Ah, they're looking into it: https://github.com/TecharoHQ/anubis/issues/94

[–] cecilkorik@lemmy.ca 65 points 2 days ago (1 children)

Begun, the information wars have.

[–] steal_your_face@lemmy.ml 10 points 2 days ago

The wars have been fought and lost a while ago tbh

[–] ryanvade@lemmy.world 24 points 2 days ago

It's being investigated at least; hopefully a solution can be found. This will probably end up as a constantly escalating battle with the AI companies. https://github.com/TecharoHQ/anubis/issues/978

[–] LiveLM@lemmy.zip 26 points 2 days ago

Uuughhh I knew it'd always be a cat-and-mouse game. I sincerely hope the Anubis devs figure out how to fuck up the AI crawlers again.

[–] Kolanaki@pawb.social 18 points 2 days ago* (last edited 2 days ago) (2 children)

I don't understand how challenging an AI by asking it to do some heavy computational stuff even makes sense... A computer is literally made to do computations, and AI is just a computer. 🤨

Wouldn't it make more sense to challenge the AI with a Voight-Kampff test? Ask it about baseball.

[–] purplemonkeymad@programming.dev 45 points 2 days ago

The scrapers are not actually AI; they are just dumb scrapers there to collect as much textual information as possible.

If they have to solve Anubis challenges, it takes them longer to get the data they scrape. I suspect they are paid per page they provide, so more time per page means less money for them.

[–] BodilessGaze@sh.itjust.works 30 points 2 days ago

The point is to make scraping expensive enough that it isn't worth the trouble. The only reason AI scrapers are trying to get this data is that it's cheaper than the alternatives (e.g. generating synthetic data). Once it stops being cheaper, the smart scrapers will stop. The dumb scrapers don't matter because they don't have the talent to devise these kinds of workarounds.
