this post was submitted on 17 Aug 2025
695 points (99.7% liked)

Comments

[–] Gullible@sh.itjust.works 105 points 2 days ago (2 children)

I really feel like scrapers should have been outlawed or acted against at some point.

[–] floofloof@lemmy.ca 84 points 2 days ago (1 children)

But they bring profits to tech billionaires. No action will be taken.

[–] BodilessGaze@sh.itjust.works 12 points 2 days ago (2 children)

No, the reason no action will be taken is because Huawei is a Chinese company. I work for a major US company that's dealing with the same problem, and the problematic scrapers are usually from China. US companies like OpenAI rarely cause serious problems because they know we can sue them if they do. There's nothing we can do legally about Chinese scrapers.

[–] mormund@feddit.org 4 points 1 day ago (1 children)

I thought Anthropic was also very abusive with their scraping?

[–] BodilessGaze@sh.itjust.works 1 point 1 day ago

Maybe to others, but not to us. Or if they are, they're very good at masking their traffic.

[–] Flax_vert@feddit.uk 6 points 2 days ago (1 children)

Can you not just block China?

[–] BodilessGaze@sh.itjust.works 12 points 2 days ago* (last edited 1 day ago)

We do, somewhat. We haven't gone as far as a blanket ban of Chinese CIDR ranges because there are a lot of risks and bureaucracy associated with a move like that. But it probably makes sense for a small company like Codeberg, since they have a higher risk tolerance and can move faster.

[–] programmer_belch@lemmy.dbzer0.com 41 points 2 days ago (5 children)

I use a tool that downloads a website every day to check for new chapters of the series I follow, then creates an RSS feed from the contents. Would this be considered a harmful scraper?

The problem with AI scrapers and bots is their scale: thousands of requests that the server behind the webpages cannot handle, resulting in slow traffic.

[–] S7rauss@discuss.tchncs.de 31 points 2 days ago (2 children)

Does your tool respect the site’s robots.txt?

[–] who@feddit.org 18 points 2 days ago* (last edited 2 days ago) (2 children)

Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for things like GP describes. HTTP 429 would be a better fit.
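
To illustrate, respecting 429 on the client side doesn't take much. A rough Python sketch using the `requests` library (the user agent string is just a placeholder):

```python
import time
import requests

def polite_get(url, max_retries=3):
    """GET a URL, backing off whenever the server answers 429 Too Many Requests."""
    resp = None
    for attempt in range(max_retries):
        resp = requests.get(url, headers={"User-Agent": "my-rss-checker"})
        if resp.status_code != 429:
            break
        # Respect Retry-After when given (assuming seconds; it can also be an HTTP date),
        # otherwise fall back to a simple exponential backoff.
        retry_after = resp.headers.get("Retry-After")
        time.sleep(int(retry_after) if retry_after and retry_after.isdigit() else 10 * 2 ** attempt)
    return resp
```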

[–] Redjard@lemmy.dbzer0.com 9 points 1 day ago (1 children)

Crawl-delay is just that: a simple directive you add to robots.txt that sets the minimum delay between requests, which caps the crawl frequency. It used to be widely followed by all but the worst crawlers.
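
If you want to respect it from Python, the standard library's robots.txt parser already reads it. A quick sketch (the site and user agent are placeholders):

```python
from urllib.robotparser import RobotFileParser

AGENT = "my-rss-checker"  # placeholder user agent

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

page = "https://example.com/series/one"
if rp.can_fetch(AGENT, page):
    delay = rp.crawl_delay(AGENT)  # seconds from Crawl-delay, or None if the directive is absent
    print(f"Allowed to fetch {page}; wait {delay or 1} seconds between requests")
else:
    print(f"robots.txt disallows fetching {page}")
```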

[–] who@feddit.org 2 points 1 day ago* (last edited 1 day ago)

Crawl-delay

It's a nonstandard extension without consistent semantics or wide support, but I suppose it's good to know about anyway. Thanks for mentioning it.

[–] S7rauss@discuss.tchncs.de 4 points 2 days ago

I was responding to their question about whether scraping the site is considered harmful. I'd say that as long as they aren't ignoring robots.txt, they shouldn't be contributing a significant amount of traffic if they're really only pulling data once a day.

Yes, it just downloads the HTML of one page and formats the data into the RSS format with only the information I need.

[–] ulterno@programming.dev 8 points 2 days ago (1 children)

If the site is getting slowed down at times (regardless of whether that's when you scrape), you might want to not scrape at all.

Probably not a good idea to download the whole site, but then that depends upon the site.

  • If it is a static site, setting up your scraper to skip CSS/JS, images, and videos should make a difference.
  • For a dynamically generated site, there's nothing I can say.
  • Then again, reducing what you download to only what you actually use, as far as possible, might be good enough.
  • Since sites are made for human consumption in the first place, consider keeping your link traversal rate similar to a human's.
  • The best would be to ask the website dev whether they have an API available.
    • Even better, ask them to provide an RSS feed.
[–] programmer_belch@lemmy.dbzer0.com 3 points 2 days ago (2 children)

As far as I know, the website doesn't have an API, so I just download the HTML and format the result with a simple Python script. It makes around 10 to 20 requests each time, one for each series I'm following.
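
The idea is roughly something like this (the URLs and the chapter-matching pattern are made-up placeholders; the real ones depend on the site's markup):

```python
import re
import time
import requests
from xml.sax.saxutils import escape

# Placeholder list standing in for the 10-20 series pages being followed.
SERIES_URLS = [
    "https://example.com/series/one",
    "https://example.com/series/two",
]

def latest_chapter(html):
    """Hypothetical extraction: grab the first thing that looks like a chapter heading."""
    match = re.search(r"Chapter\s+\d+[^<]*", html)
    return match.group(0) if match else "unknown"

items = []
for url in SERIES_URLS:
    resp = requests.get(url, headers={"User-Agent": "my-rss-checker"})
    resp.raise_for_status()
    items.append((url, latest_chapter(resp.text)))
    time.sleep(5)  # small pause so the handful of requests don't all land at once

# Write a minimal RSS 2.0 document with one <item> per series.
item_xml = "\n".join(
    f"    <item><title>{escape(title)}</title><link>{escape(url)}</link></item>"
    for url, title in items
)
feed = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<rss version="2.0">\n  <channel>\n'
    "    <title>Chapter updates</title>\n"
    "    <link>https://example.com/</link>\n"
    "    <description>Newest chapter seen on each followed series</description>\n"
    f"{item_xml}\n"
    "  </channel>\n</rss>\n"
)

with open("chapters.xml", "w", encoding="utf-8") as f:
    f.write(feed)
```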

[–] limerod@reddthat.com 2 points 1 day ago

You can use the caching features in curl/wget (timestamping, or conditional requests based on ETag/Last-Modified) so they don't download the same CSS or HTML twice. You can also skip JavaScript and image files to save unnecessary requests.

I would reduce the frequency to once every two days to further reduce the impact.
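
The same trick works from a Python script with conditional requests, assuming the server sends ETag or Last-Modified headers. A rough sketch (the cache file name is arbitrary):

```python
import json
import os
import requests

CACHE_FILE = "validators.json"  # arbitrary location for stored ETag/Last-Modified values

def fetch_if_changed(url):
    """Return the page body, or None if the server says it hasn't changed (HTTP 304)."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)

    headers = {"User-Agent": "my-rss-checker"}
    entry = cache.get(url, {})
    if "etag" in entry:
        headers["If-None-Match"] = entry["etag"]
    if "last_modified" in entry:
        headers["If-Modified-Since"] = entry["last_modified"]

    resp = requests.get(url, headers=headers)
    if resp.status_code == 304:
        return None  # unchanged since last run, nothing was re-downloaded

    # Remember the validators the server gave us for next time, if any.
    new_entry = {}
    if resp.headers.get("ETag"):
        new_entry["etag"] = resp.headers["ETag"]
    if resp.headers.get("Last-Modified"):
        new_entry["last_modified"] = resp.headers["Last-Modified"]
    cache[url] = new_entry
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)

    return resp.text
```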

[–] ulterno@programming.dev 1 points 1 day ago

That might/might not be much.
Depends upon the site, I'd say.

e.g. if it's something like Netflix, I wouldn't worry much, because they have the means to serve the requests.
But for some PeerTube instances, even a single request seems to be too heavy. So if a server doesn't respond to my request, I usually wait an hour or so before refreshing the page.

[–] Gullible@sh.itjust.works 6 points 2 days ago* (last edited 2 days ago) (1 children)

Seems like an API request would be preferable for the site you're checking. I don't imagine they're unhappy with the traffic if they haven't blocked it yet.

[–] JPAKx4@lemmy.blahaj.zone 2 points 2 days ago

I mean, if it's a CMS site there may not be an API; scraping would be the only solution in that case.

[–] Flax_vert@feddit.uk 4 points 2 days ago

The problem is that these are constant hordes of bots running out of datacentres. You have one tool; sending a few requests from your device wouldn't even dent a Raspberry Pi, never mind a beefier server.

I think the intention behind the traffic is also important. Your tool exists so you can consume content the website freely provides. Their tools exist so they can profit off the work on the website.