this post was submitted on 17 Aug 2025

695 points (99.7% liked)

Technology

Share interesting Technology news and links.

Rules:

No paywalled sites at all.
News articles have to be recent, not older than 2 weeks (14 days).
No videos.
Post only direct links.

To encourage more original sources and keep this space as commercial-free as I can, the following websites are blacklisted:

Al Jazeera;
NBC;
CNBC;
Substack;
Tom's Hardware;
ZDNet;
TechSpot;
Ars Technica;
Vox Media outlets, with exception for Axios;
Engadget;
TechCrunch;
Gizmodo;
Futurism;
PCWorld;
ComputerWorld;
Mashable;
Hackaday;
WCCFTECH;
Neowin.

More sites will be added to the blacklist as needed.

Encouraged:

Archive links in the body of the post.
Linking to the direct source, instead of linking to an article talking about the source.

founded 3 months ago

MODERATORS

Pro@programming.dev

Codeberg: an army of AI crawlers is severely slowing us down; AI crawlers have learned how to solve the Anubis challenges. (i.imgur.com)

submitted 2 days ago* (last edited 1 day ago) by Pro@programming.dev to c/Technology@programming.dev

Comments

Lemmy;
Hackernews.

Source.

[–] sp3ctr4l@lemmy.dbzer0.com 25 points 1 day ago (1 children)

Do we all want the fucking Blackwall from Cyberpunk 2077?

Fucking NetWatch?

Because this is how we end up with them.

....excuse me, I need to go buy a digital pack of cigarettes for the angry voice in my head.

[–] somerandomperson@lemmy.dbzer0.com 6 points 1 day ago (5 children)

Consider nicotine+

[–] Electricd@lemmybefree.net 1 points 16 hours ago

Do they really hit that much? I might not have a popular opinion there, but if they don't have a performance impact then I probably wouldn't care

[–] folken@lemmy.world 40 points 1 day ago* (last edited 1 day ago) (2 children)

When you realize that you live in a cyberpunk novel. The AI is cracking the ICE. https://cyberpunk.fandom.com/wiki/Black_ICE

[–] Regrettable_incident@lemmy.world 14 points 1 day ago (1 children)

I love seeing how much influence William Gibson had on cyberpunk.

[–] ThePyroPython@lemmy.world 16 points 1 day ago

It's not intentional but the chap ended up writing works that defined both the Cyberpunk (Neuromancer) and Steampunk (The Difference Engine) genres.

Can't deny that influence.

[–] Blackmist@feddit.uk 26 points 1 day ago (3 children)

Business idea: AWS, but hosted entirely within the computing power of AI web crawlers.

[–] Kissaki@feddit.org 10 points 1 day ago (5 children)

Reminds me of the "store data inside slow network requests for the in-transit duration". It was a fun article to read.

[–] excral@feddit.org 4 points 1 day ago

I like the idea but couldn't you just go the more direct route and mine crypto?

[–] tal@lemmy.today 20 points 1 day ago (1 children)

If someone just wants to download code from Codeberg for training, it seems like it'd be way more efficient to just clone the git repositories or even just download tarballs of the most-recent releases for software hosted on Codeberg than to even touch the Web UI at all.

I mean, maybe you need the Web UI to get a list of git repos, but I'd think that that'd be about it.
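To make the comparison concrete, here is a minimal sketch of the "just clone it" route: one shallow `git clone` per repository instead of walking the Web UI page by page. The helper names and the example URL are hypothetical, and how a list of repository URLs would be obtained is left out entirely.

```python
import subprocess


def build_clone_command(repo_url: str, dest: str, depth: int = 1) -> list[str]:
    """Build a shallow `git clone` command for a single repository."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # `--depth` limits how much history is fetched; `--` ends option parsing.
    return ["git", "clone", "--depth", str(depth), "--", repo_url, dest]


def clone(repo_url: str, dest: str) -> None:
    """Run the clone; raises CalledProcessError if git exits non-zero."""
    subprocess.run(build_clone_command(repo_url, dest), check=True)


if __name__ == "__main__":
    # Hypothetical repository URL, for illustration only.
    print(build_clone_command("https://codeberg.org/example/example.git", "example"))
```

A single clone like this transfers the repository once, rather than requesting every file view, blame page, and diff separately through the web front end.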

[–] witten@lemmy.world 27 points 1 day ago (1 children)

Then they'd have to bother understanding the content and downloading it as appropriate. And you'd think if anyone could understand and parse websites in real time to make download decisions, it'd be giant AI companies. But ironically they're only interested in hoovering up everything as plain web pages to feed into their raw training data.

[–] Natanael@infosec.pub 17 points 1 day ago

The same morons scrape Wikipedia instead of downloading the archive files, which can trivially be rendered as web pages locally.

[–] 0_o7@lemmy.dbzer0.com 35 points 1 day ago (3 children)

I blocked almost all the big players in hosting, plus China, Russia, and Vietnam, and now they're bombarding my site with residential IP addresses from all over the world. They must be using compromised smart home devices or phones with malware.

Soon everything on the internet will be behind a wall.

[–] ILikeTraaaains@lemmy.world 2 points 13 hours ago

Not necessarily compromised. I saw a VPN provider (don’t remember the name) that offered a free tier where the client agrees to be used for this.

And I suspect that in the future some VPN companies will be exposed doing the same but with their paid customers.

[–] irelephant@programming.dev 12 points 1 day ago (1 children)

This isn't sustainable for the AI companies. When the bubble pops, it will stop.

[–] aev_software@programming.dev 20 points 1 day ago (1 children)

In the meantime, sites are getting DDoS-ed by scrapers. One way to stop your site from getting scraped is having it be inaccessible... which is what the scrapers are causing.

Normally I would assume DDoS-ing is performed in order to take a site offline. But AI scrapers require the opposite. They need their targets online and willing. One would think they'd be a bit more careful about the damage they cause.

But they aren't, because capitalism.

[–] metacolon@lemmy.blahaj.zone 12 points 1 day ago (1 children)

Are those blocklists publicly available somewhere?

[–] Taldan@lemmy.world 11 points 1 day ago (1 children)

I would hope not. Kinda pointless if they become public

[–] daniskarma@lemmy.dbzer0.com 28 points 1 day ago (5 children)

On the contrary. Open community based block lists can be very effective. Everyone can contribute to them and asphyxiate people with malicious intents.

If you're thinking something like "if the blocklist is available, then malicious agents simply won't use those IPs", I don't think that makes a lot of sense, as a malicious agent will know that any of their IPs is blocked as soon as they use it.

[–] pedz@lemmy.ca 9 points 1 day ago

Just to give an example of public lists that work: I have an IRC server and it's getting bombarded with spam bots. It's horrible around the Super Bowl for some reason, but it just continues year-round.

So I added a few public anti spamming lists like dronebl to the config, and the vast majority of the bots are automatically G-Lined/banned.
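For anyone wondering how such a list is consulted, here is a rough sketch of a generic DNSBL lookup: reverse the IPv4 octets, append the list's zone, and see whether the name resolves. The zone name `dnsbl.dronebl.org` is an assumption on my part (check the list's own documentation), the function names are made up, and this is not how any particular IRC daemon implements it.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str = "dnsbl.dronebl.org") -> str:
    """Return the DNS name to look up for `ip` in a DNSBL zone (IPv4 only)."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    # 192.0.2.1 becomes 1.2.0.192.<zone>
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str = "dnsbl.dronebl.org") -> bool:
    """True if the DNSBL name resolves. Any lookup failure counts as not listed."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

Note that treating every resolver error as "not listed" is a simplification: a DNS outage would silently let everything through.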

[–] MonkderVierte@lemmy.zip 18 points 1 day ago* (last edited 1 day ago) (12 children)

I just thought that a client-side proof-of-work (or even just a delay) bound to the IP might push the AI companies to behave instead, because single-visit-per-IP crawling gets too expensive or slow, and you can just block ordinary abusive crawlers. But they already have mind-blowing computing and money resources and only want your data.

But if there was a simple-to-use integrated solution and every single webpage used this approach?
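A minimal sketch of what an IP-bound proof-of-work could look like, in the hashcash style: the server derives a challenge from a secret, the client IP, and a time window, and the client must find a nonce whose SHA-256 hash has enough leading zero bits. All names and the challenge format are hypothetical; this is not Anubis's actual implementation.

```python
import hashlib
import hmac
import itertools


def make_challenge(secret: bytes, client_ip: str, window: int) -> str:
    """Derive a challenge tied to one IP and one time window (server side)."""
    msg = f"{client_ip}|{window}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap server-side check: a single hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client-side brute force: about 2**difficulty hashes on average."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce
```

Because the challenge depends on the IP, a solution found from one address does not verify for another, so a crawler that rotates through addresses pays the cost each time. Whether that cost matters to someone with a data center is exactly the open question.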

[–] witten@lemmy.world 12 points 1 day ago

Believe me, these AI corporations have way too many IPs to make this feasible. I've tried per-IP rate limiting. It doesn't work on these crawlers.
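To show why that happens, here is a small sliding-window limiter keyed by IP (a sketch with made-up names, not any specific server's feature). Each address gets its own budget, so a crawler that spreads requests across thousands of addresses never exhausts any single one.

```python
from collections import defaultdict, deque


class PerIPRateLimiter:
    """Allow at most `max_requests` per IP within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = PerIPRateLimiter(max_requests=2, window_seconds=60)
    # One address hammering the site gets cut off on the third request...
    print([limiter.allow("203.0.113.5", t) for t in (0, 1, 2)])
    # ...but the same load spread over three addresses passes untouched.
    print([limiter.allow(f"198.51.100.{i}", 3) for i in range(3)])
```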

[–] Probius@sopuli.xyz 227 points 2 days ago (4 children)

This type of large-scale crawling should be considered a DDoS and the people behind it should be charged with cyber crimes and sent to prison.

[–] eah@programming.dev 19 points 1 day ago (1 children)

Applying the Computer Fraud and Abuse Act to corporations? Sign me up! Hey, they're also people, aren't they?

[–] caseyweederman@lemmy.ca 9 points 1 day ago (1 children)

Put the entire datacenter buildings into prison

[–] Blueteamsecguy@infosec.pub 4 points 1 day ago

I think they call that a "job" already

[–] FauxLiving@lemmy.world 77 points 1 day ago (2 children)

If it’s disrupting their site, it is a crime already. The problem is finding the people behind it. This won’t be some guy on his dorm PC, and they’ll likely be in places Interpol can’t reach.

[–] porous_grey_matter@lemmy.ml 47 points 1 day ago

they’ll likely be in places Interpol can’t reach

Like some Microsoft data center

[–] finitebanjo@lemmy.world 22 points 1 day ago

Huawei

[–] rozodru@lemmy.world 15 points 1 day ago (4 children)

I run my own Gitea instance on my own server, and within the past week or so I've noticed it getting absolutely nailed. One repo in particular, a Wayland WM I built, just keeps getting hammered over and over by IPs in China.

[–] ZILtoid1991@lemmy.world 10 points 1 day ago

Just keeps getting hammered over and over by IPs in China.

Simple solution: Block Chinese IPs!

[–] witten@lemmy.world 5 points 1 day ago

Are you using Anubis?

[–] Harbinger01173430@lemmy.world 5 points 1 day ago

A good solution would be to load a virus onto the PCs connecting from the AI IPs that overloads the computer and makes it explode.

[–] rozodru@lemmy.world 4 points 1 day ago

They're getting hammered again this morning.
