Technology

Share interesting Technology news and links.

Rules:

No paywalled sites at all.
News articles must be recent, no older than 2 weeks (14 days).
No videos.
Post only direct links.

To encourage more original sources and keep this space as free of commercial content as I can, the following websites are blacklisted:

Al Jazeera;
NBC;
CNBC;
Substack;
Tom's Hardware;
ZDNet;
TechSpot;
Ars Technica;
Vox Media outlets, with the exception of Axios;
Engadget;
TechCrunch;
Gizmodo;
Futurism;
PCWorld;
ComputerWorld;
Mashable;
Hackaday;
WCCFTECH;
Neowin.

More sites will be added to the blacklist as needed.

Encouraged:

Archive links in the body of the post.
Linking to the direct source, instead of linking to an article talking about the source.

Codeberg: army of AI crawlers are extremely slowing us; AI crawlers learned how to solve the Anubis challenges. (i.imgur.com)

submitted 2 days ago* (last edited 1 day ago) by Pro@programming.dev to c/Technology@programming.dev

Comments

Lemmy;
Hackernews.

Source.

[–] S7rauss@discuss.tchncs.de 31 points 2 days ago (2 children)

Does your tool respect the site’s robots.txt?

[–] who@feddit.org 18 points 2 days ago* (last edited 2 days ago) (2 children)

Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for cases like the one GP describes. HTTP 429 would be a better fit.
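
For illustration, here is a minimal sketch of how a polite client might react to an HTTP 429 response. It assumes the server sends the standard Retry-After header, either as a number of seconds or as an HTTP date. The function name `retry_delay_seconds` and the 60-second fallback are hypothetical choices, not something taken from the thread.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(status, retry_after, now=None, default=60.0):
    """Return how many seconds to wait before retrying a request.

    status      -- HTTP status code of the response
    retry_after -- raw value of the Retry-After header, or None if absent
    default     -- assumed fallback when the header is missing or unparseable
    """
    if status != 429:
        return 0.0  # not rate limited, no wait needed
    if retry_after is None:
        return default
    value = retry_after.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A scraper would call this after every response and sleep for the returned number of seconds before the next request.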

[–] Redjard@lemmy.dbzer0.com 9 points 1 day ago (1 children)

Crawl-delay is just that, a simple directive to add to robots.txt to set the maximum crawl frequency. It used to be widely followed by all but the worst crawlers ...
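
As a rough sketch of what honoring this looks like on the crawler side, Python's standard `urllib.robotparser` can read both the Disallow rules and a Crawl-delay value. The robots.txt content, the `example-feed-bot` user agent, and the example.org URLs below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all crawlers to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
agent = "example-feed-bot"  # assumed user agent name

allowed = rules.can_fetch(agent, "https://example.org/news")
blocked = rules.can_fetch(agent, "https://example.org/private/data")
delay = rules.crawl_delay(agent)  # None when no Crawl-delay applies

print(allowed, blocked, delay)
```

A crawler that respects the file would skip any URL for which `can_fetch` returns False and sleep for `delay` seconds between requests when it is not None.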

[–] who@feddit.org 2 points 1 day ago* (last edited 1 day ago)

Crawl-delay

It's a nonstandard extension without consistent semantics or wide support, but I suppose it's good to know about anyway. Thanks for mentioning it.

[–] S7rauss@discuss.tchncs.de 4 points 2 days ago

I was responding to their question about whether scraping the site is considered harmful. I would say that as long as they are not ignoring robots.txt, they shouldn’t be contributing significant amounts of traffic if they’re really only pulling data once a day.

[–] programmer_belch@lemmy.dbzer0.com 2 points 2 days ago

Yes, it just downloads the HTML of one page and formats the data into the RSS format with only the information I need.
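
A tool like the one described could look roughly like the sketch below. It is not the commenter's actual code: the page structure (links marked with a hypothetical `class="headline"`), the class and function names, and the choice of RSS 2.0 fields are all assumptions. The HTML is passed in as a string, so the single daily download is left out.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class HeadlineParser(HTMLParser):
    """Collect (title, link) pairs from <a class="headline" href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None  # set while inside a matching <a> tag
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "headline":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None


def html_to_rss(html, feed_title, feed_link):
    """Build a minimal RSS 2.0 document from the headlines found in html."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Generated from " + feed_link
    for title, link in parser.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")
```

Because only one page is fetched and everything else happens locally, the load on the site is a single request per run.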