[–] ulterno@programming.dev 8 points 2 days ago (1 children)

If the site is slowing down at times (regardless of whether that happens while you're scraping), you might want to not scrape at all.

It's probably not a good idea to download the whole site, but then that depends on the site.

  • If it is a static site, just set up your scraper to skip CSS/JS and images/videos; that should make a difference.
  • For a dynamically generated site, there's not much I can say.
  • Then again, reducing your downloads to only what you actually use, as far as possible, might be good enough.
  • Since sites are made for human consumption in the first place, consider keeping your link-traversal rate similar to a human's (see the sketch after this list).
  • The best option would be to ask the website dev whether they have an API available.
    • Even better, ask them to provide an RSS feed.
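
For what that's worth, here's a minimal sketch of that kind of restraint in Python. The URLs, delay, and User-Agent string are all made up; the point is that fetching only the HTML naturally skips CSS/JS and images, and a sleep between requests keeps the pace roughly human:

```python
import time
import requests

# Hypothetical list of pages to check; the real URLs depend on the site.
SERIES_URLS = [
    "https://example.com/series/a",
    "https://example.com/series/b",
]

DELAY_SECONDS = 10  # rough human-like pacing between page loads; tune to taste

session = requests.Session()
# Identify the bot honestly so the site admin can contact or throttle it.
session.headers["User-Agent"] = "series-checker/0.1 (contact: you@example.com)"

for url in SERIES_URLS:
    resp = session.get(url, timeout=30)  # fetches the HTML only, no assets
    resp.raise_for_status()
    print(url, len(resp.text), "bytes")
    time.sleep(DELAY_SECONDS)  # breathing room for the server between hits
```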
[–] programmer_belch@lemmy.dbzer0.com 3 points 2 days ago (2 children)

As far as I know, the website doesn't have an API. I just download the HTML and format the result with a simple Python script. It makes around 10 to 20 requests each time, one for each series I'm following.
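
The commenter doesn't show the script, but the "download the HTML, pull out one value per series" shape might look something like this sketch. The site structure, tag, and class name are invented, and it leans on the standard-library HTMLParser to avoid extra dependencies:

```python
import requests
from html.parser import HTMLParser

class LatestChapterParser(HTMLParser):
    """Grab the text of the first <h2 class="chapter-title"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.latest = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "h2" and ("class", "chapter-title") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title and data.strip():
            if self.latest is None:
                self.latest = data.strip()
            self.in_title = False

# Made-up series names and URLs; one GET per series, 10 to 20 in total.
SERIES = {
    "Series A": "https://example.com/series/a",
    "Series B": "https://example.com/series/b",
}

for name, url in SERIES.items():
    parser = LatestChapterParser()
    parser.feed(requests.get(url, timeout=30).text)
    print(f"{name}: {parser.latest}")
```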

[–] limerod@reddthat.com 2 points 1 day ago

You can use the caching features in curl/wget so they don't download the same CSS or HTML twice. You can also ignore JavaScript and image files to save on unnecessary requests.

I would drop the frequency to once every two days to further reduce the impact.
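
Keeping with the thread's Python script rather than curl/wget, the same "don't re-download what hasn't changed" idea can be done with HTTP conditional requests, assuming the server sends ETag or Last-Modified headers. A sketch (the cache file name is made up):

```python
import json
import pathlib
import requests

CACHE_FILE = pathlib.Path("validators.json")  # hypothetical cache location

def fetch_if_changed(url: str) -> str | None:
    """Return the page text, or None if the server says it hasn't changed."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    entry = cache.get(url, {})

    # Send back the validators from last time, if we have any.
    headers = {}
    if "etag" in entry:
        headers["If-None-Match"] = entry["etag"]
    if "last_modified" in entry:
        headers["If-Modified-Since"] = entry["last_modified"]

    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:  # Not Modified: the server sent no body at all
        return None
    resp.raise_for_status()

    # Remember the validators the server handed us for the next run.
    entry = {}
    if "ETag" in resp.headers:
        entry["etag"] = resp.headers["ETag"]
    if "Last-Modified" in resp.headers:
        entry["last_modified"] = resp.headers["Last-Modified"]
    cache[url] = entry
    CACHE_FILE.write_text(json.dumps(cache))
    return resp.text
```

A 304 response still costs the server a request, but it skips generating and sending the page body, which is usually the expensive part.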

[–] ulterno@programming.dev 1 points 1 day ago

That may or may not help much.
Depends on the site, I'd say.

E.g. if it's something like Netflix, I wouldn't think twice, because they have the means to serve the requests.
But for some PeerTube instances, even a single request seems to be too heavy. So if a server doesn't respond to my request, I usually wait an hour or so before refreshing the page.
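
That "back off and come back later" instinct translates directly into code as retries with increasing delays. A sketch in Python (the attempt count and starting delay are arbitrary, scaled well below the hour-long manual wait):

```python
import time
import requests

def get_with_backoff(url: str, tries: int = 4) -> requests.Response:
    """Fetch url, backing off on failure instead of hammering a struggling host."""
    delay = 60  # start with a minute; a small server may just need breathing room
    for attempt in range(tries):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == tries - 1:
                raise  # out of patience; surface the error to the caller
            time.sleep(delay)
            delay *= 2  # 1 min, 2 min, 4 min between successive attempts
    raise AssertionError("unreachable")
```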