HTTrack Website Copier

(github.com)

136 points | by iscream26 284 days ago

16 comments

Felk 283 days ago
Funny seeing this here now, as I _just_ finished archiving an old MyBB PHP forum. Though I used `wget` and it took 2 weeks and 260GB of uncompressed disk space (12GB compressed with zstd), and the process was not interruptible and I had to start over each time my hard drive got full. Maybe I should have given HTTrack a shot to see how it compares.
If anyone wanna know the specifics on how I used wget, I wrote it down here: https://github.com/SpeedcubeDE/speedcube.de-forum-archive
Also, if anyone has experience archiving similar websites with HTTrack and maybe know how it compares to wget for my use case, I'd love to hear about it!
[-]
- smashed 283 days ago
  I've tried both in order to archive EOL websites and I've had better luck with wget, it seems to recognize more links/resources and do a better job so it was probably not a bad choice.
  [-]
  - fmajid 279 days ago
    Conversely, httrack was the only tool that could archive the JS-heavy microsite my realtor made to sell our old house. The command-line interface is horrendous, but it does handle rewriting complex sites better than wget does.
- codetrotter 283 days ago
  > it took 2 weeks and 260GB of uncompressed disk space
  Is most of that data because of there being like a zillion different views and sortings of the same posts? That’s been the main difficulty for me when wanting to crawl some sites. There’s like an infinite number of permutations of URLs with different parameters because every page has a bunch of different link with auto-generated URL parameters for various things, that results in often retrieving the same data over and over and over again throughout an attempted crawl. And sometimes URL parameters are needed and sometimes not so it’s not like you can just strip all URL parameters either.
  So then you start adding things to your crawler like, starting with shortest URLs first, and then maybe you make it so whenever you pick the next URL to visit it will take one that is most different from what you’ve seen so far. And after that you start adding super specific rules for different paths of a specific site.
  [-]
  - Felk 283 days ago
    The slowdown wasn't due to a lot of permutations, but mostly because a) wget just takes a considerable amount of time to process large HTML files with lots of links, and b) MyBB has a "threaded mode", where each post of a thread geht's a dedicated page with links to all other posts of that thread. The largest thread had around 16k posts, so that's 16k² URLs to parse.
    In terms of possible permutations, MyBB is pretty tame thankfully. Only the forums are sortable, posts only have the regular and the aforementioned threaded mode to view them. Even the calender widget only goes from 1901-2030, otherwise wget might have crawled forever.
    I originally considered excluding threaded mode using wget's `--reject-regex` and then just adding an nginx rule later to redirect any incoming such links to the normal view mode. Basically just saying "fuck it, you only get this version". That might be worth a try for your case
- criddell 283 days ago
  Is there a friendly way to do this? I'd feel bad burning through hundreds of gigabytes of bandwidth for a non-corporate site. Would a database snapshot be as useful?
  [-]
  - z33k 283 days ago
    MyBB PHP forums have a web interface through which one can download the database as a single .sql file. It will most likely be a mess, depending on the addons that were installed on the forum.
  - Felk 283 days ago
    Downloading a DB dump and crawling locally is possible, but had two gnarly show stoppers for me using wget: the forum's posts often link to other posts, and those links are absolute. Getting wget to crawl those links through localhost is hardly easy (local reverse proxy with content rewriting?). Second, the forum and its server were really unmaintained. I didn't want to spend a lot of time replicating it locally and just archive it as-is while it is still barely running
  - dbtablesorrows 283 days ago
    If you want to customize the scraping, there's scrapy python framework. You would always need to download the html though.
  - squigz 283 days ago
    Isn't bandwidth mostly dirt cheap/free these days?
    [-]
    - criddell 283 days ago
      It's inexpensive, but sometimes not free. For example, Google Cloud Hosting is $0.14 / GB so 260 GB would be around $36.
    - nchmy 283 days ago
      its essentially free on non-extortionate hosts. Use hetzner + cloudflare and you'll essentially never pay for bandwidth
- begrid 283 days ago
  wget2 has an option por paralel downloading. https://github.com/rockdaboot/wget2
corinroyal 283 days ago
One time I was trying to create an offline backup of a botanical medicine site for my studies. Somehow I turned off depth of link checking and made it follow offsite links. I forgot about it. A few days later the machine crashed due to a full disk from trying to cram as much of the WWW as it could on there.
[-]
- rkhassen9 283 days ago
  That is awesome.
suriya-ganesh 283 days ago
This saved me a ton when back in college in rural India without Internet in 2015. I would download whole websites from a nearby library and read at home.
I've read py4e, ostep, Pgs essays using this.
I am who I am because of httrack. Thank you
jregmail 283 days ago
I recommend to try also https://crawler.siteone.io/ for web copying/cloning.
Real copy of the netlify.com website for demonstration: https://crawler.siteone.io/examples-exports/netlify.com/
Sample analysis of the netlify.com website, which this tool can also provide: https://crawler.siteone.io/html/2024-08-23/forever/x2-vuvb0o...
xnx 284 days ago
Great tool. Does it still work for the "modern" web (i.e. now that even simple/content websites have become "apps")?
[-]
- alganet 283 days ago
  Nope. It is for the classic web (the only websites worth saving anyway).
  [-]
  - freedomben 283 days ago
    Even for classic web, if it's behind cloudflare, then HTTrack no longer works.
    It's a sad point to be at. Fortunately, the single file extension still works really well for single pages, even when they are built dynamically by JavaScript on the client side. There isn't a solution for cloning an entire site though, at least that I know of
    [-]
    - alganet 283 days ago
      If it is cloudflare human verification, then httrack will have an issue. But in the end it's just a cookie, you can use a browser with JS to grab the cookie, then feed it to httrack headers.
      If cloudflare ddos protection is an issue, you can throttle httrack requests.
      [-]
      - acheong08 283 days ago
        > you can use a browser with JS to grab the cookie, then feed it to httrack headers
        They also check your user agent, IP and JA3 fingerprint (and ensures it matches with the one that got the cookie) so it's not as simple as copying some cookies. This might just be for paying customers though since it doesn't do such heavy checks for some sites
        [-]
        alganet 282 days ago
        Dude. Cookie is a header, user agent is a header, ja3 is a header. It's the same stuff.
        These protections are against ddos attacks, botnets, large crawling infrastructures that can lose by having to sync header info.
        If you're just a single tired dev saving a website because you care about some content, none of this is a significant barrier.
        [-]
        acheong08 282 days ago
        Dude. JA3 is a your TLS fingerprint. Most libraries don't let you spoof it. The annoying thing is that with new versions of Chrome and Firefox, JA3 is randomized per session so it changes every time. You need to intercept the request in Wireshark to get it.
        freedomben 283 days ago
        Seconded. It seems to depend on the sites settings, and those in turn are regulated heavily by subscription plan the site is on.
    - knowaveragejoe 283 days ago
      I'm aware of this tool, but I'm sure there are caveats in terms of "totally" cloning a website:
      https://github.com/ArchiveTeam/grab-site
dark-star 283 days ago
oh wow that brings back memories. I have used httrack in the late 90s and early 2000's to mirror interesting websites from the early internet, over a modem connection (and early DSL)
Good to know they're still around, however, now that the web is much more dynamic I guess it's not as useful anymore as it was back then
[-]
- dspillett 283 days ago
  > now that the web is much more dynamic I guess it's not as useful anymore as it was back then
  Also less useful because the web is so easy to access, I remember using it back then to draw things down over the university link for reference in my room (1st year, no network access at all in rooms) or house (or per-minute costed modem access).
  Sites can vanish easily of course still these days, so having a local copy could be a bonus, but they just as likely go out of date or get replaced, and if not are usually archived elsewhere already.
oriettaxx 283 days ago
I don't get it: last release 2017 while in github I see more releases...
so, did developer of the github repo took over and updating/upgrading? very good!
superjan 283 days ago
I have tried the windows version 2 years ago. The site I copied was our on-prem issue tracker (fogbugz) that we replaced. HTTrack did not work because of too much javascript rendering, and I could not figure out how to make it login. What I ended up doing was embedding a browser (WebView2) in a C# Desktop app. You can intercept all the images/css, and after the Javascript rendering was complete, write out the DOM content to a html file. Also nice is that you can login by hand if needed, and you can generate all urls from code.
chirau 283 days ago
I use it to download sites with layouts that I like and want to use for landing pages and static pages for random projects. I strip all the copy and stuff and leave the skeleton to put my own content. Most recently link.com, column.com and increase.com. I don't have the time nor the youth to start with all the JavaScript & React stuff.
zazaulola 283 days ago
The archive saved in HTTrack Website Copier can be opened in https://replayweb.page locally or they have different save formats?
Alifatisk 283 days ago
Good ol' days
subzero06 283 days ago
i use this to double check which of my web app folder/files are publicly accessible.
j0hnyl 283 days ago
Scammers love this tool. I see it used in the wild quite a bit.
alberth 283 days ago
I always wonder if this gives false positives for people just using the same WordPress template.
woutervddn 283 days ago
Also known as: static site generator for any original website platform...