Is stuff online worth saving?

(rubenerd.com)

114 points | by Brajeshwar 4 days ago

36 comments

  • JKCalhoun 10 hours ago
    One hundred twenty-three years ago my great grandmother's first husband died in a hotel in Kansas City from asphyxiation from the gas having been left on overnight (the hotel did not yet have electric lighting). A letter was hastily written on a piece of hotel stationery to be delivered to his wife in the neighboring farming community where she lived.

    It is fortunate to me that someone thought to hang on to that note since I have become interested in genealogy and this was a fairly significant event in family history (had he not died I don't suppose I would be around since it was her second marriage that gave me my grandfather).

    I long for scraps of anything that my dead relatives wrote, created, etc. It connects me better to the past: the lives they lived, how they lived them. It somehow grounds me a little better ... well, it's rather hard to explain the draw of genealogy.

    Sadly very little of the ephemera of everyday life was kept. I get it. It might have seemed like hanging on to junk mail — like you were a hoarder or whatever, but in this digital era we should be able to hold terabytes of what may appear to be ephemera.

    I'm doing what I can – not for ego, I think, but for future generations that may find a connection to their past interesting.

    • sangnoir 37 minutes ago
      > ...well, it's rather hard to explain the draw of genealogy.

      I've noticed people becoming more interested in genealogy when they - let me phrase this delicately - reach a certain age. My speculation is that it is a component of grappling with one's own mortality. As the grays and wrinkles multiply, some obsess over healthy eating and exercise, some wealthier ones invest in immortality research, some get blood boys, and the rest feel an urgent need to research our genealogy; any detritus that shows our progenitors existed proves some trace of us having been here will remain, and perhaps our existence means something, as time cruelly keeps marching on.

    • willis936 10 hours ago
      30 years ago there was no digital world. Nearly all information was in physical artifacts. The things worth saving haven't really changed, but the amount of noise they are buried in has. Imagine if that letter was kept in a two ton pile of ad fliers. Sure, someone would find some of those fliers interesting, but you'd have been much less likely to even know about the letter.
      • alex_young 4 hours ago
        Well, I remember a lot of great stuff on Usenet circa 1994, but it looks like Google shut down access to it via Google Groups, which used to archive it in a searchable way.

        There was a ton of great stuff 30 years ago, and I think it's definitely worth saving.

        The Internet was a very different place, but it was quite real 30 years ago, and I think the idea that the further back you go the more valuable this kind of thing is is the right way of looking at it.

      • jonhohle 8 hours ago
        An aside about ad spam from companies that I occasionally buy from:

        Often ad spam comes from the same mailbox as order receipts and includes words like “order”, while messages with receipts never include the word “receipt”. When inundated with daily, or sometimes multiple-times-a-day, ad spam from the same company, it becomes very difficult to filter for only the non-receipts in order to clean a neglected inbox.

        After I’m gone, I fully expect my family just to delete it all because the signal to noise is so low.

        • sdenton4 8 hours ago
          Sorting through twenty years of spammy email is one of those things that seems like an LLM would actually be good for.
        • justsomehnguy 8 hours ago
          I don't have anyone to do anything after I'm gone, so I just delete the emails myself. I do keep the notable ones, like registration information and some payment receipts but otherwise everything goes to the trash.

          Bonus points:

          I don't need a 30/50/100 GB mailbox (and the associated mailbox cost nowadays).

          Search is not only fast, but if I didn't find something, then there is nothing of this something in the mailbox.

          It's mentally pleasurable to log in once in a while and throw a bunch of unneeded stuff into the trash bin, quite similar to a real life room cleaning.

          • ghaff 3 hours ago
            Fortunately Gmail tabs go a lot of the way to letting you mass delete junk you don’t care about. Assuming you do even a modicum of labeling stuff you might like to refer to or act on, deleting at least older promotions and updates eliminates a lot of things.
      • palmfacehn 9 hours ago
        >...a two ton pile of ad fliers

        Alamy is selling scans of ad prints from the 1850s.

        https://www.alamy.com/stock-photo/1850s-advert.html

        • zamadatix 9 hours ago
          A selection of 74 items over a 10 year period is a different proposal compared to e.g. keeping two tons of ad fliers from November 17th 1907 (and every other thing, every other day, all the time).
        • chefandy 9 hours ago
          Ads range from a (necessary, in a capitalist society) nuisance to a scourge, and people justly put up increasingly thick boundaries to shield themselves from their influence. When waning cultural relevance or whatever dilutes that influence, you can more easily see the ads for what they are— often manipulative marketing tactics implemented through often genuinely beautiful art and design. Both aspects are fascinating to consider and the art can be quite enjoyable. Early modernist posters from Paris are beautiful. Watching collections of mid century television ads in the prelinger archives is fun, and tells us a lot about the ways we are influenced by modern ads speaking to current perspectives, fashions, and concerns.
          • ANewFormation 6 hours ago
            Capitalism would work 100% fine without ads because people naturally compare and contrast options when buying stuff.

            All that's necessary is making it possible for people seeking out your type of product to find you. And for revolutionary products, there's word of mouth.

            If anything I think capitalism would function better without ads, because I would argue that advertising overall results in less informed customers, especially the modern lifestyle/brand type of advertising that's clearly quite effective at manipulating people.

            • janalsncm 2 hours ago
              It’s an interesting question I guess (and slightly worrying that I can more easily imagine the end of the world than the end of advertising). Especially if we take it to the extreme and imagine sponsored listings also don’t exist. I guess incumbents would have a big advantage.

              There are second order effects of ads that we’d need to consider. Facebook and Google wouldn’t exist as we know them. Maybe that means some of their research doesn’t happen?

            • chefandy 3 hours ago
              If there were no ads, how would people know that products existed? Would they just see the products on store shelves? What about services? Would labels be ads? Would how stores merchandise things be advertisements? Could businesses negotiate for specific product placement? How would you find out about stores? Would store signs be ads? How about really big ones? How about at the edge of their property along a road highway? Could the sign say what the store sold? If you were to start a product guide to help people find what they need, how could you possibly afford to buy enough products to be useful and up-to-date enough while slow crawl word of mouth got the business off the ground? Would asking people to tell their friends be an ad? If not, could you pay someone to spread the word about your product? Would traveling sales reps be ads? What if they wore head to toe logo gear? Could you just pay people to do that without selling things? Ads suck but I don’t see how a capitalist society could survive without them.
              • janalsncm 2 hours ago
                I think the definition would have to be an exchange of something of value for telling other people about a product. There are some companies that got off the ground with no paid advertising but I think they’re an exception. Generally people are not seeking out new products.
                • chefandy 2 hours ago
                  But the whole point of a capitalist society is that competitors that do things better/cheaper start taking customers so the capital moves to the best and most efficient system.
        • chgs 9 hours ago
          Because they are rare
          • chefandy 9 hours ago
            I don’t think that’s true? Tons of stuff from that era had been digitized, even before newer more relevant stuff and older rarer stuff, because the acid paper had a short shelf life and there were so many ads in printed stuff then. I might have a skewed perspective from working in the digitization world for quite some time. I think they’re selling what they sell with all their other content— discovery, curation, preparation, and easy delivery.
      • harrall 5 hours ago
        It’s not like you currently go to a webpage and save all the images onto deep storage for archival… I’m not sure what relevance things being digital has on identifying noise.

        If the ancestor before you is hoarding anything that comes across their path, be it digital ads or every physical greeting card they’ve ever gotten, the problem is with the person’s collection habits, not the medium.

      • qwertox 9 hours ago
        What about robots reading each flier and checking if something is odd about that particular one? It could find the letter and report it to you. Even easier if it was all digital information.
      • bongodongobob 8 hours ago
        If only we had search algorithms...
      • eesmith 8 hours ago
        A two-ton pile of ad fliers? Sounds like Ted Nelson's Junk Mail collection, https://archive.org/details/tednelsonjunkmail .
    • waltbosz 6 hours ago
      This reminds me of a recent flea market experience. There at some stand were boxes of old used post cards and 100 year old family photos. Photos of people posed on a porch in their Sunday best. Or just mundanely standing around a car, not everyone looking at the camera.

      It's hard to assign a value to these things. They are simultaneously junk and treasure. I think about the journey these items took to find their way to that flea market table. It was too diverse a collection to have come from one place. So I imagine all the paths each individual item traversed. The joy of the recipient reading a post card, holding on to it, rediscovering it on spring cleaning days. Or the photo living in an album or framed on a wall somewhere for a lifetime.

      I'm not sure what the value of it all is if it just gets lugged around to various flea markets and sold piecemeal for $1 each.

    • wslh 1 hour ago
      Regarding genealogy, it is great to look at the work The Church of Jesus Christ of Latter-day Saints has been doing, which helps genealogical researchers around the globe [1], beyond that specific church.

      [1] https://newsroom.churchofjesuschrist.org/topic/genealogy

    • immibis 3 hours ago
      Now it's easier to save stuff, but there's more stuff to save. YouTubes and TikToks instead of text notes. Chat messages instead of letters.
    • kerkeslager 8 hours ago
      Sure, there are a ton of reasons to archive. And if it's cheap to do (in terms of money, yes, but also in terms of time, effort, mental health, etc.) then I am of the mind that we should archive everything.

      But, it often isn't cheap to do, and in that case, it makes sense to prioritize. The high priority items for me are the things that I might want to share, the ideas I want to amplify for my contemporaries and future generations that might examine my life. Stuff like [1] [2] and [3] which has influenced my thinking fundamentally, that I hope to build upon so that others can build upon what I have built.

      I'd argue that you do this intuitively: you're mentioning a letter from your family's past because it is a high priority item--it's relevant because it was the last written words of your great-grandmother's first husband.

      But, there's a lot that isn't worth keeping. My first form of archiving as a teenager was keeping ticket stubs for movies and concerts--a decade later I was going through my pile and found that I didn't even remember most of them. The better movies, I remembered--and I had them on DVD. The better concerts, I remembered--and I also had journal entries and CDs to remember the experience and the music. It's not important to me where/when I saw Everything, Everywhere, All At Once in theaters, but I have it on DVD and I can't wait to show it to my niece when she's older. And sure, I saw Amigo the Devil live, but frankly, he's not an artist you need to see in concert--the greatest impact of Cocaine and Abel[4] on me was when I listened to it alone in my room. The ticket stubs simply don't matter to me.

      [1] https://www.viridiandesign.org/notes/451-500/the_last_viridi...

      [2] https://www.ted.com/talks/brene_brown_the_power_of_vulnerabi...

      [3] https://digital.wpi.edu/pdfviewer/wm117p10z

      [4] https://www.youtube.com/watch?v=ZzjtLm0G49E

      EDIT: All the things linked above, I have backed up in one form or another. Notably, the Schutt paper isn't at its original URL.

  • zdc1 6 hours ago
    These days whenever I read an interesting article, I will take 2 minutes to copy and paste it into my Obsidian vault under my Articles folder. I'll take care to paste the images as images (and not links) and make sure I've got the author and source URL at the top, and have my separate notes section link to it. It's a bit silly and obsessive, but given how transient content on the Internet is, I think it's necessary to make a copy of anything you care about.
    • Modified3019 5 hours ago
      I use https://github.com/gildas-lormeau/SingleFile

      I set it to tolerate longer processing times, and to open the file after saving so I can sanity check that it got everything. Works great at faithfully saving a page with images as it appears in browser, and saves so much time.

      You might also have a look at https://github.com/ArchiveBox/ArchiveBox

      • Modified3019 5 hours ago
        Also, I believe by default the files are saved as plain html (with resources being base64 encoded), so search tools which can index the contents of html files will work.

        There is also the option to have the contents compressed, and (a separate option) to keep the plaintext of the file uncompressed, which will likewise still allow indexing to work while saving space.
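
To illustrate the "resources being base64 encoded" idea above, here is a minimal Python sketch of inlining an image into an HTML page as a `data:` URI. It is not SingleFile's actual implementation; the helper names (`to_data_uri`, `inline_images`) and the in-memory `resources` mapping are assumptions made for the example, and the regex only handles simple double-quoted `src` attributes.

```python
import base64
import re


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data: URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_images(html: str, resources: dict[str, tuple[bytes, str]]) -> str:
    """Replace src="..." values that appear in `resources` with data: URIs.

    `resources` maps a URL to (raw bytes, MIME type). It is a hypothetical
    stand-in for whatever the saving tool has already downloaded.
    """
    def replace(match: re.Match) -> str:
        url = match.group(2)
        if url not in resources:
            return match.group(0)  # leave unknown resources untouched
        data, mime_type = resources[url]
        return f"{match.group(1)}{to_data_uri(data, mime_type)}{match.group(3)}"

    return re.sub(r'(src=")([^"]*)(")', replace, html)


page = '<p>Hello</p><img src="pixel.gif">'
resources = {"pixel.gif": (b"GIF89a", "image/gif")}
single_file = inline_images(page, resources)
print(single_file)
```

Because the text of the page stays as ordinary HTML and only the binary resources become base64 strings, tools that index the contents of HTML files can still find the words in the saved file.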

    • kepano 4 hours ago
      I built Obsidian Web Clipper to automate that process. It also allows you to save web pages as nicely formatted Markdown files with YAML properties even if you don't use Obsidian.

      https://github.com/obsidianmd/obsidian-clipper

    • tempestn 5 hours ago
      I noticed a web clipper was just released for Obsidian last month. Maybe that'd cut down those two minutes for you.
    • gameshot911 41 minutes ago
      How often do you reference your vault?
    • yazantapuz 4 hours ago
      I am using monolith to just save the whole page to disk.

      https://github.com/Y2Z/monolith

    • ironyman 5 hours ago
      I do something similar but with Discord. I made a server accessible only by me, and I have a few different channels like work, life, music, ideas, etc. I also send all screenshots I take into a separate channel, and set up a chrome extension that sends whatever page I'm on as a link.
      • jay_kyburz 5 hours ago
        What if discord goes away. I would think you want the data local.
      • efilife 5 hours ago
        terrible idea. people get their discord accounts banned randomly without warning
        • accrual 1 hour ago
          Unfortunately it's not super easy to get data out of Discord either. Last I checked, one needs to carefully set up a bot then script the bot to download messages to CSV, etc., but if you're not careful with the account and bot setup, the export process itself could lead to a ban.
        • immibis 3 hours ago
          like recently they banned the entire country of germany by accident
    • brainzap 3 hours ago
      For the lazy: I think Safari's web archive export is standardised and gives you a good website backup.
    • Feathercrown 6 hours ago
      Agreed. I think you could automate some of that too, could save time if you do it often.
    • jeofken 6 hours ago
      In my day browsers could save an archive of a page

      Is this still the case?

      • accrual 1 hour ago
        It's not perfect, but Edge will let one take a simple full page screenshot with Ctrl+Shift+S. It results in a hefty PNG but at least it's a visual copy of everything which might suffice for a certain set of purposes (e.g. links will be lost, so it's not good for that).

        I can still right-click > Save any page as .html, but that doesn't guarantee server streamed stuff, media, images, etc. will be preserved correctly.

        • 800xl 1 hour ago
          Thank you for this! I pressed Ctrl+Shift+S in Firefox just to see if it would work and it has the same functionality.
      • Macha 6 hours ago
        They can but generally that includes any Javascript on the same page which sometimes does funny stuff when you open it up offline or after the remote server goes away.
        • compootr 5 hours ago
          SingleFile can make a snapshot with just content/styling
  • smitelli 10 hours ago
    > I got a picture of my great grandfather, thing took six hours to take your picture. [...] Every guy had one picture back then. And it's just him like, "[grimacing] I gotta get back, feed them hogs!" Now, in the future of course it'll be different. 50 years from now, people will be going like, "Hey! You wanna see a hundred thousand pictures of my great grandfather? I got 'em right here plus everything he did every day of his life." --Norm Macdonald[1]

    There is certainly a quantity of stuff online that is absolutely worth saving, but there's a considerably larger proportion that's just redundant to the point of being unremarkable and pointless. The trick is filtering, which can be capital-H Hard. That's why some may want to err on the side of over-collecting to reduce the possibility of missing something that will actually be important someday.

    [1]: https://www.youtube.com/watch?v=sY6SjMITHrQ

    • diggan 8 hours ago
      Yeah, this is a good point. Isn't it better we save too much, as tooling for filtering stuff out will always get better, rather than saving too little? The latter has no workaround (today at least).
    • nytesky 9 hours ago
      Another funny take from Macfarlan

      Definitely no smiling:

      https://youtu.be/8SslNMLO0tw

  • don-code 7 hours ago
    I DVR the nightly news with NextPVR, more as a convenience in case I'm doing something when it's on, want to pause/rewind, want to watch it the next morning instead, etc.

    Come 2020, I was convinced that the world was going to end. So I simply... turned off the retention rule. One hour of news is around 5GB, but that's a very-high-bitrate MPEG-2 stream with an extra audio channel in Spanish. So I instead wrote a cron job to take that week's news, drop the stuff I don't care about, and H.264 the entire set of them down to 4.7GB, then burn them to a DVD for offline storage, since there's not much value to keeping them online.

    By 2022, it was obvious the world was not, in fact, ending, but I never stopped this practice because of how simple it was, and how unobtrusive to store they are. I just make sure a fresh DVD is in the NAS every week, and put the DVDs on a spindle - they collectively take up about as much room as a toaster. I could make that even smaller and simpler if I opted for a portable hard drive.

    Occasionally I'll manually toss something interesting in, like the presidential debates, or special coverage of some newsworthy event.

    In 20 years, when it comes time to re-burn the earliest of them, maybe I'll make a value judgment on whether that's worth it, but for now it feels like I'd be losing something for not much of a good reason.
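
The arithmetic behind squeezing a week of news onto one disc can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate, not the cron job described above: the seven one-hour broadcasts, the 128 kbps audio track, and the 2% container overhead are all assumptions, and the function names are made up for the example.

```python
DVD_CAPACITY_BYTES = 4_700_000_000  # nominal single-layer DVD capacity (decimal GB)


def source_mbps(size_bytes: int, seconds: float) -> float:
    """Average bitrate of an existing recording, in megabits per second."""
    return size_bytes * 8 / seconds / 1_000_000


def target_video_kbps(
    total_seconds: float,
    capacity_bytes: int = DVD_CAPACITY_BYTES,
    audio_kbps: int = 128,   # assumed audio bitrate
    overhead: float = 0.02,  # assumed container/filesystem overhead
) -> int:
    """Video bitrate (kbps) that lets `total_seconds` of footage fit the disc."""
    usable_bits = capacity_bytes * 8 * (1 - overhead)
    total_kbps = usable_bits / total_seconds / 1000
    return int(total_kbps - audio_kbps)


# One hour at roughly 5 GB, as in the comment above.
print(round(source_mbps(5_000_000_000, 3600), 1), "Mbps source")

# Assumed: seven one-hour broadcasts per disc.
week_seconds = 7 * 3600
print(target_video_kbps(week_seconds), "kbps video target")
```

The resulting number is the kind of value one would hand to an encoder as an average video bitrate; dropping segments you don't care about shortens `total_seconds` and raises the bitrate available to what remains.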

  • montebicyclelo 6 hours ago
    One approach to this is the SingleFile browser plugin [1], configured to save pages to a GitHub repository - it saves the whole web page as a single HTML file in the repo. (Ok it's probably closer to archiving than bookmarking... but it's not too far off)

    [1] https://github.com/gildas-lormeau/SingleFile

  • accrual 1 hour ago
    I've been saving images from various sources for a few years and cataloging them into folders that are meaningful for me, mostly for future reference, inspiration ("inspo" albums), or because they were interesting, worth another look, or just because they were funny (memes, old school image macros, Imgur-grade stuff).

    It's one of my most prized possessions, like an offline curated cache of things that I personally enjoyed at one point. One day I plan to give it all back.

  • thefaux 7 hours ago
    There are many things in life that have immense personal value and zero value to nearly everyone else. This creates a lot of misunderstanding and incentive misalignment.
    • ozim 6 hours ago
      Sounds about the same as what I was going to write.

      Most likely it is not worth it. But people should not be doing only things that are “worth doing”. Then again, if something brought you joy but was a complete waste of time, it was worth it.

      I hate the dementors who tell you otherwise; it is a limited lifetime, but it is yours. You should be helpful to others, but doing only “what is worth it” sucks the beauty out of existence.

    • zimpenfish 6 hours ago
      > zero value to nearly everyone else

      Well, except future historians who may find value in "personal" information (although I guess we've got such a surfeit of recorded "personal" information these days compared to even just 50 years ago, it may not be quite as useful as when they find, e.g., some Babylonian tablet with a shopping list on. But you never know!)

  • zelon88 4 hours ago
    I was thinking the other day about the longevity of useless data. One idea that floated around in my head was self expiring emails.

    I recently deleted about 40,000 emails. Most of them were identical, duplicate marketing emails. I was forced to do this to free up storage.

    That's when I realized something. I am paying my email provider for the full price for every byte of "represented" data. In reality, their distributed file systems could compress an arbitrary number of copies of these emails and only consume the amount of space that one email consumes. So 100,000 duplicate emails on the server are consolidated into one representation of the data, but each customer has to pay for each byte that is represented.

    The vendor stores a file once and charges full price every time they reproduce it for someone. If you have 10,000 copies of a file they only have to store it once but you will pay for every byte in all 10,000 copies.

    • Scoundreller 1 hour ago
      There were some early blog posts by the single person running mailinator.

      Since they only stored text, they would make a single db entry for each unique line of text that came in and just made more and more references to that.

      Even different emails… were mostly the same.
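
The line-level deduplication described above can be sketched as a small content-addressed store in Python. This is an illustration only, not Mailinator's actual design: the `LineStore` class, the use of SHA-256 as the key, and the in-memory dictionaries are all assumptions.

```python
import hashlib


class LineStore:
    """Store each unique line of text once; messages are lists of references."""

    def __init__(self) -> None:
        self._lines: dict[str, str] = {}           # digest -> line text
        self._messages: dict[str, list[str]] = {}  # message id -> line digests

    def add(self, message_id: str, body: str) -> None:
        digests = []
        for line in body.splitlines():
            digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
            self._lines.setdefault(digest, line)  # stored only on first sight
            digests.append(digest)
        self._messages[message_id] = digests

    def get(self, message_id: str) -> str:
        """Rebuild a message from its line references."""
        return "\n".join(self._lines[d] for d in self._messages[message_id])

    def unique_line_count(self) -> int:
        return len(self._lines)


store = LineStore()
store.add("a", "Big sale today!\nUnsubscribe here")
store.add("b", "Big sale tomorrow!\nUnsubscribe here")
print(store.unique_line_count())
```

Two marketing emails that differ in a single line share every other line in storage, which is the same effect the parent comment describes at the level of whole files.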

    • password4321 3 hours ago
      This is the Dropbox business model, especially when they encourage using their service to share files and it counts as space used in source and destination accounts.
  • iamwil 5 hours ago
    Yes. Sometimes when I'm doing research into the recent history of why certain technical decisions were made, and the arguments for or against, I find archive.org invaluable for piecing together a line of thought from back then. Recently, this was to look up what the debate between React's Functional components vs Signals was.

    Also, it's helpful to get perspective on the attitudes for or against a new technology in recent history. I remembered there were people that said "If you aren't writing a kernel, you don't have their problems, so you don't need git." Turns out that's not true. Now that git is everywhere, it's harder to remember whether or even if there was pushback against it.

    This was written about the insights from using git that he needed to highlight to people back then. https://keithp.com/blog/Repository_Formats_Matter/

    I often reference it, and if it wasn't still up, I'd have only web archive to rely on.

    So for me, lots of stuff I look at online (mainly blog posts) are worth saving. Sometimes, if the discussion is on a twitter thread, that too. Which makes me fear for the day Microsoft decides to do Github in, and we'd lose all the issues and comments.

  • pabs3 3 days ago
    If you're interested in that sort of thing, come hang out with ArchiveTeam:

    https://wiki.archiveteam.org/

  • wintermutestwin 6 hours ago
    There is a wealth of live performances on youtube that individuals have uploaded and that likely violate mpaa copyright crap.

    IMO, this content is of high cultural value and I fear it won’t be long that the goog suffers us to watch “their” content without infecting it with ads.

    I wish there was an easier way to self host this content with a way to organize and browse using tags.

    • wwweston 6 hours ago
      5 years ago I was working on a semi-novel crowdfunding platform that relied on video presentations. First iteration we used the YouTube API because hosting our own video seemed daunting and that worked fine for a bit. Over time we started to run into limits/errors/interruptions/audits at inconvenient times until one weekend I was like “screw it, let’s find out what the problems of self transcoding and hosting are.” Spent some time learning to use ffmpeg and throwing the results on our static resource pile. Tagging was a fairly straightforward lift. Honestly worked better than I’d have anticipated and was much less hassle. I’m sure we would have hit the problems if we’d reached a critical scaling point buuut that didn’t happen inside our year or so of having clients.
  • dehrmann 1 hour ago
    This feels like confirmation bias to me. The author seemed to genuinely consider the question, but didn't think critically about how little value he got from two decades of bookmarking and instead focused on how he could use this archive in the future.
  • og2023 1 hour ago
    We have become so cloud-native (god forbid!). Just recently I realised that I can save an interesting page to my hard drive instead of saving its link. What a wonderful world has opened since! It's so liberating to live without all these bs tools.
  • jll29 5 hours ago
    Any information created by humans is part of our "culture". You may consider it of no value, but someone else may beg to differ.

    I went to a fantastic talk a few years ago at the British Library about digitizing a substantial quantity of historic Australian newspapers. It was amazing to be able to read funeral announcements, product advertisements and other signals from the past showing us Australian culture from the 1800s.

    Since we leave much less behind in terms of physical assets (personal letters, postcards, personal diaries), we should at least aspire to archive more from the digital realm, or to future historians we'd look like a blank century.

  • tofof 6 hours ago
    I've recently come back to a PC game (B-17 the Mighty Eighth) from 2000 that, quite unexpectedly, is getting a remaster and potentially a port to VR. It had a thriving community for several years, with many mods and guides and knowledge contained in the single dominant forum (bombs-away.net). When it shuttered, the vast majority of that information was lost. Old workarounds for bugs in the engine and detailed instructions of exactly how certain mechanics work are unavailable. One popular youtuber who continued playing through at least 2010 maintained a dropbox that had most of the mods that were ever available, but not the forum posts explaining them. So, for example, there's a mod that survives there to let you replace a generic 'sign on the dotted line' handwriting with your own - but gone are the instructions of exactly how to apply it.

    When I had returned to the game after bombs-away.net had gone defunct, I posted my own personal archive to the GoG forum for the game. Now that I've returned to the Redux version I find my own files, with my personal notes, shared by a single other soul who had similarly maintained an archive, and apparently had collected mine at some point. I'm very glad to have helped preserve knowledge - but not everything of mine was there. Now that I've noticed the 2024 remaster effort and joined that community, I've been able to share files that were otherwise apparently completely lost - in particular, a set of images showing dimensions of certain common features in bombing targets, that allow estimating the total size of the target.

    Unfortunately, my own personal archive included many forum topics that I just dragged off shortcuts to. I can see the old titles of the pages from the surviving shortcut files. I remember the questions I had (and now have again) that those shortcuts held the answers to. But because I didn't save the page itself, it's.. gone. That's immensely frustrating.

    Yes, things are worth saving. Especially for topics with extensive information among a small niche audience that have a single point of failure. I've found an extension (SingleFileZ) that does a good job of archiving a web page with all embedded content into what's a zip file under the hood - so futureproof even if the extension disappears and it becomes difficult to simply open the file directly in browser.

    EDIT - montebicyclelo mentions SingleFile, which apparently is a continuation of SingleFileZ, with new features. SingleFileZ already allowed automatically saving every visited page in a tab (or even among all tabs), batch archiving of a list of urls, etc, so presumably SingleFile has all these capabilities and more.

  • stared 9 hours ago
    I often find myself revisiting old posts and stories. As with any human artifacts, most things aren't worth revisiting or are only meaningful in the moment. If they're gone, few people miss them.

    I'm a link hoarder myself (over 13k links on Pinboard: https://pinboard.in/u:pmigdal/). While I don't revisit most of them, some have proven invaluable for re-reading and sharing. I'm not sure about the typical half-life of internet content, but a lot disappears—whether because people stop paying for domains, official websites get reorganized (or their content removed), or other reasons.

    This is where the Internet Archive steps in, doing the essential work of a digital librarian. I often share links from its Wayback Machine, which has been a link-saver more times than I can count.

  • krick 6 hours ago
    How do you back up websites? I mean, it sounds trivial, but I kinda still haven't figured out what is the way. I sometimes think that I'd like some script to automatically make a copy of every webpage I ever link in my notes (it really happens quite often that a blog I linked some years ago is no more), and maybe even replace links to that mirror of my own, but all websites I've actually backed up by now are either "old-web" that are trivial to mirror, or basically required some custom grabber to be written by myself. If you just want to copy a webpage, often it either has some broken CSS&JS, missing images, because it was "too shallow", or otherwise it is too deep and has a ton of tiny unnecessary files that are honestly just quite painful to keep on your filesystem as it grows. Add to that Cloudflare, Captchas, ads (that I don't see when browsing with ublock and ideally wouldn't want them in my mirrored sites as well), cookie warning splash-screens, all sorts of really simple (but still above wget's paygrade) anti-scraping measures, you get the idea.

    Is there something that "just works"?
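
    The bookkeeping half of such a script (finding the links in a notes file and pointing them at a local mirror) can be sketched in a few lines. This is only a sketch under stated assumptions: the `mirror/<host>/<hash>.html` layout, the function names, and the `[local: ...]` marker are all hypothetical, and the actual page capture (the hard part described above) is left to whatever tool ends up working.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Deliberately simple pattern: http(s) URLs up to whitespace, a quote, or a closing bracket.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(notes: str) -> list[str]:
    """Return the unique URLs found in the notes text, in first-seen order."""
    seen = {}
    for match in URL_RE.finditer(notes):
        url = match.group(0).rstrip(".,;:")  # drop trailing sentence punctuation
        seen.setdefault(url, None)
    return list(seen)


def mirror_path(url: str, root: str = "mirror") -> str:
    """Map a URL to one stable local file name (hypothetical layout)."""
    host = urlsplit(url).netloc or "unknown-host"
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    return f"{root}/{host}/{digest}.html"


def rewrite_links(notes: str, root: str = "mirror") -> str:
    """Append a pointer to the local copy after each URL, keeping the original link."""
    def repl(match: re.Match) -> str:
        raw = match.group(0)
        url = raw.rstrip(".,;:")
        tail = raw[len(url):]  # punctuation that was stripped from the URL
        return f"{url} [local: {mirror_path(url, root)}]{tail}"

    return URL_RE.sub(repl, notes)
```

    Keeping the original URL next to the local pointer, rather than replacing it, means the notes stay usable even if the mirror is moved or lost.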

    • wis 6 hours ago
      For saving a webpage you have open, I use a browser extension called SingleFile, I've been using it for a while (IIRC I discovered it on HN's front page a few years ago), in my experience it "just works", works really well.

      You click the "browser action" icon/button of the extension and it saves a single HTML file that looks exactly like the webpage you have open.

      From its FAQ[1] on GitHub:

        # What does SingleFile do?
        SingleFile is a browser extension designed to help users save web pages as complete, self-contained files. The extension's primary function is to capture an entire web page, including its HTML, CSS, JavaScript, images, and other resources, and package them into a single HTML file.
      
        # I am a web archivist, is it ok to use SingleFile to archive content?
        No, SingleFile is not a tool used by professionals to archive content on the Web, especially in the academic field. Professionals prefer to rely on tools based on the WARC specification instead.
      
      [1] https://github.com/gildas-lormeau/SingleFile/blob/master/faq...
      • throw0101a 5 hours ago
        > For saving a webpage you have open

        There's also print-to-PDF that most OSes now have.

        • wis 5 hours ago
          Yeah, pretty much all browsers on all OSes have print-to-PDF/save-to-PDF, I prefer saving an HTML file over saving a PDF file for 3 reasons:

          1. SingleFile allows me to save an HTML file that looks exactly like the webpage I saved. I never used a save-to-PDF functionality in any browser that allowed me to save a PDF that looks exactly like the webpage I was saving/printing. I wish browsers would implement that. Somebody did that once: they patched Chromium to save a web page as SVG[1]. AFAIK if you can save to SVG you can also save to PDF with not much modification to the code; unfortunately the fork is not maintained anymore.

          2. The HTML files that SingleFile creates are responsive (just like the webpage you had open), PDF is not responsive. I like that because it makes it easier to read the webpage I saved on my phone later, with a PDF file you saved on your desktop, you have to pinch to zoom and pan while you read it on your phone.

          3. HTML-files/Webpages are accessible to screen readers and my browser's extensions work on them, extensions don't work on PDF files (they _can_ work on HTML files opened from disk, if you allow/enable it in the extension's settings).

          [1] https://news.ycombinator.com/item?id=33584941

    • rambambram 3 hours ago
      I use WebScrapBook, an extension for Firefox. It seems to save a whole page in one file, and I can tweak a lot of the settings.

      Sometimes I wonder if there's an even easier browser-builtin function that does the same?

    • Dwedit 6 hours ago
      There are extensions like "Save Page WE" that will dump the current state of the DOM to an HTML file, including CSS and Images, but these are static and don't make the scripting work.
  • nilamo 9 hours ago
    Personally, I like that the internet is ephemeral. It matches real life in that way. I would rather see the internet as a means of connecting people over large distances (across space, Mars, etc), maintaining 20,000 copies of every irrelevant thing is just silly.
    • lxgr 8 hours ago
      The problem is that not everything it has replaced was originally ephemeral.

      In a way, the Internet is both too ephemeral (self-hosted blogs disappear, YouTube videos get taken down) and too persistent at the same time; I don't think that most Twitter posts of non-public figures would need to remain public forever by default, for example, and I don't think I need to mention various data breaches.

      The Internet Archive somewhat mitigates the first issue, but it makes me pretty nervous that there's essentially just one organization doing what used to be much more distributed to various physical libraries.

      For the second one, I hope we'll see better solutions (both technical and social) as the technology and our interactions with it mature.

    • qwertox 9 hours ago
      > Personally, I like that the internet is ephemeral.

      It is not. It is only for us normal people. But for the companies which log our lives in order to then capitalize on it, the internet is not ephemeral. They have copies of videos, pages, podcasts, whatever it is that can be found there.

      Why would you want those companies to know more about yourself than you do?

      • zamadatix 9 hours ago
        Archive.org or Google can cache more of the internet than I do while still having the majority of the content be ephemeral.

        I'd also hazard to guess most people in this camp would want these companies to also not store these things the same as they don't want people to.

      • Barrin92 4 hours ago
        >Why would you want those companies to know more about yourself than you do?

        That's not a question of wants, companies will always know more about you than you, for the simple reason that even if you had all their data you have no means to extract any meaning from it. It requires immense organization and resources, increasingly so as the rate of data production increases.

        For that reason the correct response isn't to engage in the same hoarding and privacy abuse of the companies, it's like bringing a knife to a tank fight, but to 1. make sure you don't produce that data to begin with through privacy protections and technical means and 2. create environments in which you have ownership of your data, instead of businesses.

  • Macuyiko 4 hours ago
    From an age perspective (but the crowd here will not like that): before, I trusted that I could always find things again, so I didn't need to save them. Now I can't anymore, but I don't care so much.
  • neilv 5 hours ago
    One thing that is worth saving is the PDF manuals for physical products that you own.

    These sometimes disappear from the Web. Or disappear except for some third-party site that modifies and/or paywalls them.

    Also, save the occasional important support info Web pages for those products. You'll know it when you see it. And if you don't save it now, it might be gone when you need it.

    You don't need a fancy system for this. I just made a directory `~/doc/`, and started dropping files into it. Someday, I'll take the time to merge this with `~/wiki/`, but for now, I'm capturing the information with low friction, which is most important.

    • Groxx 5 hours ago
      And even when they don't disappear, they still end up dozens of weird pages deep that none of the on-site help text or search points to correctly due to the various pointless redesigns the site has gone through.

      But hey, there's more whitespace now.

  • greatgib 8 hours ago
    Sometimes you have strange obsessions or a strange mindset related to your technological habits. And you might easily think that it is only you who is weird, not thinking straight. If you are the only one doing something, you are probably wrong.

    And then, hopefully, there are nice personal blog posts like this one, showing you that you are not alone having some peculiar habits and so that it might make sense even if most people don't even think about it.

    I have the exact same feeling when I discover through HN, blog posts and events that I'm not the only one having my web browsers full of tabs. Literally having thousands of tabs.

  • jscottbee 6 hours ago
    I created a local-only web app to wrap up some of my favorite web haunts, with HN being one of them. It allows me to look at the headlines, and save any of them in a local SQLite db that the app maintains.

    https://i.postimg.cc/v8znk92x/ycomb-hn.png
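
    For readers wondering what the storage side of such an app might look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, columns, and helper functions are assumptions made for illustration; nothing here is taken from the app in the screenshot.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS saved_headlines (
               url      TEXT PRIMARY KEY,
               title    TEXT NOT NULL,
               source   TEXT NOT NULL,
               saved_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def save_headline(conn: sqlite3.Connection, url: str, title: str, source: str = "HN") -> bool:
    """Insert a headline; return False if that URL was already saved."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO saved_headlines (url, title, source) VALUES (?, ?, ?)",
        (url, title, source),
    )
    conn.commit()
    return cur.rowcount == 1


def search(conn: sqlite3.Connection, term: str) -> list[tuple[str, str]]:
    """Return (title, url) pairs whose title contains the search term."""
    rows = conn.execute(
        "SELECT title, url FROM saved_headlines WHERE title LIKE ? ORDER BY title",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

    Using the URL as the primary key makes saving the same story twice a no-op, which suits a click-to-save workflow.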

  • willjp 7 hours ago
    This resonates so strongly with me. I worked a job where I needed to use outdated Microsoft toolchains to build plugins for software, and the documentation was just -- gone. Good luck. I've been almost compulsively saving the things that feel important to me, while seldom browsing them for years -- all the while hunting for a faster and more intuitive recall system that lets me find them later.

    My ex, however had a much more fluid relationship with the internet and media in general. They liked new things, and didn't particularly care if they enjoyed something and it faded into obscurity. I feel like that's the winning mentality, but I just can't bring myself to embrace it.

  • profsummergig 1 hour ago
    Instead of saving them as PDFs, I started saving web pages using a Chrome extension called Single File [1] (after testing it, of course).

    To my dismay, some saved files (.htm extension) didn't open when I wanted to open them.

    So I'm glad people are discussing ways to archive web pages that reproduce the original page faithfully.

    [1] https://chromewebstore.google.com/detail/singlefile/mpiodijh...

  • Viktoire 9 hours ago
    When I save things, I try to make sure that it'll be immediately useful to me once I find it again.

    I'll highlight, summarise and take notes of what I save. Or some combination of those. If I don't find anything new or directly applicable to my life, I'll let it pass by.

    This approach isn't good for archival purposes, but I hesitate to save a lot of things that I'll never read again.

    • ghaff 9 hours ago
      I'm going through my file cabinets right now. I'll keep a few things that catch my eye but I'll likely throw out most of it. The odd 25 year old computer magazine is probably interesting but not all of them collectively for the most part. And I'm certainly not going to index them in a way that they'd be useful to me.
      • galleywest200 8 hours ago
        You can probably sell or donate those old magazines to a collector, or a kid interested in that stuff. At the very least drop them off at a thrift store instead of just dumping them.
        • ghaff 8 hours ago
          Thrift stores don't want a ton of old paper. There are a lot of things that someone somewhere would probably like but I'm not going to track them down or get them there. Mostly it's not magazines anyway. It's a bunch of articles I ripped out over the years.

          The one thing I have in my garage I know someone would want is a big pile of laserdiscs. But, again, a thrift shop (or my library) wouldn't want them and I live pretty far out from a major city. Probably will try Craigslist post-winter though as I'm trying to declutter.

          • buildsjets 3 hours ago
            Laserdiscs appear and gradually disappear at my local thrift, so someone must be buying them. Now in the vinyl records pile, there are copies of Mantovani, Jim Nabors, and Herb Alpert which have been there for years, but anything classic rock or newer sells the same day.
            • ghaff 2 hours ago
              In the spring I'll probably do a take-it-or-leave-it Craigslist listing for the whole pile at a nominal price and, if that doesn't work, just take it up to the local thrift and I'll at least have tried.
  • deskr 5 hours ago
    "stuff online" is an exceptionally coarse filter to deem something worthy of saving.
  • mediumsmart 4 hours ago
    I used to think so and then I ran out of space
  • btbuildem 6 hours ago
    I think some stuff is -- the stuff that is crucial to rebuilding all the other stuff.
  • paulcole 1 hour ago
    I’m the opposite of most of the “archivists” on HN. I delete everything and save nothing. I have maybe 25 sheets of paper in my apartment, including social security card and birth certificate.

    Saving stuff just isn’t fun or useful for me. Never for more than a passing moment have I thought, “Boy I wish I had saved that whatever.”

    Old people are the worst about this stuff. They think/hope somebody will want it and then just make it the next generation’s problem.

    I told my dad if he thinks it has value, give it away while he’s alive. I have neither the interest nor the space to deal with it so it’s going straight into the trash.

  • asimpletune 10 hours ago
    It reminds me of the cool links page I see now and then.
  • RajT88 9 hours ago
    Stuff online is absolutely worth saving. It is a window into the past - what people concerned themselves with, what they loved and hated.

    Scholars will write papers on this era, speculating what it was like and how it fit into what came after.

    The web documents the massive societal changes underway which do not relate to the internet directly. Things like changes in transportation technology, medicine, sexuality and gender, and how your average people felt about all of it. Scholars will data mine those opinions to understand who felt what ways and why, with the benefit of hindsight. New knowledge will come of it.

    So yeah! It is all worth saving.

  • underseacables 10 hours ago
    I suppose it comes down to what the purpose of such archiving is.

    I think it's the preservation of information, but I also believe 90% is absolutely pointless. There is just so much of it, and data storage so cheap, that it makes sense to just save everything.

    • dreamcompiler 10 hours ago
      That data storage is also ephemeral. None of it will last as long as a paper note, unless some human goes to the trouble of copying it all onto new drives with new software every ten years or so.
      • Atreiden 9 hours ago
        With a proper NAS and RAID10 for double parity, it's a bit like the Ship of Theseus. Just keep swapping out drives when they become unhealthy and you never have to rebuild or migrate.
        • ninalanyon 9 hours ago
          Eventually the controller will die and eventually compatible ones will no longer be produced or will at least be inconvenient to obtain or commission and hence expensive.

          Paper lasts for centuries without any attention beyond keeping it moderately dry and away from things that eat it.

          • emptiestplace 9 hours ago
            No sane person uses hardware RAID in 2024, if that's what you're referring to.
            • zamadatix 9 hours ago
              Whether you're using hardware RAID or not you still need a hardware storage controller of some type which accepts the new disks you can buy and works with the NAS. What they are saying is eventually that'll be more $ and time than just migrating off the system would be. From ENIAC to now could fit in one lifespan, would you still be maintaining a home floppy drive backup system in the 2040s or just save the time and effort with a migration?
              • jpalawaga 3 hours ago
                sure, you can always move the old storage mechanism to something new if it is too cumbersome.

                why still back up floppies if you could just move the data to a single dvd, or throw it on the SAN?

                RAID is just algorithms, the actual transport doesn't matter (i.e. spinning platter and solid state both use SATA connectors).

    • danielbln 10 hours ago
      Data rots though, you can't just save it once and be done with it. You have to migrate it across storage mediums, formats etc. It's a recurrent effort/cost.
      • bdhcuidbebe 9 hours ago
        More planning for less effort.

        Do your research first. Use standards

        Eg: html, pdf, h264/h265/av1 in mp4 container, chd, zip and so on depending on what you are storing.

        • HeatrayEnjoyer 7 hours ago
          On what physical medium?

          I have 1 terabyte of data in 1860, how do I make sure the storage medium is still intact in 2024?

          • TacticalCoder 3 hours ago
            > I have 1 terabyte of data in 1860, how do I make sure the storage medium is still intact in 2024?

            Storage keeps growing and the price of storage keeps going down.

            My DOS and even some C64 source code made it to this day on backups (DVDs, HDDs, SSDs, USB memory sticks, etc., both online and offline) and to ZFS pools. Medium that didn't exist in the 80s/early 90s.

            Floppy disks -> 40 MB HDD -> 6.4 GB HDD -> 80 GB HDD -> 500 GB HDD -> 240 GB SSD -> 1 TB NVMe SSD.

            You get the idea.

            The way you get sure you still have your data is by not focusing on the medium but by focusing on the fact that data is data.

            Medium comes and goes. Data can (and should) be copied to new medium.

            Not unlike:

                /home/pub/backups/oldBackups/DOSbackups/...
                ...Conner80MBHDDbackups/backups/oldBackups/Commodore64backups/...
            
            Some people are going to complain about the naming but I have all my emails, except for six months, going back to when I started using the Internet. And I still have nearly all of my data since I started using computers. 8-bit computers.

            Do you?

            I don't care about naming much. "search, don't sort".

            We've got emulators for just about every and any system. My vintage arcade cab has both real PCBs and a Pi running an emulator with thousands of arcade games on it.

            You can already, today, emulate, say, the Raspberry Pi model you want using QEMU. There are container files that'll gladly do that for you.

            Unless civilization ends there's simply not a world in which, say, PNG, JPG and x265 files aren't readable. This just won't happen.

            FWIW I'm paranoid about the integrity of my data: I've got my own naming scheme where a cryptographic hash is added to many of my files.

            For example:

                    DSC_91394-b3-ae4f2877d3.jpg
            
            This means "This file's Blake3 checksum begins with ae4f2877d3".

            I then have a script doing statistical sampling: I enter a percentage and that percentage of files where a cryptographic hash is part of the filename are checked, randomly (if I enter 100 then 100% of the files are tested).

            If I enter for example '7', then 7% of the files are tested and then there's high probability all checksums are correct.

            > On what physical medium?

            That is the wrong question.

    • sigio 10 hours ago
      Well... storage is cheap, but not cheap enough to save everything, with just usenet being in the 400TB/day range these days. Sure, it's cheap enough to save every webpage you visit during your life, but probably not cheap enough to save every video you click on youtube or watch on a streaming-service, and all the music you listen to all day.

      Though just the music compressed in opus at 128kbit might work ok, 60 years of 24/7 128kbit is 30TB, so that would fit on 1 large HDD currently.
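
      The 30TB figure is easy to check. The short calculation below assumes "128kbit" means 128,000 bits per second and uses 365.25-day years; both are assumptions about what the estimate intended.

```python
BITRATE_BITS_PER_S = 128_000          # assumed meaning of "128kbit"
SECONDS_PER_YEAR = 365.25 * 24 * 3600
YEARS = 60

total_bytes = BITRATE_BITS_PER_S / 8 * SECONDS_PER_YEAR * YEARS
total_tb = total_bytes / 1e12         # decimal terabytes, as drive vendors count
total_tib = total_bytes / 2**40       # binary tebibytes, as many OS tools report

print(f"{total_tb:.1f} TB ({total_tib:.1f} TiB)")
```

      That comes to a little over 30 TB in decimal units, so the estimate holds.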

      • saulpw 8 hours ago
        Music is actually an ideal candidate. I don't listen to music all day, and when I do listen to it, it's often something I've listened to before. My current collection is about 200GB and that includes a ton of stuff I've never listened to; it seems reasonable that a full life's worth of music could fit in 1TB, easily.
      • add-sub-mul-div 6 hours ago
        If that much data comes across Usenet daily then how do services afford the storage to offer years of retention?

        You can't dedupe the large binary files because they're encoded in small parts likely differently every time they're posted.

  • renewiltord 7 hours ago
    In general, I am pro-turnover where there is rivalry: ceteris paribus keep the newer thing. However, information is so cheap as to be effectively non-rivalrous so I am considering running my own archival and to keep kagi's small sites etc. alive. Unfortunately, there is not a good tool for this that matches whatever Archive.org has. ArchiveHub needs routine management to keep the feed up and viewing it is not that easy. I'm sure we'll come up with stuff.

    The other thing is that searching for the long tail is near impossible. The big sites dominate Google, so I need something like marginalia to actually get to the old stuff that it used to be so easy to find. Because of the median user having simple queries, some questions are no longer answerable on Google: they are dominated by the median user and never show up.

  • swayvil 9 hours ago
    Curve smoothing. Chaikin's algorithm and Jarek's tweak etc. Very clever and nice way of making angular geometry curvy. Constructive geometry stuff.

    There were like a dozen algs. I kept links to nice papers with diagrams. Then they started disappearing. Now I'd be pressed to find 2.

    This is really useful info that is apparently disappearing. So yes, it happens, and maybe you should save that stuff.

  • paulpauper 9 hours ago
    Digital storage is free; yes, save it all
    • lxgr 8 hours ago
      Please do share where I can reliably store my backups for free!
      • fragmede 8 hours ago
        > Backups are for wimps. Real men upload their data to an FTP site and have everyone else mirror it.

        — Linus Torvalds

        • LinuxBender 8 hours ago
          This does still happen. Microsoft may nuke a git repo and someone has to figure out who has the latest version of the entire repo with all the latest commits of every branch.
        • theandrewbailey 8 hours ago
          The vast majority of people aren't privileged enough to have anyone mirror their data.
        • lxgr 7 hours ago
          But how do I get everyone to mirror my gigabytes of encrypted photo backups?
          • paulpauper 7 hours ago
            just upload them to social media accounts. AFAIK Twitter, Facebook, and YouTube do not have storage limits. no deletion for inactivity either.
            • lxgr 3 hours ago
              They don't allow uploading large binary blobs either, though, and steganographically storing gigabytes of data with probably terabytes of overhead sounds like a quick way to get banned.
        • paulpauper 7 hours ago
          dump it on Wikipedia. AFAIK wiki never removes anything. it just gets buried in an edit history. or Wikimedia image files
          • lxgr 3 hours ago
            That obviously can't be true, or spammers would be all over it, using Wikimedia as a free image host.
  • impure 7 hours ago
    The rise of LLMs has really devalued saving stuff online. What is the point of saving an article if I could just ask ChatGPT to create it and it would probably do a pretty good job? It’s still worth keeping notes and stuff that may be hard to find but the majority of things online can easily be reproduced and are not worth saving.
    • vouaobrasil 7 hours ago
      I think you are right. But I think the answer goes deeper: we have encouraged a culture where the most supported information is also the most superficial. The essence of individual experience itself has long been discouraged on the web in favour of SEO and the trashy news and the trivial.

      So the fact that ChatGPT can replace much of the web actually says less about the marvel of ChatGPT and more about the lack of anything really worthwhile because the profound just happens to be the least economically valuable.