Using S3 as a Container Registry

(ochagavia.nl)

319 points | by jandeboevrie 319 days ago

24 comments

  • stabbles 319 days ago
    The OCI Distribution Spec is not great, it does not read like a specification that was carefully designed.

    > According to the specification, a layer push must happen sequentially: even if you upload the layer in chunks, each chunk needs to finish uploading before you can move on to the next one.

    As far as I've tested with DockerHub and GHCR, chunked upload is broken anyways, and clients upload each blob/layer as a whole. The spec also promotes `Content-Range` value formats that do not match the RFC7233 format.

    (That said, there's parallelism on the level of blobs, just not per blob)

    Another gripe of mine is that they missed the opportunity to standardize pagination of listing tags, because they accidentally deleted some text from the standard [1]. Now different registries roll their own.

    [1] https://github.com/opencontainers/distribution-spec/issues/4...

    • eadmund 319 days ago
      > The OCI Distribution Spec is not great, it does not read like a specification that was carefully designed.

      That’s par for everything around Docker and containers. As a user experience Docker is amazing, but as technology it is hot garbage. That’s not as much of a dig on it as it might sound: it really was revolutionary; it really did make using Linux namespaces radically easier than they had ever been; it really did change the world for the better. But it has always prioritised experience over technology. That’s not even really a bad thing! Just as there are tons of boring companies solving expensive problems with Perl or with CSVs being FTPed around, there is a lot of value in delivering boring or even bad tech in a good package.

      It’s just sometimes it gets sad thinking how much better things could be.

      • steve1977 319 days ago
        > it really did change the world for the better.

        I don’t know about that (hyperbole aside). I’ve been in IT for more than 25 years now. I can’t see that Docker containers actually delivered any tangible benefits in terms of end-product reliability or velocity of development, to be honest. This might not necessarily be Docker’s fault though; maybe it’s just that all the potential benefits get eaten up by things like web development frameworks and Kubernetes.

        But at the end of the day, today’s Docker-based web app development delivers less than fat-client desktop app development delivered 20 years ago, as sad as that is.

        • 9dev 319 days ago
          If you haven’t seen the benefits, you’re not in the business of deploying a variety of applications to servers.

          The fact that I don’t have to install dependencies on a server, or set up third-party applications like PHP, Apache, Redis, and the myriad of other packages anymore, or manage config files in /etc, or handle upgrades of libc gracefully, or worry about rolling restarts and maintenance downtime… all of this was solvable before, but has become radically easier with containers.

          Packaging an application and its dependencies into a single, distributable artifact that can be passed around and used on all kinds of machines was a glorious success.

          • PaulHoule 319 days ago
            Circa 2005 I was working at places where I was responsible for 80 and 300 web sites respectively using a large range of technologies. On my own account I had about 30 domain names.

            I had scripts that would automatically generate the Apache configuration to deploy a new site in less than 30 seconds.

            At that time I found that most web sites have just a few things to configure: often a database connection, the path to where files are, and maybe a cryptographic secret. If you are systematic about where you put your files and how you do your configuration running servers with a lot of sites is about as easy as falling off a log, not to mention running development, test, staging, prod and any other sites you need.

            I have a Python system now with gunicorn servers and celery workers that exists in three instances on my PC, because I am disciplined and everything is documented I could bring it up on another machine manually pretty quickly, probably more quickly than I could download 3GB worth of docker images over my ADSL connection. With a script it would be no contest.

            There also was a time I was building AMIs and even selling them on the AMZN marketplace and the formula was write a Java program that writes a shell script that an EC2 instance runs on boot, when it is done it sends a message through SQS to tell the Java program to shut down and image the new machine.

            If Docker is anything it is a system that turns 1 MB worth of I/O into 1 GB of I/O. I found Docker was slowing me down when I was using a gigabit connection, I found it basically impossible to do anything with it (like boot up an image) on a 2MB/sec ADSL connection, with my current pair of 20MB/s connections it is still horrifyingly slow.

            I like how the OP is concerned about I/O speed and bringing it up and I think it could be improved if there was a better cache system (e.g. Docker might even work on slow ADSL if it properly recovered from failed downloads)

            However I think Docker has a conflict between “dev” (where I’d say your build is slow if you ever perceive yourself to be waiting) and “ops” (where a 20 minute build is “internet time”)

            I think ops is often happy with Docker, some devs really seem to like it, but for some of us it is a way to make a 20 sec task a 20 minute task.

            • cogman10 319 days ago
              And I'm guessing with this system you had a standard version of Python, Apache, and everything else. I imagine that if you wanted to update to the latest version of Python, it involved a long process of making sure those 80 or 300 websites didn't break because of some random undocumented breaking change.

              As for Docker image size, it really just depends on dev discipline, for better or for worse. The nginx image, for example, adds about 1MB of data on top of whatever you did with your website.

            • belthesar 318 days ago
              You hit a few important notes that are worth keeping in mind, but I think you handwave some valuable impacts.

              By virtue of shipping around an entire system's worth of libraries as a deployment artifact, you are indeed drastically increasing the payload size. It's easy to question whether payload efficiency matters now that >100 and even >1000 Mbit internet connections are available to the home, but that is certainly not the case everywhere. That said, assuming smart squashing of image deltas and basing off of a sane upstream image, much of that pain is felt only once.

              You bring up that you built a system that helped you quickly and efficiently configure systems, and that discipline and good systems design can bring many of the same benefits that containerized workloads do. No argument! What the Docker ecosystem provided however was a standard implemented in practice that became ubiquitous. It became less important to need to build one's own system, because the container image vendor could define that, using a collection of environment variables or config files being placed in a standardized location.

              You built up a great environment, and one that works well for you. The containerization convention replicates much of what you developed, with the benefit that it grabbed a majority mindshare, so now many more folks are building with things like standardization of config, storage, data, and environment in mind. It's certainly not the only way to do things, and much as you described, it's not great in your case. But if something solves a significant amount of cases well, then it's doing something right and well. For a not-inconsequential number of people, trading bandwidth and storage for operational knowledge and complexity is a more than equitable trade.

          • mnahkies 319 days ago
            Agreed, I remember having to vendor runtimes to my services because we couldn't risk upgrading the system installed versions with the number of things running on the box, which then led to horrible hacks with LD_PRELOAD to work around a mixture of OS / glibc versions in the fleet. Adding another replica of anything was a pain.

            Now I don't have to care what OS the host is running, or what dependencies are installed, and adding replicas is either automatic or editing a number in a config file.

            Containerization and orchestration tools like k8s have made life so much easier.

          • nitwit005 318 days ago
            As you note, it was all solvable before.

            A lot of us were just forced to "switch" from VMs to Docker; Docker that still got deployed to a VM.

            And then we got forced to switch to podman as they didn't want to pay for Docker.

            • 9dev 318 days ago
              > As you note, it was all solvable before.

              Washing clothes was possible before people had a washing machine, too; I’m not sure they would want to go back to that, though.

              I was there in the VM time, and I had to set up appliances shipped as a VM instance. It was awful. The complexity around updates and hypervisors, and all that OS adjustment work just to get a runtime environment going, that just disappeared with Docker (if done right, I’ll give you that).

              Organisations manage to abuse technology all the time. Remember when Roy Fielding wrote about using HTTP sensibly to transfer state from one system to another? Suddenly everything had to be „RESTful“, which for most people just meant that you tried to use as many HTTP verbs as possible and performed awkward URL gymnastics to get speaking resource identifiers. Horrible. But all of this doesn’t mean REST is a bad idea in itself - it’s a wonderful one, in fact, and can make an API substantially easier to reason about.

          • steve1977 319 days ago
            I’m aware of all of that, I’m just saying that this has not translated into more reliable and better software in the end, interestingly enough. As said, I’m not blaming Docker, at least not directly. It’s more that the whole “ecosystem” around it seems to have so many disadvantages that in the end they outweigh the advantages of Docker.
            • derefr 319 days ago
              It has translated to reliable legacy software. You can snapshot a piece of software, together with its runtime environment, at the point when it's still possible to build it; and then you can continue to run that built OCI image, with low overhead, on modern hardware — even when building the image from scratch has long become impossible due to e.g. all the package archives that the image fetched from going offline.

              (And this enables some increasingly wondrous acts of software archaeology, due to people building OCI images not for preservation, but just for "use at the time" — and then just never purging them from whatever repository they've pushed them to. People are preserving historical software builds in a runnable state, completely by accident!)

              Before Docker, the nearest thing you could do to this was to package software as a VM image — and there was no standard for what "a VM image" was, so this wasn't a particularly portable/long-term solution. Often VM-image formats became unsupported faster than the software held in them did!

              But now, with OCI images, we're nearly to the point where we've e.g. convinced academic science to publish a paper's computational apparatus as an OCI image, so that it can be pulled 10 years later when attempting to replicate the paper.

              • steve1977 319 days ago
                > You can snapshot a piece of software, together with its runtime environment, at the point when it's still possible to build it

                I think you’re onto part of the problem here. The thing is that you have to snapshot a lot of nowadays software together with its runtime environment.

                I mean, I can still run Windows software (for example) that is 10 years or older without that requirement.

                • 9dev 319 days ago
                  The price for that kind of backwards compatibility is a literal army of engineers working for a global megacorporation. Free software could not manage that, so having a pragmatic way to keep software running in isolated containers seems like a great solution to me.
                  • steve1977 319 days ago
                    There’s an army of developers working on Linux as well, employed by companies like IBM and Oracle. I don’t see a huge difference to Microsoft here to be honest.
                    • sangnoir 319 days ago
                      You'd have a better time working with Windows 7 than a 2.x Linux kernel. I love Linux, but Microsoft has supported its operating systems for longer.
                • ahnick 319 days ago
                  What are you even talking about? Being able to run 10 year old software (on any OS) is orthogonal to being able to build a piece of software whose dependencies are completely missing. Don't pretend like this doesn't happen on Windows.
                  • steve1977 319 days ago
                    My point was that a lot of older software, especially desktop apps, did not have such wild dependencies. Therefore this was less of an issue. Today, with Python and with JavaScript and its NPM hell, it of course is.
                    • sangnoir 319 days ago
                      > My point was that a lot of older software, especially desktop apps, did not have such wild dependencies. Therefore this was less of an issue.

                      Anyone who worked with Perl CGI and CPAN would tell you managing dependencies across environments has always been an issue. Regarding desktop software; the phrase "DLL hell" precedes NPM and pip by decades and is fundamentally the same dependency management challenge that docker mostly solves.

                      • steve1977 319 days ago
                        DLL hell was also essentially fixed decades ago. And rarely as complex as what you see nowadays.
                      • ahnick 319 days ago
                        Exactly!
            • bzmrgonz 317 days ago
              I think the disconnect is in viewing your trees and not viewing the forest. Sure you were a responsible disciplined tree engineer for your acres, but what about the rest of the forest? Can we at least agree that docker made plant husbandry easier for the masses world-wide??
            • 9dev 319 days ago
              I’m not sure I would agree here: from my personal experience, the increasing containerisation has definitely nudged lots of large software projects to behave better; they don’t spew so many artifacts all over the filesystem anymore, for example, and increasingly adopt environment variables for configuration.

              Additionally, I think lots of projects became able to adopt better tooling faster, since the barrier to use container-based tools is lower. Just think of GitHub Actions, which suddenly enabled everyone and their mother to adopt CI pipelines. That simply wasn’t possible before, and has led to more software adopting static analysis and automated testing, I think.

              • steve1977 319 days ago
                This might all be true, but has this actually resulted in better software for end users? More stability, faster delivery of useful features? That is my concern.
                • watermelon0 319 days ago
                  For SaaS, I'd say it definitely improved and sped up delivery of the software from development machine to CI to production environment. How this translates to actual end users is totally up to the developers/DevOps/etc. of each product.

                  For self-hosted software, be it for business or personal use, it immensely simplified how a software package can be pulled, and run in isolated environment.

                  Dependency hell is avoided, and you can easily create/start/stop/delete a specific software, without affecting the rest of the host machine.

        • pyrale 319 days ago
          > But at the end of the day, today’s Docker-based web app development delivers less than fat-client desktop app development delivered 20 years ago, as sad as that is.

          You mean, aside from not having to handle installation of your software on your users' machines?

          Also I'm not sure this is related to docker at all.

          • steve1977 318 days ago
            I actually did work in software packaging (amongst other things) around 20 years ago. This was never a huge issue to be honest, and neither was deployment.

            I know, in theory this stuff all sounds very nice. With web apps, you can "deploy" within seconds ideally, compared to say at least a couple of minutes or maybe hours with desktop software distribution.

            But all of that doesn't really matter if the endusers now actually have to wait weeks or months to get the features they want, because all that new stuff added so much complexity that the devs have to handle.

            And that was my point. In terms of enduser quality, I don't think we have gained much, if anything at all.

        • supriyo-biswas 319 days ago
          Being able to create a portable artifact with only the userspace components in it, one that can be shipped and run anywhere with minimal fuss, is something that didn't really exist before containers.
          • docandrew 319 days ago
            Java?
            • yjftsjthsd-h 319 days ago
              There were multiple ways to do it as long as you stayed inside one very narrow ecosystem; JARs from the JVM, Python's virtualenv, kind of PHP, I think Ruby had something? But containers gave you a single way to do it for any of those ecosystems. Docker lets you run a particular JVM with its JARs, and an exact version of the database behind that application, and the Ruby on Rails in front of it, and all these parts use the same format and commands.
        • bandrami 319 days ago
          25 years ago I could tell you what version of every CPAN library was in use at my company (because I installed them). What version of what libraries are the devs I support using now? I couldn't begin to tell you. This makes devs happy but I think has harmed the industry in aggregate.
          • twelfthnight 319 days ago
            Because of containers, my company now can roll out deployments using well defined CI/CD scripts, where we can control installations to force usage of pull-through caches (GCP Artifact Registry). So it actually has that data you're talking about, but instead of living in one person's head it's stored in a database and accessible to everyone via an API.
            • bandrami 318 days ago
              Tried that. The devs revolted and said the whole point of containers was to escape the tyranny of ops. Management sided with them, so it's the wild west there.
              • twelfthnight 318 days ago
                Huh. I actually can understand devs not wanting to need permission to install libraries/versions, but with a pull-through cache there are no restrictions save for security vulnerabilities.

                I think it actually winds up speeding up ci/cd docker builds, too.

      • KronisLV 319 days ago
        > As a user experience Docker is amazing, but as technology it is hot garbage.

        I mean, Podman exists, as do lots of custom build tools and other useful options. Personally, I mostly just stick with vanilla Docker (and Compose/Swarm), because it's pretty coherent and everything just fits together, even if it isn't always perfect.

        Either way, agreed about the concepts behind the technology making things better for a lot of folks out there, myself included (haven't had prod issues with mismatched packages or inconsistent environments in years at this point, most of my personal stuff also runs on containers).

      • derefr 319 days ago
        Yeah, but the Open Container Initiative is supposed to be the responsible adults in the room taking the "fail fast" corporate Docker Inc stuff, and taking time to apply good engineering principles to it.

        It's somewhat surprising that the results of that process are looking to be nearly as fly-by-the-seat-of-your-pants as Docker itself is.

      • belter 319 days ago
        Was it really so amazing? Here is half a Docker implementation, in about 100 lines of Bash...

        https://github.com/p8952/bocker

        • redserk 319 days ago
          Lines of code is irrelevant.

          Docker is important because:

          1) it made a convenient process to build a “system” image of sorts, upload it, download it, and run it.

          2) (the important bit!) Enough people adopted this process for it to become basically a standard

          Before Docker, it wasn't uncommon to ship some complicated apps in VMs. Packaging those was downright awful with all of the bespoke scripting needed for the various steps of distribution. And then you get a new job? Time to learn a brand new process.

          • Twirrim 318 days ago
            I guess Docker has been around long enough now that people have forgotten just how much of an absolute pain it used to end up being. Just how often I'd have to repeat the joke: Them: "Well, it works on my machine!" Me: "Great, back up your email, we're putting your laptop in production..."
        • samlinnfer 319 days ago
          The other half is the other 90%.

          Looking at it now, it won't even run in the latest systemd, which now refuses to boot with cgroups v1. Good luck even accessing /dev/null under cgroups v2 with systemd.

        • greiskul 319 days ago
          And as the famous Hacker News comment goes, Dropbox is trivial by just using FTP, curlftpfs and SVN. Docker might have many faults, but anybody who dealt with the problems it aimed to solve knows that it was revolutionary in simplifying things.

          And for people that disagree, please write a library like TestContainers using cobbled together bash scripts, that can download and cleanly execute and then clean up almost any common use backend dependency.

    • mschuster91 319 days ago
      On top of that, it's either the OCI spec that's broken or it's just AWS being nuts, but unlike GitLab and Nexus, AWS ECR doesn't support automatically creating folders (e.g. "<acctid>.dkr.ecr.<region>.amazonaws.com/foo/bar/baz:tag"); it can only do flat storage, so you end up with either seriously long image names or long tags. [1]

      Yes you can theoretically create a repository object in ECR in Terraform to mimic that behavior, but it sucks in pipelines where the result image path is dynamic - you need to give more privileges to the IAM role of the CI pipeline than I'm comfortable with, not to mention that I don't like any AWS resources managed outside of the central Terraform repository.
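
      (For anyone hitting this, the manual workaround is to pre-create every "path" as its own flat repository, since repository names may contain slashes. Roughly, with the AWS CLI:)

          # each folder-looking name is really just a flat repository of its own
          aws ecr create-repository --repository-name foo/bar/baz
          docker push <acctid>.dkr.ecr.<region>.amazonaws.com/foo/bar/baz:tag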

      [1] https://stackoverflow.com/questions/64232268/storing-images-...

      • hanikesn 319 days ago
        That seems like standard AWS practice. Implement a new feature so you can check the box, but in practice it's a huge pain to actually use.
      • xyzzy_plugh 319 days ago
        IIRC it's not in the spec because administration of resources is out of scope. For example, perhaps you offer a public repository and you want folks to sign up for an account before they can push? Or you want to have an approval process before new repositories are created?

        Regardless it's a huge pain that ECR doesn't support this. Everybody I know of who has used ECR has run into this.

        There's a long standing issue open which I've been subscribed to for years now: https://github.com/aws/containers-roadmap/issues/853

  • kbumsik 319 days ago
    Actually, Cloudflare open-sourced a container registry server using R2.[1]

    Anyone tried it?

    [1]: https://github.com/cloudflare/serverless-registry

    • justin_oaks 319 days ago
      Looks cool. Thanks for linking it.

      It does mention that it's limited to 500MB per layer.

      For some people's use case that limitation might not be a big deal, but for others that's a dealbreaker.

      • danesparza 319 days ago
        From the README:

        * Pushing with docker is limited to images that have layers of maximum size 500MB. Refer to maximum request body sizes in your Workers plan.

        * To circumvent that limitation, you can manually add the layer and the manifest into the R2 bucket or use a client that is able to chunk uploads in sizes less than 500MB (or the limit that you have in your Workers plan).

  • wofo 319 days ago
    Hi HN, author here. If anyone knows why layer pushes need to be sequential in the OCI specification, please tell! Is it merely a historical accident, or is there some hidden rationale behind it?

    Edit: to clarify, I'm talking about sequentially pushing a _single_ layer's contents. You can, of course, push multiple layers in parallel.

    • abofh 319 days ago
      It makes clean-up simpler - if you never got to the "last" one, it's obvious you didn't finish after N+Timeout and thus you can expunge it. It simplifies an implementation detail (how do you deal with partial uploads? make them easy to spot). Otherwise you basically have to trigger at the end of every chunk, see if all the other chunks are there and then do the 'completion'.

      But that's an implementation detail, and I suspect isn't one that's meaningful or intentional. Your S3 approach should work fine btw, I've done it before in a prior life when I was at a company shipping huge images and $.10/gb/month _really_ added up.

      You lose the 'bells and whistles' of ECR, but those are pretty limited (imho)

      • orf 319 days ago
        In the case of a docker registry, isn’t the “final bit” just uploading the final manifest that actually references the layers you’re uploading?

        At this point you’d validate that the layers exist and have been uploaded, otherwise you’d just bail out?

        And those missing chunks would be handled by the normal registry GC, which evicts unreferenced layers?

        • abofh 318 days ago
          It's been a long time, but I think you're correct. In my environment I didn't actually care (any failed push would be retried so the layers would always eventually complete, and anything that for whatever reason didn't retry, well, it didn't happen enough that we cared at the cost of S3 to do anything clever).

          I think OCI ordered manifests first to "open the flow", but then the close only happens when the manifest's last entry is completed - which led to this ordered upload problem.

          If your uploader knows where the chunks are going to live (OCI is more or less CAS, so it's predictable), it can just put them there in any order as long as it's all readable before something tries to pull it.

    • rcarmo 319 days ago
      Never dealt with pushes, but it’s nice to see this — back when Docker was getting started I dumped an image behind nginx and pulled from that because there was no usable private registry container, so I enjoyed reading your article.
    • majewsky 315 days ago
      Source: I have implemented an OCI-compliant registry [1], though for the most part I've been following the behavior of the reference implementation [2] rather than the spec, on account of its convolutedness.

      When the client finalizes a blob upload, they need to supply the digest of the full blob. This requirement evidently serves to enable the server side to validate the integrity of the supplied bytes. If the server only started checking the digest as part of the finalize HTTP request, it would have to read back all the blob contents that had already been written into storage in previous HTTP requests. For large layers, this can introduce an unreasonable delay. (Because of specific client requirements, I have verified my implementation to work with blobs as large as 150 GiB.)

      Instead, my implementation runs the digest computation throughout the entire sequence of requests. As blob data is taken in chunk by chunk, it is simultaneously streamed into the digest computation and into blob storage. Between each request, the state of the digest computation is serialized in the upload URL that is passed back to the client in the Location header. This is roughly the part where it happens in my code: https://github.com/sapcc/keppel/blob/7e43d1f6e77ca72f0020645...

      I believe that this is the same approach that the reference implementation uses. Because digest computation can only work sequentially, the upload has to proceed sequentially as well.

      [1] https://github.com/sapcc/keppel
      [2] https://github.com/distribution/distribution

    • codethief 319 days ago
      Hi, thanks for the blog post!

      > For the last four months I’ve been developing a custom container image builder, collaborating with Outerbounds

      I know you said this was something for another blog post but could you already provide some details? Maybe a link to a GitHub repo?

      Background: I'm looking for (or might implement myself) a way to programmatically build OCI images from within $PROGRAMMING_LANGUAGE. Think Buildah, but as an API for an actual programming language instead of a command line interface. I could of course just invoke Buildah as a subprocess but that seems a bit unwieldy (and I would have to worry about interacting with & cleaning up Buildah's internal state), plus Buildah currently doesn't support Mac.

      • throwawaynorway 318 days ago
        If $PROGRAMMING_LANGUAGE = go, you might be looking for https://github.com/containers/storage which can create layers, images, and so on. I think `Store` is the main entry: https://pkg.go.dev/github.com/containers/storage#Store

        Buildah uses it: https://github.com/containers/buildah/blob/main/go.mod#L27C2...

        Edit: buildkit seems to be the same, used by docker, but needs a daemon?

      • wofo 319 days ago
        Unfortunately, all the code is proprietary at the moment. If you are willing to get your hands dirty, the main thing to realize is that container layers are "just" tar files (see, for instance, this article: https://ochagavia.nl/blog/crafting-container-images-without-...). Contact details are in my profile, in case you'd like to chat ;)
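
        As a tiny illustration of the "layers are just tar files" point (a hand-rolled sketch, not the builder mentioned above): pack a directory, compress it, and the digest of the result is what an image manifest would reference.

            tar -C rootfs -cf layer.tar .   # pack a directory tree into a layer
            gzip -n layer.tar               # -n keeps the gzip output reproducible
            sha256sum layer.tar.gz          # the digest a manifest would point at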
        • codethief 319 days ago
          Ah too bad :)

          Thanks for the link! Though I'm less worried about the tarball / OCI spec part, more about platform compatibility. I tried running runc/crun by hand at some point and let's just say I've done things before that were more fun. :)

      • cpuguy83 318 days ago
        This is what buildkit is. Granted go has the only sdk I know of, but the api is purely protobuf and highly extensible.
        • cpuguy83 318 days ago
          For that matter, dagger (dagger.io) provides an sdk in multiple languages and gives you the full power (and then some extra on top) of buildkit.
    • IanCal 319 days ago
      I can't think of an obvious one, maybe load based?

      ~~I added parallel pushes to docker I think, unless I'm mixing up pulls & pushes, it was a while ago.~~ My stuff was around parallelising the checks not the final pushes.

      Edit - does a layer say which layer it goes "on top" of? If so perhaps that's the reason, so the IDs of what's being pointed to exist.

      • wofo 319 days ago
        Layers are fully independent of each other in the OCI spec (which makes them reusable). They are wired together through a separate manifest file that lists the layers of a specific image.

        It's a mystery... Here are the bits of the OCI spec about multipart pushes (https://github.com/opencontainers/distribution-spec/blob/58d...). In short, you can only upload the next chunk after the previous one finishes, because you need to use information from the response's headers.
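
        For the curious, the chunked flow looks roughly like this on the wire (a curl-level sketch with made-up names; headers trimmed, and $LOCATION/$DIGEST stand for values captured from earlier responses):

            # open an upload session; the 202 response carries a Location header
            curl -si -X POST https://registry.example.com/v2/myapp/blobs/uploads/

            # upload chunk 1 to that Location (note the spec's start-end Content-Range format)
            curl -si -X PATCH "$LOCATION" \
              -H 'Content-Type: application/octet-stream' \
              -H 'Content-Range: 0-1048575' \
              --data-binary @chunk.0
            # the response returns a new Location (plus a Range header), which is
            # where the next PATCH must go, hence the forced serialization

            # the closing PUT supplies the blob digest (and optionally the final chunk)
            curl -si -X PUT "$LOCATION?digest=sha256:$DIGEST" --data-binary @chunk.1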

        • IanCal 319 days ago
          Ah thanks.

          That's chunks of a single layer though, not multiple layers right?

          • wofo 319 days ago
            Indeed, you are free to push multiple layers in parallel. But when you have a 1 GiB layer full of AI/ML stuff you can feel the pain!

            (I just updated my original comment to make clear I'm talking about single-layer pushes here)

            • killingtime74 319 days ago
              Split the layer up?
              • thangngoc89 319 days ago
                You can’t. Installing pytorch and supporting dependencies takes 2.2GB on debian-slim.
                • electroly 319 days ago
                  If you've got plenty of time for the build, you can. Make a two-stage build where the first stage installs Python and pytorch, and the second stage does ten COPYs which each grab 1/10th of the files from the first stage. Now you've got ten evenly sized layers. I've done this for very large images (lots of Python/R/ML crap) and it takes significant extra time during the build but speeds up pulls because layers can be pulled in parallel.
                  • thangngoc89 318 days ago
                    I see your point on the pull speed. Most of my pulls are stuck at waiting for the pytorch/dependencies layer.

                    This might work with pip, but I absolutely hate pip and have been using Poetry with great success. I will investigate how to do this with Poetry.

                • fweimer 319 days ago
                  Surely you can have one layer per directory or something like that? Splitting along those lines works as long as everything isn't in one big file.

                  I think it was a mistake to make layers, as a storage model, visible to the end user. This should just have been an internal implementation detail, perhaps similar to how Git handles delta compression and makes it independent of branching structure. We also should have delta pushes and pulls, using global caches (for public content), and the ability to start containers while their image is still in transfer.

                • password4321 319 days ago
                  It should be possible to split into multiple layers as long as each file is wholly within its layer. This is the exact opposite of the usual recommendation to combine commands so that everything stays in one layer, which I think is ultimately done for runtime performance reasons.
                  • ramses0 319 days ago
                    I've dug fairly deep into docker layering, it would be wonderful if there was a sort of `LAYER ...` barrier instead of implicitly via `RUN ...` lines.

                    Theoretically there's nothing stopping you from building the docker image and "re-layering it", as they're "just" bundles of tar files at the end of the day.

                    eg: `RUN ... ; LAYER /usr ; LAYER /var ; LAYER /etc ; LAYER [discard|remainder]`

  • KronisLV 319 days ago
    That's a pretty cool use case!

    Personally, I just use Nexus because it works well enough (and supports everything from OCI images to apt packages and stuff like a custom Maven, NuGet, npm repo etc.), however the configuration and resource usage both are a bit annoying, especially when it comes to cleanup policies: https://www.sonatype.com/products/sonatype-nexus-repository

    That said:

    > More specifically, I logged the requests issued by docker pull and saw that they are “just” a bunch of HEAD and GET requests.

    this is immensely nice and I wish more tech out there made common sense decisions like this, just using what has worked for a long time and not overcomplicating things.
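
    (For reference, a pull against any spec-compliant registry boils down to something like the following; hypothetical host and image name, digests elided.)

        # resolve the tag and fetch the manifest
        curl -sI https://registry.example.com/v2/myapp/manifests/latest \
          -H 'Accept: application/vnd.oci.image.manifest.v1+json'
        curl -s https://registry.example.com/v2/myapp/manifests/latest \
          -H 'Accept: application/vnd.oci.image.manifest.v1+json'

        # then one GET per layer digest listed in the manifest
        curl -s -o layer0.tar.gz https://registry.example.com/v2/myapp/blobs/sha256:...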

    I am a bit surprised that there aren't more simple container repositories out there (especially with auth and cleanup support), since Nexus and Harbor are both a bit complex in practice.

    • thangngoc89 318 days ago
      I’m using Gitea just for packages (Docker, npm, Python, …). I’m surprised no one has mentioned this in this thread.
  • akeck 319 days ago
    Note that CNCF's Distribution (formerly Docker's Registry) includes support for backing a registry with CloudFront signed URLs that pull from S3. [1]

    [1] https://distribution.github.io/distribution/storage-drivers/...

  • rad_gruchalski 319 days ago
    • _flux 319 days ago
      I hadn't seen that before, and it indeed does support S3, but does it also offer the clients the downloads directly from S3, or does it merely use it as its own storage backend (so basically work as a proxy when pulling)?
      • vbezhenar 319 days ago
        It redirects client requests to the S3 endpoint, so yes, in the end all the heavy traffic is served from S3.
        • est31 318 days ago
          But it still means a downtime of your service is a downtime for anything which might need a docker container, whereas if it went to S3 (or cloudfront in front of S3) directly, you profit from the many nines that S3 offers without paying an arm and a leg for ECR (data in ECR costs five times as much as S3 standard tier).
  • lofties 319 days ago
    This sounds very, very expensive, and I would've loved to see cost mentioned in the article too. (for both S3 and R2)
    • est31 318 days ago
      S3's standard tier costs a fifth of ECR in terms of costs per GB stored. Egress costs to the free internet are the same, with the exception that for public ECR repositories they make egress to inner-AWS usage free.
    • remram 319 days ago
      The cost is the S3 cost though. It depends on region and storage tier, but the storage cost per GB, the GET/PUT cost, and the bandwidth cost can be found on the AWS website: https://aws.amazon.com/s3/pricing/
  • donatj 319 days ago
    I don't do a ton with Docker outside dev tooling, but I have never understood why private container registries even exist? It just smells like rent seeking. What real advantage does it provide over say just generating some sort of image file you manage yourself, as you please?
    • figmert 319 days ago
      You don't have to use it. You can use docker save and docker load:

          docker save alpine:3.19 > alpine.tar
          docker load < alpine.tar
      
      But now I have to manage that tar file, have all my systems be aware of where it is, how to access it, etc. Or, I could just not re-invent the wheel and use what docker already has provided.
    • JackSlateur 319 days ago
      You will probably have images that you will not share to the world. Said images will probably be made available to your infrastructure (k8s clusters, CI/CD runners etc). So you have to either build your own registry or pay someone to do it for you.

      Of course, if you use images for dev only, all of that is worthless and you just store your images on your dev machine.

      • regularfry 319 days ago
        Also if your infrastructure is within AWS, you want your images to also be within AWS when the infrastructure wants them. That doesn't necessarily imply a private registry, but it's a lot less work that way.
    • vel0city 319 days ago
      Why have a code repository instead of just emailing files around?

      Because you want a central store someplace with all the previous versions that is easily accessible to lots of consumers.

      I don't want to build my app and then have to push it to every single place that might run it. Instead, I'll build it and push it to a central repo and have everything reference that repo.

      > It just smells like rent seeking.

      You don't need to pay someone to host a private repo for you. There are lots of tools out there so you can self-host.

    • mcraiha 319 days ago
      Private (cloud) registries are very useful when there are mandatory AuthN/AuthZ things in the project related to the docker images. You can terraform/bicep/pulumi everything per environment.
    • arccy 319 days ago
      and how do you manage them? you use the same tooling that exists for all public images by running a container registry.
    • alemanek 319 days ago
      Integration with vulnerability scanning utilities and centralized permissions for orgs are nice benefits.
    • danmur 319 days ago
      Companies send young engineers (and older engineers who should know more but don't) to AWS and Microsoft for "cloud certification". They learn how to operate cloud services because that's what benefits AWS and MS, so that's what their solutions use.

      It's a difficult uphill battle to get people interested in how things work under the hood, which is what you need in order to know you can do things like easily host your own package repositories.

      • figmert 319 days ago
        This is an odd assessment. I agree certifications aren't all that, but having people learn them isn't about that. It's more that people don't feel like reinventing the wheel at every company, so they can focus on the real work, like shipping the application they've written. So companies like AWS, Docker, etc. write things and abstract things away, so someone else doesn't have to redo the whole thing.

        Yes I can host my packages and write tooling around it to make it easy. But JFrog already has all the tooling around it, and it integrates with current tooling. Why would I write the whole thing again?

        • danmur 319 days ago
          I am responding to this part of the parent comment:

          > I don't do a ton with Docker outside dev tooling, but I have never understood why private container registries even exist?

          You know the options and have made a conscious choice:

          > Yes I can host my packages and write tooling around it to make it easy. But JFrog already has all the tooling around it, and it integrates with current tooling. Why would I write the whole thing again?

          So presumably you are not the kind of people I was talking about.

          EDIT: I'm also assuming by the rent seeking part that the parent is referring to paid hosted services like ECR etc.

  • watermelon0 319 days ago
    It seems that ECR is actually designed in a way to support uploading image layers in multiple parts.

    Related ECR APIs:

    - InitiateLayerUpload API: called at the beginning of upload of each image layer

    - UploadLayerPart API: called for each layer chunk (up to 20 MB)

    - PutImage API: called after layers are uploaded, to push image manifest, containing references to all image layers

    The only weird thing seems to be that you have to upload layer chunks in base64 encoding, which increases the data size by ~33%.
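
    For reference, that flow maps to these AWS CLI commands (a rough sketch; the flag spellings follow the ECR API operation names, so double-check `aws ecr help` before relying on them):

        # start an upload for one layer; returns an uploadId and a partSize
        aws ecr initiate-layer-upload --repository-name myrepo

        # send each <=20 MB chunk; the blob travels base64-encoded in the JSON body,
        # which is where the ~33% inflation comes from
        aws ecr upload-layer-part --repository-name myrepo --upload-id "$UPLOAD_ID" \
          --part-first-byte 0 --part-last-byte 20971519 --layer-part-blob fileb://part.0

        # seal the layer, then push the manifest that references it
        aws ecr complete-layer-upload --repository-name myrepo --upload-id "$UPLOAD_ID" \
          --layer-digests "sha256:$LAYER_DIGEST"
        aws ecr put-image --repository-name myrepo --image-manifest file://manifest.json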

    • wofo 316 days ago
      I tried using that API directly, but it unfortunately did require ordered uploads too :')
  • phillebaba 319 days ago
    Interesting idea to use the file path layout as a way to control the endpoints.

    I do wonder though how you would deal with the Docker-Content-Digest header. While not required, it is suggested that responses include it, as many clients expect it and will reject layers without the header.

    Another thing to consider is that you will miss out on some features from the OCI 1.1 spec, like the referrers API, as those would be a bit tricky to implement.

  • 8organicbits 319 days ago
    > that S3 is up to 8x faster than ECR

    Awesome. Developer experience is so much better when CI doesn't take ages. Every little bit counts.

    • barbazoo 319 days ago
      > ECR 24 MiB/s (8.2 s)

      > S3 115 MiB/s (1.7 s)

      It's great that it's faster but absolutely, it's only an improvement of 6.5s observed, as you said, on the CI server. And it means using something for a purpose that it's not intended for. I'd hate to have to spend time debugging this if it breaks for whatever reason.

      • wofo 319 days ago
        To be clear, the 8x was comparing the slowest ECR throughput measurement against the fastest S3 one. In any case, the improvement is significant.
  • cpa 319 days ago
    Is there a good reason for not allowing parallel uploads in the spec?
    • wofo 319 days ago
      No idea... I asked the same question here (https://news.ycombinator.com/item?id=40943480) and am hoping we'll have a classic HN moment where someone who was involved in the design of the spec will chime in.
    • benterix 319 days ago
      I believe that even if there was one then, it's probably no longer valid and it's now just a performance limitation.
      • wofo 319 days ago
        Other than backwards-compatibility, I can imagine simplicity being a reason. For instance, sequential pushing makes it easier to calculate the sha256 hash of the layer as it's being uploaded, without having to do it after-the-fact when the uploaded chunks are assembled.
        • amluto 319 days ago
          The fact that layers are hashed with SHA256 is IMO a mistake. Layers are large, and using SHA256 means that you can’t incrementally verify the layer as you download it, which means that extreme care would be needed to start unpacking a layer while downloading it. And SHA256 is fast but not that fast, whereas if you really feel like downloading in parallel, a hash tree can be verified in parallel.

          A hash tree would have been nicer, and parallel uploads would have been an extra bonus.

          • cpuguy83 318 days ago
            sha256 has been around a long time and is highly compatible.

            blake3 support has been proposed both in the OCI spec and in the runtimes, which at least for runtimes I expect to happen soon.

            I tend to think gzip is the bigger problem, though.

            • amluto 318 days ago
              > sha256 has been around a long time and is highly compatible.

              Sure, and one can construct a perfectly nice tree hash from SHA256. (AWS Glacier did this, but their construction should not be emulated.)

              • cpuguy83 318 days ago
                But then every single client needs to support this. sha256 support is already ubiquitous.
                • amluto 318 days ago
                  Every single client already had to implement enough of the OCI distribution spec to be able to parse and download OCI images. Implementing a more appropriate hash, which could be done using SHA-256 as a primitive, would have been a rather small complication. A better compression algorithm (zstd?) is far more complex.
                  • cpuguy83 318 days ago
                    I don't think we can compare reading json to writing a bespoke, secure hashing algorithm across a broad set of languages.
                    • amluto 317 days ago
                      Reading JSON that contains a sort of hash tree already. It’s a simple format that contains a mess of hashes that need verifying over certain files.

                      Adding a rule that you hash the files in question in, say, 1 MiB chunks and hash the resulting hashes (and maybe that’s it, or maybe you add another level) is maybe 10 lines of code in any high level language.
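
                      A naive two-level version in shell, just to show the size of the idea (no leaf/parent domain separation, so illustration only, not something to ship as-is):

                          split -b 1M layer.tar.gz part.   # 1 MiB leaves
                          sha256sum part.* > leaves.txt    # one hash per leaf
                          sha256sum leaves.txt             # "root" = hash of the leaf hashes

                      Leaves can be hashed and verified in parallel, and any single chunk can be checked against leaves.txt alone.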

                      • oconnor663 317 days ago
                        Note that secure tree hashing requires a distinguisher between the leaves and the parents (to avoid collisions) and ideally another distinguisher between the root and everything else (to avoid extensions). Surprisingly few bespoke tree hashes in the wild get this right.
                        • amluto 316 days ago
                          This is why I said that Glacier’s hash should not be emulated.

                          FWIW, using a (root hash, data length) pair hides many sins, although I haven’t formally proven this. And I don’t think that extension attacks are very relevant to the OCI use case.

        • catlifeonmars 319 days ago
          That does not make any sense, as the network usually is a much bigger bottleneck than compute, even with disk reads. You’re paying quite a lot for “simplicity” if that were the case.
        • jtmarmon 319 days ago
          I’m no expert on docker but I thought the hashes for each layer would already be computed if your image is built
          • cpuguy83 318 days ago
            It's complicated. If you are using the containerd backed image store (opt-in still) OR if you push with "build --push" then yes.

            The default storage backend does not keep compressed layers, so those need to be recreated and digested on push.

            With the new store all that stuff is kept and reused.

          • wofo 319 days ago
            That's true, but I'd assume the server would like to double-check that the hashes are valid (for robustness / consistency)... That's something my little experiment doesn't do, obviously.
  • michaelmior 319 days ago
    > Why can’t ECR support this kind of parallel uploads? The “problem” is that it implements the OCI Distribution Spec…

    I don't see any reason why ECR couldn't support parallel uploads as an optimization. Provide an alternative to `docker push` for those who care about speed that doesn't conform to the spec.

    • wofo 319 days ago
      Indeed, they could support it through a non-standard API... I wish they did!
  • champtar 318 days ago
    What I would really love is for the OCI Distribution spec to support just static files, so we can use dumb HTTP servers directly, or even file:// (for pull). All the metadata could be (or already is) in the manifests; serving blobs with Content-Type: application/octet-stream could work just fine.
  • victorbjorklund 319 days ago
    But this only works for public repos, right? I assume docker pull won't use an S3 API key.
    • wofo 319 days ago
      That's true, unfortunately. I'm thinking about ways to somehow support private repos without introducing a proxy in between... Not sure if it will be possible.
      • victorbjorklund 318 days ago
        Maybe a very thin proxy layer on like cloudflare functions.
        • wofo 318 days ago
          Yep, I originally thought that wouldn't work... But now I discovered (thanks to a comment here) that the registry is allowed to return an HTTP redirect instead of serving the layer blobs directly... Which opens new possibilities :)
  • kevin_nisbet 319 days ago
    It's cool to see it, I was interested in trying something similar a couple years ago but priorities changed.

    My interest was mainly from a hardening standpoint. The base idea was that the release system, through IAM permissions, would be the only system with any write access to the underlying S3 bucket. All the public / internet facing components could then be limited to read-only access as part of the hardening.

    This would of course be in addition to signing the images, but I don't think many of the customers at the time knew anything about or configured any of the signature verification mechanisms.

  • tealpod 319 days ago
    This is such a wonderful idea, congrats.

    There is a real use case for this in some high-security sectors. I can't put the complete info here for security reasons; let me know if you are interested.

  • lazy_moderator1 319 days ago
    That's neat! On that note I've been using S3 as a private registry for years now via Gitlab and couldn't be happier!
  • jaimehrubiks 319 days ago
    Every day I experience the slowness of pushing big images (AI-related ones tend to be big) to ECR in our CI/CD.
    • wofo 319 days ago
      I wonder whether the folks at Cloudflare could take the ideas from the blog post and create a high-performance serverless container registry based on R2. They could call it scrubs, for "serverless container registry using blob storage" :P
  • dheera 318 days ago
    Make sure you use HTTPS, or someone could theoretically inject malicious code into your container. If you want to use your own domain you'll have to use CloudFront to wrap S3 though.
    • wofo 316 days ago
      That's what I thought originally, but you can actually use `https://<your-bucket>.s3.amazonaws.com` without CloudFront or any other service on top (it wasn't easy to find in the AWS docs, but it works).
  • ericfrederich 319 days ago
    R2 is only "free" until it isn't. Cloudflare hasn't got a lot of good press recently. Not something I'd wanna build my business around.
    • TheMrZZ 319 days ago
      Aside from the casino story (high value target that likely faces tons of attacks, therefore an expensive customer for CF), did something happen with them? I'm not aware of bad press around them in general
    • jgrahamc 319 days ago
      R2 egress is free.
  • fnord77 319 days ago
    > What makes S3 faster than ECR?

    the author is missing something huge - ECR does a security scan on upload, too.

  • ericpauley 319 days ago
    Where's the source code?
    • wofo 319 days ago
      The source code is proprietary, but it shouldn't take much work to replicate, fortunately (you just need to upload files at the right paths).
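
      The gist is mirroring the registry's URL layout as object keys, roughly like this (hypothetical bucket and image names; auth and the config blob's details elided):

          # manifest, addressed by tag (the media type matters to clients)
          aws s3 cp manifest.json s3://my-registry-bucket/v2/hello/manifests/latest \
            --content-type application/vnd.oci.image.manifest.v1+json

          # config and layer blobs, addressed by digest
          aws s3 cp layer.tar.gz s3://my-registry-bucket/v2/hello/blobs/sha256:$LAYER_DIGEST

          # pulls then become plain HEAD/GET requests against those keys
          docker pull my-registry-bucket.s3.amazonaws.com/hello:latest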
    • seungwoolee518 319 days ago
      Like treating the path as an object key, and putting the value as JSON or a blob?
  • filleokus 319 days ago
    I've started to grow annoyed with container registry cloud products. It's always surprisingly cumbersome to auto-delete old tags, deal with ACLs, or limit the networking.

    It would be nice if a Kubernetes distro took a page out of the "serverless" playbook and just embedded a registry. Or maybe I should just use GHCR

    • breatheoften 319 days ago
      I'm using google's artifact registry -- aside from upload speed another thing that kills me is freakin download speed ... Why in the world should it take 2 minutes to download a 2.6 GB layer to a cloud build instance sitting in the same region as the artifact registry ... Stupidly slow networking really harms the stateless ci machine + docker registry cache which actually would be quite cool if it was fast enough ...

      In my case it's still faster than doing the builds would be -- but I'm definitely gonna have to get machines with persistent local cache in the mix at some point so that these operations will finish within a few seconds instead of a few minutes ...

    • kevin_nisbet 319 days ago
      We did this in the Gravity Kubernetes Distribution (whose development has been shut down), but we had to for the use case. Since the distribution was used to take Kubernetes applications behind the firewall with no internet access, we needed the registry... and it was dead simple, just running the docker-distribution registry on some of the nodes.

      In theory it wouldn't be hard to just take docker-distribution and run it as a pod in the cluster with an attached volume if you wanted a registry in the cluster. So it's probably somewhere between trivial and takes a bit of effort if you're really motivated to have something in cluster.

    • freerc1347 318 days ago
      Have you tried zot? https://www.cncf.io/projects/zot/

      https://zotregistry.dev/

      Here are all the projects already using zot in some form or another.

      https://github.com/project-zot/zot/issues/2117

    • vbezhenar 319 days ago
      Kubernetes is extremely bare-bones; there's no way they'll embed a registry. Kubernetes doesn't touch images at all, AFAIK; it delegates that to the container runtime, e.g. containerd.

      If you want some lightweight registry, use the "official" docker registry. I'm running it inside Kubernetes, and Kubernetes consumes images from it just fine.

    • mdaniel 319 days ago
      > Always surprisingly cumbersome to auto-delete old tags,

      Does this not do what you want? https://docs.aws.amazon.com/AmazonECR/latest/userguide/lifec...

      I can't speak to the other "registry cloud products" except for GitLab, which is its own special UX nonsense, but they also support expiry after enough whisky consumption