Seems to be nfs v3 [0] - curious to test it out - the only userspace nfsv4 implementation I’m aware of is in buildbarn (golang) [1]. The example of their nfs v3 implementation disables locking. Still pretty cool to see all the ways the rust ecosystem is empowering stuff like this.
I’m kinda surprised someone hasn’t integrated the buildbarn nfs v4 stuff into docker/podman - the virtiofs stuff is pretty bad on osx and the buildbarn nfs 4.0 stuff is a big improvement over nfs v3.
Anyhow I digress. Can’t wait to take it for a spin.
[0] https://github.com/Barre/zerofs_nfsserve
[1] https://github.com/buildbarn/bb-remote-execution/tree/master...
Seems like a really interesting project! I don't understand what's going on with latency vs durability here. The benchmarks [1] report ~1ms latency for sequential writes, but that's just not possible with S3. So presumably writes are not being confirmed to storage before confirming the write to the client.
What is the durability model? The docs don't talk about intermediate storage. SlateDB does confirm writes to S3 by default, but I assume that's not happening?
[1] https://www.zerofs.net/zerofs-vs-juicefs
Looking at those benchmarks, I think you must be using a local disk to sync writes before uploading to S3?
SlateDB offers different durability levels for writes. By default writes are buffered locally and flushed to S3 when the buffer is full or the client invokes flush().
https://slatedb.io/docs/design/writes/
The durability profile before sync should be pretty close to a local filesystem. There’s in-memory buffering on writes; data is synced when fsync is issued, when the in-memory threshold is exceeded, or when a timeout expires.
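For intuition, a minimal sketch of that write path (illustrative only; the struct, field names, and thresholds are made up, not ZeroFS's actual code): writes accumulate in memory, and data only reaches object storage on fsync, when the buffer passes a size threshold, or when a flush interval elapses.

```rust
use std::time::{Duration, Instant};

// Illustrative sketch: buffer writes in memory, persist to object storage on
// fsync, when the buffer grows past a threshold, or when a timeout elapses.
struct WriteBuffer {
    buf: Vec<u8>,
    max_bytes: usize,         // in-memory threshold
    flush_interval: Duration, // timeout-based flush
    last_flush: Instant,
}

impl WriteBuffer {
    fn write(&mut self, data: &[u8]) {
        self.buf.extend_from_slice(data);
        if self.buf.len() >= self.max_bytes || self.last_flush.elapsed() >= self.flush_interval {
            self.flush();
        }
    }

    // fsync is the client-requested durability point, as on a local filesystem.
    fn fsync(&mut self) {
        self.flush();
    }

    fn flush(&mut self) {
        // put_object(&self.buf); // placeholder: this is where data would reach S3
        self.buf.clear();
        self.last_flush = Instant::now();
    }
}
```

Under that model an un-synced write can be lost on a crash, which is roughly the contract a local filesystem's page cache gives you, so presumably the ~1ms figures measure the buffered path rather than an S3 round trip.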
I am in no way affiliated with JuiceFS, but I have done a lot of benchmarking and testing of it, and the numbers claimed here for JuiceFS are suspicious (basically 5 ops/second with everything mostly broken).
I have a JuiceFS setup and get operations in the 1000/s range; not sure how they got such low numbers.
JuiceFS also supports multiple concurrent clients, each making its own connection to the metadata and object storage, allowing near-instant synchronization and better performance, whereas this seems to rely on a single service holding the connection, with everyone connecting through it and no support for clustering.
I have no doubt that JuiceFS can perform “thousands of operations per second” across parallel clients. I don't think that's a useful benchmark because the use cases we are targeting are not embarrassingly parallel.
Using a bunch of clients on any system hides the latency profile. You could even get your "thousands of operations per second" on a system where any operation takes 10 seconds to complete.
The bench suite is published at the root of the repo in bench/. Calling these “a bunch of claims” seems a bit dismissive when it’s that easy to reproduce.
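To make the parallelism point concrete, a back-of-envelope sketch (numbers invented for the example, not taken from any benchmark): aggregate throughput scales with concurrency, so it says very little about per-operation latency.

```rust
fn main() {
    // Hypothetical numbers: every operation takes 10 seconds, but the
    // workload is embarrassingly parallel across many clients.
    let per_op_latency_s = 10.0;
    let concurrent_ops = 20_000.0;

    // Little's law: throughput = concurrency / latency.
    let throughput = concurrent_ops / per_op_latency_s;
    println!("{throughput} ops/s even though each op takes {per_op_latency_s} s");
    // Prints: 2000 ops/s even though each op takes 10 s
}
```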
JuiceFS maps operations 1:1 to S3 chunk-wise, so the performance profile is entirely unsurprising to me; even untarring an archive is not going to be a pleasant experience.
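If each file operation really does turn into one or more synchronous object-store round trips, the arithmetic for untarring many small files is rough; a sketch where the file count, requests per file, and latency are all assumed for illustration:

```rust
fn main() {
    // Assumed values: a tarball of many small files, a typical S3 round trip
    // in the tens of milliseconds, and a couple of requests per file.
    let files: u64 = 10_000;
    let requests_per_file: u64 = 2; // e.g. metadata + data, assumed
    let round_trip_ms: u64 = 80;

    let total_s = files * requests_per_file * round_trip_ms / 1000;
    println!("~{} s (~{} min) if nothing is batched or buffered", total_s, total_s / 60);
    // 10,000 files * 2 * 80 ms = ~1600 s, roughly 26 minutes
}
```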
I had to laugh out loud:
"In practice, you'll encounter other constraints well before these theoretical limits, such as S3 provider limits, performance considerations with billions of objects, or simply running out of money."
Incredibly cool! It shows running Ubuntu and Postgres, and also supports full posix operations.
Questions:
- I see there is a diagram running multiple Postgres nodes backed by this store, very similar to a horizontally distributed web server. Doesn't Postgres use WAL replication? Or is it disabled and are they running on the same "views" of the filesystem?
- What does this mean for services that handle geo-distribution at the app layer? e.g. CockroachDB?
Sorry if this sounds dumb.
Built atop the excellent SlateDB! Breaks files down into 256k chunks. Encrypted. Much, much better posix compatibility than most FUSE alternatives. SlateDB has snapshots & clones, so that could be another great superpower of ZeroFS.
Incredible performance figures, rocketing to probably the best way to use object storage in an fs-like way. There's a whole series of comparisons, & they probably need a logarithmic scale given the size of the lead SlateDB has! https://www.zerofs.net/zerofs-vs-juicefs
Speaks 9p, NFS, or NBD. Some great demos of ZFS with L2ARC caches giving near-local performance while having S3 persistence.
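On those 256k chunks: a hypothetical sketch of how a byte offset could map onto fixed-size chunk records (the key layout is invented for illustration, not ZeroFS's actual schema):

```rust
const CHUNK_SIZE: u64 = 256 * 1024; // 256 KiB, per the chunk size mentioned above

/// Map a byte offset within a file to (chunk index, offset inside that chunk).
fn chunk_for(offset: u64) -> (u64, u64) {
    (offset / CHUNK_SIZE, offset % CHUNK_SIZE)
}

fn main() {
    // A 1 MiB write starting at offset 300 KiB spans chunks 1 through 5, so it
    // touches five small records rather than rewriting one large object.
    let start = 300 * 1024u64;
    let end = start + 1024 * 1024 - 1;
    let (first, _) = chunk_for(start);
    let (last, _) = chunk_for(end);
    println!("chunks {first}..={last}");
}
```

Presumably that's also what keeps partial overwrites cheap, since only the touched chunk records need to be rewritten.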
Totally what I was thinking of when someone in the Immich thread mentioned wanting a way to run it on cheap object storage. https://news.ycombinator.com/item?id=45169036