The diagrams on this page are stunning! My only complaint is that the close/maximize/minimize buttons in the top left are unnecessary, but this is the kind of clarity I always strive for (and fail to achieve) every time I make diagrams.
Did you use a tool to create them, and if so, what is that tool?
Thanks! I use https://monodraw.helftone.com/, which is my favorite one-time-purchase software of all time. I definitely agree the buttons in the top left are unnecessary, but ... it's cute and it makes me happy, so I can't help it. Maybe I'll come up with a different style for the next blog post.
We haven't even started to discuss Object Storage, but it ends up looking very, very similar if you're building data systems that use that instead of raw filesystems (not so much for physics reasons, but because object storage requires immutable objects and penalizes you for making many API calls).
So cool to see this make the front page of Hacker News! I'm the author, and I'll be online here throughout the weekend to answer any questions you might have :) Excited for the next post, which is in the works, about LSM trees.
The trick is, it's not all in memory - it's a memory-mapped file.
If you look at the page cache (with `fincore` or similar) you'll see that the binary search only loads the pages it examines, roughly logarithmic in the file size.
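A minimal sketch of that pattern (my own illustration, assuming a sorted, newline-delimited file and byte-string keys; `find_line` and the file layout are not from the original comment):

```python
import mmap

def find_line(path, key):
    # Binary-search a sorted, newline-delimited file for the first line that
    # starts with `key` (bytes). Because the file is memory-mapped, only the
    # pages the search actually examines get faulted into the page cache.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            lo, hi = 0, len(mm)  # invariant: lo is always the start of a line
            while lo < hi:
                mid = (lo + hi) // 2
                nl = mm.rfind(b"\n", lo, mid)
                start = nl + 1 if nl != -1 else lo  # start of the line containing mid
                end = mm.find(b"\n", mid)
                if end == -1:
                    end = len(mm)
                if mm[start:end] < key:
                    lo = end + 1   # answer lies after this line
                else:
                    hi = start     # answer is this line or an earlier one
            lo = min(lo, len(mm))
            end = mm.find(b"\n", lo)
            if end == -1:
                end = len(mm)
            line = mm[lo:end]
            return line if line.startswith(key) else None
```

Something like `find_line("users.sorted", b"alice,")` then touches only a logarithmic number of pages, which is what `fincore` shows.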
And a text file is the most useful general format - easy to write, easy to process with standard tools.
I've used this in the past on data sets of hundreds of millions of lines, maybe billions.
It's also true that you could use a memory-mapped indexed file for faster searches - I've used sqlite for this.
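For the SQLite route, the sketch is just a table with an indexed key column (a rough illustration with Python's built-in sqlite3; the table and column names are placeholders, not anything from the comment):

```python
import sqlite3

rows = [("alice", "30"), ("bob", "25")]  # placeholder data

con = sqlite3.connect("lookup.db")
with con:  # one transaction for setup and the bulk load
    con.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")
    con.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)", rows)

# Each lookup is a B-tree probe on the primary-key index, not a scan.
print(con.execute("SELECT value FROM kv WHERE key = ?", ("alice",)).fetchone())
```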
What this could really use is a compression format that compresses a variable amount of text into fixed-size blocks. With that, it could binary-search compressed text.
RocksDB actually does something somewhat similar with its prefix compression. It prefix-compresses keys and then "resets" the prefix compression every N records, storing a mapping of reset point -> offset so you can skip across compressed records. It's pretty neat.
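A toy version of that restart-point idea (my own sketch, not RocksDB's actual block format; the 2-byte length fields and the restart interval are arbitrary choices):

```python
import struct

RESTART_INTERVAL = 16  # reset the shared prefix every 16 entries (arbitrary)

def shared_prefix_len(a, b):
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def encode_block(sorted_keys):
    # Encode sorted byte-string keys as (shared_len, suffix_len, suffix)
    # records. Every RESTART_INTERVAL entries the shared prefix resets to 0
    # and the full key is stored; those offsets are returned as restart points.
    buf, restarts, prev = bytearray(), [], b""
    for i, key in enumerate(sorted_keys):
        if i % RESTART_INTERVAL == 0:
            restarts.append(len(buf))
            shared = 0  # restart point: store the whole key
        else:
            shared = shared_prefix_len(prev, key)
        suffix = key[shared:]
        buf += struct.pack("<HH", shared, len(suffix)) + suffix
        prev = key
    return bytes(buf), restarts

def decode_from(buf, offset):
    # Decode keys starting at a restart offset; prefixes resolve forward.
    keys, prev, pos = [], b"", offset
    while pos < len(buf):
        shared, slen = struct.unpack_from("<HH", buf, pos)
        pos += 4
        key = prev[:shared] + buf[pos:pos + slen]
        pos += slen
        keys.append(key)
        prev = key
    return keys
```

To find a key you'd binary-search the restart offsets (each restart holds a full, uncompressed key) and then decode forward from there, so you never have to decompress the whole block.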
This is also a fairly good way to handle large amounts of data with maximum performance on spinning rust, and it's at the heart of systems like Kafka.
I had assumed that the story would be very different with SSDs, so it's surprising to see append-only logs show up again.