> I do wish to point out, of course, that the whole reason it was possible to experiment cheaply and come across this serendipity was because 9 months ago, faced with the choice to either do the bad easy thing or the good nothing, I chose to do the bad easy thing. The SQLite database worked! I understood how it worked, behind the scenes with its B-trees and its Full Text Search extension.
This is the most important takeaway, imo, and a very valuable technique: Start with the obvious, stupid solution that definitely works. Then do the optimized version, while making sure it matches the naive implementation. In this case, the optimized version could even be generated from the naive one.
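One concrete way to practice that technique is differential testing: keep the naive solution around as an oracle and diff the optimized path against it on every input. A minimal Python sketch (the tiny Finnish-English dictionary here is made up for illustration):

```python
import bisect

# Naive version: a plain dict. Obvious, stupid, definitely works.
naive = {"kissa": "cat", "koira": "dog", "talo": "house"}

# "Optimized" version: parallel sorted arrays queried by binary search,
# standing in for whatever compact structure you actually build.
keys = sorted(naive)
vals = [naive[k] for k in keys]

def lookup_fast(word):
    i = bisect.bisect_left(keys, word)
    return vals[i] if i < len(keys) and keys[i] == word else None

# Differential check: the fast path must agree with the oracle,
# including on words that are absent.
for w in list(naive) + ["auto", "zzz"]:
    assert lookup_fast(w) == naive.get(w)
```

Note that, as the comment above says, the optimized structure is *generated from* the naive one, so equivalence is checkable mechanically rather than argued by hand.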
The problem with such an approach is that sometimes the "obvious easy thing" becomes so entrenched, touching everything, that ripping it out becomes a disproportionately monumental task.
I feel like it is important to manage the risk and to clearly track this debt. Personally, I try to stay clear of both financial debt and technical debt until there are sound reasons for either.
It's a KISS stack for me personally (Keep It Stupid Simple).
I would still consider technical debt to be different from other forms of debt, though. It feels much more like a tradeoff to me, but perhaps all debt can be classified as such. Either way, I think it makes for an interesting decision.
If the product fails, all technical debt goes away. So using technical debt to prove out a product is very different, because oftentimes you don't have the money or resources to build it correctly at first, since you don't know if the expected payoff will be there.
It jumped out at me too, but because I wondered what it would look like in the AI version of this story. Having had it build the SQL version, do you ... a) miss the leap because you don't understand how it works, don't care to know, and go off to vibe the next thing; b) ask it lots of questions, to develop that deep understanding, and then make the leap yourself; or c) rely on it (prompt: "this can't be good enough, do better") to make the leap for you?
(Assuming for the sake of argument that you guided it to the SQL version first)
Depends on what your overall goal is with what you're building. Is it to rush out as many features as you possibly can before VC funding falls through the floor, so you too can get a slice of the pie before the party is over? Or are you "retired" in your 30s and now have time to build the perfect software for yourself? Do you need to publish and release an experiment to see how people react to it or use it, before you can know whether it's the right thing or not?
Almost everything needs to be contextualized before you can even begin to answer what the right way forward is; it depends so heavily on what situation you're in.
Fwiw, giving Opus 4.7 two sentences about building a CLI for Finnish-to-English translation and looking for a space-efficient solution leads to an answer pointing to fst, for the same reasons stated in the blog. This is without a search tool.
The K shaped LLM scenario makes a lot of sense to me. Educated and experienced devs get better output because they know what to ask.
Yes - though I'd refine it to say it's not just more time; it's time plus the forcing function of a working solution, which requires understanding the problem space more deeply.
This was a fun read! Thanks for the great introduction to Finite State Transducers. I hadn't heard the formal term before, but your article gave me serious déjà vu.
Years ago, I entered a Scrabble programming contest and needed to compress a GADDAG dictionary to fit into my 6MB L3 cache. Without knowing the official name for it, I ended up using the exact same suffix-compression mechanism by moving characters to the edges instead of the nodes to merge overlapping paths.
Apparently the structure itself has a bit of a history. The word 'rediscovery' tipped me off to go to Wikipedia myself and read up more about this.
First Blumer et al., 1983 came up with a "DAWG", but reading the abstract [1] I was left a little confused as to how exactly we get from 'here is how we store all substrings of a string in O(|string|) space, with "is this a substring [yn]" recognition in O(|substring|) time' to the modern DAFSA, as cool and useful as that is. Come to think of it I bet I could use that in some LeetCode problems.
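For the curious: the Blumer-style structure for all substrings in linear space is what's now usually called a suffix automaton. A compact Python sketch of the standard online construction (naming and layout are mine, not the paper's):

```python
class State:
    __slots__ = ("len", "link", "next")
    def __init__(self):
        self.len = 0       # length of the longest string reaching this state
        self.link = -1     # suffix link
        self.next = {}     # transitions: char -> state index

def build_suffix_automaton(s):
    states = [State()]
    last = 0
    for ch in s:
        cur = len(states)
        states.append(State())
        states[cur].len = states[last].len + 1
        p = last
        while p != -1 and ch not in states[p].next:
            states[p].next[ch] = cur
            p = states[p].link
        if p == -1:
            states[cur].link = 0
        else:
            q = states[p].next[ch]
            if states[p].len + 1 == states[q].len:
                states[cur].link = q
            else:
                # Split state q: clone it so suffix-link invariants hold.
                clone = len(states)
                states.append(State())
                states[clone].len = states[p].len + 1
                states[clone].next = dict(states[q].next)
                states[clone].link = states[q].link
                while p != -1 and states[p].next.get(ch) == q:
                    states[p].next[ch] = clone
                    p = states[p].link
                states[q].link = clone
                states[cur].link = clone
        last = cur
    return states

def is_substring(states, t):
    # O(|t|) recognition: just walk the transitions from the root.
    cur = 0
    for ch in t:
        if ch not in states[cur].next:
            return False
        cur = states[cur].next[ch]
    return True
```

The state count stays linear in `len(s)`, which is where the O(|string|)-space claim comes from; recognition is a plain transition walk, hence O(|substring|).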
But the structure we actually think of as a DAWG or DAFSA (or FST, I guess, thanks to this Rust crate) is in the paper "The World's Fastest Scrabble Program". That worked, but you had to construct a whole trie first and then compact it down, so the build was a memory hog. Then Dr. Daciuk of 3city sharpened the blade in 2000 by proving that this was as good as it gets in the unsorted case, but that on a sorted set you could build the DAFSA incrementally, because an increasingly large part of the graph you were building was already optimized.
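The sorted-input trick can be sketched in a few dozen lines. This toy Python version (identifiers are mine; it follows the commonly described register-of-minimized-states scheme, not any particular paper's code) minimizes the no-longer-shared suffix of the previous word each time a new word arrives:

```python
class Node:
    def __init__(self):
        self.final = False
        self.edges = {}  # char -> Node

    def key(self):
        # Identity of a minimized state: finality + outgoing edges
        # (children are already registered, so id() is stable).
        return (self.final, tuple((c, id(n)) for c, n in sorted(self.edges.items())))

class Dafsa:
    def __init__(self):
        self.root = Node()
        self.register = {}   # key -> canonical Node
        self.unchecked = []  # (parent, char, child) not yet minimized
        self.prev = ""

    def add(self, word):
        assert word > self.prev, "input must be sorted and unique"
        common = 0
        while (common < min(len(word), len(self.prev))
               and word[common] == self.prev[common]):
            common += 1
        # The previous word's tail beyond the common prefix is finished:
        # fold it into the register.
        self._minimize(common)
        node = self.unchecked[-1][2] if self.unchecked else self.root
        for ch in word[common:]:
            nxt = Node()
            node.edges[ch] = nxt
            self.unchecked.append((node, ch, nxt))
            node = nxt
        node.final = True
        self.prev = word

    def _minimize(self, down_to):
        while len(self.unchecked) > down_to:
            parent, ch, child = self.unchecked.pop()
            k = child.key()
            if k in self.register:
                parent.edges[ch] = self.register[k]  # merge shared suffix
            else:
                self.register[k] = child

    def finish(self):
        self._minimize(0)

    def lookup(self, word):
        node = self.root
        for ch in word:
            node = node.edges.get(ch)
            if node is None:
                return False
        return node.final
```

Because words arrive sorted, only the current word's path is ever "unchecked"; everything to its left is already minimal, which is why no full trie ever needs to exist in memory.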
And then from there BurntSushi got involved with the implementation and the rest is history.
Why was the download 3gb, if the solution created a 300x reduction primarily by sharing suffixes? Wouldn’t vanilla compression have dealt with that and achieved a decent (not ideal) amount of compression of the database?
But there's probably not encryption either, so the sqlite database file probably has a lot of duplicated data inside it that's visible externally. I'm also curious how well it compresses by just running the sqlite file through a compression tool. I don't know enough about sqlite storage internals to guess.
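To gauge that intuition, here's a quick zlib experiment on a synthetic word list (the stems and suffixes are invented, and this says nothing about SQLite's actual page layout), showing how much of the suffix duplication a general-purpose compressor already recovers:

```python
import zlib

# Synthetic stand-in for an uncompressed on-disk dictionary: many entries
# sharing a small set of inflection-like suffixes.
stems = [f"sana{i}" for i in range(300)]
suffixes = ["", "ssa", "sta", "lla", "lta", "ksi", "n", "t", "ja", "ista"]
blob = "\n".join(stem + suf for stem in stems for suf in suffixes).encode()

packed = zlib.compress(blob, 9)
print(f"{len(blob)} bytes -> {len(packed)} bytes")
assert len(packed) < len(blob) // 2  # generic compression already wins big here
```

Generic compression exploits the repetition but still has to store every entry's bytes somewhere in the stream; a DAFSA/FST instead stores each shared suffix once structurally, which is a different (and for lookup, query-friendly) kind of saving.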
Sharing my old write-up here in case you or other data-structure nerds find the overlap interesting! https://williame.github.io/post/87682811573.html
Sure enough, the first paragraph on the Wikipedia entry for DAFSA is:
> DAFSA is the rediscovery of a data structure called Directed Acyclic Word Graph (DAWG)
[1]: https://www.sciencedirect.com/science/article/pii/0304397585...
(That's what I can glean from ~30 minutes of not particularly focused reading. Forgive me if I have made any mistakes.)
https://moodle2.units.it/pluginfile.php/718375/mod_resource/...
https://sqlite.org/com/cerod.html
It was later used to power a popular Python package for processing Russian morphology (pymorphy).