7 comments

  • Hendrikto 5 hours ago
    > I do wish to point out, of course, that the whole reason it was possible to experiment cheaply and come across this serendipity was because 9 months ago, faced with the choice to either do the bad easy thing or the good nothing, I chose to do the bad easy thing.5 The SQLite database worked! I understood how it worked, behind the scenes with its B-trees and its Full Text Search extension.

    This is the most important takeaway, imo, and a very valuable technique: Start with the obvious, stupid solution that definitely works. Then do the optimized version, while making sure it matches the naive implementation. In this case, the optimized version could even be generated from the naive one.
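    A minimal sketch of that workflow (all names and data here are made up; the article's actual code is SQLite plus an FST, neither of which is shown): keep the obvious version around as an oracle and check the optimized version against it.

```python
import bisect

def naive_lookup(words, prefix):
    # Obvious, stupid solution: linear scan. Definitely correct.
    return sorted(w for w in words if w.startswith(prefix))

def optimized_lookup(index, prefix):
    # Stand-in for the clever structure (the article uses an FST);
    # here it's just binary search over a sorted list, to have
    # something cheaper than a full scan to test.
    i = bisect.bisect_left(index, prefix)
    out = []
    while i < len(index) and index[i].startswith(prefix):
        out.append(index[i])
        i += 1
    return out

words = ["kissa", "kissan", "koira", "koiran", "talo"]
index = sorted(words)

# The naive version is the oracle: any disagreement means the
# optimized version is wrong.
for p in ["", "ki", "ko", "kissa", "x"]:
    assert optimized_lookup(index, p) == naive_lookup(words, p)

print(optimized_lookup(index, "ki"))  # ['kissa', 'kissan']
```

    The nice property is exactly what the parent comment notes: the naive implementation costs almost nothing to write, and it stays useful afterwards as the reference the optimized version must match.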

    • NooneAtAll3 1 hour ago
      The problem with such an approach is that sometimes the "obvious easy thing" becomes so entrenched, and affects so much, that ripping it out becomes a disproportionately monumental task
    • moron4hire 3 hours ago
      Technical debt, like all forms of debt, can be used for leverage.
      • Imustaskforhelp 3 hours ago
        I feel like it is important to manage the risk and to clearly manage this debt. Personally, I try to stay safe from both financial debt and technical debt until there are sound reasons for either.

        It is a KISS stack for me personally (Keep It Stupid Simple).

        I would still consider technical debt to be different from other forms of debt, though. It feels much more like a tradeoff to me, but perhaps all debt can be classified as such. Either way, I think it makes for an interesting decision nonetheless.

        • kanwisher 2 hours ago
          If the product fails, all technical debt goes away, so using technical debt to prove out a product is very different. Often you don't have the money or resources to build things correctly at first, since you don't know whether the expected payoff will be there.
    • simmonmt 4 hours ago
      It jumped out at me too, but because I wondered what it would look like in the AI version of this story. Having had it build the SQL version, do you ... a) miss the leap because you don't understand how it works, don't care to know, and go off to vibe the next thing; b) ask it lots of questions to develop that deep understanding, then make the leap yourself; or c) rely on it (prompt: "this can't be good enough, do better") to make the leap for you?

      (Assuming for the sake of argument that you guided it to the SQL version first)

      • embedding-shape 4 hours ago
        Depends on what your overall goal is with what you're building. Is it to rush out as many features as you possibly can before VC funding falls through the floor, so you too can get a slice of the pie before the party is over? Or are you "retired" in your 30s and now have time to build the perfect software for yourself? Do you need to publish and release an experiment to see how people react to it or use it before you can know whether it's the right thing or not?

        Almost everything needs to be contextualized before you can even begin to answer what the right way forward is; it depends so heavily on the situation you're in.

      • goosejuice 3 hours ago
        Fwiw, giving Opus 4.7 two sentences about building a CLI that does Finnish-to-English translation, and saying I was looking for a space-efficient solution, leads to an answer pointing to FSTs, for the same reasons stated in the blog. This is without a search tool.

        The K-shaped LLM scenario makes a lot of sense to me: educated and experienced devs get better output because they know what to ask.

    • baublet 5 hours ago
      Came here to add this, too. Sometimes the most valuable thing a solution can buy you is time to think of a better solution.
      • chrisweekly 5 hours ago
        Yes - though I'd refine it to say it's not just more time; it's time plus the forcing function of a working solution, which requires understanding the problem space more deeply.
  • wood_spirit 2 hours ago
    This was a fun read! Thanks for the great introduction to Finite State Transducers. I hadn't heard the formal term before, but your article gave me serious déjà vu.

    Years ago, I entered a Scrabble programming contest and needed to compress a GADDAG dictionary to fit into my 6MB L3 cache. Without knowing the official name for it, I ended up using the exact same suffix-compression mechanism by moving characters to the edges instead of the nodes to merge overlapping paths.

    Sharing my old write-up here in case you or other data-structure nerds find the overlap interesting! https://williame.github.io/post/87682811573.html

    • rishkewl 1 hour ago
      Nice reminder that a lot of these wins come from removing identity that the algorithm doesn’t actually need.
  • lscharen 6 hours ago
    I was halfway through the article when I began thinking that the described data structure sounded very similar to something I used about 20 years ago.

    Sure enough, the first paragraph on the Wikipedia entry for DAFSA is:

    > DAFSA is the rediscovery of a data structure called the Directed Acyclic Word Graph (DAWG)

    • hiAndrewQuinn 5 hours ago
      Apparently the structure itself has a bit of a history. The word 'rediscovery' tipped me off to go to Wikipedia myself and read up more about this.

      First, Blumer et al. (1983) came up with a "DAWG", but reading the abstract [1] I was left a little confused as to how exactly we get from 'here is how we store all substrings of a string in O(|string|) space, with "is this a substring [y/n]" recognition in O(|substring|) time' to the modern DAFSA, as cool and useful as that is. Come to think of it, I bet I could use that in some LeetCode problems.

      But the structure we actually think of as a DAWG or DAFSA (or FST, I guess, thanks to this Rust crate) is the one in the paper "The World's Fastest Scrabble Program". That worked, but you had to construct a whole trie first and then compact it down, so building it was a memory hog. Then Dr. Daciuk of 3city sharpened the blade in 2000, proving that this was as good as it gets in the unsorted case, but that on a sorted set you could build the DAFSA incrementally, because an increasingly large part of the graph you were building was already optimized.
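      To make the "build a whole trie first, then compact it down" step concrete, here is a toy sketch. This is the batch approach, not Daciuk's incremental algorithm and not the fst crate's representation; it merges equivalent suffix sub-trees bottom-up by hashing each node's (finality, outgoing edges) signature:

```python
class Node:
    def __init__(self):
        self.final = False
        self.edges = {}  # char -> Node

def build_trie(words):
    # Plain trie: one node per distinct prefix, suffixes not shared.
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root

def minimize(node, registry):
    # Post-order: children are made canonical before the parent is
    # hashed, so two nodes accepting the same suffix language end up
    # with the same key and are merged into one.
    for ch in list(node.edges):
        node.edges[ch] = minimize(node.edges[ch], registry)
    key = (node.final, tuple(sorted((c, id(n)) for c, n in node.edges.items())))
    return registry.setdefault(key, node)

def count_nodes(node, seen):
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(c, seen) for c in node.edges.values())

words = ["tap", "taps", "top", "tops"]
trie = build_trie(words)
print(count_nodes(trie, set()))   # 8 nodes in the raw trie
dawg = minimize(trie, {})
print(count_nodes(dawg, set()))   # 5 after merging shared suffixes
```

      The memory-hog complaint above is visible even here: the full trie has to exist before minimization can run, which is exactly what the sorted incremental construction avoids.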

      And then from there BurntSushi got involved with the implementation and the rest is history.

      [1]: https://www.sciencedirect.com/science/article/pii/0304397585...

      (That's what I can glean from ~30 minutes of not particularly focused reading. Forgive me if I have made any mistakes.)

    • carefulfungi 4 hours ago
      I recently learned about a related data structure designed for scrabble style word searches: https://en.wikipedia.org/wiki/GADDAG.
    • zozbot234 1 hour ago
      Isn't this just a regular expression?
  • cadamsdotcom 2 hours ago
    Why was the download 3 GB if the solution achieved a 300x reduction primarily by sharing suffixes? Wouldn't vanilla compression have dealt with that and achieved a decent (if not ideal) amount of compression of the database?
    • andriy_koval 2 hours ago
      I think there is no vanilla compression in SQLite.
      • Arcuru 20 minutes ago
        But there's probably no encryption either, so the SQLite database file probably has a lot of duplicated data inside it that's visible externally. I'm also curious how well it compresses by just running the SQLite file through a compression tool. I don't know enough about SQLite storage internals to guess.
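        One rough way to explore that question with stand-in data (the file name here is made up, and the repeated text is a placeholder for the real database's redundancy; a real test would run the compressor on the article's actual 3 GB file):

```shell
# Generate a file full of repeated strings, standing in for the
# duplicated suffix data, then compress it and compare sizes.
# A generic compressor squeezes repeated byte runs well, though it
# can't restructure B-tree pages the way an FST restructures a wordlist.
yes "kaunis kauniin kauniissa kauniista kauniisti" | head -n 100000 > dictionary.db
gzip -9 -c dictionary.db > dictionary.db.gz
ls -l dictionary.db dictionary.db.gz
```

        On this artificial input gzip does extremely well; on a real SQLite file the ratio would be worse, since the redundancy is scattered across pages rather than adjacent.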
      • woadwarrior01 2 hours ago
        There's a commercial extension that can do this.

        https://sqlite.org/com/cerod.html

  • hmokiguess 2 hours ago
    This was such a pleasant read; thank you for sharing your experience. I feel motivated now to go solve the same problems twice!
  • rishkewl 1 hour ago
    This feels like a good case of "SQLite was the right bad solution."
  • asibahi 5 hours ago
    This was a very interesting read. I wonder if similar techniques can apply to Turkish or Japanese dictionaries?
    • macleginn 19 minutes ago
      An old C++ library using a related approach (DAWG) was seemingly created by a Japanese programmer: https://code.google.com/archive/p/dawgdic/

      It was later used to power a popular Python package for processing Russian morphology (pymorphy).

    • comonoid 2 hours ago
      I don't know about Japanese, but they definitely can apply to Turkish.