There are very few phrases in all of history that have done more damage to the project of software development than:
"Premature optimization is the root of all evil."
First, let's not besmirch the good name of Tony Hoare. The quote is from Donald Knuth, and the missing context is essential.
From his 1974 paper, "Structured Programming with go to Statements":
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
He was talking about using GOTO statements. He was talking about making software much harder to reason about in the name of micro-optimizations. He assumed (incorrectly) that we would respect the machines our software runs on.
Multiple generations of programmers have now been raised to believe that brutally inefficient, bloated, and slow software is just fine. There is no limit to the amount of boilerplate and indirection a computer can be forced to execute. There is no ceiling to the crystalline abstractions emerging from these geniuses. There is no amount of time too long for a JVM to spend starting.
I worked at Google many years ago. I have lived the absolute nightmares that evolve from the willful misunderstanding of this quote.
No thank you. Never again.
I have committed these sins more than any other, and I'm mad as hell about it.
Huh, I've always understood that quote very differently, with emphasis on "premature" ... not as in, "don't optimize" but more as in "don't optimize before you've understood the problem" ... or, as a CS professor of mine said "Make it work first, THEN make it work fast" ...
And if you know in advance that a function will be in the critical path, and it needs to perform some operation on N items, and N will be large, it’s not premature to consider the speed of that loop.
Well, you see, in corporate (at least in big tech), this is usually used as a justification to merge inefficient code (we will optimize it later). That later never comes: either the developers/management move on or the work item never gets prioritized. That is until the bad software either causes outages or customer churn. Then it is fixed and shown as high impact in your next promo packet.
Another one from my personal experience: apply DRY principles (don't repeat yourself) the third time you need something. Or in other words: you're allowed to copy-and-paste the same piece of code in two different places.
Far too often we generalise a piece of logic that we need in one or two places, making things more complicated for ourselves whenever they inevitably start to differ. And chances are very slim we will actually need it more than twice.
Premature generalisation is the most common mistake that separates a junior developer from an experienced one.
The rule of 3 is awful because it focuses on the wrong thing. If two instances of the same logic represent the same concept, they should be shared. If 10 instances of the same logic represent unrelated concepts, they should be duplicated.
The goal is to have code that corresponds to a coherent conceptual model for whatever you are doing, and the resulting codebase should clearly reflect the design of the system. Once I started thinking about code in these terms, I realized that questions like "DRY vs YAGNI" were not meaningful.
Of course, the rule of 3 is saying that you often _can't tell_ what the shared concept between different instances is until you have at least 3 examples.
It's not about copying identical code twice, it's about refactoring similar code into a shared function once you have enough examples to be able to see what the shared core is.
But don’t let the rule of 3 be an excuse for you to not critically assess the abstract concepts that your program is operating upon and within.
I too often see junior engineers (and senior data scientists…) write code procedurally, with giant functions and many, many if statements, presumably because in their brain they’re thinking about “1st I do this if this, 2nd I do that if that, etc”.
> If two instances of the same logic represent the same concept, they should be shared. If 10 instances of the same logic represent unrelated concepts, they should be duplicated.
I think we should not even generalize it down to a rule of three, because then you're outsourcing your critical thinking to a rule rather than doing the thinking yourself.
Instead, I tend to ask: if I change this code here, will I always also need to change it over there?
Copy-paste is good as long as I'm just repeating patterns. A for loop is a pattern. I use for loops in many places. That doesn't mean I need to somehow abstract out for loops because I'm repeating myself.
But if I have logic that says that button_b.x = button_a.x + button_a.w + padding, then I should make sure that I only write that information down once, so that it stays consistent throughout the program.
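A rough sketch of that idea, assuming a hypothetical `Button` record, a made-up `PADDING` value, and a helper I am calling `place_right_of` (none of these names come from the comment above). The spacing rule is written down once, so every caller stays consistent:

```python
from dataclasses import dataclass

PADDING = 8  # hypothetical spacing value, chosen only for illustration


@dataclass
class Button:
    x: int  # left edge
    w: int  # width


def place_right_of(anchor: Button, width: int, padding: int = PADDING) -> Button:
    """Return a new button placed immediately to the right of `anchor`.

    The layout rule (x = anchor.x + anchor.w + padding) lives only here.
    """
    return Button(x=anchor.x + anchor.w + padding, w=width)


button_a = Button(x=10, w=100)
button_b = place_right_of(button_a, width=80)
print(button_b.x)  # 10 + 100 + 8 = 118
```

If the rule ever changes (say, padding becomes dependent on screen density), there is exactly one place to edit.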
The reason for the rule of thumb is because you don't know whether you will need to change this code here when you change it there until you've written several instances of the pattern. Oftentimes different generalizations become appropriate for N=1, N=2, N>=3 && N <= 10, N>=10 && N<=100, and N>=100.
Your example is a pretty good one. In most practical applications, you do not want to be setting button x coordinates manually. You want to use a layout manager, like CSS Flexbox or Jetpack Compose's Row or Java Swing's FlowLayout, which takes in a padding and a direction for a collection of elements and automatically figures out where they should be placed. But if you only have one button, this is overkill. If you only have two buttons, this is overkill. If you have 3 buttons, you should start to realize this is the pattern and reach for the right abstraction. If you get to 10 buttons, you'll realize that you need to arrange them in 2D as well and handle how they grow & shrink as you resize the window, and there's a good chance you need a more powerful abstraction.
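To make the "3 buttons" step concrete, here is a toy row layout in Python. It is only a sketch of what a layout manager does in one dimension, not how Flexbox, Compose, or Swing are implemented; `Box` and `layout_row` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Box:
    w: int      # width of the element
    x: int = 0  # left edge, filled in by the layout pass


def layout_row(boxes: list[Box], start_x: int = 0, padding: int = 0) -> list[Box]:
    """Assign x coordinates left to right, separating elements by `padding`."""
    cursor = start_x
    for box in boxes:
        box.x = cursor
        cursor += box.w + padding
    return boxes


buttons = layout_row([Box(w=100), Box(w=80), Box(w=60)], start_x=10, padding=8)
print([b.x for b in buttons])  # [10, 118, 206]
```

With one or two buttons this is arguably more machinery than the arithmetic it replaces. It starts to pay off once elements are added, removed, or reordered.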
The instances should be based on the context. For example we had a few different API providers for the same thing, and someone refactored the separate classes into a single one that treats all of the APIs the same.
Well, turns out that 3 of the APIs changed the way they return the data, so instead of separating the logic, someone kept adding a bunch of if statements into a single function in order to avoid repeating the code in multiple places. It was a nightmare to maintain and I ended up completely refactoring it, and even tho some of the code was repeated, it was much easier to maintain and accommodate to the API changes.
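A minimal sketch of the "separate the logic" version, assuming two hypothetical providers with made-up payload shapes. Each adapter owns its own parsing, so a format change in one provider touches one class instead of growing another `if` branch in a shared function:

```python
from typing import Protocol


class PriceProvider(Protocol):
    """Anything that can turn its own raw payload into a price."""

    def parse(self, payload: dict) -> float: ...


class ProviderA:
    # Hypothetical format: price as a decimal string at the top level.
    def parse(self, payload: dict) -> float:
        return float(payload["price"])


class ProviderB:
    # Hypothetical format: integer cents nested under "data".
    def parse(self, payload: dict) -> float:
        return payload["data"]["amount_cents"] / 100


def latest_price(provider: PriceProvider, payload: dict) -> float:
    """Shared code depends only on the small interface, not on payload shapes."""
    return provider.parse(payload)


print(latest_price(ProviderA(), {"price": "12.50"}))
print(latest_price(ProviderB(), {"data": {"amount_cents": 1250}}))
```

Some lines are repeated across adapters, but that repetition is the price of letting each provider evolve independently.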
I think this is a reasonable rule of thumb, but there are also times that the code you are about to write a second time is extremely portable and can easily be made reusable (say less than 5 minutes of extra time to make the abstraction). In these cases I think it's worth it to go ahead and do it.
Having identical logic in multiple places (even only 2) is a big contributor to technical debt, since if you're searching for something and you find it and fix it /once/ we often think of the job as done. Then the "there is still a bug and I already fixed that" confusion is avoided by staying DRY.
This is so true. I have been burned by this more times than I can count. You see two functions that look similar, you extract a shared utility, and then six months later one of them needs a slightly different behavior and now you are fighting your own abstraction instead of just changing one line in a copy. The rule of three is a good default. Let the pattern prove itself before you try to generalize it.
I really like Casey Muratori's "[Semantic] Compression-oriented programming" - which is the philosophical backing of "WET" (Write Everything Twice) counterpart to DRY.
You say that, but I've created plenty of production bugs because two different implementations diverge. Easier to avoid such bugs if we just share the implementation.
I've also seen a lot of production bugs because two things that appeared to be a copy/paste were actually conceptually different and making them common made the whole much more complex trying to get common code to handle things that diverged even though they started from the same place.
“Once, twice, automate/abstract” is a good general rule but you have to understand that the thing you’re counting isn’t appearances in the source code, it’s repetitions of the same logic in the same context. It’s gotta mean the same, not just look the same.
Depends on length and complexity, imho. If it's more than a line or two of procedure? Or involves anything counterintuitive? DRY at 2.
Extract a method or object if it's something that feels conceptually a "thing" even if it has only one use. Most tools to DRY your code also help by providing a bit of encapsulation that do a great job of tidying things up to force you to think about "should I be letting this out of domain stuff leak in here?"
More critical in my mind is investigating the "inevitably start to differ" option.
If two pieces of code use the same functionality by coincidence but could possibly evolve differently then don't refactor. Don't even refactor if this happens three, four, or five times. Because even if the code may be identical today the features are not actually identical.
But if you have two uses of code that are actually semantically identical and will assuredly evolve together then go ahead and refactor to remove duplication.
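A small illustration of the distinction, with every name and number invented for the example. The two validators look identical today but describe different concepts, so they stay separate. The two totals must always apply the same tax rule, so they share one function:

```python
# Coincidentally identical: usernames and project names are different concepts
# that happen to share a length rule today. Either one may change on its own.
def is_valid_username(name: str) -> bool:
    return 3 <= len(name) <= 20


def is_valid_project_name(name: str) -> bool:
    return 3 <= len(name) <= 20


# Semantically identical: invoices and receipts must agree on tax, always.
TAX_PERCENT = 20  # hypothetical rate, integer percent


def with_tax(net_cents: int) -> int:
    """Single definition of the tax rule, in integer cents."""
    return net_cents + net_cents * TAX_PERCENT // 100


def invoice_total(item_cents: list[int]) -> int:
    return with_tax(sum(item_cents))


def receipt_total(item_cents: list[int]) -> int:
    return with_tax(sum(item_cents))


print(invoice_total([400, 600]))  # 1200
```

The test from a few comments up applies: if the tax rule changes, both totals must change; if the username rule changes, project names probably should not.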
Ehh, people who are really excited about DRY write unreadable convoluted code, where the bulk of the code is abstractions invented to avoid rewriting a small amount of code and unless you're very familiar with the codebase reasoning about what it actually does is a mystery because related pieces of functionality are very far away from each other.
DRY is not to avoid writing code (of any amount). DRY is a maintainability feature. "Unless you're very familiar with the code" you probably won't remember that you have to make this change in two places instead of one. DRY makes life easier for future you, and anyone else unfortunate to encounter (y)our mess.
I think the bigger problem is that "Premature optimization is the root of all evil" is a statement made by software engineers to feel more comfortable in their shortcomings.
That's not to bemoan the engineer with shortcomings. Even the most experienced and educated engineer might find themself outside their comfort zone, implementing code without the ability to anticipate the performance characteristics under the hood. A mental model of computation can only go so far.
Articulated more succinctly, one might say "Use the profiler, and use it often."
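For anyone who has not done this before, a minimal sketch using Python's standard-library `cProfile` and `pstats`. The workload (`slow_sum_of_squares`, `workload`) is a made-up stand-in for real code:

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n: int) -> int:
    """Deliberately plain loop so it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload() -> int:
    return slow_sum_of_squares(200_000)


profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Collect the report in a string so it can be printed, logged, or inspected.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
report = buffer.getvalue()
print(report)
```

The point is less the tool than the habit: look at where the time actually goes before deciding what to change.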
I was a bit worried you are paraphrasing Rob Pike, but no, he actually agrees with that Knuth quote.
I am almost certain that people building bloated software are not willfully misunderstanding this quote; it's likely they never heard about it. Let's not ignore the relevance of this half a century old advice just because many programmers do not care about efficiency or do not understand how computers work. Premature optimization is exactly that, the fact that it is premature makes it wrong, regardless of whether it's about GOTO statements in the 70s or some modern equivalent where in the name of craft or fun people make their apps a lot more complex than they should be. I wouldn't be surprised if some of the brutally inefficient code you mention was so because people optimized prematurely for web-scale and their app never ever needed those abstractions and extra components. The advice applies both to hackers doing micro-optimizations and architecture astronauts dreaming too big IMHO.
Oh yes, I'd recommend everyone who uses the phrase reads the rest of the paper to see the kinds of optimisations that Knuth considers justified. For example, optimising memory accesses in quicksort.
This shows how hard it is to create a generalized and simple rule regarding programming. Context is everything and a lot is relative and subjective.
Tips like "don't try to write smart code" are often repeated but useless (not to mention that "smart" here means over-engineered or overly complex, not smart).
I dunno, I've seen people try to violate "don't prematurely optimize" probably a thousand times (no exaggeration) and never ONCE seen this happen:
1. Somebody verifies with the users that speed is actually one of the most burning problems.
2. They profile the code and discover a bottleneck.
3. Somebody says "no, but we shouldn't fix that, that's premature optimization!"
I've heard all sorts of people like OP moan that "this is why pieces of shit like Slack are bloated and slow" (it isn't) when advocating skipping steps 1 and 2 though.
I don't think they misunderstand the rule, either, they just don't agree with it.
Did Pike really have to specify explicitly that you have to identify that a problem is a problem before solving it?
To be fair, I think human nature is probably a bigger culprit here than the quote. Yes, it was one of the first things told to me as a new programmer. No, I don't think it influenced very heavily how I approach my work. It's just another small (probably reasonable) voice in the back of my head.
I don't think the quote itself is responsible for any of that.
It's true that premature optimization (that is, optimization before you've measured the software and determined whether the optimization is going to make any real-world difference) is bad.
The reality, though, is that most programmers aren't grappling with whether their optimizations are premature, they're grappling with whether to optimize at all. At most companies, once the code works, it ships. There's little, if any, time given for an extra "optimization" pass.
It's only after customers start complaining about performance (or higher-ups start complaining about compute costs) that programmers are given any time to go through and optimize things. By which point refactoring the code is now much harder than it would've been originally.
I usually defer this until a PM does the research to highlight that speed is a burning issue.
I find 98% of the time that users are clamoring to get something implemented or fixed which isn't speed related so I work on that instead.
When I do drill down what I tend to find in the flame graphs is that your scope for making performance improvements a user will actually notice is bottlenecked primarily by I/O not by code efficiency.
Meanwhile my less experienced coworkers will spot a nested loop that will never take more than a couple of milliseconds and demand it be "optimised".
Picking the starting point is very important. "optimization" is the process of going from that starting point to a more performant point.
If you don't know enough to pick good starting points you probably won't know enough to optimize well. So don't optimize prematurely.
If you are experienced enough to pick good starting points, still don't optimize prematurely.
If you see a bad starting point picked by someone else, by all means, point it out if it will be problematic now or in the foreseeable future, because that's a bug.
Slow code is more of a project management problem. Features are important and visible on the roadmap. Performance usually isn't until it hits "unacceptable", which may take a while to feed back. That's all it is.
(AI will probably make this worse as well, having a bloat tendency all of its own)
Totally agree. I’ve seen that quote used to justify wilfully ignoring basic performance techniques. Then people are surprised when the app is creaking exactly due to the lack of care taken earlier. I would tend to argue the other way most of the time: a little performance consideration goes a long way!
Maybe I’ve had an unrepresentative career, but I’ve never worked anywhere where there’s much time to fiddle with performance optimisations, let alone those that make the code/system significantly harder to understand. I expect that’s true of most people working in mainstream tech companies of the last twenty years or so. And so that quote is basically never applicable.
Don't confuse premature pessimization for the warnings against premature optimization.
I can write bubble sort, it is simple and I have confidence it will work. I wrote quicksort for class once - I turned in something that mostly worked but there were bugs I couldn't fix in time (but I could if I spent more time - I think...)
However, writing bubble sort is wrong because any good language has a sort in the standard library (likely timsort or something other than quicksort in the real world).
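To put the two side by side, a quick sketch in Python. The hand-written `bubble_sort` is my own illustration, not anyone's production code; `sorted` is the built-in.

```python
def bubble_sort(items: list[int]) -> list[int]:
    """Simple O(n^2) sort, written out only for comparison."""
    result = list(items)  # copy so the caller's list is not modified
    n = len(result)
    for i in range(n):
        swapped = False
        for j in range(n - 1 - i):
            if result[j] > result[j + 1]:
                result[j], result[j + 1] = result[j + 1], result[j]
                swapped = True
        if not swapped:  # already sorted, stop early
            break
    return result


data = [5, 2, 9, 1, 5, 6]
print(bubble_sort(data))  # [1, 2, 5, 5, 6, 9]
print(sorted(data))       # [1, 2, 5, 5, 6, 9]
```

The standard-library call is shorter, already tested, and has no loop bounds to get wrong, which is the opposite of a premature pessimization.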
> Multiple generations of programmers have now been raised to believe that brutally inefficient, bloated, and slow software is just fine. There is no limit to the amount of boilerplate and indirection a computer can be forced to execute. There is no ceiling to the crystalline abstractions emerging from these geniuses. There is no amount of time too long for a JVM to spend starting.
I think that's due to people doing premature optimization! If people took the quote to heart, they would be less inclined to increasing the amount of boilerplate and indirection.
While you were seeing those problems with Java at Google, I was seeing it with Python.
So many levels of indirection. Holy cow! So many unneeded superclasses and mixins! You can’t reason about code if the indirection is deeper than the human mind can grasp.
There was also a belief that list comprehensions were magically better somehow and would expand to 10-line monstrosities of unreadable code when a nested for loop would have been more readable and just as fast but because list comprehensions were fetishized nobody would stop at their natural readability limits. The result was like reading the run-on sentence you just suffered through.
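A small made-up example of the readability cliff. Both versions below compute the same result from a hypothetical `grid`; this sketch makes no claim about which one is faster.

```python
grid = [[1, -2, 3], [4, 5, -6], [-7, 8, 9]]

# Past its natural limit: two loops and two filters squeezed into one expression.
dense = [(r, c, v * v) for r, row in enumerate(grid) for c, v in enumerate(row) if v > 0 if (r + c) % 2 == 0]

# Same logic as a nested loop, one decision per line.
readable = []
for r, row in enumerate(grid):
    for c, v in enumerate(row):
        if v <= 0:
            continue  # skip non-positive values
        if (r + c) % 2 != 0:
            continue  # keep only cells where row + column is even
        readable.append((r, c, v * v))

print(readable)  # [(0, 0, 1), (0, 2, 9), (1, 1, 25), (2, 2, 81)]
```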
In all honesty, this is one of the less abused quotes, and I have seen more benefit from it than harm.
Like you, I've seen people produce a lot of slow code, but it's mostly been from people who would have a really hard time writing faster code that's less wrong.
I hate slow software, but I'd pick it anytime over bogus software. Also, generally, it's easier to fix performance problems than incorrect behavior, especially so when the error has created data that's stored somewhere we might not have access to. But even more so, when the harm has reached the real world.
User-facing, sure, nothing stopping us from doing "simple and fast" software. But when it comes to the code, design and architecture, "simple" is often at odds with "fast", and also "secure". Once you need something to be fast and secure, it often leads to a less simple design, because now you care about more things, it's kind of hard to avoid.
IME doing application servers and firmware my whole career, simple and fast are usually the same thing, and "simple secure" is usually better security posture than "complex secure".
Interesting, never done firmware, but plenty of backends and frontends. Besides the whole "do less and things get faster", I can't think of a single case where "simple" and "fast" is the same thing.
And I'd agree that "simple secure" is better than "complex secure" but you're kind of side-stepping what I said, what about "not secure at all", wouldn't that lead to simpler code? Usually does for me, especially if you have to pile it on top of something that is already not so secure, but even when taking it into account when designing from ground up.
Same. I, too, am sick of bloated code. But I use the quote as a reminder to myself: "look, the fact that you could spend the rest of the workday making this function run in linear instead of quadratic time doesn't mean you should – you have so many other tasks to tackle that it's better that you leave the suboptimal-but-obviously-correct implementation of this one little piece as-is for now, and return to it later if you need to".
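As an illustration of that trade-off, here is a made-up duplicate check in both forms. The quadratic one is the "obviously correct" version; the linear one assumes the items are hashable and relies on average-case constant-time set lookups.

```python
def has_duplicates_quadratic(items: list) -> bool:
    """Compare every pair. O(n^2), but hard to get wrong."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items: list) -> bool:
    """Remember what we have seen. Expected O(n), needs hashable items."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False


sample = [3, 1, 4, 1, 5]
print(has_duplicates_quadratic(sample), has_duplicates_linear(sample))  # True True
```

For a list of twenty items called once per request, the difference is unlikely to be measurable. For a list of millions in a hot path, it is the whole story.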
> Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
Always preferred Perlis' version, that might be slightly over-used in functional programming to justify all kinds of hijinks, but with some nuance works out really well in practice:
> 9. It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures.
> I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
From what I understand from the vibe coders, they tell a machine what the code should do, but not how it should do it. They leave the important decisions (the shape of data) to an LLM and thus run afoul of this, which is gonna bite them in the ass eventually.
I think this is sometimes a barrier to getting started for me. I know that I need to explore the data structure design in the context of the code that will interact with it and some of that code will be thrown out as the data structure becomes more clear, but still it can be hard to get off the ground when my gut instinct is that the data design isn't right.
This kind of exploration can be a really positive use case for AI I think, like show me a sketch of this design vs that design and let's compare them together.
My recommendation is to truly learn a functional language and apply it to a real world product. Then you’ll learn how to think about data, in its pure state, and how it is transformed to get from point A to point B. These lessons will make for much cleaner design that will be applicable to imperative languages as well.
Or learn C where you do not have the luxury of using high-level crutches.
> This kind of exploration can be a really positive use case for AI I think
Not sure if SoTA codegen models are capable of navigating design space and coming up with optimal solutions. Like for cybersecurity, maybe specialized models (like DeepMind's Sec-Gemini), if there are any, might?
I reckon, a programmer who already has learnt about / explored the design space, will be able to prompt more pointedly and evaluate the output qualitatively.
Yeah key word is exploration. It's not "hey Claude write the design doc for me" but rather, here's two possible directions for how to structure my solution, help me sketch each out a bit further so that I can get a better sense what roadblocks I may hit 50-100 hours into implementation when the cost of changing course is far greater.
"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious." -- Fred Brooks, The Mythical Man Month (1975)
This is the biggest issue I see with AI driven development. The data structures are incredibly naive. Yes it's easy to steer them in a different direction but that comes at a long term cost. The further you move from naive the more often you will need to resteer downstream and no amount of context management will help you, it is fighting against the literal mean.
The rule may not hold with AI driven development. The rule exists because it's expensive to rewrite code that depends on a given data structure arrangement, and so programmers usually resort to hacks (eg. writing translation layers or views & traversals of the data) so they can work with a more convenient data structure with functionality that's written later. If writing code becomes free, the AI will just rewrite the whole program to fit the new requirements.
This is what I've observed with using AI on relatively small (~1000 line) programs. When I add a requirement that requires a different data structure, Claude will happily move to the new optimal data structure, and rewrite literally everything accordingly.
I've heard that it gets dicier when you have source files that are 30K-40K lines and programs that are in the million+ line range. My reports have reported that Gemini falls down badly in this case, because the source file blows the context window. But even then, they've also reported that you can make progress by asking Gemini to come up with the new design, and then asking it to come up with a list of modules that depend upon the old structure, and then asking it to write a shim layer module-by-module to have the old code use the new data structure, and then have it replace the old data structure with the new one, and then have it remove the shim layer and rewrite the code of each module to natively use the new data structure. Basically, babysit it through the same refactoring that an experienced programmer would use to do a large-scale refactoring in a million+ line codebase, but have the AI rewrite modules in 5 minutes that would take a programmer 5 weeks.
Naive doesn't mean bad. 99% of software can be written with understood, well documented data structures. One of the problems with ai is that it allows people to create software without understanding the trade offs of certain data structures, algorithms and more fundamental hardware management strategies.
You don't need to be able to pass a leet code interview, but you should know about big O complexity, you should be able to work out if a linked list is better than an array, you should be able to program a trie, and you should be at least aware of concepts like cache coherence / locality. You don't need to be an expert, but these are realities of the way software and hardware work. They're also not super complex to gain a working knowledge of, and various LLMs are probably a really good way to gain that knowledge.
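For reference, a trie really is only a few lines. This is a minimal dictionary-based sketch with method names of my own choosing, not a tuned implementation:

```python
class Trie:
    """Prefix tree: each node maps a character to a child node."""

    def __init__(self) -> None:
        self.children: dict[str, "Trie"] = {}
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            if ch not in node.children:
                node.children[ch] = Trie()
            node = node.children[ch]
        node.is_word = True

    def _find(self, prefix: str) -> "Trie | None":
        """Walk down the tree; return the node for `prefix` or None."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word: str) -> bool:
        node = self._find(word)
        return node is not None and node.is_word

    def starts_with(self, prefix: str) -> bool:
        return self._find(prefix) is not None


trie = Trie()
for w in ("car", "card", "care"):
    trie.insert(w)

print(trie.contains("car"), trie.contains("ca"), trie.starts_with("ca"))  # True False True
```

Lookups cost time proportional to the length of the key rather than the number of stored words, which is the trade-off worth knowing about.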
Then don't let the AI write the data structures. I don't. I usually don't even let the AI write the class or method names. I give it a skeleton application and let it fill in the code. Works great, and I retain knowledge of how the application works.
> This is the biggest issue I see with AI driven development. The data structures are incredibly naive.
Bill Gates, for example, always advocated for thinking through the entire program design and data structures before writing any code, emphasizing that structure is crucial to success.
And Paul Allen wrote a whole Altair emulator so that they could use an (academic) Harvard computer for their little (commercial) project and test/run Bill Gates' BASIC interpreter on it.
As I'm sure more and more people are using AI to document old systems, even just to get a foothold in them personally if they don't intend to share it, here's a hint related to that: By default, if you fire an AI at a programming base, at least in my experience you get the usual documentation you expect from a system: This is the list of "key modules", this module does this, this module does that, this module does the other thing.
This is the worst sort of documentation; technically true but quite unenlightening. It is, in the parlance of the Fred Brooks quote mentioned in a sibling comment, neither the "flowchart" nor the "tables"; it is simply a brute enumeration of code.
To which the fix is, ask for the right thing. Ask for it to analyze the key data structures (tables) and provide you the flow through the program (the flowchart). It'll do it no problem. Might be inaccurate, as is a hazard with all documentation, but it makes as good a try at this style of documentation as "conventional" documentation.
Honestly one of the biggest problems I have with AI coding and documentation is just that the training set is filled to the brim with mediocrity and the defaults are inferior like this on numerous fronts. Also relevant to this conversation is that AI tends to code the same way it documents and it won't have either clear flow charts or tables unless you carefully prompt for them. It's pretty good at doing it when you ask, but if you don't ask you're gonna get a mess.
(And I find, at least in my contexts, using opus, you can't seem to prompt it to "use good data structures" in advance, it just writes scripting code like it always does and like that part of the prompt wasn't there. You pretty much have to come back in after its first cut and tell it what data structures to create. Then it's really good at the rest. YMMV, as is the way of AI.)
I find languages like Haskell, ReScript/OCaml to work really well for CRUD applications because they push you to think about your data and types first. Then you think about the transformations you want to make on the data via functions. When looking at new code I usually look for the types first, specifically what is getting stored and read.
Similarly, that approach works really well in Clojure too, albeit with a lot less concern for types, but the "data and data structures first" principle is widespread in the ecosystem.
Aren't they basically saying opposite things? Perlis is saying "don't choose the right data structure, shoehorn your data into the most popular one". This advice might have made sense before generic programming was widespread; I think it's obsolete.
> Perlis is saying "don't choose the right data structure, shoehorn your data into the most popular one"
I don't take it like that. A map could be the right data structure for something people typically reach for classes to do, and then you get a whole bunch of functions that can already operate on a map-like thing for free.
If you take a look at the standard library and the data structures of Clojure you'd see this approach taken to a somewhat extreme amount.
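A rough Python analogue of that idea, using plain dicts in place of Clojure maps. `select_keys` and `rename_key` are hypothetical helpers written for this sketch (the first is loosely modeled on Clojure's `select-keys`), and the records are invented:

```python
from typing import Any, Mapping


def select_keys(m: Mapping[str, Any], keys: list[str]) -> dict[str, Any]:
    """Return a new dict containing only the requested keys that exist."""
    return {k: m[k] for k in keys if k in m}


def rename_key(m: Mapping[str, Any], old: str, new: str) -> dict[str, Any]:
    """Return a new dict with one key renamed, everything else untouched."""
    return {(new if k == old else k): v for k, v in m.items()}


# Two unrelated "entities", both just maps.
user = {"name": "Ada", "email": "ada@example.com", "role": "admin"}
invoice = {"number": 42, "email": "billing@example.com", "total": 99}

# The same functions work on both without either needing its own class.
print(select_keys(user, ["name", "email"]))
print(select_keys(invoice, ["email", "total"]))
print(rename_key(user, "role", "access_level"))
```

Had `user` and `invoice` been two classes, each would need its own version of these operations or a shared base type. With one data structure the functions come for free.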
One part of it has interesting new resonance in the era of agentic LLMs:
alankay, June 21, 2016:
This is why "the objects of the future" have to be ambassadors that can negotiate with other objects they've never seen.
Think about this as one of the consequences of massive scaling ...
Nowadays rather than the methods associated with data objects, we are dealing with "context" and "prompts".
Hm, not sure. Data on its own (say, a string of numbers) might be meaningless - but structured data? Sure, there may be ambiguity but well-structured data generally ought to have a clear/obvious interpretation. This is the whole idea of nailing your data structures.
Yeah, structured data implies some processing on raw data to improve its meaning. Alan Kay seems to want to push this idea to encapsulate data with rich behaviour.
This quote from “Dive into Python” when I was a fresh graduate was one of the most impacting lines I ever read in a programming book.
> Busywork code is not important. Data is important. And data is not difficult. It's only data. If you have too much, filter it. If it's not what you want, map it. Focus on the data; leave the busywork behind.
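Taken literally, "filter it" and "map it" look like this in Python. The `orders` records are invented for the example:

```python
orders = [
    {"id": 1, "total": 40, "paid": True},
    {"id": 2, "total": 15, "paid": False},
    {"id": 3, "total": 25, "paid": True},
]

# Too much data? Filter it.
paid_orders = filter(lambda order: order["paid"], orders)

# Not the shape you want? Map it.
paid_totals = list(map(lambda order: order["total"], paid_orders))

print(paid_totals)       # [40, 25]
print(sum(paid_totals))  # 65
```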
> Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
If I have learned one thing in my 30-40 years spent writing code, it is this.
I agree. The biggest lesson I try to drive home to newer programmers that join my projects is that it's always best to transform the data into the structure you need at the very end of the chain, not at the beginning or middle. Keep the data in its purest form and then transform it right before displaying it to the user, or right before providing it in the final api for others to consume.
You never know how requirements are going to change over the next 5 years, and pure structures are always the most flexible to work with.
Related: your business logic should work on metric units. It is a UI concern if the user wants to see some other measurement system. Convert to feet, chains, cubits... or whatever obscure measurement system the user wants at display time. (if you do get an embedded device that reports non-metric units convert when it comes in - you will get a different device in the future that reports different units anyway)
You still have to worry about someone using kg when you use g, but you avoid a large class of problems and make your logic easier.
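A rough Python sketch of the "metric inside, convert at the edges" idea. The names here (`Distance`, `from_feet`, `format_for_display`) are made up for illustration and not from any real codebase. Wrapping the number in a type is one possible way to reduce the kg-vs-g kind of mix-up, since the unit lives in the field name instead of in someone's head.

```python
from dataclasses import dataclass

METERS_PER_FOOT = 0.3048  # exact by definition of the international foot


@dataclass(frozen=True)
class Distance:
    """Hypothetical value type: always stored in metres internally."""
    meters: float

    @classmethod
    def from_feet(cls, feet: float) -> "Distance":
        # Convert at the input boundary (e.g. a device that reports feet).
        return cls(feet * METERS_PER_FOOT)

    def __add__(self, other: "Distance") -> "Distance":
        # Business logic only ever sees metres.
        return Distance(self.meters + other.meters)


def format_for_display(d: Distance, unit: str = "m") -> str:
    # Convert at the output boundary; the unit is purely a UI preference.
    if unit == "m":
        return f"{d.meters:.2f} m"
    if unit == "ft":
        return f"{d.meters / METERS_PER_FOOT:.2f} ft"
    raise ValueError(f"unknown display unit: {unit}")


total = Distance(10.0) + Distance.from_feet(100.0)
print(format_for_display(total))        # 40.48 m
print(format_for_display(total, "ft"))  # 132.81 ft
```

Adding a new display unit only touches `format_for_display`; the arithmetic never changes.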
With 100 functions and one data structure, it is almost like programming with global variables, where a new instance is equivalent to a new process. Doesn’t seem like a good rule to follow.
The scope of where that data structure or functions are available is a different concern though, "100 functions + 1 data structure" doesn't require globals or private, it's a separate thing.
As much as relational DBs have held back enterprise software for a very long time by being so conservative in their development, the fact that they force you to put this relationship absolutely front-of-mind is excellent.
I'd personally consider "persistence" AKA "how to store shit" to be a very different concern compared to the data structures that you use in the program. Ideally, your design shouldn't care about how things are stored, unless there is a particular concern about how fast reads/writes need to be.
Often significant improvements to every aspect of a system that interacts with a database can be made by proper design of the primary keys, instead of the generic id way too many people jump to.
The key difficulty is that identifying what these are is far from obvious upfront, and so often an index appears adjacent to a table that represents what the table should have been in the first place.
I guess that might be true also, to some extent. I guess most of the times I've seen something "messy" in software design, it's almost always about domain code being made overly complicated compared to what it has to do, and almost never about "how does this domain data get written/read to/from a database", although it's very common. Although of course storage/persistence isn't non-essential, just a less common problem than the typical design/architecture spaghetti I encounter.
I'm a firm believer in always using an auto-generated surrogate key for the PK because domain PKs always eventually become a pain point. The problem is that doing so does real damage to the ergonomics of the DB.
This is why I fundamentally find SQL too conservative and outdated. There are obvious patterns for cross-cutting concerns that would mitigate things like this but enterprise SQL products like Oracle and MS are awful at providing ways to do these reusable cross-cutting concerns consistently.
I meant to reply to a different comment originally, specifically the one including this quote from Torvalds:
> Good programmers worry about data structures and their relationships.
> -- Linus Torvalds
I was specifically thinking about the "relationship" issues. The worst messes to fix are the ones where the programmer didn't consider how to relate the objects together - which relationships need to be direct PK bindings, which can be indirect, which things have to be cached vs calculated live, which things are the cache (vs the master copy), what the cardinality of each relationship is, which relationships are semantically ownerships vs peers, which data is part of the system itself vs configuration data vs live, how you handle changes to the data (event sourcing vs changelogging vs append-only vs yolo update), etc.
Not quite "data structures" I admit but absolutely thinking hard about the relationship between all the data you have.
SQL doesn't frame all of these questions out for you but it's good at getting you to start thinking about them in a way you might not otherwise.
Also basically everything DHH ever said (I stopped using Rails 15 years ago but just defining data relationships in YAML and typing a single command to get a functioning website and database was in fact pretty cool in the oughts).
Hang on, they mostly agree with each other. I've spoken to Rob Pike a few times and I never heard him call out Perlis as being wrong. On this particular point, Perlis and Pike are both extending an existing idea put forward by Fred Brooks.
Perlis is right in the way that academics so often are and Pike is right in the way that practitioners often are. They also happen to be in rough agreement on this, unsurprisingly so.
Promoting the idea of one data structure with many functions contradicts:
“If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident.”
And:
“Use simple algorithms as well as simple data structures.”
A data structure general enough to solve enough problems to be meaningful will either be poorly suited to some problems or have complex algorithms for those problems, or both.
There are reasons we don’t all use graph databases or triple stores, and rely on abstractions over our byte arrays.
I think you are badly misinterpreting the statement.
Let's say you're working for the DMV on a program for driver's licenses. The idea is to use one structure for driver's license data, as opposed to using one structure for new driver's licenses, a different one for renewals, and yet a third for expired ones, and a fourth one for name changes.
It is not saying that you should use byte arrays for driver's license records, so that you can use the same data structure for driver's license data and missile tracks. Generalize within your program, not across all possible programs running on all computers.
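Roughly what that looks like in Python, as a sketch only. The `License` record, the `Status` enum, and the three functions are hypothetical names I made up for the DMV example; a real system would have far more fields.

```python
from dataclasses import dataclass, replace
from datetime import date
from enum import Enum


class Status(Enum):
    NEW = "new"
    RENEWED = "renewed"
    EXPIRED = "expired"


@dataclass(frozen=True)
class License:
    """One hypothetical record shape for every stage of a licence's life."""
    number: str
    holder_name: str
    expires: date
    status: Status = Status.NEW


def renew(lic: License, new_expiry: date) -> License:
    # Renewal is a function over the same structure, not a new structure.
    return replace(lic, expires=new_expiry, status=Status.RENEWED)


def change_name(lic: License, new_name: str) -> License:
    return replace(lic, holder_name=new_name)


def expire_if_due(lic: License, today: date) -> License:
    if lic.expires < today:
        return replace(lic, status=Status.EXPIRED)
    return lic


lic = License("D123", "Ann Smith", date(2024, 1, 1))
lic = change_name(renew(lic, date(2029, 1, 1)), "Ann Jones")
print(lic.holder_name, lic.status.value)  # Ann Jones renewed
```

Every operation takes and returns the same `License`, so they compose freely, and nothing here is trying to be general enough to hold a missile track.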
These rules apply equally well to system architecture. I've been trying to talk our team out of premature optimization (redis cluster) and fancy algorithms (bloom filters) to compensate for poor data structures (database schema) before we know if performance is going to be a problem.
Even knowing with 100% certainty that performance will be subpar, requirements change often enough that it's often not worth the cost of adding architectural complexity too early.
I don't disagree with these principles, but if I wanted to compress all my programming wisdom into 5 rules, I wouldn't spend 3 out of the 5 slots on performance. Performance is just a component of correctness: if you have a good methodology to achieve correctness, you will get performance along the way.
My #1 programming principle would be phrased using a concept from John Boyd: make your OODA loops fast. In software this can often mean simple things like "make compile time fast" or "make sure you can detect errors quickly".
I think it's fine and generous that he credited these rules to the better-known aphorisms that inspired them, but I think his versions are better, they deserve to be presented by themselves, instead of alongside the mental clickbait of the classic aphorisms. They preserve important context that was lost when the better-known versions were ripped out of their original texts.
For example, I've often heard "premature optimization is the root of all evil" invoked to support opposite sides of the same argument. Pike's rules are much clearer and harder to interpret creatively.
Also, it's amusing that you don't hear this anymore:
> Rule 5 is often shortened to "write stupid code that uses smart objects".
In context, this clearly means that if you invest enough mental work in designing your data structures, it's easy to write simple code to solve your problem. But interpreted through an OO mindset, this could be seen as encouraging one of the classic noob mistakes of the heyday of OO: believing that your code could be as complex as you wanted, without cost, as long as you hid the complicated bits inside member methods on your objects. I'm guessing that "write stupid code that uses smart objects" was a snappy bit of wisdom in the pre-OO days and was discarded as dangerous when the context of OO created a new and harmful way of interpreting it.
Can't agree more on 5. I've repeatedly found that any really tricky programming problem is (eventually) solved by iterative refinement of the data structures (and the APIs they expose / are associated with). When you get it right the control flow of a program becomes straightforward to reason about.
To address our favorite topic: while I use LLMs to assist on coding tasks a lot, I think they're very weak at this. Claude is much more likely to suggest or expand complex control flow logic on small data types than it is to recognize and implement an opportunity to encapsulate ideas in composable chunks. And I don't buy the idea that this doesn't matter since most code will be produced and consumed by LLMs. The LLMs of today are much more effective on code bases that have already been thoughtfully designed. So are humans. Why would that change?
I feel like 1 and 2 are only applicable in cases of novelty.
The thing is, if you build enough of the same kinds of systems in the same kinds of domains, you can kinda tell where you should optimize ahead of time.
Most of us tend to build the same kinds of systems and usually spend a career or a good chunk of our careers in a given domain. I feel like you can't really be considered a staff/principal if you can't already tell ahead of time where the perf bottleneck will be just on experience and intuition.
I feel like every time I have expected an area to be the major bottleneck it has been. Sometimes some areas perform worse than I expected, usually something that hasn't been coded well, but generally it's pretty easy to spot the computationally heavy or many remote call areas well before you program them.
I have several times done performance tests before starting a project to confirm it can be made fast enough to be viable, the entire approach can often shift depending on how quickly something can be done.
It really depends on your requirements. C10k requires different design than a web server that sees a few requests per second at most, but the web might never have been invented if the focus was always on that level of optimization.
The number 1 issue I've experienced with poor programmers is a belief that they're special snowflakes who can anticipate the future.
It's the same thing with programmers who believe in BDUF or disbelieve YAGNI - they design architectures for anticipated futures which do not materialize instead of evolving the architecture retrospectively in line with the future which did materialize.
I think it's a natural human foible. Gambling, for instance, probably wouldn't exist if humans' gut instincts about their ability to predict the future defaulted to realistic.
This is why no matter how many brilliant programmers scream YAGNI, don't do BDUF and don't prematurely optimize, there will always be some comment saying the equivalent of "akshually sometimes you should...", remembering that one time when they metaphorically rolled a double six and anticipated the necessary architecture correctly when it wasn't even necessary to do so.
These programmers are all hopped up on a different kind of roulette these days...
Sure, don't build your system to keep audit trails until after you have questions to answer so that you know what needs to go in those audit trails.
Don't insist on file-based data ingestion being a wrapper around a json-rpc api just because most similar things are moving that direction; what matters is whether someone has specifically asked for that for this particular system yet.
Not all decisions can be usefully revisited later. Sometimes you really do need to go "what if..." and make sure none of the possibilities will bite too hard. Leaving the pizza cave occasionally and making sure you (have contacts who) have some idea about the direction of the industry you're writing stuff for can help.
Aye. The number one way to make software amenable to future requirements is to keep it simple so that it's easy to change in future. Adding complexity for anticipated changes works against being able to support the unanticipated ones.
Rob Pike is responsible for many cool things, but Unix isn't one of them. Go is a wonderful hybrid (with its own faults) of the schools of Thompson and Wirth, with a huge amount of Pike.
If you'd said Plan 9 and UTF-8 I'd agree with you.
Rob Pike definitely wrote large chunks of Unix while at Bell Labs. It's wrong to say he wrote all of it like the GP did but it is also wrong to diminish his contributions.
Unix was created by Ken Thompson and Dennis Ritchie at Bell Labs (AT&T) in 1969. Thompson wrote the initial version, and Ritchie later contributed significantly, including developing the C programming language, which Unix was subsequently rewritten in.
> but was a contributor to it. He, with a team, unquestionably wrote it.
contribute < wrote.
His credits are huge, but I think saying he wrote Unix is misattribution.
Credits include: Plan 9 (successor to Unix), Unix Window System, UTF-8 (maybe his most universally impactful contribution), Unix Philosophy Articulation, strings/greps/other tools, regular expressions, C successor work that ultimately led him to Go.
Are you under the impression he was, like, a hands-off project manager or something? His involvement was in writing it. Not singlehandedly, but certainly as part of a team. He unquestionably wrote it. He did not envision it like he did the other projects you mention, but the original credit was only in the writing of.
The first four are kind of related. For me the fifth is the important – and oft overlooked – one:
> Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
These rules aged well overall. The only change I would make these days is to invert the order.
Number 5 is timeless and relevant at all scales, especially as code iterations have gotten faster and faster, data is all the more relevant. Numbers 4 and 3 have shifted a bit since data sizes and performance have ballooned, algorithm overhead isn't quite as big a concern, but the simplicity argument is relevant as ever. Numbers 2 and 1 while still true (Amdahl's law is a mathematical truth after all), are also clearly a product of their time and the hard constraints programmers had to deal with at the time as well as the shallowness of the stack. Still good wisdom, though I think on the whole the majority of programmers are less concerned about performance than they should be, especially compared to 50 years ago.
The attribution to Hoare is a common error — "Premature optimization is the root of all evil" first appeared in Knuth's 1974 paper "Structured Programming with go to Statements."
Knuth later attributed it to Hoare, but Hoare said he had no recollection of it and suggested it might have been Dijkstra.
Rule 5 aged the best. "Data dominates" is the lesson every senior engineer eventually learns the hard way.
How good is your model at picking good data structures?
There’s several orders of magnitude less available discussion of selecting data structures for problem domains than there is code.
If the underlying information is implicit in the high volume of code available then maybe the models are good at it, especially when driven by devs who can/will prompt in that direction. And that assumption seems likely related to how much code was written by devs who focus on data.
> There’s several orders of magnitude less available discussion of selecting data structures for problem domains than there is code.
I believe that’s what most algorithms books are about. And most OS books talk more about data than algorithms. And if you watch livestreams or read books on practical projects, you’ll see that a lot of refactoring is first selecting a data structure, then adapting the code around it. DDD is about data structure.
- ideologically, he's spent his career chasing complexity reduction, advocating for code sobriety, resource efficiency, and clarity of thought. Large, opaque, energy-intensive LLMs represent the antithesis.
Potentially it's by either (or even both independently). Knuth originally attributed it to Hoare, but there's no paper trail to demonstrate Hoare actually coined it first.
I'm not a skilled programmer (but would like to be someday). Would someone kindly resolve what appears to me to be a contradiction between the following?
1(a) Torvalds: "Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
1(b) Pike Rule 5: "Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming."
— versus —
2. Perlis 2: "Functions delay binding; data structures induce binding. Moral: Structure data late in the programming process."
---
Ignorant as I am, I read these to advise that I ought to put data structures centrally, first, foremost — but not until the end of the programming process.
When you explore a problem, use Python and lists/sets/dictionaries/JSON. Wait with types and specific data structures till you have understanding.
Speed of development over speed of execution.
When you know what to build and how, commit to good data structures. Do the types, structs, classes, Trie, CRDTs, XML, Protobuf, Parquet and whatnot where appropriate. Instrument your program.
The efficiency of the final product counts.
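As a sketch of those two phases in Python (the order data and the `Order` class are invented purely for illustration): explore with plain dicts first, then commit to a typed structure once the shape has settled.

```python
from dataclasses import dataclass

# Exploration phase: plain dicts, cheap to reshape while requirements move.
raw_orders = [
    {"id": 1, "customer": "acme", "total_cents": 1250},
    {"id": 2, "customer": "acme", "total_cents": 300},
    {"id": 3, "customer": "globex", "total_cents": 990},
]


def totals_by_customer_proto(orders):
    totals = {}
    for o in orders:
        totals[o["customer"]] = totals.get(o["customer"], 0) + o["total_cents"]
    return totals


# Commitment phase: once the shape is understood, pin it down with a type.
@dataclass(frozen=True)
class Order:
    id: int
    customer: str
    total_cents: int


def totals_by_customer(orders: list[Order]) -> dict[str, int]:
    totals: dict[str, int] = {}
    for o in orders:
        totals[o.customer] = totals.get(o.customer, 0) + o.total_cents
    return totals


orders = [Order(**o) for o in raw_orders]  # fails loudly on a misshapen dict
print(totals_by_customer(orders))  # {'acme': 1550, 'globex': 990}
```

The algorithm barely changes between the two versions; what changes is how early a wrong field name or a missing key gets caught.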
> We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
Rule 4 I have always practiced and demanded of junior programmers: make algorithms and structures that are simple to understand for our main user, the one who will modify this code in the future.
I believe that's why Golang is a very simple but powerful language.
Yes, and I'd say it's more true now than then. Best case, your fancy algorithms are super-sizing code that runs 1% of the time, always kicking more-often-run code out of the most critical CPU caches. Worst case, your fancy algorithms contain security bugs, and the bad guys cash in.
9front is distilled Unix. I corrected Russ Cox's 'xword' to work in 9front and I am just a newbie. No LLMs, that's Idiocratic, like the movie; just '9intro.us.pdf' and man pages.
I think for people starting out - rule 5 isn't perhaps that obvious.
> Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
If you want to solve a problem - it's natural to think about logic flow and the code that implements that first and the data structures are an afterthought, whereas Rule 5 is spot on.
Computers are machines that transform an input to an output.
> If you want to solve a problem - it's natural to think about logic flow and the code that implements that first and the data structures are an afterthought, whereas Rule 5 is spot on.
It is?
How can you conceive of a precise idea of how to solve a problem without a similarly precise idea of how you intend to represent the information fundamental to it? They are inseparable.
Obviously they are linked - the question is where do you start your thinking.
Do you start with the logical task first and structure the data second, or do you actually think about the data structures first?
Let's say I have an optimisation problem - I have a simple scoring function - and I just want to find the solution with the best score. Starting with the logic.
for all solutions, score, keep if max.
Simple eh? Problem is it's a combinatorial solution space. The key to solving this before the entropic death of the universe is to think about the structure of the solution space.
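A deliberately tiny Python illustration of that point, using a toy problem I picked for the purpose (choose k numbers with the largest sum). Real combinatorial problems rarely collapse this neatly, so treat it as a sketch of the idea and nothing more.

```python
from heapq import nlargest
from itertools import combinations


def best_brute_force(values, k):
    # "for all solutions, score, keep if max": enumerates C(n, k) subsets.
    return max(combinations(values, k), key=sum)


def best_structured(values, k):
    # The score is additive, so the best subset is just the k largest values.
    # Thinking about the structure of the solution space removes the search.
    return tuple(nlargest(k, values))


values = [7, 1, 9, 4, 8, 3]
print(sum(best_brute_force(values, 3)))  # 24
print(sum(best_structured(values, 3)))   # 24
```

Both give the same score here, but the first one stops being usable long before the list gets large, while the second barely notices.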
I mean - no. If you're coming to a completely new domain you have to decide what the important entities are, and what transformations you want to apply.
Neither data structures nor algorithms, but entities and tasks, from the user POV, one level up from any kind of implementation detail.
There's no point trying to do something if you have no idea what you're doing, or why.
When you know the what and why you can start worrying about the how.
Iff this is your 50th CRUD app you can probably skip this stage. But if it's green field development - no.
Sure context is important - and the important context you appear to have missed is the 5 rules aren't about building websites. They're about solving the kind of problems which are easy to state but hard to do (well).
Also, "why these 5 in particular" is definitely not obvious -- there are a great many possible "obvious in some sense but also true in an important way" epigrams to choose from (the Perlis link from another comment has over a hundred). That Pike picked these 5 to emphasise tells you something about his view of programming, and doubly so given that they are rather overlapping in what they're talking about.
Far too often we generalise a piece of logic that we need in one or two places, making things more complicated for ourselves whenever they inevitably start to differ. And chances are very slim we will actually need it more than twice.
Premature generalisation is the most common mistake that separates a junior developer from an experienced one.
The goal is to have code that corresponds to a coherent conceptual model for whatever you are doing, and the resulting codebase should clearly reflect the design of the system. Once I started thinking about code in these terms, I realized that questions like "DRY vs YAGNI" were not meaningful.
It's not about copying identical code twice, it's about refactoring similar code into a shared function once you have enough examples to be able to see what the shared core is.
I too often see junior engineers (and senior data scientists…) write code procedurally, with giant functions and many, many if statements, presumably because in their brain they’re thinking about “1st I do this if this, 2nd I do that if that, etc”.
Yet again, understanding when to follow a rule of thumb or not is another thing that separates the junior from the senior.
Exactly.
Instead, I tend to ask: if I change this code here, will I always also need to change it over there?
Copy-paste is good as long as I'm just repeating patterns. A for loop is a pattern. I use for loops in many places. That doesn't mean I need to somehow abstract out for loops because I'm repeating myself.
But if I have logic that says that button_b.x = button_a.x + button_a.w + padding, then I should make sure that I only write that information down once, so that it stays consistent throughout the program.
Your example is a pretty good one. In most practical applications, you do not want to be setting button x coordinates manually. You want to use a layout manager, like CSS Flexbox or Jetpack Compose's Row or Java Swing's FlowLayout, which takes in a padding and a direction for a collection of elements and automatically figures out where they should be placed. But if you only have one button, this is overkill. If you only have two buttons, this is overkill. If you have 3 buttons, you should start to realize this is the pattern and reach for the right abstraction. If you get to 10 buttons, you'll realize that you need to arrange them in 2D as well and handle how they grow & shrink as you resize the window, and there's a good chance you need a more powerful abstraction.
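For the three-button stage, a hand-rolled sketch in Python might look like this. `Button` and `layout_row` are hypothetical names, and this is nowhere near what Flexbox or a real layout manager does; it only shows the spacing rule being written down once.

```python
from dataclasses import dataclass


@dataclass
class Button:
    label: str
    w: int
    x: int = 0


def layout_row(buttons, start_x=0, padding=8):
    """Place buttons left to right; the spacing rule lives only here."""
    x = start_x
    for b in buttons:
        b.x = x
        x += b.w + padding  # next.x = prev.x + prev.w + padding
    return buttons


buttons = layout_row([Button("OK", 60), Button("Cancel", 80), Button("Help", 50)])
print([b.x for b in buttons])  # [0, 68, 156]
```

Change the padding or a width and every position stays consistent, which is the "will I always need to change it over there too?" test from above.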
Well, turns out that 3 of the APIs changed the way they return the data, so instead of separating the logic, someone kept adding a bunch of if statements into a single function in order to avoid repeating the code in multiple places. It was a nightmare to maintain and I ended up completely refactoring it, and even though some of the code was repeated, it was much easier to maintain and accommodate to the API changes.
Having identical logic in multiple places (even only 2) is a big contributor to technical debt, since if you're searching for something and you find it and fix it /once/, you often think of the job as done. Staying DRY avoids the "there is still a bug and I already fixed that" confusion.
https://caseymuratori.com/blog_0015
Sometimes four or five doesn’t seem too bad, sometimes two is too many
Extract a method or object if it's something that feels conceptually a "thing" even if it has only one use. Most tools to DRY your code also help by providing a bit of encapsulation that do a great job of tidying things up to force you to think about "should I be letting this out of domain stuff leak in here?"
If two pieces of code use the same functionality by coincidence but could possibly evolve differently then don't refactor. Don't even refactor if this happens three, four, or five times. Because even if the code may be identical today the features are not actually identical.
But if you have two uses of code that are actually semantically identical and will assuredly evolve together then go ahead and refactor to remove duplication.
https://xkcd.com/1205/
https://xkcd.com/974/
That's not to bemoan the engineer with shortcomings. Even the most experienced and educated engineer might find themself outside their comfort zone, implementing code without the ability to anticipate the performance characteristics under the hood. A mental model of computation can only go so far.
Articulated more succinctly, one might say "Use the profiler, and use it often."
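For anyone who hasn't made profiling part of their loop yet, here is a minimal sketch using Python's standard `cProfile` and `pstats` modules. The function `slow_sum_of_squares` and the helper `profile_call` are made-up stand-ins for whatever code you actually suspect.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    # Deliberately naive stand-in for "the code you suspect is slow".
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
    return result, buf.getvalue()


result, report = profile_call(slow_sum_of_squares, 100_000)
print(report)
```

The report lists call counts and cumulative time per function, which is usually enough to tell whether your guess about the bottleneck was right before you touch anything.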
I am almost certain that people building bloated software are not willfully misunderstanding this quote; it's likely they never heard about it. Let's not ignore the relevance of this half a century old advice just because many programmers do not care about efficiency or do not understand how computers work. Premature optimization is exactly that, the fact that it is premature makes it wrong, regardless of whether it's about GOTO statements in the 70s or some modern equivalent where in the name of craft or fun people make their apps a lot more complex than they should be. I wouldn't be surprised if some of the brutally inefficient code you mention was so because people optimized prematurely for web-scale and their app never ever needed those abstractions and extra components. The advice applies both to hackers doing micro-optimizations and architecture astronauts dreaming too big IMHO.
Oh yes, I'd recommend everyone who uses the phrase reads the rest of the paper to see the kinds of optimisations that Knuth considers justified. For example, optimising memory accesses in quicksort.
Tips like "don't try to write smart code" are often repeated but useless (not to mention that "smart" here means over-engineered or overly complex, not smart).
1. Somebody verifies with the users that speed is actually one of the most burning problems.
2. They profile the code and discover a bottleneck.
3. Somebody says "no, but we shouldn't fix that, that's premature optimization!"
I've heard all sorts of people like OP moan that "this is why pieces of shit like Slack are bloated and slow" (it isn't) when advocating skipping steps 1 and 2 though.
I don't think they misunderstand the rule, either, they just don't agree with it.
Did Pike really have to specify explicitly that you have to identify that a problem is a problem before solving it?
I wish Knuth would come out and publicly chastise the many decades of abuse this quote has enabled.
It's true that premature optimization (that is, optimization before you've measured the software and determined whether the optimization is going to make any real-world difference) is bad.
The reality, though, is that most programmers aren't grappling with whether their optimizations are premature, they're grappling with whether to optimize at all. At most companies, once the code works, it ships. There's little, if any, time given for an extra "optimization" pass.
It's only after customers start complaining about performance (or higher-ups start complaining about compute costs) that programmers are given any time to go through and optimize things. By which point refactoring the code is now much harder than it would've been originally.
Profiling never achieved its place in most developers’ core loop the way that compiling, linting, or unit testing did.
How many real CI/CD pipelines spit out flame graphs alongside test results?
I find 98% of the time that users are clamoring to get something implemented or fixed which isn't speed related so I work on that instead.
When I do drill down what I tend to find in the flame graphs is that your scope for making performance improvements a user will actually notice is bottlenecked primarily by I/O not by code efficiency.
Meanwhile my less experienced coworkers will spot a nested loop that will never take more than a couple of milliseconds and demand it be "optimised".
I believe people don't think about Knuth when they choose to write an app in Electron. Some other forces might be at play here.
If you don't know enough to pick good starting points you probably won't know enough to optimize well. So don't optimize prematurely.
If you are experienced enough to pick good starting points, still don't optimize prematurely.
If you see a bad starting point picked by someone else, by all means, point it out if it will be problematic now or in the foreseeable future, because that's a bug.
(AI will probably make this worse as well, having a bloat tendency all of its own)
Maybe I’ve had an unrepresentative career, but I’ve never worked anywhere where there’s much time to fiddle with performance optimisations, let alone those that make the code/system significantly harder to understand. I expect that’s true of most people working in mainstream tech companies of the last twenty years or so. And so that quote is basically never applicable.
I can write bubble sort, it is simple and I have confidence it will work. I wrote quicksort for class once - I turned in something that mostly worked but there were bugs I couldn't fix in time (but I could if I spent more time - I think...)
However writing bubble sort is wrong because any good language has a sort in the standard library (likely timsort or something other than quicksort in the real world).
I think that's due to people doing premature optimization! If people took the quote to heart, they would be less inclined to increase the amount of boilerplate and indirection.
While you were seeing those problems with Java at Google, I was seeing it with Python.
So many levels of indirection. Holy cow! So many unneeded superclasses and mixins! You can’t reason about code if the indirection is deeper than the human mind can grasp.
There was also a belief that list comprehensions were magically better somehow and would expand to 10-line monstrosities of unreadable code when a nested for loop would have been more readable and just as fast but because list comprehensions were fetishized nobody would stop at their natural readability limits. The result was like reading the run-on sentence you just suffered through.
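A small made-up Python example of the same logic written both ways. It is nothing like the 10-line monsters, but it is about where a comprehension starts to strain; the matrix and the function name are invented for illustration.

```python
matrix = [[1, -2, 3], [4, 5, -6], [-7, 8, 9]]

# Comprehension past its natural readability limit: two loops, two filters.
dense = [(r, c, v * v) for r, row in enumerate(matrix) for c, v in enumerate(row) if v > 0 if (r + c) % 2 == 0]


def positive_squares_on_even_cells(matrix):
    # Same logic as a nested loop; each step gets its own line.
    out = []
    for r, row in enumerate(matrix):
        for c, v in enumerate(row):
            if v > 0 and (r + c) % 2 == 0:
                out.append((r, c, v * v))
    return out


print(positive_squares_on_even_cells(matrix))
# [(0, 0, 1), (0, 2, 9), (1, 1, 25), (2, 2, 81)]
```

Both produce the same list; whether any speed difference matters is something to measure, not assume.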
Then the quote wasn’t the problem. The wilful misunderstanding was the problem.
I don't think you can blame this phrase if people are going to drop an entire word out of an eight word sentence. The very first word, no less.
Like you, I've seen people produce a lot of slow code, but it's mostly been from people who would have a really hard time writing faster code that's less wrong.
I hate slow software, but I'd pick it anytime over bogus software. Also, generally, it's easier to fix performance problems than incorrect behavior, especially so when the error has created data that's stored somewhere we might not have access to. But even more so, when the harm has reached the real world.
We can and should have both.
This is a fraud, made up by midwits to justify their leaning towers of abstraction.
And I'd agree that "simple secure" is better than "complex secure" but you're kind of side-stepping what I said, what about "not secure at all", wouldn't that lead to simpler code? Usually does for me, especially if you have to pile it on top of something that is already not so secure, but even when taking it into account when designing from ground up.
Same. I, too, am sick of bloated code. But I use the quote as a reminder to myself: "look, the fact that you could spend the rest of the workday making this function run in linear instead of quadratic time doesn't mean you should – you have so many other tasks to tackle that it's better that you leave the suboptimal-but-obviously-correct implementation of this one little piece as-is for now, and return to it later if you need to".
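As an illustration of the kind of trade-off I mean, here is a hedged sketch in Python. Both function names are made up for this example, and the linear version assumes the items are hashable. The quadratic one is obviously correct at a glance; the set-based one is the later optimization, if it ever turns out to matter.

```python
def has_duplicates_quadratic(items):
    """Compare every pair: O(n^2), but trivially correct."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    """Track seen values in a set: expected O(n), requires hashable items."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

For a few dozen items nobody will notice the difference; for a few million, the second version is the one you return to.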
How do you know which code was written with this quote in mind?
> Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
Always preferred Perlis' version, that might be slightly over-used in functional programming to justify all kinds of hijinks, but with some nuance works out really well in practice:
> 9. It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures.
>I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
-- Linus Torvalds
This kind of exploration can be a really positive use case for AI I think, like show me a sketch of this design vs that design and let's compare them together.
My recommendation is to truly learn a functional language and apply it to a real world product. Then you’ll learn how to think about data, in its pure state, and how it is transformed to get from point A to point B. These lessons will make for much cleaner design that will be applicable to imperative languages as well.
Or learn C where you do not have the luxury of using high-level crutches.
Not sure if SoTA codegen models are capable of navigating the design space and coming up with optimal solutions. As with cybersecurity, maybe specialized models (like DeepMind's Sec-Gemini), if there are any, might?
I reckon, a programmer who already has learnt about / explored the design space, will be able to prompt more pointedly and evaluate the output qualitatively.
> sometimes a barrier to getting started for me
Plenty of great books on the topic (:
Algorithms + Data Structures = Programs (1976), https://en.wikipedia.org/wiki/Algorithms_%2B_Data_Structures...
"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious." -- Fred Brooks, The Mythical Man Month (1975)
This is what I've observed with using AI on relatively small (~1000 line) programs. When I add a requirement that requires a different data structure, Claude will happily move to the new optimal data structure, and rewrite literally everything accordingly.
I've heard that it gets dicier when you have source files that are 30K-40K lines and programs that are in the million+ line range. My reports have reported that Gemini falls down badly in this case, because the source file blows the context window. But even then, they've also reported that you can make progress by asking Gemini to come up with the new design, and then asking it to come up with a list of modules that depend upon the old structure, and then asking it to write a shim layer module-by-module to have the old code use the new data structure, and then have it replace the old data structure with the new one, and then have it remove the shim layer and rewrite the code of each module to natively use the new data structure. Basically, babysit it through the same refactoring that an experienced programmer would use to do a large-scale refactoring in a million+ line codebase, but have the AI rewrite modules in 5 minutes that would take a programmer 5 weeks.
You don't need to be able to pass a leet code interview, but you should know about big O complexity, you should be able to work out if a linked list is better than an array, you should be able to program a trie, and you should be at least aware of concepts like cache coherence / locality. You don't need to be an expert, but these are realities of the way software and hardware work. They're also not super complex to gain a working knowledge of, and various LLMs are probably a really good way to gain that knowledge.
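For anyone who hasn't written one, a trie is smaller than it sounds. This is a minimal sketch in Python, assuming plain string keys; the class and method names are my own, not from any library.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps one character to the next node
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True

    def _walk(self, text):
        """Follow text from the root; return the final node or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None
```

Lookups cost time proportional to the length of the key, not the number of stored words, which is the property that makes the structure worth knowing.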
Bill Gates, for example, always advocated for thinking through the entire program design and data structures before writing any code, emphasizing that structure is crucial to success.
While developing Altair BASIC, his choice of data structures and algorithms enabled him to fit the code into just 4 kilobytes.
Microsoft is another story.
This is the worst sort of documentation; technically true but quite unenlightening. It is, in the parlance of the Fred Brooks quote mentioned in a sibling comment, neither the "flowchart" nor the "tables"; it is simply a brute enumeration of code.
To which the fix is, ask for the right thing. Ask for it to analyze the key data structures (tables) and provide you the flow through the program (the flowchart). It'll do it no problem. Might be inaccurate, as is a hazard with all documentation, but it makes as good a try at this style of documentation as "conventional" documentation.
Honestly one of the biggest problems I have with AI coding and documentation is just that the training set is filled to the brim with mediocrity and the defaults are inferior like this on numerous fronts. Also relevant to this conversation is that AI tends to code the same way it documents and it won't have either clear flow charts or tables unless you carefully prompt for them. It's pretty good at doing it when you ask, but if you don't ask you're gonna get a mess.
(And I find, at least in my contexts, using opus, you can't seem to prompt it to "use good data structures" in advance, it just writes scripting code like it always does and like that part of the prompt wasn't there. You pretty much have to come back in after its first cut and tell it what data structures to create. Then it's really good at the rest. YMMV, as is the way of AI.)
I don't take it like that. A map could be the right data structure for something people typically reach for classes to do, and then you get a whole bunch of functions that can already operate on a map-like thing for free.
If you take a look at the standard library and the data structures of Clojure you'd see this approach taken to a somewhat extreme amount.
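A rough sketch of the same idea in Python rather than Clojure, using plain dicts. The helpers `pick` and `rename` and the sample records are hypothetical, written just for this comment; the point is that they work on any map-shaped record without knowing what it represents.

```python
def pick(record, keys):
    """Return a new dict containing only the requested keys that exist."""
    return {k: record[k] for k in keys if k in record}


def rename(record, mapping):
    """Return a new dict with keys renamed according to mapping."""
    return {mapping.get(k, k): v for k, v in record.items()}


# Two unrelated "entities", one data structure, the same functions.
user = {"name": "Ada", "email": "ada@example.com", "role": "admin"}
order = {"id": 42, "total": 19.99, "currency": "EUR"}

public_user = pick(user, ["name", "role"])
invoice_line = rename(order, {"total": "amount"})
```

With a `User` class and an `Order` class you would typically end up writing both helpers twice, or reaching for reflection.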
Perlis: stringly typed logic is great!
My interpretation of his point of view is that what you need is a process/interpreter/live object that 'explains' the data.
https://news.ycombinator.com/item?id=11945722
EDIT: He writes more about it in Quora. In brief, he says it is 'meaning', not 'data' that is central to programming.
https://qr.ae/pCVB9m
One part of it has interesting new resonance in the era of agentic LLMs:
alankay on June 21, 2016:
This is why "the objects of the future" have to be ambassadors that can negotiate with other objects they've never seen. Think about this as one of the consequences of massive scaling ...
Nowadays, rather than the methods associated with data objects, we are dealing with "context" and "prompts".
I should probably be thinking more in this direction.
> Busywork code is not important. Data is important. And data is not difficult. It's only data. If you have too much, filter it. If it's not what you want, map it. Focus on the data; leave the busywork behind.
If I have learned one thing in my 30-40 years spent writing code, it is this.
You never know how requirements are going to change over the next 5 years, and pure structures are always the most flexible to work with.
You still have to worry about someone using kg when you use g, but you avoid a large class of problems and make your logic easier.
> 2. Functions delay binding; data structures induce binding. Moral: Structure data late in the programming process.
https://ocw.mit.edu/courses/6-001-structure-and-interpretati...
which I found very helpful in (finally) managing to get through that entire text (and do all the exercises).
The key difficulty is that identifying what these are is far from obvious upfront, and so often an index appears adjacent to a table that represents what the table should have been in the first place.
This is why I fundamentally find SQL too conservative and outdated. There are obvious patterns for cross-cutting concerns that would mitigate things like this but enterprise SQL products like Oracle and MS are awful at providing ways to do these reusable cross-cutting concerns consistently.
> Good programmers worry about data structures and their relationships.
> -- Linus Torvalds
I was specifically thinking about the "relationship" issues. The worst messes to fix are the ones where the programmer didn't consider how to relate the objects together - which relationships need to be direct PK bindings, which can be indirect, which things have to be cached vs calculated live, which things are the cache (vs the master copy), what the cardinality of each relationship is, which relationships are semantically ownerships vs peers, which data is part of the system itself vs configuration data vs live, how you handle changes to the data (event sourcing vs changelogging vs append-only vs yolo update), etc.
Not quite "data structures" I admit but absolutely thinking hard about the relationship between all the data you have.
SQL doesn't frame all of these questions out for you, but it's good at getting you to start thinking about them in a way you might not otherwise.
That's great
Pike is right.
I would guess Pike is simply wise enough not to get involved in such arguments.
“If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident.”
And:
“Use simple algorithms as well as simple data structures.”
A data structure general enough to solve enough problems to be meaningful will either be poorly suited to some problems or have complex algorithms for those problems, or both.
There are reasons we don’t all use graph databases or triple stores, and rely on abstractions over our byte arrays.
Let's say you're working for the DMV on a program for driver's licenses. The idea is to use one structure for driver's license data, as opposed to using one structure for new driver's licenses, a different one for renewals, and yet a third for expired ones, and a fourth one for name changes.
It is not saying that you should use byte arrays for driver's license records, so that you can use the same data structure for driver's license data and missile tracks. Generalize within your program, not across all possible programs running on all computers.
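A hedged sketch of what that looks like, in Python. Every field and function name here is invented for illustration; a real DMV schema would obviously have more to it. There is one record type, and new issue, renewal, expiry, and name change are all just functions over it.

```python
from dataclasses import dataclass, replace
from datetime import date
from enum import Enum


class LicenseStatus(Enum):
    ACTIVE = "active"
    EXPIRED = "expired"


@dataclass(frozen=True)
class DriverLicense:
    number: str
    holder_name: str
    issued_on: date
    expires_on: date


def status(lic, today):
    """Expiry is derived from the data, not stored in a separate structure."""
    return LicenseStatus.EXPIRED if today > lic.expires_on else LicenseStatus.ACTIVE


def renew(lic, today, new_expiry):
    """A renewal is the same record with new dates."""
    return replace(lic, issued_on=today, expires_on=new_expiry)


def change_name(lic, new_name):
    """A name change is the same record with a new holder name."""
    return replace(lic, holder_name=new_name)
```

Nothing in this generalizes to missile tracks, and that is fine: the structure is general within the program, not across all programs.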
Even knowing with 100% certainty that performance will be subpar, requirements change often enough that it's often not worth the cost of adding architectural complexity too early.
My #1 programming principle would be phrased using a concept from John Boyd: make your OODA loops fast. In software this can often mean simple things like "make compile time fast" or "make sure you can detect errors quickly".
For example, I've often heard "premature optimization is the root of all evil" invoked to support opposite sides of the same argument. Pike's rules are much clearer and harder to interpret creatively.
Also, it's amusing that you don't hear this anymore:
> Rule 5 is often shortened to "write stupid code that uses smart objects".
In context, this clearly means that if you invest enough mental work in designing your data structures, it's easy to write simple code to solve your problem. But interpreted through an OO mindset, this could be seen as encouraging one of the classic noob mistakes of the heyday of OO: believing that your code could be as complex as you wanted, without cost, as long as you hid the complicated bits inside member methods on your objects. I'm guessing that "write stupid code that uses smart objects" was a snappy bit of wisdom in the pre-OO days and was discarded as dangerous when the context of OO created a new and harmful way of interpreting it.
To address our favorite topic: while I use LLMs to assist on coding tasks a lot, I think they're very weak at this. Claude is much more likely to suggest or expand complex control flow logic on small data types than it is to recognize and implement an opportunity to encapsulate ideas in composable chunks. And I don't buy the idea that this doesn't matter since most code will be produced and consumed by LLMs. The LLMs of today are much more effective on code bases that have already been thoughtfully designed. So are humans. Why would that change?
The thing is, if you build enough of the same kinds of systems in the same kinds of domains, you can kinda tell where you should optimize ahead of time.
Most of us tend to build the same kinds of systems and usually spend a career or a good chunk of our careers in a given domain. I feel like you can't really be considered a staff/principal if you can't already tell ahead of time where the perf bottleneck will be just on experience and intuition.
I have several times done performance tests before starting a project to confirm it can be made fast enough to be viable, the entire approach can often shift depending on how quickly something can be done.
It's the same thing with programmers who believe in BDUF or disbelieve YAGNI - they design architectures for anticipated futures which do not materialize instead of evolving the architecture retrospectively in line with the future which did materialize.
I think it's a natural human foible. Gambling, for instance, probably wouldn't exist if humans' gut instincts about their ability to predict the future defaulted to realistic.
This is why, no matter how many brilliant programmers scream YAGNI, don't do BDUF and don't prematurely optimize, there will always be some comment saying the equivalent of "akshually sometimes you should...", remembering that one time when they metaphorically rolled a double six and anticipated the necessary architecture correctly when it wasn't even necessary to do so.
These programmers are all hopped up on a different kind of roulette these days...
Don't insist on file-based data ingestion being a wrapper around a json-rpc api just because most similar things are moving that direction; what matters is whether someone has specifically asked for that for this particular system yet.
Not all decisions can be usefully revisited later. Sometimes you really do need to go "what if..." and make sure none of the possibilities will bite too hard. Leaving the pizza cave occasionally and making sure you (have contacts who) have some idea about the direction of the industry you're writing stuff for can help.
Rules are "kinda" made to be broken. Be free.
I've been sticking to these rules (and will keep sticking to them) for as long as I can program (I've been doing it for the last 30 years).
IMHO, you can feel that a bottleneck is likely to occur, but you definitely can't tell where, when, or how it will actually happen.
If you'd said Plan 9 and UTF-8 I'd agree with you.
Unless you meant to imply that UNIX isn't cool.
A lot of people are learning some history today, beautiful to see.
Unix was created by Ken Thompson and Dennis Ritchie at Bell Labs (AT&T) in 1969. Thompson wrote the initial version, and Ritchie later contributed significantly, including developing the C programming language, which Unix was subsequently rewritten in.
contribute < wrote.
His credits are huge, but I think saying he wrote Unix is misattribution.
Credits include: Plan 9 (successor to Unix), Unix Window System, UTF-8 (maybe his most universally impactful contribution), Unix Philosophy Articulation, strings/greps/other tools, regular expressions, C successor work that ultimately led him to Go.
This is probably the worst use of the word "shortened" ever, and it should be more like "mutilated"?
> Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
Number 5 is timeless and relevant at all scales, especially as code iterations have gotten faster and faster, data is all the more relevant. Numbers 4 and 3 have shifted a bit since data sizes and performance have ballooned, algorithm overhead isn't quite as big a concern, but the simplicity argument is relevant as ever. Numbers 2 and 1 while still true (Amdahl's law is a mathematical truth after all), are also clearly a product of their time and the hard constraints programmers had to deal with at the time as well as the shallowness of the stack. Still good wisdom, though I think on the whole the majority of programmers are less concerned about performance than they should be, especially compared to 50 years ago.
Knuth later attributed it to Hoare, but Hoare said he had no recollection of it and suggested it might have been Dijkstra.
Rule 5 aged the best. "Data dominates" is the lesson every senior engineer eventually learns the hard way.
edit: s/data/data structure/
Good software can handle crap data.
There’s several orders of magnitude less available discussion of selecting data structures for problem domains than there is code.
If the underlying information is implicit in the high volume of code available then maybe the models are good at it, especially when driven by devs who can/will prompt in that direction. And that assumption seems likely related to how much code was written by devs who focus on data.
I believe that’s what most algorithms books are about. And most OS books talk more about data than algorithms. And if you watch livestreams or read books on practical projects, you’ll see that a lot of refactoring is first selecting a data structure, then adapting the code around it. DDD is about data structure.
Based on everything public, Pike is deeply hostile to generative AI in general:
- The Christmas 2025 incident (https://simonwillison.net/2025/Dec/26/slop-acts-of-kindness/)
- he's labeled GenAI as nuclear waste (https://www.webpronews.com/rob-pike-labels-generative-ai-nuc...)
- ideologically, he's spent his career chasing complexity reduction, advocating for code sobriety, resource efficiency, and clarity of thought. Large, opaque, energy-intensive LLMs represent the antithesis.
The whole article is an AI hallucination. It refers to the same "Christmas 2025 incident". The internet is dead for real.
1(a) Torvalds: "Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
1(b) Pike Rule 5: "Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming."
— versus —
2. Perlis 2: "Functions delay binding; data structures induce binding. Moral: Structure data late in the programming process."
---
Ignorant as I am, I read these to advise that I ought to put data structures centrally, first, foremost — but not until the end of the programming process.
When you know what and how to build, commit to good data structures. Do the types, structs, classes, Trie, CRDTs, XML, Protobuf, Parquet and whatnot where appropriate. Instrument your program. The efficiency of the final product counts.
So not really a contradiction, just Perlis talking about the functional shell and Torvalds/Pike talking about the imperative core.
Good structure comes from exploring until you understand the problem well AND THEN letting data structure dominate.
This Axiom has caused far and away more damage to software development than the premature optimization ever will.
> We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
I believe that's why Golang is a very simple but powerful language.
https://news.ycombinator.com/item?id=47325225
Funny handwritten html artifact though:
LLM's work will never be reproducible by design.
> Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
If you want to solve a problem, it's natural to think first about the logic flow and the code that implements it, with the data structures as an afterthought, whereas Rule 5 is spot on.
Computers are machines that transform an input to an output.
It is?
How can you conceive of a precise idea of how to solve a problem without a similarly precise idea of how you intend to represent the information fundamental to it? They are inseparable.
Do you start with the logical task first and structure the data second, or do you actually think about the data structures first?
Let's say I have an optimisation problem - I have a simple scoring function - and I just want to find the solution with the best score. Starting with the logic.
for all solutions, score, keep if max.
Simple eh? Problem is it's a combinatorial solution space. The key to solving this before the entropic death of the universe is to think about the structure of the solution space.
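To put something runnable behind that, here is a sketch in Python using 0/1 knapsack as a stand-in problem. That choice is my assumption, not necessarily the problem the parent had in mind, and both function names are made up. The first function is literally "for all solutions, score, keep if max". The second gets the same answer by exploiting the structure of the solution space: the best value for a given capacity only depends on the best values for smaller capacities.

```python
from itertools import combinations


def best_brute_force(items, capacity):
    """items is a list of (weight, value). Tries every subset: O(2^n)."""
    best = 0
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            weight = sum(w for w, _ in subset)
            if weight <= capacity:
                best = max(best, sum(v for _, v in subset))
    return best


def best_dynamic(items, capacity):
    """Same answer in O(n * capacity); assumes non-negative integer weights."""
    # best[c] holds the best value achievable with total weight <= c.
    best = [0] * (capacity + 1)
    for weight, value in items:
        # Walk capacities downwards so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


items = [(2, 3), (3, 4), (4, 5), (5, 6)]
assert best_brute_force(items, 5) == best_dynamic(items, 5) == 7
```

The logic in the first version never changes; what makes the second one feasible is the table.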
Neither data structures nor algorithms, but entities and tasks, from the user POV, one level up from any kind of implementation detail.
There's no point trying to do something if you have no idea what you're doing, or why.
When you know the what and why you can start worrying about the how.
Iff this is your 50th CRUD app you can probably skip this stage. But if it's green field development - no.
eg sort a list.
That's why a collection of "obvious" things formulated in a convincing way by a person with big street cred is still useful and worth elevating.
"Why quote someone who's just quoting someone else?" — Michael Scott — knorker