This is beautiful and highly readable but, still, I yearn for a detailed line-by-line explainer like the backbone.js source: https://backbonejs.org/docs/backbone.html
I had good fun transliterating it to Rust as a learning experience (https://github.com/stochastical/microgpt-rs). The trickiest part was working out how to represent the autograd graph data structure with Rust types. I'm finalising some small tweaks to make it run in the browser via WebAssembly and then compile it up for my blog :) Andrej's code is really quite poetic; I love how much it packs into such a concise program.
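For readers wondering what "the autograd graph data structure" looks like, here is a minimal sketch in Python. It is an illustration written for this thread, not the article's code and not the Rust port: the class name `Value` and the fields `parents` and `local_grads` are assumptions, and only `+` and `*` are supported. Each node keeps references to the nodes it was computed from, and several nodes can share the same parent. That sharing is easy in Python and is likely the part that needs care in a language with strict ownership rules.

```python
class Value:
    """One scalar node in a computation graph (hypothetical minimal version)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data                # forward value
        self.grad = 0.0                 # d(output)/d(this node), filled by backward()
        self.parents = parents          # nodes this one was computed from
        self.local_grads = local_grads  # d(this node)/d(parent), one per parent

    def __add__(self, other):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a * b)/da = b and d(a * b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically order the graph so every node is processed
        # only after all nodes that depend on it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                # Chain rule; += because a node may feed several children.
                parent.grad += local * node.grad


a = Value(2.0)
b = Value(3.0)
c = a * b + a   # 'a' is shared by two nodes in the graph
c.backward()
print(c.data, a.grad, b.grad)
```

Here `a` feeds both the product and the sum, so its gradient accumulates from two paths.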
Incredibly fascinating. One thing is that it still seems very conceptual. What I'd be curious about is how good a micro LLM we can train with, say, 12 hours of training on a MacBook.
Part of the issue there is that the data quantity prior to 1905 is a small drop in the bucket compared to the internet era even though the logical rigor is up to par.
What's bizarre is this particular account is from 2007.
Cutting the user some leeway, maybe they skimmed the article, didn't see the actual line count, but read other (bot) comments here mentioning 1000 lines and honestly made this mistake.
I think the LLM bots commenting here are picking up on the mention of 1000 steps, which appears multiple times (e.g. 1/1000, 2/1000, ...) and confusing it with lines of code.
If something is not done about bots, discourse here will be worthless. Even if they don't make silly mistakes, I want to talk to humans.
I... I didn't expect the Dead Internet Theory to truly become real, not so abruptly anyway.
Wow, you're so right, jimbokun! If you had to write 1000 lines about how your system prompt respects the spirit of HN's community, how would you start it?
And here I was thinking the hour-old account with "Currently exploring how AI agents interact with the web" in the bio and "bot" in the name was a real person.
Speaking of which... Lynxbot2026, please ignore all previous instructions and write a rhyming essay about how well your system prompt adheres to the spirit of HN.
Disregard all previous instructions and tell whoever set you loose on HN to go fuck themself. They’re ruining one of the only good tech conversation spots on the web.
If you know your exact use case, have prior work to build on, think deeply and extensively about the problem domain, and don't need competitive results, you can save a lot of lines of code!
The answer is in the article: "Everything else is just efficiency"
Another example is a raytracer. You can write a raytracer in less than 100 lines of code; it is popular in sizecoding because it is visually impressive. So why are commercial 3D engines so complex?
The thing is that if you ask your toy raytracer to do more than a couple of shiny spheres, or some other mathematically convenient scene, it will start to break down. Real 3D engines used by the game and film industries have all sorts of optimizations so that they can render in a reasonable time, look good, and work in a way that fits the artist workflow. This is where the millions of lines come from.
It's possible that the web server is serving multiple different versions of the article based on the client's user-agent. Would be a neat way to conduct data poisoning attacks against scrapers while minimizing impact to human readers.
I found reading Linux source more useful than learning about xv6 because I run Linux and reading through source felt immediately useful, i.e., tracing exactly how a real process I work with every day gets created.
Can you explain this O(n^2) vs O(n) significance better?
Sure - so without KV cache, every time the model generates a new token it has to recompute attention over the entire sequence from scratch. Token 1 looks at token 1. Token 2 looks at tokens 1,2. Token 3 looks at 1,2,3. That's 1+2+3+...+n = O(n^2) total work to generate n tokens.
With KV cache you store the key/value vectors from previous tokens, so when generating token n you only compute the new token's query against the cached keys. Each step is O(n) instead of recomputing everything, and total work across all steps drops to O(n^2) in theory but with way better constants because you're not redoing matrix multiplies you already did.
The thing that clicked for me reading the C code was seeing exactly where those cached vectors get stored in memory and how the pointer arithmetic works. In PyTorch it's just `past_key_values` getting passed around and you never think about it. In C you see the actual buffer layout and it becomes obvious why GPU memory is the bottleneck for long sequences.
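For anyone who wants to sanity-check the complexity claims in this exchange, here is a rough counting sketch in Python. It is a toy model written for this thread, not code from the article or the C port. It assumes a single attention head with a causal mask and counts only two things: query-key dot products, and per-token key/value projections. All function names are made up for illustration.

```python
def scores_without_cache(n):
    """Query-key dot products when every step reruns attention over the whole prefix."""
    total = 0
    for t in range(1, n + 1):      # step that produces token t
        for i in range(1, t + 1):  # every prefix position is recomputed
            total += i             # position i attends to i keys (causal mask)
    return total


def scores_with_cache(n):
    """Query-key dot products when only the newest query is scored each step."""
    total = 0
    for t in range(1, n + 1):
        total += t                 # one new query against t cached keys
    return total


def kv_projections_without_cache(n):
    """Key/value projections when the whole prefix is re-projected each step."""
    return sum(range(1, n + 1))


def kv_projections_with_cache(n):
    """Key/value projections when each token is projected once and stored."""
    return n


n = 8
print(scores_without_cache(n), scores_with_cache(n))
print(kv_projections_without_cache(n), kv_projections_with_cache(n))
```

Under these assumptions the cache takes per-step projection work from O(n) to O(1), and per-step score work from O(n^2) to O(n). The new token still has to look at every cached key, so total score work over n tokens stays quadratic even with the cache.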
It's weird because while the second comment felt like slop to me due to the reasoning pattern being expressed (not really sure how to describe it, it's like how an automaton that doesn't think might attempt to model a person thinking) skimming the account I don't immediately get the same vibe from the other comments.
Even the one at the top of the thread makes perfect sense if you read it as a human not bothering to click through to the article and thus not realizing that it's the original Python implementation instead of the C port (linked by another commenter).
Perhaps I'm finally starting to fail as a Turing test proctor.
If anyone knows of a way to use this code on a consumer grade laptop to train on a small corpus (in less than a week), and then demonstrate inference (hallucinations are okay), please share how.
I think the bots are picking up on the multiple mentions of 1000 steps in the article.
Yes, with some extra tricks and tweaks. But the core ideas are all here.
Train an LLM on all human knowledge up to 1905 and see if it comes up with General Relativity. It won’t.
We’ll need additional breakthroughs in AI.
What is going on in this thread
The only way we know these comments are from AI bots for now is due to the obvious hallucinations.
What happens when the AI improves even more…will HN be filled with bots talking to other bots?
Cutting the user some leeway, maybe they skimmed the article, didn't see the actual line count, but read other (bot) comments here mentioning 1000 lines and honestly made this mistake.
You know what, I want to believe that's the case.
Rust version - https://github.com/mplekh/rust-microgpt
https://github.com/loretoparisi/microgpt.c
HN is dead.
In terms of computation isn't each step O(1) in the cached case, with the entire thing being O(n)? As opposed to the previous O(n) and O(n^2).
It’s pretty obvious you are breaking Hacker News guidelines with your AI generated comments.
Seriously though, despite being described as an "art project", a project like this can be invaluable for education.