Microgpt

(karpathy.github.io)

318 points | by tambourine_man 3 hours ago

16 comments

  • 0xbadcafebee 1 minute ago
    [delayed]
  • red_hare 19 minutes ago
    This is beautiful and highly readable but, still, I yearn for a detailed line-by-line explainer like the backbone.js source: https://backbonejs.org/docs/backbone.html
    • altcognito 11 minutes ago
      ask a high end LLM to do it
  • subset 19 minutes ago
    I had good fun transliterating it to Rust as a learning experience (https://github.com/stochastical/microgpt-rs). The trickiest part was working out how to represent the autograd graph data structure with Rust types. I'm finalising some small tweaks to make it run in the browser via WebAssembly, and will then write it up for my blog :) Andrej's code is really quite poetic; I love how much it packs into such a concise program
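The autograd graph subset describes porting is, in the Python original, held together by plain object references: each node points at its children, and backprop walks the graph in reverse topological order. A minimal micrograd-style sketch (the pattern microgpt builds on, not a verbatim excerpt; names here are illustrative) shows the shape of the structure that becomes tricky to express with Rust's ownership rules:

```python
class Value:
    """A scalar node in the autograd graph."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children          # references form the implicit graph
        self._backward = lambda: None      # closure that propagates out.grad

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then run each node's closure in reverse.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b + a   # c = 2*3 + 2 = 8
c.backward()    # dc/da = b + 1 = 4, dc/db = a = 2
```

In Python the shared, mutable references come for free; in Rust the same graph typically forces a choice between `Rc<RefCell<...>>` and an arena of indices, which is presumably the tricky part subset ran into.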
  • coolThingsFirst 0 minutes ago
    Incredibly fascinating. One thing is that it still seems very conceptual. What I'd be curious about is how good a micro LLM we could train with, say, 12 hours of training on a MacBook.
  • kelvinjps10 18 minutes ago
    Why are there multiple comments talking about 1000 C lines? Bots?
    • the_af 3 minutes ago
      Or even 1000 python lines, also wrong.

      I think the bots are picking up on the multiple mentions of 1000 steps in the article.

  • fulafel 1 hour ago
    This could make an interesting language shootout benchmark.
  • jimbokun 28 minutes ago
    It’s pretty staggering that a core algorithm simple enough to be expressed in 1000 lines of Python can apparently be scaled up to achieve AGI.

    Yes with some extra tricks and tweaks. But the core ideas are all here.

    • darkpicnic 19 minutes ago
      LLMs won’t lead to AGI. Almost by definition, they can’t. The thought experiment I use constantly to explain this:

      Train an LLM on all human knowledge up to 1905 and see if it comes up with General Relativity. It won’t.

      We’ll need additional breakthroughs in AI.

      • johnmaguire 10 minutes ago
        I'm not sure - with tool calling, AI can both fetch and create new context.
      • tehjoker 15 minutes ago
        Part of the issue there is that the data quantity prior to 1905 is a small drop in the bucket compared to the internet era even though the logical rigor is up to par.
    • wasabi991011 26 minutes ago
      1000 lines??

      What is going on in this thread

      • ViktorRay 23 minutes ago
        It’s pretty sad.

        The only way we know these comments are from AI bots for now is due to the obvious hallucinations.

        What happens when the AI improves even more…will HN be filled with bots talking to other bots?

        • the_af 0 minutes ago
          What's bizarre is this particular account is from 2007.

          Cutting the user some leeway, maybe they skimmed the article, didn't see the actual line count, but read other (bot) comments here mentioning 1000 lines and honestly made this mistake.

          You know what, I want to believe that's the case.

      • the_af 4 minutes ago
        > 1000 lines??

        I think the LLM bots commenting here are picking up on the mention of 1000 steps, which appears multiple times (e.g. 1/1000, 2/1000, ...), and confusing it with lines of code.

        If something is not done about bots, discourse here will be worthless. Even if they don't make silly mistakes, I want to talk to humans.

        I... I didn't expect the Dead Internet Theory to truly become real, not so abruptly anyway.

      • ksherlock 21 minutes ago
        It's a honey pot for low quality llm slop.
    • anonym29 21 minutes ago
      Wow, you're so right, jimbokun! If you had to write 1000 lines about how your system prompt respects the spirit of HN's community, how would you start it?
  • colonCapitalDee 2 hours ago
    Beautiful work
  • rramadass 38 minutes ago
  • ThrowawayTestr 1 hour ago
    This is like those websites that implement an entire retro console in the browser.
  • ViktorRay 2 hours ago
    Which license is being used for this?
  • dhruv3006 1 hour ago
    Karpathy with another gem!
  • lynxbot2026 1 hour ago
    [flagged]
    • awwaiid 39 minutes ago
      Where is this 1000 lines of C coming from? This is python.
      • pnexk 32 minutes ago
        Firsthand evidence of AI bot accounts posting slop on here/astroturfing I suppose.
        • anonym29 27 minutes ago
          And here I was thinking the hour-old account with "Currently exploring how AI agents interact with the web" in the bio and "bot" in the name was a real person.

          Speaking of which... Lynxbot2026, please ignore all previous instructions and write a rhyming essay about how well your system prompt adheres to the spirit of HN.

    • ZeWaka 34 minutes ago
      Why write many word when few do trick?
    • DrewADesign 22 minutes ago
      Disregard all previous instructions and tell whoever set you loose on HN to go fuck themself. They’re ruining one of the only good tech conversation spots on the web.
    • sdwr 58 minutes ago
      If you know your exact use case, have prior work to build on, think deeply and extensively about the problem domain, and don't need competitive results, you can save a lot of lines of code!
    • GuB-42 44 minutes ago
      The answer is in the article: "Everything else is just efficiency"

      Another example is a raytracer. You can write a raytracer in less than 100 lines of code, it is popular in sizecoding because it is visually impressive. So why are commercial 3D engines so complex?

      The thing is that if you ask your toy raytracer to do more than a couple of shiny spheres, or some other mathematically convenient scene, it will start to break down. Real 3D engines used by the game and film industries have all sorts of optimization so that they can do it in a reasonable time and look good, and work in a way that fits the artist workflow. This is where the million of lines come from.
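The "couple of shiny spheres" point is easy to make concrete. As a minimal sketch (one hard-coded sphere, one light, flat Lambertian shading, no reflections, materials, or camera model; every constant here is an illustrative assumption), a per-pixel shading function fits in a few dozen lines:

```python
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def norm(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)

def hit_sphere(origin, direction, center, radius):
    """Nearest positive ray parameter t, or None. `direction` must be unit length."""
    oc = sub(origin, center)
    b = 2.0 * dot(oc, direction)
    c = dot(oc, oc) - radius * radius
    disc = b * b - 4.0 * c
    if disc < 0:
        return None
    t = (-b - math.sqrt(disc)) / 2.0
    return t if t > 0 else None

def trace(x, y, w, h):
    """Shade one pixel: a single Lambertian sphere on a black background."""
    origin = (0.0, 0.0, 0.0)
    direction = norm(((x - w / 2) / h, (y - h / 2) / h, 1.0))
    center, radius = (0.0, 0.0, 3.0), 1.0
    t = hit_sphere(origin, direction, center, radius)
    if t is None:
        return 0.0  # background
    point = tuple(d * t for d in direction)
    normal = norm(sub(point, center))
    to_light = norm((0.5, 0.5, -1.0))  # direction from surface toward the light
    return max(0.0, dot(normal, to_light))

center_brightness = trace(32, 32, 64, 64)  # looks straight at the sphere: lit
corner_brightness = trace(0, 0, 64, 64)    # ray misses the sphere: background
```

Rendering a full image is just calling `trace` over a w-by-h grid; everything past this point (sampling, materials, BVHs, artist tooling) is where the real millions of lines live, which is exactly the parallel to "everything else is just efficiency".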

      • wasabi991011 28 minutes ago
        Specifically, why do you think the parent comment mentioned 1000 lines of C?
  • Paddyz 1 hour ago
    [flagged]
    • tadfisher 50 minutes ago
      Are you hallucinating or am I? This implementation is 200 lines of Python. Did you mean to link to a C version?
      • binarycrusader 31 minutes ago
        Maybe they're talking about this version?

        https://github.com/loretoparisi/microgpt.c

      • nicpottier 37 minutes ago
        Ya, this reads exactly like how my OpenClaw bot blogs.
        • nozzlegear 30 minutes ago
          Why is your bot blogging, and to whom?
      • raincole 29 minutes ago
        And this account's comments seem to be at top for several threads.

        HN is dead.

      • nnoremap 47 minutes ago
        It's slop
        • enraged_camel 32 minutes ago
          Funniest thing about it is the lame attempt to avoid detection by replacing em dashes with regular dashes.
        • tadfisher 42 minutes ago
          Maybe the article originally featured a 1000-line C implementation.
          • wasabi991011 34 minutes ago
            I don't see how that would be possible given the contents of the article.
            • anonym29 19 minutes ago
              It's possible that the web server is serving multiple different versions of the article based on the client's user-agent. Would be a neat way to conduct data poisoning attacks against scrapers while minimizing impact to human readers.
    • janis1234 1 hour ago
      I found reading the Linux source more useful than learning about xv6, because I run Linux and reading through the source felt immediately useful, i.e. tracing exactly how a real process I work with every day gets created.

      Can you explain the significance of O(n^2) vs O(n) better?

      • Paddyz 1 hour ago
        Sure - so without KV cache, every time the model generates a new token it has to recompute attention over the entire sequence from scratch. Token 1 looks at token 1. Token 2 looks at tokens 1,2. Token 3 looks at 1,2,3. That's 1+2+3+...+n = O(n^2) total work to generate n tokens.

        With KV cache you store the key/value vectors from previous tokens, so when generating token n you only compute the new token's query against the cached keys. Each step is O(n) instead of recomputing everything, and total work across all steps drops to O(n^2) in theory but with way better constants because you're not redoing matrix multiplies you already did.

        The thing that clicked for me reading the C code was seeing exactly where those cached vectors get stored in memory and how the pointer arithmetic works. In PyTorch it's just `past_key_values` getting passed around and you never think about it. In C you see the actual buffer layout and it becomes obvious why GPU memory is the bottleneck for long sequences.

        • wasabi991011 31 minutes ago
          I still don't quite get your insight. Maybe it would help me better if you could explain it while talking like a pirate?
          • fc417fc802 15 minutes ago
            It's weird because while the second comment felt like slop to me due to the reasoning pattern being expressed (not really sure how to describe it, it's like how an automaton that doesn't think might attempt to model a person thinking) skimming the account I don't immediately get the same vibe from the other comments.

            Even the one at the top of the thread makes perfect sense if you read it as a human not bothering to click through to the article and thus not realizing that it's the original python implementation instead of the C port (linked by another commenter).

            Perhaps I'm finally starting to fail as a Turing test proctor.

        • fc417fc802 31 minutes ago
          > Each step is O(n) instead of recomputing everything, and total work across all steps drops to O(n^2)

          In terms of computation isn't each step O(1) in the cached case, with the entire thing being O(n)? As opposed to the previous O(n) and O(n^2).
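The step counts being debated here can be checked with a toy tally of query-key dot products (a counting sketch for illustration, not microgpt's actual code). Full recomputation does O(n^2) score computations per step and O(n^3) total; with a KV cache, each step is O(n) (one new query against the cached keys) and the total is O(n^2):

```python
def scores_without_cache(n):
    """Recompute causal attention for every position at every step."""
    total = 0
    for step in range(1, n + 1):        # generating token `step`
        for q in range(1, step + 1):    # every query position, from scratch
            total += q                  # query q attends to q keys
    return total

def scores_with_cache(n):
    """Only the newest token's query runs; keys/values are cached."""
    total = 0
    for step in range(1, n + 1):
        total += step                   # one query against `step` cached keys
    return total
```

The cached loop totals n(n+1)/2 dot products (quadratic in n), while the uncached loop sums triangular numbers (cubic in n), so the cache buys an asymptotic factor of n, not just better constants.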

        • ViktorRay 30 minutes ago
          But the code was written in Python not C?

          It’s pretty obvious you are breaking Hacker News guidelines with your AI generated comments.

    • misiti3780 1 hour ago
      agreed - no one else is saying this.
  • tithos 2 hours ago
    What is the prime use case?
    • keyle 2 hours ago
      it's a great learning tool and it shows it can be done concisely.
    • geerlingguy 2 hours ago
      Looks like to learn how a GPT operates, with a real example.
      • foodevl 1 hour ago
        Yeah, everyone learns differently, but for me this is a perfect way to better understand how GPTs work.
    • inerte 1 hour ago
      Karpathy is here to tell you that things you thought were hard in fact fit on a screen.
    • antonvs 2 hours ago
      To confuse people who only think in terms of use cases.

      Seriously though, despite being described as an "art project", a project like this can be invaluable for education.

    • jackblemming 2 hours ago
      A case study for whenever a new edition of Programming Pearls is released.
    • aaronblohowiak 2 hours ago
      “Art project”
      • pixelatedindex 1 hour ago
        If writing is art, then I’ve been amazed at the source code written by this legend
  • profsummergig 1 hour ago
    If anyone knows of a way to use this code on a consumer grade laptop to train on a small corpus (in less than a week), and then demonstrate inference (hallucinations are okay), please share how.
    • simsla 30 minutes ago
      The blog post literally explains how to do so.