Markov chains are the original language models

(elijahpotter.dev)

226 points | by chilipepperhott 4 days ago

26 comments

  • AnotherGoodName 4 hours ago
    The problem is the linear nature of Markov chains. Sure, they can branch, but after an observation you are absolutely at a new state. A goes to B goes to C, etc. A classic problem for understanding why this is an issue is feeding in a 2D bitmap where the patterns are vertical but you’re passing in data left to right, which Markov chains can’t handle since they are navigating exclusively on the current left-to-right input. They miss the patterns completely. Similar things happen with language. Language is not linear, and context from a few sentences ago should change probabilities in the current sequence of characters. The attention mechanism is the best we have for this, and Markov chains struggle beyond stringing together a few syllables.

    I have played with Markov chains a lot. I tried having skip states and such but ultimately you’re always pushed towards doing something similar to the attention mechanism to handle context.

    • t_mann 3 hours ago
      You can model multiple-hop dependencies as a Markov chain by just blowing up the state space as a Cartesian product. Not that that would necessarily make sense in practice, but in theory Markov chains have enormous expressive power.
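
      To make that Cartesian-product trick concrete, here's a minimal Python sketch (the toy corpus and names are illustrative, not anything standard): a two-word dependency becomes an ordinary first-order chain once each state is a pair of words.

          import random

          corpus = "the cow jumped over the moon the cow ate the grass".split()

          # Order-2 dependency encoded as an order-1 chain: the "present state"
          # is the last *two* words, so the Markov property still holds.
          transitions = {}
          for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
              transitions.setdefault((a, b), []).append(c)

          state, out = ("the", "cow"), ["the", "cow"]
          for _ in range(8):
              options = transitions.get(state)
              if not options:  # pair only ever seen at the very end of the corpus
                  break
              nxt = random.choice(options)
              out.append(nxt)
              state = (state[1], nxt)  # slide the pair window
          print(" ".join(out))
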
      • TeMPOraL 19 minutes ago
        > Not that that would necessarily make sense in practice, but in theory Markov chains have enormous expressive power.

        Right - but this feels like being half-way between transformers and lookup tables; the latter have enormous expressive power too, as long as you're willing to handle even more state. I'm curious what else could be put on that spectrum, and where. Maybe there's something weaker than transformers, but good enough for practical applications, while being more resource-efficient?

        Related thought: understanding and compression seem to be fundamentally the same thing.

    • roadside_picnic 3 hours ago
      > after an observation you are absolutely at a new state

      The essential property of a Markov chain is maintaining the Markov property:

      P(X_n+1 = x_n+1 | X_n = x_n, ..., X_1 = x_1) = P(X_n+1 = x_n+1 | X_n = x_n)

      That is, the future, given the present state, is conditionally independent of the past states.

      It's worth recognizing that decoder-only LLMs (which include most of the major LLMs people use) maintain the Markov property and can properly be understood as Markov chains.

      > The attention mechanism is the best we have for this and Markov chains struggle beyond stringing together a few syllables.

      Attention has nothing to do with maintaining the Markov property, but it allows for a fantastically more complex representation of state, which is where decoder-only LLMs derive the majority of their power.

      tl;dr most of the LLMs people use are effectively Markov Chains.

      • kgeist 2 hours ago
        What is the conceptual difference between a "past state" and a "present state"? In a typical conversation with an LLM, you have past messages that remain in the KV cache when generating the next token (so as not to recalculate everything). Yet here, everything that came before is reinterpreted as part of the present state. In other words, it seems as if you can take any past state and fold it into one large concatenated present state - and, by that logic, call basically anything a Markov chain? I can't wrap my head around it. Can you give an example of a non-Markov system?
        • jltsiren 1 hour ago
          The difference is arbitrary but fixed. For any given Markov model, there are inputs that can fully fit in the present state and inputs that can't. But for any given input, there are models that can encode it entirely in the present state.

          Markov models are useful for understanding the limitations of any specific next-token predictor with a fixed context window. But not for the limitations of such predictors in general, as there is always a Markov model large enough to handle any specific input.

        • Imnimo 2 hours ago
          A common distinction is that whatever you take to be your state space should be fixed-sized. So an LLM has a fixed context window, which might be extremely large, but with a long enough conversation, the oldest tokens will fall out of the context window, and your process will be non-Markovian. But with a large context size, most conversations will be small enough that they fit entirely within the context, and then the process is Markovian.

          So, people can correctly call LLMs Markovian in practice, and also non-Markovian from a theoretical standpoint.

          I think of it as conceptually a little bit like the distinction between a formal Turing machine, which requires an infinite tape, and a practical computer with a finite amount of memory. Your computer acts as a Turing machine for the real computations you use it for, but there exist some computations that would require more memory than you have. From a theoretical standpoint, your computer is merely a finite state automaton.

          • Imnimo 2 hours ago
            Sorry, I realized I didn't quite write what I meant to. I didn't intend to say that LLMs are non-Markovian from a theoretical standpoint. I meant to say that the language generation task is non-Markovian from a theoretical standpoint, because the next word can depend on arbitrarily distant history.
        • roadside_picnic 1 hour ago
          > In other words, it seems as if you can take any past state and fold it into one large concatenated present state

          That's what n-grams are; even traditional token/word/character-based Markov chains don't rely on just the most recent word. Typical Markov chains in NLP are 3-7-grams.

          > Can you give an example of a non-Markov system?

          Encoder-decoder LLMs violate the Markov Property and would not count as Markov Chains.

        • fluoridation 2 hours ago
          Hmm... For example, say you have a machine that must decide whether a monetary transaction is allowable or not. The state at any given time could be the balances of all accounts, plus some metadata about past states (but not the past states themselves). Then if A has $5 in their account and tries to transfer $5 to B, the contents of the metadata would be the deciding factor of whether the transaction goes through or not. The machine still only operates on the present state, but the present state depends indirectly on past states.
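
          A minimal sketch of that machine in Python (the account names and the fraud flag are made-up illustrations): the decision reads only the present state, and history enters only through whatever was folded into the metadata.

              # Present state: balances plus a small digest of the past,
              # not the past transactions themselves.
              state = {
                  "balances": {"A": 5, "B": 0},
                  "meta": {"flagged": {"A"}},  # e.g. A tripped a fraud rule earlier
              }

              def allow(state, src, dst, amount):
                  # Only the current state is consulted; no transaction log needed.
                  return state["balances"][src] >= amount and src not in state["meta"]["flagged"]

              print(allow(state, "A", "B", 5))  # False: the metadata decides
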
          • kgeist 2 hours ago
            But a conversation with an LLM doesn't depend only on the last message; it depends on all previous messages. Sure, the number of messages is limited by the context window, so you can indeed say it's a high-order Markov chain (it can only see N states back), but theoretically, what would an actual non-Markovian LLM look like? Something with an infinite context? It's not physically possible... At this point it sounds like "it's a Markov chain" is not a very useful statement.
            • fluoridation 2 hours ago
              >what would an actual non-Markovian LLM look like? Something with an infinite context? It's not physically possible... At this point it sounds like "it's a Markov chain" is not a very useful statement.

              Uh... You're moving the goalposts. I don't know the exact definition of "LLM", but let's suppose that it and the definition of "Markov chain" make us conclude that LLMs are necessarily Markov chains. "It's a Markov chain" is still a useful statement, because there are processes that are neither LLMs nor Markov chains.

            • thaumasiotes 2 hours ago
              You appear to be confusing the UI over the LLM with the actual working of the LLM. None of them allow for more than one message to exist. You put in one message and you get one word (well, one token) back.
              • kgeist 2 hours ago
                I have an idea of how they work. At the end of the day, it's all just numbers. You say you put in "one message", but in reality it's N input numbers corresponding to the tokens (where N can vary from 1 to any size). Basically N variables. An LLM is one large function which takes N values as input and produces a new output value. So you can view an LLM as y = f(a, b, c, ...)

                What I don't understand is how the line is drawn between "past" and "present". Why are variables a,b,c viewed as "present" and not "past"? How is past defined? Why can't you view token generation as f(past1, past2, past3, present)?

                As I said, it does look "Markovian" if the number of "pasts" is fixed, but again, what exactly is not Markovian then? Infinite inputs? Everything in the real world is finite.

                • fluoridation 2 hours ago
                  Let's keep it simple and say that the procedure of an LLM is

                    state: list[n];                  // fixed-size buffer: the entire "present state"
                    while (true){
                        new_token = llm(state);      // pure function of the current state only
                        state.pop_front();           // roll off the oldest token
                        state.push_back(new_token);  // append the newly sampled one
                    }
                  
                  Any given time that llm() is called, it is blind to previous invocations. The state may contain evidence of them, or it may not. It's a free-form data structure that is sometimes filled with user input and sometimes contains previous llm() outputs, but llm() doesn't care.
                  • kgeist 1 hour ago
                    So any pure function is Markovian then? They too only depend on their input and are blind to previous invocations. I’m just trying to understand the practical purpose of calling something a Markov chain beyond small sizes, because if we allow the state to be of any size, I’m not sure what the usefulness is: anything can be called a Markov chain if you concatenate all input data of any size into one blob and interpret it as "one state."
                    • fluoridation 1 hour ago
                      >So any pure function is Markovian then?

                      I guess it would be more accurate to say that at its core a Markovian process is based around a pure function. Usually there's some kind of feedback loop around the function that allows the process to evolve in some way and continue producing output.

                      >I’m pretty sure anything can be modeled using pure functions.

                      We're not talking about theoretical models of black boxes, though, we're talking about known mechanisms. A recursive function doesn't become iterative just because it could be thought of as iterative (because eventually the CPU does loop over some sequence of instructions). We still say it's recursive.

                      >I’m not sure what the usefulness is

                      I suppose part of the usefulness is to de-mystify LLMs. If you can understand what Markov chains do, the thought "oh, an LLM is just a much more complex Markov chain" can help in reasoning about them.

                      • drdeca 16 minutes ago
                        “just”?

                        If you include the entire past history as part of the state, then any stochastic process becomes a Markov process.

                        Any computer program you could run (without network access or other inputs for a time; assume your computer has a true RNG module) is a Markov process. Just say that your state space is the space of possible configurations of the computer's memory.

                        Saying that the LLM is “just” a Markov process seems to be doing something a bit weird with that “just”.

        • jampekka 2 hours ago
          > In other words, it seems as if you can take any past state and fold it into one large concatenated present state - and, by that logic, call basically anything a Markov chain?

          Essentially yes. Given a complex enough state, more or less any process is Markovian. Some do require infinite state, which would maybe be stretching the definition a bit. In practice, the Markov formulation may not be a very good analysis perspective if the required state would be very large/complex.

      • mjburgess 3 hours ago
        They are Markov chains; anyone saying otherwise doesn't understand what a Markov chain is.
        • lackoftactics 2 hours ago
          Wouldn't tool calling or MCP break the finite state space?
      • eigencoder 3 hours ago
        Could you help me understand how decoder-only LLMs maintain the Markov property? If you used the same random seed, the input to the model "The cow jumped over the" would not give the same output as just "the", right? So isn't that violating the Markov property?
        • AnotherGoodName 2 hours ago
          There would need to be a state specifically for “the cow jumped over the” (and any other relevant context), and states for all the other times ‘the’ is preceded by something.

          This is the limitation I was getting at, btw. In that example, if you have an image with solid vertical columns, followed by columns of random static, followed again by solid vertical colors, a Markov chain could eventually learn all patterns that go

          solid->32 random bits->different solid color

          And eventually it would start predicting the different color correctly based on the solid color before the randomness. It ‘just’ needs a state for every possible random color in between. This is ridiculous in practice, however, since you’d need to learn 2^32 states just for the relationship between those two solid colors alone.

          • thesz 2 hours ago
            > It ‘just’ needs a state for every possible random color between.

            You can use skipgrams - prefixes with holes in them.
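
            A minimal sketch of what such a context looks like, in Python (the hole positions here are arbitrary examples, not the ones SNMLM actually uses):

                # A plain 4-gram prefix predicts the next word from all four words;
                # a skipgram context leaves holes that match anything.
                prefix = ("the", "cow", "jumped", "over")

                def skip_context(prefix, holes):
                    return tuple("_" if i in holes else w for i, w in enumerate(prefix))

                print(skip_context(prefix, {1}))     # ('the', '_', 'jumped', 'over')
                print(skip_context(prefix, {1, 2}))  # ('the', '_', '_', 'over')
                # Mixing predictions from several such contexts lets a random middle
                # be skipped over instead of enumerated state by state.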

            Sparse Non-negative Matrix Language Model [1] uses them with great success.

            [1] https://aclanthology.org/Q16-1024/

            The pure n-gram language models would have a hard time computing escape weights for such contexts, but the mixture of probabilities used in SNMLM does not need to do that.

            If I may: I've implemented an online per-byte version of SNMLM [2], which allows the use of skipgrams. They make performance worse, but they can be used. My implementation's predictive performance is within a few percent of LSTM's on enwik8.

            [2] https://github.com/thesz/snmlm-per-byte

        • OJFord 2 hours ago
          State (in this sense, at least) isn't word/token parsing progress; it comprises all the input and any context (which may include the entire chat history, for example).
      • rcfox 3 hours ago
        Is this just because LLMs don't have state?

        As far as I understand it, as you have a back-and-forth conversation with an LLM, you have to provide the entire history of the conversation plus your new response each time.

        • jampekka 2 hours ago
          Stateful models, e.g. RNNs, are Markov models too. Sometimes "Markov chain" is used to refer specifically to models with no hidden state, e.g. (decoder-only) Transformers.
    • 6r17 4 hours ago
      Would you say they're interesting to explore, having spent so much time on them? Do you feel one could make pragmatic use of them in certain contexts, or are they too much of a toy, where most of the time getting a service / coherent LLM would make the work easier?
      • AnotherGoodName 3 hours ago
        Yes. I think learning them and learning their limitations is the best way to learn neural networks actually.

        Give a class of students an image with horizontal lines, where every second line is a solid color and every other line is random static. See how their left-to-right Markov chains do here (they should make ~50% correct predictions).

        Then rotate the image 90 degrees. Have the class observe that a left-to-right Markov chain gets 0% when predicting this (every second pixel being random will do that). What to do? Maybe input both ways and weight towards the best one with a perceptron? Hey, first step to learning a neural network!

        From there you can iterate more and more until you no longer really have markov chains but instead neural networks with a type of attention mechanism.

    • cuttothechase 4 hours ago
      Would having a Markov chain of Markov chains help in this situation? One chain for when the 2D bitmap patterns are vertical and another for left to right?
      • AnotherGoodName 3 hours ago
        Yes, and then you weight between them with a neural network, and your vertical predictor catches that every second vertical line is solid (while every other vertical line is static, messing up the horizontal Markov chains). Of course, then someone passes you video, where there’s a 3rd dimension, and you need to customize yet again with that consideration. Or maybe the pattern is in 45-degree diagonal lines, not horizontal or vertical. Better have a Markov chain for that too. What about 10-degree lines? Etc.

        In the end you’re inputting millions of ways there could be a pattern, passing all of those into a neural network, and weighting the chains that make correct predictions more heavily.

        You start to realize that even with all these chains, there are still more ways past context could influence the current prediction, and what you really want is a generator for all the ways there could be a pattern. At this point you're getting into the realm of multilayer neural networks and starting to consider the attention mechanism.

        I don’t want to discourage anyone from learning Markov chains here, btw. It’s just that they have limitations, and those limitations actually make a great learning journey toward neural networks: you realize you really need more than a single absolute state being active at a time, and then you start to think about how all the states activated in the past might influence the current probabilities (essentially, you then start thinking about the problem the attention mechanism solves).

  • jcynix 4 hours ago
    Ah, yes, Markov chains. A long time ago, Mark V. Shaney (https://en.wikipedia.org/wiki/Mark_V._Shaney) was designed by Rob Pike and posted on Usenet.

    And Veritasium's video "The Strange Math That Predicts (Almost) Anything" talks in detail about the history of Markov chains: https://youtu.be/KZeIEiBrT_w

    • chilipepperhott 3 hours ago
      It's a great explainer. Definitely better than mine.
      • proprietario 3 hours ago
          FWIW, your explanation really made Markov chains click for me (unlike other Markov chain explanations). I have to admit that the part where you went from the idea, over hashmaps, to the matrix representation was a bit dense, but rereading it twice made me understand it. So thanks for your explanation!
        • chilipepperhott 2 hours ago
          That's great to hear. Thanks! I'm glad it worked.
  • ahmedhawas123 4 hours ago
    Random tidbit - 15+ years ago, Markov chains were the go-to for auto-generating text. Google was not as advanced as it is today at flagging spam, so the search engine results pages for most affiliate-marketing-dense topics (e.g., certain medications, products) were swamped with Markov-chain-created websites injected with certain keywords.
  • chankstein38 5 hours ago
    I once, probably 4-6 years ago, used exports from Slack conversations to train a Markov Chain to recreate a user that was around a lot and then left for a while. I wrote the whole thing in python and wasn't overly well-versed in the statistics and math side but understood the principle. I made a bot and had it join the Slack instance that I administrate and it would interact if you tagged it or if you said things that person always responded to (hardcoded).

    Well, the responses were pretty messed up and not accurate but we all got a good chuckle watching the bot sometimes actually sound like the person amidst a mass of random other things that person always said jumbled together :D

    • vunderba 4 hours ago
      I had a similar program designed as my "AWAY" bot that was trained on transcripts of my previous conversations and connected to Skype. At the time (2009) I was living in Taiwan so I would activate it when I went to bed to chat with my friends back in the States given the ~12 hour time difference. Reading back some of the transcripts made it sound like I was on the verge of a psychotic break though.
  • taolson 4 hours ago
    If you program a Markov chain to generate based upon a fairly short sequence length (4-5 characters), it can create some neat portmanteaus. I remember back in the early 90's I trained one on some typical tech literature and it came up with the word "marketecture".
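
    Something like this character-level chain, for instance (a minimal sketch; the order and the toy training text are made up):

        import random

        text = "marketing architecture lecture marketplace"  # toy training text
        ORDER = 4  # predict the next character from the previous 4

        model = {}
        for i in range(len(text) - ORDER):
            model.setdefault(text[i:i + ORDER], []).append(text[i + ORDER])

        state = text[:ORDER]
        out = state
        for _ in range(40):
            options = model.get(state)
            if not options:  # context only ever seen at the end: stop
                break
            out += random.choice(options)
            state = out[-ORDER:]
        print(out)  # words sharing a 4-char context can blend into portmanteaus
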
    • ronsor 3 hours ago
      That sounds like a corporate buzzword generator
  • glouwbug 5 hours ago
    The Practice of Programming by Kernighan and Pike had a really elegant Markov chain implementation:

    https://github.com/Heatwave/the-practice-of-programming/blob...

  • mwcz 3 hours ago
    This is great. I use Markov chains to describe to normies how LLMs work.

    It's very interesting reading what Claude Shannon wrote after producing Markov chains from English sentences.

    > In an unpublished spoof written a year later, Shannon imagined the damage his methods would do if they fell into the wrong hands. It seems that an evil Nazi scientist, dr. Hagen Krankheit, had escaped Germany with a prototype of his Müllabfuhrwortmaschine, a fearsome weapon of war "anticipated in the work ... of Dr. Claude Shannon." Krankheit's machine used the principles of randomized text to totally automate the propaganda industry. By randomly stitching together agitprop phrases in a way that approximated human language, the Müllabfuhrwortmaschine could produce an endless flood of demoralizing statements.

    From A Mind at Play. I don't remember the exact year he did that work. I think it was some time before 1950, yet strangely timely in 2025.

  • benob 4 hours ago
    Like it or not, LLMs are effectively high-order Markov chains
    • krackers 1 hour ago
      Only in the way that every real-world computer is a finite state machine.
    • BenoitEssiambre 4 hours ago
      Exactly. I think of them as Markov Chains in grammar space or in Abstract Syntax Tree space instead of n-gram chain-of-words space. The attention mechanism likely plays a role in identifying the parent in the grammar tree or identifying other types of back references like pronouns or if it's for programming languages, variable back references.
    • guluarte 4 hours ago
      markov chains with limited self correction
  • fluxusars 3 hours ago
    Can't believe no one's mentioned this classic yet: https://www.tumblr.com/kingjamesprogramming

    Back in the early days of Slack I wrote a chat bot that logged all the messages people wrote and used that data to make it say random things prompted by certain trigger words. It was hilarious, until the bosses found out that I was basically logging all their conversations. Unsurprisingly they made me turn it off.

  • Nevermark 1 hour ago
    Nice little example.

    Could add a "stop" token at the end of the fruit example, so when the little model runs it can call its own stop.

    To generalize to ADHD, create another model that can quietly "Listen" to a speaker (token generator) until its Markov chain predicts a "Stop", and starts its own generating model running, regardless of whether the speaker was done.

    Then two instances can have time efficient interrupty conversations with each other.

  • allthatineed 5 hours ago
    I remember playing with megahal eggdrop bots.

    This was one of my first forays into modifying c code, trying to figure out why 350mb seemed to be the biggest brain size (32 bit memory limits and requiring a contiguous block for the entire brain).

    I miss the innocence of those days. Just being a teen, tinkering with things i didn't understand.

    • vunderba 4 hours ago
      I remember reading the source of the original MegaHAL program when I was younger - one of the tricks that made it stand out (particularly in the Loebner competitions [1]) was that it used both a backwards and forwards Markov chain to generate responses.

      [1] https://en.wikipedia.org/wiki/Loebner_Prize

    • foobarian 5 hours ago
      I'm old now, but thanks to LLMs I can now again tinker with things I don't understand :-)
      • jcynix 4 hours ago
        The nice thing about LLMs is that they can explain stuff so you can learn to understand. And they are very patient.

        For example I'm currently relearning various ImageMagick details and thanks to their explanations now understand things that I cut/copy/pasted a long time ago without always understanding why things worked the way they did.

      • codr7 5 hours ago
        Are you though? Or is the LLM the target of your tinkering and lack of understanding? Honest question.
    • anthk 2 hours ago
      Hailo is still alive on CPAN.
  • robotnikman 3 hours ago
    I remember creating a chatbot for my minecraft server over a decade ago that used Markov chains. It was funny seeing the crazy things it spat out and the other players enjoyed it.
  • traes 1 hour ago
    There seems to be a mistake in the "Building the Transition Matrix" section of this article. Instead of showing the diagonal matrix D with the normalization coefficients, it shows the normalized Markov matrix M, incorrectly claiming it is equal to the unnormalized matrix C.
  • fair_enough 4 hours ago
    On a humorous note OP, you seem like exactly the kind of guy who would get a kick out of this postmodern essay generator that a STEM professor wrote using a Recursive Transition Network in 1996:

    https://www.elsewhere.org/journal/pomo/

    Every now and again, I come back to it for a good chuckle. Here's what I got this time (link to the full essay below the excerpt):

    "If one examines subcultural capitalist theory, one is faced with a choice: either reject subdialectic cultural theory or conclude that the significance of the artist is significant form, but only if culture is interchangeable with art. It could be said that many desituationisms concerning the capitalist paradigm of context exist. The subject is contextualised into a presemanticist deappropriation that includes truth as a totality."

    https://www.elsewhere.org/journal/pomo/?1298795365

    • 1718627440 1 hour ago
      I think that's how source code sounds to non-programmers. Too far removed from the meaning to distinguish what makes sense and what is garbage.
    • chilipepperhott 3 hours ago
      That essay generator is pretty cool. There are definitely a lot of undergrads whose essays look about the same.
      • fair_enough 3 hours ago
        There is no dearth of midwittery in the non-STEM fields. I admit this is elitist and downright snobbish, but I am disgusted that somebody else can get a degree from the same university as me in a non-STEM major and piggyback off of prestige they didn't create.

        In my opinion, it's stolen valor.

  • Aperocky 3 hours ago
    Shameless plug for playing with random markov chain with seed texts: https://aperocky.com/markov/

    Source: https://github.com/Aperocky/weighted-markov-generator

  • cestith 4 hours ago
    I’ve been telling people for years that a reasonably workable initial, simplified mental model of a large language model is a Markov chain generator with an unlimited, weighted corpus trained in. Very few people who know LLMs have said anything to critique that thought beyond that it’s a coarse description and downplays the details. Since being simplified is in the initial statement, and it’s not meant to capture detail, I say: if it walks like a really big duck and it honks instead of quacking, then it’s maybe a goose or swan, which are both pretty duck-like birds.
    • nerdponx 4 hours ago
      It's not a Markov chain because it doesn't obey the Markov property.

      What it is, and what I assume you mean, is a next-word prediction model based solely on the previous sequence of words, up to some limit. It literally is that, because it was designed to be that.

      • qwelfkh 3 hours ago
        Okay, so we have a function that converts a buffer of tokens to a distribution of next tokens. We give the function a buffer, get back the next-token-distribution, from which we sample to get our next token. We append that to the buffer (and possibly roll off the first token) to obtain a new buffer. If we consider entire buffers to be states (which is a perfectly reasonable thing to do), then we have a stochastic process that moves us from one state to another. How is that not a Markov chain?
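
        A minimal sketch of that framing in Python (next_token_distribution is a made-up stand-in for the real model):

            import random

            def next_token_distribution(buffer):
                # Stand-in for the LLM: any fixed function of the buffer works.
                return {"a": 0.5, "b": 0.5} if buffer[-1] == "a" else {"a": 0.9, "b": 0.1}

            state = ("a", "b", "a", "b")  # the whole buffer is one Markov state
            for _ in range(10):
                dist = next_token_distribution(state)
                tok = random.choices(list(dist), weights=list(dist.values()))[0]
                state = state[1:] + (tok,)  # roll off the oldest, append the sample

            # Each step depends only on the current buffer, never on earlier
            # buffers: a Markov chain over a (very large) space of buffers.
            print(state)
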
      • cestith 3 hours ago
        What about a language model requires knowledge of previous states to choose the next state based on the current state? Are you asserting that only the last word chosen is the current state? Couldn’t the list of symbols chosen so far be the current state, so long as no data is carried forward about how that list has been generated?
        • coliveira 3 hours ago
          You're right: the recent memory of the LLM is the state. That's why it needs to be so deep to be effective compared to a traditional MC.
    • rytill 3 hours ago
      So, I have heard a number of people say this, and I feel like I'm the person in your conversations saying it's a coarse description that downplays the details. What I don't understand is what, specifically, we gain from thinking of it as a Markov chain.

      Like, what is one insight beyond that LLMs are Markov chains that you've derived from thinking of LLMs as Markov chains? I'm genuinely very curious.

      • jltsiren 2 hours ago
        It depends on whether you already had experience using large Markov models for practical purposes.

        Around 2009, I had developed an algorithm for building the Burrows–Wheeler transform at (what was back then) very large scale. If you have the BWT of a text corpus, you can use it to simulate a Markov model with any context length. I tried that with a Wikipedia dump, which was amusing for a while but not interesting enough to develop further.

        Then, around 2019, I was working in genomics. We were using pangenomes based on thousands of (human) haplotypes as reference genomes. The problem was that adding more haplotypes also added rare variants and rare combinations of variants, which could be misleading and eventually started decreasing accuracy in the tasks we were interested in. The standard practice was dropping variants that were too rare (e.g. <1%) in the population. I got better results with synthetic haplotypes generated by downsampling the true haplotypes with a Markov model (using the BWT-based approach). The distribution of local haplotypes within each context window was similar to the full set of haplotypes, but the noise from rare combinations of variants was mostly gone.

        Other people were doing haplotype inference with Markov models based on similarly large sets of haplotypes. If you knew, for a suitably large subset of variants, whether each variant was likely absent, heterozygous, or homozygous in the sequenced genome, you could use the model to get a good approximation of the genome.

        When ChatGPT appeared, the application was surprising (even though I knew some people who had been experimenting with GPT-2 and GPT-3). But it was less surprising on a technical level, as it was close enough to what I had intuitively considered possible with large Markov models.

    • hatthew 2 hours ago
      Anything can be a markov chain if you define "memory of the past" as a component of the current state and include a big enough memory to hold all of the past for any practical amount of time. Personally, I feel like in most contexts it's not helpful to think of LLMs that way, since their state space (context length) is so large that they are extremely far removed from what we normally think of as markov chains. One might as well define humans as markov chains, since the brain has a finite memory.
    • Uehreka 2 hours ago
      The problem is that it doesn’t stay simplified; once you leave the room those people start using that model to explain details of actual LLM behavior. I’m constantly arguing with people who use this mental model to explain why LLMs can’t do things that they absolutely can do.

      I frequently see people underestimate the danger of LLMs to humanity in the long term and to people’s jobs in the medium term because they follow this chain of reasoning:

      - An LLM is basically a fancy markov chain (highly dubious even if there’s a tiny kernel of truth in there)

      - Therefore markov chains and LLMs have a similar skill ceiling (definitely wrong)

      - My job could never be done by a markov chain, it’s much too complex (this is presented self-evidently, no one ever feels the need to back this up)

      - Therefore, since LLMs are basically markov chains, and a markov chain could never do my job, I have nothing to worry about.

      I’ll grant that you’re not necessarily responsible for these people using your model in a way you didn’t intend. But I do think at this point it’s time to start pushing people to stop trying to reason mechanically about how these things work.

      • asdlfaglkj 16 minutes ago
        1. An LLM is absolutely a Markov chain. That is not meant to dismiss that they are a remarkable feat of generating and compressing the representation of really cool Markov chains.

        2. Because they are Markov chains, their skill ceiling is bounded by whatever the skill ceiling of Markov chains happens to be. What LLMs demonstrate is that the skill ceiling of Markov chains is higher than previously understood.

        3. If the skill ceiling of Markov chains is high enough, then one could take over some or all of someone's job.

    • jama211 4 hours ago
      Sure, but arguably by that definition so are we ;)
      • lackoftactics 2 hours ago
        There is no free will, only determinism
  • Doch88 3 hours ago
    I remember coding IRC bots with Markov chains that repeated random nonsense from the chat. So many memories.
  • kragen 3 hours ago
    A first-order Markov-chain text generator is not complex at all; it's a one-line shell script, reformatted here onto three lines for clarity:

        perl -ane 'push@{$m{$a}},$a=$_ for@F;END{
          print$a=$m{$a}[rand@{$m{$a}}]," "for 1..100_000
        }' <<<'fish fish for fish.'
    
    Given the King James Bible instead of "fish fish for fish." as input, this program produces output like this:

    25:13 As the hands upon his disciple; but make great matter wisely in their calamity; 1:14 Sanctify unto the arches thereof unto them: 28:14 Happy is our judges ruled, that believe. 19:36 Thus saith the first being one another's feet.

    This leans pretty hard on Perl DWIM; a Python version is many times longer:

        import random, sys
    
        model, last, word = {None: []}, None, None
        for word in (word for line in sys.stdin for word in line.split()):
            model.setdefault(last, []).append(last := word)
    
        sys.stdout.writelines(str(word := random.choice(model.get(word) or words)) + " "
                              for words in [list(model.keys())] for i in range(100_000))
    
    (If you write code like that at a job, you might get fired. I would reject your pull request.)

    It's probably worth mentioning:

    - Markov-chain states don't have to be words. For text modeling, a common thing to do is to use N-grams (of either words or characters) instead of single words for your states.

    - It's good to have some nonzero probability of taking a transition that hasn't occurred in the training set; this is especially important if you use a larger universe of states, because each state will occur fewer times in the training set. "Laplacian smoothing" is one way to do this; a minimal sketch follows this list.

    - Usually (and in my code examples above) the probability distribution of transitions we take in our text generator is the probability distribution we inferred from the input. Potter's approach of multiplying his next-state-occupancy vector Ms⃗ by a random diagonal matrix R and then taking the index of the highest value in the resulting vector produces somewhat similar results, but they are not the same; for example

        sum(random.random() * 0.75 > random.random() * 0.25 for i in range(1_000_000))
    
    is roughly 833000, not roughly 750000. It's 5× as common instead of 3×. But I imagine that it still produces perfectly cromulent text.
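
    And here's the promised minimal sketch of Laplacian smoothing (add-one pseudo-counts over a toy four-word vocabulary; the counts are made up):

        from collections import Counter

        vocab = ["fish", "for", "the", "moon"]
        seen = Counter({"for": 3, "fish": 1})  # observed transitions out of "fish"

        def smoothed(nxt, alpha=1.0):
            # Every successor gets alpha pseudo-counts, so transitions never
            # seen in training still get nonzero probability.
            total = sum(seen.values()) + alpha * len(vocab)
            return (seen[nxt] + alpha) / total

        for w in vocab:
            print(w, smoothed(w))  # for: 0.5, fish: 0.25, the/moon: 0.125 each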

    Interestingly, modern compilers derive from Chomsky's theory of linguistics, which he formulated while working on an AI project in the 01950s in order to demolish the dominant theory of psychology at the time, behaviorism. Behaviorism essentially held that human minds were Markov chains, and Chomsky showed that Markov chains couldn't produce context-free languages.

    • chilipepperhott 3 hours ago
      This is actually sick! I have to admit, my implementation is far more (likely unnecessarily) complex.
      • kragen 3 hours ago
        Thanks! I think! Your implementation might be easier to understand and modify, too, although perhaps not because of being unnecessarily complex.
    • 3abiton 3 hours ago
      I brushed with MCMC decades ago, and your code is truly beautiful.
      • kragen 3 hours ago
        MCMC is considerably more complicated, though!
    • 1718627440 2 hours ago
      What????????? Given that, how was it not obvious for decades that this devolves into something like the current LLMs? I give it only the first few paragraphs of a random man page and it already passes for a human who is trying to form nonsense sentences, with mostly correct grammar, but not always. LLMs now feel not very technically advanced. The code isolation in their interface seems to be more impressive than this.
      • kragen 2 hours ago
        If you only give it a few paragraphs, most words only occur once, so it ends up copying whole phrases from the input until it hits a common word. Computers are very good at sounding like human writing when they're just reproducing things that humans in fact did write. LLMs mostly aren't doing that, though.
        • 1718627440 2 hours ago
          You're right, I was a bit too enthusiastic. I've fed it random C code (both normal and already preprocessed) and it produces random garbage, not remotely passable as syntax. Also, it seems to really like emitting parts of copyright messages, since they occur verbatim everywhere.

          Happy to be part of today's 10,000.

          • kragen 39 minutes ago
            Welcome!
  • Spooky_Fusion1 1 hour ago
    Great article. I have been hitting on 'high school' levels including 8 dimensional, dual quaternion, Dirac's Eq. algebra on my website, mainly as they give a view of the dynamics of non trivial unitary matrix expansions that can occur in such as Zeta Zero, element nuclei, prime numbers, energy levels for light elements. For delocalized electrons in sync, the Markov chains are interesting, as they relate to cellular automata, the taking of limits for (relativity, electrons moving with charge potential). Having not done high school algebra until latter years, I have learnt to not assume a particular algebra fits all dynamic non trivial matrix processes, i.e. quantum computing, long range sensors, gravity waves, pressurised plasma at phase, etc. So great article, as it shows a practical high school level of applied knowledge that doesn't revolve around solely unitary matrices, in as much as its potential dynamic (linear expansion) evolved states! To which the high school level of visualisation can become aware! (I just thought I'd add this on legal probabilities. There is a case involving a SASR trooper 'Schultz' that revolves around shooting a Mulha's spotter. In Australia magistrates seem to judge cases based on 4 dimensional probabilities. And this seems to tie in with Jeffrey Epstein's management through cohesion of President Clinton's patent attorney change from a solely engineer to a lawyer based patent granting. Epstein's little girl problem wasn't solely sexual, but given stock market prediction dynamics, more like a dynamic non trivial unitary probability problem of visualising size. Something magistrates in Australia seem to have adopted in the determining of complex military cases.) Warren
  • slashdave 1 hour ago
    I don't think the author understands what a Markov chain is
  • saylisteins 3 hours ago
    There is an interesting video about this from Veritasium

    https://www.youtube.com/watch?v=KZeIEiBrT_w

  • calf 32 minutes ago
    So I think this "talking point" of "LLMs are just Markov chains" is as bad, analogously, as "Your amazing invention that passes the Turing Test is just a giant LUT", or "Your C++ program is just an FSM". It is all technically true, but it is being used subtly to dismiss/delegitimize something instead of communicating a scientifically insightful idea. How is this different from "Humans are just collections of cells"? True, except we are a very interesting type of collection! LLMs are a very interesting type of Markov chain! Haskell/Java/Ruby programs are a very interesting type of FSM! Etc. The talking point by itself is just reductionism, of which we've seen and heard variants during other scientific revolutions.
  • jedberg 3 hours ago
    For funsies one of the early reddit engineers wrote a Markov comment generator trained on all the existing comments.

    It worked surprisingly well. :)

  • pessimizer 3 hours ago
    Loved playing with Babble! when I was a teenager: https://archive.org/details/Babble_1020

    You could load it up with anything, and it would just babble on.

  • anthk 2 hours ago
    Also, for Perl users:

        curl -L https://cpanmin.us/ -o cpanm
        chmod +x cpanm

        ./cpanm -n local::lib
        ./cpanm -n Hailo

        ~/perl5/bin/hailo -t file.txt -b file.brn

        ~/perl5/bin/hailo -b file.brn

    file.txt must be a text file with one sentence per line.