TimeCapsuleLLM: LLM trained only on data from 1800-1875

(github.com)

157 points | by admp 1 hour ago

21 comments

  • dogma1138 1 hour ago
    Would be interesting to train a cutting edge model with a cut off date of say 1900 and then prompt it about QM and relativity with some added context.

    If the model comes up with anything even remotely correct it would be quite strong evidence that LLMs are a path to something bigger; if not, then I think it is time to go back to the drawing board.

    • bazzargh 56 minutes ago
      You would find things in there that were already close to QM and relativity. The Michelson-Morley experiment was 1887 and Lorentz transformations came along in 1889. The photoelectric effect (which Einstein explained in terms of photons in 1905) was also discovered in 1887. William Clifford (who _died_ in 1879) had notions that foreshadowed general relativity: "Riemann, and more specifically Clifford, conjectured that forces and matter might be local irregularities in the curvature of space, and in this they were strikingly prophetic, though for their pains they were dismissed at the time as visionaries." - Banesh Hoffmann (1973)

      Things don't happen all of a sudden, and with access to all the scientific papers of the era, it's possible those could have fallen out of the synthesis.

      • matthewh806 38 minutes ago
        I presume that's what the parent post is trying to get at? Seeing if, given the cutting edge scientific knowledge of the day, the LLM is able to synthesize it all into a workable theory of QM by making the necessary connections and (quantum...) leaps

        Standing on the shoulders of giants, as it were

        • actionfromafar 10 minutes ago
          Yeah but... we still might not know if it could do that because we were really close by 1900 or if the LLM is very smart.
      • bhaak 33 minutes ago
        This would still be valuable even if the LLM only finds out about things that are already in the air.

        It’s probably even more of a problem that different areas of scientific development don’t know about each other. LLMs combining results would still not be like they invented something new.

        But if they could give us a head start of 20 years on certain developments this would be an awesome result.

    • DevX101 20 minutes ago
      Chemistry would be a great space to explore. The last quarter of the 19th century had a ton of advancements in chemistry. It'd be interesting to see if an LLM could propose fruitful hypotheses or make predictions in the science of thermodynamics.
    • tokai 1 hour ago
      Looking at the training data I don't think it will know anything.[0] I doubt On the Connexion of the Physical Sciences (1834) is going to have much about QM. While the cut-off is 1900, it seems many of the texts are much closer to 1800 than 1900.

      [0] https://github.com/haykgrigo3/TimeCapsuleLLM/blob/main/Copy%...

      • dogma1138 1 hour ago
        It doesn’t need to know about QM or relativity, just about the building blocks that led to them, which were very much around by the year 1900.

        In fact you don’t want it to know about them explicitly just have enough background knowledge that you can manage the rest via context.

        • tokai 58 minutes ago
          I was vague. My point is that I don't think the building blocks are in the data. It's mainly tertiary and popular sources. Maybe if you had the writings of Victorian scientists, both public and private correspondence.
    • forgotpwd16 59 minutes ago
      Done a few weeks ago: https://github.com/DGoettlich/history-llms (discussed in: https://news.ycombinator.com/item?id=46319826)

      At least the model part. Although others have had the same thought as you, afaik none tried it.

      • chrononaut 52 minutes ago
        And unfortunately I don't think they plan on making those models public.
    • imjonse 1 hour ago
      I suppose the vast majority of training data used for cutting edge models was created after 1900.
      • dogma1138 1 hour ago
        Ofc they are because their primary goal is to be useful and to be useful they need to always be relevant.

        But considering that Special Relativity was published in 1905, which means all its building blocks were already floating in the ether by 1900, it would be a very interesting experiment to train something on Claude/Gemini scale and then, say, give it the field equations and ask it to build a theory around them.

        • famouswaffles 1 hour ago
          His point is that we can't train a Gemini 3/Claude 4.5 etc model because we don't have the data to match the training scale of those models. There aren't trillions of tokens of digitized pre-1900s text.
        • p1esk 1 hour ago
          How can you train a Claude/Gemini scale model if you’re limited to <10% of the training data?
      • kopollo 44 minutes ago
        I don't know if this is related to the topic, but GPT5 can convert an 1880 Ottoman archival photograph to English, and without any loss of quality.
    • metalliqaz 50 minutes ago
      Yann LeCun spoke explicitly on this idea recently and he asserts definitively that the LLM would not be able to add anything useful in that scenario. My understanding is that other AI researchers generally agree with him, and that it's mostly the hype beasts like Altman that think there is some "magic" in the weights that is actually intelligent. Their payday depends on it, so it is understandable. My opinion is that LeCun is probably correct.
      • johnsmith1840 23 minutes ago
        There is some ability for it to make novel connections but it's pretty small. You can see this yourself having it build novel systems.

        It largely cannot imagine anything beyond the usual but there is a small part that it can. This is similar to in context learning, it's weak but it is there.

        It would be incredible if meta learning/continual learning found a way to train exactly for novel learning path. But that's literally AGI so maybe 20yrs from now? Or never..

        You can see this on CL benchmarks. There is SOME signal but it's crazy low. When I was training CL models I found that signal was in the single % points. Some could easily argue it was zero but I really do believe there is a very small amount in there.

        This is also why any novel work or findings are done via MASSIVE compute budgets. They find RL environments that can extract that small amount out. Is it random chance? Maybe, hard to say.

      • catigula 11 minutes ago
        This is definitely wrong, most AI researchers DO NOT agree with LeCun.

        Most ML researchers think AGI is imminent.

    • a-dub 1 hour ago
      yeah i was just wondering that. i wonder how much stem material is in the training set...
      • signa11 1 hour ago
        i will go for ‘aint gonna happen for a 1000 dollars alex’
  • hallvard 2 minutes ago
    Cool! I also did something like this: https://github.com/hallvardnmbu/transformer

    But on various data (i.e., separate model per source): the Bible, Don Quixote and Franz Kafka. (As well as a (bad!) lyrics generator, and translator.)

  • eqmvii 1 hour ago
    Could this be an experiment to show how likely LLMs are to lead to AGI, or at least intelligence well beyond our current level?

    If you could only give it texts and info and concepts up to Year X, well before Discovery Y, could we then see if it could prompt its way to that discovery?

    • ben_w 58 minutes ago
      > Could this be an experiment to show how likely LLMs are to lead to AGI, or at least intelligence well beyond our current level?

      You'd have to be specific what you mean by AGI: all three letters mean a different thing to different people, and sometimes the term as a whole means something not present in the letters.

      > If you could only give it texts and info and concepts up to Year X, well before Discovery Y, could we then see if it could prompt its way to that discovery?

      To a limited degree.

      Some developments can come from combining existing ideas and seeing what they imply.

      Other things, like everything to do with relativity and quantum mechanics, would have required experiments. I don't think any of the relevant experiments had been done prior to this cut-off date, but I'm not absolutely sure of that.

      You might be able to get such an LLM to develop all the maths and geometry for general relativity, and yet find the AI still tells you that the perihelion shift of Mercury is a sign of the planet Vulcan rather than of a curved spacetime: https://en.wikipedia.org/wiki/Vulcan_(hypothetical_planet)

      • opponent4 15 minutes ago
        > You'd have to be specific what you mean by AGI

        Well, they obviously can't. AGI is not science, it's religion. It has all the trappings of religion: prophets, sacred texts, origin myth, end-of-days myth and most importantly, a means to escape death. Science? Well, the only measure of "general intelligence" would be to compare it to the only one we know, which is the human one, but we have absolutely no means by which to describe it. We do not know where to start. This is why, when you scratch the surface of any AGI definition, you only find circular definitions.

        And no, the "brain is a computer" is not a scientific description, it's a metaphor.

      • grimgrin 31 minutes ago
        An example of why you need to explain what you mean by AGI is:

        https://www.robinsloan.com/winter-garden/agi-is-here/

      • markab21 39 minutes ago
        Basically looking for emergent behavior.
    • alansaber 1 hour ago
      I think not if only for the fact that the quantity of old data isn't enough to train anywhere near a SoTA model, until we change some fundamentals of LLM architecture
      • franktankbank 1 hour ago
        Are you saying it wouldn't be able to converse using English of the time?
        • ben_w 51 minutes ago
          Machine learning today requires an obscene quantity of examples to learn anything.

          SOTA LLMs show quite a lot of skill, but they only do so after reading a significant fraction of all published writing (and perhaps images and videos, I'm not sure) across all languages, in a world whose population is 5 times higher than the link's cut off date, and the global literacy went from 20% to about 90% since then.

          Computers can only make up for this by being really really fast: what would take a human a million or so years to read, a server room can pump through a model's training stage in a matter of months.

          When the data isn't there, reading what it does have really quickly isn't enough.

        • wasabi991011 55 minutes ago
          That's not what they are saying. SOTA models include much more than just language, and the scale of training data is related to its "intelligence". Restricting the corpus in time => less training data => less intelligence => less ability to "discover" new concepts not in its training data
          • franktankbank 29 minutes ago
            Perhaps less bullshit though was my thought? Was language more restricted then? Scope of ideas?
      • andyfilms1 1 hour ago
        I mean, humans didn't need to read billions of books back then to think of quantum mechanics.
        • alansaber 56 minutes ago
          Which is why I said it's not impossible, but current LLM architecture is just not good enough to achieve this.
        • famouswaffles 58 minutes ago
          Right, what they needed was billions of years of brute force and trial and error.
    • water-data-dude 17 minutes ago
      It'd be difficult to prove that you hadn't leaked information to the model. The big gotcha of LLMs is that you train them on BIG corpuses of data, which means it's hard to say "X isn't in this corpus", or "this corpus only contains Y". You could TRY to assemble a set of training data that only contains text from before a certain date, but it'd be tricky as heck to be SURE about it.

      Ways data might leak to the model that come to mind: misfiled/mislabeled documents, footnotes, annotations, document metadata.
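
      One crude first check for this kind of leakage can be sketched as below: scan each document for four-digit years later than the cutoff. This is only an illustration, not how the project filters its data. The helper names, the sample documents and the 1875 default are assumptions made up for the example. It will raise false alarms on ordinary numbers that look like years, and it will miss any leak that carries no date.

```python
import re

# Matches four-digit numbers from 1000 to 2099 that stand alone as words.
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def years_after_cutoff(text, cutoff=1875):
    """Return the sorted distinct years in text that are later than cutoff."""
    return sorted({int(y) for y in YEAR_RE.findall(text) if int(y) > cutoff})


def flag_documents(docs, cutoff=1875):
    """Map document name -> suspicious years, only for documents that have any."""
    flagged = {}
    for name, text in docs.items():
        late = years_after_cutoff(text, cutoff)
        if late:
            flagged[name] = late
    return flagged


# Hypothetical sample documents, invented for this sketch.
docs = {
    "letter.txt": "London, 12 March 1861. Sir, I was much affected...",
    "reprint.txt": "First published 1834. This edition digitized 2009, footnotes added 1923.",
}
flagged = flag_documents(docs)
print(flagged)
```

      Here only the reprint is flagged, because its digitization note and footnotes carry later dates. That is the "footnotes, annotations, metadata" case; a misfiled document with no dates in it would pass straight through.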

    • armcat 51 minutes ago
      I think this would be an awesome experiment. However you would effectively need to train something of a GPT-5.2 equivalent. So you need a lot of text, a much larger parameterization (compared to nanoGPT and Phi-1.5), and the 1800s equivalents of supervised finetuning and reinforcement learning with human feedback.
    • dexwiz 42 minutes ago
      This would be a true test of whether LLMs can innovate or just regurgitate. I think part of people's amazement of LLMs is they don't realize how much they don't know. So thinking and recalling look the same to the end user.
    • Trufa 51 minutes ago
      This is fascinating, but the experiment seems to fail at being a fair comparison of how much knowledge we can have from that time in data vs now.

      As a thought experiment I find it thrilling.

    • Rebuff5007 46 minutes ago
      OF COURSE!

      The fact that tech leaders espouse the brilliance of LLMs and don't use this specific test method is infuriating to me. It is deeply unfortunate that there is little transparency or standardization of the datasets available for training/fine tuning.

      Having this be advertised will make more interesting and informative benchmarks. OEM models that are always "breaking" the benchmarks are doing so with improved datasets as well as improved methods. Without holding the datasets fixed, progress on benchmarks is very suspect IMO.

    • feisty0630 55 minutes ago
      I fail to see how the two concepts equate.

      LLMs have neither intelligence nor problem-solving ability (and I won't be relaxing the definition of either so that some AI bro can pretend a glorified chatbot is sentient)

      You would, at best, be demonstrating that the sharing of knowledge across multiple disciplines and nations (which is a relatively new concept - at least at the scale of something like the internet) leads to novel ideas.

      • al_borland 50 minutes ago
        I've seen many futurists claim that human innovation is dead and all future discoveries will be the results of AI. If this is true, we should be able to see AI trained on the past figure its way to various things we have today. If it can't do this, I'd like said futurists to quiet down, as they are discouraging an entire generation of kids who may go on to discover some great things.
        • skissane 26 minutes ago
          > I've seen many futurists claim that human innovation is dead and all future discoveries will be the results of AI.

          I think there's a big difference between discoveries through AI-human synergy and discoveries through AI working in isolation.

          It probably will be true soon (if it isn't already) that most innovation features some degree of AI input, but still with a human to steer the AI in the right direction.

          I think an AI being able to discover something genuinely new all by itself, without any human steering, is a lot further off.

          If AIs start producing significant quantities of genuine and useful innovation with minimal human input, maybe the singularitarians are about to be proven right.

  • addaon 12 minutes ago
    Suppose two models with similar parameters trained the same way on 1800-1875 and 1800-2025 data. Running both models, we get probability distributions across tokens, let's call the distributions 1875' and 2025'. We also get a probability distribution finite difference (2025' - 1875'). What would we get if we sampled from 1.1*(2025' - 1875') + 1875'? I don't think this would actually be a decent approximation of 2040', but it would be a fun experiment to see. (Interpolation rather than extrapolation seems just as unlikely to be useful and less likely to be amusing, but what do I know.)
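
    A minimal sketch of that extrapolation on toy numbers, assuming both models share one vocabulary and we already have their next-token probabilities. The four-word vocabulary and all probabilities are made up. One wrinkle the formula hides: extrapolating past 2025' can push some entries below zero, so this sketch clips negatives and renormalizes, which is a design choice of the example rather than part of the proposal.

```python
import random


def extrapolate(p_old, p_new, scale=1.1):
    """Return p_old + scale * (p_new - p_old), clipped at zero and renormalized."""
    if len(p_old) != len(p_new):
        raise ValueError("distributions must share one vocabulary")
    raw = [o + scale * (n - o) for o, n in zip(p_old, p_new)]
    clipped = [max(x, 0.0) for x in raw]  # extrapolation can go negative
    total = sum(clipped)
    if total == 0.0:
        raise ValueError("extrapolation left no probability mass")
    return [x / total for x in clipped]


def sample(vocab, probs, rng=random):
    """Draw one token from vocab according to probs."""
    return rng.choices(vocab, weights=probs, k=1)[0]


# Toy next-token distributions over a shared vocabulary (invented numbers).
vocab = ["telegraph", "railway", "telephone", "internet"]
p_1875 = [0.50, 0.45, 0.05, 0.00]
p_2025 = [0.02, 0.13, 0.30, 0.55]

p_extrap = extrapolate(p_1875, p_2025, scale=1.1)
print(dict(zip(vocab, p_extrap)))
print(sample(vocab, p_extrap, rng=random.Random(0)))
```

    With scale=1.0 this reproduces 2025' exactly, and with a scale between 0 and 1 it interpolates. An alternative would be to extrapolate logits instead of probabilities and apply a softmax afterwards, which avoids the clipping step.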
  • orthecreedence 4 minutes ago
    Is this what the FBI used to write the text messages between the Charlie Kirk's suspect and his "lover?"
  • jimmytucson 23 minutes ago
    Fascinating idea. There was another "time-locked" LLM project that popped up on HN recently[1]. Their model output is really polished but the team is trying to figure out how to avoid abuse and misrepresentation of their goals. We think it would be cool to talk to someone from 100+ years ago but haven't seriously considered the many ways in which it would be uncool. Interesting times!

    [1] https://news.ycombinator.com/item?id=46319826

  • dlcarrier 26 minutes ago
    It's interesting that it's trained off only historic text.

    Back in the pre-LLM days, someone trained a Markov chain off the King James Bible and a programming book: https://www.tumblr.com/kingjamesprogramming

    I'd love to see an LLM equivalent, but I don't think that's enough data to train from scratch. Could a LoRA or similar be used in a way to get speech style to strictly follow a few megabytes worth of training data?

  • myrmidon 48 minutes ago
    There was a discussion around a very similar model (Qwen3 based) some weeks ago:

    https://news.ycombinator.com/item?id=46319826

    I found it particularly thought-inspiring how a model with training from that time period completely lacks context/understanding of what it is itself, but then I realized that we are the same (at least for now).

  • simonw 1 hour ago
    Anyone seen a low-friction way to run prompts through this yet, either via a hosted API or chat UI or a convenient GGML or MLX build that runs in Ollama or llama.cpp or LM Studio?
    • t1amat 1 hour ago
      Not a direct answer but it looks like v0.5 is a nanoGPT arch and v1 is a Phi 1.5 arch, which should be well supported by quanting utilities for any engine. They are small too and should be able to be done on a potato.
    • alansaber 53 minutes ago
      I too have completely forgotten how the adapters library works and would have appreciated a simple inference script
    • philmo1 1 hour ago
      +1
  • abhishekjha 1 hour ago
    Oh I have really been thinking long about this. The intelligence that we have in these models represents a time.

    Now if I train a foundation model with docs from the Library of Alexandria and only those texts of that period, I would have a chance to get a rudimentary insight on what the world was like at that time.

    And maybe time shift even further.

    • feisty0630 1 hour ago
      > I would have a chance to get a rudimentary insight on what the world was like at that time

      Congratulations, you've reinvented the history book (just with more energy consumption and less guarantee of accuracy)

      • gordonhart 49 minutes ago
        History books, especially those from classical antiquity, are notoriously not guaranteed to be accurate either.
        • feisty0630 21 minutes ago
          Do you expect something exclusively trained on them to be any better?
          • gordonhart 13 minutes ago
            To a large extent, yes. A model trained on many different accounts of an event is likely going to give a more faithful picture of that event than any one author.

            This isn't super relevant to us because very few histories from this era survived, but presumably there was sufficient material in the Library of Alexandria to cover events from multiple angles and "zero out" the different personal/political/religious biases coloring the individual accounts.

  • marmalade2413 31 minutes ago
    Can you confidently say that the architecture of the LLM doesn't include any a priori bias that might affect the integrity of this LLM?

    That is, the architectures of today are chosen to yield the best results given the textual data around today and the problems we want to solve today.

    I'd argue that this lack of bias would need to be researched (if it hasn't been already) before this kind of model has credence.

    LLMs aren't my area of expertise but during my PhD we were able to encode a lot of a priori knowledge through the design of neural network architectures.

  • aqme28 1 hour ago
    This kind of technique seems like a good way to test model performance against benchmarks. I too am skeptical: I suspect new models are taking popular benchmark solutions into their training data. So, how does e.g. ChatGPT's underlying architecture perform on SWE-bench if trained only on data prior to 2024?
    • NitpickLawyer 26 minutes ago
      > are taking popular benchmark solutions into their training data

      That happened in the past, and the "naive" way of doing it is usually easy to spot. There are, however, many ways in which testing data can leak into models, even without data contamination. However this doesn't matter much, as any model that only does well in benchmarks but is bad in real-world usage will be quickly sussed out by people actually using them. There are also lots and lots of weird, not very popular benchmarks out there, and the outliers are quickly identified.

      > perform on SWE-bench if trained only on data prior to 2024.

      There's a benchmark called swe-REbench, that takes issues from real-world repos, published ~ monthly. They perform tests and you can select the period and check their performance. This is fool-proof for open models, but a bit unknown for API-based models.

  • HarHarVeryFunny 1 hour ago
    It would be interesting if there's enough data to train a model capable enough to converse with and ask about contemporary views on issues of the day, or what it thought about "potential" future events/technologies yet to happen.
  • andy99 47 minutes ago
    How many tokens in the data set? I see its size being referenced in GB only. Probably a stupid question, but GB of text or of what?
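
    A back-of-the-envelope conversion, assuming the GB figure refers to plain text and using the common rule of thumb of roughly 4 bytes per token for English with a BPE-style tokenizer. That ratio is an assumption, not a number from the repo; the project's own tokenizer on 19th-century text may differ noticeably, and the only exact answer comes from tokenizing the corpus.

```python
def estimate_tokens(size_gb, bytes_per_token=4.0):
    """Rough token count for size_gb gigabytes (1 GB = 1e9 bytes) of plain text.

    bytes_per_token is a rule-of-thumb assumption, not a measured value.
    """
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return int(size_gb * 1e9 / bytes_per_token)


for gb in (1, 10, 100):
    print(f"{gb} GB of text is very roughly {estimate_tokens(gb):,} tokens")
```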
  • dhruv3006 1 hour ago
    This will be something good - would love something on Ollama or lmstudio.
  • zingababba 1 hour ago
    LOL

    PROMPT:Charles Darwin

    Charles DarwinECCEMACY. Sir, — The following case is interesting to me : — I was in London a fortnight, and was much affected with an attack of rheumatism. The first attack of rheumatism was a week before I saw you, and the second when I saw you, and the third when I saw you, and the third in the same time. The second attack of gout, however, was not accompanied by any febrile symptoms, but was accompanied with an increased flow of urine, and with a more copious discharge of urine. The third attack was an hour after I saw you, and was succeeded by a return of a paroxysm of gout, and a more rapid return of the gout. The fourth attack was also accompanied by a fever, but was not always accompanied by any febrile symptoms. The third attack of gout was a fortnight after you had been ill, and the fourth was followed by a paroxysm of gout. The fourth attack was a fortnight after you were attacked, and was accompanied by a sense

    • feisty0630 49 minutes ago
      Interesting that it reads a bit like it came from a Markov chain rather than an LLM. Perhaps limited training data?
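
      For comparison, a word-level Markov chain of the kind mentioned can be sketched in a few lines. This is a generic illustration, not the code behind any project linked in this thread, and the tiny corpus is invented to echo the sample above. Each next word depends only on the previous two, which is what produces the looping "attack of gout" feel.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, seed, length=20, rng=None):
    """Extend seed by up to `length` words, stopping early at a dead end."""
    rng = rng or random.Random()
    out = list(seed)
    order = len(seed)
    for _ in range(length):
        followers = chain.get(tuple(out[-order:]))
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)


# Toy corpus, made up for this sketch.
corpus = (
    "the third attack of gout was a fortnight after the second attack "
    "of gout and the second attack of rheumatism was a week before"
)
chain = build_chain(corpus, order=2)
sample_text = generate(chain, seed=("the", "second"), length=12, rng=random.Random(0))
print(sample_text)
```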
    • alansaber 52 minutes ago
      Average small language model response
  • philmo1 1 hour ago
    Exciting idea!
  • srigi 1 hour ago
    "I'm sorry, my knowledge cutoff is 1875"