Gemini 2.5

(blog.google)

871 points | by meetpateltech 21 hours ago

77 comments

  • og_kalu 17 hours ago
    One of the biggest problems with hands-off LLM writing (for long-horizon stuff like novels) is that you can't really give them any details of your story, because they get absolutely neurotic with it.

    Imagine, for instance, that you give the LLM the profile of the love interest for your epic fantasy: it will almost always have the main character meet them within 3 pages (usually page 1), which is of course absolutely nonsensical pacing. No attempt to tell it otherwise changes anything.

    This is the first model where, after 19 pages generated so far, the result resembles anything like normal pacing, even with a TON of details. I've never felt the need to generate anywhere near this much. Extremely impressed.

    Edit: Sharing it - https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

    with pastebin - https://pastebin.com/aiWuYcrF

    • comboy 14 hours ago
      I like how critique of LLMs evolved on this site over the last few years.

      We are currently at nonsensical pacing while writing novels.

      • skyechurch 12 hours ago
        The most straightforward way to measure the pace of AI progress is by attaching a speedometer to the goalposts.
        • kaliqt 7 hours ago
          Oh, that's a good one. And it's true. There seems to be a massive inability in most people to admit the building impact of modern AI development on society.
          • benterix 6 hours ago
            Oh, we do admit the impact and even have a name for it: AI slop. (Speaking of LLMs here, since AI is a broad term and it has many extremely useful applications in various areas.)
            • Workaccount2 1 hour ago
              AI slop is soon to be "AI output that no one wanted to take credit for".
        • josefx 47 minutes ago
          They certainly seem to have moved from "it is literally skynet" and "FSD is just around the corner" in 2016 to "look how well it paces my first lady Trump/Musk slashfic" in 2025. Truly world changing.
        • munksbeer 4 hours ago
          I love this comment.
      • solardev 11 hours ago
        It's not really passing the Turing Test until it outsells Harry Potter.
        • dragonwriter 8 hours ago
          > It's not really passing the Turing Test until it outsells Harry Potter.

          Most human-written books don't do that, so that seems to be a criterion for a very different test than a Turing test.

          • ZiiS 5 hours ago
            Both books that have outsold the Harry Potter series claim divine authorship, not purely human. I am prepared to bet quite a lot that the next isn't human-written, either.
          • mirekrusin 7 hours ago
            The joke is that the goalpost is constantly moving.
            • TeMPOraL 6 hours ago
              This subgoal post can't move much further after it passes the "outsells the Bible" mark.
              • zimbatm 59 minutes ago
                Why would the book be worth buying, though, if AI can generate a fresh new one just for you?
              • mirekrusin 5 hours ago
                You may be right, on the other hand it always feels like the next goalpost is the final one.

                I'm pretty sure if something like this happens, some dude will show up from nowhere and claim that it's just parroting what other, real people have written, just blended it together and randomly spat it out – "real AI would come up with original ideas like a cure for cancer", he'll say.

                After some form of that arrives, another dude will show up and say that this "alphafold while-loop" is not real AI, because he just went for lunch and there was a guy flipping burgers – and that "AI" can't do it, so it's shit.

                https://areweagiyet.com should plot those future points as well, with all those funky goals like "if Einstein had access to the Internet, Wolfram etc. he could have come up with it anyway, so not better than humans per se", or "had to be prompted and guided by a human to find this answer, so didn't really do it by itself" etc.

        • tummler 2 hours ago
          Let me toss a grenade in here.

          What if we didn’t measure success by sales, but by impact on the industry (or society), or value to people’s lives?

          Zooming out to AI broadly: what if we didn’t measure intelligence by (game-able, arguably meaningless) benchmarks, but real world use cases, adaptability, etc?

          • szatkus 37 minutes ago
            I recently watched some Claude Plays Pokemon and believe it's a better measure than all those AI benchmarks. The game could be beaten by an 8yo, who obviously doesn't have all the knowledge that even small local LLMs possess, but has actual intelligence and could figure out the game within < 100h. So far Claude can't even get past the first half, and I doubt any other AI could get much further.
          • Workaccount2 1 hour ago
            This is difficult to do because one of the juiciest parts of AI is being able to take credit for its work.
        • eru 9 hours ago
          Well, strictly speaking, outselling Harry Potter would fail the Turing test: the Turing test is about passing for human (in an adversarial setting), not surpassing humans.

          Of course, this is just some pedantry.

          I for one love that AI is progressing so quickly, that we _can_ move the goalposts like this.

        • ninetyninenine 3 hours ago
          The goalposts will be moved again, with tons of people clamoring that the book is stupid and vapid and that only idiots bought it. When AI starts taking over jobs, which it already has, you’ll get tons of idiots claiming the same thing.
        • jychang 8 hours ago
          To be fair, pacing as a big flaw of LLMs has been a constant complaint from writers for a long time.

          There were popular writeups about this from the Deepseek-R1 era: https://www.tumblr.com/nostalgebraist/778041178124926976/hyd...

          • newswasboring 6 hours ago
            This was written on March 15. Deepseek came out in January. "Era" is not language I would use for something that happened a few days ago.
      • ksec 33 minutes ago
        Do we have any simple benchmarks (and I know benchmarks are not everything) that test all the LLMs?

        The pace is moving so fast I simply can't keep up. Or an ELI5 page which gives a 5-minute explanation of LLMs from 2020 to this moment?

      • krzat 8 hours ago
        This either ends at "better than 50% of human novels" garbage or at unimaginably compelling works of art that completely obsolete fiction writing.

        Not sure which is better for humanity in the long term.

        • WindyMiller 7 hours ago
          That could only obsolete fiction-writing if you take a very narrow, essentially commercial view of what fiction-writing is for.

          I could build a machine that phones my mother and tells her I love her, but it wouldn't obsolete me doing it.

          • bergundytomato 55 minutes ago
            Ahh, now this would be a great premise for a short story (from the mom's POV).
      • rafaelmn 6 hours ago
        People are trying to use gen AI in more and more use cases. It used to fall flat on its face at trivial stuff; now it has gotten past the trivial stuff but is still scratching the boundaries of being useful. And that is not an attempt to make the gen AI tech look bad (it is really amazing what it can do), but it is far from delivering on the hype, and that is why people are providing critical evaluations.

        Let's not forget the OpenAI benchmarks saying 4.0 could do better at college exams and such than most students. Yet real-world performance on real tasks was laughable.

      • ruraljuror 13 hours ago
        We are, if this comment is the standard for all criticism on this site. Your comment seems harsh. Perhaps novel writing is too low-brow of a standard for LLM critique?
        • jorl17 13 hours ago
          I didn't quite read parent's comment like that. I think it's more about how we keep moving the goalposts or, less cynically, how the models keep getting better and better.

          I am amazed at the progress that we are _still_ making on an almost monthly basis. It is unbelievable. Mind-boggling, to be honest.

          I am certain that the issue of pacing will be solved soon enough. I'd give 99% probability of it being solved in 3 years and 50% probability in 1.

          • dalmo3 4 hours ago
            What exactly is the problem with moving the goalposts? Who is trying to win arguments over this stuff?

            Yes, Z is indeed a big advance over Y was a big advance over X. Also yes, Z is just as underwhelming.

            Are customers hurting the AI companies' feelings?

            • TeMPOraL 1 hour ago
              > Are customers hurting the AI companies' feelings?

              No. It's the critics' feelings that are being hurt by continued advances, so they keep moving goalposts so they can keep believing they're right.

            • HelloMcFly 1 hour ago
              The goalposts should keep moving. That's called progress. Like you, I'm not sure why it seems to irritate or even amuse people.
          • jiggawatts 11 hours ago
            In my consulting career I sometimes get to tune database servers for performance. I have a bag of tricks that yield about +10-20% performance each. I get arguments about this from customers, typically along the lines of "that doesn't seem worth it."

            Yeah, but 10% plus 20% plus 20%... next thing you know you're at +100% and your server is literally double the speed!
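
            As a rough illustration of how multiplicative gains stack up (the specific percentages below are made up, not the actual tricks):

              import math

              gains = [0.10, 0.20, 0.20, 0.15, 0.15]   # five individually modest improvements
              total = math.prod(1 + g for g in gains)  # they compound multiplicatively
              print(f"{total:.2f}x")                   # ~2.09x, i.e. a bit more than double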

            AI progress feels the same. Each little incremental improvement alone doesn't blow my skirt up, but we've had years of nearly monthly advances that have added up to something quite substantial.

            • eru 9 hours ago
              Yes, if you are Mary Poppins, each individual trick in your bag doesn't have to be large.

              (For those too young or unfamiliar: Mary Poppins famously had a bag that she could keep pulling things out of.)

            • rafaelmn 7 hours ago
              Except at some point the low hanging fruit is gone and it becomes +1%, +3% in some benchmarked use case and -1% in the general case, etc. and then come the benchmarking lies that we are seeing right now, where everyone picks a benchmark that makes them look good and its correlation to real world performance is questionable.
      • leokennis 6 hours ago
        Not really new is it? First cars just had to be approaching horse and cart levels of speed. Comfort, ease of use etc. were non-factors as this was "cool new technology".

        In that light, even a 20 year old almost broken down crappy dinger is amazing: it has a radio, heating, shock absorbers, it can go over 500km on a tank of fuel! But are we fawning over it? No, because the goalposts have moved. Now we are disappointed that it takes 5 seconds for the Bluetooth to connect and the seats to auto-adjust to our preferred seating and heating setting in our new car.

      • ripped_britches 12 hours ago
        lol wouldn’t that be great to read this comment in 2022
    • deng 6 hours ago
      I have actually read it and agree it is impressive. I will not comment much on the style of the writing, since this is very much subjective, but I would rate it as the "typical" modern fantasy style, which aims at filling as many pages as possible: very "flowery" language, lots of adjectives/adverbs, lots of details, lots of high-school prose ("Panic was a luxury they couldn't afford"). Not a big fan of that, since I really miss the time when authors could write single, self-contained books instead of a sprawling series over thousands of pages, but I know of course that this kind of thing is very successful and people seem to enjoy it. If someone gave me this, I would advise them to get a good copy editor.

      There are some logical inconsistencies, though. For instance, when they both enter the cellar through a trapdoor, Kael goes first, but the innkeeper instructs him to close the trapdoor behind them, which makes no sense. Also, Kael goes down the stairs and "risks a quick look back up" and can somehow see the front door bulging and the chaos outside through the windows, which obviously is impossible when you look up through a trapdoor, not to mention that previously it was said this entry is behind the bar counter, surely blocking the view. Kael lights an oily rag which somehow becomes a torch. There are more generic issues, like the Eldertides somehow being these mythical things no one has ever seen, yet they seem to be pretty common occurrences? The dimensions of the cellar are completely unclear; at first it seems to be very small, yet they move around it quite a bit. There are other issues, like people using the same words as the narrator ("the ooze"), as if they can hear him, or the innkeeper suddenly calling Kael by his name as if they already know each other.

      Anyway, I would rate it "first draft". Of course, it is unclear whether the LLM would manage to write a consistent book, but I can fully believe that it could. I probably wouldn't want to read it.

      • og_kalu 3 hours ago
        Yeah I just focused on how well it was paced and didn't give any instructions on style or try a second pass to spot any inconsistencies.

        That would be the next step but I'd previously never thought going any further might be worth it.

      • hjnilsson 5 hours ago
        Thank you for taking the time to do a thorough read, I just skimmed it, and the prose is certainly not for me. To me it lacks focus, but as you say, this may be the style the readers enjoy.

        And it also, as you say, really reuses words. Just reading it, I noticed "phosphorescence" 4 times in this chapter, for example, and "ooze" 17 times (!).

        It is very impressive though that it can create a somewhat cohesive storyline, and certainly an improvement over previous models.

    • nout 16 hours ago
      Using the AI in multiple phases is the approach that can handle this. Similar to the "Deep Research" approach, you can tell it to first generate a storyline with multiple twists and turns. Then ask the model to take this storyline and generate prompts for individual chapters. Then ask it to generate the individual chapters based on the prompts, etc.
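
      As a rough sketch of what such a pipeline could look like (the generate() function below is just a placeholder for whichever model API you use, and the prompts are illustrative):

        def generate(prompt: str) -> str:
            # Placeholder: swap in a real chat-completion call (Gemini, OpenAI, a local model, ...).
            return f"[model output for: {prompt[:60]}...]"

        def write_story(premise: str, num_chapters: int = 12) -> list[str]:
            # Phase 1: a storyline with explicit twists, turns, and pacing notes.
            outline = generate(
                f"Write a {num_chapters}-chapter outline for this premise, noting when each "
                f"character and twist should first appear:\n{premise}"
            )
            # Phase 2: turn the outline into one self-contained prompt per chapter.
            chapter_prompts = generate(
                "For each chapter in this outline, write a standalone prompt saying what must "
                f"(and must not yet) happen in it, separated by blank lines:\n{outline}"
            ).split("\n\n")[:num_chapters]
            # Phase 3: draft each chapter, carrying a running summary forward so later
            # chapters stay consistent with what has already been written.
            chapters, summary = [], ""
            for p in chapter_prompts:
                chapters.append(generate(f"Story so far: {summary}\n\nWrite this chapter:\n{p}"))
                summary = generate(f"Update the running summary with the new chapter:\n{summary}\n\n{chapters[-1]}")
            return chapters
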
      • bbor 15 hours ago
        Yup -- asking a chatbot to create a novel in one shot is very similar to asking a human to improvise a novel in one shot.
        • mikepurvis 13 hours ago
          But a future chatbot would be able to internally project manage itself through that process, of first emitting an outline, then producing draft chapters, then going back and critiquing itself and finally rewriting the whole thing.
          • eru 9 hours ago
            Yes, and that's why many people in the discussion here are very optimistic that chatbots will have solved this problem very soon. Either with the approach you suggest, or with something else (and perhaps more general, and less directly programmed in).
        • og_kalu 15 hours ago
          It's not a problem of one-shotting it. It's that the details cause a collapse. Even if you tried breaking it down, which I have, you'd run into the same problem unless you tried holding its hand for every single page, and then - what's the point? I want to read the story, not co-author it.
          • koakuma-chan 14 hours ago
            Doesn't novel literally mean something new? Can we really expect an LLM to produce a novel?
            • eru 8 hours ago
              The etymology is pretty much irrelevant. In e.g. German, the word for novel is 'Roman'. But German readers don't expect their novels to be any more romantic, nor do English readers expect their novels to be more novel.

              LLMs have been producing new things all the time. The question was always about quality of output, never about being able to produce anything new.

            • FloorEgg 14 hours ago
              Yes
          • fragmede 10 hours ago
            I dunno, there's a certain amount of fun in "writing" a book with ChatGPT. Like playing a video game with a bunch of different endings instead of watching a movie with only one. Does the hero save the day? Or turn into a villain? You decide!
    • fshr 17 hours ago
      I think you would be better off having the LLM help you build up the plot with high level chapter descriptions and then have it dig into each chapter or arc. Or start by giving it the beats before you ask it for help with specifics. That'd be better at keeping it on rails.
      • og_kalu 17 hours ago
        I don't disagree. Like with almost anything else involving LLMs, getting hands-on produces better results. But because in this instance I much prefer to be the reader rather than the author or editor, it's really important to me that an LLM is capable of pacing long-form writing properly on its own.
        • saberience 15 hours ago
          Random question: if you don't care about being a creator yourself, why do you even want to read long-form writing written by an LLM? There are literally 10000s of actual human-written books out there, all of them better than anything an LLM can write, why not read them?
          • eru 8 hours ago
            > There are literally 10000s of actual human-written books out there, all of them better than anything an LLM can write, why not read them?

            10000s is still much smaller than the space of possibilities for even a short prompt.

            You might be right that good human novels are better than what LLMs can manage today. But that's rapidly changing.

            And if you really need that Harry Potter / Superman / Three Musketeers crossover fan fiction itch scratched, you might not care that some other existing novel is 'better' in some abstract sense.

          • og_kalu 15 hours ago
            Authors tell stories they want to tell, and readers read stories they want to read. The two don't necessarily overlap, or overlap strongly enough. If you're even a little bit specific (nowhere near as specific as the above prompt, even just something like the dynamic between protagonists), then you don't actually have 10,000s of actual human-written books. Not even close. Maybe it exists and maybe you'll find it good enough, but if it's only been read by a few hundred or thousand people? Good luck getting it recommended.

            I've read a LOT of fiction. I love reading. And if it's good enough, the idea of reading something created by a machine does not bother me at all. So of course i will continue to see if the machine is finally good enough and i can be a bit more specific.

          • dragonwriter 8 hours ago
            > There are literally 10000s of actual human-written books out there

            Tens-of-thousands is probably low by something in the neighborhood of four orders of magnitude.

          • Spooky23 14 hours ago
            Usually porn and fan fiction.
          • bufferoverflow 15 hours ago
            It's very hard to find good books written by humans. GoodReads is okay, but you quickly run out of high-end recommendations. I read mostly sci-fi, and the books that everyone recommends rarely end up being 10/10. But then I see some random recommendation on Reddit or HN, and it ends up being amazing.

            Human-generated slop is real.

            • eru 8 hours ago
              You could ask your LLM for a recommendation.
      • tluyben2 8 hours ago
        That was what I tried on the train [0] a few weeks ago. I used Groq to get something very fast, to see if it would work at least somewhat. It gives you a PDF in the end. Plugging in a better model gave much better results (still not really readable if you actually try to; at a glance it's convincing though), however it was so slow that testing was kind of impossible. You cannot really have things done in parallel either, because it does need to know what it pushed out before, or at least a summary of it.

        [0] https://github.com/tluyben/bad-writer

    • smcleod 3 hours ago
      I have never used an LLM for fictional writing, but I have been writing large amounts of code with them for years. What I'd recommend is this: when you're defining your plan up front for the sections of the content, simply state in which phase/chapter of the content the characters should meet.

      Planning generated content is often more important to invest in than the writing of it.

      Looking at your paste, your prompt is short and basic; it should probably be broken up into clear, formatted sections (try directives inside XML-style tags). For such a large output as you're expecting, I'd expect a considerable prompt of rules and context setting (maybe a page or two).
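
      For illustration (the tag names and rules here are made up, not a required schema), the kind of sectioned prompt I mean might be assembled like this:

        # Sketch only: story_details stands in for your actual character profiles,
        # setting notes, and plot beats.
        story_details = "..."

        prompt = f"""
        <role>You are drafting chapter one of a long epic fantasy novel.</role>

        <story_bible>
        {story_details}
        </story_bible>

        <pacing_rules>
        - The love interest is not introduced before chapter 8.
        - Chapter one only establishes the protagonist, the setting, and a first hint of the threat.
        - Do not resolve any mystery you introduce in this chapter.
        </pacing_rules>

        <style_rules>
        - Do not repeat distinctive words (e.g. "ooze", "phosphorescence") more than twice.
        - Vary sentence length; avoid stock phrases.
        </style_rules>

        <output>Write only the chapter text, roughly 3,000 words.</output>
        """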

    • sagarpatil 9 hours ago
      My prompt is nowhere near yours.

      Just for fun: Asked it to rewrite the first page of ‘The Fountainhead’ where Howard is a computer engineer, the rewrite is hilarious lol.

      https://gist.github.com/sagarspatil/e0b5443132501a3596c3a9a2...

    • stephen_cagle 15 hours ago
      Opening with "like a struck flint carried on a wind that wasn’t blowing." <chuckles>

      I don't know why, but that is just such a literal thing to say that it seems almost random.

    • Workaccount2 16 hours ago
      Can you share it on a text sharing site? It seems you hit your share quota
    • solardev 10 hours ago
      19 pages?! Am I the only one who prefers an AI that jumps straight to the point?

      - Buildup and happy background world-building

      - Subtle foreshadowing

      - Orcs attack

      - Hero is saved by unlikely warrior of astounding beauty

      - Evil is defeated until sales justify unnecessary sequel

      That's the kind of story fit for the modern attention span...

  • malisper 20 hours ago
    I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand.

    Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.

    For those curious the riddle is: There's three people in a circle. Each person has a positive integer floating above their heads, such that each person can see the other two numbers but not his own. The sum of two of the numbers is equal to the third. The first person is asked for his number, and he says that he doesn't know. The second person is asked for his number, and he says that he doesn't know. The third person is asked for his number, and he says that he doesn't know. Then, the first person is asked for his number again, and he says: 65. What is the product of the three numbers?
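
    For anyone who wants to poke at it mechanically, here is a minimal sketch (in Python) that treats the three "I don't know"s as public announcements and searches for consistent worlds. The 1..199 range for the two visible numbers is just an assumed bound, chosen to comfortably contain the answer.

      SPEAKERS = [0, 1, 2, 0]  # announcement order: P1, P2, P3, then P1 again

      def sum_constraint(t):
          # one of the three numbers is the sum of the other two
          a, b, c = t
          return a == b + c or b == a + c or c == a + b

      def candidates(t, person):
          # values this person could have, given the two numbers they can see
          x, y = [t[j] for j in range(3) if j != person]
          vals = {x + y}
          if x != y:  # the difference is only a candidate if it is positive
              vals.add(abs(x - y))
          return vals

      def knows(t, person, heard):
          # does this person know their own number after `heard` "I don't know" announcements?
          live = [v for v in candidates(t, person)
                  if consistent(tuple(v if j == person else t[j] for j in range(3)), heard)]
          return len(live) == 1

      def consistent(t, heard):
          # is world t consistent with the first `heard` "I don't know" announcements?
          return all(not knows(t, SPEAKERS[k], k) for k in range(heard))

      # P1 has 65, all three pass once, then P1 knows.
      for b in range(1, 200):
          for c in range(1, 200):
              t = (65, b, c)
              if sum_constraint(t) and consistent(t, 3) and knows(t, 0, 3):
                  print(t, "product:", 65 * b * c)  # expected: (65, 26, 39) product: 65910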

    • hmottestad 20 hours ago
      This looks like it was posted on Reddit 10 years ago:

      https://www.reddit.com/r/math/comments/32m611/logic_question...

      So it’s likely that it’s part of the training data by now.

      • canucker2016 19 hours ago
        You'd think so, but both Google's AI Overview and Bing's CoPilot output wrong answers.

        Google spits out: "The product of the three numbers is 10,225 (65 * 20 * 8). The three numbers are 65, 20, and 8."

        Whoa. Math is not AI's strong suit...

        Bing spits out: "The solution to the three people in a circle puzzle is that all three people are wearing red hats."

        Hats???

        Same text was used for both prompts (all the text after 'For those curious the riddle is:' in the GP comment), so Bing just goes off the rails.

        • moritzwarhier 19 hours ago
          That's a non-sequitur; they would be stupid to run an expensive _L_LM for every search query. This post is not about Google Search being replaced by Gemini 2.5 and/or a chatbot.
          • michaelt 17 hours ago
            Yes, putting an expensive LLM response atop each search query would be quite stupid.

            You know what would be even stupider? Putting a cheap, wrong LLM response atop each search query.

          • canucker2016 19 hours ago
            Google placed its "AI overview" answer at the top of the page.

            The second result is this reddit.com answer, https://www.reddit.com/r/math/comments/32m611/logic_question..., where at least the numbers make sense. I haven't examined the logic portion of the answer.

            Bing doesn't list any reddit posts (that Google-exclusive deal) so I'll assume no stackexchange-related sites have an appropriate answer (or bing is only looking for hat-related answers for some reason).

            • moritzwarhier 19 hours ago
              I might have phrased it poorly. With _L_ (or L as intended), I meant their state-of-the-art model, which I presume Gemini 2.5 is (haven't gotten around to TFA yet). Not sure if this question is just about model size.

              I'm eagerly awaiting an article about RAG caching strategies though!

        • vicek22 19 hours ago
          The riddle has a different variant with hats: https://erdos.sdslabs.co/problems/5
        • Etherlord87 5 hours ago
          There are 3 toddlers on the floor. You ask them a hard mathematical question. One of the toddlers plays around with pieces of paper on the ground and happens to raise one that has the right answer written on it.

          - This kid is a genius! - you yell

          - But wait, the kid has just picked an answer up from the ground, it didn't actually come up with it...

          - But the other toddlers could do it also but didn't!

      • malisper 19 hours ago
        Other models aren't able to solve it, so there's something else happening besides it being in the training data. You can also vary the problem and give it a number like 85 instead of 65, and Gemini is still able to properly reason through the problem.
        • lolinder 19 hours ago
          I'm sure you're right that it's more than just it being in the training data, but that it's in the training data means that you can't draw any conclusions about general mathematical ability using just this as a benchmark, even if you substitute numbers.

          There are lots of possible mechanisms by which this particular problem would become more prominent in the weights in a given round of training even if the model itself hasn't actually gotten any better at general reasoning. Here are a few:

          * Random chance (these are still statistical machines after all)

          * The problem resurfaced recently and shows up more often than it used to.

          * The particular set of RLHF data chosen for this model draws out the weights associated with this problem in a way that wasn't true previously.

          • mrtesthah 14 hours ago
            Google Gemini 2.5 is able to search the web, so if you're able to find the answer on reddit, maybe it can too.
        • mattkevan 18 hours ago
          I think there’s a big push to train LLMs on maths problems - I used to get spammed on Reddit with ads for data tagging and annotation jobs.

          Recently these have stopped, and now the ads are about becoming a maths tutor to AI.

          Doesn’t seem like a role with long-term prospects.

        • stabbles 16 hours ago
          It gets it wrong when you give it 728. It claims (728, 182, 546). I won't share the answer so it won't appear in the next training set.
          • WithinReason 8 hours ago
            with 728 the puzzle doesn't work since it's divisible by 8
            • eru 8 hours ago
              But then the AI should tell you that, too, if it really understands the problem?
            • stabbles 3 hours ago
              Fair, the question is what possible solutions exist.
        • 7e 18 hours ago
          Sure, but you can't cite this puzzle as proof that this model is "better than 95+% of the population at mathematical reasoning" when the method of solving (the "answer") it is online, and the model has surely seen it.
      • f1shy 9 hours ago
        Thanks. I wanted to do exactly that: find the answer online. It is amazing that people (even on HN) think that LLMs can reason. They just regurgitate the input.
        • Etherlord87 5 hours ago
          I think it can reason. At least if it can work in a loop ("thinking"). It's just that this reasoning is far inferior to human reasoning, despite what some people hastily claim.
        • motoxpro 9 hours ago
          I would say that 99.99% of humans do the same. Most people never come up with anything novel.
          • f1shy 7 hours ago
            I would say maybe about 80%, certainly not 99.99%. But I've seen that in college, some would only be able to solve the problems which were pretty much the same as ones already seen. Notably, some guys could easily come up with solutions to complex problems they had not seen before. I am of the opinion that no human at age 20 can have had the amount of input an LLM has today. And still, humans of age 20 do come up with very new ideas pretty often (new in the sense that they have not seen that or anything like it before). Of course there are more and less creative/intelligent people...
          • WA 7 hours ago
            Reasoning != coming up with something novel.
      • drexlspivey 16 hours ago
        And if it wasn’t, it is now
      • thaumasiotes 16 hours ago
        [flagged]
        • thaumasiotes 9 hours ago
          Is there a reason for the downvotes here? We can see that having the answer in the training data doesn't help. If it's in there, what's that supposed to show?
          • klausa 7 hours ago
            It's entirely unclear what you are trying to get across, at least to me.

            Generally speaking, posting output from an LLM without explaining exactly what you think it illustrates, and why, is frowned upon here. I don't think your comment does a great job of the latter.

            • thaumasiotes 6 hours ago
              >> So it’s likely that it’s part of the training data by now.

              > I don't think this means what you think it means.

              > I did some interacting with the Tencent model that showed up here a couple days ago [...]

              > This is a question that obviously was in the training data. How do you get the answer back out of the training data?

              What do I think the conversation illustrates? Probably that having the answer in the training data doesn't get it into the output.

              How does the conversation illustrate that? It isn't subtle. You can see it without reading any of the Chinese. If you want to read the Chinese, Google Translate is more than good enough for this purpose; that's what I used.

              • Workaccount2 43 minutes ago
                Your intentions are good, but your execution is poor.

                I cannot figure out what the comment is trying to get across either. It's easy for you because you already know what you are trying to say. You know what the pasted output shows. The poor execution is in not spending enough time thinking about how someone coming in totally blind would interpret the comment.

              • klausa 6 hours ago
                > How does the conversation illustrate that? It isn't subtle. You can see it without reading any of the Chinese.

                I can't, and I imagine most of the people who downvoted you couldn't either.

                I think asking people to go to Google Translate to parse a random comment that seems to be 90% LLM output by volume is a bit much.

          • eru 8 hours ago
            No clue. Perhaps people object to the untranslated Chinese?
          • meepmorp 6 hours ago
            > Is there a reason for the downvotes here?

            I didn't downvote you, but like (probably) most people here, I can't read Chinese; I can't derive whatever point you're trying to make just from the text you provided.

    • _cs2017_ 17 hours ago
      This is solvable in roughly half an hour on pen and paper by a random person I picked with no special math skills (beyond a university). This is far from a difficult problem. The "95%+" in math reasoning is a meaningless standard; it's like saying a model is better than 99.9% of the world population at the Albanian language, since less than 0.1% bother to learn Albanian.

      Even ignoring the fact that this or a similar problem may have appeared in the training data, it's something careful brute-force math logic should solve. It's neither difficult, nor interesting, nor useful. Yes, it may suggest a slight improvement in basic logic, but no more so than a million other benchmarks people quote.

      This goes to show that evaluating models is not a trivial problem. In fact, it's a hard problem (in particular, it's a far far harder than this math puzzle).

      • wrasee 43 minutes ago
        So in the three hours since you read the puzzle in the parent comment, you stopped what you were doing, managed to get some other "random" person to stop what they were doing, and had them spend half an hour of their time on a maths puzzle that, at that point, prior experience suggested could take a day? All within three hours?

        That's not to say that you didn't, or that you're recalling a previous time that happens to involve this exact puzzle (despite there being scant prior references to this puzzle, which is precisely the reason for using it). But you can see how some might find that not entirely credible.

        Best guess: this random person is someone who really likes puzzles, is presumably good at them, and is very, very far from being representative to the extent your argument would require.

        Read: just a heavy flex about puzzle solving.

      • windowshopping 12 hours ago
        The "random person" you picked is likely very, very intelligent and not at all a good random sample. I'm not saying this is difficult to the extent that it merits academic focus, but it is NOT a simple problem and I suspect less than 1% of the population could solve this in half an hour "with no special math skills." You have to be either exceedingly clever or trained in a certain type of reasoning or both.
        • sundarurfriend 8 hours ago
          I agree with your general point that this "random person" is probably not representative of anything close to an average person off the street, but I think the phrasing "very very intelligent" and "exceedingly clever" is kinda misleading.

          In my experience, the difference between someone who solves this type of logic puzzle and someone who doesn't, has more to do with persistence and ability to maintain focus, rather than "intelligence" in terms of problem-solving ability per se. I've worked with college students helping them learn to solve these kinds of problems (eg. as part of pre-interview test prep), and in most cases, those who solve it and those who don't have the same rate of progress towards the solution as long as they're actively working at it. The difference comes in how quickly they get frustrated (at themselves mostly), decide they're not capable of solving it, and give up on working on it further.

          I mention this because this frustration itself comes from a belief that the ability to solve these belongs some "exceedingly clever" people only, and not someone like them. So, this kind of thinking ends up being a vicious cycle that keeps them from working on their actual issues.

        • dskloet 8 hours ago
          I solved it in less than 15 minutes while walking my dog, no pen or paper. But I wouldn't claim to be a random person without math skills. And my very first guess was correct.

          It was a fun puzzle though and I'm surprised I didn't know it already. Thanks for sharing.

      • re-thc 5 hours ago
        > This is solvable in roughly half an hour on pen and paper by a random person I picked with no special math skills (beyond a university).

        I randomly answered this post and can't solve it in half an hour. Is the point leetcode but for AI? I'd rather it solve real problems than "elite problems".

        Side note: couldn't even find pen and paper around in half an hour.

    • sebzim4500 20 hours ago
      This is a great riddle. Unfortunately, I was easily able to find the exact question with a solution (albeit with a different number) online, thus it will have been in the training set.
      • Workaccount2 20 hours ago
          What makes this interesting is that while the question is online (on reddit, from 10 years ago), other models don't get the answer right. Gemini also shows its work, and it seems to do a few orders of magnitude more calculating than the elegant answer given on reddit.

          Granted, this is all way over my head, but the solution Gemini comes to matches the one given on reddit (and now here, in future training runs):

        65×26×39=65910

        • sebzim4500 19 hours ago
            >Gemini also shows its work, and it seems to do a few orders of magnitude more calculating than the elegant answer given on reddit.

            I don't think Gemini does an unnecessary amount of computation; it's just more verbose. This is typical of reasoning models: almost every step is necessary, but many would not be written down by a human.

      • varispeed 20 hours ago
        Seems like we might need a section of internet that is off limits to robots.
        • Centigonal 20 hours ago
          Everyone with limited bandwidth has been trying to limit site access to robots. The latest generation of AI web scrapers are brutal and do not respect robots.txt.
          • varispeed 19 hours ago
            There are websites where you can only register in person and have two existing members vouch for you. It can probably still be gamed, but it sounds like a great barrier to entry for robots (for now).
            • tmpz22 18 hours ago
              What prevents someone from getting access and then running an authenticated headless browser to scoop the data?
              • varispeed 16 hours ago
                Admins will see unusual traffic from that account and then take action. Of course it will not be perfect, as there could be a way to mimic human traffic and slowly scrape the data anyway; that's why there is an element of trust (two existing members to vouch).
                • tmpz22 13 hours ago
                  Yeah, don't get me wrong, I believe raising the burden of extraction is an effective strategy; I just think it's been solved at scale, i.e. voting rings and astroturfing operations on Reddit - and at the nation-state level I'd just bribe or extort the mods and admins directly (or the IT person, to dump the database).
                  • varispeed 2 hours ago
                    That's entirely possible, especially if the site is small and not run by people with access to resources like physical security, legal etc.
        • baq 19 hours ago
          It’s here and it’s called discord.
          • Zandikar 19 hours ago
            I have bad news for you if you think non-paywalled / non-phone#-required Discord communities are immune to AI scraping, especially as it costs less than hammering traditional websites, since the push-on-change event is done for you in real-time chat contexts.

            Especially as the company archives all those chats (not sure for how long) and is small enough that a billion-dollar "data sharing" agreement would be a very enticing offer.

            If there isn't a significant barrier to access, it's being scraped. And if that barrier is money, it's being scraped but less often.

          • Davidzheng 19 hours ago
            Honestly, someone should scrape the algebraic topology Discord for AI; it'll be a nice training set.
        • kylebenzle 16 hours ago
          Or we could just accept that LLMs can only output what we have put in, and that calling them "AI" was a misnomer from day one.
          • eru 8 hours ago
            Why would you accept a lie?
    • yifanl 19 hours ago
      You'd have better results if you had prompted it with the actual answer and asked how the first person came to the conclusion. Giving a number in the training set is very easy.

      i.e. You observe three people in a magical room. The first person is standing underneath a 65, the second person is standing underneath a 26, and the third person is standing underneath a 39. They can see the others' numbers but not the one they are directly under. You tell them one of the three numbers is the sum of the other two and all numbers are positive integers. You ask the first person for their number, they respond that they don't know. You ask the second person for their number, they respond that they don't know. You ask the third person, they respond that they don't know. You ask the first person again and they respond with the correct value, how did they know?

      And of course, if it responds with a verbatim answer in the line of https://www.reddit.com/r/math/comments/32m611/logic_question..., we can be pretty confident what's happening under the hood.

    • semiinfinitely 20 hours ago
      I love how the entire comment section is getting one-shotted by your math riddle instead of the original post topic.
    • SwayStar123 19 hours ago
      You can also put the AI in the first person's shoes. Prompt: You are standing in a circle, there are 2 other people in the circle with you, everyone in the circle has a positive integer above their head, no one knows what the number above their own head is but can see the numbers above the heads of the other people. You see that the person in front of you on the left has 26 above their head. The person on the right has 39 above their head. You are told that the sum of two of the numbers is the third number. You are asked what the number above your head is; the options are the sum, 65, or 13, as 26 + 13 = 39. You don't know which one it is, and you say so. The second person is asked the number above their head. They also say they don't know, the third person also says they don't know. What is your number?

      Gemini 2.5 and claude 3.7 thinking get it right, o3 mini and 4o get it wrong

    • utopcell 13 hours ago
      Same here: My problem of choice is the 100 prisoners problem [1]. I used to ask simple reasoning questions in the style of "what is the day three days before the day after tomorrow", but nowadays when I ask such questions, I can almost feel the NN giggling at the naivety of its human operator.

      [1] https://en.wikipedia.org/wiki/100_prisoners_problem

    • r0fl 17 hours ago
      Wow

      Tried this in DeepSeek and Grok and it kept thinking in loops for a while and I just turned it off

      I haven’t seen a question loop this long ever.

      Very impressed

      • z2 10 hours ago
        Deepseek R1 got the right answer after a whopping ~10 minutes of thinking. I'm impressed and feel kind of dirty, I suspect my electricity use from this could have been put to better use baking a frozen pizza.
      • deepboy2 10 hours ago
        Just tried it on Deepseek (not R1, maybe V3-0324) and got the correct answer after 7-8 pages of reasoning. Incredible!
    • ototot 19 hours ago
    • eru 8 hours ago
      I use an algorithmic question that I'd been working on for years and that I'm finally writing up the answer to.

      It's basically: given a sequence of heap operations (insert element, delete minimum element), can you predict the left-over elements (that are in the heap at the end) in linear time in the comparison model?

      (The answer is surprisingly: Yes.)

    • mitko 10 hours ago
      Loved that puzzle, thanks for sharing it. I’ve solved a lot of math problems in the past but this one had a unique flavor of interleaving logical reasoning, partial information and a little bit of arithmetic.
    • adpirz 19 hours ago
      Interactive playground for the puzzle: https://claude.site/artifacts/832e77d7-5f46-477c-a411-bdad10...

      (All state is stored in localStorage so you can come back to it :) ).

    • dkjaudyeqooe 17 hours ago
      > I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.

      It's not an exaggeration, it's a non-sequitur: you first have to show that the LLMs are reasoning in the same way humans do.

    • TrackerFF 20 hours ago
      The riddle certainly nerd-sniped GPT 4.5

      After a couple of minutes it decided on the answer being 65000. (S = {65, 40, 25})

    • bbstats 20 hours ago
      Could you explain "The sum of two of the numbers is equal to the third"??
      • rappatic 20 hours ago
        I think:

        Call the three numbers a, b, and c. This means c = a + b, but we still don’t know to which person each number belongs.

        When person 1 (p1) is asked what his number is, he has no way to know whether he has a, b, or c, so he says he doesn’t know. Same goes for p2 and p3. Clearly p1 somehow gains information by p2 and p3 passing. Either he realizes that he must be a or b, and as such his number is the difference between p2 and p3’s numbers, or he realizes that he must be c, and so his number is the sum of p2 and p3’s numbers.

        That’s all I have so far. Anyone have other ideas?

        • aardvarkr 19 hours ago
          I think it has something to do with applying the lower bound of 1.

          If p1 KNOWS that he’s the largest then he has to have gained some other piece of information. Say the numbers he sees are 32 and 33. His number would have to be either 1 or 65. If p1 was 1 then the other two would have known p1 couldn’t be the sum of the other two

          • oezi 19 hours ago
            But p2 and p3 don't yet know what they are themselves just because they see a 1:

            If p2 sees 1 and 33, s/he would wonder if s/he is 32 or 34.

            P3 would consider 31 or 33.

        • bena 19 hours ago
          The answer is online and it's clever.

          P1 knows that P2 and P3 are not equal. So they know that the set isn't [2A, A, A].

          P2 knows that P1 and P3 are not equal. So they know that the set isn't [A, 2A, A]. They also know that if P1 doesn't know, then they were able to make the same deduction. So they now know that both [2A, A, A] and [A, 2A, A] aren't correct. Since they know that [2A, A, A] isn't correct, they can also know that [2A, 3A, A] isn't correct either. Because they'd be able to see if P1 = 2A and P3 = A, and if that were true and P1 doesn't know their number, it would have to be because P2 isn't A. And if P2 isn't A, they'd have to be 3A.

          P3 knows that P1 and P2 aren't equal. Eliminates [A, A, 2A]. Knows that [2A, A, A], [A, 2A, A], and [2A, 3A, A], are eliminated. Using the same process as P2, they can eliminate [2A, A, 3A], [A, 2A, 3A], and also [2A, 3A, 5A]. Because they can see the numbers and they know if P1 is 2A and P2 is 3A.

          Now we're back at P1. Who now knows.

          So P2 and P3 are in the eliminated sets. Which means we're one of these

          [2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]

          We know his number is 65. To find the set, we can factor 65: (5 * 13). We can check the other numbers: 2(13) = 26, 3(13) = 39. And technically, you don't need to find the other numbers. The final answer is 5A * 2A * 3A, or (A^3) * 30, which with A = 13 gives 30 * 2197 = 65,910.

          • byearthithatius 19 hours ago
            "Which means we're one of these [2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]"

            Why? Couldn't it be an infinite number of 3 size arrays comprised of A where two elements sum to the third? [24A, 13A, 11A]? How did we deduce this set of arrays?

            EDIT: Solved from another reddit comment. Tuples without a common factor like the one above are considered as a=1.

            "They're not eliminated; they correspond to a = 1."

            • jhhh 12 hours ago
              I think that answer was poorly phrased because those possibilities are eliminated in a sense. There is a better answer further in the thread that explains "If the solution was not one of the flipped triplets, then the first player would not have worked out the solution." Thus if it was one of your other infinite triplets (eg. 65, 12, 53) then round 2 player 1 would've still answered 'I don't know'. Since they did respond with a definitive answer it had to be one of the formula solutions, since those were the only solutions they could prove. And since the only formula with a factor in 65 is 5 the correct formula must be [5A, 2A, 3A] and thus [65, 26, 39].

              You should be able to generate an infinite number of these problems just by multiplying the first formula factor by a prime number. Take the same question but where the person answers '52': that restricts you to either [4a, 3a, a] or [4a, a, 3a]. Since the question only asks for the product of all the terms, the answer is (4 * 13) * (3 * 13) * 13 = 26,364 either way.

            • WithinReason 8 hours ago
              Look at it this way: Person 1 sees the numbers 26 and 39, and has to guess his own number. It must be one of only 2 possibilities: 13 or 65. All he has to do is eliminate one of those possibilities.
      • malisper 20 hours ago
        if the three numbers are a, b, and c, then either a+b=c, a+c=b, or b+c=a
        • bena 20 hours ago
          And they must all be positive integers.

          So A + B = C and A + C = B. But we know that A + B = C, so we can replace C with (A + B). So we know that A + A + B = B.

          So 2A + B = B. Or 2A = 0.

          And this holds any way you slice it.

          Even if you were to try and brute force it.

          A = 1

          B = 2

          Then C = 3. But A + C has to equal B. That's 1 + 3 = 2? That's not true.

          I don't see a case where you can add to the sum of two numbers one of the numbers and get the other number.

          I'm guessing that's a misreading of the problem. Because it looks like the third number is the sum of the first two.

          • refulgentis 20 hours ago
            One of the cases has to be true, not all 3. (as you show, they're mutually exclusive for positive integers) i.e. "either" is important in the parent comment.
            • bena 19 hours ago
              Which is why I indicated that it would be a misreading of the problem.

              The original problem is a little ambiguously worded. You could say "one of their numbers is the sum of the other two" and it would be a little clearer.

              • thaumasiotes 16 hours ago
                > The original problem is a little ambiguously worded.

                No it isn't. If it said "the sum of any two of the numbers is equal to the third", that would be a contradiction. What it says is "the sum of two of the numbers is equal to the third".

                • bena 13 hours ago
                  I have three items.

                  Buying two of the items gets you the third for free.

                  The implication is any two.

                  It’s ok that it’s ambiguous. It happens. In most cases, we clarify and move on. There’s no need to defend it.

                  • thaumasiotes 10 hours ago
                    Why look for ambiguity that isn't there?
      • refulgentis 20 hours ago
        Given #s x,y, and z, either x + y = z, x + z = y, or y + z = x.
    • highfrequency 17 hours ago
      Fun puzzle! I’m curious how you managed to structure the problem such that a computer could solve it but it took 3 days of computation?
    • refulgentis 20 hours ago
      In general I find commentary here too negative on AI, but I'm a bit squeamish about maximalist claims re: AI mathematical reasoning vs. human population based off this, even setting aside lottery-ticket-hypothesis-like concerns.

      It's a common logic puzzle; Google can't turn up an exact match to the wording you have, but see e.g. here: https://www.futilitycloset.com/2018/03/03/three-hat-problem/

    • hmottestad 20 hours ago
      Is the answer somehow {65, 20, 45} with the product 58,500?

      That’s one-shot for o1 pro.

      • sebzim4500 19 hours ago
        That's wrong. From player 1's perspective {25 20 45} is entirely consistent with the calls made in the first three rounds.
    • doener 19 hours ago
      ChatGPT 4.5:

      Let’s carefully analyze this puzzle step-by-step:

      Step 1: Define the Problem Clearly

      We have three people in a circle. Let’s call them:
      • Person A (first person)
      • Person B (second person)
      • Person C (third person)

      Each has a positive integer above their heads. Each sees the other two numbers but not their own. The numbers satisfy: A + B = C, or B + C = A, or C + A = B.

      Given the puzzle scenario, we need to find out the exact numbers given the clues, and ultimately find the product of the three numbers.

      Step 2: Logic of the Puzzle (Step-by-Step Reasoning)

      Initial conditions: The numbers are positive integers. The puzzle implies a situation of ambiguity at first, but after multiple rounds of “I don’t know,” Person A finally knows their number is 65.

      Let’s denote clearly the conditions again: • Exactly one of the following three equations is true: • A + B = C, or • B + C = A, or • C + A = B.

      Person A sees B and C, but not A. Similarly for Persons B and C.

      Step 3: Logical deductions from the repeated “I don’t know”

      First round deductions: • Person A initially doesn’t know their number. This means, from Person A’s perspective, both scenarios A = B + C and A + B = C (or A + C = B) are still possible. If Person A had seen identical numbers, for instance, Person A could have instantly known their number. Thus, Person A sees two different numbers. • Similarly, Person B doesn’t know either, implying that Person B also sees ambiguity in their own number. Thus, Person B also sees two numbers that could result in at least two possible solutions. • Person C also doesn’t know, same logic as above.

      Thus, after one full round, we understand that initially, all three numbers must create ambiguity. No trivial case (like two identical numbers) is present.

      Second round deductions:

      After hearing that no one knows their number, each person gains information: they now know the scenario is one in which none of the other two people could have immediately solved it.

      Crucially, in the second round: • Person A now knows their number is 65 after considering the previous statements. This implies that Person A sees numbers B and C, and from these numbers plus the statements of uncertainty, Person A deduces that their own number can only be 65.

      Step 4: Deriving the Numbers (Critical deduction)

      Let’s denote clearly again: Person A’s number is now known to be 65.

      We have three possible equations. Let’s test each carefully given A = 65: 1. Case 1: A + B = C Then we have: • 65 + B = C. Person A sees numbers B and C, so they would see B and 65 + B. Person A sees numbers B and 65+B. If Person A sees these two numbers, they might initially consider their number could be either the sum or difference of these two. • However, if Person A sees two numbers, and one is exactly 2× the other, Person A could immediately guess their number is the difference (since if you see numbers X and 2X, your number is clearly X). So this scenario of B and 65+B is okay only if these two numbers don’t exhibit an immediate trivial scenario. • Let’s hold this scenario and continue examining others first. 2. Case 2: B + C = A (So, B + C = 65) • Then Person A sees numbers B and C, adding exactly to 65. • Initially, Person A sees two numbers summing exactly to 65. Could Person A know their number immediately at first glance? No, since their number could either be the sum (65) or the difference of the two numbers. Ambiguity here initially makes sense. • Initially, Person B and Person C also see ambiguous scenarios. They both don’t know their numbers either at first round. This scenario is plausible. 3. Case 3: C + A = B (So, C + 65 = B) • Then Person A sees numbers B = C + 65 and C. • Person A sees C and C+65. Similar reasoning as Case 1.

      Step 5: Identifying the Correct Scenario

      The puzzle’s logic typically aligns best with Case 2: B + C = A = 65.

      Here’s why clearly stated: • Initially, Person A sees two numbers adding to exactly 65. At first glance, Person A’s number could either be 65 (sum) or the difference between the two numbers. • Person B also initially sees two numbers and cannot distinguish immediately their own number, nor can Person C. • After everyone states “I don’t know,” Person A gains critical insight: if the difference between the two numbers (that A sees) allowed Person B or Person C to know their number immediately, they’d have known right away. The fact that B and C didn’t know means the difference is not immediately obvious to B or C. Thus, the difference cannot be trivial (such as identical numbers or easy multiples).

      Therefore, the two numbers must be such that their difference doesn’t immediately clue in the other persons. This occurs most clearly with pairs of numbers which have multiple factorization possibilities or aren’t trivial multiples.

      But we must carefully test quickly to find a suitable, plausible integer pair whose sum is 65 and whose ambiguity would logically survive through multiple rounds. The puzzle is a classic logic puzzle with a known structure:

      A common scenario in these classic puzzles is the use of Fibonacci-like or similar “golden” sets. Let’s quickly check the following well-known ambiguous triple:

      • Numbers: 65, 39, 26
      • Check quickly: indeed, 39 + 26 = 65.

      Checking logic quickly:

      • Person A sees 39 and 26. Initially ambiguous (number could be 65 or 13).
      • Person B sees 65 and 26. Could be either 39 or 91 at first glance. Initially ambiguous.
      • Person C sees 65 and 39. Could be either 26 or 104 initially. Initially ambiguous.

      After each person says they don’t know:

      • Now, Person A considers: If my number was 13, then Person B would see numbers 13 and 26. Person B would then immediately know their number is 39 (13 + 26 = 39), no ambiguity. But Person B didn’t know! So Person A’s number cannot be 13. Hence, Person A confidently concludes their number is 65.

      Thus, the numbers 65, 39, and 26 satisfy the puzzle conditions perfectly!

      Step 6: Final Answer (Product of the Numbers)

      The three numbers identified clearly are:

      • A = 65
      • B = 39
      • C = 26

      Then, the product is: 65 × 39 × 26 = 65 × 1014 = 65,910

      Final Answer:

      65,910
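
      For anyone who wants to check that deduction mechanically rather than follow the chain above, here is a brute-force sketch in Python. It assumes one standard reading of the puzzle (positive integers, one number is the sum of the other two, and A, B and C each say "I don't know" once before A announces 65); the search bound and the restriction to the "other two numbers sum to 65" configuration are assumptions of the sketch, not part of the original statement.

          from collections import defaultdict
          from itertools import product

          N = 600  # assumed search bound

          def all_worlds():
              # every triple (a, b, c) of positive integers where one number is the sum of the other two
              worlds = set()
              for x, y in product(range(1, N + 1), repeat=2):
                  if x + y <= N:
                      worlds.add((x + y, x, y))  # A holds the sum
                      worlds.add((x, x + y, y))  # B holds the sum
                      worlds.add((x, y, x + y))  # C holds the sum
              return worlds

          def known(worlds, i):
              # worlds in which person i can already deduce their own number from the two numbers they see
              options = defaultdict(set)
              for w in worlds:
                  options[w[:i] + w[i + 1:]].add(w[i])
              return {w for w in worlds if len(options[w[:i] + w[i + 1:]]) == 1}

          worlds = all_worlds()
          for i in (0, 1, 2):             # A, B, C each say "I don't know", in that order
              worlds -= known(worlds, i)  # each announcement publicly prunes the possible worlds

          # Which surviving worlds let A deduce their number is 65, in the configuration the
          # write-up argues for (the other two numbers summing to 65)? Keeping all numbers
          # at or below 65 also keeps the finite cutoff N from affecting the few reasoning steps involved.
          for a, b, c in sorted(known(worlds, 0)):
              if a == 65 and b + c == 65:
                  print((a, b, c), "product =", a * b * c)

      Each "I don't know" is treated as a public announcement that removes every world in which the speaker could already have deduced their own number.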

      • WithinReason 8 hours ago
        This was not logic, this was recall from the training set disguised as logic:

        Let’s quickly check the following well-known ambiguous triple: • Numbers: 65, 39, 26

    • g105b 20 hours ago
      Please can you enlighten me? I'm a mathematical plebeian.
  • simonw 18 hours ago
    I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those.

    Plus it drew me a very decent pelican riding a bicycle.

    Notes here: https://simonwillison.net/2025/Mar/25/gemini/

    • ggeorgovassilis 56 minutes ago
      I've been following your blog for a while now, great stuff!
    • jillesvangurp 4 hours ago
      Have you considered that they must be training on images of pelicans riding bicycles at this point ;-)? At least given how often that comes up in your reviews, a smart LLM engineer might put their finger on the scale a bit and optimize for the things that come up a lot in reviews of their work.
    • redox99 12 hours ago
      Claude's pelican is way better than Gemini's
    • kridsdale3 17 hours ago
      I just tried your trademark benchmark on the new 4o Image Output, though it's not the same test:

      https://imgur.com/a/xuPn8Yq

      • jonomacd 7 hours ago
        And the same thing with gemini 2.0 flash native image output.

        https://imgur.com/a/V4YAkX5

        It's sort of irrelevant though as the test is about SVGs.

      • Unroasted6154 17 hours ago
        Was that an actual SVG?
        • simonw 17 hours ago
          No that's GPT-4o native image output.
          • sebzim4500 16 hours ago
            I wonder how far away we are from models which, given this prompt, generate that image in the first step in their chain-of-thought and then use it as a reference to generate SVG code.

            It could be useful for much more than just silly benchmarks, there's a reason why physics students are taught to draw a diagram before attempting a problem.

  • freediver 15 hours ago
    Tops our benchmark in an unprecedented way.

    https://help.kagi.com/kagi/ai/llm-benchmark.html

    High quality, to the point. Bit on the slow side. Indeed a very strong model.

    Google is back in the game big time.

    • aoeusnth1 14 hours ago
      It should be in the "reasoning" category, right? (still topping the charts there)
    • utopcell 13 hours ago
      That is some wide gap!
  • anotherpaulg 15 hours ago
    Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%.

    This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.

    [0] https://aider.chat/docs/leaderboards/
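
    For anyone unfamiliar with the term: instead of reprinting whole files, the model emits small search/replace edits that the harness applies to the file. Here is a toy sketch of applying one such block; the marker strings mirror aider's documented SEARCH/REPLACE format, but the helper itself is illustrative rather than aider's actual code:

        def apply_edit(file_text: str, edit_block: str) -> str:
            # split the block into its SEARCH and REPLACE halves
            _, _, rest = edit_block.partition("<<<<<<< SEARCH\n")
            search, _, rest = rest.partition("\n=======\n")
            replace, _, _ = rest.partition("\n>>>>>>> REPLACE")
            if search not in file_text:
                # one common failure mode: the SEARCH text doesn't match the file,
                # so a harness like aider asks the model to retry the edit
                raise ValueError("SEARCH block does not match the file")
            return file_text.replace(search, replace, 1)

        original = 'def greet():\n    print("hello")\n'
        edit = (
            "<<<<<<< SEARCH\n"
            'def greet():\n'
            '    print("hello")\n'
            "=======\n"
            'def greet(name):\n'
            '    print(f"hello {name}")\n'
            ">>>>>>> REPLACE"
        )
        print(apply_edit(original, edit))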

    • aoeusnth1 14 hours ago
      Am I correct in assuming that accuracy < using correct edit format? i.e. it made mistakes in 27% of the problems, 11% of which were due to (at least) messing up the diff format?

      In which case, google should be working on achieving better output format following, as Claude and R1 are able to hit nearly 100% accuracy on the format.

      • anotherpaulg 14 hours ago
        It does have fairly low adherence to the edit format, compared to the other frontier models. But it is much better than any previous Gemini model in this regard.

        Aider automatically asks models to retry malformed edits, so it recovers. And goes on to produce a SOTA score.

        • aoeusnth1 9 hours ago
          Ok, thanks for clearing that up.
    • sagarpatil 10 hours ago
      The only benchmark I care about. Thanks!
  • Oras 21 hours ago
    These announcements have started to look like a template.

    - Our state-of-the-art model.

    - Benchmarks comparing to X,Y,Z.

    - "Better" reasoning.

    It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.

    • devsda 21 hours ago
      Reminds me of how nobody is too excited about flagship mobile launches anymore. Most flagships for some time now have been just incremental updates over the previous gen and only marginally better. Couple that with Chinese OEMs launching better or good-enough devices at a lower price point, and new launches from established players are not noteworthy anymore.

      It's interesting how the recent AI announcements are following the same trend over a smaller timeframe.

      • breppp 19 hours ago
        I think the greatest issue with buying a new phone today is ironically the seamless migration.

        Once you get all your apps, wallpaper, shortcut order and the same OS, you really quickly get the feeling you spent $1,000 for the exact same thing.

        • atonse 16 hours ago
          100% agree with you.

          It needs to be seamless to remove any friction from the purchase, but at the same time, if it feels the same, then we feel like we wasted money.

          So what I usually do is buy a different colored phone and change the wallpaper.

          My MacBook was the same. Seamless transition and 2 hours later I was used to the new m4 speeds.

      • flakiness 20 hours ago
        Phones are limited by hardware manufacturing, plus maybe the annual shopping cycle peaking at Christmas. People wouldn't have bought multiple iPhones even in their heyday.

        These LLMs were supposedly limited by the training run, but these point-version models are mostly post-training driven, which seems to take less time.

        If models were tied to specific hardware (say, an "AI PC" or whatever), the cycle would get slower and we'd get a slower summer, which I'm secretly wishing for.

    • tibbar 21 hours ago
      For me, the most exciting part is the improved long-context performance. A lot of enterprise/RAG applications rely on synthesizing a bunch of possibly relevant data. Let's just say it's clearly a bottleneck in current models and I would expect to see a meaningful % improvement in various internal applications if long-context reasoning is up. Gemini was already one of my favorite models for this usecase.

      So, I think these results are very interesting, if you know what features specifically you are using.

      • zwaps 21 hours ago
        But they score it on their own benchmark, on which, coincidentally, Gemini models were always the only good ones. In NoLiMa or BABILong we see that Gemini models still can't do long context.

        Excited to see if it works this time.

    • bhouston 21 hours ago
      > It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.

      This is the commodification of models. There is nothing special about the new models but they perform better on the benchmarks.

      They are all interchangeable. This is great for users as it adds to price pressure.

      • flir 21 hours ago
        Man, I hope those benchmarks actually measure something.
        • Legend2440 21 hours ago
          I would say they are a fairly good measure of how well the model has integrated information from pretraining.

          They are not so good at measuring reasoning, out-of-domain performance, or creativity.

      • Workaccount2 21 hours ago
        Sooner or later someone is going to find "secret sauce" that provides a step-up in capability, and it will be closely guarded by whoever finds it.

        As big players look to start monetizing, they are going to desperately be searching for moats.

        • bangaladore 21 hours ago
          Reasoning was supposed to be that for "Open" AI; that's why they go to such lengths to hide the reasoning output. Look how that turned out.

          Right now, in my opinion, OpenAI actually has a useful deep research feature which I've found nobody else matches. But there is no moat to be seen there.

          • CamperBob2 21 hours ago
            If you've seen DeepSeek R1's <think> output, you'll understand why OpenAI hides their own. It can be pretty "unsafe" relative to their squeaky-clean public image.
          • stepanhruda 15 hours ago
            They don’t hide reasoning output anymore?
        • cratermoon 20 hours ago
          Sooner or later someone is going to find the "secret sauce" that allows building a stepladder tall enough to reach the moon.

          It's called the "first step fallacy", and AI hype believers continue to fall for it.

    • cadamsdotcom 14 hours ago
      Why not snooze the news for a year and see what's been invented when you get back? That'll blow your mind properly, because each of these incremental announcements contributes to a mind-blowing rate of improvement.

      The rate of announcements is a sign that models are increasing in ability at an amazing rate, and the content is broadly the same because they’re fungible commodities.

      The latter, that models are fungible commodities, is what’s driving this explosion and leading to intense competition that benefits us all.

    • diego_sandoval 21 hours ago
      I take this as a good thing, because they're beating each other every few weeks and using benchmarks as evidence.

      If these companies start failing to beat the competition, then we should prepare ourselves for very creative writing in the announcements.

      • gtirloni 19 hours ago
        The improvements have been marginal at best. I wouldn't call that beating.
    • ototot 21 hours ago
      Maybe they just asked Gemini 2.5 to write the announcement.
      • cpeterso 21 hours ago
        And it was trained on the previous announcements.
        • xlbuttplug2 20 hours ago
          ... which were also written by earlier Gemini versions.
          • rvnx 20 hours ago
            which was written by ChatGPT3.5
          • belter 20 hours ago
            LLMs all the way down
            • cratermoon 20 hours ago
              Not all the way. At the bottom are a bunch of unpaid writers and artists and a horde of low-paid mturk workers in Nigeria.
              • throwaway2037 10 hours ago
                I love this comment. It made me laugh.

                    > mturk workers in Nigeria
                
                Serious question: Has anyone tested how much money you can actually make doing a month of Amazon Mechanical Turk? (It would make for an interesting YouTube video!) I am curious if it is middle class wages in very poor countries (like Nigeria). Some light Googling tells me that middle class salary in Nigeria is about 6K USD, so about 3 USD/hour (assuming: 50 weeks/year * 40 hours/week = 2000 hours/year). Is this possible with MTurk?
              • finebalance 12 hours ago
                That's ok. AI will kill those off soon enough, and like all winners, rewrite history enough so that that inconvenient theft never happened anyway. It's manifest destiny, or something.
    • schainks 21 hours ago
      I wish I wish I wish Google put better marketing into these releases. I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.

      Also, I think google's winning the race on actually integrating the AI to do useful things. The agent demo from OpenAI is interesting, but frankly, I don't care to watch the machine use my computer. A real virtual assistant can browse the web headless and pick flights or food for me. That's the real workflow unlock, IMO.

      • throwaway2037 10 hours ago

            > I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.
        
        This is useful feedback. I'm not here to shill for OpenAI, nor Google/Gemini, but can you share a concrete example? It would be interesting to hear more about your use case. More abstractly: Do you think these "moved entire workflows" offset a full worker, or X% of a full worker? I am curious to see how and when we will see low-end/junior knowledge workers displaced by solid LLMs. Listening to the Oxide and Friends podcast, I learned that they make pretty regular use of LLMs to create graphs using gnuplot. To paraphrase, they said "it is like having a good intern".
    • taf2 21 hours ago
      We have incrementally improved: 1% better than we were yesterday. Our competition is 1 day behind us now.
      • throwaway2037 10 hours ago
        Like! No trolling: This could be a sarcastic comment written by an LLM!
    • andai 21 hours ago
      Well hey, OpenAI did the exact opposite, and nobody liked that either.
      • exitb 20 hours ago
        I think people were fine with OpenAI demos. They were less fine with not actually ever releasing the demoed tech.
      • andai 14 hours ago
        To clarify, by "doing the opposite" I mean OpenAI releasing GPT-4.5, a non-reasoning model that does worse on benchmarks (but supposed to be qualitatively better). People shit on OpenAI hard for doing that.
      • emptysongglass 20 hours ago
        I liked their announcements and demos and continue to like them.
      • Oras 21 hours ago
        How did you measure “nobody” liked OpenAI announcements?
    • sagarpatil 10 hours ago
      AI labs, it seems, use a template for system cards as well. OpenAI stands out because they showcase their employees using their tools for various use cases, which is refreshing.
    • kh_hk 20 hours ago
      Was going to comment the same thing; it has been bugging me lately that all announcements start with "our" followed by empty superlatives. Happy to not be alone on this!
    • cratermoon 19 hours ago
      Glaringly missing from the announcements: concrete use cases and products.

      The Achilles heel of LLMs is the distinct lack of practical real-world applications. Yes, Google and Microsoft have been shoving the tech into everything they can fit, but that doesn't a product make.

      • throwaway2037 10 hours ago
        I would say Adobe is doing an excellent job of commercialising image manipulation and generation using LLMs. When I see adverts for their new features, they seem genuinely useful for normie users who are trying to edit some family/holiday photos.
      • kiratp 11 hours ago
        https://www.osmos.io/fabric

        Practical, real-world application.

      • sebzim4500 16 hours ago
        ChatGPT has like 500M weekly active users, what are you on about?
        • cratermoon 14 hours ago
          "Well, Ed, there are 300 million weekly users of ChatGPT. That surely proves that this is a very real industry!" https://www.wheresyoured.at/longcon/
          • sebzim4500 2 hours ago
            Is that article trying to argue that 500M people every week are visiting ChatGPT for the first (or second) time after reading about it in the news?

            If I'm being incredibly generous I will concede that this could have been the case for the first few weeks when it was making headlines, but it clearly isn't true now.

            It would be literally impossible to keep up these figures for as long as ChatGPT has without a ton of repeat users. There simply aren't enough people/devices.

    • Mistletoe 21 hours ago
      I’m sure the AI helps write the announcements.
    • belter 21 hours ago
      Cancelled my account a long time ago. Gemini models are like a McDonald's croissant: you always give them an extra chance, but they always fall apart in your hands...
  • mindwok 14 hours ago
    Just adding to the praise: I have a little test case I've used lately which was to identify the cause of a bug in a Dart library I was encountering by providing the LLM with the entire codebase and description of the bug. It's about 360,000 tokens.

    I tried it a month ago on all the major frontier models and none of them correctly identified the fix. This is the first model to identify it correctly.

    • markdog12 1 hour ago
      Interesting, I've been asking it to generate some Dart code, and it makes tons of mistakes, including lots of invalid code (static errors). When pointing out the mistakes, it thanks me and tells me it won't make it again, then makes it again on the very next prompt.
    • weatherlite 8 hours ago
      360k tokens = how many lines of code, approximately? And also, if it's an open source lib, are you sure there's no mention of this bug anywhere on the web?
      • mindwok 4 hours ago
        Not a huge library, around 32K LoC and no mention of the bug on the web - I was the first to encounter it (it’s since been fixed) unless the training data is super recent.
        • weatherlite 3 hours ago
          Impressive. I tend to think it managed to find the bug by itself, which is pretty crazy given that it can't actually debug anything. Then again, I haven't seen the bug description; perhaps it makes it super obvious where the problem lies.
    • nmfisher 2 hours ago
      How long did it take to sift through those?
    • kungfufrog 12 hours ago
      How do you use the model so quickly? Google AI Studio? Maybe I've missed how powerful that is.. I didn't see any easy way to pass it a whole code base!
      • mindwok 10 hours ago
        Yep! AI studio I think is the only way you can actually use it right now and AFAIK it's free.
    • ripped_britches 12 hours ago
      Wow holy smokes that is exciting
  • jnd0 21 hours ago
    > with Gemini 2.5, we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training. Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.

    Been playing around with it, and it feels intelligent and up to date. Plus, it is connected to the internet, and it acts as a reasoning model by default when it needs to.

    I hope they enable support for the recently released canvas mode for this model soon; it will be a good match.

    • Workaccount2 21 hours ago
      It is almost certainly the "nebula" model on LMArena that has been generating buzz for the last few days. I didn't test coding, but its reasoning is very strong.
  • greatgib 19 hours ago
    If you plan to use Gemini, be warned, here are the usual Big Tech dragons:

       Please don’t enter ...confidential info or any data... you wouldn’t want a reviewer to see or Google to use ...
    
    The full extract of the terms of usage:

       How human reviewers improve Google AI
    
       To help with quality and improve our products (such as the generative machine-learning models that power Gemini Apps), human reviewers (including third parties) read, annotate, and process your Gemini Apps conversations. We take steps to protect your privacy as part of this process. This includes disconnecting your conversations with Gemini Apps from your Google Account before reviewers see or annotate them. Please don’t enter confidential information in your conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.
    • cavisne 16 hours ago
      Google is the best of these. You either pay per token and there is no training on your inputs, or it’s free/a small monthly fee and there is training.
    • greatgib 19 hours ago
      And even worse:

         Conversations that have been reviewed or annotated by human reviewers (and related data like your language, device type, location info, or feedback) are not deleted when you delete your Gemini Apps activity because they are kept separately and are not connected to your Google Account. Instead, they are retained for up to three years.
      
      Emphasis on "retained for up to three years" even if you delete it!!
      • kccqzy 18 hours ago
        Well they can't delete a user's Gemini conversations because they don't know which user a particular conversation comes from.

        This seems better, not worse, than keeping the user-conversation mapping so that the user may delete their conversations.

      • mastodon_acc 18 hours ago
        How does it compare to OpenAI and anthropic’s user data retention policy?
        • greatgib 18 hours ago
          If I'm not wrong, ChatGPT clearly states that they don't use user data by default anymore.

          Also, maybe some services are doing "machine learning" training with user data, but it is the first time I've seen a recent LLM service saying that it can feed your data to human reviewers at will.

    • KoolKat23 18 hours ago
      I don't think this is the same as the AI Studio and API terms. This looks like the consumer-facing Gemini T&Cs.
    • summerlight 16 hours ago
      You can use a paid tier to avoid such issues. Not sure what you're expecting for these "experimental" models, which are in development and need user feedback.
    • sauwan 17 hours ago
      I'm assuming this is true of all experimental models? That's not true with their models if you're on a paid tier though, correct?
    • suyash 16 hours ago
      More of a reason for new privacy guidelines, especially for big tech and AI.
    • mastodon_acc 18 hours ago
      I mean this is pretty standard for online llms. What is Gemini doing here that openai or Anthropic aren’t already doing?
  • vineyardmike 21 hours ago
    I wonder what about this one gets the +0.5 to the name. IIRC the 2.0 model isn’t particularly old yet. Is it purely marketing, does it represent new model structure, iteratively more training data over the base 2.0, new serving infrastructure, etc?

    I’ve always found the use of the *.5 naming kinda silly when it became a thing. When OpenAI released 3.5, they said they already had 4 underway at the time, they were just tweaking 3 be better for ChatGPT. It felt like a scrappy startup name, and now it’s spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme.

    I’m a much bigger fan of semver (not skipping to .5 though), date based (“Gemini Pro 2025”), or number + meaningful letter (eg 4o - “Omni”) for model names.

    • forbiddenvoid 21 hours ago
      I would consider this a case of "expectation management"-based versioning. This is a release designed to keep Gemini in the news cycle, but it isn't a significant enough improvement to justify calling it Gemini 3.0.
      • jstummbillig 20 hours ago
        I think it's reasonable. The development process is just not really comparable to other software engineering: It's fairly clear that currently nobody really has a good grasp on what a model will be while they are being trained. But they do have expectations. So you do the training, and then you assign the increment to align the two.
        • 8n4vidtmkvmk 9 hours ago
          I figured you don't update the major unless you significantly change the... algorithm, for lack of a better word. At least I assume something major changed between how they trained GPT-3 vs GPT-4, other than the amount of data. But maybe I'm wrong.
          • eru 8 hours ago
            The number is purely for marketing.

            If you could get much better performance without changing the algorithm (eg just by scaling), you'd still bump the number.

      • KoolKat23 18 hours ago
        Funnily enough, from early indications (user feedback) this new model would've been worthy of the 3.0 moniker, despite what the benchmarks say.
    • aoeusnth1 21 hours ago
      I think it's because of the big jump in coding benchmarks. 74% on aider is just much, much better than before and worthy of a .5 upgrade.
    • utopcell 12 hours ago
      It may indicate a Tick-Tock [1] process.

      [1] https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model

    • Workaccount2 21 hours ago
      At least for OpenAI, a .5 increment indicates a 10x increase in training compute. This so far seems to track for 3.5, 4, 4.5.
    • alphabetting 21 hours ago
      The elo jump and big benchmark gains could be justification
    • falcor84 21 hours ago
      Agreed, can't everyone just use semantic versioning, with 0.1 increments for regular updates?
      • laurentlb 21 hours ago
        Regarding semantic versioning: what would constitute a breaking change?

        I think it makes sense to increase the major / minor numbers based on the importance of the release, but this is not semver.

        • falcor84 18 hours ago
          As I see it, if it uses a similar training approach and is expected to be better in every regard, then it's a minor release. Whereas when they have a new approach and where there might be some tradeoffs (e.g. longer runtime), it should be a major change. Or if it is very significantly different, then it should be considered an entirely differently named model.
      • morkalork 21 hours ago
        Or drop the pretense of version numbers entirely, since they're meaningless here, and go back to classics like Gemini Experience, Gemini: Millennium Edition, or Gemini New Technology.
      • joaogui1 20 hours ago
        Would be confusing for non-tech people once you did x.9 -> x.10
      • guelo 20 hours ago
        What would a major version bump look like for an llm?
        • eru 8 hours ago
          Going from English to Chinese, I guess? Because that would not be a compatible version for most previous users.
  • jorl17 19 hours ago
    Just a couple of days ago I wrote on reddit about how long context models are mostly useless to me, because they start making too many mistakes very fast. They are vaguely helpful for "needle in a haystack" problems, not much more.

    I have a "test" which consists in sending it a collection of almost 1000 poems, which currently sit at around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes, it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). Previous models don't usually "see" the final poems — they get lost, hallucinate and are pretty much worthless. I have tried several workaround techniques with varying degrees of success (e.g. randomizing the poems).

    Having just tried this model (I have spent the last 3 hours probing it), I can say that, to me, this is a breakthrough moment. Truly a leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.

    The analysis of this poetic corpus has few mistakes and is very, very, very good. Certainly very good in terms of how quickly it produces an answer — it would take someone days or weeks of thorough analysis.

    Of course, this isn't about poetry — it's about passing in huge amounts of information, without RAG, and having a high degree of confidence in whatever reasoning tasks this model performs. It is the first time that I feel confident that I could offload the task of "reasoning" over large corpus of data to an LLM. The mistakes it makes are minute, it hasn't hallucinated, and the analysis is, frankly, better than what I would expect of most people.

    Breakthrough moment.
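
    For anyone who wants to run a similar experiment, here is a minimal sketch using the google-generativeai Python client; the model name, file layout and prompt wording are assumptions for illustration (AI Studio's web UI works just as well):

        import google.generativeai as genai

        genai.configure(api_key="YOUR_API_KEY")
        model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")

        # the whole corpus goes into a single prompt: no RAG, no chunking
        with open("poems.txt", encoding="utf-8") as f:
            corpus = f.read()  # on the order of 200k+ tokens

        prompt = (
            "Below is a chronologically ordered collection of poems.\n"
            "Identify the key writing periods and how they differ from one another.\n\n"
            + corpus
        )

        print(model.generate_content(prompt).text)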

    • Alifatisk 16 hours ago
      Two years ago, Claude was known for having the largest context window and being able to remember tokens throughout the whole conversation.

      Today, it seems like Google has beaten them: it supports a way larger context window and is way better at keeping track of what has been said and at recalling older tokens.

  • Alifatisk 12 minutes ago
    Can't wait for the benchmark at artificialanalysis.ai
  • zone411 16 hours ago
    Scores 54.1 on the Extended NYT Connections Benchmark, a large improvement over Gemini 2.0 Flash Thinking Experimental 01-21 (23.1).

    1 o1-pro (medium reasoning) 82.3

    2 o1 (medium reasoning) 70.8

    3 o3-mini-high 61.4

    4 Gemini 2.5 Pro Exp 03-25 54.1

    5 o3-mini (medium reasoning) 53.6

    6 DeepSeek R1 38.6

    7 GPT-4.5 Preview 34.2

    8 Claude 3.7 Sonnet Thinking 16K 33.6

    9 Qwen QwQ-32B 16K 31.4

    10 o1-mini 27.0

    https://github.com/lechmazur/nyt-connections/

  • nickandbro 20 hours ago
    Wow, was able to nail the pelican riding on a bicycle test:

    https://www.svgviewer.dev/s/FImn7kAo

  • ascorbic 7 hours ago
    It can answer my favourite riddle for LLMs:

    "Anna, Becca and Clare go to the play park. There is nobody else there. Anna is playing on the see-saw, Becca is playing on the swings. What is Clare doing?" (Sometimes I ask similar questions with the same structure and assumptions but different activities)

    About a year ago none of them could answer it. All the latest models can pass it if I tell them to think hard, but previously Gemini could rarely answer it without that extra hint. Gemini 2.5 caveats its answer a bit, but does get it correct. Interestingly GPT-4o initially suggests it will give a wrong answer without thinking, but recognises it's a riddle, so decides to think harder and gets it right.

  • falcor84 21 hours ago
    I'm most impressed by the improvement on Aider Polyglot; I wasn't expecting it to get saturated so quickly.

    I'll be looking to see whether Google would be able to use this model (or an adapted version) to tackle ARC-AGI 2.

  • ekojs 21 hours ago
    > This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!

    From https://x.com/OfficialLoganK/status/1904583353954882046

    The low rate-limit really hampered my usage of 2.0 Pro and the like. Interesting to see how this plays out.

    • chipgap98 21 hours ago
      Any word on what that pricing is? I can't seem to find it
      • rvnx 20 hours ago
        Traditionally at Google experimental models are 100% free to use on https://aistudio.google.com (this is also where you can see the pricing) with a quite generous rate limit.

        This time, the Googler says: “good news! you will be charged for experimental models, though for now it’s still free”

        • chipgap98 20 hours ago
          Right but the tweet I was responding to says: "This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!"

          I assumed that meant there was a paid version with a higher rate limit coming out today

          • altbdoor 14 hours ago
            The parent Twitter post mentions:

                Available as experimental and for free right now in Google AI Studio + API, with pricing coming very soon!
            
            And the pricing page [1] still does not show 2.5 yet.

            [1]: https://ai.google.dev/gemini-api/docs/pricing

        • KoolKat23 18 hours ago
          I expect this might be pricier. Hoping it's not unusably expensive.
      • xnx 20 hours ago
        Currently free, but only 50 requests/day.
  • og_kalu 21 hours ago
    From the 2.0 line, the Gemini models have been far better than GPT and Claude at engineering-type questions (fluids, etc.), especially with questions that have images requiring more than just grabbing text. This one is even better.
  • M4v3R 21 hours ago
    The Long Context benchmark numbers seem super impressive. 91% vs 49% for GPT 4.5 at 128k context length.
    • Workaccount2 21 hours ago
      Google has the upper hand here because they are not dependent on Nvidia for hardware. They make and use their own AI accelerators.
    • alexdzm 21 hours ago
      Keen to hear more about this benchmark. Is it representative of chat-to-document style usecases with big docs?
      • sebzim4500 19 hours ago
        Looks like it's this benchmark [1]. It's certainly less artificial than most long context benchmarks (that are basically just a big lookup table) but probably not as representative as Fiction.LiveBench [2], which asks specific questions about works of fanfiction (which are typically excluded from training sets because they are basically porn).

        [1] https://arxiv.org/pdf/2409.12640

        [2] https://fiction.live/stories/Fiction-liveBench-Feb-20-2025/o...

        • sebzim4500 15 hours ago
          Update: Gemini 2.5 also crushes fiction.livebench
        • swyx 16 hours ago
          "MRCR (multi-round coreference resolution)" for those looking for the link to Michaelangelo
  • Dowwie 20 hours ago
    This model is a fucking beast. I am so excited about the opportunities this presents.
  • nikcub 17 hours ago
    Impressive model - but I'm confused by the knowledge cutoff. AI Studio says it is January 2025 (which would be impressive) but querying it for anything early 2025 or mid/late 2024 and it self-reports that it's cutoff is in 2023 (which can't be right).

    This is most evident when querying about fast-moving dev tools like uv or bun. It seems to only know the original uv options like pip and tools, while with bun it is unfamiliar with bun outdated (from Aug 2024), bun workspaces (from around that time?) but does know how to install bun on windows (April 2024).

    You'll still need to provide this model with a lot of context to use it with any tooling or libraries with breaking changes or new features from the past ~year - which seems to contradict the AI Studio reported knowledge cutoff.

    Were I developing models - I'd prioritise squeezing in the most recent knowledge of popular tools and libraries since development is such a popular (and revenue generating) use case.

    • foota 8 hours ago
      Maybe less has been written about these newer things, even if they had technically been released?
  • batata_frita 18 hours ago
    Why do I have the feeling that nobody is as excited about Google's models as about other companies'?
    • owlninja 18 hours ago
      Yea, I get a little bummed, but I guess a lot of HNers have reasons to not like Google. I've had a Google One membership forever, so I opted for the higher subscription with Gemini access since the beginning (plus a free year with a new Pixel phone), and I think it is awesome.
    • Davidzheng 18 hours ago
      I feel like Google intentionally doesn't want people to be as excited. This is a very good model. Definitely the best available model today.
    • bklyn11201 18 hours ago
      Most of us care only about coding performance, and Sonnet 3.5 has been such a giant winner that we don't get too excited about the latest model from Google.
    • gck1 12 hours ago
      For me personally - rate limit of 50/day means that I can't use it as daily driver so I'll have to go back to Sonnet which will gladly accept my money for more. Then I just forget it exists.
      • sagarpatil 9 hours ago
        Yeah, if I don’t have higher rate limits, it’s useless. This just sounds like a gimmick launch where they want to gather feedback. It will be a couple of months before this will be GA.
    • crimsoneer 18 hours ago
      Because most of the LLM hype is still generated by people who don't use them in production, and those people don't use GCP
    • Mond_ 18 hours ago
      Google is worse at marketing and hyping people up.
      • dmix 15 hours ago
        The internal incentives must not align with new things making money.
    • ipsum2 18 hours ago
      They're not good models. They overfit to the LMArena leaderboard, but perform worse in real-life scenarios compared to their competitors.

      The exceptions are autoregressive image generation and audio models.

    • SamuelAdams 18 hours ago
      Because it’s more likely to be sunsetted.

      https://killedbygoogle.com/

  • andai 21 hours ago
    How does Gemini have such a big context window?

    I thought memory requirement grows exponentially with context size?

    • semiinfinitely 18 hours ago
      TPUs have a network topology better suited for long context than gpus: https://jax-ml.github.io/scaling-book/tpus/#tpu-networking

      > This nearest-neighbor connectivity is a key difference between TPUs and GPUs. GPUs connect up to 256 H100s in an all-to-all configuration (called a node), rather than using local connections. On the one hand, that means GPUs can send arbitrary data within a node in a single low-latency hop. On the other hand, TPUs are dramatically cheaper and simpler to wire together, and can scale to much larger topologies because the number of links per device is constant.

    • mmoskal 21 hours ago
      Memory grows linearly, compute grows quadratically (but with a small constant - until ~100k tokens, inference will still be dominated by non-quadratic factors).
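
      To put rough numbers on the linear part, here is a back-of-the-envelope sketch; the configuration below is entirely hypothetical (Gemini's real architecture isn't public), it's just there to make the arithmetic concrete.

          layers    = 80     # transformer layers (assumption)
          kv_heads  = 8      # grouped-query-attention KV heads (assumption)
          head_dim  = 128    # dimension per head (assumption)
          bytes_per = 2      # fp16/bf16 cache entries

          def kv_cache_bytes(context_len):
              # one K and one V vector per layer per token: linear in context length
              return 2 * layers * kv_heads * head_dim * bytes_per * context_len

          for ctx in (128_000, 1_000_000):
              print(f"{ctx:>9,} tokens -> ~{kv_cache_bytes(ctx) / 2**30:,.0f} GiB of KV cache")

          # prefill compute is the quadratic part: each of the ctx tokens attends to
          # every earlier token, so attention FLOPs grow roughly with ctx**2

      So even though the cache only grows linearly, a million-token cache for a big model is still hundreds of GiB under assumptions like these, which is why serving infrastructure matters as much as the model itself.
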
      • xiphias2 20 hours ago
        Also, reusing keys/values for different queries can compress the KV cache; it can be a 1000x or 10000x improvement in bandwidth if the model is trained for it.
        • mmoskal 17 hours ago
          Just to clarify: simple prefix KV cache doesn't require any special model training. It does require the inference framework to support it, but most do by now.

          You can see dramatic improvements in latency and throughput if there is a large shared prefix of the queries.

    • throitallaway 20 hours ago
      Funnyish story: the other night I asked my Pixel 9 to generate an image via Gemini, then I asked it to make a change. It didn't consider the previous context, so I asked it "Are you capable of keeping context?" No matter how clearly I enunciated "context", it always interpreted what I was saying as "contacts." After the 4th try, I said "context, spelled "c-o-n-t-e-x-t" and it replied with "Ah, you meant context! Yes..."

      This stuff has a long way to go.

      • Workaccount2 19 hours ago
        I think Google is digging a hole for themselves by making their lightweight models the most used ones. Regardless of what their heavyweight models can do, people will naturally associate them with the search model or the assistant model.
        • Andrex 12 hours ago
          That might be considered fine if Google's larger goal is to make money from enterprises/Workspace integration, using consumer launches as splashy PR.

          This way they get two rounds of headlines. "Gemini 2.5 released" and later on "Gemini 2.5 coming to all Google accounts."

          • seunosewa 3 hours ago
            Their willingness to integrate depends on their perception of the model quality.
      • andai 20 hours ago
        I noticed Gemini Flash 2.0 making a lot of phonetic typos like that, yeah. Like instead of Basal Ganglia it said Basil Ganglia.

        I've also had it switch languages in the middle of output... like one word in the middle of a sentence was randomly output in some strange hieroglyphs, but when I translated them, it was the right word and the sentence made sense.

      • dcchambers 19 hours ago
        I was using the conversational feature of Gemini on my phone the other night and was trying to get it to read a blog post to me. The AI proceeded to tell me (out loud, via voice mode/speech synthesis) that it was a text based model and couldn't read text out loud.

        For as amazing as these things are, AGI they are not.

        • vanviegen 17 hours ago
          In its defense: it probably is just a text model that hasn't been told that its output is being read to the user.
    • ototot 21 hours ago
      The Gemini 1.5 tech report does reference some papers about supporting large context windows.
    • swyx 16 hours ago
  • arjun_krishna1 19 hours ago
    I've been using Gemini Pro for my University of Waterloo capstone engineering project. Really good understanding of PDF documents and good reasoning, as well as structured output. Recommend trying it out at aistudio dot google dot com.
  • d3nj4l 19 hours ago
    A model that is better on Aider than Sonnet 3.7? For free, right now? I think I'll give it a spin this weekend on a couple of projects, seems too good to be true.
    • istjohn 16 hours ago
      With a rate limit of 50 requests per day
      • sagarpatil 9 hours ago
        Could use multiple Google accounts to increase the rate limit.
        • Alifatisk 11 minutes ago
          This is why we can't have nice things
  • Davidzheng 18 hours ago
    On initial thoughts, I think this might be the first AI model to be reliably helpful as a research assistant in pure mathematics (o3-mini-high can be helpful but is more prone to hallucinations)
    • kadushka 8 hours ago
      Have you tried o1-pro?
  • marcus_holmes 48 minutes ago
    I tried the beta version of this model to write a business plan (long story).

    I was impressed at first. Then it got really hung up on the financial model, and I had to forcibly move it on. After that it wrote a whole section in Indonesian, which I don't speak, and then it crashed. I'd not saved for a while (ever since the financial model thing), and ended up with an outline and a couple of usable sections.

    I mean, yes, this is better than nothing. It's impressive that we made a pile of sand do this. And I'm aware that my prompt engineering could improve a lot. But also, this isn't a usable tool yet.

    I'm curious to try again, but wary of spending too much time "playing" here.

  • f1shy 6 hours ago
    One test I always do is ask for an absolutely minimal language interpreter with TCO.

    This is part of the code output (after several interactions of it not returning actual code):

            // Tail Call Optimization (very basic)
            if(func->type == VAL_FUNCTION){
                return apply(func, args, env); //no stack growth.
            }
            else{
                return apply(func, args, env);
            }
    
    I'm not very impressed.

    I pointed out that part of the code, and it answered:

    You've correctly pointed out that the TCO implementation in the provided C code snippet is essentially a no-op. The if and else blocks do the same thing: they both call apply(func, args, env). This means there's no actual tail call optimization happening; it's just a regular function call.

    But then it follows up with even worse code. It does not even compile!
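
    For contrast, here is a sketch (in Python rather than C, and for a hypothetical Lisp-like toy language) of what an actual TCO implementation looks like: eval is a loop, and anything in tail position rebinds the expression and environment instead of recursing, so tail calls consume no stack.

        def eval_(expr, env):
            while True:                          # the loop is the optimization
                if isinstance(expr, str):        # variable reference
                    return env[expr]
                if not isinstance(expr, list):   # literal value
                    return expr
                op, *args = expr
                if op == "if":                   # both branches are tail positions
                    cond, then_branch, else_branch = args
                    expr = then_branch if eval_(cond, env) else else_branch
                    continue
                func = eval_(op, env)
                vals = [eval_(a, env) for a in args]
                if callable(func):               # builtin
                    return func(*vals)
                params, body, closure_env = func # user function: (params, body, env)
                env = dict(closure_env, **dict(zip(params, vals)))
                expr = body                      # tail call: rebind and loop, no new frame

        # a tail-recursive countdown deep enough to overflow a naively recursive evaluator
        env = {"-": lambda a, b: a - b, "=": lambda a, b: a == b}
        env["countdown"] = (["n"], ["if", ["=", "n", 0], 0, ["countdown", ["-", "n", 1]]], env)
        print(eval_(["countdown", 100_000], env))  # prints 0 without blowing the stack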

  • summerlight 19 hours ago
    This looks like the first model where Google seriously comes back into the frontier competition. 2.0 Flash was nice for the price, but it's more focused on efficiency, not performance.
  • comex 20 hours ago
    I was recently trying to replicate ClaudePlaysPokemon (which uses Claude 3.7) using Gemini 2.0 Flash Thinking, but it was seemingly getting confused and hallucinating significantly more than Claude, making it unviable (although some of that might be caused by my different setup). I wonder if this new model will do better. But I can't easily test it: for now, even paid users are apparently limited to 50 requests per day [1], which is not really enough when every step in the game is a request. Maybe I'll try it anyway, but really I need to wait for them to "introduce pricing in the coming weeks".

    Edit: I did try it anyway and so far the new model is having similar hallucinations. I really need to test my code with Claude 3.7 as a control, to see if it approach the real ClaudePlaysPokemon's semi-competence.

    Edit 2: Here's the log if anyone is curious. For some reason it's letting me make more requests than the stated rate limit. Note how at 11:27:11 it hallucinates on-screen text, and earlier it thinks some random offscreen tile is the stairs. Yes, I'm sure this is the right model: gemini-2.5-pro-exp-03-25.

    https://a.qoid.us/20250325/

    [1] https://ai.google.dev/gemini-api/docs/rate-limits#tier-1

    • comex 16 hours ago
      Update: I tried a different version of the prompt and it's doing really well! Well, so far it's gotten out of its house and into Professor Oak's lab, which is not so impressive compared to ClaudePlaysPokemon, but it's a lot more than Gemini 2.0 was able to do with the same prompt.
  • serjester 21 hours ago
    I wish they’d mention pricing - it’s hard to seriously benchmark models when you have no idea what putting it in production would actually cost.
  • asah 18 hours ago
    It nailed my two hard reasoning+linguistic+math questions in one shot, both the kinds of things that LLMs struggle with but humans do well.

    (DM me for the questions)

  • Medicineguy 9 hours ago
    While I'm sure the new Gemini model has made improvements, I feel like the user experience outside of the model itself is stagnating. I think OpenAI's interfaces, both web app and mobile app, are quite a bit more polished currently. For example, Gemini's speech recognition struggles with longer pauses and often enough cuts me off mid-sentence. Also, OpenAI's Whisper model understands more context (for instance, saying “[...] Plex, Emby and Jellyfin [...]” is usually understood by Whisper, but less often by Gemini). The Gemini web app lacks keyboard shortcuts for basic actions like opening a new chat or toggling the sidebar (good for privacy-friendly pair programming). Last point off the top of my head would be the ability to edit messages beyond just the last one. That's possible in ChatGPT, but not in Gemini. Googlers are spending so much money on model training; I would appreciate them spending some on making it fun to use :)
  • DaveMcMartin 3 hours ago
    I love to see this competition between companies trying to build the best LLM, and also the fact that they're trying to make them useful as tools, focusing on math, science, coding, and so on.
  • strstr 18 hours ago
    It's a lot better at my standard benchmark: "Magic: The Gathering" rules puzzles. It gets the answers right (both the outcome and the rationale).
    • strstr 18 hours ago
      Ooof, it failed my "Wheel of Potential" bug finding question, and got aggressive about asserting it was correct.
  • jasonpeacock 21 hours ago
    Isn't every new AI model the "most <adjective>"?

    Nobody is going to say "Announcing Foobar 7.1 - not our best!"

    • minimaxir 21 hours ago
      GPT-4.5's announcement was the equivalent of that.

      "It beats all the benchmarks...but you really really don't want to use it."

      • forbiddenvoid 21 hours ago
        They even priced it so people would avoid using it. GPT-4.5's entire function was to be the anchor of keeping OpenAI in the news, to keep up the perception of releasing quickly.
        • glenstein 18 hours ago
          My assumption was that the pricing was because it really was that expensive for whatever reason. I'm keeping fingers crossed that they're going to do some kind of 4.5 mini at some point that will be more affordable.
    • smilekzs 21 hours ago
      You're not wrong, but that just means the <adjective> is where the bulk of information resides. The trade-off matters. Maybe it's a model with good enough quality but really cheap to serve. Maybe it's a model that only plays poker really well but sucks at everything else because it bluffs too much. Etc. etc.
    • Maxatar 21 hours ago
      Sure but that adjective matters. Could be cheapest, "intelligent", fastest, etc... it's rarely all three of them.
    • thethimble 21 hours ago
      Except for GPT 4.5 and Claude 3.7 :/
    • SirMaster 20 hours ago
      Same with new phones. The new phone always has the fastest CPU and GPU, and the best camera ever!
    • wigster 21 hours ago
      gobble 2.0 - a bit of a turkey
    • zamadatix 20 hours ago
      Stable Diffusion definitely had a few of those.
  • ofermend 11 hours ago
    This model is quite impressive. Not just useful for math/research with great reasoning, it also maintained a very low hallucination rate of 1.1% on Vectara Hallucination Leaderboard: https://github.com/vectara/hallucination-leaderboard
  • lvl155 19 hours ago
    With the recent pace of model updates, I wonder which factor is more important: hardware assets, software/talent, or data access. Google is clearly in the lead in terms of data access, in my view. If I were a top talent in AI, I'd go where I can work with the best data, no?
    • danpalmer 17 hours ago
      I think an argument could be made for hardware too. Perhaps in absolute terms Nvidia is ahead, but in terms of knowing how to get the most out of the hardware, Google making its own chips, building on their networking, etc, is a pretty big advantage.

      (Disclaimer, Googler, but I don’t work on any of this, I only have an external layperson’s understanding of it)

    • jillesvangurp 4 hours ago
      UX is actually increasingly the bottleneck. Most of the top models are very good if you micromanage their context and prompts. But people aren't very good at that stuff.

      Some of the desktop chat clients are turning into great productivity tools. I tried the Claude one last week and quickly went back to Chat GPT. Claude might be a better model for coding. But it's less effort to make Chat GPT do what I want at this point and it's kind of good enough for a lot of stuff. Every release it's getting better. It connects to my IDE automatically, it can look at the files I have open. It can patch those files (I actually disabled that because it's too slow for my taste), etc.

      But most importantly, I can trigger all that with option+shift+1. I do this gazillions times per day. Mostly simple stuff with really short prompts, "check this" (file, selection, current line, etc.), fix that, what do you think about x, "address the FIXMEs/TODOs", "document this", etc.

      I can ask other models the same questions and they'd get the job done. But then I have to do more work to give them the same context. Claude has a Github connect option, which is great. But unfortunately it's just a glorified file picker, which really sucks. I have files open in my editor, just look at those. I don't want to have to manually open files do that for me or specify what files to look at every time I go near the tool.

      Chat GPT actually asked me yesterday whether it could add a different file than the one it was looking at. I said "yes" and it did. That's a great UX. Don't make me do work.

      That's a good UX.

      I use Gemini mainly because it's integrated into Google's tools. So it's kind of there. And ChatGPT, for whatever reason, cannot look at the browser window. But from a UX point of view, that kind of deep integration is what you want. You have this implicit shared context (the thing you are looking at) that you don't have to spell out anymore.

      The UX of populating the context is the deciding factor in how useful models are at this point, not how well it solves pet benchmark questions or renders pelicans on bicycles.

      I have good hopes for agentic coding tools progressing rapidly this year. The ones I've tried recently need a lot of work though. I keep going back to Chat GPT because it's just the quickest & easiest to use at this point.

      • lvl155 25 minutes ago
        I agree with you about ChatGPT. It's actually a compelling product, especially their Pro tier at $200, which is essentially unlimited.
    • CuriouslyC 18 hours ago
      The problem Goog has is its insane bureaucracy and lack of vision from Sundar, which isn't very attractive from an employee position. If you're working close to Demis I imagine the situation is better though.
      • rfoo 18 hours ago
        Now that Noam is back I'm a little bit more optimistic.
  • andai 4 hours ago
    I asked it for suggestions for a project, and it was the only model that correctly pointed out serious flaws in the existing proposal. So far so good!
  • jszymborski 12 hours ago
    Gemini refuses to answer any questions on proportional swing models or anything related to psephology, on the grounds that it has to do with elections. Neither Claude nor ChatGPT nor Mistral/Le Chat is that neutered.
    • eru 8 hours ago
      I assume Gemini would be less neutered in this regard, if it wasn't developed by Google.
  • cj 21 hours ago
    Slight tangent: Interesting that they use o3-mini as the comparison rather than o1.

    I've been using o1 almost exclusively for the past couple months and have been impressed to the point where I don't feel the need to "upgrade" for a better model.

    Are there benchmarks showing o3-mini performing better than o1?

    • kmod 20 hours ago
      The benchmark numbers don't really mean anything -- Google says that Gemini 2.5 Pro has an AIME score of 86.7 which beats o3-mini's score of 86.5, but OpenAI's announcement post [1] said that o3-mini-high has a score of 87.3 which Gemini 2.5 would lose to. The chart says "All numbers are sourced from providers' self-reported numbers" but the only mention of o3-mini having a score of 86.5 I could find was from this other source [2]

      [1] https://openai.com/index/openai-o3-mini/ [2] https://www.vals.ai/benchmarks/aime-2025-03-24

      You just have to use the models yourself and see. In my experience o3-mini is much worse than o1.

    • FloorEgg 21 hours ago
      I noticed this too. I have used both o1 and o3-mini extensively, and I have run many tests on my own problems; o1 solves one of my hardest prompts quite reliably, but o3-mini is very inconsistent. So from my anecdotal experience, o1 is a superior model in terms of capability.

      The fact they would exclude it from their benchmarks seems biased/desperate and makes me trust them less. They probably thought it was clever to leave o1 out, something like "o3 is the newest model, let's just compare against that", but I think for anyone paying attention that decision will backfire.

    • boldlybold 21 hours ago
      I find o3 at least faster to get to the response I care about, anecdotally.
    • jnd0 21 hours ago
      Probably because It is more similar to o3 in terms of size/parameters as well as price (although I would expect this to be at least half price)
    • logicchains 20 hours ago
      It's a reasonable comparison given it'll likely be priced similarly to o3-mini. I find o1 to be strictly better than o3-mini, but still use o3-mini for the majority of my agentic workflow because o1 is so much more expensive.
    • PunchTornado 20 hours ago
      Why would you compare against all the models from a competitor? You take their latest one that you can test. OpenAI or Anthropic don't compare against the whole Gemini family.
  • barrenko 21 hours ago
    The incumbent has awoken.
  • WasimBhai 19 hours ago
    I do not intend to take anything away from the technical achievement of the team. However, as Satya opined some weeks back, these benchmarks do not mean a lot if we do not see a comparable increase in productivity.

    But then there are two questions. First, are the white-collar workers responsible for the increase in productivity specifically consultants and engineers? Or is it the white-collar workers at the very right tail, e.g., scientists?

    I think consultants and engineers are using these technologies a lot. I think biologists at least are using these models a lot.

    But then where are the productivity increases?

    • weatherlite 8 hours ago
      It's a complex proposition. I think Satya was talking about actual GDP growth, right? In theory, let's say all knowledge work is now 50% faster due to A.I. Well, then I would assume this should affect civil society as well - planning a bridge, a railway, etc. should happen faster and more efficiently (the actual building of things won't, but a lot of time is spent on planning and red tape). Healthcare in general should become way more efficient, with people getting better treatment; this should have a positive economic effect. It does seem to me like it should be able to speed things up in the real world, but of course a lot will have to do with how well the models can reason / how often they make catastrophic mistakes, plus the will of governments and people to start using them seriously.

      But it's more complex than that: if many people start losing their jobs, we all take a hit on GDP because they can't consume as much anymore, so it could take perhaps a long time until GDP actually sees meaningful gains.

      And one last thought: Satya likely hasn't spent much time thinking about GDP; it's just not his field. He's a smart guy for sure, but this isn't what he does.

      • eru 7 hours ago
        The problem is slightly different.

        Unemployment hasn't really picked up, and is unlikely to do so, unless the central bank is incompetent. (They have been from time to time.)

        However, some advances don't show up in GDP. E.g. Wikipedia is a tremendous achievement, but nobody pays for it, so it doesn't show up in GDP statistics.

        • weatherlite 7 hours ago
          > Unemployment hasn't really picked up, and is unlikely to do so

          That's an important assessment. I don't know if you're right. If the models continue to get more capable, I'm expecting unemployment to rise; I don't see how it won't (sure, we're promised AI will create tons of new jobs no one has imagined yet, but I haven't seen a reliable clue for such jobs).

          • borgdefenser 3 hours ago
            I can feel this already with my own use of language models.

            All the questions I had before language models, I have answered with language models.

            That doesn't mean I have no more questions, though. Answering those questions opened up 10x more questions that I have now.

            In general, everyone knows that answering scientific questions leads to new and more questions. It is the exact same process in the economy. There is a collectivist sentiment in society and the economy, though, that wants to pretend this isn't true: that the economic questions can be "solved", the spoils divided up, and we live happily ever after in some kind of equilibrium.

            As far as new jobs go, they are here now, but they surely sound as ridiculous to think about as being a professional YouTuber did in 2005. Or I think of the person making a GeoCities website in 1997 vs a front-end developer. There is no date on which the front-end developer emerges from the HTML code monkey. It is a slow and organic process that is hard to game.

            • weatherlite 3 hours ago
              > As far as new jobs go, they are here now, but they surely sound as ridiculous to think about as being a professional YouTuber did in 2005

              How many people can make an actual living out of YouTube? Surely they exist, but to reliably live off it for decades (not just 1-2 years of temporary fame, which is also very hard to come by) I'd say fewer than one in ten thousand people will make it. I can't call "YouTuber" a career path with that kind of success rate any more than I can call being an actor in Hollywood a career path.

          • eru 5 hours ago
            No, it won't (necessarily) be AI that's creating the new jobs. In general, when a new technology comes along and automates away some jobs, you can't expect the same technology to provide the new jobs.

            To give an example from the recent past: 'hipster' baristas that make you a five dollar coffee are a fairly new job. At least at scale.

            But I doubt you'll be able to find any technology that automated some other job but created barista jobs.

            It's just that the market will find stuff for people to do for money, unless prevented from doing so by incompetent central bank policy or (too) onerous labour market regulation.

            (The labour market can take quite a lot of regulation, and still be able to get people jobs. Have a look at Germany today for an example.)

            • weatherlite 5 hours ago
              > It's just that the market will find stuff for people to do for money

              Will it? Let's take my example: I'm a 41-year-old male with around 15 years' experience in software development. Let's say 4 years from now I and a million others are losing our development jobs to AI. What does the market have for my skills? I can try going into healthcare or teaching (though that's quite an extensive retraining + salary reduction), I can go into the trades (same), or get some other work that's hard to automate, like caring for old people (very low salary). All of these options involve massive salary reduction, and that's in the positive scenario where I'm actually able to retrain and survive such a shift mentally. It's quite likely many software devs won't be able to become plumbers and nurses and will become chronically unemployed.

              • eru 4 hours ago
                Well, we have many examples from the past where technology (and to a lesser extent trade) has led to some sectors of the economy using fewer people than before.

                The situation you describe isn't all that special.

                Yes, losing your job (or your career) is not fun, and can be painful. Massive salary reduction can happen.

                No, that hasn't led to widespread unemployment in the past. At least not widespread enough to be visible in aggregate statistics, especially over the noise of the 'normal' business cycle. However, individuals can obviously have pretty long spells of unemployment, but that can also happen without a shift in technology.

                • weatherlite 3 hours ago
                  > Yes, losing your job (or your career) is not fun, and can be painful. Massive salary reduction can happen.

                  I'm just trying to get the point across that unemployment might rise, so GDP may fall; in fact, I think that should be the baseline scenario, rather than assuming some new jobs we can't imagine yet will be created. It's so hard to imagine these new jobs because if the machines outperform us cognitively, it follows that we'll be able to get intelligent robots into the real world quite soon after. Then seriously, what the heck is left? Fewer jobs, not more.

                  There is one "cure" I can think of for this, and that's something closer to socialism: the market will have to step aside and the government will create massive numbers of new jobs. For example, classes could be 5 pupils per teacher instead of 30, and nurses could attend to 3 patient beds instead of 8. But letting the market sort this out? I don't think so.

    • KoolKat23 18 hours ago
      As it stands currently I'd say this is difficult to measure.

      They're not baked into workflows where the measurable output is easily attributed to use of the model. Productivity gains in their current form are transformative in the sense that the use case and the gain differ for each individual (who even provides different prompts). So some are keeping the gains for themselves; others are using it to improve quality rather than quantity.

      It'll come in time; it's important to remember GPT-4 was released two years ago this month. The newer models are more reliable and could probably be introduced into workflows more frequently. Today I spoke to a company that is looking to use it to reduce costs in the next year.

    • danpalmer 17 hours ago
      That’s true, but productivity has many factors and takes a long time to get confidence on. Any productivity value that could be stated clearly would have similar downsides to a benchmark, and take far longer.

      Benchmarks are useful as leading indicators. Early warning signs. If there’s no relation to the eventual productivity then hopefully that benchmark will disappear as it’s not useful.

      In a fast moving space like this it’s reasonable to make use of leading indicators.

    • ZeroTalent 19 hours ago
      Also, why not compare to o3 in the benchmarks?
      • KoolKat23 18 hours ago
        The model's not really available.
  • t_minus_40 8 hours ago
    I asked it for the direction of friction on a ball rolling either up or down on an inclined plane; it gave the wrong answer and was adamant about it. Surprisingly, similar to o1.

    Gave it a problem that sounds like the Monty Hall problem but is just a simple probability question, and it nailed it.

    Asked it to tell a joke: the most horrible joke ever.

    Much better than o1, but still nowhere near AGI. It has been optimized for logic and reasoning at best.

  • joshdavham 17 hours ago
    When these companies release a model “2.5”, are they using some form of semver? Where are these numbers coming from?
    • eru 7 hours ago
      Marketing.
  • jharohit 21 hours ago
    Why not enable Canvas for this model on gemini.google.com? Arguably the weakest link of Canvas is the terrible code that Gemini 2.0 Flash writes for Canvas to run.
    • aoeusnth1 21 hours ago
      I'm guessing it should be enabled eventually. @logankilpatrick thoughts?
  • rodolphoarruda 18 hours ago
    I've been trying to use Gemini 2.0 Flash, but I don't think it's possible. The model still thinks it's running the 1.5 Pro model.

    Reference: https://rodolphoarruda.pro.br/wp-content/uploads/image-14.pn...

  • simonw 14 hours ago
    Here's a Gemini 2.5 provided summary of this Hacker News thread as of the moment when it had 269 comments: https://gist.github.com/simonw/3efa62d917370c5038b7acc24b7c7...

    I ran this command to create it:

      curl -s "https://hn.algolia.com/api/v1/items/43473489" | \
        jq -r 'recurse(.children[]) | .author + ": " + .text' | \
        llm -m "gemini-2.5-pro-exp-03-25" -s \
        'Summarize the themes of the opinions expressed here.
        For each theme, output a markdown header.
        Include direct "quotations" (with author attribution) where appropriate.
        You MUST quote directly from users when crediting them, with double quotes.
        Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
    
    Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes
  • daquisu 20 hours ago
    Weird, they released Gemini 2.5 but I still can't use 2.0 Pro with a reasonable rate limit (5 RPM currently).
  • xnx 20 hours ago
    It will be a huge achievement if models can get to the point where so much selection effort isn't required: gemini.google.com currently lists 2.0 Flash, 2.0 Flash Thinking (experimental), Deep Research, Personalization (experimental), and 2.5 Pro (experimental) for me.
    • lxgr 19 hours ago
      There's probably a sweet spot here. On the flip side, ChatGPT currently doesn't indicate whether a given image generation request was serviced by multimodal GPT-4o [1] or DALL-E.

      Personally, I do like the "use web search" and "extended thinking" buttons, but ultimately, the models should probably be able to figure out whether doing so would be useful themselves too.

      [1] https://news.ycombinator.com/item?id=43474112

  • andai 21 hours ago
    Can anyone share what they're doing with reasoning models? They seem to only make a difference on novel programming problems, like Advent of Code. So this model will help solve slightly harder Advent of Code puzzles.

    By extension it should also be slightly more helpful for research, R&D?

    • mgens 20 hours ago
      Have been using them for non-interactive coding where latency is not an issue. Specifically, turning a set of many free-text requirements into SQL statements, so that later, when an item's data is entered into the system, we can efficiently find which requirements it meets. The reasoning models' output quality is much better than that of non-reasoning models like 3.5 Sonnet; it's not a subtle difference.
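
      As a rough, minimal sketch of that kind of non-interactive batch call (the model alias, table schema, and file names here are illustrative assumptions, not our actual setup), something like the llm CLI used elsewhere in this thread works:

        # Each line of requirements.txt is one free-text requirement; ask a
        # reasoning model to emit one SQL statement per line. The items()
        # schema below is made up for the example.
        llm -m o3-mini -s \
          'Each input line is a free-text requirement. For each one, output a
           single SQL SELECT against items(id, weight_kg, origin_country, price_usd)
           that returns the items meeting that requirement. Output only SQL,
           one statement per line.' \
          < requirements.txt > requirements.sql
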
    • liuliu 20 hours ago
      I found reasoning models are much more faithful at text-related tasks too (e.g. 1. translating long key-value pairs (i.e. Localizable.strings); 2. fixing and verifying long transcripts; 3. looking at CSV / tabular data and fixing it), probably due to the reflection mechanism built into these reasoning models. Prompts such as "check your output to make sure it covers everything in the input" let the model double-check its work, avoiding more manual checks on my end.
    • bravura 21 hours ago
      We're using it to RCA infrastructure incidents.
      • fn-mote 21 hours ago
        Seriously? That doesn't require a human?! Are we talking about some kind of "generic" incident? (Type 3: forgot to manually update the xxxx file.) Or what's going on?
  • billforsternz 13 hours ago
    I know next to nothing about AI, but I just experienced an extraordinary hallucination in a Google AI search (presumably an older Gemini model, right?), as I elaborated in detail in another HN thread. It might be a good test question. https://news.ycombinator.com/item?id=43477710
  • afro88 20 hours ago
    Is this the first model announcement where they show Aider's Polyglot benchmark in the performance comparison table? That's huge for Aider and anotherpaulg!
  • slama 15 hours ago
    Interestingly, the model hallucinated the ability to use a search tool when I was playing around with it
  • mclau156 17 hours ago
    Generated 1000 lines of turn-based combat with shop, skills, stats, elements, enemy types, etc. with this one.
  • dcchambers 19 hours ago
    > Developers and enterprises can start experimenting with Gemini 2.5 Pro in Google AI Studio now, and Gemini Advanced users can select it in the model dropdown on desktop and mobile. It will be available on Vertex AI in the coming weeks.

    I'm a Gemini Advanced subscriber, still don't have this in the drop-down model selection in the phone app, though I do see it on the desktop webapp.

    • ehsankia 19 hours ago
      I see it in both, probably just some gradual rollout delays.
  • testycool 13 hours ago
    It feels like Gemini 2.0 Pro + Reasoning.

    I also see Gemini 2.0 Pro has been replaced completely in AI Studio.

  • vivzkestrel 21 hours ago
    "Hi, here is our new AI model; it performs task A x% better than competitor 1 and task B y% better than competitor 2" seems to be the new hot AI announcement template in town.
  • eenchev 20 hours ago
    "My info, the stuff I was trained on, cuts off around early 2023." - Gemini 2.5 to me. Appears that they did a not-so-recent knowledge cutoff in order to use the best possible base model.
    • staticman2 19 hours ago
      It's unlikely the model knows its actual cutoff date. Try asking about 2024 news; for example, in my test it knows the January 2024 Oscar nominees.

      On AI Studio the model told me today is June 13, 2024.

  • joelthelion 21 hours ago
    Is this model going to be restricted to paying users?
  • skinkestek 8 hours ago
    Can it now generate images of soldiers in typical uniforms from 1940s Germany without having to throw in a few token ethnicities?

    Or generate images of the founding fathers of the US that at least to some degree resemble the actual ones?

  • pachico 19 hours ago
    It really surprises me that Google and Amazon, considering their infrastructure and the urge to excel at this, aren't leading the industry.
  • andrewinardeer 20 hours ago
    Google is overly cautious with their guardrails.

    Granted, Gemini answers it now, however, this one left me shaking my head.

    https://cdn.horizon.pics/PzkqfxGLqU.jpg

    • slongfield 20 hours ago
      For better or worse, Google gets more bad press when their models get things wrong compared to smaller AI labs.
    • rfoo 20 hours ago
      Ha, I still remember that super hilarious "You are under 18, so you should not write C++, as it is unsafe..." log from ... a year ago?
    • summerlight 19 hours ago
      Looks like they're gradually removing guardrails; it returns Nixon for me.
  • fourseventy 20 hours ago
    Does it think the founding fathers were a diverse group of mixed races and genders like the last model did?
  • noisy_boy 20 hours ago
    Are Gemini and Bard the same? I asked it a question and it said "... areas where I, as Bard, have..."
    • zxilly 20 hours ago
      There is no point in asking such questions; the model doesn't know what it is on its own, and you could get many different answers if you repeat it a few more times.
  • cp9 19 hours ago
    Does it still suggest glue on pizza?
  • TheMagicHorsey 18 hours ago
    I tested out Gemini 2.5 and it failed miserably at calling into tools that we had defined for it. Also, it got into an infinite loop a number of times where it would just spit out the exact same line of text continuously until we hard killed the process. I really don't know how others are getting these amazing results. We had no problems using Claude or OpenAI models in the same scenario. Even Deepseek R1 works just fine.
  • honeybadger1 20 hours ago
    Claude is still the king right now for me. Grok is 2nd in line, but sometimes it's better.
  • resource_waste 20 hours ago
    I'll try it tonight, but I'm not excited; it's just work.

    ChatGPT4.5, I was excited.

    Deepseek, I was excited. (then later disappointed)

    I know Gemini probably won't answer any medical question, even if you are a doctor. ChatGPT will.

    I know I've been disappointed at the quality of Google's AI products. They are backup at best.

    • FergusArgyll 18 hours ago
      It interpreted blood work for me

      (Everything's ok, I'm just testing it ;)

  • curtisszmania 1 hour ago
    [dead]
  • hackermeows 21 hours ago
    [flagged]
  • noisy_boy 20 hours ago
    Are Gemini and Bard the same? I asked it a question and it said "... areas where I, as Bard, have...."
    • guyzero 20 hours ago
      Normal Google rollout process: Bard is deprecated, Gemini is not ready yet.
  • ototot 20 hours ago
    And OpenAI is announcing their ImageGen in 4o

    https://news.ycombinator.com/item?id=43474112

  • throwaway13337 21 hours ago
    Google has this habit of 'releasing' without releasing AI models. This looks to be the same?

    I don't see it on the API price list:

    https://ai.google.dev/gemini-api/docs/pricing

    I can imagine that it's not so interesting to most of us until we can try it with cursor.

    I look forward to doing so when it's out. That Aider bench mixed with the speed and a long context window that their other models are known for could be a great mix. But we'll have to wait and see.

    More generally, it would be nice for these kinds of releases to also report speed and context window as separate benchmarks, or somehow include them in the score. A model that is 90% as good as the best but 10x faster is quite a bit more useful.

    These might be hard to mix to an overall score but they're critical for understanding usefulness.

    • minimaxir 21 hours ago
      It's available now as an option in Google AI Studio and Google Gemini.
      • kmod 20 hours ago
        It's "experimental", which means that it is not fully released. In particular, the "experimental" tag means that it is subject to a different privacy policy and that they reserve the right to train on your prompts.

        2.0 Pro is also still "experimental" so I agree with GP that it's pretty odd that they are "releasing" the next version despite never having gotten to fully releasing the previous version.

      • throwaway13337 21 hours ago
        Thanks. I think my post lacked clarity about what I was talking about. I meant that most people care about API access to use with their favorite editor. It's a big limiter with Grok, for example.

        But I did mingle that with my knowledge of Google's history of releasing without releasing these models, which, as you point out, isn't true with this release.

      • ototot 21 hours ago
        and the price is 0.0 usd, lol