One of the biggest problems with hands-off LLM writing (for long-horizon stuff like novels) is that you can't really give them any details of your story, because they get absolutely neurotic about it.
Imagine, for instance, that you give the LLM the profile of the love interest for your epic fantasy: it will almost always have the main character meet them within 3 pages (usually page 1), which is of course absolutely nonsensical pacing. No amount of telling it otherwise changes anything.
This is the first model where, after 19 pages generated so far, the output resembles anything like normal pacing, even with a TON of details. I've never felt the need to generate anywhere near this much. Extremely impressed.
Oh, that's a good one. And it's true. There seems to be a massive inability among most people to admit the building impact of modern AI development on society.
Oh, we do admit impact and even have a name for it: AI slop.
(Speaking of LLMs here, since AI is a broad term and has many extremely useful applications in various areas.)
They certainly seem to have moved from "it is literally skynet" and "FSD is just around the corner" in 2016 to "look how well it paces my first lady Trump/Musk slashfic" in 2025. Truly world changing.
Both books that have outsold the Harry Potter series claim divine authorship, not purely human. I am prepared to bet quite a lot that the next isn't human-written, either.
You may be right; on the other hand, it always feels like the next goalpost is the final one.
I'm pretty sure that if something like this happens, some dude will show up from nowhere and claim that it's just parroting what other, real people have written, that it just blended it together and randomly spat it out. "Real AI would come up with original ideas, like a cure for cancer," he'll say.
After some form of that comes to pass, another dude will show up and say that this "alphafold while-loop" is not real AI, because he just went out for lunch and there was a guy flipping burgers, and that "AI" can't do it, so it's shit.
https://areweagiyet.com should plot those future points as well, with all those funky goals like "if Einstein had access to the Internet, Wolfram etc. he could have come up with it anyway, so not better than humans per se", or "it had to be prompted and guided by a human to find this answer, so it didn't really do it by itself", etc.
What if we didn’t measure success by sales, but impact to the industry (or society), or value to peoples’ lives?
Zooming out to AI broadly: what if we didn’t measure intelligence by (game-able, arguably meaningless) benchmarks, but real world use cases, adaptability, etc?
I recently watched some of Claude Plays Pokemon and believe it's a better measure than all those AI benchmarks. The game could be beaten by an 8-year-old who obviously doesn't have all the knowledge that even small local LLMs possess, but who has actual intelligence and could figure out the game within < 100h. So far Claude can't even get past the first half, and I doubt any other AI could get much further.
Well, strictly speaking, outselling Harry Potter would fail the Turing test: the Turing test is about passing for human (in an adversarial setting), not surpassing humans.
Of course, this is just some pedantry.
I for one love that AI is progressing so quickly, that we _can_ move the goalposts like this.
The goalposts will be moved again. Tons of people clamoring that the book is stupid and vapid and only idiots bought it. When AI starts taking over jobs (which it already has), you'll get tons of idiots claiming the same thing.
People are trying to use gen AI in more and more use cases. It used to fall flat on its face at trivial stuff; now it has gotten past the trivial stuff but is still scratching at the boundaries of being useful. And that is not an attempt to make gen AI look bad; it is really amazing what it can do. But it is far from delivering on the hype, and that is why people are providing critical evaluations.
Let's not forget the OpenAI benchmarks saying 4.0 could do better at college exams and such than most students. Yet real-world performance on real tasks was laughable.
We are, if this comment is the standard for all criticism on this site. Your comment seems harsh. Perhaps novel writing is too low-brow of a standard for LLM critique?
I didn't quite read parent's comment like that. I think it's more about how we keep moving the goalposts or, less cynically, how the models keep getting better and better.
I am amazed at the progress that we are _still_ making on an almost monthly basis. It is unbelievable. Mind-boggling, to be honest.
I am certain that the issue of pacing will be solved soon enough. I'd give 99% probability of it being solved in 3 years and 50% probability in 1.
In my consulting career I sometimes get to tune database servers for performance. I have a bag of tricks that yield about +10-20% performance each. I get arguments about this from customers, typically along the lines of "that doesn't seem worth it."
Yeah, but 10% plus 20% plus 20%... next thing you know you're at +100% and your server is literally double the speed!
AI progress feels the same. Each little incremental improvement alone doesn't blow my skirt up, but we've had years of nearly monthly advances that have added up to something quite substantial.
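A rough back-of-the-envelope, with made-up gain figures just to show the compounding:

    # Hypothetical per-release gains; the point is that they multiply rather than add.
    gains = [0.10, 0.20, 0.20, 0.15, 0.10]
    total = 1.0
    for g in gains:
        total *= 1 + g
    print(f"{total:.2f}x")  # ~2.00x, i.e. "literally double the speed"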
Except at some point the low-hanging fruit is gone and it becomes +1% here, +3% in some benchmarked use case, -1% in the general case, etc. And then come the benchmarking lies we are seeing right now, where everyone picks a benchmark that makes them look good, and its correlation to real-world performance is questionable.
Not really new, is it? The first cars just had to approach horse-and-cart levels of speed. Comfort, ease of use, etc. were non-factors because this was "cool new technology".
In that light, even a 20-year-old, almost-broken-down crappy dinger is amazing: it has a radio, heating, shock absorbers, and it can go over 500km on a tank of fuel! But are we fawning over it? No, because the goalposts have moved. Now we are disappointed that it takes 5 seconds for the Bluetooth to connect and for the seats to auto-adjust to our preferred seating and heating settings in our new car.
I have actually read it and agree it is impressive. I will not comment much on the style of the writing, since this is very subjective, but I would rate it as the "typical" modern fantasy style, which aims at filling as many pages as possible: very "flowery" language, lots of adjectives/adverbs, lots of details, lots of high-school prose ("Panic was a luxury they couldn't afford"). Not a big fan of that, since I really miss the time when authors could write single, self-contained books instead of a sprawling series over thousands of pages, but I know of course that this kind of thing is very successful and people seem to enjoy it. If someone gave me this, I would advise them to get a good copy editor.
There are some logical inconsistencies, though. For instance, when they both enter the cellar through a trapdoor, Kael goes first, but the innkeeper instructs him to close the trapdoor behind them, which makes no sense. Also, Kael goes down the stairs and "risks a quick look back up" and can somehow see the front door bulging and the chaos outside through the windows, which is obviously impossible when you look up through a trapdoor, not to mention that the entry was previously said to be behind the bar counter, surely blocking the sight. Kael lights an oily rag, which somehow becomes a torch. There are more generic issues too, like these Eldertides somehow being mythical things no one has ever seen, yet seeming to be pretty common occurrences. The dimensions of the cellar are completely unclear: at first it seems to be very small, yet they move around it quite a bit. There are other issues, like people using the same words as the narrator ("the ooze"), as if they can hear him, or the innkeeper suddenly calling Kael by his name as if they already know each other.
Anyway, I would rate it "first draft". Of course, it is unclear whether the LLM would manage to write a consistent book, but I can fully believe that it would manage. I probably wouldn't want to read it.
Thank you for taking the time to do a thorough read. I just skimmed it, and the prose is certainly not for me. To me it lacks focus, but as you say, this may be the style readers enjoy.
And it also, as you say, really reuses words. Just reading through, I noticed "phosphorescence" 4 times in this chapter, for example, and "ooze" 17 times (!).
It is very impressive though that it can create a somewhat cohesive storyline, and certainly an improvement over previous models.
Using the AI in multiple phases is an approach that can handle this.
Similar to the "Deep Research" approach: you can tell it to first generate a storyline with multiple twists and turns, then ask the model to take this storyline and generate prompts for individual chapters, then ask it to generate the individual chapters based on those prompts, and so on.
But a future chatbot would be able to internally project-manage itself through that process: first emitting an outline, then producing draft chapters, then going back and critiquing itself, and finally rewriting the whole thing.
Yes, and that's why many people in the discussion here are very optimistic that chatbots will have solved this problem very soon, either with the approach you suggest or with something else (perhaps more general, and less directly programmed in).
It's not a problem of one-shotting it. It's that the details cause a collapse. Even if you try breaking it down, which I have, you run into the same problem unless you hold its hand for every single page, and then what's the point? I want to read the story, not co-author it.
The etymology is pretty much irrelevant. In German, for example, the word for novel is 'Roman'. But German readers don't expect their novels to be any more romantic, nor do English readers expect their novels to be more novel.
LLMs have been producing new things all the time. The question was always about quality of output, never about being able to produce anything new.
I dunno, there's a certain amount of fun in "writing" a book with ChatGPT. Like playing a video game with a bunch of different endings instead of watching a movie with only one. Does the hero save the day? Or turn into a villain? You decide!
I think you would be better off having the LLM help you build up the plot with high level chapter descriptions and then have it dig into each chapter or arc. Or start by giving it the beats before you ask it for help with specifics. That'd be better at keeping it on rails.
I don't disagree. As with almost anything else involving LLMs, getting hands-on produces better results. But because in this instance I much prefer to be the reader rather than the author or editor, it's really important to me that an LLM is capable of pacing long-form writing properly on its own.
Random question, if you don't care about being a creator yourself, why do you even want to read long form writing written by an LLM? There are literally 10000s of actual human written books out there all of them better than anything an LLM can write, why not read them?
> There are literally 10000s of actual human written books out there all of them better than anything an LLM can write, why not read them?
10000s is still much smaller than the space of possibilities for even a short prompt.
You might be right that good human novels are better than what LLMs can manage today. But that's rapidly changing.
And if you really need that Harry Potter / Superman / Three Musketeers crossover fan fiction itch scratched, you might not care that some other existing novel is 'better' in some abstract sense.
Authors tell stories they want to tell, and readers read stories they want to read. The two don't necessarily overlap, or don't overlap strongly enough. If you're even a little bit specific (nowhere near as specific as the above prompt, even just something like the dynamic between the protagonists), then you don't actually have 10,000s of actual human-written books. Not even close. Maybe it exists and maybe you'll find it good enough, but if it's only been read by a few hundred or thousand people? Good luck getting it recommended.
I've read a LOT of fiction. I love reading. And if it's good enough, the idea of reading something created by a machine does not bother me at all. So of course I will continue to check whether the machine is finally good enough that I can be a bit more specific.
It's very hard to find good books written by humans. GoodReads is okay, but you quickly run out of high-end recommendations. I read mostly sci-fi, and the books that everyone recommends rarely end up being 10/10. But then I see some random recommendation on Reddit or HN, and it ends up being amazing.
That was what I tried on the train [0] a few weeks ago. I used Groq to get something very fast, to see if it would work at least somewhat. It gives you a PDF in the end. Plugging in a better model gave much better results (still not really readable if you actually try; at a glance it's convincing though), but it was so slow that testing was kind of impossible. You can't really run things in parallel either, because it needs to know what it pushed out before, or at least a summary of it.
I have never used an LLM for fiction writing, but I have been writing large amounts of code with them for years. What I'd recommend: when you're defining your up-front plan for the sections of the content, simply state in which phase/chapter the characters should meet.
Planning generated content is often more important to invest in than the writing of it.
Looking at your paste, your prompt is short and basic; it should probably be broken up into clear, formatted sections (try directives inside XML-style tags). For such a large output as you're expecting, I'd expect a considerable prompt of rules and context-setting (maybe a page or two).
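Something along these lines, for illustration only (the tag names are made up, and the character/pacing details are placeholders loosely borrowed from the excerpt discussed upthread, not your actual prompt):

    prompt = """
    <role>You are a fantasy novelist writing a slow-burn, multi-book epic.</role>

    <characters>
      <protagonist>Kael: an innkeeper's helper; cautious, observant.</protagonist>
      <love_interest>Do NOT introduce before the midpoint of the book.</love_interest>
    </characters>

    <pacing_rules>
      - Chapters 1-5: only the village, the inn, and the first Eldertide.
      - No named antagonist before chapter 8.
    </pacing_rules>

    <style>Third person limited, ~3,000 words per chapter, minimal adverbs.</style>

    <task>Write chapter 1 only. End on a quiet note, not a cliffhanger.</task>
    """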
I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand.
Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.
For those curious, the riddle is: there are three people in a circle. Each person has a positive integer floating above their heads, such that each person can see the other two numbers but not his own. The sum of two of the numbers is equal to the third. The first person is asked for his number, and he says that he doesn't know. The second person is asked for his number, and he says that he doesn't know. The third person is asked for his number, and he says that he doesn't know. Then the first person is asked for his number again, and he says: 65. What is the product of the three numbers?
That's a non-sequitur; they would be stupid to run an expensive _L_LM for every search query. This post is not about Google Search being replaced by Gemini 2.5 and/or a chatbot.
Bing doesn't list any reddit posts (that Google-exclusive deal) so I'll assume no stackexchange-related sites have an appropriate answer (or bing is only looking for hat-related answers for some reason).
I might have phrased that poorly. With _L_ (or L, as intended), I meant their state-of-the-art model, which I presume Gemini 2.5 is (haven't gotten around to TFA yet). Not sure if this question is just about model size.
I'm eagerly awaiting an article about RAG caching strategies though!
There are 3 toddlers on the floor. You ask them a hard mathematical question. One of the toddlers plays around with pieces of paper on the ground and happens to hold up one that has the right answer written on it.
- This kid is a genius! - you yell
- But wait, the kid has just picked an answer from the ground, it didn't actually come up...
- But the other toddlers could do it also but didn't!
Other models aren't able to solve it, so there's something else happening besides it being in the training data. You can also vary the problem and give it a number like 85 instead of 65, and Gemini is still able to properly reason through the problem.
I'm sure you're right that it's more than just it being in the training data, but that it's in the training data means that you can't draw any conclusions about general mathematical ability using just this as a benchmark, even if you substitute numbers.
There are lots of possible mechanisms by which this particular problem would become more prominent in the weights in a given round of training even if the model itself hasn't actually gotten any better at general reasoning. Here are a few:
* Random chance (these are still statistical machines after all)
* The problem resurfaced recently and shows up more often than it used to.
* The particular set of RLHF data chosen for this model draws out the weights associated with this problem in a way that wasn't true previously.
Sure, but you can't cite this puzzle as proof that this model is "better than 95+% of the population at mathematical reasoning" when the method of solving (the "answer") it is online, and the model has surely seen it.
Thanks. I wanted to do exactly that: find the answer online. It is amazing that people (even on HN) think that an LLM can reason. It just regurgitates the input.
I think it can reason. At least if it can work in a loop ("thinking"). It's just that this reasoning is far inferior to human reasoning, despite what some people hastily claim.
I would say maybe about 80%, certainly not 99.99%. But I saw in college that some people could only solve problems that were pretty much the same as ones they had already seen, while some guys could easily come up with solutions to complex problems they had not seen before. In my opinion, no human at age 20 can have had the amount of input an LLM has today, and still, humans of age 20 come up with very new ideas pretty often (new in the sense that they have not seen that idea, or anything like it, before). Of course there are more and less creative/intelligent people...
Is there a reason for the downvotes here? We can see that having the answer in the training data doesn't help. If it's in there, what's that supposed to show?
It's entirely unclear what you are trying to get across, at least to me.
Generally speaking, posting output from an LLM without explaining exactly what you think it illustrates, and why, is frowned upon here. I don't think your comment does a great job of the latter.
>> So it’s likely that it’s part of the training data by now.
> I don't think this means what you think it means.
> I did some interacting with the Tencent model that showed up here a couple days ago [...]
> This is a question that obviously was in the training data. How do you get the answer back out of the training data?
What do I think the conversation illustrates? Probably that having the answer in the training data doesn't get it into the output.
How does the conversation illustrate that? It isn't subtle. You can see it without reading any of the Chinese. If you want to read the Chinese, Google Translate is more than good enough for this purpose; that's what I used.
Your intentions are good, but your execution is poor.
I cannot figure out what the comment is trying to get across either. It's easy for you because you already know what you are trying to say. You know what the pasted output shows. The poor execution is in not spending enough time thinking about how someone coming in totally blind would interpret the comment.
I didn't downvote you, but like (probably) most people here, I can't read Chinese; I can't derive whatever point you're trying to make just from the text you provided.
This is solvable in roughly half an hour on pen and paper by a random person I picked with no special math skills (beyond a university). This is far from a difficult problem. The "95%+" in math reasoning is a meaningless standard; it's like saying a model is better than 99.9% of the world population at the Albanian language, since less than 0.1% bother to learn Albanian.
Even ignoring the fact that this or a similar problem may have appeared in the training data, it's something careful brute-force logic should solve. It's neither difficult, nor interesting, nor useful. Yes, it may suggest a slight improvement in basic logic, but no more so than a million other benchmarks people quote.
This goes to show that evaluating models is not a trivial problem. In fact, it's a hard problem (in particular, far, far harder than this math puzzle).
So in the three hours between reading the puzzle in the parent comment and replying, you stopped what you were doing, managed to get some other "random" person to stop what they were doing, and had them spend half an hour of their time on a maths puzzle that, at that point, prior experience suggested could take a day? All within three hours?
That's not to say that you didn't, or that you're recalling a previous time that happens to involve this exact puzzle (despite there being scant prior references to it, which is precisely the reason for using it). But you can see how some might find that not entirely credible.
Best guess: this random person is someone who really likes puzzles, is presumably good at them, and is very, very far from being representative to the extent your argument would require.
The "random person" you picked is likely very, very intelligent and not at all a good random sample. I'm not saying this is difficult to the extent that it merits academic focus, but it is NOT a simple problem and I suspect less than 1% of the population could solve this in half an hour "with no special math skills." You have to be either exceedingly clever or trained in a certain type of reasoning or both.
I agree with your general point that this "random person" is probably not representative of anything close to an average person off the street, but I think the phrasing "very very intelligent" and "exceedingly clever" is kinda misleading.
In my experience, the difference between someone who solves this type of logic puzzle and someone who doesn't has more to do with persistence and the ability to maintain focus than with "intelligence" in terms of problem-solving ability per se. I've worked with college students helping them learn to solve these kinds of problems (e.g. as part of pre-interview test prep), and in most cases, those who solve them and those who don't make the same rate of progress towards the solution as long as they're actively working at it. The difference comes in how quickly they get frustrated (at themselves, mostly), decide they're not capable of solving it, and give up on working on it further.
I mention this because the frustration itself comes from a belief that the ability to solve these belongs only to some "exceedingly clever" people, and not to someone like them. So this kind of thinking ends up being a vicious cycle that keeps them from working on their actual issues.
I solved it in less than 15 minutes while walking my dog, no pen or paper. But I wouldn't claim to be a random person without math skills. And my very first guess was correct.
It was a fun puzzle though and I'm surprised I didn't know it already. Thanks for sharing.
> This is solvable in roughly half an hour on pen and paper by a random person I picked with no special math skills (beyond a university).
I randomly answered this post and can't solve it in half an hour. Is the point leetcode but for AI? I'd rather it solve real problems than "elite problems".
Side note: I couldn't even find pen and paper within half an hour.
This is a great riddle. Unfortunately, I was easily able to find the exact question with a solution (albeit with a different number) online, thus it will have been in the training set.
What makes this interesting is that while the question is online (on reddit, from 10 years ago), other models don't get the answer right. Gemini also shows its work, and it seems to do a few orders of magnitude more calculation than the elegant answer given on reddit.
Granted this is all way over my head, but the solution Gemini comes to matches the one given on reddit (and now here, in future training runs).
>Gemini also shows its work, and it seems to do a few orders of magnitude more calculation than the elegant answer given on reddit.
I don't think Gemini does an unnecessary amount of computation, it's just more verbose. This is typical of reasoning models, almost every step is necessary but many would not be written down by a human.
Everyone with limited bandwidth has been trying to limit site access to robots. The latest generation of AI web scrapers is brutal and does not respect robots.txt.
There are websites where you can only register in person and have two existing members vouch for you. Probably still can be gamed, but it sounds like a great barrier to entry for robots (for now).
Admins will see unusual traffic from that account and then take action. Of course it will not be perfect, as there could be a way to mimic human traffic and slowly scrape the data anyway; that's why there is an element of trust (two existing members to vouch).
Yeah, don't get me wrong, I believe raising the burden of extraction is an effective strategy; I just think it's been solved at scale, i.e. voting rings and astroturfing operations on Reddit. And at the nation-state level I'd just bribe or extort the mods and admins directly (or the IT person, to dump the database).
I have bad news for you if you think non-paywalled / non-phone#-required Discord communities are immune to AI scraping, especially as it costs less than hammering traditional websites, since the push-on-change event is done for you in real-time chat contexts.
Especially as the company archives all those chats (not sure for how long) and is small enough that a billion-dollar "data sharing" agreement would be a very enticing offer.
If there isn't a significant barrier to access, it's being scraped. And if that barrier is money, it's being scraped but less often.
You'd have better results if you had prompted it with the actual answer and asked how the first person came to the conclusion. Giving a number in the training set is very easy.
i.e. You observe three people in a magical room. The first person is standing underneath a 65, the second person is standing underneath a 26 and the third person is standing underneath a 39. They can see the others numbers but not the one they are directly under. You tell them one of the three numbers is the sum of the other two and all numbers are positive integers. You ask the first person for their number, they respond that they don't know. You ask the second person for their number, they respond that they don't know. You ask the third person, they respond that they don't know. You ask the first person again and they respond with the correct value, how did they know?
You can also put the AI in the first person's shoes.
Prompt:
You are standing in a circle, there are 2 other people in the circle with you, everyone in the circle, has a positive integer above their head, no one knows what the number above their own head is but can see the numbers above the heads of the other people. You see that the person infront of you on the left has 26 above their head. The person on the right has 39 above their head. You are told that the sum of two of the numbers is the third number. You are asked what the number above your head is, the option is the sum, 65, or 13, as 26 + 13 = 39. You don't know which one it is, and you say so. The second person is asked the number above their head. They also say they dont know, the third person also says they dont know. What is your number?
Gemini 2.5 and Claude 3.7 thinking get it right; o3 mini and 4o get it wrong.
Same here: my problem of choice is the 100 prisoners problem [1]. I used to ask simple reasoning questions in the style of "what is the day three days before the day after tomorrow", but nowadays when I ask such questions, I can almost feel the NN giggling at the naivety of its human operator.
Deepseek R1 got the right answer after a whopping ~10 minutes of thinking. I'm impressed and feel kind of dirty, I suspect my electricity use from this could have been put to better use baking a frozen pizza.
I use an algorithmic question that I'd been working on for years and that I'm finally writing up the answer to.
It's basically: given a sequence of heap operations (insert element, delete minimum element), can you predict the left-over elements (that are in the heap at the end) in linear time in the comparison model?
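For reference, the obvious baseline just simulates the heap, which costs O(n log n) comparisons; the open part of the question is whether the surviving elements can be identified with only O(n) comparisons. A quick sketch of that baseline:

    import heapq

    def leftovers(ops):
        """ops: sequence of ('insert', x) or ('delete_min',) operations."""
        h = []
        for op in ops:
            if op[0] == 'insert':
                heapq.heappush(h, op[1])
            else:
                heapq.heappop(h)
        return sorted(h)

    print(leftovers([('insert', 5), ('insert', 2), ('delete_min',), ('insert', 7)]))  # [5, 7]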
Loved that puzzle, thanks for sharing it. I’ve solved a lot of math problems in the past but this one had a unique flavor of interleaving logical reasoning, partial information and a little bit of arithmetic.
Call the three numbers a, b, and c. This means c = a + b, but we still don’t know to which person each number belongs.
When person 1 (p1) is asked what his number is, he has no way to know whether he has a, b, or c, so he says he doesn’t know. Same goes for p2 and p3. Clearly p1 somehow gains information by p2 and p3 passing. Either he realizes that he must be either a or b, and such his number is the difference between p2 and p3’s numbers, or he realizes that he must be c and so his number is the sum of p2 and p3’s numbers.
That’s all I have so far. Anyone have other ideas?
I think it has something to do with applying the lower bound of 1.
If p1 KNOWS that he's the largest, then he has to have gained some other piece of information. Say the numbers he sees are 32 and 33. His number would have to be either 1 or 65. If p1 were 1, then the other two would have known p1 couldn't be the sum of the other two.
P1 knows that P2 and P3 are not equal. So they know that the set isn't [2A, A, A].
P2 knows that P1 and P3 are not equal. So they know that the set isn't [A, 2A, A]. They also know that if P1 doesn't know, then they were able to make the same deduction. So they now know that both [2A, A, A] and [A, 2A, A] aren't correct. Since they know that [2A, A, A] isn't correct, they can also know that [2A, 3A, A] isn't correct either. Because they'd be able to see if P1 = 2A and P3 = A, and if that were true and P1 doesn't know their number, it would have to be because P2 isn't A. And if P2 isn't A, they'd have to be 3A.
P3 knows that P1 and P2 aren't equal. Eliminates [A, A, 2A]. Knows that [2A, A, A], [A, 2A, A], and [2A, 3A, A], are eliminated. Using the same process as P2, they can eliminate [2A, A, 3A], [A, 2A, 3A], and also [2A, 3A, 5A]. Because they can see the numbers and they know if P1 is 2A and P2 is 3A.
Now we're back at P1. Who now knows.
So P2 and P3 are in the eliminated sets. Which means we're one of these
[2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]
We know his number is 65. To find the set, we can factor 65: 5 * 13. We can check the other numbers: 2(13) = 26, 3(13) = 39. And technically, you don't need to find the other numbers: the final answer is 5A * 2A * 3A = 30A^3, which for A = 13 is 65,910.
"Which means we're one of these [2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]"
Why? Couldn't it be an infinite number of 3 size arrays comprised of A where two elements sum to the third? [24A, 13A, 11A]? How did we deduce this set of arrays?
EDIT: Solved from another reddit comment. Tuples without a common factor like the one above are considered as a=1.
"They're not eliminated; they correspond to a = 1."
I think that answer was poorly phrased, because those possibilities are eliminated in a sense. There is a better answer further down the thread that explains: "If the solution was not one of the flipped triplets, then the first player would not have worked out the solution." Thus if it were one of your other infinite triplets (e.g. 65, 12, 53), then in round 2 player 1 would still have answered "I don't know". Since they did respond with a definitive answer, it had to be one of the formula solutions, since those were the only solutions they could prove. And since the only leading coefficient that divides 65 is 5, the correct formula must be [5A, 2A, 3A] and thus [65, 26, 39].
You should be able to generate an infinite number of these problems just by multiplying the first formula factor by a prime number. For example, the same question where the person answers "52" restricts you to either [4a, 3a, a] or [4a, a, 3a]. Since the question only asks for the product of all the terms, the answer is (4 * 13) * (3 * 13) * 13 = 52 * 39 * 13 = 26,364 either way.
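For anyone who wants to sanity-check this (and the 52 variant) mechanically, here is a small brute force of the common-knowledge reasoning in Python. It simulates the rounds of "I don't know" directly instead of using the ratio patterns above; the search bound is arbitrary:

    from functools import lru_cache

    def valid(t):
        a, b, c = t
        return min(t) > 0 and (a == b + c or b == a + c or c == a + b)

    @lru_cache(maxsize=None)
    def consistent(t, n):
        # True if triple t is compatible with the first n announcements all being
        # "I don't know"; announcement k (1-based) is made by person (k - 1) % 3.
        if not valid(t):
            return False
        return all(not knows(t, (k - 1) % 3, k - 1) for k in range(1, n + 1))

    @lru_cache(maxsize=None)
    def knows(t, p, n):
        # True if person p, seeing the other two numbers and having heard n
        # "I don't know"s, can pin down their own number.
        x, y = [t[i] for i in range(3) if i != p]
        cands = {x + y} | ({abs(x - y)} if x != y else set())
        live = [v for v in cands
                if consistent(tuple(v if i == p else t[i] for i in range(3)), n)]
        return len(live) == 1

    def solve(first, bound=300):
        # Person 1 (index 0) holds `first` and announces on their second turn,
        # i.e. after three "I don't know"s.
        for b in range(1, bound):
            for c in range(1, bound):
                t = (first, b, c)
                if valid(t) and consistent(t, 3) and knows(t, 0, 3):
                    print(t, "product =", first * b * c)

    solve(65)  # expected: (65, 26, 39) product = 65910
    solve(52)  # expected: (52, 39, 13) and (52, 13, 39), product 26364 either way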
Look at it this way: Person 1 sees the numbers 26 and 39, and has to guess his own number. It must be one of only 2 possibilities: 13 or 65. All he has to do is eliminate one of those possibilities.
One of the cases has to be true, not all three (as you show, they're mutually exclusive for positive integers); i.e. "either" is important in the parent comment.
Which is why I indicated that it would be a misreading of the problem.
The original problem is a little ambiguously worded. You could say "one of their numbers is the sum of the other two" and it would be a little clearer.
> The original problem is a little ambiguously worded.
No it isn't. If it said "the sum of any two of the numbers is equal to the third", that would be a contradiction. What it says is "the sum of two of the numbers is equal to the third".
In general I find commentary here too negative on AI, but I'm a bit squeamish about maximalist claims re: AI mathematical reasoning vs. human population based off this, even setting aside lottery-ticket-hypothesis-like concerns.
We have three people in a circle. Let’s call them:
• Person A (first person)
• Person B (second person)
• Person C (third person)
Each has a positive integer above their head. Each sees the other two numbers but not their own. The numbers satisfy:
A + B = C, or B + C = A, or C + A = B
Given the puzzle scenario, we need to find out the exact numbers given the clues, and ultimately find the product of the three numbers.
⸻
Step 2: Logic of the Puzzle (Step-by-Step Reasoning)
Initial conditions:
The numbers are positive integers. The puzzle implies a situation of ambiguity at first, but after multiple rounds of “I don’t know,” Person A finally knows their number is 65.
Let’s denote clearly the conditions again:
• Exactly one of the following three equations is true:
• A + B = C, or
• B + C = A, or
• C + A = B.
Person A sees B and C, but not A. Similarly for Persons B and C.
⸻
Step 3: Logical deductions from the repeated “I don’t know”
First round deductions:
• Person A initially doesn’t know their number. This means, from Person A’s perspective, both scenarios A = B + C and A + B = C (or A + C = B) are still possible. If Person A had seen identical numbers, for instance, Person A could have instantly known their number. Thus, Person A sees two different numbers.
• Similarly, Person B doesn’t know either, implying that Person B also sees ambiguity in their own number. Thus, Person B also sees two numbers that could result in at least two possible solutions.
• Person C also doesn’t know, same logic as above.
Thus, after one full round, we understand that initially, all three numbers must create ambiguity. No trivial case (like two identical numbers) is present.
Second round deductions:
After hearing that no one knows their number, each person gains information: they now know the scenario is one in which none of the other two people could have immediately solved it.
Crucially, in the second round:
• Person A now knows their number is 65 after considering the previous statements. This implies that Person A sees numbers B and C, and from these numbers plus the statements of uncertainty, Person A deduces that their own number can only be 65.
⸻
Step 4: Deriving the Numbers (Critical deduction)
Let’s denote clearly again: Person A’s number is now known to be 65.
We have three possible equations. Let’s test each carefully given A = 65:
1. Case 1: A + B = C
Then we have:
• 65 + B = C. Person A sees numbers B and C, so they would see B and 65 + B. Person A sees numbers B and 65+B. If Person A sees these two numbers, they might initially consider their number could be either the sum or difference of these two.
• However, if Person A sees two numbers, and one is exactly 2× the other, Person A could immediately guess their number is the difference (since if you see numbers X and 2X, your number is clearly X). So this scenario of B and 65+B is okay only if these two numbers don’t exhibit an immediate trivial scenario.
• Let’s hold this scenario and continue examining others first.
2. Case 2: B + C = A (So, B + C = 65)
• Then Person A sees numbers B and C, adding exactly to 65.
• Initially, Person A sees two numbers summing exactly to 65. Could Person A know their number immediately at first glance? No, since their number could either be the sum (65) or the difference of the two numbers. Ambiguity here initially makes sense.
• Initially, Person B and Person C also see ambiguous scenarios. They both don’t know their numbers either at first round. This scenario is plausible.
3. Case 3: C + A = B (So, C + 65 = B)
• Then Person A sees numbers B = C + 65 and C.
• Person A sees C and C+65. Similar reasoning as Case 1.
⸻
Step 5: Identifying the Correct Scenario
The puzzle’s logic typically aligns best with Case 2: B + C = A = 65.
Here’s why clearly stated:
• Initially, Person A sees two numbers adding to exactly 65. At first glance, Person A’s number could either be 65 (sum) or the difference between the two numbers.
• Person B also initially sees two numbers and cannot distinguish immediately their own number, nor can Person C.
• After everyone states “I don’t know,” Person A gains critical insight: if the difference between the two numbers (that A sees) allowed Person B or Person C to know their number immediately, they’d have known right away. The fact that B and C didn’t know means the difference is not immediately obvious to B or C. Thus, the difference cannot be trivial (such as identical numbers or easy multiples).
Therefore, the two numbers must be such that their difference doesn’t immediately clue in the other persons. This occurs most clearly with pairs of numbers which have multiple factorization possibilities or aren’t trivial multiples.
But we must carefully test quickly to find a suitable, plausible integer pair whose sum is 65 and whose ambiguity would logically survive through multiple rounds. The puzzle is a classic logic puzzle with a known structure:
A common scenario in these classic puzzles is the use of Fibonacci-like or similar “golden” sets. Let’s quickly check the following well-known ambiguous triple:
• Numbers: 65, 39, 26
• Check quickly: indeed, 39 + 26 = 65.
Checking logic quickly:
• Person A sees 39 and 26. Initially ambiguous (number could be 65 or 13).
• Person B sees 65 and 26. Could be either 39 or 91 at first glance. Initially ambiguous.
• Person C sees 65 and 39. Could be either 26 or 104 initially. Initially ambiguous.
After each person says they don’t know:
• Now, Person A considers: If my number was 13, then Person B would see numbers 13 and 26. Person B would then immediately know their number is 39 (13+26=39), no ambiguity. But Person B didn’t know! So Person A’s number cannot be 13. Hence, Person A confidently concludes their number is 65.
Thus, the numbers 65, 39, and 26 satisfy the puzzle conditions perfectly!
⸻
Step 6: Final Answer (Product of the Numbers)
The three numbers identified clearly are:
• A = 65
• B = 39
• C = 26
Then, the product is:
65 × 39 × 26 = 65 × 1,014 = 65,910
I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those.
Plus it drew me a very decent pelican riding a bicycle.
Have you considered that they must be training on images of pelicans riding bicycles at this point ;-)? At least given how often that comes up in your reviews, a smart LLM engineer might put their finger on the scales a bit and optimize for the things that come up in reviews of their work a lot.
I wonder how far away we are from models which, given this prompt, generate that image in the first step in their chain-of-thought and then use it as a reference to generate SVG code.
It could be useful for much more than just silly benchmarks, there's a reason why physics students are taught to draw a diagram before attempting a problem.
Someone managed to get ChatGPT to render the image using GPT-4o, then save that image to a Code Interpreter container and run Python code with OpenCV to trace the edges and produce an SVG: https://bsky.app/profile/btucker.net/post/3lla7extk5c2u
Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%.
This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.
Am I correct in assuming that accuracy < using the correct edit format? I.e., it made mistakes on 27% of the problems, 11% of which were due to (at least) messing up the diff format?
In which case, Google should be working on better output-format following, as Claude and R1 are able to hit nearly 100% accuracy on the format.
It does have fairly low adherence to the edit format, compared to the other frontier models. But it is much better than any previous Gemini model in this regard.
Aider automatically asks models to retry malformed edits, so it recovers. And goes on to produce a SOTA score.
Reminds me of how nobody is too excited about flagship mobile launches anymore. Most flagships for some time now have been just incremental updates over the previous generation, only marginally better. Couple that with Chinese OEMs launching better or good-enough devices at a lower price point, and new launches from established players are not noteworthy anymore.
It's interesting how the recent AI announcements are following the same trend over a smaller timeframe.
Phones are limited by hardware manufacturing, plus maybe the annual shopping cycle peaking at Christmas. People wouldn't have bought multiple iPhones even in its heyday.
These LLM models were supposedly limited by the training run, but these point-version models are mostly post-training driven, which seems to be taking less time.
If models were tied to specific hardware (say, an "AI PC" or whatever), the cycle would get slower and we'd get a slower summer, which I'm secretly wishing for.
For me, the most exciting part is the improved long-context performance. A lot of enterprise/RAG applications rely on synthesizing a bunch of possibly relevant data. Let's just say it's clearly a bottleneck in current models and I would expect to see a meaningful % improvement in various internal applications if long-context reasoning is up. Gemini was already one of my favorite models for this usecase.
So, I think these results are very interesting, if you know what features specifically you are using.
But they score it on their own benchmark, on which, coincidentally, Gemini models were always the only good ones. In Nolima or Babilong we see that Gemini models still can't do long context.
Reasoning was supposed to be that for "Open" AI, that's why they go to such lengths to hide the reasoning output. Look how that turned out.
Right now, in my opinion, OpenAI has actually a useful deep research feature which I've found nobody else matches. But there is no moat to be seen there.
If you've seen DeepSeek R1's <think> output, you'll understand why OpenAI hides their own. It can be pretty "unsafe" relative to their squeaky-clean public image.
Why not snooze the news for a year and see what's been invented when you get back? That'll blow your mind properly, because each of these incremental announcements contributes to a mind-blowing rate of improvement.
The rate of announcements is a sign that models are increasing in ability at an amazing rate, and the content is broadly the same because they’re fungible commodities.
The latter, that models are fungible commodities, is what’s driving this explosion and leading to intense competition that benefits us all.
Serious question: Has anyone tested how much money you can actually make doing a month of Amazon Mechanical Turk? (It would make for an interesting YouTube video!) I am curious if it is middle class wages in very poor countries (like Nigeria). Some light Googling tells me that middle class salary in Nigeria is about 6K USD, so about 3 USD/hour (assuming: 50 weeks/year * 40 hours/week = 2000 hours/year). Is this possible with MTurk?
That's ok. AI will kill those off soon enough, and like all winners, rewrite history enough so that that inconvenient theft never happened anyway. It's manifest destiny, or something.
I wish I wish I wish Google put better marketing into these releases. I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.
Also, I think google's winning the race on actually integrating the AI to do useful things. The agent demo from OpenAI is interesting, but frankly, I don't care to watch the machine use my computer. A real virtual assistant can browse the web headless and pick flights or food for me. That's the real workflow unlock, IMO.
> I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.
This is useful feedback. I'm not here to shill for OpenAI, nor Google/Gemini, but can you share a concrete example? It would be interesting to hear more about your use case. More abstractly: Do you think these "moved entire workflows" offset a full worker, or X% of a full worker? I am curious to see how and when we will see low-end/junior knowledge workers displaced by solid LLMs. Listening to the Oxide and Friends podcast, I learned that they make pretty regular use of LLMs to create graphs using GNU plot. To paraphrase, they said "it is like have a good intern".
To clarify, by "doing the opposite" I mean OpenAI releasing GPT-4.5, a non-reasoning model that does worse on benchmarks (but supposed to be qualitatively better). People shit on OpenAI hard for doing that.
AI labs, it seems, use a template for system cards as well. OpenAI stands out because they showcase their employees using their tools for various use cases, which is refreshing.
Was going to comment the same thing; it has been bugging me lately that all announcements start with "our" followed by empty superlatives. Happy not to be alone on this!
Glaringly missing from the announcements: concrete use cases and products. The Achilles heel of LLMs is the distinct lack of practical real-world applications. Yes, Google and Microsoft have been shoving the tech into everything they can fit, but that doesn't a product make.
I would say Adobe is doing an excellent job of commercialising image manipulation and generation using LLMs. When I see adverts for their new features, they seem genuinely useful for normie users who are trying to edit some family/holiday photos.
Is that article trying to argue that 500M people every week are visiting ChatGPT for the first (or second) time after reading about it in the news?
If I'm being incredibly generous I will concede that this could have been the case for the first few weeks when it was making headlines, but it clearly isn't true now.
It would be literally impossible to keep up these figures for as long as ChatGPT has without a ton of repeat users. There simply aren't enough people/devices.
Cancelled my account a long time ago. Gemini models are like a McDonald's croissant: you always give them an extra chance, but they always fall apart in your hands...
Just adding to the praise: I have a little test case I've used lately which was to identify the cause of a bug in a Dart library I was encountering by providing the LLM with the entire codebase and description of the bug. It's about 360,000 tokens.
I tried it a month ago on all the major frontier models and none of them correctly identified the fix. This is the first model to identify it correctly.
Interesting, I've been asking it to generate some Dart code, and it makes tons of mistakes, including lots of invalid code (static errors). When pointing out the mistakes, it thanks me and tells me it won't make it again, then makes it again on the very next prompt.
360k tokens = approximately how many lines of code?
And also, if it's an open-source lib, are you sure there are no mentions of this bug anywhere on the web?
Not a huge library, around 32K LoC and no mention of the bug on the web - I was the first to encounter it (it’s since been fixed) unless the training data is super recent.
Impressive. I tend to think it managed to find the bug by itself, which is pretty crazy given it can't debug anything. Then again, I haven't seen the bug description; perhaps it makes it super obvious where the problem lies.
How do you use the model so quickly? Google AI Studio? Maybe I've missed how powerful that is.. I didn't see any easy way to pass it a whole code base!
> with Gemini 2.5, we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training. Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.
Been playing around with it, and it feels intelligent and up to date. Plus it's connected to the internet, and it's a reasoning model by default when it needs to be.
I hope they enable support for the recently released canvas mode for this model soon; it would be a good match.
It is almost certainly the "nebula" model on LLMarena that has been generating buzz for the last few days. I didn't test coding, but its reasoning is very strong.
If you plan to use Gemini, be warned, here are the usual Big Tech dragons:
Please don’t enter ...confidential info or any data... you wouldn’t want a reviewer to see or Google to use ...
The full extract of the terms of usage:
How human reviewers improve Google AI
To help with quality and improve our products (such as the generative machine-learning models that power Gemini Apps), human reviewers (including third parties) read, annotate, and process your Gemini Apps conversations. We take steps to protect your privacy as part of this process. This includes disconnecting your conversations with Gemini Apps from your Google Account before reviewers see or annotate them. Please don’t enter confidential information in your conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.
Conversations that have been reviewed or annotated by human reviewers (and related data like your language, device type, location info, or feedback) are not deleted when you delete your Gemini Apps activity because they are kept separately and are not connected to your Google Account. Instead, they are retained for up to three years.
Emphasis on "retained for up to three years" even if you delete it!!
If I'm not wrong, ChatGPT states clearly that they don't use user data by default anymore.
Also, maybe some services are doing "machine learning" training with user data, but it is the first time I've seen a recent LLM service saying that it can feed your data to human reviewers at will.
I believe this is out of date. There’s a very explicit opt in/out slider for permitting training on conversations that doesn’t seem to affect conversation history retention.
You can use a paid tier to avoid such issues. Not sure what you're expecting from these "experimental" models, which are in development and need user feedback.
I wonder what about this one gets the +0.5 to the name. IIRC the 2.0 model isn’t particularly old yet. Is it purely marketing, does it represent new model structure, iteratively more training data over the base 2.0, new serving infrastructure, etc?
I’ve always found the use of the *.5 naming kinda silly when it became a thing. When OpenAI released 3.5, they said they already had 4 underway at the time, they were just tweaking 3 be better for ChatGPT. It felt like a scrappy startup name, and now it’s spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme.
I’m a much bigger fan of semver (not skipping to .5 though), date based (“Gemini Pro 2025”), or number + meaningful letter (eg 4o - “Omni”) for model names.
I would consider this a case of "expectation management"-based versioning. This is a release designed to keep Gemini in the news cycle, but it isn't a significant enough improvement to justify calling it Gemini 3.0.
I think it's reasonable. The development process is just not really comparable to other software engineering: It's fairly clear that currently nobody really has a good grasp on what a model will be while they are being trained. But they do have expectations. So you do the training, and then you assign the increment to align the two.
I figured you don't update the major version unless you significantly change the... algorithm, for lack of a better word. At least I assume something major changed between how they trained GPT-3 vs GPT-4, other than the amount of data. But maybe I'm wrong.
As I see it, if it uses a similar training approach and is expected to be better in every regard, then it's a minor release. Whereas when they have a new approach and where there might be some tradeoffs (e.g. longer runtime), it should be a major change. Or if it is very significantly different, then it should be considered an entirely differently named model.
Or drop the pretext of version numbers entirely since they're meaningless here and go back to classics like Gemini Experience, Gemini: Millennium Edition or Gemini New Technology
Just a couple of days ago I wrote on reddit about how long context models are mostly useless to me, because they start making too many mistakes very fast. They are vaguely helpful for "needle in a haystack" problems, not much more.
I have a "test" which consists in sending it a collection of almost 1000 poems, which currently sit at around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes, it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). Previous models don't usually "see" the final poems — they get lost, hallucinate and are pretty much worthless. I have tried several workaround techniques with varying degrees of success (e.g. randomizing the poems).
Having just tried this model (I have spent the last 3 hours probing it), I can say that, to me, this is a breakthrough moment. Truly a leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.
The analysis of this poetic corpus has few mistakes and is very, very, very good. Certainly very good in terms of how quickly it produces an answer — it would take someone days or weeks of thorough analysis.
Of course, this isn't about poetry — it's about passing in huge amounts of information, without RAG, and having a high degree of confidence in whatever reasoning tasks this model performs. It is the first time that I feel confident that I could offload the task of "reasoning" over large corpus of data to an LLM. The mistakes it makes are minute, it hasn't hallucinated, and the analysis is, frankly, better than what I would expect of most people.
Two years ago, Claude was known for having the largest context window and being able to remember tokens throughout the whole conversation.
Today, it seems like Google has beaten them: Gemini supports a way larger context window and is way better at keeping track of what has been said and remembering older tokens.
"Anna, Becca and Clare go to the play park. There is nobody else there. Anna is playing on the see-saw, Becca is playing on the swings. What is Clare doing?" (Sometimes I ask similar questions with the same structure and assumptions but different activities)
About a year ago none of them could answer it. All the latest models can pass it if I tell them to think hard, but previously Gemini could rarely answer it without that extra hint. Gemini 2.5 caveats its answer a bit, but does get it correct. Interestingly GPT-4o initially suggests it will give a wrong answer without thinking, but recognises it's a riddle, so decides to think harder and gets it right.
> This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!
Traditionally at Google experimental models are 100% free to use on https://aistudio.google.com (this is also where you can see the pricing) with a quite generous rate limit.
This time, the Googler says: “good news! you will be charged for experimental models, though for now it’s still free”
Right but the tweet I was responding to says: "This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!"
I assumed that meant there was a paid version with a higher rate limit coming out today
Since the 2.0 line, the Gemini models have been far better than GPT and Claude at engineering-type questions (fluids etc.), especially with questions that have images requiring more than just grabbing text. This one is even better.
Looks like it's this benchmark [1]. It's certainly less artificial than most long context benchmarks (that are basically just a big lookup table) but probably not as representative as Fiction.LiveBench [2], which asks specific questions about works of fanfiction (which are typically excluded from training sets because they are basically porn).
Impressive model - but I'm confused by the knowledge cutoff. AI Studio says it is January 2025 (which would be impressive), but when I query it about anything from early 2025 or mid/late 2024, it self-reports that its cutoff is in 2023 (which can't be right).
This is most evident when querying about fast-moving dev tools like uv or bun. It seems to only know the original uv options like pip and tools, while with bun it is unfamiliar with bun outdated (from Aug 2024), bun workspaces (from around that time?) but does know how to install bun on windows (April 2024).
You'll still need to provide this model with a lot of context to use it with any tooling or libraries with breaking changes or new features from the past ~year - which seems to contradict the AI Studio reported knowledge cutoff.
Were I developing models - I'd prioritise squeezing in the most recent knowledge of popular tools and libraries since development is such a popular (and revenue generating) use case.
Yea, I get a little bummed, but I guess a lot of HNers have reasons to not like Google. I've had a Google One membership forever, so I opted for the higher subscription with Gemini access since the beginning (plus a free year with a new Pixel phone), and I think it is awesome.
Most of us care only about coding performance, and Sonnet 3.5 has been such a giant winner that we don't get too excited about the latest model from Google.
For me personally - a rate limit of 50/day means that I can't use it as a daily driver, so I'll have to go back to Sonnet, which will gladly accept my money for more. Then I just forget it exists.
Yeah, if I don’t have higher rate limits, it’s useless. This just sounds like a gimmick launch where they want to gather feedback. It will be a couple of months before this will be GA.
> This nearest-neighbor connectivity is a key difference between TPUs and GPUs. GPUs connect up to 256 H100s in an all-to-all configuration (called a node), rather than using local connections. On the one hand, that means GPUs can send arbitrary data within a node in a single low-latency hop. On the other hand, TPUs are dramatically cheaper and simpler to wire together, and can scale to much larger topologies because the number of links per device is constant.
Memory grows linearly, compute grows quadratically (but with small constant - until ~100k the inference will be still dominated by non-quadratic factors).
Also, reusing keys/values for different queries can compress the KV cache; it can be a 1000x or 10000x improvement in bandwidth if the model is trained for it.
Just to clarify: simple prefix KV cache doesn't require any special model training. It does require the inference framework to support it, but most do by now.
You can see dramatic improvements in latency and throughput if there is a large shared prefix of the queries.
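To make the prefix idea concrete, here is a toy sketch in Python (my own illustration, not any real framework's API; real systems like vLLM or SGLang use radix trees or paged blocks rather than a dict of every prefix): if two requests share a token prefix, such as a long system prompt, its per-token key/value entries are computed once and reused.

    from typing import Dict, List, Tuple

    KVEntry = str  # stand-in for the real per-token key/value tensors
    cache: Dict[Tuple[str, ...], List[KVEntry]] = {}
    compute_calls = 0

    def compute_kv(token: str) -> KVEntry:
        global compute_calls
        compute_calls += 1          # proxy for the expensive attention math
        return f"kv({token})"

    def prefill(tokens: List[str]) -> List[KVEntry]:
        # Reuse the longest cached prefix; compute KV only for the suffix.
        kv: List[KVEntry] = []
        start = 0
        for i in range(len(tokens), 0, -1):
            hit = cache.get(tuple(tokens[:i]))
            if hit is not None:
                kv, start = list(hit), i
                break
        for i in range(start, len(tokens)):
            kv.append(compute_kv(tokens[i]))
            cache[tuple(tokens[:i + 1])] = list(kv)
        return kv

    system = ["you", "are", "a", "helpful", "assistant"]
    prefill(system + ["what", "is", "rust"])
    prefill(system + ["summarize", "this"])
    print(compute_calls)  # 10, not 15: the shared 5-token prefix was reused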
Funnyish story: the other night I asked my Pixel 9 to generate an image via Gemini, then I asked it to make a change. It didn't consider the previous context, so I asked it "Are you capable of keeping context?" No matter how clearly I enunciated "context", it always interpreted what I was saying as "contacts". After the 4th try, I said "context, spelled c-o-n-t-e-x-t" and it replied with "Ah, you meant context! Yes..."
I think Google is digging a hole for themselves by making their lightweight models the most used ones. Regardless of what their heavyweight models can do, people will naturally associate them with their search model or assistant model.
I noticed Gemini Flash 2.0 making a lot of phonetic typos like that, yeah. Like instead of Basal Ganglia it said Basil Ganglia.
I've also had it switch languages in the middle of output... like one word in the middle of a sentence was randomly output in some strange hieroglyphs, but when I translated them, it was the right word and the sentence made sense.
I was using the conversational feature of Gemini on my phone the other night and was trying to get it to read a blog post to me. The AI proceeded to tell me (out loud, via voice mode/speech synthesis) that it was a text based model and couldn't read text out loud.
For as amazing as these things are, AGI they are not.
I've been using Gemini Pro for my University of Waterloo capstone engineering project. Really good understanding of PDF documents and good reasoning, as well as structured output.
Recommend trying it out at aistudio dot google dot com
A model that is better on Aider than Sonnet 3.7? For free, right now? I think I'll give it a spin this weekend on a couple of projects, seems too good to be true.
On initial thoughts, I think this might be the first AI model to be reliably helpful as a research assistant in pure mathematics (o3-mini-high can be helpful but is more prone to hallucinations)
I tried the beta version of this model to write a business plan (long story).
I was impressed at first. Then it got really hung up on the financial model, and I had to forcibly move it on. After that it wrote a whole section in Indonesian, which I don't speak, and then it crashed. I'd not saved for a while (ever since the financial model thing), and ended up with an outline and a couple of usable sections.
I mean, yes, this is better than nothing. It's impressive that we made a pile of sand do this. And I'm aware that my prompt engineering could improve a lot. But also, this isn't a usable tool yet.
I'm curious to try again, but wary of spending too much time "playing" here.
I pointed out that part of the code, and it answered:
You've correctly pointed out that the TCO implementation in the provided C code snippet is essentially a no-op. The if and else blocks do the same thing: they both call apply(func, args, env). This means there's no actual tail call optimization happening; it's just a regular function call.
But then it follows up with even worse code. It does not even compile!
This looks like the first model where Google seriously comes back into the frontier competition? 2.0 Flash was nice for the price, but it's more focused on efficiency, not performance.
I was recently trying to replicate ClaudePlaysPokemon (which uses Claude 3.7) using Gemini 2.0 Flash Thinking, but it was seemingly getting confused and hallucinating significantly more than Claude, making it unviable (although some of that might be caused by my different setup). I wonder if this new model will do better. But I can't easily test it: for now, even paid users are apparently limited to 50 requests per day [1], which is not really enough when every step in the game is a request. Maybe I'll try it anyway, but really I need to wait for them to "introduce pricing in the coming weeks".
Edit: I did try it anyway and so far the new model is having similar hallucinations. I really need to test my code with Claude 3.7 as a control, to see if it approaches the real ClaudePlaysPokemon's semi-competence.
Edit 2: Here's the log if anyone is curious. For some reason it's letting me make more requests than the stated rate limit. Note how at 11:27:11 it hallucinates on-screen text, and earlier it thinks some random offscreen tile is the stairs. Yes, I'm sure this is the right model: gemini-2.5-pro-exp-03-25.
Update: I tried a different version of the prompt and it's doing really well! Well, so far it's gotten out of its house and into Professor Oak's lab, which is not so impressive compared to ClaudePlaysPokemon, but it's a lot more than Gemini 2.0 was able to do with the same prompt.
> This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!
From https://x.com/OfficialLoganK/status/1904583353954882046
While I'm sure the new Gemini model has made improvements, I feel like the user experience outside of the model itself is stagnating. I think OpenAI's interfaces, both web app and mobile app, are quite a bit more polished currently.
For example, Gemini's speech recognition struggles with longer pauses and often cuts me off mid-sentence. Also, OpenAI's Whisper model understands more context (for instance, saying "[...] plex, emby and Jellyfin [...]" is usually understood by Whisper, but less often by Gemini).
The Gemini web app lacks keyboard shortcuts for basic actions like opening a new chat or toggling the sidebar (good for privacy friendly pair programming). Last point off the top of my head would be the ability to edit messages beyond just the last one. That's possible in ChatGPT, but not in Gemini.
Googlers are spending so much money on model training; I would appreciate spending some on making it fun to use :)
I love to see this competition between companies trying to get the best LLM, and also, the fact that they’re trying to make them useful as tools, focusing on math, science, coding, and so on
They even priced it so people would avoid using it. GPT-4.5's entire function was to be the anchor of keeping OpenAI in the news, to keep up the perception of releasing quickly.
My assumption was that the pricing was because it really was that expensive for whatever reason. I'm keeping fingers crossed that they're going to do some kind of 4.5 mini at some point that will be more affordable.
You're not wrong, but that just means the <adjective> is where the bulk of information resides. The trade-off matters. Maybe it's a model with good enough quality but really cheap to serve. Maybe it's a model that only plays poker really well but sucks at everything else because it bluffs too much. Etc. etc.
This model is quite impressive. Not just useful for math/research with great reasoning, it also maintained a very low hallucination rate of 1.1% on Vectara Hallucination Leaderboard:
https://github.com/vectara/hallucination-leaderboard
With the recent pace of model updates, I wonder which factor is more important: hardware assets, software/talent, or data access. Google is clearly in the lead in terms of data access, in my view. If I were a top talent in AI, I'd go where I can work with the best data, no?
I think an argument could be made for hardware too. Perhaps in absolute terms Nvidia is ahead, but in terms of knowing how to get the most out of the hardware, Google making its own chips, building on their networking, etc, is a pretty big advantage.
(Disclaimer, Googler, but I don’t work on any of this, I only have an external layperson’s understanding of it)
UX is actually increasingly the bottleneck. Most of the top models are very good if you micromanage their context and prompts. But people aren't very good at that stuff.
Some of the desktop chat clients are turning into great productivity tools. I tried the Claude one last week and quickly went back to Chat GPT. Claude might be a better model for coding. But it's less effort to make Chat GPT do what I want at this point and it's kind of good enough for a lot of stuff. Every release it's getting better. It connects to my IDE automatically, it can look at the files I have open. It can patch those files (I actually disabled that because it's too slow for my taste), etc.
But most importantly, I can trigger all that with option+shift+1. I do this gazillions of times per day. Mostly simple stuff with really short prompts: "check this" (file, selection, current line, etc.), fix that, what do you think about x, "address the FIXMEs/TODOs", "document this", etc.
I can ask other models the same questions and they'd get the job done. But then I have to do more work to give them the same context. Claude has a GitHub connect option, which is great. But unfortunately it's just a glorified file picker, which really sucks. I have files open in my editor; just look at those. I don't want to have to manually open files or specify what files to look at every time I go near the tool.
Chat GPT actually asked me yesterday whether it could add a different file than the one it was looking at. I said "yes" and it did. That's great UX. Don't make me do work.
I use Gemini mainly because it's integrated into Google's tools, so it's kind of just there. And Chat GPT, for whatever reason, cannot look at the browser window. But from a UX point of view, that kind of deep integration is what you want. You have this implicit shared context, which is the thing you are looking at, that you don't have to spell out anymore.
The UX of populating the context is the deciding factor in how useful models are at this point, not how well it solves pet benchmark questions or renders pelicans on bicycles.
I have good hopes for agentic coding tools progressing rapidly this year. The ones I've tried recently need a lot of work though. I keep going back to Chat GPT because it's just the quickest & easiest to use at this point.
The problem Goog has is its insane bureaucracy and lack of vision from Sundar, which isn't very attractive from an employee position. If you're working close to Demis I imagine the situation is better though.
Gemini refuses to answer any questions on proportional swing models or anything related to psephology on the grounds that it has to do with elections. Neither Claude nor ChatGPT nor Mistral/Le Chat are that neutered.
Slight tangent: Interesting that they use o3-mini as the comparison rather than o1.
I've been using o1 almost exclusively for the past couple months and have been impressed to the point where I don't feel the need to "upgrade" for a better model.
Are there benchmarks showing o3-mini performing better than o1?
The benchmark numbers don't really mean anything -- Google says that Gemini 2.5 Pro has an AIME score of 86.7 which beats o3-mini's score of 86.5, but OpenAI's announcement post [1] said that o3-mini-high has a score of 87.3 which Gemini 2.5 would lose to. The chart says "All numbers are sourced from providers' self-reported numbers" but the only mention of o3-mini having a score of 86.5 I could find was from this other source [2]
I noticed this too. I have used both o1 and o3-mini extensively, and I have run many tests on my own problems; o1 solves one of my hardest prompts quite reliably but o3-mini is very inconsistent. So from my anecdotal experience o1 is the superior model in terms of capability.
The fact they would exclude it from their benchmarks seems biased/desperate and makes me trust them less. They probably thought it was clever to leave o1 out, something like "o3 is the newest model lets just compare against that", but I think for anyone paying attention that decision will backfire.
It's a reasonable comparison given it'll likely be priced similarly to o3-mini. I find o1 to be strictly better than o3-mini, but still use o3-mini for the majority of my agentic workflow because o1 is so much more expensive.
Why would you compare against all the models from a competitor? You take their latest one that you can test. OpenAI or Anthropic don't compare against the whole Gemini family.
I do not intend to take anything away from the technical achievement of the team. However, as Satya opined some weeks back, these benchmarks do not mean a lot if we do not see a comparable increase in productivity.
But then there are two questions. First, are the white-collar workers responsible for the increase in productivity specifically consultants and engineers? Or is it the white-collar workers at the very right tail, e.g., scientists?
I think consultants and engineers are using these technologies a lot. I think biologists at least are using these models a lot.
It's a complex proposition.
I think Satya was talking about actual GDP growth, right?
In theory, let's say all knowledge work is now 50% faster due to A.I. Well then I would assume this should affect civil society as well - planning a bridge, a railway etc. should happen faster and more efficiently (the actual building of things won't, but a lot of time is spent on planning and red tape).
Healthcare in general should become way more efficient with people getting better treatment; this should have a positive economic effect.
It does seem to me like it should be able to speed things up in the real world but of course a lot will have to do with how well the models can reason / how often they make catastrophic mistakes + the will of the governments and people to start using them seriously.
But it's more complex than that - if many people start losing their jobs we all take a hit on GDP because they can't consume as much anymore, so it could take perhaps a long time until GDP actually sees meaningful gains.
And one last thought - Satya likely hasn't spent much time thinking about gdp, it's just not his field. He's a smart guy for sure but this isn't what he does.
Unemployment hasn't really picked up, and is unlikely to do so, unless the central bank is incompetent. (They have been from time to time.)
However, some advances don't show up in GDP. Eg Wikipedia is a tremendous achievement. But nobody pays for it, so it doesn't show up in GDP statistics.
> Unemployment hasn't really picked up, and is unlikely to do so
That's an important assessment. I don't know if you're right. If the models are going to continue to get more capable, I'm expecting unemployment to rise; I don't see how it won't (sure, we are promised A.I. will create tons of new jobs no one has imagined yet, but I haven't seen a reliable clue for such jobs yet).
I can feel this already with my own use of language models.
All the questions I had before language models, I have answered with language models.
That doesn't mean I have no more questions though. Answering those questions opened up 10X more questions I have now.
In general, everyone knows that answering scientific questions leads to new and more questions. It is the exact same process in the economy. There is a collectivist sentiment though in society and the economy that wants to pretend this isn't true. That the economic questions can be "solved", the spoils divided up and we live happily ever after in some kind of equilibrium.
As far as new jobs, they are here now but they surely sound as ridiculous to think about as being a professional youtuber in 2005. Or I think of the person making a geocities website in 1997 vs a front end developer. There is no date that a front end developer emerges from the html code monkey. It is a slow and organic process that is hard to game.
> As far as new jobs, they are here now but they surely sound as ridiculous to think about as being a professional youtuber in 2005
How many people can make an actual living out of Youtube? Surely they exist, but to reliably live off it for decades (not just 1-2 years of temporary fame - which is also very hard to come by) I'd say fewer than one in ten thousand people will make it. I can't call "Youtuber" a career path with that kind of success rate any more than I can call being an actor in Hollywood a career path.
No, it won't (necessarily) be AI that's creating the new jobs. In general, when a new technology comes along and automates away some jobs, you can't expect the same technology to provide the new jobs.
To give an example from the recent past: 'hipster' baristas that make you a five dollar coffee are a fairly new job. At least at scale.
But I doubt you'll be able to find any technology that automated some other job but created barista jobs.
It's just that the market will find stuff for people to do for money, unless prevented from doing so by incompetent central bank policy or (too) onerous labour market regulation.
(The labour market can take quite a lot of regulation, and still be able to get people jobs. Have a look at Germany today for an example.)
> It's just that the market will find stuff for people to do for money
Will it ?
Let's take my example: I'm a 41-year-old male with around 15 years of experience in software development. Let's say 4 years from now myself and a million others are losing our development jobs to A.I.
What does the market have for my skills? I can try going into healthcare or teaching (though that's quite an extensive retraining + salary reduction), I can go into the trades (same) or get some other work that's hard to automate like caring for old people (very low salary). All of these options involve massive salary reduction, and that's in the positive scenario that I actually am able to retrain and survive such a shift mentally.
It's quite likely many software devs won't be able to become plumbers and nurses and will become chronically unemployed.
Well, we have many examples where in the past technology (and to a lesser extent trade) has led to some sectors of the economy using fewer people than before.
The situation you describe isn't all that special.
Yes, losing your job (or your career) is not fun, and can be painful. Massive salary reduction can happen.
No, that hasn't led to widespread unemployment in the past. At least not widespread enough to be visible in aggregate statistics, especially over the noise of the 'normal' business cycle. However, individuals can obviously have pretty long spells of unemployment, but that can also happen without a shift in technology.
> Yes, losing your job (or your career) is not fun, and can be painful. Massive salary reduction can happen.
I'm just trying to get the point across that unemployment might rise, so GDP may fall; in fact I think that should be the baseline scenario, rather than assuming some new jobs we can't imagine yet will be created.
It's so hard to imagine these new jobs because if the machines will out perform us cognitively it follows we will be able to get intelligent robots into the real world quite soon after. Then seriously what the heck is left? Fewer jobs, not more.
There is one "cure" I can think of for this and that's something closer to socialism, the market will have to step aside and the government will create massive amounts of new jobs. For example classes can be 5 pupils per teacher instead of 30 pupils per teacher. Nurses can attend to 3 patient beds instead of 8.
But letting the market sort this out ? I don't think so.
As it stands currently I'd say this is difficult to measure.
They're not baked into workflows where the measurable output is easily attributed to the model use. Productivity in its current form is transformative in the sense that the use case and gain differ for each individual (who even provides different prompts). So some are keeping the gains for themselves, others are using it to improve quality rather than quantity.
It'll come in time; it's important to remember GPT-4 was released 2 years ago this month. The newer models are more reliable and could probably be introduced into workflows more frequently. Today I spoke to a company who are looking to use it to reduce costs in the next year.
That’s true, but productivity has many factors and takes a long time to get confidence on. Any productivity value that could be stated clearly would have similar downsides to a benchmark, and take far longer.
Benchmarks are useful as leading indicators. Early warning signs. If there’s no relation to the eventual productivity then hopefully that benchmark will disappear as it’s not useful.
In a fast moving space like this it’s reasonable to make use of leading indicators.
I asked for the direction of friction on a ball rolling either up or down an inclined plane - it gave the wrong answer and was adamant about it. Surprisingly, similar to o1.
Gave it a problem which sounds like the Monty Hall problem but is a simple probability question, and it nailed it.
Asked it to tell a joke - the most horrible joke ever.
Much better than o1 but still nowhere near AGI. It has been optimized for logic and reasoning at best.
Why not enable Canvas for this model on Gemini.google.com? Arguably the weakest link of Canvas is the terrible code that Gemini 2.0 Flash writes for Canvas to run.
curl -s "https://hn.algolia.com/api/v1/items/43473489" | \
jq -r 'recurse(.children[]) | .author + ": " + .text' | \
llm -m "gemini-2.5-pro-exp-03-25" -s \
'Summarize the themes of the opinions expressed here.
For each theme, output a markdown header.
Include direct "quotations" (with author attribution) where appropriate.
You MUST quote directly from users when crediting them, with double quotes.
Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
It will be a huge achievement if models can get to the point where so much selection effort isn't required: gemini.google.com currently lists 2.0 Flash, 2.0 Flash Thinking (experimental), Deep Research, Personalization (experimental), and 2.5 Pro (experimental) for me.
There's probably a sweet spot here. On the flip side, ChatGPT currently doesn't indicate whether a given image generation request was serviced by multimodal GPT-4o [1] or Dall-E.
Personally, I do like the "use web search" and "extended thinking" buttons, but ultimately, the models should probably be able to figure out whether doing so would be useful themselves too.
Can anyone share what they're doing with reasoning models? They seem to only make a difference with novel programming problems, like Advent of Code. So this model will help solve slightly harder advent of codes.
By extension it should also be slightly more helpful for research, R&D?
Have been using them for non-interactive coding where latency is not an issue. Specifically, turning a set of many free-text requirements into SQL statements, so that later when an item's data is entered into the system, we can efficiently find which requirements it meets. The reasoning models' output quality is much better than the non-reasoning models like 3.5 Sonnet, it's not a subtle difference.
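A rough sketch of that pattern as I understand it from the description (the schema, the prompt and the call_llm stub are all made up for illustration, not their actual system): the expensive reasoning step runs once per requirement, offline, and matching items later is just a cheap query.

    import sqlite3

    def call_llm(prompt: str) -> str:
        # Stand-in for whatever reasoning-model API is used; in practice the
        # generated predicate should be reviewed before being trusted.
        return "weight_kg < 2 AND ship_region = 'EU'"

    requirement = "Items must weigh under 2 kg and ship from the EU."
    predicate = call_llm(
        "Translate this requirement into a single SQL WHERE clause over "
        "items(id, weight_kg REAL, ship_region TEXT). Return only the clause.\n"
        f"Requirement: {requirement}"
    )

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE items (id INTEGER, weight_kg REAL, ship_region TEXT)")
    con.execute("INSERT INTO items VALUES (1, 1.4, 'EU'), (2, 3.0, 'EU')")
    print(con.execute(f"SELECT id FROM items WHERE {predicate}").fetchall())
    # [(1,)] -> item 1 meets the requirement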
I found reasoning models are much more faithful at text-related tasks too (e.g. 1. translating long key-value pairs (i.e. Localizable.strings), 2. long transcript fixing and verification, 3. looking at CSV / tabular data and fixing it), probably due to the reflection mechanism built into these reasoning models. Using prompts such as "check your output to make sure it covers everything in the input" lets the model double-check its work, avoiding more manual checks on my end.
Seriously? That doesn't require a human?! Are we talking about some kind of "generic" incident? (Type 3: forgot to manually update the xxxx file.) Or what's going on?
I know next to nothing about AI, but I just experienced an extraordinary hallucination in a google AI search (presumably an older Gemini model right?) as I elaborated in detail in another HN thread. It might be a good test question. https://news.ycombinator.com/item?id=43477710
Is this the first model announcement where they show Aider's Polyglot benchmark in the performance comparison table? That's huge for Aider and anotherpaulg!
> Developers and enterprises can start experimenting with Gemini 2.5 Pro in Google AI Studio now, and Gemini Advanced users can select it in the model dropdown on desktop and mobile. It will be available on Vertex AI in the coming weeks.
I'm a Gemini Advanced subscriber, still don't have this in the drop-down model selection in the phone app, though I do see it on the desktop webapp.
"Hi, here is our new AI model; it performs task A x% better than our competitor 1 and task B y% better than our competitor 2" seems to be the new hot AI template in town.
"My info, the stuff I was trained on, cuts off around early 2023." - Gemini 2.5 to me. Appears that they did a not-so-recent knowledge cutoff in order to use the best possible base model.
There is no point in asking such questions, the model doesn't know what it is on its own, and you could get many different answers if you repeat it a few more times.
I tested out Gemini 2.5 and it failed miserably at calling into tools that we had defined for it. Also, it got into an infinite loop a number of times where it would just spit out the exact same line of text continuously until we hard killed the process. I really don't know how others are getting these amazing results. We had no problems using Claude or OpenAI models in the same scenario. Even Deepseek R1 works just fine.
I can imagine that it's not so interesting to most of us until we can try it with cursor.
I look forward to doing so when it's out. That Aider bench mixed with the speed and a long context window that their other models are known for could be a great mix. But we'll have to wait and see.
More generally, it would be nice for these kinds of releases to also report speed and context window as separate benchmarks. Or somehow include them in the score. A model that is 90% as good as the best but 10x faster is quite a bit more useful.
These might be hard to mix to an overall score but they're critical for understanding usefulness.
It's "experimental", which means that it is not fully released. In particular, the "experimental" tag means that it is subject to a different privacy policy and that they reserve the right to train on your prompts.
2.0 Pro is also still "experimental" so I agree with GP that it's pretty odd that they are "releasing" the next version despite never having gotten to fully releasing the previous version.
Thanks. I think my post lacked clarity of what I was talking about. I meant that most people care about API access to use with their favorite editor. It's a big limiter with grok, for example.
But I did mingle that with my knowledge of google's history of releasing without releasing these models which, as you point out, isn't true with this release.
Edit: Sharing it - https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
with pastebin - https://pastebin.com/aiWuYcrF
We are currently at nonsensical pacing while writing novels.
Most human-written books don't do that, so that seems to be a criterion for a very different test than a Turing test.
There were popular writeups about this from the Deepseek-R1 era: https://www.tumblr.com/nostalgebraist/778041178124926976/hyd...
The pace is moving so fast I simply can't keep up. Is there an ELI5 page which gives a 5-minute explanation of LLMs from 2020 to this moment?
Not sure what is better for humanity in long term.
I could build a machine that phones my mother and tells her I love her, but it wouldn't obsolete me doing it.
Let's not forget the OpenAI benchmarks saying GPT-4 could do better at college exams and such than most students. Yet real-world performance on real tasks was laughable.
I am amazed at the progress that we are _still_ making on an almost monthly basis. It is unbelievable. Mind-boggling, to be honest.
I am certain that the issue of pacing will be solved soon enough. I'd give 99% probability of it being solved in 3 years and 50% probability in 1.
Yes, Z is indeed a big advance over Y, as Y was a big advance over X. Also yes, Z is just as underwhelming.
Are customers hurting the AI companies' feelings?
No. It's the critics' feelings that are being hurt by continued advances, so they keep moving goalposts so they can keep believing they're right.
Yeah, but 10% plus 20% plus 20%... next thing you know you're at +100% and your server is literally double the speed!
AI progress feels the same. Each little incremental improvement alone doesn't blow my skirt up, but we've had years of nearly monthly advances that have added up to something quite substantial.
(For those too young or unfamiliar: Mary Poppins famously had a bag that she could keep pulling things out of.)
In that light, even a 20 year old almost broken down crappy dinger is amazing: it has a radio, heating, shock absorbers, it can go over 500km on a tank of fuel! But are we fawning over it? No, because the goalposts have moved. Now we are disappointed that it takes 5 seconds for the Bluetooth to connect and the seats to auto-adjust to our preferred seating and heating setting in our new car.
There are some logical inconsistencies, though. For instance, when they both enter the cellar through a trapdoor, Kael goes first, but the innkeeper instructs him to close the trapdoor behind them, which makes no sense. Also, Kael goes down the stairs and "risks a quick look back up" and can somehow see the front door bulging and the chaos outside through the windows, which obviously is impossible when you look up through a trapdoor, not to mention that previously it was said this entry is behind the bar counter, surely blocking the sight. Kael lights an oily rag which somehow becomes a torch. There are more generic issues, like these Eldertides supposedly being mythical things no one has ever seen, yet they seem to be pretty common occurrences? The dimensions of the cellar are completely unclear: at first it seems to be very small, yet they move around it quite a bit. There are other issues, like people using the same words as the narrator ("the ooze"), as if they can hear him, and the innkeeper suddenly calling Kael by his name as if they already know each other.
Anyway, I would rate it "first draft". Of course, it is unclear whether the LLM would manage to write a consistent book, but I can fully believe that it would manage. I probably wouldn't want to read it.
That would be the next step but I'd previously never thought going any further might be worth it.
And it also, as you say, really reuses words. Just reading this chapter I notice "phosphorescence" 4 times, for example, and "ooze" 17 times (!).
It is very impressive though that it can create a somewhat cohesive storyline, and certainly an improvement over previous models.
LLMs have been producing new things all the time. The question was always about quality of output, never about being able to produce anything new.
10000s is still much smaller than the space of possibilities for even a short prompt.
You might be right that good human novels are better than what LLMs can manage today. But that's rapidly changing.
And if you really need that Harry Potter / Superman / Three Musketeers crossover fan fiction itch scratched, you might not care that some other existing novel is 'better' in some abstract sense.
I've read a LOT of fiction. I love reading. And if it's good enough, the idea of reading something created by a machine does not bother me at all. So of course I will continue to see if the machine is finally good enough and I can be a bit more specific.
Tens-of-thousands is probably low by something in the neighborhood of four orders of magnitude.
Human-generated slop is real.
[0] https://github.com/tluyben/bad-writer
Planning generated content is often more important to invest in than the writing of it.
Looking at your paste, your prompt is short and basic; it should probably be broken up into clear, formatted sections (try directives inside XML-style tags). For such a large output as you're expecting, I'd expect a considerable prompt of rules and context setting (maybe a page or two).
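For what it's worth, a rough illustration of what those sections might look like, written out as a Python string (the tag names and rules below are invented; the point is only the clearly delimited structure):

    prompt = """
    <role>
    You are drafting one chapter of a long novel. Work only on this chapter.
    </role>

    <plan>
    Outline of the whole book, plus a one-paragraph summary of every prior chapter.
    </plan>

    <rules>
    - Do not introduce characters that are not listed in <characters>.
    - Do not resolve any plot thread marked "later" in the outline.
    - Target 3,000-4,000 words; end on an unresolved beat.
    </rules>

    <style>
    Third person limited, past tense; avoid repeating distinctive words
    ("ooze", "phosphorescence") more than once per scene.
    </style>

    <characters>
    (character profiles go here)
    </characters>
    """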
Just for fun: Asked it to rewrite the first page of ‘The Fountainhead’ where Howard is a computer engineer, the rewrite is hilarious lol.
https://gist.github.com/sagarspatil/e0b5443132501a3596c3a9a2...
I don't know why, but that is just such a literal thing to say that it seems almost random.
- Buildup and happy background world-building
- Subtle foreshadowing
- Orcs attack
- Hero is saved by unlikely warrior of astounding beauty
- Evil is defeated until sales justify unnecessary sequel
That's the kind of story fit for the modern attention span...
Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.
For those curious the riddle is: There's three people in a circle. Each person has a positive integer floating above their heads, such that each person can see the other two numbers but not his own. The sum of two of the numbers is equal to the third. The first person is asked for his number, and he says that he doesn't know. The second person is asked for his number, and he says that he doesn't know. The third person is asked for his number, and he says that he doesn't know. Then, the first person is asked for his number again, and he says: 65. What is the product of the three numbers?
https://www.reddit.com/r/math/comments/32m611/logic_question...
So it’s likely that it’s part of the training data by now.
Google spits out: "The product of the three numbers is 10,225 (65 * 20 * 8). The three numbers are 65, 20, and 8."
Whoa. Math is not AI's strong suit...
Bing spits out: "The solution to the three people in a circle puzzle is that all three people are wearing red hats."
Hats???
Same text was used for both prompts (all the text after 'For those curious the riddle is:' in the GP comment), so Bing just goes off the rails.
You know what would be even stupider? Putting a cheap, wrong LLM response atop each search query.
The second result is this reddit.com answer, https://www.reddit.com/r/math/comments/32m611/logic_question..., where at least the numbers make sense. I haven't examined the logic portion of the answer.
Bing doesn't list any reddit posts (that Google-exclusive deal) so I'll assume no stackexchange-related sites have an appropriate answer (or bing is only looking for hat-related answers for some reason).
I'm eagerly awaiting an article about RAG caching strategies though!
- This kid is a genius! - you yell
- But wait, the kid has just picked an answer from the ground, it didn't actually come up...
- But the other toddlers could do it also but didn't!
There are lots of possible mechanisms by which this particular problem would become more prominent in the weights in a given round of training even if the model itself hasn't actually gotten any better at general reasoning. Here are a few:
* Random chance (these are still statistical machines after all)
* The problem resurfaced recently and shows up more often than it used to.
* The particular set of RLHF data chosen for this model draws out the weights associated with this problem in a way that wasn't true previously.
Recently these have stopped, and now the ads are about becoming a maths tutor for AI.
Doesn’t seem like a role with long-term prospects.
Generally speaking, posting output from an LLM without explaining exactly what you think it illustrates and why is frowned upon here. I don't think your comment does a great job of the latter.
> I don't think this means what you think it means.
> I did some interacting with the Tencent model that showed up here a couple days ago [...]
> This is a question that obviously was in the training data. How do you get the answer back out of the training data?
What do I think the conversation illustrates? Probably that having the answer in the training data doesn't get it into the output.
How does the conversation illustrate that? It isn't subtle. You can see it without reading any of the Chinese. If you want to read the Chinese, Google Translate is more than good enough for this purpose; that's what I used.
I cannot figure out what the comment is trying to get across either. It's easy for you because you already know what you are trying to say. You know what the pasted output shows. The poor execution is in not spending enough time thinking about how someone coming in totally blind would interpret the comment.
I can't, and I imagine most of the people who downvoted you couldn't either.
I think asking people to go to Google Translate to parse a random comment that seems to be 90% LLM output by volume is a bit much.
I didn't downvote you, but like (probably) most people here, I can't read Chinese; I can't derive whatever point you're trying to make just from with text you provided.
Even ignoring the fact that this or a similar problem may have appeared in the training data, it's something careful brute-force logic should solve. It's neither difficult, nor interesting, nor useful. Yes, it may suggest a slight improvement in basic logic, but no more so than a million other benchmarks people quote.
This goes to show that evaluating models is not a trivial problem. In fact, it's a hard problem (in particular, it's a far far harder than this math puzzle).
That's not to say that you didn't, or that you're just recalling it from a previous time you happened to see this exact puzzle (despite there being scant prior references to this puzzle, which is precisely the reason for using it). But you can see how some might see that as not entirely credible.
Best guess: this random person is someone that really likes puzzles, is presumably good at them and is very, very far from being representative to the extent you would require to be in support of your argument.
Read: just a heavy flex about puzzle solving.
In my experience, the difference between someone who solves this type of logic puzzle and someone who doesn't, has more to do with persistence and ability to maintain focus, rather than "intelligence" in terms of problem-solving ability per se. I've worked with college students helping them learn to solve these kinds of problems (eg. as part of pre-interview test prep), and in most cases, those who solve it and those who don't have the same rate of progress towards the solution as long as they're actively working at it. The difference comes in how quickly they get frustrated (at themselves mostly), decide they're not capable of solving it, and give up on working on it further.
I mention this because this frustration itself comes from a belief that the ability to solve these belongs to some "exceedingly clever" people only, and not someone like them. So, this kind of thinking ends up being a vicious cycle that keeps them from working on their actual issues.
It was a fun puzzle though and I'm surprised I didn't know it already. Thanks for sharing.
I randomly answered this post and can't solve it in half an hour. Is the point leetcode but for AI? I'd rather it solve real problems than "elite problems".
Side note: couldn't even find pen and paper around in half an hour.
Granted, this is all way over my head, but the solution Gemini comes to matches the one given on Reddit (and now here, in future training runs)
65×26×39=65910
I don't think Gemini does an unnecessary amount of computation, it's just more verbose. This is typical of reasoning models, almost every step is necessary but many would not be written down by a human.
Especially as the company archives all those chats (not sure for how long) and is small enough that a billion-dollar "data sharing" agreement would be a very enticing offer.
If there isn't a significant barrier to access, it's being scraped. And if that barrier is money, it's being scraped but less often.
i.e. You observe three people in a magical room. The first person is standing underneath a 65, the second person is standing underneath a 26 and the third person is standing underneath a 39. They can see the others numbers but not the one they are directly under. You tell them one of the three numbers is the sum of the other two and all numbers are positive integers. You ask the first person for their number, they respond that they don't know. You ask the second person for their number, they respond that they don't know. You ask the third person, they respond that they don't know. You ask the first person again and they respond with the correct value, how did they know?
And of course, if it responds with a verbatim answer in the line of https://www.reddit.com/r/math/comments/32m611/logic_question..., we can be pretty confident what's happening under the hood.
Gemini 2.5 and Claude 3.7 Thinking get it right; o3-mini and 4o get it wrong.
[1] https://en.wikipedia.org/wiki/100_prisoners_problem
Tried this in DeepSeek and Grok and it kept thinking in loops for a while, so I just turned it off.
I haven’t seen a question loop this long ever.
Very impressed
https://g.co/gemini/share/badd00a824d2
It's basically: given a sequence of heap operations (insert element, delete minimum element), can you predict the left-over elements (that are in the heap at the end) in linear time in the comparison model?
(The answer is surprisingly: Yes.)
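For anyone who had to re-read that: here is the straightforward O(n log n) simulation in Python, just to pin down what "left-over elements" means. The interesting claim above is that the answer can be computed in linear time in the comparison model, which this naive version makes no attempt at.

    import heapq

    def leftover(ops):
        """ops: a list of ('insert', x) and ('deletemin',) operations."""
        heap = []
        for op in ops:
            if op[0] == 'insert':
                heapq.heappush(heap, op[1])
            else:
                heapq.heappop(heap)
        return sorted(heap)

    print(leftover([('insert', 5), ('insert', 2), ('deletemin',), ('insert', 7)]))
    # -> [5, 7]: the 2 was the minimum at the time of the deletemin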
(All state is stored in localStorage so you can come back to it :) ).
It's not an exaggeration it's a non-sequitur, you first have to show that the LLMs are reasoning in the same way humans do.
After a couple of minutes it decided on the answer being 65000. (S = {65, 40, 25})
Call the three numbers a, b, and c. This means c = a + b, but we still don’t know to which person each number belongs.
When person 1 (p1) is asked what his number is, he has no way to know whether he has a, b, or c, so he says he doesn’t know. Same goes for p2 and p3. Clearly p1 somehow gains information by p2 and p3 passing. Either he realizes that he must be either a or b, and such his number is the difference between p2 and p3’s numbers, or he realizes that he must be c and so his number is the sum of p2 and p3’s numbers.
That’s all I have so far. Anyone have other ideas?
If p1 KNOWS that he’s the largest then he has to have gained some other piece of information. Say the numbers he sees are 32 and 33. His number would have to be either 1 or 65. If p1 was 1 then the other two would have known p1 couldn’t be the sum of the other two
If p2 sees 1 and 33, s/he would wonder if s/he is 32 or 34.
P3 would consider 31 or 33.
P1 knows that P2 and P3 are not equal. So they know that the set isn't [2A, A, A].
P2 knows that P1 and P3 are not equal. So they know that the set isn't [A, 2A, A]. They also know that if P1 doesn't know, then they were able to make the same deduction. So they now know that both [2A, A, A] and [A, 2A, A] aren't correct. Since they know that [2A, A, A] isn't correct, they can also know that [2A, 3A, A] isn't correct either. Because they'd be able to see if P1 = 2A and P3 = A, and if that were true and P1 doesn't know their number, it would have to be because P2 isn't A. And if P2 isn't A, they'd have to be 3A.
P3 knows that P1 and P2 aren't equal. Eliminates [A, A, 2A]. Knows that [2A, A, A], [A, 2A, A], and [2A, 3A, A], are eliminated. Using the same process as P2, they can eliminate [2A, A, 3A], [A, 2A, 3A], and also [2A, 3A, 5A]. Because they can see the numbers and they know if P1 is 2A and P2 is 3A.
Now we're back at P1. Who now knows.
So P2 and P3 are in the eliminated sets. Which means we're one of these
[2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]
We know his number is 65. To find the set, we can factor 65: (5 * 13). We can check the other numbers 2(13) = 26. 3(13) = 39. And technically, you don't need to find the other numbers. The final answer is 5A * 2A * 3A or (A^3) * 30.
Why? Couldn't it be an infinite number of 3 size arrays comprised of A where two elements sum to the third? [24A, 13A, 11A]? How did we deduce this set of arrays?
EDIT: Solved from another reddit comment. Tuples without a common factor like the one above are considered as a=1.
"They're not eliminated; they correspond to a = 1."
You should be able to generate an infinite number of these problems just by multiplying the first formula factor by a prime number. Take the same question but with the person answering '52': that restricts you to either [4a, 3a, a] or [4a, a, 3a] with a = 13. Since the question only asks for the product of all the terms, the answer is 52 * 39 * 13 = 26,364 either way.
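Since this subthread keeps going back and forth about which triples survive the three passes, here is a small brute-force check in Python (my own sketch, nothing any model produced): it simulates the common-knowledge reasoning recursively and searches a bounded range for triples where person 1 holds 65, everyone passes once, and person 1 then knows.

    from functools import lru_cache

    def candidates(x, y):
        # Given the two numbers you can see, your own is either their sum
        # or their (positive) difference.
        vals = {x + y}
        if x != y:
            vals.add(abs(x - y))
        return vals

    @lru_cache(maxsize=None)
    def knows(triple, turn):
        # Would the speaker at `turn` (turns cycle through persons 0,1,2)
        # know their own number, given that every earlier speaker passed?
        person = turn % 3
        seen = tuple(v for i, v in enumerate(triple) if i != person)
        consistent = 0
        for v in candidates(*seen):
            hypo = tuple(v if i == person else triple[i] for i in range(3))
            if all(not knows(hypo, t) for t in range(turn)):
                consistent += 1
        return consistent == 1

    hits = []
    for a in range(1, 200):
        for b in range(1, 200):
            triple = (65, a, b)
            # "One of the numbers is the sum of the other two":
            if not (a + b == 65 or a == 65 + b or b == 65 + a):
                continue
            # Three passes, then person 1 knows on their second turn.
            if (not knows(triple, 0) and not knows(triple, 1)
                    and not knows(triple, 2) and knows(triple, 3)):
                hits.append(triple)

    print(hits)  # should print [(65, 26, 39)] -> product 65 * 26 * 39 = 65910

Within that search range it agrees with the Reddit answer quoted above.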
So A + B = C and A + C = B. But we know that A + B = C, so we can replace C with (A + B). So we know that A + A + B = B.
So 2A + B = B. Or 2A = 0.
And this holds any way you slice it.
Even if you were to try and brute force it.
A = 1
B = 2
Then C = 3. But A + C has to equal B. That's 1 + 3 = 2? That's not true.
I don't see a case where you can add to the sum of two numbers one of the numbers and get the other number.
I'm guessing that's a misreading of the problem. Because it looks like the third number is the sum of the first two.
The original problem is a little ambiguously worded. You could say "one of their numbers is the sum of the other two" and it would be a little clearer.
No it isn't. If it said "the sum of any two of the numbers is equal to the third", that would be a contradiction. What it says is "the sum of two of the numbers is equal to the third".
Buying two of the items gets you the third for free.
The implication is any two.
It’s ok that it’s ambiguous. It happens. In most cases, we clarify and move on. There’s no need to defend it.
It's a common logic puzzle. Google can't turn up an exact match to the wording you have, but e.g. here: https://www.futilitycloset.com/2018/03/03/three-hat-problem/
That’s one-shot for o1 pro.
Let’s carefully analyze this puzzle step-by-step:
⸻
Step 1: Define the Problem Clearly
We have three people in a circle. Let’s call them: • Person A (first person) • Person B (second person) • Person C (third person)
Each has a positive integer above their head. Each sees the other two numbers but not their own. The numbers satisfy: A + B = C \quad \text{or} \quad B + C = A \quad \text{or} \quad C + A = B
Given the puzzle scenario, we need to find out the exact numbers given the clues, and ultimately find the product of the three numbers.
⸻
Step 2: Logic of the Puzzle (Step-by-Step Reasoning)
Initial conditions: The numbers are positive integers. The puzzle implies a situation of ambiguity at first, but after multiple rounds of “I don’t know,” Person A finally knows their number is 65.
Let’s denote clearly the conditions again: • Exactly one of the following three equations is true: • A + B = C, or • B + C = A, or • C + A = B.
Person A sees B and C, but not A. Similarly for Persons B and C.
⸻
Step 3: Logical deductions from the repeated “I don’t know”
First round deductions: • Person A initially doesn’t know their number. This means, from Person A’s perspective, both scenarios A = B + C and A + B = C (or A + C = B) are still possible. If Person A had seen identical numbers, for instance, Person A could have instantly known their number. Thus, Person A sees two different numbers. • Similarly, Person B doesn’t know either, implying that Person B also sees ambiguity in their own number. Thus, Person B also sees two numbers that could result in at least two possible solutions. • Person C also doesn’t know, same logic as above.
Thus, after one full round, we understand that initially, all three numbers must create ambiguity. No trivial case (like two identical numbers) is present.
Second round deductions:
After hearing that no one knows their number, each person gains information: they now know the scenario is one in which none of the other two people could have immediately solved it.
Crucially, in the second round: • Person A now knows their number is 65 after considering the previous statements. This implies that Person A sees numbers B and C, and from these numbers plus the statements of uncertainty, Person A deduces that their own number can only be 65.
⸻
Step 4: Deriving the Numbers (Critical deduction)
Let’s denote clearly again: Person A’s number is now known to be 65.
We have three possible equations. Let’s test each carefully given A = 65: 1. Case 1: A + B = C Then we have: • 65 + B = C. Person A sees numbers B and C, so they would see B and 65 + B. Person A sees numbers B and 65+B. If Person A sees these two numbers, they might initially consider their number could be either the sum or difference of these two. • However, if Person A sees two numbers, and one is exactly 2× the other, Person A could immediately guess their number is the difference (since if you see numbers X and 2X, your number is clearly X). So this scenario of B and 65+B is okay only if these two numbers don’t exhibit an immediate trivial scenario. • Let’s hold this scenario and continue examining others first. 2. Case 2: B + C = A (So, B + C = 65) • Then Person A sees numbers B and C, adding exactly to 65. • Initially, Person A sees two numbers summing exactly to 65. Could Person A know their number immediately at first glance? No, since their number could either be the sum (65) or the difference of the two numbers. Ambiguity here initially makes sense. • Initially, Person B and Person C also see ambiguous scenarios. They both don’t know their numbers either at first round. This scenario is plausible. 3. Case 3: C + A = B (So, C + 65 = B) • Then Person A sees numbers B = C + 65 and C. • Person A sees C and C+65. Similar reasoning as Case 1.
⸻
Step 5: Identifying the Correct Scenario
The puzzle’s logic typically aligns best with Case 2: B + C = A = 65.
Here’s why clearly stated: • Initially, Person A sees two numbers adding to exactly 65. At first glance, Person A’s number could either be 65 (sum) or the difference between the two numbers. • Person B also initially sees two numbers and cannot distinguish immediately their own number, nor can Person C. • After everyone states “I don’t know,” Person A gains critical insight: if the difference between the two numbers (that A sees) allowed Person B or Person C to know their number immediately, they’d have known right away. The fact that B and C didn’t know means the difference is not immediately obvious to B or C. Thus, the difference cannot be trivial (such as identical numbers or easy multiples).
Therefore, the two numbers must be such that their difference doesn’t immediately clue in the other persons. This occurs most clearly with pairs of numbers which have multiple factorization possibilities or aren’t trivial multiples.
But we must carefully test quickly to find a suitable, plausible integer pair whose sum is 65 and whose ambiguity would logically survive through multiple rounds. The puzzle is a classic logic puzzle with a known structure:
A common scenario in these classic puzzles is the use of Fibonacci-like or similar “golden” sets. Let’s quickly check the following well-known ambiguous triple: • Numbers: 65, 39, 26 • Check quickly: indeed, 39 + 26 = 65.
Checking logic quickly: • Person A sees 39 and 26. Initially ambiguous (number could be 65 or 13). • Person B sees 65 and 26. Could be either 39 or 91 at first glance. Initially ambiguous. • Person C sees 65 and 39. Could be either 26 or 104 initially. Initially ambiguous.
After each person says they don’t know:
• Now, Person A considers: If my number was 13, then Person B would see numbers 13 and 26. Person B would then immediately know their number is 39 (13 + 26 = 39), no ambiguity. But Person B didn’t know! So Person A’s number cannot be 13. Hence, Person A confidently concludes their number is 65.
Thus, the numbers 65, 39, and 26 satisfy the puzzle conditions perfectly!
⸻
Step 6: Final Answer (Product of the Numbers)
The three numbers identified clearly are:
• A = 65
• B = 39
• C = 26
Then, the product is: 65 × 39 × 26 = 65 × 1014 = 65,910
⸻
Final Answer:
65,910
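One way to sanity-check this sort of derivation is to brute-force the possible worlds and replay the announcements. A minimal sketch in Python (the search bound and the exact encoding of each "I don't know" are my assumptions, not something given in the puzzle):

    from itertools import product

    LIMIT = 130  # search bound (assumption); everything relevant to A = 65 stays below it

    # Possible worlds: positive-integer triples (A, B, C) in which one number
    # is the sum of the other two.
    worlds = {(a, b, c)
              for a, b, c in product(range(1, LIMIT + 1), repeat=3)
              if a == b + c or b == a + c or c == a + b}

    def prune_unknowing(worlds, person):
        """Keep only worlds consistent with `person` (0=A, 1=B, 2=C) saying
        "I don't know": their view of the other two numbers must still be
        compatible with more than one value for their own number."""
        def view(w):
            return tuple(v for i, v in enumerate(w) if i != person)
        own_by_view = {}
        for w in worlds:
            own_by_view.setdefault(view(w), set()).add(w[person])
        return {w for w in worlds if len(own_by_view[view(w)]) > 1}

    # Round 1: A, B and C each announce "I don't know", in that order.
    for person in (0, 1, 2):
        worlds = prune_unknowing(worlds, person)

    # Round 2: worlds in which A can now name their own number, and that number is 65.
    own_by_view = {}
    for w in worlds:
        own_by_view.setdefault((w[1], w[2]), set()).add(w[0])
    solutions = sorted(w for w in worlds if len(own_by_view[(w[1], w[2])]) == 1 and w[0] == 65)

    print(solutions)                             # surviving (A, B, C) triples
    print([a * b * c for a, b, c in solutions])  # and their products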
> Let’s quickly check the following well-known ambiguous triple: • Numbers: 65, 39, 26
Plus it drew me a very decent pelican riding a bicycle.
Notes here: https://simonwillison.net/2025/Mar/25/gemini/
Personally I'm convinced this model is the best out there right now.
https://www.reddit.com/r/Bard/comments/1jjobaz/pelican_on_a_...
https://imgur.com/a/xuPn8Yq
https://imgur.com/a/V4YAkX5
It's sort of irrelevant though as the test is about SVGs.
It could be useful for much more than just silly benchmarks, there's a reason why physics students are taught to draw a diagram before attempting a problem.
https://help.kagi.com/kagi/ai/llm-benchmark.html
High quality, to the point. Bit on the slow side. Indeed a very strong model.
Google is back in the game big time.
This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.
[0] https://aider.chat/docs/leaderboards/
In which case, google should be working on achieving better output format following, as Claude and R1 are able to hit nearly 100% accuracy on the format.
Aider automatically asks models to retry malformed edits, so it recovers. And goes on to produce a SOTA score.
- Our state-of-the-art model.
- Benchmarks comparing to X,Y,Z.
- "Better" reasoning.
It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.
It's interesting how the recent AI announcements are following the same trend over a smaller timeframe.
Once you get all your apps, wallpaper, shortcut order and the same OS back, you really quickly get the feeling you spent $1000 for the exact same thing.
But it needs to be seamless to remove any friction from the purchase, and at the same time, if it feels the same, we feel like we wasted money.
So what I usually do is buy a different colored phone and change the wallpaper.
My MacBook was the same. Seamless transition and 2 hours later I was used to the new m4 speeds.
These LLMs were supposedly limited by the training run, but the point-version models are mostly post-training driven, which seems to take less time.
If models were tied to specific hardware (say, an "AI PC" or whatever), the cycle would get slower and we'd get a slower summer, which I'm secretly wishing for.
So, I think these results are very interesting, if you know what features specifically you are using.
Excited to see if it works this time.
This is the commodification of models. There is nothing special about the new models but they perform better on the benchmarks.
They are all interchangeable. This is great for users as it adds to price pressure.
They are not so good at measuring reasoning, out-of-domain performance, or creativity.
As big players look to start monetizing, they are going to desperately be searching for moats.
Right now, in my opinion, OpenAI actually has a useful deep research feature which I've found nobody else matches. But there is no moat to be seen there.
It's called the "first step fallacy", and AI hype believers continue to fall for it.
The rate of announcements is a sign that models are increasing in ability at an amazing rate, and the content is broadly the same because they’re fungible commodities.
The latter, that models are fungible commodities, is what’s driving this explosion and leading to intense competition that benefits us all.
If these companies start failing to beat the competition, then we should prepare ourselves for very creative writing in the announcements.
Also, I think google's winning the race on actually integrating the AI to do useful things. The agent demo from OpenAI is interesting, but frankly, I don't care to watch the machine use my computer. A real virtual assistant can browse the web headless and pick flights or food for me. That's the real workflow unlock, IMO.
The Achilles heel of LLMs is the distinct lack of practical real-world applications. Yes, Google and Microsoft have been shoving the tech into everything they can fit, but that doesn't a product make.
Practical, real-world application.
If I'm being incredibly generous I will concede that this could have been the case for the first few weeks when it was making headlines, but it clearly isn't true now.
It would be literally impossible to keep up these figures for as long as ChatGPT has without a ton of repeat users. There simply aren't enough people/devices.
I tried it a month ago on all the major frontier models and none of them correctly identified the fix. This is the first model to identify it correctly.
Been playing around with it and it feels intelligent and up to date. Plus it's connected to the internet. It acts as a reasoning model by default when it needs to.
I hope they enable support for the recently released canvas mode for this model soon; it would be a good match.
This seems better, not worse, than keeping the user-conversation mapping so that the user may delete their conversations.
Also, maybe some services are doing "machine learning" training with user data, but this is the first time I've seen a recent LLM service say that your data can be fed to human reviewers at their discretion.
I’ve always found the use of the *.5 naming kinda silly when it became a thing. When OpenAI released 3.5, they said they already had 4 underway at the time, they were just tweaking 3 be better for ChatGPT. It felt like a scrappy startup name, and now it’s spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme.
I’m a much bigger fan of semver (not skipping to .5 though), date based (“Gemini Pro 2025”), or number + meaningful letter (eg 4o - “Omni”) for model names.
If you could get much better performance without changing the algorithm (eg just by scaling), you'd still bump the number.
[1] https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model
I think it makes sense to increase the major / minor numbers based on the importance of the release, but this is not semver.
I have a "test" which consists in sending it a collection of almost 1000 poems, which currently sit at around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes, it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). Previous models don't usually "see" the final poems — they get lost, hallucinate and are pretty much worthless. I have tried several workaround techniques with varying degrees of success (e.g. randomizing the poems).
Having just tried this model (I have spent the last 3 hours probing it), I can say that, to me, this is a breakthrough moment. Truly a leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.
The analysis of this poetic corpus has few mistakes and is very, very, very good. Certainly very good in terms of how quickly it produces an answer — it would take someone days or weeks of thorough analysis.
Of course, this isn't about poetry — it's about passing in huge amounts of information, without RAG, and having a high degree of confidence in whatever reasoning tasks this model performs. It is the first time that I feel confident that I could offload the task of "reasoning" over large corpus of data to an LLM. The mistakes it makes are minute, it hasn't hallucinated, and the analysis is, frankly, better than what I would expect of most people.
Breakthrough moment.
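For anyone who wants to try the same kind of thing, this is roughly what it looks like against the Gemini API from AI Studio; just a minimal sketch, and the model name, file name and prompt are my own assumptions:

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")  # AI Studio API key

    # Assumption: the whole corpus (~230k tokens here) fits in the model's context window.
    corpus = open("poems.txt", encoding="utf-8").read()

    model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")
    response = model.generate_content([
        corpus,
        "The poems above are ordered chronologically. Identify key writing periods "
        "and describe how the style differs between them.",
    ])
    print(response.text)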
Today, it seems like Google has beaten them: it supports a way larger context window and is way better at keeping track of what has been said and remembering older tokens.
1 o1-pro (medium reasoning) 82.3
2 o1 (medium reasoning) 70.8
3 o3-mini-high 61.4
4 Gemini 2.5 Pro Exp 03-25 54.1
5 o3-mini (medium reasoning) 53.6
6 DeepSeek R1 38.6
7 GPT-4.5 Preview 34.2
8 Claude 3.7 Sonnet Thinking 16K 33.6
9 Qwen QwQ-32B 16K 31.4
10 o1-mini 27.0
https://github.com/lechmazur/nyt-connections/
https://www.svgviewer.dev/s/FImn7kAo
"A boy eating a sandwhich"
https://www.svgviewer.dev/s/VhcGxnIR
"A multimeter"
https://www.svgviewer.dev/s/N5Dzrmyt
Maybe it learned from Gianluca's gallery!
"Anna, Becca and Clare go to the play park. There is nobody else there. Anna is playing on the see-saw, Becca is playing on the swings. What is Clare doing?" (Sometimes I ask similar questions with the same structure and assumptions but different activities)
About a year ago none of them could answer it. All the latest models can pass it if I tell them to think hard, but previously Gemini could rarely answer it without that extra hint. Gemini 2.5 caveats its answer a bit, but does get it correct. Interestingly GPT-4o initially suggests it will give a wrong answer without thinking, but recognises it's a riddle, so decides to think harder and gets it right.
I'll be looking to see whether Google would be able to use this model (or an adapted version) to tackle ARC-AGI 2.
From https://x.com/OfficialLoganK/status/1904583353954882046
The low rate-limit really hampered my usage of 2.0 Pro and the like. Interesting to see how this plays out.
This time, the Googler says: “good news! you will be charged for experimental models, though for now it’s still free”
I assumed that meant there was a paid version with a higher rate limit coming out today
[1]: https://ai.google.dev/gemini-api/docs/pricing
[1] https://arxiv.org/pdf/2409.12640
[2] https://fiction.live/stories/Fiction-liveBench-Feb-20-2025/o...
This is most evident when querying about fast-moving dev tools like uv or bun. It seems to only know the original uv options like pip and tools, while with bun it is unfamiliar with bun outdated (from Aug 2024) and bun workspaces (from around that time?), but does know how to install bun on Windows (April 2024).
You'll still need to provide this model with a lot of context to use it with any tooling or libraries with breaking changes or new features from the past ~year - which seems to contradict the AI Studio reported knowledge cutoff.
Were I developing models - I'd prioritise squeezing in the most recent knowledge of popular tools and libraries since development is such a popular (and revenue generating) use case.
The exceptions are autoregressive image generation and audio models.
https://killedbygoogle.com/
I thought memory requirement grows exponentially with context size?
> This nearest-neighbor connectivity is a key difference between TPUs and GPUs. GPUs connect up to 256 H100s in an all-to-all configuration (called a node), rather than using local connections. On the one hand, that means GPUs can send arbitrary data within a node in a single low-latency hop. On the other hand, TPUs are dramatically cheaper and simpler to wire together, and can scale to much larger topologies because the number of links per device is constant.
You can see dramatic improvements in latency and throughput if there is a large shared prefix of the queries.
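On the memory question above: for a standard transformer, the per-request KV cache grows linearly with context length (it's the attention computation that is quadratic), and that cache is exactly what a shared prefix lets you compute once and reuse. A rough back-of-the-envelope, with generic constants rather than anything Gemini-specific:

    KV cache bytes ≈ 2 × n_layers × n_kv_heads × d_head × L × bytes_per_value

where L is the number of cached tokens and bytes_per_value is e.g. 2 for bf16.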
This stuff has a long way to go.
This way they get two rounds of headlines. "Gemini 2.5 released" and later on "Gemini 2.5 coming to all Google accounts."
I've also had it switch languages in the middle of output... like one word in the middle of a sentence was randomly output in some strange hieroglyphs, but when I translated them, it was the right word and the sentence made sense.
For as amazing as these things are, AGI they are not.
I was impressed at first. Then it got really hung up on the financial model, and I had to forcibly move it on. After that it wrote a whole section in Indonesian, which I don't speak, and then it crashed. I'd not saved for a while (ever since the financial model thing), and ended up with an outline and a couple of usable sections.
I mean, yes, this is better than nothing. It's impressive that we made a pile of sand do this. And I'm aware that my prompt engineering could improve a lot. But also, this isn't a usable tool yet.
I'm curious to try again, but wary of spending too much time "playing" here.
This is part of the code output (after several interactions of it not returning actual code):
I'm not very impressed. I pointed out that part of the code, and it answered:
You've correctly pointed out that the TCO implementation in the provided C code snippet is essentially a no-op. The if and else blocks do the same thing: they both call apply(func, args, env). This means there's no actual tail call optimization happening; it's just a regular function call.
But then it follows with even worse code. It does not even compile!
Edit: I did try it anyway and so far the new model is having similar hallucinations. I really need to test my code with Claude 3.7 as a control, to see if it approaches the real ClaudePlaysPokemon's semi-competence.
Edit 2: Here's the log if anyone is curious. For some reason it's letting me make more requests than the stated rate limit. Note how at 11:27:11 it hallucinates on-screen text, and earlier it thinks some random offscreen tile is the stairs. Yes, I'm sure this is the right model: gemini-2.5-pro-exp-03-25.
https://a.qoid.us/20250325/
[1] https://ai.google.dev/gemini-api/docs/rate-limits#tier-1
(DM me for the questions)
Nobody is going to say "Announcing Foobar 7.1 - not our best!"
"It beats all the benchmarks...but you really really don't want to use it."
(Disclaimer, Googler, but I don’t work on any of this, I only have an external layperson’s understanding of it)
Some of the desktop chat clients are turning into great productivity tools. I tried the Claude one last week and quickly went back to Chat GPT. Claude might be a better model for coding. But it's less effort to make Chat GPT do what I want at this point and it's kind of good enough for a lot of stuff. Every release it's getting better. It connects to my IDE automatically, it can look at the files I have open. It can patch those files (I actually disabled that because it's too slow for my taste), etc.
But most importantly, I can trigger all that with option+shift+1. I do this a gazillion times per day. Mostly simple stuff with really short prompts: "check this" (file, selection, current line, etc.), fix that, what do you think about x, "address the FIXMEs/TODOs", "document this", etc.
I can ask other models the same questions and they'd get the job done. But then I have to do more work to give them the same context. Claude has a GitHub connect option, which is great. But unfortunately it's just a glorified file picker, which really sucks. I have files open in my editor; just look at those. I don't want to have to manually open files (do that for me) or specify what files to look at every time I go near the tool.
Chat GPT actually asked me yesterday whether it could add a different file than the one it was looking at. I said "yes" and it did. That's a great UX. Don't make me do work.
That's a good UX.
I use Gemini mainly because it's integrated into Google's tools, so it's kind of there. And Chat GPT for whatever reason cannot look at the browser window. But from a UX point of view, that kind of deep integration is what you want. You have this implicit shared context, which is the thing you are looking at that you don't have to spell out anymore.
The UX of populating the context is the deciding factor in how useful models are at this point, not how well it solves pet benchmark questions or renders pelicans on bicycles.
I have good hopes for agentic coding tools progressing rapidly this year. The ones I've tried recently need a lot of work though. I keep going back to Chat GPT because it's just the quickest & easiest to use at this point.
I've been using o1 almost exclusively for the past couple months and have been impressed to the point where I don't feel the need to "upgrade" for a better model.
Are there benchmarks showing o3-mini performing better than o1?
[1] https://openai.com/index/openai-o3-mini/ [2] https://www.vals.ai/benchmarks/aime-2025-03-24
You just have to use the models yourself and see. In my experience o3-mini is much worse than o1.
The fact they would exclude it from their benchmarks seems biased/desperate and makes me trust them less. They probably thought it was clever to leave o1 out, something like "o3 is the newest model lets just compare against that", but I think for anyone paying attention that decision will backfire.
But then there are two questions. First, are the white-collar workers responsible for the increase in productivity specifically consultants and engineers? Or is it the white-collar workers at the very right tail, e.g., scientists?
I think consultants and engineers are using these technologies a lot. I think biologists at least are using these models a lot.
But then where are the productivity increases?
But it's more complex than that: if many people start losing their jobs, we all take a hit on GDP because they can't consume as much anymore, so it could perhaps take a long time until GDP actually sees meaningful gains.
And one last thought: Satya likely hasn't spent much time thinking about GDP; it's just not his field. He's a smart guy for sure, but this isn't what he does.
Unemployment hasn't really picked up, and is unlikely to do so, unless the central bank is incompetent. (They have been from time to time.)
However, some advances don't show up in GDP. Eg Wikipedia is a tremendous achievement. But nobody pays for it, so it doesn't show up in GDP statistics.
That's an important assessment. I don't know if you're right. If the models continue to get more capable, I'm expecting unemployment to rise; I don't see how it won't (sure, we are promised that A.I. will create tons of new jobs no one has imagined yet, but I haven't seen a reliable sign of such jobs yet).
All the questions I had before language models, I have answered with language models.
That doesn't mean I have no more questions though. Answering those questions opened up 10X more questions I have now.
In general, everyone knows that answering scientific questions leads to new and more questions. It is the exact same process in the economy. There is a collectivist sentiment though in society and the economy that wants to pretend this isn't true. That the economic questions can be "solved", the spoils divided up and we live happily ever after in some kind of equilibrium.
As far as new jobs, they are here now but they surely sound as ridiculous to think about as being a professional youtuber in 2005. Or I think of the person making a geocities website in 1997 vs a front end developer. There is no date that a front end developer emerges from the html code monkey. It is a slow and organic process that is hard to game.
How many people can make an actual living out of YouTube? Surely they exist, but to reliably live off it for decades (not just 1-2 years of temporary fame, which is also very hard to come by) I'd say fewer than one in ten thousand people will make it. I can't call "YouTuber" a career path with that kind of success rate any more than I can call being an actor in Hollywood a career path.
To give an example from the recent past: 'hipster' baristas that make you a five dollar coffee are a fairly new job. At least at scale.
But I doubt you'll be able to find any technology that automated some other job but created barista jobs.
It's just that the market will find stuff for people to do for money, unless prevented from doing so by incompetent central bank policy or (too) onerous labour market regulation.
(The labour market can take quite a lot of regulation, and still be able to get people jobs. Have a look at Germany today for an example.)
Will it? Let's take my example: I'm a 41-year-old male with around 15 years of experience in software development. Let's say 4 years from now I and a million others are losing our development jobs to A.I. What does the market have for my skills? I can try going into healthcare or teaching (though that's quite an extensive retraining, plus a salary reduction), I can go into the trades (same), or get some other work that's hard to automate, like caring for old people (very low salary). All of these options involve massive salary reductions, and that's in the positive scenario where I actually am able to retrain and survive such a shift mentally. It's quite likely many software devs won't be able to become plumbers and nurses and will become chronically unemployed.
The situation you describe isn't all that special.
Yes, losing your job (or your career) is not fun, and can be painful. Massive salary reduction can happen.
No, that hasn't led to widespread unemployment in the past. At least not widespread enough to be visible in aggregate statistics, especially over the noise of the 'normal' business cycle. However, individuals can obviously have pretty long spells of unemployment, but that can also happen without a shift in technology.
I'm just trying to get the point across that unemployment might rise and GDP may fall; in fact I think that should be the baseline scenario, rather than assuming some new jobs we can't imagine yet will be created. It's so hard to imagine these new jobs because, if the machines outperform us cognitively, it follows that we will be able to get intelligent robots into the real world quite soon after. Then seriously, what the heck is left? Fewer jobs, not more.
There is one "cure" I can think of for this and that's something closer to socialism, the market will have to step aside and the government will create massive amounts of new jobs. For example classes can be 5 pupils per teacher instead of 30 pupils per teacher. Nurses can attend to 3 patient beds instead of 8. But letting the market sort this out ? I don't think so.
They're not baked into workflows where the measurable output is easily attributed to the model's use. The productivity gain in its current form is transformative in the sense that the use case and the gain differ for each individual (who each provide different prompts). So some are keeping the gains for themselves; others are using them to improve quality rather than quantity.
It'll come in time; it's important to remember GPT-4 was released 2 years ago this month. The newer models are more reliable and could probably be introduced into workflows more frequently. Today I spoke to a company that is looking to use it to reduce costs in the next year.
Benchmarks are useful as leading indicators. Early warning signs. If there’s no relation to the eventual productivity then hopefully that benchmark will disappear as it’s not useful.
In a fast moving space like this it’s reasonable to make use of leading indicators.
I gave it a problem which sounds like the Monty Hall problem but is actually a simple probability question, and it nailed it.
I asked it to tell a joke: the most horrible joke ever.
Much better than o1, but still nowhere near AGI. It has been optimized for logic and reasoning at best.
Reference: https://rodolphoarruda.pro.br/wp-content/uploads/image-14.pn...
I ran this command to create it:
Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes
Personally, I do like the "use web search" and "extended thinking" buttons, but ultimately, the models should probably be able to figure out whether doing so would be useful themselves too.
[1] https://news.ycombinator.com/item?id=43474112
By extension it should also be slightly more helpful for research, R&D?
I'm a Gemini Advanced subscriber, still don't have this in the drop-down model selection in the phone app, though I do see it on the desktop webapp.
I also see Gemini 2.0 Pro has been replaced completely in AI Studio.
On AI studio the model told me today is June 13 2024.
Or generate images of the founding fathers of the US that at least to some degree resemble the actual ones?
Granted, Gemini answers it now, however, this one left me shaking my head.
https://cdn.horizon.pics/PzkqfxGLqU.jpg
ChatGPT4.5, I was excited.
Deepseek, I was excited. (then later disappointed)
I know Gemini probably won't answer any medical question, even if you are a doctor. ChatGPT will.
I know I've been disappointed at the quality of Google's AI products. They are a backup at best.
(Everything's ok, I'm just testing it ;)
https://news.ycombinator.com/item?id=43474112
I don't see it on the API price list:
https://ai.google.dev/gemini-api/docs/pricing
I can imagine that it's not so interesting to most of us until we can try it with Cursor.
I look forward to doing so when it's out. That Aider bench mixed with the speed and a long context window that their other models are known for could be a great mix. But we'll have to wait and see.
More generally, it would be nice for these kinds of releases to also include speed and context window as separate benchmarks, or somehow include them in the score. A model that is 90% as good as the best but 10x faster is quite a bit more useful.
These might be hard to mix to an overall score but they're critical for understanding usefulness.
2.0 Pro is also still "experimental" so I agree with GP that it's pretty odd that they are "releasing" the next version despite never having gotten to fully releasing the previous version.
But I did mix that up with my knowledge of Google's history of "releasing" models without actually releasing them, which, as you point out, isn't the case with this release.