One of the biggest problems with hands-off LLM writing (for long-horizon stuff like novels) is that you can't really give them any details of your story, because they get absolutely neurotic about it.
Imagine, for instance, that you give the LLM the profile of the love interest for your epic fantasy: it will almost always have the main character meet them within 3 pages (usually page 1), which is of course absolutely nonsensical pacing. No amount of telling it otherwise changes anything.
This is the first model where, after 19 pages generated so far, the output resembles anything like normal pacing, even with a TON of details. I've never felt the need to generate anywhere near this much. Extremely impressed.
Oh, that's a good one. And it's true. There seems to be a massive inability among most people to admit the building impact of modern AI development on society.
Oh, we do admit impact and even have a name for it: AI slop.
(Speaking of LLMs here, since AI is a broad term and has many extremely useful applications in various areas.)
They certainly seem to have moved from "it is literally skynet" and "FSD is just around the corner" in 2016 to "look how well it paces my first lady Trump/Musk slashfic" in 2025. Truly world changing.
Both books that have outsold the Harry Potter series claim divine authorship, not purely human. I am prepared to bet quite a lot that the next isn't human-written, either.
You may be right; on the other hand, it always feels like the next goalpost is the final one.
I'm pretty sure that if something like this happens, some dude will show up from nowhere and claim that it's just parroting what other, real people have written, that it just blended it together and randomly spat it out. "Real AI would come up with original ideas, like a cure for cancer," he'll say.
After some form of that comes to pass, another dude will show up and say that this "alphafold while-loop" is not real AI, because he just went out for lunch and there was a guy flipping burgers, and that "AI" can't do it, so it's shit.
https://areweagiyet.com should plot those future points as well, with all those funky goals like "if Einstein had access to the Internet, Wolfram etc. he could have come up with it anyway, so not better than humans per se", or "it had to be prompted and guided by a human to find this answer, so it didn't really do it by itself", etc.
What if we didn’t measure success by sales, but impact to the industry (or society), or value to peoples’ lives?
Zooming out to AI broadly: what if we didn’t measure intelligence by (game-able, arguably meaningless) benchmarks, but real world use cases, adaptability, etc?
I recently watched some of Claude Plays Pokemon and believe it's a better measure than all those AI benchmarks. The game could be beaten by an 8-year-old who obviously doesn't have all the knowledge that even small local LLMs possess, but who has actual intelligence and could figure out the game within < 100h. So far Claude can't even get past the first half, and I doubt any other AI could get much further.
Well, strictly speaking, outselling Harry Potter would fail the Turing test: the Turing test is about passing for human (in an adversarial setting), not surpassing humans.
Of course, this is just some pedantry.
I for one love that AI is progressing so quickly, that we _can_ move the goalposts like this.
The goalposts will be moved again. Tons of people clamoring that the book is stupid and vapid and only idiots bought it. When AI starts taking over jobs (which it already has), you'll get tons of idiots claiming the same thing.
People are trying to use gen AI in more and more use cases. It used to fall flat on its face at trivial stuff; now it has gotten past the trivial stuff but is still scratching at the boundaries of being useful. And that is not an attempt to make gen AI look bad; it is really amazing what it can do. But it is far from delivering on the hype, and that is why people are providing critical evaluations.
Let's not forget the OpenAI benchmarks saying 4.0 could do better at college exams and such than most students. Yet real-world performance on real tasks was laughable.
We are, if this comment is the standard for all criticism on this site. Your comment seems harsh. Perhaps novel writing is too low-brow of a standard for LLM critique?
I didn't quite read parent's comment like that. I think it's more about how we keep moving the goalposts or, less cynically, how the models keep getting better and better.
I am amazed at the progress that we are _still_ making on an almost monthly basis. It is unbelievable. Mind-boggling, to be honest.
I am certain that the issue of pacing will be solved soon enough. I'd give 99% probability of it being solved in 3 years and 50% probability in 1.
In my consulting career I sometimes get to tune database servers for performance. I have a bag of tricks that yield about +10-20% performance each. I get arguments about this from customers, typically along the lines of "that doesn't seem worth it."
Yeah, but 10% plus 20% plus 20%... next thing you know you're at +100% and your server is literally double the speed!
AI progress feels the same. Each little incremental improvement alone doesn't blow my skirt up, but we've had years of nearly monthly advances that have added up to something quite substantial.
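A rough back-of-the-envelope, with made-up gain figures just to show the compounding:

    # Hypothetical per-release gains; the point is that they multiply rather than add.
    gains = [0.10, 0.20, 0.20, 0.15, 0.10]
    total = 1.0
    for g in gains:
        total *= 1 + g
    print(f"{total:.2f}x")  # ~2.00x, i.e. "literally double the speed"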
Except at some point the low-hanging fruit is gone and it becomes +1% here, +3% in some benchmarked use case, -1% in the general case, etc. And then come the benchmarking lies we are seeing right now, where everyone picks a benchmark that makes them look good, and its correlation to real-world performance is questionable.
Not really new, is it? The first cars just had to approach horse-and-cart levels of speed. Comfort, ease of use, etc. were non-factors because this was "cool new technology".
In that light, even a 20-year-old, almost-broken-down crappy dinger is amazing: it has a radio, heating, shock absorbers, and it can go over 500km on a tank of fuel! But are we fawning over it? No, because the goalposts have moved. Now we are disappointed that it takes 5 seconds for the Bluetooth to connect and for the seats to auto-adjust to our preferred seating and heating settings in our new car.
I have actually read it and agree it is impressive. I will not comment much on the style of the writing, since this is very subjective, but I would rate it as the "typical" modern fantasy style, which aims at filling as many pages as possible: very "flowery" language, lots of adjectives/adverbs, lots of details, lots of high-school prose ("Panic was a luxury they couldn't afford"). Not a big fan of that, since I really miss the time when authors could write single, self-contained books instead of a sprawling series over thousands of pages, but I know of course that this kind of thing is very successful and people seem to enjoy it. If someone gave me this, I would advise them to get a good copy editor.
There are some logical inconsistencies, though. For instance, when they both enter the cellar through a trapdoor, Kael goes first, but the innkeeper instructs him to close the trapdoor behind them, which makes no sense. Also, Kael goes down the stairs and "risks a quick look back up" and can somehow see the front door bulging and the chaos outside through the windows, which is obviously impossible when you look up through a trapdoor, not to mention that the entry was previously said to be behind the bar counter, surely blocking the sight. Kael lights an oily rag, which somehow becomes a torch. There are more generic issues too, like these Eldertides somehow being mythical things no one has ever seen, yet seeming to be pretty common occurrences. The dimensions of the cellar are completely unclear: at first it seems to be very small, yet they move around it quite a bit. There are other issues, like people using the same words as the narrator ("the ooze"), as if they can hear him, or the innkeeper suddenly calling Kael by his name as if they already know each other.
Anyway, I would rate it "first draft". Of course, it is unclear whether the LLM would manage to write a consistent book, but I can fully believe that it would manage. I probably wouldn't want to read it.
Thank you for taking the time to do a thorough read. I just skimmed it, and the prose is certainly not for me. To me it lacks focus, but as you say, this may be the style readers enjoy.
And it also, as you say, really reuses words. Just reading through, I noticed "phosphorescence" 4 times in this chapter, for example, and "ooze" 17 times (!).
It is very impressive though that it can create a somewhat cohesive storyline, and certainly an improvement over previous models.
Using the AI in multiple phases is an approach that can handle this.
Similar to the "Deep Research" approach: you can tell it to first generate a storyline with multiple twists and turns, then ask the model to take this storyline and generate prompts for individual chapters, then ask it to generate the individual chapters based on those prompts, and so on.
But a future chatbot would be able to internally project-manage itself through that process: first emitting an outline, then producing draft chapters, then going back and critiquing itself, and finally rewriting the whole thing.
Yes, and that's why many people in the discussion here are very optimistic that chatbots will have solved this problem very soon, either with the approach you suggest or with something else (perhaps more general, and less directly programmed in).
It's not a problem of one-shotting it. It's that the details cause a collapse. Even if you try breaking it down, which I have, you run into the same problem unless you hold its hand for every single page, and then what's the point? I want to read the story, not co-author it.
The etymology is pretty much irrelevant. In German, for example, the word for novel is 'Roman'. But German readers don't expect their novels to be any more romantic, nor do English readers expect their novels to be more novel.
LLMs have been producing new things all the time. The question was always about quality of output, never about being able to produce anything new.
I dunno, there's a certain amount of fun in "writing" a book with ChatGPT. Like playing a video game with a bunch of different endings instead of watching a movie with only one. Does the hero save the day? Or turn into a villain? You decide!
I think you would be better off having the LLM help you build up the plot with high level chapter descriptions and then have it dig into each chapter or arc. Or start by giving it the beats before you ask it for help with specifics. That'd be better at keeping it on rails.
I don't disagree. As with almost anything else involving LLMs, getting hands-on produces better results. But because in this instance I much prefer to be the reader rather than the author or editor, it's really important to me that an LLM is capable of pacing long-form writing properly on its own.
Random question, if you don't care about being a creator yourself, why do you even want to read long form writing written by an LLM? There are literally 10000s of actual human written books out there all of them better than anything an LLM can write, why not read them?
> There are literally 10000s of actual human written books out there all of them better than anything an LLM can write, why not read them?
10000s is still much smaller than the space of possibilities for even a short prompt.
You might be right that good human novels are better than what LLMs can manage today. But that's rapidly changing.
And if you really need that Harry Potter / Superman / Three Musketeers crossover fan fiction itch scratched, you might not care that some other existing novel is 'better' in some abstract sense.
Authors tell stories they want to tell, and readers read stories they want to read. The two don't necessarily overlap, or don't overlap strongly enough. If you're even a little bit specific (nowhere near as specific as the above prompt, even just something like the dynamic between the protagonists), then you don't actually have 10,000s of actual human-written books. Not even close. Maybe it exists and maybe you'll find it good enough, but if it's only been read by a few hundred or thousand people? Good luck getting it recommended.
I've read a LOT of fiction. I love reading. And if it's good enough, the idea of reading something created by a machine does not bother me at all. So of course I will continue to check whether the machine is finally good enough that I can be a bit more specific.
It's very hard to find good books written by humans. GoodReads is okay, but you quickly run out of high-end recommendations. I read mostly sci-fi, and the books that everyone recommends rarely end up being 10/10. But then I see some random recommendation on Reddit or HN, and it ends up being amazing.
That was what I tried on the train [0] a few weeks ago. I used Groq to get something very fast, to see if it would work at least somewhat. It gives you a PDF in the end. Plugging in a better model gave much better results (still not really readable if you actually try; at a glance it's convincing though), but it was so slow that testing was kind of impossible. You can't really run things in parallel either, because it needs to know what it pushed out before, or at least a summary of it.
I have never used an LLM for fiction writing, but I have been writing large amounts of code with them for years. What I'd recommend: when you're defining your up-front plan for the sections of the content, simply state in which phase/chapter the characters should meet.
Planning generated content is often more important to invest in than the writing of it.
Looking at your paste, your prompt is short and basic; it should probably be broken up into clear, formatted sections (try directives inside XML-style tags). For such a large output as you're expecting, I'd expect a considerable prompt of rules and context-setting (maybe a page or two).
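Something along these lines, for illustration only (the tag names are made up, and the character/pacing details are placeholders loosely borrowed from the excerpt discussed upthread, not your actual prompt):

    prompt = """
    <role>You are a fantasy novelist writing a slow-burn, multi-book epic.</role>

    <characters>
      <protagonist>Kael: an innkeeper's helper; cautious, observant.</protagonist>
      <love_interest>Do NOT introduce before the midpoint of the book.</love_interest>
    </characters>

    <pacing_rules>
      - Chapters 1-5: only the village, the inn, and the first Eldertide.
      - No named antagonist before chapter 8.
    </pacing_rules>

    <style>Third person limited, ~3,000 words per chapter, minimal adverbs.</style>

    <task>Write chapter 1 only. End on a quiet note, not a cliffhanger.</task>
    """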
I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand.
Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.
For those curious, the riddle is: there are three people in a circle. Each person has a positive integer floating above their heads, such that each person can see the other two numbers but not his own. The sum of two of the numbers is equal to the third. The first person is asked for his number, and he says that he doesn't know. The second person is asked for his number, and he says that he doesn't know. The third person is asked for his number, and he says that he doesn't know. Then the first person is asked for his number again, and he says: 65. What is the product of the three numbers?
That's a non-sequitur; they would be stupid to run an expensive _L_LM for every search query. This post is not about Google Search being replaced by Gemini 2.5 and/or a chatbot.
Bing doesn't list any reddit posts (that Google-exclusive deal) so I'll assume no stackexchange-related sites have an appropriate answer (or bing is only looking for hat-related answers for some reason).
I might have phrased that poorly. With _L_ (or L, as intended), I meant their state-of-the-art model, which I presume Gemini 2.5 is (haven't gotten around to TFA yet). Not sure if this question is just about model size.
I'm eagerly awaiting an article about RAG caching strategies though!
There are 3 toddlers on the floor. You ask them a hard mathematical question. One of the toddlers plays around with pieces of paper on the ground and happens to hold up one that has the right answer written on it.
- This kid is a genius! - you yell
- But wait, the kid has just picked an answer from the ground, it didn't actually come up...
- But the other toddlers could do it also but didn't!
Other models aren't able to solve it, so there's something else happening besides it being in the training data. You can also vary the problem and give it a number like 85 instead of 65, and Gemini is still able to properly reason through the problem.
I'm sure you're right that it's more than just it being in the training data, but that it's in the training data means that you can't draw any conclusions about general mathematical ability using just this as a benchmark, even if you substitute numbers.
There are lots of possible mechanisms by which this particular problem would become more prominent in the weights in a given round of training even if the model itself hasn't actually gotten any better at general reasoning. Here are a few:
* Random chance (these are still statistical machines after all)
* The problem resurfaced recently and shows up more often than it used to.
* The particular set of RLHF data chosen for this model draws out the weights associated with this problem in a way that wasn't true previously.
Sure, but you can't cite this puzzle as proof that this model is "better than 95+% of the population at mathematical reasoning" when the method of solving (the "answer") it is online, and the model has surely seen it.
Thanks. I wanted to do exactly that: find the answer online. It is amazing that people (even on HN) think that an LLM can reason. It just regurgitates the input.
I think it can reason. At least if it can work in a loop ("thinking"). It's just that this reasoning is far inferior to human reasoning, despite what some people hastily claim.
I would say maybe about 80%, certainly not 99.99%. But I saw in college that some people could only solve problems that were pretty much the same as ones they had already seen, while some guys could easily come up with solutions to complex problems they had not seen before. In my opinion, no human at age 20 can have had the amount of input an LLM has today, and still, humans of age 20 come up with very new ideas pretty often (new in the sense that they have not seen that idea, or anything like it, before). Of course there are more and less creative/intelligent people...
Is there a reason for the downvotes here? We can see that having the answer in the training data doesn't help. If it's in there, what's that supposed to show?
It's entirely unclear what you are trying to get across, at least to me.
Generally speaking, posting output from an LLM without explaining exactly what you think it illustrates, and why, is frowned upon here. I don't think your comment does a great job of the latter.
>> So it’s likely that it’s part of the training data by now.
> I don't think this means what you think it means.
> I did some interacting with the Tencent model that showed up here a couple days ago [...]
> This is a question that obviously was in the training data. How do you get the answer back out of the training data?
What do I think the conversation illustrates? Probably that having the answer in the training data doesn't get it into the output.
How does the conversation illustrate that? It isn't subtle. You can see it without reading any of the Chinese. If you want to read the Chinese, Google Translate is more than good enough for this purpose; that's what I used.
Your intentions are good, but your execution is poor.
I cannot figure out what the comment is trying to get across either. It's easy for you because you already know what you are trying to say. You know what the pasted output shows. The poor execution is in not spending enough time thinking about how someone coming in totally blind would interpret the comment.
I didn't downvote you, but like (probably) most people here, I can't read Chinese; I can't derive whatever point you're trying to make just from the text you provided.
This is solvable in roughly half an hour on pen and paper by a random person I picked with no special math skills (beyond a university). This is far from a difficult problem. The "95%+" in math reasoning is a meaningless standard; it's like saying a model is better than 99.9% of the world population at the Albanian language, since less than 0.1% bother to learn Albanian.
Even ignoring the fact that this or a similar problem may have appeared in the training data, it's something careful brute-force logic should solve. It's neither difficult, nor interesting, nor useful. Yes, it may suggest a slight improvement in basic logic, but no more so than a million other benchmarks people quote.
This goes to show that evaluating models is not a trivial problem. In fact, it's a hard problem (in particular, far, far harder than this math puzzle).
So in the three hours between reading the puzzle in the parent comment and replying, you stopped what you were doing, managed to get some other "random" person to stop what they were doing, and had them spend half an hour of their time on a maths puzzle that, at that point, prior experience suggested could take a day? All within three hours?
That's not to say that you didn't, or that you're recalling a previous time that happens to involve this exact puzzle (despite there being scant prior references to it, which is precisely the reason for using it). But you can see how some might find that not entirely credible.
Best guess: this random person is someone who really likes puzzles, is presumably good at them, and is very, very far from being representative to the extent your argument would require.
The "random person" you picked is likely very, very intelligent and not at all a good random sample. I'm not saying this is difficult to the extent that it merits academic focus, but it is NOT a simple problem and I suspect less than 1% of the population could solve this in half an hour "with no special math skills." You have to be either exceedingly clever or trained in a certain type of reasoning or both.
I agree with your general point that this "random person" is probably not representative of anything close to an average person off the street, but I think the phrasing "very very intelligent" and "exceedingly clever" is kinda misleading.
In my experience, the difference between someone who solves this type of logic puzzle and someone who doesn't has more to do with persistence and the ability to maintain focus than with "intelligence" in terms of problem-solving ability per se. I've worked with college students helping them learn to solve these kinds of problems (e.g. as part of pre-interview test prep), and in most cases, those who solve them and those who don't make the same rate of progress towards the solution as long as they're actively working at it. The difference comes in how quickly they get frustrated (at themselves, mostly), decide they're not capable of solving it, and give up on working on it further.
I mention this because the frustration itself comes from a belief that the ability to solve these belongs only to some "exceedingly clever" people, and not to someone like them. So this kind of thinking ends up being a vicious cycle that keeps them from working on their actual issues.
I solved it in less than 15 minutes while walking my dog, no pen or paper. But I wouldn't claim to be a random person without math skills. And my very first guess was correct.
It was a fun puzzle though and I'm surprised I didn't know it already. Thanks for sharing.
> This is solvable in roughly half an hour on pen and paper by a random person I picked with no special math skills (beyond a university).
I randomly answered this post and can't solve it in half an hour. Is the point leetcode but for AI? I'd rather it solve real problems than "elite problems".
Side note: I couldn't even find pen and paper within half an hour.
This is a great riddle. Unfortunately, I was easily able to find the exact question with a solution (albeit with a different number) online, thus it will have been in the training set.
What makes this interesting is that while the question is online (on reddit, from 10 years ago), other models don't get the answer right. Gemini also shows its work, and it seems to do a few orders of magnitude more calculation than the elegant answer given on reddit.
Granted this is all way over my head, but the solution Gemini comes to matches the one given on reddit (and now here, in future training runs).
>Gemini also shows its work, and it seems to do a few orders of magnitude more calculation than the elegant answer given on reddit.
I don't think Gemini does an unnecessary amount of computation, it's just more verbose. This is typical of reasoning models, almost every step is necessary but many would not be written down by a human.
Everyone with limited bandwidth has been trying to limit site access to robots. The latest generation of AI web scrapers is brutal and does not respect robots.txt.
There are websites where you can only register in person and have two existing members vouch for you. Probably still can be gamed, but it sounds like a great barrier to entry for robots (for now).
Admins will see unusual traffic from that account and then take action. Of course it will not be perfect, as there could be a way to mimic human traffic and slowly scrape the data anyway; that's why there is an element of trust (two existing members to vouch).
Yeah, don't get me wrong, I believe raising the burden of extraction is an effective strategy; I just think it's been solved at scale, i.e. voting rings and astroturfing operations on Reddit. And at the nation-state level I'd just bribe or extort the mods and admins directly (or the IT person, to dump the database).
I have bad news for you if you think non-paywalled / non-phone#-required Discord communities are immune to AI scraping, especially as it costs less than hammering traditional websites, since the push-on-change event is done for you in real-time chat contexts.
Especially as the company archives all those chats (not sure for how long) and is small enough that a billion-dollar "data sharing" agreement would be a very enticing offer.
If there isn't a significant barrier to access, it's being scraped. And if that barrier is money, it's being scraped but less often.
You'd have better results if you had prompted it with the actual answer and asked how the first person came to the conclusion. Giving a number in the training set is very easy.
i.e. You observe three people in a magical room. The first person is standing underneath a 65, the second person is standing underneath a 26 and the third person is standing underneath a 39. They can see the others numbers but not the one they are directly under. You tell them one of the three numbers is the sum of the other two and all numbers are positive integers. You ask the first person for their number, they respond that they don't know. You ask the second person for their number, they respond that they don't know. You ask the third person, they respond that they don't know. You ask the first person again and they respond with the correct value, how did they know?
You can also put the AI in the first person's shoes.
Prompt:
You are standing in a circle, there are 2 other people in the circle with you, everyone in the circle, has a positive integer above their head, no one knows what the number above their own head is but can see the numbers above the heads of the other people. You see that the person infront of you on the left has 26 above their head. The person on the right has 39 above their head. You are told that the sum of two of the numbers is the third number. You are asked what the number above your head is, the option is the sum, 65, or 13, as 26 + 13 = 39. You don't know which one it is, and you say so. The second person is asked the number above their head. They also say they dont know, the third person also says they dont know. What is your number?
Gemini 2.5 and Claude 3.7 thinking get it right; o3 mini and 4o get it wrong.
Same here: my problem of choice is the 100 prisoners problem [1]. I used to ask simple reasoning questions in the style of "what is the day three days before the day after tomorrow", but nowadays when I ask such questions, I can almost feel the NN giggling at the naivety of its human operator.
Deepseek R1 got the right answer after a whopping ~10 minutes of thinking. I'm impressed and feel kind of dirty, I suspect my electricity use from this could have been put to better use baking a frozen pizza.
I use an algorithmic question that I'd been working on for years and that I'm finally writing up the answer to.
It's basically: given a sequence of heap operations (insert element, delete minimum element), can you predict the left-over elements (that are in the heap at the end) in linear time in the comparison model?
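For reference, the obvious baseline just simulates the heap, which costs O(n log n) comparisons; the open part of the question is whether the surviving elements can be identified with only O(n) comparisons. A quick sketch of that baseline:

    import heapq

    def leftovers(ops):
        """ops: sequence of ('insert', x) or ('delete_min',) operations."""
        h = []
        for op in ops:
            if op[0] == 'insert':
                heapq.heappush(h, op[1])
            else:
                heapq.heappop(h)
        return sorted(h)

    print(leftovers([('insert', 5), ('insert', 2), ('delete_min',), ('insert', 7)]))  # [5, 7]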
Loved that puzzle, thanks for sharing it. I’ve solved a lot of math problems in the past but this one had a unique flavor of interleaving logical reasoning, partial information and a little bit of arithmetic.
Call the three numbers a, b, and c. This means c = a + b, but we still don’t know to which person each number belongs.
When person 1 (p1) is asked what his number is, he has no way to know whether he has a, b, or c, so he says he doesn’t know. Same goes for p2 and p3. Clearly p1 somehow gains information by p2 and p3 passing. Either he realizes that he must be either a or b, and such his number is the difference between p2 and p3’s numbers, or he realizes that he must be c and so his number is the sum of p2 and p3’s numbers.
That’s all I have so far. Anyone have other ideas?
I think it has something to do with applying the lower bound of 1.
If p1 KNOWS that he's the largest, then he has to have gained some other piece of information. Say the numbers he sees are 32 and 33. His number would have to be either 1 or 65. If p1 were 1, then the other two would have known p1 couldn't be the sum of the other two.
P1 knows that P2 and P3 are not equal. So they know that the set isn't [2A, A, A].
P2 knows that P1 and P3 are not equal. So they know that the set isn't [A, 2A, A]. They also know that if P1 doesn't know, then they were able to make the same deduction. So they now know that both [2A, A, A] and [A, 2A, A] aren't correct. Since they know that [2A, A, A] isn't correct, they can also know that [2A, 3A, A] isn't correct either. Because they'd be able to see if P1 = 2A and P3 = A, and if that were true and P1 doesn't know their number, it would have to be because P2 isn't A. And if P2 isn't A, they'd have to be 3A.
P3 knows that P1 and P2 aren't equal. Eliminates [A, A, 2A]. Knows that [2A, A, A], [A, 2A, A], and [2A, 3A, A], are eliminated. Using the same process as P2, they can eliminate [2A, A, 3A], [A, 2A, 3A], and also [2A, 3A, 5A]. Because they can see the numbers and they know if P1 is 2A and P2 is 3A.
Now we're back at P1. Who now knows.
So P2 and P3 are in the eliminated sets. Which means we're one of these
[2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]
We know his number is 65. To find the set, we can factor 65: 5 * 13. We can check the other numbers: 2(13) = 26, 3(13) = 39. And technically, you don't need to find the other numbers: the final answer is 5A * 2A * 3A = 30A^3, which for A = 13 is 65,910.
"Which means we're one of these [2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]"
Why? Couldn't it be an infinite number of 3 size arrays comprised of A where two elements sum to the third? [24A, 13A, 11A]? How did we deduce this set of arrays?
EDIT: Solved from another reddit comment. Tuples without a common factor like the one above are considered as a=1.
"They're not eliminated; they correspond to a = 1."
I think that answer was poorly phrased, because those possibilities are eliminated in a sense. There is a better answer further down the thread that explains: "If the solution was not one of the flipped triplets, then the first player would not have worked out the solution." Thus if it were one of your other infinite triplets (e.g. 65, 12, 53), then in round 2 player 1 would still have answered "I don't know". Since they did respond with a definitive answer, it had to be one of the formula solutions, since those were the only solutions they could prove. And since the only leading coefficient that divides 65 is 5, the correct formula must be [5A, 2A, 3A] and thus [65, 26, 39].
You should be able to generate an infinite number of these problems just by multiplying the first formula factor by a prime number. For example, the same question where the person answers "52" restricts you to either [4a, 3a, a] or [4a, a, 3a]. Since the question only asks for the product of all the terms, the answer is (4 * 13) * (3 * 13) * 13 = 52 * 39 * 13 = 26,364 either way.
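For anyone who wants to sanity-check this (and the 52 variant) mechanically, here is a small brute force of the common-knowledge reasoning in Python. It simulates the rounds of "I don't know" directly instead of using the ratio patterns above; the search bound is arbitrary:

    from functools import lru_cache

    def valid(t):
        a, b, c = t
        return min(t) > 0 and (a == b + c or b == a + c or c == a + b)

    @lru_cache(maxsize=None)
    def consistent(t, n):
        # True if triple t is compatible with the first n announcements all being
        # "I don't know"; announcement k (1-based) is made by person (k - 1) % 3.
        if not valid(t):
            return False
        return all(not knows(t, (k - 1) % 3, k - 1) for k in range(1, n + 1))

    @lru_cache(maxsize=None)
    def knows(t, p, n):
        # True if person p, seeing the other two numbers and having heard n
        # "I don't know"s, can pin down their own number.
        x, y = [t[i] for i in range(3) if i != p]
        cands = {x + y} | ({abs(x - y)} if x != y else set())
        live = [v for v in cands
                if consistent(tuple(v if i == p else t[i] for i in range(3)), n)]
        return len(live) == 1

    def solve(first, bound=300):
        # Person 1 (index 0) holds `first` and announces on their second turn,
        # i.e. after three "I don't know"s.
        for b in range(1, bound):
            for c in range(1, bound):
                t = (first, b, c)
                if valid(t) and consistent(t, 3) and knows(t, 0, 3):
                    print(t, "product =", first * b * c)

    solve(65)  # expected: (65, 26, 39) product = 65910
    solve(52)  # expected: (52, 39, 13) and (52, 13, 39), product 26364 either way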
Look at it this way: Person 1 sees the numbers 26 and 39, and has to guess his own number. It must be one of only 2 possibilities: 13 or 65. All he has to do is eliminate one of those possibilities.
One of the cases has to be true, not all three (as you show, they're mutually exclusive for positive integers); i.e. "either" is important in the parent comment.
Which is why I indicated that it would be a misreading of the problem.
The original problem is a little ambiguously worded. You could say "one of their numbers is the sum of the other two" and it would be a little clearer.
> The original problem is a little ambiguously worded.
No it isn't. If it said "the sum of any two of the numbers is equal to the third", that would be a contradiction. What it says is "the sum of two of the numbers is equal to the third".
In general I find commentary here too negative on AI, but I'm a bit squeamish about maximalist claims re: AI mathematical reasoning vs. human population based off this, even setting aside lottery-ticket-hypothesis-like concerns.
We have three people in a circle. Let’s call them:
• Person A (first person)
• Person B (second person)
• Person C (third person)
Each has a positive integer above their head. Each sees the other two numbers but not their own. The numbers satisfy:
A + B = C, or B + C = A, or C + A = B
Given the puzzle scenario, we need to find out the exact numbers given the clues, and ultimately find the product of the three numbers.
⸻
Step 2: Logic of the Puzzle (Step-by-Step Reasoning)
Initial conditions:
The numbers are positive integers. The puzzle implies a situation of ambiguity at first, but after multiple rounds of “I don’t know,” Person A finally knows their number is 65.
Let’s denote clearly the conditions again:
• Exactly one of the following three equations is true:
• A + B = C, or
• B + C = A, or
• C + A = B.
Person A sees B and C, but not A. Similarly for Persons B and C.
⸻
Step 3: Logical deductions from the repeated “I don’t know”
First round deductions:
• Person A initially doesn’t know their number. This means, from Person A’s perspective, both scenarios A = B + C and A + B = C (or A + C = B) are still possible. If Person A had seen identical numbers, for instance, Person A could have instantly known their number. Thus, Person A sees two different numbers.
• Similarly, Person B doesn’t know either, implying that Person B also sees ambiguity in their own number. Thus, Person B also sees two numbers that could result in at least two possible solutions.
• Person C also doesn’t know, same logic as above.
Thus, after one full round, we understand that initially, all three numbers must create ambiguity. No trivial case (like two identical numbers) is present.
Second round deductions:
After hearing that no one knows their number, each person gains information: they now know the scenario is one in which none of the other two people could have immediately solved it.
Crucially, in the second round:
• Person A now knows their number is 65 after considering the previous statements. This implies that Person A sees numbers B and C, and from these numbers plus the statements of uncertainty, Person A deduces that their own number can only be 65.
⸻
Step 4: Deriving the Numbers (Critical deduction)
Let’s denote clearly again: Person A’s number is now known to be 65.
We have three possible equations. Let’s test each carefully given A = 65:
1. Case 1: A + B = C
Then we have:
• 65 + B = C. Person A sees numbers B and C, so they would see B and 65 + B. Person A sees numbers B and 65+B. If Person A sees these two numbers, they might initially consider their number could be either the sum or difference of these two.
• However, if Person A sees two numbers, and one is exactly 2× the other, Person A could immediately guess their number is the difference (since if you see numbers X and 2X, your number is clearly X). So this scenario of B and 65+B is okay only if these two numbers don’t exhibit an immediate trivial scenario.
• Let’s hold this scenario and continue examining others first.
2. Case 2: B + C = A (So, B + C = 65)
• Then Person A sees numbers B and C, adding exactly to 65.
• Initially, Person A sees two numbers summing exactly to 65. Could Person A know their number immediately at first glance? No, since their number could either be the sum (65) or the difference of the two numbers. Ambiguity here initially makes sense.
• Initially, Person B and Person C also see ambiguous scenarios. They both don’t know their numbers either at first round. This scenario is plausible.
3. Case 3: C + A = B (So, C + 65 = B)
• Then Person A sees numbers B = C + 65 and C.
• Person A sees C and C+65. Similar reasoning as Case 1.
⸻
Step 5: Identifying the Correct Scenario
The puzzle’s logic typically aligns best with Case 2: B + C = A = 65.
Here’s why clearly stated:
• Initially, Person A sees two numbers adding to exactly 65. At first glance, Person A’s number could either be 65 (sum) or the difference between the two numbers.
• Person B also initially sees two numbers and cannot distinguish immediately their own number, nor can Person C.
• After everyone states “I don’t know,” Person A gains critical insight: if the difference between the two numbers (that A sees) allowed Person B or Person C to know their number immediately, they’d have known right away. The fact that B and C didn’t know means the difference is not immediately obvious to B or C. Thus, the difference cannot be trivial (such as identical numbers or easy multiples).
Therefore, the two numbers must be such that their difference doesn’t immediately clue in the other persons. This occurs most clearly with pairs of numbers which have multiple factorization possibilities or aren’t trivial multiples.
But we must carefully test quickly to find a suitable, plausible integer pair whose sum is 65 and whose ambiguity would logically survive through multiple rounds. The puzzle is a classic logic puzzle with a known structure:
A common scenario in these classic puzzles is the use of Fibonacci-like or similar “golden” sets. Let’s quickly check the following well-known ambiguous triple:
• Numbers: 65, 39, 26
• Check quickly: indeed, 39 + 26 = 65.
Checking logic quickly:
• Person A sees 39 and 26. Initially ambiguous (number could be 65 or 13).
• Person B sees 65 and 26. Could be either 39 or 91 at first glance. Initially ambiguous.
• Person C sees 65 and 39. Could be either 26 or 104 initially. Initially ambiguous.
After each person says they don’t know:
• Now, Person A considers: If my number was 13, then Person B would see numbers 13 and 26. Person B would then immediately know their number is 39 (13+26=39), no ambiguity. But Person B didn’t know! So Person A’s number cannot be 13. Hence, Person A confidently concludes their number is 65.
Thus, the numbers 65, 39, and 26 satisfy the puzzle conditions perfectly!
⸻
Step 6: Final Answer (Product of the Numbers)
The three numbers identified clearly are:
• A = 65
• B = 39
• C = 26
Then, the product is:
65 × 39 × 26 = 65 × 1,014 = 65,910
I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those.
Plus it drew me a very decent pelican riding a bicycle.
Have you considered that they must be training on images of pelicans riding bicycles at this point ;-)? At least given how often that comes up in your reviews, a smart LLM engineer might put their finger on the scales a bit and optimize for the things that come up in reviews of their work a lot.
I wonder how far away we are from models which, given this prompt, generate that image in the first step in their chain-of-thought and then use it as a reference to generate SVG code.
It could be useful for much more than just silly benchmarks, there's a reason why physics students are taught to draw a diagram before attempting a problem.
Someone managed to get ChatGPT to render the image using GPT-4o, then save that image to a Code Interpreter container and run Python code with OpenCV to trace the edges and produce an SVG: https://bsky.app/profile/btucker.net/post/3lla7extk5c2u
Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%.
This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.
Am I correct in assuming that accuracy < using the correct edit format? I.e., it made mistakes on 27% of the problems, 11% of which were due to (at least) messing up the diff format?
In which case, Google should be working on better output-format following, as Claude and R1 are able to hit nearly 100% accuracy on the format.
It does have fairly low adherence to the edit format, compared to the other frontier models. But it is much better than any previous Gemini model in this regard.
Aider automatically asks models to retry malformed edits, so it recovers. And goes on to produce a SOTA score.
Reminds me of how nobody is too excited about flagship mobile launches anymore. Most flagships for some time now have been just incremental updates over the previous generation, only marginally better. Couple that with Chinese OEMs launching better or good-enough devices at a lower price point, and new launches from established players are not noteworthy anymore.
It's interesting how the recent AI announcements are following the same trend over a smaller timeframe.
Phones are limited by hardware manufacturing, plus maybe the annual shopping cycle peaking at Christmas. People wouldn't have bought multiple iPhones even in its heyday.
These LLM models were supposedly limited by the training run, but these point-version models are mostly post-training driven, which seems to be taking less time.
If models were tied to specific hardware (say, an "AI PC" or whatever), the cycle would get slower and we'd get a slower summer, which I'm secretly wishing for.
For me, the most exciting part is the improved long-context performance. A lot of enterprise/RAG applications rely on synthesizing a bunch of possibly relevant data. Let's just say it's clearly a bottleneck in current models and I would expect to see a meaningful % improvement in various internal applications if long-context reasoning is up. Gemini was already one of my favorite models for this usecase.
So, I think these results are very interesting, if you know what features specifically you are using.
But they score it on their own benchmark, on which, coincidentally, Gemini models were always the only good ones. In Nolima or Babilong we see that Gemini models still can't do long context.
Reasoning was supposed to be that for "Open" AI, that's why they go to such lengths to hide the reasoning output. Look how that turned out.
Right now, in my opinion, OpenAI has actually a useful deep research feature which I've found nobody else matches. But there is no moat to be seen there.
If you've seen DeepSeek R1's <think> output, you'll understand why OpenAI hides their own. It can be pretty "unsafe" relative to their squeaky-clean public image.
Why not snooze the news for a year and see what's been invented when you get back? That'll blow your mind properly, because each of these incremental announcements contributes to a mind-blowing rate of improvement.
The rate of announcements is a sign that models are increasing in ability at an amazing rate, and the content is broadly the same because they’re fungible commodities.
The latter, that models are fungible commodities, is what’s driving this explosion and leading to intense competition that benefits us all.
Serious question: Has anyone tested how much money you can actually make doing a month of Amazon Mechanical Turk? (It would make for an interesting YouTube video!) I am curious if it is middle class wages in very poor countries (like Nigeria). Some light Googling tells me that middle class salary in Nigeria is about 6K USD, so about 3 USD/hour (assuming: 50 weeks/year * 40 hours/week = 2000 hours/year). Is this possible with MTurk?
That's ok. AI will kill those off soon enough, and like all winners, rewrite history enough so that that inconvenient theft never happened anyway. It's manifest destiny, or something.
I wish I wish I wish Google put better marketing into these releases. I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.
Also, I think google's winning the race on actually integrating the AI to do useful things. The agent demo from OpenAI is interesting, but frankly, I don't care to watch the machine use my computer. A real virtual assistant can browse the web headless and pick flights or food for me. That's the real workflow unlock, IMO.
> I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.
This is useful feedback. I'm not here to shill for OpenAI, nor Google/Gemini, but can you share a concrete example? It would be interesting to hear more about your use case. More abstractly: Do you think these "moved entire workflows" offset a full worker, or X% of a full worker? I am curious to see how and when we will see low-end/junior knowledge workers displaced by solid LLMs. Listening to the Oxide and Friends podcast, I learned that they make pretty regular use of LLMs to create graphs using GNU plot. To paraphrase, they said "it is like have a good intern".
To clarify, by "doing the opposite" I mean OpenAI releasing GPT-4.5, a non-reasoning model that does worse on benchmarks (but supposed to be qualitatively better). People shit on OpenAI hard for doing that.
AI labs, it seems, use a template for system cards as well. OpenAI stands out because they showcase their employees using their tools for various use cases, which is refreshing.
Was going to comment the same thing; it has been bugging me lately that all announcements start with "our" followed by empty superlatives. Happy not to be alone on this!
Glaringly missing from the announcements: concrete use cases and products. The Achilles heel of LLMs is the distinct lack of practical real-world applications. Yes, Google and Microsoft have been shoving the tech into everything they can fit, but that doesn't a product make.
I would say Adobe is doing an excellent job of commercialising image manipulation and generation using LLMs. When I see adverts for their new features, they seem genuinely useful for normie users who are trying to edit some family/holiday photos.
Is that article trying to argue that 500M people every week are visiting ChatGPT for the first (or second) time after reading about it in the news?
If I'm being incredibly generous I will concede that this could have been the case for the first few weeks when it was making headlines, but it clearly isn't true now.
It would be literally impossible to keep up these figures for as long as ChatGPT has without a ton of repeat users. There simply aren't enough people/devices.
Cancelled my account a long time ago. Gemini models are like a McDonald's croissant: you always give them an extra chance, but they always fall apart in your hands...
Just adding to the praise: I have a little test case I've used lately which was to identify the cause of a bug in a Dart library I was encountering by providing the LLM with the entire codebase and description of the bug. It's about 360,000 tokens.
I tried it a month ago on all the major frontier models and none of them correctly identified the fix. This is the first model to identify it correctly.
Interesting, I've been asking it to generate some Dart code, and it makes tons of mistakes, including lots of invalid code (static errors). When pointing out the mistakes, it thanks me and tells me it won't make it again, then makes it again on the very next prompt.
360k tokens = approximately how many lines of code?
And also, if it's an open-source lib, are you sure there are no mentions of this bug anywhere on the web?
Not a huge library, around 32K LoC and no mention of the bug on the web - I was the first to encounter it (it’s since been fixed) unless the training data is super recent.
Impressive. I tend to think it managed to find the bug by itself, which is pretty crazy given it can't debug anything. Then again, I haven't seen the bug description; perhaps it makes it super obvious where the problem lies.
How do you use the model so quickly? Google AI Studio? Maybe I've missed how powerful that is.. I didn't see any easy way to pass it a whole code base!
> with Gemini 2.5, we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training. Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.
Been playing around with it, and it feels intelligent and up to date. Plus it's connected to the internet, and it's a reasoning model by default when it needs to be.
I hope they enable support for the recently released canvas mode for this model soon; it would be a good match.
It is almost certainly the "nebula" model on LLMarena that has been generating buzz for the last few days. I didn't test coding, but its reasoning is very strong.
If you plan to use Gemini, be warned, here are the usual Big Tech dragons:
Please don’t enter ...confidential info or any data... you wouldn’t want a reviewer to see or Google to use ...
The full extract of the terms of usage:
How human reviewers improve Google AI
To help with quality and improve our products (such as the generative machine-learning models that power Gemini Apps), human reviewers (including third parties) read, annotate, and process your Gemini Apps conversations. We take steps to protect your privacy as part of this process. This includes disconnecting your conversations with Gemini Apps from your Google Account before reviewers see or annotate them. Please don’t enter confidential information in your conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.
Conversations that have been reviewed or annotated by human reviewers (and related data like your language, device type, location info, or feedback) are not deleted when you delete your Gemini Apps activity because they are kept separately and are not connected to your Google Account. Instead, they are retained for up to three years.
Emphasis on "retained for up to three years" even if you delete it!!
If I'm not wrong, ChatGPT states clearly that they don't use user data by default anymore.
Also, maybe some services are doing "machine learning" training with user data, but it is the first time I've seen a recent LLM service saying that it can feed your data to human reviewers at will.
I believe this is out of date. There’s a very explicit opt in/out slider for permitting training on conversations that doesn’t seem to affect conversation history retention.
You can use a paid tier to avoid such issues. Not sure what you're expecting from these "experimental" models, which are in development and need user feedback.
I wonder what about this one gets the +0.5 to the name. IIRC the 2.0 model isn’t particularly old yet. Is it purely marketing, does it represent new model structure, iteratively more training data over the base 2.0, new serving infrastructure, etc?
I’ve always found the use of the *.5 naming kinda silly when it became a thing. When OpenAI released 3.5, they said they already had 4 underway at the time, they were just tweaking 3 be better for ChatGPT. It felt like a scrappy startup name, and now it’s spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme.
I’m a much bigger fan of semver (not skipping to .5 though), date based (“Gemini Pro 2025”), or number + meaningful letter (eg 4o - “Omni”) for model names.
I would consider this a case of "expectation management"-based versioning. This is a release designed to keep Gemini in the news cycle, but it isn't a significant enough improvement to justify calling it Gemini 3.0.
I think it's reasonable. The development process is just not really comparable to other software engineering: It's fairly clear that currently nobody really has a good grasp on what a model will be while they are being trained. But they do have expectations. So you do the training, and then you assign the increment to align the two.
I figured you don't update the major version unless you significantly change the... algorithm, for lack of a better word. At least I assume something major changed between how they trained GPT-3 vs GPT-4, other than the amount of data. But maybe I'm wrong.
As I see it, if it uses a similar training approach and is expected to be better in every regard, then it's a minor release. Whereas when they have a new approach and where there might be some tradeoffs (e.g. longer runtime), it should be a major change. Or if it is very significantly different, then it should be considered an entirely differently named model.
Or drop the pretext of version numbers entirely since they're meaningless here and go back to classics like Gemini Experience, Gemini: Millennium Edition or Gemini New Technology
Just a couple of days ago I wrote on reddit about how long context models are mostly useless to me, because they start making too many mistakes very fast. They are vaguely helpful for "needle in a haystack" problems, not much more.
I have a "test" which consists in sending it a collection of almost 1000 poems, which currently sit at around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes, it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). Previous models don't usually "see" the final poems — they get lost, hallucinate and are pretty much worthless. I have tried several workaround techniques with varying degrees of success (e.g. randomizing the poems).
Having just tried this model (I have spent the last 3 hours probing it), I can say that, to me, this is a breakthrough moment. Truly a leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.
The analysis of this poetic corpus has few mistakes and is very, very, very good. Certainly very good in terms of how quickly it produces an answer — it would take someone days or weeks of thorough analysis.
Of course, this isn't about poetry — it's about passing in huge amounts of information, without RAG, and having a high degree of confidence in whatever reasoning tasks this model performs. It is the first time that I feel confident that I could offload the task of "reasoning" over large corpus of data to an LLM. The mistakes it makes are minute, it hasn't hallucinated, and the analysis is, frankly, better than what I would expect of most people.
Two years ago, Claude was known for having the largest context window and being able to remember tokens throughout the whole conversation.
Today, it seems like Google has beaten them: Gemini supports a way larger context window and is way better at keeping track of what has been said and remembering older tokens.
"Anna, Becca and Clare go to the play park. There is nobody else there. Anna is playing on the see-saw, Becca is playing on the swings. What is Clare doing?" (Sometimes I ask similar questions with the same structure and assumptions but different activities)
About a year ago none of them could answer it. All the latest models can pass it if I tell them to think hard, but previously Gemini could rarely answer it without that extra hint. Gemini 2.5 caveats its answer a bit, but does get it correct. Interestingly GPT-4o initially suggests it will give a wrong answer without thinking, but recognises it's a riddle, so decides to think harder and gets it right.
> This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!
Traditionally at Google experimental models are 100% free to use on https://aistudio.google.com (this is also where you can see the pricing) with a quite generous rate limit.
This time, the Googler says: “good news! you will be charged for experimental models, though for now it’s still free”
Right but the tweet I was responding to says: "This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!"
I assumed that meant there was a paid version with a higher rate limit coming out today
Since the 2.0 line, the Gemini models have been far better than GPT and Claude at engineering-type questions (fluids etc.), especially with questions that have images requiring more than just grabbing text. This one is even better.
Looks like it's this benchmark [1]. It's certainly less artificial than most long context benchmarks (that are basically just a big lookup table) but probably not as representative as Fiction.LiveBench [2], which asks specific questions about works of fanfiction (which are typically excluded from training sets because they are basically porn).
Impressive model - but I'm confused by the knowledge cutoff. AI Studio says it is January 2025 (which would be impressive), but when I query it about anything from early 2025 or mid/late 2024, it self-reports that its cutoff is in 2023 (which can't be right).
This is most evident when querying about fast-moving dev tools like uv or bun. It seems to only know the original uv options like pip and tools, while with bun it is unfamiliar with bun outdated (from Aug 2024), bun workspaces (from around that time?) but does know how to install bun on windows (April 2024).
You'll still need to provide this model with a lot of context to use it with any tooling or libraries with breaking changes or new features from the past ~year - which seems to contradict the AI Studio reported knowledge cutoff.
Were I developing models - I'd prioritise squeezing in the most recent knowledge of popular tools and libraries since development is such a popular (and revenue generating) use case.
Yea, I get a little bummed, but I guess a lot of HNers have reasons to not like Google. I've had a Google One membership forever, so I opted for the higher subscription with Gemini access since the beginning (plus a free year with a new Pixel phone), and I think it is awesome.
Most of us care only about coding performance, and Sonnet 3.5 has been such a giant winner that we don't get too excited about the latest model from Google.
For me personally - a rate limit of 50/day means that I can't use it as a daily driver, so I'll have to go back to Sonnet, which will gladly accept my money for more. Then I just forget it exists.
Yeah, if I don’t have higher rate limits, it’s useless. This just sounds like a gimmick launch where they want to gather feedback. It will be a couple of months before this will be GA.
> This nearest-neighbor connectivity is a key difference between TPUs and GPUs. GPUs connect up to 256 H100s in an all-to-all configuration (called a node), rather than using local connections. On the one hand, that means GPUs can send arbitrary data within a node in a single low-latency hop. On the other hand, TPUs are dramatically cheaper and simpler to wire together, and can scale to much larger topologies because the number of links per device is constant.
Memory grows linearly, compute grows quadratically (but with small constant - until ~100k the inference will be still dominated by non-quadratic factors).
Also, reusing keys/values for different queries can compress the KV cache; it can be a 1000x or 10000x improvement in bandwidth if the model is trained for it.
Just to clarify: simple prefix KV cache doesn't require any special model training. It does require the inference framework to support it, but most do by now.
You can see dramatic improvements in latency and throughput if there is a large shared prefix of the queries.
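To make the prefix idea concrete, here is a toy sketch in Python (my own illustration, not any real framework's API; real systems like vLLM or SGLang use radix trees or paged blocks rather than a dict of every prefix): if two requests share a token prefix, such as a long system prompt, its per-token key/value entries are computed once and reused.

    from typing import Dict, List, Tuple

    KVEntry = str  # stand-in for the real per-token key/value tensors
    cache: Dict[Tuple[str, ...], List[KVEntry]] = {}
    compute_calls = 0

    def compute_kv(token: str) -> KVEntry:
        global compute_calls
        compute_calls += 1          # proxy for the expensive attention math
        return f"kv({token})"

    def prefill(tokens: List[str]) -> List[KVEntry]:
        # Reuse the longest cached prefix; compute KV only for the suffix.
        kv: List[KVEntry] = []
        start = 0
        for i in range(len(tokens), 0, -1):
            hit = cache.get(tuple(tokens[:i]))
            if hit is not None:
                kv, start = list(hit), i
                break
        for i in range(start, len(tokens)):
            kv.append(compute_kv(tokens[i]))
            cache[tuple(tokens[:i + 1])] = list(kv)
        return kv

    system = ["you", "are", "a", "helpful", "assistant"]
    prefill(system + ["what", "is", "rust"])
    prefill(system + ["summarize", "this"])
    print(compute_calls)  # 10, not 15: the shared 5-token prefix was reused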
Funnyish story: the other night I asked my Pixel 9 to generate an image via Gemini, then I asked it to make a change. It didn't consider the previous context, so I asked it "Are you capable of keeping context?" No matter how clearly I enunciated "context", it always interpreted what I was saying as "contacts". After the 4th try, I said "context, spelled c-o-n-t-e-x-t" and it replied with "Ah, you meant context! Yes..."
I think Google is digging a hole for themselves by making their lightweight models the most used ones. Regardless of what their heavyweight models can do, people will naturally associate them with their search model or assistant model.
I noticed Gemini Flash 2.0 making a lot of phonetic typos like that, yeah. Like instead of Basal Ganglia it said Basil Ganglia.
I've also had it switch languages in the middle of output... like one word in the middle of a sentence was randomly output in some strange hieroglyphs, but when I translated them, it was the right word and the sentence made sense.
I was using the conversational feature of Gemini on my phone the other night and was trying to get it to read a blog post to me. The AI proceeded to tell me (out loud, via voice mode/speech synthesis) that it was a text based model and couldn't read text out loud.
For as amazing as these things are, AGI they are not.
I've been using Gemini Pro for my University of Waterloo capstone engineering project. Really good understanding of PDF documents and good reasoning, as well as structured output.
Recommend trying it out at aistudio dot google dot com
A model that is better on Aider than Sonnet 3.7? For free, right now? I think I'll give it a spin this weekend on a couple of projects, seems too good to be true.
On initial thoughts, I think this might be the first AI model to be reliably helpful as a research assistant in pure mathematics (o3-mini-high can be helpful but is more prone to hallucinations)
I tried the beta version of this model to write a business plan (long story).
I was impressed at first. Then it got really hung up on the financial model, and I had to forcibly move it on. After that it wrote a whole section in Indonesian, which I don't speak, and then it crashed. I'd not saved for a while (ever since the financial model thing), and ended up with an outline and a couple of usable sections.
I mean, yes, this is better than nothing. It's impressive that we made a pile of sand do this. And I'm aware that my prompt engineering could improve a lot. But also, this isn't a usable tool yet.
I'm curious to try again, but wary of spending too much time "playing" here.
I pointed out that part of the code, and it answered:
You've correctly pointed out that the TCO implementation in the provided C code snippet is essentially a no-op. The if and else blocks do the same thing: they both call apply(func, args, env). This means there's no actual tail call optimization happening; it's just a regular function call.
But then it follows up with even worse code. It does not even compile!
This looks like the first model where Google seriously comes back into the frontier competition? 2.0 Flash was nice for the price, but it's more focused on efficiency, not performance.
I was recently trying to replicate ClaudePlaysPokemon (which uses Claude 3.7) using Gemini 2.0 Flash Thinking, but it was seemingly getting confused and hallucinating significantly more than Claude, making it unviable (although some of that might be caused by my different setup). I wonder if this new model will do better. But I can't easily test it: for now, even paid users are apparently limited to 50 requests per day [1], which is not really enough when every step in the game is a request. Maybe I'll try it anyway, but really I need to wait for them to "introduce pricing in the coming weeks".
Edit: I did try it anyway and so far the new model is having similar hallucinations. I really need to test my code with Claude 3.7 as a control, to see if it approaches the real ClaudePlaysPokemon's semi-competence.
Edit 2: Here's the log if anyone is curious. For some reason it's letting me make more requests than the stated rate limit. Note how at 11:27:11 it hallucinates on-screen text, and earlier it thinks some random offscreen tile is the stairs. Yes, I'm sure this is the right model: gemini-2.5-pro-exp-03-25.
Update: I tried a different version of the prompt and it's doing really well! Well, so far it's gotten out of its house and into Professor Oak's lab, which is not so impressive compared to ClaudePlaysPokemon, but it's a lot more than Gemini 2.0 was able to do with the same prompt.
> This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!
From https://x.com/OfficialLoganK/status/1904583353954882046
While I'm sure the new Gemini model has made improvements, I feel like the user experience outside of the model itself is stagnating. I think OpenAI's interfaces, both web app and mobile app, are quite a bit more polished currently.
For example, Gemini's speech recognition struggles with longer pauses and often cuts me off mid-sentence. Also, OpenAI's Whisper model understands more context (for instance, saying "[...] plex, emby and Jellyfin [...]" is usually understood by Whisper, but less often by Gemini).
The Gemini web app lacks keyboard shortcuts for basic actions like opening a new chat or toggling the sidebar (good for privacy friendly pair programming). Last point off the top of my head would be the ability to edit messages beyond just the last one. That's possible in ChatGPT, but not in Gemini.
Googlers are spending so much money on model training; I would appreciate spending some on making it fun to use :)
I love to see this competition between companies trying to get the best LLM, and also, the fact that they’re trying to make them useful as tools, focusing on math, science, coding, and so on
They even priced it so people would avoid using it. GPT-4.5's entire function was to be the anchor of keeping OpenAI in the news, to keep up the perception of releasing quickly.
My assumption was that the pricing was because it really was that expensive for whatever reason. I'm keeping fingers crossed that they're going to do some kind of 4.5 mini at some point that will be more affordable.
You're not wrong, but that just means the <adjective> is where the bulk of information resides. The trade-off matters. Maybe it's a model with good enough quality but really cheap to serve. Maybe it's a model that only plays poker really well but sucks at everything else because it bluffs too much. Etc. etc.
This model is quite impressive. Not just useful for math/research with great reasoning, it also maintained a very low hallucination rate of 1.1% on Vectara Hallucination Leaderboard:
https://github.com/vectara/hallucination-leaderboard
With the recent pace of model updates, I wonder which factor is more important: hardware assets, software/talent, or data access. Google is clearly in the lead in terms of data access, in my view. If I were a top talent in AI, I'd go where I can work with the best data, no?
I think an argument could be made for hardware too. Perhaps in absolute terms Nvidia is ahead, but in terms of knowing how to get the most out of the hardware, Google making its own chips, building on their networking, etc, is a pretty big advantage.
(Disclaimer, Googler, but I don’t work on any of this, I only have an external layperson’s understanding of it)
UX is actually increasingly the bottleneck. Most of the top models are very good if you micromanage their context and prompts. But people aren't very good at that stuff.
Some of the desktop chat clients are turning into great productivity tools. I tried the Claude one last week and quickly went back to Chat GPT. Claude might be a better model for coding. But it's less effort to make Chat GPT do what I want at this point and it's kind of good enough for a lot of stuff. Every release it's getting better. It connects to my IDE automatically, it can look at the files I have open. It can patch those files (I actually disabled that because it's too slow for my taste), etc.
But most importantly, I can trigger all that with option+shift+1. I do this gazillions of times per day. Mostly simple stuff with really short prompts: "check this" (file, selection, current line, etc.), fix that, what do you think about x, "address the FIXMEs/TODOs", "document this", etc.
I can ask other models the same questions and they'd get the job done. But then I have to do more work to give them the same context. Claude has a GitHub connect option, which is great. But unfortunately it's just a glorified file picker, which really sucks. I have files open in my editor; just look at those. I don't want to have to manually open files or specify what files to look at every time I go near the tool.
Chat GPT actually asked me yesterday whether it could add a different file than the one it was looking at. I said "yes" and it did. That's great UX. Don't make me do work.
I use Gemini mainly because it's integrated into Google's tools, so it's kind of just there. And Chat GPT, for whatever reason, cannot look at the browser window. But from a UX point of view, that kind of deep integration is what you want. You have this implicit shared context, which is the thing you are looking at, that you don't have to spell out anymore.
The UX of populating the context is the deciding factor in how useful models are at this point, not how well it solves pet benchmark questions or renders pelicans on bicycles.
I have good hopes for agentic coding tools progressing rapidly this year. The ones I've tried recently need a lot of work though. I keep going back to Chat GPT because it's just the quickest & easiest to use at this point.
The problem Goog has is its insane bureaucracy and lack of vision from Sundar, which isn't very attractive from an employee position. If you're working close to Demis I imagine the situation is better though.
Gemini refuses to answer any questions on proportional swing models or anything related to psephology on the grounds that it has to do with elections. Neither Claude nor ChatGPT nor Mistral/Le Chat are that neutered.
Slight tangent: Interesting that they use o3-mini as the comparison rather than o1.
I've been using o1 almost exclusively for the past couple months and have been impressed to the point where I don't feel the need to "upgrade" for a better model.
Are there benchmarks showing o3-mini performing better than o1?
The benchmark numbers don't really mean anything -- Google says that Gemini 2.5 Pro has an AIME score of 86.7 which beats o3-mini's score of 86.5, but OpenAI's announcement post [1] said that o3-mini-high has a score of 87.3 which Gemini 2.5 would lose to. The chart says "All numbers are sourced from providers' self-reported numbers" but the only mention of o3-mini having a score of 86.5 I could find was from this other source [2]
I noticed this too. I have used both o1 and o3-mini extensively, and I have run many tests on my own problems; o1 solves one of my hardest prompts quite reliably but o3-mini is very inconsistent. So from my anecdotal experience o1 is the superior model in terms of capability.
The fact they would exclude it from their benchmarks seems biased/desperate and makes me trust them less. They probably thought it was clever to leave o1 out, something like "o3 is the newest model lets just compare against that", but I think for anyone paying attention that decision will backfire.
It's a reasonable comparison given it'll likely be priced similarly to o3-mini. I find o1 to be strictly better than o3-mini, but still use o3-mini for the majority of my agentic workflow because o1 is so much more expensive.
Why would you compare against all the models from a competitor? You take their latest one that you can test. OpenAI or Anthropic don't compare against the whole Gemini family.
I do not intend to take anything away from the technical achievement of the team. However, as Satya opined some weeks back, these benchmarks do not mean a lot if we do not see a comparable increase in productivity.
But then there are two questions. First, are the white-collar workers responsible for the increase in productivity specifically consultants and engineers? Or is it the white-collar workers at the very right tail, e.g., scientists?
I think consultants and engineers are using these technologies a lot. I think biologists at least are using these models a lot.
It's a complex proposition.
I think Satya was talking about actual GDP growth, right?
In theory, let's say all knowledge work is now 50% faster due to A.I. Well then I would assume this should affect civil society as well - planning a bridge, a railway etc. should happen faster and more efficiently (the actual building of things won't, but a lot of time is spent on planning and red tape).
Healthcare in general should become way more efficient with people getting better treatment; this should have a positive economic effect.
It does seem to me like it should be able to speed things up in the real world but of course a lot will have to do with how well the models can reason / how often they make catastrophic mistakes + the will of the governments and people to start using them seriously.
But it's more complex than that - if many people start losing their jobs we all take a hit on GDP because they can't consume as much anymore, so it could take perhaps a long time until GDP actually sees meaningful gains.
And one last thought - Satya likely hasn't spent much time thinking about gdp, it's just not his field. He's a smart guy for sure but this isn't what he does.
Unemployment hasn't really picked up, and is unlikely to do so, unless the central bank is incompetent. (They have been from time to time.)
However, some advances don't show up in GDP. Eg Wikipedia is a tremendous achievement. But nobody pays for it, so it doesn't show up in GDP statistics.
> Unemployment hasn't really picked up, and is unlikely to do so
That's an important assessment. I don't know if you're right. If the models are going to continue to get more capable, I'm expecting unemployment to rise; I don't see how it won't (sure, we are promised A.I. will create tons of new jobs no one has imagined yet, but I haven't seen a reliable clue for such jobs yet).
I can feel this already with my own use of language models.
All the questions I had before language models, I have answered with language models.
That doesn't mean I have no more questions though. Answering those questions opened up 10X more questions I have now.
In general, everyone knows that answering scientific questions leads to new and more questions. It is the exact same process in the economy. There is a collectivist sentiment though in society and the economy that wants to pretend this isn't true. That the economic questions can be "solved", the spoils divided up and we live happily ever after in some kind of equilibrium.
As far as new jobs, they are here now but they surely sound as ridiculous to think about as being a professional youtuber in 2005. Or I think of the person making a geocities website in 1997 vs a front end developer. There is no date that a front end developer emerges from the html code monkey. It is a slow and organic process that is hard to game.
> As far as new jobs, they are here now but they surely sound as ridiculous to think about as being a professional youtuber in 2005
How many people can make an actual living out of Youtube? Surely they exist, but to reliably live off it for decades (not just 1-2 years of temporary fame - which is also very hard to come by) I'd say fewer than one in ten thousand people will make it. I can't call "Youtuber" a career path with that kind of success rate any more than I can call being an actor in Hollywood a career path.
No, it won't (necessarily) be AI that's creating the new jobs. In general, when a new technology comes along and automates away some jobs, you can't expect the same technology to provide the new jobs.
To give an example from the recent past: 'hipster' baristas that make you a five dollar coffee are a fairly new job. At least at scale.
But I doubt you'll be able to find any technology that automated some other job but created barista jobs.
It's just that the market will find stuff for people to do for money, unless prevented from doing so by incompetent central bank policy or (too) onerous labour market regulation.
(The labour market can take quite a lot of regulation, and still be able to get people jobs. Have a look at Germany today for an example.)
> It's just that the market will find stuff for people to do for money
Will it ?
Let's take my example: I'm a 41-year-old male with around 15 years of experience in software development. Let's say 4 years from now myself and a million others are losing our development jobs to A.I.
What does the market have for my skills? I can try going into healthcare or teaching (though that's quite an extensive retraining + salary reduction), I can go into the trades (same) or get some other work that's hard to automate like caring for old people (very low salary). All of these options involve massive salary reduction, and that's in the positive scenario that I actually am able to retrain and survive such a shift mentally.
It's quite likely many software devs won't be able to become plumbers and nurses and will become chronically unemployed.
Well, we have many examples where in the past technology (and to a lesser extent trade) has led to some sectors of the economy using fewer people than before.
The situation you describe isn't all that special.
Yes, losing your job (or your career) is not fun, and can be painful. Massive salary reduction can happen.
No, that hasn't led to widespread unemployment in the past. At least not widespread enough to be visible in aggregate statistics, especially over the noise of the 'normal' business cycle. However, individuals can obviously have pretty long spells of unemployment, but that can also happen without a shift in technology.
> Yes, losing your job (or your career) is not fun, and can be painful. Massive salary reduction can happen.
I'm just trying to get the point across that unemployment might rise, so GDP may fall; in fact I think that should be the baseline scenario, rather than assuming some new jobs we can't imagine yet will be created.
It's so hard to imagine these new jobs because if the machines will out perform us cognitively it follows we will be able to get intelligent robots into the real world quite soon after. Then seriously what the heck is left? Fewer jobs, not more.
There is one "cure" I can think of for this and that's something closer to socialism, the market will have to step aside and the government will create massive amounts of new jobs. For example classes can be 5 pupils per teacher instead of 30 pupils per teacher. Nurses can attend to 3 patient beds instead of 8.
But letting the market sort this out ? I don't think so.
As it stands currently I'd say this is difficult to measure.
They're not baked into workflows where the measurable output is easily attributed to the model use. Productivity in its current form is transformative in the sense that the use case and gain differ for each individual (who even provides different prompts). So some are keeping the gains for themselves, others are using it to improve quality rather than quantity.
It'll come in time; it's important to remember GPT-4 was released 2 years ago this month. The newer models are more reliable and could probably be introduced into workflows more frequently. Today I spoke to a company who are looking to use it to reduce costs in the next year.
That’s true, but productivity has many factors and takes a long time to get confidence on. Any productivity value that could be stated clearly would have similar downsides to a benchmark, and take far longer.
Benchmarks are useful as leading indicators. Early warning signs. If there’s no relation to the eventual productivity then hopefully that benchmark will disappear as it’s not useful.
In a fast moving space like this it’s reasonable to make use of leading indicators.
I asked for the direction of friction on a ball rolling either up or down an inclined plane - it gave the wrong answer and was adamant about it. Surprisingly, similar to o1.
Gave it a problem which sounds like the Monty Hall problem but is a simple probability question, and it nailed it.
Asked it to tell a joke - the most horrible joke ever.
Much better than o1 but still nowhere near AGI. It has been optimized for logic and reasoning at best.
Why not enable Canvas for this model on Gemini.google.com? Arguably the weakest link of Canvas is the terrible code that Gemini 2.0 Flash writes for Canvas to run.
curl -s "https://hn.algolia.com/api/v1/items/43473489" | \
jq -r 'recurse(.children[]) | .author + ": " + .text' | \
llm -m "gemini-2.5-pro-exp-03-25" -s \
'Summarize the themes of the opinions expressed here.
For each theme, output a markdown header.
Include direct "quotations" (with author attribution) where appropriate.
You MUST quote directly from users when crediting them, with double quotes.
Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
It will be a huge achievement if models can get to the point where so much selection effort isn't required: gemini.google.com currently lists 2.0 Flash, 2.0 Flash Thinking (experimental), Deep Research, Personalization (experimental), and 2.5 Pro (experimental) for me.
There's probably a sweet spot here. On the flip side, ChatGPT currently doesn't indicate whether a given image generation request was serviced by multimodal GPT-4o [1] or Dall-E.
Personally, I do like the "use web search" and "extended thinking" buttons, but ultimately, the models should probably be able to figure out whether doing so would be useful themselves too.
Can anyone share what they're doing with reasoning models? They seem to only make a difference with novel programming problems, like Advent of Code. So this model will help solve slightly harder advent of codes.
By extension it should also be slightly more helpful for research, R&D?
Have been using them for non-interactive coding where latency is not an issue. Specifically, turning a set of many free-text requirements into SQL statements, so that later when an item's data is entered into the system, we can efficiently find which requirements it meets. The reasoning models' output quality is much better than the non-reasoning models like 3.5 Sonnet, it's not a subtle difference.
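A rough sketch of that pattern as I understand it from the description (the schema, the prompt and the call_llm stub are all made up for illustration, not their actual system): the expensive reasoning step runs once per requirement, offline, and matching items later is just a cheap query.

    import sqlite3

    def call_llm(prompt: str) -> str:
        # Stand-in for whatever reasoning-model API is used; in practice the
        # generated predicate should be reviewed before being trusted.
        return "weight_kg < 2 AND ship_region = 'EU'"

    requirement = "Items must weigh under 2 kg and ship from the EU."
    predicate = call_llm(
        "Translate this requirement into a single SQL WHERE clause over "
        "items(id, weight_kg REAL, ship_region TEXT). Return only the clause.\n"
        f"Requirement: {requirement}"
    )

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE items (id INTEGER, weight_kg REAL, ship_region TEXT)")
    con.execute("INSERT INTO items VALUES (1, 1.4, 'EU'), (2, 3.0, 'EU')")
    print(con.execute(f"SELECT id FROM items WHERE {predicate}").fetchall())
    # [(1,)] -> item 1 meets the requirement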
I found reasoning models are much more faithful at text-related tasks too (e.g. 1. translating long key-value pairs (i.e. Localizable.strings), 2. long transcript fixing and verification, 3. looking at CSV / tabular data and fixing it), probably due to the reflection mechanism built into these reasoning models. Using prompts such as "check your output to make sure it covers everything in the input" lets the model double-check its work, avoiding more manual checks on my end.
Seriously? That doesn't require a human?! Are we talking about some kind of "generic" incident? (Type 3: forgot to manually update the xxxx file.) Or what's going on?
I know next to nothing about AI, but I just experienced an extraordinary hallucination in a google AI search (presumably an older Gemini model right?) as I elaborated in detail in another HN thread. It might be a good test question. https://news.ycombinator.com/item?id=43477710
Is this the first model announcement where they show Aider's Polyglot benchmark in the performance comparison table? That's huge for Aider and anotherpaulg!
> Developers and enterprises can start experimenting with Gemini 2.5 Pro in Google AI Studio now, and Gemini Advanced users can select it in the model dropdown on desktop and mobile. It will be available on Vertex AI in the coming weeks.
I'm a Gemini Advanced subscriber, still don't have this in the drop-down model selection in the phone app, though I do see it on the desktop webapp.
"Hi, here is our new AI model; it performs task A x% better than our competitor 1 and task B y% better than our competitor 2" seems to be the new hot AI template in town.
"My info, the stuff I was trained on, cuts off around early 2023." - Gemini 2.5 to me. Appears that they did a not-so-recent knowledge cutoff in order to use the best possible base model.
There is no point in asking such questions, the model doesn't know what it is on its own, and you could get many different answers if you repeat it a few more times.
I tested out Gemini 2.5 and it failed miserably at calling into tools that we had defined for it. Also, it got into an infinite loop a number of times where it would just spit out the exact same line of text continuously until we hard killed the process. I really don't know how others are getting these amazing results. We had no problems using Claude or OpenAI models in the same scenario. Even Deepseek R1 works just fine.
I can imagine that it's not so interesting to most of us until we can try it with cursor.
I look forward to doing so when it's out. That Aider bench mixed with the speed and a long context window that their other models are known for could be a great mix. But we'll have to wait and see.
More generally, it would be nice for these kinds of releases to also report speed and context window as separate benchmarks. Or somehow include them in the score. A model that is 90% as good as the best but 10x faster is quite a bit more useful.
These might be hard to mix to an overall score but they're critical for understanding usefulness.
It's "experimental", which means that it is not fully released. In particular, the "experimental" tag means that it is subject to a different privacy policy and that they reserve the right to train on your prompts.
2.0 Pro is also still "experimental" so I agree with GP that it's pretty odd that they are "releasing" the next version despite never having gotten to fully releasing the previous version.
Thanks. I think my post lacked clarity of what I was talking about. I meant that most people care about API access to use with their favorite editor. It's a big limiter with grok, for example.
But I did mingle that with my knowledge of google's history of releasing without releasing these models which, as you point out, isn't true with this release.
Edit: Sharing it - https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
with pastebin - https://pastebin.com/aiWuYcrF
We are currently at nonsensical pacing while writing novels.
Most human-written books don't do that, so that seems to be a criterion for a very different test than a Turing test.
There were popular writeups about this from the Deepseek-R1 era: https://www.tumblr.com/nostalgebraist/778041178124926976/hyd...
The pace is moving so fast I simply can't keep up. Is there an ELI5 page which gives a 5-minute explanation of LLMs from 2020 to this moment?
Not sure what is better for humanity in long term.
I could build a machine that phones my mother and tells her I love her, but it wouldn't obsolete me doing it.
Let's not forget the OpenAI benchmarks saying GPT-4 could do better at college exams and such than most students. Yet real-world performance on real tasks was laughable.
I am amazed at the progress that we are _still_ making on an almost monthly basis. It is unbelievable. Mind-boggling, to be honest.
I am certain that the issue of pacing will be solved soon enough. I'd give 99% probability of it being solved in 3 years and 50% probability in 1.
Yes, Z is indeed a big advance over Y, as Y was a big advance over X. Also yes, Z is just as underwhelming.
Are customers hurting the AI companies' feelings?
No. It's the critics' feelings that are being hurt by continued advances, so they keep moving goalposts so they can keep believing they're right.
Yeah, but 10% plus 20% plus 20%... next thing you know you're at +100% and your server is literally double the speed!
AI progress feels the same. Each little incremental improvement alone doesn't blow my skirt up, but we've had years of nearly monthly advances that have added up to something quite substantial.
(For those too young or unfamiliar: Mary Poppins famously had a bag that she could keep pulling things out of.)
In that light, even a 20 year old almost broken down crappy dinger is amazing: it has a radio, heating, shock absorbers, it can go over 500km on a tank of fuel! But are we fawning over it? No, because the goalposts have moved. Now we are disappointed that it takes 5 seconds for the Bluetooth to connect and the seats to auto-adjust to our preferred seating and heating setting in our new car.
There are some logical inconsistencies, though. For instance, when they both enter the cellar through a trapdoor, Kael goes first, but the innkeeper instructs him to close the trapdoor behind them, which makes no sense. Also, Kael goes down the stairs and "risks a quick look back up" and can somehow see the front door bulging and the chaos outside through the windows, which obviously is impossible when you look up through a trapdoor, not to mention that previously it was said this entry is behind the bar counter, surely blocking the sight. Kael lights an oily rag which somehow becomes a torch. There are more generic issues, like these Eldertides supposedly being mythical things no one has ever seen, yet they seem to be pretty common occurrences? The dimensions of the cellar are completely unclear: at first it seems to be very small, yet they move around it quite a bit. There are other issues, like people using the same words as the narrator ("the ooze"), as if they can hear him, and the innkeeper suddenly calling Kael by his name as if they already know each other.
Anyway, I would rate it "first draft". Of course, it is unclear whether the LLM would manage to write a consistent book, but I can fully believe that it would manage. I probably wouldn't want to read it.
That would be the next step but I'd previously never thought going any further might be worth it.
And it also, as you say, really reuses words. Just reading this chapter I notice "phosphorescence" 4 times, for example, and "ooze" 17 times (!).
It is very impressive though that it can create a somewhat cohesive storyline, and certainly an improvement over previous models.
LLMs have been producing new things all the time. The question was always about quality of output, never about being able to produce anything new.
10000s is still much smaller than the space of possibilities for even a short prompt.
You might be right that good human novels are better than what LLMs can manage today. But that's rapidly changing.
And if you really need that Harry Potter / Superman / Three Musketeers crossover fan fiction itch scratched, you might not care that some other existing novel is 'better' in some abstract sense.
I've read a LOT of fiction. I love reading. And if it's good enough, the idea of reading something created by a machine does not bother me at all. So of course I will continue to see if the machine is finally good enough and I can be a bit more specific.
Tens-of-thousands is probably low by something in the neighborhood of four orders of magnitude.
Human-generated slop is real.
[0] https://github.com/tluyben/bad-writer
Planning generated content is often more important to invest in than the writing of it.
Looking at your paste, your prompt is short and basic; it should probably be broken up into clear, formatted sections (try directives inside XML-style tags). For such a large output as you're expecting, I'd expect a considerable prompt of rules and context setting (maybe a page or two).
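For what it's worth, a rough illustration of what those sections might look like, written out as a Python string (the tag names and rules below are invented; the point is only the clearly delimited structure):

    prompt = """
    <role>
    You are drafting one chapter of a long novel. Work only on this chapter.
    </role>

    <plan>
    Outline of the whole book, plus a one-paragraph summary of every prior chapter.
    </plan>

    <rules>
    - Do not introduce characters that are not listed in <characters>.
    - Do not resolve any plot thread marked "later" in the outline.
    - Target 3,000-4,000 words; end on an unresolved beat.
    </rules>

    <style>
    Third person limited, past tense; avoid repeating distinctive words
    ("ooze", "phosphorescence") more than once per scene.
    </style>

    <characters>
    (character profiles go here)
    </characters>
    """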
Just for fun: Asked it to rewrite the first page of ‘The Fountainhead’ where Howard is a computer engineer, the rewrite is hilarious lol.
https://gist.github.com/sagarspatil/e0b5443132501a3596c3a9a2...
I don't know why, but that is just such a literal thing to say that it seems almost random.
- Buildup and happy background world-building
- Subtle foreshadowing
- Orcs attack
- Hero is saved by unlikely warrior of astounding beauty
- Evil is defeated until sales justify unnecessary sequel
That's the kind of story fit for the modern attention span...
Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.
For those curious the riddle is: There's three people in a circle. Each person has a positive integer floating above their heads, such that each person can see the other two numbers but not his own. The sum of two of the numbers is equal to the third. The first person is asked for his number, and he says that he doesn't know. The second person is asked for his number, and he says that he doesn't know. The third person is asked for his number, and he says that he doesn't know. Then, the first person is asked for his number again, and he says: 65. What is the product of the three numbers?
https://www.reddit.com/r/math/comments/32m611/logic_question...
So it’s likely that it’s part of the training data by now.
Google spits out: "The product of the three numbers is 10,225 (65 * 20 * 8). The three numbers are 65, 20, and 8."
Whoa. Math is not AI's strong suit...
Bing spits out: "The solution to the three people in a circle puzzle is that all three people are wearing red hats."
Hats???
Same text was used for both prompts (all the text after 'For those curious the riddle is:' in the GP comment), so Bing just goes off the rails.
You know what would be even stupider? Putting a cheap, wrong LLM response atop each search query.
The second result is this reddit.com answer, https://www.reddit.com/r/math/comments/32m611/logic_question..., where at least the numbers make sense. I haven't examined the logic portion of the answer.
Bing doesn't list any reddit posts (that Google-exclusive deal) so I'll assume no stackexchange-related sites have an appropriate answer (or bing is only looking for hat-related answers for some reason).
I'm eagerly awaiting an article about RAG caching strategies though!
- This kid is a genius! - you yell
- But wait, the kid has just picked an answer from the ground, it didn't actually come up...
- But the other toddlers could do it also but didn't!
There are lots of possible mechanisms by which this particular problem would become more prominent in the weights in a given round of training even if the model itself hasn't actually gotten any better at general reasoning. Here are a few:
* Random chance (these are still statistical machines after all)
* The problem resurfaced recently and shows up more often than it used to.
* The particular set of RLHF data chosen for this model draws out the weights associated with this problem in a way that wasn't true previously.
Recently these have stopped, and now the ads are about becoming a maths tutor for AI.
Doesn’t seem like a role with long-term prospects.
Generally speaking, posting output from an LLM without explaining exactly what you think it illustrates and why is frowned upon here. I don't think your comment does a great job of the latter.
> I don't think this means what you think it means.
> I did some interacting with the Tencent model that showed up here a couple days ago [...]
> This is a question that obviously was in the training data. How do you get the answer back out of the training data?
What do I think the conversation illustrates? Probably that having the answer in the training data doesn't get it into the output.
How does the conversation illustrate that? It isn't subtle. You can see it without reading any of the Chinese. If you want to read the Chinese, Google Translate is more than good enough for this purpose; that's what I used.
I cannot figure out what the comment is trying to get across either. It's easy for you because you already know what you are trying to say. You know what the pasted output shows. The poor execution is in not spending enough time thinking about how someone coming in totally blind would interpret the comment.
I can't, and I imagine most of the people who downvoted you couldn't either.
I think asking people to go to Google Translate to parse a random comment that seems to be 90% LLM output by volume is a bit much.
I didn't downvote you, but like (probably) most people here, I can't read Chinese; I can't derive whatever point you're trying to make just from with text you provided.
Even ignoring the fact that this or a similar problem may have appeared in the training data, it's something careful brute-force logic should solve. It's neither difficult, nor interesting, nor useful. Yes, it may suggest a slight improvement in basic logic, but no more so than a million other benchmarks people quote.
This goes to show that evaluating models is not a trivial problem. In fact, it's a hard problem (in particular, it's a far far harder than this math puzzle).
That's not to say that you didn't, or that you're just recalling it from a previous time you happened to see this exact puzzle (despite there being scant prior references to this puzzle, which is precisely the reason for using it). But you can see how some might see that as not entirely credible.
Best guess: this random person is someone that really likes puzzles, is presumably good at them and is very, very far from being representative to the extent you would require to be in support of your argument.
Read: just a heavy flex about puzzle solving.
In my experience, the difference between someone who solves this type of logic puzzle and someone who doesn't, has more to do with persistence and ability to maintain focus, rather than "intelligence" in terms of problem-solving ability per se. I've worked with college students helping them learn to solve these kinds of problems (eg. as part of pre-interview test prep), and in most cases, those who solve it and those who don't have the same rate of progress towards the solution as long as they're actively working at it. The difference comes in how quickly they get frustrated (at themselves mostly), decide they're not capable of solving it, and give up on working on it further.
I mention this because this frustration itself comes from a belief that the ability to solve these belongs to some "exceedingly clever" people only, and not someone like them. So, this kind of thinking ends up being a vicious cycle that keeps them from working on their actual issues.
It was a fun puzzle though and I'm surprised I didn't know it already. Thanks for sharing.
I randomly answered this post and can't solve it in half an hour. Is the point leetcode but for AI? I'd rather it solve real problems than "elite problems".
Side note: couldn't even find pen and paper around in half an hour.
Granted, this is all way over my head, but the solution Gemini comes to matches the one given on Reddit (and now here, in future training runs)
65×26×39=65910
I don't think Gemini does an unnecessary amount of computation, it's just more verbose. This is typical of reasoning models, almost every step is necessary but many would not be written down by a human.
Especially as the company archives all those chats (not sure for how long) and is small enough that a billion-dollar "data sharing" agreement would be a very enticing offer.
If there isn't a significant barrier to access, it's being scraped. And if that barrier is money, it's being scraped but less often.
i.e. You observe three people in a magical room. The first person is standing underneath a 65, the second person is standing underneath a 26 and the third person is standing underneath a 39. They can see the others numbers but not the one they are directly under. You tell them one of the three numbers is the sum of the other two and all numbers are positive integers. You ask the first person for their number, they respond that they don't know. You ask the second person for their number, they respond that they don't know. You ask the third person, they respond that they don't know. You ask the first person again and they respond with the correct value, how did they know?
And of course, if it responds with a verbatim answer in the line of https://www.reddit.com/r/math/comments/32m611/logic_question..., we can be pretty confident what's happening under the hood.
Gemini 2.5 and Claude 3.7 Thinking get it right; o3-mini and 4o get it wrong.
[1] https://en.wikipedia.org/wiki/100_prisoners_problem
Tried this in DeepSeek and Grok and it kept thinking in loops for a while, so I just turned it off.
I haven’t seen a question loop this long ever.
Very impressed
https://g.co/gemini/share/badd00a824d2
It's basically: given a sequence of heap operations (insert element, delete minimum element), can you predict the left-over elements (that are in the heap at the end) in linear time in the comparison model?
(The answer is surprisingly: Yes.)
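For anyone who had to re-read that: here is the straightforward O(n log n) simulation in Python, just to pin down what "left-over elements" means. The interesting claim above is that the answer can be computed in linear time in the comparison model, which this naive version makes no attempt at.

    import heapq

    def leftover(ops):
        """ops: a list of ('insert', x) and ('deletemin',) operations."""
        heap = []
        for op in ops:
            if op[0] == 'insert':
                heapq.heappush(heap, op[1])
            else:
                heapq.heappop(heap)
        return sorted(heap)

    print(leftover([('insert', 5), ('insert', 2), ('deletemin',), ('insert', 7)]))
    # -> [5, 7]: the 2 was the minimum at the time of the deletemin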
(All state is stored in localStorage so you can come back to it :) ).
It's not an exaggeration it's a non-sequitur, you first have to show that the LLMs are reasoning in the same way humans do.
After a couple of minutes it decided on the answer being 65000. (S = {65, 40, 25})
Call the three numbers a, b, and c. This means c = a + b, but we still don’t know to which person each number belongs.
When person 1 (p1) is asked what his number is, he has no way to know whether he has a, b, or c, so he says he doesn’t know. Same goes for p2 and p3. Clearly p1 somehow gains information by p2 and p3 passing. Either he realizes that he must be either a or b, and such his number is the difference between p2 and p3’s numbers, or he realizes that he must be c and so his number is the sum of p2 and p3’s numbers.
That’s all I have so far. Anyone have other ideas?
If p1 KNOWS that he’s the largest then he has to have gained some other piece of information. Say the numbers he sees are 32 and 33. His number would have to be either 1 or 65. If p1 was 1 then the other two would have known p1 couldn’t be the sum of the other two
If p2 sees 1 and 33, s/he would wonder if s/he is 32 or 34.
P3 would consider 31 or 33.
P1 knows that P2 and P3 are not equal. So they know that the set isn't [2A, A, A].
P2 knows that P1 and P3 are not equal. So they know that the set isn't [A, 2A, A]. They also know that if P1 doesn't know, then they were able to make the same deduction. So they now know that both [2A, A, A] and [A, 2A, A] aren't correct. Since they know that [2A, A, A] isn't correct, they can also know that [2A, 3A, A] isn't correct either. Because they'd be able to see if P1 = 2A and P3 = A, and if that were true and P1 doesn't know their number, it would have to be because P2 isn't A. And if P2 isn't A, they'd have to be 3A.
P3 knows that P1 and P2 aren't equal. Eliminates [A, A, 2A]. Knows that [2A, A, A], [A, 2A, A], and [2A, 3A, A], are eliminated. Using the same process as P2, they can eliminate [2A, A, 3A], [A, 2A, 3A], and also [2A, 3A, 5A]. Because they can see the numbers and they know if P1 is 2A and P2 is 3A.
Now we're back at P1. Who now knows.
So P2 and P3 are in the eliminated sets. Which means we're one of these
[2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]
We know his number is 65. To find the set, we can factor 65: (5 * 13). We can check the other numbers 2(13) = 26. 3(13) = 39. And technically, you don't need to find the other numbers. The final answer is 5A * 2A * 3A or (A^3) * 30.
Why? Couldn't it be an infinite number of 3 size arrays comprised of A where two elements sum to the third? [24A, 13A, 11A]? How did we deduce this set of arrays?
EDIT: Solved from another reddit comment. Tuples without a common factor like the one above are considered as a=1.
"They're not eliminated; they correspond to a = 1."
You should be able to generate an infinite number of these problems just by multiplying the first formula factor by a prime number. Take the same question but with the person answering '52': that restricts you to either [4a, 3a, a] or [4a, a, 3a] with a = 13. Since the question only asks for the product of all the terms, the answer is 52 * 39 * 13 = 26,364 either way.
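Since this subthread keeps going back and forth about which triples survive the three passes, here is a small brute-force check in Python (my own sketch, nothing any model produced): it simulates the common-knowledge reasoning recursively and searches a bounded range for triples where person 1 holds 65, everyone passes once, and person 1 then knows.

    from functools import lru_cache

    def candidates(x, y):
        # Given the two numbers you can see, your own is either their sum
        # or their (positive) difference.
        vals = {x + y}
        if x != y:
            vals.add(abs(x - y))
        return vals

    @lru_cache(maxsize=None)
    def knows(triple, turn):
        # Would the speaker at `turn` (turns cycle through persons 0,1,2)
        # know their own number, given that every earlier speaker passed?
        person = turn % 3
        seen = tuple(v for i, v in enumerate(triple) if i != person)
        consistent = 0
        for v in candidates(*seen):
            hypo = tuple(v if i == person else triple[i] for i in range(3))
            if all(not knows(hypo, t) for t in range(turn)):
                consistent += 1
        return consistent == 1

    hits = []
    for a in range(1, 200):
        for b in range(1, 200):
            triple = (65, a, b)
            # "One of the numbers is the sum of the other two":
            if not (a + b == 65 or a == 65 + b or b == 65 + a):
                continue
            # Three passes, then person 1 knows on their second turn.
            if (not knows(triple, 0) and not knows(triple, 1)
                    and not knows(triple, 2) and knows(triple, 3)):
                hits.append(triple)

    print(hits)  # should print [(65, 26, 39)] -> product 65 * 26 * 39 = 65910

Within that search range it agrees with the Reddit answer quoted above.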
So A + B = C and A + C = B. But we know that A + B = C, so we can replace C with (A + B). So we know that A + A + B = B.
So 2A + B = B. Or 2A = 0.
And this holds any way you slice it.
Even if you were to try and brute force it.
A = 1
B = 2
Then C = 3. But A + C has to equal B. That's 1 + 3 = 2? That's not true.
I don't see a case where you can add to the sum of two numbers one of the numbers and get the other number.
I'm guessing that's a misreading of the problem. Because it looks like the third number is the sum of the first two.
The original problem is a little ambiguously worded. You could say "one of their numbers is the sum of the other two" and it would be a little clearer.
No it isn't. If it said "the sum of any two of the numbers is equal to the third", that would be a contradiction. What it says is "the sum of two of the numbers is equal to the third".
Buying two of the items gets you the third for free.
The implication is any two.
It’s ok that it’s ambiguous. It happens. In most cases, we clarify and move on. There’s no need to defend it.
It's a common logic puzzle. Google can't turn up an exact match to the wording you have, but e.g. here: https://www.futilitycloset.com/2018/03/03/three-hat-problem/
That’s one-shot for o1 pro.
Let’s carefully analyze this puzzle step-by-step:
⸻
Step 1: Define the Problem Clearly
We have three people in a circle. Let’s call them: • Person A (first person) • Person B (second person) • Person C (third person)
Each has a positive integer above their head. Each sees the other two numbers but not their own. The numbers satisfy: A + B = C \quad \text{or} \quad B + C = A \quad \text{or} \quad C + A = B
Given the puzzle scenario, we need to find out the exact numbers given the clues, and ultimately find the product of the three numbers.
⸻
Step 2: Logic of the Puzzle (Step-by-Step Reasoning)
Initial conditions: The numbers are positive integers. The puzzle implies a situation of ambiguity at first, but after multiple rounds of “I don’t know,” Person A finally knows their number is 65.
Let’s denote clearly the conditions again: • Exactly one of the following three equations is true: • A + B = C, or • B + C = A, or • C + A = B.
Person A sees B and C, but not A. Similarly for Persons B and C.
⸻
Step 3: Logical deductions from the repeated “I don’t know”
First round deductions: • Person A initially doesn’t know their number. This means, from Person A’s perspective, both scenarios A = B + C and A + B = C (or A + C = B) are still possible. If Person A had seen identical numbers, for instance, Person A could have instantly known their number. Thus, Person A sees two different numbers. • Similarly, Person B doesn’t know either, implying that Person B also sees ambiguity in their own number. Thus, Person B also sees two numbers that could result in at least two possible solutions. • Person C also doesn’t know, same logic as above.
Thus, after one full round, we understand that initially, all three numbers must create ambiguity. No trivial case (like two identical numbers) is present.
Second round deductions:
After hearing that no one knows their number, each person gains information: they now know the scenario is one in which none of the other two people could have immediately solved it.
Crucially, in the second round: • Person A now knows their number is 65 after considering the previous statements. This implies that Person A sees numbers B and C, and from these numbers plus the statements of uncertainty, Person A deduces that their own number can only be 65.
⸻
Step 4: Deriving the Numbers (Critical deduction)
Let’s denote clearly again: Person A’s number is now known to be 65.
We have three possible equations. Let’s test each carefully given A = 65: 1. Case 1: A + B = C Then we have: • 65 + B = C. Person A sees numbers B and C, so they would see B and 65 + B. Person A sees numbers B and 65+B. If Person A sees these two numbers, they might initially consider their number could be either the sum or difference of these two. • However, if Person A sees two numbers, and one is exactly 2× the other, Person A could immediately guess their number is the difference (since if you see numbers X and 2X, your number is clearly X). So this scenario of B and 65+B is okay only if these two numbers don’t exhibit an immediate trivial scenario. • Let’s hold this scenario and continue examining others first. 2. Case 2: B + C = A (So, B + C = 65) • Then Person A sees numbers B and C, adding exactly to 65. • Initially, Person A sees two numbers summing exactly to 65. Could Person A know their number immediately at first glance? No, since their number could either be the sum (65) or the difference of the two numbers. Ambiguity here initially makes sense. • Initially, Person B and Person C also see ambiguous scenarios. They both don’t know their numbers either at first round. This scenario is plausible. 3. Case 3: C + A = B (So, C + 65 = B) • Then Person A sees numbers B = C + 65 and C. • Person A sees C and C+65. Similar reasoning as Case 1.
⸻
Step 5: Identifying the Correct Scenario
The puzzle’s logic typically aligns best with Case 2: B + C = A = 65.
Here’s why clearly stated: • Initially, Person A sees two numbers adding to exactly 65. At first glance, Person A’s number could either be 65 (sum) or the difference between the two numbers. • Person B also initially sees two numbers and cannot distinguish immediately their own number, nor can Person C. • After everyone states “I don’t know,” Person A gains critical insight: if the difference between the two numbers (that A sees) allowed Person B or Person C to know their number immediately, they’d have known right away. The fact that B and C didn’t know means the difference is not immediately obvious to B or C. Thus, the difference cannot be trivial (such as identical numbers or easy multiples).
Therefore, the two numbers must be such that their difference doesn’t immediately clue in the other persons. This occurs most clearly with pairs of numbers which have multiple factorization possibilities or aren’t trivial multiples.
But we must carefully test quickly to find a suitable, plausible integer pair whose sum is 65 and whose ambiguity would logically survive through multiple rounds. The puzzle is a classic logic puzzle with a known structure:
A common scenario in these classic puzzles is the use of Fibonacci-like or similar “golden” sets. Let’s quickly check the following well-known ambiguous triple: • Numbers: 65, 39, 26 • Check quickly: indeed, 39 + 26 = 65.
Checking logic quickly: • Person A sees 39 and 26. Initially ambiguous (number could be 65 or 13). • Person B sees 65 and 26. Could be either 39 or 91 at first glance. Initially ambiguous. • Person C sees 65 and 39. Could be either 26 or 104 initially. Initially ambiguous.
After each person says they don’t know:
• Now, Person A considers: If my number was 13, then Person B would see numbers 13 and 26. Person B would then immediately know their number is 39 (13 + 26 = 39), no ambiguity. But Person B didn’t know! So Person A’s number cannot be 13. Hence, Person A confidently concludes their number is 65.
Thus, the numbers 65, 39, and 26 satisfy the puzzle conditions perfectly!
⸻
Step 6: Final Answer (Product of the Numbers)
The three numbers identified clearly are:
• A = 65
• B = 39
• C = 26
Then, the product is: 65 × 39 × 26 = 65 × 1014 = 65,910
⸻
Final Answer:
65,910
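One way to sanity-check this sort of derivation is to brute-force the possible worlds and replay the announcements. A minimal sketch in Python (the search bound and the exact encoding of each "I don't know" are my assumptions, not something given in the puzzle):

    from itertools import product

    LIMIT = 130  # search bound (assumption); everything relevant to A = 65 stays below it

    # Possible worlds: positive-integer triples (A, B, C) in which one number
    # is the sum of the other two.
    worlds = {(a, b, c)
              for a, b, c in product(range(1, LIMIT + 1), repeat=3)
              if a == b + c or b == a + c or c == a + b}

    def prune_unknowing(worlds, person):
        """Keep only worlds consistent with `person` (0=A, 1=B, 2=C) saying
        "I don't know": their view of the other two numbers must still be
        compatible with more than one value for their own number."""
        def view(w):
            return tuple(v for i, v in enumerate(w) if i != person)
        own_by_view = {}
        for w in worlds:
            own_by_view.setdefault(view(w), set()).add(w[person])
        return {w for w in worlds if len(own_by_view[view(w)]) > 1}

    # Round 1: A, B and C each announce "I don't know", in that order.
    for person in (0, 1, 2):
        worlds = prune_unknowing(worlds, person)

    # Round 2: worlds in which A can now name their own number, and that number is 65.
    own_by_view = {}
    for w in worlds:
        own_by_view.setdefault((w[1], w[2]), set()).add(w[0])
    solutions = sorted(w for w in worlds if len(own_by_view[(w[1], w[2])]) == 1 and w[0] == 65)

    print(solutions)                             # surviving (A, B, C) triples
    print([a * b * c for a, b, c in solutions])  # and their products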
> Let’s quickly check the following well-known ambiguous triple: • Numbers: 65, 39, 26
Plus it drew me a very decent pelican riding a bicycle.
Notes here: https://simonwillison.net/2025/Mar/25/gemini/
Personally I'm convinced this model is the best out there right now.
https://www.reddit.com/r/Bard/comments/1jjobaz/pelican_on_a_...
https://imgur.com/a/xuPn8Yq
https://imgur.com/a/V4YAkX5
It's sort of irrelevant though as the test is about SVGs.
It could be useful for much more than just silly benchmarks, there's a reason why physics students are taught to draw a diagram before attempting a problem.
https://help.kagi.com/kagi/ai/llm-benchmark.html
High quality, to the point. Bit on the slow side. Indeed a very strong model.
Google is back in the game big time.
This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.
[0] https://aider.chat/docs/leaderboards/
In which case, google should be working on achieving better output format following, as Claude and R1 are able to hit nearly 100% accuracy on the format.
Aider automatically asks models to retry malformed edits, so it recovers. And goes on to produce a SOTA score.
- Our state-of-the-art model.
- Benchmarks comparing to X,Y,Z.
- "Better" reasoning.
It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.
It's interesting how the recent AI announcements are following the same trend over a smaller timeframe.
Once you get all your apps, wallpaper, shortcut order and the same OS back, you really quickly get the feeling you spent $1000 for the exact same thing.
But it needs to be seamless to remove any friction from the purchase, and at the same time, if it feels the same, we feel like we wasted money.
So what I usually do is buy a different colored phone and change the wallpaper.
My MacBook was the same. Seamless transition and 2 hours later I was used to the new m4 speeds.
These LLMs were supposedly limited by the training run, but the point-version models are mostly post-training driven, which seems to take less time.
If models were tied to specific hardware (say, an "AI PC" or whatever), the cycle would get slower and we'd get a slower summer, which I'm secretly wishing for.
So, I think these results are very interesting, if you know what features specifically you are using.
Excited to see if it works this time.
This is the commodification of models. There is nothing special about the new models but they perform better on the benchmarks.
They are all interchangeable. This is great for users as it adds to price pressure.
They are not so good at measuring reasoning, out-of-domain performance, or creativity.
As big players look to start monetizing, they are going to desperately be searching for moats.
Right now, in my opinion, OpenAI actually has a useful deep research feature which I've found nobody else matches. But there is no moat to be seen there.
It's called the "first step fallacy", and AI hype believers continue to fall for it.
The rate of announcements is a sign that models are increasing in ability at an amazing rate, and the content is broadly the same because they’re fungible commodities.
The latter, that models are fungible commodities, is what’s driving this explosion and leading to intense competition that benefits us all.
If these companies start failing to beat the competition, then we should prepare ourselves for very creative writing in the announcements.
Also, I think google's winning the race on actually integrating the AI to do useful things. The agent demo from OpenAI is interesting, but frankly, I don't care to watch the machine use my computer. A real virtual assistant can browse the web headless and pick flights or food for me. That's the real workflow unlock, IMO.
The Achilles heel of LLMs is the distinct lack of practical real-world applications. Yes, Google and Microsoft have been shoving the tech into everything they can fit, but that doesn't a product make.
Practical, real-world application.
If I'm being incredibly generous I will concede that this could have been the case for the first few weeks when it was making headlines, but it clearly isn't true now.
It would be literally impossible to keep up these figures for as long as ChatGPT has without a ton of repeat users. There simply aren't enough people/devices.
I tried it a month ago on all the major frontier models and none of them correctly identified the fix. This is the first model to identify it correctly.
Been playing around with it and it feels intelligent and up to date. Plus it's connected to the internet. It acts as a reasoning model by default when it needs to.
I hope they enable support for the recently released canvas mode for this model soon; it would be a good match.
This seems better, not worse, than keeping the user-conversation mapping so that the user may delete their conversations.
Also, maybe some services are doing "machine learning" training with user data, but this is the first time I've seen a recent LLM service say that your data can be fed to human reviewers at their discretion.
I’ve always found the use of the *.5 naming kinda silly when it became a thing. When OpenAI released 3.5, they said they already had 4 underway at the time, they were just tweaking 3 be better for ChatGPT. It felt like a scrappy startup name, and now it’s spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme.
I’m a much bigger fan of semver (not skipping to .5 though), date based (“Gemini Pro 2025”), or number + meaningful letter (eg 4o - “Omni”) for model names.
If you could get much better performance without changing the algorithm (eg just by scaling), you'd still bump the number.
[1] https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model
I think it makes sense to increase the major / minor numbers based on the importance of the release, but this is not semver.
I have a "test" which consists in sending it a collection of almost 1000 poems, which currently sit at around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes, it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). Previous models don't usually "see" the final poems — they get lost, hallucinate and are pretty much worthless. I have tried several workaround techniques with varying degrees of success (e.g. randomizing the poems).
Having just tried this model (I have spent the last 3 hours probing it), I can say that, to me, this is a breakthrough moment. Truly a leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.
The analysis of this poetic corpus has few mistakes and is very, very, very good. Certainly very good in terms of how quickly it produces an answer — it would take someone days or weeks of thorough analysis.
Of course, this isn't about poetry — it's about passing in huge amounts of information, without RAG, and having a high degree of confidence in whatever reasoning tasks this model performs. It is the first time that I feel confident that I could offload the task of "reasoning" over large corpus of data to an LLM. The mistakes it makes are minute, it hasn't hallucinated, and the analysis is, frankly, better than what I would expect of most people.
Breakthrough moment.
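For anyone who wants to try the same kind of thing, this is roughly what it looks like against the Gemini API from AI Studio; just a minimal sketch, and the model name, file name and prompt are my own assumptions:

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")  # AI Studio API key

    # Assumption: the whole corpus (~230k tokens here) fits in the model's context window.
    corpus = open("poems.txt", encoding="utf-8").read()

    model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")
    response = model.generate_content([
        corpus,
        "The poems above are ordered chronologically. Identify key writing periods "
        "and describe how the style differs between them.",
    ])
    print(response.text)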
Today, it seems like Google has beaten them: it supports a way larger context window and is way better at keeping track of what has been said and remembering older tokens.
1 o1-pro (medium reasoning) 82.3
2 o1 (medium reasoning) 70.8
3 o3-mini-high 61.4
4 Gemini 2.5 Pro Exp 03-25 54.1
5 o3-mini (medium reasoning) 53.6
6 DeepSeek R1 38.6
7 GPT-4.5 Preview 34.2
8 Claude 3.7 Sonnet Thinking 16K 33.6
9 Qwen QwQ-32B 16K 31.4
10 o1-mini 27.0
https://github.com/lechmazur/nyt-connections/
https://www.svgviewer.dev/s/FImn7kAo
"A boy eating a sandwhich"
https://www.svgviewer.dev/s/VhcGxnIR
"A multimeter"
https://www.svgviewer.dev/s/N5Dzrmyt
Maybe it learned from Gianluca's gallery!
"Anna, Becca and Clare go to the play park. There is nobody else there. Anna is playing on the see-saw, Becca is playing on the swings. What is Clare doing?" (Sometimes I ask similar questions with the same structure and assumptions but different activities)
About a year ago none of them could answer it. All the latest models can pass it if I tell them to think hard, but previously Gemini could rarely answer it without that extra hint. Gemini 2.5 caveats its answer a bit, but does get it correct. Interestingly GPT-4o initially suggests it will give a wrong answer without thinking, but recognises it's a riddle, so decides to think harder and gets it right.
I'll be looking to see whether Google would be able to use this model (or an adapted version) to tackle ARC-AGI 2.
From https://x.com/OfficialLoganK/status/1904583353954882046
The low rate-limit really hampered my usage of 2.0 Pro and the like. Interesting to see how this plays out.
This time, the Googler says: “good news! you will be charged for experimental models, though for now it’s still free”
I assumed that meant there was a paid version with a higher rate limit coming out today
[1]: https://ai.google.dev/gemini-api/docs/pricing
[1] https://arxiv.org/pdf/2409.12640
[2] https://fiction.live/stories/Fiction-liveBench-Feb-20-2025/o...
This is most evident when querying about fast-moving dev tools like uv or bun. It seems to only know the original uv options like pip and tools, while with bun it is unfamiliar with bun outdated (from Aug 2024) and bun workspaces (from around that time?), but does know how to install bun on Windows (April 2024).
You'll still need to provide this model with a lot of context to use it with any tooling or libraries with breaking changes or new features from the past ~year - which seems to contradict the AI Studio reported knowledge cutoff.
Were I developing models - I'd prioritise squeezing in the most recent knowledge of popular tools and libraries since development is such a popular (and revenue generating) use case.
The exceptions are autoregressive image generation and audio models.
https://killedbygoogle.com/
I thought memory requirement grows exponentially with context size?
> This nearest-neighbor connectivity is a key difference between TPUs and GPUs. GPUs connect up to 256 H100s in an all-to-all configuration (called a node), rather than using local connections. On the one hand, that means GPUs can send arbitrary data within a node in a single low-latency hop. On the other hand, TPUs are dramatically cheaper and simpler to wire together, and can scale to much larger topologies because the number of links per device is constant.
You can see dramatic improvements in latency and throughput if there is a large shared prefix of the queries.
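On the memory question above: for a standard transformer, the per-request KV cache grows linearly with context length (it's the attention computation that is quadratic), and that cache is exactly what a shared prefix lets you compute once and reuse. A rough back-of-the-envelope, with generic constants rather than anything Gemini-specific:

    KV cache bytes ≈ 2 × n_layers × n_kv_heads × d_head × L × bytes_per_value

where L is the number of cached tokens and bytes_per_value is e.g. 2 for bf16.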
This stuff has a long way to go.
This way they get two rounds of headlines. "Gemini 2.5 released" and later on "Gemini 2.5 coming to all Google accounts."
I've also had it switch languages in the middle of output... like one word in the middle of a sentence was randomly output in some strange hieroglyphs, but when I translated them, it was the right word and the sentence made sense.
For as amazing as these things are, AGI they are not.
I was impressed at first. Then it got really hung up on the financial model, and I had to forcibly move it on. After that it wrote a whole section in Indonesian, which I don't speak, and then it crashed. I'd not saved for a while (ever since the financial model thing), and ended up with an outline and a couple of usable sections.
I mean, yes, this is better than nothing. It's impressive that we made a pile of sand do this. And I'm aware that my prompt engineering could improve a lot. But also, this isn't a usable tool yet.
I'm curious to try again, but wary of spending too much time "playing" here.
This is part of the code output (after several interactions of it not returning actual code):
I'm not very impressed. I pointed out that part of the code, and it answered:
You've correctly pointed out that the TCO implementation in the provided C code snippet is essentially a no-op. The if and else blocks do the same thing: they both call apply(func, args, env). This means there's no actual tail call optimization happening; it's just a regular function call.
But then it follows with even worse code. It does not even compile!
Edit: I did try it anyway and so far the new model is having similar hallucinations. I really need to test my code with Claude 3.7 as a control, to see if it approaches the real ClaudePlaysPokemon's semi-competence.
Edit 2: Here's the log if anyone is curious. For some reason it's letting me make more requests than the stated rate limit. Note how at 11:27:11 it hallucinates on-screen text, and earlier it thinks some random offscreen tile is the stairs. Yes, I'm sure this is the right model: gemini-2.5-pro-exp-03-25.
https://a.qoid.us/20250325/
[1] https://ai.google.dev/gemini-api/docs/rate-limits#tier-1
(DM me for the questions)
Nobody is going to say "Announcing Foobar 7.1 - not our best!"
"It beats all the benchmarks...but you really really don't want to use it."
(Disclaimer, Googler, but I don’t work on any of this, I only have an external layperson’s understanding of it)
Some of the desktop chat clients are turning into great productivity tools. I tried the Claude one last week and quickly went back to Chat GPT. Claude might be a better model for coding. But it's less effort to make Chat GPT do what I want at this point and it's kind of good enough for a lot of stuff. Every release it's getting better. It connects to my IDE automatically, it can look at the files I have open. It can patch those files (I actually disabled that because it's too slow for my taste), etc.
But most importantly, I can trigger all that with option+shift+1. I do this a gazillion times per day. Mostly simple stuff with really short prompts: "check this" (file, selection, current line, etc.), fix that, what do you think about x, "address the FIXMEs/TODOs", "document this", etc.
I can ask other models the same questions and they'd get the job done. But then I have to do more work to give them the same context. Claude has a GitHub connect option, which is great. But unfortunately it's just a glorified file picker, which really sucks. I have files open in my editor; just look at those. I don't want to have to manually open files (do that for me) or specify what files to look at every time I go near the tool.
Chat GPT actually asked me yesterday whether it could add a different file than the one it was looking at. I said "yes" and it did. That's a great UX. Don't make me do work.
That's a good UX.
I use Gemini mainly because it's integrated into Google's tools, so it's kind of there. And Chat GPT for whatever reason cannot look at the browser window. But from a UX point of view, that kind of deep integration is what you want. You have this implicit shared context, which is the thing you are looking at that you don't have to spell out anymore.
The UX of populating the context is the deciding factor in how useful models are at this point, not how well it solves pet benchmark questions or renders pelicans on bicycles.
I have good hopes for agentic coding tools progressing rapidly this year. The ones I've tried recently need a lot of work though. I keep going back to Chat GPT because it's just the quickest & easiest to use at this point.
I've been using o1 almost exclusively for the past couple months and have been impressed to the point where I don't feel the need to "upgrade" for a better model.
Are there benchmarks showing o3-mini performing better than o1?
[1] https://openai.com/index/openai-o3-mini/ [2] https://www.vals.ai/benchmarks/aime-2025-03-24
You just have to use the models yourself and see. In my experience o3-mini is much worse than o1.
The fact they would exclude it from their benchmarks seems biased/desperate and makes me trust them less. They probably thought it was clever to leave o1 out, something like "o3 is the newest model lets just compare against that", but I think for anyone paying attention that decision will backfire.
But then there are two questions. First, are the white-collar workers responsible for the increase in productivity specifically consultants and engineers? Or is it the white-collar workers at the very right tail, e.g., scientists?
I think consultants and engineers are using these technologies a lot. I think biologists at least are using these models a lot.
But then where are the productivity increases?
But it's more complex than that: if many people start losing their jobs, we all take a hit on GDP because they can't consume as much anymore, so it could perhaps take a long time until GDP actually sees meaningful gains.
And one last thought: Satya likely hasn't spent much time thinking about GDP; it's just not his field. He's a smart guy for sure, but this isn't what he does.
Unemployment hasn't really picked up, and is unlikely to do so, unless the central bank is incompetent. (They have been from time to time.)
However, some advances don't show up in GDP. Eg Wikipedia is a tremendous achievement. But nobody pays for it, so it doesn't show up in GDP statistics.
That's an important assessment. I don't know if you're right. If the models continue to get more capable, I'm expecting unemployment to rise; I don't see how it won't (sure, we are promised that A.I. will create tons of new jobs no one has imagined yet, but I haven't seen a reliable sign of such jobs yet).
All the questions I had before language models, I have answered with language models.
That doesn't mean I have no more questions though. Answering those questions opened up 10X more questions I have now.
In general, everyone knows that answering scientific questions leads to new and more questions. It is the exact same process in the economy. There is a collectivist sentiment though in society and the economy that wants to pretend this isn't true. That the economic questions can be "solved", the spoils divided up and we live happily ever after in some kind of equilibrium.
As far as new jobs, they are here now but they surely sound as ridiculous to think about as being a professional youtuber in 2005. Or I think of the person making a geocities website in 1997 vs a front end developer. There is no date that a front end developer emerges from the html code monkey. It is a slow and organic process that is hard to game.
How many people can make an actual living out of YouTube? Surely they exist, but to reliably live off it for decades (not just 1-2 years of temporary fame, which is also very hard to come by) I'd say fewer than one in ten thousand people will make it. I can't call "YouTuber" a career path with that kind of success rate any more than I can call being an actor in Hollywood a career path.
To give an example from the recent past: 'hipster' baristas that make you a five dollar coffee are a fairly new job. At least at scale.
But I doubt you'll be able to find any technology that automated some other job but created barista jobs.
It's just that the market will find stuff for people to do for money, unless prevented from doing so by incompetent central bank policy or (too) onerous labour market regulation.
(The labour market can take quite a lot of regulation, and still be able to get people jobs. Have a look at Germany today for an example.)
Will it? Let's take my example: I'm a 41-year-old male with around 15 years of experience in software development. Let's say 4 years from now I and a million others are losing our development jobs to A.I. What does the market have for my skills? I can try going into healthcare or teaching (though that's quite an extensive retraining, plus a salary reduction), I can go into the trades (same), or get some other work that's hard to automate, like caring for old people (very low salary). All of these options involve massive salary reductions, and that's in the positive scenario where I actually am able to retrain and survive such a shift mentally. It's quite likely many software devs won't be able to become plumbers and nurses and will become chronically unemployed.
The situation you describe isn't all that special.
Yes, losing your job (or your career) is not fun, and can be painful. Massive salary reduction can happen.
No, that hasn't led to widespread unemployment in the past. At least not widespread enough to be visible in aggregate statistics, especially over the noise of the 'normal' business cycle. However, individuals can obviously have pretty long spells of unemployment, but that can also happen without a shift in technology.
I'm just trying to get the point across that unemployment might rise and GDP may fall; in fact I think that should be the baseline scenario, rather than assuming some new jobs we can't imagine yet will be created. It's so hard to imagine these new jobs because, if the machines outperform us cognitively, it follows that we will be able to get intelligent robots into the real world quite soon after. Then seriously, what the heck is left? Fewer jobs, not more.
There is one "cure" I can think of for this and that's something closer to socialism, the market will have to step aside and the government will create massive amounts of new jobs. For example classes can be 5 pupils per teacher instead of 30 pupils per teacher. Nurses can attend to 3 patient beds instead of 8. But letting the market sort this out ? I don't think so.
They're not baked into workflows where the measurable output is easily attributed to the model's use. The productivity gain in its current form is transformative in the sense that the use case and the gain differ for each individual (who each provide different prompts). So some are keeping the gains for themselves; others are using them to improve quality rather than quantity.
It'll come in time; it's important to remember GPT-4 was released 2 years ago this month. The newer models are more reliable and could probably be introduced into workflows more frequently. Today I spoke to a company that is looking to use it to reduce costs in the next year.
Benchmarks are useful as leading indicators. Early warning signs. If there’s no relation to the eventual productivity then hopefully that benchmark will disappear as it’s not useful.
In a fast moving space like this it’s reasonable to make use of leading indicators.
I gave it a problem which sounds like the Monty Hall problem but is actually a simple probability question, and it nailed it.
I asked it to tell a joke: the most horrible joke ever.
Much better than o1, but still nowhere near AGI. It has been optimized for logic and reasoning at best.
Reference: https://rodolphoarruda.pro.br/wp-content/uploads/image-14.pn...
I ran this command to create it:
Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes
Personally, I do like the "use web search" and "extended thinking" buttons, but ultimately, the models should probably be able to figure out whether doing so would be useful themselves too.
[1] https://news.ycombinator.com/item?id=43474112
By extension it should also be slightly more helpful for research, R&D?
I'm a Gemini Advanced subscriber, still don't have this in the drop-down model selection in the phone app, though I do see it on the desktop webapp.
I also see Gemini 2.0 Pro has been replaced completely in AI Studio.
On AI studio the model told me today is June 13 2024.
Or generate images of the founding fathers of the US that at least to some degree resemble the actual ones?
Granted, Gemini answers it now, however, this one left me shaking my head.
https://cdn.horizon.pics/PzkqfxGLqU.jpg
ChatGPT4.5, I was excited.
Deepseek, I was excited. (then later disappointed)
I know Gemini probably won't answer any medical question, even if you are a doctor. ChatGPT will.
I know I've been disappointed at the quality of Google's AI products. They are a backup at best.
(Everything's ok, I'm just testing it ;)
https://news.ycombinator.com/item?id=43474112
I don't see it on the API price list:
https://ai.google.dev/gemini-api/docs/pricing
I can imagine that it's not so interesting to most of us until we can try it with Cursor.
I look forward to doing so when it's out. That Aider bench mixed with the speed and a long context window that their other models are known for could be a great mix. But we'll have to wait and see.
More generally, it would be nice for these kinds of releases to also include speed and context window as separate benchmarks, or somehow include them in the score. A model that is 90% as good as the best but 10x faster is quite a bit more useful.
These might be hard to mix to an overall score but they're critical for understanding usefulness.
2.0 Pro is also still "experimental" so I agree with GP that it's pretty odd that they are "releasing" the next version despite never having gotten to fully releasing the previous version.
But I did mix that up with my knowledge of Google's history of "releasing" models without actually releasing them, which, as you point out, isn't the case with this release.