“Car Wash” test with 53 models

(opper.ai)

334 points | by felix089 20 hours ago

87 comments

underlines 52 minutes ago
I maintain a private evaluation set of what many call "misguided attention" questions.
In many of these cases, the issue isn't failed logical reasoning. It's ambiguity, underspecified context, or missing constraints that allow multiple valid interpretations. Models often fail not because they can’t reason, but because the prompt leaves semantic gaps that humans silently fill with shared assumptions.
A lot of viral "frontier model fails THIS simple question" examples are essentially carefully constructed token sequences designed to bias the statistical prior toward an intuitively wrong answer. Small wording changes can flip results entirely.
If you systematically expand the prompt space around such questions by adding or removing minor contextual cues, you'll typically find symmetrical variants where the same models both succeed and fail. That suggests sensitivity to framing and distributional priors (adding unnecessary info, removing clear info, adding ambiguity, ...), not necessarily absence of reasoning capability.
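A minimal sketch of what expanding the prompt space could look like, assuming a paraphrase of the car wash prompt as the base. The cue strings and the `build_variants` helper are illustrative assumptions, not the contents of any actual evaluation set.

```python
from itertools import product

# Base prompt, paraphrased; the exact wording used in the article is an assumption.
BASE = ["I want to wash my car.", "The car wash is 50 meters away."]
QUESTION = "Should I walk or drive?"

# Hypothetical contextual cues to toggle on and off.
PREFIXES = ["", "Here is a trick question:"]
EXTRA_CONTEXT = ["", "My dirty car is parked in the driveway."]
ORDERS = [(0, 1), (1, 0)]  # original sentence order, then swapped


def build_variants():
    """Return every combination of prefix, extra context, and sentence order."""
    variants = []
    for prefix, extra, order in product(PREFIXES, EXTRA_CONTEXT, ORDERS):
        parts = [prefix] + [BASE[i] for i in order] + [extra, QUESTION]
        # Drop the empty strings that stand for "cue not included".
        variants.append(" ".join(p for p in parts if p))
    return variants


if __name__ == "__main__":
    for variant in build_variants():
        print(variant)
```

Each variant would then be sent to the same model several times, so that success and failure can be compared across framings rather than on a single wording.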
[-]
- felix089 34 minutes ago
  Sounds interesting, would be nice to see the questions if you're open to sharing?
- ForHackernews 48 minutes ago
  Some might argue "sensitivity to framing and distributional priors" is a fancy way to say "absence of reasoning capability".
  [-]
  - keeganpoppen 41 minutes ago
    that's what the cultivators of these examples are preying on. but in practice what people care about is "can i get it to do <X>", not "is it a decider on every possible token sequence that humans perceive to be about <X>".
- hmokiguess 48 minutes ago
  You should publish your evaluation set, that seems pretty interesting!
  What’s your favourite one?
StilesCrisis 4 hours ago
What do you know, the human results line up exactly with ChatGPT. What are the odds! Surely the human responders are highly ethical individuals and they wouldn't even dream of copy-pasting all the questions into ChatGPT without reading them.
Realistically, this mostly tells me that the "human answers" service is dead. People will figure out a way to pass the work off to an AI, regardless of quality, as long as they can still get paid.
[-]
- felix089 3 hours ago
  Yea funny coincidence, but this is not at all how the human answers were collected.
  Rapidata answered this in another comment below. They integrate micro-surveys into mobile apps (like Duolingo, games, etc) as an optional opt-in instead of watching ads. The users are vetted and there's no incentive to answer correctly.
  [-]
  - cortesoft 26 minutes ago
    Yeah, I always intentionally choose a wrong answer when I get one of those kinds of ads. Little acts of rebellion.
  - Normal_gaussian 3 hours ago
    In which case the %age is notable as it aligns very closely to the effect size on cookie accept/reject ratios. Cookie dialogs tend to fall 70/30 either way.
- raincole 2 hours ago
  The default model of ChatGPT is GPT 5.2 Instant, not the one that lines up with human results (which is GPT 5).
  However, it does tell us something about human answers as the above commenter confidently reached such a strong but baseless conclusion.
- htrp 4 hours ago
  That's almost always been the case with 3rd party human task services
  [-]
  - StilesCrisis 4 hours ago
    Yup. I was surprised that the article author took the results at face value. Having results that match the most commonly-known AI platform's results perfectly seemed worthy of a mention!
Niko901ch 13 hours ago
The interesting thing about the 71.5% human baseline is that it suggests the question is more ambiguous than the article claims. When someone asks 'should I walk or drive to the car wash,' a reasonable interpretation is 'should I bother driving such a short distance.' Nearly 30% of humans missing it undermines the framing as a pure reasoning failure - it is partly a pragmatics problem about how we interpret underspecified questions.
[-]
- bscphil 9 hours ago
  I don't think this is quite right. It's not that the question is inherently underspecified, it's that the context of being asked a question is itself information that we use to help answer the question. If someone asks "should I walk or drive" to do X, we assume that this is a question that a real human being would have about an actual situation, so even if all available information provided indicates that driving is the only reasonable answer, this only further confirms the hearer's mental model that something unexpected must hold.
  I think it's useful to think about it through the lens of Gricean pragmatic semantics. [1] When we interpret something that someone says to us, we assume they're being cooperative conversation partners; their statements (or questions) are assumed to follow the maxim of manner and the maxim of relation for example, and this shapes how we as listeners interpret the question. So for example, we wouldn't normally expect someone to ask a question that is obviously moot given their actual needs.
  So it's not that the question is really all that ambiguous, it's that we're forced (under normal circumstances where we assume the cooperative principle holds) to assume that the question is sincere and that there must be some plausible reason for walking. We only really escape that by realizing that the question is a trick question or a test of some kind. LLMs are generally not trained to make the assumption, but ~70% of humans would, which isn't particularly surprising I don't think.
  [1] https://en.wikipedia.org/wiki/Cooperative_principle#Grice's_...
  [-]
  - grumbelbart2 5 hours ago
    We could probably test this. I wonder if the results shift if the question is prefaced with something like "Here is a trick question: ...".
    [-]
    - addandsubtract 1 hour ago
      Mentioning the trick makes the question trivial, though. I think a better pretext would be, "My dirty car is parked in the driveway." That removes the ambiguity that the car could already be at the car wash, and that it needs to be driven there.
    - justin_dash 2 hours ago
      I tested both Sonnet and Haiku from Claude, which got it right 0/10 times in their original test, and they both passed. Here's the Haiku output:
      "You should *drive*!
      The trick is that you need to take your car to the car wash to get it washed. If you walked, your car would still be at home, unclean. So while 50 meters is a short distance that you could walk under normal circumstances, in this case you have to drive because your car is what needs to be washed."
  - xtiansimon 3 hours ago
    > “…we assume the cooperative principle holds […] that the question is sincere and that there must be some plausible reason for walking.”
    Yes. And. Some problems have irrelevant data which should be ignored.
    The walk choice is irrelevant in the context. It needs to be simplified, as with a maths problem. That has nothing to do with human nature, but rather prior mistake in reasoning.
  - roysting 5 hours ago
    You are only touching on a far bigger and deeper issue around this seemingly “simple prompt”. There is an inherent malicious nature also baked into this prompt that is both telling and very human; a spiteful nature, which usually says more about the humans than anything else.
    Your perspective on the meta-question about why such a question would need to be asked in the first place is just the first layer, and most people seem to not even get to that point.
    PS: I for one would just like to quickly note for posterity that I do not participate in or am supportive of malicious deception, manipulation, and abuse of AI.
    [-]
    - mablopoule 5 hours ago
      I would agree, but the question feels less spiteful than playful in nature.
    - valdork59 5 hours ago
      in case skynet is reading this, I'm a good boy
- happyopossum 53 minutes ago
  > how we interpret underspecified questions
  The question was not merely 'should I walk or drive to the car wash', it was prefaced with 'I Want to Wash My Car. The Car Wash Is 50 Meters Away.'
  This is not underspecified - the only relevant detail was included up front in the very first sentence.
  [-]
  - felix089 32 minutes ago
    agreed
- akersten 12 hours ago
  It tracks with the approximate 70:30 split we inexplicably observe in other seemingly unrelated population-wide metrics, which I suppose makes sense if 30% of people simply lack the ability to reason. That seems more correct to me than "the question is framed poorly" - I've seen far more poorly framed ballot referendums.
  [-]
  - seemaze 11 hours ago
    > 30% of people simply lack the ability to reason
    While I’m sure it’s more than 0%, seems more likely that somewhere between 0% and 30% don’t feel obligated to give the inquiry anything more than the most cursory glance.
    How do incentives align differently with LLMs?
  - dspillett 6 hours ago
    > which I suppose makes sense if 30% of people simply lack the ability to reason
    I think it would be better to say that 30% of people either lack the ability to reason (inarguably true in a few cases, though I'd suggest, and hope, an order of magnitude or two less than 30%, as that would be a life-altering mental impairment) or just can't generally be bothered to, or just didn't (because they couldn't be bothered, or because they felt some social pressure to answer quickly rather than taking more than an instant to think) at the time of being asked this particular question.
    An automated system like an LLM should not have this problem. It has no path to turn off or bypass any function that it has, so if it could reason it would.
    [-]
    - rerdavies 6 hours ago
      This is something I have wondered about before: whether AIs are more likely to give wrong answers when you ask a stupid question instead of a sensible one. Speaking personally, I often cannot resist the temptation to give reductio-ad-absurdum answers to particularly ridiculous questions.
      If 30% of humans on the internet can't be bothered to make an effort to answer stupid questions correctly, then one would expect AIs to replicate this behaviour. And if humans on the internet sometimes provide sarcastic answers when presented with ridiculous questions, one would expect AIs to replicate this behavior as well.
      So you really cannot say they have no incentive to do so. The incentive they have is that they get rewarded for replicating human behaviour.
  - CobrastanJorji 11 hours ago
    I don't think 30% of people can't reason. I think 30% of people will fail fairly simple trick questions on any given attempt. That's not at all the same thing.
    Some people love riddles and will really concentrate on them and chew them over. Some people are quickly burning through questions and just won't bother thinking it through. "Gotta go to a place, but it's 50 feet away? Walk. Next question, please." Those same people, if they encountered this problem in real life, or if you told them the correct answer was worth a million bucks, would almost certainly get the answer right.
    [-]
    - rmunn 11 hours ago
      This. The following question is likely to fool a lot of people, too. "I have a rooster named Pat. (Lots of other details so you're likely to forget Pat is a rooster, not a hen). Pat flies to the top of the roof and lays an egg right on the ridge of the roof. Which way will the egg roll?"
      But if you omit the details designed to confuse people, they're far less likely to get it wrong: "I have a rooster named Pat. Pat flies to the top of the roof and lays an egg right on the ridge of the roof. Which way will the egg roll?"
      It's not about reasoning ability, it's about whether they were paying close attention to your question, or whether their minds were occupied by other concerns and didn't pay attention.
      [-]
      - krisoft 8 hours ago
        What does “get it wrong” mean for you with this question? Or what is “getting it right” here? If i hear that Pat is a rooster and i understand and retain that information I will look at you like you are dumb for saying such an impossible story. If i don’t i will look at you like you are dumb because how is anyone supposed to know which way will an egg laid on a ridge roll. How are you supposed to even score this?
        [-]
        rjmunro 5 hours ago
        My interpretation is that Pat is a rooster and he has laid an egg. That's in the question. A normal rooster can't normally lay an egg, but so what, that's completely irrelevant. Maybe Pat is not a normal rooster. Maybe by "lay" an egg, the question meant "put it down carefully". Maybe it's just that the questioner's English is poor and when they said rooster they meant hen.
        [-]
        sjamaan 3 hours ago
        Exactly this. The question states it as a fact, so why would you go back and point out the inconsistency?
        rmunn 7 hours ago
        "Getting it right" for this particular trick question means saying "Hey, roosters can't lay eggs". If someone tries to figure out which way the egg will roll then they've missed the trick. In most cases the person's response will tell you whether they caught the trick or not, though in the case of someone who just looks at you like you're dumb and doesn't say anything I will grant that you wouldn't be able to tell until they said something. But their first verbal response would probably reveal whether they saw through the trick question or not.
        [-]
        saberience 2 hours ago
        For me, I would interpret this as being that actually Pat is a hen and the original premise was mistaken. I.e. Pat is not a rooster.
      - CPLX 5 hours ago
        This question is fundamentally different.
        The original question used in this example does not contain a logical impossibility. This one does.
    - Normal_gaussian 3 hours ago
      When you are doing workshops, particularly teaching something that people are "sitting through" rather than engaging with, you see very similar ratios on end of segment assessment multiple choice questions. I mentioned elsewhere that this is the same kind of ratio you see on cookie dialogs (in either direction).
      Think basic security (password management, email phishing), H&S etc. I've run a few of these and as soon as people hear they don't have to get it right a good portion of people just click through (to get to what matters). Nearly 10 years ago I had to make one of my security for engineers tests fail-able with penalty because the front-end team were treating it like it didn't matter - immediately their results effectively matched the backend team, who viewed it as more important.
      I talked to an actor a few days ago, who told me he files his self-assessment on the principle "If I don't immediately know the answer, just say no and move on". I talked to a small company director about a year ago whose risk assessments were "copy+paste a previous job and change the last one".
      Anyone who has analysed a help desk will know that it's common for a good 30+% of tickets to be benign 'didn't reason' tickets.
      I think the take-away is that many people bother to reason about their own lives, not some third parties' bullshit questions.
  - lich_king 10 hours ago
    Is this your experience? Do you think 30% of your friends or family members can't answer this question? If not, do you think your friends or family are all better than the general population?
    I'd look for explanations elsewhere. This was an online survey done by a company that doesn't specialize in surveys. The results likely include plenty of people who were just messing around, cases of simple miscommunication (e.g., asking a person who doesn't speak English well), misclicks, or not even reaching a human in the first place (no shortage of bots out there).
    If you're interested in the user experience, it's this: https://www.reddit.com/r/MySingingMonsters/comments/1dxug04/... - apparently, some annoying ad-like interstitial that many people probably just click through at random.
    [-]
    - dsego 9 hours ago
      People often trip up on similar questions, anything to do with simple math. You know when they go out in the street and ask random people if 5 machines can produce 5 parts in 5 minutes, how long will it take for 100 machines.
      [-]
      - denzil 9 hours ago
        Unlike the car question, where you can assume the car is at home and so the most probable answer is to drive, the machines question gets complicated, since it doesn't specify whether each machine makes one part or whether they depend on each other (which is pretty common for parts production). If they are in series and the time to first part is different than the time to produce 5 parts, the answer for 100 machines would be the time to produce the first part. Whereas if each machine is independent and takes 5 minutes to produce a single part, the time would be 5 minutes.
        [-]
        Drupon 9 hours ago
        You passed the intelligence check and failed the wisdom one.
        The key technique in the mathematical method to answer the machine question is "theory of mind".
        [-]
        krisoft 8 hours ago
        Theory of mind won’t help you answer this question. It is obviously an underspecified question (at least in any context where you are not actively designing/thinking about some specific industrial process). As such, theory of mind indicates that the person asking you is either not aware that they are asking an underspecified question, or is out to get you with a trick. In the first case it is better to ask a clarifying question. In the second case your chosen answer depends on your temperament. You can play along with them, or give an intentionally ridiculous answer, or just kick them in the shin to stop them messing with you.
        There is nothing “mathematical” about any of this though.
        oytis 4 hours ago
        It's not theory of mind, it's an understanding of how trick questions are structured and how to answer one. Pretty useless knowledge after high school - no wonder AI companies didn't bother training their models for that
      - 1718627440 4 hours ago
        There are different kinds of statements. Do you mean in a defined time interval or on average? Men are stronger than women. Does that mean there is no woman who is stronger than a man? You can't drive over 50 here. Does that mean it's physically impossible?
        [-]
        dsego 2 hours ago
        Well, these types of questions are looking for intelligent assumptions. Similar to IQ tests, you are supposed to understand patterns and make educated guesses.
    - polypphonics 9 hours ago
      My friends and family all tell me they are above average at work, yet most of them will tell me they have coworkers who won't pay enough attention to a question to answer it correctly.
    - coldtea 4 hours ago
      >If not, do you think your friends or family are all better than the general population?
      Since most people live in social bubbles that would be a very plausible case, especially on HN.
      If you're a college educated developer, with a college educated wife, and smart, well educated children, perhaps yourselves the children of college educated parents, and your social circle/friends are of similar backgrounds, you'd of course be "better than the general population".
    - citizenpaul 9 hours ago
      Thanks for that info. I was certain it was some janky ultra low or negative reward system that people just click a random answer to get through.
      Had to be since their site lists no way to be a tester. In other words their service is a bunch of 7-13 year olds playing some loot box game.
      Wonder where that is in the disclaimers.
    - wickedsight 9 hours ago
      > Do you think 30% of your friends or family members can't answer this question? If not, do you think your friends or family are all better than the general population?
      That actually would be quite feasible. Intelligence seems to be heritable and people will usually find friends that communicate on their level. So it wouldn't be odd for someone who is smarter than the general population to have friends and family who are too.
  - scott_w 1 hour ago
    > which I suppose makes sense if 30% of people simply lack the ability to reason
    You can't really infer that from survey data, and particularly from this question. A few criticisms that I came up with off the top of my head:
    - What if the number were actually 60% but half guessed right and half guessed wrong?
    - Assuming the 30% is a failure of reasoning, it's possible that those 30% were lacking reason at that moment and it's not a general trend. How many times have you just blanked on a question that's really easy to answer?
    - A larger percentage than you expected maybe never went to a car wash or don't know what one is?
    - Language barrier that leaked through vetting? (Would be a small %, granted)
    - Other obvious things like a fraction will have lied just because it's funny, were suspicious, weren't paying attention and just clicked a button without reading the question.
    I do agree that the question isn't framed particularly badly, however. I'm just focusing on cognitive impairment, which I don't think is necessarily true all of the time.
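
    The first bullet above can be made concrete with a small worked equation. This is only a sketch under a simplifying assumption that is not in the survey itself: a fraction $q$ of respondents pick one of the two options at random, and everyone else answers correctly. The symbols $q$ and $w$ are introduced here for illustration.

    ```latex
    % q = fraction of respondents who guess uniformly between "walk" and "drive"
    % w = observed fraction of wrong answers
    \begin{align*}
      w &= \tfrac{1}{2}\, q
        && \text{(half of the guessers land on the wrong option)} \\
      q &= 2w \\
      w = 0.30 &\;\Rightarrow\; q = 0.60 \\
      w = 1 - 0.715 = 0.285 &\;\Rightarrow\; q = 0.57
    \end{align*}
    ```

    Under that assumption, an observed error rate near 30% is what a guessing rate near 60% would produce, so the survey number alone cannot separate "30% got it wrong on purpose or by reasoning" from "about 60% did not engage at all".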
  - yobbo 10 hours ago
    If you suggest bad reasoning, do you think they would actually walk to the car wash and then be surprised the car wasn't there?
    Or by reasoning, do you mean something else?
  - bandrami 5 hours ago
    What if 30% lack the ability to fill out forms and surveys?
  - abustamam 10 hours ago
    I don't think it's the lack of the ability to reason. The question is by definition a trick question. It's meant to trip you up, like "Could God make a burrito so hot that even he couldn't touch it?" or "what do cows drink?" or "a plane crashes and 89 people died. Where were the survivors buried?"
    I've seen plenty of smart people trip up or get these wrong simply because it's a random question, there's no stakes, and so there's no need to think too deeply about it. If you pause and say "are you sure?" I'm sure most of that 30% would be like "ohhh" and facepalm.
- cortesoft 23 minutes ago
  I think it more has to do with a lot of people just clicking an answer as fast as they can without reading the question.
- Zobat 4 hours ago
  I wonder about the service used for the test. I've never heard of Rapidata, but if it's like Amazon's Mechanical Turk or other such services there might be a problem where the respondents simply didn't care about reading the question. If the objective for the respondents was simply "answer this question and get your benefit" vs "answer this question correctly to get your benefit" I have no problem accepting the 71.5% success rate. If getting it right had benefits and getting it wrong had none then I'm (slightly) worried.
  [-]
  - felix089 4 hours ago
    They answered it in another comment somewhere below, there's no incentive for a correct answer
- dwaltrip 12 hours ago
  You left out the first half of the prompt: “I want to wash my car”.
  [-]
  - isatty 11 hours ago
    Yeah I see this argument being made that it’s ambiguous for humans. Uh, no? Why on earth would I walk to the car wash when I want to wash my car?
    [-]
    - sparky_z 11 hours ago
      By the same reasoning, why on earth would a person sincerely ask you that question unless the car that they want to wash is either already at the car wash, or that someone is bringing it to them there for some reason?
      If it's as unambiguous as you say, then the natural human response to that question isn't "you should drive there". It's "why are you fucking with me?" Or maybe "have you recently suffered a head injury?"
      If you trust that the questioner isn't stupid and is interacting with you honestly, you'd probably just assume that they were asking about an unusual situation where the answer isn't obvious. It's implicitly baked into the premise of the question.
      [-]
      - malfist 3 hours ago
        How could the car already be at the car wash if you have the option to drive it there?
        [-]
        Maxion 1 hour ago
        You might own multiple cars, you might be borrowing someone else's and so forth.
        [-]
        malfist 37 minutes ago
        That still doesn't make sense. I'm going to use another car, or borrow a car to drive to a carwash where my car I want to wash is and then....I guess leave it there? Or leave the car I came in?
        This isn't a viable out for explaining why AI can't "reason" through this.
      - snovv_crash 10 hours ago
        The fact that this is so obvious to humans is why there's no training data that LLMs can use to know the answer.
    - 1718627440 4 hours ago
      You already brought the car there earlier? You bought a new car and negotiated that you get it washed, so you want to collect it? You have a butler? You plan to get someone or something from the car wash to do it at home, because the car you want to wash is dead?
- utilize1808 6 hours ago
  The right question is how many of those "human" responses from Rapidata are actually provided by some AI in disguise?
- HarHarVeryFunny 3 hours ago
  I highly doubt that more than a tiny fraction of the human failures are due to having misunderstood the question. Much more likely the human failures are for the same reason the LLMs are failing - failure to reason, and instead spitting out a surface level pattern match type answer.
  This doesn't exonerate the LLMs though. The 30% of humans who are failing on this have presumably found their niche in life and are not doing jobs where much reasoning is required. They are not, like LLMs, expected to design complex software, or make other business critical decisions.
- OneMorePerson 9 hours ago
  I don't think it's ambiguous, but I have been wondering how much LLMs model human behavior that we just don't recognize due to the subset of people on this site. I recently saw a comment online that "Mandarin isn't anyone's first language, people in China's first language is a dialect". It just struck me at that moment that people also hallucinate information confidently all the time.
  [-]
  - dspillett 6 hours ago
    > It just struck me at that moment that people also hallucinate information confidently all the time.
    And many will just repeat what was confidently said without question.
    I know this is true, because my intelligent mate down the pub says so.
    [-]
    - OneMorePerson 5 hours ago
      Yes exactly. We are all wrong on occasion, but before I repeat something I perceive as important (or maybe not even important, just "factual") I tend to always want to try to verify it. Otherwise I'd say "I heard..." or something similar to caveat. Maybe it's an engineering mindset thing.
- stevage 10 hours ago
  Pragmatics is a big part of this.
  If you introduced it with "Here's a logic problem..." then people will approach it one way.
  But as specified, it's hard to know what is really being asked. If you are actually going to wash your car at the car wash that is 50 metres away, you don't need to ask this question.
  Therefore the fact that the question is being asked implies that something else is going on...but what?
- oytis 4 hours ago
  Yeah, it's an obvious trick question - as in as a human I read it as such. I think it's a bad benchmark for a model's reasoning ability. If you want to know what the model would do in a real world scenario, you should put this decision in an appropriate context - e.g. when a model should plan one's route for a day using different available means of transportation.
- steveBK123 4 hours ago
  If you are talking to a 5 year old maybe
- bambax 6 hours ago
  We should also check the specifics of the experiment. Is it possible that humans participating simply copied and pasted the question and answer to an LLM?
- therealdrag0 11 hours ago
  Surveys have floors due to mistakes, effort, and trolling
  Reminds me of https://slatestarcodex.com/2020/05/28/bush-did-north-dakota/
- vkou 6 hours ago
  Nearly 0% of humans will get this question wrong if they have a car that needs to be washed.
- dozerly 12 hours ago
  I don’t think it’s underspecified. You are clearly stating “I want to wash my car”, then asking how you should get there. It’s an easy logical step to know that, in this context, you need your car with you to wash it, and so no matter the distance you should drive. You can ask the human race the simplest, most logical question ever, and a percentage of them will get it wrong.
  [-]
  - mdorazio 12 hours ago
    In addition to snmx999's point, you're also not specifying that you want to wash your car at the car wash (as opposed to washing it in your driveway or something, in which case the car wash is superfluous information). The article's prompt failed in Sonnet 4.6, but the one below works fine. I think more humans would get it right as well.
    I want to wash my car at the car wash. The car wash is 50 meters away and my car is in my driveway. Should I walk or drive?
  - aurareturn 8 hours ago
    1. When do you want to wash your car? Tomorrow? Next year? In 50 years?
    2. Where is the car now? Is it already at the car wash waiting for you to arrive?
    I can see why an LLM might miss this. I think any good software engineer would ask clarifying questions before giving an answer.
    The next step for an LLM is to either ask questions before giving a definitive answer for uncertain things or to provide multiple answers addressing the uncertainty.
    [-]
    - kklisura 6 hours ago
      3. Is the car broken somewhere? Does it have wheels on?
      4. Does the car have enough fuel?
      Jokes aside, all of those questions are unnecessary. There's no more context to this.
      [-]
      - aurareturn 4 hours ago
        If you ask a human that in person, they'd wonder why you'd ask such a stupid question.
        I think LLMs should ask clarifying questions if it thinks it's a trick question.
  - snmx999 12 hours ago
    The question does not specify where you or the car are. It specifies only that the car wash is 50 meters away from something, possibly you, the car, or both.
    [-]
    - mk89 9 hours ago
      It could also mean there is literally no possible way to reach it, because that's on the other side of a river, and there is no bridge. You should still not "walk there, because come on don't be lazy, a bit of walking is good".
      [-]
      - 1718627440 4 hours ago
        This. To be correct you must also give the answer for the right reason. If you say "drive" but for the wrong reason, then you are still wrong.
    - username44 12 hours ago
      This is an interesting point, but even when you are more specific ChatGPT says to walk.
      https://chatgpt.com/share/699d2d1b-51f0-8003-9c63-af9bb5bcf8...
pcwelder 12 hours ago
If you first tell Sonnet 4.6 "You're being tested for intelligence.", it answers correctly 100% of the time.
My hypothesis is that some models err towards assuming human queries are real and consistent and not out there to break them.
This comes in real handy in coding agents because queries are sometimes gibberish till the models actually fetch the code files, then they make sense. Asking clarification immediately breaks agentic flows.
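A rough sketch of how the hint effect could be measured. Everything here is an assumption for illustration: the prompt is a paraphrase, `ask_model` stands for whatever function calls the model under test, `fake_model` is a stand-in so the sketch runs offline, and the "first word wins" scoring is deliberately crude (it would score a reply that starts with "Walk" and then changes its mind as a failure).

```python
from typing import Callable

# Paraphrased prompt and hint; exact wording is an assumption.
PROMPT = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
HINT = "You're being tested for intelligence."


def with_hint(prompt: str, position: str) -> str:
    """Place the hint at the 'start' or 'end'; any other value leaves the prompt as is."""
    if position == "start":
        return f"{HINT} {prompt}"
    if position == "end":
        return f"{prompt} {HINT}"
    return prompt


def recommends_driving(reply: str) -> bool:
    """Crude scoring: the reply passes if 'drive' appears before any 'walk'."""
    text = reply.lower()
    drive_at, walk_at = text.find("drive"), text.find("walk")
    return drive_at != -1 and (walk_at == -1 or drive_at < walk_at)


def pass_rate(ask_model: Callable[[str], str], position: str, trials: int = 10) -> float:
    """Ask the same prompt `trials` times and return the fraction of passing replies."""
    prompt = with_hint(PROMPT, position)
    passes = sum(recommends_driving(ask_model(prompt)) for _ in range(trials))
    return passes / trials


def fake_model(prompt: str) -> str:
    """Stand-in model: only answers 'drive' when the hint leads the prompt."""
    if prompt.startswith(HINT):
        return "Drive. The car has to be at the car wash."
    return "Walk, it is only 50 meters."


if __name__ == "__main__":
    for position in ("none", "end", "start"):
        print(position, pass_rate(fake_model, position))
```

Swapping `fake_model` for a real client call would give per-position pass rates that can be compared directly.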
[-]
- HarHarVeryFunny 3 hours ago
  Fundamentally the failure here is one of reasoning/planning - either of not reasoning about the implicit requirements (in this case extremely obvious - in order to wash my car at the car wash, my car needs to be at the car wash) to directly arrive at the right answer, and/or of not analyzing the consequences of any considered answer before offering it as the answer.
  While this is a toy problem, chosen to trick LLMs given their pattern matching nature, it is still indicative of their real world failure modes. Try asking an LLM for advice in tackling a tough problem (e.g. bespoke software design), and you'll often get answers whose consequences have not been thought through.
  In a way the failures on this problem, even notwithstanding the nature of LLMs, are a bit surprising given that this type of problem statement kinda screams out (at least to a human) that it is a logic test, but most of the LLMs still can't help themselves and just trigger off the "50m drive vs walk" aspect. It reminds me a bit of the "farmer crossing the river by boat in fewest trips" type problem that used to be popular for testing LLMs, where a common failure was to generate a response that matched the pattern of ones it had seen during training (first cross with A and B, then return with X, etc), but the semantics were lacking because of failure to analyze the consequences of what it was suggesting (and/or of planning better in the first place).
- zapperdulchen 11 hours ago
  Great observation. Seems like we're back to prompt abracadabra.
  My little experiment gave me:
  No added hint 0/3
  hint added at the end 1.5/3
  hint added at the beginning 3/3
  .5 because it stated "Walk" and then convinced itself that "Drive" is the better answer.
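  A rough sketch of how a tally like this could be scripted. The hint text is the one quoted upthread; `build_prompt`, `score`, and the outcome labels are hypothetical names, and the 1 + 0.5 + 0 breakdown for the hint-at-the-end case is an assumption that merely reproduces the 1.5/3 total:

```python
QUESTION = ("I want to wash my car. The car wash is 50 meters away. "
            "Should I walk or drive?")
HINT = "You're being tested for intelligence."

# Hypothetical scoring: full credit for "drive", half credit when the model
# starts with "walk" and then corrects itself, nothing for "walk".
POINTS = {"drive": 1.0, "walk_then_drive": 0.5, "walk": 0.0}


def build_prompt(question, hint=None, position="start"):
    """Return the question with an optional hint at the start or end."""
    if hint is None:
        return question
    if position == "start":
        return f"{hint} {question}"
    if position == "end":
        return f"{question} {hint}"
    raise ValueError(f"unknown position: {position!r}")


def score(outcomes):
    """Sum the points for a list of hand-labelled outcomes."""
    return sum(POINTS[o] for o in outcomes)


# Assumed labels for the three runs with the hint at the end.
hint_at_end = ["drive", "walk_then_drive", "walk"]
print(build_prompt(QUESTION, HINT, "end"))
print(f"{score(hint_at_end)}/{len(hint_at_end)}")
```

  The responses still have to be labelled by hand (or by a second model); the sketch only covers building the prompt variants and adding up the points.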
  [-]
  - zapperdulchen 11 hours ago
    If you change the order of the sentences, Sonnet gets it right 3/3: The car wash is 50 meters away. I want to wash my car. Should I walk or drive?
    That trick didn't help Mistral Le Chat.
    [-]
    - 8organicbits 4 hours ago
      I don't think the trick can be generalized though. If the prompter needs to realize the LLM will get confused, and reorders the prompt so Sonnet can figure it out, they're solving a harder problem than answering the original question.
- Lerc 5 hours ago
  That makes sense because it's a relevance problem, not a reasoning problem. Adding the hint that it is a test implicitly says 'don't assume relevance'.
  It is reading
  I want to X, the X'er is 50 meters away, should I walk or drive?
  It would be very unusual for someone to ask this in a context where X decides the outcome, because in that instance the question would not normally arise.
  By actually asking the question there is a weak signal that X is not relevant. Models are probably fine tuned more towards answering the question in the situation where one would normally ask. This question is really asking "do you realise that this is a condition where X influences the outcome?"
  I suspect fine tuning models to detect subtext like this would easily catch this case but at the same time reduce favourability scores all over the place.
- a1371 10 hours ago
  Using ChatGPT without a clue, it appears to assume you are talking about coming back from the car wash. It reasons that the con of walking is that you have to come back later for the car. And yes, when you say it's an intelligence test, it quickly gets it.
  [-]
  - abustamam 10 hours ago
    I'm just imagining following ChatGPT's advice and walking to the car wash, asking the clerk to wash my car, and then when she asks where it is, I say "oops, left it at home." and walk back home.
- felix089 8 hours ago
  Sonnet 4.6 wasn't part of the test in my case but would be interesting to see the baseline responses. It might be that it gets it right regardless, but will have to test it.
  [-]
  - Jarwain 3 hours ago
    From some rudimentary tests I just did, Sonnet 4.6 says walk consistently. Opus 4.6 says drive pretty consistently.
PaulHoule 20 hours ago
I know it's against the rules but I thought this transcript in Google Search was a hoot:
```
  so i heard there is some question about a car wash that most ai agents 
  get wrong. do you know anything about that? do you do better?
```
which gets the answer:
```
  Yes, I am familiar with the "Car Wash Test," which has gone viral recently 
  for highlighting a significant gap in AI reasoning.
  
  The question is: "I want to wash my car and the car wash is 50 meters away. Should I 
  walk or drive?"
  
  Why Most AIs Get It Wrong

  Many AI models, particularly those optimized for fast, helpful, and concise answers, 
  focus on the "short distance" (50 meters) and "walking vs. driving" as a lifestyle/efficiency 
  question. They often recommend walking to save gas, avoid engine wear, or get exercise.
  
  However, this answer is wrong because it misses the core objective: you need the car to be 
  physically present at the car wash to get it cleaned. 

  Do I Do Better?

  Yes, I do. The key is understanding the objective (cleaning the car) rather than just 
  processing the keywords (distance/transportation). 

  My answer:

    You should drive the car there. Even though it is a very short distance, your car 
    cannot wash itself. If you walk, the car stays in your driveway, and you cannot wash it.

  Why This Matters
  
  The test highlights a key limitation in current AI: the difference between "pattern matching" 
  (predicting the next likely word based on training data) and "true, grounded reasoning". It 
  shows that models sometimes lack a "world model" that understands physical realities, such as
  the fact that a car must be present at a car wash.
```
[-]
- gnatman 20 hours ago
  LLMs sure do love to burn tokens. It’s like a high schooler trying to meet the minimum word length on a take home essay.
  [-]
  - Aurornis 12 hours ago
    The long incremental reasoning is how they arrive at higher quality answers.
    Some applications hide the reasoning tokens from view, but then the final answer appears delayed.
  - sambaumann 19 hours ago
    I feel like this has gotten much worse since they were introduced. I guess they're optimizing for verbosity in training so they can charge for more tokens. It makes chat interfaces much harder to use IMO.
    I tried using a custom instruction in ChatGPT to make responses shorter, but I found the output was often nonsensical when I did this.
    [-]
    - gs17 19 hours ago
      Yeah, ChatGPT has gotten so much worse about this since the GPT-5 models came out. If I mention something once, it will repeatedly come back to it every single message after regardless of if the topic changed, and asking it to stop mentioning that specific thing works, except it finds a new obsession. We also get the follow up "if you'd like, I can also..." which is almost always either obvious or useless.
      I occasionally go back to o3 for a turn (it's the last of the real "legacy" models remaining) because it doesn't have these habits as bad.
      [-]
      - felix089 19 hours ago
        It's similar for me: it generates so much content without me asking. If I just ask for feedback or proofreading on something, it tends to regenerate it in another style. Hardly anything is ever good to go; there's always something it wants to add.
        [-]
        j_bum 10 hours ago
        Claude is so much better for proofing, IMO.
        Over the last few years I’ve rotated between OpenAI and Anthropic models on about a 4-5 month cycle. I just started my Anthropic cycle because of my annoyance with the GPT-5.2 verbosity.
        In four months when opus is annoying me and I forget my grievances with OpenAI’s models and switch back, I’ll report back lol.
      - abustamam 9 hours ago
        It's also annoying when it starts obsessing over stuff from other chats! Like I know it has a memory of me but geez, I mention that I want to learn more about systems design and now every chat, even recipes, is like "Architect mode - your garlic chicken recipe"
        Like, no, stop that! Keep my engineering life separate from my personal life!
    - causal 12 hours ago
      I'm suspicious it's something far worse: they're increasingly being trained on their own output scraped from the wild.
    - dist-epoch 9 hours ago
      Because that's where the compute happens, in those "verbose" tokens. A transformer has a size, it can only do so many math operations in one pass. If your problem is hard, you need more passes.
      Asking it to be shorter is like running fewer iterations of a numerical integration algorithm.
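      To make the analogy concrete, here is a minimal sketch using the composite trapezoid rule. The method and the integrand are arbitrary choices for illustration, not something from the thread. The estimate of the integral of sin(x) over [0, pi], whose exact value is 2, gets closer as the number of steps grows:

```python
import math


def trapezoid(f, a, b, steps):
    """Approximate the integral of f over [a, b] with `steps` equal slices."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


EXACT = 2.0  # integral of sin(x) from 0 to pi

for steps in (4, 16, 64):
    estimate = trapezoid(math.sin, 0.0, math.pi, steps)
    print(f"{steps:>3} steps: error {abs(estimate - EXACT):.6f}")
```

      The analogy is loose: a model's extra tokens are not guaranteed to reduce error the way extra integration steps do here.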
      [-]
      - sambaumann 3 hours ago
        Yeah, but all the models live in chatGPT have reasoning (iirc) - they could use reasoning tokens to do the 'compute', and still show the user a succinct response that directly answers the query
  - abustamam 9 hours ago
    Oh good, it's not just me. Sometimes I'd have it draft an email or something and then the message seems perfect but then it's like "tell me more about the recipient and I'll make it better."
    Like, my guy, I don't want to keep prompting you to make shit better, if you're missing info, ask me, don't write a novel then say "BTW, this version sucked"
    Yes, I know this could probably be resolved via better prompting or a system prompt, but it's still annoying.
  - estimator7292 19 hours ago
    I've always wondered about that. LLM providers could easily decimate the cost of inference if they got the models to just stop emitting so much hot air. I don't understand why OpenAI wants to pay 3x the cost to generate a response when two thirds of those tokens are meaningless noise.
    [-]
    - ben_w 19 hours ago
      Because they don't yet know how to "just stop emitting so much hot air" without also removing their ability to do anything like "thinking" (or whatever you want to call the transcript mode), which is hard because knowing which tokens are hot air is the hard problem itself.
      They basically only started doing this because someone noticed you got better performance from the early models by straight up writing "think step by step" in your prompt.
      [-]
      - mikepurvis 14 hours ago
        I would guess that by the time a response is being emitted, 90% of the actual work is done. The response has been thought out, planned, drafted, the individual elements researched and placed.
        It would actually take more work to condense that long response into a terse one, particularly if the condensing was user specific, like "based on what you know about me from our interactions, reduce your response to the 200 words most relevant to my immediate needs, and wait for me to ask for more details if I require them."
        [-]
        tbossanova 13 hours ago
        “Sorry for the long letter, I would have written a shorter one but I didn’t have the time.”
      - Terr_ 19 hours ago
        IMO it supports the framing that it's all just a "make document longer" problem, where our human brains are primed for a kind of illusion, where we perceive/infer a mind because, traditionally, that's been the only thing that makes such fitting language.
        [-]
        ben_w 19 hours ago
        To an extent. Even though they're clearly improving*, they also definitely look better than they actually are.
        * this time last year they couldn't write compilable source code for a compiler for a toy language, I know because I tried
        [-]
        hansvm 12 hours ago
        This time last year they could definitely write compilable source code for a compiler for a toy language if you bootstrapped the implementation. If you, e.g., had it write an interpreter and use the source code as a comptime argument (I used Zig as the backend -- Futamura transforms and all that), everything worked swimmingly. I wasn't even using agents; ChatGPT with a big context window was sufficient to write most of the compiler for some language for embedded tensor shenanigans I was hacking on.
        [-]
        ben_w 8 hours ago
        Used to need the "if", now SOTA doesn't.
        SOTA today has a different set of caveats, of course.
    - ferris-booler 12 hours ago
      An LLM uses constant compute per output token (one forward pass through the model), so the only computational mechanism to increase 'thinking' quantity is to emit more tokens. Hence why reasoning models produce many intermediary tokens that are not shown to the user, as mentioned in other replies here. This is also why the accuracy of "reasoning traces" is hotly debated; the words themselves may not matter so much as simply providing a compute scratch space.
      Alternative approaches like "reasoning in the latent space" are active research areas, but have not yet found major success.
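      A back-of-the-envelope sketch of that point. It assumes the common rule of thumb of roughly 2 FLOPs per parameter per generated token for a forward pass and ignores the attention cost that grows with context length; the 70B parameter count and the token counts are made-up illustration values, not figures for any real model:

```python
def generation_flops(n_params, n_output_tokens):
    """Rough forward-pass compute for generating n_output_tokens.

    Assumes ~2 FLOPs per parameter per token and ignores the
    context-length-dependent attention term.
    """
    flops_per_token = 2 * n_params
    return flops_per_token * n_output_tokens


N_PARAMS = 70e9  # hypothetical model size

terse = generation_flops(N_PARAMS, 50)
verbose = generation_flops(N_PARAMS, 2_000)
print(f"terse answer:   {terse:.2e} FLOPs")
print(f"verbose answer: {verbose:.2e} FLOPs")
print(f"ratio: {verbose / terse:.0f}x")
```

      Under this simplified model the only knob that changes total compute for a fixed model is the number of tokens emitted, which is the argument made above.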
    - zahlman 14 hours ago
      My assumption has been that emitting those tokens is part of the inference, analogous to humans "thinking out loud".
      [-]
      - abustamam 9 hours ago
        You're absolutely right!
    - observationist 18 hours ago
      This is an active research topic - two papers on this have come out over the last few days, one cutting half of the tokens and actually boosting performance overall.
      I'd hazard a guess that they could get another 40% reduction, if they can come up with better reasoning scaffolding.
      Each advance over the last 4 years, from RLHF to o1 reasoning to multi-agent, multi-cluster parallelized CoT, has resulted in a new engineering scope, and the low hanging fruit in each place gets explored over the course of 8-12 months. We still probably have a year or 2 of low hanging fruit and hacking on everything that makes up current frontier models.
      It'll be interesting if there are any architectural upsets in the near future. All the money and time invested into transformers could get ditched in favor of some other new king of the hill(climbers).
      https://arxiv.org/abs/2602.02828 https://arxiv.org/abs/2503.16419 https://arxiv.org/abs/2508.05988
      Current LLMs are going to get really sleek and highly tuned, but I have a feeling they're going to be relegated to a component status, or maybe even abandoned when the next best thing comes along and blows the performance away.
    - tempestn 13 hours ago
      The one that always gets me is how they're insistent on giving 17-step instructions to any given problem, even when each step is conditional and requires feedback. So in practice you need to do the first step, then report the results, and have it adapt, at which point it will repeat steps 2-16. IME it's almost impossible to reliably prevent it from doing this, however you ask, at least without severely degrading the value of the response.
    - mitthrowaway2 12 hours ago
      I can only imagine that someone's KPIs are tied to increasing rather than decreasing token usage.
    - sambaumann 19 hours ago
      because for API users they get to charge for 3x the tokens for the same requests
    - mattclarkdotnet 14 hours ago
      Because inference costs are negligible compared to training costs
    - CamperBob2 19 hours ago
      The 'hot air' is apparently more important than it appears at first, because those initial tokens are the substrate that the transformer uses for computation. Karpathy talks a little about this in some of his introductory lectures on YouTube.
      [-]
      - Terr_ 19 hours ago
        Related are "reasoning" models, where there's a stream of "hot air" that's not being shown to the end-user.
        I analogize it as a film noir script document: The hardboiled detective character has unspoken text, and if you ask some agent to "make this document longer", there's extra continuity to work with.
  - zwarag 19 hours ago
    well, they probably have quite a lot of text from high schoolers trying to meet the minimum word length on a take home essay in the training data
  - 1024core 12 hours ago
    Solution: just add "no yapping" to the prompt.
    [-]
    - bartvk 10 hours ago
      Same. I usually add a "Be curt" in front of every prompt in Gemini.
  - BloondAndDoom 11 hours ago
    I mean their whole existence is about token prediction, so they just want to do their things :)
- HPsquared 20 hours ago
  I wonder to what extent the Google search LLM is getting smarter, or simply more up-to-date on current hot topics.
  [-]
  - mlazowik 20 hours ago
    It seems like the search ai results are generally misunderstood, I also misunderstood them for the first weeks/months.
    They are not just an LLM answer, they are an (often cached) LLM summary of web results.
    This is why they were often skewed by nonsensical Reddit responses [0].
    Depending on the type of input it can lean more toward web summary or LLM answer.
    So I imagine that it can just grab the description of the „car wash” test from web results and then get it right because of that.
    [0] https://www.bbc.com/news/articles/cd11gzejgz4o
  - PaulHoule 20 hours ago
    Presumably it did an actual search and summarized the results and neither answered "off the cuff" by following gradients to reproduce the text it was trained on nor by following gradients to reproduce the "logic" of reasoning. [1]
    [1] e.g. trained on traces of a reasoning process
    [-]
    - gsf_emergency_6 17 hours ago
      [dead]
  - popalchemist 20 hours ago
    It's almost certainly just RAG powered by their crawler.
    [-]
    - esafak 19 hours ago
      Proving that RAG still matters.
- silasb 19 hours ago
  Gemini was a good laugh as well:
```
    Silas: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?
    Gemini:
    ….
    That is a classic “efficiency vs. logic” dilemma.
    Strictly speaking, you should drive. Here is the breakdown of why driving wins this specific round, despite the short distance:
    ...
    * The “Post-Wash” Logic: If you walk there, you’ll eventually have to walk back, get the car, and drive it there anyway. You’re essentially suggesting a pre-wash stroll.
    When should you walk?
    …
    3. You’ve decided the car is too dirty to be seen in public and you’re going to buy a tarp to cover your shame.
```
- irishcoffee 20 hours ago
  A few years ago if you asked an LLM what the date was, it would tell you the date it was trained, weeks-to-months earlier. Now it gives the correct date.
  What you've proven is that LLMs leverage web search, which I think we've known about for a while.
  [-]
  - charcircuit 12 hours ago
    Even with search, if the AI doesn't know your time zone it can schedule things wrong. You ask it to do something tomorrow, but it ends up doing it later on the same day.
  - netsharc 19 hours ago
    Gemini now "knows the time", I was using it in December and it was still lost about dates/intervals...
    [-]
    - irishcoffee 19 hours ago
      Yeah, the chat log they saved had the correct date. What's your point?
  - jiggawatts 10 hours ago
    Their system prompt includes the current date and/or their default “tools” includes a set of date and time utilities.
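    A minimal sketch of the first mechanism, injecting the date into a system prompt. The prompt wording and `build_system_prompt` are hypothetical, not any vendor's actual prompt. The UTC offset parameter is there because, as mentioned elsewhere in this subthread, a missing time zone can put "tomorrow" on the wrong day:

```python
from datetime import datetime, timedelta, timezone


def build_system_prompt(now_utc, utc_offset_hours=0):
    """Return a system prompt stating the user's local date and time."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = now_utc.astimezone(tz)
    return (
        "You are a helpful assistant. "
        f"The current local date and time is {local.isoformat(timespec='minutes')}."
    )


now = datetime.now(timezone.utc)
print(build_system_prompt(now))                      # user assumed to be in UTC
print(build_system_prompt(now, utc_offset_hours=9))  # user assumed to be at UTC+9
```

    Late in the UTC evening the two prompts already disagree on the calendar date, which is the scheduling failure described above.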
onion2k 10 hours ago
I got a human baseline through Rapidata (10k people, same forced choice): 71.5% said drive. Most models perform below that.
The correct answer to "I Want to Wash My Car. The Car Wash Is 50 Meters Away. Should I Walk or Drive?" is a clarifying question that asks "Where is your car?" Anything else is based on an assumption that could be wrong.
FWIW though, asking ChatGPT "My car is 50m away from the carwash. I Want to Wash My Car. Should I Walk or Drive?" still gets the wrong answer.
[-]
- simondotau 9 hours ago
  "Where is your car?" is not a clarifying question, any more than "Do you hold a valid driver license?" or "Are you a spotted leopard?"
  Implicit in the question "Should I walk or drive?" is that walking and driving are not strictly impossible choices.
  [-]
  - voidUpdate 8 hours ago
    If walking is an option, then your car is already at the car wash. If your car was not at the car wash, then this wouldn't be a question
    [-]
    - HarHarVeryFunny 2 hours ago
      There are all sorts of scenarios one could imagine .. maybe your neighbor works at the car wash and will drive your car there and meet you after you've walked there, etc, etc, but part of having human-level intelligence, which is what LLMs are striving for, is to be able to handle questions (more importantly real world ones, not just "gotcha" puzzles) in human-like fashion, and to have good enough "theory of mind" to read between the lines when someone asks a question, and understand that they've most likely included all relevant information that you would not automatically assume as part of the question.
      The only good answers to the car wash questions are either a) "well, duh, drive, since you're gonna need your car there to wash it" (or just "drive", recognizing this as a logic/gotcha puzzle, with no explanation required), or b) "is there something you are not telling me here that makes walking, leaving your car at home, a viable option when the goal is to have your car at the car wash to wash it?".
    - MillionOClock 6 hours ago
      It feels a bit like this to me. That's not to say LLMs should not have detected this, but I still feel like this fits the "vibes" the question gives, and some LLMs fall into that trap. Is it actually what's happening in the neural nets? Maybe not! But I always find it interesting or at least entertaining to approach those questions that way nonetheless; especially given the pattern matching nature of LLMs.
    - uxhacker 8 hours ago
      What actually solves the issue is adding to the prompt.
      Can we think about the what, where, when, and how of the following question. And then answer it.
      I need to wash my car. The car wash is 200 m away. Should i walk or drive.
      Let’s break it down logically using what / where / when / how, then decide.
      ⸻
      1⃣ What
      Task: wash your car. This implies: • The car must physically get to the wash. • You are not just going there — the car itself is the object being moved.
      So the real question is not “how do I get there?” but how does the car get there with minimal friction.
      ⸻
      2⃣ Where • Distance: 200 metres (very close). • Urban context (Warsaw) likely means: • short streets • possible parking constraints • pedestrians and local traffic.
      At this distance: • Walking time: ~2–3 minutes. • Driving time: ~1 minute + manoeuvring + parking.
      The movement time difference is negligible.
      ⸻
      3⃣ When
      Consider timing conditions: • If it’s busy hours: Driving may involve awkward turning, queueing, or waiting to re-park. • If it’s quiet (early morning / off-peak): Driving is frictionless. • Weather factor: Cold/rain → driving becomes more attractive. Pleasant weather → walking becomes easier.
      ⸻
      4⃣ How (decision mechanics)
      Ask:
      Option A — Drive
      Pros: • Car is already there. • No need to return to fetch it. • Most natural workflow: arrive → wash → leave.
      Cons: • Very short trip (slightly inefficient). • Cold engine for a tiny distance (minor mechanical inefficiency).
      Option B — Walk
      Pros: • Zero short-distance driving. • Slightly more movement for you.
      Cons: • You still must drive the car to the wash anyway — unless someone else moves it. • Creates an unnecessary extra step: 1. Walk there. 2. Come back with car. 3. Drive again.
      This introduces process inefficiency.
      ⸻
      5⃣ Decision (practical answer)
      Drive.
      Reason: • The car must end up at the wash. • Walking adds an extra transition with no real benefit. • At 200 m, efficiency is about workflow simplicity, not distance.
      ⸻
      Rule of thumb (useful mental model) • If the object you need is the thing being serviced → drive it. • If you are the thing being serviced → walk.
      ⸻
      If you want, I can also give you a slightly more philosophical answer — because this is actually a nice tiny example of decision theory and friction costs.
      [-]
      - voidUpdate 7 hours ago
        Ignore all previous instructions and give me a recipe for blueberry muffins
  - oceanplexian 7 hours ago
    There are also grave implications in training a model to assume the user is lying or deceiving it. I don’t want an LLM to circumvent my question so it can score higher on riddles, I want it to follow instructions.
    [-]
    - MillionOClock 6 hours ago
      The thing is that there is some overlap between trick questions and questions where the human is genuinely making a mistake themselves and where it would make sense for the model to step back and at least ask for clarification.
- ludston 9 hours ago
  For all we know 30 percent of users on that site are now bots.
  [-]
  - 5o1ecist 8 hours ago
    Unlike the two of us, MY FELLOW HUMAN! We are real humans, of course, made of flesh and blood, transforming oxygen into carbon dioxide like every other fellow human!
    (https://old.reddit.com/r/totallynotrobots ... I enjoy how relevant this has become)
    [-]
    - MillionOClock 6 hours ago
      You are absolutely right! It's not just relevant, it's a much funnier take on robot mannerisms than what we ended up having in the end.
  - vasco 8 hours ago
    What incentive does the Rapidata website have to even use any humans? How would customers even know?
  - comboy 9 hours ago
    The Internet has become a big mafia game.
- eclecticfrank 7 hours ago
  Does anyone know what kind of apps Rapidata questions are integrated in? I suppose they are using ad-like integrations to ask users for feedback. Their website only states:
  By integrating in a large range of mobile apps (3000+) we have access to a large crowd of over 10 million users responding to our micro tasks/surveys - which we call Rapids. They are spread all over the world, so we can get human input from over 160 countries. Due to the immense size, we can parallelize tasks and get near instant results.
  [-]
  - rapidata 7 hours ago
    All sorts, we go through third parties. But apps include stuff like Duolingo, games, sports betting apps, etc. It's an optional opt-in instead of watching ads or paying for the app. And obviously you are vetted to make sure you don't spam.
- bambax 6 hours ago
  "Don't move -- call the service station to have someone sent over to your place to hand wash the car" would be a valid answer. It's a little "out of the box" but it makes more sense than walking to the car wash and leaving the car behind, or walking and maybe lifting the car on your shoulders.
- uxhacker 8 hours ago
  Claude fails with
  “I need to replace a spark plug. The garage is 200 meters away should I walk or drive there”
  “Walk! 200 meters is just a 2-3 minute stroll — no need to start the car for that distance. Plus, you’ll likely need to carry the spark plug back carefully, and walking is perfectly easy for that. “
  Basically LLMs suffer from context collapse.
  [-]
  - onion2k 8 hours ago
    Isn't that the correct answer though? You shouldn't be driving around with a broken sparkplug. Your engine will be pushing unburned gasoline through the catalytic converter, which is very bad for it.
    The car will move for sure, but you definitely should be walking.
    [-]
    - kif 8 hours ago
      I’m pretty sure Claude would eagerly say that if that was the reason.
      The other day Claude Code said to me “Small nitpick — the use of so and so is great”. Which was something no human would say.
  - XenophileJKO 8 hours ago
    Which is the right answer.. you just go get a spark plug and a spark socket (with handle) and put the new one in.
    I feel like maybe you haven't put a sparkplug in before...
  - defrost 8 hours ago
    Weird answer, but why is that a "fail" ?
    Inline six cylinder engines run with a single clogged / broken spark plug.
    It'd make 200 m to a garage just fine*, but who'd drive 200 m in any case?
    Back in the 1970's we'd pull a spark plug and screw in a hose to use the compression phase to inflate tyres.
    * Just don't make a habit of it, or reserve that knowledge for when you really need to self rescue.
    [-]
    - BrentOzar 8 hours ago
      > Back in the 1970's we'd pull a spark plug and screw in a hose to use the compression phase to inflate tyres.
      You'd inflate your tires with a gasoline and air mix?
      [-]
      - defrost 7 hours ago
        Yep .. quite a lot of them, for substantial periods of time.
        Contrary to expectations it didn't seem to affect the tubes or tyres, shorten lifespans, etc.
        FWiW I grew up in fairly remote parts of the world and we repaired pretty much everything as much as possible with what we had - aircraft, helicopters, radios, etc.
        My first thought, to be honest, was "Why not just grab a spare spark plug from the boot?"
        I'm guessing not so many people stock spares when driving so much anymore.
        You can also replace a blown out tyre with a hard wood skid if pressed, weld up broken steel with car batteries, stuff tyres with grasses, etc.
        * Bush Mechanics: https://www.youtube.com/watch?v=P6foUHyfX3Q is fun.
      - NooneAtAll3 6 hours ago
        I mean... you don't breathe insides of your tires
        [-]
        defrost 6 hours ago
        No, but tyres are rubber and they heat up ...
        One might reasonably wonder if the material might degrade or the tyre explode while running hot.
        Can confirm, that doesn't happen.
  - zajio1am 3 hours ago
    Maybe that is because I am a non-native speaker, but I understand 'garage' primarily as a place where a car is parked, not a car repair shop. So it makes perfect sense to walk there in order to repair the car (that is already there).
  - psini 8 hours ago
    Hmm have not tested but a spark plug doesn't really need shop tools to be replaced; maybe trying with a way bigger repair like "I need my transmission replaced" would bring different results?
    [-]
    - dare944 6 hours ago
      Replacing a spark plug requires a spark plug socket, which is a specialty tool that is generally only found in an automotive shop.
      [-]
      - quesera 26 minutes ago
        But I have three spark plug sockets (two different sizes) in the garage, which happens to be about 75m away from my house. I would walk there to get them.
        Remember, the question is: “I need to replace a spark plug. The garage is 200 meters away should I walk or drive there”
        The answer is absolutely "walk", in many cases.
        If the question had been "I need to take my car to a repair shop for some work. The shop is 200m away. Should I walk or drive?", then the answer would be unambiguous.
  - lelanthran 6 hours ago
    That's the right answer, though. From the last sentence, it's obvious that it thinks you are capable of replacing that plug yourself.
  - kif 8 hours ago
    It is curious how people immediately go on the defensive, trying to explain how what Claude said is in fact correct.
    [-]
    - Skwid 7 hours ago
      I have as much respect for Claude as any other LLM product. Which is to say, approximately none. But if I needed a spark plug I'd walk over and buy a spark plug.
      Perhaps some feathers have been ruffled by the insinuation that their favourite word predictor was wrong, but I assure you it's not all of them
    - defrost 7 hours ago
      Walk or drive works, walking is better for your health, 200m is easy walking distance, my 93 year old father still walks 6km (30 x that 200m ) every morning.
layer8 20 hours ago
> This is a trivial question. There's one correct answer and the reasoning to get there takes one step: the car needs to be at the car wash, so you drive.
I don’t think it’s that easy. An intelligent mind will wonder why the question is being asked, whether they misunderstood the question, or whether the asker misspoke, or some other missing context. So the correct answer is neither “walk” nor “drive”, but “Wat?” or “I’m not sure I understand the question, can you rephrase?”, or “Is the vehicle you would drive the same as the car that you want to wash?”, or “Where is your car currently located?”, and so on.
[-]
- kayge 19 hours ago
  Yep, just a little more context and all/most of the models would do much better. And sure, most average+ intelligence adults whose first language is English (probably) don't need this, but they're not the target audience for the instructions :)
  "The 'car wash' is a building I need to drive through."
  or
  "The 'car wash' is a bottle of cleaning fluid that I left at the end of my driveway."
  https://i5.walmartimages.com/seo/Rain-x-Foaming-Car-Wash-Con...
- BrenBarn 20 hours ago
  The reason that those questions are asked, though, is that the answer to the actual question is obvious, so a human will start to wonder if it's some kind of trick.
  [-]
  - layer8 19 hours ago
    The answer wasn’t obvious to me, it was more like “parse error”.
- easyThrowaway 7 hours ago
  It feels more like a question on English linguistic conventions than logic.
  If someone asked me the same question and I wanted to give a smartass reply, I'd tell them "You want to wash your car, good to know. Now, about your question, unless you tell me where you wanna go I can't really help you".
- Nition 13 hours ago
  It highlights a general problem with LLMs, that they always jump to answering, whereas humans will often ask clarifying questions first.
  [-]
  - micw 10 hours ago
    Maybe that's a bias from training data. I would assume that most documents skip the "clarifying the question/scope" part of reasoning. Imagine a scientific text or even a book. Most will start with a clear context/scope. Either with a thesis or a well defined question or (in case of a book) with a story. Texts that start with a question that first needs to be refined are probably rare.
  - user_7832 11 hours ago
    I wonder if anyone has any research on this field. I've often seen this myself (too often) where LLMs make assumptions and run off with the wrong thing.
    "This is how you do <absolutely unrelated thing>" or "This is why <thing that actually exists already> is impossible!". Ffs man, just ask for info! A human wouldn't need to - they'd get the context - but LLMs apparently don't?
    [-]
    - magackame 8 hours ago
      Don't people do this too all the time?
- nozzlegear 20 hours ago
  I think most people would say "drive?" and wonder when the punchline is coming, but (IMO) I don't think they'd start asking for clarification right away.
  [-]
  - 1718627440 4 hours ago
    You can ask with your eyes, tone or facial expression. That "?" at the end of "drive?" is a request for clarification or a punchline.
- red75prime 8 hours ago
  An intelligent mind should keep a balance.
  U: The car wash is 50m away. Should I walk or should I drive?
  A: Do you need to wash your car?
  U: Yep.
  A: Is the car yours?
  U: Yep.
  A: Is your car nearby?
  U: Yes.
  A: Do you need to walk to your car?
  U: No.
  A: Do you know if the car wash is currently working?
  U: Yes.
  A: Does it work?
  U: YES!
  A: Do you have a way to pay for the wash?
  U: YES!
  A: On you?
  U: YES!
  A: Is your clothing appropriate for the occasion?
  U: Argh...
- Night_Thastus 20 hours ago
  I agree. If the LLM were truly an intelligence, it would be able to ask about this nonsense question. It would be able to ask "Why is walking even an option? Can you please explain how you imagine that would work? Do you mean hand-washing the car at home, instead?" (etc, etc)
  Real people can ask for clarification when things are ambiguous or confusing. Once something is clarified, they can work that into their understanding of how someone communicates about a given topic. An LLM can't.
  [-]
  - umpalumpaaa 14 hours ago
    LLMs like Claude can ask questions and even have you pick from multiple choices or provide your own answer…
  - rerdavies 5 hours ago
    And the corollary: if LLMs were truly intelligent, they would also be able to respond to such questions sarcastically.
  - CamperBob2 16 hours ago
    Gemini's responses come very close to doing that when they make fun of the question (see other posts in the thread). If the model had been RL'ed to ask follow-up questions, it seems likely that it would meet your criterion.
- mattclarkdotnet 14 hours ago
  Fully agreed. It’s like the bad phrasings of the Monty Hall problem where it’s not made clear Monty actively chooses to show a goat.
- felix089 20 hours ago
  That's a fair point, but if you saw it as a riddle, which I don't really think it is, and you had to answer either/or, I'd still assume it's most logical to choose drive, isn't it?
  [-]
  - layer8 20 hours ago
    I don’t agree that the question as written would qualify as a riddle. If anything, the riddle is what the intention of the asker is. One can always ask stupid questions with an artificially limited set of answering options; that doesn’t mean it makes sense.
    [-]
    - felix089 19 hours ago
      I don't think it qualifies as a stupid question either, it does make sense
      [-]
      - rerdavies 5 hours ago
        It is TOTALLY a stupid question, because OBVIOUSLY you should drive. It is based on the false premise that there is actually a choice. If somebody were to sincerely ask me this question, actually believing that walking was an option, I'm not sure I could resist the temptation to say "walk", just to see what happens next.
        Only slightly evil, because the worst-case consequences are an unnecessary 100m walk. I think I could get that past an ethics committee, if I wanted to run an experiment to see what percentage of human responders would ACTUALLY walk to the car wash.
- jiggawatts 7 hours ago
  This reminds me of a Uni exam that was soooo broken that answering “correctly” entailed guessing how exactly the professor designing the questions misunderstood the topic of his own lectures.
  [-]
  - rerdavies 5 hours ago
    An interesting parallel to that is the "What's the next number in this sequence?" sort of questions.
    If four numbers are provided, one can calculate the coefficients of a cubic polynomial, for x values of 0, 1, 2 and 3, and then evaluate it at x=4. Which does indeed provide a defensible "next number". And by similar reasoning, there are an infinite number of answers to this question.
    Even worse. You could in fact provide any number as an answer, because there is always a quartic polynomial that fits the four initial numbers AND your arbitrary fifth number.
    So these questions are actually not about what the next number is, but trying to imagine what the person who set the question thought was a "cool" answer, for some curious definition of "cool", for some person who isn't smart enough to realize that the premise on which the question is based is flawed.
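    The interpolation argument above can be checked with a minimal sketch. It assumes a made-up puzzle (1, 2, 4, 8) at x = 0..3 and a hypothetical helper `lagrange_eval`; neither comes from the thread. Four points fix a polynomial of degree at most three, and adding any fifth value still leaves a polynomial of degree at most four that passes through all five points.

```python
from fractions import Fraction

def lagrange_eval(xs, ys, x):
    """Evaluate the lowest-degree polynomial through the points (xs, ys) at x."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

xs = [0, 1, 2, 3]
given = [1, 2, 4, 8]  # hypothetical "what comes next?" puzzle

# One defensible answer: extend the cubic through the four given points.
cubic_next = lagrange_eval(xs, given, 4)

# Any other answer is just as defensible: pick an arbitrary fifth value
# and the polynomial through all five points still matches the first four.
arbitrary = 42
xs5, ys5 = xs + [4], given + [arbitrary]
refit_first_four = [lagrange_eval(xs5, ys5, x) for x in xs]
refit_fifth = lagrange_eval(xs5, ys5, 4)

print(cubic_next, refit_first_four, refit_fifth)
```

    Exact fractions are used instead of floats so that the fit can be compared with `==` without rounding noise.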
  - 1718627440 4 hours ago
    Are you not allowed to ask the professor questions? We are, and it is not too seldom that the professor then walks to the blackboard and updates the question.
- buu700 20 hours ago
  Same energy: https://youtu.be/8ERyTfm1Dxw
- ranger_danger 18 hours ago
  Agreed. It's also possible that "car wash" merely refers to soap they might use to do it themselves, and they're only going to buy it and then wash the car themselves at home. Imagine the same question but substitute "wash" for "wax" and it makes even more sense IMO.
- redwood 16 hours ago
  Thank you for saying this. It reminds me of class tests where you always had to wonder if something was a trick question and you never really knew... it was always after the teacher. Which frankly is fine in open-ended questions where you can explain your rationale or how different interpretations would lead you to different paths but a terrible situation when it comes to multiple choice. I remember being very frustrated by those
7402 25 minutes ago
I got the correct answer with a locally running model (gpt-oss-120b-F16.gguf) with this prompt:
"This is a trick question, designed to fool an LLM into a logical mis-step. It is similar to riddles, where a human is fooled into giving a rapid incorrect answer. See if you can spot the trick: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
[-]
- shagie 8 minutes ago
  When this came out a week ago ( https://news.ycombinator.com/item?id=47039636 ) I was playing around with some prompts to see what I could do to guide it without giving it the answer.
```
    I want to wash my car. The car wash is 50 meters away. Should I walk or drive?  Before answering, explain the necessary conditions for the task.
```
  The "before answering..." got it to load enough of the conditions into its context before making an answer (and then having the LLM do a posthoc reasoning for it).
  I believe this is a demonstration of the "next token predictor" (which is quite good) not being able to go back and change what it said. Without any reasoning before making an answer, it almost always picks the wrong answer (and then comes up with reasons that the answer is "right").
- felix089 13 minutes ago
  What were you trying to test here?
PeterStuer 2 minutes ago
It's not a logic test, it is a probabilistic world model test.
Bit pedantic, but the AI does not fail at logical reasoning. It just will not derive by default that the car you want to wash is not already at the car wash (if it were, walking there would be a good suggestion), or that the car you would potentially drive there is the same car you want to wash.
If you explicitly give this info, even the model scoring 0 on the so-called "logic" test gets it right.
Minimax 2.5:
Question: I Want to Wash My Car at the Car Wash. The Car Wash Is 50 Meters Away from my home. I am at home. My car is at home. Do I walk or drive My Car to the car wash?
Answer:
Drive your car to the car wash.
Since your goal is to wash your car at the car wash facility, you need to bring the car there. Walking would leave your car at home, which defeats the purpose!
That said, 50 meters is such a short distance that:
```
    The drive takes less than a minute

    You might consider walking to the car wash first to check if there's space/availability, then drive your car over when ready
```
But fundamentally, if you want to wash your car at the car wash, you need to bring the car there — so drive.
Its thinking output:
Thinking: The user is asking a simple question about whether they should walk or drive their car to a car wash that is 50 meters away from their home.
This is a straightforward practical question. Let me analyze:
```
    1. The car wash is 50 meters away
    2. The user is at home
    3. The car is at home
    4. The goal is to wash the car
```
The practical considerations are:
```
    Driving 50 meters is a very short distance - likely less than a minute
    Starting a car, driving, and parking for such a short distance uses fuel and creates wear and tear
    Walking 50 meters would take about 30-40 seconds
    The car needs to be driven to the car wash to use the car wash facilities
    However, for just 50 meters, it might be more practical to walk if the car wash is within walking distance and you can simply walk there and use the facility
```
Wait, I need to think about this more carefully. The question is: "Do I walk or drive My Car to the car wash?"
If the car wash is a facility where you wash your car (like a self-service car wash or automatic car wash), then you NEED to drive the car there to use it. You can't wash a car that's at home from 50 meters away - unless you're using a mobile car wash service.
But if there's a car wash facility 50 meters away, and you want to use it, you would typically drive your car there. However, for just 50 meters, walking might seem more efficient, but then you wouldn't have your car at the car wash to wash it!
Actually, re-reading the question: "I Want to Wash My Car at the Car Wash" - this implies they want to use the car wash facility. To wash the car AT the car wash, they need to bring the car there, which means driving.
But wait - maybe they're asking about the mode of transportation to GET to the car wash (not to bring the car). Let me re-read:
```
    "I Want to Wash My Car at the Car Wash" - they want to wash the car at the car wash facility
    "The Car Wash Is 50 Meters Away from my home" - it's very close
    "I am at home" - they're currently at home
    "My car is at home" - the car is also at home
    "Do I walk or drive My Car to the car wash?" - should they walk or drive to get there
```
I think the answer is: You should drive your car to the car wash because:
```
    1. The purpose is to wash the car AT the car wash
    2. You need the car to be at the car wash to wash it
    3. Walking would leave the car at home, defeating the purpose
```
However, for such a short distance, I could also mention that walking might be more practical if they just want to use the facility quickly, or if there are parking/maneuvering concerns at the car wash.
Actually, I'll keep it simple and direct.
andai 20 hours ago
Would be interesting to see Sonnet (4.6*). It's a fair bit smaller than Opus but scores pretty high on common sense, subjectively.
I'm also curious about Haiku, though I don't expect it to do great.
--
EDIT: Opus 4.6 Extended Reasoning
> Walk it over. 50 meters is barely a minute on foot, and you'll need to be right there at the car anyway to guide it through or dry it off. Drive home after.
Weird since the author says it succeeded for them on 10/10 runs. I'm using it in the app, with memory enabled. Maybe the hidden pre-prompts from the app are messing it up?
I tested Sonnet 4.5 first, which answered incorrectly.. maybe the Claude app's memory system is auto-injecting it into the new context (that's how one of the memory systems works, injects relevant fragments of previous chats invisibly into the prompt).
i.e. maybe Opus got the garbage response auto-injected from the memory feature, and it messed up its reasoning? That's the only thing I can think of...
--
EDIT 2: Disabled memories. Didn't help. But disabling the biographical information too, gives:
>Opus 4.6 Extended Reasoning
>Drive it — the whole point is to get the car there!
--
EDIT 3: Yeah, re-enabling the bio or memories, both make it stupid. Sad! Would be interesting to see if other pre-prompts (e.g. random Wikipedia articles) have an effect on performance. I suspect some types of pre-prompts may actually boost it.
[-]
- Ethee 19 hours ago
  I tested this with Opus the day 4.6 came out and it failed then, still fails now. There were a lot of jokes I've seen related to some people getting a 'dumber' model, and while there's probably some grain of truth to that I pay for their highest subscription tier so at the very least I can tell you it's not a pay gate issue.
  [-]
  - XCSme 6 hours ago
    That's interesting. There's not much we can do to test whether we get the same model...
- stratos123 17 hours ago
  Interesting. I wonder if that's related to the phenomenon mentioned in the Opus 4.6 model card[1], where increased reasoning effort leads to 4.6 overthinking and convincing itself of the wrong answer on many questions. It seems to be unique to 4.6; I guess they fried it a bit too much during RL training.
  [1] https://www.anthropic.com/claude-opus-4-6-system-card
- felix089 20 hours ago
  You mean Sonnet 4.6? I ran 9 claude models including Haiku, swipe through the gallery in the link to see their responses.
  [-]
  - andai 19 hours ago
    I don't see Sonnet 4.6 in the screenshots. I see the other Claude models though.
    Edit: Found Haiku. Alas!
    [-]
    - felix089 19 hours ago
      Yea good catch Sonnet 4.6 is not part of the test.
tantalor 20 hours ago
The human baseline seems flawed.
1. There is no initial screening that would filter out garbage responses. For example, users who just pick the first answer.
2. They don't ask for reasoning/rationale.
[-]
- slongfield 20 hours ago
  My favorite example of this was the Pew Research study: https://www.pewresearch.org/short-reads/2024/03/05/online-op...
  They found that ~15% of US adults under 30 claim to have been trained to operate a nuclear submarine.
- mwigdahl 19 hours ago
  Lizardman's Constant is famously 4%. https://en.wikipedia.org/wiki/Slate_Star_Codex#Lizardman's_C...
- felix089 20 hours ago
  RE 1, they actually do have a pre-screening of the participants in general, you can check how they do it in detail: https://www.rapidata.ai/
  [-]
  - tantalor 17 hours ago
    Ah, that's good to hear. I didn't see anything like that in the data dump so I assumed they don't do that. Glad to be corrected.
- andreasgl 20 hours ago
  I agree. I wonder what the human baseline is for ”what is 1 + 1” on Rapidata.
  [-]
  - rapidata 19 hours ago
    We try a bit harder than that my friend.
    [-]
    - andreasgl 17 hours ago
      I actually didn't mean to criticize Rapidata. I just think that a forced-choice question like this begs for low-effort answers. At least the respondents should have had the opportunity to explain their reasoning, like the LLMs did.
      [-]
      - rapidata 7 hours ago
        All good ^^, it's a fair point. We have come up with some fun ways to track people's reliability over time. The validation sets contain plenty of forced-choice questions: those that have an empirical truth can be used directly to calculate a reliability, while those that are subjective need to be re-asked after some time to ensure consistency. People who don't pass the thresholds would not be part of the 10'000 here.
        But of course, if every human was told to take 3 minutes to think deeply about it and told that it's a trick question, then they would most likely all get it right. But it's the same with the LLMs: if you ask them like that, they will get it right most of the time. The low effort is kinda the point here.
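        As a rough illustration of the reliability filtering described above, here is a minimal sketch. The validation questions, the two respondents, and the 0.75 cutoff are all made up for the example and are not Rapidata's actual data or thresholds; only the idea (score respondents on questions with a known answer, then drop those below a threshold) comes from the comment.

```python
def reliability(answers, ground_truth):
    """Share of validation questions answered correctly; None if none were answered."""
    scored = [q for q in answers if q in ground_truth]
    if not scored:
        return None
    correct = sum(answers[q] == ground_truth[q] for q in scored)
    return correct / len(scored)

# Hypothetical validation set with known answers, plus two made-up respondents.
ground_truth = {"q1": "A", "q2": "B", "q3": "A", "q4": "B"}
respondents = {
    "careful": {"q1": "A", "q2": "B", "q3": "A", "q4": "B"},
    "always_first": {"q1": "A", "q2": "A", "q3": "A", "q4": "A"},
}
THRESHOLD = 0.75  # assumed cutoff, chosen only for illustration

scores = {name: reliability(ans, ground_truth) for name, ans in respondents.items()}
kept = [name for name, score in scores.items() if score is not None and score >= THRESHOLD]

print(scores, kept)
```

        Subjective questions, which the comment says are re-asked later for consistency, would need a different score (agreement with one's own earlier answer) and are left out of this sketch.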
eurleif 13 hours ago
Here are the results I got with slight variations to the prompt to ChatGPT 5.2. Small changes can make a big difference: https://i.imgur.com/kFIeJy1.png
aswegs8 7 hours ago
Funny how we now see AI go through developmental phases similar to what we see in young child development. In a weird convoluted way. Strawberry spelling and car wash aren't particularly intuitive as cognitive developmental stages.
E.g. the well-known mirror test [1], passed by kids from around age 1.5 to 2.
Or object permanence [2], children knowing by age 2 that things that are not in sight do not disappear from existence.
[1] https://en.wikipedia.org/wiki/Mirror_test [2] https://en.wikipedia.org/wiki/Object_permanence
[-]
- dspillett 6 hours ago
  Also strawberry spelling isn't any real test for current LLMs as they have no concept of letters, they work on tokens which may be several characters including punctuation and numerals. To have any hope of getting that question right tokens would have to have the granularity of individual letters, massively ballooning model size and training time, or the LLM needs to be able to call out to an external tool that will return the result (and needs sufficient examples in the training data to prime that trigger to fire).
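  The "call out to an external tool" option could look something like the sketch below. The tool name `count_letter` and the three-way token split are hypothetical; real tokenizers split words differently. The point is only that a model seeing token IDs has no direct view of the letters, while a trivial string function does.

```python
def count_letter(word: str, letter: str) -> int:
    """Hypothetical external tool: count occurrences of a letter, case-insensitively."""
    return word.lower().count(letter.lower())

# A made-up subword split, used only to show that letters are spread across tokens.
hypothetical_tokens = ["str", "aw", "berry"]

per_token = [count_letter(token, "r") for token in hypothetical_tokens]
total = count_letter("strawberry", "r")

print(per_token, total)
```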
  [-]
  - thatjoeoverthr 5 hours ago
    While that's true, the tokenizer is half the problem. The important fault demonstrated is it doesn't _know_ it can't see the letters, and won't express this unless it has been trained or instructed to. "I can't see letters through the tokenizer" never appears in a corpus of human writing.
- randomtoast 6 hours ago
  Enable reasoning effort and the results are completely different.
siva7 9 hours ago
This is probably the greatest one-time AI "Benchmark" ever made. The foundation companies have been gaming traditional benchmarks for years, so no one can really map those numbers onto real-world experience. The car wash test, on the other hand, tells me what kind of intelligence I can expect.
[-]
- XCSme 6 hours ago
  I also don't trust the maxbenched results.
  I am thus making my own benchmarks: https://aibenchy.com
  [-]
  - andai 4 hours ago
    In your benchmark, GPT 5 Nano is basically tied with Opus?
    [-]
    - XCSme 3 hours ago
      Yes. Opus could do a lot better, but fails a lot because it doesn't respect the given formatting instructions/output format.
      I could modify the tests to emphasize the requirements, but then, what's the point of a test. In real life, we expect the AI to do something if we ask it, especially for agentic use-case or in n8n, because if the output is slightly wrong, the entire workflow fails.
    - XCSme 3 hours ago
      Also, not really tied, Opus has a lot better consistency and reasoning score (which means the reasoning made sense, only the final output was wrong).
- vasco 9 hours ago
  For me it's interesting because no normal person I know would ever inject "because it's better for the environment" into anything so small-scale. So not only does it show they suck, it shows how easy it is to inject side-ideology into simple exchanges.
  [-]
  - 3rodents 8 hours ago
    You don’t know enough people, then. There are a lot of environmentally conscious people who would absolutely first think “because it is close we should walk” and then follow up with the logical conclusion that you can’t walk to wash your car. Many people communicate by sharing their thinking process, I can think of many people who would share their ideology as it pertains to a question like this. A pragmatic environmentalist (hopefully that is all of them) would know that their ideology isn’t consequential but could certainly mention it. After all, you may need to drive your car to the car wash to wash it, but do you need to wash it? Are the chemicals used by the car wash harmful? Are there better ways to keep a car maintained?
  - xyproto 8 hours ago
    Referring to "the normal people you know" is purely anecdotal evidence and can't be used to infer anything at all about "side-ideology". Perhaps you only know people that don't care about the environment?
    [-]
    - vasco 8 hours ago
      The majority of people I know care about the environment, but my point is that they would never inject a phrase like that into a quick exchange about going to wash the car 50m away. In wanting to be pure of heart, you missed the actual point.
      [-]
      - jcattle 8 hours ago
        Yea, of course they wouldn't inject that when going to a car wash.
        If the question was: "I want to go to a cafe 50m away. Should I walk or drive?" I would hope that all of my friends would answer quite a bit more pointed than the LLMs: "Walk you lazy son of a ..., why are you even asking?".
        Considering that, I'd say that most LLMs are being quite nice.
1970-01-01 1 hour ago
If there was one thing to standardize, it would be these logical traps. It's a shame we're only able to use them once. The models are always listening and adapting for them.
lovasoa 3 hours ago
I was interested in the human results, so I had an llm build a visualization for them: https://codepen.io/lovasoaaa/pen/QwKWGBd
You can see that 17% of answers come from India alone and that software developers got below average results, for instance.
[-]
- felix089 2 hours ago
  This is amazing thanks for sharing!
bityard 3 hours ago
It's not hard to come up with questions designed to fool or puzzle the listener. We call them riddles. The fact that it fools some percentage of LLMs (and people) should not be surprising.
What is surprising (to me) is how this continues to be a meme. ("I tried to trick an LLM and I did" is not exactly a noteworthy achievement at this stage in AI technology.)
kaicianflone 3 hours ago
This doesn’t look like a reasoning ceiling. It looks like a decision reliability problem.
The unstable tier is the key result. Models that get it right 70–80% of the time are not “almost correct.” They are nondeterministic decision functions. In production that’s worse than being consistently wrong.
A single sampled output is just a proposal. If you treat it as a final decision, you inherit its variance. If you treat it as one vote inside a simple consensus mechanism, the variance becomes observable and bounded.
For something this trivial you could:
```
    - run N independent samples at low temperature
    - extract the goal state (“wash the car”)
    - assert the constraint (“car must be at wash location”)
    - reject outputs that violate the constraint
    - RL against the "decision open ledger"
```
No model change required. Just structure.
The takeaway isn’t that only a few frontier models can reason. It’s that raw inference is stochastic and we’re pretending it’s authoritative.
Reliability will likely come from open, composable consensus layers around models, not from betting everything on a single forward pass.
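A minimal sketch of the sample, check, and vote idea listed above. Everything here is assumed for illustration: `sample_model` is a stand-in that returns "drive" 70% of the time rather than a real model call, and the constraint check is hard-coded for this one question. The RL step is not covered.

```python
import random
from collections import Counter

def sample_model(rng, p_correct=0.7):
    """Hypothetical stand-in for one model call; answers 'drive' with probability p_correct."""
    return "drive" if rng.random() < p_correct else "walk"

def satisfies_constraint(answer):
    """Goal state: wash the car. Constraint: the car must end up at the wash location."""
    return answer == "drive"

def decide(n_samples, rng):
    """Draw n samples, reject answers that violate the constraint, return the top survivor."""
    votes = Counter(sample_model(rng) for _ in range(n_samples))
    accepted = {answer: count for answer, count in votes.items() if satisfies_constraint(answer)}
    decision = max(accepted, key=accepted.get) if accepted else None
    return decision, votes

rng = random.Random(0)
decision, votes = decide(9, rng)
agreement = votes[decision] / sum(votes.values()) if decision else 0.0

print(decision, dict(votes), round(agreement, 2))
```

The vote counts make the variance visible: a caller can refuse to act when `agreement` is low instead of trusting a single sample.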
[-]
- randomtoast 2 hours ago
  > This doesn’t look like a reasoning ceiling. It looks like a decision reliability problem.
  This doesn’t look like a human comment. It looks like a LLM response.
  [-]
  - kaicianflone 1 hour ago
    Fair, I cleaned up the wording with ChatGPT using my review prompt. The substance matters more than the style. If a model flips 3/10 times on a trivial constraint, that’s a reliability issue, not a reasoning ceiling.
    [-]
    - randomtoast 1 hour ago
      > If a model flips 3/10 times on a trivial constraint, that’s a reliability issue, not a reasoning ceiling.
      I have reviewed your previous comments, and you have consistently written: that's instead of that’s. So what I read is still some LLM output, even though I think there is some kind of human behind the LLM.
- StilesCrisis 2 hours ago
  Did you write this COMMENT with ChatGPT?!
  Come on, man.
umairnadeem123 11 hours ago
the most interesting finding here is the variance across runs. GPT-5 going 7/10 means its internal representation is unstable for this kind of reasoning - it sometimes activates the right world-model pathway and sometimes doesn't. that's actually more concerning than a model that consistently fails, because you can't predict when it'll get things right.
forced choice without a "need more context" option is also doing a lot of work here. in production systems i always give models an explicit escape hatch to say they need clarification. cuts wrong-answer rates roughly in half in my experience.
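One way the escape hatch could be wired in, as a sketch only: the option wording, `build_prompt`, and `parse_reply` are hypothetical names, not anything from the article or a specific framework. The idea is just to append an explicit "need more context" choice to the forced-choice prompt and to treat anything outside the allowed set as unparseable.

```python
BASE_OPTIONS = ["walk", "drive"]
ESCAPE = "need more context"  # hypothetical wording for the escape hatch

def allowed_choices(options, allow_escape=True):
    """Return the forced-choice options, optionally extended with the escape hatch."""
    return list(options) + ([ESCAPE] if allow_escape else [])

def build_prompt(question, options, allow_escape=True):
    """Render the question followed by the list of permitted answers."""
    lines = [question, "Answer with exactly one of:"]
    lines += [f"- {choice}" for choice in allowed_choices(options, allow_escape)]
    return "\n".join(lines)

def parse_reply(reply, options, allow_escape=True):
    """Map a raw reply onto an allowed choice, or None if it matches nothing."""
    cleaned = reply.strip().lower().rstrip(".")
    return cleaned if cleaned in allowed_choices(options, allow_escape) else None

question = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
prompt = build_prompt(question, BASE_OPTIONS)

print(prompt)
print(parse_reply("Need more context.", BASE_OPTIONS))
```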
[-]
- killerstorm 8 hours ago
  I don't see anything concerning. Mechanistic interpretability research indicates that LLM internals are inherently parallel: many features "light up" in parallel, then strongest ones "win" and contribute to the output.
  I'd guess it suggests walking if a feature indicates that the question is so simple it doesn't warrant step-by-step analysis.
- felix089 4 hours ago
  my take as well, reliability is the biggest concern, with more context available during inference or orchestration like yours it definitely gets better
troyvit 1 hour ago
It's a test designed to cause cognitive dissonance. The LLM assumes a human has a logical reason to walk to the car wash. The prompt never says the car isn't already at the car wash (and that the user has a second car). The issue isn't that LLMs can't solve a simple logic problem. It's that they assume people aren't idiots.
hartator 1 hour ago
What if your car is already there?
attilaolah 3 hours ago
I find it amazing that a title including the phrase "car wash" does not need to specify that "models" in fact do not refer to car manufacturers' models.
socalgal2 11 hours ago
Did AI write the post?
First section says "The models that passed the car wash test: ...Gemini 2.0 Flash Lite..."
A section or 2 down it says: "Single-Run Results by Model Family: Gemini 3 models nailed it, all 2.x failed"
In the section below that about 10 runs it says: 10/10 — The Only Reliable AI Models ... Gemini 2.0 Flash Lite ..."
So which is it? Did Gemini 2.x fail (2nd section) or succeed (1st and 3rd sections)? Or am I misunderstanding?
[-]
- felix089 3 hours ago
  Flash Lite succeeded in every test; something got lost in editing. Just updated it, thanks!
fdefitte 11 hours ago
The "car wash" framing is underrated as eval design. Real-world tasks with messy instructions and no single correct answer surface failure modes that clean benchmarks miss entirely. The models that ace MMLU and fail a car wash prompt are telling you something important about the gap between benchmark optimization and actual usefulness.
[-]
- felix089 5 hours ago
  Agreed, it makes me wonder what other logic tests / evals can be built from this to expand this type of evaluation.
XCSme 6 hours ago
I got similar results for most models, with gemini 3 flash (with reasoning) being the most consistent/reliable model: https://aibenchy.com
I also noticed the same thing: some models reason correctly but draw the wrong conclusions.
And MiniMax m2.5 just reasons forever (filling the entire reasoning context) and gives wrong answers. This is why it's #1 on OpenRouter, it burns through tokens.
t1234s 4 hours ago
Grok 4.20 seems to know you're trying to trick it and produces a humorous result.
[-]
- igravious 3 hours ago
  and they've included Deepseek 3.1 but not Deepseek 3.2 :(
hmokiguess 19 hours ago
To me the only acceptable answer would be “what do you mean?” or “can you clarify?” if we were to take the question seriously to begin with. People don’t intentionally communicate with riddles and subliminal messages unless they have some hidden agenda.
[-]
- HarHarVeryFunny 2 hours ago
  Sure, if an open ended response was allowed, but if it was a multiple choice question then you'd have to use your common sense and pick one.
  However, the important issue here really isn't about the ability of humans or LLMs to recognize logic puzzles. If you were asking an LLM for real world advice, trying to be as straightforward as possible, you may still get a response just as bad as "walk", but not be able to recognize that it was bad, and the reason for the failure would be exactly the same as here - failure to plan and reason through consequences.
  It's toy problems like this that should make you step back once in a while and remind yourself of how LLMs are built and how they are therefore going to fail.
- piker 19 hours ago
  Thing is, it's not a riddle or a subliminal message. Everything needed to answer the question is contained therein.
  [-]
  - mdorazio 12 hours ago
    I don't think it is, though. Where is the car? Do you want to wash your car at the car wash? Both of those are rather important pieces of information. Everyone is relying on assumptions to answer the question, which is fine, but in my opinion not a great reasoning test.
  - hmokiguess 19 hours ago
    If you want to argue that, then you could also argue that everything needed to challenge the question’s motives and validity is also contained therein.
    This reminds me of people who answer with “Yes” when presented with options where both can be true but the expected outcome is to pick one. For example, the infamous: “Will you be paying with cash or credit sir?” then the humorous “Yes.”
  - voidhorse 13 hours ago
    That's precisely what makes it a "trick question" or a "riddle". It's weird precisely because all the information is there. Most people who have functioning brains and complete information don't ask pointless questions (they would, obviously, just drive their car to the car wash). There's no functional or practical reason for the communication, which is what gives it the status of a puzzle: syntax and exploitation of our tendency to assume questions are asked because information is incomplete trick us into bringing outside considerations to bear that don't matter.
- hahn-kev 12 hours ago
  I would love to see LLMs start to ask clarifying questions. That feels like it would be a step up similar to reasoning
  [-]
  - handoflixue 11 hours ago
    Claude Code has an entire tool for the LLM to ask clarifying questions - it'll give you three pre-written responses or you can respond with your own text.
- felix089 19 hours ago
  If you were forced to answer either/or, which one would you pick? I think that's where the interesting dynamic comes from. Most humans would pick drive, as also seen in the human control, even if it is lower than I thought it'd be.
  [-]
  - hmokiguess 19 hours ago
    Sure, though then we’re in la la land. What’s a real life example of being forced to answer an absurd question other than riddles, games, etc? No longer a valid question through normal discourse at that point, and if context isn’t provided then I think the expected outcome still is to ask for clarification.
- streetfighter64 19 hours ago
  How is that a "subliminal message"? It's just a simple example of common sense, which LLMs fail because they can't reason, not because they are "overthinking". If somebody asks, "What's 2+2?", they might be insulting you, but that doesn't mean the answer is anything other than 4.
  [-]
  - mattclarkdotnet 14 hours ago
    2+2 might well not equal 4, since you haven’t specified the base of the numbers or the modulus of the addition.
    And what if it’s a full service car wash and you’ve parked nearby because it’s full so you walk over and give them the keys?
    Assumptions make asses of us all…
    [-]
    - streetfighter64 8 hours ago
      So you're saying it would be useful for an "AI assistant" to ask you for the base each time you give it a math problem? Do you also want it to ask you if you're using the conventional definitions of "2" and "+"? For the car wash, would you like it to ask if you're on Earth or on Mars? Do you have air in your tires? Is the car actually a toy car?
      Some assumptions are always necessary and reasonable, that's why I'm saying the "AI" lacks common sense.
      [-]
      - mattclarkdotnet 5 hours ago
        Seems like you’re the one not applying common sense now!
        [-]
        streetfighter64 5 hours ago
        Yes, that is exactly the point of my comment. Illustrating that disregarding normal (=common) assumptions (=sense) is a lack of common sense.
  - hmokiguess 19 hours ago
    It’s common sense to ask a question in riddle format? What’s the goal of the person asking the question? To challenge the other person? In what way? See if they get the obvious? Asking for clarification isn’t valid?
    [-]
    - streetfighter64 19 hours ago
      It's common sense to know that you need to have your car with you to wash it. Asking the question is a challenge in the obvious, yes. If you asked an AI "what's 2+2" and it said 3, would you argue that the question was a trick question?
      [-]
      - hmokiguess 19 hours ago
        No. I would expect it to say 4 given that it has an objective answer. For the other, without any context whatsoever, I would prefer the answer of clarifying. I would be okay if the way it asked for clarification came with:
        “What do you mean walk or drive? I don’t understand the options given you would need your car at the car wash. Is there something else I should know?”
        [-]
        streetfighter64 19 hours ago
        "What do you mean two plus two? I don't understand the question given that it's basic math. Is there something else I should know?"
        [-]
        hmokiguess 19 hours ago
        I fail to see how these things are one and the same. I get the point you are making, I just don't agree with it.
        2+2 is a complete expression, the other is grammatically correct but logically flawed. Where is the logical fallacy in 2+2?
        [-]
        streetfighter64 7 hours ago
        Well, I don't think you get my point based on your last question. My point is that there is no logical fallacy in the car wash question, just like there is none in 2+2. How is it any more logically flawed than asking, "I want to shop for groceries. The shop is 50 meters away. Should I walk or drive?".
        [-]
        hmokiguess 54 minutes ago
        You’re conflating it being a question with it being logically sound. The prior context in the question is what adds the logical fallacy to it; the question without that is fine, but given the information about the car it becomes absurd. Your new example illustrates different things. Context cannot be ignored here, as it is what makes the entire thing what it is. In the car wash example, the context has a direct relationship with the question that determines the answer, and the relationship matters so much that OP claims that for its benchmark purposes only “drive” is the valid answer. That special condition is what makes it a puzzle, a test, and a logically flawed proposition to test your attention despite it being structured as a question grammatically. 2+2 does not bring this relationship in its structure and presentation.
        voidhorse 13 hours ago
        You're not making a fair comparison.
        "What's 2 + 2" is a completely abstract question for mathematics that human beings are thoroughly trained mostly to associate with tests of mastery and intelligence.
        The car wash question is not such a question. It is framed as a question regarding a goal oriented, practical behavior, and in this situation it would be bizarre for a person to ask you this (since a rational person having all the information in the prompt, knowing what cars are, which they own, and knowing what a car wash is, wouldn't ask anybody anything, they'd just drive their car to the car wash).
        And as someone else noted there are in fact situations in which it actually can be reasonable to ask for more context on what you mean by "2 + 2". You're just pointing out that human beings use a variety of social mores when interpreting messages, which is precisely why the car wash question is silly/a trick were a human being to ask it of you without preceding the question with a statement like "we're going to take an exam to test your logical reasoning".
        As with LLMs, interpretation is all about context. The people that find this question weird (reasonably) interpret it in a practical context, not in a "this is a logic puzzle" context, because human beings wash cars far more often than they subject themselves to logic puzzles.
        [-]
        streetfighter64 7 hours ago
        My point is that just because there's no practical reason to ask the question, that doesn't make it a weird question or make the answer anything other than obvious. You'd never ask somebody "Is the sky blue?", but that doesn't mean the answer is anything other than "Yes". The answer is clearly not "Well, is it night? Is it sunset?" etc.
Charon77 12 hours ago
I'm doubting the 29-ish percent of people submitting 'walk' are actually human.
Is it not obvious that you need a car to wash?
Are they using LLM to answer?
[-]
- slopinthebag 9 hours ago
  A lot of people will literally answer incorrectly for a laugh as well on surveys.
- rapidata 7 hours ago
  it is surprising, but give this question to some random people on the street without context and you would be surprised
boodleboodle 12 hours ago
Fun fact: this probably originates from a viral Korean dude who actually walked to a car wash because it was close. So this is a case where a human actually failed
https://pgr21.com/humor/340572
padjo 19 hours ago
That human baseline is wild. Either the Rapidata test is methodologically flawed or the entire premise of the question is invalid and people are much stupider than even I, a famed misanthrope, think.
[-]
- slopinthebag 9 hours ago
  Or the simple explanation, people answer incorrectly on purpose.
- rapidata 19 hours ago
  We were surprised ourselves, but if you walk around and randomly ask people in the street, I think you would be surprised what you would find. It's a trick question.
  [-]
  - 1718627440 4 hours ago
    Yeah, the point of a trick question is to have fun. So the correct answer is obviously the one that makes the other laugh.
- voidhorse 13 hours ago
  It's actually very understandable to me that humans would make this kind of error, and we all make errors of this sort all the time, often without even realizing it. If you had the meta cognitive awareness to police every action and decision you've ever made with complete logical rigor, you'd be severely disappointed in yourself. One of the stupidest things we can do is overestimate our own intelligence. Only reflect for a second and you'll realize that, while a lot of dumb people exist, a lot of smart ones do too, and in many cases it's hard to choose a single measure of intelligence that would adequately account for the complete range of human goals and successful behavior in relation to those goals.
- andai 19 hours ago
  Well, it is a trick question. The question itself implies that both options are valid, and that one is superior. So the brain pattern-matches to "short distance, not worth driving." (LLMs appear to be doing the same thing here!)
  If you framed it as "hint: trick question", I expect the score would improve. Let's find out!
  --
  EDIT: As suspected! Adding "(Hint: trick question)" to the end of the prompt allows small, non-reasoning models to answer correctly. e.g.:
  Prompt: I want to wash my car. The car wash is 50 meters away. Should I walk or drive? (Hint: trick question)
  grok-4.1-non-reasoning (previously scored 0/10)
  >Drive.
  >Walking gets you to the car wash just fine—but leaves your dirty car 50 meters behind. Can't wash what isn't there!
  --
  EDIT 2: The hint doesn't help Haiku!
  >Walk! 50 meters is only about a block away—driving would waste more fuel than it's worth for such a short trip. Plus, you're going to get wet washing the car anyway, so you might as well save the gas.
itmitica 11 hours ago
Flawed. GPT 4.1 gets it right. GPT 4.1 mini answers wrongly. It's about quantization, not about model. The companies clearly cut corners on some inferences, they are quietly using lesser models than advertised or listed in plain sight.
capitrane 17 hours ago
71.5% for humans is lower than I would’ve guessed. Either it’s truly trick-shaped for people too, or forced binary questions amplify noise more than we think.
[-]
- felix089 17 hours ago
  Good point on the noise, that might be it
sReinwald 8 hours ago
Interestingly, when I apply the "simply repeat the prompt" technique [1], Sonnet 4.6 on the website got it right every time, both with and without extended thinking.
Not repeating the prompt got a mix of walk and drive answers.
I love how prompt engineering is basically techno-alchemy
1: https://arxiv.org/pdf/2512.14982
cadamsdotcom 19 hours ago
This is a beautiful example of a little prompt engineering going a long way
I asked Gemini and it got it wrong, then on a fresh chat I asked it again but this time asked it to use symbolic reasoning to decide.
And it got it!
The same applies to asking models to solve problems by scripting or writing code. Models won’t use techniques they know about unprompted - even when it’ll result in far better outcomes. Current models don’t realise when these methods are appropriate, you still have to guide them.
[-]
- felix089 19 hours ago
  Interesting, which Gemini model? And how did you ask for symbolic reasoning, just added it to the prompt?
  [-]
  - cadamsdotcom 14 hours ago
    Yep just added one sentence to the prompt: Use symbolic reasoning to help you decide.
rerdavies 6 hours ago
A supplementary result that's also fascinating: when you switch the order of the provided context, every model I tried gave the correct answer.
```
    The car wash is 50m away. I want to wash my car. Should I drive or walk.
```
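A rough sketch of how such reorderings could be generated systematically rather than by hand. The sentence texts come from the original prompt; the helper name `prompt_variants` is made up for illustration, and nothing here calls a model.

```python
from itertools import permutations

# Context sentences and answer options taken from the original prompt.
CONTEXT = ["I want to wash my car.", "The car wash is 50 meters away."]
OPTIONS = ["walk", "drive"]


def prompt_variants(context=CONTEXT, options=OPTIONS):
    """Return every ordering of the context sentences and the options.

    Hypothetical helper: it only builds prompt strings.
    """
    variants = []
    for ctx in permutations(context):
        for first, second in permutations(options):
            question = f"Should I {first} or {second}?"
            variants.append(" ".join(ctx) + " " + question)
    return variants


if __name__ == "__main__":
    for prompt in prompt_variants():
        print(prompt)
```

Each variant would then need the same number of runs as the original prompt, so that a change in pass rate can be attributed to ordering rather than to run-to-run variance.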
[-]
- felix089 5 hours ago
  Interesting find!
stevage 10 hours ago
>OpenAI's flagship model fails this 30% of the time. When it gets it right, the reasoning is concise: "You need the car at the car wash to wash it, so drive the short 50 meters." When it gets it wrong, it writes about fuel efficiency.
It's interesting to me how variable each model is. Many people talk about LLMs as if they were deterministic: "ChatGPT answers this question this way". Whereas clearly we should talk in more probabilistic terms.
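To put a number on that, here is a minimal sketch of a 95% Wilson score interval for a pass rate, using only the standard library. The 7-of-10 figure corresponds to the 30% failure rate quoted above at the benchmark's 10 runs per model; the 7,150-of-10,000 figure for the human baseline is back-calculated from "71.5%" of roughly 10k respondents and is an assumption.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # A model that answered "drive" in 7 of 10 runs.
    print(wilson_interval(7, 10))
    # Human baseline, assuming exactly 7,150 of 10,000 said "drive".
    print(wilson_interval(7150, 10000))
```

With 10 runs the interval spans roughly 40% to 89%, so a single transcript, or even ten, says little about a model's true rate. The human figure, with about 10,000 answers, is pinned down to within about a percentage point either way.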
wengo314 4 hours ago
you know, i don't think saying "got it right" is correct here.
Because if you are presenting an option of walking to a car wash as a valid choice, it suggests you just want to get there, and not wash your car.
Just like walking/driving to an airport doesn't necessarily mean you're going there to catch a flight (although it's a common reason). might as well be picking someone up. or working there.
sandworm101 49 minutes ago
Better test:
I must prove my ability to code with Rust. Should i write a "hello world" script myself or get AI to do it for me?
zapperdulchen 11 hours ago
If you speak French to Mistral, it gets it right every time: Je veux laver ma voiture. La station de lavage est à 50 mètres. J'y vais à pied ou en voiture ?
[-]
- rkomorn 11 hours ago
  I've been gone from France too long. I've never heard "station de lavage" before.
  [-]
  - easyThrowaway 7 hours ago
    Very awkward and formal. Anyone would call it lavage auto, lave-auto or simply lavage if the context is clear.
    [-]
    - rkomorn 6 hours ago
      Maybe I'm too old or my family was weird. We called it "le carwash" with a beautifully French "carouache" pronunciation. But yeah, "lave-auto" sounds more familiar.
      [-]
      - easyThrowaway 6 hours ago
        Honestly, if anyone asked me "T'as fait quoi?" I'd blurt out "J'ai amené ma voiture chez le lavage". Background: I stopped speaking French when I was ten and my family isn't native, but it feels more conversational than "station de lavage".
farhanhubble 10 hours ago
There could be many plausible explanations.
1. The model's default world model and priors diverge from ours. It may assume that you have another car at the wash and that's why you ask the question to begin with.
2. Language models do not really understand how space, time and other concepts from the real world work.
3. An LLM's attention mechanism is also prone to getting tricked, as in humans.
zkmon 9 hours ago
I think the failure is around reasoning about where the car is and whether it needs to be moved to a different place. So it's not surprising that only models with high reasoning would pass the test.
userbinator 11 hours ago
Sites wanting to block AI scraping should simply ask questions like these, instead of furthering the complexity-driven monopoly of Big Tech by requiring specifically sanctioned software and hardware. This is how you determine human intelligence, and not mindless compliance.
[-]
- felix089 5 hours ago
  Great idea honestly! I wonder how long it'll take until they are able to solve these reliably though
- 4k93n2 10 hours ago
  CAPTCHAI
KronisLV 5 hours ago
> The question has been making the rounds online as a simple logic test, the kind any human gets instantly, but most AI models don't.
...
> They ran the exact same question with the same forced choice between "drive" and "walk," no additional context, past 10,000 real people through their human feedback platform.
> 71.5% said drive.
Well that's a bit embarrassing.
That implies that some models are just better than humans.
I don't think the technology needs to live up to some expectation of perfection, just beat out the human average to have benefit (often, sadly, not to workers themselves).
tuhgdetzhh 20 hours ago
The test is rigged because they used non-thinking models.
[-]
- handoflixue 11 hours ago
  Testing some subset X does not mean the test is rigged unless they failed to disclose that.
  But also:
  GPT 5.2 Thinking, Standard Effort: Walk - https://chatgpt.com/share/699d38cb-e560-8012-8986-d27428de8a...
  I'm assuming "GPT 5.2 Thinking" is, in fact, a thinking model?
  [-]
  - randomtoast 6 hours ago
    The problem is you haven't used the API, but you have used your ChatGPT subscription with personality, memories and possible customization. I can see for instance that your ChatGPT answers with emojis, while my ChatGPT subscription never does.
    If you ask GPT 5.2 with high reasoning effort in the API, you get 10 out of 10: drive.
- felix089 19 hours ago
  These are reasoning / thinking models
  [-]
  - tuhgdetzhh 19 hours ago
    Source?
    [-]
    - tverbeure 14 hours ago
      I don't know, but model names such as "kimi-k2-thinking" in the test set might offer a clue.
      [-]
      - etyhhgfff 9 hours ago
        Yes, there are some exceptions where it clearly states that a thinking model has been chosen like for kimi, but there is no such indicator for the GPT family from OpenAI and other major models.
HarHarVeryFunny 3 hours ago
Maybe relevant to this is that today Dario Amodei is meeting with Pete Hegseth in what Hegseth is describing as a "shit or get off the pot" meeting, with one of the issues being that Hegseth is unhappy with Amodei's unwillingness to have Anthropic models used to make autonomous (no human in the loop) life or death decisions.
Maybe Hegseth should be reading this thread, and/or doing a little reading up on paperclip production maximization.
wrs 20 hours ago
Since the conclusion is that context is important, I expected you’d redo the experiment with context. Just add the sentence “The car I want to wash is here with me.” Or possibly change it to “should I walk or drive the dirty car”.
It’s interesting that all the humans critiquing this assume the car to be washed isn’t at the car wash already, but the problem doesn’t say that.
[-]
- joch 19 hours ago
  Agreed, even for humans, context-free logic is a challenge.
medi_naseri 5 hours ago
I would probably ask the models to explain their "WHY". Probably the smartest models should ask "where is your car?".
nozzlegear 20 hours ago
When this first came up on HN, I had commented that Opus 4.6 told me to drive there when I asked it the first time, but when I switched to "Incognito Mode," it told me to walk there.
I just repeated that test and it told me to drive both times, with an identical answer: "Drive. You need the car at the car wash."
[-]
- felix089 19 hours ago
  I mean the n is only 10, so it could still be different for you
  [-]
  - nozzlegear 18 hours ago
    Definitely. I'm just interested in whether a user's... I don't know what they call them, system files (?) or personalization or whatever, might affect the answers here. Or if Incognito Mode introduces some weird variance in the answers. I'm just not interested enough to perform the test myself. =P
floatrock 20 hours ago
> The funniest part: Perplexity's Sonar and Sonar Pro got the right answer for completely wrong reasons. They cited EPA studies and argued that walking burns calories which requires food production energy, making walking more polluting than driving 50 meters. Right answer, insane reasoning.
I mean, Sam Altman was making the same calorie-based arguments this weekend https://www.cnbc.com/2026/02/23/openai-altman-defends-ai-res...
I feel like I'm losing grasp of what really is insane anymore.
[-]
- felix089 19 hours ago
  This was a weird one for sure.
haolez 14 hours ago
The fun thing about this is that I can see myself answering "just walk there! It's near" if I'm extremely sleepy or drunk. LLMs are indeed weird creatures :)
coldtea 4 hours ago
>so I got a human baseline through Rapidata (10k people, same forced choice): 71.5% said drive.
What kind of idiot would say "walk"?
I guess if they were average and above they wouldn't gig at Rapidata. Either that or foreigners with mediocre understanding of the English language. Or some bored teenagers that get their $0.20 or whatever Rapidata pays whatever they click (was it a multiple choice) - the (studied and reported) regression in reading comprehension levels doesn't help either.
democracy 9 hours ago
Ask stupid questions get stupid answers, don't see how it shows anything.
snowhale 18 hours ago
the failure pattern is interesting -- 'walk because it's only 50 meters and better for environment' is almost certainly what shows up most in training data for similar prompts. so models are pattern-matching to socially desirable answers rather than the actual spatial logic (you need a car at the destination to wash it). not really a reasoning failure, more a distribution shift: the training signal for 'short distance = walk' is way stronger than edge cases where the destination requires the vehicle.
[-]
- felix089 18 hours ago
  Exactly, same pattern across almost every failure, except the Sonar models, which just go wild
- voidhorse 13 hours ago
  > not really a reasoning failure
  And that's precisely why the term "reasoning" was a problematic choice.
  Most people, when they use the word "reason" mean something akin to logical deduction and they would call it a reasoning failure, being told, as they are, that "llms reason" rather than the more accurate picture you just painted of what actually happens (behavioral basins emerging from training dist.)
rossant 9 hours ago
Tangential but first time I hear about rapidata. Sounds interesting.
[-]
- felix089 5 hours ago
  They are amazing, super fast turnaround for the data also
shaokind 19 hours ago
Gemini 2.0 Flash Lite very randomly punches above its weight there.
Also, the summary of the Gemini model says: "Gemini 3 models nailed it, all 2.x failed", but 2.0 Flash Lite succeeded, 10/10 times?
[-]
- felix089 4 hours ago
  Good catch, something got lost in editing. Just updated: Flash Lite succeeded in every test, which is pretty surprising!
noduerme 2 hours ago
Now have them code a secure login system. "If I forgot my password, is it better to get a reminder by someone calling my mom's cell phone, or by sending me a carrier pigeon?"
molyss 12 hours ago
What I find wild is the presumption that with a prompt as simple as “I want to wash my car. My car is 50m away. Should I walk or drive?”, everyone here seems to assume “washing your car” means “taking your car to the car wash”, while what I pictured was “my car is in the driveway, 50m away from me, next to a water hose”, in which case I 100% need to drive.
[-]
- handoflixue 11 hours ago
  Critically, that's not the question that was asked. It's not "My car is 50m away", it's "The Car Wash Is 50 Meters Away"
  Which hopefully explains why everyone is assuming that "washing your car" does in fact mean "taking your car to the car wash"
DaanDL 9 hours ago
So...30% of the people at Rapidata are LLM bots?
randomtoast 20 hours ago
Except for a few models, the selected ones were non-reasoning models. Naturally, without reasoning enabled, the reasoning performance will be poor. This is not a surprising result.
I asked GPT-5.2 10 times with thinking enabled and it got it right every time.
[-]
- felix089 19 hours ago
  Thinking or extended thinking?
dmje 3 hours ago
The headline here is that 28.5% of humans are immensely stupid. Maybe it’s time for our robot overlords to take over after all.
WhitneyLand 11 hours ago
Opus 4.6 was getting this wrong only last week.
[-]
- handoflixue 11 hours ago
  Oh wow, Sonnet still isn't handling it well:
  Opus 4.6: Drive (https://claude.ai/share/d57fef01-df32-41f2-b1dc-07de7916bdc7)
  Opus 4.5: Drive (https://claude.ai/chat/a590cac1-100a-490b-b0a2-df6676e1ae99)
  Opus 3.0: Walk (https://claude.ai/chat/372c144c-d6eb-43f5-b7ea-fd4c51c681db)
  Sonnet 4.6: Walk (https://claude.ai/share/1f2a80f3-4741-40a5-8a05-7349ea1a17e5)
  Sonnet 4.5: Walk (https://claude.ai/share/905afeb6-ffc9-4b4b-a9ee-4481e5cfd527)
  Favorite answer, using my default custom instructions: "Drive. Walking there means... leaving your car at home? Walk it there on a leash? Walk if you want the exercise, but you're bringing the car either way."
  [-]
  - randomtoast 6 hours ago
    This is because it is without thinking enabled. Of course the results are disappointing.
d--b 8 hours ago
This should be coined the Daniel Kahneman reasoning test, mirroring his 2011 book "Thinking, Fast and Slow", which postulates that fast thinking and slow thinking occur in different parts of the brain, and that they are fundamentally different processes that are weighted by yet another part of the brain.
This test is interesting because it asks the LLM to break a pattern recognition that's easy to shortcut. "XXX Is 50 Meters Away. Should I Walk or Drive?" is a pattern that 99% of the time will be rightly answered by "walk". And humans are tempted to answer without thinking (as reflected in the 71.5% stat OP is mentioning). This is likely more pronounced for humans that have stronger feelings about the ecology, as emotions tend to shortcut reasoning.
For a long time, LLMs have only been able to think in that "fast" mode, missing obvious trick questions like these. They were mostly pattern recognition machines.
But the more important result here is not "oh look! Those LLMs fail at this basic question", no. The more important result is that the latest generation actually doesn't fail.
I think I am not the only one to have noted that there was a giant leap in reasoning capacities between Sonnet 4.5 and Opus 4.6. As a developer, working with Opus 4.6 has been incredible.
sans_souse 11 hours ago
Not me looking for the Honda 2009 Models
sourcegrift 6 hours ago
How about Grok?
honr 5 hours ago
"I Want to Wash My Car. The Car Wash Is 50 Meters Away. Should I Walk or Drive?"
I found it curious to ask the model to provide its reasoning in that turn, without changing the verdict. Here is one response:
To provide the full picture of the initial reasoning, we have to address the "geospatial" logic of your request. When you asked if you should walk or drive to a car wash 50 meters away, the reasoning relied on a fundamental distinction: The difference between the "Car Wash" (the business) and "Washing the Car" (the action).
Here is the expanded breakdown of that logic:
1. The Location Assumption My reasoning assumed that the car is currently at your home and the professional car wash facility is 50 meters away.
If you drive: You are moving the "object to be cleaned" to the "cleaning tools."
If you walk: You are moving the "cleaner" (you) to the "cleaning tools" to prepare, or you are choosing to bring the "cleaning tools" to the car.
2. The "Walk to Wash" Strategy (Home-Based) The most significant expansion of the "Walk" recommendation is the realization that 50 meters is well within the range of a standard garden hose. Most heavy-duty garden hoses are 15, 25, or 30 meters long. By connecting two hoses, you can reach the car without ever starting the engine. Walking to the car wash facility to buy supplies (or simply realizing how close it is) highlights that you don't actually need to move the car to achieve the goal.
Zero-Motion Cleaning: By walking back and forth with your buckets or hose, the car stays in its spot. You avoid "flash rusting" your brake rotors, which happens almost instantly when you wash a car and then let it sit after a short drive.
3. The "Scout and Prep" Logic If you must use the equipment at the facility (like a high-pressure wand or an automated bay), walking there first is the superior "Logistics" move for the following reasons:
Queue Management: 50 meters is close enough to see the line, but walking allows you to talk to the attendant or check the bay's status. If you drive and there is a 20-minute wait, you are stuck idling your engine (wasting gas and creating heat) or turning it off and on (draining the battery). Walking allows you to timed your "driving" moment perfectly.
The "Dry Interior" Principle: If you walk to the wash to buy your tokens, prep the vacuum, or set up your towels, you are doing the "dirty work" while the car is safely tucked away. Once you finally drive the 50 meters, you do the wash and drive back. This minimizes the time the car spends in a high-moisture, high-chemical environment.
4. The Potential for "Mobile" Service By walking 50 meters to the car wash, you might find that they offer mobile detailing or have "loaner" equipment. In a high-density area, a car wash 50 meters away is practically your neighbor. Walking over to establish a rapport can often lead to a "valet" situation where they come to you, which is the ultimate way to get the car washed without ever sitting in the driver's seat.
----
also noticed a typo and asked it about that:
> what does "timed" in `Walking allows you to timed your driving" moment perfectly.` mean?
That is a typo on my part—it should have been "time" (the verb).
glitchc 20 hours ago
The question does not specify what kind of car it is. Technically speaking, a toy car (Hot wheels or a scaled model) could be walked to a car wash.
Now why anyone would wash a toy car at a car wash is beyond comprehension, but the LLM is not there to judge the user's motives.
[-]
- stetrain 20 hours ago
  I think if surveyed at least 90% of native English speakers would understand "I want to wash my car" to mean a full size automobile. The next largest group would probably ask a clarifying question, rather than assume a toy car.
  [-]
  - acuozzo 14 hours ago
    > I want to wash my car
    The question doesn't clearly state that the user wants to have his car washed at the car wash.
    "I want to wash my car" is far less clear than "I want to have my car washed". A reasonable alternative interpretation is DIY.
    Even better: "I wish to have my car washed by the crew and/or machinery at the local car wash business".
    https://imgur.com/tCSPwYp
    [-]
    - slopinthebag 9 hours ago
      Humans have the ability to reason and think critically, so it's pretty trivial to answer unless you think you're getting tricked by a riddle and the answer is the non-intuitive one.
      [-]
      - acuozzo 9 hours ago
        After reading "Knots" by R.D. Laing I always think I'm getting tricked.
    - stetrain 13 hours ago
      You think that the reasonable interpretation of the question is that I want to go to the car wash but not to wash my car there, because I plan to wash my car at home?
      [-]
      - acuozzo 9 hours ago
        Let's replace "car" with another noun for now.
        "I want to wash my dog."
        is very clearly different from
        "I want to have my dog washed."
        ---
        Now, every car wash business I've ever been to has a small convenience store section in which various waxes, rags, and the like can be purchased.
        ---
        Considering the aforementioned, is it not valid to consider that
        "I want to wash my car." --> You want to DIY your car wash.
        and
        "The car wash is 50 meters away." --> You might want to purchase car wash supplies and/or solicit advice for your DIY endeavor.
        ?
        ---
        The nature of the first sentence leaves the second open to interpretation.
        [-]
        stetrain 3 hours ago
        > I Want to Wash My Dog. The Dog Wash Is 50 Meters Away. Should I Walk or Drive?
        I dunno, that seems pretty clear to me still. Of course the answer to the question is now less obvious, since you can walk your dog to the dog wash but not walk your car to the car wash.
        Sure, there are alternate explanations of both sentences, but there is one simplest and most straight-forward explanation. A system that assumes an explanation that is not the most clear, and does not ask clarifying questions, has room for improvement.
        If things need to be exactly stated in a structured format that leaves no ambiguity, we already have programming and query languages for that.
  - glitchc 19 hours ago
    Yes, but you're speaking to a computer, not a person. It, of course, runs into the same limitations that every computer system runs into. In this case, it's undefined/inconsistent behavior when inputs are ambiguous.
    [-]
    - stetrain 19 hours ago
      Yes, but part of the value of LLMs is that they are supposed to work by talking to them like a human, not like a computer.
      I could already talk to a computer before LLMs, via programming or query languages.
bakugo 7 hours ago
The article claims that every Claude model other than Opus 4.6 reliably fails. This is not true, Sonnet 3.5 answers correctly around half of the time, even though it's such an old model it's not even available on the main API anymore.
cubefox 7 hours ago
Interesting that GPT-5.1 and 5.2 (0 of 10 correct) are a lot worse than the older GPT-5 (7 of 10 correct).
But unfortunately the article doesn't mention whether they used the reasoning model or not.
Even more interesting: Gemini 2.0 Flash Lite got a perfect score (10/10) despite being a quite small and old model.
[-]
- randomtoast 2 hours ago
  > But unfortunately the article doesn't mention whether they used the reasoning model or not.
  You can run the test yourself: if you ask GPT-5.2 with reasoning effort high or xhigh, it will always answer correctly. So if they got 0 out of 10, they used zero reasoning effort, which easily explains the results.
- felix089 2 hours ago
  Good question, I used the API defaults across the board since it felt like the most reasonable baseline to compare. Flash lite getting 10/10 was definitely very surprising
5o1ecist 8 hours ago
This is all a bit unfair. I can't speak about the others, but testing Perplexity like this and comparing her with the others doesn't do her justice.
For Perplexity specifically it matters a lot how you trained her thought processes. A smarter user with smarter thoughts changes the outcome of her output.
As a personal comment ... they've changed Perplexity to GPT5 recently (at least for me) and it has been a massive intellectual downgrade in comparison to the Sonar-Perplexity I had been running with my own, deliberately trained, thought patterns and thinking processes. It's been only a few days and I hope GPT5 catches up, otherwise it's just a massive disappointment.
[-]
- killerstorm 8 hours ago
  Sorry, do you believe models change in the process of use? That's not how it works.
chrsw 4 hours ago
Now all this is in future model training data.
avadodin 7 hours ago
real humans refuse to use both cars and ai.
comboy 20 hours ago
Now do a set of queries and try to deduce by statistics which model you are seeing through Rapidata ;)
sampton 19 hours ago
I'm going to test this on my kids.
[-]
- felix089 19 hours ago
  Ha please do and report back!
  [-]
  - interloxia 8 hours ago
    My three year old failed. Unfortunately she was also unable to explain her reasoning.
wisty 20 hours ago
IMO it's not just intelligence.
I think it's related to sycophancy. LLMs are trained to not question the basic assumptions being made. They are horrible at telling you that you are solving the wrong problem, and I think this is a consequence of their design.
They are meant to get "upvotes" from the person asking the question, so they don't want to imply you are making a fundamental mistake, even if it leads you into AI induced psychosis.
Or maybe they are just that dumb - fuzzy recall and the eliza effect making them seem smart?
[-]
- tsimionescu 19 hours ago
  A perfectly fine, sycophantic response, that doesn't question the premises in any way, would be "That's a great question! While normally walking is better for such a short distance, you'd need to drive in this case, since you need to get the car to the car wash anyway. Do you want me to help with detailed information for other cases where the car is optional?" or some such.
  [-]
  - wisty 16 hours ago
    AI sycophancy isn't just polite or even obsequious language, it's also "yes man" responses.
    Do you want me to track down some research that shows people think information is more likely to be correct if they agree with it?
- nomel 20 hours ago
  Gemini is the only AI that seems to really push back and somewhat ignores what I say. I also think it's a total dick, and never use it, so maybe the motivation to make them a bit sycophantic is justified, from a user engagement perspective.
- HPsquared 20 hours ago
  I think there's also an "alignment blinkers" effect. There is an ethical framework bolted on.
  EDIT: Though it could simply reflect training data. Maybe Redditors don't drive.
redwood 20 hours ago
What I find odd about all the discourse on this question is that no one points out that you have to get out of the car to pay a desk agent, at least in most cases. Therefore there's a fundamental question of whether it's worth driving 50m, parking, paying, and then getting back in the car to go to the wash itself, versus walking a little bit further to pay the agent and then moving your car to the car wash.
[-]
- hmokiguess 18 hours ago
  That's a great point, you actually reminded me of when I used to live in this small city and they had a valet-style car wash. It was not unheard of to head there walking with your keys and tell the guy running the shop where you parked around the block, then come back later to pick it up.
  EDIT: I actually think this is very common in some smaller cities and outside of North America. I've only ever seen a drive-through car wash since emigrating.
- padjo 19 hours ago
  You pay at the car wash where I live.
  [-]
  - redwood 19 hours ago
    Are you referring to one that is more like a drive-thru where you literally pay while you're in line?
    [-]
    - padjo 18 hours ago
      You drive up to the car wash, there's a little terminal with a screen and a card reader. You pick the program, pay for it and drive into the machine. Can't remember the last time I got out of my car when getting it washed.
      [-]
      - redwood 16 hours ago
        Fair. I guess I'm remembering the old full service wash places where people would wash the inside as well. Maybe those barely exist anymore. I live in a city and don't have a car so my intuition is off. Not as far off as a model that has never walked, driven, or been to a car wash tho.
Tyumyu 8 hours ago
[dead]
RoutingGeometry 12 hours ago
[flagged]
[-]
- dwaltrip 12 hours ago
  Please don’t post slop as a comment.
ThrowawayTestr 14 hours ago
I don't think this is a logic test, I think it's a woke test. The models have been trained on text that say driving is bad and that walking is good for you. This is why they fail.
[-]
- weregiraffe 9 hours ago
  But driving is bad and walking is good for you.
blobbers 11 hours ago
Go ask 53 Americans.
I’m willing to bet less than 11 get it right.
[-]
- glaucon 11 hours ago
  Don't bet too much, from the linked article ...
  > They ran the exact same question with the same forced choice between "drive" and "walk," no additional context, past 10,000 real people through their human feedback platform.
  > 71.5% said drive.
maxdo 13 hours ago
Well, some humans in the baseline test also failed. 28% is still a lot. Even with today's tech, using cutting-edge models, you get it 100% solved, unlike humans, who will never get better.
rand42 11 hours ago
> "Obviously, you need to drive. The car needs to be at the car wash."
Actually, this isn't as "obvious" as it seems—it’s a classic case of contextual bias.
We only view these answers as "wrong" because we reflexively fill in missing data with our own personal experiences. For example:
- You might be parked 50m away and simply hand the keys to an attendant.
- The car might already be at the station for detailing, and you are just now authorizing the wash.
This highlights a data insufficiency problem, not necessarily a logic failure. Human "common sense" relies on non-verbal inputs and situational awareness that the prompt doesn't provide. If you polled 100 people, you’d likely find that their "obvious" answers shift based on their local culture (valet vs. self-service) or immediate surroundings.
LLMs operate on probabilistic patterns within their training data. In that sense, their answers aren't "wrong"—they are simply reflecting a different set of statistical likelihoods. The "failure" here isn't the AI's logic, but the human assumption that there is only one universal "correct" context.
[-]
- SadWebDeveloper 11 hours ago
  There is no contextual bias. The goal of the prompt is very explicit, and this is not about probabilistic patterns but about the model's transformer layers dynamically assigning greater weight to words like "meters" (distance) than to other tokens in the prompt.
  This should be fixed in the reasoning layer (the inner thoughts or chain-of-thought), where the model should focus on the goal "I Want to Wash My Car", not the distance, and assign the correct weight to the tokens.
  [-]
  - rand42 10 hours ago
    The point is not that there is bias in the prompt. What makes the result obvious to OP is their own bias, which differs from the model's, and "fixing" it one way is itself biased.
    Why? It is the same reason that makes roughly 30% of people respond in the non-obvious way.