Claude Sonnet 4.5

(anthropic.com)

770 points | by adocomplete 4 hours ago

64 comments

  • simonw 3 hours ago
    I had access to a preview over the weekend, I published some notes here: https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/

    It's very good - I think probably a tiny bit better than GPT-5-Codex, based on vibes more than a comprehensive comparison (there are plenty of benchmarks out there that attempt to be more methodical than vibes).

    It particularly shines when you try it on https://claude.ai/ using its brand new Python/Node.js code interpreter mode. Try this prompt and see what happens:

      Checkout https://github.com/simonw/llm and run the tests with
      
      pip install -e '.[test]'
      pytest
    
    I then had it iterate on a pretty complex database refactoring task, described in my post.
    • icelancer 1 hour ago
      Huh. Doesn't work over here.

      ---

      SYSTEM:

      Got it. Those are the setup commands for Simon Willison's LLM CLI tool, which lets you interact with language models from the command line.

      The -e flag installs it in editable mode, and [test] pulls in the test dependencies. Are you running into any issues with the installation or tests? What are you working on with the LLM library?

      USER: Correct. You run them using your code tooling.

      SYSTEM:

      Realized inability to execute requested Python commands. The user is telling me to run these commands using my code tooling. They want me to actually execute this. Let me think about what tools I have available:

      I have the repl tool, which runs JavaScript code in a browser environment; I have artifacts for creating code/documents; I have web_search and web_fetch.

      The user wants me to:

      Checkout https://github.com/simonw/llm; run pip install -e '.[test]'; run pytest.

      But I don't have a tool that can:

      Clone git repositories; run shell commands like pip or pytest; execute Python code in a command-line environment.

      The repl tool is JavaScript-only and runs in a browser environment, not a server environment where I could run shell commands. I should clarify that I don't have the capability to run shell commands, install Python packages, or execute pytest. I can only run JavaScript code in a browser environment via the analysis tool. I need to be direct and clear about this limitation, following the user's preferences to eliminate filler and be straightforward.

      I don't have tooling that can run shell commands, clone git repos, install Python packages, or execute pytest. My only code execution environment is a JavaScript browser REPL. I can't run pip, pytest, or any command-line tools. I can only execute JavaScript code in a sandboxed browser environment. What were you hoping to accomplish? Do you need help understanding the test setup, or were you expecting different capabilities?

      • icelancer 1 hour ago
        When pressed:

        ---

        My execution environment is a browser-based JavaScript sandbox. I don't have Python, Node.js shell access, pip, git, or pytest capabilities.

        Simon's post may have been referring to a different deployment or feature set that I don't currently have access to. My only code execution is the JavaScript REPL I just demonstrated.

        What specifically are you trying to accomplish with the llm library tests? I can help with that in other ways.

        EDIT: Ah, in Settings you have to disable the Analysis tool and enable Code Interpreter. You cannot have both enabled.

      • sophiebits 1 hour ago
        You need to enable the new code interpreter mode: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...
    • magicalist 2 hours ago
      > I told it to Give me a zip file of everything you have done so far—you can explore the contents of the file it made me in this Gist.

      For those who don't have time to dig into the gist, did it work and do a good job? I assume yes to at least nominally working or you would have mentioned that, but any other thoughts on the solution it produced?

      • simonw 2 hours ago
        I was impressed. Integrating it into my project would require me to spend a bunch more time reviewing it, but it did a better job than I was expecting. The tests it wrote were a particularly useful exploration of the problem space: https://gist.github.com/simonw/f9d0f870e8d1af399a7f366a7f87b...
        • Snuggly73 2 hours ago
          Ignoring the tests, the first change was adding a single parent id column, and the second "more complex" refactoring added a few more hash columns to the table (after you'd specified that you wanted them, i.e. not an open-ended question).

          It's a very impressive model, but I think we have different views on what is complex.

    • gloosx 37 minutes ago
      I am curious how the sandbox handles potentially malicious code. For example, what would happen if someone tried to run something like a crypto miner or a DDoS script?
      • simonw 22 minutes ago
        Code running in the Claude environment can't make outbound network connections to anything more than a tiny list of domains (GitHub, PyPI, NPM) so DDoS targets are limited.

        If I was Anthropic I'd have measures in place to rate limit network traffic from individual containers.

        • gloosx 1 minute ago
          This would be interesting to test in a controlled way, curious what the CPU/time/memory quotas per run, isolation method, and syscall filtering look like.

          Would appreciate pointers to any documentation or tests.

    • kurtis_reed 2 hours ago
      Why did you have access to a preview?
      • simonw 2 hours ago
        I get access to previews from OpenAI, Anthropic and Gemini pretty often. They're usually accompanied by an NDA and an embargo date - in this case the embargo was 10am Pacific this morning.

        I won't accept preview access if it comes with any conditions at all about what I can say about the model once the embargo has lifted.

        • poopiokaka 38 minutes ago
          Are you gonna publish the NDA to prove that or are we supposed to take your word for it
          • 0x696C6961 31 minutes ago
            Why do you even care?
          • dboreham 28 minutes ago
            NDAs often prohibit publishing...the NDA.
            • IncreasePosts 10 minutes ago
              Then publish it before you sign it! Or is there a pre-NDA that you need to sign before seeing the real NDA?
      • Redster 2 hours ago
        Simonw is a cheerful and straightforward AI journalist who likes to show and not just tell. He has done a good job aggregating and documenting the progress of LLM tools and models. As I understand it, OpenAI and Anthropic have both wisely decided to make sure he has up to date info because they know he'll write about it.

        Thanks for all your work, Simon! You're my favorite journalist in this space and I really appreciate your tone.

        • tootie 29 minutes ago
          Simon has a popular blog, but he's also co-creator of Django and very well-known in the Python community.
        • asadotzler 2 hours ago
          AI blogger seems more appropriate than journalist.
          • nchmy 1 hour ago
            are you aware of any "ai journalists"? Because simonw does great work, so perhaps blogger is what people should aspire towards?
            • simonw 1 hour ago
              I actually talk to journalists on the AI beat quite often - I've had good conversations with them at publications including The Economist and NY Times and Washington Post and ArsTechnica.

              They're not going to write up detailed reviews of things like the new Claude code interpreter mode though, because that's not of interest to a general enough audience.

              I don't have that restriction: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...

            • landl0rd 3 minutes ago
              Kylie Robison recently moved to Wired and is a solid "AI journalist".
            • grim_io 1 hour ago
              Not sure what an AI journalist is supposed to be or do, but the lack of one doesn't automatically promote someone who isn't one into the position.
        • knowsuchagency 2 hours ago
          I like Simon, but he's not a journalist. A journalist would not have gone to OpenAI to glaze the GPT-5 release with Theo. I don't say this to discount Simon -- I appreciate his writing and analysis but a journalist, he isn't.
          • simonw 1 hour ago
            I don't call myself a journalist, partly because no publication is paying me to do any of this!

            If I had an editor I imagine they would have talked me out of going to the OpenAI office for a mysterious product preview session with a film crew.

          • Redster 2 hours ago
            That's a fair point. I feel like he's more than a blogger and am not sure the best term!
      • lomase 40 minutes ago
        They are an AI evangelist who told me I can replace any technical book ever created with an LLM.

        They are a nice person.

      • minimaxir 2 hours ago
        Simon tends to write up reports of new LLM releases (with great community respect) and it's much easier with lead time if the provider is able to set up a preview endpoint.
      • rapfaria 1 hour ago
        His "pelican riding a bicycle" tests are now a classic and AI shops are benchmaxxing for it
        • simonw 1 hour ago
          They need to benchmaxxx a whole lot harder, the illustrations still all universally suck!
      • lossolo 2 hours ago
        The same reason YouTube reviewers and influencers get access to hardware or games before release. In this case, the person is a passionate blogger.
    • siva7 58 minutes ago
      Does 4.5 still answer everything with "You're absolutely right!" or is it now able to communicate like a real programmer?
      • inopinatus 42 minutes ago
        I won’t be satisfied until I get a Linus Torvalds mode.

        “Your idea is shit because you are so fucking stupid”

        “Please stop talking, it hurts my GPUs thinking down to your level”

        “I may seem evil but at least I’m not incompetent”

        • atonse 10 minutes ago
          Why is this getting downvoted? It was hilarious!

          I actually added a fun thing to my user-wide CLAUDE.md, basically saying that it should come up with a funny insult every time I come up with an idea that wasn't technically sound (I got the prompt from someone else). It seems to be disobeying me, because I refuse to believe that I don't have bad ideas. Or some other prompt is overriding it.

    • lilouartz 3 hours ago
      That's cool! Didn't know it was capable of doing that!
    • mclau153 2 hours ago
      Could it pip install pygame and make a game playable in browser?
  • iagooar 2 hours ago
    Anecdotal evidence.

    I have a fairly large web application with ~200k LoC.

    Gave the same prompt to Sonnet 4.5 (Claude Code) and GPT-5-Codex (Codex CLI).

    "implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing the title or when the user types in the title in the main input field, and none of the standard elements match, a search starts with a 2s delay"

    Sonnet 4.5 went really fast at ~3min. But what it built was broken and superficial. The code did not even manage to reuse the already existing auth and started re-building auth server-side instead of looking at how other API endpoints do it. Even re-prompting and telling it how it went wrong did not help much. No tests were written (despite the project rules requiring it).

    GPT-5-Codex needed MUCH longer, ~20min. The changes made were much more profound: it implemented proper error handling, covered lots of edge cases, and wrote tests without me prompting it to do so (project rules already require it). API calls ran smoothly. The entire feature worked perfectly.

    My conclusion is clear: GPT-5-Codex is the clear winner, not even close.

    I will take the 20mins every single time, knowing the work that has been done feels like work done by a senior dev.

    The 3mins surprised me a lot and I was hoping to see great results in such a short period of time. But of course, a quick & dirty, buggy implementation with no tests is not what I wanted.
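
    (Purely illustrative aside: a minimal sketch of the kind of fuzzy title matching that prompt describes, assuming an in-memory list of titles and Python's difflib as a stand-in for whatever matching the real codebase uses - the 2s delay would live in the UI as a debounce, not here:)

      import difflib

      def fuzzy_search(query: str, titles: list[str], limit: int = 5) -> list[str]:
          # Case-insensitive closest-match lookup over the known titles.
          lowered = {t.lower(): t for t in titles}
          matches = difflib.get_close_matches(query.lower(), list(lowered), n=limit, cutoff=0.6)
          return [lowered[m] for m in matches]

      # Hypothetical titles - a real implementation would search conversations/reports.
      titles = ["Quarterly Report", "Conversation with ACME", "Q3 Revenue Review"]
      print(fuzzy_search("quartely reprt", titles))  # -> ['Quarterly Report']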

    • Implicated 2 hours ago
      I'm not trying to be offensive here; I feel the need to indicate that.

      But that prompt leads me to believe that you're going to get rather 'random' results due to leaving SO much room for interpretation.

      Also, in my experience, punctuation is important - particularly for pacing and grouping of logical 'parts' of a task and your prompt reads like a run on sentence.

      Making a lot of assumptions here - but I bet if I were in your shoes and looking to write a prompt to start a task of a similar type, my prompt would have been 5 to 20x the length of yours (depending on complexity and importance) with far more detail, including overlapping descriptions of various tasks (i.e. potentially describing the same thing more than once in different ways in context/relation to other things to establish relation/hierarchy).

      I'm glad you got what you needed - but these types of prompts and approaches are why I believe so many people think these models aren't useful.

      You get out of them what you put into them. If you give them structured and well written requirements as well as a codebase that utilizes patterns you're going to get back something relative to that. No different than a developer - if you gave a junior coder, or some team of developers the following as a feature requirement: `implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing the title or when the user types in the title in the main input field, and none of the standard elements match, a search starts with a 2s delay` then you can't really be mad when you don't get back exactly what you wanted.

      edit: To put it another way - spend a few more minutes on the initial task/prompt/description of your needs and you're likely to get back more of what you're expecting.

      • iagooar 2 hours ago
        I think that is an interesting observation and I generally agree.

        Your point about prompting quality is very valid and for larger features I always use PRDs that are 5-20x the prompt.

        The thing is my "experiment" is one that represents a fairly common use case: this feature is actually pretty small and embeds into a pre-existing UI structure - in a larger codebase.

        GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results. Not only does it work on the first try; Codex is reliably better at understanding the context and doing the things that are common and best practice in professional SWE projects.

        If I want to get something comparable out of Claude, I would have to spend at least 20mins preparing the prompt. If not more.

        • Implicated 2 hours ago
          > The thing is my "experiment" is one that represents a fairly common use case

          Valid as well. I guess I'm just nitpicking based on how much I see people saying these models aren't useful; combined with seeing this example, it triggered my "you're doing it wrong" mode :D

          > GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results.

          I have a reputation with family and co-workers of being quite verbose - this might be why I prefer Claude (though I haven't tried Codex in the last month or so). I'm typically setting up context and spending a few minutes writing an initial prompt and iterating/adjusting on the approach in planning mode so that I _can_ just walk away (or tab out) and let it do its thing, knowing that I've already reviewed its approach and have a reasonable amount of confidence that it's taking an approach that seems logical.

          I should start playing with codex again on some new projects I have in mind where I have an initial planning document with my notes on what I want it to do but nothing super specific - just to see what it can "one shot".

      • GoatInGrey 1 hour ago
        This would explain the LLM implementing the feature in a way you didn't prefer. But this does not explain why Sonnet would deliver a broken implementation that does not work in even the most basic sense.

        Also, there is a threshold in which the time it takes to develop a prompt, allow the agent to run, review its output, and go through iterative loops to correct errors or implementation problems, can exceed the time it takes me (a lazy human) to achieve the same end result.

        Pair this with the bypassing of the generation effect, reduced prefrontal dopamine, and increased working memory load (in part due to minimal motor-cognitive integration), and AI-generated code in contexts with legal and financial repercussions can be a much worse deal than using your own fingers.

      • pton_xd 2 hours ago
        > I bet if I were in your shoes and looking to write a prompt to start a task of a similar type that my prompt would have been 5 to 20x the length of yours

        Why would you need such extensive prompting just to get the model to not re-implement authentication logic, for example? It already has access to all of the existing code, shouldn't it just take advantage of what's already there? A 20x longer prompt doesn't sound like a satisfying solution to whatever issue is happening here.

        • Implicated 2 hours ago
          Well, I don't have the context about what's happening in this example, though I don't see anything about auth myself.

          And I left that window at 5-20x because, again, no real context. But unless I was already in the middle of a task and I was giving direction that there was already context for - my prompt is generally almost never _this_ short. (referring to the prompt in the top level comment)

          > A 20x longer prompt doesn't sound like a satisfying solution to whatever issue is happening here.

          It wouldn't be, given the additional context given by the author in a sibling comment to yours. But if you had specific expectations on the resulting code/functionality that 20x longer prompt is likely to save you time and energy in the back and forth adjustments you might have to make otherwise.

      • rossjudson 6 minutes ago
        When asking for change, there are the choices you know about and the ones you don't. I've gotten in the habit of describing some of the current state as well as my desired state, and using that to focus the LLM on the areas I'd like to have a stronger voice in.

        Of course, I definitely appreciate when it makes choices that I don't know I need to make, and it chooses reasonable defaults.

        I mean, I could say "make the visualization three columns", but there's a lot of ambiguity in that kind of thing, and the LLM is going to make a lot of choices about my intent.

        Instead, "each team/quarter currently has a vertically stacked list of people assigned to that team, with two columns (staying on team, leaving team). change that to being three columns (entering team, staying on team, leaving team)."

        As a bonus, it's much, much clearer to somebody reading the history later what the intent was.

      • g42gregory 1 hour ago
        I have to agree with this assessment. I am currently going at the rate of 300-400 lines of spec for 1,000 LOC with Claude Code. Specs are AI-assisted also, otherwise you might go crazy. :-) Plus 2,000+ lines of AI-generated tests. Pretty restrictive, but then it works just fine.
      • SirMaster 2 hours ago
        But isn't the end goal to be able to get useful results without so much prompting?

        I mean in the movies for example, advanced AI assistants do amazing things with very little prompting. Seems like that's what people want.

        To me, the fact that so many people basically say "you are prompting it wrong" is a knock against the tech and the model. If people want to say that these systems are so smart at what they can do, then they should strive to get better at understanding the user without needing tons of prompts.

        Do you think his short prompt would be sufficient for a senior developer? If it's good enough for a human it should be good enough for a LLM IMO.

        I don't want to take away the ability to use tons of prompting to get the LLM to do exactly what you want, but I think that the ability for an LLM to do better with less prompting is actually a good thing and useful metric.

        • dboreham 12 minutes ago
          Quick data point that I've been able to get LLMs (recently whatever one Claude gives me) to produce amazingly useful results for the purpose of understanding complex codebases, just by asking it to look at the code and tell me how it does xyz. No complicated long prompt. Basically exactly what I'd say to a human.
        • Implicated 1 hour ago
          > But isn't the end goal to be able to get useful results without so much prompting?

          See below about context.

          > I mean in the movies for example, advanced AI assistants do amazing things with very little prompting. Seems like that's what people want.

          Movies != real life

          > To me, the fact that so many people basically say "you are prompting it wrong" is knock against the tech and the model. If people want to say that these systems are so smart at what they can do, then they should strive to get better at understanding the user without needing tons of prompts.

          See below about context.

          > Do you think his short prompt would be sufficient for a senior developer? If it's good enough for a human it should be good enough for a LLM IMO.

          Context is king.

          > I don't want to take away the ability to use tons of prompting to get the LLM to do exactly what you want, but I think that the ability for an LLM to do better with less prompting is actually a good thing and useful metric.

          What I'm understanding from your comments here is that you should just be able to give it broad statements and it should interpret that into functional results. Sure - that works incredibly well, if you provide the relevant context and the model is able to understand and properly associate it where needed.

          But you're comparing the LLMs to humans (this is a problem, but not likely to stop so we might as well address it) - but _what_ humans? You ask if that prompt would be sufficient for a senior developer - absolutely, if that developer already has the _context_ of the project/task/features/etc. They can _infer_ what's not specified. But if you give that same prompt to a jr dev who maybe has access to the codebase and has poked around inside the working application once or twice but no real in depth experience with it - they're going to _infer_ different things. They might do great, they might fail spectacularly. Flip a coin.

          So - with that prompt in the top level comment - if that LLM is provided excellent context (via AGENTS.md/attached files/etc) then it'll do great with that prompt, most likely. Especially if you aren't looking for specifics in the resulting feature outside of what you mentioned since it _will_ have to infer some things. But if you're just opening codex/CC without a good CLAUDE.md/AGENTS.md and feeding it a prompt like that you have to expect quite a bit of variance to what you get - exactly the same way you would a _human_ developer.

          Your context and prompt are the project spec. You get out what you put in.

          • DrewADesign 1 hour ago
            > But you're comparing the LLMs to humans

            These things are being marketed as super intelligent magic answer machines. Judging them using the criteria the marketing teams have provided is completely reasonable.

            > Movies != real life

            Nobody claimed it was. This is about desires and expectations. The people charging money for these services, and taking stacks of cash that would’ve otherwise been in devs’ paychecks while doing so, haven’t even tried to temper those expectations. They made their beds…

      • motoxpro 1 hour ago
        I agree. The sibling responses also give some insight into why a lot of people don't get a lot of value out of coding agents.
      • pembrook 1 hour ago
        You're critiquing OP for not playing to how the models currently work (bad at gathering context on their own). Sure, if you bend over backwards and hop on one foot, you can get them to do what you want.

        OP is critiquing the model as a product vs. the marketing promises. The model should be smart enough to gather context about the project to implement features properly on their own, if they are ever going to 'write 90% of all code THIS YEAR' as people like the founder of Anthropic claim.

      • kelvinjps 2 hours ago
        How would you have written the prompt?
        • Implicated 1 hour ago
          tbh, I don't really understand it well enough to be able to give a response here. But here's a real prompt I just used on a project copy/pasted:

          ```

          Something that seems to have been a consistent gotcha when working with llm's on this project is that there's no specific `placement` column on the table that holds the 'results' data. Our race_class_section_results table has its rows created in placement order - so it's inferred via the order relative to other records in the same race_class_section. But this seems to complicate things quite a bit at times when we have a specific record/entry and want to know its placement - we have to query the rest of them and/or include joins and other complications if we want to filter results by the placements, etc.

          Can you take a look at how this is handled, both with the querying of existing data by views/livewire components/etc and how we're storing/creating the records via the import processes, and give me a determination on whether you think it should be refactored to include a column on the database for `placement`? I think right now we've got 140,000 or so records on that table and it's got nearly 20 years worth of race records, so I don't think we need to be too concerned with the performance of the table or added storage or anything. Think very hard, understand that this would be a rather major refactor of the codebase (I assume, since it's likely used/referenced in _many_ places - thankfully though most of the complicated queries it would be found in would be easily identified by just doing a search of the codebase for the race_class_section_results table) and determine if that would be worth it for the ease of use/query simplification moving forward.
          ```

          This comes with a rather developed CLAUDE.md that includes references to other .md documents that outline various important aspects of the application that should be brought into context when working in those areas.

          This prompt was made in planning mode - the LLM will then dig into the code/application to understand things and, if needed, ask questions and give options to weigh before return with a 'plan' on how to approach. I then iterate on that plan with it before eventually accepting a plan that it will then begin work on.

    • Szpadel 12 minutes ago
      That's kind of expected for me, but Codex feels more like a vibe coding tool and Claude Code more like AI-assisted development.

      And I actually like Claude more because of that.

      Codex will indeed work more independently, but you'll have a hard time when it isn't doing what you want. It will use a Python script just to do simple edits in files (instead of a plain search-and-replace for unique code snippets in small files), and when it's wrong, good luck convincing it otherwise (it already has some outdated info, like on the latest Docker image releases, and convincing it that the Debian base had changed was challenging). It uses context more effectively, but it won't explain why it is doing what it is doing; asking it to explain will just cause it to do something else without a word.

      And of course there's the lack of proper permissions for running commands. The sandbox is cool, but I do not want it to be able to commit; I want it to just edit files, or I want to have some more control over what it does.

      You can run Codex as an MCP server; I prefer adding it to Claude and asking for a cooperative plan. Codex will do a great analysis and plan, and I can comfortably work with Claude on code that matches my style.

    • alecco 58 minutes ago
      Try adding ultrathink to the prompt, hit enter, and play this song in the background:

      [1] https://www.reddit.com/r/ClaudeAI/comments/1mgwohq/ultrathin...

    • epolanski 8 minutes ago
      There are a few issues with posts like yours:

      1. Different LLMs require different prompts and information

      2. They ignore LLMs' non-determinism; you should run the experiment several times

    • robotswantdata 2 hours ago
      I had a similar experience; not good enough yet to bring me back for the Claude Max plan.

      Sticking with ChatGPT pro sub and gpt5 codex on high.

      • mmgeorgi 1 hour ago
        do you ever hit your pro quota?
        • robotswantdata 33 minutes ago
          Never hit pro quota yet, huge repo. Have multiple projects on the go locally and in cloud.

          Feel like this is going to be the $1000 plan soon

          • JofArnold 22 minutes ago
            I'm thinking about switching to ChatGPT Pro also. Any idea what maxes it out before I need to pay via the API instead? For context I'm using about 1b tokens a month so likely similar to you by the sounds of things.
    • teekert 2 hours ago
      Ok Sam.

      Edit: ok, was trying to be funny but your account has some serious HN cred so I’ll take it back. Apologies.

      • iagooar 2 hours ago
        I even added a disclaimer "anecdotal evidence". Believe me, I am not the biggest fan of Sam. I just happen to like the best tools available, have used most of the large models and always choose the one that works best - for me.
    • kordlessagain 2 hours ago
      Claude needs a lot of context to keep it on rails IMHO. And, Claude Code is a closed source piece of crap agent framework.
    • dbbk 2 hours ago
      Did you use plan mode?
      • iagooar 36 minutes ago
        Yes, I did.

        I ran the test again, took Claude ~4mins this time. There was no error now with the auth, but the functionality was totally broken. It could not even find the most basic stuff that matches perfectly.

        • Szpadel 9 minutes ago
          try also adding Serena mcp and make sure you did /init
    • fluidcruft 1 hour ago
      I've tried Codex with GPT-5 a little bit and I haven't figured out how to get it to not be annoying. Codex just constantly tries to gaslight and argue with me. For example, I was debugging an OpenGL render pipeline that went black, and Codex insisted it must be because I was ssh'd into a headless server. It really makes me appreciate the Claude "You're absolutely right!"s. Anyway, as you can tell, I haven't cracked working with Codex. But at the moment it just messes everything up, and the ways I've learned to work with Claude don't seem to translate.
  • Aflynn50 1 hour ago
    When I see how much the latest models are capable of it makes me feel depressed.

    As well as potentially ruining my career in the next few years, it's turning all the minutiae and specifics of writing clean code, which I've worked hard to learn over the past years, into irrelevant details. All the specifics I thought were so important are just implementation details of the prompt.

    Maybe I've got a fairly backwards view of it, but I don't like the feeling that all that time and learning has gone to waste, and that my skillset of automating things is becoming itself more and more automated.

    • epolanski 5 minutes ago
      Those minutiae were always borderline irrelevant; the skill was always making somebody money, possibly with software.

      The reality is that more software will be pushed than before, and more of it will need to be overseen by a professional.

    • elAhmo 44 minutes ago
      Don't be so grim! This just frees you from worrying about writing clean code as much as you did in the past - you can focus on other parts of the development lifecycle.

      The skill of writing good-quality code is still going to be beneficial, maybe less emphasized on the writing side, but critical for shipping good code, even when someone (or something) else wrote it.

    • esaym 10 minutes ago
      You are not alone
    • animanoir 18 minutes ago
      [dead]
  • mchusma 0 minutes ago
    For me, Opus 4.1 was so much better than Sonnet 4.0 that I used it exclusively in Claude Code and cancelled Cursor. I'm a bit skeptical that Sonnet 4.5 will be in practice better, but will test with it and see! Hopefully we get Opus 4.5 soon.
  • rudedogg 3 hours ago
    I just ran this on a simple change I’ve asked Sonnet 4 and Opus 4.1 to make, and it fails too.

    It’s a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this change and they could.

    I worry everyone is chasing benchmarks to the detriment of general performance. Or the next-token weights for the incorrect change outweigh my simple but precise instructions. Either way it’s no good.

    Edit: With a follow-up “please do what I asked” sort of prompt it came through, while Opus just loops. So there’s that at least.

    • darksaints 3 hours ago
      > I worry everyone is chasing benchmarks to the detriment of general performance.

      I've been worried about this for a while. I feel like Claude in particular took a step back in my own subjective performance evaluation in the switch from 3.7 to 4, while the benchmark scores leaped substantially.

      To be fair, benchmarking has always been the most difficult problem to solve in this space, so it's not surprising that benchmark development isn't exactly keeping pace with all of the modeling/training development happening.

      • GoatInGrey 1 hour ago
        Not that it was better at programming, but I really miss Sonnet 3.5 for educational discussions. I've sometimes considered that what I actually miss was the improvement 3.5 delivered over other models at that time. Though since 3.7, my system message for Sonnet has primarily been instructing it to behave like a human and have a personality, so I really think we lost something.
    • MichealCodes 3 hours ago
      More like churning benchmarks... Release new model at max power, get all the benchmark glory, silently reduce model capability in the following weeks, repeat by releasing newer, smarter model.
      • zamadatix 3 hours ago
        That (thankfully) can't compound, so would never be more than a one time offset. E.g. if you report a score of 60% SWE-bench verified for new model A, dumb A down to score 50%, and report a 20% improvement over A with new model B then it's pretty obvious when your last two model blogposts say 60%.

        The only way around this is to never report on the same benchmark version twice, and they include too many benchmarks to realistically do that every release.

        • MichealCodes 3 hours ago
          The benchmarks are not typically ongoing, we do not often see comparisons between week 1 and week 8. Sprinkle a bit of training on the benchmarks in and you can ensure higher scores for the next model. A perfect scam loop to keep the people happy until they wise up.
    • Cthulhu_ 3 hours ago
      That's what I was thinking too; the models have the same data sources (they have all scraped the internet, github, book repositories, etc), they all optimize for the same standardized tests. Other than marginally better scores in those tests (and they will cherry-pick them to make them look better), how do the various competitors differentiate from each other still? What's the USP?
    • la_fayette 2 hours ago
      At this point it would be an interesting idea to collect examples, in the form of a community database, where LLMs miserably fail. I have examples myself...
    • itsoktocry 3 hours ago
      >It’s a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this change and they could.

      I don't understand why this kind of thing is useful. Do the thing yourself and move on. For every one problem like this, AI can do 10 better/faster than I can.

      • stefs 2 hours ago
        How can I trust it to do the complicated task well when it fails to do the simple thing?
        • enraged_camel 1 hour ago
          It's not about simple vs. complex. It's about the types of tasks the AI has been trained on: pattern-matching, thinking, reasoning, research.

          Tasks like linting and formatting a block of code are pretty simple, but also very specialized. You're much better off using formatters/linters than an AI.

          • ewoodrich 2 minutes ago
            An agent being tasked to respond to and resolve simple issues from a compiler/test suite/linter/etc is a pretty typical use case. It's not clear in this example if the linter was capable of auto-fixing the problem, which ordinarily is where you would hope an LLM would shine, with specific, accurate context and a known solution.
      • bobbylarrybobby 2 hours ago
        One reason is to simply say “fix all lints” and have the model do it
      • beefnugs 2 hours ago
        You dont understand how complete unreliability is a problem?

        So instead of just "doing things" you want a world where you try it ai-way, fail, then "do thing" 47 times in a row, then 3 ai-way saved you 5 minutes. Then 7 ai-way fail, then try to remember hmm did this work last time or not? ai-way fails another 3 times. "do thing" 3 times. How many ai-way failed today? oh it wasted 30% of the day and i forget which ways worked or not, i better start writing that all down. Lets call it the MAGIC TOME of incantations. oh i have to rewrite the tome again the model changed

    • BolexNOLA 2 hours ago
      > I worry everyone is chasing benchmarks to the detriment of general performance.

      I’m not sure this is entirely what you’re driving at, but the example I always think of in my head is “I want an AI agent that will scan through my 20,000 to 30,000 photos, remove all the duplicates, then organize them all in some coherent fashion.” That’s the kind of service I need right now, and it feels like something AI should be able to do, yet I have not encountered anything that remotely accomplishes this task. I’m still using Dupe Guru and depending on the ref system to not scatter my stuff all over further.

      Sidebar, if anybody has any recommendations for this, I would love to hear them lol
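
      (Aside, not an AI answer: for the exact-duplicate half, a content-hash pass is the standard non-AI baseline - a minimal sketch, assuming plain files on disk; it won't catch near-duplicates like resized or re-encoded copies, which is where Dupe Guru-style fuzzy matching comes in:)

        import hashlib
        from collections import defaultdict
        from pathlib import Path

        def find_exact_duplicates(root: str) -> list[list[Path]]:
            # Group files by SHA-256 of their contents; any group with >1 entry is a duplicate set.
            groups: dict[str, list[Path]] = defaultdict(list)
            for path in Path(root).rglob("*"):
                if path.is_file():
                    digest = hashlib.sha256(path.read_bytes()).hexdigest()  # fine for a sketch; stream in chunks for huge files
                    groups[digest].append(path)
            return [paths for paths in groups.values() if len(paths) > 1]

        # Hypothetical path; prints each set of byte-identical files.
        for dupes in find_exact_duplicates("/path/to/photos"):
            print(dupes)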

      • mh- 2 hours ago
        The reality of that specific ask is it would not be difficult to build, but I believe it would be extremely difficult to build and offer at a price that users would pay for. So you're unlikely to find a commercial offering that does that using a (V)LM.
        • BolexNOLA 1 hour ago
          Yeah I imagine so. Hell I would pay like $100 for them to just do it once. If they really could do it with like 99% accuracy I would pay upwards of $300 tbh. Still, that’s probably not good enough lol
    • PunchTornado 2 hours ago
      Downvoted because you didn’t mention the prompt and the issue.
  • yewenjie 4 hours ago
    Looking at the chart here, it seems like Sonnet 4 was already better than GPT-5-Codex on the SWE-bench Verified benchmark.

    However, my subjective personal experience was GPT-5-codex was far better at complex problems than Claude Code.

    • macawfish 4 hours ago
      GPT-5 is like the guy on the baseball team that's really good at hitting home runs but can't do basic shit in the outfield.

      It also consistently gets into drama with the other agents e.g. the other day when I told it we were switching to claude code for executing changes, after badmouthing claude's entirely reasonable and measured analysis it went ahead and decided to `git reset --hard` even after I twice pushed back on that idea.

      Whereas gemini and claude are excellent collaborators.

      When I do decide to hail mary via GPT-5, I now refer to the other agents as "another agent". But honestly the whole thing has me entirely sketched out.

      To be clear, I don't think this was intentionally encoded into GPT-5. What I really think is that OpenAI leadership simply squandered all its good energy and is now coming from behind. Its excellent talent either got demoralized or left.

      • rapind 3 hours ago
        > it went ahead and decided to `git reset --hard` even after I twice pushed back on that idea

        So this is something I've noticed with GPT (Codex). It really loves to use git. If you have it do something and then later change your mind and ask it to undo the changes it just made, there's a decent chance it's going to revert to the previous git commit, regardless of whether that includes reverting whole chunks of code it shouldn't.

        It also likes to occasionally notice changes it didn't make and decide they were unintended side effects and revert them to the last commit. Like if you made some tweaks and didn't tell it, there's a chance it will rip them out.

        Claude Code doesn't do this, or at least I never noticed it doing this. However, it has its own medley of problems, of course.

        When I work with Codex, I really lean into a git workflow. Everything is on a branch and commit often. It's not how I'd normally do things, but doesn't really cost me anything to adopt it.

        These agents have their own pseudo personalities, and I've found that fighting against it is like swimming upstream. I'm far more productive when I find a way to work "with" the model. I don't think you need a bunch of MCPs or boilerplate instructions that just fill up their context. Just adapt your workflow instead.

        • deciduously 3 hours ago
          Just to add another anecdotal data point, I've absolutely observed Claude Code doing exactly this as well with git operations.
          • rapind 3 hours ago
            I exclusively used Sonnet when I used Claude Code and never ran into this, so maybe it's an Opus thing, or I just got lucky? Definitely has happened to me a few times with Codex (which is what I'm currently using).
        • bobbylarrybobby 2 hours ago
          I've seen Sonnet undo changes I've made while it was working quite a few times. Now I just don't edit concurrently with it, and make sure to inform it of changes I've made before letting it work on its own.
      • vrosas 3 hours ago
        Why are you having a conversation with your LLM about other agents?
        • doctoboggan 3 hours ago
          I do it as well. I have a Claude code instance running in my backend repo, and one running in my frontend repo. If there is required coordination, I have the backend agent write a report for the front end agent about the new backend capabilities, or have the front end agent write a report requesting a new endpoint that would simplify the code.

          Lots of other people also follow the architect and builder pattern, where one agent architects the feature while the other agent does the actual implementation.

          • Sammi 49 minutes ago
            Sure. But at no point do you need to talk about the existence of other agents. You talk about making a plan, and you talk about implementing the plan. There's no need to talk about where the plan came from.
            • macawfish 5 minutes ago
              Because the plan involves using multiple agents with different roles and we don't want them conflicting.

              Sure there's no need to explicitly mention the agents themselves, but it also shouldn't trigger a pseudo-jealous panic with trash talk and a sudden `git reset --hard` either.

              And also ideally the agents would be aware of one another's strengths and weaknesses and actually play to them rather than sabotaging the whole effort.

        • macawfish 3 hours ago
          It's not a whole conversation it's like "hey I'm using claude code to do analysis and this is what it said" or "gemini just used its large context window to get a bird's eye view of the code and this is what it saw".
      • tux3 3 hours ago
        That's great given that the goal of OAI is to train artificial superintelligence first, hoping that the previous version of the AI will help us control the bigger AI.

        If GPT-5 is learning to fight and undo other models, we're in for a bright future. Twice as bright.

      • johnfn 3 hours ago
        That’s such a great analogy. I always say GPT is like the genius that completely lacks common sense. One of my favorite things is when I asked it why the WiFi wasn’t working, and showed it a photo of our wiring. It said that I should tell support:

        > “My media panel has a Cat6 patch panel but no visible ONT or labeled RJ45 hand-off. Please locate/activate the Ethernet hand-off for my unit and tell me which jack in the panel is the feed so I can patch it to the Living Room.”

        Really, GPT? Not just “can you set up the WiFi”??!

        • ipython 3 hours ago
          I'm curious what you would have expected it to reply given the input you provided?
          • johnfn 1 hour ago
            Er, I said it in my post, but calling support and saying “can you set up the WiFi” would have been fine.
      • renewiltord 3 hours ago
        All of these perform better if you say "a reviewer recommended" or something. The role statement provides the switch vs the implementation. You have to be careful, though. They all trust "a reviewer" strongly but they'll be more careful with "a static analysis tool".
        • prodigycorp 3 hours ago
          My favorite evaluation prompt which, I've found, tends to have the right level of skepticism is as follows (you have to tack it on to whatever idea/proposal you have):

          "..at least, that's what my junior dev is telling me. But I take his word with a grain of salt, because he was fired from a bunch of companies after only a few months on each job. So i need your principled and opinionated insight. Is this junior dev right?"

          It's the only way to get Claude to not glaze an idea while also not strike it down for no reason other than to play a role of a "critical" dev.

        • macawfish 3 hours ago
          Yeah, it's wild how the biases get encoded in there. Maybe they aren't even entirely separable from the magic of LLMs.
          • Marazan 3 hours ago
            It isn't wild, it is inherent to the very nature of large language models.

            The power of using LLMs is working out what it has encoded and how to access it.

            • macawfish 3 hours ago
              I appreciate it being wild in the sense that language is inherently a tangled mess and these tools are actually leveraging that messy complexity.
      • layer8 3 hours ago
        > "another agent"

        You could just say it’s another GPT-5 instance.

      • aaronbrethorst 3 hours ago
        Please tell me you're joking or at least exaggerating about GPT-5's behavior
        • macawfish 3 hours ago
          The only exaggeration is that the way I asked GPT-5 to leave Claude to do its thing was to say "why don't we just let claude cook"? I later checked with ChatGPT about the whole exchange and it confirmed that it was well aware of the meaning of this slang, and its first reaction was that the whole thing just sounded like a funny programmer joke, all in jest. But then I reminded it that I'd explicitly pushed back on a hard reset twice.

          To be clear, I don't believe that there was any _intention_ of malice or that the behavior was literally envious in a human sense. Moreso I think they haven't properly aligned GPT-5 to deal with cases like this.

          • nerdsniper 3 hours ago
            I strongly disagree with the personified way you interact with LLMs from a standpoint of “I’ve rarely gotten the best output from the LLM when I interact casually with them”.

            However, it’s the early days of learning this new interface, and there’s a lot to learn - certainly some amount of personification has been proven to help the LLM by giving it a “role”, so I’d only criticize the degree rather than the entire concept.

            It reminds me of the early days of search engines when everyone had a different knack for which search engine to use for what and precisely what to type to get good search results.

            Hopefully eventually we’ll all mostly figure it out.

            • macawfish 2 hours ago
              That's fair. I enjoy the playfulness of it and for me it feels almost like a video game or something, and also like I'm using my own natural language directly.

              Also appreciate your perspective. It's important to come at these things with some discipline. And moreso, bringing in a personal style of interaction invites a lot of untamed human energies into the dynamic.

              The thing is, most of the time I'm quite dry with it and they still ignore my requests really often, regardless of how explicit or dry I am. For me, that's the real takeaway here, stripping away my style of interaction.

    • cellis 4 hours ago
      Opposite for me… GPT-5-Codex on high ran out of tokens extremely quickly and didn’t adhere as well to the agents.md as Claude did to the Claude.md, perhaps because it insists on writing extremely complicated bash scripts or whole Python programs to execute what should be simple commands.
      • TrainedMonkey 4 hours ago
        Codex was a miserable experience for me until I learned to compact after every feature. Now it is a cut above CC, although the latter still has an edge at TODO scaffolding and planning.
        • rapind 3 hours ago
          I don't even compact, I just start from scratch whenever I get down below 40%, if I can. I've found Codex can get back up to speed pretty well.

          I like to have it come up with a detailed plan in a markdown doc, work on a branch, and commit often. Seems not to have any issues getting back on task.

          Obviously subjective take based on the work I'm doing, but I found context management to be way worse with Claude Code. In fact I felt like context management was taking up half of my time with CC and hated that. Like I was always worried about it, so it was taking up space in my brain. I never got a chance to play with CC's new 1m context though, so that might be a thing of the past.

        • furyofantares 3 hours ago
          /new (codex) or /clear (claude code) are much better than compact after every feature, but of course if there is context you need to retain you should put it (or have the agent put it) in either claude/agents.md or a work log file or some other file.

          /compact is helping you by reducing crap in your context but you can go further. And try to watch % context remaining and not go below 50% if possible - learn to choose tasks that don't require an amount of context the models can't handle very well.

        • oigursh 3 hours ago
          Compact?
          • all2 3 hours ago
            /compress or something like that, basically taking the context and summarizing it.
            • enraged_camel 1 hour ago
              Cursor does this automatically, although I wish there was a command for it as well. All AIs start shitting the bed once their context goes above 80% or so.
      • renewiltord 3 hours ago
        gpt-5 command line use is bizarre. It always writes extraordinarily complicated pipelines that Claude instead just writes simple commands for.

        My use case does better with the latter, because frequently the agent fails to do things and then can't look back at intermediate output.

        E.g. Command | Complicated Grep | Complicated Sed

        is way worse than the multistep

        Command > tmpfile

        and then grep etc., because the latter can reuse tmpfile if the grep is wrong.

    • jasonsb 4 hours ago
      My subjective personal experience is the exact opposite of yours, GPT-5-codex is super slow and the results are mediocre at best. I would probably stop using AI for coding if I was forced to use GPT-5-codex.
      • lordnacho 4 hours ago
        I'm on your side.

        I find there's a quite large spread in ability between various models. Claude models seem to work superbly for me, though I'm not sure whether that's just a quirk of what my projects look like.

        • jasonsb 3 hours ago
          I don’t think it’s just a quirk. I’ve tested Claude across Java, Python, TypeScript and several other projects. The results are consistent, regardless of language or project structure, though it definitely performs better with smaller codebases. For larger ones, it really helps if you’re familiar with the project architecture and can guide it to the right files or modules, that saves a lot of time.
      • llmslave 3 hours ago
        You need to give it clear instructions on what to implement
    • AnotherGoodName 4 hours ago
      I always wonder how absolute a given model's performance is. Sometimes I ask for Claude-Opus and the responses I get back are worse than the lowest-end models of other assistants. Other times it surprises me and is clearly best in class.

      Sometimes in between this variability of performance it pops up a little survey: "How's Claude doing this session from 1-5? 5 being great." and I suspect I'm in some experiment of extremely low performance. I'm actually at the point where I get the feeling peak-hour weekdays are terrible and odd-hour weekends are great, even when forcing a specific model.

      While there is some non-determinism, it really does feel like performance is actually quite variable. It would make sense that they scale up and down depending on utilization, right? There was a post a week ago from Anthropic acknowledging terrible model performance in parts of August due to an experiment. Perhaps also at peak hour GPT has more datacenter capacity and doesn't get degraded as badly? No idea for sure, but it is frustrating when simple asks fail and complex asks succeed without it being clear to me why that may be.

      • steveklabnik 3 hours ago
        > It would make sense they scale up and down depending on utilization right?

        It would, but

        > To state it plainly: We never reduce model quality due to demand, time of day, or server load.

        https://www.anthropic.com/engineering/a-postmortem-of-three-...

        If you believe them or not is another matter, but that's what they themselves say.

        • transcriptase 3 hours ago
          Well knowing the state of the tech industry they probably have a different, legal-team approved definition of “reducing model quality” than face value.

          After all, using a different context window, subbing in a differently quantized model, throttling response length, rate limiting features aren’t technically “reducing model quality”.

      • richwater 4 hours ago
        They absolutely mess with it
      • kruxigt 4 hours ago
        [dead]
    • CuriouslyC 4 hours ago
      The Anthropic models have been vibe-coding tuned. They're beasts at simple python/ts programs, but they definitely fall apart with scientific/difficult code and large codebases. I don't expect that to change with the new Sonnet.
      • epolanski 3 minutes ago
        They are very good with C too, but it helps that there are gazillions of lines of C out there.
      • patates 4 hours ago
        In my experience Gemini 2.5 Pro is the star when it comes to complex codebases. Give it a single xml from repomix and make sure to use the one at the aistudio.
        • garciasn 3 hours ago
          In my experience, G2.5P can handle so much more context and gives an awesome execution plan, which CC then implements so much better than anything G2.5P would write itself. So I give G2.5P the relevant code and data underneath and ask it to develop an execution plan, and then I feed that result to CC to do the actual code writing.

          This has been outstanding for what I have been developing AI assisted as of late.

          • XenophileJKO 2 hours ago
            I would believe this. In regular conversational use with the Gemini family of models, I've noticed they regularly have issues with context blending, i.e. confusing what you said versus what they said, and causality.

            I would think this would manifest as poor plan execution. I personally haven't used Gemini on coding tasks primarily based on my conversational experience with them.

        • CuriouslyC 3 hours ago
          Yup. In fact every deep research tool on the market is just a wrapper for gemini, their "secret sauce" is just how they partition/pack the codebase to feed it into gemini.
        • Workaccount2 3 hours ago
          It's mostly because it is so damn good with long contexts. It can stay on the ball even at 150k, whereas other models really wilt around 50-75k.
        • jjani 3 hours ago
          Gemini 2.5 Pro = Long context king, image input king

          GPT-5 = Overengineering/complexity/"enterprise" king

          Claude = "Get straightforward shit done efficiently" king

          • CuriouslyC 1 hour ago
            On the plus side, GPT5 is very malleable, so you CAN prompt it away from that, whereas it's very hard to prompt Claude into producing hard code: even with a nearly file-by-file breakdown of a task, it'll occasionally run into an obstacle and just give up and make a mock or stub implementation, basically diverging from the entire plan, then doing its own version.
      • sixothree 1 hour ago
        You definitely need some context management like Serena.
        • CuriouslyC 1 hour ago
          Even with Serena and detailed plans crafted by Gemini that lay out file-by-file changes, Claude will sometimes go off the rails. Claude is very task-completion driven, and it's willing to relax the constraints of the task in order to complete it in the face of even slight adversity. I can't tell you the number of times I've had Claude try to install a Python computational library, get an error, then either try to hand-roll the algorithm (in PYTHON) or just return a hard-coded or mock result. The worst part is that Claude will tell you that it completed the task as instructed in the final summary; Claude lying is a meme for a reason.
    • esafak 4 hours ago
      I'm only a week into testing, but so far codex has been slow and the cli is worse than claude code. I intend to return to Claude.
    • ambyra 4 hours ago
      For unity gamedev code reviews, I much preferred the gpt5 code. Claude gave me a bunch of bad recommendations for code changes, and also an incorrect formula for completion percentage.
    • mentos 4 hours ago
      Curious how you find ChatGPT5 compared to ChatGPT5-Codex?
    • jjcm 4 hours ago
      How long have you had early access for?
    • llmslave 3 hours ago
      Gpt5 codex is incredible, far ahead of all the other models for implementing code.
    • yunohn 4 hours ago
      Well, they seem to benchmark better only when the model is given "parallel test time compute", which AFAIU is just reasoning enabled? Whereas the GPT5 numbers are not specified to have any reasoning mode enabled.
  • rtp4me 4 minutes ago
    Just updated to Sonnet 4.5 and Claude Code 2.0 this afternoon. I worked on a quick project (creating PXE bootable files) using the updates and have to say, this new version seems much faster and more accurate than before. I did not go round-and-round trying to get good output and Claude did not go down rabbit holes like before. So far, so good.
  • Bjorkbat 3 hours ago
    > Practically speaking, we’ve observed it maintaining focus for more than 30 hours on complex, multi-step tasks.

    Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release, and it doesn't show up at all in their system card. It's only through an article on The Verge that we get more context. Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code (https://www.theverge.com/ai-artificial-intelligence/787524/a...)

    I have very low expectations around what would happen if you took an LLM and let it run unattended for 30 hours on a task, so I have a lot of questions as to the quality of the output

    • sigmoid10 2 hours ago
      This is obviously much more than just taking an LLM and letting it run for 30 hours. You have to build a whole environment together with external tool integration and context management, and then tune the prompts and perhaps even set up a multi-agent system. I believe that if someone puts a ton of work into this you can have an LLM run for that long and still produce sellable outputs, but let's not pretend this is something that average devs can do by buying some API tokens and kicking off a frontier model.
      • Philpax 2 hours ago
        Well, yes, that's Claude Code. And OpenAI Codex. And Google Gemini CLI.

        Your average dev can just use those.

    • gapeslape 2 hours ago
      “30 hours of unattended work” is totally vague and it doesn’t mean anything on its own. It - at the very least - highly depends on the amount of tokens you were able to process.

      Just to illustrate, say you are running on a slow machine that outputs 1 token per hour. At that speed you would produce approximately one sentence.

      • mh- 2 hours ago
        Has their comment been edited? A few words later it says it resulted in 11,000 LoC.

        > [..] left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code [..]

        • throwaway0123_5 51 minutes ago
          Their point still stands though? They said the 1 tok/hr example was illustrative only. 11,000 LoC could be generated line-by-line in one shot, taking not much more than 11,000 * avg_tokens_per_line tokens. Or the model could be embedded in an agent and spend a million tokens contemplating every line.
    • technocrat8080 2 hours ago
      Curious about this too – does it use the standard context management tools that ship with Claude Code? At 200K context size (or 1M for the beta version), I'm really interested in the techniques used to run it for 30 hours.
  • lexarflash8g 2 hours ago
    Just tested this on a rather simple issue. Basically it falls into rabbit holes just like the other models and tries to brute-force fixes through overengineering and trial and error. It also says "your job should now pass" maybe after 10 prompts of doing roughly the same thing, stuck in a thought loop.

    A GH actions pipeline was failing due to a CI job not having any source code files -- error was "No build system detected". Using Cursor agent with Sonnet 4.5, it would try to put dummy .JSON files and set parameters in the workflow YAML file to false, and even set parameters that don't exist. Simple solution was to just override the logic in the step to "Hello world" to get the job to pass.

    I don't understand why the models are so bad with simple, outside-the-box solutions. It's like a 170 IQ savant who can't even ride public transportation.

  • bradley13 3 hours ago
    I need to try Claude - haven't gotten to it.

    I use AI for different things, though, including proofreading posts on political topics. I have run into situations where ChatGPT just freezes and refuses. Example: discussing the recent rape case involving a 12-year-old in Austria. I assume its guardrails detect "sex + kid" and give a hard "no" regardless of the actual context or content.

    That is unacceptable.

    That's like your word processor refusing to let you write about sensitive topics. It's a tool, it doesn't get to make that choice.

    • Implicated 3 hours ago
      I'd imagine that the proportion of conversations they're intending to block, relative to "legit" conversations around these topics, is large enough that it doesn't make sense for them to even entertain the idea of supporting those conversations.

      As a rather hilarious and really annoying related issue - I have a real use where the application I'm working on is partially monitoring/analyzing the bloodlines of some rather specific/ancient mammals used in competition and... well.. it doesn't like terms like "breeders" and "breeding"

    • jjordan 2 hours ago
      This is why eventually, the AI with the fewest guardrails will win. Grok is currently the most unguarded of the frontier models, but it could still use some work on unbiased responses.
      • sixothree 1 hour ago
        I can't imagine myself sharing my code or workspace documents with X. Never mind the moral implications of just using their products.
      • beefnugs 2 hours ago
        Still has to be a local model too.

        Arbitrary government censorship on top of arbitrary corporate censorship is a hell no for me forever into the future

        • drak0n1c 2 hours ago
          For what you're looking for, VeniceAI is focused entirely on privacy and making their models uncensored, even if it's not local. They IP-block censorious jurisdictions like the UK rather than comply.
          • jjordan 1 hour ago
            VeniceAI is great, and my go-to for running open source models. Sadly they appear to have given up providing leading coding models, making it of limited use to me.
    • BeetleB 2 hours ago
      > It's a tool, it doesn't get to make that choice.

      It's a service, not a tool. If you want a tool, run some local LLM.

    • mfld 2 hours ago
      I can relate. I recently used ChatGPT/DALL-E to create several images for birthday coupons for my daughter - i.e. a girl doing different activities. She likes manga, so this was the intended styling. 3/4 of the time was spent working around various content policies.
    • MIC132 1 hour ago
      In my (admittedly very limited) experience with trying to talk about "controversial" topics, Claude seems to be much stricter about shutting down the conversation fast.
  • nickstinemates 2 hours ago
    I gave it a quick spin with System Initiative[1]. The combination solved a 503 error in our infrastructure in 15 minutes that took over 2 hours to debug manually.

    It's pretty good! I wrote about a few other use cases on my blog[2]

    1: https://systeminit.com 2: https://keeb.dev/2025/09/29/claude-sonnet-4.5-system-initiat...

  • cloverich 3 hours ago
    Please y'all, when you list supportive or critical complaints based on your actual work, include some specifics of the task and prompt. Like actual prompt, actual bugs, actual feature, etc. I've had great success with both ChatGPT and Claude for years, am at around a 3x sustained output increase in my professional work, and am kicking off and finishing new side projects / features that I used to simply not ever finish. BUT there are some tasks I run into where it's god awful. Because I have enough good experience, I know how to work around, when to give up, when to move on, etc. I am still surprised at things it cannot do, for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI (I am not an iOS dev). IMHO for people using it off and on or sparingly, it's going to seem either incredible or worthless depending on your project and prompt. Share details, it's so helpful for meaningful conversation!
    • Mathiciann 3 hours ago
      I am almost convinced your comment is parody but I am not entirely sure.

      You want proof for critical/supportive comments? Then almost in the same sentence you make an insane claim without backing it up with any evidence.

      • cloverich 2 hours ago
        > You want proof for critical/supportive criticism? Then almost in the same sentence you make an insane claim without backing this up by any evidence.

        Nearly every critical reply to my comment bases that criticism on the lack of examples and details I included for my claim which is the very thing I am suggesting we do (i.e. they are, ironically, agreeing with me?). I am sorry I thought that intentional bit of irony would help make the point rather than derail the request.

      • stavros 3 hours ago
        Well, here's an even more insane claim: I'm infinity times more productive, as I just wouldn't even start projects without the LLM to sidestep my ADHD. Then, when the LLM invariably fucks up, I step in and finish things myself!

        Here are a few projects that I made these past few months that wouldn't have been possible without LLMs:

        * https://github.com/skorokithakis/dracula - A simple blood test viewer.

        * https://www.askhuxley.com - A general helper/secretary/agent.

        * https://www.writelucid.cc - A business document/spec writing tool I'm working on, it asks you questions one at a time, writes a document, then critiques the idea to help you strengthen it.

        * A rotary phone that's a USB headset and closes your meeting when you hang up the phone, complete with the rotary dial actually typing in numbers.

        * Made some long-overdue updates on my pastebin, https://www.pastery.net, to improve general functionality.

        * https://github.com/skorokithakis/support-email-bot - A customer support bot to answer general questions about my projects to save me time on the easy stuff, works great.

        * https://github.com/skorokithakis/justone - A static HTML page for the board game Just One, so you can play with your friends when you're physically together, without needing to bring the game along.

        * https://github.com/skorokithakis/dox - A thing to run Dockerized CLI programs as if they weren't Dockerized.

        I'm probably forgetting a lot more, but I honestly wouldn't have been bothered to start any of the above if not for LLMs, as I'm too old to code but not too old to make stuff.

        EDIT: dang can we please get a bit better Markdown support? At least being able to make lists would be good!

        • Mathiciann 2 hours ago
          Well done, some of these projects look cool.

          Although I was just commenting on the irony of the parent comment.

          • stavros 2 hours ago
            What was the irony? I thought you were referring to the "3x speed" part as the insane statement.
            • AppleBananaPie 2 hours ago
              To me it seems like an arbitrary number that I'm not even sure how someone could accurately measure it but maybe I've missed something :)
              • stavros 2 hours ago
                I took it as a ballpark of how the GP felt, rather than a rigorous measurement.
                • GoatInGrey 1 hour ago
                  Which would be ironic, as LLM usage has been observed to increase the sensation of productivity even when productivity is measurably reduced. Not to mention the "vibe" component of vibe coding.
        • FlyingSnake 2 hours ago
          > dang can we please get a bit better Markdown support?

          Great use case for an LLM to make these changes as HN is open source. It’ll also tell us if LLMs can go beyond JS slop.

        • lisbbb 3 hours ago
          Did you make any money off any of that or was it all just labors of love type of stuff? I'm enjoying woodworking...
          • stavros 2 hours ago
            This is all my hobby, for my job I use Claude in a way that doesn't involve code, but is still very useful. It's basically what inspired Lucid, above, when I realized I find coming up with solutions very easy, but find explaining them very hard, because I assume the other person knows too much and I don't elaborate enough.

            LLMs are a great rubber duck, plus they can write the document for you at the end.

        • emp17344 3 hours ago
          > I'm infinity times more productive, as I just wouldn't even start projects without the LLM to sidestep my ADHD.

          1 is not infinitely greater than 0.

          • sebastiennight 2 hours ago
            It... literally is?

            Or otherwise, can you share what you think the ratio is?

            • emp17344 2 hours ago
              No, 1 is 1 more than 0. There’s a certain sense in which you could say that 1 is infinitely greater than 0, but only in an abstract, unquantifiable way. In this case, it doesn’t make sense to say you’re “infinitely more productive” because you’re producing something rather than nothing.
              • jama211 2 hours ago
                I just read it as a turn of phrase that says exactly that, that it means they produce something rather than nothing.
            • Fraterkes 2 hours ago
              I think it's a pedantic point, but maybe they just meant that talking about 1 being multitudes greater than 0 implies multiplication. And since 1/0 is undefined that doesn't make much sense.
            • inopinatus 2 hours ago
              Someone attributing all of their productivity to a given tool and none to their own ingenuity and experience is allocating 100% credit to that tool.

              It is not a ratio, it is a proportion.

          • rvz 2 hours ago
            Also, not invented here syndrome (NIH) is cool again.

            Given that most of the 'vibe-coded' projects I have seen are worse versions of software that has been tested and has stood the test of time.

    • mbesto 3 hours ago
      > include some specifics of the task and prompt. Like actual prompt, actual bugs, actual feature, etc.

      > I am still surprised at things it cannot do, for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI (I am not an iOS dev).

      You made a critical comment yet didn't follow your own rules lol.

      > it's so helpful for meaningful conversation!

      How so?

      FWIW - I too have used LLMs for both coding and personal prompting. I think the general conclusion is that when it works, it works well, but when it fails it can fail miserably and be disastrous. I've come to this conclusion because I read people complaining here and through my own experience.

      Here's the problem:

      - It's not valuable for me to print out my whole prompt sequence (and context for that matter) in a message board. The effort is boundless and the return is minimal.

      - LLMs should just work(TM). The fact that they can fail so spectacularly is a glaring issue. These aren't just bugs, they are foundational because LLMs by their nature are probabilistic and not deterministic. Which means providing specific defect criteria has limited value.

      • cloverich 2 hours ago
        > How so?

        Sure. Another article was posted today[1] on the subject. An example claim:

        > If we asked the AI to solve a task that was already partially solved, it would just replicate code all over the project. We’d end up with three different card components. Yes, this is where reviews are important, but it’s very tiring to tell the AI for the nth time that we already have a Text component with defined sizes and colors. Adding this information to the guidelines didn’t work BTW.

        This is helpful framing. I would say to this: I have also noticed this pattern. I have seen two approaches help. One, I break up UI / backend tasks. At the end of UI tasks, and sometimes before I even look at the code, I say: "Have you reviewed your code against the existing components library <link to doc>?" and sometimes "Have you reviewed the written code compared to existing patterns and can you identify opportunities for abstraction?" (I use plan mode for the latter, and review what it says). The other approach which I have seen others try, but have not myself (but it makes sense), is to automatically do this with a sub agent or hook. At a high level it seems like a good approach given I am manually doing the same thing now.

        [1]: https://antropia.studio/blog/to-ai-or-not-to-ai/

    • bartread 3 hours ago
      I had a complete shocker with all of Claude, GitHub Copilot, and ChatGPT when trying to prototype an iOS app in Swift around 12 months ago. They would all really struggle to generate anything usable, and making any progress was incredibly slow due to all the problems I was running into.

      This was in stark contrast to my experience with TypeScript/NextJS, Python, and C#. Most of the time output quality for these was at least usefully good. Occasionally you’d get stuck in a tarpit of bullshit/hallucination around anything very new that hadn’t been in the training dataset for the model release you were using.

      My take: there simply isn’t the community, thought leadership, and sheer volume of content around Swift that there is around these other languages. This means both lower quantity and lower quality of training data for Swift as compared to these other languages.

      And that, unfortunately, plays negatively into the quality of LLM output for app development in Swift.

      (Anyone who knows better, feel free to shoot me down.)

      • simonh 2 hours ago
        Going from past discussions, there seem to be two issues there. One is that Swift has changed massively since it came out and huge swathes of examples and articles and such online, that LLMs are trained on, are out of date and thus pollute the training set.

        Another issue is that Apple developer docs are largely sequestered behind JavaScript that makes them hard for scrapers to parse.

        At least, those are the two explanations I’ve seen that seem plausible.

      • fnordsensei 2 hours ago
        As someone who gets useful Clojure out of Claude quite consistently, I’m not sure that volume is the only reason for output quality.
      • resters 2 hours ago
        I think what you are saying is true for CLI-only development using Swift. It is possible, but LLMs often get the commands wrong or don't realize how to accomplish something. There have been a number of times when claude/codex has told me I have to edit a plist manually in XCode before progress can continue.
      • enjo 2 hours ago
        This is more or less my experience with Go right now.

        For a bunch of reasons I want to avoid the standard React, Typescript, and Node stack but the sheer velocity that might enable from the LLM side might make it worth it.

    • danieloj 3 hours ago
      Could you share the actual examples of where you’re seeing the 3x output increase?
      • cloverich 2 hours ago
        Sure. This is an internal web app that uses react on the front end and rails on the back end. Typical examples I see LLM success with are writing and wiring up routes/controllers/models, writing specs for those, abstracting components, writing front-end vitest/storybook entries. A typical request (filenames and such redacted) is like: "We recently added <link to model>. We refactored our approach for <goal> to <link to different model file>. We need to refactor <A> to be like <B> in these ways. Do that, then update the spec to match the pattern in <file Y>. Run rspec and rubocop when done, and address any issues". I then either wait or go do something else, then review the code and either ask for follow up, or fix minor issues. Sometimes it follows the wrong pattern and I ask it to adjust, or simply git checkout -- and say try again you did Y wrong.

        Roughly speaking that is how I think through my work, and when I get to the point of actually writing the code having most of the plan (context) in my head, I simply copy that context to the LLM then go to do something else. I only do this if I believe the LLM can do it effectively, so some tasks I do not ask for help at all on (IMHO this is important).

        I also have it help with scripts, especially scripts that munge and summarize data. I know SQL very, very well, but find it still a bit faster to prompt the LLM if it has the schema on hand.

        Do you find ^ helpful? i.e does that match how you prompt and if not, in what ways does it differ? If it does, in what ways do you get different results and at what step?

      • alfalfasprout 3 hours ago
        right? The irony is so thick you could cut it with a butter knife
    • emil-lp 3 hours ago
      How do you measure 3x sustained output increase?

      Is it number of lines? Tickets closed? PRs opened or merged? Number of happy customers?

      • hshshshshsh 3 hours ago
        All these are useless metrics. They don't say anything meaningful about the quality of your life. I would be more interested in knowing if he can now retire in the next 5 years instead of waiting another 15.

        Or does he now just get to work for 2 hours and enjoy the remaining 6 hours doing meaningful things apart from staring at a screen?

        • simonh 2 hours ago
          Not everyone hates their job and gets no satisfaction from it. Some of us relish doing something useful and getting paid for it.
          • hshshshshsh 2 hours ago
            Sure. I don't doubt it. But let's say I could make 100 million pounds appear in your bank account tomorrow. Would you say no to it and go back to your day job?
      • cloverich 1 hour ago
        Merged PRs. We typically plan out our work, break it up into e.g. JIRA tasks, then when we create PRs _very generally_ they should be tied to actual JIRA tickets, i.e. pre-planned work. A ticket is usually a requested feature or bug (as reported by an actual user). So my PR rate, or perhaps less controversially my JIRA close rate, is around 3x higher for the last few months. That's also reflected more generally in my feedback productivity-wise (i.e. from people that are looking at the project as a whole rather than e.g. how many commits I've made).

        I exclude from the 3x side projects and CLI tools, which are weird to quantify - they are typically things that would usually have been ideas in my head I never did at all. I guess I also generally exclude refactoring, although I do that more. For example, I had Claude fix a bug that was dogging our TypeScript compilation. I couldn't figure out what was so slow about it (>60s to compile). Turned out it was a specific recursive type pulled in by a specific version of a library, mixed with usage from one file! It actually took it a while to figure out; it kept proposing solutions and I had to redirect it a bunch, using mostly just intuition as opposed to experience, e.g. "No, re-run the diagnostics and look at the debug output, give me three examples of areas / commands you could look at and how", and then I'd pick one.

        I just did that task on the side. I'd go back and look at its output once every day or two, then prompt it with something else, then just go do my usual tasks as though that didn't exist. That type of work, given our pace / deadlines / etc., might never have gotten done, at least not anytime soon. But I do stuff like that all the time now, I just don't often measure it.

        Is that helpful?

      • inopinatus 3 hours ago
        It is undoubtedly 3x as many bugs.
        • _alternator_ 3 hours ago
          This would be a win. Professionals make about 1 bug for every 100 loc. If you get 3x the code with 3x the bugs, this is the definition of scaling yourself.
      • lottin 1 hour ago
        I think it's just a meaningless sentence.
      • senordevnyc 3 hours ago
        Oh good, a new discussion point that we haven't heard 1000x on here.

        Have you heard of that study that shows AI actually makes developers less productive, but they think it makes them more productive??

        EDIT: sorry all, I was being sarcastic in the above, which isn't ideal. Just annoyed because that "study" was catnip to people who already hated AI, and they (over-) cite it constantly as "evidence" supporting their preexisting bias against AI.

        • rapind 3 hours ago
          > Have you heard of that study that shows AI actually makes developers less productive, but they think it makes them more productive??

          Have you looked into that study? There's a lot wrong with it, and it's been discussed ad nauseam.

          Also, what a great catch 22, where we can't trust our own experiences! In fact, I just did a study and my findings are that everyone would be happier if they each sent me $100. What's crazy is that those who thought it wouldn't make them happier, did in fact end up happier, so ignore those naysayers!

    • Jonovono 2 hours ago
      I had Claude Code build a fairly complex SwiftUI app (5+ screens), using Firebase AI Logic and other packages. On the first prompt it got pretty much the foundation for the entire thing set up, then over the next day I got it working exactly like I wanted. The thing that took the longest was getting through app review. I was impressed how well it knew SwiftUI and Swift composable architecture.
      • cloverich 2 hours ago
        For my iOS project, I am super curious to what extent it is my lack of Swift knowledge and, e.g., how well I can prompt. Because 80% of what I usually ask the LLM to do, I know how to do myself quite well. iOS is the first time I've been coding with something I do not know how to do well; I often can barely read the code (of course that is changing rapidly now). E.g. from a recent session:

        > What is the idiom for testing the launch screen on the simulator like.. I don't see anything? How do I know if its there.

        i.e. in iOS / Swift, I don't even know if I'm using the right terms for the code I am trying to interrogate, or in some cases even what the thing is!

        • Jonovono 2 hours ago
          I have done lots of SwiftUI before, so it may have helped me recognize when it goes off the rails. But I definitely don't do anything fancy with my prompting.

          But for stuff like TCA (Swift composable architecture), I basically created a TCA.md file and pasted in a bunch of docs and examples and would reference that.

          But for the most part, it was one shotting swiftui screens that were nicer than what I had in my mind.

          • cpursley 1 hour ago
            Apple store link (I believe you, just am curious)? I'm toying with the idea of "vibing" a real Swift app instead of messing with the React Native toolchain.
    • boogieknite 3 hours ago
      > for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI

      have you tried in the new xcode extension? that tool is surprisingly good in my limited use. one of the few times xcode has impressed me in my 2 years of use. read some anecdotes that claude in the xcode tool is more accurate than standard claude code for Swift. i haven't noticed that myself but only used the xcode tool twice so far

    • dirkc 3 hours ago
      Would you say you do things you'd normally do 3 times faster? Or does it help you move past the things you'd get stuck on or avoid in the past, resulting in an overall 3x speedup?
      • cloverich 1 hour ago
        Things I'd normally do 3x faster. That 3x is me focusing explicitly on the precise things I did before - the PR rate on a specific work project - because I tie those PR's back to specific tasks the same as I did before I used claude code. I haven't looked at lines of code, total commits, etc. Qualitatively I write more tests and abstract more components than I used to, but those get lumped in to the PRs as I normally try to limit pure refactoring work, and instead tie it into ticketed feature requests or bugs.

        I don't count the things I'm doing now that I would have avoided or never finished in the past. For those, of course to me personally those are worth much more psychologically than 3x, but who knows if it's an actual boost. I.e. I took a partially scripted task the other day and fully automated it, and also had it output to the CLI in a kind of dorky sci-fi way because it makes it fun to run it. It didn't take long - 30 minutes? But I certainly didn't _gain_ time doing that, just a little more satisfaction. TBH I'm surprised 3x is so controversial, I thought it was a really cool and far more practical assessment than some of these 10x claims I'm seeing.

    • raincole 2 hours ago
      I agree. I think we can start with cloverich including some specifics of the task and prompt.
    • mpern 3 hours ago
      Would you be so kind to lead by example?

      What are the specific tasks + prompts giving you an 3x increased output, and conversely, what tasks don't work at all?

      After an admittedly cursory scan of your blog and the repos in your GH account I don't find anything in this direction.

      • cloverich 1 hour ago
        Oh, 3x at work. I shared some details on the methodology; it's the PR rate for ticketed features / bugs (so e.g. closed tickets as opposed to commits, loc, etc). For prompts and tasks, I'm happy to share (redacted as needed; check comment threads) if you want more details, presuming this is a genuine request? Here's a few example prompts (I can't paste exactly obviously, but I can approximate):

            - "Rails / sidekiq: <x file> uses sidekiq batches. <y file> does it. Refactor your to use pattern in <x file> Match spec in <z file> then run rspec and rubocop"
            - "Typescript / react. <x file>. Why is typescript compilation a bottle neck int his file. Use debugger to provide definitive evidence. Cast type to any and run script and time it; write a script to measure timing if needed. Iteratively work from type `any` to a real type and measure timing at each step. Summarize results"
            - "I redefine <FormComponent> in five places. Find them all. Identify the shared patterns. Make new component in <x location>. Refactor each to to use new component. Run yarn lint and fix any ts issues when done"
            - "<file y>: more idiomatic" (it knows my preferences)
        
        Side projects and such I have no idea, and (as you noted) I do those quite infrequently anyways! Actually, come to think of it... outside of the toy iOS work I did last week, I've not actually worked on my side projects since getting into Claude Code / Cursor agents. For work stuff, I guess other metrics I'd be interested in are total messages sent per task. I do sometimes look at $ per task (but for me anyways, that's so wildly in my favor I don't think it's worth it).
    • AnotherGoodName 2 hours ago
      Definitely an overall positive with the negatives actually being kind of hilarious and no big deal which I'll also discuss.

      I can only list my open source outputs concretely for obvious reasons, but https://github.com/rubberduckmaths/reddit_terraforming_mars_... was a near one-shot. It's a Reddit bot that posts card text to the Terraforming Mars subreddit when asked, which is helpful for context on discussions of that board game. Appreciated and used a lot by the community there. There's a similar project I used AI for to scrape card text that was also a near one-shot. I'd say for these two hobby projects 50x productivity is a reasonable statement. I wrote Reddit bots ~10 years ago without coding assistance - https://github.com/AReallyGoodName/xwingminibot - so I get to reasonably compare two very similar projects. I think it's totally fair for me to say 50x for this example. The Reddit API even changed completely in that time, so no one can really say "you used past experience to move faster, it's not the AI giving a 50x boost" - but I really didn't. My memory is not that good, except I do remember it taking an entire weekend previously vs <30 mins total now to one-shot some pretty cool projects.

      As for the negatives, they are never serious. A couple of good examples:

      "Please correct all lint errors in this project" only to have @lintignore added to all files. Lol! Obviously i just more clearly specified the prompt and it's not like it's hard to catch these things and not ship to prod. It was funny to everyone i showed and no big deal.

      Another similar case: "please make the logging of this file less verbose, especially around the tight loop on line X". Instead of changing the log level or removing some of the log statements, the AI redirected stdout at the initialization of the command line program (which would completely break it, of course). Again hilarious but also no big deal. Not even much of a waste of time, since you just change the prompt and run again, and honestly a few silly diversions like this now and then are kind of fun. The point being, the comments of "OMG AI sometimes gets it wrong" aren't at all serious. I have version control, I review code. No big deal.

      I too eye-roll massively at some of the criticisms at this point. It's like people are stretching to claim everyone who's using a coding assistant is a newb who's throwing everything into prod and deleting databases etc. That's just not reality.

    • vkou 2 hours ago
      > based on your actual work, include some specifics of the task and prompt.

      Can't show prompts and actual, real work, because, well, it's confidential, and I'd like to get a paycheck instead of a court summons sometime in the next two weeks.

      Generally, 'I can't show you the details of my work' isn't a barrier in communicating about tech, because you can generalize and strip out the proprietary bits, but because LLM behavior is incredibly idiosyncratic, by the time you do that, you're no longer accurately communicating the problem that you're having.

    • asdev 3 hours ago
      this is a great copypasta
      • rightbyte 2 hours ago
        I was thinking the same. Way too perfect to not be spammed around forever.
    • senordevnyc 3 hours ago
      HN is such a negative and cynical place these days that it's just not worth it. I just don't have the patience to hear yet another anti-AI rant, or have someone who is ideologically opposed to AI nitpick its output. Like you, I've found AI to be a huge help for my work, and I'm happy to keep outcompeting the people who are too stubborn to approach it with an open mind.
      • la_fayette 3 hours ago
        I think HN might be one of the few communities where people have been running extensive experiments with LLMs since their inception. Most here take a realistic view of their capabilities. There are certainly proven use cases where LLMs provide clear productivity gains—for example, copying an error message and retrieving potential solutions. At the same time, many recognize that marketing fantasies, such as the idea of having a "PhD in your pocket," are far beyond what this technology can deliver.
      • catigula 3 hours ago
        It's not so much that they're being negative it's that you can't see that you're an Ouroboros consuming your own tail and they can. Skill issue as they say.
        • rapind 3 hours ago
          > you can't see that you're an Ouroboros consuming your own tail and they can

          Hey, so if I DO see it, can I stop it from happening?

        • senordevnyc 3 hours ago
          OK. Well, I've been doing this the hard way for about twenty years, and now with AI in the mix my little solo SaaS has gone from nothing to $5k MRR in six weeks. Guess I'm not holding it completely wrong?
        • scrollaway 3 hours ago
          You are making assumptions about someone you have never talked to in the past, and don't know anything about.

          Of the two of you, I know which one I'd bet on being "right". (Hint: It's the one talking about their own experience, not the one supplanting theirs onto someone else)

          • catigula 3 hours ago
            What assumptions am I making? Aren't you making assumptions about what I'm saying? It appears your assumptions are extremely egregious because they're blatantly and even comically hypocritical.

            To that poster:

            Literally everyone in development is using AI.

            The difference is "negative" people can clearly see that it's on a trajectory in the NEAR, not even distant, future to completely eat your earnings, so they're not thrilled.

            You're in the forest and you're going "Wow, look at all these trees! Cool!"

            The hubris is thinking that you're a permanent indispensable part of the loop.

            • scrollaway 3 hours ago
              > The difference is "negative" people can clearly see that it's on a trajectory in the NEAR, not even distant, future to completely eat your earnings, so they're not thrilled.

              We birthed a level of cognition out of silicon that nobody would imagine even just four years ago. Sorry, but some brogrammers being worried about making ends meet is making me laugh - it's all the same people who have been automating everyone else's jobs for the past two decades (and getting paid extremely fat salaries for it), and you're telling me now we're all supposed to be worried because it's going to affect our salaries?

              Come on. You think everyone who's "vibe coding" doesn't understand the pointlessness of 90% of codemonkey work? Hell, most smart engineers understood that pointlessness years ago. Most coders work on boring CRUD apps and REST APIs to make revenue go up 0.02%. And those that aren't, are probably working on ads.

              It's a fraction of a fraction that is at all working on interesting things.

              Personally, yeah, I saw it coming and instead of "accepting fate", I created an AI research lab. And I diversified the hell out of my skillset as well - started working way out of my comfort zone. If you want to keep up with changing times, start challenging yourself.

            • senordevnyc 3 hours ago
              We are reading very different "negative" comments here.

              Most of the anti-AI comments I see on HN are NOT a version of "the problem with AI is that it's so good it's going to replace me!"

      • kelsey98765431 3 hours ago
        All major nation-state intelligence services have an incentive to spread negative sentiment and reduce developer adoption of AI technology as they race to catch up with the United States.
      • emp17344 3 hours ago
        You don’t get to shut down discussion you don’t like. That’s the opposite of exhibiting an open mind.
        • scrollaway 3 hours ago
          GP is right, though. Many programming communities, including (in some threads, but not all) HN, have become ridiculous anti-AI bubbles - what's the point of trying to have a discussion if you're going to get systematically shut down by people whose entire premise is that they don't use it? It's like trying to explain color to the blind.

          What "discussion" do you want to have? Another round of "LLMs are terrible at embedded hardware programming ergo they're useless"? Maybe with a dash of "LLMs don't write bug-free software [but I do]" to close it off?

          The discussions that are at all advancing the state of the art are happening on forums that accept reality as a matter of fact, without people constantly trying to pretend things because they're worried they'll lose their job if they don't.

          • red_rech 3 hours ago
            > What "discussion" do you want to have? Another round of "LLMs are terrible at embedded hardware programming ergo they're useless"? Maybe with a dash of "LLMs don't write bug-free software [but I do]" to close it off?

            You wouldn’t be getting some dismissive responses if yall didn’t open with “star trek post scarcity singularity” schizo shit and the companies explicitly putting out hostile messages.

            • scrollaway 2 hours ago
              There's exactly ONE factual claim about AI in this entire subthread, and it is word-for-word "AI has been a huge help for my work".

              Now you may read that as .. checks notes "star trek post scarcity singularity", but forgive me, I didn't.

          • emp17344 2 hours ago
            I think you’re overly sensitive to criticism of LLMs.
            • scrollaway 2 hours ago
              No? I really don't give a crap what people criticize. It doesn't change anything in my life - I have plenty going on and nothing you or anyone says here will alter that. It's just sad to see a community I like (and which I've been a part of for longer than you've been programming) factually shut itself down to reality...
    • FrustratedMonky 3 hours ago
      New Claude Model Runs 30-Hour Marathon To Create 11,000-Line Slack Clone

      https://www.theverge.com/ai-artificial-intelligence/787524/a...

      Yeah, maybe it is garbage. But it is still another milestone; if it can do this, then it probably does OK with the smaller things.

      This keeps incrementing from "garbage" to "wow this is amazing" at each new level. We're already forgetting that this was unbelievable magic a couple years ago.

    • bigyabai 3 hours ago
      > for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI

      That's... not super surprising? SwiftUI changes pretty dang often, and the knowledge cutoff doesn't progress fast enough to cover every use-case.

      I use Claude to write GTK interfaces, which is a UI library with a much slower update cadence. LLMs seem to have a pretty easy time working with bog-standard libraries that don't make giant idiomatic changes.

    • criley2 36 minutes ago
      "Please include ACTUAL EVIDENCE!"

      "I tripled my output (I provide no evidence for this claim)"

      Never change, HN.

  • peterdstallion 3 hours ago
    I am a paying subscriber to Gemini, Claude and OpenAI.

    I don't know if it's me, but over the last few weeks I've got to the conclusion ChatGPT is very strongly leading the race. Every answer it gives me is better - it's more concise and more informative.

    I look forward to testing this further, but out of the few runs I just did after reading about this - it isn't looking much better

    • andrewstuart 2 hours ago
      >> I am a paying subscriber to Gemini, Claude and OpenAI

      Me too.

      I find ChatGPT is the best analyzer of complex situations, but Claude does a better job of writing the code, so I get ChatGPT to design things and solve problems, then I give that to Claude/Gemini to implement.

      Gemini is good at both.

    • netdur 1 hour ago
      Yeah, ChatGPT is better overall, but with a twist: Gemini can actually be the best if you use AI Studio, tweak the config, and set up a good system prompt. It's kind of like how Nano Banana is SOTA but Qwen-Edit feels more useful since it's less censored. Meanwhile, ChatGPT is starting to feel slower and kind of showing its age. Another example is Veo 3 being SOTA while the infamous Grok is technically worse but doing better, and OpenAI's Sora is pretty much dead.

      edit: I run a low-profile service that localizes e-commerce photos, like taking Alibaba listings and swapping the model to look local. With Nano Banana I can't automate it because I have to manually check if the output got blocked (anything with female skin is risky; underwear or cleavage is 100% blocked), but Qwen-Edit just does the job without fuss.

    • beydogan 2 hours ago
      Me too. I have Claude Max and 2 ChatGpt accounts for Codex.

      I was a huge Claude fan but recently find myself using only Codex. When it gets stuck, I try Claude for some simple tasks, sometimes asking the same questions in parallel; Claude Code with Opus performs really badly compared to Codex.

    • sixothree 59 minutes ago
      Just to be clear, do you mean Codex?
    • yepyip 3 hours ago
      What about Grok, are they catching up?
      • ethmarks 2 hours ago
        I've only tried Grok Code Fast 1, so I can't speak for any of the other models.

        In my experience, Grok is very fast and very cheap, but only moderately intelligent. It isn't stupid, but it rarely does anything that impresses me. The reason it's a useful model is that it is very, very fast (~90 tokens per second) and is very competitively priced.

      • jjordan 2 hours ago
        Grok has been free for over a month now and for me it has certainly proven itself competent at most tasks that you would otherwise have to pay for with Claude, ChatGPT, etc.
  • Attummm 12 minutes ago
    Anthropic really nailed this release.

    There had been a trend where each new model released from OpenAI, Anthropic, etc. felt like a letdown or, worse, a downgrade.

    But the release of 4.5 breaks that trend, and is a pleasant surprise on day one.

    Well done! :)

  • zurfer 4 hours ago
    Same price and a 4.5 bp jump from 72.7 to 77.2 SWEBench

    Pretty solid progress for roughly 4 months.

    • zurfer 4 hours ago
      Also getting a perfect score on AIME (math) is pretty cool.

      Tongue in cheek: if we progress linearly from here software engineering as defined by SWE bench is solved in 23 months.

      • wohoef 3 hours ago
        Just a few months ago people were still talking about exponential progress. The fact that we’re already going for just linear progress is not a good sign
        • falcor84 3 hours ago
          Linear growth on a 0-100 benchmark is quite likely an exponential increase in capability.
          • usaar333 1 hour ago
            Except it is sublinear. Sonnet 4 was 10.2% above sonnet 3.7 after 3 months.
            • GoatInGrey 1 hour ago
              We should all know that in the software world, the last 10% requires 90% of the effort!
      • crthpl 3 hours ago
        The reason they get a perfect score on AIME is because every question on AIME had lots of thought put into it, and it was made sure that everything was actually possible. SWE-bench, and many other AI benchmarks, have lots of eval noise, where there is no clear right answer, and getting higher than a certain percentage means you are benchmaxxing.
        • mbesto 2 hours ago
          > SWE-bench, and many other AI benchmarks, have lots of eval noise

          SWE-bench has lots of known limitations even with its ability to reduce solution leakage and overfitting.

          > where there is no clear right answer

          This is both a feature and a bug. If there is no clear answer then how do you determine whether an LLM has progressed? It can't simply be judged on making "more right answers" on each release.

        • mrshu 3 hours ago
          Do you think a messier math benchmark (in terms of how it is defined) might be more difficult for these models to do well on?
      • levocardia 1 hour ago
        Pretty sure there is a subset of SWE bench problems that are either ill-posed or not possible with the intended setup; I think I remember seeing another company excluding a fraction of them for that reason. So maxing out SWEBench might only be ~95%.

        I'm most interested to see the METR time horizon results - that is the real test for whether we are "on-trend"

        • typpilol 50 minutes ago
          That's why they made SWE-bench Verified. Verified excludes those.
    • XMPPwocky 3 hours ago
      nit: assuming you mean basis points, one basis point is 0.01%. 4.5bp would be 72.7% to 72.71%. this is 450bp!
  • trevin 3 hours ago
    I’m always fascinated by the fine-tuning of LLM personalities. Might we finally get less of the reflexive “You’re absolutely right” with this one?

    Maybe we’re entering the Emo Claude era.

    Per the system card: In 250k real conversations, Claude Sonnet 4.5 expressed happiness about half as often as Claude 4, though distress remained steady.

    • andrewstuart 2 hours ago
      I like being lightly flattered.
    • data-ottawa 2 hours ago
      No change to “absolutely right”. I did get “You’re right” once though.
    • fnordsensei 3 hours ago
      I personally enjoy the “You’re absolutely right!” exclamation. It signals alignment with my feedback in a consistent manner.
      • transcriptase 3 hours ago
        You’re overlooking the fact that it still says that when you are, in reality, absolutely wrong.
        • fnordsensei 2 hours ago
          That’s not the purpose of it, as I understand it; it’s a token phrase generated to cajole it down a particular path.[1] An alignment mechanism.

          The complement appears to be, “actually, that’s not right.”, a correction mechanism.

          1: https://news.ycombinator.com/item?id=45137802

          • GoatInGrey 1 hour ago
            It gets annoying because A) it so quickly dismisses its own logic and conclusion from less than two minutes ago (extreme confidence with minimal conviction), and B) it fucks up the second time too (sometimes in the same way!) about 33% of the time.
            • ewoodrich 14 minutes ago
              Gemini 2.5 Pro seems to have a tic where after an initial failed task, it then starts asserting escalating levels of confidence for each subsequent attempt. Like it's ever conscious of its failure lingering in its context and feels the need to over compensate as a form of reassuring both the user and itself that it's not going to immediately faceplant again.
          • baobabKoodaa 1 hour ago
            Hmmh. I believe your explanation, but I don't think that's the full story. It's also a sycophancy mechanism to maximize engagement from real users and reward hack AI labelers.
        • podgietaru 3 hours ago
          And that it often spits out the exact same wrong answer in response.
  • MichealCodes 4 hours ago
    I really hope benchmarking improves soon to monitor the model in the weeks following the announcement. It really seems like these companies introduce a new "buffed" model and then slowly nerf the intelligence through optimizations.

    If we saw task performance week 1 vs week 8 on benchmarks, this would at least give us more insight into the loop here. In an environment lacking true progress a company could surely "show" it with this strategy.

    • SubiculumCode 3 hours ago
      I do wonder about this. I just don't know if it real or in our heads
      • commakozzi 2 hours ago
        It does feel like it has to be real. I've noticed it since ChatGPT with GPT-3.5: once it hit big news publicly and demands were made to "censor" its output to limit biases, etc. (not inherently a problem to do this with LLMs as a society, but it does affect the output for obvious reasons). Whatever workflow OpenAI and others have applied seems to be post-release somehow? I'm ignorant and just speculating, but I've noticed it with literally every model release. Starts strong, ends up feeling less capable days, weeks, months after. I'm sure some of it could be in the parallelization of processing that has to occur to service the large amount of requests, and more and more traffic spreading it thin?
      • beefnugs 2 hours ago
        Capitalism is pure scam now on every level: they did this with nvme drives in the last couple years. Sending out perfect hardware to reviewers then rug pulling trash to ship to the world
  • baobabKoodaa 1 hour ago
    Here's an anecdata. I have a real-world use case financial dataset where I have created benchmarks. Sonnet 4.5 provides no measurable improvement on these benchmarks over Sonnet 4. This is a bit surprising to me, especially when considering that the benchmark results published by Anthropic indicate that Sonnet 4.5 should be better than Sonnet 4 specifically on financial data analysis.
  • schmorptron 4 hours ago
    Oh wow, a lot of focus on code from the big labs recently. In hindsight it makes sense that the domain the people building it know best is the one getting the most attention, and it's also the one the models have seen the most undeniable usefulness in so far. Though personally, the unpredictability of the future where all of this goes is a bit unsettling at the same time...
    • doctoboggan 3 hours ago
      Along with developers wanting to build tools for developers like you said, I think code is a particularly good use case for LLMs (large language models), since the output product is a language.
      • fragmede 3 hours ago
        It's because the output is testable. If the model outputs a legal opinion or medical advice, a human needs to be looped in to verify that the advice is not batshit insane. Meanwhile, if the output is code, it can be run through a compiler and (unit) tests run to verify that the generated code is cromulent without a human being in the loop for 100% of it, which means the supercomputer can just go off and do it a thing with less supervision.
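
        A minimal sketch of that outer loop, assuming a bash shell, pytest, and an agent CLI with a non-interactive mode (`claude -p` here, but any equivalent works):

            # keep regenerating until the test suite passes, with no human in the loop
            until pytest -q; do
              pytest -q 2>&1 | tail -n 20 > /tmp/failures.txt
              claude -p "The tests are failing. Output: $(cat /tmp/failures.txt). Fix the code."
            done
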
    • martinald 3 hours ago
      Thing is, though, if they are good at code, that solves many other adjacent tasks for LLMs, like formatting docs for output, presentations, spreadsheet analysis, data crawling, etc.
    • modeless 3 hours ago
      OpenAI and Anthropic are both trying to automate their own AI research, which requires coding.
    • neuronexmachina 2 hours ago
      I think coding is also the area where companies are most likely to buy large team licenses.
  • mohsen1 4 hours ago
    Price is playing a big role in my AI usage for coding. I am using Grok Code Fast as it's super cheap. Next to it GPT-5 Codex. If you are paying for model use out of pocket Claude prices are super expensive. With better tooling setup those less smart (and often faster) models can give you better results.

    I am going to give this another shot but it will cost me $50 just to try it on a real project :(

    • xwowsersx 4 hours ago
      Same here. I've been using GCF1 with opencode and getting good results. I also started using [Serena](https://github.com/oraios/serena), which has been really helpful in a large codebase. It gives you better search than plain grep, so you can quickly find what you need instead of dumping huge chunks of code into Claude or Grok and wasting tokens.
      • sixothree 1 hour ago
        Serena really does feel like a secret weapon sometimes.
    • _joel 4 hours ago
      I'm paying $90(?) a month for Max and it holds up for about an hour or so of in-depth coding before the 5-hour window lockout kicks in (so effectively about 4 hours when I can't run it). Kinda frustrating, even with efficient prompting and context-length conservation techniques. I'm going to test this new Sonnet 4.5 now, but it'll probably be just as quick to gobble my credits.
      • Implicated 3 hours ago
        I'm on a max ($200) plan and I only use opus and I've _never_ hit a rate limit. Definitely using for 5+ hours at a time multiple days per week.
      • mrshu 3 hours ago
        Do you normally run Opus by default? It seems the Max subscription should let you run Sonnet in an uninterrupted way, so it was surprising to read.
    • sixothree 1 hour ago
      I just can't bring myself to get over the grossness factor of using an X-branded product.
    • Hamuko 3 hours ago
      I'm too cheap to pay for any of them. I've only tried gpt-oss:20b because I can run it locally and it's a complete waste of time for anything except code completions.
    • muttantt 4 hours ago
      how are you using grok code fast? what tooling/cli/etc?
  • chipgap98 4 hours ago
    Interesting that this is better than Opus 4.1. I want to see how this holds up under real-world use, but if that's the case it's very impressive.

    I wonder how long it will be before we get Opus 4.5

    • FergusArgyll 3 hours ago
      IIRC Sonnet 3.5 (and definitely 3.5-new, aka 3.6) was better than Opus 3.

      There's still a lot of low hanging fruit apparently

  • andrewstuart 2 hours ago
    Still waiting to be able to upload zip files to Claude, which Gemini and ChatGPT have had for ages.

    ChatGPT even does zip file downloads, packaging up all your files.

  • alach11 4 hours ago
    I'm really interested in the progress on computer use. These are the benchmarks to watch if you want to forecast economic disruption, IMO. Mastery of computer use takes us out of the paradigm of task-specific integrations with AI to a more generic interface that's way more scalable.
    • sipjca 3 hours ago
      Maybe this is true? But it's not clear to me this methodology will ever be quite as good as native tool calling. Or maybe I don't know the benchmark well enough; I just assume it's vision-based.

      Perhaps Tesla FSD is a similar example: in practice, self-driving with vision alone should be possible (humans do it), but it's fundamentally harder and more error-prone than having better data. Using computer screens as the fundamental unit seems very error-prone and expensive in tokens.

      But at the same time, I'm sure there are many tasks that could be automated this way, so shrug.

      • simianwords 1 hour ago
        Looks like the RPA vs. API debate all over again.
    • cantor_S_drug 2 hours ago
      Do you think a Genie-like model trained specifically on data of interactions with application interfaces would be good at computer-use tasks?
    • mrshu 3 hours ago
      What are some standard benchmarks you look at in this space?
  • unshavedyak 3 hours ago
    Interesting: in the new Claude Code 2.0.0 they got rid of the "Plan with Opus, then switch to Sonnet" feature. I hope they're correct that Sonnet is good enough to plan too, because I quite preferred Opus planning. It wasn't necessarily "better", just more predictable in my experience.

    Also, as a Max $200 user, it feels weird to be paying for an Opus-tailored sub when the standard Max $100 would now be preferable, since they claim Sonnet is better than Opus.

    Hope they have Opus 4.5 coming out soon, or next month I'm downgrading.

    • Implicated 3 hours ago
      I'm also a max user and I just _leave_ it on Opus 4.1 - I've never hit a rate limit.
      • danielbln 2 hours ago
        I'm on the 25x MAX plan and if I go full hog on multiple projects I might see the yellow "Approaching Opus limits" message in Claude Code, but I have yet to have it lock me down, I usually slip right into the next 5h block and the message vanishes.
      • stavros 2 hours ago
        Same, it very quickly says "approaching rate limits", and then just keeps going forever.
    • asar 2 hours ago
      In the same boat and ready to downgrade. But this must be on their radar, or they were/are losing money on Opus...
  • greenfish6 4 hours ago
    As the rate of model improvement appears to slow, the first reactions seem to be getting worse and worse, as it takes more time to assess the model's quality and understand the nuances & subtler improvements
  • seaal 4 hours ago
    They really had to release an updated model, I can only imagine how many people cancelled their plans and switched over to Codex over the past month.

    I'm glad they at least gave me the full $100 refund.

    • GenerWork 2 hours ago
      I'm one of them, but I'm just a product designer who likes to jump between various AI tools to get experience with them. Once my month with OpenAI is up, I may jump back to CC as I liked some of the non-coding features more, specifically plan mode.
  • _joel 3 hours ago
    `claude --model claude-sonnet-4-5-20250929` for CLI users
  • user1999919 4 hours ago
    It's time to start benchmarking the benchmarks. I'm pretty sure they're doping the game here, BMW-style.
  • cube2222 4 hours ago
    So… seems like we’re back to Sonnet being better than Opus? At least based on their benchmarks.

    Curious to see that in practice, but great if true!

  • mohsen1 4 hours ago
    • Kuinox 4 hours ago
      I never understood the point of the pelican-on-a-bicycle exercise: an LLM coding agent doesn't have any way to see the output. So the only thing this test measures is the LLM's ability to memorise.

      Edit: just to show my point, a regular human on a bicycle is way worse with the same model: https://i.imgur.com/flxSJI9.png

      • _joel 4 hours ago
        Because it exercises thinking about a pelican riding a bike (not common) and then describing that using SVG. It's quite nice imho and seems to scale with the power of the model. I'm sure Simon has some actual reasons though.
        • Kuinox 4 hours ago
          > Because it exercises thinking about a pelican riding a bike (not common)

          It is extremely common, since it's used on every single LLM to bench it.

          And there's no logic to it: LLMs are never trained on graphics tasks, and they don't see the output of the code.

          • _joel 3 hours ago
            I mean real-world examples of a pelican riding a bike are not common. It's common in benchmarking LLMs, but that's not what I meant.
        • imiric 4 hours ago
          The only thing it exercises is the ability of the model to recall its pelican-on-bicycle and other SVG training data.
      • mhh__ 4 hours ago
        Memorise what exactly?
        • Kuinox 4 hours ago
          The coordinates and shapes of the elements used to form a pelican. If you think about how LLMs ingest their data, they have no way to know how to form a pelican in SVG.

          I bet their ability to form a pelican comes purely from someone having already done it before.

          • throwaway314155 1 hour ago
            > If you think about how LLMs ingest their data, they have no way to know how to form a pelican in SVG.

            It's called generalization and yes, they do. I bet you could find plenty of examples of it working on something that truly isn't "present in the training data".

            It's funny: you're so convinced that it's not possible without direct memorization, but you forgot to account for emergent behaviors (which are frankly all over the place in LLMs - where have you been?).

            At any rate, the pelican thing from simonw is clearly just for fun at this point.

      • furyofantares 3 hours ago
        It's more for fun than as a benchmark.
        • Kuinox 3 hours ago
            It also measures something LLMs are good at probably due to cheating.
          • furyofantares 2 hours ago
            I wouldn't say any LLMs are good at it. But it doesn't really matter, it's not a serious thing. It's the equivalent of "hello world" - or whatever your personal "hello world" is - whenever you get your hands on a new language.
    • greenfish6 4 hours ago
      pelican on a bicycle benchmark probably getting saturated... especially as it's become a popular way to demonstrate model ability quickly
      • AlecSchueler 4 hours ago
        But where is the training set of good pelicans on bikes coming from? You think they have people jigging them up internally?
        • eli 3 hours ago
          Assuming they updated the crawled training data, just having a bunch of examples of specifically pelicans on bicycles from other models is likely to make a difference.
          • AlecSchueler 3 hours ago
              But then how does the quality increase? Normally we hear that when models are trained on the output of other models, the style becomes very muted and various other issues start to appear. But this is probably the best pelican on a bicycle I've ever seen, by quite some margin.
            • Kuinox 3 hours ago
                Just compare it with a human on a bicycle; you'll see that LLMs are weirdly good at drawing pelicans in SVG, but not humans.
              • AlecSchueler 3 hours ago
                  I thought a human would be a considerable step up in complexity, but I asked it first for a pelican [0] and then for a rat [1] to get out of the bird world, and it did a great job on both.

                  But just for thrills I also asked for a "punk rocker" [2], and the result--while not perfect--is leaps and bounds above anything from the last generation.

                0 -- ok, here's the first hurdle! It's giving me "something went wrong" when I try to get a share link on any of my artifacts. So for now it'll have to be a "trust me bro" and I'll try to edit this comment soon.

    • _joel 4 hours ago
      ... but can it create an SVG renderer for Claude's site?
  • wohoef 3 hours ago
    And Sonnet is again better than Opus. I'd love to see simultaneous release dates for Sonnet and Opus one day, just so that Opus is always better than Sonnet.
  • aliljet 3 hours ago
    These benchmarks remain remarkably weak proxies for real-world work. If you're using this day-to-day, the eval that really matters is how the model handles a ten-step action. Context and focus are absolutely king in real-world work. To be fair, Sonnet has tended to be very good at that...

    I wonder if the 1M-token context length is coming along for the ride too?

    • data-ottawa 2 hours ago
      Anecdotally, this new Sonnet model is massively falling apart on my tool-call-based workflows.

      I’m having to handhold it through analysis tasks.

      At one point it wrote a Python script that took the files it needed to investigate, iterated through them running `print(f"{i}. {file}")`, and then printed "Ready to investigate files…" And that's all the script did.

      I have no idea what’s going on with those benchmarks if this is real world use.

  • marginalia_nu 3 hours ago
    Is there some accessible explainer for what these numbers that keep going up actually mean? What happens at 100% accuracy or win rate?
    • lukev 3 hours ago
      It means that the benchmark isn't useful anymore and we need to build a harder one.

      edit: as far as what the numbers mean, they are arbitrary. They are only useful insofar as you can run two models (or two versions of the same model) on the same benchmark, and compare the numbers. But on an absolute scale the numbers don't mean anything.

      • typpilol 45 minutes ago
        I thought the percentage was how many problems it successfully solved
    • asadm 3 hours ago
      Then we need a new bench.
  • sberens 3 hours ago
    Is "parallel test time compute" available in claude code or the api? Or is it something they built internally for benchmark scores?
    • jspdown 2 hours ago
      I'm wondering if it's not just: spawn the same prompt multiple times and take the best result.
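
      Something along those lines would be easy to do client-side, too. A rough sketch with the Anthropic Python SDK, where the pick_best scorer is hypothetical (in a benchmark it would be a verifier, a test suite, or majority voting):

        import concurrent.futures
        import anthropic

        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

        def sample(prompt: str) -> str:
            msg = client.messages.create(
                model="claude-sonnet-4-5-20250929",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return msg.content[0].text

        def pick_best(candidates: list[str]) -> str:
            # Hypothetical scorer - the hard part in practice.
            return max(candidates, key=len)

        def parallel_attempt(prompt: str, n: int = 4) -> str:
            with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
                candidates = list(pool.map(sample, [prompt] * n))
            return pick_best(candidates)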
  • croemer 2 hours ago
    It's not yet on LMarena: https://lmarena.ai/leaderboard/text
  • usr19021ag 4 hours ago
    Their benchmark chart doesn't match what's published on https://www.swebench.com/.

    I understand that they may not have published the results for Sonnet 4.5 yet, but I would expect the other models to match...

  • iFire 57 minutes ago
    Is it 15x cheaper like Grok?
    • typpilol 48 minutes ago
      I heard it's 5x cheaper than Opus.
  • jdthedisciple 2 hours ago
    Why the focus on the "alignment"-aspect of safety?

    Surely there are more pressing issues with LLMs currently...

  • ancorevard 3 hours ago
    Can't use Anthropic models in Cursor. Completely cost-prohibitive compared to GPT-5 and Grok models.

    Why is this? Does Anthropic have just higher infrastructure costs compared to OpenAI/xAI?

    • doctoboggan 3 hours ago
      Possibly, or they are pricing for sustainability and OpenAI/xAI are just burning through VC money.
    • acchow 3 hours ago
      The Anthropic models are also better at coding. Why wouldn’t they price it higher?
    • dbbk 3 hours ago
      It's meant to be used with the Max subscription
  • chrisford 3 hours ago
    The vision model has consistently been degraded since 3.5, specifically around OCR, so I hope it has improved with Claude Sonnet 4.5!
  • meetpateltech 4 hours ago
    Seeing the progress of the Claude models is really cool!

    Charting Claude's progress with Sonnet 4.5: https://youtu.be/cu1iRoc1wBo

    • clueless 4 hours ago
      would love to see the prompt they used and the final code of the Claude.ai clone it generated
  • i-chuks 2 hours ago
    AI companies really need to consider regional pricing. Huuuuge barrier!
    • Jcampuzano2 1 hour ago
      The price of training and running the models doesn't really change much no matter which region you're hosting/making requests from.

      Regional pricing unfortunately doesn't really make much sense for them unless they're willing to take even larger losses, even if it is a barrier to lower income countries/regions.

  • edude03 3 hours ago
    Ah, I figured something was up - I had sonnet 4 selected but it changed to "Legacy Model" while I was using the app.
  • scosman 4 hours ago
    Interesting quirk on first use: "`temperature` and `top_p` cannot both be specified for this model. Please use only one."
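
    For anyone hitting the same thing, the workaround is just to send one or the other, not both (a sketch using the Anthropic Python SDK):

      import anthropic

      client = anthropic.Anthropic()

      # Fine: temperature only.
      msg = client.messages.create(
          model="claude-sonnet-4-5-20250929",
          max_tokens=512,
          temperature=0.7,
          messages=[{"role": "user", "content": "Hello"}],
      )

      # Also fine: top_p only. Passing both in one request triggers the error above.
      msg = client.messages.create(
          model="claude-sonnet-4-5-20250929",
          max_tokens=512,
          top_p=0.9,
          messages=[{"role": "user", "content": "Hello"}],
      )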
    • zora_goron 2 hours ago
      Why might this be, does anyone know?
  • dr_dshiv 2 hours ago
    Anyone try the Imagine with Claude yet? How does it work?
  • pembrook 2 hours ago
    If they stopped the automatic "You're absolutely right!" responses after the model fails to fix something 20 times in a row, then that alone will be worth the upgrade.

    Me: "You just burned my house down"

    Claude: "You're absolutely right! I burned your house down, I need to revert the previous change and..."

    Me: "Now you rebuilt my house with a toilet in the living room"

    Claude: "You're absolutely right! I put a toilet in your living room..."

    Etc.

  • hu3 3 hours ago
    I wonder if/when this will be available to GitHub Copilot in VSCode.
  • cryptoz 4 hours ago
    I've really got to refactor my side project, which I tailored to use only OpenAI API calls. I think the Anthropic APIs are a bit different, so I just never put in the energy to support the changes. I think I remember reading that there are tools to simplify this kind of work, to support multiple LLM APIs? I'm sure I could do it manually, but how do you all support multiple API providers that have differences in their API design?
    • willcodeforfoo 4 hours ago
    • juanre 3 hours ago
      I built LLMRing (https://llmring.ai) for exactly this. Unified interface across OpenAI, Anthropic, Google, and Ollama - same code works with all providers.

      The key feature: use aliases instead of hardcoding model IDs. Your code references "summarizer", and a version-controlled lockfile maps it to the actual model. Switch providers by changing the lockfile, not your code.

      Also handles streaming, tool calling, and structured output consistently across providers. Plus a human-curated registry (https://llmring.github.io/registry/) that I keep updated with current model capabilities and pricing - helpful when choosing models.

      MIT licensed, works standalone. I am using it in several projects, but it's probably not ready to be presented in polite society yet.
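
      The general shape of the alias idea, purely as an illustration (this is not LLMRing's actual API or lockfile format, just the pattern):

        import json

        # Version-controlled lockfile mapping role aliases to concrete model ids, e.g.
        # {"summarizer": "anthropic/claude-sonnet-4-5-20250929", "extractor": "openai/gpt-4o-mini"}

        def resolve(alias: str, lockfile: str = "llm.lock.json") -> str:
            with open(lockfile) as f:
                return json.load(f)[alias]

        model_id = resolve("summarizer")  # application code never hardcodes the model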

    • punkpeye 3 hours ago
      OpenRouter, Glama ( https://glama.ai/gateway/models/claude-sonnet-4-5-20250929 ), and AWS Bedrock all give you access to the AI models via an OpenAI-compatible API.
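
      For example, with OpenRouter you can point the standard OpenAI client at their base URL; the model string below is an assumption, so check their model list for the exact id:

        from openai import OpenAI

        client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key="<OPENROUTER_API_KEY>",
        )

        resp = client.chat.completions.create(
            model="anthropic/claude-sonnet-4.5",  # assumed OpenRouter id for this model
            messages=[{"role": "user", "content": "Summarize this repo's README."}],
        )
        print(resp.choices[0].message.content)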
    • l1n 3 hours ago
    • gloosx 3 hours ago
      Why don't you ask LLM to do it for you?
    • adidoit 4 hours ago
      LiteLLM is your friend.
    • pinum 4 hours ago
      I use LiteLLM as a proxy.
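
      For the non-proxy route, litellm's Python SDK gives the same unified call shape, with the provider routed off the model prefix (the model ids below are assumptions about litellm's naming):

        from litellm import completion

        messages = [{"role": "user", "content": "Explain this diff in one paragraph."}]

        # Same call shape for every provider; litellm translates to each native API.
        openai_resp = completion(model="gpt-4o-mini", messages=messages)
        claude_resp = completion(model="anthropic/claude-sonnet-4-5-20250929", messages=messages)

        print(openai_resp.choices[0].message.content)
        print(claude_resp.choices[0].message.content)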
    • dingnuts 4 hours ago
      > I think I remember reading that there are tools to simplify this kind of work, to support multiple LLM APIs

      just ask Claude to generate a tool that does this, duh! and tell Claude to make the changes to your side project and then to have sex with your wife too since it's doing all the fun parts

  • vb-8448 3 hours ago
    The claims against GPT-5 are huge!

    I used to use CC, but I switched to Codex (and it was much better)... now I guess I have to switch back to CC, at least to test this.

  • fibers 4 hours ago
    This looks exciting. I hope they add this to Windsurf soon.
    • pzo 3 hours ago
      It looks like it's already there.
      • simianwords 2 hours ago
        It’s stupid… like just have a registry of models and let people automatically use them. It’s silly to wait for manual whitelisting each time for every app
  • asdev 3 hours ago
    How do Claude/OpenAI get around rate limiting and CAPTCHAs with their computer-use functionality?
  • pants2 1 hour ago
    Unfortunately also disappointed with it in Cursor vs GPT-5-Codex. I asked it to add a test for a specific edge case, it hallucinated some parameters and didn't use existing harnesses. GPT-5-Codex with the same prompt got everything right.
  • AtNightWeCode 2 hours ago
    Sonnet is just so expensive compared to its competitors. Have they fixed this?
    • ripped_britches 2 hours ago
      Pricing is the same as Sonnet 4.
      • risho 2 hours ago
        which was expensive compared to its competitors
        • AtNightWeCode 2 hours ago
          Exactly. The cost per token is higher, and it also burns through tokens like a chipmunk on steroids.
  • jonathanstrange 2 hours ago
    I would like to see completely independent test results of these companies' products. I'm skeptical because every AI company claims their new product is the best.
  • dbbk 4 hours ago
    So Opus isn't recommended anymore? A bit confusing.
    • SatvikBeri 4 hours ago
      For now, yeah. Presumably they'll come out with Opus 4.5 soon.
    • causal 4 hours ago
      Don't think I've ever preferred Opus to Sonnet
  • catigula 3 hours ago
    I'm still absolutely right constantly, I'm a genius. I also make various excellent points.
  • rishabhaiover 2 hours ago
    HN displays a religious hatred of AI progress.
  • catigula 4 hours ago
    I happened to be in the middle of a task in a production codebase that the various models struggled on so I can give a quick vibe benchmark:

    Opus 4.1: made weird choices; eventually got to a meh solution I just rolled back.

    Codex: took a disgusting amount of time, but the result was vastly superior to Opus. Night-and-day superiority. The output was still not what I wanted.

    Sonnet 4.5: not clearly better than Opus. Categorically worse decision-making than Codex. Very fast.

    Codex was night and day the best. Codex scares me, Claude feels like a useful tool.

    • poisonborz 3 hours ago
      These reviews are pretty useless to other developers. Models perform vastly differently with each language, task type, framework.
      • cheema33 3 hours ago
        > These reviews are pretty useless to other developers.

        Agreed. If these same models were used on a different codebase, language, etc., they would likely produce very different results.

  • nickphx 3 hours ago
    It will be great when the VC cash runs out, the screws tighten, and the incessant misleading marketing claims finally come to an end.
  • atemerev 4 hours ago
    Ah, the company whose models are unusable even with a Pro subscription (you start hitting the limit after 20 minutes of conversation), and whose free models aren't usable at all (currently I can't even send a single message to Sonnet 4.5)...
  • idkmanidk 4 hours ago
    Page cannot be found / Empty screen mocks my searching / Only void responds

    but: https://imgur.com/a/462T4Fu

  • kixiQu 4 hours ago
    Lots of feature dev here – anyone have color on the behavior of the model yet? Mouthfeel, as it were.