Desperate pivot aside, I don't see how anyone competes with the big labs on coding agents. They can serve the models at a fraction of the API cost, can trivially add post-training to fill gaps, and have way deeper enterprise penetration.
The differentiator is that the scaling myth was a lie. The GPT-5 flop should make that obvious enough. These guys are spending billions and can't get the models to show more than a few percent improvement. You need to actually innovate, e.g. tricks like MoE, tool calling, better cache utilization, concurrency, better prompting, CoT, data labeling, and so on.
Not two weeks ago, some Chinese academics put out a paper called Deep Think With Confidence, in which they coaxed GPT-OSS-120B into thinking a little longer, causing it to perform better on benchmarks than it did when OpenAI released it.
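The gist, as I read it (a rough sketch of the offline version, not the paper's actual code; generate_trace, the window size, and the threshold here are all invented):

    from collections import Counter

    def trace_confidence(token_logprobs, window=2048):
        # Mean log-probability over the trailing window; low values mean
        # the model is "unsure" of its own reasoning.
        tail = token_logprobs[-window:]
        return sum(tail) / len(tail)

    def deepconf_vote(generate_trace, prompt, n_traces=16, threshold=-1.5):
        # Sample many reasoning traces, keep only the confident ones,
        # then majority-vote on the surviving final answers.
        answers = []
        for _ in range(n_traces):
            answer, token_logprobs = generate_trace(prompt)  # hypothetical sampler
            if trace_confidence(token_logprobs) >= threshold:
                answers.append(answer)
        if not answers:
            return None  # every trace was low-confidence
        return Counter(answers).most_common(1)[0][0]

If I remember right, the paper also stops low-confidence traces early during generation rather than only filtering afterwards, but the filtering idea is the core of it.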
Scaling inference, not training, is what OP means, I believe.
The smaller startups like Cursor or Windsurf are not competing on foundation model development. So whether new models are generationally better is not relevant to them.
Cursor is competing with Claude Code, and both use Claude Sonnet.
Even if Cursor were running an on-par model on their own GPUs, their inference costs would not be as cheap as Anthropic's, simply because they would not be operating at the same scale. Larger data centers mean better deals, and the labs know more about running inference efficiently because they are also doing much larger training runs.
Don't need to compete - demonstrate some ability to use AI in an easy-to-understand way, get bought out at valuation. Bad for investors, awesome for founders.
I've pitched this to people working there multiple times. Warp is not just a terminal; it's a full stack of interaction, and they have more of the vertical of the development cycle to leverage.
You need different relationships at different parts of coding: ideation, debugging, testing, etc. Cleverly sharing context while maintaining different flows and respecting the relationship hygiene is the key. Most of the VS Code extensions now do this with various system-prompt selections of different "personas".
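Something like this, as a toy sketch (the stage names and prompt text are invented, not any particular extension's actual prompts):

    # Toy sketch of persona routing: pick a system prompt per stage of the
    # work, while carrying the same shared project context.
    PERSONAS = {
        "ideation":  "You are a staff engineer brainstorming approaches. Ask questions; don't write code yet.",
        "coding":    "You are a careful implementer. Write minimal diffs and explain each change.",
        "debugging": "You are a debugger. Form hypotheses and propose experiments; never guess silently.",
        "testing":   "You are a test engineer. Enumerate edge cases before writing assertions.",
    }

    def build_messages(stage: str, shared_context: str, user_msg: str) -> list[dict]:
        # Swap the persona per stage but keep the shared context, so each
        # "relationship" stays clean without losing project state.
        return [
            {"role": "system", "content": PERSONAS[stage]},
            {"role": "system", "content": f"Project context:\n{shared_context}"},
            {"role": "user", "content": user_msg},
        ]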
I used to (6 months ago) compare these agentic systems to John Wayne as a contract programmer: parachuting into a project, firing off his pistol, shooting the criminals and the mayor, and burning the barn down, all while you're yelling at it to behave better.
There are contexts and places where this can be more productive. Warp is one of them, if executed with clean semantic perimeters. It's in a rather strong position for it, and it's an obvious loyalty builder.
Sometimes worse is better. I haven't used it a lot yet, but so far I quite like this reduced focus on editing - I see this as close to the sweet spot of vibe coding, in between Claude Code and a full editor/IDE, whereby I generally trust the agent to write the code, but just want a simple editor to steer it more effectively.
I see this similarly to the way I would have a work session with a more junior dev where sometimes during the chat I would "drop down in abstraction" to show them how I'd code a specific function, but I don't want to take over - I'm giving them a bit of direction, and it's up to them to either keep my code or ignore/rewrite it to better suit their approach.
I get that it’s a different mode, but if you ever drop into editing, why would you ever want a worse (or just unfamiliar) editor? How does that improve the agent mode? If my junior dev only had Notepad installed, I would get them to install a better editor, I wouldn’t say “it’s great you only have Notepad, so we can focus on our conversation”.
Well, Notepad is probably too much of a downgrade, but I do often prefer something simpler like Sublime rather than my fully extensionized VS Code, which I do like, but which has a ton of visual clutter that I only need when I'm in a developer mindset rather than a vibing mindset.
For me, the USP Warp used to have was generating shell commands from prompts inside the terminal - but Cursor has had this in its embedded terminal for a while now, so increasingly I find myself using Ghostty instead.
This concerns me, given what I've seen generated by these tools. In 10? 5? 1? year(s), are we going to see an influx of CVEs, or the hiring of Senior+ level developers solely for the purpose of cleaning up these messes?
As far as CVEs issued for proprietary software go, I would expect that the owning organization would not be inclined to blame AI code unless they think they can pass the buck.
But as for eventually having to hire senior developers to clean up the mess, I do expect that. Most organizations that think they can build and ship reliable products without human experts probably won’t be around long enough to be able to have actual CVEs issued. But larger organizations playing this game will eventually have to face some kind of reckoning.
Looking at the other side of the coin, I'm hoping that the proliferation of unsafe code would lead to more investment in vulnerability testing tooling, and particularly in reducing false positives by generating potential exploits. Having better security testing would be a massive boon to the industry regardless of whether we use AI to write the code.
I switched to this and, honestly, it more or less feels the same as Claude Code, except with a fancy UI and built-in MCP servers for automated memory management. But I am sticking with it so I don't have to deal with vendor lock-in (I heavily disagree with what Anthropic is doing when it comes to 'safety').
The difference to me is that I can quickly switch in and out of “AI mode” with Warp. So it’s a terminal when I want that, and it’s an AI assistant when I want that.
With Claude Code, you’re stuck in AI mode all the time (which is slow for running vanilla terminal commands) or you have to have a second window for just terminal commands.
Edit: just read some documentation saying Claude has a “bash mode” where it will actually pass through the commands, so off to try that out now.
All these monolithic agents are so wasteful. Having an agent orchestration service is so much more efficient and maintainable. My work-in-progress Rust agent takes less CPU/memory for a whole swarm than one Claude Code instance.
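Not my actual code (that's Rust, and closed for now), but the basic shape is cheap to sketch in Python: a whole swarm can be coroutines sharing one queue and one event loop instead of N heavyweight processes:

    import asyncio

    async def agent_worker(name: str, tasks: asyncio.Queue, results: list):
        # Each "agent" is a coroutine pulling work off a shared queue.
        while True:
            task = await tasks.get()
            if task is None:  # sentinel: shut down
                break
            results.append((name, f"done: {task}"))  # stand-in for an LLM call

    async def run_swarm(jobs: list[str], n_agents: int = 8):
        tasks: asyncio.Queue = asyncio.Queue()
        results: list = []
        workers = [asyncio.create_task(agent_worker(f"agent-{i}", tasks, results))
                   for i in range(n_agents)]
        for job in jobs:
            tasks.put_nowait(job)
        for _ in workers:
            tasks.put_nowait(None)
        await asyncio.gather(*workers)
        return results

    # asyncio.run(run_swarm(["fix bug", "write tests", "update docs"]))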
My Rust agent is closed source (at least right now, we'll see), but I'm happy to discuss details of how stuff works to get you going in the right direction.
I'd be glad to hear more. I'm not certain what I would even ask, as the space is really fuzzy (prompting and all that).
I've got an Ollama instance (24GB VRAM) I want to leverage to try and reduce dependency on Claude Code. Even the tech stack seems unapproachable. I've considered LiteLLM, router agents, micro-agents (smallest slice of functionality possible), etc. I haven't wrapped my head around it all the way, though.
Ideally, it would be something like:
UI <--> LiteLLM
           ^
           |
           v
       Agent Shim
Where the UI is probably aider or something similar. Claude Code muddies the differentiation between UI and agent (with all the built-in system-prompt injection). I imagine I would like to move system-prompt injection and agent CRUD into the agent shim.
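Concretely, since the LiteLLM proxy speaks the OpenAI API, the shim could start as a thin wrapper that owns the prompt injection (a sketch under that assumption; the endpoint, model name, and prompt store are all placeholders for whatever the real setup uses):

    from openai import OpenAI

    # LiteLLM's proxy is OpenAI-compatible, so the shim can sit behind the
    # UI and inject the system prompt before forwarding.
    client = OpenAI(base_url="http://localhost:4000", api_key="sk-local")

    AGENT_PROMPTS = {"coder": "You are a coding agent. Use tools; keep diffs small."}

    def shim_chat(agent: str, user_msg: str, model: str = "ollama/qwen2.5-coder"):
        # The shim owns system-prompt injection (and later, agent CRUD),
        # so the UI (aider or whatever) stays a dumb front-end.
        return client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": AGENT_PROMPTS[agent]},
                {"role": "user", "content": user_msg},
            ],
        )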
I'm just spitballing here.
Thoughts? (my email is in my profile if you would prefer to continue there)
I also have a 24GB card. Local LLMs are great for a lot of things, but I wouldn't route coding questions to them; the time/$ tradeoff isn't worth it. Also, don't use LiteLLM, it's just bad; Bifrost is the way.
You can use an LLM router to direct questions to an optimal model on a price/performance Pareto frontier. I have a plugin for Bifrost that does this, Heimdall (https://github.com/sibyllinesoft/heimdall). It's very beta right now, but the test coverage is good; I just haven't paved the integration pathway yet.
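The core routing idea is simple enough to sketch (this is just the general idea, not Heimdall's actual code; the prices and quality scores are made up):

    # Made-up price/quality numbers purely for illustration.
    MODELS = {
        "small":  {"cost_per_mtok": 0.15, "quality": 0.60},
        "medium": {"cost_per_mtok": 1.00, "quality": 0.80},
        "large":  {"cost_per_mtok": 8.00, "quality": 0.95},
    }

    def pareto_frontier(models):
        # Keep only models that no other model beats on quality at equal
        # or lower cost.
        keep = {}
        for name, m in models.items():
            dominated = any(
                o["cost_per_mtok"] <= m["cost_per_mtok"] and o["quality"] > m["quality"]
                for other, o in models.items() if other != name
            )
            if not dominated:
                keep[name] = m
        return keep

    def route(difficulty: float, models=MODELS):
        # Pick the cheapest frontier model whose quality clears the
        # estimated difficulty of the question; fall back to the best.
        frontier = pareto_frontier(models)
        ok = [(m["cost_per_mtok"], name) for name, m in frontier.items()
              if m["quality"] >= difficulty]
        if ok:
            return min(ok)[1]
        return max(frontier, key=lambda n: frontier[n]["quality"])

The hard part, of course, is estimating difficulty per question; the snippet just assumes you have that number.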
I've got a number of products in the works to manage context automatically, enrich/tune RAG, and provide enhanced code search. Most of them are public, and you can poke around and see what I'm doing. I plan on doing a number of launches soon, but I like to build rock-solid software, and rapid agentic development creates a large manual QA/acceptance-eval burden.
I'm using the swarm to build ~20 projects in parallel, some even released, and some draft papers done. Take a look at the products gallery on my site (research papers linked on the research tab): https://sibylline.dev/products/
Claude Code and Codex provide something like $5000 of tokens for $200. How will any other offering that depends on those labs' models ever compete with that, except by luring suckers or tire kickers?
They already have 40% margins on inference. Even if they make less on their own subscriptions, they may keep such margins on the API, handicapping competitor tools.
The pivot is likely because there's more VC dollars there.
It is a handy AI CLI for any terminal. I've been using the terminal app for a few months and found it a very competent coding tool. I kept giving the team feedback that they should beef up the coding side, because this was my daily driver for writing code until Claude Code arrived with Opus 4. The interface is still a bit janky, because I think it's trying to predict whether you're typing a console command or writing a new prompt (it tries to dynamically assess that, but often enough it crosses the streams). Regardless, I highly recommend checking it out; I've had some great success with it.
The part I don't get is the pricing. It seems to be priced solely on requests. Then why would someone use GPT-4.1 when Opus is charging the same price?
Pro Tip: Ask the agent or another LLM to generate a prompt for an agent describing what you want to build, then tweak it as needed, and then use that prompt. I've had decent success prompting Junie (JetBrains AI) a few times because of this.
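i.e., something like this (a minimal sketch; the meta-prompt wording and model choice are just my own):

    from openai import OpenAI

    client = OpenAI()  # any chat-capable model works here

    META_PROMPT = (
        "Write a detailed prompt for a coding agent. The agent should build: {goal}. "
        "Include constraints, tech stack, acceptance criteria, and what NOT to do."
    )

    def draft_agent_prompt(goal: str) -> str:
        # Tweak the result by hand, then feed it to the actual agent.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": META_PROMPT.format(goal=goal)}],
        )
        return resp.choices[0].message.content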
If warp had just stuck to being a decent terminal emulator with great UI, I would be using it without question. This AI nonsense is why I don't even consider them an option.
Things like self-hosting and data privacy, and model optionality too.
Plenty of companies still don't want to ship their code over to these vendors, agreement or not, or be locked into their specific model.
Reference: Browser Company
(2) A Microsoft VP of product spends enough time writing code to be a relevant testimonial?
Why suddenly agentic coding?
https://www.youtube.com/watch?v=9jKOVAa1KAo
Claude Code can replicate some of the behavior, but it’s too slow to switch in and out of command / agent flows.
Can we please standardize this and just have one markdown file that all the agents can use?
Did they? Their original product was a terminal emulator, with built-in telemetry, that required you to create an account to use.