I am working on a project with ~200k LoC, entirely written with AI codegen.
These days I use Codex, with GPT-5-Codex + $200 Pro subscription. I code all day every day and haven't yet seen a single rate limiting issue.
We've come a long way. Just 3-4 months ago, LLMs would make a huge mess when faced with a large codebase. They had massive problems with files over 1k LoC (I know, files should never grow this big).
Until recently, I had to religiously provide the right context to the model to get good results. Codex does not need it anymore.
Heck, even UI seems to be a solved problem now with shadcn/ui + MCP.
My personal workflow when building bigger new features:
1. Describe the problem with lots of detail (often recording 20-60 mins of voice, then transcribing)
2. Prompt the model to create a PRD
3. CHECK the PRD, improve and enrich it - this can take hours
4. Actually have the AI agent generate the code and lots of tests
5. Use AI code review tools like CodeRabbit, or recently the /review function of Codex, iterate a few times
6. Check and verify manually - oftentimes there are still a few minor bugs in the implementation, but they can be fixed quickly - sometimes I just create a list of what I found and pass it back for fixing
With this workflow, I am getting extraordinary results.
And I assume there's no actual product that customers are using that we could also demo? Because only 1 out of every 20 or so claims of awesomeness actually has a demoable product to back up those claims. The 1 who does usually has immediate problems. Like an invisible text box rendered over the submit button on their Contact Us page preventing an onClick event for that button.
In case it wasn't obvious, I have gone from rabidly bullish on AI to very bearish over the last 18 months. Because I haven't found one instance where AI is running the show and things aren't falling apart in not-always-obvious ways.
I'm kind of in the same boat although the timeline is more compressed. People claim they're more productive and that AI is capable of building large systems but I've yet to see any actual evidence of this. And the people who make these claims also seem to end up spending a ton of time prompting to the point where I wonder if it would have been faster for them to write the code manually, maybe with copilot's inline completions.
I created these demos using real data, real API connections, and real databases, with 100% AI code, at http://betpredictor.io and https://pix2code.com; however, they barely work. At this point, I'm fixing 90% or more of every recommendation the AI gives. With your code base being this large, you can be guaranteed that the AI will not know what needs to be edited - but I still haven't written a single line of code by hand.
It is true that AI-generated UIs tend to be... weird. In weird ways. Sometimes they are consistent and work as intended, but oftentimes they reveal strange behaviors.
Or at least this was true until recently. GPT-5 is consistently delivering more coherent and better working UIs, provided I use it with shadcn or alternative component libraries.
So while you can generate a lot of code very fast, testing UX and UI is still manual work - at least for me.
I am pretty sure AI should not run the show. It is a sophisticated tool, but it is not a showrunner - not yet.
Nothing much weird about the SwiftUI UIs GPT-5-codex generates for me. And it adapts well to building reusable/extensible components and using my existing components instead of constantly reinventing, because it is good at reading a lot of code before putting in work.
It is also good at refactoring to consolidate existing code for reusability, which makes it easier to extend and change UI in the future. Now I worry less about writing new UI or copy/pasting UI because I know I can do the refactoring easily to consolidate.
>I am working on a project with ~200k LoC, entirely written with AI codegen.
I’d love to see the codebase if you can share.
My experience with LLM code generation (I’ve tried all of the popular models and tools, though I generally favor Claude Code with Opus and Sonnet) leads me to suspect that your ~200k LoC project could be solved in only about 10k LoC. Their solutions are unnecessarily complex (I’m guessing because they don’t “know” the problem the way a human does), and that compounds over time.
At this point, I would guess my most common instruction to these tools is to simplify the solution. Even when that’s part of the plan.
What is your opinion on the "right level of detail" we should use when creating the technical documents the LLM will use to implement features?
When I started leaning heavily into LLMs I was using really detailed documentation. Not '20 minutes of voice recordings', but my specification documents would easily hit hundreds of lines even for simple features.
The results were decent but extremely frustrating, because it would often deliver 80% to 90%, and the final 10% to 20% it could never get right.
So, what I naturally started doing was to care less about the details of the implementation and focus on the behavior I want. And this led me to simpler prompts, to the point that I don't feel the need to create a specification document anymore. I just use the plan mode in Claude Code and it is good enough for me.
One way I started to think about this was that really specific documentation was almost as if I was 'over-fitting' my solution over other technically viable solutions the model could come up with. One example: if I want to sort an array, I could ask for either "sort the array" or "merge sort the array", and by forcing a merge sort I may end up with a worse solution. Admittedly, sort is a pretty simple and unlikely example, but this could happen with any topic. You may ask the model to use a hash set when a better solution would be a Bloom filter.
Given all that, do you think investing so much time into your prompts provides a good ROI compared with the alternative of not really min-maxing every single prompt?
I tend to provide detailed PRDs, because even if the first couple of iterations of the coding agent are not perfect, it tends to be easier to get there (as opposed to having a vague prompt and move on from there).
What I do sometimes is an experimental run - especially when I am stuck. I express my high-level vision, and just have the LLM code it to see what happens. I do not do it often, but it has sometimes helped me get out of being mentally stuck with some part of the application.
Funnily enough, I am facing this problem right now, and your post might just have reminded me that sometimes a quick experiment can be better than two days of overthinking the problem...
This mirrors my experience with AI so far - I've arrived at mostly using the plan and implement modes in Claude Code with complete but concise instructions about the behavior I want with maybe a few guide rails for the direction I'd like to see the implementation path take. Use cases and examples seem to work well.
I kind of assumed that Claude Code is doing most of the things described in this document under the hood (but I really have no idea).
If it's working for you I have to assume that you are an expert in the domain, know the stack inside and out and have built out non-AI automated testing in your deployment pipeline.
And yes Step 3 is what no one does. And that's not limited to AI. I built a 20+ year career mostly around step 3 (after being biomed UNIX/Network tech support, sysadmin and programmer for 6 years).
Yes, I have over 2 decades of programming experience, 15 years working professionally. With my co-founder we built an entire B2B SaaS, coding everything from scratch, did product, support, marketing, sales...
Now I am building something new but in a very familiar domain. I agree my workflow would not work for your average "vibe coder".
Now with the MCP server, you can instruct the coding agent to use shadcn. I often say "If you need to add new UI elements, make sure to use shadcn and the shadcn component registry to find the best-fitting component"
The genius move is that the shadcn components are all based on Tailwind and get COPIED to your project. 95% of the time, the created UI views are just pixel-perfect, spacing is right, everything looks good enough. You can take it from here to personalize it more using the coding agent.
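For reference, hooking it up is just a small piece of config - I believe the registration looks roughly like this in a Claude Code-style .mcp.json (treat the exact command as an assumption and check the shadcn docs for the current one):

```json
{
  "mcpServers": {
    "shadcn": {
      "command": "npx",
      "args": ["shadcn@latest", "mcp"]
    }
  }
}
```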
I've had success here by simply telling Codex which components to use. I initially imported all the shadcn components into my project and then I just say things like "Create a card component that includes a scrollview component and in the scrollview add a table with a dropdown component in the third column"...and Codex just knows how to add the shadcn components. This is without internet access turned on by the way.
Don't want to come off as combative but if you code every day with codex you must not be pushing very hard, I can hit the weekly quota in <36 hours. The quota is real and if you're multi-piloting you will 100% hit it before the week is over.
Fair enough. I spend entire days working on the product, but obviously there are lots of times I am not running Codex - when reviewing PRDs, testing, talking to users, even posting on HN is good for the quota ;)
On the Pro tier? Plus/Team is only suitable for evaluating the tool and occasional help
Btw one thing that helps conserve context/tokens is to use GPT 5 Pro to read entire files (it will read more than Codex will, though Codex is good at digging) and generate plans for Codex to execute. Tools like RepoPrompt help with this (though it also looks pretty complicated)
> 1. Describe problem with lots of details (often recording 20-60 mins of voice, transcribe)
I just ask it to give me instructions for a coding agent and give it a small description of what I want to do; it looks at my code and details what I described as best it can, and usually that's enough to let Junie (JetBrains AI) run.
I can't personally justify $200 a month, I would need to see seriously strong results for that much. I use AI piecemeal because it has always been the best way to use it. I still want to understand the codebase. When things break it's mostly on you to figure out what broke.
A small description can be extrapolated into a large feature, but then you have to accept the AI filling in the gaps. Sometimes that is cool, oftentimes it misses the mark. I do not always record that much, but if I have a vague idea that I want to verbalize, I use a recording. Then I take the transcript and create the PRD based on it. Then I iterate a few more times on the PRD - which yields much better results.
How does it compare to Cursor with Claude? I’ve been really impressed with how well Cursor works, but I'm always interested in up-leveling if there are better tools, considering how fast this space is moving. Can you comment on how Codex performs vs Cursor?
Claude Code is Claude Code, whether you use it in Cursor or not
Codex and Claude code are neck and neck, but we made the decision to go all in on opus 4, as there are compounding returns in optimizing prompts and building intuition for a specific model
That said I have tested these prompts on codex, amp, opencode, even grok 4 fast via codebuff, and they still work decently well
But they are heavily optimized from our work with opus in particular
Which of these steps do you think/wish could be automated further? Most of the latter ones seem like throwing independent AI reviewers could almost fully automate it, maybe with a "notify me" option if there's something they aren't confident about? Could PRD review be made more efficient if it was able to color code by level of uncertainty? For 1, could you point it to a feed of customer feedback or something and just have the day's draft PRD up and waiting for you when you wake up each morning?
There is definitely way too much plumbing and going back and forth.
But one thing that MUST get better soon is having the AI agent verify its own code. There are a few solutions in place, e.g. using an MCP server to give access to the browser, but these tend to be brittle and slow. And for some reason, the AI agents do not like calling these tools too much, so you kinda have to force them every time.
PRD review can be done, but AI cannot fill the missing gaps the same way a human can. Usually, when I create a new PRD, it is because I have a certain vision in my head. For that reason, the process of reviewing the PRD can be optimized by maybe 20%. Or maybe I struggle to see how tools could make me faster at reading and commenting on / editing the PRD.
Agents __SHOULD NOT__ verify their own code. They know they wrote it, and they act biased. You should have a separate agent with instructions to red team the hell out of a commit, be strict, but not nitpick/bikeshed, and you should actually run multiple review agents with slightly different areas of focus since if you try to run one agent for everything it'll miss lots of stuff. A panel of security, performance, business correctness and architecture/elegance agents (armed with a good covering set of code context + the diff) will harden a PR very quickly.
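A rough sketch of what that panel can look like in practice, assuming your agent CLI has a headless/print mode (I use `claude -p` as a stand-in here; the focus areas and prompts are illustrative only):

```typescript
// Sketch: a "panel of reviewers" over one diff, each agent with a narrow focus.
// Assumes a headless invocation like `claude -p "<prompt>"` - substitute your
// agent's non-interactive mode.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

const focuses = [
  "security: injection, authz, secrets handling",
  "performance: N+1 queries, blocking I/O, unnecessary allocations",
  "business correctness: does the diff actually match the linked spec/PRD",
  "architecture: duplication, layering violations, API ergonomics",
];

async function reviewDiff(diff: string): Promise<string[]> {
  // Each reviewer runs independently with no shared context, so they can't
  // anchor on each other's (or the author's) reasoning.
  return Promise.all(
    focuses.map(async (focus) => {
      const prompt =
        `You are a strict code reviewer. Focus ONLY on ${focus}. ` +
        `Do not nitpick style. Report concrete issues with file and line.\n\n${diff}`;
      const { stdout } = await run("claude", ["-p", prompt]);
      return `## ${focus}\n${stdout}`;
    })
  );
}

// Example: review whatever is currently staged.
// (For very large diffs you'd pass a file path instead of stuffing argv.)
run("git", ["diff", "--cached"]).then(({ stdout }) =>
  reviewDiff(stdout).then((reports) => console.log(reports.join("\n\n")))
);
```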
Codex uses this principle - /review runs in a subthread, does not see previous context, only git diff. This is what I am using. Or I open Cursor to review code written by GPT-5 using Sonnet.
Do you have examples of this working, or any best practices on how to orchestrate it efficiently? It sounds like the right thing to do, but it doesn't seem like the tech is quite to the point where this could work in practice yet, unless I missed it. I imagine multiple agents would churn through too many tokens and have a hard time coming to a consensus.
This sounds very similar to my workflow. Do you have pre-commits or CI beyond testing? I’ve started thinking about my codebase as an RL environment with the pre-commits as hyperparameters. It’s fascinating seeing what coding patterns emerge as a result.
I think pre-commit is essential. I enforce conventional commits (+ a hook which limits the commit subject line to 50 chars) and, for Python, ruff with many options enabled. Perhaps the most important one is to enforce complexity limits. That will catch a lot of basic mistakes. Any sanity checks that you can make deterministic are a good idea. You could even add unit tests to pre-commit, but I think it's fine to have the model run pytest separately.
The models tend to be very good about syntax, but this sort of linting will often catch dead code like unused variables or arguments.
You do need to rule-prompt that the agent may need to run pre-commit multiple times to verify the changes worked, or to re-add files to the commit. Also, frustratingly, you need to be explicit that pre-commit might fail and it should fix the errors (otherwise sometimes it'll run and say "I ran pre-commit!"). For commits there are some other guardrails, like blanket denying git add <wildcard>.
Claude will sometimes complain via its internal monologue when it fails a ton of linter checks and is forced to write complete docstrings for everything. Sometimes you need to nudge it to not give up, and then it will act excited when the number of errors goes down.
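For anyone who wants a starting point, a minimal .pre-commit-config.yaml along those lines might look like the sketch below - the hook repos and revs are from memory, so verify them against upstream before relying on this:

```yaml
# Sketch only: check the ruff-pre-commit and conventional-pre-commit docs for
# current revs and options.
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9          # assumed version, pin to whatever is current
    hooks:
      - id: ruff         # lint; complexity (C901) is enabled via pyproject.toml
      - id: ruff-format
  - repo: https://github.com/compilerla/conventional-pre-commit
    rev: v3.4.0          # assumed version
    hooks:
      - id: conventional-pre-commit
        stages: [commit-msg]
  # A small local hook would enforce the 50-char commit subject limit.

# And in pyproject.toml, roughly:
# [tool.ruff.lint]
# extend-select = ["C901"]
# [tool.ruff.lint.mccabe]
# max-complexity = 10
```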
Very solid advice. I need to experiment more with the pre-commit stuff, I am a bit tired of reminding the model to actually run tests / checks. They seem to be as lazy about testing as your average junior dev ;)
Yes, I do have automated linting (a bit of a PITA at this scale).
On the CI side I am using Github Actions - it does the job, but haven't put much work into it yet.
Generally I have observed that using a statically typed language like TypeScript helps catch issues early on. I had much worse results with Ruby.
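As a toy illustration of the kind of drift the compiler catches immediately (the types and names below are made up, not from my codebase):

```typescript
// A refactor renames `total` to `totalCents`; tsc flags every stale call site
// at compile time, where Ruby would only fail at runtime (if a test hits it).
interface Invoice {
  totalCents: number;
}

function formatInvoice(inv: Invoice): string {
  // return `$${(inv.total / 100).toFixed(2)}`;   // old code: now a compile error
  return `$${(inv.totalCents / 100).toFixed(2)}`;
}

console.log(formatInvoice({ totalCents: 12999 })); // "$129.99"
```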
Have you considered or tried adding steps to create / review an engineering design doc? Jumping straight from PRD to a huge code change seems scary. Granted, given that it's fast and cheap to throw code away and start over, maybe engineering design is a thing of the past. But still, it seems like it would be useful to have it delineate the high-level decisions and tradeoffs before jumping straight into code; once the code is generated it's harder to think about alternative approaches.
Adding an additional layer slows things down. So the tradeoff must be worth it.
Personally, I would go without a design doc, unless you work on a mission-critical feature humans MUST specify or deeply understand. But this is my gut speaking, I need to give it a try!
Yeah I'd love to hear more about that. Like the way I imagine things working currently is "get requirement", "implement requirement", more or less following existing patterns and not doing too much thinking or changing of the existing structure.
But what I'd love to see is, if it has an engineering design step, could it step back and say "we're starting to see this system evolve to a place where a <CQRS, event-sourcing, server-driven-state-machine, etc> might be a better architectural match, and so here's a proposal to evolve things in that direction as a first step."
Something like Kent Beck's "for each desired change, make the change easy (warning: this may be hard), then make the easy change." If we can get to a point where AI tools can make those kinds of tradeoffs, that's where I think things get slightly dangerous.
OTOH if AI models are writing all the code, and AI models have contexts that far exceed what humans can keep in their head at once, then maybe for these agents everything is an easy change. In which case, well, I guess having human SWEs in the loop would do more harm than good at that point.
I can recommend one more thing: tell the LLM frequently to "ask me clarifying questions". It's simple, but the effect is quite dramatic, it really cuts down on ambiguity and wrong directions without having to think about every little thing ahead of time.
The "ask my clarifying questions" can be incredibly useful. It often will ask me things I hadn't thought of that were relevant, and it often suggests very interesting features.
As for when/where to do it? You can experiment. I do it after step 1.
Not OP, but I use Codex for back-end, scripting, and SQL, and Claude Code for most front-end. I have found that when one faces a challenge, the other can often punch through and solve the problem. I even have them work together (moving thoughts and markdown plans back and forth) and that works wonders.
My progression: Cursor in '24, Roo Code in mid '25, Claude Code in Q2 '25, Codex CLI in Q3 '25.
Yes, it is a web project with next.js + Typescript + Tailwind + Postgres (Prisma).
I started with Cursor, since it offers a well-rounded IDE with everything you need. It also used to be the best tool for the job. These days Codex + GPT-5-Codex is king. But I sometimes go back to Cursor, especially when reading / editing the PRDs or if I need the occasional second opinion from Claude.
Then I instruct the coding agent to use shadcn / choose the right component from shadcn component registry
The MCP server has a search / discovery tool, and it can also fetch individual components. If you tell the AI agent to use a specific component, it will fetch it (reference doc here: https://ui.shadcn.com/docs/components)
I would not call it vibe coding. But I do not check all changed lines of code either.
In my opinion, and this is really my opinion, in the age of coding with AI, code review is changing as well. If you speed up how much code can be produced, you need to speed up code review accordingly.
I use automated tools most of the time AND I do very thorough manual testing. I am thinking about a more sophisticated testing setup, including integration tests via using a headless browser. It definitely is a field where tooling needs to catch up.
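As a sketch of the direction I mean, even a single Playwright smoke test catches the "invisible element over the submit button" class of bug mentioned elsewhere in this thread (the URL and selectors below are made up):

```typescript
import { test, expect } from "@playwright/test";

test("contact form submit button is actually clickable", async ({ page }) => {
  await page.goto("http://localhost:3000/contact");

  const submit = page.getByRole("button", { name: "Submit" });
  await expect(submit).toBeVisible();

  // click() runs actionability checks and fails if another element
  // intercepts the pointer - exactly the invisible-overlay failure mode.
  await submit.click();

  await expect(page.getByText("Thanks for reaching out")).toBeVisible();
});
```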
And this, my friends, is why software engineering is going down the drain. We've made our profession a joke. Can you imagine an architect or civil engineer speaking like this?
These kind of people make me want to change to a completely new discipline.
Strong feelings are fair, but the architect analogy cuts the other way. Architects and civil engineers do not eyeball every rebar or hand compute every load. They probably use way more automation than you would think.
I do not claim this is vibe coding, and I do not ship unreviewed changes to safety critical systems (in case this is what people think). I claim that in 2025 reviewing every single changed line is not the only way to achieve quality at the scale that AI codegen enables. The unit of review is shifting from lines to specifications.
You were never an engineer. I'm 18 years into my career in web and games and I was never an engineer. It's blind people leading blind people, and you're somewhere in the middle, riding the 2013 patterns that got you to this point and the 2024 advancements called "vibe coding", and you get paid $$ to make it work.
Building a bridge from steel that lasts 100 years and carries real living people in the tens or hundreds of thousands per day without failing under massive weather spikes is engineering.
We've all been waiting for the other shoe to drop. Everyone points out that reviewing code is more difficult than writing it. The natural question is, if AI is generating thousands of lines of code per day, how do you keep up with reviewing it all?
The answer: you don't!
Seems like this reality will become increasingly justified and embraced in the months to come. Really though it feels like a natural progression of the package manager driven "dependency hell" style of development, except now it's your literal business logic that's essentially a dependency that has never been reviewed.
My process is probably more robust than simply reviewing each line of code. But hey, I am not against doing it, if that is your policy. I had worked the old-fashioned way for over 15 years, I know exactly what pitfalls to watch out for.
It is a fairly standardized way of capturing the essence of a new feature. It covers the most important aspects of what the feature is about: the goals, the success criteria, even implementation details where it makes sense.
If there is interest, I can share the outline/template of my PRDs.
Wow, very nice. Thank you. That's very well thought out.
I'm particularly intrigued by the large bold letters: "Success must be verifiable by the AI / LLM that will be writing the code later, using tools like Codex or Cursor."
May I ask, what your testing strategy is like?
I think you've encapsulated a good best practices workflow here in a nice condensed way.
I'd also be interested to know how you handle documentation but don't want to bombard you with too many questions
I added that line because otherwise the LLM would generate goals that are not verifiable in development (e.g. certain pages rendering in under 300ms - this is not something you can test on your local machine).
Documentation is a different topic - I have not yet found how to do it correctly. But I am reading about it and might soon test some ideas to co-generate documentation based on the PRD and the actual code. The challenge being, the code normally evolves and drifts away from the original PRD.
Programming has always had these steps, but traditionally people with different roles would do different parts of it, like gathering requirements, creating product concept, creating development tickets, coding, testing and so on.
I’m not an expert in either language, but seeing a 20k LoC PR go up (linked in the article) would be an instant “lgtm, asshole” kind of review.
> I had to learn to let go of reading every line of PR code
Ah. And I’m over here struggling to get my teammates to read lines that aren’t in the PR.
Ah well, if this stuff works out it’ll be commoditized like the author said and I’ll catch up later. Hard to evaluate the article given the authors financial interest in this succeeding and my lack of domain expertise.
I'm the owner of some of my work projects/repos. I will absolutely without a 2nd thought close a 20k LoC PR, especially an AI generated one, because the code that ends up in master is ultimately my responsibility. Unless it's something like a repo-wide linter change or whatever, there's literally never a reason to have such a massive PR. Break it down, I don't care if it ends up being 200 disparate PRs, that's actually possible to properly review compared to a single 20k line PR.
Dumping a 20k LOC PR on somebody to review especially if all/a lot of it was generated with AI is disrespectful. The appropriate response is to kick that back and tell them to make it more digestible.
Dumping a huge PR on a shared codebase, where everyone else also has to deal with the risk of your monumental changes, is pretty rude as well. I would even go so far as to say it is selfishly risky.
If somebody did this, it means they ignored their team's conventions and offloaded work onto colleagues for their own convenience. Being considered rude by the offender is not a concern of mine when dealing with a report who pulls this kind of antisocial crap.
A 20k LOC PR isn’t reviewable in any normal workflow/process.
The only moves are refusing to review it, taking it up the chain of authority, or rubber stamping it with a note to the effect that it’s effectively unreviewable so rubber stamping must be the desired outcome.
I don't understand this attitude. Tests are important parts of the codebase. Poorly written tests are a frequent source of headaches in my experience, either by encoding incorrect assumptions, lying about what they're testing, giving a false sense of security, adding friction to architectural changes/refactors, etc. I would never want to review even 2k lines of test changes in one go.
Preach. Also, don't forget making local testing/CI take longer to run, which costs you both compute and developer context switching.
I've heard people rave about LLMs for writing tests, so I tried having Claude Code generate some tests for a bug I fixed in some autosave functionality (every 200ms, the auto-saver should initiate a save if the last change was in the previous 200ms). Claude wrote five tests that each waited 200ms (!), adding a needless entire second to the run-time of my test suite.
I went in to fix it by mocking out time, and in the process realized that the feature was doing timestamp comparisons when a simpler, less error-prone approach was to increment a logical clock for each change instead.
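For illustration, here is roughly what the mocked-time version of such a test looks like with vitest fake timers - the Autosaver class below is a made-up stand-in for the real feature, not the actual code:

```typescript
import { describe, it, expect, vi, afterEach } from "vitest";

// Hypothetical stand-in: every 200ms, save if anything changed since last tick.
class Autosaver {
  private dirty = false;
  constructor(private onSave: () => void, intervalMs = 200) {
    setInterval(() => {
      if (this.dirty) {
        this.dirty = false;
        this.onSave();
      }
    }, intervalMs);
  }
  markChanged() {
    this.dirty = true;
  }
}

describe("Autosaver", () => {
  afterEach(() => vi.useRealTimers());

  it("saves after a change without any real waiting", () => {
    vi.useFakeTimers();
    const onSave = vi.fn();
    const saver = new Autosaver(onSave);

    saver.markChanged();
    vi.advanceTimersByTime(200); // advances virtual time, no wall-clock delay
    expect(onSave).toHaveBeenCalledTimes(1);

    vi.advanceTimersByTime(200); // no new changes -> no extra save
    expect(onSave).toHaveBeenCalledTimes(1);
  });
});
```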
The tests I've seen Claude write vary from junior-level to flat-out-bad. Tests are often the first consumer of a new interface, and delegating them to an LLM means you don't experience the ergonomics of the thing you just wrote.
I think the general takeaway from all of this is that the model can write the code but you still have to design it. I don't disagree with anything you've said, and I'd say my advice is: engage more, iterate more, and work in small steps to get the right patterns and rules laid out. It won't work well on day one if you don't set up the right guidelines and guardrails. That's why it's still software engineering, despite being a different interaction medium.
And if the 10k lines of tests are all garbage, now what? Because tests are the 1 place you absolutely should not delegate to AI outside of setting up the boilerplate/descriptions.
I’d rather be behind the curve enjoying my craft than producing crap and admitting I don’t even read every line of code in a PR.
The fact that people can trust the output of an LLM which is known to not be accurate and ship code is mind boggling. We have turned our profession into a joke.
I built a package which I use for large codebase work[0].
It starts with /feature, and takes a description. Then it analyzes the codebase and asks questions.
Once I’ve answered the questions, it writes a plan in markdown. There will be 8-10 markdown files with descriptions of what it wants to do and full code samples.
Then it does a “code critic” step where it looks for errors. Importantly, this code critic is wrong about 60% of the time. I review its critique and erase a bunch of dumb issues it’s invented.
By that point, I have a concise folder of changes along with my original description, and it’s been checked over. Then all I do is say “go” to Claude Code and it’s off to the races doing each specific task.
This helps it keep from going off the rails, and I’m usually confident that the changes it made were the changes I wanted.
I use this workflow a few times per day for all the bigger tasks and then use regular Claude code when I can be pretty specific about what I want done. It’s proven to be a pretty efficient workflow.
It seems we're still collectively trying to figure out the boundaries of "delegation" versus "abstraction" which I personally don't think are the same thing, though they are certainly related and if you squint a bit you can easily argue for one or the other in many situations.
> We've gotten claude code to handle 300k LOC Rust codebases, ship a week's worth of work in a day, and maintain code quality that passes expert review.
This seems more like delegation just like if one delegated a coding task to another engineer and reviewed it.
> That in two years, you'll be opening python files in your IDE with about the same frequency that, today, you might open up a hex editor to read assembly (which, for most of us, is never).
This seems more like abstraction just like if one considers Python a sort of higher level layer above C and C a higher level layer above Assembly, except now the language is English.
I would say its much more about abstraction and the leverage abstractions give you.
You'll also note that while I talk about "spec driven development", most of the tactical stuff we've proven out is downstream of having a good spec.
But in the end a good spec is probably "the right abstraction" and most of these techniques fall out as implementation details. But to paraphrase Sandi Metz - better to stay in the details than to accidentally build against the wrong abstraction (https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction)
I don't think delegation is right - when Vaibhav and I shipped a week's worth of work in a day, we were DEEPLY engaged with the work. We didn't step away from the desk, we were constantly re-steering, and we probably sent 50+ user messages that day, in addition to some point-edits to markdown files along the way.
A few weeks later, @hellovai and I paired on shipping 35k LOC to BAML, adding cancellation support and WASM compilation - features the team estimated would take a senior engineer 3-5 days each.
Sorry, had they effectively estimated that an engineer should produce 4-6KLOC per day (that's before genAI)?
This article bases its argument on the premise that AI _at worst_ will increase developer productivity by 0-10%. But several studies have found that not to be true at all. AI can, and does, make some people less effective
There's also the more insidious gap between perceived productivity and actual productivity. Doesn't help that nobody can agree on how to measure productivity even without AI.
"AI can, and does, make some people less effective"
So those people should either stop using it or learn to use it productively. We're not doomed to live in a world where programmers start using AI, lose productivity because of it and then stay in that less productive state.
If managers are convinced by stakeholders who relentlessly put out pro-"AI" blog posts, then a subset of programmers can be forced to at least pretend to use "AI".
They can be forced to write in their performance evaluation how much (not if, because they would be fired) "AI" has improved their productivity.
Both (1) "AI can, and does, make some people less effective" and (2) "the average productivity boost (~20%) is significant" (per Stanford's analysis) can be true.
The article at the link is about how to use AI effectively in complex codebases. It emphasizes that the techniques described are "not magic", and makes very reasonable claims.
the techniques described sound like just as much work, if not more, than just writing the code. the claimed output isn't even that great, it's comparable to the speed you would expect a skilled engineer to move at in a startup environment
> the techniques described sound like just as much work, if not more, than just writing the code.
That's very fair, and I believe that's true for you and for many experienced software developers who are more productive than the average developer. For me, AI-assisted coding is a significant net win.
Yet a lot of people never bother to learn vim, and are still outstanding and productive engineers. We're surely not seeing any memos "Reflexive vim usage is now a baseline expectation at [our company]" (context: https://x.com/tobi/status/1909251946235437514)
The as-of-yet unanswered question is: Is this the same? Or will non-LLM-using engineers be left behind?
Perhaps if we get the proper thought influencers on board we can look forward to C-suite VI mandates where performance reviews become descriptions of how we’ve boosted our productivity 10x with effective use of VI keyboard agents, the magic of g-prefixed VI technology, VI-power chording, and V-selection powered column intelligence.
Question for discussion - what steps can I take as a human to set myself up for success where success is defined by AI made me faster, more efficient etc?
In many cases (though not all) it's the same thing that makes for great engineering managers:
smart generalists with a lot of depth in maybe a couple of things (so they have an appreciation for depth and complexity) but a lot of breadth so they can effectively manage other specialists,
and having great technical communication skills - be able to communicate what you want done and how without over-specifying every detail, or under-specifying tasks in important ways.
>where success is defined by AI made me faster, more efficient etc?
I think this attitude is part of the problem to me; you're not aiming to be faster or more efficient (and using AI to get there), you're aiming to use AI (to be faster and more efficient).
A sincere approach to improvement wouldn't insist on a tool first.
> It was uncomfortable at first. I had to learn to let go of reading every line of PR code. I still read the tests pretty carefully, but the specs became our source of truth for what was being built and why.
This is exactly right. Our role is shifting from writing implementation details to defining and verifying behavior.
I recently needed to add recursive uploads to a complex S3-to-SFTP Python operator that had a dozen path manipulation flags. My process was:
* Extract the existing behavior into a clear spec (i.e., get the unit tests passing).
* Expand that spec to cover the new recursive functionality.
* Hand the problem and the tests to a coding agent.
I quickly realized I didn't need to understand the old code at all. My entire focus was on whether the new code was faithful to the spec. This is the future: our value will be in demonstrating correctness through verification, while the code itself becomes an implementation detail handled by an agent.
> Our role is shifting from writing implementation details to defining and verifying behavior.
I could argue that our main job was always that - defining and verifying behavior. As in, it was a large part of the job. Time spent writing implementation details has always been on a downward trend via higher-level languages, compilers, and other abstractions.
> My entire focus was on whether the new code was faithful to the spec
This may be true, but see Hyrum's Law, which says that the observed behavior of a heavily-used system becomes its public interface and specification, with all its quirks and implementation errors. It may be important to keep testing that the clients using the code are also faithful to the spec, and to detect and handle discrepancies.
Claude Plays Pokemon showed that too. AI is bad at deciding when something is "working" - it will go in circles forever. But an AI combined with a human to occasionally course correct is a powerful combo.
If you haven't tried the research -> plan -> implementation approach here, you are missing out on how good LLMs are. It completely changed my perspective.
The key part was really just explicitly thinking about different levels of abstraction at different levels of vibecoding. I was doing it before, but not explicitly in discrete steps, and that was where I got into messes. The prior approach made checkpointing / reverting very difficult.
When I think of everything in phases, I do similar stuff with my git commits at "phase" levels, which makes design decisions easier to make.
I also spend ~4-5 hours cleaning up the code at the very, very end once everything works. But it's still way faster than writing hard features myself.
> but not explicitly in discrete steps and that was where i got into messes.
I've said this repeatedly: I mostly use it for boilerplate code, or when I'm having a brain fart of sorts. I still love to solve things for myself, but AI can take me from "I know I want x, y, z" to "oh look, I got to x, y, z" in under 30 minutes, when it could have taken hours. For side projects this is fine.
I think if you do it piecemeal it should almost always be fine. When you try to tell it to do too much, you and the model both fail to consider edge cases (ask it for those too!) and are more prone to a rude awakening eventually.
tbh I think the thing that's making this new approach so hard to adopt for many people is the word "vibecoding"
Like yes vibecoding in the lovable-esque "give me an app that does XYZ" manner is obviously ridiculous and wrong, and will result in slop. Building any serious app based on "vibes" is stupid.
But if you're doing this right, you are not "coding" in any traditional sense of the word, and you are *definitely* not relying on vibes
I'm sticking to the original definition of "vibe coding", which is AI-generated code that you don't review.
If you're properly reviewing the code, you're programming.
The challenge is finding a good term for code that's responsibly written with AI assistance. I've been calling it "AI-assisted programming" but that's WAY too long.
It's strange that the author is bragging that this 35K LOC was researched and implemented in 7 hours, when there are 40 commits spanning 7 days. Was it 1 hour per day or what?
Also quite funny that one of the latest commits is "ignore some tests" :D
FWIW I think your style is better and more honest than most advocates. But I'd really love to see some examples of things that completely failed. Because there have to be some, right? But you hardly ever see an article from an AI advocate about something that failed, nor from an AI skeptic about something that succeeded. Yet I think these would be the types of things that people would truly learn from. But maybe it's not in anyone's financial interest to cross borders like that, for those who are heavily vested in the ecosystem.
You do acknowledge this but this doesn't make the "spent 7 hours and shipped 35k LOC" claim factually correct or true. It sure sounds good but it's disingenuous, because shipping != making progress. Shipping code means deploying it to the end users.
I'm always amazed when I see xKLOC metrics being thrown around like they matter somehow. The bar has always been shipped code. If it's not being used, it's merely a playground or learning exercise.
I've used this pattern on two separate codebases. One was a ~500k LOC Apache Airflow monolith repo (I am a data engineer). The other was a greenfield Flutter side project (I don't know Dart, Flutter, or really much of anything regarding mobile development).
All I know is that it works. On the greenfield project the code is simple enough to mostly just run `/create_plan` and skip research altogether. You still get the benefit of the agents and everything.
The key is really truly reviewing the documents that the AI spits out. Ask yourself if it covered the edge cases that you're worried about or if it truly picked the right tech for the job. For instance, did it break out of your sqlite pattern and suggest using postgres or something like that. These are very simple checks that you can spot in an instant. Usually chatting with the agent after the plan is created is enough to REPL-edit the plan directly with claude code while it's got it all in context.
At my day job I've got to use github copilot, so I had to tweak the prompts a bit, but the intentional compaction between steps still happens, just not quite as efficiently because copilot doesn't support sub-agents in the same way as claude code. However, I am still able to keep productivity up.
-------
A personal aside.
Immediately before AI assisted coding really took off, I started to feel really depressed that my job was turning into a really boring thing for me. Everything just felt like such a chore. The death by a million paper cuts is real in a large codebase with the interplay and idiosyncrasies of multiple repos, teams, personalities, etc. The main benefit of AI assisted coding for me personally seems to be smoothing over those paper cuts.
I derive pleasure from building things that work. Every little thing that held up that ultimate goal was sucking the pleasure out of the activity that I spent most of my day trying to do. I am much happier now having impressed myself with what I can build if I stick to it.
Verifying behavior is great and all if you can actually exhaustively test the behaviors of your system. If you can't, then not knowing what your code is actually doing is going to set you back when things do go belly up.
I love this comment because it makes perfect sense today, it made perfect sense 10 years ago, it would have made perfect sense in 1970. The principles of software engineering are not changed by the introduction of commodified machine intelligence.
This article is like a bookmark in time of where I exactly gave up (in July) managing context in Claude code.
I made specs for every part of the code in a separate folder, which also held logs for every feature I worked on. It was an API server in Python with many services like accounts, notifications, subscriptions, etc.
It got to the point where managing context became extremely challenging. Claude could not determine the business logic properly, and it can get complex - e.g. a simple RBAC system with an account and a profile, plus a junction table for roles joining accounts to profiles. In the end, what kind of worked was giving it UML diagrams of the relationships, with examples, to make it understand and behave better.
My problem is it keeps working, even when it reaches certain things it doesn't know how to do.
I've been experimenting with Github agents recently, they use GPT-5 to write loads of code, and even make sure it compiles and "runs" before ending the task.
Then you go and run it and it's just garbage, yeah it's technically building and running "something", but often it's not anything like what you asked for, and it's splurged out so much code you can't even fix it.
> Context has never been the bottleneck for me. AI just stops working when I reach certain things that AI doesn't know how to do.
It's context all the way down. That just means you need to find and give it the context to enable it to figure out how to do the thing. Docs, manuals, whatever.
Same stuff that you would use to enable a human that doesn't know how to do it to figure out how.
In my limited experiments with Gemini: it stops working when presented with a program containing fundamental concurrency flaws. Ask it to resolve a race condition or deadlock and it will flail, eventually getting caught in a loop, suggesting the same unhelpful remedies over and over.
I imagine this has to do with concurrency requiring conceptual and logical reasoning, which LLMs are known to struggle with about as badly as they do with math and arithmetic. Now, it's possible that the right language to work with the LLM in these domains is not program code but a spec language like TLA+. However, at that point, I'd probably just spend less effort and write the potentially tricky concurrent code myself.
Anything it has not been trained on. Try getting AI to use OpenAI's responses API. You will have to try very hard to convince it not to use the chat completions API.
yeah once again you need the right context to override what's in the weights. It may not know how to use the responses api, so you need to provide examples in context (or tools to fetch them)
> And yeah sure, let's try to spend as many tokens as possible
It'd be nice if the article included the cost for each project. A 35k LOC change in a 350k codebase with a bunch of back and forth and context rewriting over 7 hours, would that be a regular subscription, max subscription, or would that not even cover it?
There are a lot of people declaring this, proclaiming that about working with AI, but nobody presents the details. Talk is cheap, show me the prompts. What will be useful is to check in all the prompts along with code. Every commit generated by AI should include a prompt log recording all the prompts that led to the change. One should be able to walkthrough the prompt log just as they may go through the commit log and observe firsthand how the code was developed.
Can't agree with the formula for performance, on the "/ size" part. You can have a huge codebase, but if the complexity goes up with size then you are screwed. Wouldn't a huge but simple codebase be practical and fine for AI to deal with?
The hierarchy of leverage concept is great! Love it. (Can't say I like the claim that 1 bad line of CLAUDE.md equals 100K lines of bad code; I've had some bad lines in my CLAUDE.md from time to time - I almost always let Claude write its own CLAUDE.md.)
I mean, there's also the fact that Claude Code wraps your claude.md in this system reminder when it injects it into context, which means that even if your claude.md sucks you will probably be okay:
<system-reminder>
IMPORTANT: this context may or may not be relevant to your tasks. You should not respond to this context or otherwise consider it in your response unless it is highly relevant to your task.
Most of the time, it is not relevant.
</system-reminder>
Lots of others have written about this so I won't go deep, but it's a clear product decision. If you don't know what's in your context window, you can't respond / architect your balance between claude.md and /commands well.
Maybe I am just misunderstanding. I probably am; seems like it happens more and more often these days
But.. I hate this. I hate the idea of learning to manage the machine's context to do work. This reads like a lecture in an MBA class about managing certain types of engineers, not like an engineering doc.
Never have I wanted to manage people. And never have I even considered my job would be to find the optimum path to the machine writing my code.
Maybe firmware is special (I write firmware)... I doubt it. We have a cursor subscription and are expected to use it on production codebases. Business leaders are pushing it HARD. To be a leader in my job, I don't need to know algorithms, design patterns, C, make, how to debug, how to work with memory mapped io, what wear leveling is, etc.. I need to know 'compaction' and 'context engineering'
I feel like a ship corker inspecting a riveted hull
Guess it boils down to personality, but I personally love it. I got into coding later in life, coming from a career that involved reading and writing voluminous amounts of text in English. I got into programming because I wanted to build web applications, not out of any love for the process of programming in and of itself. The less I have to think and write in code, the better. I am much happier reading and reviewing it than writing it myself.
No one likes programming that much. That's like saying someone loves speaking English. You have an idea and you express it. Sometimes there's additional complexity that gets in the way (initializing the library, memory cleanup, ...), but I put that at the same level as proper greetings in a formal letter.
It also helps to start small, get something useful done, and iterate by adding more features over time (or keeping it small).
Honestly - if it's such a good technique it should be built into the tool itself. I think just waiting for the tools to mature a bit will mean you can ignore a lot of the "just do xyz" crap.
It's not at senior engineer level until it asks relevant questions about lacking context instead of blindly trying to solve problems IMO.
I've started to use agents on some very low-level code, and have middling results. For pure algorithmic stuff, it works great. But I asked it to write me some arm64 assembly and it failed miserably. It couldn't keep track of which registers were which.
I used to do these things manually in Cursor. Then I had to take a few months off programming, and when I came back and updated Cursor I found out that it now automatically does ToDos, as well as keeps track of the context size and compresses it automatically by summarising the history when it reaches some threshold.
With this I find that most of the shenanigans of manual context window managing with putting things in markdown files is kind of unnecessary.
You still need to make it plan things, as well as guide the research it does to make sure it gets enough useful info into the context window, but in general it now seems to me like it does a really good job with preserving the information. This is with Sonnet 4
Except for ofc pushing their own product (humanlayer) and some very complex prompt template+agent setups that are probably overkill for most, the basics in this post about compaction and doing human review at the correct level are pretty good pointers. And giving a bit of a framework to think within is also neat
1. Go's spec and standard practices are more stable, in my experience. This means the training data is tighter and more likely to work.
2. Go's types give the llm more information on how to use something, versus the python model.
3. Python has been an entry-level accessible language for a long time. This means a lot of the code in the training set is by amateurs. Go, ime, is never someone's first language, so you effectively only get code from people who already have other programming experience.
4. Go doesn't do much 'weird' stuff. It's not hard to wrap your head around.
yeah i love that there is a lot of source data for "what is good idiomatic go" - the model doesn't have it all in the training set but you can easily collect coding standards for go with deep research or something
And then I find models try to write scripts/manual workflows for testing, but Go is REALLY good for doing what you might do in a bash script, and so you can steer the model to build its own feedback loop as a harness in go integration tests (we do a lot of this in github.com/humanlayer/humanlayer/tree/main/hld)
2. Write down the principles and assumptions behind the design and keep them current
In other words, the same thing successful human teams on complex projects do! Have we become so addicted to “attention-deficit agile” that this seems like a new technique?
Imagine, detailed specs, design documents, and RFC reviews are becoming the new hotness. Who would have thought??
Yeah, it's kinda funny how some bigger, more sophisticated eng orgs that would be called "slow and ineffective" by smaller teams are actually pretty dang well set up to leverage AI.
All because they have been forced to master technical communication at scale.
but the reason I wrote this (and maybe a side effect of the SF bubble) is MOST of the people I have talked to, from 3-person startups to 1000+ employee public companies, are in a state where this feels novel and valuable, not a foregone conclusion or something happening automatically
Hello, I noticed your privacy policy is a black page with text seemingly set to 1% or so opacity. Can you get the slopless AI to fix that when time permits?
Thanks for sharing. I wonder how you keep the stylistic and mental alignment of the codebase - does this happen during code review, or are there specific instructions at the plan/implement stages?
> Heck even Amjad was on a lenny's podcast 9 months ago talking about how PMs use Replit agent to prototype new stuff and then they hand it off to engineers to implement for production.
> Within an hour or so, I had a PR fixing a bug which was approved by the maintainer the next morning
An hour for 14 lines of code. Not sure how this shows any productivity gain from AI. It's clear that it's not the code writing that is the bottleneck in a task like this.
Looking at the "30K lines" features, the majority of the 30K lines are either auto-generated code (not by AI), or documentation. One of them is also a PoC and not merged...
We're taking a profession that attracts people who enjoy a particular type of mental stimulation, and transforming it into something that most members of the profession just fundamentally do not enjoy.
If you're a business leader wondering why AI hasn't super charged your company's productivity, it's at least partly because you're asking people to change the way they work so drastically, that they no longer derive intrinsic motivation from it.
Because now your manager will measure you on LOC against other engineers again, and it's only software engineers who worry about complexity, maintainability, and, in summary, the health of the very creature that pays your salary.
This is the new world we live in. Anyone who actually likes coding should seriously look for other venues because this industry is for other type of people now.
I use AI in my job. It went from tolerable (not doing anything fancy) to unbearable.
I'm actually looking to become a council employee with a boring job and code my own stuff, because if this is what I have to do moving forward, I'd rather go back to non-coding jobs.
I strongly disagree with this - if anything, using AI to write real production code in a real, complex codebase is MORE technical than just writing software.
Staff/Principal engineers already spend a lot more time designing systems than writing code. They care a lot about complexity, maintainability, and good architecture.
The best people I know who have been using these techniques are former CTOs, former core Kubernetes contributors, have built platforms for CRDTs at scale, and many other HIGHLY technical pursuits.
What BS. I’d love to see evidence of your claims. Otherwise it’s just claims.
Clearly there are a lot of people in this post who have some incentive for this AI BS to work. We have people defending 20k LOC PRs and admitting they don't check every line.
What have we come to. Time to change careers than work with these people.
Was the axe or the chainsaw designed in such a way that guarantees that it will definitely miss the log and hit your hand fair amount of the times you use it? If it were, would you still use it? Yes, these hand tools are dangerous, but they were not designed so that it would probably cut off your hand even 1% of the time. "Accidents happen" and "AI slop" are not even remotely the same.
So then with "AI" we're taking a tool that is known to "hallucinate", and not infrequently. So let's put this thing in charge of whatever-the-fuck we can?
I have no doubt "AI" will someday be embedded inside a "smart chainsaw", because we as humans are far more stupid than we think we are.
Even if we had perfectly human-level AI it'd still need management, just like human workers do, and turns out effective management is actually nontrivial.
It is also good at refactoring to consolidate existing code for reusability, which makes it easier to extend and change UI in the future. Now I worry less about writing new UI or copy/pasting UI because I know I can do the refactoring easily to consolidate.
Note: using it for my B2B e-commerce
I’d love to see the codebase if you can share. My experience with LLM code generation (I’ve tried all of the popular models and tools, though I generally favor Claude Code with Opus and Sonnet) leads me to suspect that your ~200k LoC project could be solved in only about 10k LoC. Their solutions are unnecessarily complex (I’m guessing because they don’t “know” the problem in the way a human does), and that compounds over time. At this point, I would guess my most common instruction to these tools is to simplify the solution. Even when that’s part of the plan.
When I started leaning heavily into LLMs I was using really detailed documentation. Not '20 minutes of voice recordings', but my specification documents would easily hit hundreds of lines even for simple features.
The result was decent, but extremely frustrating: it would often deliver 80% to 90%, but it could never get the final 10% to 20% right.
So, what I naturally started doing was to care less about the details of the implementation and focus on the behavior I want. And this led me to simpler prompts, to the point that I don't feel the need to create a specification document anymore. I just use the plan mode in Claude Code and it is good enough for me.
One way I started to think about this was that really specific documentation was almost 'over-fitting' my solution over other technically viable solutions the model could come up with. For example, if I want to sort an array, I could ask for "sort the array" or "merge sort the array", and by forcing a merge sort I may end up with a worse solution. Admittedly, sort is a simple and unlikely example, but this can happen with any topic: you may ask the model to use a hash set when a better solution would be a bloom filter. A small sketch of the idea follows below.
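To make that concrete, here is a small illustration I put together (the function names and the price example are mine, not from any project discussed here): over-specifying the algorithm versus describing the behavior you actually want.

    # Prompt: "merge sort the array" - the model dutifully hand-rolls merge sort.
    def merge_sort(items):
        if len(items) <= 1:
            return items
        mid = len(items) // 2
        left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        return merged + left[i:] + right[j:]

    # Prompt: "return the items sorted by price" - the model is free to reach for
    # the idiomatic built-in (Timsort, implemented in C) instead.
    def items_by_price(items):
        return sorted(items, key=lambda item: item["price"])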
Given all that, do you think investing so much time into your prompts provides a good ROI compared with the alternative of not really min-maxing every single prompt?
I tend to provide detailed PRDs, because even if the first couple of iterations of the coding agent are not perfect, it tends to be easier to get there (as opposed to starting from a vague prompt and iterating from there).
What I do sometimes is an experimental run - especially when I am stuck. I express my high-level vision, and just have the LLM code it to see what happens. I do not do it often, but it has sometimes helped me get out of being mentally stuck with some part of the application.
Funnily, I am facing this problem right now, and your post might just have reminded me that sometimes a quick experiment can be better than 2 days of overthinking the problem...
I kind of assumed that Claude Code is doing most of the things described in this document under the hood (but I really have no idea).
This is everyone's experience if they don't have a vested interest in LLM's, or if their domain is low risk (e.g., not regulated).
And yes Step 3 is what no one does. And that's not limited to AI. I built a 20+ year career mostly around step 3 (after being biomed UNIX/Network tech support, sysadmin and programmer for 6 years).
Now I am building something new but in a very familiar domain. I agree my workflow would not work for your average "vibe coder".
I'm interested in hearing more about this - any resource you can point me at or do you mind elaborating a bit? TIA!
If you use Codex, convert the config to toml:
    [mcp_servers.shadcn]
    command = "npx"
    args = ["shadcn@latest", "mcp"]
Now with the MCP server, you can instruct the coding agent to use shadcn. I often say something like: "If you need to add new UI elements, make sure to use shadcn and the shadcn component registry to find the best fitting component."
The genius move is that the shadcn components are all based on Tailwind and get COPIED to your project. 95% of the time, the created UI views are just pixel-perfect, spacing is right, everything looks good enough. You can take it from here to personalize it more using the coding agent.
Btw one thing that helps conserve context/tokens is to use GPT 5 Pro to read entire files (it will read more than Codex will, though Codex is good at digging) and generate plans for Codex to execute. Tools like RepoPrompt help with this (though it also looks pretty complicated)
I just ask it to give me instructions for a coding agent and give it a small description of what I want to do; it looks at my code and details what I described as best it can, and usually I have enough to let Junie (JetBrains AI) run on.
I can't personally justify $200 a month; I would need to see seriously strong results for that much. I use AI piecemeal because that has always been the best way to use it for me. I still want to understand the codebase. When things break, it's mostly on you to figure out what broke.
Codex and Claude code are neck and neck, but we made the decision to go all in on opus 4, as there are compounding returns in optimizing prompts and building intuition for a specific model
That said I have tested these prompts on codex, amp, opencode, even grok 4 fast via codebuff, and they still work decently well
But they are heavily optimized from our work with opus in particular
Drop us an email at navan.chauhan[at]strongdm.com
But one thing that MUST get better soon is having the AI agent verify its own code. There are a few solutions in place, e.g. using an MCP server to give access to the browser, but these tend to be brittle and slow. And for some reason, the AI agents do not like calling these tools too much, so you kinda have to force them every time.
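As a sketch of the direction I would like this to go (an illustration, not my current setup): a small deterministic script the agent can run as its own check, assuming Playwright for Python is installed, a dev server on localhost:3000, and a "Dashboard" heading on the landing page - all three of those are assumptions.

    # smoke_check.py - hypothetical self-verification step for the coding agent
    from playwright.sync_api import sync_playwright

    def main():
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            console_errors = []
            # Collect anything the page logs as an error while loading.
            page.on("console", lambda msg: console_errors.append(msg.text) if msg.type == "error" else None)
            page.goto("http://localhost:3000/")       # assumed dev server URL
            page.wait_for_selector("text=Dashboard")  # assumed landmark on the page
            browser.close()
            assert not console_errors, f"Console errors: {console_errors}"

    if __name__ == "__main__":
        main()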
PRD review can be done, but AI cannot fill the missing gaps the same way a human can. Usually, when I create a new PRD, it is because I have a certain vision in my head. For that reason, the process of reviewing the PRD can be optimized by maybe 20%. Or maybe I struggle to see how tools could make me faster at reading and commenting on / editing the PRD.
The models tend to be very good about syntax, but this sort of linting will often catch dead code like unused variables or arguments.
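For example (a made-up snippet, assuming a linter such as Ruff or flake8 runs in the pre-commit hooks), this is the kind of leftover the model writes with perfectly valid syntax that the hooks still flag:

    def apply_discount(order, coupon, audit_log):  # unused argument - caught by rules like Ruff's ARG001, if enabled
        original_total = order["total"]            # unused local variable - flagged as F841
        return order["total"] * (1 - coupon["rate"])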
You do need to rule-prompt that the agent may need to run pre-commit multiple times to verify the changes worked, or to re-add files to the commit. Also, frustratingly, you need to be explicit that pre-commit might fail and that it should fix the errors (otherwise it'll sometimes just run and say "I ran pre-commit!"). For commits there are some other guardrails, like blanket denying git add <wildcard>.
Claude will sometimes complain via its internal monologue when it fails a ton of linter checks and is forced to write complete docstrings for everything. Sometimes you need to nudge it to not give up, and then it will act excited when the number of errors goes down.
Generally I have observed that using a statically typed language like TypeScript helps catch issues early on. I had much worse results with Ruby.
Adding an additional layer slows things down. So the tradeoff must be worth it.
Personally, I would go without a design doc, unless you work on a mission-critical feature humans MUST specify or deeply understand. But this is my gut speaking, I need to give it a try!
But what I'd love to see is, if it has an engineering design step, could it step back and say "we're starting to see this system evolve to a place where a <CQRS, event-sourcing, server-driven-state-machine, etc> might be a better architectural match, and so here's a proposal to evolve things in that direction as a first step."
Something like Kent Beck's "for each desired change, make the change easy (warning: this may be hard), then make the easy change." If we can get to a point where AI tools can make those kinds of tradeoffs, that's where I think things get slightly dangerous.
OTOH if AI models are writing all the code, and AI models have contexts that far exceed what humans can keep in their head at once, then maybe for these agents everything is an easy change. In which case, well, I guess having human SWEs in the loop would do more harm than good at that point.
As for when/where to do it? You can experiment. I do it after step 1.
"Here is roughly what I want, ask me clarifying questions"
Now I pick and choose and have a good idea if my assumptions and the LLMs assumptions align.
Did you start with Cursor and move to Codex or only ever Codex?
My progression: Cursor in '24, Roo Code in mid '25, Claude Code in Q2 '25, Codex CLI in Q3 '25.
These tools change all the time, very quickly. Important to stay open to change though.
I started with Cursor, since it offers a well-rounded IDE with everything you need. It also used to be the best tool for the job. These days Codex + GPT-5-Codex is king. But I sometimes go back to Cursor, especially when reading / editing the PRDs or if I need the occasional 2nd opinion from Claude.
Then I instruct the coding agent to use shadcn / choose the right component from shadcn component registry
The MCP server has a search / discovery tool, and it can also fetch individual components. If you tell the AI agent to use a specific component, it will fetch it (reference doc here: https://ui.shadcn.com/docs/components)
I have roughly 2k tests now, but should probably spend a couple of days before production release to double that.
But I am working on making a solid self-service signup experience - might need a couple of weeks to get it done.
In my opinion, and this is really my opinion, in the age of coding with AI, code review is changing as well. If you speed up how much code can be produced, you need to speed up code review accordingly.
I use automated tools most of the time AND I do very thorough manual testing. I am thinking about a more sophisticated testing setup, including integration tests via using a headless browser. It definitely is a field where tooling needs to catch up.
I do not claim this is vibe coding, and I do not ship unreviewed changes to safety critical systems (in case this is what people think). I claim that in 2025 reviewing every single changed line is not the only way to achieve quality at the scale that AI codegen enables. The unit of review is shifting from lines to specifications.
Building a bridge from steel that lasts 100 years and carries real living people in the tens or hundreds of thousands per day without failing under massive weather spikes is engineering.
Hard disagree but you do you.
The answer: you don't!
Seems like this reality will become increasingly justified and embraced in the months to come. Really though it feels like a natural progression of the package manager driven "dependency hell" style of development, except now it's your literal business logic that's essentially a dependency that has never been reviewed.
My process is probably more robust than simply reviewing each line of code. But hey, I am not against doing it, if that is your policy. I had worked the old-fashioned way for over 15 years, I know exactly what pitfalls to watch out for.
I am not giving universal solutions. I am sharing MY solution.
It is a fairly standardized way of capturing the essence of a new feature. It covers the most important aspects of what the feature is about: the goals, the success criteria, even implementation details where it makes sense.
If there is interest, I can share the outline/template of my PRDs.
I'm particularly intrigued by the large bold letters: "Success must be verifiable by the AI / LLM that will be writing the code later, using tools like Codex or Cursor."
May I ask, what your testing strategy is like?
I think you've encapsulated a good best practices workflow here in a nice condensed way.
I'd also be interested to know how you handle documentation but don't want to bombard you with too many questions
Documentation is a different topic - I have not yet found how to do it correctly. But I am reading about it and might soon test some ideas to co-generate documentation based on the PRD and the actual code. The challenge being, the code normally evolves and drifts away from the original PRD.
> I had to learn to let go of reading every line of PR code
Ah. And I’m over here struggling to get my teammates to read lines that aren’t in the PR.
Ah well, if this stuff works out it'll be commoditized like the author said and I'll catch up later. Hard to evaluate the article given the author's financial interest in this succeeding and my lack of domain expertise.
Would you trust a colleague who is overconfident, lies all the time, and then pushes a huge PR? I wouldn't.
Closed > will not review > make more atomic changes.
The only moves are refusing to review it, taking it up the chain of authority, or rubber stamping it with a note to the effect that it’s effectively unreviewable so rubber stamping must be the desired outcome.
I've heard people rave about LLMs for writing tests, so I tried having Claude Code generate some tests for a bug I fixed in some autosave functionality (every 200ms, the auto-saver should initiate a save if the last change was in the previous 200ms). Claude wrote five tests that each waited 200ms (!), adding a needless entire second to the run-time of my test suite.
I went in to fix it by mocking out time, and in the process realized that the feature was doing timestamp comparisons when a simpler, less error-prone approach was to increment a logical clock for each change instead.
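For what it's worth, here is roughly the shape I wanted the tests to take - a sketch with made-up names, not the actual autosave code - where a fake clock is injected so no test ever sleeps:

    class AutoSaver:
        """Hypothetical autosaver with an injectable clock instead of calling time.time()."""

        def __init__(self, save, clock, interval=0.2):
            self.save = save
            self.clock = clock
            self.interval = interval
            self.last_change = None

        def record_change(self):
            self.last_change = self.clock()

        def tick(self):
            # Called every interval: save only if a change landed within the last interval.
            if self.last_change is not None and self.clock() - self.last_change <= self.interval:
                self.save()
                self.last_change = None

    def test_saves_after_recent_change():
        now = [0.0]
        saves = []
        saver = AutoSaver(save=lambda: saves.append(now[0]), clock=lambda: now[0])
        saver.record_change()
        now[0] += 0.2      # advance the fake clock; no real 200ms wait anywhere
        saver.tick()
        assert saves == [0.2]

The logical-clock variant just swaps the injected clock for an integer change counter and compares counters instead of timestamps.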
The tests I've seen Claude write vary from junior-level to flat-out-bad. Tests are often the first consumer of a new interface, and delegating them to an LLM means you don't experience the ergonomics of the thing you just wrote.
A 20k LOC PR. Am I living in a fantasy world where we have actually come to accept PRs that big?
Jesus. We are doomed.
The fact that people can trust the output of an LLM which is known to not be accurate and ship code is mind boggling. We have turned our profession into a joke.
It starts with /feature, and takes a description. Then it analyzes the codebase and asks questions.
Once I’ve answered questions, it writes a plan in markdown. There will be 8-10 markdowns files with descriptions of what it wants to do and full code samples.
Then it does a “code critic” step where it looks for errors. Importantly, this code critic is wrong about 60% of the time. I review its critique and erase a bunch of dumb issues it’s invented.
By that point, I have a concise folder of changes along with my original description, and it’s been checked over. Then all I do is say “go” to Claude Code and it’s off to the races doing each specific task.
This helps it keep from going off the rails, and I’m usually confident that the changes it made were the changes I wanted.
I use this workflow a few times per day for all the bigger tasks and then use regular Claude code when I can be pretty specific about what I want done. It’s proven to be a pretty efficient workflow.
[0] GitHub.com/iambateman/speedrun
I've been working on something I call Micromanaged Driven Development https://mmdd.dev and wrote about it at https://builder.aws.com/content/2y6nQgj1FVuaJIn9rFLThIslwaJ/...
I'm on a similar search, and I'm stoked to see that many people riding the wave of coding with AI are moving in this direction.
Lots of learning ahead.
> We've gotten claude code to handle 300k LOC Rust codebases, ship a week's worth of work in a day, and maintain code quality that passes expert review.
This seems more like delegation just like if one delegated a coding task to another engineer and reviewed it.
> That in two years, you'll be opening python files in your IDE with about the same frequency that, today, you might open up a hex editor to read assembly (which, for most of us, is never).
This seems more like abstraction just like if one considers Python a sort of higher level layer above C and C a higher level layer above Assembly, except now the language is English.
Can it really be both?
You'll also note that while I talk about "spec driven development", most of the tactical stuff we've proven out is downstream of having a good spec.
But in the end a good spec is probably "the right abstraction" and most of these techniques fall out as implementation details. But to paraphrase sandy metz - better to stay in the details than to accidentally build against the wrong abstraction (https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction)
I don't think delegation is right - when vaibhav and I shipped a week's worth of work in a day, we were DEEPLY engaged with the work; we didn't step away from the desk, we were constantly re-steering and probably sent 50+ user messages that day, in addition to some point-edits to markdown files along the way.
Sorry, had they effectively estimated that an engineer should produce 4-6KLOC per day (that's before genAI)?
So those people should either stop using it or learn to use it productively. We're not doomed to live in a world where programmers start using AI, lose productivity because of it and then stay in that less productive state.
They can be forced to write in their performance evaluation how much (not if, because they would be fired) "AI" has improved their productivity.
The article at the link is about how to use AI effectively in complex codebases. It emphasizes that the techniques described are "not magic", and makes very reasonable claims.
That's very fair, and I believe that's true for you and for many experienced software developers who are more productive than the average developer. For me, AI-assisted coding is a significant net win.
The as-of-yet unanswered question is: Is this the same? Or will non-LLM-using engineers be left behind?
- smart generalists with a lot of depth in maybe a couple of things (so they have an appreciation for depth and complexity) but a lot of breadth, so they can effectively manage other specialists, and
- having great technical communication skills - being able to communicate what you want done and how, without over-specifying every detail or under-specifying tasks in important ways.
I think this attitude is part of the problem to me; you're not aiming to be faster or more efficient (and using AI to get there), you're aiming to use AI (to be faster and more efficient).
A sincere approach to improvement wouldn't insist on a tool first.
This is exactly right. Our role is shifting from writing implementation details to defining and verifying behavior.
I recently needed to add recursive uploads to a complex S3-to-SFTP Python operator that had a dozen path manipulation flags. My process was:
* Extract the existing behavior into a clear spec (i.e., get the unit tests passing).
* Expand that spec to cover the new recursive functionality.
* Hand the problem and the tests to a coding agent.
I quickly realized I didn't need to understand the old code at all. My entire focus was on whether the new code was faithful to the spec. This is the future: our value will be in demonstrating correctness through verification, while the code itself becomes an implementation detail handled by an agent.
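To give a flavor of what "the spec" looked like in this case - the names here (plan_transfers and the flag names) are hypothetical stand-ins for the real operator's interface - the behavior-level tests pin down the path mapping for both the old flat behavior and the new recursive one:

    # Hypothetical pure function extracted from the operator: given source keys and
    # flags, decide which remote path each key is written to.
    from my_operator import plan_transfers

    def test_flat_upload_keeps_existing_behavior():
        plan = plan_transfers(keys=["reports/a.csv"], dest="/in", recursive=False)
        assert plan == {"reports/a.csv": "/in/a.csv"}

    def test_recursive_upload_preserves_subdirectories():
        plan = plan_transfers(
            keys=["reports/2024/a.csv", "reports/2024/q1/b.csv"],
            dest="/in",
            recursive=True,
        )
        assert plan == {
            "reports/2024/a.csv": "/in/reports/2024/a.csv",
            "reports/2024/q1/b.csv": "/in/reports/2024/q1/b.csv",
        }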
I could argue that our main job was always that - defining and verifying behavior - or at least a large part of it. Time spent writing implementation details has always been on a downward trend, via higher-level languages, compilers, and other abstractions.
This may be true, but see Hyrum's Law, which says that the observed behavior of a heavily-used system becomes its de facto public interface and specification, with all its quirks and implementation errors. It may be important to keep testing that the clients of the code are also faithful to the spec, and to detect and handle discrepancies.
The key part was really just explicitly thinking about different levels of abstraction at different levels of vibecoding. I was doing it before, but not explicitly in discrete steps, and that was where I got into messes. The prior approach made checkpointing / reverting very difficult.
When I think of everything in phases, I do similar stuff with my git commits at "phase" levels, which makes design decisions easier to make.
I also spend ~4-5 hours cleaning up the code at the very, very end once everything works. But it's still way faster than writing hard features myself.
I've said this repeatedly: I mostly use it for boilerplate code, or when I'm having a brain fart of sorts. I still love to solve things for myself, but AI can take me from "I know I want x, y, z" to "oh look, I got to x, y, z" in under 30 minutes, when it could have taken hours. For side projects this is fine.
I think if you do it piecemeal it should almost always be fine. When you try to tell it to do too much, you and the model both fail to consider edge cases (ask it for those too!) and are more prone to a rude awakening eventually.
Like yes vibecoding in the lovable-esque "give me an app that does XYZ" manner is obviously ridiculous and wrong, and will result in slop. Building any serious app based on "vibes" is stupid.
But if you're doing this right, you are not "coding" in any traditional sense of the word, and you are *definitely* not relying on vibes
Maybe we need a new word
If you're properly reviewing the code, you're programming.
The challenge is finding a good term for code that's responsibly written with AI assistance. I've been calling it "AI-assisted programming" but that's WAY too long.
I've also heard "aura coding", "spec-driven development" and a bunch of others I don't love.
but we def need a new word cause vibe coding aint it
You can vibe code using specs or just by having a conversation.
Also quite funny that one of the latest commits is "ignore some tests" :D
> While the cancelation PR required a little more love to take things over the line, we got incredible progress in just a day.
All I know is that it works. On the greenfield project the code is simple enough to mostly just run `/create_plan` and skip research altogether. You still get the benefit of the agents and everything.
The key is really truly reviewing the documents that the AI spits out. Ask yourself if it covered the edge cases that you're worried about or if it truly picked the right tech for the job. For instance, did it break out of your sqlite pattern and suggest using postgres or something like that. These are very simple checks that you can spot in an instant. Usually chatting with the agent after the plan is created is enough to REPL-edit the plan directly with claude code while it's got it all in context.
At my day job I've got to use github copilot, so I had to tweak the prompts a bit, but the intentional compaction between steps still happens, just not quite as efficiently because copilot doesn't support sub-agents in the same way as claude code. However, I am still able to keep productivity up.
-------
A personal aside.
Immediately before AI assisted coding really took off, I started to feel really depressed that my job was turning into a really boring thing for me. Everything just felt like such a chore. The death by a million paper cuts is real in a large codebase with the interplay and idiosyncrasies of multiple repos, teams, personalities, etc. The main benefit of AI assisted coding for me personally seems to be smoothing over those paper cuts.
I derive pleasure from building things that work. Every little thing that held up that ultimate goal was sucking the pleasure out of the activity that I spent most of my day trying to do. I am much happier now having impressed myself with what I can build if I stick to it.
I made specs for every part of the code in a separate folder, which also held logs for every feature I worked on. It was an API server in Python with many services like accounts, notifications, subscriptions, etc.
It got to the point where managing context became extremely challenging. Claude could not work out the business logic properly, and it can get complex - e.g., a simple RBAC system with an account and a profile, plus a junction table for roles joining the account to the profile. In the end, what kind of worked was giving it UML diagrams of the relationships, with examples, to make it understand and behave better. A sketch of the shape of that model is below.
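For reference, the shape of the model it kept stumbling over, sketched with SQLAlchemy purely as an illustration (the real service's tables and columns differ):

    from sqlalchemy import Column, ForeignKey, Integer, String
    from sqlalchemy.orm import declarative_base, relationship

    Base = declarative_base()

    class Account(Base):
        __tablename__ = "accounts"
        id = Column(Integer, primary_key=True)
        profile_roles = relationship("AccountProfileRole", back_populates="account")

    class Profile(Base):
        __tablename__ = "profiles"
        id = Column(Integer, primary_key=True)
        account_roles = relationship("AccountProfileRole", back_populates="profile")

    class AccountProfileRole(Base):
        """Junction table: which role a profile holds within an account."""
        __tablename__ = "account_profile_roles"
        account_id = Column(Integer, ForeignKey("accounts.id"), primary_key=True)
        profile_id = Column(Integer, ForeignKey("profiles.id"), primary_key=True)
        role = Column(String, nullable=False)  # e.g. "admin", "member"
        account = relationship("Account", back_populates="profile_roles")
        profile = relationship("Profile", back_populates="account_roles")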
"what happens if we end up owning this codebase but don't know how it works / don't know how to steer a model on how to make progress"
There are two common problems w/ primarily-AI-written code
1. Unfamiliar codebase -> research lets you get up to speed quickly on flows and functionality
2. Giant PR Reviews Suck -> plans give you ordered context on what's changing and why
Mitchell has praised ampcode for the thread sharing, another good solution to #2 - https://x.com/mitchellh/status/1963277478795026484
I've been experimenting with Github agents recently, they use GPT-5 to write loads of code, and even make sure it compiles and "runs" before ending the task.
Then you go and run it and it's just garbage, yeah it's technically building and running "something", but often it's not anything like what you asked for, and it's splurged out so much code you can't even fix it.
Then I go and write it myself like the old days.
It's context all the way down. That just means you need to find and give it the context to enable it to figure out how to do the thing. Docs, manuals, whatever. Same stuff that you would use to enable a human that doesn't know how to do it to figure out how.
I treat "uses AI tools" as a signal that a person doesn't know what they are doing
I imagine this has to to with concurrency requiring conceptual and logical reasoning, which LLMs are known to struggle with about as badly as they do with math and arithmetic. Now, it's possible that the right language to work with the LLM in these domains is not program code, but a spec language like TLA+. However, at that point, I'd probably just spend less effort to write the potentially tricky concurrent code myself.
It'd be nice if the article included the cost for each project. A 35k LOC change in a 350k codebase with a bunch of back and forth and context rewriting over 7 hours, would that be a regular subscription, max subscription, or would that not even cover it?
> oh, and yeah, our team of three is averaging about $12k on opus per month
I'll have to admit, I was intrigued with the workflow at first. But emm, okay, yeah, I'll keep handwriting my open source contributions for a while.
but yes we switched off per-token this week because we ran out of anthropic credits, we're on max plan now
Horrible, right? When I asked gemini, it guessed 37 cents! https://g.co/gemini/share/ff3ed97634ba
The hierarchy of leverage concept is great! Love it. (Can't say I like the idea that 1 bad line of CLAUDE.md is 100K lines of bad code; I've had some bad lines in my CLAUDE.md from time to time - I almost always let Claude write its own CLAUDE.md.)
<system-reminder> IMPORTANT: this context may or may not be relevant to your tasks. You should not respond to this context or otherwise consider it in your response unless it is highly relevant to your task. Most of the time, it is not relevant. </system-reminder>
Lots of others have written about this so I won't go deep, but it's a clear product decision. Still, if you don't know what's in your context window, you can't respond to it or architect the balance between CLAUDE.md and /commands well.
But.. I hate this. I hate the idea of learning to manage the machine's context to do work. This reads like a lecture in an MBA class about managing certain types of engineers, not like an engineering doc.
Never have I wanted to manage people. And never have I even considered my job would be to find the optimum path to the machine writing my code.
Maybe firmware is special (I write firmware)... I doubt it. We have a Cursor subscription and are expected to use it on production codebases. Business leaders are pushing it HARD. To be a leader in my job, I don't need to know algorithms, design patterns, C, make, how to debug, how to work with memory-mapped IO, what wear leveling is, etc. I need to know 'compaction' and 'context engineering'.
I feel like a ship corker inspecting a riveted hull
It also helps to start small, get something useful done, and iterate by adding more features over time (or keeping it small).
It's not at senior-engineer level until it asks relevant questions about the context it lacks instead of blindly trying to solve problems, IMO.
With this I find that most of the shenanigans of manually managing the context window by putting things in markdown files are kind of unnecessary.
You still need to make it plan things, and guide the research it does to make sure it gets enough useful info into the context window, but in general it now seems to do a really good job of preserving the information. This is with Sonnet 4.
YMMV
It's super effective with the right guardrails and docs. It also works better on languages like Go instead of Python.
1. Go's spec and standard practices are more stable, in my experience. This means the training data is tighter and more likely to work.
2. Go's types give the LLM more information on how to use something, versus Python's dynamic model.
3. Python has been an entry-level, accessible language for a long time. This means a lot of the code in the training set is by amateurs. Go, IME, is never someone's first language, so you effectively only get code from people who already have other programming experience.
4. Go doesn't do much 'weird' stuff. It's not hard to wrap your head around.
And then I find models try to write scripts / manual workflows for testing, but Go is REALLY good for doing what you might do in a bash script, so you can steer the model to build its own feedback loop as a harness in Go integration tests (we do a lot of this in github.com/humanlayer/humanlayer/tree/main/hld).
Also, strongly-typed languages tend to catch more issues through the language server which the agent can touch through LSP.
2. Write down the principles and assumptions behind the design and keep them current
In other words, the same thing successful human teams on complex projects do! Have we become so addicted to “attention-deficit agile” that this seems like a new technique?
Imagine, detailed specs, design documents, and RFC reviews are becoming the new hotness. Who would have thought??
All because they have been forced to master technical communication at scale.
but the reason I wrote this (and maybe a side effect of the SF bubble) is MOST of the people I have talked to, from 3-person startups to 1000+ employee public companies, are in a state where this feels novel and valuable, not a foregone conclusion or something happening automatically
- Mr. Snarky
Please kill me now
https://github.com/ricardoborges/cpython
What web programming task can't GPT-5 handle?
An hour for 14 lines of code. Not sure how this shows any productivity gain from AI. It's clear that it's not the code writing that is the bottleneck in a task like this.
Looking at the "30K lines" features, the majority of the 30K lines are either auto-generated code (not by AI), or documentation. One of them is also a PoC and not merged...
We're taking a profession that attracts people who enjoy a particular type of mental stimulation, and transforming it into something that most members of the profession just fundamentally do not enjoy.
Doesn't apply to every developer. But it's a lot.
If AI is so groundbreaking, why do we have to have guides and jump through 3000 hoops just so we can make it work?
I want to do the work