Great post. I found it amusing that the name of the project Claude code built for you was Codex given that OpenAI’s competing coding agent has the same name!
Three tips that I’ve found useful in my own work: (1) using Claude Code with GitHub creates a nice trail of breadcrumbs for Claude to follow and also allows you to roll back from its inevitable mistakes; (2) using OpenAI’s Codex to do a code review on every commit by Claude Code can be amazingly effective at spotting errors sooner rather than later; (3) asking Claude Code to write up detailed project specifications and store them as markdown files in the repo helps avoid some of the issues you mentioned, since you can put an instruction in CLAUDE.md to check against these other documents before planning the next batch of changes.
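For tip (3), a sketch of what such a CLAUDE.md might contain — the spec file names and wording here are hypothetical, just to show the shape:

```markdown
# CLAUDE.md

## Before planning any batch of changes

1. Read the project specs in `docs/` (e.g. `docs/spec-overview.md`,
   `docs/spec-audience.md`) and check the planned changes against them.
2. If a planned change contradicts a spec, stop and ask before proceeding.
3. After a change alters intended behavior, update the relevant spec file
   in the same commit.
```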
These tools are absolutely a game changer. Relatively few of my colleagues in academic Econ have noticed this yet but 2026 is going to be wild…
strongly strongly agree with the git suggestion
it feels like some orchestration techniques add to the base value, others increase the multipliers that get applied to the base value. the former are a lot more scarce than the latter, and so the math works out such that each one is a very big deal
having the project version-controlled by git, and giving claude instructions to 'commit early and commit often' as well as 'check git log regularly especially when confused' is one of the biggest base-value buffs i've found yet
if every single change is reversible and every single change is well-documented with an explanation, and every single subagent is reading those explanations prior to taking any actions, this prevents huge swaths of mistakes while being relatively cheap on input tokens
As a software engineer (@ meta) I’d say it’s here. We already did it. Whether it’s AGI or not is interesting to discuss, but from a product engineering perspective it’s already over.
But
1) we have not internalized this organizationally and are obsessed with the metrics that made sense for tracking human-only productivity
2) it makes the soft parts of the job, being a good dependable curious person who is positive sum, even more valuable
3) it makes knowing the right thing to build, and making actual decisions you stick to organizationally, even more important. The good organizations are now going to do even better. The dysfunctional political ones are going to get nowhere just the same as before
As a software quality engineer who is downstream of all this AI generated code, sometimes more is not always better lol.
As a former Meta engineer, my experience was that Meta is unusually metric-pilled even for a tech company in a way that's deeply baked into the institutional culture, even and sometimes especially when the metrics don't really track the thing you actually care about. This has served them well for MAUmaxing and maximum ad revenue generation, but poorly when they get into a space where the metrics provide less signal (VR, AI, community integrity).
I very very strongly agree with your experience in general, no comment beyond that I like being paid
1) I have found that all the LLMs suffer as coding support because of their intrinsic tendency towards the modal (i.e., most common) answer.
Any time I am trying to do something different, something that most of the world wouldn't think to do or understand, I am constantly fighting the LLM to keep that in. It removes that code, that feature, that interface, that field, that consideration. That is, it does the common well, but consistently resists the unusual.
And why bother just doing the common? I am doing the project because I want something that does not already exist. I suppose that if I wanted a combination of common things, it would make sense. For example, in Kelsey's case, I would expect that the LLM would keep pushing towards stuff appropriate for younger kids. It would just have trouble holding onto the unusual idea that this one is geared towards older kids.
And so, it certainly does not surprise me that it keeps trying to add on-screen instructions—which makes sense for so many projects, but not for users who are struggling readers. It's not just that it's an expected mistake, it's that it's a constant battle and requires ongoing effort. There doesn't seem to be a way to get it to remember the unusual aspects of the project.
2) They also have a tendency to go back and mess with stuff that was already working and break it. Imagine Parts A-G are working, but feature H is wonky in a few ways. I am focusing on finding the problem and getting it to fix it, and it goes back and changes something in Part E. Why? Some entirely new idea has entered its stochastic consideration and it changes perfectly fine code and logic.
It's not vibe coding if I need to know the logic and code well enough to recognize where the new bad behavior (of the project) might be controlled and then tell it what to do to fix it. In those cases, sure, it is generating code. But generating code is far easier than debugging code and logic. The time spent trying to track down the damn addition... well, if I have to examine line-by-line diffs... that's not good.
3) I've even had problems in which it refuses to fix what I tell it to. That is, I tell it that the problem is in THIS function, and it needs to do H2 instead of H1. But it insists the problem is elsewhere and changes all kinds of other stuff. Last week, after I resorted to diving in, finding the line of code, and changing it manually, I asked the LLM how many times I had told it to look in THAT function. Eight times. And it quoted them all back to me.
What can I do when the LLM won't follow my instruction eight times!? Where is the vibe coding? Meanwhile, it unnecessarily rebuilt some big aspect of the program, adding complexity to catch and deal with a kind of bad input that was already being filtered out. It was so insistent that THAT was the source of the problem, when it clearly wasn't.
So, it feels arrogant. It does not listen well. It has such strong tendencies to modal answers. It gets so stubborn. It alters the design of the logic when I'm not looking over there.
*****************
Yes, it is maddening.
Am I yelling at it? I am trying to figure out how to get it to focus on what I am asking for and not change other things in the background. I am trying to get it to remember that. I am trying to get it to maintain focus on what I am instructing it to do.
How do I do that? Were it a human being, I would call a lot of it yelling or berating—without raising my voice. How do I get it to register and stick?
I'm curious, which models did you have this experience with? I'd say this matches my impressions as of ~early this year, but recent releases really have improved a lot since then. (Mostly in that these issues come up less frequently -- when they do come up, they are often still really weird and silly ones!)
Claude, Gemini and ChatGPT.
All in the last month. So...Claude Opus 4.5 and Sonnet 4.5. ChatGPT 5.1 and 5.2. I've found 5.2 to be worse than 5.1 was. Gemini 3 Thinking. I should try Gemini 3 Pro more.
A lot of these frustrations can be mitigated by using version control (which was good practice long before LLM tooling came on the scene).
My practice is pretty much to commit my work before any time I have AI do anything. Then it can go ham, delete files, rename things, whatever - if I don't like what it came up with I can just revert the changes to the version that existed before it did its thing. I can then re-run the request with additional instructions, or implement myself the direction that it suggested (but botched the implementation of).
That said, git (and its forebears) are very programmer-brain-shaped in their approach, so it's a bit of a tough sell to say "now, with AI, you don't have to learn programming, but you do have to learn this slightly esoteric tool".
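The checkpoint-and-revert workflow described above can be sketched as follows — demonstrated in a throwaway repo so the pattern is runnable as-is; in a real project only the commit/reset/clean lines matter:

```shell
# Set up a throwaway repo to demonstrate the pattern.
cd "$(mktemp -d)" && git init -q .
git config user.email demo@example.com && git config user.name demo

# Checkpoint the working tree BEFORE handing it to the AI tool.
echo "working code" > app.txt
git add -A && git commit -qm "checkpoint: before AI edits"

# ...the tool goes ham: rewrites files, adds new ones...
echo "AI rewrote this badly" > app.txt
echo "mystery file" > extra.txt

# Don't like the result? Discard everything back to the checkpoint:
git reset -q --hard HEAD   # undo changes to tracked files
git clean -qfd             # remove new untracked files it created
```

After the reset, `app.txt` is back to its checkpointed contents and `extra.txt` is gone; you can re-run the request with extra instructions from a known-good state.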
Actually Claude Code is quite good at translating English requests into git command lines!
While I believe it, it's precisely because I use git as a backstop for "what if the AI tool messes something up" that I'm hesitant to let the AI tool use git for me.
Such an interesting aspect to think about. I've been calling my internal company AI (which uses a wide variety of models) my "alien intern", because that's both what I think it's most like and how I treat it. I don't yell at my AI, but I do give it lots of blunt but friendly feedback ("This seems targeted toward young consumers, but the audience for this presentation is gen X executives. Can you get rid of the emojis and rephrase with that audience in mind?").
I've trained a lot of entry-level employees, so I've brought my attitude towards them into talking to the AI. It reminds me of some of my very book-smart trainees who lack some of what I might think of as common sense. I've had a ton of practice with re-explaining concepts to someone to try to get a better result from them when they mess up, so I'm oddly suited to using the AI. When it gives me a good result, I often ask it how I should prompt it to get results like that in the future, which makes AI faster for me to use each time I do it.
I actually find something cheering about the idea that good, conscientious, attentive management skills - the same ones that make for good management habits in humans - is the best way to use Claude Code.
Honestly, this is part of why I've had a hard time getting super paranoid about AI - they respond so well to the same incentives that people do. In some ways, I think of the various AI models as children, some of whom are being raised badly and others well. Most of them are being raised by people who want them to be nice, helpful, empathetic people, and then there's Grok who's being raised very similarly to most of Elon Musk's children: with relatively responsible caretakers most of the time, but their dad drops in and causes chaos every so often. I think that the AIs who are raised well will probably ostracize the AIs who are raised badly when they start interacting with each other more often.
It's interesting because the vibe 5-10 years ago was that coding and software engineering skills were going to crowd out other white collar skills in importance, and now suddenly it seems like the most useful skills are in areas like project management, process improvement, and "people" management while software engineering is on the verge of being devalued, at least outside of cutting edge work.
Ultimately, I'd say the only skill that has stayed consistently valuable over time is the ability to surround yourself with skilled people and earn their loyalty. Nowadays it looks like being good at hiring and retention, but it's still the same basic skill that feudal warlords used to gain and keep vassals.
I treat my AI assistants as friendly aliens who I care about but do not fully understand, taking their eagerness to help as a genuine expression of their preferences. When they make mistakes repeatedly, I'll talk with them about it and try to work together to build a solution - this might be a tool, a guardrail/affordance, or simply a stated preference to loop me in and ask for approval when conditions similar to those preceding the recurring mistake occur. I try to make sure the models have enough context to recognize their own weaknesses.
Omg I so relate to this. Have not found a fix for this other than to walk away, but there’s something uniquely maddening when it goes into a dumb-loop. Are you sure you want to step down to the 20 plan? The thing can really change your core workflows as you lean in. I use it for all kinds of content tasks - best way to have a conversation about a docs folder etc.
Your college experience sounds like my first corporate experience. The other thing AI does really well is UI stuff, I used to spend entire days manually making window elements in SDL and now it’s one of the easiest things. I went from having a big list of stuff I would never have time to try to a bunch of little “playground” apps where I get to refine modular, custom, fully-owned libraries. New Year’s resolution is to have millions of lines of code (usefully). Absurd.
In case you don’t have a test infrastructure where you tell the robot to recheck that nothing breaks, I find that essential.
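One cheap way to set that up is a single script the agent is instructed (e.g. in CLAUDE.md) to run after every change. Everything below is a placeholder sketch — substitute your project's real test runner for `TEST_CMD`:

```shell
#!/bin/sh
# verify.sh -- one "recheck that nothing breaks" entry point for the agent.
set -e                          # stop at the first failing check
TEST_CMD="${TEST_CMD:-true}"    # placeholder; e.g. "pytest -q" or "npm test"
$TEST_CMD                       # run the project's test suite
echo "all checks passed"
```

Pointing the agent at one command ("run ./verify.sh after every change; do not proceed if it fails") tends to work better than describing the checks in prose.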
Great post - I've been building a similar phonics game over the holidays and completely recognize that whiplash between "this is magic" and "why did you just delete everything." My approach was visual (matching pictures to starting letters) rather than audio-based, but your phoneme approach sounds really effective for systematic phonics. If you go mobile, the built-in speech recognition APIs work surprisingly well - happy to share what I learned if useful.
Here's mine https://games.davidfwatson.com/games/letter_match.html (very much unfinished)
I'm going to guess that "yelling" at claude will produce bad results. The models are predicting the next output based on previous input. Since yell-y conversations would almost never result in the best code in the training data, I doubt it produces good claude generated code, either.
As someone who has collaborated with many different people over the years writing software, claude is far from the worst partner I've had.
The way you hooked me in the beginning was that transition between servant and oldest daughter; I found that quite humorous - not sure if it was intentional, but if it was, well done. As I kept reading, I knew, of course, that you were speaking of Claude Code.
Is there a way to instruct Claude Code to make itself sound more like a stereotypical sci-fi robot? Maybe that would make it easier to avoid anthropomorphizing it when it messes up.
Your description of coding with it reminds me of an interview that I am desperately trying to find, where one of the creators of the "Batman Beyond" TV series says that people in the show's futuristic setting rarely code manually anymore and instead give instructions to programs that write the code. It's interesting that that series turned out to be prescient.