I wish I had this kind of experience. I threw a tedious but straightforward task at Claude Code using Opus 4.6 late last week: find the places in a React code base where we were using useState and useEffect to calculate a value that was purely dependent on the inputs to useEffect, and replace them with useMemo. I told it to be careful to only replace cases where the change did not introduce any behavior changes, and I put it in plan mode first.
It gave me an impressive plan of attack, including a reasonable way to determine which code it could safely modify. I told it to start with just a few files and let me review; its changes looked good. So I told it to proceed with the rest of the code.
It made hundreds of changes, as expected (big code base). And most of them were correct! Except the places where it decided to do things like put its "const x = useMemo(...)" call after some piece of code that used the value of "x", meaning I now had a bunch of undefined variable references. There were some other missteps too.
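For concreteness, here is the kind of transformation being described, plus the ordering bug, as a sketch (component and variable names are made up for illustration; the `Broken` variant intentionally won't compile):

```tsx
import { useEffect, useMemo, useState } from "react";

// Before: derived value computed with useState + useEffect.
// Works, but costs an extra render and is stale for one render after inputs change.
function Before({ first, last }: { first: string; last: string }) {
  const [fullName, setFullName] = useState("");
  useEffect(() => {
    setFullName(`${first} ${last}`);
  }, [first, last]);
  return <span>{fullName}</span>;
}

// After, done correctly: same value, computed synchronously with useMemo.
function After({ first, last }: { first: string; last: string }) {
  const fullName = useMemo(() => `${first} ${last}`, [first, last]);
  return <span>{fullName}</span>;
}

// The failure mode: the useMemo declaration was sometimes placed *below*
// existing code that reads the value, so `fullName` is referenced before
// its `const` declaration.
function Broken({ first, last }: { first: string; last: string }) {
  const initials = fullName[0]; // error: used before its declaration
  const fullName = useMemo(() => `${first} ${last}`, [first, last]);
  return <span title={initials}>{fullName}</span>;
}
```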
I tried to convince it to fix the places where it had messed up, but it quickly started wanting to make larger structural changes (extracting code into helper functions, etc.) rather than just moving the offending code a few lines higher in the source file. Eventually I gave up trying to steer it and, with the help of another dev on my team, fixed up all the broken code by hand.
It probably still saved time compared to making all the changes myself. But it was way more frustrating.
One tip I have is that once you have the diff you want to fix, start a new session and have it work on the diff fresh. They've improved this, but it's still the case that the farther you get into the context window, the dumber and less focused the model gets. I learned this from the Claude Code team themselves, who have long advised starting over rather than trying to steer a conversation that has started down a wrong path.

I have heard from people who regularly push a session through multiple compactions. I don’t think this is a good idea. I virtually never do this — when I see context getting up to even 100k, I start making sure I have enough written to disk to type /new, pipe it the diff so far, and just say “keep going.” I learned recently that even essentials like the CLAUDE.md part of the prompt get diluted through compactions. You can write a hook to re-insert it but it's not done by default.
This fresh context thing is a big reason subagents might work where a single agent fails. It’s not just about parallelism: each subagent starts with a fresh context, and the parent agent only sees the result of whatever the subagent does — its own context also remains clean.
Slight tangent: you want to read the diff between your branch and the merge-base with origin/main. Otherwise you get lots of spurious spam in your diff, if main moved since you branched off.
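A self-contained illustration in a throwaway repo (file names are made up): git's three-dot syntax, `git diff A...B`, diffs `B` against the merge-base of `A` and `B`, i.e. the point where you branched off, so main's subsequent movement doesn't pollute the output.

```shell
# Demo: three-dot diff vs plain two-endpoint diff.
set -e
cd "$(mktemp -d)"
git init -q -b main
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m base
git checkout -qb feature
echo mine > mine.txt && git add mine.txt
git -c user.email=demo@example.com -c user.name=demo commit -qm "feature work"
git checkout -q main
echo theirs > theirs.txt && git add theirs.txt
git -c user.email=demo@example.com -c user.name=demo commit -qm "main moved on"
git checkout -q feature

git diff main...HEAD --name-only   # only mine.txt: your branch's changes
git diff main HEAD --name-only     # mine.txt AND theirs.txt: main's movement leaks in
```

With a remote, the same idea is `git diff origin/main...HEAD`, which is equivalent to `git diff "$(git merge-base origin/main HEAD)" HEAD`.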
One thing that seems important is to have the agent write down its plan and any useful memory in markdown files, so that further invocations can just read from them.
>"This fresh context thing is a big reason subagents might work where a single agent fails. It’s not just about parallelism: each subagent starts with a fresh context, and the parent agent only sees the result of whatever the subagent does — its own context also remains clean."
You maintain very low context usage in the main thread, just orchestration and planning details, while each individual team member remains responsible for its own. That allows you to churn through millions of output tokens in a fraction of the time.
Subagents are huge. I could execute on a massive plan that would easily fill up a 200k context window and be done at around 60k for the orchestration agent.
As a cheapass, being able to pass off the simple work to cheaper $-per-token agents is also just great. I've got a handful of tasks I can happily delegate to a haiku agent, and anything requiring a bit of reasoning goes to sonnet.
Opus feels almost like a cheat code: when I do get stuck, I just bust out a full opus workflow instead, and it usually destroys everything I was struggling with. Like playing on easy mode.
As cool as this stuff is, I kinda still wish I was just grandfathered into the plan with no weekly limit and only the 5-hour window limits; I'd just be happily hammering opus blissfully.
Same here. I don't understand how people leave it running on an "autopilot" for long periods of time. I still use it interactively as an assistant, going back and forth and stepping in when it makes mistakes or questionable architectural decisions. Maybe that workflow makes more sense if you're not a developer and don't have a good way to judge code quality in the first place.
There's probably a parallel with the CMSes and frameworks of the 2000s (e.g. WordPress or Ruby on Rails). They massively improved productivity, but as a junior developer you could get pretty stuck if something broke or you needed to implement an unconventional feature. I guess it must feel a bit similar for non-developers using tools like Claude Code today.
>Same here. I don't understand how people leave it running on an "autopilot" for long periods of time.
Things have changed. The models have reached a level of coherence that they can be left to make the right decisions autonomously. Opus 4.6 is in a class of its own now.
A non-technical client of mine has built an entire app with a very large feature set with Opus. I declined the job of cleaning it up; I was afraid it would be impossible and carry too much risk. I think we're at a level where it can build and auto-correct its mistakes, but the code is still slop and kind of dangerous to put in production, if you care about even the most basic security.
Branch first so you can just undo. I think this would have worked with subagents and /loop, maybe? Write all the items to change to a todo.md. Have it split up the work, with haiku subagents doing 5-10 changes at a time and marking the todos done, and /loop until all are done. You'll succeed, I suspect. If the main Claude instance compacts its context, stop and start again from where you left off.
It actually did automatically break the work up into chunks and launched a bunch of parallel workers to each handle a smaller amount of work. It wasn't doing everything in a single instance.
The problem wasn't that it lost track of which changes it needed to make, so I don't think checking items off a todo list would have helped. I believe it did actually change all the places in the code it should have. It just made the wrong changes sometimes.
But also, the claim I was responding to was, "I start with a PRD, ask for a step-by-step plan, and just execute on each step at a time." If I have to tell it how to organize its work and how to keep track of its progress and how to execute all the smaller chunks of work, then I may get good results, but the tool isn't as magical (for me, anyway) as it seems to be for some other people.
One of the more subtle points that seems to be crucial: it works a lot better when the context is filled with its own work rather than polluted by unrelated details. Even better than restarting once it's off the rails is to avoid that as much as possible by proactively starting a new conversation as soon as anything in the history of the existing one stops being relevant. I've found it more effective to manually restate most of what's currently in the context in a fresh session, skipping the irrelevant bits even if they're fairly small, than to rely on it to figure out that something is no longer relevant (or to give it instructions saying so, which feels like a crapshoot as to whether it will actually prune or just bloat things further, with that instruction simply added into the mix).
To echo what the parent comment said, it's almost frustrating how effective it can be at certain tasks that I wouldn't ever have the patience for. At my job recently I needed to prototype calling some Python code via WASM using the Rust wasmtime engine. Once I had set up the code structure, with the bytes for the WASM component, the arguments I wanted to pass to the function, and the WIT describing the interface for the function, it was able to fill in all of the boilerplate needed so that the function calls worked properly, within a minute or two, on the first try. Reading through all the documentation and figuring out exactly which half-dozen assorted things I had to import and hook up in the correct order would probably have taken me an hour at minimum.
I don't have any particular insight on whether or not these tools will become even more powerful over time, and I still have fairly strong concerns about how AI tools will affect society (both in terms of how they're used and the amount of energy used to produce them in the first place), but given how much the tech industry tends to prioritize productivity over social concerns, I have to assume that my future employment is going to be heavily impacted by my willingness to adopt and use these tools. I can't deny at this point that having it as an option would make me more productive than if I refuse to use it, regardless of my personal opinions on it.
The biggest thing tying my team to GitHub right now is that we use Graphite to manage stacked diffs, and as far as I can tell, Graphite doesn't support anything but GitHub. What other tools are people using for stacked-diff workflows (especially code review)?
Gerrit is the other option I'm aware of but it seems like it might require significant work to administer.
Please do this! As a Graphite user, I'd love to be able to switch to jj for my local development, but the disconnect between it and Graphite keeps me away.
This is an interesting way to look at it because you can kind of quantify the tradeoff in terms of the value of your time. A simple analysis would be something like, if you value your time at $60/hour, then spending an additional $30 in credits becomes a good choice if it saves you more than a half-hour of work.
> I've seen people who prefer to say "hey siri set alarm clock for 10 AM" rather than use the UI. Which makes sense, because language is the way people literally have evolved specialized organs for.
I don't think it's necessary to resort to evolutionary-biology explanations for that.
When I use voice to set my alarm, it's usually because my phone isn't in my hand. Maybe it's across the room from me. And speaking to it is more efficient than walking over to it, picking it up, and navigating to the alarm-setting UI. A voice command is a more streamlined UI for that specific task than a GUI is.
I don't think that example says much about chatbots, really, because the value is mostly the hands-free aspect, not the speak-it-in-English aspect.
I'd love to know the kind of phone you're using where the voice commands are faster than touchscreen navigation.
Most of the practical day-to-day tasks on the Androids I've used are 5-10 taps away from a lock screen, and draw far fewer dirty looks from those around me.
1. Unlock the phone: easy, but takes an active swipe.
2. Go to the clock app: I might not have been on the home screen, so maybe a swipe or two to get there.
3. Set the timer to what I want: and here it COMPLETELY falls down, since it's probably showing how long the last timer I set was, and if that's not what I want, I have to fiddle with it.
If I do it with my voice I don't even have to look away from what I'm currently doing. AND I can say "90 seconds" or "10 minutes" or "3 hours" or even (at least on an iPhone) "set a timer for 3PM" and it will set it to what I say without me having to select numbers on a touchscreen.
And 95% of the time there's nobody around who's gonna give me a dirty look for it.
And less mental overhead. Go to the home screen, find the clock app, go to the alarm tab, set the time, set the label, turn it on, get annoyed by the number of alarms sitting there that I should delete so there aren't a million of them. Or just ask Siri to do it.
One thing people forget is that if you do it by hand, you can do it even when people are listening or when it's loud. Meaning it works more reliably. And in your brain you only have to store one way of doing it instead of two. So I usually prefer the more reliable approach.
I don't know anyone who uses Siri except people with really bad eyesight.
I'd sort of roughly approached this technique with my own channel organization over time without thinking about it systematically, but this is a helpful crystallization of what I'd been trying to achieve. I'm glad this was posted.
Definitely agree with others that Slack needs a richer selection of notification mechanisms, both for new content in channels and for mentions. For mentions, there's no level between "I demand immediate attention from this person" and "the characters that make up this person's name happen to be in the text of my message."
> I'll start by saying I'm skeptical of the answer and ask it to state its reasoning.
How do you tell if it's actually stating the reasoning that got it to its answer originally, as opposed to constructing a plausible-sounding explanation after the fact? Or is the goal just to see if it detects mistakes, rather than to actually get it to explain how it arrived at the answer?
The act of making it state its reasoning can help it uncover mistakes. Note that I'm asking a second model to do this, not the original one; otherwise I would not expect a different result.
I would totally expect a different result even on the same model. Especially if you're doing this via a chat interface (vs API) where you can't control the temperature parameters.
But yes, it'll be more effective on a different model.
When I was living in China I got used to crossing large streets one lane at a time. Pedestrians stand on the lane markers with cars whizzing by on either side while they wait for a gap big enough to cross the next lane. It's not great for safety, to put it mildly, but the drivers expect it and it's the only way to get across the road in some places. I was freaked out by it but eventually it became habit.
Then I came back to the US and forgot to switch back to US-style street crossing behavior at first. No physical harm done, but I was very embarrassed when people slammed on their brakes at the sight of me in the middle of the road.
As a satisfied customer of yours, the prospect of having to give up Graphite is the main thing keeping me from giving jj a try at my day job.
Ironic, since if there are a bunch of people in my boat, the lack of us in jj's user base will make it that much harder for jj to cross the "popular enough to be worth supporting" threshold.
My ideal is really just a version of `gt sync` and `gt submit` that handles updating the Graphite + GitHub server side of things and lets you use `jj` for everything else. I think it could feel super nice. Probably not as simple as my dreams, but hopefully something we can get to with enough interest!