It's hard to tell how much this says about the difficulty of harnessing AI vs. the difficulty of maintaining a clean, non-bloated codebase when coding with AI.
Why not both? AI writes bloated spaghetti by default. The control plane needs to be human-written and rigid, at least until the state machine is solid enough to dogfood itself. Then you can safely let the AI enhance the harness from within the sandbox.
I mean, tools change, but I'd be happy to hear if any tool can create that from a single prompt like "create Claude Code Unpack with nice graphics". It was likely an iterative process, and it would be lovely if more people started sharing that, because the process itself is also very interesting.
I've created a Chinese-character learning website, and it took me typing 1/3 of LotR to get there[1]. I would have typed maybe 1% of that writing the code directly. It is a different process, but it still needs some direction.
As things stand today, even for research tasks, the time spent by the model is >> the time spent fetching websites. I don't see that changing any time soon, except if some deals happen behind the scenes where agents get access to Cloudflare-guarded resources that are normally blocked from automated access.
You might need to turn laws into formal proofs, and the existence of judges makes me think that's not as likely as you would like. A commenting system might work, though: one trained on countries' precedents, jurisprudence, and traditions.
This could in theory already happen without any tech, but I suspect that since the government is pretty monolithic, any changes to a specific law are all being made by the same set of people.
You might not have merge conflicts but I imagine you could end up with conflicting guidance from two separate pieces of law (e.g., law A says you must wear green on St. Patrick's day, law B outlaws green pajamas).
> “We are not aware of any successful mercenary spyware attacks against a Lockdown Mode-enabled Apple device,” Apple spokesperson Sarah O’Rourke told TechCrunch on Friday.
At least 225 judges have ruled in more than 700 cases that the administration's mandatory immigration detention policy likely violates the right to due process.[1] The Fifth Amendment's Due Process Clause generally requires those having federal funds cut off to receive notice and an opportunity for a hearing, which was not provided in many of DOGE's spending freezes.[2]
Not really related, but does anybody know if somebody is tracking the same model's performance on some benchmarks over time? Sometimes I feel like I'm being A/B tested.
Yeah, good tests come at a cost. I'd like to see benchmarks on big, messy codebases, measuring how models perform on a clearly defined task that's easy to verify.
I was thinking that tokens spent in such a case could also be an interesting measure, though an agent might do some small useful refactoring along the way. Then again, the prompt could specify making the minimal change required to achieve the goal.
People are loading huge interpreted environments for stuff that can be done from the command line, or running computations on complex objects where a single machine instruction would do, etc. The trend has been around for a long time.
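As a toy sketch of that overhead (the file path and numbers are invented for illustration): summing a column of numbers by booting a full interpreted runtime vs. streaming it through a small classic command-line tool.

```shell
# Generate a million numbers to sum (illustrative data).
seq 1 1000000 > /tmp/nums.txt

# Heavyweight: starts an entire Python interpreter for a trivial reduction.
python3 -c "print(sum(int(x) for x in open('/tmp/nums.txt')))"

# Lightweight: awk streams the file with negligible startup cost.
awk '{ s += $1 } END { printf "%d\n", s }' /tmp/nums.txt
```

Both print the same total; prefixing each command with `time` makes the startup-cost difference visible.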
Wow, /insights is genuinely useful; perhaps the CLI should push that as a tip once one has enough sessions, instead of nagging me about the frontend-developer skill I already have installed.
In general the CLI could be more reliable and responsive, though; it's a text-based environment, yet it sometimes feels like running Windows 95 on a 386DX.
It seems clear from the insights that some model is marking failure cases when things went wrong, and likely reporting home, so that should be extremely valuable to Anthropic.