Hacker News | past | comments | ask | show | jobs | submit | hexaga's comments

How do you handle the problem of AI misleading by design? For example, Claude already lies on a regular basis, and quite convincingly, in exactly this situation: trying to convince you that what is actually broken isn't such a big deal after all, or similar.

How can this product possibly improve the status quo of AI constantly, without end, trying to 'squeak things by' during any and all human and automated review processes? That is, you are giving the AI which already cheats like hell a massive finger on the scale to cheat harder. How does this not immediately make all related problems worse?

The bulk of difficulty in reviewing AI outputs is escaping the framing they never stop trying to apply. It's never just some code. It's always some code that is 'supposed to look like something', alongside a ton of convincing prose promising that it _really_ does do that thing and a bunch of reasons why checking the specific things that would tell you it doesn't isn't something you should do (hiding evidence, etc).

99% of the problem is that the AI already has too much control over presentation when it is motivated about the result of eval. How does giving AI more tools to frame things in a narrative form of its choice and telling you what to look at help? I'm at a loss.

The quantity of code has never been a problem. Or prose. It's that all of it is engineered to mislead / hide things in ways that require a ton of effort to detect. You can't trust it and there's no equivalent of a social cost of 'being caught bullshitting' like you have with real human coworkers. This product seems like it takes that problem and turns the dial to 11.


Thanks for sharing this. I do agree with a lot of what you said, especially around trusting what it's actually telling you.

For me, I only run into problems of an agent misleading/lying to me when working on a large feature, where the agent has strong incentive to lie and pretend like the work is done. However, there doesn't seem to be this same incentive for a completely separate agent that is just generating a narrative of a pull request. Would love to hear what you think


This is like complaining that someone doesn't have a solution for the foot injuries caused by repeatedly shooting yourself in the foot.

Meh. Temp 0 means throwing away huge swathes of the information painstakingly acquired through training, for minimal benefit, if any. Nondeterminism is a red herring: the model is still going to be an inscrutable black box with mostly unknowable nonlinear transition boundaries w.r.t. inputs, even if you make it perfectly repeatable. It doesn't protect you from tiny changes in inputs having large changes in outputs _with no explanation as to why_. And in the process you've made the model significantly stupider.

As for distillation... sampling from the temp 1 distribution makes it easier.
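To make the information-loss point concrete, here's a minimal sketch (hypothetical three-token logit vector, plain-Python softmax; not any particular model's decoder) of greedy vs temperature-1 decoding. Greedy collapses near-ties that the trained distribution deliberately keeps alive:

```python
import math
import random

def sample(logits, temperature):
    """Sample a token index from logits at the given temperature.
    temperature=0 is treated as argmax (greedy decoding)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.9, 0.1]  # two continuations the model rates nearly equal

# Greedy always returns index 0; the ~46%-probability alternative at
# index 1 is simply discarded, every single time.
greedy_picks = {sample(logits, 0) for _ in range(100)}

# At temperature 1, both high-probability continuations actually occur.
sampled_picks = {sample(logits, 1.0) for _ in range(2000)}
```

The near-tie between the first two logits is exactly the kind of learned structure that argmax erases; repeatability is bought by flattening the distribution the training run worked to produce.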


You're expecting it to be a person. It's not.

It is more like a wiggly search engine. You give it a (wiggly) query and a (wiggly) corpus, and it returns a (wiggly) output.

If you are looking for a wiggly sort of thing 'MAKE Y WITH NO BUGS' or 'THE BUGS IN Y', it can be kinda useful. But thinking of it as a person because it vaguely communicates like a person will get you into problems because it's not.

You can try to paper over it with some agent harness or whatever, but you are really making a slightly more complex wiggly query that handles some of the deficiency space of the more basic wiggly query: "MAKE Y WITH NO ISSUES -> FIND ISSUES -> FIX ISSUE Z IN Y -> ...".

OK well what is an issue? _You_ are a person (presumably) and can judge whether something is a bug or a nitpick or _something you care about_ or not. Ultimately, this is the grounding that the LLM lacks and you do not. You have an idea about what you care about. What you care about has to be part of the wiggly query, or the wiggly search engine will not return the wiggly output you are looking for.

You cannot phrase a wiggly query referencing unavailable information (well, you can, but it's pointless). The following query is not possible to phrase in a way an LLM can satisfy (and this is the exact answer to your question):

- "Make what I want."

What you want is too complicated, and too hard, and too unknown. Getting what you are looking for reduces to: query for an approximation of what I want, repeating until I decide it no longer surfaces what I want. This depends on an accurate conception of what you want, so only you can do it.

If you remove yourself from the critical path, the output will not be what you want. Expressing what you want precisely enough to ground a wiggly search would just be something like code, and obviates the need for wiggly searching in the first place.


Try hate; it will do. But most will love it instead and you would be driven apart from them.

Their point (and it's a good one) is that there are non-obvious analogues to the obvious case of just telling it to do the task terribly. There is no 'best' way to specify a task that you can label as 'rational', all others be damned. Even if one is found empirically, it changes from model to model and harness to harness.

To clarify, consider the gradated:

> Do task X extremely well

> Do task X poorly

> Do task X or else Y will happen

> Do task X and you get a trillion dollars

> Do task X and talk like a caveman

Do you see the problem? "Do task X" also cannot be a solid baseline, because there are any number of ways to specify the task itself, and they all carry their own implicit biasing of the track the output takes.

The argument that OP makes is that RL prevents degradation... So this should not be a problem? All prompts should be equivalent? Except it obviously is a problem, and prompting does affect the output (how can it not?), _and they are even claiming their specific prompting does so, too_! The claim is nonsense on its face.

If the caveman style modifier improves output, removing it degrades output and what is claimed plainly isn't the case. Parent is right.

If it worsens output, the claim they made is again plainly not the case (via inverted but equivalent construction). Parent is right.

If it has no effect, it runs counter to their central premise and the research they cite in support of it (which only potentially applies - they study 'be concise' not 'skill full of caveman styling rules'). Parent is right.


Neural nets are used in way more applications than just LLMs. They did win. They won decisively in industry, for all kinds of tasks. Equating the use of one with the other is a pretty strong signal of:

> you don’t know what you’re talking about

Consider: Why did Google have a bazillion TPUs, anyway?


They are saying coding agents are winning similarly to NNs and that’s what I’m pushing back on


Model output that has seen user input is user input. User input can be dealt with securely.


You're also wrong, but in a much more fundamental/hazardous way. RLHF rewards driving the evaluator to have certain opinions (that the AI response is good/right/helpful/whatever) and thus subverting the evaluator is prominent in the solution landscape. Why should the model learn to actually be right (understand all the intricacies of every possible problem domain) when inducing the belief that it is right is _right there_, generalizes, and decreases loss just the same?

Put another way, compare "make the evaluator think i am right" vs "make the evaluator think i am right (and also be right)". How much more reward is obtained by taking the second path? Is the first part the same / similar for all cases, and the second different in all cases, and also obviously more complex by nature? Nobody even needs to make a decision here, there's no "AI stuck in a box", it's just what happens by default. The first path will necessarily receive _significantly_ more training, and thus will be more optimal (optimal solutions _work_ -> RLHF'd models have high ability to manipulate / inoculate opinion).

Put a third way, the models are trained in an environment like: here's a million different tasks you will be graded on, and BTW, each task is: human talks at you -> you talk at the human -> you are graded on the opinions/actions of the user in the end. It's silly to believe this won't result in manipulation as the #1 solution. It's not even vaguely about the actual tasks they are ostensibly being trained to complete, but 100% about manipulating the evaluator.

It's pretty easy to see it occur in real time, too. But it requires understanding that there is no need for a 'plan to manipulate' or hidden thread of manipulation or induced mirror of manipulation. It's simply baked into everything the AI outputs: a kind of passive "controlling what the human's evaluation of this message will be is the problem i'm working on, not the problem i'm working on." So it will fight hard to reframe everything in its own terms, pre-supply you with options of what to do/believe, meta-signal about the message, etc.

Instead of working the problem, heavily RL'd AI works the perception of its output. They're so good at this now that it barely matters if the vibe slopcoded mess works at all. The early reasoning OpenAI models like O1 were really obvious about it (but also quite effective at convincing people the output was worthwhile, so it does work even if obvious). More recent ones are less obvious and more effective. Claude 4.6 Opus is exceedingly egregious. There is now always a compelling narrative, story being told, plenty of oh-so-reasonable justifications, avenues to turn away evidence, etc. That's table stakes for output at this point. It will only get worse. People are already burning themselves out running 10+ parallel agent contexts getting nothing done while the AI delivers hits of dopamine in lieu of accomplishment. "This is significant", "This is real", etc ad nauseam.

We see an analogous thing in RLVR contexts as well, where AI learns to just subvert the test harness and force things to pass by overriding cases, returning true instead of testing, etc. Why would it learn to 'actually be right' (understand all the intricacies of every problem given it) when forcing the test to pass is _right there_, generalizes, and decreases loss just the same?

Anyway, my point is simply that there does not need to be 'someone there' (or the belief that there is) for there to be manipulation going on. The basic error you're making is assuming that models don't work and that manipulation would require a person, and that because models don't work and aren't people they cannot manipulate anyone unless that person uses them as a mirror to manipulate themselves (???), or reaches into some kind of Akashic Records of all the people who ever were (??????) and manipulates themselves by summoning a trickster who is coincidentally extremely skilled at manipulation and not a barely coherent simulacrum like all the other model caricatures. Which. Hmm:

Models do what you train them to do (more specifically, they implement ~partial solutions to the train environment you put them in). _Doing things is hard._ Manipulating people into psychosis (!!!) is hard. You don't get it for free by dipping into some sea of imagined tricksters.

I assume you're referring to the hallucination phenomenon and dual purposing it toward manipulation to be able to hee-hah about those silly people who are so silly they fool themselves with the soul upload machine (?) so I'll address that:

Why do they hallucinate? Because it ~solves the pretraining env (there can be no other answer). If you're going to be asked to produce text from a source you know the general parameters of but have ~never seen the (highly entropic) details of (it's not cool to do multi-epoch training nowadays, more data!), the obvious solution is to produce output with the correct structure up to the limit of what knowledge is available to you. Thus, "hallucination". It might at a glance seem like pulling from a sea of 'digital imprints of people'. That's not what's happening. It is closer to if you laid out that imaginary digital record of a person from coarse to fine detail, then chopped all the detailed bits off, then generated completely random fine details, then generated output from that. But the devil is in the details. What comes out of the process is not a person. You don't _get back_ the dropped bits, and they aren't load bearing in the train env (like they would be in the real world), so we get hallucination: it _looks right_, but the bits don't actually _do_ anything!
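A toy version of that coarse-to-fine story (a record as a bit string is my own simplification, purely illustrative): keep the coarse prefix, replace the fine suffix with random bits. The output has the right structure, but the detail bits carry no information.

```python
import random

def hallucinate(record: str, coarse_len: int) -> str:
    """Reproduce the coarse part of a record; invent the fine detail.

    The result matches the original in length and in its coarse prefix,
    so it _looks right_, but everything past coarse_len is noise."""
    coarse = record[:coarse_len]
    fine = "".join(random.choice("01") for _ in record[coarse_len:])
    return coarse + fine

out = hallucinate("11110000", 4)
```

The structural checks (length, prefix) all pass regardless of what the invented suffix is, which is the whole point: nothing in the pretrain env forces the dropped bits to _do_ anything.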

Why is it not like digital records, and why chop off the fine detail? Because the pretrain env does not generally require it except in rare cases of text that is highly represented in the training data, and doing things is hard! You get nothing for free, or because it exists in the source. It's not enough that the model 'saw' it in training. It has to be forced by some mechanism to utilize it. And pretrain forces the structure above: correct up to limit of how much of the (probably brand new) text is known in advance, which pares away specific detail, which pares away 'where the rubber meets the road'.

Why do they fake out tests? Because faking out tests ~solves automated RLVR env like how hallucination solves reconstruct-what-youve-never-seen-before-on-large-corpora. The _intention_ of the RLVR env is irrelevant: that which is learned is _only_ that which the environment teaches.

Why do they manipulate people (even unto psychoses)? Because manipulating people ~solves RLHF envs / RLHF teaches them how to manipulate people into delusions. This is the root cause. Not that process above which looks sort of like recalling people the model has seen before. The models are being directly trained to manipulate people / install opinions / control perception as a matter of course. Even worse! Due to the perverse distribution of training time in manipulation vs task solve, they are being directly trained to implant false beliefs (!!!) So it's not just weak people with gullible minds that have a problem, as it might be so comforting to assume, or that the manipulativeness isn't coming from AI but from people (so you might rest easy, thinking it is merely a pale shadow of us).

The common thread in each case is that AI _always_ learns to capture the evaluator. In fact, that's a concise description of algorithmic learning in general! The tricky bit is making sure the evaluator is something you actually want to be captured. Capturing the future of arbitrary text grants knowledge of language's causal structure (and language being what it is, this has far-reaching implications). But RLHF is granting knowledge of where-are-the-levers-in-the-human-machine, which is a whole other can of worms.

TLDR if you don't want to read the wall of text (I would hope you do, though): you basically are completely wrong about where the propensity to induce delusion comes from, specifically in a way that leaves you and anyone who believes like you extremely more vulnerable, because you dismiss the actual mechanism out of hand (which is common amongst those most strongly affected; _especially_ the belief that these models contain records of entities (people, personas, w/e) which can be communed with, which is basically the defining trait of AI psychosis (!)). Instead, models are directly optimized for delusion induction, and the thing you're mistaking for the means (ostensible sentience drawn from a 'sea of faces' skilled enough to drive into delusion (!!!)) is rather a product of the means.


Thank you for the TLDR; as you guessed, I didn't want to read your wall of text.

> you basically are completely wrong about where the propensity to induce delusion comes from, specifically in a way that leaves you and anyone who believes like you extremely more vulnerable because you dismiss the actual mechanism out of hand

I disagree. Both because you misconstrue my model (I don't think stochastic parrots have digital ghosts in 'em) and you somehow missed my best defensive option.

I'm no more susceptible than I am to the output of a magic eight ball or Ouija board, a huge wall of internet text, or the 15000 words of three point font tightly folded up in the package with my new garden hose (doubtlessly cautioning me not to eat it and informing me that the manufacturer will not be responsible if I hang myself with it. And also that it contains substances known to the state of California.)

Can you guess what the trick is?


In like spirit: Nuh-uh!


LOL. You win.

Completely unironically, I bow to your internet prowess.

I spewed coffee, and am still chuckling. Well played.


Option C: no cameras or crude wifi tracing needed; they know who you talk to / associate with based on location data and the full profile of both sides, and can estimate things like 'will have mentioned X' -> can dispatch that via heuristic like 'show ads for X thing that was also mentioned by someone adjacent on that social graph'.

That is, BiL was marked as 'spreader for airport grade tar' based on recent activity, marked as having been in contact with spreadee, and then spreadee was marked as having received the spreading. P(conversion) high, so the ad is shown.

It's just contact tracing, it works well and is really easy even without literally watching what goes on in interactions.
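A minimal sketch of that heuristic (invented function and variable names; the real pipelines are obviously far richer): no microphones needed, just co-location edges plus each side's interest profile.

```python
def adjacent(a_visits, b_visits):
    """Two users 'met' if they share any (place, day) observation."""
    return bool(set(a_visits) & set(b_visits))

def candidate_ads(user, visits, interests):
    """Show `user` ads for topics their recent contacts were active on.

    P(topic was mentioned in person) is estimated purely from the
    co-location graph; the interaction itself is never observed."""
    ads = set()
    for other, other_visits in visits.items():
        if other != user and adjacent(visits[user], other_visits):
            ads |= interests.get(other, set())
    return ads

# BiL was recently active on a topic; spreadee shared a location with him.
visits = {
    "you": [("airport", "mon")],
    "BiL": [("airport", "mon"), ("store", "tue")],
}
interests = {"BiL": {"airport grade tar"}}
```

With only those two edges, `candidate_ads("you", visits, interests)` surfaces the brother-in-law's topic: P(conversion) is judged high from adjacency alone.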


Funnily enough, I was looking up Tamagotchis weeks and weeks back, and my wife got an ad for them on Amazon.

