Hacker News | comboy's comments

Prompt matters. Obviously, if you want another model's opinion you must generate from scratch using the same prompt and then try to synthesize, but working with an existing response can work if desired. I use explicit instructions to find issues with assigned severities, and then these go through a panel of judges; only issues passing a certain threshold are fixed in the original response.

I'll share a revelation which vastly improved my results: tell judges to evaluate truth and usefulness/should-be-fixed as separate axes. Because inevitably, with a prompt that forces the model to find issues, you will end up with nitpicks. Plus, the truth axis lets you better evaluate the issue-finder models for your use case.
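To make the two-axis idea concrete, here is a minimal sketch of how the panel's votes could be combined. The `Verdict` schema, the 0..1 scores, and the threshold values are all assumptions for illustration, not the actual setup behind the site.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Verdict:
    """One judge's scores for one reported issue (hypothetical schema)."""
    truth: float       # 0..1: is the reported issue factually correct?
    usefulness: float  # 0..1: is it worth changing the response for?


def mean_scores(verdicts):
    """Average each axis separately across the panel."""
    return mean(v.truth for v in verdicts), mean(v.usefulness for v in verdicts)


def should_fix(verdicts, truth_min=0.7, useful_min=0.5):
    """Fix only issues the panel finds both true and worth fixing."""
    if not verdicts:
        return False
    truth, useful = mean_scores(verdicts)
    return truth >= truth_min and useful >= useful_min


def finder_truth_rate(issues, truth_min=0.7):
    """Share of a finder's issues that the panel judged true, nitpicks included."""
    if not issues:
        return 0.0
    true_count = sum(1 for verdicts in issues if mean_scores(verdicts)[0] >= truth_min)
    return true_count / len(issues)


nitpick = [Verdict(0.9, 0.1), Verdict(1.0, 0.2)]       # true, but not worth fixing
hallucinated = [Verdict(0.1, 0.9), Verdict(0.2, 0.8)]  # sounds useful, but false
real_issue = [Verdict(0.9, 0.8), Verdict(0.8, 0.9)]    # true and worth fixing
```

With a single combined score, the nitpick and the hallucinated issue would be hard to tell apart. Kept separate, the nitpick still counts in the finder's favor on the truth axis without triggering a fix.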

That's some part of what happens when I generate explanations like this one: https://hanzirama.com/character/%E6%9D%A5#explain - at this point the site is a small side product of my LLMs-evaluation machinery.

Bonus content for patient readers: if you need top quality you will likely need to pin provider(s) on OR (OpenRouter); :exacto is not enough to get good, repeatable results, especially for open-weights models.


They cannot do it. Apart from all the practical, technical and talent reasons, it would still be exporting forbidden stuff.

The signal is clear enough though for the next Anthropic..


It's simple, marketing dominates everything. With attention being very expensive, appearance is what matters.

It doesn't matter if you write a fantastic library; nobody is gonna use it because they won't know about it. The one with a gif of the terminal (ffs) and a good page describing what it does will win (and, being the most popular one, it can even become better than your library because of the following, but that's not the point here).

It's everywhere: products, hiring, services. We have no network of trust (sigh), so we need to trust some heuristics based on shallow information. If somebody focuses on the shallow, he wins, because nobody can ever dive into everything.


But it sounds like FableFool so it has that going for it.

There is now a "Switch models when a message is flagged" option in /config which can be set to false, but I have had no chance to see what happens then: does it just stop, or what?

Session paused

Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Send feedback with /feedback or learn more

   1. Switch to Opus 4.8
   2. Edit prompt and retry with Fable 5

Biology? Why?

they're worried about people creating bioweapons

"by the way, your previous attempts have these structural problems."

Just to be clear, it did not have access to any previous work that opus did? Because they are pretty good at digging out relevant tmp files and making use of whatever is out there.

In my fable adventures I caught it hallucinating something and stating it as fact in the CLI, twice. And it was something that I had not seen opus do in such a way. Opus obviously stated many things that it did not verify but guessed, but fable said something like "the probe showed that ..." when there was no probe. It was not about some past events; it was about what it was doing right now. "I overstated"...

But boy does it know Chinese, so much better than any other English model. Gemini used to be the king, but fable clearly was trained on a decent amount of it. It has a deep cultural understanding.


If you have some spare time, I'd be interested in knowing what kind of questions you use to test models on understanding of Chinese culture.

I'm creating hanzirama.com

I generate explanations for characters and words like so: https://hanzirama.com/character/%E6%9D%A5#explain

But I don't want to mislead learners and I want to provide some cultural depth, so I have a whole sophisticated pipeline: multiple models generate the explanation, then multiple models look for issues in the explanation, each issue goes through a panel of judges (basically trying to squash any hallucinations), it's fixed, and it goes through such cycles a few times over.
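The find, judge, fix cycle can be sketched roughly like this. The function names and the toy stand-ins are hypothetical; in a real pipeline each callable would wrap one or more model calls.

```python
def refine(text, find_issues, judge, fix, max_cycles=3):
    """Run find -> judge -> fix cycles until nothing survives judging."""
    for _ in range(max_cycles):
        accepted = [issue for issue in find_issues(text) if judge(text, issue)]
        if not accepted:
            break  # nothing left that the judges consider worth fixing
        text = fix(text, accepted)
    return text


# Toy stand-ins so the sketch runs without any model calls.
def toy_find_issues(text):
    return [word for word in ("teh", "recieve") if word in text]


def toy_judge(text, issue):
    return True  # a real panel of judges would vote here


def toy_fix(text, issues):
    corrections = {"teh": "the", "recieve": "receive"}
    for issue in issues:
        text = text.replace(issue, corrections[issue])
    return text


result = refine("recieve teh letter", toy_find_issues, toy_judge, toy_fix)
```

The `max_cycles` cap is there because a finder prompted to find issues will usually keep finding something, so the loop needs a stopping rule beyond "no issues reported".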

I've been at it for some months now, so I have dozens of different probes that I needed to evaluate prompts and method changes. Plus, on some items I generated so many explanations through different means that I can tell a lot about a given model just by looking at one.

Plus I'm doing some statistics, so I see how e.g. when working as judges of issues some models correlate heavily with some others... Fun fact: during some testing runs, basically just testing providers, I stumbled upon qwen introducing itself as made by Google. And also Anthropic's Sonnet saying that it was made by OpenAI :)

At this point all my evaluation frameworks and pipeline stuff is much bigger than the site itself. I'm having lots of fun though.
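One simple way to see which judges move together is a pairwise Pearson correlation over their votes on the same issues. This is only a sketch: the judge names and votes below are made up, and the actual statistics used may well be different.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a judge that always votes the same carries no signal here
    return cov / sqrt(var_x * var_y)


def judge_correlations(votes):
    """Map each pair of judges to the correlation of their votes."""
    return {
        (a, b): pearson(votes[a], votes[b])
        for a, b in combinations(sorted(votes), 2)
    }


# Hypothetical votes (1 = issue accepted) from three judges on the same five issues.
votes = {
    "judge_a": [1, 0, 1, 1, 0],
    "judge_b": [1, 0, 1, 1, 0],
    "judge_c": [0, 1, 0, 0, 1],
}
correlations = judge_correlations(votes)
```

Two judges that correlate near 1.0 add little independent evidence to a panel, which is a reason to look at this before picking which models sit on it.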


Yes, it had access. That's actually the point.

I maintain a failure registry in the repo. Every failed attempt gets documented with the exact mechanism, the test that regressed, the revert SHA, and an instruction to start from that frontier. Fable read all of it.

But so did Opus.

Each of the 16 Opus failures ran in the same harness with the same accumulating registry. By attempt 15, it had disproofs 1–14 in context. By the end, Opus had basically the same corpus that Fable started with, and it still kept failing, sometimes by re-deriving an already-disproved approach in a slightly different shape.

So “it leveraged the previous work” doesn’t really separate them. Both had the leverage. Only one converted it.

What changed wasn’t more context. It was that Fable rejected a premise inside the context.

The registry’s standing framing was: “this needs whole-program borrow inference, which conflicts with per-module incrementality” (architecturally blocked). Fable ran around 5 fresh attempts in-session, hit the same wall, and then noticed the framing was a red herring: the borrow analysis already runs module-wide, and for a single-module program, the module is the whole program.

Opus read that same framing for months and treated it as a constraint. Fable falsified it.

It's the same repo, same rules, same disproof history, same workflow. The model was the only variable that changed, and the outcome flipped. Is it possible that attempt 17 by Opus could have figured it out? Sure. But there are 16 previous attempts that say otherwise.

As far as anecdotes go, that’s about as controlled as it gets.
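For readers wondering what a failure registry like the one described above might look like, here is a minimal sketch. The field names mirror the description (mechanism, regressed test, revert SHA, instruction), but the schema, the rendering, and every value are placeholders, not the real registry.

```python
from dataclasses import dataclass


@dataclass
class FailedAttempt:
    """One documented failure (hypothetical schema)."""
    number: int
    mechanism: str        # the exact way the attempt failed
    regressed_test: str   # the test that regressed
    revert_sha: str       # commit to revert to
    instruction: str      # where the next attempt should start


def render_registry(attempts):
    """Render all entries, oldest first, as text to put in a model's context."""
    lines = []
    for attempt in sorted(attempts, key=lambda a: a.number):
        lines.append(f"Attempt {attempt.number}: {attempt.mechanism}")
        lines.append(f"  regressed test: {attempt.regressed_test}")
        lines.append(f"  revert to: {attempt.revert_sha}")
        lines.append(f"  next: {attempt.instruction}")
    return "\n".join(lines)


# Placeholder entries only; none of these values come from the real repo.
registry = [
    FailedAttempt(2, "placeholder mechanism B", "test_b", "bbbbbbb", "start from attempt 2 frontier"),
    FailedAttempt(1, "placeholder mechanism A", "test_a", "aaaaaaa", "start from attempt 1 frontier"),
]
context_text = render_registry(registry)
```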


I’ve had a similar experience.

Pointing out past suboptimal / failing behaviours to new opus sessions would almost always actually create a sort of "anchoring bias" that would drive the agents towards exhibiting the failure mode (often while mentioning how it wouldn’t fall for it).

As far as I can recall, Fable has been the first model to discover the documented failure modes, comment on them, and just… keep going, actually avoiding them. Quite a surprise.


I kind of enjoy exploring black boxes, seeing how different inputs map to differences in outputs. It's kind of like hacking. The problem is, they keep altering the box.

The box is stochastic by design, and has an untraceable amount of complexity between its context and output by nature.

It may be fun to look at inputs and outputs, but it's not hackable and trying to map one into the other is more like astrology than a science.


It's copromancy. Picking through the clanker's doings in an attempt to predict the future.

Thanks, you taught me a new word today! https://en.wikipedia.org/wiki/Scatomancy

It feels like Greek mythology should have some metaphor for "apparently simple structure that is so complex it leads anybody that studies it into madness". But I can't think of any name to put there.

Maybe the idea of complexity is too modern.


No but you see, I have a system! /s

(I spent too long by the horse racing track)


The third sentence got to what my objection was going to be. It's fun trying to make the thing do what you want it to do! That's why many of us like computers. It's the randomness that sucks and makes the process unsatisfying.

That's just a slot machine

ROTFL

Here's the thing. Building trust and then leaving stuff in has been around forever. The fact that it becomes cheaper does not matter that much (since protection against it is also getting better), but it required you to have a bunch of extremely talented people who have spent much of their lives diving into a given topic.

Such driven people are usually hard to buy; they would usually rather get by with enough income and work on interesting projects with interesting people than take some uninteresting work for tons of money. This still does not stop them from working for Malice. But ethics do. Even if not right away, if people see that what they are doing is not quite OK, the talent starts eroding. People quit, productivity drops. That was a good dynamic. Which now will be gone.


Oh, a nice subthread to vent in. Their CLI is so f tragic that it is ridiculous. It keeps scrambling the terminal; scroll and basic shortcuts keep breaking. I've used so many TUIs and terminal apps, many of them a one-man operation and a side project, and I have never seen anything so bad.

If I didn't know from experience that directed properly claude can be powerful, knowing that they used it to create that CLI would be instant runaway based on very reasonable heuristics - if they are not able to use their product to create a decent piece of software that is not even sophisticated then it seems futile for me to try.

I just do not understand. I feel like most of HN could vibe code a better claude CLI in claude (and certainly just write one) than what we have to deal with to use the subscription.


I could not agree more that Claude itself is a janky, hacky, crappy piece of software.

When management at $DAYJOB brought the hammer down and said, "Everyone has to use genAI all the time, OR ELSE," I expected to be blown away by the tool I was avoiding due to ethical concerns, aesthetic objections, humanism, and long-term thinking.

I was blown away, but not in a good way.

The CLI is _bad_. I've seen it randomly fail to render anything at all on the terminal multiple times. It has a vim-mode, but it's painfully buggy, and I can literally outrun it - if I try to type too quickly after hitting Esc for normal mode, it just doesn't return to normal mode. I was keeping track of the bugs in the Claude TUI, but gave up because it was taking _too much of my time_ to do so.

If nothing else, I'd say Claude shows convincingly that success is not the default for vibecoding.

Yes, it technically does the job, and no, I don't think I've ever used a worse TUI.

