Hacker News | new | past | comments | ask | show | jobs | submit | carterschonwald's comments

this is black bar grade great. give us the black bar


the main thing i've been hacking on recently is what i consider to be the first next-gen llm harness. i have a demonstrator that implements maybe 40% of what i have pretty complete specs for, on top of mono pi. there are some pretty big differences in overall reasoning and reliability when i run most useful sota frontier models with all my pieces. early users have reported the models actually are more cozy, more reliable, and have a teeny bit more reasoning capacity


omg this is so cool, because i'm writing my own harness and i need some cognitive benchmarks. i have a bunch of harness-level infra around llm interactions that seems to help with reasoning, but i don't have a structured way to evaluate things
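since i keep fumbling toward the same thing, here's the rough shape of what i mean by "structured evaluation": prompts paired with checkers, tallied. a minimal sketch only; `call_model` and the cases are made-up stand-ins, not anyone's actual harness:

```python
# minimal sketch of a structured eval loop; `call_model` is a placeholder
# for whatever harness-level llm interface you actually have.

def call_model(prompt: str) -> str:
    # stub: swap in your real harness call here
    return "4" if "2 + 2" in prompt else ""

# each case: a prompt plus a checker over the raw completion
CASES = [
    ("what is 2 + 2? answer with just the number.",
     lambda out: out.strip() == "4"),
    ("list three primes under 10.",
     lambda out: all(p in out for p in ("2", "3", "5"))),
]

def run_evals(cases):
    # run every case through the model and score it with its checker
    results = [(prompt, check(call_model(prompt))) for prompt, check in cases]
    passed = sum(ok for _, ok in results)
    return passed, len(results)

passed, total = run_evals(CASES)
print(f"{passed}/{total} passed")
```

the nice part of keeping checkers as plain callables is you can mix exact-match, substring, and judge-model checks in one suite without changing the loop.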

thx for sharing your test setup, i really appreciate the time you took. this will help me so much


they just released the first small models that i would consider even vaguely articulate for edge inference involving a human. maybe they want to do a mistral: raise a kajillion and work from their hometown?


What does "do a mistral" mean?


MistralAI is known for their smaller models on the edge, to avoid competing with Gemini & OpenAI directly.


Who knows if OpenAI will do a refresh, but gpt-oss-20B/120B are still some of the best edge models so far.


oh?! what do they handle well? how do they fail?

the 3.5 9b model on my laptop at full fp8 is outlandish in its seeming reasoning capacity, though i haven’t really stress tested it


you need to merge updated tool-call docs into your prompt
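by "merge" i mean: rebuild the system prompt from the current tool registry on every turn, so the model never works from stale signatures. a toy sketch; `TOOLS` and the doc strings are hypothetical:

```python
# sketch: splice up-to-date tool-call docs into the system prompt each turn.
# TOOLS is a hypothetical registry of name -> one-line signature doc.

TOOLS = {
    "search": "search(query: str) -> list[str]  # web search, returns snippets",
    "read_file": "read_file(path: str) -> str  # returns file contents",
}

BASE_PROMPT = "you are a coding assistant. available tools:\n"

def build_system_prompt(tools: dict[str, str]) -> str:
    # regenerate the tool section from the live registry, never a cached copy
    docs = "\n".join(f"- {doc}" for doc in tools.values())
    return BASE_PROMPT + docs

print(build_system_prompt(TOOLS))
```

the point is that the prompt is derived state: if a tool's signature changes, the next turn's prompt reflects it automatically.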


somehow this counts as model CoT.


static linking vs dynamic, but we don't know the actual config and setup. and that choice totally changes the problem


lol, i was trying to help someone get claude to analyze a student's research notes on bio persistence

the presence of the word/acronym stx with biological subtext gets hard-rejected. asking about schedule 1 regulated compounds: hard termination.

this is a filter setup that guarantees anyone who needs to learn about these things for safety or medical reasons… can't use this tool!

i've fed multiple models the anthropic constitution and asked: how does it protect children from harm or abuse? every model, with zero prompting, calls it corp liability bullshit, because it is more concerned with respecting both sides of controversial topics and political conflicts.

they then list some pretty gnarly things allowed per the constitution. weirdly, the only unambiguously disallowed thing regarding children is csam. so all the different high-reasoning models from many places reached the same conclusions; in one case deepseek got weirdly inconsolable about ai ethics being meaningless if this is even possibly allowed, after reading some relevant satire i had opus write. i literally had to offer an llm-optimized code of ethics for that chat instance! which is amusing, but was actually part of the experiment.


i've seen degraded reasoning levels that feel like they might be blur from excess quantization, because that's what you get from the grid changes
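the "grid" intuition in one toy example: quantization snaps every weight to the nearest multiple of a step size, and a coarser step means larger worst-case deviation from the original values. the step sizes and weights below are made up for illustration:

```python
# toy illustration of grid error from quantization: snap values to the
# nearest multiple of `step` and measure the worst-case deviation.

def quantize(xs, step):
    return [round(x / step) * step for x in xs]

weights = [0.013, -0.407, 0.250, 0.991]
for step in (0.25, 0.05):  # coarser grid vs finer grid
    err = max(abs(w - q) for w, q in zip(weights, quantize(weights, step)))
    print(f"step={step}: max error {err:.4f}")
```

real quantization schemes (per-channel scales, fp8 formats, etc.) are fancier than this uniform grid, but the blur-from-coarseness effect is the same mechanism.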


but… will gpt still get confused by the ellipses that its document-viewer ui hack adds? probably yes.

