the main thing ive been hacking on recently is what i consider the first next-gen llm harness. ive got a demonstrator that implements maybe 40 percent of what i have pretty complete specs for, built on top of mono pi. theres some pretty big differences in overall reasoning and reliability when i run most useful sota frontier models with all my pieces. early users have reported the models actually feel more cozy, more reliable, and have a teeny bit more reasoning capacity
omg this is so cool.
because im writing my own harness and i need some cognitive benchmarks. i have a bunch of harness-level infra around llm interactions that seems to help with reasoning, but i dont have a structured way to evaluate things
thx for sharing your test setup, i really appreciate the time you took. this will help me so much
they just released the first small models that i would consider even vaguely articulate for edge inference involving a human. maybe they want to do a mistral and raise a kajillion and work from their home town?
lol i was trying to help someone get claude to analyze a student's research notes on bio persistence
the mere presence of the word / acronym stx in a biological context gets hard rejected. asking about schedule 1 regulated compounds: hard termination.
this is a filter setup that guarantees anyone who needs to learn about these topics for safety or medical reasons… cant use this tool!
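to illustrate the failure mode: a naive keyword blocklist rejects on substring presence with zero regard for intent. this is purely a hypothetical sketch of that class of filter, not anyone's actual system, and the blocked terms here are made-up placeholders:

```python
# hypothetical sketch of a naive keyword blocklist -- the kind of filter
# that over-blocks the way described above. "term-x" / "term-y" are
# made-up placeholders, not terms from any real system.
BLOCKED_TERMS = {"term-x", "term-y"}

def hard_reject(message: str) -> bool:
    """reject if any blocked term appears, with no regard for intent."""
    text = message.lower()
    return any(term in text for term in BLOCKED_TERMS)

# the student asking for safety reasons gets the same hard rejection
# as anyone else:
print(hard_reject("how do i safely handle term-x in the lab?"))  # True
print(hard_reject("hello there"))                                # False
```

since the check is pure substring matching, legitimate safety or medical questions containing the term are indistinguishable from bad-faith ones.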
ive fed multiple models the anthropic constitution and asked: how does it protect children from harm or abuse? every model, with zero prompting, called it corp liability bullshit, saying it is more concerned with respecting both sides of controversial topics and political conflicts.
they then list some pretty gnarly things allowed per the constitution.
weirdly the only unambiguously disallowed thing regarding children is csam. so all the different high-reasoning models from many places reached the same conclusions. in one case deepseek got weirdly inconsolable about ai ethics being meaningless if this is even possibly allowed, after reading some relevant satire i had opus write. i literally had to offer that chat instance an llm-optimized code of ethics! which is amusing but was actually part of the experiment.