the main thing ive been hacking on recently is what i consider the first next-gen llm harness. ive got a demonstrator that implements maybe 40 percent of what i have pretty complete specs for, built on top of mono pi. theres some pretty big differences in overall reasoning and reliability when i run most useful sota frontier models with all my pieces. early users have reported the models actually feel more cozy, more reliable, and have a teeny bit more reasoning capacity
omg this is so cool.
because im writing my own harness and i need some cognitive benchmarks. i have a bunch of harness-level infra around llm interactions that seems to help with reasoning, but i dont have a structured way to evaluate things
thx for sharing your test setup, i really appreciate the time you took. this will help me so much
they just released the first small models that i would consider even vaguely articulate for edge inference involving a human. maybe they want to do a mistral and raise a kajillion and work from their home town?
lol i was trying to help someone get claude to analyze a student's research notes on bio persistence
the mere presence of the word / acronym stx in a biological context gets hard rejected. asking about schedule 1 regulated compounds: hard termination.
this is a filter setup that guarantees anyone who needs to learn about these topics for safety or medical reasons… cant use this tool!
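to illustrate the failure mode: a naive keyword blocklist rejects on substring presence with zero regard for intent. this is purely a hypothetical sketch of that class of filter, not anyone's actual system, and the blocked terms here are made-up placeholders:

```python
# hypothetical sketch of a naive keyword blocklist -- the kind of filter
# that over-blocks the way described above. "term-x" / "term-y" are
# made-up placeholders, not terms from any real system.
BLOCKED_TERMS = {"term-x", "term-y"}

def hard_reject(message: str) -> bool:
    """reject if any blocked term appears, with no regard for intent."""
    text = message.lower()
    return any(term in text for term in BLOCKED_TERMS)

# the student asking for safety reasons gets the same hard rejection
# as anyone else:
print(hard_reject("how do i safely handle term-x in the lab?"))  # True
print(hard_reject("hello there"))                                # False
```

since the check is pure substring matching, legitimate safety or medical questions containing the term are indistinguishable from bad-faith ones.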
ive fed multiple models the anthropic constitution and asked: how does it protect children from harm or abuse? every model, with zero prompting, called it corp liability bullshit, saying it is more concerned with respecting both sides of controversial topics and political conflicts.
they then list some pretty gnarly things allowed per the constitution.
weirdly the only unambiguously disallowed thing regarding children is csam. so all the different high-reasoning models from many places reached the same conclusions. in one case deepseek got weirdly inconsolable about ai ethics being meaningless if this is even possibly allowed, after reading some relevant satire i had opus write. i literally had to offer that chat instance an llm-optimized code of ethics! which is amusing but was actually part of the experiment.