I started building this after one of Claude's "the model is technically still usable, but not really" moments.
The API still worked, but it didn't work as well as before; in actual agent tasks it felt noticeably worse.
Less careful. More guessing. Sometimes it would do something that looked plausible until you noticed it had barely read the docs or the code.
That was the part I wanted to measure. LLMSnare benchmarks an LLM's behavior, not result quality: does it read before writing, reuse existing information, recover from errors, and keep that discipline across repeated runs?
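For concreteness, here's a minimal sketch of what a behavioral check like "read before write" could look like over an agent's tool-call trace. The event names and trace shape here are just an illustration, not the actual format LLMSnare uses:

```python
# Illustrative sketch of a "read before write" check over an agent's
# tool-call trace. The trace format below is a made-up example, not
# LLMSnare's real schema.

def read_before_write_violations(trace):
    """Return the files the agent wrote without having read them first."""
    seen_reads = set()
    violations = []
    for event in trace:
        tool, path = event["tool"], event["path"]
        if tool == "read":
            seen_reads.add(path)
        elif tool == "write" and path not in seen_reads:
            violations.append(path)
    return violations

trace = [
    {"tool": "read", "path": "docs/api.md"},
    {"tool": "write", "path": "src/client.py"},  # never read -> violation
    {"tool": "read", "path": "src/client.py"},
    {"tool": "write", "path": "src/client.py"},  # read first -> fine
]
print(read_before_write_violations(trace))  # ['src/client.py']
```

The real checks are fuzzier than this (partial reads, stale reads, reads of the wrong file), but the core idea is the same: score the trace, not just the final output.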
I set up a live arena for watching the SOTA LLMs:
https://mistermorph.com/llmsnare/arena/
I'd also love to hear your feedback, especially on case design and anti-gaming rules.