This is my all-time favorite mathematics book. It approaches calculus fundamentals through a series of fictional conversations between a teacher and a student.
It is easy to read, yet I remember it gave me an odd visceral sense of what calculus is (or maybe I just thought it did).
You are absolutely right — Let me know if you want to read my personal anecdote on "Dead Internet Theory"...
Yeah, I especially hate how paranoid everyone is (but rightly so). I am constantly suspicious of others' perfectly original work being AI, and others are constantly suspicious of my work being AI.
> Increasing context length by complaining about schema errors is almost always worse from an end quality perspective than just retrying till the schema passes.
Another way to do this is to use a hybrid approach. You perform unconstrained generation first, and then constrained generation on the failures.
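Roughly like this (a minimal sketch, not anyone's actual pipeline; `generate_unconstrained`, `generate_constrained`, and the JSON Schema check are placeholders for whatever stack you're using):

```python
import json
from jsonschema import ValidationError, validate  # any schema validator works here


def generate_with_fallback(prompt, schema, generate_unconstrained, generate_constrained):
    """Cheap unconstrained attempt first; constrained decoding only on schema failure."""
    raw = generate_unconstrained(prompt)  # call 1: no grammar/constraint overhead
    try:
        parsed = json.loads(raw)
        validate(parsed, schema)  # raises ValidationError if it doesn't match the schema
        return parsed  # valid on the first try, done
    except (json.JSONDecodeError, ValidationError):
        # call 2: constrained decoding guarantees a schema-valid result
        return json.loads(generate_constrained(prompt, schema))
```

Worst case that's two calls for a prompt that fails the schema, and only one for everything else.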
There's no difference in the output distribution between always doing constrained generation and only doing it on the failures though. What's the advantage?
There's no advantage wrt output quality, but it can be more economical in some high-error regimes, with fewer LLM calls used in resampling (max 2 for most errors).
My point is that if you're capable of doing constrained generation, then trying once and constraining only on failure has the same output distribution as doing constrained generation in the first place, so you'd be better off just always doing constrained generation (max of 1 LLM call for the class of errors fixed by this).
There's only a different distribution with 2+ initial attempts before falling back to constrained, at least if I haven't screwed up any math.
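For the one-initial-attempt case, the equivalence falls out if you treat constrained decoding as conditioning on validity (a simplification; token-level masking only approximates this) and assume the retry doesn't see the failed attempt:

```latex
% V = set of schema-valid outputs, p = unconstrained model distribution,
% constrained decoding assumed to sample p(x | x \in V).
% One unconstrained attempt, then constrained fallback; for any x \in V:
P(\text{output} = x) = p(x) + \bigl(1 - P(V)\bigr)\,\frac{p(x)}{P(V)}
                     = \frac{p(x)}{P(V)} = p(x \mid x \in V)
% i.e. exactly the always-constrained distribution.
```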
Aren't LLMs just super-powerful pattern matchers? And isn't guessing "taps" a pattern-recognition task? I am struggling to understand how your experiment relates to intelligence in any way.
Also, commercial LLMs generally have system instructions baked on top of the core models, which intrinsically prompt them to look for purpose even in random user prompts.
There's definitely more than "just" pattern matching in there - for example, current SOTA models 'plan ahead', simultaneously processing both a rough outline of an answer and specific subject details, then combining them internally into the final result (https://www.anthropic.com/research/tracing-thoughts-language...).
LLMs are pattern matchers, but every model is given specific instructions and response designs that influence what it does with unclear prompts. This is hugely valuable to understand: you may ask an LLM an invalid question, and it matters whether it is likely to guess at your intent, reject the prompt, or respond randomly.
Understanding how LLMs fail differently is becoming more valuable than knowing that they all got 100% on some reasoning test with perfect context.
I went home for holidays last month. One day, my mom had a complaint about her food delivery and raised a ticket in the app. She was assigned "someone" on chat, and she carefully typed her issue. Then, she got a call from the same "person" who asked her to explain her issue in detail. After the call, she came to me confused and frustrated. She said the "person" on the other end kept giving unrelated solutions, and signed off saying they were happy to have resolved her issue.
Of course, you know the "person" on the other end was an LLM, which I figured out once she handed over her phone. I was livid, and despite having better things to do, wasted the next few hours sending a notice to the legal team. They paid a small sum to shut down the issue.
Looking back, if the app had at least stated she was talking to a machine and given her an option to escalate to human support, the situation would not have deteriorated.
I feel LLMs should never be used for negative interactions like complaints, or transactional interactions like placing orders. Their scope should be limited to answering factual, generic questions, like "What's my order's ETA?", etc.
I tokenized these and they seem to use around 20% fewer tokens than the original JSONs, which makes me think a schema like this might reduce latency and cost in constrained LLM decoding.
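If anyone wants to sanity-check that kind of number, here's a rough sketch using tiktoken's cl100k_base encoding (the sample object is a made-up stand-in, since the actual schemas aren't shown here):

```python
import json

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Made-up stand-in payload; swap in the real objects/schema to measure your own delta.
obj = {"order_id": 12345, "items": [{"sku": "A1", "qty": 2}], "status": "delivered"}

pretty = json.dumps(obj, indent=2)
minified = json.dumps(obj, separators=(",", ":"))

for label, text in [("pretty", pretty), ("minified", minified)]:
    print(f"{label}: {len(enc.encode(text))} tokens")
```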
I know that LLMs are very familiar with JSON, and that choosing uncommon schemas just to reduce tokens hurts semantic performance. But a schema that is sufficiently JSON-like probably won't disrupt the model's usual paths/patterns that much, and should avoid unintended bias.
Yeah, but I tried switching to minified JSON on a semantic labelling task and saw a ~5% accuracy drop.
I suspect this happened because most of the pre-training corpus is pretty-printed JSON, so the LLM was forced off its likely path and also lost all the "visual cues" of nesting depth.
This might happen here too, but maybe to a lesser extent. Anyways, I'll stop building castles in the air now and try it sometime.
If you really care about structured output, switch to XML. Much better results, which is why AI providers tend to use pseudo-XML in their system prompts and tool definitions.
It's ironic I've been waiting for smart glasses with displays ever since I found them in books and films as a kid, but now I see them as artifacts signaling a dystopian future. That they are tied to Meta does not help.
TFA floated a possible shift to "wearable AI devices", which isn't as blatantly aggressive. While it seems impossible for any of these to be as immersive, their number and ubiquity seem more insidiously intrusive to me.
> It is easy to read, yet I remember it gave me an odd visceral sense of what calculus is (or maybe I just thought it did).
I am reading it again.