
I always wondered how they achieved this. Is it just retries while generating tokens, where they retry as soon as they find a mismatch? Or is the model itself trained extremely well for this in 4.5?


They're using the same trick OpenAI have been using for a while: they compile a grammar and then have that running as part of token inference, such that only tokens that fit the grammar can be selected as the next token.

This trick has also been in llama.cpp for a couple of years: https://til.simonwillison.net/llms/llama-cpp-python-grammars
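
As a rough sketch of how that looks with llama-cpp-python (the model path and the toy grammar here are placeholders):

  from llama_cpp import Llama, LlamaGrammar

  # Toy GBNF grammar: the model may only answer "yes" or "no".
  grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')

  llm = Llama(model_path="./models/llama-7b.gguf")  # placeholder path
  out = llm("Is the sky blue? Answer yes or no.", grammar=grammar, max_tokens=8)
  print(out["choices"][0]["text"])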



Yeah, and now there are mature OSS solutions like Outlines and xgrammar, which makes it even weirder that Anthropic is only supporting this now.


I reaaaaally wish we could provide an EBNF grammar like llama.cpp accepts. JSON Schema covers far fewer use cases for me.


What are some examples that you can’t express in json schema?


Anything not JSON
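
For example, a GBNF grammar that forces plain CSV-style lines, which has no JSON Schema equivalent (a sketch only, the field character set is arbitrary):

  from llama_cpp import LlamaGrammar

  # Each completion must be one or more comma-separated lines of plain fields.
  csv_gbnf = r'''
  root  ::= line+
  line  ::= field ("," field)* "\n"
  field ::= [A-Za-z0-9 ]+
  '''
  grammar = LlamaGrammar.from_string(csv_gbnf)  # pass to llm(..., grammar=grammar)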


This makes me wonder if there are cases where one would want the LLM to generate a syntactically invalid response (which could be identified as such) rather than guarantee syntactic validity at the potential cost of semantic accuracy.


How sure are you that OpenAI is using that?

I would have suspected it too, but I’ve been struggling with OpenAI returning syntactically invalid JSON when provided with a simple pydantic class (a list of strings), which shouldn’t be possible unless they have a glaring error in their grammar.


You might be using JSON mode, which doesn't guarantee a schema will be followed, or structured outputs not in strict mode. In strict mode you get the property that the response is either a valid instance of the schema or an error (e.g. for a refusal).
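
Roughly, on the raw Chat Completions API the difference looks like this (schema and model name are just examples):

  from openai import OpenAI

  client = OpenAI()

  # JSON mode only guarantees *some* valid JSON, not your schema:
  #   response_format={"type": "json_object"}

  # Structured outputs with strict=True: the reply either matches the
  # schema or you get a refusal, not malformed JSON.
  schema = {
      "type": "object",
      "properties": {"items": {"type": "array", "items": {"type": "string"}}},
      "required": ["items"],
      "additionalProperties": False,
  }
  resp = client.chat.completions.create(
      model="gpt-4o-mini",
      messages=[{"role": "user", "content": "List three fruits."}],
      response_format={
          "type": "json_schema",
          "json_schema": {"name": "fruit_list", "strict": True, "schema": schema},
      },
  )
  print(resp.choices[0].message.content)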


How do you activate strict mode when using pydantic schemas? It doesn't look like that is a valid parameter to me.

No, I don't get refusals, I see literally invalid json, like: `{"field": ["value...}`


https://github.com/guidance-ai/llguidance

> 2025-05-20 LLGuidance shipped in OpenAI for JSON Schema


OpenAI is using [0] LLGuidance [1]. You need to set strict:true in your request for schema validation to kick in though.

[0] https://platform.openai.com/docs/guides/function-calling#lar... [1] https://github.com/guidance-ai/llguidance
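
For function calling the opt-in looks something like this (the tool definition is only an illustration):

  tools = [{
      "type": "function",
      "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city.",
          "strict": True,  # arguments are constrained to this exact schema
          "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"],
              "additionalProperties": False,
          },
      },
  }]
  # then: client.chat.completions.create(model=..., messages=..., tools=tools)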


I don't think that parameter is an option when using pydantic schemas.

  class FooBar(BaseModel):
      foo: list[str]
      bar: list[int]

  prompt = """#Task
  Your job is to reply with Foo Bar, a json object with foo, a list of strings, and bar, a list of ints
  """

  response = openai_client.chat.completions.parse(
      model="gpt-5-nano-2025-08-07",
      messages=[{"role": "system", "content": prompt}],
      max_completion_tokens=4096,
      seed=123,
      response_format=FooBar,
      strict=True,
  )

TypeError: Completions.parse() got an unexpected keyword argument 'strict'


You have to explicitly opt into it by passing strict=True https://platform.openai.com/docs/guides/structured-outputs/s...


Are you able to use `strict=True` when using pydantic models? It doesn't seem to be valid for me. I think that only works for json schemas.

  class FooBar(BaseModel):
      foo: list[str]
      bar: list[int]

  prompt = """#Task
  Your job is to reply with Foo Bar, a json object with foo, a list of strings, and bar, a list of ints
  """

  response = openai_client.chat.completions.parse(
      model="gpt-5-nano-2025-08-07",
      messages=[{"role": "system", "content": prompt}],
      max_completion_tokens=4096,
      seed=123,
      response_format=FooBar,
      strict=True,
  )

> TypeError: Completions.parse() got an unexpected keyword argument 'strict'
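
For what it's worth, there doesn't seem to be a strict kwarg on parse() at all; passing the pydantic model as response_format appears to be enough, since the SDK converts it to a strict JSON schema itself. Untested sketch:

  from openai import OpenAI
  from pydantic import BaseModel

  class FooBar(BaseModel):
      foo: list[str]
      bar: list[int]

  client = OpenAI()
  response = client.chat.completions.parse(
      model="gpt-5-nano-2025-08-07",
      messages=[{"role": "user", "content": "Reply with foo (strings) and bar (ints)."}],
      response_format=FooBar,  # no strict kwarg; the SDK emits a strict json_schema
  )
  print(response.choices[0].message.parsed)  # a FooBar instance, or None on refusal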


The inference doesn't return a single token, but probabilities for all tokens. You just select the most likely token that is allowed according to the compiled grammar.
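
Conceptually something like this (a toy sketch, not any particular library's implementation):

  import math

  def pick_next_token(logits: dict[str, float], allowed: set[str]) -> str:
      # Keep only the tokens the grammar allows at this position,
      # renormalise, then pick one (greedy here for simplicity).
      masked = {tok: lp for tok, lp in logits.items() if tok in allowed}
      z = max(masked.values())
      probs = {tok: math.exp(lp - z) for tok, lp in masked.items()}
      total = sum(probs.values())
      probs = {tok: p / total for tok, p in probs.items()}
      return max(probs, key=probs.get)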


Hmm, wouldn't that sacrifice a better answer in some cases (not sure how many, though)?

I'd be surprised if they hadn't also specifically trained for structured "correct" output, in addition to picking the next token following the structure.


In my experience (I've put hundreds of billions of tokens through structured outputs over the last 18 months), I think the answer is yes, but only in edge cases.

It generally happens when the grammar is highly constrained, for example if a boolean is expected next.

If the model assigns a low probability to both true and false coming next, then the sampling strategy will pick whichever one happens to score highest. Most tokens have very similar probabilities close to 0 most of the time, and if you're picking between two of these then the result will often feel random.

It's always the result of a bad prompt though: if you improve the prompt so that the model understands the task better, there will be a clear difference in the scores the tokens get, and the result seems less random.
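
You can see the effect with toy numbers, assuming the grammar only allows "true" or "false" at this position:

  import math

  # Hypothetical logits: the model really wants to write prose here,
  # and barely distinguishes between the two allowed tokens.
  logits = {"The": 9.3, " It": 8.9, "true": 1.02, "false": 1.01}
  allowed = {"true", "false"}

  masked = {t: v for t, v in logits.items() if t in allowed}
  total = sum(math.exp(v) for v in masked.values())
  probs = {t: math.exp(v) / total for t, v in masked.items()}
  print(probs)  # roughly {"true": 0.502, "false": 0.498} -- effectively a coin flip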


It's not just the prompt that matters, it's also field order (and a bunch of other things).

Imagine you're asking your model to give you a list of tasks mentioned in a meeting, along with a boolean indicating whether the task is done. If you put the boolean first, the model must decide both what the task is and whether it is done at the same time. If you put the task description first, the model can separate that work into two distinct steps.

There are more tricks like this. It's really worth thinking about which calculations you delegate to the model and which you do in code, and how you integrate the two.
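
Concretely, with pydantic models (field names are just for illustration):

  from pydantic import BaseModel

  # With constrained decoding, field order is also generation order.

  class TaskDoneFirst(BaseModel):
      done: bool        # model must commit to done/not-done before writing the task
      description: str

  class TaskDescriptionFirst(BaseModel):
      description: str  # model writes the task out first...
      done: bool        # ...then judges completion with that text already in context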


Grammars work best when aligned with the prompt. That is, if your prompt gives you the right format of answer 80% of the time, the grammar will take you to 100%. If it gives you the right answer 1% of the time, the grammar will give you syntactically correct garbage.


Sampling is already constrained with temperature, top_k, top_p, top_a, typical_p, min_p, entropy penalty, smoothing, etc. – filtering tokens down to the ones that are valid according to a grammar is just another such constraint. It makes sense and can be used for producing programming-language output as well – what's the point in generating output you know up front is invalid? Better to filter it out and allow valid completions only.


The "better answer" wouldnt had respected the schema in this case.



