Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

What I've found is that it is very important to make structured outputs as easy for the LLM as possible. This means making your schemas LLM-friendly instead of programmer-friendly.

E.g. if the LLM hallucinates non-existing URLs, you may add a boolean "contains_url" field to your entity's JSON schema, placing it before the URL field itself. This way, the URL extraction is split into two simpler steps, checking if the URL is there and actually extracting it. If the URL is missing, the `"contains_url": false` field in the context will strongly urge the LLM to output an empty string there.

This also comes up with quantities a lot. Imagine you're trying to sort job adverts by salary ranges, which you extract via LLm. . These may be expressed as monthly instead of annual (common in some countries), in different currencies, pre / post tax etc.

Instead of having an `annual_pretax_salary_usd` field, which is what you actually want, but which the LLM is extremely ill-equipped to generate, have a detailed schema like `type: monthly|yearly, currency:str, low:float, high:float, tax: pre_tax|post_tax`.

That schema is much easier for an LLM to generate, and you can then convert it to a single number via straight code.



Awesome insight, thanks for this!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: