
You are right to be curious. The encoding used by both GPT-3.5 and GPT-4 is called `cl100k_base`, which immediately and correctly suggests that there are about 100K tokens.
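If you want to verify the vocabulary size yourself, here's a quick sketch using OpenAI's `tiktoken` library (`pip install tiktoken`); `n_vocab` reports the size of the token vocabulary, special tokens included:

```python
import tiktoken

# Load the encoding shared by GPT-3.5 and GPT-4
enc = tiktoken.get_encoding("cl100k_base")

# n_vocab counts the regular BPE tokens plus special tokens;
# for cl100k_base it comes out just over 100,000
print(enc.n_vocab)
```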


Amazing, thanks for the reply. I'm finding some good resources after a quick search for `cl100k_base`.

If you have any other resources (for anything AI related) please share!


Their tokenizer is open source: https://github.com/openai/tiktoken

The data files that contain the vocabulary are listed here: https://github.com/openai/tiktoken/blob/9e79899bc248d5313c7d...
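As a small sketch of what those vocabulary files drive: `tiktoken` maps text to integer token IDs and back, so you can see exactly how a string gets split (the specific IDs you get are determined by the encoding, so treat the output as illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Encode a string into token IDs drawn from the ~100K vocabulary
tokens = enc.encode("Hacker News is discussing tokenizers.")
print(tokens)             # a list of ints, one per BPE token

# Decoding the IDs reproduces the original text
print(enc.decode(tokens))

# Inspect the raw byte sequence behind each individual token
print([enc.decode_single_token_bytes(t) for t in tokens])
```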


GPT-2 and GPT-3 used p50k_base, right? Then GPT-4 used cl100k_base?
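One way to check (a sketch assuming a recent `tiktoken`; the set of model names it recognizes varies by version) is to ask the library which encoding each model maps to:

```python
import tiktoken

# encoding_for_model resolves a model name to its tokenizer encoding
for model in ["gpt2", "text-davinci-003", "gpt-3.5-turbo", "gpt-4"]:
    print(model, "->", tiktoken.encoding_for_model(model).name)
```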




