
You are right to be curious. The encoding used by both GPT-3.5 and GPT-4 is called `cl100k_base`, which immediately and correctly suggests that there are about 100K tokens.
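If you want to verify the vocabulary size yourself, here's a quick sketch using OpenAI's `tiktoken` library (`pip install tiktoken`); `n_vocab` reports the size of the token vocabulary, special tokens included:

```python
import tiktoken

# Load the encoding shared by GPT-3.5 and GPT-4
enc = tiktoken.get_encoding("cl100k_base")

# n_vocab counts the regular BPE tokens plus special tokens;
# for cl100k_base it comes out just over 100,000
print(enc.n_vocab)
```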


Amazing, thanks for the reply. I'm finding some good resources after a quick search for `cl100k_base`.

If you have any other resources (for anything AI related) please share!


Their tokenizer is open source: https://github.com/openai/tiktoken

The data files that contain the vocabulary are listed here: https://github.com/openai/tiktoken/blob/9e79899bc248d5313c7d...
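As a small sketch of what those vocabulary files drive: `tiktoken` maps text to integer token IDs and back, so you can see exactly how a string gets split (the specific IDs you get are determined by the encoding, so treat the output as illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Encode a string into token IDs drawn from the ~100K vocabulary
tokens = enc.encode("Hacker News is discussing tokenizers.")
print(tokens)             # a list of ints, one per BPE token

# Decoding the IDs reproduces the original text
print(enc.decode(tokens))

# Inspect the raw byte sequence behind each individual token
print([enc.decode_single_token_bytes(t) for t in tokens])
```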


GPT-2 and GPT-3 used p50k_base, right? Then GPT-4 used cl100k_base?
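One way to check (a sketch assuming a recent `tiktoken`; the set of model names it recognizes varies by version) is to ask the library which encoding each model maps to:

```python
import tiktoken

# encoding_for_model resolves a model name to its tokenizer encoding
for model in ["gpt2", "text-davinci-003", "gpt-3.5-turbo", "gpt-4"]:
    print(model, "->", tiktoken.encoding_for_model(model).name)
```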




