For a drier, more formal and succinct approach, see "The Transformer Model in Equations" [0], by
John Thickstun. The whole thing fits on a single page, using standard mathematical notation.
Finally, thank you so much!
Was it so difficult?
Isn't 7 lines of mathematical notation way better than pages of qualitative pub talk?
I don't really understand these ML researchers; it always looks like they have never studied mathematics at all.
Thank god. I've had to cobble something like this together for my own notes a couple of times while trying to parse papers, and was never quite sure if I was missing something.
[0] https://johnthickstun.com/docs/transformers.pdf