This is a great visualization. The original transformer paper is not very clear or understandable; I tried to read it first, didn't understand it, and had to look for other explanations (for example, it was unclear to me how multiple tokens are handled).
Also, speaking of transformers: during generation they usually append each output token to the input and process the whole sequence again. Can this be optimized, so that we don't redo the same calculations for input tokens we've already processed?
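For what it's worth, the standard answer is key/value caching: the attention keys and values of earlier tokens don't change between steps, so they can be stored and reused, and only the newest token needs to be projected. Here's a minimal single-head numpy sketch of the idea (the `Wq`/`Wk`/`Wv` matrices and the `KVCache` class are made up for illustration, not taken from any real library):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy model/head dimension
# Hypothetical fixed projection matrices for one attention head.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Attention for a single query over all cached keys/values."""
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def step_naive(tokens):
    """Naive generation step: re-project every token every time."""
    K = tokens @ Wk
    V = tokens @ Wv
    q = tokens[-1] @ Wq  # only the newest token's query is needed
    return attend(q, K, V)

class KVCache:
    """Cached generation: keep K and V from all previous steps."""
    def __init__(self):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, x):
        # Project only the new token and append it to the cache.
        self.K = np.vstack([self.K, x @ Wk])
        self.V = np.vstack([self.V, x @ Wv])
        return attend(x @ Wq, self.K, self.V)

tokens = rng.standard_normal((5, d))  # stand-ins for embedded tokens
cache = KVCache()
for t in range(1, len(tokens) + 1):
    assert np.allclose(step_naive(tokens[:t]), cache.step(tokens[t - 1]))
print("cached attention matches naive recomputation")
```

With the cache, each step does O(1) projections instead of O(t); the attention itself still scans all t cached entries, so generation remains quadratic overall, but the repeated per-token work is gone.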