You have to consider that there is still some low-hanging fruit that could improve prompt processing (not token generation) performance by an order of magnitude or even two, yet there are no takers. The reason is quite simple: you can just buy more GPUs and forget about the optimizations.
If a 100x improvement in performance is left on the table, then surely even lower priority optimizations won't be implemented any time soon.
Consider this: a lot of clever attention optimizations rely on some initial pass to narrow down the important tokens and discard the rest from the KV cache. If that were actually possible, then how come the first few layers of the LLM don't already do it numerically to focus their attention? Here is the shocker: they already do, but since you're passing the full 8k context to the next layer anyway, you're mostly wasting it on... nothing.
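To make that concrete, here is a minimal PyTorch sketch of the pruning idea, not any real implementation: rank key positions by how much attention mass an early layer gave them, then let a later layer attend only over the top 10%. Everything here (the shapes, the random stand-in tensors, the single-head layout) is made up for illustration, and the causal mask is omitted for brevity.

    import torch

    torch.manual_seed(0)
    seq_len, d_head = 1024, 64   # small stand-in for an 8k prompt
    keep_frac = 0.10             # the later layer only sees 10% of the context

    # Pretend this is an early layer's attention map: [queries, keys].
    early_attn = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)

    # Score each key position by the total attention mass it received.
    key_importance = early_attn.sum(dim=0)                        # [seq_len]
    k_keep = max(1, int(seq_len * keep_frac))
    keep_idx = key_importance.topk(k_keep).indices.sort().values  # keep original order

    # Random stand-ins for a later layer's Q/K/V projections.
    q, k, v = (torch.randn(seq_len, d_head) for _ in range(3))

    # Full attention: every query attends over all seq_len keys.
    full_out = torch.softmax(q @ k.T / d_head**0.5, dim=-1) @ v

    # Pruned attention: same queries, only the kept 10% of keys/values.
    pruned_out = torch.softmax(q @ k[keep_idx].T / d_head**0.5, dim=-1) @ v[keep_idx]

    print(full_out.shape, pruned_out.shape)  # both [seq_len, d_head]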
I repeat: Does the 80th layer really need the ability to perform attention over all the previous 8k outputs of the 79th layer? The first layer? Definitely. The last? No.
What happens if you only perform attention over 10% of the outputs of layer 79? What speedup does this give you?
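For a rough sense of it, here is back-of-envelope arithmetic under assumed numbers (an 8k prompt, 80 layers, and a made-up choice of keeping full attention in the first 4 layers), counting only the query-times-key attention work during prefill:

    # Purely illustrative; the layer counts and the "how many early layers
    # keep full attention" choice are assumptions, not measurements.
    n, layers = 8192, 80
    full_layers, keep_frac = 4, 0.10

    full_cost = layers * n * n
    pruned_cost = full_layers * n * n + (layers - full_layers) * n * int(n * keep_frac)

    print(f"attention work ratio: {pruned_cost / full_cost:.2f}")  # ~0.14, roughly 7x less

That counts only the attention itself; the MLPs and projections in this sketch still see the full sequence, so the end-to-end gain depends on how much of the prefill time attention dominates at 8k.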
Notice how the model has already learned an efficient attention scheme on its own. You just need to give it less to do and it will get faster automatically.