
What I meant is that, by changing lambda, each attention head can put its outputs in a subspace different from those of the other heads. This means the outputs of different heads do not mingle with each other, and it is easier for the following layer to pick them apart. So I was thinking of increased expressiveness, because the attention output can in principle cover a larger volume.

Maybe expressiveness is not the right term, or not the main consequence. I could imagine that having different subspaces like that also introduces a degree of robustness to out-of-distribution inputs, as this would make it harder for the outputs of one attention head to shift towards the in-distribution outputs of another head, and thus for the following layer to confuse them.
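
To make the subspace picture concrete, here is a toy sketch in plain Python. It is not tied to any particular model and does not involve lambda at all: the two "heads", the function name `recovery_error`, and the choice of axis-aligned subspaces are all assumptions made purely for illustration. It only shows that when two heads write to disjoint subspaces, a later layer can recover one head's contribution from the sum by projection, whereas with a shared subspace it cannot.

```python
import math
import random


def recovery_error(shared, d_head=4, seed=0):
    """Error when recovering hypothetical head A's output from the sum A + B.

    shared=False: head B writes to coordinates disjoint from head A's.
    shared=True:  head B writes to the same coordinates as head A.
    """
    rng = random.Random(seed)
    d_model = 2 * d_head

    # Axis-aligned subspaces (an illustrative simplification):
    # head A owns the first d_head coordinates.
    dims_a = range(0, d_head)
    dims_b = dims_a if shared else range(d_head, d_model)

    out_a = [0.0] * d_model
    out_b = [0.0] * d_model
    for i in dims_a:
        out_a[i] = rng.gauss(0.0, 1.0)
    for i in dims_b:
        out_b[i] = rng.gauss(0.0, 1.0)

    # What the following layer actually sees: the sum of both heads.
    mixed = [a + b for a, b in zip(out_a, out_b)]

    # Project the sum onto head A's subspace (keep only A's coordinates).
    recovered_a = [m if i in dims_a else 0.0 for i, m in enumerate(mixed)]

    # Euclidean distance between the projection and A's true output.
    return math.sqrt(sum((r - a) ** 2 for r, a in zip(recovered_a, out_a)))


if __name__ == "__main__":
    print("disjoint subspaces:", recovery_error(shared=False))
    print("shared subspace:   ", recovery_error(shared=True))
```

With disjoint subspaces the projection returns head A's output unchanged, so the error is zero. With a shared subspace the projection still contains head B's contribution, which is the "mingling" described above.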


