
> predict a weight of exactly zero for some of the values

Wouldn’t this be pretty unlikely, though?



Quite the opposite: if you have a long sequence, only a smattering of the words will influence the meaning of the current word. Everything else is "noise".

Attention is really good at finding this smattering of words (i.e., assigning most of the weight there). But it struggles to put exactly 0 on the other words.
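
To make the "struggles to put exactly 0" point concrete, here is a minimal sketch (plain NumPy, made-up logits): softmax exponentiates every score, so every word gets a strictly positive weight, however small.

    import numpy as np

    def softmax(logits):
        # Subtract the max for numerical stability; exp() keeps every entry strictly positive.
        z = np.exp(logits - logits.max())
        return z / z.sum()

    scores = np.array([8.0, 7.5, -4.0, -6.0, -5.5])   # one query's attention logits (made up)
    weights = softmax(scores)
    print(weights)                  # relevant words get ~0.62 / 0.38; the rest get ~1e-6
    print((weights == 0.0).any())   # False: "noise" words get tiny, but never exactly zero, weight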


I mean, wouldn’t it be unlikely that

SoftmaxA[n] - SoftmaxB[n] is exactly 0?

Even if two attention layers learn two different things, I would imagine the corresponding weights in each layer wouldn’t exactly cancel each other out.
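
For the arithmetic being questioned, here is a minimal sketch (random logits standing in for learned scores, so purely illustrative): with two independently produced softmax maps, an entry of the difference landing on exactly zero is vanishingly rare, which is the intuition behind the question.

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(logits):
        z = np.exp(logits - logits.max())
        return z / z.sum()

    # Two attention maps over the same 16 positions, from different (random) projections.
    softmax_a = softmax(rng.normal(size=16))
    softmax_b = softmax(rng.normal(size=16))
    diff = softmax_a - softmax_b

    print(np.abs(diff).min())     # can get very small...
    print((diff == 0.0).any())    # ...but an entry that is exactly 0.0 is vanishingly unlikely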


why say lot word when few word do


Few word no do tho


🫥


Phew!



