Quite the opposite — if you have a long sequence, only a smattering of the words will influence the meaning of the current word. Everything else is “noise”.
Attention is really good at finding this smattering of words (i.e., assigning most of the weight there), but it struggles to put exactly zero weight on the other words: softmax over finite scores always produces strictly positive weights, so every token receives at least a little attention.
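A minimal sketch of that point, using NumPy (the scores here are made up for illustration, not taken from any model): even when one token's score dominates, the softmax spreads small but strictly positive weight over everything else.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; softmax is shift-invariant.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# One token dominates the attention scores; the rest are "noise".
scores = np.array([8.0, 0.1, -0.3, 0.2, -0.1])
weights = softmax(scores)

print(weights)            # roughly [0.9987, 0.0004, 0.0003, 0.0004, 0.0003]
print(weights.min() > 0)  # True: every token keeps strictly positive weight
```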
Wouldn’t this be pretty unlikely, though?