
Recent work by Feng et al. from Bengio's lab focuses on how attention can be formulated as an RNN ("Attention as RNN": https://arxiv.org/pdf/2405.13956) and how minimal versions of GRUs and LSTMs can be trained in parallel by removing some parameters ("Were RNNs All We Needed?": https://arxiv.org/pdf/2410.01201).
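To make the "trained in parallel" claim concrete: the minimal GRU in the second paper drops the hidden-state dependence from its gates, so the recurrence becomes a linear one, h_t = (1 - z_t) * h_{t-1} + z_t * h~_t, with z_t and h~_t computed from the input alone. That linear form can be unrolled in closed form over the whole sequence instead of step by step. A rough NumPy sketch of both views (weight names `Wz`, `Wh` are my own placeholders, not from the paper; the paper also uses a log-space scan for numerical stability, which this sketch omits):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def min_gru_sequential(x, Wz, Wh, h0):
    """Step-by-step minGRU: gates depend only on the input x_t,
    never on h_{t-1}, which is what makes the recurrence linear."""
    h, hs = h0, []
    for t in range(x.shape[0]):
        z = sigmoid(x[t] @ Wz)         # update gate, from input only
        h_tilde = x[t] @ Wh            # candidate state, from input only
        h = (1.0 - z) * h + z * h_tilde
        hs.append(h)
    return np.stack(hs)

def min_gru_parallel(x, Wz, Wh, h0):
    """Same recurrence solved for all t at once.  With a_t = 1 - z_t and
    b_t = z_t * h~_t, h_t = (prod_{k<=t} a_k) * (h0 + sum_{j<=t} b_j / prod_{k<=j} a_k).
    cumprod/cumsum stand in for the parallel scan; a real implementation
    would work in log space to avoid underflow on long sequences."""
    z = sigmoid(x @ Wz)
    h_tilde = x @ Wh
    a = 1.0 - z
    b = z * h_tilde
    A = np.cumprod(a, axis=0)          # running products prod_{k<=t} a_k
    return A * (h0 + np.cumsum(b / A, axis=0))
```

Both functions produce the same hidden states; the parallel version just trades the sequential loop for cumulative ops that a GPU can evaluate with a scan.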

It's possible we start seeing more blended versions of RNN/attention architectures exploring different LLM properties.

In particular, the Aaren architecture in the former paper "can not only (i) be trained in parallel (like Transformers) but also (ii) be updated efficiently with new tokens, requiring only constant memory for inferences (like traditional RNNs)."
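The constant-memory inference claim rests on the observation that softmax attention over a growing prefix can be maintained incrementally: instead of storing every key and value, you keep a running max, numerator, and denominator (the "online softmax" trick). A minimal sketch of that mechanism, not of the Aaren architecture itself; the class and method names are my own:

```python
import numpy as np

class StreamingAttentionState:
    """Constant-memory state for attention over a token stream.
    Holds only a running score max (for numerical stability), a running
    numerator sum exp(s_i - m) * v_i, and a running denominator."""
    def __init__(self, d_v):
        self.m = -np.inf           # max attention score seen so far
        self.num = np.zeros(d_v)   # running weighted sum of values
        self.den = 0.0             # running sum of exp weights

    def update(self, score, v):
        """Fold in one new (score, value) pair; O(d_v) memory and time."""
        m_new = max(self.m, score)
        scale = np.exp(self.m - m_new)   # rescale old terms to the new max
        self.num = self.num * scale + np.exp(score - m_new) * v
        self.den = self.den * scale + np.exp(score - m_new)
        self.m = m_new

    def output(self):
        """Current attention output, equal to full softmax over all
        tokens seen so far."""
        return self.num / self.den
```

After processing a stream token by token, `output()` matches what a full softmax over all stored keys and values would give, but nothing grows with sequence length.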



The formulations in "Attention as RNN" have similar issues to RWKV. Fundamentally, it's a question of what we call an RNN.

Personally, I think it's important not to call some of these recent architectures RNNs, because they have theoretical properties that don't match (read: are worse than) those of what we've "classically" called RNNs.

Ref: https://arxiv.org/abs/2404.08819

As a rule of thumb: you generally don't get parallelism for free; you pay for it with poorer expressivity.



