
>I'm just struggling to build a picture of how exactly the network accomplishes this.

I mean, intuitively it would be trivial for the model to just optimise lambda to zero during training. Then you've essentially built a vanilla transformer with an overcomplicated parameter-pruning mechanism. Pruning is already well established in the literature as something that works surprisingly well for reducing parameter counts up to (hold on to your papers)... about 40%. In practice the model probably doesn't behave exactly like that, but I wouldn't be surprised if it just approximates a normal transformer in the end anyway.
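
For reference, here's a minimal sketch of the kind of magnitude pruning that comparison has in mind (not the paper's mechanism, just plain post-hoc pruning): zero out the smallest weights by absolute value at the ~40% sparsity level mentioned above. The function name and the 0.4 default are illustrative assumptions, not anything from the paper.

    import numpy as np

    def magnitude_prune(weights: np.ndarray, sparsity: float = 0.4) -> np.ndarray:
        """Return a copy of `weights` with the smallest-|w| fraction zeroed out."""
        flat = np.abs(weights).ravel()
        k = int(sparsity * flat.size)
        if k == 0:
            return weights.copy()
        threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
        mask = np.abs(weights) > threshold            # keep only larger weights
        return weights * mask

    # Example: prune a random weight matrix and check how much survives.
    w = np.random.randn(512, 512)
    w_pruned = magnitude_prune(w, sparsity=0.4)
    print("fraction of nonzero weights:", (w_pruned != 0).mean())

The point of the comparison is just that you can already throw away roughly that fraction of weights after training with little quality loss, so a learned gate that collapses to zero wouldn't be buying much beyond that.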


