Power centralization is an inevitable outcome of improved techniques. Centralized control selects, however imperfectly, for highly specific skills that only a few possess. One can attempt to regulate inequality and inform public opinion to some extent, but this will get harder going forward, as those same improved techniques enable manipulation by those in power.
They use the key-value lookup/routing mechanism from Transformers to predict pixel-wise labels in bird's-eye view (lane, car, obstacle, intersection, etc.). The motivation is that parts of the scene may be temporarily occluded, so predicting those areas benefits from attending to remote regions of the input images. That requires long-range dependencies which depend heavily on the input itself (e.g. on whether there is an occlusion), which is exactly where the key-value mechanism excels. I'm not sure they even process past camera frames at this point; they only mention that later in the pipeline an LSTM-like NN incorporates past camera frames (Schmidhuber will be proud!!).
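For the curious, the key-value lookup being described is just plain scaled dot-product attention; here's a minimal NumPy sketch (toy shapes, not their actual architecture). Each query, e.g. for an occluded BEV cell, can pull a weighted mix of values from whichever input locations have matching keys, no matter how far away they are:

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention: each query is compared against
    every key, and the output is a softmax-weighted mix of the values."""
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)           # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ values                            # (n_q, d_v)

# Toy example: 4 "occluded" query cells attend over 16 image locations.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(16, 8))
v = rng.normal(size=(16, 8))
out = attention(q, k, v)
print(out.shape)  # (4, 8)
```

The point is that the routing (the softmax weights) is computed from the input itself, unlike a convolution's fixed receptive field, which is why it handles input-dependent long-range dependencies like occlusions.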
Edit: A random observation that just occurred to me: their predictions seem surprisingly temporally unstable. Observe, for example, the lane layout changing wildly while the car makes a left turn at the intersection (https://youtu.be/j0z4FweCy4M?t=2608). You can use the comma and period keys to step through the video frame by frame.
I'd been a lifelong Apple user (webdev, sysadmin, power user) up until 2018, when I switched to dual-booting Fedora/Xfce and Win10 on a Lenovo X1C 6th Gen. It took several months to get used to, but I haven't looked back even once. The user experience is not as polished, but it gets the job done. In hindsight I was irrationally worried about features or apps I thought I would miss.