Thank you so much for your response, that example makes a lot of sense. Algorithmically speaking in computer science, we can formalize efficiency with complexity theory.
Can we do the same with neural networks? Is there a formalization of why 'skip connections' (which I know nothing about) are better, or why transformers are more efficient than recurrence, etc.?
Is it useful to talk about their complexity, universal properties, or size? (I realize this is muddled a bit by the fact that hardware architecture can sometimes trump algorithmic efficiency.)
Classically, the equivalent of complexity theory in machine learning is statistical learning theory, where the main question is: if I have a magical algorithm that always finds the function in my class that fits the data best, how big does my dataset (which is a sample from some unknown probability distribution) need to be to ensure that the function I pick is almost as good as the best function in the class with high probability? This is known as PAC (probably approximately correct) learning.
For many non-deep machine learning models like support vector machines (SVMs), "find the function in my class that fits the data best" can be posed as a convex optimization problem. So the magical algorithm actually exists. In this setting, the PAC analysis is the natural thing to study. (Typically, we find that you need more samples if your function class is more expressive, which agrees with the observation that deep learning requires huge data sets.)
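To make that concrete, here's a toy sketch in plain numpy (a regularized least-squares problem rather than an actual SVM, purely for illustration): on a convex loss, gradient descent reaches the same global optimum from any starting point, which is exactly why the "magical algorithm" exists in this setting.

```python
import numpy as np

# Toy convex problem: L2-regularized linear least squares.
# Because the loss is convex, gradient descent converges to the
# same global optimum no matter where it starts.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)

def fit(w0, steps=2000, lr=0.1, lam=0.1):
    w = w0.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y) + lam * w  # convex loss gradient
        w -= lr * grad
    return w

w_a = fit(np.zeros(5))
w_b = fit(rng.normal(size=5) * 10)  # wildly different starting point
print(np.allclose(w_a, w_b))        # prints True: same optimum either way
```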
In deep learning, the magical algorithm doesn't exist. The optimization problem is nonconvex, yet everyone uses stochastic gradient descent (SGD), an algorithm that is only guaranteed to find the global optimum on convex problems. Theory suggests that SGD will often converge to a local optimum that is significantly worse than the global optimum. In practice, though, this doesn't happen much: if the network is big enough, the hyperparameters are tuned well, and you run the deep learning algorithm with different random seeds, the result will be about equally good every time.
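Here's the nonconvex failure mode in miniature (a hand-picked 1-D function, nothing to do with real network losses): plain gradient descent lands in whichever basin its starting point belongs to, so in general there's no guarantee of reaching the global optimum.

```python
import numpy as np

# A one-dimensional nonconvex loss with two basins. Gradient
# descent converges to whichever local minimum its starting
# point belongs to -- unlike the convex case, the answer
# depends on initialization.
def loss(w):
    return (w**2 - 1)**2 + 0.3 * w   # tilted double well

def grad(w):
    return 4 * w * (w**2 - 1) + 0.3

def descend(w, lr=0.01, steps=5000):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

print(descend(-2.0))  # roughly -1.04: the deeper minimum
print(descend(+2.0))  # roughly +0.96: stuck in the shallower one
```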
ML theory people working in deep learning tend to focus on this phenomenon: why does SGD usually find good local optima? This is totally different from the PAC analysis, and the analogy with computational complexity is less crisp.
Skip connections have very good motivation (see one of my other comments in this thread), and attention is decently well motivated, especially as an improvement in the translation setting where it was first introduced. I don't think there's any formal proof that attention >> convolutions with a wide receptive field.
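For what it's worth, the mechanical idea behind a skip connection is tiny: a residual block computes x + f(x) instead of f(x). A numpy sketch (the two-layer f here is just an assumption for illustration): when f's weights are near zero, the block passes its input through almost unchanged, which is one informal reason deep residual nets are easy to optimize.

```python
import numpy as np

# Minimal sketch of a skip (residual) connection. With near-zero
# weights, x + f(x) is approximately the identity, so stacking
# many such blocks doesn't destroy the signal.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 8)) * 0.01   # near-zero weights
W2 = rng.normal(size=(8, 8)) * 0.01

def residual_block(x):
    return x + W2 @ np.maximum(0.0, W1 @ x)   # x + f(x), f = 2-layer ReLU net

x = rng.normal(size=8)
print(np.allclose(residual_block(x), x, atol=1e-2))  # prints True: ~identity
```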
It would be fantastic to have better measures of problem complexity. My thinking at this point is that a huge parameter count makes it easier to search for a solution, but once we've found one, there should be interesting ways to simplify it. Recall from above: there are many equivalent models with the same loss once learning slows down, and some of these equivalent models have lots of zeros in them. We find that you can often prune 90% of the weights and still have a perfectly good model. Eventually you hit a wall where it gets hard to prune more without large drops in model quality; this /might/ correspond to the actual problem complexity somehow, but the pruned model you happened to find may not actually be the best, and there may be better pruning methods we haven't discovered yet.
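The pruning step itself is simple to sketch. This is plain magnitude pruning in numpy (in practice you'd fine-tune the pruned network afterwards, which I'm skipping here):

```python
import numpy as np

# Magnitude pruning: zero out the 90% of weights with the
# smallest absolute value, keeping only the top 10%.
rng = np.random.default_rng(0)
W = rng.normal(size=(100, 100))           # stand-in for a weight matrix

threshold = np.quantile(np.abs(W), 0.9)   # 90th percentile of |W|
mask = np.abs(W) >= threshold             # True where weight survives
W_pruned = W * mask

print(mask.mean())  # fraction of weights kept, ~0.1
```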