Wow, so just the weight sharing architecture does so much already? I am wondering if the same could be done with LSTMs on sequences or CNNs on voice...
Note also that this finding strongly suggests that neural net architecture actually is quite important, possibly even more important than having more data -- which contradicts the conventional wisdom!
There is some pretty strong evidence for this: all the toddlers in the world. You only need to show them something once and they'll immediately be able to recognize more examples of the same thing from different angles, and even when it is partially hidden. All they have to guide them is the structure of their brains, not the quantity of data they have been exposed to.
"All they have to guide them is the structure of their brains, not the quantity of data they have been exposed."
A typical toddler (say, 12 months old) has spent 4000-5000 hours with open eyes. Even if you assume a low frame rate (10 fps), resolution (1080p), and a 1000:1 compression ratio, that's still about 1 TB of training data.
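Rough back-of-the-envelope check, in case anyone wants the arithmetic spelled out (the 4500 hours, 10 fps, 1080p, and 1000:1 figures are just the assumptions above, nothing measured):

    # Back-of-the-envelope: a toddler's "visual training data" volume.
    # All inputs are the assumptions stated above, not measurements.
    hours_awake = 4500            # ~12 months of open eyes
    fps = 10                      # deliberately low frame rate
    width, height = 1920, 1080    # 1080p
    bytes_per_pixel = 3           # 24-bit RGB
    compression_ratio = 1000      # very aggressive 1000:1 compression

    frames = hours_awake * 3600 * fps
    raw_bytes_per_frame = width * height * bytes_per_pixel
    compressed_bytes = frames * raw_bytes_per_frame / compression_ratio

    print(f"{compressed_bytes / 1e12:.2f} TB")   # -> roughly 1 TB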
Certainly not true... reading takes ages, for instance, and associating objects with words takes forever. Perhaps it's true in some other sense, but not in the sense I described.
It's not about neural network architecture. CNNs are taught by presenting them with overlapping pieces of the image. To speed things up and keep things organized, this is not done sequentially but in parallel, which makes multiple neurons share weights -- but this is just a trick.
So what makes this result possible is not the architecture of the NN in CNN but rather the architecture of the C. That allows us to get multiple samples from a single image. The rest is just that the actual content of the image is easier to learn than the noise.
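To make the "multiple samples from a single image" point concrete, here's a minimal sketch (numpy only, toy made-up values) of one shared 3x3 filter slid over every overlapping patch of an image:

    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.random((8, 8))    # toy grayscale image
    kernel = rng.random((3, 3))   # one shared 3x3 filter (9 weights total)

    # Slide the same kernel over every overlapping 3x3 patch:
    # each output position reuses the exact same 9 weights.
    out = np.zeros((6, 6))
    for i in range(6):
        for j in range(6):
            patch = image[i:i+3, j:j+3]
            out[i, j] = np.sum(patch * kernel)

    # Each of the 36 patches acts like a separate training sample
    # for the same 9 parameters -- that's the weight sharing.
    print(out.shape)  # (6, 6)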
I think it's both C and NN. Don't forget each new C layer groups information from the previous layers; using just a single C layer won't do you much good. It might not reflect the brain much, but it kinda resembles what retina/visual cortex neurons do; CNNs were actually inspired by visual field maps found in the visual cortex, and somebody had the idea that C is the most similar CV operation we have and put them together. To everyone's surprise it worked nicely.
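To put a number on the "each new C layer groups information from the previous layers" bit: with plain stride-1 3x3 convolutions (no pooling or dilation -- an assumption, just to keep the formula simple), each extra layer widens the patch of input that a single output can see:

    def receptive_field(num_layers, kernel_size=3):
        """Receptive field of a stack of stride-1 conv layers (no pooling/dilation)."""
        return 1 + num_layers * (kernel_size - 1)

    for n in range(1, 6):
        rf = receptive_field(n)
        print(f"{n} layers -> {rf}x{rf} input window")
    # 1 layer sees only 3x3; 5 stacked layers already see 11x11,
    # which is why a single C layer won't do you much good.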
It's probably just a very rough "resemblance" :D It is said CNNs were "inspired" by visual field maps; I am 100% sure we know very little about how that part of the brain works, and maybe somebody just took a look at the main/thickest connections between neurons there and tried to assemble them in an NN to see if it helps.