It's not supposed to be a measure of "true linguistic proficiency" but a measure of the capacity to generate linguistic-proficiency-like results.
So in each case, what's the gap between actual linguistic proficiency / randomness, and the appearance thereof? And of what value is measuring these human perceptions rather than the actual facts in each instance? For example, you could put all the results into a scatter plot and see whether there is actually a pattern in the pseudorandom data, or formally analyse the grammar and spelling in question and verify that it is technically correct rather than just "English-sounding" (https://youtu.be/gU4w12oDjn8?t=2m).
Can we take that Italian gentleman's ability to make a song that sounds like English pop music "better" than an English pop song that is actually grammatically correct, and use it to infer that he has better English skills than the writer of the technically correct song?
And if not, why are we trying to make statements about the ability of some randomness source that are not based on any actual measure of true randomness?
>So in each case what's the gap between actual linguistic proficiency / randomness, and the appearance thereof?
That there are no hard constraints or expectations, as there would be when measuring the quality of, e.g., a software random number generator implementation.
They don't expect to find true randomness in the results, just to measure how much randomness (entropy if you will) those various age groups are capable of producing.
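To make "how much entropy" a bit more concrete, here is a minimal sketch of one possible way to score a human-produced sequence. This is an assumption on my part, not the method the study used: the function names `shannon_entropy` and `ngram_entropy` are made up for illustration, and the scoring is plain Shannon entropy over symbol and n-gram frequencies.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy in bits per item, from observed frequencies."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def ngram_entropy(seq, n):
    """Entropy of overlapping n-grams (hypothetical helper).

    Looking at n-grams catches patterns that single-symbol
    frequencies miss, such as strict alternation.
    """
    grams = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    return shannon_entropy(grams)


if __name__ == "__main__":
    flips = "HTHTHTHT"  # looks balanced, but is fully predictable
    print(shannon_entropy(flips))   # per-symbol entropy
    print(ngram_entropy(flips, 2))  # entropy over overlapping pairs
```

A strictly alternating sequence gets the maximum per-symbol score of 1 bit, yet its pair entropy stays below 1 bit, where a fair coin would approach 2 bits. That is one way to see the gap between the appearance of randomness and a measured amount of it.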