So, for the sake of argument, if the dataset had no fake data then it would be OK? Or would they still need to demonstrate "substantial investment", no matter the state of the data?
If the latter, then that gets weird quick. How many lines of code is considered substantial? How many hours hunched over a microfiche machine? It sounds like it would ultimately depend on the skill of your lawyer.
You need to think of "fake data" as a broader term than you are. Take play-by-play data for American college football: ncaa.com, espn.com, foxsports.com and others show slight differences in a play's down/to-go/time/etc. It is not as simple as ESPN inserting an entire fake team or fake game; to compare with the earlier example, it would be a real record with a slightly modified price. I analyze college football data sets and can tell where they came from, so I have no doubt the companies can as well.
If you have enough data sources you could theoretically reconstruct a play-by-play from all of them and end up with a data set that would be difficult to prove was stolen from any one place in particular. I say theoretically because (at least with college football) you are often not given enough information to fully recreate the game (a simple example: how long a play took to execute, which you need in order to determine drive possession time), so you are often left with a best-guess method.
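The best-guess merge described above could be sketched as a field-by-field majority vote across sources. This is a hypothetical illustration, not anyone's actual pipeline; the field names and values are made up:

```python
from collections import Counter

# Hypothetical records of the same play from three sources; the slight
# field-level differences are the kind of discrepancy described above.
records = [
    {"down": 2, "togo": 7, "clock": "11:42"},  # site A
    {"down": 2, "togo": 8, "clock": "11:42"},  # site B
    {"down": 2, "togo": 7, "clock": "11:41"},  # site C
]

def consensus(records):
    """Majority-vote each field across sources (a crude best guess)."""
    merged = {}
    for field in records[0]:
        votes = Counter(r[field] for r in records)
        merged[field] = votes.most_common(1)[0][0]
    return merged

print(consensus(records))  # {'down': 2, 'togo': 7, 'clock': '11:42'}
```

Of course this only papers over disagreements; fields no source reports (like play duration) still have to be guessed some other way.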
I think you're missing the point. If you have to start arguing over what is and isn't a substantial effort, you're probably going to fall foul of the law.
This is an open source schema for storing data, why not re-acquire the data from a fresh source and make that open source too? This avoids pulling it from an existing and potentially protected source.
You could have members of the public individually enter historical scores, and each one provide proof of the score (e.g. a photo of a result in a newspaper or a photo of the matchday guide).
You could verify the correctness of the acquired data by comparing it to a few known data sources (even protected ones). So long as the entered data was close enough to the facts not to alter history, it was probably right, and it could be corrected later (editable like a wiki).
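That verification step could work as a simple tolerance check against a majority of reference sources. A minimal sketch, assuming scores are (home, away) tuples; the function name and reference values are invented for illustration:

```python
# Hypothetical check: a crowd-entered score is accepted if it agrees
# (within tolerance) with a majority of known reference sources.
def verify_score(entered, references, tolerance=0):
    """Return True if the entered (home, away) score matches most references."""
    matches = sum(
        1 for ref in references
        if abs(entered[0] - ref[0]) <= tolerance
        and abs(entered[1] - ref[1]) <= tolerance
    )
    return matches * 2 > len(references)

# Three reference sources, one of which disagrees slightly.
references = [(21, 14), (21, 14), (20, 14)]
print(verify_score((21, 14), references))  # True: matches 2 of 3
print(verify_score((28, 14), references))  # False: matches none
```

Entries that fail the check could be flagged for review rather than rejected outright, which fits the wiki-style correction model.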
If you use one of the existing data sources you'll find yourself facing a lawsuit once you reach any reasonable size.