There is no reason to believe GPT-4 had more (or higher-quality) data than Google etc. have now. GPT-4 was trained entirely before the Microsoft deal. If OpenAI could pay to acquire data in 2023, more than 10 companies could have acquired data of similar quality by now, and yet no one has produced a model of similar quality in a year.
The key questions are around "fair use". One factor in the US fair-use doctrine is "the effect of the use upon the potential market for or value of the copyrighted work" - so one big question here is whether a model has a negative impact on the market for the copyrighted works it was trained on.
I don’t think the New York Times case is so much about training as it is about the fact that ChatGPT can use Bing, and Bing has access to New York Times articles for search purposes.
Having used both Google's and OpenAI's models, the kinds of issues they have are different. Google's models are superior, or at least on par, in knowledge. It's instruction following and understanding where OpenAI is significantly better. I don't think pretraining data is the reason for this.