It's immediately become difficult to untangle the licensing here. Is this safe f...

pwendell · on March 24, 2023

Yes it's nuanced, but will be simplified going forward.

This uses a fully open source (liberally licensed) model and we also open sourced (liberally licensed) our own training code. However, the uptraining dataset of ~50,000 samples was generated with OpenAI's text-davinci-003 model, and depending on how one interprets their terms, commercial use of the resulting model may violate the OpenAI terms of use. For that reason we are advising only noncommercial use of this model for now.

The next step here is to create a set of uptraining samples that is 100% open. Stay tuned.

Taek · on March 24, 2023

Are you in touch with the OpenAssistant team? I believe they already have a more or less complete set of samples (100,000!) that were produced in an open environment and aren't encumbered by any licensing.

pwendell · on March 24, 2023

No I haven't heard of that, we'll engage with that team. This is exactly what we need will look into it.

rnosov · on March 24, 2023

This has nothing to do with facebook. The foundational model here is GPT-J which is opensource and safe to use. Sadly, it is inferior to state-of-the-art models such as LLaMA.

Mizza · on March 24, 2023

But they're "using data from Alpaca". I don't know what that means, isn't Alpaca using data generated by ChatGPT, which isn't "clean" to use? Or data from Facebook, which isn't "clean" to use? I'm drowning.

rnosov · on March 24, 2023

They are instruction tuning it using the dataset released by stanford-alpaca team. The dataset itself is synthetic (created using GPT-3) and somewhat noisy and in my view can be easily recreated if OpenAI ever tries to go after it (which is very unlikely). Anyway, facebook has nothing to do with anything used by this project.

Mizza · on March 24, 2023

So, this is a "dirty" model, in that is was created by data which violated OpenAI ToS. Obviously, this kind of violation is basically fine if you're a massive corporation who the rules don't apply to, but it's a huge risk if you're a small fish.

hutzlibu · on March 24, 2023

"basically fine if you're a massive corporation who the rules don't apply to, but it's a huge risk if you're a small fish"

With these things, it is usually the other way around.

If you are a small fish, no one will care. But if you are big enough, that money could be extracted from you, then they will come. A big org just has better lawers and negotiating power, but they really cannot ignore the law. Especially not, if there is a competitor with money to sue.

So if you are small and want to become big, better be cautious on the legal ground you are walking.

rnosov · on March 24, 2023

ToS are not the law. It would be similar to your power company claiming copyright over the code written using "their" electricity. Not going to happen. I wouldn't be too concerned.

sp332 · on March 24, 2023

No, but you could be banned from using OpenAI products in the future, which seems like quite a liability for a researcher or company.

rnosov · on March 24, 2023

That would be anticompetitive practice that is actually against the law in many countries[1]. In the unlikely event of OpenAI ever engaging in such things they will be sued into oblivion.

[1] https://en.wikipedia.org/wiki/Refusal_to_deal

iudqnolq · on March 24, 2023

No it wouldn't. Wikipedia has a crap definition that inexplicably focuses on cartels where multiple companies coordinate the refusal, which this definitely isn't. The FTC has a better definition for US law [1].

Companies routinely ban users for ToS violations. Just look at any thread about Google on here to see people complaining about it.

[1]: https://www.ftc.gov/advice-guidance/competition-guidance/gui...

rnosov · on March 25, 2023

The FTC link has an example of the only newspaper in town refusing to deal with customers who are also running ads on a radio station. Do you think if the newspaper dressed such refusal as a ToS violation it would fly with FTC?

Google might be banning people for enforceable violations of their ToS but imagine the uproar if they banned a Bing engineer for using Google search to find solutions for some Bing problem (which is similar to the problem here). The upside for Google or OpenAI would be somewhat limited but the downside is almost boundless.

Spivak · on March 24, 2023

Especially when OpenAI explicitly doesn't have a claim to copyright on the model output.

gremlinsinc · on March 24, 2023

If you use output, from a non-profit who open sourced the output gained by following the TOS, as in they aren't using it 'for profit', it's not illegal, because:

A. it's an output gained via following the letter of the law (TOS).

B. TOS only applies directly to people who've accepted the TOS, unless alpaca's license/TOS ALSO forwards the same criterion as it's source at openai, then derivatives wouldn't apply.

It's like if an app developer on IOS violated a TOS, and apple tried to go after everybody who ever used the app, they didn't agree directly to the TOS, only the developer did.

sebzim4500 · on March 24, 2023

That's between OpenAI and the people that recorded the data. No one else needs to care.

bilekas · on March 24, 2023

I don't know the full details but Alpaca is from Stanford and only based on the LLamA (not a derivative work afaik). That said :

Also Meta's licensing here https://github.com/facebookresearch/llama/blob/main/LICENSE

Can't be sure what that license actually reffers to, the language model or just the tooling in the Git Repo.

I agree its a minefield, but with Meta I would eer on the side of caution.

leobg · on March 24, 2023

Why? Dolly had nothing to do with Llama or its weights.

Besides: How would anyone ever know which model generated the output you are serving? AFAIK there is no fingerprint in any model’s output. And even if there was, it would probably be destroyed by fine tuning “over it”.

stametseater · on March 24, 2023

> AFAIK there is no fingerprint in any model’s output.

It seems like there easily could be. What if some of the data they trained it on didn't exist anywhere else except in the training set, and was put there specifically for this purpose? For instance they could have taught it a few poems that don't exist anywhere else. If you can coax the LLM of unknown origin into reciting those poems back to you, you know where it came from.

kurthr · on March 24, 2023

Even easier have a small set of 8-10 character gibberish tokens it's trained on in a particular contexts (eg a non-existent poem). Then feed it one or several poems and see if a gibberish token pops out.

eigenvalue · on March 24, 2023

I think they call these canary GUIDs. If you manage to generate one from an LLM then you can conclude with certainty that the model saw that document during training.

neilv · on March 24, 2023

> Besides: How would anyone ever know which model generated the output you are serving?

There's precedent for "whatever you can get away with" in tech companies, but establishing a culture of that at the start of this new big change could end up undesirable for most people.

For example, it could relieve demand for more legal and sustainable ways, until it's too late. (Look at the history of digital entertainment media piracy and DRM and legislation, for example. Or look at the history of software piracy, where some big companies seem to actually want their product to be pirated, partly because it builds a bigger moat against competitors, and they can legally strongarm some of those pirates later.)

bilekas · on March 24, 2023

> Meta really botched the Llama release.

It's no surprise really though, from what I see they recognised some way to monitize and rolled back their commitment.

But this Dolly doesn't depend on Llama (unless I'M missing something), so you don't have to use it.

babyyoda · on March 24, 2023

Given that Alpaca strictly specified that they released purely for academic use and any commercial use was prohibited given doing so would violate terms of service, I don’t see this as viable for use. Looks like marketing gimmick