Yes, it's nuanced, but it will be simplified going forward.
This uses a fully open-source (liberally licensed) model, and we also open-sourced (liberally licensed) our own training code. However, the uptraining dataset of ~50,000 samples was generated with OpenAI's text-davinci-003 model, and depending on how one interprets their terms, commercial use of the resulting model may violate the OpenAI terms of use. For that reason we are advising only noncommercial use of this model for now.
The next step here is to create a set of uptraining samples that is 100% open. Stay tuned.
Are you in touch with the OpenAssistant team? I believe they already have a more or less complete set of samples (100,000!) that were produced in an open environment and aren't encumbered by any licensing.
Is the Alpaca dataset's non-commercial-only license the reason you aren't releasing the weights? Would it be possible to just release them under that same license?
Yes, the issue is that some of the training data is arguably tainted by a noncommercial license (it's nuanced; discussed below in my comment). We are releasing weights to people who request them, but we wanted an email request flow so we can make sure people know it's for noncommercial purposes only.
We're working on a model without this issue. Our goal is certainly totally open models that anyone can use for anything.
I've been a bit jaded by the "open/democratizing AI" talk followed by companies stiffing us on actually making it open - but not wanting to be the first to litigate the new kinds of issues ML brings is very understandable.
Question - would you consider benchmarking a single 4090 for your training? While training in a few hours on 8x A100s is impressive, I suspect I'm not alone in being curious how that translates to consumer hardware. IMO, running/fine-tuning on consumer hardware is the ultimate endgame for all AI models.
The DataFrame is an evolution of the RDD model, where Spark knows explicit schema information. The core Spark RDD API is very generic and assumes nothing about the structure of the user's data. This is powerful, but ultimately the generic nature imposes limits on how much we can optimize.
DataFrames impose just a bit more structure: we assume that you have a tabular schema, named fields with types, etc. Given this assumption, Spark can optimize a lot of internal execution details, and also provide slicker APIs to users. It turns out that a huge fraction of Spark workloads fall into this model, especially since we support complex types and nested structures.
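For a concrete flavor of the "slicker" API, here's a minimal PySpark sketch (assuming an existing SQLContext named `sqlContext` and a hypothetical "events.json" input file - not actual project code):

    # Schema is inferred from the JSON, so Spark knows the field
    # names and types up front.
    df = sqlContext.jsonFile("events.json")

    # Because the structure is known, Spark can plan and optimize
    # this aggregation itself rather than execute opaque lambdas.
    df.groupBy("country").count().show()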
Is the core RDD API going anywhere? Nope - not any time soon. Sometimes it really is necessary to drop into that lower level API. But I do anticipate that within a year or two most Spark applications will let DataFrames do the heavy lifting.
In fact, DataFrames and RDDs are completely interoperable; either can be converted to the other. This means that even if you don't want to use DataFrames, you can benefit from all of the cool input/output capabilities they have, even just to create regular old RDDs.
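A rough sketch of that interop in PySpark (assuming an existing SparkContext `sc` and SQLContext `sqlContext`; the data is made up):

    from pyspark.sql import Row

    # Start with a plain RDD of tuples.
    rdd = sc.parallelize([("alice", 34), ("bob", 29)])

    # RDD -> DataFrame: attach field names and types via Row.
    df = sqlContext.createDataFrame(
        rdd.map(lambda t: Row(name=t[0], age=t[1])))
    df.filter(df.age > 30).show()

    # DataFrame -> RDD: drop back to the generic API when needed.
    pairs = df.rdd.map(lambda row: (row.name, row.age))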
Hey - sorry you had a bad experience. That bug was filed as a "minor" issue with only one user ever reporting it, so it didn't end up high in our triage. We didn't merge the pull request because it wasn't correct, but we can add our own fix if it's affecting users. In the future, chiming in on a reported JIRA will escalate it in our process.
Sorry, I didn't mean to come across as completely negative.
I appreciate the work that's gone into Spark, and it's clearly well designed. Developing with Spark after coming from a Hadoop background was a very refreshing experience.
Good news about the PySpark input format improvements. Does that also cover reading complex Parquet datatypes into SchemaRDDs with their native datatypes? When can we get a Databricks Cloud account (I'm already on the waiting list)?
Yeah, you can load Parquet data directly into SchemaRDDs in 1.1 and get the type conversion, including use of nested types. The long-term solution for all of our storage integration is to go through the SchemaRDD API, since it's a standard type description and we expect many data sources to integrate there.
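Roughly what that looks like in the PySpark 1.1 API (the path here is hypothetical):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

    # Load Parquet directly into a SchemaRDD; complex and nested
    # Parquet types come through as their native Python equivalents.
    people = sqlContext.parquetFile("path/to/people.parquet")

    # SchemaRDDs carry a type description, so they plug straight
    # into Spark SQL.
    people.registerTempTable("people")
    adults = sqlContext.sql("SELECT name FROM people WHERE age > 21")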
Re: databricks cloud - shoot me an e-mail and I'll see if I can help. Right now demand for accounts exceeds our supply, but I can try!
The Princeton "University Database" that was hacked is a privately maintained Alumni Association site from the UK. It is not on the Princeton domain or associated with the University at all.
Here's an excerpt:
"Come out and suppoert Chickenshed, an inclusive theatre company based in London that brings people of all ages, backgrounds and abilities together to create groundbreaking and exciting new theatre."
Time will tell how the world will change now that this sensitive information is out in the open.
We'll definitely keep iterating on Dolly and releasing everything openly.