Hacker News | pwendell's comments

15k is the entire corpus we have right now. Hopefully others will join in by releasing additional samples that can be merged in over time.

We'll definitely keep iterating on Dolly and releasing everything openly.


Yes, this was a very surprising result... that the relatively small amount of uptraining was able to unlock so much latent knowledge in the model.


Yes, this.


Yes, it's nuanced, but it will be simplified going forward.

This uses a fully open source (liberally licensed) model and we also open sourced (liberally licensed) our own training code. However, the uptraining dataset of ~50,000 samples was generated with OpenAI's text-davinci-003 model, and depending on how one interprets their terms, commercial use of the resulting model may violate the OpenAI terms of use. For that reason we are advising only noncommercial use of this model for now.

The next step here is to create a set of uptraining samples that is 100% open. Stay tuned.


Are you in touch with the OpenAssistant team? I believe they already have a more or less complete set of samples (100,000!) that were produced in an open environment and aren't encumbered by any licensing.


No, I hadn't heard of that; we'll engage with that team. This is exactly what we need, and we will look into it.


Full source code is up here now:

https://github.com/databrickslabs/dolly

Sorry it took us a day to get the external repo set up.


Awesome thank you!

Was the Alpaca dataset's noncommercial-only license the reason you aren't releasing the weights? Is it possible to just release them under the same license?


Yes, the issue is that some of the training data is arguably tainted by a noncommercial license (it's nuanced; discussed below in my comment). We are releasing the weights to people who request them, but we wanted an email request flow so that we can make sure people know the weights are for noncommercial purposes only.

We're working on a model without this issue. Our goal is certainly fully open models that anyone can use for anything.


Understandable, thank you for the response!

I've been a bit jaded by the "open/democratizing AI" messaging, followed by companies stiffing us on actually making it open - but not wanting to be the first to litigate the new types of issues ML brings is very understandable.

Question: would you consider benchmarking your training on a single 4090? While training in a few hours on 8x A100s is impressive, I (and I think others) am curious how that translates to consumer hardware. IMO, running and fine-tuning on consumer hardware is the ultimate endgame for all AI models.


Looking forward to a response. We are heading toward a 6x Bizon 4090 system as a test bed.

https://bizon-tech.com/bizon-zx5500.html


The DataFrame is an evolution of the RDD model, where Spark knows explicit schema information. The core Spark RDD API is very generic and assumes nothing about the structure of the user's data. This is powerful, but ultimately the generic nature imposes limits on how much we can optimize.

DataFrames impose just a bit more structure: we assume that you have a tabular schema, named fields with types, etc. Given this assumption, Spark can optimize a lot of internal execution details, and also provide slicker APIs to users. It turns out that a huge fraction of Spark workloads fall into this model, especially since we support complex types and nested structures.

Is the core RDD API going anywhere? Nope - not any time soon. Sometimes it really is necessary to drop into that lower level API. But I do anticipate that within a year or two most Spark applications will let DataFrames do the heavy lifting.

In fact, DataFrames and RDDs are completely interoperable; either can be converted to the other. This means that even if you don't want to use DataFrames, you can benefit from all of the cool input/output capabilities they have, even just to create regular old RDDs.
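
To make that concrete, here's a minimal sketch of the round trip using the Spark 1.3-era Scala API (the Person class and the data are just illustrative):

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext `sc`
    import sqlContext.implicits._

    // RDD -> DataFrame: the case class supplies the schema (named, typed fields)
    val peopleRDD = sc.parallelize(Seq(Person("Ada", 36), Person("Grace", 45)))
    val peopleDF  = peopleRDD.toDF()

    // Operations on the DataFrame go through the optimizer
    peopleDF.filter($"age" > 40).show()

    // DataFrame -> RDD: drop back down to the generic API when you need to
    val rowsRDD = peopleDF.rdd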


> It turns out that a huge fraction of Spark workloads fall into this model, especially since we support complex types and nested structures.

The first step of all my Spark tasks is "turn this RDD[String] into an RDD of parsed JSON", or turning CSV into case classes.
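
Concretely, it's boilerplate along these lines (the schema and field names here are made up):

    // Hand-parsing an RDD[String] of CSV lines into case classes
    case class Event(id: Long, kind: String, value: Double)

    val lines = sc.textFile("events.csv")            // RDD[String]
    val events = lines.map(_.split(",")).collect {
      case Array(id, kind, value) => Event(id.toLong, kind, value.toDouble)
    }                                                // RDD[Event]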

What JSON parser will dataframes be using? I presume Jackson?


Very likely in Spark 1.4. Hortonworks has been helping out with this; we just need some internal refactoring of the API to make it work.


Hey - sorry you had a bad experience. That bug was filed as a "minor" issue with only one user ever reporting it, so it didn't end up high in our triage. We didn't merge the pull request because it was not correct; however, we can just add our own fix for it if it's affecting users. In the future, if you chime in on a reported JIRA, that will escalate it in our process.


Sorry, I didn't mean to come across as completely negative.

I appreciate the work that's gone into Spark, and it's clearly well designed. Developing with Spark after coming from a Hadoop background was a very refreshing experience.


No worries. Hopefully you'll reconsider using it!


Hey all - I'm the release manager for Spark 1.1. Happy to answer any questions about Spark or this release.


Good news about the PySpark input format improvements. Does that also cover reading complex Parquet datatypes into SchemaRDDs with their native datatypes? When can we get a Databricks Cloud account (I'm already on the waiting list)?


Yeah, you can load Parquet data directly into SchemaRDDs in 1.1 and get the type conversion, including use of nested types. The long-term plan for all of our storage integration is to go through the SchemaRDD API, since it's a standard type description and we expect many data sources to integrate there.
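
As a rough sketch of the Scala side (the PySpark call is analogous; the path is just a placeholder):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext `sc`

    // Returns a SchemaRDD carrying the Parquet schema, nested types included
    val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
    events.printSchema()

    // It's still an RDD, so the usual operations work too
    println(events.count())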

Re: Databricks Cloud - shoot me an email and I'll see if I can help. Right now demand exceeds supply for us on accounts, but I can try!


Doesn't SchemaRDD already support Parquet? Although it'd be great if it supported CSVs too.


There's work in progress to support importing CSV data as SchemaRDDs:

https://issues.apache.org/jira/browse/SPARK-2360 https://github.com/apache/spark/pull/1351


Any plans to allow GraphX to work with Spark Streaming DStreams?


You can call GraphX algorithms right now from within the Streaming API, for instance computing a graph on a windowed view of the data.
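
For instance, a rough sketch along these lines (the socket source and the edge format are just placeholders):

    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))

    // Parse a text stream of "src dst" pairs into GraphX edges
    val edges = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(src, dst) = line.split(" ")
      Edge(src.toLong, dst.toLong, 1)
    }

    // Each batch, build a graph from the last minute of edges and run a
    // (batch) GraphX algorithm on it
    edges.window(Seconds(60)).foreachRDD { edgeRDD =>
      val graph = Graph.fromEdges(edgeRDD, 0)   // 0 is the default vertex attribute
      val components = graph.connectedComponents().vertices.map(_._2).distinct().count()
      println(s"distinct connected components in the last minute: $components")
    }

    ssc.start()
    ssc.awaitTermination()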

Online graph algorithms aren't there yet (probably what you mean). We just started adding online MLlib algorithms, so this is the main focus for now.


The Princeton "University Database" that was hacked is a privately maintained Alumni Association site from the UK. It is not on the Princeton domain or associated with the University at all.

Here's an excerpt: "Come out and support Chickenshed, an inclusive theatre company based in London that brings people of all ages, backgrounds and abilities together to create groundbreaking and exciting new theatre."

Time will tell how the world will change now that this sensitive information is out in the open.


The University of Michigan database dump here, on the other hand, is legit.


4 of the dumps labeled U of M are from MSU (and at a glance, a couple of them appear to be a database that the linked website is intended to publish).


One of them looks like a database from the student association. Most of their stuff is built by $9/hr work-study students, so it's no big surprise it's insecure.

