Hacker News | GabrielBianconi's comments

We set up dataset splits and followed the usual best practices. Of course, if you overdo things, you can still hack benchmarks; our goal isn't to publish SOTA numbers but rather to illustrate results from our methodology. We didn't even tune hyperparameters; we just used the defaults. Definitely a valid concern for teams chasing SOTA, though.

Thanks!


Thanks, Sam! I'm excited to see what you guys come up with.


With supervised fine-tuning (SFT), you'll often see good results with 100-1000+ datapoints (they can be variations of the same prompt template). If you have more limited data, reinforcement fine-tuning (RFT) can work well in the 10-100 range.
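For example, "variations of the same prompt template" can be expanded into a chat-format SFT dataset along these lines (a hypothetical sketch; the template, fields, and filename are illustrative, not from the post):

```python
import json

# Illustrative prompt template; each datapoint is a variation of it.
TEMPLATE = "Extract the entities from the following text:\n\n{text}"

examples = [
    {"text": "Apple hired Tim Cook.", "label": '["Apple", "Tim Cook"]'},
    {"text": "NASA launched Artemis.", "label": '["NASA", "Artemis"]'},
]

# Chat-format rows, as commonly accepted by fine-tuning APIs.
dataset = [
    {
        "messages": [
            {"role": "user", "content": TEMPLATE.format(text=ex["text"])},
            {"role": "assistant", "content": ex["label"]},
        ]
    }
    for ex in examples
]

with open("sft_dataset.jsonl", "w") as f:
    for row in dataset:
        f.write(json.dumps(row) + "\n")
```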

Good luck!


AFAIK, distillation typically refers to tuning on the logits of the larger model, so you wouldn't be able to do that with fine-tuning APIs (OpenAI + Google in our blog post). We fine-tune on the outputs themselves.

But broadly speaking, yes, we generate data using a large model, curate the best samples using metrics from the environment, and fine-tune on that data. This isn't a novel technique from an academic perspective; our focus is on applying it to different use cases (e.g. agentic RAG, agentic tool use) and models (OpenAI, Google, Qwen).
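The generate → curate → fine-tune loop described above could be sketched roughly like this (a hypothetical illustration: `generate_output`, `run_environment`, and the 0.8 threshold are stand-ins for the teacher model and task-specific metrics, not the authors' actual code):

```python
import json
import random

random.seed(0)  # deterministic for the sketch

def generate_output(prompt: str) -> str:
    # Placeholder for sampling from a large "teacher" model.
    return f"answer for: {prompt}"

def run_environment(prompt: str, output: str) -> float:
    # Placeholder for a metric computed from the environment
    # (e.g. task success in agentic RAG or tool use).
    return random.random()

prompts = [f"task {i}" for i in range(100)]
samples = [(p, generate_output(p)) for p in prompts]

# Curate: keep only samples the environment scores highly,
# rather than fine-tuning on everything the teacher produced.
curated = [(p, o) for p, o in samples if run_environment(p, o) >= 0.8]

# Write the curated set as chat-format JSONL for fine-tuning.
with open("curated_sft.jsonl", "w") as f:
    for prompt, output in curated:
        row = {"messages": [{"role": "user", "content": prompt},
                            {"role": "assistant", "content": output}]}
        f.write(json.dumps(row) + "\n")
```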

Thanks!


> AFAIK, distillation typically refers to tuning on the logits of the larger model

I think this is called “logit distillation”, which is a particular form of distillation but not the only one.

> so you wouldn't be able to do that with fine-tuning APIs (OpenAI + Google in our blog post)

Distillation from competitors' APIs is so common it has been given a name: it's called “distealing”.


Thanks for the explanation and the clarification on terminology! I've used a similar approach myself and it sounded like you were doing something similar.


Thanks for the feedback!

We chose a set of tasks with different levels of complexity to see how this approach would scale. For LLMs, the "challenge" with NER is not the task itself but the arbitrariness of the labels in the dataset. I agree it's still much simpler than the other tasks we present (agentic RAG, agentic tool use, maze navigation).

There are definitely strong parallels to model distillation and student-teacher training, with the primary difference being that we don't simply take all the data from the larger model but rather filter the dataset based on metrics from the environment. In the "Does curation even matter?" section, we show that this generally improves results by a good margin.

We link to Vicuna, which might be the closest reference as prior art: https://lmsys.org/blog/2023-03-30-vicuna/

Thanks!


[I'm his coworker.] We ran Unsloth ourselves on a GPU-by-the-hour server. We have a notebook in the repository showing how to query historical data and use it with Unsloth.

It's a WIP PR that we plan to merge soon: https://github.com/tensorzero/tensorzero/pull/2273


Yeah, I hadn't noticed!


TensorZero | https://github.com/tensorzero/tensorzero | Founding Member of Technical Staff | NYC (onsite) | Full-time

TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM gateway, observability, optimization, evaluation, and experimentation.

Open Roles:

‣ Back-end Engineering (Rust)

‣ Design Engineering

‣ Developer Relations (DevRel) Engineering

‣ Front-end Engineering (React)

‣ Product Engineering (Full-Stack)

What we offer:

‣ Vast majority of your work → open source

‣ Years of runway

‣ Small and entirely technical team: former Rust compiler maintainer, ML researchers (Stanford, CMU, Oxford, Columbia) with thousands of citations, decacorn CPO

‣ $200-300k base + up to 1% equity + benefits

‣ Onsite (5 days) in New York (Williamsburg, Brooklyn)

More information: https://tensorzero.com/candidate-brief

Apply: https://www.tensorzero.com/jobs


We didn't look into that workflow closely, but you can reproduce our work (code in GitHub) and potentially find some insights!

We plan to continue investigating how it works (+ optimize the models and prompts using TensorZero).


They use different prompts depending on the action you're taking. We provided just a sample because our ultimate goal here is to start A/B testing models, optimizing prompts + models, etc. We provide the code to reproduce our work so you can see other prompts!

The Gist you shared is a good resource too though!

