It wouldn’t be too difficult to build something like that for your own usage, but I found it pretty easy to get datasets set up.
Essentially a game changer for understanding whether your prompts are working, especially if you're doing something that requires high levels of consistency.
In our case we use an LLM for classification, which fits in perfectly with evals.
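To make the idea concrete, here's a minimal sketch of what an eval for an LLM classifier can look like: a small dataset of (input, expected label) pairs, a classify function, and an accuracy score. The `classify` function is a stand-in for a real LLM call; it's stubbed with keyword rules here so the sketch runs on its own, and the dataset is made up.

```python
# Stub for an LLM classification call. A real version would prompt a model
# via an API client; keyword rules let this example run standalone.
def classify(text: str) -> str:
    text = text.lower()
    if "refund" in text or "broken" in text:
        return "negative"
    if "thanks" in text or "great" in text:
        return "positive"
    return "neutral"

# An eval dataset is just (input, expected label) pairs.
dataset = [
    ("Thanks, the new release is great!", "positive"),
    ("My order arrived broken, I want a refund.", "negative"),
    ("How do I change my billing address?", "neutral"),
]

def run_eval(dataset):
    """Run the classifier over the dataset and return (accuracy, results)."""
    results = [(text, expected, classify(text)) for text, expected in dataset]
    correct = sum(1 for _, expected, got in results if expected == got)
    return correct / len(results), results

accuracy, results = run_eval(dataset)
print(f"accuracy: {accuracy:.0%}")
for text, expected, got in results:
    mark = "OK  " if expected == got else "MISS"
    print(f"{mark} expected={expected!r} got={got!r} :: {text}")
```

The value comes from re-running this every time you change the prompt: a consistency regression shows up as a dropped score instead of a vague feeling that outputs got worse.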
Have some good takeaways / feedback on this? This is the first time I've heard of Braintrust (the eval platform), so I'll look into it, but I'm curious about your experience with it so far.
“I admit that I still disagreed with him after the exchange, but I had a new respect for him as a designer because he was able to articulate a rationale for his decision.”
Any competent designer gets really good at justifying their decisions. Everyone has an opinion about design and thinks that their taste is correct.
I’m glad I don’t have to deal with that on the software side.
Marginally related, I feel the same way about honesty, especially in a work context.
I’ve always prided myself on being an honest but considerate person.
A recent experience with a colleague who weaponised my honesty in an attempt to manipulate me has left a foul taste in my mouth. Luckily their contract ended and the problem resolved itself.
But I remember distinctly feeling that I will be professional and polite, but I do not automatically owe anyone my honesty.
We analyse thousands of lines from a CSV using an LLM.
The only thing that worked for us was to send each individual line and analyse it one by one.
I’m not sure if that would work in your use case, but you could classify each line into a value using an LLM then hard code the trends you are looking for.
For example, if you’re analysing something like support tickets, use an LLM to classify the sentiment of each one; then you can plot the sentiment on a graph and see whether it’s trending up or down.
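A rough sketch of that approach, assuming a hypothetical "message" column in the CSV: classify each line into a numeric sentiment score, then compare early vs late averages to get a trend. The `classify_sentiment` function stands in for a per-line LLM call and is stubbed with keyword rules so the example runs without an API key; the sample data is invented.

```python
import csv
import io

def classify_sentiment(text: str) -> int:
    """Return -1, 0, or 1. A real version would prompt an LLM per line."""
    text = text.lower()
    if any(word in text for word in ("broken", "angry", "refund")):
        return -1
    if any(word in text for word in ("love", "great", "thanks")):
        return 1
    return 0

def sentiment_trend(rows):
    """Score each row, then compare the first half to the second half."""
    scores = [classify_sentiment(row["message"]) for row in rows]
    half = len(scores) // 2
    early = sum(scores[:half]) / half
    late = sum(scores[half:]) / (len(scores) - half)
    return scores, late - early  # > 0 means sentiment is trending up

# Invented sample data; in practice you'd open the real CSV file.
raw = io.StringIO(
    "message\n"
    "my order arrived broken\n"
    "still angry, want a refund\n"
    "thanks for the quick fix\n"
    "love the new dashboard\n"
)
scores, delta = sentiment_trend(list(csv.DictReader(raw)))
print(scores, delta)
```

The hard-coded trend logic is the point: the LLM only does the fuzzy classification per line, and everything downstream (aggregation, plotting, alerting) stays as ordinary deterministic code.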
I think that is probably what I'll end up doing, since the data is text-based. I'll combine that with the approach of pre-analysing quantitative data to feed to the LLM.
I figured I'd ask this question because there might've been a technique I'm not aware of.
A lot of software has friction to get to the value. This is often because of constraints not choice.
To give a concrete example of this, in my company we had users upload files for analysis. Getting the export for a file took many steps. None of them hard, but a lot to get done.
We switched it to an integration and now it’s 3 clicks. We’ve gone from 10% of users onboarding to 100%.
It doesn’t mean we get people to stay, but the barrier to understanding if our tool provides value to them has completely disappeared.
I’m very curious though, what value did you strip away when trying to make your product easier to use?
How do you introduce any tool/change to a team of people?
You get buy-in: start having conversations and see what AI tools people have explored. Have they tried Claude? Do they prefer other tools? If so, why? What are their objections? Actually listen.
I’d also showcase what you can do. I love to present what Codex has found when debugging something, or a prototype I’ve put together.
If you have the budget pay for subscriptions so they can play around.
Also, you say that development velocity is a big problem, but I would dive into why that is. You may be disappointed when velocity remains the same with AI tools.