Hmmm, didn’t realize I was deflecting - just stating facts. But if I came across that way then criticism noted.
If I turned this into a paid app then more attention would be given to quality. There’s only so much an app that leverages LLMs can do, though. With enough trace data and user feedback I could imagine building out Evals from failure modes.
I can think of a few ways to provide a better UX. One is already built-in - there’s a “Recreate” button the original uploader can click if they don’t like the result.
Things could get pretty sophisticated after that, such as letting the user tweak the prompt, allowing for section-by-section re-dos, changing models, or even supporting manual edits.
From a commercial product perspective, it’s interesting to think about the cost/benefit of building around the current limits of LLMs vs building for an experience and betting the models will get better. The question is where to draw the line and where to devote cycles. Something worthy of its own thread.
Maybe you are too young to have noticed, but this is how Facebook used to be for everyone. Until some A/B testing likely led to short-term engagement boosts for news content and that's all you could see - especially during the 2016 news cycle with (allegedly Russian) political ads. Then people stopped posting, and others stopped posting, feedback loop and here we are.
I appreciate the voice of experience but if you're going to post a comment like this, could you please share some of that experience so we know at least some of what did happen?
Otherwise it comes across as a drive-by swipe, which is a human reaction when you know that something on the internet is wrong, but which degrades the threads, partly because of the example it sets for others. The life of this community depends on knowledgeable people sharing some of what they know, so the rest of us can learn.
Specifically, what happened (and I think this is all public now) is that prior to 2016, journalists and news organizations argued that Facebook was demoting news for various reasons. In reality it wasn't very engaging, so it was automatically demoted. Facebook promised to boost news more in early 2016, but largely as a result of worse engagement and negative experiences (arguing in comments), it started ranking news worse than other content. This all happened in 2016, months before the general election.
And while Russia did run ads, they were mostly not political, and the political content they ran got very little engagement. Russia mostly focused on conspiracy theories and on undermining American institutions. Facebook was aware of this in 2016 and certainly did not contribute to it intentionally - and I don't believe even by accident of some kind of misguided A/B testing.
The reason Facebook got worse for younger people is because younger people stopped posting.
Because regulation is bad, according to the current executive?
Politics aside, the FDA applies a very generous amount of regulation (mostly justifiably). I'm not sure we want to pay multiples for our consumer electronics, which (mostly) behaves acceptably and rarely kills anybody.
It is bad. Regulations have been historically hijacked to benefit corporate interests. See Intuit and tax policy for example.
Voters on the right naively thought he'd work to fix it. (Wrong!) But it is very much bad for a very large number of issues. Maybe the next executive will fix it? (Wrong!)
Because there are not a lot of high-quality examples of code editing in the training corpora, other than maybe version-control diffs.
Because editing/removing code requires the model to output tokens for tool calls that are intercepted by the coding agent.
Responses like the example below are not emergent behavior; they REQUIRE fine-tuning. Period.
I need to fix this null pointer issue in the auth module.
<|tool_call|>
{"id": "call_abc123", "type": "function", "function": {"name": "edit_file", "arguments": "{"path": "src/auth.py", "start_line": 12, "end_line": 14, "replacement": "def authenticate(user):\n if user is None:\n return False\n return verify(user.token)"}"}}
<|end_tool_call|>
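To make that concrete, here's a minimal sketch (in Python, using the hypothetical delimiters and edit_file schema from the example above) of the interception a coding agent has to do: scan the raw model output for tool-call blocks, parse them, and apply the edit to disk.

    import json
    import re

    # Delimiters and schema are the made-up ones from the example above;
    # real agents use whatever special tokens their model was fine-tuned with.
    TOOL_CALL_RE = re.compile(r"<\|tool_call\|>(.*?)<\|end_tool_call\|>", re.DOTALL)

    def apply_edit_file(args):
        # Replace lines start_line..end_line (1-indexed, inclusive) with the replacement.
        with open(args["path"]) as f:
            lines = f.readlines()
        lines[args["start_line"] - 1:args["end_line"]] = [args["replacement"] + "\n"]
        with open(args["path"], "w") as f:
            f.writelines(lines)

    def intercept(model_output):
        for match in TOOL_CALL_RE.finditer(model_output):
            call = json.loads(match.group(1))
            fn = call["function"]
            args = json.loads(fn["arguments"])  # "arguments" is itself a JSON string, so it's parsed twice
            if fn["name"] == "edit_file":
                apply_edit_file(args)

Nothing in that structure occurs as natural text in a training corpus; the model only learns to emit it reliably because someone fine-tuned it on examples of exactly this.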
Have you tried using a base model from HuggingFace? They can't even answer simple questions. If you give a base, raw model the input
What is the capital of the United States?
And there's a fucking big chance it will complete it as
What is the capital of Canada?
as much as there is a chance it could complete it with an essay about the early American republic or a sociological essay questioning the idea of capital cities.
Impressive, but not very useful. A good base model will complete your input with things that generally make sense and are usually correct, but a lot of the time completely different from what you intended it to generate. They are like a very smart dog - a genius dog that was never trained and most of the time refuses to obey.
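You can see this for yourself in a few lines with transformers (gpt2 here just because it's small and is a pure base model; any base checkpoint behaves similarly):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # base model, no instruction tuning
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("What is the capital of the United States?", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30, do_sample=True)
    print(tok.decode(out[0]))
    # Often continues with more questions or an essay instead of answering:
    # the model is completing text, not replying to you.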
So even simple behaviors, like acting as one party in a conversation the way a chatbot does, require fine-tuning (the results being the *-instruct models you find on HuggingFace). In machine-learning parlance, this is what we call supervised learning.
But in the case of chatbot behavior, the fine-tuning is not that complex, because we already have a good idea of what conversations look like from our training corpora; we have already encoded a lot of this during the unsupervised learning phase.
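Concretely, that supervised data is just conversations rendered into one flat token stream. A sketch of what a single training example might look like (the exact chat template varies by model; this just prints whatever template the checkpoint ships with):

    from transformers import AutoTokenizer

    # *-instruct/chat models ship a chat template that flattens structured
    # turns into plain text for ordinary next-token prediction.
    tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    example = [
        {"role": "user", "content": "What is the capital of the United States?"},
        {"role": "assistant", "content": "Washington, D.C."},
    ]
    print(tok.apply_chat_template(example, tokenize=False))
    # The fine-tuning loss is ordinary next-token prediction over this string,
    # usually masked so only the assistant turns are scored.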
Now, let's think about editing code, not simply generating it. Let's do a simple experiment. Go to your project and issue the following command:
claude -p --output-format stream-json "your prompt here to do some change in your code" | jq -r 'select(.type == "assistant") | .message.content[]? | select(.type? == "text") | .text'
Pay attention to the incredible number of tool-use calls the LLM generates in its output. Now think of this as one whole conversation: does it look even remotely similar to something a model would find in its training corpus?
Editing existing code, deleting it, or refactoring it is a far more complex operation than just generating a new function or class: it requires the model to read the existing code, generate a plan identifying what needs to be changed or deleted, and produce output with the appropriate tool calls.
Sequences of tokens that simply create new code have lower entropy - they are more probable - than the complex sequences that lead to editing and refactoring existing code.
During pre-training the model is learning next-token prediction, which is naturally additive. Even if you added DEL as a token, it would still be quite hard to transform the data so that it could be used in a next-token prediction task.
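The practical workaround is to serialize the edit itself as text, so that a deletion becomes something the model can emit token by token like any other string. A sketch of the idea (the framing prompt is made up; difflib is in the Python standard library):

    import difflib

    before = [
        "def authenticate(user):",
        "    return verify(user.token)",
    ]
    after = [
        "def authenticate(user):",
        "    if user is None:",
        "        return False",
        "    return verify(user.token)",
    ]

    # The edit, deletions included, becomes plain text: a trainable next-token target.
    diff = "\n".join(difflib.unified_diff(before, after, "src/auth.py", "src/auth.py", lineterm=""))
    print("Fix the null pointer issue in the auth module.\n" + diff)

This is presumably why the few natural examples that do exist in the corpora are version-control diffs, as mentioned above.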
Hope that helps
It sounded like he was trying to one-shot things when he mentioned he would ask it to fix problems with no luck. It's an approach I've tried before with similar results, so I was sharing an alternative that worked for me. Apologies if it came across as dismissive.