pedrovhb's comments | Hacker News

For what it's worth, as a primarily backend dev having ~recently started getting more deeply into frontend web, I have specifically noted in my head that the box model isn't too intuitive and in my inexperienced opinion, the default was a bad one. I figured surely if it is the way it is, then it's for reasons I do not yet comprehend™, so it actually feels pretty validating that someone who knows what they're talking about agrees.


It does feel right to me, because it's not distilling the second model, and in fact the second model is not an image generation model at all, but a visual encoder. That is, it's a more "general purpose" model which specializes in extracting semantic information from images.

In hindsight it makes total sense - generative image models don't automatically start out with an idea of semantic meaning or the world, so they have to implicitly learn one during training. That's a hard task by itself, and the model isn't specifically trained for it; it picks it up on the go at the same time as it learns to create images. The idea of the paper then is to give the diffusion model a preexisting concept of the world by nudging its internal representations to be similar to the visual encoder's. As I understand it, DINO isn't even used during inference once the model is trained; it's just about the representations.

I wouldn't at all describe it as "a technique for transplanting an existing model onto a different architecture". It's different from distillation because again, DINO isn't an image generation model at all. It's more like (very roughly simplifying for the sake of analogy) instead of teaching someone to cook from scratch, we're starting with a chef who already knows all about ingredients, flavors, and cooking techniques, but hasn't yet learned to create dishes. This chef would likely learn to create new recipes much faster and more effectively than someone starting from zero knowledge about food. It's different from telling them to just copy another chef's recipes.
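For a concrete picture, an auxiliary representation-alignment objective of this kind might look roughly like the sketch below (PyTorch; the names, dimensions, and loss weighting are placeholders rather than the paper's actual code): the diffusion backbone's intermediate features get projected and pulled toward the frozen encoder's features, on top of the usual diffusion loss.

    import torch
    import torch.nn.functional as F

    # Rough sketch only; dimensions and names are placeholders.
    proj = torch.nn.Linear(1024, 768)  # map diffusion features to the encoder's dim

    def alignment_loss(diffusion_hidden, encoder_features):
        """Pull the diffusion model's intermediate representation toward the
        frozen visual encoder's features (negative cosine similarity)."""
        pred = F.normalize(proj(diffusion_hidden), dim=-1)       # (B, N, 768)
        target = F.normalize(encoder_features.detach(), dim=-1)  # encoder stays frozen
        return -(pred * target).sum(dim=-1).mean()

    # During training the total objective would be roughly:
    #   loss = diffusion_loss + lambda_align * alignment_loss(hidden, dino_features)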


The technique in this paper would still be rightly described as distillation. In this case it's distillation of "internal" representations rather than the final prediction. This is a reasonably common form of distillation. The interesting observation in this paper is that including an auxiliary distillation loss based on features from a non-generative model can be beneficial when training a generative model. This observation leads to interesting questions like, e.g., which parts of the overall task of generating images (diffusionly) are being learned faster/better due to this auxiliary distillation loss.


You may already be aware of it, but in case not - it sounds like tree-sitter-graph could be something you'd be interested in: https://docs.rs/tree-sitter-graph/latest/tree_sitter_graph/r...

I haven't gotten into it yet but it looks pretty neat, and it's an official tool.


Or by the definition that the ratio between consecutive fib numbers approaches Phi, just multiply by 1.618? Though at that point might as well just use the real conversion ratio.

In other news, π² ≈ g.
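A quick sanity check in plain Python (1.6093 being the actual miles-to-kilometres factor):

    # Consecutive Fibonacci ratios converge on phi (~1.618),
    # which happens to sit close to the real mi->km factor (~1.609).
    a, b = 1, 2
    for _ in range(8):
        print(f"{b}/{a} = {b / a:.4f}")
        a, b = b, a + b
    print("phi      ~ 1.6180")
    print("mi -> km ~ 1.6093")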


+1 on feeling there's a lot of UX possibilities left on the table. Most seem to have accepted chat as the only means of using LLMs. In particular, I don't think most people realize that LLMs can be used in very powerful ways that just aren't possible with black-box API services as they currently exist. Google kind of has an edge in this area with recent context caching support for Gemini, but that's just one thing. Some things that feel like they could enable new modes of interaction aren't possible at all, like grammar-constrained generation and rapid LLM-tool interactions (think a repl or shell rather than function calls; currently you have to pay for the input tokens all over again if you want to use the results of that function call as context, and it adds up quickly).
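As a rough illustration of the grammar-constrained part with a local model (a sketch assuming the llama-cpp-python bindings; the model path, prompt, and grammar are placeholders):

    from llama_cpp import Llama, LlamaGrammar

    llm = Llama(model_path="models/some-local-model.gguf")  # placeholder path

    # Constrain the output to a strict yes/no answer via a GBNF grammar.
    grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')

    out = llm(
        "Is a tomato a fruit? Answer yes or no: ",
        grammar=grammar,
        max_tokens=8,
    )
    print(out["choices"][0]["text"])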

On Copilot, I've been using it since it went public, and have always found it useful, but it hasn't really changed much. There's a chat window now (groundbreaking, I know) and it shows a "processing steps" thing that says it's doing some distinct agentic tasks like collecting context and test run results and what have you, but it doesn't feel like it knows my codebase any better than it would from the cursory description I'd give an LLM without context. I use the JetBrains plugin though, and I understand the VS Code extension has some different features, so ymmv.


It does view RAW when compiled with the right flags. JXL too, interestingly. Managed to save a bunch of space on old photos (converting with cjxl, which I wouldn't have done if I weren't able to view them somehow).


Here's an idea: recursively mount code files/projects. Use something like tree-sitter to extract class and function definitions and make each into a "file" within the directory representing the actual file. Need to get an idea for how a codebase is structured? Just `tree` it :)

Getting deeper into the rabbit hole, maybe imports could be resolved into symlinks and such. Plenty of interesting possibilities!
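A rough sketch of the symbol-extraction half of that idea, using the stdlib ast module as a stand-in for tree-sitter (a real version would parse with tree-sitter to stay language-agnostic and expose the results through FUSE or similar):

    # List top-level functions/classes of a Python file as virtual "paths".
    import ast
    import sys

    def list_symbols(path):
        tree = ast.parse(open(path).read(), filename=path)
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                print(f"{path}/{node.name}()  (lines {node.lineno}-{node.end_lineno})")
            elif isinstance(node, ast.ClassDef):
                print(f"{path}/{node.name}/  (lines {node.lineno}-{node.end_lineno})")
                for item in node.body:
                    if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                        print(f"{path}/{node.name}/{item.name}()")

    if __name__ == "__main__":
        list_symbols(sys.argv[1])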


Have you tried asking it for a specific concrete length, like a number of words? I was also frustrated with concise answers when asking for long ones, but I found that the outputs improved significantly if I asked for e.g. 4000 words specifically. Further than that, have it break it down into sections and write X words per section.


Yes, all the possible length-extending custom instructions you can think of, plus multi-shot example prompts using multiple USER and GPT exchanges to define the format. I can get some reasonable-length responses out of it, but I've never seen them go over one page's worth. It seems like GPT-4 has a hard limit on how much it will output when you click "continue", and Claude Opus never goes over a page either. Another user pointed out using the API, which I have done in the past, but it's been a long while, and I can't really justify the cost of using the advanced models via API for my general use.


Everyone's coalescing around a max of 4096 tokens (~12 "pages") via API (a page being 250 words, i.e. one 8.5"x11" double-spaced page).

To your point, it doesn't matter anyway; it's nigh impossible to get over 2K of output with every trick and bit of guidance you can think of. (I got desperate trying to "make it work" when the 16K/48-page model came out - even completely deforming tricks like making it number each line and write a reminder on each line that it should write 1000 lines don't work.)
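For reference, that ceiling corresponds (roughly) to the max_tokens parameter on the completion call; a minimal sketch with the OpenAI Python client, with model name and prompt as placeholders:

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        messages=[{"role": "user", "content": "Write a ~4000 word essay on X."}],
        max_tokens=4096,      # the API-side output ceiling discussed above
    )
    print(resp.choices[0].message.content)
    print(resp.choices[0].finish_reason)  # "length" when it hits the cap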


My intuition is that a significant challenge for LLMs' ability to do arithmetic has to do with tokenization. For instance, `1654+73225`, as per the OpenAI tokenizer tool, breaks down into `165•4•+•732•25`, meaning the LLM is incapable of considering digits individually; that is, "165" is a single "word", and its relationship to "4" (and in fact to every other token representing a numerical value) has to be learned. It can't do simple carry operations (or use the other arithmetic abstractions humans have access to) in the vast majority of cases because its internal representation of text is not designed for this. Arithmetic is easy to do in base 10 or 2 or 16, but it's a whole lot harder in base ~100k where 99% of the "digits" are words like "cat" or "///////////".
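This is easy to see directly with the tiktoken package (the exact split depends on which encoding the web tool uses, so treat the expected output as illustrative):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding
    tokens = enc.encode("1654+73225")
    print([enc.decode([t]) for t in tokens])
    # Expect multi-digit chunks along the lines of ['165', '4', '+', '732', '25']
    # rather than one token per digit.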

Compare that to understanding arbitrary base64-encoded strings; that's much harder for humans to do without tools. Tokenization still isn't _the_ greatest fit for it, but it's a lot more tractable, and LLMs can do it no problem. Even understanding ASCII art is impressive, given they have no innate idea of what any letter looks like, and they "see" fragments of each letter on each line.

So I'm not sure if I agree or disagree with you here. I'd say LLMs in fact have very impressive capabilities to learn logical structures. Whether grammar is the problem isn't clear to me, but their internal representation format obviously and enormously influences how much harder seemingly trivial tasks become. Perhaps some efforts in hand-tuning vocabularies could improve performance in some tasks, perhaps something different altogether is necessary, but I don't think it's an impossible hurdle to overcome.


I don't think that's really how it works - sure this is true at the first level in a neural network, but in deep neural networks after the first few layers the LLM shouldn't be 'thinking' in tokens anymore.

The tokens are just the input - the internal representation can be totally different (and that format isn't tokens).


Please don't act like you "know how it works" when you obviously don't.

The issue is not the fact that the model "thinks or doesn't think in tokens". The model is forced at the final sampling/decoding step to convert its latent back into tokens, one token at a time.

The models are fully capable of understanding the premise that they should "output a 5-7-5 syllable haiku", but from the perspective of a model trying to count its own syllables, this is not possible: its vocabulary is tokenized in such a way that the model not only lacks direct phonetic information in the dataset, it literally has no analogue for how humans count syllables (e.g. counting jaw drops). Models can't reason about the number of characters or even tokens used in a reply for the same exact reason.

The person you're replying to is broadly right, and you are broadly wrong. The internal format does not matter when the final decoding step forces a return to tokens. Please actually use these systems rather than pontificating about them online.


Thank god we aren’t talking about a model counting syllables then.


That requires converting from a weird, unhelpful form into a more helpful form first. So yes, but the tokenisation makes things harder, as it adds an extra step - the models need to learn how these things relate while having significant amounts of the structure hidden from them.


This conversion is inherent in the problem of language and maths though - Two, too (misspelt), 2, duo, dos, $0.02, one apple next to another apple, 0b10, and 二 can all represent the (fairly abstract) concept of two.

The conversion to a helpful form is required anyway (also, let's remember that computers don't work in base 10, and there isn't really a reason to believe that base 10 is inherently great for LLMs either).


It is, but there's a reason I teach my son addition like this:

    hundreds | tens | ones
           1 |    2 |    3
    +      2 |    1 |    5
    ------------------------
           3 |    3 |    8
Rather than

unoDOOOOS(third) {}{}{} [512354]_ = three"ate

* replace {}{}{} with addition; {}{} is subtraction, unless followed by three spaces, in which case it's also addition
* translate and correct any misspellings
* [512354]: look up in your tables
* _ is 15
* dotted lines indicate repeated numbers

Technically they're doing the same thing. One we would assume is harder to learn the fundamental concepts from.


Right, which is why arithmetic is a good way to test how well LLMs generalize their capabilities to non-text tasks. LLMs could in theory be excellent at it, but they aren't, due to how they are trained.


The tokens are the structure over which the attention mechanism is permutation equivariant. This structure permeates the forward pass; it's important at every layer and will be until we find something better than attention.


I thought of something similar these days but with a different approach - rather than settrace, it would use a subclass of bdb.Bdb (the standard library base debugger, on top of which Pdb is built) to actually have the LLM run a real debugging session. It'd place breakpoints (or postmortem sessions after an uncaught exception) to drop into a repl which allows going up/down the frame stack at a given execution point, listing local state for frames, running code on the repl to try out hypotheses or understand the cause of an exception, look at methods available for the objects in scope, etc. This is similar to what you'd get by running the `%debug` magic on IPython after an uncaught exception in a cell (try it out).

The quick LLM input/repl output loop is more suitable for local models though, where you can control the hidden state cache, have lower latency, and enforce a grammar to ensure it doesn't go off the rails and sticks to the commands implemented for interacting with the debugger, which afaik you can't do with services like OpenAI's. This is something I'd like to see more of - having low-level control of a model gives qualitatively different ways of using it which I haven't seen people explore that much.
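A minimal sketch of the Bdb-subclass part of this idea (the on_break hook is hypothetical; in practice it would serialize the snapshot for the LLM and interpret whatever debugger commands come back):

    import bdb
    import linecache

    class LLMDebugger(bdb.Bdb):
        """Stop at breakpoints, snapshot the frame, and hand it to a handler
        (an LLM call in practice; print() here as a stand-in)."""

        def __init__(self, on_break=print):
            super().__init__()
            self.on_break = on_break

        def user_line(self, frame):
            if self.break_here(frame):
                self.on_break({
                    "file": frame.f_code.co_filename,
                    "line": frame.f_lineno,
                    "source": linecache.getline(frame.f_code.co_filename, frame.f_lineno).strip(),
                    "locals": {k: repr(v) for k, v in frame.f_locals.items()},
                })
            self.set_continue()  # a real agent might set_step()/set_next() instead

    def target():
        x = 41
        y = x + 1  # breakpoint lands on this line
        return y

    dbg = LLMDebugger()
    dbg.set_break(__file__, target.__code__.co_firstlineno + 2)
    dbg.run("target()", globals())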


So interestingly enough, we first tried letting GPT interact with pdb through just a set of directed prompts, but we found that it kept hallucinating commands, not responding with the correct syntax, and really struggling with line numbers. That's why we pivoted to just getting all the relevant data GPT could need upfront and letting GPT synthesize that data into a single root cause.

I think we're going to explore the local model approach though - you raise some really great points about having more granular control over the state of the model.


Interesting! Did you try the function calling API? I feel you with the line number troubles, it's hard to get something consistent there. Using diffs with GPT-4 isn't much better in my experience; I didn't extensively test that, but from what I did, it rarely produced syntactically valid diffs that could just be sent to `patch`. One approach I started playing with was using tree-sitter to add markers to code and let the LLM specify marker ranges for deletion/insertion/replacement, but alas, I got distracted before fully going through with it.

In any case, I'll keep an eye on the project, good luck! Let me know if you ever need an extra set of hands, I find this stuff pretty interesting to think about :)


I actually coded something very close to this and it worked surprisingly well: https://github.com/janpf/debuggAIr


Ooh, interesting - starred and going to dig into this later today!


I've done a manual version of this with ChatGPT.

I had ipdb open, told it to request any variables it wanted to look at, suggest what to do next, and say what it would expect - it was quite good, but took a lot of persuading; just having an LLM more tuned to this would be better.

