Hacker News | gwern's comments

While we're at it: my own work 2 years ago in creating an entire workflow for turning Midjourney or DALL-E dropcaps into attractive, lightweight, easy-to-create dropcaps for web pages: https://gwern.net/dropcap We use it for the cat, Gene Wolfe, and holiday pages.

No. I think the need for adversarial losses in order to distill diffusion models into one-step forward passes has provided additional evidence that GANs were much more viable than the diffusion maximalists loudly insisted.

(Although I'm not really current on where image generation is these days, or who is using GAN-like approaches under the hood, or what the current theoretical understandings of GAN vs AR vs diffusion are, so if you have some specific reason I should have "caved", feel free to mention it - I may well just be unaware of it.)


"SotA diffusion uses adversarial methods anyways" seems like a bit of a departure from the case you make in the blog post.

edit: For what it's worth - I agree. At least some auto-encoders (which will produce latents for diffusion models) use some form of adversarial method.

Still, I'm curious if you think GAN models in their more familiar form are going to eventually take on LCM/diffusion models?


Silence says it all.


This works surprisingly well. If you look into enough dark corners of Unicode, it turns out that you can do a shocking amount of typography, going far beyond the obvious italics and bolds: https://gwern.net/utext

In fact, I found that writing as much math as possible in Unicode makes for the best HTML reading experience: it's fast, simple, and looks more natural (avoids font inconsistency and line-height jagginess, among other things). https://gwern.net/design-graveyard#mathjax

And if you find writing Unicode yourself a pain, you can just ask a LLM to translate from LaTeX to Unicode! https://github.com/gwern/gwern.net/blob/master/build/latex2u...
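
As a rough illustration of what that translation amounts to (a hypothetical sketch, not the actual latex2unicode script linked above), much of it is just direct symbol substitution; the harder parts, like spacing and superscripts, are where an LLM helps:

    # Hypothetical sketch, not the real latex2unicode build script:
    # a few direct LaTeX -> Unicode substitutions of the kind an LLM
    # (or even a plain lookup table) can perform.
    SUBS = {
        r"\forall": "∀",
        r"\in": "∈",
        r"\mathbb{R}": "ℝ",
        r"\leq": "≤",
        r"\cup": "∪",
    }

    def latex_to_unicode(s: str) -> str:
        for latex, uni in SUBS.items():
            s = s.replace(latex, uni)
        return s

    print(latex_to_unicode(r"\forall x \in \mathbb{R}"))  # prints: ∀ x ∈ ℝ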


Regarding typing LaTeX vs Unicode: I use WinCompose/XCompose with a list of bindings that includes most LaTeX symbols. So instead of \cup I'd type <compose>cup.

For reference, here is my personal (still evolving) .XCompose: https://github.com/chtenb/dotfiles/blob/master/.XCompose
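
For anyone unfamiliar with the format, the entries look roughly like this (illustrative bindings in standard .XCompose syntax, not copied verbatim from that file):

    <Multi_key> <c> <u> <p> : "∪" U222A    # \cup
    <Multi_key> <c> <a> <p> : "∩" U2229    # \cap
    <Multi_key> <f> <a>     : "∀" U2200    # \forall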


This is the epitome of patching symptoms rather than treating the disease. Even if you suppress the obvious syntactic slop like 'it's not X but Y', you have no reason to believe you've fixed mode-collapse on higher, more important levels like semantics and creativity. (For example, Claude LLMs have always struck me as mode-collapsed on a semantic level: they don't have the blatant verbal tics of 4o, but somehow they still 'go in circles'.) That will potentially severely hinder the truly high-value uses of LLMs in creative applications like frontier research. To the extent that this succeeds in hiding the brain damage in contemporary LLMs, it arguably is a cure worse than the disease.


Those higher level kinds of mode collapse are hard to quantify in an automated way. To fix that, you would need interventions upstream, at pre & post training.

This approach is targeted at the kinds of mode collapse that we can meaningfully measure and fix after the fact, which are essentially these verbal tics. That doesn't fix the higher-level mode collapse in semantics & creativity you're identifying -- but I think fixing the verbal tics is still important and useful.


> but I think fixing the verbal tics is still important and useful.

I don't. I think they're useful for flagging the existence of mode-collapse and also for providing convenient tracers for AI-written prose. Erasing only the verbal tics with the equivalent of 's/ - /; /g' (look ma! no more 4o em dashes!) is about the worst solution you could come up with, and if adopted, it would lead to a kind of global gaslighting. It's the equivalent of a COVID vaccine which only suppresses coughing but doesn't change R, or of fixing a compiler warning by disabling the check.

If you wanted to do useful research here, you'd be doing the opposite. You'd be figuring out how to make the verbal expressions even more sensitive to the underlying mode-collapse, to help research into fixing it and raising awareness. (This would be useful even on the released models, to more precisely quantify their overall mode-collapse, which is poorly captured by existing creative writing benchmarks, I think, and one reason I've had a hard time believing things like Eqbench rankings.)


Wow it’s pretty sad to see one of my idols be kind of a hater on this. :(


How come? It was a valid take on the situation. Critical feedback is vital to success.


You're correct, but when the worst the ChatGPTisms get is turns of phrase like "LeetCode youth finally paid off: turns out all those "rebalance a binary search tree" problems were preparing me for salami, not FAANG interviews." or "Designing software for things that rot means optimising for variance, memory, and timing–not perfection. It turns out the hardest part of software isn't keeping things alive. It's knowing when to let them age.", then I'm inclined to forgive it, given how many far more egregious offenders sit at the top of HN these days. This is a rather mild use of ChatGPT for copyediting, and at least I feel like I can trust OP to factcheck everything and not put in any confabulations.


> That's when it clicked:

> You know the drill:

etc etc.

If these are hand-typed, I'll eat my hat.


The problem with assessing nerd writing for whether it's AI-assisted is that the AIs themselves are trained on nerd writing.


Exactly this. I often feel plagiarized by AI.


If you were talking about some essays I wrote in the early 2000s, you’d be buttering your Stetson. It’s hilarious to me that several of my blog posts from 20 years ago have been called out as AI generated lol.


I agree. I've written like this too, but these days when you see it it's more likely to be AI.

I actually think if I were writing blog posts these days I'd deliberately avoid these kinds of cliches for that reason. I'd try to write something no LLM is likely to spit out, even if it ends up weird.


You're absolutely right!


> Note again that a residual connection is not just an arbitrary shortcut connection or skip connection (e.g., 1988)[LA88][SEG1-3] from one layer to another! No, its weight must be 1.0, like in the 1997 LSTM, or in the 1999 initialized LSTM, or the initialized Highway Net, or the ResNet. If the weight had some other arbitrary real value far from 1.0, then the vanishing/exploding gradient problem[VAN1] would raise its ugly head, unless it was under control by an initially open gate that learns when to keep or temporarily remove the connection's residual property, like in the 1999 initialized LSTM, or the initialized Highway Net.

After reading Lang & Witbrock 1988 https://gwern.net/doc/ai/nn/fully-connected/1988-lang.pdf I'm not sure how convincing I find this explanation.


That's a cool paper. Super interesting to see how work was progressing at the time, when Convex was the machine everybody wanted on (or rather next to) their desks.


For residual networks with an infinite number of layers it is absolutely correct. For a residual network with finitely many layers, you can get away with any non zero constant weight, as long as the weight is chosen appropriately for the fixed network depth. The problem is simply that c^n gives you very big or very small numbers for large n and large deviations of c from 1.

Now let me address the other possibility that you are talking about: what if residual connections aren't necessary? What if there is another way? What are the criteria necessary to avoid exploding or vanishing gradient or slow learning in the absence of both?

For that we need to first know why residual connections work. There is no way around calculating the back propagation formula by hand, but there is an easy trick to make it simple. We don't care about the number of parameters in the network, we only care about the flow of the gradient. So just have a single input and output with hidden size 1 and two hidden layers.

Each layer has a bias and a single weight and an activation function.

Let's assume you initialize each weight and bias with zero. The forward pass returns zero for any input and the gradient is zero. In this artificial scenario the gradient starts vanished and stays vanished. The reason is pretty obvious when you apply back propagation: the second layer's zero weight zeroes out the gradient of the first layer. If there were only a single layer, its gradient would be non zero, rescuing the network from the vanishing gradient.
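
Here is a minimal sketch of that degenerate case in PyTorch (illustrative only; the particular activation, loss, and inputs don't matter):

    import torch

    def p():  # a zero-initialized scalar parameter
        return torch.tensor(0.0, requires_grad=True)

    x, y = torch.tensor(1.0), torch.tensor(2.0)
    w1, b1, w2, b2 = p(), p(), p(), p()

    h1 = torch.tanh(w1 * x + b1)   # layer 1: outputs 0
    h2 = torch.tanh(w2 * h1 + b2)  # layer 2: outputs 0
    loss = (h2 - y) ** 2
    loss.backward()

    print(w1.grad, b1.grad)  # both zero: backprop through layer 2 multiplies by w2 == 0
    print(b2.grad)           # non zero: a single layer on its own would escape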

Now what if you add residual connections? Each layer's learned branch computes the same thing in the forward pass (it just adds its input back in), but the backward pass changes for two layers and beyond. The gradient for the second layer's weight consists of just the derivative of the second layer's activation function multiplied by the first layer's activation from the forward pass. The first layer's gradient looks like the second layer's gradient with the first-layer activation replaced by the first layer's own gradient, but because it is a residual net, you also add the gradient of just the first layer on its own, via the identity path.

In other words, the first layer receives a gradient term that is independent of the layers that come after it, but it also gets feedback from the higher layers on top. This allows it to become non zero, which then lets the second layer become non zero, which lets the third become non zero, and so on.
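
Continuing the same hedged PyTorch sketch: adding the identity paths gives the first layer a non zero gradient even though every weight is still zero.

    import torch

    def p():  # a zero-initialized scalar parameter
        return torch.tensor(0.0, requires_grad=True)

    x, y = torch.tensor(1.0), torch.tensor(2.0)
    w1, b1, w2, b2 = p(), p(), p(), p()

    h1 = x + torch.tanh(w1 * x + b1)    # residual: identity path with effective weight 1.0
    h2 = h1 + torch.tanh(w2 * h1 + b2)
    loss = (h2 - y) ** 2
    loss.backward()

    print(w1.grad, b1.grad)  # non zero: the identity path carries the gradient past layer 2
    print(w2.grad, b2.grad)  # also non zero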

Since the degenerate case of a zero initialized network makes things easy to conceptualise, it should help you figure out what other ways there are to accomplish the same task.

For example, what if we apply the loss to every layer's output as a regularizer? That is essentially doing the same thing as a residual, but with skip connections that sum up the outputs. You could replace the sum with a weighted sum where the weights are not equal to 1.0.
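
In the same toy setting, that would look something like the following sketch: each layer's output gets its own loss term, so the first layer receives a gradient that never has to pass through the zero-initialized second layer.

    import torch

    def p():  # a zero-initialized scalar parameter
        return torch.tensor(0.0, requires_grad=True)

    x, y = torch.tensor(1.0), torch.tensor(2.0)
    w1, b1, w2, b2 = p(), p(), p(), p()

    h1 = torch.tanh(w1 * x + b1)
    h2 = torch.tanh(w2 * h1 + b2)
    loss = (h1 - y) ** 2 + (h2 - y) ** 2   # every layer's output is matched against the label
    loss.backward()

    print(w1.grad, b1.grad)  # non zero, via the first loss term alone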

But what if you don't want skip connections either, because they are too similar to residual networks? A residual network has one skip connection already and summing up in a different way is uninteresting. It is also too reliant on each layer being encouraged to produce an output that is matched against the label.

In other words, what if we wanted to let the inner layers not be subject to any correlation with the output data? You would need something that forces the gradients away from zero but also away from excessively high numbers. I.e. weight regularization or layer normalisation with a fixed non zero bias.

Predictive coding and especially batched predictive coding could also be a solution to this.

Predictive coding predicts the input of the next layer, so the only requirement is that the forward pass produces a non zero output. There is no requirement for the gradient to flow through the entire network.


My point is more that Schmidhuber is saying that the gates or the initialization are the innovation solely because they produce well-behaved gradients, which is why Hochreiter's 1991 thing is where he starts and nothing before that counts. But it's not clear to me why we should define it like that when you can solve the gradient misbehavior in other ways, which is why https://gwern.net/doc/ai/nn/fully-connected/1988-lang.pdf#pa... works and doesn't diverge: if I'm understanding them right, they did warmup, so the gradients don't explode or vanish. So why doesn't that count? They have shortcut layers and a solution to exploding/vanishing gradients, and it works to solve their problem. Is it literally 'well, you didn't use a gate neuron or fancy initialization to train your shortcuts stably, therefore it doesn't count'? Such an argument seems carefully tailored to exclude all prior work...


This apparently doesn't apply here, but in fact, pixels can be generated independently of each other. There are architectures where you can generate an arbitrary pixel or element of the image without generating the others; they are just implicit. See NeRFs or 'single-pixel GANs' or MAEs: eg https://arxiv.org/abs/2003.08934 https://arxiv.org/abs/2011.13775 https://arxiv.org/abs/2401.14391

Why is this possible? I tend to think of it as reflecting the ability to 'memorize' all possible data, and the independent generation is just when you 'remember' a specific part of a memory. The latent space is a Platonic object which doesn't change, so why should your generative process for materializing any specific point in the latent space have to? It's not surprising if you could generate arbitrary points from a function like 'y = mx + b' without generating every other point, right? It's just an atemporal mathematical object. Similarly with 'generating images from a random seed'. They too are just (complicated) functions mapping one number to another number.
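
A toy way to see the point (purely hypothetical, not any of the linked architectures): if an 'image' is just a deterministic function from a latent seed and a pixel coordinate to a pixel value, then any single pixel can be computed on demand without the rest of the image ever being materialized.

    import math

    def pixel(seed: float, x: int, y: int) -> float:
        # a fixed function standing in for a trained coordinate-conditioned generator
        return 0.5 + 0.5 * math.sin(seed + 0.3 * x) * math.cos(seed + 0.2 * y)

    print(pixel(seed=1.234, x=17, y=42))  # one pixel; no full image ever rendered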

(You might wonder if this is limited to images? It is not. In fact, you can generate even natural language like this to some degree: https://github.com/ethan-w-roland/AUNN based on my proposal for taking the 'independent generation' idea to a pathological extreme: https://gwern.net/aunn )


Requires a login?



I think OP is an instance that proves the point.

This is a lazy, clumsy editing attempt, done through a document registration service which exists to prevent exactly this; and yet you have to be an experienced nerd (e.g., https://en.wikipedia.org/wiki/Matthew_Garrett has a doctorate and decades of software development experience) who will jump through a bunch of hoops to even begin to build a case beyond he-said-she-said. And he still doesn't have a settlement or criminal conviction in hand, so he's not even half done... Or look at the extensive forensics in the Craig Wright case, just to establish to a legally acceptable standard simple things like that the documents were edited or backdated.

Meanwhile, the original PDF edit in question took maybe 5 minutes with entry-level PDF tools.


One of the most surprising Gwern.net bug reports was from a compulsive highlighter who noted that the skip-ink implementation (which uses the old text-shadow trick, because, frustratingly, the recently standardized skip-ink CSS still manages to fail at its only job and looks awful) looked bad because of how browsers handle shadows and highlighting.

We had known about that (and it can't be fixed, because browsers don't let you control the highlighting), but we had never imagined it'd be a problem, because you'd only see it briefly when copy-pasting some text for a quote once in a while - right? I mean, why else would anyone be highlighting text? You'd only highlight small bits of text you had already read, so if it looked bad in places, that was fine, surely.

(Narrator: "It was not fine.")

Just another instance of Hyrum's law, I guess...

We decided to WONTFIX it because we can't easily fix it without making things uglier for users who don't abuse highlighting and are reading normally, which is almost everyone else.
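
(For anyone curious, 'the old text-shadow trick' refers to roughly the following pattern - values purely illustrative, not Gwern.net's actual stylesheet: the underline is drawn as a thin repeating background gradient, and a text-shadow in the page's background color lets descenders knock gaps out of it. That shadow is what interacts badly with selection highlighting.)

    /* Illustrative sketch only: faking skip-ink underlines. */
    a {
      text-decoration: none;
      background-image: linear-gradient(currentColor, currentColor);
      background-size: 1px 1px;       /* a 1px-tall line... */
      background-repeat: repeat-x;    /* ...tiled under the text */
      background-position: 0 1.15em;  /* positioned near the baseline */
      text-shadow: 1px 0 #fff, -1px 0 #fff, 2px 0 #fff, -2px 0 #fff;  /* #fff = page background */
    }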

