The high and low frequency components of speech are produced and perceived in different ways.
The lower frequencies (roughly below 4 kHz) are created by the vocal cords opening and closing at the fundamental frequency, plus harmonics of this fundamental (e.g. a 100 Hz fundamental with 200/300/400 Hz etc. harmonics), with this frequency spectrum then being shaped by the resonances of the vocal tract, which change during pronunciation. What we perceive as speech is primarily the changes to these resonances (aka formants) due to articulation/pronunciation.
The higher frequencies present in speech mostly come from "white noise" created by the turbulence of forcing air out through nearly closed teeth, lips, etc. (e.g. the "S" sound), and our perception of these "fricative" speech sounds is based on the onset/offset of energy in these higher 4-8 kHz frequencies. Frequencies above 8 kHz are not very perceptually relevant, and may be filtered out (e.g. they are not present in analog telephone speech).
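To make that source-filter picture concrete, here is a minimal numpy/scipy sketch of the two mechanisms; the formant frequencies and Q values are illustrative assumptions, not measured speech parameters:

    import numpy as np
    from scipy import signal

    fs = 16000                                         # sample rate in Hz

    # "Source": glottal pulse train at a 100 Hz fundamental (harmonics at 200, 300, ... Hz)
    source = (np.arange(fs) % (fs // 100) == 0).astype(float)

    # "Filter": vocal-tract resonances (formants); peak frequencies/Q values are made up
    voiced = np.zeros_like(source)
    for freq, q in [(700, 5), (1200, 8)]:
        b, a = signal.iirpeak(freq, q, fs=fs)
        voiced += signal.lfilter(b, a, source)

    # Fricative: band-limited noise in the 4-8 kHz range (roughly an "s"-like hiss)
    b, a = signal.butter(4, [4000, 7900], btype="bandpass", fs=fs)
    fricative = signal.lfilter(b, a, np.random.randn(fs))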
I don't mean to be mean, but: what is surprising about any of this?
Joseph Fourier's solution to the heat equation (linear diffusion) was in fact the origin of the FT. The high-frequency coefficients decay exponentially there (as e^(-k^2 t)); the reverse is also known to be "unstable" (numerically, and singular starting from the equilibrium).
Moreover, the reformulation doesn't immediately reveal some computational speedup, or a better alternative formulation (which is usually a measure of how valuable it is epistemically).
(Edit: note that the heat equation is more akin to the Fokker-Planck equation, not the actual diffusion SDE that is used in diffusion models.)
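For reference, in Fourier space the 1D heat equation decouples into independent modes, each decaying exponentially at a rate set by the square of its frequency; reversing it therefore amplifies high frequencies exponentially, which is where the instability comes from:

    \partial_t u = \partial_x^2 u
    \quad\xrightarrow{\ \mathcal{F}\ }\quad
    \partial_t \hat{u}(k,t) = -k^2\,\hat{u}(k,t)
    \quad\Longrightarrow\quad
    \hat{u}(k,t) = \hat{u}(k,0)\,e^{-k^2 t}.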
Connections between fields drive new ideas. And this has especially been the case for recent AI progress. With the speed at which the field is moving, ideas that are obvious to some still have a significant chance of not being tried yet.
Just as the connection between the Kalman filter and RNN models or the significant similarities between back-propagation and the whole field of control theory. If it's truly not surprising, then that's just another reason to try it out if nobody else has.
Does everything always need to be immediately "useful"?
I think what's interesting about it is the inter-relation between different disciplines and how the ideas are connected. The connection between the heat equation and the generative diffusion models we see today, and its relation to the Fourier transform, would not have been immediately obvious to me.
I mean, you didn't mention autoregressive models anywhere in your comment, whereas the post is about the connection between diffusion and autoregressive modelling. Also, it's a blog post; if it had figured out a speed-up or an improved method, it would probably have been a paper.
> I won’t speculate about why images exhibit this behaviour and sound seemingly doesn’t, but it is certainly interesting (feel free to speculate away in the comments!).
Images have a large near-DC component (solid colors) and useful spatial-domain properties, while human hearing starts at ~20 Hz and the frequencies needed to understand speech range from roughly 300 Hz to 4 kHz (spitballing based on the bandwidth of analog phones).
What would happen if you built a diffusion model using pink noise to corrupt all coefficients simultaneously? Alternatively what if you used something other than noise (like a direct blur) for the model to reverse?
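A minimal sketch of what the pink-noise variant of that first question might look like (the spectral shaping and normalisation here are assumptions for illustration, not taken from the post or any particular paper):

    import numpy as np

    def pink_noise(shape, alpha=1.0):
        """White Gaussian noise reshaped (along the last axis) so power falls off as 1/f^alpha."""
        white = np.random.randn(*shape)
        spectrum = np.fft.rfft(white, axis=-1)
        freqs = np.fft.rfftfreq(shape[-1])
        freqs[0] = freqs[1]                          # avoid dividing by zero at DC
        spectrum *= freqs ** (-alpha / 2.0)          # amplitude ~ f^(-alpha/2)  =>  power ~ f^(-alpha)
        noise = np.fft.irfft(spectrum, n=shape[-1], axis=-1)
        return noise / noise.std()                   # renormalise to unit variance

    # Corrupting with x_t = sqrt(1 - sigma_t^2) * x_0 + sigma_t * pink_noise(x_0.shape)
    # would then hit all frequencies at once, just with less energy up high.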
Thanks for reading! The paper that directly inspired this blog post actually investigates the latter (blurring as the corruption process): https://arxiv.org/abs/2206.13397
The lack of semantics associated with DC (and near-DC) components in audio data is important, and a big difference compared to image data, no doubt.
I'm not sure this changes if you look at a cepstral representation (as suggested in the article). In this case, the DC component represents the (white) noise level in the raw audio space (i.e., the spectrum averaged over all frequencies), so it doesn't have strong semantics either (other than... "how noisy is the waveform?").
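In case it helps, a minimal sketch of the real cepstrum of a single audio frame (windowing and liftering omitted), which is roughly the representation being discussed:

    import numpy as np

    def real_cepstrum(frame, eps=1e-8):
        """Inverse FFT of the log-magnitude spectrum of one audio frame."""
        log_mag = np.log(np.abs(np.fft.rfft(frame)) + eps)
        return np.fft.irfft(log_mag, n=len(frame))

    # The 0th cepstral coefficient is proportional to the log-magnitude averaged
    # over all frequencies, i.e. an overall level/"how noisy is this" term, which
    # is the weak-semantics DC component mentioned above.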
All four audio examples are human-made, so it makes sense they emphasize the frequency range that humans distinguish best. It would be interesting to compare with natural audio to see if there’s a distinction like that found in natural vs. manmade scenes in images. (Unfortunately there are increasingly few places on Earth you can find truly natural audio with no manmade sounds audible…)
You could just generate the audio in frequency space, much like how MP3 style codecs encode the raw signal. This converts the purely 1D audio waveform into a 2D grid of values, which is more amenable to this type of diffusion-based generation.
It is not really 1D - to perform any T/F transform (FFT, (M)DCT, etc.) you need a number of samples in the time domain, so you are essentially transforming one 2D representation (intensity over time) into another 2D representation (magnitude, or magnitude+phase, over frequency) - this is why MP3 style codecs usually have multiple frame (or "window") lengths, usually one longer for high frequency resolution and one shorter for high temporal resolution.
That’s exactly what I mean. Break up the 1D audio into 2D samples in time and frequency space. Train the AI in this space plus diffusion noise, and have it generate de-noised output in this space.
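Something like an STFT is the usual way to get that 2D grid; a rough sketch (the window length and hop size are arbitrary illustrative choices, not tied to any particular model):

    import numpy as np
    from scipy import signal

    fs = 22050
    audio = np.random.randn(fs * 2)               # stand-in for a 2-second waveform

    f, t, Z = signal.stft(audio, fs=fs, nperseg=1024, noverlap=768)
    magnitude = np.abs(Z)                         # 2D grid: frequency bins x time frames
    phase = np.angle(Z)

    # A generative model would operate on this grid (often the log-magnitude, i.e.
    # a spectrogram); the waveform is recovered with the inverse STFT or a vocoder.
    _, reconstructed = signal.istft(Z, fs=fs, nperseg=1024, noverlap=768)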
Not my area, enjoyed the read. It reminded me of how you can decode a scaled-down version of a JPEG image by simply ignoring the higher-order DCT coefficients.
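For anyone curious, that JPEG trick looks roughly like this on a single 8x8 block (real decoders do it per block, after dequantisation; the rescaling factor just compensates for the orthonormal DCT normalisation):

    import numpy as np
    from scipy.fft import dctn, idctn

    block = np.random.rand(8, 8)                   # stand-in for one 8x8 pixel block
    coeffs = dctn(block, norm="ortho")             # 2D DCT-II, as used in JPEG

    low = coeffs[:4, :4]                           # keep only the low-frequency corner
    half_res = idctn(low, norm="ortho") * (4 / 8)  # 4x4 = half-resolution version of the block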
As such it seems the statement is that stable diffusion is like an autoregressive model which predicts the next set of higher-order FT coefficients from the lower-order ones.
Seems like this is something one could do with a "regular" autoregressive model, has this been tried? Seems obvious so I assume so, but curious how it compares.
Had just finished watching the Physics of Language Models[1] talk, where they show how GPT2 models could learn non-trivial context-free grammars, as well as effectively do dynamic programming to an extent, so thought it would be interesting to see how they performed on the spectral fine-graining task.
> I included a few references that explore that approach at the bottom of section 4
Man, reading on a mobile phone just ain't the same. Somehow managed to not catch the end of that section. The first reference, "Generating Images with Sparse Representations", is very close to what I had in mind.
This post reminded me of a conversation I had with my cousins about language and learning. It’s interesting how (most?) languages seem inherently sequential, while ideas and knowledge tend to have a more hierarchical structure, with a “base frequency” communicating the basic idea and higher frequency overtones adding the nuances. I wonder what implications this might have in teaching current LLMs to reason?
> It’s interesting how (most?) languages seem inherently sequential, while ideas and knowledge tend to have a more hierarchical structure
Spoken and written languages are presented in a sequential medium. They still represent hierarchical trees in their structure though.
(A notable semi-exception to the linearity is sign languages, which are kinematic three-dimensional languages involving two hands, the entire upper body and facial expressions. While I don't speak one, I've read a bit about it, and apparently the most common error for non-deaf people who learn it is to make so-called "split verb" errors. That is to say: to sign in a linear fashion like one would with a spoken language, instead of making use of all the parallel communication options available.)
I know you're joking, but since we're among nerds who like technical correctness: what Italians do is known as "gesticulation". It is an important part of their speech, for sure, just like the melody of a spoken language can add layers of depth to a sentence compared to its written representation. As far as I know this is not, however, a sign language. Sign languages have their own grammars that are not comparable to spoken languages. Italians do not take their gesticulation that far, AFAIK.
Ok, I get that people downvoted my comment for being a killjoy, but the point I was trying to make is a serious one: namely that sign languages are real, valid languages, that Deaf people who speak and think in them should be taken seriously, and that the consequences of not doing so are severely damaging for the Deaf.
The Deaf community has suffered a lot of discrimination throughout history, and two of the biggest issues are non-deaf people deciding on their behalf what is best for them, and forcing them to use vocal languages (which makes as much sense as forcing a blind person to communicate via colorful paintings) while denying them access to sign language. Ask any Deaf person about Milan 1880 and why Alexander Graham Bell is so controversial in their community. A major driver in this has always been that people who don't know sign languages tend to think of them as funny interpretive miming.
With that in mind comparing sign languages to "haha Italian gesticulation funny" jokes without being aware of the differences can become a form of infantilization.
Statements can have high internal branching & nesting (clauses, referents, etc.) but it seems to hit the limits of the brain's pushdown stack pretty quickly.
Now you're making me curious why people with ADHD (me included) tend to have a weird tendency to write longer run-on sentences with commas, that on top of that use more parentheses than average. Often nesting them, even. Because according to research our working memory is a little lower on average than neurotypicals', which seems to contradict this.
Perhaps the text itself is functioning as working memory.
Both ADHD people and neurotypicals have deeply structured thoughts. "Serializing" those thoughts without planning ahead leads to the "stream of consciousness" writing style, which includes things like run-on sentences and deeply nested parentheses. This style is considered poor form, because it is hard to follow. To serialize and communicate thoughts in a way that avoids this style, it is necessary to plan ahead and rely on working memory to hold several sub-goals simultaneously, instead of simply scanning back through the text to see which parentheses have not been closed yet.
It could also be simply that ADHD people have "branchier" thoughts, hopping around a constellation of related concepts that they feel compelled to communicate despite being tangential to the main point; parentheses are the main lexical construct used to convey such asides.
It's not just "branchier" thoughts that make it hard to communicate, it's graphier thoughts, when you mean (it's important) to communicate that it's not just a tree, but that connections may also go both ways, and sometimes they even have cycles. That to see the full picture in more nuance you've got to consider those feedback loops, and that they don't necessarily take precedence one over the other but must all be taken into account simultaneously.
When you explain it serially you are forced to choose a spanning tree, and people usually stop listening once the spanning tree has touched all the relevant concepts; then they persuade themselves they got the full picture but miss some of the connections that make the problem more complex and nuanced.
When graphs have more than one loop, loopy belief propagation doesn't work anymore and you need another algorithm to update your beliefs without introducing bias.
This explanation resonates with me a lot. I use Logseq to store my notes in a graph now, which works pretty darned well for me, but it still bothers me that I can't have polyhierarchies in the namespaces and/or compound aliases.
I want to be able to simultaneously encode [[Computer Science]] and [[Computer]] [[Science]].
And [[Project1/Computer Science]] to at least provide a connection to [[Project2/Computer Science]].
I am not familiar with Logseq. The sort of connection you want can often be made automatically using embeddings. Because [[Project1/Computer Science]] and [[Project2/Computer Science]] likely have similar content, their semantic embeddings are probably close, and a neighborhood search can help find them easily.
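A toy sketch of that neighbourhood search (embed() is a hypothetical placeholder for whatever sentence/note embedding model you plug in):

    import numpy as np

    def cosine_neighbours(query_vec, note_vecs, k=5):
        """Indices of the k notes whose embeddings are closest to the query (cosine similarity)."""
        q = query_vec / np.linalg.norm(query_vec)
        n = note_vecs / np.linalg.norm(note_vecs, axis=1, keepdims=True)
        return np.argsort(-(n @ q))[:k]

    # notes = ["Project1/Computer Science", "Project2/Computer Science", ...]
    # note_vecs = np.stack([embed(n) for n in notes])    # embed() is assumed, not a real API
    # cosine_neighbours(embed("Computer Science"), note_vecs)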
Communication is kind of the game of transmitting information in such a way that your interlocutor's internal representation of things ends up mapping to yours. Low-dimensional embeddings are often very useful, but sometimes graphs are not planar. Symmetry is usually useful, and a symmetric higher-dimensional embedding is often better, because the symmetry constrains it more, making it easier to be sure it was transmitted correctly.
When people end up with different concept maps, in one of which some concepts are located near each other while in the other the same concepts are located far apart, interesting things usually happen when they communicate, ranging from cultural enlightenment to culture war.
Some of these mappings are sometimes constrained to 3D, by things like memory palaces (method of loci), but this is somewhat arbitrary, and staying more abstract and working in higher dimensions until you "feel" everything fall into the right place intuitively is often preferable (aka the Feynman method).
Yes I think embeddings using some sort of analysis is the correct answer.
I have a basic natural language processing system implemented in Neo4J (what I tried to use before Logseq). But to take notes I like plain text more than a database. Less dependencies.
The problem with embeddings is that I don't know how I would wire that into my workflow yet. Plain text notes have links; I would need a separate interface or mode to browse and analyze the connections.
One guy (whom I (electronically more than else) know) writes (can) in (most of the times this (or deeper)) style.
He can produce whole paragraphs of this semi-regular language and it even has distinct structure and non-standard interactions like in the above sentence.
GP is hitting against the limits of expressiveness of sequential text. Stacked parentheses work when the flattened sentence still reads correctly, but in this case, GP has a graph-like thought, in that:
in (most of the times <this> (or deeper)) style
is supposed to represent a graph, where "most of the times" and "or deeper" both descend from "this", and "or deeper" also descends from "most of the times". A DAG like that can't in general be flattened without back references (which would be meta-elements in the text, something natural writing generally doesn't do) or repetition, and the latter will lead to non-grammatical sentences, especially as you trim the DAG down to reduce detail.
Also: while I'm not the guy GP references, I am a guy that does that too - or rather did, at some point in the past, until I realized there's like 5 people in my life who could understand this without an issue, even less who'd indulge me or enjoy communicating this way. So over time, I got back to writing like a normal person[0]; I guess conformity is just less mentally taxing.
--
[0] - Mostly - I still use semicolons and single-depth parentheses a lot, and on HN, also footnotes.
I used to do it a lot myself since it's closer to the thought. But I'm also dyslexic. Getting lost at which stack depth I'm at while reading made me respect short and to-the-point writing.
Very easy to lose focus even without dyslexia. I found out that you have to “glide” through these stacks rather than trying to reconstruct the tree, because its structure often mirrors the commenter’s stream of thought and its tempo is either somewhat similar to yours or acts as a #clk.
That’s the non-standard part. His parentheses may add context and may serve as proper child nodes or just float there linking to the most semantically relevant parts.
To me this means that you could significantly speed up image generation by using a lower resolution at the beginning of the generation process and gradually transitioning to higher resolutions. This would also help with the attention mechanism not getting overwhelmed when generating a high resolution image from scratch.
Also, you should probably enforce some kind of frequency cutoff when you're generating the high frequencies, so that you don't destroy low-frequency details later in the process.
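Very hand-wavy sketch of that coarse-to-fine idea (not how any existing sampler actually works; `denoise` is a stand-in for the model, and the upsampling/noise re-injection amounts are made up):

    import numpy as np

    def coarse_to_fine_sample(denoise, sizes=(32, 64, 128, 256), steps_per_size=25):
        x = np.random.randn(sizes[0], sizes[0], 3)       # start at low resolution
        for size in sizes:
            if x.shape[0] != size:
                x = np.kron(x, np.ones((2, 2, 1)))       # naive nearest-neighbour 2x upsample
                x += 0.1 * np.random.randn(*x.shape)     # re-inject some high-frequency noise
            for t in range(steps_per_size):
                x = denoise(x, size, t)                  # a few denoising steps at this resolution
        return x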
Intuitively, audio is way more sensitive to phase and persistence because of the time domain. So maybe audio models look more like video models instead of image models?
I'm not really sure how current video generating models work, but maybe we could get some insight into them by looking at how current audio models work?
I think we are looking at an autoregression of autoregressions of sorts, where each PSD + phase is used to output the next, right? Probably with different-sized windows of persistence as "tokens". But I'm way out of my depth here!
It's the other way around - in hearing, phase is almost irrelevant. At medium frequencies, moving your head by a few centimeters changes the phase and phase relationships of all frequencies - and we don't perceive it at all! Most audio synthesis methods work on variants of spectrograms, and phase is approximated only later (mattering mostly for transients and rapid changes in frequency content).
In images, scrambling phase yields a completely different image. A single edge has the same spectral content as pink/brown-ish noise, but they look completely unlike one another.
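A quick way to see that asymmetry for yourself (keeping the real part at the end is a shortcut; a stricter version would enforce conjugate symmetry on the random phase):

    import numpy as np

    def scramble_phase(x):
        """Keep the Fourier magnitude of x, replace its phase with random phase."""
        fft, ifft = (np.fft.fft2, np.fft.ifft2) if x.ndim == 2 else (np.fft.fft, np.fft.ifft)
        spectrum = fft(x)
        random_phase = np.exp(2j * np.pi * np.random.rand(*spectrum.shape))
        return ifft(np.abs(spectrum) * random_phase).real

    # For an image this turns structure into pink-noise-like mush; for audio the
    # damage is concentrated in transients, consistent with the point above.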
Makes sense! My impression that phase matters in audio comes from editing audio in a DAW and the like. We are very sensitive to sudden phase changes (which would be kind of like teleporting very fast from one point to another, from our head's point of view). Our ears pick them up as sudden bursts of white noise (which also makes sense, given that they look like an impulse when zoomed in a lot).
So when generating audio I think the next chunk needs to be continuous in phase to the last chunk, where in images a small discontinuity in phase would just result in a noisy patch in the image. That's why I think it should be somewhat like video models, where sudden, small phase changes from one frame to the next give that "AI graininess" that is so common in the current models
I have an example audio clip in there where the phase information has been replaced with random noise, so you can perceive the effect. It certainly does matter perceptually, but it is tricky to model, and small "vocoder" models do a decent job of filling it in post-hoc.
This was a fascinating read. I wonder if anyone has done an analysis on the FT structures of various types of data from molecular structures to time series data. Are all domains different, or do they share patterns?
I'm not sure if frequency decomposition makes sense for anything that's not grid-structured, but there is certainly evidence that there is positive "transfer" between generative modelling tasks in vastly different domains, implying that there are some underlying universal statistics which occur in almost all data modalities that we care about.
That said, the gap between perceptual modalities (image, video, sound) and language is quite large in this regard, and probably also partially explains why we currently use different modelling paradigms for them.
I was struck by the comparison between audio spectra and image spectra. Image spectra have a strong power law effect, but audio spectra have more power in middle bands. Why? One part of the issue is that the visual spectrum is very narrow (just 1 order of magnitude from red to blue) compared to audio (4 orders of magnitude from 20Hz to 20kHz).
But another issue not mentioned in the article is that in images we can zoom in/out arbitrarily. So the width of a pixel can change – it might be 1mm in one image, or 1cm in another, or 1m or 1km. Whereas in audio, the “width of a pixel” (the time between two audio samples) is a fixed amount of time – usually 1/44.1kHz, but even if it’s at a different sample rate, we would convert all images to have the same sample rate before training an NN. The equivalent of this for images would be rescaling all images so that a picture of a cat is say 100x100 pixels, while a picture of a tiger is 300x300.
Which, come to think of it, would be potentially an interesting thing to do.
> that the visual spectrum is very narrow (just 1 order of magnitude from red to blue) compared to audio (4 orders of magnitude from 20Hz to 20kHz)
I was talking nonsense here - confusing the visual spectrum of light from red to blue with the visual spectrum of images, as in "how quickly the image changes as you move across the image". The article illustrates the latter concept well.
> The RAPSD of Gaussian noise is also a straight line on a log-log plot; but a horizontal one, rather than one that slopes down. This reflects the fact that Gaussian noise contains all frequencies in equal measure
Huh. Does this mean that pink noise would be a better prior for diffusion models than Gaussian noise, as your denoiser doesn’t need to learn to adjust the overall distribution? Or is this shift in practice not a hard thing to learn in the scale of a training run?
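For anyone who wants to poke at this, a rough sketch of an RAPSD computation (the radial binning is simplified compared to the post):

    import numpy as np

    def rapsd(image):
        """Radially averaged power spectral density of a 2D array."""
        power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
        h, w = image.shape
        yy, xx = np.indices((h, w))
        r = np.hypot(yy - h // 2, xx - w // 2).astype(int)       # integer radial frequency bin
        sums = np.bincount(r.ravel(), weights=power.ravel())
        counts = np.bincount(r.ravel())
        return (sums / np.maximum(counts, 1))[1:min(h, w) // 2]  # drop DC, stay below Nyquist

    # rapsd(np.random.randn(256, 256)) is roughly flat on a log-log plot (white/Gaussian
    # noise), whereas natural images slope down at roughly 1/f^2 in power.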
I feel like Song et al. characterized diffusion models as SDEs pretty unambiguously, and it connects to Optimal Transport in a pretty unambiguous manner. I understand the desire to give different perspectives, but once you start using multiple hedge words/qualifiers like:
> basically an approximate version of the Fourier transform!
You should take a step back and ask “am I actually muddying the water right now?”
Sorry to hear that. My blog posts are intended to build intuition. I also write academic papers, which of course involves a different standard of rigour. Perhaps you'd prefer those, only one of those is about diffusion models though.
This has little to do with diffusion. The aspects described relate to images (and sound) and are true for VAE models, for example. I mean, what else is a UNet?
Well yes, econometrics and time series analysis had already described all the methods and functions for "AI", but marketing idiots decided to create new names for 30-year-old knowledge.