They mentioned that they were using strong quantization (IIRC 3-bit) and that the model degraded from that. Also, they don't have to use transistors to store the bits.
gpt-oss is fp4. They're saying they'll next try a mid-size model (I'm guessing gpt-oss-20b), then a large one (I'm guessing gpt-oss-120b), as their hardware is fp4-friendly.
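For reference, gpt-oss's fp4 is, as far as I know, MXFP4, i.e. E2M1 elements sharing a per-block scale. Here's a sketch of decoding a single E2M1 nibble; the format details are my assumption, the block scale is not modeled:

```python
def decode_e2m1(nibble: int) -> float:
    """Decode a 4-bit E2M1 value (1 sign, 2 exponent, 1 mantissa bit).

    Representable magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6}; MXFP4
    multiplies these by a shared per-block scale (omitted here).
    """
    sign = -1.0 if nibble & 0b1000 else 1.0
    exp = (nibble >> 1) & 0b11
    man = nibble & 0b1
    if exp == 0:                      # subnormal: 0 or 0.5
        return sign * 0.5 * man
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

# All eight positive code points:
print([decode_e2m1(n) for n in range(8)])
# [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Only 16 code points total, which is why a small ROM per weight is enough.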
I wonder how well this works with MoE architectures?
For dense LLMs, like llama-3.1-8B, you profit a lot from having all the weights available close to the actual multiply-accumulate hardware.
With MoE, it is rather like a memory lookup: instead of a 1:1 pairing of MACs to stored weights, you are suddenly forced to have a large memory block next to a small MAC block. And once this mismatch becomes large enough, there is a huge gain from using a highly optimized memory process for the memory instead of mask ROM.
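The mismatch can be made concrete with toy numbers (all hypothetical, just to show the ratio):

```python
def active_weight_fraction(n_experts: int, top_k: int) -> float:
    """Fraction of stored expert weights a single token actually touches."""
    return top_k / n_experts

# Dense layer: every stored weight is used on every token -> 1.0
print(active_weight_fraction(1, 1))    # 1.0
# Toy MoE layer: 64 experts, 4 routed per token (hypothetical numbers)
print(active_weight_fraction(64, 4))   # 0.0625
```

At 1/16 utilization, pairing every stored weight with its own MAC wastes most of the compute area, which is exactly when a dense memory block next to a small MAC block starts to win.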
At that point we are back to a chiplet approach...
For comparison I wanted to write on how Google handles MoE archs with its TPUv4 arch.
They use Optical Circuit Switches, operating via MEMS mirrors, to create highly reconfigurable, high-bandwidth 3D torus topologies. The OCS fabric allows 4,096 chips to be connected in a single pod, with the ability to dynamically rewire the cluster to match the communication patterns of specific MoE models.
The 3D torus connects 64-chip cubes with 6 neighbors each. TPUv4 also contains 2 SparseCores, which specialize in handling high-bandwidth, non-contiguous memory accesses.
Of course this is a DC level system, not something on a chip for your pc, but just want to express the scale here.
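The 6-neighbor torus topology above can be sketched in a few lines (a 4x4x4 cube gives the 64-chip building block; the wiring details beyond "wrap each axis" are my simplification):

```python
from itertools import product

def torus_neighbors(coord, dims):
    """Neighbors of a node in a 3D torus: +/-1 along each axis, wrapping."""
    out = []
    for axis in range(3):
        for step in (-1, 1):
            c = list(coord)
            c[axis] = (c[axis] + step) % dims[axis]
            out.append(tuple(c))
    return out

dims = (4, 4, 4)                                # 4*4*4 = 64 chips per cube
nodes = list(product(range(4), repeat=3))
print(len(nodes))                               # 64
print(len(torus_neighbors((0, 0, 0), dims)))    # 6
print(torus_neighbors((0, 0, 0), dims))         # wrap-around links included
```

The OCS fabric then stitches these cubes together, and because the mirrors are reconfigurable, the inter-cube topology can be rewired per workload.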
It could simply be bit-serial. With 4-bit weights you only need four serial addition steps, which is not an issue if the weights are stored nearby in a ROM.
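In software, a bit-serial MAC over unsigned 4-bit weights would look like this (signed formats need an extra correction step, omitted here):

```python
def bit_serial_mac(weights, activations, bits=4):
    """Multiply-accumulate processing one weight bit-plane per step.

    Each step needs only adders and shifts, no multiplier -- which is
    why it pairs well with weights stored in a nearby ROM.
    """
    acc = 0
    for b in range(bits):                     # 4 serial steps for 4-bit weights
        for w, x in zip(weights, activations):
            if (w >> b) & 1:                  # bit-plane b of the weight
                acc += x << b                 # add the shifted activation
    return acc

w = [3, 5, 7]
x = [2, 4, 1]
print(bit_serial_mac(w, x))                   # 33
print(sum(a * b for a, b in zip(w, x)))       # 33 (dense reference)
```

In hardware the inner loop runs in parallel across weights, so the latency cost really is just the four bit-plane steps.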
It may be worth pointing out that this is not the first residual-connection innovation to reach production.
Gemma 3n also uses a low-rank projection of the residual stream called LAuReL. Google did not publicize this much; I noticed it when poking around in the model file.
This is a fantastic catch. I hadn't realized Gemma 3n was already shipping with a variant of this in production.
It feels like we are entering the era of residual stream engineering. For a long time, the standard x + F(x) additive backbone was treated as untouchable. Now, between mHC (weighted scaling) and LAuReL (low-rank projections), labs are finally finding stable ways to make that signal path more dynamic.
I'm curious if the Low-Rank constraint in LAuReL acts as a natural stabilizer against the gradient explosion I saw with unconstrained hyper-connections.
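As a rough sketch of the idea (my reading of the LAuReL paper, not necessarily the exact Gemma 3n formulation): instead of y = x + F(x), you add a learned low-rank path, y = x + F(x) + B(Ax), with rank r much smaller than the model width d:

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def laurel_block(x, f, A, B):
    """y = x + f(x) + B(Ax): residual plus a learned low-rank correction.

    A is (r x d), B is (d x r), so the extra path costs only 2*d*r
    parameters instead of d*d. All weights here are toy/hypothetical.
    """
    fx = f(x)                         # the usual block output F(x)
    low_rank = matvec(B, matvec(A, x))
    return [xi + fi + li for xi, fi, li in zip(x, fx, low_rank)]

d, r = 4, 2
A = [[0.1] * d for _ in range(r)]     # toy weights
B = [[0.1] * r for _ in range(d)]
f = lambda v: [math.tanh(t) for t in v]  # stand-in for the block
print(laurel_block([1.0, 0.0, 0.0, 0.0], f, A, B))
```

On the stability question: the update through the extra path has rank at most r, which caps how much it can perturb the residual stream in any one step; that may be part of why it behaves better than unconstrained hyper-connections, but that's speculation on my part.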
Thanks for the paper link, definitely reading that tonight.
The entire debate is tiring. It would be better if these reviews put the actual device physics of the different concepts into context.
Is there any report of a reproduction of the device proposed by Lilienfeld in his patents? If he managed to make functional devices back then, it should be possible today. (Note: Cu2S is not a very well controllable semiconductor...)
From the second document:
"The primary obstacle for $Cu_2S$ TFTs is degeneracy. Spontaneous copper vacancies form with negligible energy cost in the sulfur lattice. As a result, stoichiometric $Cu_2S$ is thermodynamically unstable in air, rapidly oxidizing or losing copper to form substoichiometric phases ($Cu_{2-x}S$) with hole concentrations exceeding $10^{20}-10^{21} \text{ cm}^{-3}$."
This explains why there are zero reproductions of Lilienfeld's devices. It should be noted that Lilienfeld was one of the inventors of the electrolytic capacitor and therefore knew very well how to create the extremely thin insulating layers needed for TFTs. It is not implausible that he could have used other semiconductors (e.g. CdS) with his concept. However, the patents seem to specifically mention Cu2S, which does not yield functional TFTs.
Yes, correct. I will phrase this better in the future. The "zero power" refers only to what is, in effect, the optical replacement for the ocean of matmuls in a standard Transformer implementation.
I apologize for not being clearer.
The goal isn't actually "zero power"; the goal is "so little heat that dissipation in orbit is easy".
What also cannot be ignored is that transformer models are a great unifying force: basically one architecture that can be used for many purposes.
This eliminates the need for more specialized models and the associated engineering and optimizations for their infrastructure needs.
I am not a professional software developer but more of a multi-domain system architect, and I have to say it is absolutely magical!
The public discourse about LLM-assisted coding is often driven by front-end developers, or rather non-professionals trying to build web apps, but the value it brings to prototyping system concepts across hardware/software domains can hardly be overstated.
Instead of trying to find suitable simulation environments and couple them, I can simply whip up a GUI-based tool to play around with whatever signal chain, optimization problem, or control scheme I want to investigate. Usually I would have to find or hire people to do this, but with LLMs I can iterate on ideas at a crazy cadence.
Later, implementation does of course require proper engineering.
That said, it is often confusing how different models are hyped. As mentioned, there is an overt focus on front-end design etc. For the work I am doing, I found Claude 4.5 (both models) to be absolutely unchallenged. Gemini 3 Pro is also getting there, but its long-term agentic capability still needs to catch up. GPT 5.1/Codex is excellent for brainstorming in its UI, but I found it too unresponsive and opaque as a code assistant. It does not even matter whether it can solve bugs other LLMs cannot find, because you should not put yourself in a situation where you don't understand the system you are building.