cpldcpu's comments | Hacker News


They mentioned that they are using strong quantization (iirc 3-bit) and that the model was degraded by it. Also, they don't have to use transistors to store the bits.


I think they are talking about the transistors that apply the weights to the inputs.


gpt-oss is fp4. They're saying they'll try a mid-size model next (I'm guessing gpt-oss-20b), then a large one (I'm guessing gpt-oss-120b), as their hardware is fp4-friendly.


I wonder how well this works with MoE architectures?

For dense LLMs like llama-3.1-8B, you benefit a lot from having all the weights available close to the actual multiply-accumulate hardware.

With MoE, it is rather like a memory lookup. Instead of a 1:1 pairing of MACs to stored weights, you are suddenly forced to have a large memory block next to a small MAC block. And once this mismatch becomes large enough, there is a huge gain from using a highly optimized memory process for the memory instead of mask ROM.

At that point we are back to a chiplet approach...
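To illustrate the mismatch with a toy calculation (the expert counts are hypothetical, not from the article): a dense model streams every stored weight through the MACs for each token, while an MoE only activates a couple of experts, so most of the weight memory sits idle per token.

```python
# Toy illustration of MAC utilization per stored weight, dense vs. MoE.
# Parameter counts and expert counts below are made-up examples.

def weights_touched_per_token(total_params, n_experts=1, active_experts=1):
    """Weights that actually feed the MACs for one token."""
    expert_params = total_params / n_experts
    return expert_params * active_experts

dense = weights_touched_per_token(8e9)                              # dense 8B model
moe = weights_touched_per_token(8e9, n_experts=64, active_experts=2)  # hypothetical MoE

# The dense model keeps each stored weight ~32x busier per token,
# which is why MoE pushes you toward a big memory next to a small MAC block.
print(dense / moe)
```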


For comparison, I wanted to write about how Google handles MoE architectures with its TPUv4 architecture.

They use Optical Circuit Switches, operating via MEMS mirrors, to create highly reconfigurable, high-bandwidth 3D torus topologies. The OCS fabric allows 4,096 chips to be connected in a single pod, with the ability to dynamically rewire the cluster to match the communication patterns of specific MoE models.

The 3D torus connects 64-chip cubes with 6 neighbors each. TPUv4 also contains 2 SparseCores, which specialize in handling high-bandwidth, non-contiguous memory accesses.

Of course this is a DC level system, not something on a chip for your pc, but just want to express the scale here.

*ed: SpareCores to SparseCores


If each of the expert models were etched in silicon, it would still give a massive speed boost, wouldn't it?

I feel producing the ASIC is the main blocker here.


It could simply be bit-serial. With 4-bit weights you only need four serial addition steps, which is not an issue if the weights are stored nearby in a ROM.
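A minimal sketch of what a bit-serial multiply-accumulate does (my own illustration, not from the thread): the weight is consumed one bit per step, so the datapath needs only a shifter and an adder rather than a full multiplier, keeping the logic next to each ROM block tiny.

```python
# Bit-serial multiply: one shift-and-add step per weight bit.
# With 4-bit weights, the loop runs exactly four times.

def bit_serial_mac(activation, weight, weight_bits=4):
    acc = 0
    for i in range(weight_bits):
        if (weight >> i) & 1:          # examine one weight bit per step
            acc += activation << i     # shift-and-add replaces the multiplier
    return acc

print(bit_serial_mac(13, 11))  # 143, same as 13 * 11
```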


It may be worth pointing out that this is not the first residual-connection innovation to be in production.

Gemma 3n is also using a low-rank projection of the residual stream called LAuReL. Google did not publicize this much; I noticed it when poking around in the model file.

https://arxiv.org/pdf/2411.07501v3

https://old.reddit.com/r/LocalLLaMA/comments/1kuy45r/gemma_3...

It seems to be what they call LAuReL-LR in the paper, with D=2048 and R=64.
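As I read the linked paper (arXiv:2411.07501), LAuReL-LR adds a learned low-rank correction to the residual path, roughly x_out = f(x) + x + B(Ax), with A of shape R x D and B of shape D x R. A sketch with the D=2048, R=64 values mentioned above and dummy random weights:

```python
# Hedged sketch of a LAuReL-LR-style residual update; the matrices
# here are random placeholders, not Gemma 3n's actual weights.
import random

D, R = 2048, 64
random.seed(0)
A = [[random.gauss(0, 0.01) for _ in range(D)] for _ in range(R)]  # R x D down-projection
B = [[random.gauss(0, 0.01) for _ in range(R)] for _ in range(D)]  # D x R up-projection

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def laurel_lr_residual(x, fx):
    # Low-rank term squeezes the D-dim stream through an R-dim bottleneck.
    low_rank = matvec(B, matvec(A, x))
    return [f + xi + lr for f, xi, lr in zip(fx, x, low_rank)]

out = laurel_lr_residual([1.0] * D, [0.0] * D)
print(len(out))  # 2048-dim residual stream, unchanged width
```

Note the low-rank path costs only 2*D*R multiplies per token instead of D*D, which is presumably why it is cheap enough to ship.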


This is a fantastic catch. I hadn't realized Gemma 3n was already shipping with a variant of this in production.

It feels like we are entering the era of residual stream engineering. For a long time, the standard x + F(x) additive backbone was treated as untouchable. Now, between mHC (weighted scaling) and LAuReL (low-rank projections), labs are finally finding stable ways to make that signal path more dynamic.
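To make the contrast concrete (my own toy illustration; the scalars are hypothetical, not values from either paper): the classic backbone adds the branches unweighted, while a weighted-scaling variant like the one mHC is described as using learns coefficients for both.

```python
# Plain additive residual vs. a weighted-scaling variant (toy scalars).

def plain_residual(x, fx):
    return x + fx                        # the untouchable classic: x + F(x)

def weighted_residual(x, fx, alpha=0.9, beta=0.5):
    return alpha * x + beta * fx         # learned scaling of both branches

print(plain_residual(1.0, 2.0))          # 3.0
print(weighted_residual(1.0, 2.0))       # 1.9
```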

I'm curious if the Low-Rank constraint in LAuReL acts as a natural stabilizer against the gradient explosion I saw with unconstrained hyper-connections.

Thanks for the paper link, definitely reading that tonight.


Thanks! Would be quite interesting to see how this fares compared to mHC.

I noted that LAuReL is cited in the mHC paper, but they refer to it as "expanding the width of the residual stream", which is rather odd.


All of them use ASML lithography, including CXMT.

They are, of course, a bit slower in EUV adoption. But it's already there:

https://www.tomshardware.com/pc-components/dram/micron-sampl...

https://www.techinsights.com/blog/samsung-d1z-lpddr5-dram-eu...


This lists many transistor patents, from oldest to newest:

https://patents.google.com/?q=(H03F3%2f16)&sort=old

The Matare/Welker patent is missing, though:

https://patents.google.com/patent/US2673948A/en

The entire debate is tiring. It would be better if these reviews would put the actual device physics of the different concepts into context.

Is there any report of a reproduction of the device proposed by Lilienfeld in his patents? If he managed to make functional devices back then, it should be possible today, shouldn't it? (Note: Cu2S is not a very controllable semiconductor...)

Edit:

A Gemini Deep Research summary is here; it's quite informative: https://docs.google.com/document/d/1jE0wQVeWP9Eiybh_C6zMKeZ5...

Also, specifically on Cu-based TFTs: https://docs.google.com/document/d/1_B2x2gBPKgGFVgJyQ0qzPdI4...

From the second document: "The primary obstacle for $Cu_2S$ TFTs is degeneracy. Spontaneous copper vacancies form with negligible energy cost in the sulfur lattice. As a result, stoichiometric $Cu_2S$ is thermodynamically unstable in air, rapidly oxidizing or losing copper to form substoichiometric phases ($Cu_{2-x}S$) with hole concentrations exceeding $10^{20}-10^{21} \text{ cm}^{-3}$."

This explains why there are zero reproductions of Lilienfeld's devices. It should be noted that Lilienfeld was one of the inventors of the electrolytic capacitor and therefore knew very well how to create the extremely thin insulating layers needed for TFTs. It is not unreasonable to assume that he could have used other semiconductors (e.g. CdS) with his concept. However, the patents seem to specifically mention Cu2S, which does not yield functional TFTs.


"Zero power" does not include the power needed to translate information between electronic and optical domains and the light source itself.


Yes, correct. I will phrase this better in the future. The zero power refers only to what is, in effect, the optical replacement for the ocean of matmuls you have in a standard Transformer implementation.

I apologize for not being clearer.

The goal isn't actually "zero power"; the goal is "so little power that heat dissipation in orbit is easy".


What also cannot be ignored is that transformer models are a great unifying force: it's basically one architecture that can be used for many purposes.

This eliminates the need for more specialized models and the associated engineering and optimizations for their infrastructure needs.


And what if better models than transformers are found? Or if someone finds models that do not rely on GPUs or specialized hardware?

Neither the hyperscalers nor NVDA are safe from uncertainty.


I am not a professional software developer but rather a multi-domain system architect, and I have to say it is absolutely magical!

The public discourse about LLM-assisted coding is often driven by front-end developers or non-professionals trying to build web apps, but the value it brings to prototyping system concepts across hardware/software domains can hardly be overstated.

Instead of trying to find suitable simulation environments and trying to couple them, I can simply whip up a GUI-based tool to play around with whatever signal chain/optimization problem/control loop I want to investigate. Usually I would have to find/hire people to do this, but using LLMs I can iterate on ideas at a crazy cadence.

Later, implementation does of course require proper engineering.

That said, it is often confusing how different models are hyped. As mentioned, there is an overt focus on front-end design etc. For the work I am doing, I found Claude 4.5 (both models) to be absolutely unchallenged. Gemini 3 Pro is also getting there, but its long-term agentic capability still needs to catch up. GPT 5.1/Codex is excellent for brainstorming in the UX, but I found it too unresponsive and opaque as a code assistant. It does not even matter if it can solve bugs other LLMs cannot find, because you should not put yourself in a situation where you don't understand the system you are building.


Agreed. I know networks, requests, protocols, auth flows, etc., but it's been years since I actually coded stuff.

This is magical to me.

I love Cursor; I use it to deploy Docker packages and fix npm issues etc. too :p

I use some guardrails, like SonarQube as a static code analyzer and of course some default linters. Checks and balances.

