LLMs are a commodity now. It is all about capital. DeepSeek and Grok proved that.
It’s not Klingon cloaking tech.
With minor variations, it is all Transformers trained by autoregressive next-token prediction on text: self-attention, residual connections, layer norm, and positional encodings (RoPE/ALiBi), optimized with AdamW or Lion. Training scales with data, model size, and batch size, using LR schedules and distributed parallelism (FSDP, ZeRO, TP). Inference comes down to KV caching and probabilistic sampling (temperature, top-k/top-p).
Most differences are in scale, data quality, and marginal architectural tweaks.
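To make the "probabilistic sampling" part concrete, here is a minimal sketch of a temperature + top-k + top-p (nucleus) sampling step over a logits vector. The function name, defaults, and NumPy implementation are illustrative assumptions, not anyone's production code:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9, rng=None):
    # Illustrative sketch: scale by temperature, restrict to the top-k /
    # nucleus candidates, then draw from the renormalized distribution.
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)

    # Top-k: drop everything outside the k highest logits.
    if top_k is not None and top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)

    # Softmax over the surviving logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p (nucleus): keep the smallest set of tokens whose mass >= top_p.
    if top_p is not None and top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        keep = order[:cutoff]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()

    return int(rng.choice(probs.size, p=probs))
```

Every lab's decoder loop does some variant of this; the differences are knobs, not secrets.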
Meta had the scale and whiffed so badly that the Llama team disbanded and Zuck has been recruiting new people with nine-figure offers. Capital is essential, but the people deciding when to launch the training run still matter.