I was an architect on the Anton 2 and 3 machines. The systolic arrays that computed pairwise interactions were a significant component of the chips, but there was also an enormous number of fairly normal-looking general-purpose (32-bit / 4-way SIMD) processor cores that we just programmed in C++.
I love stuff like this! I have a Kaypro 2/84, which supports a display resolution of 160x100 with colors ranging from green to 'less green' to black, and I went down a rabbit hole of trying to push the graphics: https://www.chrisfenton.com/exploring-kaypro-video-performan... - I was eventually able to get it to display (short!) 50 fps video clips.
I built a more whimsical version of this - my daughter and I basically built a 'junk robot' from a 1980s movie, told it 'you're an independent and free junk robot living in a yard', and let it go: https://www.chrisfenton.com/meet-grasso-the-yard-robot/
I did this like 18 months ago, so it uses a webcam + multimodal LLM to figure out what it's looking at, has a motor in its base to let it look back and forth, and uses a Python wrapper around another LLM as its 'brain'. It worked pretty well!
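Conceptually the control loop is only a few lines of Python. This is a rough sketch rather than the actual robot code, and describe_frame / ask_brain / turn_head are placeholders for whatever vision model, LLM wrapper, and motor driver you have:

    import time
    import cv2  # webcam capture

    PERSONA = "You're an independent and free junk robot living in a yard."

    def describe_frame(frame):
        # Placeholder: send the frame to a multimodal LLM, return its description.
        return "a pile of garden junk"

    def ask_brain(prompt):
        # Placeholder: call the 'brain' LLM (wrapped in Python), return its reply.
        return "look left"

    def turn_head(command):
        # Placeholder: drive the motor in the base to pan back and forth.
        print("turning:", command)

    def run_robot(camera_index=0, period_s=5.0):
        cam = cv2.VideoCapture(camera_index)
        while True:
            ok, frame = cam.read()
            if not ok:
                continue
            scene = describe_frame(frame)
            reply = ask_brain(f"{PERSONA}\nYou see: {scene}\nWhat do you do?")
            if "look" in reply.lower():
                turn_head(reply)
            time.sleep(period_s)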
Your article mentioned taking 4 minutes to process a frame. Considering how much image recognition software runs in real time, I find this surprising. I haven't used them myself, so maybe I'm not understanding, but wouldn't something like YOLO be better suited to this?
It uses an Intel N100, which is an extremely slow CPU. The model sizes that he's using would be pretty slow on a CPU like that. Moving up to something like the AMD AI Max 365 would make a huge difference, but would also cost hundreds of dollars more than his current setup.
Running something much simpler that only did bounding box detection or segmentation would be much cheaper, but he's running fairly full featured LLMs.
Yeah, I guess I was more thinking of moving to a bounding-box-only model. If it's OCRing, it's doing too much IMO (though OCR could also be interesting to run). Not my circus, not my monkeys, but it feels like the wrong way to determine roughly what the camera sees.
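A stock pretrained checkpoint would probably be enough for 'roughly what the camera sees'. Untested sketch, assuming the ultralytics package and one of its pretrained YOLO models:

    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")    # nano model, small enough for a weak CPU
    results = model("frame.jpg")  # or pass a numpy frame straight from the webcam

    for box in results[0].boxes:
        label = model.names[int(box.cls)]
        confidence = float(box.conf)
        print(label, round(confidence, 2), box.xyxy.tolist())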
Mine was entirely mechanical (driven by punch cards and a hand-crank), and changed all of the pixels in parallel, but a lot of the mechanism development looked extremely familiar to me.
This is incredible! I can appreciate how much work it took to make this happen. Well done!
I was recently in the presence of some linotype machines from the 1800s and it's so good to be humbled by the achievements of people who came before us. That machine was so complex, I could barely begin to figure out how to manufacture one. Your discussion of looms reminds me of that!
If you enjoy Linotype machines, I'll suggest you watch 'Farewell, Etaoin Shrdlu', a documentary on the last night the New York Times ran its hot metal typesetting system.
I think a quad-CPU X-MP is probably the first computer that could have run (not trained!) a reasonably impressive LLM if you could magically transport one back in time. It supported a 4 GB (512 MWord) SRAM-based "Solid State Drive" with a transfer bandwidth of 2 GB/s, and delivered about 800 MFLOPS of CPU performance on something like a big matmul. You could probably run a 7B parameter model with 4-bit quantization on it with careful programming, and get a token every couple of seconds.
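Rough numbers behind the 'token every couple of seconds' guess, looking only at the SSD-bandwidth side (i.e. a lower bound from streaming the weights in for each token):

    params = 7e9
    bits_per_weight = 4
    weight_bytes = params * bits_per_weight / 8   # ~3.5 GB, fits in the 4 GB SSD
    ssd_bandwidth = 2e9                           # ~2 GB/s transfer rate

    print(weight_bytes / 1e9)                     # ~3.5 GB of weights
    print(weight_bytes / ssd_bandwidth)           # ~1.75 s/token just to stream them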
This sounds plausible and fascinating.
Let’s see what it would have taken to train a model as well.
Given an estimate of 6 FLOPs per parameter per training token, and a training set of roughly 300 billion tokens, training a 7B parameter model would require about 1.26×10^22 FLOPs. That translates to roughly 500,000 years on an 800 MFLOPS X-MP, far too long to be feasible.
Training a 100M parameter model would still take nearly 70 years.
However, a 7M-parameter model would only have required about six months of training, and a 14M one about a year, so let’s settle on 10 million. That’s already far more reasonable than the 300K model I mentioned earlier.
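Back-of-envelope sketch of that arithmetic, assuming (my assumption) the training tokens scale down proportionally with the parameter count:

    # 6 FLOPs per parameter per training token, ~800 MFLOPS sustained
    SECONDS_PER_YEAR = 3.156e7
    XMP_FLOPS = 800e6

    def train_years(params, tokens):
        return 6 * params * tokens / XMP_FLOPS / SECONDS_PER_YEAR

    print(train_years(7e9, 3e11))  # ~500,000 years for the 7B model
    print(train_years(7e6, 3e8))   # ~0.5 years for a 7M model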
Moreover, a 10M parameter model would have been far from useless. It could have performed decent summarization, categorization, basic code autocompletion, and even powered a simple chatbot with a short context. All of that in 1984 would have been pure sci-fi back in those days. And pretty snappy too, maybe around 10 tokens per second, if not a little more.
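The speed figure also checks out if you assume the usual ~2 FLOPs per parameter per generated token:

    params = 10e6
    flops_per_token = 2 * params    # ~2e7 FLOPs for one forward pass
    print(800e6 / flops_per_token)  # ~40 tokens/s at peak, so ~10/s at realistic efficiency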
Too bad we lacked the datasets and the concepts...
The Cray PVP line was also doing double-precision floating point, and could overlap vector memory operations with math operations. My guess is that you would need a microcontroller operating at several hundred MHz to beat a Cray-1 in practice. The later Cray-1/S and /M variants also supported a 10 Gbps link to an SSD of several hundred megabytes, which is hard to beat in a microcontroller.
I took a different approach by just making an FPGA-based multi-core Z80 setup. One core is dedicated to running a 'supervisor' CP/NET server, and all of the applications run on CP/NET clients and can run normal CP/M software. I built a 16-core version of this, and each CPU gets its own dedicated 'terminal' window, with all of the windowing handled by the display hardware (and ultimately controlled by the supervisor CPU). It's a fun 'what-if' architecture that works way better in practice than one might expect. It would have made an amazing mid-to-late 1980s machine.