Something else I find exciting, starting with one of the reflections-

The original training took 3 days on a Sun 4/260 workstation. I can't find specifics, but I believe early SPARC workstations of that era would likely pull about 200 watts in total (the CPU wasn't super high-powered, but the whole system, running with the disks, the monitor, etc., would pull about that).

So 200 watts * 72 hours = 14400 watt-hours of energy.

Karpathy trained the equivalent on a MacBook, not even fully utilized, in 90 seconds. Likely something around 20 watts * 0.025 hours = 0.5 watt-hours.

An energy efficiency improvement of nearly 30000x.
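
For anyone who wants to sanity-check that figure, here's a minimal Python sketch of the same back-of-the-envelope arithmetic (the wattages are the rough estimates above, not measured numbers):

  # rough estimates from the comment above, not measured figures
  sun_energy_wh = 200 * 72              # Sun 4/260: ~200 W for 3 days -> 14400 Wh
  mac_energy_wh = 20 * (90 / 3600)      # MacBook: ~20 W for 90 s -> 0.5 Wh
  print(sun_energy_wh / mac_energy_wh)  # -> 28800.0, i.e. roughly 30000x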



This is very interesting, because I've always thought that all NN performance should be measured in a unit with energy in the denominator.


It totally depends on what you want to use a measure for. Just like neither height nor volume alone will tell you what will fit in your car.

By any measure that puts energy used by the brain in the denominator, humans are probably dumber than ants. But that doesn't mean those measures are always accurate.

(For contemporary neural networks, you also have to distinguish training costs from inference costs.)


To add more context: humans are ~100 W biological machines. The brain is ~20% of that power, so ~20 W.

The greatest form of general intelligence at 20W.

A MacBook Air is ~30W.

https://www.jackery.com/blogs/knowledge/how-many-watts-a-lap...


You're leaving out the training requirements


That's not entirely fair. That's the runtime cost, but it ignores the cost-to-information ratio and the cost of drawing from that pool.

Given that a laptop runs at 30 W, how much can it do disconnected from the internet? Now how much can it do with the internet? Now how much does the internet cost in terms of wattage? Now what's the ratio?


What can a human do disconnected from society?


It is “the greatest” because we only appreciate intelligence that we ourselves understand. A 0.0001 W calculator does arithmetic faster than any human brain.


I dispute that, if the metric is a chess game between an ant and a human


For inference that could be useful, but the energy is not a property of the model alone; it is a property of at least the tuple of model, model architecture and compilation, and the hardware chosen.


30k doesn't even sound like that much to me given Moore's Law. I'd expect more improvement since 1989; supercomputer performance has increased more than a million-fold since then.


My (wrong) intuition on reading your comment was that you were over-estimating the expected growth in performance over that period. But after checking the maths based on Moore's Law, i.e. doubling every two years (which I understand is a rough estimate, more of a conceptual prediction than something meant to be precise), you're right. So I'll share the maths for anyone else whose intuition is as poor as mine:

Doubling every 2 years = compound annual growth rate (CAGR) of ~41.42%

  CAGR = ((End Value / Start Value)^(1 / Number of Years)) - 1

  ((2 / 1)^(1 / 2)) - 1 = 0.41421356237
Therefore in 34 years since then:

  1 * (1 + 0.41421356237)^34 = ~131,072
So ~30k is about 4.4x short of 131k. Then again, that's equivalent to ~1.833x every two years, compared to Moore's Law's 2x every two years, so only ~8% less growth per two-year period, which, coming back to the fact that Moore's Law is a rough conceptual estimate rather than an exact fact, doesn't seem too far off!
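
If anyone wants to replay the numbers with different inputs, here's a quick Python sketch of the same calculation (the ~30000x figure is the rough estimate from upthread):

  # 34 years of doubling every 2 years vs the ~30000x figure from upthread
  years = 34
  cagr = 2 ** (1 / 2) - 1          # ~0.4142 per year
  expected = (1 + cagr) ** years   # ~131072, i.e. 2**17
  actual = 30_000                  # rough efficiency gain from the thread
  print(expected / actual)         # ~4.37, the "~4.4x short" above
  print(actual ** (2 / years))     # ~1.833x per two years, vs Moore's 2x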


The rest of that difference can easily be explained by the difference in the class of hardware used. A desktop made today vs a laptop is roughly that factor of 4. I'm not sure whether back then there were laptops you could have done this on, for a more apples-to-apples comparison.

Modern laptops give great efficiency. When I went for solar power here, the first thing to go was the desktop computer. I still have it, but it hasn't run in over a year, and the elderly ThinkPad that is now my daily driver uses far less power and still has enough compute to serve my modest needs. If I were to dive into something requiring much more compute, though, I'd have to start the desktop again. Unfortunately power management is not yet such that computers can really throttle down to 'miser mode' when you don't need the compute; it's a good step, but not as good as the jump between desktop and laptop.


Also the 'memory wall': remember that memory bandwidth did not grow at the pace of Moore's Law. Sure, there are ways to mitigate it, but those eat into the chip budget and show up when real-world performance is measured.


Yes, true, and in a way that wall is still there. Look at the way GPUs are limited in how much RAM they have, because there is money in selling you that memory at a multiple of the cost.

Imagine a GPU with a 128 GB or even 256 GB slot-based memory section that is sold unpopulated: 8 SODIMM slots or so.


Hadn't thought of that, good point


Imagine that we are discussing "proving" a law with historical data from our POV, but at the time it must've seemed like a theory at best, or comical at the least.

8% less growth is not the point. The "law" has stood the test of time, which says something about the guy and his vision.


It isn't much, but as the link says, the neural network they were reimplementing is too small to take advantage of modern hardware.


Amdahl's law


33 years ago is 2000/1999


Um... you might want to check your tens digit.


Oops


quickest maffs


> watt-hours

You mean joules (up to a constant factor)?


A watt-hour is 3600 joules, but watt-hours or kilowatt-hours are commonly used because they're easier to calculate with.
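
For reference, converting the figures from upthread to joules (a quick Python sketch; the Wh numbers are the rough estimates above):

  # 1 Wh = 1 W * 3600 s = 3600 J
  def wh_to_joules(wh):
      return wh * 3600

  print(wh_to_joules(14400))  # Sun 4/260 run: 51840000 J (~51.8 MJ)
  print(wh_to_joules(0.5))    # MacBook run: 1800.0 J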



