
One use case that I saw myself is when a vehicle is parked such that it requires the other vehicle to go slightly over the curb; in this case the curb is flat, so I assume the parked driver thought it was okay. Every other human driver managed fine, but Waymo just refused to put its wheels on the curb and got stuck. Video here: https://x.com/aaditya_prakash/status/1989444130238259575?s=2...


In machine learning (especially deep learning / neural networks), 'training' is done using Stochastic Gradient Descent. The gradients are computed via backpropagation, which requires a backward pass through your model (typically many layers of neural weights) and thus requires keeping a lot of intermediate values (called activations) in memory. However, if you are doing "inference", that is, if the goal is only to get a result and not to improve the model, then you don't have to do backpropagation and thus you don't need to store the intermediate values. As the number of layers and parameters in deep learning grows, this difference in computation between training and inference becomes significant. In most modern applications of ML, you train once but infer many times, so it makes sense to have specialized hardware that is optimized for "inference" at the cost of its inability to do "training".
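As a toy illustration (a minimal pure-Python sketch, not any real framework): training a one-weight "network" needs the forward activation kept around to compute the gradient, while inference is just the forward pass.

```python
# Toy one-weight model y = w * x, squared-error loss (illustrative only).

def forward(w, x):
    return w * x

def train_step(w, x, target, lr=0.1):
    # Training: the forward activation is kept so the backward pass can use it.
    y = forward(w, x)               # intermediate value ("activation")
    grad_w = 2 * (y - target) * x   # dLoss/dw for loss = (y - target)**2
    return w - lr * grad_w          # one SGD update

def infer(w, x):
    # Inference: forward pass only; no intermediates need to be retained.
    return forward(w, x)

w = 0.0
for _ in range(100):
    w = train_step(w, x=1.0, target=3.0)

print(round(infer(w, 1.0), 3))  # -> 3.0
```

With real networks the saved activations are large tensors at every layer, which is exactly the memory that inference-only hardware gets to skip.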


Just to add to this: the reason these inference accelerators have become big recently (see also the "neural core" in Pixel phones) is that they can run inference tasks in real time (lower model latency) with better power usage than a GPU.

As a concrete example, on a camera you might want to run a facial detector so the camera can automatically adjust its focus when it sees a human face. Or you might want a person detector that can detect the outline of the person in the shot, so that you can blur/change their background in something like a Zoom call. All of these applications are going to work better if you can run your model at, say, 60 Hz instead of 20 Hz. Optimizing hardware to do inference tasks like this as fast as possible with the least possible power usage is pretty different from optimizing for all the things a GPU needs to do, so you might end up with hardware that has both and uses them for different tasks.
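To make the latency point concrete (the numbers here are generic, not from any specific device): the per-frame compute budget your model must fit inside shrinks quickly as the frame rate goes up.

```python
# Per-frame time budget at each frame rate (illustrative arithmetic only).
for hz in (20, 60):
    budget_ms = 1000.0 / hz
    print(f"{hz} Hz -> {budget_ms:.1f} ms per frame")
# 20 Hz -> 50.0 ms per frame
# 60 Hz -> 16.7 ms per frame
```

So a detector at 60 Hz has roughly a third of the time (and, on battery, a third of the energy headroom) it would have at 20 Hz, which is where dedicated inference silicon earns its keep.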


Thank you @iamaaditya and @eklitzke. Very informative.


It took me 20 years to learn this body of knowledge, and now it can just sort of be summed up in a paragraph.

When I learned and used gradient descent, you had to analytically determine your own gradients (https://web.archive.org/web/20161028022707/https://genomics....). I went to grad school to learn how to determine my own gradients. Unfortunately, in my realm, loss landscapes have multiple minima, and gradient descent just gets trapped in local minima.


This is the case for most contemporary neural networks as well. It turns out that for many domains, a "good" local minimum generalizes well across many tasks.


Huh. I talked to some experts and they told me NN loss functions are bowl-shaped with a single minimum, but that the minimum takes a very long time to navigate to in high-dimensional spaces.


For higher feature counts the real concern is saddle points rather than minima, where the gradient is so small that you barely move at all each iteration and get "stuck".


To add here: for a local minimum to occur, the loss needs to be increasing along every one of those dimensions (or features). This is highly unlikely for modern NNs, where you have millions of dimensions. If the loss is still going down along one dimension while going up along the rest, you have a saddle point. Since you can descend along only one (or a few) dimensions, it takes longer.
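A toy illustration of getting stuck near a saddle (my own example, plain Python): on f(x, y) = x^2 - y^2, which has a saddle point at the origin, gradient descent collapses the x direction quickly but crawls along the one escape direction y when it starts near the saddle.

```python
# Gradient descent on f(x, y) = x**2 - y**2 (saddle point at the origin).

def grad(x, y):
    return 2 * x, -2 * y   # df/dx, df/dy

x, y, lr = 1.0, 1e-8, 0.1  # start almost exactly on the saddle's escape axis
for _ in range(60):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy

# x has collapsed toward 0, but escape along y is painfully slow:
print(abs(x) < 1e-5, abs(y) < 1e-2)  # True True
```

The gradient along y is proportional to y itself, so while y is tiny the updates are tiny too; that "barely moving" regime is the practical cost of saddle points.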


What's your realm?


protein folding and structure prediction. Protein simulations typically define an energy function, similar to a loss function, over all the atoms in the protein. There are many terms: at least one per bonded atom pair, at least one per bonded atom triple, at least one per bonded atom quadruple, one per each non-bonded pair (although atoms that are distant can be excluded, sometimes making this a sparse matrix). If you start with a proposed model (say, random coordinates for all the atoms) and apply gradient descent, you'll end up with a mess. All those energy terms end up creating a high dimensional surface that is absurdly spiky in the details, and extremely wavy with many local minima at coarse grain.
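A rough sketch of the kind of energy function described above (the constants and functional forms here are hypothetical placeholders, chosen only to show the shape: harmonic terms for bonded pairs plus Lennard-Jones terms for non-bonded pairs):

```python
import math

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def bond_energy(r, r0=1.0, k=100.0):
    # One harmonic (spring) term per bonded atom pair.
    return 0.5 * k * (r - r0) ** 2

def lj_energy(r, eps=0.2, sigma=1.0):
    # One Lennard-Jones term per non-bonded atom pair.
    sr6 = (sigma / r) ** 6
    return 4 * eps * (sr6 ** 2 - sr6)

def total_energy(coords, bonds):
    e = 0.0
    for i, j in bonds:
        e += bond_energy(dist(coords[i], coords[j]))
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if (i, j) not in bonds:
                e += lj_energy(dist(coords[i], coords[j]))
    return e
```

Even this stripped-down version hints at the problem: the sum of many stiff terms over many atoms produces a surface that is spiky locally and multi-welled globally, which is exactly what traps naive gradient descent.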

Instead of using gradient descent, we used molecular dynamics (I'm unaware if this has a direct equivalent) to sample the space by moving along various isocontours (constant energy, or constant temperature, or usually constant pressure). Even so, you have to do a lot of sampling- in my day, it was years of computer time, now it's months- to get a good approximation to the total landscape, and measure transition frequencies between areas of the landscape that correspond to energy barriers (local maxima) that are smaller than the thermal energy available to the system.

It's complicated. Also, DeepMind obviated all my work by proving that sequence data (which is cheap to obtain) can be used to predict very accurate structures with little or no simulation.


Worth noting that inference in "traditional" statistics and ML/AI/DL isn't really that different at some level. In both cases you have an inverse problem; in one case the parameters are about a group or population (e.g., something about all cats in existence), and in another it is about an individual case (something about a particular cat).


This sounds really fascinating. Are there any resources that you'd recommend for someone who's starting out in learning all this? I'm a complete beginner when it comes to Machine Learning.


Deep Learning with Python (2nd ed), by Francois Chollet.

Even if you skip the programming parts, it has a lot of beginner/intermediate concepts clearly explained. If you do dive into the programming examples, you get to play around with a few architectures and ideas, and you're left ready to dive into the more advanced material knowing what you're doing.


Thanks for the explanation, really succinct. Do you recommend any good backpropagation tutorials for an EE undergrad?


Love the UI and the simple design. All the best with the product!


I believe there is some form of memory bias or bias in reporting by the media. If you look at the record books, you will find that most records are held by Sherpas. For example, more than 70 records for Mount Everest are held by Sherpas (or other Nepalese climbers) [1], far more than climbers from any other country.

Apa Sherpa, one of the most prolific Sherpas, holds many records [2]. Interesting fact: he was the Sherpa for Peter Hillary (Edmund's son).

[1] https://en.wikipedia.org/wiki/List_of_Mount_Everest_records

[2] https://en.wikipedia.org/wiki/Apa_Sherpa


Graham's Number, defined using Knuth's up-arrow notation (starting from 3 ↑↑↑↑ 3), is named after Ron Graham.

[1] https://plus.maths.org/content/too-big-write-not-too-big-gra...

[2] https://en.wikipedia.org/wiki/Graham%27s_number?oldformat=tr...

[3] And here is the man himself describing it https://www.youtube.com/watch?v=GuigptwlVHo


3 ↑↑↑↑ 3 = g1, Graham's Number is g64. See the article for details.

I'd write it out, but this margin is too narrow.
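For the curious, the up-arrow notation used above is easy to sketch in code (a toy recursive definition of my own; anything past two arrows is utterly infeasible to evaluate):

```python
def up(a, n, b):
    """Compute a ↑^n b, i.e. a followed by n Knuth up-arrows and then b."""
    if n == 1:
        return a ** b        # one arrow is ordinary exponentiation
    if b == 0:
        return 1             # base case for iterated application
    return up(a, n - 1, up(a, n, b - 1))  # each arrow iterates the previous op

print(up(3, 1, 3))  # 3 ↑ 3  = 27
print(up(3, 2, 3))  # 3 ↑↑ 3 = 3**(3**3) = 7625597484987
# up(3, 4, 3) is g1; g64 is Graham's Number. Don't try to run it.
```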


I find it stupidly fascinating how notation can compress a quantity so large into so few symbols.


[The Matrix Calculus You Need For Deep Learning](https://arxiv.org/pdf/1802.01528.pdf)


One simple way to minimize the impact of these attacks is our work, Pixel Deflection (CVPR 2018 Spotlight). Here is a short (4 min) video introducing the idea: https://youtu.be/VgjOXJ9QKWo



Technically it was heat, but mostly it was due to (i) economics: there was less demand for faster clock speeds, otherwise more research could have gone towards solving the heat problem, and (ii) each CPU cycle became more efficient, with the ability to execute multiple instructions in a single cycle and more efficient instruction sets.

Surprisingly, power consumption also made a huge impact. As tablets and laptops got more popular than desktops, battery life became a major concern, and thus TDP played a major role in research.

Try this fun experiment: Underclock your CPU by half a GHz and see if you notice the difference in your day to day work.


No, it was only because of power density, i.e. too much heat dissipated in a really small area. There is no way to "solve" this issue other than to just throw more cooling at it. And since more cooling = more money, Intel (and friends) went down the multicore route instead.

No amount of R&D spending can bend the laws of physics to overcome the inherent limitations of silicon. I'm sure Intel also looked into alternative semiconductors (e.g., III-V) before giving up on the 10 GHz dream.


Single-thread performance is as important as it has ever been.

That a secretary typing a document or someone who only spends time on Facebook doesn't notice the difference is irrelevant- consider, for example, the massive capital outlay by the financial industry to have servers located as close to the world's trading hubs as possible. If they are willing to pay whatever it takes to shave milliseconds off a round trip, faster CPUs are a part of that equation.


> faster CPUs are a part of that equation.

I think the GP did not dispute that, but pointed out that for CPU speed/throughput, clock speed is only part of it. Adding functional units and allowing the CPU to process more instructions in parallel can have a big impact, as can, e.g., a larger cache, better branch prediction, and so forth.

If you give people faster CPUs, they will cheer and find something to keep them busy. ;-) And for some people, there is no such thing as "fast enough". But for a fairly large share of desktop/mobile users, the CPU is not the limiting factor as much as memory bandwidth and I/O.


I don't disagree with that statement in a general sense. But what earns Intel its money and marketplace dominance? The cheap Celeron/Pentium-class chips sold in bargain laptops & Best Buy specials? Or the high-end, single-thread performance chips?


> Otherwise more research could have gone towards solving heat problem. (ii) Each cycle of CPU was more efficient with ability to execute multiple instructions in a single cycle and with more efficient instruction sets.

Dude, Intel spends something like $80B/yr on R&D. This is closer to hitting fundamental laws of physics barriers.

They killed off their P4 line and developed their mobile line for a reason.


The $80B a year in R&D is off by an order of magnitude.

https://www.fool.com/investing/2017/02/05/intel-corporation-...


Isn't an order of magnitude 10x? According to your link they spent $12.74 billion in 2016.


Not necessarily, it could be e if you're using natural logarithms. Anyway: log10(12.74) = 1.10, log10(80) = 1.90. So, it's a little less than a full order of magnitude, but pretty close.
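For anyone checking the arithmetic (plain Python, using the figures from this thread):

```python
import math

# How far apart are $80B and $12.74B on a base-10 log scale?
gap = math.log10(80) - math.log10(12.74)
print(round(gap, 2))  # -> 0.8, a bit under one full order of magnitude
```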


Although not normally used for smaller amounts it can go in both directions. ~1/10th is still an accurate if archaic use of the term.


Indeed, that's more than Intel earned in total revenues in 2016...


I accidentally underclocked my old CPU (Athlon 651K) to 800 MHz and only found out after about two weeks, when I bought The Vanishing of Ethan Carter. Other than that it was fine, sometimes a little slow, but comfortable.


Recently, I have seen the O-1 being granted pretty generously. A couple of people I know who completed PhDs at mediocre universities, with fewer than 10 papers in total, have all gotten their O-1s. Yes, they had to get reference letters and complete all the formalities, but nothing stellar or extraordinary was required.

