Anyone who believes that this completes their understanding of automatic differentiation is tricking themselves.
When your graph is a TREE, then everything is very simple, as in this post.
When your graph is instead a more general directed acyclic graph (e.g., x = 5; y = 2x; z = xy), then the IMPLEMENTATION is still very simple, but understanding WHY that implementation works is not as simple (repeat: if you think it’s ‘just the ordinary chain rule’, you are tricking yourself).
One of the earliest descriptions of this was by Paul Werbos. He called the required rule “the chain rule for ordered derivatives”, which he proved by induction from the ordinary chain rule. But it is nevertheless not immediately evident from the ordinary chain rule.
I welcome anyone who believes otherwise to prove me wrong. If you do I will be very happy.
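To make the contrast concrete, here is a toy sketch (my own, in plain Python) of the "very simple implementation" for the DAG above, x = 5; y = 2x; z = xy. The reverse pass clearly gives the right answer (z = 2x^2, so dz/dx = 4x = 20), but justifying why this adjoint accumulation is correct in general is exactly the part that needs the chain rule for ordered derivatives:

# forward pass over the DAG x = 5; y = 2x; z = x*y
x = 5.0
y = 2 * x
z = x * y      # z = 2*x**2, so dz/dx should be 4*x = 20

# reverse pass: walk the graph backwards, accumulating adjoints
dz = 1.0
dy = x * dz    # contribution from z = x * y
dx = y * dz    # x's first use, in z = x * y
dx += 2 * dy   # x's second use, in y = 2 * x: adjoints add
print(dx)      # 20.0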
So, where can one read more about this? The people who built autograd and other frameworks like PyTorch, MXNet, etc. must have learned it in detail somewhere. Where? AFAIK MXNet came out of academia (probably CMU).
Here's what you do: you watch this video by Andrej Karpathy [1] called "Becoming a backprop ninja". Then you pick a function that you like and implement backprop for it (backprop being a different name for reverse-mode automatic differentiation) using just numpy. With some numpy broadcasting, an np.sum, and a few for-loops, you'll start getting a good feel for what's going on.
Then you can go and read this fabulous blog post [2], and if you like what you see, you go to the framework built by its author, called Small Pebble [3]. Despite the name, it's not all that small. If you peruse the code you'll get some appreciation of what it takes to build a solid autodiff library, and if push comes to shove, you'll be able to build one yourself.
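To give a flavor of that numpy exercise (the function here is just my own toy choice, not anything from the video or the blog post): hand-write the backward pass for a tiny linear model with a squared-error loss, then check it against a finite difference.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 samples, 3 features
t = rng.normal(size=(4,))     # targets
w = rng.normal(size=(3,))
b = 0.0

# forward pass
y = X @ w + b                 # broadcasting adds b to every sample
loss = np.sum((y - t) ** 2)

# backward pass, in reverse order of the forward pass
dy = 2 * (y - t)              # d loss / d y
dw = X.T @ dy                 # chain rule through y = X @ w + b
db = np.sum(dy)               # broadcasting forward => np.sum backward

# quick finite-difference check on one weight
eps = 1e-6
w2 = w.copy()
w2[0] += eps
print(dw[0], (np.sum((X @ w2 + b - t) ** 2) - loss) / eps)  # should be close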
I don't have a great answer. Most modern descriptions are shallow and/or unclear. My favorite discussions were actually in Werbos's original papers.
A nice overview is "Backpropagation through time: what it does and how to do it" (1990). The rule itself is stated very clearly there, but without proof. The proof can be found in "Maximizing long-term gas industry profits in two minutes in Lotus using neural network methods" (1989), which I believe was copied over from his earlier thesis, of which I could never find a copy.
To be honest, I never quite get what people want with all that business, and I wonder if it is because the implied abstraction ("ordered derivatives") is not ideal.
If we follow the ordinary chain rule (for a single coordinate if you want) through the edges of the computational (DAG) graph, we get the right thing in each step.
The only other rule you need is that "if you use one variable several times in a calculation (i.e. several edges from (forward) / to (backward) the same node), you need to add the gradients computed for each use", but IMHO that is pretty basic and intuitive, too. (So if you plug z in for both x and y of f(x, y), you get d/dz f(z, z) = f_x(z, z) + f_y(z, z), where the subscripts denote partial derivatives.)
To me this seems both mathematically simpler than mixing the two into a "more than chain rule" thing and closer to what is actually going on algorithmically in a given implementation (the one I'm most familiar with is probably PyTorch's).
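A quick check of that "add the gradients for each use" rule, using PyTorch only because that's the implementation I mentioned (any framework, or a derivative by hand, would do):

import torch

def f(x, y):
    return x * torch.sin(y)                    # f_x = sin(y), f_y = x*cos(y)

z = torch.tensor(0.7, requires_grad=True)
f(z, z).backward()                             # plug z in for both arguments

with torch.no_grad():
    manual = torch.sin(z) + z * torch.cos(z)   # f_x(z, z) + f_y(z, z)
print(z.grad.item(), manual.item())            # the two agree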
But the chain rule for ordered derivatives is exactly the backprop rule. It's just the mathematical representation of 'the simple implementation' I mentioned.
I think what you're saying is that you find the process intuitive. I don't have much of a way to argue with that. But I think it's important to note that we're dealing with two things: 1. a process that we follow (backprop), 2. a true answer that is obtainable using only the chain rule. And yes it turns out that (1) and (2) both give the same answer. But (2) requires much more work, and I question anyone who claims that (1) is 'obvious' from (2): getting (1) from (2) requires work.
I'm guessing you'll agree that using only the chain rule takes much more work, but in case you don't: consider a fully connected graph with at least 5 variables, say a = 5; b = 2 a; c = 2 a b; d = 2 a b c; e = 2 a b c d. If you use backprop, you can compute de/da rapidly. If you use only the chain rule, it will take a long time to compute de/da, because the number of terms you have to deal with increases exponentially fast with the number of variables.
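For the record, here is that five-variable example run through PyTorch's reverse mode (my choice of framework): one backward sweep gives de/da directly, whereas expanding the chain rule symbolically means summing over every path from a to e.

import torch

a = torch.tensor(5.0, requires_grad=True)
b = 2 * a
c = 2 * a * b
d = 2 * a * b * c
e = 2 * a * b * c * d      # collapses to e = 256 * a**8

e.backward()               # one reverse sweep over the graph
print(a.grad.item())       # 2048 * a**7 = 160000000.0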
OC's point is that the chain rule for partial derivatives shouldn't just be assumed to hold because the ordinary chain rule does; there's more depth to it than that, and the proof is harder than you might instinctively expect from the ordinary chain rule.
It's epistemically acceptable to understand these both as "the chain rule" once we're satisfied they've both been proved, and apply liberal amounts of synecdoche from there (and I don't think OC disagrees with you on that).
Actually by 'ordinary chain rule' I am referring to what you're referring to as 'the chain rule for partial derivatives'. It seems like backprop follows very quickly even from that, but it does not.
I agree. That's what I'm referring to as 'the ordinary chain rule'.
> so it's still technically just chain rule
No. Go try to derive backprop for general DAGs using only the chain rule. If you complete the proof, then you will agree that the proof was more elaborate than you ever expected.
Many compsci people have been captivated by it and have written introductions trying to put the technique into a wider perspective. Here is mine, including a "poor man's variant" of automatic differentiation which does without operator overloading and uses complex numbers instead:
At the time I worked in Machine Learning (94-95) I was unaware of AD, and my professor, who built the objective function, also determined its analytic derivative manually. I didn't learn about AD until a few years ago and was amazed, because I spent much of the late '90s learning enough Mathematica to make my own analytic derivatives.
I think this goes back to "The complex-step derivative approximation" from 2003 by J. Martins, P. Sturdza and J. Alonso. [0] That paper is a great read!
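For anyone curious, the complex-step trick is only a few lines (a hedged sketch; the function and step size here are just illustrative): for an analytic f, f(x + ih) ≈ f(x) + ih·f'(x), so Im(f(x + ih))/h recovers f'(x) with no subtraction and hence no cancellation error.

import numpy as np

def complex_step_derivative(f, x, h=1e-20):
    # imaginary part of f evaluated a tiny step off the real axis
    return np.imag(f(x + 1j * h)) / h

f = lambda x: np.exp(x) * np.sin(x)
print(complex_step_derivative(f, 1.0))            # complex-step estimate
print(np.exp(1.0) * (np.sin(1.0) + np.cos(1.0)))  # analytic derivative, for comparison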
Autodiff computes a derivative by examining a computational graph (either up-front all at once, or implicitly by examining each computation) and producing a new graph. The person defines the forward pass (graph), and the computer figures out the backward pass.
Backprop is what happens when you tell the programmer to do the thing autodiff is doing. You examine the computational graph, write down all the local changes that autodiff would do to compute the derivative, and that new code (that you hand-wrote rather than letting a machine generate) is a function computing the derivative by backpropagating error terms through each edge in that computational graph.
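To illustrate the machine's side of that split, here is a minimal, hedged sketch of a scalar reverse-mode autodiff class (micrograd-style; the names are my own): the forward pass records the graph as it runs, and backward() replays it in reverse, applying the same local rules you would otherwise write by hand.

class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad                # d(a+b)/da = 1
            other.grad += out.grad               # d(a+b)/db = 1
        out._backward = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = backward_fn
        return out

    def backward(self):
        # topologically order the recorded graph, then sweep it in reverse
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x = Value(5.0)
y = Value(2.0) * x
z = x * y
z.backward()
print(x.grad)   # 20.0 (z = 2*x**2, so dz/dx = 4*x)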
Very similar to techniques used in Knowledge Based Engineering systems, where the term "dependency tracking" is used. Together with caching of nodes/tensors this reduces calculations, especially useful for large parametric 3D models.
When getting a value, it recursively walks the dependency tree to find which variables have changed and recalculates only those as needed. Custom Python objects and properties with __set__ and __get__ methods make this a built-in feature of an object-oriented model (a minimal sketch of such a class follows the example below).
x = Tensor(3)
y = Tensor(5)
z = x + y
print(x, y) # 3, 5
print(z) # 8
x.value = 4 # when setting value nothing is recalculated
print(z) # 9 since getting value triggers recalculation of dependencies that have changed
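Here is a minimal sketch of a class with that behavior (my own guess at an implementation, using version counters rather than whatever real KBE systems do):

class Tensor:
    def __init__(self, value=None, parents=(), op=None):
        self._value = value
        self.parents = parents        # upstream Tensors this node depends on
        self.op = op                  # how to recompute this node from its parents
        self.version = 0              # bumped whenever this node's value changes
        self._seen_versions = None    # parent versions at the last recalculation

    def __add__(self, other):
        return Tensor(parents=(self, other), op=lambda a, b: a + b)

    @property
    def value(self):
        if self.op is not None:
            # reading the parents first lets them refresh themselves recursively
            parent_values = tuple(p.value for p in self.parents)
            current = tuple(p.version for p in self.parents)
            if current != self._seen_versions:      # some dependency changed
                self._value = self.op(*parent_values)
                self._seen_versions = current
                self.version += 1
        return self._value

    @value.setter
    def value(self, new_value):
        # setting a value recalculates nothing; it only marks this node as changed
        self._value = new_value
        self.version += 1

x = Tensor(3)
y = Tensor(5)
z = x + y
print(x.value, y.value)   # 3 5
print(z.value)            # 8
x.value = 4               # nothing is recalculated here
print(z.value)            # 9, recomputed only because a dependency changed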
You are probably thinking of forward-mode autodiff (more useful when the output dimensionality of your function is large relative to the input dimensionality), which is different from backward-mode (reverse-mode) autodiff (more useful when the output dimensionality is small relative to the input). Both will work, but one will be more efficient than the other, depending on the context. For things like training neural networks, people tend to use reverse mode, since you usually want to optimize a single loss output with respect to many parameters.
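For anyone who hasn't seen forward mode, here is a minimal dual-number sketch (a toy of my own, not how any production framework does it): each value carries its derivative with respect to one chosen input, and the chain rule is applied as the computation runs forward.

from dataclasses import dataclass

@dataclass
class Dual:
    val: float   # value
    dot: float   # derivative w.r.t. the chosen input (the "tangent")

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule, applied on the fly
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

def f(x, y):
    return x * x * y + y

# df/dx at (x, y) = (3, 4): seed x with tangent 1 and y with tangent 0
x = Dual(3.0, 1.0)
y = Dual(4.0, 0.0)
print(f(x, y).dot)   # 2*x*y = 24.0

One forward pass like this gives the derivative with respect to a single input for every output, which is why the relative input/output dimensions decide which mode is cheaper.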
I wish people would just call (or at least describe) autodiff as the numerical chain rule, because that's quite literally all it is (plus a few tricks for not explicitly computing the Jacobian for certain operations); it's a lot clearer that way.
The version described here and used most frequently in backpropagation implementations, "autodiff", is reverse-mode AD, but forward-mode exists, as does a spectrum of strategies between the two extremes. Sure, it all comes down to the chain rule, but at an algorithmic level the choice is not at all a trivial one.
In fact, if asked to use the chain rule to propagate gradients through a computation graph, I suspect most people would intuitively default to the forward mode. (I would!)
Given this, it seems useful to use the term to denote a particular method of accumulating the gradients as one traverses the expressions provided by the chain rule.
Technically incorrect! A numerical chain rule would mean the method of finite differences, which accumulates error as you move through the computation; autodiff doesn't do that.
The point, as the neighboring comment says, is that the implementation really matters and is worthy of study. It's fine to say that autodiff is a family of methods for implementing the chain rule, but incorrect to say that it's 'just' the numerical chain rule.
Everyone studies the chain rule in school, so it's definitely clearer, especially if you look at the amount of unnecessary complexity and convolutedness in many autodiff explanations (a substantial amount for such a simple subject).
What is the reasoning behind calling the class a tensor? Is there some way to think of the represented expression or its derivative as a tensor? Or just because scalars are tensors and this could be extended to support other tensor types?
Tensors are objects that combine in a certain way and act on vectors in a certain way; they roughly represent a collection of linear transformations. One way to represent a tensor is as a matrix, after fixing a basis.
Machine learning people use "tensor" to just mean an N-dimensional array of numbers. The term is divorced from its meaning in Physics and Mathematics, which caused me some confusion when I started looking at machine learning papers, coming from physics.
I've been meaning to learn more about this for a while, but haven't made time for it because I had assumed it would be more complicated than this. It's surprisingly accessible. As a haskell person, the implementation here brings to mind a free monad (https://serokell.io/blog/introduction-to-free-monads) with a fairly trivial interpreter.
Oh, this is interesting. I thought the title referred to symbolic differentiation (that's what you learnt in school, but in a computer). I didn't realise automatic differentiation was a different thing.