Anyone who believes that this completes their understanding of automatic differentiation is tricking themselves.
When your graph is a TREE, then everything is very simple, as in this post.
When your graph is instead a more general directed acyclic graph (e.g., x = 5; y = 2x; z = xy), then the IMPLEMENTATION is still very simple, but understanding WHY that implementation works is not as simple (repeat: if you think it’s ‘just the ordinary chain rule’, you are tricking yourself).
One of the earliest descriptions of this was by Paul Werbos. He called the required rule “the chain rule for ordered derivatives”, which he proved by induction from the ordinary chain rule. But it is nevertheless not immediately evident from the ordinary chain rule.
I welcome anyone who believes otherwise to prove me wrong. If you do I will be very happy.
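To make the contrast concrete, here is a toy sketch (my own, in plain Python) of the "very simple implementation" for the DAG above, x = 5; y = 2x; z = xy. The reverse pass clearly gives the right answer (z = 2x^2, so dz/dx = 4x = 20), but justifying why this adjoint accumulation is correct in general is exactly the part that needs the chain rule for ordered derivatives:

# forward pass over the DAG x = 5; y = 2x; z = x*y
x = 5.0
y = 2 * x
z = x * y      # z = 2*x**2, so dz/dx should be 4*x = 20

# reverse pass: walk the graph backwards, accumulating adjoints
dz = 1.0
dy = x * dz    # contribution from z = x * y
dx = y * dz    # x's first use, in z = x * y
dx += 2 * dy   # x's second use, in y = 2 * x: adjoints add
print(dx)      # 20.0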
So, where can one read more about this? The people who built autograd and other frameworks like PyTorch, MXNet, etc. must have learned it in detail somewhere. Where? AFAIK MXNet came out of academia (probably CMU).
Here's what you do: you watch this video by Andrej Karpathy [1] called "Becoming a backprop ninja". Then you pick a function that you like and implement backprop for it (backprop being a different name for reverse-mode automatic differentiation) using just numpy. With some numpy broadcasting, an np.sum, and a few for-loops, you'll start getting a good feel for what's going on.
Then you can go and read this fabulous blog post [2], and if you like what you see, you go to the framework built by its author, called Small Pebble [3]. Despite the name, it's not all that small. If you peruse the code you'll get some appreciation of what it takes to build a solid autodiff library, and if push comes to shove, you'll be able to build one yourself.
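To give a flavor of that numpy exercise (the function here is just my own toy choice, not anything from the video or the blog post): hand-write the backward pass for a tiny linear model with a squared-error loss, then check it against a finite difference.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 samples, 3 features
t = rng.normal(size=(4,))     # targets
w = rng.normal(size=(3,))
b = 0.0

# forward pass
y = X @ w + b                 # broadcasting adds b to every sample
loss = np.sum((y - t) ** 2)

# backward pass, in reverse order of the forward pass
dy = 2 * (y - t)              # d loss / d y
dw = X.T @ dy                 # chain rule through y = X @ w + b
db = np.sum(dy)               # broadcasting forward => np.sum backward

# quick finite-difference check on one weight
eps = 1e-6
w2 = w.copy()
w2[0] += eps
print(dw[0], (np.sum((X @ w2 + b - t) ** 2) - loss) / eps)  # should be close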
I don't have a great answer. Most modern descriptions are shallow and/or unclear. My favorite discussions were actually in Werbos's original papers.
A nice overview is "Backpropagation through time: what it does and how to do it" (1990). The rule itself is stated very clearly there, but without proof. The proof can be found in "Maximizing long-term gas industry profits in two minutes in Lotus using neural network methods" (1989), which I believe was copied over from his earlier thesis, of which I could never find a copy.
To be honest, I never quite get what people want with all that business, and I wonder if it is because the implied abstraction ("ordered derivatives") is not ideal.
If we follow the ordinary chain rule (for a single coordinate if you want) through the edges of the computational (DAG) graph, we get the right thing in each step.
The only other rule you need is that "if you use one variable several times in a calculation (i.e. several edges from (forward) / to (backward) the same node), you need to add the gradients computed for each use", but IMHO that is pretty basic and intuitive, too. (So if you plug z in for both x and y of f(x, y), you get d/dz f(z, z) = f_x(z, z) + f_y(z, z), where the subscripts denote partial derivatives.)
To me this seems both mathematically simpler than mixing the two into a "more than chain rule" thing and closer to what is actually going on algorithmically in a given implementation (the one I'm most familiar with is probably PyTorch's).
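A quick check of that "add the gradients for each use" rule, using PyTorch only because that's the implementation I mentioned (any framework, or a derivative by hand, would do):

import torch

def f(x, y):
    return x * torch.sin(y)                    # f_x = sin(y), f_y = x*cos(y)

z = torch.tensor(0.7, requires_grad=True)
f(z, z).backward()                             # plug z in for both arguments

with torch.no_grad():
    manual = torch.sin(z) + z * torch.cos(z)   # f_x(z, z) + f_y(z, z)
print(z.grad.item(), manual.item())            # the two agree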
But the chain rule for ordered derivatives is exactly the backprop rule. It's just the mathematical representation of 'the simple implementation' I mentioned.
I think what you're saying is that you find the process intuitive. I don't have much of a way to argue with that. But I think it's important to note that we're dealing with two things: 1. a process that we follow (backprop), 2. a true answer that is obtainable using only the chain rule. And yes it turns out that (1) and (2) both give the same answer. But (2) requires much more work, and I question anyone who claims that (1) is 'obvious' from (2): getting (1) from (2) requires work.
I'm guessing you'll agree that using only the chain rule takes much more work, but in case you don't: consider a fully connected graph with at least 5 variables, say a = 5; b = 2 a; c = 2 a b; d = 2 a b c; e = 2 a b c d. If you use backprop, you can compute de/da rapidly. If you use only the chain rule, it will take a long time to compute de/da, because the number of terms you have to deal with increases exponentially fast with the number of variables.
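For the record, here is that five-variable example run through PyTorch's reverse mode (my choice of framework): one backward sweep gives de/da directly, whereas expanding the chain rule symbolically means summing over every path from a to e.

import torch

a = torch.tensor(5.0, requires_grad=True)
b = 2 * a
c = 2 * a * b
d = 2 * a * b * c
e = 2 * a * b * c * d      # collapses to e = 256 * a**8

e.backward()               # one reverse sweep over the graph
print(a.grad.item())       # 2048 * a**7 = 160000000.0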
OC's point is that the chain rule for partial derivatives shouldn't just be assumed to hold because the ordinary chain rule does; there's more depth to it than that, and the proof is harder than you might instinctively expect from the ordinary chain rule.
It's epistemically acceptable to understand these both as "the chain rule" once we're satisfied they've both been proved, and apply liberal amounts of synecdoche from there (and I don't think OC disagrees with you on that).
Actually by 'ordinary chain rule' I am referring to what you're referring to as 'the chain rule for partial derivatives'. It seems like backprop follows very quickly even from that, but it does not.
I agree. That's what I'm referring to as 'the ordinary chain rule'.
> so it's still technically just chain rule
No. Go try to derive backprop for general DAGs using only the chain rule. If you complete the proof, then you will agree that the proof was more elaborate than you ever expected.
Many compsci people have been captivated by it and have written introductions trying to put the technique into a wider perspective. Here is mine, including a "poor man's variant" of automatic differentiation which does without operator overloading and uses complex numbers instead:
At the time I worked in Machine Learning (94-95) I was unaware of AD, and my professor, who built the objective function, also determined its analytic derivative manually. I didn't learn about AD until a few years ago and was amazed, because I spent much of the late '90s learning enough Mathematica to make my own analytic derivatives.
I think this goes back to "The complex-step derivative approximation" from 2003 by J. Martins, P. Sturdza and J. Alonso. [0] That paper is a great read!
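For anyone curious, the complex-step trick is only a few lines (a hedged sketch; the function and step size here are just illustrative): for an analytic f, f(x + ih) ≈ f(x) + ih·f'(x), so Im(f(x + ih))/h recovers f'(x) with no subtraction and hence no cancellation error.

import numpy as np

def complex_step_derivative(f, x, h=1e-20):
    # imaginary part of f evaluated a tiny step off the real axis
    return np.imag(f(x + 1j * h)) / h

f = lambda x: np.exp(x) * np.sin(x)
print(complex_step_derivative(f, 1.0))            # complex-step estimate
print(np.exp(1.0) * (np.sin(1.0) + np.cos(1.0)))  # analytic derivative, for comparison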
Autodiff computes a derivative by examining a computational graph (either up-front all at once, or implicitly by examining each computation) and producing a new graph. The person defines the forward pass (graph), and the computer figures out the backward pass.
Backprop is what happens when you tell the programmer to do the thing autodiff is doing. You examine the computational graph, write down all the local changes that autodiff would do to compute the derivative, and that new code (that you hand-wrote rather than letting a machine generate) is a function computing the derivative by backpropagating error terms through each edge in that computational graph.
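To illustrate the machine's side of that split, here is a minimal, hedged sketch of a scalar reverse-mode autodiff class (micrograd-style; the names are my own): the forward pass records the graph as it runs, and backward() replays it in reverse, applying the same local rules you would otherwise write by hand.

class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad                # d(a+b)/da = 1
            other.grad += out.grad               # d(a+b)/db = 1
        out._backward = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = backward_fn
        return out

    def backward(self):
        # topologically order the recorded graph, then sweep it in reverse
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x = Value(5.0)
y = Value(2.0) * x
z = x * y
z.backward()
print(x.grad)   # 20.0 (z = 2*x**2, so dz/dx = 4*x)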
Very similar to techniques used in Knowledge Based Engineering systems, where the term "dependency tracking" is used. Together with caching of nodes/tensors this reduces calculations, especially useful for large parametric 3D models.
When getting a value, it recursively walks the dependency tree to find which variables have changed and recalculates only those as needed. Custom Python objects and properties with __set__ and __get__ methods make this a built-in feature of an object-oriented model (a minimal sketch of such a class follows the example below).
x = Tensor(3)
y = Tensor(5)
z = x + y
print(x, y) # 3, 5
print(z) # 8
x.value = 4 # when setting value nothing is recalculated
print(z) # 9 since getting value triggers recalculation of dependencies that have changed
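Here is a minimal sketch of a class with that behavior (my own guess at an implementation, using version counters rather than whatever real KBE systems do):

class Tensor:
    def __init__(self, value=None, parents=(), op=None):
        self._value = value
        self.parents = parents        # upstream Tensors this node depends on
        self.op = op                  # how to recompute this node from its parents
        self.version = 0              # bumped whenever this node's value changes
        self._seen_versions = None    # parent versions at the last recalculation

    def __add__(self, other):
        return Tensor(parents=(self, other), op=lambda a, b: a + b)

    @property
    def value(self):
        if self.op is not None:
            # reading the parents first lets them refresh themselves recursively
            parent_values = tuple(p.value for p in self.parents)
            current = tuple(p.version for p in self.parents)
            if current != self._seen_versions:      # some dependency changed
                self._value = self.op(*parent_values)
                self._seen_versions = current
                self.version += 1
        return self._value

    @value.setter
    def value(self, new_value):
        # setting a value recalculates nothing; it only marks this node as changed
        self._value = new_value
        self.version += 1

x = Tensor(3)
y = Tensor(5)
z = x + y
print(x.value, y.value)   # 3 5
print(z.value)            # 8
x.value = 4               # nothing is recalculated here
print(z.value)            # 9, recomputed only because a dependency changed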
You are probably thinking of forward-mode autodiff (more useful when the output dimensionality of your function is large relative to the input dimensionality), which is different from backward-mode (reverse-mode) autodiff (more useful when the output dimensionality is small relative to the input). Both will work, but one will be more efficient than the other, depending on the context. For things like training neural networks, people tend to use reverse mode, since you usually want to optimize a single loss output with respect to many parameters.
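For anyone who hasn't seen forward mode, here is a minimal dual-number sketch (a toy of my own, not how any production framework does it): each value carries its derivative with respect to one chosen input, and the chain rule is applied as the computation runs forward.

from dataclasses import dataclass

@dataclass
class Dual:
    val: float   # value
    dot: float   # derivative w.r.t. the chosen input (the "tangent")

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule, applied on the fly
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

def f(x, y):
    return x * x * y + y

# df/dx at (x, y) = (3, 4): seed x with tangent 1 and y with tangent 0
x = Dual(3.0, 1.0)
y = Dual(4.0, 0.0)
print(f(x, y).dot)   # 2*x*y = 24.0

One forward pass like this gives the derivative with respect to a single input for every output, which is why the relative input/output dimensions decide which mode is cheaper.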
I wish people would just call (or at least describe) autodiff as the numerical chain rule, because that's quite literally all it is (plus a few tricks for not explicitly computing the Jacobian for certain operations); it's a lot clearer that way.
The version described here and used most frequently in backpropagation implementations, "autodiff", is reverse-mode AD, but forward-mode exists, as does a spectrum of strategies between the two extremes. Sure, it all comes down to the chain rule, but at an algorithmic level the choice is not at all a trivial one.
In fact, if asked to use the chain rule to propagate gradients through a computation graph, I suspect most people would intuitively default to the forward mode. (I would!)
Given this, it seems useful to use the term to denote a particular method of accumulating the gradients as one traverses the expressions provided by the chain rule.
Technically incorrect! A numerical chain rule would mean the method of finite differences, which accumulates error as you move through the computation; autodiff doesn't do that.
The point, as the neighboring comment says, is that the implementation really matters and is worthy of study. It's fine to say that autodiff is a family of methods for implementing the chain rule, but incorrect to say that it's 'just' the numerical chain rule.
Everyone studies the chain rule in school, so it's definitely clearer, especially if you look at the amount of unnecessary complexity and convolutedness in many autodiff explanations (a substantial amount for such a simple subject).
What is the reasoning behind calling the class a tensor? Is there some way to think of the represented expression or its derivative as a tensor? Or just because scalars are tensors and this could be extended to support other tensor types?
Tensors are objects that combine in a certain way and act on vectors in a certain way; they roughly represent a collection of linear transformations. One way to represent a tensor is as a matrix, after fixing a basis.
Machine learning people use "tensor" to just mean an N-dimensional array of numbers. The term is divorced from its meaning in Physics and Mathematics, which caused me some confusion when I started looking at machine learning papers, coming from physics.
I've been meaning to learn more about this for a while, but haven't made time for it because I had assumed it would be more complicated than this. It's surprisingly accessible. As a haskell person, the implementation here brings to mind a free monad (https://serokell.io/blog/introduction-to-free-monads) with a fairly trivial interpreter.
Oh, this is interesting. I thought the title referred to symbolic differentiation (that's what you learnt in school, but in a computer). I didn't realise automatic differentiation was a different thing.