This looks like it's going to be a really big deal unless they totally fuck it up. Autodiff greatly simplifies all kinds of continuous mathematical optimization algorithms, and using mathematical optimization means you only have to write code that can recognize things that look similar to a solution instead of code that can find the solution — in a sense, mathematical optimization derives your code from tests.
Autodiff is useful for other kinds of things too, not just mathematical optimization, as they mention in the manifesto. This weekend I wrote a real-time SDF raymarcher in a page of Lua (https://gitlab.com/kragen/bubbleos/blob/master/yeso/sdf.png https://gitlab.com/kragen/bubbleos/blob/master/yeso/sdf.lua), but the surface shading is pretty lame because without sampling the SDF multiple times it doesn't yet have any way to calculate surface normals. Guess what a surface normal is? It's the gradient of the SDF, and reverse-mode autodiff can sample the SDF and its gradient in only double the time of just sampling the SDF.
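As a loose illustration (a from-scratch Python sketch, not the Lua code at the link; `Var` and `sphere_sdf` are invented names), here is reverse-mode autodiff recovering a sphere's surface normal from one forward SDF evaluation plus one backward sweep:

```python
# Minimal reverse-mode autodiff: enough to get a surface normal
# from a sphere SDF with a single extra backward pass.
import math

class Var:
    def __init__(self, value, parents=()):
        self.value = value      # forward value
        self.grad = 0.0         # accumulated adjoint
        self.parents = parents  # pairs of (parent Var, local derivative)

    def __add__(self, o):
        o = o if isinstance(o, Var) else Var(o)
        return Var(self.value + o.value, ((self, 1.0), (o, 1.0)))

    def __sub__(self, o):
        o = o if isinstance(o, Var) else Var(o)
        return Var(self.value - o.value, ((self, 1.0), (o, -1.0)))

    def __mul__(self, o):
        o = o if isinstance(o, Var) else Var(o)
        return Var(self.value * o.value, ((self, o.value), (o, self.value)))

    def sqrt(self):
        r = math.sqrt(self.value)
        return Var(r, ((self, 0.5 / r),))

    def backward(self, seed=1.0):
        # reverse sweep: push adjoints back through the recorded graph
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

def sphere_sdf(x, y, z, radius=1.0):
    return (x * x + y * y + z * z).sqrt() - radius

# One forward pass records the graph; one backward pass yields the
# gradient, i.e. the (unnormalized) surface normal at the point.
x, y, z = Var(3.0), Var(0.0), Var(4.0)
d = sphere_sdf(x, y, z)
d.backward()
print(d.value)                 # distance: 4.0
print(x.grad, y.grad, z.grad)  # normal: approximately (0.6, 0.0, 0.8)
```

A real implementation would topologically sort the graph instead of recursing naively, but the point stands: one forward plus one reverse sweep gives you both the distance and the normal.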
Pretty much anything you do in continuous domains can get simpler and/or better-performing with autodiff. This is why I've written so much about it in Dercuano.
It's worth pointing out that you don't need to redesign your language to get first-class differentiable programming. Source-to-source transformation has been used to do autodiff in FORTRAN for at least ten years, including on pretty massive applications. (This is mentioned in the manifesto.)
Conal Elliott (the guy who co-invented FRP) did a nice project on shading computational surfaces by extracting derivatives from code more than a decade ago: http://conal.net/Vertigo/
He is now very active in the differentiable programming community, but he got started on the problem in the graphics field.
He has been working on that for a long time. Though much of his earlier work is symbolic differentiation, I’m not very familiar with the more recent AD stuff.
Let me mostly summarize Conal’s “simple” reverse AD: he drops the tape into the stack. This has two impacts: the stack is “about twice as long”; and, you “may (will?) lose tail recursion”.
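My loose reading of the construction, sketched in Python (this is my own toy rendition, not Conal's code): each primitive returns a (value, pullback) pair, and composition nests the pullbacks as closures, so the reverse pass rides on the call/closure structure rather than on an explicit tape — which is roughly why the stack doubles and tail calls are lost:

```python
# "Tape in the stack" reverse AD sketch: every primitive returns
# (value, pullback).  Composition nests pullbacks as closures, so the
# reverse sweep lives in the (roughly doubled) call structure instead
# of an explicit side tape, and the chain of pullbacks defeats tail calls.
import math

def d_sin(x):
    return math.sin(x), lambda dy: dy * math.cos(x)

def d_square(x):
    return x * x, lambda dy: dy * 2.0 * x

def compose(df, dg):
    # d(f . g): run g then f forward; the returned pullback runs them in reverse
    def h(x):
        y, pb_g = dg(x)
        z, pb_f = df(y)
        return z, lambda dz: pb_g(pb_f(dz))
    return h

# d/dx sin(x^2) = cos(x^2) * 2x
f = compose(d_sin, d_square)
value, pullback = f(1.5)
print(value)          # sin(2.25)
print(pullback(1.0))  # cos(2.25) * 3.0
```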
It is still a really good paper, and no doubt a lot of the subtlety was lost on me.
Well, successive-approximation mathematical optimization algorithms are sort of inherently anytime, in a way: you can stop after any integer number of iterations and use the approximate solution. Some of them fail to be anytime in practice, like SOR (“successive over-relaxation”) where you usually quit after like six iterations but each iteration takes a long time. And autodiff makes it a lot easier to apply mathematical optimization, in particular variants of gradient descent, one popular use of which is training deep ANNs.
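A toy sketch of that anytime property (everything here invented for illustration): gradient descent on a quadratic can be cut off after any number of iterations, and whatever iterate you have is a usable, steadily improving answer:

```python
# Gradient descent is "anytime": stop after any k steps and the
# current iterate is an approximate solution that improves with k.
def f(x):       # toy objective with its minimum at x = 3
    return (x - 3.0) ** 2

def grad_f(x):  # derivative of f (autodiff would supply this for free)
    return 2.0 * (x - 3.0)

def descend(x0, steps, lr=0.1):
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

# Each iteration budget yields an answer; more budget yields a better one.
for k in (1, 5, 50):
    print(k, descend(0.0, k))
```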
Is that what you meant, or are you talking about something else?
I think AD can be used to train a DNN; the DNN in turn can be used to power an anytime algorithm, so that you can have systems that not only respond on different time scales but are also robust in the face of sensor noise, adversarial sensors, or broken actuators.
I like differentiable programming a lot, but can anyone explain the motivation for Swift as their language of choice for this? I assume it was simply because they wanted to get Chris Lattner, and Swift was his baby. Julia seems like a much better target for this; I wish they had just dedicated some money or resources to that. I don't have a problem with Swift's language design per se (although Julia's dynamic nature really lends itself more naturally to the sort of problems you might be trying to solve using differentiable programming, in my opinion), but its cross-platform support is terrible. Why would you choose Mac's boutique language as the foundation of your new paradigm?
I think it is a bit unfair to say “I assume it was simply because they wanted to get Chris Lattner, and Swift was his baby”. Swift is an interesting programming language in its own right – and I am saying that as someone coding pretty much exclusively in Julia – and as I have stated before “If you are sitting on a team deeply familiar and passionate about a language – Swift – what kind of managerial fool would not let them take a stab at it? Especially with Lattner’s excellent track record” [1].
Only yesterday I gave a lecture to my cohort of MSc students on precisely this topic; there is history going back to the 60s, implementations alive since the 80s, and so much development over the last five years. As someone who cares deeply about the science (and as an engineer at heart), I simply cannot be too sad that Google picked up Lattner et al., poured money over them, and asked them to push the envelope of what is possible with differentiable programming. After all, what will stop us from lifting those advances over to Julia et al. later down the line? Sure, I can wish that it had been Bezanson et al. and not Lattner et al. that got the money poured over them, but that feels very petty. Let the “best” language win, and I still feel the same way that I felt in 2018 when Swift first planted its flag: “Swift has a non-existing scientific computing community […] they will have to build it entirely from scratch and community building is difficult. […] My decision to side with Julia is partially to stay my own course, partially a preference for ‘the bazaar’ development model, and partially because I have a hunch that Julia has a better chance to capture the scientific computing community as a whole which is likely to yield benefits down the line”.
I disagree, I don't think it's unfair to say at all. It is simply mismanagement to input resources into a language rework that is dead on arrival, community wise. This is of course par for the course for big tech research orgs where big names get a lot of free rein, but that does not mean it's not a strategic mistake here.
This is simply about focus as an org, and this is the reason why PyTorch is getting so popular.
There seems to be a massive lack of focus and direction in the TF org, too many egos wanting to put their stamp all over the APIs and subsystems (tf.keras anyone?).
TensorFlow eager with autograph or Pytorch solve all differentiation problems as far as researchers and practitioners are concerned.
> It is simply mismanagement to input resources into a language rework that is dead on arrival, community wise
how exactly might other features have a community of users prior to the feature being implemented?
> TensorFlow eager with autograph or Pytorch solve all differentiation problems as far as researchers and practitioners are concerned
I think this is a pretty narrow view of the world. From autograd to Stan to the cornucopia of implementations in Julia it's worth considering not everyone's going to be able to shoehorn their problem into the TF/PyTorch way of doing things.
First of all, excellent points, and thank you for the perspective on the direction of TensorFlow development. I usually settle for “It is opaque”, because that is as much as I know at this stage. Looking at all of your points, I agree with almost all of them. I would even add that I think a major reason for TensorFlow’s initial success was the fact that the machine learning community at large preferred (and was more familiar with) the overall Python ecosystem over the lackluster one in Lua land. However, I still feel that it is unfair to call Swift “dead on arrival”, as I can imagine a future where its ecosystem becomes superior to that of Python – I would bet against it, as my comment history would suggest, but I can imagine it with some reasonably-larger-than-zero probability.
Lastly, I would somewhat object to your statement that “TensorFlow eager with [AutoGraph] or [PyTorch] solve all differentiation problems as far as researchers and practitioners are concerned”. Yes, this is a very true statement at this exact point in time. Every single one of my PhD students prefers PyTorch over anything else on the market, and I support them in their decision to pick the tool they see as best to accomplish their goals. However, my experience tells me that once you give researchers more powerful tools to express their models, they will find interesting ways to use that increase in expressive power to push the envelope in terms of what is possible. So, yes, as far as researchers and practitioners are currently concerned, what we have is sufficient, but what about the models of the 2020s? I am not so sure.
Exactly, at least from an outside perspective, why tf?! It seems like people love so much bolting stuff to "classic, conservative" languages that they do it even if it's 10x more work than doing it in a language already having full fledged mature real macros. And this is not Common Lisp we're talking about... Julia is a pretty "normal" language that wouldn't scare anyone off!
(EDIT+: though, as a software engineer primarily, I do have a slight personal preference for more classic single-dispatchy languages like Swift... I imagine that in general it's some kind of Stockholm syndrome we developers experience :P)
Multiple dispatch is a little more than just overloading. In C++, if you have a base class Animal with two derived classes Dog and Cat, and you have an Animal* pointer to a Dog or Cat, then animal->walk() will dispatch to either the Dog or Cat method (single dispatch). But animal->meet(animal) will dispatch to either dog.meet(Animal*) or cat.meet(Animal*), not dog.meet(Dog*) or cat.meet(Cat*) like multiple-dispatch languages do. You need a visitor pattern in C++ (or apparently templates) to get double dispatch:
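(The original snippet isn't reproduced here, but the same visitor trick can be sketched in Python, which, like C++ virtual calls, dispatches only on the receiver; the class and method names below are invented for illustration.)

```python
# Double dispatch via the visitor trick: the first call resolves the
# receiver's concrete type, then bounces to a method that resolves the
# argument's type.  (Python, like C++, dispatches only on the receiver.)
class Animal:
    def meet(self, other):
        raise NotImplementedError

class Dog(Animal):
    def meet(self, other):
        return other.met_by_dog(self)   # second dispatch, now on `other`
    def met_by_dog(self, other):
        return ("Dog", "Dog")
    def met_by_cat(self, other):
        return ("Cat", "Dog")

class Cat(Animal):
    def meet(self, other):
        return other.met_by_cat(self)
    def met_by_dog(self, other):
        return ("Dog", "Cat")
    def met_by_cat(self, other):
        return ("Cat", "Cat")

print(Dog().meet(Cat()))  # ('Dog', 'Cat'): both concrete types recovered
print(Cat().meet(Dog()))  # ('Cat', 'Dog')
```

The extra bounce through the met_by_* methods is exactly the boilerplate that a multiple-dispatch language like Julia makes unnecessary.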
The post is quite in-depth; but just to give a brief take on other potential languages:
8<---
Here’s my personal view of some languages that I’ve used and enjoyed, but all of which have limitations I’ve found frustrating at times:
Python: Slow at runtime, poor support for parallel processing (but very easy to use)
C, C++: hard to use (and C++ is slow at compile time), but fast and (for C++) expressive
Javascript: Unsafe (unless you use Typescript); somewhat slow (but easy to use and flexible)
Julia: Poor support for general purpose programming, but fast and expressive for numeric programming. (Edit: this may be a bit unfair to Julia; it’s come a long way since I last looked at it!)
Java: verbose (but getting better, particularly if you use Kotlin), less flexible (due to JVM issues), somewhat slow (but overall a language that has many useful application areas)
C# and F#: perhaps the fewest compromises of any major programming language, but still requires installation of a runtime, limited flexibility due to garbage collection, and difficulties making code really fast (except on Windows, where you can interface via C++/CLI)
Can’t overstate this enough. Ultimately there is a trade-off being made between end-user applications and numerical computing. Julia is surely superior when it comes to numerical computing. It has a super scientific stack, and multiple dispatch makes it easy to mix and match its pieces.
Of course there is no reason why Julia can’t evolve a similarly good ecosystem for webapps and other end user front ends. It is after all a _general purpose programming language_ which happens to have native numerical support.
Mostly because the end-user application community is not as strong in Julia (though you can create web services/sites and GUIs right now), in the same way scientific/numerical computing/machine learning/HPC isn't as strong in Swift. Perhaps the point he made was that, since his area is ML, he could contribute more by trying to create a community around that in Swift than by bringing the other domains in Julia up to par with those languages by adding another competing library.
And in terms of the language itself, Julia is very much a general-purpose language. Exceptionally general-purpose, as it's basically a Lisp below the surface, so it can not only support any domain, it can be comparatively easily extended to support them better. Differentiable programming is one such example: the compiler was not designed for it, but you can just import this functionality as a library. The focus is still desktop to HPC, though, as opposed to mobile/IoT to desktop (and Apple-focused) like Swift, which makes having both languages support differentiable programming an overall positive over having only one of them.
Reading the link, it seems that Jeremy Howard only tried out Swift because Swift for Tensorflow existed. I got the impression from your comment that using Swift for this purpose was somehow "in-the-air" at the time, and many people independently considered it, but I don't think this is true. My impression is quite the contrary - when they announced Tensorflow support for Swift, I think many people were surprised.
In his lecture, he specifically said he interviewed both the Swift and Julia teams and got a view inside what was happening at Google... and the Swift team's work, development, and vision were well beyond what the Julia team had even envisioned.
Interesting, since the manifesto doesn't mention anything that isn't already completed in Julia. Is their main project or motivation something that is not being shared? I find it quite odd that this manifesto doesn't mention anything that actually requires differentiable programming (adjoint equations for their applications have existed since the 90's), which makes it a very odd justification for this work.
I suspect his comments are about the vision for the whole solution and environment, not just auto-diff, which is just a piece.
Auto-diff extends back 40 years. But it has remained relatively obscure because it has been an "add-on", a meta-programming step, and/or limited to a subset of a language.
The Swift approach of first-class compiler support is fairly unique. It also supports a future view of computing: if many problems can be solved by differentiable programming, not just the typical but somewhat niche NN/ML problems, then it needs to be built into the whole language, not a subset.
But more importantly, future requirements and the death of Moore's Law mean that a fast, deeply statically analyzable language will be required to make the most of a heterogeneous computing environment.
I’m still crawling like a baby with Rust, but I’m quickly getting familiar with low-level language notions while finding the struct/trait system notably more intuitive than Python.
C# and F# don't require the installation of a runtime per se. With the most recent versions of .NET Core, C# and F# applications can be published as completely self-contained applications.
(Yes .NET languages have a runtime, but I don't think that's what the author was referring to).
Julia and Swift are different enough that there is benefit in having both with that functionality. Julia is a dynamic and heavily interactive language (especially as it further improves JIT latency and "time to first plot") for exploratory analysis (like what Python provides) and development without compromising performance at any step, with a strong focus on HPC/distributed computing.
Swift has some interactivity, but its core execution strategy is to create static binaries to be deployed on mobile and desktop, with much better support for high-quality end-user interfaces (especially on Apple devices).
While as a researcher and mostly a Linux developer I will probably not use Swift, I can easily see situations in which Swift would be the clear choice, were both Zygote and Swift for TensorFlow equally mature, even though Julia's scientific computing support is much broader.
Or... as Jeremy Howard said it, having a real compiler with a strongly typed language is "mind blowing" after working in Python.
A compiler isn't in your way; with modern JIT compilation and a language with an actual type system, it's actually a productivity assistant. Shouldn't AI researchers have automation too?
If I remember correctly, there is an FAQ somewhere that says that they chose Swift because they were already familiar with it, and didn't know anything about Julia.
> picked Swift over Julia because Swift has a much larger community, is syntactically closer to Python, and because we were more familiar with its internal implementation details - which allowed us to implement a prototype much faster.
Personally I find it hard to believe that Swift has a larger community of people doing numerical computation, but maybe I'm out of touch.
The community of people using Swift for numerical computation essentially didn't exist prior to their announcement, and after their announcement it seems to only consist of the team developing the framework and Jeremy Howard.
Most likely explanation: they were going to do it in Swift anyway, and they’re doing it out in the open, for others to scrutinise, copy and potentially improve upon. That’s cool, right?
1) A fast general purpose systems language (like C++)
2) With high level semantics and usability (like Python)
3) With a type system from functional programming (like Haskell)
4) With concurrent memory model (like Rust)
5) That uses compile-time analysis for static guarantees
6) That runs everywhere from CLI scripting to ML to app development
7) With a huge user base (millions of devs)
8) And two large corporate backers (Apple & Google)
9) The developers include the originators of Rust and the developers of LLVM & Clang.
That's a killer list. And Julia only addresses 1 & 2.
It is further believed by the core group at Google and Fast.ai (and borne out by history) that the future must be a statically type-checked language to scale beyond the script-level experimentation of Python to large application development, where the lines between ML, application, and system development will blur. It will also be critical to next-gen performance breakthroughs and heterogeneous computing. And the type system provides a fully hackable language for fast experimentation.
Also, I believe it was Richard Wei, not Lattner, who initiated the project with his DLVM thesis... specifically because Swift was such a good fit and is so hackable.
Except it basically only runs on MacOS, which means that it doesn't have strong datacenter support, which means that it won't be used in any serious applications.
It's also not dynamic, which is almost a must-have for exploratory data analysis.
I will avoid a point-by-point rebuttal, mostly to avoid seeming too antagonistic, but Julia supports many of the other features on the list you provided, and (more importantly, for the case at hand) has already proven its capability in the differentiable programming paradigm with Zygote.
You apparently aren't familiar with Swift at all. I'd recommend exploring more before answering.
Swift has had Linux support since it was open-sourced 4 years ago. IBM, AWS, and Google all have very performant server-side Swift packages for all manner of things, from servers to protocols. Our company is using Swift for Linux-only embedded environments.
If you think the future is dynamic, you're not reading the past, or the trajectory of the future, correctly. Dynamic languages keep a compiler from understanding the code.
Compilers are just code-optimization heuristics, like AI for developers. If you make a language and compiler with a strong ontology for the intent of code (a type system), it can break code down into functional proofs and then understand how to optimize it, in ways that aren't possible in a dynamic language that breaks the chain of intent.
Julia is still pretty niche at this point, and just recently got tools as fundamental as a debugger.
I understand quite well the connection between static type systems and proofs, and how compilers work. As pointed out by another user here, an amazing amount of optimization can be done while maintaining the dynamic nature of the code and, in my opinion, while doing data analysis this is an absolute necessity.
As for static typing providing a proof of correctness, yes, many times I have wished that Python had better static analysis so that I didn't have to run my program only to get an error halfway through. On the other hand, when writing an experiment I rarely know at the start what the eventual design will look like. It is rarely worth setting up a fully expressive type system at the onset, so I wouldn't really get many of the correctness guarantees aside from existing types. My experience with mypy/PyCharm has demonstrated that a surprising amount of type inference can be done in dynamic languages, which allows me to catch most errors prior to runtime at this point.
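For instance (a made-up example, not from my actual experiments), a plain type hint is enough for mypy to reject a bad call before anything runs:

```python
# A gradual type hint: mypy checks it statically; Python ignores it at runtime.
def update(weights: list[float], lr: float) -> list[float]:
    """One step of a toy weight-decay update."""
    return [w - lr * w for w in weights]

# mypy flags this line without running the program:
#     update([2.0, 4.0], "0.5")   # error: argument 2 has an incompatible type
print(update([2.0, 4.0], 0.5))    # [1.0, 2.0]
```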
While Swift the language might have support on Linux, the last time I used it (~8 months ago, when this initiative was announced) it was cumbersome at best. See the Swift for Linux guide, whose author, while a fervent supporter of getting Swift working on Linux, admits right in the introduction that it is not great:
http://swift-linux.refi64.com/en/latest/
As mentioned there, even when I got the core language working I was smacked in the face with another of the core issues with Swift: the ecosystem is anemic, at best, on Linux. Almost all of the libraries are MacOS only.
You apparently aren't familiar with dynamic languages at all. I'd recommend exploring more before answering.
In particular, you seem to be conflating being dynamic with being interpreted. Julia is an incredibly dynamic language whose JIT compilation strategy is essentially lazy AOT compilation. As soon as a method is called the first time, that method and all its inferred dependent methods are compiled down to very efficient machine code and run. If one writes statically inferrable code, all sorts of code optimization, theorem proving and eliding will be done just like in a static language.
However, we also have the option to write non-inferrable dynamic code where the called methods depend on runtime values when needed. This will (obviously) come with a performance hit, but as long as you know what you're doing and keep type instabilities out of performance-critical code, it can be a great boon.
> Julia is still pretty niche at this point, and just recently got tools as fundamental as a debugger.
Julia has had debuggers for ages. It's just that when 1.0 launched last year we moved to a new intermediate representation which broke all the existing debuggers. The old ones could have been updated, but it was decided that people wanted to start over from scratch having learned a lot of lessons from debuggers like Gallium.jl.
Julia is indeed a niche language. However, in the field of scientific computing, compared to Swift's ecosystem julia might as well be python. Swift has no scientific computing ecosystem to speak of.
I'm not conflating. Just like Julia, Swift can JIT; after all, both are built on top of LLVM, the compilation environment created by Swift's inventor.
That is not what I'm talking about. Nor am I talking about dynamic dispatch. Swift supports 4 different dispatch models from static to dynamic, depending on how the compiler optimizes.
What I'm talking about is a static type system. From Julia's documentation: "dynamic type systems, where nothing is known about types until run time".
What is fascinating is that ML/AI developers build deep infrastructure to support ontologies for their target problems, but then forget to adopt the same for their code tooling. You can go through a series of proofs to demonstrate that you can't extract intent from a language that doesn't enforce it. And what is possible in a language's future can be expressed by how much intent can be extracted.
Like Julia's, Swift's scientific computing ecosystem is still evolving. But keep in mind that Swift can call Python, C, Objective-C, and C++ directly, works in Jupyter notebooks, and already has a large general-purpose open-source library ecosystem.
Your comment that I responded to was making claims about the optimization opportunities available to statically versus dynamically typed languages and I believe those claims were false.
If I have a function f and I call f(x, y, z), julia will compile a specialized method of f based on the types of x, y and z, and then will do type inference and build a computational graph for all the computations carried out by f, de-virtualizing and often inlining function calls inside the body of f.
If type inference was successful, all types are known throughout the computational graph and the language has just as many optimization opportunities available to it as a statically compiled language and is free to do as much or little theorem proving as is asked for. The only major difference is that if type inference fails, the program can still run instead of erroring. It will just lazily wait until it has enough information to compile the next method.
For instance, consider the following:
julia> f(x) = x + 2
f (generic function with 1 method)
julia> function g(x)
           for i in 1:100
               x = f(x)
           end
           x
       end
g (generic function with 1 method)
julia> code_llvm(g, Tuple{Int})
define i64 @julia_g_17491(i64) {
top:
  %1 = add i64 %0, 200
  ret i64 %1
}
I defined a function f(x) = x + 2, put that function inside a simple loop in another function g, and then asked for the compiled LLVM code for g when x is an Int64. The compiled code is simply x + 200, because the compiler was able to prove that those two programs are identical. Obviously this is a simple example, but I assure you, julia's compiler is able to do some pretty impressive theorem proving and program transformations by leveraging its very powerful (yet dynamic) type system. I fully agree that type systems are fantastic for expressing programmer intent, but I do not believe that the type system needs to be static to get the behaviour one wants. Dynamic type systems are strictly more expressive after all, and it's nice to be able to fall back on dynamic behaviour.
Of course Swift offers opt-in dynamism whereas in julia it's more just that a programmer is expected to write statically inferrable code if they want performance optimizations comparable to static languages. This is just a question of what defaults one prefers, but I think in many scientific and exploratory contexts, julia's default is preferable.
While this is true, the prospect seems exciting to me because best in class ML might be a thing that could drive Swift towards better cross platform support.
Cross platform is never going to be a target for current gen Apple execs. It would require a seismic level shift in their tech focus
They don't have a real presence in traditional servers, they don't have any real presence in cloud, and they have minimal presence in non-consumer-focused services. Cross-platform support at this point would just obsolete their platform in favor of more widely deployed OSes and mobile tech.
I don't think Swift's success is gated by Apple execs. If it's a tool developers value and the ecosystem brings it to other platforms, it will succeed there. Swift is just a programming language.
Apple is the primary driver of swift. There is literally no adoption of the language outside Apple for business purposes. Saying it's "just a programming language" is meaningless.
It's reached version 5.0 and still doesn't have stable support for Linux or any support for Windows.
..or it could be an opportunity for them. Apple relies more and more on services for growth. For instance, why not make iMessage into a cross-platform app, as a real privacy-focused competitor to Whatsapp, Telegram etc?
Because they don't make money directly from iMessage; they make money by selling the hardware, and the fact that iMessage only works on their hardware increases its value and further locks you into their ecosystem.
It's certainly true iMessage isn't monetised today, but it doesn't mean it couldn't be in the future. If China is anything to go by, people do everything on their IM app.
Lock in - how many people are locked into Apple because of iMessage? I think it's just an ancillary service to most people, with WhatsApp, Messenger, Telegram, or WeChat in China being their main IM app. The longer this is the case, the less relevant iMessage becomes.
Have you forgotten about the Apple credit card/Apple Pay? Apple has already gotten into payments, but they own the whole hardware platform; they don't need to shoehorn a tab into iMessage like WeChat has to.
All I'm saying is iMessage could be a lot bigger than it is, if it was a universal app. And right now it's becoming irrelevant compared to WhatsApp etc.
I see a lot of upside and not a lot of downside for Apple here
So now everyone's pet project gets a language extension in Swift? Apple extended the language with @_functionBuilder for SwiftUI. Now Google wants to extend it with this differentiable stuff for TensorFlow.
Why won't they implement proper metaprogramming support instead of proliferating the core language?
Does anyone else dislike the usage of the word "manifesto" in computing? It always seems to me to indicate a faith-based rather than evidence-based commitment to some train of thought.
Seems strange to see manifestos discussed in the ML space, which is nominally a measure-oriented world. Mind you, when has statistics not been polemical.
Striving for concise syntax for mathematical operations is a noble intention, but it's hardly a good idea to achieve that by shoehorning changes into an existing language.
A language framework designed for extensibility has to be the basis allowing for the implementation of concepts such as differential or probabilistic programming, and it's not Haskell, i.e. we're not there yet.
Julia's source-to-source autodiff is implemented as a custom compiler pass in a third-party package. Similar approaches are being taken by probabilistic programming packages.
1) This seems like it would only work for varying the values passed to a function. Can you differentiate a function other than by just changing some values, like using a genetic algorithm does? It seems like this is just finding the right values to pass to a function, not finding a different function itself.
2) Is this limited to only continuous/ordered values? How can you differentiate logic, branching, and which other functions to call? It seems like doing so would require mapping a function to some kind of n-dimensional Euclidean space/manifold.
3) Why does this need to be an extension to a language? Can't ordinary interfaces in the OOP sense or monads in the FP sense be used to wrap functions and give them this functionality for free?
1) It's not restricted to the variables being passed. For example, Zygote [1] has an example in the README involving IO: gradient(x -> fs[readline()](x), 1). And it's not using numerical differentiation (varying inputs to check the output) but building a program (approximate, not a closed form like symbolic differentiation) for how changes to the inputs affect the output. A genetic algorithm is a black-box metaheuristic, so people favor gradient descent when the code is differentiable, but how the model is optimized is technically open to any method.
2) Yes, it can differentiate all control flow. What happens is that each forward/backward pass will have a different graph based on what branches it would follow (effectively a subderivative). Each one of these graphs is independently differentiable and do not contain the control flow by themselves.
3) It doesn't need a language extension (Julia doesn't, but that's because you can access Julia IR using the language itself; not many languages support this level of metaprogramming). The OOP strategy (like PyTorch) overloads methods, so every time you call an overloaded method it builds the graph. It depends on customized types and methods; it does not support code that isn't using those custom types that can hold the graph; native control flow is only handled implicitly, by changing what graph gets created; it reduces the possible optimizations (you only have one subderivative, not the complete graph, to optimize); and it ends up creating a DSL with its own error messages and quirks that is less natural than just using the host language. I can't say how it would look using monads, though.
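To illustrate points 2 and 3 with a toy (a forward-mode dual-number sketch in Python, standing in for the operator-overloading strategy; everything here is invented for illustration): native control flow just works, because comparisons look only at values, and each call traces whichever branch actually runs, giving that branch's derivative:

```python
# Operator-overloading AD: dual numbers carry a derivative alongside
# the value, so native `if` statements work unchanged -- each call
# traces only the branch it takes, yielding that branch's derivative
# (the "subderivative" from point 2).
class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.value * o.value,
                    self.deriv * o.value + self.value * o.deriv)
    __rmul__ = __mul__

    def __gt__(self, o):
        # comparisons look only at the value, so branching behaves normally
        return self.value > (o.value if isinstance(o, Dual) else o)

def f(x):
    if x > 0:
        return x * x   # derivative 2x on this branch
    return 3 * x       # derivative 3 on this branch

print(f(Dual(2.0, 1.0)).deriv)   # 4.0 (positive branch)
print(f(Dual(-1.0, 1.0)).deriv)  # 3.0 (negative branch)
```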