Talk:Automatic differentiation

Rewrite needed. Low quality
The quality of this article is quite low. Despite being around for so long, it is still unclear what the differences are between forward and reverse modes, both at the mathematical level and in the corresponding implementations. This is specially true for someone new to AD. What it requires is 1) working an example at the mathematical level in both modes and in detail (the reverse mode case has been just hinted at most) and 2) to clearly show how both modes are implemented. The latter should help understanding when each mode is more adequate than the other. This is still failing for the reverse mode. In case 1) it would be worth to show how to use the chain rule to sort the derivatives so as to apply each modes. Mind you, it's wrong insisting so much on the chain rule: the differential of the function is also used! Thus, again, it would be good if someone who knows the ins and out well would rewrite this article. FWIW, [Bradley's] version of 2007 has a much clearer style than the current one. — Preceding unsigned comment added by 199.119.233.131 (talk) 15:06, 4 May 2016 (UTC)

Maybe one can add pictures containing lineraized computational graphs to describe the Forward and Reverse Modes in general. The general method (stick in cartesian basis vectors from the input or output space) is missing, too...

What is the connection between Automatic Differentiation and Dual Numbers?


 * Well, I would say that if P is a polynomial, and &epsilon;2 = 0 in the usual way of the dual numbers, then


 * P(a + &epsilon;) = P(a) +P&prime;(a)&epsilon;

is an identity in the dual numbers that expresses 'automatic differentiation'.

Charles Matthews 15:57, 12 Jul 2004 (UTC)

I believe the article should be changed so that the mathematics is separated from the use of the mathematical idea in computer programs. Somwthing like this: First the pure mathematical stuff, then some stuff about the fact that automatic differentiation can either be implemented by operator overloading, wich is nice if you want to do numerical differentiation. Or it can be implemented by automatically transforming a program to calculate the derivatives of functions when they are called, which is nice if you need derivatives of functions but don't bother to edit the source code. Comments? --DanielJanzon 16:14, 8 October 2005 (UTC)


 * I agree that it's generally a good idea to separate the maths from the implementation. Your comment implies that this isn't done in the current article; however, it seems to me that the implementation is not mentioned at all. Anyway, please go ahead and describe the implementation techniques you mentioned (operator overloading and transforming the source code). -- Jitse Niesen (talk) 17:11, 8 October 2005 (UTC)

Partial rewrite
I rearranged parts of the article. The former article presented the augmented arithmetic as part of forward accumulation, which is wrong. Also I think the derivation using dual numbers is appropriate here, even though it is also present in the dual numbers article. Added two images, which needs some more explanation. Berland 07:20, 19 January 2007 (UTC)

There is a technical misunderstanding here. Augmented arithmetic (using Dual numbers) *is* the way forward-mode AD works. Or to be more pedantic, using programming language theory terminology, forward-mode AD is a nonstandard interpretation of a computer program replacing reals by dual numbers and lifting the basis functions appropriately. But reverse-mode AD works in an entirely different way. The current article essentially does not discuss reverse mode, at least not in a way that someone who does not already understand reverse mode could understand. (I suggest segregating discussion of forward versus reverse-mode, and only after both have been explained mentioning hybrids. The discussion of dual numbers belongs inside forward-mode.)  Barak 12:56, 29 January 2007 (UTC)


 * You are right, reverse-mode needs to be "written", it is not considered done yet. Feel free to contribute. Berland 14:32, 29 January 2007 (UTC)


 * Um, I did. I wrote an introduction that briefly explained the two modes, I wrote a long discourse on forward mode, and had a place ready for reverse mode to go in. Someone ripped out the intro, ripped out the bits of text distinguishing forward from reverse, added inaccuracies (like saying that AD solved the exponential increase in complexity of calculating higher derivatives which it most certainly does not!) et cetera. Right now the article is not ready for reverse mode, because the intro material needs to be fixed, the transitions need to be fixed, all just to have a place for reverse mode.  Also I'd feel sort of guilty removing the example and graph so carefully inserted, even though I'm not sure it is the right thing for someone who does not already have an idea of AD. Is the source code for the graph available? This would make it possible to edit the graph instead of just ripping it out.  Barak 07:30, 30 January 2007 (UTC)


 * I have the source for the graphs yes. Berland 07:55, 30 January 2007 (UTC)


 * I'm moving the dual numbers stuff down because I think it can be a bit intimidating, and it's perhaps peripheral to AD. It's my impression that an understanding of it isn't necessary for understanding or using AD.  Feel free to revert if you disagree. Lyonsam (talk) 17:07, 6 April 2008 (UTC)

Link to "ADIFF Online Derivatives Calculator"
I am removing this link because the site it points to falls under the category of symbolic differentiation, which AD is specifically NOT. I think the distinction is subtle enough that links such as these can only make the article more confusing. Lyonsam 18:34, 29 October 2007 (UTC)

P(x+ex') = P(x) + P'(x)x'e
This part is really confusing because (from the looks of it, im not entirely sure) the prime symbol ' is being used in 2 different ways. On x, x' just means "some other real number", whereas (I think) P' is literally the derivative of P. Is this correct? Either way, this part could use some clarification. —Preceding unsigned comment added by 98.203.237.75 (talk) 10:32, 21 January 2008 (UTC)
 * You are correct in that the prime symbol is used in two slightly different ways. On $$P$$, it signifies the derivative with respecto to $$x$$, but $$x'$$ is just "some number". However, the latter use is beneficial when you look at the later sections, where it will be evident that $$x'$$ actually can mean the derivative of some function at $$x$$. Any changes to the notation is mostly welcome, as long as it well thought through and consistent with the rest of the text. --Berland (talk) 21:29, 21 January 2008 (UTC)

Query regarding NP-Hard
"Forward and reverse accumulation are just two (extreme) ways of traversing the chain rule. In order to compute a full Jacobian of $$f:\mathbb{R}^n \rightarrow \mathbb{R}^m$$, there is a path somewhere in between which is optimal, however, locating that path is probably an NP-hard problem."

Is there a source for this assertion? I don't doubt that it is true, but I'd like to see something to back it up. Joecainey (talk) 09:56, 20 March 2008 (UTC)
 * From memory, I would guess that this is discussed in, but I don't have physical access to it right now in order to verify. --Berland (talk) 10:56, 2 April 2008 (UTC)


 * Well, it really depends very much on how you define the problem exactly. The Acta Numerica paper  mentions the NP-hardness as a conjecture, I believe.  However, Uwe Naumann has recently shown that  this "optimal Jacobian accumulation" problem is NP-complete in general.  I will add a reference to the paper to the article. Lyonsam (talk) 16:32, 6 April 2008 (UTC)

Dual numbers vs Complex step derivative
Presumably the stuff about dual numbers is identical with the far more popular Complex-step derivative that is not mentioned in the article. Perhaps someone can expand on the theory and application of f'(x)=imag(f(x+ih))/h - since this seems to be missed out in many texts. Shyamal (talk) 10:42, 6 April 2009 (UTC)
 * No, it is not identical. In the dual numbers, the square of the "imaginary" generator is zero, whereas in the complex number trick, the square of (ih) is -h^2, leading to computational errors. The smaller h gets, the less the influence on the function value, but also the greater the numerical error in the derivatives. Correctly done, it is a way to easily implement approximate derivation via divided differences.--LutzL (talk) 16:02, 1 August 2010 (UTC)
 * On second consideration, numerically both methods are the same, provided that the seed values for the imaginary part of the variables are some machine precision orders lower than the numbers occuring in the computation tree. Something like x=1+1e-80i. However, modern code-rewriting or template based methods make the complex step obsolete from an "automatic" implementation perspective. It may be still valid as a technique for manual code transformation, since it avoids the cluttering of the code with additional intermediate variables. Just replace double by complex double, exp by cexp, cos by ccos etc.--LutzL (talk) 09:14, 21 December 2011 (UTC)

acronyms
The table of AD systems has an Approach column, containing OO and SCT. I know that OO is Object Oriented, but the meaning of SCT has eluded me. How about each acronym being defined, or linked, on each page? Normvcr (talk) 20:31, 30 December 2011 (UTC)


 * I acronymized the headings above the table so that the connection is visible. SCT is source-code transformation and OO is operator overloading. Perhaps also OET would be usefull for operation-execution trace (OR), which is a different method than DP direct propagation, but both use operator overloading.--LutzL (talk) 12:10, 31 December 2011 (UTC)

Last link appears non-functional please check

 * Adjoint Algorithmic Differentiation: Calibration and Implicit Function Theorem — Preceding unsigned comment added by Yttrill (talk • contribs) 09:54, 10 July 2012 (UTC)

Historical note: "PROSE"
The first encounter I had with automatic differentiation was in a simulation language called PROSE, circa 1974. I never actually used it, but I came across a manual for it, which I found quite fascinating. As I recall, it ran on the CDC 6000 series -- at the time I was doing numerical physics on a 6400, and PROSE was being presented as an alternative to the ubiquitous FORTRAN.

PROSE seems to have been completely forgotten; I can find nothing on it now. But I distinctly recall that it did AD. ScottBurson (talk) 17:35, 25 August 2013 (UTC)

As far as I am able to determine, the PROSE system for the CDC 6000 series was a text formatting system, similar in spirit and syntax to roff. See http://www.saildart.org/PROSE.PAS%5BTMP,HE%5D. Perhaps you are thinking of a program with a different name? If so, Bert Speelpenning's PhD dissertation might have a reference to the system you're thinking of. Barak (talk) 10:26, 27 August 2013 (UTC)

Assessment comment
Substituted at 01:47, 5 May 2016 (UTC)

Are not symbolic and numerical differentiation also automatic?
The article states that "automatic differentiation is not symbolic differentiation nor numerical differentiation." But these two techniques are both carried out automatically by the computer and don't require any hand labour, so to me it sounds like they are forms of automatic differentiation. How comes this article considers them not to be forms of automatic differentiation? —Kri (talk) 11:10, 24 June 2016 (UTC)
 * That depends how you define 'automatic differentiation'. If you interpret the phrase literally, then sure, if it's done by an automated procedure, it's automatic + differentiation = automatic differentiation. However, the phrase has a specific meaning in the field that differs from a lay interpretation of the term, as is the case with many other phrases in many other fields. Hddharvey (talk) 22:02, 28 June 2022 (UTC)
 * Years ago, I got to play, just a little, with PROSE. You specify the variables you want derivatives with respect to, and it, pretty much, calculates those derivatives for all variables. (Maybe it optimizes where it can.) So you don't say which variables you want to take the derivatives of, as it does that automatically. The usual symbolic or numerical derivatives, you specify the variables you want derivatives of. Gah4 (talk) 23:30, 28 June 2022 (UTC)
 * Years ago, I got to play, just a little, with PROSE. You specify the variables you want derivatives with respect to, and it, pretty much, calculates those derivatives for all variables. (Maybe it optimizes where it can.) So you don't say which variables you want to take the derivatives of, as it does that automatically. The usual symbolic or numerical derivatives, you specify the variables you want derivatives of. Gah4 (talk) 23:30, 28 June 2022 (UTC)
 * Years ago, I got to play, just a little, with PROSE. You specify the variables you want derivatives with respect to, and it, pretty much, calculates those derivatives for all variables. (Maybe it optimizes where it can.) So you don't say which variables you want to take the derivatives of, as it does that automatically. The usual symbolic or numerical derivatives, you specify the variables you want derivatives of. Gah4 (talk) 23:30, 28 June 2022 (UTC)

It seems to me (unfamiliar with the literature) that the very general term, "automatic differentiation" has been adopted for certain systematic sequencing of calculations of derivatives by the normal rules. The purpose of the confusing "not"s appears to be to restrict attention to special niches among a broad class of possibilities. Absent an answer to Kri's query, why not reframe that part of the article accordingly? Mdmi (talk) 23:46, 25 May 2017 (UTC)

Typo ?
Ordinaux (talk) 08:13, 13 July 2020 (UTC) It looks like some equations are not consistent with another in the section "Automatic differentiation using dual numbers" $$ \left\langle u,u'\right\rangle / \left\langle v,v'\right\rangle = \left\langle \frac{u}{v}, \frac{u'v - uv'}{v^2} \right\rangle \quad ( v\ne 0) $$ should be replaced by $$ \left\langle u,u'\right\rangle / \left\langle v,v'\right\rangle = \left\langle \frac{u}{v}, \frac{- uv'}{v^2} \right\rangle \quad ( v\ne 0) $$ to run properly and to be consistent with $$ \left\langle u,u'\right\rangle^k = \left\langle u^k, k u^{k - 1} u' \right\rangle \quad (u \ne 0) $$
 * The user specifies which variables have derivatives calculated. That can be a little tricky, as you have to follow back. But in the above case, it depends on which of u and v have derivatives active. It could be u, v, or both. Gah4 (talk) 05:36, 3 October 2023 (UTC)
 * The user specifies which variables have derivatives calculated. That can be a little tricky, as you have to follow back. But in the above case, it depends on which of u and v have derivatives active. It could be u, v, or both. Gah4 (talk) 05:36, 3 October 2023 (UTC)

Substantially revise "Operational calculus on programming spaces" section?
I'm not an experienced wikipedia editor, but I just wanted to chime in after reading this article. This article seems... not great at introducing the subject matter.

Specifically, I think that section 3 should be entirely excised from the article. It seems to exclusively reference a single article from 4 years ago, that has approximately 2 citations and is not published in any journal. I haven't read the pre-print, but I don't think that this very notationally specific formulation of AD (that hasn't been peer-reviewed) needs such a prominent exposition in this article.

In general, references to this "operational calculus on programming spaces" are strewn throughout the article, often in a way that distracts from the point. For instance, in the dual number section, "This approach was generalized by the theory of operational calculus on programming spaces (see Analytic programming space), through tensor algebra of the dual space."

I also think that the article could improve from a differential geometric exposition of AD - looking at the jacobian as a linear map from the tangent space of the domain of the function to the tangent space of the codomain of the function. The difference between forward and reverse mode becomes easier to understand and generalize to the multivariate case.

Thoughts? 2601:645:8200:61E0:824:A96C:4EA5:FA97 (talk) 18:56, 21 September 2020 (UTC)


 * This subject has absolutely nothing to do with math and everything to do with maintaining a two-channel lambda-function for computation. There are far more interesting things one can do with a multi-channel lambda-functions from the mathematical point of view. 129.93.161.205 (talk) 13:49, 28 October 2022 (UTC)

Risks?
In this: Typically, AAD tools will have an adjoint factor of over 5, meaning a 5x performance penalty for introducing AAD and being able to compute all risks., it seems that AAD is automatic adjoint differentiation, but risks is unfathomable (to me). Could somebody clarify please? catslash (talk) 23:48, 7 September 2022 (UTC)

That's resolved it then. Thanks. catslash (talk) 00:04, 3 December 2022 (UTC)

Incorrect statement? Forward mode does not necessarily need a separate pass for each independent variable
My understanding is that is strictly an implementation detail. Forward mode evaluates the chain rule "from right to left" and it can do it for all independent variables in one pass with the appropriate vector-Jacobian multiplications. This discussion on math.stackexchange has an example that I think makes it clear.

The statement at https://en.wikipedia.org/wiki/Automatic_differentiation#Two_types_of_automatic_differentiation should probably be clarified/revised. Moremeta (talk) 14:56, 2 October 2023 (UTC)
 * I was figuring this out many years ago. About 1980. Whenever a variable that is active for derivatives changes, a new value is computer for the derivative(s). I had the idea at the time, that if there is only one independent variable, to use complex variables, and change the routines that evaluate complex operators. So, yes, it is the chain rule along with the addition, subtraction, multiplication, division, and power rules. Well, individual rules for each operator and function. Gah4 (talk) 05:31, 3 October 2023 (UTC)
 * I was figuring this out many years ago. About 1980. Whenever a variable that is active for derivatives changes, a new value is computer for the derivative(s). I had the idea at the time, that if there is only one independent variable, to use complex variables, and change the routines that evaluate complex operators. So, yes, it is the chain rule along with the addition, subtraction, multiplication, division, and power rules. Well, individual rules for each operator and function. Gah4 (talk) 05:31, 3 October 2023 (UTC)

“Operator overloading and source code transformation” not clear enough to be included?
The section “Operator overloading and source code transformation” at the end is very unclear and goes into technical details not covered elsewhere in the article. The English would furthermore require improvement. Finally, it may have been added by the authors of software that is being linked to in a footnote mentioned there. My impression is that it would be better to remove this material, as it does not provide much added benefit, but may rather confuse people. Equaeghe (talk) 14:08, 5 April 2024 (UTC)