Talk:Floating-point arithmetic/Archive 5

Zuse's Z3 floating-point format
There are contradictory documents about the size and the significand (mantissa) size of the floating-point format of Zuse's Z3. According to Prof. Horst Zuse, it is 22 bits, with a 15-bit significand (implicit bit + 14 represented bits). There has been a recent anonymous change to the article, based on unpublished work by Raúl Rojas, but I wonder whether this is reliable. Raúl Rojas was already wrong in the Bulletin of the Computer Conservation Society Number 37, 2006, about single precision (he said 22 bits for the mantissa). Vincent Lefèvre (talk) 14:44, 21 September 2013 (UTC)

Error in diagram
The image "Float mantissa exponent.png" erroneously shows that 10e-4 is the exponent, while the exponent actually is only -4 and the base is 10. — Preceding unsigned comment added by 109.85.65.228 (talk) 12:14, 22 January 2014 (UTC)

Failure at Dhahran - Loss of significance or clock drift
This article states in section http://en.wikipedia.org/wiki/Floating_point#Incidents that the Failure at Dhahran was caused by Loss of significance. However, the article "MIM-104 Patriot" makes it sound like it was rather simply clock drift. This should be cleared up. — Preceding unsigned comment added by 82.198.218.209 (talk) 14:01, 3 December 2014 (UTC)


 * I agree. It isn't a loss of significance as defined by Loss of significance. It is an accumulation of rounding errors (not compensating each other) due to the fact that 1/10 was represented in binary (with a low precision for its usage). In a loss of significance, the relative error increases while the absolute error remains (almost) the same. Here, it is the opposite: the relative error remains (almost) the same, but the absolute error (which is what matters here) increases. Vincent Lefèvre (talk) 00:49, 4 December 2014 (UTC)
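A minimal Python sketch of the effect Vincent describes (using IEEE 754 binary64 rather than the Patriot system's actual 24-bit fixed-point arithmetic, so the magnitudes are purely illustrative): repeatedly adding the binary approximation of 1/10 lets the absolute error grow while the relative error stays roughly constant.

```python
# Hypothetical illustration: accumulate 0.1-second ticks for one hour.
# 0.1 has no finite binary representation, so each addition carries a
# small representation/rounding error that does not cancel out.
acc = 0.0
for _ in range(36000):          # 36000 ticks of 0.1 s = 1 hour
    acc += 0.1

drift = acc - 3600.0            # absolute error after one simulated hour
print(drift)                    # small but nonzero
```

At the Patriot's much coarser 24-bit precision, the analogous drift after 100 hours of uptime was reportedly about 0.34 s, enough to mislocate an incoming missile.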

John McLaughlin's Album
Should there be a link to John McLaughlin's album at the top in case someone was trying to go there but went here?2602:306:C591:4D0:AD55:E334:4141:98FA (talk) 05:49, 7 January 2015 (UTC)


 * Done. Good catch! --Guy Macon (talk) 07:05, 7 January 2015 (UTC)

needs simpler overview
put it this way, I'm an IT guy and I can't understand this article. There needs to be a much simpler summary for non-tech people, using simple English. Right now every other word is another tech term I don't fully understand. -- thanks, Wikipedia Lover & Supporter


 * It seems that Mfwitten removed that simple overview, perhaps to enforce WP:ROWN. He called this "streamlining". I have restored my contribution, additionally reducing the 'bits part'. Yet I am sure the IT department will be happy now. --Javalenok (talk) 18:56, 17 February 2015 (UTC)

Non-trivial Floating-Point Focused computation
The C program intpow.c at www.civilized.com/files/intpow.c may be a suitable link for this topic. If the principal author agrees, please feel free to add it. (Don't assume this is just exponentiation by repeated doubling - it deals with optimal output in the presence of overflow or denormal intermediate results.) — Preceding unsigned comment added by Garyknott (talk • contribs) 23:31, 27 August 2015 (UTC)

Lead
What does "formulaic representation" in the lead sentence mean?

In general, I think we could simplify the lead. I may give it a try over the weekend.... --Macrakis (talk) 18:52, 23 February 2016 (UTC)

Minor technical correctness error
The article says "Any integer with absolute value less than 2^24 can be exactly represented in the single precision format", and similarly "any integer with absolute value less than 2^53" for double precision. These ought to say "less than or equal" instead of "less than", because the powers of two themselves can be exactly represented in single-precision and double-precision IEEE-754 numbers respectively. They are the last such consecutive integers. -- Myria (talk) 00:12, 16 June 2016 (UTC)
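Myria's point can be checked directly; here is a small Python sketch that uses the `struct` module to round a value to IEEE 754 single precision:

```python
import struct

def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

print(to_f32(2.0**24))          # 16777216.0 -- 2^24 itself is exact
print(to_f32(2.0**24 + 1))     # 16777216.0 -- 16777217 is not representable
print(to_f32(2.0**24 + 2))     # 16777218.0 -- even integers above 2^24 still are
```

So 2^24 is exactly representable, and it is the last integer in the unbroken run: 2^24 + 1 rounds away (ties-to-even) while larger even integers remain exact.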

Epsilon vs. Oopsilon
Deep in section Minimizing the effect of accuracy problems there is a sentence
 * Consequently, such tests are sometimes replaced with "fuzzy" comparisons (if (abs(x-y) < epsilon) ..., where epsilon is sufficiently small and tailored to the application, such as 1.0E−13).

wherein 'epsilon' is linked to Machine epsilon. Unfortunately this is not the same 'epsilon'. Epsilon as a general term for a minimum acceptable error is not the same as Machine epsilon which is a limitation of some hardware floating point implementation.

As used in the sentence it would be perfectly appropriate to set that constant 'epsilon' to 0.00001. Whereas Machine epsilon is derivable based on the hardware to be something like 2.22e-16. The latter is a fixed value. The former is something chosen as a "good enough" guard limit for a particular programming problem.

I'm going to unlink that use of epsilon. I hope that won't be considered an error of sufficiently large magnitude. ;-) Shenme (talk) 08:00, 25 June 2016 (UTC)
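To illustrate Shenme's distinction, a short Python sketch (the 1.0E−13 tolerance is just the example value from the article text, not a universal choice):

```python
import sys

x = 0.1 + 0.2
y = 0.3
print(x == y)                   # False: exact comparison fails
eps = 1.0e-13                   # application-chosen "fuzzy" tolerance
print(abs(x - y) < eps)         # True: fuzzy comparison succeeds
print(sys.float_info.epsilon)   # machine epsilon of binary64: ~2.22e-16
```

The application tolerance is a design decision; the machine epsilon is a fixed property of the binary64 format, which is exactly why linking the two terms was misleading.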

spelling inconsistency floating point or floating-point
The title and first section say "floating point", but elsewhere in the article "floating-point" is used. The article should be consistent in spelling. IEEE 754 uses "floating-point", with a hyphen; I think that should be the correct spelling. JHBonarius (talk) 14:18, 18 January 2017 (UTC)
 * This is not an inconsistency (at least, not always), but usual English rules: when followed by a noun, one adds a hyphen to avoid ambiguity, e.g. "floating-point arithmetic". Vincent Lefèvre (talk) 14:26, 18 January 2017 (UTC)

hidden bit
The article Hidden bit redirects to this article, but there is no definition of this term here (there are two usages, but they are unclear in context unless you already know what the term is referring to). Either there should be a definition here, or the redirection should be removed and a stub created. JulesH (talk) 05:43, 1 June 2017 (UTC)
 * It is defined in the Internal representation section. Vincent Lefèvre (talk) 17:56, 1 June 2017 (UTC)

Seeking consensus on the deletion of the "Causes of Floating Point Error" section.
There is a discussion with Vincent Lefèvre seeking consensus on the deletion of the "Causes of Floating Point Error" from this article on whether this change should be reverted.

Softtest123 (talk) 20:16, 19 April 2018 (UTC)


 * It started with "The primary sources of floating point errors are alignment and normalization." Both are completely wrong. First, alignment (of the significands) is just for addition and subtraction, and it is just an implementation method of a behavior that has (most of the time) already been specified: correct rounding. Thus alignment has nothing to do with floating-point errors. Ditto for normalization. Moreover, in the context of IEEE 754-2008, a result can be normalized or not (for the decimal formats and non-interchange binary formats), but this is a Level 4 consideration, i.e. it does not affect the rounded value, thus does not affect the rounding error. In the past (before IEEE 754), important errors could come from the lack of normalization before doing an addition or subtraction, but this is the opposite of what you said: the errors were due to the lack of normalization in the implementation of the operation, not due to normalization. Anyway, that's the past. Then this section went on about alignment and normalization...
 * The primary source of floating-point errors is actually the fact that most real numbers cannot be represented exactly and must be rounded. But this point has already been covered in the article. Then, the errors also depend on the algorithms: those used to implement the basic operations (but in practice, this is fixed by the correct rounding requirement such as for the arithmetic operations +, −, ×, /, √), and those that use these operations. Note also that there is already a section Accuracy problems about these issues.
 * Vincent Lefèvre (talk) 22:14, 19 April 2018 (UTC)
 * Perhaps it would be better stated that the root cause of floating point error is alignment and normalization. Note that either alignment or normalization must delete possibly significant digits, then the value must be rounded or truncated, both of which introduce error.


 * Of course the reason there is floating point error is because real numbers, in general, cannot be represented without error. This does not address the cause: what actual operations inside the processor (or software algorithm) cause a floating point representation of a real number to be incorrect?


 * Since you have not addressed my original arguments as posted on your talk page, I am reposting them here:


 * In your reason for this massive deletion, you explained "wrong in various ways." Specifically, how is it wrong? This is not a valid criterion for deletion. See WP:DEL-REASON.


 * When you find errors in Wikipedia, the alternative is to correct the errors with citations. This edit was a good faith edit WP:GF.


 * Even if it is " badly presented", that is not a reason for deletion. Again, see WP:DEL-REASON.


 * And finally, "applied only to addition and subtraction (thus cannot be general)." Addition and subtraction are the major causes of floating point error.  If you can make cases for adding other functions, such as multiplication, division, etc., then find a resource that backs your positions and add to the article.


 * I will give you some time to respond, but without substantive justification for your position, I am going to revert your deletion based on the Wikipedia policies cited. The first alternative is to reach a consensus.  I am willing to discuss your point of view.


 * (talk) 20:08, 19 April 2018 (UTC)


 * Because you have not responded specifically to these Wikipedia policies (WP:DEL-REASON and WP:GF), I am reverting the section. Please feel free to edit it to correct any errors you might see. I would refer you to the experts on floating point such as Professor Kahan and David Goldberg.
 * Softtest123 (talk) 23:03, 24 April 2018 (UTC)


 * You might not know, but Vincent is one of those experts on floating point. ;-)
 * Nevertheless, it is always better to correct or rephrase sub-standard contents instead of deleting it.
 * --Matthiaspaul (talk) 11:43, 16 August 2019 (UTC)


 * I think that this is more complex than you may think. The obvious cause of floating-point errors is that real numbers are not, in general, represented exactly in floating-point arithmetic. But if one wants to extend that, e.g. by mentioning solutions as what was expected with this section, this will necessarily go too far for this article. IMHO, a separate article would be needed, just like the recent Floating point error mitigation, which should be improved and probably be renamed to "Numerical error mitigation". Vincent Lefèvre (talk) 14:46, 16 August 2019 (UTC)
 * I agree that "...real numbers are not, in general, represented exactly in floating-point arithmetic", so then the question is: "How does that manifest itself in the algorithms, and consequently the hardware design? What is it in the features of these implementations that manifests the errors?" As I have pointed out, rounding error occurs when the result of an arithmetic operation produces more bits than can be represented in the mantissa of a floating point value. There are methods of minimizing the probability of the accumulation of rounding error; however, there is also cancellation error. Cancellation error occurs during normalization of subtraction when the operands are similar, and cancellation amplifies any accumulated rounding error exponentially [Higham, 1996, "Accuracy and Stability...", p. 11]. This is the material that I presented that was deleted.
 * Softtest123 (talk) 18:14, 16 August 2019 (UTC)
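The two effects under discussion can be separated in a short Python sketch (binary64). The point Vincent makes below is visible here: subtracting nearby values is itself exact (the Sterbenz lemma guarantees this when y/2 ≤ x ≤ 2y), but the cancellation exposes and amplifies error committed in earlier operations.

```python
# Exact cancellation: for nearby operands the subtraction is exact.
x = 1.0 + 2.0**-52              # 1 plus one ulp
y = 1.0
print(x - y == 2.0**-52)        # True: no new error in the subtraction

# But cancellation amplifies a prior rounding error:
t = 1e-17                       # below half an ulp of 1.0
s = 1.0 + t                     # the addition rounds to exactly 1.0
print(s - 1.0)                  # 0.0: all information about t is lost
```

The subtraction in the second half is again exact; the damage was done by the earlier rounding, which is why attributing the error to "normalization of subtraction" is contested in this thread.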


 * Interestingly, it just so happens that this week I have been doing some engineering using my trusty SwissMicros DM42 calculator which uses IEEE 754 quadruple precision decimal floating-point (~34 decimal digits, exponents from -6143 to +6144) and at the same time am writing code for a low end microcontroller used in a toy using bfloat16 (better for this application than IEEE 754 binary16 which I also use on some projects). You really have to watch for error accumulation at half precision. --Guy Macon (talk) 19:28, 16 August 2019 (UTC)


 * The effect on the algorithms is various. Some algorithms (such as Malcolm's algorithm) are actually based on the rounding errors in order to work correctly. There is no short answer. Correct rounding is nowadays required in implementations of the FP basic operations; as long as this requirement is followed, the implementer has the choice of the hardware design. Cancellation is just the effect of subtracting two numbers that are close to each other; in this case, the subtraction operation itself is exact (assuming the same precision for all variables), and the normalization does not introduce any error. Vincent Lefèvre (talk) 20:13, 16 August 2019 (UTC)

Fastfloat16?
[ https://www.analog.com/media/en/technical-documentation/application-notes/EE.185.Rev.4.08.07.pdf ]

Is this a separate floating point format or another name for an existing format? --Guy Macon (talk) 11:32, 20 September 2020 (UTC)


 * Same question for [ http://people.ece.cornell.edu/land/courses/ece4760/Math/Floating_point/ ] Somebody just added both to our Minifloat article. --Guy Macon (talk) 11:37, 20 September 2020 (UTC)


 * As the title of the first document says: Fast Floating-Point Arithmetic Emulation on Blackfin® Processors. So, these are formats convenient for a software implementation of floating point ("software implementation" rather than "emulation", as they don't try to emulate anything since they have their own arithmetic, without correct rounding). The shorter of the two formats has a 16-bit exponent and a 16-bit significand (including the sign). Thus that's a 32-bit format. Definitely not minifloat. And the goal (according to the provided algorithms) is not to emulate minifloat formats either (contrary to what I have done with Sipe, where I use a large format for a software emulation of minifloat formats). In the second document, this is a 24-bit format with a 16-bit significand, so I would not say that this is a minifloat either. — Vincent Lefèvre (talk) 16:23, 20 September 2020 (UTC)


 * Thanks! That was my conclusion as well, but I wanted someone else to look at it in case I was missing something. As an embedded systems engineer working in the toy industry I occasionally use things like minifloat and brainfloat, but I am certainly not an expert. I fixed the minifloat article. --Guy Macon (talk) 17:50, 20 September 2020 (UTC)

imprecise info about imprecision of tan (tangent)
imho the statement 'Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity' is misleading, as it's more a problem of the cos approximation not yielding '0' for pi/2; if you replace cos(x) with sin(x-pi/2) for that range you get a nice #DIV/0! for tan(pi/2),

as well sin(pi) not resulting in '0' can be corrected by replacing sin(x) with -sin(x-pi) for that range,

not sure if it holds, but if you reduce all trig. calculations on the numerical values of sin in the first quadrant - what imho is possible - the results may come out quite fine ... greatly neglected by calc, ex$el and others ... — Preceding unsigned comment added by 77.0.177.112 (talk) 01:26, 11 March 2021 (UTC)
 * No, the tan floating-point function has nothing to do with the cos floating-point function. Vincent Lefèvre (talk) 11:32, 11 March 2021 (UTC)
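The article's claim is easy to check numerically (Python, binary64): `math.pi/2` is the double nearest to π/2, but not π/2 itself, so the library `tan` of it is huge yet finite.

```python
import math

t = math.tan(math.pi / 2)
print(math.isfinite(t))         # True: large but not infinity
print(t > 1e15)                 # True: on the order of 1.6e16
```

This is consistent with both positions in the thread: the standard `tan` applied to the nearest-double argument never returns infinity, while an implementation built from a shifted `sin`, as the IP suggests, could divide by an exact zero and overflow.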


 * hello @Vincent, sorry for objecting ... imho (school math) and acc. to wikipedia (https://en.wikipedia.org/wiki/Trigonometric_functions, esp. 'Summary of relationships between trigonometric functions' there) "tan(x) = sin(x) / cos(x)", once you get a proper cos at pi/2 [use sin(pi/2-x), same reference], you can calculate a proper tan with overflow (#DIV/0! in 'calc'),

perhaps it won't work 'in IEEE' (then it's a weakness there), but developers or users can achieve proper results once they have proper sin values for the first quadrant, — Preceding unsigned comment added by 77.0.177.112 (talk) 14:03, 11 March 2021 (UTC)
 * "tan(x) = sin(x) / cos(x)" is a mathematical definition on the set of the real numbers. This has nothing to do with a floating-point specification. — Vincent Lefèvre (talk) 18:23, 11 March 2021 (UTC)


 * hello @Vincent, what's going on? there is a possibility to get math correct results, and you don't want it be posted?

0.: despite you think it a 'non floating point specification' you agree that the formulas hold and achieve correct results?

1.: the wikipedia article does not! state that there is any special 'floating-point-tangens' specification (and imho there isn't any), but states 'that an attempted computation of tan(π/2) will not yield a result of infinity', and that's simply only true for some attempts, by calculating sin and cos you can get the correct overflow,

2.: 'mathematical definition on the set of the real numbers', yes, but what in that contradicts applying it on float or double figures as they are a subset of reals? some representations and results will have small deviations, that's the tradeoff for the speed of floats, but the basic math rules should hold as long as there aren't special points against it (as there are against e.g. associative rule), pi, pi/2, pi/4, 2*pi and so on are not exact in floats or doubles ... as well as they are not! exact in decimals, despite that we calculate infinity for tan(pi/2) in decimals, and thus we can! do the same in doubles (and floats?),

3.: plenty things in this world suffer from small deviations in fp-calculations ... we should start correcting them instead of getting the prayer mill 'fp-math is imprecise' going again and again,

4.: i am meanwhile slightly annoyed when 'fp-math is imprecise' is pushed again and again with wrong reasons, fp-math has weaknesses and 'you have to care what you do with it' is true and well known since Goldberg, but this does not forbid to achieve correct results with good algorithms, on the contrary, Goldberg and Kahan explicitly recommend it (because they did not see floating point numbers as a special world in which own laws should apply but as tools to be able to process real world tasks as fast and as good as possible),

5.: the article states that a correct calculation of tan(x) at pi/2 is impossible as a result of the representation of pi being imprecise, i'd show: a: it's not impossible, b: the representation of pi isn't an issue against good results,

agree? if not please with clear definitions and sources ... — Preceding unsigned comment added by 77.3.16.116 (talk) 16:29, 12 March 2021 (UTC)


 * Well, reading the beginning of what you said, about using sin(x − π/2) in the implementation, yes, due to the cancellation in the subtraction, one could get a division by 0 and an infinity. I've clarified the text by saying "assuming an accurate implementation of tan". This would disallow implementations that do such ugly things. Even using sin(x)/cos(x) in the floating-point system to implement tan(x) would be a bad idea, due to the errors on sin and on cos, then on the division. And for 5, you misread the article (it implicitly assumes no contraction of the tan(π/2) expression, but this is quite obvious to me). The article does not say that computing tan at the math value π/2 is impossible, it just says that the floating-point tan function will never give an infinity, because its input cannot be π/2 exactly (or kπ+π/2 exactly, k being an integer). — Vincent Lefèvre (talk) 03:44, 13 March 2021 (UTC)


 * hello @Vincent,


 * - what do you refer to with 'the floating-point tan function'? i could only find ('C')-library implementations and recommendations for FPGAs, and 'The asinPi, acosPi and tanPi functions were not part of the IEEE 754-2008 standard because they were deemed less necessary' in 'https://en.wikipedia.org/wiki/IEEE_754',


 * - "assuming an accurate implementation of tan": that sounds misleading and imho an attempt to stick to 'fp-math is imprecise' despite there being correct solutions, dubbing them 'not accurate',


 * - 'due to the errors on sin and on cos,': if you - or everyone - implement(s) the trig functions as proposed, and! respectively takes that function / that part of the quadrant that has less error one will get ... 'good results',


 * - 'implementations that do such ugly things': opposite ... IEEE or '(binary) fp-math' or 'reducing accuracy by limiting to small amount off digits' is doing 'ugly things' with math in general, most of countermeasures rely on 'dirty tricks', i'd suggest letting the mill 'fp-math is imprecise' phase out, and using instead 'we are intelligent beings, we can recognize difficulties and deal with them' ... or the ' ... at least we try to',


 * - 'Even using sin(x)/cos(x) in the floating-point system to implement tan(x) would be a bad idea, due to the errors on sin and on cos, then on the division.' - don't think that simple, it's well known which trig-function has weaknesses in which range(s) (calculated by approximations or taylor series or similar), pls. consider using substitutions only for that ranges ... — Preceding unsigned comment added by 77.10.180.117 (talk) 11:09, 13 March 2021 (UTC)


 * The tan function (tangent) is included in the IEEE 754 and ISO C standards, for instance. The sentence "The asinPi, acosPi and tanPi functions..." is not about the tan function; moreover, this is historical information, as these functions are part of the current IEEE 754 standard as explained. My addition "assuming an accurate implementation of tan" is needed because some trig implementations are known to be inaccurate (at least for very large arguments), so who knows what one can get with such implementations... — Vincent Lefèvre (talk) 12:16, 13 March 2021 (UTC)