Talk:Floating-point arithmetic/Archive 4

..Easy?
Isn't this the system where a calculator DOESN'T shorten a number (e.g. 12345.6789 to 12345.68)? -Newprofile001 —Preceding unsigned comment added by 71.40.68.66 (talk) 17:11, 4 May 2009 (UTC)

IBM supercomputer info irrelevant
"In June 2008, the IBM Roadrunner supercomputer achieved 1.026 petaflops, or 1.026 quadrillion floating-point operations per second."

I don't see how this is relevant to the definition of floating point. It seems like an advertisement for IBM's Roadrunner supercomputer.

-FoxMajik (talk) 14:56, 10 July 2008 (UTC)

Time to Archive previous discussion?
Yes, I agree. We're at 107 kilobytes, shouldn't be over 32. William Ackerman 01:54, 7 December 2006 (UTC)

Done, in two batches. We are still a little heavy, at 56 KB. William Ackerman 16:03, 8 December 2006 (UTC)

More discussion of introductory material
[The following is a digest of a discussion between Amoss and William Ackerman].

From Amoss:

In reworking the intro you seem to have reverted the text drastically to a much earlier version focusing on floating point as a numeral representation. There is some discussion on the talk page about why this was changed that you have not added to. In particular: Floating-point is a system of arithmetic that operates on a particular representation (also called floating-point). You've also reverted the intro to the claim that floating-point numbers are representations of real numbers. This makes no sense and there is much controversy on the subject on the talk page. All floating-point numbers are instances of a set that is a subset of the rationals. There are no irrational elements, and so it makes no more sense to call them Real than it does to call them Complex. As a result there is now a clash between the description at the top and that later on. Amoss 16:54, 5 December 2006 (UTC)

From William Ackerman:

I made the "above the TOC" (table of contents) section list 4 ways of representing numbers: (1) integers (point implicitly at the right), (2) ordinary written mathematical notation (with a point), (3) written scientific notation (with "×10^−3"), and (4) the real thing. I believed that understanding (4) required a context of (1), (2) and (3). It was in response to a comment by JakeVortex that I put those 4 below the TOC, listing only (4), without its context, above the TOC. I really believe that this organization, discussing only (4) above the TOC and listing the context shortly afterward, is the right thing, and that at least the first few paragraphs, both above and below the TOC, are starting to be really correct. I really believe that this description, as a sort of numeral representation, is the right way to describe it, and that my sentence "could be thought of as a computer realization of scientific notation" is a good one.

The bit about rationals: I consider the fact that all representable FP numbers are rational to be a coincidence, and not fundamental, which is why I've been downplaying that. (I don't remember at the moment whether I completely reverted your text along those lines, or just seemed to.) Aside from the fact that all FP numbers represent reals that happen to be rational, there is nothing special about the rationals here. People use FP to solve differential equations, invert matrices, etc. etc. These are generally thought of as operations over the field of reals. If FP arithmetic were to be suddenly magically endowed with the ability to represent all rationals exactly, all the usual accuracy problems would remain. Well, most of them. You could invert matrices exactly, but you couldn't solve diff. eq.s, or compute pi or exp or log or sin ....

The fact that FP doesn't represent all reals, and the accuracy problems that arise therefrom, are sort of a hot-button item for me. I've seen too much "floating point mysticism" (or maybe "superstition", but really it's ignorance), and I want to be sure this article dispels same. Therefore, it's important to me that the article say that FP numbers are intended to model the reals, and that they do this only approximately because they exactly represent only a subset of the reals and any result is rounded to the nearest representable number. It's perfectly appropriate to mention somewhere that those representable numbers are rational, but that's not what is fundamental, and it really shouldn't be in the first few paragraphs.

William Ackerman 00:47, 6 December 2006 (UTC)

[End of material from user pages.]

Hi William, thanks for your detailed reply to my message on your talkpage. Firstly, I didn't mean to accuse you of "angry reverts", if my message came across that way then please accept my apologies. Now, onto the business of the page itself:

The changes in the past month are very drastic, and a huge improvement. I think that we should archive all the other sections on this page and continue with what the page needs now. There seem to be three or four active contributors at the moment; I have no idea how many of the old contributors on here are still watching the talk page.

1. Numeral-representation or system of arithmetic? There does seem to be a lot of old discussion about this. I would say that the term floating-point does refer to two separate things: a representation and a system of arithmetic. All of the old versions of the intro were quite unwieldy, as it is a lot to get across in so few words. The current version is quite straightforward and works well as an intro. Perhaps the nasty details of both could be pushed into one of the body sections?

2. Reals or rationals? I can see your point on this. Strong evidence in favour of the "subset of the reals", shortened to "reals", is that it does make the exposition clearer and less unwieldy, it's still technically correct, and it's the terminology that Kahan uses. But I think that the exact description that they are a particular subset of the rationals needs to go in somewhere. In particular it helps explain the examples like why 0.1 can't be represented exactly. Also, the arithmetic operations make more sense as operating on this set of rationals.

3. Overall structure? The new structure with the Overview split into lots of the old subsections really works well. I would say that the front of the article works well, and it is only the "tail" that still needs serious work.

Amoss 15:02, 6 December 2006 (UTC)
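
The remark that 0.1 cannot be represented exactly can be checked directly. The following is only an illustrative Java sketch (the class name `ExactDecimalExpansion` and the helper `exact` are made up for this note); it relies on the fact that `new BigDecimal(double)` converts the stored binary value without rounding, so the printed digits are the exact rational number held by the double.

```java
import java.math.BigDecimal;

public class ExactDecimalExpansion {
    // new BigDecimal(double) converts the stored binary value without rounding,
    // so the string shows exactly which rational number the double holds.
    static String exact(double x) {
        return new BigDecimal(x).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(exact(0.5));       // 1/2 is a binary fraction, stored exactly
        System.out.println(exact(0.1));       // 1/10 is not; this is the nearest double
        System.out.println(exact(1.0 / 3.0)); // the nearest double to 1/3
    }
}
```

Every output is a finite decimal string, which is another way of seeing that each double is a rational number whose denominator is a power of two.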

Even more discussion of introductory material
I think the basic material is here, and the problem is organizing it. The subject matter of floating point is proving to be extremely difficult to get organized and ordered properly. Rather than just going ahead with my vision of how things should be (possibly reverting other people's material) I think it would be good to discuss an outline here on the talk page.

There are many things we want to say, and they are all trying to find the right section, and trying to compete for a spot near the top. I think they include:
 * The very first paragraph (above the table of contents). Does this properly capture what floating point really is?
 ** I think that it is getting there. I thought the phrasing of "some kind of" designation for the radix point was a little weak. So I've made it more direct and inserted a sentence that mentions arithmetic. It uses a similar weak / glossy mention of accuracy, but hopefully it reinforces that the representation is exact and the operations are approximate. Amoss 17:04, 1 January 2007 (UTC)
 * Do we want to say something about the "finite window" of significand digits? (I happen to believe that, on a theoretical basis, the finiteness of the window is an important aspect of what FP really is, but I don't know whether it's a point that we can/should make explicitly.)
 ** Yes, it is an important point. Without a finite restriction on the window the approximation becomes exact. I think that it should be explicitly explained, although I'm not quite sure whereabouts it would fit. Amoss 17:04, 1 January 2007 (UTC)
 * Near the top, there is a list of 5 ways of interpreting things (integer, common notation, fixed point, scientific, FP.) Is this the right thing? Does fixed point belong here? (It's more a computer thing than a human thing.)
 ** On the one hand, it is easiest to explain floating-point as a comparison to other formats. On the other hand, maybe there should be a separate page for formats that explains the differences between them in more detail? Fixed-point is nice because it is essentially floating-point with a constant exponent. I think this is used by people; I remember being taught to evaluate expressions using a fixed number of decimal places (as opposed to significant digits). At the moment there is some redundancy between this section and the following nomenclature; I'll see if it is easy to merge the relevant sentences together. Amoss 17:04, 1 January 2007 (UTC)
 * We have a later separate paragraph about alternative computer representations, e.g. arbitrary precision, bignum, symbolic. Is this the right thing to do? Should fixed point computer representation be listed only here?
 ** Again, this is potentially material for a formats page, but taking it out makes it harder to explain floating-point. Amoss 17:04, 1 January 2007 (UTC)
 * Do we want to point out early on that FP gives only a subset of the reals, and, in fact, a subset of the rationals? I think this needs to be a recurring theme in the article because it's so important. (We need to go into more detail later, saying that it is [0, 2^p − 1] × 2^any.) Where should it be introduced first? Where should we introduce the non-representability of 1/3, 1/10, and π? (I'm inclined to move forward on this, putting this material down around the "value" and "the conversion and rounding" sections.)
 ** The value and rounding sections sound like the right place for this. Non-representable values should go in the value section, I think. Amoss 17:04, 1 January 2007 (UTC)
 * Do we need a "misconceptions" section? I tend to think so, because I have seen a lot of FP superstition/mysticism, and I'd like this article to straighten that out. For example, FP numbers are not approximations, they are exact. It's the FP operations that aren't exact. We probably ought to mention, in this section or elsewhere, the real-world things that can interfere with IEEE's vision of making FP deterministic (and thereby perpetuating the superstition.) This includes compilers using 80-bit arithmetic at their whim, and the compiler switches (e.g. "/Op") to prevent this.
 ** I think this section is important, and it should go directly before the numerical analysis part. So firstly explain where the floating-point representation is exact in ways that people generally don't realise. Then lead into where arithmetic can be unstable and what not to do. Amoss 17:04, 1 January 2007 (UTC)
 * There used to be a mention of the extreme economic importance of FP in science, technology, industry, etc. There was also mention of "FLOPS". It was taken out. Should it be put back? Where? Perhaps in the topmost section, above the TOC?
 ** Someone has put it back in; I think it adds to the intro and should stay. I've added a statement to it that explains why having a large automatically changing range is the important feature that makes floating-point desirable. The applications could be expanded; they are the "classic" list in some sense, but now games and multimedia performance are just as important, and more widespread. Amoss 17:04, 1 January 2007 (UTC)

William Ackerman 02:20, 11 December 2006 (UTC)
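
The two points above, that each FP number is exactly an integer significand times a power of two, and that only the operations round, can be sketched in Java. This is a minimal illustration, not a reference implementation: the class and method names (`DyadicRational`, `significand`, `exponent`, `rebuild`) are invented for this note, and the sketch assumes a finite, normal, nonzero double (subnormals, zeros, infinities and NaN are not handled).

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DyadicRational {
    // Integer significand m (53 bits for a normal double), ignoring the sign.
    static long significand(double x) {
        long bits = Double.doubleToLongBits(Math.abs(x));
        long fraction = bits & 0x000FFFFFFFFFFFFFL; // the 52 stored fraction bits
        return fraction | 0x0010000000000000L;      // add the implicit leading 1
    }

    // Exponent e such that |x| == m * 2^e exactly.
    static int exponent(double x) {
        return Math.getExponent(x) - 52;
    }

    // Recompute m * 2^e with exact arithmetic; division by a power of two always terminates.
    static BigDecimal rebuild(double x) {
        BigDecimal m = new BigDecimal(BigInteger.valueOf(significand(x)));
        BigDecimal two = BigDecimal.valueOf(2);
        int e = exponent(x);
        BigDecimal v = e >= 0 ? m.multiply(two.pow(e)) : m.divide(two.pow(-e));
        return x < 0 ? v.negate() : v;
    }

    public static void main(String[] args) {
        double x = 0.1;
        System.out.println(significand(x) + " * 2^" + exponent(x));
        // The stored number is exact: rebuilding it reproduces the same value.
        System.out.println(rebuild(x).compareTo(new BigDecimal(x)) == 0);
        // The operation is what rounds: the sum of two exact values is rounded to a double.
        System.out.println(0.1 + 0.2 == 0.3);
    }
}
```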

Etymology
I sure would like to know why they call it a floating point number. What does it have to do with floating? —Preceding unsigned comment added by 71.68.109.170 (talk) 22:45, 18 November 2010 (UTC)
 * The term floating point refers to the fact that the radix point (decimal point, or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated separately in the internal representation, and floating-point representation can thus be thought of as a computer realization of scientific notation. —Preceding unsigned comment added by RaptorHunter (talk • contribs) 22:47, 18 November 2010 (UTC)
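
As a rough illustration of the point "floating" while the significant digits stay put, the Java sketch below scales a value by powers of two and prints the stored fields. The class name `FloatingPointDemo` and the helper `fractionBits` are hypothetical, and the sample value 1.2345 is arbitrary; `Math.scalb` and `Math.getExponent` are standard library calls.

```java
public class FloatingPointDemo {
    static final long FRACTION_MASK = 0x000FFFFFFFFFFFFFL;

    // The 52 stored significand (fraction) bits of a double.
    static long fractionBits(double x) {
        return Double.doubleToLongBits(x) & FRACTION_MASK;
    }

    public static void main(String[] args) {
        double x = 1.2345;
        for (int k : new int[] {-20, 0, 20}) {
            double y = Math.scalb(x, k); // x * 2^k: the binary point moves k places
            System.out.printf("%-24s exponent=%4d fraction=%013x%n",
                    y, Math.getExponent(y), fractionBits(y));
        }
    }
}
```

Only the exponent field changes between the three lines; the fraction bits are identical, which is the sense in which the position of the point is recorded separately from the digits.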

Real vs rational
The top section of the article describes a floating point number as representing a real number, whereas this is not really true: floating point cannot fully represent an irrational number, therefore only rational numbers are representable in floating point systems. I'm commenting here rather than fixing it, because I can't believe this hasn't come up before, yet the text is still there at the top, so I'm guessing somebody must have a reason for this. JulesH 11:43, 13 February 2007 (UTC)
 * Is it not saying that it represents a specific real number (rather than all possible real numbers)? Perhaps could be clarified.  mfc 13:48, 13 February 2007 (UTC)
 * The problem that JulesH is addressing is that the specific real that is represented happens to be a rational number. As to why we talk about it representing a real rather than specifying a rational, it's a convention that is used in most literature.  There is no better way to represent a real anyway.  So when we use the constant pi we use the best approximation of pi that is possible with the precision we are using. Taemyr 20:43, 9 June 2007 (UTC)
 * An incorrect convention just the same. Accuracy is preferable to reflex conservatism; of course, we probably should note that such a convention exists.  The argument that pi is only approximate also applies to the decimal expansion of 1/3; it too has no complete representation in floating point.


 * Also, in the first sentence:

In computing, floating-point is a numerical-representation system in which a string of digits (or bits) represents a rational real number.


 * ...rational real number is as redundant as natural rational number or natural real number. The likely intent was that rational applied to the actual FPN, while real was the ideal; the painting as compared to the model; the map and the territory; waffling.  --AC 07:55, 21 June 2007 (UTC)


 * Although it doesn't directly affect the real vs. rational discussion, I'd like to point out that if the IEEE standard is our exemplar model of a floating-point system, and this standard includes representations for infinities and NaN, then a floating-point system isn't just restricted to real/rational quantities. ~ Booya Bazooka 16:26, 21 June 2007 (UTC)
 * Agreed that the IEEE made an excellent standard. However, while NaN and infinities are honorary members of a particularly good implementation and useful aids to machine calculation, (though little known and seldom used, thus needlessly reinvented), I'd wonder if these symbols, opcodes, and underlying silicon shouldn't be considered more as algorithmic conventions rather than (in)finite digital strings.  Useful exceptions when used correctly, but not general -- the IEEE infinities can't make FPUs produce all the digits of pi.  Those infinities and NaN are floating point in the same fuzzy sense that David Rice Atchison was President of the United States.  --AC 08:08, 22 June 2007 (UTC)

...wording seems unnecessarily complicated - a "rational real number" is just a rational number, isn't it?

No, that's not the same! A floating point number is a representation of a real value. Because it's finite, it must be from a rational subset of R. So it's rational as a value, and real when used in a longer calculation. Sometimes a phrase is constructed very carefully, and not only in prophecies of modern British authors. This is one of them. But rational real number should perhaps be better explained. --Brf 07:46, 9 July 2007 (UTC)


 * From the rational number article: "The rationals are a dense subset of the real numbers" - If rationals are a subset of reals, then specifying "real" is redundant. ~ Booya Bazooka 11:05, 10 July 2007 (UTC)


 * Until I read the article on rational numbers I was neutral on which wording was best. This article pointed out that the reals include irrational numbers like pi and e.  Floating-point numbers don't include the irrational numbers and so the rational number link is of higher value (ie, more relevant information), in my opinion, than any alternative link. Derek farn 16:50, 10 July 2007 (UTC)

(I moved the two discussion parts together) --Brf 06:44, 12 July 2007 (UTC)

Hi Guys, this has all been argued through before. It's possible that the talk page has been chopped since then (I even vaguely remember requesting it). The basic argument (as always) is that floating point representations are entirely contained within the set of Rationals and therefore shouldn't be referred to as Reals. In fact they are entirely within the subset of the Rationals given by products of integers and powers of two, but that never seems to crop up as a reason to describe them that way. Every floating point number is a real number. It is not a representation of a real number, but actually a real number. Yes, it is also in a subset of the Rationals, but the semantics of floating point operations (rather than just the representations) assume that these numbers are Real. William Ackerman posted a very good link to an article by Kahan on the subject. If William is reading then perhaps he could post the link again and settle this debate before it starts all over again? Amoss 13:51, 11 July 2007 (UTC)

The problem in this discussion is that floating point numbers (fpn) have several different aspects. As a number set they are a finite subset of the rational numbers (rat) and not even a dense one. They were constructed for approximate calculations in the real range (rl) and so they mostly follow the arithmetic rules of rl. Exceptions are discussed in numerical analysis. Their use is to calculate real results. Concerning their applications, fpn are real. And a reader of the article is sent in the wrong direction when reading that fpn are rational. What would you call an arithmetic using fractions? Are fractions simply integers, because only integers are involved? --Brf 06:44, 12 July 2007 (UTC)
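
The "finite and not dense" remark can be made concrete by measuring the gap between neighbouring doubles. A minimal Java sketch follows; the class name `FloatGaps` and the helper `gapAbove` are invented for illustration, while `Math.nextUp` and `Math.ulp` are standard library methods.

```java
public class FloatGaps {
    // Distance from x to the next larger double.
    static double gapAbove(double x) {
        return Math.nextUp(x) - x;
    }

    public static void main(String[] args) {
        System.out.println(gapAbove(1.0));   // spacing just above 1.0, i.e. 2^-52
        System.out.println(gapAbove(1e16));  // spacing near 10^16 is already 2.0
        double big = 9007199254740992.0;     // 2^53
        // 2^53 + 1 falls in a gap, so the sum rounds back to 2^53.
        System.out.println(big + 1.0 == big);
    }
}
```

Between any two distinct rationals there is another rational, but between two adjacent doubles there is nothing representable, and the gaps grow with magnitude.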

Request for info
The following was edited into the main article. I have reverted same, and will try to formulate a reply for this person.


 * Actually floating point are processed in binary format. there formats. 1-> IEEE.


 * I want to know what is IEEE. and how it is normalised in mantissa and exponent form


 * Please verify this and reply same


 * debakanta.rout@feelings.com

William Ackerman 00:37, 15 April 2007 (UTC)

That hyphen
Both floating point and floating-point are used interchangeably here. Which is correct? The article should be standardised on one of them. 172.188.190.67 14:55, 21 August 2007 (UTC)

Both are correct (but may not be used correctly):
 * floating-point is the adjective, as in 'floating-point number'.
 * floating point is an adjective and noun, as in 'the number has a floating point'

mfc 13:17, 29 August 2007 (UTC)

Clarification
The article mentions using the natural logarithm of a number stored in fixed-point as an alternative to the IEEE floating point format, does anyone have a reference to how that may work? Specifically, how would addition and subtraction be implemented without first exponentiating the operands and taking the logarithm of the result? —Preceding unsigned comment added by Somenick (talk • contribs) 09:02, 26 September 2007 (UTC)
 * You're right. That idea would be so inefficient it's scary. Since that possibility lacks a reference, I have removed it from the article. If anyone can find a citation that makes a persuasive argument for this we might restore it. The proposal was included in Floating point. EdJohnston (talk) 16:37, 21 November 2007 (UTC)
 * It's very efficient for multiplication. As for references, try Google.  And for an interesting insight into this and similar concepts, see also the classic:
 * FOCUS Microcomputer Number System, Albert D. Edgar and Samuel C. Lee, Communications of the ACM Vol. 22 #3, pp166–177, ACM Press, March 1979.
 * Abstract: FOCUS is a number system and supporting computational algorithms especially useful for microcomputer control and other signal processing applications. FOCUS has the wide-ranging character of floating-point numbers with a uniformity of state distributions that give FOCUS better than a twofold accuracy advantage over an equal word length floating-point system. FOCUS computations are typically five times faster than single precision fixed-point or integer arithmetic for a mixture of operations, comparable in speed with hardware arithmetic for many applications. Algorithms for 8-bit and 16-bit implementations of FOCUS are included.
 * mfc (talk) 20:09, 21 November 2007 (UTC)
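
To illustrate both sides of this exchange (cheap multiplication, awkward addition), here is a small Java sketch of a logarithmic number representation. It is an assumption-laden toy, not the FOCUS system from the cited paper: the class `LogNumber` is hypothetical, it handles positive values only, and it stores the logarithm in a `double` where a real design would use a fixed-point field. Addition uses the identity log(a + b) = log a + log(1 + e^(log b − log a)), so the operands never have to be exponentiated back in full; hardware and software designs generally tabulate or approximate that correction term.

```java
public class LogNumber {
    final double log; // natural logarithm of the represented (positive) value

    private LogNumber(double log) {
        this.log = log;
    }

    static LogNumber of(double value) {
        return new LogNumber(Math.log(value));
    }

    double value() {
        return Math.exp(log);
    }

    // Multiplication is a single addition of the stored logarithms.
    LogNumber times(LogNumber other) {
        return new LogNumber(log + other.log);
    }

    // Addition needs the correction term log(1 + e^(lo - hi)), with lo <= hi.
    LogNumber plus(LogNumber other) {
        double hi = Math.max(log, other.log);
        double lo = Math.min(log, other.log);
        return new LogNumber(hi + Math.log1p(Math.exp(lo - hi)));
    }

    public static void main(String[] args) {
        LogNumber a = LogNumber.of(3.0), b = LogNumber.of(4.0);
        System.out.println(a.times(b).value()); // close to 12
        System.out.println(a.plus(b).value());  // close to 7
    }
}
```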

Addition
Some of the wording under the Addition heading is completely mangled and unreadable, to the point that I can't clean it up because I don't even know what it's supposed to mean. Examples: "The following example is decimal means base is simply 10."; "This is nothing else as converting to engineering notation. In detail:". Largesock 19:44, 27 September 2007 (UTC)

A few nice properties
The article mentions that equality of IEEE floating-point numbers is equivalent to equality of their integer bit representation (except for exceptional values). However, this is not quite true as (according to the IEEE 754 standard) +0.0 equals -0.0 but the signed zeros have different bit representations and they are not exceptional values. —Preceding unsigned comment added by 84.163.65.4 (talk) 08:03, 21 November 2007 (UTC)
 * Correct, and also NaNs compare unequal (unordered) with NaNs. mfc (talk) 15:19, 21 November 2007 (UTC)
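
Both counterexamples are easy to reproduce. The Java sketch below is illustrative only (the class `EqualityVsBits` and helper `sameBits` are made-up names); it uses `Double.doubleToLongBits`, which maps every NaN to one canonical bit pattern.

```java
public class EqualityVsBits {
    // Compare two doubles by their 64-bit patterns instead of by value.
    static boolean sameBits(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    public static void main(String[] args) {
        // Signed zeros: equal as numbers, different as bit patterns.
        System.out.println(0.0 == -0.0);                      // true
        System.out.println(sameBits(0.0, -0.0));              // false
        // NaN: identical bit pattern, yet never equal as a number.
        System.out.println(Double.NaN == Double.NaN);         // false
        System.out.println(sameBits(Double.NaN, Double.NaN)); // true
    }
}
```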

IBM Roadrunner and the petaflops
The article says that supercomputers are usually discussed in terms of teraflops, but IHT reported that a supercomputer just broke the petaflops range (1.026 petaflops, actually). Should this be mentioned on the page? The IHT article is here: http://www.iht.com/articles/2008/06/09/technology/09petaflops.php 68.225.14.153 (talk) 19:25, 9 June 2008 (UTC)

I've added the reference. But the TOP500 page that we reference doesn't yet have a mention of it. Can someone who knows their way around the TOP500 page fix that up?

Also, we still have the issue of whether "flop" should be capitalized. I believe that this issue has been raised before, and that FLOPS was capitalized because it is an acronym for Floating Point Operations Per Second. While that is certainly true, I think that capitalizing it for that reason looks funny, as does considering it to be plural simply because of the "second". People can speak of "0.9 megaflop" as though "megaflop" is singular, and I think that, at this time, that kind of usage trumps the acronymic origin. What do people think?

William Ackerman (talk) 18:06, 10 June 2008 (UTC)

Mathematics of Gap Functions
There is an actually relatively old but not extensively well known body of mathematics on the mapping of reals to floats. Will review and supply summary at least if not present or addressed in archived talk. 74.78.162.229 (talk) 18:05, 6 July 2008 (UTC)


 * Also don't see mention/link/etc. of big number representations. 74.78.162.229 (talk) 18:07, 6 July 2008 (UTC)


 * Couldn't find ancient Xeroxed reference which discussed topic, I believe in terms of the topology of mapping functional integer representations of reals and real valued functions. Since it was a critical one and also can't identify it with superficial web search, dropping this. 74.78.162.229 (talk) 10:23, 8 July 2008 (UTC)

Representing Zero in Floating Point
Can someone explain to me the ways that Zero was represented in floating point over the years? It appears clear to me that the sign bit has no usefulness in representing zero. Were there any forms of floating point where Zero was specified in the Exponent rather than the Mantissa? 198.177.27.18 (talk) 02:41, 5 December 2008 (UTC)
 * Well, the mantissa representation of zero is a natural fallout of IEEE's ability to represent denormal numbers. A representation without this capacity would have to use a special representation for zero based on the exponent or some reserved field. As it turns out there is some usefulness in having a positive and negative zero - even when you underflow to zero, you can still determine the sign of your actual result (and you can differentiate 1.0/infinity and 1.0/(-infinity)). Dcoetzee 02:48, 5 December 2008 (UTC)


 * Ah, that's a good point. And if you can tie that in with some real world instances, I can make a history of it.   Perhaps you can tell me how floating point representations of Zero for the Apple II (for instance) departed from representations of Zero for the Atari 8 bit?   That would naturally take you back to the early 1980s.   I doubt there was any kind of a standard for the PC compatible, except that which was observed by those programmers wishing to be IEEE compliant, and that is still kind of fuzzy to me, even after reading the main article.  Most of the PC programmers I knew in the early 1980s had no interest in complying with IEEE standards. 198.177.27.25 (talk) 09:03, 5 December 2008 (UTC)


 * It's very nice to have the property that 1/(1/x) gives x, even when x is ±infinity. Regarding the bit representation, it's also extremely useful that (positive) zero is represented by all-zero bits.  It means that large blocks of numbers in memory can be set to zero at once using memset and similar low-level operations.  —Steven G. Johnson (talk) 02:55, 5 December 2008 (UTC)


 * So, if the mantissa is 30 bits long, and they are all zero bits, that is enough to make the thing positive Zero (assuming the Sign Bit is clear), and if the bits in the mantissa are all ones, that would make it a negative Zero? Shouldn't you have a flag somewhere (not in the exponent, but somewhere else) showing that there is an overflow/underflow condition? 198.177.27.25 (talk) 09:11, 5 December 2008 (UTC)
 * Minus zero in IEEE floating point is all zero except for the sign bit. See −0 (number). The value to the practical world of having a negative zero still remains to be seen. This article has little information on historic (pre-IEEE) floating point formats, though perhaps they could be added. My impression is that minus-zero and the denormals are used because of the influence of William Kahan. The article says little about Kahan's role in the creation of the floating point formats we now use. Probably it should be expanded in that area, either in the present article or in IEEE 754. EdJohnston (talk) 19:25, 5 December 2008 (UTC)
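
The behaviour described in this thread (the sign surviving underflow and division by infinity, positive zero being all-zero bits, and minus zero differing only in the sign bit) can be observed with IEEE 754 doubles in Java. The sketch is only illustrative; the class `SignedZero` and helper `hexBits` are names invented here.

```java
public class SignedZero {
    // Raw 64-bit pattern of a double, printed in hexadecimal.
    static String hexBits(double x) {
        return Long.toHexString(Double.doubleToRawLongBits(x));
    }

    public static void main(String[] args) {
        double fromPosInf = 1.0 / Double.POSITIVE_INFINITY; //  0.0
        double fromNegInf = 1.0 / Double.NEGATIVE_INFINITY; // -0.0
        System.out.println(fromPosInf == fromNegInf);        // true: the zeros compare equal
        System.out.println(1.0 / fromPosInf);                // Infinity
        System.out.println(1.0 / fromNegInf);                // -Infinity: the sign was kept
        System.out.println(hexBits(0.0));                    // 0
        System.out.println(hexBits(-0.0));                   // 8000000000000000
        double underflow = -1e-200 * 1e-200;                 // too small for any double
        System.out.println(underflow);                       // -0.0
    }
}
```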

Historically, different bases have been used
Right now, there is a sentence in the overview that says: Historically, different bases have been used for floating-point, but until recently almost all modern computer architectures used base 2, or binary.

This sentence does not make much sense to me either grammatically or factually. Grammatically, the "until recently" seems redundant with "all modern". Factually, it makes even less sense. Early computers often used base 10 floating point. Later, as computers evolved, binary was used. Then the IBM 360 series used base-16 (with its wonderful wobbling precision), and this format was still widely used into the 1990s. I'm pretty sure supercomputer manufacturers such as CDC and Cray used base 2. DEC also used base 2 for the VAX, which was the basis for IEEE 754 starting in the late 1970s, but by the early 1980s, Intel was allowing decimal bases in the 8087. More recently, IEEE 754-2008 has also allowed decimal bases.
 * (Aside: the 8087 was always only binary floating-point (with instructions to convert to and from BCD integers), which is why IEEE 754-1985 was binary, even though many of the committee members (including Prof. Kahan) argued that decimal should have been used. mfc (talk) 10:28, 29 December 2008 (UTC))
 * My 8087 databook agrees with you rather than my faulty memory. And, the BCD in the FPU wasn't even floating point, but large integers. Wrs1864 (talk) 15:55, 29 December 2008 (UTC)

I have not followed this article, so I'm really not sure what the intent of this sentence is. I guess I would rewrite it to something like: Historically, different bases have been used for floating-point, with both base 2 (binary) being the most common followed by base 10 (decimal), and then many lesser varieties such as base 16. Wrs1864 (talk) 19:52, 26 December 2008 (UTC)

That sounds about right. In context it needed to be a bit different, ended up with "Historically, different bases have been used for representing floating-point numbers, with base 2 (binary) being the most common, followed by base 10 (decimal), and other less common varieties such as base 16. In a binary floating-point representation...". mfc (talk) 10:08, 29 December 2008 (UTC)


 * Much depends on the period, but base 16 floating point is probably more common than base 10, especially after the demise of the decimal arithmetic, variable word length machines. For example all the IBM mainframes (360, 370, 390 etc and compatibles) were base 16. So were some of the non-DEC minicomputers. I'm not aware of a decimal floating point implementation in hardware from the 1970s on. That doesn't mean it does not exist. There are likely software implementations of base 10 floating point formats in various BASIC interpreters for machines without native floating point. Ferritecore (talk) 12:21, 29 December 2008 (UTC)
 * Thanks to mfc reminding me of how the x87 works, I would agree that base 16 was probably more common than base 10. Part of my confusion may have been due to IEEE 854, but I don't think that was ever widely used. Wrs1864 (talk) 15:55, 29 December 2008 (UTC)
 * You're welcome :-) mfc (talk) 08:32, 30 December 2008 (UTC)


 * There was indeed a big gap in hardware implementations of base-10 floating point. However, there are now four recent ones (POWER6 and IBM System z9 in 2007, and IBM System z10 and Silminds' in 2008).  If you add those to all the ones in the 1950s and 1960s, there are probably more base 10 designs than base 16.  However, it is possible that more base 16 have actually been manufactured (and they still are).  On the other hand, no new designs in base 16 are likely.  So it all depends what one counts... mfc (talk) 08:32, 30 December 2008 (UTC)
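
For readers wondering what difference the base makes in practice, here is a small Java sketch contrasting binary doubles with decimal arithmetic. It is an approximation of the idea rather than an IEEE 754-2008 implementation: `BigDecimal` with `MathContext.DECIMAL64` gives 16 significant decimal digits with round-half-even, but unlike a true decimal64 format it has no fixed exponent range. The class `DecimalBase` and its helpers are hypothetical names.

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class DecimalBase {
    static final MathContext DEC64 = MathContext.DECIMAL64; // 16 digits, half-even

    // Add two decimal literals, rounding the result to 16 significant digits.
    static BigDecimal add(String a, String b) {
        return new BigDecimal(a).add(new BigDecimal(b), DEC64);
    }

    // 1/3 rounded to 16 significant decimal digits.
    static BigDecimal third() {
        return BigDecimal.ONE.divide(BigDecimal.valueOf(3), DEC64);
    }

    public static void main(String[] args) {
        // In base 10, tenths are representable, so this sum is exact.
        System.out.println(add("0.1", "0.2"));
        // In base 2 the same sum is rounded.
        System.out.println(0.1 + 0.2);
        // Base 10 still cannot hold 1/3; a different base only moves the problem.
        System.out.println(third());
    }
}
```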

Good article
I like this article. I don't really see why it is still rated start-class. It's a bit too short and could need a few pictures, but it is (imo) well written. Keep up the good work! --mafutrct (talk) 10:30, 13 January 2009 (UTC)

"log summation" avoids round-off error?
Recently, there was an addition to the article about how using log summation can "reduce or avoid round-off errors". (And similar language was added to the linked article.) This just doesn't sound right to me, I don't see how more information per digit can be stored in logarithm form, any more than multiplying by ten or taking the reciprocal changes the information per digit. My gut feel is that whether this representation ends up being more or less accurate for any given calculation has to do with where the last digit ends up getting rounded to. In some representations, the rounding of the last digit may cancel out errors or happen to end up having very little rounding error, while other representations may be "unlucky". Thoughts? Wrs1864 (talk) 01:47, 1 March 2009 (UTC)


 * You are right. Use of logarithms will result in a significant increase in rounding error for any values that are close to each other (since the range of representable values is compressed).  There are better solutions for handling the situations where the range of values being added/subtracted varies by orders of magnitude. Derek farn (talk) 02:20, 1 March 2009 (UTC)

No mention of Double-double
I noticed there is no mention of double-double precision numbers. This is a fine way of emulating ultra-high precision floating point numbers without pure software emulation, and hence is faster.

Double-double precision uses two double-precision numbers to represent a single number. (Technically, it could use any two floating point numbers, even two of different kinds.)

The first number is called the high part, the other the low part. The high part encodes the main portion of the number, e.g. 3.141592653589793, and the low part encodes the exact residue (π − hi), which gives

 π  = 3.14159265358979323846264338327950288...
 hi = 3.14159265358979311599796346854419
 lo = 0.00000000000000012246467991473531
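The split quoted above can be reproduced with a few lines of Java. This is a minimal sketch, not the poster's implementation: the class name PiSplitSketch and its methods are made up for illustration, and the digits of π are simply the ones quoted in this comment.

```java
import java.math.BigDecimal;

final class PiSplitSketch {
    // Digits of pi as quoted in the comment above (truncated, not rounded).
    static final BigDecimal PI_DIGITS =
            new BigDecimal("3.14159265358979323846264338327950288");

    // High part: the double nearest to the decimal string.
    static double hi() {
        return PI_DIGITS.doubleValue();
    }

    // Low part: what the high part missed. new BigDecimal(double) converts
    // exactly, so the subtraction is exact; only doubleValue() rounds.
    static double lo() {
        return PI_DIGITS.subtract(new BigDecimal(hi())).doubleValue();
    }

    public static void main(String[] args) {
        System.out.println("hi = " + new BigDecimal(hi()).toPlainString());
        System.out.println("lo = " + lo());
        // The low part is below half a unit in the last place of hi,
        // so adding it in plain double arithmetic changes nothing.
        System.out.println("hi + lo == hi: " + (hi() + lo() == hi()));
    }
}
```

The pair (hi, lo) only carries extra information as long as the two parts are kept separate; adding them back together in ordinary double arithmetic just returns hi.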

As two double precision numbers are used together, the combination does not have the same precision as an IEEE 128-bit floating-point number, even though it occupies the same space. This is because they store two signs, two exponents and two mantissas.

ie.:

Even though double-double has less precision than quad precision, it is able to store numbers quad precision can't. Consider this number: 1.00000000000000000000000000000000001, which in quad precision would be rounded to 1.0, but which can be expressed by a tuple whose high part is 1.0 and whose low part holds the tiny remainder (the number minus 1.0).

Arithmetic with double-double precision is not straightforward, but it is nonetheless a lot faster than pure software emulation (e.g. arbitrary precision).

Consider these two basic functions in Java:

 public DoubleDouble add(DoubleDouble y) {
     double a, b, c, d, e, f;
     e = this.hi + y.hi;              // rounded sum of the high parts
     d = this.hi - e;
     a = this.lo + y.lo;              // rounded sum of the low parts
     f = this.lo - a;
     d = ((this.hi - (d + e)) + (d + y.hi)) + a;
     b = e + d;
     c = ((this.lo - (f + a)) + (f + y.lo)) + (d + (e - b));
     a = b + c;
     return new DoubleDouble(a, c + (b - a));
 }

 public DoubleDouble mul(DoubleDouble y) {
     double a, b, c, d, e;
     a = 0x08000001 * this.hi;        // 0x08000001 = 2^27 + 1, used to split hi
     a += this.hi - a;
     b = this.hi - a;
     c = 0x08000001 * y.hi;
     c += y.hi - c;
     d = y.hi - c;
     e = this.hi * y.hi;
     c = (((a * c - e) + (a * d + b * c)) + b * d) + (this.lo * y.hi + this.hi * y.lo);
     a = e + c;
     return new DoubleDouble(a, c + (e - a));
 }

In particular, after some (sub-)operations and at the very end, the numbers must be 'normalized'. This is the process of trying to store the complete result (hi+lo) in the hi part, and then subtracting the hi part from the original number, storing the remainder in the lo part. This introduces a lot of odd-looking parenthesized structures in the code. When these structures are simplified according to the usual rules of algebra, the functions will actually break.
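The normalization step described above rests on so-called error-free transformations. The following is a minimal sketch of the two standard variants; the class and method names are made up for illustration and are not taken from the QD library or from the code posted above.

```java
final class NormalizeSketch {
    // "Quick two-sum", valid when |a| >= |b|:
    // returns {s, err} where s is the rounded sum and s + err equals a + b exactly.
    static double[] quickTwoSum(double a, double b) {
        double s = a + b;
        double err = b - (s - a); // algebraically zero, numerically the rounding error
        return new double[] { s, err };
    }

    // Knuth's two-sum: same result without the ordering requirement.
    static double[] twoSum(double a, double b) {
        double s = a + b;
        double bb = s - a;
        double err = (a - (s - bb)) + (b - bb);
        return new double[] { s, err };
    }

    public static void main(String[] args) {
        double[] r = twoSum(1.0, 1e-20);
        // The small addend is lost from the sum but recovered in the error term.
        System.out.println("sum = " + r[0] + ", error term = " + r[1]);
    }
}
```

This also shows why algebraic simplification breaks the code: with s = a + b, the expression b - (s - a) simplifies to zero on paper, yet in floating point it is precisely the part of b that did not fit into s.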

As reference, I used the (somewhat buggy) QD library in C++ on the site []

Can some of this be included in the Floating point page or given its own page? --Zom-B (talk) 19:41, 1 May 2009 (UTC)

PS. I think Windows Calculator (WinXP and newer) uses double-double precision. The numbers I get for every operation I can think of have the same number of digits as my double-double implementation generates, which would be a rather odd coincidence if it used another kind of emulation.


 * Are you sure it's faster? add uses 20 floating point add operations whereas mul uses 12 add and 7 multiply operations. It'd be faster to use an integer-based floating point number and multi-word integer add/subtract operations (which can be done efficiently, as most architectures have an add-with-carry instruction). 1exec1 (talk) 16:06, 19 November 2010 (UTC)

Internal Representation Table
This table is supposed to show how the exponent and mantissa are packed into bits for various data sizes, but it also includes the exponent bias, which is an offset amount (rather than a bit count), and it took me several reads to understand (even though I am a programmer). I would suggest this information is better in a separate table or at least in a column off to the right of the main table.
 * I think just describing the columns better would do the job. For instance saying 'bits' for bits. Dmcq (talk) 20:12, 27 September 2009 (UTC)

C language
In the IEEE section, there's a discussion of C's "float" and "double" types, which asserts them to be 32 and 64 bits respectively. That's just wrong; the minimum sizes for C floats/doubles are not 32/64 bits. IEEE 754 is not mandated by the C standard, it is optional. The world is full of C-compilers for DSPs that use 32-bit doubles. - GWO (talk) 12:10, 23 October 2009 (UTC)
 * That is in the IEEE section. Anyway using 32 bits for double isn't even in conformance with quite ancient versions of the C standard that require I believe a minimum of 10 digits precision for double. Dmcq (talk) 12:49, 23 October 2009 (UTC)


 * I don't have it here, but the original K&R just says, more or less, that a double must not be shorter than a float, and on PDP-9s the word size was 9 bits anyway wasn't it? It doesn't say it has to be larger. The C spec says nothing about implementation at all. It would be perfectly conforming, I think, for an implementation to use 1 bit to represent a float and 1 bit to represent a double. Not useful, of course, but conforming. I'll check it when I get back to work (I don't trust online copies when I have a first edition on my desk.) Si Trew (talk) 01:09, 4 December 2009 (UTC)

Lead: rational. Is infinity a rational number?
I am reasonably well up on basic number theory, which makes me wonder if it can be said that FP represents a rational number, because infinity can usually be represented (at least it can in IEEE 754, as can NaN). Which leads to saying, either these are kinda special cases that are not really FP representations (I'll certainly accept that, cos 0 can't be represented except as a special case in a sign-mantissa-exponent representation either) or that infinity is a rational number (which I don't think it is, though the set of rational numbers is certainly infinite).

This is being picky I know, but if right at the start of the lead we are saying something that it is not, then that bothers me a little. I appreciate this sentence is intended for concision, but if it is simply wrong then that is bad. If it is meant that particular implementations of FP also allow numbers that are not rational, then that is OK too, but since this article specifically starts talking about representations and not simply the mathematical theory, there is some dilemma here I think.

Si Trew (talk) 01:04, 4 December 2009 (UTC)


 * It does seem a fairly silly first line. It doesn't really describe a floating point number at all. The 'rational' bit is the least of its troubles. I'll see if I can't phrase something a bit better. Dmcq (talk) 07:31, 4 December 2009 (UTC)


 * I think you've done a nice job there of putting it, as well as it can be, in a nutshell. Of course it need not explain all the exceptions, representations etc – there's the rest of the article for that – but in my experience, even mature software engineers tend to treat FP numbers as some kind of magic and basically a Pandora's box. For example, having been told not to trust the exactness of FP numbers, they will not accept that any decent implementation will be as accurate as the size of the mantissa allows, and that you can, for example, compare for equality if you know that the FP representation is actually storing only integers well within the range of the mantissa. They'll add calls to "compare it within a ratio" as Kernighan and Plauger recommend in The Elements of Programming Style, having never read that book at all, but heard it once on their computing course. It is nice to kinda explode some myths of FP like that. Si Trew (talk) 11:00, 4 December 2009 (UTC)
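The claim about exact integer comparison can be checked directly. A small sketch in Java, whose double is the IEEE 754 64-bit format with a 53-bit significand; the class and method names are made up for illustration.

```java
final class ExactIntegerSketch {
    // Sum 1 + 2 + ... + n in double arithmetic.
    static double sumUpTo(int n) {
        double s = 0.0;
        for (int i = 1; i <= n; i++) {
            s += i;
        }
        return s;
    }

    // Closed form computed in 64-bit integer arithmetic, for comparison.
    static long exactSumUpTo(int n) {
        return (long) n * (n + 1L) / 2L;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // Every partial sum is an integer far below 2^53, so nothing is rounded.
        System.out.println(sumUpTo(n) == (double) exactSumUpTo(n)); // true

        double limit = 9007199254740992.0; // 2^53
        // Beyond 2^53 not every integer is representable any more.
        System.out.println(limit + 1.0 == limit); // true

        // Decimal fractions are a different matter.
        System.out.println(0.1 + 0.2 == 0.3); // false
    }
}
```

Equality tests are therefore safe for integer-valued data in this range; it is fractions that are not representable in binary that call for a tolerance.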

Rational is not important, range is
I have been trying to say what the purpose of a floating point number is in the leader, but it keeps being changed to remove the part about representing a wide range of numbers and to say instead that it is a rational number. That the standard formats are rational numbers is not important; it is a property that is easily derived and can be mentioned further down. That it represents a wide range of numbers is very important - that's the whole purpose. Everything in a computer is bits so it is not important to mention they are composed of bits. Dmcq (talk) 18:15, 4 December 2009 (UTC)

Signed zero
In this section it says "1/(−0) returns negative infinity (exactly), while 1/+0 returns positive infinity (exactly)".

Does it? In the IEEE representation, I think it returns a NaN of some kind. But I could be wrong. Regardless, since this section is not specific to IEEE 754, and different implementations may implement it differently, I think the bald statement that it returns these is overgeneral. It could say "might" or "may" return these values; on other implementations it could differ (it might be undefined behavior). Si Trew (talk) 15:04, 8 January 2010 (UTC)

And I don't know what "(exactly)" is meant there. It could hardly return infinity approximately. Si Trew (talk) 15:05, 8 January 2010 (UTC)


 * If you look at the contents it is in the IEEE section and they returned the appropriate signed infinity not NaN. I don't know what the exactly is about either - I believe it can be removed. Dmcq (talk) 15:54, 8 January 2010 (UTC)
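For anyone who wants to check the behaviour, here is a minimal sketch in Java, whose double division follows IEEE 754. The class name is made up for illustration.

```java
final class SignedZeroSketch {
    static double reciprocal(double x) {
        return 1.0 / x;
    }

    public static void main(String[] args) {
        System.out.println(reciprocal(0.0));   // Infinity
        System.out.println(reciprocal(-0.0));  // -Infinity
        System.out.println(0.0 == -0.0);       // true: the two zeros compare equal
        System.out.println(0.0 / 0.0);         // NaN: only 0/0 is invalid
    }
}
```

So the two zeros compare equal but are distinguishable through division, and a NaN appears only for 0/0, not for a nonzero number divided by zero.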

Significand
A succession of editors keep coming to this article to correct the spelling of 'significand' to 'significant.' How would people feel about using 'mantissa' instead? The pros and cons of mantissa are well discussed in our free-standing article called significand. Although IEEE prefers 'significand' it's not clear that they should win, because we ought to be documenting common usage, and be neutral between equally good ideas. Another option is to reword the article to not use 'significand' over and over, because it's not a term that people new to floating point are likely to have heard. So it doesn't have much explanatory value, even though we use it in the article 30 times. EdJohnston (talk) 18:18, 23 November 2007 (UTC)
 * I suspect that the editors that are unfamiliar with the term 'significand' would be even more confused by the term 'mantissa'. At least the former is similar to a more common word (significant) which is in the right area.
 * If changing it at all, I would change it to 'coefficient', which means the same but is less technical than 'significand' and certainly more correct than 'mantissa'. mfc (talk) 20:09, 23 November 2007 (UTC)

"Significand" may be the most pedantically correct term, but it is linguistically unfortunate. Just look at all the recent edits that changed "significand" to "significant", just to be set right again. Looking at these edits I noticed that they typically involve the form "n significand bits". That phrase is syntactically correct with a single 'd' to 't' consonant substitution. Making the substitution changes the meaning. It obviously looks like a typo to a lot of people. I went through and recast the occurrences of "significand" followed by "bit" or "bits". For the most part I think it makes it harder to misread. Let's try it for a while, anyway. Ferritecore (talk) 01:28, 1 March 2008 (UTC)
 * Your change seems worthwhile to reduce confusion. There seems to be a reasonable argument that 'mantissa' is a worse alternative, but 'coefficient' still has some charm. The phrase 'floating point coefficient' gets about 8 times as many Google hits as 'floating point significand.' Those of us who are used to hearing the lingo of the IEEE standard may have overestimated how common 'significand' is. Most people have a notion of what a coefficient is from any math course, even when they're unfamiliar with floating point.  Anyone willing to support 'coefficient' for this article?  After all this is not the article about the IEEE standard. EdJohnston (talk) 02:24, 1 March 2008 (UTC)
 * I am not fond of the term 'significand', let's get that out of the way up front. I kept it in my edit after skimming the significand mantissa explanation earlier in the article. After some further investigation I am not convinced that significand is really any better. Let's look at the available alternatives:
 * coefficient does have more than just some charm. The floating point number can be expressed as a single-term polynomial (monomial) C × B^X with a coefficient, a base and an exponent. Reasonably educated people can be expected to recognize and correctly understand the word. If I had just invented floating point this would likely be the word I use.
 * mantissa is much maligned here. The argument given against it is at least partially incorrect. The Wolfram MathWorld definition does not mention logs at all. Mathematically, a mantissa is a fractional part of any real number, not just a log. My math handbook has a table of "mantissas of common logarithms", not a "table of mantissas". The term is also used to describe the corresponding part of a number represented in scientific notation. The term has a long and respectable history. The significand article has a 1947 John von Neumann quote using mantissa in the floating point sense. The term was the one nearly universally used. It is still in common use. The term is not perfect: the bit-field stored in a floating point number is not the mantissa of the number represented. It is the number represented, normalized to appear "in the form of a mantissa", or to have the appearance of a mantissa.
 * significand is a relatively new term. It was probably coined in response to perceived difficulties with mantissa. It is apparently used by the IEEE standard. As a coined word it means exactly what the coiner intended, without the baggage of prior or other meanings. In spite of the IEEE floating point standard being around for 20+ years, significand hasn't managed to displace mantissa in common usage, probably because it looks like a typo and can be confused with significant. The term does not appear anywhere on the Wolfram MathWorld site.
 * fraction is, oddly enough, used by the IEEE 754-1985 article. It is descriptive, universally understood (probably more so than coefficient). It strikes me as being slightly less accurate than coefficient.


 * I personally would prefer to use mantissa as the generally accepted technical term, but am unwilling to put much energy into fighting against significand. The discussion of the issue in the article needs to be corrected and improved. I may do so when I have the time to give it the proper care and balance.


 * I mention Wolfram MathWorld because there is a hazard in consulting mainstream dictionaries, even ones as respectable as the OED, for technical or scientific terms. They are frequently not quite right. Computer terms are frequently borrowed from other disciplines. Nobody worries much that a computer virus lacks a protein coat and a DNA core - biology is sufficiently removed from computer science. When borrowing a term from math, however, the separation is not so great. I went to Wolfram in hopes of getting a math insight into the terms. Ferritecore (talk) 15:01, 2 March 2008 (UTC)

All the IBM documentation that I have seen from S/360 through z/Architecture HFP (Hexadecimal Floating Point) uses fraction, as it is less than one in the usual interpretation. I believe coefficient is sometimes used for other machines, where it isn't a fraction. I don't mind coefficient or fraction. I like significand, as, being new, it doesn't have previous usage to confuse people. The term "mantissa of common logarithms" is correct, as the table does not include the characteristics of the logs (the integer part). Only in a logarithmic representation (discussed here I believe) would mantissa and characteristic be correct. Gah4 (talk) 23:02, 29 April 2011 (UTC)

More user-friendly Introduction
As a total amateur when it comes to computing, I am trying to find out what the floating point is all about. I was doing fine for a very short time before I came across this sentence in the introduction:

"For example, a fixed-point representation that has seven decimal digits, with the decimal point assumed to be positioned after the fifth digit, can represent the numbers 12345.67, 123.45, 1.23 and so on,"

How do the numbers 12345.67, 123.45, 1.23 correspond to the decimal point being assumed to be positioned after the fifth digit when 123.45 and 1.23 have neither five digits nor a decimal point after the (absent) "fifth digit"?

For an encyclopedia, which, I think, by definition, should be basically comprehensible even to amateurs, this statement is assuming too much knowledge of computing on the part of the reader.

Can someone please tweak it to make it a little more user-friendly? tripbeetle (talk) 3:36, 8 April 2010 (UTC)


 * What it means is a fixed point number that can occupy a space like #####.## where each # stands for a digit. 123.45 can occupy that space by putting 00123.45 in. Saying something like two decimal places would be better, I'll stick that in. Dmcq (talk) 10:16, 8 April 2010 (UTC)
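The #####.## picture can be sketched in code: a fixed-point value is just an integer count of hundredths, and the point is put back only when printing. This is an illustrative sketch; the class, constants and method names are assumptions, not anything from the article.

```java
import java.math.BigDecimal;
import java.util.Locale;

final class FixedPointSketch {
    static final int FRACTION_DIGITS = 2;    // digits after the point
    static final long MAX_RAW = 9_999_999L;  // seven digits in total: 99999.99

    // Store the value as an integer number of hundredths.
    // Throws ArithmeticException if the text needs more than two decimals
    // or does not fit the seven-digit field.
    static long parse(String text) {
        long raw = new BigDecimal(text).movePointRight(FRACTION_DIGITS).longValueExact();
        if (raw < 0 || raw > MAX_RAW) {
            throw new ArithmeticException("does not fit #####.##");
        }
        return raw;
    }

    // Print with the point re-inserted at its fixed position.
    static String format(long raw) {
        return String.format(Locale.ROOT, "%05d.%02d", raw / 100, raw % 100);
    }

    public static void main(String[] args) {
        for (String s : new String[] { "12345.67", "123.45", "1.23" }) {
            System.out.println(s + " -> raw " + parse(s) + " -> " + format(parse(s)));
        }
    }
}
```

Every value occupies the same seven digit positions; a floating-point format differs by storing the position of the point alongside the digits.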

Significand versus significant
I agree with the IP editor that significant is better than significand in the leader. Significand is a correct term for that part, but the term has not been introduced yet. Also it would be 'significand' versus 'significant digits'. Saying significant digits explains the bit well for a simple explanation before reading further into the article. Dmcq (talk) 23:35, 9 November 2010 (UTC)

Octuple Precision
Is there any compiler that supports octuple precision (256-bit binary)? --RaptorHunter (talk) 05:18, 18 November 2010 (UTC)


 * Not that I know of. Why on earth would anyone bother? If people needed such high precision they'd use a variable precision floating point package and do it via function calls Dmcq (talk) 10:38, 18 November 2010 (UTC)
 * As it is, it's hard to find any genuine uses for quadruple precision. E.g. a physical problem where the outcome can't be adequately predicted by a double precision calculation. EdJohnston (talk) 20:23, 18 November 2010 (UTC)
 * Sometimes you want all the precision possible. For instance, if you are trying to calculate the trajectory of 99942 Apophis (which used to have a 2.7% chance of hitting earth), then you want to avoid all possible error. When simulating trajectories, rounding errors become cumulative (see Floating_point). It's best to use numbers with as many digits as possible.--RaptorHunter (talk) 20:36, 18 November 2010 (UTC)
 * If you have information about the scientific work on the trajectory of 99942 Apophis, that shows it required more than double precision, and can find a reference, it might be interesting to put it in the Floating point article. EdJohnston (talk) 20:54, 18 November 2010 (UTC)
 * Here's an article about 64-bit errors when calculating apophis. (page 10 has a neat table)
 * I found all kinds of interesting articles:
 * Quote
 * Numerical integration error can accumulate through rounding or truncation caused by machine precision limits, especially near planetary close approaches when the time-step size must change rapidly. We found that the cumulative relative integration error for 1950 DA remains less than 1 km until 2105, thereafter oscillating around zero with a maximum amplitude of 200 km until the 2809 Earth encounter (22). It then grows to –9900 km at the 2880 encounter, changing the nominal time of close approach on 16 March 2880 by –12 min.
 * ^^Accuracy of asteroid projections using 64 bit.
 * Round of error in long term planetary orbit integrations
 * (BTW, I just found out about google scholar. It sure beats the normal page after page of yahoo answers that google dishes up for every query. If you want more sources, you can look at this link A lot of the articles are behind paywalls, but I discovered a trick. Click on the link for (All * versions) under the summary and just keep clicking until you find an article you can access. Useful stuff.)--RaptorHunter (talk) 22:43, 18 November 2010 (UTC)
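How rounding errors accumulate over many steps can be illustrated with a toy loop. This is only a sketch and has nothing to do with the orbit integrators in the papers above: it adds a time step of 0.1 ten million times, once in single and once in double precision. The class and method names are made up for illustration.

```java
final class AccumulationSketch {
    // Advance a clock by 'steps' increments of dt in single precision.
    static float advanceFloat(int steps, float dt) {
        float t = 0f;
        for (int i = 0; i < steps; i++) {
            t += dt;
        }
        return t;
    }

    // The same loop in double precision.
    static double advanceDouble(int steps, double dt) {
        double t = 0.0;
        for (int i = 0; i < steps; i++) {
            t += dt;
        }
        return t;
    }

    public static void main(String[] args) {
        int steps = 10_000_000;
        double exact = 1_000_000.0; // 10,000,000 steps of 0.1
        float f = advanceFloat(steps, 0.1f);
        double d = advanceDouble(steps, 0.1);
        System.out.println("float  result " + f + ", error " + Math.abs(f - exact));
        System.out.println("double result " + d + ", error " + Math.abs(d - exact));
    }
}
```

Once the running total is large, each single-precision addition rounds the increment to the nearest representable spacing, so the error grows with the number of steps; the double-precision loop has the same defect at a much smaller scale.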

Octuple precision was more difficult to find; for instance, this intriguing quote is behind a paywall:
 * "This implies that three-body scattering calculations are severely limited by the finite wordlength of computers. Worse still, in the more extreme cases even octuple precision would not be sufficient."

Octuple precision for binary stars

This article says: the mathcw library is designed to pave the way for future octuple-precision arithmetic in a 256-bit format offering a significand of 235 bits in binary. (Other articles seem to disagree about the size of the significand. It seems to vary between 224 and 240. Guess there's not a set standard yet. I know the new Sandy Bridge CPUs have 256-bit SIMD registers and the new AVX instructions. Maybe they will create a standard octuple precision then.) Perhaps it could be added to the article?--RaptorHunter (talk) 22:43, 18 November 2010 (UTC)


 * Anyway nobody's going to write special compiler support for this sort of stuff. What you could do, though, in a number of languages like C++, is write your own octuple class which does calls to subroutines for the various operations and so enables you to write ordinary arithmetic expressions. Dmcq (talk) 23:25, 18 November 2010 (UTC)
 * By the way, the SIMD instructions only support single and double precision, plus some for 16 bit floating point. SIMD means single instruction multiple data - a number of these operands are handled in parallel. Dmcq (talk) 23:34, 18 November 2010 (UTC)

Octuple precision is non-existent. It won't be implemented in the coming decades and I doubt there'll be need for that. Even quadruple precision is not yet implemented in hardware. Double precision support was added to the mainstream processors because there was apparent need in the consumer market, not because some scientific application would benefit from it. Extending the range and precision of the computations even more does not benefit the consumer. Even if it did, the applications will almost certainly use an arbitrary precision arithmetic library since it gives a hell of a lot more freedom for the implementation. All these SSE or AVX extensions are just for optimization of parallel algorithms. An SSE 128-bit register can hold 4 single precision floating point numbers or 2 double precision numbers, while an AVX register can hold 8 and 4 numbers respectively. 1exec1 (talk) 15:13, 19 November 2010 (UTC)
 * Even if it's just being implemented in software, I still think it would be interesting to add a new section for octuple precision: when it's needed and how so. As you can see, I have lots of good sources. I think I'll add it later today.--RaptorHunter (talk) 20:18, 19 November 2010 (UTC)
 * Octuple precision is not standardised, not supported and not used (as arbitrary precision packages are used instead). Thus it fails the wikipedia notability criteria and should (shall?) not be included.1exec1 (talk) 21:23, 19 November 2010 (UTC)
 * Read my sources above. I have scholarly papers of people using octuple precision, not arbitrary. There are libraries you can import into c to make this happen.--RaptorHunter (talk) 23:38, 19 November 2010 (UTC)
 * And as I said above you can set up your own classes for octuple precision. A compiler would not add to efficiency in any appreciable way and no-one is liable to pay for a professionally tested compiler and too tiny a part of the market would be interested. I worked once on a machine with 128 bit packed decimal registers and I think there may be some quad precision hardware floating point implementations in the future. Dmcq (talk) 00:24, 20 November 2010 (UTC)
 * Might be, but also might not be. I see neither a current nor a future need for that in any of the market segments. Consumer market won't need that. The only place where some benefits could be seen is HPC. However, the supercomputer market is shifting towards GPGPU architecture. Quad precision would be very inefficient there as it's hard to parallelize (the operations would have long latencies with low throughput and the part of the program using it would easily become a bottleneck). Also, it's quite easy to set up quad precision math using integer hardware and it's easy for GPGPU hardware vendors to include more integer cores. Not to mention the benefits of using a much more portable approach. Thus I believe that even if quad precision were implemented in hardware, it'd have very limited usage. And because of that vendors won't implement it in the first place. 1exec1 (talk) 13:26, 20 November 2010 (UTC)
 * These papers make no difference. How does an octuple precision library differ from an arbitrary precision library if both do the calculations in software using integer math? By arbitrary precision library I mean a library which does not necessarily always calculate 1000000 digits of pi, but can do it on request. You can view octuple precision as just a subset of the capabilities that such a library provides, since you can easily set it to do the calculations only in octuple precision. So usage of the term octuple precision doesn't have a lot in common with an octuple precision floating point format. At least as of now. 1exec1 (talk) 13:15, 20 November 2010 (UTC)
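The idea of fixing an arbitrary-precision library at one working precision can be sketched with Java's BigDecimal. The assumptions here are loud ones: BigDecimal counts decimal digits rather than bits, and 71 digits is chosen only because a 237-bit binary significand corresponds to roughly 71 decimal digits (the sources quoted above give other widths). The class name and helper methods are made up for illustration.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

final class FixedWidePrecisionSketch {
    // Assumption: 71 decimal digits stand in for a 237-bit binary significand
    // (237 * log10(2) is about 71.3). This is a decimal approximation only.
    static final MathContext WIDE = new MathContext(71, RoundingMode.HALF_EVEN);

    static BigDecimal divide(BigDecimal a, BigDecimal b) {
        return a.divide(b, WIDE);
    }

    static BigDecimal multiply(BigDecimal a, BigDecimal b) {
        return a.multiply(b, WIDE);
    }

    public static void main(String[] args) {
        BigDecimal third = divide(BigDecimal.ONE, BigDecimal.valueOf(3));
        System.out.println("wide precision: " + third);
        System.out.println("double:         " + (1.0 / 3.0));
    }
}
```

Every operation goes through a library call, which is the cost both sides of this discussion have in mind; whether one calls the result "octuple precision" or "arbitrary precision set to a fixed width" is then mostly a matter of naming.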


 * By the way IEEE quad precision has been implemented in hardware on the IBM z series. IBM also implemented quad precision in their 360 series and DEC did in their VAX series. In the older series hardware support would only be present on a few machines but both had emulation packages as far as I know and the quad was a double double implementation. I should have remembered about the IBM one as I've seen emulator code for the operations. Dmcq (talk) 19:42, 20 November 2010 (UTC)
 * Seemingly the VAX H format was an independent format similar to IEEE and not double double, also some or all the Cray machines had a quad precision which they called double and 32 bit was called half by them. Dmcq (talk) 19:57, 20 November 2010 (UTC)

GMP
Any thoughts on adding a reference or external link to the GNU Multiple Precision Arithmetic Library (GMP)? http://gmplib.org/

"GMP is a free library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating point numbers." —Preceding unsigned comment added by 99.32.166.179 (talk) 19:49, 8 January 2011 (UTC)


 * I suppose you could put a link to GNU Multi-Precision Library in the see also, but it doesn't seem to me to warrant anything in the main article. Dmcq (talk) 20:51, 8 January 2011 (UTC)

Alternatives to the FP representation
There should be a mention of continued fractions in that section. Software implementations of them already exist, and they have incredible properties, especially together with lazily-evaluated languages like Haskell. Whitehorses2501 (talk) 00:15, 29 April 2011 (UTC)
 * Only if we had some notability and a reliable source. That thing you pointed at was a blog of someone's efforts and they haven't even figured out how to multiply the square root of 2 by itself yet. Dmcq (talk) 08:18, 29 April 2011 (UTC)

formula to calculate Range of floating-point numbers
The Range of floating-point numbers section says Would this be more clear if expressed as: The latter is more consistent with the earlier (in Overview) notation of value = s × b^e (where b=2, e=1023). 63.116.23.136 (talk) 05:10, 1 July 2011 (UTC)

✅

Unnecessary precision
As requested after someone stuck in loads of extra digits of pi, I have set up a section in this talk page for discussion, in case somebody else thinks loads of digits which have nothing to do with the topic are a good idea. Until then the consensus on the matter is pretty apparent from the history, and a big long uninteresting and irrelevant string of digits should not be put in. Dmcq (talk) 17:35, 5 October 2011 (UTC)


 * In addition, everyone involved needs to read and follow Edit warring I have placed warnings on the userpages of everyone who is at 2RR.


 * Derek farn, the consensus is against you on this one. Davidhorman, Dmcq and Guy Macon all agree that going ten digits past the number of digits in the single precision example is enough to get the point across, and that more digits than that detract from the article. Guy Macon (talk) 21:19, 5 October 2011 (UTC)


 * Well perhaps you could explain why you want so many digits in yourself. Having a load of unnecessary digits just encourages people to add extra ones, as far as I can see, as a kind of pointy comment on the length. Why did you want some digits in grey past the first seven in bold, and do we really need thirty digits to see the difference? Dmcq (talk) 23:45, 5 October 2011 (UTC)

Software of book side of history
I just reverted a bit about the Pilot Ace in history because it used software to emulate floating point. However it occurs to me that there might be something worthwhile in the bit about J.H.Wilkinson, Rounding errors in algebraic processes. Is there evidence about who wrote a book about floating point or that this was a particular turning point? Dmcq (talk) 18:46, 6 February 2012 (UTC)

IEEE 754
I have added a section discussing the "big picture" on the rationale and use for the IEEE 754 features, which often gets lost when discussing the details. I plan to add specific references for the points made there (from Kahan's web site). It would be good to expand the examples and add additional ones as well.

Brianbjparker (talk) —Preceding undated comment added 11:22, 19 February 2012 (UTC).


 * You need to cite something saying these were accepted rationales for it. Citations point to specific books journals or newspapers and preferably page number ranges. Dmcq (talk) 13:51, 19 February 2012 (UTC)

Added direct citations as requested. Brianbjparker (talk) —Preceding undated comment added 18:20, 19 February 2012 (UTC).


 * Thanks. My feeling about Kahan and his diatribe against Java is that he just doesn't get what programmers have to do when testing a program. Having a switch to enable lax typing of intermediate results where you know it will only be run in environments you've tested is a good idea but that wasn't what Java was originally designed for. The section about extended precision there seems undue in length as I'm pretty certain other considerations like signed zero and denormal handling were the main original considerations where it differed from previous implementations. Dmcq (talk) 20:37, 19 February 2012 (UTC)


 * Although I referenced Kahan's Java paper several times, I certainly didn't want this section to appear as a slight against Java. Kahan has several other papers discussing the need for extended precision that do not mention Java-- I will replace the current references with those in the near future, and try to trim it down (although I don't think that that reference is a diatribe against Java, just against its numerics). I certainly didn't want to get into the tradeoffs between improved numerical precision of results versus exact reproducibility in Java in this section. I do however think that it is important to clarify the intended use of the IEEE754 features in an introductory article like this, which can get lost in detailed descriptions of the features. In particular, I find that there is *wide* misunderstanding of the intended use of, and need for, extended precision amongst the programming community, particularly as extended precision was historically not supported in several RISC processors, and thus it is underused by programmers, even when targeting the x86 platform for e.g. HPC (even when these same programmers would carry additional significant figures for intermediate calculations if doing the same computations by hand, as alluded to in this section). Also, Kahan's descriptions of work on the design of the x87 (based on his experience designing HP calculators which use extended precision internally) makes it clear that extended precision was intended as a key feature (indeed a recommended feature) of IEEE754, compared with previous implementations.

Brianbjparker (talk) 00:56, 20 February 2012 (UTC)


 * As far as I'm aware the main other rationales were
 * To have a sound mathematical basis in that results were correctly rounded versions of accurate results and also so reasoning about the calculations would be easier.
 * Round to even was used to improve accuracy. In fact this is much more important than extended precision if the double storage mode is only used for intermediate calculations. Using extended precision only gives about one extra bit overall at the end if values in arrays are in doubles. The main reason I believe they were put in was that it made calculating mathematical functions much easier and more accurate; they can also be used in inner routines with benefit.
 * Biased rounding was put in I believe to support interval arithmetic - another part of being able to guarantee the results of calculations. Dmcq (talk) 15:43, 20 February 2012 (UTC)


 * "Using extended precision only gives about one extra bit overall at the end if values in arrays are in doubles." This is false in general; you must be thinking of some special cases where not many intermediate calculations happen before rounding to double for storage.  For a counterexample, consider a loop to take a dot product of two double-precision arrays (not using Kahan summation etc.) — Steven G. Johnson (talk) 21:16, 20 February 2012 (UTC)
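The dot-product counterexample can be sketched in Python. This is only an illustration under stated assumptions: Python has no 80-bit extended type, so exact rational accumulation (`Fraction`) stands in for a wider accumulator, and the input arrays and function names are hypothetical, chosen to force cancellation. The intermediate value 1e16 + 1 needs 54 significand bits, so it would fit in a 64-bit extended significand but not in a 53-bit double.

```python
from fractions import Fraction

def dot_double(xs, ys):
    """Dot product with every partial sum rounded to double."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc

def dot_wide(xs, ys):
    """Dot product with an exact accumulator, rounded to double once at the end."""
    acc = Fraction(0)
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)
    return float(acc)

# Hypothetical data: the middle term is absorbed when accumulated in double.
xs = [1e16, 1.0, -1e16]
ys = [1.0, 1.0, 1.0]

narrow = dot_double(xs, ys)  # 1e16 + 1 rounds back to 1e16, so the 1 is lost
wide = dot_wide(xs, ys)      # the 1 survives until the final rounding
print(narrow, wide)
```

Here the loss is total rather than one bit, although real data are rarely this adversarial.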


 * You would normally get very little advantage in that case over round to even with so few intermediate calculations. And for longer calculations round to even wins over just using a longer mantissa and rounding down. You only get a worthwhile gain if the storage is in extended precision. Dmcq (talk) 21:53, 20 February 2012 (UTC)


 * That is certainly not the case in general. The examples you are thinking of are using simple exactly rounded single arithmetic expressions-- the advantage of extended precision is avoiding loss of precision in more complicated numerically unstable formulae-- e.g. it is easy to construct examples where even computing a quadratic formula discriminant can cause massive loss of ULP when computed in double but not in double extended. Several examples are given in the Kahan references. This is in addition to the advantage of the extended exponent in avoiding overflow in e.g. dot products. Brianbjparker (talk) 00:16, 22 February 2012 (UTC)
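A minimal sketch of the discriminant case follows. The coefficients are hypothetical and were picked so that b*b needs 55 significand bits; they are not taken from the Kahan references. Exact rational arithmetic stands in for a wider intermediate format, since a 64-bit extended significand would hold this particular b*b exactly.

```python
from fractions import Fraction

# Hypothetical coefficients of a*x**2 + b*x + c, all exactly representable doubles.
a = 1.0
b = 2.0**27 + 1.0
c = 2.0**52 + 2.0**26

# Evaluated in double: b*b = 2**54 + 2**28 + 1 is rounded to 2**54 + 2**28,
# which equals 4*a*c exactly, so the computed discriminant cancels to zero.
disc_double = b * b - 4.0 * a * c

# Evaluated exactly from the same stored inputs.
disc_exact = Fraction(b) * Fraction(b) - 4 * Fraction(a) * Fraction(c)

print(disc_double, disc_exact)
```

The double result reports a repeated root, whereas the exact discriminant of the same stored inputs is positive.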


 * When you say "Round to even was used to improve accuracy", I take it you are mainly referring to the exact rounding: breaking ties by round to even does avoid some additional statistical biases but it is rather subtle (might be worth mentioning in the main text though). Brianbjparker (talk) 00:16, 22 February 2012 (UTC)


 * "Biased rounding was put in I believe to support interval arithmetic." Yes, I believe directed rounding was included to support interval arithmetic, but also for debugging numerical stability issues-- if an algorithm gives drastically different results under round to + and - infinity then it is likely unstable. Brianbjparker (talk) 00:16, 22 February 2012 (UTC)


 * "As far as I'm aware the main other rationales were... to have a sound mathematical basis in that results were correctly rounded versions of accurate results and also so reasoning about the calculations would be easier." Yes, the exact rounding is an important point-- I have added some additional text earlier in the article to expand on this.  It is true that, like previous arithmetics, having a precise specification to allow expert numerical analysts to write robust libraries was an important consideration, but the unique aspect of IEEE-754 is that it was also aimed at a broad market of non-expert users and so I focused in the section on the robustness features relevant to that (I will add some text highlighting that aspect as well though). Brianbjparker (talk) 00:16, 22 February 2012 (UTC)


 * Well exact rounding, but I thought it better to specify the precise format they have. The point is that rounding rather than truncating is what really matters. With rounding the error only tends to go up as the square root of the number of operations whereas with directed rounding it goes up linearly. Even the reduction of bias by round to even matters in this. You always get something else putting in a little bias so it is not as good as this but directed rounding is really bad. You're better off just perturbing the original figures for stability checking.
 * The mathematical basis makes it much easier to do things like construct longer precision arithmetic packages easily, in fact the fused multiply is particularly useful for this. Dmcq (talk) 00:27, 22 February 2012 (UTC)
 * The use of directed rounding for diagnosis of stability issues is discussed here http://www.cs.berkeley.edu/~wkahan/Stnfrd50.pdf and in other references at that web site. It also discusses why perturbation alone is not as useful. IEEE 754-2008 annex B states this explicitly-- "B.2 Numerical sensitivity: Debuggers should be able to alter the attributes governing handling of rounding or exceptions inside subprograms, even if the source code for those subprograms is not available; dynamic modes might be used for this purpose. For instance, changing the rounding direction or precision during execution might help identify subprograms that are unusually sensitive to rounding, whether due to ill-condition of the problem being solved, instability in the algorithm chosen, or an algorithm designed to work in only one rounding-direction attribute. The ultimate goal is to determine responsibility for numerical misbehavior, especially in separately-compiled subprograms. The chosen means to achieve this ultimate goal is to facilitate the production of small reproducible test cases that elicit unexpected behavior." Brianbjparker (talk) 01:04, 22 February 2012 (UTC)
 * The uses that somebody makes of features is quite a different thing from the rationale for why somebody would pay to have them implemented. The introduction to the standard gives a succinct summary of the main reasons for the standard. I'll just copy the latest here so you can see:


 * a) Facilitate movement of existing programs from diverse computers to those that adhere to this standard as well as among those that adhere to this standard.
 * b) Enhance the capabilities and safety available to users and programmers who, although not expert in numerical methods, might well be attempting to produce numerically sophisticated programs.
 * c) Encourage experts to develop and distribute robust and efficient numerical programs that are portable, by way of minor editing and recompilation, onto any computer that conforms to this standard and possesses adequate capacity. Together with language controls it should be possible to write programs that produce identical results on all conforming systems.
 * d) Provide direct support for
 * ― execution-time diagnosis of anomalies
 * ― smoother handling of exceptions
 * ― interval arithmetic at a reasonable cost.
 * e) Provide for development of
 * ― standard elementary functions such as exp and cos
 * ― high precision (multiword) arithmetic
 * ― coupled numerical and symbolic algebraic computation.
 * f) Enable rather than preclude further refinements and extensions.
 * There are other things but this is what the basic rationale was and is. Directed rounding was for interval arithmetic. Dmcq (talk) 01:56, 22 February 2012 (UTC)


 * Thanks. Actually, I believe that "d) Provide direct support for― execution-time diagnosis of anomalies" is referring to this use of directed rounding to diagnose numerical instability. Certainly Kahan makes it clear that he considered it a key usage from the early design of the x87. I agree that its use for interval arithmetic was also considered from the beginning. Brianbjparker (talk) 02:11, 22 February 2012 (UTC)
 * No that refers to identification and methods of notifying the various exceptions and the handling of the signalling and quiet NaNs. Your reference from 2007 does not support in any way that arbitrarily jiggling the calculations using directed rounding was considered as a reason to include directed rounding in the specification. He'd have been just laughed at if he had justified spending money on the 8087 for such a purpose when there are easy ways of doing something like that without any hardware assistance. Dmcq (talk) 08:23, 22 February 2012 (UTC)
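Both points raised in this thread, rerunning a computation under different rounding directions and the faster error growth of directed rounding, can be tried out in a small sketch. It rests on assumptions: Python floats cannot switch rounding mode, so the `decimal` module with a 6-digit working precision stands in for hardware rounding modes, and the input data are pseudo-random and hypothetical.

```python
import random
from decimal import Context, Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_EVEN

# Hypothetical input: 5000 six-decimal values in (0, 1), fixed seed for repeatability.
rng = random.Random(0)
terms = [Decimal(rng.randrange(1, 10**6)) / 1000000 for _ in range(5000)]
exact = sum(terms)  # exact at the default 28-digit precision

def rounded_sum(rounding, prec=6):
    """Sum the terms, rounding every partial sum to `prec` digits in the given mode."""
    ctx = Context(prec=prec, rounding=rounding)
    acc = Decimal(0)
    for t in terms:
        acc = ctx.add(acc, t)
    return acc

lo = rounded_sum(ROUND_FLOOR)        # every rounding error points downward
near = rounded_sum(ROUND_HALF_EVEN)  # errors of both signs largely cancel
hi = rounded_sum(ROUND_CEILING)      # every rounding error points upward

print("floor  :", lo, "error", lo - exact)
print("nearest:", near, "error", near - exact)
print("ceiling:", hi, "error", hi - exact)
```

The floor and ceiling results always bracket the exact sum, which is the property interval arithmetic relies on; the width of that bracket is what a sensitivity check of this kind would inspect.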

Trivia removed
I removed the statement that the full precision of extended precision is attained when extended precision is used. The point about the algorithm is that it converges using the precision used. We don't need to put in the precisions of the single, double and extended precision versions of the algorithm. Dmcq (talk) 23:23, 23 February 2012 (UTC)


 * I disagree that it is trivia-- it is a good example to also illustrate the earlier discussions on the usage of extended precision. In any case, to make it easier to find for those who may be interested in the information:  the footnote to the final example, giving the precision using double extended for internal calculations, is included here-


 * "As the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision. Footnote: if intermediate calculations are carried at a higher precision using double extended (x87 80 bit) format, it reaches 18 digits of precision, which is the full target double precision." Brianbjparker (talk) 23:37, 23 February 2012 (UTC)
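For readers without the article to hand, the behaviour described in the quoted passage can be reproduced with a short sketch. It assumes the recurrence in question is the article's t_0 = 1/sqrt(3), t_(i+1) = (sqrt(t_i^2 + 1) - 1)/t_i with the estimate 6 * 2^i * t_i, and that the "second form" is t_(i+1) = t_i/(sqrt(t_i^2 + 1) + 1); the function names are hypothetical. It runs in double only and does not address the extended-precision footnote.

```python
import math

def unstable_step(t):
    # First form: subtracts two nearly equal numbers once t is small.
    return (math.sqrt(t * t + 1.0) - 1.0) / t

def stable_step(t):
    # Second form: algebraically identical, but with no cancellation.
    return t / (math.sqrt(t * t + 1.0) + 1.0)

def pi_estimate(step, n):
    """Apply the recurrence n times from t_0 = 1/sqrt(3); return 6 * 2**n * t_n."""
    t = 1.0 / math.sqrt(3.0)
    for _ in range(n):
        t = step(t)
    return 6.0 * 2.0**n * t

N = 20  # kept small: the first form eventually produces t = 0 and then divides by it
err_unstable = abs(pi_estimate(unstable_step, N) - math.pi)
err_stable = abs(pi_estimate(stable_step, N) - math.pi)
print(err_unstable, err_stable)
```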


 * It just has nothing to do with extended precision. The first algorithm would go wrong just as badly with extended precision and the second one behaves exactly like double. There is nothing of note here. Why should it have all the various precisions in? The same thing would happen with float or quad precision. All it says is that the precision for different precisions is different. Also a double cannot hold 18 digits of precision; used as an intermediate for double you'd at most get one bit of precision extra. Dmcq (talk) 00:50, 25 February 2012 (UTC)


 * Agreed that the footnote does nothing to clarify the particular point being made by that example-- that wasn't the aim though. The intention was to also utilise the example to demonstrate the utility of computing intermediate values to higher precision than needed by the final destination format to limit the effects of round-off. In that sense it is an example for the earlier discussion on extended precision (and also the section on approaches to improve accuracy). Perhaps the text "Footnote: if intermediate calculations are carried at a higher precision using double extended (x87 80 bit) format, it reaches 18 digits of precision, which is the full target double precision (see discussion on extended precision above)." would be clearer. Agreed it is not the most striking example of this, but still demonstrates the idea-- perhaps a separate, more striking and specific example would be preferable, I will see what I can find. Brianbjparker (talk) 04:52, 25 February 2012 (UTC)


 * It does not illustrate that. What gives you the idea it does? If anything it is an argument against what was said before. Using extended precision in the intermediate calculation and storing back as double does not give increased precision in the final result. The 18 digits only applies to the extended precision, it does not apply to the double result. The 18 digits is not the target precision of a double. A double can only hold 15 digits accurately. There is no way to stick the extra precision of the extended precision into the target double. Dmcq (talk) 09:53, 25 February 2012 (UTC)


 * IEEE 754 double precision gives from 15 to 17 decimal digits of precision (17 digits if round-tripping from double to text back to double). When the example is computed with extended precision it gives 17 decimal digits of precision, so if the returned double was to be used for further computation it would have less roundoff error, in ULP (at least one extra decimal digit worth). Although, as you say, if the double result is printed to 15 decimal digits this extra precision will be lost. I agree that it is not a compelling example-- a better example could show a difference in many decimal significant digits due to internal extended precision. 121.45.205.130 (talk) 23:21, 25 February 2012 (UTC)
 * The 17 digits for a round trip is only needed to cope with making certain that rounding works okay. The actual precision is just less than 16 digits, about 15.95 if one cranks the figures. Printing has nothing to do with it. I was just talking about the 53 bits of precision information held within double precision format expressed as decimal digits. You can't shove any more information into the bits. The value there is about 1 ulp out and using extended precision would gain that back. This is what I was saying about extended precision being very useful for getting accurate maths functions, straightforward implementations in double will very often be 1 ulp out without special work whereas the extended precision result will very often give the value given by rounding the exact value. Dmcq (talk) 00:08, 26 February 2012 (UTC)
 * Ideally, what should be added is a more striking example of using excess precision in intermediate computations to protect against numerical instability. The current one can indeed demonstrate this if excess precision is carried to IEEE quad precision, in which case the numerically unstable version gives good results. I have added notes to that effect which will do as an example for now. There are many examples also showing this using only double extended (e.g. even as simple as computing the roots of a quadratic equation), and I will add such an example in the future, but not for a while (by the way, I think double extended adds more than 1 ULP but I haven't checked that). Brianbjparker (talk) 06:54, 26 February 2012 (UTC)
 * That's not true either because how does one know when to stop? Using quadruple precision would still diverge. Dmcq (talk) 11:45, 26 February 2012 (UTC)
 * Yes that is so- once it does reach the correct value it stays there for several iterations (at double precision) but does eventually diverge from it again, so a stopping criterion of when the value does not change at double precision could be used. But yes, I am not completely happy with that example for that reason-- feel free to remove it if you feel it is misleading. Actually Kahan has several very compelling examples in his notes-- I will post one here in the next week or so. Brianbjparker (talk) 14:41, 26 February 2012 (UTC)
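The digit counts quoted in this exchange (15, "about 15.95" and 17) can be checked directly. A minimal sketch using only the standard library; the sample value 0.1 + 0.2 is an arbitrary choice:

```python
import math

# A 53-bit significand expressed in decimal digits: 53 * log10(2).
digits = 53 * math.log10(2)
print(digits)

y = 0.1 + 0.2  # an arbitrary double that is not the one nearest to 0.3

# 17 significant digits are enough to recover the same double from text.
roundtrip_17 = float("%.17g" % y) == y

# 15 significant digits are not always enough: "0.3" reads back as a different double.
roundtrip_15 = float("%.15g" % y) == y

print(roundtrip_17, roundtrip_15)
```

This says nothing about how many of those digits are accurate in a computed result, which is the point under dispute above.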

The use of extra precision can be illustrated easily using differentiation. If the result is to be single precision then using double precision for all the calculations is a good idea because of the loss of significance when subtracting two values of the function. Dmcq (talk) 12:00, 26 February 2012 (UTC)
 * ok yes, that could be a good example-- I will see what I can come up with. Brianbjparker (talk) 14:41, 26 February 2012 (UTC)
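A rough sketch of the differentiation example, under stated assumptions: single precision is emulated by rounding each intermediate result through a 4-byte IEEE float with `struct`, the function (sin), the point x = 1 and the step h = 1e-4 are arbitrary choices, and the helper names are hypothetical.

```python
import math
import struct

def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def deriv_single(f, x, h):
    """Central difference with every intermediate rounded to single precision."""
    x, h = f32(x), f32(h)
    upper = f32(f(f32(x + h)))
    lower = f32(f(f32(x - h)))
    return f32(f32(upper - lower) / f32(2.0 * h))

def deriv_double_then_round(f, x, h):
    """Same formula evaluated in double, rounded to single only at the end."""
    return f32((f(x + h) - f(x - h)) / (2.0 * h))

x, h = 1.0, 1e-4
exact = math.cos(x)  # derivative of sin at x
err_single = abs(deriv_single(math.sin, x, h) - exact)
err_double = abs(deriv_double_then_round(math.sin, x, h) - exact)
print(err_single, err_double)
```

In the all-single version the subtraction leaves only about eleven significant bits in the numerator, so the result is much less accurate than the single-precision destination could hold.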


 * I have added an example from Kahan's publications-- I think this is a good example as it demonstrates the massive roundoff error (up to half signif. digits lost) that can occur with even innocuous-looking formulae, and shows the two main methods to correct or improve that: increased internal precision, or numerical analysis. Brianbjparker (talk) 07:03, 28 February 2012 (UTC)
 * Yes it is definitely better to source something like that to a good source like him. I may not agree with every last word he says about it but he definitely is the premier source for anything on floating point. Dmcq (talk) 14:14, 28 February 2012 (UTC)

01010111 01101000 01100001 01110100 00101110 00101110 00101110 00111111 (What...?)
The section on internal representation does not explain how decimals are converted to floating-point values. I think it will be helpful if we add a step-by-step procedure that the computer follows. Thanks! 68.173.113.106 (talk) 02:16, 25 February 2012 (UTC)
 * This gives an example of conversion and the articles on the particular formats give other examples. Wikipedia does not in general provide step by step procedures, it describes things, see WP:NOTHOWTO. Dmcq (talk) 02:24, 25 February 2012 (UTC)
 * I just thought it was kind of unclear. Besides, doing so might actually help this article get to GA status.
 * You see, I'm trying to design an algorithm for getting the mantissa, the exponent, and the sign of a  or  . So in case anyone else actually cares about that stuff. For the record, the storage is little-endian, so you have to reverse the bit order. 68.173.113.106 (talk) 02:50, 25 February 2012 (UTC)
 * It would stop FA status. Have a look at the articles about the individual formats. They describe in quite enough details the format. Any particular algorithm is up to the user, they are not interesting or discussed in secondary sources. Dmcq (talk) 10:01, 25 February 2012 (UTC)
 * The closest in Wikipedia for the sort of stuff you're talking about is if somebody wrote something for wikibooks. Have you had a look at the various external sites? Really to me what you're talking about sounds like some homework exercise and we shouldn't help with those except perhaps to give hints. Dmcq (talk) 10:20, 25 February 2012 (UTC)
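For anyone who does want the sign, exponent and fraction fields, here is a minimal sketch for an IEEE 754 binary64 value. It assumes a Python float is a binary64 double (true on all common platforms); the function name is hypothetical. Packing with an explicit big-endian format means the byte order of the host does not matter.

```python
import struct

def decompose(x):
    """Return (sign, biased_exponent, fraction) of a binary64 value."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)  # 52 bits; the leading 1 is implicit for normal numbers
    return sign, exponent, fraction

print(decompose(1.0))
print(decompose(-2.5))
```

For normal numbers the value is (-1)^sign * (1 + fraction / 2^52) * 2^(exponent - 1023); zeros, subnormals, infinities and NaNs use the reserved exponent values 0 and 2047 and need separate handling.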

imho, "real numbers" is didactically misleading
I'd like to propose to change the beginning of the first sentence, because the limited amount of bits in the significand only allows for storing rational binary numbers. Because two is a prime factor of ten, this means only rational decimal numbers can be stored as well. Concluding, I'd like to propose to replace "real" by "rational" there. Drgst (talk) 13:17, 25 February 2012 (UTC)


 * Definitely not. That is a bad idea. They are approximations to real numbers. The concept of rational number just doesn't come into it. That they are rational is just a side effect. Dmcq (talk) 14:32, 25 February 2012 (UTC)


 * In the section 'Some other computer representations for non-integral numbers' there are some systems that can represent some irrational numbers. for instance a logarithmic system does not necessarily represent rational numbers. Dmcq (talk) 14:36, 25 February 2012 (UTC)


 * Sorry for the delayed answer, Dmcq, it seems I forgot to tick the "watch page" checkbox... now for the content: IEEE FP numbers definitely are rational numbers. For example, even the simplest irrational number in the world, sqrt(2), cannot be represented. Any mathematical theorem that really depends on the existence of irrational numbers does not hold for the set of FP numbers. Nevertheless, you are right in stating that FP numbers are meant to approximate real numbers. Yet, as no non-rational number can be represented, transcendental numbers are far from being representable. Of course, this has serious consequences: for example, none of these nice trigonometric identities involving pi or pi/2 can be used naively without introducing large errors. This is just a simple example of why I think people should be warned against associating floating point numbers with real numbers. Drgst (talk) 21:14, 27 June 2012 (UTC)
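Both observations, that every finite floating-point value is a rational with a power-of-two denominator and that identities involving sqrt(2) or pi hold only approximately, can be checked in a few lines. A minimal sketch using the standard library:

```python
import math
from fractions import Fraction

# The double nearest to 0.1, written as an exact ratio of integers.
tenth = Fraction(0.1)
print(tenth)

# Its denominator is a power of two, so the stored value is a binary rational.
denominator_is_power_of_two = tenth.denominator & (tenth.denominator - 1) == 0

# sqrt(2) is only approximated, so squaring it does not give back exactly 2.
root = math.sqrt(2.0)
print(root * root)

# math.pi is the double nearest to pi, so sin(math.pi) is tiny but not zero.
print(math.sin(math.pi))
```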


 * "Irrational numbers are those real numbers that cannot be represented as terminating or repeating decimals." --Irrational number Therefore, irrational numbers cannot be exactly represented on any digital computer. However, you can get arbitrarily close. It really doesn't take all that many bits to handle a Planck length (~10^-35m) and the estimated size of the universe (~10^26m) in the same calculation.


 * The key point here is that floating point really is a method of representing (not perfectly but arbitrarily close) real numbers. Yes, it just so happens that some of them are represented exactly and others are not, but that's not relevant to the fact that FP is a method of representing (imperfectly) real numbers. All of this is covered quite nicely in the "Representable numbers, conversion and rounding" section. No need to make the lead confusing and misleading. --Guy Macon (talk) 22:48, 27 June 2012 (UTC)


 * I don't think this is correct "floating point really is a method of representing (not perfectly but arbitrarily close) real numbers". We talk about the "representable numbers" as those real numbers which can be represented exactly within the system. Other real numbers are rounded to some representable number. So I think we should either speak in terms of "working with real numbers" (which seems a little vague) or "representing approximations to real numbers" (as we do later in the article). --Jake (talk) 08:50, 22 October 2012 (UTC)
 * You make a good point, but while "working with real numbers" is inexact and vague, "representing approximations to real numbers" is wordy and clumsy. Perhaps we can devise a third alternative? --Guy Macon (talk) 12:57, 22 October 2012 (UTC)


 * What about "approximating real numbers"? But IMHO, "real numbers" is slightly incorrect, because floating point can also be used for complex arithmetic (though a complex number is here seen as a pair of two real numbers). Moreover a floating-point arithmetic is not just about the representation, but also the behavior when doing an operation (e.g. how the result is rounded). So, I would prefer something like: "a method of doing numerical computations" Vincent Lefèvre (talk) 22:09, 22 October 2012 (UTC)

Guard bits
Anybody know where the business of needing three extra bits comes from? For addition one only needs a guard/round digit plus a sticky bit as the sticky bit will always be zero if subtraction means you have to shift up. And for multiplication one needs the double length to cope with carry properly before rounding - but one can still cut that down to two bits before applying the particular rounding. The literature talks about guard and round and sticky so I'm not disputing putting it in the text, just wondering why people got the idea in their heads in the first place. Dmcq (talk) 13:03, 8 March 2012 (UTC)


 * Somewhat related: Take a look at "2 vs 3 guard bits" here:
 * http://www.engineering.uiowa.edu/~carch/lectures07/55035-070404-prn.pdf


 * Also interesting:
 * http://www.google.com/patents/US4282582.pdf


 * These two searches turn up some interesting pages:
 * http://www.google.com/search?q="floating+point"+"40+bits"
 * http://www.google.com/search?q="floating+point"+"eight+guard+bits"+"DSP"
 * --Guy Macon (talk) 00:39, 9 March 2012 (UTC)


 * Goldberg gives a discussion of the need for two guard digits in http://www.validlab.com/goldberg/paper.pdf (page 195). There is a very clear description with example cases in: Michael L. Overton (2001). Numerical Computing with IEEE Floating Point Arithmetic. SIAM. Brianbjparker (talk) 06:17, 9 March 2012 (UTC)


 * Very good reference. It should be noted that he not only covers base 10 and guard (decimal) digits but also base 2 and guard bits. --Guy Macon (talk) 07:02, 9 March 2012 (UTC)


 * I just looked at an implementation of the whole business that I did ages ago, and I did actually use three bits! Just me forgetting what I'd done, sorry. Yes, the subtraction does actually require them all. Dmcq (talk) 11:33, 9 March 2012 (UTC)
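One reading of why three bits are kept can be sketched on plain integers. This is an illustration under assumptions, not the scheme from any of the references above: the significand carries `extra` low-order bits that must be dropped, the tail is compressed to guard, round and sticky bits, and a subtraction that needs a one-bit normalising left shift promotes the guard bit into the result while the round bit takes over as the new guard. All function names are hypothetical.

```python
from fractions import Fraction

def split_grs(sig, extra):
    """Compress the `extra` low bits of `sig` into guard, round and sticky bits."""
    assert extra >= 3
    kept = sig >> extra
    guard = (sig >> (extra - 1)) & 1
    rnd = (sig >> (extra - 2)) & 1
    sticky = int((sig & ((1 << (extra - 2)) - 1)) != 0)  # OR of all lower bits
    return kept, guard, rnd, sticky

def round_grs(kept, guard, rnd, sticky):
    """Round to nearest, ties to even, using only the three extra bits."""
    if guard and (rnd or sticky or (kept & 1)):
        kept += 1
    return kept

def shift_left_one(kept, guard, rnd, sticky):
    """One-bit normalising shift: guard joins the result, round becomes the guard."""
    return (kept << 1) | guard, rnd, 0, sticky

# The three bits give the same answer as exact round-half-even, with or without the shift.
mismatches = 0
for sig in range(1 << 10):
    grs = split_grs(sig, 4)
    if round_grs(*grs) != round(Fraction(sig, 16)):
        mismatches += 1
    if round_grs(*shift_left_one(*grs)) != round(Fraction(sig, 8)):
        mismatches += 1
print(mismatches)
```

Without the separate round bit, the shifted case would have no guard bit left to decide the rounding.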

edit : computation in page is correct after all
Sorry for the confusion : I used t_(i+1) instead of t_i. for that reason I missed a factor 2 : 2^(i+1) = 2 * 2^i. — Preceding unsigned comment added by KeesLem (talk • contribs) 14:36, 21 February 2013 (UTC)

Justification for division by zero definition
I recently added to division by zero this statement with an appropriate source:
 * "The justification for this definition is to preserve the sign of the result in case of arithmetic underflow. For example, in the single-precision computation 1/(x/2), where x = ±2^−149, the computation x/2 underflows and produces ±0 with sign matching x, and the result will be ±∞ with sign matching x. The sign will match that of the exact result ±2^150, but the magnitude of the exact result is too large to represent, so infinity is used to indicate overflow."

Provided this is valid, I wonder if it could also be added in some relevant location in the body of floating point related articles. In general I'd like to see more information on design rationales. Thanks! Dcoetzee 07:42, 11 September 2012 (UTC)
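The underflow step of that argument can be observed directly. This sketch uses a double-precision analogue, starting from the smallest positive subnormal double 2^-1074. Because Python raises ZeroDivisionError for float division by zero instead of returning an infinity, the final step goes through a hypothetical helper, `ieee_divide`, which only emulates the IEEE 754 rule for a nonzero finite or infinite numerator.

```python
import math

tiny = 5e-324  # smallest positive subnormal double, 2**-1074

def ieee_divide(a, b):
    """Hypothetical helper emulating IEEE 754 division of a nonzero number by zero."""
    if b == 0.0 and a != 0.0 and not math.isnan(a):
        # The sign of the infinity is the product of the signs of a and b.
        return math.copysign(math.inf, a) * math.copysign(1.0, b)
    return a / b

results = []
for sign in (1.0, -1.0):
    x = sign * tiny
    half = x / 2                          # underflows to a zero carrying the sign of x
    zero_sign = math.copysign(1.0, half)  # exposes the sign of that zero
    results.append((sign, half == 0.0, zero_sign, ieee_divide(1.0, half)))

for row in results:
    print(row)
```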

Signed zero section, branch cuts
The section on signed zero (under Internal representation >> Special values >> Signed zero) says the following:

"The difference between +0 and −0 is mostly noticeable for complex operations at so-called branch cuts."

In a strictly mathematical sense, +0/-0 can be interpreted as describing the limiting behaviors of a function, but that's not actually what's happening here. Moreover, branch cuts are not the only situation where these exceptional limiting behaviors appear, one can have branch cuts without exceptional limiting behaviors of this sort, and none of the examples given in the section are actually branch cuts. As far as I can tell, there is absolutely no significance to the relationship between branch cuts in complex analysis and signed zero in floating point numerical representations, but I wanted to make sure there wasn't a good reason for this being here. Thoughts? 71.227.119.236 (talk) 15:25, 29 September 2012 (UTC)


 * Result of a quick Google search:


 * "A system with signed zero can distinguish between asin(5+0i) and asin(5-0i) and pick the appropriate branch cut continuous with quadrant I or quadrant IV, respectively. A system without signed zero cannot distinguish and, according to the choses the branch cut such that it is continuous with quadrant IV (consistent with the rule of CCC). So, for asin(5+0i) it will return the same value as a system with signed zero would for asin(5-0i)." -Richard B. Kreckel ( [ http://www.ginac.de/~kreckel/ ] [ http://lists.gnu.org/archive/html/bug-gsl/2011-12/msg00004.html ] ).


 * I think that when he wrote "according to the" he meant "accordingly" (probably not a native English speaker). --Guy Macon (talk) 23:34, 29 September 2012 (UTC)


 * Somewhat straying from the subject but still quite interesting; the "Signed Zero" section of "What Every Computer Scientist Should Know About Floating-Point Arithmetic" ( [ http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html ] ) --Guy Macon (talk) 23:41, 29 September 2012 (UTC)
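The effect Kreckel describes is easy to observe in a library that honours signed zeros on branch cuts, and Python's `cmath` and `math` modules do so on IEEE 754 platforms. A minimal sketch using the square root, whose cut lies along the negative real axis, and `atan2`:

```python
import cmath
import math

# Approaching the cut of sqrt from above (+0 imaginary part) and from below (-0).
above = cmath.sqrt(complex(-4.0, 0.0))
below = cmath.sqrt(complex(-4.0, -0.0))
print(above, below)

# The same distinction in real arithmetic: the angle of a point on the negative x-axis.
angle_above = math.atan2(0.0, -1.0)
angle_below = math.atan2(-0.0, -1.0)
print(angle_above, angle_below)
```

Without a signed zero the two inputs in each pair would be indistinguishable and only one side of the cut could be returned.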

imho, the computation for Pi as shown actually computes only Pi/2
The algorithm as shown to compute an approximation of Pi actually computes imo in this form only Pi/2, even while the output shown contains an approximation for Pi. I think either the values should be halved or the formula should be changed into : 12 * 2^i * t_i KeesLem (talk) 15:16, 21 February 2013 (UTC) — Preceding unsigned comment added by 130.161.210.156 (talk)
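A short derivation may help settle which factor is right. It assumes the recurrence in the article is t_0 = 1/sqrt(3) with t_(i+1) = (sqrt(t_i^2 + 1) - 1)/t_i; that recurrence is not quoted on this page, so treat it as an assumption.

```latex
% Half-angle identity for 0 < \theta < \pi/2:
\tan\frac{\theta}{2} = \frac{1-\cos\theta}{\sin\theta}
                     = \frac{\sec\theta - 1}{\tan\theta}
                     = \frac{\sqrt{\tan^2\theta + 1} - 1}{\tan\theta}
% Hence, starting from t_0 = 1/\sqrt{3} = \tan(\pi/6):
t_i = \tan\!\left(\frac{\pi}{6\cdot 2^{i}}\right)
% and since \tan x \approx x for small x:
6\cdot 2^{i}\, t_i \;\longrightarrow\; 6\cdot 2^{i}\cdot\frac{\pi}{6\cdot 2^{i}} = \pi
\qquad (i \to \infty)
```

Under that assumption 6 * 2^i * t_i tends to pi and 12 * 2^i * t_i would tend to 2*pi; a factor-of-two discrepancy arises only if t_(i+1) is paired with 2^i.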