Talk:Double-precision floating-point format

Pros/Cons of Double Precision
This entry seems to be very technical explaining what Double Precision is, but not the benefits or applications of it.
 * I miss this too :/ Too bad I can't answer the question. -- Henriok 11:36, 8 January 2007 (UTC)

Um... pros: more digits of precision... cons: slower mathematical operations, takes more memory... ? Not too hard... Sdedeo (tips) 03:12, 10 September 2007 (UTC)
 * I don't think that was what they were asking. Given a fixed number of bits of precision "n", you could divide the interval over which you want to calculate into 2^n equal steps; this is integer arithmetic. Unfortunately it breaks down if the interval is not bounded or if both extremely large and extremely small values are expected and relative error is more important than absolute error. This means integers are unsuitable for a general purpose arithmetic (they are "bounded") and for many real world problems (where the relative error is most important). Now, you could save the logarithm of your values to solve this, but unfortunately adding and subtracting numbers would become problematic. So double precision is essentially a hybrid approach. Shinobu (talk) 23:45, 13 December 2007 (UTC)
 * I think Gerbrant is referring to the properties of floating point, not double precision in itself. (One can have double-precision integers). mfc (talk) 09:06, 22 December 2007 (UTC)
 * It's useful because computers cannot store true real numbers; they can only store values of some finite precision. So the computer can never exactly store some values, for instance 0.1 or pi. In applications, floating point allows very small numbers to be represented. For instance, if you have a program that tracks a distance and you only care about three digits after the decimal point, you could use floating-point values to store the distance. (Sorry to bring this topic back from the dead; I just wanted to give an example of when floating point would be used.) Deadcellplus (talk) 21:41, 3 October 2008 (UTC)
 * Again, the question was most likely about double precision as compared to single precision, rather than floating point as compared to integer arithmetic. A lot of your explanation is also incorrect - for every number there is some encoding scheme that can finitely encode that value. The only true statement you can make is that no encoding scheme can encode an uncountable number of different real numbers, since the number of strings is countable. Floating-point arithmetic is useful because it's an encoding scheme designed to handle a wide range of real values frequently encountered in applications. Dcoetzee 00:46, 4 October 2008 (UTC)
 * Maybe my information is a bit outdated, but in the book I learned x86 assembly language from, there was a note that x87 FPUs internally always use 80-bit-precision; therefore there is no speed gain for smaller precision values, only memory storage benefits. Junkstar (talk) 09:44, 18 March 2013 (UTC)
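
To make the single-versus-double trade-off in this thread concrete, here is a minimal Python sketch. It assumes CPython, where `float` is an IEEE 754 binary64 value; the helper name `to_single` is hypothetical and only used for illustration. It shows the storage sizes and the extra rounding error that appears when 0.1 is squeezed into single precision. It says nothing about speed, which, as noted above, depends on the hardware.

```python
import struct

def to_single(x: float) -> float:
    """Round a binary64 value to binary32 and widen it back (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

x = 0.1
single = to_single(x)

# Extra error introduced by the shorter 24-bit significand of binary32.
err_single = abs(single - x)

# Storage cost of each format in bytes.
size_single = struct.calcsize("<f")
size_double = struct.calcsize("<d")
```

Values that fit in 24 significant bits, such as 0.5, survive the conversion unchanged; 0.1 does not.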

Fractions
I can see the exponent and the sign, but which digits in the significand are used for the numbers before the decimal point, and which are used for after? —Preceding unsigned comment added by 67.91.121.114 (talk) 20:24, 20 May 2008 (UTC)
 * This is a good question - the short answer is, usually the position of the decimal point in the significand is fixed, so that it doesn't have to be encoded. In the IEEE encoding, the point is preceded by an implicit 1, also not encoded, and followed by the explicit bits of the significand. Underflow is a special case. Dcoetzee 00:54, 4 October 2008 (UTC)
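
As a small illustration of the answer above, the following Python sketch (assuming CPython floats are binary64) uses the built-in `float.hex()` to show the normalized form of 5.5. Exactly one digit, the implicit 1, sits before the binary point, and the stored fraction bits follow it; the exponent says where the point really belongs.

```python
x = 5.5  # 101.1 in binary, i.e. 1.011 (binary) times 2**2

# float.hex() prints the normalized significand in hexadecimal
# followed by the power-of-two exponent after "p".
hex_form = x.hex()

significand_text, exponent_text = hex_form.split("p")
exponent = int(exponent_text)
```

Here the hexadecimal fraction digit 6 is binary 0110, so the significand is 1.0110 in binary, matching 1.375 × 2^2 = 5.5.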

Limitations
If "All bit patterns are valid encoding.", then why is the highest exponent represented by bits [000 0111 1111]? I tested sets of bits for higher values in Microsoft Visual Studio (C++) and got "infinite" as a value. It also turns out that comparing a double value to that "infinite" value throws an exception... But I guess this is more related to the compiler itself. —Preceding unsigned comment added by 69.70.33.68 (talk) 15:01, 19 March 2009 (UTC)

Infinity is a valid value. --193.71.114.235 (talk) 18:47, 10 February 2011 (UTC)
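
A minimal Python sketch of the point above, assuming CPython floats are binary64 (the helper `from_bits` is hypothetical): the exponent field of all ones with a zero fraction encodes infinity, and the largest finite value sits just below it. In this environment comparing against infinity raises nothing; the exception reported above may be specific to that compiler or its settings.

```python
import math
import struct
import sys

def from_bits(bits: int) -> float:
    """Reinterpret a 64-bit integer as a binary64 value (hypothetical helper)."""
    return struct.unpack(">d", bits.to_bytes(8, "big"))[0]

# Exponent field 0x7FE with all fraction bits set: largest finite double.
largest_finite = from_bits(0x7FEFFFFFFFFFFFFF)

# Exponent field 0x7FF (all ones) with a zero fraction: +infinity.
pos_inf = from_bits(0x7FF0000000000000)

comparison = largest_finite < pos_inf
```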

Integer resolution limits
I was looking for a mention about within what range all integers can be represented exactly. After some experimentation with double precision floats (52 bit stored fraction) it seems $$2^{53}-1=9,007,199,254,740,991$$ and $$2^{53}=9,007,199,254,740,992$$ are the largest positive integers where a difference of 1 can be represented. $$2^{53}+1$$ would be rounded up to $$2^{53}+2$$.


 * I added it.--Patrick (talk) 11:35, 8 October 2010 (UTC)


 * The article seems definitely wrong about this. It's obvious from the formula that the negative and positive resolutions are the same, but the article says that there's an extra bit of positive resolution.  Where does it come from?
 * 206.124.141.187 (talk) 18:00, 13 June 2014 (UTC)
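
The integer limit discussed in this section can be checked with a short Python sketch, assuming CPython, where `float` is binary64 and integer-to-float conversion rounds to nearest with ties to even. Under that assumption 2^53 + 1 converts to 2^53 rather than 2^53 + 2, so other rounding modes or conversion routines may explain the different result reported above. The negative side behaves symmetrically.

```python
LIMIT = 2 ** 53

# Every integer up to 2**53 in magnitude converts exactly.
exact_below = float(LIMIT - 1) == LIMIT - 1
exact_at = float(LIMIT) == LIMIT

# 2**53 + 1 lies halfway between two doubles; ties-to-even picks 2**53.
rounded = float(LIMIT + 1)

# 2**53 + 2 is representable again (spacing is now 2).
exact_next = float(LIMIT + 2) == LIMIT + 2

# Same behaviour for negative values.
neg_rounded = float(-(LIMIT + 1))
```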

17 digits used in examples
I'm confused: why do the examples use 17 digits (e.g. 1.7976931348623157 x 10^308) if the prescribed number of digits is 15.955? Also, an explanation of how you could have 15.955 digits would be nice. I'm assuming that the higher digits can't represent all values from 0-9 hence we can't get to a full 16 digits? — Preceding unsigned comment added by Ctataryn (talk • contribs) 22:45, 31 May 2011 (UTC)

You have 52 binary digits, which happens to be 15.955 decimal digits. Compared to 16 decimal digits, the last digit can't always represent all values from 0-9 (but in some cases it can, thus it represents 9.55 different values on average). Also, while on average you only have ~16 digits of precision, sometimes two different values have the same 16 digits, so you need a 17th digit to distinguish those. This means that for some values, you have 17 digits effective precision (while some others have only 15 digits precision). --94.219.122.21 (talk) 20:52, 7 February 2013 (UTC)

You actually have 53 binary digits due to implicit bit. Double float can represent integers exactly up to 9007,1992,5474,0992 (2^53). Accuracy of 16 decimal digits would provide integers exactly up to 1,0000,0000,0000,0000. 2A01:119F:21D:7900:2DC1:2E59:7C56:EE1E (talk) 15:39, 11 June 2017 (UTC)
 * The 17 digits is wrong, and should be fixed. It seems that if you print 17 digits, and read them back, then you get the original binary value. That doesn't mean that you have 17 digits precision, though. Gah4 (talk) 07:42, 9 September 2023 (UTC)
 * When describing decimal digits, why are you putting commas after every 4th digit? Isn't the correct way to show decimal numbers to put a comma after every 3rd digit? For example your number 1,0000,0000,0000,0000 should be shown as 10,000,000,000,000,000 and your number 9,007,199,254,740,992. Benhut1 (talk) 05:31, 15 July 2024 (UTC)
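
The difference between "digits of precision" and "digits needed to round-trip" can be seen in a small Python sketch (assuming CPython floats are binary64; the sample values are arbitrary choices for illustration). For the double nearest to 0.1 + 0.2, 16 printed digits do not read back to the same value, while 17 do. In the other direction, a 15-digit decimal string survives a trip through a double.

```python
x = 0.1 + 0.2

s17 = "%.17g" % x   # 17 significant digits
s16 = "%.16g" % x   # 16 significant digits

# Reading the strings back: only the 17-digit form recovers x.
roundtrip_17 = float(s17) == x
roundtrip_16 = float(s16) == x

# Decimal -> double -> decimal with 15 digits returns the original string.
d15 = "0.123456789012345"
roundtrip_15 = ("%.15g" % float(d15)) == d15
```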

Execution speed - grammar
"calculations with double precision are 3 to 8 times slower than float."

What exactly does "x times slower" mean? Is it n^-x? Or n/x? How much would the logical conclusion of "1 time slower" be? This unscientific colloquial English is confusing and should be clarified. I would like to, but I cannot make sense of the source given. Thanks. Andersenman (talk) 10:52, 16 July 2014 (UTC)
 * x times slower is generally understood to mean it takes x times as long. Dicklyon (talk) 03:18, 4 December 2014 (UTC)
 * THEN FOR GOD'S SAKE WHY NOT WRITE IT AS THAT?! Andersenman (talk) 10:31, 30 October 2016 (UTC)

this article is using an incorrect word
According to IEEE 754, verbatim:

significand. The component of a binary floating-point number that consists of an explicit or implicit leading bit to the left of its implied binary point and a fraction field to the right.

The word "mantissa" has no definition in the standard, nor does it appear anywhere in that text. — Preceding unsigned comment added by 23.240.200.20 (talk) 03:01, 4 December 2014 (UTC)

subnormal representation allows values smaller than 1e−323 ???
The very ending of section "IEEE 754 double-precision binary floating-point format: binary64" says: "By compromising precision, subnormal representation allows values smaller than 1e−323". Shouldn't it be something like "..allows even smaller numbers up to 1e-323"? — Preceding unsigned comment added by Honzik.stransky (talk • contribs) 21:21, 18 April 2015 (UTC)


 * I think it meant that there were values smaller than 1e−323, but this was very confusing. I've done the correction. Vincent Lefèvre (talk) 10:43, 28 January 2016 (UTC)
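
For reference, a minimal Python sketch of the values involved, assuming CPython floats are binary64 with subnormal support (the helper `from_bits` is hypothetical). It builds the smallest subnormal and smallest normal doubles from their bit patterns and shows that a positive value below 1e−323 does exist.

```python
import math
import struct
import sys

def from_bits(bits: int) -> float:
    """Reinterpret a 64-bit integer as a binary64 value (hypothetical helper)."""
    return struct.unpack(">d", bits.to_bytes(8, "big"))[0]

smallest_subnormal = from_bits(1)                  # 2**-1074
second_subnormal = from_bits(2)                    # 2**-1073, printed as 1e-323
smallest_normal = from_bits(0x0010000000000000)    # 2**-1022

# Halving the smallest subnormal is a tie between 0 and itself; it rounds to 0.
underflow = smallest_subnormal / 2
```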

Lack of Citations in Implementations Section
There do not appear to be citations for any of the claims made in the "Implementations" section. Are these considered common knowledge?

--Irony0 (talk) 16:32, 27 January 2016 (UTC)


 * I've done some updates. Not everything is common knowledge. There should be citations for the specifications. But the fact that one has double precision in practice is simply due to the IEEE 754 hardware and that many implementations are written in C/C++ and double is the preferred type (matching double precision, even before this was standardized by Annex F of C99). There may also be issues with x86 extended precision. Vincent Lefèvre (talk) 10:30, 28 January 2016 (UTC)
 * Well, DOUBLE PRECISION (the actual statement) goes back to Fortran II in about 1957. Even more, Fortran requires a data type that is twice the size, such that things line up in memory when you use EQUIVALENCE. Gah4 (talk) 07:47, 9 September 2023 (UTC)

Precision
There are a few statements relating to precision that I think are incorrect (please see ):

"This gives 15–17 significant decimal digits precision. If a decimal string with at most 15 significant digits is converted to IEEE 754 double precision representation and then converted back to a string with the same number of significant digits, then the final string should match the original. If an IEEE 754 double precision is converted to a decimal string with at least 17 significant digits and then converted back to double, then the final number must match the original.[1]"

You don't really get 17 digits of precision; you only get up to 16. 17 digits is useful for "round-tripping" doubles, but it is not precision.

"with full 15–17 decimal digits precision"

15-16 digits, not 15-17.

"the total precision is therefore 53 bits (approximately 16 decimal digits, 53 log10(2) ≈ 15.955)"

The simple "take the logarithm" approach does not apply to floating-point, so the precision is not simply ≈ 15.955.

(I made similar comments about the single-precision format: https://en.wikipedia.org/wiki/Talk:Single-precision_floating-point_format#Precision )

--Behindthemath (talk) 12:31, 1 July 2016 (UTC)


 * The confusion seems to be coming from page 4 of the referenced lecture notes. The "Lower Bound" description and formula are correct and equivalent to Floor(Log10(2^(N-1))) assuming the author wanted to account for subnormal numbers by subtracting the Implied significand bit (N = Storage Bits + Implied Bit) which is the value that yields the same results as displayed.


 * The "Upper Bound" description and formula would be correct if the author intended to add the Implied Bit for normal numbers, but since it was already included in the value for N the formula here is wrong (high by 1). Additionally, the second test described (convert float to decimal and back, with first and last equal) must be trivially true for every float value, regardless of significant decimal places, for the format to work at all, and so is meaningless.


 * I'm unsure how to proceed. The page is wrong because the cited source is wrong; the cited source contains errors which are trivial to see and correct (all upper bounds high by 1). But I have no alternate source available to cite. My instinct is to remove the source, correct the page and let someone else come along and provide a source without the errors... I'll wait for guidance for a bit and if none is provided I'll proceed as described.


 * 74.214.226.120 (talk) 10:22, 14 June 2020 (UTC)


 * Updating as described 74.214.226.120 (talk) 23:12, 21 June 2020 (UTC)


 * The source is from William Kahan. I know that he has not always written correct things, but well... One issue is that there is no good way to define an equivalent precision between systems of different radices. The usual log formula just gives an approximation, and in particular it is wrong on a power of the radix, where it can give an equality. For instance, consider a 2-digit radix-1000 system. The log formula tells you that this is equivalent to a 6-digit radix-10 system (since 1000 = 10^3). Even though you can write any 2-digit radix-1000 number as a 6-digit radix-10 number, the converse is not true. For instance, 1.23456 needs 3 digits in radix 1000. More generally, what would be regarded as the equivalent precision would change near the power of a radix (this applies to both the source system and the target system). That's why if you consider all the possible inputs, you obtain a range for the equivalent precision (I would not be surprised if this were directly related to the round-tripping). Vincent Lefèvre (talk) 01:46, 22 June 2020 (UTC)
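
The three numbers argued over in this section can be reproduced with a few lines of Python. This is only a sketch of the commonly cited formulas for a p-bit binary significand, not a ruling on which figure deserves the name "precision": the plain logarithm, the number of decimal digits guaranteed to survive decimal → binary → decimal, and the number of decimal digits sufficient for binary → decimal → binary.

```python
import math

p = 53  # significand bits of binary64, including the implicit bit

# Plain "take the logarithm" estimate.
approx_digits = p * math.log10(2)

# Decimal digits that always survive a trip through a double.
guaranteed = math.floor((p - 1) * math.log10(2))

# Decimal digits that always suffice to recover the original double.
needed = math.ceil(1 + p * math.log10(2))

# Same formulas for binary32 (p = 24), for comparison.
guaranteed_single = math.floor((24 - 1) * math.log10(2))
needed_single = math.ceil(1 + 24 * math.log10(2))
```

The gap between the two bounds is the range described above: near powers of either radix the "equivalent" precision shifts, so no single digit count fits every input.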

Advantages & Disadvantages
IEEE 754 double-precision binary floating-point format: binary64 - contains the following as its first sentence:"Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost." [emphasis mine] Aside from the OBVIOUSLY poor usage of the word "over" when "compared to" would be much clearer, I challenge this claim. I have no idea whether it is its wider RANGE or greater PRECISION, or both which motivates its use, but both are obviously reasons to use it.

I have two other issues with this article. First is the discussions of its advantages and disadvantages sprinkled around the article are simply wrong. It may be, on a particular machine, faster than single precision. That is a fact. It may use the SAME storage as single, another fact. It may also use the same bandwidth. In a theoretical sense, double precision mathematics requires more resources but what it uses depends on BOTH the hardware and the software (not to mention firmware) of the machine it is running on. This article doesn't distinguish between what is theoretically true on some abstract machine, and what the various chip architectures, core - fpu - gpu utilizations, as well as OS and compiler/interpreter optimizations and defaults will ACTUALLY do. It needs a revision to qualify the overly broad sentiments.

Second, this article discusses various languages and whether or how they implement double-precision. This is problematical. First (repeating myself) it is chip dependent - and with the proliferation of 'non-Intel' architectures, this is more and more true. Second, it is version dependent. ANY statement of what C or JAVA does or doesn't do should include version numbers. (obviously). Thirdly for some of the more sophisticated language compilers, it may be (fully or partially) customizable.71.29.173.173 (talk) 17:21, 16 July 2016 (UTC)
 * I agree, the reason seems wrong to me. Moreover, the sentence doesn't say when. Vincent Lefèvre (talk) 23:06, 16 July 2016 (UTC)

Article NEVER fully defines how the number is calculated, is therefore rather useless if you don't have access to full standard
The article NEVER fully defines how the number is calculated, is therefore rather useless if you don't have access to full standard. It calculates a lot of numbers and all, but the only precise definition of how the number is actually calculated is given here:

The real value assumed by a given 64-bit double-precision datum with a given biased exponent $$e$$ and a 52-bit fraction is   : $$ (-1)^{\text{sign}}(1.b_{51}b_{50}...b_{0})_2 \times 2^{e-1023} $$

... and it completely omits on how to properly calculate the exponent. Then it goes on with useless 10^x calculations and other stuff without ever filling this information in. (at least I cannot see where, and if it comes much later it should probably be added to the quoted section instead!)

Can someone who has access to this information maybe add it? (with the actual bits of the exponent filled in, in the proper way they are used to calculate the exponent) 132.230.194.161 (talk) 09:17, 29 March 2017 (UTC)
 * The biased exponent comes from the representation, as described. There is nothing to calculate. Everything is well-defined in the article. Vincent Lefèvre (talk) 06:53, 30 March 2017 (UTC)
 * In discussions of the Fortran standard, and evaluation of expressions, it was explained (by an actual committee member) that the Fortran standard allows all expressions to evaluate to 42. That would not be a quality implementation, but legal. That is, the standard, agreeing with this question, does not describe how values are calculated. That is outside the standard, and should also be outside this article. Gah4 (talk) 07:13, 26 September 2023 (UTC)
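
For readers who want to see the quoted formula applied, here is a minimal Python sketch. It assumes CPython floats are binary64 and handles only normal numbers (biased exponent 1 to 2046); the function names `decode` and `value_of_normal` are hypothetical. The biased exponent e is simply the 11-bit field read as an unsigned integer, which is then offset by 1023.

```python
import struct
from fractions import Fraction

def decode(x: float):
    """Split a double into (sign, biased exponent e, 52-bit fraction)."""
    bits = int.from_bytes(struct.pack(">d", x), "big")
    sign = bits >> 63                    # 1 bit
    e = (bits >> 52) & 0x7FF             # 11 bits, read as an unsigned integer
    fraction = bits & ((1 << 52) - 1)    # 52 bits
    return sign, e, fraction

def value_of_normal(sign: int, e: int, fraction: int) -> Fraction:
    """Evaluate (-1)^sign * (1.b51...b0)_2 * 2^(e-1023) exactly."""
    assert 1 <= e <= 2046, "only normal numbers are handled in this sketch"
    significand = 1 + Fraction(fraction, 1 << 52)   # implicit leading 1
    return (-1) ** sign * significand * Fraction(2) ** (e - 1023)

fields = decode(-2.5)                    # -1.25 * 2**1
exact = value_of_normal(*fields)
```

Using `Fraction` keeps the reconstruction exact, so it can be compared directly with the original value.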

Proposal for minor change to exponent encoding examples
The second example shows 2^6, but it should show 2^0 instead. The section explains exponent bias, and 2^0 is highly relevant to exponent bias, whereas 2^6 is unrelated and might as well be a random number. Also the preceding and following examples show the lower and upper limits, and in that context I would (and did) expect the middle example to show the middle. The proposed change certainly would have saved me time, and I suspect it would similarly help others who need to quickly grasp exponent bias. I will gladly make this change if there's no objection. Victimofleisure (talk) 07:26, 24 November 2017 (UTC)
 * I would say: add a 4th example corresponding to 2^0, at the second position. Thus, there would be: the minimum exponent, the case 2^0, an example more generic than 2^0, and the maximum exponent. Vincent Lefèvre (talk) 12:45, 24 November 2017 (UTC)
 * Fair enough. But I still feel it would be helpful to draw attention to the importance of 2^0. How about adding (zero offset) to the right of the 2nd example? This would connect it visually to the zero offset explanation that precedes the examples. Victimofleisure (talk) 04:45, 27 November 2017 (UTC)
 * Why not, but note that I don't think that 2^0 has a particular importance, in particular due to the different conventions to represent floating-point numbers. Here, this is the exponent expressed where the significand is between 1 and 2. But in C, the significand is between 1/2 and 1, so that the zero offset (bias) would come from the encoding of 0.1₂ × 2^0 = 2^−1. And integral significands can be used too, so that in this case, the zero offset (bias) would come from the encoding of 1000...000₂ × 2^0 = 2^52. Vincent Lefèvre (talk) 11:41, 27 November 2017 (UTC)
 * Noted, and thanks. Simplification can aid understanding, hence it's sensible for all the examples in this section to use the same standard, as they do (IEEE 754).Victimofleisure (talk) 21:19, 27 November 2017 (UTC)
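
The exponent encodings discussed in this thread can be read off directly with a short Python sketch (assuming CPython floats are binary64; `exponent_field` is a hypothetical helper). It extracts the 11-bit field for the smallest normal exponent, 2^0, 2^6 and the largest exponent, and also shows the C-style convention mentioned above, where `math.frexp` returns a significand in [1/2, 1).

```python
import math
import struct
import sys

def exponent_field(x: float) -> int:
    """Return the 11-bit biased exponent field of a double (hypothetical helper)."""
    bits = int.from_bytes(struct.pack(">d", x), "big")
    return (bits >> 52) & 0x7FF

field_min = exponent_field(math.ldexp(1.0, -1022))   # smallest normal exponent
field_zero = exponent_field(1.0)                     # 2**0, the zero offset
field_six = exponent_field(64.0)                     # 2**6
field_max = exponent_field(sys.float_info.max)       # largest finite exponent

# C-style convention: 1.0 == 0.5 * 2**1, so the exponent reads one higher.
c_style = math.frexp(1.0)
```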

bit layout - is it standardized?
I cannot get hold of the standard document, but I have read other C/C++ standards, and people who write standards try to word things so as to avoid standardizing what should be left to the implementers of the hardware and compilers. That would be the layout in the case of IEEE 754! Every webpage on the net, however, has a graphic with the supposedly "standard" bit layout. Can someone please confirm or deny that the bit layout is standardized in IEEE 754? — Preceding unsigned comment added by Kotika98 (talk • contribs) 02:44, 20 March 2018 (UTC)


 * The bit layout is standardized. But the actual layout in the computer memory is out of the scope of the standard: a mapping between both must be defined by the implementation. Typically, there are differences between implementations due to endianness. Vincent Lefèvre (talk) 09:08, 20 March 2018 (UTC)
 * There is also VAX, which is not so far from IEEE 754, and is not big or little endian. Alpha has instructions which move bits to allow converting between F, D, and G, float, and IEEE S and T float. (I am not sure about H-float.) Gah4 (talk) 11:47, 10 September 2023 (UTC)
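
The distinction between the standardized bit layout and the byte order in memory can be illustrated with Python's `struct` module (a sketch assuming binary64 floats). The same 64-bit encoding of 1.0 is written out in big-endian and little-endian byte order; only the byte sequence differs, not the sign, exponent and fraction fields.

```python
import struct

big = struct.pack(">d", 1.0)      # most significant byte first
little = struct.pack("<d", 1.0)   # least significant byte first

big_hex = big.hex()
little_hex = little.hex()

# Both byte strings decode to the same 64-bit pattern.
same_bits = int.from_bytes(big, "big") == int.from_bytes(little, "little")
```

Formats such as VAX, mentioned above, are not covered by this sketch.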

Alternative names
The article starts with:

The status of the alternative names FP64 and float64 is unclear: are these names used for any 64-bit floating-point format or only for IEEE's double-precision format (binary64)? If this is the latter, I think that these alternative names should be introduced only in the paragraph on IEEE's double-precision format (where binary64 is introduced). — Vincent Lefèvre (talk) 23:52, 26 October 2020 (UTC)

Execution speed: Hardware vs. algorithms
I have modified the section about execution speed. It said that functions like sin are slower, and then continued that “this” is a particular issue on GPUs. This is not the case. The thing with GPUs is that either two single-precision cores are needed for one double-precision calculation, or that (in this case) only one of 16 cores is usable at all for double-precision. Calculations run exactly the same speed on GPUs, but fewer of them can run in parallel. — At the same time it is true that functions like sin are slower. But this is caused by the algorithms, not the processors. 91.96.31.112 (talk) 19:53, 6 October 2022 (UTC)

exponent in decimal
I just reverted an edit to IEEE 754 that claimed to have been changed to agree with this article. (Even though I don't see where.) IEEE 754 has one significant bit before the binary point, where some previously popular formats have 0 bits before the binary point. If you compare formats, it looks like the exponent is one higher than it actually is, but it really isn't. Does this article need to be changed? (And the other similar articles.) Gah4 (talk) 07:52, 9 September 2023 (UTC)

in the beginning ...
There seems to be a question on the origination of double precision. Fortran II, though maybe not quite from the beginning in 1958, added the DOUBLEPRECISION statement and data type. (Blanks are not significant in Fortran II code.) It might be possible to track down the date that it was added. ALGOL 60 has a DOUBLE keyword, which I presume declares double precision variables. It might take a while to track down how and when things got added to ALGOL, but it seems likely after 1958. Gah4 (talk) 07:24, 26 September 2023 (UTC)

First
Fortran is usually considered the first high-level language. Various not-so-high-level languages came earlier. Since Fortran had REAL from the beginning, it should be the first high-level language with a floating point type. It originated on the IBM 704, the first IBM machine with hardware floating point. I don't know about non-IBM machines. Gah4 (talk) 13:11, 16 November 2023 (UTC)

Visual Basic has a unique way of handling some of the NaN codes
I have a copy of Visual Basic 6, and it has a unique way of handling NaN codes.

While the official Wikipedia article about Double Precision FP values says this

I've found that VB6 will treat a LOT MORE than just these 3 values as NaN values. I don't know if all of these are supposed to be treated as NaN values or not (an older version of this Wiki page indicated that these would be valid NaN values, but now it instead indicates only the 3 above mentioned encodings for NaN, so I hope that someone with knowledge goes back and verifies if those 3 encodings are the only actual valid NaN encodings according to IEEE standards).

In VB6, any Double Precision NaN with the top fraction bit set to 0 like or is treated as a SNaN number when using it in an equation or passing it to some functions (some internal VB6 functions like CStr seem to detect it and trigger an error, though defined functions don't seem to trigger an error just from passing this in a variable). That is, if it's used in an equation (or even setting the variable to itself like MyDouble=MyDouble) or when used in some functions, it triggers a runtime error. So there are literally BILLIONS of possible values for an SNaN according to VB6. Now I say "when passing it to another function" it treats it as an SNaN, because if you use it directly with the Print statement to show the value (using code like Print MyDouble) then it will actually trigger no runtime error and instead say that the value is a QNaN value. The specific text it prints in that case is " 1.#QNAN".

VB6 will treat any Double Precision NaN value as QNaN in all circumstances (regardless if using the Print statement or not) if the top fraction bit is set to 1 like this or this. In these cases, it truly is a QNaN value and will not trigger any error when being passed to another function or any other situation where an SNaN value would trigger an error. Again, that means there are literally BILLIONS of values that VB6 considers valid QNaN values. In these cases, the Print statement also displays the text " 1.#QNAN".

So the Print statement makes no distinction between QNaN values and SNaN values. It doesn't even generalize them correctly by calling them NaN values. Instead it always displays them as QNaN values, which is incorrect.

Also, NaN values aren't supposed to be treated as signed. The sign bit is supposed to always be ignored. However in VB6, the Print statement does display the sign of the NaN value that was given to it. If the sign bit is 0, the Print statement displays " 1.#QNAN" while if the sign bit is 1 it instead displays "-1.#QNAN". Also there's one specific encoding of NaN that is treated differently in VB6. This encoding is In this case, the most significant 13 bits are set to 1 (sign bit, all of the exponent bits, and the top fraction bit), while all of the remaining bits are set to 0. Technically, this is one specific encoding of QNaN. This is considered the Indefinite value and is displayed by the Print statement as "-1.#IND". This value is the only value that can actually be created by doing floating point math in VB6. Things like dividing zero by zero, taking the square root of a negative number, and subtracting infinity from infinity, all generate this value (after first displaying an error). In fact, you can only get this value (instead of having the program generate an error and quit due to an impossible math calculation being performed like dividing zero by zero) by disabling VB6's forcing the program closed when an error happens on the part of the code that generates the NaN value. This is done by making sure you have the code On Error Resume Next before the code that is intended to generate the NaN. Alternatively, if you are compiling the program instead of running it in the VB6 IDE, you can set the compiling option to disable floating point error checks before you compile the program. Benhut1 (talk) 05:13, 15 July 2024 (UTC)
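
The NaN classification described in this section can be sketched in Python without ever loading the bit pattern into a floating-point register, so nothing can trap. This assumes the convention where a set top fraction bit marks a quiet NaN (the one recommended by IEEE 754-2008 and matching the VB6 behaviour reported above; some older architectures used the opposite meaning). The names `classify` and `INDEFINITE` are hypothetical, and the hexadecimal constant for the "indefinite" value is derived from the description above (top 13 bits set, the rest zero).

```python
EXP_MASK = 0x7FF0000000000000    # 11 exponent bits
FRAC_MASK = 0x000FFFFFFFFFFFFF   # 52 fraction bits
QUIET_BIT = 0x0008000000000000   # top fraction bit

def classify(bits: int) -> str:
    """Classify a 64-bit pattern using integer operations only."""
    if (bits & EXP_MASK) != EXP_MASK:
        return "finite"
    if (bits & FRAC_MASK) == 0:
        return "infinity"
    if bits & QUIET_BIT:
        return "quiet NaN"
    return "signaling NaN"

# Sign bit, all exponent bits and the top fraction bit set; everything else 0.
INDEFINITE = 0xFFF8000000000000

# Any nonzero fraction with an all-ones exponent is a NaN, for either sign.
nan_count = 2 * (2 ** 52 - 1)
```

By this count there are 2^53 − 2 NaN encodings in total, so considerably more than billions.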