Talk:POWER8

Bad maths?
Quote: ''The link between the POWER8 chip and the Centaur is a 9.6 GB/s with 40 ns latency. ... Each POWER8 can be linked to up to eight Centaur chips for an aggregated 128 MB L4 cache and 230 GB/s sustained and 410 GB/s peak memory bandwidth in and out of the processor.''

Can anyone explain how 8 times 9.6 equals 230, or even 410? Please avoid quoting nonsense from El Reg, and do not forget to apply common sense. -- Preceding unsigned comment added by 188.126.5.54 (talk) 09:04, 29 April 2014 (UTC)


 * That's a very good question. Please have a look at this paper, pages 15–17: the "410 GB/s peak bandwidth" is the bandwidth between the Centaur chips and DRAM, not between the POWER8 chip and the Centaurs. However, I (still) have no idea how eight 9.6 GB/s channels between the POWER8 and the Centaurs combine into a total of 230 GB/s. -- Dsimic (talk | contribs) 08:15, 1 May 2014 (UTC)


 * If you consider that the 9.6 GB/s is for a 1-byte-wide bus and that the 230 GB/s and 410 GB/s figures are for the full 64-byte-wide bus, then it makes sense. For the 64-byte bus, 50% would be 256 GB/s and 100% would be 512 GB/s. Since no FLOP performance is given, I would like to address that now, too. Based on other specifications given by IBM, found "here": 12 cores x 8 instructions/cycle x 5 GHz = 480 GFLOPS single precision (four double-precision or eight single-precision operations per cycle). Mightyname (talk) 13:02, 13 August 2014 (UTC)


 * Hm, but 1 GB/s is 1 GB/s regardless of the underlying bus width. Any chance of some exact math, please? -- Dsimic (talk | contribs) 02:57, 16 August 2014 (UTC)


 * There are 4x 64-bit FPUs and 2x 128-bit VMX vector units per POWER8 core. Assuming all of them can do fused multiply-add (2 FLOPs per instruction), the peak DP rate would be (4 (FPU1–4) + 2 (VMX1) + 2 (VMX2)) x 2 = 8 x 2 = 16 DP FLOPs per clock tick. At 5 GHz this would translate to 16 x 5 = 80 DP GFLOPS per core, or 960 DP GFLOPS for the 12-core package. Single precision would be roughly 2x that, at 160 SP GFLOPS per core and 1920 SP GFLOPS (1.9 TFLOPS) for the whole 12-core chip, which is consistent with IBM's claimed 1.6x improvement over POWER7. The other "execution units" in the POWER8 core are not relevant to IEEE 754 FP math. --89.137.178.0 (talk) 18:07, 21 August 2014 (UTC)
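For anyone who wants to re-run the arithmetic in the comment above, here is a minimal Python sketch. Every input (unit counts, FMA on all units, 5 GHz clock, 12 cores, SP at twice the DP rate) is that comment's assumption rather than a confirmed IBM specification, and the function name `peak_gflops` is made up for illustration.

```python
# Peak FLOPS estimate using only the assumptions stated in the comment above.
# None of these inputs are confirmed IBM specifications.

def peak_gflops(dp_lanes, flops_per_fma, clock_ghz, cores):
    """Return (per-core, per-chip) peak GFLOPS (hypothetical helper)."""
    per_core = dp_lanes * flops_per_fma * clock_ghz
    return per_core, per_core * cores

# 4 scalar FPUs + 2 DP lanes in each of the two 128-bit VMX units
dp_lanes = 4 + 2 + 2

dp_core, dp_chip = peak_gflops(dp_lanes, flops_per_fma=2, clock_ghz=5, cores=12)

# The comment takes single precision as roughly twice the DP rate.
sp_core, sp_chip = 2 * dp_core, 2 * dp_chip

print(f"DP: {dp_core} GFLOPS/core, {dp_chip} GFLOPS/chip")
print(f"SP: {sp_core} GFLOPS/core, {sp_chip} GFLOPS/chip")
```

Changing `clock_ghz` or `cores` shows how strongly the headline figure depends on the assumed 5 GHz clock.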


 * The answer to the original question is here: http://www.setphaserstostun.org/power8/POWER8_UM_external_22APR2014_pub.pdf, section 4.4.2.1 "Bandwidth". The channel to the Centaur is characterized as "Single 16-byte read data ramp, dual 16-byte write data ramp interfaces". Therefore the 9.6 GB/s per link quoted is actually the WRITE speed to RAM. Reading speed is twice that ("dual 16-byte write") at 19.2 GB/s, and the TOTAL bandwidth for one link AS CALCULATED BY IBM is 9.6 + 19.2 = 28.8 GB/s, which multiplied by 8 channels gives 230.4 GB/s. That is the quoted 230 GB/s figure, which is actually PEAK bandwidth, not SUSTAINED as described in IBM's presentation.
 * However, the processor bus is described later as "Processor bus transfer size = 32 bytes", i.e. 256 bits wide. This means that you can READ 32 bytes at a time OR WRITE 16 bytes, but not BOTH at the same time, i.e. there are two 128-bit lanes to the Centaur, one unidirectional (read-only) and one reversible (read/write). Therefore the correct bandwidth calculation is 9.6 x 8 = 76.8 GB/s peak read and 9.6 x 8 x 2 = 153.6 GB/s peak write per socket, and NOT the combined figure of 76.8 + 153.6 = 230.4 GB/s.
 * If we apply IBM's bandwidth calculation method to a quad-channel Haswell running memory at 2133 MT/s, we obtain 68 GB/s peak read + 68 GB/s peak write = 136 GB/s, and suddenly POWER8 doesn't sound so impressive and capable of handling "thousands of x86 workloads simultaneously". At best it has ~2x the peak main-memory read bandwidth. 89.120.104.138 (talk) 09:29, 19 January 2015 (UTC)
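The bandwidth arithmetic in the three points above can be checked with a short Python sketch. It uses only the figures quoted in this thread; the variable names are made up, the two link directions are given neutral names rather than "read" and "write", and the 8 bytes per transfer for the Haswell case assumes a standard 64-bit DDR channel.

```python
# Sketch of the bandwidth arithmetic discussed above.
# All figures come from the comments; the names are hypothetical.

LINK_GBPS = 9.6   # per-link figure quoted in the article
CHANNELS = 8      # Centaur links per POWER8 socket

narrow_per_link = LINK_GBPS        # single-width direction
wide_per_link = 2 * LINK_GBPS      # double-width direction
combined_per_link = narrow_per_link + wide_per_link

narrow_total = narrow_per_link * CHANNELS
wide_total = wide_per_link * CHANNELS
combined_total = combined_per_link * CHANNELS

# Quad-channel memory at 2133 MT/s; 8 bytes per transfer is an assumption
# (standard 64-bit DDR channel).
haswell_one_way = 2133e6 * 8 * 4 / 1e9
haswell_counted_twice = 2 * haswell_one_way  # same counting method as above

print(f"per link, both directions summed: {combined_per_link:.1f} GB/s")
print(f"per socket: {narrow_total:.1f} + {wide_total:.1f} = {combined_total:.1f} GB/s")
print(f"Haswell: {haswell_one_way:.1f} GB/s one way, "
      f"{haswell_counted_twice:.1f} GB/s counted both ways")
```

Summing both directions is what turns eight 9.6 GB/s links into the 230 GB/s headline number; the per-direction totals are the ones to compare against other platforms.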


 * Also, according to the above resource, the POWER8 core can issue 16 single-precision FP ops per clock cycle, not 32 (the same units process both scalar FP and vector FP and can't do both at the same time), so the correct performance figures are 40 DP / 80 SP GFLOPS per core at 5 GHz. For comparison, a 3 GHz Haswell core can manage 48 DP / 96 SP GFLOPS.
 * A fast chip, sure, arguably with twice the peak I/O bandwidth of the latest and greatest Xeons and half the FP throughput, for which it compensates with a 5 GHz clock rate, but certainly over-hyped by IBM, IMHO. 89.120.104.138 (talk) 09:57, 19 January 2015 (UTC)
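As a rough check of the per-core comparison above, the sketch below multiplies FLOPs per cycle by clock rate. The POWER8 per-cycle figure is the one given in the comment; the Haswell per-cycle figures (16 DP / 32 SP) are inferred from the quoted 48 / 96 GFLOPS at 3 GHz, and the helper name `gflops` is made up.

```python
# Per-core peak FLOPS comparison, using the figures from the comment above.
# Haswell per-cycle values are inferred from the quoted GFLOPS, not sourced.

def gflops(flops_per_cycle, clock_ghz):
    """Peak GFLOPS for one core (hypothetical helper)."""
    return flops_per_cycle * clock_ghz

power8_sp = gflops(16, 5)   # 16 SP ops per cycle at 5 GHz
power8_dp = gflops(8, 5)    # DP taken as half the SP rate
haswell_sp = gflops(32, 3)  # implied by 96 SP GFLOPS at 3 GHz
haswell_dp = gflops(16, 3)  # implied by 48 DP GFLOPS at 3 GHz

print(f"POWER8 : {power8_dp} DP / {power8_sp} SP GFLOPS per core")
print(f"Haswell: {haswell_dp} DP / {haswell_sp} SP GFLOPS per core")
print(f"Haswell/POWER8 per-core ratio: {haswell_dp / power8_dp:.2f}")
```

On these numbers the per-clock rate is half that of Haswell, and the higher clock narrows the per-core gap to a factor of 1.2.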


 * Thank you very much for the explanation and pointers to the documentation! -- Dsimic (talk | contribs) 10:33, 19 January 2015 (UTC)

Erh
The software that I'm using now doesn't support POWER7 or POWER8. It runs extremely slowly on POWER7... Unless there is a compiler that can make code written for Intel SIMD run fast on POWER7 or POWER8, I still think my school should never have bought a cluster made of these CPUs.
 * See https://www.spec.org/cpu2006/results/res2010q2/cpu2006-20100426-10752.html

The base score is for the benchmark compiled with out-of-the-box settings. The peak (blue) score is with optimizations enabled, which gives almost 2x gains in performance. The compiler is IBM's XL C/C++ for AIX. The result for POWER8 is similar, although the difference is smaller. So yes, you need to compile with optimizations enabled and use IBM's compiler if you want to use your chips efficiently. If, as you say, the original code was also tuned for Intel's SIMD (I suppose you mean optimised for MMX/SSE/AVX, alignments, caches, access patterns etc.), then it might need a complete rewrite, and no compiler will help you much with that. -- Preceding unsigned comment added by 89.120.104.138 (talk) 09:25, 4 November 2014 (UTC)
 * This is not a forum for discussing how to use the product in the article. -- Henriok (talk) 14:18, 4 November 2014 (UTC)