In 64-bit mode, I had to be careful to rely only on the 64-bit processing benefits, not the large memory footprint stuff, since The Quad had only 2GB RAM compared to the humongous 8GB of The Oct.
The 64-bit version of the Sandra bench threw out pretty much the same results as the 32-bit one (note that in both cases it used SSE4 instructions, so these extra ops are already included in the 65 nm Core 2 generation!). Even memory bandwidth still returned the same 8.6GB/s ('lucky number' 8.8GB/s Triad in memory stream) for The Quad, and half of that, 4.3GB/s, on The Oct. The FSB / memory efficiency still stood at 65% of the peak FSB bandwidth on The Quad with its Nvidia chipset, and 41% on The Oct. The Sandra memory test unfortunately doesn't use both processors and FSBs, so the dual-FSB and dual memory bus potential of the Greencreek chipset is not utilised here - exposing instead only the high-latency FB-DIMM penalty.
An interesting thing happened when I ran the 64-bit version of Cinema4D - it scales better on The Oct than the 32-bit one, and by quite a bit: 4.7x vs 4.1x. On the other hand, The Quad's scalability stays the same as on the 32-bit version, within the 2 per cent benchmark tolerance margin. I also ran another render routine, the famous POVray 64-bit renderer, the brand new 3.7 beta with full multithreading - another suggestion from a reader. Surprisingly, it scales near linearly - the raytrace render on eight CPUs was really eight times faster, within a few per cent, and the same happened on The Quad.
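For readers who want to check the efficiency figures, here is a rough sketch of the arithmetic. It assumes a 64-bit wide FSB data path, decimal gigabytes, and that The Quad's Sandra run was done at the 1667 MHz FSB setting mentioned later in this article. The function names are made up for illustration, and The Oct's FSB speed is not stated here, so only its implied peak is backed out from the quoted numbers.

```python
FSB_WIDTH_BYTES = 8  # assumption: 64-bit wide front-side bus data path


def fsb_peak_gb_s(mega_transfers_per_s: float) -> float:
    """Theoretical peak FSB bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return mega_transfers_per_s * 1e6 * FSB_WIDTH_BYTES / 1e9


def fsb_efficiency(measured_gb_s: float, mega_transfers_per_s: float) -> float:
    """Measured memory bandwidth as a fraction of the peak FSB bandwidth."""
    return measured_gb_s / fsb_peak_gb_s(mega_transfers_per_s)


def implied_peak_gb_s(measured_gb_s: float, efficiency: float) -> float:
    """Peak bandwidth implied by a measured figure and a quoted efficiency."""
    return measured_gb_s / efficiency


# The Quad: 8.6 GB/s measured, FSB assumed to run at 1667 MT/s
print(f"The Quad peak:       {fsb_peak_gb_s(1667):.2f} GB/s")
print(f"The Quad efficiency: {fsb_efficiency(8.6, 1667):.1%}")

# The Oct: 4.3 GB/s measured at a quoted 41% efficiency
print(f"The Oct implied peak: {implied_peak_gb_s(4.3, 0.41):.2f} GB/s")
```

Under those assumptions The Quad lands at roughly 64 to 65 per cent, in line with the quoted figure.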

What does this mean? Another reader commented that typical multithreaded apps (Cinema4D, for instance) may have a fixed serial, non-threaded code portion which can't be parallelised and runs only on one CPU. In Cinema4D, it could be somewhere around 10%, maybe a bit less on the seemingly better optimised 64-bit version. The remaining 90% or so then gets scaled across the CPUs.
However, in Povray, it seems that the whole ray-trace render routine is 100% multithreaded, without a serial portion - hence this ideal 'heavenly scaling'. And yes, that's why Intel was showing scalable real-time ray tracing on the demo Tigerton (16-core quad-socket server) at last year's IDF, at least the one in Taipei that I attended. After all, ray tracing is highly immune to FSB / memory bottlenecks, as most of the code fits nicely in caches, and each piece of data you bring from memory is processed many times.
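The reader's point is essentially Amdahl's law: if a fixed fraction of the run is serial, the speedup on N CPUs is 1 / (serial + (1 - serial) / N). A minimal sketch follows. The 10% figure is the rough guess from the text above, not a measured value, and `amdahl_speedup` is just an illustrative name.

```python
def amdahl_speedup(serial_fraction: float, cpus: int) -> float:
    """Speedup predicted by Amdahl's law for a given serial code fraction."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    # The serial part runs on one CPU; the rest is split evenly across all CPUs.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cpus)


# Compare a guessed 10% serial portion with a fully parallel routine.
for cpus in (1, 2, 4, 8, 16):
    with_serial = amdahl_speedup(0.10, cpus)
    fully_parallel = amdahl_speedup(0.0, cpus)
    print(f"{cpus:2d} CPUs: 10% serial -> {with_serial:.2f}x, "
          f"0% serial -> {fully_parallel:.2f}x")
```

With a 10% serial portion the model gives about 4.7x on eight CPUs, close to what Cinema4D 64-bit managed on The Oct, while a routine with no serial portion scales linearly, as Povray very nearly does.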
Now, here are the additional benchmark results:
| Benchmark | The Quad | The Oct |
|---|---|---|
| Cinebench 64 bit | | |
| One CPU | 584 | 452 |
| All CPUs | 1839 | 2125 |
| Speed Factor | 3.15 | 4.70 |
| Povray 3.7 64-bit | | |
| One CPU | 668 | 526 |
| All CPUs | 2592 | 3990 |
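To put the table's numbers in perspective, the sketch below works out the speed factor for each run and then backs out the serial fraction those factors would imply, using the Karp-Flatt metric (Amdahl's law solved for the serial part). It assumes that all scores are throughput-style (higher is better) and that The Quad runs on four cores and The Oct on eight. The dictionary layout and function names are made up for illustration.

```python
# (benchmark, machine, cores) -> (one-CPU score, all-CPU score), from the table
RESULTS = {
    ("Cinebench 64-bit", "The Quad", 4): (584, 1839),
    ("Cinebench 64-bit", "The Oct", 8): (452, 2125),
    ("Povray 3.7 64-bit", "The Quad", 4): (668, 2592),
    ("Povray 3.7 64-bit", "The Oct", 8): (526, 3990),
}


def speed_factor(one_cpu: float, all_cpus: float) -> float:
    """Multi-CPU score divided by single-CPU score (assumes higher is better)."""
    return all_cpus / one_cpu


def implied_serial_fraction(speedup: float, cpus: int) -> float:
    """Karp-Flatt metric: serial fraction implied by a measured speedup."""
    if cpus < 2:
        raise ValueError("need at least 2 CPUs to estimate a serial fraction")
    return (1.0 / speedup - 1.0 / cpus) / (1.0 - 1.0 / cpus)


for (bench, machine, cores), (one, many) in RESULTS.items():
    factor = speed_factor(one, many)
    serial = implied_serial_fraction(factor, cores)
    print(f"{bench:18s} {machine:8s} {factor:.2f}x on {cores} cores, "
          f"efficiency {factor / cores:.0%}, implied serial part {serial:.1%}")
```

This is only an estimate: the implied serial part lumps together truly serial code, memory contention and thread overhead, so treat it as a rough indicator rather than a profile of the application.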
So, up to now, the scalability varies wildly - for a lot of apps and routines, four cores are already the limit, unless of course you run many apps side by side. Therefore focusing on increasing the core speed, as well as its FSB and memory responsiveness, may be the way to further real performance gains - and that's what I attempted by pushing the FSB to 1667 MHz throughput on a 3333 MHz single-chip quad-core desktop Xeon. But if you need, say, high-end rendering yet have only a single-machine license for an expensive 3-D authoring application, The Oct is starting to make good sense.
The Novell Suse 10.2 64-bit edition Linux had some boot problems on the HP system after installation, and I'll be addressing those in the next few days - we'll do the final instalment, with Linux benchmarks as well as in-memory multithreaded dataset runs, at that point. And yes, I am doing my best to run any benchmarks you guys suggest - some, like 64-bit versions of Sciencemark or SuperPI, were not possible to find. Also, I'd like to see Futuremark compile a 64-bit 3DMark / PCMark suite, with a bit better threading too - just, please, don't make it Vista-only!