The old adage 'Fight fire with fire' does not apply to non-metaphorical fires
INTEL'S FIRST Nehalem CPU, the Core i7 uni-processor desktop entry, is now out for everyone to see - you saw roughly three thousand reviews coming out on the NDA lift date this past Monday.
Something that bugged us a bit in those early benchmarks - in common with tests we've run before on Intel's systems - was that they were all using both the HT multithreading and Turbo functions.
Yes, these are important benefits for many typical users, except, say, ray tracers, where each thread takes up the whole core without much benefit from HT, or the real-time gang, who prefer the CPU to run at one fixed clock - always.
So, to compare the raw core processing capability of Core i7 965 vs Core 2 QX 9770 at the same clock, we disabled the two features for these initial few benchmarks - a bit of a "handicap" run. Let's see how the newbie copes with it.
Before that, let's look at the initial configuration for the first of our many Nehalem reviews this month. Rather than relying on the Intel test setup, we paired the i7 965 CPU with an Asus P6T Deluxe mobo and - get this - a whopping twelve gigabytes of RAM on board, in six Qimonda Aeneon Xtune DDR3-1866 1.5 volt DIMMs, the same ones that we tried on the X48-based Asus Rampage Extreme last month with pretty good results.
In this case, using six identical DIMMs meant fully loading all three channels with two double-sided DIMMs each, driving the integrated memory controllers of the Nehalem CPU to the full capacity. How far could it be pushed without big time manual tweaking? See it in a while.
As for the rest of the setup, we used the OCZ PC Power 860W PSU, more than sufficient for the machine even when using the GTX280 or 4870X2 cards. Besides the Thermalright six heat-pipe sink with a big silent fan, supplied courtesy of Intel, we also used three Thermaltake fans to cool the surrounding board areas, especially the north bridge heat sink, as the ambient temperature stayed at about 30C even at night here in Singapore.
Interestingly, despite us not overclocking the chipset at all, and the copper heat pipes Asus used, the contraption on top of the chipset and VRMs was getting quite hot to the touch even before running serious benches. That's where the 3,000 rpm Thermaltake 8cm fans did help. Keep in mind that, on Core 2, you pretty much have to overclock the chipset and FSB to get more system throughput - not necessary here.
For the BIOS and boot tests, we used the conveniently available Sparkle Calibre 9500GT card fresh from a review, while the 3DMark Vantage and other stuff was done using the GTX280 and HD4870X2 cards of course. So, one can say, a pretty decent high-end setup.
While the Intel SSD RAID0 setup is waiting in the wings here, this first test used a simple Super Talent 60GB SSD drive for the Vista64 boot.
The test goals in this round were to leave everything at Auto - except the memory latency timings of course - and see how far the brand new platforms go, all BIOSes updated.
The system booted fine at 3.2GHz and, with the new BIOS 0804, changing the core clock multiplier from 24 to 30 (x 133 MHz) to up the CPU to 4GHz worked fine in the Auto mode. This setting passed absolutely all initial benchmarks we ran - Sandra 2009, Povray 3.7 and Cinebench all passed with flying colours. Even Linpack, the usual TDP breaker, completed, although the results there will have to wait for the optimised binary version to be worth analysing.
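For the arithmetic-minded, the multiplier change can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and the base clock is assumed to be the nominal 400/3 MHz (133.33...) rather than the rounded "133 MHz" quoted above.

```python
# Core clock = multiplier x base clock.
# ASSUMPTION: the "133 MHz" base clock is nominally 400/3 MHz (133.33...).
BCLK_MHZ = 400 / 3


def core_clock_ghz(multiplier: int, bclk_mhz: float = BCLK_MHZ) -> float:
    """Return the core clock in GHz for a given multiplier (hypothetical helper)."""
    return multiplier * bclk_mhz / 1000


# The two settings used in this test: stock (x24) and overclocked (x30).
for mult in (24, 30):
    print(f"x{mult}: {core_clock_ghz(mult):.2f} GHz")
```

With those assumptions, x24 works out to 3.20 GHz and x30 to 4.00 GHz; plugging in a flat 133 MHz instead gives 3.19 and 3.99 GHz.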
Sandra CPU & multimedia at 3.2GHz, and at 4GHz
Note that in Sandra 2009 CPU and multimedia benchmarks, if you leave the stuff at 3.2GHz no HT and no Turbo, overall the new CPU runs about the same on average as the Penryn based QX9770.
Why? Well, the raw compute power is about the same, the internal microarchitecture is surely somewhat more efficient on the i7, but any routine that spreads nicely over those 12MB of fast 15-cycle latency L2 cache in QX9770 may take a bit of a performance hit on the 8MB of 38-cycle L3 in the i7 965. But then, those four fast 9-cycle per-core L2 caches should be of some help. And yes, once you start taking things from memory, there's no competition to the Nehalem.
Sandra memory: two-channel DDR3-1600 and three-channel DDR3-1333
Talking about the memory, we were quite happy with the speed obtained at the current mobo and BIOS development stage. Running three channels fully loaded with two 16-chip DIMMs each at (supposedly, didn't do the multimeter check yet) default stock voltage and getting DDR3-1333 CL 7-7-7-18 CR2 wasn't bad. If you want to run them at DDR3-1600 CL8, it will do so without pushing up the memory voltage - but, if you don't overvoltage the memory controller and the buses, only two channels out of three will be active. See the benchmark difference for yourself. The three-channel 1333 was 20 per cent faster than the two-channel 1600 in bandwidth.
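As a rough sanity check on that figure, here is a back-of-envelope Python sketch of the theoretical peak bandwidth of the two configurations. It assumes the standard 64-bit (8-byte) DDR3 channel width and takes the DDR3 speed grades at face value as mega-transfers per second; the function and variable names are illustrative only, and these are paper numbers, not measurements.

```python
# Theoretical peak bandwidth = channels x transfers per second x bytes per transfer.
# ASSUMPTION: each DDR3 channel is 64 bits (8 bytes) wide.
BYTES_PER_TRANSFER = 8


def peak_bandwidth_gbs(channels: int, mega_transfers: float) -> float:
    """Return theoretical peak bandwidth in GB/s (hypothetical helper)."""
    return channels * mega_transfers * BYTES_PER_TRANSFER / 1000


three_ch_1333 = peak_bandwidth_gbs(3, 1333)  # three channels of DDR3-1333
two_ch_1600 = peak_bandwidth_gbs(2, 1600)    # two channels of DDR3-1600
advantage = three_ch_1333 / two_ch_1600 - 1

print(f"3 ch DDR3-1333: {three_ch_1333:.1f} GB/s")
print(f"2 ch DDR3-1600: {two_ch_1600:.1f} GB/s")
print(f"Three-channel advantage on paper: {advantage:.0%}")
```

On paper the three-channel setup comes out roughly a quarter ahead, so the 20 per cent measured gap sits a little under the theoretical ceiling, as you would expect from a real benchmark.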
Povray (one and four cores) and Cinebench
Superlinear Cinebench scaling? Remember, the HT was DISABLED. This is good... Now, recall the QX9770 3.2 GHz results?
This is how 3.2 GHz QX9770 performed, the best Penryn:
Povray 1CPU - 654.1
Povray allCPU - 2465.3
CineBench10 1Cpu - 3954
CineBench10 AllCpu - 15372
Even if you scale these up linearly 20% for 4 GHz clock, the Core i7 965 will still be a fifth faster, clock for clock - and scale better due to on-die intercore comms and a single shared L3 cache with much wider memory bandwidth.
The next run will include more of the media stuff and general apps, including those with heavy computational content, as well as, of course, more of manual tuning on this and other mobos, plus hopefully some better cooling as more LGA 1366 support comes along. Did I also mention three channel DDR3-2000 1.65 volt memory, the pride of Kingston?
3DMark Vantage - performance and extreme: Nvidia GTX280 and ATI 4870X2
Notice something interesting in the 3DMark Vantage runs - the NV-friendly CPU physics run on the GTX280 doubles the result compared to the true CPU-only test when using the ATI 4870X2. However, for the overall graphics results, it is an absolutely clear 4870X2 win. Since the Nehalem is reasonably fast for a CPU we guess, the 4870X2 with its somewhat faster graphics may be the best match for now.
In summary, for a typical high end user wanting the best performance with almost no manual tweaks, this first Nehalem Core i7 is surely a hit. The job is easier than with the initial Core2 CPUs, and, even in the extreme Singapore everyday heat, we can recommend the 4GHz aircooled setting for the i7 965 as the default for everyday operation.
According to sources at Intel, enabling the HT and Turbo should not affect this at all. With that stuff on, in any situation with either lots of threads (HT) or just one or two of them (Turbo), it will then surely beat the Core 2 QX'es. And yes, with no FSB bottleneck, we expect this platform to scale far better in the future...
Even at 1.4 volts, the heat sink was barely warm except in Linpack - and that's just the current C0 stepping. Do expect a fairly quick sequence of new stepping updates, as it should be with a brand new CPU anyway.
Those wanting to be the first need not hold out for the future steppings - the current one is quite good itself and, after all, the boards should be ready to take in the Westmere 32nm Core i7 shrink a year from now.
How does six cores with twelve meg L3 and 3.6 GHz and above clocks sound? Well, kinda nice match to the 3-channel DDR3 finally. µ
"Even if you scale these up linearly 20% for 4 GHz"

0.8 GHz is 25% of 3.2 GHz
Why compare an overclocked result for i7 to a stock result for a QX9770?

Plus 5544 / (3954 x 1.25) = 1.122

Or 12.2 % faster clock for clock on a single thread, and 16.7% faster on the multithread.

Why is Cinebench reporting 8 cores if HT is disabled?
??
@waxwing
.8GHz = .2*4GHz - tho agreed it's scaled up 25% 

what i wonder though is how hard it's going to be to get that 3.2GHz machine to run at 5GHz stable
please do a follow up with some emphasis on HT. heard HT in Nehalem was good. a confirmation from theinquier would be nice.

Go back only 2 1/2 years ago and AMD had higher margins than Intel with their lousy netburst designs. It is truly amazing how far Intel has come in such a short period of time. Either that or it is truly amazing how poorly AMD has performed over the past couple of years.
Congratulations to Intel.
It is a first big step towards cluster 4 core supercomputing on a new level making a great possibility to visualize by medical doctors a human internal organs with new enhanced resolution.
I'm not so sure about your "ray tracers where each thread takes up the whole core without much benefit from HyperThreading" : As far as I know, ray-tracing can become quite memory bound for high complexity scenes and/or incoherent rays. In that case, HT could help alleviate cache-misses, wouldn't it ?
I don't think the multiplier is high enough. Why don't we just get rid of the fsb (or what ever you want to call it now) and just use the multiplier.
This way we can make it as high as we want.
30, come on you got to be kidding me.