Tech Check

Google
 

Sunday, 13 July 2008

The Nehalem Preview: Intel Does It Again

Two years ago in Taiwan at Computex 2006 Gary Key and I stayed up all night benchmarking the Core 2 Extreme X6800, the first Core micro-architecture (Conroe core) CPU we had laid our hands on. While Intel retroactively applied its tick-tock model to previous CPU generations, it was the Core micro-architecture and the Core 2 Duo in particular that kicked it all off.

At the end of last year we saw the first update to Core, the first post-Conroe "tick" if you will: Penryn. Penryn proved to be a nice upgrade to Conroe, reducing power consumption even further and giving a slight boost to performance. What Penryn didn't do however was shake the world the way Conroe did upon its launch in 2006.

After every tick however, comes a tock. While Penryn was a die shrink of an existing architecture, Nehalem is a brand new architecture built on the same 45nm process as Penryn. It's sort of a big deal, being the first tock after the incredibly successful Core 2 launch.


731M transistors, four cores, eight threads

It's like clockwork with Intel; around six months before the release of a new processor, it's sent over to Intel's partners so they may begin developing motherboards for the chip. It was true with Northwood, Prescott, Conroe, Penryn and now Nehalem. And plus, did you really expect, on the eve of the two year anniversary of our first Core 2 preview, a trip to Taiwan for Computex without benchmarks of Nehalem? In the words of Balki Bartokomous, don't be ridiculous :)


Yep, that's what you think it is

Without Intel's approval, supervision, blessing or even desire - we went ahead and snagged us a Nehalem (actually, two) and spent some time with them.

(Sorry guys, stop making interesting chips and we'll stop trying to get an early look at them :)...)

Not One Nehalem, but Two

Nehalem itself is very stable but it has only been in Taiwanese motherboard manufacturer hands for a relatively short while now, so the only truly mature motherboards are made by Intel. Unfortunately since Intel didn't sanction our little Nehalem excursion, we were left with little more than access to some early X58 based motherboards in Taiwan. Thankfully we had two setups to play with, each for a very limited time.

We had access to a 2.66GHz Nehalem for the longest time, unfortunately the motherboard it was paired with had some serious issues with memory performance. Not only was there no difference between single and triple channel memory configurations, memory latency was high. We know this was a board specific issue since our second Nehalem platform didn't exhibit any issues. Unfortunately we didn't have access to the more mature platform for very long at all, meaning the majority of our tests had to be run on the first setup (never fear, Nehalem is fast enough that it didn't end up mattering).

The second issue we ran into was a PCI Express problem that kept us from running any meaningful GPU benchmarks. We've been told that it'll take the motherboard guys about a month to work out these kinks, but that's why you shouldn't expect to see a full performance evaluation of Nehalem in the near term.

The CPUs are quite mature and are running extremely cool (surprisingly cool actually), their clock speeds are being artificially limited by Intel in order to avoid putting all cards on the table at this time. We saw a similar approach with the very first Penryn samples which were all locked at 2.66GHz. The Intel X58 chipset we used in our testing on the other hand got quite hot.


Nehalem no longer has a conventional FSB, its clock speed is derived from a multiplier of an external clock frequency - in this case 133MHz. Expect all Nehalem chips to come out in frequencies that are multiples of 133MHz.

Thankfully we don't want a thorough look at Nehalem today, we'll save that for the launch - what we do want is to whet our appetite. We want to know if Intel can pull it off a second time.


The Socket

With an integrated memory controller, Intel needed a new pinout for Nehalem and the first version with three 64-bit DDR3 memory channels features a 1366-pin LGA interface:


LGA-1366 (left) vs. LGA-775 (right)

The socket is noticeably bigger than LGA-775 as is the mounting area for heatsinks. You can't reuse LGA-775 heatsinks and instead must use a heatsink with mounting holes more spread apart. As far as we can tell, the same push-pin mounting mechanism from LGA-775 is present in Nehalem which is disappointing.

With a larger socket and more pins, the CPU itself is obviously bigger. Here's a shot of our Nehalem compared to a Core 2 Duo E8500:


Nehalem (left) vs. Penryn (right)


Nehalem (left) vs. Penryn (right)

Intel will obviously have dual-channel versions of Nehalem in the future, unfortunately it looks like they will use a smaller socket for mainstream versions of the chip (LGA-1160?). We won't have to deal with socket segmentation just yet and it is always possible that Intel will choose to stand behind a single socket for the majority of the desktop market, reserving LGA-1366 for a Skulltrail-like high end but the strategy is unclear at this point.

The Return of Hyper Threading

While Nehalem is designed to scale to up to 8 cores per chip, each one of those cores has the hardware necessary to execute two threads simultaneously - yep, it's the return of Hyper Threading. Thus our quad-core Nehalem sample appeared as 8 logical cores under Windows Vista:


Four cores, eight threads, all in a desktop CPU

Note that as in previous implementations of Hyper Threading (or other SMT processors) this isn't a doubling of execution resources, it's simply allowing two instruction threads to make their way down the pipeline at the same time to make better use of idle execution units. Having 8 physical cores will obviously be faster, but 8 logical (4 physical) is a highly power efficient way of increasing performance.

We took Valve's source-engine map compilation benchmark and measured the compile time to execute one instance (4 threads) vs. two instances of the benchmark. The graph below shows the increase in compilation time when we double the workload:

Valve Map Compilation Benchmark - 4 to 8 Thread Scaling

While the 2.66GHz Core 2 Quad Q9450 (Penryn) takes another 127 seconds to execute twice the workload, the 2.66GHz Nehalem only needs another 49 seconds. And if you're curious, this quad-core Nehalem running at 2.66GHz is within 20% of the performance of an eight-core 3.2GHz Skulltrail system. Equalize clock speed and we'd bet that a quad-core Nehalem would be the same speed as an 8-core Skulltrail here. The raw performance numbers are below:

Valve Map Compilation Benchmark

We couldn't disable Hyper Threading so we reached the limits of what we were able to investigate here.


Power Consumption

Built on the same 45nm process as Penryn, we expect Nehalem to have higher power consumption than Penryn but given Intel's target of a 1% increase in performance for no more than a 1% increase in power consumption per microarchitectural change - the results should be reasonable:

Total System Power Consumption - Idle

Total System Power Consumption - Load

And reasonable they are; for a 20 - 50% increase in performance, total system power consumption only went up by 10%.

Final Words

First keep in mind that these performance numbers are early, and they were run on a partly crippled, very early platform. With that preface, the fact that Nehalem is still able to post these 20 - 50% performance gains says only one thing about Intel's tick-tock cadence: they did it.

We've been told to expect a 20 - 30% overall advantage over Penryn and it looks like Intel is on track to delivering just that in Q4. At 2.66GHz, Nehalem is already faster than the fastest 3.2GHz Penryns on the market today. At 3.2GHz, I'd feel comfortable calling it baby Skulltrail in all but the most heavily threaded benchmarks. This thing is fast and this is on a very early platform, keep in mind that Nehalem doesn't launch until Q4 of this year.

One valid concern is with regards to performance in applications that don't scale well beyond two or four cores, what will Nehalem offer us then? Our DivX test doesn't scale well beyond four cores and even then Nehalem's performance was in the 20 - 30% faster range that we've been expecting. The other thing to keep in mind is that none of these tests are really stressing Nehalem's integrated memory controller. When AMD made the move to an IMC, we saw an instant 20% performance boost in most applications. I suspect that the applications that don't benefit from Hyper Threading, will at least benefit from the IMC. We've only scratched the surface of Nehalem here, looking at the benefits of Hyper Threading and its lower latency unaligned cache accesses. We've hinted at what's to come with the extremely well balanced and low latency memory hierarchy of Intel's new baby. Once this thing gets closer to launch, we should be able to fill in the rest of the puzzle.

Over six years ago I had dinner with Intel's Pat Gelsinger (back when he was Intel's CTO), and I asked him the same question I always do: "what are you excited about?" Back then his response was "threading", Intel was about to launch Hyper Threading and Pat was convinced that it was absolutely necessary for the future of microprocessors.

It was at the same dinner that Pat mentioned Intel may do a chip with an integrated memory controller much like AMD, but that an IMC wouldn't solve the problem of idle execution units - only indirectly mitigate it. With Nehalem, Intel managed to combine both - and it only took 6 years to pull it off.

Pat also brought up another very good point at that dinner. He turned to me and said that you can only integrate a memory controller once, what do you do next to improve performance? Intel has managed to keep increasing performance, but what I really want to see is what happens at the next tock. Intel proved its ability with Conroe and with Nehalem it shows that the tick-tock model can work, but more than anything looking at Nehalem today makes me excited at what Sandy Bridge will bring.

The fact that we're able to see these sorts of performance improvements despite being faced with a dormant AMD says a lot. In many ways Intel is doing more to improve performance today than when AMD was on top during the Pentium 4 days.

AMD never really caught up to the performance of Conroe, through some aggressive pricing we got competition in the low end but it could never touch the upper echelon of Core 2 performance. With Penryn, Intel widened the gap. And now with Nehalem it's going to be even tougher to envision a competitive high-end AMD CPU at the end of this year. 2009 should hold a new architecture for AMD, which is the only thing that could possibly come close to achieving competition here. It's months before Nehalem's launch and there's already no equal in sight, it will take far more than Phenom to make this thing sweat.


Your Ad Here