Wednesday, July 28, 2010

Intel Xeon 5600-Series

The professional space is peppered with products derived from the desktop. Today we're looking at Intel's Xeon X5680 CPUs, which look a lot like Core i7-980X, only they're optimized for dual-socket platforms. We're also introducing new Adobe CS5 tests.
Back in 2005, Intel changed the trajectory of desktop computing by introducing its first dual-core Pentium processors. Having realized it was fighting an uphill battle trying to push frequencies beyond 10 GHz, the company shifted strategies and put parallelism in its crosshairs.


The thing was that servers and workstations were already employing multi-socket configurations to get work done faster. At the time, Irwindale-based Xeons were getting their behinds handed to them by AMD’s Opteron. Although these dropped into dual-processor boards, they were still single-core chips, aided slightly by the same Hyper-Threading technology we know today.

Just imagine the cost savings of going from a single-core, dual-socket system to a dual-core, single-socket box. Or how about the performance gain shifting from a single-core, dual-socket platform to a dual-core, dual-socket configuration? That’s two times the processing resources in the same class of hardware. Buying motherboards and CPUs only gets more expensive as you start looking at four- and eight-way boxes.

And to think, today we have six-core Hyper-Threaded chips making 12 logical processors available to operating systems like Windows 7—all from a single socket.
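If you're curious what that looks like from software's point of view, here's a tiny illustrative sketch (not part of our test methodology) showing how an application discovers the logical processor count the OS exposes and sizes a worker pool from it:

```python
# Illustrative only: how software sees the logical processors the OS exposes.
# A single six-core, Hyper-Threaded chip reports 12; the dual-socket testbed
# later in this story reports 24.
import os
from multiprocessing import Pool

def render_tile(tile_id):
    """Stand-in for one unit of a threaded workload (a frame, a tile, a clip)."""
    return sum(i * i for i in range(200_000)) + tile_id

if __name__ == "__main__":
    logical_cpus = os.cpu_count()
    print(f"Logical processors visible to the OS: {logical_cpus}")

    # A well-threaded renderer or encoder typically sizes its pool from that count,
    # which is why core/thread counts matter so much in this story.
    with Pool(processes=logical_cpus) as pool:
        results = pool.map(render_tile, range(logical_cpus * 4))
    print(f"Finished {len(results)} tiles across {logical_cpus} workers")
```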

Chris Angelini Discusses Xeon 5600-Series

Intel’s Return To Competition

As the hardware gets more powerful, software adapts to take advantage, necessitating even more capable hardware. Gotta love the vicious circle, right?

Last year, Intel launched its Xeon 5500-series CPUs for dual-socket servers and workstations. Then-vice-president Pat Gelsinger characterized the introduction as the most important in more than a decade. And while I’m not one to parrot marketing messages, this one was absolutely true for the company.

The architectural advantage AMD carved out for itself using HyperTransport was especially pronounced in multi-socket machines, while Intel relied on shared front side bus bandwidth for processor communication. With the Xeon 5500-series, Intel addressed its weaknesses through the QuickPath Interconnect, adding Hyper-Threading and Turbo Boost to further improve performance in parallelized and single-threaded applications alike.

Of course, the wheels of progress continue to spin. This year’s shift to 32 nm manufacturing gave Intel the opportunity to add complexity to its SMB-oriented processor lineup without altering its thermal properties. Enter the Xeon 5600 family, sporting up to six physical cores and 12 MB of shared L3 cache per processor—all within the same 130 W envelope established by the Xeon 5500-series.

Always fun to see 24 logical CPUs in Windows' Device Manager

At any rate, we have plenty of hardware to compare here, including a pair of Xeon 5600s, two Xeon 5500s, and a Core i7-980X that’ll demonstrate where and when a second processor will actually buy extra performance in a workstation-oriented load.

If you’ve already read my Intel Core i7-980X review, then you’re very much in the know about how Intel built its Xeon 5600 family. The main difference is that while Intel is only selling two enthusiast-class 32 nm six-core desktop CPUs (the -980X and -970), the Xeon 5600-series consists of no fewer than 12 different models, featuring TDPs from 40 W to 130 W, core counts between four and six, clock rates from 1.86 GHz to 3.46 GHz, and 12 MB of L3 cache across the board.

Now you know why I was so enthused about the thought of a true quad-core 32 nm desktop chip back when I wrote that Gulftown review. The fact of the matter is that Intel is already selling 32 nm quad-core CPUs in the workstation space. It just doesn’t see a reason to cannibalize sales of its 45 nm lineup at this point. The good news for business-oriented folks is that there’s really a Xeon 5600 for any application.

If you find 12 models a tad overwhelming, use Intel’s prefixes as a general guide. There are three general classifications in play here. The Advanced lineup (denoted with an ‘X’ prefix) consists of 130 W and 95 W SKUs. You’ll typically find these six models in performance workstations, where pedestal enclosures leave plenty of room for cooling. The three Standard-class chips are identifiable by their ‘E’ prefixes, sporting 80 W TDPs that work well in 1U and 2U rackmount equipment. A trio of low-power ‘L’ models touches 60 W and 40 W power ceilings. While these CPUs leverage the lowest clock rates, even the lowest-end chip wields four physical cores, endearing the family to entry-level SMB servers.

Model      | QPI Speed | L3 Cache | Base Freq. | Max Turbo Freq. | Power (TDP) | Cores / Threads | Price
Xeon X5680 | 6.4 GT/s  | 12 MB    | 3.33 GHz   | 3.6 GHz         | 130 W       | 6/12            | $1663
Xeon X5677 | 6.4 GT/s  | 12 MB    | 3.46 GHz   | 3.73 GHz        | 130 W       | 4/8             | $1663
Xeon X5670 | 6.4 GT/s  | 12 MB    | 2.93 GHz   | 3.33 GHz        | 95 W        | 6/12            | $1440
Xeon X5667 | 6.4 GT/s  | 12 MB    | 3.06 GHz   | 3.46 GHz        | 95 W        | 4/8             | $1440
Xeon X5660 | 6.4 GT/s  | 12 MB    | 2.8 GHz    | 3.2 GHz         | 95 W        | 6/12            | $1219
Xeon X5650 | 6.4 GT/s  | 12 MB    | 2.66 GHz   | 3.06 GHz        | 95 W        | 6/12            | $996
Xeon L5640 | 5.86 GT/s | 12 MB    | 2.26 GHz   | 2.8 GHz         | 60 W        | 6/12            | $996
Xeon L5630 | 5.86 GT/s | 12 MB    | 2.13 GHz   | 2.4 GHz         | 40 W        | 4/8             | $551
Xeon L5609 | 4.8 GT/s  | 12 MB    | 1.86 GHz   | 1.86 GHz        | 40 W        | 4/4             | $440
Xeon E5640 | 5.86 GT/s | 12 MB    | 2.66 GHz   | 2.93 GHz        | 80 W        | 4/8             | $774
Xeon E5630 | 5.86 GT/s | 12 MB    | 2.53 GHz   | 2.8 GHz         | 80 W        | 4/8             | $551
Xeon E5620 | 5.86 GT/s | 12 MB    | 2.4 GHz    | 2.66 GHz        | 80 W        | 4/8             | $387
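If it helps to see the lineup in a form you can sort and filter, here's a small illustrative sketch; the figures simply mirror the table above, with most SKUs omitted for brevity.

```python
# Illustrative only: a few Xeon 5600 SKUs from the table above, plus a simple
# filter by TDP ceiling and core count. Prefixes: X = Advanced, E = Standard,
# L = Low-power.
XEON_5600 = [
    {"model": "X5680", "qpi_gts": 6.4,  "base_ghz": 3.33, "turbo_ghz": 3.73, "tdp_w": 130, "cores": 6, "threads": 12, "price_usd": 1663},
    {"model": "X5670", "qpi_gts": 6.4,  "base_ghz": 2.93, "turbo_ghz": 3.33, "tdp_w": 95,  "cores": 6, "threads": 12, "price_usd": 1440},
    {"model": "E5620", "qpi_gts": 5.86, "base_ghz": 2.40, "turbo_ghz": 2.66, "tdp_w": 80,  "cores": 4, "threads": 8,  "price_usd": 387},
    {"model": "L5630", "qpi_gts": 5.86, "base_ghz": 2.13, "turbo_ghz": 2.40, "tdp_w": 40,  "cores": 4, "threads": 8,  "price_usd": 551},
    # ...remaining SKUs omitted for brevity; see the table above.
]

def pick(max_tdp_w, min_cores):
    """Return SKUs at or under a TDP ceiling with at least the requested core count."""
    return [c for c in XEON_5600 if c["tdp_w"] <= max_tdp_w and c["cores"] >= min_cores]

# Example: six-core parts that fit a rack-friendly 95 W envelope.
for cpu in pick(max_tdp_w=95, min_cores=6):
    print(cpu["model"], cpu["base_ghz"], "GHz,", cpu["price_usd"], "USD")
```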

Building On A Familiar Platform

Just as Gulftown (the hexa-core desktop design) is interface-compatible with Bloomfield (the original Core i7-900-series), so too does Xeon 5600 drop into the same LGA 1366 interface as Xeon 5500. The reasons for this are similar to what we discussed in reviewing Core i7-980X, including an architecture that is fundamentally the same and a TDP at or under the previous ceiling. Granted, while a processor upgrade arriving within a year of its predecessor isn’t unheard of on the desktop, that’s not usually how the server or workstation markets operate. Even if there won’t be a surge of businesses upgrading their Xeon 5500-based machines with 5600s, the fact that the same motherboards, memory modules, and graphics cards carry over unchanged is at least a benefit to resellers bringing Intel’s latest to their customers.

To briefly recap my Gulftown coverage, Westmere-EP (the internal name for Xeon 5600)…

“…is enabled by Intel’s 32 nm manufacturing process—the same node we saw debut back in January with the Clarkdale and Arrandale processor families. This time, however, enthusiasts don’t have to be bamboozled by a second, on-package 45 nm die handling graphics, memory control, and PCI Express connectivity. The [Xeon 5600] gets us performance-freaks back to where we want to be—on-die memory controller, PCI Express handled by the well-endowed [5520 and 5500] chipsets, and discrete graphics only, please.

With [Westmere-EP], Intel uses its 32 nm process to add cores and cache, rather than push integration. As a result, we have up to a six-core processor with 12 MB of shared L3 cache. Architecturally, [Westmere-EP] is otherwise the same as [Nehalem-EP]. Each core gets 32 KB of L1 instruction cache, 32 KB of L1 data cache, and a dedicated 256 KB L2 cache.

Despite the addition of two cores and 4 MB of L3, [Westmere-EP] employs a smaller die than its predecessor (248 square millimeters versus [Nehalem-EP’s] 263). Transistor count increases from 731 million to 1.17 billion. That’s fairly incredible, considering the [fastest Xeon 5600s] fit within the same 130 W thermal envelope as existing [Xeon 5500-series] processors.”

Intel S5520SCR Motherboard

Workstations are similar to servers in a great many ways, including the validation and testing that goes into a production machine. If you’re using a workstation for business, it can’t crash in the middle of a project. It needs to deliver the performance of an enthusiast-class system with the reliability of a mission-critical server. As a result, the parts that went into our reference platform are atypical of what you’d normally find in a Tom’s Hardware test bed.

We started with an Intel S5520SCR motherboard, a roughly $430 platform designed for workstation duties by virtue of its 16-lane PCI Express 2.0 slots. There are plenty of other expansion slots on the board, but we aren’t going to use any of them here.

The board boasts two LGA 1366 interfaces, each capable of taking any of the Xeon 5600-series processors. A total of 12 memory slots are divided into six per processor, yielding three channels with two slots per channel. It employs the 5520 northbridge (armed with 36 lanes of PCIe connectivity) and ICH10R southbridge. The platform is actually very similar to X58, with the exception of two QPI links to the Xeon processors versus one to Core i7.

4 x Kingston KVR1333D3E9SK3/3G Memory

We populated the board with four 3 GB memory kits from Kingston, totaling 12 GB. These ECC-enabled unbuffered modules are a welcome change from the warm-running FB-DIMMs used in Xeon 5400-based servers.

And the 1333 MT/s data rate is plenty. Remember that high-end Xeon 5600s support DDR3-1333 with two DIMMs per channel, while the Xeon 5500-series processors only sustain DDR3-1333 with one DIMM per channel. Thus, switching over to the Xeon W5580 CPUs causes our memory configuration to downshift to DDR3-1066. As you’ll see in the benchmarks, though, this doesn’t have an appreciable impact on memory performance, according to SiSoftware’s Sandra 2010.

If you plan on building your own workstation, keep configuration details like this in mind. With 12 available slots, that’s a lot of room for expansion. Getting optimal performance, though, requires that you fill all three channels for both CPUs, so that’s memory in at least six slots.
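To put rough numbers to that advice, here's a quick sketch of the memory math on this board. The downshift rule is a simplification based on the behavior we observed, not a full reading of Intel's population guidelines.

```python
# Rough sketch of the memory math on the S5520SCR, based on the behavior
# described above. The downshift rule is a simplification, not Intel's full
# population spec.
CPUS = 2
CHANNELS_PER_CPU = 3
SLOTS_PER_CHANNEL = 2          # 12 slots total on this board

def min_dimms_for_full_bandwidth():
    """One DIMM in every channel of both CPUs keeps all six channels active."""
    return CPUS * CHANNELS_PER_CPU

def effective_data_rate(cpu_family, dimms_per_channel):
    """Observed behavior: Xeon 5600 held DDR3-1333 with two DIMMs per channel,
    while our Xeon 5500 (W5580) setup dropped to DDR3-1066."""
    if cpu_family == "5600":
        return 1333
    if cpu_family == "5500":
        return 1333 if dimms_per_channel <= 1 else 1066
    raise ValueError("unknown family")

print(min_dimms_for_full_bandwidth())                    # 6 DIMMs minimum
print(effective_data_rate("5500", dimms_per_channel=2))  # 1066, matching our W5580 downshift
```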

Nvidia Quadro FX 3800

While the Quadro FX 3800 isn’t the fastest card in Nvidia’s professional lineup, it’s the fastest one I happened to have on hand, and it’s quick enough to prevent GPU-related bottlenecks in the tests that matter in a processor comparison.

One thing to keep in mind, though, especially as we start looking at Adobe CS5 benchmarks: the Quadro FX 3800 is one of the few cards on Adobe’s list with CUDA-enabled GPU acceleration. Little of what we’re testing should be affected by the Mercury Playback Engine, but even so, it’s worth noting that this card is on the short list of supported hardware.

2 x Intel X25-M 160 GB SSD

While it’s certainly desirable to use a high-end RAID controller in a workstation, offloading storage-related calculations that’d otherwise detract from overall processor performance, our simple RAID 0 setup shouldn’t cause a problem. Thus, I shelved the Intel RS2BL080 I had on-hand and used the ICH10’s SATA 3Gb/s connectivity instead, managing storage using Rapid Storage Technology software.

Especially when it comes to workloads like video editing, you want a healthy sustained write throughput. A couple of 160 GB X25-Ms in RAID 0 are probably even overkill for what we’re doing here.

The recently-released SPECviewperf 11 is meant primarily to measure OpenGL graphics performance. It includes new viewsets from up-to-date versions of LightWave, CATIA, EnSight, Maya, Pro/ENGINEER, SolidWorks, Siemens Teamcenter Visualization Mockup, and Siemens NX.
SPECviewperf 11 | 2 x Xeon X5680 | 2 x Xeon W5580 | 1 x Core i7-980X
catia-03        | 21.32          | 22.3           | 22.5
ensight-04      | 11.4           | 11.86          | 12.03
lightwave-01    | 40.06          | 40.87          | 41.88
maya-03         | 8.94           | 14.55          | 16.02
proe-05         | 7.74           | 8.09           | 9.21
sw-02           | 32.58          | 32.64          | 33.14
tcvis-02        | 16.24          | 16.66          | 16.41
snx-01          | 13.92          | 16.55          | 16.6

We were hoping that the platform hosting our Nvidia Quadro FX 3800 would have at least some impact on these tests. No such luck, it seems. If anything, the trend runs backward: the highest scores consistently come from the single-CPU Core i7-980X, though it should be noted that the differences here are fairly small.

The SPECapc for LightWave 9.6 test actually produces three results, and they’re most easily generated using the software’s Discovery (trial) mode. Once you register for the trial, however, a popup that appears before Layout launches ends up preventing the Interactive test from completing. The solution seems to be running this SPECapc workload, along with our custom test, on a full, registered copy. We’re working with NewTek to make that happen.

Still, we get some interesting results from the Render and multi-task tests (yes, the LightWave benchmark was developed specifically to take advantage of threaded platforms). The render test, specifically, sees a massive speed-up moving from a single socket to a dual-socket Xeon W5580 setup and then to a dual Xeon X5680 configuration. Though not as pronounced, the multi-task test also clearly favors a pair of Xeon X5680s over the W5580s, which in turn best a single Core i7-980X.

We’ve been running benchmarks based on Adobe’s CS4 suite for a while now—most notably Photoshop CS4 in all of our processor reviews. But graphics professionals use some of the company’s other software tools for more taxing workloads, like video editing and compositing.

Knowing that we needed tests with heavier lifting, we enlisted the help of Jon Carroll, a Tom’s Hardware freelancer and graphics professional in Southern California, to design tests using Adobe After Effects and Premiere Pro, complementing our threaded Photoshop benchmark.

As you’ll see in the charts below, though, Jon’s CS4-based tests exposed some very interesting results moving from 12 threads to 16 and then to 24. So, I quickly adapted all of Jon’s tests to run in each respective app’s CS5 version to cross-check the results. What we came away with simply blew me away…

After Effects CS4/CS5

As Jon and I discussed the testing, an HP workstation he was reviewing wrapped up this After Effects CS4 benchmark in 28 minutes. It shocked me, then, when my 24-thread Xeon X5680-based system took 44 minutes to complete the same metric. I was further floored when the 16-thread Xeon W5580-based box finished faster, and even more so when the 12-thread Core i7-980X system was fastest of all.

I suspected an issue with memory allocation. After Effects CS4 only has access to 4 GB of system memory—a third of what these Xeon boxes bring to bear. As you add execution resources to AE’s pool, less and less memory is available to each processor, be it logical or physical. The result is a lot more swapping to solid state storage, which is fast, but nowhere near as quick as three channels of DDR3.

When we pick up After Effects CS5, which supports a native 64-bit environment, the app can get its hands on a little more than 9 GB, leaving roughly 3 GB for other applications. What took more than 44 minutes to finish in CS4 drops to a little more than one minute in CS5. Even better, positive scaling is restored—the 24- and 16-thread configurations are separated by three seconds, as are the 16- and 12-thread boxes.
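A quick back-of-the-envelope calculation makes the CS4 problem obvious. The 4 GB and roughly 9 GB pool sizes come from the observations above; the even per-process split is my simplification of how After Effects divides memory among its background render processes.

```python
# Back-of-the-envelope math: divide After Effects' memory pool by the number of
# concurrent render processes. The 4 GB and ~9 GB pool sizes come from the text
# above; the even split is a simplification of how AE allocates RAM per process.
def ram_per_process_gb(pool_gb, processes):
    return pool_gb / processes

for threads in (12, 16, 24):
    cs4 = ram_per_process_gb(4.0, threads)   # 32-bit CS4 ceiling
    cs5 = ram_per_process_gb(9.0, threads)   # 64-bit CS5 pool on our 12 GB box
    print(f"{threads} threads: CS4 ~{cs4:.2f} GB each, CS5 ~{cs5:.2f} GB each")

# At 24 threads, CS4 leaves each process under 200 MB, hence the constant
# swapping to SSD that erased the dual-socket advantage.
```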

Also, it’s worth noting that in CS4, we got our best results having all cores work on each frame, while in CS5, performance improved significantly when rendering multiple frames concurrently (an option under Memory & Multiprocessing), so that’s how we benchmarked each version.

Premiere Pro CS4/CS5

Premiere Pro presented a similar comparison, though the speed-up in moving from CS4 to CS5 wasn’t as pronounced, and we didn’t see the same scaling issues in CS4. Nevertheless, a combination of shifting to a 64-bit environment and utilizing Nvidia’s Quadro FX 3800 (taking advantage of the Mercury Playback Engine) drops a 3:40 render (in CS4) down to 19 seconds (in CS5) on the dual-socket Xeon X5680 platform. Rendering the project in Adobe Media Encoder becomes a 2:55 job, down from 7:41 on the same box.

Photoshop CS4/CS5

With a tweaked script, we’re able to use our same Photoshop CS4 workload to test CS5. We had already been using 64-bit software here, so it’s hardly a surprise to see more modest gains in shifting from the older version to Adobe’s latest.

With all of that said, we have a story in the works dedicated to testing Adobe’s newest Creative Suite, where we’ll explore the effects of 64-bit operating environments, the Mercury Playback Engine, and GPU acceleration with Nvidia’s few supported graphics cards. For now, it’s fairly safe to say that professionals with multi-socket, multi-core machines will be well-served by an upgrade to CS5.

One thing you’ll notice when you use applications with multiple components is a tendency for some functionality to include threading optimizations while other pieces do not.

Our custom LightWave 3D Modeler test, which renders a 1+ million polygon version of the Tom’s Hardware logo, gains nothing from the additional compute muscle afforded by 24 threads available concurrently. The same holds true for the OpenGL-based fly-through of the logo in LightWave Layout. In fact, in both cases (as we've seen previously), the more complex architectures sacrifice performance compared to simpler and less-expensive setups.

Render Time       | 2 x Xeon X5680  | 2 x Xeon W5580  | 1 x Core i7-980X Extreme
Render, Frame 8   | 6 min., 7 sec.  | 7 min., 30 sec. | 9 min., 35 sec.
Render, Frame 41  | 6 min., 29 sec. | 7 min., 49 sec. | 10 min., 6 sec.
Render, Frame 500 | 7 min., 8 sec.  | 8 min., 35 sec. | 11 min., 12 sec.
Render, Frame 600 | 5 min., 20 sec. | 6 min., 12 sec. | 8 min.

Start rendering individual frames from the Layout-based logo file, however, and those CPU cores suddenly kick into gear. While two Xeon X5680s can’t quite halve the rendering time of a single Core i7-980X, they come close enough to make the addition of a second processor worthwhile for professionals who do a lot of rendering in LightWave.

Just remember—not every component of NewTek’s software benefits equally from a multi-socket configuration.

Normally, we’d run Prime95 to determine maximum load power consumption and then PCMark Vantage to chart out consumption over time. However, a max figure isn’t really relevant here, and Vantage simply won’t run on our multi-socket configs. SYSmark 2007 Preview is populated by old, outdated software that wouldn't exploit threading in a way we could tie into a workstation story. So, I turned to LightWave 3D 9.6. The frame rendering process taxes available CPU cores and takes long enough for us to measure average power use.

The results are pretty gosh-darned telling. Not surprisingly, the lowest-power solution is a single Core i7-980X. However, the one CPU also takes the longest to finish frame eight of our rendering workload.

Two Xeon W5580s (130 W TDP processors) are actually the most power-hungry—and they don’t even finish the fastest. That honor goes to a couple of Xeon X5680s (also 130 W CPUs).

Our Extech logger sampled power every two seconds, making it easy to gauge the exact time for frame eight to render completely. We turn that time, in seconds, into its fraction of an hour, and then multiply by the average power use during the run.
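Expressed as code, the calculation looks like this; the render time is pulled from the Frame 8 result in the table above, while the 350 W average is a made-up placeholder rather than one of our logged figures.

```python
# The efficiency calculation described above: convert the render time to a
# fraction of an hour, then multiply by average power draw to get watt-hours.
def watt_hours(render_seconds, avg_power_watts):
    return (render_seconds / 3600.0) * avg_power_watts

# Hypothetical example: frame eight's 6 min., 7 sec. render (367 seconds) on the
# dual X5680 box, paired with an assumed 350 W average rather than our logged value.
print(f"{watt_hours(367, 350):.1f} Wh")   # ~35.7 Wh for that frame
```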

It turns out that, while a single Core i7-980X is a great way to improve the efficiency of your workstation versus a pair of quad-core CPUs like the Xeon W5580s (despite the fact that the two Xeons are faster), a couple of Xeon X5680s turn that conclusion topsy-turvy. They get our workload finished fast enough that the elevated power is more than compensated for by increased performance.

And just as we harp on the importance of building balanced desktops, the same holds true here. A potent dual-socket workstation should be complemented with plenty of memory and fast storage. In this case, 12 GB of DDR3-1333 and a pair of 160 GB SSDs in RAID 0 did the trick. Naturally, there are also gains to be had from a capable graphics card. And in some applications, your GPU will make all of the difference, while the Xeons have no impact whatsoever.

What we can say definitively is that the Xeon X5680—despite running 133 MHz faster than Intel’s older Xeon W5580—operates more efficiently than its predecessor in threaded software. It’s significantly more complex, what with its two extra cores and 4 MB of extra L3 cache. But it fits within the same thermal envelope thanks to 32 nm manufacturing, and even manages to use less average power in our LightWave rendering test than the Xeon 5500-series chip.

And although a pair of hexa-core Xeons are much more power-hungry than a single Core i7-980X, the performance they enable gets threaded workloads done faster—fast enough, in fact, to yield a lower average watt-hour rating than the single-socket Core i7.

As for AMD, here’s hoping its SR56x0 and SP5100 chipset components pave the way for renewed competition in the workstation space. It’d be interesting to gauge the speed of the Opteron 6100-series’ 12 physical cores against Intel’s 6C/12T Xeon 5600-series, after all.
