dual-CPU Mac Pro Nehalem can be slower than a single-CPU model for large files



http://macperformanceguide.com/Reviews-MacProNehalem-MoreIsLess.html

Mac Pro -

When more is less: dual vs single CPU

Strange, but true:
With Photoshop CS4, a dual-CPU Mac Pro Nehalem can be slower than a
single-CPU model for large files.

A workaround is explained below. Please see Scalability for related
material. Behavior can change over time as Mac OS X changes and/or
Photoshop is upgraded, so test your own system to be sure.

Dual CPUs are slower than a single CPU! (Mac OS X 10.5.6)

"1/2 cores" means that half the CPU cores were disabled

This changes in 10.5.7!!!

Background notes

This finding applies to Photoshop CS4 11.0.1.

First, a few background notes that will be helpful in the following
discussion:

Photoshop CS4/Mac is a 32-bit program, which limits it to 3.5GB of memory
allocation. Of that, ~3GB can be used by Photoshop; the rest is overhead for
code, plugins, etc.

CPU cores are the hardware workers involved in computing; these correspond
in CS4 to "threads". CS4 creates 3 threads per CPU core when executing
tasks.

Each thread requires some memory of its own, which reduces the memory
available for storing image data and other necessary items. The amount of
overhead depends on the program, and for CS4, the overhead is apparently
substantial.

The more threads, the higher the overhead of coordinating them, but this is
likely a minor factor compared to memory usage.

Dual-CPU is slower - why?

This discussion applies when working with file(s) that require the scratch
disk; the more the scratch volume is needed, the more pronounced the
advantage of the single-CPU system.

When the diglloydMedium benchmark is run, some puzzling figures emerge: a
dual CPU machine with more memory is slower!

The graph below shows the single and dual-CPU 2.93GHz Mac Pro Nehalem with
different amounts of memory. Observe the following:

A minimum of 16GB is required for best performance. Even so, the single-CPU
MP09 with only 12GB beats the dual-CPU MP09 with 24GB!

A single-CPU is faster than a dual-CPU with either memory configuration.

The dual-CPU time drops from 61 seconds to 47 seconds when half its CPU
cores are disabled.

How can this possibly be?

The answer is most likely usable memory, but that needs some explanation;
see below.

diglloydMedium: single vs dual core, Mac OS X 10.5.6

Something has changed in Mac OS X 10.5.7. Disabling half the CPU cores is
still slightly faster, but using all 16 cores is now much closer in speed
than before, presumably because of a bug fix of some kind in OS X.

diglloydMedium: single vs dual core, Mac OS X 10.5.7

Available memory

Photoshop CS4 blindly allocates 3 "threads" per CPU core. For a 16-core
machine (dual CPU), this means that it's allocating 48 threads, vs 24
threads for an 8-core machine (single CPU). Each of these threads requires
memory of its own. That is our working theory at least.

The memory used by the threads comes out of the limited amount available to
Photoshop CS4 (a 32-bit application is limited to 3.5GB absolute max).

The net result is that the memory available for image data is reduced
substantially.
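
The thread arithmetic behind this working theory can be sketched in a few
lines of Python. The 3-threads-per-core figure and the ~3GB usable figure
come from the text above; the per-thread overhead of CS4 is not known, so
the 16 MB value used below is a purely hypothetical placeholder that only
illustrates the shape of the effect, not a measurement.

```python
THREADS_PER_CORE = 3      # from the article: CS4 creates 3 threads per core
USABLE_MEMORY_MB = 3072   # from the article: ~3GB usable by 32-bit CS4


def thread_count(cores: int) -> int:
    """Threads CS4 would create for a given number of (virtual) cores."""
    return cores * THREADS_PER_CORE


def memory_left_for_images_mb(cores: int, per_thread_overhead_mb: float) -> float:
    """Memory remaining for image data after per-thread overhead.

    per_thread_overhead_mb is a HYPOTHETICAL input; the real figure
    for CS4 is not given in the article.
    """
    return USABLE_MEMORY_MB - thread_count(cores) * per_thread_overhead_mb


if __name__ == "__main__":
    assumed_overhead_mb = 16.0  # hypothetical, for illustration only
    for cores in (8, 16):
        left = memory_left_for_images_mb(cores, assumed_overhead_mb)
        print(f"{cores} cores: {thread_count(cores)} threads, "
              f"{left:.0f} MB left for image data")
```

Whatever the true per-thread cost is, doubling the core count doubles the
memory taken out of the same fixed 32-bit budget.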

The reduced memory for image data forces Photoshop to use its scratch volume
more, which increases processing time substantially. And remember that these
times were measured with an exceptionally fast striped RAID scratch volume.

Available memory is critical when working with large files. The
diglloydMedium benchmark ends its run with a 15.7GB scratch file, which far
exceeds the ~3GB or so of usable memory in the 32-bit Photoshop CS4.

The same performance implications lie in wait for anyone working with
file(s) that begin to use the scratch disk, so beware!

Exploring the cores

Let's see what happens when CHUD Tools is used to disable real cores and
virtual cores (hyperthreading).

The M/N notation means M real cores and N virtual cores, e.g. 4/8 means 4
real cores and 8 virtual cores.

The graph shows the time to execute diglloydMedium, rounded to the nearest
second. Observe that CS4 offers marginal gains when going beyond 2 real
cores / 4 virtual cores - it doesn't scale.

The perverse result is that with all CPU cores in use, we see the
second-worst result, better only than that of a single virtual core. That is
a rather poor showing from Photoshop CS4. Let's hope Adobe does something
about this.

diglloydMedium: effect of the number of real/virtual CPU cores on run-time

A kludge workaround

CHUD Tools

Processor Palette

This workaround is worth the trouble only if you spend a lot of time in
Photoshop working with large files, e.g. those that use the scratch volume
regularly.

As an Apple developer, you can download Apple's CHUD Tools, which is part of
the Apple developer toolkit. CHUD Tools allows disabling CPU cores, real
and/or virtual ones.

When working with big files, you can use the CPU palette to disable half or
more of the CPU cores equally across the two physical CPU chips. This drops
execution time on dual-CPU MP09 systems by 23%, as shown in the graph.
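
The 23% figure follows from the 61-second and 47-second dual-CPU times
quoted earlier (Mac OS X 10.5.6). A minimal check of that arithmetic; the
helper name is only illustrative:

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Percentage by which the run time drops from before_s to after_s."""
    return (before_s - after_s) / before_s * 100.0


if __name__ == "__main__":
    # 61 s with all cores enabled, 47 s with half the cores disabled
    reduction = percent_reduction(61, 47)
    print(f"Run time reduced by {reduction:.0f}%")
```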

What Adobe can do

Adobe can address this issue by not blindly allocating threads for every CPU
core. In fact, CS4 does not scale beyond two cores, so one solution is for
the CS4 engineers to simply hard-code a limit and ignore the available CPU
cores beyond a fixed number.

Another and better solution would be to offer a "max threads" preference.
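
What a "max threads" preference might look like can be sketched as
follows. This is not Photoshop code: the function name, the `max_threads`
parameter and the cap of 12 are all hypothetical, and only the
3-threads-per-core figure comes from the article. The idea is simply to
clamp the per-core thread count to a user-chosen ceiling before creating
the worker pool.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

THREADS_PER_CORE = 3  # per-core figure reported for CS4 in the article


def worker_count(cores: int, max_threads: Optional[int] = None) -> int:
    """Threads to create: 3 per core, optionally capped by a preference."""
    wanted = cores * THREADS_PER_CORE
    if max_threads is None:
        return wanted  # uncapped: the behavior the article describes
    return max(1, min(wanted, max_threads))


def double(tile: int) -> int:
    """Stand-in for real per-tile work."""
    return tile * 2


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    n = worker_count(cores, max_threads=12)  # 12 is a hypothetical setting
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(double, range(8)))
    print(f"{n} worker threads, results: {results}")
```

With such a cap, a 16-core machine would no longer pay the memory cost of
48 threads unless the user asked for it.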

But of course the best solution is to rewrite the aging code base to use 16
cores efficiently. The chances of that happening in CS4 seem slim. The real
fix is likely to come only with a 64-bit Photoshop CS5, which will be able
to access as much memory as there is installed in the machine. Adobe will
also have to fix the internal bottlenecks which currently keep it from
scaling more than a pittance beyond two cores.

Not all bad news

In spite of the poor CPU utilization on common operations, some operations
do utilize multiple cores, though scalability remains well below optimal.

Good scalability would ideally yield about 1/16 the time for 16 virtual
cores vs 1 core, but there is some overhead even for well-written programs,
so something around 12:1 is a more realistic expectation.

For the Surface Blur filter, tests show near-perfect scalability from 2
cores to 4 cores; the time is almost exactly halved. Beyond that, the
additional cores help considerably, but we don't see 10 seconds with 16
cores (vs 80 seconds for 2 cores). Instead, we see 17 seconds for 16 cores:
not bad, but about 70% longer than perfect scalability. A figure in the
12-14 second range would be quite respectable.
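
The Surface Blur numbers can be checked with a short calculation. The 80
seconds for 2 cores and 17 seconds for 16 cores are taken from the text;
the helper names are illustrative, and "perfect scalability" is assumed to
mean run time inversely proportional to core count.

```python
def ideal_time_s(base_time_s: float, base_cores: int, cores: int) -> float:
    """Run time under perfect scaling from a measured baseline."""
    return base_time_s * base_cores / cores


def percent_over_ideal(observed_s: float, ideal_s: float) -> float:
    """How much longer the observed time is than the ideal, in percent."""
    return (observed_s - ideal_s) / ideal_s * 100.0


if __name__ == "__main__":
    ideal = ideal_time_s(80, base_cores=2, cores=16)   # perfect scaling
    over = percent_over_ideal(17, ideal)                # observed: 17 s
    efficiency = ideal / 17 * 100.0
    print(f"ideal: {ideal:.0f} s, observed: 17 s")
    print(f"{over:.0f}% longer than ideal, {efficiency:.0f}% efficiency")
```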

Surface Blur: more cores help a lot

Gray bars denote best-case scalability.

Photoshop CS4 needs work. Its threading behavior is self-defeating, making a
single-CPU system notably faster than a dual-CPU system for large files.

--
"I never mentioned that I couldn't afford to buy a Mac.
I said I couldn't afford to buy any computer." -- Dave Fritzinger

