Re: WWDC -- MacBook Pro?



In article
<haberg-1008061146230001@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
haberg@xxxxxxxxxx (Hans Aberg) wrote:

In article <nospam.News.Bob-AA9DD8.23461809082006@xxxxxxxxxxxxxxxx>, Bob
Harris <nospam.News.Bob@xxxxxxxxxxxxxxxxxxxxxx> wrote:

And my experience with 64 bit applications is that they are not
twice as big (bigger yes, but not anywhere near twice as big).

What factor increase do you get?

And that the application's data generally does not change.  A JPEG
image is still a JPEG image.  A database record is still a
database record.

If the data sizes and padding are the same, it does not increase of
course. But then alignment problems could cause a slowdown. Padding is up
to the C compiler to decide.

Yes and no. The C compiler cannot alter the alignment of a JPEG
image, nor can it alter the alignment of data stored on disk. The
C program must be able to handle data as it is presented to it; it
cannot arbitrarily change that.
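
A minimal sketch of what "handle data as it is presented" can look
like in C, assuming a record that was read from disk into a byte
buffer. The function names read_u32_at and read_le32_at are
hypothetical, made up for this illustration, and the little-endian
layout is an assumed file format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32 bit value stored at an arbitrary byte offset in a buffer.
   memcpy makes no alignment assumption, so this works even when the
   field does not sit on a 4 byte boundary. The result is in host byte
   order; no endianness conversion is done here. */
uint32_t read_u32_at(const unsigned char *buf, size_t offset)
{
    uint32_t value;
    assert(buf != NULL);
    memcpy(&value, buf + offset, sizeof value);
    return value;
}

/* Same idea for an (assumed) little-endian on-disk format: build the
   value byte by byte so the result does not depend on the host. */
uint32_t read_le32_at(const unsigned char *buf, size_t offset)
{
    const unsigned char *p = buf + offset;
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Neither function cares whether the program was built for 32 or 64
bits; the layout of the record is fixed by the file, not the compiler.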

In all the C compilers I've worked with, bytes are aligned on byte
boundaries, int16 are aligned on 2 byte boundaries, int32 are
aligned on 4 byte boundaries, int64 values are aligned on 8 byte
boundaries, and pages are aligned on page boundaries.
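
As a rough illustration of natural alignment, here is a sketch using
C11's offsetof and alignof. The struct name sample and the helper
sample_padding are hypothetical. Note the checks are written against
whatever alignment the compiler reports, since a few 32 bit ABIs align
int64 members on 4 bytes rather than 8.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record mixing field widths. The compiler inserts
   padding so each member starts on a multiple of its own alignment. */
struct sample {
    char    c;
    int16_t i16;
    int32_t i32;
    int64_t i64;
};

/* Checked at compile time, whatever the pointer width of the build. */
static_assert(offsetof(struct sample, i16) % alignof(int16_t) == 0,
              "i16 is not on its natural boundary");
static_assert(offsetof(struct sample, i32) % alignof(int32_t) == 0,
              "i32 is not on its natural boundary");
static_assert(offsetof(struct sample, i64) % alignof(int64_t) == 0,
              "i64 is not on its natural boundary");

/* Number of padding bytes the compiler added to struct sample. */
size_t sample_padding(void)
{
    return sizeof(struct sample)
         - (sizeof(char) + sizeof(int16_t) + sizeof(int32_t) + sizeof(int64_t));
}
```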

Compilers that do not follow that behavior run into portability
problems reading and processing data created on another platform.

As for performance, even the Alpha (Digital's 64 bit CPU that had
a lot of fame in the early 90's) handled 32 bit aligned data on an
equal footing to 64 bit aligned data. And later versions gave
byte and 2 byte aligned data an equal status.

The only time alignment is an issue is if an int16, int32, or int64
is not stored on its natural boundary. And C compilers have
been performing that kind of alignment since the early days of
computing. So unaligned values only show up if the application
was written to pack data as tightly as possible, and that is an
application dependent issue.

Also the CPUs mostly talk to the L1 cache. The L1 cache mostly
talks to the L2 cache (if there is an L3 cache, it is next in
line), and the last cache in the line is the one that talks to
main memory.

Memory at the bottom level may be set up to only be accessed in
some very wide width (depends on the way the cache to memory
interface is designed and the memory bus width). Some systems
access memory in 4 byte widths (not very fast computers), some
access in 8 byte widths, some in 16 byte widths, and there are
some that have memory busses that are 32 bytes wide (could even be
wider ones out there). Of course this has nothing to do with
MacBooks and MacBook Pros, but we are talking about 64 vs 32 bit
compiled code, and the way a CPU can access that code and data is
important.

Where am I going? OK. From the CPU's point of view, what matters is
the L1 cache and how the CPU can access it. If the CPU architecture
allows byte access, then the L1 cache does byte access and will do it
as efficiently as 64 bit access. As long as the data is on its
natural alignment boundary and that data is the right data for the
job, then being compiled for 32 bits or 64 bits should not matter
to the CPU, especially if the code doesn't need 64 bit features.

But when a perfectly good 32 bit application (one that was happy
being 32 bits) is compiled for 64 bits, there is no specific
advantage for the app, and a possible disadvantage in that
now its memory pointers are taking up more space in the L1, L2
(L3??) cache. The code may take more pages of memory to handle
the larger address pointers, so this may result in more TLB
(Translation Lookaside Buffer) misses (a soft fault, where
the OS has to reload the TLB for a page already in memory, vs a
hard page fault where disk I/O needs to be done).

Just because a CPU can do 64 bits does not mean it is bad at 32
bit, or 16 bit, or byte operations.

A 64 bit CPU means it can tackle larger problem sets. It does not
mean it now sucks at handling the smaller ones. And it does not
mean it has to double in memory size.

And one surprise is that some applications compiled for 32 bits
run FASTER than the same applications compiled for 64 bits.  This
is where the application doesn't need 64 bit integers nor 64 bit
pointers for huge memory access.  In this case the 64 bit pointers
used by the 64 bit version of the application and the 64 bit
libraries it links against consume extra memory and cache
bandwidth (the addresses being twice as wide as the 32 bit
equivalents).  And since the app didn't need 64 bit addresses,
at least half of each address is zeros.  So this wasted space is
taking up space in the application memory, taking up space in the
CPU cache causing more cache evictions, and when a cache line is
loaded, fewer of the bits transferred are useful if it contains
addresses, etc...  The end result is that it actually takes longer
for the CPU to do the same unit of work than it does with a more
compact 32 bit version of the program.

Now these are only small (tiny) performance differences, but they
are something that can be measured.
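
To make the pointer overhead concrete, here is a small sketch, not a
benchmark. The struct names, the helper nodes_per_cache_line, and the
64 byte cache line size are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tree node linked by native pointers: it grows when pointers grow. */
struct ptr_node {
    struct ptr_node *left;
    struct ptr_node *right;
    int32_t          key;
};

/* The same node linked by 32 bit indices into an array: its size
   does not depend on how the program was compiled. */
struct idx_node {
    uint32_t left;
    uint32_t right;
    int32_t  key;
};

static_assert(sizeof(struct idx_node) == 12,
              "index node should be three 4 byte fields");

enum { CACHE_LINE_BYTES = 64 };   /* assumed cache line size */

/* How many whole nodes of the given size fit in one cache line. */
size_t nodes_per_cache_line(size_t node_size)
{
    assert(node_size > 0);
    return CACHE_LINE_BYTES / node_size;
}
```

On a typical 64 bit build sizeof(struct ptr_node) is 24 (two 8 byte
pointers, the key, and 4 bytes of padding), so only 2 nodes fit in an
assumed 64 byte line, against 5 for the 12 byte layout. The data (the
key) is the same size in both; only the addresses got bigger.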

One possible explanation that comes to my mind is that the 64-bit
computer isn't optimized around 32-bit. Data that is packed in to a single
word, so that the CPU has to split it, is slower than unpacked data. So if
the memory is 64-bit and one fits two 32-bit data types into it, it will
be slower than if these two 32-bit data are put into two 64-bit words. And
so if the 'int', which is 32-bits on a 32-bit computer remains 32-bit on
the 64-bit computer, and the C compiler decides that two following int's
should be packed into a 64-bit word, that might cause a slowdown relative
to the case where the compiler decides that the int's should be put into
two 64-bit words.

No, it is because the application has more of its memory taken up
by addresses.

Think of it this way. You have a glass. In case A you fill that
glass up with ice cold water. You are thirsty, so you quickly
drink the water and it is good.

Case B, same glass, same ice cold water. But now half the volume
is ice. You are still thirsty, but now you only get half the
water, and you need to ask the waiter to get you more water. You
wait. The wait is short, as the waiter has a pitcher of ice water
near by (a cache), but when he gives it to you, you get half ice
again. You're still thirsty, the pitcher quickly empties, the
waiter has to go back to the kitchen to get more ice and water for
the pitcher, you wait longer. But you are still thirsty, another
pitcher gone, and more trips to the kitchen, and each time you are
only getting half a glass of water and half a glass of ice.

The point is that if what you want is water (meaningful
addresses), but you are also carrying along ice (zeros in the
upper 32 bits of every address), then the system will spend more
time moving around ice (zeros) that is never used. This data
movement is not free. It results in more memory usage, more cache
evictions, more TLB soft faults, etc...

So the problem is not running a 32 bit application on a 64 bit
CPU, it is a 32 bit application compiled to 64 bits when it
doesn't need to be.

If you have a 32 bit application that is happy being 32 bits, and
the operating system lets you stay 32 bits, then it is in your
best interests to stay 32 bits. OK, if you are a developer, you
should not trust stuff you read in a newsgroup, you should verify,
so build it both ways, and measure your performance. But also
measure it on all the platforms it will be run on, as each
platform has different CPUs (generations), different caches and
cache sizes, different sized memory busses, etc...

It is possible to speed up a program by treating such packed data as
unpacked. For example, the C library function 'strcpy' becomes faster if
the character string is copied as int's (on a 32-bit computer) instead of
as char's, because it will be copied word by word, instead of having to
split each word to find each character, and then copying it. Similar
rewrites may speed up the code written for a 32-bit computer on a 64-bit
computer.

Some operating system vendors actually do this to strcpy(),
memcpy(), bcopy() inside the libc library. A related trick is loop
unrolling: you can make it faster still by doing several copies
in a row before looping back to the top, so that you save compare
and branch instructions (but starting and stopping such unrolled
loops is messy code).
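
Here is a sketch of the idea in portable C, not how any particular
libc does it. The name copy_unrolled is hypothetical. It moves four
64 bit words per pass and then mops up the odd tail a byte at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes from src to dst (regions must not overlap).
   Main loop: 4 x 64 bit words per pass, i.e. one compare and branch
   per 32 bytes instead of one per byte. */
void *copy_unrolled(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(n == 0 || (dst != NULL && src != NULL));

    while (n >= 32) {
        uint64_t w0, w1, w2, w3;
        /* Fixed size memcpy is the portable way to express a word
           load or store without alignment or aliasing problems;
           optimizing compilers usually turn each one into a single
           instruction. */
        memcpy(&w0, s,      8);
        memcpy(&w1, s + 8,  8);
        memcpy(&w2, s + 16, 8);
        memcpy(&w3, s + 24, 8);
        memcpy(d,      &w0, 8);
        memcpy(d + 8,  &w1, 8);
        memcpy(d + 16, &w2, 8);
        memcpy(d + 24, &w3, 8);
        s += 32;
        d += 32;
        n -= 32;
    }

    /* Tail: the "messy" part, handling lengths that are not a
       multiple of 32. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

In practice the platform's own memcpy() is almost always the better
choice; this only shows where the saved branches come from.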

And if you understand cache prefetch tricks, you could throw a
memory fetch in at the beginning of an unrolled loop for data much
further down the line and not store that data until later:

load r10, 32(r1) # prefetch to load next cache line
load r11, 0(r1)
load r12, 8(r1)
load r13, 16(r1)
load r14, 24(r1)
store r11, 0(r2)
store r12, 8(r2)
store r13, 16(r2)
store r14, 24(r2)
store r10, 32(r2) # start using prefetched data

This is fake assembly. It is lousy code actually, but it is
intended to show that if you want to speed up a long memory copy,
you can do tricks to avoid CPU stalls waiting for memory. The
first load is a prefetch. It will cause the caches to load the
next 32 bytes into a cache line (this also assumes the addresses are
aligned to cache line boundaries). This load will happen in
parallel to loading the next 4 registers with 64 bits of data
each. Eventually you get around to storing the prefetched data
and either keep on going with inline load/stores, or you
recalculate the pointers, check for loop ending, and loop the code.

This code can be written in C and still get good results if the
compiler has a good optimizer. And even if it is a bad optimizer,
you can still get good results in C by unrolling loops and doing
prefetches. But it is very messy code and only pays off if you
are moving a lot of data, and it is very tricky to get the start
and end portions correct because of odd lengths and odd boundary
starting alignments.
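
A rough C version of the prefetch idea, assuming GCC or Clang, whose
__builtin_prefetch builtin issues a prefetch hint; on other compilers
the macro below expands to nothing. The name copy_words_prefetch and
the look-ahead distance are hypothetical tuning guesses, not measured
values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
/* Second argument 0 = prefetch for reading; third 0 = low temporal
   locality (the data is used once and not needed again). */
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Copy `count` 64 bit words (regions must not overlap), hinting the
   cache about source data a little further down the line. */
void copy_words_prefetch(uint64_t *dst, const uint64_t *src, size_t count)
{
    enum { AHEAD = 16 };   /* look-ahead in words (128 bytes), a guess */

    assert(count == 0 || (dst != NULL && src != NULL));

    for (size_t i = 0; i < count; i++) {
        /* One hint every 8 words (64 bytes), kept inside the array. */
        if (i % 8 == 0 && i + AHEAD < count)
            PREFETCH_READ(&src[i + AHEAD]);
        dst[i] = src[i];
    }
}
```

Whether this helps at all depends on the CPU's hardware prefetcher,
so it is something to measure rather than assume.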

And if this is done in the libc, it can use 64 bit load/store
operations no matter how the application is compiled. It may need
to have a different version for 32 bit addressing vs 64 bit
addressing, but generally that is handled by having a libc for 32
bits and a libc for 64 bits. Your mileage may vary depending on
which operating system you are working on. I have played with too
many, and so far I have not done any serious programming for Mac
OS X, I just love using it :-)

On the other hand, an application that uses lots of 64 bit integer
math, or needs huge virtual memory address space (and happens to
have enough real memory to keep page faulting to a minimum) will
outperform a 32 bit app that tries to work around its address
limitations, or perform 64 bit math using 32 bit registers (think
multiply and divide).
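
To show why 64 bit math in 32 bit registers costs more, here is a
sketch of a 64 x 64 bit multiply (keeping the low 64 bits of the
result) built from 32 bit pieces. The name mul64_via_32 is
hypothetical; a compiler targeting a 32 bit CPU generates something
along these lines on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of a * b, using only 32 bit halves as inputs.
   With a = a_hi*2^32 + a_lo and b = b_hi*2^32 + b_lo:
     a*b mod 2^64 = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) mod 2^32) * 2^32
   The a_hi*b_hi term is shifted out entirely. */
uint64_t mul64_via_32(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t low   = (uint64_t)a_lo * b_lo;        /* 32 x 32 -> 64 */
    uint32_t cross = a_lo * b_hi + a_hi * b_lo;    /* low 32 bits only */

    return low + ((uint64_t)cross << 32);
}
```

That is three multiplies plus shifts and adds for what a 64 bit CPU
does in one instruction, and divide is worse.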

The guys in need of such high end number crunching applications are the
ones really benefitting from 64-bit. For others, it has been rumored,
performance benefits may be more modest, at least before the code has been
rewritten to make use of the 64-bit features.

No reason for a 32 bit app to benefit from running on a 64 bit
computer. And as I've said before, unless the app is being held
back by 32 bit addressing or it is doing lots of 64 bit integer
math in 32 bit sizes, there are many good reasons to NOT compile
it for 64 bits. Then again, marketing forces sometimes cause
strange business decisions :-)

So I'm not trashing 64 bit CPUs.  I'm just saying that they do not
automatically double the memory needs, and that not all 64 bit
applications will be faster.

Let's return to the padding/packing question above. If you compile your
32-bit program for 64-bit, then the int's remain the same size, and
the compiler will probably pack adjacent int's into single 64-bit words.
This gives a small performance loss.

NO! (All caps are intentional.) As long as the data is aligned on
its natural boundary there is no performance loss, and there is no
reason to think the alignment will change just because the size of
the addresses has changed.
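
A small compile-time check of that claim, as a sketch. The struct
name pair and the helper int32_stride are hypothetical. The asserts
should hold whether the file is built for 32 or 64 bits, because they
depend only on the size of int32_t, not on the size of a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two adjacent 32 bit integers. */
struct pair {
    int32_t first;
    int32_t second;
};

/* The second int sits right after the first, on its natural 4 byte
   boundary, so the pair occupies exactly one 64 bit word. */
static_assert(offsetof(struct pair, second) == 4,
              "unexpected padding between the two ints");
static_assert(sizeof(struct pair) == 8,
              "pair should be exactly 8 bytes");

/* Byte distance between neighbouring elements of an int32_t array. */
size_t int32_stride(void)
{
    int32_t values[2] = {0, 0};
    return (size_t)((const char *)&values[1] - (const char *)&values[0]);
}
```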

The next step is to profile the code written for the 32-bit computer on
the 64-bit computer. So one discovers, aha, these 32-bit int's slow the
program down.

Profiling is good. Yes, do this.

But as I've said, it is unlikely that a 32 bit app will run slower
than the same app compiled for 64 bits running on the same
computer.

My wife ran SPEC benchmarks for Digital Equipment Corporation (and
then Compaq, and then HP; same office, they kept changing the name
on the building). She has real world experience running the same
SPEC codes as 32 bits and 64 bits. If the application does not
need 64 bit math or addresses, it generally ran faster on a 64 bit
system if it remained 32 bits.

So the next step is to change them to 64-bit long int's.
Now, on a 32-bit computer, a 'long' might be 32-bits just as the 'int'. So
you have introduced a 64/32-bit incompatibility. Now, when programming
continues for awhile on the 64-bit computer, it seems prudent to make use
of the longer integral types. This could be other types, such as 'double',
which are coerced into these 64-bit types. So after awhile, you have code
that can't be easily converted to work on a 32-bit computer.

That is half true. If you are a developer and you set up your
builds to generate ONLY 64 bit versions of your program, then you
will have a tendency to take advantage of int64. But unless you
are really doing heavy duty 64 bit integer math, or working with
huge memory models, the application will suffer from being
compiled this way, as explained above.
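
On the 'long' width difference mentioned above: one common way to
avoid that 64/32-bit incompatibility is to use the fixed width types
from <stdint.h> wherever the width really matters. A minimal sketch
follows; the function names long_bits and add_bytes are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Width of `long` in this build: 32 on a typical 32 bit build,
   32 or 64 on 64 bit builds depending on the platform's data model. */
unsigned long_bits(void)
{
    return (unsigned)(sizeof(long) * CHAR_BIT);
}

/* A byte counter that must hold values past 4 GB. Using int64_t
   gives it 64 bits in every build, unlike `long`. */
int64_t add_bytes(int64_t total, int64_t n)
{
    return total + n;
}

static_assert(sizeof(int64_t) * CHAR_BIT == 64,
              "int64_t is 64 bits by definition");
```

Code written this way still builds and behaves the same as a 32 bit
program; the 64 bit arithmetic is just slower there.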

However, it is best for a developer to measure this for themselves
on a case by case basis.

But if I were developing an app, I would try to stay 32 bit
compiled for as long as possible, unless and until I had a strong
need for 64 bit math or addressing.

So the code could, when programming continues for awhile and one is
making special use of the 64-bit features, expand considerably, and would
probably become wholly incompatible with the 32-bit computers.

Yes, and that could be a market limiting approach to making money.
If your app is a niche product and your customers would only be
using it on high end equipment anyway, this is less of a problem. But
if you are going to the low margin mass market, then stay 32 bits
as long as you can. There is more money in it for you to not
exclude anyone. For that matter, continue to provide PowerPC
versions too. Money is money, and you don't want to leave any of
it on the table if you don't have to.

As this program rewriting happens, the 32-bit computers will die away from
use, as new programs will not run on them anymore.

That is not tomorrow. And this may happen, or maybe the Smart
Phone will become such a big market, that you will want to write a
version of your app for the cell phone, and that may not be 64
bits for awhile.

Or a media console may turn the wide screen TV into a media center
with your application running on the side, but the media center is
32 bits.

If you are a developer and you want people to love your product,
don't shoot yourself in the foot just because you can. :-)

As for the original real life example that caused so much debate over
whether the code actually doubles: a graphics layouter has a computer, a
PowerBook, in which 1 GB isn't enough, perhaps 2 GB is it, but that is not
for sure. The problem with only 1 GB is that it takes 20 seconds or more
to just switch between applications, which is a typical sign that the
active program parts are not kept in RAM but constantly paged onto the
hard disk. Suppose something just below 2 GB is enough, and that something
just above 2 GB will be enough when switching to a 64-bit MacBook. The buy
will be 4 GB, not 3 GB, in part because matching memory produces a small
speed increase. Now, if 4 GB is the upper limit for all times on this
computer, the problem is that later versions of these programs may require
more memory for all kinds of reasons, not only rewriting code, but by
adding new features. And in addition, new programs may be required.

Experience has it that the need of RAM will expand rapidly over time. Say
it follows Moore's law, and doubles in 1-2 years. Then the computer will
only last 1-2 years! That is too little time for an investment in such an
expensive computer.

So 4 GB seems to be too little as a limit on those Mac laptops.

Graphics applications generally want more memory, but they
generally want it because of the graphics data. A 64 bit CPU with
the same graphics program will take the same amount of space as it
did on the 32 bit CPU. It will most likely run faster because the
64 bit CPU has a faster clock speed, a faster memory bus, and is able
to retire instructions in the CPU faster than the previous model.

If that application is paging (run vm_stat 60 in Terminal and
watch the pageout values to see if paging is the issue), then more
memory will help.

But this has little to do with using a 64 bit CPU vs a 32 bit CPU.

And yes, memory usage will continue to grow. And computer systems
will continue to allow more memory, but what controls how much
will show up in any given model is generally a function of how
much it will cost in space, power, heat, and cost (both of the
actual memory type needed, and engineering costs for new memory
interface chips). Giving a laptop 8GB of memory isn't any good if
you can't fit it into the system, or if the power draw drains the
battery in 15 minutes, or the heat generated melts the plastic or
burns the owners legs, or the cost is $1000 per 4GB SO-DIMM.

And if your computer system worked well when you purchased it and
your computing habits remain consistent, then it is likely it will
last as long as you need it to (mine is almost 3 years old; and
running on 640MB). If you change from reading email and surfing
the web to being a movie maker, then maybe you will need to
change your computer system sooner.

Bottom line, 64 bit CPUs are not going to have an instant effect
on memory sizes.

Bob Harris