Re: Matlab Vectorisation Speed - How is it done in c++?



sturlamolden <sturlamolden@xxxxxxxx> wrote in message
<08100346-586e-41fb-bb41-9d9342d269ed@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>...
On 17 Des, 01:05, Phil Winder
<philipwin...@xxxxxxxxxxxxxx> wrote:

Im currently porting some matlab algorithms to c++ code.
The test
code I have is testing the vector math capabilities and
how fast they
can go. I have found that it can be very, very fast and
I am
strugglling to reproduce the speed in c++. How does
matlab do it? And
how can it be reproduced in c++?

Beating the performance of vectorized Matlab code is very
hard, and
usually not worth the effort.

Matlab makes calls to optimized C and Fortran libraries
such as blas/
atlas, lapack and fftw. You cannot duplicate their
efficacies in C++
for at least two reasons:

1. There are issues related to the language syntax that
makes Fortran
particularly easy to optimize for compilers, such as lack
of pointer
aliasing. This is particularly important for optimal
allocation of
registers when the CPU goes into a tight loop.

2. A lot of effort have been put into making these
libraries as fast
as possible. This includes optimal use of cache and branch
prediction.
Duplicating these efforts on your own is going to take the
rest of
your life to complete.

My advice would be this:

If you want speed in your C++ app, link and call the same
libraries as
Matlab do. Most of them are available for free. Give the
C++ compiler
pointer aliasing hints wherever possible.

In addition:

Use optimization level 3 on numerical code and level 2 on non-
numerical code. Process the data in chunks that fit in
your L1 cache.
Force the CPU to prefetch if you know it will help.
Page-align your
arrays in RAM. Memory access is terribly slow, traverse
as few times
as possible. Never use strided memory access. Manually
unroll tight
loops. Exploit arithmetic pipelining of four subsequent
operations.
Avoid divisions, transform to a multiplcation. Exploit
multiple CPUs:
use MPI or OpenMP, forkjoin with labour-sharing
threadpools, etc. Use
inline assembly to access SIMD parallel registers.

Also remember Hoare's statement about optimization, quoted
by D.
Knuth: "Premature optimization is the root of all evil in
computer
programming." Profile your code. Direct your optimizations
to the
important bottlenecks. They are likely to be few. Do as
much as you
can with the bottlenecks, and never mind the reminding 90%
of your
code.

But before you begin: ask yourself if the hard work is
goint to be
worth the effort.



Regarding (1): I write in C and I haven't found (1) to be
that much of an issue (although I do worry about it and it's
well worth it for you to mention here). I think the more
recent versions of gcc are able to work around this issue.
More serious for C is the abuse of pointers (indirect
addressing, which requires lots of memory traffic). Memory
traffic is more of a problem than register allocation,
anyway (which you point out too, regarding the stride issue)..

Regarding (2): Yes, that's definitely true. One could
write an m-file script that does x=A\b without backslash, in
maybe 50 lines of M (LU factorization if square, QR with
Householder if rectangular). Backslash itself has maybe
250,000 lines of code (that's a guess, but an educated one
since I wrote about half that).

Some vector operations are trivial (a = b+c) to write in C
or Fortran. If you write a=b*c where b and c are matrices,
then there's no way you'll match performance in an optimized
BLAS library.

Rule of thumb: if the work is O(n) where n is the size of
the data, then there's a decent chance that simple C or
Fortran code can match (not beat) MATLAB. If the work is
higher than O(n) than you probably can't beat MATLAB with
simple C. Matrix add fits in the former category; matrix
multiply doesn't.

You can always call the BLAS / LAPACK yourself, in the dense
case, or use available C code for the sparse case. Lots of
the code in x=a*b, x=A\b, etc is open source.

.



Relevant Pages