Re: Why are lock-free implementations faster than traditional synchronization? Are they?



On 09.10.11 03.08, Michael Podolsky wrote:
I've realized I have a very basic misunderstanding of the topic of
lock-free algorithms.

They possess many good qualities, but is performance one of these
qualities? And if it is, to what extent and for what reasons
exactly?

Well, the reason is that they do not lock and block.
Each time a thread is blocked at a mutex it loses its current time slice. Even if it needs to stop for only a few clock cycles, the scheduler will not immediately reschedule the thread as long as there is something else to do. This also causes most of the thread's cache content to become waste.
A lock-free algorithm will not give up its time slice.
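For illustration, here is a minimal C++11 sketch of my own (not from the original post) contrasting a blocking counter with a lock-free one:

#include <atomic>
#include <mutex>

std::mutex m;
long blockingCounter = 0;
std::atomic<long> lockFreeCounter{0};

// Blocking: under contention the thread may be put to sleep by the
// scheduler and lose its time slice (and most of its warm cache).
void incrementBlocking() {
    std::lock_guard<std::mutex> lock(m);
    ++blockingCounter;
}

// Lock-free: the thread never gives up its time slice; contention
// costs only some extra cache-line traffic.
void incrementLockFree() {
    lockFreeCounter.fetch_add(1, std::memory_order_relaxed);
}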

Of course, there may be some special situations where we may run
faster with block-free and wait-free implementations using solely
interlocked-xadd-like operations.
But when it comes to using our main tool, the cmpxchg instruction, our
code gets a kind of transactional semantics, with failing/retrying
attempts to modify data. It's like we are spinning(!)
on attempts to modify some data with cmpxchg.

You are right. Cmpxchg is similar to a spin-lock, and it will have performance comparable to a fast mutex implemented with a spin-lock. But there are some important differences.

Now I should ask how it differs, performance-wise, from just using a
spin-lock, so that instead of

example 1
do { oldPtr = read(ptr)
     prepare newValue of ptr with optional pointed data
} while (!cmpxchg(&ptr, oldPtr, newValue));
recycle oldPtr

we write

example 2
lock(spinlock)
oldPtr = read(ptr)
prepare newValue of ptr with optional pointed data
recycle/delete oldPtr
ptr = newValue
unlock(spinlock)

Compare-exchange only comes into play if the old value is needed to compute the new value in a non-trivial way, i.e. where no dedicated machine instruction is available. Most platforms offer trivial operations like x+value, x-value, x&value etc. directly, without a loop.
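A small C++11 sketch of the distinction (my own example; the clamped increment is made up):

#include <atomic>

std::atomic<int> x{0};

// Trivial update: maps to a single instruction (e.g. lock xadd on
// x86), no loop required.
void add(int value) {
    x.fetch_add(value, std::memory_order_relaxed);
}

// Non-trivial update (increment, but saturate at a limit): no
// dedicated instruction exists, so a compare-exchange loop is needed.
void addClamped(int value, int limit) {
    int old = x.load(std::memory_order_relaxed);
    int desired;
    do {
        desired = old + value > limit ? limit : old + value;
    } while (!x.compare_exchange_weak(old, desired,
                                      std::memory_order_relaxed));
}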

One may argue that waiting on lock(spinlock) saturates the bus much
more than the rarely repeated cmpxchg of example 1, but this is merely
a question of implementing an efficient spinlock for multicore.

The collision probability scales with the square of the time taken by one cycle of the spinning loop. So the basic idea is to keep this time as short as possible. Obviously, spinning on a separate spin-lock variable and then modifying some other memory location holds the lock longer than reading the old value and writing the new value back, which implicitly releases the lock. The compare-exchange simply joins the memory location of the mutex with the memory location of the data it protects.
This also shows when it makes sense: only as long as you do not protect more than one memory location with the lock. But keep in mind that fine-grained locking is a requirement for scalability.

And there is another important difference: the loop condition is moved from the beginning of the critical section to its end. This has a significant effect on scheduling issues. See below.

Supposing that spinning on our spinlock does not saturate the bus and
just wastes the internal resources of one core (which is also the case
in example 1), why should the code in example 1 be faster than in
example 2?

The loop is smaller. Example 1 results in a collision only if at least two threads are between lines 1 and 3 at the same time. In example 2 the range is extended to lines 1 through 6. OK, there is a bug: the delete of oldPtr should be kept out of this region, as it is in example 1. And note that the collision probability scales with the square of this time (and with even higher powers for multiple collisions).
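A corrected C++ sketch of example 2 (the Node type and prepare() helper are assumptions of mine), with the recycling moved out of the critical section:

#include <atomic>

struct Node { int data; };

std::atomic_flag spin = ATOMIC_FLAG_INIT;
Node* ptr = new Node{0};  // shared data, protected by the spin-lock

Node* prepare(const Node* old) {  // hypothetical update step
    return new Node{old->data + 1};
}

void update() {
    while (spin.test_and_set(std::memory_order_acquire)) { /* spin */ }
    Node* oldPtr = ptr;
    ptr = prepare(oldPtr);
    spin.clear(std::memory_order_release);
    delete oldPtr;  // recycle outside the critical section
}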


Yes, I understand that example 2 might run slowly in some
circumstances, e.g. when a thread which holds the spinlock is preempted
and the other threads start to spin.

This is an important worst case. For this reason most spin-lock implementations fall back to some ordinary blocking, at least a sleep(0), after some number of turns.
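A minimal sketch of such a fallback (my own; the spin budget of 100 is an arbitrary assumption):

#include <atomic>
#include <thread>

class FallbackSpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        int spins = 0;
        while (flag.test_and_set(std::memory_order_acquire)) {
            if (++spins > 100)              // arbitrary spin budget
                std::this_thread::yield();  // comparable to sleep(0)
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};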

This is where the second advantage of a compare-exchange loop comes into play. The check condition is at the end of the critical section. This is like optimistic locking.
Any process is allowed to proceed until it wants to update a value. At this point the update is rejected if it is not based on the previous version. This is almost the same way any reasonable source code version repository works. (Note that the compare-exchange loop still has a potential race condition: other threads might have modified the value and eventually restored the old value before the loop completes. This is known as the ABA problem.)
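For illustration, a minimal C++ sketch of that race (my own, assuming a Treiber-style stack):

#include <atomic>

struct Node { Node* next; };
std::atomic<Node*> head{nullptr};

// Thread A loads head == n1 and is suspended before its
// compare-exchange. Thread B pops n1 and n2, then pushes a *recycled*
// n1. When A resumes, head again holds the pointer value n1, so A's
// compare-exchange succeeds and installs A's stale copy of n1->next,
// which points to the freed n2.
Node* pop() {
    Node* oldHead = head.load();
    while (oldHead &&
           !head.compare_exchange_weak(oldHead, oldHead->next)) {
        // oldHead is refreshed on failure, but the compare cannot tell
        // a recycled node from the original one.
    }
    return oldHead;
}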

Checking the condition at the end of the loop has the important advantage that if a thread is preempted or suspended for some reason, no other thread will spin. Only the thread that did not get its job done in time pays the bill and takes one extra turn. In contrast, a spin-lock mutex may spin indefinitely, possibly in several threads in parallel, if only one thread causes a delay. It simply is not lock-free. For the same reason compare-exchange loops cannot cause priority inversion either.


May a composite lock (spinlock + kernel mutex) be helpful in this case?

This is one workaround. It has the disadvantage that it requires kernel resources even if they are never needed. It also makes construction and destruction of the fast mutex expensive, which could be an issue if such mutexes are used in temporaries.
In fact I have not seen such an implementation of a fast mutex so far. Either you use fast mutexes, confirming thereby that you do not hold the mutex for long, or you don't and use kernel mutexes.
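Such a composite lock could look roughly like this (a sketch of my own; on many platforms std::mutex itself already spins briefly before blocking):

#include <mutex>

class CompositeLock {
    std::mutex kernelMutex;  // kernel-backed on most platforms
public:
    void lock() {
        for (int i = 0; i < 100; ++i)   // arbitrary spin budget
            if (kernelMutex.try_lock())
                return;                 // fast path: got it while spinning
        kernelMutex.lock();             // slow path: block in the kernel
    }
    void unlock() {
        kernelMutex.unlock();
    }
};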

Or if we take a pure kernel-level
implementation where a 'thread' cannot be 'preempted', what about
saying that example 2 is as good as example 1? Probably not, as Linux
uses RCU, but what are the exact reasons?

First of all you generate considerable overhead through the required switches to kernel context. But a spin-lock is not RCU, so I don't understand your question.


Marcel