Solving the memory issue of lock-free algorithms



Hi Folks,

I read a lot about lock-free algorithm and tried to find an "elegant"
solution to the memory problem that the algorithms poses. The pop is
obviously the problem. The algorithm I'm using is the following:

SC0: loop
SC1: head = lf->top
SC2: oc = lf->ocount
SC3: if head == NULL
SC4: return NULL
SC5: next = head->next
SC6: if CAS2(&lf->top, head, oc, next, oc + 1)
SC7: break
SC8: endloop
SC9: return head


I wanted to make a solution where I could allocate/free the nodes as
part of the function, so I started by putting alloc/free in my push/
pop and quickly found multiple memory crashers (as many papers talk
about). However I was never really a fan of the solutions I have seen
(hazard pointers and etc) just sounded like they added a lot more
slowdown than what was ultimately really worth.

I know a lot of papers describe this as the lock-free memory
reclamation issue, but from what I've read, they all seem to point
towards the fact that it is possible for SC5 to generate a memory
fault if a pop or pop/push happened to squeeze between SC2 and SC5.
However it may very well be true that we can potentially read garbage
for SC5, at least on the Intel/ARM architecture, we should still be
able to read that address in memory without generating a fault since
head was a known good location to read from at some point in time. As
long as the CAS fails in theses cases, the correctness of the
algorithm should still be ok. And it should fail since the counter
should not match. So knowing this, I went about debugging my code to
make it work such that it can have an old head and still be able to
access the memory region of "head->next" provided that it failed the
CAS2.

I had a simple test that had 2 push thread and 2 pop thread on a quad-
core. As it turns out, I found out that the real issue as to why it
was crashing was because pre-emption was done between SC1 and SC2 and
the program was fooled into thinking that it had a match when in
reality, it didn't. This false positive allowed the system to make a
swap and thus inserted garbage in the list and crashed the system when
the next item would parse the list. As it turns out, the solution was
to combine SC1 and SC2 into one single atomic operation.

As a mater of fact, I don't see how you can fully guarantee that the
algorithm works without doing this. There's 2 ways to do this on an
Intel platform. The first is to use the movq instruction back and
forth, which guarantees atomic reads of 64bit elements on 64bit
aligned boundaries or the more crafty solution to use CMPXCHG8B in
order to read it.

As it turns out, once I changed my code to do 64bit reads, I was able
to run my stress-test app which push/pops on 4 threads without any
crashers for one entire night. If I couldn't read from freed memory,
then this wouldn't work, (obviously) but since I can on a multicore
system, this seems like a very nice solution to the problem (assuming
there are no other holes in it) and that's pretty much why I'm posting
here. Since you guys are really experts on this (compared to me at
the very least), I'd like to know your thoughts on this.


I've attached my source code at the bottom (If you do use it, I merely
ask to get credited for it). I also have a C equivalent using a 64bit
read instruction, but the assembler version is just cleaner imo since
I can re-use the cmpxchg8b for the read.



struct LFStack {
LFStack() : Count(0), Top(NULL) { }

struct LFStackNode {
LFStackNode *Next;
unsigned long Data;
};

void Push(register unsigned long Data)
{
LFStackNode *Node = (LFStackNode
*)malloc(sizeof(LFStackNode));

Node->Data = Data;

__asm {
mov edi, [this] ; edi = &this
mov esi, [Node] ; esi = &Node
};
LoopPush:
__asm {
mov eax, [edi] ; eax = Top
mov [esi], eax ; Node->Next = Top;
lock cmpxchg [edi], esi
jne LoopPush
};
}

unsigned long Pop(void)
{
register LFStackNode *OldNode;

__asm {
mov edi, [this] ; edi = &this
xor eax, eax
xor edx, edx
xor ebx, ebx
xor ecx, ecx

lock cmpxchg8b [edi] ; edx:eax = this
}
LoopPop:
__asm {
test eax, eax
je PopOut

mov ebx, [eax] ; ebx = Top->Next
lea ecx, [edx + 1] ; ebx = Top->Count + 1
lock cmpxchg8b [edi] ; ebx:ecx = Next
jne LoopPop

mov OldNode, eax ; OldNode = Top
}

const unsigned long Data = OldNode->Data;

free(OldNode);

return Data;

PopOut:
return -1;
}


public:
LFStackNode *volatile Top;
volatile unsigned long Count;
};



Regards,
Chris
.



Relevant Pages

  • Re: Turing Machines and Physical Computation
    ... rather than adumbrated in that 1936 Turing paper On Computable ... ... any actual computer has finite memory, ... computer will run out of memory for some huge calculation. ... It is generally assumed that an algorithm must satisfy the following ...
    (comp.theory)
  • Re: Turing Machines and Physical Computation
    ... rather than adumbrated in that 1936 Turing paper On Computable ... ... any actual computer has finite memory, ... computer will run out of memory for some huge calculation. ... It is generally assumed that an algorithm must satisfy the following ...
    (sci.math)
  • Re: Version after Version
    ... MMX, by doing 50 times two additions at a time. ... Many compilers & memory managers have at least the option to align data on ... > - compression and decompression works at the Bit level ... > LZW is the until recently proprietory algorithm of Uinisys ...
    (alt.comp.lang.borland-delphi)
  • Re: Sans memory barrier native thread communication upon AMD64?
    ... producer/consumer or a distributed algorithm w/ mutual ... that may be garbage collected independent of any other native thread. ... microthread allocates its environment (including dynamic stacks) from the ... system provides a thread safe memory allocator. ...
    (comp.programming.threads)