Re: Wait free queue



"Chris Cochran" <chris@xxxxxxxxxxxxx> wrote in message news:0ac30931-8681-4d75-b18d-d329057495df@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
OK, Ok, ok... I have a wait-free, non-atomic local memory allocate/
dealloc, lock-free remote memory dealloc among producers, wait-free
for the consumer, with all threads serviced through lock-free queues.
Threads block when they have nothing else to do. Instrumentation in
my framework includes the count of all CAS fail/retry cycles, to see
in real time the actual (not imagined) amount of non-blocking wait
cycles that arise in any application runtime context. Other
statistics shown by thread include: latency, wake-ups, events
processed, queue depth, and others.

Sounds good to me.




I would be happy to trust my life to a medical device that has been
properly designed from stochastic processes--we do it all the time.
It is a human frailty to distrust probabilistic systems that have
acceptable risk, while simultaneously accepting all sorts of high-
probability risks arising everywhere you look. That said, your right,
you're right mathematically, you're right technically, and I'm going
to have to watch my French around here (I hope you're not French...).
We do need to be precise about these things, if for no other reason
than to protect their definitions. I stand corrected.

Sorry for coming across as hostile.




"Chris M. Thomasson wrote:"
If not, and ___please___ CORRECT me if I am misunderstanding you,
but it sure sounds like your claiming that you never used a general
purpose memory allocator that could allocate in thread_a, and free in
thread_b after thread_a finished and returned to the OS for whatever
reason (e.g., terminated).

Now you are overreaching Chris, I never said anything of the sort. I
have not tried this with very many allocators, I am not generalizing,
I didn't distilled it down to a primitive example, and my comments are
specifically about the standard new/delete provided in Microsoft
Windows, from NT4 to Win2K, and maybe WinXP (i.e. in msvcrxx.dlls).

Well, it reads as if you wrote something to that effect; how am I misunderstanding the following quote:


"Chris Cochran wrote:"
"This service will successfully delete blocks allocated by other
threads, even after the originating thread has terminated. This
capability is not available from the standard new/delete, nor from any
other heap implementation I know of--they all just crash, mine does
not."


Between its bad performance and the really obvious inability to ALWAYS
succeed at cross-threaded memory releases, my patience wore thin.

I have never seen the standard memory allocator (e.g., `malloc()/free()' `new/delete') in NT4 or Win2k crash on cross-thread memory frees. Can you show me some example code? I simple cannot reproduce the behavior your describing...


When you don't have to rely on someone else's code for such things,
you can only catch it in the act so many times before confidence
disappears and you move on. OBVIOUSLY others have figured out how to
perform cross-thread deallocation correctly too, and do, but when my
memory allocator blows the msvcrt away--why bother with it, it's a no
brainer. If I learned that Microsoft fixed this problem, I would not
go back unless I had to, and they probably have fixed it. I avoid,
because I can.

AFAICT, the standard allocator on NT4 and Win2k have no problem with allocating memory in thread_a and freeing it in thread_b after thread_a has finished its execution. Can you create a standalone example program that all of us can test?




Dr.Dobbs, October 29, 2008
Writing a Generalized Concurrent Queue

What to you mean?? This looks like a fine article, for 1998, they
just got the date wrong. ;-)

And now that we have thoroughly and completely taken over Jon Harrop's
thread for this, there you have it.

AFAICT, Herb seems to be using nasty spinlocks. I would just use OS provided mutexs for head and tail. Code like the following code, taken from the article:


bool Consume( T& result ) {
while( consumerLock.exchange(true) )
{ } // acquire exclusivity


makes be cringe. There is not even a backoff mechanism. No PAUSE instruction, nothing. Anyway, he seems to be writing that the reason he is using spinlocks is because he can put them on there own cache-lines. You can most certainly do that with `CRITICAL_SECTION's and a most implementations of `pthread_mutex_t' as well. Also, he fails to properly align data-structures on L2 boundaries. This is very important:


http://groups.google.com/group/comp.programming.threads/msg/8036dc8a380718ad


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7030f8ac1610a709





I would setup the two-lock queue data-structure on windows as follows.



<quick hack I just typed in>


________________________________________________________________
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>


#define ALIGN(mp_this, mp_type, mp_align) ((mp_type)( \
(((ptrdiff_t)(mp_this)) + (mp_align) - 1) \
& (-(ptrdiff_t)(mp_align)) \
))


#define L2_CACHE_LINE 128


union CRITICAL_SECTION_PAD {
CRITICAL_SECTION self;
char l2pad[L2_CACHE_LINE];
};


struct queue_node {
struct queue_node* next;
void* payload;
void* base_mem;
};


union queue_node_pad {
struct queue_node self;
char l2pad[L2_CACHE_LINE];
};


union queue_node_ptr_pad {
struct queue_node* ptr;
char l2pad[L2_CACHE_LINE];
};


struct queue {
union queue_node_ptr_pad head;
union CRITICAL_SECTION_PAD head_lock;
union queue_node_ptr_pad tail;
union CRITICAL_SECTION_PAD tail_lock;
};


typedef char static_assert[
sizeof(struct queue) == L2_CACHE_LINE * 4
];


struct queue_node*
queue_node_alloc(void* payload) {
void* base_mem =
malloc(sizeof(union queue_node_ptr_pad) + L2_CACHE_LINE - 1);
if (base_mem) {
struct queue_node* self =
ALIGN(base_mem, struct queue_node*, L2_CACHE_LINE);
self->payload = payload;
self->base_mem = base_mem;
self->next = NULL;
return self;
}
return NULL;
}


void
queue_node_free(struct queue_node* self) {
free(self->base_mem);
}




static unsigned char g_queue_raw_buf[
sizeof(struct queue) + L2_CACHE_LINE - 1
] = { '\0' };


static struct queue* g_queue = NULL;


int main(void) {
g_queue = ALIGN(g_queue_raw_buf, struct queue*, L2_CACHE_LINE);

InitializeCriticalSection(&g_queue->head_lock.self);
InitializeCriticalSection(&g_queue->tail_lock.self);

g_queue->head.ptr = g_queue->tail.ptr = queue_node_alloc(NULL);


if (g_queue->head.ptr) {
/* [whatever] */


assert(g_queue->head.ptr == g_queue->tail.ptr);


queue_node_free(g_queue->head.ptr);
}


DeleteCriticalSection(&g_queue->tail_lock.self);
DeleteCriticalSection(&g_queue->head_lock.self);


return 0;
}

________________________________________________________________




It should compile fine. Everything is padded to L2 and everything is aligned on L2 boundary. No false-sharing.



What do you think?

.



Relevant Pages

  • [PATCH 7/8] Use xvmalloc to store compressed chunks
    ... This gives pool's total memory usage, including allocator ... pgoff_t index, struct page *page, int is_zero) ... goto out_store; ...
    (Linux-Kernel)
  • Re: Where is the memory of net_device Tx queue located?
    ... where is the memory of this queue's memory ... It is the maximum number of skbs ('struct sk_buff') the ... network layer will queue while the interface queue has been stopped by ... the driver, usually, because the hardware is currently busy and cannot ...
    (comp.os.linux.development.system)
  • Re: [PATCH 5 of 6] hotplug-memory: add section_ops
    ... create kva mapping for that memory ... Initialize the 'struct page' ... grabbing it *out* of the buddy allocator and using it. ... detail of the balloon driver, which users needn't know about when they ...
    (Linux-Kernel)
  • Re: Where is the memory of net_device Tx queue located?
    ... where is the memory of this queue's memory ... It is the maximum number of skbs ('struct sk_buff') the ... network layer will queue while the interface queue has been stopped by ... the driver, usually, because the hardware is currently busy and cannot ...
    (comp.os.linux.development.system)
  • Re: [PATCH 10 of 20] ipath - support for userspace apps using core driver
    ... memory allocated from the regular page allocator, ... something not backed by a struct page? ... populated with compound pages anyway? ... pages (and PagePrivate is set when that page is free in the allocator). ...
    (Linux-Kernel)