Re: Single producer, "one-shot" work queue implementation, no lock on queue.
- From: David Barrett-Lennard <davidbl@xxxxxxxxxxxx>
- Date: Sat, 17 Jan 2009 17:59:37 -0800 (PST)
On Jan 17, 3:53 pm, JC <jason.cipri...@xxxxxxxxx> wrote:
On Jan 16, 10:18 pm, David Barrett-Lennard <davi...@xxxxxxxxxxxx>
wrote:
On Jan 17, 8:05 am, JC <jason.cipri...@xxxxxxxxx> wrote:
I think you need to remove that assertion in executeNext().
It's a macro that was removed in the release build that I tested with;
it doesn't affect the timing (I did remove it to double-check).
I think the assertion is *incorrect*.
This seems to be working well except it more than doubled the overhead
to 45 usec on average compared to about 20 usec before
A per-task overhead of 45 usec seems high to me.
(compared to 5
usec for just calling execute() directly).
I don't understand that. On a modern machine a function call has an
overhead that's closer to 5nsec than 5usec.
Let me double check the math here... all right, my benchmarks were
flawed because the sample set size was too low (they may be flawed for
other reasons) and because QueryPerformanceCounter() has a good bit of
overhead, I've redone the test program using a much higher sample size
(1000000 instead of 200) and using rdtsc (following guidelines herehttp://cs.smu.ca/~jamuir/rdtscpm1.pdfwith some modifications).
All right, now that the benchmarks are arguably less crappy, I am
seeing around 33nsec for the plain function call (actually, it's a
call to a virtual function via a pointer to an object, but just a
regular function showed about the same results). And I'm consistently
seeing about 15usec for the queued item.
What I said about the extra InterlockedDecrement adding extra overhead
above was wrong, with or without it, it's still around 15usec. I'm
including the call to addItem() as part of the overhead, and this is
the worst case, where there's only a single item in the queue.
All of my timings may be off by the same scale, because I'm grabbing
the CPU frequency from the Windows registry which can't necessarily be
trusted (using Intel 2.16GHz T2600 Core Duo, reported frequency is
2,161,000,000).
I guess that's negligible
compared to the length of the work items but it would be nice to speed
it up. There are 2 interlocked increments and a semaphore wait per
work item; I can't think of a way to do it with less.
The thread pool I implemented uses a mutex (aka windows critical
section) to protect the queue of pending tasks, and my worker threads
use an inner loop where they pop items and execute them until the
queue is empty. Only then do they break out of the inner loop and go
back to sleep by waiting on an event in the outer loop.
I avoid waking up all threads when the queue transitions to non-
empty. I also use an optimisation where addItem sometimes gives the
task directly to a sleeping worker thread rather than via the queue.
As you have done, I provide a facility to wait until tasks have
completed. However I group tasks according to the caller to allow
multiple clients to concurrently post tasks then wait (only) on the
execution of the tasks that they posted. For example, one client may
be posting tasks while another client is waiting on the tasks that it
posted.
This sounds like a good implementation; but our use cases are a little
different. I don't need the grouping, and in fact for the most part I
know ahead of time exactly how many items will be in the work queue
each time this runs, so I can probably get away with just resetting
the indexes instead of clearing the queue and re-adding the items now
that I think about it. I also know that the queue won't be growing
while work items are being processed -- I can take advantage of that
by not locking the queue, but it also means added complexity if I try
to make addItem give tasks directly to worker threads (I'm assuming...
but I don't really know that optimization).
IMO a more general purpose thread pool is a good idea because it
easily supports your particular use case, but also can be applied to
many other problems as well.
I achieve grouping capability using a "task executer" (which is
distinct from the thread pool) as follows:
struct Task
{
virtual void Execute() = 0;
};
struct TaskExecuter
{
virtual void Close() = 0;
virtual void Post(Task* task) = 0;
virtual void Wait() = 0;
};
struct ThreadPool
{
virtual void Close() = 0;
virtual TaskExecuter* CreateTaskExecuter() = 0;
};
ThreadPool* CreateThreadPool();
I think there is an implicit difficulty with measuring a per-task
overhead. When the execute method does nothing at all, a given worker
thread tends to rapidly execute the items until the queue is empty and
goes back to sleep. Therefore the test has a tendency to wake up
sleeping threads all the time, increasing the overheads!
That's a good point. I tried to measure the worst case, though, with
one worker thread and one item in the queue. So the sleeping thread is
always woken up, and that's included in the measurement as well.
In any case the measured overhead is about 3 usec on a dual Athlon
4600+ machine. I haven't compared performance to any other thread
pool implementations, so I have no idea whether that's good or bad.
Does that include the time it takes to add the work item to the queue?
I don't know if it's good or bad either but it's certainly a lot
better than mine, the CPU is a bit better than mine as well. After
fixing my benchmarks, I'm seeing 15usec for queue + wakeup + execute +
wait on completion. My timing might include more than yours, but it
still makes me wonder if I'm taking the wrong approach.
I'd be happy to post the test program, although it's a bit long and
messy. The basic timing is done like this (pseudo-ish code):
struct Timing : public WorkItem {
__int64 start;
__int64 fstart;
__int64 fdone;
__int64 done;
void execute () { // called by worker thread
fstart = rdtsc; // <-- processing start time
// <-- work load
fdone = rdtsc; // <-- processing end time
}
}
for (n = 0; n < 1000000; ++n) {
Timing t;
start = rdtsc; // <-- start time
work.addItem(&t);
work.beginWork();
work.waitComplete();
done = rdtsc; // <-- end time
timings.push_back(t);
}
I've tried a few different things for "work load" all producing
identical results; ranging from nothing at all, to 1000 calls to sin()
on a random number (what I'm currently using, verified that calls were
not optimized away), to Sleep() calls.
I define "overhead" as the average ((done - start) - (fdone - fstart))
over 1,000,000 iterations. The rdtsc's are preceded with cpuid to
prevent instruction reordering, and the overhead of the cpuid + rdtsc
calls is measured when the program starts (I've verified that the
measurements, in cycle counts, are repeatable with no variation). The
major unknown is CPU cycles per second, but the relative times are
still valid.
I only measured overhead as done-start using an empty execute method.
.
- Follow-Ups:
- References:
- Single producer, "one-shot" work queue implementation, no lock on queue.
- From: JC
- Re: Single producer, "one-shot" work queue implementation, no lock on queue.
- From: JC
- Re: Single producer, "one-shot" work queue implementation, no lock on queue.
- From: David Barrett-Lennard
- Re: Single producer, "one-shot" work queue implementation, no lock on queue.
- From: JC
- Single producer, "one-shot" work queue implementation, no lock on queue.
- Prev by Date: usb PERRPHERAL INTERFACE
- Next by Date: Re: Single producer, "one-shot" work queue implementation, no lock on queue.
- Previous by thread: Re: Single producer, "one-shot" work queue implementation, no lock on queue.
- Next by thread: Re: Single producer, "one-shot" work queue implementation, no lock on queue.
- Index(es):
Relevant Pages
|