very strange pthread problem




We have a single-threaded system at work that
we're extending so that worker threads can do some
of the hard work on other CPUs.

I use a boss/worker design: a main thread does the majority of the
processing, which is the code that is difficult to run in parallel
because of the nature of the processing. Then I use mutexes to pause
the main thread and broadcast to the 4 worker threads, which have been
paused in a condition wait since the last processing cycle, so that
they start. They all have to run to completion and pause again, and I
use pthread_cond_signal to get the main thread out of its pause. From
testing, I'm convinced this all works and that the main thread and the
workers alternate between working and pausing.
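
For illustration, here is a minimal sketch of that kind of boss/worker
handshake. It is not the actual code from the system: every name in it
(Shared, worker, run_cycles, the cycle and pending counters) is made up,
and it joins its threads so the example can finish, whereas the real
threads are detached. The point it shows is that each condition wait
sits in a loop that re-checks a predicate, because a wakeup can be
spurious or can arrive before the waiter is actually waiting.

```cpp
#include <pthread.h>
#include <vector>

// Hypothetical shared state; none of these names come from the real system.
struct Shared {
    pthread_mutex_t lock;
    pthread_cond_t  start_cv;  // boss -> workers: a new cycle has begun
    pthread_cond_t  done_cv;   // workers -> boss: the last worker finished
    int  cycle;                // incremented by the boss once per cycle
    int  pending;              // workers that have not finished this cycle
    bool quit;                 // set by the boss to shut the workers down
    long total;                // stand-in for "work done"
};

static void* worker(void* p) {
    Shared* s = static_cast<Shared*>(p);
    int seen = 0;              // last cycle number this worker handled
    pthread_mutex_lock(&s->lock);
    for (;;) {
        // Wait in a loop on a predicate, never on the signal alone.
        while (s->cycle == seen && !s->quit)
            pthread_cond_wait(&s->start_cv, &s->lock);
        if (s->quit) break;
        seen = s->cycle;
        pthread_mutex_unlock(&s->lock);
        // ... per-thread work would run here, outside the lock ...
        pthread_mutex_lock(&s->lock);
        s->total += 1;
        if (--s->pending == 0)
            pthread_cond_signal(&s->done_cv);  // wake the boss
    }
    pthread_mutex_unlock(&s->lock);
    return 0;
}

long run_cycles(int nworkers, int ncycles) {
    Shared s;
    pthread_mutex_init(&s.lock, 0);
    pthread_cond_init(&s.start_cv, 0);
    pthread_cond_init(&s.done_cv, 0);
    s.cycle = 0; s.pending = 0; s.quit = false; s.total = 0;

    std::vector<pthread_t> tids(nworkers);
    for (int i = 0; i < nworkers; ++i)
        pthread_create(&tids[i], 0, worker, &s);

    for (int c = 0; c < ncycles; ++c) {
        pthread_mutex_lock(&s.lock);
        s.pending = nworkers;
        ++s.cycle;
        pthread_cond_broadcast(&s.start_cv);   // release all workers
        while (s.pending != 0)                 // boss pauses until all are done
            pthread_cond_wait(&s.done_cv, &s.lock);
        pthread_mutex_unlock(&s.lock);
    }

    pthread_mutex_lock(&s.lock);
    s.quit = true;
    pthread_cond_broadcast(&s.start_cv);
    pthread_mutex_unlock(&s.lock);
    for (int i = 0; i < nworkers; ++i)
        pthread_join(tids[i], 0);

    pthread_cond_destroy(&s.done_cv);
    pthread_cond_destroy(&s.start_cv);
    pthread_mutex_destroy(&s.lock);
    return s.total;
}
```

In this sketch run_cycles(4, 3) has each of 4 workers take part in each
of 3 cycles exactly once, so the counter ends at 12.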

The worker threads need access to a class member function or two.
Since the thread entry point is a plain function taking a void*, I
pass the address of the class object to it via the 4th argument of
pthread_create, which is a pointer to a user-defined structure:
struct THREAD_ARG {
    int thread_id;     // index of this worker
    Class* myclass;    // object whose member functions the worker calls
};

The thread then casts that argument back to get the thread id and
class pointer, so it can get to the various class functions, and by
knowing its id number each thread can find its list of data to work
on. Each thread releases the shared mutex before it starts, so they
can all be in the common work function at once. I keep the threads'
work queues distinct by having an STL vector in which each index
points to the list of pointers to the data that thread should work
on, sort of like this:

vector<STRUCT> workqueue;
...
workqueue[thread_id].dataset;

dataset --> list<data*>*

so thread 0 can get its data starting from
workqueue[0].dataset->begin() and walk to the end of that list, and
likewise for the other threads. I figured this would prevent
interactions between threads since each has its own container. The
main thread puts 1/4 of the data on each dataset list for a quad-CPU
host, or 1/2 for a dual....
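
As a rough sketch of that layout (not the real code): Processor stands
in for Class, WorkItem for STRUCT, and Data, thread_main and
run_workers are invented for the example. It assumes the vector is
sized once before any thread is created and that each THREAD_ARG lives
in storage that outlasts its thread, since the thread only receives a
pointer to it. The threads are joined here so the example can return a
result; the real ones are detached.

```cpp
#include <pthread.h>
#include <list>
#include <vector>

struct Data { int value; };                 // hypothetical payload

struct WorkItem {                           // stands in for STRUCT
    std::list<Data*>* dataset;              // this thread's private list
    long sum;                               // per-thread result slot
};

class Processor {                           // stands in for Class
public:
    std::vector<WorkItem> workqueue;

    // Each thread touches only workqueue[id].
    void work(int id) {
        WorkItem& w = workqueue[id];
        for (std::list<Data*>::iterator it = w.dataset->begin();
             it != w.dataset->end(); ++it)
            w.sum += (*it)->value;
    }
};

struct THREAD_ARG {
    int thread_id;
    Processor* myclass;
};

static void* thread_main(void* p) {
    THREAD_ARG* arg = static_cast<THREAD_ARG*>(p);  // cast the void* back
    arg->myclass->work(arg->thread_id);
    return 0;
}

long run_workers(int nthreads, int nitems) {
    std::vector<Data> items(nitems);
    for (int i = 0; i < nitems; ++i) items[i].value = i + 1;

    Processor proc;
    proc.workqueue.resize(nthreads);        // sized before any thread starts
    for (int t = 0; t < nthreads; ++t) {
        proc.workqueue[t].dataset = new std::list<Data*>;
        proc.workqueue[t].sum = 0;
    }
    // The boss deals the data out across the per-thread lists.
    for (int i = 0; i < nitems; ++i)
        proc.workqueue[i % nthreads].dataset->push_back(&items[i]);

    // One argument struct per thread, alive until after the joins.
    std::vector<THREAD_ARG> args(nthreads);
    std::vector<pthread_t> tids(nthreads);
    for (int t = 0; t < nthreads; ++t) {
        args[t].thread_id = t;
        args[t].myclass = &proc;
        pthread_create(&tids[t], 0, thread_main, &args[t]);
    }

    long total = 0;
    for (int t = 0; t < nthreads; ++t) {
        pthread_join(tids[t], 0);
        total += proc.workqueue[t].sum;
        delete proc.workqueue[t].dataset;
    }
    return total;
}
```

With 4 threads and the values 1..100 dealt out round-robin, the
per-thread sums add up to 5050. Two things this arrangement relies on:
nothing resizes the vector while workers hold references into it, and
the argument passed to pthread_create is not a single local variable
that gets overwritten for the next thread.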

The problem is, the old software works fine (I can
manually turn off the threading feature and run it
the 'old' way as one thread), but one of the many
child functions in the main function gives
inconsistent results in threaded mode that it does
not give in legacy mode.
Some of the child functions used static variables for
speed reasons, but they're set up so they can be
changed back to local variables easily by recompiling,
so I made sure those were all back to being local
variables --> same result.
Right away I thought it was some kind of thread
interaction, so I made another mutex and wrapped
it around the work function so only one thread
can be in there at a time --> same result.
To make really sure, I ran it so that pthread_create
was only making one worker thread --> same result.
Then I moved the whole thing to a single-CPU Linux
machine and compiled and ran it there with one worker
thread. On that box simultaneous interaction should
be impossible, since there's just one CPU and one
worker thread --> same result.

So basically, in the thread function, the threads all
go into a "work" function, and in the work function is
a child function doing some mundane math (it computes
dot products and so on to figure out whether two signals
seen by antennas can come from a single source). This
subfunction somehow knows I'm using pthreads and breaks,
but works fine when I'm not. It even breaks when there's
just one CPU and only one pthread running in the work
function. I've looked over the work function and it looks
thread safe: all the operations on shared data in there
are reads, not writes (there is writing, but only to each
thread's local variables). Pointers are also written to
the STL lists via push_back, but there's a unique
list for each thread, and that's not the data that
looks bad anyway.

The only idea I had was stack overflow in the thread,
so that data sometimes gets corrupted when running
with the thread but not with the original code,
but when I run getstacksize() on Linux it says 10+ MB,
which is huge. I was thinking the corruption might happen
only sometimes because the stack usage varies: the function
can be left by various returns, and the further down you go
before returning the more stack is used, so maybe the
bad results come from the path through the function that
gets far enough down.
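
One way to test the stack theory is to give the worker an explicit
stack size instead of relying on the default, which is not guaranteed
to match the main thread's limit. The sketch below is only an
illustration: create_with_stack and touch_stack are invented names and
the sizes are arbitrary. It sets the size with
pthread_attr_setstacksize, reads it back with
pthread_attr_getstacksize, and starts a thread that touches a large
local buffer.

```cpp
#include <pthread.h>
#include <cstddef>

// Hypothetical thread body: touches 256 KB of its own stack.
static void* touch_stack(void*) {
    volatile char buf[256 * 1024];
    for (std::size_t i = 0; i < sizeof buf; i += 4096)
        buf[i] = 1;
    return 0;
}

// Creates one thread with the requested stack size and returns the size
// recorded in the attribute object, or 0 if any call fails.
std::size_t create_with_stack(std::size_t bytes) {
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) return 0;

    std::size_t got = 0;
    if (pthread_attr_setstacksize(&attr, bytes) != 0 ||
        pthread_attr_getstacksize(&attr, &got) != 0) {
        pthread_attr_destroy(&attr);
        return 0;
    }

    pthread_t tid;
    int rc = pthread_create(&tid, &attr, touch_stack, 0);
    pthread_attr_destroy(&attr);
    if (rc != 0) return 0;

    pthread_join(tid, 0);
    return got;
}
```

If the bad results disappear when the worker is given a much larger
stack, or get worse with a smaller one, that points toward stack
exhaustion. If nothing changes, the stack is probably not the cause.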

The threads are all DETACHED, not JOINABLE,
because the threads stick around for the whole runtime
and don't get destroyed and recreated. When developing
all this I set up all the mutexes to be ERRORCHECK type
and found no error codes coming back from the pthread
calls. I've since changed them back to default/NORMAL
for speed, but I still check the return codes of
all the thread calls for != 0. Another thing
is that I've run the software on Solaris x86 (Sun V40Z)
and on Linux (Fedora Core 4 on a single-CPU embedded
board computer) and get the same behavior on both.
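
For reference, an ERRORCHECK mutex is requested through a mutex
attribute object. The short sketch below is illustrative only
(errorcheck_catches_misuse is a made-up name). It shows two mistakes
that this mutex type reports through the return code: locking a mutex
the thread already holds, and unlocking a mutex that is not locked.

```cpp
#include <pthread.h>
#include <cerrno>

// Returns true if an error-checking mutex reports both kinds of misuse.
bool errorcheck_catches_misuse() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

    pthread_mutex_t m;
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);
    int relock = pthread_mutex_lock(&m);      // same thread locks again
    pthread_mutex_unlock(&m);
    int reunlock = pthread_mutex_unlock(&m);  // mutex is no longer locked

    pthread_mutex_destroy(&m);
    return relock == EDEADLK && reunlock == EPERM;
}
```

With a default/NORMAL mutex the same mistakes are not reported this
way (relocking deadlocks, and unlocking an unlocked mutex is
undefined), so checking return codes for != 0 only catches them while
the ERRORCHECK type is in use.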

I was just wondering if anyone has some ideas.
Mark