Re: Possible causes for "MT slowdown"



"Mirek Fidler" <cxl@xxxxxxxxxx> wrote in message news:57cb88a4-ea4a-45cf-b630-e2b7ffe25877@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
On May 29, 12:06 am, "Chris Thomasson" <cris...@xxxxxxxxxxx> wrote:
"Mirek Fidler" <c...@xxxxxxxxxx> wrote in message

news:f9dd932a-63f7-45e9-a2dc-7892ff3e33ed@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

> Hi,

> I am working on a server application which basically serves as a big cache
> for an SQL database, performing query requests. As long as the requested
> queries are in the cache, there is no SQL communication.

> I believe that in such a case, running N queries in a single thread should
> take approximately 2x as much time as running N/2 in each of 2 threads.
> However, so far I have been able to achieve only 1.6x.

> I am aware of these possible problems in MT performance:

> * mutex contention (avoided most of it)

> * false sharing (cacheline contention) (allocator used should not
> have this problem)

[...]

Are you saying that the allocator you're using eliminates false-sharing?

Yep. I did some work since the last time we discussed allocators :) It
is now lock-free and false-sharing free, unless you do remote, of
course.

Great! :^D




AFAICT, you need to properly pad and explicitly align your data-structures
on L2 cache-line boundaries regardless of what allocator you're using.

http://groups.google.com/group/comp.programming.threads/msg/8036dc8a3...

Also, you should ensure that your mutexes are padded and aligned. For
instance, on Windows I usually do something like:

Well, this is a good point I guess. All of my mutexes are global
variables now, however to reduce contention, I have a lot of them,
often in big static arrays.

Perhaps I can try to pad them to e.g. 128 byte unions, that would
still be acceptable size-wise (I have about 2000 mutexes and 8GB to
play with :).

Think of mutexes A and B which protect completely unrelated data (dA and dB). If stores to A or B invalidate dA or dB, well, that's a performance issue. If stores to dA or dB invalidate A or B, that's not good. If mutations to A invalidate B, or vice versa, well, that's another performance issue. IMVHO, it helps to think about that type of cross interference. Sometimes, keeping track of all of this micro detail can bring a marked improvement in performance in general...

You generally don't want a store into a datum A to mess around with another piece of unrelated data. You don't want a store into a lock which protects datum A to mess around with the cache lines in which A exists...

For instance, if you frequently lock/unlock mutex A, why should that have a negative impact on mutex B -or- on the contents of the critical-section which B protects access to?
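
A minimal C++ sketch of the padding idea discussed above. This is an illustration only, not the Windows code elided from this post. The names PaddedMutex, kPadSize, kLockCount, g_locks and same_block are hypothetical, and the 128-byte figure is simply the size suggested earlier in the thread; the right value depends on the target CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Assumption: 128 bytes, the padding size suggested in the thread.
// Actual cache-line sizes differ between processors.
constexpr std::size_t kPadSize = 128;

// alignas forces each object to start on a kPadSize boundary and makes
// sizeof a multiple of kPadSize, so array neighbours never share a block.
struct alignas(kPadSize) PaddedMutex {
    std::mutex m;
};

static_assert(alignof(PaddedMutex) == kPadSize, "mutex must be aligned");
static_assert(sizeof(PaddedMutex) % kPadSize == 0, "mutex must be padded");

// Hypothetical stand-in for "about 2000 mutexes in big static arrays".
constexpr std::size_t kLockCount = 2000;
static PaddedMutex g_locks[kLockCount];

// Hypothetical helper: true if both addresses fall in the same
// kPadSize-sized block, i.e. could interfere with each other.
inline bool same_block(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kPadSize ==
           reinterpret_cast<std::uintptr_t>(b) / kPadSize;
}
```

With this layout, locking g_locks[i].m should not invalidate the block holding g_locks[i + 1].m. The protected data still has to live outside the mutex's block for the same reason; the sketch only covers the mutex-versus-mutex case. If sizeof(PaddedMutex) comes out at 128, the whole array costs roughly 250 KB.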




Speaking about it, do you think it is a good idea to use "trylocks" to
reduce contention? I mean sometimes I need array data, each element
with its own mutex. I do not care about the order in which I get data from
the array, therefore I was trying to do 2 passes through the array, doing
"tries" on the first pass and a blocking lock for the remaining elements on
the second. In practice, this so far seems to reduce blocking, but does not
seem to improve real performance :)

IMVHO, only if you're generally guaranteed to have other "important" work to do when the failure case of a try_lock is hit. No need to call try_lock if the result of a failure is a spin. Only do try_lock if you can immediately operate on other work, or if you're doing an STM-like implementation and need to lock multiple mutexes and have no access to a total ordering scheme... Does that make any sense?
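
The two-pass scheme described in the question can be sketched roughly as follows, in C++. The names Slot and visit_all are hypothetical and this is not the poster's actual code. A failed try_lock here leads straight to other work (the next element), which is the case where try_lock is argued to be worthwhile.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical array element: some data guarded by its own mutex.
struct Slot {
    std::mutex lock;
    int value = 0;
};

// Applies fn to every slot exactly once, in no particular order.
// Pass 1 takes only the locks that are free right now; pass 2 blocks
// on whatever was busy. Returns how many slots had to be deferred.
template <class Fn>
std::size_t visit_all(std::vector<Slot>& slots, Fn fn) {
    std::vector<std::size_t> deferred;

    for (std::size_t i = 0; i < slots.size(); ++i) {
        std::unique_lock<std::mutex> guard(slots[i].lock, std::try_to_lock);
        if (guard.owns_lock()) {
            fn(slots[i].value);      // got the lock: do the work now
        } else {
            deferred.push_back(i);   // busy: move on to other work
        }
    }

    for (std::size_t i : deferred) {
        std::lock_guard<std::mutex> guard(slots[i].lock);  // blocking lock
        fn(slots[i].value);
    }
    return deferred.size();
}
```

Note that std::mutex::try_lock is allowed to fail spuriously, so the deferred count can be non-zero even without contention; the second pass keeps the result correct either way. Whether this beats a single blocking pass depends on how long the locks are held, which would explain seeing less blocking without a gain in throughput.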
