Re: Is C close to the machine?



Anton Ertl wrote:

To avoid these overheads and to achieve load balancing, I have been
thinking about an implementation that automatically switches between
communicating between threads and using coroutining for efficient
communication within a thread. Currently these are just ideas, but I
have written them down in
<http://www.complang.tuwien.ac.at/kps09/pdfs/ertl.pdf>


Did a quick read of the paper. My summary [Correct me if I'm wrong]

Basic idea:
1. Write code as a series of tasks connected by pipes, using pretty much the same idiom as piping together programs in Unix: e.g. to count non-blank lines by c and h files in a tree:
find . -name '*.[ch]' | grep -v '^$' | wc -l
Lets call these tasks kernels A,B,....
2. The pipeline of kernels is partitioned between cores. In the case of 5 tasks A,B,C,D,E,F running on 3 cores 0,1,2, a possible partition might be (A,B,C),(D,E),(F)

Dynamic load balancing
3. If a mismatch in workloads is noticed, the pipeline may be repartitioned, generally by moving one kernel across threads. Thus, if core 0 has more work than core 1, C may move to core 1, resulting in partition (A,B),(C,D,E),(F)
4. The repartitioning is mediated by the states of the inter-thread buffers; if C is writing to D and generally sees an empty queue, then (A,B,C) is probably running slower than (D,E), and so C can move over. If C is writing to D and generally sees a full queue, then the that core is overloaded, and D may move over onto core 0, resulting in (A,B,C,D). Alternatively, it may make sense for the task at the other end of the pipe to be moved (i.e. instead of the C/D boundary moving, the E/F boundary may move).

Kernel scheduling
5. If kenel Ki pipes information to a Ki+1 on the same core, it may yield (using coroutine semantics) to Ki+1 after writing the information

Critique:
1. Only works for pipelines; it doesn't work with more complicated graphs such as DAGs.
2. POSIX pipes are byte-streams of potentially unbounded length (i.e. you can write an arbitrary number of bytes before the first one is read).
a. You probably want a stronger type system on the data
b. Do you want to bound the buffer size?
3. You're going to have to worry about copying the data at least once and probably twice (data gets created, copied into buffer, data gets copied from buffer, data gets used). It is possible to come up with implementations of 2b that, if the data is not copied, can lead to lockup.
4. The overhead of coroutining is not the state save, its the loss of locality of the caching and prediction structures.
5. What if the kernels have different data sizes (e.g. kernel A produces data in chunks of 512 and the next one, B, processes it in a chunk of 2K; your workload will be AbAbAbAB where b is kernel B reading the 512B, and then blocking waiting for the next read).

Its a neat little idea, but it will require more engineering than you might expect. My main objection is (1) - I don't expect that pipelines is going to be a very interesting case.
.


Quantcast