Showing posts with label multicore performance. Show all posts

Thursday, July 11, 2013

Selfish processes


There is an important optimization not described in previous posts, nor
considered in the evaluation and the traces shown there. The idea is
to let processes keep a few of the resources they release, in case
they are needed later.

In particular, we modified the process structure to keep up to 10 pages
(of the size used for user segments). When a process releases a page
and has fewer than 10 pages in its pool, it simply keeps the page instead
of releasing it. Later, when a new page is needed, the process first tries
to take one from its per-process pool. The pool is not released when a
process dies; it stays in the process structure and is reused when that
structure is allocated for a new process.

The trace output taken after applying this optimization shows that
most pages are reused, and that for small cached programs about 1/3
of the allocations are satisfied from the per-process pool. Thus,
this change greatly reduces contention on the central page allocator.

Per-process resource pools should be used with care. For example, our
attempts to do the same with the kernel memory allocator showed that
it is not a good idea in that case. Memory allocations come in very
different sizes, and some structures are very long-lived while others are
very short-lived. What happened was that memory was wasted in per-process
pools while, at the same time, few memory allocations could benefit from
the technique.

In general, per-process allocation pools are a good idea when the structures
are frequently used and have the same size. For example, this could also be
applied to the Chan and Path structures used in Nix.

Wednesday, January 18, 2012

Manycore and scheduling

Time for some performance evaluation. After some scaffolding, we are measuring how long it takes to do several things: (1) compiling a kernel with the source on a remote file server, (2) copying files from a local ramfs to a local ramfs, and (3) compiling a kernel with the source in a local ramfs.

The aim is to compare how long this takes with different numbers of cores and different schedulers.

These are the results for Nix. Times are user, system, and total. The system time value is not reliable, so pay attention mostly to the total, and perhaps to the user time.
  • Single scheduler for 32 TCs:
    • 1/output:times 7.03556 50.7456 14.0344
    • 2/output:times 0.233 1.626 1.921
    • 3/output:times 7.663 63.668 10.458
  • Single scheduler for just 4 TCs:
    • 4/output:times 3.989 5.585 10.789
    • 5/output:times 0.156 0.404 0.608
    • 6/output:times 4.062 5.429 5.073
  • 8 scheduling groups, stealing when idle, for 32 TCs:
    • 7/output:times 5.849 22.625 13.859
    • 8/output:times 0.193 0.94 1.132
    • 9/output:times 6.357 28.8 9.404
We are running more experiments, but the interesting result is that it takes longer to do the work with 32 cores than it does with 4 cores.

Changing the scheduler so that there is one per group of 4 cores helps, but the time is still far from the best case, which, surprisingly, was using just 4 cores.


If we run 8 schedulers as in the last test, but do not permit them to steal/donate jobs,
so that a single group with just 4 cores is in use, the numbers return to
those of a single scheduler with just 4 TCs:
  • 8 scheduling groups, but only 1 used (4 out of 32TCs):
    • 13/output:times 4.147 5.727 9.64
    • 14/output:times 0.174 0.374 0.596
    • 15/output:times 4.049 4.816 4.804