Tuesday, January 24, 2012

SMP, AMP, and benchmarks

The new AMP scheduler for NIX can keep up with loads resulting from actual compilation of the system software. Up to 16 cores, adding cores yields a real speedup. From there on, contention on other resources means the times stop improving, and even get a bit worse.

With the old SMP scheduler, times get worse fast once more than about 6 cores are used. The curve is almost exponential, because of high contention on the scheduler.
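
To make the contention argument concrete, here is a toy sketch in plain C with POSIX threads. This is not NIX kernel code; names like runq, mailbox, and NCORE are made up for illustration. In the SMP-style loop every core takes the same run-queue lock for each process it picks up, so the lock gets hotter as cores are added; in the AMP-style loop only core 0 walks the queue and hands work to per-core mailboxes, so the other cores never touch a shared lock.

/* Toy model of the two scheduling styles discussed above: not NIX
 * code; names (runq, mailbox, NCORE) are made up for illustration.
 * smp: every core pops work from one run queue guarded by a single
 * lock, so lock traffic grows with the number of cores.
 * amp: only core 0 touches the run queue; it hands work out to
 * per-core mailboxes, so the other cores never take a shared lock. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

enum { NCORE = 8, NJOBS = 100000 };

static int runq[NJOBS], rqhead;		/* central run queue */
static pthread_mutex_t rqlock = PTHREAD_MUTEX_INITIALIZER;
static int mailbox[NCORE];		/* one pending job per core; 0 = empty */
static long done[NCORE];

/* SMP-style worker: contends on rqlock for every job it picks up. */
static void*
smpworker(void *arg)
{
	long id = (long)arg;

	for(;;){
		pthread_mutex_lock(&rqlock);
		int job = rqhead < NJOBS ? runq[rqhead++] : -1;
		pthread_mutex_unlock(&rqlock);
		if(job < 0)
			return NULL;
		done[id]++;		/* "run" the job */
	}
}

/* AMP-style worker: spins on its private mailbox; no shared lock. */
static void*
ampworker(void *arg)
{
	long id = (long)arg;

	for(;;){
		int job = __atomic_exchange_n(&mailbox[id], 0, __ATOMIC_ACQUIRE);
		if(job == -1)
			return NULL;
		if(job != 0)
			done[id]++;
	}
}

/* Core 0 as dispatcher: walks the run queue alone and fills mailboxes. */
static void
ampdispatch(void)
{
	for(int i = 0; i < NJOBS; ){
		int c = 1 + i%(NCORE-1);	/* cores 1..NCORE-1 run jobs */
		if(__atomic_load_n(&mailbox[c], __ATOMIC_ACQUIRE) == 0)
			__atomic_store_n(&mailbox[c], runq[i++], __ATOMIC_RELEASE);
	}
	for(int c = 1; c < NCORE; c++){	/* tell everyone to stop */
		while(__atomic_load_n(&mailbox[c], __ATOMIC_ACQUIRE) != 0)
			;
		__atomic_store_n(&mailbox[c], -1, __ATOMIC_RELEASE);
	}
}

int
main(int argc, char *argv[])
{
	pthread_t t[NCORE];
	int amp = argc > 1 && strcmp(argv[1], "amp") == 0;
	long total = 0;

	for(int i = 0; i < NJOBS; i++)
		runq[i] = i+1;		/* jobs are nonzero by construction */
	if(amp){
		for(long c = 1; c < NCORE; c++)
			pthread_create(&t[c], NULL, ampworker, (void*)c);
		ampdispatch();
		for(int c = 1; c < NCORE; c++)
			pthread_join(t[c], NULL);
	}else{
		for(long c = 0; c < NCORE; c++)
			pthread_create(&t[c], NULL, smpworker, (void*)c);
		for(int c = 0; c < NCORE; c++)
			pthread_join(t[c], NULL);
	}
	for(int c = 0; c < NCORE; c++)
		total += done[c];
	printf("%s: %ld jobs run\n", amp ? "amp" : "smp", total);
	return 0;
}

Build it with something like cc -pthread and run it with smp or amp as the argument; the numbers it prints are not the point, the shape of the two loops is.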

That was using a system load from the real world. As an experiment, we also measured what happens with a program similar to mk. This program concurrently spawns as many processes as needed to compile all the source files. But it is not a real build: no dependencies are taken into account, no code is generated by running scripts, and so on.
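
For reference, here is a minimal sketch of such a spawner. It is POSIX, not the actual test program used here, and the compiler name cc and the file list taken from the command line are placeholders.

/* Minimal sketch of an mk-like spawner: fork one compile per source
 * file, all at once, then wait for them all.  No dependency checking,
 * no generated code, just raw process fan-out.  Not the actual NIX
 * test program; "cc" and the argument list are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	int i, kids = 0;

	for(i = 1; i < argc; i++){	/* one child per source file */
		switch(fork()){
		case -1:
			perror("fork");
			exit(1);
		case 0:
			execlp("cc", "cc", "-c", argv[i], (char*)NULL);
			perror("cc");
			_exit(1);
		default:
			kids++;
		}
	}
	while(kids-- > 0)		/* no ordering: just reap them all */
		wait(NULL);
	return 0;
}

Run as, say, ./spawn *.c: it creates the same burst of concurrent compiler processes a parallel mk would, but with none of mk's dependency analysis.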


For this test, which is half-way between a real program and a micro-benchmark, the results differ. Instead of getting much worse, SMP holds its time when more than 4 cores are added. Also, AMP achieves about a 30% to 40% speedup.


Thus, from this benchmark you might infer that SMP is ok for building software with 32 cores and that AMP may indeed speed it up. However, for the actual build, SMP is not ok: it is much slower. And AMP achieves no speedup by the time you reach 32 cores; you are better off running with 16 cores or fewer.


There are lies, damn lies, and microbenchmarks.

4 comments:

  1. What are the other bottlenecks after 16 cores, and is there any opportunity to reduce them?

    Replies
    1. We are not sure, but likely candidates are the page allocator, the kernel qmalloc, and the channel for the ramfs.

      More measuring and evaluation are on the way.
      We even have full scheduling diagrams, produced by a trace available in the nix kernel.

  2. "The curve is almost exponential, because of a high contention on the scheduler."

    Does that really make sense? I wonder, I really do.

    Replies
    1. Too early to know. Removing the central run queue data structure changes the shape. What seems to be slowing things down is the implementation of the central scheduling data structure (or its very existence). So, what I know is that keeping such a central data structure leads to this almost exponential curve.

      (The AMP scheduler uses the same algorithm, except that core 0 executes it and dispatches processes to all the others, as I'm sure you know.)
