Friday, January 27, 2012
... as their main storage?
We are in the process of creating a new file server program for NIX. The main assumption underlying NIX is that hardware has changed so much that we should revisit all the "well known" facts about system design.
File servers are not an exception. The test machine in use for NIX at lsub has 64 Gbytes of memory. With that much memory, the machine can easily host in memory the entire file tree of our main file server. However, our standard file server program insists on using memory as a cache of the disk, treating the disk as the "master" copy of the file tree.
Instead, I'm advocating for file servers that keep the main file tree in memory, using the disk mostly to survive machine halts and power outages, and to hold those files that are too big to be kept easily in RAM.
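As a very rough sketch of that direction (this is not the actual file server code; the File structure, the Bigfile threshold, and logchange() are all invented for illustration), the whole tree can live in RAM, writes go to the in-memory copy first, and the disk only sees an append-only log that is replayed after a crash:

/* Hypothetical in-memory file tree; the disk is just a log for
 * crash recovery plus a store for files too big to keep resident.
 * Error handling is omitted to keep the sketch short. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { Bigfile = 64*1024*1024 };	/* assumed spill threshold */

typedef struct File File;
struct File {
	char	*name;
	char	*data;		/* resident contents; NULL if spilled to disk */
	size_t	len;
	long	diskaddr;	/* where a spilled or logged copy lives */
	File	*child;		/* first entry, for directories */
	File	*sibling;	/* next entry in the same directory */
};

/* Placeholder for appending an intent record to the on-disk log. */
static void
logchange(File *f, size_t off, size_t n)
{
	fprintf(stderr, "log: %s [%zu,+%zu)\n", f->name, off, n);
}

/* Writes update the in-memory copy; the disk is written off the
 * critical path, so RAM, not the disk, holds the master copy. */
void
filewrite(File *f, const void *buf, size_t n, size_t off)
{
	if(off+n > f->len){
		f->data = realloc(f->data, off+n);
		f->len = off+n;
	}
	memmove(f->data+off, buf, n);
	logchange(f, off, n);
}

Files larger than the assumed Bigfile threshold would keep only their metadata resident and page their contents from disk on demand.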
More on the IX file server in the near future.
Thursday, January 26, 2012
Wasting time.
NIX records the worst, total, and number of waiting times for each lock used in the kernel, including queueing locks and other types of lock. This lets us retrieve from /dev a list of the locks where programs waste their time (be that for useful work or not). These are the winners as of today, as reported by the kernel:
what pc #times maxtime tottime where meantime
lock 0xfffffffff017d490 50 12710074 215909070 /sys/src/nix/port/page.c:560 4318181
lock 0xfffffffff017d486 4 6326408 16216101 /sys/src/nix/port/page.c:560 4054025
lock 0xfffffffff017ce18 112 6763350 293233176 /sys/src/nix/port/page.c:309 2618153
lock 0xfffffffff017cbc3 78 12051128 126828266 /sys/src/nix/port/page.c:222 1626003
lock 0xfffffffff01620c0 162 1366053 14176025 /sys/src/nix/port/devmnt.c:1052 87506
qlock 0xfffffffff01773cc 581 1009266 40825474 /sys/src/nix/port/qio.c:1187 70267
qlock 0xfffffffff012cf2e 155 1238110 10673584 /sys/src/nix/k10/ether82563.c:893 68861
qlock 0xfffffffff01402af 376 574149 24205418 /sys/src/nix/ip/tcp.c:547 64376
qlock 0xfffffffff0143bee 350 492911 14229109 /sys/src/nix/ip/tcp.c:2118 40654
lock 0xfffffffff0119308 593 1121568 18533782 /sys/src/nix/k10/qmalloc.c:275 31254
lock 0xfffffffff0119a17 11183 1365890 294252474 /sys/src/nix/k10/qmalloc.c:500 26312
lock 0xfffffffff016226b 80 459279 1898897 /sys/src/nix/port/devmnt.c:1101 23736
The list is not interesting unless you are optimizing the kernel, but it shows how powerful this simple instrument can be. It is easy to spot the parts responsible for making programs wait, perhaps more than really needed.
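For reference, here is a minimal sketch of the kind of per-call-site accounting such an instrument needs (the real NIX code and its /dev file differ; lockwaited() and the table below are invented names). The mean time in the report is simply tottime divided by #times.

/* Hypothetical accounting for contended lock acquisitions, keyed by
 * the caller's PC so a report can point at file:line. Colliding call
 * sites simply share a slot in this sketch. */
#include <stdint.h>

enum { Nlockstats = 1024 };

typedef struct Lockstat Lockstat;
struct Lockstat {
	uintptr_t pc;		/* call site that had to wait */
	uint64_t  ntimes;	/* how many times it waited (#times) */
	uint64_t  maxwait;	/* worst single wait (maxtime) */
	uint64_t  totwait;	/* sum of all waits (tottime) */
};

static Lockstat lockstats[Nlockstats];

/* Called by the lock code after a contended acquire: pc identifies
 * the caller, waited is the time spent spinning or sleeping. */
void
lockwaited(uintptr_t pc, uint64_t waited)
{
	Lockstat *s = &lockstats[(pc >> 3) % Nlockstats];

	s->pc = pc;
	s->ntimes++;
	s->totwait += waited;
	if(waited > s->maxwait)
		s->maxwait = waited;
	/* meantime in the report is totwait/ntimes */
}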
Wednesday, January 25, 2012
Plots for SMP, AMP, and benchmarks
Recently I posted here regarding SMP, AMP, and benchmarks. That post discussed the results of measuring an actual build of the kernel and of a pseudo-build (where a benchmark program created concurrent processes for all compilations at once). There are significant differences, because the latter is closer to a microbenchmark and hides effects found in the real-world load.
These are the plots.
- Time to compile using mk (which spawns concurrent compilations). The old scheduler is SMP, the new one is AMP.
- Time to compile using pmk (with the source tree hosted in ram), a fake build tool written just for this benchmark. It spawns all compilations at once and then waits for all of them (a rough sketch follows below).
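Something in the spirit of pmk fits in a few lines of C (this is not the real pmk; the compiler name 6c and the argument handling are placeholders): it forks one compilation per source file up front and only then waits, so nothing is serialized by dependencies.

/* Sketch of a pmk-like pseudo-build: start every compilation at once,
 * then wait for all of them. Unlike mk, nothing is serialized by
 * dependencies, which is what makes it close to a micro-benchmark. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int i;
	pid_t pid;

	for(i = 1; i < argc; i++){		/* argv[]: the .c files to compile */
		pid = fork();
		if(pid < 0){
			perror("fork");
			exit(1);
		}
		if(pid == 0){
			execlp("6c", "6c", argv[i], (char*)0);	/* placeholder compiler */
			perror("exec");
			_exit(1);
		}
	}
	while(wait(0) > 0)			/* wait for every child */
		;
	return 0;
}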
Draw your own conclusions. No wonder a micro-benchmark could be designed such that AMP always achieves a speedup as we add more cores. It's a pity that in the real world there are always dependencies between independent things.
Is NUMA relevant?
We ran an experiment with a program that spawns as many compiler processes as needed to compile the kernel source files in parallel. This uses the new AMP scheduler, where core 0 assigns processes to other cores for execution.
To see the effect of NUMA on the 32-core machine used for testing, we timed this program for every number of cores: first allocating memory from the domain local to the process, and then always allocating it from the last memory domain.
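In pseudo-C, the difference between the two runs is just the allocation policy (newpage(), machdomain(), and the per-domain free lists below are invented for this sketch; the real allocator is more involved):

/* Two allocation policies: pick a page frame from the core's local
 * memory domain, or always from the last domain. */
#include <stddef.h>

enum { Ndomains = 8 };

typedef struct Page Page;
struct Page {
	int	color;		/* memory domain the frame belongs to */
	Page	*next;
};

static Page *freepages[Ndomains];	/* one free list per domain */

/* Stand-in for "which domain is the allocating core in?". */
static int
machdomain(void)
{
	return 0;
}

Page*
newpage(int numaaware)
{
	int d;

	if(numaaware)
		d = machdomain();	/* local domain: the NUMA-aware run */
	else
		d = Ndomains-1;		/* always the last domain: the second run */
	/* falling back to other domains when this one is empty is omitted */
	Page *p = freepages[d];
	if(p != NULL)
		freepages[d] = p->next;
	return p;
}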
Thus, if NUMA effects are significant, they must show up in the plot. This is it:
For fewer than 5 cores there is a visible difference: using a NUMA-aware allocator results in lower times, as could be expected. However, for more cores there is no significant difference, which means that NUMA effects are irrelevant for this system and this load (actually, the NUMA-aware allocator is a little bit worse).
The probable reason is that, when using more cores, the system trades off load balancing of processes among cores against memory locality. Although all memory is allocated from the right color, the pages kept in the cache already have a color, as does kernel memory. Using processors of other colors comes at the expense of using remote memory, which means that in the end it does not matter much where you allocate memory.
If each process relied on a different binary, results might be different. However, in practice, the same few binaries are used most of the time (the shell, common file tools, compilers, etc.).
Thus, is NUMA still relevant for this kind of machine and workload?
As provocative as this question might be...
Tuesday, January 24, 2012
SMP, AMP, and benchmarks
The new AMP scheduler for NIX can keep up with loads resulting from actual compilation of the system software. Up to 16 cores, a real speedup is obtained. From there on, contention on other resources results in times that no longer improve, and even get a bit worse.
With the old SMP scheduler, once more than about 6 cores are used, times get worse fast. The curve is almost exponential, because of high contention on the scheduler.
That was using a system load from the real world. As an experiment, we also measured what happens with a program similar to mk. This program concurrently spawns as many processes as needed to compile all the source files. But it is not a real build, because no dependencies are taken into account, no code is generated by running scripts, etc.
For this test, which is halfway between a real program and a micro-benchmark, the results differ. Instead of getting much worse, SMP holds its time when more than 4 cores are added. Also, AMP achieves about a 30% to 40% speedup.
Thus, you might infer from this benchmark that SMP is ok for building software with 32 cores and that AMP may indeed speed it up. However, for the actual build process, SMP is not ok, but much slower. And AMP achieves no speedup by the time you reach 32 cores; you are better off running with 16 cores or fewer.
There are lies, damn lies, and microbenchmarks.
Saturday, January 21, 2012
AMP better than SMP for 32 cores
After some performance measurements, we have found that AMP performs better than SMP, at least for some real-world workloads, like building a kernel.
It seems that scheduler interference, once enough cores are present, is enough to slow down the system. Sparing one core just to schedule tasks onto the other cores (and to handle interrupts) makes the system as a whole faster.
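The AMP idea, very roughly and with every name below invented for the sketch (the real scheduler is more involved), is that core 0 runs the scheduling loop alone and the application cores only execute what they are handed, so they never contend on the run-queue lock:

/* Core 0 dispatches; application cores clear their busy flag when the
 * assigned process blocks or exits (not shown). */
#include <stddef.h>

enum { Ncores = 32 };

typedef struct Proc Proc;
struct Proc {
	Proc *next;
};

typedef struct Core Core;
struct Core {
	Proc *assigned;		/* process handed to this core, if any */
	int   busy;
};

static Core cores[Ncores];
static Proc *runq;		/* global run queue, fed elsewhere */

/* Only core 0 touches the run queue in this scheme. */
static Proc*
runqueue(void)
{
	Proc *p = runq;

	if(p != NULL)
		runq = p->next;
	return p;
}

/* Stand-in for an inter-processor interrupt to wake the core. */
static void
wakecore(int i)
{
	(void)i;
}

/* Runs only on core 0; a real scheduler would halt or monitor instead
 * of spinning when there is nothing to do. */
void
schedloop(void)
{
	int i;
	Proc *p;

	for(;;){
		for(i = 1; i < Ncores; i++){	/* core 0 only schedules */
			if(cores[i].busy)
				continue;
			p = runqueue();
			if(p == NULL)
				break;		/* nothing runnable right now */
			cores[i].assigned = p;
			cores[i].busy = 1;
			wakecore(i);
		}
	}
}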
These are the times for the experiments of building a kernel, copying a file tree from ram to ram, and building a kernel with its source tree in ram. The first two rows are taken from a previous experiment also reported in this blog. We re-ran that experiment to compare against AMP because it depends on the state of the network, which may differ.
mk kernel   cp ram to ram   mk kernel (src in ram)   configuration
14.0344     1.921           10.458                   single sched. 32 cores.
10.789      0.608           5.073                    single sched. 4 cores.
12.8        2.20            9.23                     rerun of single sched, 32 cores.
10.17       0.995           5.775                    AMP sched, 32 cores.
Wednesday, January 18, 2012
Manycore and scheduling
Time for some performance evaluation. After some scaffolding, we are measuring how long it takes to do several things (1: compiling a kernel with the source from a remote file server; 2: copying files from a local ramfs to a local ramfs; 3: compiling a kernel with the source in a local ramfs).
The aim is to compare how long it takes to do this with different numbers of cores and different schedulers.
These are the results for nix. Times are user, system, and total. The value for the system time is not reliable, so pay attention mostly to the total and perhaps to the user time.
- Single scheduler for 32 TCs:
- 1/output:times 7.03556 50.7456 14.0344
- 2/output:times 0.233 1.626 1.921
- 3/output:times 7.663 63.668 10.458
- Single scheduler for just 4 TCs:
- 4/output:times 3.989 5.585 10.789
- 5/output:times 0.156 0.404 0.608
- 6/output:times 4.062 5.429 5.073
- 8 scheduling groups, stealing when idle, for 32 TCs:
- 7/output:times 5.849 22.625 13.859
- 8/output:times 0.193 0.94 1.132
- 9/output:times 6.357 28.8 9.404
We are running more experiments, but the interesting thing is that it takes longer to do the work with 32 cores than it does with 4 cores.
Changing the scheduler so that there is one per group of 4 cores helps, but the time is still far from the best case which, surprisingly, was using just 4 cores.
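A sketch of what "one scheduler per group of 4 cores" means (again with invented names, and with the locking reduced to a comment): each group owns its run queue, and an idle group may steal work from the others, which is exactly the knob turned off in the next experiment below.

/* Per-group run queues with idle stealing; 32 cores in 8 groups of 4. */
#include <stddef.h>

enum {
	Coresgrp = 4,
	Ngroups  = 8,		/* 32 cores / 4 cores per group */
};

typedef struct Proc Proc;
struct Proc {
	Proc *next;
};

typedef struct Schedgrp Schedgrp;
struct Schedgrp {
	Proc *runq;		/* runnable processes owned by this group */
};

static Schedgrp groups[Ngroups];
static int stealing = 1;	/* 0 reproduces the "only 1 group used" case */

static Proc*
dequeue(Schedgrp *s)
{
	Proc *p;

	/* a real kernel would take the group's lock here */
	p = s->runq;
	if(p != NULL)
		s->runq = p->next;
	return p;
}

/* Pick the next process for a core in group g. */
Proc*
grprunproc(int g)
{
	int i, o;
	Proc *p;

	p = dequeue(&groups[g]);	/* try the local queue first */
	if(p != NULL || !stealing)
		return p;
	for(i = 1; i < Ngroups; i++){	/* otherwise steal from a neighbour */
		o = (g + i) % Ngroups;
		p = dequeue(&groups[o]);
		if(p != NULL)
			return p;
	}
	return NULL;
}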
If we run 8 schedulers as in the last test, but do not permit them to steal or donate jobs, which means that there is a single group in use with just 4 cores, the numbers go back to the case of a single scheduler with just 4 TCs:
- 8 scheduling groups, but only 1 used (4 out of 32TCs):
- 13/output:times 4.147 5.727 9.64
- 14/output:times 0.174 0.374 0.596
- 15/output:times 4.049 4.816 4.804
Compatible with Linux or with a language?
For a new system like nix, it might be easier and more effective to introduce compatibility with a run-time for a popular language than doing the same for a legacy operating system.
In the past, introducing trampolines and emulation for Linux might have helped to run Linux binaries. However, if we aim at running, say, Go programs, it would be much easier and cleaner to provide support for their run time.