
Wednesday, May 28, 2014

bytes.Buffer or builtin concatenation to build strings in Go?

While profiling clive Go code I found an interesting bit.
It is the same old discussion about whether to use a string buffer or raw string concatenation to build strings.

A Dir is actually a map[string]string, and its network format is a string of the form

    attr1=val1 attr2=val2 ....

The function Dir.String() builds the string from the map. The question is:

Do we use bytes.Buffer and Fprintf to append data, calling Buffer.String at the end to obtain the resulting string? Or do we use a plain Go string and concatenate with "+="?
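
For reference, here is a minimal sketch of the two variants (illustrative only, not the actual clive code; the Dir type and method names here are made up):

    package dir

    import (
        "bytes"
        "fmt"
    )

    type Dir map[string]string

    // Variant 1: append with bytes.Buffer and Fprintf, then take Buffer.String.
    func (d Dir) stringBuf() string {
        var b bytes.Buffer
        for k, v := range d {
            fmt.Fprintf(&b, "%s=%s ", k, v)
        }
        return b.String()
    }

    // Variant 2: plain string concatenation with +=.
    func (d Dir) stringCat() string {
        s := ""
        for k, v := range d {
            s += k + "=" + v + " "
        }
        return s
    }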

Using bytes.Buffer and Fprintf in Dir.String() yields:
BenchmarkDirString              500000      5163 ns/op

Using strings and +=:
BenchmarkDirString              500000      5077 ns/op


Surprisingly (perhaps), it's both easier to write and faster to use plain strings and forget
about bytes.Buffer. This likely puts more pressure on the GC, because it builds and discards intermediate strings, but it's faster and easier.

Saturday, July 13, 2013

A description of the selfish Nix allocator

This TR from the Lsub papers page describes the Nix allocator and provides a bit of initial evaluation for it. In short, half of the time the allocation of resources could be done by the allocating process without interlocking with other processes or cores and without disturbing any other system component. Plus other benefits described in the TR.
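
The TR has the real description; as a rough illustration of the idea, here is a sketch in Go rather than the kernel's C, with invented names: each process keeps a handful of freed structures and reuses them before falling back to the central, lock-protected allocator.

    package selfish

    import "sync"

    type Block struct{ buf [4096]byte }

    // Central is the shared allocator; touching it requires interlocking.
    type Central struct {
        mu   sync.Mutex
        free []*Block
    }

    const maxSelf = 4 // how many freed blocks a process keeps for itself

    // Proc models the per-process state holding a small private free list.
    type Proc struct {
        selfFree []*Block
        central  *Central
    }

    func (p *Proc) Alloc() *Block {
        if n := len(p.selfFree); n > 0 { // self allocation: no locks, no sharing
            b := p.selfFree[n-1]
            p.selfFree = p.selfFree[:n-1]
            return b
        }
        p.central.mu.Lock() // slow path: the central allocator
        defer p.central.mu.Unlock()
        if n := len(p.central.free); n > 0 {
            b := p.central.free[n-1]
            p.central.free = p.central.free[:n-1]
            return b
        }
        return new(Block)
    }

    func (p *Proc) Free(b *Block) {
        if len(p.selfFree) < maxSelf { // keep it, in case we allocate again soon
            p.selfFree = append(p.selfFree, b)
            return
        }
        p.central.mu.Lock()
        p.central.free = append(p.central.free, b)
        p.central.mu.Unlock()
    }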

Here is a picture of the allocator as a teaser:

Friday, July 12, 2013

Selfish allocators in action

This is the output from the alloc device. The last few lines show that for paths,
chans, and, what is more important, I/O blocks, most of the time a process
could do the allocation by itself: it took a free structure from the small cache
of structures kept in the process structure, in case the process later allocates
again whatever it freed before.
For example, 13891 block allocations were done without reaching the central block allocator,
out of a total of 14440 block allocations.

Note that almost all of the allocations are a self-service, not just half of them.

This is utterly important for many-core machines. Most of the time it's a self-service.

I will post a paper describing the allocator here in the near future.

% cat /dev/alloc
3473408/67108864 cache bytes
72/100 cache segs 0/0 reclaims 1 procs
0/7 rpcs
98/102 segs
12/100 text segs 0/0 reclaims
1071104000 memory
15663104 kernel
0/0 1G pages 0 user 0 kernel 0 bundled 0 split
0/496 2M pages 0 user 0 kernel 0 bundled 6 split
423/968 16K pages 229 user 194 kernel 0 bundled 0 split
82/99 4K pages 0 user 82 kernel 3792 bundled 0 split
6/6 pgas
1447/1509 pgs
169/181 path 5535/5716 self allocs 5563/5575 self frees
103/123 chan 7269/7392 self allocs 7297/7317 self frees
13796608/14192640 malloc 1 segs
56/63 block 13891/14440 self allocs 13945/14438 self frees
0/8388608 ialloc bytes
61/82 mmu 4096 pages

Friday, July 5, 2013

Early evaluation of memory management in Nix mark IV


In a previous post I published a link to a TR describing the
recent work on memory management for Nix. That TR has
been updated to include some early evaluation, which I reproduce
here.

To measure the impact of the different changes in the behavior of the system, we took the final system and ran it with diagnostic output enabled for page allocation and page faults, and measured the different events of interest. Then we did the same disabling one or more of the improvements. The results are not fully precise because debugging output may occasionally miss some events. Further evaluation will use counters instead, and compare with a stock Plan 9 system.

We have to say that the impact of the changes is more dramatic than shown by the results, because early modifications for the mount driver and other parts of the system, on their own, do a good job reducing the impact of system load (e.g., by better caching).

The variations of the system executed are given these names in the results:

all

     The standard Nix mark IV. All new features are in.

nopf

     Prefaulting code for text and stack segments is disabled. Such code installs into the MMU the entries for those pages already present in the segments (because of the cache or other optimizations). The alternative is installing entries on demand.

flush

     Prefaulting code is disabled and the MMU is flushed on forks, as is customary on Plan 9. The alternative to MMU flushes is flushing just the entries for the data segment (the others do not have to be flushed, because of optimizations explained before).

nodeep

     Deep stack forks are disabled. A deep stack fork copies the actual stack memory during the fork. The alternative is the standard copy-on-reference to fork a stack segment.

nostk

     Stack segments are not cached (their memory is not kept and recycled) and deep stack copies are not performed. The alternative is the standard construction by zero-fill on demand (due to page faults) and full deallocation and allocation of stacks when processes exit and are created.

none

     There is no prefaulting code, the MMU is flushed on forks, deep stack copying is disabled, and stack memory is not cached. Quite similar to the state of the system before any modification was made, except for using 16KiB pages for user segments.


none4k

     This is the old system. Like the previous variation, but using standard 4KiB pages for user segments.

The system booted normally from a remote file server to execute a shell instead of the full standard start script, and then we executed pwd twice. The first table reports the results for the second run of pwd, counting from the time we typed the command to the time the shell printed its prompt after pwd completed. The first way to read the table is to compare any row with the first or the last one, to see the impact of a particular configuration.

The second table shows the same counters but for the entire execution of the system, from a hardware reset to the prompt after executing pwd twice.

 bench    page allocs   page frees   mmu faults   page faults   page-ins
_________________________________________________________________________
  all          6             5           14           11           1
  nopf         6             5           16           12           1
 flush         6             5           22           18           1
 nodeep        6             6           17           12           1
 nostk         6             7           16           15           1
  none         8             8           24           17           1
 none4k       15            15           65           68           1

Page allocations, page frees, MMU faults, page faults, and page-ins for variations of the system, on the second execution of a simple command.

 bench    page allocs   page frees   mmu faults   page faults   page-ins
_________________________________________________________________________
  all         229           41          219           210         107
  nopf        231           41          246           232         109
 flush        224           38          311           283         109
 nodeep       227           41          244           237         109
 nostk        232           56          245           233         109
  none        236           60          321           296         109
 none4k       501          107          847           843         313

Page allocations, page frees, MMU faults, page faults, and page-ins for variations of the system, for an entire boot and two executions of a simple command.

Several things can be seen:

∙    Going from the old to the new system means going from 68 down to 11 page faults, just for running pwd from the shell. For the entire boot process it means going from 843 down to 210.

∙    Using a more reasonable page size, without other optimizations, greatly reduces the number of page faults (as could be expected). We saw that almost all the 4K pages paged in are actually used for 16K pages, thus it pays to change the page size. Also, the new page size has a significant impact on the size of reads performed by the mount driver, because it enables concurrent reads for larger sizes.

∙    Deep stack copying reduces the page faults in the system a little, but it might not be worthwhile if the time taken to zero out the stack pages kept in the cache is wasted when they are not used. In our system that does not seem to be the case.

∙    More effort should be made to avoid flushing MMU state. As could be expected, not flushing the MMU when it is not necessary reduces the number of page faults quite a bit.

As a reference, here we list the trace output for the old and the new system for executing pwd the second time. It is illustrative to compare them.

This is for the new system:
% pwd
newpg 0x000000003fe44000 pgsz 0x4000 for 0x4000
fault pid 23 0x20f9e0 r
fault pid 21 0x400000 r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x20c000 pg 0x1174000 r1 n1
fixfixfault faulted pid 23 s /bin/rc Text 0x200000 addr 0x20c000 pg 0x1174000 ref 1
fault pid 23 0x20a9d9 r
pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x400000 pg 0x3fe6c000 r2 n1
newpg 0x000000003fe80000 pgsz 0x4000 for 0x4000
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x400000 pg 0x3fe80000 ref 1
fault pid 21 0x404b30 w
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x208000 pfixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x404000 pg 0x3fe70000 r2 n1
newpg 0x000000003fe84000 pgsz 0x4000 for 0x4000
g 0x11a4000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x404000 pg 0x3fe84000 ref 1
fault pid 21 0x409afc r
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x208000 pg 0x11a4000 ref 1
fault pid 23 0x400060 w
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x408000 pg 0x3fe78000 r2 n1
newpg 0x000000003fe88000 pgsz 0x4000 for 0x4000
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x408000 pg 0x3fe88000 ref 1
fault pid 21 0x40c178 r
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x400000 pg 0x3fe6c000 r1 n1
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x400000 pg 0x3fe6c000 ref 1
fault pid 23 0x202aec r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x40c000 pg 0x3fe74000 r2 n1
newpg 0x000000003feaddr 0x200000 pg 0x10e4000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x200000 pg 0x10e4000 ref 1
fault pid 23 0x40a698 r
8c000 pgsz 0x4000 for 0x4000
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x40c000 pg 0x3fe8c000 ref 1
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x408000 pg 0x3fe78000 r1 n1
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x408000 pg 0x3fe78000 ref 1
fault pid 23 0x212b47 r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x210000 pg 0x11a8000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x210000 pg 0x11a8000 ref 1
fault pid 23 0x404b30 w
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x404000 pg 0x3fe70000 r1 n1
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x404000 pg 0x3fe70000 ref 1
fault pid 23 0x40c17c r
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x40c000 pg 0x3fe74000 r1 n1
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x40c000 pg 0x3fe74000 ref 1
fault pid 23 0x7ffffeffbfe0 w
fixfault pid 23 s '' sref 1 Stack 0x7ffffdffc000 addr 0x7ffffeff8000 pg 0x3fe60000 r1 n1
fixfaulted pid 23 s '' Stack 0x7ffffdffc000 addr 0x7ffffeff8000 pg 0x3fe60000 ref 1
pgfree pg 0x3fe6c000
pgfree pg 0x3fe70000
pgfree pg 0x3fe78000
pgfree pg 0x3fe74000
fault pid 23 0x400018 w
fixfault pid 23 s /bin/pwd sref 1 Data 0x400000 addr 0x400000 pg 0x0 r0 n-1
pagein pid 23 s 0x400000 addr 0x400000 soff 0x0
newpg 0x000000003fe74000 pgsz 0x4000 for 0x4000
fixfaulted pid 23 s /bin/pwd Data 0x400000 addr 0x400000 pg 0x3fe74000 ref 1
/usr/nemo
pgfree pg 0x3fe74000


And this is for the old system:
fault pid 21 0x2000c4 r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x200000 pg 0x116c000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x200000 pg 0x116c000 ref 1
fault pid 21 0x20170a r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x201000 pg 0x116f000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x201000 pg 0x116f000 ref 1
fault pid 21 0x205dfc r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x205000 pg 0x1132000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x205000 pg 0x1132000 ref 1
fault pid 21 0x40b834 w
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x40b000 pg 0x11b3000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x40b000 pg 0x11b3000 ref 1
fault pid 21 0x407c78 r
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x407000 pg 0x11ac000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x407000 pg 0x11ac000 ref 1
fault pid 23 0x20f9e0 r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x20f000 pg 0x1138000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x20f000 pg 0x1138000 ref 1
fault pid 23 0x7fffffffefe8 w
fixfault pid 23 s /bin/rc sref 1 Stack 0x7ffffefff000 addr 0x7fffffffe000 pg 0x11b0000 r2 n1
newpg 0x00000000011d2000 pgsz 0x1000 for 0x1000
fixfaulted pid 23 s '' Stack 0x7ffffefff000 addr 0x7fffffffe000 pg 0x11d2000 ref 1
fault pid 23 0x20a9d9 r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x20a000 pg 0x1146000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x20a000 pg 0x1146000 ref 1
fault pid 23 0x20be91 r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x20b000 pg 0x1135000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x20b000 pg 0x1135000 ref 1
fault pid 23 0x400060 w
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x400000 pg 0x11aa000 r2 n1
newpg 0x00000000011d0000 pgsz 0x1000 for 0x1000
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x400000 pg 0x11d0000 ref 1
fault pid 23 0x202aec r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x202000 pg 0x1130000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x202000 pg 0x1130000 ref 1
fault pid 23 0x40a698 r
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x40a000 pg 0x119e000 r2 n1
newpg 0x00000000011d6000 pgsz 0x1000 for 0x1000
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x40a000 pg 0x11d6000 ref 1
fault pid 23 0x20977d r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x209000 pg 0x1139000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x209000 pg 0x1139000 ref 1
fault pid 23 0x20dbe3 r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x20d000 pg 0x113b000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x20d000 pg 0x113b000 ref 1
fault pid 23 0x212b47 r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x212000 pg 0x113c000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x212000 pg 0x113c000 ref 1
fault pid 23 0x401338 r
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x401000 pg 0x11af000 r2 n1
newpg 0x00000000011ce000 pgsz 0x1000 for 0x1000
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x401000 pg 0x11ce000 ref 1
fault pid 23 0x210670 r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x210000 pg 0x113d000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x210000 pg 0x113d000 ref 1
fault pid 23 0x404b30 w
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x404000 pg 0x11ae000 r2 n1
newpg 0x00000000011d3000 pgsz 0x1000 for 0x1000
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x404000 pg 0x11d3000 ref 1
fault pid 23 0x21107c r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x211000 pg 0x1143000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x211000 pg 0x1143000 ref 1
fault pid 23 0x40c17c r
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x40c000 pg 0x11b2000 r2 n1
newpg 0x00000000011cf000 pgsz 0x1000 for 0x1000
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x40c000 pg 0x11cf000 ref 1
fault pid 23 0x4097fc r
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x409000 pg 0x1196000 r2 n1
newpg 0x00000000011ca000 pgsz 0x1000 for 0x1000
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x409000 pg 0x11ca000 ref 1
fault pid 23 0x20e02d r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x20e000 pg 0x1147000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x20e000 pg 0x1147000 ref 1
fault pid 23 0x20cbb0 r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x20c000 pg 0x1129000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x20c000 pg 0x1129000 ref 1
fault pid 23 0x40204a r
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x402000 pg 0x11a7000 r2 n1
newpg 0x00000000011c6000 pgsz 0x1000 for 0x1000
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x402000 pg 0x11c6000 ref 1
fault pid 23 0x2086dc r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x208000 pg 0x1165000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x208000 pg 0x1165000 ref 1
fault pid 23 0x213000 r
fixfault pid 23 s /bin/rc sref 9 Text 0x200000 addr 0x213000 pg 0x1142000 r1 n1
fixfaulted pid 23 s /bin/rc Text 0x200000 addr 0x213000 pg 0x1142000 ref 1
fault pid 23 0x4055b8 r
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x405000 pg 0x11a9000 r2 n1
newpg 0x00000000011d8000 pgsz 0x4000 for 0x1000
splitbundle pg 0x00000000011d8000
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x405000 pg 0x11d8000 ref 1
fault pid 23 0x4081a8 r
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x408000 pg 0x11a8000 r2 n1
newpg 0x00000000011db000 pgsz 0x1000 for 0x1000
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x408000 pg 0x11db000 ref 1
fault pid 23 0x407c78 r
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x407000 pg 0x11ac000 r2 n1
newpg 0x00000000011da000 pgsz 0x1000 for 0x1000
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x407000 pg 0x11da000 ref 1
fault pid 23 0x406dd8 r
fixfault pid 23 s /bin/rc sref 1 Data 0x400000 addr 0x406000 pg 0x11ad000 r2 n1
newpg 0x00000000011d9000 pgsz 0x1000 for 0x1000
fixfaulted pid 23 s /bin/rc Data 0x400000 addr 0x406000 pg 0x11d9000 ref 1
fault pid 21 0x7fffffffefe8 w
fixfault pid 23 0x7ffffeffefe0 w
fixfault pid 23 s '' sref 1 Stack 0x7ffffdfff000 addr 0x7ffffeffe000 pg 0x0 r0 n-1
newpg 0x00000000011dc000 pgsz 0x4000 for 0x1000
splitbundle pg 0x00000000011dc000
fixfaulted pid 23 s '' Stack 0x7ffffdfff000 addr 0x7ffffeffe000 pg 0x11dc000 ref 1
pgfree pg 0x11d2000
pgfree pg 0x11d0000
pgfree pg 0x11ce000
pgfree pg 0x11c6000
pgfree pg 0x11d3000
pgfree pg 0x11d8000
pgfree pg 0x11d9000
pgfree pg 0x11da000
pgfree pg 0x11db000
pgfree pg 0x11ca000
pgfree pg 0x11d6000
pgfree pg 0x11cf000
faultfault pid 23 0x7fffffffef98 w
fixfault pid 23 s /bin/pwd sref 1 Stack 0x7ffffefff000 addr 0x7fffffffe000 pg 0x11dc000 r1 n1
fixfaulted pid 23 s '' Stack 0x7ffffefff000 addr 0x7fffffffe000 pg 0x11dc000 ref 1
fault pid 23 0x20008a r
fixfault pid 23 s /bin/pwd sref 3 Text 0x200000 addr 0x200000 pg 0x11cd000 r1 n1
fixfaulted pid 23 s /bin/pwd Text 0x200000 addr 0x200000 pg 0x11cd000 ref 1
fault pid 23 0x400018 w
fixfault pid 23 s /bin/pwd sref 1 Data 0x400000 addr 0x400000 pg 0x0 r0 n-1
pagein pid 23 s 0x400000 addr 0x400000 soff 0x0
newpg 0x00000000011cf000 pgsz 0x1000 for 0x1000
fixfaulted pid 23 s /bin/pwd Data 0x400000 addr 0x400000 pg 0x11cf000 ref 1
fault pid 23 0x201b6d r
fixfault pid 23 s /bin/pwd sref 3 Text 0x200000 addr 0x201000 pg 0x11d1000 r1 n1
fixfaulted pid 23 s /bin/pwd Text 0x200000 addr 0x201000 pg 0x11d1000 ref 1
 pid 21 s /bfault pid 23 0x2020ef r
fixfault pid 23 s /bin/pwd sref 3 Text 0x200000 addr 0x202000 pg 0x11d4000 r1 n1
fixfaulted pid 23 s /bin/pwd Text 0x200000 addr 0x202000 pg 0x11d4000 ref 1
fault pid 23 0x2040b2 r
fixfault pid 23 s /bin/pwd sref 3 Text 0x200000 addr 0x204000 pg 0x11d7000 r1 n1
fixfaulted pid 23 s /bin/pwd Text 0x200000 addr 0x204000 pg 0x11d7000 ref 1
/usr/nemo
fault pid 23 0x401098 r
fixfault pid 23 s /bin/pwd sref 1 Data 0x400000 addr 0x401000 pg 0x0 r0 n-1
pagein: zfod 0x401000
newpg 0x00000000011d6000 pgsz 0x1000 for 0x1000
fixfaulted pid 23 s /bin/pwd Data 0x400000 addr 0x401000 pg 0x11d6000 ref 1
pgfree pg 0x11dc000
pgfree pg 0x11cf000
pgfree pg 0x11d6000
in/rc sref 1 Stack 0x7ffffefff000 addr 0x7fffffffe000 pg 0x11b0000 r1 n1
fixfaulted pid 21 s '' Stack 0x7ffffefff000 addr 0x7fffffffe000 pg 0x11b0000 ref 1
fault pid 21 0x20f9e0 r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x20f000 pg 0x1138000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x20f000 pg 0x1138000 ref 1
fault pid 21 0x20a9d9 r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x20a000 pg 0x1146000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x20a000 pg 0x1146000 ref 1
fault pid 21 0x20bdca r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x20b000 pg 0x1135000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x20b000 pg 0x1135000 ref 1
fault pid 21 0x400000 r
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x400000 pg 0x11aa000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x400000 pg 0x11aa000 ref 1
fault pid 21 0x20ddb3 r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x20d000 pg 0x113b000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x20d000 pg 0x113b000 ref 1
fault pid 21 0x212ea2 r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x212000 pg 0x113c000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x212000 pg 0x113c000 ref 1
fault pid 21 0x401338 r
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x401000 pg 0x11af000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x401000 pg 0x11af000 ref 1
fault pid 21 0x210670 r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x210000 pg 0x113d000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x210000 pg 0x113d000 ref 1
fault pid 21 0x404b30 w
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x404000 pg 0x11ae000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x404000 pg 0x11ae000 ref 1
fault pid 21 0x409afc r
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x409000 pg 0x1196000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x409000 pg 0x1196000 ref 1
fault pid 21 0x211ba3 r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x211000 pg 0x1143000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x211000 pg 0x1143000 ref 1
fault pid 21 0x20e02d r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x20e000 pg 0x1147000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x20e000 pg 0x1147000 ref 1
fault pid 21 0x208574 r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x208000 pg 0x1165000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x208000 pg 0x1165000 ref 1
fault pid 21 0x202c39 r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x202000 pg 0x1130000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x202000 pg 0x1130000 ref 1
fault pid 21 0x40a698 r
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x40a000 pg 0x119e000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x40a000 pg 0x119e000 ref 1
fault pid 21 0x2097b7 r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x209000 pg 0x1139000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x209000 pg 0x1139000 ref 1
fault pid 21 0x213000 r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x213000 pg 0x1142000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x213000 pg 0x1142000 ref 1
fault pid 21 0x40c178 r
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x40c000 pg 0x11b2000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x40c000 pg 0x11b2000 ref 1
fault pid 21 0x20ce7e r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x20c000 pg 0x1129000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x20c000 pg 0x1129000 ref 1
fault pid 21 0x204bba r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x204000 pg 0x1133000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x204000 pg 0x1133000 ref 1
fault pid 21 0x402611 r
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x402000 pg 0x11a7000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x402000 pg 0x11a7000 ref 1
fault pid 21 0x405e00 r
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x405000 pg 0x11a9000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x405000 pg 0x11a9000 ref 1
fault pid 21 0x408d48 r
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x408000 pg 0x11a8000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x408000 pg 0x11a8000 ref 1
fault pid 21 0x20310e r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x203000 pg 0x1136000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x203000 pg 0x1136000 ref 1
fault pid 21 0x40b834 w
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x40b000 pg 0x11b3000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x40b000 pg 0x11b3000 ref 1
fault pid 21 0x206ab9 r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x206000 pg 0x113a000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x206000 pg 0x113a000 ref 1
fault pid 21 0x406820 r
fixfault pid 21 s /bin/rc sref 1 Data 0x400000 addr 0x406000 pg 0x11ad000 r1 n1
fixfaulted pid 21 s /bin/rc Data 0x400000 addr 0x406000 pg 0x11ad000 ref 1
fault pid 21 0x7fffffffce78 w
fixfault pid 21 s /bin/rc sref 1 Stack 0x7ffffefff000 addr 0x7fffffffc000 pg 0x11bd000 r1 n1
fixfaulted pid 21 s '' Stack 0x7ffffefff000 addr 0x7fffffffc000 pg 0x11bd000 ref 1
fault pid 21 0x2071d8 r
fixfault pid 21 s /bin/rc sref 7 Text 0x200000 addr 0x207000 pg 0x116b000 r1 n1
fixfaulted pid 21 s /bin/rc Text 0x200000 addr 0x207000 pg 0x116b000 ref 1




Thursday, July 4, 2013

Memory management in Nix Mark IV

There is a draft TR describing how memory management has been fully reworked in Nix mark IV. The resulting system is faster, suffers far fewer page faults than its predecessor, and makes better use of concurrent access to file servers.

We are still testing the implementation, and will hopefully make it public soon.
In a future post I will show some evaluation of the system and some quite illustrative traces obtained from debug output.

Monday, May 13, 2013

Making 9 and Nix faster

There is a full initial implementation of the new mount table, mount
driver, and protocol work. A document describing what has been done
is available at http://lsub.org/export/9pix.pdf

Thursday, January 26, 2012

Wasting time

NIX knows how to record the worst, total, and number of waiting times for each lock used in the kernel, including queueing locks and other types of lock. This lets us retrieve from /dev a list of the locks where programs waste their time (be it on useful work or not). These are the winners as of today, as reported by the kernel:


what pc #times maxtime tottime where meantime
lock 0xfffffffff017d490 50 12710074 215909070 /sys/src/nix/port/page.c:560 4318181
lock 0xfffffffff017d486 4 6326408 16216101 /sys/src/nix/port/page.c:560 4054025
lock 0xfffffffff017ce18 112 6763350 293233176 /sys/src/nix/port/page.c:309 2618153
lock 0xfffffffff017cbc3 78 12051128 126828266 /sys/src/nix/port/page.c:222 1626003
lock 0xfffffffff01620c0 162 1366053 14176025 /sys/src/nix/port/devmnt.c:1052 87506
qlock 0xfffffffff01773cc 581 1009266 40825474 /sys/src/nix/port/qio.c:1187 70267
qlock 0xfffffffff012cf2e 155 1238110 10673584 /sys/src/nix/k10/ether82563.c:893 68861
qlock 0xfffffffff01402af 376 574149 24205418 /sys/src/nix/ip/tcp.c:547 64376
qlock 0xfffffffff0143bee 350 492911 14229109 /sys/src/nix/ip/tcp.c:2118 40654
lock 0xfffffffff0119308 593 1121568 18533782 /sys/src/nix/k10/qmalloc.c:275 31254
lock 0xfffffffff0119a17 11183 1365890 294252474 /sys/src/nix/k10/qmalloc.c:500 26312
lock 0xfffffffff016226b 80 459279 1898897 /sys/src/nix/port/devmnt.c:1101 23736

The list is not interesting unless you are optimizing the kernel, but it shows how powerful this simple instrument can be. It's easy to spot some parts which are responsible for making the programs wait, perhaps more than really needed.
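
As a rough illustration of the book-keeping involved, here is a sketch in Go rather than the kernel's C, with invented names: for each lock we count the waits, the worst wait, and the total waiting time, from which the mean shown in the listing above follows.

    package lockstats

    import (
        "sync"
        "time"
    )

    // ProfiledLock wraps a mutex and records, for callers that had to wait,
    // the number of waits, the worst wait, and the total waiting time.
    type ProfiledLock struct {
        mu sync.Mutex

        statsMu sync.Mutex
        waits   uint64
        worst   time.Duration
        total   time.Duration
    }

    func (l *ProfiledLock) Lock() {
        if l.mu.TryLock() {
            return // uncontended: nothing to record
        }
        start := time.Now()
        l.mu.Lock() // we had to wait: time it
        waited := time.Since(start)

        l.statsMu.Lock()
        l.waits++
        l.total += waited
        if waited > l.worst {
            l.worst = waited
        }
        l.statsMu.Unlock()
    }

    func (l *ProfiledLock) Unlock() { l.mu.Unlock() }

    // Mean is the mean waiting time, like the last column of the listing.
    func (l *ProfiledLock) Mean() time.Duration {
        l.statsMu.Lock()
        defer l.statsMu.Unlock()
        if l.waits == 0 {
            return 0
        }
        return l.total / time.Duration(l.waits)
    }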

Wednesday, January 25, 2012

Plots for SMP, AMP, and benchmarks

Recently I posted here regarding SMP, AMP, and benchmarks. That post discussed the results of measuring an actual build of the kernel and a pseudo-build (where a benchmark program created concurrent processes for all compilations at once). There are significant differences, because the latter is closer to a microbenchmark and hides effects found in the real-world load.

These are the plots.


  • Time to compile using mk (which spawns concurrent compilations). The old scheduler is SMP, the new one is AMP.



  • Time to compile using pmk (with the source tree hosted in ram), a fake build tool just for this benchmark. It spawns all compilations at once and then waits for all of them.


Draw your own conclusions. No doubt a micro-benchmark could be designed such that AMP always achieves a speedup as we add more cores. It's a pity that in the real world there are always dependencies between independent things.

Is NUMA relevant?


We made an experiment, running a program that spawns as many compiler processes as needed to compile the kernel source files in parallel. This uses the new AMP scheduler, where core 0 assigns processes to other cores for execution.

To see the effect of NUMA on the 32-core machine used for testing, we timed this program for every number of cores: first, allocating memory from the domain local to the process; second, allocating it always from the last memory domain.

Thus, if NUMA effects are significant, they must show up in the plot. This is it:





For fewer than 5 cores there is a visible difference: using a NUMA-aware allocator results in lower times, as could be expected. However, for more cores there is no significant difference, which means that NUMA effects are irrelevant on this system and with this load (actually, the NUMA allocator is a little bit worse).

The probable reason is that, when using more cores, the system trades off load balancing of processes among cores against memory locality. Although all memory is allocated from the right color, the pages kept in the cache already have a color, as does kernel memory. Using processors from other colors comes at the expense of using remote memory, which means that in the end it does not matter much where you allocate memory.

If each process relied on a different binary, results might be different. However, in practice, the same few binaries are used most of the time (the shell, common file tools, compilers, etc.).

Thus, is NUMA still relevant for this kind of machine and workload?
As provocative as this question might be...


Tuesday, January 24, 2012

SMP, AMP, and benchmarks

The new AMP scheduler for NIX can keep up with loads resulting from actual compilation of the system software. Up to 16 cores, a real speedup is obtained. From there on, contention on other resources results in times not improving, and even slowing down a bit.

With the old SMP scheduler, when more than about 6 cores are used, times get worse fast. The curve is almost exponential, because of high contention on the scheduler.

That was using a system load from the real world. As an experiment, we also measured what happens with a program similar to mk. This program concurrently spawns as many processes as needed to compile all the source files. But it's not a real build, because no dependencies are taken into account, no code is generated by running scripts, etc.


For this test, which is half-way between a real program and a micro-benchmark, the results differ. Instead of getting much worse, SMP keeps its time when more than 4 cores are added. Also, AMP achieves about a 30% to 40% speedup.


Thus, you might infer from this benchmark that SMP is ok for building software with 32 cores and that AMP may indeed speed it up. However, for the actual build process, SMP is not ok but much slower, and AMP achieves no speedup by the time you reach 32 cores; you are better off running with 16 cores or fewer.


There are lies, damn lies, and microbenchmarks.

Saturday, January 21, 2012

AMP better than SMP for 32 cores

After some performance measurements, we have found that AMP performs better than SMP, at least for some real-world workloads, like making a kernel.

It seems that scheduler interference, when enough cores are present, is enough to slow down the system. Dedicating a core just to scheduling tasks onto other cores (and handling interrupts) makes the system as a whole faster.

These are the times for the experiments of making a kernel, copying a file tree from ram to ram, and making a kernel with its source tree in ram. The first two rows are taken from a previous experiment also reported in this blog. We re-ran that experiment to compare with AMP because it depends on the state of the network, which may differ.


mk kernel   cp (ram to ram)   mk kernel (src in ram)   configuration
14.0344     1.921             10.458                   single sched. 32 cores.
10.789      0.608             5.073                    single sched. 4 cores.
12.8        2.20              9.23                     rerun of single sched, 32 cores.
10.17       0.995             5.775                    AMP sched, 32 cores.


Sunday, May 29, 2011

User level semaphores

Multicore computers as found today can share memory. Semaphores are
popular for building other synchronization and communication primitives.
But as implemented in most operating systems, they require entering the kernel
even when it is not necessary. In many cases, two processes will be running, each
on its own core, and one puts a ticket into a semaphore before the other
tries to acquire it. Why should they enter the kernel in that case?
Linux includes futexes, with the same rationale, but they are not reliable and, therefore,
cannot be used for real applications.


In the operating system we are building, NIX, we have included user-level semaphores
using this interface:



extern int altsems(int *ss[], int n);
extern int downsem(int *s, int dontblock);
extern void upsem(int *s);


Up and down are similar to other interfaces. They rely on atomic increment and
decrement operations to update the semaphore (an integer value) without races.
However, if a process finds during a down that it should block, it enters
the kernel and sleeps. If another process increments the semaphore and finds that
someone was asleep (because the value was negative), it enters the kernel to wake
the sleeper. With this book-keeping, operations may proceed in many cases without paying
the cost of entering and exiting the kernel.
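
As a rough sketch of that fast path (written in Go for brevity, not the actual NIX C implementation; a buffered channel stands in here for the kernel sleep/wakeup path):

    package usem

    import "sync/atomic"

    type Sem struct {
        val  int32
        kern chan struct{} // stand-in for the kernel sleep/wakeup path
    }

    func New(n int32) *Sem {
        return &Sem{val: n, kern: make(chan struct{}, 1024)}
    }

    // Down takes a ticket. It only enters the "kernel" when the counter
    // goes negative, i.e. no ticket was available.
    func (s *Sem) Down() {
        if atomic.AddInt32(&s.val, -1) < 0 {
            <-s.kern // slow path: sleep until an Up wakes us
        }
    }

    // Up adds a ticket. It only enters the "kernel" when the old value
    // was negative, i.e. some process is (or will be) sleeping.
    func (s *Sem) Up() {
        if atomic.AddInt32(&s.val, 1) <= 0 {
            s.kern <- struct{}{} // slow path: wake one sleeper
        }
    }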


Altsems is a novel operation. It tries to perform simultaneous downs on an array of
semaphores, using the same approach described above. Only one of the downs will proceed,
but the process does not know which one until the call returns. This is useful to implement
user-level channels, equipped with an alt operation to send or receive from multiple
channels.



Monday, May 2, 2011

What's faster? A vector or a list?

Actual program performance is non-intuitive. Stroustrup, the author
of C++, included measurements of ordered insertion into C++ vectors
and lists in a talk. We (E. Soriano and I) measured the same issue using C in Plan 9, Mac
OS X, and Linux, and C++ in Mac OS X.

Well, it seems that it's not clear at all which data structure is better for
something as simple as sorted insertion. Depending on the implementation of
the containers, on the system, and on the language used, one or the other may be
the right choice. There are so many factors that I'd say it cannot be predicted.

Thus, we might say that it doesn't really matter. Pick your preferred one and do not optimize early. All in all, it might be the better one, or the worse one; nobody really knows.

See Some Performance Experiments for Simple Data Structures for the experiments.