Tuesday, April 30, 2013

9P read streaming in action

This is the debug output for a standard Plan 9, modified as said in previous posts.
The output was collected while copying /bin/ls to /tmp. The sendreqs print
corresponds to read requests issued to satisfy a read, and implies a blocking
during read to wait for data.

The sendreqs2 prints correspond to requests issued by the kernel during "streaming"; they are just read-aheads.

In short, the output means that 9 had to wait only for the first read. The semantics of file access remain the same of a standard 9 system, we just exploit 9P2000.ix when available and read-aheads for files that   are cached in the mount cache. Plus, we note in the mount cache when did we detect the end of file to prevent unnecessary read requests. All this being careful to check out with the server during opens, as the standard mount cache would do.

sendreqs pid 157 /bin/ls n 8192 off 0 cend -1
sendreqs2 pid 157 /bin/ls na 16336 raoff 8192
rdone pid 157 /bin/ls off 0 tot 8192 left 0xf27b6b00
rdone pid 157 /bin/ls off 16384 tot 8144 left 0xf1e71790
rdone pid 157 /bin/ls off 8192 tot 8192 left 0x0
sendreqs2 pid 157 /bin/ls na 8192 raoff 24528
rdone pid 157 /bin/ls off 24528 tot 8192 left 0x0
sendreqs2 pid 157 /bin/ls na 8192 raoff 32720
rdone pid 157 /bin/ls off 32720 tot 8192 left 0x0
sendreqs2 pid 157 /bin/ls na 8192 raoff 40912
rdone pid 157 /bin/ls off 40912 tot 8192 left 0x0
sendreqs2 pid 157 /bin/ls na 8192 raoff 49104
rdone pid 157 /bin/ls off 49104 tot 8192 left 0x0
sendreqs2 pid 157 /bin/ls na 8192 raoff 57296
rdone pid 157 /bin/ls off 57296 tot 8192 left 0x0
sendreqs2 pid 157 /bin/ls na 8192 raoff 65488
rdone pid 157 /bin/ls off 65488 tot 8192 left 0x0
sendreqs2 pid 157 /bin/ls na 8192 raoff 73680
rdone pid 157 /bin/ls off 73680 tot 7894 left 0x0
ceof pid 157 /bin/ls eof at 81574

We still have to clean up the implementation, conduct more testing, and fix any bug found during the process. But, this system is close to fully avoid the latency problems due to 9P RPCs while still using 9P(.ix), which is nice. Stay tuned :)

Saturday, April 27, 2013

1-round trip operations

These are a couple of examples of the modified 9 kernel speaking to a 9P2000.ix server.

This is a dirstat call:

You can see how the kernel sent the walk, stat, and clunk requests without waiting for the server to reply. The server processed them sequentially because they were using the same tag, and replied to each one as they were completed.

This other one is a create for a file:

The walk and create requests were sent without waiting. Because the kernel knows it's a 9P2000.ix server, it sends just a create request flagged to indicate that the server should either create the file or walk to the file and open with truncation (should the file already exist).

When the kernel is not sure regarding the server it's speaking to, it would speak 9P2000 as usual, so there is no problem regarding compatibility with the rest of the 9 world.

Coming up next, we are going to address read and writes so they can be streamed while the file server, (most of) the kernel, and the user program can still rely on good old interfaces.

Wednesday, April 24, 2013

Getting "dot" right

Yes, "dot", not "dot-dot". What happens when you...?

grep foo ../*/*.c

You start with your ".", then go up, and then down to anywhere you can. One of those places is ".", isn't it?

As part of the work we are doing to make the system more efficient regarding file operations, actually as a side-effect, we got "." right.

Our system takes any relative path given, prefixes the path for the current working directory, cleans up the name, and then tries to use that name. Note how this removes not just ".." but also ".". Name resolution has not even started.

When it starts, it would check the mount table and also your ".". If your file is under the path for ".", that file is used as a starting point for the walk to reach the named file; that is, there are no dots in file names used by the kernel but you still have a current working directory.

If you look again at the command above, you'll notice how all files under your current directory won't be resolved by going up just to go back to where you were.

Plan 9, mount tables, RPCs, and streaming

To be able to reduce the number of RPCs it is not enough to have a protocol capable of doing that. The mount table and how the kernel uses files has to change a little bit. On the other hand, it is not enough to have the kernel prepared for the task, the protocol and the server must cooperate. 

Plan 9 system calls that take a file name end up calling namec within the kernel. This function takes a name and returns a Chan, representing a file in use by the kernel (or the processes it runs). Now, to resolve the name, namec starts with the process slash or dot file, and tries to walk the given path from there. This requires an RPC to the file server (or device) providing the (slash or dot) file.

After having the list of file identifiers (qids) for the files known by the server for that walk, each element is checked out in the mount table to see if the walking should continue at a different (mounted) file.

This means that namec requires:
  • Issuing an RPC to the server for the starting point of the file path
  • Lookup the first mount point along the way
  • Repeat this process starting with the mounted file (if any) and remaining elements of the path.
When the final file is found, the resulting Chan is returned to the kernel routine calling namec.

Consider for example remove(2) or stat(2), the former is mostly:
  • call namec to get the Chan
  • issue a remove RPC to the server providing the file.
the latter looks like:
  • call namec to get the Chan
  • issue a stat RPC to the server
  • issue a clunk RPC to the server to release the Chan.
If the file is after a mounted point,  namec would require multiple RPCs, each with a RTT to a server. In other cases (like when opening a file, namec has to issue further RPCs, e.g., to open it).

In principle, to remove a file, you could do it in a single RPC RTT to the server, but the mount table, name resolution, and the implementation of remove in the kernel conspire to make you require multiple RPCs (and RTTs).

To reduce the RTT for operations like opening a file, or removing it, or ..., we have done this:

  1. We replaced the mount table with a name based mount table. It lets the kernel travel along a tree of names to determine to which server  it should talk and which file to use to start walking.  The mount table still has unions and its semantics are close to the standard mount table. But, with this table, you do not require RPCs to any server to reach the actual file where walking should start to reach the file. Plus, if there are multiple mount points along the way, you do not disturb (or wait for) the servers early in the path.
  2. We rewrote namec to exploit the new mount table, and to accept new ``access modes''. After looking for the longest prefix for the path required in the mount table, it issues the RPCs required to honor the access mode (e.g., to open a file, or to remove it, or to stat it).
  3. We added a new driver, devlater,  which is used during name resolution to defer walks. If the file to be walked is provided by the mount driver (is a remote or user-level file), and can be cached, the Chan returned from the file pretends to have walked to the file without walking at all. Such deferred channels are provided by the later device, which would do the walk only when it knows what should be done with the file (e.g., when one of its file operations is called). In the example of removing a file, when namec calls the remove operation on the device for the file, if it's a deferred-walk file, the later device would issue both the walk and the remove RPCs and then receive the replies for them. That reduces the latency to that of a single RPC. Other access requests are performed in a similar way.
  4. We changed the mount driver so the later device could use it for sending requests and receving replies; that is,  the RPC logic is now split between sending and receiving.

That's enough as long as the kernel is concerned, but, it does not suffice. Think of a concurrent file server processing the requests issued by this new client. The request to remove a file identified after a walk has to wait in the server until the walk has been done (or has failed). Thus... 

  • We changed fossil to understand a variant of 9P, called 9P2000.ix, which takes requests using the same tag value as requests to be processed sequentially (but perhaps concurrently with requests using other tag values or from other connections).
The new mount driver issues a 9P2000.ix version request to file servers it mounts, and fall backs to 9P2000 if there's no 9pix server on the other end. But, if there is one, it does many things in a single round trip to the server.

The new fossil is trivially compatible with previous kernels, because they would never send two requests with the same tag value at the same time.

Another interesting side-effect is that all sequential servers (e.g., ramfs) are 9pix servers out of the box, but the version they provide must be changed to inform the client. 

We will write another post regarding how read and write streaming can be achieved along these lines, with few modifications to the protocol (mostly the one already stated) and with almost no modification to the user interface. The underlying idea is telling the kernel if a file is to be fully read or written and then using read ahead and asynchronous writes in the kernel to keep the stream flowing.

Sunday, April 7, 2013

How to get ".." right?

Long ago Rob Pike published a paper about how to get ".." right, and the idea was applied to Plan 9.

But I think I have a better way: remove ".." from the system. That does not mean that we can't use "..",  only that the kernel would remove any ".." from names before using those names as paths.

I have an experimental 9 kernel with a changed namec that

  1. cleans the path name given to it  (nothing new by now).
  2. If it starts with "..", 
    1. takes the (string) path from up->dot
    2. appends the name given to namec
    3. cleans the resulting path.

Point 1 gets rid of middle ".."'s in the path name, as done always in Plan 9.
Point 2 will make a leading ".." become an intermediate ".." and then get rid of it.

Not much is gained by doing it this way, but for code simplicity. I was able to remove several portions of code that did worry about "..". Moreover, if the experiment works as expected, I will be able to remove even more, because all the machinery in Chan's to walk back from mount points will no longer be needed: there will be no "walk back" in the system.

If you argue that this would require resolving longer paths when you use leading ".."'s, you are right; but hopefully, after a few more changes, the system might be a lot faster. Stay tuned.

By the way, in the course of the experiment I found that cleanname does not play well with kernel device paths...

How 9 speaks 9p

This is how the 9 kernel speaks 9p with the file server, in the common case of using a cache for program images so executables are cached.

Printing the current working directory

Let's start with pwd:


The shell calls exec to run the command, and as a result, the kernel calls namec to translate the  name "pwd" into a Chan structure. A series of requests and replies for Walk are issued to locate pwd in the current directory and then in /bin. Because /bin is a union mount, it is responsible for further requests and replies until pwd is located.

Twalk tag 7 fid 275 newfid 273 nwname 1 0:pwd 
Rerror tag 7 ename file does not exist
Twalk tag 7 fid 170 newfid 273 nwname 1 0:pwd 
Rerror tag 7 ename file does not exist
Twalk tag 7 fid 166 newfid 273 nwname 1 0:pwd 
Rerror tag 7 ename file does not exist
Twalk tag 7 fid 90 newfid 273 nwname 1 0:pwd 
Rwalk tag 7 nwqid 1 0:(00000000000001b3 35 ) 

Still within namec, the file is opened, because exec called with the Aopen flag.

Topen tag 7 fid 273 mode 3
Ropen tag 7 qid (00000000000001b3 35 ) iounit 8192 

Back in exec, the kernel reads the first few bytes to check out which kind of executable it is (binary? interpreted?) and read the executable file header. The file was already in the cache, so the Chan is discarded and the cache used.

Tread tag 7 fid 273 offset 0 count 32
Rread tag 7 count 32 '000001eb 00004ba3 0000095c ...'

Tclunk tag 7 fid 273
Rclunk tag 7

Now, after executing pwd, the program calls the C library function getwd, which opens "." and
calls fd2path to retrieve the path from the kernel's Chan for the process current working directory, and the descriptor is closed.

Twalk tag 7 fid 275 newfid 273 nwname 0 
Rwalk tag 7 nwqid 0 
Topen tag 7 fid 273 mode 0
Ropen tag 7 qid (0000000000002003 139 d) iounit 8192 
Tclunk tag 7 fid 273
Rclunk tag 7

This makes a total of 10 RPCs for printing the current working directory, and by design, each one must wait for the previous one to complete.

Changing the working directory

Let's execute now cd:

$ cd /sys/src

The shell's built-in command cd issues a chdir system call, which walks to the new directory (starting from the root directory in this case) and then clunks the previous Chan for "dot".

Twalk tag 7 fid 66 newfid 274 nwname 2 0:sys 1:src 
Rwalk tag 7 nwqid 2 0:(165b 21 d) 1:(0000000000002003 139 d) 
Tclunk tag 7 fid 275
Rclunk tag 7

This time it's 2 RPCs.

Listing a directory

Now we list the current working directory:

Like before, the shell calls exec (to execute ls this time), which leads to several RPCs:

Twalk tag 7 fid 274 newfid 273 nwname 1 0:ls 
Rerror tag 7 ename file does not exist
Twalk tag 7 fid 170 newfid 273 nwname 1 0:ls 
Rerror tag 7 ename file does not exist
Twalk tag 7 fid 166 newfid 273 nwname 1 0:ls 
Rerror tag 7 ename file does not exist
Twalk tag 7 fid 90 newfid 273 nwname 1 0:ls 
Rwalk tag 7 nwqid 1 0:(000000000000412e 65 ) 
Topen tag 7 fid 273 mode 3
Ropen tag 7 qid (000000000000412e 65 ) iounit 8192 
Tread tag 7 fid 273 offset 0 count 32
Rread tag 7 count 32 '000001eb 0000b29f 00001288 00002988...'
Tclunk tag 7 fid 273
Rclunk tag 7

Now it's the time for ls to list the directory. It stats the target to see which file it must list.
Twalk tag 7 fid 274 newfid 275 nwname 0 
Rwalk tag 7 nwqid 0 
Tstat tag 7 fid 275
Tclunk tag 7 fid 275
Rclunk tag 7

It then opens the target directory,

Twalk tag 7 fid 274 newfid 275 nwname 0 
Rwalk tag 7 nwqid 0 
Topen tag 7 fid 275 mode 0
Ropen tag 7 qid (0000000000002003 139 d) iounit 8192 

reads it all to print the entries,
Tread tag 7 fid 275 offset 0 count 8192
Rread tag 7 count 3373 '3a000000 00000000 80290000...'
Tread tag 7 fid 275 offset 3373 count 8192
Rread tag 7 count 0 ''

and closes the descriptor when done.
Tclunk tag 7 fid 275
Rclunk tag 7

In total, it required 15 RPCs.

Creating a file

We now create a file without writing anything into it.
$ >/tmp/x

This makes the shell issue a create system call, which plays a dance (within namec) like shown:
Twalk tag 7 fid 66 newfid 273 nwname 1 0:tmp 
Rwalk tag 7 nwqid 1 0:(0000000000003f73 0 d) 
Twalk tag 7 fid 172 newfid 275 nwname 1 0:x 
Rerror tag 7 ename file does not exist
Twalk tag 7 fid 172 newfid 275 nwname 0 
Rwalk tag 7 nwqid 0 
Tcreate tag 7 fid 275 name x perm --rw-rw-rw- mode 1
Rcreate tag 7 qid (000000000000dd06 0 ) iounit 8192 
Tclunk tag 7 fid 273
Rclunk tag 7
Tclunk tag 7 fid 275
Rclunk tag 7

It required 6 RPCs.

Wednesday, April 3, 2013

9p + ix = 9pix

More experimentation with ix leaded to attempts to put it in a plan 9 kernel.
To make a start, we changed fossil to accept groups of requests.
A group is just a series of requests sharing the same tag, like in ix.

A new mount driver is on the works, along with experimental interfaces for
the rest of the kernel to issue calls.

One result is that it seems easier if the kernel forwards Call structures
instead of using procedural interfaces.

Another is that it seems that to exploit streaming and concurrency the calls
require using futures or promises. For example, a write call returns a
handle that can be used later to check out the results from the write.

Using this it is easier to, for example, issue all the clone requests at once
when forking a name space, and then collect all the replies. Or, to issue
a sequence of walk, open, read requests to pull data into a cache.

The plan is to make this variant, called 9pix, evolve up to a compromise
point where we get the best of ix while we keep as compatible with 9p as

More on this when we actually get the toy working...