Wednesday, April 24, 2013

Plan 9, mount tables, RPCs, and streaming

To be able to reduce the number of RPCs it is not enough to have a protocol capable of doing that. The mount table and how the kernel uses files has to change a little bit. On the other hand, it is not enough to have the kernel prepared for the task, the protocol and the server must cooperate. 

Plan 9 system calls that take a file name end up calling namec within the kernel. This function takes a name and returns a Chan, representing a file in use by the kernel (or the processes it runs). Now, to resolve the name, namec starts with the process slash or dot file, and tries to walk the given path from there. This requires an RPC to the file server (or device) providing the (slash or dot) file.

After having the list of file identifiers (qids) for the files known by the server for that walk, each element is checked out in the mount table to see if the walking should continue at a different (mounted) file.

This means that namec requires:
  • Issuing an RPC to the server for the starting point of the file path
  • Lookup the first mount point along the way
  • Repeat this process starting with the mounted file (if any) and remaining elements of the path.
When the final file is found, the resulting Chan is returned to the kernel routine calling namec.

Consider for example remove(2) or stat(2), the former is mostly:
  • call namec to get the Chan
  • issue a remove RPC to the server providing the file.
the latter looks like:
  • call namec to get the Chan
  • issue a stat RPC to the server
  • issue a clunk RPC to the server to release the Chan.
If the file is after a mounted point,  namec would require multiple RPCs, each with a RTT to a server. In other cases (like when opening a file, namec has to issue further RPCs, e.g., to open it).


In principle, to remove a file, you could do it in a single RPC RTT to the server, but the mount table, name resolution, and the implementation of remove in the kernel conspire to make you require multiple RPCs (and RTTs).

To reduce the RTT for operations like opening a file, or removing it, or ..., we have done this:

  1. We replaced the mount table with a name based mount table. It lets the kernel travel along a tree of names to determine to which server  it should talk and which file to use to start walking.  The mount table still has unions and its semantics are close to the standard mount table. But, with this table, you do not require RPCs to any server to reach the actual file where walking should start to reach the file. Plus, if there are multiple mount points along the way, you do not disturb (or wait for) the servers early in the path.
  2. We rewrote namec to exploit the new mount table, and to accept new ``access modes''. After looking for the longest prefix for the path required in the mount table, it issues the RPCs required to honor the access mode (e.g., to open a file, or to remove it, or to stat it).
  3. We added a new driver, devlater,  which is used during name resolution to defer walks. If the file to be walked is provided by the mount driver (is a remote or user-level file), and can be cached, the Chan returned from the file pretends to have walked to the file without walking at all. Such deferred channels are provided by the later device, which would do the walk only when it knows what should be done with the file (e.g., when one of its file operations is called). In the example of removing a file, when namec calls the remove operation on the device for the file, if it's a deferred-walk file, the later device would issue both the walk and the remove RPCs and then receive the replies for them. That reduces the latency to that of a single RPC. Other access requests are performed in a similar way.
  4. We changed the mount driver so the later device could use it for sending requests and receving replies; that is,  the RPC logic is now split between sending and receiving.

That's enough as long as the kernel is concerned, but, it does not suffice. Think of a concurrent file server processing the requests issued by this new client. The request to remove a file identified after a walk has to wait in the server until the walk has been done (or has failed). Thus... 

  • We changed fossil to understand a variant of 9P, called 9P2000.ix, which takes requests using the same tag value as requests to be processed sequentially (but perhaps concurrently with requests using other tag values or from other connections).
The new mount driver issues a 9P2000.ix version request to file servers it mounts, and fall backs to 9P2000 if there's no 9pix server on the other end. But, if there is one, it does many things in a single round trip to the server.

The new fossil is trivially compatible with previous kernels, because they would never send two requests with the same tag value at the same time.

Another interesting side-effect is that all sequential servers (e.g., ramfs) are 9pix servers out of the box, but the version they provide must be changed to inform the client. 

We will write another post regarding how read and write streaming can be achieved along these lines, with few modifications to the protocol (mostly the one already stated) and with almost no modification to the user interface. The underlying idea is telling the kernel if a file is to be fully read or written and then using read ahead and asynchronous writes in the kernel to keep the stream flowing.