diff options
author | Jacob Moody <moody@posixcafe.org> | 2023-04-28 07:11:37 +0000 |
---|---|---|
committer | Jacob Moody <moody@posixcafe.org> | 2023-04-28 07:11:37 +0000 |
commit | 83b569cbc7e772ec454f87ccd71104b23faebbc8 (patch) | |
tree | b445621b08bbd0dfd9b00715bc8266b66d6d116d | |
parent | bd2ef36d976db0a842e31af309f0ef1439f30a89 (diff) |
/sys/doc: "Namespaces as Security Domains"
-rw-r--r-- | sys/doc/mkfile | 1 | ||||
-rw-r--r-- | sys/doc/nssec.ms | 414 |
2 files changed, 415 insertions, 0 deletions
diff --git a/sys/doc/mkfile b/sys/doc/mkfile index 54cbf648f..913e8761e 100644 --- a/sys/doc/mkfile +++ b/sys/doc/mkfile @@ -41,6 +41,7 @@ ALL=\ port\ colophon\ nupas/nupas\ + nssec\ ALLPS=${ALL:%=%.ps} HTML=${ALL:%=%.html} release3.html release4.html diff --git a/sys/doc/nssec.ms b/sys/doc/nssec.ms new file mode 100644 index 000000000..c35fd78bb --- /dev/null +++ b/sys/doc/nssec.ms @@ -0,0 +1,414 @@ +.HTML "Namespaces as Security Domains" +.TL +Namespaces as Security Domains +.AU +Jacob Moody +.AB +We aim to explore the use of Plan 9 namespaces +as ways of building isolated processes. We present +here code for increasing the ability and granularity +for which a process may isolate itself from others +on the system. +.AE +.SH +Introduction +.PP +.FS +First presented in a slightly different form at the 9th International Workshop on Plan 9. +.FE +.LP +In a Plan 9 system the kernel exposes hardware and system +interfaces through a myriad of filesystem trees. These trees, or +sharp devices, replace the functionality of many would be system calls +through use of standard file system operations. A standard Plan 9 environment +is comprised of a composition of these individual devices together, the collection +of such being the processes namespace. +.LP +With these principles it is quite easy for a process to build a slim namespace using only +what it may need for operation. This could be done in service to reduce the "blast radius" +of awry or malicious code to some effect. But to be fully effective a process must also be able +to remove the ability to bootstrap these capabilities back. We will explore different ways of +building isolated namespaces, their pitfalls, ways to address those issues, along with new solutions. +.SH +Outside World +.LP +There have been many solutions for sandboxing within the UNIX™ +world. There are more classical approaches such as +.CW zones , +[Price04] and +.CW jails , +that all provide an abstraction of building some number of +smaller full unix boxes out of a single physical host. However these +interfaces are presented more as a systems management tool, the mechanisms +for which an administrator creates and manages these resources is unergonomic +to use on a per-process basis. Instead it seems more the fashion now to isolate +specific pieces of the system, and expect it possible that each process on the system +may choose to manage its environment. The most successful execution of this idea in the +wild is the OpenBSD project's +.CW unveil +and +.CW pledge +[Beck18] system calls, allowing a processes to cut off specific parts of the filesystem or +system call interfaces. Linux namespaces [Biederman06] implement this idea by allowing a process +to fork off private versions of specific global resources. In both these cases the sandboxing +of a process is through gradual steps, removing potentially dangerous tools one by one. +.SH +Existing Work +.LP +Let us first define the resources we are restricting access to. The aforementioned gradual solutions +provide ways in which a process can remove itself from specific kernel interfaces. In plan9 the kernel +exposes almost all of its functionality through individual filesystems. These devices are accessed +globally by prefixing a path with a sharp('#'), and have conventional places they are bound within the +namespace. +.LP +A processes namespace in plan9 is typically constructed using a namespace file. These files +are a collection of namespace operations formatted as one would expect to see them in a shell script. +They typically begin by binding in some number of sharp devices in to their expected location. +.P1 +bind #d /fd +bind -c #e /env +bind #p /proc +bind -c #s /srv +.P2 +Then using the globals provided, in particular /srv, to bring in the rest of the root filesystem. +A process can at any point choose to construct itself a new namespace, but it must do so when changing +users. This is done in part to ensure that each filesystem that the program would like to use has +their chance to authenticate and be notified. Because this information is only exchanged on attach, +the new user must construct a namespace from scratch. +.LP +Many programs, like network services, wish to drop their current user and become the special user +.CW none +user on startup, and in doing so must rebuild their namespace. The conventional default namespace +files used is /lib/namespace, but most programs allow the user to specify an alternative with a +flag. It is here that we already can approximate a chroot style environment by changing the root +filesystem used in a namespace file. +.P1 +bind #s /srv +mount /srv/myboot /root +bind -a /root / +.P2 +By having another filesystem exposed in /srv/myboot and modifying the provided namespace file, +we've allowed this process to work within an entirely separate root filesystem. +.SH +RFNOMNT +.LP +The issue in using these namespaces as security barriers is that there is nothing preventing +a process from bootstrapping a resource back. While our example code places a different root filesystem +in the namespace, nothing is preventing that process or its children from potentially rebootstrapping +the real root filesystem back. For this issue there is a special rfork flag +.CW RFNOMNT +the prevents a process from accessing any almost any sharp device of consequence. This is done by +preventing a process from walking to a device by its location within '#'. This allows existing +binds of resources to continue working within the namespace but restricts a process from binding +in new resources from the kernel. +.LP +While effective we found this to be too large a hammer in practice. Doing as its name implies +.CW RFNOMNT +also prevents a process from performing any mounts or binds. This in practice creates a single +point in time in which a process gives up all of its control, instead of the idealized gradual +process. This makes it quite hard to make use of in practice, only a singly program in a chain +may be the one to invoke +.CW RFNOMNT +or must hope that no other program further in the chain may want to make use of its namespace. +The interface itself feels very clunky, there is a nice gradual addition of these kernel devices +to the namespace why must the removal be all at once? +.SH +Chdev +.LP +We propose a new write interface through /dev/drivers +that functionally replaces +.CW RFNOMNT . +/dev/drivers now accepts writes in the form of +.P1 +chdev op devmask +.P2 +Devmask is a string of sharp device characters. Op specifies how +devmask is interpreted. Op is one of +.TS +lw(1i) lw(4.5i). +\f(CW&\fP T{ +Permit access to just the devices specified in devmask. +T} +\f(CW&~\fP T{ +Permit access to all but the devices specified in devmask. +T} +\f(CW~\fP T{ +Remove access to all devices. Devmask is ignored. +T} +.TE +.LP +This allows a process to selectively remove access to +sections of sharp devices with quite a bit of control. +In order to mimic all of +.CW RFNOMNT 's +features, removing access to +.CW devmnt , +which is not normally accessible directly, +disables the processes ability to perform mount +and bind operations. +.LP +For the implementation, we extended the existing +.CW RFNOMNT +flag attached to the process namespace group +in to a bit vector. Each bit representing a index +into +.CW devtab . +The following function illustrates how this vector is set. +.P1 +void +devmask(Pgrp *pgrp, int invert, char *devs) +{ + int i, t, w; + char *p; + Rune r; + u64int mask[nelem(pgrp->notallowed)]; + + if(invert) + memset(mask, 0xFF, sizeof mask); + else + memset(mask, 0, sizeof mask); + + w = sizeof mask[0] * 8; + for(p = devs; *p != 0;){ + p += chartorune(&r, p); + t = devno(r, 1); + if(t == -1) + continue; + if(invert) + mask[t/w] &= ~(1<<t%w); + else + mask[t/w] |= 1<<t%w; + } + + wlock(&pgrp->ns); + for(i=0; i < nelem(pgrp->notallowed); i++) + pgrp->notallowed[i] |= mask[i]; + wunlock(&pgrp->ns); +} +.P2 +Devmask is called from the write handler for /dev/drivers. This +bitmask is then consulted any time a name is resolved that begins +with '#'. This is done from within the +.CW namec () +function using the following function to check +if a particular device +.CW r +is permitted. +.P1 +int +devallowed(Pgrp *pgrp, int r) +{ + int t, w, b; + + t = devno(r, 1); + if(t == -1) + return 0; + + w = sizeof(u64int) * 8; + rlock(&pgrp->ns); + b = !(pgrp->notallowed[t/w] & 1<<t%w); + runlock(&pgrp->ns); + return b; +} +.P2 +.LP +We found that once removal is made a core verb of these sharp +devices it becomes easy to start to view access to them +as capabilities. This is aided by system functionally already neatly +organized in to the various devices themselves. For example, one could +say a process is capable of accessing the broader internet if it has access +to the +.CW devip +device. This access can either be direct via it's path under '#' or through a +location in the namespace where this device had already been bound. With these +changes, the entire capability list of a process is on display through just its +/proc/$pid/ns file. This +.CW ns +file would indicate if a particular device is bound and now also includes +the list of devices a process has access to. +.LP +In practice, this results in a pattern of binding +in a sharp device, making use of them and removing +them when no longer needed. A namespace file for +a web server could now look like +.P1 +bind #s /srv +# /srv/www created by srvfs www /lib/www +mount /srv/www /lib/ +unmount /srv +chdev -r s # chdev &~ s +.P2 +In this example we have created a new root for the process by +using exportfs to expose a little piece of the boot namespace. +We unmount +.CW devsrv +and remove access to it with +.CW chdev +ensuring there is no way for our process to talk to the real +.CW /srv/boot . +This provides a nice succinct lifetime of access to +.CW devsrv +and makes the removal of these sharp devices as easy as +it is to use them in the first place. +.LP +Like +.CW RFNOMNT , +.CW chdev +does not restrict access to sharp devices that had already been mounted. +This allows a process to use a subsection or only one piece of +sharp devices as well. One example of this may be to restrict a process +to just a single network stack +.P1 +bind '#I1' /net +chdev -r I +.P2 +.SH +/srv/clone +.LP +With this +.CW chdev +mechanism, the ability for a device to provide isolation of its +own became more powerful. Partially illustrated in the previous +.CW devip +example. +.CW Devsrv , +the sharp device providing named pipes, was an ideal target for +adding isolation. Devsrv provides a bulletin board of all posted 9p services +for a given host. We wanted to provide a mechanism for a process, or +family tree of process to share a private +.CW devsrv +between themselves. +.LP +The design for this was borrowed from devip, one in which a process opens a +.CW clone +file to read its newly allocated slot number. This new 'board' appears as a sibling directory +to the +.CW clone +it was spawned from. This new board is itself a fully functioning +.CW devsrv +with its own clone file, making nesting to full trees of +.CW srvs +quite easy, and completely transparent. The following illustrates +how one could replace their global +.CW /srv +with a freshly allocated one. +.P1 +</srv/clone { + s='/srv/'^`{read} + bind -c $s /srv + exec p +} +.P2 +Also like devip, once the last reference to the file descriptor returned by opening +.CW clone +is closed the board is closed and posters to that board receive an EOF. It is important +to bake this kind of ownership in to the design, as self referential users of +.CW /srv +are quite common in current code. +.LP +This along with chdev can be used to create a sandbox for /srv quite easily, +the process allocates itself a new /srv then removes access to the global +root srv. This allows potentially untrusted process to still make use of the interface +without needing to worry about their access to the global state. The practice of having +new boards appear as subdirectories allows the entire state to easily be seen by inspecting the +root of devsrv itself. +.SH +Restricting Within a Mount +.LP +As shown earlier with the use of +.CW srvfs , +an intermediate file server can be used to only service a small subsection of a larger +namespace. In that example we used this to expose only /lib/www from the host to processes +running a web server. This can be limited as the invocation of +.CW exportfs +can become more complicated if the user wishes to use multiple pieces from completely +separate places within the file tree. To address this a utility program +.CW protofs +was written to easily create convincing mimics of the filesystem it was run from. +.CW protofs +accepts a +.CW proto +file, a text file containing a description of file tree, and uses it to provide +dummy files mimicking the structure. These dummies can then be used by a process as targets +for bind mounts of its current namespace, providing the illusion of trimming all but select +pieces. This new root can not be simply bound over the real one, that still allows an unmount +to escape back to the real system but rexporting the namespace still works. To illustrate a +more involved setup then before. +.P1 +# We want to provide our web server +# with /bin, /lib/www and /lib/git +; cat >>/tmp/proto <<. +bin d775 +lib d775 + www d775 + git d775 +. +; protofs -m /mnt/proto /tmp/prot +; bind /bin /mnt/proto/bin +; bind /lib/www /mnt/proto/lib/www +; bind /lib/git /mnt/proto/lib/git +# A private srv could be used, omitted for brevity +; srvfs webbox /mnt/proto +# Namespace file for using our new mini-root +; cat >>/tmp/ns <<. +mount #s/webbox /root +bind -b /root / +chdev -r s +. +; auth/newns -n /tmp/ns ls / +bin +lib +; +.P2 +.SH +Future Work +.LP +While we think these bring us closer to namespaces as security boundaries, +there is still plenty of work and understanding to be done. One particular +item of interest is attempting some kind of isolation of +.CW devproc , +possibly in a similar fashion to the +.CW /srv/clone +implementation, but attempts have yet to be made. The exact nature of +.CW namespace +files and how they relate to sandboxing as a whole has yet to be fully +worked out. There is clear potential, but it is likely additional abilities may +be required. It is somewhat difficult to synthesize a namespace entirely +from nothing, which is something we found ourselves reaching for when building +alternative roots to run processes within. There is potential for some merger +of +.CW proto +and +.CW namespace +files to provide a template of the current namespace to graft on to the next one. +.LP +Both +.CW chdev +and +.CW /srv/clone +are merged into 9front and their implementations are freely available as part of the base system. +.SH REFERENCES +.LP +[Beck18] +Bob Beck, +``Pledge, and Unveil, in OpenBSD'', +.I "BSDCan Slides" +Ottawa, +July, 2018. +.LP +[Price04] +Daniel Price, +Andrew Tucker, +``Solaris Zones: Operating System Support for Consolidating Commercial Workloads'', +.I "Proceedings of the 18th Large Installation System Administration Conference" +pp. 241-254, +Atlanta, +November, 2004. +.LP +[Biederman06] +Eric W. Biederman +``Multiple Instances of the Global Linux Namespaces'', +.I "Proceedings of the 2006 Linux Symposium Volume One" +pp. 102-112, +Ottawa, Ontario +July, 2006. |