summaryrefslogtreecommitdiff
path: root/sys/doc/fossil.ms
diff options
context:
space:
mode:
authoraiju <aiju@phicode.de>2011-07-18 11:01:22 +0200
committeraiju <aiju@phicode.de>2011-07-18 11:01:22 +0200
commit8c4c1f39f4e369d7c590c9d119f1150a2215e56d (patch)
treecd430740860183fc01de1bc1ddb216ceff1f7173 /sys/doc/fossil.ms
parent11bf57fb2ceb999e314cfbe27a4e123bf846d2c8 (diff)
added /sys/doc
Diffstat (limited to 'sys/doc/fossil.ms')
-rw-r--r--sys/doc/fossil.ms1163
1 files changed, 1163 insertions, 0 deletions
diff --git a/sys/doc/fossil.ms b/sys/doc/fossil.ms
new file mode 100644
index 000000000..ca9b783fd
--- /dev/null
+++ b/sys/doc/fossil.ms
@@ -0,0 +1,1163 @@
+.HTML "Fossil, an Archival File Server
+... .FP times
+... .fp 1 R R.nomath
+... .fp 5 CW LucidaSansCW83
+.TL
+Fossil, an Archival File Server
+.AU
+Sean Quinlan
+Jim McKie
+Russ Cox
+jmk,rsc@plan9.bell-labs.com
+.AB
+This paper describes the internals and
+operation of Fossil, an archival file server built for Plan 9.
+Fossil has not yet replaced the current Plan 9 file server
+and
+.CW kfs ,
+but that is our eventual intent.
+Both fossil and this documentation are
+works in progress. Comments on either are most welcome.
+.AE
+.de HP
+.LP
+..
+.NH 1
+Introduction
+.HP
+Fossil is an archival file server built for Plan 9.
+In a typical configuration, it maintains a traditional file system
+in a local disk partition and periodically archives snapshots of the file system
+to a Venti server. These archives are made available through
+a file system interface.
+Fossil can also be run without a Venti server, in which case the
+snapshots (if any) occupy local disk space.
+.PP
+The bulk of this paper explains the underlying data structures:
+Venti trees, the Venti archival file system format, and finally Fossil's
+file system format.
+The end of the paper discusses the architecture of the Fossil server.
+.PP
+The presentation of the data structures is very detailed, perhaps
+too detailed for most readers.
+The intent is to record all the details necessary to make structural
+changes to the file system format.
+Feel free to jump ahead when boredom strikes.
+.NH 1
+Venti trees and directory hierarchies
+.HP
+Venti [3] is an archival block storage server.
+Once a block is stored, it can be retrieved by presenting the 20-byte
+SHA1 hash of its contents, called a
+.I score .
+Blocks on Venti have a maximum length of about 56 kilobytes,
+though in practice smaller blocks are used.
+To store a byte stream of arbitrary length, Venti uses a hash tree.
+Conceptually, the data stream is broken into fixed-size (say,
+.I dsize -byte)
+chunks, which are stored on the Venti server.
+The resulting scores are concatenated into a new pointer stream, which is
+broken into fixed size (say,
+.I psize -byte)
+chunks, which are stored on the Venti server.
+.I Psize "" (
+is different from
+.I dsize
+so that we can ensure that each pointer block holds an
+integral number of pointers.)
+This yields a new pointer stream, and so on, until there is a single block
+and finally a single score describing the entire tree.
+The resulting structure looks like:
+.PS
+.ps 8
+.vs 10
+boxht=0.1
+boxwid=0.1
+
+B0: box invis wid 1 "\f(CWVtDataType\fP"
+move right 0.1
+L0a: box wid 0.2
+move right 0.1
+L0b: box wid 0.2
+move right 0.1
+L0c: box invis wid 0.2 "..."
+move right 0.1
+
+L0d: box wid 0.2
+move right 0.1
+L0e: box wid 0.2
+move right 0.2
+L0f: box invis wid 0.2 "..."
+move right 0.2
+
+L0g: box wid 0.2
+move right 0.1
+L0h: box wid 0.2
+move right 0.1
+L0i: box invis wid 0.2 "..."
+move right 0.1
+
+L0j: box wid 0.2
+move right 0.1
+L0k: box wid 0.2
+move right 0.1
+L0l: box invis wid 0.2 "..."
+move right 0.1
+L0m: box wid 0.2
+
+define boxppddd {
+ line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
+ line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
+ X: box invis at 0.1<$1.nw,$1.ne>
+ Y: box invis at 0.1<$1.sw,$1.se>
+ line -> from 0.5<X,Y> to $2.nw
+ X: box invis at 0.3<$1.nw,$1.ne>
+ Y: box invis at 0.3<$1.sw,$1.se>
+ line -> from 0.5<X,Y> to $3.nw
+ "..." at 0.7<$1.w,$1.e>
+}
+
+define boxppdddp {
+ line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
+ line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
+ line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
+ X: box invis at 0.1<$1.nw,$1.ne>
+ Y: box invis at 0.1<$1.sw,$1.se>
+ line -> from 0.5<X,Y> to $2.nw
+ X: box invis at 0.3<$1.nw,$1.ne>
+ Y: box invis at 0.3<$1.sw,$1.se>
+ line -> from 0.5<X,Y> to $3.nw
+ "..." at 0.6<$1.w,$1.e>
+ X: box invis at 0.9<$1.nw,$1.ne>
+ Y: box invis at 0.9<$1.sw,$1.se>
+ line -> from 0.5<X,Y> to $4.nw
+}
+
+define boxpdddp {
+ line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
+ line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
+ X: box invis at 0.1<$1.nw,$1.ne>
+ Y: box invis at 0.1<$1.sw,$1.se>
+ line -> from 0.5<X,Y> to $2.nw
+ "..." at 0.5<$1.w,$1.e>
+ X: box invis at 0.9<$1.nw,$1.ne>
+ Y: box invis at 0.9<$1.sw,$1.se>
+ line -> from 0.5<X,Y> to $3.nw
+}
+
+bhd=0.4
+L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd)
+boxppddd(L1abc, L0a, L0b)
+L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd)
+boxppddd(L1def, L0d, L0e)
+L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd)
+boxppddd(L1ghi, L0g, L0h)
+L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd)
+boxppdddp(L1jklm, L0j, L0k, L0m)
+B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd)
+
+L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd)
+boxppddd(L2abcdef, L1abc, L1def)
+L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd)
+boxpdddp(L2ghijklm, L1ghi, L1jklm)
+B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd)
+
+L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd)
+boxpdddp(L3atom, L2abcdef, L2ghijklm)
+B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd)
+.PE
+.LP
+The leaves are the original data stream. Those blocks have type
+.CW VtDataType .
+The first pointer stream has type
+.CW VtPointerType0 ,
+the next has type
+.CW VtPointerType1 ,
+and so on.
+The figure ends with a single block of type
+.CW VtPointerType2 ,
+but in general trees can have height up to
+.CW VtPointerType6 .
+For a
+.I dsize
+of 8192 bytes
+and
+.I psize
+of 8180 bytes (409 pointers),
+this gives a maximum stream size of approximately 10 zettabytes
+(2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes).
+.PP
+Data blocks are truncated to remove trailing runs of zeros before
+storage to Venti; they are zero-filled back to
+.I dsize
+bytes after retrieval from Venti.
+Similarly, trailing runs of pointers to zero-length blocks are
+removed from and added back to pointer blocks.
+These simple rules happen to make it particularly efficient to store
+large runs of zeros, as might occur in a data stream with ``holes:''
+the zero-length block itself can be interpreted as a tree of any
+depth encoding an all-zero data stream.
+.PP
+Reconstructing the data stream requires the score and type of the
+topmost block in the tree, the data chunk size, the pointer chunk size,
+and the data stream size.
+(From the data stream size and the chunk sizes we could derive the
+depth of the tree and thus the type of the topmost block, but it is convenient
+to allow trees that are deeper than necessary.)
+This information is kept in a 40-byte structure called a
+.CW VtEntry :
+.P1
+VtEntry:
+.ta +\w' 'u +\w' 'u
+ gen[4] \fRgeneration number\fP
+ psize[2] \fRsize of pointer blocks\fP
+ dsize[2] \fRsize of data blocks\fP
+ flags[1]
+ zero[5]
+ size[6] \fRlength of file\fP
+ score[20] \fRscore of root block in tree\fP
+.P2
+(In this notation,
+.CW name[sz]
+indicates a
+.CW sz -byte
+field called
+.CW name .
+Integers are stored in big-endian order.
+.CW Size
+really is a 48-bit field.)
+.CW Flags
+is made up of the following bit fields.
+.P1
+.ta +\w' 'u +\w' 'u
+0x01 VtEntryActive \fRentry is allocated\fP
+0x02 VtEntryDir \fRentry describes a Venti directory (q.v.)\fP
+0x1C VtEntryDepthMask \fRmask for tree depth\fP
+0x20 VtEntryLocal \fRreserved (q.v.)\fP
+.P2
+.LP
+The depth of the described tree is stored in the 3 bits indicated:
+a tree with a topmost node of type
+.CW VtPointerType3
+has depth 4.
+.PP
+With
+.CW VtEntry
+we can build more complicated data structures,
+ones with multiple or nested data streams.
+A data stream consisting of
+.CW VtEntry
+structures is called a Venti directory.
+It is identical in structure to the Venti data stream
+we described earlier except that the bottom-level type is
+.CW VtDirType ,
+and
+the
+.CW VtEntry
+describing a Venti directory has the
+.CW VtEntryDir
+flag bit set.
+The
+.I dsize
+for a Venti directory
+is a multiple of 40 so that each data chunk holds
+an integer number of
+.CW VtEntry
+structures.
+By analogy with Venti directories,
+we call the original data stream a
+Venti file.
+Note that Venti files are assumed
+.I not
+to contain pointers to other Venti blocks.
+The only pointers to Venti blocks occur in
+.CW VtEntry
+structures in
+Venti directories
+(and in the internal hash tree structure of the
+individual files and directories).
+Note also that these directories are nothing more than pointer lists.
+In particular, there are no names or metadata like in a file system.
+.PP
+To make it easier to pass hierarchies between applications,
+the root of a hierarchy is described in a 300-byte structure
+called a
+.CW VtRoot :
+.P1
+VtRoot:
+.ta +\w' 'u +\w' 'u
+ version[2] \f(CW2\fP
+ name[128] \fRname of structure (just a comment)\fP
+ type[128] \fRstring describing structure (\f(CWvac\fR)\f(CW
+ score[20] \fRpointer to \f(CWVtDirType\fP block\f(CW
+ blockSize[2] \fRmaximum block size in structure\fP
+ prev[20] \fRprevious \f(CWVtRoot\fP in chain, if any\fP
+.P2
+.LP
+This structure is stored to Venti and its score is passed
+between applications, typically in the form
+``\fItype\f(CW:\fIrootscore\fR,''
+where
+.I type
+is the type field from the
+.CW VtRoot
+structure, and
+.I rootscore
+is the score of the
+.CW VtRoot
+block.
+.CW VtRoot
+structures can be chained together using the
+.I prev
+field to encode an archival history
+of the data structure.
+.PP
+For example, a small Venti hierarchy might look like:
+.PS
+.ps 8
+.vs 10
+boxwid=0.1
+boxht=0.1
+f=0.9
+mb=0.16
+
+VtRoot: [
+ right
+ B1: box
+ move right 0.1
+ "\f(CWVtRoot\fP" ljust
+]
+
+Root: [
+ right
+ B1: box fill f
+ B2: box fill f
+ B3: box fill f
+ move right 0.1
+] with .nw at VtRoot.sw+(0.2,-.1)
+Level1: [
+ RootMeta: [
+ box wid mb
+ ]
+ MetaSource: [
+ right
+ B1: box wid 5*mb
+ ] with .nw at RootMeta.sw+(0,-.1)
+
+ Source: [
+ right
+ B1: box fill f
+ B2: box fill f
+ B3: box fill f
+ B4: box fill f
+ B5: box fill f
+ B6: box fill f
+ B7: box fill f
+ B8: box fill f
+ ] with .nw at MetaSource.sw+(0,-.1)
+ SB1: box invis at Source.B1
+ SB2: box invis at Source.B2
+ SB3: box invis at Source.B3
+] with .nw at Root.sw+(0.4,-.1)
+Level2: [
+ MetaSource: [
+ right
+ B1: box wid 5*mb
+ ]
+ Source: [
+ right
+ B1: box fill f
+ B2: box fill f
+ B3: box fill f
+ B4: box fill f
+ B5: box fill f
+ B6: box fill f
+ B7: box fill f
+ B8: box fill f
+ ] with .nw at MetaSource.sw+(0,-.1)
+ File: box wid 0.8 with .nw at Source.sw+(0,-.1)
+] with .nw at Level1.sw+(0.6,-.1)
+
+line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w
+line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w
+line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w
+line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w
+
+line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w
+line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w
+line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w
+
+[
+ KEY: box wid 1.5 invis "Key"
+ line from KEY.sw to KEY.se
+ k = -.1
+ kk=0.5
+ A: [
+ box wid 4*boxwid
+ "Venti file" ljust with .w at last box .w+(kk,0)
+ ] with .nw at KEY.sw+(0,2*k)
+ B: [
+ box fill f
+ "Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0)
+ ] with .nw at A.sw+(0,k)
+ C: [
+ right
+ CC: box fill f
+ box fill f
+ box fill f
+ box fill f
+ "Venti directory" ljust with .w at CC.w+(kk,0)
+ ] with .nw at B.sw+(0,k)
+ D: [
+ line -> right 3*boxwid
+ "Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
+ ] with .nw at C.sw+(0,k)
+] with .nw at VtRoot.nw+(3,0)
+.PE
+.LP
+Venti files are shown as white boxes, while directories are shown
+as shaded boxes. Each shaded square represents a
+.CW VtEntry .
+Arrows represent pointers from
+.CW VtEntry
+structures to other
+Venti files or directories.
+.PP
+The hierarchical structure provided by Venti files and directories
+can be used as the base for more complicated data structures.
+Because this structure captures all the information
+about pointers to other blocks, tools written to traverse
+Venti hierarchies can traverse the more complicated
+data structures as well.
+For example,
+.I venti/copy
+(see
+.I venti (1))
+copies a Venti hierarchy from one Venti server to another,
+given the root
+.CW VtEntry .
+Because the traditional file system described in later sections is
+layered on a Venti hierarchy,
+.I venti/copy
+can copy it without fully understanding its structure.
+.NH 1
+Vac file system format
+.HP
+The Venti archive format
+.I vac
+builds a traditional file system using a Venti hierarchy.
+Each vac file is implemented as a Venti file;
+each vac directory is implemented as a Venti
+directory and a Venti file to provide traditional file system metadata.
+The metadata is stored in a structure called a
+.CW DirEntry :
+.P1
+DirEntry:
+.ta +\w' 'u +\w' 'u
+ magic[4] \f(CW0x1c4d9072\fP (DirMagic)\fP
+ version[2] \f(CW9\fP
+ elem[s] \fRname (final path element only)\fP
+ entry[4] \fRentry number for Venti file or directory\fP
+ gen[4] \fRgeneration number\fP
+ mentry[4] \fRentry number for Venti file holding metadata\fP
+ mgen[4] \fRgeneration number\fP
+ qid[8] \fRunique file serial number\fP
+ uid[s] \fRowner\fP
+ gid[s] \fRgroup\fP
+ mid[s] \fRlast modified by\fP
+ mtime[4] \fRlast modification time\fP
+ ctime[4] \fRcreation time\fP
+ atime[4] \fRlast access time\fP
+ mode[4] \fRmode bits\fP
+.P2
+The notation
+.CW name[s]
+denotes a string stored as a two-byte length
+and then that many bytes.
+The above describes Version 9 of the
+.CW DirEntry
+format. Versions 7 and 8 are very similar; they can be
+read by the current
+.I vac
+source code but are not written.
+Earlier versions were not widespread.
+A
+.CW DirEntry
+may be followed by optional extension sections, though none
+are currently used.
+The
+.CW mode
+bits include bits commonly used by
+Unix and Windows, in addition to those used by Plan 9.
+.PP
+The
+.CW entry
+field is an index into the parallel Venti directory.
+The
+.CW gen
+field must match the
+.CW gen
+field in the corresponding
+.CW VtEntry
+in the directory;
+it is used to detect
+stale indices.
+Similarly,
+.CW mentry
+and
+.CW mgen
+are the index and generation number
+for the metadata Venti file,
+if the
+.CW DirEntry
+describes a vac directory.
+.PP
+The relation between Venti files and directories and
+vac files and directories can be seen in this figure:
+.PS
+.ps 8
+.vs 10
+boxwid=0.1
+boxht=0.1
+f=0.9
+mb=0.16
+
+VtRoot: [
+ right
+ B1: box
+ move right 0.1
+ "\f(CWVtRoot\fP" ljust
+]
+
+SuperRoot: [
+ right
+ B1: box fill f
+ move right 0.1
+ "fs root block" ljust
+] with .nw at VtRoot.sw + (0.2, -.2)
+Root: [
+ right
+ B1: box fill f
+ B2: box fill f
+ B3: box fill f
+ move right 0.1
+ "root directory info block" ljust
+] with .nw at SuperRoot.sw+(0.2, -.2)
+Level1: [
+ RootMeta: [
+ box wid mb
+ move right 0.1
+ "root metadata" ljust
+ ]
+ MetaSource: [
+ right
+ B1: box wid mb
+ B2: box wid mb
+ B3: box wid mb
+ B4: box wid mb
+ B5: box wid mb
+ ] with .nw at RootMeta.sw+(0,-.2)
+ MB1: box wid mb invis at MetaSource.B1
+ MB2: box wid mb invis at MetaSource.B2
+ MB3: box wid mb invis at MetaSource.B3
+ MB4: box wid mb invis at MetaSource.B4
+ MB5: box wid mb invis at MetaSource.B5
+
+ Source: [
+ right
+ B1: box fill f
+ B2: box fill f
+ B3: box fill f
+ B4: box fill f
+ B5: box fill f
+ B6: box fill f
+ B7: box fill f
+ B8: box fill f
+ ] with .nw at MetaSource.sw+(0,-.1)
+ SB1: box invis at Source.B1
+ SB2: box invis at Source.B2
+ SB3: box invis at Source.B3
+ SB4: box invis at Source.B4
+ SB5: box invis at Source.B5
+ SB6: box invis at Source.B6
+ SB7: box invis at Source.B7
+ SB8: box invis at Source.B8
+] with .nw at Root.sw+(0.4,-.2)
+Level2: [
+ MetaSource: [
+ right
+ B1: box wid mb
+ B2: box wid mb
+ B3: box wid mb
+ B4: box wid mb
+ B5: box wid mb
+ ]
+ Source: [
+ right
+ B1: box fill f
+ B2: box fill f
+ B3: box fill f
+ B4: box fill f
+ B5: box fill f
+ B6: box fill f
+ B7: box fill f
+ B8: box fill f
+ ] with .nw at MetaSource.sw+(0,-.1)
+ File: box wid 0.8 with .nw at Source.sw+(0,-.2)
+] with .nw at Level1.sw+(0.6,-.2)
+
+line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w
+line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w
+line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w
+line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w
+line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w
+
+line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w
+line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w
+line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w
+
+arrowwid = arrowwid/2
+arrowht = arrowht/2
+line -> from Level1.MB1 to Level1.SB1.n
+line -> from Level1.MB2 to Level1.SB2.n
+line -> from Level1.MB2 to Level1.SB3.n
+line -> from Level1.MB4 to Level1.SB7.n
+line -> from Level1.MB5 to Level1.SB5.n
+arrowwid = arrowwid * 2
+arrowht = arrowht * 2
+
+box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
+box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
+box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2
+
+[
+ KEY: box wid 1.5 invis "Key"
+ line from KEY.sw to KEY.se
+ k = -.1
+ kk=0.5
+ A: [
+ box wid 4*boxwid
+ "Venti file" ljust with .w at last box .w+(kk,0)
+ ] with .nw at KEY.sw+(0,2*k)
+ B: [
+ box fill f
+ "Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0)
+ ] with .nw at A.sw+(0,k)
+ C: [
+ right
+ CC: box fill f
+ box fill f
+ box fill f
+ box fill f
+ "Venti directory" ljust with .w at CC.w+(kk,0)
+ ] with .nw at B.sw+(0,k)
+ D: [
+ line -> right 3*boxwid
+ "Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
+ ] with .nw at C.sw+(0,k)
+ DD: [
+ box dotted wid 4*boxwid
+ "Vac file" ljust with .w at last box .w+(kk,0)
+ ] with .nw at D.sw+(0,k)
+ E: [
+ box wid mb
+ "Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0)
+ ] with .nw at DD.sw+(0,k)
+ G: [
+ box dashed wid 4*boxwid
+ "Vac directory" ljust with .w at last box .w+(kk,0)
+ ] with .nw at E.sw+(0,k)
+ H: [
+ arrowwid = arrowwid/2
+ arrowht = arrowht/2
+ line -> right 1.5*boxwid
+ "Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0)
+ arrowwid = arrowwid * 2
+ arrowht = arrowht * 2
+ ] with .nw at G.sw+(0,k)
+] with .nw at VtRoot.nw+(3,0)
+.PE
+.LP
+In reality, the story is slightly more complicated.
+The metadata file in a Vac directory
+is not just the concatenation of
+.CW DirEntry
+structures.
+Instead, it is the concatenation of
+.CW MetaBlocks .
+A
+.CW MetaBlock
+contains some number of
+.CW DirEntry
+structures along with a sorted index to make it easy
+to look for a particular
+.CW DirEntry
+by its
+.CW elem
+field.
+The details are in the source code.
+.PP
+As shown in the diagram,
+the root directory of the file system is summarized by
+three
+.CW VtEntry
+structures describing
+the Venti directory for the children of the root,
+the Venti file for the metadata describing the children of the root,
+and a Venti file holding metadata for the root directory itself.
+These
+.CW VtEntry
+structures are placed in a Venti directory of their own,
+described by the single
+.CW VtEntry
+in the
+root block.
+.NH 1
+Fossil file system format
+.HP
+Fossil uses the vac format, with some small changes.
+The changes only affect the data on the local disk; the data
+archived to Venti is exactly in vac format.
+.PP
+Blocks stored on local disk may contain scores pointing at local disk
+blocks or at Venti blocks.
+Local block addresses are stored as 20-byte scores in which the first 16 bytes
+are all zero and the last 4 bytes specify a block number in the disk.
+Before a block is archived, all the
+blocks it points to must be archived, and the local scores in the block
+must be changed to Venti scores.
+Using block addresses rather than content hashes for local data
+makes the local file system easier to manage: if a local block's contents
+change, the pointer to the block does not need to change.
+.NH 2
+Snapshots
+.HP
+Fossil is an archival file server.
+It takes periodic snapshots of the file system,
+which are made accessible through the file system.
+Specifically, the active file system is presented in
+.CW /active .
+Ephemeral snapshots (those that are kept on local disk and eventually deleted)
+are presented in
+\f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR,
+where
+.I yyyy
+is the full year,
+.I mm
+is the month number,
+.I dd
+is the day number,
+.I hh
+is the hour,
+and
+.I mm
+is the minute.
+Archival snapshots (those that are archived to Venti and persist forever)
+are presented in
+\f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR,
+where
+.I yyyy ,
+.I mm ,
+and
+.I dd
+are year, month, and day as before,
+and
+.I s
+is a sequence number if more than one
+archival snapshot is done in a day.
+For the first snapshot,
+.I s
+is null.
+For the subsequent snapshots,
+.I s
+is
+.CW .1 ,
+.CW .2 ,
+.CW .3 ,
+etc.
+.PP
+To implement the snapshots, the file server maintains a
+current
+.I epoch
+for the active file system.
+Each local block has a label that records, among other things,
+the epoch in which the block was allocated.
+If a block was allocated in an epoch earlier than the current one,
+it is immutable and treated as copy-on-write.
+Taking a snapshot can be accomplished by
+recording the address of the current root block and then
+incrementing the epoch number.
+Notice that the copy-on-write method makes
+snapshots both time efficient and space efficient.
+The only time cost is waiting for all current file system
+requests to finish and then incrementing a counter.
+After a snapshot, blocks only get copied when they are
+next modified, so the per-snapshot
+space requirement is proportional
+to the amount of new data rather than the total
+size of the file system.
+.PP
+The blocks in the archival snapshots are moved to Venti,
+but the blocks in the ephemeral snapshots take up space
+in the local disk file.
+To allow reclamation of this disk space, the file system
+maintains a
+.I low
+.I epoch ,
+which is the epoch of the earliest ephemeral snapshot
+still available.
+Fossil only allows access to snapshots with epoch numbers
+between the
+low epoch and the current epoch
+(also called the high epoch).
+Incrementing the low epoch thus makes old
+snapshots inaccessible.
+The space required to store those snapshots can then
+be reclaimed, as described below.
+.NH 2
+Local blocks
+.HP
+The bulk of the local disk file is the local blocks.
+Each block has a 14-byte label associated with it, of the format:
+.P1
+Label:
+.ta +\w' 'u +\w' 'u
+ state[1] \fRblock state\fP
+ type[1] \fRblock type\fP
+ epoch[4] \fRallocation epoch\fP
+ epochClose[4] \fRclose epoch\fP
+ tag[4] \fRrandom tag\fP
+.P2
+.LP
+The
+.CW type
+is an analogue of the block types described earlier,
+though different names are used, to distinguish between
+pointers blocks in a hash tree for a data stream
+and pointer blocks for a directory stream.
+The
+.CW epoch
+was mentioned in the last section.
+The other fields are explained below.
+.PP
+There are two distinguished blocks states
+.CW BsFree
+.CW 0x00 ) (
+and
+.CW BsBad
+.CW 0xFF ), (
+which mark blocks that are available for allocation
+and blocks that are bad and should be avoided.
+If
+.CW state
+is not one of these values, it is a bitwise
+.I or ' `
+of the following flags:
+.P1
+.ta +\w' 'u +\w' 'u
+0x01 BsAlloc \fRblock is in use\fP
+0x02 BsCopied \fRblock has been copied\fP
+0x04 BsVenti \fRblock has been stored on Venti\fP
+0x08 BsClosed \fRblock has been unlinked from active file system\fP
+.P2
+.LP
+The flags are explained as they arise in the discussions below.
+.PP
+It is convenient to store some extra fields in the
+.CW VtEntry
+structure when it describes a Venti file or directory
+stored on local disk.
+Specifically, we set the
+.CW VtEntryLocal
+flag bit
+and then use the bytes 7-16 of the score (which would
+otherwise be zero, since it is a local score) to hold these fields:
+.P1
+.ta +\w' 'u +\w' 'u
+ archive[1] \fRboolean: this is an archival snapshot\fP
+ snap[4] \fRepoch number if root of snapshot\fP
+ tag[4] \fRrandom tag\fP
+.P2
+.LP
+The extended
+.CW VtEntry
+structure is called an
+.CW Entry .
+The
+.CW tag
+field
+in the
+.CW Label
+and the
+.CW Entry
+is used to identify dangling pointers or other file system corruption:
+all the local blocks in a hash tree must
+have tags matching the tag in the
+.CW Entry .
+If this
+.CW Entry
+points at the root of a snapshot,
+the
+.CW snap
+field is the epoch of the snapshot.
+If the snapshot is intended to be archived to Venti,
+the
+.CW archive
+field is non-zero.
+.NH 2
+Block reclamation
+.HP
+The blocks in the active file system form a tree: each
+block has only one parent.
+Once a copy-on-write block
+.I b
+is replaced by its copy, it is no longer
+needed by the active file system.
+At this point,
+.I b
+is unlinked from the active file system.
+We say that
+.I b
+is now
+.I closed :
+it is needed only for snapshots.
+When a block is closed, the
+.CW BsClosed
+bit is set in its state, and the current epoch (called the block's closing epoch)
+is stored in the
+.CW epochClose
+label field.
+(Open blocks have an
+.CW epochClose
+of
+.CW ~0 ).
+.PP
+A block is referenced by snapshots with epochs
+between the block's allocation epoch and its closing epoch.
+Once the file system's low epoch grows to be greater than or equal to the block's
+closing epoch, the block is no longer needed for any snapshots
+and can be reused.
+.PP
+In a typical configuration, where nightly archival snapshots
+are taken and written to Venti, it is desirable to reclaim
+the space occupied by now-archived blocks if possible.
+To do this, Fossil keeps track of whether the pointers
+in each block are unique to that block.
+When a block
+.I bb
+is allocated, a pointer to
+.I bb
+is written into exactly one active block (say,
+.I b ).
+In the absence of snapshots, the pointer to
+.I bb
+will remain unique to
+.I b ,
+so that if the pointer is zeroed,
+.I bb
+can be immediately reused.
+Snapshots complicate this invariant:
+when
+.I b
+is copied-on-write, all its pointers
+are no longer unique to it.
+At time of the copy, the
+.CW BsCopied
+state bit in the block's label
+is set to note the duplication of the pointers contained within.
+.NH 2
+Disk layout
+.HP
+The file system header describes the file system layout and has this format:
+.P1
+.ta +\w' 'u +\w' 'u
+Header:
+ magic[4] \fR0x3776AE89 (HeaderMagic)\fP
+ version[2] \fR1 (HeaderVersion)\fP
+ blockSize[2] \fIfile system block size\fP
+ super[4] \fRblock offset of super block\fP
+ label[4] \fRblock offset of labels\fP
+ data[4] \fRdata blocks\fP
+ end[4] \fRend of file system\fP
+.P2
+.LP
+The corresponding file system layout is:
+.PS
+.ps 8
+.vs 9
+boxwid=0.75
+boxht=0.15
+Empty: box "empty" ht 0.25
+Header: box "header" with .n at Empty.s
+Empty2: box "empty" with .n at Header.s
+Super: box "super block" with .n at Empty2.s
+Label: box "label" "blocks" with .n at Super.s ht 0.25
+Data: box "data" "blocks" with .n at Label.s ht 0.3
+" 0" ljust at Empty.ne
+" 128kB" ljust at Header.ne
+" \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne
+" \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne
+" \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne
+" \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se
+"" at (-1,0)
+"" at (6,0)
+.PE
+.LP
+The numbers to the right of the blocks are byte offsets
+of the boundaries.
+.LP
+The super block describes the file system itself and looks like:
+.P1
+.ta +\w' 'u +\w' 'u
+Super:
+ magic[4] \fR0x2340A3B1 (SuperMagic)\fP
+ version[2] \fR1 (SuperVersion)\fP
+ epochLow[4] \fRfile system low epoch\fP
+ epochHigh[4] \fRfile system high (active) epoch\fP
+ qid[8] \fRnext qid to allocate\fP
+ active[4] \fRdata block number: root of active file system\fP
+ next[4] \fRdata block number: root of next file system to archive\fP
+ current[4] \fRdata block number: root of file system currently being archived\fP
+ last[20] \fRVenti score of last successful archive\fP
+ name[128] \fRname of file system (just a comment)\fP
+.P2
+.LP
+.NH 1
+Fossil server
+.HP
+The Fossil server is a user-space program that runs on a standard Plan 9 kernel.
+.NH 2
+Process structure
+.PP
+The file server is structured as a set of processes synchronizing
+mostly through message passing along queues.
+The processes are given names, which can be seen in the output of
+.CW ps
+.CW -a .
+.PP
+.CW Listen
+processes announce on various network addresses.
+A
+.CW con
+process handles each incoming connection, reading 9P requests
+and adding them to a central message queue.
+.CW Msg
+processes remove 9P requests from the queue,
+handle them, and write the responses to the appropriate
+file descriptors.
+.PP
+The
+.CW disk
+process handles disk I/O requests made by the other processes.
+The
+.CW flush
+process writes dirty blocks from the in-memory block cache to disk.
+The
+.CW unlink
+process frees previously linked blocks once the blocks that point at them
+have been written to disk.
+.PP
+A
+.CW consI
+reads from each console file (typically a pipe posted in
+.CW /srv ),
+adding the typed characters to the input queue.
+The
+.CW cons
+process echoes input and runs the commands, saving
+output in a ring buffer.
+Because there is only one
+.CW cons
+process, only one console command may be executing at a time.
+A
+.CW consO
+process copies this ring buffer to each console file.
+.PP
+The
+.CW periodic
+process runs periodic events, like
+flushing the root metadata to disk or
+taking snapshots of the file system.
+.NH 2
+Block cache
+.HP
+Fossil maintains an in-memory block cache which
+holds both local disk blocks and Venti blocks.
+Cache eviction follows a least recently used policy.
+Dirty blocks are restricted to at most half the cache.
+This can be changed by editing
+.CW DirtyPercentage
+in
+.CW dat.h .
+.PP
+The block cache uses soft updates [1] to ensure that the on-disk
+file system is always self-consistent.
+Thus there is no
+.I halt
+console command
+and no need to check a file system
+that was shut down without halting.
+.NH 2
+Archiving
+.HP
+A background process writes blocks in archival snapshots to Venti.
+Although
+.CW /archive/\fIyyyy\fP/\fImmdds\fR
+is a copy of only
+.CW /active
+at the time of the snapshot,
+the archival process archives the
+entire file tree rather than just
+the subtree rooted at
+.CW /active .
+The snapshots
+.CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm
+are stored as empty directories.
+Once all the blocks have been archived,
+a
+.CW VtRoot
+header for the file system is archived.
+The score of that header is recorded in
+.CW super.score
+and also printed on the file server console.
+The score can used by
+.I flfmt
+to restore a file system (see
+.I fossil (4)).
+.NH 2
+Contrast with the old file server
+.HP
+The most obvious difference between Fossil and the
+old Plan 9 file server [2] is that Fossil uses a Venti server as
+its archival storage in place of a WORM juke box.
+There are a few other architectural differences to be
+aware of.
+.PP
+Fossil is a user-level program run on a standard kernel.
+.PP
+Fossil does not have any way to concatenate, stripe, or
+mirror disk files. For functionality similar to the old file server's
+configuration strings, use the experimental file stack device
+(see
+.I fs (3)).
+.PP
+Fossil speaks only 9P2000. Old 9P (aka 9P1) is not supported.
+.PP
+... XXX words about converting an old file system to fossil?
+.NH 1
+References
+.LP
+[1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules,
+and Yale N. Patt.
+``Soft Updates: A Solution to the Metadata Update Problem
+in File Systems,''
+.I "ACM Transactions on Computer Systems" ,
+Vol 18., No. 2, May 2000, pp. 127\-153.
+.LP
+[2] Sean Quinlan, ``A Cached WORM File System,''
+.I "Software\(emPractice and Experience" ,
+Vol 21., No 12., December 1991, pp. 1289\-1299.
+.LP
+[3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,''
+.I "Usenix Conference on File and Storage Technologies" ,
+2002.