diff options
author | aiju <aiju@phicode.de> | 2011-07-18 11:01:22 +0200 |
---|---|---|
committer | aiju <aiju@phicode.de> | 2011-07-18 11:01:22 +0200 |
commit | 8c4c1f39f4e369d7c590c9d119f1150a2215e56d (patch) | |
tree | cd430740860183fc01de1bc1ddb216ceff1f7173 /sys/doc/fossil.ms | |
parent | 11bf57fb2ceb999e314cfbe27a4e123bf846d2c8 (diff) |
added /sys/doc
Diffstat (limited to 'sys/doc/fossil.ms')
-rw-r--r-- | sys/doc/fossil.ms | 1163 |
1 files changed, 1163 insertions, 0 deletions
diff --git a/sys/doc/fossil.ms b/sys/doc/fossil.ms new file mode 100644 index 000000000..ca9b783fd --- /dev/null +++ b/sys/doc/fossil.ms @@ -0,0 +1,1163 @@ +.HTML "Fossil, an Archival File Server +... .FP times +... .fp 1 R R.nomath +... .fp 5 CW LucidaSansCW83 +.TL +Fossil, an Archival File Server +.AU +Sean Quinlan +Jim McKie +Russ Cox +jmk,rsc@plan9.bell-labs.com +.AB +This paper describes the internals and +operation of Fossil, an archival file server built for Plan 9. +Fossil has not yet replaced the current Plan 9 file server +and +.CW kfs , +but that is our eventual intent. +Both fossil and this documentation are +works in progress. Comments on either are most welcome. +.AE +.de HP +.LP +.. +.NH 1 +Introduction +.HP +Fossil is an archival file server built for Plan 9. +In a typical configuration, it maintains a traditional file system +in a local disk partition and periodically archives snapshots of the file system +to a Venti server. These archives are made available through +a file system interface. +Fossil can also be run without a Venti server, in which case the +snapshots (if any) occupy local disk space. +.PP +The bulk of this paper explains the underlying data structures: +Venti trees, the Venti archival file system format, and finally Fossil's +file system format. +The end of the paper discusses the architecture of the Fossil server. +.PP +The presentation of the data structures is very detailed, perhaps +too detailed for most readers. +The intent is to record all the details necessary to make structural +changes to the file system format. +Feel free to jump ahead when boredom strikes. +.NH 1 +Venti trees and directory hierarchies +.HP +Venti [3] is an archival block storage server. +Once a block is stored, it can be retrieved by presenting the 20-byte +SHA1 hash of its contents, called a +.I score . +Blocks on Venti have a maximum length of about 56 kilobytes, +though in practice smaller blocks are used. +To store a byte stream of arbitrary length, Venti uses a hash tree. +Conceptually, the data stream is broken into fixed-size (say, +.I dsize -byte) +chunks, which are stored on the Venti server. +The resulting scores are concatenated into a new pointer stream, which is +broken into fixed size (say, +.I psize -byte) +chunks, which are stored on the Venti server. +.I Psize "" ( +is different from +.I dsize +so that we can ensure that each pointer block holds an +integral number of pointers.) +This yields a new pointer stream, and so on, until there is a single block +and finally a single score describing the entire tree. +The resulting structure looks like: +.PS +.ps 8 +.vs 10 +boxht=0.1 +boxwid=0.1 + +B0: box invis wid 1 "\f(CWVtDataType\fP" +move right 0.1 +L0a: box wid 0.2 +move right 0.1 +L0b: box wid 0.2 +move right 0.1 +L0c: box invis wid 0.2 "..." +move right 0.1 + +L0d: box wid 0.2 +move right 0.1 +L0e: box wid 0.2 +move right 0.2 +L0f: box invis wid 0.2 "..." +move right 0.2 + +L0g: box wid 0.2 +move right 0.1 +L0h: box wid 0.2 +move right 0.1 +L0i: box invis wid 0.2 "..." +move right 0.1 + +L0j: box wid 0.2 +move right 0.1 +L0k: box wid 0.2 +move right 0.1 +L0l: box invis wid 0.2 "..." +move right 0.1 +L0m: box wid 0.2 + +define boxppddd { + line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se> + line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se> + X: box invis at 0.1<$1.nw,$1.ne> + Y: box invis at 0.1<$1.sw,$1.se> + line -> from 0.5<X,Y> to $2.nw + X: box invis at 0.3<$1.nw,$1.ne> + Y: box invis at 0.3<$1.sw,$1.se> + line -> from 0.5<X,Y> to $3.nw + "..." at 0.7<$1.w,$1.e> +} + +define boxppdddp { + line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se> + line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se> + line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se> + X: box invis at 0.1<$1.nw,$1.ne> + Y: box invis at 0.1<$1.sw,$1.se> + line -> from 0.5<X,Y> to $2.nw + X: box invis at 0.3<$1.nw,$1.ne> + Y: box invis at 0.3<$1.sw,$1.se> + line -> from 0.5<X,Y> to $3.nw + "..." at 0.6<$1.w,$1.e> + X: box invis at 0.9<$1.nw,$1.ne> + Y: box invis at 0.9<$1.sw,$1.se> + line -> from 0.5<X,Y> to $4.nw +} + +define boxpdddp { + line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se> + line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se> + X: box invis at 0.1<$1.nw,$1.ne> + Y: box invis at 0.1<$1.sw,$1.se> + line -> from 0.5<X,Y> to $2.nw + "..." at 0.5<$1.w,$1.e> + X: box invis at 0.9<$1.nw,$1.ne> + Y: box invis at 0.9<$1.sw,$1.se> + line -> from 0.5<X,Y> to $3.nw +} + +bhd=0.4 +L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd) +boxppddd(L1abc, L0a, L0b) +L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd) +boxppddd(L1def, L0d, L0e) +L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd) +boxppddd(L1ghi, L0g, L0h) +L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd) +boxppdddp(L1jklm, L0j, L0k, L0m) +B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd) + +L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd) +boxppddd(L2abcdef, L1abc, L1def) +L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd) +boxpdddp(L2ghijklm, L1ghi, L1jklm) +B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd) + +L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd) +boxpdddp(L3atom, L2abcdef, L2ghijklm) +B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd) +.PE +.LP +The leaves are the original data stream. Those blocks have type +.CW VtDataType . +The first pointer stream has type +.CW VtPointerType0 , +the next has type +.CW VtPointerType1 , +and so on. +The figure ends with a single block of type +.CW VtPointerType2 , +but in general trees can have height up to +.CW VtPointerType6 . +For a +.I dsize +of 8192 bytes +and +.I psize +of 8180 bytes (409 pointers), +this gives a maximum stream size of approximately 10 zettabytes +(2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes). +.PP +Data blocks are truncated to remove trailing runs of zeros before +storage to Venti; they are zero-filled back to +.I dsize +bytes after retrieval from Venti. +Similarly, trailing runs of pointers to zero-length blocks are +removed from and added back to pointer blocks. +These simple rules happen to make it particularly efficient to store +large runs of zeros, as might occur in a data stream with ``holes:'' +the zero-length block itself can be interpreted as a tree of any +depth encoding an all-zero data stream. +.PP +Reconstructing the data stream requires the score and type of the +topmost block in the tree, the data chunk size, the pointer chunk size, +and the data stream size. +(From the data stream size and the chunk sizes we could derive the +depth of the tree and thus the type of the topmost block, but it is convenient +to allow trees that are deeper than necessary.) +This information is kept in a 40-byte structure called a +.CW VtEntry : +.P1 +VtEntry: +.ta +\w' 'u +\w' 'u + gen[4] \fRgeneration number\fP + psize[2] \fRsize of pointer blocks\fP + dsize[2] \fRsize of data blocks\fP + flags[1] + zero[5] + size[6] \fRlength of file\fP + score[20] \fRscore of root block in tree\fP +.P2 +(In this notation, +.CW name[sz] +indicates a +.CW sz -byte +field called +.CW name . +Integers are stored in big-endian order. +.CW Size +really is a 48-bit field.) +.CW Flags +is made up of the following bit fields. +.P1 +.ta +\w' 'u +\w' 'u +0x01 VtEntryActive \fRentry is allocated\fP +0x02 VtEntryDir \fRentry describes a Venti directory (q.v.)\fP +0x1C VtEntryDepthMask \fRmask for tree depth\fP +0x20 VtEntryLocal \fRreserved (q.v.)\fP +.P2 +.LP +The depth of the described tree is stored in the 3 bits indicated: +a tree with a topmost node of type +.CW VtPointerType3 +has depth 4. +.PP +With +.CW VtEntry +we can build more complicated data structures, +ones with multiple or nested data streams. +A data stream consisting of +.CW VtEntry +structures is called a Venti directory. +It is identical in structure to the Venti data stream +we described earlier except that the bottom-level type is +.CW VtDirType , +and +the +.CW VtEntry +describing a Venti directory has the +.CW VtEntryDir +flag bit set. +The +.I dsize +for a Venti directory +is a multiple of 40 so that each data chunk holds +an integer number of +.CW VtEntry +structures. +By analogy with Venti directories, +we call the original data stream a +Venti file. +Note that Venti files are assumed +.I not +to contain pointers to other Venti blocks. +The only pointers to Venti blocks occur in +.CW VtEntry +structures in +Venti directories +(and in the internal hash tree structure of the +individual files and directories). +Note also that these directories are nothing more than pointer lists. +In particular, there are no names or metadata like in a file system. +.PP +To make it easier to pass hierarchies between applications, +the root of a hierarchy is described in a 300-byte structure +called a +.CW VtRoot : +.P1 +VtRoot: +.ta +\w' 'u +\w' 'u + version[2] \f(CW2\fP + name[128] \fRname of structure (just a comment)\fP + type[128] \fRstring describing structure (\f(CWvac\fR)\f(CW + score[20] \fRpointer to \f(CWVtDirType\fP block\f(CW + blockSize[2] \fRmaximum block size in structure\fP + prev[20] \fRprevious \f(CWVtRoot\fP in chain, if any\fP +.P2 +.LP +This structure is stored to Venti and its score is passed +between applications, typically in the form +``\fItype\f(CW:\fIrootscore\fR,'' +where +.I type +is the type field from the +.CW VtRoot +structure, and +.I rootscore +is the score of the +.CW VtRoot +block. +.CW VtRoot +structures can be chained together using the +.I prev +field to encode an archival history +of the data structure. +.PP +For example, a small Venti hierarchy might look like: +.PS +.ps 8 +.vs 10 +boxwid=0.1 +boxht=0.1 +f=0.9 +mb=0.16 + +VtRoot: [ + right + B1: box + move right 0.1 + "\f(CWVtRoot\fP" ljust +] + +Root: [ + right + B1: box fill f + B2: box fill f + B3: box fill f + move right 0.1 +] with .nw at VtRoot.sw+(0.2,-.1) +Level1: [ + RootMeta: [ + box wid mb + ] + MetaSource: [ + right + B1: box wid 5*mb + ] with .nw at RootMeta.sw+(0,-.1) + + Source: [ + right + B1: box fill f + B2: box fill f + B3: box fill f + B4: box fill f + B5: box fill f + B6: box fill f + B7: box fill f + B8: box fill f + ] with .nw at MetaSource.sw+(0,-.1) + SB1: box invis at Source.B1 + SB2: box invis at Source.B2 + SB3: box invis at Source.B3 +] with .nw at Root.sw+(0.4,-.1) +Level2: [ + MetaSource: [ + right + B1: box wid 5*mb + ] + Source: [ + right + B1: box fill f + B2: box fill f + B3: box fill f + B4: box fill f + B5: box fill f + B6: box fill f + B7: box fill f + B8: box fill f + ] with .nw at MetaSource.sw+(0,-.1) + File: box wid 0.8 with .nw at Source.sw+(0,-.1) +] with .nw at Level1.sw+(0.6,-.1) + +line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w +line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w +line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w +line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w + +line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w +line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w +line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w + +[ + KEY: box wid 1.5 invis "Key" + line from KEY.sw to KEY.se + k = -.1 + kk=0.5 + A: [ + box wid 4*boxwid + "Venti file" ljust with .w at last box .w+(kk,0) + ] with .nw at KEY.sw+(0,2*k) + B: [ + box fill f + "Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0) + ] with .nw at A.sw+(0,k) + C: [ + right + CC: box fill f + box fill f + box fill f + box fill f + "Venti directory" ljust with .w at CC.w+(kk,0) + ] with .nw at B.sw+(0,k) + D: [ + line -> right 3*boxwid + "Venti pointer (score)" ljust with .w at last line .w+(kk, 0) + ] with .nw at C.sw+(0,k) +] with .nw at VtRoot.nw+(3,0) +.PE +.LP +Venti files are shown as white boxes, while directories are shown +as shaded boxes. Each shaded square represents a +.CW VtEntry . +Arrows represent pointers from +.CW VtEntry +structures to other +Venti files or directories. +.PP +The hierarchical structure provided by Venti files and directories +can be used as the base for more complicated data structures. +Because this structure captures all the information +about pointers to other blocks, tools written to traverse +Venti hierarchies can traverse the more complicated +data structures as well. +For example, +.I venti/copy +(see +.I venti (1)) +copies a Venti hierarchy from one Venti server to another, +given the root +.CW VtEntry . +Because the traditional file system described in later sections is +layered on a Venti hierarchy, +.I venti/copy +can copy it without fully understanding its structure. +.NH 1 +Vac file system format +.HP +The Venti archive format +.I vac +builds a traditional file system using a Venti hierarchy. +Each vac file is implemented as a Venti file; +each vac directory is implemented as a Venti +directory and a Venti file to provide traditional file system metadata. +The metadata is stored in a structure called a +.CW DirEntry : +.P1 +DirEntry: +.ta +\w' 'u +\w' 'u + magic[4] \f(CW0x1c4d9072\fP (DirMagic)\fP + version[2] \f(CW9\fP + elem[s] \fRname (final path element only)\fP + entry[4] \fRentry number for Venti file or directory\fP + gen[4] \fRgeneration number\fP + mentry[4] \fRentry number for Venti file holding metadata\fP + mgen[4] \fRgeneration number\fP + qid[8] \fRunique file serial number\fP + uid[s] \fRowner\fP + gid[s] \fRgroup\fP + mid[s] \fRlast modified by\fP + mtime[4] \fRlast modification time\fP + ctime[4] \fRcreation time\fP + atime[4] \fRlast access time\fP + mode[4] \fRmode bits\fP +.P2 +The notation +.CW name[s] +denotes a string stored as a two-byte length +and then that many bytes. +The above describes Version 9 of the +.CW DirEntry +format. Versions 7 and 8 are very similar; they can be +read by the current +.I vac +source code but are not written. +Earlier versions were not widespread. +A +.CW DirEntry +may be followed by optional extension sections, though none +are currently used. +The +.CW mode +bits include bits commonly used by +Unix and Windows, in addition to those used by Plan 9. +.PP +The +.CW entry +field is an index into the parallel Venti directory. +The +.CW gen +field must match the +.CW gen +field in the corresponding +.CW VtEntry +in the directory; +it is used to detect +stale indices. +Similarly, +.CW mentry +and +.CW mgen +are the index and generation number +for the metadata Venti file, +if the +.CW DirEntry +describes a vac directory. +.PP +The relation between Venti files and directories and +vac files and directories can be seen in this figure: +.PS +.ps 8 +.vs 10 +boxwid=0.1 +boxht=0.1 +f=0.9 +mb=0.16 + +VtRoot: [ + right + B1: box + move right 0.1 + "\f(CWVtRoot\fP" ljust +] + +SuperRoot: [ + right + B1: box fill f + move right 0.1 + "fs root block" ljust +] with .nw at VtRoot.sw + (0.2, -.2) +Root: [ + right + B1: box fill f + B2: box fill f + B3: box fill f + move right 0.1 + "root directory info block" ljust +] with .nw at SuperRoot.sw+(0.2, -.2) +Level1: [ + RootMeta: [ + box wid mb + move right 0.1 + "root metadata" ljust + ] + MetaSource: [ + right + B1: box wid mb + B2: box wid mb + B3: box wid mb + B4: box wid mb + B5: box wid mb + ] with .nw at RootMeta.sw+(0,-.2) + MB1: box wid mb invis at MetaSource.B1 + MB2: box wid mb invis at MetaSource.B2 + MB3: box wid mb invis at MetaSource.B3 + MB4: box wid mb invis at MetaSource.B4 + MB5: box wid mb invis at MetaSource.B5 + + Source: [ + right + B1: box fill f + B2: box fill f + B3: box fill f + B4: box fill f + B5: box fill f + B6: box fill f + B7: box fill f + B8: box fill f + ] with .nw at MetaSource.sw+(0,-.1) + SB1: box invis at Source.B1 + SB2: box invis at Source.B2 + SB3: box invis at Source.B3 + SB4: box invis at Source.B4 + SB5: box invis at Source.B5 + SB6: box invis at Source.B6 + SB7: box invis at Source.B7 + SB8: box invis at Source.B8 +] with .nw at Root.sw+(0.4,-.2) +Level2: [ + MetaSource: [ + right + B1: box wid mb + B2: box wid mb + B3: box wid mb + B4: box wid mb + B5: box wid mb + ] + Source: [ + right + B1: box fill f + B2: box fill f + B3: box fill f + B4: box fill f + B5: box fill f + B6: box fill f + B7: box fill f + B8: box fill f + ] with .nw at MetaSource.sw+(0,-.1) + File: box wid 0.8 with .nw at Source.sw+(0,-.2) +] with .nw at Level1.sw+(0.6,-.2) + +line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w +line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w +line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w +line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w +line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w + +line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w +line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w +line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w + +arrowwid = arrowwid/2 +arrowht = arrowht/2 +line -> from Level1.MB1 to Level1.SB1.n +line -> from Level1.MB2 to Level1.SB2.n +line -> from Level1.MB2 to Level1.SB3.n +line -> from Level1.MB4 to Level1.SB7.n +line -> from Level1.MB5 to Level1.SB5.n +arrowwid = arrowwid * 2 +arrowht = arrowht * 2 + +box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2 +box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2 +box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2 + +[ + KEY: box wid 1.5 invis "Key" + line from KEY.sw to KEY.se + k = -.1 + kk=0.5 + A: [ + box wid 4*boxwid + "Venti file" ljust with .w at last box .w+(kk,0) + ] with .nw at KEY.sw+(0,2*k) + B: [ + box fill f + "Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0) + ] with .nw at A.sw+(0,k) + C: [ + right + CC: box fill f + box fill f + box fill f + box fill f + "Venti directory" ljust with .w at CC.w+(kk,0) + ] with .nw at B.sw+(0,k) + D: [ + line -> right 3*boxwid + "Venti pointer (score)" ljust with .w at last line .w+(kk, 0) + ] with .nw at C.sw+(0,k) + DD: [ + box dotted wid 4*boxwid + "Vac file" ljust with .w at last box .w+(kk,0) + ] with .nw at D.sw+(0,k) + E: [ + box wid mb + "Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0) + ] with .nw at DD.sw+(0,k) + G: [ + box dashed wid 4*boxwid + "Vac directory" ljust with .w at last box .w+(kk,0) + ] with .nw at E.sw+(0,k) + H: [ + arrowwid = arrowwid/2 + arrowht = arrowht/2 + line -> right 1.5*boxwid + "Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0) + arrowwid = arrowwid * 2 + arrowht = arrowht * 2 + ] with .nw at G.sw+(0,k) +] with .nw at VtRoot.nw+(3,0) +.PE +.LP +In reality, the story is slightly more complicated. +The metadata file in a Vac directory +is not just the concatenation of +.CW DirEntry +structures. +Instead, it is the concatenation of +.CW MetaBlocks . +A +.CW MetaBlock +contains some number of +.CW DirEntry +structures along with a sorted index to make it easy +to look for a particular +.CW DirEntry +by its +.CW elem +field. +The details are in the source code. +.PP +As shown in the diagram, +the root directory of the file system is summarized by +three +.CW VtEntry +structures describing +the Venti directory for the children of the root, +the Venti file for the metadata describing the children of the root, +and a Venti file holding metadata for the root directory itself. +These +.CW VtEntry +structures are placed in a Venti directory of their own, +described by the single +.CW VtEntry +in the +root block. +.NH 1 +Fossil file system format +.HP +Fossil uses the vac format, with some small changes. +The changes only affect the data on the local disk; the data +archived to Venti is exactly in vac format. +.PP +Blocks stored on local disk may contain scores pointing at local disk +blocks or at Venti blocks. +Local block addresses are stored as 20-byte scores in which the first 16 bytes +are all zero and the last 4 bytes specify a block number in the disk. +Before a block is archived, all the +blocks it points to must be archived, and the local scores in the block +must be changed to Venti scores. +Using block addresses rather than content hashes for local data +makes the local file system easier to manage: if a local block's contents +change, the pointer to the block does not need to change. +.NH 2 +Snapshots +.HP +Fossil is an archival file server. +It takes periodic snapshots of the file system, +which are made accessible through the file system. +Specifically, the active file system is presented in +.CW /active . +Ephemeral snapshots (those that are kept on local disk and eventually deleted) +are presented in +\f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR, +where +.I yyyy +is the full year, +.I mm +is the month number, +.I dd +is the day number, +.I hh +is the hour, +and +.I mm +is the minute. +Archival snapshots (those that are archived to Venti and persist forever) +are presented in +\f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR, +where +.I yyyy , +.I mm , +and +.I dd +are year, month, and day as before, +and +.I s +is a sequence number if more than one +archival snapshot is done in a day. +For the first snapshot, +.I s +is null. +For the subsequent snapshots, +.I s +is +.CW .1 , +.CW .2 , +.CW .3 , +etc. +.PP +To implement the snapshots, the file server maintains a +current +.I epoch +for the active file system. +Each local block has a label that records, among other things, +the epoch in which the block was allocated. +If a block was allocated in an epoch earlier than the current one, +it is immutable and treated as copy-on-write. +Taking a snapshot can be accomplished by +recording the address of the current root block and then +incrementing the epoch number. +Notice that the copy-on-write method makes +snapshots both time efficient and space efficient. +The only time cost is waiting for all current file system +requests to finish and then incrementing a counter. +After a snapshot, blocks only get copied when they are +next modified, so the per-snapshot +space requirement is proportional +to the amount of new data rather than the total +size of the file system. +.PP +The blocks in the archival snapshots are moved to Venti, +but the blocks in the ephemeral snapshots take up space +in the local disk file. +To allow reclamation of this disk space, the file system +maintains a +.I low +.I epoch , +which is the epoch of the earliest ephemeral snapshot +still available. +Fossil only allows access to snapshots with epoch numbers +between the +low epoch and the current epoch +(also called the high epoch). +Incrementing the low epoch thus makes old +snapshots inaccessible. +The space required to store those snapshots can then +be reclaimed, as described below. +.NH 2 +Local blocks +.HP +The bulk of the local disk file is the local blocks. +Each block has a 14-byte label associated with it, of the format: +.P1 +Label: +.ta +\w' 'u +\w' 'u + state[1] \fRblock state\fP + type[1] \fRblock type\fP + epoch[4] \fRallocation epoch\fP + epochClose[4] \fRclose epoch\fP + tag[4] \fRrandom tag\fP +.P2 +.LP +The +.CW type +is an analogue of the block types described earlier, +though different names are used, to distinguish between +pointers blocks in a hash tree for a data stream +and pointer blocks for a directory stream. +The +.CW epoch +was mentioned in the last section. +The other fields are explained below. +.PP +There are two distinguished blocks states +.CW BsFree +.CW 0x00 ) ( +and +.CW BsBad +.CW 0xFF ), ( +which mark blocks that are available for allocation +and blocks that are bad and should be avoided. +If +.CW state +is not one of these values, it is a bitwise +.I or ' ` +of the following flags: +.P1 +.ta +\w' 'u +\w' 'u +0x01 BsAlloc \fRblock is in use\fP +0x02 BsCopied \fRblock has been copied\fP +0x04 BsVenti \fRblock has been stored on Venti\fP +0x08 BsClosed \fRblock has been unlinked from active file system\fP +.P2 +.LP +The flags are explained as they arise in the discussions below. +.PP +It is convenient to store some extra fields in the +.CW VtEntry +structure when it describes a Venti file or directory +stored on local disk. +Specifically, we set the +.CW VtEntryLocal +flag bit +and then use the bytes 7-16 of the score (which would +otherwise be zero, since it is a local score) to hold these fields: +.P1 +.ta +\w' 'u +\w' 'u + archive[1] \fRboolean: this is an archival snapshot\fP + snap[4] \fRepoch number if root of snapshot\fP + tag[4] \fRrandom tag\fP +.P2 +.LP +The extended +.CW VtEntry +structure is called an +.CW Entry . +The +.CW tag +field +in the +.CW Label +and the +.CW Entry +is used to identify dangling pointers or other file system corruption: +all the local blocks in a hash tree must +have tags matching the tag in the +.CW Entry . +If this +.CW Entry +points at the root of a snapshot, +the +.CW snap +field is the epoch of the snapshot. +If the snapshot is intended to be archived to Venti, +the +.CW archive +field is non-zero. +.NH 2 +Block reclamation +.HP +The blocks in the active file system form a tree: each +block has only one parent. +Once a copy-on-write block +.I b +is replaced by its copy, it is no longer +needed by the active file system. +At this point, +.I b +is unlinked from the active file system. +We say that +.I b +is now +.I closed : +it is needed only for snapshots. +When a block is closed, the +.CW BsClosed +bit is set in its state, and the current epoch (called the block's closing epoch) +is stored in the +.CW epochClose +label field. +(Open blocks have an +.CW epochClose +of +.CW ~0 ). +.PP +A block is referenced by snapshots with epochs +between the block's allocation epoch and its closing epoch. +Once the file system's low epoch grows to be greater than or equal to the block's +closing epoch, the block is no longer needed for any snapshots +and can be reused. +.PP +In a typical configuration, where nightly archival snapshots +are taken and written to Venti, it is desirable to reclaim +the space occupied by now-archived blocks if possible. +To do this, Fossil keeps track of whether the pointers +in each block are unique to that block. +When a block +.I bb +is allocated, a pointer to +.I bb +is written into exactly one active block (say, +.I b ). +In the absence of snapshots, the pointer to +.I bb +will remain unique to +.I b , +so that if the pointer is zeroed, +.I bb +can be immediately reused. +Snapshots complicate this invariant: +when +.I b +is copied-on-write, all its pointers +are no longer unique to it. +At time of the copy, the +.CW BsCopied +state bit in the block's label +is set to note the duplication of the pointers contained within. +.NH 2 +Disk layout +.HP +The file system header describes the file system layout and has this format: +.P1 +.ta +\w' 'u +\w' 'u +Header: + magic[4] \fR0x3776AE89 (HeaderMagic)\fP + version[2] \fR1 (HeaderVersion)\fP + blockSize[2] \fIfile system block size\fP + super[4] \fRblock offset of super block\fP + label[4] \fRblock offset of labels\fP + data[4] \fRdata blocks\fP + end[4] \fRend of file system\fP +.P2 +.LP +The corresponding file system layout is: +.PS +.ps 8 +.vs 9 +boxwid=0.75 +boxht=0.15 +Empty: box "empty" ht 0.25 +Header: box "header" with .n at Empty.s +Empty2: box "empty" with .n at Header.s +Super: box "super block" with .n at Empty2.s +Label: box "label" "blocks" with .n at Super.s ht 0.25 +Data: box "data" "blocks" with .n at Label.s ht 0.3 +" 0" ljust at Empty.ne +" 128kB" ljust at Header.ne +" \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne +" \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne +" \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne +" \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se +"" at (-1,0) +"" at (6,0) +.PE +.LP +The numbers to the right of the blocks are byte offsets +of the boundaries. +.LP +The super block describes the file system itself and looks like: +.P1 +.ta +\w' 'u +\w' 'u +Super: + magic[4] \fR0x2340A3B1 (SuperMagic)\fP + version[2] \fR1 (SuperVersion)\fP + epochLow[4] \fRfile system low epoch\fP + epochHigh[4] \fRfile system high (active) epoch\fP + qid[8] \fRnext qid to allocate\fP + active[4] \fRdata block number: root of active file system\fP + next[4] \fRdata block number: root of next file system to archive\fP + current[4] \fRdata block number: root of file system currently being archived\fP + last[20] \fRVenti score of last successful archive\fP + name[128] \fRname of file system (just a comment)\fP +.P2 +.LP +.NH 1 +Fossil server +.HP +The Fossil server is a user-space program that runs on a standard Plan 9 kernel. +.NH 2 +Process structure +.PP +The file server is structured as a set of processes synchronizing +mostly through message passing along queues. +The processes are given names, which can be seen in the output of +.CW ps +.CW -a . +.PP +.CW Listen +processes announce on various network addresses. +A +.CW con +process handles each incoming connection, reading 9P requests +and adding them to a central message queue. +.CW Msg +processes remove 9P requests from the queue, +handle them, and write the responses to the appropriate +file descriptors. +.PP +The +.CW disk +process handles disk I/O requests made by the other processes. +The +.CW flush +process writes dirty blocks from the in-memory block cache to disk. +The +.CW unlink +process frees previously linked blocks once the blocks that point at them +have been written to disk. +.PP +A +.CW consI +reads from each console file (typically a pipe posted in +.CW /srv ), +adding the typed characters to the input queue. +The +.CW cons +process echoes input and runs the commands, saving +output in a ring buffer. +Because there is only one +.CW cons +process, only one console command may be executing at a time. +A +.CW consO +process copies this ring buffer to each console file. +.PP +The +.CW periodic +process runs periodic events, like +flushing the root metadata to disk or +taking snapshots of the file system. +.NH 2 +Block cache +.HP +Fossil maintains an in-memory block cache which +holds both local disk blocks and Venti blocks. +Cache eviction follows a least recently used policy. +Dirty blocks are restricted to at most half the cache. +This can be changed by editing +.CW DirtyPercentage +in +.CW dat.h . +.PP +The block cache uses soft updates [1] to ensure that the on-disk +file system is always self-consistent. +Thus there is no +.I halt +console command +and no need to check a file system +that was shut down without halting. +.NH 2 +Archiving +.HP +A background process writes blocks in archival snapshots to Venti. +Although +.CW /archive/\fIyyyy\fP/\fImmdds\fR +is a copy of only +.CW /active +at the time of the snapshot, +the archival process archives the +entire file tree rather than just +the subtree rooted at +.CW /active . +The snapshots +.CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm +are stored as empty directories. +Once all the blocks have been archived, +a +.CW VtRoot +header for the file system is archived. +The score of that header is recorded in +.CW super.score +and also printed on the file server console. +The score can used by +.I flfmt +to restore a file system (see +.I fossil (4)). +.NH 2 +Contrast with the old file server +.HP +The most obvious difference between Fossil and the +old Plan 9 file server [2] is that Fossil uses a Venti server as +its archival storage in place of a WORM juke box. +There are a few other architectural differences to be +aware of. +.PP +Fossil is a user-level program run on a standard kernel. +.PP +Fossil does not have any way to concatenate, stripe, or +mirror disk files. For functionality similar to the old file server's +configuration strings, use the experimental file stack device +(see +.I fs (3)). +.PP +Fossil speaks only 9P2000. Old 9P (aka 9P1) is not supported. +.PP +... XXX words about converting an old file system to fossil? +.NH 1 +References +.LP +[1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules, +and Yale N. Patt. +``Soft Updates: A Solution to the Metadata Update Problem +in File Systems,'' +.I "ACM Transactions on Computer Systems" , +Vol 18., No. 2, May 2000, pp. 127\-153. +.LP +[2] Sean Quinlan, ``A Cached WORM File System,'' +.I "Software\(emPractice and Experience" , +Vol 21., No 12., December 1991, pp. 1289\-1299. +.LP +[3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,'' +.I "Usenix Conference on File and Storage Technologies" , +2002. |