1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
|
.HTML "Fossil, an Archival File Server
... .FP times
... .fp 1 R R.nomath
... .fp 5 CW LucidaSansCW83
.TL
Fossil, an Archival File Server
.AU
Sean Quinlan
Jim McKie
Russ Cox
jmk,rsc@plan9.bell-labs.com
.AB
This paper describes the internals and
operation of Fossil, an archival file server built for Plan 9.
Fossil has not yet replaced the current Plan 9 file server
and
.CW kfs ,
but that is our eventual intent.
Both fossil and this documentation are
works in progress. Comments on either are most welcome.
.AE
.de HP
.LP
..
.NH 1
Introduction
.HP
Fossil is an archival file server built for Plan 9.
In a typical configuration, it maintains a traditional file system
in a local disk partition and periodically archives snapshots of the file system
to a Venti server. These archives are made available through
a file system interface.
Fossil can also be run without a Venti server, in which case the
snapshots (if any) occupy local disk space.
.PP
The bulk of this paper explains the underlying data structures:
Venti trees, the Venti archival file system format, and finally Fossil's
file system format.
The end of the paper discusses the architecture of the Fossil server.
.PP
The presentation of the data structures is very detailed, perhaps
too detailed for most readers.
The intent is to record all the details necessary to make structural
changes to the file system format.
Feel free to jump ahead when boredom strikes.
.NH 1
Venti trees and directory hierarchies
.HP
Venti [3] is an archival block storage server.
Once a block is stored, it can be retrieved by presenting the 20-byte
SHA1 hash of its contents, called a
.I score .
Blocks on Venti have a maximum length of about 56 kilobytes,
though in practice smaller blocks are used.
To store a byte stream of arbitrary length, Venti uses a hash tree.
Conceptually, the data stream is broken into fixed-size (say,
.I dsize -byte)
chunks, which are stored on the Venti server.
The resulting scores are concatenated into a new pointer stream, which is
broken into fixed size (say,
.I psize -byte)
chunks, which are stored on the Venti server.
.I Psize "" (
is different from
.I dsize
so that we can ensure that each pointer block holds an
integral number of pointers.)
This yields a new pointer stream, and so on, until there is a single block
and finally a single score describing the entire tree.
The resulting structure looks like:
.PS
.ps 8
.vs 10
boxht=0.1
boxwid=0.1
B0: box invis wid 1 "\f(CWVtDataType\fP"
move right 0.1
L0a: box wid 0.2
move right 0.1
L0b: box wid 0.2
move right 0.1
L0c: box invis wid 0.2 "..."
move right 0.1
L0d: box wid 0.2
move right 0.1
L0e: box wid 0.2
move right 0.2
L0f: box invis wid 0.2 "..."
move right 0.2
L0g: box wid 0.2
move right 0.1
L0h: box wid 0.2
move right 0.1
L0i: box invis wid 0.2 "..."
move right 0.1
L0j: box wid 0.2
move right 0.1
L0k: box wid 0.2
move right 0.1
L0l: box invis wid 0.2 "..."
move right 0.1
L0m: box wid 0.2
define boxppddd {
line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
X: box invis at 0.1<$1.nw,$1.ne>
Y: box invis at 0.1<$1.sw,$1.se>
line -> from 0.5<X,Y> to $2.nw
X: box invis at 0.3<$1.nw,$1.ne>
Y: box invis at 0.3<$1.sw,$1.se>
line -> from 0.5<X,Y> to $3.nw
"..." at 0.7<$1.w,$1.e>
}
define boxppdddp {
line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
X: box invis at 0.1<$1.nw,$1.ne>
Y: box invis at 0.1<$1.sw,$1.se>
line -> from 0.5<X,Y> to $2.nw
X: box invis at 0.3<$1.nw,$1.ne>
Y: box invis at 0.3<$1.sw,$1.se>
line -> from 0.5<X,Y> to $3.nw
"..." at 0.6<$1.w,$1.e>
X: box invis at 0.9<$1.nw,$1.ne>
Y: box invis at 0.9<$1.sw,$1.se>
line -> from 0.5<X,Y> to $4.nw
}
define boxpdddp {
line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
X: box invis at 0.1<$1.nw,$1.ne>
Y: box invis at 0.1<$1.sw,$1.se>
line -> from 0.5<X,Y> to $2.nw
"..." at 0.5<$1.w,$1.e>
X: box invis at 0.9<$1.nw,$1.ne>
Y: box invis at 0.9<$1.sw,$1.se>
line -> from 0.5<X,Y> to $3.nw
}
bhd=0.4
L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd)
boxppddd(L1abc, L0a, L0b)
L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd)
boxppddd(L1def, L0d, L0e)
L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd)
boxppddd(L1ghi, L0g, L0h)
L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd)
boxppdddp(L1jklm, L0j, L0k, L0m)
B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd)
L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd)
boxppddd(L2abcdef, L1abc, L1def)
L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd)
boxpdddp(L2ghijklm, L1ghi, L1jklm)
B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd)
L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd)
boxpdddp(L3atom, L2abcdef, L2ghijklm)
B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd)
.PE
.LP
The leaves are the original data stream. Those blocks have type
.CW VtDataType .
The first pointer stream has type
.CW VtPointerType0 ,
the next has type
.CW VtPointerType1 ,
and so on.
The figure ends with a single block of type
.CW VtPointerType2 ,
but in general trees can have height up to
.CW VtPointerType6 .
For a
.I dsize
of 8192 bytes
and
.I psize
of 8180 bytes (409 pointers),
this gives a maximum stream size of approximately 10 zettabytes
(2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes).
.PP
Data blocks are truncated to remove trailing runs of zeros before
storage to Venti; they are zero-filled back to
.I dsize
bytes after retrieval from Venti.
Similarly, trailing runs of pointers to zero-length blocks are
removed from and added back to pointer blocks.
These simple rules happen to make it particularly efficient to store
large runs of zeros, as might occur in a data stream with ``holes:''
the zero-length block itself can be interpreted as a tree of any
depth encoding an all-zero data stream.
.PP
Reconstructing the data stream requires the score and type of the
topmost block in the tree, the data chunk size, the pointer chunk size,
and the data stream size.
(From the data stream size and the chunk sizes we could derive the
depth of the tree and thus the type of the topmost block, but it is convenient
to allow trees that are deeper than necessary.)
This information is kept in a 40-byte structure called a
.CW VtEntry :
.P1
VtEntry:
.ta +\w' 'u +\w' 'u
gen[4] \fRgeneration number\fP
psize[2] \fRsize of pointer blocks\fP
dsize[2] \fRsize of data blocks\fP
flags[1]
zero[5]
size[6] \fRlength of file\fP
score[20] \fRscore of root block in tree\fP
.P2
(In this notation,
.CW name[sz]
indicates a
.CW sz -byte
field called
.CW name .
Integers are stored in big-endian order.
.CW Size
really is a 48-bit field.)
.CW Flags
is made up of the following bit fields.
.P1
.ta +\w' 'u +\w' 'u
0x01 VtEntryActive \fRentry is allocated\fP
0x02 VtEntryDir \fRentry describes a Venti directory (q.v.)\fP
0x1C VtEntryDepthMask \fRmask for tree depth\fP
0x20 VtEntryLocal \fRreserved (q.v.)\fP
.P2
.LP
The depth of the described tree is stored in the 3 bits indicated:
a tree with a topmost node of type
.CW VtPointerType3
has depth 4.
.PP
With
.CW VtEntry
we can build more complicated data structures,
ones with multiple or nested data streams.
A data stream consisting of
.CW VtEntry
structures is called a Venti directory.
It is identical in structure to the Venti data stream
we described earlier except that the bottom-level type is
.CW VtDirType ,
and
the
.CW VtEntry
describing a Venti directory has the
.CW VtEntryDir
flag bit set.
The
.I dsize
for a Venti directory
is a multiple of 40 so that each data chunk holds
an integer number of
.CW VtEntry
structures.
By analogy with Venti directories,
we call the original data stream a
Venti file.
Note that Venti files are assumed
.I not
to contain pointers to other Venti blocks.
The only pointers to Venti blocks occur in
.CW VtEntry
structures in
Venti directories
(and in the internal hash tree structure of the
individual files and directories).
Note also that these directories are nothing more than pointer lists.
In particular, there are no names or metadata like in a file system.
.PP
To make it easier to pass hierarchies between applications,
the root of a hierarchy is described in a 300-byte structure
called a
.CW VtRoot :
.P1
VtRoot:
.ta +\w' 'u +\w' 'u
version[2] \f(CW2\fP
name[128] \fRname of structure (just a comment)\fP
type[128] \fRstring describing structure (\f(CWvac\fR)\f(CW
score[20] \fRpointer to \f(CWVtDirType\fP block\f(CW
blockSize[2] \fRmaximum block size in structure\fP
prev[20] \fRprevious \f(CWVtRoot\fP in chain, if any\fP
.P2
.LP
This structure is stored to Venti and its score is passed
between applications, typically in the form
``\fItype\f(CW:\fIrootscore\fR,''
where
.I type
is the type field from the
.CW VtRoot
structure, and
.I rootscore
is the score of the
.CW VtRoot
block.
.CW VtRoot
structures can be chained together using the
.I prev
field to encode an archival history
of the data structure.
.PP
For example, a small Venti hierarchy might look like:
.PS
.ps 8
.vs 10
boxwid=0.1
boxht=0.1
f=0.9
mb=0.16
VtRoot: [
right
B1: box
move right 0.1
"\f(CWVtRoot\fP" ljust
]
Root: [
right
B1: box fill f
B2: box fill f
B3: box fill f
move right 0.1
] with .nw at VtRoot.sw+(0.2,-.1)
Level1: [
RootMeta: [
box wid mb
]
MetaSource: [
right
B1: box wid 5*mb
] with .nw at RootMeta.sw+(0,-.1)
Source: [
right
B1: box fill f
B2: box fill f
B3: box fill f
B4: box fill f
B5: box fill f
B6: box fill f
B7: box fill f
B8: box fill f
] with .nw at MetaSource.sw+(0,-.1)
SB1: box invis at Source.B1
SB2: box invis at Source.B2
SB3: box invis at Source.B3
] with .nw at Root.sw+(0.4,-.1)
Level2: [
MetaSource: [
right
B1: box wid 5*mb
]
Source: [
right
B1: box fill f
B2: box fill f
B3: box fill f
B4: box fill f
B5: box fill f
B6: box fill f
B7: box fill f
B8: box fill f
] with .nw at MetaSource.sw+(0,-.1)
File: box wid 0.8 with .nw at Source.sw+(0,-.1)
] with .nw at Level1.sw+(0.6,-.1)
line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w
line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w
line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w
line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w
line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w
line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w
line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w
[
KEY: box wid 1.5 invis "Key"
line from KEY.sw to KEY.se
k = -.1
kk=0.5
A: [
box wid 4*boxwid
"Venti file" ljust with .w at last box .w+(kk,0)
] with .nw at KEY.sw+(0,2*k)
B: [
box fill f
"Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0)
] with .nw at A.sw+(0,k)
C: [
right
CC: box fill f
box fill f
box fill f
box fill f
"Venti directory" ljust with .w at CC.w+(kk,0)
] with .nw at B.sw+(0,k)
D: [
line -> right 3*boxwid
"Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
] with .nw at C.sw+(0,k)
] with .nw at VtRoot.nw+(3,0)
.PE
.LP
Venti files are shown as white boxes, while directories are shown
as shaded boxes. Each shaded square represents a
.CW VtEntry .
Arrows represent pointers from
.CW VtEntry
structures to other
Venti files or directories.
.PP
The hierarchical structure provided by Venti files and directories
can be used as the base for more complicated data structures.
Because this structure captures all the information
about pointers to other blocks, tools written to traverse
Venti hierarchies can traverse the more complicated
data structures as well.
For example,
.I venti/copy
(see
.I venti (1))
copies a Venti hierarchy from one Venti server to another,
given the root
.CW VtEntry .
Because the traditional file system described in later sections is
layered on a Venti hierarchy,
.I venti/copy
can copy it without fully understanding its structure.
.NH 1
Vac file system format
.HP
The Venti archive format
.I vac
builds a traditional file system using a Venti hierarchy.
Each vac file is implemented as a Venti file;
each vac directory is implemented as a Venti
directory and a Venti file to provide traditional file system metadata.
The metadata is stored in a structure called a
.CW DirEntry :
.P1
DirEntry:
.ta +\w' 'u +\w' 'u
magic[4] \f(CW0x1c4d9072\fP (DirMagic)\fP
version[2] \f(CW9\fP
elem[s] \fRname (final path element only)\fP
entry[4] \fRentry number for Venti file or directory\fP
gen[4] \fRgeneration number\fP
mentry[4] \fRentry number for Venti file holding metadata\fP
mgen[4] \fRgeneration number\fP
qid[8] \fRunique file serial number\fP
uid[s] \fRowner\fP
gid[s] \fRgroup\fP
mid[s] \fRlast modified by\fP
mtime[4] \fRlast modification time\fP
ctime[4] \fRcreation time\fP
atime[4] \fRlast access time\fP
mode[4] \fRmode bits\fP
.P2
The notation
.CW name[s]
denotes a string stored as a two-byte length
and then that many bytes.
The above describes Version 9 of the
.CW DirEntry
format. Versions 7 and 8 are very similar; they can be
read by the current
.I vac
source code but are not written.
Earlier versions were not widespread.
A
.CW DirEntry
may be followed by optional extension sections, though none
are currently used.
The
.CW mode
bits include bits commonly used by
Unix and Windows, in addition to those used by Plan 9.
.PP
The
.CW entry
field is an index into the parallel Venti directory.
The
.CW gen
field must match the
.CW gen
field in the corresponding
.CW VtEntry
in the directory;
it is used to detect
stale indices.
Similarly,
.CW mentry
and
.CW mgen
are the index and generation number
for the metadata Venti file,
if the
.CW DirEntry
describes a vac directory.
.PP
The relation between Venti files and directories and
vac files and directories can be seen in this figure:
.PS
.ps 8
.vs 10
boxwid=0.1
boxht=0.1
f=0.9
mb=0.16
VtRoot: [
right
B1: box
move right 0.1
"\f(CWVtRoot\fP" ljust
]
SuperRoot: [
right
B1: box fill f
move right 0.1
"fs root block" ljust
] with .nw at VtRoot.sw + (0.2, -.2)
Root: [
right
B1: box fill f
B2: box fill f
B3: box fill f
move right 0.1
"root directory info block" ljust
] with .nw at SuperRoot.sw+(0.2, -.2)
Level1: [
RootMeta: [
box wid mb
move right 0.1
"root metadata" ljust
]
MetaSource: [
right
B1: box wid mb
B2: box wid mb
B3: box wid mb
B4: box wid mb
B5: box wid mb
] with .nw at RootMeta.sw+(0,-.2)
MB1: box wid mb invis at MetaSource.B1
MB2: box wid mb invis at MetaSource.B2
MB3: box wid mb invis at MetaSource.B3
MB4: box wid mb invis at MetaSource.B4
MB5: box wid mb invis at MetaSource.B5
Source: [
right
B1: box fill f
B2: box fill f
B3: box fill f
B4: box fill f
B5: box fill f
B6: box fill f
B7: box fill f
B8: box fill f
] with .nw at MetaSource.sw+(0,-.1)
SB1: box invis at Source.B1
SB2: box invis at Source.B2
SB3: box invis at Source.B3
SB4: box invis at Source.B4
SB5: box invis at Source.B5
SB6: box invis at Source.B6
SB7: box invis at Source.B7
SB8: box invis at Source.B8
] with .nw at Root.sw+(0.4,-.2)
Level2: [
MetaSource: [
right
B1: box wid mb
B2: box wid mb
B3: box wid mb
B4: box wid mb
B5: box wid mb
]
Source: [
right
B1: box fill f
B2: box fill f
B3: box fill f
B4: box fill f
B5: box fill f
B6: box fill f
B7: box fill f
B8: box fill f
] with .nw at MetaSource.sw+(0,-.1)
File: box wid 0.8 with .nw at Source.sw+(0,-.2)
] with .nw at Level1.sw+(0.6,-.2)
line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w
line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w
line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w
line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w
line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w
line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w
line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w
line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w
arrowwid = arrowwid/2
arrowht = arrowht/2
line -> from Level1.MB1 to Level1.SB1.n
line -> from Level1.MB2 to Level1.SB2.n
line -> from Level1.MB2 to Level1.SB3.n
line -> from Level1.MB4 to Level1.SB7.n
line -> from Level1.MB5 to Level1.SB5.n
arrowwid = arrowwid * 2
arrowht = arrowht * 2
box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2
[
KEY: box wid 1.5 invis "Key"
line from KEY.sw to KEY.se
k = -.1
kk=0.5
A: [
box wid 4*boxwid
"Venti file" ljust with .w at last box .w+(kk,0)
] with .nw at KEY.sw+(0,2*k)
B: [
box fill f
"Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0)
] with .nw at A.sw+(0,k)
C: [
right
CC: box fill f
box fill f
box fill f
box fill f
"Venti directory" ljust with .w at CC.w+(kk,0)
] with .nw at B.sw+(0,k)
D: [
line -> right 3*boxwid
"Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
] with .nw at C.sw+(0,k)
DD: [
box dotted wid 4*boxwid
"Vac file" ljust with .w at last box .w+(kk,0)
] with .nw at D.sw+(0,k)
E: [
box wid mb
"Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0)
] with .nw at DD.sw+(0,k)
G: [
box dashed wid 4*boxwid
"Vac directory" ljust with .w at last box .w+(kk,0)
] with .nw at E.sw+(0,k)
H: [
arrowwid = arrowwid/2
arrowht = arrowht/2
line -> right 1.5*boxwid
"Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0)
arrowwid = arrowwid * 2
arrowht = arrowht * 2
] with .nw at G.sw+(0,k)
] with .nw at VtRoot.nw+(3,0)
.PE
.LP
In reality, the story is slightly more complicated.
The metadata file in a Vac directory
is not just the concatenation of
.CW DirEntry
structures.
Instead, it is the concatenation of
.CW MetaBlocks .
A
.CW MetaBlock
contains some number of
.CW DirEntry
structures along with a sorted index to make it easy
to look for a particular
.CW DirEntry
by its
.CW elem
field.
The details are in the source code.
.PP
As shown in the diagram,
the root directory of the file system is summarized by
three
.CW VtEntry
structures describing
the Venti directory for the children of the root,
the Venti file for the metadata describing the children of the root,
and a Venti file holding metadata for the root directory itself.
These
.CW VtEntry
structures are placed in a Venti directory of their own,
described by the single
.CW VtEntry
in the
root block.
.NH 1
Fossil file system format
.HP
Fossil uses the vac format, with some small changes.
The changes only affect the data on the local disk; the data
archived to Venti is exactly in vac format.
.PP
Blocks stored on local disk may contain scores pointing at local disk
blocks or at Venti blocks.
Local block addresses are stored as 20-byte scores in which the first 16 bytes
are all zero and the last 4 bytes specify a block number in the disk.
Before a block is archived, all the
blocks it points to must be archived, and the local scores in the block
must be changed to Venti scores.
Using block addresses rather than content hashes for local data
makes the local file system easier to manage: if a local block's contents
change, the pointer to the block does not need to change.
.NH 2
Snapshots
.HP
Fossil is an archival file server.
It takes periodic snapshots of the file system,
which are made accessible through the file system.
Specifically, the active file system is presented in
.CW /active .
Ephemeral snapshots (those that are kept on local disk and eventually deleted)
are presented in
\f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR,
where
.I yyyy
is the full year,
.I mm
is the month number,
.I dd
is the day number,
.I hh
is the hour,
and
.I mm
is the minute.
Archival snapshots (those that are archived to Venti and persist forever)
are presented in
\f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR,
where
.I yyyy ,
.I mm ,
and
.I dd
are year, month, and day as before,
and
.I s
is a sequence number if more than one
archival snapshot is done in a day.
For the first snapshot,
.I s
is null.
For the subsequent snapshots,
.I s
is
.CW .1 ,
.CW .2 ,
.CW .3 ,
etc.
.PP
To implement the snapshots, the file server maintains a
current
.I epoch
for the active file system.
Each local block has a label that records, among other things,
the epoch in which the block was allocated.
If a block was allocated in an epoch earlier than the current one,
it is immutable and treated as copy-on-write.
Taking a snapshot can be accomplished by
recording the address of the current root block and then
incrementing the epoch number.
Notice that the copy-on-write method makes
snapshots both time efficient and space efficient.
The only time cost is waiting for all current file system
requests to finish and then incrementing a counter.
After a snapshot, blocks only get copied when they are
next modified, so the per-snapshot
space requirement is proportional
to the amount of new data rather than the total
size of the file system.
.PP
The blocks in the archival snapshots are moved to Venti,
but the blocks in the ephemeral snapshots take up space
in the local disk file.
To allow reclamation of this disk space, the file system
maintains a
.I low
.I epoch ,
which is the epoch of the earliest ephemeral snapshot
still available.
Fossil only allows access to snapshots with epoch numbers
between the
low epoch and the current epoch
(also called the high epoch).
Incrementing the low epoch thus makes old
snapshots inaccessible.
The space required to store those snapshots can then
be reclaimed, as described below.
.NH 2
Local blocks
.HP
The bulk of the local disk file is the local blocks.
Each block has a 14-byte label associated with it, of the format:
.P1
Label:
.ta +\w' 'u +\w' 'u
state[1] \fRblock state\fP
type[1] \fRblock type\fP
epoch[4] \fRallocation epoch\fP
epochClose[4] \fRclose epoch\fP
tag[4] \fRrandom tag\fP
.P2
.LP
The
.CW type
is an analogue of the block types described earlier,
though different names are used, to distinguish between
pointers blocks in a hash tree for a data stream
and pointer blocks for a directory stream.
The
.CW epoch
was mentioned in the last section.
The other fields are explained below.
.PP
There are two distinguished blocks states
.CW BsFree
.CW 0x00 ) (
and
.CW BsBad
.CW 0xFF ), (
which mark blocks that are available for allocation
and blocks that are bad and should be avoided.
If
.CW state
is not one of these values, it is a bitwise
.I or ' `
of the following flags:
.P1
.ta +\w' 'u +\w' 'u
0x01 BsAlloc \fRblock is in use\fP
0x02 BsCopied \fRblock has been copied\fP
0x04 BsVenti \fRblock has been stored on Venti\fP
0x08 BsClosed \fRblock has been unlinked from active file system\fP
.P2
.LP
The flags are explained as they arise in the discussions below.
.PP
It is convenient to store some extra fields in the
.CW VtEntry
structure when it describes a Venti file or directory
stored on local disk.
Specifically, we set the
.CW VtEntryLocal
flag bit
and then use the bytes 7-16 of the score (which would
otherwise be zero, since it is a local score) to hold these fields:
.P1
.ta +\w' 'u +\w' 'u
archive[1] \fRboolean: this is an archival snapshot\fP
snap[4] \fRepoch number if root of snapshot\fP
tag[4] \fRrandom tag\fP
.P2
.LP
The extended
.CW VtEntry
structure is called an
.CW Entry .
The
.CW tag
field
in the
.CW Label
and the
.CW Entry
is used to identify dangling pointers or other file system corruption:
all the local blocks in a hash tree must
have tags matching the tag in the
.CW Entry .
If this
.CW Entry
points at the root of a snapshot,
the
.CW snap
field is the epoch of the snapshot.
If the snapshot is intended to be archived to Venti,
the
.CW archive
field is non-zero.
.NH 2
Block reclamation
.HP
The blocks in the active file system form a tree: each
block has only one parent.
Once a copy-on-write block
.I b
is replaced by its copy, it is no longer
needed by the active file system.
At this point,
.I b
is unlinked from the active file system.
We say that
.I b
is now
.I closed :
it is needed only for snapshots.
When a block is closed, the
.CW BsClosed
bit is set in its state, and the current epoch (called the block's closing epoch)
is stored in the
.CW epochClose
label field.
(Open blocks have an
.CW epochClose
of
.CW ~0 ).
.PP
A block is referenced by snapshots with epochs
between the block's allocation epoch and its closing epoch.
Once the file system's low epoch grows to be greater than or equal to the block's
closing epoch, the block is no longer needed for any snapshots
and can be reused.
.PP
In a typical configuration, where nightly archival snapshots
are taken and written to Venti, it is desirable to reclaim
the space occupied by now-archived blocks if possible.
To do this, Fossil keeps track of whether the pointers
in each block are unique to that block.
When a block
.I bb
is allocated, a pointer to
.I bb
is written into exactly one active block (say,
.I b ).
In the absence of snapshots, the pointer to
.I bb
will remain unique to
.I b ,
so that if the pointer is zeroed,
.I bb
can be immediately reused.
Snapshots complicate this invariant:
when
.I b
is copied-on-write, all its pointers
are no longer unique to it.
At time of the copy, the
.CW BsCopied
state bit in the block's label
is set to note the duplication of the pointers contained within.
.NH 2
Disk layout
.HP
The file system header describes the file system layout and has this format:
.P1
.ta +\w' 'u +\w' 'u
Header:
magic[4] \fR0x3776AE89 (HeaderMagic)\fP
version[2] \fR1 (HeaderVersion)\fP
blockSize[2] \fIfile system block size\fP
super[4] \fRblock offset of super block\fP
label[4] \fRblock offset of labels\fP
data[4] \fRdata blocks\fP
end[4] \fRend of file system\fP
.P2
.LP
The corresponding file system layout is:
.PS
.ps 8
.vs 9
boxwid=0.75
boxht=0.15
Empty: box "empty" ht 0.25
Header: box "header" with .n at Empty.s
Empty2: box "empty" with .n at Header.s
Super: box "super block" with .n at Empty2.s
Label: box "label" "blocks" with .n at Super.s ht 0.25
Data: box "data" "blocks" with .n at Label.s ht 0.3
" 0" ljust at Empty.ne
" 128kB" ljust at Header.ne
" \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne
" \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne
" \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne
" \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se
"" at (-1,0)
"" at (6,0)
.PE
.LP
The numbers to the right of the blocks are byte offsets
of the boundaries.
.LP
The super block describes the file system itself and looks like:
.P1
.ta +\w' 'u +\w' 'u
Super:
magic[4] \fR0x2340A3B1 (SuperMagic)\fP
version[2] \fR1 (SuperVersion)\fP
epochLow[4] \fRfile system low epoch\fP
epochHigh[4] \fRfile system high (active) epoch\fP
qid[8] \fRnext qid to allocate\fP
active[4] \fRdata block number: root of active file system\fP
next[4] \fRdata block number: root of next file system to archive\fP
current[4] \fRdata block number: root of file system currently being archived\fP
last[20] \fRVenti score of last successful archive\fP
name[128] \fRname of file system (just a comment)\fP
.P2
.LP
.NH 1
Fossil server
.HP
The Fossil server is a user-space program that runs on a standard Plan 9 kernel.
.NH 2
Process structure
.PP
The file server is structured as a set of processes synchronizing
mostly through message passing along queues.
The processes are given names, which can be seen in the output of
.CW ps
.CW -a .
.PP
.CW Listen
processes announce on various network addresses.
A
.CW con
process handles each incoming connection, reading 9P requests
and adding them to a central message queue.
.CW Msg
processes remove 9P requests from the queue,
handle them, and write the responses to the appropriate
file descriptors.
.PP
The
.CW disk
process handles disk I/O requests made by the other processes.
The
.CW flush
process writes dirty blocks from the in-memory block cache to disk.
The
.CW unlink
process frees previously linked blocks once the blocks that point at them
have been written to disk.
.PP
A
.CW consI
reads from each console file (typically a pipe posted in
.CW /srv ),
adding the typed characters to the input queue.
The
.CW cons
process echoes input and runs the commands, saving
output in a ring buffer.
Because there is only one
.CW cons
process, only one console command may be executing at a time.
A
.CW consO
process copies this ring buffer to each console file.
.PP
The
.CW periodic
process runs periodic events, like
flushing the root metadata to disk or
taking snapshots of the file system.
.NH 2
Block cache
.HP
Fossil maintains an in-memory block cache which
holds both local disk blocks and Venti blocks.
Cache eviction follows a least recently used policy.
Dirty blocks are restricted to at most half the cache.
This can be changed by editing
.CW DirtyPercentage
in
.CW dat.h .
.PP
The block cache uses soft updates [1] to ensure that the on-disk
file system is always self-consistent.
Thus there is no
.I halt
console command
and no need to check a file system
that was shut down without halting.
.NH 2
Archiving
.HP
A background process writes blocks in archival snapshots to Venti.
Although
.CW /archive/\fIyyyy\fP/\fImmdds\fR
is a copy of only
.CW /active
at the time of the snapshot,
the archival process archives the
entire file tree rather than just
the subtree rooted at
.CW /active .
The snapshots
.CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm
are stored as empty directories.
Once all the blocks have been archived,
a
.CW VtRoot
header for the file system is archived.
The score of that header is recorded in
.CW super.score
and also printed on the file server console.
The score can used by
.I flfmt
to restore a file system (see
.I fossil (4)).
.NH 2
Contrast with the old file server
.HP
The most obvious difference between Fossil and the
old Plan 9 file server [2] is that Fossil uses a Venti server as
its archival storage in place of a WORM juke box.
There are a few other architectural differences to be
aware of.
.PP
Fossil is a user-level program run on a standard kernel.
.PP
Fossil does not have any way to concatenate, stripe, or
mirror disk files. For functionality similar to the old file server's
configuration strings, use the experimental file stack device
(see
.I fs (3)).
.PP
Fossil speaks only 9P2000. Old 9P (aka 9P1) is not supported.
.PP
... XXX words about converting an old file system to fossil?
.NH 1
References
.LP
[1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules,
and Yale N. Patt.
``Soft Updates: A Solution to the Metadata Update Problem
in File Systems,''
.I "ACM Transactions on Computer Systems" ,
Vol 18., No. 2, May 2000, pp. 127\-153.
.LP
[2] Sean Quinlan, ``A Cached WORM File System,''
.I "Software\(emPractice and Experience" ,
Vol 21., No 12., December 1991, pp. 1289\-1299.
.LP
[3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,''
.I "Usenix Conference on File and Storage Technologies" ,
2002.
|