.TH DOC2TXT 1
.SH NAME
doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings, msexceltables
\- extract printable text from Microsoft documents
.SH SYNOPSIS
.B doc2txt
[
.I file.doc
]
.br
.B doc2ps
[
.I file.doc
]
.br
.B wdoc2txt
[
.I file.doc
]
.br
.B xls2txt
[
.I file.xls
]
.br
.B aux/olefs
[
.B -m
.I mtpt
]
.I file.doc
.br
.B aux/mswordstrings
.IB mtpt /WordDocument
.br
.B aux/msexceltables
[
.B -qaDnt
] [
.B -d
.I delim
] [
.B -c
.I column-range
] [
.B -w
.I worksheet-range
]
.IB mtpt /Workbook
.SH DESCRIPTION
.I Doc2txt
is an
.IR rc (1)
script that uses
.I olefs
and
.I mswordstrings
to extract the printable text from the body of a Microsoft Word document
and write it on the standard output.
.I Doc2ps
is similar, but emits PostScript corresponding to the document.
.I Wdoc2txt
is similar to
.IR doc2txt ,
but uses
.IR plumb (1)
to send the output to a new
.IR acme (1)
window instead.
.I Xls2txt
performs a similar function for Microsoft Excel documents.
.PP
Microsoft Office documents are stored in OLE (Object Linking and Embedding)
format, which is a scaled-down version of Microsoft's FAT file system.
.I Olefs
presents the contents of an MS Office document as a file system
on
.IR mtpt ,
which defaults to
.BR /mnt/doc .
.I Mswordstrings
or
.I msexceltables
may then be used to parse the files inside, extracting
a text stream.
.I Msexceltables
may be given options to control the formatting of its output.
.TF "\fL-d \fIdelim"
.TP
.B -a
Attempt conversion of non-tabular sheets in the workbook (charts).
.TP
.BI -d " delim
Sets the inter-field delimiter to the string
.IR delim ,
by default a single space.
.TP
.B -D
Enables debugging output.
.TP
.BI -c " range
.I Range
is a comma-separated list of column numbers and ranges.
A range is written as two column numbers separated by a dash.
Processing is limited to the named columns;
by default all columns are output.
.TP
.B -n
Disables field padding to column width.
.TP
.B -q
Disables quoting of textual fields (see
.IR quote (2)).
.TP
.B -t
Truncate fields to the column width.
.TP
.BI -w " range
.I Range
is a comma-separated list of worksheet numbers and ranges,
using the same syntax as the
.B -c
option above;
only the sheets named are output.
Suppressed chart pages are always included in the sheet count.
.SH EXAMPLE
Extract pieces of an MS Excel spreadsheet.
.PD 0
.IP
.EX
.SM
aux/olefs report.xls
aux/msexceltables -q -w 1,7,9-14 -c 3-5 -n -d '@' /mnt/doc/Workbook > rpt.txt
unmount /mnt/doc
.EE
.PD
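.PP
The same steps can be carried out by hand for a Word document.
This is only a sketch of roughly what
.I doc2txt
does, not a copy of the script;
.B memo.doc
and
.B memo.txt
are hypothetical file names, and the default mount point
.B /mnt/doc
is assumed.
.IP
.EX
# mount the document's internal files on /mnt/doc
aux/olefs memo.doc
# extract the printable text from the Word stream
aux/mswordstrings /mnt/doc/WordDocument > memo.txt
# release the mount point when finished
unmount /mnt/doc
.EE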
.SH SOURCE
.TF "\fL/sys/src/cmd/aux "
.TP
.B /rc/bin
.BR doc2txt ,
.BR doc2ps ,
.BR wdoc2txt ,
and
.BR xls2txt
.TP
.B /sys/src/cmd/aux
the others
.fi
.PD
.SH SEE ALSO
.IR strings (1)
.br
``Microsoft Word 97 Binary File Format'',
at Microsoft's developer (MSDN) home page.
.br
``LAOLA Binary Structures'',
.B http://user.cs.tu-berlin.de/~schwartz/pmh
.br
``OpenOffice.Org's Excel Documentation'',
.br
.B http://sc.openoffice.org/excelfileformat.pdf