author | cinap_lenrek <cinap_lenrek@localhost> | 2011-05-03 11:25:13 +0000 |
---|---|---|
committer | cinap_lenrek <cinap_lenrek@localhost> | 2011-05-03 11:25:13 +0000 |
commit | 458120dd40db6b4df55a4e96b650e16798ef06a0 (patch) | |
tree | 8f82685be24fef97e715c6f5ca4c68d34d5074ee /sys/src/cmd/python/Doc/lib/libhtmllib.tex | |
parent | 3a742c699f6806c1145aea5149bf15de15a0afd7 (diff) |
add hg and python
Diffstat (limited to 'sys/src/cmd/python/Doc/lib/libhtmllib.tex')
-rw-r--r-- | sys/src/cmd/python/Doc/lib/libhtmllib.tex | 181 |
1 files changed, 181 insertions, 0 deletions
diff --git a/sys/src/cmd/python/Doc/lib/libhtmllib.tex b/sys/src/cmd/python/Doc/lib/libhtmllib.tex
new file mode 100644
index 000000000..a84dd856d
--- /dev/null
+++ b/sys/src/cmd/python/Doc/lib/libhtmllib.tex
@@ -0,0 +1,181 @@
+\section{\module{htmllib} ---
+         A parser for HTML documents}
+
+\declaremodule{standard}{htmllib}
+\modulesynopsis{A parser for HTML documents.}
+
+\index{HTML}
+\index{hypertext}
+
+
+This module defines a class which can serve as a base for parsing text
+files formatted in the HyperText Mark-up Language (HTML). The class
+is not directly concerned with I/O --- it must be provided with input
+in string form via a method, and makes calls to methods of a
+``formatter'' object in order to produce output. The
+\class{HTMLParser} class is designed to be used as a base class for
+other classes in order to add functionality, and allows most of its
+methods to be extended or overridden. In turn, this class is derived
+from and extends the \class{SGMLParser} class defined in module
+\refmodule{sgmllib}\refstmodindex{sgmllib}. The \class{HTMLParser}
+implementation supports the HTML 2.0 language as described in
+\rfc{1866}. Two implementations of formatter objects are provided in
+the \refmodule{formatter}\refstmodindex{formatter}\ module; refer to the
+documentation for that module for information on the formatter
+interface.
+\withsubitem{(in module sgmllib)}{\ttindex{SGMLParser}}
+
+The following is a summary of the interface defined by
+\class{sgmllib.SGMLParser}:
+
+\begin{itemize}
+
+\item
+The interface to feed data to an instance is through the \method{feed()}
+method, which takes a string argument. This can be called with as
+little or as much text at a time as desired; \samp{p.feed(a);
+p.feed(b)} has the same effect as \samp{p.feed(a+b)}. When the data
+contains complete HTML markup constructs, these are processed immediately;
+incomplete constructs are saved in a buffer. To force processing of all
+unprocessed data, call the \method{close()} method.
+
+For example, to parse the entire contents of a file, use:
+\begin{verbatim}
+parser.feed(open('myfile.html').read())
+parser.close()
+\end{verbatim}
+
+\item
+The interface to define semantics for HTML tags is very simple: derive
+a class and define methods called \method{start_\var{tag}()},
+\method{end_\var{tag}()}, or \method{do_\var{tag}()}. The parser will
+call these at appropriate moments: \method{start_\var{tag}()} or
+\method{do_\var{tag}()} is called when an opening tag of the form
+\code{<\var{tag} ...>} is encountered; \method{end_\var{tag}()} is called
+when a closing tag of the form \code{</\var{tag}>} is encountered. If
+an opening tag requires a corresponding closing tag, like \code{<H1>}
+... \code{</H1>}, the class should define the \method{start_\var{tag}()}
+method; if a tag requires no closing tag, like \code{<P>}, the class
+should define the \method{do_\var{tag}()} method.
+
+\end{itemize}
+
+The module defines a parser class and an exception:
+
+\begin{classdesc}{HTMLParser}{formatter}
+This is the basic HTML parser class. It supports all entity names
+required by the XHTML 1.0 Recommendation (\url{http://www.w3.org/TR/xhtml1}).
+It also defines handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
+\end{classdesc}
+
+\begin{excdesc}{HTMLParseError}
+Exception raised by the \class{HTMLParser} class when it encounters an
+error while parsing.
+\versionadded{2.4}
+\end{excdesc}
+
+
+\begin{seealso}
+  \seemodule{formatter}{Interface definition for transforming an
+                        abstract flow of formatting events into
+                        specific output events on writer objects.}
+  \seemodule{HTMLParser}{Alternate HTML parser that offers a slightly
+                         lower-level view of the input, but is
+                         designed to work with XHTML, and does not
+                         implement some of the SGML syntax not used in
+                         ``HTML as deployed'' and which isn't legal
+                         for XHTML.}
+  \seemodule{htmlentitydefs}{Definition of replacement text for XHTML 1.0
+                             entities.}
+  \seemodule{sgmllib}{Base class for \class{HTMLParser}.}
+\end{seealso}
+
+
+\subsection{HTMLParser Objects \label{html-parser-objects}}
+
+In addition to tag methods, the \class{HTMLParser} class provides some
+additional methods and instance variables for use within tag methods.
+
+\begin{memberdesc}{formatter}
+This is the formatter instance associated with the parser.
+\end{memberdesc}
+
+\begin{memberdesc}{nofill}
+Boolean flag which should be true when whitespace should not be
+collapsed, or false when it should be. In general, this should only
+be true when character data is to be treated as ``preformatted'' text,
+as within a \code{<PRE>} element. The default value is false. This
+affects the operation of \method{handle_data()} and \method{save_end()}.
+\end{memberdesc}
+
+
+\begin{methoddesc}{anchor_bgn}{href, name, type}
+This method is called at the start of an anchor region. The arguments
+correspond to the attributes of the \code{<A>} tag with the same
+names. The default implementation maintains a list of hyperlinks
+(defined by the \code{HREF} attribute for \code{<A>} tags) within the
+document. The list of hyperlinks is available as the data attribute
+\member{anchorlist}.
+\end{methoddesc}
+
+\begin{methoddesc}{anchor_end}{}
+This method is called at the end of an anchor region. The default
+implementation adds a textual footnote marker using an index into the
+list of hyperlinks created by \method{anchor_bgn()}.
+\end{methoddesc}
+
+\begin{methoddesc}{handle_image}{source, alt\optional{, ismap\optional{,
+                                 align\optional{, width\optional{, height}}}}}
+This method is called to handle images. The default implementation
+simply passes the \var{alt} value to the \method{handle_data()}
+method.
+\end{methoddesc}
+
+\begin{methoddesc}{save_bgn}{}
+Begins saving character data in a buffer instead of sending it to the
+formatter object. Retrieve the stored data via \method{save_end()}.
+Use of the \method{save_bgn()} / \method{save_end()} pair may not be
+nested.
+\end{methoddesc}
+
+\begin{methoddesc}{save_end}{}
+Ends buffering character data and returns all data saved since the
+preceding call to \method{save_bgn()}. If the \member{nofill} flag is
+false, whitespace is collapsed to single spaces. A call to this
+method without a preceding call to \method{save_bgn()} will raise a
+\exception{TypeError} exception.
+\end{methoddesc}
+
+
+
+\section{\module{htmlentitydefs} ---
+         Definitions of HTML general entities}
+
+\declaremodule{standard}{htmlentitydefs}
+\modulesynopsis{Definitions of HTML general entities.}
+\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}
+
+This module defines three dictionaries, \code{name2codepoint},
+\code{codepoint2name}, and \code{entitydefs}. \code{entitydefs} is
+used by the \refmodule{htmllib} module to provide the
+\member{entitydefs} member of the \class{HTMLParser} class. The
+definition provided here contains all the entities defined by XHTML 1.0
+that can be handled using simple textual substitution in the Latin-1
+character set (ISO-8859-1).
+
+
+\begin{datadesc}{entitydefs}
+  A dictionary mapping XHTML 1.0 entity definitions to their
+  replacement text in ISO Latin-1.
+
+\end{datadesc}
+
+\begin{datadesc}{name2codepoint}
+  A dictionary that maps HTML entity names to the Unicode codepoints.
+  \versionadded{2.3}
+\end{datadesc}
+
+\begin{datadesc}{codepoint2name}
+  A dictionary that maps Unicode codepoints to HTML entity names.
+  \versionadded{2.3}
+\end{datadesc}
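
Note on the three dictionaries documented at the end of the added file: the sketch below shows one way they might be used together. It is not part of the commit. It assumes the Python 2 module name `htmlentitydefs` from the documentation above and falls back to `html.entities`, the name the same module carries in Python 3. The helper `encode_entities` is a hypothetical name introduced only for this illustration.

```python
# Illustrative sketch, not part of the commit. htmlentitydefs is the
# Python 2 module name documented above; the fallback import is an
# assumption added so the example also runs on Python 3.
try:
    from htmlentitydefs import codepoint2name, entitydefs, name2codepoint
except ImportError:
    from html.entities import codepoint2name, entitydefs, name2codepoint


def encode_entities(text):
    """Hypothetical helper: replace each character that has a named
    entity with its &name; reference and leave everything else alone."""
    return "".join(
        "&%s;" % codepoint2name[ord(ch)] if ord(ch) in codepoint2name else ch
        for ch in text
    )


amp_codepoint = name2codepoint["amp"]  # entity name -> Unicode code point
amp_name = codepoint2name[38]          # Unicode code point -> entity name
amp_text = entitydefs["amp"]           # entity name -> replacement text
encoded = encode_entities("a < b & c")
```

`name2codepoint` and `codepoint2name` are inverse lookups, so the helper only needs `codepoint2name` to go from a character back to its entity name.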